An Empirical Analysis of Off-policy Learning in Discrete MDPs
Cosmin Păduraru, Doina Precup, Joelle Pineau, Gheorghe Comănici; JMLR W&CP 24:89-102, 2012.
Off-policy evaluation is the problem of evaluating a decision-making policy using data
collected under a different behaviour policy. While several methods are available for off-policy evaluation, little work has been done on identifying which of them performs best in practice.
In this paper, we conduct an in-depth comparative study of several off-policy evaluation
methods in non-bandit, finite-horizon MDPs, using randomly generated MDPs, as well as a
Mallard population dynamics model [Anderson, 1975]. We find that un-normalized importance sampling can exhibit prohibitively large variance in problems involving look-ahead
longer than a few time steps, and that dynamic programming methods perform better than
Monte Carlo-style methods.
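The variance claim can be illustrated with a toy sketch. Everything here (the two-action chain, the specific behaviour and target policies, the 0/1 reward) is an illustrative assumption, not the paper's experimental setup: per-trajectory importance weights are products of per-step likelihood ratios, so the variance of the un-normalized estimator grows geometrically with the horizon, while the weighted (self-normalized) variant is far more stable.

```python
import numpy as np

def simulate_returns(horizon, n_traj, rng):
    # Hypothetical two-action chain: behaviour policy is uniform,
    # target policy strongly prefers action 0.
    pi_b = np.array([0.5, 0.5])
    pi_e = np.array([0.9, 0.1])
    actions = rng.choice(2, size=(n_traj, horizon), p=pi_b)
    # Reward 1 for action 0 at each step; the return is the sum over the horizon.
    returns = (actions == 0).sum(axis=1).astype(float)
    # Per-trajectory importance weight: product of per-step ratios pi_e/pi_b.
    ratios = pi_e[actions] / pi_b[actions]
    weights = ratios.prod(axis=1)
    return returns, weights

def is_estimator_variances(horizon, n_traj=1000, n_reps=200, seed=0):
    """Empirical variance of the ordinary (un-normalized) and the
    weighted (self-normalized) importance sampling estimators."""
    rng = np.random.default_rng(seed)
    ordinary, weighted = [], []
    for _ in range(n_reps):
        g, w = simulate_returns(horizon, n_traj, rng)
        ordinary.append(np.mean(w * g))             # ordinary IS
        weighted.append(np.sum(w * g) / np.sum(w))  # weighted IS
    return np.var(ordinary), np.var(weighted)

for h in (2, 5, 10):
    v_ord, v_wis = is_estimator_variances(h)
    print(f"horizon={h:2d}  ordinary IS var={v_ord:.4f}  weighted IS var={v_wis:.4f}")
```

Running the sketch shows the ordinary-IS variance blowing up as the horizon grows, while the weighted variant stays small, which is consistent with the look-ahead effect reported above.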