Finding the Most Interesting Patterns in a Database Quickly by Using Sequential Sampling
Tobias Scheffer, Stefan Wrobel;
3(Dec):833-862, 2002.
Abstract
Many discovery problems, e.g., subgroup or association rule discovery, can
naturally be cast as n-best hypotheses problems where the goal is to find
the n hypotheses from a given hypothesis space that score best according to
a certain utility function. We present a sampling algorithm that solves this
problem by issuing a small number of database queries while guaranteeing
precise bounds on the confidence and quality of solutions. Known sampling
approaches have treated single hypothesis selection problems, assuming that
the utility is the average (over the examples) of some function --- which is
not the case for many frequently used utility functions. We show that our
algorithm works for all utilities that can be estimated with bounded error. We
provide these error bounds and resulting worst-case sample bounds for some of
the most frequently used utilities, and prove that no sampling algorithm
exists for a popular class of utility functions whose values cannot be
estimated with bounded error. The algorithm is sequential in the sense that it starts
to return (or discard) hypotheses that already seem to be particularly good
(or bad) after a few examples. Thus, the algorithm almost always needs far
fewer examples than its worst-case sample bounds.
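
To make the sequential accept/discard idea concrete, the following Python sketch illustrates generic sequential n-best selection; it is not the algorithm from the paper. It assumes the utility is an average over examples of a per-example score in [0, 1], uses Hoeffding-style confidence intervals, and splits the error probability with a crude union bound. The names sequential_n_best, hoeffding_eps, utility, and sample_example are illustrative placeholders.

    import math

    def hoeffding_eps(m, delta):
        # Two-sided Hoeffding confidence radius for the mean of m samples in [0, 1].
        return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

    def sequential_n_best(hypotheses, utility, sample_example, n, delta,
                          max_samples=100_000):
        # Illustrative sequential n-best selection (NOT the paper's algorithm):
        # assumes utility(h, x) in [0, 1] and an averaging utility.
        active = list(hypotheses)          # hypotheses still undecided
        accepted = []                      # hypotheses already returned as "good"
        sums = {h: 0.0 for h in active}
        m = 0
        # Crude union bound over hypotheses and sampling steps.
        local_delta = delta / (len(hypotheses) * max_samples)

        while active and len(accepted) < n and m < max_samples:
            x = sample_example()           # one database query / example
            m += 1
            for h in active:
                sums[h] += utility(h, x)

            eps = hoeffding_eps(m, local_delta)
            means = {h: sums[h] / m for h in active}
            still_needed = n - len(accepted)
            ranked = sorted(active, key=lambda h: means[h], reverse=True)
            top, rest = ranked[:still_needed], ranked[still_needed:]

            # Return hypotheses that already look clearly good: their pessimistic
            # estimate beats the optimistic estimate of every rival outside the top.
            for h in top:
                if all(means[h] - eps >= means[g] + eps for g in rest):
                    accepted.append(h)
                    active.remove(h)

            # Discard hypotheses that already look clearly bad: enough rivals are
            # better even when comparing pessimistic against optimistic estimates.
            for h in rest:
                better = sum(1 for g in active if means[g] - eps > means[h] + eps)
                if better >= still_needed:
                    active.remove(h)

            # If only as many hypotheses remain as are still needed, accept them.
            if len(active) <= n - len(accepted):
                accepted.extend(active)
                active = []

        # Sample budget exhausted: fall back to the current empirical ranking.
        if len(accepted) < n and active:
            active.sort(key=lambda h: sums[h] / max(m, 1), reverse=True)
            accepted.extend(active[:n - len(accepted)])
        return accepted

Splitting delta by a union bound over hypotheses and steps is the simplest and loosest choice; the paper instead derives utility-specific error bounds (also for non-averaging utilities) and the resulting worst-case sample bounds.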