Datasets
We evaluated dependency networks and Bayesian networks on three
datasets: (1) Nielsen, the dataset described in
Section 3.3; (2) MS.COM, which records whether or not
users of microsoft.com on one day in 1996 visited areas (``vroots'')
of the site (available on the Irvine Data Mining Repository); and (3)
MSNBC, which records whether or not visitors to MSNBC on one day
in 1998 read stories among the 1001 most popular stories on the site.
In each of these datasets, users correspond to cases and the items
they may have viewed correspond to variables. The MSNBC dataset contains
20,000 users sampled at random from the approximately 600,000 users that
visited the site that day. In a separate analysis of this dataset, we
found that including additional users did not substantially
increase accuracy. Table 2 provides additional
information about each dataset. All datasets were partitioned into
training and test sets at random. The train/test split for Nielsen
was the same as for the density-estimation experiment described in
Section 3.3. We used the learning algorithms for dependency networks
and Bayesian networks, with the parameter settings, described in
Section 3.
Table 2:
Number of users, items, and items per user for the datasets used
in evaluating the algorithms.

    Dataset                              MS.COM   Nielsen   MSNBC
    -------------------------------------------------------------
    Training cases                        32711      1637   10000
    Test cases                             5000      1637   10000
    Total items                             294       203    1001
    Mean items per case (training set)     3.02      8.64    2.67
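To make the setup concrete, the following sketch shows one way to
represent such a dataset as a binary user-by-item matrix and to
partition it into training and test sets at random, as described
above. It is a minimal illustration, not the authors' code: the file
name, the input format (one case per line, listing the 0-indexed
items the user visited), and the helper names are assumptions.

    import random

    import numpy as np

    def load_visits(path, n_items):
        """Read one case per line: whitespace-separated indices of
        items the user visited (assumed 0-indexed)."""
        cases = []
        with open(path) as f:
            for line in f:
                row = np.zeros(n_items, dtype=np.int8)
                for tok in line.split():
                    row[int(tok)] = 1  # user visited/read this item
                cases.append(row)
        return np.array(cases)

    def train_test_split(cases, n_test, seed=0):
        """Partition the cases into training and test sets at random."""
        rng = random.Random(seed)
        idx = list(range(len(cases)))
        rng.shuffle(idx)
        return cases[idx[n_test:]], cases[idx[:n_test]]

    # Example with MS.COM-like dimensions from Table 2 (294 items,
    # 5000 test cases); "mscom_visits.txt" is a hypothetical file.
    # data = load_visits("mscom_visits.txt", n_items=294)
    # train, test = train_test_split(data, n_test=5000)
    # train.sum(axis=1).mean()  # mean items per case in training set

The last line recovers the statistic reported in the bottom row of
Table 2: summing each binary row counts the items a user viewed, and
averaging over the training cases gives the mean items per case.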