RCV1-v2/LYRL2004: The LYRL2004 Distribution of the RCV1-v2 Text Categorization Test Collection

(14-Oct-2005 Version)

David D. Lewis

David D. Lewis Consulting

Email (written out to deter spam): my first name, dave, followed by my last name, lewis, at daviddlewis dot com.

A. Introduction

This web page describes RCV1-v2/LYRL2004, a text categorization test collection which is distributed as a set of on-line appendices to a Journal of Machine Learning Research (JMLR) article.

A.1. How to (Not) Cite This Document 

In most cases, the following article should be cited:  

Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004. http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf

instead of the web page you are now reading. I will refer to this article as LYRL2004. LYRL2004 contains all the information on this web page, except for details such as file formats.

If for some reason you need to cite this web page, you could cite it as:

Lewis, D. D.  RCV1-v2/LYRL2004: The LYRL2004 Distribution of the RCV1-v2 Text Categorization Test Collection (14-Oct-2005 Version). http://www.jmlr.org/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm

or as appropriate for your bibliographic format. If the web page is cited, we ask that LYRL2004 always be cited as well, since it is part of the archival literature.

A.2. Legal Issues 

The original agreement under which Reuters, Ltd. distributed the RCV1 CD-ROMs between 2000 and 2004 stated:

    "Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published provided it is not possible to reconstruct the Data from the summary." 

Based on this clause, Reuters personnel stated that distributing term/document matrices is not a violation of the Agreement: 

    http://groups.yahoo.com/group/ReutersCorpora/message/70

    http://groups.yahoo.com/group/ReutersCorpora/message/106

To ensure that the original data cannot be reconstructed, the term/document matrices we distribute here remove words from a large stop list (including essentially all linguistic function words), replace the remaining words with stems, and scramble the order of the stems. 

While the term/document matrices and other information available in these Appendices are available without a license, I encourage those using the data to also obtain a licensed copy of the original RCV1 data, currently distributed by NIST (the National Institute of Standards and Technology, a US government agency).

B. On-Line Appendices to LYRL2004 

The RCV1-v2/LYRL2004 test collection is made up of a large number of files, which take the form of 18 On-Line Appendices to the LYRL2004 article.  We describe the files in the order of the corresponding On-Line Appendix numbers in the LYRL2004 paper. 

B.1. On-Line Appendix 1

On-Line Appendix 1 consists of the ASCII file rcv1.topics.txt, which is 488 bytes in size. It contains a list, one per line, of the names of the 103 RCV1 Topics categories that were available to Reuters indexers. See Section 3.2.1 of LYRL2004 for more information.

B.2. On-Line Appendix 2

On-Line Appendix 2 consists of the ASCII file rcv1.topics.hier.orig, which is 6,965 bytes in size. It contains a 104 node hierarchy (tree) of Reuters Topics categories. There is 1 root node, plus nodes for the 103 assignable categories (some of which are leaf nodes and some of which are internal nodes). Each node is represented by a line of the form:

    parent: <cat> child: <cat> child-description: <desc>

where <cat> is the name of a category, the string "Root" to indicate the root node, or the string "None" which is a placeholder that does not correspond to a node. There are 104 lines, one for each node, with the node specified in the child field, and the structure specified by giving the parent of each child.
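The line format above can be read with a few lines of Python. This is a minimal sketch, not code used in LYRL2004: the function name parse_hierarchy, the tolerance for variable spacing, and the sample lines (including their descriptions and the assumption that the root's own line has parent "None") are illustrative assumptions.

```python
import re
from collections import defaultdict

# Pattern for one hierarchy line; tolerance for variable spacing is an assumption.
LINE_RE = re.compile(
    r"^parent:\s*(\S+)\s+child:\s*(\S+)\s+child-description:\s*(.*)$"
)

def parse_hierarchy(lines):
    """Return (parent_of, description, children) dictionaries."""
    parent_of, description = {}, {}
    children = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognized hierarchy line: {line!r}")
        parent, child, desc = match.groups()
        parent_of[child] = parent
        description[child] = desc.strip()
        if parent != "None":  # "None" is a placeholder, not a node
            children[parent].append(child)
    return parent_of, description, dict(children)

# Illustrative lines in the documented format (descriptions are made up).
SAMPLE = [
    "parent: None child: Root child-description: Root",
    "parent: Root child: CCAT child-description: example description",
    "parent: CCAT child: C24 child-description: another example",
]
parent_of, description, children = parse_hierarchy(SAMPLE)
print(children["Root"])
```

The same function should apply to the files of On-Line Appendices 3 and 5, since they share this format.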

See Section 3.2.1 of LYRL2004 for more information.

B.3. On-Line Appendix 3

On-Line Appendix 3 consists of the ASCII file rcv1.topics.hier.expanded, which is 7,810 bytes in size. It contains a 117 node hierarchy (tree) of Reuters Topics categories. There is 1 root node, 13 third-level nodes that do not correspond to assignable categories, and 103 nodes for the 103 assignable categories (some of which are leaf nodes and some of which are internal nodes). The format of the file is the same as for On-Line Appendix 2. See Section 3.2.1 of LYRL2004 for more information.

B.4. On-Line Appendix 4

On-Line Appendix 4 consists of the ASCII file rcv1.industries.txt, which is 2688 bytes in size. It contains a list, one per line, of the names of the 354 RCV1 Industry categories that were available to Reuters indexers. See Section 3.2.2 of LYRL2004 for more information.

B.5. On-Line Appendix 5

On-Line Appendix 5 consists of the ASCII file rcv1.industries.hier, which is 30,162 bytes in size. It contains a 365 node hierarchy (tree) of Reuters Industry categories. There is 1 root node, 10 second level nodes which do not correspond to assignable categories, plus nodes for the 354 assignable categories (some of which are leaf nodes and some of which are internal nodes). The format of the file is the same as for On-Line Appendix 2. See Section 3.2.2 of LYRL2004 for more information.

B.6. On-Line Appendix 6

On-Line Appendix 6 consists of the ASCII file rcv1.regions.txt, which is 2065 bytes in size. It contains a list, one per line, of the names of the 366 RCV1 Region categories that were available to Reuters indexers. See Section 3.2.3 of LYRL2004 for more information.

B.7. On-Line Appendix 7

On-Line Appendix 7 consists of the ASCII file rcv1v2-ids.dat.gz, which is 1,715,108 bytes in gzipped form, or 5,527,301 bytes when uncompressed. It contains a list, one per line, of the Reuters-assigned IDs of the 804,414 documents in RCV1-v2. Each line contains a single ID. See Section 4 of LYRL2004 for more information.

B.8. On-Line Appendix 8

On-Line Appendix 8 consists of the ASCII file rcv1-v2.topics.qrels.gz, which is 7,272,130 bytes in gzipped form, or 35,382,548 bytes when uncompressed. It specifies which Topic categories each RCV1-v2 document belongs to. The file has the format of a TREC qrels file, as we now describe. Each category/document pair is specified by a separate one-line record. There are 2,606,875 lines, and each line has the format:

<category name> <did> 1 

where <category name> is the name of the category, <did> is a Reuters-assigned document ID, and the 1 is redundant but required for TREC format.

As an example, the records for the first two documents in rcv1-v2.topics.qrels look like this:

E11 2286 1
ECAT 2286 1
M11 2286 1
M12 2286 1
MCAT 2286 1
C24 2287 1
CCAT 2287 1

These indicate that document 2286 belongs to Topic categories E11, ECAT, M11, M12, and MCAT. Document 2287 belongs to Topic categories C24 and CCAT.
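Grouping the records by document can be sketched as follows. This is an illustrative sketch only; the function name parse_qrels and the choice to raise an error on malformed lines are assumptions, not part of the distribution.

```python
from collections import defaultdict

def parse_qrels(lines):
    """Map each document ID to the set of categories assigned to it."""
    cats_by_doc = defaultdict(set)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3 or fields[2] != "1":
            raise ValueError(f"unexpected qrels record: {line!r}")
        category, did = fields[0], int(fields[1])
        cats_by_doc[did].add(category)
    return dict(cats_by_doc)

# The example records shown above.
SAMPLE = """\
E11 2286 1
ECAT 2286 1
M11 2286 1
M12 2286 1
MCAT 2286 1
C24 2287 1
CCAT 2287 1
"""
cats = parse_qrels(SAMPLE.splitlines())
print(sorted(cats[2287]))
```

To read the distributed file directly, the lines could be supplied by gzip.open(path, "rt") from the Python standard library, where path is wherever the file was saved.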

See Section 4 of LYRL2004 for more information.

B.9. On-Line Appendix 9

On-Line Appendix 9 consists of the ASCII file rcv1-v2.industries.qrels.gz, which is 2,005,036 bytes in gzipped form, or 9,055,643 bytes when uncompressed. It has 560,922 lines and specifies which Industry categories each RCV1-v2 document belongs to. The file has the same format as On-Line Appendix 8. See Section 4 of LYRL2004 for more information.

B.10. On-Line Appendix 10

On-Line Appendix 10 consists of the ASCII file rcv1-v2.regions.qrels.gz, which is 3,214,040 bytes in gzipped form, or 14,348,799 bytes when uncompressed. It has 1,057,880 lines and specifies which Region categories each RCV1-v2 document belongs to. The file has the same format as On-Line Appendix 8. See Section 4 of LYRL2004 for more information.

B.11. On-Line Appendix 11

On-Line Appendix 11 consists of the ASCII file english.stop, which is 3,589 bytes in size. It contains a list of 571 stop words, one per line, that was developed by the SMART project. See Section 7 of LYRL2004 for more information.

B.12. On-Line Appendix 12

On-Line Appendix 12 consists of ten ASCII files containing tokenized documents. The files fall in two groups.

B.12.i. RCV1-v2 Token Files 

Five of the files contain the exact RCV1-v2 tokenized documents used to produce the vectors that were then used for training and testing supervised learners in LYRL2004. Four files contain test set tokenized documents, and the fifth contains the training set tokenized documents.

In gzipped form the file sizes in bytes are:

lyrl2004_tokens_test_pt0.dat.gz : 44734992

lyrl2004_tokens_test_pt1.dat.gz : 45595102

lyrl2004_tokens_test_pt2.dat.gz : 44507510 

lyrl2004_tokens_test_pt3.dat.gz : 42052117 

lyrl2004_tokens_train.dat.gz : 5108963

In uncompressed form the file sizes in bytes are: 

lyrl2004_tokens_test_pt0.dat : 153955383

lyrl2004_tokens_test_pt1.dat : 156091348

lyrl2004_tokens_test_pt2.dat : 153363982

lyrl2004_tokens_test_pt3.dat : 145174772

lyrl2004_tokens_train.dat : 17590105

The number of documents in each file is:

lyrl2004_tokens_test_pt0.dat : 199328 test documents

lyrl2004_tokens_test_pt1.dat : 199339 test documents

lyrl2004_tokens_test_pt2.dat : 199576 test documents 

lyrl2004_tokens_test_pt3.dat : 183022 test documents

lyrl2004_tokens_train.dat : 23149 training documents

There are 23,149 training documents and 781,265 test documents in these files, for a total of 804,414 documents, i.e. all the documents from RCV1-v2 as defined in LYRL2004. The documents have been tokenized, stopworded, and stemmed.  Most but not all punctuation was removed during stemming.  Note that while the LYRL2004 experiments used the particular training/test split reflected in the files, this split had no impact on how the tokenized documents were created.  Therefore, these files could be used in experiments with any other training/test split desired.

Each document in a file is represented in a format used by the SMART text retrieval system. A document has the format:

.I <did>

.W

<textline>+

<blankline>

where we have:

<did> : Reuters-assigned document id. 

<textline> : A line of white-space separated strings, one for each token produced by preprocessing for the specified document. These lines never begin with a period followed by an upper case alphabetic character.

<blankline> : A single end of line character. 

Each line that begins with ".I" indicates the start of a new document. 

Here's an example of the tokenized document file format: 

.I 1
.W
now is the time for all good documents
to come to the aid of the ir community

.I 2
.W
i am the best document since i have only one line

.I 3
.W
no i am the best document 

Actual tokenized documents are typically longer. 
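A reader for this format can be sketched as a small generator. This is a minimal sketch under the format description above; the name iter_token_docs is an assumption, and the code relies on the stated guarantee that text lines never begin with a period followed by an upper case letter.

```python
def iter_token_docs(lines):
    """Yield (did, tokens) pairs from lines in the SMART-style format."""
    did = None
    tokens = []
    in_text = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(".I"):
            # A new document starts; emit the previous one, if any.
            if did is not None:
                yield did, tokens
            did = int(line.split()[1])
            tokens = []
            in_text = False
        elif line.startswith(".W"):
            in_text = True
        elif in_text and line.strip():
            tokens.extend(line.split())
    if did is not None:
        yield did, tokens

# The example file shown above.
SAMPLE = """\
.I 1
.W
now is the time for all good documents
to come to the aid of the ir community

.I 2
.W
i am the best document since i have only one line

.I 3
.W
no i am the best document
"""
docs = list(iter_token_docs(SAMPLE.splitlines()))
print([(did, len(tokens)) for did, tokens in docs])
```

Because it is a generator, the large test files can be processed one document at a time without loading a whole file into memory.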

See Section 7 of LYRL2004 for further details.

B.12.ii. Non-v2 RCV1 Token Files Produced as part of LYRL2004 Work

The remaining five files in On-Line Appendix 12 contain tokenized documents that correspond to original RCV1 documents that we did not include in the RCV1-v2 collection, and so are not included in the five files discussed above. They correspond to RCV1 documents that had demonstrably invalid category codes. In gzipped form the file sizes in bytes are:

lyrl2004-non-v2_tokens_test_pt0.dat.gz : 149887 

lyrl2004-non-v2_tokens_test_pt1.dat.gz : 171205

lyrl2004-non-v2_tokens_test_pt2.dat.gz : 102370

lyrl2004-non-v2_tokens_test_pt3.dat.gz : 132291

lyrl2004-non-v2_tokens_train.dat.gz : 46419

In uncompressed form the file sizes in bytes are:

lyrl2004-non-v2_tokens_test_pt0.dat : 567844

lyrl2004-non-v2_tokens_test_pt1.dat : 592220

lyrl2004-non-v2_tokens_test_pt2.dat : 357564

lyrl2004-non-v2_tokens_test_pt3.dat : 435917

lyrl2004-non-v2_tokens_train.dat : 161010

The number of documents in each file is:

lyrl2004-non-v2_tokens_test_pt0.dat : 671 documents

lyrl2004-non-v2_tokens_test_pt1.dat : 661 documents

lyrl2004-non-v2_tokens_test_pt2.dat : 424 documents

lyrl2004-non-v2_tokens_test_pt3.dat : 463 documents

lyrl2004-non-v2_tokens_train.dat : 158 documents 

These files contain tokenized versions of documents that are in RCV1, but not in RCV1-v2.  They were produced in exactly the same fashion as the RCV1-v2 tokenized document files, and have the same format. Of these files, only lyrl2004-non-v2_tokens_train.dat was used in the LYRL2004 experiments. It was used to produce lyrl2004-non-v2_vectors_train.dat which was in turn used (unintentionally) only for generating IDF weights. We include these files for completeness, but most researchers will not need to use them.

See Section 7 of LYRL2004 for further details. 

B.13. On-Line Appendix 13

On-Line Appendix 13 consists of ten ASCII files containing document vectors. They were produced using the On-Line Appendix 12 files as the starting point. As discussed in Section 7 of LYRL2004, for most kinds of research the On-Line Appendix 12 files will be more useful.

B.13.i. RCV1-v2 Vector Files 

Five of the files contain the exact RCV1-v2 vectors used for training and testing supervised learners in LYRL2004. Four files contain test set vectors, and the fifth contains the training set vectors. In gzipped form the file sizes in bytes are:

lyrl2004_vectors_test_pt0.dat.gz : 159879168

lyrl2004_vectors_test_pt1.dat.gz : 161878016

lyrl2004_vectors_test_pt2.dat.gz : 158580736

lyrl2004_vectors_test_pt3.dat.gz : 149512192

lyrl2004_vectors_train.dat.gz : 18620416

In uncompressed form the file sizes in bytes are:

lyrl2004_vectors_test_pt0.dat : 367197611

lyrl2004_vectors_test_pt1.dat : 371378053

lyrl2004_vectors_test_pt2.dat : 364319208 

lyrl2004_vectors_test_pt3.dat : 343575752

lyrl2004_vectors_train.dat : 42955532

The number of vectors in each file is: 

lyrl2004_vectors_test_pt0.dat : 199328 test vectors

lyrl2004_vectors_test_pt1.dat : 199339 test vectors

lyrl2004_vectors_test_pt2.dat : 199576 test vectors 

lyrl2004_vectors_test_pt3.dat : 183022 test vectors

lyrl2004_vectors_train.dat : 23149 training vectors

There are 23,149 training vectors and 781,265 test vectors in this data set, for a total of 804,414 vectors, i.e. all vectors from RCV1-v2 as defined in LYRL2004. Vectors are cosine-normalized, log TF-IDF vectors.

IDF weights in the above vectors were computed from the union of lyrl2004_vectors_train.dat and lyrl2004-non-v2_vectors_train.dat (see below), i.e. a small number of RCV1-v1 vectors that are not in RCV1-v2 were used in computing IDF weights. No vectors from the lyrl2004-non-v2 files were used in any supervised learning. Any term in a test document that did not occur in one or more documents from the union of lyrl2004_vectors_train.dat and lyrl2004-non-v2_vectors_train.dat was discarded before cosine normalization. This discarding of terms, as well as the impact of the training/test split on IDF computation, means the Appendix 13 vectors should not be used in experiments with any training/test split besides the one used in LYRL2004!

The main reason to use the Appendix 13 files would be if a researcher wants to directly compare a supervised learning algorithm against those tested in the LYRL2004 paper, while keeping the training/test split and text representation exactly the same. For all other purposes, the token files (On-Line Appendix 12) are likely to be more useful. 
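For researchers who build their own vectors from the token files, the kind of weighting described above can be sketched as follows. This is a sketch under stated assumptions, not the code used for LYRL2004: it assumes the common form (1 + ln tf) * ln(N / df) for log TF-IDF, and the function names are hypothetical. Section 7 of LYRL2004 gives the exact definition used in the experiments.

```python
import math
from collections import Counter

def idf_table(training_docs):
    """Compute ln(N / df) for every term seen in the training documents."""
    n_docs = len(training_docs)
    doc_freq = Counter()
    for tokens in training_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / n) for term, n in doc_freq.items()}

def weight_document(tokens, idf):
    """Return a cosine-normalized log TF-IDF vector as a dict."""
    # Terms absent from the IDF table are discarded before normalization.
    tf = Counter(t for t in tokens if t in idf)
    raw = {t: (1.0 + math.log(n)) * idf[t] for t, n in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return {}
    return {t: w / norm for t, w in raw.items()}

# Toy data (made up) standing in for tokenized training documents.
training = [["a", "b"], ["a", "c"], ["b", "b", "d"]]
idf = idf_table(training)
vector = weight_document(["b", "b", "zzz", "c"], idf)
print(sorted(vector))
```

Computing the IDF table from the training documents of whatever split is in use avoids the dependence on the LYRL2004 split noted above.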

Each vector in a file of vectors is represented by a single line of the form:

<did> [<tid>:<weight>]+

where we have: 

<did> : Reuters-assigned document id. 

<tid> : A positive integer term id. Term ids are between 1 and 47,236. The corresponding type (string form) for each term id is found in On-Line Appendix 14. 

<weight> : The numeric feature value, i.e. within document weight, assigned to this term for this document, as described in LYRL2004.

Here's an example of the vector file format:

999995 1:0.03 3:0.047 8:0.38749738478937479 14:0.1 2748:0.03
999996 7:0.13 19:0.138 255:0.58588 314:0.28101 18800:0.005
999998 2:0.00001 3:0.108 184:0.228 488:0.0821 40917:0.111

Actual vectors are much longer, both due to more terms and more decimal places in weights.
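A line in this format can be split into a document ID and a sparse vector as sketched below. The function name parse_vector_line and the use of a dict for the sparse vector are illustrative choices.

```python
def parse_vector_line(line):
    """Return (did, {tid: weight}) for one line of a vector file."""
    fields = line.split()
    did = int(fields[0])
    weights = {}
    for pair in fields[1:]:
        tid, weight = pair.split(":")
        weights[int(tid)] = float(weight)
    return did, weights

# First line of the example shown above.
did, weights = parse_vector_line(
    "999995 1:0.03 3:0.047 8:0.38749738478937479 14:0.1 2748:0.03"
)
print(did, sorted(weights))
```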

See Section 7 of LYRL2004 for further details. 

B.13.ii. Non-v2 RCV1 Vectors Produced as part of LYRL2004 Work

The remaining 5 files in Appendix 13 contain vectors that correspond to original RCV1 documents that we did not include in the RCV1-v2 collection, and so are not included in the five files discussed above. They correspond to RCV1 documents that had demonstrably invalid category codes. In gzipped form the file sizes in bytes are: 

lyrl2004-non-v2_vectors_test_pt0.dat.gz : 532480

lyrl2004-non-v2_vectors_test_pt1.dat.gz : 524288

lyrl2004-non-v2_vectors_test_pt2.dat.gz : 339968

lyrl2004-non-v2_vectors_test_pt3.dat.gz : 413696

lyrl2004-non-v2_vectors_train.dat.gz : 172032

In uncompressed form the file sizes in bytes are: 

lyrl2004-non-v2_vectors_test_pt0.dat : 1359872

lyrl2004-non-v2_vectors_test_pt1.dat : 1294336

lyrl2004-non-v2_vectors_test_pt2.dat : 839680

lyrl2004-non-v2_vectors_test_pt3.dat : 974848

lyrl2004-non-v2_vectors_train.dat : 389120

The number of vectors in each file is:

lyrl2004-non-v2_vectors_test_pt0.dat : 671 vectors

lyrl2004-non-v2_vectors_test_pt1.dat : 661 vectors

lyrl2004-non-v2_vectors_test_pt2.dat : 424 vectors

lyrl2004-non-v2_vectors_test_pt3.dat : 463 vectors

lyrl2004-non-v2_vectors_train.dat : 158 vectors

Of these files, only lyrl2004-non-v2_vectors_train.dat was used in the LYRL2004 experiments. It was used, along with lyrl2004_vectors_train.dat, only for generating IDF weights. These vectors were produced in exactly the same fashion as the RCV1-v2 vector files, and have the same format. We include them for completeness, but most researchers will not need to use them.

See Section 7 of LYRL2004 for further details. 

B.14. On-Line Appendix 14

On-Line Appendix 14 consists of the ASCII file stem.termid.idf.map.txt, which is 1,411,031 bytes in size.  It specifies the mapping between the numeric term IDs used in our vector files (On-Line Appendix 13) and the stemmed tokens used in our tokenized document files (On-Line Appendix 12). There are 47,236 lines in the file, corresponding to the 47,236 unique stemmed tokens present in the 23,307 pre-breakpoint RCV1-v1 documents. The lines have the form:

<stem> <termid> <idf>

where we have:

<stem> : stemmed term, as appears in On-Line Appendix 12 files

<termid> : integer term ID, as appears in On-Line Appendix 13 files

<idf> : inverse document frequency value used for the term in the LYRL2004 experiments

Not all the 47,236 terms represented in this file were actually used in our experiments. See Section 6.5 and Section 7 of LYRL2004 for details.
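Loading the mapping into lookup tables can be sketched as follows. The function name parse_term_map is an assumption, and the sample lines use made-up stems, term IDs, and IDF values purely to illustrate the three-field format.

```python
def parse_term_map(lines):
    """Return (stem_of, idf_of) dictionaries keyed by integer term ID."""
    stem_of, idf_of = {}, {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"unexpected term map line: {line!r}")
        stem, termid, idf = fields[0], int(fields[1]), float(fields[2])
        stem_of[termid] = stem
        idf_of[termid] = idf
    return stem_of, idf_of

# Made-up lines in the documented <stem> <termid> <idf> format.
SAMPLE = [
    "examplestem 1 9.5",
    "anotherstem 2 4.25",
]
stem_of, idf_of = parse_term_map(SAMPLE)
print(stem_of[2], idf_of[2])
```

With stem_of in hand, the term IDs in an On-Line Appendix 13 vector can be translated back to the stems found in the On-Line Appendix 12 files.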

B.15. On-Line Appendix 15 

On-Line Appendix 15 consists of nineteen ASCII files containing the contingency tables used to generate all experiment results reported in LYRL2004. The filenames (not the files) have the format: 

<catset>.<alg>.<opt>.xml

where we have:

<catset> : Set of categories

<alg> : Classifier algorithm 

<opt> : Whether the classifier was optimized for macroaveraging, microaveraging, or on a per-category basis. 

All Industries files are 35067 bytes in size, all Regions files are 35523 bytes in size, and all Topics files are 9909 bytes in size. The files are available as a gzipped tar archive, a15-contingency-tables.tar.gz, which is 73547 bytes in size.

All files are in XML format and have the following structure (note our convention for specifying file format is slightly different here than in the rest of this web page): 

<allcats>

TABLELINE+

</allcats>

where "<allcats>" and "</allcats>" are actual XML tags in the file, each on its own line.  Each TABLELINE is a line with the following format:

<c> <n> NAME </n> <tp> A </tp> <fp> B </fp> <fn> C </fn> <tn> D </tn> </c>

where we have: 

NAME : name of a category

A : integer value for number of true positive (classifier assigned category to document, and category should be assigned to document) decisions the classifier made for this category on the RCV1-v2 test set documents.

B : integer value for number of false positive (classifier assigned category, category should not be assigned) decisions.

C : integer value for number of false negative (classifier did not assign category, category should be assigned) decisions.

D : integer value for number of true negative (classifier did not assign category, category should not be assigned) decisions.

Note that A+B+C+D sums to 781,265 (the number of RCV1-v2 test documents) for all lines.
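A TABLELINE can be turned into standard effectiveness values as sketched below. This is an illustrative sketch: the function names and the sample counts are made up, and returning None when a measure is undefined (zero denominator) is a choice made here rather than the convention of LYRL2004, which Section 5.3 of the paper describes.

```python
import re

# One TABLELINE; tolerance for variable spacing is an assumption.
TABLELINE_RE = re.compile(
    r"<c>\s*<n>\s*(?P<name>.*?)\s*</n>\s*"
    r"<tp>\s*(?P<tp>\d+)\s*</tp>\s*"
    r"<fp>\s*(?P<fp>\d+)\s*</fp>\s*"
    r"<fn>\s*(?P<fn>\d+)\s*</fn>\s*"
    r"<tn>\s*(?P<tn>\d+)\s*</tn>\s*</c>"
)

def parse_table_line(line):
    """Return (name, tp, fp, fn, tn) for one contingency table line."""
    match = TABLELINE_RE.search(line)
    if match is None:
        raise ValueError(f"not a table line: {line!r}")
    counts = tuple(int(match.group(k)) for k in ("tp", "fp", "fn", "tn"))
    return (match.group("name"),) + counts

def ratio(numerator, denominator):
    """Divide, returning None when the measure is undefined."""
    return numerator / denominator if denominator else None

def effectiveness(tp, fp, fn):
    """Return (precision, recall, F1) computed from the counts."""
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * tp, 2 * tp + fp + fn)
    return precision, recall, f1

# Made-up counts in the documented format.
SAMPLE = "<c> <n> EXAMPLE </n> <tp> 80 </tp> <fp> 20 </fp> <fn> 20 </fn> <tn> 880 </tn> </c>"
name, tp, fp, fn, tn = parse_table_line(SAMPLE)
precision, recall, f1 = effectiveness(tp, fp, fn)
print(name, precision, recall, f1)
```

Summing the counts over all categories before calling effectiveness gives microaveraged values, while averaging the per-category results gives macroaveraged values.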

See Section 5.3 of LYRL2004 for more on effectiveness measures. 

B.16. On-Line Appendix 16

On-Line Appendix 16 consists of the ASCII file topics.rbb, which is 6143 bytes in size. It is a Reuters documentation file on a set of categories closely related to the Topics categories. See Section 3.2.4 of LYRL2004 for more information.

B.17. On-Line Appendix 17

On-Line Appendix 17 consists of the ASCII file industries.rbb, which is 13222 bytes in size. It is a Reuters documentation file on a set of categories closely related to the Industries categories. See Section 3.2.4 of LYRL2004 for more information.

B.18. On-Line Appendix 18

On-Line Appendix 18 consists of the ASCII file regions.rbb, which is 6900 bytes in size. It is a Reuters documentation file on a set of categories closely related to the Regions categories. See Section 3.2.4 of LYRL2004 for more information.