Distributional Word Clusters vs. Words for Text Categorization
Ron Bekkerman, Ran El-Yaniv, Naftali Tishby, and Yoad Winter;
3(Mar):1183-1208, 2003.
Abstract
We study an approach to text categorization that combines
distributional clustering of words and a Support Vector Machine
(SVM) classifier. This word-cluster representation is computed
using the recently introduced Information Bottleneck method,
which generates a compact and efficient representation of
documents. When combined with the classification power of the SVM,
this method yields high performance in text categorization. This
novel combination of SVM with word-cluster representation is
compared with SVM-based categorization using the simpler
bag-of-words (BOW) representation. The comparison is performed
over three known datasets. On one of these datasets (the 20
Newsgroups) the method based on word clusters significantly
outperforms the word-based representation in terms of
categorization accuracy or representation efficiency. On the two
other sets (Reuters-21578 and WebKB) the word-based representation
slightly outperforms the word-cluster representation. We
investigate the potential reasons for this behavior and relate it
to structural differences between the datasets.
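
To make the two representations concrete, the following is a minimal sketch of how a bag-of-words vector could be collapsed into a word-cluster vector. It assumes a hard assignment of words to clusters that is already available; the mapping `WORD_TO_CLUSTER` and the function names are hypothetical and written by hand for illustration, whereas in the paper the clusters are computed with the Information Bottleneck method. The SVM training step is omitted.

```python
from collections import Counter

# Hypothetical hard assignment of words to clusters. In the paper this mapping
# would be produced by the Information Bottleneck method, not written by hand.
WORD_TO_CLUSTER = {
    "goalie": 0, "puck": 0, "hockey": 0,
    "orbit": 1, "shuttle": 1, "nasa": 1,
}


def bag_of_words(text):
    """Count each lowercased, whitespace-separated token."""
    return Counter(text.lower().split())


def cluster_representation(bow, word_to_cluster, n_clusters):
    """Sum the counts of all words assigned to the same cluster."""
    features = [0] * n_clusters
    for word, count in bow.items():
        cluster = word_to_cluster.get(word)
        if cluster is not None:  # words without a cluster are dropped
            features[cluster] += count
    return features


doc = "NASA shuttle reaches orbit as shuttle crew reports"
bow = bag_of_words(doc)
clusters = cluster_representation(bow, WORD_TO_CLUSTER, n_clusters=2)
```

Here the BOW vector has one dimension per distinct word, while the cluster vector has one dimension per cluster, which is the sense in which the word-cluster representation is more compact. Either vector could then be passed to an SVM classifier.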