Word-Sequence Kernels
Nicola Cancedda, Eric Gaussier, Cyril Goutte, Jean-Michel Renders
3(Feb):1059-1082, 2003.
Abstract
We address the problem of categorising documents using
kernel-based methods such as Support Vector Machines. Since the
work of Joachims (1998), there is ample experimental evidence
that SVMs using the standard word frequencies as features yield
state-of-the-art performance on a number of benchmark problems.
Recently, Lodhi et al. (2002) proposed the use of string
kernels, a novel way of computing document similarity based on
matching non-consecutive subsequences of characters. In this
article, we propose the use of this technique with sequences of
words rather than characters. This approach has several
advantages: in particular, it is more efficient computationally and
it ties in closely with standard linguistic pre-processing
techniques. We present some extensions to sequence kernels dealing
with symbol-dependent and match-dependent decay factors, and
present empirical evaluations of these extensions on the
Reuters-21578 datasets.
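
To make the idea of matching non-consecutive word subsequences concrete, here is a minimal brute-force sketch in Python. It is not the implementation evaluated in the article: the function names, the whitespace tokenisation, and the single uniform decay factor are all assumptions made for illustration, and the symbol-dependent and match-dependent decay extensions are not modelled.

```python
from collections import defaultdict
from itertools import combinations


def subsequence_features(words, n, decay):
    """Map each length-n word subsequence to its gap-weighted weight.

    Every occurrence contributes decay ** span, where span is the number
    of positions it covers in the document, gaps included.
    """
    features = defaultdict(float)
    for idx in combinations(range(len(words)), n):
        span = idx[-1] - idx[0] + 1
        features[tuple(words[i] for i in idx)] += decay ** span
    return features


def word_sequence_kernel(doc_a, doc_b, n=2, decay=0.5):
    """Inner product of the gap-weighted subsequence feature maps.

    Hypothetical helper: documents are split on whitespace, which stands
    in for whatever linguistic pre-processing is actually applied.
    """
    fa = subsequence_features(doc_a.split(), n, decay)
    fb = subsequence_features(doc_b.split(), n, decay)
    return sum(weight * fb[seq] for seq, weight in fa.items() if seq in fb)


if __name__ == "__main__":
    a = "the cat sat on the mat"
    b = "the cat lay on the mat"
    print(word_sequence_kernel(a, b, n=2, decay=0.5))
```

Enumerating all index combinations is only practical for short inputs and small n; string kernels are normally computed with a dynamic-programming recursion instead, which this sketch does not attempt to reproduce.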