Text Classification using String Kernels
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, Chris Watkins;
2(Feb):419-444, 2002.
Abstract
We propose a novel approach for categorizing text documents based on
the use of a special kernel. The kernel is
an inner product in the feature space generated by all subsequences of
length
k. A subsequence is any ordered sequence of
k characters
occurring in the text though not necessarily contiguously. The subsequences
are weighted by an exponentially decaying factor of their full length in the
text, hence emphasising those occurrences that are close to contiguous. A
direct computation of this feature vector would involve a prohibitive amount
of computation even for modest values of
k, since the dimension of the
feature space grows exponentially with
k. The paper describes how despite
this fact the inner product can be efficiently evaluated by a dynamic
programming technique.
Experimental comparisons of the
performance of the kernel compared with a standard word feature space
kernel (Joachims, 1998) show positive results on modestly sized datasets.
The case of contiguous subsequences is also considered for comparison
with the subsequences kernel with different decay factors.
For larger documents and datasets the paper introduces an approximation
technique that is shown to deliver good approximations efficiently for large
datasets.
[abs]
[pdf]
[ps.gz]
[ps]