New Techniques for Disambiguation in Natural Language and Their Application
to Biological Text
Filip Ginter, Jorma Boberg, Jouni Järvinen, Tapio Salakoski; 5(Jun):605--621, 2004.
Abstract
We study the problems of disambiguation in natural language, focusing on the
problem of gene vs. protein name disambiguation in biological text and also
considering the problem of context-sensitive spelling error correction. We
introduce a new family of classifiers based on ordering and weighting the
feature vectors obtained from word counts and word co-occurrence in the text, and
inspect several concrete classifiers from this family. We obtain the most
accurate prediction when weighting by positions of the words in the context. On
the gene/protein name disambiguation problem, this classifier outperforms both
the Naive Bayes and SNoW baseline classifiers. We also study the effect of
smoothing techniques for the Naive Bayes classifier, of collocation features,
and of context length on classification accuracy, and show that setting the
context length correctly is important and problem-dependent.