A Neural Probabilistic Language Model
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin;
Journal of Machine Learning Research 3(Feb):1137-1155, 2003.
Abstract
A goal of statistical language modeling is to learn the joint
probability function of sequences of words in a language. This is
intrinsically difficult because of the
curse of dimensionality:
a word sequence on which the model will be tested is likely to be
different from all the word sequences seen during training.
Traditional but very successful approaches based on n-grams obtain
generalization by concatenating very short overlapping
sequences seen in the training
set. We propose to fight the curse of dimensionality by
learning a distributed representation for words which allows each
training sentence to inform the model about an exponential number of
semantically neighboring sentences. The model learns simultaneously
(1) a distributed representation for each word along with (2) the
probability function for word sequences, expressed in terms of these
representations. Generalization is obtained because a sequence of
words that has never been seen before gets high probability if it is
made of words that are similar (in the sense of having a nearby
representation) to words forming an already seen sentence. Training
such large models (with millions of parameters) within a reasonable
time is itself a significant challenge. We report on experiments
using neural networks for the probability function, showing on two
text corpora that the proposed approach significantly improves on
state-of-the-art n-gram models, and that it allows the model
to take advantage of longer contexts.
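
To make the described architecture concrete, the following is a minimal sketch (in NumPy) of a forward pass through such a model: each context word is mapped to a row of a learned feature matrix, the feature vectors are concatenated and passed through a tanh hidden layer together with a direct input-to-output connection, and a softmax yields the predicted distribution over the next word. The parameterization mirrors the model studied in the paper, but all sizes below (vocabulary, context length, embedding and hidden dimensions) are illustrative placeholders, not values from the experiments.

    import numpy as np

    # Illustrative sizes only: vocabulary, number of context words,
    # word feature (embedding) dimension, and hidden layer width.
    V, n_ctx, m, h = 10_000, 3, 30, 50
    rng = np.random.default_rng(0)

    C = rng.normal(scale=0.01, size=(V, m))           # shared word feature matrix
    H = rng.normal(scale=0.01, size=(h, n_ctx * m))   # input-to-hidden weights
    d = np.zeros(h)                                   # hidden bias
    U = rng.normal(scale=0.01, size=(V, h))           # hidden-to-output weights
    W = rng.normal(scale=0.01, size=(V, n_ctx * m))   # direct input-to-output weights
    b = np.zeros(V)                                   # output bias

    def next_word_distribution(context_word_ids):
        """P(w_t | w_{t-1}, ..., w_{t-n+1}) for one context of n_ctx word ids."""
        x = C[context_word_ids].reshape(-1)        # concatenated feature vectors
        y = b + W @ x + U @ np.tanh(d + H @ x)     # unnormalized log-probabilities
        e = np.exp(y - y.max())                    # numerically stable softmax
        return e / e.sum()

    p = next_word_distribution([12, 7, 345])       # hypothetical word ids
    print(p.shape, p.sum())                        # (10000,) 1.0

In the full model every parameter above, including the word feature matrix C, would be fitted jointly by stochastic gradient ascent on the log-likelihood of the training corpus, which is how the distributed representation and the probability function are learned simultaneously.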