Shallow Parsing using Noisy and Non-Stationary Training Material

Miles Osborne; 2(Mar):695-719, 2002.


Shallow parsers are usually assumed to be trained on noise-free material, drawn from the same distribution as the testing material. However, when either the training set is noisy or else drawn from a different distributions, performance may be degraded. Using the parsed Wall Street Journal, we investigate the performance of four shallow parsers (maximum entropy, memory-based learning, N-grams and ensemble learning) trained using various types of artificially noisy material. Our first set of results show that shallow parsers are surprisingly robust to synthetic noise, with performance gradually decreasing as the rate of noise increases. Further results show that no single shallow parser performs best in all noise situations. Final results show that simple, parser-specific extensions can improve noise-tolerance. Our second set of results addresses the question of whether naturally occurring disfluencies undermines performance more than does a change in distribution. Results using the parsed Switchboard corpus suggest that, although naturally occurring disfluencies might harm performance, differences in distribution between the training set and the testing set are more significant.

[abs] [pdf] [ps.gz] [ps]