The formalism used by ALLiS to encode training data is XML. Although other representation schemas could be used equivalently (such as a database formalism), XML offers an adequate representation for structured (here hierarchical) data. The data is thus encoded as a tree structure.
For each sentence a node is created, and for each word of the sentence an empty node (element N in the examples below). Each node carries its features as attributes (in the following examples, the word itself (feature W), its POS tag (feature C), and its category (feature S)).
This is a traditional attribute-value representation.
More complex annotation can be used, for example, the tree of a given linguistic analysis can be kept.
Table 1 shows data with a flat structure, and Table 2 data with several hierarchical levels of linguistic information (chunk, clause). We used the first representation for the CoNLL 2000 shared task, and the second one for the CoNLL 2001 shared task (see [Dejeanclausing]).
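The flat encoding described above can be sketched in a few lines of Python. This is only an illustration: the sentence element name SENT is an assumption (only the word element N and the features W, C and S appear in the text), and the helper encode_sentence is hypothetical.

```python
import xml.etree.ElementTree as ET

def encode_sentence(tagged_words):
    """Build the flat XML encoding for one sentence.

    tagged_words: list of (word, pos_tag, category) triples,
    mapped to the features W, C and S of the text.
    """
    sent = ET.Element("SENT")  # one node per sentence (assumed element name)
    for w, c, s in tagged_words:
        # one empty node per word, with its features as attributes
        ET.SubElement(sent, "N", W=w, C=c, S=s)
    return sent

# Two of the words from the running example
xml_tree = encode_sentence([
    ("arrested", "VBN", "B-NP"),
    ("dismayed", "VBN", "I-VP"),
])
print(ET.tostring(xml_tree, encoding="unicode"))
```

Richer annotations (Table 2) would simply nest further elements, e.g. chunk and clause nodes, between the sentence node and the word nodes.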
We now explain how positive and negative examples are generated from this annotation.
The division of the data into positive and negative examples is based on these features.
For instance, let the example shown in Table 1 be the complete training data and let NP chunking be the task. The task is then to assign the correct value of the feature S to each node using the features W and C. Each node (word) with the feature S='I-NP' or S='B-NP' is then a positive example, and each node without this feature value is a negative example.
For example, the node <N W='dismayed' C='VBN' S='I-VP'/> is a negative example, whereas the node <N W='arrested' C='VBN' S='B-NP'/> is a positive example, although both words are tagged VBN.
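The split just described can be sketched as follows. Each node is modelled as a plain feature dictionary; the variable names are illustrative, not ALLiS's actual implementation.

```python
# Word nodes from the running example, as feature dictionaries
nodes = [
    {"W": "arrested", "C": "VBN", "S": "B-NP"},
    {"W": "dismayed", "C": "VBN", "S": "I-VP"},
]

# For NP chunking, nodes tagged I-NP or B-NP are positive examples
POSITIVE_TAGS = {"I-NP", "B-NP"}

positives = [n for n in nodes if n["S"] in POSITIVE_TAGS]
negatives = [n for n in nodes if n["S"] not in POSITIVE_TAGS]
```

Note that the two nodes share the same POS tag (C='VBN') yet fall on opposite sides of the split, which is exactly the ambiguity discussed below.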
This example shows that the information attached to each node may be insufficient to correctly predict the attribute S.
Section 2.3 explains how the search space is extended.