There are many different CoNLL formats, since CoNLL hosts a different shared task each year. The format for CoNLL 2009 is described here. Each line represents a single word as a series of tab-separated fields. Underscores (_) indicate empty values. Mate-Parser's manual says that it uses the first 12 columns of CoNLL 2009:
ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL
The definitions of some of these columns come from earlier shared tasks (the CoNLL-X format used in 2006 and 2007):
ID (index in sentence, starting at 1)
FORM (word form itself)
LEMMA (word's lemma or stem)
POS (part of speech)
FEAT (list of morphological features separated by |)
HEAD (index of syntactic parent, 0 for ROOT)
DEPREL (syntactic relationship between HEAD and this word)
Each of those columns also has a variant whose name starts with P (e.g., PPOS alongside POS). The P indicates that the value was automatically predicted rather than taken from the gold standard.
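As a rough illustration of the 12-column layout, here is a minimal Python sketch of a reader. The function names `parse_token` and `parse_sentences` are made up for this example and are not part of Mate-Parser or any CoNLL tooling. It assumes sentences are separated by blank lines, and it treats every `_` as an empty value, so a token whose word form is literally `_` would be misread.

```python
# Names of the first 12 CoNLL 2009 columns, in file order.
COLUMNS = ["ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS",
           "FEAT", "PFEAT", "HEAD", "PHEAD", "DEPREL", "PDEPREL"]


def parse_token(line):
    """Turn one tab-separated line into a dict; '_' becomes None."""
    # Keep only the first 12 fields; later columns are ignored here.
    fields = line.rstrip("\n").split("\t")[:len(COLUMNS)]
    token = {name: (None if value == "_" else value)
             for name, value in zip(COLUMNS, fields)}
    # ID, HEAD and PHEAD are word indices (HEAD 0 means ROOT).
    for key in ("ID", "HEAD", "PHEAD"):
        if token.get(key) is not None:
            token[key] = int(token[key])
    # FEAT and PFEAT are lists of features separated by '|'.
    for key in ("FEAT", "PFEAT"):
        if token.get(key) is not None:
            token[key] = token[key].split("|")
    return token


def parse_sentences(text):
    """Split text on blank lines into sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(parse_token(line))
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

A usage example on an invented two-word sentence (the tags and labels are illustrative, not taken from a real treebank):

```python
sample = (
    "1\tDogs\tdog\tdog\tNNS\tNNS\t_\t_\t2\t2\tSBJ\tSBJ\n"
    "2\tbark\tbark\tbark\tVBP\tVBP\t_\t_\t0\t0\tROOT\tROOT\n"
)
for token in parse_sentences(sample)[0]:
    print(token["ID"], token["FORM"], token["HEAD"], token["DEPREL"])
```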
Update: There is now a CoNLL-U data format as well which extends the CoNLL-X format.