Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


nlp - What is CoNLL data format?

I am new to text mining. I am using an open source jar (Mate Parser) which gives me output in CoNLL 2009 format after dependency parsing. I want to use the dependency parsing results for Information Extraction. I am able to understand some of the output, but I cannot make sense of the CoNLL data format itself. Can anyone help me understand the CoNLL data format? Any kind of pointers would be appreciated.


Fight with dragons too long and you become a dragon yourself; gaze into the abyss too long and the abyss gazes back…

1 Answer


There are many different CoNLL formats, since CoNLL hosts a different shared task each year. The format for CoNLL 2009 is described in that shared task's documentation. Each line represents a single word as a series of tab-separated fields, and an underscore (_) indicates an empty value. Mate-Parser's manual says that it uses the first 12 columns of CoNLL 2009:

ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL

The definitions of some of these columns come from earlier shared tasks (the CoNLL-X format used in 2006 and 2007):

  • ID (index in sentence, starting at 1)
  • FORM (word form itself)
  • LEMMA (word's lemma or stem)
  • POS (part of speech)
  • FEAT (list of morphological features separated by |)
  • HEAD (index of syntactic parent, 0 for ROOT)
  • DEPREL (syntactic relationship between HEAD and this word)
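To make the column layout concrete, here is a minimal Python sketch that reads 12-column CoNLL 2009 text into one dict per word. It assumes sentences are separated by blank lines, as is usual in CoNLL files; the function name `parse_conll09`, the sample sentence, and its tags and relation labels are made up for illustration and are not real Mate Parser output.

```python
# Names of the first 12 CoNLL 2009 columns, in file order.
COLUMNS = ["ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS",
           "FEAT", "PFEAT", "HEAD", "PHEAD", "DEPREL", "PDEPREL"]


def parse_conll09(text):
    """Split CoNLL 2009 text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Assumption: a blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        token = dict(zip(COLUMNS, fields[:len(COLUMNS)]))
        # "_" marks an empty value; keep FORM untouched in case the word is "_".
        token = {k: (None if v == "_" and k != "FORM" else v)
                 for k, v in token.items()}
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


# Hypothetical three-word sentence with illustrative tags and labels.
SAMPLE = "\n".join([
    "\t".join(["1", "The", "the", "the", "DT", "DT", "_", "_", "2", "2", "NMOD", "NMOD"]),
    "\t".join(["2", "dog", "dog", "dog", "NN", "NN", "_", "_", "3", "3", "SBJ", "SBJ"]),
    "\t".join(["3", "barks", "bark", "bark", "VBZ", "VBZ", "_", "_", "0", "0", "ROOT", "ROOT"]),
])

sentences = parse_conll09(SAMPLE)
for token in sentences[0]:
    print(token["ID"], token["FORM"], token["HEAD"], token["DEPREL"])
```

All values are kept as strings here (including ID and HEAD), so converting them to integers is left to the caller.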

The variants of those columns that start with P (e.g., PPOS rather than POS) indicate that the value was automatically predicted rather than taken from a gold standard.
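For information extraction you usually want (head, relation, dependent) triples. The sketch below builds them from either the predicted columns (PHEAD, PDEPREL) or the gold ones (HEAD, DEPREL). It assumes that parser output on raw text fills the predicted columns and may leave the gold ones as _, so check your own output to see which columns are populated. The function name `dependency_triples` and the sample lines are hypothetical.

```python
def dependency_triples(sentence_lines, predicted=True):
    """Return (head_form, relation, dependent_form) for each word of one sentence."""
    rows = [line.split("\t") for line in sentence_lines if line.strip()]
    # 0-based positions in the 12-column layout:
    # ID=0, FORM=1, HEAD=8, PHEAD=9, DEPREL=10, PDEPREL=11
    head_col, rel_col = (9, 11) if predicted else (8, 10)
    forms = {row[0]: row[1] for row in rows}  # map ID -> FORM
    triples = []
    for row in rows:
        head_id = row[head_col]
        if head_id == "_":
            continue  # this column was not filled in
        # HEAD 0 means the word attaches to the artificial ROOT node.
        head_form = "ROOT" if head_id == "0" else forms[head_id]
        triples.append((head_form, row[rel_col], row[1]))
    return triples


# Hypothetical parser output: gold HEAD/DEPREL empty, predicted ones filled.
lines = [
    "\t".join(["1", "The", "_", "the", "_", "DT", "_", "_", "_", "2", "_", "NMOD"]),
    "\t".join(["2", "dog", "_", "dog", "_", "NN", "_", "_", "_", "3", "_", "SBJ"]),
    "\t".join(["3", "barks", "_", "bark", "_", "VBZ", "_", "_", "_", "0", "_", "ROOT"]),
]

for head, relation, dependent in dependency_triples(lines):
    print(f"{relation}({head}, {dependent})")
```

Passing `predicted=False` switches to the gold columns, which is what you would use when reading an annotated treebank instead of parser output.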

Update: There is now also a CoNLL-U data format, which extends the CoNLL-X format.

