Using Corpora and Evaluation Tools
Diana Maynard
Kalina Bontcheva
http://gate.ac.uk/ http://nlp.shef.ac.uk/
March 2004
Corpus structure
• Located in gatecorpora in cvs
• Each directory under gatecorpora has a corpus, e.g., gatecorpora/ace
• Each corpus can have sub-parts, e.g. ace/bnews
• Each (sub-)corpus has a clean and marked directory, these are important
• Clean holds the unannotated version, while marked holds the human-marked ones
• There may also be a processed subdirectory – this is a datastore (unlike the other two)
• Corresponding files in each subdirectory must have the same name
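The naming rule above (corresponding files in clean and marked must share a name) is easy to check mechanically. A minimal sketch, assuming each of the two subdirectories is a flat folder of files; the function name check_corpus is hypothetical and not part of gatecorpora/utilities:

```python
from pathlib import Path


def check_corpus(corpus_dir):
    """Report file names that appear in only one of clean/ and marked/."""
    corpus = Path(corpus_dir)
    # Collect plain file names; subdirectories are ignored in this sketch.
    clean = {p.name for p in (corpus / "clean").iterdir() if p.is_file()}
    marked = {p.name for p in (corpus / "marked").iterdir() if p.is_file()}
    return {
        "missing_from_marked": sorted(clean - marked),
        "missing_from_clean": sorted(marked - clean),
    }
```

Two empty lists in the result would mean the clean and marked directories line up file for file. The processed directory is left out because it is a datastore rather than a plain folder.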
Tools for corpus manipulation
• There are lots of tools available in gatecorpora/utilities and in subdirectories of each corpus
• Many of the corpora, e.g. MUC, ACE come in different formats (e.g. inline vs standoff markup) and have been converted to GATE-style annotations
• Also tools for e.g. counting things, changing annotation names etc (mostly JAPE grammars)
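To illustrate the difference between inline and standoff markup mentioned above, here is a rough sketch that turns inline-tagged text into plain text plus a list of (start, end, type) offsets. It assumes simple, well-nested tags without attributes; the function name inline_to_standoff is made up for this example and is not one of the converters in gatecorpora.

```python
import re

# Matches opening and closing tags without attributes, e.g. <Person>, </Person>.
TAG = re.compile(r"<(/?)([\w-]+)>")


def inline_to_standoff(marked_text):
    """Split inline-tagged text into plain text and (start, end, type) spans."""
    pieces = []        # untagged text chunks, in order
    annotations = []   # (start, end, type), offsets into the plain text
    open_tags = []     # stack of (type, start offset) for tags not yet closed
    pos = 0            # read position in marked_text
    length = 0         # length of the plain text produced so far
    for match in TAG.finditer(marked_text):
        chunk = marked_text[pos:match.start()]
        pieces.append(chunk)
        length += len(chunk)
        pos = match.end()
        closing, name = match.group(1), match.group(2)
        if not closing:
            open_tags.append((name, length))
            continue
        if not open_tags or open_tags[-1][0] != name:
            raise ValueError("unbalanced tag: </%s>" % name)
        _, start = open_tags.pop()
        annotations.append((start, length, name))
    if open_tags:
        raise ValueError("unclosed tag: <%s>" % open_tags[-1][0])
    pieces.append(marked_text[pos:])
    return "".join(pieces), annotations
```

For example, the input "<Person>Diana</Person> works in <Location>Sheffield</Location>." gives the text "Diana works in Sheffield." with a Person span at offsets 0 to 5 and a Location span at 15 to 24. The standoff form leaves the text untouched, so several annotation layers can point at the same characters.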
Corpora available
• MUC7 (newswires)
• MUSE (news texts from the web)
• ACE
• ACE Chinese
• ACE Arabic
• Romanian (news texts; 1984)
• CMU seminars
• Jobs
• CONLL’03 – part of Reuters with NEs
• Bulgarian - news
MUC 7 corpus
• Newswires used in the official MUC 7 evaluation
• Data available in MUC format and GATE format
• Annotation types: Person, Location, Organization, Money, Percent, Date, Time
• Division into training and test sets
MUSE corpus
• News texts from various websites (BBC, Guardian, etc.)
• Annotation types: Person, Organisation, Location, Date, Time, Money, Percent, Address
• Slight differences in annotation guidelines from MUC, e.g. people’s titles are included in names
• Available from gatecorpora/news in various subdirectories
ACE corpus
• 3 types of text: newswire, broadcast news and newspaper
• Broadcast news and newspaper available as ground truth and original (degraded) texts
• Annotation types: Person, Organisation, Location, GPE, Facility
• Some annotations have roles to indicate metonymous usage
• Guidelines are different from MUC and MUSE
• Available from gatecorpora/ace in various subdirectories
Multilingual ACE
• As for ACE, but in Chinese and Arabic
• Texts are in UTF-8
• No degraded versions of these texts
• Available from gatecorpora/ace/ace03/Chinese/ and gatecorpora/ace/ace03/Arabic/
CMU Seminars & Jobs
• Corpora frequently used to evaluate relation extraction and wrapper induction systems
• gatecorpora/jobs-corpus and gatecorpora/cmu-seminars
• Converted into GATE XML, ready for use
CONLL’03 shared task
• Corpus used in the CONLL’03 shared task for evaluating NE recognition
• In English, part of the Reuters corpus
• Markup is e.g., <I-LOC>, not converted to MUSE tags
• Use reuterstogate.jape to convert to MUSE tags
• gatecorpora/ReutersWithNamedEntities
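The tag conversion amounts to renaming annotation types. The sketch below shows the idea over in-memory (start, end, type) tuples. It is not a transcription of reuterstogate.jape, whose rules are not shown in these slides: the mapping table is an assumption based on the CoNLL-2003 tag set (PER, LOC, ORG, MISC with I- or B- prefixes) and the MUSE type names listed earlier, and MISC is left unchanged because MUSE has no obvious equivalent.

```python
# Assumed mapping from CoNLL'03 entity classes to MUSE annotation types.
CONLL_TO_MUSE = {
    "PER": "Person",
    "LOC": "Location",
    "ORG": "Organisation",
}


def rename_types(annotations):
    """Rename CoNLL-style types (e.g. I-LOC) to MUSE-style types."""
    renamed = []
    for start, end, ann_type in annotations:
        # Drop the I-/B- prefix, if present, before looking up the class.
        prefix, _, base = ann_type.partition("-")
        if prefix in ("I", "B") and base in CONLL_TO_MUSE:
            ann_type = CONLL_TO_MUSE[base]
        # Anything unmapped (e.g. I-MISC) is kept under its original name.
        renamed.append((start, end, ann_type))
    return renamed
```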
How it works
• Clean, marked, and processed
• Corpus_tool.properties – must be in the directory from where GATE is executed
• Specifies configuration information about
  – What annotation types are to be evaluated
  – Threshold below which to print out debug info
  – Input set name and key set name
• Modes
  – Default – regression testing
  – Human marked against already stored, processed
  – Human marked against current processing results
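The comparison behind these modes can be pictured as scoring a response set (system output) against a key set (human-marked) for each configured annotation type, and printing detail only when a score falls below the threshold. The following is an illustrative sketch and not GATE's implementation: it assumes annotations are (start, end, type) tuples, counts only exact span matches, and uses F-measure as the thresholded score; the function name evaluate and the default threshold are made up.

```python
def evaluate(key, response, types, threshold=0.5):
    """Return {type: (precision, recall, f_measure)} using exact span matches."""
    results = {}
    for ann_type in types:
        key_spans = {(s, e) for s, e, t in key if t == ann_type}
        resp_spans = {(s, e) for s, e, t in response if t == ann_type}
        correct = len(key_spans & resp_spans)
        precision = correct / len(resp_spans) if resp_spans else 0.0
        recall = correct / len(key_spans) if key_spans else 0.0
        total = precision + recall
        f_measure = 2 * precision * recall / total if total else 0.0
        results[ann_type] = (precision, recall, f_measure)
        if f_measure < threshold:
            # Debug output: spans the system missed and spans it invented.
            print(ann_type, "missing:", sorted(key_spans - resp_spans))
            print(ann_type, "spurious:", sorted(resp_spans - key_spans))
    return results
```

In this picture, the modes differ only in where the response set comes from: annotations already stored in the processed datastore, or the results of running the application again on the clean documents.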
Conclusion
This talk: http://gate.ac.uk/sale/talks/corpora-tutorial.ppt
More information: http://gate.ac.uk/