Named Entity Recognition for Danish in a CG framework


Slide 1

Named Entity Recognition for Danish in a CG framework
Eckhard Bick, Southern Denmark University
lineb@hum.au.dk

Slide 2
Topics
- DanGram system overview
- Distributed NER techniques: pattern matching, lexical, CG rules
- Evaluation and outlook

Slide 3
System structure

Slide 4
The DanGram system in current numbers
- Lexemes in the morphological base lexicon: 146.342 (equals about 1.000.000 full forms), of these:
  - proper names: 44.839 (experimental)
  - polylexicals: 460 (plus names and certain number expressions)
- Lexemes in the valency and semantic prototype lexicon: 95.308
- Lexemes in the bilingual lexicon (Danish-Esperanto): 36.001
- Danish CG rules, in all: 7.400
  - morphological CG disambiguation rules: 2.259 + 137 = 2.396
  - syntactic mapping rules: 2.250
  - syntactic CG disambiguation rules: 2.205
  - NER module: 432
  - attachment CG: 117 (plus 429 bilingual rules in separate MT grammars)
- Danish PSG rules: 605 (for generating syntactic tree structures)
- Performance: at full disambiguation (i.e. maximal precision), the system has an average correctness of 99% for word class (PoS) and about 95% for syntactic function tags (depending on how fine-grained an annotation scheme is used). From raw news text, 50-70% of sentences produce well-formed syntactic trees.
- Speed: full CG parse ca. 400 words/sec for larger texts (start-up time 3-6 sec); morphological analysis alone ca. 1.000 words/sec

Slide 5
Running CG annotation
Da (when) [da] KS @SUB
den (the) [den] ART UTR S DEF @>N
gamle (old) [gammel] ADJ nG S DEF NOM @>N
sælger (salesman) [sælger] N UTR S IDF NOM @SUBJ>
kørte (drove) [køre] V IMPF AKT @FS-ADVL>
hjem (home) [hjem] N NEU P IDF NOM @...

6. NE chaining, a repair mechanism for faulty NE string recognition at levels (1) and (2) (cleanmorf.dan)
Performs chunking choices too hard or too ambiguous to make before CG:
- Fuses Hans=Jensen og Otte=Nielsen, but keeps Hans Porsche and Otte PC'er unfused, using CG recognition of Jensen PROP, Nielsen PROP, Porsche PROP and PC'er N
- Fuses PROP and certain semantic N-types, if upper case and so far unrecognized:
  PROP + N -> PROP: Betty=Nansen Broen
  PROP + N -> PROP: Betty=Nansen Foreningen
  PROP + N -> PROP: Betty=Nansen Prisen
- Repairs erroneous PROP splitting by the preprocessor, if later contextual typing asks for fusion:
  PROP + PROP: Dansk=Røde=Kors Afrika
  PROP + PROP: Danmarks Monetære Institut

Slide 22
7. NE function classes, mapped and disambiguated by context-based rules (dancg.syn, ca. 4.400 rules)
Handles, among other things, the syntactic function and attachment of names. The following are examples of functions relevant to the subsequent type mapper:
(i) @N< (nominal dependents): præsident Bush, filmen "The Matrix"
(ii) @APP (identifying appositions): Forældrebestyrelsens formand, Kurt Christensen, anklager borgmester ...
(iii) @N< ...

8.1 Cross-nominal prototype transfer
Post-nominal attachment: i byen Rijnsburg
MAP (<...>) TARGET (PROP @N<) ...

8.2 Coordination-based type inference
1. Maps "close coordinators" (&KC-CLOSE):
   ADD (&KC-CLOSE) TARGET (KC) (*1 @SUBJ> BARRIER @NON->N) (-1 @SUBJ>)
2. Then uses these tags in disambiguation rules, e.g. Arafat @SUBJ> og hans Palæstinas=Selvstyre @SUBJ>:
   REMOVE %non-h (0 %hum-all) (*-1 &KC-CLOSE BARRIER @NON->N LINK -1C %hum OR N-HUM LINK 0 NOM) ;
   SELECT (<...>) (1 &KC-CLOSE) (*2C <...> BARRIER @NON->N) ;
3. Danish has human-only and non-human genitive pronouns:
   SELECT %hum (0 @SUBJ>) (1 KC) (2 ("han" GEN) OR ("hun" GEN)) (*3 @SUBJ> BARRIER @NON->N/KOMMA) ; # Hejberg og hans skole
   REMOVE %hum (0 @SUBJ>) (1 KC) (2 ("den" GEN) OR ("det" GEN)) (*3 @SUBJ> BARRIER @NON->N/KOMMA) ; # Anden Verdenskrig og dens mange slag
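The genitive-pronoun cue in point 3 is easy to picture outside the CG formalism. Below is a minimal Python sketch of the same inference, assuming a plain list of lower-cased tokens; the set names and the function are hypothetical stand-ins, not part of DanGram (the real rules also check the @SUBJ> function and the @NON->N/KOMMA barrier):

```python
# Illustrative sketch only, not the actual CG rules: type inference for a
# proper name from coordination with a human-only or non-human genitive
# pronoun ("og hans/hendes ..." vs. "og dens/dets ...").

HUM_GEN_PRONOUNS = {"hans", "hendes"}    # genitives of "han"/"hun" (human-only)
NONHUM_GEN_PRONOUNS = {"dens", "dets"}   # genitives of "den"/"det" (non-human)

def infer_type_from_coordination(tokens, i):
    """Guess %hum vs. %non-hum for the name at position i from a following
    coordinator + genitive pronoun; None if the pattern does not apply."""
    if i + 2 < len(tokens) and tokens[i + 1] == "og":
        if tokens[i + 2] in HUM_GEN_PRONOUNS:
            return "%hum"        # e.g. "Hejberg og hans skole"
        if tokens[i + 2] in NONHUM_GEN_PRONOUNS:
            return "%non-hum"    # e.g. "Anden Verdenskrig og dens mange slag"
    return None

print(infer_type_from_coordination(["Hejberg", "og", "hans", "skole"], 0))  # %hum
```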
Slide 26
8.3 PP contexts
- Word-specific narrow context:
  MAP (<...>) TARGET (PROP) (-1 ("for" PRP)) (-2 ("syd") OR ("vest") OR ("nord") OR ("øst")) ;
- NP-level vs. clause-level function:
  ADD (<...>) TARGET (PROP @P< <...>) ; (safe, early rule)
  REMOVE (<...>) (0 @P< <...>) ; (heuristic, later rule)
- PP-attachment inference, class-based: godt 40 km fra Madras
  REMOVE %non-top (-1 ("fra" PRP) OR ("til" PRP)) (-2 N-DIST) (-3 NUM) ;
- PP-attachment inference, word-list-based:
  MAP (%org) TARGET (PROP NOM @P< <...> OR <...>) ;

Slide 27
8.4 Genitive mapping
MAP (%org) TARGET (GEN @>N) (*1 (N IDF) BARRIER @NON->N LINK 0 GEN-ORG) (NOT 0 <...> OR <...> OR <...> OR <...> OR <...>) ; # Microsofts generalforsamling/aktiekurs ("hard" GEN-ORG set)
MAP (%org) TARGET (GEN @>N) (*1 (N IDF) BARRIER @NON->N LINK 0 GEN-ORG/HUM) (NOT 0 <...> OR <...> OR <...> OR <...> OR <...> OR <...>) ; # Microsofts / Bill=Gates advokat/hjemmeside ("soft" GEN-ORG set)
REMOVE %non-h (0 GEN LINK 0 %h) (*1 N BARRIER @NON->N/KOMMA LINK 0 (<...>) OR (<...>)) ; # owning thoughts and "thought products"; %non-h also respects "humanoids", <...>, etc.

Slide 28
8.5 Prenominal context: using adjective classes
Uses semantic adjective classes, e.g.:
1. Type-based, more general, less safe:
   LIST ADJ-HUM = <...> ;
2. Word-based, more specific and safer:
   LIST ADJ-HUM& = "adfærdsvanskelig" "adspredt" "affektlabil" "afklaret" "afmægtig" "afslappet" "afstumpet" "afvisende" "agtbar" "agtpågivende" "agtsom" "alert" "alfaderlig" "alkærlig" "altopgivende" "altopofrende" "alvorsfuld" ... ;
MAP (%hum) TARGET (<...> PROP NOM) (-1 AD LINK 0 ADJ-HUM&) (*-2 (ART S DEF) BARRIER @NON->N) ; # Den langlemmede Kanako=Yonekura
ADD (%hum) TARGET (<...> PROP NOM) (-1 AD LINK 0 ADJ-HUM) (*-2 (ART S DEF) BARRIER @NON->N) ; # Den langlemmede Kanako=Yonekura

Slide 29
Evaluation: CG annotation for Danish news text

                          Recall   Precision   F-score
All word classes [1]      98.6     98.7        98.65
All syntactic functions   95.4     94.6        94.9

[1] Verbal subcategories (present PR, past IMPF, infinitive INF, present and past participle PCP1/2) and pronoun subcategories (inflecting DET, uninflecting INDP and personal PERS) were counted as different PoS.

Slide 30
Performance statistics: Korpus 90

Slide 31
Performance statistics: Korpus 2000

Slide 32
Cross-class and class-internal name type errors

Slide 33
Comparisons
- LTG (Mikheev et al. 1998) achieved an overall F-measure of 93.39, using hybrid techniques involving probabilistics/HMM, name/suffix lists and SGML-manipulating rules
- MENE (Borthwick et al. 1998), maximum-entropy training:
  - in-domain/same-topic F-scores of up to 92.20, and 97.12 for a hybrid system integrating other MUC-7 systems
  - cross-topic formal test F-scores: 84.22 (pure MENE), 92 (hybrid MENE)
- Possible weakness of trained systems: heavy training-data bias?
- Korpus90/2000, which was used for the evaluation of the rule-based system presented here, is a mixed-genre corpus; even its news texts are highly cross-domain/cross-topic, since sentence order has been randomized for copyright reasons.
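For reference, the F-scores quoted here and on Slide 29 are consistent with the standard balanced F-measure, the harmonic mean of recall and precision; the slides do not spell the formula out, so this reading is an assumption. A quick check in Python:

```python
def f_score(recall, precision):
    """Balanced F-measure (F1): harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)

print(round(f_score(98.6, 98.7), 2))  # 98.65, matching Slide 29's word-class row
print(round(f_score(95.4, 94.6), 2))  # 95.0; Slide 29 reports 94.9, presumably from raw counts
```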
Slide 34
What is it being used for?
- Enhancing ordinary grammatical analysis: noun disambiguation, semantic selection-restriction fillers
- Corpus research on names
- Enhancing IR systems, e.g. question answering

Slide 35
Outlook
- A future direct comparison might corroborate the intuition that a hand-crafted system is less likely to have a domain/topic bias than automated learning systems with limited training data.
- Balancing strengths and weaknesses, future work should also examine to which degree automated learning / probabilistic systems can interface with or supplement Constraint Grammar based NER systems.
- For large chunks, text/discourse-based memory should be used for name type disambiguation, so that clear cases and a majority vote could determine the class of unknown names (a sketch of the idea follows at the end of this transcript).
- With a larger window of analysis, anaphora resolution across sentence boundaries might help NER ("human" pronouns, definite NPs, ...).

Slide 36
Where to reach us: http://beta.visl.sdu.dk - http://corp.hum.sdu.dk
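The discourse-memory idea from Slide 35 can be sketched in a few lines: tally the types that safe rules assign to each name within a document, and let a majority vote type later, contextually mute occurrences. Everything below is a hypothetical illustration, not DanGram code; only the %hum/%org/%top-style labels follow the tag set used in the rules above:

```python
from collections import Counter, defaultdict

class DiscourseMemory:
    """Majority-vote name typing across a document: confidently typed
    occurrences vote, and unknown occurrences inherit the winner."""

    def __init__(self):
        self.votes = defaultdict(Counter)

    def observe(self, name, ne_type):
        # Record an occurrence typed by a safe (early, non-heuristic) rule.
        self.votes[name][ne_type] += 1

    def resolve(self, name):
        # Majority type for the name, or None if it was never safely typed.
        if name not in self.votes:
            return None
        return self.votes[name].most_common(1)[0][0]

mem = DiscourseMemory()
mem.observe("Madras", "%top")
mem.observe("Madras", "%top")
mem.observe("Madras", "%org")  # one noisy vote is outvoted
print(mem.resolve("Madras"))   # %top
```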