Citation Extractor
Nguyen Bach, Sue Ann Hong, Ben Lambert
AuthorOf(Author, Paper)
PublishedAt(Paper, Conference)
IsPaper, IsAuthor, IsConference
Extraction Task
• “Citation” = <Paper, Authors, Conference>
• “Pattern”
– regular expression
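A minimal sketch of the task's data model, assuming a citation is stored as an immutable record and expanded into the AuthorOf / PublishedAt relations listed on the slide. The class name `Citation`, the helper `relations`, and the tuple encoding of facts are illustrative choices, not something the slides specify.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class Citation:
    """One <Paper, Authors, Conference> triple (field names are assumptions)."""
    paper: str
    authors: Tuple[str, ...]
    conference: str


def relations(c: Citation) -> Set[Tuple[str, str, str]]:
    """Expand a citation into AuthorOf and PublishedAt facts."""
    facts = {("PublishedAt", c.paper, c.conference)}
    for author in c.authors:
        facts.add(("AuthorOf", author, c.paper))
    return facts


# Seed taken from the example query used later in the deck.
seed = Citation(
    paper="multiple-goal recognition from low-level signals",
    authors=("Xiaoyong Chai", "Qiang Yang"),
    conference="AAAI 2005",
)
facts = relations(seed)
```

One seed with two authors yields three facts: one PublishedAt and two AuthorOf.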
Method Outline
• Seed (e.g. 5 citations)
• Citation DB
• Query search (WIT)
• Web pages (HTML, text)
• Extract patterns using known citations
• Page-specific patterns
• Extract citations using new patterns
• Citations
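The outline above is a bootstrapping loop. Here is a rough sketch of it, assuming the search, pattern-learning, and pattern-application steps are passed in as functions. The names `bootstrap`, `search`, `learn_patterns`, `apply_patterns`, and the toy pages are all hypothetical stand-ins; the real system queries the web and learns regular expressions.

```python
def bootstrap(seeds, search, learn_patterns, apply_patterns, rounds=2):
    """Grow a citation DB from seeds; every callable is a stand-in."""
    db = set(seeds)          # Citation DB
    frontier = set(seeds)    # citations not yet used as queries
    for _ in range(rounds):
        new = set()
        for citation in frontier:
            for page in search(citation):                # query search
                patterns = learn_patterns(page, db)      # page-specific patterns
                new |= set(apply_patterns(page, patterns)) - db
        if not new:          # nothing new was found, so stop early
            break
        db |= new
        frontier = new
    return db


# Toy stand-ins: a "page" is a list of citation ids, and a learned
# "pattern" simply unlocks everything on a page with a known citation.
toy_pages = [["c1", "c2"], ["c2", "c3"], ["c4"]]


def search(citation):
    return [page for page in toy_pages if citation in page]


def learn_patterns(page, db):
    return ["page-pattern"] if db & set(page) else []


def apply_patterns(page, patterns):
    return page if patterns else []


result = bootstrap({"c1"}, search, learn_patterns, apply_patterns)
```

With the toy data, the seed c1 reaches c2 in the first round and c3 in the second; c4 shares no page with a known citation and is never found.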
Query: "multiple-goal recognition from low-level signals" "Xiaoyong Chai" "Qiang Yang" "AAAI 2005"
Page: http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/y/Yang:Qiang.html
AUTHOR, AUTHOR: TITLE. CONF
4 Patterns:
AUTHOR, (A-Za-z): (A-Za-z). (A-Za-z)
(A-Za-z), AUTHOR: (A-Za-z). (A-Za-z)
(A-Za-z), (A-Za-z): TITLE. (A-Za-z)
(A-Za-z), (A-Za-z): (A-Za-z). CONF
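One way the four patterns above could be derived, as a sketch: replace the known field values in a page line with slot names, then emit one pattern per slot with the other slots wildcarded. The page line below is an assumed rendering of the seed citation, and `template_from_line` and `slot_patterns` are hypothetical helpers. The `(A-Za-z)` wildcard is kept in the slides' own notation.

```python
import re


def template_from_line(line, title, authors, conf):
    """Replace known field values in a page line with slot names."""
    out = line
    for author in authors:
        out = re.sub(re.escape(author), "AUTHOR", out, flags=re.IGNORECASE)
    out = re.sub(re.escape(title), "TITLE", out, flags=re.IGNORECASE)
    out = re.sub(re.escape(conf), "CONF", out, flags=re.IGNORECASE)
    return out


def slot_patterns(template, wildcard="(A-Za-z)"):
    """One pattern per slot: keep that slot's label, wildcard the rest."""
    slots = list(re.finditer(r"AUTHOR|TITLE|CONF", template))
    patterns = []
    for keep in range(len(slots)):
        parts, pos = [], 0
        for i, m in enumerate(slots):
            parts.append(template[pos:m.start()])   # literal text between slots
            parts.append(m.group() if i == keep else wildcard)
            pos = m.end()
        parts.append(template[pos:])
        patterns.append("".join(parts))
    return patterns


# Assumed page line for the seed citation (not copied from the real page).
line = ("Xiaoyong Chai, Qiang Yang: "
        "Multiple-Goal Recognition from Low-Level Signals. AAAI 2005")
template = template_from_line(
    line,
    title="multiple-goal recognition from low-level signals",
    authors=["Xiaoyong Chai", "Qiang Yang"],
    conf="AAAI 2005",
)
patterns = slot_patterns(template)
```

Matching is case-insensitive here because the query title is lowercase while the page text may not be.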
Finding New Citations
[Figure: page text annotated with AUTHOR, TITLE, and CONF markers]
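A sketch of applying a learned pattern to the rest of a page, assuming the page-specific pattern has the shape AUTHOR, AUTHOR: TITLE. CONF and has been turned into a regular expression with named groups. The expression `CITATION_RE`, the helper `extract`, and the second sample line (a placeholder, not a real citation) are assumptions for illustration.

```python
import re

# Authors run up to the first colon, the title up to a period, and the
# conference is assumed to look like "<Name> <4-digit year>".
CITATION_RE = re.compile(
    r"^(?P<authors>[^:]+):\s*(?P<title>.+?)\.\s+"
    r"(?P<conf>[A-Z][A-Za-z]*\s+\d{4})"
)


def extract(lines):
    """Return (title, authors, conference) for every line that matches."""
    found = []
    for line in lines:
        m = CITATION_RE.match(line.strip())
        if m:
            authors = tuple(a.strip() for a in m.group("authors").split(","))
            found.append((m.group("title"), authors, m.group("conf")))
    return found


page_lines = [
    "Xiaoyong Chai, Qiang Yang: "
    "Multiple-Goal Recognition from Low-Level Signals. AAAI 2005",
    "A. Author, B. Author: Some Title. CONF 2000",   # placeholder line
    "Home | Browse | Search",                         # non-citation text
]
citations = extract(page_lines)
```

Lines without the colon / period / year structure are skipped, which is the "highly restrictive" behaviour that limits recall.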
System Spits Out…
• 6 seeds → 60 citations
• 36 of these (partial citations)
– "Theory and Algorithms for Plan Merging", "Ming Li"
– "The Expected Value of Hierarchical Problem-Solving", "Fahiem Bacchus"
– "Handling feature interactions in process-planning"
• 14 of these (partial strings)
– "On D"
– "On t", "John Tromp", "Elizabeth Sweedyk", "Umest Vazirani"
– "An L", "Ronan Sleep"
– "To D"
• No new conferences (end-token)
Bootstrapping, Short-Lived
• Highly restrictive regexes
– No recovery
– The more seeds and variety, the better
• Stupid Little Things
– Mis-capitalization
– Variations in titles (‘-’ vs. ‘ ’)
– Etc, etc, etc…
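The "stupid little things" above (capitalization, ‘-’ vs. ‘ ’) can be blunted by normalizing both the known citation and the page text before comparing. A minimal sketch, assuming ASCII titles; `normalize` is a hypothetical helper, not part of the described system.

```python
import re


def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics to one space.

    Non-ASCII letters are also collapsed, so this is only a rough key.
    """
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


on_page = "Multiple-Goal Recognition from Low-Level Signals"
in_query = "multiple goal recognition from low level signals "
same = normalize(on_page) == normalize(in_query)
```

Comparing normalized keys makes the hyphenated, capitalized page title equal to the lowercase query title.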
Extensions ~ Improvements
• Less strict string matching
– Not case and punctuation sensitive
• Better boundary detection
– Start/end tokens, HTML wrapper detection?
• Better pattern construction
– e.g. n authors not 2
• NER
– Help find the right "window"
– A source of ENTITY markers
– Used like ‘AUTHOR’, ‘TITLE’, ‘CONF’ but with probabilities/confidence values
• Evaluation with DBLP?
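For the "n authors not 2" item, one option is a pattern that accepts any number of comma-separated names before the colon. This is a sketch only: the `NAME` expression is a crude assumption about what a name looks like, `split_authors` is hypothetical, and the text after the colon in the sample line is a placeholder.

```python
import re

# A name is assumed to be one or more capitalized tokens ("A. Waibel").
NAME = r"[A-Z][\w.\-']*(?:\s+[A-Z][\w.\-']*)*"
AUTHORS_RE = re.compile(rf"^(?P<authors>{NAME}(?:,\s*{NAME})*):")


def split_authors(line):
    """Return the author list in front of the first colon, or [] if none."""
    m = AUTHORS_RE.match(line)
    if not m:
        return []
    return [a.strip() for a in m.group("authors").split(",")]


one = split_authors("John Tromp: TITLE. CONF")
three = split_authors(
    "John Tromp, Elizabeth Sweedyk, Umest Vazirani: TITLE. CONF"
)
none = split_authors("see also: other pages")
```

The same expression covers one, two, or many authors, so a seed with two authors no longer fixes the author count of every match.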
NER
• Baseline model (News corpus)
<ENAMEX_TYPE="PERSON"> M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: </ENAMEX> Towards Spontaneous Speech Translation, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
<ENAMEX_TYPE="PERSON"> S. Awodey. </ENAMEX> Topological Representation of the Lambda Calculus. September <ENAMEX_TYPE="PERSON"> 1998. Math. Struct. </ENAMEX> in <ENAMEX_TYPE="LOCATION"> Comp. Sci. (2000), vol. 10, pp. 81--96. </ENAMEX>
• Adapted model (News + citation corpus)
<ENAMEX_TYPE="PERSON"> M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: </ENAMEX> Towards Spontaneous Speech Translation, Proceedings of the <ENAMEX_TYPE="ORGANIZATION"> International Conference on Acoustics, Speech, </ENAMEX> and Signal Processing.
<ENAMEX_TYPE="PERSON"> L. Birkedal. </ENAMEX> A General Notion of Realizability. December 1999. Proceedings of <ENAMEX_TYPE="ORGANIZATION"> LICS 2000 </ENAMEX>
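To use tagger output like the above as ENTITY markers, the tagged spans first have to be pulled out. A small sketch, assuming the tags appear exactly in the form shown on the slide (it also accepts a space instead of the underscore); `ENAMEX_RE` and `entities` are illustrative names.

```python
import re

# Capture the entity type and the text between the paired tags.
ENAMEX_RE = re.compile(
    r'<ENAMEX[_ ]TYPE="(\w+)">\s*(.*?)\s*</ENAMEX>', re.DOTALL
)

tagged = (
    '<ENAMEX_TYPE="PERSON"> L. Birkedal. </ENAMEX> A General Notion of '
    'Realizability. December 1999. Proceedings of '
    '<ENAMEX_TYPE="ORGANIZATION"> LICS 2000 </ENAMEX>'
)

# List of (entity_type, text) pairs in document order.
entities = ENAMEX_RE.findall(tagged)
```

The PERSON span can then stand in for an AUTHOR slot and the ORGANIZATION span for a CONF slot, subject to whatever confidence the tagger provides.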
Lessons Learned
Another Boring Text Slide
• Semi-structured text is surprisingly difficult to read
• Off-line training for wrappers and/or NER may help
• Need very high-confidence rules to ensure precision
• A continuously-running system needs robustness (internet/Google-failure, unexpected errors, …)