Citation Extractor

Citation Extractor Nguyen Bach Sue Ann Hong Ben Lambert


Page 1: Citation Extractor

Citation Extractor

Nguyen Bach, Sue Ann Hong, Ben Lambert

Page 2: Citation Extractor

Extraction Task

AuthorOf(Author, Paper)
PublishedAt(Paper, Conference)
IsPaper, IsAuthor, IsConference

• “Citation” = <Paper, Authors, Conference>
• “Pattern”
– regular expression

Page 3: Citation Extractor

Method Outline

• Seed (e.g. 5 citations) → Citation DB
• Query Search (WIT) → Web pages (HTML, text)
• Extract Patterns using known citations → Page-specific Patterns
• Extract Citations using new patterns → Citations
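The outline above can be read as a bootstrapping loop. The sketch below is one possible reading, not the authors' implementation: `search`, `learn_patterns`, and `apply_patterns` are hypothetical callables standing in for the query step, the pattern-extraction step, and the citation-extraction step, and `max_rounds` is an assumed cap.

```python
def bootstrap(seeds, search, learn_patterns, apply_patterns, max_rounds=3):
    """Grow a citation DB from a few seed citations.

    Hypothetical callables (assumptions, not from the slides):
      search(citation)            -> iterable of pages
      learn_patterns(page, known) -> page-specific patterns
      apply_patterns(page, pats)  -> citations found on that page
    """
    db = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        new = set()
        for citation in frontier:
            for page in search(citation):
                patterns = learn_patterns(page, db)
                new.update(apply_patterns(page, patterns))
        new -= db          # keep only citations not already in the DB
        if not new:        # nothing new found: the bootstrapping has died out
            break
        db |= new
        frontier = new     # next round queries with the new citations only
    return db


# Toy usage: a "page" is just a set of citation labels.
pages = [{"a", "b"}, {"b", "c"}, {"d"}]

def toy_search(citation):
    return [p for p in pages if citation in p]

def toy_learn(page, known):
    return [True] if page & known else []

def toy_apply(page, patterns):
    return page if patterns else []

found = bootstrap({"a"}, toy_search, toy_learn, toy_apply)
```

In the toy run, page `{"d"}` is never reached because no known citation leads to it, which mirrors how a bootstrapping run can stall when seeds lack variety.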

Page 4: Citation Extractor

Query: "multiple-goal recognition from low-level signals" "Xiaoyong Chai" "Qiang Yang" "AAAI 2005"

Page: http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/y/Yang:Qiang.html

AUTHOR, AUTHOR: TITLE. CONF

4 Patterns:

AUTHOR, (A-Za-z): (A-Za-z). (A-Za-z)
(A-Za-z), AUTHOR: (A-Za-z). (A-Za-z)
(A-Za-z), (A-Za-z): TITLE. (A-Za-z)
(A-Za-z), (A-Za-z): (A-Za-z). CONF
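A minimal sketch of how four such patterns could be generated from one known citation: each pattern pins one known field and leaves the other three as wildcards. `WILDCARD` is an assumed stand-in for the slide's "(A-Za-z)" placeholder (digits and hyphens are allowed so that "AAAI 2005" and hyphenated titles match), and `make_patterns` is a hypothetical helper name.

```python
import re

# Assumed wildcard: letters, digits, spaces, and hyphens. It deliberately
# excludes ',', ':' and '.', which act as field boundaries in the template.
WILDCARD = r"([A-Za-z0-9][A-Za-z0-9 \-]*)"
TEMPLATE = r"{}, {}: {}\. {}"   # AUTHOR, AUTHOR: TITLE. CONF


def make_patterns(author1, author2, title, conf):
    """Return four regexes, each fixing exactly one known field."""
    known = [author1, author2, title, conf]
    patterns = []
    for i, value in enumerate(known):
        slots = [WILDCARD] * 4
        slots[i] = "(" + re.escape(value) + ")"   # pin the known field
        patterns.append(re.compile(TEMPLATE.format(*slots)))
    return patterns


patterns = make_patterns(
    "Xiaoyong Chai",
    "Qiang Yang",
    "Multiple-Goal Recognition from Low-Level Signals",
    "AAAI 2005",
)
for p in patterns:
    print(p.pattern)
```

Because every field is a capture group, a match on any of the four patterns yields the full (author, author, title, conference) tuple.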

Page 5: Citation Extractor

Finding New Citations

AUTHOR, (A-Za-z): (A-Za-z). (A-Za-z)
(A-Za-z), AUTHOR: (A-Za-z). (A-Za-z)
(A-Za-z), (A-Za-z): TITLE. (A-Za-z)
(A-Za-z), (A-Za-z): (A-Za-z). CONF

[Figure: page text with the matched slots labeled AUTHOR, TITLE, and CONF]
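As an illustration of this step, the sketch below applies two such patterns line by line to a page and collects every tuple they match. The page text is made up for the example (only the first line comes from the query on the previous page), and `WILDCARD` and `find_citations` are assumed names, not part of the original system.

```python
import re

# Assumed stand-in for the slide's "(A-Za-z)" placeholder.
WILDCARD = r"([A-Za-z0-9][A-Za-z0-9 \-]*)"


def find_citations(page_text, patterns):
    """Return the set of (author, author, title, conf) tuples matched on a page."""
    found = set()
    for line in page_text.splitlines():
        for pattern in patterns:
            match = pattern.search(line)
            if match:
                found.add(tuple(group.strip() for group in match.groups()))
    return found


# Two of the four patterns: one pins a known author, one pins a known conference.
patterns = [
    re.compile(WILDCARD + ", (Qiang Yang): " + WILDCARD + r"\. " + WILDCARD),
    re.compile(WILDCARD + ", " + WILDCARD + ": " + WILDCARD + r"\. (AAAI 2005)"),
]

# Lines 2 to 4 are invented test data, not real citations.
page = """Xiaoyong Chai, Qiang Yang: Multiple-Goal Recognition from Low-Level Signals. AAAI 2005
First Author, Qiang Yang: A Made-Up Title. CONF 2004
Other Person, Second Person: Another Made-Up Title. AAAI 2005
Unrelated line without the expected punctuation"""

citations = find_citations(page, patterns)
```

The seed line matches both patterns but is stored once, since results are kept in a set. Any line that deviates from the exact punctuation is silently skipped.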

Page 6: Citation Extractor

System Spits Out…

• 6 seeds → 60 citations
• 36 of these (partial citations)
– "Theory and Algorithms for Plan Merging ", " Ming Li"
– "The Expected Value of Hierarchical Problem-Solving ", " Fahiem Bacchus"
– "Handling feature interactions in process-planning "
• 14 of these (partial strings)
– "On D "
– "On t ", " John Tromp", " Elizabeth Sweedyk", " Umest Vazirani"
– "An L ", " Ronan Sleep"
– "To D "
• No new conferences (end-token)

Page 7: Citation Extractor

Bootstrapping, Short-Lived

• Highly restrictive regexes
– No recovery
– The more seeds and variety, the better
• Stupid Little Things
– Mis-capitalization
– Variations in titles (‘-’ vs. ‘ ’)
– Etc.

Page 8: Citation Extractor

Extensions ~ Improvements

• Less strict string matching
– Not case and punctuation sensitive
• Better boundary detection
– Start/end tokens, HTML wrapper detection?
• Better pattern construction
– e.g. n authors not 2
• NER
– help find the right "window"
– A source of ENTITY marker
  • Use like ‘AUTHOR’, ‘TITLE’, ‘CONF’ but with probabilities/confidence values
• Evaluation with DBLP?
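One way the "less strict string matching" item might look, as a sketch under stated assumptions: compare strings after lowercasing and collapsing every run of punctuation or whitespace into a single space. `normalize` and `loosely_equal` are hypothetical helper names, and the ASCII-only character class is a simplification.

```python
import re


def normalize(text):
    """Lowercase and treat any run of non-alphanumerics as a single space.

    Simplification: only ASCII letters and digits are kept, so accented
    characters are treated as separators.
    """
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def loosely_equal(a, b):
    """True if two strings differ only in case, punctuation, or spacing."""
    return normalize(a) == normalize(b)


same = loosely_equal(
    "Multiple-Goal Recognition from Low-Level Signals",
    "multiple goal recognition from low level signals ",
)
```

This covers the mis-capitalization and ‘-’ vs. ‘ ’ cases; it does not handle abbreviations or reordered words.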

Page 9: Citation Extractor

NER

• Baseline model (News corpus)

<ENAMEX_TYPE="PERSON"> M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: </ENAMEX> Towards Spontaneous Speech Translation, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.

<ENAMEX_TYPE="PERSON"> S. Awodey. </ENAMEX> Topological Representation of the Lambda Calculus. September <ENAMEX_TYPE="PERSON"> 1998. Math. Struct. </ENAMEX> in <ENAMEX_TYPE="LOCATION"> Comp. Sci. (2000), vol. 10, pp. 81--96. </ENAMEX>

• Adapted model (News + citation corpus)

<ENAMEX_TYPE="PERSON"> M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93: </ENAMEX> Towards Spontaneous Speech Translation, Proceedings of the <ENAMEX_TYPE="ORGANIZATION"> International Conference on Acoustics, Speech, </ENAMEX> and Signal Processing.

<ENAMEX_TYPE="PERSON"> L. Birkedal. </ENAMEX> A General Notion of Realizability. December 1999. Proceedings of <ENAMEX_TYPE="ORGANIZATION"> LICS 2000 </ENAMEX>
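To use tagged output like this as a source of entity markers, the spans first have to be pulled out of the text. The sketch below assumes the tag style exactly as printed on this slide (`<ENAMEX_TYPE="...">` with an underscore); `entities` is a hypothetical helper name.

```python
import re

# Matches the tag style shown on the slide: <ENAMEX_TYPE="PERSON"> ... </ENAMEX>
ENAMEX = re.compile(r'<ENAMEX_TYPE="([A-Z]+)">(.*?)</ENAMEX>', re.DOTALL)


def entities(tagged):
    """Return (type, text) pairs for every tagged span, in document order."""
    return [(kind, text.strip()) for kind, text in ENAMEX.findall(tagged)]


sample = (
    '<ENAMEX_TYPE="PERSON"> L. Birkedal. </ENAMEX> A General Notion of '
    'Realizability. December 1999. Proceedings of '
    '<ENAMEX_TYPE="ORGANIZATION"> LICS 2000 </ENAMEX>'
)
spans = entities(sample)
# spans == [('PERSON', 'L. Birkedal.'), ('ORGANIZATION', 'LICS 2000')]
```

The untagged remainder (here the title and date) is dropped, so a caller that needs the title would have to recover it from the text between spans.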

Page 10: Citation Extractor

Lessons Learned

Another Boring Text Slide

• Semi-structured text is surprisingly difficult to read

• Off-line training for wrappers and/or NER may help

• Need very high-confidence rules to ensure precision

• A continuously-running system needs robustness (internet/Google-failure, unexpected errors, …)
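As one illustration of the robustness point, a fetch step can be wrapped in a retry with exponential backoff. This is a generic sketch, not something described in the slides: `with_retries`, its parameters, and the injected `sleep` function are all assumptions made for the example.

```python
import time


def with_retries(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    Re-raises the last exception once all attempts are used up. Catching
    Exception broadly is a simplification; a real system would likely
    narrow this to network errors.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Toy usage: a fetch that fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("search backend unavailable")
    return "page text"

waits = []
result = with_retries(flaky_fetch, sleep=waits.append)  # record waits instead of sleeping
```

Passing `sleep` as a parameter keeps the wrapper testable without real delays.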