Instructor: Alan Ritter CSE 5539: Web Information Extraction.
![Page 1: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/1.jpg)
Instructor: Alan Ritter
CSE 5539: Web Information Extraction
![Page 2: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/2.jpg)
Motivation
• Data Analytics / Big Data
  – Companies have lots of data lying around
  – Computing cycles are cheap
  – Using data to get insights:
    • Business, Healthcare, Science, Government, Politics
• Challenge: Most of the world’s data is unstructured
  – Text
  – Speech
  – Images
Structured Data
Bigger Unstructured Data
![Page 3: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/3.jpg)
Extracting Knowledge from Text
[Figure: The Web, News -> Text Extractors -> Structured Data]
![Page 4: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/4.jpg)
![Page 8: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/8.jpg)
Example: Information Extraction from Twitter
“Yess! Yess! Its official Nintendo announced today that they Will release the Nintendo 3DS in north America march 27 for $250”

PRODUCT RELEASE

| COMPANY | PRODUCT | DATE | PRICE | REGION |
|---|---|---|---|---|
| Nintendo | 3DS | March 27 | $250 | North America |
![Page 9: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/9.jpg)
Example: Information Extraction from Twitter
Samsung Galaxy S5 Coming to All Major U.S. Carriers Beginning April 11th
PRODUCT RELEASE

| COMPANY | PRODUCT | DATE | PRICE | REGION |
|---|---|---|---|---|
| Samsung | Galaxy S5 | April 11 | ? | U.S. |
| Nintendo | 3DS | March 27 | $250 | North America |
![Page 10: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/10.jpg)
Example: Information Extraction from Twitter
PRODUCT RELEASE

| COMPANY | PRODUCT | DATE | PRICE | REGION |
|---|---|---|---|---|
| Samsung | Galaxy S5 | April 11 | ? | U.S. |
| Nintendo | 3DS | March 27 | $250 | North America |
| … | … | … | … | … |
News
![Page 11: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/11.jpg)
Example Applications
• Question Answering / Structured Queries
  – Which companies are releasing new smartphones / new products in Europe this Spring?
  – Alert me anytime a new smartphone is announced in the U.S.
• Data Mining
  – Analyze trends in product releases across different industries
  – Is there a correlation between price and date of release?
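Once releases sit in a table, the structured queries listed above reduce to simple filters. The following is a minimal sketch, assuming a hypothetical `Release` record type and a hypothetical `releases_in` helper; the two rows are the ones shown on the earlier PRODUCT RELEASE slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Release:
    # Field names mirror the PRODUCT RELEASE columns; the class itself is hypothetical.
    company: str
    product: str
    date: str
    price: Optional[str]  # None when the source text gives no price
    region: str


# Rows copied from the PRODUCT RELEASE table on the slides.
RELEASES = [
    Release("Samsung", "Galaxy S5", "April 11", None, "U.S."),
    Release("Nintendo", "3DS", "March 27", "$250", "North America"),
]


def releases_in(region, rows=RELEASES):
    """Return the sorted companies with a release recorded for the given region."""
    return sorted({r.company for r in rows if r.region == region})


print(releases_in("U.S."))  # ['Samsung']
```

An empty result for a region such as Europe only means that nothing was extracted for it, which is why coverage of the extractors matters as much as their precision.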
![Page 12: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/12.jpg)
Knowledge Graphs
Things not strings!
[Figure: a small knowledge graph with three edges]
Course offered at(CSE 5539, Ohio State Univ.)
Instructor(CSE 5539, Alan Ritter)
Located In(Ohio State Univ., Columbus OH)
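The graph on this slide can be held as a list of (subject, relation, object) triples. This is a minimal sketch: the edge directions are an assumption, since the transcript lost the arrows, and `build_index` and `follow` are hypothetical helper names.

```python
from collections import defaultdict

# (subject, relation, object) triples read off the figure; the edge
# directions are an assumption because the transcript lost the arrows.
TRIPLES = [
    ("CSE 5539", "course_offered_at", "Ohio State Univ."),
    ("CSE 5539", "instructor", "Alan Ritter"),
    ("Ohio State Univ.", "located_in", "Columbus OH"),
]


def build_index(triples):
    """Index triples by subject so lookups go through entities, not string search."""
    index = defaultdict(dict)
    for subj, rel, obj in triples:
        index[subj][rel] = obj
    return index


def follow(index, start, *relations):
    """Follow a chain of relations from a start entity; None if any hop is missing."""
    node = start
    for rel in relations:
        node = index.get(node, {}).get(rel)
        if node is None:
            return None
    return node


graph = build_index(TRIPLES)
# Two hops: where is the university that offers CSE 5539 located?
print(follow(graph, "CSE 5539", "course_offered_at", "located_in"))  # Columbus OH
```

Because entities are nodes rather than strings, a question about the course can be answered by walking edges, which is the point of "things not strings".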
![Page 13: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/13.jpg)
Data Sources
![Page 14: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/14.jpg)
Available Data Sources
All of these databases are sparsely populated and out of date.
We need to extract this type of knowledge from text!!!!
![Page 16: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/16.jpg)
Traditional Information Extraction
![Page 17: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/17.jpg)
![Page 18: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/18.jpg)
![Page 19: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/19.jpg)
![Page 20: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/20.jpg)
![Page 21: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/21.jpg)
![Page 22: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/22.jpg)
Traditional Information Extraction
![Page 23: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/23.jpg)
Example Text from MUC-4 (1992) [Cowie and Wilks]
![Page 24: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/24.jpg)
Example Output from MUC-4 (1992) [Cowie and Wilks]
…
![Page 28: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/28.jpg)
Approaches
• Initially: Rule Based
  – Basically just write a bunch of regular expressions
• Machine Learning (Freitag 1998), (Soderland 1999), (Mooney 1999)
  – Annotate training / dev / test documents
  – Train machine learning models
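The rule-based approach can be sketched with a handful of regular expressions applied to the Nintendo tweet from the earlier slides. Every pattern below is an illustrative assumption written for this one sentence, not a rule set from the course or from any MUC system.

```python
import re

TWEET = ("Yess! Yess! Its official Nintendo announced today that they Will "
         "release the Nintendo 3DS in north America march 27 for $250")

MONTHS = (r"(?:january|february|march|april|may|june|july|august|"
          r"september|october|november|december)")

# One hand-written rule per slot. Each rule is deliberately narrow.
RULES = {
    "COMPANY": re.compile(r"\b([A-Z][a-z]+) announced\b"),
    "PRODUCT": re.compile(r"\brelease the (?:[A-Z][a-z]+ )?(\w*\d\w*)"),
    "DATE": re.compile(rf"\b({MONTHS} \d{{1,2}})\b", re.IGNORECASE),
    "PRICE": re.compile(r"(\$\d+(?:\.\d{2})?)"),
    "REGION": re.compile(r"\bin (north america|europe|japan)\b", re.IGNORECASE),
}


def extract(text):
    """Apply every rule; slots whose rule does not fire are left as None."""
    record = {}
    for slot, pattern in RULES.items():
        match = pattern.search(text)
        record[slot] = match.group(1) if match else None
    return record


print(extract(TWEET))
```

These rules fill all five slots for the tweet but none for the Samsung headline ("Samsung Galaxy S5 Coming to All Major U.S. Carriers Beginning April 11th"), which has no "announced", no "release the", and a date written as "11th". That brittleness is the usual argument for moving from hand-written rules to learned models.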
![Page 29: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/29.jpg)
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g. looking for seminar location
[Slide from William Cohen]
![Page 33: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/33.jpg)
A “Naïve Bayes” Sliding Window Model [Freitag 1997]
… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
prefix: $w_{t-m} \dots w_{t-1}$, contents: $w_t \dots w_{t+n}$, suffix: $w_{t+n+1} \dots w_{t+n+m}$
• Estimate Pr(LOCATION|window) using Bayes rule
• Try all “reasonable” windows (vary length, position)
• Assume independence for length, prefix words, suffix words, content words
• Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
• If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
[Slide from William Cohen]
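The recipe on this slide can be sketched in a few dozen lines. This is an illustration under stated assumptions, not Freitag's implementation: the three training announcements are made up, the prefix/suffix width `M` and the window limit `MAX_LEN` are arbitrary, every non-gold window is treated as a negative example, and feature values never seen in training are skipped at prediction time.

```python
import math
from collections import Counter, defaultdict

M = 2        # prefix/suffix width in tokens (assumed)
MAX_LEN = 4  # longest candidate window (assumed)


def windows(tokens):
    """Yield every (start, end) span of 1..MAX_LEN tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + MAX_LEN, len(tokens)) + 1):
            yield start, end


def features(tokens, start, end):
    """Length, prefix, content and suffix features, treated as independent."""
    feats = [("len", end - start)]
    feats += [("pre", w) for w in tokens[max(0, start - M):start]]
    feats += [("in", w) for w in tokens[start:end]]
    feats += [("suf", w) for w in tokens[end:end + M]]
    return feats


class NaiveBayesWindow:
    def fit(self, docs):
        """docs: (tokens, gold_span) pairs; every other window is a negative."""
        self.classes = Counter()
        self.counts = {True: Counter(), False: Counter()}
        self.totals = {True: Counter(), False: Counter()}
        self.vocab = defaultdict(set)
        for tokens, gold in docs:
            for span in windows(tokens):
                label = span == gold
                self.classes[label] += 1
                for kind, value in features(tokens, *span):
                    self.counts[label][kind, value] += 1
                    self.totals[label][kind] += 1
                    self.vocab[kind].add(value)
        return self

    def _log_joint(self, feats, label):
        logp = math.log(self.classes[label] / sum(self.classes.values()))
        for kind, value in feats:
            if value not in self.vocab[kind]:
                continue  # skip values never seen in training (a simplification)
            num = self.counts[label][kind, value] + 1  # add-one smoothing
            den = self.totals[label][kind] + len(self.vocab[kind])
            logp += math.log(num / den)
        return logp

    def prob_location(self, tokens, start, end):
        """Pr(LOCATION | window) via Bayes rule over the two classes."""
        feats = features(tokens, start, end)
        pos, neg = self._log_joint(feats, True), self._log_joint(feats, False)
        top = max(pos, neg)
        return math.exp(pos - top) / (math.exp(pos - top) + math.exp(neg - top))

    def best_window(self, tokens):
        return max(windows(tokens), key=lambda w: self.prob_location(tokens, *w))


# Hypothetical training announcements with gold location spans.
TRAIN = [
    ("Time : 3:30 pm Place : Doherty Hall Rm 1112 Speaker : Tom Mitchell".split(), (6, 10)),
    ("Place : Newell Hall Rm 3305 Time : 2:00 pm".split(), (2, 6)),
    ("Speaker : Jaime Carbonell Place : Porter Hall Rm 100".split(), (6, 10)),
]

model = NaiveBayesWindow().fit(TRAIN)
test = "Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
start, end = model.best_window(test)
print(" ".join(test[start:end]), round(model.prob_location(test, start, end), 3))
```

In practice the extraction threshold would be tuned on development data. With only three training documents the posterior stays modest even for the correct window, so the number printed here should not be read as a meaningful confidence.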
![Page 34: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/34.jpg)
“Naïve Bayes” Sliding Window Results
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
Domain: CMU UseNet Seminar Announcements
| Field | F1 |
|---|---|
| Person Name | 30% |
| Location | 61% |
| Start Time | 98% |
[Slide from William Cohen]
![Page 35: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/35.jpg)
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM (states: person name, location name, background):
Find the most likely state sequence (Viterbi): $\hat{s} = \arg\max_{s} P(s, o)$
Any words said to be generated by the designated “person name” state are extracted as a person name:
Person name: Pedro Domingos
[Slide from William Cohen]
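The Viterbi step can be sketched on the example sentence from this slide. All probabilities below are made-up illustrative values rather than trained parameters, and the `UNSEEN` floor is a crude stand-in for proper smoothing, so the emission tables are not normalized.

```python
import math

STATES = ["background", "person", "location"]

# Made-up illustrative parameters, not estimated from any data.
START = {"background": 0.8, "person": 0.1, "location": 0.1}
TRANS = {
    "background": {"background": 0.8, "person": 0.1, "location": 0.1},
    "person": {"background": 0.3, "person": 0.6, "location": 0.1},
    "location": {"background": 0.3, "person": 0.1, "location": 0.6},
}
EMIT = {
    "background": {"yesterday": 0.2, "spoke": 0.2, "this": 0.2,
                   "example": 0.2, "sentence": 0.15},
    "person": {"pedro": 0.45, "domingos": 0.45},
    "location": {"seattle": 0.9},
}
UNSEEN = 0.01  # floor for words a state has no entry for


def emit(state, word):
    """Log emission probability, after lowercasing and stripping a final period."""
    return math.log(EMIT[state].get(word.lower().strip("."), UNSEEN))


def viterbi(words):
    """Return the state sequence s maximizing P(s, o) for the observed words o."""
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: math.log(START[s]) + emit(s, words[0]) for s in STATES}]
    back = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + math.log(TRANS[p][s]))
            scores[s] = best[-1][prev] + math.log(TRANS[prev][s]) + emit(s, word)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


words = "Yesterday Pedro Domingos spoke this example sentence.".split()
path = viterbi(words)
person = " ".join(w for w, s in zip(words, path) if s == "person")
print(person)  # Pedro Domingos
```

The extraction rule is the one stated on the slide: whatever the decoded path assigns to the person-name state is emitted as a person name.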
![Page 36: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/36.jpg)
Finite State Models
[Figure: Naïve Bayes -> (sequence) -> HMMs -> (general graphs) -> generative directed models. Each has a conditional counterpart: Logistic Regression -> (sequence) -> Linear-chain CRFs -> (general graphs) -> General CRFs.]
![Page 37: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/37.jpg)
Various Annotated Datasets for Event / Relation Extraction
• ACE
  – Automatic Content Extraction
  – Newswire
  – Successor to MUC
![Page 38: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/38.jpg)
Various Annotated Datasets for Event / Relation Extraction
• GENIA
  – Medline abstracts
  – Similar extraction task in the Biomedical domain
![Page 39: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/39.jpg)
Schemas -> Triples
“Yess! Yess! Its official Nintendo announced today that they Will release the Nintendo 3DS in north America march 27 for $250”

PRODUCT RELEASE

| COMPANY | PRODUCT | DATE | PRICE | REGION |
|---|---|---|---|---|
| Nintendo | 3DS | March 27 | $250 | North America |

Relation Extraction:
Manufacturer(3DS, Nintendo)
ReleaseDate(3DS, March 27)
Price(3DS, $250)
…
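The step from a filled schema to binary relations can be sketched as a small mapping. The function name `record_to_triples` is hypothetical, and the `Region` relation name is an assumption: the slide lists only Manufacturer, ReleaseDate and Price before trailing off.

```python
# Slot -> relation name. "Region" is an assumed name; the slide elides it.
RELATION_NAMES = {
    "COMPANY": "Manufacturer",
    "DATE": "ReleaseDate",
    "PRICE": "Price",
    "REGION": "Region",
}


def record_to_triples(record, key="PRODUCT"):
    """Turn one schema row into (relation, subject, object) triples keyed on the product."""
    subject = record[key]
    return [
        (relation, subject, record[slot])
        for slot, relation in RELATION_NAMES.items()
        if record.get(slot) is not None  # unknown slots produce no triple
    ]


row = {"COMPANY": "Nintendo", "PRODUCT": "3DS", "DATE": "March 27",
       "PRICE": "$250", "REGION": "North America"}

for relation, subj, obj in record_to_triples(row):
    print(f"{relation}({subj}, {obj})")
```

Keying every triple on the product means a row with a missing slot, such as the Samsung row with no price, simply yields fewer triples instead of a placeholder value.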
![Page 40: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/40.jpg)
Open Information Extraction (Banko et al. 2007)
![Page 41: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/41.jpg)
Demo (TextRunner)
• http://openie.allenai.org/
![Page 42: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/42.jpg)
Distant (weak) Supervision for Relation Extraction, e.g. [Mintz et al. 2009]

| Person | Birth Location |
|---|---|
| Barack Obama | Honolulu |
| Mitt Romney | Detroit |
| Albert Einstein | Ulm |
| Nikola Tesla | Smiljan |
| … | … |

“Barack Obama was born on August 4, 1961 at … in the city of Honolulu ...”
“Birth notices for Barack Obama were published in the Honolulu Advertiser…”
“Born in Honolulu, Barack Obama went on to become…”
…
(Barack Obama, Honolulu)
(Mitt Romney, Detroit)
(Albert Einstein, Ulm)
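The labeling heuristic behind distant supervision can be sketched as plain string matching: any sentence that mentions both members of a known pair becomes a positive training example for the relation. This is a simplified illustration, not the feature pipeline of Mintz et al.; the function name `label_sentences` and the last sentence in the list are hypothetical.

```python
# Known (person, birth location) pairs from the table on the slide.
KB = {
    ("Barack Obama", "Honolulu"),
    ("Mitt Romney", "Detroit"),
    ("Albert Einstein", "Ulm"),
    ("Nikola Tesla", "Smiljan"),
}

SENTENCES = [
    "Barack Obama was born on August 4, 1961 at … in the city of Honolulu ...",
    "Birth notices for Barack Obama were published in the Honolulu Advertiser…",
    "Born in Honolulu, Barack Obama went on to become…",
    "Mitt Romney campaigned in Ohio.",  # hypothetical sentence matching no pair
]


def label_sentences(sentences, kb):
    """Return (person, place, sentence) for every sentence mentioning a known pair."""
    examples = []
    for sentence in sentences:
        for person, place in sorted(kb):
            if person in sentence and place in sentence:
                examples.append((person, place, sentence))
    return examples


for person, place, sentence in label_sentences(SENTENCES, KB):
    print(f"BirthLocation({person}, {place}) <- {sentence}")
```

The second sentence is labeled positive even though "Honolulu Advertiser" is a newspaper, not a statement about a birthplace. That kind of noisy label is the price of getting training data without manual annotation.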
![Page 43: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/43.jpg)
Demo (NELL)
• http://rtw.ml.cmu.edu/rtw/kbbrowser/
![Page 44: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/44.jpg)
Demo (Literome)
• http://literome.azurewebsites.net/
![Page 45: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/45.jpg)
Knowledge Base Population Subtasks
• Entity Recognition/Classification/Linking
• Relation Extraction
• Event Extraction
• Knowledge Base Inference
![Page 46: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/46.jpg)
Applications
• Google knowledge graph
• Facebook graph search
• Biomedical knowledge bases
• -> Your application domain here
  – Geoscience knowledge graph?
  – Patent knowledge graph?
  – Cybersecurity knowledge graph?
![Page 47: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/47.jpg)
Research Groups at Other Places
![Page 48: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/48.jpg)
Why learn about this stuff?
![Page 49: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/49.jpg)
Paper Selection Form! (please fill out before next class)
https://goo.gl/AghZ1f
![Page 50: Instructor: Alan Ritter CSE 5539: Web Information Extraction.](https://reader036.fdocuments.net/reader036/viewer/2022062421/56649e5c5503460f94b54d48/html5/thumbnails/50.jpg)
Administrative Details
• Course Webpage
  – http://aritter.github.io/courses/5539_fall15.html