Big Text: from Language to Knowledge Gerhard Weikum Max Planck Institute for Informatics & Saarland...
-
Upload
kale-breckenridge -
Category
Documents
-
view
229 -
download
1
Transcript of Big Text: from Language to Knowledge Gerhard Weikum Max Planck Institute for Informatics & Saarland...
Big Text:from Language to Knowledge
Gerhard Weikum Max Planck Institute for Informatics & Saarland University Saarbrücken, Germanyhttp://www.mpi-inf.mpg.de/~weikum/
From Natural-Language Text to Knowledge
WebContents
Knowledge
knowledgeacquisition
intelligentinterpretation
more knowledge, analytics, insight
Cyc
TextRunner/ReVerb WikiTaxonomy/
WikiNet
SUMO
ConceptNet 5
BabelNet
ReadTheWeb
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png
Web of Data & Knowledge (Linked Open Data)> 50 Bio. subject-predicate-object triples from > 1000 sources
• 10M entities in 350K classes• 120M facts for 100 relations• 100 languages• 95% accuracy
• 4M entities in 250 classes• 500M facts for 6000 properties• live updates
• 40M entities in 15000 topics• 1B facts for 4000 properties• core of Google Knowledge Graph
Web of Data & Knowledge
• 600M entities in 15000 topics• 20B facts
> 50 Bio. subject-predicate-object triples from > 1000 sources
Web of Data & Knowledge > 50 Bio. subject-predicate-object triples from > 1000 sources
Bob_Dylan type songwriter Bob_Dylan type civil_rights_activist songwriter subclassOf artist Bob_Dylan composed Hurricane Hurricane isAbout Rubin_Carter Steve_Jobs marriedTo Sara_Lownds validDuring [Sep-1965, June-1977] Bob_Dylan knownAs „voice of a generation“ Steve_Jobs „was big fan of“ Bob_Dylan Bob_Dylan „briefly dated“ Joan_Baez
taxonomic knowledge
factual knowledge
temporal knowledge
terminological knowledge
evidence & belief knowledge
Knowledge for Intelligent Applications
Enabling technology for:• disambiguation in written & spoken natural language• deep reasoning (e.g. QA to win quiz game)• machine reading (e.g. to summarize book or corpus)• semantic search in terms of entities&relations (not keywords&pages)• entity-level linkage for Big Data & Big Text analytics
Big Text Analytics: Who Covered Whom?1000‘s of Databases100 Mio‘s of Web Tables100 Bio‘s of Web & Social Media Pages
Hannes Wader Elvis Presley Wooden Heart Elvis Presley F. Silcher Muss i denn Tote Hosen Hannes Wader Heute hier morgen dort . . . . . . . . . . . . . . .
Musician Original Title
in different language, country, key, …with more sales, awards, media buzz, ….....
Big Text Analytics: Who Covered Whom?1000‘s of Databases100 Mio‘s of Web Tables100 Bio‘s of Web & Social Media Pages
in different language, country, key, …with more sales, awards, media buzz, ….....
Hannes Wader Wooden Heart Hannes Wader Heute Hier Tote Hosen Morgen Dort
Musician PerformedTitle
Elvis Wood Heart F. Silcher Muss i denn Hans E. Wader Heute Hier . . . . . . . . . .
Musician CreatedTitle
Name PlaceU2 Dublin Dagstuhl Wadern
Name Group Bono U2 Campino Tote Hosen
Wad
ern
Big Data & Big Text:challengeVariety & Veracity
Big Data & Big Text Analytics Who covered which other singer?Who influenced which other musicians?
Entertainment:
Drugs (combinations) and their side effectsHealth:
Politicians‘ positions on controversial topics and their involvement with industry
Politics:
Customer opinions on small-company products,gathered from social media
Business:
• Identify relevant contents sources• Identify entities of interest & their relationships• Position in time & space• Group and aggregate• Find insightful patterns & predict trends
General Design Pattern:
9
Trends in society, cultural factors, etc.Culturomics:
Named Entity Recognition & Disambiguation
Hurricane,about Carter,is on Bob‘sDesire.It is played inthe film withWashington.
(NERD)
contextual similarity:mention vs. entity(bag-of-words, language model)
prior popularityof name-entity pairs
Named Entity Recognition & Disambiguation
Hurricane,about Carter,is on Bob‘sDesire.It is played inthe film withWashington.
(NERD)Coherence of entity pairs:• semantic relationships• shared types (categories)• overlap of Wikipedia links
Named Entity Recognition & Disambiguation
Hurricane,about Carter,is on Bob‘sDesire.It is played inthe film withWashington.
racism protest songboxing championwrong conviction
Grammy Award winnerprotest song writerfilm music composercivil rights advocate
Academy Award winnerAfrican-American actorCry for Freedom filmHurricane film
racism victimmiddleweight boxingnickname Hurricanefalsely convicted
Coherence: (partial) overlap of (statistically weighted)entity-specific keyphrases
Named Entity Recognition & Disambiguation
Hurricane,about Carter,is on Bob‘sDesire.It is played inthe film withWashington.
NED algorithms compute mention-to-entity mappingover weighted graph of candidatesby popularity & similarity & coherence
KB provides building blocks:• name-entity dictionary,• relationships, types,• text descriptions, keyphrases,• statistics for weights
Joint Mapping
• Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB• Compute high-likelihood mapping (ML or MAP) or dense subgraph (with high total edge weight) such that: each m is connected to exactly one e (or at most one e)
90
30
5100
100
50
20 50
90
80 90
30
10 10
20
30
30
16
m1
m2
m3
m4
e1
e2
e3
e4
e5
e6
Coherence Graph Algorithm
90
30
5100
100
50 50
90
80 90
30
10 20
10
20
30
30140
180
50
470
145
230
D5 Overview May 14, 2013
17
• Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e)• Approx. algorithms (greedy, randomized, …), hash sketches, …• 82% precision on CoNLL‘03 benchmark• Open-source software & online service AIDA
http://www.mpi-inf.mpg.de/yago-naga/aida/
m1
m2
m3
m4
e1
e2
e3
e4
e5
e6
Entity Matching in Structured Data
entity linkage:• key to data integration• long-standing problem, very difficult, unsolved
H.L. Dunn: Record Linkage. American Journal of Public Health 36 (12), 1946H.B. Newcombe et al.: Automatic Linkage of Vital Records. Science 130 (3381), 1959
Hurricane 1975Forever Young 1972Like a Hurricane 1975……….
Hurricane DylanLike a Hurricane YoungHurricane Everette.
Dylan Bob 1941Thomas Dylan Swansea 1914Young Brigham 1801Young Neil Toronto 1945 Denny Sandy London 1947
Hurricane Katrina New Orleans 2005Hurricane Sandy New York 2012……….
Variety & Veracity !
?
Linking Big Data & Big Text
Musician Song Year Listeners Charts . . .
Bob Dylan Death is not … 1988 14 218 Bob Dylan Don‘t think twice 1962 319 588 Bob Dylan Make you feel … 1997 72 468 Nick Cave Death is not … 1996 85 333 Kronos Q. Don‘t think twice 2012 679 Adele Make you feel … 2008 559 715 H. Wader Heute hier ... 1972 2 630 Tote Hosen Heute hier … 2012 6 432 . . . . . . . . . .
Machine Reading of Scholarly Papershttps://gate.d5.mpi-inf.mpg.de/knowlife/
Machine Reading of Health Forumshttps://gate.d5.mpi-inf.mpg.de/knowlife/
Big Data & Text Analytics:Side Effects of Drug Combinations
http://dailymed.nlm.nih.gov http://www.patient.co.uk
Deeper insight from bothexpert data & social media:• actual side effects of drugs• … and drug combinations• risk factors and complications of (wide-spread) diseases• alternative therapies• aggregation & comparison by age, gender, life style, etc.
Structured Expert Data
SocialMedia
Machine Reading: from Names and Phrases to Entities, Classes, and Relations
Rome(Italy)
LazioRoma
ASRoma
MaestroCard
EnnioMorricone
MDMA
l‘Estasidell‘Oro
LeonardBernstein
JackMa
Yo-YoMa
cover of
story about
born in
plays for filmmusic
goal infootball
playssport
playsmusic
westernmovie
WesternDigital
Ma played his version of the Ecstasy.
The Maestro from Rome wrote scores for westerns.
wrote scoresr: composed
Disambiguation for Entities, Classes & Relations
scores for
westerns
from
Rome
Maestro
Combinatorial Optimization by ILP (with type constraints etc.)
e: Rome (Italy)
e: Lazio Roma
e: MaestroCard
e: Ennio Morriconec: conductor
c:soundtrackr: soundtrackFor
r: shootsGoalFor
r: bornIn
r: actedIn
c: western movie
e: Western Digital we
igh
ted
ed
ge
s (c
oh
ere
nc
e, s
imila
rity
, etc
.)
(M. Yahya et al.: EMNLP’12, CIKM‘13)
c: musician
r: giveExam
ILP optimizers like Gurobisolve this in seconds
search
Internet
publish & recommend
Entity Linking: Privacy at Stake
Levothroid shakingAddison’s disease
………Nive concert
Greenland singersSomalia elections
Steve Biko
searchengine
Zoe
female 29y Jamame
social network
Nive NielsenCryFreedom
discuss & seek help
online forum
female 25-30 Somalia
Synthroid tremble……….Addison disorder……….
social network
Synthroid tremble……….Addison disorder……….
search
Internet
Privacy Adversaries Linkability Threats: Weak cues: profiles, friends, etc.
Semantic cues: health, taste, queries
Statistical cues: correlations
discuss & seek help
publish & recommend
Levothroid shakingAddison’s disease
………Nive concert
Greenland singersSomalia elections
Steve BikoNive Nielsen
female 25-30 Somalia
female 29y Jamame
online forum
searchengine
CryFreedom
social network
search
social network
searchengine
online forum
Synthroid tremble……….Addison disorder……….
Internet
Levothroid shakingAddison’s disease
………Nive concert
Greenland singersSomalia elections
female 25-30 Somalia
Goal: Automated Privacy Advisor
discuss & seek help
publish & recommend
Nive NielsenCryFreedom
female 29y Jamame
PrivacyAdviser (PA):Software tool that analyses risk alerts user advises user
• explains consequences
• recommends policy changes
Your queries may lead to linking your identiesin Facebook and patient.co.uk !
………….Would you like to use an anonymization tool
for your search requests?………..
ERC Project imPACT(Backes/Druschel/Majumdar/Weikum)
Big Text & Big Data
Big Text & NERD: valuable content about entities
lifted towards knowledge & analytic insight
Machine Reading: discover and interpret names & phrases as entities, classes, relations, spatio-temporal modifiers, sentiments, beliefs, ….
Big Data: interlink natural-language text, social media,structured data & knowledge bases, images, videos
and help users coping with privacy risks