Scalable Information
Extraction and Integration
Eugene Agichtein
Microsoft Research �
Emory U
niversity
SunitaSarawagi
IIT Bombay
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
The Value of Text Data
�“U
nstructured”text data is the primary source of
human-generated inform
ation
�Citeseer, comparison shopping, PIM
systems, web search,
data warehousing
�Managing and utilizing text: inform
ation extraction
and integration
�Scalability: a bottleneck for deployment
�Relevance to data m
ining community
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Example: Answering Queries Over Text
For years, Microsoft
CorporationCEOBill
Gateswas against open
source. But today he
appears to have changed
his mind. "We can be
open source. We love the
concept of shared
source," said Bill Veghte,
a MicrosoftVP. "That's a
super-important shift for
us in terms of code
access.“
Richard Stallman,
founderof the Free
Software Foundation,
countered saying…
Name Title Organization
Bill Gates
CEO
Microsoft
Bill Veghte
VP
Microsoft
Richard Stallman
Founder
Free Soft..
PEOPLE
Select Name
From PEOPLE
Where Organization = ‘Microsoft’
Bill Gates
Bill Veghte
(from W
illiam Cohen’s IE tutorial, 2003)
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Managing Unstructured Text Data
�Information Extraction from text
�Represent information in text data in a structured form
�Identify instances of entities and relationships
�Main approaches and architectures
�Scaling up to large collections of documents (e.g., web)
�Information Integration
�Combine/resolve/clean information about entities
�Entity Resolution & Deduplication
�Scaling Up: Batch m
ode/algorithmic issues
�Connections between Inform
ation Extraction and Integration
�Coreference Resolution
�Deriving values from m
ultiple sources
�(W
eb) Question Answering
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Part I: Tutorial Outline
�Overview of Information Extraction
�Entity tagging
�Relation extraction
�Scaling up Inform
ation Extraction
�Focus on scaling up to large collections (where
data m
ining and M
L techniques shine)
�Other dim
ensions of scalability
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Information Extraction Components
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Information Extraction Tasks
�Extracting entities and relations: this tutorial
�Entities: named (e.g., Person) and generic (e.g., disease name)
�Relations: entities related in a predefined way (e.g., Location of a
Disease outbreak)
�Common extraction subtasks:
�Preprocessing: sentence chunking, syntactic parsing,
morphological analysis
�Creating rules or extraction patterns: manual, m
achine learning,
and hybrid
�Applying extraction patterns to extract new inform
ation
�Postprocessing and complex extraction: not covered
�Co-reference resolution
�Combining Relations into Events and Facts
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Related Tutorials
�Previous inform
ation extraction tutorials: consult for more
details
�R. Feldman, Inform
ation Extraction –
Theory and Practice,
ICML 2006
http://www.cs.biu.ac.il/~feldman/icml_tutorial.htm
l
�W. Cohen, A. McCallu
m, Inform
ation Extraction and
Integration: an O
verview, KDD 2003
http://www.cs.cmu.edu/~wcohen/ie-survey.ppt
�A. Doan, R. Ramakrishnan, S. Vaithyanathan, Managing
Inform
ation Extraction, SIG
MOD’06
�N. Koudas, D. Srivastava, S. Sarawagi, Record Linkage:
Sim
ilarity M
easures and Algorithms, SIG
MOD 2006
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Entity Tagging
�Identifying m
entions of entities (e.g., person names, locations,
companies) in text
�MUC (1997): Person, Location, Organization, Date/Tim
e/Currency
�ACE (2005): m
ore than 100 m
ore specific types
�Hand-coded vs. Machine Learning approaches
�Best approach depends on entity type and domain:
�Closed class (e.g., geographical locations, disease names, gene &
protein names): hand coded + dictionaries
�Syntactic (e.g., phone numbers, zipcodes): regexes
�Others (e.g., person and company names): m
ixture of context,
syntactic features, dictionaries, heuristics, etc.
�“Alm
ost solved”for common/typical entity types
�Non-syntactic entities computationally expensive
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Example: Extracting Entities from Text
�Useful for data warehousing, data cleaning, web
data integration
14089 Whispering Pines Nobel Drive San Diego CA 92122
House
number
Building
Road
City
Zip
State
Address
Citation
Year
2002
S4
Conference
Proc. of ACM SIG
MOD
S3
Title
Combining Fuzzy Inform
ation from M
ultiple Systems
S2
Author
Ronald Fagin
S1
Label(si)
Sequence
Segment(si)
Ronald Fag
in, Combining Fuzzy Inform
ation from M
ultiple
System
s, Proc. of ACM SIG
MOD, 2002
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Hand-Coded Methods
�Easy to construct in m
any cases
�e.g., to recognize prices, phone numbers, zip codes,
conference names, etc.
�Easier to debug & m
aintain
�Especially if written in a “high-level”language (as is
usually the case): e.g.,
�Easier to incorporate / reuse domain knowledge
�Can be quite labor intensive to write
ContactPattern RegularExpression(Email.body,”can be reached at”)
PersonPhone
Precedes(Person
Precedes(ContactPattern, Phone, D),
D)
[From Avatar]
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Example of Hand-Coded Entity Tagger
[Ramakrishnan. G
, 2005, Slides from Doan et al., SIGMOD 2006]
Rule 1
This rule will find person names with a salutation (e.g. Dr. Laura
Haas) and two capitalized words
<token> INITIAL</token>
<token>DOT </token>
<token>CAPSWORD</token>
<token>CAPSWORD</token>
Rule 2
This rule will find person names where two capitalized words
are present in a Person dictionary
<token>PERSONDICT, CAPSWORD </token>
<token>PERSONDICT,CAPSWORD</token>
CAPSWORD : W
ord starting with uppercase, second letter lowercase
E.g., DeWitt will satisfy it (D
EWITT will not)
\p{U
pper}\p{Lower}[\p{Alpha}]{1,25}
DOT
: The character ‘.’
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Hand Coded Rule Example: Conference Name
# These are subordinate patterns
$wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|four
teenth|fifteenth)";
my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
my $confTypes="(?:Conference|Workshop|Symposium)";
my $words="(?:[A-Z]\\w+\\s*)"; # A word starting with a capital letter and ending with 0 or more spaces
my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)"; # .e.g "International Conference ...' or the conference
name for workshops (e.g. "VLDB Workshop ...")
my $connectors="(?:on|of)";
my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))"; # Conference abbreviations like "(SIGMOD'06)"
# The actual pattern we search for. A typical conference name this pattern will find is
# "3rd International Conference on Blah BlahBlah(ICBBB-05)"
my
$fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\ s+)?$abb
reviations?)(?:\\n|\\r|\\.|<)";
############################## ################################
# Given a <dbworldMessage>, look for the conference pattern
##############################################################
lookForPattern($dbworldMessage, $fullNamePattern);
#########################################################
# In a given <file>, look for occurrences of <pattern>
# <pattern> is a regular expression
#########################################################
sub lookForPattern{
my ($file,$pattern) = @_;
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Gene & Protein Tagger: AliBaba
�Extract gene names from PubMedabstracts
�Use C
lassifier (Support Vector Machine -SVM)
Vector
Generator
SVMlight
Tagged
Text
Post
Processor
Tokenized
Training
Corpus
SVM Model
driven
Tagger
New
Text
Vector
Generator
�Corpus of 7500 sentences
�140.000 non-gene words
�60.000 gene names
�SVM
lighton different feature sets
�Dictionary compile
d from G
enbank, HUGO, MGD, YDB
�Post-processing for compound gene names
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Some Hand Coded Entity Taggers
�FRUMP [DeJong82]
�CIRCUS / AutoSlog[Riloff93]
�SRI FASTUS [Appelt, 1996]
�MITRE Alembic
(available for use)
�Alias-I LingPipe(available for use)
�OSMX [Embley, 2005]
�DBLife [Doan et al, 2006]
�Avatar [Jayram
et al, 2006]
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Machine Learning Methods
�Can work well when training data is easy to construct and is
plentiful
�Can capture complex patterns that are hard to encode w
ith
hand-crafted rules
�e.g., determ
ine w
hether a review is positive or negative
�extract long complex gene names
The human T cell leukemia lymphotropic virus type 1 Tax protein
represses M
yoD-dependent transcription by inhibiting MyoD-
binding to the KIX domain of p300.“
[From AliB
aba]
�Can be labor intensive to construct training data
�Question: how m
uch training data is sufficient?
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Popular Machine Learning Methods for IE
�Naive Bayes
�SRV [Freitag-98], Inductive Logic Programming
�Rapier [Califf& M
ooney-97]
�Hidden M
arkov M
odels [Leek, 1997]
�Maxim
um Entropy M
arkov M
odels [McCallu
m et al, 2000]
�Conditional Random Fields [Lafferty et al, 2000]
�Im
plementations available:
�Mallet (Andrew M
cCallum)
�crf.sourceforge.net (SunitaSarawagi)
�MinorThirdminorthird.sourceforge.net (W
illiam Cohen)
For details: [Feldman, 2006 and C
ohen, 2004]
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Example of State-based ML Method
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Extracted Entities: Resolving Duplicates
Document 1:The Justice Department has officially ended its inquiry into theassassinations
ofJohn F. Kennedyand Martin Luther King Jr., finding ``no persuasive evidence'' to
support conspiracy theories, according to department documents. The House Assassinations
Committee concluded in 1978 thatKennedywas ``probably'' assassinated as the result of a
conspiracy involving a second gunman, a finding that broke from the Warren Commission
's belief that Lee Harvey Oswald acted alone in Dallason Nov. 22, 1963.
Document 2: In 1953, MassachusettsSen. John F. Kennedymarried Jacqueline Lee
Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedy
confronted the issue of his Roman Catholic faith by telling a Protestant group in Houston,
``I do not speak for my church on public matters, and the churchdoes not speak for me.'‘
Document 3:David Kennedywas born in Leicester, England in 1959.…Kennedyco-
edited The New Poetry (BloodaxeBooks 1993), and is the author of New Relations: The
Refashioning Of British Poetry 1980-1994 (Seren1996).
[From Li, M
orie, & Roth, AI Magazine, 2005]
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Important Problem, Addressed in Part II
�Appears in numerous real-world contexts
�Plagues m
any applications
�Citeseer, DBLife, AliB
aba, Rexa, etc.
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Outline
�Overview of Inform
ation Extraction
�Entity tagging
�Relation extraction
�Scaling up Inform
ation Extraction
�Focus on scaling up to large collections (where
data m
ining and M
L techniques shine)
�Other dim
ensions of scalability
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Relation Extraction: D
isease Outbreaks
�Extract structured relations from text
May 19 1995, Atlanta --The Centers for Disease Control
and Prevention, which is in the front line of the world's
response to the deadly Ebola epidemic in Zaire ,
is finding itself hard pressed to cope with the crisis…
Zaire
Ebola
May 1995
U.S.
Pneumonia
Feb. 1995
July 1995
Jan. 1995
Date
Location
Disease Name
U.K.
Mad Cow D
isease
Ethiopia
Malaria
Information
Extraction System
(e.g., NYU’s Proteus)
Disease O
utbreaks in The New York Tim
es
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Example: Protein Interactions
„Weshowthat CBF-A and CBF-C interact
witheachother to form a CBF-A-CBF-C complex
and that CBF-B doesnotinteractwith CBF-A or
CBF-C individuallybutthatitassociateswiththe
CBF-A-CBF-C complex.“
CBF-A
CBF-C
CBF-B
CBF-A-CBF-C complex
interact
complex
associates
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Relation Extraction
�Typically require Entity Tagging as preprocessing
�Knowledge Engineering
�Rules defined over lexical items
�“<company> located in <location>”
�Rules defined over parsed text
�“((O
bj<company>) (Verb located) (*) (Subj<location>))”
�Proteus, GATE, …
�Machine Learning-based
�Learn rules/patterns from examples
Dan Roth 2005, Cardie
2006, Mooney 2005, …
�Partially-supervised: bootstrap from “seed”examples
Agichtein & G
ravano2000, Etzioniet al., 2004, …
�Recently, hybrid m
odels [Feldman2004, 2006]
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Example Extraction Rule [NYU Proteus]
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Example Extraction Patterns:
Snowball [AG2000]
{<’s 0.7> <in 0.7>
<headquarters 0.7>}
ORGANIZATION
LOCATION
{<-0.75>
<based 0.75>}
ORGANIZATION
LOCATION
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Accuracy of Information Extraction
�Errors cascade (error in entity tag �
error in relation
extraction)
�This estimate is optimistic:
�Holds for well-established tasks
�Many specific/novel IE tasks exhibit lower accuracy
[Feldman, ICML 2006 tutorial]
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Outline
�Overview of Inform
ation Extraction
�Entity tagging
�Relation extraction
�Scaling up Information Extraction
�Focus on scaling up to large collections (where
data m
ining and M
L techniques shine)
�Other dim
ensions of scalability
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Dimensions of Scalability
�Efficiency/corpus size
�Years to process a large colle
ctions (centuries for Web)
�Heterogeneity/diversity of inform
ation sources
�Requires m
any rules (expensive to apply)
�Many sources/conventions (expensive to m
aintain rules)
�Accessing required documents
�Hidden W
eb databases are not crawlable
�Number of Extraction Tasks (not covered)
�Many patterns/rules to develop and m
aintain
�Open research area
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Scaling Up Information Extraction
�Scan-based extraction
�Classification/filtering to avoid processing documents
�Sharing common tags/annotations
�General (keyword) index-based techniques
�QXtract, KnowItAll
�Specialized indexes
�BE/KnowItNow, Linguist’s Search Engine
�Parallelization/Adaptive Processing
�IBM W
ebFountain, Google’s M
ap/Reduce
�Application: Question Answering
�AskMSR, Arranea, Mulder
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Scan
…
Output Tokens
Extraction
System
Text Database
3.Extract output
tokens
2.Process
documents
1.Retrieve docs
from database
��Scan
Scanretrieves and processes documents
sequentially (until reaching target recall)
Execution time = |Retrieved Docs| ·(R + P)
Tim
e for retrieving
a document
Tim
e for processing
a document
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Efficient Scanning for Information
Extraction
�80/20 rule: use few sim
ple rules to capture
majority of the cases [PRH2004]
�Train a classifier to discard irrelevant
documents without processing [GHY2002]
�Share base annotations (entity tags) across
multiple tasks
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Filtered Scan
…
Output Tokens
Extraction
System
Text Database
4.Extract output
tokens
3.Process
documents
1.Retrieve docs
from database
��Scan
Scanretrieves and processes all documents(until reaching target recall)
��Filtered Scan
Filtered Scanuses a classifier to identify and process only promising documents
(e.g., the Sports section of NYT is unlikely to describe diseaseoutbreaks)
Execution time = |Retrieved Docs| * ( R + F + P)
Tim
e for retrieving
a document
Tim
e for filtering
a document
Classifier
2.Filter documents
Tim
e for processing
a document
Classifier selectivity(σ≤1)
σ
filtered
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Exploiting Keyword and Phrase Indexes
�Generate queries to retrieve only relevant documents
�Data m
ining problem!
�Some m
ethods in literature:
�Traversing Q
uery G
raphs [AIG
2003]
�Iteratively refine queries [AG2003]
�Iteratively partition document space [Etzioniet al., WWW 2004]
�Case studies: QXtract, KnowItAll
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Simple Strategy: Iterative Set Expansion
…
Output Tokens
Extraction
System
Text Database
3.Extract tokens
from docs
2.Process retrieved
documents
1.Query
database with
seed tokens
Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q
Tim
e for retrieving
a document
Tim
e for answering
a query
Tim
e for processing
a document
Query
Generation
4.Augment seed
tokens with
new tokens
(e.g., [Ebola AND Zaire])
(e.g., <Malaria, Ethiopia>)
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Querying Graph
�The querying graph is a
bipartite graph, containing
tokens and documents
�Each token (transform
ed to a
keyword
query) retrieves
documents
�Documents contain
tokens
Tokens
Documents
t 1 t 2 t 3 t 4 t 5
d1
d2
d3
d4 d5
<SARS, China>
<Ebola, Zaire>
<Malaria, Ethiopia>
<Cholera, Sudan>
<H5N1, Vietnam>
[AIG
2003]
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Recall Limit: ReachabilityGraph
t 1retrievesdocument d1
that containst 2
t 1 t 2t 3 t 4
t 5
Tokens
Documents
t 1 t 2 t 3 t 4 t 5
d1
d2
d3
d4 d5
Upper recall limit: determ
ined by the size
of the biggest connected component
Reachability
Graph
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Reachability Graph for DiseaseOutbreaks
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Connected Components Visualization
DiseaseOutbreaks,
New York Times 1995
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Getting Around Reachability Limit
�KnowItAll:
�Add keywords to partition documents into
retrievable disjoint sets
�Submit queries with parts of extracted instances
�QXtract
�General queries with m
any m
atching documents
�Assumes m
any documents retrievable per query
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
QXtract [AG2003]
1.Get docu
men
t sample
with
“likely neg
ative”
and “likely
positive”
exam
ples.
2.Lab
el sam
ple docu
men
ts using
inform
ation extraction system
as “oracle.”
3.Train classifiers
to “reco
gnize”
useful docu
men
ts.
4.Gen
erate queriesfrom classifier
model/rules.
Query Generation
Inform
ation Extraction
Text Database
Search Engine
??
??
??
??
++
++
--
--
Seed Sampling
Classifier Training
Queries
tuple1
tuple2
tuple3
tuple4
tuple5
++
++
--
--
User-Provided Seed Tuples
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
KnowItAllArchitecture
�System W
ork Flow
Extractor
Search Engine Interface
Assessor
Database
Web Pages
Rule
Rule template
NP1 “such as”NPList2
& head(N
P1) = plural(name(C
lass1))
&properN
oun(head(each(N
PList2)))
=>
instanceOf(Class1,head(each(N
PList2)))
NP1 “such as”NPList2
& head(N
P1) = “countries”
&properN
oun(head(each(N
PList2)))
=>
instanceOf(Country,head(each(N
PList2)))
Keywords: “countries such as”
Slides: [ZhengShao, UIUC]
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
KnowItAllArchitecture (Cont.)
�System W
ork Flow
Extractor
Search Engine Interface
Assessor
Database
Web Pages
Rule
Extracted Inform
ation
Knowledge
the United Kingdom and Canada
India
North Korea, Iran, India and Pakistan
Japan
Iraq, Italy and Spain
…the United Kingdom
Canada
India
North Korea
Iran
…
Discriminator Phrase
Country AND X
“Countries such as X”
Country AND the United Kingdom
Countries such as the United Kingdom
Frequency
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Using Generic Indexes: Summary
�Order of magnitude scale-up in corpus size
�Indexes are approxim
ate (queries not precise)
�Require m
any documents to retrieve
�Can we do better?
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Index Structures for Information Extraction
�Bindings Engine [CE2005]
�Indexes of entities: [CGHX2006], [IBM Avatar]
�Other systems (not covered)
�Linguist’s search engine (P. Resniket al.): indexes
syntactic structures
�FREE: Indexing regular expressions: J. Cho et al.
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Bindings Engine (BE) [Slides: Cafarella2005]
�Bindings Engine (BE) is search engine where:
�No downloads during query processing
�Disk seeks constant in corpus size
�#queries = #phrases
�BE’sapproach:
�“Variabilized”search query language
�Pre-processes all documents before query-tim
e
�Integrates variable/type data with inverted index,
minim
izing query seeks
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
BE Query Support
cities such as <NounPhrase>
President Bush <Verb>
<NounPhrase>is the capital of <NounPhrase>
reach m
e at <phone-number>
�Any sequence of concrete term
s and typed
variables
�NEAR is insufficient
�Functions (e.g., “head(<NounPhrase>)”)
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
BE Operation
�Like a generic search engine, BE:
�Downloads a corpus of pages
�Creates an index
�Uses index to process queries efficiently
�BE further requires:
�Set of indexed types (e.g., “NounPhrase”), with a
“recognizer”for each
�String processing functions (e.g., “head()”)
�A BE system can only process types and
functions that its index supports
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Index design
�Search engines handle scale with inverted index
�Single disk seekper term
�Mainly sequential reads
�Disk analysis
�Seeksrequire ~5 m
s, so only 200/sec
�Sequential readstransfer 10-40 M
B/sec
�Inverted index m
inim
izes expensive seeks; BE
should do the same
�Parallel downloads are just parallel, distributed
seeks; still very costly
words
such
seattle
nickels
mayors
give
friendly
cities
billy
as
#docs
docid0
docid1
#docs
docid0
#docs
docid0
docid1
docid2
docid#docs-1
…#docs
docid0
docid1
docid2
#docs
docid0
docid1
#docs
docid0
docid1
docid2
docid#docs-1
…
#docs
docid0
#docs
docid0
docid1
docid2
docid3
#docs
docid0
docid1
docid2
docid#docs-1
…
#docs
docid0
docid1
docid2
docid#docs-1
…
words
such
seattle
nickels
mayors
give
friendly
cities
billy
as
#docs
docid0
docid1
docid2
docid#docs-1
…
104
21
150
322
2501
15
99
322
426
1309
1.Test for equality
2.Advance smaller pointer
3.Abort when a list is exhausted
Returned docs:
322
Query: such as
#docs
docid0
docid1
docid2
docid#docs-1
…
words
such
seattle
nickels
mayors
give
friendly
cities
billy
as
#docs
…pos 0
pos 1
docid#docs-1pos #
docs-1
#posns
pos 0
pos 1…
pos #
pos-1
docid0
docid1
#docs
…pos 0
pos 1
docid#docs-1pos #
docs-1
docid0
docid1
#posns
pos 0
pos 1…
pos #
pos-1
In phrase queries, match positions as well
#docs
docid0
docid1
docid2
docid#docs-1
…
#docs
docid0
docid1
docid2
docid#docs-1
…
“such as”
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Neighbor Index
�At each position in the index, store “neighbor text”
that might be useful
�Let’s index <NounPhrase> and <Adj-Term
>
“I love cities such as Philadelphia.”
Left
Right
AdjT: “love”
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Neighbor Index
�At each position in the index, store “neighbor text”
that might be useful
�Let’s index <NounPhrase> and <Adj-Term
>
“I love cities such as Philadelphia.”
Left
Right
AdjT: “cities”
NP: “cities”
AdjT: “I”
NP: “I”
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Neighbor Index
Left
Right
AdjT: “such”
Query:“cities such as <NounPhrase>”
AdjT: “Philadelphia”
NP: “Philadelphia”
“I love cities such as Philadelphia.”
neighbor 1
str 1
NPrightPhiladelphia
words
such
philadelphia
nickels
mayors
give
friendly
cities
billy
as
#docs
…pos 0
pos 1
docid#docs-1pos #
docs-1
#posns
pos 0
pos 1…
pos #
pos-1
docid0
docid1
“cities such as <NounPhrase>”
1.Find phrase query positions, as with phrase queries
2.If term is adjacent to variable, extract typed value
#posns
pos 0
neighbor 0
pos 1
neighbor 1…
pos #
pos-1…
#neighbors
blk_offset
3<offset>
19
12
In doc 19, starting at posn8:
“I love cities such as Philadelphia.”
neighbor 0
str 0
AdjTleft
such
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Asymptotic Efficiency Analysis
�kconcrete term
s in query
�Bbindings found for query
�Ndocuments in corpus
�Tindexed types in corpus
O(N)
O(k + B)
Std M
odel
O(N * T)
O(k)
BE
Index Space
Query Tim
e
(in seeks)
Band Nscale together; koften small; Toften exclusive
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Experiment 2: KnowItAllon BE
150k
50k
10k
Num
Extractions
89,641s
29,880s
5,976s
Std Imp/
BE
Speedup
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Experiment 2: KnowItAllon BE
150k
50k
10k
Num
Extractions
89,641s
29,880s
5,976s
Std Imp/
N/A
95s
95s
BE
N/A
314x
63x
Speedup
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
BE Summary
�Significant im
provement over generic indexes
�Index size grows linearly with number of types
�Some M
L-based patterns (e.g., HMMs, CRFs,
character models) not supported
�Can we use it for general QA, RE tasks?
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Similar Approach: [CGHX2006]
�Support “relationship”keyword queries over indexed entities
�Top-K support for early term
ination
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Indexing Thousands of Entity Types
[Slides from Chakrabartiet al., WWW 2006]
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Workload-Driven Indexing
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Selecting Types to Index
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Parallelization/Adaptive Processing
�Parallelize processing:
�IBM W
ebFountain
[GCG+2004]
�Google’s M
ap/Reduce
�Select most efficient access strategy
�Cost Estimation and O
ptimization [IAJG2006]
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Map/Reduce Framework
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Map/Reduce Framework
�General framework
�Scales to 1000s of machines
�Im
plemented in Nutch
�“M
aps”easily to inform
ation extraction
�Map phase:
�Parse individual documents
�Tag entities
�Propose candidate relation tuples
�Reduce phase
�Merge m
ultuple
mentionesof same relation tuple
�Resolve co-references, duplicates
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Cost Optimizer for Text-Centric Tasks
100
1,000
10,000
100,000 0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
0.90
1.00
Recall
Execution Time (secs)
Scan
Filt. Scan
Iterative Set Exp
ansion
Automatic
Query G
en.
OPTIM
IZED
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Other Dimensions of Scalability:
Managing Complex Features [CNS2006]
R. Fagin and J. Helpern, Belief, awareness, reasoning, In AI 1998
Surface features
(cheap)
Authors
Database lookup features
(expensive!)
Inverted
index
Efficient search for
top-k most similar
entities
Many large
tables
1.Batch up to do better than
individual top-k?
2.Find top segmentation without
top-k matches for all segments?
J. Widom
E. F. Codd
R. K. Narayan
Nick Koudas
S. Chakrabarti
S. Sudarshan
Steve Cook
Ronald Fagin
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Other Dimensions of Scalability:
Extraction Pattern Discovery [K
onigand Brill, KDD 2006]
�Use suffix array to efficiently explore
candidate patterns
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Application: W
eb Question Answering
AskMSR: does not use patterns
�Sim
plicity �
scalability (cheap to compute n-grams)
�Challenge: do better than n-grams on web Q
A
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Summary
�Brief overview of inform
ation extraction from text
�Techniques to scale up inform
ation extraction
�Scan-based techniques (lim
ited impact)
�Exploiting general indexes (lim
ited accuracy)
�Build
ing specialized index structures (most promising)
�Scalability is a data m
ining problem
�Querying graphs �
link discovery
�Workload m
ining for index optimization
�Must be optimized for specific text mining application?
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Related Challenges
�Duplicate entities, relation tuplesextracted
�Missing values
�Extraction errors
�Inform
ation spans m
ultiple documents
�Combining relation tuplesinto complex
events
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Break
�Eugene Agichtein, Microsoft &
Emory University
�http://www.m
athcs.emory.edu/~eugene/
�eugene@
mathcs.emory.edu
�Next: Scalable Inform
ation Integration
�Core set of techniques to enable large-scale IE, text
mining
�SunitaSarawagi
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
References
�[AGI2005] E. Agichtein, Scaling Inform
ation Extraction to Large Document Colle
ctions,
IEEE Data Engineering Bulletin, 2005
�[AG2003] E. Agichtein and L. Gravano. Querying text databases for efficient
inform
ation extraction. ICDE 2003
�[AIG
2003] E. Agichtein, P. Ipeirotis, and L. Gravano, Modelin
g Q
uery-Based Access to
Text Databases, WebDB2003
�[CDS+2005] l J. Cafarella, D. Downey, S. Soderland, and O
ren Etzioni. KnowItNow: Fast,
scalable inform
ation extraction from the web. (H
LT/EMNLP), 2005.
�[CE2005] M
. J. Cafarella
and O
. Etzioni. A search engine for natural language
applications. (W
WW), 2005
�[CNS2006] A. Chandel, P.C. Nagesh, and S. Sarawagi. Efficient batch top-k search for
dictionary-based entity recognition. ICDE 2006
�[CRW2005] S. Chaudhuri, R. Ramakrishnan, and G
. Weikum. Integrating db and ir
technologies: What is the sound of one hand clapping?, CIDR 2005.
�[CGHX2006] K. Chakrabarti, V. Ganti, JiaweiHan, D. Xin, Ranking O
bjects Based on
Relationships, SIG
MOD 2006
�[CPD 2006] S. Chakrabarti, KritiPuniyaniand SujathaDas, Optimizing Scoring Functions
and Indexes for Proxim
ity Search in Type-annotated Corpora. WWW 2006
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
References II
�[DBB+2002] S. Dumais, M. Banko, E. Brill, J. Lin and A. Ng (2002). P. Bennett, S. Dumais
and E. Horvitz (2002). W
eb question answering: Is m
ore always better? SIG
IR 2002
�[G
HY2002] R. Grishman, S. Huttunen, and R. Yangarber. Inform
ation extraction for
enhanced access to disease outbreak reports. Journal of Biomedical Inform
atics, 2002.
�[G
CG+2004] D. Gruhl, L. Chavet, D. Gibson, J. Meyer, P. Pattanayak, A. Tomkins, and J.
Zien. How to build a W
ebFountain: An architecture for very large-scale text analytics. IBM
Systems Journal, 2004.
�[IAJG2006] Ipeirotis, E. Agichtein, P. Jain, and L. Gravano, To Search or to Crawl:
Towards a Q
uery O
ptimizer for Text-Centric Tasks, SIG
MOD 2006
�[KRV+2004] R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, H. Zhu, Avatar: A
Database Approach to Semantic Search, SIG
MOD 2006
�[PRH2004] P. Pantel, D. Ravichandran, and E. Hovy. Towards terascale
knowledge
acquisition. In Conference on Computational Linguistics (COLING), 2004.
�[PE2005] P. Resnikand A. Elkiss. The linguist’s search engine: An overview
(demonstration). In ACL, 2005.
�[PDT2001] P.D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In
European Conference on M
achine Learning (ECML), 2001.
�C. König
and E. Brill, Reducing the Human O
verhead in Text Categorization, KDD 2006
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
The data integration problem
De-duplication
Extracted entities
•People/organization names,
•Addresses,
•Citations,
•Disease outbreaks
Multiple related entities with m
ostly text attributes
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
The challenge of data integration
�Large lists with m
ultiple noisy m
entions of the
same entity
�No single key to order or cluster likely
duplicates while separating them from sim
ilar
but different entities.
�Need to depend on fuzzy and compute-
intensive string sim
ilarity functions
�Cannot afford to compare with m
ention with
every other.
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Integration: outline
�Duplicate Entity elim
ination
�Defining duplicates: string sim
ilarity functions
�Scalable algorithms for finding sim
ilar pairs
�Batch m
ode
�Join algorithms to find duplicate pairs
�Onlin
e m
ode
�Efficient index design (same as indices in
batch m
ode)
�Create groups from duplicate entity pairs
�Single entity: data partitions
�Multiple entities: Colle
ctive m
ulti-attribute deduplication
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
String similarity measures
�Token-based
�Examples
�Jaccard
�TF-IDF Cosine sim
ilarities
�Suitable for large documents
�Character-based
�Examples:
�Edit-distance and variants like Levenshtein, Jaro-W
inkler
�Soundex
�Suitable for short strings with spelling m
istakes
�Hybrids
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Token based
�Tokens/words
�‘AT&T Corporation’-> ‘AT&T’, ‘Corporation’
�Sim
ilarity: various m
easures of overlap of two sets S,T
�Jaccard(S,T) = |S∩T|/|S∪T|
�Example
�S = ‘AT&T C
orporation’-> ‘AT&T’, ‘Corporation’
�T = ‘AT&T C
orp’-> ‘AT&T’, ‘Corp.’
�Jaccard(S,T) = 1/3
�Variants: weights attached with each token
�Useful for large strings: example web documents
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Cosine similarity with TF/IDF weights
�Cosine sim
ilarity:
�Sets transform
ed to vectors with each term
as dim
ension
�Sim
ilarity: dot-product of two vectors each norm
alized to unit
length
��
cosine of angle between them
�Term
weight == TF/IDF
�log (tf+1) * log idfwhere
�tf: frequency of ‘term
’in a document d
�idf: number of documents / number of documents containing ‘term
’
�Intuitively: rare ‘term
s’are m
ore important
�Widely used in traditional IR
�Example:
�‘AT&T Corporation’, ‘AT&T Corp’or ‘AT&T Inc’
�Low weights for ‘Corporation’,’Corp’,’Inc’, Higher weight for ‘AT&T’
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Edit Distance [G98]
�Given two strings, S,T, edit(S,T):
�Minim
um cost sequence of operations to transform
S to T.
�Character Operations: I (insert), D (delete), R (Replace).
�Example: edit(Error,Eror) = 1, edit(great,grate) = 2
�Folklore dynamic programming algorithm to compute edit();
�O(m
2) versus O
(2m log m
) for token-based m
easures
�Several variants (gaps,weights) ---becomes NP-complete
easily.
�Varying costs of operations: can be learnt [RY97].
�Observations
�Suitable for common typing m
istakes on small strings
�Comprehensive vsComprenhensive
�Problematic for specific domains
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Edit Distance with affine gaps
�Differences between ‘duplicates’often due to abbreviations or
whole word insertions.
�IBM Corp. closer to ATT Corp. than IBM Corporation
�John SmithvsJohn Edward SmithvsJohn E. Smith
�Allo
w sequences of mis-m
atched characters (gaps) in the
alig
nment of two strings.
�Penalty: using the affine cost model
�Cost(g) = s+e ⋅l
�s: cost of opening a gap
�e: cost of extending the gap
�l: length of a gap
�Sim
ilar dynamic programming algorithm
�Parameters domain-dependent, learnable, e.g., [BM03,
MBP05]
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Hybrids [CRF03]
�Example: Edward, John Vs Jon Edwerd
�Let S = {a1,…
,aK}, T = {b1,…
bL} sets of term
s:
�Sim
(S,T) =
�Sim
’() some other sim
ilarity function
�C(t,S,T) = {w∈S s.t ∃
v ∈
T, sim
’(w,v) > t}
�D(w
,T) = m
axv∈Tsim
’(w,v), w ∈
C(t,S,T)
�sTFIDF=
)b,
a('max
1j
i
1
1sim
K
K i
L j∑ =
=
∑∈
),
,(
),
(*
),
(*
),
(T
St
Cw
Tw
DT
wW
Sw
W
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Integration: outline
�Duplicate Entity elim
ination
�Defining duplicates: string sim
ilarity functions
�Scalable algorithms for finding sim
ilar pairs
�Join algorithms to find duplicate pairs
�Efficient index design
�Create groups from duplicate entity pairs
�Single entity: data partitions
�Multiple entities: Collective m
ulti-attribute
deduplication
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Finding all duplicate pairs in large lists
�Input: a large list of records R with string attributes
�Output: all pairs (S,T) of records in R
which satisfy a Sim
ilarity
Criteria:
�Jaccard(S,T) > 0.7
�Overlapping tokens (S,T) > 5
�TF-IDF-C
osine(S,T) > 0.8
�Edit-distance(S,T) < k
�More complicated sim
ilarity functions use these as filters (high
recall, low precision)
�Naïve m
ethod: for each record pair, compute sim
ilarity score
�I/O and CPU intensive, not scalable to m
illions of records
�Goal: reduce O
(n2) cost to O
(n*w
), w
here w << n
�Reduce number of pairs on which sim
ilarity is computed
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
General template for similarity functions
�Sets: r,s
threshold
Common
tokens
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Approximating edit distance [GIJ+01]
�EditDistance(s,t) ≤d →
|q-grams(s)∩
q-
grams(t)| ≥
max(|s|,|t|) -(d-1)*q –
1
�Q-grams (sequence of q-characters in a field)
�‘AT&T Corporation’
�3-grams: {‘AT&’,’T&T’,’&T ‘, ‘T C’,’
Co’,’orp’,’rpo’,’por’,’ora’,’rat’,’ati’,’tio’,’ion’}
�Typically, q=3 �
Large q-gram sets
�Approxim
ate large q-gram sets to smalle
r sets
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Reducing size of large sets
�MinHashmethod
�Random projection m
ethod
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Minhashmethod
�MinHashsignature of sets
�Choose a random perm
utation P of all tokens
�MinHash(S
={s
1,s
2,..,sn}) = s
jif s
jis the first token in
perm
utation P
�Jaccard(s,t) = Probability(M
inhash(s) = M
inhash(t))
�Choose B such perm
utations to improve accuracy
�Extended to Jaccard
on weighted term
s(C
harikar STOC
2002)
�Highly effective in practice
�Mirror detection on webpages (Broder et al 1997)
�Cleaning database records (Chaudhuriet al 2004)
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Random Projection method
�Project each token to a random value in [-1 1]
�Random projection of set S={s
1,s
2,..,sn}
�1 if sum of projection of each token in S > 0
�0 otherw
ise
�Cosine(s,t) ∝Probability(projection of s and t
same)
�Repeat for B projections to improve accuracy
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Approximate join problem
�Find all pairs s,t where Token-overlap(s,t) > T
�Main idea: exploit threshold T to reduce work
�Two algorithms
�Pair-C
ount
�Indexed Count
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
�Step 1: Pass over data, for each token create list of sets
that contain it
�Step 2: generate pairs of sets, count and output those
with count > T
Pair-Count
Inverted index
t1 t2
2 2 1
The pair-counting table could be large �
too memory intensive
Not good when list lengths highly skewed
Self-join
lists
(Broderet al WWW 1997)
Data
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Step 2: Using each record,
probe
merge lists to find rids in T of them
Probe-Count
Step 1: Create inverted index
w1
w2
Data
Inverted index
Heap
w1, w4,wk
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Threshold sensitive list merge
Lists to be merged
Sort by increasing size
Except T-1 largest,
organize rest in heap
(T=3)
Heap
Pop from heap successively
Heap
Search in large lists in increasing order
Use lower bounds to terminate early
(SK04, CGK06)
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Summary of the pair creation step
�Can be extended to the weighted case fitting
the general framework.
�More complicated sim
ilarity functions use set
sim
ilarity functions as filters
�Set sizes can be reduced through techniques
like M
inHash (weighted versions also exist)
�Small sets (average set size < 20), m
ost database
entities with word tokens: use as-is
�Large sets: web documents, sets of q-grams
�Use M
inhash or random projection
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Integration: outline
�Duplicate Entity elim
ination
�Defining duplicates: string sim
ilarity functions
�Scalable algorithms for finding sim
ilar pairs
�Join algorithms to find duplicate pairs
�Efficient index design
�Create groups from duplicate entity pairs
�Single entity: data partitions
�Multiple entities: Collective m
ulti-attribute
deduplication
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Creating partitions
�Transitive closure
�Dangers: unrelated records collapsed into
a single cluster
2
1
10
7
93
5
8
6
4
2
1
10
7
93
5
8
6
4
3 disagreements
�Correlation clustering (Bansalet al 2002)
�Partition to m
inim
ize total disagreements
1.Edges across partitions
2.Missing edges within partition
�More appealing than clustering:
�No m
agic constants: number of clusters,
sim
ilarity thresholds, diameter, etc
�Extends to real-valued scores
�NP Hard: many approxim
ate algorithms
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Algorithms for correlation clustering
�Integer programming form
ulation (Charikar03)
�Xij= 1 if i and j in same partition, 0 otherw
ise
�Im
practical: O
(n3) constraints
�Practical substitutes (Heuristics, no guarantees)
�Agglomerative clustering: repeatedly m
erge closest clusters
�Efficient im
plementation possible via heaps (BG 2005)
�Definition of closeness subject to tuning
�Greatest reduction in error
�Average/M
ax/M
in sim
ilarity
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Empirical results on data partitioning
�Setup: Online comparison shopping,
�Fields: name, model, description, price
�Learner: O
nline perceptronlearner
�Complete-link clustering >> single-link clustering(transitive closure)
�An issue: when to stop m
erging clusters
Digital cameras
Camcoder
Luggage
(From: Bile
nkoet al,
2005)
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Other methods of partitioning
[Chaudhuriet al IC
DE 2005]
�Partitions are compact and relatively far from other points
�A Partition has to satisfy a number of criteria
�Points within partition closer than any points outside
�#points within p-neighborhood of each partition < c
�Either number of points in partition < K, or diameter < θ
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Algorithm
�Consider case where partitions required to be of
size < K �
if partition Pjof size m
in output then
�m-nearest neighbors of all r in Pi is Pi
�Neighborhood of each point is sparse
�For each record, do efficient index probes to get
�Get K nearest neighbors
�Count of number of points in p-neighborhood for each m
nearest neighbors
�Form
pairs and perform
grouping based on above
insight to find groups
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Summary: partitioning
�Transitive closure is a bad idea
�No verdict yet on best alternative
�Difficult to design an objective and algorithms
�Correlation clustering
�Reasonable objective with a skewed scoring function
�Poor algorithms
�Greedy agglomerative clustering algorithms ok
�Greatest minim
um sim
ilarity (complete-link), benefit
�Reasonable perform
ance with heap-based implementation
�Dense/Sparse partitioning
�Positives: Declarative objective, efficient algorithm
�Parameter retuning across domains
�Need comparison betw
een complete-link, Dense/Sparse, and
Correlation clustering.
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Collective de-duplication: m
ulti-attribute
Collectively de-duplicate entities and its many
attributes
Associate variables for predictions
for each attribute k
each record pair (i,j)
for each record pair
from Parag& Domingos2005
a1
a2
a3
Akij
Rij
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Dependency graph
A112
A134
A212
A234
A312= A
334
R12
R34
Scoring functions
�Independent scores
�sk(A
k,a
i,aj) Attribute-level
�Any classifier on various text
sim
ilarities of attribute pairs
�s(R
,bi,b
j) Record-level
�Any classifier on various
sim
ilarities of all k attribute
pairs
�Dependency scores
�dk(A
k, R): record pair, attribute
pair
10
10
71
24
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Joint de-duplication steps
�Jointly pick 0/1 labels for all record pairs R
ijand all K attribute
pairs A
kijto m
axim
ize
�When dependency scores associative
�dk(1,1) + d
k(0,0) >= d
k(1,0)+dk(0,1)
�Can find optimal scores through graph M
INCUT
[s(Rij
ij∑)+
s k(Aijk
k∑)+d k(Rij,Aijk)]
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Other issues and approaches
�Partitioning
�Transitive-closure as a post processing
�Results:
•Collective deduplication
•does not help whole citations,
•helps attributes
•Transitive closure can cause drop in accuracy
�Combined partitioning and linked dedup
�Dong, HaLevy, Madhavan(SIG
MOD 2005)
�Bhattacharya and G
etoor(2005)
CitationAuthorVenue
PT
PT
PT
Independent
87
85
798949
59
Collective
86
89
898986
82
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Collective linkage: set-oriented data
(Bhattacharya and G
etoor, 2005)
AnupGupta
P3
David W
hite
P4
Liu, Jane & J G
upta & W
hite, Don
P2
D W
hite, J Liu, A G
upta
P1
Scoring functions
�S(A
ij) Attribute-level
�Text sim
ilarity
�S(A
ij, N
ij) Dependency with
labels of co-author set
�Fraction of co-author set
assigned label 1.
�Score: αs(A
ij) + (1-α) s(A
ij, N
ij)
Algorithm
Greedy agglomerative clustering
�Merge author clusters with highest
score
�Redefine sim
ilarity between clusters
of authors instead of single authors
�Max of author-level sim
ilarity
D White
White, Don
J Liu
Liu Jane
A Gupta
AnupGupta
David White
D White
A Gupta
J Gupta
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
Summary
�Scalable algorithms for finding pairs of duplicates
solved for selective set sim
ilarity functions
�Open problem: colle
ction of weak sim
ilarity functions
�Data partitioning: agglomerative heap-based
clustering algorithms ok.
�Open problems:
�scalable algorithms for correlation clustering
�multi-attribute colle
ctive partitioning
�Ambiguity resolution in networked entities
�Other issues: combined extraction and integration
(Online integration)
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
1983
Year
10
Update Semantics
2
Canonical
Journal
Title
Id
ACM Trans.
Databases
16
17
AI
17
ACM TODS
10
Canonica
lName
Id
32
22
11
2
Autho
r
Article
4Jeffrey Ullm
an
4
3Ron Fagin
3
4J. Ullm
an
2
M Y Vardi
11
Canonical
Name
IdAuthors
Writes
Journals
Articles
Variant links
to canonical
entries
Database: norm
alized, stores noisy variants
3 Top-
level
entities
R. Faginand J. Helpern, Belief, awareness, reasoning. In AI1988[10] also see
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
1988
1983
Year
17
Belief, awareness,
reasoning
7
10
Update Semantics
2
Canonical
Journal
Title
Id
10
ACM Trans.
Databases
16
17
AI
17
ACM TODS
10
Canonica
l
Name
Id
87
32
22
97
11
2
Autho
r
Article
4Jeffrey Ullm
an
4
3R Fagin
8
8J Helpern
9
3Ron Fagin
3
4J. Ullm
an
2
M Y Vardi
11
Canonic
al
Name
IdAuthors
Writes
Journals
Articles
R. Faginand J. Helpern, Belief, awareness, reasoning. In AI1988[10] also see
Extraction
Integration
Author: R. Fagin A
Author: J. Helpern
Title Belief,..reasoning
Journal: AI
Year: 1998
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
References
�[ACG02] RohitAnanthakrishna, SurajitChaudhuri, VenkateshGanti: Elim
inating Fuzzy Duplicates in Data W
arehouses.
VLDB 2002: 586-597
�[BD83] Dina Bitton, David J. DeWitt: Duplicate Record Elim
ination in Large Data Files. ACM Trans. Database Syst. 8(2):
255-265 (1983)
�[BE77] Mike W
. Blasgen, KapaliP. Eswaran: Storage and Access in Relational Data Bases. IBM Systems Journal 16(4):
362-377 (1977)
�[BG04] IndrajitBhattacharya, LiseGetoor: Iterative record linkage for cleaning and integration. DMKD 2004: 11-18
�[C98] William W
. Cohen: Integration of Heterogeneous Databases W
ithout Common Domains Using Q
ueries Based on
Textual Sim
ilarity. SIG
MOD Conference 1998: 201-212
�[C00] William W
. Cohen: Data integration using sim
ilarity joins and a word-based inform
ation representation language. ACM
Trans. Inf. Syst. 18(3): 288-321 (2000)
�[CCZ02] Peter Christen, Tim
Churches, Xi Zhu: Probabilistic nameand address cleaning and standardization. Australasian
Data M
ining W
orkshop 2002.
�[CGGM04] SurajitChaudhuri, Kris G
anjam, VenkateshGanti, Rajeev M
otwani: Robust and Efficient Fuzzy M
atch for Online
Data Cleaning. SIG
MOD Conference 2003: 313-324
�[CGG+05] SurajitChaudhuri, Kris G
anjam, VenkateshGanti, RahulKapoor, VivekR. Narasayya, Theo Vassilakis: Data
cleaning in m
icrosoftSQL server 2005. SIG
MOD Conference 2005: 918-920
�[CGK06] SurajitChaudhuri, VenkateshGanti, RaghavKaushik: A primitive operator for sim
ilarity joins in data cleaning. ICDE
2006.
�[CGM05] SurajitChaudhuri, VenkateshGanti, Rajeev M
otwani: Robust Identification of Fuzzy Duplicates. ICDE 2005: 865-
876
�[CRF03] William W
. Cohen, PradeepRavikumar, Stephen E. Fienberg: A Comparison of String Distance M
etrics for Name-
Matching Tasks. IIWeb2003: 73-78
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
References
�[DJ03] TamraparniDasu, Theodore Johnson: Exploratory D
ata M
ining and D
ata Cleaning John W
iley
2003
�[DNS91] David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider: An Evaluation of Non-E
quijoin
Algorithms. VLDB 1991: 443-452
�[DWI02] Data W
arehousing Institute report 2002
�[E00] Larry English: Plain English on D
ata Q
uality: Inform
ation Q
uality M
anagement: The Next Frontier.
DM Review M
agazine: April 2000. http://w
ww.dmreview.com/article_sub.cfm
?articleId=2073
�[FL95] ChristosFaloutsos, King-IpLin: FastM
ap: A Fast Algorithm for Indexing, Data-M
ining and
Visualization of Traditional and M
ultim
edia Datasets. SIG
MOD Conference 1995: 163-174
�[FS69] I. Fellegi, A. Sunter: A theory of record linkage. Journal of the American Statistical Association,
Vol64. No 328, 1969
�[G
98] D. Gusfield: Algorithms on strings, trees and sequences. Cambridge university press 1998
�[G
FS+01] Helena G
alhardas, Daniela Florescu, Dennis Shasha, Eric Sim
on, Cristian-AugustinSaita:
Declarative Data Cleaning: Language, Model, and Algorithms. VLDB2001: 371-380
�[G
IJ+01] Luis G
ravano, PanagiotisG. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh
Srivastava: Approxim
ate String Joins in a Database (Alm
ost) for Free. VLDB 2001: 491-500
�[G
IKS03] Luis G
ravano, PanagiotisG. Ipeirotis, Nick Koudas, Divesh Srivastava: Text joins in an RDBMS
for web data integration. WWW 2003: 90-101
�[G
KMS04] S. Guha, N. Koudas, A. Marathe, D. Srivastava: Merging the results of approxim
ate m
atch
operations. VLDB 2004.
�[G
KR98] David G
ibson, Jon M
. Kleinberg, PrabhakarRaghavan: Clustering C
ategorical Data: An
Approach Based on D
ynamical Systems. VLDB 1998: 311-322
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
References
�[HS95] Mauricio A. Hernández, Salvatore J. Stolfo: The M
erge/Purge Problem for Large Databases.
SIG
MOD Conference 1995: 127-138
�[HS98] Gísli R. Hjaltason, HananSamet: Incremental Distance Join Algorithms for Spatial Databases.
SIG
MOD Conference 1998: 237-248
�[J89] M. A. Jaro: Advances in record linkage m
ethodology as applie
d to m
atching the 1985 census of
Tampa, Florida. Journal of the American Statistical Association 84: 414-420.
�[JLM03] LiangJin, Chen Li, SharadMehrotra: Efficient Record Linkage in Large Data Sets. DASFAA
2003
�[JU91] PetteriJokinen, EskoUkkonen: Two Algorithms for Approxim
ate String M
atching in Static Texts.
MFCS 1991: 240-248
�[KL51] S. Kullb
ack, R. Liebler: On inform
ation and sufficiency. The annals of mathematical statistics
22(1): 79-86. 1959.
�[KMC05] DmitriV. Kalashnikov, SharadMehrotra, ZhaoqiChen: Exploiting R
elationships for Domain-
Independent Data Cleaning. SDM 2005
�[KMS04] Nick Koudas, AmitMarathe, Divesh Srivastava: Flexible String M
atching Against Large
Databases in Practice. VLDB 2004: 1078-1086
�[KMS05] Nick Koudas, AmitMarathe, Divesh Srivastava: SPID
ER: flexible m
atching in databases.
SIG
MOD Conference 2005: 876-878
�[LLL00] Mong-Li Lee, TokWang Ling, WaiLupLow: IntelliClean: a knowledge-based intelligent data
cleaner. KDD 2000: 290-294
�[M
E96] Alvaro E. Monge, Charles Elkan: The Field M
atching Problem: Algorithms and Applications. KDD
1996: 267-270
August 2006
Agichteinand Sarawagi, KDD 2006: Scalable Information Extraction and Integration
References
�[M
E97] Alvaro E. Monge, Charles Elkan: An Efficient Domain-Independent Algorithm for Detecting Approxim
ately Duplicate
Database Records. DMKD 1997
�[RY97] E. Ristad, P. Yianilos: Learning string edit distance. IEEE Pattern analysis and m
achine intelligence 1998.
�[S83] Gerry Salton: Introduction to m
odern inform
ation retrieval. M
cGraw Hill 1987.
�[SK04] Sunita Sarawagi, AlokKirpal: Efficient set joins on sim
ilarity predicates. SIG
MOD Conference 2004: 743-754
�[TF95] Howard R. Turtle, James Flood: Query Evaluation: Strategies and O
ptimizations. Inf. Process. Manage. 31(6): 831-
850 (1995)
�[TKF01] S. Tejada, C. Knoblock, S. Minton : Learning object identification rules for inform
ation integration. Inform
ation
Systems, Vol26, No 8, 607-633, 2001.
�[W
94] William E. Winkler: Advanced m
ethods for record linkage. Proceedings of the section on survey research m
ethods,
American Statistical Association 1994: 467-472
�[W
99] William E. Winkler: The state of record linkage and current research problems. IRS publication R
99/04
(http://www.census.gov/srd/www/byname.htm
l)
�[Y02] William E. Yancey: BigMatch: A program for extracting probable m
atches from a large file for record linkage. RRC
2002-01. Statistical Research Division, U.S. Bureau of the Census.
Top Related