UMass and Learning for CALO Andrew McCallum Information Extraction & Synthesis Laboratory Department...

UMass and Learning for CALO

Andrew McCallum

Information Extraction & Synthesis Laboratory

Department of Computer Science

University of Massachusetts

Outline

• CC-Prediction

– Learning in the wild from user email usage

• DEX

– Learning in the wild from user correction... as well as KB records filled by other CALO components

• Rexa

– Learning in the wild from user corrections to coreference... propagating constraints in a Markov-Logic-like system that scales to ~20 million objects

• Several new topic models

– Discover interesting useful structure without the need for supervision... learning from newly arrived data on the fly

CC Prediction Using Various Exponential Family Factor Graphs

Learning to keep an org. connected & avoid stove-piping.

First steps toward ad-hoc team creation.

Learning in the wild from the user’s CC behavior, and from other parts of the CALO ontology.

Graphical Models for Email

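[Figure: factor graph for email. The target variable y (recipient of email) connects through local functions to body words x_b (N_b replications), subject words x_s (N_s replications), and other recipients x_r (N_r − 1 replications). Legend: function nodes, random-variable nodes, and plates marking N replications.]

• Compute P(y|x) for CC prediction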

• Local functions facilitate system engineering through modularity

Email Model: Nb words in the body, Ns words in the subject, Nr recipients

The graph describes the joint distribution of the random variables in terms of a product of local functions
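As a rough illustration of "a product of local functions", the sketch below scores each candidate recipient y by summing the log-values of three local functions (body, subject, other recipients) and normalizing to get P(y|x). The weight tables, names, and example email are hypothetical and are not taken from the CALO system; the slides do not show the actual parameterization.

```python
import math

def cc_posterior(candidates, email, local_functions):
    """Return P(y | x) over candidate recipients y for one email x."""
    log_scores = {}
    for y in candidates:
        # A product of local functions becomes a sum of their logs.
        log_scores[y] = sum(f(y, email) for f in local_functions)
    m = max(log_scores.values())  # subtract the max for numerical stability
    z = sum(math.exp(s - m) for s in log_scores.values())
    return {y: math.exp(s - m) / z for y, s in log_scores.items()}

# Hypothetical log-weights, for illustration only.
BODY_W = {("alice", "budget"): 1.2, ("bob", "compiler"): 1.5}
SUBJECT_W = {("alice", "invoice"): 0.8}
RECIPIENT_W = {("alice", "carol"): 0.9, ("bob", "dave"): 0.7}

def body_factor(y, email):
    """Log of the local function tying y to the N_b body words."""
    return sum(BODY_W.get((y, w), 0.0) for w in email["body"])

def subject_factor(y, email):
    """Log of the local function tying y to the N_s subject words."""
    return sum(SUBJECT_W.get((y, w), 0.0) for w in email["subject"])

def recipient_factor(y, email):
    """Log of the local function tying y to the other N_r - 1 recipients."""
    return sum(RECIPIENT_W.get((y, r), 0.0) for r in email["recipients"])

email = {"body": ["budget", "review"], "subject": ["invoice"], "recipients": ["carol"]}
posterior = cc_posterior(
    ["alice", "bob"], email, [body_factor, subject_factor, recipient_factor]
)
```

Because each local function is a separate callable, one can be added or swapped without touching the others, which is the modularity point made above.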

Document Models

[Figure: factor graph for documents. The target variable y (author of document) connects to replicated nodes for title, abstract, body, co-authors, and references.]

• Models may include relational attributes

• We can optimize P(y|x) for classification performance and P(x|y) for model interpretability and parameter transfer (to other models)

CC Prediction and Relational Attributes

[Figure: factor graph for CC prediction with relational attributes. The target recipient y connects to body words, subject words, other recipients, thread relations, and recipient relations.]

Thread Relations: e.g., was a given recipient ever included on this email thread?

Recipient Relationships: e.g., does one of the other recipients report to the target recipient?

CC-Prediction Learning in the Wild

• As documents are added to Rexa, models of expertise for authors grow

• As DEX obtains more contact information and keywords, organizational relations emerge

• Model parameters can be adapted on-line

• Priors on parameters can be used to transfer learned information between models

• New relations can be added on-line

• Modular model construction and intelligent model optimization enable these goals
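One way to picture on-line adaptation with a transferring prior is a single stochastic gradient step on a logistic model, with a Gaussian prior centered on parameters learned elsewhere. This is a minimal sketch under that assumption; the function name, feature names, learning rate, and prior strength are all hypothetical, and the slides do not say which optimizer CALO uses.

```python
import math

def online_update(weights, prior_mean, features, label, lr=0.1, prior_strength=0.01):
    """One on-line gradient step for a logistic model.

    weights, prior_mean, features: dicts keyed by feature name.
    label: 1 if the candidate was actually CC'd, else 0.
    The Gaussian prior pulls each weight toward prior_mean (transferred parameters).
    """
    z = sum(weights.get(k, 0.0) * v for k, v in features.items())
    p = 1.0 / (1.0 + math.exp(-z))  # current predicted probability
    updated = dict(weights)
    # Iterating over the union of keys lets new relations (features) appear on-line.
    for k in set(weights) | set(features) | set(prior_mean):
        w = weights.get(k, 0.0)
        grad = (label - p) * features.get(k, 0.0)            # log-likelihood term
        grad -= prior_strength * (w - prior_mean.get(k, 0.0))  # prior term
        updated[k] = w + lr * grad
    return updated

# Hypothetical usage: a prior transferred from another model, one observed example.
prior = {"reports_to_target": 1.0}
w = online_update({}, prior, {"on_thread_before": 1.0}, label=1)
```

The feature seen in the example moves in the direction of the observed label, while the unseen feature drifts toward its transferred prior value.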

CC Prediction: Upcoming work on Multi-Conditional Learning

A discriminatively-trained topic model, discovering low-dimensional representations for transfer learning and improved regularization & generalization.

Objective Functions for Parameter Estimation

Traditional

– Traditional, joint training (e.g. naive Bayes, most topic models)

– Traditional, conditional training (e.g. MaxEnt classifiers, CRFs)

– Traditional mixture model (e.g. LDA)

– Conditional mixtures (e.g. Jebara’s CEM, McCallum CRF string edit distance, ...)

New, multi-conditional

– Multi-conditional (mostly conditional, generative regularization)

– Multi-conditional (for semi-sup)

– Multi-conditional (for transfer learning, 2 tasks, shared hiddens)
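The objective formulas themselves did not survive extraction from the slides. As a hedged illustration of the "mostly conditional, generative regularization" idea, the sketch below evaluates a weighted sum of two conditional log-likelihoods, alpha · log P(y|x) + beta · log P(x|y), under a small joint table. The weights alpha and beta and the table-based parameterization are assumptions made for illustration, not the authors' exact formulation.

```python
import math

def multi_conditional_objective(joint, data, alpha=1.0, beta=0.5):
    """Weighted sum of conditional log-likelihoods under a joint table P(x, y).

    joint: dict mapping (x, y) -> probability, summing to 1.
    data:  list of observed (x, y) pairs.
    alpha weights log P(y | x); beta weights log P(x | y).
    """
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p  # marginal P(x)
        py[y] = py.get(y, 0.0) + p  # marginal P(y)
    total = 0.0
    for x, y in data:
        total += alpha * math.log(joint[(x, y)] / px[x])  # log P(y | x)
        total += beta * math.log(joint[(x, y)] / py[y])   # log P(x | y)
    return total

# Hypothetical 2x2 joint distribution and a single observation.
uniform = {(x, y): 0.25 for x in ("x0", "x1") for y in ("y0", "y1")}
objective = multi_conditional_objective(uniform, [("x0", "y0")])
```

Setting beta to 0 recovers ordinary conditional training; raising it adds the generative term that acts as a regularizer.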

“Multi-Conditional Learning” (Regularization) [McCallum, Pal, Wang, 2006]

Predictive Random Fields: mixture of Gaussians on synthetic data

[Figure panels: Data, classify by color; Generatively trained; Conditionally-trained [Jebara 1998]; Multi-Conditional]

[McCallum, Wang, Pal, 2005]

Multi-Conditional Mixtures vs. Harmonium on document retrieval task

Harmonium, joint with words, no labels

Harmonium, joint, with class labels and words

Conditionally-trained, to predict class labels

Multi-Conditional, multi-way conditionally trained

[McCallum, Wang, Pal, 2005]

DEX

Beginning with a review of previous work,

then new work on record extraction,

with the ability to leverage new KBs in the wild, and for transfer

System Overview

[Figure: system diagram. Inputs: Email, WWW. Components: Contact Info and Person Name Extraction; Person Name Extraction; Name Coreference; Homepage Retrieval; Social Network Analysis; Keyword Extraction. Other labels: CRF, names.]

An Example

To: “Andrew McCallum” [email protected]

Subject ...

First Name: Andrew

Middle Name: Kachites

Last Name: McCallum

JobTitle: Associate Professor

Company: University of Massachusetts

Street Address: 140 Governor’s Dr.

City: Amherst

State: MA

Zip: 01003

Company Phone: (413) 545-1323

Links: Fernando Pereira, Sam Roweis,…

Key Words: Information extraction, social network,…

Search for new people

Summary of Results

Contact info and name extraction performance (25 fields)

        Token Acc   Field Prec   Field Recall   Field F1
CRF     94.50       85.73        76.33          80.76

Example keywords extracted

Person               Keywords
William Cohen        Logic programming; Text categorization; Data integration; Rule learning
Daphne Koller        Bayesian networks; Relational models; Probabilistic models; Hidden variables
Deborah McGuinness   Semantic web; Description logics; Knowledge representation; Ontologies
Tom Mitchell         Machine learning; Cognitive states; Learning apprentice; Artificial intelligence
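Field F1 is the harmonic mean of field precision and field recall. As a quick consistency check on the CRF figures reported above (precision 85.73, recall 76.33), the following sketch recomputes it; the function name is just illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# Field precision and recall reported for the CRF extractor.
crf_f1 = f1(85.73, 76.33)
```

Rounded to two decimals, the result agrees with the reported field F1.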

1. Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid “stove-piping” in large org’s by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)

2. Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.

Importance of accurate DEX fields in IRIS

• Information about

– people

– contact information

– email

– affiliation

– job title

– expertise

– ...

are key to answering many CALO questions... both directly, and as supporting inputs to higher-level questions.

Learning Field Compatibilities in DEX

Professor Jane Smith

University of California

209-555-5555

Professor Smith chairs the Computer Science Department. She hails from Boston, …her administrative assistant …

John Doe

Administrative Assistant

209-444-4444

Extracted Record

Name: Jane Smith, John Doe

JobTitle: Professor, Administrative Assistant

Company: U of California

Department: Computer Science

Phone: 209-555-5555, 209-444-4444

City: Boston

Compatibility Graph

[Figure: graph whose nodes are the extracted field values (Jane Smith, University of California, 209-555-5555, Computer Science, Boston, John Doe, 209-444-4444, Professor) and whose edges carry compatibility weights -.5, -.4, -.6, .4, .8, .4, -.5. Which weight belongs to which edge is not recoverable from the extraction.]



• ~35% error reduction over transitive closure

• Qualitatively better than heuristic approach

• Mine Knowledge Bases from other parts of IRIS for learning compatibility rules among fields

– “Professor” job title co-occurs with “University” company

– Area code / city compatibility

– “Senator” job title co-occurs with “Washington, D.C.” location

• In the wild

– As the user adds new fields & makes corrections, DEX learns from this KB data

• Transfer learning

– between departments/industries
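To make the compatibility-graph idea concrete, here is a minimal sketch that groups extracted field values into records by greedily merging the pair of clusters with the largest positive total compatibility. The greedy procedure and the edge weights below are assumptions chosen for illustration; the slides do not specify the partitioning algorithm DEX uses, and the figure's edge-to-weight assignment is not recoverable.

```python
from itertools import combinations

def partition(nodes, weights):
    """Greedy agglomerative partitioning of a signed compatibility graph.

    nodes:   list of field values.
    weights: dict mapping an unordered pair (a, b) to a compatibility score;
             missing pairs count as 0.
    """
    clusters = [{n} for n in nodes]

    def w(a, b):
        return weights.get((a, b), weights.get((b, a), 0.0))

    def score(c1, c2):
        # Total compatibility between two clusters.
        return sum(w(a, b) for a in c1 for b in c2)

    while True:
        best, pair = 0.0, None
        for i, j in combinations(range(len(clusters)), 2):
            s = score(clusters[i], clusters[j])
            if s > best:
                best, pair = s, (i, j)
        if pair is None:  # no merge has positive total compatibility
            return clusters
        i, j = pair
        clusters[i] |= clusters[j]
        del clusters[j]

# Hypothetical weights: positive = likely same record, negative = likely different.
WEIGHTS = {
    ("Jane Smith", "Professor"): 0.8,
    ("Jane Smith", "209-555-5555"): 0.4,
    ("John Doe", "209-444-4444"): 0.4,
    ("John Doe", "Professor"): -0.5,
    ("John Doe", "209-555-5555"): -0.5,
    ("Jane Smith", "John Doe"): -0.6,
    ("209-555-5555", "209-444-4444"): -0.4,
}
NODES = ["Jane Smith", "John Doe", "Professor", "209-555-5555", "209-444-4444"]
records = partition(NODES, WEIGHTS)
```

Unlike transitive closure over positive edges alone, negative weights here can keep two records apart even when a chain of weak positive links connects them.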


Rexa

A knowledge base of publications, grants, people, their expertise, topics, and inter-connections

Learning for information extraction and coreference.

Incrementally leveraging multiple sources of information for improved coreference

Gathering information about people’s expertise and co-author, citation relations

First a tour of Rexa, then slides about learning

Previous Systems

[Figure: entity-relation diagram with the entity Research Paper and the relation Cites.]

More Entities and Relations

[Figure: entity-relation diagram with the entities Research Paper, Person, University, Venue, Grant, Groups, and Expertise, and the relation Cites.]

Learning in Rexa

Extraction, coreference

In the wild: Re-adjusting KB after corrections from a user

Also, learning research topics/expertise, and their interconnections

(Linear Chain) Conditional Random Fields

[Lafferty, McCallum, Pereira 2001]

Undirected graphical model, trained to maximize conditional probability of output sequence given input sequence

[Figure: finite state model and graphical model views of a linear chain. FSM states (output seq) y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, ... carry the labels OTHER PERSON OTHER ORG TITLE …; observations (input seq) x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}, ... are the words “said Jones a Microsoft VP …”.]

$$p(y \mid x) = \frac{1}{Z_x} \prod_t \Phi(y_t, y_{t-1}, x, t)$$

where

$$\Phi(y_t, y_{t-1}, x, t) = \exp\left( \sum_k \lambda_k f_k(y_t, y_{t-1}, x, t) \right)$$

Wide-spread interest, positive experimental results in many applications.

Noun phrase, Named entity [HLT’03], [CoNLL’03]

Protein structure prediction [ICML’04]

IE from Bioinformatics text [Bioinformatics ‘04],…

Asian word segmentation [COLING’04], [ACL’04]

IE from Research papers [HLT’04]

Object classification in images [CVPR ‘04]

(500 citations)
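The linear-chain CRF definition, p(y | x) = (1/Z_x) ∏_t Φ(y_t, y_{t−1}, x, t) with Φ = exp(Σ_k λ_k f_k), can be sketched on the slide's example sentence as follows. The feature set and the weights λ_k are hypothetical; only the label set and the sentence come from the slide. The normalizer Z_x is computed with the forward algorithm in log space.

```python
import math
from itertools import product

LABELS = ["OTHER", "PERSON", "ORG", "TITLE"]

# Hypothetical weights lambda_k, one per binary feature f_k.
WEIGHTS = {
    ("word", "jones", "PERSON"): 2.0,
    ("word", "microsoft", "ORG"): 2.0,
    ("word", "vp", "TITLE"): 2.0,
    ("trans", "ORG", "TITLE"): 1.0,
    ("trans", "OTHER", "OTHER"): 0.5,
}

def log_phi(y_t, y_prev, x, t):
    """log Phi(y_t, y_{t-1}, x, t) = sum_k lambda_k f_k(y_t, y_{t-1}, x, t)."""
    active = [("word", x[t].lower(), y_t), ("trans", y_prev, y_t)]
    return sum(WEIGHTS.get(f, 0.0) for f in active)

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(x):
    """log Z_x via the forward algorithm."""
    alpha = {y: log_phi(y, None, x, 0) for y in LABELS}
    for t in range(1, len(x)):
        alpha = {
            y: logsumexp([alpha[yp] + log_phi(y, yp, x, t) for yp in LABELS])
            for y in LABELS
        }
    return logsumexp(list(alpha.values()))

def log_prob(y, x):
    """log p(y | x) for a full label sequence y."""
    score = sum(
        log_phi(y[t], y[t - 1] if t > 0 else None, x, t) for t in range(len(x))
    )
    return score - log_partition(x)

x = ["said", "Jones", "a", "Microsoft", "VP"]
y = ["OTHER", "PERSON", "OTHER", "ORG", "TITLE"]
p = math.exp(log_prob(y, x))

# Sanity check: probabilities of all label sequences should sum to 1.
total = sum(math.exp(log_prob(list(seq), x)) for seq in product(LABELS, repeat=len(x)))
```

The brute-force sum over all label sequences is only feasible for tiny inputs; it is included to check that the forward algorithm computes the right normalizer.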

IE from Research Papers [McCallum et al ‘99]

IE from Research Papers

Method                              Field-level F1   Reference
Hidden Markov Models (HMMs)         75.6             [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)      89.7             [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)    93.9             [Peng, McCallum, 2004]

Error reduction: 40%

(Word-level accuracy is >99%)
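The 40% error figure is consistent with comparing CRFs to SVMs if "error" is read as 100 minus field-level F1. That reading is an assumption, since the slide does not say which pair of systems is compared. A small worked check:

```python
def relative_error_reduction(f1_old, f1_new):
    """Relative reduction in error, taking error = 100 - F1 (an assumption)."""
    err_old = 100.0 - f1_old
    err_new = 100.0 - f1_new
    return (err_old - err_new) / err_old

# SVM field-level F1 of 89.7 versus CRF field-level F1 of 93.9.
reduction = relative_error_reduction(89.7, 93.9)
```

Under this reading the error drops from 10.3 to 6.1, a relative reduction of roughly 40%.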

Joint segmentation and co-reference

Extraction from and matching of research paper citations.

[Wellner, McCallum, Peng, Hay, UAI 2004]; see also [Marthi, Milch, Russell, 2003]

[Figure: graphical model with node types o, s, c, y, and p, and the labels Segmentation, Citation attributes, Co-reference decisions, Database field values, and World Knowledge.]

Example citations:

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.

Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

Inference: Variant of Iterated Conditional Modes [Besag, 1986]

35% reduction in co-reference error by using segmentation uncertainty.

6-14% reduction in segmentation error by using co-reference.
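For readers unfamiliar with Iterated Conditional Modes, the following is a sketch of the textbook procedure on a toy model: each variable in turn is set to the value that maximizes its local score given the current values of all others, until nothing changes. The variant used by the authors is not described on the slides; the variables, scores, and function names here are hypothetical.

```python
def icm(labels, unary, pairwise, assignment, max_sweeps=20):
    """Textbook ICM: coordinate-wise maximization of a sum of local scores.

    labels:     candidate values for every variable.
    unary:      unary[var][val] is the score of var taking val.
    pairwise:   function(var, val, other, other_val) -> score.
    assignment: dict var -> initial value.
    """
    assignment = dict(assignment)

    def local_score(var, val):
        s = unary[var][val]
        for other, other_val in assignment.items():
            if other != var:
                s += pairwise(var, val, other, other_val)
        return s

    for _ in range(max_sweeps):
        changed = False
        for var in assignment:
            best = max(labels, key=lambda val: local_score(var, val))
            if local_score(var, best) > local_score(var, assignment[var]):
                assignment[var] = best
                changed = True
        if not changed:  # reached a local optimum
            break
    return assignment

# Hypothetical toy problem: three binary variables that prefer to agree.
UNARY = {"a": [0.0, 3.0], "b": [0.5, 0.0], "c": [0.5, 0.0]}

def agree(var, val, other, other_val):
    return 1.0 if val == other_val else 0.0

result = icm([0, 1], UNARY, agree, {"a": 0, "b": 0, "c": 0})
```

ICM only guarantees a local optimum: in this toy problem it stops with b and c unchanged even though setting all three variables to 1 would score higher overall.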

Rexa Learning in the Wild from User Feedback

• Coreference will never be perfect.

• Rexa allows users to enter corrections to coreference decisions

• Rexa then uses this feedback to

– re-consider other inter-related parts of the KB

– automatically make further error corrections by propagating constraints

• (Our coreference system uses underlying ideas very much like Markov Logic, and scales to ~20 million mention objects.)

Finding Topics in 1 million CS papers

200 topics & keywords automatically discovered.

Topical Transfer

Citation counts from one topic to another. Map “producers and consumers”

Topical Diversity

Find the topics that are cited by many other topics---measuring diversity of impact.

Entropy of the topic distribution among papers that cite this paper (this topic).

[Figure columns: Low Diversity; High Diversity]
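The diversity measure described above, the entropy of the topic distribution among citing papers, can be sketched as follows. The input format (one topic label per citing paper) and the use of base-2 logarithms are assumptions; the slides do not state either.

```python
import math
from collections import Counter

def topic_diversity(citing_topics):
    """Entropy (in bits) of the topic distribution among citing papers.

    citing_topics: iterable with one topic label per citing paper.
    """
    counts = Counter(citing_topics)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical examples: citations spread over four topics vs. a single topic.
high = topic_diversity(["nlp", "vision", "theory", "systems"])
low = topic_diversity(["nlp"] * 5)
```

A paper cited evenly from many topics gets high entropy, while one cited from a single topic gets zero.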

Some New Work on Topic Models

Robustly capturing topic correlations: Pachinko Allocation Model

Capturing phrases in topic-specific ways: Topical N-Gram Model

Pachinko Machine


Pachinko Allocation Model [Li, McCallum, 2005]

(Model structure, not the graphical model)

[Figure: directed acyclic graph with root node 11; interior nodes 21, 22, 31, 32, 33, 41, 42, 43, 44, 45; and leaves word1 through word8.]

Distributions over words (like “LDA topics”)

Distributions over topics; mixtures, representing topic correlations

Distributions over distributions over topics...

Some interior nodes could contain one multinomial, used for all documents. (i.e. a very peaked Dirichlet)
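A simplified way to read the structure above is as a generative walk from the root down to a word. The sketch below assumes a four-level hierarchy (root, super-topics, sub-topics, words) in which the root and super-topics draw per-document multinomials from Dirichlets, and sub-topics hold one fixed multinomial over words shared by all documents. All sizes and parameter values are hypothetical, and the general model allows arbitrary DAGs.

```python
import random

def dirichlet(alphas, rng):
    """Sample from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def choose(probs, rng):
    """Sample an index according to a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_document(n_words, root_alpha, super_alphas, sub_topic_words, vocab, rng):
    """Four-level sketch: root -> super-topic -> sub-topic -> word."""
    theta_root = dirichlet(root_alpha, rng)                  # per-document, over super-topics
    theta_super = [dirichlet(a, rng) for a in super_alphas]  # per-document, over sub-topics
    words = []
    for _ in range(n_words):
        s = choose(theta_root, rng)          # pick a super-topic
        z = choose(theta_super[s], rng)      # pick a sub-topic under it
        w = choose(sub_topic_words[z], rng)  # pick a word from the sub-topic
        words.append(vocab[w])
    return words

# Hypothetical parameters: 2 super-topics, 3 sub-topics, 4 vocabulary words.
VOCAB = ["estimation", "bayesian", "kernel", "network"]
ROOT_ALPHA = [1.0, 1.0]
SUPER_ALPHAS = [[5.0, 1.0, 0.1], [0.1, 1.0, 5.0]]
SUB_TOPIC_WORDS = [[0.7, 0.3, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25], [0.0, 0.0, 0.4, 0.6]]

doc = sample_document(10, ROOT_ALPHA, SUPER_ALPHAS, SUB_TOPIC_WORDS, VOCAB, random.Random(0))
```

Here the super-topics' Dirichlet parameters determine which sub-topics tend to co-occur in a document, which is how the model represents topic correlations.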

Topic Coherence Comparison

LDA 20 (“models, estimation, stopwords”): models, model, parameters, distribution, bayesian, probability, estimation, data, gaussian, methods, likelihood, em, mixture, show, approach, paper, density, framework, approximation, markov

LDA 100 (“estimation, some junk”): estimation, likelihood, maximum, noisy, estimates, mixture, scene, surface, normalization, generated, measurements, surfaces, estimating, estimated, iterative, combined, figure, divisive, sequence, ideal

PAM 100 (“estimation”): estimation, bayesian, parameters, data, methods, estimate, maximum, probabilistic, distributions, noise, variable, variables, noisy, inference, variance, entropy, models, framework, statistical, estimating

Example super-topic

33   input hidden units function number
27   estimation bayesian parameters data methods
24   distribution gaussian markov likelihood mixture
11   exact kalman full conditional deterministic
1    smoothing predictive regularizers intermediate slope

Topic Correlations in PAM

5000 research paper abstracts, from across all CS

Numbers on edges are supertopics’ Dirichlet parameters

Likelihood Comparison

Varying number of topics

Want to Model Trends over Time

• Is prevalence of topic growing or waning?

• Pattern appears only briefly

– Capture its statistics in focused way

– Don’t confuse it with patterns elsewhere in time

• How do roles, groups, influence shift over time?

Topics over Time (TOT) [Wang, McCallum 2006]

[Figure: two plate-diagram views of the model. Each of the D documents has a multinomial over topics with a Dirichlet prior. Each of its N_d tokens has a topic index z, a word w drawn from that topic's multinomial over words (Dirichlet prior, one per topic, T topics), and a time stamp t drawn from that topic's Beta over time (uniform prior, one per topic). The second view is labeled with a distribution on time stamps.]
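Since each topic in TOT has a Beta distribution over normalized time stamps, Bayes' rule gives a distribution over topics conditioned on time: P(z | t) ∝ P(z) · Beta(t; a_z, b_z). The sketch below illustrates this with hypothetical topic priors and Beta parameters; none of the values come from the paper.

```python
import math

def beta_pdf(t, a, b):
    """Density of Beta(a, b) at t, for 0 < t < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))

def topic_given_time(t, topic_prior, beta_params):
    """P(z | t) proportional to P(z) * Beta(t; a_z, b_z)."""
    joint = [p * beta_pdf(t, a, b) for p, (a, b) in zip(topic_prior, beta_params)]
    z = sum(joint)
    return [j / z for j in joint]

# Hypothetical setup: topic 0 peaks early in the timeline, topic 1 peaks late.
PRIOR = [0.5, 0.5]
BETAS = [(2.0, 5.0), (5.0, 2.0)]

early = topic_given_time(0.2, PRIOR, BETAS)
late = topic_given_time(0.8, PRIOR, BETAS)
```

Time stamps must be rescaled to lie strictly between 0 and 1 before calling `beta_pdf`, since the logarithms are undefined at the endpoints.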

State of the Union Address

208 Addresses delivered between January 8, 1790 and January 29, 2002.

To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stopping was applied.

• 17,156 ‘documents’

• 21,534 words

• 669,425 tokens

Our scheme of taxation, by means of which this needless surplus is taken from the people and put into the public Treasury, consists of a tariff or duty levied upon importations from abroad and internal-revenue taxes levied upon the consumption of tobacco and spirituous and malt liquors. It must be conceded that none of the things subjected to internal-revenue taxation are, strictly speaking, necessaries. There appears to be no just complaint of this taxation by the consumers of these articles, and there seems to be nothing so well able to bear the burden without hardship to any portion of the people.

1910

Comparing TOT against LDA

Topic Distributions Conditioned on Time

[Figure: topic distributions over time for NIPS vol. 1-14. Horizontal axis: time. Vertical axis: topic mass (in vertical height).]
