Identity Resolution in Email Collections Tamer Elsayed and Douglas W. Oard CLIP Colloquium, UMD, Feb...
-
date post
22-Dec-2015 -
Category
Documents
-
view
219 -
download
3
Transcript of Identity Resolution in Email Collections Tamer Elsayed and Douglas W. Oard CLIP Colloquium, UMD, Feb...
Identity Resolution in Email CollectionsIdentity Resolution in Email Collections
Tamer Elsayed and Douglas W. Oard
CLIP Colloquium, UMD, Feb 2009CLIP Colloquium, UMD, Feb 2009
Department of Computer Science, UMIACS, and iSchool
2
Identity Resolution in Email Collections
Real Problem
National ArchivesNational Archives
Clinton Clinton White HouseWhite House Tobacco Tobacco
PolicyPolicy
search search requestrequest
hired 25 hired 25 personspersons
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~32 million
emails
200,000
80,000
for 6 months …
3
Identity Resolution in Email Collections
Date: Wed Dec 20 08:57:00 EST 2000From: Kay Mann <[email protected]>To: Suzanne Adams <[email protected]>Subject: Re: GE Conference Call has be rescheduled
Did Sheila want Scott to participate? Looks like the
call will be too late for him.
Sheila
Identity Resolution in Email
WHO?WHO?WHO?WHO?
4
Identity Resolution in Email Collections
Enron Collection
-----Original Message-----From: [email protected]@ENRONSent: Monday, July 30, 2001 2:24 PMTo: Sager, Elizabeth; Murphy, Harlan; [email protected]; [email protected]: [email protected]: Shhhh.... it's a SURPRISE !
Message-ID: <1494.1584620.JavaMail.evans@thyme>Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)From: [email protected]: [email protected]: RE: Shhhh.... it's a SURPRISE !X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>X-To: '[email protected]@ENRON'
Hope all is well.Count me in for the group present.See ya next week if not earlier
Please call me (713) 207-5233
Liza
Elizabeth Sager713-853-6349
Hi Shari
Thanks!
Shari
55 Sheila’s !!55 Sheila’s !!weisman
pardoglover
richjones
breedenhuckaby
tweedmcintyrechadwick
birminghamkahanekforakertasmanfisherpetitt
DomboRobbinschang
jarnotkirby
knudsenboehringer
lutzgloverwollamjortnerneylon
whangernagel
gravesmclaughlin
venvillerappazzo
millerswatekhollis
maynesnacey
ferrarinidey
macleodhowarddarlingwatsonperlickadvanihesterkennerlewis
waltonwhitmanberggrenosowski
kelly
Rank Rank CandidatesCandidates
Rank Rank CandidatesCandidates
5
Identity Resolution in Email Collections
Generative Model
1. Choose “personperson” c to mention
p(c)
2. Choose appropriate “contextcontext” X to mention c
p(X | c)
3. Choose a “mentionmention” l
p(l | X, c)““sheila”sheila”
GEGEconferenceconference
callcall
6
Identity Resolution in Email Collections
3-Step Solution(1) Identity(1) Identity ModelingModeling
Posterior DistributionPosterior Distribution
(3) Mention Resolution(3) Mention Resolution
(2) Context Reconstruction(2) Context Reconstruction
7
Identity Resolution in Email Collections
Outline
Introduction and Approach Overview Computational Model of Identity Context Reconstruction Mention Resolution Evaluation on Existing Collections Scalable MapReduce Implementation New Test Collection Conclusion and Future Work
8
Identity Resolution in Email Collections
“Easy References” of Identity
-----Original Message-----From: [email protected]@ENRONSent: Monday, July 30, 2001 2:24 PMTo: Sager, Elizabeth; Murphy, Harlan; [email protected]; [email protected]: [email protected]: Shhhh.... it's a SURPRISE !
Message-ID: <1494.1584620.JavaMail.evans@thyme>Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)From: [email protected]: [email protected]: RE: Shhhh.... it's a SURPRISE !X-From: Sager, Elizabeth </O=ENRON/OU=NA/CN=RECIPIENTS/CN=ESAGER>X-To: '[email protected]@ENRON'
Hope all is well.Count me in for the group present.See ya next week if not earlier
Please call me (713) 207-5233
Liza
Elizabeth Sager713-853-6349
Hi Shari
Thanks!
Shari
Email Email StandardsStandards
Email-Client Email-Client BehaviorBehavior
User User RegularitiesRegularities
9
Identity Resolution in Email Collections
Representational Model of Identity
77,240 “non-trivial” models
14 (Quoted Headers)
sheila glover
932 (Main Headers)
sheila
19 (Salutation)
216 (Signature)
sg19 (Signature)
sheila glover
1170 (User Name)
Representational ModelRepresentational Model
10
Identity Resolution in Email Collections
Computational Model of Identity
c
m
t
identity
observed mention
name type
Tt
ctfreq
ctfreqctp
'
),'(
),()|(
)('
),,'(
),,(),|(
cassocl
ctmfreq
ctmfreqctmp
Tt
ctpctmpcmp )|(),|()|(
Cc
cassoc
cassoccp
'
)'(
)()(
11
Identity Resolution in Email Collections
Identity Models
Candidates
Candidates
Likelihood: p Likelihood: p ( “sheila” | ( “sheila” | cc))
12
Identity Resolution in Email Collections
Outline
Introduction and Approach Overview Computational Model of Identity Context ReconstructionContext Reconstruction Mention Resolution Evaluation on Existing Collections Scalable MapReduce Implementation New Test Collection Conclusion and Future Work
13
Identity Resolution in Email Collections
Contextual Space
LocalLocalContextContext
Conversational Conversational ContextContext
Topical ContextTopical Context
14
Identity Resolution in Email Collections
Topical Context
Date: Fri Dec 15 05:33:00 EST 2000From: [email protected]: vince j kaminski <[email protected]>Cc: sheila walton <[email protected]>Subject: Re: Grant Masson
Great news. Lets get this moving along. Sheila, can you work out GE letter?
Vince, I am in London Monday/Tuesday, back Weds late. I'll ask Sheila to fix this for you and if you need me call me on my cell phone.
Date: Wed Dec 20 08:57:00 EST 2000From: Kay Mann <[email protected]>To: Suzanne Adams <[email protected]>Subject: Re: GE Conference Call has be rescheduled
Did Sheila want Scott to participate? Looks like the call will be too late for
him.
Sheila call
Sheila
call
GE
GE
15
Identity Resolution in Email Collections
Contextual Space
Social ContextSocial Context
LocalLocalContextContext
Conversational Conversational ContextContext
Topical ContextTopical Context
16
Identity Resolution in Email Collections
Date: Wed Dec 20 08:57:00 EST 2000From: Kay Mann <[email protected]>To: Suzanne Adams <[email protected]>Subject: Re: GE Conference Call has be rescheduled
Did Sheila want Scott to participate? Looks like the call will be too late for
him.
Social Context
Date: Tue, 19 Dec 2000 07:07:00 -0800 (PST)From: [email protected]: [email protected] Subject: ESA Option Execution
KayCan you initial the ESA assignment and assumption agreement or should I ask Sheila Tweed to do it? I believe she is currently en route from Portland.Thanks, Rebecca
Sheila Tweed
17
Identity Resolution in Email Collections
Formally
A context of an email is a probability probability distributiondistribution over emails
Probability estimated based on type of context
Contextual Space is a linear combination of 4 contexts
))(|( ikj exep
kx
18
Identity Resolution in Email Collections
Context Expansion
topical
time
people
content
social
conversationallocal
Temporal similarity affects social and topical similarity
19
Identity Resolution in Email Collections
Temporal Similarity
Decay over time Gaussian and Linear functions Time difference / Rank
20
Identity Resolution in Email Collections
Social Similarity
Two sets of participants (email adresses) Binary, Overlap, Jacaard, Both
21
Identity Resolution in Email Collections
Temporal Effect
Temporal Sim Pure Social Sim
Social SimNormalize
Social Context
22
Identity Resolution in Email Collections
Topical Similarity
Standard IR Similarity: BM25
Email as a DOCUMENT? Subject Body (+Subject) Root of thread Concatenated path to root
Combined similarly with temporal similarity
reply / forward
23
Identity Resolution in Email Collections
Contextual Space (emails)
Social ContextSocial Context
LocalLocalContextContext
Conversational Conversational ContextContext
Topical ContextTopical Context
24
Identity Resolution in Email Collections
Contextual Space (mentions)
“Sheila”
social
conversational
social
topical
social
topical
topical
“Sheila Tweed”
“sheila”
“sg”
“Sheila Walton”
“Sheila”
25
Identity Resolution in Email Collections
Outline
Introduction and Approach Overview Computational Model of Identity Context Reconstruction Mention ResolutionMention Resolution Evaluation on Existing Collections Scalable MapReduce Implementation New Test Collection Conclusion and Future Work
26
Identity Resolution in Email Collections
Mention Resolution
Candidates
Likelihood: p Likelihood: p ( “sheila” | ( “sheila” | cc))
Goal: estimate p(c|m, X(m)) and rank accordingly
Date: Wed Dec 20 08:57:00 EST 2000From: Kay Mann <[email protected]>To: Suzanne Adams <[email protected]>Subject: Re: GE Conference Call has be rescheduled
Did Sheila want Scott to participate? Looks like the call will be too late for
him.
Sheila
11
22 33
??
27
Identity Resolution in Email Collections
[1] Context-Free Resolution
“Sheila”
social
conversational
social
topical
social
topical
topical
“Sheila Tweed”
“sheila”
“sg”
“Sheila Walton”
“Sheila”
“Sheila”
X
Context-FreeResolution
28
Identity Resolution in Email Collections
[2] Contextual Resolution
“Sheila”
social
social social
topical
“Sheila Tweed”
“sheila”
“sg”
“Sheila Walton”
“Sheila”
Context-FreeResolution
29
Identity Resolution in Email Collections
Outline
Introduction and Approach Overview Computational Model of Identity Context Reconstruction Mention Resolution Evaluation on Existing CollectionsEvaluation on Existing Collections Scalable MapReduce Implementation New Test Collection Conclusion
30
Identity Resolution in Email Collections
Test Collections
Collection Emails Identities Mention Candidates
Queries Min. Avg. Max.
Sager 1,628 627 51 1 4 11
Shapiro 974 855 49 1 8 21
Enron-subset 54,018 27,340 78 1 152 489
Enron-all 248,451 123,783 78 3 518 1785
Sager
Shapiro
Enron-subsetEnron-all
31
Identity Resolution in Email Collections
Evaluation Measures
Commonly used in “known-item” retrieval
Success @1 (i.e., Precision @1) One-best
MRR (Mean Reciprocal Rank) Inverse of the harmonic mean of the ranks of true
answer ri
n
i irnMRR
1
11
32
Identity Resolution in Email Collections
Comparison w/Literature
MRRMRR Success @ 1Success @ 1
ContextContext Lit.Lit. ContextContext Lit.Lit.
CollectionCollection ExpansionExpansion BestBest ExpansionExpansion BestBest
Sager 0.911 0.889 0.863 0.804
Shapiro 0.913 0.879 0.878 0.779
Enron-subset 0.91 - 0.846 (0.82)
Enron-all 0.89 - 0.821 -
ContextContextExpansionExpansion
Lit.Lit.BestBest
ContextContextExpansionExpansion
Lit.Lit.BestBest
Earlier expansion approach, reported in ACL 2008
Improvedexpansion
0.870.92
33
Identity Resolution in Email Collections
Limitations
Resolving single mentions
All mention-queries are sampled from Enron to Enron emails
All mention-queries refer to Enron Employee Small for train/test split
Scalable Implementation for Resolving All Mentions
New Test Collection
34
Identity Resolution in Email Collections
Outline
Introduction and Approach Overview Computational Model of Identity Context Reconstruction Mention Resolution Evaluation on Existing Collections Scalable MapReduce ImplementationScalable MapReduce Implementation New Test Collection Conclusion
35
Identity Resolution in Email Collections
Scalable Implementation
Two Bottlenecks:1. Context expansion of ALL emails
For each email: ranked list of “Similar” emails
2. Resolution of ALL mentionsResolution of one mention depends on resolution of all other mentions in context
36
Identity Resolution in Email Collections
Context Expansion of ALL Emails
Goal: For each email: ranked list of “Similar” emails Need for BOTH social and topical contexts Efficient implementation
Abstract Problem:Computing Pairwise Similarity
37
Identity Resolution in Email Collections
Trivial Solution
load each vector o(N) times load each term o(dft2) times
scalable and efficient solutionfor large collections
Goal
38
Identity Resolution in Email Collections
Better Solution
Load weights for each term once Each term contributes o(dft2) partial scores
Each term contributes only if appears in
39
Identity Resolution in Email Collections
MapReduce Framework
mapmap
mapmap
mapmap
mapmap
reducereduce
reducereduce
reducereduce
input
input
input
input
output
output
output
ShufflingShuffling
group values group values by: [by: [keyskeys]]
(a) Map(a) Map (b) Shuffle(b) Shuffle (c) Reduce(c) Reduce
handles low-level details transparentlytransparently
(k2, [v2])(k1, v1)
[(k3, v3)][k2, v2]
40
Identity Resolution in Email Collections
reducereduce
Decomposition
Load weights for each term once Each term contributes o(dft2) partial scores
Each term contributes only if appears in
mapmap
41
Identity Resolution in Email Collections
Expansion Using MapReduce
Using generic pairwise-similarity for both topical and social expansion
~~~~~~~~~~~~
~~~~~~~~~~~~
~~~~~~~~~~~--
~~~~~~~~~~~~
~~~~~~~~~~~~
doc rep.
time window
rankcut-off
df-cut
topical : body/root/pat
hsocial : participants
temporal sim model
contextsim model
contextgraph
42
Identity Resolution in Email Collections
Context Mention-Graph
“Sheila”
social
conversational
social
topical
social
topical
topical
“Sheila Tweed”
“sheila”
“sg”
“Sheila Walton”
“Sheila”
Context-FreeResolution
mapmap
mapmap
mapmap
mapmap
reducereduce
43
Identity Resolution in Email Collections
Resolution System Using MapReduce
EmailsThreads Identity Models
Social Expansion
Conv.Expansion
Topical Expansion
LocalExpansion
Local Graph Social GraphTopical GraphConv. Graph
Mention Recognition and
Prior Computation
Prior Resolution
Posterior Resolution
Social Resolution
Conv.Resolution
Topical Resolution
LocalResolution
Merging Context Resolutions
PriorPriorPrior
Dict.
Exp
ansi
on
Exp
ansi
on
Res
olu
tio
nR
eso
luti
on
Pac
kin
gP
acki
ng
PreprocessingPreprocessing
44
Identity Resolution in Email Collections
Outline
Introduction and Approach Overview Computational Model of Identity Context Reconstruction Mention Resolution Evaluation on Existing Collections Scalable MapReduce Implementation New Test CollectionNew Test Collection Conclusion
45
Identity Resolution in Email Collections
New Test Collection
Random Sample from CMU-Enron collection “Annotation + Search” interface available Total annotators: 3 Annotation time: ~50 hours. Not only resolutions
Time, difficulty, confidence, evidence, and comments
Total mention-queries : 584 80% resolvable, 82% of them to Enron domain Overall inter-annotator agreement: ~81 %
46
Identity Resolution in Email Collections
Mention-Query Selection
47
Identity Resolution in Email Collections
Distribution of Names Based on Resolution
Enron-Resolvable390 (66%)
Non-Enron-Resolvable
80 (14%)
Unresolvable114 (20%)
Probably- Enron 24 (4%)
Probably-Non-Enron
62 (11%)
Unknown28 (5%)
48
Identity Resolution in Email Collections
Distribution Based on Difficulty
843
39
108
33
239
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Enron-Resolvable Non-Enron-Resolvable
Hard Moderately Hard Easy
49
Identity Resolution in Email Collections
Distribution Based on Confidence
411 2
31
4315
79
33663
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Enron-Resolvable Non-Enron-Resolvable
Unresolvable
Not Confident Somewhat Confident Very Confident
50
Identity Resolution in Email Collections
Distribution of Time Spent
R2 = 0.98
R2 = 0.95
R2 = 0.88
0%
2%
4%
6%
8%
10%
12%
14%
16%
18%
20%
22%
0 1 2 3 4 5 6 7 8 9 10 11 12
Time (minutes)
Enron-Resolvable
Non-Enron-Resolvable
Unresolvable
Enron
Non-Enron
Unresolvable
51
Identity Resolution in Email Collections
Evaluation again …
0.90
0.44
0.690.65
0.84
0.47
0.760.71
0.92
0.59
0.800.76
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
Enron Non-Enron Enron All
MR
R
topical social combination
Old Collection New Collection
52
Identity Resolution in Email Collections
Pairwise Agreement
195
50 190
199
16/16 (100%)
2/4 (50%)
4/7 (57%)
50
27
23
12/16 (75%)
1/5 (20%)
2/2 (100%)
35/38 (92%)
2/2 (100%)
6/10 (60%)
24/27 (89%)
5/12 (42%)
11/11 (100%)
Enron-resolvable
Non-enron-resolvable
Unresolvable
a3
a2
a1
53
Identity Resolution in Email Collections
Individual Annotator Agreement
6/9
3/9
28/32
6/10
37/40 2/224/27
5/12
11/11
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Enron-Resolvable Non-enron-Resolvable
Unresolvable
Ag
reem
ent
a1 a2 a3
54
Identity Resolution in Email Collections
Overall Agreement
0.90
0.43
0.810.77
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Enron-Resolvable
Non-enron-Resolvable
ResolvableOverall
Unresolvable
Ag
ree
me
nt
55
Identity Resolution in Email Collections
Agreement Based on Difficulty
57/59
5/7
30/38
5/16
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Enron-Resolvable Non-Enron-Resolvable
Ag
reem
ent
Easy Non-Easy
56
Identity Resolution in Email Collections
Agreement Based on Confidence
74/82
8/18
17/22
13/15
2/5
6/8
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Enron-Resolvable Non-Enron-Resolvable
Unresolvable
Ag
reem
ent
Very-Confident Not-Very-Confident
57
Identity Resolution in Email Collections
Conclusion and Future Work
Identity Resolution by non-participants is feasible And automatic systems for that can be built ~90-75% accurate
Proposed generative probabilistic model Context Expansion using temporal similarity Scalable Implementation using “Pairwise Sim with MapReduce”
Developed largest test collection for the task 80% resolvable, 82% of them to Enron employees
Effectiveness scales well to large collections
Efficiency Results Evaluation using double-assessments Iterative approach for “joint resolution”
58
Identity Resolution in Email Collections
Thank You!
59
Identity Resolution in Email Collections
Related Work
Diehl et al. (SIAM, 2006) Developed Enron-subset collection Temporal traffic models Candidates must have communicated with sender
Minkov et al. (SIGIR, 2006) Developed Sager and Shapiro collections Graphical framework Large collections?