Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Understanding email traffic David Graus, University of Amsterdam [email protected] @dvdgrs


Transcript of Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Page 1: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Understanding email traffic
David Graus, University of Amsterdam [email protected] @dvdgrs

Page 2: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Page 3: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Recipient recommendation

• Given a sender, an email, all possible recipients (in an enterprise);
• Predict which recipient(s) are most likely to receive the email

Page 4: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Why?

• Understanding the communication in, and the structure of, an enterprise
• Applications in:
    ◦ enterprise search
    ◦ expert finding
    ◦ community detection
    ◦ spam classification
    ◦ anomaly detection

Page 5: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

How?

• Gmail
    ◦ Who do you frequently “co-address”?
    ◦ Ego network
• Related work
    ◦ Social Network Analysis (SNA)
    ◦ Email content
• Us
    ◦ SNA + email content

Page 6: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Part 1: Social Network Analysis?

[Figure: three email addresses; the addresses are redacted in this transcript]

Page 7: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

image by Calvinius - Creative Commons Attribution-Share Alike 3.0

Page 8: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

SNA for predicting recipients?

1. Importance of a node in the network
   More important people are more likely to be the recipient of an email

2. Strength of connection between two nodes
   Given the sender of an email, the people that sender addresses frequently are more likely to be its recipients

Page 9: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

SNA for predicting recipients?

1. Importance of a node in the network
    1. Number of received emails
    2. PageRank score of node

2. Strength of connection between two nodes
    1. Number of emails sent between nodes
    2. Number of times two nodes are addressed together
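The four signals listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation behind the talk: the toy emails and addresses are made up, a connection is treated as an undirected pair, and `pagerank` is a plain power-iteration version with an assumed damping factor of 0.85.

```python
from collections import Counter
from itertools import combinations

# Toy data: (sender, [recipients]); all addresses are made up.
emails = [
    ("ann", ["bob", "cat"]),
    ("ann", ["bob"]),
    ("bob", ["ann"]),
    ("cat", ["bob", "ann"]),
]

received = Counter()      # node importance: number of received emails
sent_between = Counter()  # connection strength: emails exchanged by a pair
co_addressed = Counter()  # connection strength: pair addressed together
out_links = {}            # directed sender -> recipients graph for PageRank

for sender, recipients in emails:
    out_links.setdefault(sender, set()).update(recipients)
    for r in recipients:
        out_links.setdefault(r, set())
        received[r] += 1
        sent_between[frozenset((sender, r))] += 1
    for a, b in combinations(sorted(set(recipients)), 2):
        co_addressed[frozenset((a, b))] += 1


def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; dangling nodes spread rank evenly."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in links[v]:
                new_rank[target] += damping * rank[v] / len(links[v])
        rank = new_rank
    return rank


scores = pagerank(out_links)
```

In this toy network `received["bob"]` is 3 and the pair ann/bob has exchanged 3 emails, so both the importance and the connection-strength signals would favour bob as a recipient for ann.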

Page 10: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Part 2: Email content

• Statistical Language Models (LMs)
    ◦ Assign a probability to a sequence of words;
    ◦ Compute models for different corpora;
• Used in lots of places:
    ◦ Information Retrieval
    ◦ Machine Translation
    ◦ Speech Recognition
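To make "assign a probability to a sequence of words" concrete, here is a minimal sketch of a unigram language model. The slides do not say which model family or smoothing was used, so the unigram assumption, the add-one smoothing, and the two tiny corpora are all illustrative choices.

```python
import math
from collections import Counter


def train_unigram(texts):
    """Count word occurrences over a small corpus of whitespace-tokenised texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def log_prob(words, counts, vocab_size):
    """Log-probability of a word sequence under an add-one smoothed unigram LM."""
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in words
    )


# Two made-up corpora, each giving its own model.
legal = train_unigram(["the contract is signed", "please review the contract"])
social = train_unigram(["lunch at noon", "see you at the party"])
vocab = set(legal) | set(social)

query = "review the contract".split()
legal_score = log_prob(query, legal, len(vocab))
social_score = log_prob(query, social, len(vocab))
```

The same word sequence gets a higher log-probability under the model trained on the corpus it resembles, which is what makes models computed for different corpora comparable.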

Page 11: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Language Models

• Language models as communication “profiles”

Page 12: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Language Models

• Language models as communication “profiles”
    1. Incoming LM (how people talk to user)

Page 13: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Language Models

• Language models as communication “profiles”
    1. Incoming LM (how people talk to user)
    2. Outgoing LM (how user talks to people)

Page 14: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Language Models

• Language models as communication “profiles”
    1. Incoming LM (how people talk to user)
    2. Outgoing LM (how user talks to people)
    3. Interpersonal LM (how node1 talks with node2)

Page 16: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Language Models

• Language models as communication “profiles”
    1. Incoming LM (how people talk to user)
    2. Outgoing LM (how user talks to people)
    3. Interpersonal LM (how node1 talks with node2)
    4. Corpus LM (how everyone talks)
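The four profiles can be pictured as four sets of word counts filled from the same emails. The sketch below assumes whitespace tokenisation, made-up emails, and an undirected pair key for the interpersonal profile; none of these details come from the slides.

```python
from collections import Counter, defaultdict

# Toy emails: (sender, recipients, body); all values are made up.
emails = [
    ("ann", ["bob"], "budget meeting tomorrow"),
    ("bob", ["ann", "cat"], "budget approved"),
    ("cat", ["ann"], "lunch tomorrow"),
]

incoming = defaultdict(Counter)       # words in emails a user receives
outgoing = defaultdict(Counter)       # words in emails a user sends
interpersonal = defaultdict(Counter)  # words exchanged within a pair
corpus = Counter()                    # words in all emails

for sender, recipients, body in emails:
    words = body.lower().split()
    corpus.update(words)
    outgoing[sender].update(words)
    for recipient in recipients:
        incoming[recipient].update(words)
        interpersonal[frozenset((sender, recipient))].update(words)
```

Each counter can then be normalised (and smoothed) into a language model, so one email contributes to several profiles at once.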

Page 17: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Why language models?

• Comparisons between communication profiles:
    ◦ Find nodes with most similar communication
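One way to compare two communication profiles is to treat each as a word distribution and measure how far apart they are. The slides do not name a similarity measure; the sketch below assumes Jensen-Shannon divergence as a stand-in, with made-up profiles, and `most_similar` is a hypothetical helper.

```python
import math
from collections import Counter


def to_distribution(counts):
    """Turn raw word counts into relative frequencies."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0 for identical, 1 for disjoint."""
    def kl_to_mixture(a, b):
        return sum(
            pa * math.log2(pa / (0.5 * pa + 0.5 * b.get(w, 0.0)))
            for w, pa in a.items() if pa > 0
        )
    return 0.5 * kl_to_mixture(p, q) + 0.5 * kl_to_mixture(q, p)


# Made-up profiles for three users.
profiles = {
    "ann": Counter("budget meeting budget report".split()),
    "bob": Counter("budget report meeting agenda".split()),
    "cat": Counter("lunch party weekend lunch".split()),
}
dists = {name: to_distribution(c) for name, c in profiles.items()}


def most_similar(name):
    """Return the other node whose profile diverges least from `name`."""
    others = [n for n in dists if n != name]
    return min(others, key=lambda n: js_divergence(dists[name], dists[n]))
```

Here `most_similar("ann")` picks bob, whose vocabulary overlaps with ann's, rather than cat, whose profile shares no words with it.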

Page 18: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

SNA

1. Importance of a node in the network
2. Strength of connection between nodes

Email Content

1. Incoming LM
2. Outgoing LM
3. Interpersonal LM
4. Corpus-based LM

Page 19: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Approach: time-based

t=0  1 email, 2 addresses
t=1  2 emails, 2 addresses
t=2  3 emails, 4 addresses
t=3  4 emails, 5 addresses

etc…

t=n  607,011 emails, 2,068 addresses
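The time-based setup can be read as replaying the emails in chronological order while the network grows. The sketch below is one way to picture that, assuming one time step per email; the toy stream and the `network_growth` helper are hypothetical and chosen so the first four steps match the counts on the slide.

```python
def network_growth(emails):
    """Yield (t, emails seen, distinct addresses seen) in chronological order."""
    addresses = set()
    ordered = sorted(emails, key=lambda e: e[0])
    for t, (_, sender, recipients) in enumerate(ordered):
        addresses.add(sender)
        addresses.update(recipients)
        yield t, t + 1, len(addresses)


# Toy stream: (timestamp, sender, recipients); addresses are made up.
emails = [
    (1, "ann", ["bob"]),
    (2, "bob", ["ann"]),
    (3, "ann", ["cat", "dan"]),
    (4, "eve", ["ann"]),
]
growth = list(network_growth(emails))
```

At each step, only the emails seen so far would be available for building the network and the language models, which keeps the prediction for the next email from looking into the future.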

Page 20: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

At some time interval t

• Given the email, sender, and network
    ◦ Remove recipients from email
    ◦ Rank all nodes in the network
• By computing for each candidate (recipient) node:
    1. Importance of candidate
    2. Strength of connection between sender and candidate
    3. Similarity between sender and candidate LMs
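The ranking step can be sketched as scoring every candidate with the three signals and sorting. How the signals are actually combined is not stated on the slide, so the weighted sum, the equal default weights, and the pre-normalised toy values below are assumptions made purely for illustration.

```python
def rank_candidates(sender, nodes, importance, strength, similarity,
                    weights=(1.0, 1.0, 1.0)):
    """Rank every node except the sender by a weighted sum of three signals."""
    w_imp, w_str, w_sim = weights
    scored = []
    for candidate in nodes:
        if candidate == sender:
            continue
        score = (
            w_imp * importance.get(candidate, 0.0)
            + w_str * strength.get((sender, candidate), 0.0)
            + w_sim * similarity.get((sender, candidate), 0.0)
        )
        scored.append((score, candidate))
    # Highest score first; ties broken alphabetically for determinism.
    return [c for _, c in sorted(scored, key=lambda x: (-x[0], x[1]))]


# Hypothetical, already-normalised signal values in [0, 1].
nodes = ["ann", "bob", "cat", "dan"]
importance = {"bob": 0.5, "cat": 0.3, "dan": 0.2}
strength = {("ann", "bob"): 0.2, ("ann", "cat"): 0.7, ("ann", "dan"): 0.1}
similarity = {("ann", "bob"): 0.3, ("ann", "cat"): 0.6, ("ann", "dan"): 0.1}

ranking = rank_candidates("ann", nodes, importance, strength, similarity)
```

With these values cat outranks bob even though bob is the more important node, because the sender-specific signals (connection strength and LM similarity) outweigh global importance.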

Page 21: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Page 22: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Findings: what works for predicting recipients?

• Importance of node: Number of received emails of node
• Strength of connection: Number of emails between nodes
• LM similarity: Interpersonal LM is most important

Page 23: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Findings: SNA vs email content

• SNA:
    ◦ SNA signals deteriorate over time
    ◦ SNA signals are most informative on highly active users
• Email content:
    ◦ LM signal improves over time
    ◦ LM signal does worse with highly active users

Page 24: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Finally

• Combining Social Network Analysis with Language Modeling is better than doing either alone.

Page 25: Understanding Email Traffic (talk @ E-Discovery NL Symposium)

Why for E-Discovery

• Anomaly detection
    ◦ Given a working prediction model, identify “unexpected” communication
• Language models for communication
    ◦ For a node, find the most different interpersonal communication
        ▪ Friends/family vs colleagues?
    ◦ Find communication that differs from the corpus-based communication
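The anomaly-detection idea can be sketched as comparing what a prediction model expects with what actually happened. The helper below is hypothetical: it assumes the model returns a ranked list of candidate recipients and treats anything outside an assumed `top_k` cut-off as "unexpected".

```python
def unexpected_recipients(ranking, actual_recipients, top_k=2):
    """Return actual recipients that the model did not rank in its top_k."""
    expected = set(ranking[:top_k])
    return [r for r in actual_recipients if r not in expected]


# Hypothetical model output for one email, best candidate first.
ranking = ["cat", "bob", "dan", "eve"]
flagged = unexpected_recipients(ranking, ["bob", "eve"])
```

In this example eve is flagged, since the email went to someone the model ranked last; a reviewer could then look at such emails first.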