Enron Email data set mining

Authors: Avik Das and Jagriti Das, University of Connecticut. Dr. Fei Wang. Email data mining


Page 1: Enron Email data set mining

Authors: Avik Das and Jagriti Das, University of Connecticut

Dr. Fei Wang

Email data mining

Page 2: Enron Email data set mining

Enron email dataset: SQL tables

Enron email dataset

Enron email dataset: SQL dump

Refined SQL dump, with the noise eliminated, split into multiple views:

• Views containing the number of messages sent in 2000, 2001, and 2002
• Views containing the number of messages sent in 2000, 2001, and 2002 to external entities
• View containing the role of each employee
• Views containing the number of messages sent in 2000, 2001, and 2002 to lawyers

Noise:

• Employees having multiple email IDs
• Records of people other than the 151 employees listed in the fact sheet
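One way to remove both kinds of noise is to map every known address to a single employee id and drop records that involve unlisted people. The sketch below is only an illustration: the `ALIASES` table, the addresses in it, and the `clean` helper are hypothetical and are not taken from the slides.

```python
# Hypothetical alias table: every known address of an employee maps to one eid.
ALIASES = {
    "jane.doe@enron.com": 1,
    "jdoe@enron.com": 1,       # second address of the same person
    "john.roe@enron.com": 2,
}

def clean(records):
    """Keep only (sender, receiver) pairs where both sides are listed
    employees, and replace each address by its employee id."""
    cleaned = []
    for sender, receiver in records:
        s = ALIASES.get(sender.strip().lower())
        r = ALIASES.get(receiver.strip().lower())
        if s is not None and r is not None:
            cleaned.append((s, r))
    return cleaned

raw = [
    ("jdoe@enron.com", "john.roe@enron.com"),
    ("Jane.Doe@enron.com", "john.roe@enron.com"),
    ("stranger@example.com", "john.roe@enron.com"),  # not in the employee list
]
pairs = clean(raw)  # both addresses of employee 1 collapse to eid 1
```

A filter like this would apply only to the employee-to-employee data. The views for external entities need the unlisted recipients kept.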

Page 3: Enron Email data set mining

Database schema

• First slide: Primary schema
• Second slide: Role view based on the primary schema
• Third slide: Sent Messages views based on the primary schema
• Fourth slide: Sent Messages views based on the primary schema
• Fifth slide: Sent Messages views based on the primary schema

Page 4: Enron Email data set mining

employeelist: Email_id (PK), First Name, Second name, Eid

Message: Mid (PK), Sender, date, message_id, subject, body, folder

recipient info: rid (PK), mid (FK), rvalue, rtype, date

The Mid in Message is present as a foreign key (mid) in recipient info.

Page 5: Enron Email data set mining

employeelist: Email_id (PK), First Name, Second name, Eid

Role: Eid (PK), First Name (FK), Second Name (FK), Role

Fact Sheet

The Role view maps each employee id to a first and last name.

The Role view maps the role from the fact sheet.

Page 6: Enron Email data set mining

Message: Mid (PK), Sender, date, message_id, subject, body, folder

send2002: rvalue (PK), date, count. Contains the count of messages sent in 2002.

send2001: rvalue (PK), date, count. Contains the count of messages sent in 2001.

send2000: rvalue (PK), date, count. Contains the count of messages sent in 2000.
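The slides do not show the SQL behind the per-year views. A minimal sketch of how such views could be defined is given here with Python's built-in `sqlite3` module and an in-memory database. The sample rows, the `msg_count` column name, and the choice to group by sender are assumptions made for illustration (the slides list `rvalue` as the key of each view).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE message (mid INTEGER PRIMARY KEY, sender TEXT, date TEXT,
                      message_id TEXT, subject TEXT, body TEXT, folder TEXT);
INSERT INTO message (mid, sender, date) VALUES
    (1, 'a@enron.com', '2000-03-01'),
    (2, 'a@enron.com', '2000-07-15'),
    (3, 'b@enron.com', '2000-09-09'),
    (4, 'a@enron.com', '2001-01-20');
-- One view per year (grouping by sender is an assumption)
CREATE VIEW send2000 AS
    SELECT sender, COUNT(*) AS msg_count FROM message
    WHERE strftime('%Y', date) = '2000' GROUP BY sender;
CREATE VIEW send2001 AS
    SELECT sender, COUNT(*) AS msg_count FROM message
    WHERE strftime('%Y', date) = '2001' GROUP BY sender;
""")

rows_2000 = conn.execute(
    "SELECT sender, msg_count FROM send2000 ORDER BY sender").fetchall()
rows_2001 = conn.execute(
    "SELECT sender, msg_count FROM send2001 ORDER BY sender").fetchall()
```

The external-entity and lawyer views could follow the same pattern with an extra condition on the recipient in the `WHERE` clause.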

Page 7: Enron Email data set mining

Message: Mid (PK), Sender, date, message_id, subject, body, folder

send_ext_2002: rvalue (PK), date, count. Contains the count of messages sent in 2002 to recipients external to Enron.

send_ext_2001: rvalue (PK), date, count. Contains the count of messages sent in 2001 to recipients external to Enron.

send_ext_2000: rvalue (PK), date, count. Contains the count of messages sent in 2000 to recipients external to Enron.

Page 8: Enron Email data set mining


Message: Mid (PK), Sender, date, message_id, subject, body, folder

send_law_2002: rvalue (PK), date, count. Contains the count of messages sent in 2002 to lawyers.

send_law_2001: rvalue (PK), date, count. Contains the count of messages sent in 2001 to lawyers.

send_law_2000: rvalue (PK), date, count. Contains the count of messages sent in 2000 to lawyers.

Page 10: Enron Email data set mining

Receive Matrix

Receive matrix H[i,j]:
i - employee id of receiver
j - employee id of sender

Receiver ID   Receiver Mail       Sender ID   Sender Mail
6             [email protected]   4           [email protected]
6             [email protected]   4           [email protected]

SQL dump

Receiver ID   Receiver Mail       Sender ID   Sender Mail
2             [email protected]   2           [email protected]
2             [email protected]   2           [email protected]

SQL dump - Noise
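The receive matrix can be built by counting (receiver id, sender id) pairs. The sketch below uses the four sample rows from this slide. It assumes that the noise rows are the ones where sender and receiver are the same employee, which the slide does not state explicitly. The `receive_matrix` helper is hypothetical.

```python
def receive_matrix(pairs, n):
    """H[i][j] = number of messages that employee i received from employee j.
    Employee ids are 1-based, so row and column 0 are left unused."""
    H = [[0] * (n + 1) for _ in range(n + 1)]
    for receiver, sender in pairs:
        if receiver == sender:
            continue  # assumption: self-addressed rows are treated as noise
        H[receiver][sender] += 1
    return H

# (receiver id, sender id) pairs from the two sample tables on this slide
pairs = [(6, 4), (6, 4), (2, 2), (2, 2)]
H = receive_matrix(pairs, n=6)
```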

Page 11: Enron Email data set mining


Steps to find the CEO: Step 1

From the receive matrix, for each row (receiver), find the sender or senders who have sent the minimum number of mails.

Receive matrix: for employee 2 (row 2), the minimum is zero and is found at columns 1, 4, and 5.

New matrix C: for each row, replace all the minimum values with 999 and all other values with 0.
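Step 1 can be sketched in a few lines of Python. The 5x5 matrix here is toy data invented for illustration. Only its second row mirrors the slide's example, with a minimum of zero at columns 1, 4, and 5.

```python
VOTE = 999  # marker value used on the slide

def vote_matrix(H):
    """For each row (receiver), mark the sender(s) holding the row minimum
    with 999 and set every other entry to 0."""
    C = []
    for row in H:
        low = min(row)
        C.append([VOTE if v == low else 0 for v in row])
    return C

H = [
    [1, 4, 2, 6, 3],
    [0, 2, 7, 0, 0],  # row 2: minimum 0 at columns 1, 4, 5
    [5, 1, 1, 8, 2],
    [3, 3, 0, 4, 9],
    [2, 6, 5, 1, 7],
]
C = vote_matrix(H)
```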

Page 12: Enron Email data set mining

Steps to find the CEO: Step 2

From the new matrix C, find for each employee how many times they were voted as parent.

Count the number of 999s in each column; this gives the number of times that employee was voted as parent.

Find the maximum number of 999s over all the columns.
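The column count and its maximum can be sketched as follows. The matrix C is toy data, and `parent_votes` is a hypothetical helper name.

```python
VOTE = 999

def parent_votes(C):
    """Count the 999 entries in each column of C."""
    return [sum(1 for row in C if row[j] == VOTE) for j in range(len(C[0]))]

C = [
    [999, 0, 0, 0, 0],
    [999, 0, 0, 999, 999],
    [0, 999, 999, 0, 0],
    [0, 0, 999, 0, 0],
    [0, 0, 0, 999, 0],
]
votes = parent_votes(C)
top = max(votes)
# 1-based employee ids that reach the maximum vote count
candidates = [j + 1 for j, v in enumerate(votes) if v == top]
```

In this toy matrix three employees tie for the maximum, so the vote count alone does not always single out one candidate.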

Page 13: Enron Email data set mining

Steps to find the CEO: Step 3

Get the maximum number of times an employee could be voted as parent.

The maximum value comes out to 150.

New send index matrix D[i,j]:
i - employee id
j - number of times the employee was voted as parent

Page 14: Enron Email data set mining

Final Step for CEO

Several employees have the maximum number of parent votes.

Noise

Send index matrix

Fact sheet

Emp-ID: 129 Jeffrey Skilling

Emp-ID: 127 Kenneth Lay

Eliminating noise

Page 15: Enron Email data set mining

Proposed Hierarchy from CEO find algorithm

Jeffrey Skilling: President & CEO

Unknown

Unknown

Page 16: Enron Email data set mining

Levels of Hierarchy: First Children

For each employee, get the maximum number of messages sent.

For emp-1, the maximum number of messages sent was 3, so the possible first child would be 73.

Send Matrix

First Child

Page 17: Enron Email data set mining

Levels of Hierarchy: Second Children

For each employee, get the second maximum number of messages sent.

For emp-1, the maximum number of messages sent was 3 and the next maximum value was 2, so the possible second children would be 17 and 53.

Send Matrix

Second Child

Page 18: Enron Email data set mining

Levels of Hierarchy: Third Children

For each employee, get the third maximum number of messages sent.

For emp-2, the third maximum number of messages sent was 44, so the possible third child would be 19.

Send Matrix

Third Child

Page 19: Enron Email data set mining

Levels of Hierarchy: Fourth Children

For each employee, get the fourth maximum number of messages sent.

For emp-2, the fourth maximum number of messages sent was 29, so the possible fourth child would be 4.

Send Matrix

Fourth Child
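The four "children" slides follow one pattern: for each sender, take the k-th largest message count in that sender's row of the send matrix and list the recipients holding it. The sketch below generalizes this to any k. It assumes that ties share a level, as in the emp-1 example where two recipients share the count 2, and that zero counts are ignored. The send matrix S is toy data, and `kth_children` is a hypothetical helper.

```python
def kth_children(S, k):
    """For each sender (row of S), return the recipients that received the
    k-th largest distinct positive message count. Ids are 1-based."""
    result = {}
    for i, row in enumerate(S, start=1):
        counts = sorted({v for v in row if v > 0}, reverse=True)
        if k <= len(counts):
            target = counts[k - 1]
            result[i] = [j for j, v in enumerate(row, start=1) if v == target]
        else:
            result[i] = []  # this sender has fewer than k distinct counts
    return result

S = [
    [0, 3, 2, 2, 1],  # emp-1: max 3 -> child 2; next max 2 -> children 3, 4
    [6, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [2, 2, 0, 0, 9],
    [1, 0, 4, 0, 0],
]
first = kth_children(S, 1)
second = kth_children(S, 2)
```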

Page 20: Enron Email data set mining

Sample Level of Hierarchy

Jeffrey Skilling - 129, CEO

Kenneth Lay - 127, CEO

Greg Whalley - 54, President

John Arnold - 44, Vice president

Jeffrey Shankman - 36, President

Andy Zipper - 78, Vice president

John Lavorato - 53, CEO

Louise Kitchen - 107, President

Barry Tycholiz - 38, Vice president

Page 21: Enron Email data set mining

Network Graph: Between CEOs

Jeffrey Skilling

John Lavorato

Kenneth Lay

David Delainey

The number of inter-communications between CEOs is quite low.

The network traffic is quite low with respect to messages sent and received.

Page 22: Enron Email data set mining

Network Graph: Between managers/vice-presidents/presidents

Sample data

The number of inter-communications between mid-level employees increases as we go down from the CEO level.

Page 23: Enron Email data set mining

Network Graph: Between employees

Sample data

The number of inter-communications between employees is the highest amongst all the tiers.

Page 24: Enron Email data set mining

Network Graph: Intra-communication, CEO-Managers and Employee-Managers

CEO-Managers: The number of intra-communications between the CEO level and mid-level employees is quite high.

Managers-Employees: The number of intra-communications between the manager level and lower-level employees is the highest.

Page 25: Enron Email data set mining

Ratio of communication between different levels

Levels: CEO, Manager, Employee

Manager->CEO

CEO->Manager

Manager->Employee

Employee->CEO

Page 26: Enron Email data set mining

Ratio of communication between different levels

Send Matrix

Function

Sub send matrices:

• CEO->Manager
• Manager->CEO
• Manager->Employee
• Employee->Manager

Compute the ratio of the total number of messages sent for the different sub send matrices.
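One possible way to obtain these totals is to sum the send matrix into one block per pair of levels and then divide by the overall total. The sketch below assumes each employee id has already been assigned a level. The `level` mapping, the toy send matrix S, and the `level_totals` helper are all hypothetical.

```python
def level_totals(S, level):
    """Sum the send matrix S into one total per (sender level, receiver level).
    S[i-1][j-1] = number of messages employee i sent to employee j."""
    totals = {}
    for i, row in enumerate(S, start=1):
        for j, v in enumerate(row, start=1):
            key = (level[i], level[j])
            totals[key] = totals.get(key, 0) + v
    return totals

# Hypothetical level assignment for four employees
level = {1: "CEO", 2: "Manager", 3: "Manager", 4: "Employee"}
S = [
    [0, 2, 1, 0],
    [4, 0, 3, 5],
    [2, 1, 0, 7],
    [0, 6, 2, 0],
]
totals = level_totals(S, level)
grand = sum(totals.values())
# Share of all traffic carried by each sender-level/receiver-level block
share = {k: v / grand for k, v in totals.items()}
```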

Page 27: Enron Email data set mining

Output

Employee-sent Manager-response/sent CEO-response

4 2

4 1

Page 28: Enron Email data set mining

Detection of Anomalous Behavior in Employees

Database

Send matrix 2000

Send matrix 2001

eid

Number of messages sent

Page 29: Enron Email data set mining

Detection of Anomalous Behavior in Employees

2000: emails sent to lawyer + trader

2001: emails sent to lawyer + trader

Percentage change above threshold = 25%?

Class learnt: concept clustering
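The threshold test can be sketched as a per-employee percentage change between the two years. This covers only the thresholding step, not the clustering. The counts are toy data, and comparing the absolute change against the 25% threshold is an assumption, since the slide does not say whether decreases count.

```python
THRESHOLD = 25.0  # percent, from the slide

def percent_change(old, new):
    """Percentage change from old to new; a rise from zero counts as infinite."""
    if old == 0:
        return float("inf") if new > 0 else 0.0
    return (new - old) / old * 100.0

def classify(sent_2000, sent_2001, threshold=THRESHOLD):
    """Label each employee id 'above' or 'below' the threshold."""
    labels = {}
    for eid in sent_2000:
        change = percent_change(sent_2000[eid], sent_2001.get(eid, 0))
        # assumption: increases and decreases are treated alike
        labels[eid] = "above" if abs(change) > threshold else "below"
    return labels

# Toy counts of emails sent to lawyers and traders, keyed by eid
sent_2000 = {1: 40, 2: 100, 3: 0}
sent_2001 = {1: 44, 2: 180, 3: 12}
labels = classify(sent_2000, sent_2001)
```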

Page 30: Enron Email data set mining

Results: sample clusters

Below Threshold samples

Above Threshold samples

CEO/President

Page 31: Enron Email data set mining

Email stats over the years 2000 and 2001 for low/mid-level employees

Page 32: Enron Email data set mining


Higher-level employees

Email stats over the years 2000 and 2001 for high-level employees

Page 33: Enron Email data set mining

Temporal analysis of emails sent by some high-level employees

Page 34: Enron Email data set mining

Temporal analysis of emails sent by some high-level employees

Page 35: Enron Email data set mining

Future work

• Semantic analysis using the LIWC tool
• Probabilistic dependency

Page 36: Enron Email data set mining

Thank You

Questions?