Enron Email data set mining
Authors: Avik Das and Jagriti Das, University of Connecticut
Dr. Fei Wang
Email data mining
Enron email dataset --- SQL tables
Enron email dataset
Enron email dataset- SQL dump
Refined SQL dump: noise eliminated and the data split into multiple views
Views that contain the number of messages sent in the years 2000, 2001 and 2002
Views that contain the number of messages sent in the years 2000, 2001 and 2002 to external entities
View containing the role of each employee
Views that contain the number of messages sent in the years 2000, 2001 and 2002 to lawyers
Noise:
• Employees having multiple email ids
• Records of persons other than the 151 employees listed in the fact sheet
First Slide- Primary schema
Second slide- Role view based on Primary schema
Third slide- Sent Messages Views based on Primary schema
Fourth slide- Sent Messages Views based on Primary schema
Fifth Slide- Sent Messages Views based on Primary schema
Database schema
employeelist: Email_id (PK), First Name, Second name, Eid
Message: Mid (PK), Sender, date, message_id, subject, body, folder
recipient info: rid (PK), mid (FK), rvalue, rtype, date
The Mid in Message is present as a foreign key in recipient info.
employeelist: Email_id (PK), First Name, Second name, Eid
The role view maps the first and last name for each employee id.
Role: Eid (PK), First Name (FK), Second Name (FK), Role
Fact Sheet
The role view maps the role from the fact sheet.
Message: Mid (PK), Sender, date, message_id, subject, body, folder
send2002: rvalue (PK), date, count. Contains the count of messages sent in the year 2002.
send2001: rvalue (PK), date, count. Contains the count of messages sent in the year 2001.
send2000: rvalue (PK), date, count. Contains the count of messages sent in the year 2000.
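A minimal sketch of how a per-year view such as send2002 could be defined, using Python's built-in sqlite3 module on an in-memory database. The table and column names follow the schema slide, but the sample rows, the ISO date format, and the choice to count per recipient address (rvalue) are assumptions for illustration; the slides do not show the actual view definition, and the view's date column is left out here.

```python
import sqlite3

# In-memory database with a cut-down version of the schema from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE message (mid INTEGER PRIMARY KEY, sender TEXT, date TEXT);
CREATE TABLE recipientinfo (
    rid INTEGER PRIMARY KEY,
    mid INTEGER REFERENCES message(mid),
    rvalue TEXT
);

-- Hypothetical sample rows (not taken from the Enron dump).
INSERT INTO message VALUES
    (1, 'a@example.com', '2001-05-02'),
    (2, 'a@example.com', '2002-01-15'),
    (3, 'c@example.com', '2002-03-09');
INSERT INTO recipientinfo VALUES
    (1, 1, 'b@example.com'),
    (2, 2, 'b@example.com'),
    (3, 3, 'a@example.com'),
    (4, 3, 'b@example.com');

-- Assumed definition: one row per recipient address with the number of
-- messages addressed to it in 2002.
CREATE VIEW send2002 AS
SELECT r.rvalue AS rvalue, COUNT(*) AS "count"
FROM recipientinfo AS r
JOIN message AS m ON m.mid = r.mid
WHERE strftime('%Y', m.date) = '2002'
GROUP BY r.rvalue;
""")

rows = conn.execute('SELECT rvalue, "count" FROM send2002 ORDER BY rvalue').fetchall()
print(rows)
```

Views for 2001 and 2000 would differ only in the year literal.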
Message: Mid (PK), Sender, date, message_id, subject, body, folder
send_ext_2002: rvalue (PK), date, count. Contains the count of messages sent in the year 2002 to recipients external to Enron.
send_ext_2001: rvalue (PK), date, count. Contains the count of messages sent in the year 2001 to recipients external to Enron.
send_ext_2000: rvalue (PK), date, count. Contains the count of messages sent in the year 2000 to recipients external to Enron.
Message: Mid (PK), Sender, date, message_id, subject, body, folder
send_law_2002: rvalue (PK), date, count. Contains the count of messages sent in the year 2002 to lawyers.
send_law_2001: rvalue (PK), date, count. Contains the count of messages sent in the year 2001 to lawyers.
send_law_2000: rvalue (PK), date, count. Contains the count of messages sent in the year 2000 to lawyers.
Send Matrix A[i,j]
i: employee id of sender
j: employee id of receiver
SQL dump
Sender ID   Sender Mail         Receiver ID   Receiver Mail
3           [email protected]   4             [email protected]
3           [email protected]   4             [email protected]
3           [email protected]   4             [email protected]
3           [email protected]   4             [email protected]
3           [email protected]   4             [email protected]
3           [email protected]   4             [email protected]
SQL dump - Noise
Sender ID   Sender Mail         Receiver ID   Receiver Mail
2           [email protected]   2             [email protected]
2           [email protected]   2             [email protected]
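A rough sketch of how a send matrix could be built from (sender mail, receiver mail) records while dropping the two kinds of noise named earlier. The function name, the address-to-id mapping and the sample records are hypothetical; treating rows with the same sender and receiver id as noise is an assumption based on the noise sample above. The receive matrix H[i,j] (i receiver, j sender) is taken here as the transpose of A.

```python
def build_send_matrix(records, eid_of, size):
    """Return A where A[i-1][j-1] counts mails from employee i to employee j."""
    matrix = [[0] * size for _ in range(size)]
    for sender_mail, receiver_mail in records:
        i = eid_of.get(sender_mail)
        j = eid_of.get(receiver_mail)
        if i is None or j is None:
            continue  # person not in the fact sheet: treated as noise
        if i == j:
            continue  # same employee on both sides (multiple email ids): noise
        matrix[i - 1][j - 1] += 1
    return matrix


# Hypothetical mapping: employee 1 has two addresses.
eid_of = {
    "a@example.com": 1,
    "a.alt@example.com": 1,
    "b@example.com": 2,
    "c@example.com": 3,
}
records = [
    ("a@example.com", "b@example.com"),
    ("a.alt@example.com", "b@example.com"),
    ("a@example.com", "a.alt@example.com"),  # self-sent, dropped
    ("b@example.com", "c@example.com"),
    ("x@example.org", "a@example.com"),      # unknown sender, dropped
]

A = build_send_matrix(records, eid_of, 3)
H = [list(column) for column in zip(*A)]  # receive matrix as transpose of A
print(A)
print(H)
```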
Receive Matrix H[i,j]
i: employee id of receiver
j: employee id of sender
SQL dump
Receiver ID   Receiver Mail       Sender ID   Sender Mail
6             [email protected]   4           [email protected]
6             [email protected]   4           [email protected]
SQL dump - Noise
Receiver ID   Receiver Mail       Sender ID   Sender Mail
2             [email protected]   2           [email protected]
2             [email protected]   2           [email protected]
Steps to find the CEO - Step 1
From the receive matrix, for each row (receiver), find the sender or senders who have sent the fewest mails.
For employee 2 (row 2), the minimum is zero and is found at columns 1, 4 and 5.
Receive matrix
New matrix C
In each row, replace all the minimum values with 999 and all other values with 0.
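Step 1 can be sketched as follows. The 5x5 receive matrix is made up for illustration; only its second row is chosen to mirror the example on the slide (minimum zero at columns 1, 4 and 5).

```python
VOTE = 999  # marker the slide uses for a row's minimum entries


def vote_matrix(receive):
    """Return C: VOTE where a row of the receive matrix hits its minimum, else 0."""
    votes = []
    for row in receive:
        low = min(row)
        votes.append([VOTE if value == low else 0 for value in row])
    return votes


# Hypothetical receive matrix (rows: receivers, columns: senders).
H = [
    [0, 3, 1, 2, 4],
    [0, 5, 7, 0, 0],
    [2, 1, 0, 4, 3],
    [1, 1, 6, 0, 2],
    [3, 2, 2, 5, 0],
]

C = vote_matrix(H)
print(C[1])  # row for employee 2
```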
Steps to find the CEO - Step 2
From the new matrix C, find for each employee how many times they were voted as parent.
The number of 999s in a column gives the number of times that employee was voted as parent.
Find the maximum number of 999s over all the columns.
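A small sketch of the column count in Step 2, assuming C is stored as a list of rows; the 4x4 matrix is illustrative only.

```python
VOTE = 999  # marker for "voted as parent", as on the slide


def parent_votes(vote_matrix):
    """Count, per column, how many rows marked that column with VOTE."""
    return [
        sum(1 for value in column if value == VOTE)
        for column in zip(*vote_matrix)
    ]


# Hypothetical matrix C (rows: receivers, columns: senders).
C = [
    [999, 0, 0, 999],
    [999, 0, 999, 0],
    [999, 999, 0, 0],
    [0, 0, 0, 999],
]

votes = parent_votes(C)
most_votes = max(votes)
print(votes, most_votes)
```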
Steps to find the CEO - Step 3
Get the maximum number of times an employee could be voted as parent.
The maximum value comes out as 150.
New send index matrix D[i,j]
i: employee id
j: number of times the employee was voted as parent
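One way to picture matrix D is as a list of (employee id, vote count) pairs, from which the employees holding the maximum count are read off. The helper names and the vote counts below are hypothetical.

```python
def send_index(votes):
    """Pair each 1-based employee id with its vote count, highest count first."""
    pairs = [(eid, count) for eid, count in enumerate(votes, start=1)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def top_candidates(index):
    """Return the ids of all employees that share the maximum vote count."""
    best = index[0][1]
    return [eid for eid, count in index if count == best]


votes = [3, 1, 3, 2]  # hypothetical per-employee vote counts
D = send_index(votes)
candidates = top_candidates(D)
print(D)
print(candidates)
```

Several ids can tie for the maximum, which is why a further noise-elimination step against the fact sheet is needed.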
Final step for CEO
Several employees have the maximum number of parent votes.
Noise
Send index matrix
Fact sheet
Emp-ID: 129 Jeffrey Skilling
Emp-ID: 127 Kenneth Lay
Eliminating noise
Proposed hierarchy from the CEO-finding algorithm
Jeffrey Skilling, President & CEO
Unknown
Unknown
Levels of Hierarchy - First Children
For each employee, get the maximum number of messages sent.
For emp-1, the maximum number of messages sent was 3, so the possible first child would be 73.
Send Matrix
First Child
Levels of Hierarchy - Second Children
For each employee, get the second maximum number of messages sent.
For emp-1, the maximum number of messages sent was 3 and the next maximum value was 2, so the possible second children would be 17 and 53.
Send Matrix
Second Child
Levels of Hierarchy - Third Children
For each employee, get the third maximum number of messages sent.
For emp-2, the third maximum number of messages sent was 44, so the possible third child would be 19.
Send Matrix
Third Child
Levels of Hierarchy - Fourth Children
For each employee, get the fourth maximum number of messages sent.
For emp-2, the fourth maximum number of messages sent was 29, so the possible fourth child would be 4.
Send Matrix
Fourth Child
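The four child levels follow the same pattern, so they can be sketched with one helper that takes the rank (1 for first children, 2 for second, and so on). Ranking over distinct nonzero counts is an assumption; the slides do not say how ties or zeros are handled. The sample row is made up to reproduce the emp-1 example (maximum 3 at receiver 73, next value 2 at receivers 17 and 53).

```python
def children_at_rank(send_row, rank):
    """Return 1-based receiver ids whose count is the rank-th largest distinct nonzero value."""
    distinct = sorted({count for count in send_row if count > 0}, reverse=True)
    if rank > len(distinct):
        return []  # fewer distinct counts than the requested rank
    target = distinct[rank - 1]
    return [j for j, count in enumerate(send_row, start=1) if count == target]


# Hypothetical send-matrix row for emp-1 (80 receivers, mostly zero).
row_emp1 = [0] * 80
row_emp1[72] = 3  # receiver 73
row_emp1[16] = 2  # receiver 17
row_emp1[52] = 2  # receiver 53
row_emp1[4] = 1   # receiver 5

first = children_at_rank(row_emp1, 1)
second = children_at_rank(row_emp1, 2)
print(first, second)
```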
Sample Level of Hierarchy
Name               Emp-ID   Role
Jeffrey Skilling   129      CEO
Kenneth Lay        127      CEO
Greg Whalley       54       President
John Arnold        44       Vice president
Jeffrey Shankman   36       President
Andy Zipper        78       Vice president
John Lavorato      53       CEO
Louise Kitchen     107      President
Barry Tycholiz     38       Vice president
Network Graph - Between CEOs
Jeffrey Skilling
John Lavorato
Kenneth Lay
David Delainey
The number of inter-communications between CEOs is quite low.
The network traffic, in terms of messages sent and received, is quite low.
Network Graph - Between managers/vice-presidents/presidents
Sample data
The number of inter-communications between mid-level employees increases as we go down from the CEO level.
Network Graph - Between employees
Sample data
The number of inter-communications between employees is the highest amongst all the tiers.
Network Graph - Intra-communication: CEO-Managers, Employee-Managers
CEO-Managers: the number of intra-communications between the CEO level and mid-level employees is quite high.
Managers-Employees: the number of intra-communications between the manager level and lower-level employees is the highest.
Ratio of communication between different levels
CEO
Manager
Employee
CEO
Manager
Employee
Manager->CEO
CEO->Manager
Manager->Employee
Employee->CEO
Sub send Matrices
Compute the ratio of the total number of messages sent for the different sub send matrices.
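A sketch of the ratio computation, assuming each employee id has already been assigned a level (CEO, Manager or Employee). Instead of slicing out sub matrices explicitly, it sums the send matrix per (sender level, receiver level) pair, which gives the same totals. The level assignment and counts are invented for illustration.

```python
def level_totals(send, level_of):
    """Sum messages for every (sender level, receiver level) pair."""
    totals = {}
    for i, row in enumerate(send, start=1):
        for j, count in enumerate(row, start=1):
            if i == j:
                continue  # skip self-sent mail, treated as noise here
            key = (level_of[i], level_of[j])
            totals[key] = totals.get(key, 0) + count
    return totals


def ratio(totals, first, second):
    """Ratio of two level-pair totals, or None when the denominator is zero."""
    denominator = totals.get(second, 0)
    return totals.get(first, 0) / denominator if denominator else None


# Hypothetical data: employee 1 is a CEO, 2 a manager, 3 and 4 employees.
level_of = {1: "CEO", 2: "Manager", 3: "Employee", 4: "Employee"}
send = [
    [0, 4, 0, 0],
    [2, 0, 6, 2],
    [0, 3, 0, 5],
    [1, 1, 5, 0],
]

totals = level_totals(send, level_of)
ceo_vs_manager = ratio(totals, ("CEO", "Manager"), ("Manager", "CEO"))
manager_vs_employee = ratio(totals, ("Manager", "Employee"), ("Employee", "Manager"))
print(ceo_vs_manager, manager_vs_employee)
```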
Ratio of communication between different levels
Send Matrix
Function
CEO->Manager
Manager->CEO
Manager->Employee
Employee->Manager
Output
Employee-sent Manager-response/sent CEO-response
4 2
4 1
Detection of Anomalous Behavior in Employees
Database
Send matrix 2000
Send matrix 2001
eid
Number of messages sent: 2000, 2001
Emails sent to lawyer + trader (2000)
Emails sent to lawyer + trader (2001)
Percentage change above threshold = 25%?
Class learnt concept: clustering
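The threshold test in this flow can be sketched as below. This covers only the 25% percentage-change check, not the clustering step. Using the absolute change, and skipping employees with no 2000 baseline, are assumptions; the per-employee counts are made up.

```python
THRESHOLD = 25.0  # percent, as stated on the slide


def percent_change(before, after):
    """Percentage change from before to after, or None without a baseline."""
    if before == 0:
        return None
    return (after - before) / before * 100.0


def flag_employees(sent_2000, sent_2001, threshold=THRESHOLD):
    """Map each employee id to True when the absolute change exceeds the threshold."""
    flagged = {}
    for eid in sorted(set(sent_2000) & set(sent_2001)):
        change = percent_change(sent_2000[eid], sent_2001[eid])
        flagged[eid] = change is not None and abs(change) > threshold
    return flagged


# Hypothetical counts of emails sent to lawyers + traders, keyed by eid.
sent_2000 = {1: 40, 2: 100, 3: 0}
sent_2001 = {1: 60, 2: 110, 3: 12}

flags = flag_employees(sent_2000, sent_2001)
print(flags)
```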
Detection of Anomalous Behavior in Employees
Results-sample clusters
Below Threshold samples
Above Threshold samples
CEO/President
Email Stats over the year 2000 and 2001 for low/mid level employees
Higher level employees
Email Stats over the year 2000 and 2001 for high level employees
Temporal Analysis of emails sent for some high level employees
Future work
Semantic analysis using the LIWC tool.
Probabilistic dependency.
Thank You
Questions?