Analyzing Social and Stylometric Features to Identify Spear phishing Emails
Unifying the Global Response to Cybercrime
Analyzing Social and Stylometric Features to Identify Spearphishing Emails
Prateek Dewan, Anand Kashyap, Ponnurangam Kumaraguru
Indraprastha Institute of Information Technology – Delhi (IIITD), India
Overview
• What is spearphishing?
• Spearphishing and Online Social Media
• Challenges and dataset
• Feature extraction
• Classification results
• Discussion
What is spearphishing?
• Targeted phishing attack
• Contains contextual content instead of random messages
• Harder to detect, since spearphishing emails look more genuine
• Victims are asked to
  • Download malicious attachments
  • Reply with sensitive information
  • Click on URLs
  • …
Why study spearphishing?
• Victims are 4.5 times more likely to fall for spear phishing than normal phishing [1].
• One of the main entry points for Advanced Persistent Threats.
• Causes losses worth millions.
[1] M. Jakobsson. Modeling and preventing phishing attacks. In Financial Cryptography, volume 5. Citeseer, 2005.
Spearphishing and social media
• Social media profiles can be a good source for the “context” part of spear phishing emails
• FBI warning on July 04, 2013¹
  • “…emails typically contain accurate information about victims obtained from data posted on social networking sites…”
¹ http://www.computerweekly.com/news/2240187487/FBI-warns-of-increased-spear-phishing-attacks
Data
• Emails
  • Spear phishing emails (Symantec)
  • Spam / phishing emails (Symantec)
  • Benign emails (Enron)
• LinkedIn profiles
  • Recipients of emails in the three datasets mentioned above
  • LinkedIn People Search API
Challenges (social features)
• Limited information about victim to identify her on social media
  • Only first name, last name, organization available from victim’s email ID
• Hard to find victim on Facebook, Twitter, Google+
  • Too many profiles with same first name, last name
  • Work field not searchable.
Challenges (social features) contd.
• LinkedIn – Only network which provides searching using work field
• People search API access restricted.
  • We requested access under their Vetted API access scheme.
• Rate limited
  • Only 100 requests per day per app
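The per-day cap above can be respected on the client side with a small counter. The following is only a sketch of that idea: the `DailyQuota` class, its method name, and the example date are hypothetical and do not come from the slides, and nothing here calls the LinkedIn API.

```python
from datetime import date


class DailyQuota:
    """Tracks how many requests remain under a fixed per-day cap."""

    def __init__(self, limit=100):
        self.limit = limit  # 100 requests per day per app, as stated on the slide
        self.day = None
        self.used = 0

    def try_acquire(self, today):
        """Count one request against today's budget; return False once it is spent."""
        if today != self.day:
            # A new day starts with a fresh budget.
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


quota = DailyQuota(limit=100)
day_one = date(2014, 1, 1)  # arbitrary illustrative date
granted = sum(quota.try_acquire(day_one) for _ in range(150))
print(granted)
```

Lookups that are refused would simply be retried on a later day, which is why collecting thousands of profiles under such a cap is slow.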
Dataset
• Emails sent to employees of 14 international organizations
• SPEAR (Targeted spear phishing emails from Symantec)
  • 4,742 emails → 2,434 victims / LinkedIn profiles
• SPAM (Spam / phishing emails from Symantec)
  • 9,353 emails → 5,912 victims / LinkedIn profiles
• BENIGN (Sample from Enron email corpus)
  • 6,601 emails → 1,240 victims / LinkedIn profiles
Feature set creation
• SPAM, SPEAR, BENIGN emails → Stylometric features from emails
• Recipient email address → 1. firstName 2. lastName 3. organization → http://api.linkedin.com/v1/people-search → LinkedIn Profile(s) → Social features from LinkedIn
• Stylometric features + Social features → Final feature vector
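The first step of the social-feature path, turning a recipient address into the three search fields, might look like the sketch below. It assumes a hypothetical `first.last@organization.tld` address layout, which the slides do not specify; the function name `lookup_fields` is illustrative, and the code only prepares the fields without contacting LinkedIn.

```python
def lookup_fields(email_address):
    """Guess firstName, lastName and organization from a recipient address.

    Assumes the hypothetical layout first.last@organization.tld and returns
    None when the local part does not split into exactly two name parts.
    """
    local, _, domain = email_address.strip().lower().partition("@")
    if not local or not domain:
        return None
    # Treat "_" like "." so that first_last is handled the same as first.last.
    parts = [p for p in local.replace("_", ".").split(".") if p]
    if len(parts) != 2:
        return None
    # Simplification: take the first domain label as the organization name.
    organization = domain.split(".")[0]
    return {
        "firstName": parts[0],
        "lastName": parts[1],
        "organization": organization,
    }


print(lookup_fields("Jane.Doe@example.com"))
print(lookup_fields("info@example.com"))
```

Addresses that do not fit the assumed layout (role accounts, initials, subdomains) return `None` or a rough guess here, which mirrors the "limited information about victim" challenge noted earlier.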
Stylometric Features
• Subject based (7)
  • Num. words, Num. characters, Richness
  • Has words: “bank”, “verify”
  • isReply, isForwarded
• Attachment based (2)
  • Length of attachment name
  • Attachment size
• Body based (9)
  • Num. words, Num. characters, Num. unique words
  • Has words: “attach”, “suspension”, “verify your account”
  • Num. newlines, Richness, function words
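As an illustration, the seven subject-based features could be computed roughly as follows. This is a sketch under assumptions: the slides do not define "Richness", so it is taken here to be the ratio of word count to character count, and the reply/forward checks rely on the common "Re:" / "Fw:" / "Fwd:" subject prefixes. The function and key names are hypothetical.

```python
def subject_features(subject):
    """Return the seven subject-based features as a dict."""
    words = subject.split()
    num_chars = len(subject)
    lowered = subject.lower().lstrip()
    return {
        "num_words": len(words),
        "num_chars": num_chars,
        # Assumption: richness = words / characters (not defined on the slide).
        "richness": len(words) / num_chars if num_chars else 0.0,
        "has_bank": "bank" in lowered,
        "has_verify": "verify" in lowered,
        # Assumption: replies and forwards are marked by subject prefixes.
        "is_reply": lowered.startswith("re:"),
        "is_forwarded": lowered.startswith(("fw:", "fwd:")),
    }


features = subject_features("Re: Verify your bank account")
print(features)
```

The body-based features follow the same pattern with extra counts (unique words, newlines); function-word counting would additionally need a word list, which the slides do not provide.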
Social Features
• Location
• Connections
• Summary based (5)
  • Num. words, Num. characters, Num. unique words
  • Length, Richness
• Profession based (2)
  • Job Level (0-7)
  • Job Type (0-9)
Results (SPEAR v/s SPAM)
Feature set (num. features)   Metric         Random Forest   J48 Decision Tree   Naïve Bayes
Subject (7)                   Accuracy (%)   83.91           83.10               58.87
                              FP rate        0.208           0.227               0.371
Attachment (2)                Accuracy (%)   97.86           96.69               69.15
                              FP rate        0.035           0.046               0.218
All email (9)                 Accuracy (%)   98.28           97.32               68.69
                              FP rate        0.024           0.035               0.221
Social (9)                    Accuracy (%)   81.73           76.63               65.85
                              FP rate        0.229           0.356               0.445
Email + Social (18)           Accuracy (%)   96.47           95.90               69.35
                              FP rate        0.052           0.054               0.232
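For reference, both reported metrics can be derived from a binary confusion matrix. The sketch below uses the standard definitions; the counts in the example are invented for illustration and are not taken from the study, and the slides do not state which class was treated as positive or whether the FP rate was averaged across classes.

```python
def accuracy_and_fp_rate(tp, fp, tn, fn):
    """Return (accuracy in percent, false positive rate) from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = 100.0 * (tp + tn) / total if total else 0.0
    # FP rate: share of actual negatives that were wrongly flagged as positive.
    negatives = fp + tn
    fp_rate = fp / negatives if negatives else 0.0
    return accuracy, fp_rate


# Illustrative counts only (not from the dataset described in the slides).
accuracy, fp_rate = accuracy_and_fp_rate(tp=90, fp=5, tn=95, fn=10)
print(accuracy, fp_rate)
```

Reading the two numbers together matters: a classifier can reach a high accuracy on an imbalanced set while still having a high FP rate.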
Results (SPEAR v/s SPAM) contd.
• Most informative features
  • Attachment size
  • Length of attachment name
  • Subject Richness
  • No. of characters in subject
  • Location (from LinkedIn profile)
  • No. of words in subject
  • LinkedIn connections
  • …
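The slides do not say how the features were ranked. One common way to score how informative a feature is, assumed here purely for illustration, is information gain: the reduction in label entropy after splitting on the feature. The toy data and names below are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy example: one feature separates the classes perfectly, the other not at all.
labels = ["spear", "spear", "spam", "spam"]
perfect_feature = [1, 1, 0, 0]
useless_feature = [1, 0, 1, 0]
print(information_gain(perfect_feature, labels))
print(information_gain(useless_feature, labels))
```

Numeric features such as attachment size would need to be binned or thresholded before this discrete version applies.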
Results (SPEAR v/s SPAM) contd.
SPEAR v/s SPAM subjects
← Spam / phishing
Spear phishing →
Results (SPEAR v/s BENIGN)
Feature set (num. features)   Metric         Random Forest   J48 Decision Tree   Naïve Bayes
Subject (7)                   Accuracy (%)   81.19           81.11               61.75
                              FP rate        0.210           0.217               0.489
Body (9)                      Accuracy (%)   97.17           95.62               53.81
                              FP rate        0.031           0.048               0.338
All email (16)                Accuracy (%)   97.39           95.84               54.14
                              FP rate        0.029           0.044               0.334
Social (9)                    Accuracy (%)   94.48           91.79               69.76
                              FP rate        0.067           0.103               0.278
Email + Social (25)           Accuracy (%)   97.04           95.28               57.27
                              FP rate        0.032           0.052               0.316
Results (SPEAR v/s BENIGN) contd.
• Most informative features
  • Body richness
  • No. of characters in body
  • No. of words in body
  • No. of unique words in body
  • Location (from LinkedIn)
  • No. of newlines in body
  • Subject richness
  • …
Results (SPEAR v/s SPAM + BENIGN)
Feature set (num. features)   Metric         Random Forest   J48 Decision Tree   Naïve Bayes
Subject (7)                   Accuracy (%)   86.48           86.35               77.99
                              FP rate        0.333           0.352               0.681
Social (9)                    Accuracy (%)   88.04           84.69               74.46
                              FP rate        0.241           0.371               0.454
Email + Social (16)           Accuracy (%)   89.86           88.38               73.97
                              FP rate        0.202           0.248               0.381
Results (SPEAR v/s SPAM + BENIGN) contd.
• Most informative features
  • Subject richness
  • No. of characters in subject
  • Location (from LinkedIn)
  • LinkedIn connections
  • No. of words in subject
  • Email forwarded? (True / false)
  • Email is a reply? (True / false)
  • …
Discussion
• Social features (from LinkedIn) did not help in distinguishing spear phishing emails from non-spear-phishing emails.
  • Stylometric features from emails suffice to do so.
• Real-world scenarios may be much different
  • Attackers may use information from other sources / social networks, viz. Facebook, Twitter, etc.
• Dataset limitation
  • It is possible that no spear phishing emails in our dataset were crafted using LinkedIn features
  • We cannot conclude that such behavior would not be found outside our dataset, or in the future.
Thanks!
Prateek Dewan E: [email protected]
W: http://precog.iiitd.edu.in/people/prateek
Backup slides…
Results (SPEAR v/s SPAM) contd.
Attachment names
Results (SPEAR v/s BENIGN) contd.
← Benign emails
Spear phishing →
Attachment types
Details of organizations