Learning to Detect Phishing Emails

Date posted: 08-Jan-2016

Learning to Detect Phishing Emails. Report: 鄭志欣. Advisor: Hsing-Kuo Pao. I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In Proceedings of the International World Wide Web Conference (WWW), pages 649-656, 2007. Outline: Introduction, Method, Empirical evaluation.

Transcript of Learning to Detect Phishing Emails

  • Report: 鄭志欣. Advisor: Hsing-Kuo Pao

    Learning to Detect Phishing Emails. I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In Proceedings of the International World Wide Web Conference (WWW), pages 649-656, 2007.

  • Outline: Introduction, Method, Empirical evaluation, Conclusion

  • Introduction

    Phishing (spoofed websites) aims at stealing account information: logon credentials and identity information.

    The phishing problem is hard.

  • Method

    PILFER: a machine-learning-based approach to classification.

    Classes: phishing emails vs. ham (good) emails.

    Feature set: features as used in email classification.

  • Features as used in email classification: IP-based URLs

    Example: http://192.168.0.1/paypal.cgi?fix_account

    Some phishing attacks are hosted on compromised PCs, which are then referred to by IP address rather than by domain name. This feature is binary.
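
A minimal sketch of how the IP-based URL feature could be computed, assuming the links have already been extracted from the email as strings. The function name `has_ip_based_url` is hypothetical and not taken from PILFER; obfuscated forms such as decimal or hexadecimal host encodings are not handled.

```python
import ipaddress
from urllib.parse import urlparse


def has_ip_based_url(urls):
    """Return True if any URL uses a literal IP address as its host.

    Hypothetical helper for illustration; not the PILFER implementation.
    """
    for url in urls:
        host = urlparse(url).hostname
        if host is None:
            continue  # no scheme/host part, nothing to check
        try:
            ipaddress.ip_address(host)  # raises ValueError for domain names
            return True
        except ValueError:
            continue
    return False


if __name__ == "__main__":
    print(has_ip_based_url(["http://192.168.0.1/paypal.cgi?fix_account"]))
    print(has_ip_based_url(["http://www.paypal.com/login"]))
```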

  • Age of linked-to domain names

    Phishers register legitimate-sounding domain names, e.g. Palypal.com or paypal-update.com. These domains often have a limited life.

    A WHOIS query gives the registration date of the domain. If that date is within 60 days of the date the email was sent, the domain is flagged as "fresh". This is a binary feature.
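
The 60-day test can be sketched as below. The WHOIS lookup itself is left out: the registration date is assumed to be available already. The function name and the choice to treat a registration date later than the send date as "not fresh" are assumptions, not details from the slides.

```python
from datetime import date


def is_fresh_domain(registered_on, email_sent_on, window_days=60):
    """Return True if the domain was registered at most `window_days`
    before the email was sent.

    Assumption: a registration date after the send date (for example a
    domain that was re-registered later) is not counted as fresh.
    """
    age_days = (email_sent_on - registered_on).days
    return 0 <= age_days <= window_days


if __name__ == "__main__":
    print(is_fresh_domain(date(2006, 1, 1), date(2006, 1, 31)))
    print(is_fresh_domain(date(2001, 5, 1), date(2006, 1, 31)))
```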

  • Nonmatching URLs

    This is the case of a link whose text says paypal.com but which actually links to badsite.com.

    Such a link looks like <a href="badsite.com"> paypal.com</a>.

    This is a binary feature.
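
One way to check for nonmatching URLs is sketched below using only the Python standard library. The class and function names are hypothetical. The comparison is deliberately naive: it only looks at links whose visible text itself looks like a host name, and it compares full host names, so www.paypal.com and paypal.com would count as a mismatch.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(value):
    """Extract a lower-cased host name, tolerating a missing scheme."""
    value = value.strip()
    if "://" not in value:
        value = "http://" + value
    return (urlparse(value).hostname or "").lower()


def has_nonmatching_url(html):
    """Return True if some link text names one host but the href another."""
    parser = LinkCollector()
    parser.feed(html)
    for href, text in parser.links:
        # Only compare when the visible text looks like a host name or URL.
        if " " in text or "." not in text:
            continue
        if host_of(text) != host_of(href):
            return True
    return False


if __name__ == "__main__":
    print(has_nonmatching_url('<a href="badsite.com"> paypal.com</a>'))
```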

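  • "Here" links to non-modal domain

    Phishing emails often contain text such as "Click here to restore your account access".

    The modal domain is the domain most frequently linked to in the email. The email is flagged if a link with the text "link", "click", or "here" points to a domain other than this modal domain.

    This is a binary feature.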
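
A rough sketch of this feature, assuming the links are already available as (href, visible text) pairs. The function name is hypothetical, full host names stand in for domains, and ties for the most frequent host are broken by first occurrence; none of these choices come from the slides.

```python
from collections import Counter
from urllib.parse import urlparse

TRIGGER_WORDS = ("link", "click", "here")


def here_link_to_non_modal_domain(links):
    """links: list of (href, visible_text) pairs taken from one email.

    Returns True if a link whose text contains "link", "click", or "here"
    points to a host other than the most frequently linked host.
    """
    hosts = [(urlparse(href).hostname or "").lower() for href, _ in links]
    known_hosts = [h for h in hosts if h]
    if not known_hosts:
        return False
    modal_host = Counter(known_hosts).most_common(1)[0][0]
    for (_, text), host in zip(links, hosts):
        words = [w.strip(".,!:") for w in text.lower().split()]
        if host and host != modal_host and any(w in TRIGGER_WORDS for w in words):
            return True
    return False


if __name__ == "__main__":
    email_links = [
        ("http://www.paypal.com/help", "Help"),
        ("http://www.paypal.com/privacy", "Privacy"),
        ("http://badsite.com/login", "Click here to restore your account access"),
    ]
    print(here_link_to_non_modal_domain(email_links))
```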

  • HTML emails

    Emails are sent as plain text, as HTML, or as a combination of the two (the multipart/alternative format).

    Launching an attack without using HTML is difficult, because plain text offers little room to disguise the URL a user is taken to.

    This is a binary feature.
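
The HTML check can be approximated with the standard `email` package, as in the sketch below. It assumes the raw message text is available and treats any text/html part, including one inside multipart/alternative, as "HTML email". The function name is hypothetical.

```python
from email import message_from_string


def is_html_email(raw_message):
    """Return True if the message, or any of its MIME parts, is text/html."""
    msg = message_from_string(raw_message)
    # walk() yields the message itself and then every nested part.
    return any(part.get_content_type() == "text/html" for part in msg.walk())


if __name__ == "__main__":
    print(is_html_email("Content-Type: text/html\n\n<p>hello</p>"))
    print(is_html_email("Content-Type: text/plain\n\nhello"))
```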

  • Number of links

    The number of links present in an email, where a link is an <a> tag with an href attribute in the HTML.

    This is a continuous feature.

  • Number of domains

    Take the domain names previously extracted from all of the links and count the number of distinct domains.

    Only the main part of a domain is considered, e.g. university.edu for https://www.cs.university.edu/ and company.co.jp for http://www.company.co.jp/.

    This is a continuous feature.

  • Number of dots

    Attackers add dots to URLs in two ways: subdomains, like http://www.my-bank.update.data.com, and redirection scripts, such as http://www.google.com/url?q=http://www.badsite.com.

    This feature is the maximum number of dots ('.') contained in any of the links present in the email. It is a continuous feature.
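
The three continuous link features (number of links, number of distinct domains, maximum number of dots) can be sketched together, assuming the href values have already been extracted. All names are hypothetical. The "main part" heuristic is a simplification that only knows a handful of second-level labels such as co and ac; a real implementation would more likely rely on a public suffix list, and IP-address hosts are not treated specially.

```python
from urllib.parse import urlparse

# Assumption: a tiny, incomplete set of labels that sit directly under a
# two-letter country code (as in company.co.jp).
SECOND_LEVEL_LABELS = {"co", "ac", "com", "org", "net", "edu", "gov", "ne", "or"}


def main_domain(url):
    """Return the registrable-looking part of the host, e.g. university.edu."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    if len(labels) >= 3 and len(labels[-1]) == 2 and labels[-2] in SECOND_LEVEL_LABELS:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


def link_features(hrefs):
    """Compute the three continuous link features for one email."""
    domains = {main_domain(h) for h in hrefs} - {""}
    return {
        "num_links": len(hrefs),
        "num_domains": len(domains),
        # Dots are counted over the whole URL, so redirection targets count too.
        "max_dots": max((h.count(".") for h in hrefs), default=0),
    }


if __name__ == "__main__":
    print(link_features([
        "https://www.cs.university.edu/",
        "http://www.company.co.jp/",
        "http://www.my-bank.update.data.com",
    ]))
```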

  • Contains JavaScript

    Attackers can use JavaScript to hide information from the user and potentially launch sophisticated attacks.

    An email is flagged with the "contains javascript" feature if the string "javascript" appears in the email, regardless of whether it is actually in a <script> or <a> tag.

    This is a binary feature.

  • Spam-filter output

    The output (ham or spam) of the trained version of SpamAssassin, run with the default rule weights and threshold.

    This is a binary feature.
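
One simple way to read a ham/spam verdict is sketched below. It assumes the message has already been passed through SpamAssassin, which records its verdict in an X-Spam-Status header beginning with "Yes" or "No". How the original study obtained the verdict is not stated in the slides; the function name is hypothetical.

```python
from email import message_from_string


def spamassassin_says_spam(raw_message):
    """Return True if the X-Spam-Status header starts with "Yes".

    Assumption: the message was already processed by SpamAssassin; a
    message without the header is treated as ham.
    """
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    return str(status).strip().lower().startswith("yes")


if __name__ == "__main__":
    sample = "X-Spam-Status: Yes, score=7.2 required=5.0\nSubject: hi\n\nbody"
    print(spamassassin_says_spam(sample))
```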

  • Empirical Evaluation

    Machine-learning implementation; testing SpamAssassin; datasets; additional challenges; false positives vs. false negatives.

  • Machine-Learning Implementation (PILFER)

    First, run a set of scripts to extract all the features listed.

    Second, train and test a classifier using 10-fold cross-validation.

    The classifier is a random forest. Random forests create a number of decision trees; each decision tree is made by randomly choosing an attribute to split on at each level, and then pruning the tree.

  • We use a random forest as a classifier.
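
The 10-fold cross-validation procedure can be sketched in plain Python as follows. This is only an outline of the evaluation loop: the `train` argument is a placeholder for whatever learner is used, and the stub at the end is a hypothetical single-feature rule, not a random forest. Labels are assumed to be 1 for phishing and 0 for ham, a false positive is a ham email flagged as phishing, and a false negative is a phishing email that is missed.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(features, labels, train, k=10, seed=0):
    """Return (false_positive_rate, false_negative_rate) over all folds.

    train(X, y) must return a function predict(x) -> 0 or 1.
    Assumes the data contains at least one ham and one phishing email.
    """
    fp = fn = 0
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        predict = train([features[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        for i in held_out:
            guess = predict(features[i])
            if guess == 1 and labels[i] == 0:
                fp += 1  # ham flagged as phishing
            elif guess == 0 and labels[i] == 1:
                fn += 1  # phishing email missed
    return fp / labels.count(0), fn / labels.count(1)


def train_single_feature_stub(X, y):
    """Hypothetical stand-in learner: flag an email if its first feature is set."""
    return lambda x: 1 if x[0] else 0


if __name__ == "__main__":
    X = [[1]] * 10 + [[0]] * 88 + [[1]] * 2
    y = [1] * 10 + [0] * 90
    print(cross_validate(X, y, train_single_feature_stub))
```

A real learner would be passed in place of `train_single_feature_stub`; the loop itself does not depend on the classifier.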

  • Testing SpamAssassin

    SpamAssassin is a widely deployed, freely available spam filter that is highly accurate in classifying spam emails.

    We classify the exact same dataset using SpamAssassin version 3.1.0, with the default thresholds and rules.

    Two configurations are tested: untrained SpamAssassin, and SpamAssassin trained within the 10-fold cross-validation.

  • Datasets

    Two publicly available datasets are used.

    Ham corpora from the SpamAssassin project: 6950 non-phishing, non-spam emails.

    Phishingcorpus: approximately 860 email messages.

  • Additional Challenges

    The age of the dataset is a challenge. Phishing websites are short-lived, so some of our features cannot be extracted from older emails, which makes testing difficult. Example: the age of the linked-to domain.

  • Result

  • Conclusion

    It is possible to detect phishing emails with high accuracy by using a specialized filter whose features are more directly applicable to phishing emails than those employed by general-purpose spam filters.

  • Reference

    I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In Proceedings of the International World Wide Web Conference (WWW), pages 649-656, 2007.

    www.ics.uci.edu/.../Learning%20to%20Detect%20Phishing%20Emails.pptx

    http://armorize-cht.blogspot.com/2010/01/phishing-mail.html
