Page 1

An Effective Similarity Metric for Application Traffic Classification

Jae Yoon Chung 1, Byungchul Park 1, Young J. Won 1, John Strassner 2, and James W. Hong 1, 2

{dejavu94, fates, yjwon, johns, jwkhong}@postech.ac.kr

April 20, 2010

1 Dept. of Computer Science and Engineering, POSTECH, Korea
2 Division of IT Convergence Engineering, POSTECH, Korea

DPNM, POSTECH, NOMS 2010

Page 2

Contents

• Introduction
• Related Work
• Research Goal
• Proposed Methodology
• Evaluation
• Conclusion and Future Work

Page 3

Introduction

Traffic classification for network management
• Network planning
• QoS management
• Security
• Etc.

Diversity of today's Internet traffic
• New types of network applications
• Increase of P2P traffic
• Various techniques for avoiding detection

Document classification and traffic classification
• Document classification in natural language processing
• Comparing packet payload vectors is analogous to document classification

Page 4

Related Work

Well-known port-based classification
• Low complexity
• Low accuracy (approximately 50~70%)

Signature-based classification
• High reliability
• Exhaustive tasks for searching signatures
• E.g., Snort, LASER

Behavior-based classification
• Focuses on traffic patterns and connection behaviors
• Questionable accuracy
• E.g., BLINC

Machine learning-based classification
• Utilizes statistical information
• Heavy consumption of computing resources
• E.g., SVM, Bayesian Network

Similarity-based classification
• Utilizes a document classification approach
• Questionable scalability
• E.g., flow similarity calculation [IPOM '09]

Page 5

Summary of IPOM 2009

Proposed a new traffic classification approach
• Utilizes a document classification approach using Cosine similarity calculation
• Proposes a new packet representation using the Vector Space Model
• Proposes a flow similarity calculation methodology that compares the packets in a flow sequentially

Methodology validated using real-world traffic on our campus backbone network

Cannot classify flows in an asymmetric routing environment

No comparison of Cosine similarity with other similarity metrics
• Cosine similarity is a common similarity metric for human-document classification
• High variation of the similarity value according to term frequency

Page 6

Research Goals

Propose a new traffic classification algorithm
• Automation of the signature generation step
  – Generate an application vector, which is an alternative signature, using simple vector operations
  – Make groups according to traffic type and operation within single-application traffic
• Accurate and feasible traffic classification algorithm
  – Classify application traffic using similarity calculation
  – Solve the asymmetric routing classification problem
  – Validation using real-world network traffic to compare similarity metrics
  – Complexity analysis

Compare three similarity metrics for traffic classification
• Jaccard similarity: counting fragments of the signature
• Cosine similarity: high weighting scheme for the signature
• RBF similarity: Euclidean distance between packets

Page 7

Proposed Methodology

Page 8

Vector Space Modeling

Vector Space Modeling
• An algebraic model representing text documents as vectors
• Widely used for document classification
  – Categorize electronic documents based on their content (e.g., e-mail spam filtering)

Document classification vs. traffic classification
• Document classification
  – Find documents among stored text documents that satisfy certain information queries
• Traffic classification
  – Classify network traffic according to the type of application based on traffic information
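As a small illustration of the vector space model on ordinary text, the sketch below turns a document into a term-frequency vector. The vocabulary, the sample sentences, and the function name `term_frequency_vector` are made up for this example and do not come from the slides.

```python
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Map a text document to a term-frequency vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    # One vector component per vocabulary term, in vocabulary order.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and documents, chosen only for illustration.
vocabulary = ["free", "meeting", "offer", "report"]
spam = term_frequency_vector("Free offer free offer free", vocabulary)
work = term_frequency_vector("Meeting report for the meeting", vocabulary)
print(spam)  # [3, 0, 2, 0]
print(work)  # [0, 2, 0, 1]
```

Documents with similar word usage end up close together in this space, which is the property the payload vectors are meant to borrow.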

Page 9

Payload Vector Conversion (1/2)

Definition of word in payload
• Payload data within an i-byte sliding window
• $|\text{Word set}| = 2^{8 \times \text{sliding window size}}$

Definition of payload vector
• A term-frequency vector in NLP
• Payload Vector = $[w_1 \; w_2 \; \dots \; w_n]^T$
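The conversion can be sketched in a few lines of Python. This is a minimal sketch under two assumptions that the slide does not state: the window advances one byte at a time, and the vector is stored sparsely (only words that occur) rather than as a dense array with one entry per possible word. The names `payload_to_vector` and `word_set_size` are hypothetical.

```python
from collections import Counter


def payload_to_vector(payload: bytes, window: int = 2) -> Counter:
    """Count every `window`-byte word in the payload.

    Assumption: the window advances one byte at a time. The result is a
    sparse view of the payload vector: absent words have frequency 0.
    """
    if window < 1:
        raise ValueError("window must be at least 1 byte")
    last_start = len(payload) - window
    return Counter(payload[i:i + window] for i in range(last_start + 1))


def word_set_size(window: int) -> int:
    """Number of distinct words: 2^(8 * sliding window size)."""
    return 2 ** (8 * window)


vector = payload_to_vector(b"abab")
print(vector[b"ab"], vector[b"ba"])  # 2 1
print(word_set_size(2))              # 65536
```

A sparse mapping is convenient here because a single packet contains far fewer distinct words than the full word set.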

Page 10

Payload Vector Conversion (2/2)

[Figure: a payload with three sliding-window positions, each labeled "Word"]

• The word size is 2 bytes and the word set size is $2^{16}$
  – The simplest case for representing the order of content in payloads

Page 11

Similarity Metrics for Traffic Classification

Jaccard similarity
• The size of the intersection of the sample sets X and Y divided by the size of the union of the sample sets X and Y

$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$

Cosine similarity
• Compares two vectors X and Y of n dimensions by finding the cosine of the angle between them

$$C(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|}$$

RBF similarity
• Radial basis function of the Euclidean distance between two vectors X and Y, with a tunable parameter $\gamma$

$$RBF(X, Y) = \exp\left(-\gamma \, \|X - Y\|^2\right)$$
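The three metrics can be written against sparse payload vectors as follows. This is a sketch, not the authors' implementation: it assumes vectors are word-to-count mappings, that Jaccard is taken over the sets of distinct words, and that the RBF parameter is a single scalar called `gamma` with an arbitrary default of 0.1.

```python
import math
from collections import Counter


def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y| over the sets of distinct words."""
    xs, ys = set(x), set(y)
    union = xs | ys
    return len(xs & ys) / len(union) if union else 0.0


def cosine(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def rbf(x, y, gamma=0.1):
    """exp(-gamma * squared Euclidean distance); gamma is an assumed default."""
    squared = sum((x.get(w, 0) - y.get(w, 0)) ** 2 for w in x.keys() | y.keys())
    return math.exp(-gamma * squared)


x = Counter({"ab": 2, "ba": 1})
y = Counter({"ab": 1, "bc": 1})
print(jaccard(x, y))  # one shared word out of three distinct words
print(cosine(x, y))   # 2 / sqrt(5 * 2)
print(rbf(x, y))      # exp(-0.1 * 3)
```

All three return values between 0 and 1, so the same threshold logic can be reused when swapping one metric for another.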

Page 12

Application Vector Heuristics

Application vector
• Represents typical packets that are generated by target applications as the center (basis) of each cluster

Application vector generator
• Read packets from the target application trace
• Divide the packets into several types of clusters without any pre-processing

[Figure: application vector generator, with labels: application trace; application vectors 1, 2, 3; traffic clusters 1, 2]

Page 13

Application Vector Generation

Unsupervised grouping within single-application traffic
• Provide fine-grained classification
• Classify single-application traffic according to traffic types

[Figure: packets 1–6 of the application traffic grouped into Cluster 1 and Cluster 2, represented by application vector 1 and application vector 2]
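The slides do not spell out the grouping procedure, so the following is only one plausible reading: a greedy one-pass grouping in which each packet joins the most similar existing cluster if the similarity reaches a threshold, and otherwise starts a new cluster. The cluster's application vector is kept as the element-wise sum of its members, which is a simple vector operation and points in the same direction as their mean. The function name, the threshold of 0.5, and the choice of Cosine similarity are all assumptions.

```python
import math
from collections import Counter


def cosine(x, y):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def generate_application_vectors(payload_vectors, threshold=0.5, similarity=cosine):
    """Greedy one-pass grouping (hypothetical stand-in for the generator)."""
    clusters = []
    for index, vector in enumerate(payload_vectors):
        best, best_score = None, threshold
        for cluster in clusters:
            score = similarity(cluster["vector"], vector)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            # No cluster is similar enough: this packet seeds a new one.
            clusters.append({"vector": Counter(vector), "members": [index]})
        else:
            # Element-wise addition keeps the cluster centre up to date.
            best["vector"].update(vector)
            best["members"].append(index)
    return clusters


packets = [
    Counter({"ab": 2, "ba": 1}),
    Counter({"ab": 2, "ba": 2}),
    Counter({"xy": 3}),
]
clusters = generate_application_vectors(packets)
print([c["members"] for c in clusters])  # [[0, 1], [2]]
```

The result of a greedy pass depends on packet order and on the threshold, so it should be read as an illustration of the idea rather than a tuned procedure.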

Page 14

Two-stage Traffic Classification

Packet-level clustering
• Classify single packets regardless of flow information
• Compare payload vectors with application vectors by calculating similarity values
• Mark each packet with its application and priority
• Allow permutation of the packet sequence

Flow-level classification
• Rearrange packets according to flow information
• Ignore mis-clustered packets that are caused by protocol ambiguities
  – HTTP for Web
  – HTTP for P2P
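The two stages can be sketched as below. Several details are assumptions made for the sake of a runnable example: packets are `(flow_id, payload_vector)` pairs, stage 1 picks the best-matching application vector above a threshold of 0.5, and stage 2 resolves each flow by majority vote so that a minority of mis-clustered packets is ignored. The per-packet priority mentioned on the slide is not modelled, and the word sets are invented.

```python
from collections import Counter, defaultdict


def jaccard(x, y):
    """Jaccard similarity over the sets of distinct words."""
    xs, ys = set(x), set(y)
    union = xs | ys
    return len(xs & ys) / len(union) if union else 0.0


def classify_packets(packets, app_vectors, similarity=jaccard, threshold=0.5):
    """Stage 1: label every packet on its own, ignoring flow information."""
    labelled = []
    for flow_id, vector in packets:
        best_label, best_score = None, threshold
        for label, app_vector in app_vectors.items():
            score = similarity(app_vector, vector)
            if score >= best_score:
                best_label, best_score = label, score
        labelled.append((flow_id, best_label))
    return labelled


def classify_flows(labelled):
    """Stage 2: regroup by flow and keep the majority label (an assumption)."""
    votes = defaultdict(Counter)
    for flow_id, label in labelled:
        if label is not None:
            votes[flow_id][label] += 1
    return {flow_id: count.most_common(1)[0][0] for flow_id, count in votes.items()}


# Hypothetical application vectors and packets.
app_vectors = {"BitTorrent": {"bt", "pr"}, "Web": {"ht", "tp"}}
packets = [
    ("F1", {"bt", "pr"}),
    ("F1", {"ht", "tp"}),        # HTTP inside a P2P flow: mis-clustered as Web
    ("F1", {"bt", "pr", "xx"}),
    ("F2", {"ht", "tp"}),
    ("F2", {"zz"}),              # matches nothing, left unlabelled
]
flows = classify_flows(classify_packets(packets, app_vectors))
print(flows)  # {'F1': 'BitTorrent', 'F2': 'Web'}
```

Because stage 1 never looks at packet order or direction, shuffling the packets gives the same flow labels. Flows whose packets all stay unlabelled are simply absent from the result.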

Page 15

Two-stage Traffic Classification

[Figure: Stage 1 and Stage 2 of the classification. Packets P1–P4 of Flow 1 and Flow 2 from backbone traffic are compared with application vectors 1–3 and placed in clusters 1–3, then regrouped by flow. Labels: BitTorrent traffic, FileGuri traffic, BitTorrent, FileGuri, Melon, mis-clustered]

Page 16

Evaluation

Page 17

Classifying Real-world Traffic

Fixed-port applications
• Traffic trace captured at one of two Internet junctions at POSTECH using an optical tap
• Ground-truth traffic
  – Some active flows among application traffic, distinguished by their use of active port numbers
• Target applications
  – FileGuri, ClubBox, Melon, BigFile

Untraceable-port applications
• Traffic Measurement Agent (TMA)
  – Monitors the network interface of the host
  – Records log data (flow five-tuples, process name, packet count, etc.)
• Target applications
  – eMule, BitTorrent

[Figure: diagrams with regions labeled Backbone Traffic, Target Application Traffic, and Ground-truth Traffic]
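To make the role of the TMA log concrete, the sketch below labels captured flows with the process name recorded for the same five-tuple. The record layout, the field names, and the direction-independent key are assumptions for illustration; the slides only state that the agent logs five-tuples, process names, and packet counts. The addresses come from documentation ranges.

```python
from typing import NamedTuple, Optional


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


def canonical(flow: FiveTuple):
    """Direction-independent key: order the two endpoints."""
    first, second = sorted([(flow.src_ip, flow.src_port), (flow.dst_ip, flow.dst_port)])
    return (first, second, flow.protocol)


def build_ground_truth(tma_log):
    """tma_log: iterable of (FiveTuple, process_name) records (hypothetical layout)."""
    return {canonical(flow): process for flow, process in tma_log}


def label_flow(flow: FiveTuple, ground_truth) -> Optional[str]:
    """Return the logged process name for a captured flow, or None if unknown."""
    return ground_truth.get(canonical(flow))


tma_log = [
    (FiveTuple("192.0.2.10", 51413, "198.51.100.7", 6881, "TCP"), "bittorrent.exe"),
]
ground_truth = build_ground_truth(tma_log)

reverse = FiveTuple("198.51.100.7", 6881, "192.0.2.10", 51413, "TCP")
unknown = FiveTuple("192.0.2.10", 40000, "203.0.113.5", 80, "TCP")
print(label_flow(reverse, ground_truth))  # bittorrent.exe
print(label_flow(unknown, ground_truth))  # None
```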

Page 18

Classification Accuracy

Classification accuracy comparison
• Fixed-port applications
  – FileGuri, ClubBox, Melon, BigFile
• Untraceable-port applications
  – eMule, BitTorrent

Jaccard similarity
• Reliable
  – Counts common segments

Cosine similarity
• Emphasizes common segments
  – Cannot distinguish ambiguous packets

RBF similarity
• Difficulty of setting the parameter
  – Needs a guideline for how to set the parameter

BitTorrent traffic on the backbone network
• Traffic over-classification by Cosine similarity
• High false positive rate of Cosine similarity
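One way such over-classification can arise is shown with synthetic numbers below; these are not measurements from the evaluation. The application vector and the unrelated packet share a single word that occurs very often (for example a run of zero bytes), and nothing else. Cosine similarity is dominated by that one high-frequency term, while Jaccard similarity only counts it once.

```python
import math


def jaccard(x, y):
    """Jaccard similarity over the sets of distinct words."""
    xs, ys = set(x), set(y)
    union = xs | ys
    return len(xs & ys) / len(union) if union else 0.0


def cosine(x, y):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


# Synthetic vectors: only the padding word "\x00\x00" is shared.
application_vector = {"\x00\x00": 50, "Bi": 1, "tT": 1}
unrelated_packet = {"\x00\x00": 40, "xx": 1, "yy": 1}

print(jaccard(application_vector, unrelated_packet))  # 0.2
print(cosine(application_vector, unrelated_packet))   # close to 1
```

With a threshold of 0.5, the Cosine score would accept this packet as application traffic and the Jaccard score would reject it.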

Page 19

Histogram of Similarity Values

Page 20

CDF of Distance among Payload Vectors

Page 21

Complexity Analysis

Page 22

Conclusion and Future Work

Develop new traffic classification research
• Utilizing a document classification approach for traffic classification
• Unsupervised classification to make clusters within single-application traffic
• Two-stage classification algorithm to solve the asymmetric routing classification problem
• Linear time complexity

Compare three similarity metrics
• Provide a guideline for selecting similarity metrics for traffic classification
• Provide soft classification that represents similarity as a numerical value ranging from 0 to 1

Future Work
• Enhance the unsupervised classification methodology for automated signature generation
• Extract orthogonal application vectors to improve scalability
