DPNM, POSTECH 1/23NOMS 2010
An Effective Similarity Metric for Application Traffic Classification
Jae Yoon Chung¹, Byungchul Park¹, Young J. Won¹, John Strassner², and James W. Hong¹˒²
{dejavu94, fates, yjwon, johns, jwkhong}@postech.ac.kr
¹ Dept. of Computer Science and Engineering, POSTECH, Korea
² Division of IT Convergence Engineering, POSTECH, Korea
April 20, 2010
Contents
• Introduction
• Related Work
• Research Goals
• Proposed Methodology
• Evaluation
• Conclusion and Future Work
Introduction
• Traffic classification for network management
  – Network planning
  – QoS management
  – Security
  – Etc.
• Diversity of today's Internet traffic
  – New types of network applications
  – Increase of P2P traffic
  – Various techniques for avoiding detection
• Document classification and traffic classification
  – Document classification in natural language processing
  – Comparing packet payload vectors is analogous to document classification
Related Work
• Well-known port-based classification
  – Low complexity
  – Low accuracy (approximately 50–70%)
• Signature-based classification
  – High reliability
  – Exhaustive effort required to search for signatures
  – E.g., Snort, LASER
• Behavior-based classification
  – Focuses on traffic patterns and connection behaviors
  – Questionable accuracy
  – E.g., BLINC
• Machine learning-based classification
  – Utilizes statistical information
  – Heavy consumption of computing resources
  – E.g., SVM, Bayesian network
• Similarity-based classification
  – Utilizes a document classification approach
  – Questionable scalability
  – E.g., flow similarity calculation [IPOM '09]
Summary of IPOM 2009
• Proposed a new traffic classification approach
  – Utilizes a document classification approach based on Cosine similarity calculation
  – Proposes a new packet representation using the Vector Space Model
  – Proposes a flow similarity calculation methodology that compares the packets in flows sequentially
• Validated the methodology using real-world traffic on our campus backbone network
• Cannot classify flows in an asymmetric routing environment
• No comparison of Cosine similarity with other similarity metrics
  – Cosine similarity is a common similarity metric for human-document classification
  – High variation of the similarity value according to term frequency
Research Goals
• Propose a new traffic classification algorithm
  – Automation of the signature generation step
    • Generate application vectors, which serve as alternative signatures, using simple vector operations
    • Form groups according to traffic type and operation within single-application traffic
  – Accurate and feasible traffic classification algorithm
    • Classify application traffic using similarity calculation
    • Solve the asymmetric routing classification problem
    • Validate with real-world network traffic to compare similarity metrics
    • Complexity analysis
• Compare three similarity metrics for traffic classification
  – Jaccard similarity: counts fragments of the signature
  – Cosine similarity: applies a high weighting scheme to the signature
  – RBF similarity: Euclidean distance between packets
Proposed Methodology
Vector Space Modeling
• Vector Space Modeling
  – An algebraic model representing text documents as vectors
  – Widely used for document classification
    • Categorizes electronic documents based on their content (e.g., e-mail spam filtering)
• Document classification vs. traffic classification
  – Document classification
    • Find documents among stored text documents that satisfy certain information queries
  – Traffic classification
    • Classify network traffic according to the type of application based on traffic information
Payload Vector Conversion (1/2)
• Definition of a word in a payload
  – Payload data within an i-byte sliding window
  – $|\text{Word set}| = 2^{8 \times \text{sliding window size}}$
• Definition of a payload vector
  – A term-frequency vector, as in NLP
  – $\text{Payload Vector} = [w_1\ w_2\ \dots\ w_n]^T$
Payload Vector Conversion (2/2)
[Figure: words extracted from a payload]
• The word size is 2 bytes and the word set size is $2^{16}$
  – The simplest case that represents the order of content in payloads
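The conversion described on these two slides can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the window advances one byte at a time (the slides do not state the step), and it stores the vector sparsely, keeping only the words that occur, instead of as a dense array with 2^16 entries. The function name `payload_to_vector` and the sample payload are hypothetical.

```python
from collections import Counter


def payload_to_vector(payload: bytes, window: int = 2) -> Counter:
    """Return a sparse term-frequency vector of `window`-byte words."""
    # Assumption: the window slides one byte at a time, so words overlap.
    last_start = len(payload) - window
    return Counter(payload[i:i + window] for i in range(last_start + 1))


# Number of possible words for a 2-byte window: 2^(8 * 2) = 65536.
word_set_size = 2 ** (8 * 2)

# Hypothetical payload: "GE" and "ET" each occur twice.
vector = payload_to_vector(b"GET /GET")
```

A payload shorter than the window yields an empty vector, because no complete word fits.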
Similarity Metrics for Traffic Classification
• Jaccard similarity
  – The size of the intersection of the sample sets X and Y divided by the size of their union
  – $J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$
• Cosine similarity
  – Compares two n-dimensional vectors X and Y by finding the cosine of the angle between them
  – $C(X, Y) = \frac{X \cdot Y}{|X|\,|Y|}$
• RBF similarity
  – Radial basis function of the Euclidean distance between two vectors X and Y
  – $RBF(X, Y) = \exp(-\sigma \|X - Y\|^2)$
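The three metrics can be sketched in Python over sparse term-frequency vectors (mappings from word to count). Several choices here are assumptions rather than details from the slides: Jaccard is computed over the sets of distinct words and ignores counts, the RBF form is exp(-sigma * ||X - Y||^2), and the parameter name `sigma` with its default of 0.1 is purely illustrative.

```python
import math
from collections import Counter


def jaccard(x, y) -> float:
    """Size of the intersection over size of the union of the word sets."""
    a, b = set(x), set(y)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(x, y) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def rbf(x, y, sigma: float = 0.1) -> float:
    """Radial basis function of the squared Euclidean distance."""
    # Words missing from one vector count as frequency 0.
    squared = sum((x.get(w, 0) - y.get(w, 0)) ** 2 for w in x.keys() | y.keys())
    return math.exp(-sigma * squared)


# Hypothetical payload vectors sharing one word.
x = Counter({b"GE": 2, b"ET": 1})
y = Counter({b"GE": 1, b"PO": 1})
scores = (jaccard(x, y), cosine(x, y), rbf(x, y))
```

All three functions return 1 for identical non-empty vectors, which matches the soft-classification range of 0 to 1 mentioned in the conclusion.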
Application Vector Heuristics
• Application vector
  – Represents typical packets generated by a target application as the center (basis) of each cluster
• Application vector generator
  – Reads packets from the target application trace
  – Divides the packets into several types of clusters without any pre-processing
[Figure: application vector generator; labels: application trace, application vectors 1–3, traffic clusters 1–2]
Application Vector Generation
• Unsupervised grouping within single-application traffic
  – Provides fine-grained classification
  – Classifies single-application traffic according to traffic type
[Figure: application traffic (packets 1–6) grouped into cluster 1 and cluster 2, with application vectors 1 and 2]
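The slides do not spell out the grouping procedure, so the following Python sketch substitutes a simple greedy scheme purely for illustration: each payload vector joins the most similar existing cluster if the similarity reaches a threshold, otherwise it starts a new cluster, and the application vector of a cluster is the mean of its members. The names `word_overlap` and `generate_application_vectors`, the threshold of 0.5, and the use of a Jaccard-style similarity are all assumptions.

```python
from collections import Counter


def word_overlap(x, y) -> float:
    """Jaccard similarity over the distinct words of two sparse vectors."""
    a, b = set(x), set(y)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def generate_application_vectors(payload_vectors, threshold=0.5):
    """Greedily group payload vectors; return one mean vector per group."""
    clusters = []  # each entry: {"sum": Counter of word counts, "count": int}
    for vec in payload_vectors:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = word_overlap(vec, cluster["sum"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # No cluster is similar enough: this packet starts a new one.
            clusters.append({"sum": Counter(vec), "count": 1})
        else:
            best["sum"].update(vec)  # add this packet's word counts
            best["count"] += 1
    return [
        {word: total / c["count"] for word, total in c["sum"].items()}
        for c in clusters
    ]


# Hypothetical payload vectors: two similar packets and one unrelated packet.
packets = [
    Counter({"a": 1, "b": 1}),
    Counter({"a": 2, "b": 1}),
    Counter({"x": 1, "y": 1}),
]
app_vectors = generate_application_vectors(packets)
```

With this input the first two packets share a cluster and the third forms its own, so two application vectors are produced.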
Two-stage Traffic Classification
• Packet-level clustering
  – Classifies single packets regardless of flow information
  – Compares payload vectors with application vectors by calculating similarity values
  – Marks each packet with its application and priority
  – Allows permutation of the packet sequence
• Flow-level classification
  – Rearranges packets according to flow information
  – Ignores mis-clustered packets caused by protocol ambiguities
    • HTTP for Web
    • HTTP for P2P
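The two stages can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' algorithm: the flow-level decision is taken by majority vote over the packet labels (the slides only say that mis-clustered packets are ignored), the priority marking is omitted, and the function names, the threshold, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def classify_packets(packets, app_vectors, similarity, threshold=0.5):
    """Stage 1: label each packet on its own, ignoring flow and order."""
    labels = []
    for flow_id, vec in packets:
        best_app, best_sim = None, threshold
        for app, app_vec in app_vectors:
            sim = similarity(vec, app_vec)
            if sim >= best_sim:
                best_app, best_sim = app, sim
        labels.append((flow_id, best_app))  # None means "no match"
    return labels


def classify_flows(labels):
    """Stage 2: regroup labels by flow and keep the majority application."""
    votes = defaultdict(Counter)
    for flow_id, app in labels:
        if app is not None:
            votes[flow_id][app] += 1
    # Minority labels within a flow are treated as mis-clustered packets.
    return {flow: count.most_common(1)[0][0] for flow, count in votes.items()}


def word_overlap(x, y) -> float:
    """Jaccard similarity over the distinct words of two sparse vectors."""
    a, b = set(x), set(y)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical application vectors and packets tagged with a flow identifier.
app_vectors = [("BitTorrent", {"bt": 1}), ("FileGuri", {"fg": 1})]
packets = [
    ("F1", {"bt": 1}),
    ("F1", {"bt": 1, "x": 1}),
    ("F1", {"fg": 1}),  # ambiguous packet that lands in the wrong cluster
    ("F2", {"fg": 1}),
    ("F2", {"zz": 1}),  # matches no application vector
]
flow_labels = classify_flows(classify_packets(packets, app_vectors, word_overlap))
```

Because stage 1 never looks at packet order or direction, each direction of a flow can be labeled independently, which is the property the slides rely on for asymmetric routing.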
Two-stage Traffic Classification
[Figure: two-stage classification example. Stage 1: packets of flows F1 and F2 (P1–P4 each) from backbone traffic, including BitTorrent and FileGuri traffic, are grouped into clusters 1–3 around application vectors 1–3 (BitTorrent, FileGuri, Melon). Stage 2: packets are regrouped into Flow 1 and Flow 2 and labeled BitTorrent or FileGuri; mis-clustered packets are marked.]
Evaluation
Classifying Real-world Traffic
• Fixed-port applications
  – Traffic trace captured at one of the two Internet junctions at POSTECH using an optical tap
  – Ground-truth traffic
    • Some active flows among the application traffic, distinguished by their use of an active port number
  – Target applications
    • FileGuri, ClubBox, Melon, BigFile
• Untraceable-port applications
  – Traffic Measurement Agent (TMA)
    • Monitors the network interface of the host
    • Records log data (flow five-tuples, process name, packet count, etc.)
  – Target applications
    • eMule, BitTorrent
[Figure: backbone traffic, target application traffic, and ground-truth traffic]
Classification Accuracy
• Classification accuracy comparison
  – Fixed-port applications
    • FileGuri, ClubBox, Melon, BigFile
  – Untraceable-port applications
    • eMule, BitTorrent
  – Jaccard similarity
    • Reliable
      – Counts common segments
  – Cosine similarity
    • Emphasizes common segments
      – Cannot distinguish ambiguous packets
  – RBF similarity
    • Difficult to set the parameter
      – Needs a guideline on how to set the parameter
• BitTorrent traffic on the backbone network
  – Traffic over-classification by Cosine similarity
  – High false positive rate of Cosine similarity
Histogram of Similarity Values
CDF of Distance among Payload Vectors
Complexity Analysis
Conclusion and Future Work
• Developed a new traffic classification approach
  – Applies a document classification approach to traffic classification
  – Unsupervised classification to form clusters within single-application traffic
  – Two-stage classification algorithm to solve the asymmetric routing classification problem
  – Linear time complexity
• Compared three similarity metrics
  – Provides a guideline for selecting similarity metrics for traffic classification
  – Provides soft classification that represents similarity as a numerical value ranging from 0 to 1
• Future work
  – Enhance the unsupervised classification methodology for automated signature generation
  – Extract orthogonal application vectors to improve scalability