Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering
Pattern Recognition and Applications Lab
University of Cagliari, Italy
Department of Electrical and Electronic Engineering
Poisoning Complete-Linkage Hierarchical Clustering
Battista Biggio1, Samuel Rota Bulò2, Ignazio Pillai1, Michele Mura1, Eyasu Zemene Mequanint3, Marcello Pelillo3, and Fabio Roli1
(1) Università di Cagliari (IT); (2) FBK-irst, Trento (IT); (3) Università Ca’ Foscari di Venezia (IT)
S+SSPR 2014, Joensuu, Finland, 20-22 August 2014
http://pralab.diee.unica.it
Threats and Attacks in Computer Security
• Growing number of devices, services and applications connected to the Internet
• Vulnerabilities and attacks through malicious software (malware)
– Examples: Android market, malware applications
• Identity theft
• Stolen credentials / credit card numbers
Threats and Attacks in Computer Security
• Need for (automated) detection (and rule generation)
– machine learning-based defenses (data clustering)
• Evasion: malware families / variants
– +65% new malware variants from 2012 to 2013 (Mobile Adware and Malware Analysis, Symantec, 2014)
• Detection: antivirus systems, rule-based systems
Data Clustering for Computer Security
• Goal: clustering of malware families to identify common characteristics and design suitable countermeasures
– e.g., antivirus rules / signatures
[Figure: malware clustering pipeline]
• data collection (honeypots)
• feature extraction (e.g., URL length, num. of parameters, etc.), giving feature vectors x1 x2 ... xd
• clustering of malware families (e.g., similar HTTP requests)
• data analysis / countermeasure design (e.g., signature generation): if … then … else …
e.g., suspicious HTTP request to a web server:
http://www.vulnerablehotel.com/components/com_hbssearch/longDesc.php?h_id=1&id=-2%20union%20select%20concat%28username,0x3a,password%29%20from%20jos_users--
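The feature extraction step can be illustrated with a minimal Python sketch. It maps one request URL to two of the features named on the slide (URL length, number of parameters) plus the query length. The function name `extract_url_features`, the returned keys, and the example URL are assumptions made for illustration; the slides do not specify how the features were actually computed.

```python
from urllib.parse import urlsplit, parse_qsl


def extract_url_features(url: str) -> dict:
    """Map one HTTP request URL to a few numeric features (illustrative choice)."""
    parts = urlsplit(url)
    # keep_blank_values=True so that parameters with an empty value are still counted.
    params = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "url_length": len(url),
        "num_parameters": len(params),
        "query_length": len(parts.query),
    }


if __name__ == "__main__":
    # Hypothetical request; the host is a placeholder, not taken from real traffic.
    example = "http://www.example.com/search.php?h_id=1&id=2"
    print(extract_url_features(example))
```

Each request then becomes one point in feature space, which is the representation the clustering step operates on.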
Is Data Clustering Secure?
• Attackers can poison input data to subvert malware clustering
[Figure: the same pipeline under attack]
• Well-crafted HTTP requests to subvert clustering: http://www.vulnerablehotel.com/…
• clustering of malware families (e.g., similar HTTP requests) … is significantly compromised
• data analysis / countermeasure design (e.g., signature generation) … becomes useless (too many false alarms, low detection rate)
(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
Is Data Clustering Secure?
• Earlier work (1,2): qualitative definition of attacks
– Samples can be added to merge (and/or split) existing clusters
– Samples can be obfuscated and hidden within existing clusters (e.g., fringe clusters)
[Figure: clustering on untainted data, shown alongside the two attacks above]
(1) D. B. Skillicorn. Adversarial knowledge discovery. IEEE Intelligent Systems, 24:54–61, 2009.
(2) J. G. Dutrisac and D. Skillicorn. Hiding clusters in adversarial settings. In IEEE Int’l Conf. Intelligence and Security Informatics, pp. 185–187, 2008.
Is Data Clustering Secure?
• Our previous work (1):
– Framework for security evaluation of clustering algorithms
– Formalization of poisoning and obfuscation attacks (optimization)
– Case study on single-linkage hierarchical clustering
• Although hierarchical clustering is widely used for malware clustering (2,3), it is significantly vulnerable to well-crafted attacks!
• In this work we focus on poisoning attacks against complete-linkage hierarchical clustering
(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
(2) R. Perdisci, D. Ariu, and G. Giacinto. Scalable fine-grained behavioral clustering of http-based malware. Computer Networks, 57(2):487-500, 2013.
(3) K. Rieck, P. Trinius, C. Willems, and T. Holz. Automatic analysis of malware behavior using machine learning. J. Comput. Secur., 19(4):639-668, 2011.
Complete-Linkage Hierarchical Clustering
• Bottom-up agglomerative clustering
– each point is initially considered as a cluster
– closest clusters are iteratively merged
• Linkage criterion to define distance between clusters
– complete-linkage criterion: $\mathrm{dist}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} d(a, b)$
• Clustering output is a hierarchy of clusterings
– Criterion needed to select a given clustering (e.g., number of clusters)
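The bottom-up procedure and the complete-linkage criterion can be sketched in plain Python as follows. This is a naive illustration, not the implementation used in the experiments: it assumes Euclidean distance for d(a, b), selects a clustering from the hierarchy by a fixed number of clusters, and the function names are hypothetical.

```python
import math


def complete_linkage_distance(ci, cj, points):
    """dist(Ci, Cj) = max over a in Ci, b in Cj of d(a, b), with Euclidean d."""
    return max(math.dist(points[a], points[b]) for a in ci for b in cj)


def complete_linkage(points, num_clusters):
    """Naive bottom-up agglomerative clustering, stopped at num_clusters clusters.

    Returns a list of clusters, each a sorted list of point indices.
    """
    # Each point is initially considered as a cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest complete-linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = complete_linkage_distance(clusters[i], clusters[j], points)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        # Merge the closest pair (j > i, so removing j keeps index i valid).
        clusters[i] = sorted(clusters[i] + clusters[j])
        clusters.pop(j)
    return clusters


if __name__ == "__main__":
    toy_points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 5)]
    print(complete_linkage(toy_points, 3))
```

Because the distance between two clusters is set by their farthest pair of points, a single added point that stretches a cluster changes which merges happen next; this is the property the poisoning attack exploits.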
Poisoning Attacks
• Goal: to maximally compromise the clustering output on D
• Capability: adding m attack samples
• Knowledge: perfect / worst-case attack
• Attack strategy: $\max_A \; d_c(Y, Y'(A)), \quad A = \{a_i\}_{i=1}^{m}$
– $Y = f(D)$: clustering on untainted data D
– $Y' = f_D(D \cup A)$: clustering output on D after adding the attack samples A
– $d_c$: distance between the clustering in the absence of attack and that under attack
[Figure: clustering on untainted data D, and with attack samples A]
Poisoning Attacks
• Attack strategy: $\max_A \; d_c(Y, Y'(A)), \quad A = \{a_i\}_{i=1}^{m}$
• For a given clustering: $d_c(Y, Y') = \| Y Y^T - Y' Y'^T \|_F$
– Example with 5 samples (rows: Sample 1 … Sample 5) and 3 clusters (columns):
$$Y = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \quad Y Y^T = \begin{bmatrix} 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$
• How to choose a given clustering from the hierarchy?
– The clustering algorithm chooses the number of clusters that minimizes the attacker’s objective!
– This gives us a lower bound on the worst-case attack’s impact!
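The objective d_c can be computed directly from cluster labels, since entry (i, j) of Y Y^T is 1 exactly when samples i and j share a cluster. The sketch below assumes each clustering is given as one integer label per sample, and that the labels under attack cover only the original samples in D; the function names are hypothetical.

```python
import math


def co_membership(labels):
    """Return Y Y^T as nested lists: entry (i, j) is 1 if samples i and j share a cluster."""
    return [[1 if li == lj else 0 for lj in labels] for li in labels]


def clustering_distance(labels, labels_attacked):
    """d_c(Y, Y') = || Y Y^T - Y' Y'^T ||_F for two labelings of the same samples."""
    if len(labels) != len(labels_attacked):
        raise ValueError("both clusterings must cover the same samples")
    a = co_membership(labels)
    b = co_membership(labels_attacked)
    n = len(a)
    squared = sum((a[i][j] - b[i][j]) ** 2 for i in range(n) for j in range(n))
    return math.sqrt(squared)


if __name__ == "__main__":
    # Labels matching the 5-sample example: samples 1 and 4 together, 2 and 3 together.
    for row in co_membership([0, 2, 2, 0, 1]):
        print(row)
```

Working with Y Y^T rather than Y makes the measure independent of how clusters are numbered, so two clusterings that differ only by a relabeling are at distance zero.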
Poisoning Complete-Linkage Clustering
• Attack strategy: $\max_A \; d_c(Y, Y'(A)), \quad A = \{a_i\}_{i=1}^{m}$
• Heuristic-based solutions
– Greedy approach: adding one attack sample at a time
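The greedy approach can be sketched as a loop that adds, at each step, the candidate point that most increases the objective. This is an illustration under strong simplifying assumptions: the candidate pool is a small finite list, and `largest_gap_clustering` is a toy 1-D stand-in for a real clustering algorithm, used only to keep the example self-contained. All names are hypothetical.

```python
def largest_gap_clustering(points):
    """Toy 1-D stand-in for a clustering algorithm: split into two at the largest gap."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    gaps = [points[order[k + 1]] - points[order[k]] for k in range(len(order) - 1)]
    cut = gaps.index(max(gaps))
    labels = [0] * len(points)
    for rank, i in enumerate(order):
        labels[i] = 0 if rank <= cut else 1
    return labels


def co_membership_distance(labels_a, labels_b):
    """Frobenius distance between the co-membership matrices of two labelings."""
    n = len(labels_a)
    return sum(
        ((labels_a[i] == labels_a[j]) - (labels_b[i] == labels_b[j])) ** 2
        for i in range(n)
        for j in range(n)
    ) ** 0.5


def greedy_poisoning(data, candidates, cluster_fn, distance_fn, num_attack_samples):
    """Greedily add attack samples, one at a time, to maximize the objective."""
    n = len(data)
    clean_labels = cluster_fn(data)  # Y = f(D)
    attack = []
    for _ in range(num_attack_samples):
        best_point, best_value = None, float("-inf")
        for point in candidates:
            # Cluster D union A, then keep only the labels of the samples in D.
            tainted_labels = cluster_fn(data + attack + [point])[:n]
            value = distance_fn(clean_labels, tainted_labels)
            if value > best_value:
                best_point, best_value = point, value
        attack.append(best_point)
    return attack


if __name__ == "__main__":
    data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
    print(greedy_poisoning(data, [3.0, 6.0, 30.0],
                           largest_gap_clustering, co_membership_distance, 1))
```

In this toy setting the far-away point wins because it moves the largest gap, so the two original clusters end up merged. Each greedy step costs one clustering run per candidate, which is why the number of candidates matters.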
Poisoning Complete-Linkage Clustering
• Local maxima are found at the clusters’ boundaries (wide regions)
[Figure: objective $d_c(Y, Y'(a))$ for a single attack point a, plotted over the feature space (x1, x2)]
Poisoning Complete-Linkage Clustering
• Underlying idea: to increase intra-cluster distance (extend attack)
• For each cluster, consider two candidate attack points
[Figure: candidate attack points]
Poisoning Complete-Linkage Clustering
1. Extend (Best): evaluates Y’(a) for each candidate attack, retaining the best one
– Clustering is run for each candidate attack point, twice per cluster
2. Extend (Hard): estimates Y’(a) assuming that each candidate will split the corresponding cluster, potentially merging it with a fragment of the closest cluster
– It does not require running clustering to find the best attack point
3. Extend (Soft): estimates Y’(a) as in Extend (Hard), but using a soft probabilistic estimate instead of 0/1 sample-to-cluster assignments
– It does not require running clustering to find the best attack point
Poisoning Complete-Linkage Clustering
• The attack compromises the initial clustering by forming heterogeneous clusters
[Figure: clustering on untainted data vs. clustering after adding 10 attack samples]
Experimental Setup
• Banana: artificial data, 80 samples, 2 features, k=4 initial clusters
• Malware: real data (1), 1,000 samples, 6 features, k≈9 initial clusters (estimated from the data by minimizing the Davies-Bouldin Index)
– Features:
1. number of GET requests
2. number of POST requests
3. average URL length
4. average number of URL parameters
5. average amount of data sent by POST requests
6. average response length
• MNIST Handwritten Digits: real data, 330 samples per cluster, 28 x 28 = 784 features (pixels), k=3 initial clusters corresponding to
(1) R. Perdisci, D. Ariu, and G. Giacinto. Scalable fine-grained behavioral clustering of http-based malware. Computer Networks, 57(2):487-500, 2013.
Experimental Results
• Attack strategies: Extend (Best/Hard/Soft), Random, Random (Best)
– Banana:
• Extend (Best) very close to Optimal (Grid Search)
• Random (Best) competitive with Extend (Hard / Soft)
[Figure: Banana. Objective function vs. fraction of samples controlled by the attacker (0% to 20%; 11.1% corresponds to 10 attack samples). Legend: Random, Random (Best), Extend (Hard), Extend (Soft), Extend (Best), Optimal (Grid Search)]
Experimental Results
• Attack strategies: Extend (Best/Hard/Soft), Random, Random (Best)
– Malware:
• Extend attacks and Random (Best) perform rather well
– MNIST Handwritten Digits:
• Random (Best) not effective (high-dimensional feature space)
• Extend (Soft) outperforms Extend (Best / Hard)
[Figure: two panels, Malware (0% to 5%) and Digits (0.0% to 1.0%). Objective function vs. fraction of samples controlled by the attacker. Legend: Random, Random (Best), Extend (Hard), Extend (Soft), Extend (Best), Optimal (Grid Search)]
Conclusions and Future Work
• Framework for security evaluation of clustering algorithms
• Poisoning attack vs. complete-linkage hierarchical clustering
– Even random-based attacks can be effective!
• Future work
– Extensions to other clustering algorithms, common attack strategy
• e.g., black-box optimization with suitable heuristics
– Attacks with limited knowledge of the input data
[Figure: attacks against clustering, secure clustering algorithms]
Any questions?
Thanks for your attention!
Extra slides
Is Data Clustering Secure?
• Our previous work (1):
– Framework for security evaluation of clustering algorithms
1. Formal definition of potential attacks
2. Empirical evaluation of their impact
• Adversary’s model
– Goal (security violation)
– Knowledge of the attacked system
– Capability of manipulating the input data
– Attack strategy (optimization problem)
• Inspired by previous work on adversarial machine learning
– Barreno et al., Can machine learning be secure?, ASIACCS 2006
– Huang et al., Adversarial machine learning, AISec 2011
– Biggio et al., Security evaluation of pattern classifiers under attack, IEEE Trans. Knowledge and Data Eng., 2013
(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
Adversary’s Goal
• Security violation
– Integrity: hiding clusters / malicious activities without compromising normal system operation
• e.g., creating fringe clusters → obfuscation attack
– Availability: compromising normal system operation by maximally altering the clustering output
• e.g., merging existing clusters → poisoning attack
– Privacy: gaining confidential information about system users by reverse-engineering the clustering process
(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
Adversary’s Knowledge
• Perfect knowledge – upper bound on the performance degradation under attack
[Figure: what the attacker may know. Input data, feature representation (x1 x2 ... xd), clustering algorithm (e.g., k-means), algorithm parameters (e.g., initialization)]
(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
Adversary’s Capability
• Attacker’s capability is bounded:
– maximum number of samples that can be added to the input data
• e.g., the attacker may only control a small fraction of malware samples collected by a honeypot
– maximum amount of modifications (application-specific constraints in feature space)
• e.g., malware samples should preserve their malicious functionality (elements cannot be removed → features can only be incremented)
[Figure: a sample x, its feasible domain, and a modified sample x']
(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
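The two capability bounds can be written as simple checks. The sketch below is only one possible reading: it assumes the amount of modification is measured with the L1 norm (the slides do not name a norm), and it measures the attacker's share as attack samples over the total number of samples after the attack. Function and parameter names are hypothetical.

```python
def is_feasible(original, modified, max_total_change):
    """Check one modified sample against two illustrative constraints.

    1. Features can only be incremented (no element of the sample is removed).
    2. The total change, measured here with the L1 norm (an assumption),
       is at most max_total_change.
    """
    if len(original) != len(modified):
        raise ValueError("feature vectors must have the same length")
    only_incremented = all(m >= o for o, m in zip(original, modified))
    total_change = sum(abs(m - o) for o, m in zip(original, modified))
    return only_incremented and total_change <= max_total_change


def within_budget(num_attack_samples, num_clean_samples, max_fraction):
    """Check the bound on the fraction of samples controlled by the attacker."""
    total = num_clean_samples + num_attack_samples
    return num_attack_samples / total <= max_fraction


if __name__ == "__main__":
    print(is_feasible([3, 1, 40], [4, 1, 45], max_total_change=10))
    print(within_budget(10, 80, max_fraction=0.12))
```

Under this reading, 10 attack samples added to 80 clean ones give the attacker 10/90, about 11.1% of the data.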
Formalizing the Optimal Attack Strategy
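$$\max_A \; \mathbb{E}_{\theta \sim \mu}\left[ g(A; \theta) \right] \quad \text{s.t. } A \in \Omega$$
– Attacker’s goal: the objective $g(A; \theta)$
– Knowledge of the data, features, …: the distribution $\mu$ over $\theta$
– Capability of manipulating the input data: the constraint $A \in \Omega$
• Perfect knowledge: $\mathbb{E}_{\theta \sim \mu}\left[ g(A; \theta) \right] = g(A; \theta_0)$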
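The expectation in the attack strategy can be illustrated with a Monte Carlo estimate. This sketch assumes the attacker's uncertainty is available as a sampler for θ; the objective `toy_objective` and the sampler are placeholders invented for the example, not part of the original framework. With perfect knowledge the sampler always returns θ0, so the estimate reduces to g(A; θ0).

```python
import random


def expected_objective(g, attack, sample_theta, num_draws=1000, seed=0):
    """Monte Carlo estimate of the expected value of g(A; theta) for theta ~ mu."""
    rng = random.Random(seed)
    return sum(g(attack, sample_theta(rng)) for _ in range(num_draws)) / num_draws


def toy_objective(attack, theta):
    """Placeholder objective: grows with the number of attack samples, scaled by theta."""
    return theta * len(attack)


if __name__ == "__main__":
    attack = [1, 2, 3]
    # Perfect knowledge: theta is known exactly (theta0 = 2.0 here).
    print(expected_objective(toy_objective, attack, lambda rng: 2.0, num_draws=10))
    # Limited knowledge: theta is only known to lie between 1 and 3.
    print(expected_objective(toy_objective, attack, lambda rng: rng.uniform(1.0, 3.0)))
```

The constraint A ∈ Ω is not enforced here; in practice the maximization would only range over attack sets that satisfy the capability bounds.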