Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

Pattern Recognition and Applications Lab, Department of Electrical and Electronic Engineering, University of Cagliari, Italy. Poisoning Complete-Linkage Hierarchical Clustering. Battista Biggio (1), Samuel Rota Bulò (2), Ignazio Pillai (1), Michele Mura (1), Eyasu Zemene Mequanint (3), Marcello Pelillo (3), and Fabio Roli (1). (1) Università di Cagliari (IT); (2) FBK-irst, Trento (IT); (3) Università Ca’ Foscari di Venezia (IT). S+SSPR 2014, Joensuu, Finland, 20-22 August 2014.


Page 1: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

Pattern Recognition and Applications Lab
Department of Electrical and Electronic Engineering
University of Cagliari, Italy

Poisoning Complete-Linkage Hierarchical Clustering

Battista Biggio (1), Samuel Rota Bulò (2), Ignazio Pillai (1), Michele Mura (1), Eyasu Zemene Mequanint (3), Marcello Pelillo (3), and Fabio Roli (1)

(1) Università di Cagliari (IT); (2) FBK-irst, Trento (IT); (3) Università Ca’ Foscari di Venezia (IT)

S+SSPR 2014, Joensuu, Finland, 20-22 August 2014

Page 2: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

http://pralab.diee.unica.it

Threats and Attacks in Computer Security

•  Growing number of devices, services and applications connected to the Internet

•  Vulnerabilities and attacks through malicious software (malware)

–  Examples: Android market, malware applications

•  Identity theft

•  Stolen credentials / credit card numbers

Page 3: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Threats and Attacks in Computer Security

•  Evasion: malware families / variants

–  +65% new malware variants from 2012 to 2013 (Mobile Adware and Malw. Analysis, Symantec, 2014)

•  Detection: antivirus systems, rule-based systems

•  Need for (automated) detection (and rule generation)

–  machine learning-based defenses (data clustering)

Page 4: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Data Clustering for Computer Security

•  Goal: clustering of malware families to identify common characteristics and design suitable countermeasures

–  e.g., antivirus rules / signatures

[Figure: malware clustering pipeline]

1.  data collection (honeypots)
2.  feature extraction (e.g., URL length, num. of parameters, etc.) into feature vectors x1 x2 ... xd
3.  clustering of malware families (e.g., similar HTTP requests)
4.  data analysis / countermeasure design (e.g., signature generation): if … then … else …

e.g., suspicious HTTP request to a web server:

http://www.vulnerablehotel.com/components/com_hbssearch/longDesc.php?h_id=1&id=-2%20union%20select%20concat%28username,0x3a,password%29%20from%20jos_users--
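The feature-extraction step can be illustrated with a minimal sketch that computes two of the features named on this slide (URL length and number of parameters) for the example request. The function name `url_features` and the dictionary keys are hypothetical, and the URL is reassembled on the assumption that the spaces in the transcript are line-wrap artifacts; the authors' actual extraction code is not shown in the slides.

```python
from urllib.parse import urlsplit, parse_qsl

def url_features(url):
    """Return two simple per-request features (hypothetical helper)."""
    parts = urlsplit(url)
    # parse_qsl splits the query string into (name, value) pairs.
    params = parse_qsl(parts.query, keep_blank_values=True)
    return {"url_length": len(url), "num_params": len(params)}

# Example request from the slide, reassembled (assumption: spaces were line wraps).
URL = (
    "http://www.vulnerablehotel.com/components/com_hbssearch/longDesc.php"
    "?h_id=1&id=-2%20union%20select%20concat%28username,0x3a,password%29"
    "%20from%20jos_users--"
)

features = url_features(URL)
print(features)
```

Each request then becomes one row of the feature matrix that is passed to the clustering step.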

Page 5: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Is Data Clustering Secure?

•  Attackers can poison input data to subvert malware clustering

[Figure: malware clustering pipeline under attack]

1.  data collection (honeypots), fed with well-crafted HTTP requests to subvert clustering:
    http://www.vulnerablehotel.com/…
    http://www.vulnerablehotel.com/…
    http://www.vulnerablehotel.com/…
    http://www.vulnerablehotel.com/…
2.  feature extraction (e.g., URL length, num. of parameters, etc.) into feature vectors x1 x2 ... xd
3.  clustering of malware families (e.g., similar HTTP requests) … is significantly compromised
4.  data analysis / countermeasure design (e.g., signature generation) … becomes useless (too many false alarms, low detection rate)

(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.

Page 6: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Is Data Clustering Secure?

•  Earlier work (1,2): qualitative definition of attacks

–  Samples can be added to merge (and/or split) existing clusters

–  Samples can be obfuscated and hidden within existing clusters (e.g., fringe clusters)

[Figure: scatter plots illustrating these attacks; panel label: Clustering on untainted data]

(1) D. B. Skillicorn. Adversarial knowledge discovery. IEEE Intelligent Systems, 24:54–61, 2009.

(2) J. G. Dutrisac and D. Skillicorn. Hiding clusters in adversarial settings. In IEEE Int’l Conf. Intelligence and Security Informatics, pp. 185–187, 2008.

Page 7: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Is Data Clustering Secure?

•  Our previous work (1):

–  Framework for security evaluation of clustering algorithms
–  Formalization of poisoning and obfuscation attacks (optimization)
–  Case study on single-linkage hierarchical clustering

•  Although hierarchical clustering is widely used for malware clustering (2,3), it is significantly vulnerable to well-crafted attacks!

•  In this work we focus on poisoning attacks against complete-linkage hierarchical clustering

(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.

(2) R. Perdisci, D. Ariu, and G. Giacinto. Scalable fine-grained behavioral clustering of HTTP-based malware. Computer Networks, 57(2):487–500, 2013.

(3) K. Rieck, P. Trinius, C. Willems, and T. Holz. Automatic analysis of malware behavior using machine learning. J. Comput. Secur., 19(4):639–668, 2011.

Page 8: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Complete-Linkage Hierarchical Clustering

•  Bottom-up agglomerative clustering

–  each point is initially considered as a cluster
–  closest clusters are iteratively merged

•  Linkage criterion to define distance between clusters

–  complete-linkage criterion: $\mathrm{dist}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} d(a, b)$

•  Clustering output is a hierarchy of clusterings

–  Criterion needed to select a given clustering (e.g., number of clusters)
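As a small illustration of the complete-linkage criterion, the sketch below computes the distance between two clusters as the largest pairwise point distance, then builds the hierarchy with SciPy and cuts it at a chosen number of clusters. The toy data, the Euclidean metric and the helper name `complete_linkage_distance` are assumptions made for the example; they are not taken from the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

def complete_linkage_distance(Ci, Cj):
    """dist(Ci, Cj) = max over a in Ci, b in Cj of d(a, b), with Euclidean d."""
    return cdist(Ci, Cj).max()

# Two tiny clusters on a line: the farthest pair is (0, 0) and (4, 0).
print(complete_linkage_distance(np.array([[0.0, 0.0], [1.0, 0.0]]),
                                np.array([[3.0, 0.0], [4.0, 0.0]])))

# Toy data (assumption): three well-separated 2-D blobs of 10 points each.
rng = np.random.default_rng(0)
centers = ([0, 0], [3, 0], [0, 3])
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(10, 2)) for c in centers])

# Bottom-up agglomerative clustering with the complete-linkage criterion.
Z = linkage(X, method="complete", metric="euclidean")

# Select one clustering from the hierarchy by fixing the number of clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

Here the clustering is selected by fixing the number of clusters; any other selection criterion could be applied to the same linkage matrix `Z`.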

Page 9: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Poisoning Attacks

•  Goal: to maximally compromise the clustering output on D
•  Capability: adding m attack samples
•  Knowledge: perfect / worst-case attack

•  Attack strategy:

$$\max_{A} \; d_c\big(Y, Y'(A)\big), \qquad A = \{a_i\}_{i=1}^{m}$$

–  $Y = f(D)$: clustering on untainted data D
–  $Y' = f_D(D \cup A)$: clustering under attack, with attack samples A
–  $d_c$: distance between the clustering in the absence of attack and that under attack

[Figure: clustering on untainted data D, and the same data with attack samples A added]

Page 10: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Poisoning Attacks

•  Attack strategy: $\max_{A} \; d_c\big(Y, Y'(A)\big)$, $A = \{a_i\}_{i=1}^{m}$

•  Distance between clusterings: $d_c(Y, Y') = \| Y Y^\top - Y' {Y'}^\top \|_F$

•  For a given clustering (rows run from Sample 1 to Sample 5, columns are clusters):

$$Y = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \qquad Y Y^\top = \begin{bmatrix} 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$

•  How to choose a given clustering from the hierarchy?

–  The clustering algorithm chooses the number of clusters that minimizes the attacker’s objective!
–  This gives us a lower bound on the worst-case attack’s impact!
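The distance between two clusterings can be checked numerically with a short sketch. It rebuilds the 5-sample assignment matrix Y from this slide and compares it with a second, purely hypothetical clustering in which sample 3 has moved into the cluster of sample 5; the helper names `one_hot` and `clustering_distance` are illustrative, not from the source.

```python
import numpy as np

def one_hot(labels):
    """Binary sample-to-cluster assignment matrix Y (one row per sample)."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    return (labels[:, None] == clusters[None, :]).astype(float)

def clustering_distance(labels_a, labels_b):
    """d_c = Frobenius norm of (Ya Ya^T - Yb Yb^T)."""
    Ya, Yb = one_hot(labels_a), one_hot(labels_b)
    return np.linalg.norm(Ya @ Ya.T - Yb @ Yb.T, ord="fro")

clean = [0, 2, 2, 0, 1]      # the clustering shown on the slide
attacked = [0, 2, 1, 0, 1]   # hypothetical: sample 3 joins sample 5's cluster

print(one_hot(clean) @ one_hot(clean).T)      # reproduces the Y Y^T matrix above
print(clustering_distance(clean, attacked))   # 2.0
```

Because Y Y^T only records which pairs of samples share a cluster, the measure does not depend on how clusters are numbered: relabeling the clusters of either clustering leaves the distance unchanged.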

Page 11: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Poisoning Complete-Linkage Clustering

•  Attack strategy: $\max_{A} \; d_c\big(Y, Y'(A)\big)$, $A = \{a_i\}_{i=1}^{m}$

•  Heuristic-based solutions

–  Greedy approach: adding one attack sample at a time
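The greedy approach can be sketched as a loop that adds one attack sample per iteration, keeping the candidate that most increases the objective. This is only an illustration under several assumptions: candidates are drawn uniformly at random inside the bounding box of the data (the slides do not specify how random candidates are sampled), the number of clusters k is held fixed rather than chosen by the algorithm, and the clustering under attack is evaluated on the original samples D only. All function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster(X, k):
    """Complete-linkage clustering cut at (at most) k clusters."""
    return fcluster(linkage(X, method="complete"), t=k, criterion="maxclust")

def comembership(labels):
    """Y Y^T: entry (i, j) is 1 when samples i and j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def objective(D, A, k, clean_labels):
    """d_c between the clean clustering and the attacked one, restricted to D."""
    tainted = cluster(np.vstack([D, A]), k)[: len(D)]
    return np.linalg.norm(comembership(clean_labels) - comembership(tainted))

def greedy_poisoning(D, m, k, n_candidates=20, seed=0):
    rng = np.random.default_rng(seed)
    clean_labels = cluster(D, k)
    lo, hi = D.min(axis=0), D.max(axis=0)
    A = np.empty((0, D.shape[1]))
    history = []
    for _ in range(m):  # one attack sample at a time
        candidates = rng.uniform(lo, hi, size=(n_candidates, D.shape[1]))
        scores = [objective(D, np.vstack([A, c]), k, clean_labels)
                  for c in candidates]
        best = int(np.argmax(scores))
        A = np.vstack([A, candidates[best]])
        history.append(scores[best])
    return A, history

# Toy data (assumption): three 2-D blobs of 15 points each.
rng = np.random.default_rng(1)
D = np.vstack([rng.normal(loc=c, scale=0.3, size=(15, 2))
               for c in ([0, 0], [3, 0], [0, 3])])
A, history = greedy_poisoning(D, m=3, k=3)
print(A)
print(history)
```

Each iteration reruns the clustering once per candidate, so the cost grows with both the number of candidates and the number of attack samples; a smaller, better-targeted candidate set reduces that cost.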

Page 12: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Poisoning Complete-Linkage Clustering

•  Local maxima are found at the clusters’ boundaries (wide regions)

[Figure: objective $d_c\big(Y, Y'(a)\big)$ for a single attack sample a, plotted over the feature space (x1, x2)]

Page 13: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Poisoning Complete-Linkage Clustering

•  Underlying idea: to increase intra-cluster distance (extend attack)
•  For each cluster, consider two candidate attack points

[Figure: candidate attack points]

Page 14: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Poisoning Complete-Linkage Clustering

•  Underlying idea: to increase intra-cluster distance (extend attack)

Page 15: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Poisoning Complete-Linkage Clustering

•  Underlying idea: to increase intra-cluster distance (extend attack)

[Figure: candidate attack points]

Page 16: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Poisoning Complete-Linkage Clustering

1.  Extend (Best): evaluates Y’(a) for each candidate attack point, retaining the best one

–  Clustering is run once for each candidate attack point, i.e., twice per cluster

2.  Extend (Hard): estimates Y’(a) assuming that each candidate will split the corresponding cluster, potentially merging it with a fragment of the closest cluster

–  It does not require running clustering to find the best attack point

3.  Extend (Soft): estimates Y’(a) as in Extend (Hard), but using a soft probabilistic estimate instead of 0/1 sample-to-cluster assignments

–  It does not require running clustering to find the best attack point

Page 17: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Poisoning Complete-Linkage Clustering

•  The attack compromises the initial clustering by forming heterogeneous clusters

[Figure: two panels, "Clustering on untainted data" and "Clustering after adding 10 attack samples"]

Page 18: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Experimental Setup

•  Banana: artificial data, 80 samples, 2 features, k=4 initial clusters

•  Malware: real data (1), 1,000 samples, 6 features, k≈9 initial clusters (estimated from data by minimizing the Davies-Bouldin Index)

–  Features:
1.  number of GET requests
2.  number of POST requests
3.  average URL length
4.  average number of URL parameters
5.  average amount of data sent by POST requests
6.  average response length

•  MNIST Handwritten Digits: real data, 330 samples per cluster, 28 x 28 = 784 features (pixels), k=3 initial clusters corresponding to [digit images]

(1) R. Perdisci, D. Ariu, and G. Giacinto. Scalable fine-grained behavioral clustering of HTTP-based malware. Computer Networks, 57(2):487–500, 2013.
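Estimating the number of clusters by minimizing the Davies-Bouldin Index can be sketched as follows. The sketch cuts one complete-linkage hierarchy at several values of k and keeps the value with the lowest index, using scikit-learn's `davies_bouldin_score`. The toy data, the candidate range for k and the function name `select_k` are assumptions for illustration; the slides do not describe the exact procedure used in the experiments.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import davies_bouldin_score

def select_k(X, k_values=range(2, 8)):
    """Return the k with the lowest Davies-Bouldin Index, plus all scores."""
    Z = linkage(X, method="complete")
    scores = {}
    for k in k_values:
        labels = fcluster(Z, t=k, criterion="maxclust")
        if len(np.unique(labels)) < 2:
            continue  # the index is undefined for a single cluster
        scores[k] = davies_bouldin_score(X, labels)
    best_k = min(scores, key=scores.get)
    return best_k, scores

# Toy data (assumption): four tight, well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = ([0, 0], [5, 0], [0, 5], [5, 5])
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(20, 2)) for c in centers])

best_k, scores = select_k(X)
print(best_k)
```

Lower values of the index indicate compact clusters that lie far from each other, which is why the minimum is taken.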

Page 19: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Experimental Results

•  Attack strategies: Extend (Best/Hard/Soft), Random, Random (Best)

–  Banana:
•  Extend (Best) very close to Optimal (Grid Search)
•  Random (Best) competitive with Extend (Hard / Soft)

[Figure: "Banana". x-axis: fraction of samples controlled by the attacker (0% to 20%); y-axis: objective function (0 to 50). Curves: Random, Random (Best), Extend (Hard), Extend (Soft), Extend (Best), Optimal (Grid Search). Annotation: 0% and 11.1% (10 attack samples)]

Page 20: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Experimental Results

•  Attack strategies: Extend (Best/Hard/Soft), Random, Random (Best)

–  Malware:
•  Extend attacks and Random (Best) perform rather well

–  MNIST Handwritten Digits:
•  Random (Best) not effective
–  high-dimensional feature space
•  Extend (Soft) outperforms Extend (Best / Hard)

[Figure: two panels. "Malware": x-axis from 0% to 5%, y-axis from 0 to 150. "Digits": x-axis from 0.0% to 1.0%, y-axis from 0 to 250. x-axis: fraction of samples controlled by the attacker; y-axis: objective function. Legend: Random, Random (Best), Extend (Hard), Extend (Soft), Extend (Best), Optimal (Grid Search)]

Page 21: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Conclusions and Future Work

•  Framework for security evaluation of clustering algorithms
•  Poisoning attack vs. complete-linkage hierarchical clustering

–  Even random-based attacks can be effective!

•  Future work

–  Extensions to other clustering algorithms, common attack strategy
•  e.g., black-box optimization with suitable heuristics
–  Attacks with limited knowledge of the input data

[Figure labels: Secure clustering algorithms; Attacks against clustering]

Page 22: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Any questions?

Thanks for your attention!

Page 23: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Extra slides

Page 24: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Is Data Clustering Secure?

•  Our previous work (1):

–  Framework for security evaluation of clustering algorithms
1.  Formal definition of potential attacks
2.  Empirical evaluation of their impact

•  Adversary’s model

–  Goal (security violation)
–  Knowledge of the attacked system
–  Capability of manipulating the input data
–  Attack strategy (optimization problem)

•  Inspired by previous work on adversarial machine learning

–  Barreno et al., Can machine learning be secure?, ASIACCS 2006
–  Huang et al., Adversarial machine learning, AISec 2011
–  Biggio et al., Security evaluation of pattern classifiers under attack, IEEE Trans. Knowledge and Data Eng., 2013

(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.

Page 25: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Adversary’s Goal

•  Security violation

–  Integrity: hiding clusters / malicious activities without compromising normal system operation
•  e.g., creating fringe clusters → obfuscation attack

–  Availability: compromising normal system operation by maximally altering the clustering output
•  e.g., merging existing clusters → poisoning attack

–  Privacy: gaining confidential information about system users by reverse-engineering the clustering process

[Figure labels: Integrity, Availability, Privacy]

(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.

Page 26: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Adversary’s Knowledge

•  Perfect knowledge

–  upper bound on the performance degradation under attack

[Figure: system components the attacker may know: INPUT DATA; FEATURE REPRESENTATION (x1 x2 ... xd); CLUSTERING ALGORITHM (e.g., k-means); ALGORITHM PARAMETERS (e.g., initialization)]

(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.

Page 27: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

Adversary’s Capability

•  Attacker’s capability is bounded:

–  maximum number of samples that can be added to the input data
•  e.g., the attacker may only control a small fraction of malware samples collected by a honeypot

–  maximum amount of modifications (application-specific constraints in feature space)
•  e.g., malware samples should preserve their malicious functionality (elements cannot be removed → features can only be incremented)

[Figure: feasible domain around a sample x, containing the modified sample x']

(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
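The two constraints on modifications can be made concrete with a minimal sketch that maps a desired attack point back into a feasible domain. It assumes, purely for illustration, that the "maximum amount of modifications" is a budget on the total (L1) increase over the original sample x; the slides do not state which norm or bound is used, and the function name `to_feasible` is hypothetical.

```python
import numpy as np

def to_feasible(x, candidate, budget):
    """Map `candidate` to a point x' with x' >= x and sum(x' - x) <= budget."""
    x = np.asarray(x, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    # Features can only be incremented: drop any requested decrease.
    delta = np.maximum(candidate - x, 0.0)
    total = delta.sum()
    # Assumed L1 budget: shrink the modification if it exceeds the bound.
    if total > budget:
        delta *= budget / total
    return x + delta

x = np.array([1.0, 2.0, 3.0])
x_prime = to_feasible(x, candidate=[0.0, 5.0, 4.0], budget=2.0)
print(x_prime)   # [1.  3.5 3.5]
```

Rescaling keeps the direction of the requested change but is not the closest feasible point in the Euclidean sense; it is chosen here only because it is simple to follow.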

Page 28: Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

 

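Formalizing the Optimal Attack Strategy

$$\max_{A} \; \mathbb{E}_{\theta \sim \mu}\big[ g(A; \theta) \big] \qquad \text{s.t. } A \in \Omega$$

–  $g$: attacker’s goal
–  $\theta \sim \mu$: knowledge of the data, features, …
–  $A \in \Omega$: capability of manipulating the input data

Perfect knowledge: $\mathbb{E}_{\theta \sim \mu}\big[ g(A; \theta) \big] = g(A; \theta_0)$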
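As a worked instance, the poisoning attack of the main slides can be read as a special case of this formulation under perfect knowledge. The reading below assumes that the attacker's goal g is the clustering distance d_c and that the capability set Ω is the bound of m added samples; both match the "Poisoning Attacks" slide, but writing Ω as a set-size constraint is an interpretation rather than a formula stated in the source.

```latex
\begin{aligned}
\max_{A} \;\; & g(A; \theta_0) = d_c\big(Y, Y'(A)\big),
  \qquad Y = f(D), \quad Y'(A) = f_D(D \cup A) \\
\text{s.t.} \;\; & A \in \Omega = \big\{ A : |A| \le m \big\}
\end{aligned}
```

With θ fixed to the true value θ_0, the expectation disappears and the problem reduces to the maximization over the attack samples A = {a_i}, i = 1, ..., m.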