Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

Pattern Recognition and Applications Lab

Department of Electrical and Electronic Engineering, University of Cagliari, Italy

Poisoning Complete-Linkage Hierarchical Clustering

Battista Biggio (1), Samuel Rota Bulò (2), Ignazio Pillai (1), Michele Mura (1), Eyasu Zemene Mequanint (3), Marcello Pelillo (3), and Fabio Roli (1)

(1) Università di Cagliari (IT); (2) FBK-irst, Trento (IT); (3) Università Ca’ Foscari di Venezia (IT)

S+SSPR 2014, Joensuu, Finland, 20-22 August 2014

http://pralab.diee.unica.it

Threats and Attacks in Computer Security

•  Growing number of devices, services and applications connected to the Internet

•  Vulnerabilities and attacks through malicious software (malware)

–  Examples: Android market, malware applications

•  Identity theft

•  Stolen credentials / credit card numbers


Threats and Attacks in Computer Security

•  Need for (automated) detection (and rule generation)

–  machine learning-based defenses (data clustering)

•  Evasion: malware families / variants

–  +65% new malware variants from 2012 to 2013 (Mobile Adware and Malw. Analysis, Symantec, 2014)

•  Detection: antivirus systems, rule-based systems


Data Clustering for Computer Security

•  Goal: clustering of malware families to identify common characteristics and design suitable countermeasures

–  e.g., antivirus rules / signatures

Pipeline (figure):

1.  data collection (honeypots)

2.  feature extraction (e.g., URL length, num. of parameters, etc.), giving a feature vector x1 x2 ... xd per sample

3.  clustering of malware families (e.g., similar HTTP requests)

4.  data analysis / countermeasure design (e.g., signature generation), giving rules of the form if … then … else …

e.g., suspicious HTTP request to a web server: http://www.vulnerablehotel.com/components/com_hbssearch/longDesc.php?h_id=1&id=-2%20union%20select%20concat%28username,0x3a,password%29%20from%20jos_users--


Is Data Clustering Secure?

•  Attackers can poison input data to subvert malware clustering

Pipeline under attack (figure):

1.  data collection (honeypots): well-crafted HTTP requests to subvert clustering, e.g., http://www.vulnerablehotel.com/…

2.  feature extraction (e.g., URL length, num. of parameters, etc.), giving a feature vector x1 x2 ... xd per sample

3.  clustering of malware families (e.g., similar HTTP requests) … is significantly compromised

4.  data analysis / countermeasure design (e.g., signature generation) … becomes useless (too many false alarms, low detection rate)

(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.



•  Earlier work (1,2): qualitative definition of attacks

–  Samples can be added to merge (and/or split) existing clusters

–  Samples can be obfuscated and hidden within existing clusters (e.g., fringe clusters)

[Figure: clustering on untainted data, shown next to examples of the two attacks above]

(1) D. B. Skillicorn. Adversarial knowledge discovery. IEEE Intelligent Systems, 24:54–61, 2009.

(2) J. G. Dutrisac and D. Skillicorn. Hiding clusters in adversarial settings. In IEEE Int’l Conf. Intelligence and Security Informatics, pp. 185–187, 2008.



•  Our previous work (1):

–  Framework for security evaluation of clustering algorithms

–  Formalization of poisoning and obfuscation attacks (optimization)

–  Case study on single-linkage hierarchical clustering

•  Although hierarchical clustering is widely used for malware clustering (2,3), it is significantly vulnerable to well-crafted attacks!

•  In this work we focus on poisoning attacks against complete-linkage hierarchical clustering

(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.

(2) R. Perdisci, D. Ariu, and G. Giacinto. Scalable fine-grained behavioral clustering of http-based malware. Computer Networks, 57(2):487-500, 2013.

(3) K. Rieck, P. Trinius, C. Willems, and T. Holz. Automatic analysis of malware behavior using machine learning. J. Comput. Secur., 19(4):639-668, 2011.


Complete-Linkage Hierarchical Clustering

•  Bottom-up agglomerative clustering

–  each point is initially considered as a cluster

–  closest clusters are iteratively merged

•  Linkage criterion to define distance between clusters

–  complete-linkage criterion: $\mathrm{dist}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} d(a, b)$

•  Clustering output is a hierarchy of clusterings

–  Criterion needed to select a given clustering (e.g., number of clusters)
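The bottom-up procedure and the complete-linkage criterion above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in the talk: the Euclidean distance for d(a, b), the function names, the toy points, and the choice of stopping at a fixed number of clusters k are all assumptions made here.

```python
import math

def complete_linkage_dist(ci, cj, points):
    """dist(Ci, Cj) = max over a in Ci, b in Cj of d(a, b); d is assumed Euclidean."""
    return max(math.dist(points[a], points[b]) for a in ci for b in cj)

def complete_linkage(points, k):
    """Bottom-up clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(points))]  # each point starts as its own cluster
    while len(clusters) > k:
        # pick the pair of clusters with the smallest complete-linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: complete_linkage_dist(clusters[p[0]], clusters[p[1]], points),
        )
        merged = clusters.pop(j)      # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters

# Toy data (hypothetical): two tight pairs and one isolated point
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(complete_linkage(points, k=3))  # [[0, 1], [2, 3], [4]]
```

Stopping at different values of k yields the different levels of the hierarchy, which is why a selection criterion such as the number of clusters is needed.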


Poisoning Attacks

•  Goal: to maximally compromise the clustering output on D

•  Capability: adding m attack samples

•  Knowledge: perfect / worst-case attack

•  Attack strategy:

$$Y = f(D), \qquad Y' = f_D(D \cup A)$$

$$\max_A \; d_c\big(Y, Y'(A)\big), \qquad A = \{a_i\}_{i=1}^{m}$$

–  $d_c$: distance between the clustering in the absence of attack and that under attack

–  $A$: attack samples

[Figure: clustering on untainted data D, shown next to the clustering obtained with the attack samples A]


Poisoning Attacks

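$$d_c(Y, Y') = \left\| Y Y^T - Y' Y'^T \right\|_F$$

For a given clustering (rows correspond to Sample 1, …, Sample 5; columns of $Y$ correspond to clusters):

$$Y = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \qquad Y Y^T = \begin{bmatrix} 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$

$$\max_A \; d_c\big(Y, Y'(A)\big), \qquad A = \{a_i\}_{i=1}^{m}$$

•  How to choose a given clustering from the hierarchy?

–  The clustering algorithm chooses the number of clusters that minimizes the attacker’s objective!

–  This gives us a lower bound on the worst-case attack’s impact!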
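As a worked example of the distance d_c, the sketch below rebuilds Y Y^T for the five-sample assignment matrix shown on this slide and compares it with a second clustering. The attacked assignment `Y_attacked` is a hypothetical example invented here (sample 5 merged into the cluster of samples 1 and 4); it is not taken from the talk.

```python
import math

def co_association(Y):
    """Return Y Y^T: entry (i, j) is 1 if samples i and j share a cluster, else 0."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in Y] for ri in Y]

def clustering_distance(Y, Y_attacked):
    """d_c(Y, Y') = || Y Y^T - Y' Y'^T ||_F (Frobenius norm)."""
    P, Q = co_association(Y), co_association(Y_attacked)
    return math.sqrt(sum((p - q) ** 2 for rp, rq in zip(P, Q) for p, q in zip(rp, rq)))

# Assignment matrix from the slide: one row per sample, one column per cluster
Y = [[1, 0, 0], [0, 0, 1], [0, 0, 1], [1, 0, 0], [0, 1, 0]]

# Hypothetical clustering under attack: sample 5 joins samples 1 and 4
Y_attacked = [[1, 0, 0], [0, 0, 1], [0, 0, 1], [1, 0, 0], [1, 0, 0]]

for row in co_association(Y):
    print(row)                               # reproduces Y Y^T from the slide
print(clustering_distance(Y, Y_attacked))    # 2.0
```

Four entries of the co-association matrix change, namely (1,5), (5,1), (4,5) and (5,4), so the Frobenius norm is sqrt(4) = 2. Because the comparison goes through Y Y^T, the two clusterings may use different numbers of clusters.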


Poisoning Complete-Linkage Clustering

•  Attack strategy:

$$\max_A \; d_c\big(Y, Y'(A)\big), \qquad A = \{a_i\}_{i=1}^{m}$$

•  Heuristic-based solutions

–  Greedy approach: adding one attack sample at a time
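The greedy approach can be sketched as a loop that adds one attack sample per iteration, each time keeping the candidate point that maximizes d_c. The sketch below rests on several assumptions that are not from the talk: the number of clusters k is kept fixed (the slides instead let the algorithm pick the number of clusters least favourable to the attacker), the candidate points come from a hypothetical finite set supplied by the caller, d(a, b) is Euclidean, and Y' is read as the clustering of D ∪ A restricted to the samples in D.

```python
import math

def complete_linkage_labels(points, k):
    """Complete-linkage agglomerative clustering; returns one cluster label per point."""
    clusters = [[i] for i in range(len(points))]

    def dist(ci, cj):
        return max(math.dist(points[a], points[b]) for a in ci for b in cj)

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters.pop(j)
        clusters[i].extend(merged)
    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for idx in members:
            labels[idx] = c
    return labels

def d_c(labels_a, labels_b):
    """Frobenius norm of the difference between the two co-association matrices."""
    n = len(labels_a)
    disagreements = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i in range(n) for j in range(n)
    )
    return math.sqrt(disagreements)

def greedy_poisoning(data, k, m, candidates):
    """Add m attack samples one at a time, keeping the best candidate at each step."""
    clean = complete_linkage_labels(data, k)
    attack = []
    for _ in range(m):
        def objective(a):
            attacked = complete_linkage_labels(data + attack + [a], k)
            return d_c(clean, attacked[:len(data)])  # compare on the samples in D only
        attack.append(max(candidates, key=objective))
    return attack

# Toy data and candidate grid (both hypothetical)
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 0.0), (5.0, 1.0), (6.0, 0.0)]
candidates = [(float(x), 0.5) for x in range(-2, 9)]
print(greedy_poisoning(data, k=2, m=2, candidates=candidates))
```

Each greedy step reruns the clustering once per candidate, which is the cost that the heuristics on the following slides try to reduce.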



•  Local maxima are found at the clusters’ boundaries (wide regions)

[Figure: objective function $d_c\big(Y, Y'(a)\big)$ as a function of the attack point $a = (x_1, x_2)$]



•  Underlying idea: to increase intra-cluster distance (extend attack)

•  For each cluster, consider two candidate attack points

[Figure: candidate attack points]






1.  Extend (Best): evaluates Y’(a) for each candidate attack, retaining the best one

–  Clustering is run for each candidate attack point, twice per cluster

2.  Extend (Hard): estimates Y’(a) assuming that each candidate will split the corresponding cluster, potentially merging it with a fragment of the closest cluster

–  It does not require running clustering to find the best attack point

3.  Extend (Soft): estimates Y’(a) as Extend (Hard), but using a soft probabilistic estimate instead of 0/1 sample-to-cluster assignments

–  It does not require running clustering to find the best attack point




•  The attack compromises the initial clustering by forming heterogeneous clusters

[Figure: clustering on untainted data, shown next to the clustering after adding 10 attack samples]


Experimental Setup

•  Banana: artificial data, 80 samples, 2 features, k=4 initial clusters

•  Malware: real data (1), 1,000 samples, 6 features, k≈9 initial clusters (estimated from data minimizing the Davies-Bouldin Index)

–  Features:

1.  number of GET requests

2.  number of POST requests

3.  average URL length

4.  average number of URL parameters

5.  average amount of data sent by POST requests

6.  average response length

•  MNIST Handwritten Digits: real data, 330 samples per cluster, 28 x 28 = 784 features (pixels), k=3 initial clusters corresponding to

(1) R. Perdisci, D. Ariu, and G. Giacinto. Scalable fine-grained behavioral clustering of http-based malware. Computer Networks, 57(2):487-500, 2013.
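For the Malware data, the number of initial clusters is estimated by minimizing the Davies-Bouldin Index. The sketch below implements the standard textbook form of that index with Euclidean distances; whether the experiments used exactly this variant is an assumption, and the toy points and labelings are hypothetical.

```python
import math

def centroid(pts):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def davies_bouldin(points, labels):
    """Average over clusters of the worst (scatter_i + scatter_j) / d(centroid_i, centroid_j).

    Lower values indicate compact, well-separated clusters.
    Requires at least two clusters with distinct centroids.
    """
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    cents = {l: centroid(g) for l, g in groups.items()}
    # scatter: mean distance of a cluster's points to its centroid
    scatter = {l: sum(math.dist(p, cents[l]) for p in g) / len(g) for l, g in groups.items()}
    total = 0.0
    for i in groups:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in groups if j != i
        )
    return total / len(groups)

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(davies_bouldin(points, [0, 0, 1, 1]))  # compact, well-separated pairs: low index
print(davies_bouldin(points, [0, 1, 0, 1]))  # stretched, close clusters: high index
```

To estimate k, one would compute the index for the clustering obtained at each candidate k in the hierarchy and keep the k with the smallest value.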


Experimental Results

•  Attack strategies: Extend (Best/Hard/Soft), Random, Random (Best)

–  Banana:

•  Extend (Best) very close to Optimal (Grid Search)

•  Random (Best) competitive with Extend (Hard / Soft)

[Figure: Banana. Objective function (0 to 50) vs. fraction of samples controlled by the attacker (0% to 20%; 11.1% corresponds to 10 attack samples). Curves: Random, Random (Best), Extend (Hard), Extend (Soft), Extend (Best), Optimal (Grid Search)]


Experimental Results

•  Attack strategies: Extend (Best/Hard/Soft), Random, Random (Best)

–  Malware:

•  Extend attacks and Random (Best) perform rather well

–  MNIST Handwritten Digits:

•  Random (Best) not effective

–  high-dimensional feature space

•  Extend (Soft) outperforms Extend (Best / Hard)

[Figure: Malware (objective function 0 to 150, attacker fraction 0% to 5%) and Digits (objective function 0 to 250, attacker fraction 0.0% to 1.0%). Objective function vs. fraction of samples controlled by the attacker. Legend: Random, Random (Best), Extend (Hard), Extend (Soft), Extend (Best), Optimal (Grid Search)]


Conclusions and Future Work

•  Framework for security evaluation of clustering algorithms

•  Poisoning attack vs. complete-linkage hierarchical clustering

–  Even random-based attacks can be effective!

•  Future work

–  Extensions to other clustering algorithms, common attack strategy

•  e.g., black-box optimization with suitable heuristics

–  Attacks with limited knowledge of the input data

[Figure labels: secure clustering algorithms; attacks against clustering]


Any questions?

Thanks for your attention!

http://pralab.diee.unica.it

Extra slides



•  Our previous work (1):

–  Framework for security evaluation of clustering algorithms

1.  Formal definition of potential attacks

2.  Empirical evaluation of their impact

•  Adversary’s model

–  Goal (security violation)

–  Knowledge of the attacked system

–  Capability of manipulating the input data

–  Attack strategy (optimization problem)

•  Inspired by previous work on adversarial machine learning

–  Barreno et al., Can machine learning be secure?, ASIACCS 2006

–  Huang et al., Adversarial machine learning, AISec 2011

–  Biggio et al., Security evaluation of pattern classifiers under attack, IEEE Trans. Knowledge and Data Eng., 2013

(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.


Adversary’s Goal

•  Security violation

–  Integrity: hiding clusters / malicious activities without compromising normal system operation

•  e.g., creating fringe clusters → obfuscation attack

–  Availability: compromising normal system operation by maximally altering the clustering output

•  e.g., merging existing clusters → poisoning attack

–  Privacy: gaining confidential information about system users by reverse-engineering the clustering process

(1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.


Adversary’s Knowledge

•  Perfect knowledge

–  upper bound on the performance degradation under attack

[Figure: system components known to the attacker: input data; feature representation (x1 x2 ... xd); clustering algorithm (e.g., k-means); algorithm parameters (e.g., initialization)]


Adversary’s Capability

•  Attacker’s capability is bounded:

–  maximum number of samples that can be added to the input data

•  e.g., the attacker may only control a small fraction of malware samples collected by a honeypot

–  maximum amount of modifications (application-specific constraints in feature space)

•  e.g., malware samples should preserve their malicious functionality (elements cannot be removed → features can only be incremented)

[Figure: feasible domain around a sample x, containing the modified sample x']



Formalizing the Optimal Attack Strategy

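$$\max_A \; \mathbb{E}_{\theta \sim \mu}\big[ g(A; \theta) \big] \qquad \text{s.t. } A \in \Omega$$

–  $\theta \sim \mu$: knowledge of the data, features, …

–  $A \in \Omega$: capability of manipulating the input data

–  $g$: attacker’s goal

Perfect knowledge: $\mathbb{E}_{\theta \sim \mu}\big[ g(A; \theta) \big] = g(A; \theta_0)$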
