Hidden Markov Models for Software Piracy Detection
Shabana Kazi, Mark Stamp
HMMs for Piracy Detection
Intro
Here, we apply metamorphic analysis to software piracy detection
Very similar to techniques used in malware detection
o But the problem is completely different
o Has nothing to do with malware
We show that there are other applications of such techniques
Software Piracy
Software piracy is a major problem
o By 2009 estimate, $3 to $4 lost to piracy for every $1 in software sales
Usually, piracy consists of taking software without modification
In some cases, software is modified
o Commercial theft of intellectual property
o Thief really doesn’t want to get caught…
Software Piracy
We assume software is stolen
o And modified, making it hard to detect
o If completely rewritten from scratch, we won’t detect it by our approach
Want to make life hard for bad guys
o Ideally, major modifications required
How much modification is needed before we cannot reliably detect?
Goals
Technique applicable to any software
No special effort by developer
o Nothing extra inserted into code
We only require access to exe file
Not a watermarking scheme
o More like software “birthmark” analysis
Also not plagiarism detection
o Here, want a “deeper” analysis
Use Case
You work for Alice’s Software Company (ASC)
o And you develop fancy software for ASC
Trudy’s Software Company (TSC) develops a suspiciously similar product
You suspect TSC of stealing your code
o Not identical, but seems similar
What can you do?
o We’ve got some ideas that might help…
Use Case
Using the technique discussed here, can easily measure code similarity
Low similarity?
o Then no hope of proving code is stolen
High similarity?
o Further (costly) analysis is warranted
High similarity does not prove code is stolen
o But a good reason to take a closer look
Background
Metamorphic software
o Metamorphic techniques (dead code, permutation, substitution)
HMM
o Basic ideas and notation
o The 3 problems and their solutions (discussed at a high level)
We’ve seen all of this before
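As a reminder of the scoring problem (computing how well a model fits an observation sequence), here is a minimal sketch of the scaled forward algorithm in plain Python. The function name `score_sequence` and the toy matrices are illustrative assumptions, not taken from the slides; normalizing the log likelihood by sequence length is also an assumption, made so that files of different lengths can be compared.

```python
import math

def score_sequence(A, B, pi, obs):
    """Return log P(obs | model) per symbol via the scaled forward algorithm.

    A[i][j]: transition probability from state i to state j
    B[i][k]: probability of observing symbol k in state i
    pi[i]:   initial probability of state i
    obs:     non-empty list of symbol indices (e.g., integer-coded opcodes)
    """
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n = len(pi)
    log_prob = 0.0
    alpha = []
    for t, symbol in enumerate(obs):
        if t == 0:
            alpha = [pi[i] * B[i][symbol] for i in range(n)]
        else:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][symbol]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        log_prob += math.log(scale)
    return log_prob / len(obs)

# Toy 2-state, 3-symbol model (hypothetical values, for illustration only).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
print(score_sequence(A, B, pi, [0, 1, 0, 2]))
```

Higher (less negative) scores mean the sequence looks more like the data the model was trained on.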
Overview
Training and scoring
Train HMM on slightly morphed copies of given “base” software
o Slight morphing to avoid overfitting
Score morphed copies and other files
o Here, morphing serves to simulate modifications by attacker
Want to know how much morphing required before detection fails
Metamorphic Generator
Built our own metamorphic generator
Morph based on extracted opcodes
o Morphing consists of dead code insertion
o Specify a dead code percentage and number of blocks to insert
Do not require that morphed code works
o Makes detection more difficult, not easier
o A worst-case scenario, detection-wise
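Dead code insertion on an opcode sequence might look roughly like the sketch below. This is not the authors' generator: the function name `insert_dead_code`, the reading of "percentage" as a fraction of the original length, the even split across blocks, and the random choice of insertion points are all assumptions made for illustration.

```python
import random

def insert_dead_code(opcodes, dead_pool, percent, n_blocks, rng=None):
    """Insert about `percent`% dead opcodes into `opcodes`, in `n_blocks` blocks.

    Each block is a contiguous run copied from `dead_pool` (e.g., the opcode
    sequence of some unrelated "normal" file). Assumption: the percentage is
    measured against the length of the original sequence.
    """
    if not dead_pool:
        raise ValueError("dead_pool must not be empty")
    rng = rng or random.Random()
    n_insert = round(len(opcodes) * percent / 100)
    if n_insert == 0 or n_blocks <= 0:
        return list(opcodes)
    n_blocks = min(n_blocks, n_insert)
    # Split the dead code budget as evenly as possible across the blocks.
    base, extra = divmod(n_insert, n_blocks)
    sizes = [base + (1 if i < extra else 0) for i in range(n_blocks)]
    # Pick insertion points in the original sequence (sorted, may repeat).
    positions = sorted(rng.randrange(len(opcodes) + 1) for _ in sizes)
    morphed, prev = [], 0
    for pos, size in zip(positions, sizes):
        morphed.extend(opcodes[prev:pos])
        start = rng.randrange(len(dead_pool))
        # Copy a run from the pool, wrapping around if the pool is short.
        morphed.extend(dead_pool[(start + k) % len(dead_pool)] for k in range(size))
        prev = pos
    morphed.extend(opcodes[prev:])
    return morphed

base = ["mov", "add", "push", "pop", "call"] * 20   # 100 opcodes
dead = ["nop", "xor", "sub"]                         # stand-in "normal" file
morphed = insert_dead_code(base, dead, percent=10, n_blocks=2, rng=random.Random(0))
print(len(base), len(morphed))
```

Because nothing checks that the result still runs, the original opcodes survive in order and only the inserted runs differ, which matches the "morphed code need not work" setting above.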
Training
Given a base executable file…
Extract its opcode sequence
Generate 100 slightly morphed copies
o Each morphed 10%, using dead code extracted from random “normal” file
Train HMM on morphed copies
o Using 5-fold cross validation
o Note: We train one model for each “fold”
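The 5-fold partition of the 100 morphed copies can be sketched as follows. The helper name `k_fold_splits` and the file names are hypothetical, and the contiguous, unshuffled split is an assumption; the slides only state that one model is trained per fold.

```python
def k_fold_splits(items, k=5):
    """Yield (training, held_out) pairs for k-fold cross validation.

    Each fold holds out one contiguous slice and trains on the rest.
    If len(items) is not divisible by k, the leftover items stay in
    every training set.
    """
    fold_size = len(items) // k
    for i in range(k):
        lo, hi = i * fold_size, (i + 1) * fold_size
        yield items[:lo] + items[hi:], items[lo:hi]

# Hypothetical names for the 100 morphed copies of one base file.
copies = [f"morphed_{n:03d}" for n in range(100)]
for fold, (training, held_out) in enumerate(k_fold_splits(copies)):
    print(fold, len(training), len(held_out))
```

With 100 copies and 5 folds, each fold trains on 80 copies and holds out 20.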
Training
[Figure: illustration of training process, using slightly morphed copies of base program]
Determine Threshold
For each of the 5 folds
o Train HMM
o Score 20 morphed files (match set) and 15 normal (nomatch set)
Determine threshold based on scores
o Threshold is highest score of normal file
o Implies FPR = 0; equivalently, TNR = 1 (for the given “fold”)
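The threshold rule for a single fold can be sketched as below. The function names and all scores are made up for illustration; treating a file as a match only when its score is strictly above the threshold is an assumption, chosen so that no normal file is flagged.

```python
def set_threshold(nomatch_scores):
    """Threshold = highest score among the normal (nomatch) files."""
    return max(nomatch_scores)

def fold_rates(match_scores, nomatch_scores, threshold):
    """Return (TPR, FPR), counting a file as a match if score > threshold."""
    tpr = sum(s > threshold for s in match_scores) / len(match_scores)
    fpr = sum(s > threshold for s in nomatch_scores) / len(nomatch_scores)
    return tpr, fpr

# Made-up per-opcode log-likelihood scores (higher = closer to the base file).
match_scores = [-2.1, -2.3, -2.0, -3.5]
nomatch_scores = [-3.0, -4.2, -3.8]

threshold = set_threshold(nomatch_scores)
print(threshold, fold_rates(match_scores, nomatch_scores, threshold))
```

By construction no normal file scores above the threshold, so FPR is 0 on the fold used to set it; the price is that low-scoring morphed copies are missed.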
Setting a Threshold
[Figure: process used to set threshold]
Experiments
Want to determine robustness
For each base file tested…
Train to obtain HMM and threshold
Morph base file at various percentages
o Using various morphing strategies
o Refer to this morphing as tampering
Score each tampered copy
o Classify, based on threshold
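Classifying tampered copies against the threshold, grouped by tampering percentage, might be tabulated as in this sketch. The function name, the dictionary layout, and every score are hypothetical; the strict "score > threshold" rule is the same assumption as for threshold setting.

```python
def detection_rate_by_tamper(scores_by_percent, threshold):
    """Map each tampering percentage to the fraction of copies still detected.

    A tampered copy counts as detected (still matches the base file)
    when its score is strictly above the threshold.
    """
    return {
        percent: sum(s > threshold for s in scores) / len(scores)
        for percent, scores in scores_by_percent.items()
    }

# Made-up scores for tampered copies of one base file.
scores_by_percent = {
    10: [-2.0, -2.2],
    50: [-2.9, -3.4],
    90: [-3.6, -4.0],
}
rates = detection_rate_by_tamper(scores_by_percent, threshold=-3.0)
print(rates)
```

The percentage at which the detection rate drops off is a rough measure of how much tampering the attacker needs.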
Experiments
[Figure: scoring tampered files]
Experiment Details
For each base file
o 6 models
o 10 tamper percentages for each
o 100 files each
o So, 6000 scores!
Experiment Details
Tested 10 base files for each data point
o So 60,000 scores computed…
Experiment Details
Repeated entire experiment 6 times
o Using different number of blocks in training phase
o Number of blocks in training made little difference to scores
o So, here we only give results where 1 block used in training phase
In total 360,000 scores computed
o And 360 “models” generated
o That is, 1800 HMMs (one per fold)
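The totals quoted across the Experiment Details slides can be checked with a few lines of arithmetic. The variable names are illustrative, and the factorization of the 360 models as 6 models × 10 base files × 6 repetitions is an assumption that happens to be consistent with the stated figures.

```python
models_per_base_file = 6
tamper_percentages = 10
files_per_percentage = 100
base_files = 10
repetitions = 6      # different numbers of blocks in the training phase
folds = 5            # one HMM per fold

scores_per_base_file = models_per_base_file * tamper_percentages * files_per_percentage
scores_per_experiment = scores_per_base_file * base_files
total_scores = scores_per_experiment * repetitions

# Assumed breakdown of the model count (not spelled out in the slides).
total_models = models_per_base_file * base_files * repetitions
total_hmms = total_models * folds

print(scores_per_base_file, scores_per_experiment, total_scores)  # 6000 60000 360000
print(total_models, total_hmms)                                   # 360 1800
```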
Results: Bar Graph
Results: 3-d Plot
Conclusions
Results look very promising
o Robust: high degree of morphing required before base file undetected
o Practical: only requires exe, no special effort when developing
o Applies to any exe, at any time
Overall, strong software “birthmark” strategy with practical implications
Future Work
Statistical analysis somewhat weak
o Results may be stronger than they appear
Many other scores/combinations of scores can be tested
o Results can only get better
Consider other morphing techniques
o And other file types (e.g., bytecode)
o And mitigations for 1-block morphing
…
References
S. Kazi and M. Stamp, Hidden Markov models for software piracy detection, Information Security Journal: A Global Perspective, 22:140-149, 2013