BinaryPig - Scalable Malware Analytics in Hadoop

44
BinaryPig: Scalable Binary Data Extraction in Hadoop Created By: Jason Trost, Telvis Calhoun, Zach Hanif

Transcript of BinaryPig - Scalable Malware Analytics in Hadoop

Page 1: BinaryPig - Scalable Malware Analytics in Hadoop

BinaryPig: Scalable Binary Data Extraction in Hadoop

Created By:Jason Trost, Telvis Calhoun, Zach Hanif

Page 2: BinaryPig - Scalable Malware Analytics in Hadoop

Bringing  data  science  to  cyber  security,  allowing  you  to  sense,  analyze  and  act  in  

real  5me.    

Page 3: BinaryPig - Scalable Malware Analytics in Hadoop

Agenda  

•  • The Problem •  • BinaryPig Architecture •  • Code and Implementation Details •  • Analysis and Results •  • Demo •  • Wrap-Up •  A

Page 4: BinaryPig - Scalable Malware Analytics in Hadoop

Background  

2.5 years 20M samples

9.5TB of malware • 

Page 5: BinaryPig - Scalable Malware Analytics in Hadoop
Page 6: BinaryPig - Scalable Malware Analytics in Hadoop

Malware data mining is useful

•  • Threat intel feeds• Contextual enrichment on events• Machine learning models

Page 7: BinaryPig - Scalable Malware Analytics in Hadoop

Pre-­‐BinaryPig:  Architecture  

Page 8: BinaryPig - Scalable Malware Analytics in Hadoop

Pre-­‐BinaryPig:  Storage  Issues  

•  • We kept running out of disk! •  • We lost samples when NFS nodes failed.

Page 9: BinaryPig - Scalable Malware Analytics in Hadoop

Pre-­‐BinaryPig  -­‐  Processing  Issues  

•  • No Data Locality. •  • Node failures were catastrophic. •  • Hard to add new analysis scripts.

Page 10: BinaryPig - Scalable Malware Analytics in Hadoop

Pre-­‐Binary  Pig  -­‐  Data  Explora=on  Issues  

•  • How can I share my findings for greater fame and glory? •  • Create a table schema for every analysis script? •  • RDBMS failure is worse than zombie apocalypse.

Page 11: BinaryPig - Scalable Malware Analytics in Hadoop

We  needed  a  system  that...  

•  • Scales to our historical data •  • Recovers from failures •  • Grows through scripting •  • Supports dynamic schemas •  • Searchable via the web

Page 12: BinaryPig - Scalable Malware Analytics in Hadoop

BinaryPig FRAMEWORK FOR PROCESSING SMALL BINARY FILES USING APACHE HADOOP AND APACHE PIG Bi

Page 13: BinaryPig - Scalable Malware Analytics in Hadoop

BinaryPig  

•  • Simple DSL •  • Pluggable analytics •  • Plays nice with existing tools •  • Enables rapid iteration

Page 14: BinaryPig - Scalable Malware Analytics in Hadoop

BinaryPig    

Page 15: BinaryPig - Scalable Malware Analytics in Hadoop

BinaryPig  -­‐  Storage  

•  • HDFS, scalable, replicated •  • Aggregate malware samples into sequence files

Page 16: BinaryPig - Scalable Malware Analytics in Hadoop

BinaryPig  -­‐  Processing  

•  • Hadoop - robust, distributed, with data locality •  • Apache Pig - Extensible, simple

Page 17: BinaryPig - Scalable Malware Analytics in Hadoop

BinaryPig  -­‐  Results  Explora=on  

•  • UI - turns your grandma into a data scientist •  • Elasticsearch - schemaless, replicated, awesome

Page 18: BinaryPig - Scalable Malware Analytics in Hadoop

Yet  Another  Framework?  

•  Malware  tools  didn't  scale  •  Hadoop  does  not  play  well  with  small  binary  files  

•  Hadoop  did  not  integrate  exis5ng  malware  analysis  tools  

 

Page 19: BinaryPig - Scalable Malware Analytics in Hadoop

Code  and  Implementa5on  Details  

Page 20: BinaryPig - Scalable Malware Analytics in Hadoop

BinaryPig  is  easy  to  use!  

Page 21: BinaryPig - Scalable Malware Analytics in Hadoop

BinaryPig  Ingest  Tools  

•  Generate  sequence  file  from  directory  containing  malware  samples  

 ./bin/dir_to_sequencefile <malwareDir> <hdfsOutputFile>

•  Generate  sequence  file  from  archive    

./bin/archive_to_sequencefile <archive> <hdfsOutputFile>

Page 22: BinaryPig - Scalable Malware Analytics in Hadoop

BinaryPig  Loaders  

•  Converts  raw  data  to  a  tuple  •  Execu5ng  Loader  

o  Executes  a  specific  script/program  on  a  file  wriIen  to  a  logical  path  

o  Example:  Hashing  •  Daemon  Loader  

o  Writes  binaries  to  a  path,  and  provides  those  paths  to  an  already  running  analysis  process  

o  Example:  Clamd      

Page 23: BinaryPig - Scalable Malware Analytics in Hadoop

Op=miza=ons  in  BinaryPig  •  To  leverage  pre-­‐exis5ng  tools,  we  had  to  write  malware  binaries  to  the  local  filesystem  on  the  worker  nodes  o  Note:  local  copy,  not  network  copy  o  We  op5mized  this  to  use  /dev/shm/  instead  

•  Quick  scripts  are  great  for  rapid  itera5on,  but...  o  Interpreter  startup  5me  can  dominate  execu5on  5me  o  Crea5ng  small,  long  running,  analy5c  daemons  provides  a  huge  speedup  for  frequently  used  tasks  

o  i.e.  the  clamscand  model  of  execu5on    

Page 24: BinaryPig - Scalable Malware Analytics in Hadoop

BinaryPig:  Loader  Implementa=ons  

•  Generic Script Loader •  Generic Daemon Loader •  ClamAV Loader •  Yara Loader •  Hashing Loader

Page 25: BinaryPig - Scalable Malware Analytics in Hadoop

strings.sh:    #!/bin/bash

strings "$@"

BinaryPig:  Scrip=ng  

strings.pig:    define Loader com.endgame.binarypig.loaders.ExecutingTextLoader;

data = LOAD '$INPUT' USING Loader('strings.sh');

DUMP data;

Page 26: BinaryPig - Scalable Malware Analytics in Hadoop

BinaryPig  supports  non-­‐PE32  files  

•  Handles more than just malware... o  Image analysis o  PDF data extraction o  APK extraction o  Any small binary files

Page 27: BinaryPig - Scalable Malware Analytics in Hadoop

Web  Interface  

Page 28: BinaryPig - Scalable Malware Analytics in Hadoop

Analysis  and  Results  

Page 29: BinaryPig - Scalable Malware Analytics in Hadoop

Malware  Census  

•  •  20  Million  unique  binaries  o  •  ~94%  PE  format  o  •  ~6%  are  mostly  Android  APK's  

•  •  5  hours  to  run  historical  set  

Page 30: BinaryPig - Scalable Malware Analytics in Hadoop

General  Findings  

Page 31: BinaryPig - Scalable Malware Analytics in Hadoop

Feature  Extrac=on  •  Our  core  mo5va5on  was  to  dras5cally  improve  the  experience  of  valida5ng  research.  

•  Packer  iden5fica5on  o  Overall  and  Sec5onal  Entropy  o  Kolmogorov  Complexity  o  Sec5on  and  resource  names  o  Sec5on  flags  

•  Import  tables  •  Func5on  Calls  •  Resource  hashes  and  subfeatures  

Page 32: BinaryPig - Scalable Malware Analytics in Hadoop

Feature  Depth  •  PEHeaders  are  shallow  

o  Easy  to  manipulate  o  Less  resolu5on  than  reverse  engineering  features  o  File  metadata  is  also  low  resolu5on  

•  Headers  provide  excellent  fast  features  •  Headers  are  oben  ignored  •  Work  the  analysis  around  the  feature  resolu5on  

o  Ignore  5ght  clusters,  go  for  wide  ones  o  Triage,  not  true  classifica5on  

Page 33: BinaryPig - Scalable Malware Analytics in Hadoop

Clustering  Results  •  Triage  for  dynamic  analysis  winnowing  •  Largest  cluster:  377,882  samples  

o  Three  malware  families  contained  within    o  Second  largest:  124,894  samples  

•  Valida5on  is  tricky  o  Manual  valida5on  cannot  be  en5rely  avoided  o  Cluster  meanings  change  with  feature  sets  o  Cannot  just  go  off  of  AV  results  

Page 34: BinaryPig - Scalable Malware Analytics in Hadoop

ICO  Extrac=on  

Page 35: BinaryPig - Scalable Malware Analytics in Hadoop

Icon  Features  •  Pixel  based  features  

o  Brightness    o  Color  values  o  Pixel  density  

•  Cryptographic  and  fuzzy  hashing  o  Perceptual  hashes  

•  Edge  detec5on  

Page 36: BinaryPig - Scalable Malware Analytics in Hadoop

Icon  Results  •  Icon  clustering    

o  Groups  do  not  just  include  family  lines  o  Copycat  malware  is  shown  as  well  o  Clear  indica5ons  of  malicious  intent  

•  Method  of  infec5on  can  be  extrapolated  o  Phishing  o  Obfuscated  executables  o  Adware  (more  than  we  expected)  o  False  posi5ves  -­‐  popular  sobware  detec5on  

Page 37: BinaryPig - Scalable Malware Analytics in Hadoop

Lessons  Learned  •  Feature  Selec5on  

o  Over  500  features  in  PEheader  alone  o  Abundance  of  features  requires  pruning  

•  Interpreta5on  and  Valida5on  of  Results  o  Manual  valida5on  is  an  unfortunate  reality  o  Care  has  to  be  taken  to  ensure  that  unsupervised  learning  provides  meaningful  results  

Page 38: BinaryPig - Scalable Malware Analytics in Hadoop

DEMO  

Page 39: BinaryPig - Scalable Malware Analytics in Hadoop

WRAP  UP  

Page 40: BinaryPig - Scalable Malware Analytics in Hadoop

•  Rapid Iteration •  Feature extraction •  Clustering analysis for rapid malware triage •  Enables weekly AV scans with latest signatures over

previous malware. •  Created binary classifier to improve sample collection

and categorize new samples

So  What?  

Page 41: BinaryPig - Scalable Malware Analytics in Hadoop

Future  work  •  • Compatibility with Pig 0.10.* and 0.11.* •  • EC2 tutorial •  • More examples/starter scripts

o  • Inclusion of some of our Mahout tasks o  • Open source process for that is moving forward

•  • Better error logging and handling o  Messages should be stored in a separate DB

•  • Easier deployments o  • Analytic daemons o  • Dependency libs o  • Fabric/Salt/Puppet/Chef

Page 42: BinaryPig - Scalable Malware Analytics in Hadoop

BinaryPig  is  Open  Source!  

https://github.com/endgameinc/binarypigApache 2 License

Page 43: BinaryPig - Scalable Malware Analytics in Hadoop

We  are  hiring!  

http://endgame.com/careers

Page 44: BinaryPig - Scalable Malware Analytics in Hadoop

QUESTIONS