Khaled N. Khasawneh*, Meltem Ozsoy***, Caleb Donovick**, Nael Abu-Ghazaleh*, and Dmitry Ponomarev** Ensemble Learning for Low-level Hardware-supported Malware Detection * University of California, Riverside, ** Binghamton University, *** Intel Corp. RAID 2015 – Kyoto, Japan, November 2015


Khaled N. Khasawneh*, Meltem Ozsoy***, Caleb Donovick**, Nael Abu-Ghazaleh*, and Dmitry Ponomarev**

Ensemble Learning for Low-level Hardware-supported Malware Detection

* University of California, Riverside, ** Binghamton University, *** Intel Corp.

RAID 2015 – Kyoto, Japan, November 2015


Malware Growth

• McAfee Labs
– Over 350M malware programs in their malware zoo
– 387 new threats every minute



Malware Detection Analysis

• Static analysis
– Searches for signatures in the executable
– Can detect all known malware programs with no false alarms
– Can't detect metamorphic malware, polymorphic malware, or targeted attacks

• Dynamic analysis
– Monitors the behavior of the program
– Can detect metamorphic malware, polymorphic malware, and targeted attacks
– Adds substantial overhead to the system and has false positives


TWO-LEVEL MALWARE DETECTION FRAMEWORK


Two-Level Malware Detection

• MAP was introduced by Ozsoy et al. (HPCA 2015)
– Explored a number of sub-semantic feature vectors
– Single hardware-supported detector
– Detects malware online (in real time)
– Two-stage detection


Contributions of this work

• Better hardware malware detection using an ensemble of detectors specialized for each type of malware

• Metrics to measure the resulting advantages of using the two-level malware detection framework


EVALUATION METHODOLOGY: WORKLOADS, FEATURES, PERFORMANCE MEASURES


Data Set & Data Collection

           Total   Training   Testing   Cross-Validation
Backdoor     815        489       163                163
Rogue        685        411       137                137
PWS          558        335       111                111
Trojan      1123        673       225                225
Worm         473        283        95                 95
Regular      554        332       111                111

• Source of programs
– Malware
  • MalwareDB
  • 2011-2014
  • 3,690 total malware programs
– Regular
  • Windows system binaries
  • Other applications like WinRAR, Notepad++, Acrobat Reader

• Dynamic trace
– Windows 7 virtual machine
– Firewall and security services were all disabled
– Pin tool was used to collect the features during execution
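The per-class counts in the table look consistent with roughly a 60/20/20 split into training, testing, and cross-validation sets. That ratio is inferred from the numbers, not stated on the slide. A minimal Python sketch of such a per-class split, with a hypothetical function name and seed:

```python
import random

def split_class(samples, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle one class and split it into (train, test, cross_validation).

    The 60/20/20 fractions are an assumption inferred from the table.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = round(train_frac * len(items))
    n_test = round(test_frac * len(items))
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    cross_val = items[n_train + n_test:]
    return train, test, cross_val

# Placeholder program IDs standing in for the 815 backdoor samples.
backdoor_ids = range(815)
train, test, cross_val = split_class(backdoor_ids)
```

Splitting each class separately keeps the class proportions the same in all three sets.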


Feature Space

• Instruction mix
– INS1: frequency of instruction categories
– INS2: frequency of most variant opcodes
– INS3: presence of instruction categories
– INS4: presence of most variant opcodes

• Memory reference patterns
– MEM1: histogram (count) of memory address distances
– MEM2: binary (presence) of memory address distances

• Architectural events
– ARCH: total number of memory reads, memory writes, unaligned memory accesses, immediate branches, and taken branches
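To make the difference between the frequency and presence variants concrete, here is a small Python sketch. The category names, the log2 binning of address distances, and the bin count are assumptions for illustration; the slides do not specify them.

```python
from collections import Counter

# Hypothetical instruction categories; the real list is not given on the slide.
CATEGORIES = ["arith", "logic", "load", "store", "branch"]

def ins_frequency(trace):
    """INS1-style vector: how often each category occurs in a window."""
    counts = Counter(trace)
    return [counts[c] for c in CATEGORIES]

def ins_presence(trace):
    """INS3-style vector: 1 if the category occurs at all, else 0."""
    return [1 if n > 0 else 0 for n in ins_frequency(trace)]

def mem_distance_histogram(addresses, n_bins=8):
    """MEM1-style vector: counts of distances between consecutive accesses.

    Distances are binned by bit length (roughly log2); this is an assumption.
    """
    hist = [0] * n_bins
    for prev, cur in zip(addresses, addresses[1:]):
        dist = abs(cur - prev)
        hist[min(dist.bit_length(), n_bins - 1)] += 1
    return hist

def mem_distance_presence(addresses, n_bins=8):
    """MEM2-style vector: 1 for every non-empty histogram bin."""
    return [1 if n > 0 else 0 for n in mem_distance_histogram(addresses, n_bins)]

window = ["load", "arith", "arith", "branch"]
accesses = [0x1000, 0x1004, 0x1004, 0x2000]
```

The presence vectors discard counts, which makes them cheaper to collect but less informative than the frequency vectors.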


Detection Performance Measures

• Sensitivity: percent of malware that was detected (true positive rate)

• Specificity: percent of correctly classified regular programs (true negative rate)

• Receiver Operating Characteristic (ROC) curve: summarizes the prediction performance for a range of detection thresholds

• Area Under the Curve (AUC): traditional performance metric for the ROC curve
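A short Python sketch of these measures may help. The labels and scores below are made up for illustration, and the function names are hypothetical. The AUC is computed through its rank interpretation: the probability that a randomly chosen malware sample scores higher than a randomly chosen regular program, with ties counted as one half.

```python
def sensitivity(y_true, y_pred):
    """True positive rate: detected malware / all malware (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def specificity(y_true, y_pred):
    """True negative rate: correctly passed regular programs (label 0)."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tn / (tn + fp)

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 1 = malware, 0 = regular program.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # one fixed threshold
```

Sensitivity and specificity depend on the chosen threshold, while the AUC summarizes all thresholds at once.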


SPECIALIZING THE DETECTORS FOR DIFFERENT MALWARE TYPES


Constructing Specialized Detectors

• Specialized detectors for each malware type were trained only with the data of that type
• Supervised learning with logistic regression was used

MEM1 Detectors
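The following Python sketch illustrates one way to read "trained only with the data of that type": a specialized detector sees regular programs plus malware of a single type, while a general detector sees all malware. The plain gradient-descent training loop, the hyperparameters, and the synthetic samples are all assumptions for illustration, not the authors' setup.

```python
import math

def sigmoid(z):
    # Clamp to avoid overflow in math.exp for extreme inputs.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def select_training_data(samples, malware_type=None):
    """Build (X, y). malware_type=None keeps all malware (general detector);
    otherwise only regular programs and that one malware type are kept."""
    X, y = [], []
    for features, label in samples:
        if label == "regular":
            X.append(features)
            y.append(0)
        elif malware_type is None or label == malware_type:
            X.append(features)
            y.append(1)
    return X, y

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Synthetic toy samples, for illustration only.
samples = [
    ([0.1, 0.9], "regular"), ([0.2, 0.8], "regular"),
    ([0.9, 0.1], "worm"), ([0.8, 0.2], "worm"),
    ([0.5, 0.6], "trojan"),
]
X_worm, y_worm = select_training_data(samples, "worm")
worm_detector = train_logistic(X_worm, y_worm)
```

Here the trojan sample is excluded from the worm detector's training set but would be included when `malware_type` is left as `None`.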


General vs. Specialized Detectors

                    Backdoor   PWS     Rogue   Trojan   Worm
INS1  General       0.713      0.909   0.949   0.715    0.705
      Specialized   0.715      0.892   0.962   0.727    0.819
INS2  General       0.905      0.946   0.993   0.768    0.810
      Specialized   0.895      0.954   0.976   0.782    0.984
INS3  General       0.837      0.909   0.924   0.527    0.761
      Specialized   0.840      0.888   0.991   0.808    0.852
INS4  General       0.866      0.868   0.914   0.788    0.830
      Specialized   0.891      0.941   0.993   0.798    0.869
MEM1  General       0.729      0.893   0.424   0.650    0.868
      Specialized   0.868      0.961   0.921   0.867    0.871
MEM2  General       0.833      0.947   0.761   0.866    0.903
      Specialized   0.843      0.979   0.931   0.868    0.871
ARCH  General       0.702      0.919   0.965   0.763    0.602
      Specialized   0.686      0.942   0.970   0.795    0.560


Is There an Opportunity?

           General   Specialized   Difference
Backdoor   0.8662    0.8956        0.0294
PWS        0.8684    0.9795        0.1111
Rogue      0.9149    0.9937        0.0788
Trojan     0.7887    0.8676        0.0789
Worm       0.8305    0.9842        0.1537
Average    0.8537    0.9441        0.0904

General: best general detector (INS4). Specialized: best specialized detector per type.
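The arithmetic behind this comparison can be checked with a few lines of Python. The sketch below picks the best specialized detector per malware type from the three-decimal values in the "General vs. Specialized Detectors" table, then recomputes the differences and averages from the four-decimal values on this slide. All numbers are copied from the slides; the variable and function names are hypothetical.

```python
TYPES = ["Backdoor", "PWS", "Rogue", "Trojan", "Worm"]

# Specialized rows of the "General vs. Specialized Detectors" table.
SPECIALIZED = {
    "INS1": [0.715, 0.892, 0.962, 0.727, 0.819],
    "INS2": [0.895, 0.954, 0.976, 0.782, 0.984],
    "INS3": [0.840, 0.888, 0.991, 0.808, 0.852],
    "INS4": [0.891, 0.941, 0.993, 0.798, 0.869],
    "MEM1": [0.868, 0.961, 0.921, 0.867, 0.871],
    "MEM2": [0.843, 0.979, 0.931, 0.868, 0.871],
    "ARCH": [0.686, 0.942, 0.970, 0.795, 0.560],
}

def best_specialized_per_type(table):
    """For each malware type, return (feature vector, value) with the top score."""
    best = {}
    for i, malware_type in enumerate(TYPES):
        feature = max(table, key=lambda f: table[f][i])
        best[malware_type] = (feature, table[feature][i])
    return best

# Four-decimal values from the "Is There an Opportunity?" slide.
GENERAL_INS4 = dict(zip(TYPES, [0.8662, 0.8684, 0.9149, 0.7887, 0.8305]))
BEST_SPECIALIZED = dict(zip(TYPES, [0.8956, 0.9795, 0.9937, 0.8676, 0.9842]))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

best = best_specialized_per_type(SPECIALIZED)
differences = {t: BEST_SPECIALIZED[t] - GENERAL_INS4[t] for t in TYPES}
average_general = mean(GENERAL_INS4.values())
average_specialized = mean(BEST_SPECIALIZED.values())
average_difference = mean(differences.values())
```

The selection shows that no single feature vector wins for every malware type, which is the motivation for combining specialized detectors.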


ENSEMBLE DETECTORS


Ensemble Learning

• Multiple diverse base detectors
– Different learning algorithm
– Different data set
• Combined to solve a problem


Decision Functions

• Or’ing
• High Confidence Or’ing


Decision Functions

• Majority voting
• Stacking
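The slides name the four decision functions without defining them, so the Python sketch below is one plausible reading. In particular, treating "high confidence" as a fixed margin above each detector's threshold is an assumption, and the stacking weights are hypothetical placeholders for a second-level model that would normally be trained on the base detectors' outputs.

```python
import math

def oring(scores, thresholds):
    """Flag malware if any base detector crosses its threshold."""
    return any(s >= t for s, t in zip(scores, thresholds))

def high_confidence_oring(scores, thresholds, margin=0.2):
    """Flag malware only if some detector exceeds its threshold by a margin.

    The margin-based notion of confidence is an assumption of this sketch.
    """
    return any(s >= t + margin for s, t in zip(scores, thresholds))

def majority_voting(scores, thresholds):
    """Flag malware if more than half of the detectors vote malware."""
    votes = sum(1 for s, t in zip(scores, thresholds) if s >= t)
    return votes > len(scores) / 2

def stacking(scores, weights, bias, threshold=0.5):
    """Feed base-detector scores into a second-level logistic model.

    weights and bias are hypothetical; in practice they would be learned.
    """
    z = sum(w * s for w, s in zip(weights, scores)) + bias
    return 1.0 / (1.0 + math.exp(-z)) >= threshold

thresholds = [0.5, 0.5, 0.5]
weak_alarm = [0.55, 0.2, 0.3]    # one detector barely fires
strong_alarm = [0.9, 0.6, 0.1]   # two detectors fire, one strongly
```

With these toy scores, plain or'ing fires on the weak alarm while the other three functions do not, which hints at why or'ing tends toward high sensitivity and low specificity.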


Ensemble Detectors

• General Ensemble
– Combines multiple general detectors
– Best of INS, MEM, ARCH

• Specialized Ensemble
– Combines the best specialized detector for each malware type

• Mixed Ensemble
– Combines the best general detector with the best specialized detectors from the same feature vector


Offline Detection Effectiveness

                       Decision Function   Sensitivity   Specificity   Accuracy
Best General           -                   82.4%         89.3%         85.1%
General Ensemble       Or’ing              99.1%         13.3%         65.0%
                       High Confidence     80.7%         92.0%         85.1%
                       Majority Voting     83.3%         92.1%         86.7%
                       Stacking            80.7%         96.0%         86.8%
Specialized Ensemble   Or’ing              100%          5%            51.3%
                       High Confidence     94.4%         94.7%         94.5%
                       Stacking            95.8%         96.0%         95.9%
Mixed Ensemble         Or’ing              84.2%         70.6%         78.8%
                       High Confidence     83.3%         81.3%         82.5%
                       Stacking            80.7%         96.0%         86.7%


Online Detection Effectiveness

• A decision is made after each 10,000 committed instructions
• Exponentially Weighted Moving Average (EWMA) is used to filter false alarms

                                  Sensitivity   Specificity   Accuracy
Best General                      84.2%         86.6%         85.1%
General Ensemble (Stacking)       77.1%         94.6%         84.1%
Specialized Ensemble (Stacking)   92.9%         92.0%         92.3%
Mixed Ensemble (Stacking)         85.5%         90.1%         87.4%
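As a rough illustration of how an EWMA can filter false alarms, the Python sketch below smooths the stream of per-window decisions (one per 10,000 committed instructions) and raises an alarm only when the smoothed value crosses a threshold. The smoothing factor `alpha` and the alarm threshold are assumptions; the slides do not give the values used.

```python
def ewma_filter(decisions, alpha=0.2, threshold=0.5):
    """Smooth per-window decisions (1 = malware, 0 = regular) with an EWMA.

    Returns one boolean per window: True when the running average is at or
    above the alarm threshold. alpha and threshold are illustrative values.
    """
    average = 0.0
    alarms = []
    for decision in decisions:
        average = alpha * decision + (1.0 - alpha) * average
        alarms.append(average >= threshold)
    return alarms

isolated_false_alarm = [0, 1, 0, 0]   # a single spurious detection
sustained_detection = [1, 1, 1, 1, 1]  # consistent malware behavior
```

A single spurious detection never pushes the average over the threshold, while a sustained run of detections does after a few windows. The cost is a detection delay of several windows.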


METRICS TO ASSESS RELATIVE PERFORMANCE OF TWO-LEVEL DETECTION FRAMEWORK


Metrics

• Work Advantage
• Time Advantage
• Detection Performance


Time & Work Advantage Results

[Figure: two panels, Time Advantage and Work Advantage]


Hardware Implementation

• Physical design overhead
– Area: 2.8% (Ensemble), 0.3% (General)
– Power: 1.5% (Ensemble), 0.1% (General)
– Cycle time: 9.8% (Ensemble), 1.9% (General)


Conclusions & Future Work

• Ensemble learning with specialized detectors can significantly improve detection performance

• Hardware complexity increases, but several optimizations are still possible

• Some features are complex to collect; simpler features may carry the same information

• Future work:
– Demonstrate a fully functional system
– Study how attackers could evolve and adversarial machine learning


Thank you!


Questions?