
Current Problems in Malware Analysis

Engin Kirda

In this talk…

• I plan to talk about some of my observations, in research as well as in industry, with respect to malware analysis and detection
– “AI” usage in threat detection products: there is a lot of talk on this, and a lot of it is buzz, as you know

• I plan to talk a bit about what we are doing
– How to use “AI” the right way to detect advanced attacks

TOBB-ETÜ Siber Güvenlik Günü

What is…


Malware?

Conventional Definition

• Trojan
• Backdoor
• Ransomware
• Worm
• Virus
• Exploit
• …


The 2000s called... Sorry, too boring!

Let’s Start With “Who” and “Why”

• Today, who develops malware… and why?


Actors / APT

Advanced Persistent Threats != malware!

• Criminal Businesses
• Organized Crime
• Nation States


Criminals

• Steal credit cards / bank accounts
• Steal credentials / social accounts
• Traffic / Pay-per-click
• Pay-per-install
• Spam / Malvertising
• DDoS
• Ransomware


Organized Crime

Commercial Espionage

• Taking competitive advantage
• Stealing trade secrets
• Disrupting competitors’ ability to work

Political targets

Learn more: https://apt.securelist.com/

Nation State

• Very targeted / low profile
• Each campaign might cost millions of US dollars
• Planned and developed for years
• Unique pieces of software used
• Zero-day exploits


What is malware?


Malware is software and infrastructure designed to achieve the actor’s goal

Advanced Threats

• Sophisticated toolsets
• Multi-vector attacks
• Multi-step attacks

MITRE’s ATT&CK: Adversarial Tactics, Techniques, and Common Knowledge

• Initial Access
• Execution
• Persistence
• Privilege Escalation
• Defense Evasion
• Credential Access
• Discovery
• Lateral Movement
• Collection
• Exfiltration
• Command & Control


Artificial Intelligence

[Figure: nested sets. Artificial Intelligence contains Machine Learning, which contains Deep Learning.]

The promise of AI in Cybersecurity

AI can search large amounts of data to identify relevant patterns and signals

• recognize (previously unknown) threats
• find (hidden) connections between related events
• filter data and reduce amount of manual analysis effort
• enable automated response

Truth #1: Anomalies are not (necessarily) threats

• Unsupervised learning
– identifies structure in unlabeled data and finds outliers (anomalies)

• Supervised learning
– classifies data (can distinguish between good and bad)

Anomalies are not (necessarily) threats

[Figure sequence, axes x and y: unsupervised learning; outlier (anomaly); false positive and false negatives; unsupervised learning vs. supervised learning]


• Many network security solutions today
– profile your network (using unsupervised learning)
– then find outliers and label them as threats

• The result is false positives and missed threats

Ask how a solution recognizes actual threats (and how it differentiates good from bad outliers)
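A minimal sketch of why "outlier" and "threat" are different questions. The host names, upload volumes, and labels below are invented for illustration, and the z-score rule is only a stand-in for whatever unsupervised profiling a product might use.

```python
from statistics import mean, stdev

# Hypothetical per-host upload volumes in MB; hosts and labels are invented.
observations = [
    ("ws-01", 100, "benign"),
    ("ws-02", 110, "benign"),
    ("ws-03", 95, "benign"),
    ("ws-04", 105, "benign"),
    ("ws-05", 98, "benign"),
    ("ws-06", 102, "benign"),
    ("ws-07", 107, "benign"),
    ("ws-08", 99, "benign"),
    ("backup-server", 5000, "benign"),  # nightly backup: unusual but harmless
    ("laptop-17", 104, "malicious"),    # slow exfiltration that blends in
]

def flag_outliers(rows, z_threshold=2.0):
    """Flag hosts whose upload volume is far from the mean (unsupervised)."""
    values = [value for _, value, _ in rows]
    mu, sigma = mean(values), stdev(values)
    return [host for host, value, _ in rows if abs(value - mu) / sigma > z_threshold]

flagged = flag_outliers(observations)
labels = {host: label for host, _, label in observations}

# Outliers that are actually benign, and threats that were never flagged.
false_positives = [h for h in flagged if labels[h] == "benign"]
false_negatives = [h for h, l in labels.items() if l == "malicious" and h not in flagged]

print("flagged:", flagged)
print("false positives:", false_positives)
print("false negatives:", false_negatives)
```

In this toy data the only outlier is the benign backup server, while the malicious host sits inside the normal range, so labeling outliers as threats produces one false positive and one missed threat.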

Truth #2: The world is too complex for linear classifiers

• A linear classifier makes a classification decision based on the value of a linear combination of the (input) characteristics.

• The linear classifier splits a high-dimensional input space with a hyperplane: all points on one side of the hyperplane are classified as "yes", while the others are classified as "no".

[Wikipedia]
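The definition above can be sketched in a few lines. The XOR-style points are a standard toy example, not data from the talk, and the coarse grid search is only an illustration: it tries many hyperplanes and reports the best accuracy any of them reaches.

```python
from itertools import product

def linear_classify(weights, bias, point):
    """Return True ("yes") if the point lies on the positive side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, point)) + bias > 0

# XOR-style toy data: the "yes" points sit on opposite corners.
data = [((0, 0), False), ((1, 1), False), ((0, 1), True), ((1, 0), True)]

def accuracy(weights, bias):
    """Fraction of the toy points that the given hyperplane classifies correctly."""
    correct = sum(linear_classify(weights, bias, p) == label for p, label in data)
    return correct / len(data)

# Try every combination of w1, w2, and bias in [-2, 2] with step 0.5.
grid = [i / 2 for i in range(-4, 5)]
best = max(accuracy((w1, w2), b) for w1, w2, b in product(grid, repeat=3))

print("best accuracy of any linear classifier on the grid:", best)
```

No single hyperplane separates the two classes here, so at least one of the four points is always misclassified. Real network behavior is rarely simpler than this toy case.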

The world is too complex for linear classifiers

[Figure sequence, axes x and y; a later panel is captioned "Let’s add some data points"]


• Many network security solutions today
– use simple models of network properties that are based on thresholds or the idea of average & standard deviation

• The result is false positives and missed threats

Ask people for details about their AI models

Truth #3: Good training data can be hard to get

• More training data is (almost) always better

Michele Banko and Eric Brill, "Scaling to Very Very Large Corpora for Natural Language Disambiguation", ACL, 2001

Good training data can be hard to get

• Malware samples are relatively easy to get
– as a result, we have seen good models for malicious files

• What about malicious network traffic?
– when you detonate malware, it opens all kinds of network connections (many benign)
– malicious infrastructure is frequently down
– how to observe human-driven lateral movement?

Good training data can be hard to get

• Many network security vendors today
– have a difficult time getting large amounts of high-quality, labeled network data (specifically, malicious traffic)
– “threat intelligence team” manually builds (small) datasets

• The result is weak AI models and missed threats

Ask vendors how they get enough labeled network data to train their models

Truth #4: We need signal in the data to train AI

Which of these two connections is malicious?

TCP 128.111.41.134:55015 -> 192.168.0.1:80
in_bytes: 85; out_bytes: 78,143; duration 2.3s

TCP 128.111.41.134:55016 -> 192.168.0.1:80
in_bytes: 91; out_bytes: 9,763; duration 1.3s

We need signal in the data to train AI



GET /getimage.php?url=https://website.com/images/cat.jpg HTTP/1.1
Host: example.com

GET /getimage.php?url=http://169.254.169.254/latest/meta-data/ HTTP/1.1
Host: example.com

Second request: AWS EC2 Instance Metadata (SSRF Attack)
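The two requests above look alike in flow metadata; the difference only shows up in the payload. A minimal sketch of a payload-level check follows, assuming the parameter is named `url` as in the example. The function name `ssrf_suspects` is hypothetical, and the check is deliberately narrow: it catches literal non-public IP addresses only, so it would miss DNS names or encoded addresses that resolve to the same target.

```python
import ipaddress
from urllib.parse import parse_qs, urlsplit

def ssrf_suspects(request_line, param="url"):
    """Return values of `param` whose host is a literal non-public IP address."""
    _method, target, _version = request_line.split(" ", 2)
    query = urlsplit(target).query
    suspects = []
    for value in parse_qs(query).get(param, []):
        host = urlsplit(value).hostname
        if host is None:
            continue
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            continue  # a DNS name, not a literal IP address
        if ip.is_link_local or ip.is_private or ip.is_loopback:
            suspects.append(value)
    return suspects

benign = "GET /getimage.php?url=https://website.com/images/cat.jpg HTTP/1.1"
attack = "GET /getimage.php?url=http://169.254.169.254/latest/meta-data/ HTTP/1.1"

print(ssrf_suspects(benign))
print(ssrf_suspects(attack))
```

The address 169.254.169.254 is link-local, which is what the check keys on for the second request.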


• Many network security solutions today
– rely (too much) on network metadata to detect malicious traffic (especially in the NDR space)
– certain attacks are simply not visible in metadata

• The result is missed threats

Ask whether a solution combines metadata with deep packet inspection (DPI)

Truth #5: AI can be attacked (and evaded)

• Adversarial machine learning is a hot topic

[Figures: fooling face recognition; confusing self-driving cars]

AI can be attacked (and evaded)

• ML has also been attacked in a security context

AI can be attacked (and evaded)

• Many network security solutions today
– present AI as the magic silver bullet
– put too much confidence into AI as a single detection layer

• The result is missed threats

Ask whether a solution combines AI with other detection capabilities, such as sandboxing and threat intelligence

(Bonus) Truth #6: Despite the boom and bust, AI provides value

• AI works well for focused tasks, not wide ranging missions (we are far from artificial general intelligence)

• You need a lot of input data, good features, and the proper algorithms (and deep security expertise)

• AI solutions operate best in combination with complementary detection techniques and human experts

Extracting behaviors: Sandboxing

• Not all sandboxes are created equal
• Generating logs is not enough: identifying high-level behaviors is important
• Resistance to evasion is key
• Exposing the actual malicious code is very important
• Exposing the network traffic is very important


FUSE – Deep Visibility Into Malware

Legacy Sandbox [diagram labels: Guest OS, Malware]

• Visibility limited to interactions with the OS
• Visibility limited to user-space objects
• Virtualization artifacts in Guest OS allow malware to fingerprint and evade detection

Full System Emulation Sandbox [diagram labels: Guest OS, Malware]

• Deep Content Inspection supports visibility into malware internals
• Full-system Emulation provides visibility into kernel-level objects
• Guest OS is unmodified and resists fingerprinting and evasion

Misuse vs. Anomaly Detection: The 80’s Are Back!

Modeling Good Behavior

• Time consuming
• Requires expert knowledge
• Incomplete
• Constantly outdated


Learning Good Behavior

• Automated
• Does not require expert knowledge
• Comprehensive
• Continued


Learn What Your Network Does

• Input: Netflow data, DNS resolutions, HTTP requests, DHCP logs, Active Directory data

• Output: A network baseline model
– Ports open
– Recurrent name resolutions and repeated connections
– HTTP request characteristics (and number of errors)
– Normal destinations of flows (flow fan-in/fan-out)
– Normal amount of data sent/received
– Time of activity, logins
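As a rough illustration of what learning a baseline can mean, the sketch below folds flow records into a per-host profile. The record layout `(src_host, dst_host, dst_port, bytes_sent)`, the host names, and the function `build_baseline` are assumptions made for this example; a real system would also cover DNS, HTTP, DHCP, and login data as listed above.

```python
from collections import defaultdict

# Hypothetical flow records: (src_host, dst_host, dst_port, bytes_sent).
flows = [
    ("ws-01", "dc-01", 53, 300),
    ("ws-01", "fileserver", 445, 1_500_000),
    ("ws-01", "dc-01", 53, 280),
    ("ws-02", "proxy", 8080, 40_000),
]

def build_baseline(flows):
    """Summarize, per source host, the peers, ports, and largest upload seen."""
    baseline = defaultdict(lambda: {"peers": set(), "ports": set(), "max_bytes": 0})
    for src, dst, port, sent in flows:
        entry = baseline[src]
        entry["peers"].add(dst)       # normal destinations (fan-out)
        entry["ports"].add(port)      # ports normally contacted
        entry["max_bytes"] = max(entry["max_bytes"], sent)  # normal data volume
    return dict(baseline)

baseline = build_baseline(flows)
print(baseline["ws-01"])
```

The profile is learned from observed traffic rather than written by an expert, which is the point of the "Learning Good Behavior" slide.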


Identify Anomalies: Examples

• Once the baseline model has been established, the system identifies outliers
– A new service started on a host
– An RDP connection has been established to a server that was never contacted before
– An unusual amount of data has been uploaded to a never-seen-before host
– An unusual number of HTTP errors has been generated
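A minimal sketch of checking a new flow against a learned profile. The baseline contents, the `volume_factor` of 10, and the function `find_anomalies` are all hypothetical choices for illustration.

```python
# Hypothetical baseline learned earlier: per-host known peers, ports, and largest upload.
baseline = {
    "ws-01": {"peers": {"dc-01", "fileserver"}, "ports": {53, 445}, "max_bytes": 2_000_000},
}

def find_anomalies(flow, baseline, volume_factor=10):
    """Return the ways a flow (src, dst, port, bytes_sent) deviates from the baseline."""
    src, dst, port, sent = flow
    known = baseline.get(src)
    if known is None:
        return ["host not in baseline"]
    findings = []
    if dst not in known["peers"]:
        findings.append("never-seen-before destination")
    if port not in known["ports"]:
        findings.append("new destination port")
    if sent > volume_factor * known["max_bytes"]:
        findings.append("unusual upload volume")
    return findings

routine = ("ws-01", "dc-01", 53, 500)
odd = ("ws-01", "203.0.113.9", 3389, 50_000_000)  # RDP to a new host, large upload

print(find_anomalies(routine, baseline))
print(find_anomalies(odd, baseline))
```

The findings are outliers, not verdicts: a new backup target would trigger the same three findings as an exfiltration.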


Bad is anomalous
Anomalous is bad

Pitfalls in Anomaly Detection

• Unusual data transfer
• Abnormal CPU activity
• Unusual login time
• Long database query
• Failed DNS resolutions
• Long TCP session


Falses


Using Machine Learning the Right Way

• By deriving models from the attacks we have detected, we can learn about new, unknown attacks and their behaviors

• In any case, whenever we find an attack, we can try to find similar attacks automatically using machine learning
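One way to picture "find similar attacks" is a similarity search seeded with a confirmed attack. The sketch below uses Jaccard similarity over sets of observed behaviors as a simple stand-in for a trained model; the talk does not name a specific technique, and the behavior names, hosts, and the 0.5 threshold are invented.

```python
def jaccard(a, b):
    """Jaccard similarity of two behavior sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Behaviors observed in one confirmed attack (hypothetical).
known_attack = {"new_service", "rdp_to_new_server", "large_upload_new_host"}

# Behaviors observed on other hosts (hypothetical).
candidates = {
    "host-a": {"new_service", "rdp_to_new_server", "large_upload_new_host", "http_errors"},
    "host-b": {"http_errors"},
    "host-c": {"rdp_to_new_server", "large_upload_new_host"},
}

def similar_hosts(known, candidates, threshold=0.5):
    """Return hosts whose behavior resembles the known attack, most similar first."""
    scores = {host: jaccard(known, seen) for host, seen in candidates.items()}
    hits = [host for host, score in scores.items() if score >= threshold]
    return sorted(hits, key=lambda host: -scores[host])

print(similar_hosts(known_attack, candidates))
```

Starting from a known-bad example keeps the search anchored to actual attacks instead of to whatever happens to be unusual.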

Conclusions

• Machine learning and anomaly detection are important tools for finding cyber attacks

• Unfortunately, machine learning is not the solution to everything and, as you have seen, it has to overcome some challenges

• For anomaly detection to work well, combining it with machine learning is sensible and efficient

• I expect that we will see interesting new problems and solutions in this area in the coming years