Current Problems in Malware Analysis
Engin [email protected]
In this talk…
• I plan to talk about some of my observations in research as well as in industry right now with respect to malware analysis and detection
– “AI” usage in threat detection products: there is a lot of talk on this, and a lot of it is buzz, as you know
• I plan to talk a bit about what we are doing
– How to use “AI” the right way to detect advanced attacks
TOBB-ETÜ Siber Güvenlik Günü
What is…
Malware?
Conventional Definition
• Trojan
• Backdoor
• Ransomware
• Worm
• Virus
• Exploit
• …
The 2000s called... Sorry, too boring!
Let’s Start With “Who” and “Why”
• Today, who develops malware… and why?
Actors / APT
Advanced Persistent Threats != malware!
• Criminal Businesses
• Organized Crime
• Nation States
Criminals
• Steal credit cards / bank accounts
• Steal credentials / social accounts
• Traffic / Pay-per-click
• Pay-per-install
• Spam / Malvertising
• DDoS
• Ransomware
Organized Crime
Commercial Espionage
• Taking competitive advantage
• Stealing trade secrets
• Disrupting competitors’ ability to work
Political targets
Learn more: https://apt.securelist.com/
Nation State
• Very targeted / low profile
• Each campaign might cost millions of US dollars
• Planned and developed for years
• Unique pieces of software used
• Zero-day exploits
What is malware?
Malware is software and infrastructure designed to achieve the actor’s goal
Advanced Threats
• Sophisticated toolsets
• Multi-vector attacks
• Multi-step attacks
Tactics: Initial Access, Execution, Persistence, Privilege Escalation, Defense Evasion, Credential Access, Discovery, Lateral Movement, Collection, Exfiltration, Command & Control
MITRE’s ATT&CK: Adversarial Tactics, Techniques, and Common Knowledge
Artificial Intelligence
[Diagram: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning]
The promise of AI in Cybersecurity
AI can search large amounts of data to identify relevant patterns and signals
• recognize (previously unknown) threats
• find (hidden) connections between related events
• filter data and reduce amount of manual analysis effort
• enable automated response
Truth #1: Anomalies are not (necessarily) threats
• Unsupervised learning
– identifies structure in unlabeled data and finds outliers (anomalies)
• Supervised learning
– classifies data (can distinguish between good and bad)
[Figure: scatter plots, axes x and y, comparing unsupervised learning with supervised learning. Annotations: “Outlier (Anomaly)”, “False positive”, “False negatives”.]
• Many network security solutions today
– profile your network (using unsupervised learning)
– then find outliers and label them as threats
• The result is false positives and missed threats
Ask how a solution recognizes actual threats (and how it differentiates good from bad outliers)
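A minimal sketch of this pitfall, using a plain z-score test as a stand-in for unsupervised network profiling. The host names, upload volumes, ground-truth labels, and the threshold of 2.0 are all made up for illustration; real products use richer models, but the failure mode is the same.

```python
from statistics import mean, stdev

# Hypothetical daily upload volume in MB per host, with made-up ground truth.
hosts = {
    "ws-01": (90, "benign"),
    "ws-02": (95, "benign"),
    "ws-03": (100, "benign"),
    "ws-04": (105, "benign"),
    "ws-05": (110, "benign"),
    "ws-06": (98, "benign"),
    "ws-07": (102, "benign"),
    "ws-08": (100, "benign"),
    "ws-09": (104, "malicious"),      # slow exfiltration that blends in
    "backup-srv": (5000, "benign"),   # legitimate nightly backup
}


def zscore_outliers(values, threshold=2.0):
    """Return the keys whose value lies more than `threshold` std devs from the mean."""
    mu = mean(values.values())
    sigma = stdev(values.values())
    return sorted(k for k, v in values.items() if abs(v - mu) / sigma > threshold)


uploads = {host: mb for host, (mb, _label) in hosts.items()}
flagged = zscore_outliers(uploads)

# Compare the "anomalous" hosts with the ground truth.
false_positives = [h for h in flagged if hosts[h][1] == "benign"]
false_negatives = sorted(
    h for h, (_mb, label) in hosts.items() if label == "malicious" and h not in flagged
)

print("flagged:", flagged)
print("false positives:", false_positives)
print("false negatives:", false_negatives)
```

In this toy data the only outlier is the benign backup server, while the malicious host looks perfectly normal. Labels (supervised learning) are what would let a model tell the two apart.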
Truth #2: The world is too complex for linear classifiers
• A linear classifier makes a classification decision based on the value of a linear combination of the (input) characteristics.
• The linear classifier splits a high-dimensional input space with a hyperplane: all points on one side of the hyperplane are classified as "yes", while the others are classified as "no".
[Wikipedia]
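The definition above can be sketched in a few lines. The two toy datasets (AND-style and XOR-style labels on four points) and the brute-force grid search are assumptions chosen for illustration, not something from the talk; they show one layout a hyperplane separates perfectly and one it cannot.

```python
from itertools import product


def linear_classify(weights, bias, point):
    """Return True ("yes") if the point lies on the positive side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, point)) + bias > 0


def accuracy(weights, bias, data):
    """Fraction of labeled points the linear classifier gets right."""
    hits = sum(linear_classify(weights, bias, p) == label for p, label in data)
    return hits / len(data)


def best_accuracy(data, grid):
    """Brute-force the best linear classifier over a small grid of weights and biases."""
    return max(accuracy((w1, w2), b, data) for w1, w2, b in product(grid, repeat=3))


# AND-style labels: a single line separates "yes" from "no".
and_data = [((0, 0), False), ((0, 1), False), ((1, 0), False), ((1, 1), True)]
# XOR-style labels: "yes" points sit on opposite corners, so no line separates them.
xor_data = [((0, 0), False), ((0, 1), True), ((1, 0), True), ((1, 1), False)]

grid = [x / 2 for x in range(-6, 7)]  # -3.0 to 3.0 in steps of 0.5

print("best accuracy on AND data:", best_accuracy(and_data, grid))
print("best accuracy on XOR data:", best_accuracy(xor_data, grid))
```

On the XOR-style data every linear classifier misclassifies at least one of the four points, no matter how the weights are chosen, so a finer grid would not help.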
[Figure: scatter plots, axes x and y. Annotation: “Let’s add some data points”.]
• Many network security solutions today
– use simple models of network properties that are based on thresholds or the idea of average & standard deviation
• The result is false positives and missed threats
Ask people for details about their AI models
Truth #3: Good training data can be hard to get
• More training data is (almost) always better
Michele Banko and Eric Brill, “Scaling to Very Very Large Corpora for Natural Language Disambiguation”, ACL, 2001
• Malware samples are relatively easy to get
– as a result, we have seen good models for malicious files
• What about malicious network traffic?
– when you detonate malware, it opens all kinds of network connections (many benign)
– malicious infrastructure is frequently down
– how to observe human-driven lateral movement?
• Many network security vendors today
– have a difficult time getting large amounts of high-quality, labeled network data (specifically, malicious traffic)
– “threat intelligence team” manually builds (small) datasets
• The result is weak AI models and missed threats
Ask vendors how they get enough labeled network data to train their models
Truth #4: We need signal in the data to train AI
Which of these two connections is malicious?
TCP 128.111.41.134:55015 -> 192.168.0.1:80
in_bytes: 85; out_bytes: 78,143; duration 2.3s
TCP 128.111.41.134:55016 -> 192.168.0.1:80
in_bytes: 91; out_bytes: 9,763; duration 1.3s
The same connections with their HTTP requests:
GET /getimage.php?url=https://website.com/images/cat.jpg HTTP/1.1
Host: example.com
GET /getimage.php?url=http://169.254.169.254/latest/meta-data/ HTTP/1.1
Host: example.com
AWS EC2 Instance Metadata (SSRF Attack)
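The signal here lives in the request payload, not in the flow metadata. A rough sketch of a payload-level check, assuming the forwarded address always arrives in a query parameter named `url` as in the two requests above. The function name `ssrf_target` and the choice of "internal" address ranges are illustrative assumptions, and DNS names that resolve to internal addresses are deliberately ignored.

```python
import ipaddress
from urllib.parse import parse_qs, urlsplit


def ssrf_target(request_line):
    """Return the internal IP that the request's `url` parameter points at, or None."""
    _method, target, _version = request_line.split(" ", 2)
    query = parse_qs(urlsplit(target).query)
    for value in query.get("url", []):
        host = urlsplit(value).hostname
        if host is None:
            continue
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            continue  # a DNS name; resolving it is out of scope for this sketch
        # 169.254.0.0/16 is link-local and hosts the EC2 metadata endpoint.
        if ip.is_link_local or ip.is_private or ip.is_loopback:
            return str(ip)
    return None


benign = "GET /getimage.php?url=https://website.com/images/cat.jpg HTTP/1.1"
attack = "GET /getimage.php?url=http://169.254.169.254/latest/meta-data/ HTTP/1.1"

print(ssrf_target(benign))
print(ssrf_target(attack))
```

Both requests produce nearly identical connection metadata (same endpoints, similar byte counts and durations), so a check like this is only possible when the detector can see the HTTP request itself.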
• Many network security solutions today
– rely (too much) on network metadata to detect malicious traffic (especially in the NDR space)
– certain attacks are simply not visible in metadata
• The result is missed threats
Ask vendors if they combine metadata with deep packet inspection (DPI)
Truth #5: AI can be attacked (and evaded)
• Adversarial machine learning is a hot topic
– Fooling face recognition
– Confusing self-driving cars
• ML has also been attacked in the security context
• Many network security solutions today
– present AI as the magic silver bullet
– put too much confidence into AI as a single detection layer
• The result is missed threats
Ask vendors if they combine AI with other detection capabilities, such as sandboxing and threat intelligence
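One way to picture layered detection is a simple vote across independent signals. This is only a sketch: the `Evidence` fields, the 0.9 threshold, and the two-out-of-three rule are hypothetical choices for illustration, not a description of any product.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    ml_score: float          # hypothetical classifier output in [0, 1]
    sandbox_malicious: bool  # hypothetical sandbox verdict
    intel_hit: bool          # hypothetical threat-intelligence match


def verdict(evidence, ml_threshold=0.9):
    """Combine three detection layers so that no single layer decides alone."""
    votes = [
        evidence.ml_score >= ml_threshold,
        evidence.sandbox_malicious,
        evidence.intel_hit,
    ]
    if sum(votes) >= 2:
        return "block"
    if any(votes):
        return "investigate"
    return "allow"


# A sample crafted to evade the ML model is still caught by the other two layers.
evaded_ml = Evidence(ml_score=0.10, sandbox_malicious=True, intel_hit=True)
# A lone high ML score is treated as a lead, not as proof.
ml_only = Evidence(ml_score=0.95, sandbox_malicious=False, intel_hit=False)

print(verdict(evaded_ml))
print(verdict(ml_only))
```

The point of the design is that an attacker who fools the model still has to get past the sandbox and the threat intelligence, and a model false positive alone does not trigger a block.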
(Bonus) Truth #6: Despite the boom and bust, AI provides value
• AI works well for focused tasks, not wide ranging missions (we are far from artificial general intelligence)
• You need a lot of input data, good features, and the proper algorithms (and deep security expertise)
• AI solutions operate best in combination with complementary detection techniques and human experts
Extracting behaviors: Sandboxing
• Not all sandboxes are created equal
• Generating logs is not enough: identifying high-level behaviors is important
• Resistance to evasion is key
• Exposing the actual malicious code is very important
• Exposing the network traffic is very important
FUSE – Deep Visibility Into Malware
Legacy Sandbox
• Visibility limited to interactions with the OS
• Virtualization artifacts in Guest OS allow malware to fingerprint and evade detection
• Visibility limited to user-space objects
Full System Emulation Sandbox
• Deep Content Inspection supports visibility into malware internals
• Full-system Emulation provides visibility into kernel-level objects
• Guest OS is unmodified and resists fingerprinting and evasion
Misuse vs. Anomaly Detection: The 80’s Are Back!
Modeling Good Behavior
• Time consuming
• Requires expert knowledge
• Incomplete
• Constantly outdated
Learning Good Behavior
• Automated
• Does not require expert knowledge
• Comprehensive
• Continued
Learn What Your Network Does
• Input: Netflow data, DNS resolutions, HTTP requests, DHCP logs, Active Directory data
• Output: A network baseline model
– Ports open
– Recurrent name resolutions and repeated connections
– HTTP request characteristics (and amount of errors)
– Normal destinations of flows (flow fan-in/fan-out)
– Normal amount of data sent/received
– Time of activity, logins
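A toy version of such a baseline, built from flow records only. The record layout `(src_host, dst_host, dst_port, bytes_sent)`, the host names, and the three baseline fields are assumptions for illustration; a real baseline would also draw on DNS, HTTP, DHCP, and Active Directory data as listed above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical flow records: (src_host, dst_host, dst_port, bytes_sent)
flows = [
    ("ws-01", "dc-01", 389, 1200),
    ("ws-01", "web-01", 443, 5000),
    ("ws-02", "web-01", 443, 7000),
    ("ws-02", "dc-01", 389, 800),
    ("ws-01", "web-01", 443, 3000),
]


def build_baseline(flows):
    """Summarize observed flows into a simple per-host picture of normal behavior."""
    open_ports = defaultdict(set)  # server -> ports seen accepting connections
    peers = defaultdict(set)       # client -> destinations it normally contacts
    sent = defaultdict(list)       # client -> bytes sent per flow
    for src, dst, port, nbytes in flows:
        open_ports[dst].add(port)
        peers[src].add(dst)
        sent[src].append(nbytes)
    return {
        "open_ports": dict(open_ports),
        "peers": dict(peers),
        "mean_bytes_sent": {host: mean(values) for host, values in sent.items()},
    }


baseline = build_baseline(flows)
print(baseline["open_ports"])
print(baseline["peers"])
print(baseline["mean_bytes_sent"])
```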
Identify Anomalies: Examples
• Once the baseline model has been established, the system identifies outliers
– A new service started on a host
– An RDP connection has been established to a server that was never contacted before
– An unusual amount of data has been uploaded to a never-seen-before host
– An unusual amount of HTTP errors has been generated
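The first three examples can be sketched as checks of a new flow against a baseline. The baseline contents, the field names, and the `find_outliers` helper are hypothetical and hard-coded here to keep the example self-contained; the findings it returns are leads for further analysis, not verdicts.

```python
# Hypothetical baseline learned earlier from normal traffic.
baseline = {
    "open_ports": {"web-01": {443}, "dc-01": {389}},
    "peers": {"ws-01": {"web-01", "dc-01"}},
    "max_upload_bytes": {"ws-01": 10_000},
}

RDP_PORT = 3389


def find_outliers(baseline, flow):
    """Return human-readable deviations of one flow from the baseline."""
    src, dst, port, nbytes = flow
    findings = []
    # A port that was never seen open on the destination host.
    if port not in baseline["open_ports"].get(dst, set()):
        findings.append(f"new service on {dst}: port {port}")
    new_peer = dst not in baseline["peers"].get(src, set())
    # RDP to a server this client never contacted before.
    if new_peer and port == RDP_PORT:
        findings.append(f"RDP from {src} to never-contacted host {dst}")
    # More data than usual uploaded to a never-seen-before host.
    if new_peer and nbytes > baseline["max_upload_bytes"].get(src, 0):
        findings.append(f"unusual upload of {nbytes} bytes from {src} to new host {dst}")
    return findings


print(find_outliers(baseline, ("ws-01", "web-01", 443, 2000)))
print(find_outliers(baseline, ("ws-01", "db-07", 3389, 500)))
```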
“Bad is anomalous” does not imply “anomalous is bad”
Pitfalls in Anomaly Detection
Unusual data transfer
Abnormal CPU activity
Unusual login time
Long database query
Failed DNS resolutions
Long TCP session
Falses
Using Machine Learning the Right Way
• By deriving models from the attacks we detect, we can learn about new, unknown attacks and their behaviors
• In any case, whenever we find an attack, we can try to find similar attacks automatically with machine learning
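A minimal sketch of "find similar attacks automatically", using Jaccard similarity over sets of observed behaviors as a simple stand-in for a learned model. The behavior names, the sample names, and the 0.5 threshold are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Behaviors extracted from an attack we have already confirmed (hypothetical).
known_attack = {
    "creates_run_key",
    "disables_defender",
    "encrypts_user_files",
    "deletes_shadow_copies",
}

# Behaviors of new, unlabeled samples (hypothetical).
samples = {
    "sample-a": {"creates_run_key", "encrypts_user_files", "deletes_shadow_copies", "contacts_tor"},
    "sample-b": {"opens_window", "reads_config", "checks_for_updates"},
    "sample-c": {"disables_defender", "encrypts_user_files", "deletes_shadow_copies"},
}


def similar_samples(known, samples, threshold=0.5):
    """Return (name, score) pairs at or above the threshold, most similar first."""
    scored = [(name, jaccard(known, behaviors)) for name, behaviors in samples.items()]
    return sorted(
        (pair for pair in scored if pair[1] >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )


for name, score in similar_samples(known_attack, samples):
    print(name, round(score, 2))
```

Because the search starts from a confirmed attack rather than from "whatever is unusual", the matches are tied to known-bad behavior instead of mere statistical outliers.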
Conclusions
• Machine learning and anomaly detection are important tools for finding cyber attacks
• Unfortunately, machine learning is not the solution to everything and, as you have seen, it has to overcome some challenges
• For anomaly detection to work well, it is sensible and effective to combine it with machine learning
• I expect that we will see interesting new problems and solutions in this area in the coming years