Dmk audioviz

WEAPONS GRADE AUDIO VISUALIZATION (and other pattern stunts) Dan Kaminsky Director Of Penetration Testing IOActive Inc.

Transcript of Dmk audioviz

Page 1: Dmk audioviz

WEAPONS GRADE AUDIO VISUALIZATION

(and other pattern stunts)

Dan Kaminsky

Director Of Penetration Testing

IOActive Inc.

Page 2: Dmk audioviz

Introduction

Page 3: Dmk audioviz

Alas

• Very easy to do the pretty
  – Very hard to do the useful
• Two major categories of audio visualization
  – Direct rendering of intensity or spectrum
  – Pretty morphing shapes…that…uh…kinda guide the imagery.
    • But not really.

• Can we do better?

Page 4: Dmk audioviz

Another Approach: Dotplots

Page 5: Dmk audioviz

Useful for various domains

• Started in genomics – genes are the worst protocol the world has ever seen
• I’ve been using it for analyzing all sorts of things
  – Code
  – Books
  – Law
  – Video
• Originally saw a paper applying it to audio
  – Could I make a WinAMP plugin that does this?

Page 6: Dmk audioviz

What Exactly Are We Doing

• Jonathan Helfman’s “Dotplot Patterns: A Literal Look at Pattern Languages” offers an introduction
• Instead of “to, be, not” etc, we use chunks of data from arbitrary files
  – Instead of demanding perfect equality, we measure how similar the chunks are
  – If most of the bytes are in most of the same places, it’s pretty similar; if most are different, pretty dissimilar
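The similarity rule above can be sketched in a few lines of Python. This is only an illustration, not the code behind the slides: the function names are invented here, and cutting the file into non-overlapping fixed-size windows is an assumption.

```python
def chunk_similarity(a: bytes, b: bytes) -> float:
    """Fraction of positions at which two equal-length chunks hold the same byte."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dotplot(data: bytes, window: int = 32) -> list:
    """Compare every window-sized chunk of data against every other chunk."""
    # Drop a trailing partial chunk so that all chunks have equal length.
    usable = len(data) - len(data) % window
    chunks = [data[i:i + window] for i in range(0, usable, window)]
    return [[chunk_similarity(row, col) for col in chunks] for row in chunks]


if __name__ == "__main__":
    sample = b"ABCDABCDABCXWXYZ"
    for row in dotplot(sample, window=4):
        print(" ".join(f"{cell:.2f}" for cell in row))
```

Each cell would then be drawn as one pixel, brighter for higher similarity; a repeated region shows up as a bright line parallel to the main diagonal.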

Page 7: Dmk audioviz

Demo: LudiVu

Page 8: Dmk audioviz

Intro to LudiVu

• Realtime spectral analyzer – compares what you’re listening to now with what you’ve been listening to the last few seconds
  – Really simple similarity metric: Split available spectrum into three bands
    • Bass: Red
    • Midrange: Green
    • Treble: Blue
  – Take difference between source band and dest band. If very similar, add sim. Else sub.
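One possible reading of the three-band metric, as a rough Python sketch. Everything specific here is an assumption made for illustration: the equal split into thirds, the threshold, and the base and step brightness values are hypothetical and are not taken from LudiVu.

```python
def band_levels(spectrum):
    """Average magnitude of the lower, middle and upper third of one spectrum frame."""
    third = len(spectrum) // 3
    bands = (spectrum[:third], spectrum[third:2 * third], spectrum[2 * third:])
    return tuple(sum(band) / len(band) for band in bands)


def similarity_pixel(now, past, threshold=0.1, base=128, step=64):
    """Return an (R, G, B) pixel comparing the current frame with an earlier one.

    threshold, base and step are illustrative values, not LudiVu's.
    """
    pixel = []
    for current, earlier in zip(band_levels(now), band_levels(past)):
        if abs(current - earlier) < threshold:
            value = base + step   # bands agree: brighten this channel
        else:
            value = base - step   # bands differ: darken this channel
        pixel.append(max(0, min(255, value)))
    return tuple(pixel)
```

Bass maps to red, midrange to green and treble to blue, so a frame that matches an earlier one in all three bands comes out near white.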

Page 9: Dmk audioviz

Infrastructure

• Built on top of AVS
  – WinAMP Advanced Visualization Studio
  – OpenCV, hacker style – lots and lots of image manipulation algorithms, stackable and hackable
  – Very easy to alter the framebuffer after the fact

Page 10: Dmk audioviz

What are we getting?

• Two independent outputs
  – One: When the same signal repeats, we can see it as a line
    • “Visual autocorrelation”
  – Two: Even if signals do not repeat, the shapes they form create a sort of “visual hash”
    • Might be possible to do larger scale mapping, because the viz is the same
  – The thought: This roughly feels like seeing as we hear
    • Temporal in Audio = Spatial in Visual

Page 11: Dmk audioviz

Two Primary Modes

• Visual Hashes vs. Similarity Sequences are overlaid
  – Vertical white lines on top of RGB layout
  – Adding motion blur highlights similarity (white lines) at expense of visual hashes
    • Actual mechanism: Leave n% of the old image around
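The "leave n% of the old image around" mechanism amounts to a weighted blend of the previous and current frames. A minimal sketch, assuming frames are plain nested lists of (R, G, B) tuples; the function name and the default of 80% are hypothetical.

```python
def blend_frames(old_frame, new_frame, keep=0.8):
    """Mix two equally sized frames of (R, G, B) pixels, keeping `keep` of the old one."""
    return [
        [
            tuple(round(keep * old + (1.0 - keep) * new) for old, new in zip(old_px, new_px))
            for old_px, new_px in zip(old_row, new_row)
        ]
        for old_row, new_row in zip(old_frame, new_frame)
    ]
```

Lines that recur frame after frame survive the blend, while one-off shapes fade, which is why the blur favors the similarity lines over the visual hashes.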

Page 12: Dmk audioviz

Chemical Brothers

Page 13: Dmk audioviz

So Why “Weapons Grade?”

• This was just supposed to be a toy…

• Then my friend suggested “Why don’t you try running this on Audio CAPTCHAs?”…
  – Sequences of numbers spoken over noise
  – There are only ten numbers
  – Can I see the repeated numbers I hear?

• Yup

Page 14: Dmk audioviz

Repeated Digit

Page 15: Dmk audioviz

What about other domains? (Nine Inch Nails, “Closer”)

Page 16: Dmk audioviz

More Video Analysis: Cibo Matto / Michel Gondry’s Palindromatic “Sugar Water”

Page 17: Dmk audioviz

We’ve figured out what some of these patterns mean…but code…

Page 18: Dmk audioviz

But some code just comes out strange.

Page 19: Dmk audioviz

Dotplots for Security / Code Analysis?

• A) Format Identification
  – 1) Do different files appear different, and does the appearance reflect the existence of internal structure?
  – 2) Do different instances of the same file format appear similar?
  – 3) Does one format embedded in another make itself apparent?
• B) Fuzzer Guidance
  – 1) Can we locate the actual byte offsets where one section ends and another begins?
  – 2) Can we visualize and compare fuzzer operations via Dotplots?

Page 20: Dmk audioviz

Format Identification

• 1) Do different files appear different, and does the appearance reflect the existence of internal structure?

• 2) Do different instances of the same file format appear similar?

• 3) Does one format embedded in another make itself apparent?

Page 21: Dmk audioviz

Java Class Files

Page 22: Dmk audioviz

.NET Assemblies

Page 23: Dmk audioviz

CNN’s Home Page

Page 24: Dmk audioviz

SMBTorture Traffic (Packets – Note, Stop/Start Is Visible)

Page 25: Dmk audioviz

Kernel32.dll

Page 26: Dmk audioviz

Chromosome 22 (This is, after all, a genomics hack)

Page 27: Dmk audioviz

The Legend Of Zelda

Page 28: Dmk audioviz

Format Identification

• 1) Do different files appear different, and does the appearance reflect the existence of internal structure?
  – Answer: Yes. They do.

• 2) Do different instances of the same file format appear similar?

• 3) Does one format embedded in another make itself apparent?

Page 29: Dmk audioviz

Books from Project Gutenberg: Consistent

Despite English’s low information content, lack of even mildly related strings causes little self-similarity across symbol clusters

Page 30: Dmk audioviz

US Code: Moderately Consistent

Legalese is a massively structured dialect. Symbols appear in very distinct patterns that are more reminiscent of machine code than text.

Page 31: Dmk audioviz

HTML: Consistent

HTML repeats smaller symbols (tags) and larger symbol clusters (via template engines) regularly. This shows up visually as a tightly repeating pattern.

Page 32: Dmk audioviz

Java Class Files (Compared): Mildly Consistent

Binary code (be it bytecode or x86) tends to be very structured. Still, we are dependent on both the content and the compiler to generate distinct patterns.

Page 33: Dmk audioviz

x86: Consistent (In Sections)

x86 tends not to be handwritten; as such complex instructions are emitted in a highly structured form.

Page 34: Dmk audioviz

Exception?

• 64 kilobyte graphical demonstration

• Run through a packer

• Compression removes patterns

Page 35: Dmk audioviz

NES Games

6502 Assembly Tends To Show Consistent Patterns, But…

Page 36: Dmk audioviz

Mario Games Look Rather Different.

1) Output is highly dependent on the compiler

2) Output is highly dependent upon the actual content

File formats are merely shells for actual content. You are analyzing the content; the format is just syntactic sugar.

Page 37: Dmk audioviz

Format Identification

• 1) Do different files appear different, and does the appearance reflect the existence of internal structure?
  – Answer: Yes. They do.
• 2) Do different instances of the same file format appear similar?
  – Answer: Somewhat. Similar content looks like itself, but you’re measuring the fundamental entropy of the underlying content, not the format of the content itself.

• 3) Does one format embedded in another make itself apparent?

Page 38: Dmk audioviz

File Formats Contain Multiple Subformats: Another Look At Kernel32.DLL

These are all different parts of Kernel32.

Page 39: Dmk audioviz

Quickly Browsing Large Files: Tilt-Shift View

• Instead of measuring absolute Y against absolute X, make X relative
  – Advance through the file going down, look back a number of bytes going right
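A sketch of the relative-X idea under simple assumptions: the file is cut into fixed-size windows, each row is one window, and column k compares that window with the one k windows earlier. The function names and the default lookback are hypothetical, not the tool's actual parameters.

```python
def chunk_similarity(a: bytes, b: bytes) -> float:
    """Fraction of positions at which two equal-length chunks hold the same byte."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def tilt_shift(data: bytes, window: int = 32, lookback: int = 8) -> list:
    """Row y compares chunk y with chunks y, y-1, ..., y-(lookback-1)."""
    usable = len(data) - len(data) % window
    chunks = [data[i:i + window] for i in range(0, usable, window)]
    rows = []
    for y, chunk in enumerate(chunks):
        row = []
        for back in range(lookback):
            x = y - back
            # Positions before the start of the file have nothing to compare against.
            row.append(chunk_similarity(chunk, chunks[x]) if x >= 0 else 0.0)
        rows.append(row)
    return rows
```

The image is then only `lookback` pixels wide, so its size grows linearly with the file instead of quadratically, which is what makes browsing a large file practical.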

Page 40: Dmk audioviz

Complain All You Want. Hex Still Sucks.

Page 41: Dmk audioviz

Format Identification

• 1) Do different files appear different, and does the appearance reflect the existence of internal structure?
  – Answer: Yes. They do.
• 2) Do different instances of the same file format appear similar?
  – Answer: Somewhat. Similar content looks like itself, but you’re measuring the fundamental entropy of the underlying content, not the format of the content itself.
• 3) Does one format embedded in another make itself apparent?
  – Answer: Yes. Multiple, distinct sections are clearly visible in a way that hex cannot show.

Page 42: Dmk audioviz

Fuzzer Guidance

• 1) Can we locate the actual byte offsets where one section ends and another begins?
  – Why would we want to?
    • Fuzzers break parsers.
    • Many subformats to a format, many subparsers to a parser
    • To a rough level of approximation, fuzzing a single subformat lets you stress a single subparser
    • So once we split a file up, we can selectively attack one subparser at a time.
• 2) Can we visualize and compare fuzzer operations via Dotplots?

Page 43: Dmk audioviz

Simple Math

We select an interesting blob from kernel32.dll. The blob is at pixel offset 507x507, and is a square around 570 pixels wide.

Window size on viz was 32.

507*32 = The interesting section starts 16224 bytes into the file.

570*32 = The interesting section is 18240 bytes long.
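The arithmetic above generalizes to a two-line helper. This is a small sketch with hypothetical function names; it assumes, as in the example, that each pixel covers exactly one window of bytes.

```python
def pixel_to_bytes(pixel_offset: int, pixel_width: int, window: int):
    """Convert a square's pixel offset and width into (byte start, byte length)."""
    return pixel_offset * window, pixel_width * window


def slice_region(data: bytes, pixel_offset: int, pixel_width: int, window: int) -> bytes:
    """Cut the bytes behind a square region of the plot out of the file contents."""
    start, length = pixel_to_bytes(pixel_offset, pixel_width, window)
    return data[start:start + length]


if __name__ == "__main__":
    start, length = pixel_to_bytes(507, 570, 32)
    print(start, length)  # 16224 18240
```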

Page 44: Dmk audioviz

What’s The Actual Data?

dd if=kernel32.dll bs=1 skip=16100 | hexdump - | more

Page 45: Dmk audioviz

Using Hardcorr as a “first knife” to locate interesting-to-fuzz regions

Page 46: Dmk audioviz

Fuzzer Guidance

• 1) Can we locate the actual byte offsets where one section ends and another begins?
  – Answer: Yes. We can quickly route from the image to the byte offset, through basic arithmetic.

• 2) Can we visualize and compare fuzzer operations via Dotplots?

Page 47: Dmk audioviz

Differentials

• Major use of dotplots in bioinformatics is to compare one genome against another
  – Autocorrelation: Compare A to A
  – Cross-Correlation: Compare A to B
• Most files are sufficiently dissimilar that not much interesting structure shows up
  – Notable exception: Different versions of the same binary
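Cross-correlation only changes what goes on each axis: rows come from file A, columns from file B. A minimal sketch with hypothetical names and a byte-equality similarity measure assumed for illustration; `best_matches` is an added helper that shows how an insertion shifts the diagonal.

```python
def chunks_of(data: bytes, window: int) -> list:
    """Split data into non-overlapping window-sized chunks, dropping any partial tail."""
    usable = len(data) - len(data) % window
    return [data[i:i + window] for i in range(0, usable, window)]


def cross_dotplot(a: bytes, b: bytes, window: int = 32) -> list:
    """Rows are chunks of a, columns are chunks of b, cells are byte-wise similarity."""
    def similarity(x: bytes, y: bytes) -> float:
        return sum(p == q for p, q in zip(x, y)) / window

    return [[similarity(row, col) for col in chunks_of(b, window)]
            for row in chunks_of(a, window)]


def best_matches(grid: list) -> list:
    """For each chunk of a, the index of the most similar chunk of b."""
    return [max(range(len(row)), key=row.__getitem__) for row in grid]
```

For two versions of the same binary, most rows match a column on or near the diagonal; inserted or removed code shows up as a jump in the matching column.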

Page 48: Dmk audioviz

Visual Bindiff!

Page 49: Dmk audioviz

MSVCR70.DLL v. MSVCR71.DLL

Page 50: Dmk audioviz

Fuzzers: Very Broken Patchers

Mangle.C – Single Bit Differences

CFG9000 – Large Scale Reordering

Page 51: Dmk audioviz

Fuzzer Guidance

• 1) Can we locate the actual byte offsets where one section ends and another begins?
  – Answer: Yes. We can quickly route from the image to the byte offset, through basic arithmetic.
• 2) Can we visualize and compare fuzzer operations via Dotplots?
  – Answer: Yes – visual diffing effectively shows differences between files, including differences introduced by various flavors of fuzzers.

Page 52: Dmk audioviz

Other Structural Analysis Mechanisms

• Many physicists would agree that, had it not been for congestion control, the evaluation of web browsers might never have occurred. In fact, few hackers worldwide would disagree with the essential unification of voice-over-IP and public private key pair. In order to solve this riddle, we confirm that SMPs can be made stochastic, cacheable, and interposable.
  – Rooter: A Methodology for the Typical Unification of Access Points and Redundancy

Page 53: Dmk audioviz

That was BS.

• That also got accepted into a con.
  – Automatically generated from a context free grammar
  – I’ve been working too hard all these years
  – “Be quiet, or I will replace you with a very small shell script”
• This talk is a bit of a remix
  – Patterns and symbols are interesting to me as of late
    • Automatic determination of both is difficult, interesting, and unsolved
  – Integration into human symbolic systems promises particularly interesting results
  – So we’re going to explore a bit.

Page 54: Dmk audioviz

Language Is Cool

• Language: A protocol for the transmission of concepts and intentions between humans
  – Documentation is not available
  – Documentation does not really work
  – Learned through exposure and use
• Significant amount of internal structure, redundancy, and consistency
• Who makes language?
  – Kids.
    • Adults coin words here and there, but when they’re forced to invent a common language to get things done, it’s called a Pidgin, and it’s terrible
    • The kids hear it, and invent a Creole – a merged language of significantly greater accuracy and depth
• Children make languages
• Adults make “working” languages
• Programmers make barely working languages

Page 55: Dmk audioviz

Programmers Talk Funny

• Fundamentally two languages that programmers must use
  – Code to Human: “User Interface Design”
  – Code to Code: “File and Network Protocol”
• UI is a protocol.
  – This is obvious in retrospect.
• There are two things this talk hopes to do
  – Correct some of the Code->Human protocols that are out there
  – Use human strategies to analyze Code to Code communications
    • Learning a protocol is learning a language. Humans do not learn languages quickly, and thus we’re resource bound on fuzzer development
    • It’s 2007 – most parsers remain unfuzzed (and thus just waiting to be exploited)

Page 56: Dmk audioviz

Weaponizing Noam?

• “An early inference procedure was described by Chomsky and Miller (1957a), as reported in Solomonoff (1959). Chomsky proposed a method for detecting loops in finite state languages. The approach requires a set of valid sentences, and an oracle that determines whether a sentence is in the language. The algorithm proceeds by deleting part of a valid sentence and asking the oracle whether the sentence is still valid. If it is, the deleted part is reinserted into the sequence and repeated, so that it appears twice. If the sentence is still in the language, a cycle has been detected.”
  – Inferring Sequential Structure, Craig Nevill-Manning, 1996
  – This couldn’t POSSIBLY be useful for building a structure for a dumb fuzzer to operate against.
    • Instead of seeing if the parser crashes, just see if it considers the input valid
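The delete-then-double procedure in the quote can be sketched directly. This is an illustrative version, not code from the talk: `find_loops` tries every substring, and `toy_oracle` is a stand-in regular expression playing the role of "does the parser consider this input valid".

```python
import re


def find_loops(sentence: str, oracle) -> list:
    """Return (start, end, part) for every substring that behaves like a loop.

    A substring counts as a loop if the sentence stays valid both with the
    substring deleted and with the substring repeated twice.
    """
    loops = []
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            part = sentence[start:end]
            without = sentence[:start] + sentence[end:]
            doubled = sentence[:start] + part + part + sentence[end:]
            if oracle(without) and oracle(doubled):
                loops.append((start, end, part))
    return loops


def toy_oracle(candidate: str) -> bool:
    """Hypothetical language for the demo: an 'a', any number of 'bc', then a 'd'."""
    return re.fullmatch(r"a(bc)*d", candidate) is not None
```

Against a real target the oracle would feed the candidate to the parser and report whether it was accepted; the brute-force substring scan here is quadratic in the sentence length and each probe costs one parser run.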

Page 57: Dmk audioviz

Topics Of Discussion

• Further Explorations in Cryptomnemonics
  – Using Names and Syllables for password representation
• Sequitur-XML: Merging automated structure discovery with the standard architecture for structure representation
  – …which turned out to be quite nice for controlled structure destruction
• Exploring Dotplots
  – Building a GUI
  – Exploring other domains

Page 58: Dmk audioviz

Intro To Symbol Sets

• Machine Symbols
  – Data (AA, BB, CC)
  – Code (a(), b(), c())
  – Formats (All, Bad, Code)
• Human Symbols
  – Letters (A, B, C)
  – Glyphs ()
  – Syllables (Ah, Bee, See)
  – Words (Amazing, Bear, Clear)
  – Native Names (Alice, Bob, Charlie)
  – Things (Axe, Bone, Chimpanzee)
  – Actions (Ask, Buy, Compute)
  – Colors (Aquamarine, Blue, Chartreuse)
• Machines can use formats, but their native format is raw bits
• Humans have no concept of “raw bits” – everything must be contextual
  – Long history in mnemonics of mapping arbitrary data to a context

Page 59: Dmk audioviz

Different Domains Have Different Strengths – See Visual Processing

Page 60: Dmk audioviz

Cryptomnemonics

• Definition: The study of human memory, as it applies to cryptographic systems

• Developing in response to this:
  – $ ssh dan@blah
    The authenticity of host 'blah (1.2.3.4)' can't be established.
    RSA key fingerprint is 09:a9:b1:99:84:17:7d:ba:c6:55:46:5a:17:f8:83:01.
    Are you sure you want to continue connecting (yes/no)?
• The machine is acting like it’s integrating with another machine. It’s not, and that matters.

• Humans can handle hexadecimal characters – but not that many.

Page 61: Dmk audioviz

Hex Confusion

• After somewhere between 2 and 5 characters, most of you will fail to see a difference
  – Positional Bias: Expect to see certain things at the beginning or end
  – Value Confusion: Letter vs. Number is remembered before the actual value of letter or number
    • Glyph confusion
  – “Despair” Effect
    • Nobody could possibly detect a change, so it’s not rational to even try

Page 62: Dmk audioviz

Classes of Memory

• There are three classes of memory, at least to the degree that is useful in cryptography
  – Rejection: “I’ve never seen that before”
  – Recognition: “It’s that one, not that other one”
  – Recollection: “Let me describe it to you.”
• SSH just requires rejection
  – Hex is not rejectable
  – Can we try another domain?

Page 63: Dmk audioviz

Exploring The Nymic Domain

• $ ssh dan@blah
  Key Data: julio and epifania dezzutti luther and rolande doornbos manual and twyla imbesi dirk and cuc kolopajlo omar and jeana hymel
  The authenticity of host 'blah (1.2.3.4)' can't be established.
  Are you sure you want to continue connecting (yes/no)?
  – Alternate mapping for 09:a9:b1:99:84:17:7d:ba:c6:55:46:5a:17:f8:83:01.
  – Proposed last year as a potential solution
• There is nothing more contextual than a story, and there is nothing more stable in a story than the names of its participants
  – Stories retold are stories remembered – we need to be exposed to the above group time and time again to be able to reject any deviation from it
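How a fingerprint could be turned into couples can be sketched as bit slicing: each couple consumes a few bits for a male first name, a few for a female first name, and a few for a last name. The tiny four-entry lists below reuse names from the example but are purely illustrative; the real name lists, their ordering, and the bit layout are not given in the slides and are assumptions here.

```python
MALE = ["julio", "luther", "dirk", "omar"]             # hypothetical 2-bit list
FEMALE = ["epifania", "rolande", "twyla", "jeana"]     # hypothetical 2-bit list
LAST = ["dezzutti", "doornbos", "imbesi", "hymel"]     # hypothetical 2-bit list


def bits_for(names) -> int:
    """Number of bits one pick from this list encodes; length must be a power of two."""
    bits = len(names).bit_length() - 1
    if bits < 0 or len(names) != 1 << bits:
        raise ValueError("list length must be a power of two")
    return bits


def fingerprint_to_couples(fingerprint: bytes, male=MALE, female=FEMALE, last=LAST) -> list:
    """Render the fingerprint as 'male and female last' couples, most significant bits first."""
    m_bits, f_bits, l_bits = bits_for(male), bits_for(female), bits_for(last)
    per_couple = m_bits + f_bits + l_bits
    total_bits = len(fingerprint) * 8
    couples = -(-total_bits // per_couple)  # ceiling division
    # Pad with zero bits on the right so the value splits evenly into couples.
    value = int.from_bytes(fingerprint, "big") << (couples * per_couple - total_bits)
    result = []
    for index in range(couples):
        group = (value >> ((couples - 1 - index) * per_couple)) & ((1 << per_couple) - 1)
        m = group >> (f_bits + l_bits)
        f = (group >> l_bits) & ((1 << f_bits) - 1)
        l = group & ((1 << l_bits) - 1)
        result.append(f"{male[m]} and {female[f]} {last[l]}")
    return result
```

With lists of 256 entries each, `bits_for` gives 8 bits per pick and therefore 24 bits per couple. Any single-bit change in the fingerprint changes at least one name.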

Page 64: Dmk audioviz

How To Derive Names?

• Original Model
  – Take US Census Data
  – Remove any names that may be easily confused with one another:
    • Easy: Bob v. Bobby
    • Hard: Bob v. Robert
• Celebrity Naming
  – “Marge Godwin”
• Archaic Naming
  – Use constructs from various ancient languages
• Mechanistic Constructs
  – Bubble Babble: 64 bits = xegoz-tosys-vusik-masar
  – Koremutake: 64 bits = darujifahe stygrifrejy

Page 65: Dmk audioviz

How Many Names?

• Unclear what the crossover point is between the difficulty of more names and the benefit of more entropy per name
  – Present system is 512 male names, 512 female names, 1024 last names from US Census
  – 256/256/256 would provide 24 bits per couple instead of 40, and the names would be more recognizable. Better? How much better?
• The more names, the more of a problem position becomes
  – We’re sensitive to names, but without a story context, there are no roles locking people to being the first or the second or the third. So the more names, the more bits we lose to reader confusion.

• How many bits are necessary? Depends on what for.

Page 66: Dmk audioviz

Flipping The Bits

• SSH Key Representation is not the only thing we can do with this technique
  – In fact, it’s not even the most pressing problem
• Passwords are in crisis right now
  – PKI failed, deal with it
    • There’s an entire alternate history where XSS enjoys the benefits of your legal credentials being available and shared
  – People are being asked to generate, frequently, high entropy non repeated passwords
    • They’re repeating them
    • They’ve exhausted personal entropy, and have moved to geometric progressions to evade lameness checks
      – &(*uoiJKL798
      – Fixed prefix

Page 67: Dmk audioviz

A Fundamental Shift

• Generate passwords for your users.

– “But they’re hideous, nobody will remember what we automatically generate”

• You’re theoretically forcing them to generate those hideous passwords, off the top of their head

• Use alternate symbolic domains to coat the password entropy you require in a form users can accept

– Why yes, this is exactly like a tunnel. We’re tunneling entropy over a baby name book

Page 68: Dmk audioviz

Change Your Ways

• Modify your validation logic to accept long passwords without weird character sets
  – Punctuation and case sensitivity are “weak symbols”
• It is easier to chain together common symbols in a common way, than it is to link together arbitrary bytes out of context
  – This is a fundamental difference between human symbol manipulation and the operations of computers

Page 69: Dmk audioviz

How Many Bits Do We Really Need?

• Hash Validation: 80-100 bits
  – We don’t have a birthday paradox problem with hashes, since one of them is fixed.
  – 2^80 to 2^100 work efforts are outside the range of feasibility at this time
• Password Entry: 24 bits for low security, 36-48 bits for high security
  – Need enough to make brute force enumeration across all users infeasible
    • For each username, try one possible password
  – 48 bit is what we’re at with punctuation/case/number/8 character.
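The bit counts above follow from length times log2 of the alphabet size. A quick sketch for checking such figures; the 64-symbol alphabet used to reproduce the 48-bit number is an assumption (8 characters at 6 bits each), since the slides do not state the character set, and real users pick far less uniformly than this model assumes.

```python
import math


def entropy_bits(alphabet_size: int, length: int) -> float:
    """Bits of entropy in `length` symbols drawn uniformly from the alphabet."""
    return length * math.log2(alphabet_size)


def symbols_needed(target_bits: float, alphabet_size: int) -> int:
    """Fewest uniformly chosen symbols that reach the target number of bits."""
    return math.ceil(target_bits / math.log2(alphabet_size))


if __name__ == "__main__":
    print(entropy_bits(64, 8))        # 8 characters from an assumed 64-symbol set
    print(symbols_needed(48, 1024))   # names needed from a 1024-entry list for 48 bits
```

The same helper shows the trade the talk is proposing: a handful of names from a modest list carries as many bits as an 8-character password full of punctuation.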

Page 70: Dmk audioviz

Limits to alternate symbol domains

• We lose the ability to measure “nextness”
  – 0x10 is one less than 0x11
  – Bob is…how much less than Charlie?
• Data may become variable length
  – Bob is three characters, Charlie is seven
  – Harder to see patterns
• Has trouble scaling to any large number of bits.
  – We can’t analyze even mildly large systems using this translation layer

Page 71: Dmk audioviz

What We’ve Been Using (Warning: Sucks.)

Page 72: Dmk audioviz

N’est-ce pas Non Sequitur

• Sequitur: Linear Time Pattern Finder
  – Creates hierarchical Context Free Grammars from arbitrary input
• Compression Algorithm in which you can “look under the covers” to see what’s going on
• Created by Craig Nevill-Manning as his PhD thesis a decade ago
  – He’s now Chief Research Scientist at Google
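To make "grammar from arbitrary input" concrete, here is a deliberately simpler stand-in. It is not Sequitur: Sequitur builds its grammar incrementally in linear time, while this sketch repeatedly replaces the most frequent adjacent pair with a new rule (an offline, Re-Pair-style approach) and makes no attempt at efficiency. Rule names like R1 are invented for the example.

```python
from collections import Counter


def replace_pair(seq: list, pair: tuple, name: str):
    """Replace non-overlapping occurrences of pair, left to right; return (new_seq, hits)."""
    out, i, hits = [], 0, 0
    while i < len(seq):
        if tuple(seq[i:i + 2]) == pair:
            out.append(name)
            i += 2
            hits += 1
        else:
            out.append(seq[i])
            i += 1
    return out, hits


def infer_grammar(text: str):
    """Return (top_level_sequence, rules), where rules maps a rule name to a symbol pair."""
    seq, rules = list(text), {}
    while True:
        counts = Counter(zip(seq, seq[1:]))
        for pair, count in counts.most_common():
            if count < 2:
                continue
            name = f"R{len(rules) + 1}"
            candidate, hits = replace_pair(seq, pair, name)
            if hits >= 2:  # only keep rules that are actually used twice
                rules[name] = pair
                seq = candidate
                break
        else:
            return seq, rules  # no pair repeats any more


def expand(symbol: str, rules: dict) -> str:
    """Recursively expand a symbol back into the text it stands for."""
    if symbol not in rules:
        return symbol
    return "".join(expand(part, rules) for part in rules[symbol])
```

The resulting rules form a hierarchy (rules can refer to other rules), which is the "look under the covers" property: the top-level sequence is the compressed file, and every rule is a repeated pattern you can inspect.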

Page 73: Dmk audioviz

What’s New: Sequitur-XML

• echo 'aabbabc' | ./sequitur_simple.exe
• Why translate: Gives us much easier to manipulate output
  – C is very good for generating the tree
  – Other languages are very good for analyzing / modifying the tree
• XML is a (shockingly) good machine format for representing structure

Page 74: Dmk audioviz

Early Work: Syntax Highlighting Using Compression Depth

Page 75: Dmk audioviz

What’s Actually Going On?

• (0) -> … (73),b4,(73),ca,(73),e6,(73),02,(74),18,(74),2c,(74),4a,(74),5c,(74),6e,(74),80,(74),98,(74),b0,(74),c8,(74),e8,(74),fc,(74),10,(75),20,(75),30,(75),40,(75),50,(75),64,(75),82,(75),90,(75),9e,(75)…(84),d6,(84),ee,(84),0c,(85),28,(85),3c,(85),4e,(85),66,(85),7e,(85),8c,(85),9e,(85),ac,(85),be,(85),ca,(85),ea,(85),08,(86),26,(86),44,(86),56,(86),6a,(86),7c,(86),8a,(86),a6,(86),b6,(86),cc,(86),de,(86),02,(87)

• Repeated sequence, single byte literal. Repeated sequence, single byte literal. Rinse, lather, repeat.

Page 76: Dmk audioviz

Where Things Get Most Interesting… Live Symbol Browsing!

Page 77: Dmk audioviz

Browsing HOWTO

• For each entry in the root node,
  – If it’s a literal, color it white
  – If it’s part of a reference, color it red
  – If it’s clicked, color it and every other instance of that reference blue
• A little buggy
• Present implementation DOES NOT SCALE
• But effective!

Page 78: Dmk audioviz

Symbol Links: Where To Go From Here

• Turns code on left into symbolic set on right; it’s easy then to link the symbols together as per the graph.
• This works for non-textual data
• Sequitur imputes meaningful symbols from arbitrary input data

Page 79: Dmk audioviz

Context Free Grammar Fuzzer:THE CFG9000

• Reduce input data to a stream of symbols
• Fuzz data at the symbol level, rather than at pure bytes
  – Shuffle
  – Drop
  – Repeat
  – Uniform Corrupt
    • Consistently corrupt all instances of a given symbol
    • <HEAD> -> <FOOBAR>
• Partially ported to the new XML framework
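The four symbol-level operations listed above can be sketched as follows. This is not the CFG9000 itself: the regex tokenizer is a stand-in for the grammar-derived symbol stream, and all function names are invented for the example.

```python
import random
import re


def to_symbols(text: str) -> list:
    """Stand-in symbolizer: treat each tag and each run of text as one symbol."""
    return re.findall(r"<[^>]*>|[^<]+", text)


def shuffle(symbols: list, rng: random.Random) -> list:
    """Reorder all symbols at random."""
    out = symbols[:]
    rng.shuffle(out)
    return out


def drop(symbols: list, rng: random.Random) -> list:
    """Remove one randomly chosen symbol."""
    i = rng.randrange(len(symbols))
    return symbols[:i] + symbols[i + 1:]


def repeat(symbols: list, rng: random.Random, times: int = 3) -> list:
    """Repeat one randomly chosen symbol `times` times in place."""
    i = rng.randrange(len(symbols))
    return symbols[:i] + [symbols[i]] * times + symbols[i + 1:]


def uniform_corrupt(symbols: list, target: str, replacement: str) -> list:
    """Consistently replace every instance of one symbol with another."""
    return [replacement if s == target else s for s in symbols]
```

For example, `"".join(uniform_corrupt(to_symbols(page), "<HEAD>", "<FOOBAR>"))` rewrites every `<HEAD>` in `page` the same way, so the output stays internally consistent even though it is no longer valid.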

Page 80: Dmk audioviz

Sample CFG9000 Output

• calculate_rule_usage(p->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rule() }

• calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(p->rule());

Page 81: Dmk audioviz

Slashdot Fuzzed

Page 82: Dmk audioviz

Slashdot Fuzzed (2)

Page 83: Dmk audioviz

Why We Moved To XML In The First Place

• XML is a (potentially) validating format
  – Has the concept of schemas
  – NOT THAT THEY’RE ALWAYS OR EVEN OFTEN CHECKED
    • Schema validation is expensive
• We should be able to use XML Schemas to guide fuzzers
  – WS-Bang
    • Excellent tool for bashing Web Services frameworks
    • Given a WSDL file (Web Services Description Language), fuzz it
  – Untidy: Mostly just attacks XML parsers, doesn’t hit the structure

Page 84: Dmk audioviz

Automatically Generating Schemas?

• We can autogenerate Schemas from XML (to some degree)
  – Relaxer
  – Trang
  – Tends to capture structure better than content
    • Doesn’t appear to automatically determine what values are valid for each field
    • Does provide framework for automatically extracting all instances of what can go where

Page 85: Dmk audioviz

Wireshark Demo: From…

<field show="QUERY_FS_INFO Data" size="20" pos="126" value="ff002700ff000000080000004e00540046005300">
  <field show="FS Attributes: 0x002700ff" size="4" pos="126" value="ff002700">
    <field name="smb.fs_attr.css" showname=".... .... .... .... .... .... .... ...1 = Case Sensitive Search: This FS supports CASE SENSITIVE SEARCHes" size="4" pos="126" show="1" value="1" unmaskedvalue="ff002700" />
    <field name="smb.fs_attr.cpn" showname=".... .... .... .... .... .... .... ..1. = Case Preserving: This FS supports CASE PRESERVED NAMES" size="4" pos="126" show="1" value="1" unmaskedvalue="ff002700" />
    <field name="smb.fs_attr.uod" showname=".... .... .... .... .... .... .... .1.. = Unicode On Disk: This FS supports UNICODE NAMES" size="4" pos="126" show="1" value="1" unmaskedvalue="ff002700" />

Page 86: Dmk audioviz

Wireshark Demo: To:

<xsd:complexType name="field">
  <xsd:sequence>
    <xsd:element maxOccurs="unbounded" minOccurs="1" name="field" type="field" />
  </xsd:sequence>
  <xsd:attribute name="name" type="xsd:token" />
  <xsd:attribute name="pos" type="xsd:int" />
  <xsd:attribute name="show" type="xsd:normalizedString" />
  <xsd:attribute name="showname" type="xsd:normalizedString" />
  <xsd:attribute name="size" type="xsd:int" />
  <xsd:attribute name="value" type="xsd:token" />
  <xsd:attribute name="hide" type="xsd:token" />
  <xsd:attribute name="unmaskedvalue" type="xsd:token" />
</xsd:complexType>

Page 87: Dmk audioviz

Could we automatically extract structure from Sequitur-XML?

• “This sequence of bytes can be reconstructed with these other sequences of bytes”

– No tree relationship – anything can link in anything

– Need to have the content awareness Relaxer lacks to get anything useful

– Where might we get this content awareness?

Page 88: Dmk audioviz

What Might We Borrow From Linguistics?

• Can we use linguistic approaches?
  – Common Elements
    • Humans: Subjects, Verbs, etc.
    • Machines: Delimiters, Length Fields, ASCII/Unicode, x86, Padding to Four Byte Boundaries
  – Symbol Interrelationships
    • Humans: We take word boundaries for granted
      – Until we’re listening to a foreign language, and wonder why there aren’t spaces between words
    • Machines: File formats rarely make it easy to see where one symbol ends and another begins
    • Does one symbol always appear before another? Does one symbol always find itself surrounded by two others?

Page 89: Dmk audioviz

How To Think Of Sequitur

• Any time you’re manipulating data as bytes, think of manipulating it as symbols
  – N-gram histograms on bytes -> N-gram histograms on symbols
  – Bayesian probabilities on characters -> Bayesian probabilities on symbols
• Sequitur is not necessarily the best way to determine a grammar
  – Suffix Trees may be more accurate
  – Kieffer-Yang (redundant symbol extraction) a very good post-processing step to add
  – Ray removes In-Memory Grammar Requirement
  – Not all other solutions are linear time, though
    • Kind of cool to have a grammar that covers a 750GB hard drive undergoing forensics
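The "bytes -> symbols" shift needs no new machinery: an n-gram histogram works on any sequence. A minimal sketch with a hypothetical function name; the symbol list in the usage note stands in for whatever a grammar inference step would produce.

```python
from collections import Counter


def ngram_histogram(sequence, n: int = 2) -> Counter:
    """Count every run of n consecutive items; works on bytes or on lists of symbols."""
    return Counter(tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1))
```

Calling `ngram_histogram(b"abab")` counts byte pairs, while `ngram_histogram(["<HEAD>", "<TITLE>", "<HEAD>", "<TITLE>"])` counts pairs of whole symbols, so the statistics describe structure rather than raw byte values.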

Page 90: Dmk audioviz

Fuzzy Wuzzy Wuz A Symbol

• Symbol analysis systems (language translators, etc) have issues w/ TMTOWTDI (There’s More Than One Way To Do It)
  – Very similar messages can be encapsulated in very different ways
  – Very similar messages can be encapsulated in very similar, but not identical ways
• Sequitur only handles exact matches – fuzzy grammar imputation doesn’t appear to exist yet
  – We must develop this fuzziness to create byte-sourced XML schemas
    • It is a pretty wild concept, so

Page 91: Dmk audioviz

Conclusions…

• Lots of interesting work left to do
  – Unification of local presence of symbols, and global view of file format
    • Possible to do dotplots themselves in the symbolic domain
  – Use of dotplots to segment formats, which thus provides the tree we want for an XML schema
    • <format>
      – <blob1 />
      – <blob2 />
      – <blob3 />
    • </format>
  – More colorful pretty pictures!

Page 92: Dmk audioviz

The Ancient Tongue: TCP/IP

• Can’t all be about pretty pictures
• A new problem has popped up: Network oligopolies are threatening to install firewalls that limit or eliminate bandwidth on a per-company basis
  – Their own media services might be fast, others will be slow
  – Their own VPN services might be fast, others will be slow
• Question: Is it possible to detect and locate devices violating network neutrality?

Page 93: Dmk audioviz

What’s The Closest Tool We Have?

• Firewalk
  – Mike Schiffman’s Firewall Analysis Tool
  – Packets elicit an ICMP Time Exceeded error if they reach a router with TTL=0
    • TTL is decremented by one for each hop, so if you start low, you can trace the route to a host
  – A firewalled packet won’t live long enough to reach TTL=0
  – So you can locate the firewall, and divine things about its ruleset, based on when your packets stop getting ICMP Time Exceeded

Page 94: Dmk audioviz

Limitations of Firewalking

• But Firewalk tells us what, not who is blocked…and it tells us nothing about who is allowed to go fast, and who is made to go slow
  – Suddenly, we devolve to a much older question: Is it possible to find out that a target firewall is, or is not, blocking against or accepting traffic from an arbitrary IP address?

Page 95: Dmk audioviz

TCP Does Speed Measurement

• TCP speed analysis done blindly
  – Endpoints do not negotiate with one another
  – Everyone sends their packets, routers route what they will. Endpoints need to adjust to what the routers are willing to pass.

• Routers communicate with endpoints by dropping their packets

• Can we combine this router backchannel w/ Firewalk?

Page 96: Dmk audioviz

In From The Side

• What causes packets to drop?
  – Too many packets
• What are we going to do?
  – Send too many packets
• Two channels are set up
  – A primary channel, which drops packets at some known rate
  – A secondary channel, whose purpose it is to interfere (or not) with the primary channel
• When the secondary interferes with the primary, we get feedback via the primary channel
  – The traffic composing the secondary channel can come from anywhere, be composed of anything, and can be TTL’d just like in a normal firewalk.

Page 97: Dmk audioviz

The TTL Channel

• Normally, you don’t know which router along a path is dropping your packets

• If you are the source of the drop-inducing packets, you can control how far your noise goes out – thus, you can discover which router is hitting its limit / censoring your net connection
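The idea of walking noise outward hop by hop can be illustrated with a toy simulation; no packets are sent. The model is an assumption made for the sketch: each hop has a fixed capacity, an overloaded hop drops both flows proportionally, and `noise_reach` stands for how many hops the TTL-limited noise loads (the exact off-by-one between a packet's TTL and the congested queue is glossed over).

```python
def primary_throughput(capacities, primary, noise, noise_reach):
    """Rate of the primary flow that survives the path while noise loads the first hops."""
    p, n = float(primary), float(noise)
    for hop, capacity in enumerate(capacities, start=1):
        if hop > noise_reach:
            n = 0.0  # the noise packets have expired before this hop
        offered = p + n
        if offered > capacity:
            scale = capacity / offered  # overloaded hop drops both flows proportionally
            p *= scale
            n *= scale
    return p


def locate_bottleneck(capacities, primary, noise):
    """Smallest noise reach at which the primary flow starts losing packets, or None."""
    baseline = primary_throughput(capacities, primary, noise, 0)
    for reach in range(1, len(capacities) + 1):
        if primary_throughput(capacities, primary, noise, reach) < baseline:
            return reach
    return None
```

With capacities `[100, 100, 40, 100]`, a primary flow of 30 and noise of 50, the primary only starts to suffer once the noise reaches hop 3, which points at the third hop as the one hitting its limit.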

Page 98: Dmk audioviz

Scorchmarking

• Why Scorchmarking?
  – Routers are burning packets…those that get through might have a scorch mark or two
• Basic Model
  – Client downloads a file from a site, at some given speed negotiated via TCP.
  – At the same time, traffic is injected from different IP addresses. This should cause drops.
    • If it doesn’t, the network is either penalizing the primary channel (easy to drop against) or rewarding the secondary channel (resilient to drops)

Page 99: Dmk audioviz

Advanced Scorchmarking [0]

• Having to depend on a client is lame
  – Wouldn’t it be nice if we could scan the Internet for these servers?
• What fundamental service is a receiving client providing?
  – It is acknowledging our traffic – letting us know how much it received, and how many milliseconds it took to receive it
• Aren’t there other ways we could extract the same data from hosts?

Page 100: Dmk audioviz

Advanced Scorchmarking [1]

• What else will acknowledge receiving traffic from us?
  – TCP Servers
    • Sting, from Stefan Savage, used this to great effect
  – DNS Servers
  – Routers.
    • Supposedly, routers won’t send more than a certain number of ICMP Time Exceeded packets per second
    • In reality, they seem to ICMP Time Exceeded ACK however much you throw at them
    • Even if they didn’t, you could use the difference in ICMP Time Exceeded rates between Primary and Secondary channel, to determine whether interference was showing up.
    • Everyone’s got a NAT – so you can query everyone for whether certain sorts of traffic are being blocked to them

Page 101: Dmk audioviz

Advanced Scorchmarking [2]

• So, yes.
  – You can scan for violations of Network Neutrality
  – You can find networks that are blocking or passing particular IP ranges
• It’s not exactly efficient though
• Neutrality violations are easier to find than the standard FW case
  – Firewalls are normally between the WAN and the LAN (Slow Net -> FW -> Fast Net)
  – Neutrality violators are mid-WAN (Slow Net -> FW -> Slow Net -> Fast Net)
  – Easier to overload the slow net after the firewall
    • Boxes with max TTL rates override this

Page 102: Dmk audioviz

Speed Limits

• Fundamental Problem: Have to max out bandwidth on the link to trigger the backchannel
  – No packets dropping, no data
  – Means you have to DoS a link – not scalable/legal
• Potential Solution: Find capped acknowledgers
  – The mythical ICMP Time Exceeded rate limit works well
    • Primary and Secondary channel both eliciting ITE’s
    • When secondary channel gets a packet through, it takes up a slot on the primary channel’s
    • ITE is perfect, since you can TTL limit any packet
    • Depends on the firewall passing the primary’s ITE’s
    • Maybe Linux / NATs actually implement rate limits?
  – Another option: What if we have code on the client?

Page 103: Dmk audioviz

Windows Media Player: More Than Just DRM. Really!

• Bulk Transfer: RTP
  – Runs over Unicast UDP
  – Yes, the same Unicast UDP that penetrates NAT so well!
• Flow Control / Quality Monitoring: RTCP
• No technical reason RTCP needs to go back to the same address that RTP stream is coming from
  – So: We pretend to provide media streams from all sorts of sites, and use WMP to collect traffic stats for us

• It might work…