Weaponizing Noam
Chomsky:
Symbols and Grammars Are Fun
Dan Kaminsky
Director Of Penetration Testing
Introduction
• Many physicists would agree that, had it not
been for congestion control, the evaluation of
web browsers might never have occurred. In
fact, few hackers worldwide would disagree with
the essential unification of voice-over-IP and
public private key pair. In order to solve this
riddle, we confirm that SMPs can be made
stochastic, cacheable, and interposable.
– Rooter: A Methodology for the Typical Unification of
Access Points and Redundancy
That was BS.
• That also got accepted into a con.
– Automatically generated from a context free grammar
– I’ve been working too hard all these years 
– “Be quiet, or I will replace you with a very small shell
script”
• This talk is a bit of a remix
– Patterns and symbols have been interesting me of late
• Automatic determination of both is difficult, interesting, and
unsolved
– Integration into human symbolic systems promises
particularly interesting results
– So we’re going to explore a bit.
Language Is Cool
• Language: A protocol for the transmission of concepts and
intentions between humans
– Documentation is not available
– Documentation does not really work
– Learned through exposure and use
• Significant amount of internal structure, redundancy, and
consistency
• Who makes language?
– Kids.
• Adults coin words here and there, but when they’re forced
to invent a common language to get things done, it’s
called a Pidgin, and it’s terrible
• The kids hear it, and invent a Creole – a merged language
of significantly greater accuracy and depth
• Children make languages
• Adults make “working” languages
• Programmers make barely working languages
Programmers Talk Funny
• Fundamentally two languages that programmers must use
– Code to Human: “User Interface Design”
– Code to Code: “File and Network Protocol”
• UI is a protocol.
– This is obvious in retrospect.
• There are two things this talk hopes to do
– Correct some of the Code->Human protocols that are out
there
– Use human strategies to analyze Code to Code
communications
• Learning a protocol is learning a language.
Humans do not learn languages quickly, and thus
we’re resource bound on fuzzer development
• It’s 2007 – most parsers remain unfuzzed (and thus
just waiting to be exploited)
Weaponizing Noam?
• “An early inference procedure was described by
Chomsky and Miller (1957a), as reported in Solomonoff
(1959). Chomsky proposed a method for detecting loops
in finite state languages. The approach requires a set of
valid sentences, and an oracle that determines whether
a sentence is in the language.
The algorithm proceeds by deleting part of a valid
sentence and asking the oracle whether the sentence is
still valid. If it is, the deleted part is reinserted into the
sequence and repeated, so that it appears twice. If the
sentence is still in the language, a cycle has been
detected.”
– Inferring Sequential Structure, Craig Nevill-Manning, 1996
– This couldn’t POSSIBLY be useful for building a structure
for a dumb fuzzer to operate against.
• Instead of seeing if the parser crashes, just see if it considers
the input valid
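A minimal sketch of the delete-then-double procedure quoted above, assuming a toy oracle. The names `find_loops` and `toy_oracle` are illustrative and do not come from the talk or from Chomsky and Miller; in practice the oracle would be the target parser reporting "valid" or "invalid" for each candidate input.

```python
import re
from typing import Callable, List, Tuple


def find_loops(sentence: str, oracle: Callable[[str], bool]) -> List[Tuple[int, int, str]]:
    """Return (start, end, fragment) for each fragment that behaves like a loop.

    A fragment counts as a loop if the sentence stays valid both when the
    fragment is deleted and when it is reinserted twice.
    """
    loops = []
    n = len(sentence)
    for start in range(n):
        for end in range(start + 1, n + 1):
            fragment = sentence[start:end]
            deleted = sentence[:start] + sentence[end:]
            if not oracle(deleted):
                continue  # deleting this part breaks the sentence
            doubled = sentence[:start] + fragment * 2 + sentence[end:]
            if oracle(doubled):
                loops.append((start, end, fragment))
    return loops


def toy_oracle(candidate: str) -> bool:
    """Hypothetical stand-in for a real parser: accepts the language a(bc)*d."""
    return re.fullmatch(r"a(bc)*d", candidate) is not None


if __name__ == "__main__":
    for start, end, fragment in find_loops("abcbcd", toy_oracle):
        print(start, end, fragment)
```

The search is quadratic in the sentence length and asks the oracle up to twice per fragment, so it only suits short seed inputs; it is meant to show the shape of the idea, not a practical tool.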
Topics Of Discussion
• Further Explorations in Cryptomnemonics
– Using Names and Syllables for password
representation
• Sequitur-XML: Merging automated
structure discovery with the standard
architecture for structure representation
– …which turned out to be quite nice for
controlled structure destruction 
• Exploring Dotplots
– Building a GUI
– Exploring other domains
Intro To Symbol Sets
• Machine Symbols
– Data (AA, BB, CC)
– Code (a(), b(), c())
– Formats (All, Bad, Code) 
• Human Symbols
– Letters (A, B, C)
– Glyphs ()
– Syllables (Ah, Bee, See)
– Words (Amazing, Bear, Clear)
– Native Names (Alice, Bob, Charlie)
– Things (Axe, Bone, Chimpanzee)
– Actions (Ask, Buy, Compute)
– Colors (Aquamarine, Blue, Chartreuse)
• Machines can use formats, but their native format is raw bits
• Humans have no concept of “raw bits” – everything must be
contextual
– Long history in mnemonics of mapping arbitrary data to a
Different Domains Have Different
Strengths – See Visual Processing
Cryptomnemonics
• Definition: The study of human memory, as it
applies to cryptographic systems
• Developing in response to this:
– $ ssh dan@blah
The authenticity of host 'blah
(1.2.3.4)' can't be established.
RSA key fingerprint is
09:a9:b1:99:84:17:7d:ba:c6:55:46:5a:17:
f8:83:01.
Are you sure you want to continue
connecting (yes/no)?
• The machine is acting like it’s integrating with
another machine. It’s not, and that matters.
• Humans can handle hexadecimal characters –
but not that many.
Hex Confusion
• After somewhere between 2 and 5
characters, most of you will fail to see a
difference
– Positional Bias: Expect to see certain things
at the beginning or end
– Value Confusion: Letter vs. Number is
remembered before the actual value of letter
or number
• Glyph confusion
– “Despair” Effect
• Nobody could possibly detect a change, so it’s not
rational to even try
Classes of Memory
• There are three classes of memory, at least to
the degree that is useful in cryptography
– Rejection: “I’ve never seen that before”
– Recognition: “It’s that one, not that other one”
– Recollection: “Let me describe it to you.”
• SSH just requires rejection
– Hex is not rejectable
– Can we try another domain?
Exploring The Nymic Domain
• $ ssh dan@blah
Key Data:
julio and epifania dezzutti
luther and rolande doornbos
manual and twyla imbesi
dirk and cuc kolopajlo
omar and jeana hymel
The authenticity of host 'blah (1.2.3.4)' can't be
established.
Are you sure you want to continue connecting
(yes/no)?
– Alternate mapping for
09:a9:b1:99:84:17:7d:ba:c6:55:46:5a:17:f8:83:01.
– Proposed last year as a potential solution
• There is nothing more contextual than a story, and there is nothing
more stable in a story than the names of its participants
– Stories retold are stories remembered – we need to be exposed
to the above group time and time again to be able to reject any
deviation from it
How To Derive Names?
• Original Model
– Take US Census Data
– Remove any names that may be easily confused with
one another:
• Easy: Bob v. Bobby
• Hard: Bob v. Robert
• Celebrity Naming
– “Marge Godwin”
• Archaic Naming
– Use constructs from various ancient languages
• Mechanistic Constructs
– Bubble Babble: 64 bits = xegoz-tosys-vusik-masar
– Koremutake: 64 bits = darujifahe stygrifrejy
How Many Names?
• Unclear where the crossover point lies between the harm from more
names and the benefit from more entropy per name
– Present system is 512 male names, 512 female names, 1024
last names from US Census
– 256/256/256 would provide 24 bits per couple instead of
40, and the names would be more recognizable. Better?
How much better?
• The more names, the more of a problem position
becomes
– We’re sensitive to names, but without a story
context, there are no roles locking people to being the
first or the second or the third. So the more names,
the more bits we lose to reader confusion.
• How many bits are necessary? Depends on what for.
Flipping The Bits
• SSH Key Representation is not the only thing we can do with
this technique
– In fact, it’s not even the most pressing problem
• Passwords are in crisis right now
– PKI failed, deal with it
• There’s an entire alternate history where XSS enjoys
the benefits of your legal credentials being available
and shared
– People are being asked to generate, frequently, high
entropy non repeated passwords
• They’re repeating them
• They’ve exhausted personal entropy, and have moved
to geometric progressions to evade lameness checks
– &(*uoiJKL798
– Fixed prefix
A Fundamental Shift
• Generate passwords for your users.
– “But they’re hideous, nobody will remember
what we automatically generate”
• You’re theoretically forcing them to generate
those hideous passwords, off the top of their
head
• Use alternate symbolic domains to coat the
password entropy you require in a form users can
accept
– Why yes, this is exactly like a tunnel. We’re
tunneling entropy over a baby name book 
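A rough sketch of "tunneling entropy over a baby name book", assuming three tiny name lists. The lists below reuse names from the earlier SSH example purely for illustration; a real deployment would use the much larger census-derived lists described above. The functions `bits_per_couple`, `encode`, and `decode` are hypothetical names, not part of any existing tool.

```python
import math
import secrets

# Toy lists (4 entries each, so 2 bits per list and 6 bits per couple).
MALE = ["julio", "luther", "dirk", "omar"]
FEMALE = ["epifania", "rolande", "twyla", "jeana"]
LAST = ["dezzutti", "doornbos", "imbesi", "hymel"]


def bits_per_couple(n_male: int, n_female: int, n_last: int) -> float:
    """Entropy carried by one 'X and Y Z' couple for the given list sizes."""
    return math.log2(n_male * n_female * n_last)


def encode(value: int, couples: int) -> list:
    """Spread an integer across a fixed number of name couples (mixed radix)."""
    phrases = []
    for _ in range(couples):
        value, l = divmod(value, len(LAST))
        value, f = divmod(value, len(FEMALE))
        value, m = divmod(value, len(MALE))
        phrases.append(f"{MALE[m]} and {FEMALE[f]} {LAST[l]}")
    if value:
        raise ValueError("value does not fit in the requested number of couples")
    return phrases


def decode(phrases: list) -> int:
    """Invert encode(): rebuild the integer from the name couples."""
    value = 0
    for phrase in reversed(phrases):
        male, _, rest = phrase.partition(" and ")
        female, last = rest.split(" ")
        value = value * len(MALE) + MALE.index(male)
        value = value * len(FEMALE) + FEMALE.index(female)
        value = value * len(LAST) + LAST.index(last)
    return value


if __name__ == "__main__":
    secret = secrets.randbelow(64 ** 2)   # 12 random bits with these toy lists
    password = encode(secret, couples=2)
    print(password, decode(password) == secret)
```

With 256 entries in each list, `bits_per_couple(256, 256, 256)` gives the 24 bits per couple mentioned earlier.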
Change Your Ways
• Modify your validation logic to accept long
passwords without weird character sets
– Punctuation and case sensitivity are
“weak symbols”
• It is easier to chain together common
symbols in a common way, than it is to
link together arbitrary bytes out of
context
– This is a fundamental difference between
human symbol manipulation and the
operations of computers
How Many Bits Do We Really
Need?
• Hash Validation: 80-100 bits
– We don’t have a birthday paradox problem with
hashes, since one of them is fixed.
– 2^80-2^100 work efforts are outside the range of
feasibility at this time
• Password Entry: 24 bits for low security, 36-48
bits for high security
– Need enough to make brute force enumeration
across all users infeasible
– For each username, try one possible password
– 48 bit is what we’re at with
punctuation/case/number/8 character.
Limits to alternate symbol domains
• We lose the ability to measure “nextness”
– 0x10 is one less than 0x11
– Bob is…how much less than Charlie?
• Data may become variable length – Bob is
three characters, Charlie is seven
– Harder to see patterns
• Has trouble scaling to any large number of
bits.
– We can’t analyze even mildly large systems
using this translation layer
What We’ve Been Using
(Warning: Sucks.)
N’est-ce pas Non Sequitur
• Sequitur: Linear Time Pattern Finder
– Creates hierarchical Context Free Grammars from arbitrary input
• Compression Algorithm in which you can “look under the
covers” to see what’s going on
• Created by Craig Nevill-Manning for his PhD thesis a
decade ago
– He’s now Chief Research Scientist at Google
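To make "look under the covers" concrete, here is a small sketch of grammar inference by repeated pair replacement. This is not Sequitur itself (Sequitur works incrementally and in linear time); it is a simpler, slower scheme in the spirit of Re-Pair, used here only as an assumption-laden illustration of what a hierarchical grammar over arbitrary input looks like. The rule names `R1`, `R2`, ... and both function names are made up for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def infer_grammar(text: str) -> Tuple[List[str], Dict[str, Tuple[str, str]]]:
    """Replace the most frequent adjacent pair with a new rule until none repeats."""
    seq = list(text)          # single characters are the terminal symbols
    rules: Dict[str, Tuple[str, str]] = {}
    next_id = 1
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        name = f"R{next_id}"  # multi-character, so it cannot collide with a terminal
        next_id += 1
        rules[name] = pair
        rewritten, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                rewritten.append(name)
                i += 2
            else:
                rewritten.append(seq[i])
                i += 1
        seq = rewritten
    return seq, rules


def expand(symbol: str, rules: Dict[str, Tuple[str, str]]) -> str:
    """Recursively expand a symbol back into the text it stands for."""
    if symbol in rules:
        return "".join(expand(part, rules) for part in rules[symbol])
    return symbol


if __name__ == "__main__":
    root, rules = infer_grammar("abcabcabc")
    print("root:", root)
    for name, body in rules.items():
        print(name, "->", body)
```

The root sequence plus the rule table is the grammar: every rule is a symbol that something else can reference, which is the property the later slides rely on.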
What’s New: Sequitur-XML
• echo 'aabbabc' |
./sequitur_simple.exe
• Why translate: Gives us
much easier to
manipulate output
– C is very good for
generating the tree
– Other languages are very
good for analyzing /
modifying the tree
• XML is a (shockingly)
good machine format for
representing structure
Early Work: Syntax Highlighting
Using Compression Depth
What’s Actually Going On?
• (0) -> … (73),b4,(73),ca,(73),e6,(73),02,(74),18,
(74),2c,(74),4a,(74),5c,(74),6e,(74),80,(74),98,
(74),b0,(74),c8,(74),e8,(74),fc,(74),10,(75),20,
(75),30,(75),40,(75),50,(75),64,(75),82,(75),90,
(75),9e,(75)
…
(84),d6,(84),ee,(84),0c,(85),28,(85),3c,(85),4e,
(85),66,(85),7e,(85),8c,(85),9e,(85),ac,(85),be,
(85),ca,(85),ea,(85),08,(86),26,(86),44,(86),56,
(86),6a,(86),7c,(86),8a,(86),a6,(86),b6,(86),cc,
(86),de,(86),02,(87)
• Repeated sequence, single byte literal.
Repeated sequence, single byte literal. Rinse,
lather, repeat.
Where Things Get Most
Interesting…Live Symbol Browsing!
Browsing HOWTO
• For each entry in the root node,
– If it’s a literal, color it white
– If it’s part of a reference, color it red
– If it’s clicked, color it and every other instance
of that reference blue
• A little buggy
• Present implementation DOES NOT SCALE
• But effective!
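The coloring rules above fit in a few lines. This is only a sketch under an assumed data layout: each root entry is a `(kind, value)` tuple where `kind` is `"lit"` or `"ref"`. That layout and the function name `color_entries` are hypothetical, not the browser's actual code.

```python
from typing import List, Optional, Tuple

Entry = Tuple[str, object]  # ("lit", byte_text) or ("ref", rule_id)


def color_entries(root: List[Entry], clicked: Optional[object] = None) -> List[str]:
    """Assign a display color to every entry of the root rule."""
    colors = []
    for kind, value in root:
        if kind == "ref" and clicked is not None and value == clicked:
            colors.append("blue")   # every instance of the clicked reference
        elif kind == "ref":
            colors.append("red")    # part of some repeated sequence
        else:
            colors.append("white")  # plain literal
    return colors


if __name__ == "__main__":
    root = [("ref", 73), ("lit", "b4"), ("ref", 73), ("lit", "ca"), ("ref", 74)]
    print(color_entries(root, clicked=73))
```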
Symbol Links: Where To Go From
Here
• Turns code on left into
symbolic set on right;
it’s easy then to link
the symbols together
as per the graph.
• This works for non-textual data
• Sequitur imputes meaningful
symbols from arbitrary input
data
Context Free Grammar Fuzzer:
THE CFG9000
• Reduce input data to a stream of symbols
• Fuzz data at the symbol level, rather than
at pure bytes
– Shuffle
– Drop
– Repeat
– Uniform Corrupt
• Consistently corrupt all instances of a given
symbol
• <HEAD> -> <FOOBAR>
• Partially ported to the new XML framework
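A sketch of the four symbol-level mutations listed above. It assumes the input has already been reduced to a list of symbols; the regex tokenizer here is a crude stand-in for a grammar-derived symbol stream, and none of the function names come from the CFG9000 source.

```python
import random
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Stand-in symbolizer: split into tags and the text between them."""
    return re.findall(r"<[^>]+>|[^<]+", text)


def shuffle_symbols(symbols: List[str], rng: random.Random) -> List[str]:
    """Shuffle: reorder whole symbols, leaving each symbol intact."""
    shuffled = list(symbols)
    rng.shuffle(shuffled)
    return shuffled


def drop_symbol(symbols: List[str], rng: random.Random) -> List[str]:
    """Drop: remove one randomly chosen symbol."""
    i = rng.randrange(len(symbols))
    return symbols[:i] + symbols[i + 1:]


def repeat_symbol(symbols: List[str], rng: random.Random, times: int = 16) -> List[str]:
    """Repeat: duplicate one randomly chosen symbol many times in place."""
    i = rng.randrange(len(symbols))
    return symbols[:i] + [symbols[i]] * times + symbols[i + 1:]


def uniform_corrupt(symbols: List[str], target: str, replacement: str) -> List[str]:
    """Uniform Corrupt: consistently replace every instance of one symbol."""
    return [replacement if s == target else s for s in symbols]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a crashing case can be replayed
    symbols = tokenize("<HEAD>a</HEAD><HEAD>b</HEAD>")
    print("".join(uniform_corrupt(symbols, "<HEAD>", "<FOOBAR>")))
    print("".join(repeat_symbol(symbols, rng)))
```

Because every mutation keeps symbols whole, the output tends to stay "almost valid", which is the point of fuzzing at the symbol level rather than at pure bytes.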
Sample CFG9000 Output
• calculate_rule_usage(p->rulep->rulep->rulep-
>rulep->rulep->rulep->rulep->rulep->rulep-
>rulep->rulep->rulep->rulep->rule() }
• calculate_rule_usage(calculate_rule_usage(calc
ulate_rule_usage(calculate_rule_usage(calculat
e_rule_usage(calculate_rule_usage(calculate_ru
le_usage(calculate_rule_usage(calculate_rule_u
sage(calculate_rule_usage(calculate_rule_usag
e(calculate_rule_usage(calculate_rule_usage(ca
lculate_rule_usage(calculate_rule_usage(calcula
te_rule_usage(calculate_rule_usage(calculate_r
ule_usage(p->rule());
Slashdot Fuzzed
Slashdot Fuzzed (2)
Why We Moved To XML In The
First Place
• XML is a (potentially) validating format
– Has the concept of schemas
– NOT THAT THEY’RE ALWAYS OR EVEN OFTEN
CHECKED
• Schema validation is expensive
• We should be able to use XML Schemas to
guide fuzzers
– WS-Bang
• Excellent tool for bashing Web Services frameworks
• Given a WSDL file (Web Services Description Language),
fuzz it
– Untidy: Mostly just attacks XML parsers, doesn’t hit
the structure
Automatically Generating
Schemas?
• We can autogenerate Schemas from XML
(to some degree)
– Relaxer
– Trang
– Tends to capture structure better than content
• Doesn’t appear to automatically determine what
values are valid for each field
• Does provide framework for automatically
extracting all instances of what can go where
Wireshark Demo:
From…
• <field show="QUERY_FS_INFO Data" size="20"
pos="126"
value="ff002700ff000000080000004e00540046005300"
>
– - <field show="FS Attributes: 0x002700ff" size="4" pos="126"
value="ff002700">
– <field name="smb.fs_attr.css"
showname=".... .... .... .... .... .... .... ...1 = Case Sensitive
Search: This FS supports CASE SENSITIVE SEARCHes"
size="4" pos="126" show="1" value="1"
unmaskedvalue="ff002700" />
– <field name="smb.fs_attr.cpn" showname=".... .... .... .... .... ....
.... ..1. = Case Preserving: This FS supports CASE
PRESERVED NAMES" size="4" pos="126" show="1" value="1"
unmaskedvalue="ff002700" />
– <field name="smb.fs_attr.uod" showname=".... .... .... .... .... ....
.... .1.. = Unicode On Disk: This FS supports UNICODE
NAMES" size="4" pos="126" show="1" value="1"
unmaskedvalue="ff002700" />
Wireshark Demo:
To:
• - <xsd:complexType name="field">
– - <xsd:sequence>
– <xsd:element maxOccurs="unbounded" minOccurs="1"
name="field" type="field" />
– </xsd:sequence>
– <xsd:attribute name="name" type="xsd:token" />
– <xsd:attribute name="pos" type="xsd:int" />
– <xsd:attribute name="show" type="xsd:normalizedString" />
– <xsd:attribute name="showname"
type="xsd:normalizedString" />
– <xsd:attribute name="size" type="xsd:int" />
– <xsd:attribute name="value" type="xsd:token" />
– <xsd:attribute name="hide" type="xsd:token" />
– <xsd:attribute name="unmaskedvalue" type="xsd:token" />
– </xsd:complexType>
Could we automatically extract
structure from Sequitur-XML?
• “This sequence of bytes can be reconstructed
with these other sequences of bytes”
– No tree relationship – anything can link in
anything
– Need to have the content awareness
Relaxer lacks to get anything useful
– Where might we get this content
awareness?
What Might We Borrow From
Linguistics?
• Can we use linguistic approaches?
– Common Elements
• Humans: Subjects, Verbs, etc.
• Machines: Delimiters, Length Fields, ASCII/Unicode, x86,
Padding to Four Byte Boundaries
– Symbol Interrelationships
• Humans: We take word boundaries for granted
– Until we’re listening to a foreign language, and wonder
why there aren’t spaces between words
• Machines: File formats rarely make it easy to see where
one symbol ends and another begins
• Does one symbol always appear before another? Does
one symbol always find itself surrounded by two others?
How To Think Of Sequitur
• Any time you’re manipulating data as bytes,
think of manipulating it as symbols
– N-gram histograms on bytes -> N-gram histograms
on symbols
– Bayesian probabilities on characters -> Bayesian
probabilities on symbols
• Sequitur is not necessarily the best way to
determine a grammar
– Suffix Trees may be more accurate
– Kieffer-Yang (redundant symbol extraction) a very
good post-processing step to add
– Ray removes In-Memory Grammar Requirement
– Not all other solutions are linear time, though
• Kind of cool to have a grammar that covers a 750GB hard
drive undergoing forensics
Fuzzy Wuzzy Wuz A Symbol
• Symbol analysis systems (language translators,
etc) have issues w/ TMTOWTDI (There’s More
Than One Way To Do It)
– Very similar messages can be encapsulated in very
different ways
– Very similar messages can be encapsulated in very
similar, but not identical ways
• Sequitur only handles exact matches – fuzzy
grammar imputation doesn’t appear to exist yet
– We must develop this fuzziness to create byte-
sourced XML schemas 
• It is a pretty wild concept, so 
– Are there any systems for analyzing complex, unequal
but somewhat related sets of symbols?
Another Approach: Dotplots
What Exactly Are We Doing
• Jonathan Helfman’s
“DotPlot Patterns: A
Literal Look at Pattern
Languages” offers an
introduction
• Instead of “to, be, not” etc, we use chunks of
data from arbitrary files
– Instead of demanding perfect equality, we measure
how similar the chunks are
– If most of the bytes are in most of the same places,
it’s pretty similar, if most are different, pretty dissimilar
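A minimal sketch of the fuzzy dotplot described above, in plain Python. The scoring rule (fraction of bytes that match at the same position) follows the slide; the function names and the default `window` of 32 are assumptions for illustration.

```python
from typing import List


def chunk_similarity(a: bytes, b: bytes) -> float:
    """Fraction of positions where the two chunks hold the same byte."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def dotplot(left: bytes, right: bytes, window: int = 32) -> List[List[float]]:
    """Similarity matrix: cell [y][x] compares chunk y of left to chunk x of right."""
    rows = [left[i:i + window] for i in range(0, len(left), window)]
    cols = [right[i:i + window] for i in range(0, len(right), window)]
    return [[chunk_similarity(r, c) for c in cols] for r in rows]


if __name__ == "__main__":
    data = b"ABCD" * 8 + b"WXYZ" * 8
    for row in dotplot(data, data, window=16):   # compare the data to itself
        print(" ".join(f"{cell:.1f}" for cell in row))
```

Each cell becomes one pixel, brighter for higher similarity. Passing the same data twice compares a file to itself; passing two different inputs compares one against the other. Cost grows with the square of the number of chunks, so large files need a larger window or sampling.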
New: Video Analysis!
(Nine Inch Nails, “Closer”)
More Video Analysis:
Cibo Matto / Michel Gondry’s Palindromatic “Sugar Water”
We’ve figured out what some of
these patterns mean…
But some code just comes out
strange.
So How Might This Be Useful?
• A) Format Identification
– 1) Do different files appear different, and does the
appearance reflect the existence of internal structure?
– 2) Do different instances of the same file format
appear similar?
– 3) Does one format embedded in another make itself
apparent?
• B) Fuzzer Guidance
– 1) Can we locate the actual byte offsets where one
section ends and another begins?
– 2) Can we visualize and compare fuzzer operations
via Dotplots?
Format Identification
• 1) Do different files appear different,
and does the appearance reflect the
existence of internal structure?
• 2) Do different instances of the same file format appear
similar?
• 3) Does one format embedded in another make itself
apparent?
Java Class Files
.NET Assemblies
CNN’s Home Page
SMBTorture Traffic
(Packets – Note, Stop/Start Is Visible)
Kernel32.dll
Chromosome 22
(This is, after all, a genomics hack)
The Legend Of Zelda
Format Identification
• 1) Do different files appear different, and
does the appearance reflect the existence
of internal structure?
– Answer: Yes. They do.
• 2) Do different instances of the same
file format appear similar?
• 3) Does one format embedded in another make itself
apparent?
Books from Project Gutenberg:
Consistent
Despite English’s low
information content,
lack of even mildly
related strings causes
little self-similarity
across symbol clusters
US Code:
Moderately Consistent
Legalese is a massively
structured dialect.
Symbols appear in very
distinct patterns that are
more reminiscent of
machine code than text.
HTML:
Consistent
HTML repeats smaller
symbols (tags) and larger
symbol clusters (via
template engines) regularly.
This shows up visually as a
tightly repeating pattern.
Java Class Files (Compared):
Mildly Consistent
Binary code (be it bytecode
or x86) tends to be very
structured. Still, we are
dependent on both the
content and the compiler
to generate distinct
patterns.
x86:
Consistent (In Sections)
x86 tends not to be
handwritten; as such
complex instructions are
emitted in a highly
structured form.
Exception?
• 64 kilobyte graphical
demonstration
• Run through a packer

• Compression
removes patterns
NES Games
6502 Assembly Tends
To Show Consistent
Patterns, But…
Mario Games Look Rather
Different.
1) Output is highly
dependent on the
compiler
2) Output is highly
dependent upon the
actual content
File formats are merely
shells for actual
content. You are
analyzing the content;
the format is just
syntactic sugar.
Format Identification
• 1) Do different files appear different, and does the
appearance reflect the existence of internal structure?
– Answer: Yes. They do.
• 2) Do different instances of the same
file format appear similar?
– Answer: Somewhat. Similar content looks
like itself, but you’re measuring the
fundamental entropy of the underlying
content, not the format of the content
itself.
• 3) Does one format embedded in another make
itself apparent?
File Formats Contain Multiple Subformats
Another Look At Kernel32.DLL
These are all different
parts of Kernel32.
Quickly Browsing Large Files:
Tilt-Shift View
• Instead of measuring
absolute Y against
absolute X, make X
relative
– Advance through the
file going down, look
back a number of
bytes going right
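One possible reading of the tilt-shift view, sketched below: each row is a chunk of the file (advancing downward), and each column compares that chunk with the chunk some number of windows earlier (looking back, going right). This interpretation, the parameter `max_lookback`, and the function names are assumptions, not the tool's actual layout.

```python
from typing import List


def chunk_similarity(a: bytes, b: bytes) -> float:
    """Fraction of positions where the two chunks hold the same byte."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def tilt_shift(data: bytes, window: int = 32, max_lookback: int = 8) -> List[List[float]]:
    """Row y, column k-1: similarity between chunk y and the chunk k windows before it."""
    chunks = [data[i:i + window] for i in range(0, len(data), window)]
    rows = []
    for y, chunk in enumerate(chunks):
        row = []
        for back in range(1, max_lookback + 1):
            if y - back < 0:
                row.append(0.0)  # nothing that far back yet
            else:
                row.append(chunk_similarity(chunk, chunks[y - back]))
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in tilt_shift(b"ABCD" * 16, window=16, max_lookback=3):
        print(row)
```

The output has one row per chunk but only `max_lookback` columns, so the image grows linearly with file size instead of quadratically, which is what makes quick browsing of large files feasible.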
Complain All You Want.
Hex Still Sucks.
Format Identification
• 1) Do different files appear different, and does the
appearance reflect the existence of internal structure?
– Answer: Yes. They do.
• 2) Do different instances of the same file format appear
similar?
– Answer: Somewhat. Similar content looks like itself,
but you’re measuring the fundamental entropy of the
underlying content, not the format of the content itself.
• 3) Does one format embedded in another
make itself apparent?
– Answer: Yes. Multiple, distinct sections
are clearly visible in a way that hex cannot
show.
Fuzzer Guidance
• 1) Can we locate the actual byte offsets
where one section ends and another begins?
– Why would we want to?
• Fuzzers break parsers.
• Many subformats to a format, many subparsers to a parser
• To a rough level of approximation, fuzzing a single subformat
lets you stress a single subparser
• So once we split a file up, we can selectively attack one
subparser at a time.
• 2) Can we visualize and compare fuzzer operations via
Dotplots?
Simple Math
We select an interesting blob
from kernel32.dll. The blob is
at pixel offset 507x507, and
is a square around 570 pixels
wide.
Window size on viz was 32.
507*32 = The interesting
section starts 16224 bytes
into the file.
570*32 = The interesting
section is 18240 bytes long.
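The arithmetic above as a small helper, assuming one pixel per window-sized chunk. The names `region_for` and `extract_region` are illustrative only.

```python
from typing import Tuple


def region_for(pixel_offset: int, pixel_width: int, window: int) -> Tuple[int, int]:
    """Convert a square on the dotplot into (byte offset, byte length)."""
    return pixel_offset * window, pixel_width * window


def extract_region(data: bytes, pixel_offset: int, pixel_width: int, window: int) -> bytes:
    """Slice the bytes that the selected square corresponds to."""
    start, length = region_for(pixel_offset, pixel_width, window)
    return data[start:start + length]


if __name__ == "__main__":
    start, length = region_for(507, 570, 32)
    print(start, length)  # 16224 18240
```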
What’s The Actual Data?
dd if=kernel32.dll bs=1 skip=16100
| hexdump - | more
Using Hardcorr as a “first knife” to
locate interesting-to-fuzz regions
Fuzzer Guidance
• 1) Can we locate the actual byte offsets where
one section ends and another begins?
– Answer: Yes. We can quickly route from the image
to the byte offset, through basic arithmetic.
• 2) Can we visualize and compare
fuzzer operations via Dotplots?
Differentials
• Major use of dotplots in bioinformatics is to
compare one genome against another
– Autocorrelation: Compare A to A
– Cross-Correlation: Compare A to B
• Most files are sufficiently dissimilar that
not very interesting structure shows up
– Notable exception: Different versions of
the same binary
Visual Bindiff!
MSVCR70.DLL v. MSVCR71.DLL
Fuzzers:
Very Broken Patchers 
Mangle.C – Single Bit
Differences
CFG9000 – Large Scale
Reordering
Fuzzer Guidance
• 1) Can we locate the actual byte offsets where one
section ends and another begins?
– Answer: Yes. We can quickly route from the image
to the byte offset, through basic arithmetic.
• 2) Can we visualize and compare
fuzzer operations via Dotplots?
–Answer: Yes – visual diffing effectively
shows differences between files,
including differences introduced by
various flavors of fuzzers.
Conclusions…
• Lots of interesting work left to do
– Unification of local presence of symbols, and global
view of file format
• Possible to do dotplots themselves in the symbolic domain
– Use of dotplots to segment formats, which thus
provides the tree we want for an XML schema
• <format>
– <blob1 />
– <blob2 />
– <blob3 />
• </format>
– More colorful pretty pictures!
The Ancient Tongue:
TCP/IP
• Can’t all be about pretty pictures 
• A new problem has popped up: Network
oligopolies are threatening to install
firewalls that limit or eliminate bandwidth
on a per-company basis
– Their own media services might be fast,
others will be slow
– Their own VPN services might be fast, others
will be slow
• Question: Is it possible to detect and
locate devices violating network
What’s The Closest Tool We Have?
• Firewalk
– Mike Schiffman’s Firewall Analysis Tool
– Packets elicit an ICMP Time Exceeded error if
they reach a router with TTL=0
• TTL decremented by one for each hop, so you
start low, you can trace the route to a host
– A firewalled packet won’t live long enough to
reach TTL=0
– So you can locate the firewall, and divine
things about its ruleset, based on when your
packets stop getting ICMP Time Exceeded
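The inference step can be sketched without touching the network. The function below assumes the probing has already been done and recorded as a mapping from TTL to "did an ICMP Time Exceeded come back?"; both that input format and the name `first_silent_ttl` are assumptions for illustration, not Firewalk's actual interface.

```python
from typing import Dict, Optional


def first_silent_ttl(responses: Dict[int, bool]) -> Optional[int]:
    """Return the lowest TTL whose probe drew no ICMP Time Exceeded.

    responses[ttl] is True if a probe sent with that TTL elicited a
    Time Exceeded reply. The first silent TTL suggests the packet was
    filtered at or just before that hop. None means every probe was answered.
    """
    for ttl in sorted(responses):
        if not responses[ttl]:
            return ttl
    return None


if __name__ == "__main__":
    observed = {1: True, 2: True, 3: True, 4: False, 5: False}
    print(first_silent_ttl(observed))
```

Repeating this for different ports or protocols and comparing where the replies stop is what lets you guess at the ruleset. A single silent hop can also be a router that simply does not send Time Exceeded, so real measurements need several probes per TTL.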
Limitations of Firewalking
• But Firewalk tells us what, not who is
blocked…and it tells us nothing about who
is allowed to go fast, and who is made to
go slow
– Suddenly, we devolve to a much older
question: Is it possible to find out that a target
firewall is, or is not, blocking against or
accepting traffic from an arbitrary IP address?
TCP Does Speed Measurement
• TCP speed analysis done blindly
– Endpoints do not negotiate with one another
– Everyone sends their packets, routers route
what they will. Endpoints need to adjust to
what the routers are willing to pass.
• Routers communicate with endpoints by dropping
their packets
• Can we combine this router backchannel
w/ Firewalk?
In From The Side
• What causes packets to drop?
– Too many packets
• What are we going to do?
– Send too many packets
• Two channels are set up
– A primary channel, which drops packets at some
known rate
– A secondary channel, whose purpose it is to interfere
(or not) with the primary channel
• When the secondary interferes with the primary,
we get feedback via the primary channel
– The traffic composing the secondary channel can
come from anywhere, be composed of anything, and
can be TTL’d just like in a normal firewalk.
The TTL Channel
• Normally, you don’t know which router
along a path is dropping your packets
 
• If you are the source of the drop-inducing
packets, you can control how far your
noise goes out – thus, you can discover
which router is hitting its limit / censoring
your net connection
 
Scorchmarking
• Why Scorchmarking?
– Routers are burning packets…those that get through
might have a scorch mark or two 
• Basic Model
– Client downloads a file from a site, at some given
speed negotiated via TCP.
– At the same time, traffic is injected from different IP
addresses. This should cause drops.
• If it doesn’t, the network is either penalizing the primary
channel (easy to drop against) or rewarding the secondary
channel (resilient to drops)
Advanced Scorchmarking [0]
• Having to depend on a client is lame
– Wouldn’t it be nice if we could scan the
Internet for these servers?
• What fundamental service is a receiving
client providing?
– It is acknowledging our traffic – letting us
know how much it received, and how many
milliseconds it took to receive it
• Aren’t there other ways we could extract
the same data from hosts?
Advanced Scorchmarking [1]
• What else will acknowledge receiving traffic from
us?
– TCP Servers
• Sting, from Stefan Savage, used this to great effect
– DNS Servers 
– Routers.
• Supposedly, routers won’t send more than a certain number
of ICMP Time Exceeded packets per second
• In reality, they seem to ICMP Time Exceeded ACK however
much you throw at them
• Even if they didn’t, you could use the difference in ICMP
Time Exceeded rates between Primary and Secondary
channel, to determine whether interference was showing up.
• Everyone’s got a NAT – so you can query everyone for
whether certain sorts of traffic are being blocked to them
Advanced Scorchmarking [2]
• So, yes.
– You can scan for violations of Network Neutrality
– You can find networks that are blocking or passing
particular IP ranges
• It’s not exactly efficient though
• Neutrality violations are easier to find than the
standard FW case
– Firewalls are normally between the WAN and the LAN
(Slow Net -> FW -> Fast Net)
– Neutrality violators are mid-WAN (Slow Net -> Fw ->
Slow Net -> Fast Net)
– Easier to overload the slow net after the firewall
• Boxes with max TTL rates override this
Speed Limits
• Fundamental Problem: Have to max out
bandwidth on the link to trigger the backchannel
– No packets dropping, no data
– Means you have to DoS a link – not scalable/legal
• Potential Solution: Find capped acknowledgers
– The mythical ICMP Time Exceeded rate limit works
well
• Primary and Secondary channel both eliciting ITE’s
• When secondary channel gets a packet through, it takes up a
slot on the primary channel’s
• ITE is perfect, since you can TTL limit any packet
• Depends on the firewall passing the primary’s ITE’s
• Maybe Linux / NATs actually implement rate limits?
– Another option: What if we have code on the client?
Windows Media Player:
More Than Just DRM. Really!
• Bulk Transfer: RTP
– Runs over Unicast UDP
– Yes, the same Unicast UDP that penetrates NAT so
well!
• Flow Control / Quality Monitoring: RTCP
• No technical reason RTCP needs to go back to
the same address that the RTP stream is coming
from
– So: We pretend to provide media streams from all
sorts of sites, and use WMP to collect traffic stats for
us 
• It might work…

Man vs Internet - Current challenges and future tendencies of establishing tr...
 
Keynote - Closing the TLS Authentication Gap
Keynote - Closing the TLS Authentication GapKeynote - Closing the TLS Authentication Gap
Keynote - Closing the TLS Authentication Gap
 

Viewers also liked

Bh fed-03-kaminsky
Bh fed-03-kaminskyBh fed-03-kaminsky
Bh fed-03-kaminskyDan Kaminsky
 
232 md5-considered-harmful-slides
232 md5-considered-harmful-slides232 md5-considered-harmful-slides
232 md5-considered-harmful-slidesDan Kaminsky
 
Black ops of tcp2005 japan
Black ops of tcp2005 japanBlack ops of tcp2005 japan
Black ops of tcp2005 japanDan Kaminsky
 
Bh us-02-kaminsky-blackops
Bh us-02-kaminsky-blackopsBh us-02-kaminsky-blackops
Bh us-02-kaminsky-blackopsDan Kaminsky
 

Viewers also liked (6)

Bh fed-03-kaminsky
Bh fed-03-kaminskyBh fed-03-kaminsky
Bh fed-03-kaminsky
 
Dmk bo2 k7_web
Dmk bo2 k7_webDmk bo2 k7_web
Dmk bo2 k7_web
 
232 md5-considered-harmful-slides
232 md5-considered-harmful-slides232 md5-considered-harmful-slides
232 md5-considered-harmful-slides
 
Black ops of tcp2005 japan
Black ops of tcp2005 japanBlack ops of tcp2005 japan
Black ops of tcp2005 japan
 
Bh us-02-kaminsky-blackops
Bh us-02-kaminsky-blackopsBh us-02-kaminsky-blackops
Bh us-02-kaminsky-blackops
 
Advanced open ssh
Advanced open sshAdvanced open ssh
Advanced open ssh
 

Similar to Cryptomnemonics: Using Names and Stories to Represent Data/TITLE

Oral tradition in software engineering: Passing the craft across generations
Oral tradition in software engineering: Passing the craft across generationsOral tradition in software engineering: Passing the craft across generations
Oral tradition in software engineering: Passing the craft across generationsbcantrill
 
AI Lecture-01 (Introduction) NN and Fuzzy
AI Lecture-01 (Introduction) NN and FuzzyAI Lecture-01 (Introduction) NN and Fuzzy
AI Lecture-01 (Introduction) NN and FuzzySirRafiLectures
 
intro (1).ppt
intro (1).pptintro (1).ppt
intro (1).pptburakkrk6
 
Password Storage Sucks!
Password Storage Sucks!Password Storage Sucks!
Password Storage Sucks!nerdybeardo
 
Using construction grammar in conversational systems
Using construction grammar in conversational systemsUsing construction grammar in conversational systems
Using construction grammar in conversational systemsCJ Jenkins
 
Iulia Pasov, Sixt. Trends in sentiment analysis. The entire history from rule...
Iulia Pasov, Sixt. Trends in sentiment analysis. The entire history from rule...Iulia Pasov, Sixt. Trends in sentiment analysis. The entire history from rule...
Iulia Pasov, Sixt. Trends in sentiment analysis. The entire history from rule...IT Arena
 
Clojure slides
Clojure slidesClojure slides
Clojure slidesmcohen01
 
Un actor (model) per amico - Alessandro Melchiori - Codemotion Milan 2016
Un actor (model) per amico - Alessandro Melchiori - Codemotion Milan 2016Un actor (model) per amico - Alessandro Melchiori - Codemotion Milan 2016
Un actor (model) per amico - Alessandro Melchiori - Codemotion Milan 2016Codemotion
 
HarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the peopleHarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the peopleMichael Bernstein
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structureselliando dias
 
Password Management
Password ManagementPassword Management
Password ManagementRick Chin
 
Artificial intelligence(introduction)
Artificial intelligence(introduction)Artificial intelligence(introduction)
Artificial intelligence(introduction)syed rafi
 
Introduction-To-AI(L-1).pdf
Introduction-To-AI(L-1).pdfIntroduction-To-AI(L-1).pdf
Introduction-To-AI(L-1).pdfssuser8324dd
 
BSidesLV 2013 - Using Machine Learning to Support Information Security
BSidesLV 2013 - Using Machine Learning to Support Information SecurityBSidesLV 2013 - Using Machine Learning to Support Information Security
BSidesLV 2013 - Using Machine Learning to Support Information SecurityAlex Pinto
 
Scylla Summit 2022: Predicting the Past
Scylla Summit 2022: Predicting the PastScylla Summit 2022: Predicting the Past
Scylla Summit 2022: Predicting the PastScyllaDB
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014fnothaft
 
Nicholas Dorans - The Evolution of Passwords
Nicholas Dorans - The Evolution of PasswordsNicholas Dorans - The Evolution of Passwords
Nicholas Dorans - The Evolution of PasswordsCSNP
 
Hybrid concurrency patterns
Hybrid concurrency patternsHybrid concurrency patterns
Hybrid concurrency patternsKyle Drake
 

Similar to Cryptomnemonics: Using Names and Stories to Represent Data/TITLE (20)

Dmk audioviz
Dmk audiovizDmk audioviz
Dmk audioviz
 
Oral tradition in software engineering: Passing the craft across generations
Oral tradition in software engineering: Passing the craft across generationsOral tradition in software engineering: Passing the craft across generations
Oral tradition in software engineering: Passing the craft across generations
 
AI Lecture-01 (Introduction) NN and Fuzzy
AI Lecture-01 (Introduction) NN and FuzzyAI Lecture-01 (Introduction) NN and Fuzzy
AI Lecture-01 (Introduction) NN and Fuzzy
 
intro (1).ppt
intro (1).pptintro (1).ppt
intro (1).ppt
 
Password Storage Sucks!
Password Storage Sucks!Password Storage Sucks!
Password Storage Sucks!
 
chatcptkk.ppt
chatcptkk.pptchatcptkk.ppt
chatcptkk.ppt
 
Using construction grammar in conversational systems
Using construction grammar in conversational systemsUsing construction grammar in conversational systems
Using construction grammar in conversational systems
 
Iulia Pasov, Sixt. Trends in sentiment analysis. The entire history from rule...
Iulia Pasov, Sixt. Trends in sentiment analysis. The entire history from rule...Iulia Pasov, Sixt. Trends in sentiment analysis. The entire history from rule...
Iulia Pasov, Sixt. Trends in sentiment analysis. The entire history from rule...
 
Clojure slides
Clojure slidesClojure slides
Clojure slides
 
Un actor (model) per amico - Alessandro Melchiori - Codemotion Milan 2016
Un actor (model) per amico - Alessandro Melchiori - Codemotion Milan 2016Un actor (model) per amico - Alessandro Melchiori - Codemotion Milan 2016
Un actor (model) per amico - Alessandro Melchiori - Codemotion Milan 2016
 
HarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the peopleHarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the people
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structures
 
Password Management
Password ManagementPassword Management
Password Management
 
Artificial intelligence(introduction)
Artificial intelligence(introduction)Artificial intelligence(introduction)
Artificial intelligence(introduction)
 
Introduction-To-AI(L-1).pdf
Introduction-To-AI(L-1).pdfIntroduction-To-AI(L-1).pdf
Introduction-To-AI(L-1).pdf
 
BSidesLV 2013 - Using Machine Learning to Support Information Security
BSidesLV 2013 - Using Machine Learning to Support Information SecurityBSidesLV 2013 - Using Machine Learning to Support Information Security
BSidesLV 2013 - Using Machine Learning to Support Information Security
 
Scylla Summit 2022: Predicting the Past
Scylla Summit 2022: Predicting the PastScylla Summit 2022: Predicting the Past
Scylla Summit 2022: Predicting the Past
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014
 
Nicholas Dorans - The Evolution of Passwords
Nicholas Dorans - The Evolution of PasswordsNicholas Dorans - The Evolution of Passwords
Nicholas Dorans - The Evolution of Passwords
 
Hybrid concurrency patterns
Hybrid concurrency patternsHybrid concurrency patterns
Hybrid concurrency patterns
 

More from Dan Kaminsky

More from Dan Kaminsky (7)

Chicken
ChickenChicken
Chicken
 
Chicken Chicken Chicken Chicken
Chicken Chicken Chicken ChickenChicken Chicken Chicken Chicken
Chicken Chicken Chicken Chicken
 
Some Thoughts On Bitcoin
Some Thoughts On BitcoinSome Thoughts On Bitcoin
Some Thoughts On Bitcoin
 
Interpolique
InterpoliqueInterpolique
Interpolique
 
Bh eu 05-kaminsky
Bh eu 05-kaminskyBh eu 05-kaminsky
Bh eu 05-kaminsky
 
Bo2004
Bo2004Bo2004
Bo2004
 
Gwc3
Gwc3Gwc3
Gwc3
 

Cryptomnemonics: Using Names and Stories to Represent Data

  • 1. Weaponizing Noam Chomsky: Symbols and Grammars Are Fun Dan Kaminsky Director Of Penetration Testing
  • 2. Introduction • Many physicists would agree that, had it not been for congestion control, the evaluation of web browsers might never have occurred. In fact, few hackers worldwide would disagree with the essential unification of voice-over-IP and public private key pair. In order to solve this riddle, we confirm that SMPs can be made stochastic, cacheable, and interposable. – Rooter: A Methodology for the Typical Unification of Access Points and Redundancy
  • 3. That was BS. • That also got accepted into a con. – Automatically generated from a context free grammar – I’ve been working too hard all these years  – “Be quiet, or I will replace you with a very small shell script” • This talk is a bit of a remix – Patterns and symbols are interesting me as of late • Automatic determination of both is difficult, interesting, and unsolved – Integration into human symbolic systems promises particularly interesting results – So we’re going to explore a bit.
  • 4. Language Is Cool • Language: A protocol for the transmission of concepts and intentions between humans – Documentation is not available – Documentation does not really work – Learned through exposure and use • Significant amount of internal structure, redundancy, and consistency • Who makes language? – Kids. • Adults coin words here and there, but when they’re forced to invent a common language to get things done, it’s called a Pidgin, and it’s terrible • The kids hear it, and invent a Creole – a merged language of significantly greater accuracy and depth • Children make languages • Adults make “working” languages • Programmers make barely working languages
  • 5. Programmers Talk Funny • Fundamentally two languages that programmers must use – Code to Human: “User Interface Design” – Code to Code: “File and Network Protocol” • UI is a protocol. – This is obvious in retrospect. • There are two things this talk hopes to do – Correct some of the Code->Human protocols that are out there – Use human strategies to analyze Code to Code communications • Learning a protocol is learning a language. Humans do not learn languages quickly, and thus we’re resource bound on fuzzer development • It’s 2007 – most parsers remain unfuzzed (and thus just waiting to be exploited)
  • 6. Weaponizing Noam? • “An early inference procedure was described by Chomsky and Miller (1957a), as reported in Solomonoff (1959). Chomsky proposed a method for detecting loops in finite state languages. The approach requires a set of valid sentences, and an oracle that determines whether a sentence is in the language. The algorithm proceeds by deleting part of a valid sentence and asking the oracle whether the sentence is still valid. If it is, the deleted part is reinserted into the sequence and repeated, so that it appears twice. If the sentence is still in the language, a cycle has been detected.” – Inferring Sequential Structure, Craig Nevill-Manning, 1996 – This couldn’t POSSIBLY be useful for building a structure for a dumb fuzzer to operate against. • Instead of seeing if the parser crashes, just see if it considers the input valid
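The delete-then-double procedure quoted above can be sketched in a few lines. This is a minimal illustration, not the original algorithm's implementation: the function name `find_loops` and the regex-based `toy_oracle` are assumptions made for the example, and a real target would replace the oracle with "does the parser accept this input?".

```python
import re
from typing import Callable, List, Tuple


def find_loops(sentence: str, oracle: Callable[[str], bool]) -> List[Tuple[int, int, str]]:
    """Return (start, end, chunk) for every chunk that can be both deleted
    and doubled while the oracle still accepts the sentence."""
    loops = []
    n = len(sentence)
    for start in range(n):
        for end in range(start + 1, n + 1):
            chunk = sentence[start:end]
            # Step 1: delete the chunk and ask the oracle.
            deleted = sentence[:start] + sentence[end:]
            if not oracle(deleted):
                continue
            # Step 2: reinsert the chunk twice and ask again.
            doubled = sentence[:start] + chunk + chunk + sentence[end:]
            if oracle(doubled):
                loops.append((start, end, chunk))
    return loops


def toy_oracle(s: str) -> bool:
    # Hypothetical stand-in for a parser: accepts a, then any number of "bc", then d.
    return re.fullmatch(r"a(bc)*d", s) is not None


if __name__ == "__main__":
    for start, end, chunk in find_loops("abcbcd", toy_oracle):
        print(start, end, repr(chunk))
```

The brute-force scan asks the oracle about every substring, so it is quadratic in the sentence length. It also reports rotations of the same cycle (for example `cb` as well as `bc`), which a real tool would probably want to merge.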
  • 7. Topics Of Discussion • Further Explorations in Cryptomnemonics – Using Names and Syllables for password representation • Sequitur-XML: Merging automated structure discovery with the standard architecture for structure representation – …which turned out to be quite nice for controlled structure destruction • Exploring Dotplots – Building a GUI – Exploring other domains
  • 8. Intro To Symbol Sets • Machine Symbols – Data (AA, BB, CC) – Code (a(), b(), c()) – Formats (All, Bad, Code)  • Human Symbols – Letters (A, B, C) – Glyphs () – Syllables (Ah, Bee, See) – Words (Amazing, Bear, Clear) – Native Names (Alice, Bob, Charlie) – Things (Axe, Bone, Chimpanzee) – Actions (Ask, Buy, Compute) – Colors (Aquamarine, Blue, Chartreuse) • Machines can use formats, but their native format is raw bits • Humans have no concept of “raw bits” – everything must be contextual – Long history in mnemonics of mapping arbitrary data to a
  • 9. Different Domains Have Different Strengths – See Visual Processing
  • 10. Cryptomnemonics • Definition: The study of human memory, as it applies to cryptographic systems • Developing in response to this: – $ ssh dan@blah The authenticity of host 'blah (1.2.3.4)' can't be established. RSA key fingerprint is 09:a9:b1:99:84:17:7d:ba:c6:55:46:5a:17:f8:83:01. Are you sure you want to continue connecting (yes/no)? • The machine is acting like it’s integrating with another machine. It’s not, and that matters. • Humans can handle hexadecimal characters – but not that many.
  • 11. Hex Confusion • After somewhere between 2 and 5 characters, most of you will fail to see a difference – Positional Bias: Expect to see certain things at the beginning or end – Value Confusion: Letter vs. Number is remembered before the actual value of letter or number • Glyph confusion – “Despair” Effect • Nobody could possibly detect a change, so it’s not rational to even try
  • 12. Classes of Memory • There are three classes of memory, at least to the degree that is useful in cryptography – Rejection: “I’ve never seen that before” – Recognition: “It’s that one, not that other one” – Recollection: “Let me describe it to you.” • SSH just requires rejection – Hex is not rejectable – Can we try another domain?
  • 13. Exploring The Nymic Domain • $ ssh dan@blah Key Data: julio and epifania dezzutti luther and rolande doornbos manual and twyla imbesi dirk and cuc kolopajlo omar and jeana hymel The authenticity of host 'blah (1.2.3.4)' can't be established. Are you sure you want to continue connecting (yes/no)? – Alternate mapping for 09:a9:b1:99:84:17:7d:ba:c6:55:46:5a:17:f8:83:01. – Proposed last year as a potential solution • There is nothing more contextual than a story, and there is nothing more stable in a story than the names of its participants – Stories retold are stories remembered – we need to be exposed to the above group time and time again to be able to reject any deviation from it
  • 14. How To Derive Names? • Original Model – Take US Census Data – Remove any names that may be easily confused with one another: • Easy: Bob v. Bobby • Hard: Bob v. Robert • Celebrity Naming – “Marge Godwin” • Archaic Naming – Use constructs from various ancient languages • Mechanistic Constructs – Bubble Babble: 64 bits = xegoz-tosys-vusik-masar – Koremutake: 64 bits = darujifahe stygrifrejy
  • 15. How Many Names? • Unclear what the crossover point is between harm from more names and benefit from more entropy per name – Present system is 512 male names, 512 female names, and 1024 last names from US Census – 256/256/256 would provide 24 bits per couple instead of 40, and the names would be more recognizable. Better? How much better? • The more names, the more of a problem position becomes – We’re sensitive to names, but without a story context, there are no roles locking people to being the first or the second or the third. So the more names, the more bits we lose to reader confusion. • How many bits are necessary? Depends on what for.
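A rough sketch of the name mapping may help here. Everything in it is an assumption made for illustration: the three four-entry lists are toy stand-ins built from names on the earlier slide (not the Census lists), and the digit order used by `encode` is just one possible convention. Each couple is treated as one "digit" made of a male first name, a female first name, and a shared last name.

```python
import math
from typing import List

# Toy name lists (hypothetical; the real system draws from US Census data).
MALE = ["julio", "luther", "dirk", "omar"]
FEMALE = ["epifania", "rolande", "twyla", "jeana"]
LAST = ["dezzutti", "doornbos", "imbesi", "hymel"]


def bits_per_couple(n_male: int, n_female: int, n_last: int) -> float:
    """Entropy carried by one 'male and female last' couple."""
    return math.log2(n_male) + math.log2(n_female) + math.log2(n_last)


def encode(value: int, n_couples: int) -> List[str]:
    """Spell an integer as a list of couples, least significant couple first."""
    couples = []
    for _ in range(n_couples):
        value, l = divmod(value, len(LAST))
        value, f = divmod(value, len(FEMALE))
        value, m = divmod(value, len(MALE))
        couples.append(f"{MALE[m]} and {FEMALE[f]} {LAST[l]}")
    return couples


def decode(couples: List[str]) -> int:
    """Inverse of encode()."""
    value = 0
    for couple in reversed(couples):
        male, rest = couple.split(" and ")
        female, last = rest.split(" ")
        value = value * len(MALE) + MALE.index(male)
        value = value * len(FEMALE) + FEMALE.index(female)
        value = value * len(LAST) + LAST.index(last)
    return value


if __name__ == "__main__":
    print(bits_per_couple(256, 256, 256))  # the 256/256/256 case from the slide
    print(encode(0x2A5, 2))
```

With these toy lists each couple carries only 6 bits, so the value passed to `encode` must fit in `6 * n_couples` bits for the round trip to be lossless.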
  • 16. Flipping The Bits • SSH Key Representation is not the only thing we can do with this technique – In fact, it’s not even the most pressing problem • Passwords are in crisis right now – PKI failed, deal with it • There’s an entire alternate history where XSS enjoys the benefits of your legal credentials being available and shared – People are being asked to generate, frequently, high entropy non repeated passwords • They’re repeating them • They’ve exhausted personal entropy, and have moved to geometric progressions to evade lameness checks – &(*uoiJKL798 – Fixed prefix
  • 17. A Fundamental Shift • Generate passwords for your users. – “But they’re hideous, nobody will remember what we automatically generate” • You’re theoretically forcing them to generate those hideous passwords, off the top of their head • Use alternate symbolic domains to coat the password entropy you require in a form users can accept – Why yes, this is exactly like a tunnel. We’re tunneling entropy over a baby name book
  • 18. Change Your Ways • Modify your validation logic to accept long passwords without weird character sets – Punctuation and case sensitivity are “weak symbols” • It is easier to chain together common symbols in a common way, than it is to link together arbitrary bytes out of context – This is a fundamental difference between human symbol manipulation and the operations of computers
  • 19. How Many Bits Do We Really Need? • Hash Validation: 80-100 bits – We don’t have a birthday paradox problem with hashes, since one of them is fixed. – 2^80-2^100 work efforts are outside the range of feasibility at this time • Password Entry: 24 bits for low security, 36-48 bits for high security – Need enough to make brute force enumeration across all users infeasible – For each username, try one possible password – 48 bit is what we’re at with punctuation/case/number/8 character.
  • 20. Limits to alternate symbol domains • We lose the ability to measure “nextness” – 0x10 is one less than 0x11 – Bob is…how much less than Charlie? • Data may become variable length – Bob is three characters, Charlie is seven – Harder to see patterns • Has trouble scaling to any large number of bits. – We can’t analyze even mildly large systems using this translation layer
  • 21. What We’ve Been Using (Warning: Sucks.)
  • 22. N’est-ce pas Non Sequitur • Sequitur: Linear Time Pattern Finder – Creates hierarchical Context Free Grammars from arbitrary input • Compression Algorithm in which you can “look under the covers” to see what’s going on • Created by Craig Nevill-Manning as his PhD thesis a decade ago – He’s now Chief Research Scientist at Google
  • 23. What’s New: Sequitur-XML • echo 'aabbabc' | ./sequitur_simple.exe • Why translate: Gives us much easier to manipulate output – C is very good for generating the tree – Other languages are very good for analyzing / modifying the tree • XML is a (shockingly) good machine format for representing structure
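To make the "grammar as XML" idea concrete, here is a small sketch. It is not Sequitur: it swaps in a much simpler offline scheme (repeatedly replacing the most frequent adjacent pair, in the spirit of Re-Pair), and the element names `grammar`, `rule`, `ref`, and `lit` are invented for this example rather than taken from the actual Sequitur-XML output.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def _replace(seq, pair, ref):
    """Replace non-overlapping occurrences of pair with ref, left to right."""
    out, i, hits = [], 0, 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(ref)
            i += 2
            hits += 1
        else:
            out.append(seq[i])
            i += 1
    return out, hits


def infer_grammar(data: str):
    """Return (root_sequence, rules). Literals are str, rule references are int."""
    seq, rules = list(data), {}
    while True:
        counts = Counter(zip(seq, seq[1:]))
        for pair, count in counts.most_common():
            if count < 2:
                continue
            rule_id = len(rules) + 1
            new_seq, hits = _replace(seq, pair, rule_id)
            if hits >= 2:  # only keep rules that are actually reused
                rules[rule_id] = pair
                seq = new_seq
                break
        else:
            return seq, rules  # no pair repeats any more


def expand(symbols, rules) -> str:
    """Undo the grammar: expand references back into literals."""
    return "".join(expand(rules[s], rules) if isinstance(s, int) else s for s in symbols)


def to_xml(seq, rules) -> str:
    root = ET.Element("grammar")

    def emit(parent, symbols):
        for s in symbols:
            if isinstance(s, int):
                ET.SubElement(parent, "ref", id=str(s))
            else:
                ET.SubElement(parent, "lit").text = s

    emit(ET.SubElement(root, "rule", id="0"), seq)  # rule 0 is the root sequence
    for rule_id, body in rules.items():
        emit(ET.SubElement(root, "rule", id=str(rule_id)), body)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    seq, rules = infer_grammar("aabbabc")
    print(to_xml(seq, rules))
```

Once the grammar is a tree of `rule` and `ref` elements, any XML-aware language can walk, count, or rewrite the rules, which is the division of labor the slide describes.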
  • 24. Early Work: Syntax Highlighting Using Compression Depth
  • 25. What’s Actually Going On? • (0) -> … (73),b4,(73),ca,(73),e6,(73),02,(74),18, (74),2c,(74),4a,(74),5c,(74),6e,(74),80,(74),98, (74),b0,(74),c8,(74),e8,(74),fc,(74),10,(75),20, (75),30,(75),40,(75),50,(75),64,(75),82,(75),90, (75),9e,(75) … (84),d6,(84),ee,(84),0c,(85),28,(85),3c,(85),4e, (85),66,(85),7e,(85),8c,(85),9e,(85),ac,(85),be, (85),ca,(85),ea,(85),08,(86),26,(86),44,(86),56, (86),6a,(86),7c,(86),8a,(86),a6,(86),b6,(86),cc, (86),de,(86),02,(87) • Repeated sequence, single byte literal. Repeated sequence, single byte literal. Rinse, lather, repeat.
  • 26. Where Things Get Most Interesting…Live Symbol Browsing!
  • 27. Browsing HOWTO • For each entry in the root node, – If it’s a literal, color it white – If it’s part of a reference, color it red – If it’s clicked, color it and every other instance of that reference blue • A little buggy • Present implementation DOES NOT SCALE • But effective!
  • 28. Symbol Links: Where To Go From Here • Turns code on left into symbolic set on right; it’s easy then to link the symbols together as per the graph. • This works for non-textual data • Sequitur imputes meaningful symbols from arbitrary input data
  • 29. Context Free Grammar Fuzzer: THE CFG9000 • Reduce input data to a stream of symbols • Fuzz data at the symbol level, rather than at pure bytes – Shuffle – Drop – Repeat – Uniform Corrupt • Consistently corrupt all instances of a given symbol • <HEAD> -> <FOOBAR> • Partially ported to the new XML framework
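The four symbol-level operations listed above can be sketched as follows. This is an illustration of the idea only, not CFG9000's code: the function names, the seeded `random.Random`, and the hand-written token list are all assumptions, and the input is assumed to be a non-empty list of symbols already produced by some grammar or tokenizer.

```python
import random
from typing import List


def shuffle(symbols: List[str], rng: random.Random) -> List[str]:
    """Reorder the whole symbol stream."""
    out = list(symbols)
    rng.shuffle(out)
    return out


def drop(symbols: List[str], rng: random.Random) -> List[str]:
    """Remove one randomly chosen symbol."""
    i = rng.randrange(len(symbols))
    return symbols[:i] + symbols[i + 1:]


def repeat(symbols: List[str], rng: random.Random, times: int = 3) -> List[str]:
    """Repeat one randomly chosen symbol `times` times in place."""
    i = rng.randrange(len(symbols))
    return symbols[:i] + [symbols[i]] * times + symbols[i + 1:]


def uniform_corrupt(symbols: List[str], target: str, replacement: str) -> List[str]:
    """Consistently replace every instance of one symbol."""
    return [replacement if s == target else s for s in symbols]


if __name__ == "__main__":
    rng = random.Random(1234)  # fixed seed so a crash can be reproduced
    page = ["<HTML>", "<HEAD>", "title", "</HEAD>", "<BODY>", "text", "</BODY>", "</HTML>"]
    print("".join(uniform_corrupt(page, "<HEAD>", "<FOOBAR>")))
    print("".join(repeat(page, rng)))
    print("".join(drop(page, rng)))
    print("".join(shuffle(page, rng)))
```

Working on symbols rather than bytes means a mutation keeps each token intact, so the parser is more likely to get past its first sanity checks and into deeper code.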
  • 30. Sample CFG9000 Output • calculate_rule_usage(p->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rulep->rule() } • calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(calculate_rule_usage(p->rule());
  • 33. Why We Moved To XML In The First Place • XML is a (potentially) validating format – Has the concept of schemas – NOT THAT THEY’RE ALWAYS OR EVEN OFTEN CHECKED • Schema validation is expensive • We should be able to use XML Schemas to guide fuzzers – WS-Bang • Excellent tool for bashing Web Services frameworks • Given a WSDL file (Web Services Description Language), fuzz it – Untidy: Mostly just attacks XML parsers, doesn’t hit the structure
  • 34. Automatically Generating Schemas? • We can autogenerate Schemas from XML (to some degree) – Relaxer – Trang – Tends to capture structure better than content • Doesn’t appear to automatically determine what values are valid for each field • Does provide framework for automatically extracting all instances of what can go where
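The "what can go where" extraction can be approximated with a short walk over the document. The sketch below is an assumption-laden illustration and does not reproduce what Relaxer or Trang do: `collect_structure` is a made-up name, and `SAMPLE` is a hand-trimmed fragment in the style of the Wireshark export rather than real tool output.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified fragment loosely modeled on a Wireshark XML export.
SAMPLE = """<packet>
  <field name="smb.fs_attr.css" size="4" pos="126" show="1"/>
  <field name="smb.fs_attr.cpn" size="4" pos="126" show="1"/>
</packet>"""


def collect_structure(xml_text: str):
    """Record which child tags appear under each tag, and which values
    each (tag, attribute) pair has been observed to take."""
    children = defaultdict(set)
    attr_values = defaultdict(set)
    for elem in ET.fromstring(xml_text).iter():
        for child in elem:
            children[elem.tag].add(child.tag)
        for name, value in elem.attrib.items():
            attr_values[(elem.tag, name)].add(value)
    return children, attr_values


if __name__ == "__main__":
    children, attr_values = collect_structure(SAMPLE)
    print(dict(children))
    for key, values in sorted(attr_values.items()):
        print(key, sorted(values))
```

The observed value sets are only a lower bound on what is valid, which matches the caveat above: structure is recovered, but the tool cannot tell which unseen values the parser would also accept.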
  • 35. Wireshark Demo: From… • <field show="QUERY_FS_INFO Data" size="20" pos="126" value="ff002700ff000000080000004e00540046005300" > – - <field show="FS Attributes: 0x002700ff" size="4" pos="126" value="ff002700"> – <field name="smb.fs_attr.css" showname=".... .... .... .... .... .... .... ...1 = Case Sensitive Search: This FS supports CASE SENSITIVE SEARCHes" size="4" pos="126" show="1" value="1" unmaskedvalue="ff002700" /> – <field name="smb.fs_attr.cpn" showname=".... .... .... .... .... .... .... ..1. = Case Preserving: This FS supports CASE PRESERVED NAMES" size="4" pos="126" show="1" value="1" unmaskedvalue="ff002700" /> – <field name="smb.fs_attr.uod" showname=".... .... .... .... .... .... .... .1.. = Unicode On Disk: This FS supports UNICODE NAMES" size="4" pos="126" show="1" value="1" unmaskedvalue="ff002700" />
  • 36. Wireshark Demo: To: • - <xsd:complexType name="field"> – - <xsd:sequence> – <xsd:element maxOccurs="unbounded" minOccurs="1" name="field" type="field" /> – </xsd:sequence> – <xsd:attribute name="name" type="xsd:token" /> – <xsd:attribute name="pos" type="xsd:int" /> – <xsd:attribute name="show" type="xsd:normalizedString" /> – <xsd:attribute name="showname" type="xsd:normalizedString" /> – <xsd:attribute name="size" type="xsd:int" /> – <xsd:attribute name="value" type="xsd:token" /> – <xsd:attribute name="hide" type="xsd:token" /> – <xsd:attribute name="unmaskedvalue" type="xsd:token" /> – </xsd:complexType>
  • 37. Could we automatically extract structure from Sequitur-XML? • “This sequence of bytes can be reconstructed with these other sequences of bytes” – No tree relationship – anything can link in anything – Need to have the content awareness Relaxer lacks to get anything useful – Where might we get this content awareness?
  • 38. What Might We Borrow From Linguistics? • Can we use linguistic approaches? – Common Elements • Humans: Subjects, Verbs, etc. • Machines: Delimiters, Length Fields, ASCII/Unicode, x86, Padding to Four Byte Boundaries – Symbol Interrelationships • Humans: We take word boundaries for granted – Until we’re listening to a foreign language, and wonder why there aren’t spaces between words • Machines: File formats rarely make it easy to see where one symbol starts and another begins • Does one symbol always appear before another? Does one symbol always find itself surrounded by two others?
  • 39. How To Think Of Sequitur • Any time you’re manipulating data as bytes, think of manipulating it as symbols – N-gram histograms on bytes -> N-gram histograms on symbols – Bayesian probabilities on characters -> Bayesian probabilities on symbols • Sequitur is not necessarily the best way to determine a grammar – Suffix Trees may be more accurate – Kieffer-Yang (redundant symbol extraction) a very good post-processing step to add – Ray removes In-Memory Grammar Requirement – Not all other solutions are linear time, though • Kind of cool to have a grammar that covers a 750GB hard drive undergoing forensics
  • 40. Fuzzy Wuzzy Wuz A Symbol • Symbol analysis systems (language translators, etc) have issues w/ TMTOWTDI (There’s More Than One Way To Do It) – Very similar messages can be encapsulated in very different ways – Very similar messages can be encapsulated in very similar, but not identical ways • Sequitur only handles exact matches – fuzzy grammar imputation doesn’t appear to exist yet – We must develop this fuzziness to create byte-sourced XML schemas • It is a pretty wild concept, so – Are there any systems for analyzing complex, unequal but somewhat related sets of symbols?
  • 42. What Exactly Are We Doing • Jonathan Helfman’s “DotPlot Patterns: A Literal Look at Pattern Languages” offers an introduction • Instead of “to, be, not” etc, we use chunks of data from arbitrary files – Instead of demanding perfect equality, we measure how similar the chunks are – If most of the bytes are in most of the same places, it’s pretty similar, if most are different, pretty dissimilar
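The chunk comparison described above might look roughly like this. It is a sketch under assumptions, not the talk's tool: the names `chunks`, `similarity`, and `dotplot` are invented here, and "similarity" is taken to mean the fraction of byte positions that match, which is one plausible reading of "most of the bytes are in most of the same places".

```python
from typing import List


def chunks(data: bytes, window: int) -> List[bytes]:
    """Split data into consecutive window-sized chunks (the last may be short)."""
    return [data[i:i + window] for i in range(0, len(data), window)]


def similarity(a: bytes, b: bytes) -> float:
    """Fraction of positions holding the same byte in both chunks (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    same = sum(x == y for x, y in zip(a, b))
    return same / max(len(a), len(b))


def dotplot(a: bytes, b: bytes, window: int = 32) -> List[List[float]]:
    """Matrix where cell [y][x] compares chunk y of `a` with chunk x of `b`.
    Passing the same data twice compares a file against itself."""
    ca, cb = chunks(a, window), chunks(b, window)
    return [[similarity(x, y) for y in cb] for x in ca]


if __name__ == "__main__":
    data = b"ABCDABCDxxxxABCE"
    for row in dotplot(data, data, window=4):
        print(" ".join(f"{cell:.2f}" for cell in row))
```

Rendering each cell as a pixel whose brightness is the similarity gives the picture; passing two different inputs gives the file-versus-file comparison instead of a self-comparison.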
  • 43. New: Video Analysis! (Nine Inch Nails, “Closer”)
  • 44. More Video Analysis: Cibo Matto / Michel Gondry’s Palindromatic “Sugar Water”
  • 45. We’ve figured out what some of these patterns mean…
  • 46. But some code just comes out strange.
  • 47. So How Might This Be Useful? • A) Format Identification – 1) Do different files appear different, and does the appearance reflect the existence of internal structure? – 2) Do different instances of the same file format appear similar? – 3) Does one format embedded in another make itself apparent? • B) Fuzzer Guidance – 1) Can we locate the actual byte offsets where one section ends and another begins? – 2) Can we visualize and compare fuzzer operations via Dotplots?
  • 48. Format Identification • 1) Do different files appear different, and does the appearance reflect the existence of internal structure? • 2) Do different instances of the same file format appear similar? • 3) Does one format embedded in another make itself apparent?
  • 52. SMBTorture Traffic (Packets – Note, Stop/Start Is Visible)
  • 54. Chromosome 22 (This is, after all, a genomics hack)
  • 55. The Legend Of Zelda
  • 56. Format Identification • 1) Do different files appear different, and does the appearance reflect the existence of internal structure? – Answer: Yes. They do. • 2) Do different instances of the same file format appear similar? • 3) Does one format embedded in another make itself apparent?
  • 57. Books from Project Gutenberg: Consistent Despite English’s low information content, lack of even mildly related strings causes little self-similarity across symbol clusters
  • 58. US Code: Moderately Consistent Legalese is a massively structured dialect. Symbols appear in very distinct patterns that are more reminiscent of machine code than text.
  • 59. HTML: Consistent HTML repeats smaller symbols (tags) and larger symbol clusters (via template engines) regularly. This shows up visually as a tightly repeating pattern.
  • 60. Java Class Files (Compared): Mildly Consistent Binary code (be it bytecode or x86) tends to be very structured. Still, we are dependent on both the content and the compiler to generate distinct patterns.
  • 61. x86: Consistent (In Sections) x86 tends not to be handwritten; as such complex instructions are emitted in a highly structured form.
  • 62. Exception? • 64 kilobyte graphical demonstration • Run through a packer • Compression removes patterns
  • 63. NES Games 6502 Assembly Tends To Show Consistent Patterns, But…
  • 64. Mario Games Look Rather Different. 1) Output is highly dependent on the compiler 2) Output is highly dependent upon the actual content File formats are merely shells for actual content. You are analyzing the content; the format is just syntactic sugar.
  • 65. Format Identification • 1) Do different files appear different, and does the appearance reflect the existence of internal structure? – Answer: Yes. They do. • 2) Do different instances of the same file format appear similar? – Answer: Somewhat. Similar content looks like itself, but you’re measuring the fundamental entropy of the underlying content, not the format of the content itself. • 3) Does one format embedded in another make itself apparent?
  • 66. File Formats Contain Multiple Subformats Another Look At Kernel32.DLL These are all different parts of Kernel32.
  • 67. Quickly Browsing Large Files: Tilt-Shift View • Instead of measuring absolute Y against absolute X, make X relative – Advance through the file going down, look back a number of bytes going right
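One way to read the tilt-shift layout is sketched below. The function names and the choice to measure the look-back in whole chunks (rather than single bytes) are assumptions for the example; the helpers repeat the positional-match similarity idea so the block stands alone.

```python
from typing import List


def chunks(data: bytes, window: int) -> List[bytes]:
    return [data[i:i + window] for i in range(0, len(data), window)]


def similarity(a: bytes, b: bytes) -> float:
    """Fraction of positions holding the same byte in both chunks."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def tilt_shift(data: bytes, window: int = 32, lookback: int = 8) -> List[List[float]]:
    """Row y is chunk y of the file; column x compares it with the chunk
    x positions earlier. Cells that would reach before the start are 0.0."""
    cs = chunks(data, window)
    rows = []
    for y, chunk in enumerate(cs):
        rows.append([
            similarity(chunk, cs[y - x]) if y - x >= 0 else 0.0
            for x in range(lookback)
        ])
    return rows


if __name__ == "__main__":
    for row in tilt_shift(b"ABCD" * 4, window=4, lookback=3):
        print(row)
```

The output has one row per chunk and a fixed number of columns, so memory grows linearly with file size instead of quadratically, which is what makes large files browsable.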
  • 68. Complain All You Want. Hex Still Sucks.
  • 69. Format Identification • 1) Do different files appear different, and does the appearance reflect the existence of internal structure? – Answer: Yes. They do. • 2) Do different instances of the same file format appear similar? – Answer: Somewhat. Similar content looks like itself, but you’re measuring the fundamental entropy of the underlying content, not the format of the content itself. • 3) Does one format embedded in another make itself apparent? – Answer: Yes. Multiple, distinct sections are clearly visible in a way that hex cannot show.
  • 70. Fuzzer Guidance • 1) Can we locate the actual byte offsets where one section ends and another begins? – Why would we want to? • Fuzzers break parsers. • Many subformats to a format, many subparsers to a parser • To a rough level of approximation, fuzzing a single subformat lets you stress a single subparser • So once we split a file up, we can selectively attack one subparser at a time. • 2) Can we visualize and compare fuzzer operations via Dotplots?
  • 71. Simple Math We select an interesting blob from kernel32.dll. The blob is at pixel offset 507x507, and is a square around 570 pixels wide. Window size on viz was 32. 507*32 = 16224: the interesting section starts 16224 bytes into the file. 570*32 = 18240: the interesting section is 18240 bytes long.
  • 72. What’s The Actual Data? dd if=kernel32.dll bs=1 skip=16100 | hexdump - | more
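The arithmetic on this slide is just a scale by the window size. A small sketch, where the helper name `pixel_to_byte_range` is made up for illustration:

```python
def pixel_to_byte_range(pixel_offset: int, pixel_width: int, window: int):
    """Map a square seen in the dotplot back to a byte range in the file.

    Each pixel covers `window` bytes, so both the start and the length
    are scaled by the window size.
    """
    start = pixel_offset * window
    length = pixel_width * window
    return start, length


if __name__ == "__main__":
    start, length = pixel_to_byte_range(507, 570, 32)
    print(start, length)  # 16224 18240
```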
  • 73. Using Hardcorr as a “first knife” to locate interesting-to-fuzz regions
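Once a region has been located, a fuzzer can be confined to it so that, roughly, only one subparser is stressed. The sketch below is a hypothetical region-limited bit flipper, not Mangle.C or any tool from the talk; the function name, the `flips` count, and the seeded random generator are assumptions chosen to keep the example reproducible.

```python
import random


def fuzz_region(data: bytes, start: int, length: int,
                flips: int = 8, seed: int = 0) -> bytes:
    """Flip `flips` random single bits, all inside [start, start+length).

    Bytes outside the region are left untouched, so the rest of the
    file still parses normally.
    """
    rng = random.Random(seed)
    out = bytearray(data)
    end = min(start + length, len(out))
    if start >= end:
        return bytes(out)  # region lies outside the file: nothing to do
    for _ in range(flips):
        pos = rng.randrange(start, end)
        out[pos] ^= 1 << rng.randrange(8)
    return bytes(out)


if __name__ == "__main__":
    original = bytes(40000)  # stand-in for a real file's contents
    mutated = fuzz_region(original, 16224, 18240, flips=5)
    changed = [i for i in range(len(original)) if original[i] != mutated[i]]
    print(min(changed) >= 16224, max(changed) < 16224 + 18240)
```

The offsets in the usage example are the ones derived on the "Simple Math" slide; in practice they would come from whatever region the visualization highlights.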
  • 74. Fuzzer Guidance • 1) Can we locate the actual byte offsets where one section ends and another begins? – Answer: Yes. We can quickly route from the image to the byte offset, through basic arithmetic. • 2) Can we visualize and compare fuzzer operations via Dotplots?
  • 75. Differentials • Major use of dotplots in bioinformatics is to compare one genome against another – Autocorrelation: Compare A to A – Cross-Correlation: Compare A to B • Most files are sufficiently dissimilar that not very interesting structure shows up – Notable exception: Different versions of the same binary
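The difference between autocorrelation and cross-correlation can be shown with a toy dotplot. This sketch assumes fixed-size windows compared by exact equality, which is a simplification picked for the example; the function name `dotplot` and the tiny window are likewise illustrative.

```python
def dotplot(a: bytes, b: bytes, window: int = 4):
    """Return a grid where cell [row][col] is True when the row-th
    window of `a` is identical to the col-th window of `b`."""
    chunks_a = [a[i:i + window] for i in range(0, len(a), window)]
    chunks_b = [b[i:i + window] for i in range(0, len(b), window)]
    return [[ca == cb for cb in chunks_b] for ca in chunks_a]


if __name__ == "__main__":
    old = b"AAAABBBBCCCCDDDD"
    new = b"AAAABBBBXXXXDDDD"  # one window patched
    for name, grid in (("auto", dotplot(old, old)),
                       ("cross", dotplot(old, new))):
        print(name)
        for row in grid:
            print("".join("#" if cell else "." for cell in row))
```

Comparing a file to itself always gives an unbroken main diagonal. Comparing two versions of the same binary gives a diagonal with gaps where the content changed, which is the structure the slide calls out.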
  • 78. Fuzzers: Very Broken Patchers  Mangle.C – Single Bit Differences CFG9000 – Large Scale Reordering
  • 79. Fuzzer Guidance • 1) Can we locate the actual byte offsets where one section ends and another begins? – Answer: Yes. We can quickly route from the image to the byte offset, through basic arithmetic. • 2) Can we visualize and compare fuzzer operations via Dotplots? – Answer: Yes – visual diffing effectively shows differences between files, including differences introduced by various flavors of fuzzers.
  • 80. Conclusions… • Lots of interesting work left to do – Unification of local presence of symbols, and global view of file format • Possible to do dotplots themselves in the symbolic domain – Use of dotplots to segment formats, which thus provides the tree we want for an XML schema • <format> – <blob1 /> – <blob2 /> – <blob3 /> • </format> – More colorful pretty pictures!
  • 81. The Ancient Tongue: TCP/IP • Can’t all be about pretty pictures  • A new problem has popped up: Network oligopolies are threatening to install firewalls that limit or eliminate bandwidth on a per-company basis – Their own media services might be fast, others will be slow – Their own VPN services might be fast, others will be slow • Question: Is it possible to detect and locate devices violating network
  • 82. What’s The Closest Tool We Have? • Firewalk – Mike Schiffman’s Firewall Analysis Tool – Packets elicit an ICMP Time Exceeded error if they reach a router with TTL=0 • TTL is decremented by one for each hop, so if you start low, you can trace the route to a host – A firewalled packet won’t live long enough to reach TTL=0 – So you can locate the firewall, and divine things about its ruleset, based on when your packets stop getting ICMP Time Exceeded
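The firewalking logic can be illustrated with a pure simulation. No packets are sent here and this is not Firewalk's code: `probe` and `locate_firewall` are hypothetical names, and the model assumes the filtering hop drops a blocked probe before it would have answered with Time Exceeded.

```python
from typing import Optional


def probe(ttl: int, path_len: int, firewall_hop: Optional[int]) -> Optional[int]:
    """Simulate one probe. Return the hop that answers with ICMP Time
    Exceeded, or None if nothing comes back.

    Assumption: a filtered probe is dropped at `firewall_hop`, so any
    probe that would need to reach or pass that hop goes silent.
    """
    if firewall_hop is not None and ttl >= firewall_hop:
        return None  # dropped by the filter before TTL hits zero
    if ttl > path_len:
        return None  # reached the destination; no Time Exceeded
    return ttl       # expired at hop number `ttl`


def locate_firewall(path_len: int, firewall_hop: Optional[int]) -> Optional[int]:
    """Raise the TTL one hop at a time; the first silent TTL marks the filter."""
    for ttl in range(1, path_len + 1):
        if probe(ttl, path_len, firewall_hop) is None:
            return ttl
    return None


if __name__ == "__main__":
    print(locate_firewall(path_len=10, firewall_hop=6))
    print(locate_firewall(path_len=10, firewall_hop=None))
```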
  • 83. Limitations of Firewalking • But Firewalk tells us what, not who is blocked…and it tells us nothing about who is allowed to go fast, and who is made to go slow – Suddenly, we devolve to a much older question: Is it possible to find out that a target firewall is, or is not, blocking against or accepting traffic from an arbitrary IP address?
  • 84. TCP Does Speed Measurement • TCP speed analysis done blindly – Endpoints do not negotiate with one another – Everyone sends their packets, routers route what they will. Endpoints need to adjust to what the routers are willing to pass. • Routers communicate with endpoints by dropping their packets • Can we combine this router backchannel w/ Firewalk?
  • 85. In From The Side • What causes packets to drop? – Too many packets • What are we going to do? – Send too many packets • Two channels are set up – A primary channel, which drops packets at some known rate – A secondary channel, whose purpose it is to interfere (or not) with the primary channel • When the secondary interferes with the primary, we get feedback via the primary channel – The traffic composing the secondary channel can come from anywhere, be composed of anything, and can be TTL’d just like in a normal firewalk.
  • 86. The TTL Channel • Normally, you don’t know which router along a path is dropping your packets • If you are the source of the drop-inducing packets, you can control how far your noise goes out – thus, you can discover which router is hitting its limit / censoring your net connection
  • 87. Scorchmarking • Why Scorchmarking? – Routers are burning packets…those that get through might have a scorch mark or two  • Basic Model – Client downloads a file from a site, at some given speed negotiated via TCP. – At the same time, traffic is injected from different IP addresses. This should cause drops. • If it doesn’t, the network is either penalizing the primary channel (easy to drop against) or rewarding the secondary channel (resilient to drops)
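A toy model of the basic scorchmarking measurement, under loud assumptions: the bottleneck shares capacity proportionally between the two channels, rates are in arbitrary packets per second, and the 5% tolerance is arbitrary. None of these names or numbers come from the talk.

```python
def primary_goodput(capacity: float, primary_rate: float,
                    secondary_rate: float,
                    secondary_reaches_bottleneck: bool) -> float:
    """Throughput the primary channel sees at a shared bottleneck.

    Assumption: when the link is oversubscribed, drops hit both
    channels in proportion to their offered load.
    """
    offered = primary_rate
    if secondary_reaches_bottleneck:
        offered += secondary_rate
    if offered <= capacity:
        return primary_rate
    return primary_rate * capacity / offered


def interference_detected(baseline: float, loaded: float,
                          tolerance: float = 0.05) -> bool:
    """True when the primary slowed down by more than `tolerance`."""
    return loaded < baseline * (1 - tolerance)


if __name__ == "__main__":
    baseline = primary_goodput(100, 80, 0, True)
    passed = primary_goodput(100, 80, 80, True)     # secondary gets through
    filtered = primary_goodput(100, 80, 80, False)  # secondary is dropped early
    print(interference_detected(baseline, passed))
    print(interference_detected(baseline, filtered))
```

If injected traffic from some source address never slows the primary download, that is the signal the slide describes: the network is treating the two channels differently.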
  • 88. Advanced Scorchmarking [0] • Having to depend on a client is lame – Wouldn’t it be nice if we could scan the Internet for these servers? • What fundamental service is a receiving client providing? – It is acknowledging our traffic – letting us know how much it received, and how many milliseconds it took to receive it • Aren’t there other ways we could extract the same data from hosts?
  • 89. Advanced Scorchmarking [1] • What else will acknowledge receiving traffic from us? – TCP Servers • Sting, from Stefan Savage, used this to great effect – DNS Servers – Routers. • Supposedly, routers won’t send more than a certain number of ICMP Time Exceeded packets per second • In reality, they seem to answer with ICMP Time Exceeded for however much you throw at them • Even if they didn’t, you could use the difference in ICMP Time Exceeded rates between Primary and Secondary channel, to determine whether interference was showing up. • Everyone’s got a NAT – so you can query everyone for whether certain sorts of traffic are being blocked to them
  • 90. Advanced Scorchmarking [2] • So, yes. – You can scan for violations of Network Neutrality – You can find networks that are blocking or passing particular IP ranges • It’s not exactly efficient though • Neutrality violations are easier to find than the standard FW case – Firewalls are normally between the WAN and the LAN (Slow Net -> FW -> Fast Net) – Neutrality violators are mid-WAN (Slow Net -> FW -> Slow Net -> Fast Net) – Easier to overload the slow net after the firewall • Boxes with max TTL rates override this
  • 91. Speed Limits • Fundamental Problem: Have to max out bandwidth on the link to trigger the backchannel – No packets dropping, no data – Means you have to DoS a link – not scalable/legal • Potential Solution: Find capped acknowledgers – The mythical ICMP Time Exceeded rate limit works well • Primary and Secondary channel both eliciting ITE’s • When secondary channel gets a packet through, it takes up a slot on the primary channel’s • ITE is perfect, since you can TTL limit any packet • Depends on the firewall passing the primary’s ITE’s • Maybe Linux / NATs actually implement rate limits? – Another option: What if we have code on the client?
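The "capped acknowledger" idea can be sketched as a one-second simulation of a router with a fixed ICMP Time Exceeded (ITE) budget. This is a hypothetical model: the function name, the strict alternation of probe arrivals, and the first-come-first-served budget are assumptions for illustration, not measured router behavior.

```python
def ite_replies_to_primary(limit: int, primary_probes: int,
                           secondary_probes: int,
                           secondary_passes: bool) -> int:
    """Count ITE replies the primary channel gets in one second.

    The router answers at most `limit` expiring probes. Secondary
    probes only compete for those slots if the firewall lets them
    through to the router.
    """
    arrivals = []
    for i in range(max(primary_probes, secondary_probes)):
        if i < primary_probes:
            arrivals.append("primary")
        if secondary_passes and i < secondary_probes:
            arrivals.append("secondary")
    answered = arrivals[:limit]  # only the first `limit` get a reply
    return answered.count("primary")


if __name__ == "__main__":
    print(ite_replies_to_primary(10, 10, 10, secondary_passes=True))
    print(ite_replies_to_primary(10, 10, 10, secondary_passes=False))
```

The appeal of this variant is that the signal comes from a small reply budget rather than from saturating the link, so far less traffic is needed to see whether the secondary channel is getting through.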
  • 92. Windows Media Player: More Than Just DRM. Really! • Bulk Transfer: RTP – Runs over Unicast UDP – Yes, the same Unicast UDP that penetrates NAT so well! • Flow Control / Quality Monitoring: RTCP • No technical reason RTCP needs to go back to the same address that RTP stream is coming from – So: We pretend to provide media streams from all sorts of sites, and use WMP to collect traffic stats for us  • It might work…