1. A New Year in Data Science:
ML Unpaused
Data Day Texas
Austin, 2015-01-10
Paco Nathan, @pacoid
2. Observations about Machine Learning, Data Science,
Big Data, Open Source, Cluster Computing, Notebooks,
etc., over the past year … plus, a look ahead
4. Backstory: The Sun Also Rises
Some wake early in
the morning and go
build buildings
6. Backstory: The Sun Also Rises
Some gaze into the
heavens, sit back,
and explain the
process…
Clearly, provably,
our Sun revolves
around the Earth
at an observable
rate
8. Backstory: The Sun Also Rises
Others create and
evaluate models to
predict the Earth’s
orbit of the Sun
9. Backstory: The Sun Also Rises
Sometimes, when
the sky gods become
angry and obscure
the Sun as our due
punishment…
We grow scared and
react: sacrifices must
be offered, our plans
must change, etc.
These points are what
I’d like to discuss today
13. Feel free to disagree, but I find that definition
to be flawed…
1. That ignores DevOps (how’s that working out?)
and Visualization/Design (ditto)
2. When the CEO asks you to help explain why
revenue nose-dived over the past month…
neither field has a clue about how to model
business phenomena
Whither Data Science?
16. Software Engineering:
implement and test a model that somebody selected
…almost ignores the matter of modeling entirely,
at least since old-school types like Dijkstra
Statistics:
measure and justify a model that somebody selected
…was never particularly good at teaching how to
model problems – as two renowned statisticians,
William Cleveland and Leo Breiman, noted
Whither Data Science?
Both fields are necessary,
but not sufficient
18. The Thorn in the Side of Big Data: too few artists
Christopher Ré, Stanford
safaribooksonline.com/library/view/strata-conference-santa/9781491900321/part92.html
Whither Data Science?
“You should think
about features and
not algorithms”
21. Floyd Marinescu observed about the aftermath
of EJBs in Brief History…
Intended for building framework components,
e.g., for IBM, Oracle, Sun, but not many others
Based on RMI, prior to notions
like RESTful web services
Enterprise Java Beans: Lessons from hate-watch reality television
22. Maybe a handful of people in the world would
ever actually need to use EJBs, but those few
people wanted a spec
Then, for tragic political reasons (MSFT envy),
Sun Microsystems made EJBs prominent in
their Java APIs
Enterprise Java Beans: Lessons from hate-watch reality television
23. Fortunately, we evolved: Spring, JBoss, etc.,
those came along as relatively more sane tech
Now we see the Docker thing soar, with notions
such as microservices displacing legacy cruft
(BTW, if you haven’t yet, check out Weave)
Enterprise Java Beans: Lessons from hate-watch reality television
24. I mention this because, to me, EJB represented
a convoluted form of template thinking:
developing complex web apps
for the sake of
developing complex web apps
Enterprise Java Beans: Lessons from hate-watch reality television
25. Enterprise Java Beans: Lessons from hate-watch reality television
IRL developers and template thinking don’t
determine public policy… right?
26. Enterprise Java Beans: Lessons from hate-watch reality television
To paraphrase Dean Wampler, consider WordCount,
a simple app written for MapReduce in Hadoop …
~50 lines of unapologetic Java that feels hella like
writing EJBs:
27. Enterprise Java Beans: Lessons from hate-watch reality television
Compare that with functional programming, where
the same WC app is three lines of easily-read Scala
when run in Apache Spark:
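The slide's code listing did not survive extraction. As a rough stand-in, here is the same word-count shape (flatten lines into words, then count per word) sketched in plain Python rather than the Scala/Spark API the talk refers to. The `word_count` helper and the sample lines are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import chain


def word_count(lines):
    """Count word occurrences across an iterable of text lines."""
    # "flatMap" step: split each line into words, flatten into one stream
    words = chain.from_iterable(line.split() for line in lines)
    # "map" + "reduceByKey" step: tally one count per occurrence of each word
    return Counter(words)


counts = word_count(["to be or not to be", "that is the question"])
print(counts.most_common(2))
```

The point carried over from the slide is the brevity: the whole job is a flatten followed by a keyed count, with no framework scaffolding.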
28. Enterprise Java Beans: Lessons from hate-watch reality television
Check out Dean’s talk at 11:00,
“Why Scala is Taking Over
the Big Data World”
29. Enterprise Java Beans: Lessons from hate-watch reality television
Hadoop suffers because, IMHO, that convoluted
EJB style of developer-centric template thinking
staged a coup
Perhaps we could
“donate” some
OSS talent…
Send a pull
request…
Or something.
31. Probability got going, formally, in the 16th c. –
although interesting mathematical estimations
trace back to classical times
Arabs in the 9th c. used frequency analysis –
later rediscovered by Europeans during the
early Italian Renaissance
Statistics followed, originally more about what
we might call demographics – through 18th c.
Lies, Damn Lies, Statistics, Data Science
32. Laplace, Gauss, et al., bridged the fields in the
late 18th c. using distributions (what we studied
in Stats 101) to infer the probability of errors
in estimates
Much of the 19th/20th c. work was about using
goodness-of-fit tests, etc., to justify some distribution
• generally speaking, that requires samples
• that, in turn, implies batch windows
Lies, Damn Lies, Statistics, Data Science
33. Lies, Damn Lies, Statistics, Data Science
That kind of template thinking in action
really lurvs it some batch windows
34. While 19th/20th c. stats work focused on
defensibility,
21st c. work, w.r.t. Big Data apps, focuses more
on predictability – plus there’s a shift in how we
make estimates…
Lies, Damn Lies, Statistics, Data Science
BTW, doesn’t it seem weird to crunch through piles
of data in large batch jobs, at large expense, when
the results get used to approximate features
ultimately? Why not perform that in stream?
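One way to read "in stream": many summary statistics can be updated one record at a time, with constant memory and no batch window. A minimal sketch using Welford's online algorithm for mean and variance follows. That technique is swapped in here for illustration and is not named in the talk, and the `RunningStats` class is hypothetical. Unlike the probabilistic structures discussed on the following slides, this particular statistic is exact rather than approximate.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in one pass, O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        # Uses the updated mean, which keeps the recurrence numerically stable
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stand-in for an unbounded stream
    stats.update(value)
print(stats.mean, stats.variance)
```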
35. A fascinating, relatively new area pioneered by
relatively few people – e.g., Philippe Flajolet
Provides approximation with error bounds
using much less resources (RAM, CPU, etc.)
highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
Lies, Damn Lies, Statistics, Data Science
36. algorithm          use case                   example
    Bloom Filter       set membership             code
    MinHash            set similarity             code
    HyperLogLog        set cardinality            code
    Count-Min Sketch   frequency summaries        code
    DSQ                streaming quantiles        code
    SkipList           ordered sequence search    code
Lies, Damn Lies, Statistics, Data Science
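To make the first row of that table concrete, here is a minimal Bloom filter sketch for set membership. It is an illustration under stated assumptions, not code from the talk: the `BloomFilter` class, the salted SHA-256 hashing scheme and the `user-N` keys are all hypothetical choices, and the sizing uses the standard formulas m = -n ln p / (ln 2)^2 bits and k = (m/n) ln 2 hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Approximate set membership: no false negatives, tunable false positives."""

    def __init__(self, expected_items, false_positive_rate):
        n, p = expected_items, false_positive_rate
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes
        self.num_bits = max(1, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash index
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(expected_items=1000, false_positive_rate=0.01)
for i in range(1000):
    bf.add(f"user-{i}")
print("user-42" in bf, len(bf.bits), "bytes")
```

With these parameters the filter needs roughly 1.2 KB no matter how long the keys are. A "yes" answer may occasionally be wrong, but a "no" answer never is.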
37. Lies, Damn Lies, Statistics, Data Science
E.g., ±4% could buy you two orders of magnitude
reduction in the required memory footprint for
an analytics app
OSS projects such as Algebird and BlinkDB
provide for this newer approach to the math of
approximations at scale
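The trade of a bounded error for a much smaller footprint can be sketched with a Count-Min Sketch. This is a hedged illustration, not the Algebird or BlinkDB API: the `CountMinSketch` class, the salted SHA-256 hashing and the sample stream are hypothetical. The sizing uses the standard bounds, width = ceil(e / epsilon) and depth = ceil(ln(1 / delta)), under which an estimate never undercounts and overcounts by more than epsilon times the total stream size with probability at most delta.

```python
import hashlib
import math


class CountMinSketch:
    """Approximate frequency counts in a fixed-size table."""

    def __init__(self, epsilon, delta):
        self.width = math.ceil(math.e / epsilon)     # columns per row
        self.depth = math.ceil(math.log(1 / delta))  # independent rows
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0

    def _cells(self, item):
        # One column per row, chosen by a row-salted hash of the item
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        self.total += count
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only ever inflate a cell, so the minimum is the best guess
        return min(self.table[row][col] for row, col in self._cells(item))


cms = CountMinSketch(epsilon=0.04, delta=0.01)
for i in range(5000):
    cms.add("hot-key")        # one heavy hitter
    cms.add(f"rare-{i}")      # plus 5000 distinct rare keys
print(cms.width * cms.depth, "counters;", cms.estimate("hot-key"))
```

Here 340 counters summarize a stream with 5,001 distinct keys. Note that the error bound is relative to the total stream count (epsilon × N), not to each individual key's count.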
Oscar Boykin at 14:00,
“Aggregators: Modeling
Data Queries Functionally”
co-author of Algebird, Scalding
40. Data Science is inherently interdisciplinary
To paraphrase Chris Ré, emphasis on algorithms
is relatively minor in the grand scheme –
Especially when compared to needs for modeling
business problems effectively
To wit: beyond phenomenology, leading
into quantitative analysis and repeatable results
On the one hand, CS + Stats do not quite address
those needs…
The Interzone
41. On the other hand, Physics
does well to teach modeling –
I like to hire physicists to work
on Data teams…
They tend to get the interdisciplinary aspects:
they have the math background, coding experience,
and are generally good at systems engineering, etc.
Not saying we should all rush out to get Physics
degrees, but there’s something to be learned there,
vital for the work and priorities ahead
The Interzone
42. I mention this because we are at a crossroads,
which has more to do with the physical world –
some talks here at DDTx15 help illustrate that
Vast implications for Health Care, Transportation,
Agriculture, Energy, Gov, Manufacturing in general…
More about that
in a bit –
The Interzone
44. Most of the ML libraries that one encounters
today focus on two general kinds of solutions:
• convex optimization
• matrix factorization
The Libraries: Alexandria Redux
45. One might think of the convex optimization
in this case as a kind of curve fitting – generally
with some regularization term to avoid overfitting,
which is not good
[Figure: two example curve fits, labeled Good and Bad]
The Libraries: Alexandria Redux
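The "curve fitting plus a regularization term" idea can be sketched with the simplest convex case: a straight-line fit by gradient descent with an L2 penalty on the slope. This is an illustrative assumption, not any particular library's method. The `fit_ridge_line` function, the toy data and the step size are all hypothetical.

```python
def fit_ridge_line(xs, ys, l2=0.0, learning_rate=0.01, steps=5000):
    """Minimize mean((w*x + b - y)^2) + l2 * w^2 by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradient of the squared-error term, plus the penalty's pull on w
        grad_w = 2 * sum(r * x for r, x in zip(residuals, xs)) / n + 2 * l2 * w
        grad_b = 2 * sum(residuals) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]            # exactly y = 2x + 1

w_plain, b_plain = fit_ridge_line(xs, ys, l2=0.0)
w_ridge, b_ridge = fit_ridge_line(xs, ys, l2=1.0)
print(w_plain, b_plain)                    # close to slope 2, intercept 1
print(w_ridge, b_ridge)                    # slope shrunk toward zero
```

Because the objective is convex, gradient descent reaches the single global minimum. The penalty deliberately shrinks the slope (to 4/3 on this data), which is the mechanism that discourages overfitting on noisier inputs.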
46. For supervised learning, used to create classifiers:
1. categorize the expected data into N classes
2. split a sample of the data into train/test sets
3. use learners to optimize classifiers based on
the training set, to label the data into N classes
4. evaluate the classifiers against the test set,
measuring error in predicted vs. expected labels
The Libraries: Alexandria Redux
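The four steps above can be sketched end to end with a deliberately simple learner. Everything here is an assumption for illustration: the nearest-centroid classifier stands in for whatever learner one would actually use, the "good"/"bad" labels and the synthetic, well-separated 2-D data are made up, and the helper names are hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Step 2: shuffle labeled rows and split them into train/test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def fit_centroids(train):
    """Step 3: 'learn' one centroid (mean feature vector) per class."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Label a point with the class of its nearest centroid."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def error_rate(centroids, test):
    """Step 4: fraction of test rows whose predicted label is wrong."""
    wrong = sum(predict(centroids, f) != label for f, label in test)
    return wrong / len(test)


# Step 1: two classes, synthetic and well separated by construction
rng = random.Random(42)
rows = [([rng.uniform(-1, 1), rng.uniform(-1, 1)], "good") for _ in range(100)]
rows += [([rng.uniform(4, 6), rng.uniform(4, 6)], "bad") for _ in range(100)]

train, test = train_test_split(rows)
centroids = fit_centroids(train)
test_error = error_rate(centroids, test)
print(len(train), len(test), test_error)
```

Real data is rarely this clean. The point is only the shape of the workflow: the test set is held out before any fitting, and error is measured on labels the learner never saw.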
47. Bokay, great for security problems with simply
two classes: good guys vs. bad guys
How do you decide what the classes are
for more complex problems in business?
That’s where the matrix factorization
parts come in handy…
The Libraries: Alexandria Redux
48. For unsupervised learning, which is often used
to reduce dimension:
1. create a covariance matrix of the data
2. solve for the eigenvectors and eigenvalues
of the matrix
3. select the top N eigenvectors, based on
diminishing returns for how they explain
variance in the data
4. those eigenvectors define your N classes
The Libraries: Alexandria Redux
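Steps 1 through 3 above can be sketched in plain Python. The assumptions are loud ones: power iteration with deflation is swapped in as the eigen-solver (a production library would use a proper linear algebra routine), the 95% variance threshold is an arbitrary illustrative cutoff for "diminishing returns", and the function names and toy data are hypothetical. The retained eigenvectors are the directions the slide refers to in step 4.

```python
import math
import random


def covariance_matrix(rows):
    """Step 1: sample covariance matrix (one row per observation)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_eigenpairs(matrix, k, iterations=500, seed=0):
    """Step 2: leading k (eigenvalue, eigenvector) pairs by power iteration."""
    d = len(matrix)
    m = [row[:] for row in matrix]
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        v = [rng.random() for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = mat_vec(m, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        value = sum(a * b for a, b in zip(v, mat_vec(m, v)))  # Rayleigh quotient
        pairs.append((value, v))
        # Deflate: subtract the component just found before the next pass
        m = [[m[i][j] - value * v[i] * v[j] for j in range(d)] for i in range(d)]
    return pairs


def components_needed(pairs, total_variance, threshold=0.95):
    """Step 3: smallest N whose eigenvalues explain `threshold` of the variance."""
    explained = 0.0
    for count, (value, _) in enumerate(pairs, start=1):
        explained += value / total_variance
        if explained >= threshold:
            return count
    return len(pairs)


# Toy data: points lying almost exactly along the line y = 2x
rows = [[float(x), 2.0 * x + (0.1 if x % 2 == 0 else -0.1)] for x in range(10)]
cov = covariance_matrix(rows)
pairs = top_eigenpairs(cov, k=2)
total = sum(cov[i][i] for i in range(len(cov)))  # trace = total variance
n_keep = components_needed(pairs, total)
print(n_keep, pairs[0][1])
```

On this data a single direction, roughly (1, 2) normalized, explains nearly all the variance, so two dimensions reduce to one.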
49. An excellent overview of ML definitions
(up to this point) is given in:
A Few Useful Things to Know about Machine Learning
Pedro Domingos
CACM 55:10 (Oct 2012)
http://dl.acm.org/citation.cfm?id=2347755
To wit:
Generalization = Representation + Optimization + Evaluation
The Libraries: Alexandria Redux
52. [Figure: ML workflow, circa 2010, in three phases: representation, optimization, evaluation.
Stages: ETL into cluster/cloud; Data Prep; Features; Learners, Parameters; Unsupervised Learning; Explore; train set / test set; models; Evaluate; Optimize; Scoring; production data; use cases; visualize, reporting.
Flows: data pipelines; actionable results; decisions, feedback.
Legend: bar developers, foo algorithms]
Algorithms and developer-centric template thinking
only go so far
The Libraries: Alexandria Redux
Matthew Kirk 12:00
“Lessons Learned: Machine Learning
and Technical Debt”
Ted Dunning 13:00
“Computing with Chaos”
Julia Evans 15:00
“Data Pipelines. They're a lot of work!”
Christopher Johnson 16:00
“Scala Data Pipelines for Music
Recommendations”
53. Even so, business demands extend far beyond
what classifiers and labels alone can give us…
Businesses lurv Optimization, gobs of it; in
that context ML libraries today merely scratch
the surface
Round hole, square peg
The Libraries: Alexandria Redux
54. Imagine that you compete with FedEx… how do
you optimize delivery routes for airplanes, trucks,
trains, nanodrones, hoverboards, etc.?
Which do you optimize: fuel cost,
delivery time, maintenance schedules,
minimizing lost packages?
Doesn’t sound much like online
advertising, social networks, or
any episode of Silicon Valley
The Libraries: Alexandria Redux
56. What were the origins of machine learning?
• Marvin Minsky @MIT, 1950s
• Support Vector Machines @Bell Labs, 1990s
• Google @Stanford, 1990s
• Ray Kurzweil, 2000s
Nope…
ML, Unpaused
57. ML has been an aspect of AI research for a
long while, through several different vectors
A good early history (up to 1980s) is given in:
ML, Unpaused
Machine Learning: A Historical and Methodological Analysis
Jaime Carbonell, Ryszard Michalski, Tom Mitchell
AI Magazine 4:3 (1983)
http://dx.doi.org/10.1609/aimag.v4i3.406
To wit:
task-oriented studies, knowledge acquisition, cognitive
simulation, theoretical exploration … overall, a much
broader class of optimization problems
58. An era of anticipation – AI was making inroads…
• emphasis on capturing/representing knowledge
and expertise – production use cases in medicine
• Fifth Generation Computing (parallel h/w)
in Japan, MCC, etc.
However:
• few outside academia had enough cluster compute
power – aside from 3-letter agencies and AT&T
• meanwhile ML was not yet considered “academic”
enough within academia
Circa early 1980s:
60. Some fundamental tech platforms emerge…
• Hubble Space Telescope, Human Genome Project,
WWW, electric cars relaunched
And throughout that decade:
• Linux, Java @Sun, JavaScript @Netscape
• Firefly, an initial commercial ML app
on teh interwebs @MIT Media Lab
• Rise of e-commerce leveraging horizontal
scale-out with commodity hardware
Circa early 1990s:
62. GOOG AMZN EBAY YHOO LNKD NFLX FB TWTR
emerged out of the dust…
• web apps dominated for search, e-commerce,
social networks, etc.
• did we mention EJBs and template thinking?
• mobile picked up traction
• recommender systems went mainstream
• AI picked up with semantic web efforts…
Circa early 2000s:
64. Successful e-commerce firms have IPO’ed and are
now busy building skyscrapers in downtown SF…
Circa mid 2010s:
LinkedIn, 350 Bush
Transbay Transit
Salesforce, 415 Mission
65. An odd truism about the hubris of the uber-wealthy
and the timing of their skyscraper projects…
But…
Sears Tower, Chicago
Lehman Brothers, London
Fontainebleau, Las Vegas
67. Businesses lurv Optimization, lots of it…
• ML circa 1985 focused on those needs, but got
knocked back to something inevitably more
Aristotelian and predictable
• Outside of Silicon Valley, we’ve made big strides
• One danger: next downturn cycle, VCs might
reshape tech industry, reverting to “safe bets”
Circa mid 2010s: Back to the Future
However, a few extremely interesting
aspects have emerged…
68. [Figure: ML workflow diagram, circa 2010 (representation, optimization, evaluation), repeated from slide 52]
We have approximation, deep learning and
symbolic regression to assist on “Features”
Or, maybe, cognitive computing will help on
several of the more difficult aspects of this…
Circa mid 2010s: Extremely Interesting Emerging Aspects
70. Circa mid 2010s: Extremely Interesting Emerging Aspects
Rhetorical postures: “Is AI a good idea,
or potentially harmful?”
– per Elon Musk, et al.
71. Circa mid 2010s: Extremely Interesting Emerging Aspects
Clearly: good idea
brewbot.io
72. Circa mid 2010s: Extremely Interesting Emerging Aspects
Speaking of which, a highly recommended podcast
by actual data scientists drinking really good beers:
partiallyderivative.com
73. Circa mid 2010s: Extremely Interesting Emerging Aspects
2015: Notebooks in Containers in the Cloud
“Keep simple things simple
and complex things possible.”
databricks.com/product
Publishing Workflows for Jupyter
Andrew Odewahn, Kyle Kelley, Rune Madsen
odewahn.github.io/publishing-workflows-for-jupyter
IPython Interactive Demo
Nature Magazine + Rackspace
nature.com/news/ipython-interactive-demo-7.21492
Makes me wonder about the “data engineer”
role … notebooks simplify ops needs, while
ultimately the domain experts wield the real
power with data
76. Frontstory: The Sun Also Rises
Some wake early in
the morning and go
build buildings
dev-centric templates
77. Some gaze into the
heavens, sit back,
and explain the
process…
20th c. stats
Frontstory: The Sun Also Rises
78. Sometimes, when
the sky gods become
angry and obscure
the Sun as our due
punishment…
VCs during recessions
Frontstory: The Sun Also Rises
79. Others create and
evaluate models to
predict the Earth’s
orbit of the Sun
What’s needed most
Frontstory: The Sun Also Rises
80. Forward Motion:
SV trend: early data scientists displace old-school
product managers
Because there are hard
problems to be solved…
Because we need
new eyes on target…
Because use cases…
82. Because Use Cases: Health Care
Employing tech such as deep learning and
cognitive computing for vital use cases in
health care:
“In fact, using our Topological Data Analysis system, they were
able to discover multiple types of Type 2 diabetes … huge
impact on all the hundreds of millions of people” – Ayasdi
“Nobody knows what to do with those archives … They’re just
sitting there, costing money. This is just seen as a big opportunity.
It’s like, ‘Oh, this is what we were saving this up for!’” – Enlitic
“Sloan-Kettering is also training Watson on 1,500 real-world lung
cancer cases, helping it to decipher physician notes and learn
from the hospital’s expertise in treating cancer.” – IBM Watson
83. Because Use Cases: Transportation
http://automatic.com/
Detects events like hard braking, acceleration – uploaded in
real-time with geolocation to a Spark Streaming pipeline …
data trends indicate road hazards, blind intersections, bad
signal placement, and other input to improve traffic planning.
Also detects inefficient vehicle operation, under-inflated tires,
poor driving behaviors, aggressive acceleration, etc.
84. Because Use Cases: Education
https://databricks.com/blog/2014/12/08/
pearson…
Integrates Kafka + Spark Streaming + Cassandra +
Blur, running within a YARN cluster on AWS to provide
a scalable, reliable, cloud-based platform for services
that analyze student performance across product and
institution boundaries.
Delivers immersive learning experiences
designed for how students read, think,
and learn; as well as efficacy insights to
both learners and institutions which were
not possible before.
Reliability features handle Kafka node
failures, receiver failures, leader changes,
committed offset in ZK, plus adjustable
data-rate throughput.
85. Because Use Cases: Language, everywhere
http://idibon.com/
Our social fabric is encoded as text documents,
and similarly it gets tested, deployed, maintained,
and monitored there – it’s the launch point for
cognitive computing.
http://digitalreasoning.com/
Robert Munro, 12:00 “Building Better
Experts: co-optimization of human and
machine intelligence at Idibon”
Andrew Trask, David Gilmore 11:00
“Deep Learning for Natural Language
Processing”
87. Because Use Cases: Geospatial
Advanced geo use cases throughout all levels of gov
and industry for Big Data, machine learning, graph
algorithms, approximations, etc.
If you roll trucks you probably use licenses from ESRI.
Also consider the IoT sensor data, e.g., from National
Instruments' customers – where does it go, what do
organizations use to analyze it?
These are the large-scale optimization problems
you were looking for…
http://esri.github.io/gis-tools-for-hadoop/ (and Spark)
http://thunderheadxpler.blogspot.com/
http://geotrellis.io/
http://www.oculusinfo.com/tiles/
https://databricks.com/blog/2014/12/03/app...
88. Because Use Cases: Telecom,Travel, Banking, etc.
http://spark-summit.org/2014/talk/
stratio-streaming…
Stratio represents one of the most sophisticated
integrations for Spark Streaming – the union of
a real-time messaging bus with a complex event
processing engine: Kafka, Spark Streaming,
Cassandra, along with the Siddhi CEP engine
Telecom, in particular, is leveraging this new
streaming technology as a big win near-term
http://www.openstratio.org/
https://github.com/stratio
https://github.com/Stratio/streaming-cep-engine
BTW if you’re in Madrid next fall
check out Big Data Hispano
89. Because Use Cases…
Common theme: many of those use cases are
powered by Apache Spark –
Especially notice Spark Streaming, which is a big
game-changer for analytics across industry
Taylor Goetz 11:00
“Beyond the Tweeting Toaster: IoT
Streaming Analytics With Apache
Storm, Kafka, and Arduino”
Hari Shreedharan 12:00
“Real Time Data Processing Using
Spark Streaming”
91. Because Use Cases: Agriculture
Ag+Data Issues
http://radar.oreilly.com/2014/04/agdata.html
Data Guild whitepaper: Ag Systems + Data Outlook
http://goo.gl/OK8RFf
• livelihood for 40% of world population
• $15T/year annual GDP globally
• data-intensive issues, much legal impasse
Over a half billion small farms worldwide, and most
are family-run farms that rely on rain-fed agriculture
Nudge, and I just might propose DWave clusters
into cold craters on the Lunar South Pole with
routers @L5 and an LLO skyhook… to handle
the vector quantization demands. Or something.
Layered Sensing Networks:
microsats: e.g., Planet Labs, 400 km
airships: e.g., JP Aerospace, 40 km
atmosats: e.g., Titan Aerospace, 20 km
drones: e.g., HoneyComb, 120 m
robots: e.g., Blue River, 1 m
sensors: e.g., Hortau, -0.3 m
93. Apache Spark developer certificate program
• http://oreilly.com/go/sparkcert
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
certification:
94. MOOCs:
Anthony Joseph
UC Berkeley
begins 2015-02-23
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181
Ameet Talwalkar
UCLA
begins 2015-04-14
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
97. confs:
Strata CA
San Jose, Feb 18-20
strataconf.com/strata2015
Spark Summit East
NYC, Mar 18-19
spark-summit.org/east
Big Data Tech Con
Boston, Apr 26-28
bigdatatechcon.com
Strata EU
London, May 5-7
strataconf.com/big-data-conference-uk-2015
Spark Summit 2015
SF, Jun 15-17
spark-summit.org
98. books:
Fast Data Processing
with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/9781782167068.do
Spark in Action
Chris Fregly
Manning (2015*)
sparkinaction.com/
Learning Spark
Holden Karau,
Andy Konwinski,
Matei Zaharia
O’Reilly (2015*)
shop.oreilly.com/product/0636920028512.do
99. presenter:
Just Enough Math
O’Reilly, 2014
justenoughmath.com
preview: youtu.be/TQ58cWgdCpA
monthly newsletter for updates,
events, conf summaries, etc.:
liber118.com/pxn/
Enterprise Data Workflows
with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do