With Anaconda (in particular Numba and Dask) you can scale up your NumPy and Pandas stack to many CPUs and GPUs, as well as scale out to run on clusters of machines, including Hadoop.
1. Scale Up and Scale Out with the Anaconda Platform
Travis Oliphant
CEO
2. Travis Oliphant, PhD — About me
• PhD 2001 from Mayo Clinic in Biomedical Engineering
• MS/BS degrees in Elec. Comp. Engineering from BYU
• Created SciPy (1999-2009)
• Professor at BYU (2001-2007)
• Author and Principal Dev of NumPy (2005-2012)
• Started Numba (2012)
• Founding Chair of NumFocus / PyData
• Former PSF Director (2015)
• Founder of Continuum Analytics in 2012.
3. Anaconda enables Scale Up and Scale Out
• Vertical Scaling (Bigger Nodes): a big-memory, many-core/GPU box
• Horizontal Scaling (More Nodes): many commodity nodes in a cluster
• Best of Both (e.g. a GPU cluster)
4. Anaconda enables Scale Up and Scale Out
• Vertical Scaling (Bigger Nodes): Numba, DyND, Anaconda + MKL
• Horizontal Scaling (More Nodes): Dask, Blaze, conda, Anaconda Inside Hadoop
10. Conda
• Package, dependency and environment manager
• Language agnostic (Python, R, Java, C, FORTRAN…)
• Cross-platform (Windows, OS X, Linux)
$ conda install python=2.7
$ conda install pandas
$ conda install -c r r
$ conda install mongodb
11. ANACONDA.ORG
Where packages, notebooks, and environments are shared.
Powerful collaboration and package management for open source and private projects.
Public projects and notebooks are always free.
REGISTER TODAY!
13. Anaconda now with MKL as default
• Intel MKL (Math Kernel Library) provides enhanced algorithms for basic math functions.
• Using MKL gives optimal performance for BLAS, LAPACK, FFT, and basic math functions.
• Version 2.5 of Anaconda ships with MKL as the default in the free download.
14. Space of Python Compilation

                              Ahead Of Time                        Just In Time
Relies on CPython/libpython   Cython, Shedskin, Nuitka (today),    Numba, HOPE, Theano, Pyjion
                              Pythran, Numba
Replaces CPython/libpython    Nuitka (someday)                     Pyston, PyPy
18.
@jit('void(f8[:,:],f8[:,:],f8[:,:])')
def filter(image, filt, output):
    M, N = image.shape
    m, n = filt.shape
    for i in range(m//2, M-m//2):
        for j in range(n//2, N-n//2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i+k-m//2, j+l-n//2] * filt[k, l]
            output[i, j] = result

~1500x speed-up
20. Numba Features
• Numba supports:
Windows, OS X, and Linux
32 and 64-bit x86 CPUs and NVIDIA GPUs
Python 2 and 3
NumPy versions 1.6 through 1.9
• Does not require a C/C++ compiler on the user’s system.
• < 70 MB to install.
• Does not replace the standard Python interpreter
(all of your existing Python libraries are still available)
21. Numba Modes
• object mode: Compiled code operates on Python
objects. Only significant performance improvement is
compilation of loops that can be compiled in nopython
mode (see below).
• nopython mode: Compiled code operates on “machine
native” data. Usually within 25% of the performance of
equivalent C or FORTRAN.
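A minimal sketch of requesting nopython mode; the toy function is mine, not from the slides, and the try/except fallback lets the snippet run even where Numba is not installed:

```python
import math

try:
    from numba import jit
except ImportError:
    def jit(**kwargs):                 # no-op stand-in so the example runs without Numba
        def wrap(func):
            return func
        return wrap

@jit(nopython=True)                    # nopython mode: operates on machine-native data
def hypot_sum(n):
    total = 0.0
    for i in range(n):                 # the kind of loop Numba compiles to native code
        total += math.sqrt(i * i + 1.0)
    return total

print(hypot_sum(5))
```

With nopython=True, code that Numba cannot lower to machine-native operations raises an error at compile time instead of silently running in the slower object mode.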
22. How to Use Numba
1. Create a realistic benchmark test case.
(Do not use your unit tests as a benchmark!)
2. Run a profiler on your benchmark.
(cProfile is a good choice)
3. Identify hotspots that could potentially be compiled by Numba with a
little refactoring.
(see rest of this talk and online documentation)
4. Apply @numba.jit and @numba.vectorize as needed to critical
functions.
(Small rewrites may be needed to work around Numba limitations.)
5. Re-run benchmark to check if there was a performance improvement.
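Steps 1 and 2 can be sketched with the standard library alone; the benchmark and the hotspot function below are hypothetical stand-ins:

```python
import cProfile
import io
import pstats

def harmonic_tail(n):
    # hypothetical hotspot: the kind of pure-Python loop a profiler flags
    total = 0.0
    for i in range(1, n):
        total += 1.0 / i
    return total

def benchmark():
    # step 1: a realistic, repeatable workload (not a unit test)
    return [harmonic_tail(10000) for _ in range(50)]

# step 2: profile the benchmark and rank entries by cumulative time
profiler = cProfile.Profile()
profiler.enable()
benchmark()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
report = stream.getvalue()
print("harmonic_tail" in report)   # the hotspot shows up in the report
```

Step 3 is then reading that report: functions high in cumulative time with tight numeric loops are the candidates for @numba.jit.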
24. The Basics
• Array allocation
• Looping over ndarray x as an iterator
• Using numpy math functions
• Returning a slice of the array
• Numba decorator (nopython=True not required)
2.7x speedup over NumPy!
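The transcript omits the slide's actual code; here is a hedged reconstruction that exercises exactly the four annotated features (the function name and body are my invention), with a fallback so it also runs without Numba:

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    def jit(func):                     # no-op stand-in when Numba is absent
        return func

@jit                                   # Numba decorator (nopython=True not required)
def smooth_log(x):
    out = np.empty(len(x))             # array allocation
    i = 0
    for v in x:                        # looping over ndarray x as an iterator
        out[i] = np.log(1.0 + v * v)   # using numpy math functions
        i += 1
    return out[1:-1]                   # returning a slice of the array

print(smooth_log(np.arange(5.0)))
```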
29. Case study: j0 from scipy.special
• scipy.special was one of the first libraries I wrote (in 1999)
• extended “umath” module by adding new “universal functions” to
compute many scientific functions by wrapping C and Fortran libs.
• Bessel functions are solutions to the differential equation

    x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - \alpha^2)\, y = 0, \qquad y = J_\alpha(x)

  with the integral representation

    J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos(n\tau - x \sin\tau)\, d\tau
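The integral representation can be checked numerically with just the standard library; this quadrature sketch is mine, not from the talk (for this even, periodic integrand the trapezoid rule converges very quickly):

```python
import math

def bessel_j(n, x, steps=2000):
    """J_n(x) via its integral representation, composite trapezoid rule."""
    h = math.pi / steps
    f = lambda tau: math.cos(n * tau - x * math.sin(tau))
    total = 0.5 * (f(0.0) + f(math.pi))
    for k in range(1, steps):
        total += f(k * h)
    return total * h / math.pi

print(bessel_j(0, 0.0))                      # J_0(0) = 1
print(abs(bessel_j(0, 2.404825557695773)))   # near 0: the first zero of J_0
```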
31. Result: equivalent to compiled code
In [6]: %timeit vj0(x)
10000 loops, best of 3: 75 us per loop
In [7]: from scipy.special import j0
In [8]: %timeit j0(x)
10000 loops, best of 3: 75.3 us per loop
But! Now code is in Python and can be experimented with
more easily (and moved to the GPU / accelerator more easily)!
32. Numba is very popular!
A numba mailing list reports experiments of a SciPy author who got a 2x speed-up by removing their Cython type annotations and surrounding the function with numba.jit (with a few minor changes needed to the code).
With Numba's ahead-of-time compilation one can legitimately use Numba to create a library that you ship to others (who then don't need to have Numba installed, or just need a Numba run-time installed).
SciPy (and NumPy) would look very different if Numba had existed 16 years ago when SciPy was getting started… and you would all be happier.
35. CUDA Python (in open-source Numba!)
CUDA Development
using Python syntax for
optimal performance!
You have to understand
CUDA at least a little —
writing kernels that launch
in parallel on the GPU
38. Mandelbrot
from numba import jit

@jit
def mandel(x, y, max_iters):
    c = complex(x, y)
    z = 0j
    for i in range(max_iters):
        z = z*z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters
    return 255
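A driver for the mandel kernel might look like the following; it is restated in pure Python so the snippet is self-contained, and the grid bounds and size are my choices (on the slide the @jit-compiled version would be used):

```python
def mandel(x, y, max_iters):
    # escape-time kernel, as on the slide (without the @jit decorator)
    c = complex(x, y)
    z = 0j
    for i in range(max_iters):
        z = z * z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters
    return 255

def create_fractal(min_x, max_x, min_y, max_y, width, height, iters):
    # fill a height x width grid with escape-time values
    image = [[0] * width for _ in range(height)]
    for row in range(height):
        y = min_y + row * (max_y - min_y) / height
        for col in range(width):
            x = min_x + col * (max_x - min_x) / width
            image[row][col] = mandel(x, y, iters)
    return image

img = create_fractal(-2.0, 1.0, -1.0, 1.0, 60, 40, 20)
print(img[20][20])   # the point (-1, 0) lies inside the set: 255
```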
40. Other interesting things
• CUDA Simulator to debug your code in Python interpreter
• Generalized ufuncs (@guvectorize) including GPU support and
multi-core (threaded) support
• Call ctypes and cffi functions directly and pass them as arguments
• Support for types that understand the buffer protocol
• Pickle Numba functions to run on remote execution engines
• “numba annotate” to dump HTML annotated version of compiled
code
• See: http://numba.pydata.org/numba-doc/0.23.0/
41. What Doesn’t Work?
(A non-comprehensive list)
• Sets, lists, dictionaries, user defined classes (tuples do work!)
• List, set and dictionary comprehensions
• Recursion
• Exceptions with non-constant parameters
• Most string operations (buffer support is very preliminary!)
• yield from
• closures inside a JIT function (compiling JIT functions inside a closure
works…)
• Modifying globals
• Debugging of compiled code (you have to debug in Python mode).
42. Recently Added Numba Features
• Support for named tuples in nopython mode
• Limited support for lists in nopython mode
• On-disk caching of compiled functions (opt-in)
• JIT classes (zero-cost abstraction)
• Support for np.dot (and the ‘@‘ operator on Python 3.5)
• Support for some of np.linalg
• generated_jit (jit the functions that are the return values of the
decorated function)
• SmartArrays which can exist on host and GPU (transparent
data access).
44. Conclusion
• Lots of progress in the past year!
• Try out Numba on your numerical and NumPy-related
projects:
conda install numba
• Your feedback helps us make Numba better!
Tell us what you would like to see:
https://github.com/numba/numba
• Extension API coming soon and support for more data
structures
46. Blaze
Blaze is an extensible high-level interface for data analytics. It feels like NumPy/Pandas. It drives other data systems. Blaze expressions enable high-level reasoning. An ecosystem of tools.
http://blaze.pydata.org
• Infrastructure for meta-data, meta-compute, and expression graphs/dataflow
• Data glue for scale-up or scale-out
• Generic remote computation & query system
• (NumPy+Pandas+LINQ+OLAP+PADL).mashup()
49. APIs, syntax, language
[Stack diagram: blaze provides the Expressions layer; on the Data side, datashape handles metadata and odo the storage/containers; on the Runtime side, dask handles compute (parallelize, optimize, JIT)]
50. Blaze
Interface to query data on different storage systems (http://blaze.pydata.org/en/latest/)

from blaze import Data

iris = Data('iris.csv')                        # CSV
iris = Data('sqlite:///flowers.db::iris')      # SQL
iris = Data('mongodb://localhost/mydb::iris')  # MongoDB
iris = Data('iris.json')                       # JSON
iris = Data('s3://blaze-data/iris.csv')        # S3
…

Current focus is the “dark data” and pydata stack for run-time (dask, dynd, numpy, pandas, x-ray, etc.) + customer needs (i.e. kdb, mongo).
52. Blaze uses datashape as its type system (like DyND)

>>> iris = Data('iris.json')
>>> iris.dshape
dshape("""var * {
    petal_length: float64,
    petal_width: float64,
    sepal_length: float64,
    sepal_width: float64,
    species: string
}""")
53. Datashape
A structured data description language
http://datashape.pydata.org/

var * { x : int32, y : string, z : float64 }

A datashape composes dimensions with a dtype. Unit types (int32, float64, string) are the leaves; dimensions (fixed sizes such as 3 or 4, or the variable-length var) are attached with *. The record

{ x : int32, y : string, z : float64 }

is an ordered struct dtype: a collection of types keyed by labels. Applying var to this record gives the tabular datashape above.
55. Blaze Server — Lights up your Dark Data
Builds off of the Blaze uniform interface to host data remotely through a JSON web API.

server.yaml:
iriscsv:
    source: iris.csv
irisdb:
    source: sqlite:///flowers.db::iris
irisjson:
    source: iris.json
    dshape: "var * {name: string, amount: float64}"
irismongo:
    source: mongodb://localhost/mydb::iris

$ blaze-server server.yaml -e
localhost:6363/compute.json
57. Compute recipes work with existing libraries and have multiple
backends — write once and run anywhere.
• python list
• numpy arrays
• dynd
• pandas DataFrame
• Spark, Impala
• Mongo
• dask
58. • You can layer expressions over any data
• Write once, deploy anywhere
• Practically, expressions will work better on specific data
structures, formats, and engines
• Use odo to copy from one format and/or engine to another
59. Dask: Distributed PyData
• A parallel computing framework
• That leverages the excellent Python ecosystem (NumPy and Pandas)
• Using blocked algorithms and task scheduling
• Written in pure Python
Core Ideas
• Dynamic task scheduling yields sane parallelism
• Simple library to enable parallelism
• Dask.array/dataframe to encapsulate the functionality
• Distributed scheduler
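The blocked-algorithms idea can be shown in miniature without Dask itself: reduce a dataset one chunk at a time so only a block's worth of data is ever "in memory". This is a toy sketch of the pattern dask.array automates, not Dask API:

```python
def chunked(seq, size):
    # yield successive blocks of `size` elements
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

def blocked_mean(seq, blocksize):
    # per-block partial sums and counts, then combine: the blocked-algorithm pattern
    total, count = 0.0, 0
    for block in chunked(seq, blocksize):
        total += sum(block)
        count += len(block)
    return total / count

data = list(range(1, 101))          # stand-in for a larger-than-memory array
print(blocked_mean(data, 7))        # 50.5, identical to the unblocked mean
```

Dask generalizes this by recording the per-block operations and the combine step as a task graph, then handing the graph to a scheduler.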
67. • simple: easy to use API
• flexible: perform lots of actions with a minimal amount of code
• fast: dispatching to run-time engines & cython
• database-like: familiar ops
• interop: integration with the PyData Stack
(((A + 1) * 2) ** 3)
71.
from dask import dataframe as dd

columns = ["name", "amenity", "Longitude", "Latitude"]
data = dd.read_csv('POIWorld.csv', usecols=columns)

with_name = data[data.name.notnull()]
with_amenity = data[data.amenity.notnull()]

is_starbucks = with_name.name.str.contains('[Ss]tarbucks')
is_dunkin = with_name.name.str.contains('[Dd]unkin')

starbucks = with_name[is_starbucks]
dunkin = with_name[is_dunkin]

locs = dd.compute(starbucks.Longitude,
                  starbucks.Latitude,
                  dunkin.Longitude,
                  dunkin.Latitude)

# extract arrays of values from the series:
lon_s, lat_s, lon_d, lat_d = [loc.values for loc in locs]

%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

def draw_USA():
    """initialize a basemap centered on the continental USA"""
    plt.figure(figsize=(14, 10))
    return Basemap(projection='lcc', resolution='l',
                   llcrnrlon=-119, urcrnrlon=-64,
                   llcrnrlat=22, urcrnrlat=49,
                   lat_1=33, lat_2=45, lon_0=-95,
                   area_thresh=10000)

m = draw_USA()

# Draw map background
m.fillcontinents(color='white', lake_color='#eeeeee')
m.drawstates(color='lightgray')
m.drawcoastlines(color='lightgray')
m.drawcountries(color='lightgray')
m.drawmapboundary(fill_color='#eeeeee')

# Plot the values in Starbucks green and Dunkin' Donuts orange
style = dict(s=5, marker='o', alpha=0.5, zorder=2)
m.scatter(lon_s, lat_s, latlon=True,
          label="Starbucks", color='#00592D', **style)
m.scatter(lon_d, lat_d, latlon=True,
          label="Dunkin' Donuts", color='#FC772A', **style)
plt.legend(loc='lower left', frameon=False);
72. dask distributed
Pythonic multiple-machine parallelism that understands Dask graphs
1) Defines Center (dcenter) and Worker (dworker)
2) Simplified setup with dcluster, for example:
   dcluster 192.168.0.{1,2,3,4}
   or
   dcluster --hostfile hostfile.txt
3) Create Executor objects like concurrent.futures (Python 3) or futures (Python 2.7 back-port)
4) Data locality supported with ad-hoc task graphs by returning futures wherever possible
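Because the Executor deliberately mirrors concurrent.futures, the submit-and-futures style can be previewed locally with the standard library alone; here a thread pool stands in for a dask cluster:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

with ThreadPoolExecutor(max_workers=4) as executor:
    # like executor.submit on a dask cluster: returns futures immediately
    futures = [executor.submit(square, n) for n in range(5)]
    results = [f.result() for f in futures]          # block until each result is ready
    total = executor.submit(sum, results).result()   # results can feed later submissions

print(results)   # [0, 1, 4, 9, 16]
print(total)     # 30
```

On a dask cluster the futures additionally carry data locality, so follow-up submissions can run on the workers that already hold the inputs.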
73. Anaconda (PyData) Inside Hadoop
conda + Dask + MPI: high performance, all of Python/R
• Part of the Dask project
• Native HDFS reader
• YARN/Mesos integration
• Parquet, Avro, Thrift readers
• Preview releases available now; coming GA in Q2 of 2016
• The native way to do Hadoop with the PyData stack!
• For Python users it’s better than Spark (faster, and integrates with current code).
• Integrates easily with NumPy, Pandas, scikit-learn, etc.
74. PyData Inside Hadoop
Two key libraries enable the connection:
• knit (http://knit.readthedocs.org/en/latest/): enables Python code to interact with Hadoop schedulers
• hdfs3 (http://hdfs3.readthedocs.org/en/latest/): a wrapper for Pivotal’s libhdfs3, which provides native reading and writing of HDFS (without the JVM)
https://github.com/dask/dec2 might also be useful for you (to ease starting dask distributed clusters on EC2)
76. HDFS without Java
1. HDFS splits large files into many small blocks replicated on many datanodes
2. For efficient computation we must use data directly on datanodes
3. distributed.hdfs queries the locations of the individual blocks
4. distributed executes functions directly on those blocks on the datanodes
5. distributed + pandas enables distributed CSV processing on HDFS in pure Python
6. Distributed, Pythonic computation, directly on HDFS!
77.
$ hdfs dfs -cp yellow_tripdata_2014-01.csv /data/nyctaxi/

>>> from distributed import hdfs
>>> blocks = hdfs.get_locations('/data/nyctaxi/', '192.168.50.100', 9000)
>>> columns = ['vendor_id', 'pickup_datetime', 'dropoff_datetime',
...            'passenger_count', 'trip_distance', 'pickup_longitude', 'pickup_latitude',
...            'rate_code', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude',
...            'payment_type', 'fare_amount', 'surcharge', 'mta_tax', 'tip_amount',
...            'tolls_amount', 'total_amount']
>>> from distributed import Executor
>>> executor = Executor('192.168.1.100:8787')
>>> dfs = [executor.submit(pd.read_csv, block['path'], workers=block['hosts'],
...                        columns=columns, skiprows=1)
...        for block in blocks]

These operations produce Future objects that point to remote results on the worker machines. This does not pull results back to local memory; we can use these futures in later computations with the executor.
78.
def sum_series(seq):
    result = seq[0]
    for s in seq[1:]:
        result = result.add(s, fill_value=0)
    return result

>>> counts = executor.map(lambda df: df.passenger_count.value_counts(), dfs)
>>> total = executor.submit(sum_series, counts)
>>> total.result()
0        259
1    9727301
2    1891581
3     566248
4     267540
5     789070
6     540444
7          7
8          5
9         16
208       19