SlideShare ist ein Scribd-Unternehmen logo
1 von 73
Downloaden Sie, um offline zu lesen
On the code of data science
Varoquaux
Ga¨el
import data science
data science.discover()
On the code of data science
Varoquaux
Ga¨el
import data science
data science.discover()
Disclaimer:
A bit of a Python bias
Key ideas carry over
What is data science?
G Varoquaux 2
What is data science?
G Varoquaux 2
Data science = statistics + code
Larger datasets
Automating insights
Data science is disruptive when it enables
real-time or personalized decisions
G Varoquaux 3
1 Writing code for data science
2 Machine learning in Python
3 Outlook: infrastructure
Make you more productive as a data scientist
G Varoquaux 4
1 Writing code for data science
Productivity & Quality
G Varoquaux 5
1 A data-science workflow
Work based on intuition
and experimentation
Conjecture
Experiment
ñ Interactive & framework-less
Yet
needs consolidation
keeping flexibility
G Varoquaux 6
1 A design pattern in computational experiments
MVC pattern from Wikipedia:
Model
Manages the data
and rules of the
application
View
Output represen-
tation
Possibly several views
Controller
Accepts input
and converts it to
commands
for model and view
Photo-editing software
Filters Canvas Tool palettes
Typical web application
Database Web-pages URLs
G Varoquaux 7
1 A design pattern in computational experiments
MVC pattern from Wikipedia:
Model
Manages the data
and rules of the
application
View
Output represen-
tation
Possibly several views
Controller
Accepts input
and converts it to
commands
for model and view
For data science:
Numerical, data-
processing, & ex-
perimental logic
Results, as files.
Data & plots
Imperative API
Avoid input as files:
not expressive
Module
with functions
Post-processing script
CSV & data files
Script
ñ for loops
G Varoquaux 7
1 A design pattern in computational experiments
MVC pattern from Wikipedia:
Model
Manages the data
and rules of the
application
View
Output represen-
tation
Possibly several views
Controller
Accepts input
and converts it to
commands
for model and view
For data science:
Numerical, data-
processing, & ex-
perimental logic
Results, as files.
Data & plots
Imperative API
Avoid input as files:
not expressive
Module
with functions
Post-processing script
CSV & data files
Script
ñ for loops
A recipe
3 types of files:
• modules • command scripts • post-processing scripts
CSVs & intermediate data files
Separate computation from analysis / plotting
Code and text (and data) ñ version control
G Varoquaux 7
1 A design pattern in computational experiments
MVC pattern from Wikipedia:
Model
Manages the data
and rules of the
application
View
Output represen-
tation
Possibly several views
Controller
Accepts input
and converts it to
commands
for model and view
For data science:
Numerical, data-
processing, & ex-
perimental logic
Results, as files.
Data & plots
Imperative API
Avoid input as files:
not expressive
Module
with functions
Post-processing script
CSV & data files
Script
ñ for loops
A recipe
3 types of files:
• modules • command scripts • post-processing scripts
CSVs & intermediate data files
Separate computation from analysis / plotting
Code and text (and data) ñ version control
Decouple steps
Goals: Reuse code
Mitigate compute time
G Varoquaux 7
1 How I work progressive consolidation
Start with a script playing to understand the problem
G Varoquaux 8
1 How I work progressive consolidation
Start with a script playing to understand the problem
Identify blocks/operations ñ move to a function
Use functions
Obstacle: local scope
requires identifying input and output variables
That’s a good thing
Solution: Interactive debugging / understanding
inside a function: %debug in IPython
Functions are the basic reusable abstraction
G Varoquaux 8
1 How I work progressive consolidation
Start with a script playing to understand the problem
Identify blocks/operations ñ move to a function
As they stabilize, move to a module
Modules
enable sharing between experiments
ñ avoid 1000 lines scripts + commented code
enable testing
Fast experiments as tests
ñ gives confidence, hence refactorings
G Varoquaux 8
1 How I work progressive consolidation
Start with a script playing to understand the problem
Identify blocks/operations ñ move to a function
As they stabilize, move to a module
Clean: delete code & files you have version control
Attentional load makes it impossible
to find or understand things
Where’s Waldo?G Varoquaux 8
1 How I work progressive consolidation
Start with a script playing to understand the problem
Identify blocks/operations ñ move to a function
As they stabilize, move to a module
Clean: delete code & files you have version control
Why is it hard?
Long compute times
make us unadventurous
Know your tools
Refactoring editor
Version control
G Varoquaux 8
1 Caching to tame computation time
The memoize pattern
mem = joblib.Memory(cachedir=’.’)
g = mem.cache(f)
b = g(a) # computes a using f
c = g(a) # retrieves results from memory
For data science
a & b can be big
a & b arbitrary objects no change in workflow
Results stored on disk
G Varoquaux 9
1 Caching to tame computation time
The memoize pattern
mem = joblib.Memory(cachedir=’.’)
g = mem.cache(f)
b = g(a) # computes a using f
c = g(a) # retrieves results from memory
For data science
Fits in experimentation loop
Helps decrease re-run times œ
Black-boxy, persistence only implicit
G Varoquaux 9
Adopting software-engineering best practices
G Varoquaux 10
1 The ladder of code quality
Use linting/code analysis in your editor seriously
G Varoquaux 11
1 The ladder of code quality
Use linting/code analysis in your editor seriously
Coding convention, good naming
Version control Use git + github
Code review
Unit testing
If it’s not tested, it’s broken or soon will be
Make a package
controlled dependencies and compilation
...
G Varoquaux 11
1 The ladder of code qualityIncreasingcost
?İ
Use linting/code analysis in your editor seriously
Coding convention, good naming
Version control Use git + github
Code review
Unit testing
If it’s not tested, it’s broken or soon will be
Make a package
controlled dependencies and compilation
...
Avoid premature software engineering
G Varoquaux 11
1 The ladder of code qualityIncreasingcost
?İ
Use linting/code analysis in your editor seriously
Coding convention, good naming
Version control Use git + github
Code review
Unit testing
If it’s not tested, it’s broken or soon will be
Make a package
controlled dependencies and compilation
...
Avoid premature software engineering
Over versus under engineering
When the goal is generating insights
Experimentation to develop intuitions
ñ new ideas
As the path becomes clear: consolidation
Heavy engineering too early freezes bad ideas
G Varoquaux 11
1 LibrariesIncreasingcost
?İ
Use linting/code analysis in your editor seriously
Coding convention, good naming
Version control Use git + github
Code review
Unit testing
If it’s not tested, it’s broken or soon will be
Make a package
controlled dependencies and compilation
...
A library
G Varoquaux 12
2 Machine learning in Python
scikit-learn
G Varoquaux 13
2 Tradeoffs
Experimentation Production-
đ §
scikit
G Varoquaux 14
2 My stack for data science
Python, what else?
General-purpose language
Interactive
Easy to read / write
G Varoquaux 15
2 My stack for data science
The scientific Python stack
numpy arrays
Mostly a float**
No annotation / structure
Universal across applications
Easily shared across languages
03878794797927
01790752701578
94071746124797
54970718717887
13653490495190
74754265358098
48721546349084
90345673245614
57187745620
03878794797927
01790752701578
94071746124797
54970718717887
13653490495190
74754265358098
48721546349084
90345673245614
7187745620
G Varoquaux 15
2 My stack for data science
The scientific Python stack
numpy arrays
Connecting to
pandas
Columnar data
scikit-image
Images
scipy
Numerics, signal processing
...
G Varoquaux 15
2 Machine learning in a nutshell
Machine learning is about making predictions from data
e.g. learning to distinguish apples from oranges
G Varoquaux 16
2 Machine learning in a nutshell
Machine learning is about making predictions from data
e.g. learning to distinguish apples from oranges
Prediction is very difficult, especially about the future. Niels Bohr
Learn as much as possible from the data
but not too much
G Varoquaux 16
2 Machine learning in a nutshell
Machine learning is about making predictions from data
e.g. learning to distinguish apples from oranges
Prediction is very difficult, especially about the future. Niels Bohr
Learn as much as possible from the data
but not too much
x
y
x
y
Which model do you prefer?
G Varoquaux 16
2 Machine learning in a nutshell
Machine learning is about making predictions from data
e.g. learning to distinguish apples from oranges
Prediction is very difficult, especially about the future. Niels Bohr
Learn as much as possible from the data
but not too much
x
y
x
y
Minimizing train error ‰ generalization : overfit
G Varoquaux 16
2 Machine learning in a nutshell
Machine learning is about making predictions from data
e.g. learning to distinguish apples from oranges
Prediction is very difficult, especially about the future. Niels Bohr
Learn as much as possible from the data
but not too much
x
y
x
y
Adapting model complexity to data – regularization
G Varoquaux 16
2 Machine learning without learning the machinery
G Varoquaux 17
2 Machine learning without learning the machinery
A library, not a program
More expressive and flexible
Easy to include in an ecosystem
let’s disrupt something new
G Varoquaux 17
2 Machine learning without learning the machinery
A library, not a program
More expressive and flexible
Easy to include in an ecosystem
let’s disrupt something new
As easy as py
from s k l e a r n import svm
c l a s s i f i e r = svm.SVC()
c l a s s i f i e r . f i t ( X t r a i n , y t r a i n )
Y t e s t = c l a s s i f i e r . p r e d i c t ( X t e s t )
G Varoquaux 17
2 Show me your data: the samples ˆ features matrix
Data input: a 2D numerical array
Requires transforming your problem
03078090707907
00790752700578
94071006000797
00970008007000
10000400400090
00050205008000
samples
features
G Varoquaux 18
2 Show me your data: the samples ˆ features matrix
Data input: a 2D numerical array
Requires transforming your problem
With text documents:
03078090707907
00790752700578
94071006000797
00970008007000
10000400400090
00050205008000
documents
the
Python
performance
profiling
module
is
code
can
a
sklearn.feature extraction.text.TfIdfVectorizer
G Varoquaux 18
“Big” data
Engineering efficient processing pipelines
Many samples or
03078090707907
00790752700578
94071006000797
00970008007000
10000400400090
00050205008000
samples
features
Many features
03078090707907
00790752700578
94071006000797
00970008007000
10000400400090
00050205008000
samples
features 03078090707907
00790752700578
94071006000797
00970008007000
10000400400090
00050205008000
See also: http://www.slideshare.net/GaelVaroquaux/processing-
biggish-data-on-commodity-hardware-simple-python-patterns
G Varoquaux 19
2 Many samples: on-line algorithms
e s t i m a t o r . p a r t i a l f i t ( X t r a i n , Y t r a i n )0
3
8
7
8
7
9
4
7
9
7
9
2
7
0
1
7
9
0
7
5
2
7
0
1
5
7
8
9
4
0
7
1
7
4
6
1
2
4
7
9
7
5
4
9
7
0
7
1
8
7
1
7
8
8
7
1
3
6
5
3
4
9
0
4
9
5
1
9
0
7
4
7
5
4
2
6
5
3
5
8
0
9
8
4
8
7
2
1
5
4
6
3
4
9
0
8
4
9
0
3
4
5
6
7
3
2
4
5
6
1
4
7
8
9
5
7
1
8
7
7
4
5
6
2
0
0
3
8
7
8
7
9
4
7
9
7
9
2
7
0
1
7
9
0
7
5
2
7
0
1
5
7
8
9
4
0
7
1
7
4
6
1
2
4
7
9
7
5
4
9
7
0
7
1
8
7
1
7
8
8
7
1
3
6
5
3
4
9
0
4
9
5
1
9
0
7
4
7
5
4
2
6
5
3
5
8
0
9
8
4
8
7
2
1
5
4
6
3
4
9
0
8
4
9
0
3
4
5
6
7
3
2
4
5
6
1
4
7
8
9
5
7
1
8
7
7
4
5
6
2
0
0
3
8
7
8
7
9
4
7
9
7
9
2
7
0
1
7
9
0
7
5
2
7
0
1
5
7
8
9
4
0
7
1
7
4
6
1
2
4
7
9
7
5
4
9
7
0
7
1
8
7
1
7
8
8
7
1
3
6
5
3
4
9
0
4
9
5
1
9
0
7
4
7
5
4
2
6
5
3
5
8
0
9
8
4
8
7
2
1
5
4
6
3
4
9
0
8
4
9
0
3
4
5
6
7
3
2
4
5
6
1
4
7
8
9
5
7
1
8
7
7
4
5
6
2
0
0
3
8
7
8
7
9
4
7
9
7
9
2
7
0
1
7
9
0
7
5
2
7
0
1
5
7
8
9
4
0
7
1
7
4
6
1
2
4
7
9
7
5
4
9
7
0
7
1
8
7
1
7
8
8
7
1
3
6
5
3
4
9
0
4
9
5
1
9
0
7
4
7
5
4
2
6
5
3
5
8
0
9
8
4
8
7
2
1
5
4
6
3
4
9
0
8
4
9
0
3
4
5
6
7
3
2
4
5
6
1
4
7
8
9
5
7
1
8
7
7
4
5
6
2
0
0
3
8
7
8
7
9
4
7
9
7
9
2
7
0
1
7
9
0
7
5
2
7
0
1
5
7
8
9
4
0
7
1
7
4
6
1
2
4
7
9
7
5
4
9
7
0
7
1
8
7
1
7
8
8
7
1
3
6
5
3
4
9
0
4
9
5
1
9
0
7
4
7
5
4
2
6
5
3
5
8
0
9
8
4
8
7
2
1
5
4
6
3
4
9
0
8
4
9
0
3
4
5
6
7
3
2
4
5
6
1
4
7
8
9
5
7
1
8
7
7
4
5
6
2
0
0
3
8
7
8
7
9
4
7
9
7
9
2
7
0
1
7
9
0
7
5
2
7
0
1
5
7
8
9
4
0
7
1
7
4
6
1
2
4
7
9
7
5
4
9
7
0
7
1
8
7
1
7
8
8
7
1
3
6
5
3
4
9
0
4
9
5
1
9
0
7
4
7
5
4
2
6
5
3
5
8
0
9
8
4
8
7
2
1
5
4
6
3
4
9
0
8
4
9
0
3
4
5
6
7
3
2
4
5
6
1
4
7
8
9
5
7
1
8
7
7
4
5
6
2
0
G Varoquaux 20
2 Many samples: on-line algorithms
e s t i m a t o r . p a r t i a l f i t ( X t r a i n , Y t r a i n )
Supervised models: predicting
sklearn.naive bayes...
sklearn.linear model.SGDRegressor
sklearn.linear model.SGDClassifier
Clustering: grouping samples
sklearn.cluster.MiniBatchKMeans
sklearn.cluster.Birch
Linear decompositions: finding new representations
sklearn.decompositions.IncrementalPCA
sklearn.decompositions.MiniBatchDictionaryLearning
sklearn.decompositions.LatentDirichletAllocation
G Varoquaux 20
2 Many features: on-the-fly data reduction
ñ Reduce the data as it is loaded
X s m a l l =
e s t i m a t o r . t r a n s f o r m ( X big , y)
G Varoquaux 21
2 Many features: on-the-fly data reduction
Random projections (will average features)
sklearn.random projection
random linear combinations of the features
Fast clustering of features
sklearn.cluster.FeatureAgglomeration
on images: super-pixel strategy
Hashing when observations have varying size
(e.g. words)
sklearn.feature extraction.text.
HashingVectorizer
stateless: can be used in parallel
G Varoquaux 21
More gems in scikit-learn
SAG:
linear model.LogisticRegression(solver=’sag’)
Fast linear model on biggish data
G Varoquaux 22
More gems in scikit-learn
SAG:
linear model.LogisticRegression(solver=’sag’)
Fast linear model on biggish data
PCA == RandomizedPCA: (0.18)
Heuristic to switch PCA to random linear algebra
Fights global warming
Huge speed gains for biggish data
G Varoquaux 22
More gems in scikit-learn
New cross-validation objects (0.18)
from s k l e a r n . c r o s s v a l i d a t i o n
import S t r a t i f i e d K F o l d
cv = S t r a t i f i e d K F o l d (y , n f o l d s =2)
for t r a i n , t e s t in cv :
X t r a i n = X[ t r a i n ]
y t r a i n = y[ t r a i n ]
Data-independent better nested-CV
G Varoquaux 22
More gems in scikit-learn
New cross-validation objects (0.18)
from s k l e a r n . m o d e l s e l e c t i o n
import S t r a t i f i e d K F o l d
cv = S t r a t i f i e d K F o l d ( n f o l d s =2)
for t r a i n , t e s t in cv . s p l i t (X, y):
X t r a i n = X[ t r a i n ]
y t r a i n = y[ t r a i n ]
Data-independent ñ better nested-CV
G Varoquaux 22
More gems in scikit-learn
Outlier detection and isolation forests (0.18)
G Varoquaux 22
3 Outlook: infrastructure
No strings attached
G Varoquaux 23
3 Dataflow is key to scale
Array computing
CPU
03878794797927
01790752701578
03878794797927
01790752701578
Data parallel
03878794797927
03878794797927
Streaming
0
3
8
7
8
7
9
4
7
9
7
9
2
7
0
1
7
9
0
7
5
2
7
0
1
5
7
8
9
4
0
7
1
7
4
6
1
2
4
7
9
7
5
4
9
7
0
7
1
8
7
1
7
8
8
7
1
3
6
5
3
4
9
0
4
9
5
1
9
0
7
4
7
5
4
2
6
5
3
5
8
0
9
8
4
8
7
2
1
5
4
6
3
4
9
0
8
4
9
0
3
4
5
6
7
3
2
4
5
6
1
4
7
8
9
5
7
1
8
7
7
4
5
6
2
0
0
3
8
7
8
7
9
4
7
9
7
9
2
7
0
1
7
9
0
7
5
2
7
0
1
5
7
8
9
4
0
7
1
7
4
6
1
2
4
7
9
7
5
4
9
7
0
7
1
8
7
1
7
8
8
7
1
3
6
5
3
4
9
0
4
9
5
1
9
0
7
4
7
5
4
2
6
5
3
5
8
0
9
8
4
8
7
2
1
5
4
6
3
4
9
0
8
4
9
0
3
4
5
6
7
3
2
4
5
6
1
4
7
8
9
5
7
1
8
7
7
4
5
6
2
0
0
3
8
7
8
7
9
4
7
9
7
9
2
7
0
1
7
9
0
7
5
2
7
0
1
5
7
8
9
4
0
7
1
7
4
6
1
2
4
7
9
7
5
4
9
7
0
7
1
8
7
1
7
8
8
7
1
3
6
5
3
4
9
0
4
9
5
1
9
0
7
4
7
5
4
2
6
5
3
5
8
0
9
8
4
8
7
2
1
5
4
6
3
4
9
0
8
4
9
0
3
4
5
6
7
3
2
4
5
6
1
4
7
8
9
5
7
1
8
7
7
4
5
6
2
0
0
3
8
7
8
7
9
4
7
9
7
9
2
7
0
1
7
9
0
7
5
2
7
0
1
5
7
8
9
4
0
7
1
7
4
6
1
2
4
7
9
7
5
4
9
7
0
7
1
8
7
1
7
8
8
7
1
3
6
5
3
4
9
0
4
9
5
1
9
0
7
4
7
5
4
2
6
5
3
5
8
0
9
8
4
8
7
2
1
5
4
6
3
4
9
0
8
4
9
0
3
4
5
6
7
3
2
4
5
6
1
4
7
8
9
5
7
1
8
7
7
4
5
6
2
0
0
3
8
7
8
7
9
4
7
9
7
9
2
7
0
1
7
9
0
7
5
2
7
0
1
5
7
8
9
4
0
7
1
7
4
6
1
2
4
7
9
7
5
4
9
7
0
7
1
8
7
1
7
8
8
7
1
3
6
5
3
4
9
0
4
9
5
1
9
0
7
4
7
5
4
2
6
5
3
5
8
0
9
8
4
8
7
2
1
5
4
6
3
4
9
0
8
4
9
0
3
4
5
6
7
3
2
4
5
6
1
4
7
8
9
5
7
1
8
7
7
4
5
6
2
0
0
3
8
7
8
7
9
4
7
9
7
9
2
7
0
1
7
9
0
7
5
2
7
0
1
5
7
8
9
4
0
7
1
7
4
6
1
2
4
7
9
7
5
4
9
7
0
7
1
8
7
1
7
8
8
7
1
3
6
5
3
4
9
0
4
9
5
1
9
0
7
4
7
5
4
2
6
5
3
5
8
0
9
8
4
8
7
2
1
5
4
6
3
4
9
0
8
4
9
0
3
4
5
6
7
3
2
4
5
6
1
4
7
8
9
5
7
1
8
7
7
4
5
6
2
0
Parallel computing
Data + code transfer Out-of-memory persistence
These patterns can yield horrible code
G Varoquaux 24
3 Parallel-computing engine: joblib
sklearn.Estimator(n jobs=2)
G Varoquaux 25
3 Parallel-computing engine: joblib
sklearn.Estimator(n jobs=2)
Under the hood: joblib
ąąąąąąąąą from joblib import Parallel, delayed
ąąąąąąąąą Parallel(n jobs=2)(delayed(sqrt)(i**2)
... for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Threading and multiprocessing mode
G Varoquaux 25
3 Parallel-computing engine: joblib
sklearn.Estimator(n jobs=2)
Under the hood: joblib
ąąąąąąąąą from joblib import Parallel, delayed
ąąąąąąąąą Parallel(n jobs=2)(delayed(sqrt)(i**2)
... for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Threading and multiprocessing mode
G Varoquaux 25
3 Parallel-computing engine: joblib
sklearn.Estimator(n jobs=2)
Under the hood: joblib
ąąąąąąąąą from joblib import Parallel, delayed
ąąąąąąąąą Parallel(n jobs=2)(delayed(sqrt)(i**2)
... for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Threading and multiprocessing mode
New: distributed computing backends:
Yarn, dask.distributed, IPython.parallel
import distributed.joblib
from joblib import Parallel, parallel backend
with parallel backend(’dask.distributed’,
scheduler host=’HOST:PORT’):
# normal Joblib code
G Varoquaux 25
3 Parallel-computing engine: joblib
sklearn.Estimator(n jobs=2)
Under the hood: joblib
ąąąąąąąąą from joblib import Parallel, delayed
ąąąąąąąąą Parallel(n jobs=2)(delayed(sqrt)(i**2)
... for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Threading and multiprocessing mode
New: distributed computing backends:
Yarn, dask.distributed, IPython.parallel
import distributed.joblib
from joblib import Parallel, parallel backend
with parallel backend(’dask.distributed’,
scheduler host=’HOST:PORT’):
# normal Joblib code
Algorithmic plain-Python code
with optional bells and whistles
G Varoquaux 25
3 Persistence to disk
Persisting any Python object fast, with little overhead
Streamed compression
I/O from/in open file handles
ñ In S3, HDFS
G Varoquaux 26
3 joblib.Memory as a storage pool in dev
S3/HDFS/cloud backend:
joblib.Memory(’uri’, backend=’s3’)
https://github.com/joblib/joblib/pull/397
G Varoquaux 27
3 joblib.Memory as a storage pool in dev
S3/HDFS/cloud backend:
joblib.Memory(’uri’, backend=’s3’)
https://github.com/joblib/joblib/pull/397
Cache replacement:
mem = joblib.Memory(’uri’, bytes limit=’1G’)
mem.reduce size() # Remove oldest
G Varoquaux 27
3 joblib.Memory as a storage pool in dev
S3/HDFS/cloud backend:
joblib.Memory(’uri’, backend=’s3’)
https://github.com/joblib/joblib/pull/397
Cache replacement:
mem = joblib.Memory(’uri’, bytes limit=’1G’)
mem.reduce size() # Remove oldest
Out-of-memory computing
ąąąąąąąąą result = mem.cache(g).call and shelve(a)
ąąąąąąąąą result
MemorizedResult(cachedir=”...”, func=”g”, argument hash=”...”)
ąąąąąąąąą c = result.get()
G Varoquaux 27
3 joblib as bricks of a compute engine: vision
03878794797927
03878794797927
G Varoquaux 28
3 joblib as bricks of a compute engine: vision
03878794797927
03878794797927
Parallel on
a cloud/cluster
Memory + shelving
on distributed stores
as out-of-core
& common
computation
Simple algorithmic code
Using infrastructure without buying into it
G Varoquaux 28
Time to wrap up
Time to wrap up
Code, code, code
Scipy-lectures: learning numerical Python
Many problems are better solved by
documentation than new code
Scipy-lectures: learning numerical Python
Comprehensive document: numpy, scipy, ...
1. Getting started with Python for science
2. Advanced topics
3. Packages and applications
http://scipy-lectures.org
Scipy-lectures: learning numerical Python
Code examples
On the code of data science
@GaelVaroquaux
I believe in code
On the code of data science
@GaelVaroquaux
I believe in code
without compromises
Libraries
with side-effect free code
tight APIs
decoupling of analysis
and plotting
Software engineering
version control
On the code of data science
@GaelVaroquaux
I believe in code
without compromises
Code empowers
enables automation
opens new applications
Disrupt something new
We use scikit-learn for markers
of neuropsychiatric diseases
On the code of data science
@GaelVaroquaux
I believe in code
without compromises
Code empowers
Software purity holds back
Interactivity:
brittle and costly
but necessary for insights
Refusing syntax shortcuts
ñ verbosity
On the code of data science
@GaelVaroquaux
I believe in code
without compromises
Code empowers
Software purity holds back
The cost is worth the benefit in the long run

Weitere ähnliche Inhalte

Andere mochten auch

Scientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataScientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataGael Varoquaux
 
Simple big data, in Python
Simple big data, in PythonSimple big data, in Python
Simple big data, in PythonGael Varoquaux
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectGael Varoquaux
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Gael Varoquaux
 
Brain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizationsBrain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizationsGael Varoquaux
 
Inter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIInter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIGael Varoquaux
 
Machine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsMachine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsGael Varoquaux
 
A hand-waving introduction to sparsity for compressed tomography reconstruction
A hand-waving introduction to sparsity for compressed tomography reconstructionA hand-waving introduction to sparsity for compressed tomography reconstruction
A hand-waving introduction to sparsity for compressed tomography reconstructionGael Varoquaux
 
Advanced network modelling 2: connectivity measures, goup analysis
Advanced network modelling 2: connectivity measures, goup analysisAdvanced network modelling 2: connectivity measures, goup analysis
Advanced network modelling 2: connectivity measures, goup analysisGael Varoquaux
 
Brain network modelling: connectivity metrics and group analysis
Brain network modelling: connectivity metrics and group analysisBrain network modelling: connectivity metrics and group analysis
Brain network modelling: connectivity metrics and group analysisGael Varoquaux
 
Social-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsitySocial-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsityGael Varoquaux
 
Connectomics: Parcellations and Network Analysis Methods
Connectomics: Parcellations and Network Analysis MethodsConnectomics: Parcellations and Network Analysis Methods
Connectomics: Parcellations and Network Analysis MethodsGael Varoquaux
 
Scikit learn: apprentissage statistique en Python
Scikit learn: apprentissage statistique en PythonScikit learn: apprentissage statistique en Python
Scikit learn: apprentissage statistique en PythonGael Varoquaux
 
Brain reading, compressive sensing, fMRI and statistical learning in Python
Brain reading, compressive sensing, fMRI and statistical learning in PythonBrain reading, compressive sensing, fMRI and statistical learning in Python
Brain reading, compressive sensing, fMRI and statistical learning in PythonGael Varoquaux
 
Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...
Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...
Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...Gael Varoquaux
 

Andere mochten auch (15)

Scientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataScientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of data
 
Simple big data, in Python
Simple big data, in PythonSimple big data, in Python
Simple big data, in Python
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the project
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
 
Brain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizationsBrain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizations
 
Inter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIInter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRI
 
Machine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsMachine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questions
 
A hand-waving introduction to sparsity for compressed tomography reconstruction
A hand-waving introduction to sparsity for compressed tomography reconstructionA hand-waving introduction to sparsity for compressed tomography reconstruction
A hand-waving introduction to sparsity for compressed tomography reconstruction
 
Advanced network modelling 2: connectivity measures, goup analysis
Advanced network modelling 2: connectivity measures, goup analysisAdvanced network modelling 2: connectivity measures, goup analysis
Advanced network modelling 2: connectivity measures, goup analysis
 
Brain network modelling: connectivity metrics and group analysis
Brain network modelling: connectivity metrics and group analysisBrain network modelling: connectivity metrics and group analysis
Brain network modelling: connectivity metrics and group analysis
 
Social-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsitySocial-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsity
 
Connectomics: Parcellations and Network Analysis Methods
Connectomics: Parcellations and Network Analysis MethodsConnectomics: Parcellations and Network Analysis Methods
Connectomics: Parcellations and Network Analysis Methods
 
Scikit learn: apprentissage statistique en Python
Scikit learn: apprentissage statistique en PythonScikit learn: apprentissage statistique en Python
Scikit learn: apprentissage statistique en Python
 
Brain reading, compressive sensing, fMRI and statistical learning in Python
Brain reading, compressive sensing, fMRI and statistical learning in PythonBrain reading, compressive sensing, fMRI and statistical learning in Python
Brain reading, compressive sensing, fMRI and statistical learning in Python
 
Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...
Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...
Scikit-learn: apprentissage statistique en Python. Créer des machines intelli...
 

Ähnlich wie On the code of data science

Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael VaroquauxBuilding a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael VaroquauxPyData
 
20090918 Agile Computer Control of a Complex Experiment
20090918 Agile Computer Control of a Complex Experiment20090918 Agile Computer Control of a Complex Experiment
20090918 Agile Computer Control of a Complex ExperimentJonathan Blakes
 
Data science workflows: from notebooks to production
Data science workflows: from notebooks to productionData science workflows: from notebooks to production
Data science workflows: from notebooks to productionMarissa Saunders
 
Vision Algorithmics
Vision AlgorithmicsVision Algorithmics
Vision Algorithmicspotaters
 
Start with version control and experiments management in machine learning
Start with version control and experiments management in machine learningStart with version control and experiments management in machine learning
Start with version control and experiments management in machine learningMikhail Rozhkov
 
Jay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIJay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIAI Frontiers
 
Bacalhau: Stable Diffusion on a GPU
Bacalhau: Stable Diffusion on a GPUBacalhau: Stable Diffusion on a GPU
Bacalhau: Stable Diffusion on a GPUAlly339821
 
What is Python? An overview of Python for science.
What is Python? An overview of Python for science.What is Python? An overview of Python for science.
What is Python? An overview of Python for science.Nicholas Pringle
 
Bug prediction + sdlc automation
Bug prediction + sdlc automationBug prediction + sdlc automation
Bug prediction + sdlc automationAlexey Tokar
 
Kubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slidesKubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slidesWeaveworks
 
Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Gael Varoquaux
 
Data Science in Production: Technologies That Drive Adoption of Data Science ...
Data Science in Production: Technologies That Drive Adoption of Data Science ...Data Science in Production: Technologies That Drive Adoption of Data Science ...
Data Science in Production: Technologies That Drive Adoption of Data Science ...Nir Yungster
 
ODSC West 2022 – Kitbashing in ML
ODSC West 2022 – Kitbashing in MLODSC West 2022 – Kitbashing in ML
ODSC West 2022 – Kitbashing in MLBryan Bischof
 
Data science tools of the trade
Data science tools of the tradeData science tools of the trade
Data science tools of the tradeFangda Wang
 
An introduction to_programming_the_microchip_pic_in_ccs_c
An introduction to_programming_the_microchip_pic_in_ccs_cAn introduction to_programming_the_microchip_pic_in_ccs_c
An introduction to_programming_the_microchip_pic_in_ccs_cSuresh Murugesan
 
Industrializing Machine learning pipelines
Industrializing Machine learning pipelinesIndustrializing Machine learning pipelines
Industrializing Machine learning pipelinesGermain Tanguy
 
digitaldesign-s20-lecture3b-fpga-afterlecture.pdf
digitaldesign-s20-lecture3b-fpga-afterlecture.pdfdigitaldesign-s20-lecture3b-fpga-afterlecture.pdf
digitaldesign-s20-lecture3b-fpga-afterlecture.pdfDuy-Hieu Bui
 

Ähnlich wie On the code of data science (20)

Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael VaroquauxBuilding a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
Building a Cutting-Edge Data Process Environment on a Budget by Gael Varoquaux
 
20090918 Agile Computer Control of a Complex Experiment
20090918 Agile Computer Control of a Complex Experiment20090918 Agile Computer Control of a Complex Experiment
20090918 Agile Computer Control of a Complex Experiment
 
Data science workflows: from notebooks to production
Data science workflows: from notebooks to productionData science workflows: from notebooks to production
Data science workflows: from notebooks to production
 
Vision Algorithmics
Vision AlgorithmicsVision Algorithmics
Vision Algorithmics
 
Start with version control and experiments management in machine learning
Start with version control and experiments management in machine learningStart with version control and experiments management in machine learning
Start with version control and experiments management in machine learning
 
Fuzzing - Part 2
Fuzzing - Part 2Fuzzing - Part 2
Fuzzing - Part 2
 
Jay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIJay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AI
 
Bacalhau: Stable Diffusion on a GPU
Bacalhau: Stable Diffusion on a GPUBacalhau: Stable Diffusion on a GPU
Bacalhau: Stable Diffusion on a GPU
 
What is Python? An overview of Python for science.
What is Python? An overview of Python for science.What is Python? An overview of Python for science.
What is Python? An overview of Python for science.
 
Bug prediction + sdlc automation
Bug prediction + sdlc automationBug prediction + sdlc automation
Bug prediction + sdlc automation
 
Kubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slidesKubecon seattle 2018 workshop slides
Kubecon seattle 2018 workshop slides
 
Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...
 
Data Science in Production: Technologies That Drive Adoption of Data Science ...
Data Science in Production: Technologies That Drive Adoption of Data Science ...Data Science in Production: Technologies That Drive Adoption of Data Science ...
Data Science in Production: Technologies That Drive Adoption of Data Science ...
 
ODSC West 2022 – Kitbashing in ML
ODSC West 2022 – Kitbashing in MLODSC West 2022 – Kitbashing in ML
ODSC West 2022 – Kitbashing in ML
 
Data science tools of the trade
Data science tools of the tradeData science tools of the trade
Data science tools of the trade
 
An introduction to_programming_the_microchip_pic_in_ccs_c
An introduction to_programming_the_microchip_pic_in_ccs_cAn introduction to_programming_the_microchip_pic_in_ccs_c
An introduction to_programming_the_microchip_pic_in_ccs_c
 
Industrializing Machine learning pipelines
Industrializing Machine learning pipelinesIndustrializing Machine learning pipelines
Industrializing Machine learning pipelines
 
digitaldesign-s20-lecture3b-fpga-afterlecture.pdf
digitaldesign-s20-lecture3b-fpga-afterlecture.pdfdigitaldesign-s20-lecture3b-fpga-afterlecture.pdf
digitaldesign-s20-lecture3b-fpga-afterlecture.pdf
 
Power Meter Presentation
Power Meter PresentationPower Meter Presentation
Power Meter Presentation
 
CG-Orientation ppt.pptx
CG-Orientation ppt.pptxCG-Orientation ppt.pptx
CG-Orientation ppt.pptx
 

Mehr von Gael Varoquaux

Evaluating machine learning models and their diagnostic value
Evaluating machine learning models and their diagnostic valueEvaluating machine learning models and their diagnostic value
Evaluating machine learning models and their diagnostic valueGael Varoquaux
 
Measuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingMeasuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingGael Varoquaux
 
Machine learning with missing values
Machine learning with missing valuesMachine learning with missing values
Machine learning with missing valuesGael Varoquaux
 
Dirty data science machine learning on non-curated data
Dirty data science machine learning on non-curated dataDirty data science machine learning on non-curated data
Dirty data science machine learning on non-curated dataGael Varoquaux
 
Representation learning in limited-data settings
Representation learning in limited-data settingsRepresentation learning in limited-data settings
Representation learning in limited-data settingsGael Varoquaux
 
Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Gael Varoquaux
 
Atlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingAtlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingGael Varoquaux
 
Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesGael Varoquaux
 
Machine learning for functional connectomes
Machine learning for functional connectomesMachine learning for functional connectomes
Machine learning for functional connectomesGael Varoquaux
 
Towards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingTowards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingGael Varoquaux
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Gael Varoquaux
 
A tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingA tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingGael Varoquaux
 
Coding for science and innovation
Coding for science and innovationCoding for science and innovation
Coding for science and innovationGael Varoquaux
 
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsEstimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsGael Varoquaux
 

Mehr von Gael Varoquaux (14)

Evaluating machine learning models and their diagnostic value
Evaluating machine learning models and their diagnostic valueEvaluating machine learning models and their diagnostic value
Evaluating machine learning models and their diagnostic value
 
Measuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingMeasuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imaging
 
Machine learning with missing values
Machine learning with missing valuesMachine learning with missing values
Machine learning with missing values
 
Dirty data science machine learning on non-curated data
Dirty data science machine learning on non-curated dataDirty data science machine learning on non-curated data
Dirty data science machine learning on non-curated data
 
Representation learning in limited-data settings
Representation learning in limited-data settingsRepresentation learning in limited-data settings
Representation learning in limited-data settings
 
Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?
 
Atlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingAtlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mapping
 
Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variables
 
Machine learning for functional connectomes
Machine learning for functional connectomesMachine learning for functional connectomes
Machine learning for functional connectomes
 
Towards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingTowards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imaging
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
 
A tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingA tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imaging
 
Coding for science and innovation
Coding for science and innovationCoding for science and innovation
Coding for science and innovation
 
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsEstimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
 

Kürzlich hochgeladen

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Kürzlich hochgeladen (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

On the code of data science

  • 1. On the code of data science Varoquaux Ga¨el import data science data science.discover()
  • 2. On the code of data science Varoquaux Ga¨el import data science data science.discover() Disclaimer: A bit of a Python bias Key ideas carry over
  • 3. What is data science? G Varoquaux 2
  • 4. What is data science? G Varoquaux 2
  • 5. Data science = statistics + code Larger datasets Automating insights Data science is disruptive when it enables real-time or personalized decisions G Varoquaux 3
  • 6. 1 Writing code for data science 2 Machine learning in Python 3 Outlook: infrastructure Make you more productive as a data scientist G Varoquaux 4
  • 7. 1 Writing code for data science Productivity & Quality G Varoquaux 5
  • 8. 1 A data-science workflow Work based on intuition and experimentation Conjecture Experiment ñ Interactive & framework-less Yet needs consolidation keeping flexibility G Varoquaux 6
  • 9. 1 A design pattern in computational experiments MVC pattern from Wikipedia: Model Manages the data and rules of the application View Output represen- tation Possibly several views Controller Accepts input and converts it to commands for model and view Photo-editing software Filters Canvas Tool palettes Typical web application Database Web-pages URLs G Varoquaux 7
  • 10. 1 A design pattern in computational experiments MVC pattern from Wikipedia: Model Manages the data and rules of the application View Output represen- tation Possibly several views Controller Accepts input and converts it to commands for model and view For data science: Numerical, data- processing, & ex- perimental logic Results, as files. Data & plots Imperative API Avoid input as files: not expressive Module with functions Post-processing script CSV & data files Script ñ for loops G Varoquaux 7
  • 11. 1 A design pattern in computational experiments MVC pattern from Wikipedia: Model Manages the data and rules of the application View Output represen- tation Possibly several views Controller Accepts input and converts it to commands for model and view For data science: Numerical, data- processing, & ex- perimental logic Results, as files. Data & plots Imperative API Avoid input as files: not expressive Module with functions Post-processing script CSV & data files Script ñ for loops A recipe 3 types of files: • modules • command scripts • post-processing scripts CSVs & intermediate data files Separate computation from analysis / plotting Code and text (and data) ñ version control G Varoquaux 7
  • 12. 1 A design pattern in computational experiments MVC pattern from Wikipedia: Model Manages the data and rules of the application View Output represen- tation Possibly several views Controller Accepts input and converts it to commands for model and view For data science: Numerical, data- processing, & ex- perimental logic Results, as files. Data & plots Imperative API Avoid input as files: not expressive Module with functions Post-processing script CSV & data files Script ñ for loops A recipe 3 types of files: • modules • command scripts • post-processing scripts CSVs & intermediate data files Separate computation from analysis / plotting Code and text (and data) ñ version control Decouple steps Goals: Reuse code Mitigate compute time G Varoquaux 7
  • 13. 1 How I work progressive consolidation Start with a script playing to understand the problem G Varoquaux 8
  • 14. 1 How I work progressive consolidation Start with a script playing to understand the problem Identify blocks/operations ñ move to a function Use functions Obstacle: local scope requires identifying input and output variables That’s a good thing Solution: Interactive debugging / understanding inside a function: %debug in IPython Functions are the basic reusable abstraction G Varoquaux 8
  • 15. 1 How I work progressive consolidation Start with a script playing to understand the problem Identify blocks/operations ñ move to a function As they stabilize, move to a module Modules enable sharing between experiments ñ avoid 1000 lines scripts + commented code enable testing Fast experiments as tests ñ gives confidence, hence refactorings G Varoquaux 8
  • 16. 1 How I work progressive consolidation Start with a script playing to understand the problem Identify blocks/operations ñ move to a function As they stabilize, move to a module Clean: delete code & files you have version control Attentional load makes it impossible to find or understand things Where’s Waldo?G Varoquaux 8
  • 17. 1 How I work progressive consolidation Start with a script playing to understand the problem Identify blocks/operations ñ move to a function As they stabilize, move to a module Clean: delete code & files you have version control Why is it hard? Long compute times make us unadventurous Know your tools Refactoring editor Version control G Varoquaux 8
  • 18. 1 Caching to tame computation time The memoize pattern mem = joblib.Memory(cachedir=’.’) g = mem.cache(f) b = g(a) # computes a using f c = g(a) # retrieves results from memory For data science a & b can be big a & b arbitrary objects no change in workflow Results stored on disk G Varoquaux 9
  • 19. 1 Caching to tame computation time The memoize pattern mem = joblib.Memory(cachedir=’.’) g = mem.cache(f) b = g(a) # computes a using f c = g(a) # retrieves results from memory For data science Fits in experimentation loop Helps decrease re-run times œ Black-boxy, persistence only implicit G Varoquaux 9
  • 20. Adopting software-engineering best practices G Varoquaux 10
  • 21. 1 The ladder of code quality Use linting/code analysis in your editor seriously G Varoquaux 11
  • 22. 1 The ladder of code quality Use linting/code analysis in your editor seriously Coding convention, good naming Version control Use git + github Code review Unit testing If it’s not tested, it’s broken or soon will be Make a package controlled dependencies and compilation ... G Varoquaux 11
  • 23. 1 The ladder of code qualityIncreasingcost ?İ Use linting/code analysis in your editor seriously Coding convention, good naming Version control Use git + github Code review Unit testing If it’s not tested, it’s broken or soon will be Make a package controlled dependencies and compilation ... Avoid premature software engineering G Varoquaux 11
  • 24. 1 The ladder of code qualityIncreasingcost ?İ Use linting/code analysis in your editor seriously Coding convention, good naming Version control Use git + github Code review Unit testing If it’s not tested, it’s broken or soon will be Make a package controlled dependencies and compilation ... Avoid premature software engineering Over versus under engineering When the goal is generating insights Experimentation to develop intuitions ñ new ideas As the path becomes clear: consolidation Heavy engineering too early freezes bad ideas G Varoquaux 11
  • 25. 1 LibrariesIncreasingcost ?İ Use linting/code analysis in your editor seriously Coding convention, good naming Version control Use git + github Code review Unit testing If it’s not tested, it’s broken or soon will be Make a package controlled dependencies and compilation ... A library G Varoquaux 12
  • 26. 2 Machine learning in Python scikit-learn G Varoquaux 13
  • 28. 2 My stack for data science Python, what else? General-purpose language Interactive Easy to read / write G Varoquaux 15
  • 29. 2 My stack for data science The scientific Python stack numpy arrays Mostly a float** No annotation / structure Universal across applications Easily shared across languages 03878794797927 01790752701578 94071746124797 54970718717887 13653490495190 74754265358098 48721546349084 90345673245614 57187745620 03878794797927 01790752701578 94071746124797 54970718717887 13653490495190 74754265358098 48721546349084 90345673245614 7187745620 G Varoquaux 15
  • 30. 2 My stack for data science The scientific Python stack numpy arrays Connecting to pandas Columnar data scikit-image Images scipy Numerics, signal processing ... G Varoquaux 15
  • 31. 2 Machine learning in a nutshell Machine learning is about making predictions from data e.g. learning to distinguish apples from oranges G Varoquaux 16
  • 32. 2 Machine learning in a nutshell Machine learning is about making predictions from data e.g. learning to distinguish apples from oranges Prediction is very difficult, especially about the future. Niels Bohr Learn as much as possible from the data but not too much G Varoquaux 16
  • 33. 2 Machine learning in a nutshell Machine learning is about making predictions from data e.g. learning to distinguish apples from oranges Prediction is very difficult, especially about the future. Niels Bohr Learn as much as possible from the data but not too much x y x y Which model do you prefer? G Varoquaux 16
  • 34. 2 Machine learning in a nutshell Machine learning is about making predictions from data e.g. learning to distinguish apples from oranges Prediction is very difficult, especially about the future. Niels Bohr Learn as much as possible from the data but not too much x y x y Minimizing train error ‰ generalization : overfit G Varoquaux 16
  • 35. 2 Machine learning in a nutshell Machine learning is about making predictions from data e.g. learning to distinguish apples from oranges Prediction is very difficult, especially about the future. Niels Bohr Learn as much as possible from the data but not too much x y x y Adapting model complexity to data – regularization G Varoquaux 16
  • 36. 2 Machine learning without learning the machinery G Varoquaux 17
  • 37. 2 Machine learning without learning the machinery A library, not a program More expressive and flexible Easy to include in an ecosystem let’s disrupt something new G Varoquaux 17
  • 38. 2 Machine learning without learning the machinery A library, not a program More expressive and flexible Easy to include in an ecosystem let’s disrupt something new As easy as py from s k l e a r n import svm c l a s s i f i e r = svm.SVC() c l a s s i f i e r . f i t ( X t r a i n , y t r a i n ) Y t e s t = c l a s s i f i e r . p r e d i c t ( X t e s t ) G Varoquaux 17
  • 39. 2 Show me your data: the samples ˆ features matrix Data input: a 2D numerical array Requires transforming your problem 03078090707907 00790752700578 94071006000797 00970008007000 10000400400090 00050205008000 samples features G Varoquaux 18
  • 40. 2 Show me your data: the samples ˆ features matrix Data input: a 2D numerical array Requires transforming your problem With text documents: 03078090707907 00790752700578 94071006000797 00970008007000 10000400400090 00050205008000 documents the Python performance profiling module is code can a sklearn.feature extraction.text.TfIdfVectorizer G Varoquaux 18
  • 41. “Big” data Engineering efficient processing pipelines Many samples or 03078090707907 00790752700578 94071006000797 00970008007000 10000400400090 00050205008000 samples features Many features 03078090707907 00790752700578 94071006000797 00970008007000 10000400400090 00050205008000 samples features 03078090707907 00790752700578 94071006000797 00970008007000 10000400400090 00050205008000 See also: http://www.slideshare.net/GaelVaroquaux/processing- biggish-data-on-commodity-hardware-simple-python-patterns G Varoquaux 19
  • 42. 2 Many samples: on-line algorithms e s t i m a t o r . p a r t i a l f i t ( X t r a i n , Y t r a i n )0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 G Varoquaux 20
  • 43. 2 Many samples: on-line algorithms e s t i m a t o r . p a r t i a l f i t ( X t r a i n , Y t r a i n ) Supervised models: predicting sklearn.naive bayes... sklearn.linear model.SGDRegressor sklearn.linear model.SGDClassifier Clustering: grouping samples sklearn.cluster.MiniBatchKMeans sklearn.cluster.Birch Linear decompositions: finding new representations sklearn.decompositions.IncrementalPCA sklearn.decompositions.MiniBatchDictionaryLearning sklearn.decompositions.LatentDirichletAllocation G Varoquaux 20
  • 44. 2 Many features: on-the-fly data reduction ñ Reduce the data as it is loaded X s m a l l = e s t i m a t o r . t r a n s f o r m ( X big , y) G Varoquaux 21
  • 45. 2 Many features: on-the-fly data reduction Random projections (will average features) sklearn.random projection random linear combinations of the features Fast clustering of features sklearn.cluster.FeatureAgglomeration on images: super-pixel strategy Hashing when observations have varying size (e.g. words) sklearn.feature extraction.text. HashingVectorizer stateless: can be used in parallel G Varoquaux 21
  • 46. More gems in scikit-learn SAG: linear model.LogisticRegression(solver=’sag’) Fast linear model on biggish data G Varoquaux 22
  • 47. More gems in scikit-learn SAG: linear model.LogisticRegression(solver=’sag’) Fast linear model on biggish data PCA == RandomizedPCA: (0.18) Heuristic to switch PCA to random linear algebra Fights global warming Huge speed gains for biggish data G Varoquaux 22
  • 48. More gems in scikit-learn New cross-validation objects (0.18) from s k l e a r n . c r o s s v a l i d a t i o n import S t r a t i f i e d K F o l d cv = S t r a t i f i e d K F o l d (y , n f o l d s =2) for t r a i n , t e s t in cv : X t r a i n = X[ t r a i n ] y t r a i n = y[ t r a i n ] Data-independent better nested-CV G Varoquaux 22
  • 49. More gems in scikit-learn New cross-validation objects (0.18) from s k l e a r n . m o d e l s e l e c t i o n import S t r a t i f i e d K F o l d cv = S t r a t i f i e d K F o l d ( n f o l d s =2) for t r a i n , t e s t in cv . s p l i t (X, y): X t r a i n = X[ t r a i n ] y t r a i n = y[ t r a i n ] Data-independent ñ better nested-CV G Varoquaux 22
  • 50. More gems in scikit-learn Outlier detection and isolation forests (0.18) G Varoquaux 22
  • 51. 3 Outlook: infrastructure No strings attached G Varoquaux 23
  • 52. 3 Dataflow is key to scale Array computing CPU 03878794797927 01790752701578 03878794797927 01790752701578 Data parallel 03878794797927 03878794797927 Streaming 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 Parallel computing Data + code transfer Out-of-memory persistence These patterns can yield horrible code G Varoquaux 24
  • 53. 3 Parallel-computing engine: joblib sklearn.Estimator(n jobs=2) G Varoquaux 25
  • 54. 3 Parallel-computing engine: joblib sklearn.Estimator(n jobs=2) Under the hood: joblib ąąąąąąąąą from joblib import Parallel, delayed ąąąąąąąąą Parallel(n jobs=2)(delayed(sqrt)(i**2) ... for i in range(8)) [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0] Threading and multiprocessing mode G Varoquaux 25
  • 55. 3 Parallel-computing engine: joblib sklearn.Estimator(n jobs=2) Under the hood: joblib ąąąąąąąąą from joblib import Parallel, delayed ąąąąąąąąą Parallel(n jobs=2)(delayed(sqrt)(i**2) ... for i in range(8)) [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0] Threading and multiprocessing mode G Varoquaux 25
  • 56. 3 Parallel-computing engine: joblib sklearn.Estimator(n jobs=2) Under the hood: joblib ąąąąąąąąą from joblib import Parallel, delayed ąąąąąąąąą Parallel(n jobs=2)(delayed(sqrt)(i**2) ... for i in range(8)) [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0] Threading and multiprocessing mode New: distributed computing backends: Yarn, dask.distributed, IPython.parallel import distributed.joblib from joblib import Parallel, parallel backend with parallel backend(’dask.distributed’, scheduler host=’HOST:PORT’): # normal Joblib code G Varoquaux 25
  • 57. 3 Parallel-computing engine: joblib sklearn.Estimator(n jobs=2) Under the hood: joblib ąąąąąąąąą from joblib import Parallel, delayed ąąąąąąąąą Parallel(n jobs=2)(delayed(sqrt)(i**2) ... for i in range(8)) [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0] Threading and multiprocessing mode New: distributed computing backends: Yarn, dask.distributed, IPython.parallel import distributed.joblib from joblib import Parallel, parallel backend with parallel backend(’dask.distributed’, scheduler host=’HOST:PORT’): # normal Joblib code Algorithmic plain-Python code with optional bells and whistles G Varoquaux 25
  • 58. 3 Persistence to disk Persisting any Python object fast, with little overhead Streamed compression I/O from/in open file handles ñ In S3, HDFS G Varoquaux 26
  • 59. 3 joblib.Memory as a storage pool in dev S3/HDFS/cloud backend: joblib.Memory(’uri’, backend=’s3’) https://github.com/joblib/joblib/pull/397 G Varoquaux 27
  • 60. 3 joblib.Memory as a storage pool in dev S3/HDFS/cloud backend: joblib.Memory(’uri’, backend=’s3’) https://github.com/joblib/joblib/pull/397 Cache replacement: mem = joblib.Memory(’uri’, bytes limit=’1G’) mem.reduce size() # Remove oldest G Varoquaux 27
  • 61. 3 joblib.Memory as a storage pool in dev S3/HDFS/cloud backend: joblib.Memory(’uri’, backend=’s3’) https://github.com/joblib/joblib/pull/397 Cache replacement: mem = joblib.Memory(’uri’, bytes limit=’1G’) mem.reduce size() # Remove oldest Out-of-memory computing ąąąąąąąąą result = mem.cache(g).call and shelve(a) ąąąąąąąąą result MemorizedResult(cachedir=”...”, func=”g”, argument hash=”...”) ąąąąąąąąą c = result.get() G Varoquaux 27
  • 62. 3 joblib as bricks of a compute engine: vision 03878794797927 03878794797927 G Varoquaux 28
  • 63. 3 joblib as bricks of a compute engine: vision 03878794797927 03878794797927 Parallel on a cloud/cluster Memory + shelving on distributed stores as out-of-core & common computation Simple algorithmic code Using infrastructure without buying into it G Varoquaux 28
  • 65. Time to wrap up Code, code, code
  • 66. Scipy-lectures: learning numerical Python Many problems are better solved by documentation than new code
  • 67. Scipy-lectures: learning numerical Python Comprehensive document: numpy, scipy, ... 1. Getting started with Python for science 2. Advanced topics 3. Packages and applications http://scipy-lectures.org
  • 68. Scipy-lectures: learning numerical Python Code examples
  • 69. On the code of data science @GaelVaroquaux I believe in code
  • 70. On the code of data science @GaelVaroquaux I believe in code without compromises Libraries with side-effect free code tight APIs decoupling of analysis and plotting Software engineering version control
  • 71. On the code of data science @GaelVaroquaux I believe in code without compromises Code empowers enables automation opens new applications Disrupt something new We use scikit-learn for markers of neuropsychiatric diseases
  • 72. On the code of data science @GaelVaroquaux I believe in code without compromises Code empowers Software purity holds back Interactivity: brittle and costly but necessary for insights Refusing syntax shortcuts ñ verbosity
  • 73. On the code of data science @GaelVaroquaux I believe in code without compromises Code empowers Software purity holds back The cost is worth the benefit in the long run