This document discusses best practices for writing code for data science projects. It recommends starting with scripts to understand problems, then refactoring code into well-named functions and modules for reuse and testing. Caching and online algorithms can help handle large datasets. The document also advocates adopting software engineering practices like version control and testing as projects mature. Overall it provides guidance on structuring data science code for productivity, quality, and scalability.
5. Data science = statistics + code
Larger datasets
Automating insights
Data science is disruptive when it enables real-time or personalized decisions
G Varoquaux 3
6. 1 Writing code for data science
2 Machine learning in Python
3 Outlook: infrastructure
Make you more productive as a data scientist
7. 1 Writing code for data science
Productivity & Quality
8. 1 A data-science workflow
Work based on intuition and experimentation: a loop of conjecture and experiment
→ Interactive & framework-less
Yet it needs consolidation, while keeping flexibility
9. 1 A design pattern in computational experiments
The MVC pattern, from Wikipedia:
• Model: manages the data and rules of the application
• View: the output representation (possibly several views)
• Controller: accepts input and converts it to commands for the model and view
Photo-editing software: filters (model), canvas (view), tool palettes (controller)
Typical web application: database (model), web pages (view), URLs (controller)
For data science:
• Model: the numerical, data-processing, & experimental logic → a module with functions
• View: the results, as files (data & plots) → a post-processing script over CSV & data files
• Controller: an imperative API → a script with for loops (avoid input as files: not expressive)
A recipe
3 types of files:
• modules • command scripts • post-processing scripts
CSVs & intermediate data files
Separate computation from analysis / plotting
Code and text (and data) → version control
Decouple steps
Goals: reuse code, mitigate compute time
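As a sketch of this recipe, the three kinds of files might look as follows. The file names and helper functions here are invented for the example, and the three "files" are shown in one listing for brevity:

```python
# analysis.py -- the module: numerical & experimental logic, importable and testable
def compute_result(x):
    """The computation, kept in a function so it can be reused and tested."""
    return [v * v for v in x]

# run_experiment.py -- the command script: for loops, writes results to files
import csv

def run(parameters, out_path):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for p in parameters:
            writer.writerow(compute_result(p))

# plot_results.py -- the post-processing script: reads the result files back
def load(out_path):
    with open(out_path) as f:
        return [list(map(int, row)) for row in csv.reader(f)]
```

Because computation and analysis only communicate through the intermediate CSV files, each step can be re-run or swapped out independently.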
13. 1 How I work: progressive consolidation
Start with a script, playing to understand the problem
Identify blocks/operations → move them to a function
Use functions
Obstacle: the local scope requires identifying input and output variables
That's a good thing
Solution for interactive debugging / understanding inside a function: %debug in IPython
Functions are the basic reusable abstraction
As they stabilize, move them to a module
Modules
• enable sharing between experiments → avoid 1000-line scripts + commented-out code
• enable testing
Fast experiments as tests → give confidence, hence refactorings
Clean: delete code & files; you have version control
Attentional load makes it impossible to find or understand things (Where's Waldo?)
Why is it hard?
Long compute times make us unadventurous
Know your tools: a refactoring editor, version control
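As a tiny illustration of moving a script block into a function (the names here are invented for the example):

```python
# Before: a block inside a script, with variables floating in global scope
data = [1.0, 2.5, 3.5]
total = sum(data)
mean = total / len(data)

# After: the same block as a function. The local scope forces us to name
# the inputs (values) and the output (the returned mean), which is exactly
# the "good thing" about the obstacle: it documents the block.
def compute_mean(values):
    return sum(values) / len(values)

mean_from_function = compute_mean(data)
print(mean == mean_from_function)  # True
```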
18. 1 Caching to tame computation time
The memoize pattern
    mem = joblib.Memory(cachedir='.')
    g = mem.cache(f)
    b = g(a)  # computes b from a using f
    c = g(a)  # retrieves the result from the cache
For data science
• a & b can be big
• a & b are arbitrary objects; no change in workflow
• results are stored on disk
Fits in the experimentation loop
Helps decrease re-run times
Caveat: black-boxy, persistence is only implicit
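A minimal runnable sketch of the pattern, assuming a current joblib where the cache-directory argument is spelled `location` (the call counter is added purely to make the cache hit visible):

```python
import tempfile
from joblib import Memory

# Count calls so we can see when the cache is hit (illustration only)
calls = {"n": 0}

def expensive_square(x):
    calls["n"] += 1
    return x * x

cachedir = tempfile.mkdtemp()
mem = Memory(location=cachedir, verbose=0)  # current joblib spells it `location`
g = mem.cache(expensive_square)

b = g(4)   # computes 4*4 and stores the result on disk
c = g(4)   # same argument: retrieved from the cache, no recomputation
print(b, c, calls["n"])  # 16 16 1
```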
21. 1 The ladder of code quality (increasing cost as you climb)
• Use linting/code analysis in your editor. Seriously.
• Coding conventions, good naming
• Version control: use git + GitHub
• Code review
• Unit testing: if it's not tested, it's broken or soon will be
• Make a package: controlled dependencies and compilation
• ...
Avoid premature software engineering
Over- versus under-engineering
When the goal is generating insights:
• Experimentation to develop intuitions → new ideas
• As the path becomes clear: consolidation
• Heavy engineering too early freezes bad ideas
25. 1 Libraries
At the top of the ladder of code quality (linting, conventions, version control, code review, unit testing, packaging): a library
28. 2 My stack for data science
Python, what else?
General-purpose language
Interactive
Easy to read / write
29. 2 My stack for data science
The scientific Python stack
numpy arrays
• Mostly a float**
• No annotation / structure
• Universal across applications
• Easily shared across languages
[figure: a numpy array pictured as a block of digits]
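A quick sketch of these properties, using nothing beyond plain numpy:

```python
import numpy as np

# A numpy array is essentially a typed block of memory: shape + dtype,
# with no further annotation or structure attached
a = np.array([[0.3, 8.7, 9.4], [7.9, 7.9, 2.7]])
print(a.shape, a.dtype)  # (2, 3) float64

# The raw buffer can be handed to other languages/libraries without copying
print(len(a.tobytes()))  # 48 bytes: 6 float64 values, 8 bytes each
```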
30. 2 My stack for data science
The scientific Python stack
numpy arrays, connecting to:
• pandas: columnar data
• scikit-image: images
• scipy: numerics, signal processing
• ...
31. 2 Machine learning in a nutshell
Machine learning is about making predictions from data
e.g. learning to distinguish apples from oranges
"Prediction is very difficult, especially about the future." (Niels Bohr)
Learn as much as possible from the data, but not too much
[figure: two fits of y against x, one simple and one very wiggly]
Which model do you prefer?
Minimizing train error ≠ generalization: overfit
Adapting model complexity to the data: regularization
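The overfit point can be sketched with plain numpy: a flexible model always wins on its own training data, which says nothing about generalization. The data-generating curve and polynomial degrees are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.3 * rng.randn(20)

# Two polynomial models: a simple one and a very flexible one
simple = np.poly1d(np.polyfit(x, y, deg=1))
flexible = np.poly1d(np.polyfit(x, y, deg=15))

def train_error(model):
    return float(np.mean((model(x) - y) ** 2))

# Minimizing train error: the flexible model always looks at least as good
# on the points it was fit to -- the definition of what overfitting exploits
print(train_error(flexible) <= train_error(simple))  # True
```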
37. 2 Machine learning without learning the machinery
A library, not a program
• More expressive and flexible
• Easy to include in an ecosystem: let's disrupt something new
As easy as py
    from sklearn import svm
    classifier = svm.SVC()
    classifier.fit(X_train, y_train)
    y_test = classifier.predict(X_test)
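The same fit/predict pattern as a runnable sketch on scikit-learn's built-in iris data; the train/test split and `random_state` are choices made for the example:

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and hold out a test set
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifier = svm.SVC()
classifier.fit(X_train, y_train)          # learn from the training data
predictions = classifier.predict(X_test)  # predict on unseen data
accuracy = (predictions == y_test).mean()
```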
39. 2 Show me your data: the samples × features matrix
Data input: a 2D numerical array
Requires transforming your problem
[figure: a samples × features matrix of numbers]
40. 2 Show me your data: the samples × features matrix
Data input: a 2D numerical array
Requires transforming your problem
With text documents:
[figure: a documents × words count matrix, columns labeled "the", "Python", "performance", "profiling", "module", "is", "code", "can", "a"]
sklearn.feature_extraction.text.TfidfVectorizer
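A short sketch of that transformation; the example corpus is invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Turn a small corpus into a documents x words matrix
docs = [
    "the Python profiling module",
    "Python code can be profiled",
    "profiling is a performance tool",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document
print(X.shape)  # (3, <number of distinct words in the corpus>)
```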
41. “Big” data
Engineering efficient processing pipelines
Many samples, or many features
[figure: a tall (many samples) matrix and a wide (many features) matrix]
See also: http://www.slideshare.net/GaelVaroquaux/processing-biggish-data-on-commodity-hardware-simple-python-patterns
43. 2 Many samples: on-line algorithms
    estimator.partial_fit(X_train, y_train)
Supervised models: predicting
• sklearn.naive_bayes...
• sklearn.linear_model.SGDRegressor
• sklearn.linear_model.SGDClassifier
Clustering: grouping samples
• sklearn.cluster.MiniBatchKMeans
• sklearn.cluster.Birch
Linear decompositions: finding new representations
• sklearn.decomposition.IncrementalPCA
• sklearn.decomposition.MiniBatchDictionaryLearning
• sklearn.decomposition.LatentDirichletAllocation
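A runnable sketch of on-line learning with `partial_fit`, streaming synthetic mini-batches; the data-generating rule and batch sizes are invented for the example:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # must be declared up front for partial_fit

# Stream the data in mini-batches instead of loading it all at once
for _ in range(30):
    X_batch = rng.randn(50, 3)
    y_batch = (X_batch[:, 0] > 0).astype(int)  # a simple separable rule
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.randn(200, 3)
y_test = (X_test[:, 0] > 0).astype(int)
accuracy = clf.score(X_test, y_test)
```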
44. 2 Many features: on-the-fly data reduction
→ Reduce the data as it is loaded:
    X_small = estimator.transform(X_big)
Random projections (will average features)
• sklearn.random_projection: random linear combinations of the features
Fast clustering of features
• sklearn.cluster.FeatureAgglomeration (on images: a super-pixel strategy)
Hashing when observations have varying size (e.g. words)
• sklearn.feature_extraction.text.HashingVectorizer
  stateless: can be used in parallel
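A sketch of the stateless hashing idea; the `n_features` value and the example sentences are arbitrary:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless hashing: no fit step and no stored vocabulary, so independent
# batches of documents can be vectorized in parallel
vectorizer = HashingVectorizer(n_features=1024)
X = vectorizer.transform([
    "hashing maps words to a fixed number of columns",
    "no vocabulary is stored, so the transformer is stateless",
])
print(X.shape)  # (2, 1024)
```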
46. More gems in scikit-learn
SAG solver:
    linear_model.LogisticRegression(solver='sag')
• Fast linear model on biggish data
PCA == RandomizedPCA (0.18)
• Heuristic to switch PCA to randomized linear algebra
• Huge speed gains for biggish data ("fights global warming")
48. More gems in scikit-learn
New cross-validation objects (0.18)
Old API:
    from sklearn.cross_validation import StratifiedKFold
    cv = StratifiedKFold(y, n_folds=2)
    for train, test in cv:
        X_train = X[train]
        y_train = y[train]
New API:
    from sklearn.model_selection import StratifiedKFold
    cv = StratifiedKFold(n_splits=2)
    for train, test in cv.split(X, y):
        X_train = X[train]
        y_train = y[train]
Data-independent → better nested cross-validation
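The new-style splitter, runnable end to end; the toy `X` and `y` are invented for the example:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# The splitter is configured without any data; data only enters at split
# time, which is what makes nested cross-validation easier to express
cv = StratifiedKFold(n_splits=2)
fold_sizes = [len(test) for _, test in cv.split(X, y)]
print(fold_sizes)  # [5, 5]
```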
50. More gems in scikit-learn
Outlier detection and isolation forests (0.18)
54. 3 Parallel-computing engine: joblib
    sklearn.Estimator(n_jobs=2)
Under the hood: joblib
    >>> from joblib import Parallel, delayed
    >>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
    ...                    for i in range(8))
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Threading and multiprocessing modes
New: distributed computing backends: Yarn, dask.distributed, IPython.parallel
    import distributed.joblib
    from joblib import Parallel, parallel_backend
    with parallel_backend('dask.distributed',
                          scheduler_host='HOST:PORT'):
        # normal joblib code
Algorithmic plain-Python code, with optional bells and whistles
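The interactive snippet above, made runnable as a plain script:

```python
from math import sqrt
from joblib import Parallel, delayed

# An embarrassingly parallel loop: delayed() captures the call, Parallel
# dispatches the calls over 2 workers and gathers the results in order
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(8))
print(results)  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```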
58. 3 Persistence to disk
• Persisting any Python object, fast and with little overhead
• Streamed compression
• I/O from/to open file handles → works with S3, HDFS
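One way to exercise this with an open file handle, here an in-memory buffer standing in for a remote store:

```python
import io
import numpy as np
import joblib

# joblib.dump/load accept open file objects, with optional streamed compression
arr = np.arange(1000)
buf = io.BytesIO()
joblib.dump(arr, buf, compress=3)  # write compressed pickle into the handle
buf.seek(0)
restored = joblib.load(buf)        # read it back from the same handle
print(np.array_equal(arr, restored))  # True
```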
59. 3 joblib.Memory as a storage pool (in development)
S3/HDFS/cloud backend:
    joblib.Memory('uri', backend='s3')
    https://github.com/joblib/joblib/pull/397
Cache replacement:
    mem = joblib.Memory('uri', bytes_limit='1G')
    mem.reduce_size()  # remove the oldest entries
Out-of-memory computing:
    >>> result = mem.cache(g).call_and_shelve(a)
    >>> result
    MemorizedResult(cachedir="...", func="g", argument_hash="...")
    >>> c = result.get()
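A runnable sketch of `call_and_shelve`; the function `g` and the temporary cache directory are for the example, and current joblib spells the cache directory argument `location`:

```python
import tempfile
from joblib import Memory

mem = Memory(location=tempfile.mkdtemp(), verbose=0)

def g(a):
    return a + 1

# call_and_shelve computes (or finds) the result but returns a lightweight
# reference; the value itself stays on disk until .get() is called
result = mem.cache(g).call_and_shelve(41)
c = result.get()
print(c)  # 42
```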
62. 3 joblib as bricks of a compute engine: vision
[figure: data arrays flowing through parallel workers and a shared store]
• Parallel on a cloud/cluster
• Memory + shelving on distributed stores, as out-of-core & common computation
• Simple algorithmic code
Using infrastructure without buying into it
69. On the code of data science
@GaelVaroquaux
I believe in code, without compromises
Libraries
• with side-effect-free code
• tight APIs
• decoupling of analysis and plotting
Software engineering
• version control
Code empowers
• enables automation
• opens new applications
• disrupt something new: we use scikit-learn for markers of neuropsychiatric diseases
Software purity holds back
• Interactivity: brittle and costly, but necessary for insights
• Refusing syntax shortcuts → verbosity
The cost is worth the benefit in the long run