Slides for my keynote at SciPy 2017
https://youtu.be/eVDDL6tgsv8
Computing has been driving a revolution in how science and technology solve new problems. Python has grown to be a central player in this game, from computational physics to data science. I would like to explore some lessons learned doing science with Python as well as building Python libraries for science. What ingredients do scientists need? What technical and project-management choices drove the success of projects I've been involved with? How do these demands and offers shape our ecosystem?
In this talk, I'd like to share a few thoughts on how we code for science and innovation, with the modest goal of changing the world.
2. Science
The process of discovering
knowledge and mechanisms
Computing is a central part of how we do science
G Varoquaux 2
3. Science
The process of discovering
knowledge and mechanisms
Computing is a central part of how we do science
Science + Computers = Computational science
Nuclear physics Fluid dynamics Chemistry
4. Science
The process of discovering
knowledge and mechanisms
Computing is a central part of how we do science
Science + Computers = Computational science
Psychology
5. Science
The process of discovering
knowledge and mechanisms
Computing is a central part of how we do science
Science + Computers = Computational science
Psychology
Marketing
Data science: using data to acquire insights
6. Science
The process of discovering
knowledge and mechanisms
“Science is not a political construct or a belief system. Scientific progress depends on openness, transparency, and the free flow of ideas and people.”
— Dr. Rush Holt, CEO of AAAS, testimony to the House Committee on Science, Space, and Technology, Feb 8, 2017
7. Science
The process of discovering
knowledge and mechanisms
Science helps shape society
Growth in a time of debt [Reinhart & Rogoff 2010]:
Wrong conclusions due to flawed Excel processing
⇒ Public debt blamed for financial crisis (Osborne UK MP)
Autism and vaccines:
forged study: [Wakefield et al, Lancet 1998]
⇒ Drop in vaccination, measles outbreak
Loss of trust in science is very costly
9. Innovation
Putting the right technology to the right use
Light bulb:
Invented ∼ 1835 by Lindsay
Extra progress: vacuum pumps (Swan ∼ 1880)
Economics: availability of electric power
⇒ Edison’s company
10. Innovation
Putting the right technology to the right use
Light bulb:
Invented ∼ 1835 by Lindsay
Extra progress: vacuum pumps (Swan ∼ 1880)
Economics: availability of electric power
⇒ Edison’s company
Outbox: company digitizing physical mail
But citizens aren’t the USPS customers, junk mailers are
⇒ No cooperation from USPS, Outbox dies
Power balances drive innovation as much as technology
11. Coding for science and innovation:
Computing is the new electricity:
a driver for change
With new data sources,
it reaches beyond physics & engineering
12. Coding for science and innovation:
1 Coding as a scientist
2 Building software for science
3 An ecosystem
14. 1 Data in brain sciences
The mental world
cognition, emotions
autism, depression
Historically studied
via verbal interactions
Psychology
15. 1 Data in brain sciences
The mental world
cognition, emotions
autism, depression
Historically studied
via verbal interactions
The brain
an organ:
neurons, firing
Imaging brain activity
Quantitative data
16. 1 One example of our work: biomarkers of Autism
[Abraham...Varoquaux, 2017]
Comparing the brain activity of many subjects
Supervised machine learning to discriminate Autism
17. 1 One example of our work: biomarkers of Autism
[Abraham...Varoquaux, 2017]
1. Extract brain networks
Unsupervised feature learning
Complex model fit to 1 TB of data
18. 1 One example of our work: biomarkers of Autism
[Abraham...Varoquaux, 2017]
1. Extract brain networks
2. Per-subject connections
Information geometry,
Lie algebra...
19. 1 One example of our work: biomarkers of Autism
[Abraham...Varoquaux, 2017]
1. Extract brain networks
2. Per-subject connections
3. Supervised learning
Scikit-learn
20. 1 One example of our work: biomarkers of Autism
[Abraham...Varoquaux, 2017]
1. Extract brain networks
2. Per-subject connections
3. Supervised learning
Scikit-learn
Limits to impact:
Cannot outperform clinicians that define Autism/Control
Psychiatrists unhappy with current blurry definition
But not ready to accept black-box algorithmic definition
21. 1 One example of our work: biomarkers of Autism
[Abraham...Varoquaux, 2017]
1. Extract brain networks
2. Per-subject connections
3. Supervised learning
Scikit-learn
Limits to impact:
Cannot outperform clinicians that define Autism/Control
Psychiatrists unhappy with current blurry definition
But not ready to accept black-box algorithmic definition
Lots of moving parts
Practitioners need to
make the tools theirs
22. 1 A quest for trust: reproducible research
“if it’s not open and verifiable by others, it’s not science,
or engineering, or whatever it is you call what we do”
— V. Stodden, The scientific method in practice
Computational reproducibility:
Automate everything
Control the environment
24. 1 Automate everything...
Some operations work better with a human in the loop
Scientific research is an iterative process
Tension between needs for interaction and replay
25. 1 Automate everything...
Some operations work better with a human in the loop
Scientific research is an iterative process
Tension between needs for interaction and replay
Mayavi
Reflexivity between dialogs and objects
Record mode
26. 1 Automate everything...
Some operations work better with a human in the loop
Scientific research is an iterative process
Tension between needs for interaction and replay
Jupyter, and its widgets:
Exploring the space between interaction and code
27. 1 Beyond computational reproducibility
Make every computational step reproducible,
and good science will emerge
28. 1 Beyond computational reproducibility
Make every computational step reproducible,
and good science will emerge
Estimating the reproducibility of psychological science
[Science 2015] 36% of effects replicate
Reasons:
Statistical challenges — analysis degrees of freedom
Weak incentives — winner’s curse in publication
Seldom computational reproducibility
29. 1 Beyond computational reproducibility
Make every computational step reproducible,
and good science will emerge
Estimating the reproducibility of psychological science
[Science 2015] 36% of effects replicate
Reasons:
Statistical challenges — analysis degrees of freedom
Weak incentives — winner’s curse in publication
Seldom computational reproducibility
I think that reproducibility is a misnomer.
What matters is that operations be
verifiable or reusable.
30. In practice, the best way to improve research
is to use the right (conceptual) tools.
31. 1 Managing complexity
In practice, the best way to improve research
is to use the right (conceptual) tools.
The everyday roadblock is cognitive load
Machine learning, brain anatomy, psychology
R, Python, shell scripts
Funding agencies, reviewer 3, courting VCs
32. Coding as a scientist
Final code should be auditable,
ideally reusable
Tension between interactive computing
& automating
Main enemy: cognitive overload
33. Coding as a scientist
Final code should be auditable,
ideally reusable
Tension between interactive computing
& automating
Main enemy: cognitive overload
In the industry
Reusable
Verifiable? Not for Silicon Valley,
but in insurance, healthcare, banking...
Moving data-scientist code
to production?
Software projects going over budget?
34. Code quality in exploratory work
Use pyflakes in your editor seriously
Coding convention, good naming
Version control: use git + GitHub
Code review
Unit testing
If it’s not tested, it’s broken or soon will be
Make a package
controlled dependencies and compilation
...
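A minimal sketch of what unit testing can look like in exploratory code. The function `zscore` and its test are hypothetical and chosen only for illustration; the test is a plain function built on assertions, the kind a runner such as pytest would also collect.

```python
import math


def zscore(values):
    """Center values on their mean and scale them to unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        raise ValueError(f"cannot z-score constant data, got values={values!r}")
    return [(v - mean) / std for v in values]


def test_zscore():
    scaled = zscore([1.0, 2.0, 3.0])
    # After scaling, the mean is 0 and the (population) variance is 1.
    assert abs(sum(scaled)) < 1e-12
    assert abs(sum(s ** 2 for s in scaled) / len(scaled) - 1.0) < 1e-12


test_zscore()
```

Even a single test like this pins down what the function is supposed to do, at a cost that stays small while the code is still moving.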
35. Code quality in exploratory work
Increasing cost
?
Use pyflakes in your editor seriously
Coding convention, good naming
Version control: use git + GitHub
Code review
Unit testing
If it’s not tested, it’s broken or soon will be
Make a package
controlled dependencies and compilation
...
Avoid premature software engineering
36. Code quality in exploratory work
Increasing cost
?
Use pyflakes in your editor seriously
Coding convention, good naming
Version control: use git + GitHub
Code review
Unit testing
If it’s not tested, it’s broken or soon will be
Make a package
controlled dependencies and compilation
...
Avoid premature software engineering
Over versus under engineering
Goal is generating insights / moving in new spaces
Experimentation for intuitions and proofs of concepts
⇒ new ideas
As the path becomes clear: consolidation
is great for that
Heavy engineering too early freezes bad ideas
37. 2 Building software for science
The point of view of the developer
Libraries are what enables us to scale:
Abstractions reduce cognitive load
Code reuse gets us further
38. 2 Examples of such libraries
scikit-learn
Make research in machine-learning
models and algorithms usable by people
who do not understand them
nilearn
Make it easy to answer neuroimaging
problems with them
39. 2 Examples of such libraries
scikit-learn
Make research in machine-learning
models and algorithms usable by people
who do not understand them
Challenges:
Variety of that space
Statistical concepts coding concepts
nilearn
Make it easy to answer neuroimaging
problems with them
Challenges: Onboarding technology-averse users
40. 2 Tools that reduce cognitive overload
It’s a design problem
41. 2 Tools that reduce cognitive overload
Jonathan Ive, an industrial designer, is #4 at Apple
Code different.
42. 2 Some API design principles for the scipy stack
Consistency, consistency, consistency
Functions are easier to understand than classes
A library should hinge on a small number of concepts
Common data containers make the ecosystem stronger
Each function should have one and only one purpose
Code for interfaces, but don’t overdo duck typing
Properties are for impedance matching
Shallow is better than deep
Error messages matter
Be Pythonic
43. 2 Some API design principles for the scipy stack
Consistency, consistency, consistency
np.save(file, obj) vs pickle.dump(obj, file)
fmin(..., maxiter=10) vs lsq_linear(..., max_iter=10)
Creates cognitive overload
Functions are easier to understand than classes
A library should hinge on a small number of concepts
Common data containers make the ecosystem stronger
Each function should have one and only one purpose
Code for interfaces, but don’t overdo duck typing
Properties are for impedance matching
Shallow is better than deep
Error messages matter
Be Pythonic
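The argument-order inconsistency between `np.save` and `pickle.dump` mentioned on this slide can be seen side by side in a short sketch. The in-memory buffers and variable names are illustrative choices, not part of the talk.

```python
import io
import pickle

import numpy as np

arr = np.arange(3)

# np.save takes the file first, then the object...
npy_buffer = io.BytesIO()
np.save(npy_buffer, arr)

# ...while pickle.dump takes the object first, then the file.
pkl_buffer = io.BytesIO()
pickle.dump(arr, pkl_buffer)

# Both round-trip correctly; only the argument order differs.
npy_buffer.seek(0)
pkl_buffer.seek(0)
from_npy = np.load(npy_buffer)
from_pkl = pickle.load(pkl_buffer)
```

Neither order is wrong on its own; the cost is that the user has to remember which library uses which.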
44. 2 Some API design principles for the scipy stack
Consistency, consistency, consistency
Functions are easier to understand than classes
Objects have hidden states,
Objects have no universal interface, entry point, output
A library should hinge on a small number of concepts
Common data containers make the ecosystem stronger
Each function should have one and only one purpose
Code for interfaces, but don’t overdo duck typing
Properties are for impedance matching
Shallow is better than deep
Error messages matter
Be Pythonic
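A small sketch of the hidden-state point, using a hypothetical `RunningMean` class and a plain `mean` function: the same method call returns different values depending on the object's history, while the function's output depends only on its arguments.

```python
class RunningMean:
    """Object version: the result depends on everything seen so far."""

    def __init__(self):
        self._total = 0.0  # hidden state
        self._count = 0    # hidden state

    def update(self, value):
        self._total += value
        self._count += 1
        return self._total / self._count


def mean(values):
    """Function version: the output depends only on the arguments."""
    return sum(values) / len(values)


tracker = RunningMean()
tracker.update(2.0)
first_call = tracker.update(4.0)   # mean of 2, 4
second_call = tracker.update(4.0)  # mean of 2, 4, 4: same call, different answer

always_the_same = mean([2.0, 4.0])
```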
45. 2 Some API design principles for the scipy stack
Consistency, consistency, consistency
Functions are easier to understand than classes
A library should hinge on a small number of concepts
How much do usage patterns carry over across the library?
Common data containers make the ecosystem stronger
Each function should have one and only one purpose
Code for interfaces, but don’t overdo duck typing
Properties are for impedance matching
Shallow is better than deep
Error messages matter
Be Pythonic
46. 2 Some API design principles for the scipy stack
Consistency, consistency, consistency
Functions are easier to understand than classes
A library should hinge on a small number of concepts
Common data containers make the ecosystem stronger
Facilitates working with multiple libraries together
Easier to get up to speed with a given library
Each function should have one and only one purpose
Code for interfaces, but don’t overdo duck typing
Properties are for impedance matching
Shallow is better than deep
Error messages matter
Be Pythonic
47. 2 Some API design principles for the scipy stack
Consistency, consistency, consistency
Functions are easier to understand than classes
A library should hinge on a small number of concepts
Common data containers make the ecosystem stronger
Each function should have one and only one purpose
Change of behavior depending on input type
Code for interfaces, but don’t overdo duck typing
Properties are for impedance matching
Shallow is better than deep
Error messages matter
Be Pythonic
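One way to picture a function whose behavior changes with the input type, and a single-purpose alternative. All names here (`select`, `select_row`, `select_column`, `table`) are hypothetical and only serve the illustration.

```python
table = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]


def select(table, key):
    """Anti-pattern: an int selects a row, a str selects a column."""
    if isinstance(key, int):
        return table[key]
    return [row[key] for row in table]


def select_row(table, index):
    """One purpose: return the row at a given position."""
    return table[index]


def select_column(table, name):
    """One purpose: return the values of a named column."""
    return [row[name] for row in table]


row = select_row(table, 0)
column = select_column(table, "a")
```

With the two dedicated functions, the return type is known from the function name alone, without inspecting the argument.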
48. 2 Some API design principles for the scipy stack
Consistency, consistency, consistency
Functions are easier to understand than classes
A library should hinge on a small number of concepts
Common data containers make the ecosystem stronger
Each function should have one and only one purpose
Code for interfaces, but don’t overdo duck typing
Interfaces define objects
Incompatible behaviors lead to bugs (eg np.matrix)
Properties are for impedance matching
Shallow is better than deep
Error messages matter
Be Pythonic
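The `np.matrix` case can be sketched as follows: a function written with arrays in mind silently computes something else when handed a matrix, because `*` means a different operation on each type. The helper `square` is hypothetical; note that NumPy's own documentation discourages `np.matrix` in new code.

```python
import numpy as np

data = [[1, 2], [3, 4]]
as_array = np.array(data)
as_matrix = np.matrix(data)


def square(x):
    """Written with arrays in mind: expects `*` to multiply element by element."""
    return x * x


elementwise = square(as_array)   # [[1, 4], [9, 16]]
matrix_prod = square(as_matrix)  # [[7, 10], [15, 22]], the matrix product
```

Both objects look like 2D arrays and pass the same duck-typing checks, yet the results differ without any error being raised.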
49. 2 Some API design principles for the scipy stack
Consistency, consistency, consistency
Functions are easier to understand than classes
A library should hinge on a small number of concepts
Common data containers make the ecosystem stronger
Each function should have one and only one purpose
Code for interfaces, but don’t overdo duck typing
Properties are for impedance matching
Properties obfuscate the data model of the object
Properties can create hidden compute costs
Shallow is better than deep
Error messages matter
Be Pythonic
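A sketch of how a property can hide a compute cost. The `Dataset` class and its `n_sorts` counter are hypothetical, added only to make the hidden work visible.

```python
class Dataset:
    """Hypothetical container used to illustrate hidden costs of properties."""

    def __init__(self, values):
        self.values = list(values)
        self.n_sorts = 0  # counts how often the expensive step runs

    @property
    def median(self):
        # Looks like a stored attribute, but sorts the data on every access.
        self.n_sorts += 1
        ordered = sorted(self.values)
        # Middle element (the upper middle one for even lengths).
        return ordered[len(ordered) // 2]


data = Dataset([5, 1, 3])
total = data.median + data.median + data.median
```

Nothing in the expression `data.median` tells the reader that three accesses trigger three sorts, nor whether the median is stored in the object or derived from it.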
50. 2 Some API design principles for the scipy stack
Consistency, consistency, consistency
Functions are easier to understand than classes
A library should hinge on a small number of concepts
Common data containers make the ecosystem stronger
Each function should have one and only one purpose
Code for interfaces, but don’t overdo duck typing
Properties are for impedance matching
Shallow is better than deep
Objects are understood by their surface
Composition creates cognitive overload
Error messages matter
Be Pythonic
51. 2 Some API design principles for the scipy stack
Consistency, consistency, consistency
Functions are easier to understand than classes
A library should hinge on a small number of concepts
Common data containers make the ecosystem stronger
Each function should have one and only one purpose
Code for interfaces, but don’t overdo duck typing
Properties are for impedance matching
Shallow is better than deep
Error messages matter
Explain the problem
Print the offending value
Be Pythonic
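A possible shape for an error message that explains the problem and prints the offending value. The function and parameter names (`check_n_components`, `n_components`, `n_features`) are hypothetical, picked for illustration.

```python
def check_n_components(n_components, n_features):
    """Validate a hypothetical `n_components` parameter."""
    if not 1 <= n_components <= n_features:
        raise ValueError(
            f"n_components must be between 1 and n_features={n_features}, "
            f"got n_components={n_components!r}. "
            "Reduce n_components or provide data with more features."
        )
    return n_components


try:
    check_n_components(10, n_features=4)
except ValueError as exc:
    message = str(exc)
```

The message states the constraint, shows the value that violated it, and suggests a way out, so the user does not need to open the source to understand the failure.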
52. 2 Some API design principles for the scipy stack
Consistency, consistency, consistency
Functions are easier to understand than classes
A library should hinge on a small number of concepts
Common data containers make the ecosystem stronger
Each function should have one and only one purpose
Code for interfaces, but don’t overdo duck typing
Properties are for impedance matching
Shallow is better than deep
Error messages matter
Be Pythonic
Avoid syntax hacks
53. 2 Scikit-learn API
Scikit-learn cheat sheet
Scikit-learn
Fit and predict
>>> estimator = Estimator(param1=param1)
>>> estimator.fit(X_train, y_train)
>>> y_test = estimator.predict(X_test)
Transform data
>>> X_red = estimator.transform(X_test)
54. 2 Scikit-learn API
Scikit-learn cheat sheet
Scikit-learn
Fit and predict
>>> estimator = Estimator(param1=param1)
>>> estimator.fit(X_train, y_train)
>>> y_test = estimator.predict(X_test)
Transform data
>>> X_red = estimator.transform(X_test)
The estimator is a “contract”
(slightly more elaborate than above)
It has created an ecosystem of packages
Based on duck-typing, not inheritance
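A minimal sketch of what a duck-typed "contract" can look like, in pure Python and without scikit-learn itself. `MeanRegressor` and `holdout_error` are hypothetical; the real scikit-learn contract has more to it (for instance parameter introspection through `get_params` and `set_params`).

```python
class MeanRegressor:
    """Hypothetical estimator: no base class, it only follows fit/predict."""

    def __init__(self, shrinkage=0.0):
        self.shrinkage = shrinkage  # hyperparameters are set at construction

    def fit(self, X, y):
        # State learned from data gets a trailing underscore, as in scikit-learn.
        self.mean_ = (1 - self.shrinkage) * sum(y) / len(y)
        return self  # returning self allows est.fit(X, y).predict(X)

    def predict(self, X):
        return [self.mean_ for _ in X]


def holdout_error(estimator, X_train, y_train, X_test, y_test):
    """Generic code: relies only on the presence of fit and predict."""
    y_pred = estimator.fit(X_train, y_train).predict(X_test)
    return sum(abs(p - t) for p, t in zip(y_pred, y_test)) / len(y_test)


error = holdout_error(MeanRegressor(), [[0], [1]], [1.0, 3.0], [[2]], [2.0])
```

Because `holdout_error` never checks the estimator's class, any third-party object with the same two methods can be plugged in, which is what lets an ecosystem of packages grow around the interface.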
56. 2 Example-driven development
The 3-liner as the new cool
Teaching others
Teaching yourself
Write examples that solve end problems
Iterate on your API until these are simple
Mayavi scikit-learn nilearn
57. 2 Example-driven development
The 3-liner as the new cool
Teaching others
Teaching yourself
Write examples that solve end problems
Iterate on your API until these are simple
Mayavi scikit-learn nilearn
User flow on the scikit-learn website:
Examples
58. 2 Example-driven development
The 3-liner as the new cool
Teaching others
Teaching yourself
Write examples that solve end problems
Iterate on your API until these are simple
Mayavi scikit-learn nilearn
User flow on the nilearn website:
Examples
59. 2 Example-driven development
The 3-liner as the new cool
Teaching others
Teaching yourself
Write examples that solve end problems
Iterate on your API until these are simple
Mayavi scikit-learn nilearn
Sphinx-gallery: compiling scripts in an examples gallery
60. 2 Example-driven development
The 3-liner as the new cool
Teaching others
Teaching yourself
Write examples that solve end problems
Iterate on your API until these are simple
Mayavi scikit-learn nilearn
Sphinx-gallery: compiling scripts in an examples gallery
reStructuredText formatting
Capturing outputs
Links to function docs
+ Creates Jupyter notebooks
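A sketch of the kind of script Sphinx-gallery turns into a gallery page, here imagined as a hypothetical file `plot_sample_mean.py` (the `plot_` prefix mirrors the nilearn example name cited in this talk). The content is illustrative only: a reStructuredText docstring for the title and introduction, then code cells separated by `# %%` comment blocks.

```python
"""
Mean of a small sample
======================

This docstring is written in reStructuredText and becomes the title
and introduction of the generated page.
"""

# %%
# Comment blocks that follow a ``# %%`` marker are rendered as text
# between code cells.
samples = [2.0, 4.0, 9.0]

# %%
# Printed output is captured and shown below the code that produced it.
mean = sum(samples) / len(samples)
print(f"mean = {mean}")
```

The script stays a plain Python file that runs on its own, which is what makes it possible to execute every example as part of the documentation build.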
61. 2 Example-driven development
The 3-liner as the new cool
Teaching others
Teaching yourself
Write examples that solve end problems
Iterate on your API until these are simple
Mayavi scikit-learn nilearn
Sphinx-gallery: compiling scripts in an examples gallery
Insert links to examples
containing a function
62. 2 Building great documentation
Focus on explaining concepts (hint: write plain English)
Less is more: prioritize, avoid redundancy
Code examples must be short (link to full tutorial examples)
Links everywhere: users will land at the wrong place
Teach with the docs
Plan for maintenance of docs:
Continuous integration
Check links
Runs examples
Doctests
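As an illustration of the doctest item, here is a minimal sketch using the standard-library `doctest` module. The function `clip` is hypothetical; the point is that the examples in the docstring are executed and compared with the documented output.

```python
import doctest


def clip(value, low, high):
    """Limit ``value`` to the interval [low, high].

    >>> clip(5, 0, 3)
    3
    >>> clip(-2, 0, 3)
    0
    >>> clip(1, 0, 3)
    1
    """
    return max(low, min(value, high))


# Parse the examples out of the docstring and run them.
test = doctest.DocTestParser().get_doctest(
    clip.__doc__, {"clip": clip}, "clip", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)  # TestResults(failed=..., attempted=...)
```

In a continuous-integration job, a non-zero `results.failed` would mark the build as broken, so documented examples cannot drift away from the code unnoticed.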
63. 2 Reusable science
scikit-learn is the new machine-learning textbook
nilearn is the new neuroimaging review article
Experiments reproduced
at each commit
eg: brain reading
nilearn.github.io/auto_examples/02_decoding/plot_miyawaki_reconstruction.html
64. 2 Reusable science
scikit-learn is the new machine-learning textbook
nilearn is the new neuroimaging review article
Experiments reproduced
at each commit
eg: brain reading
nilearn.github.io/auto_examples/02_decoding/plot_miyawaki_reconstruction.html
Resource intensive CI:
Data ⇒ Fight for good open data
Computation ⇒ Find good algorithms and tradeoffs
Forces us to distill the literature (as a review)
65. 2 Reusable science
scikit-learn is the new machine-learning textbook
nilearn is the new neuroimaging review article
Experiments reproduced
at each commit
eg: brain reading
nilearn.github.io/auto_examples/02_decoding/plot_miyawaki_reconstruction.html
Package development consolidates
science and moves it outside the lab
66. 3 An ecosystem
A bird’s eye view on scientific packages
67. 3 Packages of the Python ecosystem
[Figure: number of PyPI downloads (10^4 to 10^9) against package rank (1 to 10000), both on log scales]
A small number of packages are used by many
1/f distribution, preferential attachment
68. 3 Packages of the Python ecosystem
[Figure: number of PyPI downloads (10^4 to 10^9) against package rank (1 to 10000), both on log scales. Labeled packages: simplejson #1, six #2, setuptools #3, numpy #49, scikit-learn #110, joblib #431, nilearn #2877]
A small number of packages are used by many
1/f distribution, preferential attachment
nilearn relies on scikit-learn & joblib that rely on numpy...
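To get a feel for what a 1/f distribution implies, here is a back-of-the-envelope sketch. It assumes downloads are exactly proportional to 1 / rank over 10,000 packages, which is an idealization and not the actual PyPI data behind the plot; `top_share` is a hypothetical helper.

```python
def top_share(n_top, n_packages):
    """Fraction of all downloads going to the n_top most downloaded packages,
    assuming downloads are exactly proportional to 1 / rank."""
    weights = [1.0 / rank for rank in range(1, n_packages + 1)]
    return sum(weights[:n_top]) / sum(weights)


share_top_10 = top_share(10, 10_000)    # roughly 0.30
share_top_100 = top_share(100, 10_000)  # roughly 0.53
```

Under this idealized model, the top 1% of packages would receive about half of all downloads, which matches the slide's point that a small number of packages are used by many.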
69. 3 Standing on the shoulders of maintainers
May 31st: pip broken
https://github.com/pypa/setuptools/pull/1043
Left-pad:
How left-padding strings broke the Internet
A JavaScript package for left-padding strings was removed from Node’s package manager, breaking all the websites that depended on it.
70. 3 Dependencies
Beyond installation, a challenge is to ensure package
versions play well together: correctness of the code
Breakage of backward compatibility
yields irreconcilable dependencies
71. 3 Dependencies and their upgrade
It’s a fact: users hate upgrading
If it ain’t broken, don’t fix it
even if it is, apparently
72. 3 Declaring undependence?
Monolithic packages with no dependencies...
But:
Scaling is hard
Complexity grows as square of codebase size
[Woodfield 1979]
User support grows with userbase size
73. 3 Core software is infrastructure
Everybody uses it every day
In industry, education, & research
74. 3 Core software is infrastructure
Everybody uses it every day
In industry, education, & research
It needs maintenance
Like roads (or OpenSSL, to prevent Heartbleed)
Central infrastructure packages are “boring”
They are understaffed and underfunded
References: “Roads and Bridges”, Ford Foundation report
Excellent talk by Heather Miller
https://www.youtube.com/watch?v=17yy5BwIiTw
75. @GaelVaroquaux
Coding for science and innovation
New science
High value of bringing new methods to a field
⇒ Enable domain-specialists
Rapid iteration, but with automation & consolidation
Software tools
Scientists are limited by cognitive load
⇒ Design of API and documentation in libraries
Libraries make science reproducible and reusable
An ecosystem
Central packages hold the ecosystem together
Thanks to: the scipy community