Dask Tutorial at PyCon.DE / PyData Karlsruhe 2018. These were the introductory slides, which mainly contain the link to Matthew Rocklin's Dask workshop at PyData NYC 2018, on which this workshop was based.
Scalable Scientific Computing with Dask
1.
PyCon.DE / PyData Karlsruhe 2018
Uwe L. Korn
Scalable Scientific Computing with Dask
2.
• Senior Data Scientist at Blue Yonder
(@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Data Engineer and Architect with a heavy
focus on Pandas
About me
xhochy
mail@uwekorn.com
3.
• Definition and execution of task graphs
• A parallel computing library that scales the existing Python ecosystem
• Scales down to your laptop
• Scales up to a cluster
What is Dask?
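The task graphs mentioned above can be built lazily and then executed in parallel. A minimal sketch using `dask.delayed` (assuming Dask is installed; the function names `inc` and `add` are illustrative, not part of Dask):

```python
import dask


@dask.delayed
def inc(x):
    # each call builds a task instead of running immediately
    return x + 1


@dask.delayed
def add(x, y):
    return x + y


# these calls only construct the task graph ...
a = inc(1)
b = inc(2)
total = add(a, b)

# ... compute() hands the graph to a scheduler, which can
# run independent tasks (here: the two inc calls) in parallel
print(total.compute())  # 5
```

Nothing runs until `compute()` is called, which is what allows Dask to schedule the independent tasks across cores or a cluster.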
4.
• multi-core and distributed parallel execution
• low-level: task schedulers for computation graphs
• high-level: Array, Bag and DataFrame
More than a single CPU
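The high-level collections mirror familiar NumPy/Pandas APIs while executing chunk-wise on the low-level scheduler. A short sketch with the Array collection (assuming `dask[array]` and NumPy are available):

```python
import dask.array as da

# a 1000x1000 array of ones, split into 100x100 chunks;
# operations are applied per chunk and can run on multiple cores
x = da.ones((1000, 1000), chunks=(100, 100))

# builds a task graph over the chunks; nothing is computed yet
result = (x + x.T).sum()

print(result.compute())  # 2000000.0
```

Bag and DataFrame work the same way: the collection records operations as a task graph, and a scheduler (threaded, multiprocessing, or distributed) executes it.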
5.
Dask is
• Lighter-weight
• Written in Python; interoperates well with C/C++/Fortran/LLVM or other natively compiled code
• Part of the Python ecosystem
What about Spark?
6.
Spark is
• Written in Scala and works well within the JVM
• Python support is very limited
• Brings its own ecosystem
• Able to provide more high-level optimizations
What about Spark?