Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy


This document discusses how Apache Arrow enables sharing data between Python and Java without copying. It summarizes Arrow's capabilities for efficient in-memory columnar data and its ability to exchange data between different programming languages. The document then outlines how Arrow, through its Java and Python libraries, allows querying data in Java from Python without copying, by passing memory addresses between the two environments. This enables faster data science workflows that involve both Python and Java/Scala.


1
Fulfilling Apache Arrow's Promises:
Pandas on JVM memory without a copy
PyCon.DE Karlsruhe 2018
Uwe L. Korn

2
About me
• Senior Data Scientist at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Data Engineer and Architect with heavy focus around Pandas
• xhochy
• mail@uwekorn.com

3
What’s Apache Arrow?
• Published in February 2016
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, …)
• Exchange data without conversion between Python, C++, C (glib), Ruby, Lua, R, JavaScript, Go, Rust, Matlab and the JVM
• Brought Parquet to Pandas and made PySpark fast (@pandas_udf)

4
February 2016: Birth of Apache Arrow
Just a goal…

5
Data Science Workflow in 2018
[Diagram with boxes: SQL Engine; pre-processing with pandas; Python machine learning model; probability density function (PDF)]

6
Looks simple?
• It isn’t.
• "Data" is a very heterogeneous landscape
• Most common setup:
• Java/Scala, i.e. JVM, for data processing
• Python for machine learning

7
Data Science Workflow in 2018
[Diagram with boxes: SQL Engine; JDBC Driver; JayDeBeApi; pre-processing with pandas; Python machine learning model. Data labels: JDBC ROWS, PYTHON ROWS]

8
org.apache.arrow.adapter.jdbc
• Retrieve JDBC results as Arrow RecordBatch / VectorSchemaRoot
• Converts rows to columns inside the JVM
• Data is stored "off-heap", i.e.:
• not managed by the JVM
• native memory layout, same as in pyarrow
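
The row-to-column conversion named on this slide can be illustrated with a minimal sketch. This is not the adapter's code (the adapter runs in the JVM); it only shows the idea in plain Python, using the standard-library `array` module as a stand-in for contiguous typed buffers. The `rows` data and the `rows_to_columns` helper are hypothetical.

```python
from array import array

# Hypothetical JDBC-style result set: one tuple per row.
rows = [
    (1, 9.5),
    (2, 7.25),
    (3, 8.0),
]


def rows_to_columns(rows):
    """Pivot row tuples into one contiguous typed buffer per column."""
    ids = array("q")     # 64-bit signed integers
    scores = array("d")  # 64-bit floats
    for row_id, score in rows:
        ids.append(row_id)
        scores.append(score)
    return {"id": ids, "score": scores}


columns = rows_to_columns(rows)
print(columns["id"].tolist())
print(columns["score"].tolist())
```

Each column ends up in its own contiguous buffer, which is the property the columnar layout relies on.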

9
Workflow in 2018 with Arrow
[Diagram with boxes: SQL Engine; JDBC Driver; org.apache.arrow.adapter.jdbc; pre-processing with pandas; Python machine learning model. Data labels: JDBC ROWS, ARROW, ?]

10
So we’re done? No.
• We still only have Arrow data in the JVM
• Arrow and Pandas have a slightly different memory layout
• We have this today in PySpark
• It’s fast
• Still involves a copy over the network
• Arrow → pandas conversion is tuned but still a copy

11
pyarrow.jvm
• Access Arrow data created in the JVM from Python
• Involves no copy of the data
• Translates only the helper objects (schema, fields)
• Actually passes memory addresses around
No copy between the JVM and Python!
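
What "passing memory addresses around" means can be sketched with the standard-library `ctypes` module. This is an illustration of the principle only, not how pyarrow.jvm is implemented: a producer hands over an address and a length, and the consumer builds a typed view on the same bytes. All names in the sketch are hypothetical.

```python
import ctypes

# "Producer" side: a buffer of four 64-bit integers allocated by ctypes,
# standing in for Arrow's off-heap memory.
Int64x4 = ctypes.c_int64 * 4
producer_buffer = Int64x4(10, 20, 30, 40)

# Only these two plain integers cross the language boundary.
address = ctypes.addressof(producer_buffer)
length = len(producer_buffer)

# "Consumer" side: rebuild a typed view on the same memory from the address.
consumer_view = (ctypes.c_int64 * length).from_address(address)

# A write through the producer is visible through the consumer,
# which shows that both names refer to the same bytes.
producer_buffer[0] = 99
print(list(consumer_view))
```

The view does not own the memory, so the producer has to keep the buffer alive for as long as the consumer uses it. Any real zero-copy bridge has to handle the same lifetime question.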

NumPy & the BlockManager
Photo by Susan Holt Simpson on Unsplash

13
Pandas Shortcomings
• Limited to NumPy data types, otherwise object
• Columns are not separate, grouped by type
• Nullability is not type-safe (yet)
→ Arrow memory does not match Pandas memory
→ Copy 😢
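
The nullability point can be made concrete with a small sketch. Arrow keeps a separate validity bitmap next to the values buffer (one bit per value, least-significant bit first), so integers stay integers. A NaN-based representation, in contrast, has to turn the column into floats. The code below is a plain-Python illustration with hypothetical helper names, not pyarrow or pandas code.

```python
from array import array


def build_nullable_int_column(values):
    """Store ints plus a separate validity bitmap (1 bit per value, LSB first)."""
    data = array("q", (0 if v is None else v for v in values))
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # mark slot i as valid
    return data, bitmap


def is_valid(bitmap, i):
    """Return True if slot i holds a value, False if it is null."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


raw = [1, None, 3]

# Arrow-style: integer buffer plus validity bitmap.
data, bitmap = build_nullable_int_column(raw)
print(data.tolist(), [is_valid(bitmap, i) for i in range(len(raw))])

# NaN-style: the missing value forces every element to become a float.
as_floats = [float("nan") if v is None else float(v) for v in raw]
print(as_floats)
```

Because the two representations store different bytes, handing an Arrow column to code that expects the NaN-based form requires a conversion.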

14
Pandas ExtensionArrays
• New interfaces introduced in pandas 0.23
• ExtensionDtype
• What type of scalars?
• ExtensionArray
• Implement basic array ops
• Pandas provides algorithms on top
• Still experimental; wait for 0.24

16
fletcher
• https://github.com/xhochy/fletcher
• Implements Extension{Array,Dtype} with Apache Arrow as storage
• Uses Numba to implement the necessary analytics on top
• Needs {pandas, Arrow, …} master
No copy between Apache Arrow and pandas!

17
Workflow in 2018 with Arrow
[Diagram with boxes: SQL Engine; JDBC Driver; org.apache.arrow.adapter.jdbc; pyarrow.jvm / fletcher; pre-processing with pandas; Python machine learning model. Data labels: JDBC ROWS, ARROW]


22
Get Involved!

Apache Arrow
Cross language DataFrame library
• Website: https://arrow.apache.org/
• Mailing list: dev@arrow.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
• Slack: https://apachearrowslackin.herokuapp.com/
• Github mirror: https://github.com/apache/arrow

Apache Parquet
Famous columnar file format
• Website: https://parquet.apache.org/
• Mailing list: dev@parquet.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
• Slack: https://parquet-slack-invite.herokuapp.com/
• C++ Github mirror: https://github.com/apache/parquet-cpp

18
Does it work?
