SlideShare ist ein Scribd-Unternehmen logo
1 von 18
Downloaden Sie, um offline zu lesen
Taming the language border in
data analytics and science with
Apache Arrow
Uwe Korn – QuantCo – 18th June 2019
About me
• Engineering at QuantCo

• Apache {Arrow, Parquet} PMC

• Focus on Python but interact with
R, Java, SAS, …
@xhochy
@xhochy
mail@uwekorn.com
https://uwekorn.com
Do we have a problem?
• Yes, there are different ecosystems!

• Berlin Buzzwords

• Java / Scala

• Flink / ElasticSearch / Kafka

• Scala-Spark / Kubernetes

• PyData

• Python / R

• Pandas / NumPy / PySpark/sparklyr / Docker

• SQL-based databases

• ODBC / JDBC

• Custom protocols (e.g. Postgres)
Why solve this?
• We build pipelines to move data

• Goal: end-to-end data products

Somewhere along the path we need to talk

• Avoid duplicate work / work on converters
Apache Arrow at its core
• Main idea: common columnar representation of data in memory
• Provide libraries to access the data structures

• Broad support for many languages

• Create building blocks to form an ecosystem around it

• Implement adaptors for existing structures
Columnar Data
Previous Work
• CSV works really everywhere 

• Slow, untyped and row-wise

• Parquet is gaining traction in all ecosystems

• one of the major features and interaction points of Arrow

• Still, this serializes data

• RAM-Copy: 10GB/s on a Laptop

• DataFrame implementations look similar but still are incompatible
Languages
• C++, C(glib), Python, Ruby, R, Matlab

• C#

• Go

• Java

• JavaScript

• Rust
There’s a social component
• It’s not only APIs you need to bring together

• Communities are also quite distinct

• Get them talking!
Shipped with batteries
• There is more than just data structures

• Batteries in Arrow

• Vectorized Parquet reader: C++, Rust, Java(WIP)

C++ also supports ORC

• Gandiva: LLVM-based expression kernels

• Plasma: Shared-memory object store

• DataFusion: Rust-based query engine

• Flight: RPC protocol built on top of gRPC with zero-copy optimizations
Ecosystem
• RAPIDS: Analytics on the GPU

• Dremio: Data platform

• Turbodbc: columnar ODBC access in C++/Python

• Spark: fast Python and R bridge

• fletcher (pandas): Use Arrow instead of NumPy as backing storage

• fletcher (FPGA): Use Arrow on FPGAs

• Many more … https://arrow.apache.org/powered_by/
Does it work?
Everything is amazing on slides …

… so does this Arrow actually work?

Let’s take a real example with:

• ERP System in Java with JDBC access (no non-Java client)

• ETL and Data Cleaning in Python

• Analysis in R
Does it work?
Does it work?
Does it work?
WIP
Up Next
• Build more adaptors, e.g. Postgres

• Building blocks for query engines on top of Arrow

• Datasets

• Analytical kernels

• DataFrame implementations directly on top of Arrow
Thanks
Slides at https://twitter.com/xhochy

Question here!

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop at ayasdi
Hadoop at ayasdiHadoop at ayasdi
Hadoop at ayasdiMohit Jaggi
 
Introduction to df
Introduction to dfIntroduction to df
Introduction to dfMohit Jaggi
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
Building DSLs with Scala
Building DSLs with ScalaBuilding DSLs with Scala
Building DSLs with ScalaMohit Jaggi
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
AgensGraph: a Multi-model Graph Database based on PostgreSql
AgensGraph: a Multi-model Graph Database based on PostgreSqlAgensGraph: a Multi-model Graph Database based on PostgreSql
AgensGraph: a Multi-model Graph Database based on PostgreSqlKisung Kim
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityUwe Korn
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
pandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fastpandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fastUwe Korn
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended CutWes McKinney
 
DL4J at Workday Meetup
DL4J at Workday MeetupDL4J at Workday Meetup
DL4J at Workday MeetupDavid Kale
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talksyhadoop
 
Developing a Staff-Only Samvera Application
Developing a Staff-Only Samvera ApplicationDeveloping a Staff-Only Samvera Application
Developing a Staff-Only Samvera ApplicationJames Griffin
 
Deep Learning with GPUs in Production - AI By the Bay
Deep Learning with GPUs in Production - AI By the BayDeep Learning with GPUs in Production - AI By the Bay
Deep Learning with GPUs in Production - AI By the BayAdam Gibson
 
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran John Mulhall
 
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup groupApache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup groupfvanvollenhoven
 
A rubyist's naive comparison of some database systems and toolkits
A rubyist's naive comparison of some database systems and toolkitsA rubyist's naive comparison of some database systems and toolkits
A rubyist's naive comparison of some database systems and toolkitsBelighted
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBScyllaDB
 

Was ist angesagt? (20)

Hadoop at ayasdi
Hadoop at ayasdiHadoop at ayasdi
Hadoop at ayasdi
 
Introduction to df
Introduction to dfIntroduction to df
Introduction to df
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
Building DSLs with Scala
Building DSLs with ScalaBuilding DSLs with Scala
Building DSLs with Scala
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
AgensGraph: a Multi-model Graph Database based on PostgreSql
AgensGraph: a Multi-model Graph Database based on PostgreSqlAgensGraph: a Multi-model Graph Database based on PostgreSql
AgensGraph: a Multi-model Graph Database based on PostgreSql
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Rust meetup delhi nov 18
Rust meetup delhi nov 18Rust meetup delhi nov 18
Rust meetup delhi nov 18
 
pandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fastpandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fast
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
 
DL4J at Workday Meetup
DL4J at Workday MeetupDL4J at Workday Meetup
DL4J at Workday Meetup
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
 
Developing a Staff-Only Samvera Application
Developing a Staff-Only Samvera ApplicationDeveloping a Staff-Only Samvera Application
Developing a Staff-Only Samvera Application
 
Deep Learning with GPUs in Production - AI By the Bay
Deep Learning with GPUs in Production - AI By the BayDeep Learning with GPUs in Production - AI By the Bay
Deep Learning with GPUs in Production - AI By the Bay
 
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran
 
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup groupApache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup group
 
A rubyist's naive comparison of some database systems and toolkits
A rubyist's naive comparison of some database systems and toolkitsA rubyist's naive comparison of some database systems and toolkits
A rubyist's naive comparison of some database systems and toolkits
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
 

Ähnlich wie Berlin Buzzwords 2019 - Taming the language border in data analytics and science with Apache Arrow

Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato Review2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato ReviewHang Li
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSSri Ambati
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleDomino Data Lab
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...ScyllaDB
 
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...Thoughtworks
 
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...Thoughtworks
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OSri Ambati
 
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copyFulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copyUwe Korn
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015Wes McKinney
 
High Performance Systems in Go - GopherCon 2014
High Performance Systems in Go - GopherCon 2014High Performance Systems in Go - GopherCon 2014
High Performance Systems in Go - GopherCon 2014Derek Collison
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionNew Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionSri Ambati
 
H2O Rains with Databricks Cloud - NY 02.16.16
H2O Rains with Databricks Cloud - NY 02.16.16H2O Rains with Databricks Cloud - NY 02.16.16
H2O Rains with Databricks Cloud - NY 02.16.16Sri Ambati
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latestWes McKinney
 
H2O Rains with Databricks Cloud - Parisoma SF
H2O Rains with Databricks Cloud - Parisoma SFH2O Rains with Databricks Cloud - Parisoma SF
H2O Rains with Databricks Cloud - Parisoma SFSri Ambati
 

Ähnlich wie Berlin Buzzwords 2019 - Taming the language border in data analytics and science with Apache Arrow (20)

Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato Review2015 Data Science Summit @ dato Review
2015 Data Science Summit @ dato Review
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
 
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
 
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copyFulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015
 
High Performance Systems in Go - GopherCon 2014
High Performance Systems in Go - GopherCon 2014High Performance Systems in Go - GopherCon 2014
High Performance Systems in Go - GopherCon 2014
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionNew Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 Edition
 
Be faster then rabbits
Be faster then rabbitsBe faster then rabbits
Be faster then rabbits
 
H2O Rains with Databricks Cloud - NY 02.16.16
H2O Rains with Databricks Cloud - NY 02.16.16H2O Rains with Databricks Cloud - NY 02.16.16
H2O Rains with Databricks Cloud - NY 02.16.16
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
 
H2O Rains with Databricks Cloud - Parisoma SF
H2O Rains with Databricks Cloud - Parisoma SFH2O Rains with Databricks Cloud - Parisoma SF
H2O Rains with Databricks Cloud - Parisoma SF
 

Kürzlich hochgeladen

Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 

Kürzlich hochgeladen (20)

Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 

Berlin Buzzwords 2019 - Taming the language border in data analytics and science with Apache Arrow

  • 1. Taming the language border in data analytics and science with Apache Arrow Uwe Korn – QuantCo – 18th June 2019
  • 2. About me • Engineering at QuantCo • Apache {Arrow, Parquet} PMC • Focus on Python but interact with R, Java, SAS, … @xhochy @xhochy mail@uwekorn.com https://uwekorn.com
  • 3. Do we have a problem? • Yes, there are different ecosystems! • Berlin Buzzwords • Java / Scala • Flink / ElasticSearch / Kafka • Scala-Spark / Kubernetes • PyData • Python / R • Pandas / NumPy / PySpark/sparklyr / Docker • SQL-based databases • ODBC / JDBC • Custom protocols (e.g. Postgres)
  • 4. Why solve this? • We build pipelines to move data • Goal: end-to-end data products
 Somewhere along the path we need to talk • Avoid duplicate work / work on converters
  • 5.
  • 6. Apache Arrow at its core • Main idea: common columnar representation of data in memory • Provide libraries to access the data structures • Broad support for many languages • Create building blocks to form an ecosystem around it • Implement adaptors for existing structures
  • 8. Previous Work • CSV works really everywhere • Slow, untyped and row-wise • Parquet is gaining traction in all ecosystems • one of the major features and interaction points of Arrow • Still, this serializes data • RAM-Copy: 10GB/s on a Laptop • DataFrame implementations look similar but still are incompatible
  • 9. Languages • C++, C(glib), Python, Ruby, R, Matlab • C# • Go • Java • JavaScript • Rust
  • 10. There’s a social component • It’s not only APIs you need to bring together • Communities are also quite distinct • Get them talking!
  • 11. Shipped with batteries • There is more than just data structures • Batteries in Arrow • Vectorized Parquet reader: C++, Rust, Java(WIP)
 C++ also supports ORC • Gandiva: LLVM-based expression kernels • Plasma: Shared-memory object store • DataFusion: Rust-based query engine • Flight: RPC protocol built on top of gRPC with zero-copy optimizations
  • 12. Ecosystem • RAPIDS: Analytics on the GPU • Dremio: Data platform • Turbodbc: columnar ODBC access in C++/Python • Spark: fast Python and R bridge • fletcher (pandas): Use Arrow instead of NumPy as backing storage • fletcher (FPGA): Use Arrow on FPGAs • Many more … https://arrow.apache.org/powered_by/
  • 13. Does it work? Everything is amazing on slides … … so does this Arrow actually work? Let’s take a real example with: • ERP System in Java with JDBC access (no non-Java client) • ETL and Data Cleaning in Python • Analysis in R
  • 17. Up Next • Build more adaptors, e.g. Postgres • Building blocks for query engines on top of Arrow • Datasets • Analytical kernels • DataFrame implementations directly on top of Arrow