Efficient DataFrame storage with Apache Parquet

Uwe Korn

Slides for my presentation at PyData London 2017: https://pydata.org/london2017/schedule/presentation/54/

Efficient and portable DataFrame storage with Apache Parquet
Uwe L. Korn, PyData London 2017

About me
• Data Scientist at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Work in Python, Cython, C++11 and SQL
• Heavy Pandas User
xhochy
uwe@apache.org

Agenda
• History of Apache Parquet
• The format in detail
• Use it in Python

About Parquet
1. Columnar on-disk storage format
2. Started in fall 2012 by Cloudera & Twitter
3. July 2013: 1.0 release
4. Top-level Apache project
5. Fall 2016: Python & C++ support
6. State-of-the-art format in the Hadoop ecosystem
• often used as the default I/O option

Why use Parquet?
1. Columnar format -> vectorized operations
2. Efficient encodings and compressions -> small size without the need for a fat CPU
3. Query push-down -> bring computation to the I/O layer
4. Language independent format -> libs in Java / Scala / C++ / Python /…

Who uses Parquet?
• Query Engines
• Hive
• Impala
• Drill
• Presto
• …
• Frameworks
• Spark
• MapReduce
• …
• Pandas
• Dask

File Structure
File
RowGroup
Column Chunks
Page
Statistics

Encodings
• Know the data
• Exploit the knowledge
• Cheaper than universal compression
• Example dataset:
• NYC TLC Trip Record data for January 2016
• 1629 MiB as CSV
• columns: bool(1), datetime(2), float(12), int(4)
• Source: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

Encodings — PLAIN
• Simply write the binary representation to disk
• Simple to read & write
• Performance limited by I/O throughput
• -> 1499 MiB for the example dataset

Encodings — RLE & Bit Packing
• Bit-packing: only use the necessary bits per value
• Run-length encoding: store "378 times 12" instead of repeating the value 378 times
• hybrid: dynamically choose the best
• Used for Definition & Repetition levels
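
The two ideas on this slide can be sketched in a few lines of plain Python. This is a simplified illustration only: the function names are hypothetical, and it does not reproduce Parquet's actual byte layout or the hybrid selection logic.

```python
from itertools import groupby


def bit_width(max_value: int) -> int:
    """Bits needed to represent any integer in 0..max_value (bit-packing width)."""
    return max_value.bit_length()


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# The "378 times 12" case from the slide, followed by a few other values.
values = [12] * 378 + [7, 7, 3]
encoded = run_length_encode(values)

print(encoded)                  # three pairs instead of 381 values
print(bit_width(max(values)))   # bits per value if bit-packed instead
```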

Encodings — Dictionary
• PLAIN_DICTIONARY / RLE_DICTIONARY
• every value is assigned a code
• Dictionary: store a map of code -> value
• Data: store only codes, use RLE on that
• -> 329 MiB (22%)
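
A minimal sketch of the dictionary idea in plain Python, assuming codes are handed out in order of first appearance. The helper names and the sample column are hypothetical; a real writer would additionally run-length encode and bit-pack the codes.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): dictionary[code] is the original value."""
    dictionary = []   # code -> value
    code_of = {}      # value -> code
    codes = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Look every code up in the dictionary to restore the values."""
    return [dictionary[code] for code in codes]


# Hypothetical low-cardinality column.
payment_types = ["card", "card", "cash", "card", "cash", "cash"]
dictionary, codes = dictionary_encode(payment_types)

print(dictionary)
print(codes)
```

The fewer distinct values a column has, the smaller the codes are and the longer the runs that RLE can collapse.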

Compression
1. Shrink data size independent of its content
2. More CPU intensive than encoding
3. Encoding + compression performs better than compression alone, with less CPU cost
4. LZO, Snappy, GZIP, Brotli -> if in doubt: use Snappy
5. GZIP: 174 MiB (11%), Snappy: 216 MiB (14%)
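
To experiment with point 3, the sketch below compresses a low-cardinality column twice: once as plain text and once after dictionary encoding. It uses only the standard library, with `zlib` (DEFLATE, the algorithm behind GZIP) standing in for the codecs named above; the column contents are invented. The sizes are printed rather than asserted, since the exact numbers depend on the data.

```python
import random
import zlib

# Hypothetical column with three distinct values in random order.
rng = random.Random(0)
boroughs = ["Manhattan", "Brooklyn", "Queens"]
column = [rng.choice(boroughs) for _ in range(10_000)]

# Compression alone: serialize the strings as-is, then compress.
plain = "\n".join(column).encode("utf-8")
compressed_plain = zlib.compress(plain)

# Encoding + compression: replace each string by a one-byte code first.
dictionary = sorted(set(column))
code_of = {value: code for code, value in enumerate(dictionary)}
codes = bytes(code_of[value] for value in column)
compressed_codes = zlib.compress(codes)

print("plain:  ", len(plain), "->", len(compressed_plain), "bytes")
print("encoded:", len(codes), "->", len(compressed_codes), "bytes")
```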

Query pushdown
1. Only load used data
   a. skip columns that are not needed
   b. skip (chunks of) rows that are not relevant
2. Saves I/O load as the data is not transferred
3. Saves CPU as the data is not decoded
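
Row skipping relies on the per-chunk statistics stored in the file. The following toy model in plain Python shows the principle for a "value > threshold" filter; the `RowGroup` class and `scan_greater_than` function are hypothetical stand-ins, not part of any Parquet library.

```python
class RowGroup:
    """Toy stand-in for a row group holding a single integer column."""

    def __init__(self, values):
        self.values = list(values)
        # Statistics are computed once at write time and kept as metadata.
        self.minimum = min(self.values)
        self.maximum = max(self.values)


def scan_greater_than(row_groups, threshold):
    """Return (matching values, number of row groups that had to be decoded)."""
    matches = []
    decoded = 0
    for group in row_groups:
        if group.maximum <= threshold:
            continue  # no row can match: skip without reading or decoding
        decoded += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, decoded


row_groups = [RowGroup([1, 2, 3]), RowGroup([4, 9, 5]), RowGroup([2, 2, 1])]
matches, decoded = scan_greater_than(row_groups, 3)

print(matches, decoded)
```

Only the middle row group is decoded; the other two are ruled out by their maximum alone.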

Read & Write Parquet
https://arrow.apache.org/docs/python/parquet.html
Alternative Implementation: https://fastparquet.readthedocs.io/en/latest/

Apache Arrow?
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, ...)
• Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R and the JVM
• This brought Parquet to Pandas without any Python code in parquet-cpp
• Just released: 0.3

Get Involved!

Apache Arrow
Cross language DataFrame library
• Website: https://arrow.apache.org/
• ML: dev@arrow.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
• Slack: https://apachearrowslackin.herokuapp.com/
• GitHub mirror: https://github.com/apache/arrow

Apache Parquet
Famous columnar file format
• Website: https://parquet.apache.org/
• ML: dev@parquet.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
• Slack: https://parquet-slack-invite.herokuapp.com/
• C++ GitHub mirror: https://github.com/apache/parquet-cpp

Blue Yonder
Best decisions, delivered daily

Blue Yonder GmbH
Ohiostraße 8
76149 Karlsruhe
Germany
+49 721 383117 0

Blue Yonder Software Limited
19 Eastbourne Terrace
London, W2 6LG
United Kingdom
+44 20 3626 0360

Blue Yonder Analytics, Inc.
5048 Tennyson Parkway
Suite 250
Plano, Texas 75024
USA

Benchmarks (slides 14 to 16, content not captured in this transcript): size, time, size vs time
