pd.{read/to}_sql is simple but
not fast
Uwe Korn – QuantCo – November 2020
About me
• Engineering at QuantCo

• Apache {Arrow, Parquet} PMC

• Turbodbc Maintainer

• Other OSS stuff
@xhochy
mail@uwekorn.com
https://uwekorn.com
Our setting
• We like tabular data

• Thus we use pandas

• We want large amounts of this data in pandas
• The traditional storage for it is SQL databases

• How do we get from one to another?
SQL
• Very very brief intro:

• „domain-specific language for accessing data held in a relational
database management system“

• The one language in data systems that predates Python, R,
Julia, … (whichever we use as our „main“ language); it also has a
much wider user base

• SELECT * FROM table

INSERT INTO table
• Two main arguments of pandas.read_sql:

• sql: SQL query to be executed or a table name.

• con: SQLAlchemy connectable, str, or sqlite3 connection
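A minimal sketch of these two arguments, assuming an in-memory SQLite database and a hypothetical table `trips` (neither appears on the slides). Note that passing a bare table name instead of a query needs an SQLAlchemy connectable; with a plain sqlite3 connection only queries work.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database with a small example table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (id INTEGER, distance REAL)")
con.executemany("INSERT INTO trips VALUES (?, ?)", [(1, 2.5), (2, 7.0), (3, 1.2)])
con.commit()

# sql: the query to execute; con: here a plain sqlite3 connection.
df = pd.read_sql("SELECT id, distance FROM trips WHERE distance > 2 ORDER BY id", con)
print(df)
```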
• Two main arguments of DataFrame.to_sql:

• name: Name of SQL table.

• con: SQLAlchemy connectable, str, or sqlite3 connection
• Let’s look at the other nice bits („additional arguments“)

• if_exists: „What should we do when the target already exists?“

• fail

• replace

• append
• index: „What should we do with this one magical column?“ (bool)

• index_label

• chunksize: „Write less data at once“

• dtype: „Which SQL type should each column get?“ (dict or scalar)

• method: „Supply some magic insertion hook“ (callable)
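A short sketch of how some of these arguments combine, assuming an in-memory SQLite database and a hypothetical table `talks`. With a plain sqlite3 connection the `dtype` values are SQL type names given as strings; with an SQLAlchemy connectable they would be SQLAlchemy types instead.

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
df = pd.DataFrame({"city": ["Berlin", "Karlsruhe"], "visitors": [120, 45]})

# First write: creates the (hypothetical) table "talks".
# index=False drops the DataFrame index instead of storing it as a column.
df.to_sql("talks", con, index=False, dtype={"visitors": "INTEGER"})

# Second write: the table exists now, so the default if_exists="fail" would raise.
# "append" adds the rows; chunksize=1 inserts one row per batch.
df.to_sql("talks", con, index=False, if_exists="append", chunksize=1)

row_count = con.execute("SELECT COUNT(*) FROM talks").fetchone()[0]
column_names = [row[1] for row in con.execute("PRAGMA table_info(talks)")]
print(row_count, column_names)
```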
SQLAlchemy
• SQLAlchemy is a Python SQL toolkit and Object Relational Mapper
(ORM)

• We only use the toolkit part for:

• Metadata about schema and tables (incl. creation)

• Engine for connecting to various databases using a uniform
interface
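The two toolkit parts named above can be sketched as follows, assuming an in-memory SQLite database; the table name `trips` and its columns are made up for illustration.

```python
from sqlalchemy import Column, Float, Integer, MetaData, Table, create_engine, inspect

# Engine: a uniform interface to the database (here an in-memory SQLite).
engine = create_engine("sqlite://")

# Metadata: a Python-side description of the schema and its tables.
metadata = MetaData()
trips = Table(
    "trips",  # hypothetical table name
    metadata,
    Column("id", Integer, primary_key=True),
    Column("distance", Float),
)

# Emit CREATE TABLE for everything registered on the metadata object.
metadata.create_all(engine)

table_names = inspect(engine).get_table_names()
print(table_names)
```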
Under the bonnet
How does it work (read_sql)?
• pandas.read_sql [1] calls SQLDatabase.read_query [2]

• This then executes the query through the connection

• Depending on whether a chunksize was given, this fetches all or
parts of the result
[1] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L509-L516
[2] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1243
How does it work (read_sql)?
• Passes the fetched rows into the DataFrame.from_records constructor


• Optionally parses dates and sets an index
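The read path described on these two slides can be sketched roughly as below. This is a simplified stand-in written against the sqlite3 DB-API, not the actual pandas implementation; `read_query_sketch` and the table `trips` are hypothetical names.

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (id INTEGER, day TEXT, distance REAL)")
con.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(1, "2020-11-11", 2.5), (2, "2020-11-12", 7.0), (3, "2020-11-13", 1.2)],
)


def read_query_sketch(sql, con, chunksize=None):
    """Simplified stand-in for the read path; not the actual pandas code."""
    cursor = con.execute(sql)
    columns = [description[0] for description in cursor.description]

    def to_frame(rows):
        # Every row is a tuple of Python objects; from_records converts
        # these row-wise records into pandas' columnar layout.
        return pd.DataFrame.from_records(rows, columns=columns)

    if chunksize is None:
        return to_frame(cursor.fetchall())
    # With a chunksize, yield one DataFrame per batch of rows.
    return (to_frame(rows) for rows in iter(lambda: cursor.fetchmany(chunksize), []))


df = read_query_sketch("SELECT * FROM trips ORDER BY id", con)
# Optional post-processing: parse dates and set an index.
df["day"] = pd.to_datetime(df["day"])
df = df.set_index("id")

chunk_lengths = [
    len(chunk) for chunk in read_query_sketch("SELECT * FROM trips", con, chunksize=2)
]
```

Every value passes through a Python object before it lands in a column, which is the part the faster approaches later in the deck try to avoid.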
How does it work (to_sql)?
• This is more tricky as we modify the database.

• to_sql [1] may need to create the target

• If the table does not exist yet, it issues a CREATE TABLE [2]

• Afterwards, we INSERT [3] into the (new) table

• The insertion step is where we convert from DataFrame back into
records [4]



[1] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1320
[2] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1383-L1393
[3] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1398
[4] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L734-L747
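The write path can be sketched in the same spirit. Again this is a simplified stand-in against sqlite3, not the pandas code behind the links above; `to_sql_sketch` is a hypothetical helper, and column types are left out, which SQLite happens to permit.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "distance": [2.5, 7.0, 1.2]})
con = sqlite3.connect(":memory:")


def to_sql_sketch(frame, name, con):
    """Simplified stand-in for the write path; not the actual pandas code."""
    # 1. Create the target table if it does not exist yet.
    #    Column types are omitted here; pandas derives them from the dtypes.
    column_list = ", ".join(f'"{column}"' for column in frame.columns)
    con.execute(f'CREATE TABLE IF NOT EXISTS "{name}" ({column_list})')

    # 2. Convert the columnar DataFrame back into row-wise records.
    records = list(frame.itertuples(index=False, name=None))

    # 3. INSERT the records through the DB-API.
    placeholders = ", ".join("?" for _ in frame.columns)
    con.executemany(f'INSERT INTO "{name}" VALUES ({placeholders})', records)
    return len(records)


inserted = to_sql_sketch(df, "trips", con)
rows = con.execute("SELECT id, distance FROM trips ORDER BY id").fetchall()
```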
Why is it slow?
No benchmarks yet, theory first.
Thanks
Slides will come after PyData Global

Follow me on Twitter: @xhochy
How to get fast?
ODBC
• Open Database Connectivity (ODBC) is a standard API for accessing
databases

• Most databases provide an ODBC interface, some of them are
efficient

• Two popular Python libraries for that:

• https://github.com/mkleehammer/pyodbc

• https://github.com/blue-yonder/turbodbc
ODBC
Turbodbc has support for Apache Arrow:
https://arrow.apache.org/blog/2017/06/16/turbodbc-arrow/
ODBC
• With turbodbc + Arrow we get the following performance
improvements:

• 3-4x speedup for MS SQL, see https://youtu.be/B-uj8EDcjLY?t=1208

• 3-4x speedup for Exasol, see https://youtu.be/B-uj8EDcjLY?t=1390
Snowflake
• Turbodbc is a solution that retrofits performance

• Snowflake drivers already come with built-in speed

• Default response is JSON-based, BUT:

• The database server can answer directly with Arrow

• Client only needs the Arrow->pandas conversion (lightning fast⚡)

• Up to 10x faster, see
https://www.snowflake.com/blog/fetching-query-results-from-snowflake-just-got-a-lot-faster-with-apache-arrow/
JDBC
• Blogged about this at:
https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html

• Not yet very convenient, and read-only

• First, you need all your Java dependencies incl. arrow-jdbc in your
classpath

• Start the JVM, load the driver, and set up Arrow Java
JDBC
• Then:

• Fetch result using the Arrow Java JDBC adapter

• Use pyarrow.jvm to get a Python reference to the JVM memory

• Convert to pandas

136x speedup!
Postgres
Not yet open-sourced, but this is how it works:
How do we get this
into pandas.read_sql?
API troubles
• pandas’ simple API: 



• turbodbc

API troubles
• pandas’ simple API: 



• Snowflake

API troubles
• pandas’ simple API: 



• pyarrow.jvm + JDBC
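The code on these comparison slides did not survive extraction. The contrast can be sketched as below, assuming turbodbc's `fetchallarrow()` and the Snowflake connector's `fetch_pandas_all()` cursor methods; the wrapper function names are hypothetical and each expects an already opened connection of the respective driver.

```python
def read_sql_turbodbc(sql, connection):
    """Fetch a query result through turbodbc's Arrow support."""
    cursor = connection.cursor()
    cursor.execute(sql)
    # fetchallarrow() returns a pyarrow.Table; to_pandas() converts it column-wise.
    return cursor.fetchallarrow().to_pandas()


def read_sql_snowflake(sql, connection):
    """Fetch a query result through the Snowflake connector's pandas support."""
    cursor = connection.cursor()
    cursor.execute(sql)
    # fetch_pandas_all() returns a DataFrame built from the Arrow result.
    return cursor.fetch_pandas_all()
```

With pandas the same request is the single call `pd.read_sql(sql, con)`; each fast path needs its own driver-specific calls, which is the API trouble these slides point at.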

Building a better API
• We want to use pandas’ simple API but with the nice performance
benefits

• One idea: Dispatching based on the connection class



• User doesn’t need to learn a new API

• Performance improvements come via optional packages
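One way the dispatching idea could look, sketched with `functools.singledispatch` from the standard library. This is an illustration under assumptions, not the proposed pandas design: `FastConnection` and its `fetch_arrow_as_pandas` method are hypothetical, and the connection comes first in the signature only because `singledispatch` dispatches on the first argument.

```python
from functools import singledispatch


@singledispatch
def read_sql(con, sql):
    """Fallback: any connection without a registered fast path goes to pandas."""
    import pandas as pd  # imported lazily; only needed for the fallback

    return pd.read_sql(sql, con)


class FastConnection:
    """Hypothetical stand-in for a driver connection with an Arrow fast path."""

    def fetch_arrow_as_pandas(self, sql):  # hypothetical method
        return f"fast result for: {sql}"


# An optional package would register its own connection class like this.
@read_sql.register
def _(con: FastConnection, sql):
    return con.fetch_arrow_as_pandas(sql)


result = read_sql(FastConnection(), "SELECT 1")
```

The caller keeps a single entry point, and a fast path only becomes active once the package providing the connection class is installed and registered.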

Building a better API
Alternative idea:
Building a better API
Discussion in https://github.com/pandas-dev/pandas/issues/36893
Thanks
Follow me on Twitter: @xhochy

 
Bengaluru Tableau UG event- 2nd March 2024 Q1
Bengaluru Tableau UG event- 2nd March 2024 Q1Bengaluru Tableau UG event- 2nd March 2024 Q1
Bengaluru Tableau UG event- 2nd March 2024 Q1
 
Air Con Energy Rating Info411 Presentation.pdf
Air Con Energy Rating Info411 Presentation.pdfAir Con Energy Rating Info411 Presentation.pdf
Air Con Energy Rating Info411 Presentation.pdf
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdfNeo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
 
Báo cáo Social Media Benchmark 2024 cho dân Marketing
Báo cáo Social Media Benchmark 2024 cho dân MarketingBáo cáo Social Media Benchmark 2024 cho dân Marketing
Báo cáo Social Media Benchmark 2024 cho dân Marketing
 
Empowering Decisions A Guide to Embedded Analytics
Empowering Decisions A Guide to Embedded AnalyticsEmpowering Decisions A Guide to Embedded Analytics
Empowering Decisions A Guide to Embedded Analytics
 
TCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI Pipelines
 
Understanding the Impact of video length on student performance
Understanding the Impact of video length on student performanceUnderstanding the Impact of video length on student performance
Understanding the Impact of video length on student performance
 
The market for cross-border mortgages in Europe
The market for cross-border mortgages in EuropeThe market for cross-border mortgages in Europe
The market for cross-border mortgages in Europe
 
Data Collection from Social Media Platforms
Data Collection from Social Media PlatformsData Collection from Social Media Platforms
Data Collection from Social Media Platforms
 
Brain Tumor Detection with Machine Learning.pptx
Brain Tumor Detection with Machine Learning.pptxBrain Tumor Detection with Machine Learning.pptx
Brain Tumor Detection with Machine Learning.pptx
 

pandas.(to/from)_sql is simple but not fast

  • 1. pd.{read/to}_sql is simple but not fast Uwe Korn – QuantCo – November 2020
  • 2. About me • Engineering at QuantCo • Apache {Arrow, Parquet} PMC • Turbodbc Maintainer • Other OSS stuff @xhochy @xhochy mail@uwekorn.com https://uwekorn.com
  • 3. Our setting • We like tabular data • Thus we use pandas • We want large amounts of this data in pandas • The traditional storage for it is SQL databases • How do we get from one to another?
  • 4. SQL • Very very brief intro: • „domain-specific language for accessing data held in a relational database management system“ • The one language in data systems that precedes all the Python, R, Julia, … we use as our „main“ language, also much wider user base • SELECT * FROM table
 INSERT INTO table
  • 5. • Two main arguments: • sql: SQL query to be executed or a table name. • con: SQLAlchemy connectable, str, or sqlite3 connection
  • 6. • Two main arguments: • name: Name of SQL table. • con: SQLAlchemy connectable, str, or sqlite3 connection
  • 7. • Let’s look at the other nice bits („additional arguments“) • if_exists: „What should we do when the target already exists?“ • fail • replace • append
  • 8. • index: „What should we do with this one magical column?“ (bool) • index_label • chunksize: „Write less data at once“ • dtype: „Which SQL type should each column get?“ (dict or scalar) • method: „Supply some magic insertion hook“ (callable)
  • 9. SQLAlchemy • SQLAlchemy is a Python SQL toolkit and Object Relational Mapper (ORM) • We only use the toolkit part for: • Metadata about schema and tables (incl. creation) • Engine for connecting to various databases using a uniform interface
  • 15. How does it work (read_sql)? • pandas.read_sql [1] calls SQLDatabase.read_query [2] • This then does • Depending on whether a chunksize was given, this fetches all or parts of the result
 
 [1] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L509-L516 [2] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1243
  • 16. How does it work (read_sql)? • Passes the fetched data into the DataFrame.from_records constructor • Optionally parses dates and sets an index
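The read path described on slides 15 and 16 can be sketched roughly as follows. This is a simplified illustration and not the pandas implementation: `read_query_sketch` is a hypothetical helper, it uses the stdlib `sqlite3` driver instead of SQLAlchemy, and it skips date parsing.

```python
import sqlite3

import pandas as pd


def read_query_sketch(con, sql, chunksize=None, index_col=None):
    """Hypothetical helper that mimics the read_sql flow on a sqlite3 connection."""
    cursor = con.execute(sql)
    # Column names come from the DB-API cursor description.
    columns = [description[0] for description in cursor.description]

    def to_frame(rows):
        # Rows arrive as Python tuples and are converted in from_records.
        frame = pd.DataFrame.from_records(rows, columns=columns)
        if index_col is not None:
            frame = frame.set_index(index_col)
        return frame

    if chunksize is None:
        # No chunksize: fetch the whole result at once.
        return to_frame(cursor.fetchall())

    def chunks():
        # With a chunksize: fetch and convert the result piece by piece.
        while True:
            rows = cursor.fetchmany(chunksize)
            if not rows:
                break
            yield to_frame(rows)

    return chunks()
```

Every value passes through a Python object (a tuple of Python scalars) before it lands in a DataFrame column.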
  • 17. How does it work (to_sql)? • This is more tricky as we modify the database. • to_sql [1] may need to create the target • If the table does not exist, it issues CREATE TABLE [2] • Afterwards, we INSERT [3] into the (new) table • The insertion step is where we convert from DataFrame back into records [4]
 
 [1] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1320 [2] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1383-L1393 [3] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1398 [4] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L734-L747
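The write path from slide 17 can be sketched in a similar way. Again, this is a simplified illustration under assumptions, not the pandas code: `to_sql_sketch` is a hypothetical helper, it targets the stdlib `sqlite3` driver, creates untyped columns, and ignores the index and `if_exists` handling.

```python
import sqlite3

import pandas as pd


def to_sql_sketch(frame, name, con, chunksize=None):
    """Hypothetical helper that mimics the to_sql flow on a sqlite3 connection."""
    columns = list(frame.columns)
    quoted = ", ".join(f'"{column}"' for column in columns)

    # Step 1: create the target table if it does not exist yet.
    con.execute(f'CREATE TABLE IF NOT EXISTS "{name}" ({quoted})')

    # Step 2: convert the DataFrame back into row-wise records (Python tuples).
    records = list(frame.itertuples(index=False, name=None))

    # Step 3: INSERT the records, optionally in chunks.
    placeholders = ", ".join("?" for _ in columns)
    insert = f'INSERT INTO "{name}" ({quoted}) VALUES ({placeholders})'
    step = chunksize or len(records) or 1
    for start in range(0, len(records), step):
        con.executemany(insert, records[start:start + step])
    con.commit()
    return len(records)
```

As on the read side, the columnar data is turned into one Python tuple per row before it reaches the driver.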
  • 18. Why is it slow? No benchmarks yet, theory first.
  • 19. Why is it slow?
  • 20. Thanks • Slides will come after PyData Global • Follow me on Twitter: @xhochy • How to get fast?
  • 21. ODBC • Open Database Connectivity (ODBC) is a standard API for accessing databases • Most databases provide an ODBC interface, some of them are efficient • Two popular Python libraries for that: • https://github.com/mkleehammer/pyodbc • https://github.com/blue-yonder/turbodbc
  • 22. ODBC Turbodbc has support for Apache Arrow: https://arrow.apache.org/blog/2017/06/16/turbodbc-arrow/
  • 23. ODBC • With turbodbc + Arrow we get the following performance improvements: • 3-4x for MS SQL, see https://youtu.be/B-uj8EDcjLY?t=1208 • 3-4x speedup for Exasol, see https://youtu.be/B-uj8EDcjLY?t=1390
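A minimal sketch of the turbodbc + Arrow route, assuming turbodbc was built with Arrow support and that an ODBC data source is configured. The function name `read_sql_turbodbc` and the `dsn` argument are placeholders for illustration.

```python
def read_sql_turbodbc(sql, dsn):
    """Hypothetical wrapper: fetch a query result via turbodbc as Arrow, then convert."""
    # Imported lazily because turbodbc is an optional dependency.
    from turbodbc import connect

    connection = connect(dsn=dsn)
    try:
        cursor = connection.cursor()
        cursor.execute(sql)
        # The result is returned as a columnar pyarrow.Table instead of Python rows.
        table = cursor.fetchallarrow()
    finally:
        connection.close()
    # Arrow -> pandas is a columnar conversion, with no per-row Python tuples.
    return table.to_pandas()
```

Usage would look like `read_sql_turbodbc("SELECT * FROM my_table", dsn="my_dsn")`, where both the table and the DSN name are made up.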
  • 24. Snowflake • Turbodbc is a solution that retrofits performance • Snowflake drivers already come with built-in speed • Default response is JSON-based, BUT: • The database server can answer directly with Arrow • Client only needs the Arrow->pandas conversion (lightning fast⚡) • Up to 10x faster, see https://www.snowflake.com/blog/fetching-query-results-from-snowflake-just-got-a-lot-faster-with-apache-arrow/
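With the Snowflake Python connector, the Arrow-backed fetch could look roughly like this. The wrapper name `read_sql_snowflake` is hypothetical, and the connection keyword arguments (account, user, password, and so on) depend on your setup.

```python
def read_sql_snowflake(sql, **connect_kwargs):
    """Hypothetical wrapper: run a query and fetch the result as a DataFrame."""
    # Imported lazily because the connector is an optional dependency;
    # the pandas fetch methods additionally require pyarrow to be installed.
    import snowflake.connector

    connection = snowflake.connector.connect(**connect_kwargs)
    try:
        cursor = connection.cursor()
        cursor.execute(sql)
        # Fetches the Arrow-encoded result and converts it to pandas.
        return cursor.fetch_pandas_all()
    finally:
        connection.close()
```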
  • 25. JDBC • Blogged about this at: https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html • Not yet so convenient and read-only • First, you need all your Java dependencies incl. arrow-jdbc in your classpath • Start the JVM, load the driver, and set up Arrow Java
  • 26. JDBC • Then: • Fetch result using the Arrow Java JDBC adapter • Use pyarrow.jvm to get a Python reference to the JVM memory • Convert to pandas 136x speedup!
  • 27. Postgres Not yet open-sourced but this is how it works:
  • 28. How do we get this into pandas.read_sql?
  • 29. API troubles • pandas’ simple API: 
 • turbodbc

  • 30. API troubles • pandas’ simple API: 
 • Snowflake

  • 31. API troubles • pandas’ simple API: 
 • pyarrow.jvm + JDBC

  • 32. Building a better API • We want to use pandas’ simple API but with the nice performance benefits • One idea: Dispatching based on the connection class
 • User doesn’t need to learn a new API • Performance improvements come via optional packages
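The dispatching idea could be sketched as a small registry keyed by connection class. Everything here is hypothetical (`register_reader`, `read_sql_dispatch`, and the sqlite3 backend are made-up names, not pandas API), and to stay dependency-free the example backend returns column names and rows rather than a DataFrame.

```python
import sqlite3

# Maps a connection class to the function that knows how to read from it.
_READERS = {}


def register_reader(connection_type):
    """Decorator an optional package could use to register a fast reader."""
    def decorator(func):
        _READERS[connection_type] = func
        return func
    return decorator


def read_sql_dispatch(sql, con):
    """Single entry point: choose the reader based on the connection's class."""
    for klass in type(con).__mro__:
        reader = _READERS.get(klass)
        if reader is not None:
            return reader(sql, con)
    raise TypeError(f"No reader registered for {type(con).__name__}")


@register_reader(sqlite3.Connection)
def _read_sqlite(sql, con):
    # Stand-in backend; a real one would return a DataFrame.
    cursor = con.execute(sql)
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()
```

In this design the user keeps calling one function, and installing an optional package that registers a reader for its connection class is what switches on the faster path.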

  • 33. Building a better API Alternative idea:
  • 34. Building a better API Discussion in https://github.com/pandas-dev/pandas/issues/36893
  • 35. Thanks Follow me on Twitter: @xhochy