Sadayuki Furuhashi
Founder & Software Architect
Treasure Data, inc.
Presto: Interactive SQL Query Engine for Big Data
Hadoop Conference in Japan 2014
A little about me...
> Sadayuki Furuhashi
> github/twitter: @frsyuki
> Treasure Data, Inc.
> Founder & Software Architect
> Open-source hacker
> MessagePack - efficient object serializer
> Fluentd - data collection tool
> ServerEngine - Ruby framework to build multiprocess servers
> LS4 - distributed object storage system
> kumofs - distributed key-value data store
0. Background + Intro
What’s Presto?
A distributed SQL query engine
for interactive data analysis
against GBs to PBs of data.
Presto’s history
> 2012 Fall: Project started at Facebook
> Designed for interactive query
> with speed of commercial data warehouse
> and scalability to the size of Facebook
> 2013 Winter: Open sourced!
> 30+ contributors in 6 months
> including people from outside of Facebook
What are the problems to solve?
> We couldn’t visualize data in HDFS directly
using dashboards or BI tools
> because Hive is too slow (not interactive)
> or ODBC connectivity is unavailable/unstable
> We needed to store daily-batch results to an
interactive DB for quick response
(PostgreSQL, Redshift, etc.)
> Interactive DBs cost more and are far less scalable
> Some data are not stored in HDFS
> We need to copy the data into HDFS to analyze
[Diagram: a batch analysis platform (HDFS, Hive, daily/hourly batch) feeds a separate visualization platform (PostgreSQL, etc., serving interactive queries to dashboards and commercial BI tools)]
[Diagram: the same two platforms, annotated with their drawbacks]
✓ Less scalable
✓ Extra cost
✓ More work to manage 2 platforms
✓ Can’t query against “live” data directly
[Diagram: Presto serves interactive queries for the dashboard directly on HDFS, next to Hive's daily/hourly batch]
[Diagram: SQL on any data sets: Presto queries HDFS, Cassandra, MySQL, and commercial DBs]
[Diagram: a single data analysis platform: Presto also serves commercial BI tools]
✓ IBM Cognos
✓ Tableau
✓ ...
dashboard on chart.io: https://chartio.com/
What can Presto do?
> Query interactively (in milliseconds to minutes)
> MapReduce and Hive are still necessary for ETL
> Query using commercial BI tools or dashboards
> Reliable ODBC/JDBC connectivity
> Query across multiple data sources such as
Hive, HBase, Cassandra, or even commercial DBs
> Plugin mechanism
> Integrate batch analysis + visualization
into a single data analysis platform
Presto’s deployment
> Facebook
> Multiple geographical regions
> scaled to 1,000 nodes
> actively used by 1,000+ employees
> who run 30,000+ queries every day
> processing 1PB/day
> Netflix, Dropbox, Treasure Data, Airbnb, Qubole
> Presto as a Service
Today’s talk
1. Distributed architecture
2. Data visualization - Demo
3. Query Execution - Presto vs. MapReduce
4. Monitoring & Configuration
5. Roadmap - the future
1. Distributed architecture
[Diagram: Client, Coordinator (with Connector Plugin), three Workers, Storage / Metadata, Discovery Service]
1. Discovery Service: find servers in a cluster
2. Client sends a query using HTTP
3. Coordinator builds a query plan; connector plugin provides metadata (table schema, etc.)
4. Coordinator sends tasks to workers
5. Workers read data through connector plugin
6. Workers run tasks in memory
7. Client gets the result from a worker
What are Connectors?
> Connectors are plugins to Presto
> written in Java
> Access to storage and metadata
> provide table schema to coordinators
> provide table rows to workers
> Implementations:
> Hive connector
> Cassandra connector
> MySQL through JDBC connector (prerelease)
> Or your own connector
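The division of labor above (table schema to the coordinator, table rows to the workers) can be sketched as a small interface. This is an illustration only: real connectors are written in Java against Presto's plugin interfaces, and the names `Connector`, `get_table_schema`, `read_rows`, and `InMemoryConnector` are hypothetical, not part of Presto.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple

Schema = List[Tuple[str, str]]  # (column name, column type)


class Connector(ABC):
    """Hypothetical connector interface (not Presto's real Java SPI)."""

    @abstractmethod
    def get_table_schema(self, table: str) -> Schema:
        """Used by the coordinator while planning: column names and types."""

    @abstractmethod
    def read_rows(self, table: str) -> Iterator[tuple]:
        """Used by workers while executing: the rows of the table."""


class InMemoryConnector(Connector):
    """Toy connector backed by a dict, standing in for Hive, Cassandra, etc."""

    def __init__(self, tables: Dict[str, Tuple[Schema, List[tuple]]]):
        self._tables = tables

    def get_table_schema(self, table: str) -> Schema:
        return self._tables[table][0]

    def read_rows(self, table: str) -> Iterator[tuple]:
        return iter(self._tables[table][1])


connector = InMemoryConnector({
    "impressions": (
        [("name", "varchar"), ("time", "bigint")],
        [("a", 1), ("b", 2), ("a", 3)],
    )
})
schema = connector.get_table_schema("impressions")  # coordinator side
rows = list(connector.read_rows("impressions"))     # worker side
```

Keeping metadata access and row access as separate methods mirrors the split between planning on the coordinator and execution on the workers.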
[Diagram: Hive connector: the coordinator and workers reach HDFS and the Hive Metastore through the Hive connector]
[Diagram: Cassandra connector: the coordinator and workers reach Cassandra through the Cassandra connector]
[Diagram: Multiple connectors in a query: the Hive connector (HDFS / Metastore), the Cassandra connector (Cassandra), and other connectors (other data sources...) serve the same query]
1. Distributed architecture
> 3 types of servers:
> Coordinator, worker, discovery service
> Get data/metadata through connector plugins.
> Presto is NOT a database
> Presto provides SQL to existing data stores
> Client protocol is HTTP + JSON
> Language bindings:
Ruby, Python, PHP, Java (JDBC), R, Node.js...
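Because the client protocol is HTTP + JSON, a minimal client needs nothing beyond an HTTP library. The sketch below assumes the commonly documented shape of the protocol (`POST /v1/statement` with an `X-Presto-User` header, then `GET` on each `nextUri` until none is returned); these details are not on the slides, so check them against the Presto version in use. The coordinator address and user name in the usage note are placeholders.

```python
import json
import urllib.request
from typing import Callable, Dict, List, Optional


def http_send(method: str, url: str, headers: Dict[str, str],
              body: Optional[bytes]) -> dict:
    """Send one HTTP request and decode the JSON response body."""
    req = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_query(sql: str, coordinator: str, user: str,
              send: Callable[..., dict] = http_send) -> List[list]:
    """Submit a query, then follow `nextUri` until the result is complete."""
    headers = {"X-Presto-User": user}  # assumed header name, see lead-in
    page = send("POST", coordinator + "/v1/statement", headers,
                sql.encode("utf-8"))
    rows: List[list] = []
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        rows.extend(page.get("data", []))  # a page may carry no rows yet
        next_uri = page.get("nextUri")
        if next_uri is None:
            return rows
        page = send("GET", next_uri, headers, None)
```

A call such as `run_query("SELECT COUNT(1) FROM tbl1", "http://coordinator.example:8080", "demo")` would return the result rows as a list of lists. The `send` parameter exists only so the transport can be swapped out, for example in tests.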
[Diagram: Coordinator HA: the same architecture with a second coordinator]
2. Data visualization
The problems with using BI tools
> BI tools need ODBC or JDBC connectivity
> Tableau, IBM Cognos, QlikView, Chart.IO, ...
> JasperSoft, Pentaho, MotionBoard, ...
> ODBC/JDBC is VERY COMPLICATED
> A mature implementation takes a LONG time
A solution: PostgreSQL protocol
> Creating a PostgreSQL protocol gateway
> Using PostgreSQL’s stable ODBC / JDBC driver
https://github.com/treasure-data/prestogres
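How does Prestogres work?
[Diagram: client, pgpool-II + patch, PostgreSQL, Presto coordinator]
1. Client sends the query to pgpool-II + patch:
SELECT COUNT(1) FROM tbl1
2. pgpool-II + patch rewrites it and sends it to PostgreSQL:
select run_presto_as_temp_table(
'presto_result', 'SELECT COUNT(1) FROM tbl1');
3. The “run_presto_as_temp_table” function runs the query on the Presto coordinator
4. The result is read back from PostgreSQL:
SELECT * FROM presto_result;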
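The rewrite in this flow can be illustrated with a few lines of string handling. This is a toy sketch of the idea only, not the code of the patched pgpool-II; the function name `rewrite_for_prestogres` is hypothetical, while `run_presto_as_temp_table` and `presto_result` come from the slide.

```python
from typing import List


def rewrite_for_prestogres(sql: str,
                           temp_table: str = "presto_result") -> List[str]:
    """Return the two statements that would be sent to PostgreSQL."""
    # Drop a trailing semicolon and double any single quotes so the query
    # can be embedded in a SQL string literal.
    quoted = sql.rstrip().rstrip(";").replace("'", "''")
    return [
        f"select run_presto_as_temp_table('{temp_table}', '{quoted}');",
        f"SELECT * FROM {temp_table};",
    ]


statements = rewrite_for_prestogres("SELECT COUNT(1) FROM tbl1")
for statement in statements:
    print(statement)
```

The client only ever sees PostgreSQL's wire protocol, which is why PostgreSQL's existing ODBC and JDBC drivers can be reused unchanged.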
Demo
2. Data visualization with Presto
> Data visualization tools need ODBC/JDBC driver
> but implementation takes a LONG time
> A solution is to use PostgreSQL protocol
> and use PostgreSQL’s ODBC/JDBC driver
> Prestogres is already confirmed to work with
some commercial BI tools
3. Query Execution
Presto’s execution model
> Presto is NOT MapReduce
> Presto’s query plan is based on DAG
> more like Apache Tez or traditional MPP
databases
How does a query run?
> Coordinator
> SQL Parser
> Query Planner
> Execution planner
> Workers
> Task execution scheduler
[Diagram: SQL → SQL Parser → AST → Logical Planner → Logical Query Plan → Optimizer → Distributed Planner → Distributed Query Plan → Execution Planner → Execution Plan]
✓ table schema: Metadata, provided by the Connector
✓ node list: NodeManager, provided by the Discovery Service
[Diagram: the same pipeline as simplified for today’s talk: SQL → SQL Parser → Query Planner → Query Plan → Execution Planner → Execution Plan]
Query Planner
SQL:
  SELECT
    name,
    count(*) AS c
  FROM impressions
  GROUP BY name
Table schema:
  impressions (
    name varchar
    time bigint
  )
Logical query plan:
  Table scan (name:varchar) → GROUP BY (name, count(*)) → Output (name, c)
Distributed query plan:
  Table scan → Partial aggregation → Sink → Exchange → Final aggregation → Sink → Exchange → Output
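The split of the GROUP BY into partial aggregation, exchange, and final aggregation can be sketched in plain Python. This is a single-process illustration of the technique, not Presto's operators; the function names and the sample `name` values are made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def partial_aggregation(names: Iterable[str]) -> Counter:
    """Runs next to the table scan: count names within one worker's input."""
    return Counter(names)


def exchange(partials: List[Counter],
             num_partitions: int) -> List[List[Tuple[str, int]]]:
    """Route each (name, count) pair to a partition chosen by hashing name."""
    partitions: List[List[Tuple[str, int]]] = [[] for _ in range(num_partitions)]
    for partial in partials:
        for name, count in partial.items():
            partitions[hash(name) % num_partitions].append((name, count))
    return partitions


def final_aggregation(partition: List[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the partial counts that arrived for each name."""
    totals: Counter = Counter()
    for name, count in partition:
        totals[name] += count
    return dict(totals)


# Two workers each scan part of the `name` column of `impressions`.
scans = [["a", "b", "a"], ["b", "c", "b"]]
partials = [partial_aggregation(scan) for scan in scans]

result: Dict[str, int] = {}
for partition in exchange(partials, num_partitions=2):
    result.update(final_aggregation(partition))
```

Hashing on `name` guarantees that all partial counts for one name meet in the same final aggregation, so the per-partition results can simply be combined into the output.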
Query Planner - Stages
Stage-0: Exchange → Output
Stage-1: Exchange → Final aggregation → Sink
Stage-2: Table scan → Partial aggregation → Sink
> inter-worker data transfer between stages
> pipelined aggregation within a stage
Execution Planner
+ Node list
✓ 2 workers
[Diagram: the distributed query plan expanded across Worker 1 and Worker 2]
Execution Planner - Tasks
[Diagram: the plan on Worker 1 and Worker 2; each stage on each worker is one task]
Task
1 task / worker / stage
✓ All tasks in parallel
Execution Planner - Split
[Diagram: the same tasks on Worker 1 and Worker 2, divided into splits]
Split
many splits / task = many threads / worker (table scan)
1 split / task = 1 thread / worker
1 split / worker = 1 thread / worker
All stages are pipelined
✓ No wait time
✓ No fault-tolerance
MapReduce vs. Presto
MapReduce: map → disk → reduce → disk → map → disk → reduce
Write data to disk
Wait between stages
Presto: task → task → task
memory-to-memory data transfer
✓ No disk IO
✓ Data chunk must fit in memory
3. Query Execution
> SQL is converted into stages, tasks and splits
> All tasks run in parallel
> No wait time between stages (pipelined)
> If one task fails, all tasks fail at once (query fails)
> Memory-to-memory data transfer
> No disk IO
> If aggregated data doesn’t fit in memory,
query fails
• Note: query dies but worker doesn’t die.
Memory consumption of all queries is fully managed
4. Monitoring & Configuration
Monitoring
> Web UI
> basic query status check
> JMX HTTP API
> GET /v1/jmx/mbean[/{objectName}]
• com.facebook.presto.execution:name=TaskManager
• com.facebook.presto.execution:name=QueryManager
• com.facebook.presto.execution:name=NodeScheduler
> Event notification (remote logging)
> POST http://remote.server/v2/event
• query start, query complete, split complete
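As a rough illustration of the JMX HTTP API above, the sketch below builds the URL for one of the listed MBeans and fetches it. It assumes the endpoint answers with JSON and that the object name can be placed in the path as shown on the slide; the host name `coordinator.example` and the helper names are placeholders.

```python
import json
import urllib.request
from urllib.parse import quote


def mbean_url(server: str, object_name: str = "") -> str:
    """Build the URL for GET /v1/jmx/mbean[/{objectName}]."""
    url = server.rstrip("/") + "/v1/jmx/mbean"
    if object_name:
        # Keep ':', '=' and ',' readable; they are part of JMX object names.
        url += "/" + quote(object_name, safe=":=,")
    return url


def fetch_mbean(server: str, object_name: str = "") -> object:
    """Fetch and decode the JSON document for one MBean (or the full list)."""
    with urllib.request.urlopen(mbean_url(server, object_name)) as resp:
        return json.loads(resp.read().decode("utf-8"))


url = mbean_url("http://coordinator.example:8080",
                "com.facebook.presto.execution:name=QueryManager")
print(url)
```

Polling `fetch_mbean` for `TaskManager`, `QueryManager`, and `NodeScheduler` from a monitoring job is one way to turn these MBeans into time series.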
Configuration
> Execution planning (for coordinator)
> query.initial-hash-partitions
• max number of hash buckets (=tasks) of a GROUP BY
(default: 8)
> node-scheduler.min-candidates
• max number of workers to run a stage in parallel
(default: 10)
> node-scheduler.include-coordinator
• whether to run tasks only on workers or also on the coordinator
> query.schedule-split-batch-size
• number of splits of a stage to start at once
Configuration
> Task execution (for workers)
> task.cpu-timer-enabled
• enable detailed statistics (causes some overhead)
(default: true)
> task.max-memory
• memory limit of a task especially for hash tables used by
GROUP BY and JOIN operations (default: 256MB)
• enlarge if you get a “Task exceeded max memory size” error
> task.shard.max-threads
• max number of threads of a worker to run active splits
(default: number of CPU cores * 4)
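To make the worker settings concrete, the sketch below renders them as `key=value` lines using the defaults quoted on the slide. It assumes they belong in a Java-style properties file (Presto reads `etc/config.properties`); the helper `render_properties` is hypothetical, and property names may differ in other Presto versions.

```python
import os
from typing import Dict


def render_properties(props: Dict[str, object]) -> str:
    """Render key=value lines in the style of a Java .properties file."""
    lines = []
    for key, value in props.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # properties use lowercase
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


cores = os.cpu_count() or 1
worker_props = {
    "task.cpu-timer-enabled": True,       # slide default: true
    "task.max-memory": "256MB",           # slide default: 256MB
    "task.shard.max-threads": cores * 4,  # slide default: CPU cores * 4
}
text = render_properties(worker_props)
print(text, end="")
```

Raising `task.max-memory` in the rendered file is the adjustment the slide suggests when a query fails with the max memory error.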
5. Roadmap
A report of Presto Meetup 2014
http://www.slideshare.net/dain1/presto-meetup-20140514-34731104
"Presto, Past, Present, and Future" by Dain Sundstrom at Facebook
Presto’s future
> Huge JOIN and GROUP BY
> Spill to disk
> Task recovery
> CREATE VIEW (※implemented)
> Native store (※implemented)
> Fast data store in Presto workers
> to cache hot data
> Authentication and permissions
Presto’s future
> DDL/DML statements
> CREATE TABLE with partitioning
> DELETE and INSERT
> Plugin repository
> CLI plugin manager
> JOIN and aggregation pushdown
> Custom optimizers
Links
> Web site & document
> http://prestodb.io
> Mailing list
> https://groups.google.com/group/presto-users
> Github
> https://github.com/facebook/presto
> Guidelines for contribution
> https://github.com/facebook/presto/blob/master/CONTRIBUTING.md
Check: www.treasuredata.com
Cloud service for the entire data pipeline,
including Presto. We’re hiring!

Weitere ähnliche Inhalte

Was ist angesagt?

Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouseAltinity Ltd
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOAltinity Ltd
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingHortonworks
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangDatabricks
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.Taras Matyashovsky
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderDatabricks
 
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...Altinity Ltd
 

Was ist angesagt? (20)

Presto
PrestoPresto
Presto
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouse
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks Autoloader
 
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
 
Internal Hive
Internal HiveInternal Hive
Internal Hive
 

Ähnlich wie Presto - Hadoop Conference Japan 2014

SQL for Everything at CWT2014
SQL for Everything at CWT2014SQL for Everything at CWT2014
SQL for Everything at CWT2014N Masahiro
 
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSSN Masahiro
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoSadayuki Furuhashi
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoopch adnan
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...viirya
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyAlluxio, Inc.
 
Big data-at-detik
Big data-at-detikBig data-at-detik
Big data-at-detikk4ndar
 
Boston Hadoop Meetup: Presto for the Enterprise
Boston Hadoop Meetup: Presto for the EnterpriseBoston Hadoop Meetup: Presto for the Enterprise
Boston Hadoop Meetup: Presto for the EnterpriseMatt Fuller
 
Fluentd - RubyKansai 65
Fluentd - RubyKansai 65Fluentd - RubyKansai 65
Fluentd - RubyKansai 65N Masahiro
 
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...MongoDB
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezJan Pieter Posthuma
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupBlake Irvine
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesSadayuki Furuhashi
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 

Ähnlich wie Presto - Hadoop Conference Japan 2014 (20)

SQL on Hadoop in Taiwan
SQL on Hadoop in TaiwanSQL on Hadoop in Taiwan
SQL on Hadoop in Taiwan
 
SQL for Everything at CWT2014
SQL for Everything at CWT2014SQL for Everything at CWT2014
SQL for Everything at CWT2014
 
Presto+MySQLで分散SQL
Presto+MySQLで分散SQLPresto+MySQLで分散SQL
Presto+MySQLで分散SQL
 
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSS
 
Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
 
Big data-at-detik
Big data-at-detikBig data-at-detik
Big data-at-detik
 
Boston Hadoop Meetup: Presto for the Enterprise
Boston Hadoop Meetup: Presto for the EnterpriseBoston Hadoop Meetup: Presto for the Enterprise
Boston Hadoop Meetup: Presto for the Enterprise
 
Fluentd - RubyKansai 65
Fluentd - RubyKansai 65Fluentd - RubyKansai 65
Fluentd - RubyKansai 65
 
Prashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEWPrashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEW
 
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 

Mehr von Sadayuki Furuhashi

Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Sadayuki Furuhashi
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Sadayuki Furuhashi
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupSadayuki Furuhashi
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?Sadayuki Furuhashi
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container EraSadayuki Furuhashi
 
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11Sadayuki Furuhashi
 
Fighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkSadayuki Furuhashi
 
Embulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダEmbulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダSadayuki Furuhashi
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsSadayuki Furuhashi
 
Embulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loaderEmbulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loaderSadayuki Furuhashi
 
Fluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect MoreFluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect MoreSadayuki Furuhashi
 
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualWhat's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualSadayuki Furuhashi
 
How we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataHow we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataSadayuki Furuhashi
 
How to collect Big Data into Hadoop
How to collect Big Data into HadoopHow to collect Big Data into Hadoop
How to collect Big Data into HadoopSadayuki Furuhashi
 

Mehr von Sadayuki Furuhashi (20)

Scripting Embulk Plugins
Scripting Embulk PluginsScripting Embulk Plugins
Scripting Embulk Plugins
 
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Presto - Hadoop Conference Japan 2014

  • 1. Presto: Interactive SQL Query Engine for Big Data. Sadayuki Furuhashi, Founder & Software Architect, Treasure Data, Inc. Hadoop Conference in Japan 2014
  • 2. A little about me... > Sadayuki Furuhashi > github/twitter: @frsyuki > Treasure Data, Inc. > Founder & Software Architect > Open-source hacker > MessagePack - efficient object serializer > Fluentd - data collection tool > ServerEngine - Ruby framework to build multiprocess servers > LS4 - distributed object storage system > kumofs - distributed key-value data store
  • 4. What’s Presto? A distributed SQL query engine for interactive data analysis against GBs to PBs of data.
  • 5. Presto’s history > Fall 2012: Project started at Facebook > Designed for interactive queries > with the speed of a commercial data warehouse > and scalability to the size of Facebook > Winter 2013: Open sourced! > 30+ contributors in 6 months > including people from outside of Facebook
  • 6. What are the problems to solve? > We couldn’t visualize data in HDFS directly using dashboards or BI tools > because Hive is too slow (not interactive) > or ODBC connectivity is unavailable/unstable > We needed to store daily-batch results in an interactive DB for quick response (PostgreSQL, Redshift, etc.) > Interactive DBs cost more and are far less scalable > Some data is not stored in HDFS > We need to copy the data into HDFS to analyze it
  • 10. [Diagram] Batch analysis platform: HDFS + Hive -> daily/hourly batch -> PostgreSQL, etc. Visualization platform: PostgreSQL, etc. -> interactive query -> dashboard and commercial BI tools.
  • 11. [Same diagram, annotated] Interactive DB (PostgreSQL, etc.): ✓ Less scalable ✓ Extra cost. Two separate platforms: ✓ More work to manage 2 platforms ✓ Can’t query against “live” data directly.
  • 12. [Diagram] Before: HDFS + Hive -> daily/hourly batch -> PostgreSQL, etc. -> interactive query -> dashboard. With Presto: HDFS + Hive (daily/hourly batch) -> Presto -> interactive query -> dashboard.
  • 14. [Diagram] Data analysis platform: Presto provides SQL on any data set (HDFS + Hive with daily/hourly batch, Cassandra, MySQL, commercial DBs) and serves interactive queries to dashboards and commercial BI tools (✓ IBM Cognos ✓ Tableau ✓ ...).
  • 15. Dashboard on Chartio: https://chartio.com/
  • 16. What can Presto do? > Query interactively (in milliseconds to minutes) > MapReduce and Hive are still necessary for ETL > Query using commercial BI tools or dashboards > Reliable ODBC/JDBC connectivity > Query across multiple data sources such as Hive, HBase, Cassandra, or even commercial DBs > Plugin mechanism > Integrate batch analysis + visualization into a single data analysis platform
  • 17. Presto’s deployment > Facebook > Multiple geographical regions > scaled to 1,000 nodes > actively used by 1,000+ employees > who run 30,000+ queries every day > processing 1PB/day > Netflix, Dropbox, Treasure Data, Airbnb, Qubole > Presto as a Service
  • 18. Today’s talk 1. Distributed architecture 2. Data visualization - Demo 3. Query Execution - Presto vs. MapReduce 4. Monitoring & Configuration 5. Roadmap - the future
  • 21. [Architecture diagram: Client, Coordinator, Connector Plugin, Worker x 3, Storage / Metadata, Discovery Service] 1. Discovery service finds servers in the cluster
  • 22. 2. Client sends a query using HTTP
  • 23. 3. Coordinator builds a query plan; the connector plugin provides metadata (table schema, etc.)
  • 24. 4. Coordinator sends tasks to workers
  • 25. 5. Workers read data through the connector plugin
  • 26. 6. Workers run tasks in memory
  • 27. 7. Client gets the result from a worker
  • 29. What are connectors? > Connectors are plugins to Presto > written in Java > Access to storage and metadata > provide table schema to coordinators > provide table rows to workers > Implementations: > Hive connector > Cassandra connector > MySQL through JDBC connector (prerelease) > Or your own connector
  • 32. [Diagram] Multiple connectors in a query: Client, Coordinator, and Workers use the Hive connector (HDFS / Metastore), the Cassandra connector (Cassandra), and other connectors (other data sources...); the discovery service finds servers in the cluster.
  • 33. 1. Distributed architecture > 3 types of servers: > Coordinator, worker, discovery service > Get data/metadata through connector plugins. > Presto is NOT a database > Presto provides SQL on top of existing data stores > Client protocol is HTTP + JSON > Language bindings: Ruby, Python, PHP, Java (JDBC), R, Node.js...
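As a rough illustration of the HTTP + JSON client protocol, the sketch below submits a statement and follows the paging link in each response until the query finishes. It assumes the `/v1/statement` endpoint, the `X-Presto-User` header, and the `nextUri`, `data`, and `error` response fields used by Presto's REST API; the coordinator address, the user name, and the injectable `send` helper are hypothetical, so check the details against your Presto version.

```python
import json
import urllib.request


def http_send(method, url, body=None, headers=None):
    """Send one HTTP request and decode the JSON response (stdlib only)."""
    data = body.encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method, headers=headers or {})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_query(sql, coordinator="http://localhost:8080", user="demo", send=http_send):
    """Yield result rows for `sql`, following nextUri until the query finishes."""
    headers = {"X-Presto-User": user}
    # Assumption: the statement is POSTed as plain text to /v1/statement.
    page = send("POST", coordinator + "/v1/statement", sql, headers)
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        # Each page may carry a batch of rows under "data".
        for row in page.get("data", []):
            yield row
        next_uri = page.get("nextUri")
        if next_uri is None:
            break  # no more pages: the query is done
        page = send("GET", next_uri, None, headers)
```

Passing a different `send` callable makes the paging loop easy to exercise without a running cluster.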
  • 34. [Architecture diagram with multiple coordinators] Coordinator HA
  • 36. The problems with using BI tools > BI tools need ODBC or JDBC connectivity > Tableau, IBM Cognos, QlikView, Chartio, ... > JasperSoft, Pentaho, MotionBoard, ... > ODBC/JDBC is VERY COMPLICATED > A mature implementation takes a LONG time
  • 37. A solution: PostgreSQL protocol > Creating a PostgreSQL protocol gateway > Using PostgreSQL’s stable ODBC / JDBC driver https://github.com/treasure-data/prestogres
  • 38. How does Prestogres work? [Diagram: client, pgpool-II + patch, PostgreSQL, Presto Coordinator] 1. Client sends SELECT COUNT(1) FROM tbl1 to pgpool-II + patch 2. pgpool-II sends select run_presto_as_temp_table('presto_result', 'SELECT COUNT(1) FROM tbl1'); to PostgreSQL 3. The “run_presto_as_temp_table” function runs the query on the Presto Coordinator 4. SELECT * FROM presto_result;
  • 39. Demo
  • 40. 2. Data visualization with Presto > Data visualization tools need an ODBC/JDBC driver > but implementation takes a LONG time > A solution is to use the PostgreSQL protocol > and use PostgreSQL’s ODBC/JDBC driver > Prestogres is already confirmed to work with some commercial BI tools
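The query rewriting in the Prestogres flow can be sketched as follows. This is a minimal illustration, not Prestogres's actual code: the function name `rewrite_for_prestogres` is hypothetical, and only `run_presto_as_temp_table` and the `presto_result` table name come from the slide.

```python
def rewrite_for_prestogres(sql, temp_table="presto_result"):
    """Return the two statements a Prestogres-style gateway would run on PostgreSQL."""
    # Drop a trailing semicolon and escape single quotes so the original
    # query can be embedded as a SQL string literal.
    quoted_sql = sql.rstrip().rstrip(";").replace("'", "''")
    run = "select run_presto_as_temp_table('{}', '{}');".format(temp_table, quoted_sql)
    fetch = "SELECT * FROM {};".format(temp_table)
    return [run, fetch]


for statement in rewrite_for_prestogres("SELECT COUNT(1) FROM tbl1"):
    print(statement)
```

The client only ever speaks the PostgreSQL protocol; the first statement makes PostgreSQL hand the query to Presto, and the second reads the materialized result back.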
  • 42. Presto’s execution model > Presto is NOT MapReduce > Presto’s query plan is based on DAG > more like Apache Tez or traditional MPP databases
  • 43. How does a query run? > Coordinator > SQL Parser > Query Planner > Execution planner > Workers > Task execution scheduler
  • 44. [Diagram] Components: SQL, SQL Parser, AST, Logical Planner, Logical Query Plan, Optimizer, Distributed Planner, Distributed Query Plan, Execution Planner, Execution Plan. Inputs: Metadata (✓ table schema) from the Connector; NodeManager (✓ node list) from the Discovery Server.
  • 45. [Same diagram, with the Query Planner portion highlighted as the focus of today’s talk]
  • 46. Query Planner > SQL: SELECT name, count(*) AS c FROM impressions GROUP BY name > Table schema: impressions (name varchar, time bigint) > Logical query plan: Table scan (name:varchar) -> GROUP BY (name, count(*)) -> Output (name, c) > Distributed query plan: Table scan -> Partial aggregation -> Sink -> Exchange -> Final aggregation -> Sink -> Exchange -> Output
  • 47. Query Planner - Stages > Stage-2: Table scan -> Partial aggregation -> Sink (pipelined aggregation) > inter-worker data transfer > Stage-1: Exchange -> Final aggregation -> Sink > inter-worker data transfer > Stage-0: Exchange -> Output
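The staged plan for the `GROUP BY name` query can be mimicked in a few lines. This is a single-process sketch under simplifying assumptions (in-memory lists stand in for workers, and Python's `hash` stands in for the exchange partitioning); all function names are hypothetical and this is not Presto's implementation.

```python
from collections import Counter, defaultdict


def partial_aggregation(rows):
    """Stage-2 task: scan local rows and pre-aggregate count(*) by name."""
    return Counter(name for name, _time in rows)


def exchange(partials, num_buckets):
    """Route each partial count to a bucket chosen by hashing the GROUP BY key."""
    buckets = defaultdict(list)
    for partial in partials:
        for name, count in partial.items():
            buckets[hash(name) % num_buckets].append((name, count))
    return buckets


def final_aggregation(pairs):
    """Stage-1 task: merge the partial counts for the keys in one bucket."""
    totals = Counter()
    for name, count in pairs:
        totals[name] += count
    return totals


def run(worker_rows, num_buckets=2):
    """Run all three stages over per-worker row lists and return {name: c}."""
    partials = [partial_aggregation(rows) for rows in worker_rows]
    buckets = exchange(partials, num_buckets)
    output = {}
    for pairs in buckets.values():  # Stage-0: collect the final results
        output.update(final_aggregation(pairs))
    return output
```

Because every row with the same `name` lands in the same bucket, each final aggregation task can finish its keys independently of the others.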
  • 48. Execution Planner + Node list (✓ 2 workers) > The distributed query plan is expanded per worker > Worker 1 and Worker 2 each run Table scan -> Partial aggregation -> Sink and Exchange -> Final aggregation -> Sink > followed by a single Exchange -> Output
  • 49. Execution Planner - Tasks > 1 task / worker / stage > ✓ All tasks run in parallel > Worker 1 and Worker 2 each run a Table scan -> Partial aggregation -> Sink task and an Exchange -> Final aggregation -> Sink task > plus one Exchange -> Output task
  • 50. Execution Planner - Split > Table scan tasks: many splits / task = many threads / worker > Final aggregation tasks: 1 split / task = 1 thread / worker > Output task: 1 split / worker = 1 thread / worker
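The "many splits / task = many threads / worker" idea can be sketched with a thread pool. This is an illustration only: `scan_split` and `run_table_scan_task` are hypothetical names, a split is modeled as a plain list of rows, and the default pool size copies the "CPU cores * 4" default quoted for `task.shard.max-threads` on the configuration slides.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def scan_split(split):
    """Process one split (here: a list of rows) and return a partial row count."""
    return len(split)


def run_table_scan_task(splits, max_threads=None):
    """Run all splits of one table-scan task on a shared pool of worker threads."""
    if max_threads is None:
        # Assumption: mirror the slide's default of CPU cores * 4.
        max_threads = (os.cpu_count() or 1) * 4
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Each split is handled by whichever thread is free next.
        return sum(pool.map(scan_split, splits))
```

A task with a single split, such as a final aggregation, would use just one of these threads.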
  • 51. MapReduce vs. Presto > MapReduce: map -> disk -> reduce -> disk -> map -> disk -> reduce > Write data to disk > Wait between stages > Presto: task -> task -> task > All stages are pipelined ✓ No wait time ✓ No fault-tolerance > Memory-to-memory data transfer ✓ No disk IO ✓ Data chunk must fit in memory
  • 52. 3. Query Execution > SQL is converted into stages, tasks and splits > All tasks run in parallel > No wait time between stages (pipelined) > If one task fails, all tasks fail at once (query fails) > Memory-to-memory data transfer > No disk IO > If aggregated data doesn’t fit in memory, query fails • Note: query dies but worker doesn’t die. Memory consumption of all queries is fully managed
  • 53. 4. Monitoring & Configuration
  • 54. Monitoring > Web UI > basic query status check > JMX HTTP API > GET /v1/jmx/mbean[/{objectName}] • com.facebook.presto.execution:name=TaskManager • com.facebook.presto.execution:name=QueryManager • com.facebook.presto.execution:name=NodeScheduler > Event notification (remote logging) > POST http://remote.server/v2/event • query start, query complete, split complete
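A small helper for the JMX HTTP API might look like the sketch below. The path and the MBean object names come from the slide; the base address, the function names, and the assumption that the response body is JSON are mine, so verify them against your deployment.

```python
import json
import urllib.parse
import urllib.request


def jmx_mbean_url(object_name, base="http://localhost:8080"):
    """Build the JMX HTTP API URL for one MBean object name."""
    # Percent-encode anything unusual but keep ':', '=' and ',' readable.
    return base + "/v1/jmx/mbean/" + urllib.parse.quote(object_name, safe=":=,")


def fetch_mbean(object_name, base="http://localhost:8080"):
    """GET one MBean and decode the body (assumed JSON; needs a running server)."""
    with urllib.request.urlopen(jmx_mbean_url(object_name, base)) as resp:
        return json.loads(resp.read().decode("utf-8"))


print(jmx_mbean_url("com.facebook.presto.execution:name=QueryManager"))
```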
  • 55. Configuration > Execution planning (for coordinator) > query.initial-hash-partitions • max number of hash buckets (=tasks) of a GROUP BY (default: 8) > node-scheduler.min-candidates • max number of workers to run a stage in parallel (default: 10) > node-scheduler.include-coordinator • whether to run tasks only on workers or also on the coordinator > query.schedule-split-batch-size • number of splits of a stage to start at once
  • 56. Configuration > Task execution (for workers) > task.cpu-timer-enabled • enable detailed statistics (causes some overhead) (default: true) > task.max-memory • memory limit of a task, especially for hash tables used by GROUP BY and JOIN operations (default: 256MB) • increase it if you get a “Task exceeded max memory size” error > task.shard.max-threads • max number of threads of a worker to run active splits (default: number of CPU cores * 4)
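To make the property names above concrete, the sketch below renders them as `key=value` lines. The property names and default values are the ones quoted on the slides; the helper `render_properties` is hypothetical, and placing the output in a server's `etc/config.properties` file is an assumption to check against the documentation for your Presto version.

```python
def render_properties(settings):
    """Render a dict as Java-style `key=value` lines, sorted by key."""
    return "\n".join("{}={}".format(key, settings[key]) for key in sorted(settings)) + "\n"


# Values below are the defaults quoted on the slides, not tuning advice.
coordinator_settings = {
    "query.initial-hash-partitions": 8,
    "node-scheduler.min-candidates": 10,
}
worker_settings = {
    "task.cpu-timer-enabled": "true",
    "task.max-memory": "256MB",
}

print(render_properties(coordinator_settings))
print(render_properties(worker_settings))
```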
  • 57. 5. Roadmap A report of Presto Meetup 2014 http://www.slideshare.net/dain1/presto-meetup-20140514-34731104 "Presto, Past, Present, and Future" by Dain Sundstrom at Facebook
  • 58. Presto’s future > Huge JOIN and GROUP BY > Spill to disk > Task recovery > CREATE VIEW (※implemented) > Native store (※implemented) > Fast data store in Presto workers > to cache hot data > Authentication and permissions
  • 59. Presto’s future > DDL/DML statements > CREATE TABLE with partitioning > DELETE and INSERT > Plugin repository > CLI plugin manager > JOIN and aggregation pushdown > Custom optimizers
  • 60. Links > Web site & documentation > http://prestodb.io > Mailing list > https://groups.google.com/group/presto-users > GitHub > https://github.com/facebook/presto > Guidelines for contribution > https://github.com/facebook/presto/blob/master/CONTRIBUTING.md
  • 61. Check: www.treasuredata.com Cloud service for the entire data pipeline, including Presto. We’re hiring!