© 2014 IBM Corporation
Challenges of Building a First
Class SQL-on-Hadoop Engine
Scott C. Gray sgray@us.ibm.com @ScottCGrayIBM
Adriana Zubiri zubiri@ca.ibm.com @adrianaZubiri
Agenda
► Why and what is Big SQL 3.0?
• Not a sales pitch, I promise!
► Overview of the challenges
► How we solved (some of) them
• Architecture and interaction with Hadoop
• Query rewrite
• Query optimization
► Future challenges
The Perfect Storm
► Increased business interest in SQL on Hadoop to
improve the pace and efficiency of adopting Hadoop
► SQL engines on Hadoop moving away from MR
towards MPP architectures
► SQL users expect the same level of language expressiveness,
features and (somewhat) performance as RDBMSs
► IBM has decades of experience and assets in building
SQL engines… Why not leverage them?
The Result? Big SQL 3.0
► MapReduce replaced with a modern
MPP shared-nothing architecture
► Architected from the ground up
for low latency and high throughput
► Same SQL expressiveness as a relational
RDBMS, which allows application portability
► Rich enterprise capabilities…
Big SQL 3.0 At a Glance
► Application Portability & Integration
• Data shared with Hadoop ecosystem
• Comprehensive file formats supported
• Superior enablement of IBM Software
► Performance
• Powerful SQL query rewriter
• Cost based optimizer
• Optimized for concurrent user throughput performance
• Result sets not constrained by existing memory
► Federation
• Distributed requests to multiple data sources within a single SQL statement
• Main data sources supported: DB2, Teradata, Oracle, Netezza
► Enterprise Capabilities
• Advanced Security / Auditing
• Resource and Workload Management
• Self Tuning Memory Management
• Comprehensive Monitoring
► Rich SQL
• Comprehensive SQL support
• IBM’s SQL PL compatibility
How did we do it?
► Big SQL is derived from an existing IBM shared-nothing RDBMS
• A very mature MPP architecture
• Already understands distributed joins and optimization
► Behavior is sufficiently different that
it is considered a separate product
• Certain SQL constructs are disabled
• Traditional data warehouse partitioning
is unavailable
• New SQL constructs introduced
► On the surface, porting a shared
nothing RDBMS to a shared nothing
cluster (Hadoop) seems easy, but …
[Figure: Traditional Distributed RDBMS Architecture, showing four database partitions]
Challenges for a traditional RDBMS on Hadoop
► Data placement
• Traditional databases expect to have full control over data placement
• Data placement plays an important role in performance (e.g. co-located
joins)
• Hadoop’s randomly scattered data plays against the grain of this
► Reading and writing Hadoop files
• Normally an RDBMS has its own storage format
• Format is highly optimized to minimize cost of moving data into memory
• Hadoop has a practically unbounded number of storage formats all with
different capabilities
Challenges for a traditional RDBMS on Hadoop
► Query optimization
• Statistics on Hadoop are a relatively new concept
• They are frequently not available
• The database optimizer can use statistics not traditionally available in Hive
• Hive-style partitioning (grouping data into different files/directories) is a new
concept
► Resource management
• A database server almost always runs in isolation
• In Hadoop the nodes must be shared with many other tasks
– Data nodes
– MR task tracker and tasks
– HBase region servers, etc.
• We needed to learn to play nice with others
Architecture Overview
[Figure: cluster architecture diagram]
► Management nodes host:
• Big SQL Master Node (with DDL FMP and UDF FMP)
• Big SQL Scheduler
• Database Service, Hive Metastore, Hive Server
► Each compute node hosts:
• Big SQL Worker Node (with Java I/O FMP, Native I/O FMP, UDF FMP and temp data)
• HDFS Data Node and its HDFS data
• MR Task Tracker
• Other services
*FMP = Fenced mode process
Big SQL Scheduler
► The Scheduler is the main RDBMS↔Hadoop service interface
• Interfaces with Hive metastore for table metadata
• Acts like the MapReduce job tracker for Big SQL
– Big SQL provides query predicates for
scheduler to perform partition elimination
– Determines splits for each “table” involved in the query
– Schedules splits on available Big SQL nodes
(favoring scheduling locally to the data)
– Serves work (splits) to I/O engines
– Coordinates “commits” after INSERTs
► Scheduler allows the database engine to
be largely unaware of the Hadoop world
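The scheduler duties listed above (partition elimination, then assigning splits with a preference for data-local nodes) can be sketched roughly as follows. This is a minimal Python illustration, not Big SQL's implementation: the `Split` type, the function names, the single partition column, and the least-loaded tie-breaking rule are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """One unit of scan work (hypothetical model of an HDFS block)."""
    path: str
    partition_value: str      # value of the Hive-style partition column
    replica_hosts: tuple      # nodes holding a local copy of the block


def eliminate_partitions(splits, wanted_values):
    """Partition elimination: drop splits whose partition cannot match."""
    return [s for s in splits if s.partition_value in wanted_values]


def assign_splits(splits, workers):
    """Assign each split to a worker, preferring one that holds the data.

    Ties are broken by current load (an assumed policy) so that no single
    worker is swamped. Splits with no local worker fall back to a remote read.
    """
    load = {w: 0 for w in workers}
    plan = defaultdict(list)
    for split in splits:
        local = [w for w in split.replica_hosts if w in load]
        candidates = local or list(workers)
        chosen = min(candidates, key=lambda w: load[w])
        plan[chosen].append(split)
        load[chosen] += 1
    return dict(plan)


if __name__ == "__main__":
    splits = [
        Split("/sales/dt=2014-01/f0", "2014-01", ("n1", "n2")),
        Split("/sales/dt=2014-01/f1", "2014-01", ("n2", "n3")),
        Split("/sales/dt=2014-02/f0", "2014-02", ("n1", "n3")),
        Split("/sales/dt=2014-02/f1", "2014-02", ("n9",)),  # no local worker
    ]
    survivors = eliminate_partitions(splits, {"2014-02"})
    for worker, work in assign_splits(survivors, ["n1", "n2", "n3"]).items():
        print(worker, [s.path for s in work])
```

In this toy model the query predicate is reduced to a set of wanted partition values; a real scheduler would evaluate arbitrary predicates against partition metadata from the Hive metastore.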
[Figure: architecture diagram showing the Big SQL Scheduler on a management node, next to the Big SQL Master Node, the Database Service and the Hive Metastore, with a Big SQL worker node below]
I/O Fenced Mode Processes
► Native I/O FMP
• The high-speed interface for a limited number of common file formats
► Java I/O FMP
• Handles all other formats via standard Hadoop/Hive APIs
► Both perform multi-threaded direct I/O on local data
► The database engine had to be taught storage format capabilities
• Projection list is pushed into I/O format
• Predicates are pushed as close to the data as
possible (into storage format, if possible)
• Predicates that cannot be pushed down are
evaluated within the database engine
► The database engine is only aware of which nodes
need to read
• Scheduler directs the readers to their portion of work
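The division of labour described above (projection and supported predicates go to the reader, the rest stays in the engine) can be illustrated with a small simulation. Everything named here is hypothetical: the `READER_CAPABILITIES` table, the `(column, operator, value)` predicate tuples, and the `plan_scan` and `scan` functions are assumptions for the sketch, not Big SQL interfaces.

```python
import operator

OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

# Hypothetical capability table: which comparison operators the reader
# for a given storage format can evaluate by itself.
READER_CAPABILITIES = {
    "parquet": {"=", "<", ">"},
    "textfile": set(),
}


def plan_scan(storage_format, predicates):
    """Split predicates into (pushed, residual) for a storage format."""
    supported = READER_CAPABILITIES.get(storage_format, set())
    pushed = [p for p in predicates if p[1] in supported]
    residual = [p for p in predicates if p[1] not in supported]
    return pushed, residual


def matches(row, predicates):
    """True if the row satisfies every (column, operator, value) predicate."""
    return all(OPS[op](row[col], value) for col, op, value in predicates)


def scan(rows, columns, storage_format, predicates):
    """Simulate reader plus engine for one table scan."""
    pushed, residual = plan_scan(storage_format, predicates)
    # The reader must still return columns that residual predicates need.
    needed = set(columns) | {col for col, _, _ in residual}
    # Reader side: filter early and return only the needed columns.
    from_reader = [
        {c: row[c] for c in needed} for row in rows if matches(row, pushed)
    ]
    # Engine side: residual predicates, then the final projection.
    return [
        {c: row[c] for c in columns}
        for row in from_reader
        if matches(row, residual)
    ]
```

Both paths return the same rows; what changes is how much data crosses the reader boundary, which is the quantity pushdown tries to minimize.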
[Figure: architecture diagram showing the Java I/O FMP and Native I/O FMP inside a Big SQL worker node on a compute node, alongside the HDFS Data Node, MR Task Tracker and other services]
Query Compilation
There is a lot involved in SQL compilation
► Parsing
• Catch syntax errors
• Generate internal representation of query
► Semantic checking
• Determine if query makes sense
• Incorporate view definitions
• Add logic for constraint checking
► Query optimization
• Modify query to improve performance (Query Rewrite)
• Choose the most efficient “access plan”
► Pushdown Analysis
• Federation “optimization”
► Threaded code generation
• Generate efficient “executable” code
Query Rewrite
► Why is query re-write important?
• There are many ways to express the same query
• Query generators often produce suboptimal queries and don’t permit “hand optimization”
• Complex queries often result in redundancy, especially with views
• For large data volumes, optimal access plans are more crucial because the penalty for poor
planning is greater
Original query:
select sum(l_extendedprice) / 7.0 as avg_yearly
from tpcd.lineitem, tpcd.part
where p_partkey = l_partkey
  and p_brand = 'Brand#23'
  and p_container = 'MED BOX'
  and l_quantity < (select 0.2 * avg(l_quantity)
                    from tpcd.lineitem
                    where l_partkey = p_partkey);
After query rewrite:
select sum(l_extendedprice) / 7.0 as avg_yearly
from (select l_quantity,
             avg(l_quantity) over (partition by l_partkey) as avgquantity,
             l_extendedprice
      from tpcd.lineitem, tpcd.part
      where p_partkey = l_partkey
        and p_brand = 'Brand#23'
        and p_container = 'MED BOX') as temp (l_quantity, avgquantity, l_extendedprice)
where l_quantity < 0.2 * avgquantity;
• Query correlation eliminated
• Line item table accessed only once
• Execution time cut in half!
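To see that the correlated form and the window-function form of this query return the same answer, both can be run against a tiny data set. The sketch below uses Python's built-in `sqlite3` module instead of Big SQL, so it only checks the equivalence, not the performance claim. The `tpcd` schema prefix is dropped, the derived table is named `t`, and the sample rows are invented for illustration. Window functions need SQLite 3.25 or later.

```python
import sqlite3

# Correlated-subquery form: lineitem is referenced twice.
ORIGINAL = """
select sum(l_extendedprice) / 7.0 as avg_yearly
from lineitem, part
where p_partkey = l_partkey
  and p_brand = 'Brand#23'
  and p_container = 'MED BOX'
  and l_quantity < (select 0.2 * avg(l_quantity)
                    from lineitem
                    where l_partkey = p_partkey)
"""

# Decorrelated form: one pass over lineitem with a window aggregate.
REWRITTEN = """
select sum(l_extendedprice) / 7.0 as avg_yearly
from (select l_quantity,
             avg(l_quantity) over (partition by l_partkey) as avgquantity,
             l_extendedprice
      from lineitem, part
      where p_partkey = l_partkey
        and p_brand = 'Brand#23'
        and p_container = 'MED BOX') as t
where l_quantity < 0.2 * avgquantity
"""


def run_both():
    """Run both query forms on made-up data and return their results."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table part (p_partkey int primary key,
                           p_brand text, p_container text);
        create table lineitem (l_partkey int, l_quantity real,
                               l_extendedprice real);
        insert into part values (1, 'Brand#23', 'MED BOX'),
                                (2, 'Brand#11', 'MED BOX');
        insert into lineitem values
            (1, 1.0, 70.0), (1, 2.0, 140.0), (1, 47.0, 700.0),
            (2, 1.0, 35.0), (2, 50.0, 350.0);
    """)
    original = conn.execute(ORIGINAL).fetchone()[0]
    rewritten = conn.execute(REWRITTEN).fetchone()[0]
    conn.close()
    return original, rewritten


if __name__ == "__main__":
    print(run_both())
```

The rewrite is only safe because `p_partkey` is unique in `part`: after the join and the brand/container filters, every line item of a qualifying part is still present exactly once, so the windowed average equals the per-part average that the correlated subquery computes.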
Query Rewrite
► Most existing query rewrite rules remain unchanged
• 140+ existing query re-writes are leveraged
• Almost none are impacted by “the Hadoop world”
► A few modifications were required, however…
Query Rewrite and Indexes
► Column nullability and indexes can help drive query optimization
• Can produce more efficiently decorrelated subqueries and joins
• Used to prove uniqueness of joined rows (“early-out” join)
► Very few Hadoop data sources support
the concept of an index
► In the Hive metastore all columns
are implicitly nullable
► Big SQL introduces advisory
constraints and nullability indicators
• User can specify whether or not
constraints can be “trusted” for
query rewrites
create hadoop table users
(
id int not null primary key,
office_id int null,
fname varchar(30) not null,
lname varchar(30) not null,
salary timestamp(3) null,
constraint fk_ofc foreign key (office_id)
references office (office_id)
)
row format delimited
fields terminated by '|'
stored as textfile;
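[Figure callouts on the statement above: “Nullability Indicators” (the null / not null column attributes) and “Constraints” (the primary key and foreign key clauses)]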
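Why a declared uniqueness constraint helps can be shown with a toy “early-out” join. The following Python sketch is an illustration under assumptions, not the engine's join code: `join` and its `inner_key_unique` flag are hypothetical, and the nested loop stands in for whatever join method the optimizer actually picks.

```python
def join(outer_rows, inner_rows, key, inner_key_unique=False):
    """Nested-loop equi-join that can stop probing early.

    If the inner key is declared unique (for example through a trusted
    advisory primary key), at most one inner row can match each outer row,
    so the probe stops at the first hit.
    Returns (joined_rows, number_of_inner_rows_examined).
    """
    joined, examined = [], 0
    for outer in outer_rows:
        for inner in inner_rows:
            examined += 1
            if outer[key] == inner[key]:
                joined.append({**inner, **outer})
                if inner_key_unique:
                    break  # early out: no further match is possible
    return joined, examined


if __name__ == "__main__":
    offices = [
        {"office_id": 1, "city": "Toronto"},
        {"office_id": 2, "city": "San Jose"},
        {"office_id": 3, "city": "Austin"},
    ]
    users = [
        {"id": 10, "office_id": 1},
        {"id": 11, "office_id": 2},
    ]
    print(join(users, offices, "office_id"))
    print(join(users, offices, "office_id", inner_key_unique=True))
```

Because the constraint is advisory and not enforced, the shortcut is only correct when the data really is unique. If it is not, the early-out version silently drops matches, which is why the user has to state whether a constraint can be trusted.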
Query Pushdown
► Pushdown moves processing down as
close to the data as possible
• Projection pushdown – retrieve only
necessary columns
• Selection pushdown – push search criteria
► Big SQL understands the capabilities of
readers and storage formats involved
• As much as possible is pushed down
• Residual processing done in the server
• Optimizer costs queries based upon how
much can be pushed down
3) External Sarg Predicate,
Comparison Operator: Equal (=)
Subquery Input Required: No
Filter Factor: 0.04
Predicate Text:
--------------
(Q1.P_BRAND = 'Brand#23')
4) External Sarg Predicate,
Comparison Operator: Equal (=)
Subquery Input Required: No
Filter Factor: 0.025
Predicate Text:
--------------
(Q1.P_CONTAINER = 'MED BOX')
select sum(l_extendedprice) / 7.0 as avg_yearly
from (select l_quantity,
             avg(l_quantity) over (partition by l_partkey) as avgquantity,
             l_extendedprice
      from tpcd.lineitem, tpcd.part
      where p_partkey = l_partkey
        and p_brand = 'Brand#23'
        and p_container = 'MED BOX') as temp (l_quantity, avgquantity, l_extendedprice)
where l_quantity < 0.2 * avgquantity;
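The filter factors in the explain output above (0.04 and 0.025) are the optimizer's estimates of the fraction of rows each pushed-down predicate keeps. A common textbook way to combine them is to multiply, which assumes the two columns are independent. The short Python sketch below applies that rule. The independence assumption and the table cardinality of 200,000 rows are assumptions for illustration, not values from the slides.

```python
def combined_filter_factor(filter_factors):
    """Multiply per-predicate filter factors, assuming independence."""
    result = 1.0
    for factor in filter_factors:
        result *= factor
    return result


def estimated_rows(table_cardinality, filter_factors):
    """Estimated rows surviving all predicates."""
    return table_cardinality * combined_filter_factor(filter_factors)


if __name__ == "__main__":
    factors = [0.04, 0.025]           # from the explain output
    hypothetical_part_rows = 200_000  # assumed cardinality, not from the slides
    print(combined_filter_factor(factors))
    print(estimated_rows(hypothetical_part_rows, factors))
```

With these inputs the combined factor is about 0.001, so roughly 1 row in 1,000 would leave the reader. When the columns are correlated the product underestimates or overestimates, and that is the case column group statistics are meant to address.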
Statistics
► Big SQL utilizes Hive statistics
collection with some extensions:
• Additional support for column groups,
histograms and frequent values
• Automatic determination of partitions that
require statistics collection vs. explicit
• Partitioned tables: added table-level
versions of NDV, Min, Max, Null count,
Average column length
• Hive catalogs as well as database engine
catalogs are also populated
• We are restructuring the relevant code for
submission back to Hive
► Capability for statistic fabrication
if no stats available at compile time
Table statistics
• Cardinality (count)
• Number of Files
• Total File Size
Column statistics
• Minimum value (all types)
• Maximum value (all types)
• Cardinality (non-nulls)
• Distribution (Number of Distinct Values NDV)
• Number of null values
• Average Length of the column value (all types)
• Histogram - Number of buckets configurable
• Frequent Values (MFV) – Number configurable
Column group statistics
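As a concrete picture of the column statistics listed above, the following Python sketch computes them exactly for a small in-memory column. It is an illustration only: `column_statistics` is a hypothetical helper, `None` stands in for SQL NULL, lengths are measured on the string form of each value, and a real collector working on large tables may well sample or approximate and would also build histograms, which are omitted here.

```python
from collections import Counter


def column_statistics(values, n_frequent=2):
    """Compute basic per-column statistics; None stands for SQL NULL."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    if non_null:
        avg_length = sum(len(str(v)) for v in non_null) / len(non_null)
    else:
        avg_length = None
    return {
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "cardinality": len(non_null),        # number of non-null values
        "ndv": len(counts),                  # number of distinct values
        "nulls": len(values) - len(non_null),
        "avg_length": avg_length,
        "frequent_values": counts.most_common(n_frequent),
    }


if __name__ == "__main__":
    containers = ["MED BOX", "SM BOX", "MED BOX", None, "LG CASE"]
    for name, value in column_statistics(containers, n_frequent=1).items():
        print(name, value)
```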
Costing Model
► Few extensions required to the Cost Model
► TBSCAN operator cost model extended
to evaluate cost of reading from Hadoop
► New elements taken into account:
# of files, size of files, # of partitions, # of nodes
► Optimizer now knows in which subset of
nodes the data resides
→ Better costing!
                     |
                2.66667e-08
                  HSJOIN
                  (   7)
                1.1218e+06
                   8351
          /---------+---------\
    5.30119e+08            3.75e+07
        BTQ                 NLJOIN
      (   8)                (  11)
      948130                146345
       7291                  1060
         |               /----+----\
    5.76923e+08         1        3.75e+07
        LTQ           GRPBY       FILTER
      (   9)          (  12)      (  20)
      855793          114241      126068
       7291            1060        1060
         |              |           |
    5.76923e+08        13        7.5e+07
      TBSCAN         TBSCAN        BTQ
      (  10)         (  13)      (  21)
      802209         114241      117135
       7291           1060        1060
         |              |           |
      7.5e+09          13       5.76923e+06
TABLE: TPCH5TB_PARQ   TEMP         LTQ
      ORDERS         (  14)      (  22)
        Q1           114241      108879
                      1060        1060
                        |           |
                       13       5.76923e+06
                       DTQ        TBSCAN
                     (  15)       (  23)
                     114241       108325
                      1060         1060
                        |           |
                        1         7.5e+08
                      GRPBY   TABLE: TPCH5TB_PARQ
                     (  16)       CUSTOMER
                     114241          Q5
                      1060
                        |
                        1
                       LTQ
                     (  17)
                     114241
                      1060
                        |
                        1
                      GRPBY
                     (  18)
                     114241
                      1060
                        |
                   5.24479e+06
                     TBSCAN
                     (  19)
                     113931
                      1060
                        |
                     7.5e+08
               TABLE: TPCH5TB_PARQ
                    CUSTOMER
                       Q2
New Access Plans
► Data is not hash partitioned on any particular column
(aka “scattered partitioned”)
► We can access a Hadoop table as:
• “Scattered” partitioned: each node accesses only its local data
• Replicated: each node accesses local and remote data
– Optimizer could also use a broadcast table queue
– HDFS shared file system provides replication
► New parallel join strategy introduced
Parallel Join Strategies
Replicated vs. Broadcast join
All tables are “scatter” partitioned
Join predicate:
STORE.STOREKEY = DAILY_SALES.STOREKEY
Replicate smaller table to partitions
of the larger table using:
• Broadcast table queue
• Replicated HDFS scan
Table Queue represents
communication between
nodes or subagents
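[Figure: JOIN of Store and Daily Sales. Daily Sales is scanned locally; Store reaches the join either through a SCAN feeding a broadcast table queue or through a replicated SCAN]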
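The broadcast variant of this strategy can be sketched in a few lines of Python. This is a simulation under assumptions, not engine code: each table is modelled as a list with one list of rows per node, `broadcast_join` is a hypothetical name, and a hash lookup stands in for the local join method.

```python
def broadcast_join(small_parts, large_parts, key):
    """Join two scatter-partitioned tables by broadcasting the small one.

    `small_parts` and `large_parts` hold one list of rows per node.
    Every node receives a full copy of the small table and joins it with
    its local slice of the large table, so the large table never moves.
    Returns one list of joined rows per node.
    """
    # Broadcast step: the union of all partitions of the small table.
    lookup = {}
    for part in small_parts:
        for row in part:
            lookup.setdefault(row[key], []).append(row)

    result_parts = []
    for local_rows in large_parts:  # runs independently on each node
        joined = []
        for row in local_rows:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
        result_parts.append(joined)
    return result_parts


if __name__ == "__main__":
    store = [[{"storekey": 1, "city": "Toronto"}],
             [{"storekey": 2, "city": "San Jose"}]]
    daily_sales = [[{"storekey": 2, "amount": 5}, {"storekey": 1, "amount": 7}],
                   [{"storekey": 2, "amount": 9}]]
    for node, rows in enumerate(broadcast_join(store, daily_sales, "storekey")):
        print(node, rows)
```

The replicated HDFS scan reaches the same end state by a different route: each node reads the whole small table from HDFS itself, so no table queue is needed.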
Parallel Join Strategies
Repartitioned join
All tables are “scatter” partitioned
Join predicate:
DAILY_FORECAST.STOREKEY = DAILY_SALES.STOREKEY
• Both tables large
• Too expensive to broadcast or
replicate either
• Repartition both tables on the
join columns
• Use directed table queue (DTQ)
[Figure: JOIN of Daily Forecast and Daily Sales. Each table is scanned and then sent through its own directed table queue before the join]
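A repartitioned join can be simulated the same way. In this hedged Python sketch each row is routed to the node chosen by hashing its join key, which plays the role of the directed table queue. The function names, the use of Python's built-in `hash`, and the assumption that both tables are spread over the same number of nodes are choices made for the example.

```python
def repartition(parts, key, n_nodes):
    """Route every row to node hash(key) % n_nodes (a directed table queue)."""
    out = [[] for _ in range(n_nodes)]
    for part in parts:
        for row in part:
            out[hash(row[key]) % n_nodes].append(row)
    return out


def repartitioned_join(left_parts, right_parts, key):
    """Repartition both inputs on the join key, then join node by node.

    Assumes both tables are spread over the same number of nodes.
    Returns one list of joined rows per node.
    """
    n_nodes = len(left_parts)
    left = repartition(left_parts, key, n_nodes)
    right = repartition(right_parts, key, n_nodes)
    result = []
    for left_rows, right_rows in zip(left, right):  # now a purely local join
        lookup = {}
        for row in right_rows:
            lookup.setdefault(row[key], []).append(row)
        result.append([
            {**match, **row}
            for row in left_rows
            for match in lookup.get(row[key], [])
        ])
    return result


if __name__ == "__main__":
    daily_forecast = [[{"storekey": 1, "forecast": 10}],
                      [{"storekey": 2, "forecast": 20}]]
    daily_sales = [[{"storekey": 2, "sales": 5}],
                   [{"storekey": 1, "sales": 7}]]
    for node, rows in enumerate(
            repartitioned_join(daily_forecast, daily_sales, "storekey")):
        print(node, rows)
```

After repartitioning, rows with equal join keys sit on the same node, so each node can join its own slices without further communication. The cost is that both tables cross the network once, compared with only the smaller table in the broadcast case.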
Future Challenges
► The challenges never end!
• That’s what makes this job fun!
• The Hadoop ecosystem continues to expand
• New storage techniques, indexing techniques, etc.
► Here are a few areas we’re exploring….
Future Challenges
► Dynamic split allocation
• React to competing workloads
• If one node is slow, hand the work it would have received to another node
► More pushdown!
• Currently we push projection/selection down
• Should we push more advanced operations? Aggregation? Joins?
► Join co-location
• Perform co-located joins when tables are partitioned on the same join key
► Explicit MapReduce style parallelism (“SQL MR”)
• Expand SQL to explicitly perform partitioned operations
Queries?
(Optimized, of course)
Try Big SQL 3.0 Beta on the cloud!
https://bigsql.imdemocloud.com/

CDC to the Max!CDC to the Max!
CDC to the Max!
 

Challenges of Implementing an Advanced SQL Engine on Hadoop

  • 1. © 2014 IBM Corporation Challenges of Building a First Class SQL-on-Hadoop Engine Scott C. Gray sgray@us.ibm.com @ScottCGrayIBM Adriana Zubiri zubiri@ca.ibm.com @adrianaZubiri
  • 2. Agenda
    ► Why and what is Big SQL 3.0?
      • Not a sales pitch, I promise!
    ► Overview of the challenges
    ► How we solved (some of) them
      • Architecture and interaction with Hadoop
      • Query rewrite
      • Query optimization
    ► Future challenges
  • 3. The Perfect Storm
    ► Increased business interest in SQL on Hadoop to improve the pace and efficiency of adopting Hadoop
    ► SQL engines on Hadoop moving away from MR towards MPP architectures
    ► SQL users expect the same level of language expressiveness, features and (somewhat) performance as RDBMSs
    ► IBM has decades of experience and assets in building SQL engines… Why not leverage them?
  • 4. The Result? Big SQL 3.0
    ► MapReduce replaced with a modern MPP shared-nothing architecture
    ► Architected from the ground up for low latency and high throughput
    ► Same SQL expressiveness as RDBMSs, which allows application portability
    ► Rich enterprise capabilities…
  • 5. Big SQL 3.0 At a Glance
    ► Application Portability & Integration
      • Data shared with Hadoop ecosystem
      • Comprehensive file formats supported
      • Superior enablement of IBM Software
    ► Performance
      • Powerful SQL query rewriter
      • Cost based optimizer
      • Optimized for concurrent user throughput performance
      • Result sets not constrained by existing memory
    ► Federation
      • Distributed requests to multiple data sources within a single SQL statement
      • Main data sources supported: DB2, Teradata, Oracle, Netezza
    ► Enterprise Capabilities
      • Advanced Security / Auditing
      • Resource and Workload Management
      • Self Tuning Memory Management
      • Comprehensive Monitoring
    ► Rich SQL
      • Comprehensive SQL support
      • IBM’s SQL PL compatibility
  • 6. How did we do it?
    ► Big SQL is derived from an existing IBM shared-nothing RDBMS
      • A very mature MPP architecture
      • Already understands distributed joins and optimization
    ► Behavior is sufficiently different that it is considered a separate product
      • Certain SQL constructs are disabled
      • Traditional data warehouse partitioning is unavailable
      • New SQL constructs introduced
    ► On the surface, porting a shared-nothing RDBMS to a shared-nothing cluster (Hadoop) seems easy, but …
    [Figure: Traditional Distributed RDBMS Architecture, showing four database partitions]
  • 7. Challenges for a traditional RDBMS on Hadoop
    ► Data placement
      • Traditional databases expect to have full control over data placement
      • Data placement plays an important role in performance (e.g. co-located joins)
      • Hadoop’s randomly scattered data goes against the grain of this
    ► Reading and writing Hadoop files
      • Normally an RDBMS has its own storage format
      • Format is highly optimized to minimize the cost of moving data into memory
      • Hadoop has a practically unbounded number of storage formats, all with different capabilities
  • 8. Challenges for a traditional RDBMS on Hadoop
    ► Query optimization
      • Statistics on Hadoop are a relatively new concept
      • They are frequently not available
      • The database optimizer can use statistics not traditionally available in Hive
      • Hive-style partitioning (grouping data into different files/directories) is a new concept
    ► Resource management
      • A database server almost always runs in isolation
      • In Hadoop the nodes must be shared with many other tasks
        – Data nodes
        – MR task tracker and tasks
        – HBase region servers, etc.
      • We needed to learn to play nice with others
  • 9. Architecture Overview
    [Figure: architecture diagram. Management nodes host the Big SQL master node, the Big SQL scheduler, DDL and UDF FMPs, the database service, the Hive metastore and the Hive server. Each of the three compute nodes hosts a Big SQL worker node with Java I/O, native I/O and UDF FMPs, alongside the HDFS data node, the MR task tracker and other services, plus HDFS data and temp data.]
    *FMP = Fenced mode process
  • 10. Big SQL Scheduler
    ► The Scheduler is the main RDBMS↔Hadoop service interface
      • Interfaces with Hive metastore for table metadata
      • Acts like the MapReduce job tracker for Big SQL
        – Big SQL provides query predicates for scheduler to perform partition elimination
        – Determines splits for each “table” involved in the query
        – Schedules splits on available Big SQL nodes (favoring scheduling locally to the data)
        – Serves work (splits) to I/O engines
        – Coordinates “commits” after INSERTs
    ► Scheduler allows the database engine to be largely unaware of the Hadoop world
    [Figure: the scheduler on the management node, next to the Big SQL master node, database service and Hive metastore, serving a Big SQL worker node with its Java I/O and native I/O FMPs]
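
The scheduler steps above (partition elimination, then assigning splits with a preference for nodes that hold the data) can be sketched roughly as follows. This is only an illustration: the function names, the dictionary layout and the least-loaded tie-break are assumptions made for the sketch, not the actual Big SQL scheduler.

```python
from collections import defaultdict


def eliminate_partitions(partitions, predicate):
    """Keep only partitions whose partition-key values can satisfy the predicate."""
    return [p for p in partitions if predicate(p["key"])]


def assign_splits(splits, workers):
    """Assign each split to a worker, preferring workers that hold it locally.

    splits  -- list of (split_id, set of hosts holding a local replica)
    workers -- list of available worker host names
    Falls back to any worker when no local worker is available (hypothetical policy).
    """
    load = defaultdict(int)
    assignment = {}
    for split_id, local_hosts in splits:
        candidates = [w for w in workers if w in local_hosts] or workers
        # Least-loaded candidate first; the host name breaks ties deterministically.
        chosen = min(candidates, key=lambda w: (load[w], w))
        assignment[split_id] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical Hive-style partitions, each with the splits that make it up.
partitions = [
    {"key": {"year": 2013}, "splits": [("s1", {"node1", "node2"})]},
    {"key": {"year": 2014}, "splits": [("s2", {"node2", "node3"}), ("s3", {"node3"})]},
]

surviving = eliminate_partitions(partitions, lambda key: key["year"] == 2014)
splits = [s for p in surviving for s in p["splits"]]
plan = assign_splits(splits, ["node1", "node2", "node3"])
```

Here the predicate `year = 2014` removes the 2013 partition before any split is created for it, and both remaining splits land on nodes that already store their data.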
  • 11. I/O Fenced Mode Processes
    ► Native I/O FMP
      • The high-speed interface for a limited number of common file formats
    ► Java I/O FMP
      • Handles all other formats via standard Hadoop/Hive APIs
    ► Both perform multi-threaded direct I/O on local data
    ► The database engine had to be taught storage format capabilities
      • Projection list is pushed into I/O format
      • Predicates are pushed as close to the data as possible (into storage format, if possible)
      • Predicates that cannot be pushed down are evaluated within the database engine
    ► The database engine is only aware of which nodes need to read
      • Scheduler directs the readers to their portion of work
    [Figure: a compute node with a Big SQL worker node and its Java I/O, native I/O and UDF FMPs, alongside the HDFS data node, MR task tracker, other services, HDFS data and temp data]
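
The division of labor described above, with pushed-down predicates evaluated by the reader and the rest evaluated by the engine, can be sketched as below. The predicate tuples, the `reader_ops` capability set and the function names are hypothetical; the slides do not describe the real interface between the engine and the I/O processes.

```python
import operator

# Comparison operators used by this sketch.
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def matches(row, predicates):
    """True when the row satisfies every (column, op, value) predicate."""
    return all(OPS[op](row[col], val) for col, op, val in predicates)


def split_predicates(predicates, reader_ops):
    """Separate predicates the reader can evaluate from those left to the engine."""
    pushed = [p for p in predicates if p[1] in reader_ops]
    residual = [p for p in predicates if p[1] not in reader_ops]
    return pushed, residual


def scan(rows, projection, predicates, reader_ops):
    pushed, residual = split_predicates(predicates, reader_ops)
    # Reader side: return only the projected columns plus any column the
    # engine still needs in order to evaluate residual predicates.
    needed = list(dict.fromkeys(projection + [col for col, _, _ in residual]))
    from_reader = [{c: r[c] for c in needed} for r in rows if matches(r, pushed)]
    # Engine side: apply the residual predicates, then the final projection.
    return [{c: r[c] for c in projection} for r in from_reader if matches(r, residual)]


rows = [
    {"p_partkey": 1, "p_brand": "Brand#23", "p_size": 5},
    {"p_partkey": 2, "p_brand": "Brand#23", "p_size": 40},
    {"p_partkey": 3, "p_brand": "Brand#12", "p_size": 5},
]
predicates = [("p_brand", "=", "Brand#23"), ("p_size", "<", 10)]

# Assume a storage format whose reader can only evaluate equality predicates.
result = scan(rows, ["p_partkey"], predicates, reader_ops={"="})
```

With an equality-only reader, the `p_brand` predicate is applied while reading and the `p_size` predicate stays in the engine; a more capable format would simply be given a larger `reader_ops` set.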
  • 12. Query Compilation
    There is a lot involved in SQL compilation
    ► Parsing
      • Catch syntax errors
      • Generate internal representation of query
    ► Semantic checking
      • Determine if query makes sense
      • Incorporate view definitions
      • Add logic for constraint checking
    ► Query optimization
      • Modify query to improve performance (Query Rewrite)
      • Choose the most efficient “access plan”
    ► Pushdown Analysis
      • Federation “optimization”
    ► Threaded code generation
      • Generate efficient “executable” code
    [Figure: management node with the Big SQL master node, Big SQL scheduler, DDL FMP and UDF FMP]
  • 13. Query Rewrite
    ► Why is query rewrite important?
      • There are many ways to express the same query
      • Query generators often produce suboptimal queries and don’t permit “hand optimization”
      • Complex queries often result in redundancy, especially with views
      • For large data volumes, optimal access plans are more crucial because the penalty for poor planning is greater
    Original query:
      select sum(l_extendedprice) / 7.0 avg_yearly
      from tpcd.lineitem, tpcd.part
      where p_partkey = l_partkey
        and p_brand = 'Brand#23'
        and p_container = 'MED BOX'
        and l_quantity < (select 0.2 * avg(l_quantity)
                          from tpcd.lineitem
                          where l_partkey = p_partkey);
    Rewritten query:
      select sum(l_extendedprice) / 7.0 as avg_yearly
      from temp (l_quantity, avgquantity, l_extendedprice) as
        (select l_quantity,
                avg(l_quantity) over (partition by l_partkey) as avgquantity,
                l_extendedprice
         from tpcd.lineitem, tpcd.part
         where p_partkey = l_partkey
           and p_brand = 'Brand#23'
           and p_container = 'MED BOX')
      where l_quantity < 0.2 * avgquantity
      • Query correlation eliminated
      • Line item table accessed only once
      • Execution time cut in half!
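
To check that the two forms of the query above return the same answer, the sketch below runs both against a tiny in-memory SQLite database. Several things here are assumptions rather than anything from the slides: SQLite stands in for Big SQL, the rewritten form is written with a standard WITH clause, the schema prefix is dropped, the sample rows are invented, and window functions require SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table part (p_partkey integer primary key, p_brand text, p_container text);
    create table lineitem (l_partkey integer, l_quantity real, l_extendedprice real);
    insert into part values
        (1, 'Brand#23', 'MED BOX'),
        (2, 'Brand#23', 'LG BOX'),
        (3, 'Brand#12', 'MED BOX');
    insert into lineitem values
        (1, 1, 700), (1, 10, 100), (1, 19, 100),
        (2, 1, 500), (2, 50, 500),
        (3, 1, 900), (3, 50, 900);
""")

# Original form: correlated subquery, so lineitem is read twice.
correlated = conn.execute("""
    select sum(l_extendedprice) / 7.0 as avg_yearly
    from lineitem, part
    where p_partkey = l_partkey
      and p_brand = 'Brand#23'
      and p_container = 'MED BOX'
      and l_quantity < (select 0.2 * avg(l_quantity)
                        from lineitem
                        where l_partkey = p_partkey)
""").fetchone()[0]

# Rewritten form: the per-part average becomes a window function,
# so lineitem is read only once and the correlation disappears.
windowed = conn.execute("""
    with temp (l_quantity, avgquantity, l_extendedprice) as (
        select l_quantity,
               avg(l_quantity) over (partition by l_partkey),
               l_extendedprice
        from lineitem, part
        where p_partkey = l_partkey
          and p_brand = 'Brand#23'
          and p_container = 'MED BOX')
    select sum(l_extendedprice) / 7.0
    from temp
    where l_quantity < 0.2 * avgquantity
""").fetchone()[0]
```

The rewrite is only safe because `p_partkey` is unique: each window partition then holds exactly the line items that the correlated subquery would have averaged for that part.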
  • 14. Query Rewrite
    ► Most existing query rewrite rules remain unchanged
      • 140+ existing query rewrites are leveraged
      • Almost none are impacted by “the Hadoop world”
    ► There were, however, a few modifications that were required…
  • 15. Query Rewrite and Indexes
    ► Column nullability and indexes can help drive query optimization
      • Can produce more efficiently decorrelated subqueries and joins
      • Used to prove uniqueness of joined rows (“early-out” join)
    ► Very few Hadoop data sources support the concept of an index
    ► In the Hive metastore all columns are implicitly nullable
    ► Big SQL introduces advisory constraints and nullability indicators
      • User can specify whether or not constraints can be “trusted” for query rewrites
    Example, with nullability indicators and constraints called out:
      create hadoop table users (
        id int not null primary key,
        office_id int null,
        fname varchar(30) not null,
        lname varchar(30) not null,
        salary timestamp(3) null,
        constraint fk_ofc foreign key (office_id) references office (office_id)
      )
      row format delimited fields terminated by '|'
      stored as textfile;
  • 16. Query Pushdown
    ► Pushdown moves processing down as close to the data as possible
      • Projection pushdown – retrieve only necessary columns
      • Selection pushdown – push search criteria
    ► Big SQL understands the capabilities of readers and storage formats involved
      • As much as possible is pushed down
      • Residual processing done in the server
      • Optimizer costs queries based upon how much can be pushed down
    Explain output for the pushed-down predicates:
      3) External Sarg Predicate,
         Comparison Operator: Equal (=)
         Subquery Input Required: No
         Filter Factor: 0.04
         Predicate Text:
         --------------
         (Q1.P_BRAND = 'Brand#23')
      4) External Sarg Predicate,
         Comparison Operator: Equal (=)
         Subquery Input Required: No
         Filter Factor: 0.025
         Predicate Text:
         --------------
         (Q1.P_CONTAINER = 'MED BOX')
    Query:
      select sum(l_extendedprice) / 7.0 as avg_yearly
      from temp (l_quantity, avgquantity, l_extendedprice) as
        (select l_quantity,
                avg(l_quantity) over (partition by l_partkey) as avgquantity,
                l_extendedprice
         from tpcd.lineitem, tpcd.part
         where p_partkey = l_partkey
           and p_brand = 'Brand#23'
           and p_container = 'MED BOX')
      where l_quantity < 0.2 * avgquantity
  • 17. Statistics
    ► Big SQL utilizes Hive statistics collection with some extensions:
      • Additional support for column groups, histograms and frequent values
      • Automatic determination of partitions that require statistics collection vs. explicit
      • Partitioned tables: added table-level versions of NDV, Min, Max, Null count, Average column length
      • Hive catalogs as well as database engine catalogs are also populated
      • We are restructuring the relevant code for submission back to Hive
    ► Capability for statistic fabrication if no stats available at compile time
    Table statistics
      • Cardinality (count)
      • Number of Files
      • Total File Size
    Column statistics
      • Minimum value (all types)
      • Maximum value (all types)
      • Cardinality (non-nulls)
      • Distribution (Number of Distinct Values, NDV)
      • Number of null values
      • Average Length of the column value (all types)
      • Histogram – Number of buckets configurable
      • Frequent Values (MFV) – Number configurable
    Column group statistics
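
As a rough illustration of the column statistics listed above, the sketch below derives them from a small in-memory column. The function name, the dictionary keys and the choice to measure length on the string form of each value are assumptions; the slides do not say how Big SQL or Hive define or compute these values, and histograms are left out.

```python
from collections import Counter


def column_stats(values, top_n=1):
    """Compute simple column-level statistics; None represents SQL NULL."""
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)
    return {
        "cardinality": len(non_null),            # non-null count
        "nulls": len(values) - len(non_null),    # number of null values
        "ndv": len(freq),                        # number of distinct values
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        # Average length, measured on the string form of each value (assumption).
        "avg_len": (sum(len(str(v)) for v in non_null) / len(non_null)
                    if non_null else None),
        # Most frequent values with their counts.
        "frequent_values": freq.most_common(top_n),
    }


stats = column_stats(["MED BOX", "LG BOX", "MED BOX", None, "SM BAG"])
```

An optimizer can turn numbers like these into selectivity estimates; under a uniform-distribution assumption, for example, an equality predicate on this column would be estimated to keep about 1/NDV of the non-null rows.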
  • 18. Costing Model ► Few extensions were required to the cost model ► The TBSCAN operator cost model was extended to evaluate the cost of reading from Hadoop ► New elements taken into account: number of files, size of files, number of partitions, number of nodes ► The optimizer now knows on which subset of nodes the data resides → better costing!
    Example access plan fragment (values as printed in the explain output: estimated rows, cost, I/O; indentation shows each operator's inputs):
    HSJOIN (7): rows 2.66667e-08, cost 1.1218e+06, I/O 8351
      BTQ (8): rows 5.30119e+08, cost 948130, I/O 7291
        LTQ (9): rows 5.76923e+08, cost 855793, I/O 7291
          TBSCAN (10): rows 5.76923e+08, cost 802209, I/O 7291
            TABLE: TPCH5TB_PARQ.ORDERS (Q1), rows 7.5e+09
      NLJOIN (11): rows 3.75e+07, cost 146345, I/O 1060
        GRPBY (12): rows 1, cost 114241, I/O 1060
          TBSCAN (13): rows 13, cost 114241, I/O 1060
            TEMP (14): rows 13, cost 114241, I/O 1060
              DTQ (15): rows 13, cost 114241, I/O 1060
                GRPBY (16): rows 1, cost 114241, I/O 1060
                  LTQ (17): rows 1, cost 114241, I/O 1060
                    GRPBY (18): rows 1, cost 114241, I/O 1060
                      TBSCAN (19): rows 5.24479e+06, cost 113931, I/O 1060
                        TABLE: TPCH5TB_PARQ.CUSTOMER (Q2), rows 7.5e+08
        FILTER (20): rows 3.75e+07, cost 126068, I/O 1060
          BTQ (21): rows 7.5e+07, cost 117135, I/O 1060
            LTQ (22): rows 5.76923e+06, cost 108879, I/O 1060
              TBSCAN (23): rows 5.76923e+06, cost 108325, I/O 1060
                TABLE: TPCH5TB_PARQ.CUSTOMER (Q5), rows 7.5e+08
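As a rough illustration of how the new inputs named on this slide could enter a scan cost, consider the toy function below. Everything in it is an assumption: the name `hadoop_scan_cost`, the formula, and the constants `open_cost` and `bytes_per_cost_unit` are invented for this sketch and do not describe Big SQL's actual cost model.

```python
def hadoop_scan_cost(num_files: int,
                     total_bytes: float,
                     nodes_with_data: int,
                     partitions_read: int = 1,
                     partitions_total: int = 1,
                     open_cost: float = 5.0,
                     bytes_per_cost_unit: float = 1_000_000.0) -> float:
    """Toy per-node cost of scanning a Hadoop table (hypothetical formula).

    - every file pays a fixed open overhead, so many small files cost more
    - bytes read are charged at a flat rate
    - partition elimination scales the work by the fraction of partitions read
    - the work is divided across the nodes that actually hold the data
    """
    if nodes_with_data <= 0 or partitions_total <= 0:
        raise ValueError("nodes_with_data and partitions_total must be positive")
    fraction = partitions_read / partitions_total
    work = fraction * (num_files * open_cost + total_bytes / bytes_per_cost_unit)
    return work / nodes_with_data
```

Under these made-up constants, 1 GB stored in 100 files across 10 nodes costs less per node than the same data across 5 nodes, and less than the same data stored in 10,000 small files, which is the kind of distinction the slide says the optimizer can now make.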
  • 19. New Access Plans ► Data is not hash partitioned on particular columns (aka “scatter partitioned”) ► We can access a Hadoop table as: • “Scatter” partitioned: accesses only data local to the node • Replicated: accesses local and remote data – The optimizer could also use a broadcast table queue – The HDFS shared file system provides the replication ► New parallel join strategy introduced
  • 20. Parallel Join Strategies: Replicated vs. Broadcast Join ► All tables are “scatter” partitioned ► Join predicate: STORE.STOREKEY = DAILY_SALES.STOREKEY ► Replicate the smaller table to the partitions of the larger table using either: • A broadcast table queue • A replicated HDFS scan ► A table queue (TQ) represents communication between nodes or subagents [Diagram: JOIN of Store and Daily Sales; Daily Sales is scanned locally, while Store reaches the join either through a SCAN plus broadcast TQ or through a replicated SCAN]
  • 21. Parallel Join Strategies: Repartitioned Join ► All tables are “scatter” partitioned ► Join predicate: DAILY_FORECAST.STOREKEY = DAILY_SALES.STOREKEY ► Both tables are large • Too expensive to broadcast or replicate either one • Repartition both tables on the join columns • Use a directed table queue (DTQ) [Diagram: JOIN of Daily Forecast and Daily Sales; each table is scanned and sent to the join through its own directed TQ]
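The two parallel join strategies on slides 20 and 21 can be simulated in a few lines. This is a single-process sketch under stated assumptions, not engine code: nodes are plain Python lists, round-robin placement stands in for “scatter” partitioning, and the function names (`scatter`, `broadcast_join`, `directed`, `repartitioned_join`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]


def scatter(rows: List[Row], num_nodes: int) -> List[List[Row]]:
    """Round-robin placement: rows are spread out without regard to any column."""
    nodes: List[List[Row]] = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def local_hash_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Ordinary hash join executed on one node."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def broadcast_join(big_nodes, small_nodes, key):
    """Broadcast TQ: every node receives a full copy of the smaller table."""
    small_all = [row for node in small_nodes for row in node]
    out: List[Row] = []
    for node_rows in big_nodes:  # the big table never moves
        out.extend(local_hash_join(node_rows, small_all, key))
    return out


def directed(nodes, key, num_nodes):
    """Directed TQ: each row is sent to the node chosen by hashing its join key."""
    target: List[List[Row]] = [[] for _ in range(num_nodes)]
    for node_rows in nodes:
        for row in node_rows:
            target[hash(row[key]) % num_nodes].append(row)
    return target


def repartitioned_join(left_nodes, right_nodes, key):
    """Repartition both inputs on the join key, then join node by node."""
    n = len(left_nodes)
    left_parts = directed(left_nodes, key, n)
    right_parts = directed(right_nodes, key, n)
    out: List[Row] = []
    for l_rows, r_rows in zip(left_parts, right_parts):
        out.extend(local_hash_join(l_rows, r_rows, key))
    return out


# Made-up sample data.
SALES = [{"storekey": k % 4, "amount": k} for k in range(12)]
STORES = [{"storekey": k, "city": f"city{k}"} for k in range(4)]
```

Both strategies produce the same rows as a single-node join; they differ in what moves. The broadcast join ships the small table to every node, while the repartitioned join ships each row of both tables exactly once, which is why it is preferred when neither table is small.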
  • 22. Future Challenges ► The challenges never end! • That’s what makes this job fun! • The Hadoop ecosystem continues to expand • New storage techniques, indexing techniques, etc. ► Here are a few areas we’re exploring….
  • 23. Future Challenges ► Dynamic split allocation • React to competing workloads • If one node is slow, hand the work it would have received to another node ► More pushdown! • Currently we push projection and selection down • Should we push more advanced operations? Aggregation? Joins? ► Join co-location • Perform co-located joins when tables are partitioned on the same join key ► Explicit MapReduce-style parallelism (“SQL MR”) • Expand SQL to explicitly perform partitioned operations
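The dynamic split allocation idea can be sketched as a pull model in which whichever node becomes free first takes the next split. This is only one possible reading of the slide, simulated with made-up per-node speeds; `dynamic_allocation` and `static_allocation` are hypothetical names, and nothing here reflects how Big SQL schedules splits.

```python
import heapq
from typing import List, Tuple


def static_allocation(num_splits: int, seconds_per_split: List[float]) -> Tuple[List[int], float]:
    """Assign splits evenly up front, regardless of node speed."""
    n = len(seconds_per_split)
    counts = [num_splits // n + (1 if i < num_splits % n else 0) for i in range(n)]
    finish = max(c * s for c, s in zip(counts, seconds_per_split))
    return counts, finish


def dynamic_allocation(num_splits: int, seconds_per_split: List[float]) -> Tuple[List[int], float]:
    """Hand each split to whichever node becomes free first."""
    ready = [(0.0, node) for node in range(len(seconds_per_split))]
    heapq.heapify(ready)  # ordered by the time each node is next free
    counts = [0] * len(seconds_per_split)
    finish = 0.0
    for _ in range(num_splits):
        free_at, node = heapq.heappop(ready)
        done = free_at + seconds_per_split[node]
        counts[node] += 1
        finish = max(finish, done)
        heapq.heappush(ready, (done, node))
    return counts, finish


# Hypothetical cluster: two healthy nodes and one node that is four times slower.
SPEEDS = [1.0, 1.0, 4.0]
```

In this simulation the slow node ends up with fewer of the 12 splits under dynamic allocation, and the overall finish time drops compared with the even, up-front assignment.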
  • 24. Queries? (Optimized, of course) Try Big SQL 3.0 Beta on the cloud! https://bigsql.imdemocloud.com/ Scott C. Gray sgray@us.ibm.com @ScottCGrayIBM Adriana Zubiri zubiri@ca.ibm.com @adrianaZubiri

Editor's notes

  1. Rewriting a given SQL query into a semantically equivalent form that may be processed more efficiently
  2. MFV and histograms give better selectivity estimates for range predicates over non-uniformly distributed data. Stats are stored in the Hive metastore for the statistics Hive currently supports, and in our internal catalog tables for all of them. Min and max are kept in Hive only for a subset of types. Average length of the column values is kept in Hive only for strings. Column and table stats are collected together. Next: automatic stats collection.