In this talk, we discuss how we use Spark as part of a hybrid RDBMS architecture that includes Hadoop and HBase. The optimizer evaluates each query and sends OLTP traffic (including CRUD queries) to HBase and OLAP traffic to Spark. We focus on the tradeoffs inherent in an integrated architecture that handles real-time and batch traffic simultaneously. Lessons learned include:
- Embedding Spark into an RDBMS
- Running Spark on YARN and isolating OLTP traffic from OLAP traffic
- Accelerating the generation of Spark RDDs from HBase
- Customizing the Spark UI
The lessons learned can also be applied to other hybrid systems, such as Lambda architectures.
Bio:
John Leach is the CTO and Co-Founder of Splice Machine. With over 15 years of software experience under his belt, John’s expertise in analytics and BI drives his role as Chief Technology Officer. Prior to Splice Machine, John founded Incite Retail in June 2008 and led the company’s strategy and development efforts. At Incite Retail, he built custom Big Data systems (leveraging HBase and Hadoop) for Fortune 500 companies. Prior to Incite Retail, he ran the business intelligence practice at Blue Martini Software and built strategic partnerships with integration partners. John was a key subject matter expert for Blue Martini Software in many strategic implementations across the world. His focus at Blue Martini was helping clients incorporate decision support knowledge into their current business processes utilizing advanced algorithms and machine learning. John received dual bachelor’s degrees in biomedical and mechanical engineering from Washington University in Saint Louis. Leach is the organizer emeritus for the Saint Louis Hadoop Users Group and is active in the Washington University Elliot Society.
Spark as Part of a Hybrid RDBMS Architecture - John Leach, Co-Founder, Splice Machine
1. Powering Real-Time Applications & Analytics
Enabling Decisions in the Moment
John Leach, CTO & Co-Founder
Splice Machine Proprietary and Confidential
2. Decisions in the Moment
- Life Sciences
- Digital Marketing
- Fraud Detection
- Supply Chain Optimization
3. Today’s Reality: Stale Data, Backward-Looking Decisions
How old is the data in your reports?
- 1 day +
- 1 day
- 4 hours +
- 1 hour +
- Real-time
4. Today’s Reality: Stale Data, Backward-Looking Decisions
How old is the data in your reports? (Options: 1 day +, 1 day, 4 hours +, 1 hour +, real-time)
[Chart: response shares of 24%, 50%, 7%, 9%, 9%; the mapping of shares to options is not recoverable from the transcript]
* Source: Webinars on 11-3-15 and 12-10-15, 237 respondents
5. Data Gridlock: Complex, Outdated ETL Pipelines
[Diagram labels: OLTP Systems (ERP, CRM, Supply Chain, HR, …); ETL (Extract, Transform, Load); ODS; Stream or Batch Updates; OLAP Systems (Data Warehouse, Datamart); Ad Hoc Analytics; Executive Business Reports; Operational Reports; Mixed Workload Apps]
- Pain
  - Separate OLTP & OLAP systems
  - Messy ETL “glue”
- Why?
  - Different workloads
  - Different data structures
  - Hard to isolate workloads
- No longer adequate
  - Can’t afford to wait days or hours to analyze data
Current architectures unable to keep up
6. Nirvana: No More ETL
[Diagram labels: OLTP App; OLAP Report; OLTP/OLAP]
Simultaneous OLTP & OLAP workloads
- Benefits
  - Faster since Big Data does not need to be moved
  - Eliminate expensive ETL and data warehouse systems
  - Act on real-time data instead of yesterday’s
- Why is it possible now?
7. Disruptive Technology Enablers
- Scale-Out Technology: Scale Up (increase server size) vs. Scale Out (more small servers)
- In-Memory Technology
8. The Splice Machine RDBMS: Replace Oracle & MySQL
The First RDBMS Powered by Hadoop & Spark
Pillars: SQL, Scale Out, Speed
- ANSI SQL: No retraining or rewrites for SQL-based analysts, reports, and applications
- ¼ the Cost: Scales out on commodity hardware
- Transactions: Ensure reliable updates across multiple rows
- Mixed Workloads: Simultaneously support OLTP and OLAP workloads
- Elastic: Increase scale in just a few minutes
- 10-20x Faster: Leverages Spark in-memory technology
9. Omni-Channel Marketing: Harte-Hanks
Overview
- Digital marketing services provider
- Unified Customer Profile
- Real-time campaign management
- Operational application with BI reports
Challenges
- Oracle RAC too expensive to scale
- Queries too slow, even up to ½ hour
- Getting worse: expect 30-50% data growth
- Looked for 9 months for a cost-effective solution
Solution Diagram
[Diagram labels: Cross-Channel Campaigns; Real-Time Personalization; Real-Time Actions]
Initial Results
- ¼ cost with commodity scale out
- 3-7x faster through parallelized queries
- 10-20x price/perf with no application, BI or ETL rewrites
10. Simultaneous OLTP & OLAP Workloads
Very few applications are OLTP only
[Diagram: Traditional RDBMSs (bottlenecks, delays) vs. Splice Machine (HBase and Spark, workload isolation); key: OLTP, OLAP]
11. Simultaneous OLTP & OLAP Workloads
Separate OLTP & OLAP processes isolate workloads
[Charts: OLTP response time plotted against OLAP load]
- Traditional RDBMSs: As OLAP load rises, OLTP response times increase
- Splice Machine: As OLAP load rises, OLTP response times remain flat
12. Proven Building Blocks: Spark, Hadoop and Derby
Apache Derby
- ANSI SQL-99 RDBMS
- Java-based
- ODBC/JDBC compliant
Apache HBase/Hadoop
- Auto-sharding
- High availability
- Scalability to 100s of PBs
Apache Spark
- Analytical engine
- Fast, in-memory technology
- Memory resilient to node failure
13. HBase: Proven Scale-Out
- Auto-sharding
- Scales with commodity hardware
- Cost-effective from GBs to PBs
- High availability through failover and replication
- LSM-trees
14. Apache Spark
Unmatched Performance
- Fastest sort of 1PB of data
Advanced In-Memory Technology
- Spill-to-disk for large datasets
- Resilient against node failures
- Pipelining for computation parallelism
Most Active Apache Community
- Almost 500 committers
Extensive Libraries
- Over 140 and growing
- Libraries for machine learning, streaming and graph processing
15. How is Spark aiding Splice Machine?
Address HBase Challenges
- Compactions
- Large data movements
RDBMS Features
- Index creation
- Statistics collection
- Import
- Admin UI
Analytic Processing
- Pipelining for computation parallelism
- Lineage
Machine Learning
- Incorporating MLlib into the RDBMS
16. Splice Machine: Advanced Spark Integration
Compaction: LSM Tree (Deal with the Devil)
17. Splice Machine: Spark/HBase Integration: Compaction
[Diagram: Minor Compaction and Major Compaction]
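As a rough illustration of what compaction does in an LSM tree, the sketch below merges several sorted store files into one, keeping only the newest version of each key. The cell layout (key, timestamp, value), the `TOMBSTONE` marker, and the `compact` function are assumptions made for this example; they are not HBase's HFile format or Splice Machine's implementation.

```python
import heapq

TOMBSTONE = None  # hypothetical delete marker used only in this sketch


def compact(store_files, drop_tombstones=True):
    """Merge sorted store files into a single sorted file.

    Each store file is a list of (key, timestamp, value) cells sorted by key,
    with at most one cell per key. For every key, only the cell with the
    highest timestamp survives the merge.
    """
    # Stream cells in key order; for equal keys the newest timestamp comes first.
    merged = heapq.merge(*store_files, key=lambda cell: (cell[0], -cell[1]))
    result = []
    last_key = object()  # sentinel that compares unequal to every real key
    for key, ts, value in merged:
        if key == last_key:
            continue  # an older version of a key we have already handled
        last_key = key
        if value is TOMBSTONE and drop_tombstones:
            continue  # the delete wins and the marker itself is discarded
        result.append((key, ts, value))
    return result
```

In this sketch a minor compaction would pass only a few files with `drop_tombstones=False`, because files left out of the merge may still hold older versions of a deleted key. A major compaction would pass every file and can safely drop the markers.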
18. Splice Machine: Advanced Spark Integration
Innovative, High-Performance RDD Creation
- Fast access to HFiles in HDFS
- Merged with deltas from Memstore
- Avoids slower HBase API
Universal Execution Plan and Byte Code
- Optimizer, plan and code shared across Spark or HBase execution
[Diagram: on each physical node, an HBase Region Server hosts Regions 1 to N, each with a Memstore and HFiles in HDFS; a Spark Worker on the same node holds RDDs 1 to N]
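The idea of building partitions from flushed files and then overlaying unflushed Memstore deltas can be sketched in plain Python. Everything here is a toy model: `Region`, `DELETED`, `region_partition`, and `build_partitions` are hypothetical names, dictionaries stand in for HFiles, and no Spark or HBase API is involved.

```python
from dataclasses import dataclass, field

DELETED = object()  # hypothetical marker for a row deleted in the memstore


@dataclass
class Region:
    """Toy stand-in for an HBase region: flushed files plus in-memory deltas."""
    hfiles: list                                   # list of dicts, oldest file first
    memstore: dict = field(default_factory=dict)   # newest, unflushed writes


def region_partition(region):
    """Build one partition: rows read from the files, overlaid with memstore deltas."""
    rows = {}
    for hfile in region.hfiles:
        rows.update(hfile)  # newer files overwrite older versions of a row
    for key, value in region.memstore.items():
        if value is DELETED:
            rows.pop(key, None)  # a pending delete hides the flushed row
        else:
            rows[key] = value    # a pending write supersedes the flushed row
    return sorted(rows.items())


def build_partitions(regions):
    """One partition per region, mirroring the Region 1..N to RDD 1..N layout."""
    return [region_partition(region) for region in regions]
```

Reading files directly gives the bulk of the data cheaply, while the memstore overlay keeps the result consistent with writes that have not been flushed yet.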
19. Splice Machine Architecture
(1) Standard install of HBase cluster (HBase, HDFS, ZooKeeper) with Spark
(2) Distribute Splice Machine JAR to each region server
(3) Automatically invoke co-processors on each region
[Diagram: each node runs an HBase Region Server (Splice Parser, Planner, Optimizer, Executor) over HDFS, with a Splice Executor co-processor per Region providing snapshot isolation and indexes, alongside a Spark Worker whose Executors hold cache, tasks, and RDDs; the cluster also runs the Spark Master, HMaster, and ZooKeeper. Legend: HBase Co-Processor]
24. Splice Machine: Query Execution
1. Parse SQL
2. Optimize query plan
3. Generate optimal byte code
OLTP Execution on HBase
4a. Execute OLTP query from byte code
5a. Use block cache and bloom filters to optimize data access
6a. Return results
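To show why bloom filters help point lookups on the OLTP path, here is a minimal sketch of a filter that lets a read skip files that cannot contain a row key. The `BloomFilter` class, its sizing defaults, and `files_to_read` are assumptions for illustration only and do not reflect HBase's actual bloom filter implementation.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit keeps the sketch simple

    def _positions(self, key):
        # Derive several bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def files_to_read(key, files):
    """Return the names of the files whose filter does not rule the key out."""
    return [name for name, bloom in files if bloom.might_contain(key)]
```

A lookup then touches only the files returned by `files_to_read`, at the cost of an occasional unnecessary read when the filter reports a false positive.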
25. Splice Machine: Query Execution
1. Parse SQL
2. Optimize query plan
3. Generate optimal byte code
OLTP Execution on HBase
4a. Execute OLTP query from byte code
5a. Use block cache and bloom filters to optimize data access
6a. Return results
OLAP Execution on Spark
4b. Generate Spark execution plan
5b. Submit Spark plan with byte code
6b. Fair scheduling of distributed tasks
7b. Generate RDD from HFiles and Memstore
8b. Execute query and return results
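The split between the two execution paths can be pictured as a routing decision made after planning. The sketch below assumes, purely for illustration, that the decision is based on an estimated row count compared against a fixed threshold; `PlanEstimate`, `OLAP_ROW_THRESHOLD`, and `choose_engine` are hypothetical names, and the slides do not say which cost signals the optimizer uses.

```python
from dataclasses import dataclass


@dataclass
class PlanEstimate:
    """Hypothetical summary of an optimized plan."""
    sql: str
    estimated_rows_scanned: int


# Assumed cutoff for this sketch; a real system would tune or derive it.
OLAP_ROW_THRESHOLD = 20_000


def choose_engine(plan, threshold=OLAP_ROW_THRESHOLD):
    """Send small, selective queries to HBase and large scans to Spark."""
    if plan.estimated_rows_scanned > threshold:
        return "spark"
    return "hbase"
```

For example, a primary-key lookup estimated at one row would take the OLTP path, while an aggregate over millions of rows would take the OLAP path.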
26. Isolated Resource Management
Isolate Spark & HBase resources through Linux Cgroups
28. Configurable Spark Resource Management
- Prioritize Spark resources between Query, Admin & Import jobs
- Custom resource pools through XML
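Spark's fair scheduler reads pool definitions from an XML allocation file with `pool`, `schedulingMode`, `weight`, and `minShare` elements. The sketch below parses such a file and computes each pool's relative weight. The pool names (`query`, `admin`, `import`) and the numbers are assumptions chosen to match the job types named on the slide, not Splice Machine's shipped configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical allocation file in the fair scheduler's pool format.
POOLS_XML = """
<allocations>
  <pool name="query">
    <schedulingMode>FAIR</schedulingMode>
    <weight>4</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="admin">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>1</minShare>
  </pool>
  <pool name="import">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1</weight>
  </pool>
</allocations>
"""


def load_pools(xml_text):
    """Parse pool definitions, falling back to Spark's documented defaults."""
    root = ET.fromstring(xml_text.strip())
    pools = {}
    for pool in root.findall("pool"):
        pools[pool.get("name")] = {
            "mode": pool.findtext("schedulingMode", default="FIFO"),
            "weight": int(pool.findtext("weight", default="1")),
            "min_share": int(pool.findtext("minShare", default="0")),
        }
    return pools


def weight_shares(pools):
    """Relative share of the cluster each pool's weight entitles it to."""
    total = sum(p["weight"] for p in pools.values())
    return {name: p["weight"] / total for name, p in pools.items()}
```

With these assumed weights, interactive queries are entitled to four times the share of admin or import jobs whenever all three pools have work queued.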
29. Spark Query Management
Visualization of active and completed queries
30. Spark Query Management (cont’d)
Visualization of stages for each query, plus kill function
31. Spark Query Management (cont’d)
Visualization of stages for query plan, plus kill function
32. Spark Query Management (cont’d)
Detailed metrics for tasks in each stage
34. Federated Query Support
Virtual Table Interface (VTI)
- Execute federated queries against external files, libraries or databases
- External Databases
  - Use JDBC to access data in DBs such as Oracle and DB2
- External Libraries
  - Access over 140 Spark libraries for machine learning and streaming
- External Files
  - Pre-defined or dynamic schema
  - Access local FS, HDFS, AWS S3
  - Sample query:
MapReduce I/O Formats
- Accept federated queries from MapReduce, Pig, and Hive
- Register Splice Machine schema in HCatalog
- Merge structured (Splice) and unstructured data in ad-hoc query
- Seamless integration to Hadoop ecosystem
40. Advisory Board
Advisory Board includes luminaries in databases and technology
- Roger Bamford: Former Principal Architect at Oracle; Father of Oracle RAC
- Mike Franklin: Computer Science Chair, UC Berkeley; Director, UC Berkeley AMPLab; Founder of Apache Spark
- Marie-Anne Neimat: Co-Founder, Times-Ten Database; Former VP, Database Eng. at Oracle
- Ken Rudin: Head of Growth and Analysis for Google Search; Head of Analytics at Facebook
- Abhinav Gupta: Co-Founder, VP Engineering at Rocket Fuel; Runs 15PB HBase Cluster