Apache Cassandra is rock-solid and widely deployed for OLTP and real-time applications, but it is typically not thought of as an OLAP database for analytical queries. This talk will show architectures and techniques for combining Apache Cassandra and Spark to yield a 10-1000x improvement in OLAP analytical performance. We will then introduce a new open-source project that combines the above performance improvements with the ease of use of Apache Cassandra, and compare it to implementations based on Hadoop and Parquet.
First, the existing Cassandra Spark connector allows one to easily load data from Cassandra to Spark. We'll cover how to accelerate queries through different caching options in Spark, and the tradeoffs and limitations around performance, memory, and updating data in real time. We then dive into the use of columnar storage layout and efficient coding techniques that dramatically speed up I/O for OLAP use cases. Cassandra features like triggers and custom secondary indexes allow for easy data ingestion into columnar format. Next, we explore how to integrate this new storage with Spark SQL and its pluggable data storage API. Future developments will enable extreme analytical database performance, including smart caching of column projections, a columnar version of Spark's Catalyst execution planner, and how vectorization makes for fast cache- and GPU-friendly calculations - see Spark's Project Tungsten.
FiloDB is a new open-source database using the above techniques to combine very fast Spark SQL analytical queries with the ease of use of Cassandra. We will briefly cover interesting use cases, such as:
* Easy exactly-once ingestion from Kafka for streaming and IoT applications
* Incremental computed columns and geospatial annotations. We'll discuss how FiloDB improves aggregations needed for choropleth maps over standard PostGIS solutions.
2. Who am I?
Distinguished Engineer, Tuplejump
@evanfchan
http://velvia.github.io
User and contributor to Spark since 0.9, Cassandra since 0.6
Co-creator and maintainer of Spark Job Server
3. About Tuplejump
Tuplejump is a big data technology leader providing solutions for
rapid insights from data.
Calliope - the first Spark-Cassandra integration
Stargate - an open source Lucene indexer for Cassandra
SnackFS - open source HDFS for Cassandra
4. Didn't I attend the same talk last year?
Similar title, but mostly new material
Will reveal new open source projects! :)
5. Problem Space
Need analytical database / queries on structured big data
Something SQL-like, very flexible and fast
Pre-aggregation too limiting
Fast data / constant updates
Ideally, want my queries to run over fresh data too
6. Example: Video analytics
Typical collection and analysis of consumer events
3 billion new events every day
Video publishers want updated stats, the sooner the better
Pre-aggregation only enables simple dashboard UIs
What if one wants to offer more advanced analysis, or a
generic data query API?
E.g., top countries filtered by device type, OS, browser
7. Requirements
Scalable - rules out PostgreSQL, etc.
Easy to update and ingest new data
Not traditional OLAP cubes - that's not what I'm talking
about
Very fast for analytical queries - OLAP not OLTP
Extremely flexible queries
Preferably open source
8. Parquet
Widely used, lots of support (Spark, Impala, etc.)
Problem: Parquet is read-optimized, not easy to use for writes
Cannot support idempotent writes
Optimized for writing very large chunks, not small updates
Not suitable for time series, IoT, etc.
Often needs multiple passes of jobs for compaction of small
files, deduplication, etc.
People really want a database-like abstraction, not a file format!
9. Cassandra
Horizontally scalable
Very flexible data modelling (lists, sets, custom data types)
Easy to operate
Perfect for ingestion of real time / machine data
Best of breed storage technology, huge community
BUT: Simple queries only
OLTP-oriented
10. Apache Spark
Horizontally scalable, in-memory queries
Functional Scala transforms - map, filter, groupBy, sort
etc.
SQL, machine learning, streaming, graph, R, many more plugins
all on ONE platform - feed your SQL results to a logistic
regression, easy!
Huge number of connectors with every single storage
technology
11. Spark provides the missing fast, deep
analytics piece of Cassandra!
...tying together fast event ingestion and rich deep
analytics!
14. Not Very Fast, but Real-Time Updates
Spark does no caching by default - you will always be reading
from C*!
Pros:
No need to fit all data in memory
Always get the latest data
Cons:
Pretty slow for ad hoc analytical queries - using regular CQL
tables
15. How to go Faster?
Read less data
Do less I/O
Make your computations faster
19. Spark Cached Tables can be Really Fast
GDELT dataset, 4 million rows, 60 columns, localhost
Method secs
Uncached 317
Cached 0.38
Almost a 1000x speedup!
On an 8-node EC2 c3.XL cluster with 117 million rows, common
queries run in 1-2 seconds against the cached dataset.
20. Problems with Cached Tables
Still have to read the data from Cassandra first, which is slow
Amount of RAM: your entire data + extra for conversion to
cached table
Cached tables only live in Spark executors - by default
tied to single context - not HA
once any executor dies, must re-read data from C*
Caching takes time: convert from RDD[Row] to compressed
columnar format
Cannot easily combine new RDD[Row] with cached tables
(and keep speed)
21. Problems with Cached Tables
If you don't have enough RAM, Spark can cache your tables
partly to disk. This is still way, way, faster than scanning an entire
C* table. However, cached tables are still tied to a single Spark
context/application.
Also: rdd.cache() is NOT the same as SQLContext's
cacheTable!
23. How Cassandra stores your CQL Tables
Suppose you had this CQL table:
CREATE TABLE employees (  -- table name is missing on the slide; "employees" is a placeholder
    department text,
    empId text,
    first text,
    last text,
    age int,
    PRIMARY KEY (department, empId)
);
24. How Cassandra stores your CQL Tables
Partition key  01:first  01:last  01:age  02:first  02:last   02:age
Sales          Bob       Jones    34      Susan     O'Connor  40
Engineering    Dilbert   P        ?       Dogbert   Dog       1
Each row is stored contiguously. All columns in row 2 come after
row 1.
To analyze only age, C* still has to read every field.
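To make that cost concrete, here is a minimal Scala sketch that lays out a few of the employee rows both ways and counts how many field values a scan for age has to touch. It is a toy model, not Cassandra's storage engine; `Employee`, `LayoutDemo`, and the counting approach are illustrative assumptions, and Dilbert is left out because his age is unknown on the slide.

```scala
// Illustrative only: a toy model of row-oriented vs column-oriented layouts.
// It does not reflect Cassandra's real on-disk format.
case class Employee(department: String, empId: String, first: String, last: String, age: Int)

object LayoutDemo {
  val rows: Vector[Employee] = Vector(
    Employee("Sales", "01", "Bob", "Jones", 34),
    Employee("Sales", "02", "Susan", "O'Connor", 40),
    Employee("Engineering", "02", "Dogbert", "Dog", 1)
  )

  // Row layout: all fields of a row sit next to each other,
  // so a scan for `age` walks past all five fields of every row.
  def fieldsTouchedRowLayout: Int = rows.map(_.productArity).sum

  // Columnar layout: each column is its own contiguous array,
  // so a scan for `age` touches only the age values.
  val ageColumn: Array[Int] = rows.map(_.age).toArray
  def fieldsTouchedColumnar: Int = ageColumn.length

  def averageAge: Double = ageColumn.sum.toDouble / ageColumn.length
}
```

With three rows the row layout touches 15 field values to answer the query, while the columnar layout touches 3. The gap widens with every extra column in the table.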
25. Cassandra is really a row-based, OLTP-oriented datastore.
Unless you know how to use it otherwise :)
28. Columnar Format solves I/O
How much data can I query interactively? More than you think!
29. Columnar Storage Performance Study
http://github.com/velvia/cassandra-gdelt
Scenario      Ingest    Read all columns  Read one column
Narrow table  1927 sec  505 sec           504 sec
Wide table    3897 sec  365 sec           351 sec
Columnar      93 sec    8.6 sec           0.23 sec
On reads, using a columnar format is up to 2190x faster, while
ingestion is 20-40x faster.
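The ratios quoted here follow directly from the benchmark figures. As a quick check, this sketch recomputes them; the object and value names are made up for illustration, and the numbers are copied from the study's table.

```scala
// Figures copied from the Columnar Storage Performance Study table (seconds).
object SpeedupCheck {
  val narrowIngest = 1927.0
  val wideIngest = 3897.0
  val columnarIngest = 93.0
  val narrowReadOne = 504.0
  val columnarReadOne = 0.23

  def ratio(slow: Double, fast: Double): Double = slow / fast

  // Best-case read speedup: narrow table vs columnar, reading one column.
  val bestReadSpeedup: Double = ratio(narrowReadOne, columnarReadOne)
  // Ingestion speedups against the narrow and wide tables.
  val ingestVsNarrow: Double = ratio(narrowIngest, columnarIngest)
  val ingestVsWide: Double = ratio(wideIngest, columnarIngest)
}
```

The read ratio comes out at about 2191, and the two ingestion ratios at about 20.7 and 41.9, which matches the rounded claims on the slide.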
30. Columnar Format solves Caching
Use the same format on disk, in cache, in memory scan
Caching works a lot better when the cached object is the
same!!
No data format dissonance means bringing in new bits of data
and combining with existing cached data is seamless
31. So, why isn't everybody doing this?
No columnar storage format designed to work with NoSQL
stores
Efficient conversion to/from columnar format a hard problem
Most infrastructure is still row oriented
Spark SQL/DataFrames based on RDD[Row]
Spark Catalyst is a row-oriented query optimizer
32. All hard work leads to profit, but mere talk leads
to poverty.
- Proverbs 14:23
37. Versioned
Incrementally add a column or a few rows as a new version. Easily
control what versions to query. Roll back changes inexpensively.
Stream out new versions as continuous queries :)
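One way to picture versioning is a store where each version holds only the rows added in that version, and a read merges versions up to the one requested. The sketch below is a hypothetical in-memory model built on that assumption; `VersionedStore` and its methods are invented names and do not describe FiloDB's actual data model or API.

```scala
// Hypothetical sketch: each version holds only the rows added in that version.
// This is not FiloDB's actual data model or API.
case class VersionedStore(versions: Map[Int, Map[String, Int]] = Map.empty) {
  // Add or overwrite a few rows as a new version without copying older versions.
  def addVersion(version: Int, rows: Map[String, Int]): VersionedStore =
    copy(versions = versions + (version -> rows))

  // Query as of a version: later versions override earlier ones key by key.
  def readAsOf(version: Int): Map[String, Int] =
    versions.toSeq
      .filter { case (v, _) => v <= version }
      .sortBy { case (v, _) => v }
      .foldLeft(Map.empty[String, Int]) { case (acc, (_, rows)) => acc ++ rows }

  // Rolling back just drops the newer versions; older data is untouched.
  def rollbackTo(version: Int): VersionedStore =
    copy(versions = versions.filter { case (v, _) => v <= version })
}
```

Because older versions are never rewritten, adding a version costs only the size of the new rows, and a rollback is a metadata change rather than a data copy.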
38. Columnar
Parquet-style storage layout
Retrieve select columns and minimize I/O for analytical
queries
Add a new column without having to copy the whole table
Vectorization and lazy/zero serialization for extreme
efficiency
39. What's in the name?
Rich sweet layers of distributed, versioned database goodness
40. 100% Reactive
Built completely on the Typesafe Platform:
Scala 2.10 and SBT
Spark (including custom data source)
Akka Actors for rational scale-out concurrency
Futures for I/O
Phantom Cassandra client for reactive, type-safe C* I/O
Typesafe Config
43. Analytical Query Performance
Up to 200x Faster Queries for Spark on Cassandra 2.x
Parquet Performance with Cassandra Flexibility
(Stick around for the demo)
44. Fast Event/Time-Series Ad-Hoc Analytics
New rows appended via Kafka
Writes are idempotent - no need to dedup!
Converted to columnar chunks on ingest and stored in C*
Only necessary columnar chunks are read into Spark for
minimal I/O
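The conversion step can be sketched roughly as follows: batch incoming rows, pivot each batch into one array per column, and let a query pull only the arrays it needs. The event fields, chunk size, and all names here are assumptions for illustration, not FiloDB internals.

```scala
// Hypothetical sketch of turning ingested rows into columnar chunks.
// Field names and chunk size are assumptions, not FiloDB internals.
case class Event(entity: String, timestamp: Long, country: String, watchSecs: Int)

// One chunk stores each column of a batch of rows as its own vector.
case class ColumnarChunk(
  entity: Vector[String],
  timestamp: Vector[Long],
  country: Vector[String],
  watchSecs: Vector[Int]
)

object ChunkDemo {
  // Pivot every batch of `rowsPerChunk` rows into one columnar chunk.
  def toChunks(events: Seq[Event], rowsPerChunk: Int): Vector[ColumnarChunk] =
    events.grouped(rowsPerChunk).map { batch =>
      ColumnarChunk(
        batch.map(_.entity).toVector,
        batch.map(_.timestamp).toVector,
        batch.map(_.country).toVector,
        batch.map(_.watchSecs).toVector
      )
    }.toVector

  // An analytical query that needs only one column out of four.
  def totalWatchSecs(chunks: Seq[ColumnarChunk]): Int =
    chunks.map(_.watchSecs.sum).sum
}
```

In a real store each column vector would be a separately addressable blob, so `totalWatchSecs` would fetch a quarter of the columns and skip the rest.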
45. Fast Event/Time-Series Ad-Hoc Analytics
Entity Time1 Time2
US-0123 d1 d2
NZ-9495 d1 d2
Model your time series with FiloDB similarly to Cassandra:
Sort key: Timestamp, similar to clustering key
Partition Key: Event/machine entity
FiloDB keeps data sorted while stored in efficient columnar
storage.
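The partition key plus sort key model can be mimicked with a map of sorted maps, which also shows why writes keyed this way are idempotent. This is a hypothetical in-memory stand-in; `TimeSeriesStore`, `write`, and `scan` are invented names, not FiloDB's API.

```scala
import scala.collection.immutable.SortedMap

// Hypothetical model: partition key -> (sort key -> value), kept sorted by timestamp.
// The shape mirrors the Entity/Time table on the slide; it is not FiloDB's API.
case class TimeSeriesStore(partitions: Map[String, SortedMap[Long, Double]] = Map.empty) {
  // Writing the same (entity, timestamp) twice overwrites rather than appends,
  // so replaying a batch leaves the store unchanged.
  def write(entity: String, timestamp: Long, value: Double): TimeSeriesStore = {
    val series = partitions.getOrElse(entity, SortedMap.empty[Long, Double])
    copy(partitions = partitions + (entity -> series.updated(timestamp, value)))
  }

  // Range scan within one partition, returned in timestamp order.
  def scan(entity: String, from: Long, to: Long): Vector[(Long, Double)] =
    partitions.getOrElse(entity, SortedMap.empty[Long, Double])
      .toVector
      .filter { case (ts, _) => ts >= from && ts <= to }
}
```

Replaying a batch after a failure produces exactly the same store, which is the property that removes the need for a separate deduplication pass.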
48. Simplify your Lambda Architecture...
(https://www.mapr.com/developercentral/lambda-architecture)
49. With Spark, Cassandra, and FiloDB
Ma, where did all the components go?
You mean I don't have to deal with Hadoop?
Use Cassandra as a front end to store IoT data first
50. FiloDB vs Parquet
Comparable read performance - with lots of space to improve
Assuming co-located Spark and Cassandra
On localhost, both subsecond for simple queries (GDELT
1979-1984)
FiloDB has more room to grow - due to hot column caching
and much less deserialization overhead
Lower memory requirement due to much smaller block sizes
Much better fit for IoT / Machine / Time-series applications
Idempotent writes by PK with no deduplication
Limited support for types
array / set / map support not there, but will be added
53. Ingestion and Storage?
Current version:
Each dataset is stored using 2 regular Cassandra tables
Ingestion using Spark (DataFrames or SQL)
Future version?
Automatic ingestion of your existing C* data using custom
secondary index
55. The filo project
is a binary data vector library
designed for extreme read performance with minimal
deserialization costs.
http://github.com/velvia/filo
Designed for NoSQL, not a file format
random or linear access
on or off heap
missing value support
Scala only, but cross-platform support possible
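Missing value support alongside random and linear access could look something like the sketch below, which pairs a primitive array with a bitmap of missing slots. This is an assumed design for illustration only; Filo's real binary layout is not shown in this deck and may differ, and `NullableIntVector` and `IntVectors` are invented names.

```scala
import java.util.BitSet

// Hypothetical sketch of an int vector with missing-value support.
// Filo's real binary layout is not shown in this deck and may differ.
final class NullableIntVector(values: Array[Int], missing: BitSet) {
  def length: Int = values.length
  def isAvailable(i: Int): Boolean = !missing.get(i)

  // Random access: None for a missing slot.
  def get(i: Int): Option[Int] = if (isAvailable(i)) Some(values(i)) else None

  // Linear access: sum only the values that are present, without boxing.
  def sumAvailable: Long = {
    var total = 0L
    var i = 0
    while (i < values.length) {
      if (!missing.get(i)) total += values(i)
      i += 1
    }
    total
  }
}

object IntVectors {
  // Build a vector from optional values, recording missing slots in the bitmap.
  def fromOptions(items: Seq[Option[Int]]): NullableIntVector = {
    val values = new Array[Int](items.length)
    val missing = new BitSet(items.length)
    items.zipWithIndex.foreach {
      case (Some(v), i) => values(i) = v
      case (None, i)    => missing.set(i)
    }
    new NullableIntVector(values, missing)
  }
}
```

For example, `IntVectors.fromOptions(Seq(Some(34), Some(40), None, Some(1)))` models an age column where one employee's age is unknown.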
56. What is the ceiling?
This Scala loop can read integers from a binary Filo blob at a rate
of 2 billion integers per second - single threaded:
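// numValues and sc (the vector being read) are not defined on the slide.
// The trailing `optimized` looks like the Scalaxy loops macro; that is an assumption.
def sumAllInts(): Int = {
  var total = 0
  for { i <- 0 until numValues optimized } {
    total += sc(i)
  }
  total
}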
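For readers who want to try the idea without the Filo library, here is a self-contained sketch that sums integers straight out of a binary buffer. It assumes a flat little-endian blob of 32-bit ints, which is a simplification chosen for this example and not Filo's actual format; `BlobSum` is an invented name, and no throughput figure is implied.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical sketch: a flat little-endian blob of 32-bit ints.
// Filo's actual binary format is not described here and may differ.
object BlobSum {
  def encode(values: Array[Int]): ByteBuffer = {
    val buf = ByteBuffer.allocate(values.length * 4).order(ByteOrder.LITTLE_ENDIAN)
    values.foreach(v => buf.putInt(v))
    buf.flip()
    buf
  }

  // Reads each int in place via absolute indexing; nothing is copied
  // into an intermediate Array[Int] or boxed collection first.
  def sumAllInts(blob: ByteBuffer): Int = {
    val numValues = blob.limit() / 4
    var total = 0
    var i = 0
    while (i < numValues) {
      total += blob.getInt(i * 4)
      i += 1
    }
    total
  }
}
```

The point of the design is that the bytes that arrive from storage are the same bytes the loop reads, so there is no deserialization step between I/O and computation.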
57. Vectorization of Spark Queries
The Tungsten project.
Process many elements from the same column at once, keep data
in L1/L2 cache.
Coming in Spark 1.4 through 1.6
58. Hot Column Caching in Tachyon
Has a "table" feature, originally designed for Shark
Keep hot columnar chunks in shared off-heap memory for fast
access
59. FiloDB - Roadmap
Support for many more data types and sort and partition keys -
please give us your input!
Non-Spark ingestion API. Your input is again needed.
In-memory caching for significant query speedup
Projections. Often-repeated queries can be sped up
significantly with projections.
Use of GPU and SIMD instructions to speed up queries
60. You can help!
Send me your use cases for fast big data analysis on Spark and
Cassandra
Especially IoT, Event, Time-Series
What is your data model?
Email if you want to contribute
61. Thanks...
to the entire OSS community, but in particular:
Lee Mighdoll, Nest/Google
Rohit Rai and Satya B., Tuplejump
My colleagues at Socrata
If you want to go fast, go alone. If you want to go
far, go together.
-- African proverb
64. The scenarios
Global Database of Events, Language, and Tone (GDELT) dataset
1979 to now
60 columns, 250 million+ rows, 250GB+
Let's compare Cassandra I/O only, no caching or Spark
Narrow table - CQL table with one row per partition key
Wide table - wide rows with 10,000 logical rows per partition
key
Columnar layout - 1000 rows per columnar chunk, wide rows,
with dictionary compression
First 4 million rows, localhost, SSD, C* 2.0.9, LZ4 compression.
Compaction performed before read benchmarks.
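Dictionary compression, as used in the columnar scenario, replaces repeated strings with small integer codes plus a lookup table. The sketch below shows the general technique in a few lines; it is not the encoder from the cassandra-gdelt repo, and `DictEncoded` and `DictEncoder` are invented names.

```scala
// Illustrative dictionary encoding for a low-cardinality string column.
// The benchmark's actual encoder may differ.
case class DictEncoded(dictionary: Vector[String], codes: Array[Int]) {
  // Look each code up in the dictionary to recover the original column.
  def decode: Vector[String] = codes.map(dictionary).toVector
}

object DictEncoder {
  def encode(column: Seq[String]): DictEncoded = {
    // Distinct values in first-seen order become the dictionary.
    val dictionary = column.distinct.toVector
    val index = dictionary.zipWithIndex.toMap
    DictEncoded(dictionary, column.map(index).toArray)
  }
}
```

A column such as a country code, repeated across millions of events, shrinks to one short dictionary and an array of ints, which is part of why the columnar scenario uses so much less disk.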
65. Disk space usage
Scenario Disk used
Narrow table 2.7 GB
Wide table 1.6 GB
Columnar 0.34 GB
The disk space usage helps explain some of the numbers.
66. Connecting Spark to Cassandra
Datastax's Spark Cassandra Connector
Tuplejump Calliope
Get started in one line with spark-shell!
bin/spark-shell \
  --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3 \
  --conf spark.cassandra.connection.host=127.0.0.1
67. What about C* Secondary Indexing?
Spark-Cassandra Connector and Calliope can both reduce I/O by
using Cassandra secondary indices. Does this work with caching?
No, not really, because only the filtered rows would be cached.
Subsequent queries against this limited cached table would not
give you expected results.
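A toy example with plain Scala collections shows the pitfall. Here a vector stands in for the Cassandra table and a filtered copy stands in for the cached result of an index-assisted read; `Visit` and `FilteredCacheDemo` are invented for illustration, and no Spark or connector API is involved.

```scala
// Toy illustration: plain collections stand in for a C* table and a Spark cache.
case class Visit(country: String, device: String)

object FilteredCacheDemo {
  val table: Vector[Visit] = Vector(
    Visit("US", "mobile"), Visit("US", "desktop"),
    Visit("NZ", "mobile"), Visit("NZ", "tablet")
  )

  // An index-assisted read brings back only the matching rows...
  val cached: Vector[Visit] = table.filter(_.country == "US")

  // ...so a later query with a different predicate sees an incomplete picture.
  def mobileCount(source: Seq[Visit]): Int = source.count(_.device == "mobile")
}
```

Counting mobile visits over the full table gives 2, but the same query against the cached subset gives 1, because the NZ rows were never loaded.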
68. Turns out this has been solved before!
Even Facebook uses Vertica.
69. MPP Databases
Easy writes plus fast queries, with constant transfers
Automatic query optimization by storing intermediate query
projections
Stonebraker et al. - C-Store paper (Brown Univ)