Breakthrough OLAP
Performance with
Cassandra and Spark
Evan Chan
September 2015
Who am I?
Distinguished Engineer, Tuplejump
@evanfchan
http://velvia.github.io
User and contributor to Spark since 0.9, Cassandra since 0.6
Co-creator and maintainer of Spark Job Server
About Tuplejump
Tuplejump is a big data technology leader providing solutions for
rapid insights from data.
Calliope - the first Spark-Cassandra integration
Stargate - an open source Lucene indexer for Cassandra
SnackFS - open source HDFS for Cassandra
Didn't I attend the same talk last year?
Similar title, but mostly new material
Will reveal new open source projects! :)
Problem Space
Need analytical database / queries on structured big data
Something SQL-like, very flexible and fast
Pre-aggregation too limiting
Fast data / constant updates
Ideally, want my queries to run over fresh data too
Example: Video analytics
Typical collection and analysis of consumer events
3 billion new events every day
Video publishers want updated stats, the sooner the better
Pre-aggregation only enables simple dashboard UIs
What if one wants to offer more advanced analysis, or a
generic data query API?
Eg, top countries filtered by device type, OS, browser
Requirements
Scalable - rules out PostgreSQL, etc.
Easy to update and ingest new data
Not traditional OLAP cubes - that's not what I'm talking
about
Very fast for analytical queries - OLAP not OLTP
Extremely flexible queries
Preferably open source
Parquet
Widely used, lots of support (Spark, Impala, etc.)
Problem: Parquet is read-optimized, not easy to use for writes
Cannot support idempotent writes
Optimized for writing very large chunks, not small updates
Not suitable for time series, IoT, etc.
Often needs multiple passes of jobs for compaction of small
files, deduplication, etc.
 
People really want a database-like abstraction, not a file format!
Cassandra
Horizontally scalable
Very flexible data modelling (lists, sets, custom data types)
Easy to operate
Perfect for ingestion of real time / machine data
Best of breed storage technology, huge community
BUT: Simple queries only
OLTP-oriented
Apache Spark
Horizontally scalable, in-memory queries
Functional Scala transforms - map, filter, groupBy, sort
etc.
SQL, machine learning, streaming, graph, R, many more plugins
all on ONE platform - feed your SQL results to a logistic
regression, easy!
Huge number of connectors with every single storage
technology
Spark provides the missing fast, deep
analytics piece of Cassandra!
...tying together fast event ingestion and rich deep
analytics!
How to make Spark and Cassandra
Go Fast
Spark on Cassandra: No Caching
Not Very Fast, but Real-Time Updates
Spark does no caching by default - you will always be reading
from C*!
Pros:
No need to fit all data in memory
Always get the latest data
Cons:
Pretty slow for ad hoc analytical queries - using regular CQL
tables
How to go Faster?
Read less data
Do less I/O
Make your computations faster
Spark as Cassandra's Cache
Caching a SQL Table from Cassandra
DataFrames support in Cassandra Connector 1.4.0 (and 1.3.0):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .option("table", "gdelt")
  .option("keyspace", "test").load()
df.registerTempTable("gdelt")
sqlContext.cacheTable("gdelt")
sqlContext.sql("SELECT count(monthyear) FROM gdelt").show()
 
How Spark SQL's Table Caching Works
Spark Cached Tables can be Really Fast
GDELT dataset, 4 million rows, 60 columns, localhost
Method secs
Uncached 317
Cached 0.38
 
Almost a 1000x speedup!
On an 8-node EC2 c3.XL cluster with 117 million rows, common
queries run in 1-2 seconds against the cached dataset.
Problems with Cached Tables
Still have to read the data from Cassandra first, which is slow
Amount of RAM: your entire data + extra for conversion to
cached table
Cached tables only live in Spark executors - by default
tied to single context - not HA
once any executor dies, must re-read data from C*
Caching takes time: convert from RDD[Row] to compressed
columnar format
Cannot easily combine new RDD[Row] with cached tables
(and keep speed)
Problems with Cached Tables
If you don't have enough RAM, Spark can cache your tables
partly to disk. This is still way, way, faster than scanning an entire
C* table. However, cached tables are still tied to a single Spark
context/application.
Also: rdd.cache() is NOT the same as SQLContext's
cacheTable!
Faster Queries Through Columnar Storage
Wait, I thought Cassandra was columnar?
How Cassandra stores your CQL Tables
Suppose you had this CQL table:
CREATE TABLE employees (  -- table name assumed; the slide omits it
  department text,
  empId text,
  first text,
  last text,
  age int,
  PRIMARY KEY (department, empId)
);
How Cassandra stores your CQL Tables
PartitionKey   01:first   01:last   01:age   02:first   02:last    02:age
Sales          Bob        Jones     34       Susan      O'Connor   40
Engineering    Dilbert    P         ?        Dogbert    Dog        1
 
Each row is stored contiguously. All columns in row 2 come after
row 1.
To analyze only age, C* still has to read every field.
Cassandra is really a row-based, OLTP-oriented datastore.
Unless you know how to use it otherwise :)
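A minimal sketch of why the row layout hurts an age-only query. It models the employee table from the slide in memory (the Dilbert row is left out because its age is unknown) and counts how many fields each layout has to touch. The object and method names are illustrative assumptions, not Cassandra internals.

```scala
object RowVsColumnScan {
  // Hypothetical in-memory model of the employee table from the slide.
  final case class Employee(department: String, empId: String,
                            first: String, last: String, age: Int)

  val rows: Vector[Employee] = Vector(
    Employee("Sales", "01", "Bob", "Jones", 34),
    Employee("Sales", "02", "Susan", "O'Connor", 40),
    Employee("Engineering", "02", "Dogbert", "Dog", 1)
  )

  // Row layout: every field of every row is read before age can be used.
  val fieldsPerRow = 5
  def rowScanFieldsTouched: Int = rows.size * fieldsPerRow

  // Column layout: the same data pivoted so that age is one contiguous array.
  val ageColumn: Array[Int] = rows.map(_.age).toArray
  def columnScanFieldsTouched: Int = ageColumn.length

  def averageAge: Double = ageColumn.sum.toDouble / ageColumn.length
}
```

With three rows the row scan touches 15 fields and the column scan touches 3; the gap grows with the number of columns.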
The traditional row-based data storage
approach is dead
- Michael Stonebraker
Columnar Storage
Name column
0 1
0 1
 
Dictionary: {0: "Barack", 1: "Hillary"}
 
Age column
0 1
46 66
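The name column above can be sketched as a dictionary-encoded vector: distinct strings are stored once and each row holds only a small integer code. This is a simplified illustration under assumed names (Encoded, encode), not the storage format of any particular system.

```scala
object DictionaryColumn {
  // Encoded column: a dictionary of distinct values plus one integer code per row.
  final case class Encoded(dictionary: Vector[String], codes: Array[Int]) {
    def apply(i: Int): String = dictionary(codes(i))
    def length: Int = codes.length
  }

  def encode(values: Seq[String]): Encoded = {
    val dictionary = values.distinct.toVector
    val index: Map[String, Int] = dictionary.zipWithIndex.toMap
    Encoded(dictionary, values.map(v => index(v)).toArray)
  }

  // Four rows, but only two distinct strings are stored.
  val names: Encoded = encode(Seq("Barack", "Hillary", "Barack", "Hillary"))
}
```

Numeric columns such as age need no dictionary and can be stored as a plain array of values.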
Columnar Format solves I/O
How much data can I query interactively? More than you think!
 
Columnar Storage Performance Study
http://github.com/velvia/cassandra-gdelt
Scenario       Ingest     Read all columns   Read one column
Narrow table   1927 sec   505 sec            504 sec
Wide table     3897 sec   365 sec            351 sec
Columnar       93 sec     8.6 sec            0.23 sec
 
On reads, using a columnar format is up to 2190x faster, while
ingestion is 20-40x faster.
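The quoted ratios follow directly from the timings in the table. A small worked check, with variable names chosen here for illustration:

```scala
object SpeedupCheck {
  // Timings in seconds, copied from the performance table.
  val narrowIngest = 1927.0
  val wideIngest = 3897.0
  val columnarIngest = 93.0
  val narrowReadOneColumn = 504.0
  val columnarReadOneColumn = 0.23

  def speedup(baseline: Double, improved: Double): Double = baseline / improved

  // Single-column read: narrow table vs columnar (about 2191x).
  val readSpeedup: Double = speedup(narrowReadOneColumn, columnarReadOneColumn)
  // Ingest: narrow vs columnar (about 20.7x) and wide vs columnar (about 41.9x).
  val ingestVsNarrow: Double = speedup(narrowIngest, columnarIngest)
  val ingestVsWide: Double = speedup(wideIngest, columnarIngest)
}
```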
Columnar Format solves Caching
Use the same format on disk, in cache, in memory scan
Caching works a lot better when the cached object is the
same!!
No data format dissonance means bringing in new bits of data
and combining with existing cached data is seamless
So, why isn't everybody doing this?
No columnar storage format designed to work with NoSQL
stores
Efficient conversion to/from columnar format a hard problem
Most infrastructure is still row oriented
Spark SQL/DataFrames based on RDD[Row]
Spark Catalyst is a row-oriented query optimizer
All hard work leads to profit, but mere talk leads
to poverty.
- Proverbs 14:23
Introducing FiloDB
Distributed. Versioned. Columnar. Built for Streaming.
 
github.com/tuplejump/FiloDB
FiloDB - What?
Distributed
Apache Cassandra. Scale out with no SPOF. Cross-datacenter
replication. Proven storage and database technology.
Versioned
Incrementally add a column or a few rows as a new version. Easily
control what versions to query. Roll back changes inexpensively.
Stream out new versions as continuous queries :)
Columnar
Parquet-style storage layout
Retrieve select columns and minimize I/O for analytical
queries
Add a new column without having to copy the whole table
Vectorization and lazy/zero serialization for extreme
efficiency
What's in the name?
Rich sweet layers of distributed, versioned database goodness
100% Reactive
Built completely on the Typesafe Platform:
Scala 2.10 and SBT
Spark (including custom data source)
Akka Actors for rational scale-out concurrency
Futures for I/O
Phantom Cassandra client for reactive, type-safe C* I/O
Typesafe Config
Spark SQL Queries!
CREATE TEMPORARY TABLE gdelt
USING filodb.spark
OPTIONS (dataset "gdelt");
SELECT Actor1Name, Actor2Name, AvgTone FROM gdelt ORDER BY AvgTone DESC LIMIT 15
Read into and write from Spark DataFrames
Append/merge to FiloDB table from Spark Streaming
Use Tableau or any other JDBC tool
FiloDB - Why?
Fast Streaming Data + Big Data, All in One!
Analytical Query Performance
Up to 200x Faster Queries for Spark on Cassandra 2.x
Parquet Performance with Cassandra Flexibility
(Stick around for the demo)
Fast Event/Time-Series Ad-Hoc Analytics
New rows appended via Kafka
Writes are idempotent - no need to dedup!
Converted to columnar chunks on ingest and stored in C*
Only necessary columnar chunks are read into Spark for
minimal I/O
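Why idempotent writes remove the need for deduplication can be sketched with a store keyed by primary key. This is a plain in-memory model under assumed names (Event, ingest); it is not FiloDB's ingestion API.

```scala
object IdempotentIngest {
  // Hypothetical event shape: partition key, sort key, and one measured value.
  final case class Event(entity: String, timestamp: Long, value: Double)

  type Store = Map[(String, Long), Double]

  // Upsert keyed by (entity, timestamp): writing the same event again changes nothing.
  def ingest(store: Store, events: Seq[Event]): Store =
    events.foldLeft(store) { (s, e) => s.updated((e.entity, e.timestamp), e.value) }

  val batch: Seq[Event] = Seq(Event("US-0123", 1L, 0.5), Event("NZ-9495", 1L, 0.7))

  val once: Store = ingest(Map.empty[(String, Long), Double], batch)
  // Replaying the batch, for example after a redelivery, yields the same state.
  val replayed: Store = ingest(once, batch)
}
```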
Fast Event/Time-Series Ad-Hoc Analytics
Entity Time1 Time2
US-0123 d1 d2
NZ-9495 d1 d2
 
Model your time series with FiloDB similarly to Cassandra:
Sort key: Timestamp, similar to clustering key
Partition Key: Event/machine entity
FiloDB keeps data sorted while stored in efficient columnar
storage.
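The partition key plus sort key model can be sketched with a map of sorted maps: one entry per entity, with rows kept ordered by timestamp so a time-range scan is cheap. The types and function names below are assumptions for illustration only.

```scala
import scala.collection.immutable.TreeMap

object TimeSeriesModel {
  // Partition key (entity) -> rows kept ordered by sort key (timestamp).
  type Dataset = Map[String, TreeMap[Long, String]]

  def append(ds: Dataset, entity: String, ts: Long, data: String): Dataset = {
    val partition = ds.getOrElse(entity, TreeMap.empty[Long, String])
    ds.updated(entity, partition.updated(ts, data))
  }

  // Range scan within one partition: rows with from <= timestamp < until.
  def scan(ds: Dataset, entity: String, from: Long, until: Long): Seq[String] =
    ds.get(entity).map(_.range(from, until).values.toSeq).getOrElse(Seq.empty)

  // Rows arrive out of order but are stored sorted.
  val ds: Dataset =
    Seq(("US-0123", 2L, "d2"), ("US-0123", 1L, "d1"), ("NZ-9495", 1L, "d1"))
      .foldLeft(Map.empty[String, TreeMap[Long, String]]) {
        case (acc, (entity, ts, data)) => append(acc, entity, ts, data)
      }
}
```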
FiloDB = Streaming + Columnar
Extend your Cassandra (2.x) Investment
Make it work for batch and ad-hoc analytics!
Simplify your Lambda Architecture...
(https://www.mapr.com/developercentral/lambda-architecture)
With Spark, Cassandra, and FiloDB
Ma, where did all the components go?
You mean I don't have to deal with Hadoop?
Use Cassandra as a front end to store IoT data first
FiloDB vs Parquet
Comparable read performance - with lots of space to improve
Assuming co-located Spark and Cassandra
On localhost, both subsecond for simple queries (GDELT
1979-1984)
FiloDB has more room to grow - due to hot column caching
and much less deserialization overhead
Lower memory requirement due to much smaller block sizes
Much better fit for IoT / Machine / Time-series applications
Idempotent writes by PK with no deduplication
Limited support for types
array / set / map support not there, but will be added
FiloDB - How?
FiloDB Architecture
Ingestion and Storage?
Current version:
Each dataset is stored using 2 regular Cassandra tables
Ingestion using Spark (Dataframes or SQL)
Future version?
Automatic ingestion of your existing C* data using custom
secondary index
Towards Extreme Query Performance
The filo project
Filo (http://github.com/velvia/filo) is a binary data vector library
designed for extreme read performance with minimal
deserialization costs.
Designed for NoSQL, not a file format
random or linear access
on or off heap
missing value support
Scala only, but cross-platform support possible
What is the ceiling?
This Scala loop can read integers from a binary Filo blob at a rate
of 2 billion integers per second - single threaded:
def sumAllInts(): Int = {
  var total = 0
  for { i <- 0 until numValues optimized } {
    total += sc(i)
  }
  total
}
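A self-contained sketch of the same idea using only java.nio: integers are packed into one binary blob and summed by absolute-offset reads, with no per-element objects created. The layout here (little-endian, 4 bytes per value, no header) is an assumption for illustration and is not Filo's actual wire format.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object BinaryIntVector {
  // Pack the values into one contiguous little-endian blob (4 bytes per Int).
  def build(values: Array[Int]): ByteBuffer = {
    val buf = ByteBuffer.allocate(values.length * 4).order(ByteOrder.LITTLE_ENDIAN)
    values.foreach(v => buf.putInt(v))
    buf.flip()
    buf
  }

  // Sum by reading each Int directly at its byte offset in the blob.
  def sumAllInts(buf: ByteBuffer): Int = {
    val numValues = buf.limit() / 4
    var total = 0
    var i = 0
    while (i < numValues) {
      total += buf.getInt(i * 4)
      i += 1
    }
    total
  }
}
```

A while loop is used here in place of the macro-optimized for comprehension so that the sketch needs no extra library.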
Vectorization of Spark Queries
The Tungsten project.
Process many elements from the same column at once, keep data
in L1/L2 cache.
Coming in Spark 1.4 through 1.6
Hot Column Caching in Tachyon
Has a "table" feature, originally designed for Shark
Keep hot columnar chunks in shared off-heap memory for fast
access
FiloDB - Roadmap
Support for many more data types and sort and partition keys -
please give us your input!
Non-Spark ingestion API. Your input is again needed.
In-memory caching for significant query speedup
Projections. Often-repeated queries can be sped up
significantly with projections.
Use of GPU and SIMD instructions to speed up queries
You can help!
Send me your use cases for fast big data analysis on Spark and
Cassandra
Especially IoT, Event, Time-Series
What is your data model?
Email if you want to contribute
Thanks...
to the entire OSS community, but in particular:
Lee Mighdoll, Nest/Google
Rohit Rai and Satya B., Tuplejump
My colleagues at Socrata
 
If you want to go fast, go alone. If you want to go
far, go together.
-- African proverb
DEMO TIME
GDELT: Regular C* Tables vs FiloDB
Extra Slides
The scenarios
Global Database of Events, Language, and Tone (GDELT) dataset
1979 to now
60 columns, 250 million+ rows, 250GB+
Let's compare Cassandra I/O only, no caching or Spark
Narrow table - CQL table with one row per partition key
Wide table - wide rows with 10,000 logical rows per partition
key
Columnar layout - 1000 rows per columnar chunk, wide rows,
with dictionary compression
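The columnar layout scenario can be sketched as grouping rows into fixed-size chunks and pivoting each chunk into per-column vectors. The two-column Row type and the toChunks helper are hypothetical, and dictionary compression is left out to keep the sketch short.

```scala
object ColumnarChunks {
  // Hypothetical two-column row; the benchmark dataset has 60 columns.
  final case class Row(actor: String, avgTone: Double)
  final case class Chunk(actors: Vector[String], avgTones: Vector[Double])

  // Split rows into fixed-size groups and pivot each group into column vectors.
  def toChunks(rows: Seq[Row], rowsPerChunk: Int = 1000): Vector[Chunk] =
    rows.grouped(rowsPerChunk)
      .map(g => Chunk(g.map(_.actor).toVector, g.map(_.avgTone).toVector))
      .toVector

  val rows: Seq[Row] = (0 until 2500).map(i => Row(s"actor-$i", i.toDouble))
  val chunks: Vector[Chunk] = toChunks(rows)
}
```

A query that needs only avgTone would then read one vector per chunk and skip the rest.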
First 4 million rows, localhost, SSD, C* 2.0.9, LZ4 compression.
Compaction performed before read benchmarks.
Disk space usage
Scenario Disk used
Narrow table 2.7 GB
Wide table 1.6 GB
Columnar 0.34 GB
The disk space usage helps explain some of the numbers.
Connecting Spark to Cassandra
DataStax's Spark Cassandra Connector
Tuplejump Calliope
 
Get started in one line with spark-shell!
bin/spark-shell \
  --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3 \
  --conf spark.cassandra.connection.host=127.0.0.1
What about C* Secondary Indexing?
Spark-Cassandra Connector and Calliope can both reduce I/O by
using Cassandra secondary indices. Does this work with caching?
No, not really, because only the filtered rows would be cached.
Subsequent queries against this limited cached table would not
give you expected results.
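The pitfall can be shown with plain collections standing in for the Cassandra table and the Spark cache. All names here are illustrative; no connector API is involved.

```scala
object FilteredCachePitfall {
  final case class Event(country: String, device: String)

  val table: Seq[Event] =
    Seq(Event("US", "phone"), Event("US", "tablet"), Event("NZ", "phone"))

  // First query pushes a filter down (as a secondary index would) and caches the result.
  val cached: Seq[Event] = table.filter(_.device == "phone")

  // A later query with a different predicate is answered from the cache.
  def countFromCache(country: String): Int = cached.count(_.country == country)
  def countFromTable(country: String): Int = table.count(_.country == country)
}
```

Counting US events from the cache gives 1, while the full table holds 2, because the tablet row never made it into the cache.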
Turns out this has been solved before!
Even Facebook uses Vertica.
MPP Databases
Easy writes plus fast queries, with constant transfers
Automatic query optimization by storing intermediate query
projections
Stonebraker et al. - CStore paper (Brown Univ)
When in doubt, use brute force
- Ken Thompson

More Related Content

What's hot

Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsTimothy Spann
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and SparkEvan Chan
 
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016Anya Bida
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Helena Edelson
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Summit
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Robert "Chip" Senkbeil
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleHelena Edelson
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksLegacy Typesafe (now Lightbend)
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...Simon Ambridge
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisHelena Edelson
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Sparkrhatr
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Spark, Python and Parquet
Spark, Python and Parquet Spark, Python and Parquet
Spark, Python and Parquet odsc
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016 Hiromitsu Komatsu
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAmazon Web Services
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 

What's hot (20)

Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDsApache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
Apache Spark 1.6 with Zeppelin - Transformations and Actions on RDDs
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
 
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Spark, Python and Parquet
Spark, Python and Parquet Spark, Python and Parquet
Spark, Python and Parquet
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 

Viewers also liked

Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkDataStax Academy
 
Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsDave Gardner
 
World’s Best Data Modeling Tool
World’s Best Data Modeling ToolWorld’s Best Data Modeling Tool
World’s Best Data Modeling ToolArtem Chebotko
 
BDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using ImpalaBDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using ImpalaDavid Lauzon
 
Overiew of Cassandra and Doradus
Overiew of Cassandra and DoradusOveriew of Cassandra and Doradus
Overiew of Cassandra and Doradusrandyguck
 
Extending Cassandra with Doradus OLAP for High Performance Analytics
Extending Cassandra with Doradus OLAP for High Performance AnalyticsExtending Cassandra with Doradus OLAP for High Performance Analytics
Extending Cassandra with Doradus OLAP for High Performance Analyticsrandyguck
 
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...randyguck
 
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...DataStax Academy
 
DataStax: The Whys of NoSQL
DataStax: The Whys of NoSQLDataStax: The Whys of NoSQL
DataStax: The Whys of NoSQLDataStax Academy
 
Petabridge: The New .NET Enterprise Stack
Petabridge: The New .NET Enterprise StackPetabridge: The New .NET Enterprise Stack
Petabridge: The New .NET Enterprise StackDataStax Academy
 
DataStax: Making a Difference with Smart Analytics
DataStax: Making a Difference with Smart AnalyticsDataStax: Making a Difference with Smart Analytics
DataStax: Making a Difference with Smart AnalyticsDataStax Academy
 
DataStax: Steps to successfully implementing NoSQL in the enterprise
DataStax: Steps to successfully implementing NoSQL in the enterpriseDataStax: Steps to successfully implementing NoSQL in the enterprise
DataStax: Steps to successfully implementing NoSQL in the enterpriseDataStax Academy
 
DataStax: Setting Your Database Management on Autopilot with OpsCenter
DataStax: Setting Your Database Management on Autopilot with OpsCenterDataStax: Setting Your Database Management on Autopilot with OpsCenter
DataStax: Setting Your Database Management on Autopilot with OpsCenterDataStax Academy
 
DataStax: Ramping up Cassandra QA
DataStax: Ramping up Cassandra QADataStax: Ramping up Cassandra QA
DataStax: Ramping up Cassandra QADataStax Academy
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying SparkDatabricks
 
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra MigrationInfosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra MigrationDataStax Academy
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Lillian Pierson
 
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksSlim Baltagi
 

Viewers also liked (20)

Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
 
Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patterns
 
World’s Best Data Modeling Tool
World’s Best Data Modeling ToolWorld’s Best Data Modeling Tool
World’s Best Data Modeling Tool
 
data-modeling-paper
data-modeling-paperdata-modeling-paper
data-modeling-paper
 
Chapter-13-solutions
Chapter-13-solutionsChapter-13-solutions
Chapter-13-solutions
 
BDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using ImpalaBDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using Impala
 
Overiew of Cassandra and Doradus
Overiew of Cassandra and DoradusOveriew of Cassandra and Doradus
Overiew of Cassandra and Doradus
 
Extending Cassandra with Doradus OLAP for High Performance Analytics
Extending Cassandra with Doradus OLAP for High Performance AnalyticsExtending Cassandra with Doradus OLAP for High Performance Analytics
Extending Cassandra with Doradus OLAP for High Performance Analytics
 
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
 
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
 
DataStax: The Whys of NoSQL
DataStax: The Whys of NoSQLDataStax: The Whys of NoSQL
DataStax: The Whys of NoSQL
 
Petabridge: The New .NET Enterprise Stack
Petabridge: The New .NET Enterprise StackPetabridge: The New .NET Enterprise Stack
Petabridge: The New .NET Enterprise Stack
 
DataStax: Making a Difference with Smart Analytics
DataStax: Making a Difference with Smart AnalyticsDataStax: Making a Difference with Smart Analytics
DataStax: Making a Difference with Smart Analytics
 
DataStax: Steps to successfully implementing NoSQL in the enterprise
DataStax: Steps to successfully implementing NoSQL in the enterpriseDataStax: Steps to successfully implementing NoSQL in the enterprise
DataStax: Steps to successfully implementing NoSQL in the enterprise
 
DataStax: Setting Your Database Management on Autopilot with OpsCenter
DataStax: Setting Your Database Management on Autopilot with OpsCenterDataStax: Setting Your Database Management on Autopilot with OpsCenter
DataStax: Setting Your Database Management on Autopilot with OpsCenter
 
DataStax: Ramping up Cassandra QA
DataStax: Ramping up Cassandra QADataStax: Ramping up Cassandra QA
DataStax: Ramping up Cassandra QA
 
Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying Spark
 
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra MigrationInfosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
 
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
 

TupleJump: Breakthrough OLAP performance on Cassandra and Spark

  • 1. Breakthrough OLAP Performance with Cassandra and Spark Evan Chan September 2015
  • 2. Who am I? Distinguished Engineer, Tuplejump. @evanfchan, http://velvia.github.io. User and contributor to Spark since 0.9, Cassandra since 0.6. Co-creator and maintainer of Spark Job Server.
  • 3. About Tuplejump: Tuplejump is a big data technology leader providing solutions for rapid insights from data. Calliope - the first Spark-Cassandra integration. Stargate - an open source Lucene indexer for Cassandra. SnackFS - open source HDFS for Cassandra.
  • 4. Didn't I attend the same talk last year? Similar title, but mostly new material Will reveal new open source projects! :)
  • 5. Problem Space Need analytical database / queries on structured big data Something SQL-like, very flexible and fast Pre-aggregation too limiting Fast data / constant updates Ideally, want my queries to run over fresh data too
  • 6. Example: Video analytics Typical collection and analysis of consumer events 3 billion new events every day Video publishers want updated stats, the sooner the better Pre-aggregation only enables simple dashboard UIs What if one wants to offer more advanced analysis, or a generic data query API? Eg, top countries filtered by device type, OS, browser
  • 7. Requirements Scalable - rules out PostgreSQL, etc. Easy to update and ingest new data Not traditional OLAP cubes - that's not what I'm talking about Very fast for analytical queries - OLAP not OLTP Extremely flexible queries Preferably open source
  • 8. Parquet Widely used, lots of support (Spark, Impala, etc.) Problem: Parquet is read-optimized, not easy to use for writes Cannot support idempotent writes Optimized for writing very large chunks, not small updates Not suitable for time series, IoT, etc. Often needs multiple passes of jobs for compaction of small files, deduplication, etc.   People really want a database-like abstraction, not a file format!
  • 9. Cassandra Horizontally scalable Very flexible data modelling (lists, sets, custom data types) Easy to operate Perfect for ingestion of real time / machine data Best of breed storage technology, huge community BUT: Simple queries only OLTP-oriented
  • 10. Apache Spark Horizontally scalable, in-memory queries Functional Scala transforms - map, filter, groupBy, sort etc. SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy! Huge number of connectors with every single storage technology
  • 11. Spark provides the missing fast, deep analytics piece of Cassandra! ...tying together fast event ingestion and rich deep analytics!
  • 12. How to make Spark and Cassandra Go Fast
  • 13. Spark on Cassandra: No Caching
  • 14. Not Very Fast, but Real-Time Updates Spark does no caching by default - you will always be reading from C*! Pros: No need to fit all data in memory Always get the latest data Cons: Pretty slow for ad hoc analytical queries - using regular CQL tables
  • 15. How to go Faster? Read less data Do less I/O Make your computations faster
  • 17. Caching a SQL Table from Cassandra
    DataFrames support in Cassandra Connector 1.4.0 (and 1.3.0):

        val sqlContext = new org.apache.spark.sql.SQLContext(sc)
        val df = sqlContext.read
          .format("org.apache.spark.sql.cassandra")
          .option("table", "gdelt")
          .option("keyspace", "test")
          .load()
        df.registerTempTable("gdelt")
        sqlContext.cacheTable("gdelt")
        sqlContext.sql("SELECT count(monthyear) FROM gdelt").show()
  • 18. How Spark SQL's Table Caching Works
  • 19. Spark Cached Tables can be Really Fast
    GDELT dataset, 4 million rows, 60 columns, localhost

        Method     secs
        Uncached   317
        Cached     0.38

    Almost a 1000x speedup! On an 8-node EC2 c3.XL cluster, 117 million rows, can run common queries in 1-2 seconds against the cached dataset.
  • 20. Problems with Cached Tables Still have to read the data from Cassandra first, which is slow. Amount of RAM: your entire data + extra for conversion to cached table. Cached tables only live in Spark executors - by default tied to a single context - not HA. Once any executor dies, must re-read data from C*. Caching takes time: convert from RDD[Row] to compressed columnar format. Cannot easily combine new RDD[Row] with cached tables (and keep speed).
  • 21. Problems with Cached Tables If you don't have enough RAM, Spark can cache your tables partly to disk. This is still way, way faster than scanning an entire C* table. However, cached tables are still tied to a single Spark context/application. Also: rdd.cache() is NOT the same as SQLContext's cacheTable!
  • 22. Faster Queries Through Columnar Storage Wait, I thought Cassandra was columnar?
  • 23. How Cassandra stores your CQL Tables
    Suppose you had this CQL table:

        CREATE TABLE employees (   -- table name is missing in the source; "employees" is a placeholder
          department text,
          empId text,
          first text,
          last text,
          age int,
          PRIMARY KEY (department, empId)
        );

  • 24. How Cassandra stores your CQL Tables

        PartitionKey   01:first   01:last   01:age   02:first   02:last    02:age
        Sales          Bob        Jones     34       Susan      O'Connor   40
        Engineering    Dilbert    P         ?        Dogbert    Dog        1

    Each row is stored contiguously. All columns in row 2 come after row 1. To analyze only age, C* still has to read every field.
  • 25. Cassandra is really a row-based, OLTP-oriented datastore. Unless you know how to use it otherwise :)
  • 26. The traditional row-based data storage approach is dead - Michael Stonebraker
  • 27. Columnar Storage

        Name column
          row:    0    1
          code:   0    1
        Dictionary: {0: "Barak", 1: "Hillary"}

        Age column
          row:    0    1
          age:    46   66
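The dictionary-encoded name column on the slide above can be sketched in a few lines of plain Scala. This is only an illustration of the idea, not FiloDB's or Filo's actual format; `DictColumn` and `encode` are hypothetical names introduced here.

```scala
object DictionaryColumnSketch {
  // Hypothetical type: a string column stored as small integer codes
  // plus a code -> value dictionary, as on the slide.
  final case class DictColumn(codes: Array[Int], dictionary: Map[Int, String]) {
    // Look up the original string for a given row.
    def apply(row: Int): String = dictionary(codes(row))
  }

  // Assign codes in order of first appearance.
  def encode(values: Seq[String]): DictColumn = {
    val codeOf: Map[String, Int] = values.distinct.zipWithIndex.toMap
    DictColumn(values.map(codeOf).toArray, codeOf.map(_.swap))
  }

  def main(args: Array[String]): Unit = {
    val names = encode(Seq("Barak", "Hillary"))
    val ages  = Array(46, 66) // numeric columns are kept as plain values

    println(names.codes.mkString(",")) // prints 0,1
    println(names(1))                  // prints Hillary
    println(ages.sum.toDouble / ages.length) // average age, read from one column only
  }
}
```

Repeated strings cost one small integer per row, and a query over `ages` never touches the name column at all.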
  • 28. Columnar Format solves I/O How much data can I query interactively? More than you think!
  • 29. Columnar Storage Performance Study
    http://github.com/velvia/cassandra-gdelt

        Scenario       Ingest     Read all columns   Read one column
        Narrow table   1927 sec   505 sec            504 sec
        Wide table     3897 sec   365 sec            351 sec
        Columnar       93 sec     8.6 sec            0.23 sec

    On reads, using a columnar format is up to 2190x faster, while ingestion is 20-40x faster.
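The speedup figures quoted on this slide follow directly from the table. A small worked calculation, using only the timings shown (the `Timing` case class and `speedup` helper are names made up for this sketch):

```scala
object SpeedupArithmetic {
  // Timings in seconds, copied from the performance study table.
  final case class Timing(ingest: Double, readAll: Double, readOne: Double)

  val narrow   = Timing(1927, 505, 504)
  val wide     = Timing(3897, 365, 351)
  val columnar = Timing(93, 8.6, 0.23)

  // How many times faster `improved` is than `baseline`.
  def speedup(baseline: Double, improved: Double): Double = baseline / improved

  def main(args: Array[String]): Unit = {
    // Reading one column: narrow table vs columnar (about 2191x)
    println(f"read one column: ${speedup(narrow.readOne, columnar.readOne)}%.0fx")
    // Ingestion: narrow and wide tables vs columnar (about 21x and 42x)
    println(f"ingest vs narrow: ${speedup(narrow.ingest, columnar.ingest)}%.1fx")
    println(f"ingest vs wide:   ${speedup(wide.ingest, columnar.ingest)}%.1fx")
  }
}
```

The largest read ratio comes from the single-column case, where the columnar layout skips every other column.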
  • 30. Columnar Format solves Caching Use the same format on disk, in cache, in memory scan Caching works a lot better when the cached object is the same!! No data format dissonance means bringing in new bits of data and combining with existing cached data is seamless
  • 31. So, why isn't everybody doing this? No columnar storage format designed to work with NoSQL stores Efficient conversion to/from columnar format a hard problem Most infrastructure is still row oriented Spark SQL/DataFrames based on RDD[Row] Spark Catalyst is a row-oriented query parser
  • 32. All hard work leads to profit, but mere talk leads to poverty. - Proverbs 14:23
  • 33.
  • 34. Introducing FiloDB Distributed. Versioned. Columnar. Built for Streaming.   github.com/tuplejump/FiloDB
  • 36. Distributed Apache Cassandra. Scale out with no SPOF. Cross-datacenter replication. Proven storage and database technology.
  • 37. Versioned Incrementally add a column or a few rows as a new version. Easily control what versions to query. Roll back changes inexpensively. Stream out new versions as continuous queries :)
  • 38. Columnar Parquet-style storage layout Retrieve select columns and minimize I/O for analytical queries Add a new column without having to copy the whole table Vectorization and lazy/zero serialization for extreme efficiency
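A minimal in-memory sketch of why a columnar layout makes "retrieve select columns" and "add a new column without copying the table" cheap. This is not FiloDB's storage code; `ColumnTable`, `select`, and `withColumn` are hypothetical names, and the sample rows are the Sales employees from the earlier CQL example.

```scala
object ColumnProjectionSketch {
  sealed trait Column
  final case class IntColumn(values: Vector[Int]) extends Column
  final case class StringColumn(values: Vector[String]) extends Column

  // A table is a set of named columns, each stored as its own vector.
  final case class ColumnTable(columns: Map[String, Column]) {
    // Projection touches only the requested column vectors.
    def select(names: String*): ColumnTable =
      ColumnTable(names.map(n => n -> columns(n)).toMap)

    // Adding a column reuses the existing vectors; nothing is rewritten.
    def withColumn(name: String, column: Column): ColumnTable =
      ColumnTable(columns + (name -> column))
  }

  // An analytical query over a single column.
  def averageAge(table: ColumnTable): Double = table.columns("age") match {
    case IntColumn(values) => values.sum.toDouble / values.length
    case other             => sys.error(s"age is not an integer column: $other")
  }

  val sample: ColumnTable = ColumnTable(Map(
    "first" -> StringColumn(Vector("Bob", "Susan")),
    "last"  -> StringColumn(Vector("Jones", "O'Connor")),
    "age"   -> IntColumn(Vector(34, 40))
  ))

  def main(args: Array[String]): Unit = {
    println(averageAge(sample.select("age"))) // prints 37.0
    val extended = sample.withColumn("department", StringColumn(Vector("Sales", "Sales")))
    println(extended.columns.keys.toList.sorted.mkString(","))
  }
}
```

In a row-oriented layout the same average would require reading first and last names as well, which is the I/O the slide is trying to avoid.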
  • 39. What's in the name? Rich sweet layers of distributed, versioned database goodness
  • 40. 100% Reactive Built completely on the Typesafe Platform: Scala 2.10 and SBT Spark (including custom data source) Akka Actors for rational scale-out concurrency Futures for I/O Phantom Cassandra client for reactive, type-safe C* I/O Typesafe Config
  • 41. Spark SQL Queries!

        CREATE TEMPORARY TABLE gdelt
        USING filodb.spark
        OPTIONS (dataset "gdelt");

        SELECT Actor1Name, Actor2Name, AvgTone FROM gdelt ORDER BY AvgTone DESC LIMIT 15

    Read from and write to Spark DataFrames. Append/merge to FiloDB table from Spark Streaming. Use Tableau or any other JDBC tool.
  • 42. FiloDB - Why? Fast Streaming Data + Big Data, All in One!
  • 43. Analytical Query Performance Up to 200x Faster Queries for Spark on Cassandra 2.x Parquet Performance with Cassandra Flexibility (Stick around for the demo)
  • 44. Fast Event/Time-Series Ad-Hoc Analytics New rows appended via Kafka Writes are idempotent - no need to dedup! Converted to columnar chunks on ingest and stored in C* Only necessary columnar chunks are read into Spark for minimal I/O
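To illustrate what "writes are idempotent" buys you, here is a small sketch in which a write is an upsert keyed by primary key, so replaying a batch (for example after a Kafka redelivery) leaves the store unchanged. The `Event` shape and the in-memory `Store` are assumptions for illustration, not FiloDB's ingestion API.

```scala
object IdempotentWriteSketch {
  // Hypothetical event shape: primary key is (entity, timestamp).
  final case class Event(entity: String, timestamp: Long, value: Double)

  type Store = Map[(String, Long), Double]

  // A write is an upsert by primary key, so duplicates overwrite themselves.
  def write(store: Store, events: Seq[Event]): Store =
    events.foldLeft(store) { (s, e) => s.updated((e.entity, e.timestamp), e.value) }

  def main(args: Array[String]): Unit = {
    val batch = Seq(Event("US-0123", 1L, 0.5), Event("NZ-9495", 1L, 0.7))
    val once  = write(Map.empty, batch)
    val twice = write(once, batch) // same batch delivered again
    println(once == twice)         // prints true: no dedup pass needed
  }
}
```

Contrast this with appending to files, where a replayed batch produces duplicate rows that a later job has to remove.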
  • 45. Fast Event/Time-Series Ad-Hoc Analytics Entity: Time1, Time2. US-0123: d1, d2. NZ-9495: d1, d2. Model your time series with FiloDB similarly to Cassandra: Sort key: Timestamp, similar to clustering key. Partition key: Event/machine entity. FiloDB keeps data sorted while stored in efficient columnar storage.
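The partition key / sort key model above can be sketched with a sorted map per entity. This is an illustrative in-memory model only; `Dataset`, `write`, and `scan` are hypothetical names, not FiloDB or Cassandra calls. Rows within a partition stay ordered by timestamp, so a time-range query is a contiguous scan.

```scala
import scala.collection.immutable.TreeMap

// Hypothetical sketch of the data model: partition key -> rows sorted by sort key.
object TimeSeriesModel {
  final case class Dataset(partitions: Map[String, TreeMap[Long, Double]] = Map.empty) {
    // Insert a value under (entity, timestamp); the TreeMap keeps timestamps ordered.
    def write(entity: String, timestamp: Long, value: Double): Dataset = {
      val partition = partitions.getOrElse(entity, TreeMap.empty[Long, Double])
      copy(partitions = partitions.updated(entity, partition.updated(timestamp, value)))
    }

    // Range scan within one partition: timestamps in [from, until).
    def scan(entity: String, from: Long, until: Long): Seq[(Long, Double)] =
      partitions.get(entity).map(_.range(from, until).toSeq).getOrElse(Seq.empty)
  }
}
```

Writes may arrive out of order; `scan` still returns rows in timestamp order because ordering is maintained at write time.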
  • 46. FiloDB = Streaming + Columnar
  • 47. Extend your Cassandra (2.x) Investment Make it work for batch and ad-hoc analytics!
  • 48. Simplify your Lambda Architecture... (https://www.mapr.com/developercentral/lambda-architecture)
  • 49. With Spark, Cassandra, and FiloDB Ma, where did all the components go? You mean I don't have to deal with Hadoop? Use Cassandra as a front end to store IoT data first
  • 50. FiloDB vs Parquet Comparable read performance - with lots of space to improve Assuming co-located Spark and Cassandra On localhost, both subsecond for simple queries (GDELT 1979-1984) FiloDB has more room to grow - due to hot column caching and much less deserialization overhead Lower memory requirement due to much smaller block sizes Much better fit for IoT / Machine / Time-series applications Idempotent writes by PK with no deduplication Limited support for types: array / set / map support is not there yet, but will be added
  • 53. Ingestion and Storage? Current version: Each dataset is stored using 2 regular Cassandra tables Ingestion using Spark (Dataframes or SQL) Future version? Automatic ingestion of your existing C* data using custom secondary index
  • 54. Towards Extreme Query Performance
  • 55. The filo project is a binary data vector library designed for extreme read performance with minimal deserialization costs. http://github.com/velvia/filo Designed for NoSQL, not a file format random or linear access on or off heap missing value support Scala only, but cross-platform support possible
  • 56. What is the ceiling? This Scala loop can read integers from a binary Filo blob at a rate of 2 billion integers per second - single threaded: def sumAllInts(): Int = { var total = 0; for { i <- 0 until numValues optimized } { total += sc(i) }; total }
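The loop on this slide depends on names defined elsewhere (`sc`, `numValues`, and the `optimized` loop modifier, which presumably comes from a loop-optimization macro). A self-contained approximation, assuming a plain little-endian `ByteBuffer` of 32-bit ints as a stand-in for a Filo blob, might look like the sketch below. It substitutes a `while` loop for `optimized` and does not use the filo library, so it illustrates the zero-deserialization idea rather than reproducing the benchmark.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical stand-in for a binary int vector; not the filo library's API.
object IntBlobSum {
  // Pack ints into a little-endian binary blob.
  def encode(values: Array[Int]): ByteBuffer = {
    val buf = ByteBuffer.allocate(values.length * 4).order(ByteOrder.LITTLE_ENDIAN)
    values.foreach(v => buf.putInt(v))
    buf.flip()
    buf
  }

  // Sum straight from the blob with absolute reads: no per-element objects are created.
  def sumAllInts(blob: ByteBuffer): Int = {
    val numValues = blob.limit() / 4
    var total = 0
    var i = 0
    while (i < numValues) {
      total += blob.getInt(i * 4)
      i += 1
    }
    total
  }
}
```

For example, `IntBlobSum.sumAllInts(IntBlobSum.encode(Array(1, 2, 3, 4)))` evaluates to 10. Throughput will depend on the JVM and hardware; the 2 billion integers per second figure belongs to the original Filo loop, not to this sketch.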
  • 57. Vectorization of Spark Queries The Tungsten project. Process many elements from the same column at once, keep data in L1/L2 cache. Coming in Spark 1.4 through 1.6
  • 58. Hot Column Caching in Tachyon Has a "table" feature, originally designed for Shark Keep hot columnar chunks in shared off-heap memory for fast access
  • 59. FiloDB - Roadmap Support for many more data types and sort and partition keys - please give us your input! Non-Spark ingestion API. Your input is again needed. In-memory caching for significant query speedup Projections. Often-repeated queries can be sped up significantly with projections. Use of GPU and SIMD instructions to speed up queries
  • 60. You can help! Send me your use cases for fast big data analysis on Spark and Cassandra Especially IoT, Event, Time-Series What is your data model? Email if you want to contribute
  • 61. Thanks... to the entire OSS community, but in particular: Lee Mighdoll, Nest/Google Rohit Rai and Satya B., Tuplejump My colleagues at Socrata   If you want to go fast, go alone. If you want to go far, go together. -- African proverb
  • 62. DEMO TIME GDELT: Regular C* Tables vs FiloDB
  • 64. The scenarios Global Database of Events, Language, and Tone (GDELT) dataset, 1979 to now. 60 columns, 250 million+ rows, 250GB+. Let's compare Cassandra I/O only, no caching or Spark. Narrow table - CQL table with one row per partition key. Wide table - wide rows with 10,000 logical rows per partition key. Columnar layout - 1000 rows per columnar chunk, wide rows, with dictionary compression. First 4 million rows, localhost, SSD, C* 2.0.9, LZ4 compression. Compaction performed before read benchmarks.
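The columnar scenario above combines fixed-size chunks with dictionary compression. A minimal sketch of that idea for a string column follows; `EncodedChunk` and `encodeColumn` are hypothetical names, and the encoding actually used in the benchmark may differ. Each chunk stores its distinct values once plus one small integer code per row, which is why low-cardinality columns shrink considerably.

```scala
// Hypothetical sketch of per-chunk dictionary encoding for a string column.
object DictionaryEncoding {
  final case class EncodedChunk(dictionary: Vector[String], codes: Array[Int]) {
    // Rebuild the original values by looking each code up in the dictionary.
    def decode: Vector[String] = codes.iterator.map(dictionary).toVector
  }

  // Store each distinct value once; every row keeps only an index into the dictionary.
  def encode(values: Seq[String]): EncodedChunk = {
    val dictionary = values.distinct.toVector
    val index = dictionary.zipWithIndex.toMap
    EncodedChunk(dictionary, values.iterator.map(index).toArray)
  }

  // Split a column into fixed-size chunks and encode each one independently.
  def encodeColumn(values: Seq[String], rowsPerChunk: Int = 1000): Vector[EncodedChunk] =
    values.grouped(rowsPerChunk).map(chunk => encode(chunk)).toVector
}
```

With a country-code column, for instance, a 1000-row chunk might hold a few dozen dictionary entries and 1000 integer codes instead of 1000 repeated strings.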
  • 65. Disk space usage (scenario: disk used) Narrow table: 2.7 GB. Wide table: 1.6 GB. Columnar: 0.34 GB. The disk space usage helps explain some of the numbers.
  • 66. Connecting Spark to Cassandra Datastax's Spark Cassandra Connector Tuplejump Calliope   Get started in one line with spark-shell! bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M3 --conf spark.cassandra.connection.host=127.0.0.1
  • 67. What about C* Secondary Indexing? Spark-Cassandra Connector and Calliope can both reduce I/O by using Cassandra secondary indices. Does this work with caching? No, not really, because only the filtered rows would be cached. Subsequent queries against this limited cached table would not give you expected results.
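The caching problem described above can be shown with a toy example. The `Row` type and the data are invented for illustration. If an index-based filter is pushed down to the store, only the matching rows reach the cache, and a later query with a different predicate run against that cache silently undercounts.

```scala
// Hypothetical illustration of caching a table that was filtered at the source.
object FilteredCachePitfall {
  final case class Row(country: String, device: String)

  val table: Seq[Row] = Seq(
    Row("US", "phone"),
    Row("US", "tablet"),
    Row("NZ", "phone")
  )

  // Pushdown via a secondary index on country: only matching rows are read and cached.
  val cached: Seq[Row] = table.filter(_.country == "US")

  // A later query with a different predicate.
  def countPhones(rows: Seq[Row]): Int = rows.count(_.device == "phone")
}
```

Here `countPhones(table)` is 2 but `countPhones(cached)` is 1, because the NZ phone row was never read into the cache.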
  • 68. Turns out this has been solved before! Even Facebook uses Vertica.
  • 69. MPP Databases Easy writes plus fast queries, with constant transfers Automatic query optimization by storing intermediate query projections Stonebraker et al. - CStore paper (Brown Univ)
  • 70. When in doubt, use brute force - Ken Thompson