Massively scalable, always on, and ridiculously fast: Apache Cassandra is the database chosen by Apple, Netflix, and 30 of the Fortune 100 to power their critical infrastructure. How do we analyze petabytes of data, whether in massive batches or as it is ingested via streaming with Apache Kafka? Enter Apache Spark. Challenging MapReduce head on, Apache Spark offers powerful constructs for slicing and dicing your data: machine learning, graph queries, and transformations familiar to people with functional programming backgrounds, such as map, filter, and reduce. Step away ready to rock with the most powerful distributed database, scalable messaging, and analytics platform on the planet.
Watch the video here
https://www.youtube.com/watch?v=X-FKmKc9hkI
2. Small Data
• 100s of MB to low GB, single user
• sed, awk, grep are great
• Python & Ruby
• sqlite
• Iterate quickly
• Limitations:
• bad for multiple concurrent users (file sharing!)
3. Medium Data
• Fits on 1 machine
• Most current web apps
• Content-driven web apps (answerbag.com)
• RDBMS is fine
• postgres
• mysql
• Supports hundreds of concurrent users
• ACID makes us feel good
• Scales vertically
5. Replication: ACID is a lie
[Diagram: Client, Master, Slave; replication lag between master and slave]
Consistent results? Nope!
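The stale read on this slide can be sketched in a few lines of Python. This is a toy model, not how any real database replicates: the `Master` and `Slave` classes and the explicit `replicate()` step are hypothetical stand-ins for asynchronous replication.

```python
class Slave:
    def __init__(self):
        self.data = {}


class Master:
    def __init__(self, slave):
        self.data = {}
        self.slave = slave
        self.pending = []  # writes acknowledged but not yet shipped to the slave

    def write(self, key, value):
        # The client gets its acknowledgement as soon as the master has the write.
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Hypothetical replication step; in real systems this runs asynchronously.
        for key, value in self.pending:
            self.slave.data[key] = value
        self.pending.clear()


slave = Slave()
master = Master(slave)

master.write("name", "jon")
stale = slave.data.get("name")  # read hits the slave during the replication lag
master.replicate()
fresh = slave.data.get("name")  # same read after replication has caught up
```

Here `stale` is `None` and `fresh` is `"jon"`: the write was acknowledged, yet a read routed to the slave in between does not see it.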
6. Third Normal Form Doesn't Scale
• Queries are unpredictable
• Users are impatient
• If data > memory, you = history
• Disk seeks are the worst
• Data must be denormalized
(SELECT
CONCAT(city_name,', ',region) value,
latitude,
longitude,
id,
population,
( 3959 * acos( cos( radians($latitude) ) *
cos( radians( latitude ) ) * cos( radians( longitude )
- radians($longitude) ) + sin( radians($latitude) ) *
sin( radians( latitude ) ) ) )
AS distance,
CASE region
WHEN '$region' THEN 1
ELSE 0
END AS region_match
FROM `cities`
$where and foo_count > 5
ORDER BY region_match desc, foo_count desc
limit 0, 11)
UNION
(SELECT
CONCAT(city_name,', ',region) value,
latitude,
longitude,
id,
population,
( 3959 * acos( cos( radians($latitude) ) *
cos( radians( latitude ) ) * cos( radians( longitude )
- radians($longitude) ) + sin( radians($latitude) ) *
sin( radians( latitude ) ) ) )
7. Sharding is a Nightmare
• Data is all over the place
• No more joins
• No more aggregations
• Denormalize all the things
• Querying secondary indexes requires hitting every shard
• Adding shards requires manually moving data
• Schema changes
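To see why adding a shard means moving data, here is a rough sketch that assumes the simplest possible scheme, `key % shard_count`. The slide does not say which scheme is used, so `shard_for` and `moved_keys` are illustrative names only.

```python
def shard_for(key, shard_count):
    # Assumed scheme: integer key modulo the number of shards.
    return key % shard_count


def moved_keys(keys, old_count, new_count):
    # Keys whose shard changes when the shard count changes.
    return [k for k in keys
            if shard_for(k, old_count) != shard_for(k, new_count)]


keys = range(1000)
moved = moved_keys(keys, old_count=4, new_count=5)
fraction_moved = len(moved) / len(keys)
```

Going from 4 to 5 shards relocates 800 of these 1000 keys (`fraction_moved` is 0.8), and with manual sharding every one of those moves is an operational task.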
8. High Availability... not really
• Master failover… who's responsible?
• Another moving part…
• Bolted on hack
• Downtime is frequent
• Change database settings (InnoDB buffer pool, etc.)
• Drive, power supply failures
• OS updates
• Multi-DC? Yeah ok buddy…
9. Summary of Failure
• Scaling is a pain
• ACID is naive at best
• You aren't consistent
• Re-sharding is a manual process
• We're going to denormalize for performance
• High availability is complicated, requires additional operational overhead
• Generally more servers is more work
10. Lessons Learned
• Consistency is not practical
• So we give it up
• Manual sharding & rebalancing is hard
• So let's build it in
• Adding servers should be EASY
• Every moving part makes systems more complex
• So let's simplify our architecture - no more master / slave
• Scaling up is expensive
• We want commodity hardware
• Scatter / gather no good
• We denormalize for real time query performance
• Goal is to always hit 1 server
11. What is Apache Cassandra?
• Fast Distributed Database
• High Availability
• Linear Scalability
• Predictable Performance
• No SPOF
• Multi-DC
• Commodity Hardware
• Easy to manage operationally
• Not a drop-in replacement for RDBMS
12. Masterless, Peer to Peer
• Data distributed automatically
• Data replicated automatically (replication factor)
• Rack / AZ aware
• Queries routed internally to correct server
• Tune consistency as needed
• ONE, QUORUM, ALL
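The arithmetic behind tunable consistency can be sketched as follows. The helper names are hypothetical; the rule itself is the standard one: a quorum is a majority of replicas, and a read is guaranteed to overlap the latest write when read replicas plus write replicas exceed the replication factor.

```python
def quorum(replication_factor):
    # A majority of the replicas.
    return replication_factor // 2 + 1


def replicas_required(level, replication_factor):
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def read_overlaps_write(write_level, read_level, replication_factor):
    # True when at least one replica is in both the write set and the read set.
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return w + r > replication_factor


rf = 3
quorum_both = read_overlaps_write("QUORUM", "QUORUM", rf)  # 2 + 2 > 3
one_both = read_overlaps_write("ONE", "ONE", rf)           # 1 + 1 > 3 fails
```

With a replication factor of 3, QUORUM writes plus QUORUM reads overlap, while ONE plus ONE trades that guarantee for lower latency and higher availability.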
13. Data Structures
• Like an RDBMS, Cassandra uses a table to store data
• But that's where the similarities end
• Partitions within tables
• Rows within partitions (or a single row)
• CQL to create tables & query data
• Partition keys determine where a partition is found
• Clustering keys determine ordering of rows within a partition
[Diagram: Keyspace > Table > Partition > Row]
14. Example: Single Row Partition
• Simple User system
• Identified by name (pk)
• 1 Row per partition
• This is familiar territory
name       age  job
jon        33   evangelist
luke       33   evangelist
old pete   108  retired
s. seagal  62   actor
JCVD       53   actor
cqlsh:demo> create table user
    (name text primary key,
     age int,
     job text);
cqlsh:demo> select * from user where name = 'JCVD';
15. Example: Multiple Rows
• Comments on photos
• Comments are always selected by the photo_id
• There are only 4 rows in 2 partitions
• In the real world, use UUIDs instead of int for PK
photo_id  comment_id  user  comment
5         1           jon   hi
5         2           luke  oh hey
5         3           JCVD  AHHHHH!!!
6         4           jon   great pic
create table comment
    (photo_id int,
     comment_id int,
     user text,
     comment text,
     primary key (photo_id, comment_id));
select * from comment where photo_id = 5;
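As a mental model only, the comment table can be pictured as a map from partition key to a list of rows kept sorted by clustering key. The `Table` class below is a hypothetical in-memory sketch, not Cassandra's storage engine.

```python
import bisect
from collections import defaultdict


class Table:
    def __init__(self):
        # partition key -> list of (clustering key, values), sorted by clustering key
        self.partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, values):
        rows = self.partitions[partition_key]
        keys = [ck for ck, _ in rows]
        i = bisect.bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, values)  # same primary key: overwrite (upsert)
        else:
            rows.insert(i, (clustering_key, values))

    def select(self, partition_key):
        # A query by partition key reads exactly one partition, already in order.
        return list(self.partitions.get(partition_key, []))


comment = Table()
# Inserted out of order on purpose; rows come back sorted by comment_id.
comment.insert(5, 3, {"user": "JCVD", "comment": "AHHHHH!!!"})
comment.insert(6, 4, {"user": "jon", "comment": "great pic"})
comment.insert(5, 1, {"user": "jon", "comment": "hi"})
comment.insert(5, 2, {"user": "luke", "comment": "oh hey"})

photo_5 = comment.select(5)
```

`photo_5` holds the three comments for photo 5 in `comment_id` order, and the table holds two partitions in total, matching the 4 rows in 2 partitions on the slide.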
16. Embrace Denormalization
• Fastest way to query data is to ask for a list of things, stored contiguously
group_name  user_id  name   age
ninjas      1        Jon    33
ninjas      2        Luke   33
ninjas      3        Pete   103
ninjas      4        Sarah  22
group_name: partition key (grouping)
user_id: clustering key (sorting)
17. Model Tables to Answer Queries
• This is not 3NF!!
• We always query by partition key
• Create many tables aka materialized views
• Manage in your app code
• Denormalize!!
user  age
jon   33
luke  33
JCVD  53

age  user  user
33   jon   luke
53   JCVD
CREATE TABLE age_to_user (
age int,
user text,
primary key (age, user)
);
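"Manage in your app code" means the application writes to both tables itself. A minimal sketch, using plain dictionaries as stand-ins for the `user` and `age_to_user` tables (the `add_user` helper is hypothetical):

```python
import bisect

user = {}         # name -> age               (stand-in for the user table)
age_to_user = {}  # age -> sorted list of names (stand-in for age_to_user)


def add_user(name, age):
    # Write the base table and the lookup table together.
    user[name] = age
    names = age_to_user.setdefault(age, [])
    if name not in names:
        bisect.insort(names, name)  # keep names ordered, like a clustering key


for name, age in [("jon", 33), ("luke", 33), ("JCVD", 53)]:
    add_user(name, age)

thirty_three = age_to_user[33]
```

`thirty_three` is `["jon", "luke"]`, the same partition shown on the slide. Note what the sketch leaves out: if a user's age changes, the old entry in `age_to_user` must also be removed, which is exactly the bookkeeping that denormalization pushes into the application.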
18. Limitations
• No aggregations yet
• (coming in 3.0)
• No joins
• Select rows by partition key
• Manage your own secondary indexes
• No arbitrary, full cluster queries
20. Apache Spark
• Batch processing
• Functional constructs
• map / reduce / filter
• Fully distributed SQL
• RDD is a collection of data
• Scala or Python
• Streaming
• Machine learning
• Graph analytics (GraphX)
21. Spark on Cassandra
• Use the DataStax Spark connector (OSS)
• Data locality - run Spark locally on each node
• Dedicated DC - different workload
• Jobs write results back to Cassandra
• Lambda Architecture
24. Data Migrations
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Declared at the top level; case classes nested inside a method can
// trip up the connector's reflection-based column mapping
case class FoodToUserIndex(food: String, user: String)

object DataMigration {
  def main(args: Array[String]): Unit = {

    // Point the connector at a Cassandra node
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")

    val sc = new SparkContext("local", "test", conf)

    // Read every row of tutorial.user
    val user_table = sc.cassandraTable("tutorial", "user")

    // Re-key each row by favorite food
    val food_index = user_table.map(r =>
      FoodToUserIndex(r.getString("favorite_food"),
        r.getString("name")))

    // Write the new lookup table back to Cassandra
    food_index.saveToCassandra("tutorial", "food_to_user_index")

    sc.stop()
  }
}
25. SparkSQL
• Register an RDD as a table
• Query as if it were an RDBMS
• Joins, aggregations, etc
• Join across data sources (Postgres to Cassandra)
• Supports JDBC & ODBC
case class Person(name: String, age: Int)
// Assumes a SQLContext whose members are imported (import sqlContext._), as in early Spark SQL releases
sc.cassandraTable[Person]("test", "persons").registerAsTable("persons")
val adults = sql("SELECT * FROM persons WHERE age > 17")
27. Overview
• Read data from a streaming source
• ZeroMQ, Kafka, Raw Socket
• Data is read in batches
• Streaming is at best an approximation
• val ssc = new StreamingContext(sc, Seconds(1))
Time  1.1    1.5    1.7    2.1    2.4    2.8    3.4
Data  (1,2)  (4,2)  (6,2)  (9,1)  (3,5)  (7,1)  (3,10)
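The timeline above can be cut into one-second batches with a few lines of Python. This is a toy model that assumes the times on the slide are arrival times and that batch boundaries fall on whole seconds; `micro_batches` is a hypothetical helper, not a Spark API.

```python
import math
from collections import defaultdict

# (arrival time in seconds, record) pairs taken from the slide
events = [(1.1, (1, 2)), (1.5, (4, 2)), (1.7, (6, 2)), (2.1, (9, 1)),
          (2.4, (3, 5)), (2.8, (7, 1)), (3.4, (3, 10))]


def micro_batches(events, batch_seconds=1.0):
    # Group records by the start of the batch interval they arrived in.
    batches = defaultdict(list)
    for timestamp, record in events:
        batch_start = math.floor(timestamp / batch_seconds) * batch_seconds
        batches[batch_start].append(record)
    return dict(sorted(batches.items()))


batches = micro_batches(events)
```

The seven records land in three batches starting at 1.0, 2.0, and 3.0 seconds, holding three, three, and one record respectively. Each batch is then processed as a unit, which is why results are per-interval snapshots rather than a continuous answer.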
28. What is Apache Kafka?
• Distributed, partitioned pub/sub
• Messages are sent to “topics” (sort of like a queue)
• Multiple subscribers can read from same partition
• Each subscriber maintains its own position
• Scales massively
• We can use this to message both ways
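The idea that several subscribers read the same partition, each at its own position, can be modeled as an append-only list plus one offset per reader. The classes below are hypothetical illustrations and have nothing to do with the real Kafka client API.

```python
class TopicPartition:
    """Append-only log; a message's offset is its index in the log."""

    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset of the message just written


class Subscriber:
    def __init__(self, partition):
        self.partition = partition
        self.position = 0  # each subscriber tracks its own offset

    def poll(self, max_messages=10):
        batch = self.partition.log[self.position:self.position + max_messages]
        self.position += len(batch)
        return batch


partition = TopicPartition()
for message in ["a", "b", "c"]:
    partition.append(message)

fast = Subscriber(partition)
slow = Subscriber(partition)
fast_batch = fast.poll()                # reads a, b, c
slow_batch = slow.poll(max_messages=1)  # reads only a
```

Reading does not remove anything from the log: after both polls the partition still holds all three messages, `fast` is at position 3, and `slow` is at position 1.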
29. Streaming Snippet
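// Imports assume the Spark 1.x-era Kafka receiver API and the DataStax connector/driver
import java.util.UUID
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.driver.core.utils.UUIDs
import com.datastax.spark.connector.streaming._

// ssc, kafkaParams, topic, and PageView are defined outside this snippet;
// parse(...).extract[...] is json4s and needs its imports and an implicit Formats
val rawEvents: ReceiverInputDStream[(String, String)] =
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)

// Parse each JSON message body into a PageView
val parsed: DStream[PageView] = rawEvents.map { case (_, v) =>
  parse(v).extract[PageView]
}

case class PageViewsPerSite(site_id: String, ts: UUID, pageviews: Int)

// One (site_id, 1) pair per page view
val pairs = parsed.map(event => (event.site_id, 1))

// Sum the pairs per site within each batch
val hits_per_site: DStream[PageViewsPerSite] = pairs.reduceByKey(_ + _).map {
  case (site_id, hits) => PageViewsPerSite(site_id, UUIDs.timeBased(), hits)
}

hits_per_site.saveToCassandra("killranalytics", "real_time_data")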
Full source: https://github.com/rustyrazorblade/killranalytics/blob/master/spark/src/main/scala/RawEventProcessing.scala
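For readers who do not have a cluster handy, the map and reduceByKey steps of the snippet reduce to the following plain Python, applied to one batch at a time. The event dictionaries and the `hits_per_site` function are made up for illustration; only the `site_id` field name comes from the slide.

```python
from collections import Counter


def hits_per_site(batch):
    # map: one (site_id, 1) pair per page view
    pairs = ((event["site_id"], 1) for event in batch)
    # reduceByKey(_ + _): sum the ones for each site
    counts = Counter()
    for site_id, one in pairs:
        counts[site_id] += one
    return dict(counts)


# One hypothetical micro-batch of parsed page views
batch = [{"site_id": "a"}, {"site_id": "b"}, {"site_id": "a"}, {"site_id": "a"}]
counts = hits_per_site(batch)
```

For this batch `counts` is `{"a": 3, "b": 1}`. In the streaming job the same computation runs once per batch interval, and each result row is stamped with a time-based UUID before being saved.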
30. Things to keep in mind….
• Streaming aggregations are an approximation
• Best practice is to come up w/ a rough number in streaming
• You’ll need to reaggregate original data if you want precision
• If it's time series, use DateTieredCompactionStrategy with consistent TTLs
32. Clustering
• Unsupervised learning
• Batch or streaming
• reevaluate clusters as new data arrives
• K-means
• Puts points into predefined # of clusters
• Power iteration clustering
• clustering vertices of a graph
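K-means itself fits in a short, single-machine sketch. This is the textbook Lloyd iteration with caller-supplied starting centroids, written in plain Python for illustration; it is not Spark's MLlib implementation, and the function names are my own.

```python
import math


def closest(point, centroids):
    # Index of the centroid nearest to the point (Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    # Coordinate-wise average of a non-empty list of points.
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centroids, max_iterations=100):
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [mean(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
```

With k = 2 the six points split into the three near the origin and the three near (10, 10). The number of clusters is fixed up front by the number of starting centroids, which is the "predefined # of clusters" on the slide.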
34. Collaborative Filtering
• Recommendation engine
• Algo: Alternating least squares
• Movies, music, etc
• Perfect match for Cassandra
• Source of truth
• Hot, live data
• Spark generates recommendations (store in Cassandra)
• Feedback loop generates better recommendations over time
35. When is Spark + Cassandra right?
• Many storage alternatives
• HDFS / HBase
• RDBMS
• Spark talks to everyone
• CSV on S3
• Lots of live data
• Real time requirements - SLA
• Multi-DC
• Not for data at rest, not ETL
• Cassandra is not a data warehouse
36. Open Source
• Latest, bleeding edge features
• File JIRAs
• Support via mailing list & IRC
• Fix bugs
• cassandra.apache.org
• Perfect for hacking
37. DataStax Enterprise
• Integrated Multi-DC Solr
• Integrated Spark
• Free Startup Program
• <$3M revenue & <$30M funding
• Extended support
• Additional QA
• Focused on stable releases for enterprise