The Spark/Mesos Seattle Meetup group shares the latest presentation from their recent meetup, showcasing real-world implementations of Spark within the context of your Big Data infrastructure.
The session is demo-heavy and slide-light, focusing on getting your development environment up and running, including initial setup, configuration issues, SparkSQL vs. Hive, and more.
To learn more about the Seattle meetup: http://www.meetup.com/Seattle-Spark-Meetup/members/21698691/
2. 2
Next Sessions
• Currently planning
– New sessions
• Joint Seattle Spark Meetup and Seattle Mesos User Group meeting
• Joint Seattle Spark Meetup and Cassandra meeting
• Mobile marketing scenario with Tune
– Repeats requested for:
• Deep Dive into Spark, Tachyon, and Mesos Internals (Eastside)
• Spark at eBay: Troubleshooting the everyday issues (Seattle)
• This session (Eastside)
3. 3
Unlocking your Hadoop data with Apache Spark and CDH5
Denny Lee, Steven Hastings
Data Sciences Engineering
4. 4
Purpose
• Showcasing Apache Spark scenarios within your Hadoop cluster
• Utilizing some Concur Data Sciences Engineering scenarios
• Special Guests:
– Tableau Software: Showcasing Tableau connecting to Apache Spark
– Cloudera: How to configure and deploy Spark on YARN via Cloudera Manager
5. 5
Agenda
• Configuring and Deploying Spark on YARN with Cloudera Manager
• Quick primer on our expense receipt scenario
• Connecting to Spark on your CDH5.1 cluster
• Quick demos
– Pig vs. Hive
– SparkSQL
• Tableau connecting to SparkSQL
• Deep Dive demo
– MLlib: SVD
6. 6
Take Picture of Receipt
[Image: photo of a sample expense receipt from Chantanee Thai Restaurant & Bar, Bellevue, WA, with its noisy OCR output: subtotal $40.00, taxes $3.80, grand total $43.80, paid by Amex]
8. 8
Gateway Node
[Diagram: gateway node alongside the Hadoop cluster]
- Can connect to HDFS
- Can execute Hadoop jobs on the cluster
- Can execute Spark jobs on the cluster
- OR run locally for ad hoc work
- Can set up multiple VMs
9. 9
Connecting…
spark-shell --master spark://$masternode:7077 --executor-memory 1G --total-executor-cores 16

--master
specify the master node, OR if using a gateway node you can just run locally to test
--executor-memory
limit the amount of memory you use; otherwise you'll use up as much as you can (should set defaults)
--total-executor-cores
limit the number of cores you use; otherwise you'll use up as many as you can (should set defaults)
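Rather than passing these caps on every invocation, you can bake them into your application or site defaults. A minimal sketch using the standard SparkConf API; the values here are illustrative, not from the slides:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative defaults; tune for your cluster
val conf = new SparkConf()
  .setMaster("spark://masternode:7077") // or "local[*]" on a gateway node
  .set("spark.executor.memory", "1g")   // same cap as --executor-memory 1G
  .set("spark.cores.max", "16")         // same cap as --total-executor-cores 16
val sc = new SparkContext(conf)

The same keys can also go in conf/spark-defaults.conf so every spark-shell session picks them up.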
11. 11
RowCount: Pig
A = LOAD '/user/hive/warehouse/dennyl.db/sample_ocr/000000_0' USING TextLoader as (line:chararray);
B = group A all;
C = foreach B generate COUNT(A);
dump C;
12. 12
RowCount: Spark
val ocrtxt = sc.textFile("/user/hive/warehouse/dennyl.db/sample_ocr/000000_0")
ocrtxt.count
13. 13
RowCount: Pig vs. Spark
Row count against 1+ million categorized receipts

Query | Pig     | Spark
1     | 0:00:41 | 0:00:02
2     | 0:00:42 | 0:00:00.5
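The much faster second Spark run reflects Spark keeping the data in memory across queries. You can make that explicit with the standard cache() call; a minimal sketch (the caching step is our addition, not shown on the slides):

// Mark the RDD to be kept in memory once materialized
val ocrtxt = sc.textFile("/user/hive/warehouse/dennyl.db/sample_ocr/000000_0").cache()
ocrtxt.count // first run reads from HDFS and populates the cache
ocrtxt.count // second run is served from memory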
15. 15
WordCount: Pig
-- A is the relation loaded in the RowCount example
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
E = group D all;
F = foreach E generate COUNT(D);
dump F;
16. 16
WordCount: Spark
val wc = ocrtxt.flatMap(
  line => line.split(" ")
).map(
  word => (word, 1)
).reduceByKey(_ + _)
wc.count
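As a small extension (our addition, not on the slides), the same pair RDD can surface the most frequent words via the standard takeOrdered:

// Top 10 words, ordered descending by count
wc.takeOrdered(10)(Ordering.by { case (_, n) => -n })
  .foreach { case (word, n) => println(s"$word: $n") }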
17. 17
WordCount: Pig vs. Spark
Word count against 1+ million categorized receipts

Query | Pig     | Spark
1     | 0:02:09 | 0:00:38
2     | 0:02:07 | 0:00:02
18. 18
SparkSQL: Querying
// Utilize SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

// Define the schema class
case class ocrdata(company_code: String, user_id: Long, date: Long, category_desc: String, category_key: String, legacy_key: String, legacy_desc: String, vendor: String, amount: Double)
19. 19
SparkSQL: Querying (2)
// Extract data
val ocr = sc.textFile("/$HDFS_Location").map(_.split("\t")).map(
  m => ocrdata(m(0), m(1).toLong, m(2).toLong, m(3), m(4), m(5), m(6), m(7), m(8).toDouble)
)

// For Spark 1.0.2
ocr.registerAsTable("ocr")
// For Spark 1.1.0+
ocr.registerTempTable("ocr")
20. 20
SparkSQL: Querying (3)
// Write a SQL statement
val blah = sqlContext.sql(
  "SELECT company_code, user_id FROM ocr"
)

// Show the first 10 rows
blah.map(
  a => a(0) + ", " + a(1)
).collect().take(10).foreach(println)
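A quick sanity check at this point (our addition, not on the slides; printSchema is part of the SchemaRDD API in the Spark 1.1 line, so hedge by version):

// Print the schema inferred from the ocrdata case class
blah.printSchema()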
21. 21
Oops!
14/11/15 09:55:35 ERROR scheduler.TaskSetManager: Task 16.0:0 failed 4 times; aborting job
14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Cancelling stage 16
14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Stage 16 was cancelled
14/11/15 09:55:35 INFO scheduler.DAGScheduler: Failed to run collect at <console>:22
14/11/15 09:55:35 WARN scheduler.TaskSetManager: Task 136 was killed.
14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 16.0, whose tasks have all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 16.0:0 failed 4 times, most recent failure: Exception failure in TID 135 on host $server$: java.lang.NumberFormatException: For input string: "\N"
sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
java.lang.Double.parseDouble(Double.java:540)
scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
22. 22
Let’s find the error
// The schema class was configured incorrectly; let's find the rows containing "\N"
val errors = ocrtxt.filter(line => line.contains("\\N"))
// Error count (71053 lines)
errors.count
// Look at some of the data
errors.take(10).foreach(println)
23. 23
Solution
// Issue: the [amount] field contains \N, the NULL marker generated by Hive

// Schema class (original)
case class ocrdata(company_code: String, user_id: Long, ... amount: Double)

// Schema class (fixed)
case class ocrdata(company_code: String, user_id: Long, ... amount: String)
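An alternative to widening the field to String is to handle the marker while parsing. A minimal sketch; parseAmount is a hypothetical helper of ours, not from the slides:

// Hypothetical helper: map Hive's \N NULL marker to None instead of throwing
def parseAmount(s: String): Option[Double] =
  if (s == "\\N") None else Some(s.toDouble)

This keeps amount numeric for downstream aggregation, at the cost of deciding up front how NULLs should be treated.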
24. 24
Re-running the Query
14/11/16 18:43:32 INFO scheduler.DAGScheduler: Stage 10 (collect at <console>:22) finished in 7.249 s
14/11/16 18:43:32 INFO spark.SparkContext: Job finished: collect at <console>:22, took 7.268298566 s
-1978639384, 20156192
-1978639384, 20164613
542292324, 20131109
-598558319, 20128132
1369654093, 20130970
-1351048937, 20130846
25. 25
SparkSQL: By Category (Count)
// Query
val blah = sqlContext.sql("SELECT category_desc, COUNT(1) FROM ocr GROUP BY category_desc")
blah.map(a => a(0) + ", " + a(1)).collect().take(10).foreach(println)

// Results
14/11/16 18:46:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool
14/11/16 18:46:12 INFO spark.SparkContext: Job finished: collect at <console>:22, took 4.275620339 s
Category 1, 25
Category 2, 97
Category 3, 37
26. 26
SparkSQL: via Sum(Amount)
// Query
val blah = sqlContext.sql("SELECT category_desc, sum(amount) FROM ocr GROUP BY category_desc")
blah.map(a => a(0) + ", " + a(1)).collect().take(10).foreach(println)

// Results
14/11/16 18:46:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool
14/11/16 18:46:12 INFO spark.SparkContext: Job finished: collect at <console>:22, took 4.275620339 s
Category 1, 2000
Category 2, 10
Category 3, 1800
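Note that after the earlier fix, amount is a String in the schema class. Assuming your SparkSQL version supports CAST (standard HiveQL-style syntax), one way to keep the aggregation numeric:

// Cast the string column back to a numeric type inside the query
val totals = sqlContext.sql("SELECT category_desc, SUM(CAST(amount AS DOUBLE)) FROM ocr GROUP BY category_desc")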
29. 29
Overview
• Re-Intro to Expense Receipt Prediction
• SVD
– What is it?
– Why do I care?
• Demo
– Basics (get the data)
– Data wrangling
– Compute SVD
30. 30
Expense Receipts (Problems)
• Want to guess the expense type based on the words on the receipt
• The Receipt × Word matrix is sparse
• Some words are likely to be found together
• Some words are actually the same word
31. 31
SVD (Singular Value Decomposition)
• Has been around a while
– But still popular and useful
• Matrix factorization
– Intuition: rotate your view of the data
• Data can be well approximated by fewer features
– And you can get an idea of how good the approximation is
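For reference, the factorization itself (the standard definition, not spelled out on the slides): a rank-k truncated SVD of an m × n matrix A is

A \approx U_k \Sigma_k V_k^{T}, \quad U_k \in \mathbb{R}^{m \times k}, \; \Sigma_k = \mathrm{diag}(\sigma_1, \ldots, \sigma_k), \; V_k \in \mathbb{R}^{n \times k}

Because the singular values are ordered, σ1 ≥ σ2 ≥ … ≥ σk, their decay tells you how well k features approximate the data.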
32. 32
Demo: Overview
Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD
33. 33
Basics
val rawlines = sc.textFile("/user/stevenh/subocr/subset.dat")
val ocrRecords = rawlines map { rawline =>
  rawline.split("\t")
} filter { line =>
  line.length == 10 && line(8) != ""
} zipWithIndex() map { case (lineItems, lineIdx) =>
  OcrRecord(lineIdx, lineItems(0), lineItems(1).toLong, lineItems(4), lineItems(8).toDouble, lineItems(9))
}

zipWithIndex() lets you give each record in your RDD an incrementing integer index
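The OcrRecord case class itself never appears on the slides; a plausible reconstruction consistent with how the fields are used later (recIdx and ocrText are referenced in the tokenizing step; the other field names are our guesses):

// Hypothetical reconstruction: names beyond recIdx and ocrText are assumptions
case class OcrRecord(
  recIdx: Long,        // index assigned by zipWithIndex()
  companyCode: String, // lineItems(0)
  userId: Long,        // lineItems(1)
  categoryKey: String, // lineItems(4)
  amount: Double,      // lineItems(8)
  ocrText: String      // lineItems(9), the raw OCR text
)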
34. 34
Tokenize Records
val splitRegex = new scala.util.matching.Regex("\r\n")
val wordRegex = new scala.util.matching.Regex("[a-zA-Z0-9_]+")
val recordWords = ocrRecords flatMap { rec =>
  val s1 = splitRegex.replaceAllIn(rec.ocrText, "")
  val s2 = wordRegex.findAllIn(s1)
  for { s <- s2 } yield (rec.recIdx, s.toLowerCase)
}

Keep track of which record each word came from
35. 35
Demo: Overview
Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD
36. 36
Group data by Record and Word
// Count occurrences of each (recIdx, word) pair
val wordCounts = recordWords groupBy { t => t }
// Re-key by word: (word, (recIdx, word, count))
val wordsByRecord = wordCounts map { gr =>
  (gr._1._2, (gr._1._1, gr._1._2, gr._2.size))
}
// Assign each distinct word an incrementing integer index
val uniqueWords = wordsByRecord groupBy { t =>
  t._2._2
} zipWithIndex() map { gr =>
  (gr._1._1, gr._2)
}
37. 37
Join Record, Word Data
val preJoined = wordsByRecord join uniqueWords
val joined = preJoined map { pj =>
  RecordWord(pj._2._1._1, pj._2._2, pj._2._1._2, pj._2._1._3.toDouble)
}

join combines two 2-tuple RDDs on the first value of each tuple
Now we have data for each non-zero word/record combination
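Like OcrRecord, the RecordWord case class is not shown on the slides; a plausible reconstruction matching the constructor call above and the fields (wordIdx, recIdx, n) referenced in the matrix step:

// Hypothetical reconstruction based on usage
case class RecordWord(
  recIdx: Long,  // which receipt
  wordIdx: Long, // integer index assigned to the word
  word: String,  // the word itself
  n: Double      // occurrence count in that receipt
)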
38. 38
Demo: Overview
Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD
39. 39
Generate Word x Record Matrix
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, SparseVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val ncols = ocrRecords.count().toInt
val nrows = uniqueWords.count().toLong
// This is a Spark MLlib Vector, not a scala.collection Vector
val vectors: RDD[Vector] = joined groupBy { t =>
  t.wordIdx
} map { gr =>
  val indices = for { x <- gr._2 } yield x.recIdx.toInt
  val data = for { x <- gr._2 } yield x.n
  new SparseVector(ncols, indices.toArray, data.toArray)
}
val rowMatrix = new RowMatrix(vectors, nrows, ncols)
40. 40
Demo: Overview
Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD
41. 41
Compute SVD
val svd = rowMatrix.computeSVD(5, computeU = true)
• Ironically, in Spark v1.0 computeSVD is limited by an operation which must complete on a single node…
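The returned object bundles the three factors; pulling them out uses the standard MLlib API for RowMatrix.computeSVD:

// svd: SingularValueDecomposition[RowMatrix, Matrix]
val U = svd.U // distributed RowMatrix of left singular vectors (since computeU = true)
val s = svd.s // local Vector of singular values, in descending order
val V = svd.V // local Matrix of right singular vectors
println(s.toArray.mkString(", ")) // inspect how quickly the singular values decay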
42. 42
Spark / SVD References
• Distributing the Singular Value Decomposition with Spark
– Spark-SVD gist
– Twitter / Databricks blog post
• Spotting Topics with the Singular Value Decomposition
44. 44
Upcoming Conferences
• Strata + Hadoop World
– http://strataconf.com/big-data-conference-ca-2015/public/content/home
– San Jose, Feb 17-20, 2015
• Spark Summit East
– http://spark-summit.org/east
– New York, March 18-19, 2015
• Ask for a copy of “Learning Spark”
– http://www.oreilly.com/pub/get/spark