1 
Seattle Spark Meetup
2 
Next Sessions 
• Currently planning 
– New sessions 
• Joint Seattle Spark Meetup and Seattle Mesos User Group meeting 
• Joint Seattle Spark Meetup and Cassandra meeting 
• Mobile marketing scenario with Tune 
– Repeats requested for: 
• Deep Dive into Spark, Tachyon, and Mesos Internals (Eastside) 
• Spark at eBay: Troubleshooting the everyday issues (Seattle) 
• This session (Eastside)
3 
Unlocking your Hadoop data with Apache Spark and CDH5
Denny Lee, Steven Hastings 
Data Sciences Engineering
4 
Purpose 
• Showcasing Apache Spark scenarios within your Hadoop cluster 
• Utilizing some Concur Data Sciences Engineering Scenarios 
• Special Guests: 
– Tableau Software: Showcasing Tableau connecting to Apache Spark 
– Cloudera: How to configure and deploy Spark on YARN via Cloudera Manager
5 
Agenda 
• Configuring and Deploying Spark on YARN with Cloudera Manager 
• Quick primer on our expense receipt scenario 
• Connecting to Spark on your CDH5.1 cluster 
• Quick demos 
– Pig vs. Spark 
– SparkSQL 
• Tableau connecting to SparkSQL 
• Deep Dive demo 
– MLLib: SVD
6 
Take Picture of Receipt 
[Receipt image (OCR text): Chantanee Thai Restaurant & Bar, 601 108th Ave. NE, Bellevue, WA 98004, www.chantanee.com — Table B6, 2 guests, server Jerry, 4/14/2014 1:14:02 PM; Subtotal $40.00, Total Taxes $3.80, Grand Total $43.80; credit purchase, Amex]
7 
Help Choose Expense Type
8 
Gateway Node 
Hadoop 
Gateway Node 
- Can connect to HDFS 
- Can execute Hadoop jobs on the cluster 
- Can execute Spark on the cluster 
- OR locally for ad hoc work 
- Can set up multiple VMs
9 
Connecting… 
spark-shell --master spark://$masternode:7077 --executor-memory 1G --total-executor-cores 16 
--master 
specify the master node; OR, if using a gateway node, you can just run locally to test 
--executor-memory 
limit the amount of memory you use; otherwise you'll use as much as you can (defaults should be set) 
--total-executor-cores 
limit the number of cores you use; otherwise you'll use as many as you can (defaults should be set)
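Since the agenda also covers Spark on YARN via Cloudera Manager, here is a minimal sketch (not from the deck) of connecting through YARN instead of the standalone master, assuming a CDH5.1 gateway node where Cloudera Manager has deployed the Spark client configuration; the resource values are illustrative: 

spark-shell --master yarn-client --num-executors 8 --executor-memory 1G --executor-cores 2 

On YARN, --num-executors and --executor-cores together cap the resources you take, playing the role that --total-executor-cores plays against a standalone master.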
10 
Connecting… and resources
11 
RowCount: Pig 
A = LOAD '/user/hive/warehouse/dennyl.db/sample_ocr/000000_0' USING TextLoader as (line:chararray); 
B = group A all; 
C = foreach B generate COUNT(A); 
dump C;
12 
RowCount: Spark 
val ocrtxt = sc.textFile("/user/hive/warehouse/dennyl.db/sample_ocr/000000_0") 
ocrtxt.count
13 
RowCount: Pig vs. Spark 
Row Count against 1+ million categorized receipts 
Query Pig Spark 
1 0:00:41 0:00:02 
2 0:00:42 0:00:00.5
14 
RowCount: Spark Stages
15 
WordCount: Pig 
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word; 
C = group B by word; 
D = foreach C generate COUNT(B), group; 
E = group D all; 
F = foreach E generate COUNT(D); 
dump F;
16 
WordCount: Spark 
var wc = 
ocrtxt.flatMap( 
line => line.split(" ") 
).map( 
word => (word, 1) 
).reduceByKey(_ + _) 
wc.count
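A quick follow-on that is not on the slide: the same (word, count) pair RDD can surface the most frequent words; a minimal sketch: 

// top 10 most frequent words: swap to (count, word), sort descending by count, take 10 
wc.map(_.swap).sortByKey(ascending = false).take(10).foreach(println)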
17 
WordCount: Pig vs. Spark 
Word Count against 1+ million categorized receipts 
Query Pig Spark 
1 0:02:09 0:00:38 
2 0:02:07 0:00:02
18 
SparkSQL: Querying 
-- Utilize SQLContext 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
import sqlContext.createSchemaRDD 
-- Configure class 
case class ocrdata(company_code: String, user_id: Long, date: Long, 
category_desc: String, category_key: String, legacy_key: String, 
legacy_desc: String, vendor: String, amount: Double)
19 
SparkSQL: Querying (2) 
-- Extract Data 
val ocr = sc.textFile("/$HDFS_Location").map(_.split("\t")).map( 
  m => ocrdata(m(0), m(1).toLong, m(2).toLong, m(3), m(4), m(5), m(6), m(7), m(8).toDouble) 
) 
-- For Spark 1.0.2 
ocr.registerAsTable("ocr") 
-- For Spark 1.1.0+ 
ocr.registerTempTable("ocr")
20 
SparkSQL: Querying (3) 
-- Write a SQL statement 
val blah = sqlContext.sql( 
  "SELECT company_code, user_id FROM ocr" 
) 
-- Show the first 10 rows 
blah.map( 
a => a(0) + ", " + a(1) 
).collect().take(10).foreach(println)
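One note not on the slide: collect() pulls the entire result set back to the driver before take(10) trims it to ten rows; calling take(10) directly on the RDD fetches only roughly what is needed: 

blah.map(a => a(0) + ", " + a(1)).take(10).foreach(println)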
21 
Oops! 
14/11/15 09:55:35 ERROR scheduler.TaskSetManager: Task 16.0:0 failed 4 times; aborting job 
14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Cancelling stage 16 
14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Stage 16 was cancelled 
14/11/15 09:55:35 INFO scheduler.DAGScheduler: Failed to run collect at <console>:22 
14/11/15 09:55:35 WARN scheduler.TaskSetManager: Task 136 was killed. 
14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 16.0, whose tasks have all 
completed, from pool 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 16.0:0 failed 4 times, most 
recent failure: Exception failure in TID 135 on host $server$: java.lang.NumberFormatException: For 
input string: "N" 
sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241) 
java.lang.Double.parseDouble(Double.java:540) 
scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
22 
Let’s find the error 
-- incorrect configuration, let me try to find "N" 
val errors = ocrtxt.filter(line => line.contains("N")) 
-- error count (71053 lines) 
errors.count 
-- look at some of the data 
errors.take(10).foreach(println)
23 
Solution 
-- Issue 
The [amount] field contains "N", which is the NULL value generated by Hive 
-- Configure class (Original) 
case class ocrdata(company_code: String, user_id: Long, ... amount: Double) 
-- Configure class (Modified) 
case class ocrdata(company_code: String, user_id: Long, ... amount: String)
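An alternative not shown in the deck is to keep [amount] typed as Double and handle Hive's NULL marker while parsing. A minimal sketch using the original Double-typed ocrdata class; parseAmount is a hypothetical helper, and mapping NULL to 0.0 is just one possible policy: 

// hypothetical helper: fall back to 0.0 when the field is Hive's NULL marker (or otherwise unparseable) 
def parseAmount(s: String): Double = 
  try s.toDouble catch { case _: NumberFormatException => 0.0 } 

val ocr = sc.textFile("/$HDFS_Location").map(_.split("\t")).map( 
  m => ocrdata(m(0), m(1).toLong, m(2).toLong, m(3), m(4), m(5), m(6), m(7), parseAmount(m(8))) 
)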
24 
Re-running the Query 
14/11/16 18:43:32 INFO scheduler.DAGScheduler: Stage 10 (collect at 
<console>:22) finished in 7.249 s 
14/11/16 18:43:32 INFO spark.SparkContext: Job finished: collect at 
<console>:22, took 7.268298566 s 
-1978639384, 20156192 
-1978639384, 20164613 
542292324, 20131109 
-598558319, 20128132 
1369654093, 20130970 
-1351048937, 20130846
25 
SparkSQL: By Category (Count) 
-- Query 
val blah = sqlContext.sql("SELECT category_desc, COUNT(1) FROM ocr GROUP BY 
category_desc") 
blah.map(a => a(0) + ", " + a(1)).collect().take(10).foreach(println) 
-- Results 
14/11/16 18:46:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose 
tasks have all completed, from pool 
14/11/16 18:46:12 INFO spark.SparkContext: Job finished: collect at 
<console>:22, took 4.275620339 s 
Category 1, 25 
Category 2, 97 
Category 3, 37
26 
SparkSQL: via Sum(Amount) 
-- Query 
val blah = sqlContext.sql("SELECT category_desc, sum(amount) FROM ocr GROUP BY 
category_desc") 
blah.map(a => a(0) + ", " + a(1)).collect().take(10).foreach(println) 
-- Results 
14/11/16 18:46:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose 
tasks have all completed, from pool 
14/11/16 18:46:12 INFO spark.SparkContext: Job finished: collect at 
<console>:22, took 4.275620339 s 
Category 1, 2000 
Category 2, 10 
Category 3, 1800
27 
Connecting Tableau to SparkSQL
28 
Diving into MLLib / SVD for Expense 
Receipt Scenario
29 
Overview 
• Re-Intro to Expense Receipt Prediction 
• SVD 
– What is it? 
– Why do I care? 
• Demo 
– Basics (get the data) 
– Data wrangling 
– Compute SVD
30 
Expense Receipts (Problems) 
• Want to guess the expense type based on the words on a receipt 
• The Receipt x Word matrix is sparse 
• Some words are likely to be found together 
• Some words are actually the same word
31 
SVD (Singular Value Decomposition) 
• Been around a while 
– But still popular and useful 
• Matrix Factorization 
– Intuition: rotate your view of the data 
• Data can be well approximated by fewer features 
– And you can get an idea of how good the approximation is
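For reference (this notation is not on the slides), the rank-k truncated SVD used in the demo factors the word x record matrix A (m words by n records) as 

A \approx U_k \Sigma_k V_k^{T} 

where U_k (m x k) and V_k (n x k) have orthonormal columns and \Sigma_k is a diagonal k x k matrix of the largest singular values; the demo later calls computeSVD with k = 5.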
32 
Demo: Overview 
Raw Data → Tokenized Words → Grouped Words/Records → Matrix → SVD
33 
Basics 
val rawlines = sc.textFile("/user/stevenh/subocr/subset.dat") 
val ocrRecords = rawlines.map { rawline => 
  rawline.split("\t") 
}.filter { line => 
  line.length == 10 && line(8) != "" 
}.zipWithIndex().map { case (lineItems, lineIdx) => 
  OcrRecord(lineIdx, lineItems(0), lineItems(1).toLong, lineItems(4), lineItems(8).toDouble, lineItems(9)) 
} 
zipWithIndex() lets you give each record in your RDD an incrementing integer index
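The OcrRecord case class is not shown in the deck; a plausible definition for this walkthrough (field names are assumptions, inferred from how the fields are used on later slides) might be: 

// hypothetical sketch of the record type built from each tab-separated line 
case class OcrRecord( 
  recIdx: Long,          // index assigned by zipWithIndex() 
  companyCode: String,   // lineItems(0) 
  userId: Long,          // lineItems(1) 
  categoryKey: String,   // lineItems(4) 
  amount: Double,        // lineItems(8) 
  ocrText: String        // lineItems(9): raw OCR text, tokenized on the next slide 
)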
34 
Tokenize Records 
val splitRegex = new scala.util.matching.Regex("\r\n") 
val wordRegex = new scala.util.matching.Regex("[a-zA-Z0-9_]+") 
val recordWords = ocrRecords flatMap { rec => 
  val s1 = splitRegex.replaceAllIn(rec.ocrText, "") 
  val s2 = wordRegex.findAllIn(s1) 
  for { S <- s2 } yield (rec.recIdx, S.toLowerCase) 
} 
Keep track of which record this came from
35 
Demo: Overview 
Raw Data → Tokenized Words → Grouped Words/Records → Matrix → SVD
36 
Group data by Record and Word 
val wordCounts = recordWords groupBy { T => T }   // group identical (recIdx, word) pairs 
val wordsByRecord = wordCounts map { gr => 
  (gr._1._2, (gr._1._1, gr._1._2, gr._2.size))    // keyed by word: (word, (recIdx, word, count)) 
} 
val uniqueWords = wordsByRecord.groupBy { T => 
  T._2._2                                         // group by the word itself 
}.zipWithIndex().map { gr => 
  (gr._1._1, gr._2)                               // (word, wordIdx) 
}
37 
Join Record, Word Data 
val preJoined = wordsByRecord join uniqueWords    // (word, ((recIdx, word, count), wordIdx)) 
val joined = preJoined map { pj => 
  RecordWord(pj._2._1._1, pj._2._2, pj._2._1._2, pj._2._1._3.toDouble)   // RecordWord(recIdx, wordIdx, word, n) 
} 
Join 2-tuple RDDs on first value of tuple 
Now we have data for each non-zero word/record combo
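Like OcrRecord, the RecordWord case class is not shown; a plausible definition matching how its fields are referenced on the next slide (names are assumptions) might be: 

// hypothetical sketch of the per (record, word) entry 
case class RecordWord( 
  recIdx: Long,    // record (receipt) index 
  wordIdx: Long,   // unique-word index from zipWithIndex() 
  word: String, 
  n: Double        // number of times this word appears in this record 
)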
38 
Demo: Overview 
Raw Data → Tokenized Words → Grouped Words/Records → Matrix → SVD
39 
Generate Word x Record Matrix 
val ncols = ocrRecords.count().toInt 
val nrows = uniqueWords.count().toLong 
val vectors: RDD[Vector] = joined groupBy { T => 
T.wordIdx 
} map { gr => 
This is a Spark Vector, not a scala Vector 
val indices = for { x <- gr._2 } yield x.recIdx.toInt 
val data = for { x <- gr._2 } yield x.n 
new SparseVector(ncols, indices.toArray, data.toArray) 
} 
val rowMatrix = new RowMatrix(vectors, nrows, ncols)
40 
Demo: Overview 
Raw Data → Tokenized Words → Grouped Words/Records → Matrix → SVD
41 
Compute SVD 
val svd = rowMatrix.computeSVD(5, computeU = true) 
• Ironically, in Spark v1.0 computeSVD is limited by an operation 
which must complete on a single node…
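The pieces of the result can then be pulled apart; a minimal sketch using the SingularValueDecomposition returned by computeSVD (k = 5 as above): 

// s: local Vector of the k singular values 
println(svd.s) 
// V: local (ncols x k) Matrix of right singular vectors -- the record side of this word x record matrix 
val V = svd.V 
// U: distributed RowMatrix (nrows x k) of left singular vectors -- the word side (available because computeU = true) 
val U = svd.U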
42 
Spark / SVD References 
• Distributing the Singular Value Decomposition with Spark 
– Spark-SVD gist 
– Twitter / Databricks blog post 
• Spotting Topics with the Singular Value Decomposition
43 
Now do something with the data!
44 
Upcoming Conferences 
• Strata + Hadoop World 
– http://strataconf.com/big-data-conference-ca-2015/public/content/home 
– San Jose, Feb 17-20, 2015 
• Spark Summit East 
– http://spark-summit.org/east 
– New York, March 18-19, 2015 
• Ask for a copy of “Learning Spark” 
– http://www.oreilly.com/pub/get/spark