This document provides an overview of Apache Spark, a unified analytics engine for large-scale data processing. It explains what Spark is and its key features: speed, integration, and simplicity. It covers Spark's language support for Scala, Java, and Python, then introduces Resilient Distributed Datasets (RDDs), Spark's fundamental data structure, along with transformations and actions, with examples of RDD operations. Finally, it briefly describes the Spark projects Spark SQL, Spark Streaming, MLlib, and GraphX.
39. Spark SQL
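// Row type for the table
case class Person(name: String, age: Int)
// Assumes Spark 1.3+ and an existing SparkContext `sc`
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._ // enables .toDF() on RDDs of case classes
// Map each line of the text file to a Person
val people = sc.textFile("...")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
// Register as a temporary table
people.registerTempTable("people")
// Query the table with SQL
val teenagers = sqlContext.sql(
  "SELECT name FROM people WHERE age >= 13 AND age <= 19")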
40. Spark SQL - Hive
// Assumes an existing SparkContext `sc` and a Spark build with Hive support
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Create table and load data
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH '...' INTO TABLE src")
// Queries are expressed in HiveQL; collect() brings the rows to the driver for printing
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
43. Spark Streaming
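import org.apache.spark.streaming.{Seconds, StreamingContext}
// `conf` is assumed to be an existing SparkConf with at least two local threads (e.g. master "local[2]")
val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batches
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// At least one output operation is required; without it, start() fails
wordCounts.print()
ssc.start()
ssc.awaitTermination()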
48. MLlib - Algorithms
● Linear Algebra
● Basic Statistics
● Classification and Regression
○ Linear models (SVM, logistic regression, linear regression)
○ Decision trees
○ Naive Bayes
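As an illustration of the "Basic Statistics" entry above, the following is a minimal sketch using `Statistics.colStats` from `spark.mllib`. The object name `ColStatsSketch`, the helper `summarize`, the local master setting, and the three sample vectors are assumptions made for this example and do not come from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

// Hypothetical example object; not part of the original slides
object ColStatsSketch {
  // Computes column-wise summary statistics over a small made-up data set
  def summarize(sc: SparkContext): MultivariateStatisticalSummary = {
    val observations = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0),
      Vectors.dense(2.0, 20.0),
      Vectors.dense(3.0, 30.0)))
    Statistics.colStats(observations)
  }

  def main(args: Array[String]): Unit = {
    // Local mode is assumed here only to keep the sketch self-contained
    val conf = new SparkConf().setMaster("local[*]").setAppName("ColStatsSketch")
    val sc = new SparkContext(conf)
    val summary = summarize(sc)
    println(s"count    = ${summary.count}")
    println(s"mean     = ${summary.mean}")
    println(s"variance = ${summary.variance}")
    sc.stop()
  }
}
```

Each row of the input is one observation and each vector position is one column, so the summary reports one mean and one variance per column.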
51. MLlib - ALS Recommendation
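import org.apache.spark.mllib.recommendation.{ALS, Rating}
// Each input line is split into user, item, and rating
val data = sc.textFile("...")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})
// Build the recommendation model using ALS (rank = 10, iterations = 20, lambda = 0.01)
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)
// Top 10 product recommendations for one user; `userId` is assumed to be a known Int user ID
val recommendations = model.recommendProducts(userId, 10)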
52. MLlib - Naive Bayes
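import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Each line holds a label, a comma, then space-separated feature values
val data = sc.textFile("data/mllib/sample_naive_bayes_data.txt")
val training = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble,
    Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
// lambda is the additive smoothing parameter
val model = NaiveBayes.train(training, lambda = 1.0)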
53. GraphX is Apache Spark's API for graphs and graph-parallel computation.
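To make the idea of a property graph concrete, here is a minimal sketch that builds a small GraphX graph from a vertex RDD and an edge RDD. The object name `GraphSketch`, the helper `build`, the usernames, and the "follows" edge label are hypothetical and chosen only for illustration; they are not taken from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical example object; not part of the original slides
object GraphSketch {
  // Builds a three-vertex directed cycle with String vertex and edge properties
  def build(sc: SparkContext): Graph[String, String] = {
    // Vertices are (id, property) pairs; ids must be Long
    val vertices = sc.parallelize(Seq(
      (1L, "alice"), (2L, "bob"), (3L, "carol")))
    // Each Edge carries a source id, a destination id, and a property
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"),
      Edge(2L, 3L, "follows"),
      Edge(3L, 1L, "follows")))
    Graph(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    // Local mode is assumed here only to keep the sketch self-contained
    val conf = new SparkConf().setMaster("local[*]").setAppName("GraphSketch")
    val sc = new SparkContext(conf)
    val graph = build(sc)
    println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")
    sc.stop()
  }
}
```

A graph built this way exposes its `vertices` and `edges` as RDDs, so the usual RDD transformations and actions still apply to them.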
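59. GraphX - PageRank
// Run PageRank until convergence (tolerance 0.0001); `graph` is assumed to be an existing Graph
val ranks = graph.pageRank(0.0001).vertices
// Join the ranks with the usernames
// (the slide elides the parsing step; lines of the form "id,username" are an assumption)
val users = sc.textFile("...").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
// Print the result, one (username, rank) pair per line
println(ranksByUsername.collect().mkString("\n"))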