The document is a presentation on Apache Spark and Scala. It introduces key concepts like Big Data, Spark, and Spark's features. It describes Spark's ecosystem including Spark Streaming, MLlib, and GraphX. It also provides an overview of Scala, why it is used with Spark, and demonstrates a Spark Streaming program to count lines with the word "FATAL". The presentation aims to explain Spark and Scala concepts and their use in Big Data applications.
2. Slide 2 www.edureka.co/apache-spark-scala-training Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for Questions
Objectives of this Session
What is Big Data?
What is Spark?
Why Spark?
Spark Ecosystem
Spark Features
Scala overview
Spark Streaming Demo
For queries during the session and class recording:
Post on Twitter @edurekaIN: #askEdureka
Post on Facebook /edurekaIN
3. Slide 3
What is Big Data
4. Slide 4
What is Big Data
5. Slide 5
Big Data
Lots of Data (Terabytes or Petabytes)
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
[Word cloud: Big Data, cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile]
6. Slide 6
What is Big Data
7. Slide 7
What is Big Data
8. Slide 8
What is Spark?
Apache Spark is a general-purpose, in-memory cluster computing system.
It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs.
It provides various high-level tools such as Spark SQL for structured data processing, MLlib for machine learning, and more.
[Diagram: High-Level APIs, High-Level Tools, More…]
9. Slide 9
Why Spark?
Cluster Manager Deployment
The Spark framework can be deployed through Apache Mesos, Apache Hadoop via YARN, or Spark's own cluster manager.
10. Slide 10
Why Spark?
Polyglot
The Spark framework is polyglot: it can be programmed in several languages (currently Scala, Java and Python are supported).
11. Slide 11
Why Spark?
Provides powerful caching and disk persistence capabilities
Interactive Data Analysis
Faster Batch
Iterative Algorithms
Real-Time Stream Processing
Faster Decision-Making
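The caching and persistence point above can be illustrated with a minimal Scala sketch. It is not taken from the presentation: the object name CachingSketch, the helper square, the input range, and the "local[*]" master are all assumptions chosen for illustration. It persists an RDD with the MEMORY_AND_DISK storage level so that a second action can reuse the stored partitions rather than recompute them.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  // Hypothetical helper: a pure function, so it can be checked without a cluster.
  def square(n: Int): Long = n.toLong * n

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM; an assumption made for the sketch.
    val conf = new SparkConf().setAppName("CachingSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Keep partitions in memory and spill to disk if they do not fit.
      val squares = sc.parallelize(1 to 1000)
        .map(square)
        .persist(StorageLevel.MEMORY_AND_DISK)

      // The first action computes the RDD and stores its partitions.
      val total = squares.reduce(_ + _)
      // The second action can read the persisted partitions instead of recomputing.
      val largest = squares.max()

      println(s"sum = $total, max = $largest")
    } finally {
      sc.stop()
    }
  }
}
```

Persisting pays off mainly for iterative algorithms and interactive analysis, where the same dataset is read by several actions in a row.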
12. Slide 12
Spark Community is Super Active!
14. Slide 14
Spark Ecosystem (Contd.)
Spark Core Engine
Shark (SQL): Used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment.
Spark Streaming (Streaming): Enables analytical and interactive apps for live streaming data.
MLlib (Machine learning): Machine learning library being built on top of Spark. Provides for support of many machine learning algorithms, with speeds up to 100 times faster than MapReduce.
GraphX (Graph Computation): Graph computation engine (similar to Giraph).
SparkR (R on Spark): Package for the R language that lets R users leverage Spark's power from the R shell.
BlinkDB (Approximate SQL): An approximate query engine that runs over the Spark Core Engine.
Legend: Alpha/Pre-alpha
15. Slide 15
Spark Ecosystem (Contd.)
16. Slide 16
Spark Functional Features
17. Slide 17
Spark Non-functional Features
18. Slide 18
19. Slide 19
Introduction to Scala
20. Slide 20
Introduction to Scala
21. Slide 21
Scala frameworks
22. Slide 22
Why Scala?
23. Slide 23
Demo: Spark Streaming
Write a Spark Streaming program that counts the number of lines containing the word "FATAL" and keeps reporting the count on the console.
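The transcript does not include the demo code itself, so what follows is only one possible sketch of the exercise, written against the DStream API. The object name FatalLineCount, the helper isFatal, the socket source on localhost:9999, the 5-second batch interval, and the "local[2]" master are all assumptions, not details from the session.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FatalLineCount {
  // Hypothetical helper: true when the line contains the substring "FATAL" (case-sensitive).
  def isFatal(line: String): Boolean = line.contains("FATAL")

  def main(args: Array[String]): Unit = {
    // At least two local threads: one for the socket receiver, one for processing.
    val conf = new SparkConf().setAppName("FatalLineCount").setMaster("local[2]")
    // Assumed batch interval of 5 seconds.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Assumed source: newline-delimited text arriving on a local TCP socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // One count per batch: the number of lines in that batch containing "FATAL".
    val fatalCounts = lines.filter(isFatal _).count()

    // Print each batch's count to the console.
    fatalCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

To try it locally, one option is to start a text server with `nc -lk 9999` in another terminal and type log lines into it. Note that this sketch reports a count per batch; a running total across batches would need stateful processing, which is omitted here.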
24. Slide 24
Questions?
Buy Spark Course at : www.edureka.co