An Overview of Apache Spark
Ilya Gulman
CTO, Twingo
June, 2014
About Twingo
• A Big Data Company
• Established in 2006 by Golan Nahum
• 25 Employees
• Reseller and expert integrator in HP VERTICA
• Reseller and integrator in MAPR
• Reseller and expert integrator in MICROSTRATEGY
• Deep knowledge of Python and Linux
• Completed more than 20 successful Big Data projects
• Expertise in SAAS /OEM BIG DATA solutions
Agenda
• What is Spark?
• The Difference with Spark
• SQL on Spark
• Combining the power
• Real-World Use Cases
• Resources
What is Spark?
Fast and general MapReduce-like engine
for large-scale data processing
• Fast
In memory data storage for very fast interactive queries
Up to 100 times faster than Hadoop
• General - Unified platform that can combine:
SQL, Machine Learning, Streaming, Graph & Complex
analytics
• Ease of use
Can be developed in Java, Scala or Python
• Integrated with Hadoop
Can read from HDFS, HBase, Cassandra, and any Hadoop
data source.
What is Spark?
Spark is the Most Active Open
Source Project in Big Data
[Figure: bar chart of project contributors in the past year; projects shown include Giraph, Storm, and Tez]
The Spark Community
Unified Platform
• Shark (SQL)
• Spark Streaming (Streaming)
• MLlib (Machine learning)
• GraphX (Graph computation)
• Spark (General execution engine)
Continued innovation bringing new functionality, e.g.:
• Java 8 (Closures, Lambda Expressions)
• Spark SQL (SQL on Spark, not just Hive)
• BlinkDB (Approximate Queries)
• SparkR (R wrapper for Spark)
Supported Languages
• Java
• Scala
• Python
• SQL
Data Sources
• Local Files
– file:///opt/httpd/logs/access_log
• S3
• Hadoop Distributed Filesystem
– Regular files, sequence files, any other Hadoop
InputFormat
• HBase
• Can also read from any other Hadoop data
source.
The Difference with Spark
Easy and Fast Big Data
• Easy to Develop
– Rich APIs in Java, Scala,
Python
– Interactive shell
• Fast to Run
– General execution graphs
– In-memory storage
2-5× less code
Up to 10× faster on disk, 100× in memory
Resilient Distributed Datasets (RDD)
• Spark revolves around RDDs
• Fault-tolerant collection of elements that can be
operated on in parallel
– Parallelized Collection: an existing Scala collection, distributed so that
it can be operated on in parallel
– Hadoop Dataset: records of files supported by
Hadoop
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
RDD Operations
• Transformations
– Creation of a new dataset from an existing one
• map, filter, distinct, union, sample, groupByKey, join, etc…
• Actions
– Return a value after running a computation
• collect, count, first, takeSample, foreach, etc…
Check the documentation for a complete list
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
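The split between transformations and actions can be sketched in plain Python. This is a toy model, not the Spark API: `MiniRDD` and `load` are hypothetical names, and the sketch assumes only that transformations are recorded lazily and that an action triggers the actual computation.

```python
class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source        # callable returning an iterable of records
        self._steps = tuple(steps)   # recorded transformations, not yet run

    # Transformations: return a new dataset, compute nothing yet
    def map(self, func):
        return MiniRDD(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return MiniRDD(self._source, self._steps + (("filter", pred),))

    def _compute(self):
        # Replay the recorded steps over freshly loaded source records
        records = iter(self._source())
        for kind, func in self._steps:
            records = map(func, records) if kind == "map" else filter(func, records)
        return records

    # Actions: run the computation and return a value
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


reads = []  # records every time the source is actually read

def load():
    reads.append(1)
    return ["INFO ok", "ERROR MySQL down", "ERROR disk full"]

lines = MiniRDD(load)
errors = lines.filter(lambda line: "ERROR" in line)
mysql = errors.filter(lambda line: "MySQL" in line)
loads_before_actions = len(reads)   # transformations alone read nothing

n_errors = errors.count()           # action: reads the source
mysql_lines = mysql.collect()       # action: reads the source again
```

Each action re-reads the source here, which is the cost that caching is meant to avoid.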
RDD Code Example
file = spark.textFile("hdfs://...")
errors = file.filter(lambda line: "ERROR" in line)
# Count all the errors
errors.count()
# Count errors mentioning MySQL
errors.filter(lambda line: "MySQL" in line).count()
# Fetch the MySQL errors as an array of strings
errors.filter(lambda line: "MySQL" in line).collect()
RDD Persistence / Caching
• Variety of storage levels
– memory_only (default), memory_and_disk, etc…
• API Calls
– persist(StorageLevel)
– cache() – shorthand for
persist(StorageLevel.MEMORY_ONLY)
• Considerations
– Read from disk vs. recompute (memory_and_disk)
– Total memory storage size (memory_only_ser)
– Replicate to second node for faster fault recovery
(memory_only_2)
• Think about this option if supporting a web application
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
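The read-vs-recompute trade-off can be illustrated with a small pure-Python sketch. It is a toy model under stated assumptions, not the Spark API: `Dataset`, `expensive_load`, and `calls` are hypothetical names, and the only behavior modeled is "recompute on every action unless cache() was called".

```python
class Dataset:
    """Toy dataset that recomputes from its source on every action
    unless cache() was called (hypothetical, not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute
        self._use_cache = False
        self._cached = None

    def cache(self):
        self._use_cache = True   # mark only; data is stored on the first action
        return self

    def _records(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self._compute())
            return self._cached
        return list(self._compute())

    def count(self):
        return len(self._records())


calls = {"n": 0}  # how many times the source was loaded

def expensive_load():
    calls["n"] += 1
    return ["ERROR a", "ERROR b", "ERROR c"]

uncached = Dataset(expensive_load)
uncached.count()
uncached.count()
uncached_loads = calls["n"]   # one load per action

calls["n"] = 0
cached = Dataset(expensive_load).cache()
cached.count()
cached.count()
cached_loads = calls["n"]     # loaded once, then served from memory
```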
Cache Scaling Matters
Execution time (s) by % of working set in cache:
• Cache disabled: 69
• 25%: 58
• 50%: 41
• 75%: 30
• Fully cached: 12
RDD Fault Recovery
RDDs track lineage information that can be used to
efficiently recompute lost data
msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                .map(lambda s: s.split("\t")[2]))
HDFS File → filter (func = startswith(…)) → Filtered RDD → map (func = split(…)) → Mapped RDD
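Lineage-based recovery can be sketched without Spark. In this toy model the lineage is just a list of (operation, function) pairs, and a lost partition is rebuilt by replaying that list over the matching source partition. `build_partition`, the sample log lines, and the two-partition layout are all hypothetical.

```python
def build_partition(source_lines, lineage):
    """Recompute one partition by replaying its lineage over the source."""
    records = list(source_lines)
    for kind, func in lineage:
        if kind == "filter":
            records = [r for r in records if func(r)]
        elif kind == "map":
            records = [func(r) for r in records]
    return records


# Two partitions of a hypothetical tab-separated log file
source = {
    0: ["ERROR\tdb\ttimeout", "INFO\tweb\tok"],
    1: ["ERROR\tweb\t500", "ERROR\tdb\tdeadlock"],
}

# Same two steps as the msgs example: keep ERROR lines, take the third field
lineage = [
    ("filter", lambda s: s.startswith("ERROR")),
    ("map", lambda s: s.split("\t")[2]),
]

msgs = {pid: build_partition(lines, lineage) for pid, lines in source.items()}
before = list(msgs[1])

del msgs[1]                                     # simulate losing a partition
msgs[1] = build_partition(source[1], lineage)   # recompute only that partition
```

Only the lost partition is recomputed; partition 0 is left untouched.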
Interactive Shell
• Iterative Development
– Cache those RDDs
– Open the shell and ask questions
• We have all wished we could do this with MapReduce
– Compile / save your code for scheduled jobs later
• Scala – spark-shell
• Python – pyspark
SQL on Spark
Before Spark - Hive
• Puts structure/schema onto HDFS data
• Compiles HiveQL queries into MapReduce
jobs
• Very popular: 90+% of Facebook Hadoop
jobs generated by Hive
• Initially developed by Facebook
But... Hive is slow
• Takes 20+ seconds even for simple
queries
• "A good day is when I can run 6 Hive
queries" – @mtraverso
SQL over Spark
Shark – SQL over Spark
• Hive-compatible (HiveQL, UDFs, metadata)
– Works in existing Hive warehouses without changing
queries or data!
• Augments Hive
– In-memory tables and columnar memory store
• Fast execution engine
– Uses Spark as the underlying execution engine
– Low-latency, interactive queries
– Scale-out and tolerates worker failures
Machine Learning
Machine Learning - MLlib
• K-Means
• L1 and L2-regularized Linear Regression
• L1 and L2-regularized Logistic Regression
• Alternating Least Squares
• Naive Bayes
• Stochastic Gradient Descent
* As of May 14, 2014
** Don’t be surprised if you see the Mahout library converting to Spark soon
Streaming
Comparison to Storm
• Higher throughput than Storm
– Spark Streaming: 670k records/sec/node
– Storm: 115k records/sec/node
– Commercial systems: 100-500k records/sec/node
[Figure: throughput per node (MB/s) vs. record size (100 and 1000 bytes) for Spark and Storm; panels: WordCount and Grep]
Combining the power
Combining the power
• Use a machine learning result as a table.
GENERATE KMeans(tweet_locations)
SAVE AS TABLE tweet_clusters;
• Combine SQL, ML, and streaming (Scala)
val points = sc.runSql[Double, Double](
"select latitude, longitude from historic_tweets")
val model = KMeans.train(points, 10)
sc.twitterStream(...)
.map(t => (model.closestCenter(t.location), 1))
.reduceByWindow("5s", _ + _)
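The per-cluster counting that the snippet above appears to aim for can be tried locally in plain Python. This is a minimal sketch under stated assumptions: the centers and tweet locations are made-up values, `closest_center` is a hypothetical helper, and plain Euclidean distance on (latitude, longitude) pairs is used as a simplification.

```python
import math
from collections import Counter

def closest_center(point, centers):
    """Index of the center nearest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

# Hypothetical cluster centers (latitude, longitude); on the slide these
# would come from training k-means on historic tweets.
centers = [(32.1, 34.8), (40.7, -74.0), (51.5, -0.1)]

# Hypothetical tweet locations arriving within one 5-second window
window = [(32.0, 34.9), (40.8, -73.9), (51.4, 0.0), (40.6, -74.1)]

# Assign each tweet to its nearest center and count tweets per cluster
counts = Counter(closest_center(loc, centers) for loc in window)
```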
Real-World Use Cases
Spark at Yahoo!
• Fast Machine Learning
Personalized news pages
Spark at Yahoo!
• Hive on Spark (Shark)
Using existing BI tools to view and query advertising
analytic data collected in Hadoop.
Any tool that plugs into Hive, like Tableau, automatically
works with Shark.
Spark at Conviva
• One of the largest streaming video companies
on the Internet
• 4+ billion video feeds per month
(second only to YouTube)
• CONVIVA uses Spark Streaming to learn
network conditions in real time
Resources
Remember
• If you want to use a new technology you must learn that
new technology
• For those who have been using Hadoop for a while, at
one time you had to learn all about MapReduce and how
to manage and tune it
• To get the most out of a new technology you need to
learn that technology; this includes tuning
– There are switches you can use to optimize your work
Configuration
http://spark.apache.org/docs/latest/
Most Important
• Application Configuration
http://spark.apache.org/docs/latest/configuration.html
• Standalone Cluster Configuration
http://spark.apache.org/docs/latest/spark-standalone.html
• Tuning Guide
http://spark.apache.org/docs/latest/tuning.html
Resources
• Pig on Spark
– http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html
– https://github.com/aniket486/pig
– https://github.com/twitter/pig/tree/spork
– http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
– https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix
• Latest on Spark
– http://databricks.com/categories/spark/
– http://www.spark-stack.org/
Thank You
www.twingo.co.il
Optional - More Examples
JavaSparkContext sc = new JavaSparkContext(master, appName, [sparkHome], [jars]);
JavaRDD<String> file = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = file.flatMap(line -> Arrays.asList(line.split(" ")))
.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
.reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://...");
val sc = new SparkContext(master, appName, [sparkHome], [jars])
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Word Count
• Java MapReduce (~15 lines of code)
• Java Spark (~ 7 lines of code)
• Scala and Python (4 lines of code)
– interactive shell: skip line 1 and replace the last line with counts.collect()
• Java 8 (4 lines of code)
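The three word-count steps can be checked on a small input without a cluster. The sketch below mirrors flatMap, map, and reduceByKey in plain Python; `word_count` is a hypothetical helper and nothing here calls Spark.

```python
from collections import defaultdict

def word_count(lines):
    # flatMap: split every line into words
    words = [w for line in lines for w in line.split(" ")]
    # map: pair every word with a count of 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: sum the counts for each word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

counts = word_count(["to be or", "not to be"])
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```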
Network Word Count – Streaming
// Create the context with a 1 second batch size
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1),
System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))
// Create a NetworkInputDStream on target host:port and count the
// words in input stream of \n delimited text (eg. generated by 'nc')
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
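The batch-at-a-time behavior of the streaming example can be imitated in plain Python. This is a toy model, not the Spark Streaming API: `batch_word_counts` is a hypothetical helper, and the input is a made-up list of lines grouped into 1-second batches. Each batch is counted on its own, which is what the example above prints once per interval.

```python
from collections import Counter

def batch_word_counts(batches):
    """Yield one word-count table per batch of received lines."""
    for lines in batches:
        words = [w for line in lines for w in line.split(" ") if w]
        yield Counter(words)

# Hypothetical input: lines received in three consecutive 1-second batches
batches = [["hello spark", "hello"], [], ["spark streaming"]]
results = list(batch_word_counts(batches))
```

Counts are not carried over between batches, so the second (empty) batch yields an empty table.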
Deploying Spark – Cluster
Manager Types
• Standalone mode
– Comes bundled (EC2 capable)
• YARN
• Mesos

Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 

Intro to Apache Spark by CTO of Twingo

  • 1. An Overview of Apache Spark Ilya Gulman CTO, Twingo June, 2014
  • 2. About Twingo • A Big Data company • Established in 2006 by Golan Nahum • 25 employees • Reseller and expert integrator of HP Vertica • Reseller and integrator of MapR • Reseller and expert integrator of MicroStrategy • Deep knowledge of Python and Linux • More than 20 successful Big Data projects completed • Expertise in SaaS/OEM Big Data solutions
  • 3. Agenda • What is Spark? • The Difference with Spark • SQL on Spark • Combining the power • Real-World Use Cases • Resources
  • 5. What is Spark? Fast and general MapReduce-like engine for large-scale data processing • Fast: in-memory data storage for very fast interactive queries; up to 100 times faster than Hadoop • General: unified platform that can combine SQL, machine learning, streaming, graph, and complex analytics • Ease of use: can be developed in Java, Scala, or Python • Integrated with Hadoop: can read from HDFS, HBase, Cassandra, and any Hadoop data source
  • 6. Spark is the Most Active Open Source Project in Big Data (bar chart: project contributors in the past year; recoverable labels include Giraph, Storm, and Tez)
  • 8. Unified Platform Shark (SQL) Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Continued innovation bringing new functionality, e.g.: • Java 8 (Closures, Lambda Expressions) • Spark SQL (SQL on Spark, not just Hive) • BlinkDB (Approximate Queries) • SparkR (R wrapper for Spark)
  • 9. Supported Languages • Java • Scala • Python • SQL
  • 10. Data Sources • Local Files – file:///opt/httpd/logs/access_log • S3 • Hadoop Distributed Filesystem – Regular files, sequence files, any other Hadoop InputFormat • HBase • Can also read from any other Hadoop data source.
  • 12. Easy and Fast Big Data • Easy to Develop – Rich APIs in Java, Scala, Python – Interactive shell • Fast to Run – General execution graphs – In-memory storage 2-5× less code Up to 10× faster on disk, 100× in memory
  • 13. Resilient Distributed Datasets (RDD) • Spark revolves around RDDs • Fault-tolerant collection of elements that can be operated on in parallel – Parallelized Collection: Scala collection which is run in parallel – Hadoop Dataset: records of files supported by Hadoop http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • 14. RDD Operations • Transformations – Creation of a new dataset from an existing • map, filter, distinct, union, sample, groupByKey, join, etc… • Actions – Return a value after running a computation • collect, count, first, takeSample, foreach, etc… Check the documentation for a complete list http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
  • 15. RDD Code Example file = spark.textFile("hdfs://...") errors = file.filter(lambda line: "ERROR" in line) # Count all the errors errors.count() # Count errors mentioning MySQL errors.filter(lambda line: "MySQL" in line).count() # Fetch the MySQL errors as an array of strings errors.filter(lambda line: "MySQL" in line).collect()
  • 16. RDD Persistence / Caching • Variety of storage levels – memory_only (default), memory_and_disk, etc… • API Calls – persist(StorageLevel) – cache() – shorthand for persist(StorageLevel.MEMORY_ONLY) • Considerations – Read from disk vs. recompute (memory_and_disk) – Total memory storage size (memory_only_ser) – Replicate to second node for faster fault recovery (memory_only_2) • Think about this option if supporting a web application http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
  • 17. Cache Scaling Matters (chart: execution time in seconds by % of working set in cache) • Cache disabled: 69 s • 25%: 58 s • 50%: 41 s • 75%: 30 s • Fully cached: 12 s
  • 18. RDD Fault Recovery RDDs track lineage information that can be used to efficiently recompute lost data msgs = textFile.filter(lambda s: s.startswith("ERROR")) .map(lambda s: s.split("\t")[2]) Lineage: HDFS File → filter (func = startswith(…)) → Filtered RDD → map (func = split(...)) → Mapped RDD
  • 19. Interactive Shell • Iterative Development – Cache those RDDs – Open the shell and ask questions • We have all wished we could do this with MapReduce – Compile / save your code for scheduled jobs later • Scala – spark-shell • Python – pyspark
  • 21. Before Spark - Hive • Puts structure/schema onto HDFS data • Compiles HiveQL queries into MapReduce jobs • Very popular: 90+% of Facebook Hadoop jobs generated by Hive • Initially developed by Facebook
  • 22. But... Hive is slow • Takes 20+ seconds even for simple queries • "A good day is when I can run 6 Hive queries" – @mtraverso
  • 24. Shark – SQL over Spark • Hive-compatible (HiveQL, UDFs, metadata) – Works in existing Hive warehouses without changing queries or data! • Augments Hive – In-memory tables and columnar memory store • Fast execution engine – Uses Spark as the underlying execution engine – Low-latency, interactive queries – Scale-out and tolerates worker failures
  • 27. Machine Learning - MLlib • K-Means • L1 and L2-regularized Linear Regression • L1 and L2-regularized Logistic Regression • Alternating Least Squares • Naive Bayes • Stochastic Gradient Descent * As of May 14, 2014 ** Don’t be surprised if you see the Mahout library converting to Spark soon
  • 29. Comparison to Storm • Higher throughput than Storm – Spark Streaming: 670k records/sec/node – Storm: 115k records/sec/node – Commercial systems: 100-500k records/sec/node (charts: throughput per node in MB/s vs. record size of 100 and 1000 bytes, Spark vs. Storm, for WordCount and Grep)
  • 31. Combining the power • Use a machine learning result as a table: GENERATE KMeans(tweet_locations) SAVE AS TABLE tweet_clusters; • Combine SQL, ML, and streaming (Scala): val points = sc.runSql[Double, Double]("select latitude, longitude from historic_tweets") val model = KMeans.train(points, 10) sc.twitterStream(...) .map(t => (model.closestCenter(t.location), 1)) .reduceByWindow("5s", _ + _)
  • 33. Spark at Yahoo! • Fast Machine Learning Personalized news pages
  • 34. Spark at Yahoo! • Hive on Spark (Shark) Using existing BI tools to view and query advertising analytic data collected in Hadoop. Any tool that plugs into Hive, like Tableau, automatically works with Shark.
  • 35. Spark at Conviva • One of the largest streaming video companies on the Internet • 4+ billion video feeds per month (second only to YouTube) • Conviva uses Spark Streaming to learn network conditions in real time
  • 37. Remember • If you want to use a new technology, you must learn that technology • Those who have been using Hadoop for a while once had to learn all about MapReduce and how to manage and tune it • Getting the most out of a new technology means learning it, and that includes tuning – There are switches you can use to optimize your work
  • 38. Configuration http://spark.apache.org/docs/latest/ Most Important • Application Configuration http://spark.apache.org/docs/latest/configuration.html • Standalone Cluster Configuration http://spark.apache.org/docs/latest/spark-standalone.html • Tuning Guide http://spark.apache.org/docs/latest/tuning.html
  • 39. Resources • Pig on Spark – http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html – https://github.com/aniket486/pig – https://github.com/twitter/pig/tree/spork – http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1 – https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix • Latest on Spark – http://databricks.com/categories/spark/ – http://www.spark-stack.org/
  • 41. Optional - More Examples
  • 42. Word Count • Java MapReduce (~15 lines of code) • Java Spark (~7 lines of code) • Scala and Python (4 lines of code) – interactive shell: skip line 1 and replace the last line with counts.collect() • Java 8 (4 lines of code) Java 8: JavaSparkContext sc = new JavaSparkContext(master, appName, [sparkHome], [jars]); JavaRDD<String> file = sc.textFile("hdfs://..."); JavaPairRDD<String, Integer> counts = file.flatMap(line -> Arrays.asList(line.split(" "))) .mapToPair(w -> new Tuple2<String, Integer>(w, 1)) .reduceByKey((x, y) -> x + y); counts.saveAsTextFile("hdfs://..."); Scala: val sc = new SparkContext(master, appName, [sparkHome], [jars]) val file = sc.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  • 43. Network Word Count – Streaming // Create the context with a 1 second batch size val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1), System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass)) // Create a NetworkInputDStream on target host:port and count the // words in input stream of \n delimited text (e.g. generated by 'nc') val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.print() ssc.start()
  • 44. Deploying Spark – Cluster Manager Types • Standalone mode – Comes bundled (EC2 capable) • YARN • Mesos
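
The RDD ideas in slides 14 to 18 and the word count in slide 42 can be tried without a cluster. What follows is a minimal sketch in plain Python, not the Spark API: `MiniRDD`, `from_source`, `flat_map`, `reduce_by_key`, `drop_cache`, and the sample log lines are all hypothetical names and data invented for illustration. It assumes a single process and an in-memory source, and only aims to show that transformations are recorded lazily, that actions trigger the computation, that `cache()` avoids re-reading the source, and that a lost cached result can be rebuilt from lineage.

```python
class MiniRDD:
    """Toy single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, parent, name, step):
        self._parent = parent      # upstream MiniRDD, or None for the base dataset
        self._name = name          # label of the step that produced this dataset
        self._step = step          # function: parent rows -> this dataset's rows
        self._use_cache = False
        self._cached = None

    @classmethod
    def from_source(cls, read):
        # `read` is a zero-argument callable that (re)loads the base data.
        return cls(None, "source", lambda _rows: read())

    # Transformations: only record a step; nothing is computed yet.
    def filter(self, pred):
        return MiniRDD(self, "filter", lambda rows: [r for r in rows if pred(r)])

    def map(self, fn):
        return MiniRDD(self, "map", lambda rows: [fn(r) for r in rows])

    def flat_map(self, fn):
        return MiniRDD(self, "flat_map", lambda rows: [x for r in rows for x in fn(r)])

    def reduce_by_key(self, fn):
        def step(pairs):
            acc = {}
            for key, value in pairs:
                acc[key] = fn(acc[key], value) if key in acc else value
            return list(acc.items())
        return MiniRDD(self, "reduce_by_key", step)

    def cache(self):
        self._use_cache = True     # keep the result after the first action
        return self

    def drop_cache(self):
        self._cached = None        # simulate losing the in-memory copy

    def lineage(self):
        steps, node = [], self
        while node is not None:
            steps.append(node._name)
            node = node._parent
        return steps[::-1]

    def _compute(self):
        if self._cached is not None:
            return self._cached
        parent_rows = self._parent._compute() if self._parent is not None else None
        rows = list(self._step(parent_rows))
        if self._use_cache:
            self._cached = rows
        return rows

    # Actions: walk the lineage and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return len(self._compute())


LOG_LINES = [  # made-up sample data
    "INFO\tweb\tstarted",
    "ERROR\tdb\tMySQL connection lost",
    "ERROR\tweb\ttimeout",
    "ERROR\tdb\tMySQL deadlock",
]
reads = {"n": 0}  # counts how often the base data is re-read


def read_log():
    reads["n"] += 1
    return list(LOG_LINES)


log = MiniRDD.from_source(read_log)
errors = log.filter(lambda line: line.startswith("ERROR")).cache()
mysql_errors = errors.filter(lambda line: "MySQL" in line)
reads_before_action = reads["n"]       # transformations alone read nothing

error_count = errors.count()           # first action: reads the source, fills the cache
mysql_count = mysql_errors.count()     # served from the cached `errors`
reads_with_cache = reads["n"]

errors.drop_cache()                    # cached copy is lost...
messages = errors.map(lambda line: line.split("\t")[2]).collect()
reads_after_loss = reads["n"]          # ...so lineage re-reads the source once

words = MiniRDD.from_source(lambda: ["to be or", "not to be"])
word_counts = dict(
    words.flat_map(str.split)
         .map(lambda w: (w, 1))
         .reduce_by_key(lambda x, y: x + y)
         .collect()
)
```

In this sketch each dataset keeps a reference to its parent and the step that produced it, which is the lineage; a real RDD additionally partitions data across nodes and recomputes only the lost partitions, which this toy version does not attempt.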

Editor's Notes

  1. Spark is really cool…
  2. We invested more than 12 person-years in the project, the FIX project, in collaboration with Matrix BI
  3. Yahoo and Adobe are in production with Spark.
  4. This sounds a lot like the reason to consider Pig vs. Java MapReduce
  5. Gracefully
  6. You can import the MLlib to use here in the shell!
  7. Don’t forget to share your experiences. This is really what the community is about. Don’t have time to contribute to open source, use it and share your experiences!
  8. This isn’t all proven out yet, but some of it should just work already.
  9. Best use case? Standalone followed by Mesos… My personal opinion is that Mesos is where the future will take us.