The MongoDB Spark Connector integrates MongoDB and Apache Spark, giving users the ability to process data stored in MongoDB with the massive parallelism of Spark. The connector exposes Spark's streaming capabilities, machine learning libraries, and interactive processing through the Spark shell, DataFrames, and Datasets. We'll take a tour of the connector, focusing on practical usage, and run a demo that uses both Spark and MongoDB for data processing.
Agenda
● What Is Spark
  ○ Overview
  ○ Spark Stack
● Spark + MongoDB
  ○ How to set up MongoDB and Spark?
  ○ Integration
● Use Cases / Demo
  ○ Data science, analytics
  ○ Others
Getting started with Spark - pyspark

wget https://www-eu.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xzvf spark-2.4.0-bin-hadoop2.7.tgz
spark-2.4.0-bin-hadoop2.7/bin/pyspark

Python 2.7.10 (default, Aug 17 2018, 17:41:52)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)] on darwin
...
Using Python version 2.7.10 (default, Aug 17 2018 17:41:52)
SparkSession available as 'spark'.
>>>
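To make the MongoDB connector available in the same shell, pyspark can be launched with the connector package and input/output URIs. This is a sketch for Spark 2.4 with the Scala 2.11 build of connector 2.4.0; the `127.0.0.1/test.coll` URI is an illustrative local deployment, adjust versions and URIs to your setup:

```shell
# Launch pyspark with the MongoDB Spark Connector on the classpath,
# pointing reads and writes at a local test.coll collection.
spark-2.4.0-bin-hadoop2.7/bin/pyspark \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 \
  --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll" \
  --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll"
```

With those settings in place, the shell's `spark` session can read the collection into a DataFrame via the connector's data source.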
Spark Stack

The core Apache Spark engine is extended by four libraries:
● Spark SQL - seamless integration with SQL using the DataFrame API; also supports Hive SQL
● Spark Streaming - fast feed data processing API, designed for fault tolerance; bridges streaming with batch processing
● MLlib - Spark's trick bag of machine learning algorithms
● GraphX - Spark's graph library
Demo steps
● SRT text messages on the network
● Spark collects those messages
  ○ defines a processing window
  ○ performs a word count
● The resulting DataFrame is stored in MongoDB
https://github.com/nleite/mdb.local
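Independent of the Spark machinery, the computation applied to each processing window is an ordinary word count. A minimal plain-Python sketch of that per-window step (the sample messages are made up; the demo repo above does this with Spark DataFrames instead):

```python
from collections import Counter

def word_count(messages):
    """Tally word frequencies across all messages in one processing window."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return dict(counts)

# One hypothetical window of collected messages.
window = ["hello spark", "hello mongodb"]
counts = word_count(window)  # {'hello': 2, 'spark': 1, 'mongodb': 1}
```

In the demo, the equivalent grouped counts are computed per window and the resulting DataFrame is written to MongoDB through the connector.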