The MongoDB Spark Connector integrates MongoDB and Apache Spark, giving users the ability to process data stored in MongoDB with the massive parallelism of Spark. The connector exposes Spark's streaming capabilities, machine learning libraries, and interactive processing through the Spark shell, DataFrames, and Datasets. We'll take a tour of the connector, focusing on practical usage, and run a demo that uses both Spark and MongoDB for data processing.
Agenda
● What Is Spark
  ○ Overview
  ○ Spark Stack
● Spark + MongoDB
  ○ How to set up MongoDB and Spark?
  ○ Integration
● Use Cases / Demo
  ○ Data science, analytics
  ○ Others
Getting started with Spark - pyspark

wget https://www-eu.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xzvf spark-2.4.0-bin-hadoop2.7.tgz
spark-2.4.0-bin-hadoop2.7/bin/pyspark

Python 2.7.10 (default, Aug 17 2018, 17:41:52)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.0.42)] on darwin
...
Using Python version 2.7.10 (default, Aug 17 2018 17:41:52)
SparkSession available as 'spark'.
>>>
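To make the MongoDB connector available in the same shell, pyspark can be launched with the connector package and input/output URIs. This is a sketch for Spark 2.4 with the Scala 2.11 build of connector 2.4.0; the `127.0.0.1/test.coll` URI is an illustrative local deployment, adjust versions and URIs to your setup:

```shell
# Launch pyspark with the MongoDB Spark Connector on the classpath,
# pointing reads and writes at a local test.coll collection.
spark-2.4.0-bin-hadoop2.7/bin/pyspark \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 \
  --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll" \
  --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll"
```

With those settings in place, the shell's `spark` session can read the collection into a DataFrame via the connector's data source.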
Spark Stack

The core Apache Spark engine is extended by four libraries:
● Spark SQL - seamless integration with SQL using the DataFrame API; also supports Hive SQL
● Spark Streaming - fast feed data processing API, designed for fault tolerance; bridges streaming with batch processing
● MLlib - Spark's trick bag of machine learning algorithms
● GraphX - Spark's graph library
Demo steps
● SRT text messages on the network
● Spark collects those messages
  ○ defines a processing window
  ○ performs a word count
● The resulting DataFrame is stored in MongoDB
https://github.com/nleite/mdb.local
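Independent of the Spark machinery, the computation applied to each processing window is an ordinary word count. A minimal plain-Python sketch of that per-window step (the sample messages are made up; the demo repo above does this with Spark DataFrames instead):

```python
from collections import Counter

def word_count(messages):
    """Tally word frequencies across all messages in one processing window."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return dict(counts)

# One hypothetical window of collected messages.
window = ["hello spark", "hello mongodb"]
counts = word_count(window)  # {'hello': 2, 'spark': 1, 'mongodb': 1}
```

In the demo, the equivalent grouped counts are computed per window and the resulting DataFrame is written to MongoDB through the connector.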