SlideShare ist ein Scribd-Unternehmen logo
1 von 16
Downloaden Sie, um offline zu lesen
Nested JSON
Data processing using
Apache Spark
Instructions for use
Let us read a public JSON dataset available on the internet. Extract required
fields from nested data, and analyze the dataset to get some insights. Here
I’m using the Baby names public data set available on the internet for this
demo.
2
What are we performing in this demo?
╺ Read data from the URL using scala API
╺ Convert the read data into a dataframe
╺ Extract the required fields from the nested JSON dataset
╺ Analyze the data by writing queries
╺ Visualize the processed data
3
Let us read a public JSON dataset available on the internet. Extract required fields from
nested data, and analyze the dataset to get some insights. Here I’m using the Baby names
public data set available on the internet for this demo.
After this, we use the jsonString Val created above and create a dataframe using Spark API.
We need to import spark.implicits to convert Sequence of Strings to a Dataset, and then
we create a dataframe out of it.
Now let us see the schema of the JSON using printSchema method:
5
Now let us see the schema of the JSON using printSchema method:
|-- data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true))
Also, it contains metadata about the data, let’s not worry about it, for now. But you can
have a look at it when you run this in your machine. Mainly it contains columns field
information in metadata, which I have extracted for you to have a better understanding of
the data we will work on.
We have below fields within an Array of data that we are going to analyze.
6
╺ meta
╺ Year
╺ first_name
╺ County
╺ Sex
╺ Count
╺ Sid
╺ Id
╺ Position
╺ created_at
╺ created_meta
╺ updated_at
╺ updated_meta
7
But how we can extract these data fields from JSON? Now let’s select data from the jsonDF
dataframe we created. It looks something like this
8
Now we have to extract the fields within this data. To do this, let us first create a temporary
view of this dataframe and use explode function to extract Year, Name, County, and gender
fields.
To use explode method, we should first import spark sql functions.
9
Now let us see the schema of the insightData.
10
Let me show you the contents of insightData datafrmae using the display method
available in Databricks.
11
Now let us write a query to see what is the most popular first letter baby names to start
within each year.
insightData.select("year","name").createOrReplaceTempView("yearname")
val dis=spark.sql("select year,firstLetter,count,ranks from (select year,firstLetter,count
,rank() over (partition by year order by count desc) as ranks from (select year, left(name,1) as
firstLetter, count(1) as count from yearname group by year ,firstLetter order by year
desc,count desc)Y )Z where ranks=1 order by year desc")
12
Now let’s visualize this data using the graphs available in Databricks.
13
Apache Spark Integration
Services
With 15+ years in data analytics technology services,
Aegis Softwares Canada expert offers a wide range of
apache spark implementation, integration, and
development solutions also 24/7 support.
14
AEGIS SOFTWARE
CANADA (Branch Office)
2 Robert Speck Parkway,
Suite 750, Mississauga,
ON Ontario-L4Z1H8,
Canada.
OFSHORE SOFTWARE DEVELOPMENT COMPANY
INDIA (Head Office)
319, 3rd Floor, Golden Plaza,
Tagore Road,
Rajkot – 360001
Gujarat, India
info@aegissoftwares.com www.aegissoftwares.com
16
Thank you

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleDatabricks
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slidesDat Tran
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
Real World End to End machine Learning Pipeline
Real World End to End machine Learning PipelineReal World End to End machine Learning Pipeline
Real World End to End machine Learning PipelineSrivatsan Srinivasan
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Databricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez Willy Lulciuc
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code GenerationDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 

Was ist angesagt? (20)

Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Real World End to End machine Learning Pipeline
Real World End to End machine Learning PipelineReal World End to End machine Learning Pipeline
Real World End to End machine Learning Pipeline
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Spark
SparkSpark
Spark
 
Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 

Ähnlich wie Nested JSON data processing with Apache Spark

Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...
Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...
Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...Kavika Roy
 
Machine Learning with SparkR
Machine Learning with SparkRMachine Learning with SparkR
Machine Learning with SparkROlgun Aydın
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 
Using pandas library for data analysis in python
Using pandas library for data analysis in pythonUsing pandas library for data analysis in python
Using pandas library for data analysis in pythonBruce Jenks
 
Module 9: Natural Language Processing Part 2
Module 9:  Natural Language Processing Part 2Module 9:  Natural Language Processing Part 2
Module 9: Natural Language Processing Part 2Sara Hooker
 
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...Amazon Web Services
 
Working with solr.pptx
Working with solr.pptxWorking with solr.pptx
Working with solr.pptxalignminds
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 
Oracle to vb 6.0 connectivity
Oracle to vb 6.0 connectivityOracle to vb 6.0 connectivity
Oracle to vb 6.0 connectivityrohit vishwakarma
 
Unit-2 JSON.pdf
Unit-2 JSON.pdfUnit-2 JSON.pdf
Unit-2 JSON.pdfhskznx
 
Important SAS Tips and Tricks for A Grade
Important SAS Tips and Tricks for A GradeImportant SAS Tips and Tricks for A Grade
Important SAS Tips and Tricks for A GradeLesa Cote
 
Big data analysis using spark r published
Big data analysis using spark r publishedBig data analysis using spark r published
Big data analysis using spark r publishedDipendra Kusi
 
School of Data - mapping company networks
School of Data - mapping company networksSchool of Data - mapping company networks
School of Data - mapping company networksTony Hirst
 
Database adapter
Database adapterDatabase adapter
Database adapterxavier john
 

Ähnlich wie Nested JSON data processing with Apache Spark (20)

Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...
Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...
Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...
 
Machine Learning with SparkR
Machine Learning with SparkRMachine Learning with SparkR
Machine Learning with SparkR
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Automotive industry ppt
Automotive industry pptAutomotive industry ppt
Automotive industry ppt
 
Using pandas library for data analysis in python
Using pandas library for data analysis in pythonUsing pandas library for data analysis in python
Using pandas library for data analysis in python
 
Module 9: Natural Language Processing Part 2
Module 9:  Natural Language Processing Part 2Module 9:  Natural Language Processing Part 2
Module 9: Natural Language Processing Part 2
 
Linq
LinqLinq
Linq
 
Java programming-examples
Java programming-examplesJava programming-examples
Java programming-examples
 
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
 
Working with solr.pptx
Working with solr.pptxWorking with solr.pptx
Working with solr.pptx
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Spring database - part2
Spring database -  part2Spring database -  part2
Spring database - part2
 
Spark1
Spark1Spark1
Spark1
 
Oracle to vb 6.0 connectivity
Oracle to vb 6.0 connectivityOracle to vb 6.0 connectivity
Oracle to vb 6.0 connectivity
 
Json tutorial, a beguiner guide
Json tutorial, a beguiner guideJson tutorial, a beguiner guide
Json tutorial, a beguiner guide
 
Unit-2 JSON.pdf
Unit-2 JSON.pdfUnit-2 JSON.pdf
Unit-2 JSON.pdf
 
Important SAS Tips and Tricks for A Grade
Important SAS Tips and Tricks for A GradeImportant SAS Tips and Tricks for A Grade
Important SAS Tips and Tricks for A Grade
 
Big data analysis using spark r published
Big data analysis using spark r publishedBig data analysis using spark r published
Big data analysis using spark r published
 
School of Data - mapping company networks
School of Data - mapping company networksSchool of Data - mapping company networks
School of Data - mapping company networks
 
Database adapter
Database adapterDatabase adapter
Database adapter
 

Mehr von Aegis Software Canada

Top apache-spark concepts and services in India
Top apache-spark concepts and services in IndiaTop apache-spark concepts and services in India
Top apache-spark concepts and services in IndiaAegis Software Canada
 
Grow your business using the Dynamics 365 Solutions
Grow your business using the Dynamics 365 SolutionsGrow your business using the Dynamics 365 Solutions
Grow your business using the Dynamics 365 SolutionsAegis Software Canada
 
Detailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark FrameworkDetailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark FrameworkAegis Software Canada
 
Microsoft Dynamics 365: Tutorial of Content and Modules
Microsoft Dynamics 365: Tutorial of Content and ModulesMicrosoft Dynamics 365: Tutorial of Content and Modules
Microsoft Dynamics 365: Tutorial of Content and ModulesAegis Software Canada
 
2018 What's New in Visual Studio Code 1.25?
2018 What's New in Visual Studio Code 1.25?2018 What's New in Visual Studio Code 1.25?
2018 What's New in Visual Studio Code 1.25?Aegis Software Canada
 
Fantastic four machine_learning_java_libraries
Fantastic four machine_learning_java_librariesFantastic four machine_learning_java_libraries
Fantastic four machine_learning_java_librariesAegis Software Canada
 
Liferay plugin customization to change the behavior in portal
Liferay plugin customization to change the behavior in portalLiferay plugin customization to change the behavior in portal
Liferay plugin customization to change the behavior in portalAegis Software Canada
 

Mehr von Aegis Software Canada (8)

Top apache-spark concepts and services in India
Top apache-spark concepts and services in IndiaTop apache-spark concepts and services in India
Top apache-spark concepts and services in India
 
Grow your business using the Dynamics 365 Solutions
Grow your business using the Dynamics 365 SolutionsGrow your business using the Dynamics 365 Solutions
Grow your business using the Dynamics 365 Solutions
 
Detailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark FrameworkDetailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark Framework
 
Microsoft Dynamics 365: Tutorial of Content and Modules
Microsoft Dynamics 365: Tutorial of Content and ModulesMicrosoft Dynamics 365: Tutorial of Content and Modules
Microsoft Dynamics 365: Tutorial of Content and Modules
 
Why Choose Dynamics 365 CRM?
Why Choose Dynamics 365 CRM?Why Choose Dynamics 365 CRM?
Why Choose Dynamics 365 CRM?
 
2018 What's New in Visual Studio Code 1.25?
2018 What's New in Visual Studio Code 1.25?2018 What's New in Visual Studio Code 1.25?
2018 What's New in Visual Studio Code 1.25?
 
Fantastic four machine_learning_java_libraries
Fantastic four machine_learning_java_librariesFantastic four machine_learning_java_libraries
Fantastic four machine_learning_java_libraries
 
Liferay plugin customization to change the behavior in portal
Liferay plugin customization to change the behavior in portalLiferay plugin customization to change the behavior in portal
Liferay plugin customization to change the behavior in portal
 

Kürzlich hochgeladen

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 

Kürzlich hochgeladen (20)

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

Nested JSON data processing with Apache Spark

  • 1. Nested JSON Data processing using Apache Spark
  • 2. Instructions for use Let us read a public JSON dataset available on the internet. Extract required fields from nested data, and analyze the dataset to get some insights. Here I’m using the Baby names public data set available on the internet for this demo. 2 What are we performing in this demo? ╺ Read data from the URL using scala API ╺ Convert the read data into a dataframe ╺ Extract the required fields from the nested JSON dataset ╺ Analyze the data by writing queries ╺ Visualize the processed data
  • 3. 3 Let us read a public JSON dataset available on the internet. Extract required fields from nested data, and analyze the dataset to get some insights. Here I’m using the Baby names public data set available on the internet for this demo. After this, we use the jsonString Val created above and create a dataframe using Spark API. We need to import spark.implicits to convert Sequence of Strings to a Dataset, and then we create a dataframe out of it.
  • 4. Now let us see the schema of the JSON using printSchema method:
  • 5. 5 Now let us see the schema of the JSON using printSchema method: |-- data: array (nullable = true) | |-- element: array (containsNull = true) | | |-- element: string (containsNull = true)) Also, it contains metadata about the data, let’s not worry about it, for now. But you can have a look at it when you run this in your machine. Mainly it contains columns field information in metadata, which I have extracted for you to have a better understanding of the data we will work on.
  • 6. We have below fields within an Array of data that we are going to analyze. 6 ╺ meta ╺ Year ╺ first_name ╺ County ╺ Sex ╺ Count ╺ Sid ╺ Id ╺ Position ╺ created_at ╺ created_meta ╺ updated_at ╺ updated_meta
  • 7. 7 But how we can extract these data fields from JSON? Now let’s select data from the jsonDF dataframe we created. It looks something like this
  • 8. 8 Now we have to extract the fields within this data. To do this, let us first create a temporary view of this dataframe and use explode function to extract Year, Name, County, and gender fields. To use explode method, we should first import spark sql functions.
  • 9. 9 Now let us see the schema of the insightData.
  • 10. 10 Let me show you the contents of insightData datafrmae using the display method available in Databricks.
  • 11. 11 Now let us write a query to see what is the most popular first letter baby names to start within each year. insightData.select("year","name").createOrReplaceTempView("yearname") val dis=spark.sql("select year,firstLetter,count,ranks from (select year,firstLetter,count ,rank() over (partition by year order by count desc) as ranks from (select year, left(name,1) as firstLetter, count(1) as count from yearname group by year ,firstLetter order by year desc,count desc)Y )Z where ranks=1 order by year desc")
  • 12. 12 Now let’s visualize this data using the graphs available in Databricks.
  • 13. 13
  • 14. Apache Spark Integration Services With 15+ years in data analytics technology services, Aegis Softwares Canada expert offers a wide range of apache spark implementation, integration, and development solutions also 24/7 support. 14
  • 15. AEGIS SOFTWARE CANADA (Branch Office) 2 Robert Speck Parkway, Suite 750, Mississauga, ON Ontario-L4Z1H8, Canada. OFSHORE SOFTWARE DEVELOPMENT COMPANY INDIA (Head Office) 319, 3rd Floor, Golden Plaza, Tagore Road, Rajkot – 360001 Gujarat, India info@aegissoftwares.com www.aegissoftwares.com