In this article we explain some of the best ways to process nested JSON data with Apache Spark, illustrated with examples at each step.
Instructions for use
Let us read a public JSON dataset available on the internet, extract the required fields from the nested data, and analyze the dataset to get some insights. For this demo, I'm using the publicly available Baby Names dataset.
What are we performing in this demo?
- Read data from the URL using the Scala API
- Convert the data we read into a dataframe
- Extract the required fields from the nested JSON dataset
- Analyze the data by writing queries
- Visualize the processed data
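Let's start with the first step: reading the data from the URL with the Scala API. The original deck showed this as a screenshot, so here is a minimal sketch instead; the endpoint below is an assumption (New York's Baby Names dataset on health.data.ny.gov), so substitute whatever URL you are working with.

import scala.io.Source

// Fetch the raw JSON payload as a single string.
// NOTE: this URL is an assumption; the original slides showed it only in a screenshot.
val url = "https://health.data.ny.gov/api/views/jxy9-yhdk/rows.json?accessType=DOWNLOAD"
val jsonString = Source.fromURL(url).mkString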
After this, we use the jsonString val created above to build a dataframe with the Spark API. We need to import spark.implicits._ to convert a sequence of strings to a Dataset, and then we create a dataframe out of it.
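A sketch of that step, assuming we are in a Databricks notebook where spark is the active SparkSession:

import spark.implicits._

// Wrap the JSON string in a Dataset[String] so Spark can parse it as JSON
val jsonDF = spark.read.json(Seq(jsonString).toDS)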
Now let us see the schema of the JSON using the printSchema method.
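Assuming the dataframe from the previous step is named jsonDF, that is a one-liner:

// Print the schema Spark inferred for the parsed JSON
jsonDF.printSchema()

The relevant part of the output looks like this: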
|-- data: array (nullable = true)
|    |-- element: array (containsNull = true)
|    |    |-- element: string (containsNull = true)
The schema also contains metadata about the data; let's not worry about that for now, but you can have a look at it when you run this on your machine. The metadata mainly holds the column information, which I have extracted below to give you a better understanding of the data we will work on.
Within the data array we have the fields below, which we are going to analyze:
- meta
- Year
- first_name
- County
- Sex
- Count
- Sid
- Id
- Position
- created_at
- created_meta
- updated_at
- updated_meta
But how can we extract these fields from the JSON? Let's start by selecting the data column from the jsonDF dataframe we created. It looks something like this:
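The slide showed the result as a screenshot; the select itself is just the following, shown here with the Databricks display helper that we also use later:

// Each value in the data column is an array of string arrays
display(jsonDF.select("data"))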
Now we have to extract the fields within this data. To do this, let us first create a temporary view of this dataframe and use the explode function to extract the Year, Name, County, and Sex fields.
To use the explode method, we should first import the Spark SQL functions.
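Putting those steps together, here is a hedged sketch. The array positions (record[8] and so on) are assumptions based on the column order implied by the metadata; verify them against the meta field when you run this yourself.

// Needed when calling explode and friends from the DataFrame API
import org.apache.spark.sql.functions._

// Register the raw dataframe as a temporary view for Spark SQL
jsonDF.createOrReplaceTempView("jsonTable")

// explode flattens the outer data array into one row per record; each field
// is then picked out of the inner array by position.
// NOTE: the indexes below are assumptions -- check the column order in meta.
val insightData = spark.sql("""
  select record[8]  as year,
         record[9]  as name,
         record[10] as county,
         record[11] as sex
  from (select explode(data) as record from jsonTable)
""")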
Let me show you the contents of the insightData dataframe using the display method available in Databricks.
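In a notebook cell, that is simply:

// Renders insightData as an interactive table in Databricks
display(insightData)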
Now let us write a query to find the most popular first letter of baby names in each year.
insightData.select("year", "name").createOrReplaceTempView("yearname")

val dis = spark.sql("""
  select year, firstLetter, count, ranks
  from (
    select year, firstLetter, count,
           rank() over (partition by year order by count desc) as ranks
    from (
      select year, left(name, 1) as firstLetter, count(1) as count
      from yearname
      group by year, firstLetter
      order by year desc, count desc
    ) Y
  ) Z
  where ranks = 1
  order by year desc
""")
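That covers the query. For the final step in our list, visualizing the processed data, the ranked result can again be rendered with the Databricks display helper and charted (for example, count by year) from the notebook UI:

display(dis)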
Apache Spark Integration Services

With 15+ years in data analytics technology services, Aegis Softwares Canada offers a wide range of Apache Spark implementation, integration, and development solutions, along with 24/7 support.