```python
!pip install pyspark
```
Collecting pyspark
Downloading pyspark-2.2.0.post0.tar.gz (188.3MB)
Collecting py4j==0.10.4 (from pyspark)
Downloading py4j-0.10.4-py2.py3-none-any.whl (186kB)
Building wheels for collected packages: pyspark
Running setup.py bdist_wheel for pyspark: started
Running setup.py bdist_wheel for pyspark: finished with status 'done'
Stored in directory: C:\Users\Dell\AppData\Local\pip\Cache\wheels\5f\0b\b3\5cb16b15d28dcc32f8e7ec91a044829642874bb7586f6e6cbe
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.4 pyspark-2.2.0
```python
from pyspark import SparkContext,SparkConf
sc=SparkContext()
```
```python
import os
```
```python
os.getcwd()
```
'C:\\Users\\Dell'
```python
os.chdir(r'C:\Users\Dell\Desktop')
```
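Backslashes in Windows paths are easy to lose or to misread as escape sequences. The following is a minimal sketch, not part of the original notebook, of three equivalent ways to spell the Desktop path; the use of `PureWindowsPath` is an illustrative choice so the snippet behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

raw = r'C:\Users\Dell\Desktop'          # raw string: backslashes are kept literally
escaped = 'C:\\Users\\Dell\\Desktop'    # ordinary string: each backslash escaped by hand
built = PureWindowsPath('C:/', 'Users', 'Dell', 'Desktop')  # assembled from parts

print(raw == escaped)
print(str(built) == raw)
```

Any of the three forms can be passed to `os.chdir` on Windows (the `PureWindowsPath` one after `str()`).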
```python
os.listdir()
```
['desktop.ini',
'dump 2582017',
'Fusion Church.html',
'Fusion Church_files',
'iris.csv',
'KOG',
'NF22997109906610.ETicket.pdf',
'R Packages',
'Telegram.lnk',
'twitter_share.jpg',
'winutils.exe',
'~$avel Reimbursements.docx',
'~$thonajay.docx']
```python
# load the CSV file as an RDD of text lines
data = sc.textFile(r'C:\Users\Dell\Desktop\iris.csv')
```
```python
type(data)
```
pyspark.rdd.RDD
```python
data.top(1)
```
['7.9,3.8,6.4,2,"virginica"']
```python
data.first()
```
'"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"'
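Because `sc.textFile` yields plain strings, `top(1)` ranks whole lines in string order rather than by any numeric column, while `first()` simply returns the first line, here the header. A plain-Python sketch of the same behaviour, assuming a hypothetical four-line excerpt of `iris.csv`:

```python
# Hypothetical excerpt of iris.csv: the header plus three data rows.
lines = [
    '"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"',
    '5.1,3.5,1.4,0.2,"setosa"',
    '7.9,3.8,6.4,2,"virginica"',
    '6.4,3.2,4.5,1.5,"versicolor"',
]

largest = sorted(lines, reverse=True)[:1]  # analogous to data.top(1)
first = lines[0]                           # analogous to data.first()

print(largest)
print(first)
```

The header starts with a double quote, which sorts before every digit, so it never wins the string comparison; the row beginning with `7.9` does.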
```python
from pyspark.sql import SparkSession
```
```python
# parentheses let the builder chain span several lines
spark = (SparkSession.builder
         .master("local")
         .appName("Data Exploration")
         .getOrCreate())
```
```python
# load the same file as a Spark DataFrame; the first line supplies the column names
data2 = (spark.read.format("csv")
         .option("header", "true")
         .option("mode", "DROPMALFORMED")
         .load(r'C:\Users\Dell\Desktop\iris.csv'))
```
```python
type(data2)
```
pyspark.sql.dataframe.DataFrame
```python
data2.printSchema()
```
root
|-- Sepal.Length: string (nullable = true)
|-- Sepal.Width: string (nullable = true)
|-- Petal.Length: string (nullable = true)
|-- Petal.Width: string (nullable = true)
|-- Species: string (nullable = true)
```python
data2.columns
```
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width',
'Species']
```python
data2.schema.names
```
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width',
'Species']
```python
newColumns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width',
'Species']
```
```python
from functools import reduce
```
```python
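# oldColumns is assumed to be the current column names, in the same order as newColumns
oldColumns = data2.schema.names
data2 = reduce(
    lambda df, idx: df.withColumnRenamed(oldColumns[idx], newColumns[idx]),
    range(len(oldColumns)),
    data2,
)
data2.printSchema()
data2.show()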
```
root
|-- Sepal_Length: string (nullable = true)
|-- Sepal_Width: string (nullable = true)
|-- Petal_Length: string (nullable = true)
|-- Petal_Width: string (nullable = true)
|-- Species: string (nullable = true)
+------------+-----------+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|Species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|          3|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|           5|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|           5|        3.4|         1.5|        0.2| setosa|
|         4.4|        2.9|         1.4|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
|         5.4|        3.7|         1.5|        0.2| setosa|
|         4.8|        3.4|         1.6|        0.2| setosa|
|         4.8|          3|         1.4|        0.1| setosa|
|         4.3|          3|         1.1|        0.1| setosa|
|         5.8|          4|         1.2|        0.2| setosa|
|         5.7|        4.4|         1.5|        0.4| setosa|
|         5.4|        3.9|         1.3|        0.4| setosa|
|         5.1|        3.5|         1.4|        0.3| setosa|
|         5.7|        3.8|         1.7|        0.3| setosa|
|         5.1|        3.8|         1.5|        0.3| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 20 rows
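The `reduce` call works by threading one accumulator through every index, renaming a single column per step. The plain-Python sketch below mirrors that folding pattern on a list of names; `rename` is a hypothetical stand-in for `withColumnRenamed`, not a Spark function.

```python
from functools import reduce

old_names = ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
new_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']

def rename(columns, idx):
    """Hypothetical stand-in for withColumnRenamed: swap one name, return a new list."""
    return [new_names[idx] if name == old_names[idx] else name for name in columns]

# Start from old_names and apply rename once per index, feeding each result forward.
renamed = reduce(rename, range(len(old_names)), old_names)
print(renamed)
```

When every column is renamed at once, `data2.toDF(*newColumns)` is a shorter alternative that should give the same result.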
```python
data2.dtypes
```
[('Sepal_Length', 'string'),
('Sepal_Width', 'string'),
('Petal_Length', 'string'),
('Petal_Width', 'string'),
('Species', 'string')]
```python
data3 = data2.select('Sepal_Length', 'Sepal_Width', 'Species')
data3.cache()
data3.count()
```
150
```python
data3.show()
```
+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Species|
+------------+-----------+-------+
|         5.1|        3.5| setosa|
|         4.9|          3| setosa|
|         4.7|        3.2| setosa|
|         4.6|        3.1| setosa|
|           5|        3.6| setosa|
|         5.4|        3.9| setosa|
|         4.6|        3.4| setosa|
|           5|        3.4| setosa|
|         4.4|        2.9| setosa|
|         4.9|        3.1| setosa|
|         5.4|        3.7| setosa|
|         4.8|        3.4| setosa|
|         4.8|          3| setosa|
|         4.3|          3| setosa|
|         5.8|          4| setosa|
|         5.7|        4.4| setosa|
|         5.4|        3.9| setosa|
|         5.1|        3.5| setosa|
|         5.7|        3.8| setosa|
|         5.1|        3.8| setosa|
+------------+-----------+-------+
only showing top 20 rows
```python
data3.limit(5)
```
DataFrame[Sepal_Length: string, Sepal_Width: string, Species: string]
```python
data3.limit(5).show()
```
+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Species|
+------------+-----------+-------+
|         5.1|        3.5| setosa|
|         4.9|          3| setosa|
|         4.7|        3.2| setosa|
|         4.6|        3.1| setosa|
|           5|        3.6| setosa|
+------------+-----------+-------+
```python
data3.limit(5).limit(2).show()
```
+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Species|
+------------+-----------+-------+
|         5.1|        3.5| setosa|
|         4.9|          3| setosa|
+------------+-----------+-------+
```python
# casting to INT drops the decimals (5.1 becomes 5), which lowers the averages computed from it
data4 = data2.selectExpr('CAST(Sepal_Length AS INT) AS Sepal_Length')
```
```python
data4
```
DataFrame[Sepal_Length: int]
```python
# import only the functions used, so Python built-ins such as sum and max are not shadowed
from pyspark.sql.functions import col, mean
```
```python
data4.select('Sepal_Length').agg(mean('Sepal_Length')).show()
```
+-----------------+
|avg(Sepal_Length)|
+-----------------+
|5.386666666666667|
+-----------------+
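The average above is computed on values already cast to `INT`, so every measurement has lost its decimals before the mean is taken. A small plain-Python sketch of that effect, using the first five `Sepal_Length` strings from the table; `int(float(v))` is assumed here as a stand-in for the truncating cast.

```python
values = ['5.1', '4.9', '4.7', '4.6', '5']   # first five Sepal_Length strings

as_int = [int(float(v)) for v in values]     # truncates: 5.1 -> 5, 4.9 -> 4
as_float = [float(v) for v in values]        # keeps the decimals

mean_int = sum(as_int) / len(as_int)
mean_float = sum(as_float) / len(as_float)

print(mean_int, round(mean_float, 2))
```

Casting with `CAST(Sepal_Length AS DOUBLE)` instead of `INT` in `selectExpr` would keep the decimals, if the untruncated mean is what is wanted.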
```python
data5 = data2.selectExpr(
    'CAST(Sepal_Length AS INT) AS Sepal_Length',
    'CAST(Petal_Width AS INT) AS Petal_Width',
    'CAST(Sepal_Width AS INT) AS Sepal_Width',
    'CAST(Petal_Length AS INT) AS Petal_Length',
    'Species',
)
```
```python
data5
```
DataFrame[Sepal_Length: int, Petal_Width: int, Sepal_Width: int,
Petal_Length: int, Species: string]
```python
data5.columns
```
['Sepal_Length', 'Petal_Width', 'Sepal_Width', 'Petal_Length',
'Species']
```python
data5.select('Sepal_Length', 'Species').groupBy('Species').agg(mean('Sepal_Length')).show()
```
+----------+-----------------+
|   Species|avg(Sepal_Length)|
+----------+-----------------+
| virginica|             6.08|
|versicolor|             5.48|
|    setosa|              4.6|
+----------+-----------------+
```python
# df = data3.select(col('Sepal_Length'), data3.Sepal_Length.cast('float').alias('price'))
```

Pyspark

```python
!pip install pyspark
```
Collecting pyspark
Downloading pyspark-2.2.0.post0.tar.gz (188.3MB)
Collecting py4j==0.10.4 (from pyspark)
Downloading py4j-0.10.4-py2.py3-none-any.whl (186kB)
Building wheels for collected packages: pyspark
Running setup.py bdist_wheel for pyspark: started
Running setup.py bdist_wheel for pyspark: finished with status 'done'
Stored in directory: C:\Users\Dell\AppData\Local\pip\Cache\wheels\5f\0b\b3\5cb16b15d28dcc32f8e7ec91a044829642874bb7586f6e6cbe
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.4 pyspark-2.2.0
```python
from pyspark import SparkContext, SparkConf
sc = SparkContext()
```
```python
import os
```
```python
os.getcwd()
```
'C:\\Users\\Dell'
```python
os.chdir(r'C:\Users\Dell\Desktop')
```
```python
os.listdir()
```
['desktop.ini',
'dump 2582017',
'Fusion Church.html',
'Fusion Church_files',
'iris.csv',
'KOG',
'NF22997109906610.ETicket.pdf',
'R Packages',
'Telegram.lnk',
'twitter_share.jpg',
'winutils.exe',
'~$avel Reimbursements.docx',
'~$thonajay.docx']
```python
# Load the data as an RDD of raw text lines
data = sc.textFile(r'C:\Users\Dell\Desktop\iris.csv')
```
```python
type(data)
```
pyspark.rdd.RDD
```python
data.top(1)
```
['7.9,3.8,6.4,2,"virginica"']
```python
data.first()
```
'"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"'
```python
from pyspark.sql import SparkSession
```
```python
spark = (
    SparkSession.builder
    .master("local")
    .appName("Data Exploration")
    .getOrCreate()
)
```
```python
# Load the data as a Spark DataFrame
data2 = (
    spark.read.format("csv")
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .load(r'C:\Users\Dell\Desktop\iris.csv')
)
```
```python
type(data2)
```
pyspark.sql.dataframe.DataFrame
```python
data2.printSchema()
```
root
|-- Sepal.Length: string (nullable = true)
|-- Sepal.Width: string (nullable = true)
|-- Petal.Length: string (nullable = true)
|-- Petal.Width: string (nullable = true)
|-- Species: string (nullable = true)
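`data.top(1)` returned the `7.9,...` line rather than the header because the RDD holds raw text lines and `top` picks the largest elements under their natural ordering, which for strings is a character-by-character comparison. The following is a minimal plain-Python sketch of that comparison, with no Spark required. The two setosa lines are written in the raw format implied by the outputs above, which is an assumption about the exact text of the file.

```python
# Raw text lines as sc.textFile would yield them (header included).
lines = [
    '"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"',
    '5.1,3.5,1.4,0.2,"setosa"',   # assumed raw form of a data row
    '4.9,3,1.4,0.2,"setosa"',     # assumed raw form of a data row
    '7.9,3.8,6.4,2,"virginica"',
]

# Strings compare character by character; the double quote sorts before
# any digit, so the header is the smallest line and '7.9...' the largest.
largest = max(lines)
smallest = min(lines)
print(largest)
print(smallest)
```

The same ordering explains why the header never shows up in `top`, even though `first()` returns it as the first line of the file.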
```python
data2.columns
```
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
```python
data2.schema.names
```
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
```python
# Keep the original names so each one can be mapped to its replacement
oldColumns = data2.schema.names
newColumns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
```
```python
from functools import reduce
```
```python
# Rename the columns one at a time, threading the DataFrame through reduce
data2 = reduce(
    lambda df, idx: df.withColumnRenamed(oldColumns[idx], newColumns[idx]),
    range(len(oldColumns)),
    data2,
)
data2.printSchema()
data2.show()
```
root
|-- Sepal_Length: string (nullable = true)
|-- Sepal_Width: string (nullable = true)
|-- Petal_Length: string (nullable = true)
|-- Petal_Width: string (nullable = true)
|-- Species: string (nullable = true)
+------------+-----------+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|Species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|          3|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|           5|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|           5|        3.4|         1.5|        0.2| setosa|
|         4.4|        2.9|         1.4|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
|         5.4|        3.7|         1.5|        0.2| setosa|
|         4.8|        3.4|         1.6|        0.2| setosa|
|         4.8|          3|         1.4|        0.1| setosa|
|         4.3|          3|         1.1|        0.1| setosa|
|         5.8|          4|         1.2|        0.2| setosa|
|         5.7|        4.4|         1.5|        0.4| setosa|
|         5.4|        3.9|         1.3|        0.4| setosa|
|         5.1|        3.5|         1.4|        0.3| setosa|
|         5.7|        3.8|         1.7|        0.3| setosa|
|         5.1|        3.8|         1.5|        0.3| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 20 rows
```python
data2.dtypes
```
[('Sepal_Length', 'string'),
('Sepal_Width', 'string'),
('Petal_Length', 'string'),
('Petal_Width', 'string'),
('Species', 'string')]
```python
data3 = data2.select('Sepal_Length', 'Sepal_Width', 'Species')
data3.cache()
data3.count()
```
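The rename replaces the dots in the column names because a dot inside a Spark SQL expression is read as struct-field access, so a name like `Sepal.Length` would need backticks in `selectExpr`. As a hedged alternative to typing the new names by hand, they can be derived from the old ones. The sketch below is plain Python over the names printed above; the `toDF` call appears only as a comment because it needs a live DataFrame.

```python
# Column names exactly as reported by data2.columns before the rename.
old_columns = ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']

# Replace every dot with an underscore; names without dots are unchanged.
new_columns = [name.replace('.', '_') for name in old_columns]
print(new_columns)

# With a DataFrame in hand, all columns could then be renamed in one call:
# data2 = data2.toDF(*new_columns)
```

Deriving the names this way keeps the old and new lists the same length and in the same order, which the index-based `reduce` rename relies on.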
150
```python
data3.show()
```
+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Species|
+------------+-----------+-------+
|         5.1|        3.5| setosa|
|         4.9|          3| setosa|
|         4.7|        3.2| setosa|
|         4.6|        3.1| setosa|
|           5|        3.6| setosa|
|         5.4|        3.9| setosa|
|         4.6|        3.4| setosa|
|           5|        3.4| setosa|
|         4.4|        2.9| setosa|
|         4.9|        3.1| setosa|
|         5.4|        3.7| setosa|
|         4.8|        3.4| setosa|
|         4.8|          3| setosa|
|         4.3|          3| setosa|
|         5.8|          4| setosa|
|         5.7|        4.4| setosa|
|         5.4|        3.9| setosa|
|         5.1|        3.5| setosa|
|         5.7|        3.8| setosa|
|         5.1|        3.8| setosa|
+------------+-----------+-------+
only showing top 20 rows
```python
data3.limit(5)
```
DataFrame[Sepal_Length: string, Sepal_Width: string, Species: string]
```python
data3.limit(5).show()
```
+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Species|
+------------+-----------+-------+
|         5.1|        3.5| setosa|
|         4.9|          3| setosa|
|         4.7|        3.2| setosa|
|         4.6|        3.1| setosa|
|           5|        3.6| setosa|
+------------+-----------+-------+
```python
data3.limit(5).limit(2).show()
```
+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Species|
+------------+-----------+-------+
|         5.1|        3.5| setosa|
|         4.9|          3| setosa|
+------------+-----------+-------+
```python
# Note: CAST(... AS INT) drops the fractional part, so 5.1 becomes 5
data4 = data2.selectExpr('CAST(Sepal_Length AS INT) AS Sepal_Length')
```
```python
data4
```
DataFrame[Sepal_Length: int]
```python
from pyspark.sql.functions import col, mean
```
```python
data4.select('Sepal_Length').agg(mean('Sepal_Length')).show()
```
+-----------------+
|avg(Sepal_Length)|
+-----------------+
|5.386666666666667|
+-----------------+
```python
data5 = data2.selectExpr(
    'CAST(Sepal_Length AS INT) AS Sepal_Length',
    'CAST(Petal_Width AS INT) AS Petal_Width',
    'CAST(Sepal_Width AS INT) AS Sepal_Width',
    'CAST(Petal_Length AS INT) AS Petal_Length',
    'Species',
)
```
```python
data5
```
DataFrame[Sepal_Length: int, Petal_Width: int, Sepal_Width: int, Petal_Length: int, Species: string]
```python
data5.columns
```
['Sepal_Length', 'Petal_Width', 'Sepal_Width', 'Petal_Length', 'Species']
```python
data5.select('Sepal_Length', 'Species').groupBy('Species').agg(mean("Sepal_Length")).show()
```
+----------+-----------------+
|   Species|avg(Sepal_Length)|
+----------+-----------------+
| virginica|             6.08|
|versicolor|             5.48|
|    setosa|              4.6|
+----------+-----------------+
```python
#df = data3.select(col('Sepal_Length'), dat.Sepal_Length.cast('float').alias('price'))
```
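The averages above are computed after `CAST(... AS INT)`, which discards the fractional part of each measurement, so they understate the true means. The sketch below reproduces the effect in plain Python on the first five `Sepal_Length` values shown by `data3.limit(5).show()`. Here `int(float(v))` stands in for Spark's cast, which is an assumption about equivalent behavior for these positive values. Casting to `DOUBLE` instead of `INT` in `selectExpr` would keep the decimals.

```python
from statistics import mean

# First five Sepal_Length values, still strings as in the DataFrame.
values = ['5.1', '4.9', '4.7', '4.6', '5']

# Stand-in for CAST(... AS INT): the fractional part is dropped.
truncated = [int(float(v)) for v in values]

# Stand-in for CAST(... AS DOUBLE): the full value is kept.
as_float = [float(v) for v in values]

truncated_mean = mean(truncated)
float_mean = mean(as_float)
print(truncated)
print(truncated_mean, round(float_mean, 2))
```

On these five rows the truncated mean is lower than the mean of the full values, which is the same bias visible in the per-species averages.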