Introduction to dplyr and base R functions for data manipulation
Kamal Gupta Roy
Last Edited on 3rd Nov 2021
Instructions/Agenda and Learnings
1. Use of functions like ls(), getwd(), setwd(), rm()
2. Install packages (dslabs, dplyr)
3. Load packages (dslabs, dplyr) with library()
4. Read the murders dataset
5. Functions: nrow, ncol, head, tail, summary, class (try on both the data frame and a variable), str, names, levels, nlevels
6. Accessing an element of a data frame by position
7. Reading a vector from a data frame and doing basic arithmetic on it
8. Order/Arrange - Sorting the data
9. Selecting a column
10. Filtering rows
11. Creating a new variable
12. Summarizing data
13. Summarizing while grouping
14. Chaining Method
15. Exercise
dplyr functionality
• Five basic verbs: filter, select, arrange, mutate, summarise (plus group_by)
Basic Codes
Directory Details
#### workspace
ls()
## character(0)
# Show the current (default) working directory
getwd()
## [1] "C:/Users/Debzitt/Dropbox (Erasmus Universiteit Rotterdam)/Kamal Gupta/AMSOM-Teaching/a. TOD531 -
# Set the working directory with setwd(); the path must be a quoted string
#setwd("C:/Users/Admin/")
getwd()
## [1] "C:/Users/Debzitt/Dropbox (Erasmus Universiteit Rotterdam)/Kamal Gupta/AMSOM-Teaching/a. TOD531 -
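The agenda also lists rm(), which the session above does not show. A minimal sketch of the workspace and directory functions together follows; the objects x and y and the use of tempdir() are assumptions made only for illustration.

```r
# Hypothetical objects, created only for this illustration
x <- 1:3
y <- "hello"

ls()                           # names of objects in the workspace, now including "x" and "y"
rm(x)                          # remove one object by name
exists("x", inherits = FALSE)  # FALSE: x is gone, y is still there

old <- getwd()                 # remember the current working directory
setwd(tempdir())               # switch to R's per-session temporary directory
getwd()                        # now reports the temporary directory
setwd(old)                     # switch back to where we started
```

Saving the old directory before calling setwd() makes it easy to return to it afterwards.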
Install packages
install.packages("dslabs")
install.packages("dplyr")
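Calling install.packages() every time the script runs re-downloads packages that are already present. One possible way to avoid that is sketched below; install_if_missing is a hypothetical helper written for this example, not a function from any package.

```r
# Hypothetical helper (not part of any package): install a package only when it is missing.
install_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))   # already installed, nothing to do
  }
  install.packages(pkg)        # needs an internet connection and a configured CRAN mirror
  invisible(TRUE)
}

# "stats" ships with R, so this call never triggers an installation
install_if_missing("stats")
```

In the class setting one would call install_if_missing("dslabs") and install_if_missing("dplyr") instead.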
Load packages
library(dslabs)
library(dplyr)
##
## Attaching package: ’dplyr’
## The following objects are masked from ’package:stats’:
##
## filter, lag
## The following objects are masked from ’package:base’:
##
## intersect, setdiff, setequal, union
Read dataframe
murder <- data.frame(murders)
Basic check on data
nrow(murder)
## [1] 51
ncol(murder)
## [1] 5
head(murder)
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
murder[1,1]
## [1] "Alabama"
tail(murder)
## state abb region population total
## 46 Vermont VT Northeast 625741 2
## 47 Virginia VA South 8001024 250
## 48 Washington WA West 6724540 93
## 49 West Virginia WV South 1852994 27
## 50 Wisconsin WI North Central 5686986 97
## 51 Wyoming WY West 563626 5
summary(murder)
## state abb region population
## Length:51 Length:51 Northeast : 9 Min. : 563626
## Class :character Class :character South :17 1st Qu.: 1696962
## Mode :character Mode :character North Central:12 Median : 4339367
## West :13 Mean : 6075769
## 3rd Qu.: 6636084
## Max. :37253956
## total
## Min. : 2.0
## 1st Qu.: 24.5
## Median : 97.0
## Mean : 184.4
## 3rd Qu.: 268.0
## Max. :1257.0
class(murder)
## [1] "data.frame"
class(murder$state)
## [1] "character"
str(murder)
## ’data.frame’: 51 obs. of 5 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ abb : chr "AL" "AK" "AZ" "AR" ...
## $ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
## $ population: num 4779736 710231 6392017 2915918 37253956 ...
## $ total : num 135 19 232 93 1257 ...
names(murder)
## [1] "state" "abb" "region" "population" "total"
levels(murder$region)
## [1] "Northeast" "South" "North Central" "West"
nlevels(murder$region)
## [1] 4
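The call murder[1,1] above picks a single cell by row and column position. The sketch below shows a few more positional patterns on a small stand-in data frame, toy, built from the six rows printed by head(murder); the name toy is an assumption, and the same indexing should work on murder itself.

```r
# Stand-in for the first six rows of the murder data (values copied from head(murder))
toy <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  abb        = c("AL", "AK", "AZ", "AR", "CA", "CO"),
  region     = factor(c("South", "West", "West", "South", "West", "West")),
  population = c(4779736, 710231, 6392017, 2915918, 37253956, 5029196),
  total      = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

toy[1, 1]           # single cell: row 1, column 1
toy[2, ]            # the whole second row (a one-row data frame)
toy[, 4]            # the fourth column as a plain vector
toy[1:3, c(1, 5)]   # rows 1 to 3, columns 1 and 5
dim(toy)            # rows and columns in one call
sapply(toy, class)  # class of every column at once
```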
Read a vector from data frame
mdr <- murder$total
sum(mdr)
## [1] 9403
mean(mdr)
## [1] 184.3725
max(mdr)
## [1] 1257
min(mdr)
## [1] 2
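Arithmetic on vectors is applied element by element, so two columns can be combined directly. As a sketch, the snippet below computes murders per 100,000 residents for the six states shown by head(murder); the per-100,000 scaling is an assumption chosen for illustration, since the handout does not define a rate.

```r
# Values copied from the first six rows of the murder data
population <- c(4779736, 710231, 6392017, 2915918, 37253956, 5029196)
total      <- c(135, 19, 232, 93, 1257, 65)

# Element-wise division and multiplication: one rate per state
rate <- total / population * 100000
round(rate, 2)

sum(total)        # total murders across these six states
which.max(rate)   # position of the state with the highest rate
```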
dplyr functions
Sorting data
Simple R
rway <- murder[order(murder$total),]
head(rway)
## state abb region population total
## 46 Vermont VT Northeast 625741 2
## 35 North Dakota ND North Central 672591 4
## 30 New Hampshire NH Northeast 1316470 5
## 51 Wyoming WY West 563626 5
## 12 Hawaii HI West 1360301 7
## 42 South Dakota SD North Central 814180 8
dplyr
dpway <- arrange(murder, total)
head(dpway)
## state abb region population total
## 1 Vermont VT Northeast 625741 2
## 2 North Dakota ND North Central 672591 4
## 3 New Hampshire NH Northeast 1316470 5
## 4 Wyoming WY West 563626 5
## 5 Hawaii HI West 1360301 7
## 6 South Dakota SD North Central 814180 8
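Both calls above sort in ascending order. A sketch of the descending variants follows, again on a small stand-in data frame (toy, an assumed name) built from the head(murder) rows. The dplyr part is guarded so the snippet still runs where dplyr is not installed.

```r
# Stand-in data: state names and murder totals from head(murder)
toy <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  total = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

# Simple R: ask order() for decreasing positions
r_desc <- toy[order(toy$total, decreasing = TRUE), ]
head(r_desc, 3)

# dplyr: wrap the column in desc(); skipped when dplyr is not installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  dp_desc <- dplyr::arrange(toy, dplyr::desc(total))
  head(dp_desc, 3)
}
```

As in the ascending case, the base R result keeps the original row names while arrange() renumbers the rows from 1.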
Selecting a column
Simple R
rway <- murder[,"state"]
head(rway)
## [1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
## [6] "Colorado"
class(rway)
## [1] "character"
rway <- murder[,c("state","total")]
head(rway)
## state total
## 1 Alabama 135
## 2 Alaska 19
## 3 Arizona 232
## 4 Arkansas 93
## 5 California 1257
## 6 Colorado 65
dplyr
dpway <- select(murder,state)
head(dpway)
## state
## 1 Alabama
## 2 Alaska
## 3 Arizona
## 4 Arkansas
## 5 California
## 6 Colorado
class(dpway)
## [1] "data.frame"
dpway <- select(murder,state,total)
head(dpway)
## state total
## 1 Alabama 135
## 2 Alaska 19
## 3 Arizona 232
## 4 Arkansas 93
## 5 California 1257
## 6 Colorado 65
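The class() results above differ because base R simplifies a single selected column to a vector, while select() always returns a data frame. The sketch below shows how to keep the data frame in base R as well; toy is an assumed stand-in built from the head(murder) rows.

```r
# Stand-in data with three of the original columns
toy <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  abb   = c("AL", "AK", "AZ", "AR", "CA", "CO"),
  total = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

v  <- toy[, "state"]                # simplified to a character vector
df <- toy[, "state", drop = FALSE]  # drop = FALSE keeps a one-column data frame
l  <- toy["state"]                  # list-style indexing also keeps a data frame

class(v)
class(df)
class(l)
```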
Filtering rows
Simple R
rway <- murder[murder$state=='California',]
head(rway)
## state abb region population total
## 5 California CA West 37253956 1257
dplyr
dpway <- filter(murder,state=='California')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
dpway <- filter(murder,state=='California' & abb=='CA')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
dpway <- filter(murder,state=='California', abb=='CA')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
dpway <- filter(murder,state=='California' | abb=='WI')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
## 2 Wisconsin WI North Central 5686986 97
dpway <- filter(murder,abb %in% c('CA','WI','NY'))
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
## 2 New York NY Northeast 19378102 517
## 3 Wisconsin WI North Central 5686986 97
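Only the first filter above has a "Simple R" counterpart. A sketch of base R equivalents for the OR, %in%, and combined conditions follows. It uses a stand-in data frame (toy, an assumed name) with the head(murder) rows, so the abbreviations differ from the ones filtered above.

```r
# Stand-in data: three columns from head(murder)
toy <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  abb   = c("AL", "AK", "AZ", "AR", "CA", "CO"),
  total = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

# OR condition: a logical vector inside the row index
either <- toy[toy$state == "California" | toy$abb == "AK", ]

# Membership test with %in%
listed <- toy[toy$abb %in% c("CA", "AZ"), ]

# subset() lets column names be used directly, much like filter()
big <- subset(toy, total > 100 & abb != "CA")

either
listed
big
```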
Creating a new variable
Simple R
murder$newpop <- murder$population / 1000
head(murder)
## state abb region population total newpop
## 1 Alabama AL South 4779736 135 4779.736
## 2 Alaska AK West 710231 19 710.231
## 3 Arizona AZ West 6392017 232 6392.017
## 4 Arkansas AR South 2915918 93 2915.918
## 5 California CA West 37253956 1257 37253.956
## 6 Colorado CO West 5029196 65 5029.196
dplyr
dpway <- mutate(murder,newpop=population/1000)
head(dpway)
## state abb region population total newpop
## 1 Alabama AL South 4779736 135 4779.736
## 2 Alaska AK West 710231 19 710.231
## 3 Arizona AZ West 6392017 232 6392.017
## 4 Arkansas AR South 2915918 93 2915.918
## 5 California CA West 37253956 1257 37253.956
## 6 Colorado CO West 5029196 65 5029.196
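Note that the "Simple R" assignment above changes murder itself, whereas mutate() returns a new data frame and leaves its input alone. Base R's transform() behaves like mutate() in that respect. A minimal sketch on an assumed stand-in data frame, toy:

```r
# Stand-in data: state and population from head(murder)
toy <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  population = c(4779736, 710231, 6392017, 2915918, 37253956, 5029196),
  stringsAsFactors = FALSE
)

# transform() returns a copy with the extra column; toy itself is unchanged
toy2 <- transform(toy, newpop = population / 1000)

names(toy)    # still only state and population
names(toy2)   # state, population, newpop
head(toy2)
```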
summarise: Reduce variables to values
• Primarily useful with data that has been grouped by one or more variables
• group_by creates the groups that will be operated on
• summarise uses the provided aggregation function to summarise each group
dplyr way - summarize
summarise(murder,summurder=sum(total,na.rm=TRUE))
## summurder
## 1 9403
summarise(murder,avgmurder=mean(total,na.rm=TRUE))
## avgmurder
## 1 184.3725
summarise(murder,countrows=n())
## countrows
## 1 51
summarise(murder,summurder=sum(total,na.rm=TRUE),
avgmurder=mean(total,na.rm=TRUE),countrows=n())
## summurder avgmurder countrows
## 1 9403 184.3725 51
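The summarise() calls above have no "Simple R" counterpart in the handout. One possible base R version is sketched below, together with a small demonstration of why na.rm=TRUE is passed. The vector holds the six totals from head(murder), and the appended NA is artificial.

```r
# Murder totals from the first six rows of the data
total <- c(135, 19, 232, 93, 1257, 65)

# Base R: build the one-row summary by hand
data.frame(
  summurder = sum(total, na.rm = TRUE),
  avgmurder = mean(total, na.rm = TRUE),
  countrows = length(total)
)

# Why na.rm matters: a single missing value makes the plain sum NA
with_gap <- c(total, NA)    # artificial missing value, for illustration only
sum(with_gap)               # NA
sum(with_gap, na.rm = TRUE) # missing value ignored
```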
dplyr way - group by
m1 <- group_by(murder,region)
ab <- summarise(m1,md=sum(total, na.rm=TRUE),
pop = mean(population, na.rm=TRUE),
cn = n())
ab <- data.frame(ab)
ab
## region md pop cn
## 1 Northeast 1469 6146360 9
## 2 South 4195 6804378 17
## 3 North Central 1828 5577250 12
## 4 West 1911 5534273 13
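For comparison, grouped summaries can also be sketched in base R with tapply(), table(), and aggregate(). The stand-in data frame toy (an assumed name) holds the six head(murder) rows, so it covers only the South and West regions.

```r
# Stand-in data: region, population and total from head(murder)
toy <- data.frame(
  region     = factor(c("South", "West", "West", "South", "West", "West")),
  population = c(4779736, 710231, 6392017, 2915918, 37253956, 5029196),
  total      = c(135, 19, 232, 93, 1257, 65)
)

md <- tapply(toy$total, toy$region, sum)  # sum of total within each region
cn <- table(toy$region)                   # number of rows per region

# aggregate() applies one function to several columns, grouped by region
agg <- aggregate(cbind(total, population) ~ region, data = toy, FUN = sum)

md
cn
agg
```

aggregate() applies a single function to all listed columns, so mixing sum and mean in one result, as the summarise() call does, takes separate calls in base R.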
Chaining Method
ab <- murder %>%
group_by(region) %>%
summarise(md = sum(total, na.rm=TRUE),
pop = sum(population, na.rm=TRUE),
cn=n())
ab <- data.frame(ab)
ab
## region md pop cn
## 1 Northeast 1469 55317240 9
## 2 South 4195 115674434 17
## 3 North Central 1828 66927001 12
## 4 West 1911 71945553 13
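The pop column here differs from the group_by example because this chain uses sum(population) where the earlier call used mean(population). The pipe itself only changes how the calls are written: x %>% f(y) means f(x, y). The sketch below shows the same idea with the native pipe |>, which is assumed to be available (R 4.1 or later) and needs no package; toy is an assumed stand-in data frame.

```r
# Stand-in data: state and region from head(murder)
toy <- data.frame(
  state  = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  region = factor(c("South", "West", "West", "South", "West", "West")),
  stringsAsFactors = FALSE
)

# Nested calls: read from the inside out
nested <- nrow(subset(toy, region == "West"))

# Piped calls: read from left to right, each result feeds the next function
piped <- toy |> subset(region == "West") |> nrow()

nested
piped
```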
Exercises
Exercise 1
Do the following for the murders dataset:
i. Get the murders dataset (as was done in class).
ii. Do a basic exploration of the data (number of rows, number of columns, structure, and names of the data).
iii. Which three states have the highest population?
iv. How many states have a population above the average?
v. What is the total population of the US (as the actual number and in millions)?
vi. What is the total number of murders across the US?
vii. What is the average number of murders?
viii. What is the total number of murders in the South region?
ix. How many states are there in each region?
x. What is the murder rate in each region?
xi. Which is the most dangerous state?
Exercise 2
Do the following for the mtcars dataset:
i. Get the mtcars dataset.
ii. Do a basic exploration of the data (number of rows, number of columns, structure, and names of the data).
iii. How many different types of gears are there?
iv. Which type of transmission is more common: automatic or manual?
v. What is the average hp by number of cylinders?
vi. What is the average hp by number of gears?
vii. Does mpg depend on the number of gears?
viii. Does the weight of a car depend on the number of cylinders?
