This document provides an introduction to using base R and dplyr functions for data manipulation. It discusses loading and exploring data, sorting and selecting data, filtering rows, creating new variables, and summarizing data with and without grouping. Functions from both base R and dplyr are demonstrated for common data manipulation tasks like arranging, filtering, selecting, mutating, and summarizing. The document concludes with exercises that apply these functions to the murders and mtcars datasets.
Introduction to dplyr and base R functions for data manipulation
Kamal Gupta Roy
Last Edited on 3rd Nov 2021
Instructions/Agenda and Learnings
1. Use of functions like ls(), getwd(), setwd(), rm()
2. Install packages (dslabs, dplyr)
3. Load packages (dslabs, dplyr) with library()
4. Read murder dataset
5. Functions: nrow, ncol, head, tail, summary, class (try on both the data frame and a variable), str, names, levels, nlevels
6. Accessing a data frame element by position
7. Reading a vector from a data frame and doing basic arithmetic on it
8. Order/Arrange - Sorting the data
9. Selecting a column
10. Filtering rows
11. Creating a new variable
12. Summarizing data
13. Summarizing while grouping
14. Chaining Method
15. Exercise
dplyr functionality
• Five basic verbs: filter, select, arrange, mutate, summarise (plus group_by)
Basic Codes
Directory Details
#### workspace
ls()
## character(0)
# Find the current working directory
getwd()
## [1] "C:/Users/Debzitt/Dropbox (Erasmus Universiteit Rotterdam)/Kamal Gupta/AMSOM-Teaching/a. TOD531 -
# Setting a Working Directory using setwd()
#setwd("C:/Users/Admin/")
getwd()
## [1] "C:/Users/Debzitt/Dropbox (Erasmus Universiteit Rotterdam)/Kamal Gupta/AMSOM-Teaching/a. TOD531 -
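The agenda also lists rm(), which is not demonstrated above. The following is a minimal sketch of the workspace and directory functions used together. The objects x and y are hypothetical and exist only for this illustration, and tempdir() stands in for a real project folder.

```r
# Hypothetical objects, created only for this illustration
x <- 1:3
y <- "hello"

ls()        # lists the objects in the workspace, now including "x" and "y"
rm(x)       # removes x from the workspace; y is untouched
ls()        # "x" is no longer listed

# Change the working directory and then restore it.
# tempdir() is used as a stand-in for a real project folder.
old_dir <- getwd()      # remember where we started
setwd(tempdir())        # switch to the session's temporary directory
getwd()                 # now reports the temporary directory
setwd(old_dir)          # go back to the original directory
```

Saving the old directory before calling setwd() makes it easy to undo the change at the end of a script.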
Install packages
install.packages("dslabs")
install.packages("dplyr")
Load packages
library(dslabs)
library(dplyr)
##
## Attaching package: ‘dplyr’
## The following objects are masked from ‘package:stats’:
##
## filter, lag
## The following objects are masked from ‘package:base’:
##
## intersect, setdiff, setequal, union
Read dataframe
murder <- data.frame(murders)
Basic check on data
nrow(murder)
## [1] 51
ncol(murder)
## [1] 5
head(murder)
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
murder[1,1]
## [1] "Alabama"
tail(murder)
## state abb region population total
## 46 Vermont VT Northeast 625741 2
## 47 Virginia VA South 8001024 250
## 48 Washington WA West 6724540 93
## 49 West Virginia WV South 1852994 27
## 50 Wisconsin WI North Central 5686986 97
## 51 Wyoming WY West 563626 5
summary(murder)
##     state               abb                      region     population
##  Length:51          Length:51          Northeast    : 9   Min.   :  563626
##  Class :character   Class :character   South        :17   1st Qu.: 1696962
##  Mode  :character   Mode  :character   North Central:12   Median : 4339367
##                                        West         :13   Mean   : 6075769
##                                                           3rd Qu.: 6636084
##                                                           Max.   :37253956
##      total
##  Min.   :   2.0
##  1st Qu.:  24.5
##  Median :  97.0
##  Mean   : 184.4
##  3rd Qu.: 268.0
##  Max.   :1257.0
class(murder)
## [1] "data.frame"
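The agenda also mentions str, names, levels, nlevels, class on a single variable, and arithmetic on a vector read from the data frame. The sketch below shows how these might look. To stay self-contained it does not load dslabs; instead it builds a hypothetical six-row data frame, murder_small, from the rows printed by head(murder) above. The factor levels for region are taken from the summary output.

```r
# Hypothetical six-row copy of the rows printed by head(murder) above
murder_small <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  abb        = c("AL", "AK", "AZ", "AR", "CA", "CO"),
  region     = factor(c("South", "West", "West", "South", "West", "West"),
                      levels = c("Northeast", "South", "North Central", "West")),
  population = c(4779736, 710231, 6392017, 2915918, 37253956, 5029196),
  total      = c(135, 19, 232, 93, 1257, 65),
  stringsAsFactors = FALSE
)

str(murder_small)              # structure: one line per column with its type
names(murder_small)            # column names
class(murder_small$state)      # class of a single variable (character)
class(murder_small$region)     # region is stored as a factor
levels(murder_small$region)    # the categories of the factor
nlevels(murder_small$region)   # number of categories

# Read a column as a vector and do basic arithmetic on it
pop <- murder_small$population
sum(pop)                       # total population of these six states
mean(pop)                      # average population
pop_millions <- pop / 1e6      # element-wise division: population in millions
pop_millions
```

Note that levels() and nlevels() apply to factors such as region; on a character column like state, levels() returns NULL.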
dplyr functions
Sorting data
Simple R
rway <- murder[order(murder$total),]
head(rway)
## state abb region population total
## 46 Vermont VT Northeast 625741 2
## 35 North Dakota ND North Central 672591 4
## 30 New Hampshire NH Northeast 1316470 5
## 51 Wyoming WY West 563626 5
## 12 Hawaii HI West 1360301 7
## 42 South Dakota SD North Central 814180 8
dplyr
dpway <- arrange(murder, total)
head(dpway)
## state abb region population total
## 1 Vermont VT Northeast 625741 2
## 2 North Dakota ND North Central 672591 4
## 3 New Hampshire NH Northeast 1316470 5
## 4 Wyoming WY West 563626 5
## 5 Hawaii HI West 1360301 7
## 6 South Dakota SD North Central 814180 8
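Both calls above sort in ascending order. A possible way to sort from largest to smallest is sketched below, using the decreasing argument of order() and dplyr's desc() helper. The data frame murder_small is a hypothetical six-row subset copied from the head(murder) output shown earlier.

```r
library(dplyr)

# Hypothetical six-row subset, copied from the head(murder) output above
murder_small <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  total = c(135, 19, 232, 93, 1257, 65)
)

# Simple R: order() with decreasing = TRUE
rway <- murder_small[order(murder_small$total, decreasing = TRUE), ]
head(rway, 3)

# dplyr: wrap the column in desc()
dpway <- arrange(murder_small, desc(total))
head(dpway, 3)
```

As in the ascending case, the base R result keeps the original row names while arrange() renumbers the rows from 1.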
Selecting a column
Simple R
rway <- murder[,"state"]
head(rway)
## [1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
## [6] "Colorado"
class(rway)
## [1] "character"
rway <- murder[,c("state","total")]
head(rway)
## state total
## 1 Alabama 135
## 2 Alaska 19
## 3 Arizona 232
## 4 Arkansas 93
## 5 California 1257
## 6 Colorado 65
dplyr
dpway <- select(murder,state)
head(dpway)
## state
## 1 Alabama
## 2 Alaska
## 3 Arizona
## 4 Arkansas
## 5 California
## 6 Colorado
class(dpway)
## [1] "data.frame"
dpway <- select(murder,state,total)
head(dpway)
## state total
## 1 Alabama 135
## 2 Alaska 19
## 3 Arizona 232
## 4 Arkansas 93
## 5 California 1257
## 6 Colorado 65
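The class() calls above show a difference worth noting: selecting a single column with murder[,"state"] returns a character vector, while select() always returns a data frame. The sketch below shows one way to get either result from either approach, using drop = FALSE in base R and dplyr's pull(). The data frame murder_small is a hypothetical subset copied from the head(murder) output.

```r
library(dplyr)

# Hypothetical subset, copied from the head(murder) output above
murder_small <- data.frame(
  state = c("Alabama", "Alaska", "Arizona"),
  total = c(135, 19, 232)
)

# Simple R: a single column drops to a vector by default
as_vector <- murder_small[, "state"]
class(as_vector)

# Simple R: drop = FALSE keeps the data frame structure
as_df <- murder_small[, "state", drop = FALSE]
class(as_df)

# dplyr: pull() extracts one column as a vector
pulled <- pull(murder_small, state)
class(pulled)
```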
Filtering rows
Simple R
rway <- murder[murder$state=='California',]
head(rway)
## state abb region population total
## 5 California CA West 37253956 1257
dplyr
dpway <- filter(murder,state=='California')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
dpway <- filter(murder,state=='California' & abb=='CA')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
dpway <- filter(murder,state=='California', abb=='CA')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
dpway <- filter(murder,state=='California' | abb=='WI')
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
## 2 Wisconsin WI North Central 5686986 97
dpway <- filter(murder,abb %in% c('CA','WI','NY'))
head(dpway)
## state abb region population total
## 1 California CA West 37253956 1257
## 2 New York NY Northeast 19378102 517
## 3 Wisconsin WI North Central 5686986 97
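Only the single-condition filter was shown in simple R. The combined conditions used with filter() could be written in base R roughly as follows. The data frame murder_small is a hypothetical four-row subset whose values are copied from outputs shown in this document.

```r
# Hypothetical four-row subset; values copied from outputs shown above
murder_small <- data.frame(
  state = c("Alabama", "California", "New York", "Wisconsin"),
  abb   = c("AL", "CA", "NY", "WI"),
  total = c(135, 1257, 517, 97)
)

# AND: both conditions must hold
both <- murder_small[murder_small$state == "California" & murder_small$abb == "CA", ]

# OR: at least one condition must hold
either <- murder_small[murder_small$state == "California" | murder_small$abb == "WI", ]

# Membership in a set of values
several <- murder_small[murder_small$abb %in% c("CA", "WI", "NY"), ]

# subset() avoids repeating the data frame name inside the condition
several2 <- subset(murder_small, abb %in% c("CA", "WI", "NY"))

both
either
several
```

The comma before the closing bracket matters in the bracket form: the condition picks rows, and leaving the column position empty keeps all columns.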
Creating a new variable
Simple R
murder$newpop <- murder$population / 1000
head(murder)
## state abb region population total newpop
## 1 Alabama AL South 4779736 135 4779.736
## 2 Alaska AK West 710231 19 710.231
## 3 Arizona AZ West 6392017 232 6392.017
## 4 Arkansas AR South 2915918 93 2915.918
## 5 California CA West 37253956 1257 37253.956
## 6 Colorado CO West 5029196 65 5029.196
dplyr
dpway <- mutate(murder,newpop=population/1000)
head(dpway)
## state abb region population total newpop
## 1 Alabama AL South 4779736 135 4779.736
## 2 Alaska AK West 710231 19 710.231
## 3 Arizona AZ West 6392017 232 6392.017
## 4 Arkansas AR South 2915918 93 2915.918
## 5 California CA West 37253956 1257 37253.956
## 6 Colorado CO West 5029196 65 5029.196
summarise: Reduce variables to values
• Primarily useful with data that has been grouped by one or more variables
• group_by creates the groups that will be operated on
• summarise uses the provided aggregation function to summarise each group
dplyr way - summarize
summarise(murder,summurder=sum(total,na.rm=TRUE))
## summurder
## 1 9403
summarise(murder,avgmurder=mean(total,na.rm=TRUE))
## avgmurder
## 1 184.3725
summarise(murder,countrows=n())
## countrows
## 1 51
summarise(murder,summurder=sum(total,na.rm=TRUE),
avgmurder=mean(total,na.rm=TRUE),countrows=n())
## summurder avgmurder countrows
## 1 9403 184.3725 51
dplyr way - group by
m1 <- group_by(murder,region)
ab <- summarise(m1,md=sum(total, na.rm=TRUE),
pop = mean(population, na.rm=TRUE),
cn = n())
ab <- data.frame(ab)
ab
## region md pop cn
## 1 Northeast 1469 6146360 9
## 2 South 4195 6804378 17
## 3 North Central 1828 5577250 12
## 4 West 1911 5534273 13
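For comparison, the same kinds of summaries can be computed in simple R. The sketch below uses sum(), mean(), and nrow() for the ungrouped case, and tapply(), table(), and aggregate() for the grouped case. It is one possible approach, not the only one. The data frame murder_small is a hypothetical six-row subset copied from the head(murder) output, so the numbers differ from the full-data results above.

```r
# Hypothetical six-row subset, copied from the head(murder) output above
murder_small <- data.frame(
  state  = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  region = c("South", "West", "West", "South", "West", "West"),
  total  = c(135, 19, 232, 93, 1257, 65)
)

# Without grouping
summurder <- sum(murder_small$total, na.rm = TRUE)
avgmurder <- mean(murder_small$total, na.rm = TRUE)
countrows <- nrow(murder_small)

# With grouping: apply sum() to total within each region
by_region <- tapply(murder_small$total, murder_small$region, sum)
by_region

# Number of rows per region
counts <- table(murder_small$region)
counts

# aggregate() returns the grouped result as a data frame
agg <- aggregate(total ~ region, data = murder_small, FUN = sum)
agg
```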
Chaining Method
ab <- murder %>%
group_by(region) %>%
summarise(md = sum(total, na.rm=TRUE),
pop = sum(population, na.rm=TRUE),
cn=n())
ab <- data.frame(ab)
ab
## region md pop cn
## 1 Northeast 1469 55317240 9
## 2 South 4195 115674434 17
## 3 North Central 1828 66927001 12
## 4 West 1911 71945553 13
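The %>% operator passes the result on its left as the first argument of the function on its right, so a chain reads top to bottom in the order the steps happen. The sketch below chains four of the verbs covered earlier and compares the result with the equivalent nested call. The data frame murder_small is a hypothetical six-row subset copied from the head(murder) output.

```r
library(dplyr)

# Hypothetical six-row subset, copied from the head(murder) output above
murder_small <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  region     = c("South", "West", "West", "South", "West", "West"),
  population = c(4779736, 710231, 6392017, 2915918, 37253956, 5029196)
)

# Chained: filter, then mutate, then arrange, then select
west <- murder_small %>%
  filter(region == "West") %>%
  mutate(newpop = population / 1000) %>%
  arrange(desc(newpop)) %>%
  select(state, newpop)
west

# The same steps as nested calls, which must be read from the inside out
nested <- select(
  arrange(
    mutate(filter(murder_small, region == "West"), newpop = population / 1000),
    desc(newpop)
  ),
  state, newpop
)
```

Both forms give the same result; the chained version avoids intermediate objects such as m1 and is usually easier to follow.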
Exercises
Exercise 1
Do the following for the murder dataset:
i. Get the murder dataset (as was done in class).
ii. Do a basic exploration of the data (number of rows, number of columns, structure, and names of the data).
iii. Which three states have the highest population?
iv. How many states have a population above the average?
v. What is the total population of the US (as the actual number and in millions)?
vi. What is the total number of murders across the US?
vii. What is the average number of murders?
viii. What is the total number of murders in the South region?
ix. How many states are there in each region?
x. What is the murder rate in each region?
xi. Which is the most dangerous state?
Exercise 2
Do the following for the mtcars dataset:
i. Get the mtcars dataset.
ii. Do a basic exploration of the data (number of rows, number of columns, structure, and names of the data).
iii. How many different types of gears are there?
iv. Which type of transmission is more common: automatic or manual?
v. What is the average hp by number of cylinders?
vi. What is the average hp by number of gears?
vii. Does mpg depend on the number of gears?
viii. Does the weight of a car depend on the number of cylinders?