[Figure: "scatter plot" of Petal Length (y axis) against Sepal Length (x axis)]
pch Values
• pch = 0, square
• pch = 1, circle
• pch = 2, triangle point up
• pch = 3, plus
• pch = 4, cross
• pch = 5, diamond
• pch = 6, triangle point down
• pch = 7, square cross
• pch = 8, star
• pch = 9, diamond plus
• pch = 10, circle plus
• pch = 11, triangles up and down
• pch = 12, square plus
• pch = 13, circle cross
• pch = 14, square and triangle down
• pch = 15, filled square
• pch = 16, filled circle
• pch = 17, filled triangle point up
• pch = 18, filled diamond
• pch = 19, solid circle
• pch = 20, bullet (smaller circle)
• pch = 21, filled circle (border set by col, fill set by bg; blue fill here)
• pch = 22, filled square (blue fill)
• pch = 23, filled diamond (blue fill)
• pch = 24, filled triangle point up (blue fill)
• pch = 25, filled triangle point down (blue fill)
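The list above can be checked visually. The following is a minimal sketch, not part of the original slides, that draws all 26 symbols on a small grid; the names `pch_values`, `x` and `y` are illustrative only. It assumes base R graphics, where `col` sets the symbol (or border) colour and `bg` sets the fill colour for pch 21 to 25.

```r
# Draw all 26 plotting symbols (pch 0 to 25) on a 2 x 13 grid.
pch_values <- 0:25
x <- (pch_values %% 13) + 1    # column position, 1..13
y <- 2 - (pch_values %/% 13)   # row position: pch 0-12 on top, pch 13-25 below

plot(x, y,
     pch = pch_values,
     col = "black",            # symbol colour; border colour for pch 21-25
     bg = "blue",              # fill colour, only used by pch 21-25
     cex = 2,
     xlim = c(0.5, 13.5), ylim = c(0.5, 2.5),
     axes = FALSE, xlab = "", ylab = "", main = "pch values 0 to 25")

# Label each symbol with its pch number, just below the symbol.
text(x, y - 0.3, labels = pch_values, cex = 0.8)
```

Only the last five symbols should appear with a blue interior, since `bg` has no effect on pch 0 to 20.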
Lines on the Scatter Plot
plot(iris$Sepal.Length,iris$Petal.Length,xlab="Sepal Length",ylab="Petal Length",main="scatter plot", co
abline(v=6, col="purple") # vertical line at Sepal Length = 6
abline(h=6, col="red") # horizontal line at Petal Length = 6
[Figure: "scatter plot" of Petal Length against Sepal Length with a purple vertical line at Sepal Length = 6 and a red horizontal line at Petal Length = 6]
Mean Lines on Scatter Plot
plot(iris$Sepal.Length,iris$Petal.Length,xlab="Sepal Length",ylab="Petal Length",main="scatter plot", co
abline(v=mean(iris$Sepal.Length),col="blue") # vertical line at the mean Sepal Length
abline(h=mean(iris$Petal.Length),col="pink") # horizontal line at the mean Petal Length
fit <- lm(Petal.Length~Sepal.Length, data=iris) # fit a linear regression of Petal Length on Sepal Length
abline(fit, col="yellow") # fitted regression line
[Figure: "scatter plot" of Petal Length against Sepal Length with lines at the mean Sepal Length (blue) and the mean Petal Length (pink), plus the fitted regression line (yellow)]
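The two mean lines and the regression line meet at a single point, because a least-squares line with an intercept always passes through the point of means. The following is a small sketch, not from the original slides, that checks this numerically on `iris`; the names `mean_x`, `mean_y` and `pred_at_mean` are illustrative.

```r
# Fit the same simple regression as in the plot above.
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)

mean_x <- mean(iris$Sepal.Length)
mean_y <- mean(iris$Petal.Length)

# Predicted Petal Length at the mean Sepal Length.
pred_at_mean <- unname(predict(fit, newdata = data.frame(Sepal.Length = mean_x)))

# The prediction at mean_x equals mean_y up to floating-point tolerance.
print(all.equal(pred_at_mean, mean_y))

# Optional: mark the point of means on the scatter plot.
plot(iris$Sepal.Length, iris$Petal.Length,
     xlab = "Sepal Length", ylab = "Petal Length", main = "scatter plot")
abline(fit, col = "yellow")
points(mean_x, mean_y, pch = 19, col = "red")
```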
R-Square between two variables
summary(lm(Petal.Length~Sepal.Length, data=iris))$r.squared
## [1] 0.7599546
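The R-squared between Petal Length and Sepal Length is about 0.76: in this simple linear regression, Sepal Length accounts for roughly 76% of the variance in Petal Length.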
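For a regression with a single predictor and an intercept, R-squared equals the squared Pearson correlation between the two variables. A short sketch of that check follows; it is not part of the original slides, and the variable names are illustrative.

```r
# Pearson correlation between the two variables.
r <- cor(iris$Sepal.Length, iris$Petal.Length)
r_squared_from_cor <- r^2

# R-squared reported by the linear model.
r_squared_from_lm <- summary(lm(Petal.Length ~ Sepal.Length, data = iris))$r.squared

# The two values agree up to floating-point tolerance.
print(all.equal(r_squared_from_cor, r_squared_from_lm))
```

This shortcut holds only for one predictor; with several predictors, use the value reported by `summary()`.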
Scatterplot Matrix (Cross Tab) of all numeric attributes
pairs(iris[,1:4])
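A correlation matrix is a useful numeric companion to the scatterplot matrix. The sketch below is an addition, not from the original slides: it prints the pairwise correlations of the four numeric columns and redraws the matrix with points coloured by species. The names `numeric_cols` and `cor_matrix` are illustrative.

```r
# The four numeric measurement columns of iris.
numeric_cols <- iris[, 1:4]

# Pairwise Pearson correlations, rounded for readability.
cor_matrix <- round(cor(numeric_cols), 2)
print(cor_matrix)

# Same scatterplot matrix, with one colour per species
# (the factor is converted to integer codes 1, 2, 3).
pairs(numeric_cols, col = as.integer(iris$Species), pch = 19)
```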
[Figure: count (y axis) against mpg (x axis)]
All about ggplot2
Libraries for ggplot2 and Data Manipulation
library(dplyr)
##
## Attaching package: ’dplyr’
## The following objects are masked from ’package:stats’:
##
## filter, lag
## The following objects are masked from ’package:base’:
##
## intersect, setdiff, setequal, union
library(ggplot2)
house <- read.csv("C:UserskamalDropbox (Erasmus Universiteit Rotterdam)Kamal GuptaAMSOM-Teachi
head(house)
##   X  price lot_size waterfront age land_value construction air_cond     fuel
## 1 1 132500     0.09         No  42      50000           No       No Electric
## 2 2 181115     0.92         No   0      22300           No       No      Gas
## 3 3 109000     0.19         No 133       7300           No       No      Gas
## 4 4 155000     0.41         No  13      18700           No       No      Gas
## 5 5  86060     0.11         No   0      15000          Yes      Yes      Gas
## 6 6 120000     0.68         No  31      14000           No       No      Gas
##        heat   sewer living_area fireplaces bathrooms rooms
## 1  Electric Private         906          1       1.0     5
## 2 Hot Water Private        1953          0       2.5     6
## 3 Hot Water  Public        1944          1       1.0     8
## 4   Hot Air Private        1944          1       1.5     5
## 5   Hot Air  Public         840          0       1.0     3
## 6   Hot Air Private        1152          1       1.0     8
str(house)
## ’data.frame’: 1728 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ price : int 132500 181115 109000 155000 86060 120000 153000 170000 90000 122900 ...
## $ lot_size : num 0.09 0.92 0.19 0.41 0.11 0.68 0.4 1.21 0.83 1.94 ...
## $ waterfront : chr "No" "No" "No" "No" ...
## $ age : int 42 0 133 13 0 31 33 23 36 4 ...
## $ land_value : int 50000 22300 7300 18700 15000 14000 23300 14600 22200 21200 ...
## $ construction: chr "No" "No" "No" "No" ...
## $ air_cond : chr "No" "No" "No" "No" ...
## $ fuel : chr "Electric" "Gas" "Gas" "Gas" ...
## $ heat : chr "Electric" "Hot Water" "Hot Water" "Hot Air" ...
## $ sewer : chr "Private" "Private" "Public" "Private" ...
## $ living_area : int 906 1953 1944 1944 840 1152 2752 1662 1632 1416 ...
## $ fireplaces : int 1 0 1 1 0 1 1 1 0 0 ...
## $ bathrooms : num 1 2.5 1 1.5 1 1 1.5 1.5 1.5 1.5 ...
## $ rooms : int 5 6 8 5 3 8 8 9 8 6 ...
summary(house)
## X price lot_size waterfront
## Min. : 1.0 Min. : 5000 Min. : 0.0000 Length:1728
## 1st Qu.: 432.8 1st Qu.:145000 1st Qu.: 0.1700 Class :character
## Median : 864.5 Median :189900 Median : 0.3700 Mode :character
## Mean : 864.5 Mean :211967 Mean : 0.5002
## 3rd Qu.:1296.2 3rd Qu.:259000 3rd Qu.: 0.5400
## Max. :1728.0 Max. :775000 Max. :12.2000
## age land_value construction air_cond
## Min. : 0.00 Min. : 200 Length:1728 Length:1728
## 1st Qu.: 13.00 1st Qu.: 15100 Class :character Class :character
## Median : 19.00 Median : 25000 Mode :character Mode :character
## Mean : 27.92 Mean : 34557
## 3rd Qu.: 34.00 3rd Qu.: 40200
## Max. :225.00 Max. :412600
## fuel heat sewer living_area
## Length:1728 Length:1728 Length:1728 Min. : 616
## Class :character Class :character Class :character 1st Qu.:1300
## Mode :character Mode :character Mode :character Median :1634
## Mean :1755
## 3rd Qu.:2138
## Max. :5228
## fireplaces bathrooms rooms
## Min. :0.0000 Min. :0.0 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:1.5 1st Qu.: 5.000
## Median :1.0000 Median :2.0 Median : 7.000
## Mean :0.6019 Mean :1.9 Mean : 7.042
## 3rd Qu.:1.0000 3rd Qu.:2.5 3rd Qu.: 8.250
## Max. :4.0000 Max. :4.5 Max. :12.000
Histogram
ggplot(data=house, aes(x=price/100000)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: histogram of price/1e+05 (x axis) with count on the y axis]
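The `stat_bin()` message asks for an explicit bin width. The following is a hedged sketch of how that might look. Because the path to the `house` file is not fully shown, it uses a hypothetical stand-in, `house_sample`, built from only the six rows printed by `head(house)`; the bin width of 0.25 (that is, 25,000 in price) is an arbitrary choice for illustration, not a recommendation from the slides.

```r
library(ggplot2)

# Hypothetical stand-in for `house`: the six rows shown by head(house).
house_sample <- data.frame(
  price = c(132500, 181115, 109000, 155000, 86060, 120000),
  living_area = c(906, 1953, 1944, 1944, 840, 1152)
)

# Histogram with an explicit bin width instead of the default 30 bins.
p <- ggplot(data = house_sample, aes(x = price / 100000)) +
  geom_histogram(binwidth = 0.25, boundary = 0, colour = "white") +
  labs(x = "Price (in units of 100,000)", y = "Count")

print(p)
```

With the full data set, replace `house_sample` with `house` and tune `binwidth` until the shape of the distribution is clear.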
[Figure: price (y axis) against living_area (x axis), coloured by air_cond (No, Yes)]
ggplot(data=house, aes(y=price, x=living_area, col=heat)) + geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
[Figure: smoothed price (y axis) against living_area (x axis), one line per heat type (Electric, Hot Air, Hot Water)]
Scatter Plot with Smooth Lines
ggplot(data=house, aes(y=price, x=living_area)) + geom_point() + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
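The straight line drawn by `geom_smooth(method=lm)` is an ordinary least-squares fit, so its intercept and slope can be read from `lm()` directly. The sketch below is an addition under stated assumptions: `house_sample` is a hypothetical stand-in built from the six rows printed by `head(house)`, so the coefficients it prints are not those of the full data set. Passing `formula = y ~ x` explicitly also silences the message shown above.

```r
library(ggplot2)

# Hypothetical stand-in for `house`: the six rows shown by head(house).
house_sample <- data.frame(
  price = c(132500, 181115, 109000, 155000, 86060, 120000),
  living_area = c(906, 1953, 1944, 1944, 840, 1152)
)

# Intercept and slope of the least-squares line.
fit <- lm(price ~ living_area, data = house_sample)
print(coef(fit))

# Scatter plot with the same straight-line fit drawn by ggplot2.
p <- ggplot(data = house_sample, aes(x = living_area, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p)
```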