SlideShare a Scribd company logo
1 of 35
Download to read offline
Text Mining with R { an Analysis of Twitter Data 
Yanchang Zhao 
http://www.RDataMining.com 
30 September 2014 
1 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
2 / 35
Text Mining 
I unstructured text data 
I text categorization 
I text clustering 
I entity extraction 
I sentiment analysis 
I document summarization 
I . . . 
3 / 35
Text mining of Twitter data with R 1 
1. extract data from Twitter 
2. clean extracted data and build a document-term matrix 
3.
nd frequent words and associations 
4. create a word cloud to visualize important words 
5. text clustering 
6. topic modelling 
1Chapter 10: Text Mining, R and Data Mining: Examples and Case Studies. 
http://www.rdatamining.com/docs/RDataMining.pdf 
4 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
5 / 35
Retrieve Tweets 
Retrieve recent tweets by @RDataMining 
## Option 1: retrieve tweets from Twitter 
library(twitteR) 
tweets <- userTimeline("RDataMining", n = 3200) 
## Option 2: download @RDataMining tweets from RDataMining.com 
url <- "http://www.rdatamining.com/data/rdmTweets.RData" 
download.file(url, destfile = "./data/rdmTweets.RData") 
## load tweets into R 
load(file = "./data/rdmTweets-201306.RData") 
6 / 35
(n.tweet <- length(tweets)) 
## [1] 320 
tweets[1:5] 
## [[1]] 
## [1] "RDataMining: Examples on calling Java code from R nht... 
## 
## [[2]] 
## [1] "RDataMining: Simulating Map-Reduce in R for Big Data A... 
## 
## [[3]] 
## [1] "RDataMining: Job opportunity: Senior Analyst - Big Dat... 
## 
## [[4]] 
## [1] "RDataMining: CLAVIN: an open source software package f... 
## 
## [[5]] 
## [1] "RDataMining: An online book on Natural Language Proces... 
7 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
8 / 35
# convert tweets to a data frame 
# tweets.df <- do.call("rbind", lapply(tweets, as.data.frame)) 
tweets.df <- twListToDF(tweets) 
dim(tweets.df) 
## [1] 320 14 
library(tm) 
# build a corpus, and specify the source to be character vectors 
myCorpus <- Corpus(VectorSource(tweets.df$text)) 
# convert to lower case 
myCorpus <- tm_map(myCorpus, tolower) 
Package tm v0.5-10 was used in this example. With tm v0.6, 
content transformer" needs to be used to wrap around normal 
functions. 
# tm v0.6 
myCorpus <- tm_map(myCorpus, content_transformer(tolower)) 
9 / 35
# remove punctuation 
myCorpus <- tm_map(myCorpus, removePunctuation) 
# remove numbers 
myCorpus <- tm_map(myCorpus, removeNumbers) 
# remove URLs 
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x) 
myCorpus <- tm_map(myCorpus, removeURL) 
# add two extra stop words: 
available
 and 
via
 
myStopwords <- c(stopwords("english"), "available", "via") 
# remove 
r
 and 
big
 from stopwords 
myStopwords <- setdiff(myStopwords, c("r", "big")) 
# remove stopwords from corpus 
myCorpus <- tm_map(myCorpus, removeWords, myStopwords) 
10 / 35
# keep a copy of corpus to use later as a dictionary for stem completion 
myCorpusCopy <- myCorpus 
# stem words 
myCorpus <- tm_map(myCorpus, stemDocument) 
# inspect the first 5 documents (tweets) inspect(myCorpus[1:5]) 
# The code below is used for to make text fit for paper width 
for (i in 1:5) f 
cat(paste("[[", i, "]] ", sep = "")) 
writeLines(myCorpus[[i]]) 
g 
## [[1]] exampl call java code r 
## 
## [[2]] simul mapreduc r big data analysi use flight data ... 
## [[3]] job opportun senior analyst big data wesfarm indust... 
## [[4]] clavin open sourc softwar packag document geotag g... 
## [[5]] onlin book natur languag process python 
11 / 35
# stem completion 
myCorpus <- tm_map(myCorpus, stemCompletion, 
dictionary = myCorpusCopy) 
## [[1]] examples call java code r 
## [[2]] simulating mapreduce r big data analysis used flights... 
## [[3]] job opportunity senior analyst big data wesfarmers in... 
## [[4]] clavin open source software package document geotaggi... 
## [[5]] online book natural language processing python 
12 / 35
# count frequency of "mining" 
miningCases <- tm_map(myCorpusCopy, grep, pattern = "nn<mining") 
sum(unlist(miningCases)) 
## [1] 82 
# count frequency of "miners" 
minerCases <- tm_map(myCorpusCopy, grep, pattern = "nn<miners") 
sum(unlist(minerCases)) 
## [1] 4 
# replace "miners" with "mining" 
myCorpus <- tm_map(myCorpus, gsub, pattern = "miners", 
replacement = "mining") 
13 / 35
tdm <- TermDocumentMatrix(myCorpus, 
control = list(wordLengths = c(1, Inf))) 
tdm 
## A term-document matrix (790 terms, 320 documents) 
## 
## Non-/sparse entries: 2449/250351 
## Sparsity : 99% 
## Maximal term length: 27 
## Weighting : term frequency (tf) 
14 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
15 / 35
idx <- which(dimnames(tdm)$Terms == "r") 
inspect(tdm[idx + (0:5), 101:110]) 
## A term-document matrix (6 terms, 10 documents) 
## 
## Non-/sparse entries: 4/56 
## Sparsity : 93% 
## Maximal term length: 12 
## Weighting : term frequency (tf) 
## 
## Docs 
## Terms 101 102 103 104 105 106 107 108 109 110 
## r 0 1 1 0 0 0 0 0 1 1 
## ramachandran 0 0 0 0 0 0 0 0 0 0 
## random 0 0 0 0 0 0 0 0 0 0 
## ranked 0 0 0 0 0 0 0 0 0 0 
## rann 0 0 0 0 0 0 0 0 0 0 
## rapidminer 0 0 0 0 0 0 0 0 0 0 
16 / 35
# inspect frequent words 
(freq.terms <- findFreqTerms(tdm, lowfreq = 15)) 
## [1] "analysis" "applications" "big" "book" 
## [5] "code" "computing" "data" "examples" 
## [9] "group" "introduction" "mining" "network" 
## [13] "package" "position" "postdoctoral" "r" 
## [17] "research" "see" "slides" "social" 
## [21] "tutorial" "university" "used" 
term.freq <- rowSums(as.matrix(tdm)) 
term.freq <- subset(term.freq, term.freq >= 15) 
df <- data.frame(term = names(term.freq), freq = term.freq) 
17 / 35
library(ggplot2) 
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") + 
xlab("Terms") + ylab("Count") + coord_flip() 
used 
university 
tutorial 
social 
slides 
see 
research 
r 
postdoctoral 
position 
package 
network 
mining 
introduction 
group 
examples 
data 
computing 
code 
book 
big 
applications 
analysis 
0 50 100 150 
Count 
Terms 
18 / 35
# which words are associated with 
r
? 
findAssocs(tdm, "r", 0.2) 
## r 
## examples 0.32 
## code 0.29 
## package 0.20 
# which words are associated with 
mining
? 
findAssocs(tdm, "mining", 0.25) 
## mining 
## data 0.47 
## mahout 0.30 
## recommendation 0.30 
## sets 0.30 
## supports 0.30 
## frequent 0.26 
## itemset 0.26 
19 / 35
library(graph) 
library(Rgraphviz) 
plot(tdm, term = freq.terms, corThreshold = 0.12, weighting = T) 
analysis 
research r 
applications big 
book code computing 
data 
examples 
group 
introduction 
mining 
network postdoctoral 
package 
position 
see 
slides 
tutorial social 
university used 
20 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
21 / 35
library(wordcloud) 
m <- as.matrix(tdm) 
# calculate the frequency of words and sort it by frequency 
word.freq <- sort(rowSums(m), decreasing = T) 
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3, 
random.order = F) 
language 
postdoctoral 
university 
social 
book 
data 
r 
project 
forecasting 
frequent 
applications 
big 
code 
tutorial 
job 
tried 
edited 
provided 
information 
research examples 
package 
mining 
analysis 
used 
cfp 
network 
positionslides 
computing 
see 
lecture 
introduction 
group 
analytics 
australia 
modelling 
scientist 
time 
association 
free 
online 
parallel 
text 
clustering 
course 
learn 
pdf 
rules 
series 
talk 
document 
now 
programming 
fellow 
statistics 
ausdm 
detection 
outlier 
google 
large 
open 
rdatamining 
techniques 
thanks 
tools 
users 
vacancy 
amp 
chapter 
due 
graphics 
processing 
notes 
map 
presentation 
science 
software 
starting 
studies 
visualizing 
business 
call 
can 
case center 
database 
details 
follower 
functions 
join 
kdnuggets 
page 
poll 
top 
submission spatial 
technology 
twitter 
analyst 
california 
card 
fast 
classification 
find 
get 
graph 
handling 
interactive 
linkedin 
list 
machine 
melbourne 
recent 
reference 
using 
videos 
web 
elsevier 
workshop 
wwwrdataminingcom 
access added 
canada 
canberra 
china 
cloud 
conference 
datasets 
distributed 
dmapps 
events draft 
experience 
high 
ibm 
industrial 
knowledge 
management 
nd 
performance 
predictive 
published 
search 
sentiment 
short 
snowfall 
southern 
sydney 
tenuretrack 
topic 
week 
youtube 
advanced 
also 
answers 
april 
area 
august 
building 
charts 
check 
comments 
containing 
cran 
create 
csiro 
dec 
dynamic 
easier 
engineer 
format 
guide 
hadoop 
igraph 
incl 
itemset 
jan 
labs 
may 
media 
microsoft 
mid 
new 
opportunity 
plot 
postdocresearch 
quick 
realworld 
san 
senior 
simple 
singapore 
state 
titled 
tj 
track 
tweets 
usamaf 
views 
visits 
watson 
website 22 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
23 / 35
# remove sparse terms 
tdm2 <- removeSparseTerms(tdm, sparse = 0.95) 
m2 <- as.matrix(tdm2) 
# cluster terms 
distMatrix <- dist(scale(m2)) 
fit <- hclust(distMatrix, method = "ward") 
24 / 35
plot(fit) 
rect.hclust(fit, k = 6) # cut tree into 6 clusters 
10 20 30 40 50 60 
research 
position 
postdoctoral 
university 
analysis 
network 
social 
examples 
package 
slides 
used 
computing 
tutorial 
big 
applications 
book 
r 
data 
mining 
Cluster Dendrogram 
Height 
25 / 35
m3 <- t(m2) # transpose the matrix to cluster documents (tweets) 
set.seed(122) # set a fixed random seed 
k <- 6 # number of clusters 
kmeansResult <- kmeans(m3, k) 
round(kmeansResult$centers, digits = 3) # cluster centers 
## analysis applications big book computing data examples 
## 1 0.147 0.088 0.147 0.015 0.059 1.015 0.088 
## 2 0.028 0.167 0.167 0.250 0.028 1.556 0.194 
## 3 0.810 0.000 0.000 0.000 0.000 0.048 0.095 
## 4 0.080 0.036 0.007 0.058 0.087 0.000 0.181 
## 5 0.000 0.000 0.000 0.067 0.067 0.333 0.067 
## 6 0.119 0.048 0.071 0.000 0.048 0.357 0.000 
## mining network package position postdoctoral r research 
## 1 0.338 0.015 0.015 0.059 0.074 0.235 0.074 
## 2 1.056 0.000 0.222 0.000 0.000 1.000 0.028 
## 3 0.048 1.000 0.095 0.143 0.095 0.286 0.048 
## 4 0.065 0.022 0.174 0.000 0.007 0.703 0.000 
## 5 1.200 0.000 0.000 0.000 0.067 0.067 0.000 
## 6 0.119 0.000 0.024 0.643 0.310 0.000 0.714 
## slides social tutorial university used 
## 1 0.074 0.000 0.015 0.015 0.029 
## 2 0.056 0.000 0.000 0.000 0.250 
## 3 0.095 0.762 0.190 0.000 0.095 
26 / 35
for (i in 1:k) f 
cat(paste("cluster ", i, ": ", sep = "")) 
s <- sort(kmeansResult$centers[i, ], decreasing = T) 
cat(names(s)[1:5], "nn") 
# print the tweets of every cluster 
# print(tweets[which(kmeansResult$cluster==i)]) 
g 
## cluster 1: data mining r analysis big 
## cluster 2: data mining r book used 
## cluster 3: network analysis social r tutorial 
## cluster 4: r examples package slides used 
## cluster 5: mining tutorial slides data book 
## cluster 6: research position university data postdoctoral 
27 / 35
library(fpc) 
# partitioning around medoids with estimation of number of clusters 
pamResult <- pamk(m3, metric="manhattan") 
k <- pamResult$nc # number of clusters identified 
pamResult <- pamResult$pamobject 
# print cluster medoids 
for (i in 1:k) f 
cat("cluster", i, ": ", 
colnames(pamResult$medoids)[which(pamResult$medoids[i,]==1)], "nn") 
g 
## cluster 1 : examples r 
## cluster 2 : analysis data r 
## cluster 3 : data 
## cluster 4 : 
## cluster 5 : r 
## cluster 6 : data mining r 
## cluster 7 : data mining 
## cluster 8 : analysis network social 
## cluster 9 : data position research 
## cluster 10 : position postdoctoral university 
28 / 35
# plot clustering result 
layout(matrix(c(1, 2), 1, 2)) # set to two graphs per page 
plot(pamResult, col.p = pamResult$clustering) 
−4 −2 0 2 4 6 
−4 −2 0 2 4 
clusplot(pam(x = sdata, k = k, diss = diss, metric = "manhattan")) 
Component 1 
Component 2 
These two components explain 25.23 % of the point variability. 
Silhouette plot of pam(x = sdata, k = k, diss = diss, metric = "manhattan") 
n = 320 10  clusters  Cj 
−0.2 0.0 0.2 0.4 0.6 0.8 1.0 
Silhouette width si 
Average silhouette width : 0.27 
j : nj | aveiÎCj  si 
1 : 28 | 0.13 
2 : 12 | 0.25 
3 : 31 | 0.23 
4 : 61 | 0.54 
5 : 58 | 0.29 
6 : 28 | 0.21 
7 : 43 | 0.15 
8 : 19 | 0.29 
9 : 21 | 0.15 
10 : 19 | 0.14 
layout(matrix(1)) # change back to one graph per page 
29 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
30 / 35
Topic Modelling 
dtm <- as.DocumentTermMatrix(tdm) 
library(topicmodels) 
lda <- LDA(dtm, k = 8) # find 8 topics 
term <- terms(lda, 4) # first 4 terms of every topic 
term 
## Topic 1 Topic 2 Topic 3 Topic 4 
## [1,] "data" "data" "mining" "r" 
## [2,] "mining" "r" "r" "lecture" 
## [3,] "scientist" "mining" "tutorial" "time" 
## [4,] "research" "book" "introduction" "series" 
## Topic 5 Topic 6 Topic 7 Topic 8 
## [1,] "position" "package" "r" "analysis" 
## [2,] "research" "r" "data" "network" 
## [3,] "postdoctoral" "clustering" "computing" "r" 
## [4,] "university" "detection" "slides" "code" 
term <- apply(term, MARGIN = 2, paste, collapse = ", ") 
31 / 35
Topic Modelling 
# first topic identified for every document (tweet) 
topic <- topics(lda, 1) 
topics <- data.frame(date=as.IDate(tweets.df$created), topic) 
qplot(date, ..count.., data=topics, geom="density", 
fill=term[topic], position="stack") 
0.4 
0.2 
0.0 
2011−07 2012−01 2012−07 2013−01 2013−07 
date 
count 
term[topic] 
analysis, network, r, code 
data, mining, scientist, research 
data, r, mining, book 
mining, r, tutorial, introduction 
package, r, clustering, detection 
position, research, postdoctoral, university 
r, data, computing, slides 
r, lecture, time, series 
32 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
33 / 35
Online Resources 
I Chapter 10: Text Mining, in book 
R and Data Mining: Examples and Case Studies 
http://www.rdatamining.com/docs/RDataMining.pdf 
I R Reference Card for Data Mining 
http://www.rdatamining.com/docs/R-refcard-data-mining.pdf 
I Free online courses and documents 
http://www.rdatamining.com/resources/ 
I RDataMining Group on LinkedIn (7,000+ members) 
http://group.rdatamining.com 
I RDataMining on Twitter (1,700+ followers) 
@RDataMining 
34 / 35

More Related Content

What's hot

Request to Fulfill Presentation (IT4IT)
Request to Fulfill Presentation (IT4IT)Request to Fulfill Presentation (IT4IT)
Request to Fulfill Presentation (IT4IT)Rob Akershoek
 
Huawei Digital Transformation
Huawei Digital TransformationHuawei Digital Transformation
Huawei Digital TransformationMyles Freedman
 
Database/ Bases de données
Database/ Bases de donnéesDatabase/ Bases de données
Database/ Bases de donnéeszied kallel
 
ITIL Service Strategy
ITIL Service StrategyITIL Service Strategy
ITIL Service StrategyMarvin Sirait
 
Business Value Measurements and the Solution Design Framework
Business Value Measurements and the Solution Design FrameworkBusiness Value Measurements and the Solution Design Framework
Business Value Measurements and the Solution Design FrameworkLeo Barella
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
5.1 K plus proches voisins
5.1 K plus proches voisins5.1 K plus proches voisins
5.1 K plus proches voisinsBoris Guarisma
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map ReduceApache Apex
 
Reinventing Enterprise Operations
Reinventing Enterprise OperationsReinventing Enterprise Operations
Reinventing Enterprise Operationsaccenture
 
ITIL v3 vs v4
ITIL v3 vs v4ITIL v3 vs v4
ITIL v3 vs v4BITIL.COM
 
Learn Hadoop Administration
Learn Hadoop AdministrationLearn Hadoop Administration
Learn Hadoop AdministrationEdureka!
 
Boost your ITSM maturity with a service catalog
Boost your ITSM maturity with a service catalogBoost your ITSM maturity with a service catalog
Boost your ITSM maturity with a service catalogAxios Systems
 
end-to-end service management with ServiceNow (English)
end-to-end service management with ServiceNow (English)end-to-end service management with ServiceNow (English)
end-to-end service management with ServiceNow (English)Orange Business Services
 
Being digital: Embracing the future of work
Being digital: Embracing the future of workBeing digital: Embracing the future of work
Being digital: Embracing the future of workaccenture
 
La gestion de la connaissance et le catalogue de service au service de l’opti...
La gestion de la connaissance et le catalogue de service au service de l’opti...La gestion de la connaissance et le catalogue de service au service de l’opti...
La gestion de la connaissance et le catalogue de service au service de l’opti...itSMF France
 
Why New-age IT Operating Models are Necessary for Enhanced Operational Agility
Why New-age IT Operating Models are Necessary for Enhanced Operational AgilityWhy New-age IT Operating Models are Necessary for Enhanced Operational Agility
Why New-age IT Operating Models are Necessary for Enhanced Operational AgilityCognizant
 

What's hot (20)

Request to Fulfill Presentation (IT4IT)
Request to Fulfill Presentation (IT4IT)Request to Fulfill Presentation (IT4IT)
Request to Fulfill Presentation (IT4IT)
 
ITIL v4 Foundation course
 ITIL v4 Foundation course  ITIL v4 Foundation course
ITIL v4 Foundation course
 
Huawei Digital Transformation
Huawei Digital TransformationHuawei Digital Transformation
Huawei Digital Transformation
 
Database/ Bases de données
Database/ Bases de donnéesDatabase/ Bases de données
Database/ Bases de données
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
ITIL Service Strategy
ITIL Service StrategyITIL Service Strategy
ITIL Service Strategy
 
Business Value Measurements and the Solution Design Framework
Business Value Measurements and the Solution Design FrameworkBusiness Value Measurements and the Solution Design Framework
Business Value Measurements and the Solution Design Framework
 
Introduction à ITIL
Introduction à ITILIntroduction à ITIL
Introduction à ITIL
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
5.1 K plus proches voisins
5.1 K plus proches voisins5.1 K plus proches voisins
5.1 K plus proches voisins
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Reinventing Enterprise Operations
Reinventing Enterprise OperationsReinventing Enterprise Operations
Reinventing Enterprise Operations
 
ITIL v3 vs v4
ITIL v3 vs v4ITIL v3 vs v4
ITIL v3 vs v4
 
Learn Hadoop Administration
Learn Hadoop AdministrationLearn Hadoop Administration
Learn Hadoop Administration
 
Boost your ITSM maturity with a service catalog
Boost your ITSM maturity with a service catalogBoost your ITSM maturity with a service catalog
Boost your ITSM maturity with a service catalog
 
end-to-end service management with ServiceNow (English)
end-to-end service management with ServiceNow (English)end-to-end service management with ServiceNow (English)
end-to-end service management with ServiceNow (English)
 
Being digital: Embracing the future of work
Being digital: Embracing the future of workBeing digital: Embracing the future of work
Being digital: Embracing the future of work
 
La gestion de la connaissance et le catalogue de service au service de l’opti...
La gestion de la connaissance et le catalogue de service au service de l’opti...La gestion de la connaissance et le catalogue de service au service de l’opti...
La gestion de la connaissance et le catalogue de service au service de l’opti...
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Why New-age IT Operating Models are Necessary for Enhanced Operational Agility
Why New-age IT Operating Models are Necessary for Enhanced Operational AgilityWhy New-age IT Operating Models are Necessary for Enhanced Operational Agility
Why New-age IT Operating Models are Necessary for Enhanced Operational Agility
 

Similar to Text Mining with R -- an Analysis of Twitter Data

Text Mining of Twitter in Data Mining
Text Mining of Twitter in Data MiningText Mining of Twitter in Data Mining
Text Mining of Twitter in Data MiningMeghaj Mallick
 
Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1Johan Blomme
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rYanchang Zhao
 
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...Goran S. Milovanovic
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In RRsquared Academy
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...Databricks
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahoutaneeshabakharia
 
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkSpark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkZalando Technology
 
Machine Learning with Microsoft Azure
Machine Learning with Microsoft AzureMachine Learning with Microsoft Azure
Machine Learning with Microsoft AzureDmitry Petukhov
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionChetan Khatri
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
Dynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web siteDynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web siteSriram Natarajan
 
An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"sandinmyjoints
 
Reproducibility with R
Reproducibility with RReproducibility with R
Reproducibility with RMartin Jung
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLPBill Liu
 
Learning deep structured semantic models for web search
Learning deep structured semantic models for web searchLearning deep structured semantic models for web search
Learning deep structured semantic models for web searchhyunsung lee
 
A look ahead at spark 2.0
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0 Databricks
 
“Performance” - Dallas Oracle Users Group 2019-01-29 presentation
“Performance” - Dallas Oracle Users Group 2019-01-29 presentation“Performance” - Dallas Oracle Users Group 2019-01-29 presentation
“Performance” - Dallas Oracle Users Group 2019-01-29 presentationCary Millsap
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Serban Tanasa
 
200612_BioPackathon_ss
200612_BioPackathon_ss200612_BioPackathon_ss
200612_BioPackathon_ssSatoshi Kume
 

Similar to Text Mining with R -- an Analysis of Twitter Data (20)

Text Mining of Twitter in Data Mining
Text Mining of Twitter in Data MiningText Mining of Twitter in Data Mining
Text Mining of Twitter in Data Mining
 
Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-r
 
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In R
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
 
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkSpark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
 
Machine Learning with Microsoft Azure
Machine Learning with Microsoft AzureMachine Learning with Microsoft Azure
Machine Learning with Microsoft Azure
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Dynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web siteDynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web site
 
An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"
 
Reproducibility with R
Reproducibility with RReproducibility with R
Reproducibility with R
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 
Learning deep structured semantic models for web search
Learning deep structured semantic models for web searchLearning deep structured semantic models for web search
Learning deep structured semantic models for web search
 
A look ahead at spark 2.0
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0
 
“Performance” - Dallas Oracle Users Group 2019-01-29 presentation
“Performance” - Dallas Oracle Users Group 2019-01-29 presentation“Performance” - Dallas Oracle Users Group 2019-01-29 presentation
“Performance” - Dallas Oracle Users Group 2019-01-29 presentation
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
 
200612_BioPackathon_ss
200612_BioPackathon_ss200612_BioPackathon_ss
200612_BioPackathon_ss
 

More from Yanchang Zhao

RDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisRDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisYanchang Zhao
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classificationYanchang Zhao
 
RDataMining slides-r-programming
RDataMining slides-r-programmingRDataMining slides-r-programming
RDataMining slides-r-programmingYanchang Zhao
 
RDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-rRDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-rYanchang Zhao
 
RDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisationRDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisationYanchang Zhao
 
RDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rRDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rYanchang Zhao
 
RDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-rRDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-rYanchang Zhao
 
RDataMining-reference-card
RDataMining-reference-cardRDataMining-reference-card
RDataMining-reference-cardYanchang Zhao
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with RYanchang Zhao
 
Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with RYanchang Zhao
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with RYanchang Zhao
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with RYanchang Zhao
 
Data Exploration and Visualization with R
Data Exploration and Visualization with RData Exploration and Visualization with R
Data Exploration and Visualization with RYanchang Zhao
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RYanchang Zhao
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with RYanchang Zhao
 
Time series-mining-slides
Time series-mining-slidesTime series-mining-slides
Time series-mining-slidesYanchang Zhao
 
R Reference Card for Data Mining
R Reference Card for Data MiningR Reference Card for Data Mining
R Reference Card for Data MiningYanchang Zhao
 

More from Yanchang Zhao (17)

RDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisRDataMining slides-time-series-analysis
RDataMining slides-time-series-analysis
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classification
 
RDataMining slides-r-programming
RDataMining slides-r-programmingRDataMining slides-r-programming
RDataMining slides-r-programming
 
RDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-rRDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-r
 
RDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisationRDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisation
 
RDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rRDataMining slides-clustering-with-r
RDataMining slides-clustering-with-r
 
RDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-rRDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-r
 
RDataMining-reference-card
RDataMining-reference-cardRDataMining-reference-card
RDataMining-reference-card
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with R
 
Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with R
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with R
 
Data Exploration and Visualization with R
Data Exploration and Visualization with RData Exploration and Visualization with R
Data Exploration and Visualization with R
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in R
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with R
 
Time series-mining-slides
Time series-mining-slidesTime series-mining-slides
Time series-mining-slides
 
R Reference Card for Data Mining
R Reference Card for Data MiningR Reference Card for Data Mining
R Reference Card for Data Mining
 

Recently uploaded

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 

Recently uploaded (20)

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

Text Mining with R -- an Analysis of Twitter Data

  • 1. Text Mining with R { an Analysis of Twitter Data Yanchang Zhao http://www.RDataMining.com 30 September 2014 1 / 35
  • 2. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 2 / 35
  • 3. Text Mining I unstructured text data I text categorization I text clustering I entity extraction I sentiment analysis I document summarization I . . . 3 / 35
  • 4. Text mining of Twitter data with R 1 1. extract data from Twitter 2. clean extracted data and build a document-term matrix 3.
  • 5. nd frequent words and associations 4. create a word cloud to visualize important words 5. text clustering 6. topic modelling 1Chapter 10: Text Mining, R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/RDataMining.pdf 4 / 35
  • 6. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 5 / 35
  • 7. Retrieve Tweets Retrieve recent tweets by @RDataMining ## Option 1: retrieve tweets from Twitter library(twitteR) tweets <- userTimeline("RDataMining", n = 3200) ## Option 2: download @RDataMining tweets from RDataMining.com url <- "http://www.rdatamining.com/data/rdmTweets.RData" download.file(url, destfile = "./data/rdmTweets.RData") ## load tweets into R load(file = "./data/rdmTweets-201306.RData") 6 / 35
  • 8. (n.tweet <- length(tweets)) ## [1] 320 tweets[1:5] ## [[1]] ## [1] "RDataMining: Examples on calling Java code from R nht... ## ## [[2]] ## [1] "RDataMining: Simulating Map-Reduce in R for Big Data A... ## ## [[3]] ## [1] "RDataMining: Job opportunity: Senior Analyst - Big Dat... ## ## [[4]] ## [1] "RDataMining: CLAVIN: an open source software package f... ## ## [[5]] ## [1] "RDataMining: An online book on Natural Language Proces... 7 / 35
  • 9. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 8 / 35
  • 10. # convert tweets to a data frame # tweets.df <- do.call("rbind", lapply(tweets, as.data.frame)) tweets.df <- twListToDF(tweets) dim(tweets.df) ## [1] 320 14 library(tm) # build a corpus, and specify the source to be character vectors myCorpus <- Corpus(VectorSource(tweets.df$text)) # convert to lower case myCorpus <- tm_map(myCorpus, tolower) Package tm v0.5-10 was used in this example. With tm v0.6, content transformer" needs to be used to wrap around normal functions. # tm v0.6 myCorpus <- tm_map(myCorpus, content_transformer(tolower)) 9 / 35
  • 11. # remove punctuation myCorpus <- tm_map(myCorpus, removePunctuation) # remove numbers myCorpus <- tm_map(myCorpus, removeNumbers) # remove URLs removeURL <- function(x) gsub("http[[:alnum:]]*", "", x) myCorpus <- tm_map(myCorpus, removeURL) # add two extra stop words: available and via myStopwords <- c(stopwords("english"), "available", "via") # remove r and big from stopwords myStopwords <- setdiff(myStopwords, c("r", "big")) # remove stopwords from corpus myCorpus <- tm_map(myCorpus, removeWords, myStopwords) 10 / 35
  • 12. # keep a copy of corpus to use later as a dictionary for stem completion myCorpusCopy <- myCorpus # stem words myCorpus <- tm_map(myCorpus, stemDocument) # inspect the first 5 documents (tweets) inspect(myCorpus[1:5]) # The code below is used for to make text fit for paper width for (i in 1:5) f cat(paste("[[", i, "]] ", sep = "")) writeLines(myCorpus[[i]]) g ## [[1]] exampl call java code r ## ## [[2]] simul mapreduc r big data analysi use flight data ... ## [[3]] job opportun senior analyst big data wesfarm indust... ## [[4]] clavin open sourc softwar packag document geotag g... ## [[5]] onlin book natur languag process python 11 / 35
  • 13. # stem completion myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy) ## [[1]] examples call java code r ## [[2]] simulating mapreduce r big data analysis used flights... ## [[3]] job opportunity senior analyst big data wesfarmers in... ## [[4]] clavin open source software package document geotaggi... ## [[5]] online book natural language processing python 12 / 35
  • 14. # count frequency of "mining" miningCases <- tm_map(myCorpusCopy, grep, pattern = "nn<mining") sum(unlist(miningCases)) ## [1] 82 # count frequency of "miners" minerCases <- tm_map(myCorpusCopy, grep, pattern = "nn<miners") sum(unlist(minerCases)) ## [1] 4 # replace "miners" with "mining" myCorpus <- tm_map(myCorpus, gsub, pattern = "miners", replacement = "mining") 13 / 35
  • 15. tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf))) tdm ## A term-document matrix (790 terms, 320 documents) ## ## Non-/sparse entries: 2449/250351 ## Sparsity : 99% ## Maximal term length: 27 ## Weighting : term frequency (tf) 14 / 35
  • 16. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 15 / 35
  • 17. idx <- which(dimnames(tdm)$Terms == "r") inspect(tdm[idx + (0:5), 101:110]) ## A term-document matrix (6 terms, 10 documents) ## ## Non-/sparse entries: 4/56 ## Sparsity : 93% ## Maximal term length: 12 ## Weighting : term frequency (tf) ## ## Docs ## Terms 101 102 103 104 105 106 107 108 109 110 ## r 0 1 1 0 0 0 0 0 1 1 ## ramachandran 0 0 0 0 0 0 0 0 0 0 ## random 0 0 0 0 0 0 0 0 0 0 ## ranked 0 0 0 0 0 0 0 0 0 0 ## rann 0 0 0 0 0 0 0 0 0 0 ## rapidminer 0 0 0 0 0 0 0 0 0 0 16 / 35
  • 18. # inspect frequent words (freq.terms <- findFreqTerms(tdm, lowfreq = 15)) ## [1] "analysis" "applications" "big" "book" ## [5] "code" "computing" "data" "examples" ## [9] "group" "introduction" "mining" "network" ## [13] "package" "position" "postdoctoral" "r" ## [17] "research" "see" "slides" "social" ## [21] "tutorial" "university" "used" term.freq <- rowSums(as.matrix(tdm)) term.freq <- subset(term.freq, term.freq >= 15) df <- data.frame(term = names(term.freq), freq = term.freq) 17 / 35
  • 19. library(ggplot2) ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") + xlab("Terms") + ylab("Count") + coord_flip() used university tutorial social slides see research r postdoctoral position package network mining introduction group examples data computing code book big applications analysis 0 50 100 150 Count Terms 18 / 35
  • 20. # which words are associated with r ? findAssocs(tdm, "r", 0.2) ## r ## examples 0.32 ## code 0.29 ## package 0.20 # which words are associated with mining ? findAssocs(tdm, "mining", 0.25) ## mining ## data 0.47 ## mahout 0.30 ## recommendation 0.30 ## sets 0.30 ## supports 0.30 ## frequent 0.26 ## itemset 0.26 19 / 35
  • 21. library(graph) library(Rgraphviz) plot(tdm, term = freq.terms, corThreshold = 0.12, weighting = T) analysis research r applications big book code computing data examples group introduction mining network postdoctoral package position see slides tutorial social university used 20 / 35
  • 22. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 21 / 35
  • 23. library(wordcloud) m <- as.matrix(tdm) # calculate the frequency of words and sort it by frequency word.freq <- sort(rowSums(m), decreasing = T) wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3, random.order = F) language postdoctoral university social book data r project forecasting frequent applications big code tutorial job tried edited provided information research examples package mining analysis used cfp network positionslides computing see lecture introduction group analytics australia modelling scientist time association free online parallel text clustering course learn pdf rules series talk document now programming fellow statistics ausdm detection outlier google large open rdatamining techniques thanks tools users vacancy amp chapter due graphics processing notes map presentation science software starting studies visualizing business call can case center database details follower functions join kdnuggets page poll top submission spatial technology twitter analyst california card fast classification find get graph handling interactive linkedin list machine melbourne recent reference using videos web elsevier workshop wwwrdataminingcom access added canada canberra china cloud conference datasets distributed dmapps events draft experience high ibm industrial knowledge management nd performance predictive published search sentiment short snowfall southern sydney tenuretrack topic week youtube advanced also answers april area august building charts check comments containing cran create csiro dec dynamic easier engineer format guide hadoop igraph incl itemset jan labs may media microsoft mid new opportunity plot postdocresearch quick realworld san senior simple singapore state titled tj track tweets usamaf views visits watson website 22 / 35
  • 24. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 23 / 35
  • 25. # remove sparse terms tdm2 <- removeSparseTerms(tdm, sparse = 0.95) m2 <- as.matrix(tdm2) # cluster terms distMatrix <- dist(scale(m2)) fit <- hclust(distMatrix, method = "ward") 24 / 35
  • 26. plot(fit) rect.hclust(fit, k = 6) # cut tree into 6 clusters 10 20 30 40 50 60 research position postdoctoral university analysis network social examples package slides used computing tutorial big applications book r data mining Cluster Dendrogram Height 25 / 35
  • 27. m3 <- t(m2) # transpose the matrix to cluster documents (tweets) set.seed(122) # set a fixed random seed k <- 6 # number of clusters kmeansResult <- kmeans(m3, k) round(kmeansResult$centers, digits = 3) # cluster centers ## analysis applications big book computing data examples ## 1 0.147 0.088 0.147 0.015 0.059 1.015 0.088 ## 2 0.028 0.167 0.167 0.250 0.028 1.556 0.194 ## 3 0.810 0.000 0.000 0.000 0.000 0.048 0.095 ## 4 0.080 0.036 0.007 0.058 0.087 0.000 0.181 ## 5 0.000 0.000 0.000 0.067 0.067 0.333 0.067 ## 6 0.119 0.048 0.071 0.000 0.048 0.357 0.000 ## mining network package position postdoctoral r research ## 1 0.338 0.015 0.015 0.059 0.074 0.235 0.074 ## 2 1.056 0.000 0.222 0.000 0.000 1.000 0.028 ## 3 0.048 1.000 0.095 0.143 0.095 0.286 0.048 ## 4 0.065 0.022 0.174 0.000 0.007 0.703 0.000 ## 5 1.200 0.000 0.000 0.000 0.067 0.067 0.000 ## 6 0.119 0.000 0.024 0.643 0.310 0.000 0.714 ## slides social tutorial university used ## 1 0.074 0.000 0.015 0.015 0.029 ## 2 0.056 0.000 0.000 0.000 0.250 ## 3 0.095 0.762 0.190 0.000 0.095 26 / 35
  • 28. for (i in 1:k) f cat(paste("cluster ", i, ": ", sep = "")) s <- sort(kmeansResult$centers[i, ], decreasing = T) cat(names(s)[1:5], "nn") # print the tweets of every cluster # print(tweets[which(kmeansResult$cluster==i)]) g ## cluster 1: data mining r analysis big ## cluster 2: data mining r book used ## cluster 3: network analysis social r tutorial ## cluster 4: r examples package slides used ## cluster 5: mining tutorial slides data book ## cluster 6: research position university data postdoctoral 27 / 35
  • 29. library(fpc) # partitioning around medoids with estimation of number of clusters pamResult <- pamk(m3, metric="manhattan") k <- pamResult$nc # number of clusters identified pamResult <- pamResult$pamobject # print cluster medoids for (i in 1:k) f cat("cluster", i, ": ", colnames(pamResult$medoids)[which(pamResult$medoids[i,]==1)], "nn") g ## cluster 1 : examples r ## cluster 2 : analysis data r ## cluster 3 : data ## cluster 4 : ## cluster 5 : r ## cluster 6 : data mining r ## cluster 7 : data mining ## cluster 8 : analysis network social ## cluster 9 : data position research ## cluster 10 : position postdoctoral university 28 / 35
  • 30. # plot clustering result layout(matrix(c(1, 2), 1, 2)) # set to two graphs per page plot(pamResult, col.p = pamResult$clustering) −4 −2 0 2 4 6 −4 −2 0 2 4 clusplot(pam(x = sdata, k = k, diss = diss, metric = "manhattan")) Component 1 Component 2 These two components explain 25.23 % of the point variability. Silhouette plot of pam(x = sdata, k = k, diss = diss, metric = "manhattan") n = 320 10 clusters Cj −0.2 0.0 0.2 0.4 0.6 0.8 1.0 Silhouette width si Average silhouette width : 0.27 j : nj | aveiÎCj si 1 : 28 | 0.13 2 : 12 | 0.25 3 : 31 | 0.23 4 : 61 | 0.54 5 : 58 | 0.29 6 : 28 | 0.21 7 : 43 | 0.15 8 : 19 | 0.29 9 : 21 | 0.15 10 : 19 | 0.14 layout(matrix(1)) # change back to one graph per page 29 / 35
  • 31. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 30 / 35
  • 32. Topic Modelling dtm <- as.DocumentTermMatrix(tdm) library(topicmodels) lda <- LDA(dtm, k = 8) # find 8 topics term <- terms(lda, 4) # first 4 terms of every topic term ## Topic 1 Topic 2 Topic 3 Topic 4 ## [1,] "data" "data" "mining" "r" ## [2,] "mining" "r" "r" "lecture" ## [3,] "scientist" "mining" "tutorial" "time" ## [4,] "research" "book" "introduction" "series" ## Topic 5 Topic 6 Topic 7 Topic 8 ## [1,] "position" "package" "r" "analysis" ## [2,] "research" "r" "data" "network" ## [3,] "postdoctoral" "clustering" "computing" "r" ## [4,] "university" "detection" "slides" "code" term <- apply(term, MARGIN = 2, paste, collapse = ", ") 31 / 35
  • 33. Topic Modelling # first topic identified for every document (tweet) topic <- topics(lda, 1) topics <- data.frame(date=as.IDate(tweets.df$created), topic) qplot(date, ..count.., data=topics, geom="density", fill=term[topic], position="stack") 0.4 0.2 0.0 2011−07 2012−01 2012−07 2013−01 2013−07 date count term[topic] analysis, network, r, code data, mining, scientist, research data, r, mining, book mining, r, tutorial, introduction package, r, clustering, detection position, research, postdoctoral, university r, data, computing, slides r, lecture, time, series 32 / 35
  • 34. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 33 / 35
  • 35. Online Resources I Chapter 10: Text Mining, in book R and Data Mining: Examples and Case Studies http://www.rdatamining.com/docs/RDataMining.pdf I R Reference Card for Data Mining http://www.rdatamining.com/docs/R-refcard-data-mining.pdf I Free online courses and documents http://www.rdatamining.com/resources/ I RDataMining Group on LinkedIn (7,000+ members) http://group.rdatamining.com I RDataMining on Twitter (1,700+ followers) @RDataMining 34 / 35
  • 36. The End Thanks! Email: yanchang(at)rdatamining.com 35 / 35