Natural Language Processing
in R (rNLP)
Fridolin Wild, The Open University, UK
Tutorial to the Doctoral School
at the Institute of Business Informatics
of the Goethe University Frankfurt
Structure of this tutorial
• An introduction to R and cRunch
• Language basics in R
• Basic I/O in R
• Social Network Analysis
• Latent Semantic Analysis
• Twitter
• Sentiment
• (Advanced I/O in R: MySQL, SparQL)
Introduction
cRunch
• is an infrastructure
• for computationally-intense learning
analytics
• supporting researchers
• in investigating big data
• generated in the co-construction of
knowledge
… and beyond
…
Architecture
(Thiele & Lehner, 2011)
Living Reports
data shop
cron jobs
R webservices
Reports
Living reports
• reports with embedded
scripts and data
• knitr and Sweave
• render to html, PDF, …
• visualisations:
– ggplot2, trellis, graphix
– jpg, png, eps, pdf
png(file="n.png"); plot(network(m)); dev.off()
• Fill-in-the-blanks:
Drop-out rate went down to
<<echo=FALSE>>=
doquote["OU","2011"]
@
\documentclass[a4paper]{article}
\title{Sweave Example 1}
\author{Friedrich Leisch}
\begin{document}
\maketitle
In this example we embed parts of the examples from the
\texttt{kruskal.test} help page into a \LaTeX{} document:
<<>>=
data(airquality)
library(stats)  # kruskal.test() lives in 'stats'; the former 'ctest' package was merged into it
kruskal.test(Ozone ~ Month, data = airquality)
@
which shows that the location parameter of the Ozone
distribution varies significantly from month to month. Finally we
include a boxplot of the data:
\begin{center}
<<fig=TRUE,echo=FALSE>>=
boxplot(Ozone ~ Month, data = airquality)
@
\end{center}
\end{document}
Example PDF report
Example html5 report
Example Report
=============
This is an example of embedded scripts and
data.
```{r}
a = "hello world"
print(a)
```
And here is an example of how to embed a chart.
```{r fig.width=7, fig.height=6}
plot( 5:20 )
```
Shiny Widgets (1)
• Widgets: use-case
sized encapsulations
of mini apps
• HTML5
• Two files:
ui.R, server.R
• Still missing:
manifest files
(info.plist, config.xml)
Shiny Widgets (2)
From http://www.rstudio.com/shiny/
Web Services
harmonization &
data warehousing
Example R web service
print("hello world")
More complex R web service
setContentType("image/png")
a = c(1,3,5,12,13,15)
image_file = tempfile()
png(file=image_file)
plot(a,
main = "The magic image",
ylab = "", xlab = "",
col = c("darkred", "darkblue", "darkgreen")
)
dev.off()
sendBin(readBin(image_file,'raw',n=file.info(image_file)$size))
unlink(image_file)
R web services
• Uses the apache
mod_R.so
• See http://Rapache.net
• Common server functions:
– GET and POST variables
– setContentType
– sendBin
– …
A word on memory mgmt.
• Advanced memory management
(see p.70 of Dietl diploma thesis):
– Use package bigmemory
(for shared memory across
threads)
– Use package Rserve (for shared
read-only access across threads)
– Swap out memory objects with
save() and load()
– The latter is typically sufficient
(hard disks are fast!)
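The save()/load() swapping mentioned above can be sketched in base R as follows. This is only an illustration under assumptions: the object `big`, its size, and the temporary file are invented, and a real service would pick its own file location.

```r
# Hypothetical large object that we want out of memory for a while.
big <- matrix(runif(1e4), nrow = 100)
checksum <- sum(big)

# Swap out: write the object to disk, then free the memory.
swap_file <- tempfile(fileext = ".rda")
save(big, file = swap_file)
rm(big)

# ... memory-hungry work would happen here ...

# Swap in: load() restores the object under its original name.
load(swap_file)
unlink(swap_file)
```

Because load() restores objects by name, the calling code does not need to know which file held which object.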
• data management abstraction
layer for mod_R.so:
configure handler in httpd.conf:
specify directory match and load specific
data management routines at start up:
REvalOnStartup
"source('/dbal.R');"
Harvesting
data acquisition
Job scheduling
• crontab entries for R webservices
• e.g. harvest feeds
• e.g. store in local DB
data shop
sharing
Data shop and the community
• You have a 'public/' folder :)
– 'public/data': save() any .rda file and
it will be indexed within the hour
– 'public/services': use this to execute
your scripts; indexed within the hour
– 'public/gallery': use this to store
your public visualisations
– code sharing: Any .R script in your
'public/' folder is source readable by
the web
Not covered
The useful pointer
More NLP packages
install.packages("NaturalLanguageProcessing")
library("NaturalLanguageProcessing")
studio
exploratory
programming
studio
Social Network Analysis
The Idea
The basic concept
• Precursors date back to the 1920s, math to
Euler's 'Seven Bridges of Königsberg'
• Social Networks are:
– Actors (people, groups, media, tags, …)
– Ties (interactions, relationships, …)
• Actors and ties form a graph
• The graph has measurable structural
properties
– Betweenness,
– Degree centrality,
– Density,
– Cohesion
• Structural Patterns
Forum Messages
message_id forum_id parent_id author
130 2853483 2853445 N 2043
131 1440740 785876 N 1669
132 2515257 2515256 N 5814
133 4704949 4699874 N 5810
134 2597170 2558273 N 2054
135 2316951 2230821 N 5095
136 3407573 3407568 N 36
137 2277393 2277387 N 359
138 3394136 3382201 N 1050
139 4603931 4167338 N 453
140 6234819 6189254 6231352 5400
141 806699 785877 804668 2177
142 4430290 3371246 3380313 48
143 3395686 3391024 3391129 35
144 6270213 6024351 6265378 5780
145 2496015 2491522 2491536 2774
146 4707562 4699873 4707502 5810
147 2574199 2440094 2443801 5801
148 4501993 4424215 4491650 5232
message_id forum_id parent_id author
60 734569 31117 N 2491
221 762702 31117 1
317 762717 31117 762702 1927
1528 819660 31117 793408 1197
1950 840406 31117 839998 1348
1047 841810 31117 767386 1879
2239 862709 31117 N 1982
2420 869839 31117 862709 2038
2694 884824 31117 N 5439
2503 896399 31117 862709 1982
2846 901691 31117 895022 992
3321 951376 31117 N 5174
3384 952895 31117 951376 1597
1186 955595 31117 767386 5724
3604 958065 31117 N 716
2551 960734 31117 862709 1939
4072 975816 31117 N 584
2574 986038 31117 862709 2043
2590 987842 31117 862709 1982
Incidence Matrix
• msg_id = incident, authors appear in incidents
Derive Adjacency Matrix
= t(im) %*% im
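As a minimal sketch of this derivation, the base R snippet below builds a small incidence matrix and multiplies its transpose with it. The four messages and four authors are invented toy data; only the expression `t(im) %*% im` comes from the slide.

```r
# Hypothetical toy data: rows are incidents (messages), columns are authors.
im <- matrix(0, nrow = 4, ncol = 4,
             dimnames = list(paste0("msg", 1:4), paste0("a", 1:4)))
im["msg1", c("a1", "a2")] <- 1
im["msg2", c("a2", "a3")] <- 1
im["msg3", c("a1", "a2")] <- 1
im["msg4", c("a3", "a4")] <- 1

# Author-by-author adjacency: cell [i, j] counts the incidents shared by i and j.
am <- t(im) %*% im

# The diagonal only counts each author's own incidents, so it is zeroed here.
diag(am) <- 0
print(am)
```

Zeroing the diagonal is a design choice for plotting sociograms without self-loops; keep it if the per-author incident counts are needed.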
Visualization: Sociogram
Degree
Betweenness
Network Density
• Total edges = 29
• Possible edges =
18 * (18-1)/2 = 153
• Density = 0.19
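The density arithmetic above can be checked with a few lines of base R. The helper name `density_undirected` and the small adjacency matrix are assumptions made for illustration; the figures 29 and 18 are taken from the slide.

```r
# Density of an undirected network: observed edges / possible edges.
density_undirected <- function(n_edges, n_nodes) {
  n_edges / (n_nodes * (n_nodes - 1) / 2)
}

# Figures from the slide: 29 edges among 18 actors.
slide_density <- density_undirected(29, 18)
print(round(slide_density, 2))

# The same measure derived from a hypothetical symmetric adjacency matrix.
am <- matrix(c(0, 1, 1, 0,
               1, 0, 1, 0,
               1, 1, 0, 1,
               0, 0, 1, 0), nrow = 4, byrow = TRUE)
n_edges <- sum(am[upper.tri(am)] > 0)   # count each undirected tie once
toy_density <- density_undirected(n_edges, nrow(am))
print(toy_density)
```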
kmeans Cluster (k=3)
Analysis
• Mix
• Match
• Optimise
Tutorials
• Starter: sna-simple.Rmd
• Real: sna-blog.Rmd
• Advanced: sna-forum.Rmd
Latent Semantic Analysis
Latent Semantic Analysis
• “Humans learn word meanings and how to combine
them into passage meaning through experience
with ~paragraph unitized verbal environments.”
• “They don't remember all the separate words of a
passage; they remember its overall gist or
meaning.”
• “LSA learns by 'reading' ~paragraph unitized
texts that represent the environment.”
• “It doesn't remember all the separate words of a
text; it remembers its overall gist or meaning.”
(Landauer, 2007)
Word choice is over-rated
• An educated adult understands ~100,000 word forms
• An average sentence contains 20 tokens.
• Thus 100,000^20 possible combinations of words in a
sentence
• maximum of log2(100,000^20)
= 332 bits in word choice alone.
• 20! = 2.4 × 10^18 possible orders of 20 words
= maximum of 61 bits from order of the words.
• 332/(61 + 332) = 84% word choice
(Landauer, 2007)
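The arithmetic behind these figures can be reproduced in base R. The variable names are chosen for illustration only; the inputs (100,000 word forms, 20 tokens) come from the slide.

```r
vocab_size <- 1e5   # word forms an educated adult understands
n_tokens   <- 20    # tokens in an average sentence

# Bits carried by word choice: log2(vocab_size^n_tokens).
bits_choice <- n_tokens * log2(vocab_size)

# Bits carried by word order: log2(20!).
bits_order <- log2(factorial(n_tokens))

# Share of the information that is due to word choice.
share_choice <- bits_choice / (bits_choice + bits_order)

print(round(c(choice = bits_choice, order = bits_order)))
print(round(100 * share_choice))
```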
LSA (2)
• Assumption: texts have a semantic structure
• However, this structure is obscured by word
usage (noise, synonymy, polysemy, …)
• Proposed LSA Solution:
– map doc-term matrix
– using conceptual indices
– derived statistically (truncated SVD)
– and make similarity comparisons using
angles
Input (e.g., documents)
{ M } =
Deerwester, Dumais, Furnas, Landauer, and Harshman (1990):
Indexing by Latent Semantic Analysis, In: Journal of the American
Society for Information Science, 41(6):391-407
Only the red terms appear in more
than one document, so strip the rest.
term = feature
vocabulary = ordered set of features
TEXTMATRIX
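A term-by-document matrix of this kind can be sketched in base R as shown below. The four toy documents and all variable names are invented for illustration, and no stop words are removed; the lsa package's textmatrix(), listed later in these slides, is the function meant for real use, and this sketch does not claim to reproduce it.

```r
# Hypothetical toy documents.
docs <- c(d1 = "human machine interface for computer applications",
          d2 = "a survey of user opinion of computer system response time",
          d3 = "the generation of random binary trees",
          d4 = "graph minors: a survey")

# Tokenise: lower-case, split on anything that is not a letter.
tokens <- lapply(docs, function(d) strsplit(tolower(d), "[^a-z]+")[[1]])

# Vocabulary = ordered set of features (terms).
vocabulary <- sort(unique(unlist(tokens)))

# Count each term per document: rows are terms, columns are documents.
tm <- vapply(tokens,
             function(tk) as.integer(table(factor(tk, levels = vocabulary))),
             integer(length(vocabulary)))
rownames(tm) <- vocabulary

# Keep only terms that appear in more than one document.
tm <- tm[rowSums(tm > 0) > 1, , drop = FALSE]
print(tm)
```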
Singular Value Decomposition
M = TSD'
Truncated SVD
latent-semantic space
Reconstructed, Reduced Matrix
m4: Graph minors: A survey
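The decomposition, truncation, and reconstruction steps can be sketched with base R's svd(). The matrix `M` and the choice k = 2 are assumptions for illustration; they are not the example matrix from the slides.

```r
# Hypothetical term-by-document matrix (5 terms, 4 documents).
M <- matrix(c(1, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 1, 0,
              0, 0, 1, 1,
              0, 0, 0, 1), nrow = 5, byrow = TRUE,
            dimnames = list(paste0("t", 1:5), paste0("d", 1:4)))

# Full SVD: M = T S D'
dec <- svd(M)

# Truncate to the k largest singular values.
k  <- 2
Tk <- dec$u[, 1:k, drop = FALSE]
Sk <- diag(dec$d[1:k], nrow = k)
Dk <- dec$v[, 1:k, drop = FALSE]

# Reconstructed, reduced matrix: same shape as M, but rank k.
Mk <- Tk %*% Sk %*% t(Dk)
dimnames(Mk) <- dimnames(M)
print(round(Mk, 2))
```

The reconstruction error in the Frobenius norm equals the root of the summed squares of the discarded singular values, which is one way to judge how much was lost.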
Similarity in a Latent-Semantic Space
[Figure: Query, Target 1, and Target 2 vectors plotted over the X and Y dimensions, with Angle 1 and Angle 2 marked]
doc2doc - similarities
Unreduced = pure vector space model
- Based on M = TSD'
- Pearson Correlation over document vectors
Reduced
- Based on M2 = T S2 D'
- Pearson Correlation over document vectors
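Both comparisons can be sketched in base R, since cor() on a matrix correlates its columns, that is, the document vectors. The matrix `M` and the choice of two dimensions are invented for illustration.

```r
# Hypothetical term-by-document matrix (5 terms, 4 documents).
M <- matrix(c(1, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 1, 0,
              0, 0, 1, 1,
              0, 0, 0, 1), nrow = 5, byrow = TRUE,
            dimnames = list(paste0("t", 1:5), paste0("d", 1:4)))

# Reduced matrix M2 = T S2 D' with two dimensions kept.
dec <- svd(M)
k   <- 2
M2  <- dec$u[, 1:k, drop = FALSE] %*% diag(dec$d[1:k], nrow = k) %*%
       t(dec$v[, 1:k, drop = FALSE])
dimnames(M2) <- dimnames(M)

# Pearson correlation over document vectors (columns).
sim_unreduced <- cor(M)    # pure vector space model
sim_reduced   <- cor(M2)   # latent-semantic space

print(round(sim_unreduced, 2))
print(round(sim_reduced, 2))
```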
Ex Post Updating: Folding-In
• SVD factor stability
– SVD calculates factors over a given text base
– Different texts – different factors
– Challenge: avoid unwanted factor changes
(e.g., bad essays)
– Solution: folding-in of essays instead of recalculating
• SVD is computationally expensive
Folding-In in Detail
(1) convert original vector $v$ to "Dk"-format:
$d_i = v^T T_k S_k^{-1}$
(2) convert "Dk"-format vector to "Mk"-format:
$m_i = T_k S_k d_i^T$
where $M_k = T_k S_k D_k^T$
(Berry et al., 1995)
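The two folding-in steps can be sketched in base R as below. The helper `fold_in_vec`, the matrix `M`, and the new document vector are hypothetical; the lsa package's own fold_in() is the function to use in practice, and this sketch does not claim to mirror its internals.

```r
# Hypothetical term-by-document matrix and its truncated SVD (k = 2).
M <- matrix(c(1, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 1, 0,
              0, 0, 1, 1,
              0, 0, 0, 1), nrow = 5, byrow = TRUE)
dec <- svd(M)
k   <- 2
Tk  <- dec$u[, 1:k, drop = FALSE]
Sk  <- diag(dec$d[1:k], nrow = k)
Dk  <- dec$v[, 1:k, drop = FALSE]
Mk  <- Tk %*% Sk %*% t(Dk)

# Fold a term vector v into the existing space without recomputing the SVD.
fold_in_vec <- function(v, Tk, Sk) {
  d <- t(v) %*% Tk %*% solve(Sk)   # step (1): "Dk"-format, 1 x k
  m <- Tk %*% Sk %*% t(d)          # step (2): "Mk"-format, terms x 1
  list(d = d, m = m)
}

# A new (hypothetical) document that uses terms 1 and 3.
new_doc <- fold_in_vec(c(1, 0, 1, 0, 0), Tk, Sk)
print(round(new_doc$m, 2))

# Sanity check: folding in an original document reproduces its column of Mk.
old_doc <- fold_in_vec(M[, 1], Tk, Sk)
```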
LSA Process & Driving Parameters
4 x 12 x 7 x 2 x 3
= 2016 Combinations
Pre-Processing
• Stemming
– Porter Stemmer (snowball.tartarus.org)
– 'move', 'moving', 'moves' => 'move'
– in German even more important (more inflections)
• Stop Word Elimination
– 373 Stop Words in German
• Stemming plus Stop Word Elimination
• Unprocessed ('raw') Terms
Term Weighting Schemes
• Global Weights (GW)
– None ('raw' tf)
– Normalisation: $\mathrm{norm}_i = \frac{1}{\sqrt{\sum_j tf_{ij}^2}}$
– Inverse Document Frequency (IDF): $\mathrm{idf}_i = \log_2\left(\frac{\mathrm{numdocs}}{\mathrm{docfreq}(i)}\right) + 1$
– 1 + Entropy: $\mathrm{entplusone}_i = 1 + \sum_j \frac{p_{ij} \log p_{ij}}{\log(\mathrm{numdocs})}$, where $p_{ij} = \frac{tf_{ij}}{\sum_j tf_{ij}}$
$\mathrm{weight}_{ij} = lw(tf_{ij}) \cdot gw(tf_{ij})$
• Local Weights (LW)
– None ('raw' tf)
– Binary Term Frequency
– Logarithmized Term Frequency (log)
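The weighting schemes above can be sketched in base R. The term-frequency matrix is invented, and the logarithmised local weight is assumed to be log(tf + 1); the lsa package provides ready-made functions such as lw_logtf() and gw_idf(), which this sketch does not claim to reproduce exactly.

```r
# Hypothetical raw term frequencies (rows = terms, columns = documents).
tf <- matrix(c(2, 0, 1,
               1, 1, 1,
               0, 3, 0), nrow = 3, byrow = TRUE,
             dimnames = list(c("t1", "t2", "t3"), c("d1", "d2", "d3")))
numdocs <- ncol(tf)

# Local weights.
lw_binary <- (tf > 0) * 1
lw_log    <- log(tf + 1)            # assumption: log(tf + 1)

# Global weights (one value per term).
gw_norm <- 1 / sqrt(rowSums(tf^2))
gw_idf  <- log2(numdocs / rowSums(tf > 0)) + 1
p       <- tf / rowSums(tf)
# 0 * log(0) yields NaN; na.rm = TRUE treats those cells as contributing 0.
gw_ent  <- 1 + rowSums(p * log(p), na.rm = TRUE) / log(numdocs)

# weight_ij = lw(tf_ij) * gw(tf_ij); the per-term vector recycles down each column.
weighted <- lw_log * gw_idf
print(round(weighted, 2))
print(round(gw_ent, 2))
```

In this toy data the evenly spread term t2 gets an entropy weight of 0, while t3, which occurs in a single document, gets the maximum weight of 1.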
SVD-Dimensionality
• Many different proposals (see package)
• 80% variance is a good estimator
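One possible reading of the 80% rule is sketched below: keep the smallest number of dimensions whose squared singular values cover at least 80% of the total. The helper `dims_for_share` and the singular values are hypothetical, and treating "variance" as squared singular values is an assumption; the package's dimcalc_share() may define the share differently.

```r
# Smallest k whose squared singular values reach the requested share.
dims_for_share <- function(s, share = 0.8) {
  explained <- cumsum(s^2) / sum(s^2)
  which(explained >= share)[1]
}

# Hypothetical singular values, sorted in decreasing order as svd() returns them.
s <- c(4, 3, 2, 1)
k <- dims_for_share(s)
print(k)
```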
Proximity Measures
• Pearson Correlation
• Cosine Similarity
• Spearman's Rho
pics: http://davidmlane.com/hyperstat/A62891.html
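The three measures can be compared on a pair of vectors in base R. The vectors and the helper `cosine_sim` are invented for illustration; cor() with its `method` argument is part of base R's stats package.

```r
# Cosine of the angle between two vectors.
cosine_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Hypothetical vectors: y grows monotonically with x, but not linearly.
x <- c(1, 2, 3, 4)
y <- c(1, 4, 9, 16)

pearson  <- cor(x, y, method = "pearson")    # linear association
spearman <- cor(x, y, method = "spearman")   # rank-based association
cosine   <- cosine_sim(x, y)                 # angle between the vectors

print(round(c(pearson = pearson, spearman = spearman, cosine = cosine), 3))
```

Spearman's rho only looks at ranks, so any strictly increasing relationship scores 1, whereas Pearson and cosine both react to the non-linear shape.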
Pair-wise dis/similarity
Convergence expected: ‘eu’, ‘österreich’
Divergence expected: ‘jahr’, ‘wien’
The Package
• Available via CRAN, e.g.:
http://cran.r-project.org/web/packages/lsa/index.html
• Higher-level Abstraction to Ease Use
– Core methods:
textmatrix() / query()
lsa()
fold_in()
as.textmatrix()
– Support methods for term weighting, dimensionality
calculation, correlation measurement, …
Core Workflow
• tm = textmatrix("dir/")
• tm = lw_logtf(tm) * gw_idf(tm)
• space = lsa(tm, dims=dimcalc_share())
• tm3 = fold_in(tm, space)
• as.textmatrix(space)
Pre-Processing Chain
Tutorials
• Starter: lsa-indexing.Rmd
• Real: lsa-essayscoring.Rmd
• Advanced: lsa-sparse.Rmd
Additional tutorials
Tutorials
• Advanced I/O: twitter.Rmd
• Advanced I/O: sparql.Rmd
• Advanced NLP: twitter-sentiment.Rmd
• Evaluation: interrater-agreement.Rmd

More Related Content

What's hot

Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - IJaganadh Gopinadhan
 
Text Mining Using R
Text Mining Using RText Mining Using R
Text Mining Using RKnoldus Inc.
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rVivian S. Zhang
 
R by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlinesR by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlinesJeffrey Breen
 
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey GusevImage Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey GusevDatabricks
 
Applications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and ClassificationApplications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and Classificationshakimov
 
A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupDan Sullivan, Ph.D.
 
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...Shuyo Nakatani
 
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Sean Golliher
 
A Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitmentA Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitmentKemal Can Kara
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Jimmy Lai
 
Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationPierre de Lacaze
 
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...shakimov
 
Slides
SlidesSlides
Slidesbutest
 
Navigating and Exploring RDF Data using Formal Concept Analysis
Navigating and Exploring RDF Data using Formal Concept AnalysisNavigating and Exploring RDF Data using Formal Concept Analysis
Navigating and Exploring RDF Data using Formal Concept AnalysisMehwish Alam
 
Big Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureBig Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureDr. Christian Betz
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionDeeksha thakur
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesBryan Gummibearehausen
 
Programming with Millions of Examples (HRL)
Programming with Millions of Examples (HRL)Programming with Millions of Examples (HRL)
Programming with Millions of Examples (HRL)Eran Yahav
 

What's hot (20)

Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - I
 
Text Mining Using R
Text Mining Using RText Mining Using R
Text Mining Using R
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
 
R by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlinesR by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlines
 
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey GusevImage Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
 
Applications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and ClassificationApplications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and Classification
 
Data mining techniques
Data mining techniquesData mining techniques
Data mining techniques
 
A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetup
 
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
 
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
 
A Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitmentA Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitment
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
 
Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and Representation
 
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
Learning Multilingual Semantic Parsers for Question Answering over Linked Dat...
 
Slides
SlidesSlides
Slides
 
Navigating and Exploring RDF Data using Formal Concept Analysis
Navigating and Exploring RDF Data using Formal Concept AnalysisNavigating and Exploring RDF Data using Formal Concept Analysis
Navigating and Exploring RDF Data using Formal Concept Analysis
 
Big Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureBig Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and Clojure
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & Extraction
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News Stories
 
Programming with Millions of Examples (HRL)
Programming with Millions of Examples (HRL)Programming with Millions of Examples (HRL)
Programming with Millions of Examples (HRL)
 

Viewers also liked

Integrating R & Hadoop - Text Mining & Sentiment Analysis
Integrating R & Hadoop - Text Mining & Sentiment AnalysisIntegrating R & Hadoop - Text Mining & Sentiment Analysis
Integrating R & Hadoop - Text Mining & Sentiment AnalysisAravind Babu
 
Social media analysis in R using twitter API
Social media analysis in R using twitter API Social media analysis in R using twitter API
Social media analysis in R using twitter API Mohd Shadab Alam
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with PythonBenjamin Bengfort
 
Twitter analysis by Kaify Rais
Twitter analysis by Kaify RaisTwitter analysis by Kaify Rais
Twitter analysis by Kaify RaisAjay Ohri
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in TwitterAyushi Dalmia
 
Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?Marina Santini
 
NLP若手の会シンポジウム行ってきた & Chainer使ってみた
NLP若手の会シンポジウム行ってきた & Chainer使ってみたNLP若手の会シンポジウム行ってきた & Chainer使ってみた
NLP若手の会シンポジウム行ってきた & Chainer使ってみたYoshiyuki Kakihara
 
NLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey StellaNLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey StellaSpark Summit
 
Processamento Automático da Língua Portuguesa - Campus Party Br 6
Processamento Automático da Língua Portuguesa - Campus Party Br 6Processamento Automático da Língua Portuguesa - Campus Party Br 6
Processamento Automático da Língua Portuguesa - Campus Party Br 6William Colen
 
Startupfest 2015: HARPER REED (Modest, Inc.) - Lightning Keynote
Startupfest 2015: HARPER REED (Modest, Inc.) - Lightning KeynoteStartupfest 2015: HARPER REED (Modest, Inc.) - Lightning Keynote
Startupfest 2015: HARPER REED (Modest, Inc.) - Lightning KeynoteStartupfest
 
Count-Min Tree Sketch : Approximate counting for NLP tasks
Count-Min Tree Sketch : Approximate counting for NLP tasksCount-Min Tree Sketch : Approximate counting for NLP tasks
Count-Min Tree Sketch : Approximate counting for NLP tasksGuillaume Pitel
 
Natural language procesing in R
Natural language procesing in RNatural language procesing in R
Natural language procesing in ROlabanji Shonibare
 
Practical Predictive Analytics Models and Methods
Practical Predictive Analytics Models and MethodsPractical Predictive Analytics Models and Methods
Practical Predictive Analytics Models and MethodsZhipeng Liang
 
Webinar: Maximize Keyword Profits & Conversions with Data Science
Webinar: Maximize Keyword Profits & Conversions with Data ScienceWebinar: Maximize Keyword Profits & Conversions with Data Science
Webinar: Maximize Keyword Profits & Conversions with Data ScienceQuanticMind
 
An ad words ad performance analysis by r
An ad words ad performance analysis by rAn ad words ad performance analysis by r
An ad words ad performance analysis by rSimonChen888
 
Building Emoji Autocomplete
Building Emoji AutocompleteBuilding Emoji Autocomplete
Building Emoji AutocompleteDasmer Singh
 

Viewers also liked (20)

TextMining with R
TextMining with RTextMining with R
TextMining with R
 
Integrating R & Hadoop - Text Mining & Sentiment Analysis
Integrating R & Hadoop - Text Mining & Sentiment AnalysisIntegrating R & Hadoop - Text Mining & Sentiment Analysis
Integrating R & Hadoop - Text Mining & Sentiment Analysis
 
Social media analysis in R using twitter API
Social media analysis in R using twitter API Social media analysis in R using twitter API
Social media analysis in R using twitter API
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
NLTK in 20 minutes
NLTK in 20 minutesNLTK in 20 minutes
NLTK in 20 minutes
 
Twitter analysis by Kaify Rais
Twitter analysis by Kaify RaisTwitter analysis by Kaify Rais
Twitter analysis by Kaify Rais
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in Twitter
 
Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?
 
Tulane March 2017 Talk
Tulane March 2017 TalkTulane March 2017 Talk
Tulane March 2017 Talk
 
NLP若手の会シンポジウム行ってきた & Chainer使ってみた
NLP若手の会シンポジウム行ってきた & Chainer使ってみたNLP若手の会シンポジウム行ってきた & Chainer使ってみた
NLP若手の会シンポジウム行ってきた & Chainer使ってみた
 
NLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey StellaNLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey Stella
 
Processamento Automático da Língua Portuguesa - Campus Party Br 6
Processamento Automático da Língua Portuguesa - Campus Party Br 6Processamento Automático da Língua Portuguesa - Campus Party Br 6
Processamento Automático da Língua Portuguesa - Campus Party Br 6
 
Startupfest 2015: HARPER REED (Modest, Inc.) - Lightning Keynote
Startupfest 2015: HARPER REED (Modest, Inc.) - Lightning KeynoteStartupfest 2015: HARPER REED (Modest, Inc.) - Lightning Keynote
Startupfest 2015: HARPER REED (Modest, Inc.) - Lightning Keynote
 
Count-Min Tree Sketch : Approximate counting for NLP tasks
Count-Min Tree Sketch : Approximate counting for NLP tasksCount-Min Tree Sketch : Approximate counting for NLP tasks
Count-Min Tree Sketch : Approximate counting for NLP tasks
 
Natural language procesing in R
Natural language procesing in RNatural language procesing in R
Natural language procesing in R
 
Practical Predictive Analytics Models and Methods
Practical Predictive Analytics Models and MethodsPractical Predictive Analytics Models and Methods
Practical Predictive Analytics Models and Methods
 
Webinar: Maximize Keyword Profits & Conversions with Data Science
Webinar: Maximize Keyword Profits & Conversions with Data ScienceWebinar: Maximize Keyword Profits & Conversions with Data Science
Webinar: Maximize Keyword Profits & Conversions with Data Science
 
NLP from scratch
NLP from scratch NLP from scratch
NLP from scratch
 
An ad words ad performance analysis by r
An ad words ad performance analysis by rAn ad words ad performance analysis by r
An ad words ad performance analysis by r
 
Building Emoji Autocomplete
Building Emoji AutocompleteBuilding Emoji Autocomplete
Building Emoji Autocomplete
 

Similar to Natural Language Processing in R (rNLP)

Framester and WFD
Framester and WFD Framester and WFD
Framester and WFD Aldo Gangemi
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingNesreen K. Ahmed
 
Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL ServerStéphane Fréchette
 
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al MesAyudando a los Viajeros usando 500 millones de Reseñas Hoteleras al Mes
Ayudando a los Viajeros usando 500 millones de Reseñas Hoteleras al MesBig Data Colombia
 
Language Technology Enhanced Learning
Language Technology Enhanced LearningLanguage Technology Enhanced Learning
Language Technology Enhanced Learningtelss09
 
Technologies For Appraising and Managing Electronic Records
Technologies For Appraising and Managing Electronic RecordsTechnologies For Appraising and Managing Electronic Records
Technologies For Appraising and Managing Electronic Recordspbajcsy
 
Seminar on Parallel and Concurrent Programming
Seminar on Parallel and Concurrent ProgrammingSeminar on Parallel and Concurrent Programming
Seminar on Parallel and Concurrent ProgrammingStefan Marr
 
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Konstantin V. Shvachko
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...MongoDB
 
Multi-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsMulti-Model Data Query Languages and Processing Paradigms
Multi-Model Data Query Languages and Processing ParadigmsJiaheng Lu
 
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Bradley Allen
 
CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners Jen Stirrup
 
KDD17Tutorial_final (1).pdf
KDD17Tutorial_final (1).pdfKDD17Tutorial_final (1).pdf
KDD17Tutorial_final (1).pdfssuserf2f0fe
 
Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2yannabraham
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Alexandros Karatzoglou
 

Similar to Natural Language Processing in R (rNLP) (20)

Framester and WFD
Framester and WFD Framester and WFD
Framester and WFD
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
 
Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL Server
 
Natural Language Processing in R (rNLP)

  • 1. Natural Language Processing in R (rNLP) Fridolin Wild, The Open University, UK Tutorial to the Doctoral School at the Institute of Business Informatics of the Goethe University Frankfurt
  • 2. Structure of this tutorial • An introduction to R and cRunch • Language basics in R • Basic I/O in R • Social Network Analysis • Latent Semantic Analysis • Twitter • Sentiment • (Advanced I/O in R: MySQL, SPARQL)
  • 4. cRunch • is an infrastructure • for computationally-intense learning analytics • supporting researchers • in investigating big data • generated in the co-construction of knowledge … and beyond …
  • 6. Architecture (Thiele & Lehner, 2011) Living Reports data shop cron jobs R webservices
  • 8. Living reports • reports with embedded scripts and data • knitr and Sweave • render to HTML, PDF, … • visualisations: – ggplot2, trellis, graphix – jpg, png, eps, pdf
    png(file="n.png"); plot(network(m)); dev.off()
    • Fill-in-the-blanks:
    Drop-out rate went down to
    <<echo=FALSE>>=
    doquote["OU","2011"]
    @

    \documentclass[a4paper]{article}
    \title{Sweave Example 1}
    \author{Friedrich Leisch}
    \begin{document}
    \maketitle
    In this example we embed parts of the examples from the
    \texttt{kruskal.test} help page into a \LaTeX{} document:
    <<>>=
    data(airquality)
    library(ctest)
    kruskal.test(Ozone ~ Month, data = airquality)
    @
    which shows that the location parameter of the Ozone
    distribution varies significantly from month to month. Finally we
    include a boxplot of the data:
    \begin{center}
    <<fig=TRUE,echo=FALSE>>=
    boxplot(Ozone ~ Month, data = airquality)
    @
    \end{center}
    \end{document}
  • 10. Example html5 report
    Example Report
    =============
    This is an example of embedded scripts and data.
    ```{r}
    a = "hello world"
    print(a)
    ```
    And here is an example of how to embed a chart.
    ```{r fig.width=7, fig.height=6}
    plot( 5:20 )
    ```
  • 11. Shiny Widgets (1) • Widgets: use-case sized encapsulations of mini apps • HTML5 • Two files: ui.R, server.R • Still missing: manifest files (info.plist, config.xml)
  • 12. Shiny Widgets (2) From http://www.rstudio.com/shiny/
  • 14. Example R web service print("hello world")
  • 15. More complex R web service
    # tell the client that a PNG image follows
    setContentType("image/png")
    a = c(1,3,5,12,13,15)
    # render the plot into a temporary PNG file
    image_file = tempfile()
    png(file=image_file)
    plot(a, main = "The magic image", ylab = "", xlab = "", col = c("darkred", "darkblue", "darkgreen"))
    dev.off()
    # send the file's raw bytes to the client, then clean up
    sendBin(readBin(image_file, 'raw', n=file.info(image_file)$size))
    unlink(image_file)
  • 16. R web services • Uses the apache mod_R.so • See http://Rapache.net • Common server functions: – GET and POST variables – setContentType – sendBin – …
  • 17. A word on memory mgmt. • Advanced memory management (see p.70 of Dietl diploma thesis): – Use package bigmemory (for shared memory across threads) – Use package Rserve (for shared read-only access across threads) – Swap out memory objects with save() and load() – The latter is typically sufficient (hard disks are fast!) • data management abstraction layer for mod_R.so: configure handler in httpd.conf: specify directory match and load specific data management routines at start up: REvalOnStartup "source('/dbal.R');"
  • 19. Job scheduling • crontab entries for R webservices • e.g. harvest feeds • e.g. store in local DB
  • 21. Data shop and the community • You have a 'public/' folder :) – 'public/data': save() any .rda file and it will be indexed within the hour – 'public/services': use this to execute your scripts; indexed within the hour – 'public/gallery': use this to store your public visualisations – code sharing: any .R script in your 'public/' folder is source readable by the web
  • 26. Social Network Analysis Fridolin Wild, The Open University, UK
  • 30. The basic concept • Precursors date back to the 1920s, the math to Euler's 'Seven Bridges of Koenigsberg' • Social Networks are: • Actors (people, groups, media, tags, …) • Ties (interactions, relationships, …) • Actors and ties form a graph • The graph has measurable structural properties • Betweenness, • Degree of Centrality, • Density, • Cohesion • Structural Patterns
  • 31. Forum Messages
          message_id  forum_id  parent_id  author
    130      2853483   2853445          N    2043
    131      1440740    785876          N    1669
    132      2515257   2515256          N    5814
    133      4704949   4699874          N    5810
    134      2597170   2558273          N    2054
    135      2316951   2230821          N    5095
    136      3407573   3407568          N      36
    137      2277393   2277387          N     359
    138      3394136   3382201          N    1050
    139      4603931   4167338          N     453
    140      6234819   6189254    6231352    5400
    141       806699    785877     804668    2177
    142      4430290   3371246    3380313      48
    143      3395686   3391024    3391129      35
    144      6270213   6024351    6265378    5780
    145      2496015   2491522    2491536    2774
    146      4707562   4699873    4707502    5810
    147      2574199   2440094    2443801    5801
    148      4501993   4424215    4491650    5232

          message_id  forum_id  parent_id  author
    60        734569     31117          N    2491
    221       762702     31117          1
    317       762717     31117     762702    1927
    1528      819660     31117     793408    1197
    1950      840406     31117     839998    1348
    1047      841810     31117     767386    1879
    2239      862709     31117          N    1982
    2420      869839     31117     862709    2038
    2694      884824     31117          N    5439
    2503      896399     31117     862709    1982
    2846      901691     31117     895022     992
    3321      951376     31117          N    5174
    3384      952895     31117     951376    1597
    1186      955595     31117     767386    5724
    3604      958065     31117          N     716
    2551      960734     31117     862709    1939
    4072      975816     31117          N     584
    2574      986038     31117     862709    2043
    2590      987842     31117     862709    1982
  • 32. Incidence Matrix • msg_id = incident, authors appear in incidents
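As a rough illustration of the incidence idea, the base R sketch below builds an author-by-incident matrix and projects it onto an author-by-author network. The `posts` data frame and its values are hypothetical toy data, not the forum data of the tutorial, and counting shared incidents as tie strength is an assumption of this sketch.

```r
# Hypothetical toy data: which author appears in which incident (message thread).
posts <- data.frame(
  incident = c("m1", "m1", "m2", "m2", "m2", "m3"),
  author   = c("a",  "b",  "b",  "c",  "a",  "c"),
  stringsAsFactors = FALSE
)

# Incidence matrix: rows = authors, columns = incidents, 1 = author appears.
inc <- (unclass(table(posts$author, posts$incident)) > 0) * 1

# Author-by-author adjacency: number of incidents two authors share.
adj <- inc %*% t(inc)
diag(adj) <- 0   # drop self-ties

print(inc)
print(adj)
```

Multiplying the incidence matrix by its transpose is one common way to turn a two-mode (actor by event) structure into a one-mode actor network.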
  • 37. Network Density • Total edges = 29 • Possible edges = 18 * (18-1)/2 = 153 • Density = 0.19
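The density figure can be re-derived in a few lines of base R. The helper name `density_undirected` is made up for this sketch; it assumes an undirected network without self-ties, as the formula on the slide does.

```r
# Density of an undirected network without self-ties:
# observed edges divided by the number of possible actor pairs.
density_undirected <- function(n_edges, n_actors) {
  possible <- n_actors * (n_actors - 1) / 2
  n_edges / possible
}

possible_edges <- 18 * (18 - 1) / 2          # number of possible edges
dens <- density_undirected(29, 18)
cat("possible edges:", possible_edges, " density:", round(dens, 2), "\n")
```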
  • 40. Tutorials • Starter: sna-simple.Rmd • Real: sna-blog.Rmd • Advanced: sna-forum.Rmd
  • 41. Latent Semantic Analysis Fridolin Wild, The Open University, UK
  • 42. Latent Semantic Analysis • "Humans learn word meanings and how to combine them into passage meaning through experience with ~paragraph unitized verbal environments." • "They don't remember all the separate words of a passage; they remember its overall gist or meaning." • "LSA learns by 'reading' ~paragraph unitized texts that represent the environment." • "It doesn't remember all the separate words of a text; it remembers its overall gist or meaning." (Landauer, 2007)
  • 43. Word choice is over-rated • Educated adult understands ~100,000 word forms • An average sentence contains 20 tokens. • Thus $100{,}000^{20}$ possible combinations of words in a sentence • maximum of $\log_2 100{,}000^{20} = 332$ bits in word choice alone. • $20! = 2.4 \times 10^{18}$ possible orders of 20 words = maximum of 61 bits from order of the words. • 332/(61 + 332) = 84% word choice (Landauer, 2007)
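The arithmetic behind this estimate can be checked directly in R. The variable names are chosen for this sketch only; the inputs (100,000 word forms, 20 tokens per sentence) are the ones stated on the slide.

```r
vocab  <- 100000   # word forms an educated adult understands
tokens <- 20       # tokens in an average sentence

# log2(vocab^tokens), computed as tokens * log2(vocab) to avoid huge numbers
bits_choice <- tokens * log2(vocab)

# log2(20!): information carried by the order of 20 words
bits_order <- log2(factorial(tokens))

share_choice <- bits_choice / (bits_choice + bits_order)

cat(round(bits_choice), "bits from word choice,",
    round(bits_order), "bits from word order,",
    round(100 * share_choice), "% word choice\n")
```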
  • 44. LSA (2) • Assumption: texts have a semantic structure • However, this structure is obscured by word usage (noise, synonymy, polysemy, …) • Proposed LSA Solution: – map doc-term matrix – using conceptual indices – derived statistically (truncated SVD) – and make similarity comparisons using angles
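A minimal base R sketch of this idea follows, using `svd()` on a small, made-up term-document matrix rather than the lsa package. The matrix values and the choice of k = 2 dimensions are assumptions for illustration only.

```r
# Hypothetical toy term-document matrix (rows = terms, columns = documents).
M <- matrix(c(1, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 1, 0,
              0, 0, 1, 1,
              0, 0, 0, 1),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("t", 1:5), paste0("d", 1:4)))

k   <- 2                                   # number of dimensions to keep
dec <- svd(M)                              # M = T S D'
Tk  <- dec$u[, 1:k, drop = FALSE]          # term vectors
Sk  <- diag(dec$d[1:k], nrow = k)          # singular values
Dk  <- dec$v[, 1:k, drop = FALSE]          # document vectors

# Rank-k reconstruction of the original matrix
Mk <- Tk %*% Sk %*% t(Dk)
dimnames(Mk) <- dimnames(M)

# Similarity as the cosine of the angle between two vectors
cosine_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

sim_d1_d2 <- cosine_sim(Mk[, "d1"], Mk[, "d2"])
print(round(Mk, 2))
print(sim_d1_d2)
```

Documents d1 and d2 share only one term in `M`; comparing them in the reduced matrix `Mk` shows how the truncation changes their similarity.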
  • 45. Input (e.g., documents) { M } = Deerwester, Dumais, Furnas, Landauer, and Harshman (1990): Indexing by Latent Semantic Analysis, In: Journal of the American Society for Information Science, 41(6):391-407 Only the red terms appear in more than one document, so strip the rest. term = feature vocabulary = ordered set of features TEXTMATRIX
  • 48. Reconstructed, Reduced Matrix m4: Graph minors: A survey
  • 49. Similarity in a Latent-Semantic Space (Figure: a query vector and two target vectors, Target 1 and Target 2, in a two-dimensional space with axes X dimension and Y dimension; Angle 1 and Angle 2 mark the angles between them.)
  • 50. doc2doc - similarities • Unreduced = pure vector space model – based on $M = T S D'$ – Pearson correlation over document vectors • Reduced – based on $M_2 = T S_2 D'$ – Pearson correlation over document vectors
  • 51. Ex Post Updating: Folding-In • SVD factor stability – SVD calculates factors over a given text base – Different texts – different factors – Challenge: avoid unwanted factor changes (e.g., bad essays) – Solution: folding-in of essays instead of recalculating • SVD is computationally expensive
  • 52. Folding-In in Detail (Berry et al., 1995)
    (1) convert original vector $v$ to "Dk"-format: $d_i = v^T T_k S_k^{-1}$
    (2) convert "Dk"-format vector to "Mk"-format: $m_i = T_k S_k d_i^T$
    (Figure: $v^T$, $T_k$, $S_k$, $D_k$, $M_k$)
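The two folding-in steps (project a new document vector into the k-dimensional document space, then map it back into term space) can be sketched in base R as follows. The toy matrix, k = 2, and the helper name `fold_in_vector` are assumptions of this sketch; the lsa package provides its own `fold_in()`.

```r
# Hypothetical toy term-document matrix (rows = terms, columns = documents).
M <- matrix(c(1, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 1, 0,
              0, 0, 1, 1,
              0, 0, 0, 1),
            nrow = 5, byrow = TRUE)

k   <- 2
dec <- svd(M)
Tk  <- dec$u[, 1:k, drop = FALSE]
Sk  <- diag(dec$d[1:k], nrow = k)
Dk  <- dec$v[, 1:k, drop = FALSE]
Mk  <- Tk %*% Sk %*% t(Dk)

# Fold a term-frequency vector v into an existing space without recomputing the SVD.
fold_in_vector <- function(v, Tk, Sk) {
  d <- t(v) %*% Tk %*% solve(Sk)   # step 1: "Dk"-format, a 1 x k row vector
  m <- Tk %*% Sk %*% t(d)          # step 2: "Mk"-format, one value per term
  list(d = d, m = as.vector(m))
}

# A new document that uses terms 1 and 3
new_doc <- fold_in_vector(c(1, 0, 1, 0, 0), Tk, Sk)
print(round(new_doc$m, 2))

# Sanity check: folding in a document that is already part of the space
# gives back its column of the reduced matrix.
same <- fold_in_vector(M[, 1], Tk, Sk)
print(all.equal(same$m, as.vector(Mk[, 1])))
```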
  • 53. LSA Process & Driving Parameters 4 x 12 x 7 x 2 x 3 = 2016 Combinations
  • 54. Pre-Processing • Stemming – Porter Stemmer (snowball.tartarus.org) – 'move', 'moving', 'moves' => 'move' – in German even more important (more inflections) • Stop Word Elimination – 373 Stop Words in German • Stemming plus Stop Word Elimination • Unprocessed ('raw') Terms
  • 55. Term Weighting Schemes
    $weight_{ij} = lw(tf_{ij}) \cdot gw(tf_{ij})$
    • Local Weights (LW)
    – None ('raw' tf)
    – Binary Term Frequency
    – Logarithmized Term Frequency (log)
    • Global Weights (GW)
    – None ('raw' tf)
    – Normalisation: $norm_i = 1 / \sqrt{\sum_j tf_{ij}^2}$
    – Inverse Document Frequency (IDF): $idf_i = \log_2\left(\frac{numdocs}{docfreq(i)}\right) + 1$
    – 1 + Entropy: $entplusone_i = 1 + \frac{\sum_j p_{ij} \log p_{ij}}{\log(numdocs)}$, where $p_{ij} = \frac{tf_{ij}}{\sum_j tf_{ij}}$
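To make the weighting schemes concrete, here is a base R sketch of a logarithmised local weight combined with the IDF and 1 + entropy global weights. The toy matrix and the function names (`lw_log`, `gw_idf_sketch`, `gw_entropy_sketch`) are made up for this example; the lsa package ships its own weighting functions, such as `lw_logtf()` and `gw_idf()`.

```r
# Hypothetical toy term-frequency matrix (rows = terms, columns = documents).
tf <- matrix(c(2, 0, 1,
               1, 1, 1,
               0, 3, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("t1", "t2", "t3"), c("d1", "d2", "d3")))

# Local weight: logarithmised term frequency
lw_log <- function(tf) log(tf + 1)

# Global weight: inverse document frequency, log2(numdocs / docfreq) + 1
gw_idf_sketch <- function(tf) log2(ncol(tf) / rowSums(tf > 0)) + 1

# Global weight: 1 + entropy, with 0 * log(0) treated as 0
gw_entropy_sketch <- function(tf) {
  p     <- tf / rowSums(tf)
  plogp <- ifelse(p > 0, p * log(p), 0)
  1 + rowSums(plogp) / log(ncol(tf))
}

# weight_ij = lw(tf_ij) * gw(tf_ij); the global weight vector has one entry
# per term (row), so R recycles it down the rows of the matrix.
weighted <- lw_log(tf) * gw_idf_sketch(tf)

print(round(gw_idf_sketch(tf), 3))
print(round(gw_entropy_sketch(tf), 3))
print(round(weighted, 3))
```

A term spread evenly over all documents (t2) gets the lowest global weights, while a term concentrated in one document (t3) gets the highest.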
  • 56. SVD-Dimensionality • Many different proposals (see package) • 80% variance is a good estimator
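One way to read the 80% rule is to keep the smallest number of dimensions whose squared singular values add up to at least 80% of the total. The sketch below implements that reading in base R; the helper name `dims_by_share` and the use of squared singular values as "variance" are assumptions, and the package's own helpers (for example `dimcalc_share()`) may define the share differently.

```r
# Smallest k such that the first k squared singular values
# account for at least `share` of their total.
dims_by_share <- function(s, share = 0.8) {
  cum <- cumsum(s^2) / sum(s^2)
  which(cum >= share)[1]
}

# Hypothetical singular values, sorted in decreasing order as svd() returns them
s <- c(4, 2, 1, 1)
k80 <- dims_by_share(s)          # share = 0.8
k70 <- dims_by_share(s, 0.7)
cat("dimensions for 80%:", k80, " for 70%:", k70, "\n")
```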
  • 57. Proximity Measures • Pearson Correlation • Cosine Correlation • Spearman's Rho pics: http://davidmlane.com/hyperstat/A62891.html
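The three measures can be compared on a pair of vectors with base R alone. The vectors `x` and `y` are made-up values; `cosine_sim` is a helper defined here for the sketch.

```r
# Two hypothetical vectors, e.g. two document columns of a text matrix
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)

cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")   # Pearson correlation of the ranks
cosine   <- cosine_sim(x, y)

# Pearson correlation equals the cosine of the mean-centred vectors
pearson_via_cosine <- cosine_sim(x - mean(x), y - mean(y))

print(c(pearson = pearson, spearman = spearman, cosine = cosine))
```

Because the cosine is not mean-centred, it is usually higher than the Pearson correlation for vectors with only non-negative entries, such as term frequencies.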
  • 58. Pair-wise dis/similarity Convergence expected: ‘eu’, ‘österreich’ Divergence expected: ‘jahr’, ‘wien’
  • 59. The Package • Available via CRAN, e.g.: http://cran.r-project.org/web/packages/lsa/index.html • Higher-level Abstraction to Ease Use – Core methods: textmatrix() / query() lsa() fold_in() as.textmatrix() – Support methods for term weighting, dimensionality calculation, correlation measurement, …
  • 60. Core Workflow • tm = textmatrix("dir/") • tm = lw_logtf(tm) * gw_idf(tm) • space = lsa(tm, dims=dimcalc_share()) • tm3 = fold_in(tm, space) • as.textmatrix(space)
  • 62. Tutorials • Starter: lsa-indexing.Rmd • Real: lsa-essayscoring.Rmd • Advanced: lsa-sparse.Rmd
  • 63. Additional tutorials Fridolin Wild, The Open University, UK
  • 64. Tutorials • Advanced I/O: twitter.Rmd • Advanced I/O: sparql.Rmd • Advanced NLP: twitter-sentiment.Rmd • Evaluation: interrater-agreement.Rmd