SlideShare ist ein Scribd-Unternehmen logo
1 von 19
HANDS ON:
TEXT MINING WITH R
Jahnab Kumar Deka
Introduction
• To learn from collections of text documents like books,
newspapers, emails, etc.
Important Terms:
• Tokenization
• Tagging (Noun/Verb/…)
• Chunking(Noun Phase)
• Stemming(-ing/-s/-ed)
Important packages in R
• library(tm) # Framework for text mining.
• library(SnowballC) # Provides wordStem() for stemming.
• library(qdap) # Quantitative discourse analysis of
transcripts.
• library(qdapDictionaries)
• library(dplyr) # Data preparation and pipes %>%.
• library(RColorBrewer) # Generate palette of colours for
plots.
• library(ggplot2) # Plot word frequencies.
• library(scales) # Include commas in numbers.
• library(Rgraphviz) # Correlation plots.
Corpus
• Collection of text
• Each corpus will have separate articles, stories, volumes,
each treated as a separate entity or record.
• Any file format can be converted to text file for corpus
Eg:
• PDF to Text File
• system("for f in *.pdf; do pdftotext -enc ASCII7 -nopgbrk $f; done")
• Word Document to Text File
• system("for f in *.doc; do antiword $f; done")
Corpus
• Consider folder corpus/txt
• List some of file names
Loading Corpus
• Loading Corpus
** Using DirSource() the source object is passed on to Corpus() which loads the documents.
• In case of PDF Documents
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF))
** xpdf application needs to be installed for readPDF()
• In case of Word Documents
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readDOC("-r -s")))
** -r requests that removed text be included in the output
** -s requests that text hidden by Word be included
Exploration of Corpus
• inspect()
• Preparing the corpus
• Transformation type
• tm map() is used to apply one of this transformation
• Other transformations can be implemented using R functions and wrapped
within content_transformer()
Transformation Example
• replace “/”, “@” and “|” with a space
• Alternate method
• Conversion to toLower Case
• Remove Numbers
• Remove Punctuation
Contd...
• Remove English Stop Words
• Remove Own Stop Words
• Strip Whitespace
• Specific Transformations
Contd...
• Stemming
• Creating a Document Term Matrix
A matrix with documents as the rows
terms as the columns
count of the frequency of words as the cells of the matrix.
• Term frequency
Contd...
• Frequency order of item
• ord <- order(freq)
• Least Frequent item
• freq[head(ord)]
• Most frequent item
• freq[tail(ord)]
• Document Term matrix to CSV
• dtm <- DocumentTermMatrix(docs)
• m <- as.matrix(dtm)
• write.csv(m, file="dtm.csv")
Contd...
• Removing Sparse Terms
• dtms <- removeSparseTerms(dtm, 0.1) //Sparse factor
• the resulting matrix contains only terms with a sparse factor of less than sparse.
• Frequent items and association
** lowfreq = terms that occur at least 1000 times
• Association with word with correlation limit
• // association of “data” with other word
• // two words always appear together => correlation would be 1.0
Correlation
• 50 of the more frequent words
• With minimum correlation of 0.5
• Word occurrences 100
• By default
• 20 random terms
• With minimum correlation of 0.7
Plotting word frequencies
• freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
• wf <- data.frame(word=names(freq), freq=freq)
• //words that occurs at least 500 times in the corpus
Word cloud
Size of Word & Frequency
• For word limitation
• wordcloud(names(freq), freq, max.words=100)
• For term frequency limitation
• wordcloud(names(freq), freq, min.freq=100)
• Adding Color
• wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))
Quantitative Analysis of Text (qdap)
• Extracting the column names (the terms) and retain those shorter
than 20 characters
• To generate frequencies and percentage
Contd...
• Word Length Counts
** vertical line = Mean length of words
Letter and Position Heatmap

Weitere ähnliche Inhalte

Was ist angesagt?

Data Analysis Software
Data Analysis SoftwareData Analysis Software
Data Analysis SoftwareResearchLeap
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesankit_ppt
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning Gopal Sakarkar
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Dev Sahu
 
Introduction to R and R Studio
Introduction to R and R StudioIntroduction to R and R Studio
Introduction to R and R StudioRupak Roy
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationSara Hooker
 
Knowledge representation In Artificial Intelligence
Knowledge representation In Artificial IntelligenceKnowledge representation In Artificial Intelligence
Knowledge representation In Artificial IntelligenceRamla Sheikh
 
Text classification & sentiment analysis
Text classification & sentiment analysisText classification & sentiment analysis
Text classification & sentiment analysisM. Atif Qureshi
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Parth Khare
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question AnsweringMarina Santini
 
Workshop presentation hands on r programming
Workshop presentation hands on r programmingWorkshop presentation hands on r programming
Workshop presentation hands on r programmingNimrita Koul
 
Exploratory Data Analysis
Exploratory Data AnalysisExploratory Data Analysis
Exploratory Data AnalysisUmair Shafique
 

Was ist angesagt? (20)

Class ppt intro to r
Class ppt intro to rClass ppt intro to r
Class ppt intro to r
 
Data Analysis Software
Data Analysis SoftwareData Analysis Software
Data Analysis Software
 
Language models
Language modelsLanguage models
Language models
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Text mining
Text miningText mining
Text mining
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniques
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier
 
Introduction to R and R Studio
Introduction to R and R StudioIntroduction to R and R Studio
Introduction to R and R Studio
 
Module 4: Model Selection and Evaluation
Module 4: Model Selection and EvaluationModule 4: Model Selection and Evaluation
Module 4: Model Selection and Evaluation
 
Knowledge representation In Artificial Intelligence
Knowledge representation In Artificial IntelligenceKnowledge representation In Artificial Intelligence
Knowledge representation In Artificial Intelligence
 
Association rules apriori algorithm
Association rules   apriori algorithmAssociation rules   apriori algorithm
Association rules apriori algorithm
 
Text classification & sentiment analysis
Text classification & sentiment analysisText classification & sentiment analysis
Text classification & sentiment analysis
 
KNN
KNNKNN
KNN
 
Step By Step Guide to Learn R
Step By Step Guide to Learn RStep By Step Guide to Learn R
Step By Step Guide to Learn R
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question Answering
 
Wordnet
WordnetWordnet
Wordnet
 
Workshop presentation hands on r programming
Workshop presentation hands on r programmingWorkshop presentation hands on r programming
Workshop presentation hands on r programming
 
Exploratory Data Analysis
Exploratory Data AnalysisExploratory Data Analysis
Exploratory Data Analysis
 

Andere mochten auch

Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataYanchang Zhao
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)fridolin.wild
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in RAshraf Uddin
 
Social media analysis in R using twitter API
Social media analysis in R using twitter API Social media analysis in R using twitter API
Social media analysis in R using twitter API Mohd Shadab Alam
 
Twitter analysis by Kaify Rais
Twitter analysis by Kaify RaisTwitter analysis by Kaify Rais
Twitter analysis by Kaify RaisAjay Ohri
 
Sentiment analysis of tweets
Sentiment analysis of tweetsSentiment analysis of tweets
Sentiment analysis of tweetsVasu Jain
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in TwitterAyushi Dalmia
 
Text analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco ControlText analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco ControlBen Healey
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSumit Raj
 
Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with RYanchang Zhao
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with RYanchang Zhao
 
Interactive Text Mining Suite: Data Visualization for Literary Studies
Interactive Text Mining Suite: Data Visualization for Literary Studies Interactive Text Mining Suite: Data Visualization for Literary Studies
Interactive Text Mining Suite: Data Visualization for Literary Studies Olga Scrivner
 
Rugby World Cup 2011 twitter analysis
Rugby World Cup 2011 twitter analysisRugby World Cup 2011 twitter analysis
Rugby World Cup 2011 twitter analysisiGo2 Pty Ltd
 
Der Nobelpreis geht an: Vitamin C
Der Nobelpreis geht an: Vitamin CDer Nobelpreis geht an: Vitamin C
Der Nobelpreis geht an: Vitamin CDr Rath
 
Text Mining for Second Screen
Text Mining for Second ScreenText Mining for Second Screen
Text Mining for Second ScreenIvan Demin
 
Count-Min Tree Sketch : Approximate counting for NLP tasks
Count-Min Tree Sketch : Approximate counting for NLP tasksCount-Min Tree Sketch : Approximate counting for NLP tasks
Count-Min Tree Sketch : Approximate counting for NLP tasksGuillaume Pitel
 

Andere mochten auch (20)

Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
 
TextMining with R
TextMining with RTextMining with R
TextMining with R
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in R
 
Social media analysis in R using twitter API
Social media analysis in R using twitter API Social media analysis in R using twitter API
Social media analysis in R using twitter API
 
Twitter analysis by Kaify Rais
Twitter analysis by Kaify RaisTwitter analysis by Kaify Rais
Twitter analysis by Kaify Rais
 
Sentiment analysis of tweets
Sentiment analysis of tweetsSentiment analysis of tweets
Sentiment analysis of tweets
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in Twitter
 
Text analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco ControlText analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco Control
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 
Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with R
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with R
 
Interactive Text Mining Suite: Data Visualization for Literary Studies
Interactive Text Mining Suite: Data Visualization for Literary Studies Interactive Text Mining Suite: Data Visualization for Literary Studies
Interactive Text Mining Suite: Data Visualization for Literary Studies
 
R Datatypes
R DatatypesR Datatypes
R Datatypes
 
Reading Data into R
Reading Data into RReading Data into R
Reading Data into R
 
Text MIning
Text MIningText MIning
Text MIning
 
Rugby World Cup 2011 twitter analysis
Rugby World Cup 2011 twitter analysisRugby World Cup 2011 twitter analysis
Rugby World Cup 2011 twitter analysis
 
Der Nobelpreis geht an: Vitamin C
Der Nobelpreis geht an: Vitamin CDer Nobelpreis geht an: Vitamin C
Der Nobelpreis geht an: Vitamin C
 
Text Mining for Second Screen
Text Mining for Second ScreenText Mining for Second Screen
Text Mining for Second Screen
 
Count-Min Tree Sketch : Approximate counting for NLP tasks
Count-Min Tree Sketch : Approximate counting for NLP tasksCount-Min Tree Sketch : Approximate counting for NLP tasks
Count-Min Tree Sketch : Approximate counting for NLP tasks
 

Ähnlich wie hands on: Text Mining With R

Introduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics ResearchersIntroduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics ResearchersVitomir Kovanovic
 
Introduction to XML
Introduction to XMLIntroduction to XML
Introduction to XMLAbhra Basak
 
OUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingOUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingFlorian Leitner
 
SMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachSMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachReza Rahimi
 
Alexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape MeetupAlexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape MeetupOleksii Holub
 
Text and Numbers (Data Types)in PHP
Text and Numbers (Data Types)in PHPText and Numbers (Data Types)in PHP
Text and Numbers (Data Types)in PHPKamal Acharya
 
Set Similarity Search using a Distributed Prefix Tree Index
Set Similarity Search using a Distributed Prefix Tree IndexSet Similarity Search using a Distributed Prefix Tree Index
Set Similarity Search using a Distributed Prefix Tree IndexHPCC Systems
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examplesYoshitomo Matsubara
 
Course 102: Lecture 13: Regular Expressions
Course 102: Lecture 13: Regular Expressions Course 102: Lecture 13: Regular Expressions
Course 102: Lecture 13: Regular Expressions Ahmed El-Arabawy
 

Ähnlich wie hands on: Text Mining With R (20)

Web search engines
Web search enginesWeb search engines
Web search engines
 
Introduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics ResearchersIntroduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics Researchers
 
04 standard class library c#
04 standard class library c#04 standard class library c#
04 standard class library c#
 
Data Collection and Storage
Data Collection and StorageData Collection and Storage
Data Collection and Storage
 
Introduction to XML
Introduction to XMLIntroduction to XML
Introduction to XML
 
OUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingOUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String Processing
 
search engine
search enginesearch engine
search engine
 
SMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachSMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning Approach
 
Alexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape MeetupAlexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape Meetup
 
Text and Numbers (Data Types)in PHP
Text and Numbers (Data Types)in PHPText and Numbers (Data Types)in PHP
Text and Numbers (Data Types)in PHP
 
MIPS Architecture
MIPS ArchitectureMIPS Architecture
MIPS Architecture
 
Set Similarity Search using a Distributed Prefix Tree Index
Set Similarity Search using a Distributed Prefix Tree IndexSet Similarity Search using a Distributed Prefix Tree Index
Set Similarity Search using a Distributed Prefix Tree Index
 
AWS Data Collection & Storage
AWS Data Collection & StorageAWS Data Collection & Storage
AWS Data Collection & Storage
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examples
 
Basics R.ppt
Basics R.pptBasics R.ppt
Basics R.ppt
 
Lecture_4.pdf
Lecture_4.pdfLecture_4.pdf
Lecture_4.pdf
 
Text features
Text featuresText features
Text features
 
Basics.ppt
Basics.pptBasics.ppt
Basics.ppt
 
Course 102: Lecture 13: Regular Expressions
Course 102: Lecture 13: Regular Expressions Course 102: Lecture 13: Regular Expressions
Course 102: Lecture 13: Regular Expressions
 
محاضرة برنامج التحليل الكمي R program د.هديل القفيدي
محاضرة برنامج التحليل الكمي   R program د.هديل القفيديمحاضرة برنامج التحليل الكمي   R program د.هديل القفيدي
محاضرة برنامج التحليل الكمي R program د.هديل القفيدي
 

Kürzlich hochgeladen

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 

Kürzlich hochgeladen (20)

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 

hands on: Text Mining With R

  • 1. HANDS ON: TEXT MINING WITH R Jahnab Kumar Deka
  • 2. Introduction • To learn from collections of text documents like books, newspapers, emails, etc. Important Terms: • Tokenization • Tagging (Noun/Verb/…) • Chunking(Noun Phase) • Stemming(-ing/-s/-ed)
  • 3. Important packages in R • library(tm) # Framework for text mining. • library(SnowballC) # Provides wordStem() for stemming. • library(qdap) # Quantitative discourse analysis of transcripts. • library(qdapDictionaries) • library(dplyr) # Data preparation and pipes %>%. • library(RColorBrewer) # Generate palette of colours for plots. • library(ggplot2) # Plot word frequencies. • library(scales) # Include commas in numbers. • library(Rgraphviz) # Correlation plots.
  • 4. Corpus • Collection of text • Each corpus will have separate articles, stories, volumes, each treated as a separate entity or record. • Any file format can be converted to text file for corpus Eg: • PDF to Text File • system("for f in *.pdf; do pdftotext -enc ASCII7 -nopgbrk $f; done") • Word Document to Text File • system("for f in *.doc; do antiword $f; done")
  • 5. Corpus • Consider folder corpus/txt • List some of file names
  • 6. Loading Corpus • Loading Corpus ** Using DirSource() the source object is passed on to Corpus() which loads the documents. • In case of PDF Documents • docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF)) ** xpdf application needs to be installed for readPDF() • In case of Word Documents • docs <- Corpus(DirSource(cname), readerControl=list(reader=readDOC("-r -s"))) ** -r requests that removed text be included in the output ** -s requests that text hidden by Word be included
  • 7. Exploration of Corpus • inspect() • Preparing the corpus • Transformation type • tm map() is used to apply one of this transformation • Other transformations can be implemented using R functions and wrapped within content_transformer()
  • 8. Transformation Example • replace “/”, “@” and “|” with a space • Alternate method • Conversion to toLower Case • Remove Numbers • Remove Punctuation
  • 9. Contd... • Remove English Stop Words • Remove Own Stop Words • Strip Whitespace • Specific Transformations
  • 10. Contd... • Stemming • Creating a Document Term Matrix A matrix with documents as the rows terms as the columns count of the frequency of words as the cells of the matrix. • Term frequency
  • 11. Contd... • Frequency order of item • ord <- order(freq) • Least Frequent item • freq[head(ord)] • Most frequent item • freq[tail(ord)] • Document Term matrix to CSV • dtm <- DocumentTermMatrix(docs) • m <- as.matrix(dtm) • write.csv(m, file="dtm.csv")
  • 12. Contd... • Removing Sparse Terms • dtms <- removeSparseTerms(dtm, 0.1) //Sparse factor • the resulting matrix contains only terms with a sparse factor of less than sparse. • Frequent items and association ** lowfreq = terms that occur at least 1000 times • Association with word with correlation limit • // association of “data” with other word • // two words always appear together => correlation would be 1.0
  • 13. Correlation • 50 of the more frequent words • With minimum correlation of 0.5 • Word occurrences 100 • By default • 20 random terms • With minimum correlation of 0.7
  • 14. Plotting word frequencies • freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE) • wf <- data.frame(word=names(freq), freq=freq) • //words that occurs at least 500 times in the corpus
  • 16. Size of Word & Frequency • For word limitation • wordcloud(names(freq), freq, max.words=100) • For term frequency limitation • wordcloud(names(freq), freq, min.freq=100) • Adding Color • wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))
  • 17. Quantitative Analysis of Text (qdap) • Extracting the column names (the terms) and retain those shorter than 20 characters • To generate frequencies and percentage
  • 18. Contd... • Word Length Counts ** vertical line = Mean length of words