PyData Taipei 2020
1. Is a Twitter user’s location correlated with their opinion on #COVID-19?
Rong-Ching Chang
Chun-Ming Lai, Assist Prof
Information Security Lab X Social Computing And Information Security Lab (SCIS)
Tunghai University
Twitter: @AnnCC12
2. Agenda
• General introduction: how to get data from Twitter
• Data and data pre-processing
• Methodology
• What English-language users in the top 5 countries tweeted about from February to May
• How is it different from getting data from Facebook?
3. How do you get data from Twitter?
• Twitter API
• Hydration
• Web scraping tools
5. Open data & Covid & Hydration
Dataset citation: Umair Qazi, Muhammad Imran, Ferda Ofli. GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information. ACM SIGSPATIAL Special, May 2020. doi: https://doi.org/10.1145/3404820.3404823
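Hydration means turning a list of shared tweet IDs back into full tweet objects by looking them up in batches. The sketch below shows only the batching logic. The batch size of 100 and the `fetch_batch` callable are assumptions for illustration; in practice `fetch_batch` would wrap whatever lookup client you use, and `fake_fetch` here is a stand-in so the example runs offline.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def batched(ids: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` tweet IDs."""
    batch: List[str] = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # last, possibly shorter, batch
        yield batch


def hydrate(ids: Iterable[str],
            fetch_batch: Callable[[List[str]], List[Dict]],
            size: int = 100) -> List[Dict]:
    """Look up every ID batch by batch and collect the returned tweets."""
    tweets: List[Dict] = []
    for batch in batched(ids, size):
        tweets.extend(fetch_batch(batch))
    return tweets


def fake_fetch(batch: List[str]) -> List[Dict]:
    """Hypothetical stand-in for a real lookup call (no network access)."""
    return [{"id": tweet_id, "text": "tweet " + tweet_id} for tweet_id in batch]


tweet_ids = [str(n) for n in range(250)]
hydrated = hydrate(tweet_ids, fake_fetch)
```

A real lookup may return fewer tweets than IDs, since deleted or protected tweets cannot be hydrated.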
6. Basic Data Pre-processing
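• Lowercase: A -> a
• Remove e-mail addresses (@ mail), # hashtags, https:// links, and @mentions
• Correct misspellings: Dooing -> doing
• Lemmatize: Doing -> do
• Remove special characters (:$%#@), newlines (\n), and nan values
• Remove stopwords
• Filter by language
• Filter by geography
• Remove the top 20 and tail 20 words
• Tokenization
• Word frequency
• Word cloud, trigrams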
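Most of these cleaning steps can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used in the talk: the regular expressions and the tiny stopword set are assumptions, "tail 20" is read as the 20 least frequent words, and spelling correction and lemmatization are left out because they need an external NLP library.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stopword list (assumption); a real run would use a full one.
STOPWORDS = {"the", "a", "is", "in", "to", "of", "and"}


def clean_tweet(text: str) -> List[str]:
    """Lowercase, strip links/e-mails/mentions/hashtags/symbols, then tokenize."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # links
    text = re.sub(r"\S+@\S+", " ", text)        # e-mail addresses
    text = re.sub(r"[@#]\w+", " ", text)        # @mentions and #hashtags
    text = re.sub(r"[^a-z\s]", " ", text)       # special characters and digits
    tokens = text.split()                        # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS and t != "nan"]


def drop_extremes(docs: List[List[str]], n: int = 20) -> List[List[str]]:
    """Remove the n most frequent and n least frequent words from every document."""
    if n <= 0:
        return docs
    counts = Counter(t for doc in docs for t in doc)
    ranked = [word for word, _ in counts.most_common()]
    drop = set(ranked[:n]) | set(ranked[-n:])
    return [[t for t in doc if t not in drop] for doc in docs]


tokens = clean_tweet("Stay SAFE! @who says https://t.co/abc #COVID19 mail me@x.org nan")
trimmed = drop_extremes([["a", "a", "a", "b", "b", "c"]], n=1)
```

Links are removed before e-mail addresses and mentions so that an "@" inside a URL is not treated as a mention.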
8. Word Cloud
A word cloud is a kind of weighted list used to visualize language or text data.
Types: word frequency, categorization, mixed
• Categorization: the font size indicates the number of subcategories of a collection.
• Word frequency: the font size represents the number of times a keyword appears in the collection.
Cite: Yuping Jin, “Development of Word Cloud Generator Software Based on Python”, GCMM 2016
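For the word-frequency type, the mapping from counts to font size can be sketched as below. The linear scaling and the 10 to 60 point range are assumptions for illustration; word cloud tools generally apply their own scaling.

```python
from collections import Counter
from typing import Dict, List


def font_sizes(tokens: List[str], min_size: float = 10, max_size: float = 60) -> Dict[str, float]:
    """Map each word to a font size that grows linearly with its frequency."""
    counts = Counter(tokens)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    return {
        word: min_size + (count - lo) * (max_size - min_size) / span
        for word, count in counts.items()
    }


sizes = font_sizes(["covid"] * 5 + ["mask"] * 3 + ["home"])
```

The most frequent word gets `max_size`, the least frequent gets `min_size`, and everything else falls in between.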
14. Top 5 Countries (ISO 3166-1 alpha-2)
• 'us': United States of America
• 'in': India
• 'gb': United Kingdom
• 'cn': China
• 'au': Australia
• Dates: 2/1, 3/16, 3/25, 4/25
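Selecting tweets for these five countries on these four dates can be sketched as a simple grouping step. The record fields `country_code`, `day`, and `text` are hypothetical names, and the year 2020 is assumed from the February to May study period.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

COUNTRIES = ("us", "in", "gb", "cn", "au")
# Year 2020 is an assumption; the slide only lists month/day.
DATES = (date(2020, 2, 1), date(2020, 3, 16), date(2020, 3, 25), date(2020, 4, 25))


def select(tweets: List[dict]) -> Dict[Tuple[str, date], List[str]]:
    """Group tweet texts by (country code, day), keeping only the chosen slices."""
    groups: Dict[Tuple[str, date], List[str]] = defaultdict(list)
    for tweet in tweets:
        key = (tweet["country_code"], tweet["day"])
        if key[0] in COUNTRIES and key[1] in DATES:
            groups[key].append(tweet["text"])
    return dict(groups)


sample = [
    {"country_code": "us", "day": date(2020, 2, 1), "text": "first"},
    {"country_code": "us", "day": date(2020, 2, 2), "text": "wrong day"},
    {"country_code": "fr", "day": date(2020, 2, 1), "text": "wrong country"},
    {"country_code": "gb", "day": date(2020, 4, 25), "text": "second"},
]
selected = select(sample)
```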
31. Thank you
Rong-Ching Chang
Chun-Ming Lai, Assist Prof
Information Security Lab X Social Computing And Information Security Lab (SCIS)
Tunghai University
Twitter: @AnnCC12
Editor's notes
Explain why these four points were chosen
Add the confirmed cases
Misspelling correction
Lemmatization
Optional common-word removal; you can also add extra stopwords or other words to remove.
In the categorization type, the font size indicates the number of subcategories of a collection.
To capture stylometric features, they based their approach on trigrams, arguing that trigrams capture stylometric features well and are more extensible to unknown text when using a small training set, compared to a bag-of-words approach.
Daniel Ricardo Jaimes Moreno et al., “Prediction of Personality Traits in Twitter Users with Latent Features”, IEEE, 2019
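A trigram is a run of three consecutive items. The sketch below extracts word-level trigrams from tokenized tweets and counts them; treating trigrams as word-level rather than character-level is an assumption, since the notes do not say which was used.

```python
from collections import Counter
from typing import List, Tuple

Trigram = Tuple[str, str, str]


def trigrams(tokens: List[str]) -> List[Trigram]:
    """Return every run of three consecutive tokens."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


def top_trigrams(docs: List[List[str]], k: int = 3) -> List[Tuple[Trigram, int]]:
    """Count trigrams across all documents and return the k most common."""
    counts = Counter(gram for doc in docs for gram in trigrams(doc))
    return counts.most_common(k)


grams = trigrams(["stay", "home", "stay", "safe"])
common = top_trigrams([["wash", "your", "hands"], ["please", "wash", "your", "hands"]], k=1)
```

Unlike a bag of words, trigrams keep local word order, which is the property the cited paper relies on.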
The polarity score is a float within the range [-1.0, 1.0].
Sentiment polarity can be classified as positive, negative, or neutral.
Subjective vs. objective
The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.
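Turning the two scores into labels can be sketched as follows. These ranges match what libraries such as TextBlob report, but the scores are taken here as given inputs rather than computed. The `neutral_band` and the 0.5 subjectivity cutoff are assumptions for illustration, not values from the talk.

```python
def polarity_label(polarity: float, neutral_band: float = 0.0) -> str:
    """Label a polarity score in [-1.0, 1.0] as positive, negative, or neutral."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must be within [-1.0, 1.0]")
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


def subjectivity_label(subjectivity: float, cutoff: float = 0.5) -> str:
    """Label a subjectivity score in [0.0, 1.0]; the cutoff is an assumed threshold."""
    if not 0.0 <= subjectivity <= 1.0:
        raise ValueError("subjectivity must be within [0.0, 1.0]")
    return "subjective" if subjectivity >= cutoff else "objective"


labels = [polarity_label(p) for p in (0.4, -0.2, 0.0)]
```

Widening `neutral_band` (for example to 0.1) treats weakly scored tweets as neutral instead of forcing them into positive or negative.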
'us', 'in', 'gb', 'cn', 'au'
At the end of March, you can clearly see positive sentiment.
Around April 25, COVID is clearly being tweeted about alongside political messages.
Add the number of tweets