In order to study the effects of Online Social Network (OSN) activity on real-world offline events, researchers need access to OSN data, the reliability of which has particular implications for social network analysis. This relates not only to the completeness of any collected dataset, but also to constructing meaningful social and information networks from them. In this multidisciplinary study, we consider the question of constructing traditional social networks from OSN data and then present a measurement case study showing how the reliability of OSN data affects social network analyses. To this end we developed a systematic comparison methodology, which we applied to two parallel datasets we collected from Twitter. We found considerable differences in datasets collected with different tools and that these variations significantly alter the results of subsequent analyses. Our results lead to a set of guidelines for researchers planning to collect online data streams to infer social networks.
Presented at ASONAM'20 (2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining)
Co-authors: Mehwish Nasim (Data61 / CSIRO), Lewis Mitchell (University of Adelaide), Lucia Falzon (University of Melbourne / DST Group)
A method to evaluate the reliability of social media data for social network analysis
1. ASONAM 2020 OFFICIAL
Derek Weber1,2, Mehwish Nasim3-6, Lewis Mitchell5,6 and Lucia Falzon2,7
Contact: derek.weber@adelaide.edu.au
1 School of Computer Science, The University of Adelaide, Australia.
2 Defence Science and Technology Group, Department of Defence, Australia.
3 Data61, CSIRO, Adelaide, Australia.
4 Cyber Security Cooperative Research Centre, Adelaide, Australia.
5 ARC Centre for Excellence for Mathematical and Statistical Frontiers (ACEMS), Adelaide, Australia.
6 School of Mathematical Sciences, The University of Adelaide, Australia.
7 School of Psychological Sciences, The University of Melbourne, Australia.
A METHOD TO EVALUATE THE RELIABILITY OF SOCIAL MEDIA DATA FOR SOCIAL NETWORK ANALYSIS
2. ASONAM 2020 OFFICIAL
Moreno’s sociogram of a year 2 class
Building Social Networks
• To build a social network…
• What’s the research question?
• What’s a node?
• What’s an edge?
• Where’s the boundary?
• How will the data be collected?
• Traditional example
• Interviewing kids in a school regarding their friends
• Social media?
http://www.martingrandjean.ch/social-network-analysis-visualization-morenos-sociograms-revisited/
3. ASONAM 2020 OFFICIAL
Using Social Media Data
What’s a node? An account
What’s an edge? Friends & followers?
• Cheap, often stale, platform-specific meaning, dense networks, boundary questions, bot followers
Open research questions
• What makes a relationship?
• Does SNA make sense?
• What to use as evidence? Where’s the boundary?
Our approach: interaction networks
• Active connections
• Degree of activity
• Direction of flow
Follower network in Gephi, coloured by modularity
https://www.flickr.com/photos/psychemedia/5821218782/in/photostream/
4. ASONAM 2020 OFFICIAL
Social Media Interactions
Social Media Behaviour → Potential Social Networks
• Mention
• Retweet
• Reply
5. ASONAM 2020 OFFICIAL
Network Varieties
• Mentions (e.g., @username; cf. Instagram tags)
• Replies (cf. Facebook or Reddit comments)
• Retweets (cf. Facebook shares, Tumblr reposts)
Five minutes of #qanda (ABC’s Q&A) Twitter in 2018
Diagrams were produced in Visone, available at https://visone.info
6. ASONAM 2020 OFFICIAL
Collection Methods
Input: Keywords, Usernames, Timeframes
Method: Filter live stream, Retrieval via search, Snowball strategy
Output: Post corpus, Egonet
7. ASONAM 2020 OFFICIAL
Conducting a Collection
Boundary
• Filter terms:
• Hashtags
• Search terms
• Usernames
• Timing:
• Start time / date
• End time / date
• Geo:
• Bounding box
[Timeline diagram: posts collected within the boundary criteria over time]
8. ASONAM 2020 OFFICIAL
Potential Pitfalls
• What have we got? What are we missing?
• Repeatability?
• Null hypothesis
• Plan:
• Conduct simultaneous collections…
• with different tools…
• but same boundary criteria.
• Compare the results.
Collecting social media data from the same platform, at the same time, using the same boundary criteria, regardless of the tool used, should result in identical datasets.
10. ASONAM 2020 OFFICIAL
Experiment Collection
Datasets
• ABC’s Q&A
• 2018 episode
• Terms: “qanda”, …*
• AFL
• March & April 2019
• Term: “afl”
• Ethics
• University of Adelaide HREC
Tools
• RAPID
• University of Melbourne / DST
• Collection & data analytics
• Twitter & Reddit
• Topic tracking
• Twarc
• Thin open-source Python wrapper over the Twitter API
RAPID: K. Lim, S. Jayasekara, S. Karunasekera, A. Harwood, L. Falzon, J. Dunn, & G. Burgess, RAPID: Real-time Analytics
Platform for Interactive Data Mining, in: ECML/PKDD (3), volume 11053 of LNCS, Springer, 2018, pp. 649–653.
Twarc: https://github.com/DocNow/twarc
* Two terms hidden to adhere to ethics protocols.
Image sources: www.abc.net.au, Wikipedia.org
16. ASONAM 2020 OFFICIAL
A Weekend of AFL
Collection  Tool   Duration  When        Posts   Users
AFL1        RAPID  72 hrs    March 2019  21,799  11,573
            Twarc  72 hrs    March 2019  44,470  16,821

(English only: Twarc 22,962; RAPID 20,431)
18. ASONAM 2020 OFFICIAL
What’s happening?
• Twarc
• Thin layer – matches anything with “afl”
• RAPID
• Grab any match, then…
• Check that the text-based fields contain the desired string
• E.g., ‘text’, ‘user.screen_name’, ‘user.description’
19. ASONAM 2020 OFFICIAL
Lessons
• Null hypothesis rejected
• Collection variations affect analyses
• Know what your tool does
• Be aware of noise
• Be aware of language clashes
• Choose filter terms carefully
• Have a guiding research question
WARNING: Your tools may affect results & business decisions
20. ASONAM 2020 OFFICIAL
QUESTIONS?
Derek Weber1,2, Mehwish Nasim3-6, Lewis Mitchell5,6 and Lucia Falzon2,7
21. ASONAM 2020 OFFICIAL
Q&A Content
[Hashtag co-mention network diagrams: Part 1: RAPID and Part 1: Twarc]
Noisy for SNA; OK for trend analysis.
* Some hashtags hidden to adhere to ethics protocols.
22. ASONAM 2020 OFFICIAL
AFL1 Content
[Hashtag co-mention network diagrams: Part 1: RAPID and Part 1: Twarc]
Noisy for SNA; OK for trend analysis.
* Some hashtags hidden to adhere to ethics protocols.
23. ASONAM 2020 OFFICIAL
Dataset Statistics
Collection          Tool   Duration (hrs)  When        Posts   Reposts  Users
Q&A Part 1          RAPID  4               2018        15,930  8,744    4,970
                    Twarc  4               2018        27,389  14,191   7,057
Q&A Part 2          RAPID  15              2018        11,719  8,051    4,708
                    Twarc  15              2018        15,490  10,988   5,799
AFL1                RAPID  72              March 2019  21,799  7,047    11,573
                    Twarc  72              March 2019  44,470  11,482   16,821
AFL1-en (en & und)  RAPID  72              March 2019  21,235  6,849    11,238
                    Twarc  72              March 2019  25,231  8,531    12,399
AFL2                RAPID  144             April 2019  30,103  9,215    14,231
                    RAPID  144             April 2019  30,115  9,215    14,232
Good morning everyone,
The promise of social media is easy access to plentiful social data, but there are questions to consider, particularly concerning how to define network boundaries while working with the vagaries of the platforms themselves.
(Then there are sampling issues, and the dynamic nature of the data.)
Today I will present work that my colleagues and I have done considering the challenges of using social media data as a source for social network analysis.
To conduct SNA, we first need to construct a social network.
How do we do this? Of course, in the general case, “It depends”. [CLICK] What’s the research question we’re trying to answer?
[CLICK] In turn, that will help answer the follow up questions:
What will a node represent?
How will nodes be linked?
How far do we want the network to extend? This could cover a family or a community, or be based on something categorical about the actors and their attributes, such as their gender or income bracket.
These questions will help answer what data needs to be collected, and then we can address the challenges of how to collect the data.
(The information may be hard to come by, or may come with a degree of uncertainty, or may not be timely.)
[CLICK] A traditional example is Moreno’s work on student social networks and how they evolve as the students progress.
(The data was collected through interviews and the boundary surrounded the class.)
[CLICK] But what can we do with social media data?
Diagram source: https://en.wikipedia.org/wiki/Sociometry, via http://www.martingrandjean.ch/social-network-analysis-visualization-morenos-sociograms-revisited/. Description: node size = indegree; colour = dark blue if indegree is 0, white if indegree >= 3.
Well, it’s clear that an obvious candidate for a node is an account, but what constitutes an edge?
[CLICK] A social network edge is typically a long-lasting connection, like parenthood or employment, but the obvious counterpart in social media, friend and follower relations, aren’t quite the same:
They even differ between platforms. A Twitter follow is cheaper to create and easier to ignore than a friendship in Facebook;
Because they are so cheap to create, and everything is effectively permanent, many such connections may in fact be stale or, at best, still valid but inactive, or may be otherwise questionable, such as links to bots.
[CLICK] In any case, they can form very dense networks, such as these.
[CLICK] So the questions remain:
What constitutes a relationship in social media?
Is it meaningful to use it for SNA (social network analysis)?
What evidence can be gathered, i.e., what data is available, to support these relationships?
[CLICK] Our approach is to focus on interaction networks, because by studying actual posting behaviour, we can see not only how accounts are connecting, but also how often, which provides a richer sense of the underlying relationships than follower networks.
Furthermore, because the interactions are often directed, it gives us a chance to consider the flow of information and influence.
https://www.flickr.com/photos/psychemedia/5821218782/in/photostream/
To build social networks from interactions, we consider ones that are common across platforms. We focus on [CLICK] accounts and the posts they make, whether they’re Facebook or Instagram posts or tweets on Twitter.
[CLICK] On many platforms, a post may mention another account;
[CLICK] A post may be duplicated by sharing or retweeting it; OR
[CLICK] A post may be a reply to or a comment on an existing post.
These are just examples, of course, as there are a myriad of other ways to connect accounts, especially when you consider their content
(like their hashtags, links to web pages, or other media, or even the terms they use).
From a five minute window looking at the Twitter discussion surrounding an episode of ABC’s Q&A, [CLICK] we created these three different networks.
As you can see, the retweet network in green appears to be a subset of the mention network in red - [CLICK] here are notable matches. This is because retweets necessarily include a mention of the account being retweeted. Depending on your research question, this may be important to keep, or it may be a confound.
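To make this concrete, here is a minimal sketch of how the three interaction networks might be built with networkx, assuming each tweet is a Twitter API v1.1 JSON object; this illustrates the construction rather than reproducing our actual pipeline.

```python
# Minimal sketch: build directed mention, retweet and reply networks
# from Twitter API v1.1 JSON objects (one dict per tweet).
import networkx as nx

def _add_edge(g, u, v):
    # Accumulate a weight so the degree of activity is retained.
    w = g.get_edge_data(u, v, default={}).get("weight", 0)
    g.add_edge(u, v, weight=w + 1)

def build_interaction_networks(tweets):
    mentions, retweets, replies = nx.DiGraph(), nx.DiGraph(), nx.DiGraph()
    for t in tweets:
        src = t["user"]["screen_name"]
        # Mentions: every account named in the tweet's entities. Retweets
        # also carry a mention of the retweeted account, which is why the
        # retweet network can appear as a subset of the mention network.
        for m in t.get("entities", {}).get("user_mentions", []):
            _add_edge(mentions, src, m["screen_name"])
        # Retweets: edge from the retweeter to the original author.
        if "retweeted_status" in t:
            _add_edge(retweets, src, t["retweeted_status"]["user"]["screen_name"])
        # Replies: edge from the replier to the account replied to.
        if t.get("in_reply_to_screen_name"):
            _add_edge(replies, src, t["in_reply_to_screen_name"])
    return mentions, retweets, replies
```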
Now that we know how to build social networks, how can we get the data?
Social media platforms offer data via their APIs, but the data models vary, as does the amount of information that is accessible, whether through rate limits or simply through it not being made available; paying will usually get you more data. [CLICK] You can tap into live feeds of posts, search for current or historical data, or start with seed accounts and work your way out using a snowball strategy.
[CLICK] As input, we can use search terms, usernames and timestamps, though many platforms offer sophisticated query languages.
(Geographic bounding boxes can often also be used.)
[CLICK] The output we get may be a corpus of posts or an egonet.
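As an illustration, the two main Twitter collection modes take only a few lines with twarc (v1); the credential variables and the store() helper below are placeholders, not part of twarc itself.

```python
# Sketch: the two main Twitter collection modes via twarc (v1).
from twarc import Twarc

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

# Filter the live stream: collect tweets matching the boundary criteria
# as they are posted (this loop runs until interrupted).
for tweet in t.filter(track="afl"):
    store(tweet)  # store() stands in for whatever persistence is used

# Retrieval via search: collect recent historical tweets (the standard
# Search API only reaches back about a week).
for tweet in t.search("qanda"):
    store(tweet)
```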
The network boundary we want may not be the one we get, depending on what the API will let us do.
Typically, we want to find the community of accounts discussing a particular topic.
[CLICK] Say these are the posts we want to collect for our network.
[CLICK] Constrained by the API, our filter can collect only these, so we are bound to miss some posts we wanted.
[CLICK] Also, we collect posts that match our criteria but that we don’t want. These may be irrelevant spam and advertising, posts from others who’ve used our filter terms, or matches on terms that are meaningful in other languages.
So what can we do with all this?
[CLICK] Importantly, what are we missing and how does it affect our analyses?
[CLICK] Could someone else do the same study? Bearing in mind, of course, that they’d need to do it simultaneously?
[CLICK] Our null hypothesis was therefore that collecting social media data from a particular platform, at a particular time, using particular criteria, should produce the same dataset, give or take minor timing issues.
[CLICK] Our plan was to test this out. We would conduct simultaneous Twitter collection activities with different tools during the same time periods, using the same boundary criteria, and then analyse the datasets we collected to see if they differed.
(, and then see how any variations affected social network analyses based on them.)
We established this analysis process to compare corresponding datasets.
[CLICK] First we consider the dataset statistics,
(such as number of posts, number of accounts that appeared, number of retweets, hashtags, etc.)
[CLICK] Then we construct different networks from the datasets and compare network statistics
(such as number of nodes, edges, network diameter, largest component, etc.)
[CLICK] To look more closely, we then examine the nodes themselves, considering a variety of standard centrality measures.
[CLICK] Finally, we compare the groupings in the networks based on Louvain clustering.
In this way, we consider not just variations in the data collected, but also the effect of those variations on analyses.
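A minimal sketch of the first two comparison steps, reusing the network-building sketch from earlier; the field names again assume v1.1 tweet JSON, and these are illustrative counts rather than our exact measures.

```python
# Sketch: corpus-level counts and basic network statistics for one
# dataset; run over each corresponding pair and compare the results.
import networkx as nx

def dataset_stats(tweets):
    return {
        "posts": len(tweets),
        "accounts": len({t["user"]["screen_name"] for t in tweets}),
        "reposts": sum(1 for t in tweets if "retweeted_status" in t),
        "hashtags": len({h["text"].lower()
                         for t in tweets
                         for h in t.get("entities", {}).get("hashtags", [])}),
    }

def network_stats(g):
    # Largest (weakly) connected component of a directed network.
    largest = max(nx.weakly_connected_components(g), key=len)
    return {
        "nodes": g.number_of_nodes(),
        "edges": g.number_of_edges(),
        "largest_component": len(largest),
    }
```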
[CLICK] The first tool we used was RAPID, a collection and data mining application created by the University of Melbourne with DST Group. It lets you set up filtered collections from Twitter and Reddit. Notably, RAPID can dynamically modify the filter criteria autonomously, periodically adding terms it finds are popular and dropping ones that haven’t been seen. This allows it to track the topic based on how it’s actually being discussed in real time.
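As a toy sketch of that dynamic tracking behaviour (our illustration only, not RAPID's actual code; the update_filter_terms function and its parameters are hypothetical):

```python
# Toy sketch of dynamic topic tracking: periodically drop filter terms
# that no longer appear and add the hashtags that are currently most
# popular in the recently collected stream.
from collections import Counter

def update_filter_terms(terms, recent_tweets, add_top=5):
    counts = Counter(h["text"].lower()
                     for t in recent_tweets
                     for h in t.get("entities", {}).get("hashtags", []))
    # Keep only terms still being seen in the recent window...
    kept = {term for term in terms if counts[term.lower()] > 0}
    # ...and add the hashtags that have become popular.
    kept.update(tag for tag, _ in counts.most_common(add_top))
    return kept
```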
[CLICK] As a contrast, we used Twarc, which just provides a very thin layer around the Twitter API with no extra smarts.
[CLICK] We then used these to create two collections, each with two simultaneous phases, generating four datasets for each collection.
The first collection was over the episode of ABC’s Q&A mentioned earlier, including 4 hours during and shortly after the episode, and then 15 hours the following day.
The second was conducted over two separate weekends in 2019 aiming to follow discussions about Australian Rules Football.
[CLICK] All collection, storage and analysis of the data was conducted under two ethics protocols approved by the University of Adelaide HREC (#170316 and H-2018-045).
Q&A image: https://www.abc.net.au/cm/rimage/10760138-1x1-large.jpg?v=2
AFL image: https://en.wikipedia.org/wiki/Australian_Football_League#/media/File:Australian_Football_League.svg
These tables show just a raw count of the number of posts collected and the number of accounts that appeared in them.
Our hypothesis said the collections should be nearly identical, which is clearly not the case.
(Remembering that our null hypothesis was that collecting at the same time, with the same criteria against the same social media platform should result in similar if not identical datasets, we can see that our collections were unexpectedly different.)
We can see that these variations do cause differences in the networks we build from them. [CLICK]
70% more tweets and 40% more accounts in the Twarc dataset result in [CLICK] 25-50% more nodes and edges in the corresponding mention [CLICK] and reply networks.
(These diagrams show just the largest component from each of those networks, but those components contain 95% and 70% of the mention and reply network nodes, respectively.)
Largest components: Twarc mentions 5,819 of 6,119; RAPID mentions 4,326 of 4,535; Twarc replies 1,081 of 1,490; RAPID replies 829 of 1,184.
To examine the centrality values of nodes in corresponding networks, we can’t just compare the values directly. Instead, given two networks G1 and G2, we first of all only consider the common nodes [CLICK], and then rank them by their centrality values [CLICK] and see how closely the rankings match. To compare the rankings, [CLICK] we used Kendall’s tau and Spearman’s rho (rank correlation coefficients).
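This comparison is straightforward to sketch with networkx and scipy.stats (our choice of tooling for illustration); degree centrality stands in here for whichever centrality measure is being compared.

```python
# Sketch: restrict two networks to the common nodes of their top-k
# centrality rankings, then compare the rankings with Kendall's tau and
# Spearman's rho (both rank the raw scores internally).
import networkx as nx
from scipy.stats import kendalltau, spearmanr

def compare_rankings(g1, g2, centrality=nx.degree_centrality, k=1000):
    c1, c2 = centrality(g1), centrality(g2)
    top1 = sorted(c1, key=c1.get, reverse=True)[:k]
    top2 = set(sorted(c2, key=c2.get, reverse=True)[:k])
    common = [n for n in top1 if n in top2]
    x = [c1[n] for n in common]
    y = [c2[n] for n in common]
    return kendalltau(x, y), spearmanr(x, y)
```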
Ideally, our corresponding networks should have mostly the same nodes, and they should appear in similar rankings, according to our null hypothesis.
We compared the common nodes from the top 1,000 of each corresponding network, hoping for tau and rho values around 0.9. Moderate similarity values would be 0.4 to 0.6, [CLICK] but as you can see, we often didn’t even get that.
Reply networks were more similar than mention networks, but that could be due to them being much smaller.
Common nodes in the top 1,000: Part 1 mentions 521–585, replies 988–993; Part 2 mentions 893–906, replies 998 (all).
By this point, it’s no longer surprising that there are differences in the clusters we discovered using the Louvain method. Here, we considered just the largest 20 clusters in each pair of corresponding networks for reply, mention and retweet networks. Although they follow similar trends of size, the differences are clear.
We spent some time examining the membership of corresponding clusters [CLICK], but came to no firm conclusions other than that the reply networks were the most similar.
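A sketch of the clustering step, assuming the python-louvain package (imported as community); Louvain operates on undirected graphs, so the directed interaction networks are converted first.

```python
# Sketch: sizes of the n largest Louvain clusters in a network, for
# comparing the cluster-size profiles of corresponding networks.
import community as community_louvain

def top_cluster_sizes(g, n=20):
    # best_partition() maps each node to a community id.
    partition = community_louvain.best_partition(g.to_undirected())
    sizes = {}
    for cluster in partition.values():
        sizes[cluster] = sizes.get(cluster, 0) + 1
    return sorted(sizes.values(), reverse=True)[:n]
```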
Like the Q&A datasets, the first AFL datasets were dramatically different in size, though [CLICK] it appeared as though there was significant overlap.
A manual scan of the tweets showed many non-English tweets in the Twarc dataset, [CLICK] so we looked at the language distributions and found a significant amount of Japanese content.
Looking at just the English tweets [CLICK], we find the datasets are much more similar in size.
So what are those Japanese tweets about? Is AFL really big in Japan?
English only: Twarc 22,962; RAPID 20,431. English & und: Twarc 25,236; RAPID 21,235.
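Extracting that English-only view is a one-line filter over the tweet JSON, since Twitter attaches a detected language code to each tweet, with "und" for undetermined; keeping "und" alongside "en" gives the "en & und" figures above.

```python
# Sketch: keep tweets detected as English or of undetermined language.
def english_subset(tweets):
    return [t for t in tweets if t.get("lang") in ("en", "und")]
```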
Picking a random Japanese tweet, it looks like it’s promoting some kind of boxing match. Others referred to a Japanese marketplace, like eBay or Amazon.
[CLICK] Looking at the raw JSON (don’t read it), the string “afl” appears in this block here [CLICK], and if we look closely [CLICK], it’s part of a URL, and has nothing to do with Aussie Rules.
So what’s going on?
[CLICK] Twarc is a very thin layer over Twitter’s APIs, so it should be doing the least above and beyond what Twitter is offering. If the string “afl” appears in a tweet, it will return it, pending rate limits.
[CLICK] We spoke to the RAPID team at Melbourne Uni and found that it does some useful post-collection processing. It retrieves the same tweets as Twarc, but then checks that the filter terms appear in the text-based fields in the tweet, such as in the text of the tweet or the account’s screen name or description, both of which are included in tweets’ metadata.
So RAPID was matching these Japanese tweets but then dumping them.
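A sketch of that post-collection check (our reconstruction of the described behaviour, not RAPID's code); the dotted field paths match those on the earlier slide.

```python
# Sketch: keep a tweet only if a filter term appears in one of its
# text-based fields.
TEXT_FIELDS = ("text", "user.screen_name", "user.description")

def _field(tweet, dotted_path):
    # Walk a dotted path such as "user.description" through the JSON.
    value = tweet
    for key in dotted_path.split("."):
        value = (value or {}).get(key)
    return value or ""

def matches_terms(tweet, terms):
    return any(term.lower() in _field(tweet, f).lower()
               for f in TEXT_FIELDS
               for term in terms)

# A tweet where "afl" appears only inside a URL fails this check and is
# discarded, which is what removed the Japanese noise from RAPID's data.
```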
[CLICK] That’s important to know.
So what have we learned?
[CLICK] We’ve found that attempting to conduct the same collection simultaneously using different tools can produce very different datasets.
Furthermore, we’ve shown that analyses of social networks based on those datasets produce very different results.
[CLICK] We’ve learned tools can value-add, but we should know what they do.
We’ve learned that noise can also affect results, and that it may be necessary to trawl through content manually to see what types of noise are present.
Some may be incidental or from spammers or advertising, but some may be from language clashes. Be wary of very short filter terms.
This activity focussed on the collection of data and the building of networks in the absence of [CLICK] a clear guiding research question, which would really have helped with deciding on network construction parameters and subsequent data requirements.
We chose not to constrain ourselves to particular research questions, so we possibly encountered more pitfalls than we might have done otherwise.
We invite you to learn from the consequences of our decisions.
The broader implication for those who analyse social media data is that they should be aware that their tools may affect their analyses and the decisions they base on their results. [CLICK] More focused research is required.
Thank you for your time and attention.
I’m happy to answer any questions now.
If we consider the content of the datasets and map out hashtags that are mentioned by the same accounts, we can see a lot more alignment between corresponding datasets, not only in the most mentioned hashtags but also in their linkages.
From this we learn two things:
From a high level, this kind of content is subjectively similar, despite the differences in the datasets; and
We can start to identify noise in the datasets. We wanted to see discussion about Q&A but there’s content here that clearly doesn’t match that. In fact, there appear to be clusters of Spanish content that was picked up because it included the term ‘qanda’.
[CLICK] In short, these variations might be fine for trend analysis, say if you’re an advertiser wanting to track an ad campaign, but it raises questions about using SNA techniques.
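For reference, a minimal sketch of one way to build such a hashtag co-mention network, under the reading that two hashtags are linked when the same account has used both; the function name is illustrative.

```python
# Sketch: link two hashtags whenever the same account has used both,
# weighting edges by how many accounts co-mention the pair.
import itertools
from collections import defaultdict
import networkx as nx

def hashtag_comention_network(tweets):
    tags_by_account = defaultdict(set)
    for t in tweets:
        tags = {h["text"].lower()
                for h in t.get("entities", {}).get("hashtags", [])}
        tags_by_account[t["user"]["screen_name"]] |= tags
    g = nx.Graph()
    for tags in tags_by_account.values():
        for a, b in itertools.combinations(sorted(tags), 2):
            w = g.get_edge_data(a, b, default={}).get("weight", 0)
            g.add_edge(a, b, weight=w + 1)
    return g
```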
frasedeldia – “phrase of the day” (Catalan)
felizjueves – “happy Thursday” (Spanish)
Arabic hashtags – “five”, “spraying”
8kasimdunyadelilergunu – “8 November World Madness Day” (Turkish), celebrated by Turks sharing cartoons with each other on social media