ASONAM 2020 OFFICIAL
Derek Weber1,2, Mehwish Nasim3-6, Lewis Mitchell5,6 and Lucia Falzon2,7
Contact: derek.weber@adelaide.edu.au
1 School of Computer Science, The University of Adelaide, Australia.
2 Defence Science and Technology Group, Department of Defence, Australia.
3 Data61, CSIRO, Adelaide, Australia.
4 Cyber Security Cooperative Research Centre, Adelaide, Australia.
5 ARC Centre for Excellence for Mathematical and Statistical Frontiers (ACEMS), Adelaide, Australia.
6 School of Mathematical Sciences, The University of Adelaide, Australia.
7 School of Psychological Sciences, The University of Melbourne, Australia.
A METHOD TO EVALUATE
THE RELIABILITY OF SOCIAL MEDIA DATA
FOR SOCIAL NETWORK ANALYSIS
Building Social Networks
• To build a social network…
  • What’s the research question?
  • What’s a node?
  • What’s an edge?
  • Where’s the boundary?
  • How will the data be collected?
• Traditional example
  • Interviewing kids in a school regarding their friends
• Social media?
Figure: Moreno’s sociogram of a year 2 class
http://www.martingrandjean.ch/social-network-analysis-visualization-morenos-sociograms-revisited/
Using Social Media Data
What’s a node? An account
What’s an edge? Friends & followers?
• Cheap, often stale, platform-specific meaning, dense networks, boundary questions, bot followers
Open research questions
• What makes a relationship?
• Does SNA make sense?
• What to use as evidence? Where’s the boundary?
Our approach: interaction networks
• Active connections
• Degree of activity
• Direction of flow
Figure: Follower network in Gephi, coloured by modularity
https://www.flickr.com/photos/psychemedia/5821218782/in/photostream/
Social Media Interactions
Social media behaviours that yield potential social networks:
• Mention
• Retweet
• Reply
Network Varieties
• Mentions: e.g. @username (cf. Instagram tags)
• Replies: cf. Facebook or Reddit comments
• Retweets: cf. Facebook shares, Tumblr reposts
Figure: Five minutes of #qanda (ABC’s Q&A) Twitter in 2018
Diagrams were produced in Visone, available at https://visone.info
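The three network varieties can be sketched in code. This is a minimal illustration, not the authors' pipeline: it assumes tweets are dictionaries shaped like Twitter API v1.1 objects (`user.screen_name`, `entities.user_mentions`, `in_reply_to_screen_name`, `retweeted_status`), and the function name `interaction_edges` and the sample data are hypothetical.

```python
from collections import Counter


def interaction_edges(tweets):
    """Return weighted directed edges {(source, target): count} for
    mention, reply and retweet networks (assumes v1.1-style tweet dicts)."""
    mentions, replies, retweets = Counter(), Counter(), Counter()
    for tweet in tweets:
        source = tweet["user"]["screen_name"]
        # Mentions: one edge per account named in the post.
        for mention in tweet.get("entities", {}).get("user_mentions", []):
            mentions[(source, mention["screen_name"])] += 1
        # Replies: an edge to the account being replied to, if any.
        target = tweet.get("in_reply_to_screen_name")
        if target:
            replies[(source, target)] += 1
        # Retweets: an edge to the author of the original post.
        original = tweet.get("retweeted_status")
        if original:
            retweets[(source, original["user"]["screen_name"])] += 1
    return mentions, replies, retweets


# Hypothetical sample: alice retweets bob, carol replies to alice.
sample = [
    {"user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "bob"}]},
     "retweeted_status": {"user": {"screen_name": "bob"}}},
    {"user": {"screen_name": "carol"},
     "entities": {"user_mentions": [{"screen_name": "alice"}]},
     "in_reply_to_screen_name": "alice"},
]
mentions, replies, retweets = interaction_edges(sample)
```

In this sample every retweet edge also appears as a mention edge, because the retweet carries a mention of the retweeted account. Whether to keep or drop those duplicated edges depends on the research question.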
Collection Methods
Input: keywords, usernames, timeframes
Method: filter live stream, retrieval via search, snowball strategy
Output: post corpus, egonet
Conducting a Collection
Boundary
• Filter terms:
  • Hashtags
  • Search terms
  • Usernames
• Timing:
  • Start time / date
  • End time / date
• Geo:
  • Bounding box
[Diagram: posts over time]
Potential Pitfalls
• What have we got? What are we missing?
• Repeatability?
• Null hypothesis: Collecting social media data from the same platform, at the same time, using the same boundary criteria, regardless of the tool used, should result in identical datasets
• Plan:
  • Conduct simultaneous collections…
  • with different tools…
  • but same boundary criteria.
  • Compare the results.
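A first check of the null hypothesis is to compare the post IDs gathered by each tool. The sketch below is illustrative only: the function name `compare_collections` and the IDs are hypothetical, and Jaccard similarity is one reasonable overlap measure rather than one named on the slides.

```python
def compare_collections(ids_a, ids_b):
    """Summarise the overlap between two collections of post IDs."""
    a, b = set(ids_a), set(ids_b)
    union = a | b
    return {
        "only_a": len(a - b),   # posts only the first tool collected
        "only_b": len(b - a),   # posts only the second tool collected
        "shared": len(a & b),   # posts both tools collected
        # Jaccard similarity is 1.0 only when the datasets are identical.
        "jaccard": len(a & b) / len(union) if union else 1.0,
    }


# Hypothetical post IDs from two simultaneous collections.
result = compare_collections(["1", "2", "3"], ["2", "3", "4", "5"])
```

Under the null hypothesis, `only_a` and `only_b` would both be zero and `jaccard` would equal 1.0.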
Analysis Process
[Process diagram with stages: Posts, Dataset statistics, Network statistics, Indexing, Grouping]
Experiment Collection
Datasets
• ABC’s Q&A
  • 2018 episode
  • Terms: “qanda”, …*
• AFL
  • March & April 2019
  • Term: “afl”
• Ethics
  • University of Adelaide HREC
Tools
• RAPID
  • University of Melbourne / DST
  • Collection & data analytics
  • Twitter & Reddit
  • Topic tracking
• Twarc
  • Thin open source Python wrapper over Twitter API
RAPID: K. Lim, S. Jayasekara, S. Karunasekera, A. Harwood, L. Falzon, J. Dunn, & G. Burgess, RAPID: Real-time Analytics Platform for Interactive Data Mining, in: ECML/PKDD (3), volume 11053 of LNCS, Springer, 2018, pp. 649–653.
Twarc: https://github.com/DocNow/twarc
* Two terms hidden to adhere to ethics protocols.
Image sources: www.abc.net.au, Wikipedia.org
Dataset Statistics
[Figure: tweet and account counts per collection, by tool: Q&A Part 1 (4 hrs; RAPID, Twarc), Q&A Part 2 (15 hrs; RAPID, Twarc), AFL1 (72 hrs; RAPID, Twarc), AFL2 (144 hrs; RAPID1, RAPID2)]
Q&A Networks
[Figure: Part 1 MENTIONS and Part 1 REPLIES networks, Twarc and RAPID side by side, annotated with +35%, +49%, +26%, +32%, +72%, +42%]
Ranking Centralities
Nodes G1: n1: 0.6, n2: 0.55, n3: 0.25, n4: 0.15, n5: 0.1
Nodes G2: n5: 0.35, n1: 0.32, n4: 0.37, n2: 0.31, n3: 0.33
Similarity measures: Kendall τ, Spearman’s ρ
[Diagram: graphs G1 and G2 over nodes n1 to n5; labels nx, ny]
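The ranking comparison on this slide can be worked through in a few lines. This is a sketch under stated assumptions: the first five scores are read as G1 and the second five as G2, both graphs share the same node set, and no scores are tied. The function names are illustrative; a statistics library would normally be used instead.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau (tau-a); assumes there are no tied scores."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1   # the pair is ordered the same way in both
        elif sign < 0:
            discordant += 1   # the pair is ordered in opposite ways
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs


def spearman_rho(x, y):
    """Spearman's rho via the rank-difference formula; assumes no ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda k: values[k], reverse=True)
        rank = [0] * len(values)
        for position, k in enumerate(order, start=1):
            rank[k] = position
        return rank

    rank_x, rank_y = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Centrality scores from the slide, keyed by node.
g1 = {"n1": 0.6, "n2": 0.55, "n3": 0.25, "n4": 0.15, "n5": 0.1}
g2 = {"n1": 0.32, "n2": 0.31, "n3": 0.33, "n4": 0.37, "n5": 0.35}
nodes = sorted(g1)
tau = kendall_tau([g1[n] for n in nodes], [g2[n] for n in nodes])
rho = spearman_rho([g1[n] for n in nodes], [g2[n] for n in nodes])
```

For these values, 2 of the 10 node pairs are concordant and 8 are discordant, giving τ = -0.6, and the squared rank differences sum to 36, giving ρ = -0.8. The two graphs therefore rank the nodes in largely opposite orders.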
Centralities
Moderate similarity: 0.4 – 0.6
High similarity: 0.9+
Grouping
Adjusted Rand index scores
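For readers unfamiliar with the adjusted Rand index (ARI), the sketch below computes it from two cluster labellings of the same accounts. It is a minimal, assumption-laden illustration: the function name is hypothetical, the inputs are assumed to cover the same accounts in the same order (at least two of them), and how the slides' clusters were produced is not specified here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labellings of the same items."""
    n = len(labels_a)
    # Pairs of items placed together in both labellings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of items placed together within each labelling.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)   # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in a single cluster
    return (together_both - expected) / (maximum - expected)


# Same grouping under different label names, then a crossed grouping.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A score of 1 means the two groupings agree exactly, regardless of label names, while scores near or below 0 mean agreement is no better than chance. To compare clusters from two collections, the labels would first be restricted to accounts present in both.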
A Weekend of AFL
Collection  Tool   Duration  When        Posts   Users
AFL1        RAPID  72 hrs    March 2019  21,799  11,573
            Twarc                        44,470  16,821
Annotations: 22962; 20431 en only
Is AFL big in Japan?
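One way to check for a language clash is to count posts by their language tag and keep only the languages of interest. The sketch assumes v1.1-style tweet dictionaries with a `lang` field, where `und` marks an undetermined language; the function names and sample records are hypothetical, and `ja` appears purely for illustration.

```python
from collections import Counter


def language_counts(tweets):
    """Count posts per language tag, treating a missing tag as 'und'."""
    return Counter(tweet.get("lang", "und") for tweet in tweets)


def keep_languages(tweets, allowed=("en", "und")):
    """Keep only posts whose language tag is in the allowed set."""
    return [tweet for tweet in tweets if tweet.get("lang", "und") in allowed]


# Hypothetical records; only the language tag matters here.
sample = [
    {"id_str": "1", "lang": "en"},
    {"id_str": "2", "lang": "ja"},
    {"id_str": "3", "lang": "und"},
    {"id_str": "4", "lang": "ja"},
]
counts = language_counts(sample)
english = keep_languages(sample)
```

Keeping `en` and `und` mirrors the "en & und" subset listed in the dataset statistics. A large count for an unexpected language suggests the filter term is meaningful in that language.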
What’s happening?
• Twarc
  • Thin layer: matches anything with “afl”
• RAPID
  • Grab any match, then…
  • Check text properties have the desired string
  • E.g., ‘text’, ‘user.screen_name’, ‘user.description’
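The second-stage check described for RAPID can be approximated as follows. This is a hypothetical sketch, not RAPID's actual implementation: the helper names, the case-insensitive substring match and the exact field list are assumptions based only on the examples on the slide.

```python
def get_path(obj, dotted):
    """Follow a dotted path such as 'user.description' through nested dicts."""
    for key in dotted.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def matches_term(tweet, term,
                 fields=("text", "user.screen_name", "user.description")):
    """True if the term appears in any of the listed text properties."""
    term = term.lower()
    for field in fields:
        value = get_path(tweet, field)
        if isinstance(value, str) and term in value.lower():
            return True
    return False


# Hypothetical posts: the second matched the platform filter some other
# way (say, via a link) but has no "afl" in the checked properties.
kept = matches_term(
    {"text": "Great AFL game", "user": {"screen_name": "fan", "description": ""}},
    "afl")
dropped = matches_term(
    {"text": "hello", "user": {"screen_name": "x", "description": "y"}},
    "afl")
```

A thin wrapper that skips this step returns everything the platform matched, which is one plausible source of the differing post counts between the two tools.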
Lessons
• Null hypothesis rejected
• Collection variations affect analyses
• Know what your tool does
• Be aware of noise
• Be aware of language clashes
• Choose filter terms carefully
• Have a guiding research question
WARNING: Your tools may affect results & business decisions
QUESTIONS?
Q&A Content
Hashtag co-mention networks (Part 1: RAPID; Part 1: Twarc)
• Noisy for SNA
• OK for trend analysis
* Some hashtags hidden to adhere to ethics protocols.
AFL1 Content
Hashtag co-mention networks (Part 1: RAPID; Part 1: Twarc)
• Noisy for SNA
• OK for trend analysis
* Some hashtags hidden to adhere to ethics protocols.
Dataset Statistics
Collection          Tool   Duration (hrs)  When        Posts   Reposts  Users
Q&A Part 1          RAPID  4               2018        15,930  8,744    4,970
                    Twarc                              27,389  14,191   7,057
Q&A Part 2          RAPID  15              2018        11,719  8,051    4,708
                    Twarc                              15,490  10,988   5,799
AFL1                RAPID  72              March 2019  21,799  7,047    11,573
                    Twarc                              44,470  11,482   16,821
AFL1-en (en & und)  RAPID  72              March 2019  21,235  6,849    11,238
                    Twarc                              25,231  8,531    12,399
AFL2                RAPID  144             April 2019  30,103  9,215    14,231
                    RAPID                              30,115  9,215    14,232
Q&A Datasets
Clusters
Editor’s notes

  1. Good morning everyone. The promise of social media is easy access to plentiful social data, but there are questions to consider, particularly concerning how to define network boundaries while working with the vagaries of the platforms themselves. (Then there are sampling issues, and the dynamic nature of the data.) Today I will present work that my colleagues and I have done considering the challenges of using social media data as a source for social network analysis.
  2. To conduct SNA, we first need to construct a social network. How do we do this? Of course, in the general case, “It depends”. [CLICK] What’s the research question we’re trying to answer? [CLICK] In turn, that will help answer the follow-up questions: What will a node represent? How will nodes be linked? How far do we want the network to extend? This could cover a family or a community, or be based on something categorical about the actors and their attributes, such as their gender or income bracket. These questions will help answer what data needs to be collected, and then we can address the challenges of how to collect the data. (The information is hard to come by, or may come with a degree of uncertainty or timeliness.) [CLICK] A traditional example is Moreno’s work on student social networks and how they evolve as the students progress. (The data was collected through interviews and the boundary surrounded the class.) [CLICK] But what can we do with social media data? Diagram source: https://en.wikipedia.org/wiki/Sociometry which got it from http://www.martingrandjean.ch/social-network-analysis-visualization-morenos-sociograms-revisited/ Description: Size = Indegree, Colour = dark blue if indegree is 0, white if indegree >= 3
  3. Well, it’s clear that an obvious candidate for a node is an account, but what constitutes an edge? [CLICK] A social network edge is typically a long-lasting connection, like parenthood or employment, but the obvious counterparts in social media, friend and follower relations, aren’t quite the same. They even differ between platforms: a Twitter follow is cheaper to create and easier to ignore than a friendship in Facebook. Because they are so cheap to create and everything is effectively permanent, many such connections may, in fact, be stale or, at best, still valid but inactive, or may be otherwise questionable, such as if they’re links to bots. [CLICK] In any case, they can form very dense networks, such as these. [CLICK] So the questions remain: What constitutes a relationship in social media? Is it meaningful to use it for SNA (social network analysis)? What evidence can be gathered, i.e., what data is available, to support these relationships? [CLICK] Our approach is to focus on interaction networks, because by studying actual posting behaviour, we can see not only how accounts are connecting, but also how often, which provides a richer sense of the underlying relationships than follower networks. Furthermore, because the interactions are often directed, it gives us a chance to consider the flow of information and influence. https://www.flickr.com/photos/psychemedia/5821218782/in/photostream/
  4. To build social networks from interactions, we consider ones that are common across platforms. We focus on [CLICK] accounts and the posts they make, whether they’re Facebook or Instagram posts or tweets on Twitter. [CLICK] On many platforms, a post may mention another account; [CLICK] A post may be duplicated by sharing or retweeting it; OR [CLICK] A post may be a reply to or a comment on an existing post. These are just examples, of course, as there are a myriad of other ways to connect accounts, especially when you consider their content (like their hashtags, links to web pages, or other media, or even the terms they use).
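A minimal sketch of how the three interaction types just described could be turned into directed, weighted edge lists. The post fields used here (`user`, `mentions`, `retweet_of`, `reply_to`) are hypothetical simplifications for illustration, not any platform's actual schema.

```python
from collections import Counter


def build_interaction_networks(posts):
    """Return one weighted, directed edge list per interaction type.

    Each post is a dict with hypothetical keys (assumed for this sketch):
      'user'       - account that made the post
      'mentions'   - list of accounts mentioned in the post
      'retweet_of' - account whose post was shared/retweeted, or None
      'reply_to'   - account being replied to, or None
    """
    networks = {"mention": Counter(), "retweet": Counter(), "reply": Counter()}
    for post in posts:
        source = post["user"]
        # Edge direction: from the posting account to the account it refers to.
        for target in post.get("mentions", []):
            networks["mention"][(source, target)] += 1
        if post.get("retweet_of"):
            networks["retweet"][(source, post["retweet_of"])] += 1
        if post.get("reply_to"):
            networks["reply"][(source, post["reply_to"])] += 1
    return networks


# Made-up sample posts, purely for illustration.
posts = [
    {"user": "alice", "mentions": ["bob"], "retweet_of": "bob", "reply_to": None},
    {"user": "carol", "mentions": ["alice", "bob"], "retweet_of": None, "reply_to": "alice"},
    {"user": "alice", "mentions": ["bob"], "retweet_of": None, "reply_to": None},
]
networks = build_interaction_networks(posts)
for kind, edges in networks.items():
    print(kind, dict(edges))
```

The edge weights record how often one account interacted with another, which is the "degree of activity" that follower links cannot show.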
  5. From a five minute window looking at the Twitter discussion surrounding an episode of ABC’s Q&A, [CLICK] we created these three different networks. As you can see, the retweet network in green appears to be a subset of the mention network in red - [CLICK] here are notable matches. This is because retweets necessarily include a mention of the account they are retweeting. Depending on your research question, this may be important to keep or confounding.
  6. Now that we know how to build social networks, how can we get the data? Social media platforms offer data via their APIs, but the data models will vary, as will the amount of information that is accessible, either through rate limits or simply because the platform doesn’t make it available; paying will usually get you more data. [CLICK] You can tap into live feeds of posts, search for current or historical data, or start with seed accounts and work your way out using a snowball strategy. [CLICK] As input, we can use search terms, usernames and timestamps, though many platforms offer sophisticated query languages. (Geographic bounding boxes can often also be used.) [CLICK] The output we get may be a corpus of posts or an egonet.
  7. The network boundary we want may not be the one we get, depending on what the API will let us do. Typically, we want to find the community of accounts discussing a particular topic. [CLICK] Say these are posts we want to collect for our network. [CLICK] Constrained by the API, our filter can collect only these, so we are bound to miss some posts we wanted. [CLICK] Also, we collect posts that match our criteria but that we don’t want. These may be irrelevant spam and advertising or posts by others who’ve used our filter terms, or else the terms may be meaningful in other languages.
  8. So what can we do with all this? [CLICK] Importantly, what are we missing and how does it affect our analyses? [CLICK] Could someone else do the same study? Bearing in mind, of course, that they’d need to do it simultaneously? [CLICK] Our null hypothesis was therefore that collecting social media data from a particular platform, at a particular time, using particular criteria, should produce the same dataset, give or take minor timing issues. [CLICK] Our plan was to test this out. We would conduct simultaneous Twitter collection activities with different tools during the same time periods, using the same boundary criteria, then analyse the datasets we collected to see if they differed, and then see how any variations affected social network analyses based on them.
  9. We established this analysis process to compare corresponding datasets. [CLICK] First we consider the dataset statistics (such as number of posts, number of accounts that appeared, number of retweets, hashtags, etc.). [CLICK] Then we construct different networks from the datasets and compare network statistics (such as number of nodes, edges, network diameter, largest component, etc.). [CLICK] To look more closely, we then examine the nodes themselves, considering a variety of standard centrality measures. [CLICK] Finally, we compare the groupings in the networks based on Louvain clustering. In this way, we consider not just variations in the data collected, but also the effect of those variations on analyses.
  10. [CLICK] The first tool we used was RAPID, a collection and data mining application created by the University of Melbourne with DST Group. It lets you set up filters on Twitter and Reddit. Notably, RAPID has the ability to dynamically modify the filter criteria autonomously, periodically adding terms it finds are popular and dropping ones that haven’t been seen. This allows it to track the topic based on how it’s actually being discussed in real time. [CLICK] As a contrast, we used Twarc, which just provides a very thin layer around the Twitter API with no extra smarts. [CLICK] We then used these to create two collections, each with two simultaneous phases, generating 4 datasets for each collection. The first collection was over the episode of ABC’s Q&A mentioned earlier, including 4 hours during and shortly after the episode, and then 15 hours the following day. The second was conducted over two separate weekends in 2019 aiming to follow discussions about Australian Rules Football. [CLICK] All collection, storage and analysis of the data was conducted under two ethics protocols approved by the Adelaide Uni HREC: #170316 and H-2018-045. Q&A image: https://www.abc.net.au/cm/rimage/10760138-1x1-large.jpg?v=2 AFL image: https://en.wikipedia.org/wiki/Australian_Football_League#/media/File:Australian_Football_League.svg
  11. These tables show just a raw count of the number of posts collected and the number of accounts that appeared in them. Our hypothesis said the collections should be nearly identical, which is clearly not the case. (Remembering that our null hypothesis was that collecting at the same time, with the same criteria against the same social media platform should result in similar if not identical datasets, we can see that our collections were unexpectedly different.)
  12. We can see that these variations do cause differences in the networks we build from them. [CLICK] 70% more tweets and 40% more accounts in the Twarc dataset result in [CLICK] 25-50% more nodes and edges in the corresponding mention [CLICK] and reply networks. (These diagrams are of just the largest components from each of those networks, but they correspond to 95% and 70% respectively of the mention and reply network nodes.) Largest component nodes: Twarc mentions 5819 of 6119; RAPID mentions 4326 of 4535; Twarc replies 1081 of 1490; RAPID replies 829 of 1184.
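As a quick check on the figures quoted in this note, the following sketch recomputes the largest-component shares and the relative node counts from the four pairs of numbers given. Only the numbers come from the notes; the variable and function names are illustrative.

```python
# Node counts quoted in the notes:
# (nodes in largest component, nodes in the whole network)
components = {
    ("Twarc", "mention"): (5819, 6119),
    ("RAPID", "mention"): (4326, 4535),
    ("Twarc", "reply"): (1081, 1490),
    ("RAPID", "reply"): (829, 1184),
}


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# Share of each network's nodes that sit in its largest component.
coverage = {key: percent(*counts) for key, counts in components.items()}


def extra_nodes(kind):
    """How many more nodes (in %) the Twarc network has than the RAPID one."""
    twarc_total = components[("Twarc", kind)][1]
    rapid_total = components[("RAPID", kind)][1]
    return percent(twarc_total - rapid_total, rapid_total)


for (tool, kind), share in coverage.items():
    print(f"{tool} {kind}: largest component holds {share:.1f}% of nodes")
for kind in ("mention", "reply"):
    print(f"Twarc {kind} network has {extra_nodes(kind):.1f}% more nodes than RAPID")
```

The mention networks both come out at about 95%, the reply networks at roughly 70 to 73%, and the node differences fall inside the 25-50% range quoted above.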
  13. To examine the centrality values of nodes in corresponding networks, we can’t just compare the values directly. Instead, given two networks G1 and G2, we first of all only consider the common nodes [CLICK], and then rank them by their centrality values [CLICK] and see how closely the rankings match. To compare the rankings, [CLICK] we used Kendall’s tau and Spearman’s rho (rank correlation coefficients). Ideally, our corresponding networks should have mostly the same nodes, and they should appear in similar rankings, according to our null hypothesis.
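The comparison just described can be sketched as follows: keep only the nodes common to both networks, then compare how their centrality values rank. This is a plain-Python illustration under stated assumptions, not the code used in the study. Kendall's tau is computed in its simple tau-a form (tied pairs count as neither concordant nor discordant), and Spearman's rho as the Pearson correlation of average ranks. The centrality values are made up.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values (1 = largest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) / 2)


def compare_centralities(c1, c2):
    """Compare two {node: centrality} mappings on their common nodes only."""
    common = sorted(set(c1) & set(c2))
    x = [c1[node] for node in common]
    y = [c2[node] for node in common]
    return kendall_tau(x, y), spearman_rho(x, y)


# Made-up centrality scores for networks G1 and G2.
g1 = {"a": 10, "b": 8, "c": 5, "d": 1, "e": 7}
g2 = {"a": 9, "b": 4, "c": 6, "d": 2, "f": 3}
tau, rho = compare_centralities(g1, g2)
print(f"tau={tau:.3f} rho={rho:.3f}")
```

Nodes `e` and `f` appear in only one network each, so they are dropped before ranking, as the notes describe.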
  14. We compared the common nodes from the top 1000 of each corresponding network, hoping for tau and rho values around 0.9. Moderate similarity values would be 0.4 to 0.6 [CLICK] but as you can see we didn’t often even get that. Reply networks were more similar than mention networks, but that could be due to them being much smaller. Part 1: mention 521 – 585, reply 988 – 993. Part 2: mention 893 – 906, reply 998 (all).
  15. By this point, it’s no longer surprising that there are differences in the clusters we discovered using the Louvain method. Here, we considered just the largest 20 clusters in each pair of corresponding networks for reply, mention and retweet networks. Although they follow similar trends of size, the differences are clear. We spent some time examining the membership of corresponding clusters [CLICK], but came to no firm conclusions other than that the reply networks were the most similar.
  16. Like the Q&A datasets, the first AFL datasets were dramatically different in size, though [CLICK] it appeared as though there was significant overlap. A manual scan of the tweets showed many non-English tweets in the Twarc dataset, [CLICK] so we looked at the language distributions and found a significant amount of Japanese content. Looking at just the English tweets [CLICK], we find the datasets are much more similar in size. So what are those Japanese tweets about? Is AFL really big in Japan? English only: Twarc 22,962; RAPID 20,431. English & und: Twarc 25,236; RAPID 21,235.
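A minimal sketch of the language check described in this note, assuming each tweet is a dict carrying a Twitter-style `lang` field in which `und` marks an undetermined language. The sample tweets are made up.

```python
from collections import Counter


def language_distribution(tweets):
    """Count tweets per value of the 'lang' field ('und' = undetermined)."""
    return Counter(tweet.get("lang", "und") for tweet in tweets)


def keep_languages(tweets, languages=("en",)):
    """Keep only tweets whose 'lang' value is in the given set."""
    return [tweet for tweet in tweets if tweet.get("lang", "und") in languages]


# Made-up sample, purely for illustration.
tweets = [
    {"id": 1, "lang": "en"},
    {"id": 2, "lang": "ja"},
    {"id": 3, "lang": "ja"},
    {"id": 4, "lang": "und"},
    {"id": 5, "lang": "en"},
]
distribution = language_distribution(tweets)
english_only = keep_languages(tweets)
english_and_und = keep_languages(tweets, ("en", "und"))
print(distribution.most_common())
print(len(english_only), len(english_and_und))
```

Running the same two filters over each tool's dataset gives the "English only" and "English & und" counts that can then be compared.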
  17. Picking a random Japanese tweet, it looks like it’s promoting some kind of boxing match. Others referred to a Japanese marketplace, like eBay or Amazon. [CLICK] Looking at the raw JSON (don’t read it), the string “afl” appears in this block here [CLICK], and if we look closely [CLICK], it’s part of a URL, and has nothing to do with Aussie Rules.
  18. So what’s going on? [CLICK] Twarc is a very thin layer over Twitter’s APIs, so it should be doing the least above and beyond what Twitter is offering. If the string “afl” appears in a tweet, it will return it, pending rate limits. [CLICK] We spoke to the RAPID team at Melbourne Uni and found that it does some useful post-collection processing. It retrieves the same tweets as Twarc, but then checks that the filter terms appear in the text-based fields in the tweet, such as in the text of the tweet or the account’s screen name or description, both of which are included in tweets’ metadata. So RAPID was matching these Japanese tweets but then dumping them. [CLICK] That’s important to know.
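The difference between the two tools can be illustrated with a small sketch. It contrasts a match anywhere in the tweet, URLs included, with a match restricted to the text-based fields named in this note (tweet text, screen name, description). The field names follow Twitter-style tweet JSON; the substring matching and the helper names are assumptions made for illustration, since RAPID's actual matching rules are not described here.

```python
def text_fields(tweet):
    """Text-based fields: tweet text, account screen name and description."""
    user = tweet.get("user", {})
    return [
        tweet.get("text", "") or "",
        user.get("screen_name", "") or "",
        user.get("description", "") or "",  # description may be None
    ]


def matches_in_text(tweet, terms):
    """True if any filter term appears in the text-based fields (assumed substring match)."""
    haystack = " ".join(text_fields(tweet)).lower()
    return any(term.lower() in haystack for term in terms)


def matches_anywhere(tweet, terms):
    """True if any filter term appears in the text fields or in an attached URL."""
    urls = [u.get("expanded_url", "") for u in tweet.get("entities", {}).get("urls", [])]
    haystack = " ".join(text_fields(tweet) + urls).lower()
    return any(term.lower() in haystack for term in terms)


# Made-up tweets: the second only contains "afl" inside a URL.
tweets = [
    {
        "text": "Great AFL game tonight",
        "user": {"screen_name": "footyfan", "description": "Go Crows"},
        "entities": {"urls": []},
    },
    {
        "text": "ボクシングの試合",
        "user": {"screen_name": "boxing_jp", "description": None},
        "entities": {"urls": [{"expanded_url": "https://example.com/afl/item"}]},
    },
]
raw_matches = [t for t in tweets if matches_anywhere(t, ["afl"])]
kept_after_check = [t for t in raw_matches if matches_in_text(t, ["afl"])]
print(len(raw_matches), len(kept_after_check))
```

Because the check here is a plain substring test, a short term such as "afl" would also match inside longer words, which is one reason to be wary of very short filter terms.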
  19. So what have we learned? [CLICK] We’ve found that attempting to conduct the same collection simultaneously using different tools can produce very different datasets. Furthermore, we’ve shown that analyses of social networks based on those datasets produce very different results. [CLICK] We’ve learned tools can value-add, but we should know what they do. We’ve learned that noise can also affect results, and that it may be necessary to trawl through content manually to see what types of noise are present. Some may be incidental or from spammers or advertising, but some may be from language clashes. Be wary of very short filter terms. This activity focussed on the collection of data and building of networks, but in the absence of [CLICK] a clear guiding research question, which really helps with deciding on network construction parameters and subsequent data requirements. We chose not to constrain ourselves to particular research questions, so we possibly encountered more pitfalls than we might have done otherwise. We invite you to learn from the consequences of our decisions. The broader implication for those who analyse social media data is that they should be aware that their tools may affect their analyses and the decisions they base on their results. [CLICK] More focused research is required.
  20. Thank you for your time and attention. I’m happy to answer any questions now.
  21. If we consider the content of the datasets and map out hashtags that are mentioned by the same accounts, we can see a lot more alignment between corresponding datasets, not only in the most mentioned hashtags but also in their linkages. From this we learn two things: from a high level, this kind of content is subjectively similar, despite the differences in the datasets; and we can start to identify noise in the datasets. We wanted to see discussion about Q&A but there’s content here that clearly doesn’t match that. In fact, there appear to be clusters of Spanish content that were picked up because they included the term ‘qanda’. [CLICK] In short, these variations might be fine for trend analysis, say if you’re an advertiser wanting to track an ad campaign, but they raise questions about using SNA techniques. Frasedeldia – phrase of the day (Catalan); Felizjueves – happy Thursday (Spanish); Arabic – “five”, “spraying”; 8kasimdunyadelilergunu – 8th November world mad man day (Turkish), i.e. “world madness day”, celebrated by Turks by sharing cartoons with each other on social media.
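One way to map out hashtags mentioned by the same accounts is sketched below: two hashtags are linked once for every account that used both. The post fields (`user`, `hashtags`) and the choice to lower-case tags are assumptions for illustration, not necessarily how the slide's figures were produced.

```python
from collections import Counter, defaultdict
from itertools import combinations


def hashtag_co_mention_edges(posts):
    """Link two hashtags once for every account that used both (undirected, weighted)."""
    tags_by_account = defaultdict(set)
    for post in posts:
        # Lower-case so that '#QandA' and '#qanda' count as the same hashtag.
        tags_by_account[post["user"]].update(tag.lower() for tag in post["hashtags"])
    edges = Counter()
    for tags in tags_by_account.values():
        # Sorting gives each undirected pair one canonical (a, b) key.
        for a, b in combinations(sorted(tags), 2):
            edges[(a, b)] += 1
    return edges


# Made-up posts, purely for illustration.
posts = [
    {"user": "alice", "hashtags": ["QandA", "auspol"]},
    {"user": "bob", "hashtags": ["qanda"]},
    {"user": "bob", "hashtags": ["auspol"]},
    {"user": "carol", "hashtags": ["qanda", "felizjueves"]},
]
edges = hashtag_co_mention_edges(posts)
print(edges.most_common())
```

Hashtags that connect to the target topic only through a handful of accounts, like `felizjueves` in this made-up sample, are candidates for the kind of noise described above.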