Security Challenges in
Online Social Media
Chun-Ming Tim Lai
賴俊鳴, Assistant Professor
Personal information
2007 – 2011: B.S. NTU CSIE
• Prof. Juin-Ming Chen (Math): Lattice Reduction
• Prof. Der-Tsai Lee (CSIE), Chen-Mou Cheng (EE): Secure Index
2011 – 2012: Military Police second lieutenant
2013 – 2019: Ph.D. study, UC Davis
• Prof. S. Felix Wu
• Dissertation: Attackers’ Intention and Influence
Analysis in Social Media
2019 ~ Cloud Innovation School @ Tunghai University (東海大學)
10/18/2021 2
Social Media
Exerting significant impact on mass communication
                Traditional Media   Social Media
Data size       Less                More
User type       Reader              Editor/Reporter
Time-based      Delayed             Real time
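Reaction
➢ Timeliness of replies
➢ Whether the reply addresses the key point; cases are filed and tracked
➢ Life cycle of a post
➢ 1.5 hours on average, affecting people's lives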
Security Threat
Severe Threat
• Phishing
• Malware, drive-by-download
Medium to light Threat
• Advertisement
• Spamming (Fund-raising, porn, canned messages, etc.)
New type Threat
• Rumors, Media manipulation, sign up, vote stuffing, etc.
• Fake News
• Crowdturfing = CrowdSourcing + Astroturfing
Outline
➢ Suitable Target, Lifecycle Analysis
➢ Multiple Accounts Detection
➢ Geolocation Identification
➢ Personal words
Facebook.com/63811549237/posts/10153038271604238
2014-12-19, 03:06 am GMT
Social Media: Climate Change
Total: 609 comments
Suitable Targets Problem
Given any post thread p on a social media platform, predict whether p contains at least one malicious comment, using a classifier c: p → {target, nontarget}
Key idea: Life Cycle of Posts
10 hrs
Definitions
➢ Time Series (TS)
• TScreated(post): the time at which the original post is published
• TSj: a time period j following the creation time of the original post
• TSfinal: the end of our observation
➢ Accumulated Number of participants (AccNcomment)
• The number of comments on the post between TS(i-1) and TSi
➢ Discussion Atmosphere Vector (DAV): the vector of AccNcomment values over consecutive time windows
Example
TScreated(Climate) = 2014-12-19 03:06:42
Suppose j = 5 minutes, final = 120 minutes
DAV(Climate) = [# of comments 03:06:42 ~ 03:11:42 (1st),
# of comments 03:11:42 ~ 03:16:42 (2nd),
…,
# of comments 05:01:42 ~ 05:06:42 (24th)]
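The DAV example can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function name, its parameters, and the comment timestamps are made up for the example. Only the creation time, the 5-minute window, and the 120-minute observation period come from the slide.

```python
from datetime import datetime, timedelta


def discussion_atmosphere_vector(created, comment_times,
                                 window_minutes=5, final_minutes=120):
    """Count comments per fixed-size time window after the post is created."""
    n_windows = final_minutes // window_minutes
    dav = [0] * n_windows
    for t in comment_times:
        offset = (t - created).total_seconds()
        # Ignore comments outside the observation period [0, final).
        if 0 <= offset < final_minutes * 60:
            dav[int(offset // (window_minutes * 60))] += 1
    return dav


# Creation time from the slide; the comment offsets below are hypothetical.
created = datetime(2014, 12, 19, 3, 6, 42)
comments = [created + timedelta(minutes=m) for m in (1, 2, 7, 119, 121)]
dav = discussion_atmosphere_vector(created, comments)
```

With j = 5 and final = 120 the vector has 24 entries, matching the 1st to 24th windows listed above; a comment arriving after 05:06:42 is dropped.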
Dataset
2011~2014 Ten Main Media pages on
Facebook
Totally 42,703,463
10/18/2021 17
Feature Engineering
➢ # of comments, # of likes, # of shares
➢ Spanning time (last comment time - first comment time)
➢ Temporal features over a delta time window, up to a final observation time
➢ Context-free: no need for Natural Language Processing
Time elapsed: 1st comments, 1st likes, 1st shares
Results
Near Real Time
Discussion: Do you understand Facebook enough?
• Attackers’ preference
• Selected by Facebook
• Audience reaction
• Bandwagon Effect
• Rich get Richer
• People love biased and debate-provoking posts
Life Cycle and Influence Ratio
CNN 2012 all post threads
>70%
mURL
DAV Predict IR (1/2)
DAV Predict IR (2/2)
Accounts Activity within a week around election date
Active = Count(Activities) within 1 week >= threshold
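The "Active" rule can be written as a one-week counting check. This is a hedged sketch: the function name, the week boundaries, and the sample timestamps are assumptions for illustration, and the slide does not state the threshold value.

```python
from datetime import datetime, timedelta


def is_active(activity_times, week_start, threshold):
    """Return True if the account has at least `threshold` activities in the week."""
    week_end = week_start + timedelta(weeks=1)
    count = sum(1 for t in activity_times if week_start <= t < week_end)
    return count >= threshold


# Hypothetical account activity around a hypothetical week start.
week_start = datetime(2016, 11, 1)
activity = [week_start + timedelta(days=d) for d in (0, 2, 5, 9)]
active_at_3 = is_active(activity, week_start, threshold=3)
active_at_4 = is_active(activity, week_start, threshold=4)
```

Three of the four sample activities fall inside the week, so the account counts as active at threshold 3 but not at threshold 4.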
Clinton: 1st week, 2nd week
Trump: 1st week, 2nd week
All accounts: Periodic
Attacker accounts: Random
Conclusion
Suitable targets can be predicted successfully with temporal features
• Attackers: Follow or not?
• Defenders: Deploy resources
Temporal analysis with different variables
• Influence Ratio: increase or decrease in the next time window?
• 24-hour pattern: links online and offline behavior
Outline
➢ Suitable Target, Lifecycle Analysis
➢ Multiple Accounts Detection
➢ Geolocation Identification
➢ Personal words
Semi-Supervised Learning on Graphs
Motivation of detecting multiple accounts on FB
Crawler 1, Crawler 2, Crawler 3 → Facebook API
When calling the Facebook API, the API gives each crawler a different scope ID. As a result, the same user appears with different scope IDs in the dataset.
100003468896671 高婷婷 https://www.facebook.com/mayuko.sakamoto.503
100004123536871 賴婷婷 https://www.facebook.com/profile.php?id=100004123536871
100003251795795 陳婷婷 https://www.facebook.com/rika.etoh
100000681128139 高婷婷 https://www.facebook.com/vincenzo.muscari.5
100002630019886 陳婷婷 https://www.facebook.com/sven.erkens.98
813243492 高婷婷 https://www.facebook.com/profile.php?id=813243492
Ting-Ting’s Family
Number of friends allowed by Facebook: 5000
100003468896671 高婷婷 45xx
100004123536871 賴婷婷 45xx
100003251795795 陳婷婷 4xxx
Multiple Accounts Detection using
Semi-Supervised Learning on Graphs
When crawling data from FB with multiple crawlers, the API gives each crawler a scope ID for a user instead of the user’s primary ID.
For example, a user’s primary ID is mohamed.aimane.98. He has multiple scope IDs: 1815396745342476, 1815402648675219, 1815411572007660, 1815468805335270, 1815515615330589, 1815482155333935, 1815488781999939, and 1816157185266432. This implies that mohamed’s data was crawled by 8 different crawlers.
As a result, our dataset contains many IDs that share the user name mohamed aimane.
Problem: Given 2 scope IDs with the same user name, do they belong to the same user (same primary ID) or not?
Motivation of detecting multiple accounts on FB
Graph Construction
U: {Users}, V: {Pages}, edge (u, v): user u had an activity on page v
Activities
Main Algorithms
Unsupervised learning using Katz Similarity
P_xy(i) = (x, x1, x2, …, y): a path of length i from x to y
u1 and u2 are similar if their activity paths are similar
Katz similarity can be computed by:
$$S = \sum_{i=1}^{\infty} \beta^{i} M^{i} = (I - \beta M)^{-1} - I$$
where M is the adjacency matrix of graph G, $\beta$ is a scalar smaller than $1/\lVert M \rVert_2$ to ensure convergence, and I is the identity matrix.
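A minimal sketch of Katz similarity on a toy user-page graph follows. It sums the series β^i M^i up to a fixed path length rather than inverting a matrix, so it needs only the standard library. The toy graph, the value of β, and the truncation length are assumptions for illustration, not values from the talk.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def katz_similarity(adj, beta, max_len=50):
    """Approximate sum_{i>=1} beta^i M^i by truncating at paths of length max_len."""
    n = len(adj)
    total = [[0.0] * n for _ in range(n)]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_len):
        # term becomes (beta * M)^i for the next path length i.
        term = [[beta * v for v in row] for row in mat_mul(term, adj)]
        total = [[t + x for t, x in zip(t_row, x_row)]
                 for t_row, x_row in zip(total, term)]
    return total


# Hypothetical graph: nodes 0-2 are scope IDs (users), nodes 3-4 are pages.
# Users 0 and 1 were both active on page 3; user 2 was active on page 4.
edges = [(0, 3), (1, 3), (2, 4)]
adj = [[0.0] * 5 for _ in range(5)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1.0

S = katz_similarity(adj, beta=0.1)
```

Users 0 and 1 share a page, so S[0][1] is positive, while user 2 has no path to user 0 and S[0][2] stays at zero. A threshold on S then decides which scope IDs are merged.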
Main Algorithms
Unsupervised learning using Katz Similarity
Katz matrix is
1    0.9  0.2  0.3
0.9  1    0.5  0.5
0.2  0.6  1    0.8
0.3  0.5  0.8  1
The threshold we use is 0.8.
Then the 1st and 2nd nodes belong to the same user, and the 3rd and 4th nodes belong to the same user; the others do not.
Example of Algorithm 1
Main Algorithms
Semi-Supervised Method using Graph Embedding
Classical ML Tasks in Networks
• Node Classification
• Predict type of a node
• Link Prediction
• Predict friends
• Community Detection
• Network Similarity
• How similar are two networks?
Node2vec(1/4)
Many Possible ways:
• PageRank score, Degree, centrality, # of edges…etc.
Features
Node2vec(2/4)
Mixture of BFS and DFS
BFS --- LocalView (u and S1)
DFS --- GlobalView (u and S6)
Node2vec(3/4)
• Two parameters:
  • Return parameter p:
    • Controls how likely the walk is to return to the previous node
  • In-out parameter q:
    • Moving outwards (DFS) vs. inwards (BFS)
    • The ratio of BFS vs. DFS
• Biased 2nd-order random walks explore network neighborhoods.
Parameters
Node2vec(4/4)
• Simulate r random walks of length l starting from each node u
• Optimize the node2vec objective using Stochastic Gradient Descent
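The walk-simulation step can be sketched as below. It uses the standard node2vec weighting, where an edge back to the previous node gets weight 1/p, an edge to a common neighbor of the previous node gets weight 1, and any other edge gets weight 1/q. The toy graph and the parameter values are assumptions, and the SGD step that turns walks into embeddings is not shown.

```python
import random


def node2vec_walk(graph, start, length, p, q, rng):
    """Simulate one biased 2nd-order random walk that visits `length` nodes."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(graph[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)   # return to the previous node
            elif nxt in graph[prev]:
                weights.append(1.0)       # stay close to prev (BFS-like)
            else:
                weights.append(1.0 / q)   # move outwards (DFS-like)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical undirected graph as an adjacency dict of sets.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
rng = random.Random(0)
r, l = 3, 6  # r walks of length l from each node
walks = [node2vec_walk(graph, u, l, p=1.0, q=0.5, rng=rng)
         for u in sorted(graph) for _ in range(r)]
```

A small q makes outward moves more likely, so walks look more like DFS; a large q keeps them near the start node, closer to BFS.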
Embedding for node 1: (0.1, 0.3, 0.2, 0.4); embedding for node 2: (0.2, 0.3, 0.2, 0.4)
We sample some ground truth, e.g., that node 1 and node 2 belong to the same user, etc.
L looks like: ((1,2), 1), ((1,3), 0), ((2,3), 0), ((2,4), -1), ((3,4), -1), …
X is derived from the embeddings: for example, ((1,2), (0.1, 0, 0, 0)), …
Then we feed X and L into a label spreading model, which predicts that the 1st and 2nd nodes belong to the same user, and the 3rd and 4th nodes belong to the same user; the others do not.
Example of Algorithm 3
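The construction of X and L can be sketched as follows. The X entry for pair (1,2) on the slide matches the element-wise absolute difference of the two embeddings, so the sketch assumes that is how pair features are built. Reading -1 as "unlabeled" is also an assumption (it is the convention of common label spreading implementations), and the embeddings for nodes 3 and 4 are made up.

```python
def pair_feature(emb_a, emb_b):
    """Element-wise absolute difference of two embeddings (assumed pair feature)."""
    return tuple(round(abs(a - b), 10) for a, b in zip(emb_a, emb_b))


embeddings = {
    1: (0.1, 0.3, 0.2, 0.4),  # from the slide
    2: (0.2, 0.3, 0.2, 0.4),  # from the slide
    3: (0.7, 0.1, 0.9, 0.2),  # hypothetical
    4: (0.7, 0.2, 0.9, 0.2),  # hypothetical
}

# 1 = same user, 0 = different users, -1 = unlabeled (assumed meaning).
labels = {(1, 2): 1, (1, 3): 0, (2, 3): 0, (2, 4): -1, (3, 4): -1}

X = [pair_feature(embeddings[a], embeddings[b]) for (a, b) in labels]
L = list(labels.values())
```

X and L are then in the row-per-pair shape a semi-supervised classifier expects: labeled pairs guide the model, and the pairs marked -1 receive predicted labels.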
Main Algorithms
Different measurement of Embedding Vectors
Experiments and Evaluation
Comparison among the Three Methods
Two simple datasets : dataset 1: 188 nodes and 262 activities (links);
dataset 2: 4188 accounts and 6715 activities(links).
Outline
➢ Suitable Target, Lifecycle Analysis
➢ Multiple Accounts Detection
➢ Geolocation Identification
➢ Personal words
Page Information and Page-like Graph
Page-like graph (example): Sports Illustrated, Golden State Warriors, Oakland Museum, and Giving Tuesday, connected by "like" edges
Field          Example
Page ID        47657117525
Name           Golden State Warriors
Category       Sports Team
Country        United States
Fan Count      11,019,236
Description    The Official Facebook page of the Golden State Warriors
• Facebook public pages are public profiles used by local businesses, companies, organizations or public figures
Likes: promoting other pages to community participants
Data Collection
Facebook Graph API version 2.8 was used to collect our data [1]
• 38,831,367 pages (for this work)
• 2,430,873 US
• 12,685,090 other countries
• 23,715,404 unknown
[1] https://developers.facebook.com/docs/graph-api/reference/page
Majority Vote Algorithm
• Location is designated as state information in this scenario
• The location label is determined by the most votes
• Overall accuracy is only 59.4%
• This algorithm works well in the page nationality prediction task, with 90.25% accuracy
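A majority vote over neighboring pages can be sketched as below. The slide does not say which pages cast votes, so this sketch assumes the voters are the pages adjacent in the page-like graph whose location is already known. All names and data here are hypothetical.

```python
from collections import Counter


def majority_vote(page, like_graph, known_location):
    """Label a page with the most common known location among its neighbors."""
    votes = Counter(known_location[n]
                    for n in like_graph.get(page, ())
                    if n in known_location)
    if not votes:
        return None  # no neighbor with a known location
    return votes.most_common(1)[0][0]


# Hypothetical page-like graph and known state labels.
like_graph = {"P": ["n1", "n2", "n3", "n4"], "Q": ["n5"]}
known_location = {"n1": "California", "n2": "California", "n3": "Nevada"}

label_p = majority_vote("P", like_graph, known_location)
label_q = majority_vote("Q", like_graph, known_location)
```

Page P gets the label held by most of its labeled neighbors, while Q has no labeled neighbor and stays unlabeled.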
Baseline Algorithm
Utilizes the locality of states to find pages belonging to their corresponding states
• Pick out anchored pages with a local property as multiple seeds from which to start BFS
Target classifier: 51 classes
• 50 classes for US states and one class for “others (OT)”
State Distance Vector (SDV)
Seeds: Alabama, Alaska, Arkansas, Arizona, ……, Wyoming
Example for a page P: IHOP(P, S_Arizona) == 4, OHOP(P, S_Arizona) == 3
31M+ nodes, 600M+ edges
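The State Distance Vector can be sketched with plain BFS. This is an illustration under assumptions: OHOP is taken as the hop distance from a seed along outward like-edges, IHOP as the distance along reversed (inward) edges, and -1 marks an unreachable page. The graph and seed names are hypothetical, and rerunning BFS per page is only acceptable at toy scale.

```python
from collections import deque


def hop_distances(adjacency, seed):
    """BFS hop distance from seed to every reachable node."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def state_distance_vector(page, out_edges, seeds, unreachable=-1):
    """Concatenate (IHOP, OHOP) to each seed; -1 marks an unreachable page."""
    in_edges = {}
    for src, dsts in out_edges.items():
        for dst in dsts:
            in_edges.setdefault(dst, []).append(src)
    sdv = []
    for seed in seeds:
        ohop = hop_distances(out_edges, seed).get(page, unreachable)
        ihop = hop_distances(in_edges, seed).get(page, unreachable)
        sdv.extend([ihop, ohop])
    return sdv


# Hypothetical like-edges: seed_AZ -> a -> P -> seed_CA.
out_edges = {"seed_AZ": ["a"], "a": ["P"], "P": ["seed_CA"]}
sdv = state_distance_vector("P", out_edges, ["seed_AZ", "seed_CA"])
```

Here P is two outward hops from the Arizona seed and one inward hop from the California seed, so sdv is [-1, 2, 1, -1]. The vector is then used as the feature input of the classifier.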
Anchor Page Selection (1/2)
Effectiveness of BFS-based algorithms
• It depends on anchored page selection
Anchored pages have to be local so that the SDV reflects the true tendency of a page’s locality
Suitable examples (focusing on local communities)
• State universities, government, park or police organizations
Ill-suited examples (popular and thus having global impact)
• NBA, MLB, or NFL sports teams
Anchor Page Selection (2/2)
• We adopt all subsidiary pages of “OnlyInYourState.com” as a set of anchored pages
• It has a distinct page for each state
• Each subsidiary page mostly connects local communities
Page Name                      Page ID
Only In Alabama                783744898386760
Only In Alaska                 686107314826906
Only In Southern California    1840349052857006
Only In Northern California    856450181102963
Idaho Only                     435099846671531
Only In New York               386608421546055
Only In Virginia               1560515737540492
Only In West Virginia          1509709509286532
Only In Wisconsin              1390297064627420
Only In Wyoming                1724174364476381
51 Anchors
Arizona
Northern
California
Advanced Algorithm
Drawback of the baseline algorithm
• A local page can have a few connections to pages far beyond its own region
• This kind of connection noise can substantially reduce prediction accuracy
State Neighborhood Probability (SNP)
Both SDV and SNP are taken as feature vectors for ML models
• Utilize locality and neighborhood context for better identification
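One way to read the State Neighborhood Probability is as the share of a page's inward edges that come from each region. The slides do not give the formula, so the normalization below (edges from one region divided by all inward edges with a known region) is an assumption, as are the names and data.

```python
from collections import Counter


def neighborhood_probability(page, in_neighbors, region_of, regions):
    """Fraction of the page's labeled inward neighbors that fall in each region."""
    counts = Counter(region_of[n]
                     for n in in_neighbors.get(page, ())
                     if n in region_of)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(regions)
    return [counts[r] / total for r in regions]


# Hypothetical inward neighbors of page P and their known regions.
in_neighbors = {"P": ["a", "b", "c", "d"]}
region_of = {"a": "CA", "b": "CA", "c": "CA", "d": "NY"}
regions = ["CA", "NY", "TX"]

snp = neighborhood_probability("P", in_neighbors, region_of, regions)
```

A single far-away connection (the NY neighbor here) only shifts the vector slightly, which is why this feature is less sensitive to connection noise than hop distance alone.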
Dataset
California accounts for 20% of all US pages, and about half of all pages (49.49%) are located in the top 5 states
• California, New York, Florida, Illinois, and Texas
Accuracy Summary
Classifier                      Precision   Recall   F1 score
Naive Bayes (Baseline BFS)      0.44        0.27     0.26
Adaboost (Baseline BFS)         0.46        0.40     0.37
Random Forest (Baseline BFS)    0.69        0.69     0.68
Random Forest (Advanced BFS)    0.89        0.88     0.88
Outline
➢ Suitable Target, Lifecycle Analysis
➢ Multiple Accounts Detection
➢ Geolocation Identification
➢ Personal words
Future Trends -- IT
Thank you!
Q & A
Yuntech present

Editor’s notes

  1. Top-down, Authoritative, vs. distributed, skim SFW – “Editor/Reporter” and “reader”
  2. Sometimes it’s hard to evaluate “spamming” New SFW – Likefarm? Is that ContentFarm?
  3. Every principle has its mind, reason, everything has its causality
  4. SFW – we need to have a better organized presentation for problems. SFW – the defenders concern might be different – we need to consider the risk factor
  5. Shelf Life, skim messages, can “catch” ones eyes only , enlarge the influence https://www.facebook.com/barackobama/posts/10151673679836749 https://www.facebook.com/cnn/posts/313652498762911 SFW – ask the audience “which post has higher prob to be attacked”?
  6. SFW – watch out for the transition into this slide. SFW – do you want to provide one example for all or most of the slides? SFW – I feel that you should give an example to explain. SFW – Definition**s**
  7. SFW – how to interpret 10 minutes? (what is the total time and attack time)? Naive Bayes: the DAV entries are not independent of each other. Adaboost: not good with outliers; number of estimators = 50 and learning rate = 1. Decision Tree: good for social network data; we set minimum samples split = 2 and minimum samples leaf = 1; as for depth, nodes are expanded until all leaves are pure.
  8. 1. Is IR learnable? 2. There is no difference between Light and Critical malicious URLs, since their performance is quite similar. 3. The recall for “increase” is high.
  9. SFW – explain “Exact time after last attack”
  10. Why do you choose similarity? Fast
  11. Read the slide
  12. Our first thought is majority vote algorithm
  13. where IHOP(Page,Si) denotes hop distance between page and seed Si, using inward edges as connection for BFS; OHOP(Page, Si) denotes hop distance between page and seed Si, using outward edges as connection for BFS.
  14. In particular, since California is much larger than other states in terms of population and economy, “OnlyInYourState.com” splits California into Northern and Southern regions, as shown in the table. Therefore, both “Only In Northern California” and “Only In Southern California” are used as anchored pages to calculate IHOP(Page, Si) and OHOP(Page, Si), in addition to the other forty-nine anchored pages. Hence N_anchored_pages is set to 51. Furthermore, since “Only In Idaho” had already been registered, OnlyInYourState.com named its Idaho counterpart “Idaho Only” instead. In general, involving more anchored pages enlarges the BFS coverage of pages.
  15. This probability is not high; however, the baseline BFS-based ML algorithm only cares about the hop distances to the anchored pages. where INP(Page,Ri) denotes inward neighborhood location probability between this page and the adjacent pages belonging to the region Ri; where IE(Page,Ri) is the number of inward edges between this page and the adjacent pages belonging to the region Ri;
  16. We took the pages with declared location information of country and city as ground truth data. Few pages are excluded because their city names exist in multiple states, which can result in ambiguous city-to-state mapping. There are 29,849 cities in total in the US.
  17. The training set used 80% of the data and the test set used the rest. Since the number of classes is rather large, the Random Forest classifier is preferred over the Gradient Boosting classifier [23]. The default parameter sets were applied when using the implementations available in the scikit-learn package [54]. As shown in Table 4.2, the precision, recall, and F1 score of the Random Forest classifier are at least 20% better than those of the Naive Bayes classifier and the Adaboost classifier. Thus in the following, we only present results obtained with the Random Forest classifier. The baseline BFS-based ML algorithm with the Random Forest classifier achieved 69% accuracy, which is 10% better than the accuracy of the majority vote algorithm. With the addition of SNP, the advanced BFS-based ML algorithm reached 89% prediction accuracy, a 20% improvement over the baseline.
  18. SFW – what have been done? Whether you can justify some of your work is fundamental and not just incremental and applied? SFW – balance between contributions to CS versus Social Science