Online Social Networks (OSNs) are popular platforms for online users. Users typically register and maintain accounts (user identities) on several OSNs to share a variety of content and stay connected with their friends. Consequently, linking user identities across OSN platforms, referred to as user identity linkage (UIL), becomes a critical problem. Solving this problem gives a more comprehensive view of a user’s activities across OSNs, which is highly beneficial for targeted advertisements, recommendations, and many other applications. In this thesis, we propose approaches for analyzing data collection methods, investigating biases in identity linkage datasets, linking user identities across social networks, giving users control over the linkability of their identities, and applying user identity linkage solutions to related problems.
User Identity Linkage: Data Collection, Dataset Biases, Method, Control and Application
1. User Identity Linkage: Data Collection, Dataset Biases, Method, Control and Application
Rishabh Kaushal
PhD15008
Committee Members:
Prof. Sanjay Jha
Dr. Alessandra Sala
Prof. Anwitaman Datta
Prof. Ponnurangam Kumaraguru (PK), Advisor
PhD Defense Presentation
2. Who Am I?
Sponsored PhD Student, Precog Research Group, IIIT, Delhi.
Serving as Assistant Professor, IT Dept, IGDTUW.
MS by Research from IIIT, Hyderabad.
Research Interest: Social Computing.
4. Identity in Physical World
Figure: one person holds several identities in the physical world (student, teacher, software engineer, father).
5. Identity in Online World
An identity has three dimensions: profile, content, and network.
A user joins multiple social networks.
Figure: the world of social networks (professional, personal, news).
6. Problem: User Identity Linkage (UIL)
UIL refers to the problem of determining whether two input user
identities, taken from two different social networks A and B, belong to
the same person or not.
$(I_a, I_b)$: Linked User Identity Pair
9. Thesis Statement
“Computational approaches can be proposed for the analysis of data
collection methods, investigation of biases in identity linkage datasets,
linkage of user identities across social networks, control-ability of user
identity linkage, and application of user identity linkage solution to
solve extraneous problems.”
10. Outline of Talk
Accepted at 12th IEEE International Conference on Social Computing (SocialCom 2019). Xiamen, China.
12. Social Aggregation (SA)
Social aggregation platforms are sites on which users create an account and
provide details of their accounts on multiple social networks.
Perito et al. → Google profiles, Liu et al. → About.me profiles
13. Cross Platform Sharing
Cross-platform sharing refers to the user behavior of posting the same
content across multiple social networks (Correa et al.).
14. Self Disclosure
On their profile page, users themselves disclose their identity on
another social network platform (Chen et al.).
18. Data Collection - Conclusion
Computational approaches to collect linked user identity pairs can be
implemented.
Each data collection method depends on a particular user behavior,
which is leveraged to collect the linked identities of that user.
19. Outline of Talk
Accepted at 35th ACM/SIGAPP Symposium on Applied Computing (SAC 2020). Brno, Czech Republic.
20. Why study dataset biases?
Every data collection approach depends on the typical behaviors of
users who maintain identities across multiple social networks.
As a consequence, the behavioral biases exhibited by these users are
manifested in the resulting user identity linkage datasets.
21. Scope of our work
We focus on two identity linkage datasets, SD and CPS, collected on
Twitter and Instagram by leveraging two user behaviors: self-disclosure
and cross-platform posting, respectively.
(1) Detection & Impact: Do dataset biases exist? What is the impact
of dataset biases on ML models?
(2) Quantification: How can the amount of dataset bias be measured?
22. UIL as Supervised Learning Problem
Negative Class Generation: Unlinked user identity pairs, i.e. user identities that do not
belong to the same person, are created in two ways: random pairing and similar pairing.
1. Jaccard Similarity on the ‘username’ of a user identity pair.
2. Edit Distance on the ‘display name’ of a user identity pair.
+ve pair: (rishabhk_, rk.iiit)
-ve pairs: (rishab, rk.iiit), (rahul, rk.iiit)
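The two similarity measures named above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the slides do not say how usernames are tokenized or whether edit distance is normalized, so character bigrams and division by the longer string's length are assumptions made here, and all function names are hypothetical.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of a lowercased string (assumed tokenization)."""
    text = text.lower()
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a, b, n=2):
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def edit_distance(a, b):
    """Levenshtein distance, computed with a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization), in [0, 1]."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    # The example pairs from the slide: one linked pair and two unlinked pairs.
    for pair in [("rishabhk_", "rk.iiit"), ("rishab", "rk.iiit"), ("rahul", "rk.iiit")]:
        print(pair, jaccard_similarity(*pair), normalized_edit_distance(*pair))
```

Under these assumptions, a Jaccard value near 1 and an edit distance near 0 both indicate that a user reuses nearly the same name on both networks.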
24. User Behavioral Features
Jaccard Similarity (JS) on usernames
50% of user identity pairs from SD have a JS value of 0.9, as opposed to
only 23% from CPS.
Figure (y-axis: proportion of users).
25. User Behavioral Features
Edit Distance (ED) on display names
58% of user identity pairs obtained through SD have display names with an
ED of 0.0, as compared to 35% from CPS.
Figure (y-axis: proportion of users).
26. Impact of biases on model
Experiments were run in two ways: (1) same dataset for training and testing;
(2) different datasets for training and testing.
Across all learning algorithms adopted, the precision of models trained and tested on the same
dataset is better than that of models trained and tested on different datasets.
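The same-dataset versus cross-dataset protocol can be sketched as a small grid of train/test combinations. This is only an illustration under stated assumptions: a single-feature threshold classifier stands in for the learning algorithms used in the thesis (which the slides do not list), and the function names and the `datasets` layout are hypothetical.

```python
def precision(y_true, y_pred):
    """Fraction of predicted-linked pairs that are truly linked."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0


def fit_threshold(scores, labels):
    """Stand-in model: pick the similarity cut-off with the best training accuracy."""
    best_threshold, best_accuracy = 0.0, -1.0
    for threshold in sorted(set(scores)):
        predictions = [1 if s >= threshold else 0 for s in scores]
        accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold


def evaluate(train, test):
    """Train on one (scores, labels) split and report precision on another."""
    threshold = fit_threshold(*train)
    scores, labels = test
    return precision(labels, [1 if s >= threshold else 0 for s in scores])


def precision_grid(datasets):
    """datasets maps a name (e.g. "SD", "CPS") to {"train": (scores, labels), "test": (scores, labels)}.

    Returns precision for every (train dataset, test dataset) combination.
    """
    return {(a, b): evaluate(datasets[a]["train"], datasets[b]["test"])
            for a in datasets for b in datasets}
```

The diagonal entries of the grid, such as `("SD", "SD")`, correspond to setting (1), and the off-diagonal entries, such as `("SD", "CPS")`, to setting (2).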
27. Quantification of Bias
We have detected behavioral biases in user identities, characterized
them and measured their impact on identity linkage models.
We propose a design that quantifies these biases by leveraging a
well-established discrimination measurement approach, namely
‘situational testing’ (ST).
29. Applying ST to quantify biases
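Each element of situational testing (left) maps to an element of the identity linkage setting (right):
Data Record: Person → User Identity Pair
Protected Attribute: Gender (male or female) → Data Collection Method (SD or CPS)
Class Label: (Selected / Not-Selected) → (Linked / Not-Linked)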
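One way to read this mapping is through the k-nearest-neighbour form of situational testing: for a record, compare how often its label occurs among its nearest neighbours collected by the same method against its nearest neighbours collected by the other method. The sketch below assumes that form; the Euclidean distance, the value of `k`, the record layout, and all names are illustrative assumptions, not details taken from the slides.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def share_with_label(record, pool, label, k):
    """Share of the k nearest neighbours of `record` in `pool` that carry `label`."""
    neighbours = sorted(
        pool, key=lambda other: euclidean(record["features"], other["features"])
    )[:k]
    if not neighbours:
        return 0.0
    return sum(1 for n in neighbours if n["label"] == label) / len(neighbours)


def t_value(record, records, k=3):
    """Difference between the two neighbour shares; 0 means no measured bias.

    Each record is a dict with "features" (list of floats), "label"
    (1 = linked, 0 = not linked) and "method" ("SD" or "CPS").
    """
    same_method = [r for r in records
                   if r is not record and r["method"] == record["method"]]
    other_method = [r for r in records if r["method"] != record["method"]]
    return (share_with_label(record, same_method, record["label"], k)
            - share_with_label(record, other_method, record["label"], k))
```

A positive value says that similar pairs from the same collection method share the record's label more often than similar pairs from the other method; a negative value says the opposite.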
30. Results
RQ: Are both decision classes (linked and unlinked) equally affected by biases?
A t-value of 0 means no bias.
However, the probability distributions of t-values are spread over both the
positive (t > 0) and negative (t < 0) sides, which indicates that
behavioral biases affect many data records.
31. Dataset Biases - Conclusion
Behavioral biases exist in identity linkage datasets. They can be
detected and quantified.
We recommend collecting linked user identities using more than one
data collection method.
Mitigation of biases in identity linkage datasets remains an open problem.
32. Outline of Talk
Accepted at International School & Conference on Network Science (NetSciX, 2020), Tokyo, Japan.
33. Proposed: NeXLink Framework
Can we obtain effective node representations such that node embeddings of users
belonging to Cross-Network Linkages (CNLs) are closer in embedding space than
other nodes?
34. More formally
The goal of the embedding function is to transform user identities $u_i^X$ and $u_j^Y$ into
low-dimensional vectors $z_i^X$ and $z_j^Y$ of size $d$ such that, if $u_i^X$ and $u_j^Y$ belong to the
same person, their embedding vectors $z_i^X$ and $z_j^Y$ are close in the embedding
space, and far apart otherwise.
35. NeXLink Framework
Structural similarities of nodes within their respective networks are preserved.
Similarities of nodes across the two networks are preserved based on common friendship relations.
36. Local Node Embeddings*
The joint probability of $u_i^X$ and $u_k^X$ is defined from their embedding vectors $z_i^X$ and $z_k^X$.
The empirical probability between $u_i^X$ and $u_k^X$ within the same network is defined by
their normalized edge weights.
Optimization: Minimize the KL-divergence between these two distributions.
* LINE algorithm: Tang et al.
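For reference, the first-order proximity objective of LINE has the form below. That the slide uses exactly this form is an assumption based on its reference to LINE; the edge weight $w_{ik}$ and the edge set $E^X$ of network X are notation borrowed from LINE rather than from the slide.

```latex
% Joint probability: sigmoid of the inner product of the two embedding vectors
p(u_i^X, u_k^X) = \frac{1}{1 + \exp\left(- z_i^X \cdot z_k^X\right)}

% Empirical probability: edge weight normalized by the total edge weight of network X
\hat{p}(u_i^X, u_k^X) = \frac{w_{ik}}{\sum_{(m,n) \in E^X} w_{mn}}

% Minimizing KL(\hat{p} \,\|\, p) and dropping constant terms gives the objective
O^X = - \sum_{(i,k) \in E^X} w_{ik} \log p(u_i^X, u_k^X)
```

The same objective would be written for network Y with its own embeddings and edge weights.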
37. Global Node Embeddings
To construct global node embeddings, we construct a
global graph (G) as follows.
$G(V) = V^X \cup V^Y$
$G(E) = \mathrm{CNL} \cup \mathrm{NCNL}$
Positive Edge Generation (CNL): Linked identity pairs
belonging to the same person across social networks.
Negative Edge Generation (NCNL): For every node pair $(u_i^X, u_j^Y)$, we perform a random
walk of length $t$ starting at node $u_i^X$ and add $(u_i^X, u_k^Y)$ to NCNL (Non Cross Network Links)
if $u_k^Y$ appears in the random walk.
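The negative edge generation step can be sketched as below. Several details are assumptions, since the slide does not state them: the walk is taken over the union of both networks' friendship edges and the CNL edges (otherwise a walk starting in X could not reach Y), walks start only from the X side of known linked pairs, and true linked pairs are excluded from NCNL. Node identifiers are assumed to be strings that are unique across the two networks, and all names are hypothetical.

```python
import random


def build_adjacency(edges_x, edges_y, cnl):
    """Undirected adjacency over both networks plus the cross-network links."""
    adjacency = {}
    for a, b in list(edges_x) + list(edges_y) + list(cnl):
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def random_walk(adjacency, start, length, rng):
    """Uniform random walk of at most `length` steps starting at `start`."""
    walk = [start]
    for _ in range(length):
        neighbours = sorted(adjacency.get(walk[-1], ()))  # sorted for reproducibility
        if not neighbours:
            break
        walk.append(rng.choice(neighbours))
    return walk


def negative_edges(edges_x, edges_y, cnl, nodes_y, walk_length=5, seed=0):
    """Collect (x, y) pairs where y is reached by a walk from x but is not linked to x."""
    rng = random.Random(seed)
    adjacency = build_adjacency(edges_x, edges_y, cnl)
    linked = set(cnl)
    ncnl = set()
    for node_x, _ in cnl:
        for visited in random_walk(adjacency, node_x, walk_length, rng):
            if visited in nodes_y and (node_x, visited) not in linked:
                ncnl.add((node_x, visited))
    return ncnl
```

Because the walk stays close to its start node, the negatives produced this way are nearby non-matching identities rather than arbitrary ones.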
38. Global Node Embeddings
To learn node embeddings, we perform biased walks (node2vec*) whose transition
probabilities are guided by the common friends (CF) metric.
* node2vec algorithm: Grover et al.
39. Datasets
We evaluated the NeXLink framework on two datasets.
Augmented Dataset: Two sub-graphs sampled from a large Facebook friendship
network comprising 63,713 nodes and 817,090 edges (Man et al.).
Real-world Dataset: Twitter (5,120 users and 130,575 edges) and Instagram (5,313
users and 54,233 edges) with 1,288 common users (Kong et al.).
40. Evaluation Metric
For a given node $u_i^X$, our goal is to find the node $u_j^Y$ that belongs to the same person.
Therefore, we count a hit if $z_j^Y$ is present among the top-k node embeddings, ordered by
cosine similarity.
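The hit counting described above can be sketched as follows. The slide only defines a single hit; reporting the fraction of hits over all ground-truth pairs is an assumption, and the dictionary-based layout and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, candidates, k):
    """Names of the k candidate embeddings most similar to `query`."""
    ranked = sorted(candidates,
                    key=lambda name: cosine(query, candidates[name]),
                    reverse=True)
    return ranked[:k]


def hit_rate_at_k(embeddings_x, embeddings_y, ground_truth, k):
    """Fraction of X nodes whose true Y counterpart appears in their top-k list.

    embeddings_x / embeddings_y map node name -> vector;
    ground_truth maps an X node name to its linked Y node name.
    """
    if not ground_truth:
        return 0.0
    hits = sum(1 for node_x, node_y in ground_truth.items()
               if node_y in top_k(embeddings_x[node_x], embeddings_y, k))
    return hits / len(ground_truth)
```

Larger values of k make the criterion more lenient, so results for different k are best reported side by side.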
41. Evaluation - Comparison with others
We compare our proposed NeXLink (LINE-node2vec) framework against two
other approaches.
IONE: Input-Output Network Embedding for the task of network alignment.
REGAL: Representation Learning based Graph Alignment.
42. NeXLink Framework - Conclusion
A node representation learning based approach can effectively learn
embedding vectors for extracting linked user identities.
43. Outline of Talk
Accepted at 9th International Conference on Social Informatics (SocInfo, 2017), University of Oxford, UK.
44. Linkability Nudge
Can we help users control the linkability of their identities across social
networks?
We design and implement a linkability nudge: a gentle intervention that
helps users make an informed decision.
The user sets a range of linkability scores (thresholds) for each identity
pair (dynamic web portal).
Whenever the user's behavior goes beyond the pre-configured range, the
user is nudged (web browser extension).
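The range check behind the nudge can be sketched as below. How the linkability score is computed is not covered on this slide, so the score is taken as a given number; the class name, function name, and message wording are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkabilityRange:
    """Range of linkability scores a user accepts for one identity pair."""
    lower: float
    upper: float

    def contains(self, score):
        return self.lower <= score <= self.upper


def nudge_message(score, allowed):
    """Return a nudge text if `score` falls outside `allowed`, else None."""
    if allowed.contains(score):
        return None
    direction = "above" if score > allowed.upper else "below"
    return (f"Linkability score {score:.2f} is {direction} your chosen range "
            f"[{allowed.lower:.2f}, {allowed.upper:.2f}].")
```

In this sketch the portal would store one `LinkabilityRange` per identity pair, and the browser extension would call `nudge_message` before a post or profile change is submitted.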
49. Nudge Evaluation
Controlled lab experiment with a control period and a treatment period.
Participants were recruited and asked to perform tasks related to
making a post and changing a profile attribute.
We observed the impact of the linkability nudge on participants.
54. Contributions Summary
Performed comparative analysis of data collection methods.
Investigated biases in identity linkage datasets.
Proposed node embedding framework for user identity linkage.
Helped users control linkability of their identities across OSNs.
Applied UIL solution to detect clones and flag their behaviors.
55. Limitations & Future Directions
Data collection is a challenge. Other social media platforms, such as
Goodreads and Strava, need to be explored.
We employed situational testing to detect dataset biases. Other methods
from algorithmic fairness studies need to be explored.
Our NeXLink node embedding framework uses only network information.
Leveraging content and profile features could be helpful.
We performed a controlled lab study. Deploying the linkability nudge in field trials is future work.
56. Acknowledgements
PhD Advisor: Prof PK
Monitoring Committee: Prof Arun Balaji Buduru, Prof Rajiv Ratn Shah
Co-authors and Peers
Members of Precog
My family