Social network analysis in R (using Twitter data)

In the context of social network analysis, we have chosen to analyze social media data, specifically from Twitter, as the main theme of our project. Twitter users tweet, like, follow, and retweet, creating complex network structures. We analyze these network structures and visualize the relationships between individual users as a retweet network for a topic of interest.

Supervised by: Pr. Tarik AGOUTI
Done by: Ayoub NAINIA, Walid DARHOURI, Ayoub OUMALEK

Table of contents

1. Introduction: Case study
2. Social Network Analysis vocabulary
   2.1 Network vs Network Analysis
   2.2 Components of a network
   2.3 Directed vs Undirected Network
3. Application in social media
   3.1 Retweet network
   3.2 Create a retweet network of users
   3.3 Create the tweet data frame
   3.4 Create data frame for the network
   3.5 Include only retweets in the data frame
   3.6 Convert data frame to a matrix
   3.7 Create the retweet network
   3.8 View the retweet network
4. Network centrality measures
   4.2 Network centrality measures
       4.2.1 Degree centrality
             4.2.1.1 Degree centrality of a user
             4.2.1.2 Users who retweeted most
             4.2.1.3 Users whose posts were retweeted most
       4.2.2 Betweenness
             4.2.2.1 Identifying users with high betweenness
5. Visualizing Twitter retweet network
   5.1 Create the base network plot
   5.2 Format the plot
       5.2.1 Set vertex size based on the out-degree
       5.2.2 Follower count of network users
             5.2.2.1 Assign network attributes
             5.2.2.2 View vertex attributes
             5.2.2.3 Changing vertex colors
             5.2.2.4 View plot formatted with vertex attributes
6. Conclusion
7. Bibliography

Table of figures

Figure 1: Components of a network
Figure 2: Undirected network
Figure 3: Retweet network
Figure 4: The out-degree of a vertex (node)
Figure 5: The in-degree of a vertex
Figure 6: Illustrating betweenness
Figure 7: Plotted retweets network
Figure 8: Formatted plot of retweets network
Figure 9: Proportionating vertex size to the out-degree
Figure 10: Formatted plot displaying retweets network (followers, degree with size and color)
1. Introduction: Case study

Social Network Analysis (SNA) is the process of investigating social structures through the use of networks and graph theory. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties, edges, or links (relationships or interactions) that connect them. Examples of social structures commonly visualized through social network analysis include social media networks, information circulation, friendship and acquaintance networks, business networks, and knowledge networks. These networks are often visualized as graphs in which nodes are represented as points and ties as lines. Such visualizations provide a means of qualitatively assessing networks by varying the visual representation of their nodes and edges to reflect attributes of interest.

In the context of social network analysis, we have chosen to analyze social media data, specifically from Twitter, as the main theme of our project. Twitter users tweet, like, follow, and retweet, creating complex network structures. We analyze these network structures and visualize the relationships between individual users as a retweet network for a topic of interest.

2. Social Network Analysis vocabulary

2.1 Network vs Network Analysis

A network is a system of interconnected objects. Two classic examples of networks are a Local Area Network of computers and social media.

Network analysis, on the other hand, can be regarded as a set of techniques with a shared methodological perspective, which allow researchers to depict relations among actors and to analyze the social structures that emerge from the recurrence of these relations. (Antonio M. Chiesi, Network Analysis, in International Encyclopedia of the Social & Behavioral Sciences (Second Edition), 2015)

2.2 Components of a network

The two main components of a network are the nodes and the edges. The objects that are interconnected in a network are called nodes (or vertices); every user in a Twitter network is a node. The connections between these objects are called edges.

Figure 1: Components of a network.
2.3 Directed vs Undirected Network

There are two broad types of networks based on the information flow: directed and undirected networks.

When the edges of a network point in one direction, the network is called a directed network. In Figure 1 above, the relationship between nodes works in one direction only.

When the edges of a network have no direction, the network is an undirected network. In such networks the relationship between nodes works in both directions.

Figure 2: Undirected network

3. Application in social media

Twitter users tweet, like, follow, and retweet, creating complex network structures. Analyzing the structure and size of these networks helps identify the key players and influencers who are pivotal to transmitting information to a wide audience.

3.1 Retweet network

A retweet network is a network of Twitter users who retweet tweets posted by other users. It is a directed network in which the source node (or vertex) is the user who retweets and the target vertex is the user who posted the original tweet.

Figure 3: Retweet network.

Understanding the position of potential customers in a retweet network allows a brand to identify key players who are likely to retweet posts and so spread brand messaging.
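The difference between the two network types can be checked on a toy example. The following is a minimal sketch, assuming the igraph package is installed; the user names and the three retweets are invented for illustration and do not come from the project data.

```r
library(igraph)

# Hypothetical retweets: the user in column 1 retweets the user in column 2
toy_edges <- matrix(c("ana",  "brand",
                      "ben",  "brand",
                      "cleo", "ana"),
                    ncol = 2, byrow = TRUE)

# Build the same edge list once as a directed and once as an undirected network
directed_net   <- graph_from_edgelist(toy_edges, directed = TRUE)
undirected_net <- graph_from_edgelist(toy_edges, directed = FALSE)

is_directed(directed_net)                     # TRUE
are_adjacent(directed_net, "ana", "brand")    # TRUE: ana retweeted brand
are_adjacent(directed_net, "brand", "ana")    # FALSE: no edge in that direction
are_adjacent(undirected_net, "brand", "ana")  # TRUE: direction is ignored
```

In the directed version the edge keeps the "who retweeted whom" orientation described in section 3.1, which is what the in-degree and out-degree measures rely on.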
3.2 Create a retweet network of users

We will create a retweet network of users who retweet on the hashtag OOTD, which means "Outfit Of The Day". This hashtag is popular among young users for showing off their outfits and can be used by fashion brands to grab the attention of potential customers.

3.3 Create the tweet data frame

First, we extract 18000 tweets on the hashtag OOTD using search_tweets() and include retweets.

    # search_tweets() is provided by the rtweet package
    library(rtweet)

    # Create tweet data frame for tweets on #OOTD
    twts_OOTD <- search_tweets("#OOTD", n = 18000, include_rts = TRUE)

3.4 Create data frame for the network

Next, we create a subset data frame with the columns screen_name and retweet_screen_name from the extracted tweets. For the retweet network, the source node is the screen_name and the target node is the retweet_screen_name.

    # Create data frame for the network
    rt_df <- twts_OOTD[, c("screen_name", "retweet_screen_name")]

Some rows have NA (missing) values under retweet_screen_name; we will exclude these rows before proceeding further.

    # View the data frame
    head(rt_df, 10)

    # A tibble: 10 x 2
       screen_name    retweet_screen_name
       <chr>          <chr>
     1 spookybins     fyjypnation
     2 muunniiee      fyjypnation
     3 muunniiee      fyjypnation
     4 kiranalix      NA
     5 kiranalix      NA
     6 kiranalix      NA
     7 kiranalix      NA
     8 KingHwang00    fyjypnation
     9 KingHwang00    fyjypnation
    10 skzinstaupdate NA
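Before dropping rows, it can be useful to know how many of them are affected. The sketch below uses base R only; toy_rt_df is a hand-made stand-in for rt_df with invented user names, not real Twitter data.

```r
# Hypothetical stand-in for rt_df: two of the four rows are not retweets
toy_rt_df <- data.frame(
  screen_name         = c("ana", "ben", "cleo", "dev"),
  retweet_screen_name = c("brand", NA, "brand", NA),
  stringsAsFactors    = FALSE
)

# TRUE for every row whose retweet_screen_name is missing
is_missing <- is.na(toy_rt_df$retweet_screen_name)

sum(is_missing)    # number of rows without a retweeted user
mean(is_missing)   # share of such rows in the data frame
```

Applied to rt_df, the same two calls would show how much of the 18000-tweet sample is removed when only retweets are kept.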
3.5 Include only retweets in the data frame

To remove rows with NA values, we use the complete.cases() function. It takes the data frame as input and returns TRUE for each row without NA values, so indexing with it retains only the complete rows.

    # Remove rows with missing values
    rt_df_new <- rt_df[complete.cases(rt_df), ]

    # View the data frame
    head(rt_df_new, 10)

    # A tibble: 10 x 2
       screen_name     retweet_screen_name
       <chr>           <chr>
     1 spookybins      fyjypnation
     2 muunniiee       fyjypnation
     3 muunniiee       fyjypnation
     4 KingHwang00     fyjypnation
     5 KingHwang00     fyjypnation
     6 nicolasvillar12 InfurMagazine
     7 somewaychild    snapu_fukuoka
     8 somewaychild    snapu_fukuoka
     9 somewaychild    snapu_fukuoka
    10 somewaychild    snapu_fukuoka

3.6 Convert data frame to a matrix

To create a network, we need the contents saved as a matrix. The as.matrix() function converts the data frame to a matrix.

    # Convert to matrix
    matrix <- as.matrix(rt_df_new)

3.7 Create the retweet network

We are now ready to create the retweet network using graph_from_edgelist() from the igraph library. This function takes two arguments: the edge list, el, set to the matrix, and directed, set to TRUE for a directed network.
    # Create the retweet network
    library(igraph)
    nw_rtweet <- graph_from_edgelist(el = matrix, directed = TRUE)

3.8 View the retweet network

We use the print.igraph() function to view the retweet network. In the header of the output, DN indicates that it is a directed network with named vertices, and the two numbers that follow are the vertex count and the edge count: 5464 vertices and 6485 edges. The source and target vertex (node) names are shown separated by arrows. For example, "spookybins" is the source vertex and "fyjypnation" is the target vertex in the first row.

    # View the retweet network
    print.igraph(nw_rtweet)

    IGRAPH bb7fb9a DN-- 5464 6485 --
    + attr: name (v/c)
    + edges from bb7fb9a (vertex names):
    [1] spookybins     ->fyjypnation
    [2] muunniiee      ->fyjypnation
    [3] muunniiee      ->fyjypnation
    [4] KingHwang00    ->fyjypnation
    [5] KingHwang00    ->fyjypnation
    [6] nicolasvillar12->InfurMagazine
    [7] somewaychild   ->snapu_fukuoka
    [8] somewaychild   ->snapu_fukuoka
    + ... omitted several edges

We have now successfully created a retweet network. Next, we identify key players in this network using network centrality measures.

4. Network centrality measures

Network centrality measures are critical to identifying key players and influencers in a network. In this case study, we introduce the concept of network centrality and two key centrality measures, degree centrality and betweenness. We calculate these measures for our retweet network to identify its key players and the role they can play in a promotional campaign.
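Before turning to the centrality measures, the construction steps of sections 3.5 to 3.7 can be rehearsed end to end on a small example. This is a sketch under stated assumptions: igraph is installed, and toy_rt_df with its user names is invented for illustration.

```r
library(igraph)

# Hypothetical stand-in for rt_df; "ben" retweets "brand" twice
toy_rt_df <- data.frame(
  screen_name         = c("ana", "ben", "ben", "cleo", "dev"),
  retweet_screen_name = c("brand", "brand", "brand", NA, "ana"),
  stringsAsFactors    = FALSE
)

# Keep complete rows only, then convert to a two-column character matrix
toy_edges <- as.matrix(toy_rt_df[complete.cases(toy_rt_df), ])

# Build the directed network from the edge list
toy_net <- graph_from_edgelist(el = toy_edges, directed = TRUE)

vcount(toy_net)        # 4 vertices: ana, brand, ben, dev
ecount(toy_net)        # 4 edges: the repeated ben -> brand retweet is kept
any_multiple(toy_net)  # TRUE: repeated retweets become parallel edges
```

Because repeated retweets are kept as parallel edges, as with the duplicated muunniiee -> fyjypnation rows in the output above, edge-based counts on such a network reflect the number of retweets rather than the number of distinct user pairs.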
4.2 Network centrality measures

The influence of a node (vertex) is determined by its number of edges and its position within the network.

Network Centrality is a key property of complex networks that influences the behavior of dynamical processes, like synchronization and epidemic spreading, and can bring important information about the organization of complex systems, like our brain and society. (Rodrigues F.A. (2019) Network Centrality: An Introduction. In: Macau E. (eds) A Mathematical Modeling Approach from Nonlinear Dynamics to Complex Systems. Nonlinear Systems and Complexity, vol 22. Springer, Cham. https://doi.org/10.1007/978-3-319-78512-7_10)

In other words, network centrality is a measure of the importance of a vertex in a network. Network centrality measures assign a numerical value to each vertex according to its influence on other vertices (nodes). We focus on degree centrality and betweenness centrality in this case study.

4.2.1 Degree centrality

Degree centrality is the simplest centrality measure to compute. Recall that a node's degree is simply a count of how many social connections (i.e., edges) it has. The degree centrality for a node is simply its degree. A node with 10 social connections would have a degree centrality of 10. A node with 1 edge would have a degree centrality of 1. (Jennifer Golbeck, in Introduction to Social Media Investigation, https://www.sciencedirect.com/book/9780128016565/introduction-to-social-media-investigation, 2015)

In a directed network, vertices have out-degree and in-degree scores.

Figure 4: The out-degree of a vertex (node).

The out-degree represents how many outgoing edges a vertex has. In a retweet network, the out-degree of a vertex indicates the number of times a user retweets posts.
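To make the counting concrete, both degree scores can be tallied by hand from an edge list. The sketch below uses base R only and an invented four-row edge list; it illustrates what the scores mean and is not the method used in the rest of the report.

```r
# Hypothetical retweets: the user in column 1 retweets the user in column 2
toy_edges <- matrix(c("ana",  "brand",
                      "ben",  "brand",
                      "ben",  "ana",
                      "cleo", "brand"),
                    ncol = 2, byrow = TRUE)

# Out-degree: how often each user appears as a source (retweeter)
out_deg <- table(toy_edges[, 1])

# In-degree: how often each user appears as a target (original poster)
in_deg <- table(toy_edges[, 2])

out_deg   # ana 1, ben 2, cleo 1
in_deg    # ana 1, brand 3
```

One difference from a full network object is that users with a score of zero simply do not appear in these tables, for example "brand" in out_deg.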
Figure 5: The in-degree of a vertex.

The in-degree represents how many incoming edges each vertex has. In a retweet network, the in-degree indicates the number of times the user's posts are retweeted.

4.2.1.1 Degree centrality of a user

The out-degree and in-degree centrality of a user in a retweet network can be calculated using the function degree() from the igraph package. This function takes the following arguments:

  • The retweet network.
  • The user, and mode set to "out" to extract the out-degree.

    library(igraph)

    # Calculate out-degree
    out_deg <- degree(nw_rtweet, "OutfitAww", mode = "out")
    out_deg

    OutfitAww
           20

The value of 20 for out-degree indicates that the user has retweeted 20 times on the topic.

To calculate the in-degree, we use the same function and arguments but set the mode to "in". For this user the in-degree is 23, which indicates that the user's posts have been retweeted 23 times.

    # Calculate in-degree
    in_deg <- degree(nw_rtweet, "OutfitAww", mode = "in")
    in_deg
    OutfitAww
           23

4.2.1.2 Users who retweeted most

Let us identify the users who retweet the most by calculating the out-degree for the whole retweet network.

    # Calculate the out-degree scores
    out_degree <- degree(nw_rtweet, mode = "out")

To find the top 3 users who retweet the most, we sort the vector in descending order of out-degree using the sort() function.

    # Sort users in descending order of out-degree scores
    out_degree_sort <- sort(out_degree, decreasing = TRUE)

The top 3 users and their out-degree values are displayed here. These users are key players who can be used as a medium to retweet promotional posts of a fashion brand.

    # View the top 3 users
    out_degree_sort[1:3]

    _sundaysunsets_ myfoodfantasy69  RoarLoudTravel
                111              93              88

4.2.1.3 Users whose posts were retweeted most

We now calculate the in-degree scores for the network to identify the users whose posts were retweeted the most. The degree() and sort() functions are used again, this time to calculate the in-degree values and to sort the users by them.

    # Calculate the in-degree scores
    in_degree <- degree(nw_rtweet, mode = "in")

    # Sort the users in descending order of in-degree scores
    in_degree_sort <- sort(in_degree, decreasing = TRUE)
The users with the top 3 in-degrees are influential because their tweets are retweeted many times. They can be used to initiate the branding messages of a fashion brand.

    # View the top 3 users
    in_degree_sort[1:3]

    EnjoyNature always5star   Havenlust
           1390         244         230

4.2.2 Betweenness

Betweenness centrality is a widely used measure that captures a person's role in allowing information to pass from one part of the network to another. A node with higher betweenness centrality has more control over the network because more information passes through that node.

Figure 6: Illustrating betweenness.

4.2.2.1 Identifying users with high betweenness

We can identify the top users based on betweenness scores using the betweenness() function. The function takes two arguments: the retweet network and the value TRUE for directed.

    # Calculate the betweenness scores of the network
    betwn_nw <- betweenness(nw_rtweet, directed = TRUE)

We sort the vector in descending order of betweenness scores and view the top 3 users.

    # Sort the users in descending order of betweenness scores
    betwn_nw_sort <- betwn_nw %>%
      sort(decreasing = TRUE) %>%
      round()

These users are key bridges between people who retweet frequently and users whose tweets are retweeted frequently. They are important to the flow of information through the retweet network.

    # View the top 3 users
    betwn_nw_sort[1:3]

        always5star  RoarLoudTravel TravelBugsWorld
               5603            1852             933

5. Visualizing Twitter retweet network

Visualizing Twitter networks makes complex networks easier to understand and more appealing to read. In this section, we first plot a network with default parameters. Next, we apply formatting attributes to the plot to improve readability. Finally, we use centrality measures and network attributes to enhance the plot.

5.1 Create the base network plot

We can create a base network plot by using the plot.igraph() function from the igraph library. This function takes the retweet network as input. The set.seed() function fixes the random number generator so that the same plot is reproduced every time.

    # Create the base network plot
    set.seed(1234)
    plot.igraph(nw_rtweet)

The plot is created with vertices shown as orange circles and the edges indicated by grey lines. Let's now format the plot with attributes for better readability.
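The effect of set.seed() can be checked without drawing anything, by computing vertex coordinates twice. This is a sketch under assumptions: igraph is installed, the toy network is invented, and layout_with_fr() (a force-directed layout with a random starting position) is used as a stand-in for whatever layout the plot picks by default.

```r
library(igraph)

# Hypothetical three-edge retweet network
toy_net <- graph_from_edgelist(
  matrix(c("ana",  "brand",
           "ben",  "brand",
           "cleo", "ana"),
         ncol = 2, byrow = TRUE),
  directed = TRUE
)

# Same seed before each call, so both layouts start from the same random state
set.seed(1234)
layout_a <- layout_with_fr(toy_net)

set.seed(1234)
layout_b <- layout_with_fr(toy_net)

# Compare the two coordinate matrices (one row per vertex, columns x and y)
identical(layout_a, layout_b)
```

With the seed reset, the two coordinate matrices should match; without it, the vertex positions would generally differ between runs. A stored layout can also be passed to the plot through its layout argument to pin the positions explicitly.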
Figure 7: Plotted retweets network.

5.2 Format the plot

We add the following arguments to the plot() function. The aspect ratio is set to 9/16 to obtain a rectangular plot, and attributes for vertex size and color, edge arrow size and color, and text label size and color are included.

    # Format the network plot with attributes
    set.seed(1234)
    plot(nw_rtweet,
         asp = 9/16,
         vertex.size = 10,
         vertex.color = "lightblue",
         edge.arrow.size = 0.5,
         edge.color = "black",
         vertex.label.cex = 0.9,
         vertex.label.color = "black")

In the plot, the number of arrows going out of a vertex is a measure of the number of times the user retweets. The plot is more meaningful if the vertex size is proportional to the number of times the user retweets, that is, to the out-degree.
Figure 8: Formatted plot of retweets network.

5.2.1 Set vertex size based on the out-degree

We will add attributes such that the vertex size is indicative of the number of times the user retweets. Let's calculate the out-degree and assign it to a variable, deg_out.

    # Create a variable for out-degree
    deg_out <- degree(nw_rtweet, mode = "out")
    deg_out

    AndyPennefather  bestforluxury    CenturyCruises   thepassportsaga  heathrowshuttle
    4                0                0                0                0
    TheBayAreaLife   _rjardon         BeautyfromItaly  wildlife__pics   fhdphotography
    3                5                0                3                0
    remymichaels     buffett_v        dungnhi01911301  AccTravel        TravelBabbo
    0                4                0                4                0
    Lyora            Mel365dotCom     IBBtravel        mOQIl            GeographyLion
    0                0                0                4                0
    RoadtripC        _chokey          ishan_Wickrama   Jennasm66663888  Home_and_Loving
    0                0                0                3                0
    SurfnSunshine    AskChefDennis    edjlazar         Enjoy_Nature_    stays_unique
    0                0                3                0                0
    SMaureneLoft     ppeters21
    4                0

To avoid assigning zero values to the vertex size, we amplify this variable by multiplying deg_out by a scaling factor of 3 (an arbitrary choice) and adding 10, so that the minimum vertex size is 10.

    vert_size <- (deg_out * 3) + 10

We assign vert_size to the vertex size attribute and retain the other arguments in the plot.

    # Assign vert_size to vertex size attribute and plot network
    set.seed(1234)
    plot(nw_rtweet,
         asp = 9/16,
         vertex.size = vert_size,
         vertex.color = "lightblue",
         edge.arrow.size = 0.5,
         edge.color = "black",
         vertex.label.cex = 0.9,
         vertex.label.color = "black")

The vertex size is now proportionate to the out-degree. Vertices with bigger circles are the users who retweet more.
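The arithmetic of the rescaling is easy to verify on a few values. The sketch below uses base R and an invented named vector in place of deg_out; the factor 3 and the offset 10 are taken from the code above.

```r
# Hypothetical out-degree values for five users
toy_deg_out <- c(ana = 0, ben = 3, cleo = 5, dev = 0, eli = 4)

# Same linear rescaling as vert_size: multiply by 3, then add 10
toy_vert_size <- (toy_deg_out * 3) + 10

toy_vert_size        # ana 10, ben 19, cleo 25, dev 10, eli 22
min(toy_vert_size)   # 10: users who never retweet still get a visible circle
```

The offset sets the smallest circle and the factor controls how quickly circles grow with the out-degree, so both can be tuned to the size of the plot.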
Figure 9: Proportionating vertex size to the out-degree.

The users who retweet most add more value if they also have a high follower count, because their retweets reach a wider audience. We will modify the network plot to show users who retweet more and also have a high number of followers. To do this, we add the follower count as a network attribute from an external data frame.

5.2.2 Follower count of network users

In a network plot, the combination of vertex size indicating the number of retweets by a user and vertex color indicating a high follower count provides clear insight into the most influential users who can promote a brand.

The data frame containing follower counts for the screen names in the network is imported using the readRDS() function.

    # Import the followers count data frame
    followers <- readRDS("follower_count.rds")
The followers data frame has 2 columns: the screen name (shown as vert_names in the output below) and followers_count. We use the ifelse() function to create a new column, follow, which takes the value "1" when the follower count is greater than 500 and "0" otherwise.

    # Categorize high and low follower count
    followers$follow <- ifelse(followers$followers_count > 500, "1", "0")

We now have a new column follow with values 1 or 0.

    # View the data frame with the new column
    head(followers)

           vert_names followers_count follow
    1         _chokey            1398      1
    2        _rjardon            9357      1
    3       AccTravel            1267      1
    4 AndyPennefather           34763      1
    5   AskChefDennis               0      0
    6 BeautyfromItaly               0      0

5.2.2.1 Assign network attributes

We assign the new column follow as an attribute called followers to the vertices of the network using the V() function. This function takes the retweet network as input. Note that the assignment is positional: it assumes that the rows of followers are in the same order as the vertices returned by V(nw_rtweet).

    # Assign external network attributes to retweet network
    V(nw_rtweet)$followers <- followers$follow

5.2.2.2 View vertex attributes

We can view the network attributes with the vertex_attr() function. The network vertices have 2 attributes: name, which is the screen name, and followers, with values 0 or 1. The followers values are shown below.

    # View the vertex attributes
    vertex_attr(nw_rtweet)

     [1] "1" "1" "1" "1" "0" "0" "1" "1" "1" "0" "1" "0" "0" "0" "1" "0" "0" "0" "1"
    [20] "0" "1" "1" "0" "1" "1" "1" "0" "1" "1" "0" "0" "1"
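When the external data frame is not sorted in vertex order, the attribute can be aligned by name instead of by position. The following is a minimal sketch, assuming igraph is installed; toy_net and toy_followers are invented, and the column name vert_names is borrowed from the output shown earlier.

```r
library(igraph)

# Hypothetical network; the vertex order follows first appearance: cleo, ana, ben
toy_net <- graph_from_edgelist(
  matrix(c("cleo", "ana",
           "ben",  "ana"),
         ncol = 2, byrow = TRUE),
  directed = TRUE
)

# Hypothetical follower table, sorted alphabetically rather than in vertex order
toy_followers <- data.frame(
  vert_names       = c("ana", "ben", "cleo"),
  followers_count  = c(9000, 120, 640),
  stringsAsFactors = FALSE
)
toy_followers$follow <- ifelse(toy_followers$followers_count > 500, "1", "0")

# For each vertex name, find the matching row of the follower table
idx <- match(V(toy_net)$name, toy_followers$vert_names)
V(toy_net)$followers <- toy_followers$follow[idx]

V(toy_net)$name                    # "cleo" "ana" "ben"
vertex_attr(toy_net, "followers")  # "1" "1" "0"
```

A purely positional assignment on this toy data would have given "ben" the flag that belongs to "cleo". With match(), a vertex that is missing from the table receives NA, which makes gaps visible.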
5.2.2.3 Changing vertex colors

The vertex color can now be set based on the followers attribute. First, create a vector sub_color with the values "tomato" and "lightgreen". In the plot attributes, use this sub_color for vertex.color, indexed by the vertex attribute followers converted to a factor. The factor levels are "0" and "1" in that order, so the first color is applied to vertices with followers equal to 0 and the second to vertices with followers equal to 1.

    # Set the vertex colors for the plot
    # Order matters: first color for level "0", second for level "1"
    sub_color <- c("tomato", "lightgreen")
    set.seed(1234)
    plot(nw_rtweet,
         asp = 9/16,
         vertex.size = vert_size,
         edge.arrow.size = 0.5,
         vertex.label.cex = 1.3,
         vertex.color = sub_color[as.factor(vertex_attr(nw_rtweet, "followers"))],
         vertex.label.color = "black",
         vertex.frame.color = "grey")

5.2.2.4 View plot formatted with vertex attributes

The vertices with followers equal to 1, that is, with a follower count above 500, are displayed in light green and the rest are displayed in the color tomato. The larger vertices colored light green are our most important users since they retweet the most and also have a high number of followers.
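How a factor picks colors out of a vector can be checked in isolation. The sketch below uses base R only; followers_flag and toy_palette are invented names for illustration.

```r
# Hypothetical followers attribute for four vertices
followers_flag <- c("1", "0", "1", "0")

# Converting to a factor sorts the levels: "0" comes first, "1" second
flag_factor <- as.factor(followers_flag)
levels(flag_factor)       # "0" "1"
as.integer(flag_factor)   # 2 1 2 1

# Indexing with a factor uses its integer codes, so position 1 serves
# level "0" and position 2 serves level "1"
toy_palette <- c("tomato", "lightgreen")
vertex_cols <- toy_palette[flag_factor]
vertex_cols               # "lightgreen" "tomato" "lightgreen" "tomato"
```

The order of the color vector therefore decides which group gets which color. A named vector such as c("0" = "tomato", "1" = "lightgreen") indexed by the character values is one way to make that mapping explicit.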
Figure 10: Formatted plot displaying retweets network (followers, degree with size and color).

6. Conclusion

We have identified the most influential users in the network who can promote a brand. These are the larger vertices colored light green, as these users retweet the most and also have a high number of followers.
7. Bibliography

- Wikipedia, Social Network Analysis: https://en.wikipedia.org/wiki/Social_network_analysis
- Datacamp, Analyzing social media data in R: https://learn.datacamp.com/
- Jennifer Golbeck (2015), in Introduction to Social Media Investigation: https://www.sciencedirect.com/book/9780128016565/introduction-to-social-media-investigation
- Rodrigues F.A. (2019) Network Centrality: An Introduction. In: Macau E. (eds) A Mathematical Modeling Approach from Nonlinear Dynamics to Complex Systems. Nonlinear Systems and Complexity, vol 22. Springer, Cham. https://doi.org/10.1007/978-3-319-78512-7_10
- Antonio M. Chiesi (2015), Network Analysis, in International Encyclopedia of the Social & Behavioral Sciences (Second Edition): https://www.sciencedirect.com/science/article/pii/B9780080970868730558
