COMMunicationRESearch.NET

Site Tools

sna_eg_stanford:lab05

Lab 05

################################################
# LAB 5: Two-Mode Networks and Mobility Models #
################################################

# NOTE: if you have trouble because some packages are not installed,
# see lab 1 for instructions on how to install all necessary packages.

##################################################################
#
# Lab 5
#
# This lab covers affiliation data. It shows the user how to plot
# affiliation data using the igraph package, how to transform it
# into one mode data and generate centrality measures.

# This lab also introduces additional material on plotting network
# data, with special attention to two-mode networks. It also
# covers using opacity/transparency in network plots, hand-placing
# nodes in a network layout (in case labels overlap), and various
# other considerations that may be of interest when creating
# network visualizations in your publications.
#
# It then covers network mobility. Specifically, it covers how to
# compute transition probability matrices and then instructs the
# user to create visualizations similar to those generated for the
# affiliation data.
#
# We'll work with affiliation data collected by Dan McFarland
# on student extracurricular affiliations (CITE). It's a
# longitudinal data set, with 3 waves - 1996, 1997, 1998.  It
# consists of students (anonymized) and the student organizations
# in which they are members (e.g. National Honor Society,
##################################################################

# To-do
# (A) need to add 3 affiliations:
#	1.	"entry" affiliations for students: for any kid younger than 8th grade (for Magnet) or 9th grade (for Rural) in any particular year, this would be their affiliation that year.
#	2.	"leave" affiliations for students: for any kid older than 12th grade in any particular year, this would be their affiliation.
#	3.	"non-participant": for any student who is not in a leave or entry position and who has a row sum of 0.
# (B) Using my table of affiliation characteristics, we can omit certain fluff clubs like pep club, NHS, foreign language clubs, etc from the graph as they drive a lot of non-sensical patterns.

library(igraph)

# (1) Read in the data files, NA data objects coded as "na"
magact96 = read.delim("http://sna.stanford.edu/sna_R_labs/data/mag_act96.txt", na.strings = "na", check.names = FALSE)
magact97 = read.delim("http://sna.stanford.edu/sna_R_labs/data/mag_act97.txt", na.strings = "na", check.names = FALSE)
magact98 = read.delim("http://sna.stanford.edu/sna_R_labs/data/mag_act98.txt", na.strings = "na", check.names = FALSE)

# Missing data is coded as "na" in this data, which is why we gave
# R the command na.strings = "na". We need to preserve the
# original column names for labeling our visualizations so we use
# the "check.names = FALSE" argument as well. If we needed to
# access our data using the "magact96$variable" syntax, we would # NOT want to read in our data like this. # These files consist of four columns of individual-level # attributes (ID, gender, grade, race), then a bunch of group # membership dummy variables (coded "1" for membership, "0" for no # membership). We need to set aside the first four columns (which # do not change from year to year). # take a look at the data as follows (you can edit or "fix" the data # using this perspective if necessary): fix(magact96) ################################################################ # (2) Create the attribute data. Attributes appear the same # from 1996-1998 magattrib = magact96[,1:4] ################################################################ # (3) Drop columns so we have a square incidence matrix for each # year g96 = as.matrix(magact96[,-(1:4)]); row.names(g96) = magact96[,1] g97 = as.matrix(magact97[,-(1:4)]); row.names(g97) = magact97[,1] g98 = as.matrix(magact98[,-(1:4)]); row.names(g98) = magact98[,1] # For future reference, you can access these via # data(studentnets.magact96.97.98, package = "NetData") # Now load in these two-mode graphs into igraph. i96 <- graph.incidence(g96, mode=c("all") ) i97 <- graph.incidence(g97, mode=c("all") ) i98 <- graph.incidence(g98, mode=c("all") ) ################################################################ # (3a) Run some plots: # # Now, let's plot these graphs. The igraph package has excellent # plotting functionality that allows you to assign visual # attributes to igraph objects before you plot. The alternative is # to pass 20 or so arguments to the plot.igraph() function, which # gets really messy. # In past labs, you may have accessed vertex (node attributes via # this function: get.edge.attribute(krack_full, 'reports_to_tie') # A "shorthand" to access edge or vertex (node) attributes works # similarly to R's handling of lists and dataframes (e.g., # 'dataframe$variablename' or'list$sublist$object').

# Each node (or "vertex") object is accessible by calling V(g),
# and you can call (or create) a node attribute by using the $# operator so that you call V(g)$attribute.

# Let's now use this notation to set the vertex color, with
# special attention to making graph objects slightly transparent.
# We'll use the rgb function in R to do this. We specify the
# levels of Red, Green, and Blue and "alpha-channel" (a.k.a.
# opacity) using this syntax: rgb(red, green, blue, alpha). To
# return solid red, one would use this call: rgb(red = 1,green =
# 0,blue = 0, alpha = 1). We will make nodes, edges, and labels
# slightly transparent so that when things overlap it is still

# You can read up on the RGB color model at
# http://en.wikipedia.org/wiki/RGB_color_model.

# Here's how to set the color attribute for a set of nodes in a
# graph object:

V(i96)$color[1:1295] = rgb(red = 1, green = 0, blue = 0, alpha = .5) V(i96)$color[1296:1386] = rgb(red = 0, green = 1, blue = 0, alpha = .5)

# Notice that we index the V(g)$color object by a seemingly # arbitrary value, 1295. This marks the end of the student nodes, # and 1296 is the first group node. You can view which nodes are # which by typing V(i96). R prints out a list of all the nodes in # the graph, and those with an id number can be distinguished from # group names. # From here on out, we do not specify "red = ", "green = ", "blue # = ", and "alpha = ". These are the default arguments (R knows # the first number corresponds to red, the second to blue, and so # on). # Now we'll set some other graph attributes: V(i96)$label = V(i96)$name V(i96)$label.color = rgb(0,0,.2,.5)
V(i96)$label.cex = .4 V(i96)$size = 6
V(i96)$frame.color = V(i96)$color

# You can also set edge attributes. Here we'll make the edges
# nearly transparent and slightly yellow because there will be so
# many edges in this graph:

E(i96)$color = rgb(.5,.5,0,.2) # Now, we'll open a pdf "device" on which to plot. This is just a # connection to a pdf file. Note that the code below will take a # minute or two to execute (or longer if you have a pre- Intel # dual-core processor), because the graph is so large. pdf("5.1_magact_stdnt_actvts_1996.pdf") plot(i96, layout=layout.fruchterman.reingold) dev.off() # Note that we've used the Fruchterman-Reingold force-directed # layout algorithm here. Generally speaking, when you have a # ton of edges, the Kamada-Kawai layout algorithm works well but, # it can get really slow for networks with a lot of nodes. Also, # for larger networks, layout.fruchterman.reingold.grid is faster, # but can fail to produce a plot with any meaningful pattern if you # have too many isolates, as is the case here. Experiment for # yourself. # Now, if you open the pdf output, you'll notice that you can zoom # in on any part of the graph ad infinitum without losing any # resolution. How is that possible in such a small file? It's # possible because the pdf device output consists of data based on # vectors: lines, polygons, circles, ellipses, etc., each specified # by a mathematical formula that your pdf program renders when you # view it. Regular bitmap or jpeg picture output, on the other # hand, consists of a pixel-coordinate mapping of the image in # question, which is why you lose resolution when you zoom in on a # digital photograph or a plot produced with most other programs. # This plot is oddly reminiscent of a crescent and star, but # impossible to read. Part of the problem is the way in which # layout algorithms deal with isolates. For example, # layout.fruchterman.reingold will squish all of the connected # nodes into the center creating a useless "hairball"-like # visualization. The same applies to # layout.fruchterman.reingold.grid (it's even worse). And # layout.kamada.kawai takes exponentially longer to converge (it # may not ever converge in igraph's implementation of the # algorithm with isolates). # Let's remove all of the isolates (the crescent), change a few # aesthetic features, and replot. First, we'll remove isolates, by # deleting all nodes with a degree of 0, meaning that they have # zero edges. Then, we'll suppress labels for students and make # their nodes smaller and more transparent. Then we'll make the # edges more narrow more transparent. Then, we'll replot using # various layout algorithms: i96 = delete.vertices(i96, V(i96)[ degree(i96)==0 ]) V(i96)$label[1:857] = NA
V(i96)$color[1:857] = rgb(1,0,0,.1) V(i96)$size[1:857] = 2

E(i96)$width = .3 E(i96)$color = rgb(.5,.5,0,.1)

dev.off()

pdf("5.3_magact_stdnt_actvts_1996_layout.fruchterman.reingold.grid.pdf")
plot(i96, layout=layout.fruchterman.reingold.grid)
dev.off()

pdf("5.4_magact_stdnt_actvts_1996_layout.fruchterman.reingold.pdf")
plot(i96, layout=layout.fruchterman.reingold)
dev.off()

# The nice thing about the Fruchterman-Reingold layout in this
# case is that it really emphasizes centrality -- the nodes that
# are most central are nearly always placed in the middle of the
# plot.

# Now repeat for other years (we do not delete the vertices in
# these years.

# 1997
i97 = delete.vertices(i97, V(i97)[ degree(i97)==0 ])
V(i97)$color = rgb(0, 1, 0, .5) V(i97)$color[1:752] =  rgb(1,0,0,.1)
V(i97)$frame.color = V(i97)$color

V(i97)$size = 6 V(i97)$size[1:752] = 2

V(i97)$label = V(i97)$name
V(i97)$label.color = rgb(0,0,.2,.5) V(i97)$label.cex = .4
V(i97)$label[1:752] = NA E(i97)$width = .3
E(i97)$color = rgb(.5,.5,0,.2) pdf("5.5_magact_stdnt_actvts_1997.pdf") plot(i97, layout=layout.fruchterman.reingold) dev.off() # 1998 i98 = delete.vertices(i98, V(i98)[ degree(i98)==0 ]) V(i98)$color = rgb(0, 1, 0, .5)
V(i98)$color[1:724] = rgb(1,0,0,.1) V(i98)$frame.color = V(i98)$color V(i98)$size = 6
V(i98)$size[1:724] = 2 V(i98)$label = V(i98)$name V(i98)$label.color = rgb(0,0,.2,.5)
V(i98)$label.cex = .4 V(i98)$label[1:724] = NA

E(i98)$width = .3 E(i98)$color = rgb(.5,.5,0,.2)

pdf("5.6_magact_stdnt_actvts_1998.pdf")
plot(i98, layout=layout.fruchterman.reingold)
dev.off()

################################################################
# (4) Now produce single mode co-event matrices for each year. This
# is done using R's matrix algebra commands. You first need to get
# your data in matrix format. We already have a matrix representation
# of our data, but if you did not, a network object can be coerced via
# as.matrix(your-network) if you are using the network or sna
# packages; with the igraph package you would use get.adjacency(your
# network).

# To get the one-mode representation of ties between rows (people in
# our example), multiply the matrix by its transpose. To get the
# one-mode representation of ties between columns (clubs in our
# example), multiply the transpose of the matrix by the matrix.
# Note that you must use the matrix-multiplication operator %*%
# rather than a simple asterisk. The R code is for our data follows:

g96e = t(g96) %*% g96
g97e = t(g97) %*% g97
g98e = t(g98) %*% g98

i96e = graph.adjacency(g96e, mode = "undirected")

# Now we need to transform the graph so that multiple edges become an
# attribute of each unique edge, accessible via E(g)$weight: E(i96e)$weight <- count.multiple(i96e)
i96e <- simplify(i96e)

# Now plot the first single mode co-event matrices. Set vertex
# attributes, making sure to make them slightly transparent by
# altering the gamma via the rgb(r,g,b,gamma) function.

# Set vertex attributes
V(i96e)$label = V(i96e)$name
V(i96e)$label.color = rgb(0,0,.2,.8) V(i96e)$label.cex = .6
V(i96e)$size = 6 V(i96e)$color = rgb(0,0,1,.5)
V(i96e)$frame.color = V(i96e)$color

# We set edge opacity/transparency as a function of how many students
# each group has in common (the weight of the edge that connects the
# two groups).

# In order to do so, we need to transform the edge weights so that
# they are between about .05 and 1, otherwise they will not show up on
# the plot.

# We use the log function + .3 to make sure all transparencies are
# on relatively the same scale, then divide by the maximum edge weight
# to get them on a scale from about .2 and 1.

egalpha = (log(E(i96e)$weight)+.3)/max(log(E(i96e)$weight)+.3)
E(i96e)$color = rgb(.5,.5,0,egalpha) # For illustrative purposes, let's compare how the Kamada-Kawai and # Fruchterman-Reingold algorithms render this graph: pdf("5.7_magact_stdnt_actvts_1996_clubs.pdf") plot(i96e, main = "layout.kamada.kawai", layout=layout.kamada.kawai) plot(i96e, main = "layout.fruchterman.reingold", layout=layout.fruchterman.reingold) dev.off() # Be sure to go tot the second page in the pdf to see the FR layout. # You might like the Kamada-Kawai layout for this graph, because the # center of the graph is very busy if you use the Fruchterman-Reingold # layout. ################################################################ # (5) and (6) Group Overlap Networks and Plots # We are also interested in the percent overlap between groups. # Note that this will be a directed graph, because the percent overlap # will not be symmetric across groups--for example, it may be that 3/4 # of Spanish NHS members are in NHS, but only 1/8 of NHS members are # in the Spanish NHS. We'll create this graph for all years in our # data (though we could do it for one year only). # First we'll create a percent overlap graph. We start by dividing # each row by the diagonal (this is really easy in R): ol96 = g96e/diag(g96e) ol97 = g97e/diag(g97e) ol98 = g98e/diag(g98e) ################################################################ # (7) Next, we'll sum the matrices and set any NA cells (caused by # dividing by zero in the step above) to zero: magall = ol96 + ol97 + ol98 magall[is.na(magall)] = 0 # Note that magall now consists of a percent overlap matrix, but # because we've summed over 3 years, the maximun is now 3 instead of # 1. ################################################################ # 7 (a) compute average club size, by taking the mean across each # value in each diagonal magdiag = apply(cbind(diag(g96e), diag(g97e), diag(g98e)), 1, mean ) ################################################################ # 7 (a) i. Generate centrality for magall # First we'll want to make an igraph object: magallg = graph.adjacency(magall, weighted=T) # We need to set weighted=T because otherwise igraph dichotomizes # edges at 1. Note also that this graph is directed, so we no longer # tell igraph that the graph is "undirected." The graph is directed # because the percent overlap will not be symmetric across groups - # for example, it may be that 3/4 of Spanish NHS members are in NHS, # but only 1/8 of NHS members are in the Spanish NHS. # Now we'll calculate some node-level centrality measures: # Degree V(magallg)$degree = degree(magallg)

# Betweenness centrality
V(magallg)$btwcnt = betweenness(magallg) ################################################################ # 7 (b) plot magall, for relationships > 1 # Before we plot this network, we should probably filter some of the # edges, otherwise our graph will probably be too busy to make sense # of visually. Take a look at the distribution of connection strength # by plotting the density of the magall matrix: pdf("5.8_magact_stdnt_actvts_club_overlap_density.pdf") plot(density(magall)) dev.off() # Nearly all of the edge weights are below 1 -- or in other words, the # percent overlap for most clubs is less than 1/3. Let's filter at 1, # so that an edge will consists of group overlap of more than 1/3 of # the group's members in question. magallgt1 = magall magallgt1[magallgt1<1] = 0 magallggt1 = graph.adjacency(magallgt1, weighted=T) # Removes loops: magallggt1 <- simplify(magallggt1, remove.multiple=FALSE, remove.loops=TRUE) # Degree V(magallggt1)$degree = degree(magallggt1)

# Betweenness centrality
V(magallggt1)$btwcnt = betweenness(magallggt1) # Before we do anything else, we'll create a custom layout based on # Fruchterman-Ringold wherein we adjust the coordates by hand using # the tkplot gui tool to make sure all of the labels are visible. This # is very useful if you want to create a really sharp-looking network # visualization for publication. magallggt1$layout = layout.fruchterman.reingold(magallggt1)
V(magallggt1)$label = V(magallggt1)$name
tkplot(magallggt1)

# Let the plot load, then maximize the window, and select to View ->
# Fit to Screen so that you get maximum resolution for this large
# graph. Now hand-place the nodes, making sure no labels overlap.

# Pay special attention to whether the labels overlap (or might
# overlap if the font was bigger) along the vertical. Save the layout
# coordinates to the graph object:
magallggt1$layout = tkplot.getcoords(1) # We use "1" here only if this was the first tkplot object you # called. If you called tkplot a few times, use the last plot object. # You can tell which object is visible because at the top of the # tkplot interface, you'll see something like "Graph plot 1". # Set vertex attributes V(magallggt1)$label = V(magallggt1)$name V(magallggt1)$label.color = rgb(0,0,.2,.6)
V(magallggt1)$size = 6 V(magallggt1)$color = rgb(0,0,1,.5)
V(magallggt1)$frame.color = V(magallggt1)$color

# Set edge attributes
E(magallggt1)$arrow.size = .3 # Set edge alpha according to edge weight egalpha = (E(magallggt1)$weight+.1)/max(E(magallggt1)$weight+.1) E(magallggt1)$color = rgb(.5,.5,0,egalpha)

# One thing that we can do with this graph is to set label size as a
# function of degree, which adds a "tag-cloud"-like element to the
# visualization:
V(magallggt1)$label.cex = V(magallggt1)$degree/(max(V(magallggt1)$degree)/2)+ .3 # Note, as presently coded, you must play with the formula above to # get the ratio of big text to small text just right for other graphs. # You may want to hand place the nodes as we did earlier using tkplot. #Now plot it: pdf("5.9_magact_stdnt_actvts_club_overlap_handplaced_layout.pdf") plot(magallggt1) dev.off() # This plots our network with our custom layout. (Because we specified # our layout so that it was part of our igraph object as # "magallggt1$layout," igraph used our layout by default).

################################################################
# 7 (c) Play around with the visualizations above, save the best

# Variations might include scaling vertex size based on degree,
# scaling edge width or edge transperancy (gamma) based on
# edge weight.

################################################################
# (8) Play around with cohesion and centrality measurements.
# What can we say about the general comemberships at Magnet
# High over 3 years?

# useful code:

plot(density(degree(i96e)))
plot(density(betweenness(i96e)))

degree.distribution(i96e)

############################################################
###########################################################
# Part II: Mobility and Careers:
###########################################################

# (1) create a new matrix that multiplies 1996 magnet with 1997
# magnet so you see the number of students moving from 1996
# membership to 1997 memberships.

# Before we actually do this, we need to make sure that the
# rows and columns for g96 and g97 are the same. We'll
# use the match() function for this:

# First, let's get an idea of how many column-names (activities) and row
# names (student ids) are in common between the two years:

(cnames = intersect( colnames(g96), colnames(g97) ) )
(rnames = intersect( row.names(g96), row.names(g97) ) )

# Great, there are a lot of names in common. Now we
# need to make sure we are only using the rows
# and columns of each matrix that contain entries used in
# both years. We also need to make sure that the columns and
# rows are in the same order.

# In order to accomplish this we are going to exploit R's
# indexing capabilities. We are going to have R "rebuild"
# each matrix according to the order of rnames and cnames.
# We'll use the match() function to accomplish this.
g96matched = g96[ match(rnames, row.names(g96)), match(cnames, colnames(g96)) ]
g97matched = g97[ match(rnames, row.names(g97)), match(cnames, colnames(g97)) ]

# We need to do the same thing for the diagonal of the matrix g96e, which is
# our co-membership/affiliation matrix computed above:
mag96diagmatched = diag( g96e[ match(cnames, colnames(g96e)), match(cnames, colnames(g96e)) ] )

# Now let's check to make sure things worked correctly:
which(row.names(g96matched) != row.names(g97matched))
which(colnames(g96matched) != colnames(g97matched))

# NOW we can multiply to get the transition probability matrix:
mag96_97 = t(g96matched) %*% g97matched

# Now we need to do the same for 97 and 98:
cnames = intersect( colnames(g97), colnames(g98) )
rnames = intersect( row.names(g97), row.names(g98) )
g97matched = g97[ match(rnames, row.names(g97)), match(cnames, colnames(g97)) ]
g98matched = g98[ match(rnames, row.names(g98)), match(cnames, colnames(g98)) ]

mag97_98 = t(g97matched) %*% g98matched
mag97diagmatched = diag( g97e[ match(cnames, colnames(g97e)), match(cnames, colnames(g97e)) ] )

############################################################
## (2) Create mag96-diag and mag97-diag
## what we'll need to do is to run the same proceedure as above

#mag96diag = diag(g96matched)
#mag97diag = diag(g97matched)

###########################################################
# 2 (a) Create magmob96_97 by dividing by mag96diagmatched in
#       order to get a transition probability matrix:
magmob96_97 = mag96_97/mag96diagmatched

###########################################################
# 2 a (i) Repeat for mag97 98 using mag97-diag values.
#           Save as magmob97 98.

magmob97_98 = mag97_98/mag97diagmatched

###########################################################
# (3) aggregate the transition probability matrix across years

# Now use the match() function to make it so that the
# transition probability matrices have the same row and
# column entries, using the example above as a guide.

# One you have that, you can just add the matrices and
# (scalar) divide by 2.

###########################################################
# (4) Now plot as you did with the event-overlap graphs

############################################################
#
# Questions:
#   (1) How does the story differ for the mobility vs the
#       cross-sectional overlaps?
#   (2) What network properties (cohesion or centrality) afford
#       some insight into your claims?
#   (3) From these results, what substantive story can you make
#       about the extracurriculum and it's career structure?
#
############################################################