Lab 07
##############################################
# Lab 7: PEER INFLUENCE & QAP REGRESSION LAB #
##############################################

# NOTE: if you have trouble because some packages are not installed,
# see lab 1 for instructions on how to install all necessary packages.

###########################################################################
#
# Lab 7
#
# Part I - Develop peer influence models that meet the high standards of
# publication in today's top journals (i.e., addressing autocorrelation
# and issues of causality).
#
# Part II - Use QAP regression to predict increases in social interaction
# and task interaction for two classrooms. Basically, run dyadic-level
# regressions suitable for continuous variables and count data (which
# exponential random graph models, ERGMs, cannot do).
#
###########################################################################

##########################################################################
# PART I -- PEER INFLUENCE
###########################################################################
#
# This lab examines peer effects in classroom data. It first introduces
# the reader to the necessary data processing and manipulation, then basic
# visualization related to peer effects. It then introduces the Linear Network
# Autocorrelation Model (lnam), which can be used to better model the
# dependencies in network data (well, better than an OLS model anyway).
#
# The next section introduces the reader to the concept of matching--treating
# the explanatory variable of interest as an experimental "treatment"
# and then matching on other covariates. Matching allows the application
# of straightforward analytical techniques used to analyze experiments,
# namely comparison of means. Code for bootstrapping is introduced to allow for
# non-parametric estimation of the standard errors for the mean. Code is also provided
# to produce a dotplot showing the means and a (bootstrapped) 95% CI.
#
# The data was collected by Dan McFarland (McFarland 2001). The key measures of
# interest for this example are the time 1 and time 2 measures of how much each
# student liked the subject, and the friend network at time 1.
#
# The codebook is available here: http://stanford.edu/~messing/codebook.v12.txt
###############################################################################

# Clear variables
rm(list=ls(all=TRUE))
gc()

# Install and load necessary packages:
install.packages("reshape", dependencies = TRUE)
library(reshape)
library(igraph)

data(studentnets.peerinfl, package="NetData")

######################################################################################
# Now we need to reshape our attitudes data according to semester. The data is
# currently in "long" format, so that student ids and measures are repeated on
# multiple lines and "semester" is a single variable. We need
# to transform the data so that measures are repeated on a single
# line, or "wide" format. We'll use the excellent "reshape" package
# to accomplish this. Take a look at the documentation using ?reshape
?reshape

attitudesw = reshape(
  attitudes,
  idvar="std_id",
  timevar="sem_id",
  direction="wide"
)

# Notice that all semester 1 variables now have a ".1" suffix, while all
# semester 2 variables have a ".2" suffix.

# We'll first visualize the data. We want to know whether the change in
# a student's appreciation of a subject (t_2 - t_1) changes in the direction of
# his or her friends' appreciation at t_1.
# So, we'll first take the difference of a few dependent variables
# for which we might expect peer influence (e.g. "sub" (liking of subject),
# "cmt" (liking of classmates), and "tch" (liking of teacher)),
# and then take the mean of each individual's friends' values at t_1.

# First create the delta variables:
attitudesw$deltasub = attitudesw$sub.2 - attitudesw$sub.1
attitudesw$deltatch = attitudesw$tch.2 - attitudesw$tch.1
attitudesw$deltacmt = attitudesw$cmt.2 - attitudesw$cmt.1

# Next we'll create the mean of friends' sub.1.
# We'll use the adjacency matrix, multiply that by attitudesw$sub.1
# (element-wise, not matrix multiplication), and then take the mean of each row.
sem1graph = graph.data.frame(d = sem1[1:2], directed=TRUE)
#sem1graph = network(x = sem1[1:2], directed=TRUE)
#sem2graph = graph.data.frame(d = sem2, directed=TRUE)

# But first, we'll need to clean the data to make sure the rows line up!
# Let's check to see whether the edge data has the same cases as the attitudes data:
which(!(V(sem1graph)$name %in% as.character(attitudesw$std_id)))
which(!(as.character(attitudesw$std_id) %in% V(sem1graph)$name ))
# They are not the same... we'll need to address this below.

# Now let's get the matrix representation of the network:
sem1matrix = get.adjacency(sem1graph, binary=T)
# This is often referred to as "W" in the literature.

# When you have a large square matrix like this, you can get a good idea of
# its density/sparsity using the image function.
image(sem1matrix)

# It's also generally good to know the degree distribution of any network you
# are handling:
plot(density(degree(sem1graph)))

# Looks like degree might be distributed exponentially. We can check as follows:
simexpdist100 = rexp(n=length(V(sem1graph)), rate = .100)
simexpdist125 = rexp(n=length(V(sem1graph)), rate = .125)
simexpdist150 = rexp(n=length(V(sem1graph)), rate = .150)
lines(density(simexpdist100), col="red")
lines(density(simexpdist125), col="blue")
lines(density(simexpdist150), col="green")
# It's not a precise fit, but it might be a good approximation.

# Let's reorder the attitudes data to make sure it's in the same order as
# sem1matrix:
attitudesw = attitudesw[match(row.names(sem1matrix), attitudesw$std_id),]

# Make sure it worked (these should each return an empty result if so):
which(row.names(sem1matrix)!=as.character(attitudesw$std_id))
which(colnames(sem1matrix)!=as.character(attitudesw$std_id))

# Let's also make sure that igraph read the graph in correctly:
# (note that igraph can behave strangely if you pass it a
# parameter like "directed = FALSE" for a directed network like this one).
sem1$alter_id[sem1$std_id == 149824]
V(sem1graph)[0]
V(sem1graph)[unique(neighbors(sem1graph, v=0, mode = 1))]
# Looks good.

# Now we can compute the mean of sub.1 for a student's friends,
# by element-wise multiplying sub.1 by the matrix and then taking the
# mean of each non-zero cell in each row.

# Let's first test this out with a simple matrix to make sure we know what
# we are doing:
(M = matrix(c(1,0,1,0,1,0,1,0,1), nrow=3, ncol=3))
(R = c(1,2,3))
R * M
# This is multiplying the first row by 1, the second row by 2, and so on.
# Instead we want:
t(R * t(M))
# This is multiplying the first column by 1, the second column by 2, and so on.
# Recall that we will be analyzing this data so that rows are cases, so
# this is what we want if we are going to calculate each student's friends'
# mean sub.1. (The first option would return the student's own sub.1 for
# each non-zero row, not his/her friends' sub.1.)
sem1Wxsub1 = t(attitudesw$sub.1 * t(sem1matrix))
sem1Wxtch1 = t(attitudesw$tch.1 * t(sem1matrix))
sem1Wxcmt1 = t(attitudesw$cmt.1 * t(sem1matrix))

# Now we'll take the mean of cells not equal to zero:
attitudesw$frsub1mean = numeric(length(attitudesw$std_id))
attitudesw$frtch1mean = numeric(length(attitudesw$std_id))
attitudesw$frcmt1mean = numeric(length(attitudesw$std_id))

for(i in 1:length(attitudesw$std_id)){
  attitudesw$frsub1mean[i] = mean(sem1Wxsub1[i,sem1Wxsub1[i,]!=0 ], na.rm=T)
  attitudesw$frtch1mean[i] = mean(sem1Wxtch1[i,sem1Wxtch1[i,]!=0], na.rm=T)
  attitudesw$frcmt1mean[i] = mean(sem1Wxcmt1[i,sem1Wxcmt1[i,]!=0], na.rm=T)
}

# Now let's plot the data and look at the relationship:
plot(attitudesw$frsub1mean, jitter(attitudesw$deltasub))
plot(attitudesw$frtch1mean, jitter(attitudesw$deltatch))
plot(attitudesw$frcmt1mean, jitter(attitudesw$deltacmt))
# Looks noisy, but there might be a relationship for "sub".

############################################################################
# An alternative way to compute this is iteratively. The advantage
# of the following approach is that if you are working with a large
# data set, you do not need to load an adjacency matrix into
# memory; you can keep things in edgelist format, which is MUCH
# more efficient.
attitudesw$mfrsub.1 = numeric(length(attitudesw$sub.1))
attitudesw$mfrtch.1 = numeric(length(attitudesw$sub.1))
attitudesw$mfrcmt.1 = numeric(length(attitudesw$sub.1))

# If you thought the number of alters who like the class mattered
# more than the mean, you might uncomment the code for the nfrsubgt3
# variable and incorporate it into your analysis (two occurrences):
#attitudesw$nfrsubgt3 = numeric(length(attitudesw$sub.1))

for(i in 1:length(attitudesw$std_id)){
  # first get alters of student i:
  altrs = sem1$alter_id[ sem1$std_id == attitudesw$std_id[i] ]
  # then get alters' attitudes:
  altatts = attitudesw$sub.1[ attitudesw$std_id %in% altrs ]
  # now count how many friends like the class more than "3":
  attitudesw$nfrsubgt3[i] = length(which(altatts > 3))
  # then take the mean, ignoring NAs:
  attitudesw$mfrsub.1[i] = mean(altatts, na.rm = TRUE)
  # Note that this can all be done in one line of code:
  # attitudesw$mfrsub.1[i] = mean(attitudesw$sub.1[ attitudesw$std_id %in%
  #   sem1$alter_id[sem1$std_id %in% attitudesw$std_id[i]]], na.rm=TRUE )
  attitudesw$mfrtch.1[i] = mean(attitudesw$tch.1[ attitudesw$std_id %in%
    sem1$alter_id[sem1$std_id %in% attitudesw$std_id[i]]], na.rm=TRUE )
  attitudesw$mfrcmt.1[i] = mean(attitudesw$cmt.1[ attitudesw$std_id %in%
    sem1$alter_id[sem1$std_id %in% attitudesw$std_id[i]]], na.rm=TRUE )
}

# If you are going to run this with a lot of data, you may wish to include
# the following lines inside your for-loop:
# gc()
# print(i)
# This will run the garbage collector, gc(), which helps R manage memory,
# and print(i) will let you know your progress as R chugs away.

# The plot is exactly the same:
plot(attitudesw$mfrsub.1, jitter(attitudesw$deltasub))

# And the correlation should be 1:
cor(attitudesw$mfrsub.1, attitudesw$frsub1mean, use= "complete")

# The plots suggest that there might be a relationship for "sub", or
# how much each student likes the subject.

# Let's cheat a little bit and run linear models predicting
# change in each of our variables based on mean friends' values at t=1.
# The data violates many of the assumptions of OLS regression, so the
# estimates are not particularly meaningful, other than to let us know
# that we may be on the right path.
summary(lm(deltasub~mfrsub.1, data=attitudesw))
summary(lm(deltatch~mfrtch.1, data=attitudesw))
summary(lm(deltacmt~mfrcmt.1, data=attitudesw))
# Significance is always encouraging, even though we are only predicting a tiny
# fraction of the variance (R^2 = .026).

# We'll cheat a little more and remove everyone whose attitudes did not change:
summary(lm(deltasub~mfrsub.1, data=attitudesw, subset = deltasub!=0 ))
summary(lm(deltatch~mfrtch.1, data=attitudesw, subset = deltasub!=0 ))
summary(lm(deltacmt~mfrcmt.1, data=attitudesw, subset = deltasub!=0 ))
# Very nice: for "sub" the effect grows in strength and significance,
# and R^2 goes up to .067.

# Let's also take a look at the plot:
pdf("7.1_Peer_effect_on_opinion_of_subject.pdf")
plot(
  x = attitudesw$mfrsub.1,
  y = jitter(attitudesw$deltasub),
  main = "Peer effects on opinion of subject",
  xlab = "mean friends' opinion",
  ylab = expression(Delta~"opinion from "~t[1]*~to~t[2])
)
# We can draw a regression line on the plot based on the regression
# we estimated above:
abline(lm(deltasub~mfrsub.1, data=attitudesw ))
dev.off()

# We've got some evidence of a peer effect on how much a student
# likes the subject. Our evidence is based on temporal precedence:
# your friends' opinion of the subject predicts how much
# your opinion of the subject changes from t_1 to t_2.

############################################################################
# Now let's run some "real" models.

# We'll first estimate the model Y_2 = Y_1 + W*Y_alter, using a network
# autocorrelation model. Read up on it here:
# Roger Th. A. J. Leenders, Modeling social influence through network
# autocorrelation: constructing the weight matrix, Social Networks,
# Volume 24, Issue 1, January 2002, Pages 21-47, ISSN 0378-8733,
# DOI: 10.1016/S0378-8733(01)00049-1.
# (http://www.sciencedirect.com/science/article/B6VD1-44SKCC2-1/2/cae2d6b4cf4c1c21f4f870fd2d58b5cc)

# This model arguably takes into account the correlation between friends in
# the disturbances (error term). Yet there are problems with this kind of model.
# See:
# Eric J. Neuman, Mark S. Mizruchi, Structure and bias in the network autocorrelation
# model, Social Networks, In Press, Corrected Proof, Available online 31 May 2010,
# ISSN 0378-8733, DOI: 10.1016/j.socnet.2010.04.003.
# (http://www.sciencedirect.com/science/article/B6VD1-506H0SN-1/2/73a16c627271d29d9283b6d69b873a07)

# With that in mind, let's proceed:
library(sna)
library(numDeriv)

# We'll use the mfrsub.1 variable to approximate the W*Y_alter term. (If
# we actually used the matrix, we could not estimate the model--there
# would be the same number of variables as cases.)

# Let's remove the NA values from the attitudes measures
# (otherwise lnam will crash). We'll remove rows with NAs for
# sub.1, sub.2, or mfrsub.1. Note that we could remove any row with an
# NA value with the syntax na.omit(attitudesw), but if there are
# rows that contain our variables of interest but are NA for some other
# variable, na.omit would also drop these rows and we would lose valuable
# data.
atts = attitudesw[!is.na(attitudesw$sub.2),]
atts = atts[!is.na(atts$sub.1),]
atts = atts[!is.na(atts$mfrsub.1),]

W = sem1matrix
# Now we'll make sure the rows and columns in W are in the same
# order as atts:
W = W[match(atts$std_id,row.names(W)), match(atts$std_id,colnames(W))]

# Let's make sure (this will return an empty result if we did it right):
which(rownames(W) != atts$std_id)

# Now we'll estimate the peer influence model using lnam, which stands for
# Linear Network Autocorrelation Model.
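# An aside before we fit the model: the specification below places W in the
# disturbance term (lnam's W2 argument). lnam() can also model autocorrelation
# in the outcome itself via its W1 argument, i.e. y = rho1*W1%*%y + X*beta + e.
# A minimal sketch of that alternative specification, using the same atts and
# W objects (illustrative only, not part of the original lab):
# pim.ar <- lnam(y = atts$sub.2, x = cbind(atts$sub.1), W1 = W)
# summary(pim.ar)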
pim1<-lnam(atts$sub.2, cbind(atts$sub.1, atts$mfrsub.1), W2 = W)
summary(pim1)

# This shows a pretty strong peer effect on attitudes toward the subject at
# time 2. Notice that the adjusted R^2 is .41. This is not good news. Even
# after controlling for t_1 values of attitude toward the subject, we are
# predicting less than half of the variance in t_2 values. That means that
# there is a good chance of omitted variable bias.

# Below is an attempt to run the model with the actual WYalt as the X matrix.
# For the reasons stated above, it doesn't work. It is included in case the
# reader wants to uncomment this code and see for him/herself.
#WYalt = sem1Wxsub1
#WYalt = WYalt[match(atts$std_id,row.names(WYalt)), match(atts$std_id,colnames(WYalt))]
#image(WYalt)
#
##pim2<-lnam(y = atts$sub.2, x = WYalt, W2 = W)
#pim2<-lnam(y = atts$sub.2, x = WYalt)
#summary(pim2)

############################################################################
# Matching
#
# One method that is less vulnerable to some problems related to regression is
# matching. In matching, one divides the key explanatory variable into discrete
# "treatments" and then matches participants so that they are as similar on the
# remaining set of covariates as possible. A simple difference in means/medians/
# quantiles can then be used. From there we can use a simple t-test to estimate the
# "treatment" effect without having to worry about multicollinearity.
# If the data violates the normality assumption (as will be the
# case here), non-parametric methods such as a permutation test or bootstrapped
# standard errors can be used. This approach also violates fewer statistical
# assumptions and therefore should be expected to be less biased than running a
# simple OLS model.

# For the original article on matching, see:
# Donald B. Rubin, Matching to Remove Bias in Observational Studies,
# Biometrics, Vol. 29, No. 1 (Mar., 1973), pp. 159-183.
# Stable URL: http://www.jstor.org/stable/2529684

# Also worthy of note is the Rubin Causal Model, which uses matching.
# Read up on it here: http://en.wikipedia.org/wiki/Rubin_Causal_Model

# However, matching is not a cure-all and is obviously sensitive to the variables
# on which one matches. For an excellent article on this, see:
# Jeffrey A. Smith, Petra E. Todd, Does matching overcome LaLonde's critique of
# nonexperimental estimators?, Journal of Econometrics, Volume 125, Issues 1-2,
# Experimental and non-experimental evaluation of economic policy and models,
# March-April 2005, Pages 305-353.
# (http://www.sciencedirect.com/science/article/B6VC0-4CXTY59-1/2/06728bd79899fd9a5b814dfea9fd1560)

# A good introduction to matching using the MatchIt package can be found courtesy
# of Gary King here: http://gking.harvard.edu/matchit/docs/matchit.pdf
library(MatchIt)

# First, we dichotomize the independent variable of interest, mean friend opinion
# regarding the subject (mfrsub.1):
atts$mftreat = ifelse(atts$mfrsub.1 > 3, 1,0)
atts = atts[c("mftreat", "deltasub", "mfrsub.1","egrd.1", "sub.1", "tot.1",
  "frn.1", "cmt.1", "prev.1")]

# Before we do any matching, let's take a look at our new treatment variable.
# We'll make this into a factor, which will help interpret our results. A factor
# is a special type of object in R that is used for qualitative (nominal or ordinal)
# data/variables. A factor is also generally more efficient with respect to memory,
# because it stores an integer value corresponding to a label (e.g. 1 for "low" and 2
# for "high") rather than storing the entire string.
# Factors have an order, which you can set by ordering the labels in ascending order
# when you create the factor, or by using the reorder() function. We'll talk about this
# more in the section on dotplots below.
atts$mftreatf = factor(atts$mftreat, labels = c("low", "high"))

# Let's compute the means:
tapply(X = atts$deltasub, INDEX = list(atts$mftreatf), FUN = mean)

# Note, the tapply function is useful if you have more than one factor over which
# you want to calculate some function. For example, if there were another
# factor, say exptreat, we could get the mean for each cell using this syntax:
# tapply(X = atts$deltasub, INDEX = list(atts$mftreat, atts$exptreat), FUN = mean)

# A t-test is not totally kosher here because the DV is ordinal, and so it violates
# assumptions behind the t-test.
# But let's take a look anyway:
t.test(deltasub~mftreatf, data=atts)

# A better alternative is the Wilcoxon test (a.k.a. Mann-Whitney test),
# suitable for ordinal variables.
wilcox.test(x=atts$deltasub[atts$mftreatf=="low"],
  y=atts$deltasub[atts$mftreatf=="high"], conf.int = TRUE)

# Now let's match and redo the analyses.

# We normally would assign each subject/case to treatment or control groups based on
# whatever corresponds more closely to a treatment or a control. In this case, it
# is debatable whether having friends who like the class is more of a "treatment"
# than having friends who do not like the class. However, there are 246 students
# whose friends do not like the class versus 99 whose friends do like the class,
# so we can say that having friends who like the class is certainly more unusual
# than the opposite. Accordingly, we'll conceptualize having friends who like the
# class as the treatment.

# In the actual matching procedure, what we'll do is try to find individuals in
# the "control" condition (having friends who do not like the subject) who look
# most similar to each treated individual based on a variety of variables. Ideally,
# we will achieve "balance" on the covariates between those in the treatment condition
# and those we select from the control condition, so that the individuals are as
# similar as possible.

# Here are the variables we'll use (all are at time 1):
# sub.1  - how much the student likes the subject
# egrd.1 - expected grade
# tot.1  - how much the student likes the way the course is taught
# frn.1  - how interesting the student finds his/her friends
# cmt.1  - how interesting the student finds his/her classmates
# prev.1 - whether the student has had the teacher previously

# We'll use "nearest neighbor" matching, which uses a distance measure to
# select the individual in the "control" group that is closest to each
# individual in the "treatment" group. The default distance measure is the
# logistic regression propensity score.

# The codebook for this data is available here: http://stanford.edu/~messing/codebook.v12.txt
m.out = matchit(mftreat ~ egrd.1 + sub.1 + tot.1 + frn.1 + cmt.1 + prev.1,
  data = atts, method = "nearest")
summary(m.out)
plot(m.out)

# Now let's assess the extent to which our matching algorithm effectively matched each
# treatment subject to a control subject, i.e., the extent to which we achieved balance.
# The plot function displays Q-Q plots of the control (X-axis) versus treatment groups
# (Y-axis) for the original sample and the matched sample. The Q-Q plots indicate perfect
# balance on the covariate in question if all dots are perfectly aligned on the
# 45 degree line.
# The further the deviation from this line, the worse the samples are
# balanced on this variable.
# Based on the plots, matching appears to have improved balance on our covariates a
# little bit, but not perfectly.
# Ideally we want to experiment with other covariates and try to find the combination
# that best and most parsimoniously captures the key phenomena we want to control for.

matchatts = match.data(m.out)

# We'll make our treatment variable into a factor:
matchatts$mftreatf = factor(matchatts$mftreat, labels = c("low", "high"))

# Let's compute the means:
tapply(X = matchatts$deltasub, INDEX = list(matchatts$mftreatf), FUN = mean)

t.test(deltasub~mftreatf, data=matchatts)

wilcox.test(x=matchatts$deltasub[matchatts$mftreatf=="low"],
  y=matchatts$deltasub[matchatts$mftreatf=="high"], conf.int = TRUE)

# We can also perform a full permutation test. A permutation test is an exact test of
# whether it is possible to reject the null hypothesis that two distributions are
# the same. It works by first calculating the mean difference, which functions as the
# test statistic. The difference in sample means is then calculated and recorded for
# every possible way of dividing the pooled values into two groups of size n_A and
# n_B, i.e., for every permutation of the group labels A and B. The set of these
# calculated differences is the exact distribution of possible differences under the
# null hypothesis that group label does not matter. This test functions like a t-test
# but does not rely on any distributional assumptions about the data.
# For additional information, see:
# http://en.wikipedia.org/wiki/Resampling_%28statistics%29#Permutation_tests
library(coin)
independence_test(deltasub ~ mftreatf, data = matchatts, distribution = exact())

# Another alternative is to look at bootstrapped estimates of the mean and standard
# error. This is often an attractive way to estimate statistical significance
# because our data violates the distributional assumptions involved in estimating
# standard errors the usual way (namely, that our dependent variable resembles anything
# like a normal or t distribution--it cannot, because it is ordinal with only
# 7 levels). Because bootstrapping relies on resampling instead of distributional
# assumptions to estimate variance, it is more robust.

# There are packages available to generate bootstrapped estimates, but seeing the code
# is a valuable way to get a sense of exactly how bootstrapping works. Here is the code
# for a bootstrapping function designed to estimate the mean and standard error:
b.mean <- function(data, nsim) {
  # Create a list object to store the sets of resampled data (in this case nsim = 1000).
  resamples = list()
  # Create a vector of the same length to store the resampled means:
  r.mean = numeric(nsim)
  # Generate samples with replacement:
  for(i in 1:nsim){
    # generate a random sample of our data with replacement
    resamples[[i]] = sample(x=data, size=length(data), replace=T)
    # now calculate the mean for this iteration
    r.mean[i] = mean(resamples[[i]], na.rm=T)
  }
  # Calculate the mean of the simulated estimates above:
  mean.adj = mean(r.mean, na.rm=T)
  # Calculate how this differs from the arithmetic mean estimate of the original
  # sample:
  bias = mean(data, na.rm=T) - mean.adj
  # Generate the standard error of the estimate:
  std.err <- sqrt(var(r.mean))
  # Return results:
  return( data.frame(
    mean = mean(data, na.rm=T), # the mean estimate
    mean.adj = mean.adj,        # the adjusted estimate based on simulations
    bias = bias,                # the mean minus the adjusted estimate
    std.err = std.err           # the standard error of the estimate (based on simulations)
  ) )
}

# Before we use it on our data, let's make sure it works. We will simulate some
# data for which we know the parameters and then estimate the parameters using
# the bootstrap:
simdata = rnorm(n = 1000, mean = 5, sd = 1)
b.mean(simdata, nsim=1000)

# Calculate the theoretical mean and standard error:
mean(simdata)
(se = sd(simdata)/sqrt(length(simdata)-1)) # here we use the formula SE = SD/sqrt(N - 1)

# We have demonstrated that our bootstrapping function is a good estimator
# for "perfect" data that meets the assumptions behind traditional
# statistical tests. Now let's move on to our data.

# For those whose friends have, on average, positive attitudes:
b.mean(data=matchatts$deltasub[matchatts$mftreatf=="high"], nsim=1000)
# For those whose friends have, on average, negative attitudes:
b.mean(data=matchatts$deltasub[matchatts$mftreatf=="low"], nsim=1000)

# Here's how to do it using the boot package, which is faster and more
# extensible than the R code above:
library(boot)

# You have to write a special function for the boot package, which
# can be confusing at first, but is useful for more complicated
# types of bootstrapping. The boot package is also orders of magnitude
# faster than doing things in plain R, which is useful if you start to do
# more complicated bootstrapping or use a higher number of simulations.

# First, write a basic function that will return the estimate in question
# for x data using the cases indexed by d:
samplemean <- function(x, d) {
  return(mean(x[d]))
}

(bootlow = boot(data = matchatts$deltasub[matchatts$mftreatf=="low"],
  statistic = samplemean, R = 1000))
(boothigh = boot(data = matchatts$deltasub[matchatts$mftreatf=="high"],
  statistic = samplemean, R = 1000))

# Note that estimates of the standard errors will be slightly different
# each time due to the randomness involved in estimation. As you might expect,
# as the number of simulations increases, this variance will decrease.

# Based on our bootstrapped estimates and standard error calculations,
# it's clear that the two means are quite different. So, based on our
# matched samples, it seems there is a significant peer effect.
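# A follow-up worth knowing about (an aside, not in the original lab): the boot
# package can also compute confidence intervals directly from the bootstrap
# replicates via boot.ci, e.g. percentile intervals:
boot.ci(bootlow, type = "perc")
boot.ci(boothigh, type = "perc")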
##########################################################################
# Let's try full matching so that we are not throwing away any of our data:
m.out = matchit(mftreat ~ egrd.1 + sub.1 + tot.1 + frn.1 + cmt.1 + prev.1,
  data = atts, method = "full")
matchatts = match.data(m.out)

# We'll make this into a factor, which will help interpret our results:
matchatts$mftreatf = factor(matchatts$mftreat, labels = c("low", "high"))

# Note that we'll now have to use the weights that the output provided, because
# we told it to use the full sample.

# Let's compute the weighted means. (Note: weighted.mean takes its weights via
# the argument "w", so we pass each group's weights through a small wrapper;
# supplying "weights = ..." as an extra tapply argument would be silently
# ignored.)
tapply(X = seq_along(matchatts$deltasub), INDEX = list(matchatts$mftreatf),
  FUN = function(i) weighted.mean(matchatts$deltasub[i],
    w = matchatts$weights[i]))

# There is no straightforward way to compute a t-test in R with weighted data.
# We can, however, compute weighted standard errors and a corresponding
# 95% CI.
# Recall that a 95% CI of an estimate is calculated by taking the estimate
# +/- the standard error * 1.96. Actually, +/- 1.96 is just the normal
# approximation to the 0.025 and 0.975 quantiles of the relevant statistical
# distribution. We'll use qt(quantile, N-1) in this case.
library(Hmisc)

meanlow = wtd.mean(matchatts$deltasub[matchatts$mftreatf=="low"],
  weights = matchatts$weights[matchatts$mftreatf=="low"])
varlow = wtd.var(matchatts$deltasub[matchatts$mftreatf=="low"],
  weights = matchatts$weights[matchatts$mftreatf=="low"])
selow = sqrt(varlow)/sqrt(sum(matchatts$weights[matchatts$mftreatf=="low"]))
meanlow + selow * qt(.025, ( sum(matchatts$weights[matchatts$mftreatf=="low"]) - 1))
meanlow + selow * qt(.975, ( sum(matchatts$weights[matchatts$mftreatf=="low"]) - 1))

meanhigh = wtd.mean(matchatts$deltasub[matchatts$mftreatf=="high"],
  weights = matchatts$weights[matchatts$mftreatf=="high"])
varhigh = wtd.var(matchatts$deltasub[matchatts$mftreatf=="high"],
  weights = matchatts$weights[matchatts$mftreatf=="high"])
sehigh = sqrt(varhigh)/sqrt(sum(matchatts$weights[matchatts$mftreatf=="high"]))
meanhigh + sehigh * qt(.025, ( sum(matchatts$weights[matchatts$mftreatf=="high"]) - 1))
meanhigh + sehigh * qt(.975, ( sum(matchatts$weights[matchatts$mftreatf=="high"]) - 1))

# Here's how to do it with the boot package. (Note: boot passes resampled
# indices d to the statistic, so we subset the weights by d as well, keeping
# each resampled case paired with its own weight.)
sample.wtd.mean <- function(x, d, wts) {
  return(weighted.mean(x[d], w = wts[d]))
}

(bootlow = boot(data = matchatts$deltasub[matchatts$mftreatf=="low"],
  wts = matchatts$weights[matchatts$mftreatf=="low"],
  statistic = sample.wtd.mean, R = 1000))
(boothigh = boot(data = matchatts$deltasub[matchatts$mftreatf=="high"],
  wts = matchatts$weights[matchatts$mftreatf=="high"],
  statistic = sample.wtd.mean, R = 1000))

# We can make a nice plot of the results with the lattice package:
library(lattice)

# Dot plots:
visual = list()
visual$var = c("Friends do not like subj.", "Friends like subj.")
visual$M = c(bootlow$t0, boothigh$t0)
visual$se = c(sd(bootlow$t), sd(boothigh$t))
visual$N = c(length(matchatts$weights[matchatts$mftreatf=="low"]),
  length(matchatts$weights[matchatts$mftreatf=="high"]))
visual$lo = visual$M - visual$se
visual$up = visual$M + visual$se
visual = as.data.frame(visual)

# Print it out:
pdf(file="7.2_dotplot_matchoutput_bootstrapped_SE.pdf", width = 6, height = 3)
dotplot(
  reorder(var,M) ~ M,
  data = visual,
  main="Effect of friends' attitude on opinion change,\nmatched samples with bootstrapped SE",
  panel = function (x, y, subscripts, groups, ...)
  {
    panel.dotplot(x, y)
    panel.segments(x0=visual$lo[subscripts], y0=as.numeric(y),
      x1=visual$up[subscripts], y1=as.numeric(y), lty = 1 )
  },
  xlab=expression(Delta~"opinion from "~t[1]*~to~t[2]),
  xlim= c(-.5, .2)
)
dev.off()
# Very nice!

############################################
# QUESTION #1 - What do these results show?
# What assumptions are being made?
# What might lead you to question the results? What would improve them?
############################################

#########################################################################
# Extra credit:
#
# Use the McFarland dataset to acquire other individual-level variables
# and develop better matching. Then follow this lab and explore the
# variety of outcomes suitable for illustrating peer effects in
# classrooms (e.g., PSAT scores, engagement, conflict, etc.). Results
# are likely suitable for an A-journal publication.
#
# Again, the codebook is available here: http://stanford.edu/~messing/codebook.v12.txt
#
#########################################################################

#########################################################################
# PART II -- QAP REGRESSION
#########################################################################
#
# Here we want to compare different classrooms and discern what leads
# participants to become more or less sociable and academically engaged
# with one another.
#
# Class m173 is an accelerated math class of all 10th graders taught
# by a rigid teacher who used teacher-centered instructional formats.
# Class m182 is a regular math class of 10th and 11th grade students
# taught by a casual teacher who used student-centered instructional formats.
#
#########################################################################

# igraph and sna don't mix well. Before we load sna, we will detach igraph.
detach(package:igraph)

# Load the "sna" library
library(sna)

# Clear variables
rm(list=ls(all=TRUE))
gc()

#(2) QAP regressions for m173

# Note that each matrix must be the same size (n x n). You'll want to make
# sure all input and output matrices are the same size. Predictor matrices
# thus are not individual-level variables; they are differences, matches,
# etc. You'd use a matrix of differences in the case of quantitative (node-level)
# variables, and matches in the case of categorical/dichotomous (node-level)
# variables. There's a really easy way to do it in R
# (say your node-level variables are x and y):
# x <- seq(1,4)
# y <- seq(2,5)
# outer(x,y,FUN="-")  # find the difference between x and y
# outer(x,y,FUN="==") # produce a 0/1 similarity matrix
# (A short worked example with a categorical attribute appears just before
# the array construction below.)

# For this lab we will use matrices that were previously generated in UCINet.

# Loading predictor matrices. Each matrix is an n x n matrix;
# the data is saved in CSV format. You can open the files in Excel
# to see the structure.
data(studentnets.mrqap173, package="NetData")

# Look at what we loaded via ls()

# We need the data in matrix format.
# Predictor matrices:
m173_sem1_SSL <- as.matrix(m173_sem1_SSL)
m173_sem1_TSL <- as.matrix(m173_sem1_TSL)
m173_sem1_FRN <- as.matrix(m173_sem1_FRN)
m173_sem1_SEAT <- as.matrix(m173_sem1_SEAT)
m173_sem1_RCE <- as.matrix(m173_sem1_RCE)
m173_sem1_GND <- as.matrix(m173_sem1_GND)

# Load response matrices:
m173_sem2_SSL <- as.matrix(m173_sem2_SSL)
m173_sem2_TSL <- as.matrix(m173_sem2_TSL)

# In order to run the QAP regression we must create an array of matrices
# containing all the predictor matrices. We are, in effect, creating a
# 3-d matrix (predictor x n x n).
# Important: All matrices must be the same size!
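# As promised above, a short worked example of turning a categorical
# node-level attribute into a 0/1 match matrix (g here is a made-up gender
# vector for illustration, not part of the lab data):
# g <- c("F", "M", "F", "M")
# outer(g, g, FUN = "==") * 1   # 1 = same category, 0 = different
# And a quick way to verify that two matrices share the same dimensions
# before stacking them into the array below:
# stopifnot(all(dim(m173_sem1_SSL) == dim(m173_sem2_SSL)))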
response_matrices <- array(NA, c(6, length(m173_sem1_SSL[1,]), length(m173_sem1_SSL[1,])))
response_matrices[1,,] <- m173_sem1_SSL
response_matrices[2,,] <- m173_sem1_TSL
response_matrices[3,,] <- m173_sem1_FRN
response_matrices[4,,] <- m173_sem1_SEAT
response_matrices[5,,] <- m173_sem1_RCE
response_matrices[6,,] <- m173_sem1_GND

##############################
#(2a) SSL2 <- SSL1 + TSL1 + FRN1 + SEAT1 + RCE + GND
##############################

# Fit a netlm model by using netlm, the response matrix, and the array of
# predictor matrices. This may take a LONG time.
nl<-netlm(m173_sem2_SSL,response_matrices)

# Make the model easier to read by adding labels for each predictor matrix.
nlLabeled <- list()
nlLabeled <- summary(nl)
# Labels are provided in the same order as they were assigned in the
# response_matrices array:
nlLabeled$names <- c("Intercept", "Social normalized and labeled (SSL1)",
  "Task normalized and labeled (TSL1)", "Friends 1=friend, 2=best friend (FRN1)",
  "Seat in first semester (SEAT1)", "Race (RCE)", "Gender (GND)")

# Round the coefficients to two decimal places:
nlLabeled$coefficients = round(nlLabeled$coefficients, 2)
nlLabeled

##############################
#(2b) TSL2 <- TSL1 + SSL1 + FRN1 + SEAT1 + RCE + GND
##############################

# Fit a netlm model by using netlm, the response matrix, and the array of
# predictor matrices.
nl<-netlm(m173_sem2_TSL,response_matrices)

# Make the model easier to read:
nlLabeled <- list()
nlLabeled <- summary(nl)
nlLabeled$names <- c("Intercept", "Social normalized and labeled (SSL1)",
  "Task normalized and labeled (TSL1)", "Friends 1=friend, 2=best friend (FRN1)",
  "Seat in first semester (SEAT1)", "Race (RCE)", "Gender (GND)")
nlLabeled$coefficients = round(nlLabeled$coefficients, 2)
nlLabeled

##############################
#(3) QAP regressions for m182
##############################
# Repeat for class m182.

# Clear variables
rm(list=ls(all=TRUE))
gc()

data(studentnets.mrqap182, package = "NetData")

# Look at what we loaded via ls()

# Again, we need the data in matrix format.
# Predictor matrices:
m182_sem1_SSL <- as.matrix(m182_sem1_SSL)
m182_sem1_TSL <- as.matrix(m182_sem1_TSL)
m182_sem1_FRN <- as.matrix(m182_sem1_FRN)
m182_sem1_SEAT <- as.matrix(m182_sem1_SEAT)
m182_sem1_RCE <- as.matrix(m182_sem1_RCE)
m182_sem1_GND <- as.matrix(m182_sem1_GND)

# Response matrices:
m182_sem2_SSL <- as.matrix(m182_sem2_SSL)
m182_sem2_TSL <- as.matrix(m182_sem2_TSL)

# This class will require you to make multiple extraction operations.
# A student exits the class at the end of first semester and another
# enters at the start of second semester. These students must be removed
# before you can conduct QAP regression (you need the same row-column orderings).
# Extract actor 15 for TSL and SSL of second semester (#15 <- new student).
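# A minimal sketch of the extraction step described above. This assumes the
# new student sits in row/column 15 of the semester-2 matrices; check the
# dimnames of your own matrices before dropping anything:
# m182_sem2_SSL <- m182_sem2_SSL[-15, -15]
# m182_sem2_TSL <- m182_sem2_TSL[-15, -15]
# The departing student would be removed from the semester-1 predictor
# matrices the same way, so that all matrices share one row/column ordering.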
response_matrices <- array(NA, c(6, length(m182_sem1_SSL[1,]), length(m182_sem1_SSL[1,])))
response_matrices[1,,] <- m182_sem1_SSL
response_matrices[2,,] <- m182_sem1_TSL
response_matrices[3,,] <- m182_sem1_FRN
response_matrices[4,,] <- m182_sem1_SEAT
response_matrices[5,,] <- m182_sem1_RCE
response_matrices[6,,] <- m182_sem1_GND

##############################
#(3a) SSL2 <- SSL1 + TSL1 + FRN1 + SEAT1 + RCE + GND
##############################

# Fit a netlm model:
nl<-netlm(m182_sem2_SSL,response_matrices)

# Make the model easier to read:
nlLabeled <- list()
nlLabeled <- summary(nl)
nlLabeled$names <- c("Intercept", "Social normalized and labeled (SSL1)",
  "Task normalized and labeled (TSL1)", "Friends 1=friend, 2=best friend (FRN1)",
  "Seat in first semester (SEAT1)", "Race (RCE)", "Gender (GND)")
nlLabeled$coefficients = round(nlLabeled$coefficients, 2)
nlLabeled

##############################
#(3b) TSL2 <- SSL1 + TSL1 + FRN1 + SEAT1 + RCE + GND
##############################

# Fit a netlm model:
nl<-netlm(m182_sem2_TSL,response_matrices)

# Make the model easier to read:
nlLabeled <- list()
nlLabeled <- summary(nl)
nlLabeled$names <- c("Intercept", "Social normalized and labeled (SSL1)",
  "Task normalized and labeled (TSL1)", "Friends 1=friend, 2=best friend (FRN1)",
  "Seat in first semester (SEAT1)", "Race (RCE)", "Gender (GND)")
nlLabeled$coefficients = round(nlLabeled$coefficients, 2)
nlLabeled

# Report your results. Describe what they mean. Repeat for ETSL of second semester.

#######################################################################
# QUESTION #2 - Compare your results for m173 and m182.
# Do the classes have different results?
# If not, why not?
# If so, why so?
# What sort of story can you derive about the change in
# task and social interactions within these classrooms?
#
# Remember you are predicting changes in relations of sociable or task
# interaction using other types of relations (i.e., friendship,
# proximate seating, same race, etc.).
#########################################################################
#
# Extra credit - run the QAP regression part of the lab on the
# entire McFarland dataset of classrooms, and perform meta-analyses
# on the results using multi-level modeling - then presto -
# you will have results on academic and social relations in classrooms
# which are suitable for publication in an A-journal.
#
#########################################################################