sna_eg_stanford:lab07
Lab 07
##############################################
# Lab 7: PEER INFLUENCE & QAP REGRESSION LAB #
##############################################
# NOTE: if you have trouble because some packages are not installed,
# see lab 1 for instructions on how to install all necessary packages.
###########################################################################
#
# Lab 7
#
# Part I - Develop peer influence models that meet the high standards of
# publication in todays top journals (i.e., addressing autocorrelation
# and issues of causality).
#
# Part II - Use QAP regression to predict increases in social interaction
# and task interaction for two classrooms. Basically run dyadic level
# regressions suitable for continuous variables and count data (what
# ERGM- Exponential Random Graph's cannot do).
#
###########################################################################
##########################################################################
# PART I -- PEER INFLUENCE
###########################################################################
#
# This lab examines peer effects in classroom data. It first introduces
# the reader to the necessary data processing and manipulation, then basic
# visualization related to peer-effects. It then introduces the Linear Network
# Autocorrelation Model (lnam), which can be used to better model the
# dependencies in network data (well, better than an OLS model anyway).
#
# The next section introduces the reader to the concept of matching--treating
# the explanatory variable of interest as an experimental "treatment"
# and then matching on other covariates. Matching allows the application
# of straightforward analytical techniques used to analyze experiments,
# namely comparison of means. Code for bootstrapping is introduced to allow for
# non-parametric estimation of the standard errors for the mean. Code is also provided
# to produce a dotplot showing the means and a (bootstrapped) 95% CI.
#
# The data was collected by Dan McFarland (McFarland 2001). The key measures of interest
# for this example are the time 1 and 2 measures of how much each student liked
# the subject, and the friend network at time 1.
#
# The codebook is available here: http://stanford.edu/~messing/codebook.v12.txt
###############################################################################
#Clear Variables
rm(list=ls(all=TRUE))
gc()
# Install and load necessary packages:
install.packages("reshape", dependencies = TRUE)
library(reshape)
library(igraph)
data(studentnets.peerinfl, package="NetData")
######################################################################################
# Now we need to reshape our attitudes data according to semester. The data is
# currently in "long" format, so that student ids and measures are repeated on
# multiple lines and "semester" is a single variable. We need
# to transform the data so that measures are repeated on a single
# line, or "wide" format. We'll use the excellent "reshape" package
# to accomplish this. Take a look at the documentation using ?reshape
?reshape
attitudesw = reshape( attitudes, idvar="std_id",
timevar="sem_id", direction="wide" )
# Notice that all semester 1 variables now feature a ".1" postscript, while all
# semester 2 variables feature a ".2" postscript
# We'll first visualize the data. We want to know whether the change in
# a student's appreciation of a subject (t_2 - t_1) changes in the direction of his
# friends' appreciation at t_1.
# So, we'll first take the difference of a few dependent variables
# for which we might expect peer influence (e.g. "sub" (liking of subject),
# "cmt" (liking of classmates), and "tch" (liking of teacher)),
# and then take the mean value for an individual's friends' values at t_1.
# First create the delta variables:
attitudesw$deltasub = attitudesw$sub.2 - attitudesw$sub.1
attitudesw$deltatch = attitudesw$tch.2 - attitudesw$tch.1
attitudesw$deltacmt = attitudesw$cmt.2 - attitudesw$cmt.1
# Next we'll create the mean of friends' sub.1.
# We'll use the adjacency matrix, multiply that by the attitudesw$sub.1
# (element-wise, not matrix multiplication), and then take the mean of each row.
sem1graph = graph.data.frame(d = sem1[1:2], directed=TRUE)
#sem1graph = network(x = sem1[1:2], directed=TRUE)
#sem2graph = graph.data.frame(d = sem2, directed=TRUE)
# But first, we'll need to clean the data to make sure the rows line up!
# let's check to see whether the edge data has the same cases as the attitudes data
which(!(V(sem1graph)$name %in% as.character(attitudesw$std_id)))
which(!(as.character(attitudesw$std_id) %in% V(sem1graph)$name ))
# They are not the same... we'll need to address this below.
# Now let's get the matrix representation of the network:
sem1matrix = get.adjacency(sem1graph, binary=T)
# This is often referred to as "W" in the literature
# When you have a large square matrix like this, you can get a good idea of
# it's density/sparsity using the image function.
image(sem1matrix)
# It's also generally good to know the degree distribution of any network you
# are handling:
plot(density(degree(sem1graph)))
# Looks like degree might be distributed exponentially. We can check as follows:
simexpdist100 = rexp(n=length(V(sem1graph)), rate = .100)
simexpdist125 = rexp(n=length(V(sem1graph)), rate = .125)
simexpdist150 = rexp(n=length(V(sem1graph)), rate = .150)
lines(density(simexpdist100), col="red")
lines(density(simexpdist125), col="blue")
lines(density(simexpdist150), col="green")
# It's not a precise fit, but it might be a good approximation.
# Let's reorder the attitudes data to make sure it's in the same order as
# sem1matrix
attitudesw = attitudesw[match(row.names(sem1matrix), attitudesw$std_id),]
# Make sure it worked (this should return 0 if so):
which(row.names(sem1matrix)!=as.character(attitudesw$std_id))
which(colnames(sem1matrix)!=as.character(attitudesw$std_id))
# Let's also make sure that igraph read the graph in correctly:
# (note that igraph can behave strangely if you pass it a
# parameter like "directed = FALSE" for a directed network like this one).
sem1$alter_id[sem1$std_id == 149824]
V(sem1graph)[0]
V(sem1graph)[unique(neighbors(sem1graph, v=0, mode = 1))]
# looks good.
# Now we can compute the mean of sub.1 for a student's friends,
# by element-wise multiplying sub.1 by the matrix and then taking the
# mean of each non-zero cell in each row:
# Let's first test this out with a simple matrix to make sure we know what
# we are doing:
(M = matrix(c(1,0,1,0,1,0,1,0,1), nrow=3, ncol=3))
(R = c(1,2,3))
R * M
# This is multiplying the first row by 1, the second row by 2, and so on.
# Instead we want:
t(R * t(M))
# This is multiplying the first column by 1, the second column by 2, and so on.
# Recall that we will be analyzing this data so that rows are cases, so
# this is what we want if we are going to calculate sub.1 for each
# student's friends' sub.1. (The first option would return the
# student's sub.1 for each non-zero row, not his/her friends' sub.1).
sem1Wxsub1 = t(attitudesw$sub.1 * t(sem1matrix))
sem1Wxtch1 = t(attitudesw$tch.1 * t(sem1matrix))
sem1Wxcmt1 = t(attitudesw$cmt.1 * t(sem1matrix))
# Now we'll take the mean of cells not equal to zero:
attitudesw$frsub1mean = numeric(length(attitudesw$std_id))
attitudesw$frtch1mean = numeric(length(attitudesw$std_id))
attitudesw$frcmt1mean = numeric(length(attitudesw$std_id))
for(i in 1:length(attitudesw$std_id)){
attitudesw$frsub1mean[i] = mean(sem1Wxsub1[i,sem1Wxsub1[i,]!=0 ], na.rm=T)
attitudesw$frtch1mean[i] = mean(sem1Wxtch1[i,sem1Wxtch1[i,]!=0], na.rm=T)
attitudesw$frcmt1mean[i] = mean(sem1Wxcmt1[i,sem1Wxcmt1[i,]!=0], na.rm=T)
}
# Now let's plot the data and look at the relationship:
plot(attitudesw$frsub1mean, jitter(attitudesw$deltasub))
plot(attitudesw$frtch1mean, jitter(attitudesw$deltatch))
plot(attitudesw$frcmt1mean, jitter(attitudesw$deltacmt))
# Looks noisy, but there might be a relationship for "sub".
############################################################################
# An alternative way to compute this is iteratively. The advantage
# to the following approach is that if you are working with a large
# data set, you do not need to load an adjacency matrix into
# memory, you can keep things in edgelist format, which is MUCH
# more efficient.
attitudesw$mfrsub.1 = numeric(length(attitudesw$sub.1))
attitudesw$mfrtch.1 = numeric(length(attitudesw$sub.1))
attitudesw$mfrcmt.1 = numeric(length(attitudesw$sub.1))
# If you thought the number of alters who like the class mattered
# more than the mean, you might uncomment the code for the nfrsubgt3
# variable and incorporate it into your analysis (two occurrences):
#attitudesw$nfrsubgt3 = numeric(length(attitudesw$sub.1))
for(i in 1:length(attitudesw$std_id)){
# first get alters of student i:
altrs = sem1$alter_id[ sem1$std_id == attitudesw$std_id[i] ]
# then get alters' attitudes
altatts = attitudesw$sub.1[ attitudesw$std_id %in% altrs ]
# now count how many friends like the class more than "3"
attitudesw$nfrsubgt3[i] = length(which(altatts > 3))
# then take the mean, ignoring NAs:
attitudesw$mfrsub.1[i] = mean(altatts, na.rm = TRUE)
# Note that this can all be done in one line of code:
# attitudesw$mfrsub.1[i] = mean(attitudesw$sub.1[ attitudesw$std_id %in% sem1$alter_id[sem1$std_id %in% attitudesw$std_id[i]]], na.rm=TRUE )
attitudesw$mfrtch.1[i] = mean(attitudesw$tch.1[ attitudesw$std_id %in% sem1$alter_id[sem1$std_id %in% attitudesw$std_id[i]]], na.rm=TRUE )
attitudesw$mfrcmt.1[i] = mean(attitudesw$cmt.1[ attitudesw$std_id %in% sem1$alter_id[sem1$std_id %in% attitudesw$std_id[i]]], na.rm=TRUE )
}
# if you are going to run this with a lot of data, you may wish to include
# the following lines inside your for-loop:
# gc()
# print(i)
# this will run the garbage collector, gc(), which helps R manage memory
# and print(i) will let you know your progress as R chugs away.
# The plot is exactly the same:
plot(attitudesw$mfrsub.1, jitter(attitudesw$deltasub))
# And the correlation should be 1
cor(attitudesw$mfrsub.1,attitudesw$frsub1mean, use= "complete")
# The plots suggest that there might be a relationship for "sub", or
# how much each student likes the subject.
# Let's cheat a little bit and run linear models predicting
# change in each of our variables based on mean friend's values at t=1.
# The data violates many of the assumptions of OLS regression so the
# estimates are not particularly meaningful, other than to let us know
# that we may be on the right path.
summary(lm(deltasub~mfrsub.1, data=attitudesw))
summary(lm(deltatch~mfrtch.1, data=attitudesw))
summary(lm(deltacmt~mfrcmt.1, data=attitudesw))
# Significance is always encouraging, even though we are only predicting a tiny
# fraction of the variance (R^2 = .026)
# We'll cheat a little more and remove everyone whose attitudes did not change:
summary(lm(deltasub~mfrsub.1, data=attitudesw, subset = deltasub!=0 ))
summary(lm(deltatch~mfrtch.1, data=attitudesw, subset = deltasub!=0 ))
summary(lm(deltacmt~mfrcmt.1, data=attitudesw, subset = deltasub!=0 ))
# Very nice, for "sub" the effect grows in strength, signficance,
# and R^2 goes up to .067.
# Let's also take a look at the plot:
pdf("7.1_Peer_effect_on_opinion_of_subject.pdf")
plot( x = attitudesw$mfrsub.1,
y = jitter(attitudesw$deltasub),
main = "Peer effects on opinion of subject",
xlab = "mean friends' opinion",
ylab = expression(Delta~"opinion from "~t[1]*~to~t[2])
)
# We can draw a regression line on the plot based on the regression
# we estimated above:
abline(lm(deltasub~mfrsub.1, data=attitudesw ))
dev.off()
# We've got some evidence of a peer effect on how much a student
# likes the subject. Our evidence is based on temporal precendence:
# your friends opinion of the subject predicts how much
# your opinion the subject changes from t_1 to t_2.
############################################################################
# Now let's run some "real" models.
# We'll first estimate the model Y_2 = Y_1 + W*Y_alter, using a network
# autocorrelation model. Read up on it here:
# Roger Th. A. J. Leenders, Modeling social influence through network
# autocorrelation: constructing the weight matrix, Social Networks, Volume 24, Issue 1, January 2002, Pages 21-47, ISSN 0378-8733, DOI: 10.1016/S0378-8733(01)00049-1.
# (http://www.sciencedirect.com/science/article/B6VD1-44SKCC2-1/2/cae2d6b4cf4c1c21f4f870fd2d58b5cc)
# This model arguably takes into accounts the correlation between friends in
# the disturbances (error term). Yet there are problems with this kind of model.
# See:
# Eric J. Neuman, Mark S. Mizruchi, Structure and bias in the network autocorrelation
# model, Social Networks, In Press, Corrected Proof, Available online 31 May 2010, ISSN 0378-8733, DOI: 10.1016/j.socnet.2010.04.003.
# (http://www.sciencedirect.com/science/article/B6VD1-506H0SN-1/2/73a16c627271d29d9283b6d69b873a07)
# With that in mind, let's proceed:
library(sna)
library(numDeriv)
# We'll use the mfrsub.1 variable to approximate the W*Y_alter term. (If
# we actually use the matrix, we cannot estimate the model--there
# would be the same number of variables as cases).
# Let's remove the NA values from the attitudes measures
# (otherwise lnam will crash). We'll remove rows with NAs for
# sub.1, sub.2, or mfrsub.1. Note that we could remove any row with an
# NA value with the syntax: na.omit(attitudesw), but if there are
# rows that contain our variables of interest but are NA for some other
# variable, na.omit would also drop these rows and we would lose valuable
# data.
atts = attitudesw[!is.na(attitudesw$sub.2),]
atts = atts[!is.na(atts$sub.1),]
atts = atts[!is.na(atts$mfrsub.1),]
W = sem1matrix
# Now we'll make sure the rows and columns in W are in the same
# order as atts:
W = W[match(atts$std_id,row.names(W)), match(atts$std_id,colnames(W))]
# Let's make sure (this will return 0 if we did it right):
which(rownames(W) != atts$std_id)
# Now we'll estimate the peer influence model using lnam, which stands for
# Linear Network Autocorrelation Model.
pim1<-lnam(atts$sub.2,cbind(atts$sub.1, atts$mfrsub.1), W2 = W)
summary(pim1)
# This shows a pretty strong peer effect on subjects at time2. Notice that
# the adjusted R^2 is .41. This is not good news. Even after controlling
# for t1 values of attitude for the subject, we are predicting less than
# half of the variance in t_2 values. That means that there is a good
# chance of omitted variable bias.
# Below is an attempt to run the model with the actual WYalt as the X matrix.
# For the reasons stated above, it doesn't work. It is included in case the
# reader wants to uncomment this code and see for him/herself.
#WYalt = sem1Wxsub1
#WYalt = WYalt[match(atts$std_id,row.names(WYalt)), match(atts$std_id,colnames(WYalt))]
#image(WYalt)
#
##pim2<-lnam(y = atts$sub.2, x = WYalt, W2 = W)
#pim2<-lnam(y = atts$sub.2, x = WYalt)
#summary(pim2)
############################################################################
# Matching
#
# One method that is less vulnerable to some problems related to regression is
# matching. In matching, one divides the key explanatory variable into discrete
# "treatments" and then matches participants so that they are as similar on the
# remaining set of covariates as possible. A simple difference in means/medians/
# quantiles can be used. From there we can use a simple t-test to estimate the
# "treatment" effect without having to worry about multicolinearity.
# If the data violates the normality assumption (as will be the
# case here), non-parametric methods such as a permutation test or bootstrapped
# standard errors can be used. This approach also also violates less statistical
# assumptions and therefore should be expected to be less biased than running a
# simple OLS model.
# For the original article on matching, see:
# Donald B. Rubin, Matching to Remove Bias in Observational Studies
# Biometrics, Vol. 29, No. 1 (Mar., 1973), pp. 159-183
# Stable URL: http://www.jstor.org/stable/2529684
# Also worthy of note is the Rubin Causal Model, which uses matching.
# Read up on it here: http://en.wikipedia.org/wiki/Rubin_Causal_Model
# However, matching is not a cure-all and is obviously sensitive to the variables
# on which one matches. For an excellent article on this, see:
# Jeffrey A. Smith, Petra E. Todd, Does matching overcome LaLonde's critique of
# nonexperimental estimators?, Journal of Econometrics, Volume 125, Issues 1-2,
# Experimental and non-experimental evaluation of economic policy and models,
# March-April 2005, Pages 305-353.
# (http://www.sciencedirect.com/science/article/B6VC0-4CXTY59-1/2/06728bd79899fd9a5b814dfea9fd1560)
# A good introduction to matching using the MatchIt package can be found courtesy
# of Gary King here: http://gking.harvard.edu/matchit/docs/matchit.pdf
library(MatchIt)
# First, we dichotomize the independent variable of interest, mean friend opinion regarding
# subject: mfrsub.1
atts$mftreat = ifelse(atts$mfrsub.1 > 3, 1,0)
atts = atts[c("mftreat", "deltasub", "mfrsub.1","egrd.1", "sub.1", "tot.1", "frn.1", "cmt.1", "prev.1")]
# Before we do any matching, let's take a look at our new treatment variable.
# We'll make this into a factor, which will help interpret our results. A factor
# is a special type of object in R that is used for qualitative (nominal or ordinal)
# data/variables. A factor is also generally more efficient with respect to memory,
# because it stores an integer value corresponding to a label (e.g. 1 for "low" and 2
# for "high") rather than storing the entire string.
# Factors have an order, which you can set by ordering the labels in ascending order
# when you create the factor, or by using the reorder() function. We'll talk about this
# more in the section on dotplots below.
atts$mftreatf = factor(atts$mftreat, labels = c("low", "high"))
# Let's compute the means:
tapply(X = atts$deltasub, INDEX = list(atts$mftreatf), FUN = mean)
# Note, the tapply function is useful if you have more than one factor for which
# you want to calcuate some function over. For example, if there were another
# factor, say exptreat, we could get the mean for each cell using this syntax:
# tapply(X = atts$deltasub, INDEX = list(atts$mftreat, atts$exptreat), FUN = mean)
# A t-test is not totally kosher here because the DV is ordinal, and so it violates
# assumptions behind the t-test.
# But let's take a look anyway:
t.test(deltasub~mftreatf, data=atts)
# A better alternative is the Wilcoxon test (a.k.a. Mann-Whitney test),
# suitable for ordinal variables.
wilcox.test(x=atts$deltasub[atts$mftreatf=="low"],
y=atts$deltasub[atts$mftreatf=="high"],
conf.int = TRUE)
# Now let's match and redo the analyses.
# We normally would assign each subject/case to treatment or control groups based on
# whatever corresponds more closely to a treatment or a control. In this case, it
# is debatable whether having friends who like the class is more of a "treatment"
# than having friends who do not like the class. However, there are 246 students
# whose friends do not like the class versus 99 whose friends do like the class,
# so we can say that having friends who like the class is certainly more unusual
# than the opposite. Accordingly, we'll conceptualize having friends who like the
# class as the treatment.
# In the actual matching proceedure, what we'll do is try to find individuals in
# the "control" condition (having friends who do not like the subject) who look
# most similar to each treated individual based on a variety of variables. Ideally,
# we will achieve "balance" on the covariates between those in the treatment condition,
# and those we select from the control condition, so that the individuals are as similar
# as possible.
# Here are the variables we'll use (all are at time 1):
# sub.1 - how much the student likes the subject
# egrd.1 - expected grade
# tot.1 - how much the student likes the way the course is taught
# frn.1 - how interesting the subject finds his/her friends
# cmt.1 - how interesting the student finds his/her classmates
# prev.1 - has the student had the teacher previously)
# We'll use "nearest neighbor" matching, which uses a distance measure to
# select the individual in the "control" group that is closest to each
# individual in the "treatment" group. The default distance measure is the
# logistic regression propensity score.
# The codebook for this data is available here: http://stanford.edu/~messing/codebook.v12.txt
m.out = matchit(mftreat ~ egrd.1 + sub.1 + tot.1 + frn.1 + cmt.1 + prev.1,
data = atts, method = "nearest")
summary(m.out)
plot(m.out)
# Now let's assess the extent to which our matching algorithm effectively matched each
# treatment subject to a control subject, e.g., the extent to which we achieved balance.
# The plot function displays Q-Q plots of the control (X-axis) versus treatment groups
# (Y-axis) for the original sample and the matched sample. The Q-Q plots indicate perfect
# balance on the covariate in question if all dots are perfectly aligned on the
# 45 degree line. The further the deviation from this line, the worse the samples are
# balanced on this variable.
# Based on the plots, matching appears to have improved balance on our covariates a
# little bit, but not perfectly.
# Ideally we want to experiment with other covariates and try to find the combination
# that best and most parsimoniously captures the key phenomena we want to control for.
matchatts = match.data(m.out)
# We'll make our treatment variable into a factor:
matchatts$mftreatf = factor(matchatts$mftreat, labels = c("low", "high"))
# Let's compute the means:
tapply(X = matchatts$deltasub, INDEX = list(matchatts$mftreatf), FUN = mean)
t.test(deltasub~mftreatf, data=matchatts)
wilcox.test(x=matchatts$deltasub[matchatts$mftreatf=="low"],
y=matchatts$deltasub[matchatts$mftreatf=="high"],
conf.int = TRUE)
# We can also perform a full permuation test. A permutation test is an exact test of
# whether it is possible to reject the null hypothesis that two distributions are
# the same. It works by first calulating the mean difference, which fuctions as the
# test statistic. The difference in sample means is calculated and recorded for every
# possible way of dividing these pooled values into two groups of size n_A and n_B
# for every permutation of the group labels A and B. The set of these calculated
# differences is the exact distribution of possible differences under the null
# hypothesis that group label does not matter. This test fuctions like a t.test but
# does not rely on any distributional assumptions about the data.
# For additional information, see:
# http://en.wikipedia.org/wiki/Resampling_%28statistics%29#Permutation_tests
library(coin)
independence_test(deltasub ~ mftreatf, data = matchatts,
distribution = exact())
# Another alternative is to look at bootstrapped estimates of the mean and standard
# error. This is often an attractive way to estimate statistical significance
# because our data violates the distributional assumptions involved in estimating
# standard errors in this way (namely, that our dependent variable resembles anything
# like a normal or t distribution--it cannot because it is ordinal with only
# 7 levels). Because boostrapping relies on resampling instead of distributional
# assumptions to estimate variance, it is more robust.
# There are packages available to generate bootstrapped estimates, but seeing the code
# is a valuable way to get a sense of exactly how bootstrapping works. Here is the code
# for a bootstrapping function designed to estimate the mean and standard error:
b.mean <- function(data, nsim) {
# Create a list object to store the sets of resampled data (in this case nsim = 1000).
resamples = list()
# Create a vector of the same length to store the resampled means
r.mean = numeric(nsim)
# Generate a sample with replacement
for(i in 1:nsim){
# generates a random sample of our data with replacement
resamples[[i]] = sample(x=data, size=length(data), replace=T)
# Now calcuate the mean for this iteration
r.mean[i] = mean(resamples[[i]], na.rm=T)
}
# Calculate the mean of the mean of the simulated estimates above:
mean.adj = mean(r.mean, na.rm=T)
# Calculate how this differs from the arithmatic mean estimate of the original
# sample:
bias = mean(data, na.rm=T) - mean.adj
# Generate the standard error of the estimate
std.err <- sqrt(var(r.mean))
# Return results
return( data.frame(mean = mean(data, na.rm=T), #the mean estimate
mean.adj = mean.adj, # the adjusted estimate based on simulations
bias = bias, # the mean minus the adjusted estimate
std.err=std.err # the standard error of the estimate (based on simulations)
)
)
}
# Before we use it on our data, let's make sure it works. We will simulate some
# data for which we know the parameters and then estimate the parameters using
# the bootstrap:
simdata = rnorm(n = 1000, mean = 5, sd = 1)
b.mean(simdata, nsim=1000)
# Calculate the theoretical mean and standard error
mean(simdata)
(se = sd(simdata)/sqrt(length(simdata)-1))
# here we use the formula SE = SD/sqrt(N - 1)
# We have demonstrated that our bootstraping function is a good estimator
# for "perfect" data that meets the assumptions behind traditional
# statistical tests. Now let's move on to our data.
# For those with friends with on average postive attitudes:
b.mean(data=matchatts$deltasub[matchatts$mftreatf=="high"],
nsim=1000)
# For those with friends with on average negative attitudes:
b.mean(data=matchatts$deltasub[matchatts$mftreatf=="low"],
nsim=1000)
# Here's how to do it using the boot package, which is faster and more
# extensible than the R-code above:
library(boot)
# You have to write a special function for the boot package, which
# can be confusing at first, but is useful for more complicated
# types of bootstrapping. The boot package is also orders of magnitude
# faster than doing things in R, which is useful if you start to do
# more complicated bootstrapping or use a higher number of simulations.
# First, write a basic function that will return the estimate in question
# for x data using d cases:
samplemean <- function(x, d) {
return(mean(x[d]))
}
(bootlow = boot(data = matchatts$deltasub[matchatts$mftreatf=="low"],
statistic = samplemean,
R = 1000))
(boothigh =boot(data = matchatts$deltasub[matchatts$mftreatf=="high"],
statistic = samplemean,
R = 1000))
# Note that estimates of the standard errors will be slightly different
# each time due to randomness involved in estimation. As you might expect,
# as the number of simulations increases, variance will decrease.
# Based on our bootstrapped estimates and standard error calculations,
# it's clear that the two means are quite different. So, based on our
# matched samples, it seems there is a significant peer effect.
##########################################################################
# Let's try full matching so that we are not throwing away any of our data:
m.out = matchit(mftreat ~ egrd.1 + sub.1 + tot.1 + frn.1 + cmt.1 + prev.1,
data = atts, method = "full")
matchatts = match.data(m.out)
# We'll make this into a factor, which will help inturpret our results:
matchatts$mftreatf = factor(matchatts$mftreat, labels = c("low", "high"))
# Note that we'll now have to use the weights that the output provided because
# we told it to use the full sample.
# Let's compute the means:
tapply(X = matchatts$deltasub,
INDEX = list(matchatts$mftreatf),
FUN = weighted.mean, weights = matchatts$weights )
# There is no straightforward way to compute a t-test in R with weighted data.
# We can however compute weighted standard errors and a corresponding
# 95% CI:
# Recall that a 95% CI of an estimate is calculated by taking the estimate
# +/- the standard error * 1.96. Actually, +/- 1.96 is just an estimate of the
# the 0.05 and 0.95 quantiles of a statistical distribution. We'll use
# qt(quantile, N-1) in this case.
library(Hmisc)
meanlow = wtd.mean(matchatts$deltasub[matchatts$mftreatf=="low"],
weights = matchatts$weights[matchatts$mftreatf=="low"])
varlow = wtd.var(matchatts$deltasub[matchatts$mftreatf=="low"],
weights = matchatts$weights[matchatts$mftreatf=="low"])
selow = sqrt(varlow)/sqrt(sum(matchatts$weights[matchatts$mftreatf=="low"]))
meanlow + selow * qt(.025, ( sum(matchatts$weights[matchatts$mftreatf=="low"]) - 1))
meanlow + selow * qt(.975, ( sum(matchatts$weights[matchatts$mftreatf=="low"]) - 1))
meanhigh = wtd.mean(matchatts$deltasub[matchatts$mftreatf=="high"],
weights = matchatts$weights[matchatts$mftreatf=="high"])
varhigh = wtd.var(matchatts$deltasub[matchatts$mftreatf=="high"],
weights = matchatts$weights[matchatts$mftreatf=="high"])
sehigh = sqrt(varhigh)/sqrt(sum(matchatts$weights[matchatts$mftreatf=="high"]))
meanhigh + sehigh * qt(.025, ( sum(matchatts$weights[matchatts$mftreatf=="high"]) - 1))
meanhigh + sehigh * qt(.975, ( sum(matchatts$weights[matchatts$mftreatf=="high"]) - 1))
# Here's how to do it with the boot package:
sample.wtd.mean <- function(x, d, wts) {
return(weighted.mean(x[d], w = wts))
}
(bootlow = boot(data = matchatts$deltasub[matchatts$mftreatf=="low"],
wts = matchatts$weights[matchatts$mftreatf=="low"],
statistic = sample.wtd.mean,
R = 1000))
(boothigh =boot(data = matchatts$deltasub[matchatts$mftreatf=="high"],
wts = matchatts$weights[matchatts$mftreatf=="high"],
statistic = sample.wtd.mean,
R = 1000))
# We can make a nice plot of the results with the lattice package:
library(lattice)
# Dot plots:
visual = list()
visual$var = c("Friends do not like subj.", "Friends like subj.")
visual$M = c(bootlow$t0, boothigh$t0)
visual$se = c(sd(bootlow$t), sd(boothigh$t))
visual$N = c(length(matchatts$weights[matchatts$mftreatf=="high"]),
length(matchatts$weights[matchatts$mftreatf=="low"]))
visual$lo = visual$M - visual$se
visual$up = visual$M + visual$se
visual = as.data.frame(visual)
#print it out:
pdf(file="7.2_dotplot_matchoutput_bootstrapped_SE.pdf", width = 6, height = 3)
dotplot( reorder(var,M) ~ M,
data = visual,
main="Effect of friends' attitude on opinion change,\nmatched samples with bootstrapped SE",
panel = function (x, y, subscripts, groups, ...) {
panel.dotplot(x, y)
panel.segments(x0=visual$lo[subscripts],y0=as.numeric(y),
x1=visual$up[subscripts],y1=as.numeric(y),
lty = 1 )
},
xlab=expression(Delta~"opinion from "~t[1]*~to~t[2]),
xlim= c(-.5, .2)
)
dev.off()
# Very nice!
############################################
# QUESTION #1 - What do these results show?
# What assumptions are being made?
# What might lead you to question the results? What would improve them?
############################################
#########################################################################
# Extra-credit:
#
# Use the McFarland dataset to acquire other individual level variables
# and develop better matching. Then follow this lab and explore the
# variety of outcomes suitable for illustrating peer effects in
# classrooms (e.g., PSAT scores, engagement, conflict, etc). Results
# are likely suitable for an A-journal publication.
#
# Again, the codebook is available here: http://stanford.edu/~messing/codebook.v12.txt
#
#########################################################################
#########################################################################
# PART II -- QAP REGRESSION
#########################################################################
#
# Here we want to compare different classrooms and discern what leads
# participants to become more or less sociable and academically engaged
# with one another.
#
# Class m173 is an accelerated math class of all 10th graders taught
# by a rigid teacher who used teacher-centered instructional formats.
# Class m 182 is a regular math class of 10th and 11th grade students
# taught by a casual teacher who used student-centered instructional formats.
#
#########################################################################
# igraph and sna don't mix well. Before we load sna, we will detach igraph.
detach(package:igraph)
# Load the "sna" library
library(sna)
#Clear Variables
rm(list=ls(all=TRUE))
gc()
#(2) QAP Regressions for m l73
# Note that each matrix must be the same size (n x n). You'll want to make
# sure all input and output matrices are the same size. Predictor matrices
# thus are not individual level variables; they are differences, matches,
# etc. You'd use a matrix of differences in of quantitative (node level)
# variables, and matches in the case of categorical/dichotomous (node
# level) variables. There's a really easy way to do it in R
# (say your node-level variables are x and y)
# x <- seq(1,4)
# y <- seq(2,5)
# outer(x,y,FUN="-") # find the difference between x and y
# outer(x,y,FUN="==") # produce 0/1 similairty matrix
# For this lab we will use matrices that were previously generated in UCINet
# Loading predictor matrices. Each matrix is a n x n matrix
# data is saved in a CSV format. You can open the files in Excel
# to see the structure.
data(studentnets.mrqap173, package="NetData")
# Look at what we loaded via
ls()
# We need the data in matrix format
# predictor matrices
m173_sem1_SSL <- as.matrix(m173_sem1_SSL)
m173_sem1_TSL <- as.matrix(m173_sem1_TSL)
m173_sem1_FRN <- as.matrix(m173_sem1_FRN)
m173_sem1_SEAT <- as.matrix(m173_sem1_SEAT)
m173_sem1_RCE <- as.matrix(m173_sem1_RCE)
m173_sem1_GND <- as.matrix(m173_sem1_GND)
# Load response matrices
m173_sem2_SSL <- as.matrix(m173_sem2_SSL)
m173_sem2_TSL <- as.matrix(m173_sem2_TSL)
# In order to run the QAP regression we must create an array of matrices
# containing all the predcitor matrices. We are, in effect, creating a
# 3-d matrix (predcitor x n x n).
# Important: All matrices must be the same size!
response_matrices <- array(NA, c(6, length(m173_sem1_SSL[1,]),length(m173_sem1_SSL[1,])))
response_matrices[1,,] <- m173_sem1_SSL
response_matrices[2,,] <- m173_sem1_TSL
response_matrices[3,,] <- m173_sem1_FRN
response_matrices[4,,] <- m173_sem1_SEAT
response_matrices[5,,] <- m173_sem1_RCE
response_matrices[6,,] <- m173_sem1_GND
##############################
#(2a) SSL2 <- SSL1 + TSL1 + FRN1 + SEAT1 + RCE + GND
##############################
# Fit a netlm model by using netlm, the response matrix and the array of predictor matrices
# This may take a LONG time.
nl<-netlm(m173_sem2_SSL,response_matrices)
# Make the model easier to read by adding lables for each predictor matrix.
nlLabeled <- list()
nlLabeled <- summary(nl)
# Labels are provided in the same order as they were assigned in the response_matrices array
nlLabeled$names <- c("Intercept", "Social normalized and labeled (SSL1)", "Task normalized and labeled (TSL1)", "Friends 1=friend, 2=best friend(FRN1)", "Seat in first semester (Seat1)","Race (RCE)","Gender (GND)")
# Round the ocefficients to two decimals
nlLabeled$coefficients = round(nlLabeled$coefficients, 2)
nlLabeled
##############################
#(2b) TSL2 <- TSL1 + SSL1 + FRN1 + SEAT1 + RCE + GND
##############################
# Fit a netlm model by using netlm, the response matrix and the array of predictor matrices
nl<-netlm(m173_sem2_TSL,response_matrices)
#make the model easier to read
nlLabeled <- list()
nlLabeled <- summary(nl)
nlLabeled$names <- c("Intercept", "Social normalized and labeled (SSL1)", "Task normalized and labeled (TSL1)", "Friends 1=friend, 2=best friend(FRN1)", "Seat in first semester (Seat1)","Race (RCE)","Gender (GND)")
nlLabeled$coefficients = round(nlLabeled$coefficients, 2)
nlLabeled
##############################
#(3) QAP Regressions for m 182
##############################
# Repeat for class m 182
# Clear Variables
rm(list=ls(all=TRUE))
gc()
data(studentnets.mrqap182, package = "NetData")
# Look at what we loaded via
ls()
# Again, we need the data in matrix format
# predictor matrices
m182_sem1_SSL <- as.matrix(m182_sem1_SSL)
m182_sem1_TSL <- as.matrix(m182_sem1_TSL)
m182_sem1_FRN <- as.matrix(m182_sem1_FRN)
m182_sem1_SEAT <- as.matrix(m182_sem1_SEAT)
m182_sem1_RCE <- as.matrix(m182_sem1_RCE)
m182_sem1_GND <- as.matrix(m182_sem1_GND)
#response matrices
m182_sem2_SSL <- as.matrix(m182_sem2_SSL)
m182_sem2_TSL <- as.matrix(m182_sem2_TSL)
#This class will require you to make multiple extraction operations.
#A student exits the class at the end of first semester and another
#enters at the start of second semester. These students must be removed
#before you can conduct QAP regression (you need the same row-column orderings).
#Extract actor 15 for TSL and SSL of second semester (# 15 <- new student).
response_matrices <- array(NA, c(6, length(m182_sem1_SSL[1,]),length(m182_sem1_SSL[1,])))
response_matrices[1,,] <- m182_sem1_SSL
response_matrices[2,,] <- m182_sem1_TSL
response_matrices[3,,] <- m182_sem1_FRN
response_matrices[4,,] <- m182_sem1_SEAT
response_matrices[5,,] <- m182_sem1_RCE
response_matrices[6,,] <- m182_sem1_GND
##############################
#(3a) SSL2 <- SSL1 + TSL1 + FRN1 + SEAT1 + RCE + GND
##############################
#Fit a netlm model
nl<-netlm(m182_sem2_SSL,response_matrices)
#make the model easier to read
nlLabeled <- list()
nlLabeled <- summary(nl)
nlLabeled$names <- c("Intercept", "Social normalized and labeled (SSL1)", "Task normalized and labeled (TSL1)", "Friends 1=friend, 2=best friend(FRN1)", "Seat in first semester (Seat1)","Race (RCE)","Gender (GND)")
nlLabeled$coefficients = round(nlLabeled$coefficients, 2)
nlLabeled
##############################
#(3b) TSL2 <- SSL1 + TSL1 + FRN1 + SEAT1 + RCE + GND
##############################
#Fit a netlm model
nl<-netlm(m182_sem2_TSL,response_matrices)
#make the model easier to read
nlLabeled <- list()
nlLabeled <- summary(nl)
nlLabeled$names <- c("Intercept", "Social normalized and labeled (SSL1)", "Task normalized and labeled (TSL1)", "Friends 1=friend, 2=best friend(FRN1)", "Seat in first semester (Seat1)","Race (RCE)","Gender (GND)")
nlLabeled$coefficients = round(nlLabeled$coefficients, 2)
nlLabeled
#Report your results. Describe what they mean? Repeat for ETSL of second semester.
#######################################################################
# QUESTION #2 - Compare your results for m 173 and m 182.
# Do the classes have different results?
# If not, why not?
# If so, why so?
# What sort of story can you derive about the change in
# task and social interactions within these classrooms?
#
# Remember you are predicting changes in relations of sociable or task
# interaction using other types of relations (i.e., friendship,
# proximate seating, same race, etc).
#########################################################################
#
# Extra-credit - run the QAP regression part of the lab on the
# entire McFarland dataset of classrooms, and perform meta-analyses
# on the results using multi-level modeling - then presto -
# you will have results on academic and social relations in classrooms
# which is suitable for publication in an A-journal.
#
#########################################################################
sna_eg_stanford/lab07.txt · Last modified: by hkimscil
