ANCOVA (Analysis of Covariance)

Multiple Regression에 대한 이해가 필요
그 중에서

Partial correlation (hence partial R square value),
semi-partial correlation (or Part correlation),
zero-order correlation 를 설명한 부분과
이를 직접 설명한 페이지 Partial and semipartial correlation 참조

Y = a + b1X1 + b2X2 의 예에서

모든 X 변인 들의 설명력에 대한 테스트는 ANOVA test를 하게 된다.
위에서 구하는 R^2은 모든 변인들의 설명력을 합친 것을 말한다.
각 coefficient에 대한 기여도에 대한 test는 t-test를 한다.
이 때 각 변인의 기여도는 중복되는 부분을 제외한 고유의 부분에 대한 테스트이다 (partial correlation R^2에 대한 테스트, semi-partial correlation R^2 이 아님). 이 설명이 partial and semipartial correlation에 자세히 나온다.
X1 고유의 설명력을 알아보기 위해서 우리가 한 것은
Y = a + b1X1 의 regression을 해본 후
그 residual을 따로 구해서 보면 이 부분은 X1과 무관한 부분이 된다 (X1의 설명력을 제외한 Y변량의 나머지) = res.y
X2 고유의 기여도 혹은 설명력을 (X1과 중복되는 부분을 제외한 부분을) 보기 위해서
X2 = a + b1X1의 regression을 한 후에
X2의 residual을 따로 모아본다. 이것 또한 X1과 무관한 부분이 된다. = res.x2
이제 res.y = a + b1 * res.x2 의 regression을 하면 이것이
x2의 partial correlation test가 된다 (혹은 X2 고유의 설명력에 대한 regression test가 된다).

위의 recap에 더해서

만약에 X1 = 숫자로 측정된 변인이고
X2는 종류로 측정된 변인이라고 하자
이 때 X1의 설명력을 제외한 X2만의 설명력을 본다는 것은
X1를 제어하고 X2의 효과를 (설명력을 혹은 영향력을) 본다는 것이다.
이것이 ANCOVA이다.

e.g. 통계를 공부하는 방법이 세가지가 있다.

B = traditional 하게 책으로만 공부하는 방법
T = 동영상을 위주로 공부하는 방법
R = Researcher가 고안해낸 새로운 방법
이 세가지 방법의 효과에 차이가 있는지 알고 싶어서
세 그룹의 학생들이 (각 30명) 세 가지 방법으로 통계를 배운 후. (x2)
통계시험을 치룬다 (y)
그런데 연구자는 각 학생들의 지능지수를 제어하여 살펴보고 싶다 (x1)

e.g.
anorexia.csv

anorexia <- read.csv("http://commres.net/wiki/_media/anorexia.csv")
anorexia
anorexia$Diff = anorexia$Postwt-anorexia$Prewt
anorexia

$ Diff = Treat + Error $

$ Diff = Prewt + Treat + Error $

$ Postwt = Prewt + Treat + Error $

levels(anorexia$Treat)
[1] NULL

anorexia$Treat <- factor(anorexia$Treat, levels = c("Cont", "CBT", "FT"))

levels(anorexia$Treat)
[1] "Cont" "CBT"  "FT"  

model <- lm (Postwt ~ Prewt + Treat, data = anorexia)

Anova(model)
Anova Table (Type II tests)

Response: Postwt
          Sum Sq Df F value    Pr(>F)    
Prewt      353.8  1  7.2655 0.0088500 ** 
Treat      766.3  2  7.8681 0.0008438 ***
Residuals 3311.3 68                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

summary(model)
Call:
lm(formula = Postwt ~ Prewt + Treat, data = anorexia)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.1083  -4.2773  -0.5484   5.4838  15.2922 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  45.6740    13.2167   3.456 0.000950 ***
Prewt         0.4345     0.1612   2.695 0.008850 ** 
TreatCBT      4.0971     1.8935   2.164 0.033999 *  
TreatFT       8.6601     2.1931   3.949 0.000189 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.978 on 68 degrees of freedom
Multiple R-squared:  0.2777,	Adjusted R-squared:  0.2458 
F-statistic: 8.713 on 3 and 68 DF,  p-value: 5.719e-05

library(multcomp)
dunnett = glht(model, linfct=mcp(Treat="Dunnett"))

summary(dunnett)

	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Dunnett Contrasts


Fit: lm(formula = Postwt ~ Prewt + Treat, data = anorexia)

Linear Hypotheses:
                Estimate Std. Error t value Pr(>|t|)    
CBT - Cont == 0    4.097      1.893   2.164 0.062939 .  
FT - Cont == 0     8.660      2.193   3.949 0.000373 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)

ANCOVA (Analysis of Covariance) 2

Data

### --------------------------------------------------------------
### Analysis of covariance, cricket example
### pp. 228–229
### --------------------------------------------------------------

Input = (

"Species  Temp   Pulse
 ex       20.8   67.9 
 ex       20.8   65.1 
 ex       24     77.3 
 ex       24     78.7 
 ex       24     79.4 
 ex       24     80.4 
 ex       26.2   85.8 
 ex       26.2   86.6 
 ex       26.2   87.5 
 ex       26.2   89.1 
 ex       28.4   98.6 
 ex       29    100.8 
 ex       30.4   99.3 
 ex       30.4  101.7 
 niv      17.2   44.3 
 niv      18.3   47.2 
 niv      18.3   47.6 
 niv      18.3   49.6 
 niv      18.9   50.3 
 niv      18.9   51.8 
 niv      20.4   60 
 niv      21     58.5 
 niv      21     58.9 
 niv      22.1   60.7 
 niv      23.5   69.8 
 niv      24.2   70.9 
 niv      25.9   76.2 
 niv      26.5   76.1 
 niv      26.5   77 
 niv      26.5   77.7 
 niv      28.6   84.7
")

Data = read.table(textConnection(Input),header=TRUE)

Simple Plot

plot(x   = Data$Temp, 
     y   = Data$Pulse, 
     col = Data$Species, 
     pch = 16,
     xlab = "Temperature",
     ylab = "Pulse")

legend('bottomright', 
       legend = levels(Data$Species), 
       col = 1:2, 
       cex = 1,    
       pch = 16)

Analysis of covariance

options(contrasts = c("contr.treatment", "contr.poly"))
   
   ### These are the default contrasts in R


model.1 = lm (Pulse ~ Temp + Species + Temp:Species,
              data = Data)

library(car)
Anova(model.1, type="II")
 
Anova Table (Type II tests)
 
             Sum Sq Df  F value    Pr(>F)   
Temp         4376.1  1 1388.839 < 2.2e-16 ***
Species       598.0  1  189.789 9.907e-14 ***
Temp:Species    4.3  1    1.357    0.2542    
 
### Interaction is not significant, so the slope across groups
### is not different.

Re-analysis

model.2 = lm (Pulse ~ Temp + Species,
              data = Data)

library(car)
Anova(model.2, type="II")
 
Anova Table (Type II tests)
 
          Sum Sq Df F value    Pr(>F)   
Temp      4376.1  1  1371.4 < 2.2e-16 ***
Species    598.0  1   187.4 6.272e-14 ***
 
### The category variable (Species) is significant,
### so the intercepts among groups are different
 
 
summary(model.2)
 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -7.21091    2.55094  -2.827  0.00858 **
Temp          3.60275    0.09729  37.032  < 2e-16 ***
Speciesniv  -10.06529    0.73526 -13.689 6.27e-14 ***
 
### Note that these estimates are different than in the Handbook,
###   but the calculated results will be identical.
### The slope estimate is the same.
### The intercept for species 1 (ex) is (intercept).
### The intercept for species 2 (niv) is (intercept) + Speciesniv.
### This is determined from the contrast coding of the Species
### variable shown below, and the fact that Speciesniv is shown in
### coefficient table above.
 
 
contrasts(Data$Species)
 
    niv
ex    0
niv   1

Simple plot with fitted lines

I.nought = -7.21091
I1 = I.nought + 0
I2 = I.nought + -10.06529
B  = 3.60275 

plot(x   = Data$Temp, 
     y   = Data$Pulse, 
     col = Data$Species, 
     pch = 16,
     xlab = "Temperature",
     ylab = "Pulse")

legend('bottomright', 
       legend = levels(Data$Species), 
       col = 1:2, 
       cex = 1,    
       pch = 16)

abline(I1, B,
       lty=1, lwd=2, col = 1)

abline(I2, B,
       lty=1, lwd=2, col = 2)

p-value and R-squared of combined model

summary(model.2)

Checking assumptions of the model

hist(residuals(model.2), 
     col="darkgray")

plot(fitted(model.2), 
     residuals(model.2)
     )

Eg.

tri.csv
trt: treatment
pat: patient id
hgba1c: Hemoglobin A1c
trichg: Triglyceride changes

$ trichg = hgba1c + trt + e $

Table of Contents

ANCOVA (Analysis of Covariance)

ANCOVA (Analysis of Covariance) 2

Eg.