Partial and semi-partial correlation

References
The Elements of Statistical Learning

A simple explanation of the procedure below:

  • Separately regress Y and X1 against X2, that is,
    • regress Y against X2, AND
    • regress X1 against X2.
  • Regress the Y residuals against the X1 residuals.
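The two steps above can be sketched on simulated data (variable names are illustrative, not the tests_cor.csv data used later on this page). The slope from the residual-on-residual regression equals the coefficient of X1 in the full multiple regression, which is the Frisch-Waugh-Lovell result:

```r
# A minimal sketch of the steps above, using simulated data.
set.seed(1)
n  <- 50
x2 <- rnorm(n)
x1 <- 0.6 * x2 + rnorm(n)              # x1 is correlated with x2
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)

res.y.x2  <- resid(lm(y ~ x2))         # regress Y against X2
res.x1.x2 <- resid(lm(x1 ~ x2))        # regress X1 against X2
b.res  <- unname(coef(lm(res.y.x2 ~ res.x1.x2))[2])  # residuals on residuals
b.full <- unname(coef(lm(y ~ x1 + x2))["x1"])
all.equal(b.res, b.full)               # TRUE: same slope as the full model
```

The correlation of the two residual vectors is the partial correlation of Y and X1 controlling for X2.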

In the example below,

  • regress gpa against clep
  • regress sat against clep
  • regress the gpa residuals against the sat residuals.

Take a close look at the graphs, especially the grey areas.

For more, see https://stats.stackexchange.com/questions/28474/how-can-adding-a-2nd-iv-make-the-1st-iv-significant

  1. When sat and clep are each used separately in a regression on gpa, both show an effect. When the two are used together, however, the effect of sat disappears. We therefore demonstrate what controlling for clep means, starting from the regression of gpa on clep.
  2. Obtain lm.gpa.clep 1) and then res.lm.gpa.clep 2); what remains is the part of gpa left after removing the influence of clep.
  3. Obtain lm.gpa.clepsat and see that the influence of sat disappears.
  4. Then, regressing res.lm.gpa.clep (as the DV) on sat (as the IV) would show the influence of sat alone after controlling for clep (it should equal pcor.test(gpa, sat, clep, data=tests)).
  5. Because sat contains a part that is correlated with clep, the step just above cannot work as written.
  6. If the IVs sat and clep were orthogonal (i.e., correlated at 0), the struck-out step would be possible; since they are not, obtain the part of sat with clep removed, i.e., res.lm.sat.clep, and look at its correlation with res.lm.gpa.clep.

Partial cor

Regressing gpa against clep

# import test score data "tests_cor.csv"
tests <- read.csv("http://commres.net/wiki/_media/r/tests_cor.csv")
colnames(tests) <- c("ser", "sat", "clep", "gpa")
tests <- subset(tests, select=c("sat", "clep", "gpa"))
attach(tests)
lm.gpa.clep <- lm(gpa ~ clep, data = tests)
summary(lm.gpa.clep)
Call:
lm(formula = gpa ~ clep, data = tests)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.190496 -0.141167 -0.002376  0.110847  0.225207 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.17438    0.38946   3.015 0.016676 *  
clep         0.06054    0.01177   5.144 0.000881 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1637 on 8 degrees of freedom
Multiple R-squared:  0.7679,	Adjusted R-squared:  0.7388 
F-statistic: 26.46 on 1 and 8 DF,  p-value: 0.0008808
# get residuals  
res.lm.gpa.clep <- lm.gpa.clep$residuals

# get cor between gpa, clep, pred, and resid from lm.gpa.clep
cor.gpa.clep <- as.data.frame(cbind(gpa, clep, lm.gpa.clep$fitted.values, lm.gpa.clep$residuals))
colnames(cor.gpa.clep) <- c("gpa", "clep", "pred", "resid")
cor(cor.gpa.clep)
         gpa   clep   pred  resid
gpa   1.0000 0.8763 0.8763 0.4818
clep  0.8763 1.0000 1.0000 0.0000
pred  0.8763 1.0000 1.0000 0.0000
resid 0.4818 0.0000 0.0000 1.0000
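The pattern in the matrix above (pred correlating 1.0000 with clep, resid correlating 0.0000) is general. A small sketch on simulated data (illustrative names p and q, not the tests data):

```r
# With one predictor, the fitted values correlate perfectly with it
# (up to sign) and the residuals are exactly uncorrelated with it.
set.seed(10)
p <- rnorm(30)
q <- 2 + 0.5 * p + rnorm(30)
fit <- lm(q ~ p)
round(abs(cor(p, fitted(fit))), 10)   # 1
round(cor(p, resid(fit)), 10)         # 0
```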

Regressing gpa against sat

# may not be necessary;
# just to check that sat alone also influences gpa.
lm.gpa.sat <-  lm(gpa ~ sat, data=tests)
summary(lm.gpa.sat)
# get residuals  
res.lm.gpa.sat <- lm.gpa.sat$residuals

# get cor between gpa, sat, pred, and resid from lm.gpa.sat
cor.gpa.sat <- as.data.frame(cbind(gpa, sat, lm.gpa.sat$fitted.values, lm.gpa.sat$residuals))
colnames(cor.gpa.sat) <- c("gpa", "sat", "pred", "resid")
cor(cor.gpa.sat)

Regressing gpa against both clep and sat

lm.gpa.clepsat <- lm(gpa ~ clep + sat, data = tests) 
summary(lm.gpa.clepsat)
Call:
lm(formula = gpa ~ clep + sat, data = tests)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.197888 -0.128974 -0.000528  0.131170  0.226404 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  1.1607560  0.4081117   2.844   0.0249 *
clep         0.0729294  0.0253799   2.874   0.0239 *
sat         -0.0007015  0.0012564  -0.558   0.5940  
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1713 on 7 degrees of freedom
Multiple R-squared:  0.7778,	Adjusted R-squared:  0.7143 
F-statistic: 12.25 on 2 and 7 DF,  p-value: 0.005175


Checking partial cor 1

# get res.lm.sat.clep
lm.sat.clep <- lm(sat ~ clep, data = tests)
summary(lm.sat.clep)
Call:
lm(formula = sat ~ clep, data = tests)

Residuals:
     Min       1Q   Median       3Q      Max 
-101.860  -19.292    1.136   28.306   54.132 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -19.421    114.638  -0.169  0.86967    
clep          17.665      3.464   5.100  0.00093 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 48.2 on 8 degrees of freedom
Multiple R-squared:  0.7648,	Adjusted R-squared:  0.7353 
F-statistic: 26.01 on 1 and 8 DF,  p-value: 0.0009303


res.lm.sat.clep <- lm.sat.clep$residuals
install.packages("ppcor")
library(ppcor)
pcor.gpa.sat.clep <- lm(res.lm.gpa.clep ~ res.lm.sat.clep)
summary(pcor.gpa.sat.clep)
Call:
lm(formula = res.lm.gpa.clep ~ res.lm.sat.clep)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.197888 -0.128974 -0.000528  0.131170  0.226404 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      1.755e-17  5.067e-02   0.000    1.000
res.lm.sat.clep -7.015e-04  1.175e-03  -0.597    0.567

Residual standard error: 0.1602 on 8 degrees of freedom
Multiple R-squared:  0.04264,	Adjusted R-squared:  -0.07703 
F-statistic: 0.3563 on 1 and 8 DF,  p-value: 0.5671

 
> pcor.gpa.sat.clep <- pcor.test(gpa,sat,clep)
> pcor.gpa.sat.clep
    estimate   p.value statistic  n gp  Method
1 -0.2064849 0.5940128  -0.55834 10  1 pearson
> pcor.gpa.sat.clep$estimate^2
[1] 0.04263601
d <- data.frame(sat=sat, clep=clep, gpa=gpa, res.lm.gpa.clep=res.lm.gpa.clep)
plot(d)

Note that the relationship between res.lm.gpa.clep and sat looks negative, which can be confirmed in lm.gpa.clepsat:

summary(lm.gpa.clepsat)

Checking partial cor 2

This method will not work because of the correlation between the two IVs: using sat as a whole is not the same as using res.lm.sat.clep, which removes the part shared with clep.
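A sketch of the point on simulated data (illustrative names): correlating the y-residuals with the raw predictor reproduces the partial correlation only when the IVs are uncorrelated, which they are not here.

```r
# With correlated IVs, cor(y-residuals, raw x) differs from the
# proper partial correlation cor(y-residuals, x-residuals).
set.seed(42)
z <- rnorm(200)
x <- 0.7 * z + rnorm(200)            # x shares variance with z
y <- 0.5 * x + 0.5 * z + rnorm(200)
e.y <- resid(lm(y ~ z))
e.x <- resid(lm(x ~ z))
r.partial <- cor(e.y, e.x)           # proper partial correlation
r.naive   <- cor(e.y, x)             # uses the whole x
abs(r.partial - r.naive) > 0.01      # TRUE: they do not match
```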

# theoretically, if we regress res.lm.gpa.clep against sat
pcor.res.sat <- lm(res.lm.gpa.clep ~ tests$sat)
summary(pcor.res.sat)
Call:
lm(formula = res.lm.gpa.clep ~ tests$sat)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.200398 -0.142818  0.000099  0.112699  0.223556 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.0924155  0.3286705   0.281    0.786
tests$sat   -0.0001650  0.0005797  -0.285    0.783

Residual standard error: 0.1629 on 8 degrees of freedom
Multiple R-squared:  0.01003,	Adjusted R-squared:  -0.1137 
F-statistic: 0.08105 on 1 and 8 DF,  p-value: 0.7831

Semipartial cor

> spcor.gpa.sat.clep <- spcor.test(gpa,sat,clep)
> spcor.gpa.sat.clep
     estimate   p.value  statistic  n gp  Method
1 -0.09948786 0.7989893 -0.2645326 10  1 pearson
> spcor.gpa.sat.clep$estimate^2
[1] 0.009897835
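Both coefficients can also be computed from the three pairwise correlations. The helper functions below are illustrative (assumed names, not part of ppcor); they implement the standard first-order formulas and are cross-checked against the residual definitions on simulated data:

```r
# First-order partial and semipartial (part) correlations from the
# three pairwise correlations.
partial.r <- function(r.yx, r.yz, r.xz) {
  (r.yx - r.yz * r.xz) / sqrt((1 - r.yz^2) * (1 - r.xz^2))
}
semipartial.r <- function(r.yx, r.yz, r.xz) {
  # z is removed from x only, as in spcor.test(y, x, z)
  (r.yx - r.yz * r.xz) / sqrt(1 - r.xz^2)
}

# cross-check against the residual definitions
set.seed(7)
z <- rnorm(100)
x <- 0.5 * z + rnorm(100)
y <- 0.4 * x + 0.3 * z + rnorm(100)
r.formula <- partial.r(cor(y, x), cor(y, z), cor(x, z))
r.resid   <- cor(resid(lm(y ~ z)), resid(lm(x ~ z)))
all.equal(r.formula, r.resid)        # TRUE
s.formula <- semipartial.r(cor(y, x), cor(y, z), cor(x, z))
s.resid   <- cor(y, resid(lm(x ~ z)))
all.equal(s.formula, s.resid)        # TRUE
```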

e.g.,

In this example, the two IVs are orthogonal (uncorrelated with each other). Hence, regressing res.lm.y.x2 against x1 causes no problem.

n <- 32
set.seed(182)
u <- matrix(rnorm(2*n), ncol=2)
u0 <- cbind(u[,1] - mean(u[,1]), u[,2] - mean(u[,2]))
x <- svd(u0)$u
eps <- rnorm(n)
y <-  x %*% c(0.05, 1) + eps * 0.01
x1 <- x[,1]
x2 <- x[,2]
dset <- data.frame(y,x1,x2)
head(dset)
        y       x1      x2
1  0.2311 -0.42320  0.2536
2 -0.1708 -0.13428 -0.1573
3  0.1617  0.12404  0.1580
4  0.1111  0.10377  0.1214
5  0.2176  0.08796  0.1962
6  0.2054  0.02187  0.1950
round(cor(dset), 3)
       y    x1    x2
y  1.000 0.068 0.996
x1 0.068 1.000 0.000
x2 0.996 0.000 1.000
lm.y.x1 <- lm(y ~ x1)
summary(lm.y.x1)
Call:
lm(formula = y ~ x1)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3750 -0.1013 -0.0229  0.1402  0.2985 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.00258    0.03242   -0.08     0.94
x1           0.06895    0.18341    0.38     0.71

Residual standard error: 0.183 on 30 degrees of freedom
Multiple R-squared:  0.00469,	Adjusted R-squared:  -0.0285 
F-statistic: 0.141 on 1 and 30 DF,  p-value: 0.71
lm.y.x1x2 <- lm(y ~ x1 + x2)
summary(lm.y.x1x2)
Call:
lm(formula = y ~ x1 + x2)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.026697 -0.004072  0.000732  0.006664  0.017220 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.00258    0.00168   -1.54     0.14    
x1           0.06895    0.00949    7.27  5.3e-08 ***
x2           1.00328    0.00949  105.72  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.00949 on 29 degrees of freedom
Multiple R-squared:  0.997,	Adjusted R-squared:  0.997 
F-statistic: 5.61e+03 on 2 and 29 DF,  p-value: <2e-16
lm.y.x2 <- lm(y ~ x2)
summary(lm.y.x2)
Call:
lm(formula = y ~ x2)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.027366 -0.010654  0.002941  0.009922  0.027470 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.002576   0.002770   -0.93     0.36    
x2           1.003276   0.015669   64.03   <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01567 on 30 degrees of freedom
Multiple R-squared:  0.9927,	Adjusted R-squared:  0.9925 
F-statistic:  4100 on 1 and 30 DF,  p-value: < 2.2e-16
res.lm.y.x2 <- lm.y.x2$residuals
d <- data.frame(X1=x1, X2=x2, Y=y, res.lm.y.x2=res.lm.y.x2)
plot(d)

> 1-0.9927
[1] 0.0073

Take the remainder of the variance of Y not explained by X2 (1 - R2 = 0.0073) as the dependent variable and x1 as the independent variable; the regression then shows the relationship on the RY axis of the figure. In particular, the relationship between RY and X1 turns linear: X1 alone appeared to play no role, but once X2 enters and the part of Y explained by X2 is removed (controlled), X1's explanatory power for the remainder becomes significant.

> lm.resyx2.x1 <- lm(lm.y.x2$residuals ~ x1)
> summary(lm.resyx2.x1)

Call:
lm(formula = lm.y.x2$residuals ~ x1)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0266967 -0.0040718  0.0007323  0.0066643  0.0172201 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.220e-18  1.649e-03    0.00        1    
x1          6.895e-02  9.331e-03    7.39 3.11e-08 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.009331 on 30 degrees of freedom
Multiple R-squared:  0.6454,	Adjusted R-squared:  0.6336 
F-statistic: 54.61 on 1 and 30 DF,  p-value: 3.115e-08


The actual correlations would look like the following.

The influence of x1 is small relative to the total variance of y (b / (a + b) = 0.469%).


After controlling for the influence of x2, the influence of x1 reaches 64.54%.
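The 64.54% figure is the squared partial correlation, i.e. the R-squared increment from adding x1, rescaled by the variance x2 left unexplained. A sketch with freshly simulated orthogonal predictors (a new seed, not the seed-182 data above):

```r
# Squared partial correlation of x1 as an R-squared increment.
set.seed(3)
n  <- 40
u  <- matrix(rnorm(2 * n), ncol = 2)
u0 <- cbind(u[, 1] - mean(u[, 1]), u[, 2] - mean(u[, 2]))
x  <- svd(u0)$u                       # orthogonal, centered predictors
y  <- as.vector(x %*% c(0.05, 1) + rnorm(n) * 0.01)
x1 <- x[, 1]; x2 <- x[, 2]
R2.full <- summary(lm(y ~ x1 + x2))$r.squared
R2.red  <- summary(lm(y ~ x2))$r.squared
pr2       <- (R2.full - R2.red) / (1 - R2.red)
pr2.check <- summary(lm(resid(lm(y ~ x2)) ~ x1))$r.squared
all.equal(pr2, pr2.check)             # TRUE
```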

e.g.

eg.b.csv
y	x1	x2
1.644540	1.063845	.351188
1.785204	1.203146	.200000
-1.36357	-.466514	-.961069
.314549	1.175054	.800000
.317955	.100612	.858597
.970097	2.438904	1.000000
.664388	1.204048	.292670
-.870252	-.993857	-1.89018
1.962192	.587540	-.275352
1.036381	-.110834	-.246448
.007415	-.069234	1.447422
1.634353	.965370	.467095
.219813	.553268	.348095
-.285774	.358621	.166708
1.498758	-2.87971	-1.13757
1.671538	-.310708	.396034
1.462036	.057677	1.401522
-.563266	.904716	-.744522
.297874	.561898	-.929709
-1.54898	-.898084	-.838295
eg.c.csv
y	x1	x2
1.644540	1.063845	.351188
1.785204	-1.20315	.200000
-1.36357	-.466514	-.961069
.314549	1.175054	.800000
.317955	-.100612	.858597
.970097	1.438904	1.000000
.664388	1.204048	.292670
-.870252	-.993857	-1.89018
1.962192	-.587540	-.275352
1.036381	-.110834	-.246448
.007415	-.069234	1.447422
1.634353	.965370	.467095
.219813	.553268	.348095
-.285774	.358621	.166708
1.498758	-2.87971	-1.13757
1.671538	-.810708	.396034
1.462036	-.057677	1.401522
-.563266	.904716	-.744522
.297874	.561898	-.929709
-1.54898	-1.26108	-.838295

e.g. 3,

set.seed(888)  # for reproducibility
S = rnorm(60, mean=0, sd=1.0)  # the Suppressor is normally distributed
U = 1.1*S + rnorm(60, mean=0, sd=0.1)  # U (unrelated) is Suppressor plus error
R <- rnorm(60, mean=0, sd=1.0)  # related part; normally distributed
OV = U + R  # the Other Variable is U plus R
Y  = R + rnorm(60, mean=0, sd=2)  # Y is R plus error
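The page stops after generating the data; a possible next step (an assumption, not shown in the original) is to compare the models with and without the suppressor S. The setup is repeated so the sketch is self-contained:

```r
# Classic suppression: S relates to Y only through the part it shares
# with U, so adding S to the model cleans U's noise out of OV.
set.seed(888)
S  <- rnorm(60, mean = 0, sd = 1.0)
U  <- 1.1 * S + rnorm(60, mean = 0, sd = 0.1)
R  <- rnorm(60, mean = 0, sd = 1.0)
OV <- U + R
Y  <- R + rnorm(60, mean = 0, sd = 2)
r2.ov  <- summary(lm(Y ~ OV))$r.squared      # OV alone
r2.ovs <- summary(lm(Y ~ OV + S))$r.squared  # OV plus the suppressor
r2.ovs > r2.ov                               # TRUE: fit improves
```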
1)
Result-variable names follow the pattern "method.DV.IV".
2)
In R, the residuals are obtained as follows:
res.lm.gpa.clep <- residuals(lm.gpa.clep)
or
res.lm.gpa.clep <- lm.gpa.clep$residuals
partial_and_semipartial_correlation.txt · Last modified: 2019/02/28 20:16 by hkimscil