User Tools

Site Tools


partial_and_semipartial_correlation

Partial and semi-partial correlation

please refer to the page: http://faculty.cas.usf.edu/mbrannick/regression/Partial.html

tests_cor.csv

# import test score data "tests_cor.csv"
tests <- read.csv("http://commres.net/wiki/_media/r/tests_cor.csv")
tests 
   X.Person SAT.Q CLEP Math.GPA
1         1   500   30      2.8
2         2   550   32      3.0
3         3   450   28      2.9
4         4   400   25      2.8
5         5   600   32      3.3
6         6   650   38      3.3
7         7   700   39      3.5
8         8   550   38      3.7
9         9   650   35      3.4
10       10   550   31      2.9

Then change the column names into something easy to handle. And get rid of the “ser” column.

colnames(tests) <- c("ser", "sat", "clep", "gpa")
tests <- subset(tests, select=c("sat", "clep", "gpa"))

This data represents that GPA might be caused by (influenced by) two variables: that is, clep test score and sat (quantitative) score.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
    usr <- par("usr")
    on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- cor(x, y)        
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    text(0.5, 0.5, txt)
    }
attach(tests)
pairs(cbind(sat, clep, gpa), lower.panel = panel.cor, pch = 18)

Note that correlation between sat and gpa and that of clep and gpa is somewhat high in the graphic and at the same time, correlation between the ivs – sat and gpa is also high.

In order to understand how the ivs influence each other. We first regress gpa on sat, first, that is, a simple regression.

Regression GPA on SAT

tests.lm.gpa.sat <- lm(gpa ~ sat, data = tests)
summary(tests.lm.gpa.sat)

Call:
lm(formula = gpa ~ sat, data = tests)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.23544 -0.12184  0.00316  0.02943  0.56456 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 1.7848101  0.4771715   3.740   0.0057 **
sat         0.0024557  0.0008416   2.918   0.0193 * 
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2365 on 8 degrees of freedom
Multiple R-squared:  0.5156,	Adjusted R-squared:  0.455 
F-statistic: 8.515 on 1 and 8 DF,  p-value: 0.01935

Also, correlation between sat and gpa is 0.7180** (see the below output)

> options(digits=4)
> cor(tests)
        sat   clep    gpa
sat  1.0000 0.8745 0.7180**
clep 0.8745 1.0000 0.8763
gpa  0.7180 0.8763 1.0000
> 
 

Now we collect predicted values (y hat) and residual values (residual error values even with the iv variable considered).

tmp1 <- data.frame(cbind(tests.lm.gpa.sat$fitted.values, tests.lm.gpa.sat$residuals))
tmp.gpa.sat <- as.data.frame(cbind(sat, gpa, tmp1$X1, tmp1$X2))
colnames(tmp.gpa.sat) <- c("sat", "gpa", "pred", "resid")
tmp.gpa.sat
   sat gpa     pred        resid
1  500 2.8 3.012658 -0.212658228
2  550 3.0 3.135443 -0.135443038
3  450 2.9 2.889873  0.010126582
4  400 2.8 2.767089  0.032911392
5  600 3.3 3.258228  0.041772152
6  650 3.3 3.381013 -0.081012658
7  700 3.5 3.503797 -0.003797468
8  550 3.7 3.135443  0.564556962
9  650 3.4 3.381013  0.018987342
10 550 2.9 3.135443 -0.235443038

Note that,

  • as mentioned, correlation between sat and gpa = 0.7180,
  • correlation between gpa and pred = the same as the above, and
    • it should be that way since predicted values are regression part of y sum of square.
    • in other words, correlation between the dv and the predicted values are the same as that between the dv and the iv (sat).
  • correlation between pred and resid = 0.
    • it is so because residuals are rest of predicted ones. As shown in the below figure, SSres does NOT share any with SSreg.
  • Also, correlation between gpa (the dv) and the resdiuals (SSres) is square root value of (1-0.71802)
  • Why? see the below r code:
> sqrt(1-(0.718^2))
[1] 0.696

round(cor(tmp.gpa.sat), digits = 3)
        sat   gpa  pred resid
sat   1.000 0.718 1.000 0.000
gpa   0.718 1.000 0.718 0.696
pred  1.000 0.718 1.000 0.000
resid 0.000 0.696 0.000 1.000

regression (treatment) part = .718^2 = 0.515524
residual (error) part = .696^2 = 0.484416

Regression GPA against CLEP

We do the same thing as the above.

tests.lm.gpa.clep <- lm(gpa ~ clep, data = tests)
summary(tests.lm.gpa.clep)

Call:
lm(formula = gpa ~ clep, data = tests)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.190496 -0.141167 -0.002376  0.110847  0.225207 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.17438    0.38946   3.015 0.016676 *  
clep         0.06054    0.01177   5.144 0.000881 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1637 on 8 degrees of freedom
Multiple R-squared:  0.7679,	Adjusted R-squared:  0.7388 
F-statistic: 26.46 on 1 and 8 DF,  p-value: 0.0008808

tmp.gpa.clep <- as.data.frame(cbind(tests$clep, tests$gpa, tests.lm.gpa.clep$fitted.values, tests.lm.gpa.clep$residuals))
colnames(tmp.gpa.clep) <- c("clep", "gpa", "pred", "resid")
tmp.gpa.clep
   clep gpa     pred       resid
1    30 2.8 2.990496 -0.19049587
2    32 3.0 3.111570 -0.11157025
3    28 2.9 2.869421  0.03057851
4    25 2.8 2.687810  0.11219008
5    32 3.3 3.111570  0.18842975
6    38 3.3 3.474793 -0.17479339
7    39 3.5 3.535331 -0.03533058
8    38 3.7 3.474793  0.22520661
9    35 3.4 3.293182  0.10681818
10   31 2.9 3.051033 -0.15103306
round(cor(tmp.gpa.clep), digits = 4)
        clep    gpa   pred  resid
clep  1.0000 0.8763 1.0000 0.0000
gpa   0.8763 1.0000 0.8763 0.4818
pred  1.0000 0.8763 1.0000 0.0000
resid 0.0000 0.4818 0.0000 1.0000
> 

Note that rclep x gpa = .876 = .88

Suppose that you don't have the correlation output as shown the above. Identify the correlation between gpa (dv) and residuals.

Regression GPA against SAT and CLEP

tests.lm.gpa.satclep <- lm(gpa ~ sat + clep, data = tests)
summary(tests.lm.gpa.satclep)

Call:
lm(formula = gpa ~ sat + clep, data = tests)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.197888 -0.128974 -0.000528  0.131170  0.226404 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  1.1607560  0.4081117   2.844   0.0249 *
sat         -0.0007015  0.0012564  -0.558   0.5940  
clep         0.0729294  0.0253799   2.874   0.0239 *
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1713 on 7 degrees of freedom
Multiple R-squared:  0.7778,	Adjusted R-squared:  0.7143 
F-statistic: 12.25 on 2 and 7 DF,  p-value: 0.005175

Note that SAT is not influential anymore.

  • Intercept = 1.16, t=2.844, p < .05
  • CLEP = 0.07, t=2.874, p < .05
  • SAT = -.0007, t=-0.558, n.s.

We may conclude that the significant unique predictor is CLEP. Although
Now we might come up with an idea that SAT may influence both CLEP and GPA. Therefore, we want to figure out the relationship between CLEP and SAT.

Regression CLEP against SAT


sat not only influence gpa but also influence clep test. Hence we investigate the relationship between sat and clep.

tests.lm.clep.sat <- lm(clep~sat, data=tests) 
summary(tests.lm.clep.sat)

Call:
lm(formula = clep ~ sat, data = tests)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5316 -1.2437 -0.2848  0.0949  5.6329 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.556962   4.813367   1.778  0.11334    
sat         0.043291   0.008489   5.100  0.00093 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.386 on 8 degrees of freedom
Multiple R-squared:  0.7648,	Adjusted R-squared:  0.7353 
F-statistic: 26.01 on 1 and 8 DF,  p-value: 0.0009303
tmp.clep.sat <- as.data.frame(cbind(sat, clep, tests.lm.clep.sat$fitted.values, tests.lm.clep.sat$residuals))
colnames(tmp.clep.sat) <- c("sat", "clep", "pred", "resid")
tmp.clep.sat
   sat clep     pred       resid
1  500   30 30.20253 -0.20253165
2  550   32 32.36709 -0.36708861
3  450   28 28.03797 -0.03797468
4  400   25 25.87342 -0.87341772
5  600   32 34.53165 -2.53164557
6  650   38 36.69620  1.30379747
7  700   39 38.86076  0.13924051
8  550   38 32.36709  5.63291139
9  650   35 36.69620 -1.69620253
10 550   31 32.36709 -1.36708861
round(cor(tmp.clep.sat), digits = 3)
        sat  clep  pred resid
sat   1.000 0.875 1.000 0.000
clep  0.875 1.000 0.875 0.485
pred  1.000 0.875 1.000 0.000
resid 0.000 0.485 0.000 1.000

Note that correlation between SAT (iv) and resid = 0

Now we have two sets of residuals from SAT, one for GPA and one for CLEP. GPA and CLEP are our two achievement measures. According to our theory, they should not be correlated except for the common influence of SAT. The residuals are what is left when we remove SAT from each variable. Therefore, our theory says that our two residuals should not be correlated.

If we compute the correlation between these two sets of residual, we find that:

tmp.resid.clepgpa <- round(cbind(tests.lm.clep.sat$residuals, tests.lm.gpa.sat$residuals), digits = 5)
colnames(tmp.resid.clepgpa) <- c("clep_resid", "gpa_resid")
tmp.resid.clepgpa
   clep_resid gpa_resid
1     -0.2025   -0.2127
2     -0.3671   -0.1354
3     -0.0380    0.0101
4     -0.8734    0.0329
5     -2.5316    0.0418
6      1.3038   -0.0810
7      0.1392   -0.0038
8      5.6329    0.5646
9     -1.6962    0.0190
10    -1.3671   -0.2354
cor(tmp.resid.clepgpa)
           clep_resid gpa_resid
clep_resid  1.0000000 0.7356583
gpa_resid   0.7356583 1.0000000

rclep_resid x gpa_resid = 0.7356583 = .73

The correlation between the two sets is .73, which is significantly different from zero at p < .05.

The correlation between the two sets of residuals is called a partial correlation. In our case, it was the correlation between GPA and CLEP while holding SAT constant.

The partial correlation is what we get when we hold constant some third variable from two other variables. We know the correlation between CLEP and GPA is .88. But SAT “accounts for” (or could account for) part of that.

**** PLEASE NOTE THAT the numbers (percentages) of SS do not add up to 100% because we are talking about partial correlation. Why? please see 무엇부터 라는_문제 in multiple regression page.

The other one

> tests.lm.sat.clep <- lm(sat~clep, data=tests) 
> summary(tests.lm.sat.clep)

Call:
lm(formula = sat ~ clep, data = tests)

Residuals:
    Min      1Q  Median      3Q     Max 
-101.86  -19.29    1.14   28.31   54.13 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -19.42     114.64   -0.17  0.86967    
clep           17.67       3.46    5.10  0.00093 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 48.2 on 8 degrees of freedom
Multiple R-squared:  0.765,	Adjusted R-squared:  0.735 
F-statistic:   26 on 1 and 8 DF,  p-value: 0.00093

> tmp.sat.clep <- as.data.frame(cbind(clep, sat, tests.lm.sat.clep$fitted.values, tests.lm.sat.clep$residuals))
> colnames(tmp.sat.clep) <- c("clep", "sat", "pred", "resid")
> round(cor(tmp.sat.clep), digits=4)
        clep    sat   pred resid
clep  1.0000 0.8745 1.0000 0.000
sat   0.8745 1.0000 0.8745 0.485
pred  1.0000 0.8745 1.0000 0.000
resid 0.0000 0.4850 0.0000 1.000
> .8745^2
[1] 0.7648
> .4850^2
[1] 0.2352
> tmp.resid.satgpa <- round(cbind(tests.lm.sat.clep$residuals, tests.lm.gpa.clep$residuals), digits=4)
> colnames(tmp.resid.satgpa) <- c("satresid", "gparesid")
> tmp.resid.satgpa
   satresid gparesid
1   -10.537  -0.1905
2     4.132  -0.1116
3   -25.207   0.0306
4   -22.211   0.1122
5    54.132   0.1884
6    -1.859  -0.1748
7    30.475  -0.0353
8  -101.859   0.2252
9    51.136   0.1068
10   21.797  -0.1510
> cor(tmp.resid.satgpa)
         satresid gparesid
satresid   1.0000  -0.2065
gparesid  -0.2065   1.0000
> 0.2065^2
[1] 0.04264
> 
partial_and_semipartial_correlation.txt · Last modified: 2018/12/14 08:04 by hkimscil