Differences

This shows you the differences between two versions of the page.

--- partial_and_semipartial_correlation [2023/05/24 00:09] – [e.g.,] hkimscil
+++ partial_and_semipartial_correlation [2023/05/31 08:56] – [Why overall model is significant while IVs are not?] hkimscil
@@ Line 175: / Line 175: @@
 </code>
-We know that this is called partial correlation. Testing this with the ppcor package gives:
+[[:multiple_regression#determining_ivs_role|We know that this is called partial correlation]]. Testing this with the ppcor package gives:
 <code>
@@ Line 242: / Line 242: @@
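 The explanation above also accounts for why summary(lm()) and anova(lm()) differ in multiple regression, as [[:multiple_regression#in_r|mentioned elsewhere]] (here, summary(mod) and anova(mod)). anova takes the variables in the order given and controls each IV only for the IVs entered before it, not for those entered after it, so its results can change with the order of the IVs.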
+In the results below, the p values of the independent variables from anova() differ from the p values of the same variables from summary().
+<code>
+# result from anova()
+acs_k3      1  110211  110211   32.059 2.985e-08 ***
+# result from summary(lm())
+acs_k3        3.3884     2.3333   1.452    0.147
+</code>
+The following is [[:multiple_regression#in_r|taken from the Multiple Regression explanation]].
+<code>
+dvar <- read.csv("http://commres.net/wiki/_media/elemapi2_.csv", fileEncoding="UTF-8-BOM")
+mod <- lm(api00 ~ ell + acs_k3 + avg_ed + meals, data=dvar)
+summary(mod)
+anova(mod)
+</code>
+<code>
+> dvar <- read.csv("http://commres.net/wiki/_media/elemapi2_.csv", fileEncoding="UTF-8-BOM")
+> mod <- lm(api00 ~ ell + acs_k3 + avg_ed + meals, data=dvar)
+> summary(mod)
+Call:
+lm(formula = api00 ~ ell + acs_k3 + avg_ed + meals, data = dvar)
+Residuals:
+     Min       1Q   Median       3Q      Max
+-187.020  -40.358   -0.313   36.155  173.697
+Coefficients:
+            Estimate Std. Error t value Pr(>|t|)
+(Intercept) 709.6388    56.2401  12.618  < 2e-16 ***
+ell          -0.8434     0.1958  -4.307 2.12e-05 ***
+acs_k3        3.3884     2.3333   1.452    0.147
+avg_ed       29.0724     6.9243   4.199 3.36e-05 ***
+meals        -2.9374     0.1948 -15.081  < 2e-16 ***
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+Residual standard error: 58.63 on 374 degrees of freedom
+  (21 observations deleted due to missingness)
+Multiple R-squared:  0.8326,	Adjusted R-squared:  0.8308
+F-statistic:   465 on 4 and 374 DF,  p-value: < 2.2e-16
+> anova(mod)
+Analysis of Variance Table
+Response: api00
+           Df  Sum Sq Mean Sq  F value    Pr(>F)
+ell         1 4502711 4502711 1309.762 < 2.2e-16 ***
+acs_k3      1  110211  110211   32.059 2.985e-08 ***
+avg_ed      1  998892  998892  290.561 < 2.2e-16 ***
+meals       1  781905  781905  227.443 < 2.2e-16 ***
+Residuals 374 1285740    3438
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+>
+</code>
 Looking at mod2 below,
@@ Line 247: / Line 306: @@
   * the independent variables are ell, avg_ed, meals + acs_k3,
   * and their order has changed from the earlier document's
-  * ell, acs_k3, avg_ed, meals.
+  * ell, acs_k3, avg_ed, meals (acs_k3 moved to the end).
+  * That is,
+  * lm(api00 ~ ell + acs_k3 + avg_ed + meals)
+  * lm(api00 ~ ell + avg_ed + meals + acs_k3)
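 anova processes the IVs in the order they are entered, without taking the IVs entered later into account. Moving acs_k3 to the end therefore lets the other IVs claim their explanatory power over the DV first, and acs_k3 is shown only what remains. This is why its t-test is not significant.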
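The order effect described above can be sketched with simulated data. This is a minimal illustration, not the elemapi2 analysis from this page: the variables x1, x2, y, the seed, and the coefficients are all assumptions chosen only to make two correlated IVs.

```r
# Simulated data (hypothetical; not the elemapi2 data used on this page)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # x2 is correlated with x1
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

fit_a <- lm(y ~ x1 + x2)   # x2 entered last
fit_b <- lm(y ~ x2 + x1)   # x1 entered last

# summary(): each coefficient is tested with all other IVs controlled,
# so the t values do not depend on the order of entry
t_a <- coef(summary(fit_a))[, "t value"]
t_b <- coef(summary(fit_b))[, "t value"]
print(all.equal(t_a[c("x1", "x2")], t_b[c("x1", "x2")]))

# anova(): sequential sums of squares, so the F for x1 depends on its position
aov_a <- anova(fit_a)
aov_b <- anova(fit_b)
print(aov_a["x1", "F value"])   # x1 entered first
print(aov_b["x1", "F value"])   # x1 entered after x2

# For the IV entered last, the sequential F equals the squared t from summary()
print(all.equal(aov_a["x2", "F value"], unname(t_a["x2"])^2))
```

The last comparison is the link between the two tables: the t-test in summary() matches what anova() would report for a variable only when that variable is entered last.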
@@ Line 290: / Line 352: @@
 </code>
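-The same holds when the order of the other independent variables is changed. mod3 is an example in which the meals variable of mod2 is moved to the front.
+The same holds when the order of the other independent variables is changed. mod3 is an example in which the meals variable of mod2 is moved to the front. That is,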
+  * mod  <- lm(api00 ~ ell + acs_k3 + avg_ed + meals)
+  * mod2 <- lm(api00 ~ ell + avg_ed + meals + acs_k3)
+  * mod3 <- lm(api00 ~ meals + ell + avg_ed + acs_k3)
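 The results of summary(mod), summary(mod2), and summary(mod3) do not differ from one another, but the F values and p-values from anova change depending on which independent variable comes first. Of course, if the correlations among the independent variables were 0, the order would have no effect.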
 <code>
@@ Line 576: / Line 643: @@
 {{pcor.y.x1.x2.v2.png?400}}
 After controlling for the influence of x2, the influence of x1 comes to 64.54%.
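A percentage of this kind can be computed as a squared partial correlation, assuming that is what the figure above reports. The sketch below uses simulated data (the variables, seed, and coefficients are hypothetical, so the resulting number is not 64.54%) and obtains the value in two equivalent ways: from residuals, and from the R-squared values of nested models.

```r
# Simulated data (hypothetical; not the data behind the figure above)
set.seed(2)
n  <- 100
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)
y  <- 1 + 1.5 * x1 + 1.0 * x2 + rnorm(n)

# Remove the part of y and the part of x1 that x2 explains
res_y  <- resid(lm(y ~ x2))
res_x1 <- resid(lm(x1 ~ x2))

# Partial correlation of y and x1, controlling for x2
pr <- cor(res_y, res_x1)
print(pr^2)   # share of the variance in y not explained by x2 that x1 explains

# Same quantity from the R-squared values of the nested models
r2_full <- summary(lm(y ~ x1 + x2))$r.squared
r2_x2   <- summary(lm(y ~ x2))$r.squared
pr2_from_r2 <- (r2_full - r2_x2) / (1 - r2_x2)
print(pr2_from_r2)
```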
+====== Explanatory power of each IV alone versus in a combined regression? ======
+see https://www.researchgate.net/post/Why_is_the_Multiple_regression_model_not_significant_while_simple_regression_for_the_same_variables_is_significant
+<code>
+RSS = 3:10 #Right shoe size
+LSS = rnorm(RSS, RSS, 0.1) #Left shoe size - similar to RSS
+cor(LSS, RSS) #correlation ~ 0.99
+weights = 120 + rnorm(RSS, 10*RSS, 10)
+##Fit a joint model
+m = lm(weights ~ LSS + RSS)
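+##The p-value of the overall F-test is very small, but neither LSS nor RSS is significant
+summary(m)
+##Fitting RSS or LSS separately gives a significant result.
+summary(lm(weights ~ LSS))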
+</code>
+<code>> RSS = 3:10 #Right shoe size
+> LSS = rnorm(RSS, RSS, 0.1) #Left shoe size - similar to RSS
+> cor(LSS, RSS) #correlation ~ 0.99
+[1] 0.9994836
+>
+> weights = 120 + rnorm(RSS, 10*RSS, 10)
+>
+> ##Fit a joint model
+> m = lm(weights ~ LSS + RSS)
+>
+> ##The p-value of the overall F-test is very small, but neither LSS nor RSS is significant
+> summary(m)
+Call:
+lm(formula = weights ~ LSS + RSS)
+Residuals:
+       2       3       4       5       6       7       8
+.8544  4.5254 -3.6333 -7.6402 -0.2467 -3.1997 -5.2665 10.6066
+Coefficients:
+            Estimate Std. Error t value Pr(>|t|)
+(Intercept)  104.842      8.169  12.834 5.11e-05 ***
+LSS          -14.162     35.447  -0.400    0.706
+RSS           26.305     35.034   0.751    0.487
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+Residual standard error: 7.296 on 5 degrees of freedom
+Multiple R-squared:  0.9599,	Adjusted R-squared:  0.9439
+F-statistic: 59.92 on 2 and 5 DF,  p-value: 0.000321
+>
+> ##Fitting RSS or LSS separately gives a significant result.
+> summary(lm(weights ~ LSS))
+Call:
+lm(formula = weights ~ LSS)
+Residuals:
+   Min     1Q Median     3Q    Max
+-6.055 -4.930 -2.925  4.886 11.854
+Coefficients:
+            Estimate Std. Error t value Pr(>|t|)
+(Intercept)  103.099      7.543   13.67 9.53e-06 ***
+LSS           12.440      1.097   11.34 2.81e-05 ***
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+Residual standard error: 7.026 on 6 degrees of freedom
+Multiple R-squared:  0.9554,	Adjusted R-squared:  0.948
+F-statistic: 128.6 on 1 and 6 DF,  p-value: 2.814e-05
+>
+</code>
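Why the joint model is significant while neither IV is can be made visible by looking at how much the collinearity inflates the standard errors. The sketch below re-creates the shoe-size example under assumptions: the seed is added here so the run is repeatable (the run on this page is unseeded, so the numbers will differ), and the variance inflation factor is computed by hand with the two-predictor formula 1 / (1 - r^2) rather than with a package.

```r
set.seed(3)                                # seed is an assumption, not from the page
RSS <- 3:10                                # right shoe size
LSS <- rnorm(length(RSS), RSS, 0.1)        # left shoe size, nearly identical to RSS
weights <- 120 + rnorm(length(RSS), 10 * RSS, 10)

m <- lm(weights ~ LSS + RSS)

# Variance inflation factor for two IVs: 1 / (1 - r^2)
vif <- 1 / (1 - cor(LSS, RSS)^2)
print(vif)

# Standard error of the RSS slope: joint model versus RSS alone
se_joint <- coef(summary(m))["RSS", "Std. Error"]
se_alone <- coef(summary(lm(weights ~ RSS)))["RSS", "Std. Error"]
print(c(joint = se_joint, alone = se_alone))

# Overall F-test: both IVs together against the intercept-only model
print(anova(lm(weights ~ 1), m))
```

Because LSS and RSS carry almost the same information, each slope is estimated very imprecisely once the other is in the model, while the two together still explain most of the variance in weights.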