Differences

This shows you the differences between two versions of the page.

--- adjusted_r_squared [2016/05/11 07:17] – created hkimscil
+++ adjusted_r_squared [2016/05/11 07:48] (current) – hkimscil
@@ Line 13: / Line 13: @@
 |  __Model Summary(b)__   |||||
-| Model  | R              | R Square      | Adjusted R Square   | Std. Error of the Estimate   |
+| Model  | R              | R \\ Square      | Adjusted \\ R Square   | Std. Error of \\ the Estimate   |
 | 1      | 0.903696114    | 0.816666667   | 0.755555556         | 0.605530071   |
 <WRAP clear />
-  * r-square:
+**__r-square:__**
-   * $r^2=\frac{SS_{total}-SS_{res}}{SS_{total}} = \frac{\text{Explained sample variability}}{\text{Total sample variability}}$
+  * $\displaystyle r^2=\frac{SS_{total}-SS_{res}}{SS_{total}} = \frac{\text{Explained sample variability}}{\text{Total sample variability}}$
-   * $r^2=\frac{SS_{total}-SS_{res}}{SS_{total}} = 1-\frac{SS_{res}}{SS_{total}} = 0.816666667 = R^2 $
+  * $\displaystyle r^2=\frac{SS_{total}-SS_{res}}{SS_{total}} = 1-\frac{SS_{res}}{SS_{total}} = 0.816666667 = R^2 $
-   * Usually interpreted as a percentage (multiply $r^2$ by 100)
+  * Usually interpreted as a percentage (multiply $r^2$ by 100)
-  * Adjusted r-square:
+**__Adjusted r-square:__ **
-   * $r^2=\frac{SS_{total}-SS_{res}}{SS_{total}} = 1 - \frac{SS_{res}}{SS_{total}} $ ,
+  * $\displaystyle r^2=\frac{SS_{total}-SS_{res}}{SS_{total}} = 1 - \frac{SS_{res}}{SS_{total}} $ ,
-   * This is equivalent to: $ 1 - \frac {Var_{res}}{Var_{total}} $
+  * This is equivalent to: $ \displaystyle 1 - \frac {Var_{res}}{Var_{total}} $
-   * Var = MS = s<sup>2</sup> = SS / n
+  * $\text{Var} = \text{MS} = s^{2} = \displaystyle \frac {SS}{n} $
-   * Here, we replace the value n
+  * Here, n is replaced by the following values (n = number of samples, p = number of independent variables):
-    * for Var<sub>res</sub> = SS<sub>res</sub> / n - p -1
+    * $\displaystyle Var_{res} = \frac {SS_{res}}{n-p-1}$
-    * for Var<sub>total</sub> = SS<sub>total</sub> / n - 1
+    * $\displaystyle Var_{total} = \frac {SS_{total}}{n-1}$
-   * This follows the same logic as using n-1 instead of n to estimate the population standard deviation from a sample statistic.
+  * Therefore,
+    * $\displaystyle \text{Adjusted } R^{2} = 1 - \displaystyle \frac {\displaystyle \frac {SS_{res}}{n-p-1}}{\displaystyle  \frac {SS_{total}}{n-1}} $
+   * This follows **the same logic** as using n-1 instead of n to estimate the population standard deviation from a sample statistic.
    * Therefore, the Adjusted r<sup>2</sup> = 0.755555556
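The arithmetic above can be checked with a minimal Python sketch. It assumes the values reported for this example (SS<sub>res</sub> = 1.1, SS<sub>total</sub> = 6, n = 5, p = 1); the function names are illustrative only and are not taken from any library.

```python
def r_squared(ss_res, ss_total):
    # R^2 = 1 - SS_res / SS_total
    return 1.0 - ss_res / ss_total


def adjusted_r_squared(ss_res, ss_total, n, p):
    # Replace n with the degrees of freedom of each sum of squares:
    # Var_res = SS_res / (n - p - 1), Var_total = SS_total / (n - 1)
    var_res = ss_res / (n - p - 1)
    var_total = ss_total / (n - 1)
    return 1.0 - var_res / var_total


# Values assumed from the example on this page
ss_res, ss_total, n, p = 1.1, 6.0, 5, 1

r2 = r_squared(ss_res, ss_total)
adj_r2 = adjusted_r_squared(ss_res, ss_total, n, p)

print(f"R^2          = {r2:.9f}")
print(f"Adjusted R^2 = {adj_r2:.9f}")
```

The two printed values should agree with the R Square and Adjusted R Square columns of the Model Summary table.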
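+**__Why use the Adjusted R squared value?__ **
+  * As p grows, that is . . . .
+  * the Adjusted R squared value tends to become smaller.
+  * A growing p means that independent variables keep being added. Even if none of these X variables actually explains Y (that is, even if X and Y have no theoretical cause-and-effect relationship), the R<sup>2</sup> value naturally increases. Such a model is said to be over-fit (by analogy with the statistical test (F-test) on the R square value being called a goodness of fit test). The Adjusted R squared value, however, includes p in its calculation, so at some point (as X variables keep being added) it starts to decrease. The point at which it starts to decrease is taken as the point at which to stop in order to avoid over-fit.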
+  * <imgcaption image1|>{{:bestsubsetsex.gif|}}</imgcaption>
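+  * In the case above, for example, the researcher may decide to use only the first three independent variables, because the Adjusted R squared value starts to decrease when the fourth variable is entered, whereas the R squared value keeps increasing.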
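The effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are synthetic (one real predictor plus five pure-noise predictors, generated with a fixed seed), and `solve` and `fit_r2` are hypothetical helper names written for this example, not functions from any library. R squared cannot decrease as predictors are added to a nested model; whether Adjusted R squared falls at a given step depends on the data.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_r2(xs, y):
    """Least-squares fit with intercept; xs is a list of predictor columns.

    Returns (R^2, adjusted R^2).
    """
    n, p = len(y), len(xs)
    cols = [[1.0] * n] + xs  # intercept column first
    k = p + 1
    xtx = [[sum(cols[i][t] * cols[j][t] for t in range(n)) for j in range(k)]
           for i in range(k)]
    xty = [sum(cols[i][t] * y[t] for t in range(n)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations
    y_hat = [sum(beta[i] * cols[i][t] for i in range(k)) for t in range(n)]
    y_bar = sum(y) / n
    ss_res = sum((y[t] - y_hat[t]) ** 2 for t in range(n))
    ss_total = sum((v - y_bar) ** 2 for v in y)
    r2 = 1 - ss_res / ss_total
    adj = 1 - (ss_res / (n - p - 1)) / (ss_total / (n - 1))
    return r2, adj


# Synthetic data (an assumption for illustration): y depends on x1 only.
random.seed(0)
n = 30
x1 = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * v + random.gauss(0, 1) for v in x1]
noise = [[random.gauss(0, 1) for _ in range(n)] for _ in range(5)]

results = []
for extra in range(len(noise) + 1):
    r2, adj = fit_r2([x1] + noise[:extra], y)
    results.append((1 + extra, r2, adj))
    print(f"p={1 + extra}  R2={r2:.4f}  adjusted R2={adj:.4f}")
```

Comparing the two printed columns row by row shows how each added noise variable moves R squared and Adjusted R squared for this particular sample.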
-If we take a look at the ANOVA result:
-^  __ANOVA__   ^^^^^^^
-| Model   |     |  Sum of Squares   |  df   |  Mean Square   |  F   |  Sig.   |
-| 1   |  Regression   |  4.9   |  1   |  4.9   |  13.36363636   |  0.035352847   |
-|    |  Residual   |  @yellow:1.1   |  3   |  0.366666667   |     |     |
-|    |  Total   |  6   |  4   |     |     |     |
-| a Predictors: (Constant),  x   |||||||
-| b Dependent Variable: y    |||||||
-<WRAP clear />
-  * ANOVA, F-test, $F=\frac{MS_{between}}{MS_{within}}$
-  * MS_between?
-  * MS_within?
-  * MS for residual
-   * $s = \sqrt{s^2} = \sqrt{\frac{SS_{res}}{n-2}} $
-   * random difference (MS<sub>within</sub> ): $s^2 = \frac{SS_{res}}{n-2} $
-  * MS for regression . . . Obtained difference
-   * follow the same procedure as above for the MS for residual.
-   * but this time the degrees of freedom are k-1 (number of variables - 1), that is, 1.
-  * Then what does F value mean?
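As a rough illustration of the questions above, the F value can be recomputed from the sums of squares and degrees of freedom in the ANOVA table. This is a minimal sketch that assumes the table values (SS<sub>regression</sub> = 4.9 with df = 1, SS<sub>residual</sub> = 1.1 with df = 3); the variable names are illustrative.

```python
# Values assumed from the ANOVA table on this page
ss_reg, df_reg = 4.9, 1   # regression (obtained difference)
ss_res, df_res = 1.1, 3   # residual (random difference)

ms_reg = ss_reg / df_reg  # mean square for regression
ms_res = ss_res / df_res  # mean square for residual
f_value = ms_reg / ms_res

print(f"MS regression = {ms_reg:.9f}")
print(f"MS residual   = {ms_res:.9f}")
print(f"F             = {f_value:.8f}")
```

With a single predictor, the square root of F equals the t value reported for the slope, so the two tests are equivalent in this example.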
-Then, we take another look at coefficients result:
-^  __example__   ^^^^^^^^^
-|  Model   ||  Unstandardized Coefficients   |     |  Standardized Coefficients   |  t   |  Sig.   |  95% Confidence Interval for B   ||
-|  B   |  Std. Error   |  Beta   |     |     |  Lower Bound   |  Upper Bound   |
-|  1   |  (Constant)   |  -0.1   |  0.635085296   |     |  -0.157459164   |  0.88488398   |  -2.121124854   |  1.921124854   |
-|     |  x   |  0.7   | @yellow:0.191485422   |  0.903696114   | @yellow:3.655630775   |  0.035352847   |  0.090607928   |  1.309392072   |
-| a  Dependent Variable: y   |||||||
-<WRAP clear />
-  * Why do we run a t-test on the slope of the X variable? Below is a mathematical explanation.
-  * Sampling distribution of Beta (or b):
-   * $\sigma_{\beta_{1}} = \frac{\sigma}{\sqrt{SS_{xx}}}$
-   * estimation of $\sigma_{\beta_{1}}$ : substitute sigma with s
-  * t-test
-   * $t=\frac{\beta_{1} - \text{Hypothesized value of }\beta_{1}}{s_{\beta_{1}}}$
-   * The hypothesized value of beta is usually 0, so the t value is
-   * $t=\frac{\beta_{1}}{s_{\beta_{1}}}$
-   * $s_{\beta} = \sqrt{\frac {MS_{E}}{SS_{X}}} = \displaystyle\frac{\sqrt{\frac{SSE}{n-2}}}{\sqrt{SS_{X}}} = \displaystyle\frac{\sqrt{\frac{\Sigma{(Y-\hat{Y})^2}}{n-2}}}{\sqrt{\Sigma{(X_{i}-\bar{X})^2}}} $
-^ X  ^ Y  ^ $X-\bar{X}$  ^ ssx  ^ sp  ^ y<sub>predicted</sub>  ^ error  ^ error<sup>2</sup>  ^
-| 1  | 1  | -2  | 4  | 2  | 0.6  | -0.4  | 0.16  |
-| 2  | 1  | -1  | 1  | 1  | 1.3  | 0.3  | 0.09  |
-| 3  | 2  | 0  | 0  | 0  | 2  | 0  | 0  |
-| 4  | 2  | 1  | 1  | 0  | 2.7  | 0.7  | 0.49  |
-| 5  | 4  | 2  | 4  | 4  | 3.4  | -0.6  | 0.36  |
-| $\bar{X}$ = 3  | 2  |  | SS<sub>X</sub> = 10  | $\Sigma$ = 7  |   |   | SSE = 1.1  |
-Regression formula: y<sub>predicted</sub> = -0.1 + 0.7 X
-SSE = Sum of Square Error
-The standard error of the slope beta (b) is obtained as follows.
-$$se_{\beta} = \frac {\sqrt{SSE/(n-2)}}{\sqrt{SS_{X}}} = \frac {\sqrt{1.1/3}}{\sqrt{10}} = 0.191485 $$
-And b = 0.7.
-Therefore t = b / se = 3.655631
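The worked table and the standard-error calculation above can be reproduced with a short Python sketch. It assumes the five (X, Y) pairs listed in the table; the variable names are illustrative.

```python
from math import sqrt

# Data assumed from the table on this page
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

ss_x = sum((x - x_bar) ** 2 for x in xs)                       # SS_X
sp = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))    # sum of products

b = sp / ss_x            # slope
a = y_bar - b * x_bar    # intercept

# SSE: sum of squared differences between observed and predicted y
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

se_b = sqrt(sse / (n - 2)) / sqrt(ss_x)  # standard error of the slope
t = b / se_b                             # t value under H0: slope = 0

print(f"y_predicted = {a:.1f} + {b:.1f} X")
print(f"SS_X = {ss_x:.1f}, SP = {sp:.1f}, SSE = {sse:.1f}")
print(f"se_b = {se_b:.6f}, t = {t:.6f}")
```

The slope, its standard error, and the t value should match the corresponding cells of the coefficients table.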