^  ANOVA(b)  ^^^^^^^
|  Model  ||  Sum of Squares  |  df  |  Mean Square  |  F  |  Sig.  |
|  1    |  Regression   | @white: 18.934    | @lightblue: 1    |  18.934    |  13.687    |  0.006   |
|      |  Residual   | @orange: 11.066    | @lightgreen: 8    |  1.383*    |    |    |
|      |  Total   | @yellow: 30.000    | @#eee: 9    |    |    |    |
| a Predictors: (Constant), bankIncome  income \\ b Dependent Variable: bankbook  number of bank  |||||||
  
  * 1.383 = SS<sub>res</sub> / (n-2) = MS residual = MS error (error due to random chance)
  * This MS error means the same thing as the se in t = difference / se from [[:t-test]].
  * The F value is MS<sub>regression</sub>, 18.934, divided by this MS error (MS residual),
  * and it is what we test statistically (df regression = 2 - 1; df residual = n - 2; df total = n - 1).
<WRAP clear />
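As a check, the F value in the table can be recomputed from the sums of squares alone. A minimal R sketch (the variable names are ours; the numbers come from the ANOVA(b) table above):

<code>
# Recompute the F test from the ANOVA(b) table above
ss.reg <- 18.934                      # SS regression (white)
ss.res <- 11.066                      # SS residual (orange)
n <- 10                               # df total = n - 1 = 9 (grey)
ms.reg <- ss.reg / (2 - 1)            # df regression = number of variables - 1
ms.res <- ss.res / (n - 2)            # 1.383 = MS error
f <- ms.reg / ms.res                  # 13.687
pf(f, 1, n - 2, lower.tail = FALSE)   # about 0.006, the Sig. column
</code>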
  
  
| for SS   | for degrees of freedom   |
| @white: white \\ = explained error (E) \\ = $SS_{reg}$  | @lightblue: for regression \\ (number of variables - 1) \\ = 1 (light blue) |
| @orange: orange \\ = unexplained error (U) \\ = $SS_{res}$  | @lightgreen: for residual \\ (number of cases - number of variables) \\ = 8 (light green) |
| @yellow: yellow \\ = total error $SS_{total}$ \\ = E + U \\ = $SS_{reg} + SS_{res}$ | @#eee: grey \\ = total df \\ = total sample # - 1  |
  
  
  
===== r-square =====
  * $\displaystyle r^2=\frac{SS_{total}-SS_{res}}{SS_{total}} = \frac{\text{Explained sample variability}}{\text{Total sample variability}}$
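A quick R check of this identity, using the five-case data (x = 1..5; y = 1, 1, 2, 2, 4) that appears later in this section:

<code>
# r-square = explained / total sample variability
x <- c(1, 2, 3, 4, 5)
y <- c(1, 1, 2, 2, 4)
ss.total <- sum((y - mean(y))^2)      # 6.0
ss.res <- sum(resid(lm(y ~ x))^2)     # 1.1
(ss.total - ss.res) / ss.total        # 0.8167 = Multiple R-squared
</code>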
  
  
  
===== Adjusted r-square =====
  * $\displaystyle r^2=\frac{SS_{total}-SS_{res}}{SS_{total}} = 1 - \frac{SS_{res}}{SS_{total}}$ ,
  
      * R2 value goes down -- which means
      * more (many) IVs are not always good
  * Therefore, the Adjusted r<sup>2</sup> = 1 - (.367 / 1.5) = 0.756 (green color cell)
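The same arithmetic in R (a sketch with our own variable names; k is the number of IVs):

<code>
# Adjusted r-square: divide each SS by its df before taking the ratio
x <- c(1, 2, 3, 4, 5)
y <- c(1, 1, 2, 2, 4)
n <- length(x); k <- 1
ss.res <- sum(resid(lm(y ~ x))^2)     # 1.1
ss.total <- sum((y - mean(y))^2)      # 6.0
1 - (ss.res / (n - k - 1)) / (ss.total / (n - 1))   # 1 - (.367 / 1.5) = 0.756
</code>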
  
===== Slope test =====
If we take a look at the ANOVA result:
  
| b Dependent Variable: y    |||||||
<WRAP clear />
F-test recap (checked with R's anova() after this list):
  * ANOVA, F-test, $F=\frac{MS_{between}}{MS_{within}}$
    * MS_between?
    * MS_within?
  * In regression, the counterpart of "within" is the residual:
    * $s = \sqrt{s^2} = \sqrt{\frac{SS_{res}}{n-2}} $
    * because this SS residual captures the random difference (MS<sub>within</sub>): $s^2 = \frac{SS_{res}}{n-2} $
  * MS for regression . . . obtained difference:
    * follow the same procedure as above, but for MS <del>residual</del> regression.
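The anova() output for the same five-case model reproduces this decomposition (the values in the comments are what we expect for these data):

<code>
# ANOVA table for the regression: MS regression vs. MS residual
x <- c(1, 2, 3, 4, 5)
y <- c(1, 1, 2, 2, 4)
anova(lm(y ~ x))
# x         : Sum Sq 4.9, Mean Sq 4.90000, F = 13.364
# Residuals : Sum Sq 1.1, Mean Sq 0.36667  (= MS within = MS error)
</code>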
  
  * Why do we do a t-test for the slope of the X variable? Below is a mathematical explanation.
  * Sampling distribution of the errors around the slope line b (a small simulation after the formulas below illustrates this):
    * $\displaystyle \sigma_{b_{1}} = \frac{\sigma}{\sqrt{SS_{x}}}$
    * Remember that $\displaystyle \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ ?
    * estimation of $\sigma_{b_{1}}$ : substitute sigma with s

If the errors (residuals) cluster around the slope line b and we take them out and draw their distribution curve, they will form a normal distribution with a mean of 0 and a standard deviation equal to the standard error above.

  * t-test
    * $\displaystyle t=\frac{b_{1} - \text{Hypothesized value of }\beta_{1}}{s_{b_{1}}}$
    * The hypothesized value of b (or beta) is 0; therefore, the t value is
    * $\displaystyle t=\frac{b_{1}}{s_{b_{1}}}$
    * The standard error (se) of the slope is obtained as follows:
  
\begin{eqnarray*}
s_{b_{1}} & = & \sqrt {\frac {MSE}{SS_{X}}} \\
 & = & \sqrt { \frac{1}{n-2} \times \frac{SSE}{SS_{X}} } \\
 & = & \sqrt { \frac{1}{n-2} \times \frac{ \Sigma{(Y-\hat{Y})^2} }{ \Sigma{ (X_{i} - \bar{X})^2 } } } \\
\end{eqnarray*}
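What does this sampling distribution look like in practice? A small simulation can show it. This is an illustrative sketch of our own, not part of the original example: with $\sigma = 1$ and the X values held fixed, the standard deviation of the simulated slopes should approach $\sigma / \sqrt{SS_{X}} = 1/\sqrt{10} \approx 0.316$.

<code>
# Simulate the sampling distribution of the slope b
set.seed(1)
x <- c(1, 2, 3, 4, 5)                  # SSx = 10
b.sim <- replicate(10000, {
  y <- 0.5 + 0.7 * x + rnorm(5, 0, 1)  # true slope 0.7, sigma = 1
  coef(lm(y ~ x))[2]
})
mean(b.sim)   # close to 0.7
sd(b.sim)     # close to 1 / sqrt(10) = 0.316
</code>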
  
^ X  ^ Y  ^ $X-\bar{X}$  ^ ssx  ^ sp  ^ y<sub>predicted</sub>  ^ error  ^ error<sup>2</sup>  ^
  
Regression formula: y<sub>predicted</sub> = -0.1 + 0.7 X \\
SSE = Sum of Squared Errors = $SS_{res}$ \\
The standard error of the slope beta (b) is obtained as follows:

\begin{eqnarray*}
se_{\beta} & = & \frac {\sqrt{SSE/(n-2)}}{\sqrt{SS_{X}}} \\
 & = & \frac {\sqrt{1.1/3}}{\sqrt{10}} \\
 & = & 0.1914854 \\
\end{eqnarray*}

Therefore, t = b / se = 0.7 / 0.1914854 = 3.655631.
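The same numbers can be reproduced step by step in R (variable names are ours):

<code>
# se of the slope and t, computed by hand from the table above
x <- c(1, 2, 3, 4, 5)
y <- c(1, 1, 2, 2, 4)
y.pred <- -0.1 + 0.7 * x
sse <- sum((y - y.pred)^2)            # 1.1
ssx <- sum((x - mean(x))^2)           # 10
se.b <- sqrt(sse / (length(x) - 2)) / sqrt(ssx)   # 0.1914854
0.7 / se.b                            # t = 3.655631
</code>

And summary(lm(y ~ x)) reports the same slope, standard error, and t value: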
  
  
<code>
> x <- c(1, 2, 3, 4, 5)
> y <- c(1, 1, 2, 2, 4)
> mody <- lm(y ~ x) 
> summary(mody)

Call:
lm(formula = y ~ x)

Residuals:
         1          2          3          4          5 
 4.000e-01 -3.000e-01 -3.886e-16 -7.000e-01  6.000e-01 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -0.1000     0.6351  -0.157   0.8849  
x             0.7000     0.1915   3.656   0.0354 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6055 on 3 degrees of freedom
Multiple R-squared:  0.8167, Adjusted R-squared:  0.7556 
F-statistic: 13.36 on 1 and 3 DF,  p-value: 0.03535
</code>
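Two things worth noting in the output above: the reported Pr(>|t|) is two-tailed, and in a simple regression F equals t squared. A quick check:

<code>
t <- 3.655631
2 * pt(t, df = 3, lower.tail = FALSE)   # 0.03535, the p-value for x
t^2                                     # 13.36, the F-statistic above
</code>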
====== E.g., 4. Simple regression ======
Another example of simple regression: from {{:elemapi.sav}} \\