regression


^  __ prediction for y values with__ $\overline{Y}$  ^^^^
|  bankaccount  |  prediction  |  error  |  error<sup>2</sup>  |
|  6  |  8  |  -2  |  4  |
|  5  |  8  |  -3  |  9  |
|  7  |  8  |  -1  |  1  |
|  7  |  8  |  -1  |  1  |
|  8  |  8  |  0  |  0  |
|  10  |  8  |  2  |  4  |
|  8  |  8  |  0  |  0  |
|  11  |  8  |  3  |  9  |
|  9  |  8  |  1  |  1  |
|  9  |  8  |  1  |  1  |
|  $\overline{Y}=8$  |  |  |  $SS_{total} = 30$  |

What is the sum of the squared values above? It is 30. In other words, the SS (Sum of Squares) value is 30. And, as explained earlier, this value can be called $SS_{total}$: the __total error__ variance.

...

__SS<sub>res</sub>, Residual error__

<code>
> head(datavar)
. . . .
> mod <- lm(bankaccount ~ income, data = datavar)
> summary(mod)

Residuals:
    Min      1Q  Median      3Q     Max 
. . . .
</code>

...

|  Model  |  |  Sum of Squares  |  df  |  Mean Square  |  F  |  Sig.  |
|  1  |  Regression  | @white: 18.934  | @grey: 1  |  18.934  |  13.687  |  0.006  |
|  |  Residual  | @orange: 11.066  | @green: 8  |  1.383*  |  |  |
|  |  Total  | @yellow: 30.000  |  9  |  |  |  |
| a Predictors: (Constant), income \\ b Dependent Variable: bankaccount (number of bank accounts)  |||||||

  * 1.383* = SS<sub>res</sub> / (n - 2), the mean square of the residuals. This is the error variance; its square root is the standard error of estimate.
  * This standard error carries the same meaning as the se in t = difference / se, which you saw when learning the [[:t-test]].
  * Accordingly, the F value is MS<sub>regression</sub> (18.934) divided by this error term: 18.934 / 1.383 = 13.687.

__F-test using SS<sub>total</sub>, SS<sub>reg</sub>, and SS<sub>res</sub>__

...
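The three bullet points above can be verified with plain arithmetic. Below is a minimal sketch (Python is used here only for the check; the page's own examples use R) that rebuilds the ANOVA quantities from numbers already shown above: the bankaccount column gives $SS_{total}$, and the SS values from the ANOVA table give MS, F, and r<sup>2</sup>.

```python
# bankaccount values from the prediction table above
bankaccount = [6, 5, 7, 7, 8, 10, 8, 11, 9, 9]
n = len(bankaccount)                        # 10 observations

y_bar = sum(bankaccount) / n                # the mean prediction, 8.0
ss_total = sum((y - y_bar) ** 2 for y in bankaccount)   # 30.0, as in the table

# SS values taken from the SPSS-style ANOVA table above
ss_reg, ss_res = 18.934, 11.066
ms_reg = ss_reg / 1                         # df_reg = 1 (one predictor)
ms_res = ss_res / (n - 2)                   # df_res = n - 2 = 8  ->  about 1.383
f_value = ms_reg / ms_res                   # about 13.69, matching F = 13.687 up to rounding
r_squared = ss_reg / ss_total               # about 0.631

print(ss_total, round(ms_res, 3), round(f_value, 2), round(r_squared, 3))
```

Note that F is just the regression mean square measured in units of the error variance, which is why a large F indicates that the regression explains far more variance than is left in the residuals.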
====== E.g., Simple regression ======

data: {{:acidity.sav}} \\

...

SS<sub>total</sub> = 87.733 \\
r<sup>2</sup> = SS<sub>reg</sub> / SS<sub>total</sub> = 42.462 / 87.733 = .484

====== e.g. Simple Regression ======

data: {{:AllenMursau.data.csv}}

<code>
datavar <- read.csv("http://commres.net/wiki/_media/allenmursau.data.csv")
</code>

<code>
> mod <- lm(Y ~ X, data = datavar)
> summary(mod)

Call:
lm(formula = Y ~ X, data = datavar)

Residuals:
    Min      1Q  Median      3Q     Max 
-250.22 -132.28   33.09  165.53  187.78 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  300.976    229.754   1.310    0.219   
X             10.312      3.124   3.301    0.008 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 170.5 on 10 degrees of freedom
Multiple R-squared:  0.5214,    Adjusted R-squared:  0.4736 
F-statistic:  10.9 on 1 and 10 DF,  p-value: 0.008002
</code>

<code>
> anova(mod)
Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value   Pr(>F)   
X          1 316874  316874  10.896 0.008002 **
Residuals 10 290824   29082                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code>

<code>
> ss_total <- var(datavar$Y) * 11   # var() divides by n - 1 = 11; multiplying back gives SS_total
> round(ss_total)
[1] 607698
> 316874 + 290824   # the Sum Sq values for X and Residuals from the output above, added together
[1] 607698
</code>

Can you obtain the R-square value from the anova output box above?

====== E.g., 3. Simple regression: Adjusted R squared & Slope test ======
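Picking up the question at the end of the previous example: R squared can indeed be recovered from the anova(mod) sums of squares, and the adjusted R squared that this section turns to follows from the same two numbers. A minimal arithmetic sketch (Python is used only for the check; n = 12 is implied by the 1 and 10 DF in the R output):

```python
# Sum Sq values for X and Residuals, read from the anova(mod) output above
ss_reg, ss_res = 316874, 290824
ss_total = ss_reg + ss_res               # 607698, matching var(Y) * (n - 1)

n, k = 12, 1                             # 12 observations, 1 predictor (df = 1 and 10)
r_squared = ss_reg / ss_total            # about 0.5214, matching summary(mod)

# Adjusted R squared penalizes for the predictor count by comparing
# the residual variance (df = n - k - 1) to the total variance (df = n - 1).
adj_r_squared = 1 - (ss_res / (n - k - 1)) / (ss_total / (n - 1))
                                         # about 0.4736, matching summary(mod)

print(round(r_squared, 4), round(adj_r_squared, 4))
```

So the anova table alone is enough: plain R squared is the regression share of the total sum of squares, while the adjusted version re-expresses the same quantities as variances with their proper degrees of freedom.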