====== regression ======
^ ANOVA(b) ^^^^^^^
| Model |  | Sum of Squares | df | Mean Square | F | Sig. |
| 1.000 | Regression |  |  |  |  |  |
|  | Residual |  |  |  |  |  |
|  | Total |  |  |  |  |  |
| a Predictors: (Constant), bankIncome |||||||
  * This MS error is ...
  * A statistical test is then carried out on these values (df regression = 2 - 1; df residual = n - 2; df total = n - 1); see the R sketch below.
<WRAP clear />
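The table above is SPSS output. Below is a minimal R sketch of the same decomposition; the data values and the dependent-variable name (toStoreVisits) are made up for illustration, and only the predictor name bankIncome comes from the table.

<code>
# hypothetical data; only the predictor name (bankIncome) appears in the table above
bankIncome    <- c(12, 15, 11, 18, 21, 14, 17, 20, 16, 19)
toStoreVisits <- c( 3,  4,  2,  6,  7,  4,  5,  7,  5,  6)   # hypothetical DV name and values

mod <- lm(toStoreVisits ~ bankIncome)
anova(mod)     # SS, df, and MS for the regression (bankIncome) row and the Residuals row
               # df regression = 2 - 1 = 1, df residual = n - 2 = 8, df total = n - 1 = 9
summary(mod)   # the same F value and its p-value (Sig.)
</code>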
Once a regression equation (regression model) has been built, the model itself needs to be evaluated. First, the residual error derived from the regression equation (SS<sub>res</sub>) ...

In the table above,

^ for SS ^ for degrees of freedom ^
| @white: white \\ = explained error (E) \\ = $SS_{reg}$ | for regression \\ (number of variables - 1) \\ = 1 |
| @orange: orange \\ = unexplained error (U) \\ = $SS_{res}$ | for residual \\ (number of cases - number of variables) \\ = 8 |
| @yellow: yellow \\ = total error \\ = $SS_{total}$ = E + U = $SS_{reg} + SS_{res}$ | for total \\ (n - 1) \\ = 9 |
Then, \\
-> r<sup>2</sup> ...
The judgment of whether b has contributed is made using a t-test. This is explained in the Slope test section below.
^ __Coefficients(a)__ ^
===== r-square =====
  * $\displaystyle r^2=\frac{SS_{total}-SS_{res}}{SS_{total}} = \frac{\text{Explained sample variability}}{\text{Total sample variability}}$
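A quick R check of this identity; the data and variable names below are made up for illustration and are not from the page.

<code>
# arbitrary example data, just to check the identity above
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2, 1, 4, 3, 6, 5, 8, 7, 9, 12)

mod      <- lm(y ~ x)
ss.total <- sum((y - mean(y))^2)        # SS_total
ss.res   <- sum(resid(mod)^2)           # SS_res
(ss.total - ss.res) / ss.total          # r-square from the formula above
summary(mod)$r.squared                  # the same value reported by lm()
</code>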
===== Adjusted r-square =====
  * $\displaystyle r^2=\frac{SS_{total}-SS_{res}}{SS_{total}} = 1 - \frac{SS_{res}}{SS_{total}} $ ,
  * R2 value goes down -- which means that the added IVs did not really contribute
  * more (many) IVs is not always good
  * Therefore, the Adjusted r<sup>2</sup> takes the number of IVs into account
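A minimal R illustration of the point above; the data are simulated here, not from the page.

<code>
# simulated data: adding a useless IV raises R2 but can lower adjusted R2
set.seed(7)
x1 <- rnorm(30)
x2 <- rnorm(30)                       # pure noise, unrelated to y
y  <- 2 + 1.5 * x1 + rnorm(30)

m1 <- lm(y ~ x1)
m2 <- lm(y ~ x1 + x2)
c(summary(m1)$r.squared,     summary(m2)$r.squared)      # R2 never decreases
c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared)  # adjusted R2 penalizes the extra IV
</code>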
===== Slope test =====

If we take a look at the ANOVA result:
| b Dependent Variable: y |||||||
<WRAP clear />
F test recap.
  * ANOVA, F-test, $F=\frac{MS_{between}}{MS_{within}}$
    * MS_within?
      * in regression, what corresponds to "within" == residual
      * $s = \sqrt{s^2} = \sqrt{\frac{SS_{res}}{n-2}} $
      * random difference (MS<sub>res</sub>)
    * MS for regression . . . Obtained difference
      * do the same procedure as above for MS for <del>residual</del> regression
      * but, this time the degrees of freedom is k - 1 (number of variables - 1), i.e., 1.
  * Then what does the F value mean? (see the R sketch below)
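A minimal R sketch of this recap; the data below are simulated, not from the page. MS<sub>residual</sub> and MS<sub>regression</sub> are computed by hand and the resulting F is checked against R's own anova() output.

<code>
# simulated data: F computed by hand and checked against anova()
set.seed(1)
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20, sd = 2)
mod <- lm(y ~ x)

ss.res <- sum(resid(mod)^2)
ss.reg <- sum((fitted(mod) - mean(y))^2)
ms.res <- ss.res / (length(y) - 2)   # MS within  = random (chance) difference
ms.reg <- ss.reg / 1                 # MS regression = obtained difference, df = k - 1 = 1
f.val  <- ms.reg / ms.res
f.val
pf(f.val, 1, length(y) - 2, lower.tail = FALSE)  # p-value: chance of this F if the slope were 0
anova(mod)                                       # same F and p reported by R
</code>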
  * Why do we do a t-test for the slope of the X variable? Below is a mathematical explanation of this.
    * Sampling distribution of error around the slope line b:
      * $\displaystyle \sigma_{b_{1}} = \frac{\sigma}{\sqrt{SS_{x}}}$
      * Recall that $\displaystyle \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.
    * estimation of $\sigma_{b_{1}}$ : substitute sigma with s
If the errors (residuals) are spread around the slope b, and we take them out separately and draw their distribution curve, they will form a normal distribution with a mean of 0 and a standard deviation equal to the standard error above.
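A small simulation sketch of this idea (the data here are simulated, not from the page): if we repeatedly draw samples and refit the regression, the estimated slopes spread out with a standard deviation close to $\sigma / \sqrt{SS_{x}}$.

<code>
# simulation sketch: across repeated samples, the SD of the estimated slope b1
# approaches sigma / sqrt(SS_x)
set.seed(2)
x     <- 1:20
sigma <- 2
b.hat <- replicate(5000, {
  y <- 1 + 0.5 * x + rnorm(length(x), sd = sigma)
  coef(lm(y ~ x))[2]                    # slope estimate from this sample
})
sd(b.hat)                               # empirical standard deviation of b1
sigma / sqrt(sum((x - mean(x))^2))      # theoretical sigma_b1 = sigma / sqrt(SS_x)
</code>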
  * t-test
    * $\displaystyle t=\frac{b_{1} - \text{Hypothesized value of }\beta_{1}}{s_{b_{1}}}$
    * The hypothesized value of b (or beta) is 0, so the t value becomes
      * $\displaystyle t=\frac{b_{1}}{s_{b_{1}}}$
    * The standard error (se) of the slope is obtained as follows:
\begin{eqnarray*}
\displaystyle s_{b_{1}} & = & \displaystyle \sqrt { \frac{MSE}{SS_{X}} } \\
& = & \displaystyle \sqrt { \frac{1}{n-2} * \frac{SSE}{SS_{X}} } \\
& = & \displaystyle \sqrt { \frac{1}{n-2} * \frac{ \Sigma{(Y-\hat{Y})^2} }{ \Sigma{ (X_{i} - \bar{X})^2 } } } \\
\end{eqnarray*}
^ X ^ Y ^ $X-\bar{X}$ ^ ... ^
Regression formula: y<sub>hat</sub> = ...
SSE = Sum of Squared Errors = SS<sub>residual</sub>
The standard error of the slope beta (b) is obtained as follows.
+ | |||
\begin{eqnarray*}
se_{\beta} & = & \frac {\sqrt{SSE/(n-2)}}{\sqrt{SS_{X}}} \\
\end{eqnarray*}
Therefore, t = b / se = 3.655631
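An earlier revision of the page included an R snippet for this example (the y values and the lm() call). A runnable sketch is given below; the X values 1 through 5 are an assumption, chosen because they reproduce the t value 3.655631 quoted above.

<code>
x <- c(1, 2, 3, 4, 5)     # assumed X values (not shown in this excerpt)
y <- c(1, 1, 2, 2, 4)     # Y values from the earlier revision of the page
mody <- lm(y ~ x)

sse  <- sum(resid(mody)^2)                        # SSE = SS_residual
ssx  <- sum((x - mean(x))^2)                      # SS_X
se.b <- sqrt(sse / (length(x) - 2)) / sqrt(ssx)   # standard error of the slope
coef(mody)[2] / se.b                              # t = b / se = 3.655631
summary(mody)                                     # the same t value in the x row
</code>

summary(mody) reports this t value and its p-value for the slope, which is what the Coefficients table shows in SPSS output.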
====== E.g., 4. Simple regression ======