Adjusted R Squared
Adjusted $R^2$ vs. $R^2$
Below is the simple regression example from Regression, E.g. 3.
DATA
| x | y |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
| 5 | 4 |
Model Summary(b)
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | 0.903696114 | 0.816666667 | 0.755555556 | 0.605530071 |
r-square:
- $\displaystyle r^2=\frac{SS_{total}-SS_{res}}{SS_{total}} = \frac{\text{Explained sample variability}}{\text{Total sample variability}}$
- $\displaystyle r^2=\frac{SS_{total}-SS_{res}}{SS_{total}} = 1-\frac{SS_{res}}{SS_{total}} = 0.816666667 = R^2 $
- Usually interpreted as a percentage (by multiplying $r^2$ by 100).
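The fit and the $R^2$ value above can be checked with a minimal sketch in plain Python (no libraries), reproducing the slope, intercept, and R Square reported in the Model Summary table:

```python
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)
mx = sum(x) / n                      # mean of x
my = sum(y) / n                      # mean of y
# least-squares slope b1 = SP / SSX, intercept b0 = mean(y) - b1 * mean(x)
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
     sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx
ss_total = sum((yi - my) ** 2 for yi in y)                        # SS_total
ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # SS_res
r2 = 1 - ss_res / ss_total
print(b0, b1, r2)   # ≈ -0.1, 0.7, 0.81667
```

This matches the table: $R^2 = 1 - 1.1/6 \approx 0.8167$.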
Adjusted r-square:
- $\displaystyle r^2=\frac{SS_{total}-SS_{res}}{SS_{total}} = 1 - \frac{SS_{res}}{SS_{total}} $ ,
- This is equivalent to: $ \displaystyle 1 - \frac {Var_{res}}{Var_{total}} $
- $\text{Var} = \text{MS} = s^{2} = \displaystyle \frac {SS}{n} $
- Here, if we use the following values instead of n in each case (n = number of samples, p = number of predictors),
- $\displaystyle Var_{res} = \frac {SS_{res}}{n-p-1}$
- $\displaystyle Var_{total} = \frac {SS_{total}}{n-1}$
- Therefore,
- $\displaystyle \text{Adjusted } R^{2} = 1 - \displaystyle \frac {\displaystyle \frac {SS_{res}}{n-p-1}}{\displaystyle \frac {SS_{total}}{n-1}} $
- This is the same logic as using n-1 instead of n in order to estimate the population standard deviation from a sample statistic.
- Therefore, Adjusted $R^2$ = 0.755555556
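A minimal sketch of the Adjusted $R^2$ formula above, plugging in $SS_{res} = 1.1$, $SS_{total} = 6$, n = 5, and p = 1 from this example:

```python
# Adjusted R^2 = 1 - (SS_res / (n - p - 1)) / (SS_total / (n - 1))
ss_res, ss_total = 1.1, 6.0
n, p = 5, 1
adj_r2 = 1 - (ss_res / (n - p - 1)) / (ss_total / (n - 1))
print(adj_r2)   # ≈ 0.755556
```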
Why do we use the Adjusted R squared value?
If we take a look at the ANOVA result:
ANOVA
| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 4.9 | 1 | 4.9 | 13.36363636 | 0.035352847 |
| | Residual | 1.1 | 3 | 0.366666667 | | |
| | Total | 6 | 4 | | | |

a Predictors: (Constant), x
b Dependent Variable: y
- ANOVA, F-test, $F=\frac{MS_{between}}{MS_{within}}$
- What is $MS_{between}$?
- What is $MS_{within}$?
- MS for residual
- $s = \sqrt{s^2} = \sqrt{\frac{SS_{res}}{n-2}} $
- random difference (MSwithin ): $s^2 = \frac{SS_{res}}{n-2} $
- MS for regression . . . Obtained difference
- do the same procedure as above (in MS for residual).
- but this time the degrees of freedom is k-1 (number of variables - 1), which is 1 here.
- Then what does the F value mean? It is the ratio of the obtained (regression) variance to the random (residual) variance.
Then, we take another look at the coefficients result:
Coefficients(a)
| Model | | Unstandardized B | Std. Error | Standardized Beta | t | Sig. | 95% CI Lower Bound | 95% CI Upper Bound |
|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | -0.1 | 0.635085296 | | -0.157459164 | 0.88488398 | -2.121124854 | 1.921124854 |
| | x | 0.7 | 0.191485422 | 0.903696114 | 3.655630775 | 0.035352847 | 0.090607928 | 1.309392072 |

a Dependent Variable: y
- Why do we do a t-test for the slope of the X variable? Below is a mathematical explanation.
- Sampling distribution of Beta (or b):
- $\sigma_{\beta_{1}} = \frac{\sigma}{\sqrt{SS_{xx}}}$
- estimation of $\sigma_{\beta_{1}}$ : substitute sigma with s
- t-test
- $t=\frac{\beta_{1} - \text{Hypothesized value of }\beta_{1}}{s_{\beta_{1}}}$
- The hypothesized value of beta is usually 0, so the t value is
- $t=\frac{\beta_{1}}{s_{\beta_{1}}}$
- $s_{\beta} = \displaystyle\frac{\sqrt{MS_{E}}}{\sqrt{SS_{X}}} = \displaystyle\frac{\sqrt{\frac{SSE}{n-2}}}{\sqrt{SS_{X}}} = \displaystyle\frac{\sqrt{\frac{\Sigma{(Y-\hat{Y})^2}}{n-2}}}{\sqrt{\Sigma{(X_{i}-\bar{X})^2}}} $
| X | Y | $X-\bar{X}$ | ssx | sp | ypredicted | error ($Y-\hat{Y}$) | error² |
|---|---|---|---|---|---|---|---|
| 1 | 1 | -2 | 4 | 2 | 0.6 | 0.4 | 0.16 |
| 2 | 1 | -1 | 1 | 1 | 1.3 | -0.3 | 0.09 |
| 3 | 2 | 0 | 0 | 0 | 2 | 0 | 0 |
| 4 | 2 | 1 | 1 | 0 | 2.7 | -0.7 | 0.49 |
| 5 | 4 | 2 | 4 | 4 | 3.4 | 0.6 | 0.36 |
| $\bar{X}$ = 3 | $\bar{Y}$ = 2 | | SSX = 10 | SP = 7 | | | SSE = 1.1 |
Regression formula: ypredicted = -0.1 + 0.7X
SSE = Sum of Squared Errors
The standard error for the slope beta (b) is obtained as follows:
$$se_{\beta} = \frac {\sqrt{SSE/(n-2)}}{\sqrt{SSX}} = \frac {\sqrt{1.1/3}}{\sqrt{10}} = 0.191485 $$
And b = 0.7.
Therefore, t = b / se = 0.7 / 0.191485 = 3.655631
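The standard error and t value above can be verified with a short sketch, using SSE = 1.1, SSX = 10, and b = 0.7 from the worked table:

```python
import math

# se_beta = sqrt(SSE / (n - 2)) / sqrt(SSX), then t = b / se_beta
sse, ssx = 1.1, 10.0
n, b = 5, 0.7
se_beta = math.sqrt(sse / (n - 2)) / math.sqrt(ssx)
t = b / se_beta
print(se_beta, t)   # ≈ 0.191485, 3.655631
```

These agree with the Std. Error and t columns of the coefficients table.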
Last modified: 2016/05/11 07:35 by hkimscil