====== Multiple Regression e.gs. ======
====== E.g. 1 ======
{{:d.yyk.csv}}
d.yyk <- read.csv("http://commres.net/wiki/_media/d.yyk.csv")
d.yyk
d.yyk <- subset(d.yyk, select = -c(1))
d.yyk
> d.yyk <- subset(d.yyk, select = -c(1))
> d.yyk
bmi stress happiness
1 15.1 2 4
2 15.3 2 4
3 16.4 1 5
4 16.3 2 4
5 17.5 2 3
6 18.8 2 4
7 19.2 2 3
8 20.3 1 4
9 21.3 1 4
10 21.3 2 4
11 22.4 2 5
12 23.5 2 5
13 23.7 2 4
14 24.2 3 3
15 24.3 3 3
16 25.6 2 3
17 26.4 3 3
18 26.4 3 2
19 26.4 3 2
20 27.5 3 3
21 28.6 3 2
22 28.2 4 2
23 31.3 3 2
24 32.1 4 1
25 33.1 4 1
26 33.2 5 1
27 34.4 5 1
28 35.8 5 1
29 36.1 5 1
30 38.1 5 1
우선 여기에서 종속변인인 (dv) happiness에 bmi와 stress를 리그레션 해본다.
attach(d.yyk)
lm.happiness.bmistress <- lm(happiness ~ bmi + stress, data=d.yyk)
summary(lm.happiness.bmistress)
anova(lm.happiness.bmistress)
> attach(d.yyk)
> lm.happiness.bmistress <- lm(happiness ~ bmi + stress, data=d.yyk)
> summary(lm.happiness.bmistress)
Call:
lm(formula = happiness ~ bmi + stress, data = d.yyk)
Residuals:
Min 1Q Median 3Q Max
-0.89293 -0.40909 0.08816 0.29844 1.46429
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.29098 0.50779 12.389 1.19e-12 ***
bmi -0.05954 0.03626 -1.642 0.11222
stress -0.67809 0.19273 -3.518 0.00156 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5869 on 27 degrees of freedom
Multiple R-squared: 0.8217, Adjusted R-squared: 0.8085
F-statistic: 62.22 on 2 and 27 DF, p-value: 7.76e-11
>
>
> anova(lm.happiness.bmistress)
Analysis of Variance Table
Response: happiness
Df Sum Sq Mean Sq F value Pr(>F)
bmi 1 38.603 38.603 112.070 4.124e-11 ***
stress 1 4.264 4.264 12.378 0.001558 **
Residuals 27 9.300 0.344
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
위의 분석을 보면
R2 = 0.8217, F (2, 27) = 62.77, p = 7.76e-11
그러나, coefficient 값을 보면 bmi는 significant 하지 않고, stress는 significant하다. 이는 R제곱에 영향을 주는 것으로 stress가 주이고 bmi의 영향력은 미미하다고 하다는 결론을 내리도록 해준다. 그러나, [[:multiple regression]]에서 언급한 것처럼 독립변인이 두 개 이상일 때에는 무엇이 얼마나 종속변인에 영향을 주는지 그림을 그릴 수 있어야 하므로, 아래와 같이 bmi만을 가지고 regression을 다시 해본다.
lm.happiness.bmi <- lm(happiness ~ bmi, data=d.yyk)
summary(lm.happiness.bmi)
anova(lm.happiness.bmi)
> lm.happiness.bmi <- lm(happiness ~ bmi, data=d.yyk)
>
> summary(lm.happiness.bmi)
Call:
lm(formula = happiness ~ bmi, data = d.yyk)
Residuals:
Min 1Q Median 3Q Max
-1.20754 -0.49871 -0.03181 0.35669 1.83265
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.24143 0.50990 14.202 2.54e-14 ***
bmi -0.17337 0.01942 -8.927 1.11e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.696 on 28 degrees of freedom
Multiple R-squared: 0.74, Adjusted R-squared: 0.7307
F-statistic: 79.69 on 1 and 28 DF, p-value: 1.109e-09
> anova(lm.happiness.bmi)
Analysis of Variance Table
Response: happiness
Df Sum Sq Mean Sq F value Pr(>F)
bmi 1 38.603 38.603 79.687 1.109e-09 ***
Residuals 28 13.564 0.484
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
놀랍게도 bmi 하나만을 가지고 regression을 했더니, R제곱 값이 .74이었다 (F(1,28) = 79.69, p = 1.109e-09). 이런 결과가 나올 수 있는 이유는 독립변인인 bmi와 stress 간의 상관관계가 높아서 처음 분석에서 그 영향력을 (설명력, R제곱에 기여하는 부분을) 하나의 독립변인이 모두 가졌갔기 때문이라고 생각할 수 있다. 이 경우에는 그 독립변인이 stress이다.
happiness에 stress 만을 regression 해본 결과는 아래와 같다.
lm.happiness.stress <- lm(happiness ~ stress, data = d.yyk)
summary(lm.happiness.stress)
anova(lm.happiness.stress)
> lm.happiness.stress <- lm(happiness ~ stress, data = d.yyk)
> summary(lm.happiness.stress)
Call:
lm(formula = happiness ~ stress, data = d.yyk)
Residuals:
Min 1Q Median 3Q Max
-0.7449 -0.6657 0.2155 0.3343 1.3343
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.58651 0.27965 19.98 < 2e-16 ***
stress -0.96041 0.08964 -10.71 2.05e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6044 on 28 degrees of freedom
Multiple R-squared: 0.8039, Adjusted R-squared: 0.7969
F-statistic: 114.8 on 1 and 28 DF, p-value: 2.053e-11
> anova(lm.happiness.stress)
Analysis of Variance Table
Response: happiness
Df Sum Sq Mean Sq F value Pr(>F)
stress 1 41.938 41.938 114.8 2.053e-11 ***
Residuals 28 10.229 0.365
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
>
R제곱값은 0.80로 (F(1, 28) = 114.8, 2.053e-11) 스트레스만으로도 significant한 결과를 갖는다.
그렇다면 stress 와 bmi가 공통으로 기여하는 부분을 뺀 순수 기여분은 어떻게 될까? 즉, 위의 .80 부분 중 bmi와 공통으로 기여하는 부분을 제외한 나머지는 얼마일까? 보통 이와 같은 작업을 bmi의 (다른 독립변인의) 영향력을 제어하고 (control) 순수기여분만을 살펴본다고 이야기 한다.
===== 방법 1 =====
이를 위해서 아래를 계획, 수행해본다.
- 각각의 독립변인이 고유하게 미치는 영향력은 (설명력은) 무엇인지를 본다.
- 공통설명력은 얼마나 되는지 본다.
- 1을 위해서는 각 독립변인과 종속변인인 happiness의 semi-partial correlation값을 구해서 제곱해보면 되겠다.
- 2를 위해서는 두 독립변인을 써서 구했던 r 제곱값에서 위의 1에서 구한 제곱값들을 제외한 나머지를 보면 된겠다.
* 결론을 내기 위한 계획을 세우고 실행한다.
* 이는 아래와 같이 정리할 수 있다
{{:pasted:20201201-170048.png}}
===== 각각의 독립변인이 고유하게 미치는 영향력은 (설명력은) 무엇인지를 본다 =====
> spcor(d.yyk)
$estimate
bmi stress happiness
bmi 1.0000000 0.2730799 -0.1360657
stress 0.2371411 1.0000000 -0.2532032
happiness -0.1334127 -0.2858909 1.0000000
$p.value
bmi stress happiness
bmi 0.0000000 0.1517715 0.4815643
stress 0.2154821 0.0000000 0.1850784
happiness 0.4902316 0.1327284 0.0000000
$statistic
bmi stress happiness
bmi 0.0000000 1.475028 -0.7136552
stress 1.2684024 0.000000 -1.3600004
happiness -0.6994855 -1.550236 0.0000000
$n
[1] 30
$gp
[1] 1
$method
[1] "pearson"
>
>
happiness에 영향을 주는 변인을 보는 것이므로
bmi stress
happiness -0.1334127 -0.2858909
를 본다. 그리고 이 값의 제곱값이 각 독립변인의 고유 설명력이다.
> (-0.1334127)^2
[1] 0.01779895
> (-0.2858909)^2
[1] 0.08173361
>
즉, '' stress: 8.1% '' 와 '' bmi: 1.78% '' 만이 독립변인의 고유영향력이고 이를 제외한 '' 82.17 - (9.88) = 72.29 '' 가 공통영향력이라고 하겠다.
이를 파티션을 하면서 직접 살펴보려면
* 우선 $\frac{b}{a+b+c+d}$ 를 보려고 한다.
* 그림에서 m.bmi <- lm((a+b+c+d)~(b+e)) 와 같이 한후에 r제곱값을 보고, sqrt 하면 r값을 알 수 있다.
* b+e를 구하려면 lm(bmi~stress)를 한후, 그 residual을 보면 된다.
* a+b+c+d 는 happiness 그 자체이다.
m.bmi <- lm(bmi ~ stress)
mod <- lm(happiness ~ resid(m.bmi))
summary(mod)
> m.bmi <- lm(bmi ~ stress)
> mod <- lm(happiness ~ resid(m.bmi))
> summary(mod)
Call:
lm(formula = happiness ~ resid(m.bmi))
Residuals:
Min 1Q Median 3Q Max
-1.97283 -0.94440 0.05897 0.97961 2.29664
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.83333 0.24698 11.472 4.27e-12 ***
resid(m.bmi) -0.05954 0.08358 -0.712 0.482
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.353 on 28 degrees of freedom
Multiple R-squared: 0.0178, Adjusted R-squared: -0.01728
F-statistic: 0.5074 on 1 and 28 DF, p-value: 0.4822
위의 분석에서 R-square 값인 0.0178 이 bmi의 고유의 설명력이다. r값은 sqrt(0.0178)이다. 그리고, 위의 모델은 significant하지 않음을 주목한다.
다음으로 $\frac {d}{a+b+c+d}$을 구해서 stress 고유설명력을 본다. 이제는
m.stress <- lm(stress ~ bmi)
mod2 <- lm(happiness ~ resid(m.stress))
sumary(mod2)
> m.stress <- lm(stress ~ bmi)
> mod2 <- lm(happiness ~ resid(m.stress))
> summary(mod2)
Call:
lm(formula = happiness ~ resid(m.stress))
Residuals:
Min 1Q Median 3Q Max
-1.9383 -1.2297 0.2170 0.9804 1.9284
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.8333 0.2388 11.865 1.95e-12 ***
resid(m.stress) -0.6781 0.4295 -1.579 0.126
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.308 on 28 degrees of freedom
Multiple R-squared: 0.08173, Adjusted R-squared: 0.04894
F-statistic: 2.492 on 1 and 28 DF, p-value: 0.1256
>
Multiple R-squared 인 0.08173 이 고유 설명력이고, 이 또한 significant 하지 않다.
0.08173 값과 0.0178을 더한 값을 제외한 lm(happiness~bmi+stress) 에서의 R-squared 값이 공통설명력이 된다. 아래의 분석 결과에서 Multiple R-squared: 0.8217 이 두 변인을 모두 합한 설명력이다.
m.both <- lm(happiness~bmi+stress)
summary(m.both)
> m.both <- lm(happiness~bmi+stress)
> summary(m.both)
Call:
lm(formula = happiness ~ bmi + stress)
Residuals:
Min 1Q Median 3Q Max
-0.89293 -0.40909 0.08816 0.29844 1.46429
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.29098 0.50779 12.389 1.19e-12 ***
bmi -0.05954 0.03626 -1.642 0.11222
stress -0.67809 0.19273 -3.518 0.00156 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5869 on 27 degrees of freedom
Multiple R-squared: 0.8217, Adjusted R-squared: 0.8085
F-statistic: 62.22 on 2 and 27 DF, p-value: 7.76e-11
이 값은 0.72217 이다.
> 0.8217- (0.08173 + 0.0178)
[1] 0.72217
>
bmi나 stress 중 하나를 IV로 취하는 것이 좋다는 결론을 내린다.
====== with Two Predictor Variables ======
data file: {{:mlt06.sav}} from http://www.psychstat.missouristate.edu/multibook/mlt06.htm \\
* Y1 - A measure of success in graduate school.
* X1 - A measure of intellectual ability.
* X2 - A measure of "work ethic."
* X3 - A second measure of intellectual ability.
* X4 - A measure of spatial ability.
* Y2 - Score on a major review paper.
Analyze
Descriptive statistics
Descriptives
모든 변수를 Variable(s)로 이동
OK 누르기
Descriptive Statistics
N Minimum Maximum Mean Std. Deviation
sucess in graduate school 20 125 230 169.45 24.517
score on a major review paper 20 105 135 120.50 8.918
intellectual ability I 20 11 64 37.05 14.891
work ethic 20 11 56 29.10 14.056
intellectual ability II 20 17 79 49.35 18.622
spatial ability 20 11 56 32.50 13.004
Valid N (listwise) 20
===== Correlation matrix 검사 =====
Analyze
Correlate
Bivariate
모든 변수를 Variable(s)로 이동
Correlations
y1 y2 x1 x2 x3 x4
y1 Pearson Correlation 1 .310 .764** .769** .687** .736**
Sig. (2-tailed) .184 .000 .000 .001 .000
N 20 20 20 20 20 20
y2 Pearson Correlation .310 1 .251 .334 .168 .018
Sig. (2-tailed) .184 .286 .150 .479 .939
N 20 20 20 20 20 20
x1 Pearson Correlation .764** .251 1 .255 .940** .904**
Sig. (2-tailed) .000 .286 .278 .000 .000
N 20 20 20 20 20 20
x2 Pearson Correlation .769** .334 .255 1 .243 .266
Sig. (2-tailed) .000 .150 .278 .302 .257
N 20 20 20 20 20 20
x3 Pearson Correlation .687** .168 .940** .243 1 .847**
Sig. (2-tailed) .001 .479 .000 .302 .000
N 20 20 20 20 20 20
x4 Pearson Correlation .736** .018 .904** .266 .847** 1
Sig. (2-tailed) .000 .939 .000 .257 .000
N 20 20 20 20 20 20
** Correlation is significant at the 0.01 level (2-tailed).
{{:mlt06_bivariate.png|}}
===== Regression =====
Analyze
Regression
Linear
Dependent: y1 (success in graduate school)
Independent: x1 (intell. ability)
x2 (work ethic)
Click Statistics (오른쪽 상단 버튼)
Choose
Estimates (under Regression Coefficients)
Model fit
R squared change
Descriptives
Part and partial correlations
Collinearity diagnostics
Model Summaryb
Change Statistics
Model R R Adjusted Std. Error R F df1 df2 Sig. F
Square R Square of the Square Change Change
Estimate Change
1 .968a .936 .929 6.541 .936 124.979 2 17 .000
a Predictors: (Constant), work ethic, intellectual ability I
b Dependent Variable: success in graduate school
ANOVAa
Model Sum of Squares df Mean Square F Sig.
1 Regression 10693.657 2 5346.828 124.979 .000b
Residual 727.293 17 42.782
Total 11420.950 19
a Dependent Variable: sucess in graduate school
b Predictors: (Constant), work ethic, intellectual ability I
Coefficientsa
Model Unstandardized Standardized Correlations Collinearity
Coefficients Coefficients Statistics
------------- ----- ------------ -------------------- --------- ----
B Std. Beta t Sig. Zero Partial Part Tolerance VIF
Error -order
1 (Constant) 101.222 4.587 22.065 .000
x1 1.000 .104 .608 9.600 .000 .764 .919 .588 .935 1.070
x3 1.071 .110 .614 9.699 .000 .769 .920 .594 .935 1.070
a Dependent Variable: success in graduate school
x1 intellectual ability I
x3 work ethic
from the above output:
| x | zero-order cor | part cor | squared zero-order cor | squared part cor | shared square cor |
| x1 | .764 | .588 | 0.583696 | 0.345744 | 0.237952 |
| x2 | .769 | .594 | 0.591361 | 0.352836 | 0.238525 |
note that the values of two raws at the last column are similar. The portion is the shared effects from both x1 and x2.
$$ \hat{Y_{i}} = 101.222 + 1.000X1_{i} + 1.071X2_{i} $$
$$ \hat{Y_{i}} = 101.222 + 1.000 \ \text{intell. ability}_{i} + 1.071 \ \text{work ethic}_{i} $$
{{:mlt06_residual_stats.png|}}
====== E.g. 2 ======
?state.x77
US State Facts and Figures
Description
Data sets related to the 50 states of the United States of America.
Usage
state.abb
state.area
state.center
state.division
state.name
state.region
state.x77
Details
R currently contains the following “state” data sets. Note that all data are arranged according to alphabetical order of the state names.
state.abb:
character vector of 2-letter abbreviations for the state names.
state.area:
numeric vector of state areas (in square miles).
state.center:
list with components named x and y giving the approximate geographic center of each state in negative longitude and latitude. Alaska and Hawaii are placed just off the West Coast.
state.division:
factor giving state divisions (New England, Middle Atlantic, South Atlantic, East South Central, West South Central, East North Central, West North Central, Mountain, and Pacific).
state.name:
character vector giving the full state names.
state.region:
factor giving the region (Northeast, South, North Central, West) that each state belongs to.
state.x77:
matrix with 50 rows and 8 columns giving the following statistics in the respective columns.
Population:
population estimate as of July 1, 1975
Income:
per capita income (1974)
Illiteracy:
illiteracy (1970, percent of population)
Life Exp:
life expectancy in years (1969–71)
Murder:
murder and non-negligent manslaughter rate per 100,000 population (1976)
HS Grad:
percent high-school graduates (1970)
Frost:
mean number of days with minimum temperature below freezing (1931–1960) in capital or large city
Area:
land area in square miles
Source
U.S. Department of Commerce, Bureau of the Census (1977) Statistical Abstract of the United States.
U.S. Department of Commerce, Bureau of the Census (1977) County and City Data Book.
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
> state.x77 # output not shown
> str(state.x77) # clearly not a data frame! (it's a matrix)
> st = as.data.frame(state.x77) # so we'll make it one
> str(st)
'data.frame': 50 obs. of 8 variables:
$ Population: num 3615 365 2212 2110 21198 ...
$ Income : num 3624 6315 4530 3378 5114 ...
$ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
$ Life Exp : num 69.0 69.3 70.5 70.7 71.7 ...
$ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
$ HS Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
$ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
$ Area : num 50708 566432 113417 51945 156361 ...
> colnames(st)[4] = "Life.Exp" # no spaces in variable names, please
> colnames(st)[6] = "HS.Grad" # ditto
> st$Density = st$Population * 1000 / st$Area # create a new column in st
> summary(st)
Population Income Illiteracy Life.Exp Murder
Min. : 365 Min. :3098 Min. :0.500 Min. :67.96 Min. : 1.400
1st Qu.: 1080 1st Qu.:3993 1st Qu.:0.625 1st Qu.:70.12 1st Qu.: 4.350
Median : 2838 Median :4519 Median :0.950 Median :70.67 Median : 6.850
Mean : 4246 Mean :4436 Mean :1.170 Mean :70.88 Mean : 7.378
3rd Qu.: 4968 3rd Qu.:4814 3rd Qu.:1.575 3rd Qu.:71.89 3rd Qu.:10.675
Max. :21198 Max. :6315 Max. :2.800 Max. :73.60 Max. :15.100
HS.Grad Frost Area Density
Min. :37.80 Min. : 0.00 Min. : 1049 Min. : 0.6444
1st Qu.:48.05 1st Qu.: 66.25 1st Qu.: 36985 1st Qu.: 25.3352
Median :53.25 Median :114.50 Median : 54277 Median : 73.0154
Mean :53.11 Mean :104.46 Mean : 70736 Mean :149.2245
3rd Qu.:59.15 3rd Qu.:139.75 3rd Qu.: 81162 3rd Qu.:144.2828
Max. :67.30 Max. :188.00 Max. :566432 Max. :975.0033
> cor(st) # correlation matrix (not shown, yet)
> round(cor(st), 3) # rounding makes it easier to look at
Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area Density
Population 1.000 0.208 0.108 -0.068 0.344 -0.098 -0.332 0.023 0.246
Income 0.208 1.000 -0.437 0.340 -0.230 0.620 0.226 0.363 0.330
Illiteracy 0.108 -0.437 1.000 -0.588 0.703 -0.657 -0.672 0.077 0.009
Life.Exp -0.068 0.340 -0.588 1.000 -0.781 0.582 0.262 -0.107 0.091
Murder 0.344 -0.230 0.703 -0.781 1.000 -0.488 -0.539 0.228 -0.185
HS.Grad -0.098 0.620 -0.657 0.582 -0.488 1.000 0.367 0.334 -0.088
Frost -0.332 0.226 -0.672 0.262 -0.539 0.367 1.000 0.059 0.002
Area 0.023 0.363 0.077 -0.107 0.228 0.334 0.059 1.000 -0.341
Density 0.246 0.330 0.009 0.091 -0.185 -0.088 0.002 -0.341 1.000
>
> pairs(st) # scatterplot matrix; plot(st) does the same thing
{{state.x77.cor.png}}
> ### model1 = lm(Life.Exp ~ Population + Income + Illiteracy + Murder +
+ ### HS.Grad + Frost + Area + Density, data=st)
> ### This is what we're going to accomplish, but more economically, by
> ### simply placing a dot after the tilde, which means "everything else."
> model1 = lm(Life.Exp ~ ., data=st)
> summary(model1)
Call:
lm(formula = Life.Exp ~ ., data = st)
Residuals:
Min 1Q Median 3Q Max
-1.47514 -0.45887 -0.06352 0.59362 1.21823
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.995e+01 1.843e+00 37.956 < 2e-16 ***
Population 6.480e-05 3.001e-05 2.159 0.0367 *
Income 2.701e-04 3.087e-04 0.875 0.3867
Illiteracy 3.029e-01 4.024e-01 0.753 0.4559
Murder -3.286e-01 4.941e-02 -6.652 5.12e-08 ***
HS.Grad 4.291e-02 2.332e-02 1.840 0.0730 .
Frost -4.580e-03 3.189e-03 -1.436 0.1585
Area -1.558e-06 1.914e-06 -0.814 0.4205
Density -1.105e-03 7.312e-04 -1.511 0.1385
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7337 on 41 degrees of freedom
Multiple R-squared: 0.7501, Adjusted R-squared: 0.7013
F-statistic: 15.38 on 8 and 41 DF, p-value: 3.787e-10
> summary.aov(model1)
Df Sum Sq Mean Sq F value Pr(>F)
Population 1 0.409 0.409 0.760 0.38849
Income 1 11.595 11.595 21.541 3.53e-05 ***
Illiteracy 1 19.421 19.421 36.081 4.23e-07 ***
Murder 1 27.429 27.429 50.959 1.05e-08 ***
HS.Grad 1 4.099 4.099 7.615 0.00861 **
Frost 1 2.049 2.049 3.806 0.05792 .
Area 1 0.001 0.001 0.002 0.96438
Density 1 1.229 1.229 2.283 0.13847
Residuals 41 22.068 0.538
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
===== Model fitting =====
> model2 = update(model1, .~. -Area)
> summary(model2)
Call:
lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder +
HS.Grad + Frost + Density, data = st)
Residuals:
Min 1Q Median 3Q Max
-1.50252 -0.40471 -0.06079 0.58682 1.43862
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.094e+01 1.378e+00 51.488 < 2e-16 ***
Population 6.249e-05 2.976e-05 2.100 0.0418 *
Income 1.485e-04 2.690e-04 0.552 0.5840
Illiteracy 1.452e-01 3.512e-01 0.413 0.6814
Murder -3.319e-01 4.904e-02 -6.768 3.12e-08 ***
HS.Grad 3.746e-02 2.225e-02 1.684 0.0996 .
Frost -5.533e-03 2.955e-03 -1.873 0.0681 .
Density -7.995e-04 6.251e-04 -1.279 0.2079
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7307 on 42 degrees of freedom
Multiple R-squared: 0.746, Adjusted R-squared: 0.7037
F-statistic: 17.63 on 7 and 42 DF, p-value: 1.173e-10
> anova(model2, model1)
Analysis of Variance Table
Model 1: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
Frost + Density
Model 2: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
Frost + Area + Density
Res.Df RSS Df Sum of Sq F Pr(>F)
1 42 22.425
2 41 22.068 1 0.35639 0.6621 0.4205
> model3 = update(model2, .~. -Illiteracy)
> summary(model3)
Call:
lm(formula = Life.Exp ~ Population + Income + Murder + HS.Grad +
Frost + Density, data = st)
Residuals:
Min 1Q Median 3Q Max
-1.49555 -0.41246 -0.05336 0.58399 1.50535
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.131e+01 1.042e+00 68.420 < 2e-16 ***
Population 5.811e-05 2.753e-05 2.110 0.0407 *
Income 1.324e-04 2.636e-04 0.502 0.6181
Murder -3.208e-01 4.054e-02 -7.912 6.32e-10 ***
HS.Grad 3.499e-02 2.122e-02 1.649 0.1065
Frost -6.191e-03 2.465e-03 -2.512 0.0158 *
Density -7.324e-04 5.978e-04 -1.225 0.2272
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7236 on 43 degrees of freedom
Multiple R-squared: 0.745, Adjusted R-squared: 0.7094
F-statistic: 20.94 on 6 and 43 DF, p-value: 2.632e-11
>
> model4 = update(model3, .~. -Income)
> summary(model4)
Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost +
Density, data = st)
Residuals:
Min 1Q Median 3Q Max
-1.56877 -0.40951 -0.04554 0.57362 1.54752
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.142e+01 1.011e+00 70.665 < 2e-16 ***
Population 6.083e-05 2.676e-05 2.273 0.02796 *
Murder -3.160e-01 3.910e-02 -8.083 3.07e-10 ***
HS.Grad 4.233e-02 1.525e-02 2.776 0.00805 **
Frost -5.999e-03 2.414e-03 -2.485 0.01682 *
Density -5.864e-04 5.178e-04 -1.132 0.26360
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7174 on 44 degrees of freedom
Multiple R-squared: 0.7435, Adjusted R-squared: 0.7144
F-statistic: 25.51 on 5 and 44 DF, p-value: 5.524e-12
> model5 = update(model4, .~. -Density)
> summary(model5)
Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost,
data = st)
Residuals:
Min 1Q Median 3Q Max
-1.47095 -0.53464 -0.03701 0.57621 1.50683
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16 ***
Population 5.014e-05 2.512e-05 1.996 0.05201 .
Murder -3.001e-01 3.661e-02 -8.199 1.77e-10 ***
HS.Grad 4.658e-02 1.483e-02 3.142 0.00297 **
Frost -5.943e-03 2.421e-03 -2.455 0.01802 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7197 on 45 degrees of freedom
Multiple R-squared: 0.736, Adjusted R-squared: 0.7126
F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12
>
> anova(model5, model1)
Analysis of Variance Table
Model 1: Life.Exp ~ Population + Murder + HS.Grad + Frost
Model 2: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
Frost + Area + Density
Res.Df RSS Df Sum of Sq F Pr(>F)
1 45 23.308
2 41 22.068 4 1.2397 0.5758 0.6818
===== Stepwise =====
> step(model1, direction="backward")
Start: AIC=-22.89
Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
Frost + Area + Density
Df Sum of Sq RSS AIC
- Illiteracy 1 0.3050 22.373 -24.208
- Area 1 0.3564 22.425 -24.093
- Income 1 0.4120 22.480 -23.969
22.068 -22.894
- Frost 1 1.1102 23.178 -22.440
- Density 1 1.2288 23.297 -22.185
- HS.Grad 1 1.8225 23.891 -20.926
- Population 1 2.5095 24.578 -19.509
- Murder 1 23.8173 45.886 11.707
Step: AIC=-24.21
Life.Exp ~ Population + Income + Murder + HS.Grad + Frost + Area +
Density
Df Sum of Sq RSS AIC
- Area 1 0.1427 22.516 -25.890
- Income 1 0.2316 22.605 -25.693
22.373 -24.208
- Density 1 0.9286 23.302 -24.174
- HS.Grad 1 1.5218 23.895 -22.918
- Population 1 2.2047 24.578 -21.509
- Frost 1 3.1324 25.506 -19.656
- Murder 1 26.7071 49.080 13.072
Step: AIC=-25.89
Life.Exp ~ Population + Income + Murder + HS.Grad + Frost + Density
Df Sum of Sq RSS AIC
- Income 1 0.132 22.648 -27.598
- Density 1 0.786 23.302 -26.174
22.516 -25.890
- HS.Grad 1 1.424 23.940 -24.824
- Population 1 2.332 24.848 -22.962
- Frost 1 3.304 25.820 -21.043
- Murder 1 32.779 55.295 17.033
Step: AIC=-27.6
Life.Exp ~ Population + Murder + HS.Grad + Frost + Density
Df Sum of Sq RSS AIC
- Density 1 0.660 23.308 -28.161
22.648 -27.598
- Population 1 2.659 25.307 -24.046
- Frost 1 3.179 25.827 -23.030
- HS.Grad 1 3.966 26.614 -21.529
- Murder 1 33.626 56.274 15.910
Step: AIC=-28.16
Life.Exp ~ Population + Murder + HS.Grad + Frost
Df Sum of Sq RSS AIC
23.308 -28.161
- Population 1 2.064 25.372 -25.920
- Frost 1 3.122 26.430 -23.877
- HS.Grad 1 5.112 28.420 -20.246
- Murder 1 34.816 58.124 15.528
Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost,
data = st)
Coefficients:
(Intercept) Population Murder HS.Grad Frost
7.103e+01 5.014e-05 -3.001e-01 4.658e-02 -5.943e-03
>
===== Prediction =====
> predict(model5, list(Population=2000, Murder=10.5, HS.Grad=48, Frost=100))
1
69.61746
===== Regression Diagnostics =====
> par(mfrow=c(2,2)) # visualize four graphs at once
> plot(model5)
> par(mfrow=c(1,1)) # reset the graphics defaults
{{:state.x77.reg.diagnostic.720.png}}
===== Model objects =====
> names(model5)
[1] "coefficients" "residuals" "effects" "rank" "fitted.values"
[6] "assign" "qr" "df.residual" "xlevels" "call"
[11] "terms" "model"
> coef(model5) # an extractor function
(Intercept) Population Murder HS.Grad Frost
7.102713e+01 5.013998e-05 -3.001488e-01 4.658225e-02 -5.943290e-03
> model5$coefficients # list indexing
(Intercept) Population Murder HS.Grad Frost
7.102713e+01 5.013998e-05 -3.001488e-01 4.658225e-02 -5.943290e-03
> model5[[1]] # recall by position in the list (double brackets for lists)
(Intercept) Population Murder HS.Grad Frost
7.102713e+01 5.013998e-05 -3.001488e-01 4.658225e-02 -5.943290e-03
> model5$resid
Alabama Alaska Arizona Arkansas California
0.56888134 -0.54740399 -0.86415671 1.08626119 -0.08564599
Colorado Connecticut Delaware Florida Georgia
0.95645816 0.44541028 -1.06646884 0.04460505 -0.09694227
Hawaii Idaho Illinois Indiana Iowa
1.50683146 0.37010714 -0.05244160 -0.02158526 0.16347124
Kansas Kentucky Louisiana Maine Maryland
0.67648037 0.85582067 -0.39044846 -1.47095411 -0.29851996
Massachusetts Michigan Minnesota Mississippi Missouri
-0.61105391 0.76106640 0.69440380 -0.91535384 0.58389969
Montana Nebraska Nevada New Hampshire New Jersey
-0.84024805 0.42967691 -0.49482393 -0.49635615 -0.66612086
New Mexico New York North Carolina North Dakota Ohio
0.28880945 -0.07937149 -0.07624179 0.90350550 -0.26548767
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
0.26139958 -0.28445333 -0.95045527 0.13992982 -1.10109172
South Dakota Tennessee Texas Utah Vermont
0.06839119 0.64416651 0.92114057 0.84246817 0.57865019
Virginia Washington West Virginia Wisconsin Wyoming
-0.06691392 -0.96272426 -0.96982588 0.47004324 -0.58678863
> sort(model5$resid) # extract residuals and sort them
Maine South Carolina Delaware West Virginia Washington
-1.47095411 -1.10109172 -1.06646884 -0.96982588 -0.96272426
Pennsylvania Mississippi Arizona Montana New Jersey
-0.95045527 -0.91535384 -0.86415671 -0.84024805 -0.66612086
Massachusetts Wyoming Alaska New Hampshire Nevada
-0.61105391 -0.58678863 -0.54740399 -0.49635615 -0.49482393
Louisiana Maryland Oregon Ohio Georgia
-0.39044846 -0.29851996 -0.28445333 -0.26548767 -0.09694227
California New York North Carolina Virginia Illinois
-0.08564599 -0.07937149 -0.07624179 -0.06691392 -0.05244160
Indiana Florida South Dakota Rhode Island Iowa
-0.02158526 0.04460505 0.06839119 0.13992982 0.16347124
Oklahoma New Mexico Idaho Nebraska Connecticut
0.26139958 0.28880945 0.37010714 0.42967691 0.44541028
Wisconsin Alabama Vermont Missouri Tennessee
0.47004324 0.56888134 0.57865019 0.58389969 0.64416651
Kansas Minnesota Michigan Utah Kentucky
0.67648037 0.69440380 0.76106640 0.84246817 0.85582067
North Dakota Texas Colorado Arkansas Hawaii
0.90350550 0.92114057 0.95645816 1.08626119 1.50683146
====== e.g., ======
library(ISLR)
head(Carseats)
str(Carseats)
lm.full <- lm(Sales ~ . , data = Carseats)
lm.null <- lm(Sales ~ 1 , data = Carseats)
> stepAIC(lm.full)
Start: AIC=26.82
Sales ~ CompPrice + Income + Advertising + Population + Price +
ShelveLoc + Age + Education + Urban + US
Df Sum of Sq RSS AIC
- Population 1 0.33 403.16 25.15
- Education 1 1.19 404.02 26.00
- Urban 1 1.23 404.06 26.04
- US 1 1.57 404.40 26.38
402.83 26.82
- Income 1 76.16 478.99 94.09
- Advertising 1 127.14 529.97 134.54
- Age 1 217.44 620.27 197.48
- CompPrice 1 519.91 922.74 356.35
- ShelveLoc 2 1053.20 1456.03 536.80
- Price 1 1323.23 1726.06 606.85
Step: AIC=25.15
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age + Education + Urban + US
Df Sum of Sq RSS AIC
- Urban 1 1.15 404.31 24.29
- Education 1 1.36 404.52 24.49
- US 1 1.89 405.05 25.02
403.16 25.15
- Income 1 75.94 479.10 92.18
- Advertising 1 145.38 548.54 146.32
- Age 1 218.52 621.68 196.38
- CompPrice 1 521.69 924.85 355.27
- ShelveLoc 2 1053.18 1456.34 534.89
- Price 1 1323.51 1726.67 605.00
Step: AIC=24.29
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age + Education + US
Df Sum of Sq RSS AIC
- Education 1 1.44 405.76 23.72
- US 1 1.85 406.16 24.12
404.31 24.29
- Income 1 76.64 480.96 91.73
- Advertising 1 146.03 550.34 145.63
- Age 1 217.59 621.91 194.53
- CompPrice 1 526.17 930.48 355.69
- ShelveLoc 2 1053.93 1458.25 533.41
- Price 1 1322.80 1727.11 603.10
Step: AIC=23.72
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age + US
Df Sum of Sq RSS AIC
- US 1 1.63 407.39 23.32
405.76 23.72
- Income 1 77.87 483.62 91.94
- Advertising 1 145.30 551.06 144.15
- Age 1 217.97 623.73 193.70
- CompPrice 1 525.25 931.00 353.92
- ShelveLoc 2 1056.88 1462.64 532.61
- Price 1 1322.83 1728.58 601.44
Step: AIC=23.32
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age
Df Sum of Sq RSS AIC
407.39 23.32
- Income 1 76.68 484.07 90.30
- Age 1 219.12 626.51 193.48
- Advertising 1 234.03 641.42 202.89
- CompPrice 1 523.83 931.22 352.01
- ShelveLoc 2 1055.51 1462.90 530.68
- Price 1 1324.42 1731.81 600.18
Call:
lm(formula = Sales ~ CompPrice + Income + Advertising + Price +
ShelveLoc + Age, data = Carseats)
Coefficients:
(Intercept) CompPrice Income Advertising
5.47523 0.09257 0.01578 0.11590
Price ShelveLocGood ShelveLocMedium Age
-0.09532 4.83567 1.95199 -0.04613
>
> step(lm.full, direction="both")
Start: AIC=26.82
Sales ~ CompPrice + Income + Advertising + Population + Price +
ShelveLoc + Age + Education + Urban + US
Df Sum of Sq RSS AIC
- Population 1 0.33 403.16 25.15
- Education 1 1.19 404.02 26.00
- Urban 1 1.23 404.06 26.04
- US 1 1.57 404.40 26.38
402.83 26.82
- Income 1 76.16 478.99 94.09
- Advertising 1 127.14 529.97 134.54
- Age 1 217.44 620.27 197.48
- CompPrice 1 519.91 922.74 356.35
- ShelveLoc 2 1053.20 1456.03 536.80
- Price 1 1323.23 1726.06 606.85
Step: AIC=25.15
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age + Education + Urban + US
Df Sum of Sq RSS AIC
- Urban 1 1.15 404.31 24.29
- Education 1 1.36 404.52 24.49
- US 1 1.89 405.05 25.02
403.16 25.15
+ Population 1 0.33 402.83 26.82
- Income 1 75.94 479.10 92.18
- Advertising 1 145.38 548.54 146.32
- Age 1 218.52 621.68 196.38
- CompPrice 1 521.69 924.85 355.27
- ShelveLoc 2 1053.18 1456.34 534.89
- Price 1 1323.51 1726.67 605.00
Step: AIC=24.29
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age + Education + US
Df Sum of Sq RSS AIC
- Education 1 1.44 405.76 23.72
- US 1 1.85 406.16 24.12
404.31 24.29
+ Urban 1 1.15 403.16 25.15
+ Population 1 0.25 404.06 26.04
- Income 1 76.64 480.96 91.73
- Advertising 1 146.03 550.34 145.63
- Age 1 217.59 621.91 194.53
- CompPrice 1 526.17 930.48 355.69
- ShelveLoc 2 1053.93 1458.25 533.41
- Price 1 1322.80 1727.11 603.10
Step: AIC=23.72
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age + US
Df Sum of Sq RSS AIC
- US 1 1.63 407.39 23.32
405.76 23.72
+ Education 1 1.44 404.31 24.29
+ Urban 1 1.24 404.52 24.49
+ Population 1 0.41 405.35 25.32
- Income 1 77.87 483.62 91.94
- Advertising 1 145.30 551.06 144.15
- Age 1 217.97 623.73 193.70
- CompPrice 1 525.25 931.00 353.92
- ShelveLoc 2 1056.88 1462.64 532.61
- Price 1 1322.83 1728.58 601.44
Step: AIC=23.32
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age
Df Sum of Sq RSS AIC
407.39 23.32
+ US 1 1.63 405.76 23.72
+ Education 1 1.22 406.16 24.12
+ Urban 1 1.19 406.20 24.15
+ Population 1 0.72 406.67 24.62
- Income 1 76.68 484.07 90.30
- Age 1 219.12 626.51 193.48
- Advertising 1 234.03 641.42 202.89
- CompPrice 1 523.83 931.22 352.01
- ShelveLoc 2 1055.51 1462.90 530.68
- Price 1 1324.42 1731.81 600.18
Call:
lm(formula = Sales ~ CompPrice + Income + Advertising + Price +
ShelveLoc + Age, data = Carseats)
Coefficients:
(Intercept) CompPrice Income Advertising
5.47523 0.09257 0.01578 0.11590
Price ShelveLocGood ShelveLocMedium Age
-0.09532 4.83567 1.95199 -0.04613
>
> stepAIC(lm.full, direction="backward")
Start: AIC=26.82
Sales ~ CompPrice + Income + Advertising + Population + Price +
ShelveLoc + Age + Education + Urban + US
Df Sum of Sq RSS AIC
- Population 1 0.33 403.16 25.15
- Education 1 1.19 404.02 26.00
- Urban 1 1.23 404.06 26.04
- US 1 1.57 404.40 26.38
402.83 26.82
- Income 1 76.16 478.99 94.09
- Advertising 1 127.14 529.97 134.54
- Age 1 217.44 620.27 197.48
- CompPrice 1 519.91 922.74 356.35
- ShelveLoc 2 1053.20 1456.03 536.80
- Price 1 1323.23 1726.06 606.85
Step: AIC=25.15
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age + Education + Urban + US
Df Sum of Sq RSS AIC
- Urban 1 1.15 404.31 24.29
- Education 1 1.36 404.52 24.49
- US 1 1.89 405.05 25.02
403.16 25.15
- Income 1 75.94 479.10 92.18
- Advertising 1 145.38 548.54 146.32
- Age 1 218.52 621.68 196.38
- CompPrice 1 521.69 924.85 355.27
- ShelveLoc 2 1053.18 1456.34 534.89
- Price 1 1323.51 1726.67 605.00
Step: AIC=24.29
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age + Education + US
Df Sum of Sq RSS AIC
- Education 1 1.44 405.76 23.72
- US 1 1.85 406.16 24.12
404.31 24.29
- Income 1 76.64 480.96 91.73
- Advertising 1 146.03 550.34 145.63
- Age 1 217.59 621.91 194.53
- CompPrice 1 526.17 930.48 355.69
- ShelveLoc 2 1053.93 1458.25 533.41
- Price 1 1322.80 1727.11 603.10
Step: AIC=23.72
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age + US
Df Sum of Sq RSS AIC
- US 1 1.63 407.39 23.32
405.76 23.72
- Income 1 77.87 483.62 91.94
- Advertising 1 145.30 551.06 144.15
- Age 1 217.97 623.73 193.70
- CompPrice 1 525.25 931.00 353.92
- ShelveLoc 2 1056.88 1462.64 532.61
- Price 1 1322.83 1728.58 601.44
Step: AIC=23.32
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age
Df Sum of Sq RSS AIC
407.39 23.32
- Income 1 76.68 484.07 90.30
- Age 1 219.12 626.51 193.48
- Advertising 1 234.03 641.42 202.89
- CompPrice 1 523.83 931.22 352.01
- ShelveLoc 2 1055.51 1462.90 530.68
- Price 1 1324.42 1731.81 600.18
Call:
lm(formula = Sales ~ CompPrice + Income + Advertising + Price +
ShelveLoc + Age, data = Carseats)
Coefficients:
(Intercept) CompPrice Income Advertising Price
5.47523 0.09257 0.01578 0.11590 -0.09532
ShelveLocGood ShelveLocMedium Age
4.83567 1.95199 -0.04613
>
> step(lm.full, direction="backward")
Start: AIC=26.82
Sales ~ CompPrice + Income + Advertising + Population + Price +
ShelveLoc + Age + Education + Urban + US
Df Sum of Sq RSS AIC
- Population 1 0.33 403.16 25.15
- Education 1 1.19 404.02 26.00
- Urban 1 1.23 404.06 26.04
- US 1 1.57 404.40 26.38
402.83 26.82
- Income 1 76.16 478.99 94.09
- Advertising 1 127.14 529.97 134.54
- Age 1 217.44 620.27 197.48
- CompPrice 1 519.91 922.74 356.35
- ShelveLoc 2 1053.20 1456.03 536.80
- Price 1 1323.23 1726.06 606.85
Step: AIC=25.15
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age + Education + Urban + US
Df Sum of Sq RSS AIC
- Urban 1 1.15 404.31 24.29
- Education 1 1.36 404.52 24.49
- US 1 1.89 405.05 25.02
403.16 25.15
- Income 1 75.94 479.10 92.18
- Advertising 1 145.38 548.54 146.32
- Age 1 218.52 621.68 196.38
- CompPrice 1 521.69 924.85 355.27
- ShelveLoc 2 1053.18 1456.34 534.89
- Price 1 1323.51 1726.67 605.00
Step: AIC=24.29
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age + Education + US
Df Sum of Sq RSS AIC
- Education 1 1.44 405.76 23.72
- US 1 1.85 406.16 24.12
404.31 24.29
- Income 1 76.64 480.96 91.73
- Advertising 1 146.03 550.34 145.63
- Age 1 217.59 621.91 194.53
- CompPrice 1 526.17 930.48 355.69
- ShelveLoc 2 1053.93 1458.25 533.41
- Price 1 1322.80 1727.11 603.10
Step: AIC=23.72
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age + US
Df Sum of Sq RSS AIC
- US 1 1.63 407.39 23.32
405.76 23.72
- Income 1 77.87 483.62 91.94
- Advertising 1 145.30 551.06 144.15
- Age 1 217.97 623.73 193.70
- CompPrice 1 525.25 931.00 353.92
- ShelveLoc 2 1056.88 1462.64 532.61
- Price 1 1322.83 1728.58 601.44
Step: AIC=23.32
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age
Df Sum of Sq RSS AIC
407.39 23.32
- Income 1 76.68 484.07 90.30
- Age 1 219.12 626.51 193.48
- Advertising 1 234.03 641.42 202.89
- CompPrice 1 523.83 931.22 352.01
- ShelveLoc 2 1055.51 1462.90 530.68
- Price 1 1324.42 1731.81 600.18
Call:
lm(formula = Sales ~ CompPrice + Income + Advertising + Price +
ShelveLoc + Age, data = Carseats)
Coefficients:
(Intercept) CompPrice Income Advertising Price
5.47523 0.09257 0.01578 0.11590 -0.09532
ShelveLocGood ShelveLocMedium Age
4.83567 1.95199 -0.04613
===== Pick the model from stepAIC =====
lm.fit.01 <- lm(formula = Sales ~
CompPrice + Income +
Advertising + Price +
ShelveLoc + Age,
data = Carseats)
summary(lm.fit.01)
> lm.fit.01 <- lm(formula = Sales ~ CompPrice +
Income + Advertising + Price +
ShelveLoc + Age, data = Carseats)
> summary(lm.fit.01)
Call:
lm(formula = Sales ~ CompPrice + Income + Advertising + Price +
ShelveLoc + Age, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-2.7728 -0.6954 0.0282 0.6732 3.3292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.475226 0.505005 10.84 <2e-16 ***
CompPrice 0.092571 0.004123 22.45 <2e-16 ***
Income 0.015785 0.001838 8.59 <2e-16 ***
Advertising 0.115903 0.007724 15.01 <2e-16 ***
Price -0.095319 0.002670 -35.70 <2e-16 ***
ShelveLocGood 4.835675 0.152499 31.71 <2e-16 ***
ShelveLocMedium 1.951993 0.125375 15.57 <2e-16 ***
Age -0.046128 0.003177 -14.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.019 on 392 degrees of freedom
Multiple R-squared: 0.872, Adjusted R-squared: 0.8697
F-statistic: 381.4 on 7 and 392 DF, p-value: < 2.2e-16
Compare the fitted model to full model
anova(lm.full, lm.fit.01)
> anova(lm.full, lm.fit.01)
Analysis of Variance Table
Model 1: Sales ~ CompPrice + Income + Advertising + Population + Price +
ShelveLoc + Age + Education + Urban + US
Model 2: Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age
Res.Df RSS Df Sum of Sq F Pr(>F)
1 388 402.83
2 392 407.39 -4 -4.5533 1.0964 0.358
>
===== Backward elimination =====
drop1(lm.full, test = "F")
> drop1(lm.full, test = "F")
Single term deletions
Model:
Sales ~ CompPrice + Income + Advertising + Population + Price +
ShelveLoc + Age + Education + Urban + US
Df Sum of Sq RSS AIC F value Pr(>F)
402.83 26.82
CompPrice 1 519.91 922.74 356.35 500.7659 < 2.2e-16 ***
Income 1 76.16 478.99 94.09 73.3537 2.58e-16 ***
Advertising 1 127.14 529.97 134.54 122.4571 < 2.2e-16 ***
Population 1 0.33 403.16 25.15 0.3149 0.5750
Price 1 1323.23 1726.06 606.85 1274.5022 < 2.2e-16 ***
ShelveLoc 2 1053.20 1456.03 536.80 507.2079 < 2.2e-16 ***
Age 1 217.44 620.27 197.48 209.4333 < 2.2e-16 ***
Education 1 1.19 404.02 26.00 1.1450 0.2853
Urban 1 1.23 404.06 26.04 1.1831 0.2774
US 1 1.57 404.40 26.38 1.5094 0.2200
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
>
drop1(update(lm.full, ~ . -Population), test = "F")
> drop1(update(lm.full, ~ . -Population), test = "F")
Single term deletions
Model:
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age + Education + Urban + US
Df Sum of Sq RSS AIC F value Pr(>F)
403.16 25.15
CompPrice 1 521.69 924.85 355.27 503.3686 < 2.2e-16 ***
Income 1 75.94 479.10 92.18 73.2717 2.652e-16 ***
Advertising 1 145.38 548.54 146.32 140.2694 < 2.2e-16 ***
Price 1 1323.51 1726.67 605.00 1277.0276 < 2.2e-16 ***
ShelveLoc 2 1053.18 1456.34 534.89 508.0927 < 2.2e-16 ***
Age 1 218.52 621.68 196.38 210.8411 < 2.2e-16 ***
Education 1 1.36 404.52 24.49 1.3122 0.2527
Urban 1 1.15 404.31 24.29 1.1132 0.2920
US 1 1.89 405.05 25.02 1.8262 0.1774
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
>
drop1(update(lm.full, ~. - Population -Urban), test="F")
> drop1(update(lm.full, ~. - Population -Urban), test="F")
Single term deletions
Model:
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age + Education + US
Df Sum of Sq RSS AIC F value Pr(>F)
404.31 24.29
CompPrice 1 526.17 930.48 355.69 507.5374 <2e-16 ***
Income 1 76.64 480.96 91.73 73.9307 <2e-16 ***
Advertising 1 146.03 550.34 145.63 140.8593 <2e-16 ***
Price 1 1322.80 1727.11 603.10 1275.9661 <2e-16 ***
ShelveLoc 2 1053.93 1458.25 533.41 508.3098 <2e-16 ***
Age 1 217.59 621.91 194.53 209.8897 <2e-16 ***
Education 1 1.44 405.76 23.72 1.3930 0.2386
US 1 1.85 406.16 24.12 1.7848 0.1823
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
>
drop1(update(lm.full, ~. - Population -Urban -Education), test="F")
> drop1(update(lm.full, ~. - Population -Urban -Education), test="F")
Single term deletions
Model:
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age + US
Df Sum of Sq RSS AIC F value Pr(>F)
405.76 23.72
CompPrice 1 525.25 931.00 353.92 506.1421 <2e-16 ***
Income 1 77.87 483.62 91.94 75.0336 <2e-16 ***
Advertising 1 145.30 551.06 144.15 140.0181 <2e-16 ***
Price 1 1322.83 1728.58 601.44 1274.7123 <2e-16 ***
ShelveLoc 2 1056.88 1462.64 532.61 509.2202 <2e-16 ***
Age 1 217.97 623.73 193.70 210.0409 <2e-16 ***
US 1 1.63 407.39 23.32 1.5693 0.2111
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
drop1(update(lm.full, ~. - Population -Urban -Education -US), test="F")
> drop1(update(lm.full, ~. - Population -Urban -Education -US), test="F")
Single term deletions
Model:
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc +
Age
Df Sum of Sq RSS AIC F value Pr(>F)
407.39 23.32
CompPrice 1 523.83 931.22 352.01 504.047 < 2.2e-16 ***
Income 1 76.68 484.07 90.30 73.784 < 2.2e-16 ***
Advertising 1 234.03 641.42 202.89 225.192 < 2.2e-16 ***
Price 1 1324.42 1731.81 600.18 1274.400 < 2.2e-16 ***
ShelveLoc 2 1055.51 1462.90 530.68 507.822 < 2.2e-16 ***
Age 1 219.12 626.51 193.48 210.848 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
lm.fit.be <- lm(Sales ~
CompPrice + Income +
Advertising + Price +
ShelveLoc + Age, data = Carseats)
summary(lm.fit.be)
> lm.fit.be <- lm(Sales ~ CompPrice +
Income + Advertising +
Price + ShelveLoc +
Age, data = Carseats)
> summary(lm.fit.be)
Call:
lm(formula = Sales ~ CompPrice + Income + Advertising + Price +
ShelveLoc + Age, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-2.7728 -0.6954 0.0282 0.6732 3.3292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.475226 0.505005 10.84 <2e-16 ***
CompPrice 0.092571 0.004123 22.45 <2e-16 ***
Income 0.015785 0.001838 8.59 <2e-16 ***
Advertising 0.115903 0.007724 15.01 <2e-16 ***
Price -0.095319 0.002670 -35.70 <2e-16 ***
ShelveLocGood 4.835675 0.152499 31.71 <2e-16 ***
ShelveLocMedium 1.951993 0.125375 15.57 <2e-16 ***
Age -0.046128 0.003177 -14.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.019 on 392 degrees of freedom
Multiple R-squared: 0.872, Adjusted R-squared: 0.8697
F-statistic: 381.4 on 7 and 392 DF, p-value: < 2.2e-16
>
lm.fit.01 <- lm(formula = Sales ~
CompPrice + Income +
Advertising + Price +
ShelveLoc + Age,
data = Carseats)
summary(lm.fit.01)
> lm.fit.01 <- lm(formula = Sales ~ CompPrice +
Income + Advertising + Price +
ShelveLoc + Age, data = Carseats)
> summary(lm.fit.01)
Call:
lm(formula = Sales ~ CompPrice + Income + Advertising + Price +
ShelveLoc + Age, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-2.7728 -0.6954 0.0282 0.6732 3.3292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.475226 0.505005 10.84 <2e-16 ***
CompPrice 0.092571 0.004123 22.45 <2e-16 ***
Income 0.015785 0.001838 8.59 <2e-16 ***
Advertising 0.115903 0.007724 15.01 <2e-16 ***
Price -0.095319 0.002670 -35.70 <2e-16 ***
ShelveLocGood 4.835675 0.152499 31.71 <2e-16 ***
ShelveLocMedium 1.951993 0.125375 15.57 <2e-16 ***
Age -0.046128 0.003177 -14.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.019 on 392 degrees of freedom
Multiple R-squared: 0.872, Adjusted R-squared: 0.8697
F-statistic: 381.4 on 7 and 392 DF, p-value: < 2.2e-16