Multiple Regression e.gs.
E.g. 1
- 방법 1
- 각각의 독립변인이 고유하게 미치는 영향력은 (설명력은) 무엇인지를 본다
with Two Predictor Variables
- Correlation matrix 검사
- Regression
E.g. 2
e.g.,
- Pick the model from stepAIC
- Backward elimination

Multiple Regression e.gs.

E.g. 1

d.yyk.csv

d.yyk <- read.csv("http://commres.net/wiki/_media/d.yyk.csv")
d.yyk
d.yyk <- subset(d.yyk, select = -c(1))
d.yyk

> d.yyk <- subset(d.yyk, select = -c(1))
> d.yyk
    bmi stress happiness
1  15.1      2         4
2  15.3      2         4
3  16.4      1         5
4  16.3      2         4
5  17.5      2         3
6  18.8      2         4
7  19.2      2         3
8  20.3      1         4
9  21.3      1         4
10 21.3      2         4
11 22.4      2         5
12 23.5      2         5
13 23.7      2         4
14 24.2      3         3
15 24.3      3         3
16 25.6      2         3
17 26.4      3         3
18 26.4      3         2
19 26.4      3         2
20 27.5      3         3
21 28.6      3         2
22 28.2      4         2
23 31.3      3         2
24 32.1      4         1
25 33.1      4         1
26 33.2      5         1
27 34.4      5         1
28 35.8      5         1
29 36.1      5         1
30 38.1      5         1

우선 여기에서 종속변인인 (dv) happiness에 bmi와 stress를 리그레션 해본다.

attach(d.yyk)
lm.happiness.bmistress <- lm(happiness ~ bmi + stress, data=d.yyk)
summary(lm.happiness.bmistress)
anova(lm.happiness.bmistress)

> attach(d.yyk)
> lm.happiness.bmistress <- lm(happiness ~ bmi + stress, data=d.yyk)
> summary(lm.happiness.bmistress)

Call:
lm(formula = happiness ~ bmi + stress, data = d.yyk)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.89293 -0.40909  0.08816  0.29844  1.46429 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.29098    0.50779  12.389 1.19e-12 ***
bmi         -0.05954    0.03626  -1.642  0.11222    
stress      -0.67809    0.19273  -3.518  0.00156 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5869 on 27 degrees of freedom
Multiple R-squared:  0.8217,	Adjusted R-squared:  0.8085 
F-statistic: 62.22 on 2 and 27 DF,  p-value: 7.76e-11
>
> 
> anova(lm.happiness.bmistress)
Analysis of Variance Table

Response: happiness
          Df Sum Sq Mean Sq F value    Pr(>F)    
bmi        1 38.603  38.603 112.070 4.124e-11 ***
stress     1  4.264   4.264  12.378  0.001558 ** 
Residuals 27  9.300   0.344                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

위의 분석을 보면
R² = 0.8217, F (2, 27) = 62.77, p = 7.76e-11

그러나, coefficient 값을 보면 bmi는 significant 하지 않고, stress는 significant하다. 이는 R제곱에 영향을 주는 것으로 stress가 주이고 bmi의 영향력은 미미하다고 하다는 결론을 내리도록 해준다. 그러나, multiple regression에서 언급한 것처럼 독립변인이 두 개 이상일 때에는 무엇이 얼마나 종속변인에 영향을 주는지 그림을 그릴 수 있어야 하므로, 아래와 같이 bmi만을 가지고 regression을 다시 해본다.

lm.happiness.bmi <- lm(happiness ~ bmi, data=d.yyk)
summary(lm.happiness.bmi)
anova(lm.happiness.bmi)

> lm.happiness.bmi <- lm(happiness ~ bmi, data=d.yyk)
> 
> summary(lm.happiness.bmi)

Call:
lm(formula = happiness ~ bmi, data = d.yyk)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.20754 -0.49871 -0.03181  0.35669  1.83265 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.24143    0.50990  14.202 2.54e-14 ***
bmi         -0.17337    0.01942  -8.927 1.11e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.696 on 28 degrees of freedom
Multiple R-squared:   0.74,	Adjusted R-squared:  0.7307 
F-statistic: 79.69 on 1 and 28 DF,  p-value: 1.109e-09

> anova(lm.happiness.bmi)
Analysis of Variance Table

Response: happiness
          Df Sum Sq Mean Sq F value    Pr(>F)    
bmi        1 38.603  38.603  79.687 1.109e-09 ***
Residuals 28 13.564   0.484                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

놀랍게도 bmi 하나만을 가지고 regression을 했더니, R제곱 값이 .74이었다 (F(1,28) = 79.69, p = 1.109e-09). 이런 결과가 나올 수 있는 이유는 독립변인인 bmi와 stress 간의 상관관계가 높아서 처음 분석에서 그 영향력을 (설명력, R제곱에 기여하는 부분을) 하나의 독립변인이 모두 가졌갔기 때문이라고 생각할 수 있다. 이 경우에는 그 독립변인이 stress이다.

happiness에 stress 만을 regression 해본 결과는 아래와 같다.

lm.happiness.stress <- lm(happiness ~ stress, data = d.yyk)
summary(lm.happiness.stress)
anova(lm.happiness.stress)

> lm.happiness.stress <- lm(happiness ~ stress, data = d.yyk)
> summary(lm.happiness.stress)

Call:
lm(formula = happiness ~ stress, data = d.yyk)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7449 -0.6657  0.2155  0.3343  1.3343 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.58651    0.27965   19.98  < 2e-16 ***
stress      -0.96041    0.08964  -10.71 2.05e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6044 on 28 degrees of freedom
Multiple R-squared:  0.8039,	Adjusted R-squared:  0.7969 
F-statistic: 114.8 on 1 and 28 DF,  p-value: 2.053e-11

> anova(lm.happiness.stress)
Analysis of Variance Table

Response: happiness
          Df Sum Sq Mean Sq F value    Pr(>F)    
stress     1 41.938  41.938   114.8 2.053e-11 ***
Residuals 28 10.229   0.365                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
>

R제곱값은 0.80로 (F(1, 28) = 114.8, 2.053e-11) 스트레스만으로도 significant한 결과를 갖는다.

그렇다면 stress 와 bmi가 공통으로 기여하는 부분을 뺀 순수 기여분은 어떻게 될까? 즉, 위의 .80 부분 중 bmi와 공통으로 기여하는 부분을 제외한 나머지는 얼마일까? 보통 이와 같은 작업을 bmi의 (다른 독립변인의) 영향력을 제어하고 (control) 순수기여분만을 살펴본다고 이야기 한다.

방법 1

이를 위해서 아래를 계획, 수행해본다.

각각의 독립변인이 고유하게 미치는 영향력은 (설명력은) 무엇인지를 본다.
공통설명력은 얼마나 되는지 본다.

1을 위해서는 각 독립변인과 종속변인인 happiness의 semi-partial correlation값을 구해서 제곱해보면 되겠다.
2를 위해서는 두 독립변인을 써서 구했던 r 제곱값에서 위의 1에서 구한 제곱값들을 제외한 나머지를 보면 된겠다.

결론을 내기 위한 계획을 세우고 실행한다.
이는 아래와 같이 정리할 수 있다

각각의 독립변인이 고유하게 미치는 영향력은 (설명력은) 무엇인지를 본다

> spcor(d.yyk)
$estimate
                 bmi     stress  happiness
bmi        1.0000000  0.2730799 -0.1360657
stress     0.2371411  1.0000000 -0.2532032
happiness -0.1334127 -0.2858909  1.0000000

$p.value
                bmi    stress happiness
bmi       0.0000000 0.1517715 0.4815643
stress    0.2154821 0.0000000 0.1850784
happiness 0.4902316 0.1327284 0.0000000

$statistic
                 bmi    stress  happiness
bmi        0.0000000  1.475028 -0.7136552
stress     1.2684024  0.000000 -1.3600004
happiness -0.6994855 -1.550236  0.0000000

$n
[1] 30

$gp
[1] 1

$method
[1] "pearson"

> 
>

happiness에 영향을 주는 변인을 보는 것이므로

                 bmi    stress
happiness -0.1334127 -0.2858909

를 본다. 그리고 이 값의 제곱값이 각 독립변인의 고유 설명력이다.

> (-0.1334127)^2 
[1] 0.01779895
> (-0.2858909)^2
[1] 0.08173361
>

즉, stress: 8.1% 와 bmi: 1.78% 만이 독립변인의 고유영향력이고 이를 제외한 82.17 - (9.88) = 72.29 가 공통영향력이라고 하겠다.

이를 파티션을 하면서 직접 살펴보려면

우선 $\frac{b}{a+b+c+d}$ 를 보려고 한다.
그림에서 m.bmi ← lm¹⁾ 와 같이 한후에 r제곱값을 보고, sqrt 하면 r값을 알 수 있다.
b+e를 구하려면 lm(bmi~stress)를 한후, 그 residual을 보면 된다.
a+b+c+d 는 happiness 그 자체이다.

m.bmi <- lm(bmi ~ stress)
mod <- lm(happiness ~ resid(m.bmi))
summary(mod)

> m.bmi <- lm(bmi ~ stress)
> mod <- lm(happiness ~ resid(m.bmi))
> summary(mod)

Call:
lm(formula = happiness ~ resid(m.bmi))

Residuals:
     Min       1Q   Median       3Q      Max 
-1.97283 -0.94440  0.05897  0.97961  2.29664 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.83333    0.24698  11.472 4.27e-12 ***
resid(m.bmi) -0.05954    0.08358  -0.712    0.482    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.353 on 28 degrees of freedom
Multiple R-squared:  0.0178,	Adjusted R-squared:  -0.01728 
F-statistic: 0.5074 on 1 and 28 DF,  p-value: 0.4822

위의 분석에서 R-square 값인 0.0178 이 bmi의 고유의 설명력이다. r값은 sqrt(0.0178)이다. 그리고, 위의 모델은 significant하지 않음을 주목한다.

다음으로 $\frac {d}{a+b+c+d}$을 구해서 stress 고유설명력을 본다. 이제는

m.stress <- lm(stress ~ bmi)
mod2 <- lm(happiness ~ resid(m.stress))
sumary(mod2)

> m.stress <- lm(stress ~ bmi)
> mod2 <- lm(happiness ~ resid(m.stress))
> summary(mod2)

Call:
lm(formula = happiness ~ resid(m.stress))

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9383 -1.2297  0.2170  0.9804  1.9284 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       2.8333     0.2388  11.865 1.95e-12 ***
resid(m.stress)  -0.6781     0.4295  -1.579    0.126    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.308 on 28 degrees of freedom
Multiple R-squared:  0.08173,	Adjusted R-squared:  0.04894 
F-statistic: 2.492 on 1 and 28 DF,  p-value: 0.1256

>

Multiple R-squared 인 0.08173 이 고유 설명력이고, 이 또한 significant 하지 않다.
0.08173 값과 0.0178을 더한 값을 제외한 lm(happiness~bmi+stress) 에서의 R-squared 값이 공통설명력이 된다. 아래의 분석 결과에서 Multiple R-squared: 0.8217 이 두 변인을 모두 합한 설명력이다.

m.both <- lm(happiness~bmi+stress)
summary(m.both)

> m.both <- lm(happiness~bmi+stress)
> summary(m.both)

Call:
lm(formula = happiness ~ bmi + stress)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.89293 -0.40909  0.08816  0.29844  1.46429 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.29098    0.50779  12.389 1.19e-12 ***
bmi         -0.05954    0.03626  -1.642  0.11222    
stress      -0.67809    0.19273  -3.518  0.00156 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5869 on 27 degrees of freedom
Multiple R-squared:  0.8217,	Adjusted R-squared:  0.8085 
F-statistic: 62.22 on 2 and 27 DF,  p-value: 7.76e-11

이 값은 0.72217 이다.

> 0.8217- (0.08173 + 0.0178)
[1] 0.72217
>

bmi나 stress 중 하나를 IV로 취하는 것이 좋다는 결론을 내린다.

with Two Predictor Variables

data file: mlt06.sav from http://www.psychstat.missouristate.edu/multibook/mlt06.htm

Y1 - A measure of success in graduate school.
X1 - A measure of intellectual ability.
X2 - A measure of “work ethic.”
X3 - A second measure of intellectual ability.
X4 - A measure of spatial ability.
Y2 - Score on a major review paper.

Analyze
 Descriptive statistics
  Descriptives
   모든 변수를 Variable(s)로 이동 
    OK 누르기

Descriptive Statistics					
				N	Minimum	Maximum	Mean	Std. Deviation
sucess in graduate school	20	125	230	169.45	24.517
score on a major review paper	20	105	135	120.50	8.918
intellectual ability I		20	11	64	37.05	14.891
work ethic			20	11	56	29.10	14.056
intellectual ability II		20	17	79	49.35	18.622
spatial ability			20	11	56	32.50	13.004
Valid N (listwise)		20

Correlation matrix 검사

Analyze
 Correlate
  Bivariate
   모든 변수를 Variable(s)로 이동

Correlations
				y1	y2	x1	x2	x3	x4
y1	Pearson Correlation	1	.310	.764**	.769**	.687**	.736**
	Sig. (2-tailed)		.184	.000	.000	.001	.000
	N			20	20	20	20	20	20
y2	Pearson Correlation	.310	1	.251	.334	.168	.018
	Sig. (2-tailed)		.184		.286	.150	.479	.939
	N			20	20	20	20	20	20
x1	Pearson Correlation	.764**	.251	1	.255	.940**	.904**
	Sig. (2-tailed)		.000	.286		.278	.000	.000
	N			20	20	20	20	20	20
x2	Pearson Correlation	.769**	.334	.255	1	.243	.266
	Sig. (2-tailed)		.000	.150	.278		.302	.257
	N			20	20	20	20	20	20
x3	Pearson Correlation	.687**	.168	.940**	.243	1	.847**
	Sig. (2-tailed)		.001	.479	.000	.302		.000
	N			20	20	20	20	20	20
x4	Pearson Correlation	.736**	.018	.904**	.266	.847**	1
	Sig. (2-tailed)		.000	.939	.000	.257	.000	
	N			20	20	20	20	20	20
** Correlation is significant at the 0.01 level (2-tailed).

Regression

Analyze
 Regression
  Linear
   Dependent: 	y1 (success in graduate school)
   Independent:	x1 (intell. ability) 
		x2 (work ethic)
   Click Statistics (오른쪽 상단 버튼)
    Choose
     Estimates (under Regression Coefficients)
     Model fit
     R squared change
     Descriptives
     Part and partial correlations
     Collinearity diagnostics

Model Summaryb
							Change Statistics	
Model	R	R	Adjusted	Std. Error	R  	F 	df1	df2	Sig. F 
		Square	R Square	of the 		Square	Change			Change
					Estimate	Change
1	.968a	.936	.929		6.541		.936	124.979	2	17	.000
a Predictors: (Constant), work ethic, intellectual ability I
b Dependent Variable: success in graduate school

ANOVAa						
Model			Sum of Squares	df	Mean Square	F		Sig.
1	Regression	10693.657	2	5346.828	124.979		.000b
	Residual	727.293		17	42.782		
	Total		11420.950	19			
a Dependent Variable: sucess in graduate school						
b Predictors: (Constant), work ethic, intellectual ability I

Coefficientsa
Model		Unstandardized 	Standardized 		Correlations		Collinearity 
		Coefficients	Coefficients					Statistics
		-------------	-----	------------	--------------------	---------	----
		B	Std. 	Beta	t	Sig.	Zero	Partial	Part	Tolerance	VIF
			Error					-order	
1 (Constant)	101.222	4.587		22.065	.000					
  x1		1.000	.104	.608	9.600	.000	.764	.919	.588	.935		1.070
  x3		1.071	.110	.614	9.699	.000	.769	.920	.594	.935		1.070
a Dependent Variable: success in graduate school
x1  intellectual ability I
x3  work ethic

from the above output:

x	zero-order cor	part cor	squared zero-order cor	squared part cor	shared square cor
x1	.764	.588	0.583696	0.345744	0.237952
x2	.769	.594	0.591361	0.352836	0.238525

note that the values of two raws at the last column are similar. The portion is the shared effects from both x1 and x2.

$$ \hat{Y_{i}} = 101.222 + 1.000X1_{i} + 1.071X2_{i} $$
$$ \hat{Y_{i}} = 101.222 + 1.000 \ \text{intell. ability}_{i} + 1.071 \ \text{work ethic}_{i} $$

E.g. 2

?state.x77

US State Facts and Figures

Description

Data sets related to the 50 states of the United States of America.

Usage

state.abb
state.area
state.center
state.division
state.name
state.region
state.x77
Details

R currently contains the following “state” data sets. Note that all data are arranged according to alphabetical order of the state names.

state.abb:
character vector of 2-letter abbreviations for the state names.

state.area:
numeric vector of state areas (in square miles).

state.center:
list with components named x and y giving the approximate geographic center of each state in negative longitude and latitude. Alaska and Hawaii are placed just off the West Coast.

state.division:
factor giving state divisions (New England, Middle Atlantic, South Atlantic, East South Central, West South Central, East North Central, West North Central, Mountain, and Pacific).

state.name:
character vector giving the full state names.

state.region:
factor giving the region (Northeast, South, North Central, West) that each state belongs to.

state.x77:
matrix with 50 rows and 8 columns giving the following statistics in the respective columns.

Population:
population estimate as of July 1, 1975

Income:
per capita income (1974)

Illiteracy:
illiteracy (1970, percent of population)

Life Exp:
life expectancy in years (1969–71)

Murder:
murder and non-negligent manslaughter rate per 100,000 population (1976)

HS Grad:
percent high-school graduates (1970)

Frost:
mean number of days with minimum temperature below freezing (1931–1960) in capital or large city

Area:
land area in square miles

Source

U.S. Department of Commerce, Bureau of the Census (1977) Statistical Abstract of the United States.

U.S. Department of Commerce, Bureau of the Census (1977) County and City Data Book.

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

> state.x77                            # output not shown
> str(state.x77)                       # clearly not a data frame! (it's a matrix)
> st = as.data.frame(state.x77)        # so we'll make it one
> str(st)
'data.frame':   50 obs. of  8 variables:
 $ Population: num   3615   365  2212  2110 21198 ...
 $ Income    : num  3624 6315 4530 3378 5114 ...
 $ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
 $ Life Exp  : num  69.0 69.3 70.5 70.7 71.7 ...
 $ Murder    : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
 $ HS Grad   : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
 $ Frost     : num  20 152 15 65 20 166 139 103 11 60 ...
 $ Area      : num   50708 566432 113417  51945 156361 ...

> colnames(st)[4] = "Life.Exp"                   # no spaces in variable names, please
> colnames(st)[6] = "HS.Grad"                    # ditto
> st$Density = st$Population * 1000 / st$Area    # create a new column in st

> summary(st)
   Population        Income       Illiteracy       Life.Exp         Murder      
 Min.   :  365   Min.   :3098   Min.   :0.500   Min.   :67.96   Min.   : 1.400  
 1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625   1st Qu.:70.12   1st Qu.: 4.350  
 Median : 2838   Median :4519   Median :0.950   Median :70.67   Median : 6.850  
 Mean   : 4246   Mean   :4436   Mean   :1.170   Mean   :70.88   Mean   : 7.378  
 3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575   3rd Qu.:71.89   3rd Qu.:10.675  
 Max.   :21198   Max.   :6315   Max.   :2.800   Max.   :73.60   Max.   :15.100  
    HS.Grad          Frost             Area           Density        
 Min.   :37.80   Min.   :  0.00   Min.   :  1049   Min.   :  0.6444  
 1st Qu.:48.05   1st Qu.: 66.25   1st Qu.: 36985   1st Qu.: 25.3352  
 Median :53.25   Median :114.50   Median : 54277   Median : 73.0154  
 Mean   :53.11   Mean   :104.46   Mean   : 70736   Mean   :149.2245  
 3rd Qu.:59.15   3rd Qu.:139.75   3rd Qu.: 81162   3rd Qu.:144.2828  
 Max.   :67.30   Max.   :188.00   Max.   :566432   Max.   :975.0033

> cor(st)                              # correlation matrix (not shown, yet)
> round(cor(st), 3)                    # rounding makes it easier to look at
           Population Income Illiteracy Life.Exp Murder HS.Grad  Frost   Area Density
Population      1.000  0.208      0.108   -0.068  0.344  -0.098 -0.332  0.023   0.246
Income          0.208  1.000     -0.437    0.340 -0.230   0.620  0.226  0.363   0.330
Illiteracy      0.108 -0.437      1.000   -0.588  0.703  -0.657 -0.672  0.077   0.009
Life.Exp       -0.068  0.340     -0.588    1.000 -0.781   0.582  0.262 -0.107   0.091
Murder          0.344 -0.230      0.703   -0.781  1.000  -0.488 -0.539  0.228  -0.185
HS.Grad        -0.098  0.620     -0.657    0.582 -0.488   1.000  0.367  0.334  -0.088
Frost          -0.332  0.226     -0.672    0.262 -0.539   0.367  1.000  0.059   0.002
Area            0.023  0.363      0.077   -0.107  0.228   0.334  0.059  1.000  -0.341
Density         0.246  0.330      0.009    0.091 -0.185  -0.088  0.002 -0.341   1.000
>

> pairs(st)                            # scatterplot matrix; plot(st) does the same thing

> ### model1 = lm(Life.Exp ~ Population + Income + Illiteracy + Murder +
+ ###                        HS.Grad + Frost + Area + Density, data=st)
> ### This is what we're going to accomplish, but more economically, by
> ### simply placing a dot after the tilde, which means "everything else."
> model1 = lm(Life.Exp ~ ., data=st)
> summary(model1)

Call:
lm(formula = Life.Exp ~ ., data = st)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.47514 -0.45887 -0.06352  0.59362  1.21823 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.995e+01  1.843e+00  37.956  < 2e-16 ***
Population   6.480e-05  3.001e-05   2.159   0.0367 *  
Income       2.701e-04  3.087e-04   0.875   0.3867    
Illiteracy   3.029e-01  4.024e-01   0.753   0.4559    
Murder      -3.286e-01  4.941e-02  -6.652 5.12e-08 ***
HS.Grad      4.291e-02  2.332e-02   1.840   0.0730 .  
Frost       -4.580e-03  3.189e-03  -1.436   0.1585    
Area        -1.558e-06  1.914e-06  -0.814   0.4205    
Density     -1.105e-03  7.312e-04  -1.511   0.1385    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7337 on 41 degrees of freedom
Multiple R-squared:  0.7501,	Adjusted R-squared:  0.7013 
F-statistic: 15.38 on 8 and 41 DF,  p-value: 3.787e-10

> summary.aov(model1)
            Df Sum Sq Mean Sq F value   Pr(>F)    
Population   1  0.409   0.409   0.760  0.38849    
Income       1 11.595  11.595  21.541 3.53e-05 ***
Illiteracy   1 19.421  19.421  36.081 4.23e-07 ***
Murder       1 27.429  27.429  50.959 1.05e-08 ***
HS.Grad      1  4.099   4.099   7.615  0.00861 ** 
Frost        1  2.049   2.049   3.806  0.05792 .  
Area         1  0.001   0.001   0.002  0.96438    
Density      1  1.229   1.229   2.283  0.13847    
Residuals   41 22.068   0.538                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

Model fitting

> model2 = update(model1, .~. -Area)
> summary(model2)

Call:
lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder + 
    HS.Grad + Frost + Density, data = st)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.50252 -0.40471 -0.06079  0.58682  1.43862 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.094e+01  1.378e+00  51.488  < 2e-16 ***
Population   6.249e-05  2.976e-05   2.100   0.0418 *  
Income       1.485e-04  2.690e-04   0.552   0.5840    
Illiteracy   1.452e-01  3.512e-01   0.413   0.6814    
Murder      -3.319e-01  4.904e-02  -6.768 3.12e-08 ***
HS.Grad      3.746e-02  2.225e-02   1.684   0.0996 .  
Frost       -5.533e-03  2.955e-03  -1.873   0.0681 .  
Density     -7.995e-04  6.251e-04  -1.279   0.2079    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7307 on 42 degrees of freedom
Multiple R-squared:  0.746,	Adjusted R-squared:  0.7037 
F-statistic: 17.63 on 7 and 42 DF,  p-value: 1.173e-10

> anova(model2, model1)    
Analysis of Variance Table

Model 1: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + 
    Frost + Density
Model 2: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + 
    Frost + Area + Density
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     42 22.425                           
2     41 22.068  1   0.35639 0.6621 0.4205

> model3 = update(model2, .~. -Illiteracy)
> summary(model3)


Call:
lm(formula = Life.Exp ~ Population + Income + Murder + HS.Grad + 
    Frost + Density, data = st)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.49555 -0.41246 -0.05336  0.58399  1.50535 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.131e+01  1.042e+00  68.420  < 2e-16 ***
Population   5.811e-05  2.753e-05   2.110   0.0407 *  
Income       1.324e-04  2.636e-04   0.502   0.6181    
Murder      -3.208e-01  4.054e-02  -7.912 6.32e-10 ***
HS.Grad      3.499e-02  2.122e-02   1.649   0.1065    
Frost       -6.191e-03  2.465e-03  -2.512   0.0158 *  
Density     -7.324e-04  5.978e-04  -1.225   0.2272    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7236 on 43 degrees of freedom
Multiple R-squared:  0.745,	Adjusted R-squared:  0.7094 
F-statistic: 20.94 on 6 and 43 DF,  p-value: 2.632e-11

>

> model4 = update(model3, .~. -Income)
> summary(model4)

Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost + 
    Density, data = st)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.56877 -0.40951 -0.04554  0.57362  1.54752 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.142e+01  1.011e+00  70.665  < 2e-16 ***
Population   6.083e-05  2.676e-05   2.273  0.02796 *  
Murder      -3.160e-01  3.910e-02  -8.083 3.07e-10 ***
HS.Grad      4.233e-02  1.525e-02   2.776  0.00805 ** 
Frost       -5.999e-03  2.414e-03  -2.485  0.01682 *  
Density     -5.864e-04  5.178e-04  -1.132  0.26360    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7174 on 44 degrees of freedom
Multiple R-squared:  0.7435,	Adjusted R-squared:  0.7144 
F-statistic: 25.51 on 5 and 44 DF,  p-value: 5.524e-12

> model5 = update(model4, .~. -Density)
> summary(model5)


Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost, 
    data = st)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.47095 -0.53464 -0.03701  0.57621  1.50683 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.103e+01  9.529e-01  74.542  < 2e-16 ***
Population   5.014e-05  2.512e-05   1.996  0.05201 .  
Murder      -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
HS.Grad      4.658e-02  1.483e-02   3.142  0.00297 ** 
Frost       -5.943e-03  2.421e-03  -2.455  0.01802 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7197 on 45 degrees of freedom
Multiple R-squared:  0.736,	Adjusted R-squared:  0.7126 
F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12

>

> anova(model5, model1)
Analysis of Variance Table

Model 1: Life.Exp ~ Population + Murder + HS.Grad + Frost
Model 2: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + 
    Frost + Area + Density
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     45 23.308                           
2     41 22.068  4    1.2397 0.5758 0.6818

Stepwise

> step(model1, direction="backward")
Start:  AIC=-22.89
Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + 
    Frost + Area + Density

             Df Sum of Sq    RSS     AIC
- Illiteracy  1    0.3050 22.373 -24.208
- Area        1    0.3564 22.425 -24.093
- Income      1    0.4120 22.480 -23.969
<none>                    22.068 -22.894
- Frost       1    1.1102 23.178 -22.440
- Density     1    1.2288 23.297 -22.185
- HS.Grad     1    1.8225 23.891 -20.926
- Population  1    2.5095 24.578 -19.509
- Murder      1   23.8173 45.886  11.707

Step:  AIC=-24.21
Life.Exp ~ Population + Income + Murder + HS.Grad + Frost + Area + 
    Density

             Df Sum of Sq    RSS     AIC
- Area        1    0.1427 22.516 -25.890
- Income      1    0.2316 22.605 -25.693
<none>                    22.373 -24.208
- Density     1    0.9286 23.302 -24.174
- HS.Grad     1    1.5218 23.895 -22.918
- Population  1    2.2047 24.578 -21.509
- Frost       1    3.1324 25.506 -19.656
- Murder      1   26.7071 49.080  13.072

Step:  AIC=-25.89
Life.Exp ~ Population + Income + Murder + HS.Grad + Frost + Density

             Df Sum of Sq    RSS     AIC
- Income      1     0.132 22.648 -27.598
- Density     1     0.786 23.302 -26.174
<none>                    22.516 -25.890
- HS.Grad     1     1.424 23.940 -24.824
- Population  1     2.332 24.848 -22.962
- Frost       1     3.304 25.820 -21.043
- Murder      1    32.779 55.295  17.033

Step:  AIC=-27.6
Life.Exp ~ Population + Murder + HS.Grad + Frost + Density

             Df Sum of Sq    RSS     AIC
- Density     1     0.660 23.308 -28.161
<none>                    22.648 -27.598
- Population  1     2.659 25.307 -24.046
- Frost       1     3.179 25.827 -23.030
- HS.Grad     1     3.966 26.614 -21.529
- Murder      1    33.626 56.274  15.910

Step:  AIC=-28.16
Life.Exp ~ Population + Murder + HS.Grad + Frost

             Df Sum of Sq    RSS     AIC
<none>                    23.308 -28.161
- Population  1     2.064 25.372 -25.920
- Frost       1     3.122 26.430 -23.877
- HS.Grad     1     5.112 28.420 -20.246
- Murder      1    34.816 58.124  15.528

Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost, 
    data = st)

Coefficients:
(Intercept)   Population       Murder      HS.Grad        Frost  
  7.103e+01    5.014e-05   -3.001e-01    4.658e-02   -5.943e-03  

>

Prediction

> predict(model5, list(Population=2000, Murder=10.5, HS.Grad=48, Frost=100))
       1 
69.61746

Regression Diagnostics

> par(mfrow=c(2,2))                    # visualize four graphs at once
> plot(model5)
> par(mfrow=c(1,1))                    # reset the graphics defaults

Model objects

> names(model5)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values"
 [6] "assign"        "qr"            "df.residual"   "xlevels"       "call"         
[11] "terms"         "model"

> coef(model5)                         # an extractor function
  (Intercept)    Population        Murder       HS.Grad         Frost 
 7.102713e+01  5.013998e-05 -3.001488e-01  4.658225e-02 -5.943290e-03 
> model5$coefficients                  # list indexing
  (Intercept)    Population        Murder       HS.Grad         Frost 
 7.102713e+01  5.013998e-05 -3.001488e-01  4.658225e-02 -5.943290e-03
> model5[[1]]                          # recall by position in the list (double brackets for lists)
  (Intercept)    Population        Murder       HS.Grad         Frost 
 7.102713e+01  5.013998e-05 -3.001488e-01  4.658225e-02 -5.943290e-03

> model5$resid
       Alabama         Alaska        Arizona       Arkansas     California 
    0.56888134    -0.54740399    -0.86415671     1.08626119    -0.08564599 
      Colorado    Connecticut       Delaware        Florida        Georgia 
    0.95645816     0.44541028    -1.06646884     0.04460505    -0.09694227 
        Hawaii          Idaho       Illinois        Indiana           Iowa 
    1.50683146     0.37010714    -0.05244160    -0.02158526     0.16347124 
        Kansas       Kentucky      Louisiana          Maine       Maryland 
    0.67648037     0.85582067    -0.39044846    -1.47095411    -0.29851996 
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
   -0.61105391     0.76106640     0.69440380    -0.91535384     0.58389969 
       Montana       Nebraska         Nevada  New Hampshire     New Jersey 
   -0.84024805     0.42967691    -0.49482393    -0.49635615    -0.66612086 
    New Mexico       New York North Carolina   North Dakota           Ohio 
    0.28880945    -0.07937149    -0.07624179     0.90350550    -0.26548767 
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
    0.26139958    -0.28445333    -0.95045527     0.13992982    -1.10109172 
  South Dakota      Tennessee          Texas           Utah        Vermont 
    0.06839119     0.64416651     0.92114057     0.84246817     0.57865019 
      Virginia     Washington  West Virginia      Wisconsin        Wyoming 
   -0.06691392    -0.96272426    -0.96982588     0.47004324    -0.58678863

> sort(model5$resid)                   # extract residuals and sort them
         Maine South Carolina       Delaware  West Virginia     Washington 
   -1.47095411    -1.10109172    -1.06646884    -0.96982588    -0.96272426 
  Pennsylvania    Mississippi        Arizona        Montana     New Jersey 
   -0.95045527    -0.91535384    -0.86415671    -0.84024805    -0.66612086 
 Massachusetts        Wyoming         Alaska  New Hampshire         Nevada 
   -0.61105391    -0.58678863    -0.54740399    -0.49635615    -0.49482393 
     Louisiana       Maryland         Oregon           Ohio        Georgia 
   -0.39044846    -0.29851996    -0.28445333    -0.26548767    -0.09694227 
    California       New York North Carolina       Virginia       Illinois 
   -0.08564599    -0.07937149    -0.07624179    -0.06691392    -0.05244160 
       Indiana        Florida   South Dakota   Rhode Island           Iowa 
   -0.02158526     0.04460505     0.06839119     0.13992982     0.16347124 
      Oklahoma     New Mexico          Idaho       Nebraska    Connecticut 
    0.26139958     0.28880945     0.37010714     0.42967691     0.44541028 
     Wisconsin        Alabama        Vermont       Missouri      Tennessee 
    0.47004324     0.56888134     0.57865019     0.58389969     0.64416651 
        Kansas      Minnesota       Michigan           Utah       Kentucky 
    0.67648037     0.69440380     0.76106640     0.84246817     0.85582067 
  North Dakota          Texas       Colorado       Arkansas         Hawaii 
    0.90350550     0.92114057     0.95645816     1.08626119     1.50683146

e.g.,

library(ISLR)
head(Carseats)
str(Carseats)

lm.full <- lm(Sales ~ . , data = Carseats)
lm.null <- lm(Sales ~ 1 , data = Carseats)

> stepAIC(lm.full)
Start:  AIC=26.82
Sales ~ CompPrice + Income + Advertising + Population + Price + 
    ShelveLoc + Age + Education + Urban + US

              Df Sum of Sq     RSS    AIC
- Population   1      0.33  403.16  25.15
- Education    1      1.19  404.02  26.00
- Urban        1      1.23  404.06  26.04
- US           1      1.57  404.40  26.38
<none>                      402.83  26.82
- Income       1     76.16  478.99  94.09
- Advertising  1    127.14  529.97 134.54
- Age          1    217.44  620.27 197.48
- CompPrice    1    519.91  922.74 356.35
- ShelveLoc    2   1053.20 1456.03 536.80
- Price        1   1323.23 1726.06 606.85

Step:  AIC=25.15
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age + Education + Urban + US

              Df Sum of Sq     RSS    AIC
- Urban        1      1.15  404.31  24.29
- Education    1      1.36  404.52  24.49
- US           1      1.89  405.05  25.02
<none>                      403.16  25.15
- Income       1     75.94  479.10  92.18
- Advertising  1    145.38  548.54 146.32
- Age          1    218.52  621.68 196.38
- CompPrice    1    521.69  924.85 355.27
- ShelveLoc    2   1053.18 1456.34 534.89
- Price        1   1323.51 1726.67 605.00

Step:  AIC=24.29
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age + Education + US

              Df Sum of Sq     RSS    AIC
- Education    1      1.44  405.76  23.72
- US           1      1.85  406.16  24.12
<none>                      404.31  24.29
- Income       1     76.64  480.96  91.73
- Advertising  1    146.03  550.34 145.63
- Age          1    217.59  621.91 194.53
- CompPrice    1    526.17  930.48 355.69
- ShelveLoc    2   1053.93 1458.25 533.41
- Price        1   1322.80 1727.11 603.10

Step:  AIC=23.72
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age + US

              Df Sum of Sq     RSS    AIC
- US           1      1.63  407.39  23.32
<none>                      405.76  23.72
- Income       1     77.87  483.62  91.94
- Advertising  1    145.30  551.06 144.15
- Age          1    217.97  623.73 193.70
- CompPrice    1    525.25  931.00 353.92
- ShelveLoc    2   1056.88 1462.64 532.61
- Price        1   1322.83 1728.58 601.44

Step:  AIC=23.32
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age

              Df Sum of Sq     RSS    AIC
<none>                      407.39  23.32
- Income       1     76.68  484.07  90.30
- Age          1    219.12  626.51 193.48
- Advertising  1    234.03  641.42 202.89
- CompPrice    1    523.83  931.22 352.01
- ShelveLoc    2   1055.51 1462.90 530.68
- Price        1   1324.42 1731.81 600.18

Call:
lm(formula = Sales ~ CompPrice + Income + Advertising + Price + 
    ShelveLoc + Age, data = Carseats)

Coefficients:
    (Intercept)        CompPrice           Income      Advertising  
        5.47523          0.09257          0.01578          0.11590  
          Price    ShelveLocGood  ShelveLocMedium              Age  
       -0.09532          4.83567          1.95199         -0.04613  

>

> step(lm.full, direction="both")
Start:  AIC=26.82
Sales ~ CompPrice + Income + Advertising + Population + Price + 
    ShelveLoc + Age + Education + Urban + US

              Df Sum of Sq     RSS    AIC
- Population   1      0.33  403.16  25.15
- Education    1      1.19  404.02  26.00
- Urban        1      1.23  404.06  26.04
- US           1      1.57  404.40  26.38
<none>                      402.83  26.82
- Income       1     76.16  478.99  94.09
- Advertising  1    127.14  529.97 134.54
- Age          1    217.44  620.27 197.48
- CompPrice    1    519.91  922.74 356.35
- ShelveLoc    2   1053.20 1456.03 536.80
- Price        1   1323.23 1726.06 606.85

Step:  AIC=25.15
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age + Education + Urban + US

              Df Sum of Sq     RSS    AIC
- Urban        1      1.15  404.31  24.29
- Education    1      1.36  404.52  24.49
- US           1      1.89  405.05  25.02
<none>                      403.16  25.15
+ Population   1      0.33  402.83  26.82
- Income       1     75.94  479.10  92.18
- Advertising  1    145.38  548.54 146.32
- Age          1    218.52  621.68 196.38
- CompPrice    1    521.69  924.85 355.27
- ShelveLoc    2   1053.18 1456.34 534.89
- Price        1   1323.51 1726.67 605.00

Step:  AIC=24.29
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age + Education + US

              Df Sum of Sq     RSS    AIC
- Education    1      1.44  405.76  23.72
- US           1      1.85  406.16  24.12
<none>                      404.31  24.29
+ Urban        1      1.15  403.16  25.15
+ Population   1      0.25  404.06  26.04
- Income       1     76.64  480.96  91.73
- Advertising  1    146.03  550.34 145.63
- Age          1    217.59  621.91 194.53
- CompPrice    1    526.17  930.48 355.69
- ShelveLoc    2   1053.93 1458.25 533.41
- Price        1   1322.80 1727.11 603.10

Step:  AIC=23.72
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age + US

              Df Sum of Sq     RSS    AIC
- US           1      1.63  407.39  23.32
<none>                      405.76  23.72
+ Education    1      1.44  404.31  24.29
+ Urban        1      1.24  404.52  24.49
+ Population   1      0.41  405.35  25.32
- Income       1     77.87  483.62  91.94
- Advertising  1    145.30  551.06 144.15
- Age          1    217.97  623.73 193.70
- CompPrice    1    525.25  931.00 353.92
- ShelveLoc    2   1056.88 1462.64 532.61
- Price        1   1322.83 1728.58 601.44

Step:  AIC=23.32
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age

              Df Sum of Sq     RSS    AIC
<none>                      407.39  23.32
+ US           1      1.63  405.76  23.72
+ Education    1      1.22  406.16  24.12
+ Urban        1      1.19  406.20  24.15
+ Population   1      0.72  406.67  24.62
- Income       1     76.68  484.07  90.30
- Age          1    219.12  626.51 193.48
- Advertising  1    234.03  641.42 202.89
- CompPrice    1    523.83  931.22 352.01
- ShelveLoc    2   1055.51 1462.90 530.68
- Price        1   1324.42 1731.81 600.18

Call:
lm(formula = Sales ~ CompPrice + Income + Advertising + Price + 
    ShelveLoc + Age, data = Carseats)

Coefficients:
    (Intercept)        CompPrice           Income      Advertising  
        5.47523          0.09257          0.01578          0.11590  
          Price    ShelveLocGood  ShelveLocMedium              Age  
       -0.09532          4.83567          1.95199         -0.04613  

>

> stepAIC(lm.full, direction="backward")
Start:  AIC=26.82
Sales ~ CompPrice + Income + Advertising + Population + Price + 
    ShelveLoc + Age + Education + Urban + US

              Df Sum of Sq     RSS    AIC
- Population   1      0.33  403.16  25.15
- Education    1      1.19  404.02  26.00
- Urban        1      1.23  404.06  26.04
- US           1      1.57  404.40  26.38
<none>                      402.83  26.82
- Income       1     76.16  478.99  94.09
- Advertising  1    127.14  529.97 134.54
- Age          1    217.44  620.27 197.48
- CompPrice    1    519.91  922.74 356.35
- ShelveLoc    2   1053.20 1456.03 536.80
- Price        1   1323.23 1726.06 606.85

Step:  AIC=25.15
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age + Education + Urban + US

              Df Sum of Sq     RSS    AIC
- Urban        1      1.15  404.31  24.29
- Education    1      1.36  404.52  24.49
- US           1      1.89  405.05  25.02
<none>                      403.16  25.15
- Income       1     75.94  479.10  92.18
- Advertising  1    145.38  548.54 146.32
- Age          1    218.52  621.68 196.38
- CompPrice    1    521.69  924.85 355.27
- ShelveLoc    2   1053.18 1456.34 534.89
- Price        1   1323.51 1726.67 605.00

Step:  AIC=24.29
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age + Education + US

              Df Sum of Sq     RSS    AIC
- Education    1      1.44  405.76  23.72
- US           1      1.85  406.16  24.12
<none>                      404.31  24.29
- Income       1     76.64  480.96  91.73
- Advertising  1    146.03  550.34 145.63
- Age          1    217.59  621.91 194.53
- CompPrice    1    526.17  930.48 355.69
- ShelveLoc    2   1053.93 1458.25 533.41
- Price        1   1322.80 1727.11 603.10

Step:  AIC=23.72
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age + US

              Df Sum of Sq     RSS    AIC
- US           1      1.63  407.39  23.32
<none>                      405.76  23.72
- Income       1     77.87  483.62  91.94
- Advertising  1    145.30  551.06 144.15
- Age          1    217.97  623.73 193.70
- CompPrice    1    525.25  931.00 353.92
- ShelveLoc    2   1056.88 1462.64 532.61
- Price        1   1322.83 1728.58 601.44

Step:  AIC=23.32
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age

              Df Sum of Sq     RSS    AIC
<none>                      407.39  23.32
- Income       1     76.68  484.07  90.30
- Age          1    219.12  626.51 193.48
- Advertising  1    234.03  641.42 202.89
- CompPrice    1    523.83  931.22 352.01
- ShelveLoc    2   1055.51 1462.90 530.68
- Price        1   1324.42 1731.81 600.18

Call:
lm(formula = Sales ~ CompPrice + Income + Advertising + Price + 
    ShelveLoc + Age, data = Carseats)

Coefficients:
    (Intercept)        CompPrice           Income      Advertising            Price  
        5.47523          0.09257          0.01578          0.11590         -0.09532  
  ShelveLocGood  ShelveLocMedium              Age  
        4.83567          1.95199         -0.04613  

>

> step(lm.full, direction="backward")
Start:  AIC=26.82
Sales ~ CompPrice + Income + Advertising + Population + Price + 
    ShelveLoc + Age + Education + Urban + US

              Df Sum of Sq     RSS    AIC
- Population   1      0.33  403.16  25.15
- Education    1      1.19  404.02  26.00
- Urban        1      1.23  404.06  26.04
- US           1      1.57  404.40  26.38
<none>                      402.83  26.82
- Income       1     76.16  478.99  94.09
- Advertising  1    127.14  529.97 134.54
- Age          1    217.44  620.27 197.48
- CompPrice    1    519.91  922.74 356.35
- ShelveLoc    2   1053.20 1456.03 536.80
- Price        1   1323.23 1726.06 606.85

Step:  AIC=25.15
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age + Education + Urban + US

              Df Sum of Sq     RSS    AIC
- Urban        1      1.15  404.31  24.29
- Education    1      1.36  404.52  24.49
- US           1      1.89  405.05  25.02
<none>                      403.16  25.15
- Income       1     75.94  479.10  92.18
- Advertising  1    145.38  548.54 146.32
- Age          1    218.52  621.68 196.38
- CompPrice    1    521.69  924.85 355.27
- ShelveLoc    2   1053.18 1456.34 534.89
- Price        1   1323.51 1726.67 605.00

Step:  AIC=24.29
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age + Education + US

              Df Sum of Sq     RSS    AIC
- Education    1      1.44  405.76  23.72
- US           1      1.85  406.16  24.12
<none>                      404.31  24.29
- Income       1     76.64  480.96  91.73
- Advertising  1    146.03  550.34 145.63
- Age          1    217.59  621.91 194.53
- CompPrice    1    526.17  930.48 355.69
- ShelveLoc    2   1053.93 1458.25 533.41
- Price        1   1322.80 1727.11 603.10

Step:  AIC=23.72
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age + US

              Df Sum of Sq     RSS    AIC
- US           1      1.63  407.39  23.32
<none>                      405.76  23.72
- Income       1     77.87  483.62  91.94
- Advertising  1    145.30  551.06 144.15
- Age          1    217.97  623.73 193.70
- CompPrice    1    525.25  931.00 353.92
- ShelveLoc    2   1056.88 1462.64 532.61
- Price        1   1322.83 1728.58 601.44

Step:  AIC=23.32
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age

              Df Sum of Sq     RSS    AIC
<none>                      407.39  23.32
- Income       1     76.68  484.07  90.30
- Age          1    219.12  626.51 193.48
- Advertising  1    234.03  641.42 202.89
- CompPrice    1    523.83  931.22 352.01
- ShelveLoc    2   1055.51 1462.90 530.68
- Price        1   1324.42 1731.81 600.18

Call:
lm(formula = Sales ~ CompPrice + Income + Advertising + Price + 
    ShelveLoc + Age, data = Carseats)

Coefficients:
    (Intercept)        CompPrice           Income      Advertising            Price  
        5.47523          0.09257          0.01578          0.11590         -0.09532  
  ShelveLocGood  ShelveLocMedium              Age  
        4.83567          1.95199         -0.04613

Pick the model from stepAIC

lm.fit.01 <- lm(formula = Sales ~ 
                 CompPrice + Income + 
                 Advertising + Price + 
                 ShelveLoc + Age, 
                 data = Carseats)
summary(lm.fit.01)

> lm.fit.01 <- lm(formula = Sales ~ CompPrice + 
         Income + Advertising + Price + 
         ShelveLoc + Age, data = Carseats)
> summary(lm.fit.01)

Call:
lm(formula = Sales ~ CompPrice + Income + Advertising + Price + 
    ShelveLoc + Age, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7728 -0.6954  0.0282  0.6732  3.3292 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5.475226   0.505005   10.84   <2e-16 ***
CompPrice        0.092571   0.004123   22.45   <2e-16 ***
Income           0.015785   0.001838    8.59   <2e-16 ***
Advertising      0.115903   0.007724   15.01   <2e-16 ***
Price           -0.095319   0.002670  -35.70   <2e-16 ***
ShelveLocGood    4.835675   0.152499   31.71   <2e-16 ***
ShelveLocMedium  1.951993   0.125375   15.57   <2e-16 ***
Age             -0.046128   0.003177  -14.52   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.019 on 392 degrees of freedom
Multiple R-squared:  0.872,	Adjusted R-squared:  0.8697 
F-statistic: 381.4 on 7 and 392 DF,  p-value: < 2.2e-16

Compare the fitted model to full model

anova(lm.full, lm.fit.01)

> anova(lm.full, lm.fit.01)
Analysis of Variance Table

Model 1: Sales ~ CompPrice + Income + Advertising + Population + Price + 
    ShelveLoc + Age + Education + Urban + US
Model 2: Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    388 402.83                           
2    392 407.39 -4   -4.5533 1.0964  0.358
>

Backward elimination

drop1(lm.full, test = "F")

> drop1(lm.full, test = "F")
Single term deletions

Model:
Sales ~ CompPrice + Income + Advertising + Population + Price + 
    ShelveLoc + Age + Education + Urban + US
            Df Sum of Sq     RSS    AIC   F value    Pr(>F)    
<none>                    402.83  26.82                        
CompPrice    1    519.91  922.74 356.35  500.7659 < 2.2e-16 ***
Income       1     76.16  478.99  94.09   73.3537  2.58e-16 ***
Advertising  1    127.14  529.97 134.54  122.4571 < 2.2e-16 ***
Population   1      0.33  403.16  25.15    0.3149    0.5750    
Price        1   1323.23 1726.06 606.85 1274.5022 < 2.2e-16 ***
ShelveLoc    2   1053.20 1456.03 536.80  507.2079 < 2.2e-16 ***
Age          1    217.44  620.27 197.48  209.4333 < 2.2e-16 ***
Education    1      1.19  404.02  26.00    1.1450    0.2853    
Urban        1      1.23  404.06  26.04    1.1831    0.2774    
US           1      1.57  404.40  26.38    1.5094    0.2200    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
>

drop1(update(lm.full, ~ . -Population), test = "F")

> drop1(update(lm.full, ~ . -Population), test = "F")
Single term deletions

Model:
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age + Education + Urban + US
            Df Sum of Sq     RSS    AIC   F value    Pr(>F)    
<none>                    403.16  25.15                        
CompPrice    1    521.69  924.85 355.27  503.3686 < 2.2e-16 ***
Income       1     75.94  479.10  92.18   73.2717 2.652e-16 ***
Advertising  1    145.38  548.54 146.32  140.2694 < 2.2e-16 ***
Price        1   1323.51 1726.67 605.00 1277.0276 < 2.2e-16 ***
ShelveLoc    2   1053.18 1456.34 534.89  508.0927 < 2.2e-16 ***
Age          1    218.52  621.68 196.38  210.8411 < 2.2e-16 ***
Education    1      1.36  404.52  24.49    1.3122    0.2527    
Urban        1      1.15  404.31  24.29    1.1132    0.2920    
US           1      1.89  405.05  25.02    1.8262    0.1774    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
>

drop1(update(lm.full, ~. - Population -Urban), test="F")

> drop1(update(lm.full, ~. - Population -Urban), test="F")
Single term deletions

Model:
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age + Education + US
            Df Sum of Sq     RSS    AIC   F value Pr(>F)    
<none>                    404.31  24.29                     
CompPrice    1    526.17  930.48 355.69  507.5374 <2e-16 ***
Income       1     76.64  480.96  91.73   73.9307 <2e-16 ***
Advertising  1    146.03  550.34 145.63  140.8593 <2e-16 ***
Price        1   1322.80 1727.11 603.10 1275.9661 <2e-16 ***
ShelveLoc    2   1053.93 1458.25 533.41  508.3098 <2e-16 ***
Age          1    217.59  621.91 194.53  209.8897 <2e-16 ***
Education    1      1.44  405.76  23.72    1.3930 0.2386    
US           1      1.85  406.16  24.12    1.7848 0.1823    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
>

drop1(update(lm.full, ~. - Population -Urban -Education), test="F")

> drop1(update(lm.full, ~. - Population -Urban -Education), test="F")
Single term deletions

Model:
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age + US
            Df Sum of Sq     RSS    AIC   F value Pr(>F)    
<none>                    405.76  23.72                     
CompPrice    1    525.25  931.00 353.92  506.1421 <2e-16 ***
Income       1     77.87  483.62  91.94   75.0336 <2e-16 ***
Advertising  1    145.30  551.06 144.15  140.0181 <2e-16 ***
Price        1   1322.83 1728.58 601.44 1274.7123 <2e-16 ***
ShelveLoc    2   1056.88 1462.64 532.61  509.2202 <2e-16 ***
Age          1    217.97  623.73 193.70  210.0409 <2e-16 ***
US           1      1.63  407.39  23.32    1.5693 0.2111    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

drop1(update(lm.full, ~. - Population -Urban -Education -US), test="F")

> drop1(update(lm.full, ~. - Population -Urban -Education -US), test="F")
Single term deletions

Model:
Sales ~ CompPrice + Income + Advertising + Price + ShelveLoc + 
    Age
            Df Sum of Sq     RSS    AIC  F value    Pr(>F)    
<none>                    407.39  23.32                       
CompPrice    1    523.83  931.22 352.01  504.047 < 2.2e-16 ***
Income       1     76.68  484.07  90.30   73.784 < 2.2e-16 ***
Advertising  1    234.03  641.42 202.89  225.192 < 2.2e-16 ***
Price        1   1324.42 1731.81 600.18 1274.400 < 2.2e-16 ***
ShelveLoc    2   1055.51 1462.90 530.68  507.822 < 2.2e-16 ***
Age          1    219.12  626.51 193.48  210.848 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

lm.fit.be <- lm(Sales ~ 
            CompPrice + Income + 
            Advertising + Price + 
            ShelveLoc + Age, data = Carseats)
            
summary(lm.fit.be)

> lm.fit.be <- lm(Sales ~ CompPrice + 
        Income + Advertising + 
        Price + ShelveLoc + 
        Age, data = Carseats)
> summary(lm.fit.be)

Call:
lm(formula = Sales ~ CompPrice + Income + Advertising + Price + 
    ShelveLoc + Age, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7728 -0.6954  0.0282  0.6732  3.3292 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5.475226   0.505005   10.84   <2e-16 ***
CompPrice        0.092571   0.004123   22.45   <2e-16 ***
Income           0.015785   0.001838    8.59   <2e-16 ***
Advertising      0.115903   0.007724   15.01   <2e-16 ***
Price           -0.095319   0.002670  -35.70   <2e-16 ***
ShelveLocGood    4.835675   0.152499   31.71   <2e-16 ***
ShelveLocMedium  1.951993   0.125375   15.57   <2e-16 ***
Age             -0.046128   0.003177  -14.52   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.019 on 392 degrees of freedom
Multiple R-squared:  0.872,	Adjusted R-squared:  0.8697 
F-statistic: 381.4 on 7 and 392 DF,  p-value: < 2.2e-16

>

lm.fit.01 <- lm(formula = Sales ~ 
                 CompPrice + Income + 
                 Advertising + Price + 
                 ShelveLoc + Age, 
                 data = Carseats)
summary(lm.fit.01)

> lm.fit.01 <- lm(formula = Sales ~ CompPrice + 
         Income + Advertising + Price + 
         ShelveLoc + Age, data = Carseats)
> summary(lm.fit.01)

Call:
lm(formula = Sales ~ CompPrice + Income + Advertising + Price + 
    ShelveLoc + Age, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7728 -0.6954  0.0282  0.6732  3.3292 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5.475226   0.505005   10.84   <2e-16 ***
CompPrice        0.092571   0.004123   22.45   <2e-16 ***
Income           0.015785   0.001838    8.59   <2e-16 ***
Advertising      0.115903   0.007724   15.01   <2e-16 ***
Price           -0.095319   0.002670  -35.70   <2e-16 ***
ShelveLocGood    4.835675   0.152499   31.71   <2e-16 ***
ShelveLocMedium  1.951993   0.125375   15.57   <2e-16 ***
Age             -0.046128   0.003177  -14.52   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.019 on 392 degrees of freedom
Multiple R-squared:  0.872,	Adjusted R-squared:  0.8697 
F-statistic: 381.4 on 7 and 392 DF,  p-value: < 2.2e-16

¹⁾

a+b+c+d)~(b+e

Table of Contents