Simple Linear Regression

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
where $\beta_0$ and $\beta_1$ are the regression coefficients and the $\varepsilon_i$ are the error terms.

library(MASS)      # MASS provides the Cars93 data set
attach(Cars93)
plot(MPG.city ~ EngineSize)
mod <- lm(MPG.city ~ EngineSize)
mod
Call:
lm(formula = MPG.city ~ EngineSize)

Coefficients:
(Intercept)   EngineSize  
     32.627       -3.846  
      
summary(mod)

Call:
lm(formula = MPG.city ~ EngineSize)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.6264  -2.7032  -0.5491   2.1428  17.2197 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  32.6267     1.1439  28.523  < 2e-16 ***
EngineSize   -3.8464     0.3999  -9.618 1.61e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.979 on 91 degrees of freedom
Multiple R-squared:  0.5041,	Adjusted R-squared:  0.4987 
F-statistic: 92.51 on 1 and 91 DF,  p-value: 1.606e-15

Is the model statistically significant?

  • Check the F statistic at the bottom of the summary.
  • The F statistic tests whether the model as a whole is significant, that is, whether it explains more variance than an intercept-only model.

Are the coefficients significant?

  • Check the coefficient’s t statistics and p-values in the summary.

Is the model useful?

  • Check the R2 near the bottom of the summary.
  • R2 is a measure of the model’s quality. Bigger is better. Mathematically, it is the fraction of the variance of y that is explained by the regression model.

Does the model fit the data well?

Does the data satisfy the assumptions behind linear regression?

  • Check whether the diagnostics confirm that a linear model is reasonable for your data (see the sketch below).
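As a quick check, the plot method for lm objects draws the four standard diagnostic plots. A minimal sketch, using the model fitted above:

# four diagnostic plots: Residuals vs Fitted, Normal Q-Q,
# Scale-Location, and Residuals vs Leverage
par(mfrow = c(2, 2))   # arrange the plots in a 2 x 2 grid
plot(mod)
par(mfrow = c(1, 1))   # restore the default layout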

E.g. 1

dataset_simpleregression.csv
   YEAR  ROLL UNEM HGRAD  INC
1     1  5501  8.1  9552 1923
2     2  5945  7.0  9680 1961
3     3  6629  7.3  9731 1979
4     4  7556  7.5 11666 2030
5     5  8716  7.0 14675 2112
6     6  9369  6.4 15265 2192
7     7  9920  6.5 15484 2235
8     8 10167  6.4 15723 2351
9     9 11084  6.3 16501 2411
10   10 12504  7.7 16890 2475
11   11 13746  8.2 17203 2524
12   12 13656  7.5 17707 2674
13   13 13850  7.4 18108 2833
14   14 14145  8.2 18266 2863
15   15 14888 10.1 19308 2839
16   16 14991  9.2 18224 2898
17   17 14836  7.7 18997 3123
18   18 14478  5.7 19505 3195
19   19 14539  6.5 19800 3239
20   20 14395  7.5 19546 3129
21   21 14599  7.3 19117 3100
22   22 14969  9.2 18774 3008
23   23 15107 10.1 17813 2983
24   24 14831  7.5 17304 3069
25   25 15081  8.8 16756 3151
26   26 15127  9.1 16749 3127
27   27 15856  8.8 16925 3179
28   28 15938  7.8 17231 3207
29   29 16081  7.0 16816 3345
datavar <- read.csv("http://commres.net/wiki/_media/r/dataset_simpleregression.csv")

# ROLL = students' enrollment
# HGRAD = number of high school graduates
# UNEM = unemployment
# INC = per capita income
linearModelVar <- lm(ROLL ~ UNEM, datavar)

linearModelVar
Call:
lm(formula = ROLL ~ UNEM, data = datavar)

Coefficients:
(Intercept)         UNEM  
       3957         1134  

From the above, Enrollment = 3957 + 1134 * Unemployment rate.
If the unemployment rate is 9%, the predicted enrollment is

3957 + 1134 * 9
[1] 14163

that is, about 14,163 students.
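Rather than computing this by hand, predict() gives the same answer. A minimal sketch:

# newdata must be a data frame whose column names match the predictors
predict(linearModelVar, newdata = data.frame(UNEM = 9))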

Summarizing the model

summary(linearModelVar)
Call:
lm(formula = ROLL ~ UNEM, data = datavar)

Residuals:
    Min      1Q  Median      3Q     Max 
-7640.0 -1046.5   602.8  1934.3  4187.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3957.0     4000.1   0.989   0.3313  
UNEM          1133.8      513.1   2.210   0.0358 *
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3049 on 27 degrees of freedom
Multiple R-squared:  0.1531,	Adjusted R-squared:  0.1218 
F-statistic: 4.883 on 1 and 27 DF,  p-value: 0.03579

You should be able to explain what each of the following terms means:
Multiple R-squared(R2): 0.1531
F-statistic: 4.883 (1, 27), p-value: 0.03579
t-statistic for UNEM: 2.210, p-value: 0.0358 *

What about beta coefficient?

In R we demonstrate the use of the lm.beta() function in the QuantPsyc package (due to Thomas D. Fletcher of State Farm). The function is short and sweet, and takes a linear model object as its argument:

> library(QuantPsyc)
> lm.beta(mod)
EngineSize 
-0.7100032 
> cor(MPG.city, EngineSize)
[1] -0.7100032

With a single predictor, the standardized beta equals the Pearson correlation between x and y, which is why the two values match.

Multiple Regression

regression output table

anova(m): ANOVA table
coef(m), coefficients(m): model coefficients
confint(m): confidence intervals for the regression coefficients
deviance(m): residual sum of squares
effects(m): vector of orthogonal effects
fitted(m): vector of fitted y values
residuals(m), resid(m): model residuals
summary(m): key statistics, such as R2, the F statistic, and the residual standard error (σ, the sample standard deviation of ε)
vcov(m): variance-covariance matrix of the main parameters
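For example, applied to the simple regression model mod fitted earlier (a minimal sketch):

confint(mod)    # 95% confidence intervals for the intercept and slope
deviance(mod)   # residual sum of squares

Now fit a model with two predictors: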
lm.model <- lm(Cars93$MPG.city ~ Cars93$EngineSize + Cars93$Price)
summary(lm.model)
Call:
lm(formula = Cars93$MPG.city ~ Cars93$EngineSize + Cars93$Price)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.4511 -2.0637 -0.5156  1.6239 16.9372 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       33.34647    1.12251  29.707  < 2e-16 ***
Cars93$EngineSize -2.98885    0.47808  -6.252 1.33e-08 ***
Cars93$Price      -0.15415    0.05134  -3.002  0.00347 ** 
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.815 on 90 degrees of freedom
Multiple R-squared:  0.5492,	Adjusted R-squared:  0.5392 
F-statistic: 54.83 on 2 and 90 DF,  p-value: 2.674e-16

Questions:

  • What is R2?
  • How many cars are involved in this test? (cf. df = 90)
    • df (90) + the number of estimated coefficients (3, including the intercept) = 93
    • check 'str(Cars93)'
  • If I eliminate the R2 from the above output, can you still identify what it is?

The last question:

  • If I eliminate the R2 from the above output, can you still identify what it is?
  • to answer the question, use the regression output table:

$R^2 = \frac{SS_{reg}}{SS_{total}} = 1 - \frac{SS_{res}}{SS_{total}}$

> anova(lm.model)
Analysis of Variance Table

Response: Cars93$MPG.city
                  Df Sum Sq Mean Sq F value  Pr(>F)    
Cars93$EngineSize  1   1465    1465  100.65 2.4e-16 ***
Cars93$Price       1    131     131    9.01  0.0035 ** 
Residuals         90   1310      15                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> sstotal = 1465+131+1310
> ssreg <- 1465+131
> ssreg/sstotal
[1] 0.54921
> 
> # or
> 1-(deviance(lm.model)/sstotal)
[1] 0.54932
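As a cross-check, the same quantity is stored in the summary object:

summary(lm.model)$r.squared   # the R-squared reported by summary()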

Regression formula:

  • $\hat{Y} = -2.99 \times \text{EngineSize} - 0.15 \times \text{Price} + 33.35$
  • $\hat{Y} = \widehat{\text{MPG.city}}$

In the meantime, compare the beta coefficients with the pairwise correlations:

> lm.beta(lm.model)
Cars93$EngineSize      Cars93$Price 
       -0.5517121        -0.2649553 
> cor(MPG.city,EngineSize)
[1] -0.7100032
> cor(EngineSize,Price)
[1] 0.5974254
> cor(MPG.city,Price)
[1] -0.5945622
> 

Or . . . .

> temp <- subset(Cars93, select=c(MPG.city,EngineSize,Price))
> temp
. . . . 

> cor(temp)
             MPG.city EngineSize      Price
MPG.city    1.0000000 -0.7100032 -0.5945622
EngineSize -0.7100032  1.0000000  0.5974254
Price      -0.5945622  0.5974254  1.0000000
> 

Beta coefficients are not equal to the correlations among the variables: with multiple predictors, the betas diverge from the simple correlations because the predictors are correlated with one another.

plot(lm.model$residuals)
abline(h=0, col="red")
anova(lm.model)
Analysis of Variance Table

Response: Cars93$MPG.city
                  Df  Sum Sq Mean Sq F value    Pr(>F)    
Cars93$EngineSize  1 1464.71 1464.71 100.653 2.444e-16 ***
Cars93$Price       1  131.17  131.17   9.014  0.003468 ** 
Residuals         90 1309.69   14.55                      
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Why use anova() with lm output (lm.model in this case)? Because anova() reports the sequential sums of squares, which is what you need to compute R2 by hand, as shown above.

coef(lm.model)
      (Intercept) Cars93$EngineSize      Cars93$Price 
       33.3464734        -2.9888458        -0.1541498 
residuals(lm.model)
          1           2           3           4           5 
-0.51556927 -0.55648889 -0.49194621 -0.16625801  3.73898067 
          6           7           8           9          10 
-2.35086090  0.21745634  3.34329779  1.06528019  2.64786884 
         11          12          13          14          15 
 2.58362397  0.29459458 -0.01370501 -1.85673580 -3.32003095 
         16          17          18          19          20 
-1.47621772 -2.93554987  1.49577174  6.54763980 -0.64692607 
         21          22          23          24          25 
 1.05563073  1.06413662  1.55497338 -2.02911999 -1.82416666 
         26          27          28          29          30 
-4.45108994 -2.46962214 -2.40287136  2.01742275  0.08957791 
         31          32          33          34          35 
 2.67973459 -3.40963806 -2.73023540 -2.02114636 -1.21068471 
         36          37          38          39          40 
-6.31235513 -0.26611019  1.62394797 16.93723064  3.36255227 
         41          42          43          44          45 
 0.58003782 15.00200777 -0.07339128  1.36999363 -4.42505304 
         46          47          48          49          50 
-1.32170679 -5.22609969  4.48710776 -2.06374182 -0.95386332 
         51          52          53          54          55 
 0.29847852  3.96702480  1.71512315  1.82158662  2.66911267 
         56          57          58          59          60 
-3.43567496 -7.45110565 -1.55474970  4.75970527 -3.39080806 
         61          62          63          64          65 
-0.69202743  1.72453815 -2.35662642  2.25464742  0.24690826 
         66          67          68          69          70 
-4.43567496 -0.06571546 -0.39110586 -1.25837103 -0.98293839 
         71          72          73          74          75 
 0.20204136 -2.74679396  3.82302800 -2.65771911 -1.45594634 
         76          77          78          79          80 
-1.33262651  0.77239559 -2.64579820  2.04339631  4.53499980 
         81          82          83          84          85 
-1.28631823 -0.76509170 10.86471434  4.64746325  1.06534353 
         86          87          88          89          90 
-1.96548643 -4.67404320 -1.56378785 -5.83760799 -3.28578597 
         91          92          93 
-3.38601500 -1.97292778 -2.05744404 
deviance(lm.model)
[1] 1309.686

Ex 1

In the Cars93 data above, how much is city mileage (MPG.city) affected by EngineSize, Horsepower, and Turn.circle? (See the sketch after this list.)

  • Turn.circle is the turning radius,
  • EngineSize is the size of the engine, and
  • Horsepower is the engine's horsepower.
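A minimal sketch of one way to start on this exercise (the model name ex1.model is arbitrary):

# regress city mileage on engine size, horsepower, and turning circle
ex1.model <- lm(MPG.city ~ EngineSize + Horsepower + Turn.circle, data = Cars93)
summary(ex1.model)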

Ex 2

dataset_multipleregression.csv

data <- read.csv("http://commres.net/wiki/_media/r/dataset_multipleregression.csv")
head(data)
  YEAR ROLL UNEM HGRAD  INC
1    1 5501  8.1  9552 1923
2    2 5945  7.0  9680 1961
3    3 6629  7.3  9731 1979
4    4 7556  7.5 11666 2030
5    5 8716  7.0 14675 2112
6    6 9369  6.4 15265 2192

ROLL = students' enrollment
HGRAD = number of high school graduates
UNEM = unemployment
INC = per capita income
onePredictorModel <- lm(ROLL ~ UNEM, data)
onePredictorModel

Call:
lm(formula = ROLL ~ UNEM, data = data)

Coefficients:
(Intercept)         UNEM  
       3957         1134  

summary(onePredictorModel)

Call:
lm(formula = ROLL ~ UNEM, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-7640.0 -1046.5   602.8  1934.3  4187.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3957.0     4000.1   0.989   0.3313  
UNEM          1133.8      513.1   2.210   0.0358 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3049 on 27 degrees of freedom
Multiple R-squared:  0.1531,	Adjusted R-squared:  0.1218 
F-statistic: 4.883 on 1 and 27 DF,  p-value: 0.03579

Predict the fall enrollment (ROLL) using:

  • the unemployment rate (UNEM) and
  • number of spring high school graduates (HGRAD)
twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, data)
twoPredictorModel

Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = data)

Coefficients:
(Intercept)         UNEM        HGRAD  
 -8255.7511     698.2681       0.9423  

summary(twoPredictorModel)

Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2102.2  -861.6  -349.4   374.5  3603.5 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.256e+03  2.052e+03  -4.023  0.00044 ***
UNEM         6.983e+02  2.244e+02   3.111  0.00449 ** 
HGRAD        9.423e-01  8.613e-02  10.941 3.16e-11 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared:  0.8489,	Adjusted R-squared:  0.8373 
F-statistic: 73.03 on 2 and 26 DF,  p-value: 2.144e-11

That is, ROLL = -8255.7511 + 698.2681 * UNEM + 0.9423 * HGRAD

Q: what is the expected fall enrollment (ROLL) given this year's unemployment rate (UNEM) of 9% and spring high school graduating class (HGRAD) of 100,000?

UNEM <- 9
HGRAD <- 100000
-8255.7511  +  698.2681*UNEM  +  0.9423*HGRAD
[1] 92258.66

A: the predicted fall enrollment, given a 9% unemployment rate and a 100,000-student spring high school graduating class, is about 92,259 students.
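The same prediction via predict(), as a sketch:

predict(twoPredictorModel, newdata = data.frame(UNEM = 9, HGRAD = 100000))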

The relationship between enrollment (ROLL) and unemployment, high school graduates, and income. That is,

  • dv = ROLL (Enrollment)
  • iv = UNEM, HGRAD, INC
threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC, data)
threePredictorModel

Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = data)

Coefficients:
(Intercept)         UNEM        HGRAD          INC  
 -9153.2545     450.1245       0.4065       4.2749  
summary(threePredictorModel)

Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1148.84  -489.71    -1.88   387.40  1425.75 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.153e+03  1.053e+03  -8.691 5.02e-09 ***
UNEM         4.501e+02  1.182e+02   3.809 0.000807 ***
HGRAD        4.065e-01  7.602e-02   5.347 1.52e-05 ***
INC          4.275e+00  4.947e-01   8.642 5.59e-09 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared:  0.9621,	Adjusted R-squared:  0.9576 
F-statistic: 211.5 on 3 and 25 DF,  p-value: < 2.2e-16

How to get beta coefficients1) for predictor variables?

  • Use the scale function to standardize the x values and then run the regression (see the sketch after the code below), or
  • use the lm.beta function in the QuantPsyc package.
  • In this case:
install.packages("QuantPsyc")
library(QuantPsyc)
lm.beta(threePredictorModel)
     UNEM     HGRAD       INC 
0.1553619 0.3656177 0.6061762 
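The scale route mentioned above produces the same numbers. A minimal sketch (zdata is an arbitrary name):

# standardize every variable to mean 0 and sd 1, then refit;
# the slopes of the refitted model are the beta coefficients
zdata <- as.data.frame(scale(data[, c("ROLL", "UNEM", "HGRAD", "INC")]))
lm(ROLL ~ UNEM + HGRAD + INC, data = zdata)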

How to compare the models (as IVs are added incrementally)

anova(onePredictorModel, twoPredictorModel, threePredictorModel)

Analysis of Variance Table

Model 1: ROLL ~ UNEM
Model 2: ROLL ~ UNEM + HGRAD
Model 3: ROLL ~ UNEM + HGRAD + INC
  Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
1     27 251084710                                  
2     26  44805568  1 206279143 458.92 < 2.2e-16 ***
3     25  11237313  1  33568255  74.68 5.594e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

e.g. 3

hierarchicalregressiondata.csv

# Import data (simulated data for this example)
myData <- read.csv('http://static.lib.virginia.edu/statlab/materials/data/hierarchicalRegressionData.csv')
# or 
# myData <- read.csv('http://commres.net/wiki/_media/r/hierarchicalregressiondata.csv')

# Build models to check whether adding variables is worthwhile.
m0 <- lm(happiness ~ 1, data=myData)  # to obtain Total SS
m1 <- lm(happiness ~ age + gender, data=myData)  # Model 1
m2 <- lm(happiness ~ age + gender + friends, data=myData)  # Model 2
m3 <- lm(happiness ~ age + gender + friends + pets, data=myData)  # Model 3
anova(m0)  # to obtain Total SS
Analysis of Variance Table

Response: happiness
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 99 240.84  2.4327  
summary(m1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.66778    2.01364   3.808 0.000246 ***
age         -0.13039    0.07936  -1.643 0.103611    
genderMale   0.16430    0.31938   0.514 0.608106    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.553 on 97 degrees of freedom
Multiple R-squared:  0.02855,	Adjusted R-squared:  0.008515 
F-statistic: 1.425 on 2 and 97 DF,  p-value: 0.2455
summary(m2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  6.21730    1.96220   3.169  0.00206 **
age         -0.12479    0.07546  -1.654  0.10146   
genderMale   0.14931    0.30365   0.492  0.62405   
friends      0.18985    0.05640   3.366  0.00110 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.476 on 96 degrees of freedom
Multiple R-squared:  0.1311,	Adjusted R-squared:  0.1039 
F-statistic: 4.828 on 3 and 96 DF,  p-value: 0.003573
summary(m3)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  5.78540    1.90266   3.041  0.00305 **
age         -0.11146    0.07309  -1.525  0.13057   
genderMale  -0.14267    0.31157  -0.458  0.64806   
friends      0.17134    0.05491   3.120  0.00239 **
pets         0.36391    0.13044   2.790  0.00637 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.427 on 95 degrees of freedom
Multiple R-squared:  0.1969,	Adjusted R-squared:  0.1631 
F-statistic: 5.822 on 4 and 95 DF,  p-value: 0.0003105
> lm.beta(m3)
        age  genderMale     friends        pets 
-0.14098154 -0.04484095  0.28909280  0.27446786 
anova(m0,m1,m2,m3)
Analysis of Variance Table

Model 1: happiness ~ 1
Model 2: happiness ~ age + gender
Model 3: happiness ~ age + gender + friends
Model 4: happiness ~ age + gender + friends + pets
  Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
1     99 240.84                                   
2     97 233.97  2    6.8748  1.6883 0.1903349    
3     96 209.27  1   24.6957 12.1293 0.0007521 ***
4     95 193.42  1   15.8461  7.7828 0.0063739 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  • Model 0: SSTotal= 240.84 (no predictors)
  • Model 1: SSResidual= 233.97 (after adding age and gender)
  • Model 2: SSResidual= 209.27,
    • SSDifference= 233.97 - 209.27 = 24.696,
    • F(1,96) = 12.1293, p value = 0.0007521 (after adding friends)
  • Model 3: SSResidual= 193.42,
    • SSDifference= 209.27 - 193.42 = 15.846,
    • F(1,95) = 7.7828, p value = 0.0063739 (after adding pets)
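To report the R2 change at each step, a common summary in hierarchical regression, compare the stored R2 values directly. A sketch:

# R-squared of each model and the increment contributed by each block
r2 <- c(summary(m1)$r.squared, summary(m2)$r.squared, summary(m3)$r.squared)
diff(c(0, r2))   # change in R-squared at each step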


Source: Park, N., Kee, K. F., & Valenzuela, S. (2009). Being immersed in social networking environment: Facebook groups, uses and gratifications, and social outcomes. CyberPsychology & Behavior, 12, 729-733.


Source: Lankau, M. J., & Scandura, T. A. (2002). An investigation of personal learning in mentoring relationships: Content, antecedents, and consequences. Academy of Management Journal, 45, 779-790.

see also http://rtutorialseries.blogspot.kr/2010/01/r-tutorial-series-basic-hierarchical.html

e.g. 4

crime_simple_1_.txt

# data import 
crime <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/crime_simple.txt", 
         sep = "\t", header = TRUE)
str(crime)        
  • R: Crime rate: # of offenses reported to police per million population
  • Age: The number of males of age 14-24 per 1000 population
  • S: Indicator variable for Southern states (0 = No, 1 = Yes)
  • Ed: Mean # of years of schooling x 10 for persons of age 25 or older
  • Ex0: 1960 per capita expenditure on police by state and local government
  • Ex1: 1959 per capita expenditure on police by state and local government
  • LF: Labor force participation rate per 1000 civilian urban males age 14-24
  • M: The number of males per 1000 females
  • N: State population size in hundred thousands
  • NW: The number of non-whites per 1000 population
  • U1: Unemployment rate of urban males per 1000 of age 14-24
  • U2: Unemployment rate of urban males per 1000 of age 35-39
  • W: Median value of transferable goods and assets or family income in tens of dollars
  • X: The number of families per 1000 earning below 1/2 the median income
# Assign more meaningful variable names
colnames(crime) <- c("crime.per.million", "young.males", "is.south", "average.ed",
                     "exp.per.cap.1960", "exp.per.cap.1959", "labour.part",
                     "male.per.fem", "population", "nonwhite",
                     "unemp.youth", "unemp.adult", "median.assets", "num.low.salary")
# Convert is.south to a factor
# Divide average.ed by 10 so that the variable is actually average education
# Convert median assets to 1000's of dollars instead of 10's
crime <- transform(crime, is.south = as.factor(is.south),
                          average.ed = average.ed / 10,
                          median.assets = median.assets / 100)
# Fit model
# regress crime.per.million (the DV) on all the other variables
crime.lm <- lm(crime.per.million ~ ., data = crime)
plot(crime)

The scatterplot matrix shows strong linear relationships (high correlations) between the 1959 and 1960 police-expenditure variables, and likewise between unemp.youth and unemp.adult. Since each pair is largely redundant, we drop one variable from each pair.

# Remove 1959 expenditure and youth unemployment
crime.lm2 <- update(crime.lm, . ~ . - exp.per.cap.1959 - unemp.youth)
summary(crime.lm2)

Call:
lm(formula = crime.per.million ~ young.males + is.south + average.ed + 
    exp.per.cap.1960 + labour.part + male.per.fem + population + 
    nonwhite + unemp.adult + median.assets + num.low.salary, 
    data = crime)

Residuals:
   Min     1Q Median     3Q    Max 
-35.82 -11.57  -1.51  10.63  55.02 

Coefficients:
                    Estimate  Std. Error t value  Pr(>|t|)    
(Intercept)      -633.438828  145.470340  -4.354  0.000111 ***
young.males         1.126883    0.418791   2.691  0.010853 *  
is.south1          -0.556600   13.883248  -0.040  0.968248    
average.ed         15.328028    6.202516   2.471  0.018476 *  
exp.per.cap.1960    1.138299    0.226977   5.015 0.0000153 ***
labour.part         0.068716    0.133540   0.515  0.610087    
male.per.fem        0.003021    0.173041   0.017  0.986172    
population         -0.064477    0.128278  -0.503  0.618367    
nonwhite           -0.013794    0.061901  -0.223  0.824960    
unemp.adult         0.931498    0.541803   1.719  0.094402 .  
median.assets      15.158975   10.524458   1.440  0.158653    
num.low.salary      0.825936    0.234189   3.527  0.001197 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.98 on 35 degrees of freedom
Multiple R-squared:  0.7543,	Adjusted R-squared:  0.6771 
F-statistic: 9.769 on 11 and 35 DF,  p-value: 0.00000009378




summary(crime.lm)

Call:
lm(formula = crime.per.million ~ ., data = crime)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.884 -11.923  -1.135  13.495  50.560 

Coefficients:
                    Estimate  Std. Error t value  Pr(>|t|)    
(Intercept)      -691.837588  155.887918  -4.438 0.0000956 ***
young.males         1.039810    0.422708   2.460   0.01931 *  
is.south1          -8.308313   14.911588  -0.557   0.58117    
average.ed         18.016011    6.496504   2.773   0.00906 ** 
exp.per.cap.1960    1.607818    1.058667   1.519   0.13836    
exp.per.cap.1959   -0.667258    1.148773  -0.581   0.56529    
labour.part        -0.041031    0.153477  -0.267   0.79087    
male.per.fem        0.164795    0.209932   0.785   0.43806    
population         -0.041277    0.129516  -0.319   0.75196    
nonwhite            0.007175    0.063867   0.112   0.91124    
unemp.youth        -0.601675    0.437154  -1.376   0.17798    
unemp.adult         1.792263    0.856111   2.093   0.04407 *  
median.assets      13.735847   10.583028   1.298   0.20332    
num.low.salary      0.792933    0.235085   3.373   0.00191 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.94 on 33 degrees of freedom
Multiple R-squared:  0.7692,	Adjusted R-squared:  0.6783 
F-statistic: 8.462 on 13 and 33 DF,  p-value: 0.0000003686

See the below stepwise regression explanation.

reduced.model <- step(crime.lm2, direction="backward")
Start:  AIC=300.61
crime.per.million ~ young.males + is.south + average.ed + exp.per.cap.1960 + 
    labour.part + male.per.fem + population + nonwhite + unemp.adult + 
    median.assets + num.low.salary

                   Df Sum of Sq   RSS    AIC
- male.per.fem      1       0.1 16906 298.61
- is.south          1       0.8 16907 298.61
- nonwhite          1      24.0 16930 298.67
- population        1     122.0 17028 298.95
- labour.part       1     127.9 17034 298.96
<none                         16906 300.61
- median.assets     1    1002.1 17908 301.31
- unemp.adult       1    1427.8 18334 302.42
- average.ed        1    2949.9 19856 306.17
- young.males       1    3497.3 20403 307.45
- num.low.salary    1    6008.0 22914 312.90
- exp.per.cap.1960  1   12148.5 29054 324.06

Step:  AIC=298.61
crime.per.million ~ young.males + is.south + average.ed + exp.per.cap.1960 + 
    labour.part + population + nonwhite + unemp.adult + median.assets + 
    num.low.salary

                   Df Sum of Sq   RSS    AIC
- is.south          1       0.8 16907 296.61
- nonwhite          1      25.7 16932 296.68
- labour.part       1     155.9 17062 297.04
- population        1     161.7 17068 297.06
<none                         16906 298.61
- median.assets     1    1002.9 17909 299.32
- unemp.adult       1    1804.4 18711 301.37
- average.ed        1    3242.3 20148 304.85
- young.males       1    3807.8 20714 306.16
- num.low.salary    1    6267.5 23174 311.43
- exp.per.cap.1960  1   12586.1 29492 322.76

Step:  AIC=296.61
crime.per.million ~ young.males + average.ed + exp.per.cap.1960 + 
    labour.part + population + nonwhite + unemp.adult + median.assets + 
    num.low.salary

                   Df Sum of Sq   RSS    AIC
- nonwhite          1      38.4 16945 294.72
- population        1     162.2 17069 295.06
- labour.part       1     210.0 17117 295.19
<none                         16907 296.61
- median.assets     1    1047.1 17954 297.43
- unemp.adult       1    1866.1 18773 299.53
- average.ed        1    3243.1 20150 302.86
- young.males       1    3820.9 20728 304.19
- num.low.salary    1    7560.2 24467 311.98
- exp.per.cap.1960  1   12624.3 29531 320.82

Step:  AIC=294.72
crime.per.million ~ young.males + average.ed + exp.per.cap.1960 + 
    labour.part + population + unemp.adult + median.assets + 
    num.low.salary

                   Df Sum of Sq   RSS    AIC
- population        1     175.1 17120 293.20
- labour.part       1     216.2 17162 293.31
<none                         16945 294.72
- median.assets     1    1146.8 18092 295.79
- unemp.adult       1    1855.8 18801 297.60
- average.ed        1    3515.0 20460 301.58
- young.males       1    4032.3 20978 302.75
- num.low.salary    1    7700.6 24646 310.32
- exp.per.cap.1960  1   14283.5 31229 321.45

Step:  AIC=293.2
crime.per.million ~ young.males + average.ed + exp.per.cap.1960 + 
    labour.part + unemp.adult + median.assets + num.low.salary

                   Df Sum of Sq   RSS    AIC
- labour.part       1     230.7 17351 291.83
<none                         17120 293.20
- median.assets     1    1064.3 18185 294.03
- unemp.adult       1    1858.2 18979 296.04
- average.ed        1    3981.1 21101 301.03
- young.males       1    4509.0 21629 302.19
- num.low.salary    1    7540.2 24661 308.35
- exp.per.cap.1960  1   15814.0 32934 321.95

Step:  AIC=291.83
crime.per.million ~ young.males + average.ed + exp.per.cap.1960 + 
    unemp.adult + median.assets + num.low.salary

                   Df Sum of Sq   RSS    AIC
<none                         17351 291.83
- median.assets     1    1252.6 18604 293.11
- unemp.adult       1    1628.7 18980 294.05
- young.males       1    4461.0 21812 300.58
- average.ed        1    6214.7 23566 304.22
- num.low.salary    1    8932.3 26283 309.35
- exp.per.cap.1960  1   15596.5 32948 319.97
summary(reduced.model)

Performing Linear Regression Without an Intercept

lm(y ~ x + 0)
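A concrete sketch using the built-in cars data (y ~ x - 1 is equivalent notation; noint.model is an arbitrary name):

# + 0 (or - 1) in the formula suppresses the intercept,
# forcing the fitted line through the origin
noint.model <- lm(dist ~ speed + 0, data = cars)
coef(noint.model)   # a single slope, no intercept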

Performing Linear Regression with Interaction Terms

This is a very tricky part. See Interaction effects in regression analysis
see also https://cran.r-project.org/web/packages/jtools/vignettes/interactions.html

lm(y ~ u*v)
$y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 u_i v_i + \varepsilon_i$, where $\beta_3 u_i v_i$ is the interaction term.

y ~ u*v*w
$y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 w_i + \beta_4 u_i v_i + \beta_5 u_i w_i + \beta_6 v_i w_i + \beta_7 u_i v_i w_i + \varepsilon_i$
first-order interactions: $\beta_4 u_i v_i$, $\beta_5 u_i w_i$, $\beta_6 v_i w_i$
second-order interaction: $\beta_7 u_i v_i w_i$

y ~ u + v + w + u:v:w
$y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 w_i + \beta_4 u_i v_i w_i + \varepsilon_i$
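A sketch of the u*v form with Cars93 (the choice of predictors is only for illustration):

# u*v expands to u + v + u:v, i.e., both main effects plus their interaction
int.model <- lm(MPG.city ~ EngineSize * Weight, data = Cars93)
summary(int.model)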

Selecting the Best Regression Variables = Stepwise regression

http://www.utstat.toronto.edu/~brunner/oldclass/appliedf11/handouts/2101f11StepwiseLogisticR.pdf

full.model <- lm(y ~ x1 + x2 + x3 + x4)
reduced.model <- step(full.model, direction="backward")

dataset_multipleregression.csv

data <- read.csv("http://commres.net/wiki/_media/r/dataset_multipleregression.csv")
head(data)
  YEAR ROLL UNEM HGRAD  INC
1    1 5501  8.1  9552 1923
2    2 5945  7.0  9680 1961
3    3 6629  7.3  9731 1979
4    4 7556  7.5 11666 2030
5    5 8716  7.0 14675 2112
6    6 9369  6.4 15265 2192

ROLL = students' enrollment
HGRAD = number of high school graduates
UNEM = unemployment
INC = per capita income
r.model <- step(threePredictorModel, direction="backward")
Start:  AIC=381.16
ROLL ~ UNEM + HGRAD + INC

        Df Sum of Sq      RSS    AIC
<none              11237313 381.16
- UNEM   1   6522098 17759411 392.43
- HGRAD  1  12852039 24089352 401.27
- INC    1  33568255 44805568 419.27

Note: nothing has been discarded; the full three-predictor model (<none>) already has the lowest AIC.

e.g., from the textbook, without the data:

full.model <- lm(y ~ x1 + x2 + x3 + x4)
summary(full.model)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.405  -5.153   2.025   6.525  19.186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   10.274      4.573   2.247   0.0337 *
x1          -102.298     47.558  -2.151   0.0413 *
x2             4.362     40.237   0.108   0.9145  
x3            75.115     34.236   2.194   0.0377 *
x4            26.286     42.239   0.622   0.5394  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 12.19 on 25 degrees of freedom
Multiple R-squared: 0.8848,     Adjusted R-squared: 0.8664 
F-statistic:    48 on 4 and 25 DF,  p-value: 2.235e-11

We want to eliminate the insignificant variables, so we use step to incrementally eliminate the underperformers. The result is called the reduced model:

reduced.model <- step(full.model, direction="backward")
Start:  AIC=154.58
y ~ x1 + x2 + x3 + x4

       Df Sum of Sq    RSS    AIC
- x2    1      1.75 3718.5 152.60
- x4    1     57.58 3774.3 153.04
<none             3716.7 154.58
- x1    1    687.89 4404.6 157.68
- x3    1    715.67 4432.4 157.87

Step:  AIC=152.6
y ~ x1 + x3 + x4

       Df Sum of Sq    RSS    AIC
- x4    1     77.82 3796.3 151.22
<none             3718.5 152.60
- x3    1    737.39 4455.9 156.02
- x1    1    787.96 4506.4 156.36

Step:  AIC=151.22
y ~ x1 + x3

       Df Sum of Sq    RSS    AIC
<none             3796.3 151.22
- x1    1    884.44 4680.7 155.50
- x3    1    964.24 4760.5 156.01

The output from step shows the sequence of models that it explored. In this case, step removed x2 and x4 and left only x1 and x3 in the final (reduced) model. The summary of the reduced model shows that it contains only significant predictors:

summary(reduced.model)

Call:
lm(formula = y ~ x1 + x3)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.192  -2.677   2.049   7.453  19.704 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   10.267      4.437   2.314   0.0285 *
x1           -79.265     31.604  -2.508   0.0185 *
x3            82.726     31.590   2.619   0.0143 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 11.86 on 27 degrees of freedom
Multiple R-squared: 0.8823,     Adjusted R-squared: 0.8736 
F-statistic: 101.2 on 2 and 27 DF,  p-value: 2.842e-13

Or start with nothing

min.model <- lm(y ~ 1)
fwd.model <- step(min.model,
                  direction="forward",
                  scope=( ~ x1 + x2 + x3 + x4),
                  trace=0)
min.model <- lm(ROLL ~ 1, data)
fwd.model <- step(min.model, direction="forward", scope=(~ UNEM + HGRAD + INC),trace=0)
summary(fwd.model)

Call:
lm(formula = ROLL ~ INC + HGRAD + UNEM, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1148.84  -489.71    -1.88   387.40  1425.75 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.153e+03  1.053e+03  -8.691 5.02e-09 ***
INC          4.275e+00  4.947e-01   8.642 5.59e-09 ***
HGRAD        4.065e-01  7.602e-02   5.347 1.52e-05 ***
UNEM         4.501e+02  1.182e+02   3.809 0.000807 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared:  0.9621,	Adjusted R-squared:  0.9576 
F-statistic: 211.5 on 3 and 25 DF,  p-value: < 2.2e-16


Diagnosing a Linear Regression

m <- lm(y ~ x)
plot(m)
# the cars data set is built into R (package datasets); no extra package is needed
m <- lm(cars$dist ~ cars$speed)
plot(m)
  • The points in the Residuals vs Fitted plot are randomly scattered with no particular pattern.
  • The points in the Normal Q-Q plot are more or less on the line, indicating that the residuals follow a normal distribution.
  • In both the Scale-Location plot and the Residuals vs Leverage plot, the points are in a group with none too far from the center.
influence.measures(m)
Influence measures of
	 lm(formula = cars$dist ~ cars$speed) :

     dfb.1_ dfb.crs.    dffit cov.r   cook.d    hat inf
1   0.09440 -0.08625  0.09490 1.175 4.59e-03 0.1149   *
2   0.29242 -0.26716  0.29398 1.146 4.35e-02 0.1149   *
3  -0.10750  0.09369 -0.11040 1.116 6.20e-03 0.0715    
4   0.21898 -0.19085  0.22488 1.093 2.55e-02 0.0715    
5   0.03408 -0.02901  0.03554 1.109 6.45e-04 0.0600    
6  -0.11101  0.09174 -0.11852 1.085 7.13e-03 0.0499    
7  -0.04618  0.03669 -0.05110 1.085 1.33e-03 0.0413    
8   0.05248 -0.04170  0.05807 1.084 1.72e-03 0.0413    
9   0.15209 -0.12083  0.16827 1.058 1.43e-02 0.0413    
10 -0.09168  0.06895 -0.10716 1.065 5.82e-03 0.0341    
11  0.02446 -0.01840  0.02859 1.079 4.17e-04 0.0341    
12 -0.13849  0.09602 -0.17628 1.027 1.55e-02 0.0284    
13 -0.08467  0.05870 -0.10777 1.056 5.88e-03 0.0284    
14 -0.04929  0.03417 -0.06274 1.067 2.00e-03 0.0284    
15 -0.01413  0.00979 -0.01798 1.073 1.65e-04 0.0284    
16 -0.05330  0.03233 -0.07757 1.058 3.06e-03 0.0242    
17  0.00323 -0.00196  0.00470 1.069 1.13e-05 0.0242    
18  0.00323 -0.00196  0.00470 1.069 1.13e-05 0.0242    
19  0.08843 -0.05364  0.12870 1.039 8.34e-03 0.0242    
20 -0.06172  0.02871 -0.11111 1.041 6.23e-03 0.0214    
21 -0.00789  0.00367 -0.01419 1.065 1.03e-04 0.0214    
22  0.12329 -0.05734  0.22194 0.971 2.40e-02 0.0214    
23  0.24851 -0.11558  0.44734 0.747 8.56e-02 0.0214   *
24 -0.08002  0.01551 -0.20360 0.979 2.03e-02 0.0201    
25 -0.05700  0.01105 -0.14504 1.019 1.05e-02 0.0201    
26  0.04643 -0.00900  0.11812 1.034 7.02e-03 0.0201    
27 -0.02664 -0.01432 -0.12571 1.031 7.94e-03 0.0203    
28 -0.01059 -0.00569 -0.04998 1.059 1.27e-03 0.0203    
29 -0.00528 -0.04978 -0.17031 1.010 1.44e-02 0.0219    
30 -0.00281 -0.02647 -0.09054 1.050 4.15e-03 0.0219    
31  0.00022  0.00207  0.00708 1.066 2.56e-05 0.0219    
32  0.01561 -0.05223 -0.11741 1.046 6.96e-03 0.0249    
33 -0.00387  0.01296  0.02914 1.068 4.33e-04 0.0249    
34 -0.03235  0.10823  0.24330 0.972 2.88e-02 0.0249    
35 -0.04462  0.14928  0.33557 0.894 5.26e-02 0.0249    
36  0.06663 -0.13914 -0.24553 0.989 2.95e-02 0.0295    
37  0.03458 -0.07221 -0.12744 1.051 8.20e-03 0.0295    
38 -0.03372  0.07042  0.12428 1.052 7.80e-03 0.0295    
39  0.14564 -0.25086 -0.38002 0.921 6.81e-02 0.0354    
40  0.06340 -0.10920 -0.16542 1.048 1.38e-02 0.0354    
41  0.04382 -0.07547 -0.11432 1.065 6.62e-03 0.0354    
42  0.02443 -0.04207 -0.06373 1.076 2.07e-03 0.0354    
43 -0.01411  0.02431  0.03682 1.080 6.92e-04 0.0354    
44  0.02456 -0.03551 -0.04533 1.098 1.05e-03 0.0518    
45  0.19602 -0.27032 -0.32823 1.039 5.32e-02 0.0622    
46  0.08260 -0.11000 -0.12877 1.116 8.43e-03 0.0740    
47 -0.18634  0.24815  0.29050 1.077 4.21e-02 0.0740    
48 -0.19890  0.26488  0.31008 1.071 4.79e-02 0.0740    
49 -0.57747  0.76902  0.90027 0.762 3.40e-01 0.0740   *
50 -0.06025  0.07812  0.08898 1.139 4.04e-03 0.0873   *

When a statistician says that an observation is influential, it means that removing the observation would significantly change the fitted regression model. We want to identify those observations because they might be outliers that distort our model; we owe it to ourselves to investigate them.
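To list only the flagged cases (the rows marked with an asterisk above), use the summary method on the influence.measures object. A minimal sketch:

im <- influence.measures(m)   # compute the influence diagnostics
summary(im)                   # print only the potentially influential cases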

Testing Residuals for Autocorrelation (Durbin-Watson Test)

library(lmtest)   # dwtest() is in the lmtest package
dwtest(m)

	Durbin-Watson test

data:  m
DW = 1.6762, p-value = 0.09522
alternative hypothesis: true autocorrelation is greater than 0

Autocorrelation in the residuals is a scourge because it distorts the regression statistics, such as the F statistic and the t statistics for the regression coefficients. The presence of autocorrelation suggests that your model is missing a useful predictor variable or that it should include a time series component, such as a trend or a seasonal indicator.

1) beta weights, beta values