====== Simple Linear Regression ======
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
where $\beta_0$ and $\beta_1$ are the regression coefficients and the $\varepsilon_i$ are the error terms.
library(MASS)
attach(Cars93)
plot(MPG.city~EngineSize)
mod <- lm(MPG.city ~ EngineSize)
mod
Call:
lm(formula = MPG.city ~ EngineSize)
Coefficients:
(Intercept) EngineSize
32.627 -3.846
summary(mod)
Call:
lm(formula = MPG.city ~ EngineSize)
Residuals:
Min 1Q Median 3Q Max
-10.6264 -2.7032 -0.5491 2.1428 17.2197
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.6267 1.1439 28.523 < 2e-16 ***
EngineSize -3.8464 0.3999 -9.618 1.61e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.979 on 91 degrees of freedom
Multiple R-squared: 0.5041, Adjusted R-squared: 0.4987
F-statistic: 92.51 on 1 and 91 DF, p-value: 1.606e-15
Is the model statistically significant?
* Check the **F statistic** at the bottom of the summary.
* The F statistic tests whether the model as a whole explains more than a model with no predictors; a small p-value means the model is statistically significant.
Are the coefficients significant?
* Check the coefficient’s **t statistics and p-values** in the summary.
Is the model useful?
* Check the **R2** near the bottom of the summary.
* R2 is a measure of the model’s quality. Bigger is better. Mathematically, it is the fraction of the variance of y that is explained by the regression model.
Does the model fit the data well?
* **Plot the residuals** and check the regression diagnostics (see the sketch after this list).
* see [[https://drsimonj.svbtle.com/visualising-residuals|visualizing residuals]]
Does the data satisfy the assumptions behind linear regression?
* Check whether the diagnostics confirm that a linear model is reasonable for your data.
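For a quick look at these diagnostics for the model fitted above, plot.lm() draws the four standard diagnostic plots (a minimal sketch):
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2x2 grid
plot(mod)              # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1))   # restore the default single-plot layout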
====== E.g. 1 ======
{{:r:dataset_simpleRegression.csv}}
YEAR ROLL UNEM HGRAD INC
1 1 5501 8.1 9552 1923
2 2 5945 7.0 9680 1961
3 3 6629 7.3 9731 1979
4 4 7556 7.5 11666 2030
5 5 8716 7.0 14675 2112
6 6 9369 6.4 15265 2192
7 7 9920 6.5 15484 2235
8 8 10167 6.4 15723 2351
9 9 11084 6.3 16501 2411
10 10 12504 7.7 16890 2475
11 11 13746 8.2 17203 2524
12 12 13656 7.5 17707 2674
13 13 13850 7.4 18108 2833
14 14 14145 8.2 18266 2863
15 15 14888 10.1 19308 2839
16 16 14991 9.2 18224 2898
17 17 14836 7.7 18997 3123
18 18 14478 5.7 19505 3195
19 19 14539 6.5 19800 3239
20 20 14395 7.5 19546 3129
21 21 14599 7.3 19117 3100
22 22 14969 9.2 18774 3008
23 23 15107 10.1 17813 2983
24 24 14831 7.5 17304 3069
25 25 15081 8.8 16756 3151
26 26 15127 9.1 16749 3127
27 27 15856 8.8 16925 3179
28 28 15938 7.8 17231 3207
29 29 16081 7.0 16816 3345
datavar <- read.csv("http://commres.net/wiki/_media/r/dataset_simpleregression.csv")
# ROLL = students' enrollment
# HGRAD = number of high school graduates
# UNEM = unemployment
# INC = per capita income
linearModelVar <- lm(ROLL ~ UNEM, datavar)
linearModelVar
Call:
lm(formula = ROLL ~ UNEM, data = datavar)
Coefficients:
(Intercept) UNEM
3957 1134
From the above, //Enrollment// = 3957 + 1134 * //Unemployment Rate//
If the unemployment rate were 9%, the predicted enrollment would be
3957 + 1134 * 9
[1] 14163
that is, about 14,163 students.
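The same prediction comes straight from predict() (base R), up to the rounding of the coefficients above:
predict(linearModelVar, newdata = data.frame(UNEM = 9))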
Summarizing the model
summary(linearModelVar)
Call:
lm(formula = ROLL ~ UNEM, data = datavar)
Residuals:
Min 1Q Median 3Q Max
-7640.0 -1046.5 602.8 1934.3 4187.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3957.0 4000.1 0.989 0.3313
UNEM 1133.8 513.1 2.210 0.0358 *
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3049 on 27 degrees of freedom
Multiple R-squared: 0.1531, Adjusted R-squared: 0.1218
F-statistic: 4.883 on 1 and 27 DF, p-value: 0.03579
You should be able to explain what the terms below mean:
Multiple R-squared (R2): 0.1531
F-statistic: 4.883 (1, 27), p-value: 0.03579
t-statistic for UNEM: 2.210, p-value: 0.0358 *
What about the beta coefficients?
In R we demonstrate the use of the lm.beta() function in the QuantPsyc package (due to Thomas D. Fletcher of State Farm). The function is short and sweet, and takes a linear model object as its argument:
> lm.beta(mod)
EngineSize
-0.7100032
> cor(MPG.city,EngineSize)
[1] -0.7100032
>
====== Multiple Regression ======
Useful functions for extracting regression output from a model object m:
| anova(m) | ANOVA table |
| coefficients(m) = coef(m) | Model coefficients |
| confint(m) | Confidence intervals for the regression coefficients |
| deviance(m) | Residual sum of squares |
| effects(m) | Vector of orthogonal effects |
| fitted(m) | Vector of fitted y values |
| residuals(m) = resid(m) | Model residuals (the vector of estimated error terms) |
| summary(m) | Key statistics, such as R2, the F statistic, and the residual standard error (σ) \\ The residual standard error is the sample estimate of the standard deviation of ε |
| vcov(m) | Variance–covariance matrix of the main parameters |
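For instance, a few of these accessors applied to the simple model mod fitted earlier:
confint(mod)     # confidence intervals for (Intercept) and the EngineSize slope
deviance(mod)    # residual sum of squares
vcov(mod)        # variance-covariance matrix of the coefficient estimates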
lm.model <- lm(Cars93$MPG.city ~ Cars93$EngineSize + Cars93$Price)
summary(lm.model)
Call:
lm(formula = Cars93$MPG.city ~ Cars93$EngineSize + Cars93$Price)
Residuals:
Min 1Q Median 3Q Max
-7.4511 -2.0637 -0.5156 1.6239 16.9372
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.34647 1.12251 29.707 < 2e-16 ***
Cars93$EngineSize -2.98885 0.47808 -6.252 1.33e-08 ***
Cars93$Price -0.15415 0.05134 -3.002 0.00347 **
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.815 on 90 degrees of freedom
Multiple R-squared: 0.5492, Adjusted R-squared: 0.5392
F-statistic: 54.83 on 2 and 90 DF, p-value: 2.674e-16
Questions:
* What is R2?
* How many cars are involved in this test? (cf. df = 90)
* df (90) + number of estimated coefficients (intercept + 2 slopes = 3) = 93 cases
* check 'str(Cars93)'
* If I eliminate the R2 from the above output, can you still identify what it is?
To answer the last question, use the regression output table above:
$R^2 = SS_{reg} / SS_{total} = 1 - SS_{res} / SS_{total}$
> anova(lm.model)
Analysis of Variance Table
Response: Cars93$MPG.city
Df Sum Sq Mean Sq F value Pr(>F)
Cars93$EngineSize 1 1465 1465 100.65 2.4e-16 ***
Cars93$Price 1 131 131 9.01 0.0035 **
Residuals 90 1310 15
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> sstotal = 1465+131+1310
> ssreg <- 1465+131
> ssreg/sstotal
[1] 0.54921
>
> # or
> 1-(deviance(lm.model)/sstotal)
[1] 0.54932
Regression formula:
* $\hat{Y} = -2.99 \times \text{EngineSize} - 0.15 \times \text{Price} + 33.35$
* $\hat{Y} = \widehat{\text{MPG.city}}$
In the meantime, compare the beta coefficients with the pairwise correlations:
> lm.beta(lm.model)
Cars93$EngineSize Cars93$Price
-0.5517121 -0.2649553
> cor(MPG.city,EngineSize)
[1] -0.7100032
> cor(EngineSize,Price)
[1] 0.5974254
> cor(MPG.city,Price)
[1] -0.5945622
>
Or . . . .
> temp <- subset(Cars93, select=c(MPG.city,EngineSize,Price))
> temp
. . . .
> cor(temp)
MPG.city EngineSize Price
MPG.city 1.0000000 -0.7100032 -0.5945622
EngineSize -0.7100032 1.0000000 0.5974254
Price -0.5945622 0.5974254 1.0000000
>
Beta coefficients are not equal to the correlations among the variables; they coincide only in simple regression with a single predictor (as seen with EngineSize above).
plot(lm.model$residuals)
abline(h=0, col="red")
anova(lm.model)
Analysis of Variance Table
Response: Cars93$MPG.city
Df Sum Sq Mean Sq F value Pr(>F)
Cars93$EngineSize 1 1464.71 1464.71 100.653 2.444e-16 ***
Cars93$Price 1 131.17 131.17 9.014 0.003468 **
Residuals 90 1309.69 14.55
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Why use anova with lm output (lm.model in this case)?
coef(lm.model)
(Intercept) Cars93$EngineSize Cars93$Price
33.3464734 -2.9888458 -0.1541498
residuals(lm.model)
1 2 3 4 5
-0.51556927 -0.55648889 -0.49194621 -0.16625801 3.73898067
6 7 8 9 10
-2.35086090 0.21745634 3.34329779 1.06528019 2.64786884
11 12 13 14 15
2.58362397 0.29459458 -0.01370501 -1.85673580 -3.32003095
16 17 18 19 20
-1.47621772 -2.93554987 1.49577174 6.54763980 -0.64692607
21 22 23 24 25
1.05563073 1.06413662 1.55497338 -2.02911999 -1.82416666
26 27 28 29 30
-4.45108994 -2.46962214 -2.40287136 2.01742275 0.08957791
31 32 33 34 35
2.67973459 -3.40963806 -2.73023540 -2.02114636 -1.21068471
36 37 38 39 40
-6.31235513 -0.26611019 1.62394797 16.93723064 3.36255227
41 42 43 44 45
0.58003782 15.00200777 -0.07339128 1.36999363 -4.42505304
46 47 48 49 50
-1.32170679 -5.22609969 4.48710776 -2.06374182 -0.95386332
51 52 53 54 55
0.29847852 3.96702480 1.71512315 1.82158662 2.66911267
56 57 58 59 60
-3.43567496 -7.45110565 -1.55474970 4.75970527 -3.39080806
61 62 63 64 65
-0.69202743 1.72453815 -2.35662642 2.25464742 0.24690826
66 67 68 69 70
-4.43567496 -0.06571546 -0.39110586 -1.25837103 -0.98293839
71 72 73 74 75
0.20204136 -2.74679396 3.82302800 -2.65771911 -1.45594634
76 77 78 79 80
-1.33262651 0.77239559 -2.64579820 2.04339631 4.53499980
81 82 83 84 85
-1.28631823 -0.76509170 10.86471434 4.64746325 1.06534353
86 87 88 89 90
-1.96548643 -4.67404320 -1.56378785 -5.83760799 -3.28578597
91 92 93
-3.38601500 -1.97292778 -2.05744404
deviance(lm.model)
[1] 1309.686
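For an lm object, deviance() is simply the residual sum of squares, as a quick check confirms:
sum(residuals(lm.model)^2)   # same value as deviance(lm.model)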
===== Ex 1 =====
In the Cars93 data above, how strongly is city mileage (MPG.city) affected by EngineSize, Horsepower, and Turn.circle?
* Turn.circle is the car's turning radius
* EngineSize is the size of the engine
* Horsepower is the engine's rated horsepower
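A possible starting point (a sketch, not a worked answer):
ex1.model <- lm(MPG.city ~ EngineSize + Horsepower + Turn.circle, data = Cars93)
summary(ex1.model)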
===== Ex 2 =====
{{dataset_multipleRegression.csv}}
data <- read.csv("http://commres.net/wiki/_media/r/dataset_multipleregression.csv")
head(data)
YEAR ROLL UNEM HGRAD INC
1 1 5501 8.1 9552 1923
2 2 5945 7.0 9680 1961
3 3 6629 7.3 9731 1979
4 4 7556 7.5 11666 2030
5 5 8716 7.0 14675 2112
6 6 9369 6.4 15265 2192
ROLL = students' enrollment
HGRAD = number of high school graduates
UNEM = unemployment
INC = per capita income
onePredictorModel <- lm(ROLL ~ UNEM, datavar)
onePredictorModel
Call:
lm(formula = ROLL ~ UNEM, data = datavar)
Coefficients:
(Intercept) UNEM
3957 1134
summary(onePredictorModel)
Call:
lm(formula = ROLL ~ UNEM, data = datavar)
Residuals:
Min 1Q Median 3Q Max
-7640.0 -1046.5 602.8 1934.3 4187.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3957.0 4000.1 0.989 0.3313
UNEM 1133.8 513.1 2.210 0.0358 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3049 on 27 degrees of freedom
Multiple R-squared: 0.1531, Adjusted R-squared: 0.1218
F-statistic: 4.883 on 1 and 27 DF, p-value: 0.03579
Predict the fall enrollment (ROLL) using:
* the unemployment rate (UNEM) and
* number of spring high school graduates (HGRAD)
twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, data)
twoPredictorModel
Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = data)
Coefficients:
(Intercept) UNEM HGRAD
-8255.7511 698.2681 0.9423
summary(twoPredictorModel)
Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = data)
Residuals:
Min 1Q Median 3Q Max
-2102.2 -861.6 -349.4 374.5 3603.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.256e+03 2.052e+03 -4.023 0.00044 ***
UNEM 6.983e+02 2.244e+02 3.111 0.00449 **
HGRAD 9.423e-01 8.613e-02 10.941 3.16e-11 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared: 0.8489, Adjusted R-squared: 0.8373
F-statistic: 73.03 on 2 and 26 DF, p-value: 2.144e-11
y = -8255.7511 + 698.2681*UNEM + 0.9423*HGRAD
Q: what is the expected fall enrollment (ROLL) given this year's unemployment rate (UNEM) of 9% and spring high school graduating class (HGRAD) of 100,000?
UNEM <- 9
HGRAD <- 100000
-8255.7511 + 698.2681*UNEM + 0.9423*HGRAD
[1] 92258.66
A: the predicted fall enrollment, given a 9% unemployment rate and 100,000 student spring high school graduating class, is
**92258** students.
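The same answer via predict(), without copying coefficients by hand:
predict(twoPredictorModel, newdata = data.frame(UNEM = 9, HGRAD = 100000))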
Now the relationship between enrollment and unemployment, high school graduates, and income together. That is:
* dv = ROLL (Enrollment)
* iv = UNEM, HGRAD, INC
threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC, data)
threePredictorModel
Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = data)
Coefficients:
(Intercept) UNEM HGRAD INC
-9153.2545 450.1245 0.4065 4.2749
summary(threePredictorModel)
Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = data)
Residuals:
Min 1Q Median 3Q Max
-1148.84 -489.71 -1.88 387.40 1425.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.153e+03 1.053e+03 -8.691 5.02e-09 ***
UNEM 4.501e+02 1.182e+02 3.809 0.000807 ***
HGRAD 4.065e-01 7.602e-02 5.347 1.52e-05 ***
INC 4.275e+00 4.947e-01 8.642 5.59e-09 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared: 0.9621, Adjusted R-squared: 0.9576
F-statistic: 211.5 on 3 and 25 DF, p-value: < 2.2e-16
How to get **beta coefficients**((beta weights, beta values)) for predictor variables?
* Use the scale() function to standardize the variables (y as well as the x's), then rerun the regression (see the sketch after the lm.beta() output below)
* Use the lm.beta() function in the QuantPsyc package
* In this case:
install.packages("QuantPsyc")
library(QuantPsyc)
lm.beta(threePredictorModel)
UNEM HGRAD INC
0.1553619 0.3656177 0.6061762
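The scale() route gives the same beta weights, because the slopes of a regression on standardized variables are the standardized coefficients (a minimal sketch):
zdata <- data.frame(scale(data[, c("ROLL", "UNEM", "HGRAD", "INC")]))
zmodel <- lm(ROLL ~ UNEM + HGRAD + INC, data = zdata)
coef(zmodel)   # slopes match lm.beta(threePredictorModel); the intercept is ~0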
How to compare the models (with incrementally added IVs):
anova(onePredictorModel, twoPredictorModel, threePredictorModel)
Analysis of Variance Table
Model 1: ROLL ~ UNEM
Model 2: ROLL ~ UNEM + HGRAD
Model 3: ROLL ~ UNEM + HGRAD + INC
Res.Df RSS Df Sum of Sq F Pr(>F)
1 27 251084710
2 26 44805568 1 206279143 458.92 < 2.2e-16 ***
3 25 11237313 1 33568255 74.68 5.594e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
===== e.g. 3 =====
{{hierarchicalregressiondata.csv}}
# Import data (simulated data for this example)
myData <- read.csv('http://static.lib.virginia.edu/statlab/materials/data/hierarchicalRegressionData.csv')
# or
# myData <- read.csv('http://commres.net/wiki/_media/r/hierarchicalregressiondata.csv')
# Build models to compare whether adding variables is worthwhile.
m0 <- lm(happiness ~ 1, data=myData) # to obtain Total SS
m1 <- lm(happiness ~ age + gender, data=myData) # Model 1
m2 <- lm(happiness ~ age + gender + friends, data=myData) # Model 2
m3 <- lm(happiness ~ age + gender + friends + pets, data=myData) # Model 3
anova(m0) # to obtain Total SS
Analysis of Variance Table
Response: happiness
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 99 240.84 2.4327
summary(m1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.66778 2.01364 3.808 0.000246 ***
age -0.13039 0.07936 -1.643 0.103611
genderMale 0.16430 0.31938 0.514 0.608106
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.553 on 97 degrees of freedom
Multiple R-squared: 0.02855, Adjusted R-squared: 0.008515
F-statistic: 1.425 on 2 and 97 DF, p-value: 0.2455
summary(m2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.21730 1.96220 3.169 0.00206 **
age -0.12479 0.07546 -1.654 0.10146
genderMale 0.14931 0.30365 0.492 0.62405
friends 0.18985 0.05640 3.366 0.00110 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.476 on 96 degrees of freedom
Multiple R-squared: 0.1311, Adjusted R-squared: 0.1039
F-statistic: 4.828 on 3 and 96 DF, p-value: 0.003573
summary(m3)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.78540 1.90266 3.041 0.00305 **
age -0.11146 0.07309 -1.525 0.13057
genderMale -0.14267 0.31157 -0.458 0.64806
friends 0.17134 0.05491 3.120 0.00239 **
pets 0.36391 0.13044 2.790 0.00637 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.427 on 95 degrees of freedom
Multiple R-squared: 0.1969, Adjusted R-squared: 0.1631
F-statistic: 5.822 on 4 and 95 DF, p-value: 0.0003105
> lm.beta(m3)
age genderMale friends pets
-0.14098154 -0.04484095 0.28909280 0.27446786
anova(m0,m1,m2,m3)
Analysis of Variance Table
Model 1: happiness ~ 1
Model 2: happiness ~ age + gender
Model 3: happiness ~ age + gender + friends
Model 4: happiness ~ age + gender + friends + pets
Res.Df RSS Df Sum of Sq F Pr(>F)
1 99 240.84
2 97 233.97 2 6.8748 1.6883 0.1903349
3 96 209.27 1 24.6957 12.1293 0.0007521 ***
4 95 193.42 1 15.8461 7.7828 0.0063739 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
* Model 0: SSTotal= 240.84 (no predictors)
* Model 1: SSResidual= 233.97 (after adding age and gender)
* Model 2: SSResidual= 209.27,
* SSDifference= 233.97 - 209.27 = 24.696,
* F(1,96) = 12.1293, p value = 0.0007521 (after adding friends)
* Model 3: SSResidual= 193.42,
* SSDifference= 209.27 - 193.42 = 15.846,
* F(1,95) = 7.7828, p value = 0.0063739 (after adding pets)
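Equivalently, in R2 terms, each step adds the following increment of explained variance (numbers taken from the summaries above):
summary(m1)$r.squared                           # 0.02855 (age and gender)
summary(m2)$r.squared - summary(m1)$r.squared   # about 0.103 (adding friends)
summary(m3)$r.squared - summary(m2)$r.squared   # about 0.066 (adding pets)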
{{https://data.library.virginia.edu/files/Park.png}}
Source: Park, N., Kee, K. F., & Valenzuela, S. (2009). Being immersed in social networking environment: Facebook groups, uses and gratifications, and social outcomes. CyberPsychology & Behavior, 12, 729-733.
{{https://data.library.virginia.edu/files/Lankau-950x1024.png}}
Source: Lankau, M. J., & Scandura, T. A. (2002). An investigation of personal learning in mentoring relationships: Content, antecedents, and consequences. Academy of Management Journal, 45, 779-790.
see also http://rtutorialseries.blogspot.kr/2010/01/r-tutorial-series-basic-hierarchical.html
===== e.g. 4 =====
{{:r:crime_simple_1_.txt}}
# data import
crime <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/crime_simple.txt",
sep = "\t", header = TRUE)
str(crime)
* R: Crime rate: # of offenses reported to police per million population
* Age: The number of males of age 14-24 per 1000 population
* S: Indicator variable for Southern states (0 = No, 1 = Yes)
* Ed: Mean # of years of schooling x 10 for persons of age 25 or older
* Ex0: 1960 per capita expenditure on police by state and local government
* Ex1: 1959 per capita expenditure on police by state and local government
* LF: Labor force participation rate per 1000 civilian urban males age 14-24
* M: The number of males per 1000 females
* N: State population size in hundred thousands
* NW: The number of non-whites per 1000 population
* U1: Unemployment rate of urban males per 1000 of age 14-24
* U2: Unemployment rate of urban males per 1000 of age 35-39
* W: Median value of transferable goods and assets or family income in tens of dollar
* X: The number of families per 1000 earning below 1/2 the median income
# Assign more meaningful variable names
colnames(crime) <- c("crime.per.million", "young.males", "is.south", "average.ed",
"exp.per.cap.1960", "exp.per.cap.1959", "labour.part",
"male.per.fem", "population", "nonwhite",
"unemp.youth", "unemp.adult", "median.assets", "num.low.salary")
# Convert is.south to a factor
# Divide average.ed by 10 so that the variable is actually average education
# Convert median assets to 1000's of dollars instead of 10's
crime <- transform(crime, is.south = as.factor(is.south),
average.ed = average.ed / 10,
median.assets = median.assets / 100)
# Fit model
# regress crime.per.million (the DV) on all the other variables (the IVs)
crime.lm <- lm(crime.per.million ~ ., data = crime)
plot(crime)   # scatterplot matrix of all pairs of variables
The scatterplot matrix shows a strong linear relationship (correlation) between the 1959 and 1960 police-expenditure variables, and likewise between unemp.youth and unemp.adult. Because each pair is nearly redundant, we drop one variable from each pair.
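The suspected redundancy can be confirmed numerically (a sketch; both correlations should come out strongly positive):
with(crime, cor(exp.per.cap.1960, exp.per.cap.1959))   # near-collinear expenditure measures
with(crime, cor(unemp.youth, unemp.adult))             # strongly related unemployment measures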
# Remove 1959 expenditure and youth unemployment
crime.lm2 <- update(crime.lm, . ~ . - exp.per.cap.1959 - unemp.youth)
summary(crime.lm2)
Call:
lm(formula = crime.per.million ~ young.males + is.south + average.ed +
exp.per.cap.1960 + labour.part + male.per.fem + population +
nonwhite + unemp.adult + median.assets + num.low.salary,
data = crime)
Residuals:
Min 1Q Median 3Q Max
-35.82 -11.57 -1.51 10.63 55.02
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -633.438828 145.470340 -4.354 0.000111 ***
young.males 1.126883 0.418791 2.691 0.010853 *
is.south1 -0.556600 13.883248 -0.040 0.968248
average.ed 15.328028 6.202516 2.471 0.018476 *
exp.per.cap.1960 1.138299 0.226977 5.015 0.0000153 ***
labour.part 0.068716 0.133540 0.515 0.610087
male.per.fem 0.003021 0.173041 0.017 0.986172
population -0.064477 0.128278 -0.503 0.618367
nonwhite -0.013794 0.061901 -0.223 0.824960
unemp.adult 0.931498 0.541803 1.719 0.094402 .
median.assets 15.158975 10.524458 1.440 0.158653
num.low.salary 0.825936 0.234189 3.527 0.001197 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.98 on 35 degrees of freedom
Multiple R-squared: 0.7543, Adjusted R-squared: 0.6771
F-statistic: 9.769 on 11 and 35 DF, p-value: 0.00000009378
summary(crime.lm)
Call:
lm(formula = crime.per.million ~ ., data = crime)
Residuals:
Min 1Q Median 3Q Max
-34.884 -11.923 -1.135 13.495 50.560
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -691.837588 155.887918 -4.438 0.0000956 ***
young.males 1.039810 0.422708 2.460 0.01931 *
is.south1 -8.308313 14.911588 -0.557 0.58117
average.ed 18.016011 6.496504 2.773 0.00906 **
exp.per.cap.1960 1.607818 1.058667 1.519 0.13836
exp.per.cap.1959 -0.667258 1.148773 -0.581 0.56529
labour.part -0.041031 0.153477 -0.267 0.79087
male.per.fem 0.164795 0.209932 0.785 0.43806
population -0.041277 0.129516 -0.319 0.75196
nonwhite 0.007175 0.063867 0.112 0.91124
unemp.youth -0.601675 0.437154 -1.376 0.17798
unemp.adult 1.792263 0.856111 2.093 0.04407 *
median.assets 13.735847 10.583028 1.298 0.20332
num.low.salary 0.792933 0.235085 3.373 0.00191 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.94 on 33 degrees of freedom
Multiple R-squared: 0.7692, Adjusted R-squared: 0.6783
F-statistic: 8.462 on 13 and 33 DF, p-value: 0.0000003686
See the stepwise regression explanation below.
reduced.model <- step(crime.lm2, direction="backward")
Start: AIC=300.61
crime.per.million ~ young.males + is.south + average.ed + exp.per.cap.1960 +
labour.part + male.per.fem + population + nonwhite + unemp.adult +
median.assets + num.low.salary
Df Sum of Sq RSS AIC
- male.per.fem 1 0.1 16906 298.61
- is.south 1 0.8 16907 298.61
- nonwhite 1 24.0 16930 298.67
- population 1 122.0 17028 298.95
- labour.part 1 127.9 17034 298.96
summary(reduced.model)
====== Performing Linear Regression Without an Intercept ======
lm(y ~ x + 0)
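For example, with the Cars93 data used earlier (an illustrative sketch; y ~ x - 1 is an equivalent spelling):
m0 <- lm(MPG.city ~ EngineSize + 0, data = Cars93)   # model forced through the origin
coef(m0)   # a single EngineSize slope, no (Intercept) term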
====== Performing Linear Regression with Interaction Terms ======
This is a very tricky part. See [[:Interaction effects in regression analysis]]
see also https://cran.r-project.org/web/packages/jtools/vignettes/interactions.html
lm(y ~ u*v)
$y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 u_i v_i + \varepsilon_i$
$\beta_3 u_i v_i$ = interaction term
y ~ u*v*w
$y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 w_i + \beta_4 u_i v_i + \beta_5 u_i w_i + \beta_6 v_i w_i + \beta_7 u_i v_i w_i + \varepsilon_i$
first-order interactions: $\beta_4 u_i v_i$, $\beta_5 u_i w_i$, $\beta_6 v_i w_i$
second-order interaction: $\beta_7 u_i v_i w_i$
y ~ u + v + w + u:v:w
$y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 w_i + \beta_4 u_i v_i w_i + \varepsilon_i$
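A quick illustration with the Cars93 data loaded earlier (the variable choice here is ours, purely for illustration):
m.int <- lm(MPG.city ~ EngineSize * Price, data = Cars93)
summary(m.int)   # the EngineSize:Price row is the interaction coefficient (beta3)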
====== Selecting the Best Regression Variables = Stepwise regression ======
http://www.utstat.toronto.edu/~brunner/oldclass/appliedf11/handouts/2101f11StepwiseLogisticR.pdf
full.model <- lm(y ~ x1 + x2 + x3 + x4)
reduced.model <- step(full.model, direction="backward")
Applying backward elimination to the enrollment data and the threePredictorModel from Ex 2 above:
r.model <- step(threePredictorModel, direction="backward")
Start: AIC=381.16
ROLL ~ UNEM + HGRAD + INC
Df Sum of Sq RSS AIC
Another example, with four generic predictors:
full.model <- lm(y ~ x1 + x2 + x3 + x4)
summary(full.model)
Call:
lm(formula = y ~ x1 + x2 + x3 + x4)
Residuals:
Min 1Q Median 3Q Max
-34.405 -5.153 2.025 6.525 19.186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.274 4.573 2.247 0.0337 *
x1 -102.298 47.558 -2.151 0.0413 *
x2 4.362 40.237 0.108 0.9145
x3 75.115 34.236 2.194 0.0377 *
x4 26.286 42.239 0.622 0.5394
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.19 on 25 degrees of freedom
Multiple R-squared: 0.8848, Adjusted R-squared: 0.8664
F-statistic: 48 on 4 and 25 DF, p-value: 2.235e-11
We want to eliminate the insignificant variables, so we use step to incrementally eliminate the underperformers. The result is called the reduced model:
reduced.model <- step(full.model, direction="backward")
Start: AIC=154.58
y ~ x1 + x2 + x3 + x4
Df Sum of Sq RSS AIC
- x2 1 1.75 3718.5 152.60
- x4 1 57.58 3774.3 153.04
The output from step shows the sequence of models that it explored. In this case, step removed x2 and x4 and left only x1 and x3 in the final (reduced) model. The summary of the reduced model shows that it contains only significant predictors:
summary(reduced.model)
Call:
lm(formula = y ~ x1 + x3)
Residuals:
Min 1Q Median 3Q Max
-36.192 -2.677 2.049 7.453 19.704
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.267 4.437 2.314 0.0285 *
x1 -79.265 31.604 -2.508 0.0185 *
x3 82.726 31.590 2.619 0.0143 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.86 on 27 degrees of freedom
Multiple R-squared: 0.8823, Adjusted R-squared: 0.8736
F-statistic: 101.2 on 2 and 27 DF, p-value: 2.842e-13
Or start with nothing and add variables (forward selection):
min.model <- lm(y ~ 1)
fwd.model <- step(min.model,
                  direction="forward",
                  scope=( ~ x1 + x2 + x3 + x4),
                  trace=0)
min.model <- lm(ROLL ~ 1, data)
fwd.model <- step(min.model, direction="forward", scope=(~ UNEM + HGRAD + INC),trace=0)
summary(fwd.model)
Call:
lm(formula = ROLL ~ INC + HGRAD + UNEM, data = data)
Residuals:
Min 1Q Median 3Q Max
-1148.84 -489.71 -1.88 387.40 1425.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.153e+03 1.053e+03 -8.691 5.02e-09 ***
INC 4.275e+00 4.947e-01 8.642 5.59e-09 ***
HGRAD 4.065e-01 7.602e-02 5.347 1.52e-05 ***
UNEM 4.501e+02 1.182e+02 3.809 0.000807 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared: 0.9621, Adjusted R-squared: 0.9576
F-statistic: 211.5 on 3 and 25 DF, p-value: < 2.2e-16
>
====== Diagnosing a Linear Regression ======
m <- lm(y ~ x)
plot(m)
# cars is a built-in dataset (package datasets), so no library() call is needed
m <- lm(cars$dist ~ cars$speed)
plot(m)
* The points in the Residuals vs Fitted plot are randomly scattered with no particular pattern.
* The points in the Normal Q–Q plot are more-or-less on the line, indicating that the residuals follow a normal distribution.
* In both the Scale–Location plot and the Residuals vs Leverage plot, the points are in a group with none too far from the center.
influence.measures(m)
Influence measures of
lm(formula = cars$dist ~ cars$speed) :
dfb.1_ dfb.crs. dffit cov.r cook.d hat inf
1 0.09440 -0.08625 0.09490 1.175 4.59e-03 0.1149 *
2 0.29242 -0.26716 0.29398 1.146 4.35e-02 0.1149 *
3 -0.10750 0.09369 -0.11040 1.116 6.20e-03 0.0715
4 0.21898 -0.19085 0.22488 1.093 2.55e-02 0.0715
5 0.03408 -0.02901 0.03554 1.109 6.45e-04 0.0600
6 -0.11101 0.09174 -0.11852 1.085 7.13e-03 0.0499
7 -0.04618 0.03669 -0.05110 1.085 1.33e-03 0.0413
8 0.05248 -0.04170 0.05807 1.084 1.72e-03 0.0413
9 0.15209 -0.12083 0.16827 1.058 1.43e-02 0.0413
10 -0.09168 0.06895 -0.10716 1.065 5.82e-03 0.0341
11 0.02446 -0.01840 0.02859 1.079 4.17e-04 0.0341
12 -0.13849 0.09602 -0.17628 1.027 1.55e-02 0.0284
13 -0.08467 0.05870 -0.10777 1.056 5.88e-03 0.0284
14 -0.04929 0.03417 -0.06274 1.067 2.00e-03 0.0284
15 -0.01413 0.00979 -0.01798 1.073 1.65e-04 0.0284
16 -0.05330 0.03233 -0.07757 1.058 3.06e-03 0.0242
17 0.00323 -0.00196 0.00470 1.069 1.13e-05 0.0242
18 0.00323 -0.00196 0.00470 1.069 1.13e-05 0.0242
19 0.08843 -0.05364 0.12870 1.039 8.34e-03 0.0242
20 -0.06172 0.02871 -0.11111 1.041 6.23e-03 0.0214
21 -0.00789 0.00367 -0.01419 1.065 1.03e-04 0.0214
22 0.12329 -0.05734 0.22194 0.971 2.40e-02 0.0214
23 0.24851 -0.11558 0.44734 0.747 8.56e-02 0.0214 *
24 -0.08002 0.01551 -0.20360 0.979 2.03e-02 0.0201
25 -0.05700 0.01105 -0.14504 1.019 1.05e-02 0.0201
26 0.04643 -0.00900 0.11812 1.034 7.02e-03 0.0201
27 -0.02664 -0.01432 -0.12571 1.031 7.94e-03 0.0203
28 -0.01059 -0.00569 -0.04998 1.059 1.27e-03 0.0203
29 -0.00528 -0.04978 -0.17031 1.010 1.44e-02 0.0219
30 -0.00281 -0.02647 -0.09054 1.050 4.15e-03 0.0219
31 0.00022 0.00207 0.00708 1.066 2.56e-05 0.0219
32 0.01561 -0.05223 -0.11741 1.046 6.96e-03 0.0249
33 -0.00387 0.01296 0.02914 1.068 4.33e-04 0.0249
34 -0.03235 0.10823 0.24330 0.972 2.88e-02 0.0249
35 -0.04462 0.14928 0.33557 0.894 5.26e-02 0.0249
36 0.06663 -0.13914 -0.24553 0.989 2.95e-02 0.0295
37 0.03458 -0.07221 -0.12744 1.051 8.20e-03 0.0295
38 -0.03372 0.07042 0.12428 1.052 7.80e-03 0.0295
39 0.14564 -0.25086 -0.38002 0.921 6.81e-02 0.0354
40 0.06340 -0.10920 -0.16542 1.048 1.38e-02 0.0354
41 0.04382 -0.07547 -0.11432 1.065 6.62e-03 0.0354
42 0.02443 -0.04207 -0.06373 1.076 2.07e-03 0.0354
43 -0.01411 0.02431 0.03682 1.080 6.92e-04 0.0354
44 0.02456 -0.03551 -0.04533 1.098 1.05e-03 0.0518
45 0.19602 -0.27032 -0.32823 1.039 5.32e-02 0.0622
46 0.08260 -0.11000 -0.12877 1.116 8.43e-03 0.0740
47 -0.18634 0.24815 0.29050 1.077 4.21e-02 0.0740
48 -0.19890 0.26488 0.31008 1.071 4.79e-02 0.0740
49 -0.57747 0.76902 0.90027 0.762 3.40e-01 0.0740 *
50 -0.06025 0.07812 0.08898 1.139 4.04e-03 0.0873 *
When a statistician says that an observation is influential, it means that removing the observation would significantly change the fitted regression model. We want to identify those observations because they might be outliers that distort our model; we owe it to ourselves to investigate them.
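To list just the flagged observations instead of scanning the whole table, base R's summary() method for influence.measures() prints only the potentially influential rows (a quick sketch):
infl <- influence.measures(m)
summary(infl)                       # prints only the observations flagged in the inf column
which(apply(infl$is.inf, 1, any))   # row numbers of the flagged observations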
====== Testing Residuals for Autocorrelation (Durbin–Watson Test) ======
library(lmtest)   # dwtest() is provided by the lmtest package
dwtest(m)
Durbin-Watson test
data: m
DW = 1.6762, p-value = 0.09522
alternative hypothesis: true autocorrelation is greater than 0
Autocorrelation in the residuals is a scourge because it distorts the regression statistics, such as the F statistic and the t statistics for the regression coefficients. The presence of autocorrelation suggests that your model is missing a useful predictor variable or that it should include a time series component, such as a trend or a seasonal indicator.