Table of Contents
Simple Linear Regression
yi = β0 + β1xi + εi
where β0 and β1 are the regression coefficients and the εi are the error terms.
library(MASS)
attach(Cars93)
plot(MPG.city ~ EngineSize)
mod <- lm(MPG.city ~ EngineSize)
mod
Call:
lm(formula = MPG.city ~ EngineSize)

Coefficients:
(Intercept)   EngineSize  
     32.627       -3.846  

summary(mod)

Call:
lm(formula = MPG.city ~ EngineSize)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.6264  -2.7032  -0.5491   2.1428  17.2197 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  32.6267     1.1439  28.523  < 2e-16 ***
EngineSize   -3.8464     0.3999  -9.618 1.61e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.979 on 91 degrees of freedom
Multiple R-squared:  0.5041,	Adjusted R-squared:  0.4987 
F-statistic: 92.51 on 1 and 91 DF,  p-value: 1.606e-15
Is the model statistically significant?
- Check the F statistic at the bottom of the summary.
- The F statistic tests whether the model as a whole is statistically significant, i.e., whether the predictors jointly explain a significant share of the variance in y.
Are the coefficients significant?
- Check the coefficient’s t statistics and p-values in the summary.
Is the model useful?
- Check the R2 near the bottom of the summary.
- R2 is a measure of the model’s quality. Bigger is better. Mathematically, it is the fraction of the variance of y that is explained by the regression model.
Does the model fit the data well?
- Plot the residuals and check the regression diagnostics.
Does the data satisfy the assumptions behind linear regression?
- Check whether the diagnostics confirm that a linear model is reasonable for your data.
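All of these checks can be pulled out of the fitted model programmatically rather than read off the printed summary. A minimal sketch using the mod object fitted above (the values match the summary output):

```r
s <- summary(mod)
s$r.squared        # R-squared: fraction of the variance of y explained
s$adj.r.squared    # adjusted R-squared
s$fstatistic       # F statistic with its numerator and denominator df
s$coefficients     # estimates, standard errors, t values, p-values

# The overall model p-value is not stored directly; compute it from the F statistic:
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)
```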
E.g. 1
- dataset_simpleRegression.csv
   YEAR  ROLL UNEM HGRAD  INC
1     1  5501  8.1  9552 1923
2     2  5945  7.0  9680 1961
3     3  6629  7.3  9731 1979
4     4  7556  7.5 11666 2030
5     5  8716  7.0 14675 2112
6     6  9369  6.4 15265 2192
7     7  9920  6.5 15484 2235
8     8 10167  6.4 15723 2351
9     9 11084  6.3 16501 2411
10   10 12504  7.7 16890 2475
11   11 13746  8.2 17203 2524
12   12 13656  7.5 17707 2674
13   13 13850  7.4 18108 2833
14   14 14145  8.2 18266 2863
15   15 14888 10.1 19308 2839
16   16 14991  9.2 18224 2898
17   17 14836  7.7 18997 3123
18   18 14478  5.7 19505 3195
19   19 14539  6.5 19800 3239
20   20 14395  7.5 19546 3129
21   21 14599  7.3 19117 3100
22   22 14969  9.2 18774 3008
23   23 15107 10.1 17813 2983
24   24 14831  7.5 17304 3069
25   25 15081  8.8 16756 3151
26   26 15127  9.1 16749 3127
27   27 15856  8.8 16925 3179
28   28 15938  7.8 17231 3207
29   29 16081  7.0 16816 3345
datavar <- read.csv("http://commres.net/wiki/_media/r/dataset_simpleregression.csv")
# ROLL = students' enrollment
# HGRAD = number of high school graduates
# UNEM = unemployment
# INC = per capita income
linearModelVar <- lm(ROLL ~ UNEM, datavar)
linearModelVar

Call:
lm(formula = ROLL ~ UNEM, data = datavar)

Coefficients:
(Intercept)         UNEM  
       3957         1134  
From the above: Enrollment = 3957 + 1134 * Unemployment Rate
If the unemployment rate is 9%, the predicted enrollment is:
3957 + 1134 * 9
[1] 14163
That is, we would predict an enrollment of about 14,163 students.
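The same prediction can be obtained with predict(), which avoids transcribing the coefficients by hand; a sketch (the column name in newdata must match the predictor name):

```r
# predict enrollment at 9% unemployment from the fitted model
predict(linearModelVar, newdata = data.frame(UNEM = 9))

# interval = "prediction" additionally returns a prediction interval
predict(linearModelVar, newdata = data.frame(UNEM = 9), interval = "prediction")
```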
Summarizing the model
summary(linearModelVar)
Call:
lm(formula = ROLL ~ UNEM, data = datavar)

Residuals:
    Min      1Q  Median      3Q     Max 
-7640.0 -1046.5   602.8  1934.3  4187.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3957.0     4000.1   0.989   0.3313  
UNEM          1133.8      513.1   2.210   0.0358 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3049 on 27 degrees of freedom
Multiple R-squared:  0.1531,	Adjusted R-squared:  0.1218 
F-statistic: 4.883 on 1 and 27 DF,  p-value: 0.03579
You should be able to explain what each of the terms below means:
Multiple R-squared(R2): 0.1531
F-statistic: 4.883 (1, 27), p-value: 0.03579
t-statistic for UNEM: 2.210, p-value: 0.0358 *
What about beta coefficient?
In R we demonstrate the use of the lm.beta() function in the QuantPsyc package (due to Thomas D. Fletcher of State Farm). The function is short and sweet, and takes a linear model object as argument:
> lm.beta(mod)
EngineSize 
-0.7100032 
> cor(MPG.city, EngineSize)
[1] -0.7100032
Multiple Regression
regression output table
anova(m) | ANOVA table |
coefficients(m) = coef(m) | Model coefficients |
confint(m) | Confidence intervals for the regression coefficients |
deviance(m) | Residual sum of squares |
effects(m) | Vector of orthogonal effects |
fitted(m) | Vector of fitted y values |
residuals(m) = resid(m) | Model residuals |
summary(m) | Key statistics, such as R2, the F statistic, and the residual standard error (σ, the sample standard deviation of ε) |
vcov(m) | Variance-covariance matrix of the main parameters |
lm.model <- lm(Cars93$MPG.city ~ Cars93$EngineSize + Cars93$Price)
summary(lm.model)
Call:
lm(formula = Cars93$MPG.city ~ Cars93$EngineSize + Cars93$Price)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.4511 -2.0637 -0.5156  1.6239 16.9372 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       33.34647    1.12251  29.707  < 2e-16 ***
Cars93$EngineSize -2.98885    0.47808  -6.252 1.33e-08 ***
Cars93$Price      -0.15415    0.05134  -3.002  0.00347 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.815 on 90 degrees of freedom
Multiple R-squared:  0.5492,	Adjusted R-squared:  0.5392 
F-statistic: 54.83 on 2 and 90 DF,  p-value: 2.674e-16
Questions:
- What is R2?
- How many cars are involved in this test? (cf. df = 90)
- df (90) + number of estimated coefficients (2 slopes + the intercept = 3) = 93
- check 'str(Cars93)'
- If I eliminate the R2 from the above output, can you still identify what it is?
The last question:
- If I eliminate the R2 from the above output, can you still identify what it is?
- to answer the question, use the regression output table:
R2 = SSreg / SStotal = 1 - SSres / SStotal
> anova(lm.model)
Analysis of Variance Table

Response: Cars93$MPG.city
                  Df Sum Sq Mean Sq F value  Pr(>F)    
Cars93$EngineSize  1   1465    1465  100.65 2.4e-16 ***
Cars93$Price       1    131     131    9.01  0.0035 ** 
Residuals         90   1310      15                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> sstotal <- 1465 + 131 + 1310
> ssreg <- 1465 + 131
> ssreg/sstotal
[1] 0.54921
> # or
> 1 - (deviance(lm.model)/sstotal)
[1] 0.54932
Regression formula:
- $\hat{Y} = 33.35 - 2.99 * EngineSize - 0.15 * Price$
- $\hat{Y} = \widehat{\text{MPG.city}}$
In the meantime, compare the beta coefficients with the zero-order correlations:
> lm.beta(lm.model)
Cars93$EngineSize      Cars93$Price 
       -0.5517121        -0.2649553 
> cor(MPG.city, EngineSize)
[1] -0.7100032
> cor(EngineSize, Price)
[1] 0.5974254
> cor(MPG.city, Price)
[1] -0.5945622
Or . . . .
> temp <- subset(Cars93, select=c(MPG.city, EngineSize, Price))
> temp
. . . .
> cor(temp)
             MPG.city EngineSize      Price
MPG.city    1.0000000 -0.7100032 -0.5945622
EngineSize -0.7100032  1.0000000  0.5974254
Price      -0.5945622  0.5974254  1.0000000
Note that with more than one predictor, the beta coefficients are no longer equal to the correlations among the variables.
plot(lm.model$residuals)
abline(h=0, col="red")
anova(lm.model)
Analysis of Variance Table

Response: Cars93$MPG.city
                  Df  Sum Sq Mean Sq F value    Pr(>F)    
Cars93$EngineSize  1 1464.71 1464.71 100.653 2.444e-16 ***
Cars93$Price       1  131.17  131.17   9.014  0.003468 ** 
Residuals         90 1309.69   14.55                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Why use anova with lm output (lm.model in this case)?
coef(lm.model)
      (Intercept) Cars93$EngineSize      Cars93$Price 
       33.3464734        -2.9888458        -0.1541498 
residuals(lm.model)
          1           2           3           4           5 
-0.51556927 -0.55648889 -0.49194621 -0.16625801  3.73898067 
          6           7           8           9          10 
-2.35086090  0.21745634  3.34329779  1.06528019  2.64786884 
         11          12          13          14          15 
 2.58362397  0.29459458 -0.01370501 -1.85673580 -3.32003095 
         16          17          18          19          20 
-1.47621772 -2.93554987  1.49577174  6.54763980 -0.64692607 
         21          22          23          24          25 
 1.05563073  1.06413662  1.55497338 -2.02911999 -1.82416666 
         26          27          28          29          30 
-4.45108994 -2.46962214 -2.40287136  2.01742275  0.08957791 
         31          32          33          34          35 
 2.67973459 -3.40963806 -2.73023540 -2.02114636 -1.21068471 
         36          37          38          39          40 
-6.31235513 -0.26611019  1.62394797 16.93723064  3.36255227 
         41          42          43          44          45 
 0.58003782 15.00200777 -0.07339128  1.36999363 -4.42505304 
         46          47          48          49          50 
-1.32170679 -5.22609969  4.48710776 -2.06374182 -0.95386332 
         51          52          53          54          55 
 0.29847852  3.96702480  1.71512315  1.82158662  2.66911267 
         56          57          58          59          60 
-3.43567496 -7.45110565 -1.55474970  4.75970527 -3.39080806 
         61          62          63          64          65 
-0.69202743  1.72453815 -2.35662642  2.25464742  0.24690826 
         66          67          68          69          70 
-4.43567496 -0.06571546 -0.39110586 -1.25837103 -0.98293839 
         71          72          73          74          75 
 0.20204136 -2.74679396  3.82302800 -2.65771911 -1.45594634 
         76          77          78          79          80 
-1.33262651  0.77239559 -2.64579820  2.04339631  4.53499980 
         81          82          83          84          85 
-1.28631823 -0.76509170 10.86471434  4.64746325  1.06534353 
         86          87          88          89          90 
-1.96548643 -4.67404320 -1.56378785 -5.83760799 -3.28578597 
         91          92          93 
-3.38601500 -1.97292778 -2.05744404 
deviance(lm.model)
[1] 1309.686
Ex 1
Using the Cars93 data above, how much is city mileage (MPG.city) affected by EngineSize, Horsepower, and Turn.circle?
- Turn.circle is the car's turning radius
- EngineSize is the size of the engine
- Horsepower is the engine's horsepower
Ex 2
dataset_multipleregression.csv
data <- read.csv("http://commres.net/wiki/_media/r/dataset_multipleregression.csv")
head(data)
  YEAR ROLL UNEM HGRAD  INC
1    1 5501  8.1  9552 1923
2    2 5945  7.0  9680 1961
3    3 6629  7.3  9731 1979
4    4 7556  7.5 11666 2030
5    5 8716  7.0 14675 2112
6    6 9369  6.4 15265 2192

ROLL = students' enrollment
HGRAD = number of high school graduates
UNEM = unemployment
INC = per capita income
onePredictorModel <- lm(ROLL ~ UNEM, data)
onePredictorModel

Call:
lm(formula = ROLL ~ UNEM, data = data)

Coefficients:
(Intercept)         UNEM  
       3957         1134  

summary(onePredictorModel)

Call:
lm(formula = ROLL ~ UNEM, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-7640.0 -1046.5   602.8  1934.3  4187.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3957.0     4000.1   0.989   0.3313  
UNEM          1133.8      513.1   2.210   0.0358 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3049 on 27 degrees of freedom
Multiple R-squared:  0.1531,	Adjusted R-squared:  0.1218 
F-statistic: 4.883 on 1 and 27 DF,  p-value: 0.03579
Predict the fall enrollment (ROLL) using:
- the unemployment rate (UNEM) and
- number of spring high school graduates (HGRAD)
twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, data)
twoPredictorModel

Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = data)

Coefficients:
(Intercept)         UNEM        HGRAD  
 -8255.7511     698.2681       0.9423  
summary(twoPredictorModel)

Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2102.2  -861.6  -349.4   374.5  3603.5 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.256e+03  2.052e+03  -4.023  0.00044 ***
UNEM         6.983e+02  2.244e+02   3.111  0.00449 ** 
HGRAD        9.423e-01  8.613e-02  10.941 3.16e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared:  0.8489,	Adjusted R-squared:  0.8373 
F-statistic: 73.03 on 2 and 26 DF,  p-value: 2.144e-11
y = -8255.7511 + 698.2681*UNEM + 0.9423*HGRAD
Q: what is the expected fall enrollment (ROLL) given this year's unemployment rate (UNEM) of 9% and spring high school graduating class (HGRAD) of 100,000?
UNEM <- 9
HGRAD <- 100000
-8255.7511 + 698.2681*UNEM + 0.9423*HGRAD
[1] 92258.66
A: The predicted fall enrollment, given a 9% unemployment rate and a 100,000-student spring high school graduating class, is about 92,258 students.
The relationship between enrollment and unemployment, high school graduates, and income. That is,
- dv = ROLL (Enrollment)
- iv = UNEM, HGRAD, INC
threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC, data)
threePredictorModel

Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = data)

Coefficients:
(Intercept)         UNEM        HGRAD          INC  
 -9153.2545     450.1245       0.4065       4.2749  
summary(threePredictorModel)

Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1148.84  -489.71    -1.88   387.40  1425.75 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.153e+03  1.053e+03  -8.691 5.02e-09 ***
UNEM         4.501e+02  1.182e+02   3.809 0.000807 ***
HGRAD        4.065e-01  7.602e-02   5.347 1.52e-05 ***
INC          4.275e+00  4.947e-01   8.642 5.59e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared:  0.9621,	Adjusted R-squared:  0.9576 
F-statistic: 211.5 on 3 and 25 DF,  p-value: < 2.2e-16
How do we get beta coefficients for the predictor variables?
- Use scale function to standardize the x values then do regression
- Use lm.beta function in QuantPsyc
- In this case
install.packages("QuantPsyc")
library(QuantPsyc)
lm.beta(threePredictorModel)
     UNEM     HGRAD       INC 
0.1553619 0.3656177 0.6061762 
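The scale() route mentioned above yields the same standardized coefficients without QuantPsyc. A minimal sketch (scale() returns a matrix, hence the as.data.frame() wrapper):

```r
# standardize the DV and IVs, then refit; the slopes are the beta coefficients
std.data <- as.data.frame(scale(data[, c("ROLL", "UNEM", "HGRAD", "INC")]))
std.model <- lm(ROLL ~ UNEM + HGRAD + INC, data = std.data)
coef(std.model)  # slopes should match lm.beta(threePredictorModel); intercept ~ 0
```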
How to compare each model (with incremental IVs)
anova(onePredictorModel, twoPredictorModel, threePredictorModel)
Analysis of Variance Table

Model 1: ROLL ~ UNEM
Model 2: ROLL ~ UNEM + HGRAD
Model 3: ROLL ~ UNEM + HGRAD + INC
  Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
1     27 251084710                                  
2     26  44805568  1 206279143 458.92 < 2.2e-16 ***
3     25  11237313  1  33568255  74.68 5.594e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
e.g. 3
hierarchicalregressiondata.csv
# Import data (simulated data for this example)
myData <- read.csv('http://static.lib.virginia.edu/statlab/materials/data/hierarchicalRegressionData.csv')
# or
# myData <- read.csv('http://commres.net/wiki/_media/r/hierarchicalregressiondata.csv')

# Build models to check whether adding variables is worthwhile
m0 <- lm(happiness ~ 1, data=myData)                             # to obtain Total SS
m1 <- lm(happiness ~ age + gender, data=myData)                  # Model 1
m2 <- lm(happiness ~ age + gender + friends, data=myData)        # Model 2
m3 <- lm(happiness ~ age + gender + friends + pets, data=myData) # Model 3
anova(m0)  # to obtain Total SS
Analysis of Variance Table

Response: happiness
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 99 240.84  2.4327               
summary(m1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.66778    2.01364   3.808 0.000246 ***
age         -0.13039    0.07936  -1.643 0.103611    
genderMale   0.16430    0.31938   0.514 0.608106    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.553 on 97 degrees of freedom
Multiple R-squared:  0.02855,	Adjusted R-squared:  0.008515 
F-statistic: 1.425 on 2 and 97 DF,  p-value: 0.2455
summary(m2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  6.21730    1.96220   3.169  0.00206 **
age         -0.12479    0.07546  -1.654  0.10146   
genderMale   0.14931    0.30365   0.492  0.62405   
friends      0.18985    0.05640   3.366  0.00110 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.476 on 96 degrees of freedom
Multiple R-squared:  0.1311,	Adjusted R-squared:  0.1039 
F-statistic: 4.828 on 3 and 96 DF,  p-value: 0.003573
summary(m3)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  5.78540    1.90266   3.041  0.00305 **
age         -0.11146    0.07309  -1.525  0.13057   
genderMale  -0.14267    0.31157  -0.458  0.64806   
friends      0.17134    0.05491   3.120  0.00239 **
pets         0.36391    0.13044   2.790  0.00637 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.427 on 95 degrees of freedom
Multiple R-squared:  0.1969,	Adjusted R-squared:  0.1631 
F-statistic: 5.822 on 4 and 95 DF,  p-value: 0.0003105
> lm.beta(m3)
        age  genderMale     friends        pets 
-0.14098154 -0.04484095  0.28909280  0.27446786 
anova(m0, m1, m2, m3)
Analysis of Variance Table

Model 1: happiness ~ 1
Model 2: happiness ~ age + gender
Model 3: happiness ~ age + gender + friends
Model 4: happiness ~ age + gender + friends + pets
  Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
1     99 240.84                                   
2     97 233.97  2    6.8748  1.6883 0.1903349    
3     96 209.27  1   24.6957 12.1293 0.0007521 ***
4     95 193.42  1   15.8461  7.7828 0.0063739 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- Model 0: SSTotal= 240.84 (no predictors)
- Model 1: SSResidual= 233.97 (after adding age and gender)
- Model 2: SSResidual= 209.27,
- SSDifference= 233.97 - 209.27 = 24.696,
- F(1,96) = 12.1293, p value = 0.0007521 (after adding friends)
- Model 3: SSResidual= 193.42,
- SSDifference= 209.27 - 193.42 = 15.846,
- F(1,95) = 7.7828, p value = 0.0063739 (after adding pets)
Source: Park, N., Kee, K. F., & Valenzuela, S. (2009). Being immersed in social networking environment: Facebook groups, uses and gratifications, and social outcomes. CyberPsychology & Behavior, 12, 729-733.
Source: Lankau, M. J., & Scandura, T. A. (2002). An investigation of personal learning in mentoring relationships: Content, antecedents, and consequences. Academy of Management Journal, 45, 779-790.
see also http://rtutorialseries.blogspot.kr/2010/01/r-tutorial-series-basic-hierarchical.html
e.g. 4
# data import
crime <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/crime_simple.txt", sep = "\t", header = TRUE)
str(crime)
- R: Crime rate: # of offenses reported to police per million population
- Age: The number of males of age 14-24 per 1000 population
- S: Indicator variable for Southern states (0 = No, 1 = Yes)
- Ed: Mean # of years of schooling x 10 for persons of age 25 or older
- Ex0: 1960 per capita expenditure on police by state and local government
- Ex1: 1959 per capita expenditure on police by state and local government
- LF: Labor force participation rate per 1000 civilian urban males age 14-24
- M: The number of males per 1000 females
- N: State population size in hundred thousands
- NW: The number of non-whites per 1000 population
- U1: Unemployment rate of urban males per 1000 of age 14-24
- U2: Unemployment rate of urban males per 1000 of age 35-39
- W: Median value of transferable goods and assets or family income, in tens of dollars
- X: The number of families per 1000 earning below 1/2 the median income
# Assign more meaningful variable names
colnames(crime) <- c("crime.per.million", "young.males", "is.south", "average.ed",
                     "exp.per.cap.1960", "exp.per.cap.1959", "labour.part",
                     "male.per.fem", "population", "nonwhite", "unemp.youth",
                     "unemp.adult", "median.assets", "num.low.salary")

# Convert is.south to a factor
# Divide average.ed by 10 so that the variable is actually average education
# Convert median assets to 1000's of dollars instead of 10's
crime <- transform(crime,
                   is.south = as.factor(is.south),
                   average.ed = average.ed / 10,
                   median.assets = median.assets / 100)
# Fit model
# enter all IVs to predict crime.per.million (the DV)
crime.lm <- lm(crime.per.million ~ ., data = crime)
plot(crime)
Check the scatterplot matrix for strongly correlated predictors: the 1959 and 1960 police expenditures are nearly collinear, as are unemp.youth and unemp.adult. We therefore drop exp.per.cap.1959 and unemp.youth as redundant.
# Remove 1959 expenditure and youth unemployment
crime.lm2 <- update(crime.lm, . ~ . - exp.per.cap.1959 - unemp.youth)
summary(crime.lm2)

Call:
lm(formula = crime.per.million ~ young.males + is.south + average.ed + 
    exp.per.cap.1960 + labour.part + male.per.fem + population + 
    nonwhite + unemp.adult + median.assets + num.low.salary, 
    data = crime)

Residuals:
   Min     1Q Median     3Q    Max 
-35.82 -11.57  -1.51  10.63  55.02 

Coefficients:
                    Estimate  Std. Error t value  Pr(>|t|)    
(Intercept)      -633.438828  145.470340  -4.354  0.000111 ***
young.males         1.126883    0.418791   2.691  0.010853 *  
is.south1          -0.556600   13.883248  -0.040  0.968248    
average.ed         15.328028    6.202516   2.471  0.018476 *  
exp.per.cap.1960    1.138299    0.226977   5.015 0.0000153 ***
labour.part         0.068716    0.133540   0.515  0.610087    
male.per.fem        0.003021    0.173041   0.017  0.986172    
population         -0.064477    0.128278  -0.503  0.618367    
nonwhite           -0.013794    0.061901  -0.223  0.824960    
unemp.adult         0.931498    0.541803   1.719  0.094402 .  
median.assets      15.158975   10.524458   1.440  0.158653    
num.low.salary      0.825936    0.234189   3.527  0.001197 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.98 on 35 degrees of freedom
Multiple R-squared:  0.7543,	Adjusted R-squared:  0.6771 
F-statistic: 9.769 on 11 and 35 DF,  p-value: 0.00000009378

summary(crime.lm)

Call:
lm(formula = crime.per.million ~ ., data = crime)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.884 -11.923  -1.135  13.495  50.560 

Coefficients:
                    Estimate  Std. Error t value  Pr(>|t|)    
(Intercept)      -691.837588  155.887918  -4.438 0.0000956 ***
young.males         1.039810    0.422708   2.460   0.01931 *  
is.south1          -8.308313   14.911588  -0.557   0.58117    
average.ed         18.016011    6.496504   2.773   0.00906 ** 
exp.per.cap.1960    1.607818    1.058667   1.519   0.13836    
exp.per.cap.1959   -0.667258    1.148773  -0.581   0.56529    
labour.part        -0.041031    0.153477  -0.267   0.79087    
male.per.fem        0.164795    0.209932   0.785   0.43806    
population         -0.041277    0.129516  -0.319   0.75196    
nonwhite            0.007175    0.063867   0.112   0.91124    
unemp.youth        -0.601675    0.437154  -1.376   0.17798    
unemp.adult         1.792263    0.856111   2.093   0.04407 *  
median.assets      13.735847   10.583028   1.298   0.20332    
num.low.salary      0.792933    0.235085   3.373   0.00191 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.94 on 33 degrees of freedom
Multiple R-squared:  0.7692,	Adjusted R-squared:  0.6783 
F-statistic: 8.462 on 13 and 33 DF,  p-value: 0.0000003686
See the stepwise regression explanation below.
reduced.model <- step(crime.lm2, direction="backward")
Start:  AIC=300.61
crime.per.million ~ young.males + is.south + average.ed + exp.per.cap.1960 + 
    labour.part + male.per.fem + population + nonwhite + unemp.adult + 
    median.assets + num.low.salary

                   Df Sum of Sq   RSS    AIC
- male.per.fem      1       0.1 16906 298.61
- is.south          1       0.8 16907 298.61
- nonwhite          1      24.0 16930 298.67
- population        1     122.0 17028 298.95
- labour.part       1     127.9 17034 298.96
<none>                          16906 300.61
- median.assets     1    1002.1 17908 301.31
- unemp.adult       1    1427.8 18334 302.42
- average.ed        1    2949.9 19856 306.17
- young.males       1    3497.3 20403 307.45
- num.low.salary    1    6008.0 22914 312.90
- exp.per.cap.1960  1   12148.5 29054 324.06

Step:  AIC=298.61
crime.per.million ~ young.males + is.south + average.ed + exp.per.cap.1960 + 
    labour.part + population + nonwhite + unemp.adult + median.assets + 
    num.low.salary

                   Df Sum of Sq   RSS    AIC
- is.south          1       0.8 16907 296.61
- nonwhite          1      25.7 16932 296.68
- labour.part       1     155.9 17062 297.04
- population        1     161.7 17068 297.06
<none>                          16906 298.61
- median.assets     1    1002.9 17909 299.32
- unemp.adult       1    1804.4 18711 301.37
- average.ed        1    3242.3 20148 304.85
- young.males       1    3807.8 20714 306.16
- num.low.salary    1    6267.5 23174 311.43
- exp.per.cap.1960  1   12586.1 29492 322.76

Step:  AIC=296.61
crime.per.million ~ young.males + average.ed + exp.per.cap.1960 + 
    labour.part + population + nonwhite + unemp.adult + median.assets + 
    num.low.salary

                   Df Sum of Sq   RSS    AIC
- nonwhite          1      38.4 16945 294.72
- population        1     162.2 17069 295.06
- labour.part       1     210.0 17117 295.19
<none>                          16907 296.61
- median.assets     1    1047.1 17954 297.43
- unemp.adult       1    1866.1 18773 299.53
- average.ed        1    3243.1 20150 302.86
- young.males       1    3820.9 20728 304.19
- num.low.salary    1    7560.2 24467 311.98
- exp.per.cap.1960  1   12624.3 29531 320.82

Step:  AIC=294.72
crime.per.million ~ young.males + average.ed + exp.per.cap.1960 + 
    labour.part + population + unemp.adult + median.assets + 
    num.low.salary

                   Df Sum of Sq   RSS    AIC
- population        1     175.1 17120 293.20
- labour.part       1     216.2 17162 293.31
<none>                          16945 294.72
- median.assets     1    1146.8 18092 295.79
- unemp.adult       1    1855.8 18801 297.60
- average.ed        1    3515.0 20460 301.58
- young.males       1    4032.3 20978 302.75
- num.low.salary    1    7700.6 24646 310.32
- exp.per.cap.1960  1   14283.5 31229 321.45

Step:  AIC=293.2
crime.per.million ~ young.males + average.ed + exp.per.cap.1960 + 
    labour.part + unemp.adult + median.assets + num.low.salary

                   Df Sum of Sq   RSS    AIC
- labour.part       1     230.7 17351 291.83
<none>                          17120 293.20
- median.assets     1    1064.3 18185 294.03
- unemp.adult       1    1858.2 18979 296.04
- average.ed        1    3981.1 21101 301.03
- young.males       1    4509.0 21629 302.19
- num.low.salary    1    7540.2 24661 308.35
- exp.per.cap.1960  1   15814.0 32934 321.95

Step:  AIC=291.83
crime.per.million ~ young.males + average.ed + exp.per.cap.1960 + 
    unemp.adult + median.assets + num.low.salary

                   Df Sum of Sq   RSS    AIC
<none>                          17351 291.83
- median.assets     1    1252.6 18604 293.11
- unemp.adult       1    1628.7 18980 294.05
- young.males       1    4461.0 21812 300.58
- average.ed        1    6214.7 23566 304.22
- num.low.salary    1    8932.3 26283 309.35
- exp.per.cap.1960  1   15596.5 32948 319.97
summary(reduced.model)
Performing Linear Regression Without an Intercept
lm(y ~ x + 0)
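For example, with the Cars93 data used above (a sketch only; whether a zero intercept is actually sensible depends on the data, since it forces the fitted line through the origin):

```r
library(MASS)  # provides the Cars93 data frame
m0 <- lm(MPG.city ~ EngineSize + 0, data = Cars93)  # no intercept term
# equivalent formula notation:
# m0 <- lm(MPG.city ~ EngineSize - 1, data = Cars93)
summary(m0)
```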
Performing Linear Regression with Interaction Terms
This is a very tricky part. See Interaction effects in regression analysis
see also https://cran.r-project.org/web/packages/jtools/vignettes/interactions.html
lm(y ~ u*v)
yi = β0 + β1ui + β2vi + β3uivi + εi
where β3uivi is the interaction term.
y ~ u*v*w
yi = β0 + β1ui + β2vi + β3wi + β4uivi + β5uiwi + β6viwi + β7uiviwi + εi
first-order interactions: β4uivi, β5uiwi, β6viwi
second-order interaction: β7uiviwi
y ~ u + v + w + u:v:w
yi = β0 + β1ui + β2vi + β3wi + β4uiviwi + εi
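A concrete illustration with Cars93 (a hypothetical model chosen only to show the syntax): u*v expands to both main effects plus their interaction, while u:v adds only the product term.

```r
library(MASS)  # provides the Cars93 data frame

# EngineSize * Price expands to EngineSize + Price + EngineSize:Price
m.int <- lm(MPG.city ~ EngineSize * Price, data = Cars93)

# the same model written out explicitly:
m.int2 <- lm(MPG.city ~ EngineSize + Price + EngineSize:Price, data = Cars93)

summary(m.int)  # the EngineSize:Price row is the interaction coefficient
```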
Selecting the Best Regression Variables = Stepwise regression
http://www.utstat.toronto.edu/~brunner/oldclass/appliedf11/handouts/2101f11StepwiseLogisticR.pdf
full.model <- lm(y ~ x1 + x2 + x3 + x4) reduced.model <- step(full.model, direction="backward")
dataset_multipleregression.csv
data <- read.csv("http://commres.net/wiki/_media/r/dataset_multipleregression.csv")
head(data)
  YEAR ROLL UNEM HGRAD  INC
1    1 5501  8.1  9552 1923
2    2 5945  7.0  9680 1961
3    3 6629  7.3  9731 1979
4    4 7556  7.5 11666 2030
5    5 8716  7.0 14675 2112
6    6 9369  6.4 15265 2192

ROLL = students' enrollment
HGRAD = number of high school graduates
UNEM = unemployment
INC = per capita income
r.model <- step(threePredictorModel, direction="backward")
Start:  AIC=381.16
ROLL ~ UNEM + HGRAD + INC

        Df Sum of Sq      RSS    AIC
<none>                11237313 381.16
- UNEM   1   6522098 17759411 392.43
- HGRAD  1  12852039 24089352 401.27
- INC    1  33568255 44805568 419.27

Note: nothing has been discarded.

E.g., from the textbook (without data):

full.model <- lm(y ~ x1 + x2 + x3 + x4)
summary(full.model)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.405  -5.153   2.025   6.525  19.186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   10.274      4.573   2.247   0.0337 *
x1          -102.298     47.558  -2.151   0.0413 *
x2             4.362     40.237   0.108   0.9145  
x3            75.115     34.236   2.194   0.0377 *
x4            26.286     42.239   0.622   0.5394  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.19 on 25 degrees of freedom
Multiple R-squared:  0.8848,	Adjusted R-squared:  0.8664 
F-statistic: 48 on 4 and 25 DF,  p-value: 2.235e-11
We want to eliminate the insignificant variables, so we use step to incrementally eliminate the underperformers. The result is called the reduced model:
reduced.model <- step(full.model, direction="backward")
Start:  AIC=154.58
y ~ x1 + x2 + x3 + x4

       Df Sum of Sq    RSS    AIC
- x2    1      1.75 3718.5 152.60
- x4    1     57.58 3774.3 153.04
<none>              3716.7 154.58
- x1    1    687.89 4404.6 157.68
- x3    1    715.67 4432.4 157.87

Step:  AIC=152.6
y ~ x1 + x3 + x4

       Df Sum of Sq    RSS    AIC
- x4    1     77.82 3796.3 151.22
<none>              3718.5 152.60
- x3    1    737.39 4455.9 156.02
- x1    1    787.96 4506.4 156.36

Step:  AIC=151.22
y ~ x1 + x3

       Df Sum of Sq    RSS    AIC
<none>              3796.3 151.22
- x1    1    884.44 4680.7 155.50
- x3    1    964.24 4760.5 156.01
The output from step shows the sequence of models that it explored. In this case, step removed x2 and x4 and left only x1 and x3 in the final (reduced) model. The summary of the reduced model shows that it contains only significant predictors:
summary(reduced.model)

Call:
lm(formula = y ~ x1 + x3)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.192  -2.677   2.049   7.453  19.704 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   10.267      4.437   2.314   0.0285 *
x1           -79.265     31.604  -2.508   0.0185 *
x3            82.726     31.590   2.619   0.0143 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.86 on 27 degrees of freedom
Multiple R-squared:  0.8823,	Adjusted R-squared:  0.8736 
F-statistic: 101.2 on 2 and 27 DF,  p-value: 2.842e-13
Or start with nothing
min.model <- lm(y ~ 1)
fwd.model <- step(min.model,
                  direction="forward",
                  scope=(~ x1 + x2 + x3 + x4),
                  trace=0)
min.model <- lm(ROLL ~ 1, data)
fwd.model <- step(min.model, direction="forward", scope=(~ UNEM + HGRAD + INC), trace=0)
summary(fwd.model)

Call:
lm(formula = ROLL ~ INC + HGRAD + UNEM, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1148.84  -489.71    -1.88   387.40  1425.75 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.153e+03  1.053e+03  -8.691 5.02e-09 ***
INC          4.275e+00  4.947e-01   8.642 5.59e-09 ***
HGRAD        4.065e-01  7.602e-02   5.347 1.52e-05 ***
UNEM         4.501e+02  1.182e+02   3.809 0.000807 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared:  0.9621,	Adjusted R-squared:  0.9576 
F-statistic: 211.5 on 3 and 25 DF,  p-value: < 2.2e-16
Diagnosing a Linear Regression
m <- lm(y ~ x)
plot(m)
# the cars dataset is built into R (package datasets), so no extra package is needed
m <- lm(cars$dist ~ cars$speed)
plot(m)
- The points in the Residuals vs Fitted plot are randomly scattered with no particular pattern.
- The points in the Normal Q-Q plot are more-or-less on the line, indicating that the residuals follow a normal distribution.
- In both the Scale-Location plot and the Residuals vs Leverage plot, the points are in a group with none too far from the center.
influence.measures(m)
Influence measures of
	 lm(formula = cars$dist ~ cars$speed) :

     dfb.1_ dfb.crs.    dffit cov.r   cook.d    hat inf
1   0.09440 -0.08625  0.09490 1.175 4.59e-03 0.1149   *
2   0.29242 -0.26716  0.29398 1.146 4.35e-02 0.1149   *
3  -0.10750  0.09369 -0.11040 1.116 6.20e-03 0.0715    
4   0.21898 -0.19085  0.22488 1.093 2.55e-02 0.0715    
5   0.03408 -0.02901  0.03554 1.109 6.45e-04 0.0600    
6  -0.11101  0.09174 -0.11852 1.085 7.13e-03 0.0499    
7  -0.04618  0.03669 -0.05110 1.085 1.33e-03 0.0413    
8   0.05248 -0.04170  0.05807 1.084 1.72e-03 0.0413    
9   0.15209 -0.12083  0.16827 1.058 1.43e-02 0.0413    
10 -0.09168  0.06895 -0.10716 1.065 5.82e-03 0.0341    
11  0.02446 -0.01840  0.02859 1.079 4.17e-04 0.0341    
12 -0.13849  0.09602 -0.17628 1.027 1.55e-02 0.0284    
13 -0.08467  0.05870 -0.10777 1.056 5.88e-03 0.0284    
14 -0.04929  0.03417 -0.06274 1.067 2.00e-03 0.0284    
15 -0.01413  0.00979 -0.01798 1.073 1.65e-04 0.0284    
16 -0.05330  0.03233 -0.07757 1.058 3.06e-03 0.0242    
17  0.00323 -0.00196  0.00470 1.069 1.13e-05 0.0242    
18  0.00323 -0.00196  0.00470 1.069 1.13e-05 0.0242    
19  0.08843 -0.05364  0.12870 1.039 8.34e-03 0.0242    
20 -0.06172  0.02871 -0.11111 1.041 6.23e-03 0.0214    
21 -0.00789  0.00367 -0.01419 1.065 1.03e-04 0.0214    
22  0.12329 -0.05734  0.22194 0.971 2.40e-02 0.0214    
23  0.24851 -0.11558  0.44734 0.747 8.56e-02 0.0214   *
24 -0.08002  0.01551 -0.20360 0.979 2.03e-02 0.0201    
25 -0.05700  0.01105 -0.14504 1.019 1.05e-02 0.0201    
26  0.04643 -0.00900  0.11812 1.034 7.02e-03 0.0201    
27 -0.02664 -0.01432 -0.12571 1.031 7.94e-03 0.0203    
28 -0.01059 -0.00569 -0.04998 1.059 1.27e-03 0.0203    
29 -0.00528 -0.04978 -0.17031 1.010 1.44e-02 0.0219    
30 -0.00281 -0.02647 -0.09054 1.050 4.15e-03 0.0219    
31  0.00022  0.00207  0.00708 1.066 2.56e-05 0.0219    
32  0.01561 -0.05223 -0.11741 1.046 6.96e-03 0.0249    
33 -0.00387  0.01296  0.02914 1.068 4.33e-04 0.0249    
34 -0.03235  0.10823  0.24330 0.972 2.88e-02 0.0249    
35 -0.04462  0.14928  0.33557 0.894 5.26e-02 0.0249    
36  0.06663 -0.13914 -0.24553 0.989 2.95e-02 0.0295    
37  0.03458 -0.07221 -0.12744 1.051 8.20e-03 0.0295    
38 -0.03372  0.07042  0.12428 1.052 7.80e-03 0.0295    
39  0.14564 -0.25086 -0.38002 0.921 6.81e-02 0.0354    
40  0.06340 -0.10920 -0.16542 1.048 1.38e-02 0.0354    
41  0.04382 -0.07547 -0.11432 1.065 6.62e-03 0.0354    
42  0.02443 -0.04207 -0.06373 1.076 2.07e-03 0.0354    
43 -0.01411  0.02431  0.03682 1.080 6.92e-04 0.0354    
44  0.02456 -0.03551 -0.04533 1.098 1.05e-03 0.0518    
45  0.19602 -0.27032 -0.32823 1.039 5.32e-02 0.0622    
46  0.08260 -0.11000 -0.12877 1.116 8.43e-03 0.0740    
47 -0.18634  0.24815  0.29050 1.077 4.21e-02 0.0740    
48 -0.19890  0.26488  0.31008 1.071 4.79e-02 0.0740    
49 -0.57747  0.76902  0.90027 0.762 3.40e-01 0.0740   *
50 -0.06025  0.07812  0.08898 1.139 4.04e-03 0.0873   *
When a statistician says that an observation is influential, it means that removing the observation would significantly change the fitted regression model. We want to identify those observations because they might be outliers that distort our model; we owe it to ourselves to investigate them.
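A common rule-of-thumb screen for influential observations uses Cook's distance; a minimal sketch using the m object fitted above (the 4/n cutoff is a widely used convention, not a strict threshold):

```r
cd <- cooks.distance(m)   # Cook's distance for each observation
n <- length(cd)
which(cd > 4 / n)         # indices of potentially influential observations

# visualize the distances against the cutoff
plot(cd, type = "h", ylab = "Cook's distance")
abline(h = 4 / n, col = "red")
```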
Testing Residuals for Autocorrelation (Durbin-Watson Test)
library(lmtest)  # dwtest() is in the lmtest package
dwtest(m)

	Durbin-Watson test

data:  m
DW = 1.6762, p-value = 0.09522
alternative hypothesis: true autocorrelation is greater than 0
Autocorrelation in the residuals is a scourge because it distorts the regression statistics, such as the F statistic and the t statistics for the regression coefficients. The presence of autocorrelation suggests that your model is missing a useful predictor variable or that it should include a time series component, such as a trend or a seasonal indicator.