====== Simple Linear Regression ======

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\beta_0$ and $\beta_1$ are the regression coefficients and the $\varepsilon_i$ are the error terms.

<code>
library(MASS)
attach(Cars93)
plot(MPG.city ~ EngineSize)
mod <- lm(MPG.city ~ EngineSize)
mod
</code>
<code>
Call:
lm(formula = MPG.city ~ EngineSize)

Coefficients:
(Intercept)   EngineSize  
     32.627       -3.846  
</code>
<code>
summary(mod)
</code>
<code>
Call:
lm(formula = MPG.city ~ EngineSize)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.6264  -2.7032  -0.5491   2.1428  17.2197 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  32.6267     1.1439  28.523  < 2e-16 ***
EngineSize   -3.8464     0.3999  -9.618 1.61e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.979 on 91 degrees of freedom
Multiple R-squared:  0.5041,  Adjusted R-squared:  0.4987
F-statistic: 92.51 on 1 and 91 DF,  p-value: 1.606e-15
</code>

Is the model statistically significant?
  * Check the **F statistic** at the bottom of the summary.
  * Its p-value tells you whether the model as a whole is significant.

Are the coefficients significant?
  * Check the coefficients' **t statistics and p-values** in the summary.

Is the model useful?
  * Check the **R2** near the bottom of the summary.
  * R2 is a measure of the model's quality. Bigger is better: mathematically, it is the fraction of the variance of y that is explained by the regression model.

Does the model fit the data well?
  * **Plot the residuals** and check the regression diagnostics.
  * see [[https://drsimonj.svbtle.com/visualising-residuals|visualizing residuals]]

Does the data satisfy the assumptions behind linear regression?
  * Check whether the diagnostics confirm that a linear model is reasonable for your data.

====== E.g. 1 ======
{{:r:dataset_simpleRegression.csv}}
<code>
   YEAR  ROLL UNEM HGRAD  INC
1     1  5501  8.1  9552 1923
2     2  5945  7.0  9680 1961
3     3  6629  7.3  9731 1979
4     4  7556  7.5 11666 2030
5     5  8716  7.0 14675 2112
6     6  9369  6.4 15265 2192
7     7  9920  6.5 15484 2235
8     8 10167  6.4 15723 2351
9     9 11084  6.3 16501 2411
10   10 12504  7.7 16890 2475
11   11 13746  8.2 17203 2524
12   12 13656  7.5 17707 2674
13   13 13850  7.4 18108 2833
14   14 14145  8.2 18266 2863
15   15 14888 10.1 19308 2839
16   16 14991  9.2 18224 2898
17   17 14836  7.7 18997 3123
18   18 14478  5.7 19505 3195
19   19 14539  6.5 19800 3239
20   20 14395  7.5 19546 3129
21   21 14599  7.3 19117 3100
22   22 14969  9.2 18774 3008
23   23 15107 10.1 17813 2983
24   24 14831  7.5 17304 3069
25   25 15081  8.8 16756 3151
26   26 15127  9.1 16749 3127
27   27 15856  8.8 16925 3179
28   28 15938  7.8 17231 3207
29   29 16081  7.0 16816 3345
</code>
<code>
datavar <- read.csv("http://commres.net/wiki/_media/r/dataset_simpleregression.csv")
# ROLL  = students' enrollment
# HGRAD = number of high school graduates
# UNEM  = unemployment
# INC   = per capita income
linearModelVar <- lm(ROLL ~ UNEM, datavar)
linearModelVar
</code>
<code>
Call:
lm(formula = ROLL ~ UNEM, data = datavar)

Coefficients:
(Intercept)         UNEM  
       3957         1134  
</code>

From the above, //Enrollment// = 3957 + 1134 * //Unemployment Rate//. If the unemployment rate were 9%, the predicted enrollment would be

<code>
3957 + 1134 * 9
[1] 14163
</code>

that is, about 14,163 students.

Summarizing the model:

<code>
summary(linearModelVar)
</code>
<code>
Call:
lm(formula = ROLL ~ UNEM, data = datavar)

Residuals:
    Min      1Q  Median      3Q     Max 
-7640.0 -1046.5   602.8  1934.3  4187.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3957.0     4000.1   0.989   0.3313  
UNEM          1133.8      513.1   2.210   0.0358 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3049 on 27 degrees of freedom
Multiple R-squared:  0.1531,  Adjusted R-squared:  0.1218
F-statistic: 4.883 on 1 and 27 DF,  p-value: 0.03579
</code>

You should be able to explain what the terms below mean:
  * Multiple R-squared (R2): 0.1531
  * F-statistic: 4.883 (1, 27), p-value: 0.03579
  * t-statistic for UNEM: 2.210, p-value: 0.0358
  * What about the beta coefficient?
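One way to answer that last question is to standardize both variables yourself and refit the model. A minimal sketch, assuming the datavar data frame loaded above (std.model is just an illustrative name):

<code>
# z-score the outcome and the predictor with scale(), then refit:
# the slope of the standardized model is the beta (standardized) coefficient.
std.model <- lm(scale(ROLL) ~ scale(UNEM), data = datavar)
coef(std.model)["scale(UNEM)"]    # beta weight for UNEM
cor(datavar$ROLL, datavar$UNEM)   # with a single predictor, beta equals Pearson's r
</code>

With only one predictor, the standardized slope and the correlation coefficient agree, which is exactly what the lm.beta() example below shows for the Cars93 model.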
In R we demonstrate the use of the lm.beta() function in the QuantPsyc package (due to Thomas D. Fletcher of State Farm). The function is short and sweet, and takes a linear model object as argument:
<code>
> lm.beta(mod)
EngineSize 
-0.7100032 
> cor(MPG.city, EngineSize)
[1] -0.7100032
</code>
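To see where that number comes from, the beta weight can also be computed by hand from the unstandardized slope. A quick sketch, assuming Cars93 is still attached as above:

<code>
# beta = b1 * sd(x) / sd(y): rescale the raw slope by the two standard deviations
b1 <- coef(mod)["EngineSize"]
b1 * sd(EngineSize) / sd(MPG.city)   # should match lm.beta(mod), about -0.71
</code>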
====== Multiple Regression ======

Regression output table:

| anova(m) | ANOVA table |
| coefficients(m) = coef(m) | Model coefficients |
| confint(m) | Confidence intervals for the regression coefficients |
| deviance(m) | Residual sum of squares |
| effects(m) | Vector of orthogonal effects |
| fitted(m) | Vector of fitted y values |
| residuals(m) = resid(m) | Model residuals |
| summary(m) | Key statistics, such as R2, the F statistic, and the residual standard error (σ), that is, the sample standard deviation of ε |
| vcov(m) | Variance-covariance matrix of the main parameters |

<code>
lm.model <- lm(Cars93$MPG.city ~ Cars93$EngineSize + Cars93$Price)
summary(lm.model)
</code>
<code>
Call:
lm(formula = Cars93$MPG.city ~ Cars93$EngineSize + Cars93$Price)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.4511 -2.0637 -0.5156  1.6239 16.9372 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       33.34647    1.12251  29.707  < 2e-16 ***
Cars93$EngineSize -2.98885    0.47808  -6.252 1.33e-08 ***
Cars93$Price      -0.15415    0.05134  -3.002  0.00347 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.815 on 90 degrees of freedom
Multiple R-squared:  0.5492,  Adjusted R-squared:  0.5392
F-statistic: 54.83 on 2 and 90 DF,  p-value: 2.674e-16
</code>

Questions:
  * What is R2?
  * How many cars are involved in this test? (cf. df = 90)
    * residual df (90) + number of estimated coefficients (3, including the intercept) = 93
    * check 'str(Cars93)'
  * If I eliminate the R2 from the above output, can you still identify what it is?

The last question:
  * If I eliminate the R2 from the above output, can you still identify what it is?
  * To answer the question, use the regression output table: R2 = SSreg/SStotal = 1 - SSres/SStotal

<code>
> anova(lm.model)
Analysis of Variance Table

Response: Cars93$MPG.city
                  Df Sum Sq Mean Sq F value  Pr(>F)    
Cars93$EngineSize  1   1465    1465  100.65 2.4e-16 ***
Cars93$Price       1    131     131    9.01  0.0035 ** 
Residuals         90   1310      15                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> sstotal = 1465+131+1310
> ssreg <- 1465+131
> ssreg/sstotal
[1] 0.54921
> 
> # or
> 1-(deviance(lm.model)/sstotal)
[1] 0.54932
</code>

Regression formula:
  * $\hat{Y} = -2.99 \cdot EngineSize - 0.15 \cdot Price + 33.35$
  * $\hat{Y} = \widehat{\text{MPG.city}}$

In the meantime,

<code>
> lm.beta(lm.model)
Cars93$EngineSize      Cars93$Price 
       -0.5517121        -0.2649553 
> cor(MPG.city, EngineSize)
[1] -0.7100032
> cor(EngineSize, Price)
[1] 0.5974254
> cor(MPG.city, Price)
[1] -0.5945622
</code>

Or . . . .

<code>
> temp <- subset(Cars93, select=c(MPG.city, EngineSize, Price))
> temp
. . . .
> cor(temp)
             MPG.city EngineSize      Price
MPG.city    1.0000000 -0.7100032 -0.5945622
EngineSize -0.7100032  1.0000000  0.5974254
Price      -0.5945622  0.5974254  1.0000000
</code>

Beta coefficients are not equal to correlations among variables.

<code>
plot(lm.model$residuals)
abline(h=0, col="red")
anova(lm.model)
</code>
<code>
Analysis of Variance Table

Response: Cars93$MPG.city
                  Df  Sum Sq Mean Sq F value    Pr(>F)    
Cars93$EngineSize  1 1464.71 1464.71 100.653 2.444e-16 ***
Cars93$Price       1  131.17  131.17   9.014  0.003468 ** 
Residuals         90 1309.69   14.55                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code>

Why use anova with lm output (lm.model in this case)?
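One reason to call anova() on an lm object is that it hands you the sums of squares, so R2 can be computed without retyping the numbers by hand as above. A small sketch, assuming lm.model from above (ss, ssreg, and sstotal are illustrative names):

<code>
ss      <- anova(lm.model)[["Sum Sq"]]  # sequential SS for each term plus residuals
sstotal <- sum(ss)                      # total SS
ssreg   <- sum(ss[-length(ss)])         # regression SS (drop the Residuals row)
ssreg / sstotal                         # R-squared, about 0.549
</code>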
<code>
coef(lm.model)
</code>
<code>
      (Intercept) Cars93$EngineSize      Cars93$Price 
       33.3464734        -2.9888458        -0.1541498 
</code>
<code>
residuals(lm.model)
</code>
<code>
          1           2           3           4           5 
-0.51556927 -0.55648889 -0.49194621 -0.16625801  3.73898067 
          6           7           8           9          10 
-2.35086090  0.21745634  3.34329779  1.06528019  2.64786884 
         11          12          13          14          15 
 2.58362397  0.29459458 -0.01370501 -1.85673580 -3.32003095 
         16          17          18          19          20 
-1.47621772 -2.93554987  1.49577174  6.54763980 -0.64692607 
         21          22          23          24          25 
 1.05563073  1.06413662  1.55497338 -2.02911999 -1.82416666 
         26          27          28          29          30 
-4.45108994 -2.46962214 -2.40287136  2.01742275  0.08957791 
         31          32          33          34          35 
 2.67973459 -3.40963806 -2.73023540 -2.02114636 -1.21068471 
         36          37          38          39          40 
-6.31235513 -0.26611019  1.62394797 16.93723064  3.36255227 
         41          42          43          44          45 
 0.58003782 15.00200777 -0.07339128  1.36999363 -4.42505304 
         46          47          48          49          50 
-1.32170679 -5.22609969  4.48710776 -2.06374182 -0.95386332 
         51          52          53          54          55 
 0.29847852  3.96702480  1.71512315  1.82158662  2.66911267 
         56          57          58          59          60 
-3.43567496 -7.45110565 -1.55474970  4.75970527 -3.39080806 
         61          62          63          64          65 
-0.69202743  1.72453815 -2.35662642  2.25464742  0.24690826 
         66          67          68          69          70 
-4.43567496 -0.06571546 -0.39110586 -1.25837103 -0.98293839 
         71          72          73          74          75 
 0.20204136 -2.74679396  3.82302800 -2.65771911 -1.45594634 
         76          77          78          79          80 
-1.33262651  0.77239559 -2.64579820  2.04339631  4.53499980 
         81          82          83          84          85 
-1.28631823 -0.76509170 10.86471434  4.64746325  1.06534353 
         86          87          88          89          90 
-1.96548643 -4.67404320 -1.56378785 -5.83760799 -3.28578597 
         91          92          93 
-3.38601500 -1.97292778 -2.05744404 
</code>
<code>
deviance(lm.model)
[1] 1309.686
</code>

===== Ex 1 =====
In the Cars93 data above, how much is city mileage (MPG.city) affected by EngineSize, Horsepower, and Turn.circle?
  * Turn.circle is the car's turning radius,
  * EngineSize is the size of the engine, and
  * Horsepower is the engine's horsepower.

===== Ex 2 =====
{{dataset_multipleRegression.csv}}
<code>
data <- read.csv("http://commres.net/wiki/_media/r/dataset_multipleregression.csv")
head(data)
</code>
<code>
  YEAR ROLL UNEM HGRAD  INC
1    1 5501  8.1  9552 1923
2    2 5945  7.0  9680 1961
3    3 6629  7.3  9731 1979
4    4 7556  7.5 11666 2030
5    5 8716  7.0 14675 2112
6    6 9369  6.4 15265 2192
</code>
  * ROLL = students' enrollment
  * HGRAD = number of high school graduates
  * UNEM = unemployment
  * INC = per capita income

<code>
onePredictorModel <- lm(ROLL ~ UNEM, datavar)
onePredictorModel
</code>
<code>
Call:
lm(formula = ROLL ~ UNEM, data = datavar)

Coefficients:
(Intercept)         UNEM  
       3957         1134  
</code>
<code>
summary(onePredictorModel)
</code>
<code>
Call:
lm(formula = ROLL ~ UNEM, data = datavar)

Residuals:
    Min      1Q  Median      3Q     Max 
-7640.0 -1046.5   602.8  1934.3  4187.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3957.0     4000.1   0.989   0.3313  
UNEM          1133.8      513.1   2.210   0.0358 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3049 on 27 degrees of freedom
Multiple R-squared:  0.1531,  Adjusted R-squared:  0.1218
F-statistic: 4.883 on 1 and 27 DF,  p-value: 0.03579
</code>

Predict the fall enrollment (ROLL) using
  * the unemployment rate (UNEM) and
  * the number of spring high school graduates (HGRAD).

<code>
twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, data)
twoPredictorModel
</code>
<code>
Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = data)

Coefficients:
(Intercept)         UNEM        HGRAD  
 -8255.7511     698.2681       0.9423  
</code>
<code>
summary(twoPredictorModel)
</code>
<code>
Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2102.2  -861.6  -349.4   374.5  3603.5 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.256e+03  2.052e+03  -4.023  0.00044 ***
UNEM         6.983e+02  2.244e+02   3.111  0.00449 ** 
HGRAD        9.423e-01  8.613e-02  10.941 3.16e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared:  0.8489,  Adjusted R-squared:  0.8373
F-statistic: 73.03 on 2 and 26 DF,  p-value: 2.144e-11
</code>

y = -8255.7511 + 698.2681*UNEM + 0.9423*HGRAD

Q: what is the expected fall enrollment (ROLL) given this year's unemployment rate (UNEM) of 9% and spring high school graduating class (HGRAD) of 100,000?

<code>
UNEM <- 9
HGRAD <- 100000
-8255.7511 + 698.2681*UNEM + 0.9423*HGRAD
[1] 92258.66
</code>

A: the predicted fall enrollment, given a 9% unemployment rate and a 100,000-student spring high school graduating class, is **92258** students.

Now consider the relationship between enrollment and unemployment, high school graduates, and income together; that is,
  * dv = ROLL (Enrollment)
  * iv = UNEM, HGRAD, INC

<code>
threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC, data)
threePredictorModel
</code>
<code>
Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = data)

Coefficients:
(Intercept)         UNEM        HGRAD          INC  
 -9153.2545     450.1245       0.4065       4.2749  
</code>
<code>
summary(threePredictorModel)
</code>
<code>
Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1148.84  -489.71    -1.88   387.40  1425.75 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.153e+03  1.053e+03  -8.691 5.02e-09 ***
UNEM         4.501e+02  1.182e+02   3.809 0.000807 ***
HGRAD        4.065e-01  7.602e-02   5.347 1.52e-05 ***
INC          4.275e+00  4.947e-01   8.642 5.59e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared:  0.9621,  Adjusted R-squared:  0.9576
F-statistic: 211.5 on 3 and 25 DF,  p-value: < 2.2e-16
</code>

How to get **beta coefficients**((beta weights, beta values)) for the predictor variables?
  * Use the scale function to standardize the x values, then do the regression; or
  * use the lm.beta function in QuantPsyc. In this case:

<code>
install.packages("QuantPsyc")
library(QuantPsyc)
lm.beta(threePredictorModel)
</code>
<code>
     UNEM     HGRAD       INC 
0.1553619 0.3656177 0.6061762 
</code>

How to compare each model (with incremental IVs):

<code>
anova(onePredictorModel, twoPredictorModel, threePredictorModel)
</code>
<code>
Analysis of Variance Table

Model 1: ROLL ~ UNEM
Model 2: ROLL ~ UNEM + HGRAD
Model 3: ROLL ~ UNEM + HGRAD + INC
  Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
1     27 251084710                                  
2     26  44805568  1 206279143 458.92 < 2.2e-16 ***
3     25  11237313  1  33568255  74.68 5.594e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code>

===== e.g. 3 =====
{{hierarchicalregressiondata.csv}}
<code>
# Import data (simulated data for this example)
myData <- read.csv('http://static.lib.virginia.edu/statlab/materials/data/hierarchicalRegressionData.csv')
# or
# myData <- read.csv('http://commres.net/wiki/_media/r/hierarchicalregressiondata.csv')

# Build models to check whether adding variables is worthwhile.
m0 <- lm(happiness ~ 1, data=myData)                              # to obtain Total SS
m1 <- lm(happiness ~ age + gender, data=myData)                   # Model 1
m2 <- lm(happiness ~ age + gender + friends, data=myData)         # Model 2
m3 <- lm(happiness ~ age + gender + friends + pets, data=myData)  # Model 3

anova(m0)  # to obtain Total SS
</code>
<code>
Analysis of Variance Table

Response: happiness
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 99 240.84  2.4327               
</code>
<code>
summary(m1)
</code>
<code>
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.66778    2.01364   3.808 0.000246 ***
age         -0.13039    0.07936  -1.643 0.103611    
genderMale   0.16430    0.31938   0.514 0.608106    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.553 on 97 degrees of freedom
Multiple R-squared:  0.02855,  Adjusted R-squared:  0.008515
F-statistic: 1.425 on 2 and 97 DF,  p-value: 0.2455
</code>
<code>
summary(m2)
</code>
<code>
Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  6.21730    1.96220   3.169  0.00206 **
age         -0.12479    0.07546  -1.654  0.10146   
genderMale   0.14931    0.30365   0.492  0.62405   
friends      0.18985    0.05640   3.366  0.00110 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.476 on 96 degrees of freedom
Multiple R-squared:  0.1311,  Adjusted R-squared:  0.1039
F-statistic: 4.828 on 3 and 96 DF,  p-value: 0.003573
</code>
<code>
summary(m3)
</code>
<code>
Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  5.78540    1.90266   3.041  0.00305 **
age         -0.11146    0.07309  -1.525  0.13057   
genderMale  -0.14267    0.31157  -0.458  0.64806   
friends      0.17134    0.05491   3.120  0.00239 **
pets         0.36391    0.13044   2.790  0.00637 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.427 on 95 degrees of freedom
Multiple R-squared:  0.1969,  Adjusted R-squared:  0.1631
F-statistic: 5.822 on 4 and 95 DF,  p-value: 0.0003105
</code>
<code>
> lm.beta(m3)
        age  genderMale     friends        pets 
-0.14098154 -0.04484095  0.28909280  0.27446786 
</code>
<code>
anova(m0,m1,m2,m3)
</code>
<code>
Analysis of Variance Table

Model 1: happiness ~ 1
Model 2: happiness ~ age + gender
Model 3: happiness ~ age + gender + friends
Model 4: happiness ~ age + gender + friends + pets
  Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
1     99 240.84                                   
2     97 233.97  2    6.8748  1.6883 0.1903349    
3     96 209.27  1   24.6957 12.1293 0.0007521 ***
4     95 193.42  1   15.8461  7.7828 0.0063739 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
</code>

  * Model 0: SSTotal = 240.84 (no predictors)
  * Model 1: SSResidual = 233.97 (after adding age and gender)
  * Model 2: SSResidual = 209.27,
    * SSDifference = 233.97 - 209.27 = 24.696,
    * F(1,96) = 12.1293, p value = 0.0007521 (after adding friends)
  * Model 3: SSResidual = 193.42,
    * SSDifference = 209.27 - 193.42 = 15.846,
    * F(1,95) = 7.7828, p value = 0.0063739 (after adding pets)

{{https://data.library.virginia.edu/files/Park.png}}
Source: Park, N., Kee, K. F., & Valenzuela, S. (2009). Being immersed in social networking environment: Facebook groups, uses and gratifications, and social outcomes. CyberPsychology & Behavior, 12, 729-733.

{{https://data.library.virginia.edu/files/Lankau-950x1024.png}}
Source: Lankau, M. J., & Scandura, T. A. (2002). An investigation of personal learning in mentoring relationships: Content, antecedents, and consequences. Academy of Management Journal, 45, 779-790.

see also http://rtutorialseries.blogspot.kr/2010/01/r-tutorial-series-basic-hierarchical.html
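Alongside the F tests from anova(), hierarchical regression results are often reported as the change in R2 at each step. A brief sketch, assuming the m1, m2, and m3 models above (rsq is an illustrative name):

<code>
# R-squared for each model in the hierarchy, and the change as each block is added
rsq <- sapply(list(m1, m2, m3), function(m) summary(m)$r.squared)
rsq         # about 0.029, 0.131, 0.197, matching the summaries above
diff(rsq)   # R-squared change after adding friends, then pets
</code>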
===== e.g. 4 =====
{{:r:crime_simple_1_.txt}}
<code>
# data import
crime <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/crime_simple.txt", sep = "\t", header = TRUE)
str(crime)
</code>
  * R: Crime rate: # of offenses reported to police per million population
  * Age: The number of males of age 14-24 per 1000 population
  * S: Indicator variable for Southern states (0 = No, 1 = Yes)
  * Ed: Mean # of years of schooling x 10 for persons of age 25 or older
  * Ex0: 1960 per capita expenditure on police by state and local government
  * Ex1: 1959 per capita expenditure on police by state and local government
  * LF: Labor force participation rate per 1000 civilian urban males age 14-24
  * M: The number of males per 1000 females
  * N: State population size in hundred thousands
  * NW: The number of non-whites per 1000 population
  * U1: Unemployment rate of urban males per 1000 of age 14-24
  * U2: Unemployment rate of urban males per 1000 of age 35-39
  * W: Median value of transferable goods and assets or family income in tens of dollars
  * X: The number of families per 1000 earning below 1/2 the median income

<code>
# Assign more meaningful variable names
colnames(crime) <- c("crime.per.million", "young.males", "is.south", "average.ed",
                     "exp.per.cap.1960", "exp.per.cap.1959", "labour.part",
                     "male.per.fem", "population", "nonwhite", "unemp.youth",
                     "unemp.adult", "median.assets", "num.low.salary")

# Convert is.south to a factor
# Divide average.ed by 10 so that the variable is actually average education
# Convert median assets to 1000's of dollars instead of 10's
crime <- transform(crime, is.south = as.factor(is.south),
                          average.ed = average.ed / 10,
                          median.assets = median.assets / 100)

# Fit the model: regress crime.per.million (the DV) on all remaining variables
crime.lm <- lm(crime.per.million ~ ., data = crime)

plot(crime)
</code>

The scatterplot matrix from plot(crime) shows a strong linear relationship between the 1959 and 1960 police expenditures, and likewise between unemp.youth and unemp.adult. Hence we delete exp.per.cap.1959 and unemp.youth as redundant.

<code>
# Remove 1959 expenditure and youth unemployment
crime.lm2 <- update(crime.lm, . ~ . - exp.per.cap.1959 - unemp.youth)
summary(crime.lm2)
</code>
<code>
Call:
lm(formula = crime.per.million ~ young.males + is.south + average.ed + 
    exp.per.cap.1960 + labour.part + male.per.fem + population + 
    nonwhite + unemp.adult + median.assets + num.low.salary, 
    data = crime)

Residuals:
   Min     1Q Median     3Q    Max 
-35.82 -11.57  -1.51  10.63  55.02 

Coefficients:
                    Estimate  Std. Error t value  Pr(>|t|)    
(Intercept)      -633.438828  145.470340  -4.354  0.000111 ***
young.males         1.126883    0.418791   2.691  0.010853 *  
is.south1          -0.556600   13.883248  -0.040  0.968248    
average.ed         15.328028    6.202516   2.471  0.018476 *  
exp.per.cap.1960    1.138299    0.226977   5.015 0.0000153 ***
labour.part         0.068716    0.133540   0.515  0.610087    
male.per.fem        0.003021    0.173041   0.017  0.986172    
population         -0.064477    0.128278  -0.503  0.618367    
nonwhite           -0.013794    0.061901  -0.223  0.824960    
unemp.adult         0.931498    0.541803   1.719  0.094402 .  
median.assets      15.158975   10.524458   1.440  0.158653    
num.low.salary      0.825936    0.234189   3.527  0.001197 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.98 on 35 degrees of freedom
Multiple R-squared:  0.7543,  Adjusted R-squared:  0.6771
F-statistic: 9.769 on 11 and 35 DF,  p-value: 0.00000009378
</code>
<code>
summary(crime.lm)
</code>
<code>
Call:
lm(formula = crime.per.million ~ ., data = crime)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.884 -11.923  -1.135  13.495  50.560 

Coefficients:
                    Estimate  Std. Error t value  Pr(>|t|)    
(Intercept)      -691.837588  155.887918  -4.438 0.0000956 ***
young.males         1.039810    0.422708   2.460   0.01931 *  
is.south1          -8.308313   14.911588  -0.557   0.58117    
average.ed         18.016011    6.496504   2.773   0.00906 ** 
exp.per.cap.1960    1.607818    1.058667   1.519   0.13836    
exp.per.cap.1959   -0.667258    1.148773  -0.581   0.56529    
labour.part        -0.041031    0.153477  -0.267   0.79087    
male.per.fem        0.164795    0.209932   0.785   0.43806    
population         -0.041277    0.129516  -0.319   0.75196    
nonwhite            0.007175    0.063867   0.112   0.91124    
unemp.youth        -0.601675    0.437154  -1.376   0.17798    
unemp.adult         1.792263    0.856111   2.093   0.04407 *  
median.assets      13.735847   10.583028   1.298   0.20332    
num.low.salary      0.792933    0.235085   3.373   0.00191 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.94 on 33 degrees of freedom
Multiple R-squared:  0.7692,  Adjusted R-squared:  0.6783
F-statistic: 8.462 on 13 and 33 DF,  p-value: 0.0000003686
</code>

See the stepwise regression explanation below.

<code>
reduced.model <- step(crime.lm2, direction="backward")
</code>
<code>
Start:  AIC=300.61
crime.per.million ~ young.males + is.south + average.ed + exp.per.cap.1960 + 
    labour.part + male.per.fem + population + nonwhite + unemp.adult + 
    median.assets + num.low.salary

                   Df Sum of Sq   RSS    AIC
- male.per.fem      1       0.1 16906 298.61
- is.south          1       0.8 16907 298.61
- nonwhite          1      24.0 16930 298.67
- population        1     122.0 17028 298.95
- labour.part       1     127.9 17034 298.96
</code>
<code>
summary(reduced.model)
</code>

====== Performing Linear Regression Without an Intercept ======

<code>
lm(y ~ x + 0)
</code>

====== Performing Linear Regression with Interaction Terms ======

This is a very tricky part. See [[:Interaction effects in regression analysis]]; see also https://cran.r-project.org/web/packages/jtools/vignettes/interactions.html

<code>
lm(y ~ u*v)
</code>
$y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 u_i v_i + \varepsilon_i$, where $\beta_3 u_i v_i$ is the interaction term.

<code>
y ~ u*v*w
</code>
$y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 w_i + \beta_4 u_i v_i + \beta_5 u_i w_i + \beta_6 v_i w_i + \beta_7 u_i v_i w_i + \varepsilon_i$
  * first-order interactions: $\beta_4 u_i v_i$, $\beta_5 u_i w_i$, $\beta_6 v_i w_i$
  * second-order interaction: $\beta_7 u_i v_i w_i$

<code>
y ~ u + v + w + u:v:w
</code>
$y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 w_i + \beta_4 u_i v_i w_i + \varepsilon_i$

====== Selecting the Best Regression Variables = Stepwise regression ======

http://www.utstat.toronto.edu/~brunner/oldclass/appliedf11/handouts/2101f11StepwiseLogisticR.pdf

<code>
full.model <- lm(y ~ x1 + x2 + x3 + x4)
reduced.model <- step(full.model, direction="backward")
</code>

{{dataset_multipleRegression.csv}}
<code>
data <- read.csv("http://commres.net/wiki/_media/r/dataset_multipleregression.csv")
head(data)
</code>
<code>
  YEAR ROLL UNEM HGRAD  INC
1    1 5501  8.1  9552 1923
2    2 5945  7.0  9680 1961
3    3 6629  7.3  9731 1979
4    4 7556  7.5 11666 2030
5    5 8716  7.0 14675 2112
6    6 9369  6.4 15265 2192
</code>
  * ROLL = students' enrollment
  * HGRAD = number of high school graduates
  * UNEM = unemployment
  * INC = per capita income

<code>
r.model <- step(threePredictorModel, direction="backward")
</code>
<code>
Start:  AIC=381.16
ROLL ~ UNEM + HGRAD + INC

       Df Sum of Sq       RSS    AIC
</code>
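step() ranks the candidate models by AIC (smaller is better). You can make a similar comparison directly with AIC(); a sketch, assuming the enrollment models fit earlier (the absolute values differ from step()'s extractAIC() by a constant, but the ordering is the same):

<code>
AIC(onePredictorModel, twoPredictorModel, threePredictorModel)
# if the full three-predictor model has the lowest AIC,
# backward elimination will drop nothing
</code>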
<code>
full.model <- lm(y ~ x1 + x2 + x3 + x4)
summary(full.model)
</code>
<code>
Call:
lm(formula = y ~ x1 + x2 + x3 + x4)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.405  -5.153   2.025   6.525  19.186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   10.274      4.573   2.247   0.0337 *
x1          -102.298     47.558  -2.151   0.0413 *
x2             4.362     40.237   0.108   0.9145  
x3            75.115     34.236   2.194   0.0377 *
x4            26.286     42.239   0.622   0.5394  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.19 on 25 degrees of freedom
Multiple R-squared:  0.8848,  Adjusted R-squared:  0.8664
F-statistic: 48 on 4 and 25 DF,  p-value: 2.235e-11
</code>

We want to eliminate the insignificant variables, so we use step to incrementally eliminate the underperformers. The result is called the reduced model:

<code>
reduced.model <- step(full.model, direction="backward")
</code>
<code>
Start:  AIC=154.58
y ~ x1 + x2 + x3 + x4

       Df Sum of Sq    RSS    AIC
- x2    1      1.75 3718.5 152.60
- x4    1     57.58 3774.3 153.04
</code>

The output from step shows the sequence of models that it explored. In this case, step removed x2 and x4 and left only x1 and x3 in the final (reduced) model. The summary of the reduced model shows that it contains only significant predictors:

<code>
summary(reduced.model)
</code>
<code>
Call:
lm(formula = y ~ x1 + x3)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.192  -2.677   2.049   7.453  19.704 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   10.267      4.437   2.314   0.0285 *
x1           -79.265     31.604  -2.508   0.0185 *
x3            82.726     31.590   2.619   0.0143 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.86 on 27 degrees of freedom
Multiple R-squared:  0.8823,  Adjusted R-squared:  0.8736
F-statistic: 101.2 on 2 and 27 DF,  p-value: 2.842e-13
</code>

Or start with nothing:

<code>
min.model <- lm(y ~ 1)
fwd.model <- step(min.model,
                  direction="forward",
                  scope=( ~ x1 + x2 + x3 + x4),
                  trace=0 )
</code>
<code>
min.model <- lm(ROLL ~ 1, data)
fwd.model <- step(min.model, direction="forward", scope=(~ UNEM + HGRAD + INC), trace=0)
summary(fwd.model)
</code>
<code>
Call:
lm(formula = ROLL ~ INC + HGRAD + UNEM, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1148.84  -489.71    -1.88   387.40  1425.75 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.153e+03  1.053e+03  -8.691 5.02e-09 ***
INC          4.275e+00  4.947e-01   8.642 5.59e-09 ***
HGRAD        4.065e-01  7.602e-02   5.347 1.52e-05 ***
UNEM         4.501e+02  1.182e+02   3.809 0.000807 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared:  0.9621,  Adjusted R-squared:  0.9576
F-statistic: 211.5 on 3 and 25 DF,  p-value: < 2.2e-16
</code>

====== Diagnosing a Linear Regression ======

<code>
m <- lm(y ~ x)
plot(m)
</code>
<code>
# the cars data frame is built into R (datasets package), so no extra package is needed
m <- lm(cars$dist ~ cars$speed)
plot(m)
</code>
  * The points in the Residuals vs Fitted plot are randomly scattered with no particular pattern.
  * The points in the Normal Q-Q plot are more-or-less on the line, indicating that the residuals follow a normal distribution.
  * In both the Scale-Location plot and the Residuals vs Leverage plot, the points are in a group with none too far from the center.

<code>
influence.measures(m)
</code>
Influence measures of lm(formula = cars$dist ~ cars$speed) :
dfb.1_ dfb.crs.
dffit cov.r cook.d hat inf 1 0.09440 -0.08625 0.09490 1.175 4.59e-03 0.1149 * 2 0.29242 -0.26716 0.29398 1.146 4.35e-02 0.1149 * 3 -0.10750 0.09369 -0.11040 1.116 6.20e-03 0.0715 4 0.21898 -0.19085 0.22488 1.093 2.55e-02 0.0715 5 0.03408 -0.02901 0.03554 1.109 6.45e-04 0.0600 6 -0.11101 0.09174 -0.11852 1.085 7.13e-03 0.0499 7 -0.04618 0.03669 -0.05110 1.085 1.33e-03 0.0413 8 0.05248 -0.04170 0.05807 1.084 1.72e-03 0.0413 9 0.15209 -0.12083 0.16827 1.058 1.43e-02 0.0413 10 -0.09168 0.06895 -0.10716 1.065 5.82e-03 0.0341 11 0.02446 -0.01840 0.02859 1.079 4.17e-04 0.0341 12 -0.13849 0.09602 -0.17628 1.027 1.55e-02 0.0284 13 -0.08467 0.05870 -0.10777 1.056 5.88e-03 0.0284 14 -0.04929 0.03417 -0.06274 1.067 2.00e-03 0.0284 15 -0.01413 0.00979 -0.01798 1.073 1.65e-04 0.0284 16 -0.05330 0.03233 -0.07757 1.058 3.06e-03 0.0242 17 0.00323 -0.00196 0.00470 1.069 1.13e-05 0.0242 18 0.00323 -0.00196 0.00470 1.069 1.13e-05 0.0242 19 0.08843 -0.05364 0.12870 1.039 8.34e-03 0.0242 20 -0.06172 0.02871 -0.11111 1.041 6.23e-03 0.0214 21 -0.00789 0.00367 -0.01419 1.065 1.03e-04 0.0214 22 0.12329 -0.05734 0.22194 0.971 2.40e-02 0.0214 23 0.24851 -0.11558 0.44734 0.747 8.56e-02 0.0214 * 24 -0.08002 0.01551 -0.20360 0.979 2.03e-02 0.0201 25 -0.05700 0.01105 -0.14504 1.019 1.05e-02 0.0201 26 0.04643 -0.00900 0.11812 1.034 7.02e-03 0.0201 27 -0.02664 -0.01432 -0.12571 1.031 7.94e-03 0.0203 28 -0.01059 -0.00569 -0.04998 1.059 1.27e-03 0.0203 29 -0.00528 -0.04978 -0.17031 1.010 1.44e-02 0.0219 30 -0.00281 -0.02647 -0.09054 1.050 4.15e-03 0.0219 31 0.00022 0.00207 0.00708 1.066 2.56e-05 0.0219 32 0.01561 -0.05223 -0.11741 1.046 6.96e-03 0.0249 33 -0.00387 0.01296 0.02914 1.068 4.33e-04 0.0249 34 -0.03235 0.10823 0.24330 0.972 2.88e-02 0.0249 35 -0.04462 0.14928 0.33557 0.894 5.26e-02 0.0249 36 0.06663 -0.13914 -0.24553 0.989 2.95e-02 0.0295 37 0.03458 -0.07221 -0.12744 1.051 8.20e-03 0.0295 38 -0.03372 0.07042 0.12428 1.052 7.80e-03 0.0295 39 0.14564 -0.25086 -0.38002 0.921 6.81e-02 0.0354 40 0.06340 -0.10920 -0.16542 1.048 1.38e-02 0.0354 41 0.04382 -0.07547 -0.11432 1.065 6.62e-03 0.0354 42 0.02443 -0.04207 -0.06373 1.076 2.07e-03 0.0354 43 -0.01411 0.02431 0.03682 1.080 6.92e-04 0.0354 44 0.02456 -0.03551 -0.04533 1.098 1.05e-03 0.0518 45 0.19602 -0.27032 -0.32823 1.039 5.32e-02 0.0622 46 0.08260 -0.11000 -0.12877 1.116 8.43e-03 0.0740 47 -0.18634 0.24815 0.29050 1.077 4.21e-02 0.0740 48 -0.19890 0.26488 0.31008 1.071 4.79e-02 0.0740 49 -0.57747 0.76902 0.90027 0.762 3.40e-01 0.0740 * 50 -0.06025 0.07812 0.08898 1.139 4.04e-03 0.0873 * When a statistician says that an observation is influential, it means that removing the observation would significantly change the fitted regression model. We want to identify those observations because they might be outliers that distort our model; we owe it to ourselves to investigate them. ====== Testing Residuals for Autocorrelation (Durbin?Watson Test) ====== dwtest(m) Durbin-Watson test data: m DW = 1.6762, p-value = 0.09522 alternative hypothesis: true autocorrelation is greater than 0 Autocorrelation in the residuals is a scourge because it distorts the regression statistics, such as the F statistic and the t statistics for the regression coefficients. The presence of autocorrelation suggests that your model is missing a useful predictor variable or that it should include a time series component, such as a trend or a seasonal indicator.
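The dwtest() function shown above comes from the lmtest package, which has to be loaded first. A self-contained sketch of the same test on the built-in cars data:

<code>
# install.packages("lmtest")   # if not already installed
library(lmtest)
m <- lm(dist ~ speed, data = cars)   # built-in cars data frame
dwtest(m)                            # Durbin-Watson test for autocorrelated residuals
</code>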