Differences

This shows you the differences between two versions of the page.

--- r:linear_regression [2018/06/15 08:02] – [Multiple Regression] hkimscil
+++ r:linear_regression [2019/06/13 10:15] (current) – hkimscil
@@ Line 51: / Line 51: @@
 Does the model fit the data well?
   * **Plot the residuals** and check the regression diagnostics.
+    * see [[https://drsimonj.svbtle.com/visualising-residuals|visualizing residuals]]
 Does the data satisfy the assumptions behind linear regression?
   * Check whether the diagnostics confirm that a linear model is reasonable for your data.
@@ Line 146: / Line 147: @@
 </WRAP>
+<WRAP info>
+What about the beta (standardized) coefficient?
+<blockquote>In R we demonstrate the use of the lm.beta() function in the QuantPsyc package (due to Thomas D. Fletcher of State Farm). The function is short and sweet, and takes a linear model object as argument:</blockquote>
+<code>> lm.beta(mod)
+EngineSize
+-0.7100032
+> cor(MPG.city,EngineSize)
+[1] -0.7100032
+>
+</code></WRAP>
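+A minimal sketch of why the two numbers above agree, assuming mod was fitted on Cars93 from the MASS package as shown below (the fitting call does not appear above, so it is an assumption). With a single predictor, the beta coefficient is the raw slope rescaled by sd(x)/sd(y), which is the Pearson correlation:
+<code>
+library(MASS)  # provides the Cars93 data set
+# assumed model: city MPG regressed on engine size only
+mod <- lm(MPG.city ~ EngineSize, data = Cars93)
+b <- coef(mod)[["EngineSize"]]                           # unstandardized slope
+beta <- b * sd(Cars93$EngineSize) / sd(Cars93$MPG.city)  # standardized slope
+r <- cor(Cars93$MPG.city, Cars93$EngineSize)             # simple correlation
+all.equal(beta, r)  # TRUE: beta equals r in simple regression
+</code>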
 ====== Multiple Regression ======
+Regression output table:
 | anova(m)  | ANOVA table  |
 | coefficients(m) = coef(m)  | Model coefficients  |
@@ Line 184: / Line 198: @@
 F-statistic: 54.83 on 2 and 90 DF,  p-value: 2.674e-16
 </code>
-Questions:
+<WRAP box help>Questions:
   * What is R<sup>2</sup>?
   * How many cars are involved in this test? (cf. df = 90)
+    * residual df (90) + number of estimated coefficients (3: intercept, EngineSize, Price) = 93
+    * check with 'str(Cars93)', which reports 93 observations
   * If I eliminate the R<sup>2</sup> from the above output, can you still identify what it is?
+</WRAP>
+<WRAP box info>The last question:
+  * If I eliminate the R<sup>2</sup> from the above output, can you still identify what it is?
+  * to answer the question, use the regression output table:
+R<sup>2</sup> = SS<sub>reg</sub> / SS<sub>total</sub> = 1 - SS<sub>res</sub> / SS<sub>total</sub>
+<code>> anova(lm.model)
+Analysis of Variance Table
+Response: Cars93$MPG.city
+                  Df Sum Sq Mean Sq F value  Pr(>F)
+Cars93$EngineSize  1   1465    1465  100.65 2.4e-16 ***
+Cars93$Price       1    131     131    9.01  0.0035 **
+Residuals         90   1310      15
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+> sstotal = 1465+131+1310
+> ssreg <- 1465+131
+> ssreg/sstotal
+[1] 0.54921
+>
+> # or, with the unrounded residual SS (sstotal still uses rounded values, hence the small difference)
+> 1-(deviance(lm.model)/sstotal)
+[1] 0.54932
+</code>
+</WRAP>
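+The same calculation can be done without retyping rounded sums of squares. A sketch, assuming lm.model is the two-predictor model printed above and that Cars93 comes from the MASS package; the data= form of the call is a choice made here, not necessarily how the original model was fitted:
+<code>
+library(MASS)  # provides the Cars93 data set
+lm.model <- lm(MPG.city ~ EngineSize + Price, data = Cars93)
+ss <- anova(lm.model)[["Sum Sq"]]  # rows: EngineSize, Price, Residuals
+ss.total <- sum(ss)                # SS_total = SS_reg + SS_res
+ss.res <- ss[length(ss)]           # the last row is Residuals
+r2 <- 1 - ss.res / ss.total
+all.equal(r2, summary(lm.model)$r.squared)  # TRUE
+</code>
+Because nothing is rounded along the way, SS<sub>reg</sub>/SS<sub>total</sub> and 1 - SS<sub>res</sub>/SS<sub>total</sub> give the same value here.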
 Regression formula:
@@ Line 193: / Line 240: @@
   * $\hat{Y} = \widehat{\text{MPG.city}}$
-<code>plot(lm.model$residuals)</code>
+<WRAP box info>in the meantime,
+<code>> lm.beta(lm.model)
+Cars93$EngineSize      Cars93$Price
+       -0.5517121        -0.2649553
+> cor(MPG.city,EngineSize)
+[1] -0.7100032
+> cor(EngineSize,Price)
+[1] 0.5974254
+> cor(MPG.city,Price)
+[1] -0.5945622
+>
+</code>
+Or . . . .
+<code>> temp <- subset(Cars93, select=c(MPG.city,EngineSize,Price))
+> temp
+. . . .
+> cor(temp)
+             MPG.city EngineSize      Price
+MPG.city    1.0000000 -0.7100032 -0.5945622
+EngineSize -0.7100032  1.0000000  0.5974254
+Price      -0.5945622  0.5974254  1.0000000
+>
+</code>
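+Beta coefficients are not the same as the simple correlations: with two correlated predictors (r = 0.60 between EngineSize and Price), each beta (-0.55, -0.26) is smaller in magnitude than the corresponding correlation with MPG.city (-0.71, -0.59).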
+</WRAP>
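+With exactly two predictors, the betas can be recovered from the three correlations alone. A sketch using the standard two-predictor formula and the correlations printed above; the variable names are made up for illustration:
+<code>
+r.y1 <- -0.7100032  # cor(MPG.city, EngineSize)
+r.y2 <- -0.5945622  # cor(MPG.city, Price)
+r.12 <-  0.5974254  # cor(EngineSize, Price)
+# each beta removes the part of the correlation shared with the other predictor
+beta.engine <- (r.y1 - r.y2 * r.12) / (1 - r.12^2)
+beta.price  <- (r.y2 - r.y1 * r.12) / (1 - r.12^2)
+round(c(EngineSize = beta.engine, Price = beta.price), 4)
+</code>
+The result agrees with the lm.beta(lm.model) output to the printed precision. If r.12 were 0, each beta would reduce to the simple correlation.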
+<code>plot(lm.model$residuals)
+abline(h=0, col="red")
+</code>
 <code>anova(lm.model)
@@ Line 205: / Line 281: @@
 ---
 Signif. codes:
-‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</code>
+‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+</code>
+Why use anova with lm output (lm.model in this case)?
 <code>coef(lm.model)
@@ Line 315: / Line 394: @@
-#predict the fall enrollment (ROLL)
+Predict the fall enrollment (ROLL) using:
-using the unemployment rate (UNEM) and
+  * the unemployment rate (UNEM) and
-number of spring high school graduates (HGRAD)
+  * the number of spring high school graduates (HGRAD)
 <code>
@@ Line 357: / Line 436: @@
 </code>
-<code>y =  -8255.7511  +  698.2681*UNEM  +  0.9423*HGRAD  </code>
+<code>
+y =  -8255.7511  +  698.2681*UNEM  +  0.9423*HGRAD
+</code>
 Q: What is the expected fall enrollment (ROLL) given this year's unemployment rate (UNEM) of 9% and a spring high school graduating class (HGRAD) of 100,000?
@@ Line 369: / Line 450: @@
 **92258** students.
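+The arithmetic behind this answer, as a sketch that plugs the rounded coefficients from the formula above into R (the variable names are made up for illustration):
+<code>
+b0      <- -8255.7511  # intercept
+b.unem  <-   698.2681  # slope for UNEM (percent)
+b.hgrad <-     0.9423  # slope for HGRAD (students)
+roll.hat <- b0 + b.unem * 9 + b.hgrad * 100000
+roll.hat  # about 92258.66
+# With the fitted model object, predict() gives the same result from unrounded coefficients, e.g.
+# predict(twoPredictorModel, newdata = data.frame(UNEM = 9, HGRAD = 100000))
+</code>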
-The relationship between enrollment and unemployment, high school graduates, and income
+The relationship between enrollment and unemployment, high school graduates, and income. That is,
-<code>threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC, data)
+  * dv = ROLL (Enrollment)
-></code
+  * iv = UNEM, HGRAD, INC
+<code>threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC, data)</code>
 <code>threePredictorModel
@@ Line 406: / Line 489: @@
 F-statistic: 211.5 on 3 and 25 DF,  p-value: < 2.2e-16
-></code>
+</code>
 How to get **beta coefficients**((beta weights, beta values)) for predictor variables?
@@ Line 420: / Line 503: @@
 How to compare each model (with incremental IVs)
-<codeanova(onePredictorModel, twoPredictorModel, threePredictorModel)
+<code>
+anova(onePredictorModel, twoPredictorModel, threePredictorModel)
 Analysis of Variance Table
@@ Line 442: / Line 526: @@
 # Import data (simulated data for this example)
 myData <- read.csv('http://static.lib.virginia.edu/statlab/materials/data/hierarchicalRegressionData.csv')
+# or
+# myData <- read.csv('http://commres.net/wiki/_media/r/hierarchicalregressiondata.csv')
-# Build models
+# Build models to check whether adding variables is worthwhile.
 m0 <- lm(happiness ~ 1, data=myData)  # to obtain Total SS
 m1 <- lm(happiness ~ age + gender, data=myData)  # Model 1
@@ Line 457: / Line 543: @@
 Residuals 99 240.84  2.4327
 </code>
-<code>anova(m1, m2, m3)  # model comparison
-Analysis of Variance Table
-Model 1: happiness ~ age + gender
-Model 2: happiness ~ age + gender + friends
-Model 3: happiness ~ age + gender + friends + pets
-  Res.Df    RSS Df Sum of Sq       F    Pr(>F)
-     97 233.97
-     96 209.27  1    24.696 12.1293 0.0007521 ***
-     95 193.42  1    15.846  7.7828 0.0063739 **
----
-Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
-</code>
-  * Model 0: SS<sub>Total</sub= 240.84 (no predictors)
-  * Model 1: SS<sub>Residual</sub= 233.97 (after adding age and gender)
-  * Model 2: SS<sub>Residual</sub= 209.27,
-    * SS<sub>Difference</sub= 233.97 - 209.27 = 24.696,
-    * FF(1,96) = 12.1293, pp = 0.0007521 (after adding friends)
-  * Model 3: SS<sub>Residual</sub= 193.42,
-    * SS<sub>Difference</sub= 209.27 - 193.42 = 15.846,
-    * FF(1,95) = 7.7828, pp = 0.0063739 (after adding pets)
 <code>summary(m1)
@@ Line 494: / Line 557: @@
 Multiple R-squared:  0.02855,	Adjusted R-squared:  0.008515
 F-statistic: 1.425 on 2 and 97 DF,  p-value: 0.2455
+</code>
+<code>
 summary(m2)
@@ Line 509: / Line 573: @@
 Multiple R-squared:  0.1311,	Adjusted R-squared:  0.1039
 F-statistic: 4.828 on 3 and 96 DF,  p-value: 0.003573
+</code>
+<code>
 summary(m3)
@@ Line 527: / Line 593: @@
 </code>
+<code>
+> lm.beta(m3)
+        age  genderMale     friends        pets
+-0.14098154 -0.04484095  0.28909280  0.27446786
+</code>
+<code>
+anova(m0,m1,m2,m3)
+Analysis of Variance Table
+Model 1: happiness ~ 1
+Model 2: happiness ~ age + gender
+Model 3: happiness ~ age + gender + friends
+Model 4: happiness ~ age + gender + friends + pets
+  Res.Df    RSS Df Sum of Sq       F    Pr(>F)
+     99 240.84
+     97 233.97  2    6.8748  1.6883 0.1903349
+     96 209.27  1   24.6957 12.1293 0.0007521 ***
+     95 193.42  1   15.8461  7.7828 0.0063739 **
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+</code>
+  * Model 0: SS<sub>Total</sub>= 240.84 (no predictors)
+  * Model 1: SS<sub>Residual</sub>= 233.97 (after adding age and gender)
+  * Model 2: SS<sub>Residual</sub>= 209.27,
+    * SS<sub>Difference</sub>= 233.97 - 209.27 = 24.696,
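+    * F(1,95) = 12.1293, p value = 0.0007521 (after adding friends; anova() tests each step against the residual mean square of the largest model, m3, so the denominator df is 95)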
+  * Model 3: SS<sub>Residual</sub>= 193.42,
+    * SS<sub>Difference</sub>= 209.27 - 193.42 = 15.846,
+    * F(1,95) = 7.7828, p value = 0.0063739 (after adding pets)
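+The F values in the table can be rebuilt from the residual sums of squares. A sketch using the rounded RSS and Res.Df columns printed above (the variable names are made up for illustration). When several models are compared, anova() divides each step's mean square by the residual mean square of the largest model, here m3 with 95 df:
+<code>
+rss    <- c(m0 = 240.84, m1 = 233.97, m2 = 209.27, m3 = 193.42)  # RSS column
+res.df <- c(99, 97, 96, 95)                                      # Res.Df column
+ss.diff <- -diff(rss)     # SS explained by each added block of predictors
+df.diff <- -diff(res.df)  # number of coefficients added at each step: 2, 1, 1
+ms.res.full <- rss[["m3"]] / res.df[4]  # residual mean square of m3
+f <- (ss.diff / df.diff) / ms.res.full
+p <- pf(f, df.diff, res.df[4], lower.tail = FALSE)
+round(cbind(ss.diff, df.diff, f, p), 4)
+</code>
+Up to rounding of the inputs, f and p reproduce the F and Pr(>F) columns of the anova(m0,m1,m2,m3) table.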
 {{https://data.library.virginia.edu/files/Park.png}}