Table of Contents
Sequential or Hierarchical regression
연구자가 판단하여 독립변인들 중 필요한 것들을 묶어서 스테이지 별로 (단계 별) 넣고 분석하는 것을 말한다. Stepwise regression은 이를 컴퓨터나 계산방법을 통하여 수행하게 된다.
데이터
DATA for regression analysis | ||
---|---|---|
bankaccount | income | famnum |
6 | 220 | 5 |
5 | 190 | 6 |
7 | 260 | 3 |
7 | 200 | 4 |
8 | 330 | 2 |
10 | 490 | 4 |
8 | 210 | 3 |
11 | 380 | 2 |
9 | 320 | 1 |
9 | 270 | 3 |
datavar <- read.csv("http://commres.net/wiki/_media/regression01-bankaccount.csv")
Enter
Model Summaryb | |||||||||
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | Change Statistics | ||||
R Square Change | F Change | df1 | df2 | Sig. F Change | |||||
1 | .893a | .798 | .740 | .930 | .798 | 13.838 | 2 | 7 | .004 |
a. Predictors: (Constant), 가족숫자, 수입 | |||||||||
b. Dependent Variable: 통장갯수 |
ANOVAa | ||||||
Model | Sum of Squares | df | Mean Square | F | Sig. | |
1 | Regression | 23.944 | 2 | 11.972 | 13.838 | .004b |
Residual | 6.056 | 7 | .865 | |||
Total | 30.000 | 9 | ||||
a. Dependent Variable: 통장갯수 | ||||||
b. Predictors: (Constant), 가족숫자, 수입 |
Coefficientsa | |||||||||
Model | Unstandardized Coefficients | Standardized Coefficients | t | Sig. | Correlations | ||||
B | Std. Error | Beta | Zero-order | Partial | Part | ||||
1 | (Constant) | 6.399 | 1.517 | 4.220 | .004 | ||||
수입 | .012 | .004 | .616 | 3.325 | .013 | .794 | .783 | .565 | |
가족숫자 | -.545 | .226 | -.446 | -2.406 | .047 | -.692 | -.673 | -.409 | |
a. Dependent Variable: 통장갯수 |
$\hat{Y} = 6.399 + .012 X_{1} + -.545 X_{2} $
The below is just an exercise for figuring out the unique part of r2 value for x1 and x2 (수입, 가족수). For more information see part and zero-order relationship: see determining_ivs_role in multiple regression
zero-order | part = semi-partial | ||
x1 | x2 | x1p | x2p |
.794 | -.692 | .565 | -.409 |
zero-order square | part (in spss) = semipartial (in general) | ||
x1 zsq (x1zsq) | x2 zsq (x1zsq) | x1 semi-partial (or part) sq (x1spsq) | x2 part sq (x1spsq) |
.630436 | .478864 | .319225 | .167281 |
a+b / a+b+c+d | b+c / a+b+c+d | a / a+b+c+d | c / a+b+c+d |
x1zsq - x1spsq ~= x2zsq - x2spsq
0.311211 ~= 0.311583
아래는 r 에서 계산한 것
> .794^2 - .565^2 [1] 0.3112 > .692^2 - .409^2 [1] 0.3116
R에서 보는 예는 아래를 참조
Seq.
Model Summaryc | |||||||||
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | Change Statistics | ||||
R Square Change | F Change | df1 | df2 | Sig. F Change | |||||
1 | .794a | .631 | .585 | 1.176 | .631 | 13.687 | 1 | 8 | .006 |
2 | .893b | .798 | .740 | .930 | .167 | 5.791 | 1 | 7 | .047 |
a. Predictors: (Constant), 수입 | |||||||||
b. Predictors: (Constant), 수입, 가족숫자 | |||||||||
c. Dependent Variable: 통장갯수 |
증가한 r2값에 대한 F-test 결과는 Fdiff=5.791, p = .047 (less than .05)
ANOVAa | ||||||
Model | Sum of Squares | df | Mean Square | F | Sig. | |
1 | Regression | 18.934 | 1 | 18.934 | 13.687 | .006b |
Residual | 11.066 | 8 | 1.383 | |||
Total | 30.000 | 9 | ||||
2 | Regression | 23.944 | 2 | 11.972 | 13.838 | .004c |
Residual | 6.056 | 7 | .865 | |||
Total | 30.000 | 9 | ||||
a. Dependent Variable: 통장갯수 | ||||||
b. Predictors: (Constant), 수입 | ||||||
c. Predictors: (Constant), 수입, 가족숫자 |
Coefficientsa | |||||||||
Model | Unstandardized Coefficients | Standardized Coefficients | t | Sig. | Correlations | ||||
B | Std. Error | Beta | Zero-order | Partial | Part | ||||
1 | (Constant) | 3.618 | 1.242 | 2.914 | .019 | ||||
수입 | .015 | .004 | .794 | 3.700 | .006 | .794 | .794 | .794 | |
2 | (Constant) | 6.399 | 1.517 | 4.220 | .004 | ||||
수입 | .012 | .004 | .616 | 3.325 | .013 | .794 | .783 | .565 | |
가족숫자 | -.545 | .226 | -.446 | -2.406 | .047 | -.692 | -.673 | -.409 | |
a. Dependent Variable: 통장갯수 |
http://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/hier
https://ww2.coastal.edu/kingw/statistics/R-tutorials/multregr.html
r
datavar <- read.csv("http://commres.net/wiki/_media/regression01-bankaccount.csv") datavar m1 <- lm(bankaccount~income+famnum, data=datavar) summary(m1) library(ppcor) spcor(datavar) pcor(datavar)
> datavar <- read.csv("http://commres.net/wiki/_media/regression01-bankaccount.csv") > datavar bankaccount income famnum 1 6 220 5 2 5 190 6 3 7 260 3 4 7 200 4 5 8 330 2 6 10 490 4 7 8 210 3 8 11 380 2 9 9 320 1 10 9 270 3 > m1 <- lm(bankaccount~income+famnum, data=datavar) > summary(m1) Call: lm(formula = bankaccount ~ income + famnum, data = datavar) Residuals: Min 1Q Median 3Q Max -1.2173 -0.5779 -0.1515 0.6642 1.1906 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.399103 1.516539 4.220 0.00394 ** income 0.011841 0.003561 3.325 0.01268 * famnum -0.544727 0.226364 -2.406 0.04702 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.9301 on 7 degrees of freedom Multiple R-squared: 0.7981, Adjusted R-squared: 0.7404 F-statistic: 13.84 on 2 and 7 DF, p-value: 0.003696 > library(ppcor) > spcor(datavar) $estimate bankaccount income famnum bankaccount 1.0000000 0.5646726 -0.4086619 income 0.7171965 1.0000000 0.2078919 famnum -0.6166940 0.2470028 1.0000000 $p.value bankaccount income famnum bankaccount 0.00000000 0.113182 0.2748117 income 0.02964029 0.000000 0.5914441 famnum 0.07691195 0.521696 0.0000000 $statistic bankaccount income famnum bankaccount 0.000000 1.8101977 -1.1846548 income 2.722920 0.0000000 0.5623159 famnum -2.072679 0.6744045 0.0000000 $n [1] 10 $gp [1] 1 $method [1] "pearson" > pcor(datavar) $estimate bankaccount income famnum bankaccount 1.0000000 0.7825112 -0.6728560 income 0.7825112 1.0000000 0.3422911 famnum -0.6728560 0.3422911 1.0000000 $p.value bankaccount income famnum bankaccount 0.00000000 0.01267595 0.04702022 income 0.01267595 0.00000000 0.36723388 famnum 0.04702022 0.36723388 0.00000000 $statistic bankaccount income famnum bankaccount 0.000000 3.3251023 -2.4064253 income 3.325102 0.0000000 0.9638389 famnum -2.406425 0.9638389 0.0000000 $n [1] 10 $gp [1] 1 $method [1] "pearson" ## zero-order correlation > cor(datavar) bankaccount income famnum bankaccount 1.0000000 0.7944312 -0.6922935 income 0.7944312 1.0000000 -0.3999614 famnum -0.6922935 -0.3999614 1.0000000 >
semipartial (part): spcor()
bankaccount | income | famnum | |
bankaccount | 1.0000000 | 0.5646726$(1)$ | -0.4086619$(2)$ |
income | 0.7171965 | 1.0000000 | 0.2078919 |
famnum | -0.6166940 | 0.2470028 | 1.0000000 |
bankaccount | income | famnum | |
bankaccount | 1.0000000 | 0.7825112 | -0.6728560 |
income | 0.7825112 | 1.0000000 | 0.3422911 |
famnum | -0.6728560 | 0.3422911 | 1.0000000 |
bankaccount | income | famnum | |
bankaccount | 1.0000000 | 0.7944312$(3)$ | -0.6922935$(4)$ |
income | 0.7944312 | 1.0000000 | -0.3999614 |
famnum | -0.6922935 | -0.3999614 | 1.0000000 |
sp.b.i <- 0.5646726 ## (1) c.b.i <- 0.7944312 ## (3) sp.b.f <- -0.4086619 ## (2) c.b.f <- -0.6922935 ## (4) c.b.i.sq <- c.b.i^2 ## (3)^2 sp.b.i.sq <- sp.b.i^2 ## (1)^2 c.b.i.sq - sp.b.i.sq c.b.f.sq <- c.b.f^2 ## (4)^2 sp.b.f.sq <- sp.b.f^2 ## (1)^2 c.b.f.sq - sp.b.f.sq
> sp.b.i <- 0.5646726 > c.b.i <- 0.7944312 > > sp.b.f <- -0.4086619 > c.b.f <- -0.6922935 > > c.b.i.sq <- c.b.i^2 ## (3)^2 > sp.b.i.sq <- sp.b.i^2 > > c.b.i.sq - sp.b.i.sq [1] 0.3123 > > c.b.f.sq <- c.b.f^2 ## (4)^2 > sp.b.f.sq <- sp.b.f^2 > > c.b.f.sq - sp.b.f.sq [1] 0.3123
0.3123 가 두 독립변인이 DV에 같이 (공히) 미치는 영향력 분량이다.
pp.b.i <- pcor.test(datavar$bankaccount, datavar$income, datavar$famnum) p.b.i p.b.i$estimate p.b.f <- pcor.test(datavar$bankaccount, datavar$famnum, datavar$income) p.b.f p.b.f$estimate sp.b.i <- spcor.test(datavar$bankaccount, datavar$income, datavar$famnum) sp.b.i sp.b.i$estimate sp.b.f <- spcor.test(datavar$bankaccount, datavar$famnum, datavar$income) sp.b.f sp.b.f$estimate zc.b.i <- cor(datavar$bankaccount, datavar$income) zc.b.i zc.b.f <- cor(datavar$bankaccount, datavar$famnum) zc.b.f zc.b.i^2 - (sp.b.i$estimate)^2 zc.b.f^2 - (sp.b.f$estimate)^2
. . .
> pp.b.i <- pcor.test(datavar$bankaccount, datavar$income, datavar$famnum) > p.b.i estimate p.value statistic n gp Method 1 0.7825 0.01268 3.325 10 1 pearson > p.b.i$estimate [1] 0.7825 > > p.b.f <- pcor.test(datavar$bankaccount, datavar$famnum, datavar$income) > p.b.f estimate p.value statistic n gp Method 1 -0.6729 0.04702 -2.406 10 1 pearson > p.b.f$estimate [1] -0.6729 > > sp.b.i <- spcor.test(datavar$bankaccount, datavar$income, datavar$famnum) > sp.b.i estimate p.value statistic n gp Method 1 0.5647 0.1132 1.81 10 1 pearson > sp.b.i$estimate [1] 0.5647 > sp.b.f <- spcor.test(datavar$bankaccount, datavar$famnum, datavar$income) > sp.b.f estimate p.value statistic n gp Method 1 -0.4087 0.2748 -1.185 10 1 pearson > sp.b.f$estimate [1] -0.4087 > > > zc.b.i <- cor(datavar$bankaccount, datavar$income) > zc.b.i [1] 0.7944 > zc.b.f <- cor(datavar$bankaccount, datavar$famnum) > zc.b.f [1] -0.6923 > > zc.b.i^2 - (sp.b.i$estimate)^2 [1] 0.3123 > zc.b.f^2 - (sp.b.f$estimate)^2 [1] 0.3123 > > >
e.g. 3. College enrollment in New Mexico University
> datavar <- read.csv("http://commres.net/wiki/_media/r/dataset_hlr.csv") > str(datavar) 'data.frame': 29 obs. of 5 variables: $ YEAR : int 1 2 3 4 5 6 7 8 9 10 ... $ ROLL : int 5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ... $ UNEM : num 8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ... $ HGRAD: int 9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ... $ INC : int 1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ... >
onePredictorModel <- lm(ROLL ~ UNEM, data = datavar) twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, data = datavar) threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC, data = datavar)
summary(onePredictorModel) summary(twoPredictorModel) summary(threePredictorModel)
> summary(onePredictorModel) Call: lm(formula = ROLL ~ UNEM, data = datavar) Residuals: Min 1Q Median 3Q Max -7640.0 -1046.5 602.8 1934.3 4187.2 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3957.0 4000.1 0.989 0.3313 UNEM 1133.8 513.1 2.210 0.0358 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3049 on 27 degrees of freedom Multiple R-squared: 0.1531, Adjusted R-squared: 0.1218 F-statistic: 4.883 on 1 and 27 DF, p-value: 0.03579
> summary(twoPredictorModel) Call: lm(formula = ROLL ~ UNEM + HGRAD, data = datavar) Residuals: Min 1Q Median 3Q Max -2102.2 -861.6 -349.4 374.5 3603.5 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -8.256e+03 2.052e+03 -4.023 0.00044 *** UNEM 6.983e+02 2.244e+02 3.111 0.00449 ** HGRAD 9.423e-01 8.613e-02 10.941 3.16e-11 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1313 on 26 degrees of freedom Multiple R-squared: 0.8489, Adjusted R-squared: 0.8373 F-statistic: 73.03 on 2 and 26 DF, p-value: 2.144e-11 >
> summary(threePredictorModel) Call: lm(formula = ROLL ~ UNEM + HGRAD + INC, data = datavar) Residuals: Min 1Q Median 3Q Max -1148.84 -489.71 -1.88 387.40 1425.75 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -9.153e+03 1.053e+03 -8.691 5.02e-09 *** UNEM 4.501e+02 1.182e+02 3.809 0.000807 *** HGRAD 4.065e-01 7.602e-02 5.347 1.52e-05 *** INC 4.275e+00 4.947e-01 8.642 5.59e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 670.4 on 25 degrees of freedom Multiple R-squared: 0.9621, Adjusted R-squared: 0.9576 F-statistic: 211.5 on 3 and 25 DF, p-value: < 2.2e-16
anova(onePredictorModel, twoPredictorModel, threePredictorModel) Analysis of Variance Table Model 1: ROLL ~ UNEM Model 2: ROLL ~ UNEM + HGRAD Model 3: ROLL ~ UNEM + HGRAD + INC Res.Df RSS Df Sum of Sq F Pr(>F) 1 27 251084710 2 26 44805568 1 206279143 458.92 < 2.2e-16 *** 3 25 11237313 1 33568255 74.68 5.594e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 >
e.g. 4. Happiness
hierarchical.regression.data.csv
# Import data (simulated data for this example) # myData <- read.csv('http://static.lib.virginia.edu/statlab/materials/data/hierarchicalRegressionData.csv') myData <- read.csv("http://commres.net/wiki/_media/hierarchical.regression.data.csv")
> str(myData) 'data.frame': 100 obs. of 5 variables: $ happiness: int 5 5 6 4 3 5 5 5 4 4 ... $ age : int 24 28 25 26 20 25 24 24 26 26 ... $ gender : chr "Male" "Male" "Female" "Male" ... $ friends : int 12 8 6 4 8 9 5 6 8 6 ... $ pets : int 3 1 0 2 0 0 5 2 1 4 ... > myData$gender <- factor(myData$gender) > str(myData) 'data.frame': 100 obs. of 5 variables: $ happiness: int 5 5 6 4 3 5 5 5 4 4 ... $ age : int 24 28 25 26 20 25 24 24 26 26 ... $ gender : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 2 2 2 2 ... $ friends : int 12 8 6 4 8 9 5 6 8 6 ... $ pets : int 3 1 0 2 0 0 5 2 1 4 ... >
> m0 <- lm(happiness ~ 1, data = myData) > anova(m0) Analysis of Variance Table Response: happiness Df Sum Sq Mean Sq F value Pr(>F) Residuals 99 240.84 2.4327 >
# 불필요하지만 위의 분석이 variance와 # 같은 것이라는 것을 아래처럼 확인한다. > attach(myData) The following objects are masked from myData (pos = 3): age, friends, gender, happiness, pets > var(happiness) [1] 2.432727 > length(happiness) [1] 100 > df.happiness <- length(happiness) - 1 > df.happiness # degrees of freedom [1] 99 > ss.happiness <- var(happiness)* df.happiness # sum of square (ss) value for happiness variable > ss.happiness [1] 240.84 >
> m1 <- lm(happiness ~ age + gender, data=myData) # Model 1 > summary(m1) Call: lm(formula = happiness ~ age + gender, data = myData) Residuals: Min 1Q Median 3Q Max -3.6688 -1.0094 -0.1472 0.8273 4.2973 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.66778 2.01364 3.808 0.000246 *** age -0.13039 0.07936 -1.643 0.103611 genderMale 0.16430 0.31938 0.514 0.608106 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.553 on 97 degrees of freedom Multiple R-squared: 0.02855, Adjusted R-squared: 0.008515 F-statistic: 1.425 on 2 and 97 DF, p-value: 0.2455 >
# m1은 이미 위에서 실행 > m2 <- lm(happiness ~ age + gender + friends, data=myData) # Model 2: Adding friends variable > m3 <- lm(happiness ~ age + gender + friends + pets, data = myData) # Model 3: Adding pets variable
> anova(m1, m2, m3) Analysis of Variance Table Model 1: happiness ~ age + gender Model 2: happiness ~ age + gender + friends Model 3: happiness ~ age + gender + friends + pets Res.Df RSS Df Sum of Sq F Pr(>F) 1 97 233.97 2 96 209.27 1 24.696 12.1293 0.0007521 *** 3 95 193.42 1 15.846 7.7828 0.0063739 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 >
> summary(m1) Call: lm(formula = happiness ~ age + gender, data = myData) Residuals: Min 1Q Median 3Q Max -3.6688 -1.0094 -0.1472 0.8273 4.2973 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.66778 2.01364 3.808 0.000246 *** age -0.13039 0.07936 -1.643 0.103611 genderMale 0.16430 0.31938 0.514 0.608106 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.553 on 97 degrees of freedom Multiple R-squared: 0.02855, Adjusted R-squared: 0.008515 F-statistic: 1.425 on 2 and 97 DF, p-value: 0.2455 > summary(m2) Call: lm(formula = happiness ~ age + gender + friends, data = myData) Residuals: Min 1Q Median 3Q Max -3.5758 -1.0204 0.0156 0.8087 3.7299 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.21730 1.96220 3.169 0.00206 ** age -0.12479 0.07546 -1.654 0.10146 genderMale 0.14931 0.30365 0.492 0.62405 friends 0.18985 0.05640 3.366 0.00110 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.476 on 96 degrees of freedom Multiple R-squared: 0.1311, Adjusted R-squared: 0.1039 F-statistic: 4.828 on 3 and 96 DF, p-value: 0.003573 > summary(m3) Call: lm(formula = happiness ~ age + gender + friends + pets, data = myData) Residuals: Min 1Q Median 3Q Max -3.0556 -1.0183 -0.1109 0.8832 3.5911 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.78540 1.90266 3.041 0.00305 ** age -0.11146 0.07309 -1.525 0.13057 genderMale -0.14267 0.31157 -0.458 0.64806 friends 0.17134 0.05491 3.120 0.00239 ** pets 0.36391 0.13044 2.790 0.00637 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.427 on 95 degrees of freedom Multiple R-squared: 0.1969, Adjusted R-squared: 0.1631 F-statistic: 5.822 on 4 and 95 DF, p-value: 0.0003105 >