{{keywords> multiple regression, statistics, research methods}}
====== Multiple Regression ======
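See also [[:Regression]] or [[:Regression|simple regression]] \\ 
[[regression|Simple regression]] and multiple regression are among the most widely used methods for examining the relationship between one dependent variable and one or more independent variables (note the plural in the multiple case). The terms correlation and regression are often used interchangeably without a sharp distinction. If a distinction must be drawn, regression is used more for prediction, whereas correlation is used more for examining the relationship between variables. 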

For simple [[:Regression|linear regression]], evaluating r<sup>2</sup> is relatively straightforward, because only one b value is responsible for r<sup>2</sup>. Suppose the regression yields the coefficients a (the constant) and b (the coefficient for X):

$$Y = a + bX$$

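If the F value for this regression equation is statistically significant, then the b value of X, the only variable contributing to it, accounts for all of it. When two or more [[:types_of_variables#independent|independent variables]] are used in the regression, however, the story changes. 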
 
Multiple regression is used in many disciplines. For example, Baldry ((Baldry, A. C. (2003). Bullying in schools and exposure to domestic violence. Child Abuse & Neglect, 27(7), 713-732. doi: 10.1016/S0145-2134(03)00114-5. {{Bullying in schools and exposure to domestic violence.pdf}} )) used multiple regression to examine the factors (variables) that influence children's bullying behavior. Using [[:hierarchical regression]], also called [[:sequential regression]], Baldry first entered age and gender (male, female) as variables explaining children's violent behavior. The second step added the father's verbal and physical violence toward the mother (abuse; the mother's violence was excluded at this step), and the final step added the mother's verbal and physical violence toward the father. According to the results, the father's violence was not associated with the child's violent behavior, whereas the mother's violence, together with gender and age, was more strongly associated with it. The four variables above explained only 14% of the variance in children's violent behavior (see the reference above). 

Yang and colleagues ((Yang, H.-E., Wu, C.-C., & Wang, K.-C. (2009). An empirical analysis of online game service satisfaction and loyalty. Expert Systems with Applications, 36(2, Part 1), 1816-1825. {{An empirical analysis of online game service satisfaction and loyalty.pdf}} )) proposed service quality, transaction cost, and experiential value as variables that influence satisfaction with and loyalty to online game services, and studied the relationships among them. According to their results, all three variables affected satisfaction with the game service, and satisfaction in turn affected loyalty (a mediating effect). They used path analysis, which can also estimate the relationships among the IVs, as their analytic tool. Path analysis is itself a kind of regression method.

Rice and Katz ((Rice, R. E., & Katz, J. E. (2008). Assessing new cell phone text and video services. Telecommunications Policy, 32(7), 455-467. {{Assessing new cell phone text and video services.pdf}} )) examined the variables that influence interest in the growing range of mobile phone services. They proposed demographic characteristics, social factors, and prior use of similar technologies (here, the Internet and cell phones). The factors under each of these were grouped as follows.

__IVs__ 
  * Demographic characteristics 
      * Education  
      * Marital Status 
      * Age 
      * Gender 
      * Race/Ethnicity 
      * Income 
  * Social Factors 
      * Physical distance from family and friends 
      * Social support 
      * Right to privacy 
      * Threats to privacy 
  * Experience with prior technologies 
      * Internet adoption/usage 
      * Cell phone adoption/usage 

__DV__  
  * Evaluation of cell phone video and text services 
      * Surveillance services 
      * Entertainment services 
      * Instrumental services 

The analysis found that the demographic factors other than gender had a positive influence, direct or indirect, on evaluations of the three kinds of services, and that the indirect influence was mediated by social support. Among the social factors, respondents who lived physically closer to and were on closer terms with family and friends were more favorable toward instrumental services (such as location tracking), and respondents who placed more importance on privacy evaluated emergency services more positively. Overall, however, the privacy-related factors showed only a minimal effect on cell phone use. Finally, the less respondents used the Internet and cell phones, the more favorable they were toward instrumental services (such as location tracking).((Related paper: {{:public views of mobile medical devices and services.pdf}} )) ((Additional reading: Finocchiaro Castro, M. (2008). Where are you from? Cultural differences in public good experiments. Journal of Socio-Economics, 37(6), 2319-2329. {{:cultural differences in public good experiments.pdf}} )) 
<WRAP clear />

====== e.g. ======
Data set again. 
<code>
datavar <- read.csv("http://commres.net/wiki/_media/regression01-bankaccount.csv") </code>

^  DATA for regression analysis   ^^^
| bankaccount   | income   | famnum  | 
| 6   | 220   | 5  | 
| 5   | 190   | 6  | 
| 7   | 260   | 3  | 
| 7   | 200   | 4  | 
| 8   | 330   | 2  | 
| 10   | 490   | 4  | 
| 8   | 210   | 3  | 
| 11   | 380   | 2  | 
| 9   | 320   | 1  | 
| 9   | 270   | 3  | 

<code>    account       income        fammember   
 Min.   : 5   Min.   :190.0   Min.   :1.00  
 1st Qu.: 7   1st Qu.:212.5   1st Qu.:2.25  
 Median : 8   Median :265.0   Median :3.00  
 Mean   : 8   Mean   :287.0   Mean   :3.30  
 3rd Qu.: 9   3rd Qu.:327.5   3rd Qu.:4.00  
 Max.   :11   Max.   :490.0   Max.   :6.00  
</code>

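The following shows how the variance (MS) is computed. In the table, the error column is the error made when each individual score is predicted by the mean ($\overline{Y}=8$). Each error is squared (error<sup>2</sup>) and the squares are summed ($SS_{total} = 30$). 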
^  prediction for y values with $\overline{Y}$  ^^^
| bankaccount   | error   | error<sup>2</sup>  | 
| 6   | -2   | 4  | 
| 5   | -3   | 9  | 
| 7   | -1   | 1  | 
| 7   | -1   | 1  | 
| 8   | 0   | 0  | 
| 10   | 2   | 4  | 
| 8   | 0   | 0  | 
| 11   | 3   | 9  | 
| 9   | 1   | 1  | 
| 9   | 1   | 1  | 
|  $\overline{Y}=8$   |    |  $SS_{total} = 30$   | 
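
The table above can be reproduced with a few lines of R. This is only a minimal sketch: the scores are typed in from the bankaccount column rather than read from the CSV file, and the object names (''y_bar'', ''ss_total'', and so on) are made up for illustration.

<code>
# Scores copied from the bankaccount column of the table above
bankaccount <- c(6, 5, 7, 7, 8, 10, 8, 11, 9, 9)

y_bar    <- mean(bankaccount)         # prediction that ignores every predictor
error    <- bankaccount - y_bar       # deviation of each score from the mean
ss_total <- sum(error^2)              # sum of squared errors
df_total <- length(bankaccount) - 1   # n - 1
ms_total <- ss_total / df_total       # variance (MS); same value as var(bankaccount)

print(data.frame(bankaccount, error, error_sq = error^2))
print(c(mean = y_bar, SS_total = ss_total, df = df_total, MS = ms_total))
</code>

Dividing $SS_{total}$ by its degrees of freedom gives the variance of the scores, which is the baseline that the regression model has to improve on.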

Regression output (using R)

<code>Residuals:
    Min      1Q  Median      3Q     Max 
-1.2173 -0.5779 -0.1515  0.6642  1.1906 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  6.399103   1.516539   4.220  0.00394 **
fammember   -0.544727   0.226364  -2.406  0.04702 * 
income       0.011841   0.003561   3.325  0.01268 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.9301 on 7 degrees of freedom
Multiple R-squared: 0.7981,	Adjusted R-squared: 0.7404 
F-statistic: 13.84 on 2 and 7 DF,  p-value: 0.003696 
</code>

$$\hat{Y} = 6.399103 + (-0.544727) \text{fammember} + (0.011841) \text{income} $$

Computing the predicted values ( $\hat{Y}$ ) from the regression equation above gives:

^  Another X, X<sub>2</sub>  ^^^^
|  bankaccount   |  pred2   |  error2   |  error<sup>2</sup>   | 
|  $Y_{i}$   |  $\hat{Y}$   |  $\hat{Y}-Y_{i}$ = error   |  error<sup>2</sup>   | 
|  6.000    |  6.281    |  0.281    |  0.079   | 
|  5.000    |  5.381    |  0.381    |  0.145   | 
|  7.000    |  7.844    |  0.844    |  0.712   | 
|  7.000    |  6.588    |  -0.412    |  0.169   | 
|  8.000    |  9.217    |  1.217    |  1.482   | 
|  10.000    |  10.023    |  0.023    |  0.001   | 
|  8.000    |  7.252    |  -0.748    |  0.560   | 
|  11.000    |  9.809    |  -1.191    |  1.418   | 
|  9.000    |  9.644    |  0.644    |  0.414   | 
|  9.000    |  7.962    |  -1.038    |  1.077   | 
|     |     |  SS<sub>res</sub>   |  6.056   | 
|     |     |  SS<sub>reg</sub>   |  23.944   | 
Still, 

$$SS_{total} = 30$$

Now, by entering another variable, X<sub>2</sub> = number of family members, we get:

$$SS_{unexplained} = 6.056$$
$$SS_{explained} = 23.944$$
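
The split of $SS_{total}$ into explained and unexplained parts can be checked in R. The sketch below types the ten rows of the data table in by hand and uses the column names from that table (''bankaccount'', ''income'', ''famnum''). These names are an assumption: the CSV file may name its columns differently (the R output above shows ''fammember'').

<code>
# Ten rows copied from the data table above (column names are assumed)
bank <- data.frame(
  bankaccount = c(6, 5, 7, 7, 8, 10, 8, 11, 9, 9),
  income      = c(220, 190, 260, 200, 330, 490, 210, 380, 320, 270),
  famnum      = c(5, 6, 3, 4, 2, 4, 3, 2, 1, 3)
)

fit  <- lm(bankaccount ~ famnum + income, data = bank)
pred <- fitted(fit)                                  # Y-hat for every row

ss_tot <- sum((bank$bankaccount - mean(bank$bankaccount))^2)
ss_res <- sum((bank$bankaccount - pred)^2)           # unexplained part
ss_reg <- sum((pred - mean(bank$bankaccount))^2)     # explained part

print(round(c(SS_total = ss_tot, SS_res = ss_res, SS_reg = ss_reg), 3))
</code>

Because the model contains an intercept, the two parts add up to $SS_{total}$.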

What, then, are the R<sup>2</sup>, F, and b values?

^  Model Summary(b)   ^^^^^ 
|  Model   |  R   |  R Square   |  Adjusted R Square   |  Std. Error of the Estimate   | 
|  1.000    |  0.893    |@orange:  0.798    |  0.740    |  0.930   | 
| a Predictors: (Constant), bankfam, bankIncome income \\ b Dependent Variable: bankbook number of bank   ||||| 

$$ R^2 = \frac{SS_{reg}}{SS_{tot}} = 0.798$$


^  ANOVA(b)   ^^^^^^^
|  Model   |      |  Sum of Squares   |  df   |  Mean Square   |  F   |  Sig.  | 
|  1.000    |  Regression   |  23.944    |  2.000    |  11.972    |  13.838    |  0.004   | 
|     |  Residual   |  6.056    |  7.000    |  0.865    |     |    | 
|     |  Total   |  30.000    |  9.000    |     |     |    | 
| a Predictors: (Constant), bankfam, bankIncome income \\ b Dependent Variable: bankbook number of bank   ||||||| 
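
Every figure in the Model Summary and ANOVA tables follows from the two sums of squares. As a rough check, the arithmetic can be written out in R. The sketch starts from the rounded SS values printed in the table, so the results agree with the table only to about three decimals, and the variable names are illustrative.

<code>
# Values taken from the ANOVA table above
ss_reg <- 23.944   # regression (explained) sum of squares
ss_res <- 6.056    # residual (unexplained) sum of squares
n      <- 10       # number of cases
k      <- 2        # number of predictors

ss_tot <- ss_reg + ss_res
r2     <- ss_reg / ss_tot                          # R Square
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)     # Adjusted R Square
ms_reg <- ss_reg / k                               # Mean Square, regression
ms_res <- ss_res / (n - k - 1)                     # Mean Square, residual
f_val  <- ms_reg / ms_res                          # F
p_val  <- pf(f_val, k, n - k - 1, lower.tail = FALSE)
se_est <- sqrt(ms_res)                             # Std. Error of the Estimate

print(round(c(R2 = r2, adj_R2 = adj_r2, F = f_val, p = p_val, SE = se_est), 3))
</code>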


^  Coefficients(a)  ^^^^^^^
|  Model   |      |  Unstandardized \\ Coefficients   |     |  Standardized \\ Coefficients   |  t   |  Sig.  | 
|     |     |  B   |  Std. Error   |  Beta   |     |    | 
|  1.000    |  (Constant)   |  6.399    |  1.517    |     |  4.220    |  0.004   | 
|     |  income   |  0.012    |  0.004    |  0.616    |  3.325    |  0.013   | 
|     |  bankfam   |  -0.545    |  0.226    |  -0.446    |  -2.406    |  0.047   | 
| a Dependent Variable: bankbook  number of bank   |||||||


====== Slope test ======

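The significance of each b (coefficient) is tested with a t-test. A t-test essentially divides the treatment effect (the effect of the independent variable, or a difference) by the random error, that is, the standard error. In the table above, the t value for income is therefore 0.012/0.004 and the t value for bankfam is -0.545/0.226. Because the table rounds B and the standard error to three decimals, these ratios reproduce the reported t values only approximately. The unrounded values in the R output give 0.011841/0.003561 ≈ 3.325 and -0.544727/0.226364 ≈ -2.406. 

With a single independent variable, the t value for its slope is the square root of the model's F value, so t<sup>2</sup> = F. With two or more independent variables, the overall F tests all slopes jointly and the independent variables are usually correlated with one another, so the square of an individual t value does not in general equal F.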
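
Both points can be illustrated with the ten-row bank data. The sketch below is an illustration under assumptions rather than the original analysis: the data are typed in by hand, the column names follow the data table, and the object names are made up.

<code>
# Ten rows copied from the data table (column names are assumed)
bank <- data.frame(
  bankaccount = c(6, 5, 7, 7, 8, 10, 8, 11, 9, 9),
  income      = c(220, 190, 260, 200, 330, 490, 210, 380, 320, 270),
  famnum      = c(5, 6, 3, 4, 2, 4, 3, 2, 1, 3)
)

# Two predictors: t is the estimate divided by its standard error
fit2 <- lm(bankaccount ~ income + famnum, data = bank)
co   <- summary(fit2)$coefficients
t_by_hand <- co[, "Estimate"] / co[, "Std. Error"]
f2   <- unname(summary(fit2)$fstatistic["value"])
print(round(cbind(co[, c("Estimate", "Std. Error")], t = t_by_hand), 4))

# One predictor: the squared t of the slope equals the model F
fit1 <- lm(bankaccount ~ income, data = bank)
t1   <- summary(fit1)$coefficients["income", "t value"]
f1   <- unname(summary(fit1)$fstatistic["value"])
print(c(t_squared = t1^2, F = f1))

# Two predictors: neither squared t equals the overall F
print(c(t_income_sq = unname(t_by_hand["income"])^2,
        t_famnum_sq = unname(t_by_hand["famnum"])^2,
        F = f2))
</code>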

====== Beta coefficients ======
See [[:beta coefficients]] (standardized coefficients). 
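
As a brief illustration, the Beta column of the coefficients table in the bank example can be obtained in two equivalent ways: rescaling each b by sd(X)/sd(Y), or refitting the model on z-scored variables. The sketch assumes the hand-typed data and column names used for the table; the object names are illustrative.

<code>
# Ten rows copied from the data table (column names are assumed)
bank <- data.frame(
  bankaccount = c(6, 5, 7, 7, 8, 10, 8, 11, 9, 9),
  income      = c(220, 190, 260, 200, 330, 490, 210, 380, 320, 270),
  famnum      = c(5, 6, 3, 4, 2, 4, 3, 2, 1, 3)
)
ivs <- c("income", "famnum")

# Way 1: beta = b * sd(X) / sd(Y)
fit <- lm(bankaccount ~ income + famnum, data = bank)
beta_by_formula <- coef(fit)[ivs] * sapply(bank[ivs], sd) / sd(bank$bankaccount)

# Way 2: regress z-scores on z-scores; the slopes are the betas
bank_z <- as.data.frame(scale(bank))
fit_z  <- lm(bankaccount ~ income + famnum, data = bank_z)
beta_by_scaling <- coef(fit_z)[ivs]

print(round(rbind(beta_by_formula, beta_by_scaling), 3))
</code>

Because betas are expressed in standard-deviation units, they can be compared across predictors measured on different scales.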

====== e.g., ======
DATA: \\ 
<wrap indent>{{:elemapi2.sav}}
{{:elemapi2.csv}}
</wrap>

The Academic Performance Index (**API**) is a measurement of //academic performance and progress of individual schools in California, United States//. It is one of the main components of the Public Schools Accountability Act passed by the California legislature in 1999. API scores range from a low of 200 to a high of 1000. [[https://www.google.co.kr/search?q=what+is+high+school+api+|Google search]]

<code>	Variable Labels
Variable	Position	Label
snum	1	school number
dnum	2	district number
api00	3	api 2000
api99	4	api 1999
growth	5	growth 1999 to 2000
meals	6	pct free meals;  the percentage of students receiving free meals
ell	7	english language learners
yr_rnd	8	year round school
mobility	9	pct 1st year in school
acs_k3	10	avg class size k-3; the average class size in kindergarten through 3rd grade
acs_46	11	avg class size 4-6; the average class size in 4th through 6th grade
not_hsg	12	parent not hsg
hsg	13	parent hsg
some_col	14	parent some college
col_grad	15	parent college grad
grad_sch	16	parent grad school
avg_ed	17	avg parent ed
full	18	pct full credential; the percentage of teachers who have full teaching credentials
emer	19	pct emer credential; the percentage of teachers who have emergency credentials
enroll	20	number of students
mealcat	21	Percentage free meals in 3 categories
collcat	22	<none>
Variables in the working file
</code>

<code>regression
  /dependent api00
  /method=enter ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll .
</code>

<code>	Variables Entered/Removed
Model	Variables Entered	Variables Removed	Method
1      number of students,                               Enter
       avg class size 4-6, 
       pct 1st year in school, 
       avg class size k-3, 
       pct emer credential, 
       english language learners, 
       year round school, 
       pct free meals, 
       pct full credential(a)	.	
a. All requested variables entered.
</code>

^  Model Summary  ^^^^^
| Model   | R   | R Square   | Adjusted \\ R Square   | Std. Error of \\ the Estimate   | 
| 1   | .919<sup>a</sup>   | .845   | .841   | 56.768   | 
| a. Predictors: (Constant), number of students, avg class size 4-6, pct 1st year in school, avg class size k-3, pct emer credential, english language learners, year round school, pct free meals, pct full credential   ||||| 

^  ANOVA<sup>b</sup>  ^^^^^^^
| Model   |    | Sum of Squares   | df   | Mean Square   | F   | Sig.   | 
| 1   | Regression   | 6740702.006   | 9   | 748966.890   | 232.409   | .000<sup>a</sup>   | 
|    | Residual   | 1240707.781   | 385   | 3222.618   |    |    | 
|    | Total   | 7981409.787   | 394   |    |    |    | 
| b. Dependent Variable: api 2000   \\ a. Predictors: (Constant), number of students, avg class size 4-6, pct 1st year in school, avg class size k-3, pct emer credential, english language learners, year round school, pct free meals, pct full credential  |||||||

^  Coefficients<sup>a</sup>  ^^^^^^^ 
| Model   |    | Unstandardized \\ Coefficients   |    | Standardized \\ Coefficients   | t   | Sig.   | 
|    |    | B   | Std. Error   | Beta   |    |    | 
| 1   | (Constant)   | 758.942   | 62.286   |    | 12.185   | .000   | 
|    | english language learners   | -.860   | .211   | -.150   | -4.083   | .000   | 
|    | pct free meals   | -2.948   | .170   | -.661   | -17.307   | .000   | 
|    | year round school   | -19.889   | 9.258   | -.059   | -2.148   | .032   | 
|    | pct 1st year in school   | -1.301   | .436   | -.069   | -2.983   | .003   | 
|    | avg class size k-3   | 1.319   | 2.253   | .013   | .585   | .559   | 
|    | avg class size 4-6   | 2.032   | .798   | .055   | 2.546   | .011   | 
|    | pct full credential   | .610   | .476   | .064   | 1.281   | .201   | 
|    | pct emer credential   | -.707   | .605   | -.058   | -1.167   | .244   | 
|    | number of students   | -.012   | .017   | -.019   | -.724   | .469   | 
| a. Dependent Variable: api 2000  ||||||| 


====== e.g., ======
Another one from the same data. 
<code>REGRESSION
  /DEPENDENT api00
  /METHOD=ENTER ell acs_k3 avg_ed meals
</code>

^  Variable Labels   ^^^
| Variable   | Position   | Label   | 
| snum   | 1   | school number   | 
| dnum   | 2   | district number   | 
| @yellow:api00   | 3   | api 2000   | 
| api99   | 4   | api 1999   | 
| growth   | 5   | growth 1999 to 2000   | 
| @white:meals   | 6   | pct free meals   | 
| @white:ell   | 7   | english language learners   | 
| yr_rnd   | 8   | year round school   | 
| mobility   | 9   | pct 1st year in school   | 
| @white:acs_k3   | 10   | avg class size k-3   | 
| acs_46   | 11   | avg class size 4-6   | 
| not_hsg   | 12   | parent not hsg   | 
| hsg   | 13   | parent hsg   | 
| some_col   | 14   | parent some college   | 
| col_grad   | 15   | parent college grad   | 
| grad_sch   | 16   | parent grad school   | 
| @white:avg_ed   | 17   | avg parent ed   | 
| full   | 18   | pct full credential   | 
| emer   | 19   | pct emer credential   | 
| enroll   | 20   | number of students   | 
| mealcat   | 21   | Percentage free meals in 3 categories   | 
| collcat   | 22   | <none>   | 
|   |   | Variables in the working file   | 
<WRAP clear />

^  Model Summary   ^^^^^
| Model   | R   | R Square   | Adjusted R Square   | Std. Error of the Estimate   | 
| 1   | .912<sup>a</sup>   | .833   | .831   | 58.633   | 
| a. Predictors: (Constant), pct free meals, avg class size k-3, english language learners, avg parent ed   |||||
<WRAP clear />

Build a hypothesis: 
<wrap indent>What is the DV?</wrap> \\
<wrap indent>What are the IVs?</wrap> \\

^  ANOVA(b)   ^^^^^^^
| Model   |    | Sum of Squares   | df   | Mean Square   | F   | Sig.   | 
| 1   | Regression   | 6393719.254   | 4   | 1598429.813   | 464.956   | .000<sup>a</sup>   | 
|    | Residual   | 1285740.498   | 374   | 3437.809   |   |   | 
|    | Total   | 7679459.752   | 378   |   |   |   | 
| a. Predictors: (Constant), pct free meals, avg class size k-3, english language learners, avg parent ed  \\ b. Dependent Variable: api 2000   ||||||| 
<WRAP clear />

What does the R<sup>2</sup> mean?
How would you make your decision on fitting the model? 

^  Coefficients(a)   ^^^^^^^ 
|    |    | Unstandardized \\ Coefficients   |    | Standardized \\ Coefficients   |   |   | 
| Model   |    | B   | Std. Error   | Beta   | t   | Sig.   | 
| 1   | (Constant)   | 709.639   | 56.240   |    | 12.618   | .000   | 
|    | english language learners   | -.843   | .196   | -.147   | -4.307   | .000   | 
|    | avg class size k-3   | 3.388   | 2.333   | .032   | 1.452   | .147   | 
|    | avg parent ed   | 29.072   | 6.924   | .156   | 4.199   | .000   | 
|    | pct free meals   | -2.937   | .195   | -.655   | -15.081   | .000   | 
| a. Dependent Variable: api 2000   ||||||| 
<WRAP clear />

What is the contribution of each IV?
How would you compare the IVs with each other?

-> From here go to the data examination section. We will get back here soon. [[Outliers]].

<code>DATASET ACTIVATE DataSet3.
REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA COLLIN TOL CHANGE ZPP
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN 
  /DEPENDENT api00
  /METHOD=ENTER ell acs_k3 avg_ed meals
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID).

</code>

===== in R =====
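<code># Read the elemapi2 data; the file is comma-separated, so the default separator is used
dvar <- read.csv("http://commres.net/wiki/_media/elemapi2_.csv", fileEncoding="UTF-8-BOM")
# Regress api00 on the four predictors entered together
mod <- lm(api00 ~ ell + acs_k3 + avg_ed + meals, data=dvar)
summary(mod)   # coefficients, t tests, R-squared, overall F
anova(mod)     # sequential sums of squares, in the order the terms are written
</code>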
<code>
dvar <- read.csv("http://commres.net/wiki/_media/elemapi2_.csv", fileEncoding="UTF-8-BOM")
> mod <- lm(api00 ~ ell + acs_k3 + avg_ed + meals, data=dvar)
> summary(mod)

Call:
lm(formula = api00 ~ ell + acs_k3 + avg_ed + meals, data = dvar)

Residuals:
     Min       1Q   Median       3Q      Max 
-187.020  -40.358   -0.313   36.155  173.697 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 709.6388    56.2401  12.618  < 2e-16 ***
ell          -0.8434     0.1958  -4.307 2.12e-05 ***
acs_k3        3.3884     2.3333   1.452    0.147    
avg_ed       29.0724     6.9243   4.199 3.36e-05 ***
meals        -2.9374     0.1948 -15.081  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 58.63 on 374 degrees of freedom
  (21 observations deleted due to missingness)
Multiple R-squared:  0.8326,	Adjusted R-squared:  0.8308 
F-statistic:   465 on 4 and 374 DF,  p-value: < 2.2e-16

> anova(mod)
Analysis of Variance Table

Response: api00
           Df  Sum Sq Mean Sq  F value    Pr(>F)    
ell         1 4502711 4502711 1309.762 < 2.2e-16 ***
acs_k3      1  110211  110211   32.059 2.985e-08 ***
avg_ed      1  998892  998892  290.561 < 2.2e-16 ***
meals       1  781905  781905  227.443 < 2.2e-16 ***
Residuals 374 1285740    3438                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
</code>

<code>> mod

Call:
lm(formula = api00 ~ ell + acs_k3 + avg_ed + meals, data = dvar)

Coefficients:
(Intercept)          ell       acs_k3       avg_ed        meals  
   709.6388      -0.8434       3.3884      29.0724      -2.9374  

></code>
$$ \hat{Y} = 709.6388 + (-0.8434) \text{ell} + 3.3884 \text{acs_k3} + 29.0724 \text{avg_ed} + (-2.9374) \text{meals} $$ 
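
To see how the equation is used, the sketch below plugs one set of predictor values into it. The coefficients are copied from the output above. The school profile (''ell'' = 20, ''acs_k3'' = 19, ''avg_ed'' = 3, ''meals'' = 50) is entirely hypothetical and chosen only to show the arithmetic.

<code>
# Coefficients copied from the printed model
b <- c(intercept = 709.6388, ell = -0.8434, acs_k3 = 3.3884,
       avg_ed = 29.0724, meals = -2.9374)

# A made-up school, used only to illustrate the calculation
school <- c(ell = 20, acs_k3 = 19, avg_ed = 3, meals = 50)

# Y-hat = intercept + sum of (coefficient * predictor value)
pred_api00 <- b[["intercept"]] + sum(b[names(school)] * school)
print(pred_api00)
</code>

With the fitted ''mod'' object available, ''predict(mod, newdata = as.data.frame(t(school)))'' should give the same value up to the rounding of the printed coefficients.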

What, then, is the unique explanatory power of each independent variable? --> see [[:partial and semipartial correlation]]


====== The problem of "which one is entered first?" ======

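__Figure to be inserted here__

An explanation of the relationship between the variance of Y (the total variance) and the variance explained by three independent variables (X<sub>1</sub> X<sub>2</sub> X<sub>3</sub> ) \\
Consequently, which variables are entered, and how, becomes an important question.\\

  * Enter method (all at once as if they are not related)
  * Selection methods
    * [[:Statistical regression methods]]
      * Forward selection: Among the X variables (predictors), the one most highly correlated with the dependent variable Y is entered first and the regression is computed. The variable entered first is, because of its high correlation, treated as a theoretically important factor in explaining the dependent variable. Each subsequent variable is entered with the previously entered variables taken into account. 
      * Backward elimination: All independent variables are entered at once to start the regression. The X variables judged not to contribute statistically to the regression equation are then removed one at a time, and the regression is recomputed at each step. 
      * Step-wise selection: The regression proceeds as in forward selection, but the explanatory power of each entered variable is evaluated to decide whether to keep or drop it. 
    * [[:Sequential regression]] ([[:hierarchical regression]] or block-wise) method: The independent variables are grouped into blocks on theoretical grounds or by the researcher's judgment, and entered block by block. Each block is entered into the regression and, if its explanatory power is insufficient, may be removed before the next block is added. Because the block (set of variables) entered earlier tends to keep its full explanatory power, the earlier block can be said to control for that effect. In addition, removing variables that do not contribute to the explanation from the computation and interpretation has the effect of raising the explanatory power of the independent variables.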

  * [[http://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/hier|What is the difference between hierarchical and stepwise regressions?]]
    * . . . the stepwise procedure defines an a posteriori order based solely on a statistical consideration (the statistical significance of semi-partial correlations) . . . .
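
The block-wise idea can be sketched with the small bank data from the first example. The entry order below (income first, then the number of family members) is purely illustrative and not a recommendation, and the data and column names are typed in by hand. The last lines call R's built-in ''step()'', which selects by AIC rather than by the p-value criteria that SPSS uses, so it is only a rough analogue of backward elimination.

<code>
# Ten rows copied from the first example's data table (column names are assumed)
bank <- data.frame(
  bankaccount = c(6, 5, 7, 7, 8, 10, 8, 11, 9, 9),
  income      = c(220, 190, 260, 200, 330, 490, 210, 380, 320, 270),
  famnum      = c(5, 6, 3, 4, 2, 4, 3, 2, 1, 3)
)

# Sequential (hierarchical) entry: block 1 = income, block 2 adds famnum
block1 <- lm(bankaccount ~ income, data = bank)
block2 <- update(block1, . ~ . + famnum)

r2_block1 <- summary(block1)$r.squared
r2_block2 <- summary(block2)$r.squared
r2_change <- r2_block2 - r2_block1      # what block 2 adds beyond block 1

change_test <- anova(block1, block2)    # F test for the R-squared change
print(round(c(R2_block1 = r2_block1, R2_block2 = r2_block2, change = r2_change), 3))
print(change_test)

# Rough analogue of backward elimination (AIC-based, not p-value-based)
backward <- step(block2, direction = "backward", trace = 0)
print(formula(backward))
</code>

Reversing the entry order changes how much each block appears to explain, because the variance shared by the two predictors is credited to whichever block enters first.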
====== Determining IVs' role ======
For a complete explanation and examples, read [[:partial and semipartial correlation]]
https://www.youtube.com/watch?v=-QsMvrQDxyU
[{{ :partial.correlations.jpg?300 |r-squared semi-partial partial correlations }}]

|  | Standard Multiple   | Sequential   |  comments   | 
| r<sub>i</sub><sup>2</sup>  \\ squared correlation \\ squared **zero-order** \\ correlation in spss  | IV<sub>1</sub> : (a+b) / (a+b+c+d)   | IV<sub>1</sub> : (a+b) / (a+b+c+d)   | overlapped effects   | 
| ::: | IV<sub>2</sub> : (c+b) / (a+b+c+d)   | IV<sub>2</sub>: (c+b) / (a+b+c+d)   | ::: | 
| sr<sub>i</sub><sup>2</sup>  \\ squared \\ **semipartial** correlation \\ **part in spss**   | IV<sub>1</sub> : (a) / (a+b+c+d)   | IV<sub>1</sub> : (a+b) / (a+b+c+d)   | Usual setting \\ Unique contribution to Y   | 
| ::: | IV<sub>2</sub> : %%(c%%) / (a+b+c+d)   | IV<sub>2</sub> : %%(c%%) / (a+b+c+d)   | ::: | 
| pr<sub>i</sub><sup>2</sup>  \\ squared \\ **partial** correlation \\ **partial in spss**   | IV<sub>1</sub> : (a) / (a+d)   | IV<sub>1</sub> : (a+b) / (a+b+d)   | Like adjusted r<sup>2</sup>  \\ Unique contribution to Y   | 
| ::: | IV<sub>2</sub> : %%(c%%) / (c+d)   | IV<sub>2</sub> : %%(c%%) / (c+d)   | ::: | 
| Assuming IV<sub>1</sub> is entered before IV<sub>2</sub>   |||| 
<WRAP clear />
Semipartial = part \\
partial = partial \\
In the explanation above, \\ 
Standard multiple regression = the ENTER method \\
Sequential = forward selection, backward elimination, statistical selection, and so on \\

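__Note__
  * a+b+c+d -> the whole of Y (the total variance of Y).
  * b -> the ambiguous part: the explanatory power over Y could be attributed either to X<sub>1</sub> or to X<sub>2</sub>.
  * The difference between semipartial and partial lies in the denominator.
  * Partial -> the other IV's contribution is removed from both the denominator and the numerator. 
  * Semipartial -> the other IV's contribution is removed from the numerator only. It is therefore the ratio of the IV's unique contribution to the total variance of the dependent variable (DV). SPSS calls it __part__.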
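
For the standard (ENTER) case, the areas in the table can be obtained from differences between R<sup>2</sup> values. The sketch below uses the hand-typed bank data from the first example, with IV<sub>1</sub> = income and IV<sub>2</sub> = number of family members. The column names, the helper ''r2()'', and the other object names are assumptions made for illustration.

<code>
# Ten rows copied from the first example's data table (column names are assumed)
bank <- data.frame(
  bankaccount = c(6, 5, 7, 7, 8, 10, 8, 11, 9, 9),
  income      = c(220, 190, 260, 200, 330, 490, 210, 380, 320, 270),
  famnum      = c(5, 6, 3, 4, 2, 4, 3, 2, 1, 3)
)

# Helper (hypothetical name): R-squared of a model fitted to the bank data
r2 <- function(f) summary(lm(f, data = bank))$r.squared

r2_full   <- r2(bankaccount ~ income + famnum)  # (a+b+c) / (a+b+c+d)
r2_income <- r2(bankaccount ~ income)           # (a+b)   / (a+b+c+d)
r2_famnum <- r2(bankaccount ~ famnum)           # (b+c)   / (a+b+c+d)

# Squared semipartial: unique area over the total variance of Y
sr2_income <- r2_full - r2_famnum               # a / (a+b+c+d)
sr2_famnum <- r2_full - r2_income               # c / (a+b+c+d)

# Squared partial: unique area over what the other IV leaves unexplained
pr2_income <- sr2_income / (1 - r2_famnum)      # a / (a+d)
pr2_famnum <- sr2_famnum / (1 - r2_income)      # c / (c+d)

print(round(c(sr2_income = sr2_income, sr2_famnum = sr2_famnum,
              pr2_income = pr2_income, pr2_famnum = pr2_famnum), 3))
</code>

The numerators are the same for the semipartial and the partial. Only the denominator shrinks for the partial, so the squared partial is never smaller than the squared semipartial.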

<code>  /STATISTICS COEFF OUTS R ANOVA CHANGE ZPP
</code>
In the subcommand above, the ZPP keyword requests the zero-order, part, and partial correlations.

Multicollinearity problem = when tolerance < .10 or when VIF > 10 (VIF = 1 / tolerance) 
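
Tolerance and VIF can be computed by hand: regress one IV on the remaining IVs, take 1 - R<sup>2</sup> as the tolerance, and invert it for the VIF. The sketch below does this for the two predictors of the bank example, again with hand-typed data and assumed column names. With only two predictors, both share the same tolerance.

<code>
# Ten rows copied from the first example's data table (column names are assumed)
bank <- data.frame(
  bankaccount = c(6, 5, 7, 7, 8, 10, 8, 11, 9, 9),
  income      = c(220, 190, 260, 200, 330, 490, 210, 380, 320, 270),
  famnum      = c(5, 6, 3, 4, 2, 4, 3, 2, 1, 3)
)

# How much of income is explained by the other predictor(s)?
r2_income_on_rest <- summary(lm(income ~ famnum, data = bank))$r.squared

tolerance <- 1 - r2_income_on_rest   # share of income NOT explained by the other IVs
vif       <- 1 / tolerance           # variance inflation factor

print(round(c(tolerance = tolerance, VIF = vif), 3))
</code>

Here the two predictors are only moderately correlated, so the tolerance is well above .10 and multicollinearity is not a concern for this small example.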

====== elem e.g. again ======
<code>
dvar <- read.csv("http://commres.net/wiki/_media/elemapi2_.csv", fileEncoding="UTF-8-BOM")
mod <- lm(api00 ~ ell + acs_k3 + avg_ed + meals, data=dvar)
summary(mod)
anova(mod)
</code>
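<code>
# install.packages("ppcor")
library(ppcor)
# Pull the columns out of dvar; the bare names are not visible unless dvar is attached
myvar <- with(dvar, data.frame(api00, ell, acs_k3, avg_ed, meals))
myvar <- na.omit(myvar)   # drop rows with missing values
spcor(myvar)              # semipartial correlations for every pair of columns
</code>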

<code>
> library(ppcor)
> myvar <- data.frame(api00, ell, acs_k3, avg_ed, meals)
> myvar <- na.omit(myvar)
> spcor(myvar)
$estimate
             api00         ell      acs_k3      avg_ed      meals
api00   1.00000000 -0.09112026  0.03072660  0.08883450 -0.3190889
ell    -0.13469956  1.00000000  0.06086724 -0.06173591  0.1626061
acs_k3  0.07245527  0.09709299  1.00000000 -0.13288465 -0.1367842
avg_ed  0.12079565 -0.05678795 -0.07662825  1.00000000 -0.2028836
meals  -0.29972194  0.10332189 -0.05448629 -0.14014709  1.0000000

$p.value
              api00        ell    acs_k3      avg_ed        meals
api00  0.000000e+00 0.07761805 0.5525340 0.085390280 2.403284e-10
ell    8.918743e-03 0.00000000 0.2390272 0.232377348 1.558141e-03
acs_k3 1.608778e-01 0.05998819 0.0000000 0.009891503 7.907183e-03
avg_ed 1.912418e-02 0.27203887 0.1380449 0.000000000 7.424903e-05
meals  3.041658e-09 0.04526574 0.2919775 0.006489783 0.000000e+00

$statistic
           api00       ell     acs_k3    avg_ed     meals
api00   0.000000 -1.769543  0.5945048  1.724797 -6.511264
ell    -2.628924  0.000000  1.1793030 -1.196197  3.187069
acs_k3  1.404911  1.886603  0.0000000 -2.592862 -2.670380
avg_ed  2.353309 -1.100002 -1.4862899  0.000000 -4.006914
meals  -6.075665  2.008902 -1.0552823 -2.737331  0.000000

$n
[1] 379

$gp
[1] 3

$method
[1] "pearson"
> 
> 
</code>

<code>
> spcor.test(myvar$api00, myvar$meals, myvar[,c(2,3,4)])
    estimate      p.value statistic   n gp  Method
1 -0.3190889 2.403284e-10 -6.511264 379  3 pearson
> 
</code>
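
The semipartial correlation for ''meals'' can be cross-checked against the ''anova(mod)'' table for the same model, because R's ''anova()'' reports sequential sums of squares and ''meals'' is the last term in the formula. Its SS is therefore the part of the variance in ''api00'' that only ''meals'' explains. The sketch below uses nothing but the numbers printed in the outputs; the object names are illustrative.

<code>
# Sum Sq column of anova(mod) for api00 ~ ell + acs_k3 + avg_ed + meals
ss <- c(ell = 4502711, acs_k3 = 110211, avg_ed = 998892,
        meals = 781905, Residuals = 1285740)

ss_total       <- sum(ss)                     # total SS of api00
sr2_from_anova <- ss[["meals"]] / ss_total    # unique share of meals (entered last)

# Semipartial correlation of api00 with meals, from spcor.test()
sr_from_spcor  <- -0.3190889
sr2_from_spcor <- sr_from_spcor^2

print(round(c(from_anova = sr2_from_anova, from_spcor = sr2_from_spcor), 4))
</code>

The two values agree, so ''meals'' uniquely accounts for roughly 10% of the variance in ''api00'' once the other three predictors are in the model.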
====== e.g., ======
[[:multiple regression examples]]
{{:h1.sav|An example data file 1}}
  * Y1 - A measure of success in graduate school.
  * X1 - A measure of intellectual ability.
  * X2 - A measure of "work ethic."
  * X3 - A second measure of intellectual ability.
  * X4 - A measure of spatial ability.
  * Y2 - Score on a major review paper.
{{:h2.sav|An example data file 2}}
  * Age
  * Gender (0=Male, 1=Female)
  * Married (0=No, 1=Yes)
  * IncomeC Income in College (in thousands)
  * HealthC Score on Health Inventory in College
  * ChildC Number of Children while in College
  * LifeSatC Score on Life Satisfaction Inventory in College
  * SES Socio Economic Status of Parents
  * Smoker (0=No, 1=Yes)
  * SpiritC Score on Spirituality Inventory in College
  * Finish Finish the program in college (0=No, 1=Yes)
  * LifeSat Score on Life Satisfaction Inventory seven years after College
  * Income Income seven years after College (in thousands)

====== exercise ======
{{:insurance.csv}}
<code>
dvar <- read.csv("http://commres.net/wiki/_media/insurance.csv")
</code>

[[:Multiple Regression Exercise]]

====== Resources ======
  * [[http://www.theanalysisfactor.com/resources/by-topic/linear-regression/|Linear Regression Resources]]
  * [[https://www.youtube.com/watch?v=YcNGKam_wwQ|Linear Regression 1]]
    * [[https://www.youtube.com/watch?v=U4QL8QCbil8|2]]
    * [[https://www.youtube.com/watch?v=YdCLztrI73s|3]]
    * [[https://www.youtube.com/watch?v=ynijS2McieQ|4]]
    * [[https://www.youtube.com/watch?v=Q6CkRWXvCZw|5]]
  * [[https://www.youtube.com/watch?v=IWYENu0kCYE|Multiple regression 1]]
  * http://www.ats.ucla.edu/stat/spss/webbooks/reg/chapter1/spssreg1.htm

  * [[https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php|Linear Regression Analysis using SPSS Statistics]]
  * [[https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php|Multiple Regression Analysis using SPSS Statistics]]


  * https://ww2.coastal.edu/kingw/statistics/R-tutorials/multregr.html
    * state.x77 data in r
  * http://www.statmethods.net/stats/regression.html
  * http://rtutorialseries.blogspot.kr/2009/12/r-tutorial-series-multiple-linear.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+RTutorialSeries+(R+Tutorial+Series)
  * https://www.r-bloggers.com/analysis-of-covariance-%E2%80%93-extending-simple-linear-regression/
  * http://www.wekaleamstudios.co.uk/posts/analysis-of-covariance-extending-simple-linear-regression/

https://www.youtube.com/user/marinstatlectures/search?query=Multiple+Linear+Regression+


{{tag> "research methods" "statistics" "regression" "multiple regression"}}