Multiple Regression

See also Regression (simple regression)

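Simple regression and multiple regression are among the most common methods for examining the relationship between one dependent variable and one or more independent variables (note the plural in the multiple case). The terms correlation and regression are often used interchangeably without a sharp distinction, but if one must distinguish them, regression is used more for prediction, whereas correlation is used more for examining the relationship between variables.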

In simple linear regression, evaluating r² is relatively straightforward, because only one b value is responsible for r². Suppose the regression yields the coefficients a (the constant) and b (the coefficient for X):

$$Y = a + bX$$

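If the F value for this regression equation is statistically significant, then the b of X, the only variable contributing to it, accounts for all of that effect. Once two or more independent variables enter the regression, however, the story changes.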

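Multiple regression is used widely across research disciplines. For example, Baldry ¹⁾ used multiple regression to examine the factors (variables) that influence children's bullying behavior. Using hierarchical (sequential) regression, Baldry first entered age and gender as variables explaining children's bullying behavior; in the second step, the father's verbal and physical abuse of the mother (the mother's abuse was excluded at this step); and in the last step, the mother's verbal and physical abuse of the father. According to the results, the father's abusiveness was not related to the child's bullying behavior, whereas the mother's abusiveness, together with gender and age, was more strongly related to it. The four variables together explained only 14% of children's bullying behavior (see the reference cited above).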

Yang and colleagues ²⁾ studied service quality, transaction cost, and experiential value as variables affecting satisfaction with, and loyalty to, online game services. According to their results, all three variables affected satisfaction with the game service, and satisfaction in turn affected loyalty (a mediating effect). They used path analysis, which can also measure the relationships among IVs, as their analytic tool; this too is a kind of regression method.

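Rice and Katz ³⁾ examined the variables that influence interest in the increasingly varied services offered on mobile phones, proposing demographic characteristics, social factors, and prior use of similar technologies (here, the Internet and the cell phone). The factors listed under each of these variables were broken down further as follows.

IVs

Demographic characteristics
- Education
- Marital Status
- Age
- Gender
- Race/Ethnicity
- Income
Social factors
- Physical distance from family and friends
- Social support
- Right to privacy
- Threats to privacy
Prior experience with technology
- Internet adoption/usage
- Cell phone adoption/usage

DV

Evaluation of cell phone video and text services
- Surveillance services
- Entertainment services
- Instrumental services

The analysis showed that the demographic factors other than gender had direct and indirect positive effects on the evaluation of all three kinds of services, with the indirect effects mediated by social support. Among the social factors, respondents who lived physically closer to family and friends and were on closer terms with them were more favorable toward instrumental services (such as location tracking), and those who placed more importance on privacy evaluated emergency services more positively. Overall, however, privacy-related factors showed only a marginal effect on cell phone use. Finally, respondents who used the Internet and cell phones less were more favorable toward instrumental services (such as location tracking).⁴⁾ ⁵⁾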

e.g.

Data set again.

datavar <- read.csv("http://commres.net/wiki/_media/regression01-bankaccount.csv")

DATA for regression analysis
bankaccount	income	famnum
6	220	5
5	190	6
7	260	3
7	200	4
8	330	2
10	490	4
8	210	3
11	380	2
9	320	1
9	270	3

    account       income        fammember   
 Min.   : 5   Min.   :190.0   Min.   :1.00  
 1st Qu.: 7   1st Qu.:212.5   1st Qu.:2.25  
 Median : 8   Median :265.0   Median :3.00  
 Mean   : 8   Mean   :287.0   Mean   :3.30  
 3rd Qu.: 9   3rd Qu.:327.5   3rd Qu.:4.00  
 Max.   :11   Max.   :490.0   Max.   :6.00

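The following shows how the variance (variance, or MS) is obtained. In the table, the error column is the error made when each individual score is predicted by the mean ($\overline{Y}=8$). These errors are squared (error²) and then summed ($SS_{total} = 30$).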

prediction for y values with $\overline{Y}$
bankaccount	error	error²
6	-2	4
5	-3	9
7	-1	1
7	-1	1
8	0	0
10	2	4
8	0	0
11	3	9
9	1	1
9	1	1
$\overline{Y}=8$		$SS_{total} = 30$
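
The table can be reproduced with a few lines of R. This is a minimal sketch that types the ten bankaccount scores in by hand instead of reading the CSV; the object names (`y_bar`, `ss_total`, `ms_total`) are illustrative and do not come from the source.

```r
# Bank-account scores copied from the table above (the DV only).
bankaccount <- c(6, 5, 7, 7, 8, 10, 8, 11, 9, 9)

y_bar    <- mean(bankaccount)      # prediction that uses the mean alone
error    <- bankaccount - y_bar    # deviation of each score from the mean
ss_total <- sum(error^2)           # sum of squared errors (SS_total)
ms_total <- ss_total / (length(bankaccount) - 1)  # variance = SS / df

ss_total
ms_total
var(bankaccount)   # same value as ms_total
```

Dividing the sum of squares by its degrees of freedom (n - 1 = 9) gives the variance, which is why `ms_total` and `var()` agree.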

Regression output (using R)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2173 -0.5779 -0.1515  0.6642  1.1906 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  6.399103   1.516539   4.220  0.00394 **
fammember   -0.544727   0.226364  -2.406  0.04702 * 
income       0.011841   0.003561   3.325  0.01268 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.9301 on 7 degrees of freedom
Multiple R-squared: 0.7981,	Adjusted R-squared: 0.7404 
F-statistic: 13.84 on 2 and 7 DF,  p-value: 0.003696

$$\hat{Y} = 6.399103 + (-0.544727) \text{fammember} + (0.011841) \text{income} $$

Computing the predicted values ( $\hat{Y}$ ) from the regression equation above gives:

Another X, X₂
bankaccount	pred2	error2	error²
$Y_{i}$	$\hat{Y}$	$\hat{Y}-Y_{i}$ = error	error²
6.000	6.281	0.281	0.079
5.000	5.381	0.381	0.145
7.000	7.844	0.844	0.712
7.000	6.588	-0.412	0.169
8.000	9.217	1.217	1.482
10.000	10.023	0.023	0.001
8.000	7.252	-0.748	0.560
11.000	9.809	-1.191	1.418
9.000	9.644	0.644	0.414
9.000	7.962	-1.038	1.077
		SS_res	6.056
		SS_reg	23.944

Still,

$$SS_{total} = 30$$

Now, by entering another variable, X₂ = number of family members, we get:

$$SS_{unexplained} = 6.056$$
$$SS_{explained} = 23.944$$
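
The decomposition can be checked in R. The sketch below enters the ten cases by hand; the column names `account`, `income`, and `fammember` follow the R output shown earlier and are an assumption about how the CSV is labeled.

```r
# Data typed in from the table at the top of this example.
bank <- data.frame(
  account   = c(6, 5, 7, 7, 8, 10, 8, 11, 9, 9),
  income    = c(220, 190, 260, 200, 330, 490, 210, 380, 320, 270),
  fammember = c(5, 6, 3, 4, 2, 4, 3, 2, 1, 3)
)

fit  <- lm(account ~ fammember + income, data = bank)
pred <- fitted(fit)                 # Y-hat for each case
y_bar <- mean(bank$account)

ss_total <- sum((bank$account - y_bar)^2)  # total variation in Y
ss_res   <- sum((bank$account - pred)^2)   # unexplained (residual) part
ss_reg   <- sum((pred - y_bar)^2)          # explained (regression) part
r2       <- ss_reg / ss_total

round(c(SS_total = ss_total, SS_res = ss_res, SS_reg = ss_reg, R2 = r2), 3)
```

The explained and unexplained parts always add up to the total, so adding a predictor can only move variation from the residual row to the regression row.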

Then what are the R², F, and b values?

Model Summary(b)
Model	R	R Square	Adjusted R Square	Std. Error of the Estimate
1	0.893	0.798	0.740	0.930
a. Predictors: (Constant), bankfam, bankIncome income b. Dependent Variable: bankbook number of bank

$$ R^2 = \frac{SS_{reg}}{SS_{tot}} = 0.798$$

ANOVA(b)
Model		Sum of Squares	df	Mean Square	F	Sig.
1	Regression	23.944	2	11.972	13.838	0.004
	Residual	6.056	7	0.865
	Total	30.000	9
a. Predictors: (Constant), bankfam, bankIncome income b. Dependent Variable: bankbook number of bank
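
As a worked check, the entries of the ANOVA and Model Summary tables follow from the sums of squares and their degrees of freedom. Here n = 10 cases and p = 2 predictors; the adjusted R² formula is the standard one and is not spelled out in the source.

$$F = \frac{SS_{reg}/df_{reg}}{SS_{res}/df_{res}} = \frac{23.944/2}{6.056/7} = \frac{11.972}{0.865} \approx 13.84$$

$$R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1} = 1 - \frac{6.056}{30} \cdot \frac{9}{7} \approx 0.740$$

$$\text{Std. Error of the Estimate} = \sqrt{MS_{res}} = \sqrt{0.865} \approx 0.930$$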

Coefficients(a)
Model		Unstandardized Coefficients		Standardized Coefficients	t	Sig.
		B	Std. Error	Beta
1	(Constant)	6.399	1.517		4.220	0.004
	income	0.012	0.004	0.616	3.325	0.013
	bankfam	-0.545	0.226	-0.446	-2.406	0.047
a. Dependent Variable: bankbook number of bank

Slope test

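The significance of each b (coefficient) is tested with a t-test. A t statistic is basically the treatment effect (the effect of the independent variable, or the difference) divided by the random error, that is, the standard error. In the table above, each t value is therefore B / Std. Error. For income, the unrounded R estimates give 0.011841 / 0.003561 ≈ 3.325 (the rounded table entries 0.012 / 0.004 give only a rough 3.0); for bankfam, -0.545 / 0.226 ≈ -2.41.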

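With a single independent variable, the t value of the slope is the square root of the F value of the regression model (t² = F). With two or more independent variables, the model F tests all slopes jointly and the independent variables are in most cases correlated with each other, so the square of an individual t value is generally not equal to the F value.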
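
A short R sketch of this relationship, using the bank data typed in by hand (column names as in the R output above; the object names are illustrative):

```r
bank <- data.frame(
  account   = c(6, 5, 7, 7, 8, 10, 8, 11, 9, 9),
  income    = c(220, 190, 260, 200, 330, 490, 210, 380, 320, 270),
  fammember = c(5, 6, 3, 4, 2, 4, 3, 2, 1, 3)
)

# One predictor: the squared t of the slope equals the model F.
s1       <- summary(lm(account ~ income, data = bank))
t_income <- s1$coefficients["income", "t value"]
f_simple <- unname(s1$fstatistic["value"])
c(t_squared = t_income^2, F = f_simple)

# Two predictors: neither squared t matches the overall F.
s2         <- summary(lm(account ~ income + fammember, data = bank))
t2_squared <- s2$coefficients[c("income", "fammember"), "t value"]^2
f_multiple <- unname(s2$fstatistic["value"])
t2_squared
f_multiple
```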

Beta coefficients

See beta coefficients (standardized coefficients).
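
As a rough illustration, a standardized coefficient can be obtained by rescaling b with the standard deviations of X and Y, or by refitting the model on z-scores. The sketch below uses the bank data entered by hand; it is one common way to compute beta, not necessarily the procedure SPSS uses internally.

```r
bank <- data.frame(
  account   = c(6, 5, 7, 7, 8, 10, 8, 11, 9, 9),
  income    = c(220, 190, 260, 200, 330, 490, 210, 380, 320, 270),
  fammember = c(5, 6, 3, 4, 2, 4, 3, 2, 1, 3)
)
ivs <- c("income", "fammember")

# (1) Rescale the unstandardized slopes: beta = b * sd(X) / sd(Y).
fit  <- lm(account ~ income + fammember, data = bank)
beta <- coef(fit)[ivs] * sapply(bank[ivs], sd) / sd(bank$account)

# (2) Refit on z-scores; the slopes are then the betas directly.
bank_z <- as.data.frame(scale(bank))
fit_z  <- lm(account ~ income + fammember, data = bank_z)

round(beta, 3)
round(coef(fit_z)[ivs], 3)
```

Because both predictors end up on the same scale, the betas can be compared with each other in a way the raw b values cannot.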

e.g.,

DATA:

elemapi2.sav
elemapi2.csv

The Academic Performance Index (API) is a measurement of academic performance and progress of individual schools in California, United States. It is one of the main components of the Public Schools Accountability Act passed by the California legislature in 1999. API scores range from a low of 200 to a high of 1000.

	Variable Labels
Variable	Position	Label
snum	1	school number
dnum	2	district number
api00	3	api 2000
api99	4	api 1999
growth	5	growth 1999 to 2000
meals	6	pct free meals;  the percentage of students receiving free meals
ell	7	english language learners
yr_rnd	8	year round school
mobility	9	pct 1st year in school
acs_k3	10	avg class size k-3; the average class size in kindergarten through 3rd grade
acs_46	11	avg class size 4-6; the average class size in 4th through 6th grade
not_hsg	12	parent not hsg
hsg	13	parent hsg
some_col	14	parent some college
col_grad	15	parent college grad
grad_sch	16	parent grad school
avg_ed	17	avg parent ed
full	18	pct full credential; the percentage of teachers who have full teaching credentials
emer	19	pct emer credential; the percentage of teachers who have emergency credentials
enroll	20	number of students
mealcat	21	Percentage free meals in 3 categories
collcat	22	<none>
Variables in the working file

regression
  /dependent api00
  /method=enter ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll .

	Variables Entered/Removed
Model	Variables Entered	Variables Removed	Method
1	number of students, avg class size 4-6, pct 1st year in school, avg class size k-3, pct emer credential, english language learners, year round school, pct free meals, pct full credential(a)	.	Enter
a. All requested variables entered.

Model Summary
Model	R	R Square	Adjusted R Square	Std. Error of the Estimate
1	.919a	.845	.841	56.768
a. Predictors: (Constant), number of students, avg class size 4-6, pct 1st year in school, avg class size k-3, pct emer credential, english language learners, year round school, pct free meals, pct full credential

ANOVA^b
Model		Sum of Squares	df	Mean Square	F	Sig.
1	Regression	6740702.006	9	748966.890	232.409	.000a
	Residual	1240707.781	385	3222.618
	Total	7981409.787	394
b. Dependent Variable: api 2000 a. Predictors: (Constant), number of students, avg class size 4-6, pct 1st year in school, avg class size k-3, pct emer credential, english language learners, year round school, pct free meals, pct full credential

Coefficients_a
Model		Unstandardized Coefficients		Standardized Coefficients	t	Sig.
		B	Std. Error	Beta
1	(Constant)	758.942	62.286		12.185	.000
	english language learners	-.860	.211	-.150	-4.083	.000
	pct free meals	-2.948	.170	-.661	-17.307	.000
	year round school	-19.889	9.258	-.059	-2.148	.032
	pct 1st year in school	-1.301	.436	-.069	-2.983	.003
	avg class size k-3	1.319	2.253	.013	.585	.559
	avg class size 4-6	2.032	.798	.055	2.546	.011
	pct full credential	.610	.476	.064	1.281	.201
	pct emer credential	-.707	.605	-.058	-1.167	.244
	number of students	-.012	.017	-.019	-.724	.469
a. Dependent Variable: api 2000

e.g.,

Another one from the same data.

REGRESSION
  /DEPENDENT api00
  /METHOD=ENTER ell acs_k3 avg_ed meals .


Model Summary
Model	R	R Square	Adjusted R Square	Std. Error of the Estimate
1	.912a	.833	.831	58.633
a. Predictors: (Constant), pct free meals, avg class size k-3, english language learners, avg parent ed

Build a hypothesis:
What is the DV?

What are the IVs?

ANOVA(b)
Model		Sum of Squares	df	Mean Square	F	Sig.
1	Regression	6393719.254	4	1598429.813	464.956	.000a
	Residual	1285740.498	374	3437.809
	Total	7679459.752	378
a. Predictors: (Constant), pct free meals, avg class size k-3, english language learners, avg parent ed b. Dependent Variable: api 2000

What does the R² mean?
How would you make your decision on fitting the model?

Coefficients(a)
		Unstandardized Coefficients		Standardized Coefficients
Model		B	Std. Error	Beta	t	Sig.
1	(Constant)	709.639	56.240		12.618	.000
	english language learners	-.843	.196	-.147	-4.307	.000
	avg class size k-3	3.388	2.333	.032	1.452	.147
	avg parent ed	29.072	6.924	.156	4.199	.000
	pct free meals	-2.937	.195	-.655	-15.081	.000
a. Dependent Variable: api 2000

What is the contribution of each IV?
How would you compare them to each other?

→ From here go to the data examination section. We will get back here soon. Outliers.

DATASET ACTIVATE DataSet3.
REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA COLLIN TOL CHANGE ZPP
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN 
  /DEPENDENT api00
  /METHOD=ENTER ell acs_k3 avg_ed meals
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID).

in R

dvar <- read.csv("http://commres.net/wiki/_media/elemapi2_.csv", fileEncoding="UTF-8-BOM")
mod <- lm(api00 ~ ell + acs_k3 + avg_ed + meals, data=dvar)
summary(mod)
anova(mod)

> dvar <- read.csv("http://commres.net/wiki/_media/elemapi2_.csv", fileEncoding="UTF-8-BOM")
> mod <- lm(api00 ~ ell + acs_k3 + avg_ed + meals, data=dvar)
> summary(mod)

Call:
lm(formula = api00 ~ ell + acs_k3 + avg_ed + meals, data = dvar)

Residuals:
     Min       1Q   Median       3Q      Max 
-187.020  -40.358   -0.313   36.155  173.697 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 709.6388    56.2401  12.618  < 2e-16 ***
ell          -0.8434     0.1958  -4.307 2.12e-05 ***
acs_k3        3.3884     2.3333   1.452    0.147    
avg_ed       29.0724     6.9243   4.199 3.36e-05 ***
meals        -2.9374     0.1948 -15.081  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 58.63 on 374 degrees of freedom
  (21 observations deleted due to missingness)
Multiple R-squared:  0.8326,	Adjusted R-squared:  0.8308 
F-statistic:   465 on 4 and 374 DF,  p-value: < 2.2e-16

> anova(mod)
Analysis of Variance Table

Response: api00
           Df  Sum Sq Mean Sq  F value    Pr(>F)    
ell         1 4502711 4502711 1309.762 < 2.2e-16 ***
acs_k3      1  110211  110211   32.059 2.985e-08 ***
avg_ed      1  998892  998892  290.561 < 2.2e-16 ***
meals       1  781905  781905  227.443 < 2.2e-16 ***
Residuals 374 1285740    3438                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>

> mod

Call:
lm(formula = api00 ~ ell + acs_k3 + avg_ed + meals, data = dvar)

Coefficients:
(Intercept)          ell       acs_k3       avg_ed        meals  
   709.6388      -0.8434       3.3884      29.0724      -2.9374  

>

$$ \hat{Y} = 709.6388 - 0.8434 \text{ell} + 3.3884 \text{acs_k3} + 29.0724 \text{avg_ed} - 2.9374 \text{meals} $$

Then how much explanatory power is unique to each independent variable? → see partial and semipartial correlation

The problem of "which one is entered first?"

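(Figure to be included around here)

An explanation of the relationship between the variance of Y (the total variance) and the variance explained by three independent variables (X₁, X₂, X₃)

Which variables are entered, and in what way, therefore becomes an important question.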

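Enter method (all at once, as if the IVs were not related)
Selection methods
- Statistical regression methods
  - Forward selection: Among the X variables (predictors), the one with the highest correlation with the dependent variable Y is entered first and the regression is computed. Because of its high correlation, the variable entered first is treated as an important factor in explaining the dependent variable. Each subsequent variable is entered with the previously entered variables taken into account.
  - Backward elimination: The regression starts with all independent variables entered at once. X variables judged not to contribute statistically to the equation are then removed one at a time, and the regression is recomputed after each removal.
  - Step-wise selection: The regression proceeds as in forward selection, but the explanatory power of each entered variable is re-evaluated to decide whether to keep or drop it.
- Sequential regression (hierarchical regression or block-wise) method: The researcher groups the independent variables into blocks on theoretical grounds or by judgment and enters them block by block. Each block is entered into the regression; if its explanatory power is insufficient, it may be removed before the next block is added. Because the block (the variables) entered earlier tends to receive full credit for its explanatory power, entering a block first amounts to controlling for its effect. In addition, removing variables that do not contribute from the computation and the interpretation tends to raise the explanatory power attributed to the remaining independent variables.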
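
To see why entry order matters, the sketch below fits the two-predictor bank-account model from the earlier example in both orders and compares the sequential (Type I) sums of squares that `anova()` reports, then runs a two-block hierarchical entry. The data are typed in by hand, and the block assignment (income first, then fammember) is an arbitrary choice for illustration.

```r
bank <- data.frame(
  account   = c(6, 5, 7, 7, 8, 10, 8, 11, 9, 9),
  income    = c(220, 190, 260, 200, 330, 490, 210, 380, 320, 270),
  fammember = c(5, 6, 3, 4, 2, 4, 3, 2, 1, 3)
)

# Same model, two entry orders: the variable entered first is credited
# with the variance it shares with the other predictor.
fit_income_first <- lm(account ~ income + fammember, data = bank)
fit_fam_first    <- lm(account ~ fammember + income, data = bank)
ss_income_1st <- anova(fit_income_first)["income", "Sum Sq"]
ss_income_2nd <- anova(fit_fam_first)["income", "Sum Sq"]
c(entered_first = ss_income_1st, entered_second = ss_income_2nd)

# Hierarchical (block-wise) entry: block 1, then block 1 + block 2.
block1 <- lm(account ~ income, data = bank)
block2 <- update(block1, . ~ . + fammember)
anova(block1, block2)    # F test for the added block
r2_change <- summary(block2)$r.squared - summary(block1)$r.squared
r2_change
```

Base R also offers `step()` for automated selection, but it selects by AIC by default, not by the significance tests SPSS uses, so its results need not match the SPSS stepwise procedures.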

What is the difference between hierarchical and stepwise regressions?
- . . . the stepwise procedure defines an a posteriori order based solely on a statistical consideration (the statistical significance of semi-partial correlations) . . . .

Determining IVs' role

For a complete explanation and examples, read partial and semipartial correlation
https://www.youtube.com/watch?v=-QsMvrQDxyU

r-squared semi-partial partial correlations

	Standard Multiple	Sequential	Comments
r_i²: squared correlation (zero-order in SPSS)	IV₁: (a+b) / (a+b+c+d)	IV₁: (a+b) / (a+b+c+d)	Overlapped effects
	IV₂: (c+b) / (a+b+c+d)	IV₂: (c+b) / (a+b+c+d)
sr_i²: squared semipartial correlation (part in SPSS)	IV₁: (a) / (a+b+c+d)	IV₁: (a+b) / (a+b+c+d)	Usual setting; unique contribution to Y
	IV₂: (c) / (a+b+c+d)	IV₂: (c) / (a+b+c+d)
pr_i²: squared partial correlation (partial in SPSS)	IV₁: (a) / (a+d)	IV₁: (a+b) / (a+b+d)	Like adjusted r²; unique contribution to Y
	IV₂: (c) / (c+d)	IV₂: (c) / (c+d)
Assuming IV₁ is entered before IV₂.

Semipartial = part

partial = partial

In the explanation in the section above:

Standard Multiple Regression means the ENTER method.

Sequential refers to forward selection, backward elimination, statistical (stepwise) selection, and so on.

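Note

a+b+c+d → the whole of Y (its total variance).
b → the ambiguous part; its explanatory power over Y could be attributed to either X₁ or X₂.
The difference between semipartial and partial lies in the denominator.
Partial → the role of the other IV is removed from both the denominator and the numerator.
Semipartial → the role of the other IV is removed from the numerator only. It is therefore the ratio of the independent variable's unique contribution to the total variance of the dependent variable (DV). SPSS calls it part.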
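
The two ratios can be sketched in R as differences in R² between a full and a reduced model. The example uses the bank-account data from the earlier example, typed in by hand, and treats fammember as the IV of interest; the object names are illustrative.

```r
bank <- data.frame(
  account   = c(6, 5, 7, 7, 8, 10, 8, 11, 9, 9),
  income    = c(220, 190, 260, 200, 330, 490, 210, 380, 320, 270),
  fammember = c(5, 6, 3, 4, 2, 4, 3, 2, 1, 3)
)

full    <- lm(account ~ income + fammember, data = bank)
reduced <- lm(account ~ income, data = bank)   # model without fammember
r2_full <- summary(full)$r.squared
r2_red  <- summary(reduced)$r.squared

# Squared semipartial: unique part of fammember over ALL of Y's variance.
sr2 <- r2_full - r2_red
# Squared partial: the same unique part over what income left unexplained.
pr2 <- sr2 / (1 - r2_red)

round(c(sr2 = sr2, pr2 = pr2), 3)
```

The numerator is the same in both; only the denominator changes, so pr² can never be smaller than sr².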

  /STATISTICS COEFF OUTS R ANOVA CHANGE ZPP
In the subcommand above, the ZPP keyword requests the zero-order, part, and partial correlations.

Multicollinearity problem = when tolerance < .10 or when VIF > 10 (tolerance = 1 / VIF)
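
Tolerance and VIF can be computed by hand from an auxiliary regression of one IV on the other IVs. The sketch below does this for the two predictors of the bank-account example (values typed in by hand); with only two IVs the auxiliary R² is their squared correlation.

```r
income    <- c(220, 190, 260, 200, 330, 490, 210, 380, 320, 270)
fammember <- c(5, 6, 3, 4, 2, 4, 3, 2, 1, 3)

# Auxiliary regression: how well do the other IVs predict this IV?
aux_r2 <- summary(lm(income ~ fammember))$r.squared

tolerance <- 1 - aux_r2    # share of the IV not explained by the other IVs
vif       <- 1 / tolerance # variance inflation factor

round(c(aux_R2 = aux_r2, tolerance = tolerance, VIF = vif), 3)
```

A tolerance near 1 (VIF near 1) means the predictor carries mostly independent information; values near the cutoffs above signal redundancy among the IVs.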

elem e.g. again

dvar <- read.csv("http://commres.net/wiki/_media/elemapi2_.csv", fileEncoding="UTF-8-BOM")
mod <- lm(api00 ~ ell + acs_k3 + avg_ed + meals, data=dvar)
summary(mod)
anova(mod)

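# install.packages("ppcor")
library(ppcor)
# Take the DV and the four IVs from dvar (the columns are not attached)
myvar <- with(dvar, data.frame(api00, ell, acs_k3, avg_ed, meals))
# spcor() does not accept missing values, so drop incomplete cases first
myvar <- na.omit(myvar)
# Matrix of semipartial (part) correlations
spcor(myvar)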

> library(ppcor)
> myvar <- data.frame(api00, ell, acs_k3, avg_ed, meals)
> myvar <- na.omit(myvar)
> spcor(myvar)
$estimate
             api00         ell      acs_k3      avg_ed      meals
api00   1.00000000 -0.09112026  0.03072660  0.08883450 -0.3190889
ell    -0.13469956  1.00000000  0.06086724 -0.06173591  0.1626061
acs_k3  0.07245527  0.09709299  1.00000000 -0.13288465 -0.1367842
avg_ed  0.12079565 -0.05678795 -0.07662825  1.00000000 -0.2028836
meals  -0.29972194  0.10332189 -0.05448629 -0.14014709  1.0000000

$p.value
              api00        ell    acs_k3      avg_ed        meals
api00  0.000000e+00 0.07761805 0.5525340 0.085390280 2.403284e-10
ell    8.918743e-03 0.00000000 0.2390272 0.232377348 1.558141e-03
acs_k3 1.608778e-01 0.05998819 0.0000000 0.009891503 7.907183e-03
avg_ed 1.912418e-02 0.27203887 0.1380449 0.000000000 7.424903e-05
meals  3.041658e-09 0.04526574 0.2919775 0.006489783 0.000000e+00

$statistic
           api00       ell     acs_k3    avg_ed     meals
api00   0.000000 -1.769543  0.5945048  1.724797 -6.511264
ell    -2.628924  0.000000  1.1793030 -1.196197  3.187069
acs_k3  1.404911  1.886603  0.0000000 -2.592862 -2.670380
avg_ed  2.353309 -1.100002 -1.4862899  0.000000 -4.006914
meals  -6.075665  2.008902 -1.0552823 -2.737331  0.000000

$n
[1] 379

$gp
[1] 3

$method
[1] "pearson"
> 
>

> spcor.test(myvar$api00, myvar$meals, myvar[,c(2,3,4)])
    estimate      p.value statistic   n gp  Method
1 -0.3190889 2.403284e-10 -6.511264 379  3 pearson
>
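
As a rough check on the output above, squaring the semipartial correlation of api00 with meals gives the share of the api00 variance that is uniquely attributable to meals. The same figure can be recovered from the t value of meals in the regression output through the identity $sr^2 = t^2 (1 - R^2) / df_{res}$, which is a standard result and is not stated in the source.

$$ sr^2_{meals} = (-0.3190889)^2 \approx 0.102 $$

$$ sr^2_{meals} = \frac{t^2 (1 - R^2)}{df_{res}} = \frac{(-15.081)^2 (1 - 0.8326)}{374} \approx 0.102 $$

So about 10% of the variance in api00 is explained by meals alone, even though the model as a whole explains about 83%.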

e.g.,

multiple regression examples
An example data file 1

Y1 - A measure of success in graduate school.
X1 - A measure of intellectual ability.
X2 - A measure of “work ethic.”
X3 - A second measure of intellectual ability.
X4 - A measure of spatial ability.
Y2 - Score on a major review paper.

An example data file 2

Age
Gender (0=Male, 1=Female)
Married (0=No, 1=Yes)
IncomeC Income in College (in thousands)
HealthC Score on Health Inventory in College
ChildC Number of Children while in College
LifeSatC Score on Life Satisfaction Inventory in College
SES Socio Economic Status of Parents
Smoker (0=No, 1=Yes)
SpiritC Score on Spirituality Inventory in College
Finish Finish the program in college (0=No, 1=Yes)
LifeSat Score on Life Satisfaction Inventory seven years after College
Income Income seven years after College (in thousands)

exercise

insurance.csv

dvar <- read.csv("http://commres.net/wiki/_media/insurance.csv")

Multiple Regression Exercise

Resources

https://www.youtube.com/user/marinstatlectures/search?query=Multiple+Linear+Regression+


¹⁾

Baldry, A. C. (2003). Bullying in schools and exposure to domestic violence. Child Abuse & Neglect, 27(7), 713-732. doi: 10.1016/S0145-2134(03)00114-5. bullying_in_schools_and_exposure_to_domestic_violence.pdf

²⁾

Yang, H.-E., Wu, C.-C., & Wang, K.-C. (2009). An empirical analysis of online game service satisfaction and loyalty. Expert Systems with Applications, 36(2, Part 1), 1816-1825. an_empirical_analysis_of_online_game_service_satisfaction_and_loyalty.pdf

³⁾

Rice, R. E., & Katz, J. E. (2008). Assessing new cell phone text and video services. Telecommunications Policy, 32(7), 455-467. assessing_new_cell_phone_text_and_video_services.pdf

⁴⁾

⁵⁾

Additional reading: Finocchiaro Castro, M. (2008). Where are you from? Cultural differences in public good experiments. Journal of Socio-Economics, 37(6), 2319-2329. cultural_differences_in_public_good_experiments.pdf
