Differences

This shows you the differences between two versions of the page.

--- using_dummy_variables [2026/06/09 22:45] – hkimscil
+++ using_dummy_variables [2026/06/14 23:15] (current) – [Regression with two Catogorical IVs] hkimscil
@@ Line 549: / Line 549: @@
 = 81-100
 </code>
+  * pct (%) of emer (emergency) credentials: Some states will grant an emergency certificate or permit at the request of a school district that has posted and failed to find a qualified candidate for a teacher vacancy. It typically allows the candidate to serve in a temporary capacity for the duration of a school year.
+  * pct (%) of full credentials: Full certification generally means that a teacher has graduated from an accredited college, completed an approved teacher credential program, and passed a test of academic skills.
 Among the variables above, a regression test was run to see how "year-round school" (a school with no break) affects scores. The results are below.
@@ Line 871: / Line 874: @@
 {{pasted:20260609-224414.png?600}}
 {{pasted:20260609-224430.png?600}}
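+Looking at the output above, the yr_rnd variable, which distinguishes schools with a break from schools without one, explains about 22.6% of the total variation (Sum of Squares Total) in student scores, and this is statistically significant (F(1, 398) = 116.241, p < .001). In the regression output above, the schools with a break are the baseline category of yr_rnd; with that in mind, the regression equation can be written as follows.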
-|  **Model Summary**  |||||
+''y hat = 684.54 - 160.51 * (yr_rndno_break) ''
-| Model   | R   | R Square   | Adjusted R Square   | Std. Error of the Estimate   |
+  * yr_rndno_break: yr_rndno_break = 1
-| 1   | .475a   | 0.226   | 0.224   | 125.3   |
+    * y hat = 684.54 - 160.51 * (1)
-| a. Predictors: (Constant), year round school   |||||
+    * y hat = 524.03
+  * yr_rndbreak: yr_rndno_break = 0
+    * y hat = 684.54 - 160.51 * (0)
+    * y hat = 684.54
-|  **ANOVA(b)**  |||||||
+In the regression equation above, R
-| Model   |    | Sum of Squares   | df   | Mean Square   | F   | Sig.   |
+uses the form ''y hat = a + bX'', with ''no_break'' (yr_rndno_break) as ''X''. The equation is interpreted by plugging in 1 for yr_rndno_break. That is, for a no-break school, ''y hat = 684.54 - 160.51(1) ''. Conversely, for a school with a break the second term does not apply, so it is replaced with ''0'', and the regression equation becomes ''y hat = 684.54''. Comparing the two, for no break we get
-| 1   | Regression   | 1825000.563   | 1   | 1825000.563   | 116.241   | .000a   |
+''y hat = 684.54 - 160.51(1) = 524.03'', and
-|    | Residual   | 6248671.435   | 398   | 15700.179   |    |    |
+for break we get
-|    | Total   | 8073671.997   | 399   |    |    |    |
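+''y hat = 684.54 - 160.51(0) = 684.54''. In other words, the mean api score of schools without a break is 524.03, while the mean of schools with a break is 684.54. This difference has the same form as a comparison of two group means; that is, it is equivalent to running a t-test. The values obtained above,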
-| a. Predictors: (Constant), year round school  \\ b. Dependent Variable: api 2000  |||||||
+  * t.value
+  * F.value
+  * t.value.lm
-|  **Coefficients(a)**  |||||||
+  * F.value.lm all test the two groups (the groups' mean api scores) on the same principle.
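+A minimal sketch of this equivalence, using simulated data rather than the api data. The names ''grp'', ''y'', ''t.lm'' and ''F.lm'' are hypothetical, and ''var.equal = TRUE'' is set because ''t.test()'' defaults to the Welch test, which does not pool the variances the way the regression does.
+<code>
+set.seed(1)
+grp <- factor(rep(c("br", "nobr"), times = c(30, 20)))     # br is the baseline level
+y   <- ifelse(grp == "br", 680, 520) + rnorm(50, sd = 100)  # simulated scores
+
+tt  <- t.test(y ~ grp, var.equal = TRUE)   # pooled-variance two-sample t-test
+fit <- summary(lm(y ~ grp))                # regression on the dummy-coded factor
+
+t.lm <- fit$coefficients["grpnobr", "t value"]
+F.lm <- unname(fit$fstatistic["value"])
+
+# same test three ways: |t| from t.test equals |t| from lm, and t^2 equals F
+same.t <- isTRUE(all.equal(abs(unname(tt$statistic)), abs(t.lm)))
+same.F <- isTRUE(all.equal(t.lm^2, F.lm))
+</code>
+Both ''same.t'' and ''same.F'' should be TRUE; the sign of t differs only because ''t.test()'' subtracts the groups in the opposite order from the regression coefficient.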
-|    |    | Unstandardized Coefficients   |    | Standardized Coefficients   |    |    |
-| Model   |    | B   | Std. Error   | Beta   | t   | Sig.   |
-| 1   | (Constant)   | 684.539   | 7.14   |    | 95.878   | 0   |
-|    | year round school   | -160.506   | 14.887   | -0.475   | -10.782   | 0   |
-| a. Dependent Variable: api 2000   |||||||
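-Looking at the output above,
-year-round schooling (no break) accounts for about 23% of the variance in student scores, and this is statistically significant (F(1, 398) = 116.241, p < .001). The coefficients obtained from the regression give
-$\hat{Y} = 684.539 - 160.506 X$
-where
-  * X: 0 = No
-  * X: 1 = Yes
-so substituting X = 0 (schools with a break) gives an estimate of 684.539, and substituting X = 1 (schools without a break) gives an estimate of 524.033. The significance of the b coefficient, which represents this difference, is judged with a t-test; the t value is -10.782, and its square equals the F value of 116.241 (that is, $t^2 = F$ ). Strictly speaking this situation calls for a t-test (a difference in mean scores between two groups), but running a regression gives the same result (and, in the same sense, so would an F-test).
-The estimates above are also equal to the means of the two categories of X (year-round schools and regular schools), and the coefficient of X, -160.506, is exactly the difference between these two means.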
-<code>IGRAPH
- /X1 = VAR(yr_rnd) TYPE = scale
- /Y = VAR (api00) TYPE = SCALE
- /FITLINE METHOD = REGRESSION  LINEAR LINE = TOTAL MEFFECT
- /CATORDER VAR(yr_rnd) (ASCENDING VALUES  OMITEMPTY)
- /SCATTER COINCIDENT = NONE.
-</code>
 {{regressionCategory.jpg}}
-In the graph above, the straight line is $\hat{Y} = 684.539 - 160.506 X$.
+The graph above shows the distribution of api scores for the two groups (break and no-break schools). The straight line connects the two group means and is expressed as $\hat{Y} = 684.539 - 160.506 X$.
-<code>MEANS
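+In this way a regression test can be run even with a categorical (nominal) variable, and it is in fact no different from a t-test or an F-test. One point to note is that when the two categories are coded by hand, they are coded 0 and 1 rather than 1 and 2. This makes interpretation easier, and it is the usual practice. Data coded 1 and 2 nevertheless give much the same result: the constant (intercept) changes, while the coefficient is identical to the one in the analysis above. In general, R does not require you to code dummy variables yourself; it builds them internally when the predictor is a factor or character variable. That is, ''lm(api00 ~ yr_rnd, data=df)'' works without creating a dummy variable first.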
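+The two claims above can be checked with a small simulation (a sketch on made-up data, not the api data; ''x01'', ''x12'' and ''g'' are hypothetical names). Recoding 0/1 as 1/2 leaves the slope unchanged and shifts the intercept by one slope, and ''model.matrix()'' shows the 0/1 dummy column that ''lm()'' builds from a factor.
+<code>
+set.seed(2)
+x01 <- rep(c(0, 1), each = 10)                # 0 = break, 1 = no break
+y   <- 684 - 160 * x01 + rnorm(20, sd = 50)   # simulated scores
+x12 <- x01 + 1                                # same groups coded 1 and 2
+
+b01 <- coef(lm(y ~ x01))
+b12 <- coef(lm(y ~ x12))
+
+# the slope is identical; the 1/2 intercept equals (0/1 intercept - slope)
+same.slope    <- isTRUE(all.equal(b01[["x01"]], b12[["x12"]]))
+shifted.const <- isTRUE(all.equal(b12[["(Intercept)"]],
+                                  b01[["(Intercept)"]] - b01[["x01"]]))
+
+# lm() dummy-codes a factor by itself; the first level is the baseline
+g <- factor(x01, levels = c(0, 1), labels = c("br", "nobr"))
+dummy.cols <- colnames(model.matrix(~ g))   # "(Intercept)" "gnobr"
+</code>
+With 1/2 coding the intercept is the fitted value at X = 0, a group that does not exist, which is why 0/1 coding is easier to read.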
-  TABLES=api00 BY yr_rnd.
-</code>
-|  **Report**  ||||
-| api 2000 ||||
-| year round school  | Mean  | N  | Std. Deviation  |
-| No     | 684.54  | 308  | 132.113  |
-| Yes    | 524.03  | 92   | 98.916   |
-| Total  | 647.62  | 400  | 142.249  |
-In this way a regression test can be run even with a categorical (nominal) variable, and it is in fact no different from a t-test or an F-test. One point to note is that the two categories were coded 0 and 1 rather than 1 and 2. This makes interpretation easier, and it is the usual practice. Data coded 1 and 2 nevertheless give much the same result: the constant (intercept) changes, while the coefficient is identical to the one in the analysis above.
-===== 3 or more groups =====
-How should a variable with three or more categories, as in an ANOVA test, be handled? Below is the result of testing such a variable with regression.
+===== Regression with a categorical IV with 3 attributes =====
+<tabbox rs.3att>
 <code>
-> mod2 <- lm(api00 ~ factor(mealcat), data=datavar)
+m.mealcat <- lm(api00 ~ mealcat, data=df)
-> mod2
+summary(m.mealcat)
+</code>
+<tabbox ro.3att>
+<code>
+> #######################################
+> # categorical IV with 3 or more attributes
+> #######################################
+> m.mealcat <- lm(api00 ~ mealcat, data=df)
+> summary(m.mealcat)
 Call:
-lm(formula = api00 ~ factor(mealcat), data = datavar)
+lm(formula = api00 ~ mealcat, data = df)
-Coefficients:
-     (Intercept)  factor(mealcat)2  factor(mealcat)3
-           805.7            -166.3            -301.3
-> summary(mod2)
-Call:
-lm(formula = api00 ~ factor(mealcat), data = datavar)
 Residuals:
@@ Line 955: / Line 922: @@
 Coefficients:
-                 Estimate Std. Error t value Pr(>|t|)
+             Estimate Std. Error t value Pr(>|t|)
-(Intercept)       805.718      6.169  130.60   <2e-16 ***
+(Intercept)   805.718      6.169  130.60   <2e-16 ***
-factor(mealcat)2 -166.324      8.708  -19.10   <2e-16 ***
+mealcatto80  -166.324      8.708  -19.10   <2e-16 ***
-factor(mealcat)3 -301.338      8.629  -34.92   <2e-16 ***
+mealcatto100 -301.338      8.629  -34.92   <2e-16 ***
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
@@ Line 968: / Line 935: @@
 >
 </code>
+</tabbox>
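-A comparison between two groups was possible earlier, but with three groups a regression test would treat x as a continuous variable and fit a linear relationship. With a two-group variable, setting one group to 0 automatically makes the other 1, and the two are compared; with three groups, setting one group to 0 leaves two groups to compare against, which makes the comparison difficult. Since treating the categories as continuous does not fit reality, one usually creates extra variables for the groups (dummy variables) and runs the regression on those.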
+<code>
+y hat = 805.718 - 166.324 * to80 - 301.338 * to100
+mealcat0-46 (written as to46)
+mealcat47-80 (written as to80)
+mealcat81-100 (written as to100)
+</code>
-In SPSS, the recoding can be done as follows.
+This is interpreted in the same way as before. Below, mg1, mg2 and mg3 stand for the to46, to80 and to100 groups.
+  * y hat = 805.718 - 166.324*mg2 - 301.338*mg3
+  * mg1 = 1, mg2 = 0, mg3 = 0 일 경우
+    * y hat = 805.718 - 166.324*(0) - 301.338*(0)
+    * y hat = 805.718
+  * mg1 = 0, mg2 = 1, mg3 = 0 일 경우
+    * y hat = 805.718 - 166.324*(1) - 301.338*(0)
+    * y hat = 805.718 - 166.324
+    * y hat = 639.394
+  * mg1 = 0, mg2 = 0, mg3 = 1 일 경우
+    * y hat = 805.718 - 166.324*(0) - 301.338*(1)
+    * y hat = 805.718 - 301.338
+    * y hat = 504.38
-<code>compute mealcat1 = 0.
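+  * That is, the higher the percentage of free meals, the lower the api score. Whether the free-meal percentage (the independent variable) explains api00 (the achievement score, the dependent variable) to a statistically significant degree can be checked in the regression output (summary(m.mealcat)).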
-if mealcat = 1 mealcat1 = 1.
+    * The judgment uses the F-value and the p-value.
-compute mealcat2 = 0.
+      * (F (2, 397) = 611.1; p-value < 2.2e-16)
-if mealcat = 2 mealcat2 = 1.
+      * Here 2 and 397 are the between-groups and within-groups degrees of freedom, respectively. From these we can also tell that
-compute mealcat3 = 0.
+      * a total of 400 schools are in the data (2 + 397, plus 1), and
-if mealcat = 3 mealcat3 = 1.
+      * the independent variable has 3 categories (since df = 2).
-execute.
+    * The R-squared value gives the size of the explanatory power.
-</code>
+      * 0.7548: the independent variable explains 75.48% of the variance in the dependent variable (a substantial amount).
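+As a rough check of this reading, the sketch below fits the same kind of model on simulated data (not the api data; the object names and the simulated group means are assumptions for illustration). It checks that the intercept, and the intercept plus each coefficient, reproduce the three group means, and that the overall F test has 2 and n - 3 degrees of freedom.
+<code>
+set.seed(3)
+lev <- c("to46", "to80", "to100")
+g <- factor(rep(lev, each = 20), levels = lev)   # to46 is the baseline level
+y <- c(800, 640, 500)[as.integer(g)] + rnorm(60, sd = 60)
+
+fit <- lm(y ~ g)
+b   <- coef(fit)
+
+# predicted value for each group from the dummy-coded equation
+pred <- c(b[["(Intercept)"]],
+          b[["(Intercept)"]] + b[["gto80"]],
+          b[["(Intercept)"]] + b[["gto100"]])
+
+# these equal the plain group means
+means.match <- isTRUE(all.equal(pred, as.vector(tapply(y, g, mean))))
+
+# degrees of freedom of the overall F test: 3 - 1 = 2 and 60 - 3 = 57
+df.F <- unname(summary(fit)$fstatistic[c("numdf", "dendf")])
+</code>
+Here ''means.match'' should be TRUE and ''df.F'' should be 2 and 57, mirroring the 2 and 397 reported for the 400 schools.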
-The commands above set the given category to 1 and all others to 0, dichotomizing the variable. This creates three new variables (mealcat1, mealcat2, mealcat3), and only two of the three are used in the regression test. In SPSS:
-<code>regression
+===== Regression with a continuous and a categorical IV =====
- /dependent api00
+<tabbox rs.04>
- /method = enter mealcat2 mealcat3.
+<code>
-Note: do not enter all three variables.
+#######################################
-</code>
+#######################################
+# with a continuous and a category IV
+#######################################
+#######################################
+m.mealsyr_rnd <- lm(api00~meals+yr_rnd, data=df)
+summary(m.mealsyr_rnd)
+m.mealsxyr_rnd <- lm(api00~meals*yr_rnd, data=df)
+summary(m.mealsxyr_rnd)
-<code>		Model Summary
+# Important: judging by the t test above, the effect of yr_rnd
-Model	R	R Square	Adjusted R Square	Std. Error of the Estimate
+# disappears. Why?
-	.869a	.755	.754	70.612
+# Once the interaction is added, the yr_rnd coefficient
-a. Predictors: (Constant), mealcat3, mealcat2
+# (the effect of having a break or not) is estimated where the meals percentage
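+# is 0. Data with meals = 0 sit at the extreme of the range,
+# and there are no nobr schools there either, so the estimated
+# se is inflated and the t value shrinks. To fix this,
+# shift meals so that 0 falls in the middle (mean centering)
+# and rerun the regression; meals = 0 is then no longer extreme,
+# and the yr_rnd effect is evaluated at the mean of meals.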
-			ANOVA(b)
+df %>%
-Model		Sum of Squares	df	Mean Square	F	Sig.
+  ggplot(aes(x = meals, y = api00, color = factor(yr_rnd))) +
-	Regression	6094197.670	2	3047098.835	611.121	.000a
+  geom_point(size = 3) +
-	Residual	1979474.328	397	4986.081
+  geom_smooth(method = "lm", se = FALSE) +
-	Total	8073671.997	399
+  scale_color_brewer(palette = "Set1", name = "yr_rnd") +
-a. Predictors: (Constant), mealcat3, mealcat2
+  labs(title = "Multiple Regression: Interaction by yr_rnd break",
-b. Dependent Variable: api 2000
+       x = " (free meals percentage)",
+       y = "api00")
-			Coefficients(a)
-		Unstandardized Coefficients		Standardized Coefficients
-Model		B	Std. Error	Beta	t	Sig.
-	(Constant)	805.718	6.169		130.599	.000
-	mealcat2	-166.324	8.708	-.550	-19.099	.000
-	mealcat3	-301.338	8.629	-1.007	-34.922	.000
-a. Dependent Variable: api 2000
-</code>
+# 1. Store the mean-centered values as a new variable
+df$meals_centered <-
+  df$meals - mean(df$meals)
-|  **Coefficients(a)**  |||||||
+# 2. Run the model again using the centered variable
-|  |  | Unstandardized Coefficients   |    | Standardized Coefficients   |   |   |
+m_centered <- lm(api00 ~ meals_centered * yr_rnd,
-| Model   |    | B   | Std. Error   | Beta   | t   | Sig.   |
+                     data = df)
-| 1   | (Constant)   | 805.718   | 6.169   |    | 130.599   | .000   |
-|    | mealcat2   | -166.324   | 8.708   | -.550   | -19.099   | .000   |
-|    | mealcat3   | -301.338   | 8.629   | -1.007   | -34.922   | .000   |
-| a. Dependent Variable: api 2000   |||||||
-From the above,
+# 3. Check the new summary
-$\hat{Y} = 805.718 - 166.324 \; \text{mealcat2} - 301.338 \; \text{mealcat3} $
+summary(m_centered)
-This is interpreted as follows:
+df %>%
-  * when mealcat2 and mealcat3 are both 0 (i.e., the mealcat1 group): 805.718
+  ggplot(aes(x = meals_centered, y = api00, color = factor(yr_rnd))) +
-  * when mealcat2 is 1 and mealcat3 is 0: 805.718-166.324 = 639.39
+  geom_point(size = 3) +
-  * when mealcat3 is 1 and mealcat2 is 0: 805.718-301.338 = 504.38
+  geom_smooth(method = "lm", se = FALSE) +
-  * These values are exactly the means of the three groups.
+  scale_color_brewer(palette = "Set1", name = "yr_rnd") +
+  labs(title = "Multiple Regression: Interaction by yr_rnd break",
+       x = " (free meals percentage CENTERED)",
+       y = "api00")
-<code>	Report
-api 2000
-Percentage free meals in 3 categories	Mean	N	Std. Deviation
-0-46% free meals		805.72	131	65.669
-47-80% free meals		639.39	132	82.135
-81-100% free meals		504.38	137	62.727
-	Total				647.62	400	142.249
-</code>
-Above, leaving out the mealcat3 group instead of mealcat1 would not have affected the interpretation of the results.
+# Install the package if you do not have it
+# install.packages("interactions")
-Finally, the test above is equivalent to the FactorialAnova mentioned earlier.
+# Plot the interaction
+library(interactions)
+interact_plot(m.mealsxyr_rnd, pred = meals, modx = yr_rnd)
-<code>glm
- api00 by mealcat
- /print=parameter.
+m.ellyr_rnd <- lm(api00~ell+yr_rnd, data=df)
+summary(m.ellyr_rnd)
+m.ellxyr_rnd <- lm(api00~ell*yr_rnd, data=df)
+summary(m.ellxyr_rnd)
+df %>%
+  ggplot(aes(x = ell, y = api00, color = factor(yr_rnd))) +
+  geom_point(size = 3) +
+  geom_smooth(method = "lm", se = FALSE) +
+  scale_color_brewer(palette = "Set1", name = "yr_rnd") +
+  labs(title = "Multiple Regression: Interaction by yr_rnd break",
+       x = " (ell, english language learner)",
+       y = "api00")
+# ell mean-centered: in this case it makes little difference
+df$ell_centered <- df$ell - mean(df$ell)
+m.ell.centered <- lm(api00 ~ ell_centered * yr_rnd, data = df)
+summary(m.ell.centered)
+coefs <- summary(m.ellxyr_rnd)$coefficients
+coefs
+api.br <- coefs[1]
+api.nobr <- coefs[1]+coefs[3]
+slope.red <- coefs[2]
+slope.blue <- coefs[2]+coefs[4]
+data.frame(api.br, api.nobr, slope.red, slope.blue)
+# the slope of ell differing between br and nobr
+# is the interaction effect
 </code>
+<tabbox ro.04>
+<code>
+> #######################################
+> #######################################
+> # with a continuous and a category IV
+> #######################################
+> #######################################
+> m.mealsyr_rnd <- lm(api00~meals+yr_rnd, data=df)
+> summary(m.mealsyr_rnd)
-|  **Between-Subjects Factors**  ||||
+Call:
-|    |    | Value Label   | N   |
+lm(formula = api00 ~ meals + yr_rnd, data = df)
-| Percentage free meals in 3 categories   | 1   | 0-46% free meals   | 131   |
-|    | 2   | 47-80% free meals   | 132   |
-|    | 3   | 81-100% free meals   | 137   |
-|  **Tests of Between-Subjects Effects**  ||||||
+Residuals:
-| Dependent Variable:api 2000   |  |  |  |  |  |
+     Min       1Q   Median       3Q      Max
-| Source   | Type III Sum of Squares   | df   | Mean Square   | F   | Sig.   |
+-174.882  -41.542   -1.044   39.454  176.007
-| Corrected Model   | 6.094E6   | 2   | 3047098.835   | 611.121   | .000   |
-| Intercept   | 1.688E8   | 1   | 1.688E8   | 33863.695   | .000   |
-| mealcat   | 6094197.670   | 2   | 3047098.835   | 611.121   | .000   |
-| Error   | 1979474.328   | 397   | 4986.081   |    |    |
-| Total   | 1.758E8   | 400   |    |    |    |
-| Corrected Total   | 8073671.997   | 399   |    |    |    |
-|a. R Squared = .755 (Adjusted R Squared = .754)   ||||||
-|  **Parameter Estimates**   |||||||
+Coefficients:
-| Dependent Variable:api 2000   |||||||
+            Estimate Std. Error t value Pr(>|t|)
-|    |    |    |    |    | 95% Confidence Interval   |  |
+(Intercept) 885.6192     6.4712 136.855  < 2e-16 ***
-| Parameter   | B   | Std. Error   | t   | Sig.   | Lower Bound   | Upper Bound   |
+meals        -3.7921     0.1036 -36.595  < 2e-16 ***
-| Intercept   | 504.380   | 6.033   | 83.606   | .000   | 492.519   | 516.240   |
+yr_rndnobr  -40.3291     7.8479  -5.139 4.35e-07 ***
-| [mealcat=1]   | 301.338   | 8.629   | 34.922   | .000   | 284.374   | 318.302   |
+---
-| [mealcat=2]   | 135.014   | 8.612   | 15.677   | .000   | 118.083   | 151.945   |
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
-| [mealcat=3]   | 0a   | .   | .   | .   | .   | .   |
-| a. This parameter is set to zero because it is redundant.   ||||||||
-or a one-way ANOVA
+Residual standard error: 59.99 on 397 degrees of freedom
+Multiple R-squared:  0.823,	Adjusted R-squared:  0.8221
+F-statistic: 923.2 on 2 and 397 DF,  p-value: < 2.2e-16
-<code>ONEWAY api00 BY mealcat
+>
-  /STATISTICS DESCRIPTIVES EFFECTS HOMOGENEITY
+> m.mealsxyr_rnd <- lm(api00~meals*yr_rnd, data=df)
-  /PLOT MEANS
+> summary(m.mealsxyr_rnd)
-  /MISSING ANALYSIS
-  /POSTHOC=TUKEY SCHEFFE ALPHA(0.05).
-</code>
-|	|  Sum of Squares  |  df  |  Mean Square  |  F  |  Sig.  |
+Call:
-|  Between Groups  |  6094197.67  |  2  |  3047098.835  |  611.120953  |  .000  |
+lm(formula = api00 ~ meals * yr_rnd, data = df)
-|  Within Groups  |  1979474.328  |  397  |  4986.08143  |    |    |
-|  Total  |  8073671.998  |  399  |    |    |    |
+Residuals:
+    Min      1Q  Median      3Q     Max
+-175.78  -41.96   -0.76   39.06  175.49
-===== 2 variables, categorical =====
+Coefficients:
-A regression can also be run with __both of the independent variables__ used above. Following the reasoning above, this is a kind of FactorialAnova.
+                 Estimate Std. Error t value Pr(>|t|)
+(Intercept)      885.0046     6.7647 130.827   <2e-16 ***
+meals             -3.7805     0.1100 -34.354   <2e-16 ***
+yr_rndnobr       -31.8737    27.9104  -1.142    0.254
+meals:yr_rndnobr  -0.1041     0.3299  -0.316    0.752
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
-<code>regression
+Residual standard error: 60.06 on 396 degrees of freedom
- /dep api00
+Multiple R-squared:  0.8231,	Adjusted R-squared:  0.8217
- /method =  enter yr_rnd mealcat1 mealcat2.
+F-statistic: 614.1 on 3 and 396 DF,  p-value: < 2.2e-16
+>
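+> # Important: judging by the t test above, the effect of yr_rnd
+> # disappears. Why?
+> # Once the interaction is added, the yr_rnd coefficient
+> # (the effect of having a break or not) is estimated where the meals percentage
+> # is 0. Data with meals = 0 sit at the extreme of the range,
+> # and there are no nobr schools there either, so the estimated
+> # se is inflated and the t value shrinks. To fix this,
+> # shift meals so that 0 falls in the middle (mean centering)
+> # and rerun the regression; meals = 0 is then no longer extreme,
+> # and the yr_rnd effect is evaluated at the mean of meals.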
+>
+> df %>%
++   ggplot(aes(x = meals, y = api00, color = factor(yr_rnd))) +
++   geom_point(size = 3) +
++   geom_smooth(method = "lm", se = FALSE) +
++   scale_color_brewer(palette = "Set1", name = "yr_rnd") +
++   labs(title = "Multiple Regression: Interaction by yr_rnd break",
++        x = " (free meals percentage)",
++        y = "api00")
+`geom_smooth()` using formula = 'y ~ x'
+>
+>
 </code>
+{{pasted:20260609-230452.png?600}}
-|  **Model Summary**  |||||
+<code>
-| Model   | R   | R Square   | Adjusted R Square   | Std. Error of the Estimate   |
+> # 1. Store the mean-centered values as a new variable
-| 1   | .876a   | .767   | .765   | 68.893   |
+> df$meals_centered <-
-| a. Predictors: (Constant), mealcat2, year round school, mealcat1   |||||
++   df$meals - mean(df$meals)
+>
+> # 2. Run the model again using the centered variable
+> m_centered <- lm(api00 ~ meals_centered * yr_rnd,
++                      data = df)
+>
+> # 3. Check the new summary
+> summary(m_centered)
-|  **ANOVA(b)**  |||||||
+Call:
-| Model   |              | Sum of Squares   | df    | Mean Square   | F   | Sig.   |
+lm(formula = api00 ~ meals_centered * yr_rnd, data = df)
-| 1       | Regression   | 6194144.303      | 3     | 2064714.768   | 435.017  | .000a   |
-|         | Residual     | 1879527.694      | 396   | 4746.282   |    |    |
-|         | Total        | 8073671.997      | 399   |    |    |    |
-| a. Predictors: (Constant), mealcat2, year round school, mealcat1   ||||||
-| b. Dependent Variable: api 2000   |||||||
-^  **Coefficients(a)**   ^^^^^^^
+Residuals:
-|    |    | Unstandardized \\ Coefficients   |    | Standardized \\ Coefficients   |   |   |
+    Min      1Q  Median      3Q     Max
-| Model   |    | B   | Std. Error   | Beta   | t   | Sig.   |
+-175.78  -41.96   -0.76   39.06  175.49
-| 1   | (Constant)   | 526.330   | 7.585   |    | 69.395   | .000   |
-|    | year round school   | -42.960   | 9.362   | -.127   | -4.589   | .000   |
-|    | mealcat1   | 281.683   | 9.446   | .930   | 29.821   | .000   |
-|    | mealcat2   | 117.946   | 9.189   | .390   | 12.836   | .000   |
-| a. Dependent Variable: api 2000   |||||||
+Coefficients:
+                          Estimate Std. Error t value Pr(>|t|)
+(Intercept)               656.9827     3.5150 186.910  < 2e-16 ***
+meals_centered             -3.7805     0.1100 -34.354  < 2e-16 ***
+yr_rndnobr                -38.1550    10.4473  -3.652 0.000295 ***
+meals_centered:yr_rndnobr  -0.1041     0.3299  -0.316 0.752386
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
-This is the same analysis, run so that the effect of the last two variables can be seen separately.
+Residual standard error: 60.06 on 396 degrees of freedom
+Multiple R-squared:  0.8231,	Adjusted R-squared:  0.8217
+F-statistic: 614.1 on 3 and 396 DF,  p-value: < 2.2e-16
-<code>regression
+>
- /dep api00
+> df %>%
- /method =  enter yr_rnd
++   ggplot(aes(x = meals_centered, y = api00, color = factor(yr_rnd))) +
- /method = test(mealcat1 mealcat2).
++   geom_point(size = 3) +
++   geom_smooth(method = "lm", se = FALSE) +
++   scale_color_brewer(palette = "Set1", name = "yr_rnd") +
++   labs(title = "Multiple Regression: Interaction by yr_rnd break",
++        x = " (free meals percentage CENTERED)",
++        y = "api00")
+`geom_smooth()` using formula = 'y ~ x'
+>
+>
 </code>
+{{pasted:20260609-230505.png?600}}
-^ Model Summary   ^^^^^
+<code>
-| Model   | R   | R Square   | Adjusted R Square   | Std. Error of the Estimate   |
+> # Install the package if you do not have it
-| 1   | .475a   | .226   | .224   | 125.300   |
+> # install.packages("interactions")
-| 2   | .876b   | .767   | .765   | 68.893   |
+>
-| a. Predictors: (Constant), year round school   \\ b. Predictors: (Constant), year round school, mealcat2, mealcat1   |||||
+> # Plot the interaction
+> library(interactions)
+> interact_plot(m.mealsxyr_rnd, pred = meals, modx = yr_rnd)
+>
+>
+>
+</code>
+{{pasted:20260609-230516.png?600}}
+<code>
+> m.ellyr_rnd <- lm(api00~ell+yr_rnd, data=df)
+> summary(m.ellyr_rnd)
-| ANOVA(d) |  |  |  |  |  |  |  |  |
+Call:
-| Model   |    |    | Sum of Squares   | df   | Mean Square   | F   | Sig.   | R Square Change   |
+lm(formula = api00 ~ ell + yr_rnd, data = df)
-| 1   | Regression   |    | 1825000.563   | 1   | 1825000.563   | 116.241   | .000a   |    |
-|    | Residual   |    | 6248671.435   | 398   | 15700.179   |    |    |    |
-|    | Total   |    | 8073671.997   | 399   |    |    |    |    |
-| 2   | Subset Tests   | mealcat1, mealcat2   | 4369143.740   | 2   | 2184571.870   | 460.270   | .000b   | .541   |
-|    | Regression   |    | 6194144.303   | 3   | 2064714.768   | 435.017   | .000c   |    |
-|    | Residual   |    | 1879527.694   | 396   | 4746.282   |    |    |    |
-|    | Total   |    | 8073671.997   | 399   |    |    |    |    |
-| a. Predictors: (Constant), year round school   \\ b. Tested against the full model.   \\ c. Predictors in the Full Model: (Constant), year round school, mealcat2, mealcat1.  \\ d. Dependent Variable: api 2000  |||||||||
+Residuals:
+     Min       1Q   Median       3Q      Max
+-265.843  -55.349    0.334   70.673  195.778
-^ Coefficients(a)    ^^^^^^^
+Coefficients:
-|    |    | Unstandardized Coefficients   |    | Standardized Coefficients   |   |   |
+            Estimate Std. Error t value Pr(>|t|)
-| Model   |    | B   | Std. Error   | Beta   | t   | Sig.   |
+(Intercept) 784.3978     7.2878  107.63  < 2e-16 ***
-| 1   | (Constant)   | 684.539   | 7.140   |    | 95.878   | .000   |
+ell          -4.0427     0.2094  -19.31  < 2e-16 ***
-|    | year round school   | -160.506   | 14.887   | -.475   | -10.782   | .000   |
+yr_rndnobr  -41.8420    12.3441   -3.39  0.00077 ***
-| 2   | (Constant)   | 526.330   | 7.585   |    | 69.395   | .000   |
+---
-|    | year round school   | -42.960   | 9.362   | -.127   | -4.589   | .000   |
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
-|    | mealcat1   | 281.683   | 9.446   | .930   | 29.821   | .000   |
-|    | mealcat2   | 117.946   | 9.189   | .390   | 12.836   | .000   |
-| a. Dependent Variable: api 2000   |||||||
+Residual standard error: 90.1 on 397 degrees of freedom
+Multiple R-squared:  0.6008,	Adjusted R-squared:  0.5988
+F-statistic: 298.8 on 2 and 397 DF,  p-value: < 2.2e-16
-^  **Excluded Variables(b)**   ^^^^^^^
+> m.ellxyr_rnd <- lm(api00~ell*yr_rnd, data=df)
-|    |    |    |    |    |    | Collinearity Statistics   |
+> summary(m.ellxyr_rnd)
-| Model   |    | Beta In   | t   | Sig.   | Partial Correlation   | Tolerance   |
-| 1   | mealcat1   | .697a   | 23.132   | .000   | .758   | .914   |
-|    | mealcat2   | -.138a   | -3.106   | .002   | -.154   | .962   |
-| a. Predictors in the Model: (Constant), year round school   \\ b. Dependent Variable: api 2000   |||||||
+Call:
+lm(formula = api00 ~ ell * yr_rnd, data = df)
-On interpretation . . . .
+Residuals:
-^  **interpretation**  ^^^^
+     Min       1Q   Median       3Q      Max
-|   | mealcat=1   | mealcat=2   | mealcat=0   |
+-270.514  -57.800    5.994   65.845  206.275
-|yr_rnd=0   | cell1   | cell2   | cell3   |
-|yr_rnd=1   | cell4   | cell5   | cell6   |
+Coefficients:
+                Estimate Std. Error t value Pr(>|t|)
+(Intercept)     794.2576     7.8422 101.280  < 2e-16 ***
+ell              -4.4418     0.2420 -18.354  < 2e-16 ***
+yr_rndnobr     -110.5779    24.7935  -4.460 1.07e-05 ***
+ell:yr_rndnobr    1.4884     0.4673   3.185  0.00156 **
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
-^  **interpretation**  ^^^^
+Residual standard error: 89.08 on 396 degrees of freedom
-|   | mealcat=1   | mealcat=2   | mealcat=0   |
+Multiple R-squared:  0.6108,	Adjusted R-squared:  0.6078
-|   | mealcat=1->1   | mealcat=2->1    | mealcat=3->mealcat1,2=0   |
+F-statistic: 207.1 on 3 and 396 DF,  p-value: < 2.2e-16
-| yr_rnd=0   | cell1   | cell2   | cell3   |
-| :::  | intercept + \\ BMealCat1   | intercept + \\ BMealCat2   | intercept   |
-| yr_rnd=1   | cell4   | cell5   | cell6   |
-| :::  | intercept + \\ BMealCat1 + \\ Byr_rnd   | intercept + \\ BMealCat2 + \\ Byr_rnd   | intercept + \\ Byr_rnd   |
-<code>glm
+> df %>%
-  api00 BY yr_rnd mealcat
++   ggplot(aes(x = ell, y = api00, color = factor(yr_rnd))) +
-  /DESIGN = yr_rnd mealcat
++   geom_point(size = 3) +
-  /print=parameter TEST(LMATRIX).
++   geom_smooth(method = "lm", se = FALSE) +
-</code>
++   scale_color_brewer(palette = "Set1", name = "yr_rnd") +
++   labs(title = "Multiple Regression: Interaction by yr_rnd break",
++        x = " (ell, english language learner)",
++        y = "api00")
+`geom_smooth()` using formula = 'y ~ x'
+>
+>
+> # ell mean-centered: in this case it makes little difference
+> df$ell_centered <- df$ell - mean(df$ell)
+> m.ell.centered <- lm(api00 ~ ell_centered * yr_rnd, data = df)
+> summary(m.ell.centered)
-===== continuous + categorical variables =====
+Call:
-<code>regress
+lm(formula = api00 ~ ell_centered * yr_rnd, data = df)
- /dep = api00
- /method = enter yr_rnd some_col
- /save pre.
-* pre = predicted value (y hat).
-output:
+Residuals:
+     Min       1Q   Median       3Q      Max
+-270.514  -57.800    5.994   65.845  206.275
-		Model Summary(b)
+Coefficients:
-Model	R	R Square	Adjusted R Square	Std. Error of the Estimate
+                        Estimate Std. Error t value Pr(>|t|)
-	.507a	.257	.253	122.951
+(Intercept)             654.5514     5.3323 122.752  < 2e-16 ***
-a. Predictors: (Constant), parent some college, year round school
+ell_centered             -4.4418     0.2420 -18.354  < 2e-16 ***
-b. Dependent Variable: api 2000
+yr_rndnobr              -63.7652    14.0117  -4.551 7.12e-06 ***
+ell_centered:yr_rndnobr   1.4884     0.4673   3.185  0.00156 **
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
-			ANOVA(b)
+Residual standard error: 89.08 on 396 degrees of freedom
-Model		Sum of Squares	df	Mean Square	F	Sig.
+Multiple R-squared:  0.6108,	Adjusted R-squared:  0.6078
-	Regression	2072201.839	2	1036100.919	68.539	.000a
+F-statistic: 207.1 on 3 and 396 DF,  p-value: < 2.2e-16
-	Residual	6001470.159	397	15117.053
+>
-	Total		8073671.997	399
+>
-a. Predictors: (Constant), parent some college, year round school
+> coefs <- summary(m.ellxyr_rnd)$coefficients
-b. Dependent Variable: api 2000
+> coefs
+                  Estimate Std. Error    t value      Pr(>|t|)
+(Intercept)     794.257647  7.8421894 101.280090 3.235236e-285
+ell              -4.441819  0.2420092 -18.353922  6.903303e-55
+yr_rndnobr     -110.577884 24.7935271  -4.459950  1.069371e-05
+ell:yr_rndnobr    1.488362  0.4673174   3.184906  1.562618e-03
+> api.br <- coefs[1]
+> api.nobr <- coefs[1]+coefs[3]
+> slope.red <- coefs[2]
+> slope.blue <- coefs[2]+coefs[4]
+> data.frame(api.br, api.nobr, slope.red, slope.blue)
+    api.br api.nobr slope.red slope.blue
+794.2576 683.6798 -4.441819  -2.953456
+> # the slope of ell differing between br and nobr
+> # is the interaction effect
+>
+</code>
+{{pasted:20260609-230601.png?600}}
-			Coefficients(a)
+</tabbox>
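+Why centering changes the yr_rnd row but not the interaction row can be seen algebraically: substituting x = xc + mean(x) into y = a + b*x + c*g + d*x*g gives a group coefficient of c + d*mean(x), while d itself is untouched. The sketch below checks this on simulated data (not the api data; all object names and the simulated coefficients are assumptions for illustration).
+<code>
+set.seed(4)
+n <- 200
+g <- factor(rep(c("br", "nobr"), each = n / 2))   # br is the baseline level
+x <- runif(n, 20, 100)                            # stand-in for a percentage predictor
+y <- 880 - 3.8 * x - 40 * (g == "nobr") + rnorm(n, sd = 60)
+
+xc  <- x - mean(x)                # mean-centered predictor
+raw <- coef(lm(y ~ x * g))
+cen <- coef(lm(y ~ xc * g))
+
+# the interaction slope is unchanged by centering
+same.int <- isTRUE(all.equal(raw[["x:gnobr"]], cen[["xc:gnobr"]]))
+
+# centered group effect = raw group effect + interaction * mean(x)
+shifted.g <- isTRUE(all.equal(cen[["gnobr"]],
+                              raw[["gnobr"]] + raw[["x:gnobr"]] * mean(x)))
+</code>
+Both checks should be TRUE. This matches the output above, where the interaction estimate is -0.1041 in both the raw and the centered meals models while the yr_rndnobr estimate and its standard error change.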
-		Unstandardized Coefficients		Standardized Coefficients
-Model				B		Std. Error	Beta	t	Sig.
-	(Constant)		637.858		13.503			47.237	.000
-	year round school	-149.159	14.875		-.442	-10.027	.000
-	parent some college	2.236		.553		.178	4.044	.000
-a. Dependent Variable: api 2000
-</code>
+===== Regression with a categorical and a continuous IV: e.g. 2 =====
+<tabbox rs.06>
+<code>
+###############################
+# Another example
+###############################
+m.ellmealcat <- lm(api00~ell+mealcat, data=df)
+summary(m.ellmealcat)
+# install.packages("emmeans")
+library(emmeans)  # must be loaded before emmeans() is called below
+# Calculate estimated marginal means for each group
+gmeans <- emmeans(m.ellmealcat, specs = ~ mealcat)
+# Perform pairwise comparisons using Tukey's adjustment (Default)
+pairs(gmeans, adjust = "tukey")
+m.ellxmealcat <- lm(api00~ell*mealcat, data=df)
+summary(m.ellxmealcat)
+df %>%
+  ggplot(aes(x = ell, y = api00, color = factor(mealcat))) +
+  geom_point(size = 3) +
+  geom_smooth(method = "lm", se = FALSE) +
+  scale_color_brewer(palette = "Set1", name = "mealcat") +
+  labs(title = "Multiple Regression: Interaction by meal cat",
+       x = " (ell, english language learner)",
+       y = "api00")
+mealcat1 <- summary(m.ellxmealcat)$coefficients[[1]]
+mealcat2 <- summary(m.ellxmealcat)$coefficients[[1]] +
+  summary(m.ellxmealcat)$coefficients[[3]]
+mealcat3 <- summary(m.ellxmealcat)$coefficients[[1]] +
+  summary(m.ellxmealcat)$coefficients[[4]]
+data.frame(mealcat1, mealcat2, mealcat3)
+ell.slope.mealcat1 <- summary(m.ellxmealcat)$coefficients[[2]]
+ell.slope.mealcat2 <- summary(m.ellxmealcat)$coefficients[[2]] +
+  summary(m.ellxmealcat)$coefficients[[5]]
+ell.slope.mealcat3 <- summary(m.ellxmealcat)$coefficients[[2]] +
+  summary(m.ellxmealcat)$coefficients[[6]]
+data.frame(ell.slope.mealcat1,
+           ell.slope.mealcat2,
+           ell.slope.mealcat3)
-<code>COMPUTE filt=(yr_rnd=0).
-FILTER BY filt.
-regress
- /dep = api00
- /method = enter some_col.
 </code>
+<tabbox ro.06>
+<code>
+> ###############################
+> # Another example
+> ###############################
+> m.ellmealcat <- lm(api00~ell+mealcat, data=df)
+> summary(m.ellmealcat)
-The (SPSS) command above selects the cases whose yr_rnd value is 0; the filter variable is set to 1 for those cases, and
+Call:
-cases that do not pass the filter are dropped, so only the filtered cases are used in the analysis. In other words, the command runs a simple regression only on the cases where yr_rnd is 0.
+lm(formula = api00 ~ ell + mealcat, data = df)
-<code>		Model Summary
+Residuals:
-Model	R	R Square	Adjusted R Square	Std. Error of the Estimate
+    Min      1Q  Median      3Q     Max
-	.126a	.016		.013			131.278
+-224.96  -42.70   -0.32   48.93  179.92
-a. Predictors: (Constant), parent some college
-			ANOVA(b)
+Coefficients:
-Model		Sum of Squares	df	Mean Square	F	Sig.
+             Estimate Std. Error t value Pr(>|t|)
-	Regression	84700.858	1	84700.858	4.915	.027a
+(Intercept)   819.654      6.052 135.426  < 2e-16 ***
-	Residual	5273591.675	306	17233.960
+ell            -1.546      0.203  -7.616 1.94e-13 ***
-	Total		5358292.532	307
+mealcatto80  -134.493      9.153 -14.693  < 2e-16 ***
-a. Predictors: (Constant), parent some college
+mealcatto100 -230.737     12.290 -18.774  < 2e-16 ***
-b. Dependent Variable: api 2000
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
-			Coefficients(a)
+Residual standard error: 66.03 on 396 degrees of freedom
-		Unstandardized Coefficients		Standardized Coefficients
+Multiple R-squared:  0.7861,	Adjusted R-squared:  0.7845
-Model				B	Std. Error	Beta	t	Sig.
+F-statistic: 485.2 on 3 and 396 DF,  p-value: < 2.2e-16
-	(Constant)		655.110	15.237			42.995	.000
-	parent some college	1.409	.636		.126	2.217	.027
-a. Dependent Variable: api 2000
-</code>
-{{regression-cat0.jpg}}
+> # install.packages("emmeans")
+> # library(emmeans)
+> # Calculate estimated marginal means for each group
+> gmeans <- emmeans(m.ellmealcat, specs = ~ mealcat)
+> # Perform pairwise comparisons using Tukey's adjustment (Default)
+> pairs(gmeans, adjust = "tukey")
+ contrast     estimate    SE  df t.ratio p.value
+ to46 - to80     134.5  9.15 396  14.693  <.0001
+ to46 - to100    230.7 12.30 396  18.774  <.0001
+ to80 - to100     96.2  9.53 396  10.102  <.0001
-<code>COMPUTE filt=(yr_rnd=1).
+P value adjustment: tukey method for comparing a family of 3 estimates
-FILTER BY filt.
+>
-regress
+> m.ellxmealcat <- lm(api00~ell*mealcat, data=df)
- /dep = api00
+> summary(m.ellxmealcat)
- /method = enter some_col.
+Call:
+lm(formula = api00 ~ ell * mealcat, data = df)
-		Model Summary
+Residuals:
-Model	R	R Square	Adjusted R Square	Std. Error of the Estimate
+     Min       1Q   Median       3Q      Max
-	.648a	.420		.413			75.773
+-221.854  -43.244    1.437   47.443  181.471
-a. Predictors: (Constant), parent some college
-			ANOVA(b)
+Coefficients:
-Model		Sum of Squares	df	Mean Square	F	Sig.
+                  Estimate Std. Error t value Pr(>|t|)
-	Regression	373644.064	1	373644.064	65.078	.000a
+(Intercept)       836.7724     8.8620  94.422  < 2e-16 ***
-	Residual	516734.838	90	5741.498
+ell                -3.4447     0.7508  -4.588 6.03e-06 ***
-	Total		890378.902	91
+mealcatto80      -146.6127    13.9085 -10.541  < 2e-16 ***
-a. Predictors: (Constant), parent some college
+mealcatto100     -270.8356    18.7941 -14.411  < 2e-16 ***
-b. Dependent Variable: api 2000
+ell:mealcatto80     1.7300     0.8111   2.133   0.0335 *
+ell:mealcatto100    2.3190     0.8032   2.887   0.0041 **
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
-			Coefficients(a)
+Residual standard error: 65.47 on 394 degrees of freedom
-		Unstandardized Coefficients		Standardized Coefficients
+Multiple R-squared:  0.7909,	Adjusted R-squared:  0.7882
-Model				B	Std. Error	Beta	t	Sig.
+F-statistic:   298 on 5 and 394 DF,  p-value: < 2.2e-16
-	(Constant)		407.039	16.515			24.647	.000
-	parent some college	7.403	.918		.648	8.067	.000
+> df %>%
-a. Dependent Variable: api 2000
++   ggplot(aes(x = ell, y = api00, color = factor(mealcat))) +
++   geom_point(size = 3) +
++   geom_smooth(method = "lm", se = FALSE) +
++   scale_color_brewer(palette = "Set1", name = "mealcat") +
++   labs(title = "Multiple Regression: Interaction by meal cat",
++        x = " (ell, english language learner)",
++        y = "api00")
+`geom_smooth()` using formula = 'y ~ x'
+>
+> mealcat1 <- summary(m.ellxmealcat)$coefficients[[1]]
+> mealcat2 <- summary(m.ellxmealcat)$coefficients[[1]] +
++   summary(m.ellxmealcat)$coefficients[[3]]
+> mealcat3 <- summary(m.ellxmealcat)$coefficients[[1]] +
++   summary(m.ellxmealcat)$coefficients[[4]]
+> data.frame(mealcat1, mealcat2, mealcat3)
+  mealcat1 mealcat2 mealcat3
+836.7724 690.1597 565.9368
+>
+> ell.slope.mealcat1 <- summary(m.ellxmealcat)$coefficients[[2]]
+> ell.slope.mealcat2 <- summary(m.ellxmealcat)$coefficients[[2]] +
++   summary(m.ellxmealcat)$coefficients[[5]]
+> ell.slope.mealcat3 <- summary(m.ellxmealcat)$coefficients[[2]] +
++   summary(m.ellxmealcat)$coefficients[[6]]
+> data.frame(ell.slope.mealcat1,
++            ell.slope.mealcat2,
++            ell.slope.mealcat3)
+  ell.slope.mealcat1 ell.slope.mealcat2 ell.slope.mealcat3
+          -3.444691          -1.714707          -1.125645
+>
 </code>
+</tabbox>
+{{pasted:20260609-231439.png?600}}
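The intercept and slope calculations above pick coefficients by position, which breaks silently if the term order ever changes. Below is a minimal sketch of the same arithmetic done by coefficient name. The vector `b` is typed in by hand from the `ell * mealcat` output above (rounded to four decimals), and `line_for` is a hypothetical helper name, not part of the original script.

```r
# Coefficients copied (rounded) from summary(m.ellxmealcat) above.
# With the fitted model at hand, coef(m.ellxmealcat) returns the same named vector.
b <- c("(Intercept)"      = 836.7724,
       "ell"              = -3.4447,
       "mealcatto80"      = -146.6127,
       "mealcatto100"     = -270.8356,
       "ell:mealcatto80"  = 1.7300,
       "ell:mealcatto100" = 2.3190)

# Hypothetical helper: intercept and ell slope of the regression line for one mealcat level.
line_for <- function(level) {
  if (level == "to46") {
    # Baseline group: no dummy and no interaction term applies.
    return(c(intercept = unname(b["(Intercept)"]),
             slope     = unname(b["ell"])))
  }
  c(intercept = unname(b["(Intercept)"] + b[paste0("mealcat", level)]),
    slope     = unname(b["ell"] + b[paste0("ell:mealcat", level)]))
}

lines_by_group <- sapply(c("to46", "to80", "to100"), line_for)
print(round(lines_by_group, 4))
```

Each column is one of the three lines drawn in the plot. The ell slope is steepest in the baseline group (to46) and flatter in the other two, which is what the positive interaction coefficients express.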
-{{regression-cat1.jpg}}
+===== Regression with two Categorical IVs =====
+<tabbox rs.07>
+<code>
+##########################################
+# Two categorical IVs
+##########################################
+m.yrrndmealcat <- lm(api00~yr_rnd+mealcat, data=df)
+summary(m.yrrndmealcat)
-===== interaction effect =====
+# Interpretation.
+coefs <- summary(m.yrrndmealcat)$coefficients
+coefs
+br.to46 <- coefs[1]
+br.to80 <- coefs[1]+coefs[3]
+br.to100 <- coefs[1]+coefs[4]
+nobr.to46 <- coefs[1]+coefs[2]
+nobr.to80 <- coefs[1]+coefs[2]+coefs[3]
+nobr.to100 <- coefs[1]+coefs[2]+coefs[4]
+cat(br.to46, br.to80, br.to100)
+cat(nobr.to46, nobr.to80, nobr.to100)
-The two regressions above were run separately (api00 <- some_col) for the two levels of the yr_rnd variable. The result shows that the regression slopes (coefficients) of the two groups differ, that is, the effect of some_col on api00 differs. The same independent variable plays a different role depending on the condition (the level of the other variable), which is what an interaction effect is. Therefore, testing whether the two slopes (coefficients) differ is a test of the interaction effect.
+# Interpretation: interaction
+m.yrrndxmealcat <- lm(api00~yr_rnd*mealcat, data=df)
+summary(m.yrrndxmealcat)
-Below, a new variable is created whose value is the product of yr_rnd and some_col. That is,
+coefs <- summary(m.yrrndxmealcat)$coefficients
+coefs
+br.to46 <- coefs[1]
+br.to80 <- coefs[1]+coefs[3]
+br.to100 <- coefs[1]+coefs[4]
+nobr.to46 <- coefs[1]+coefs[2]
+nobr.to80 <- coefs[1]+coefs[2]+coefs[3]+coefs[5]
+nobr.to100 <- coefs[1]+coefs[2]+coefs[4]+coefs[6]
+cat(br.to46, br.to80, br.to100)
+cat(nobr.to46, nobr.to80, nobr.to100)
+</code>
+<tabbox ro.07>
+<code>
+> ##########################################
+> # Two categorical IVs
+> ##########################################
+> m.yrrndmealcat <- lm(api00~yr_rnd+mealcat, data=df)
+> summary(m.yrrndmealcat)
-DV: api00
+Call:
+lm(formula = api00 ~ yr_rnd + mealcat, data = df)
-IV1: some_col
+Residuals:
-IV2: yr_rnd
+    Min      1Q  Median      3Q     Max
-IV3: yr_rnd * some_col = interaction effects
+-215.32  -49.50    1.65   49.17  183.63
-<code>compute yrXsome = yr_rnd*some_col.
+Coefficients:
-execute.
+             Estimate Std. Error t value Pr(>|t|)
+(Intercept)   808.013      6.040 133.777  < 2e-16 ***
+yr_rndnobr    -42.960      9.362  -4.589 5.99e-06 ***
+mealcatto80  -163.737      8.515 -19.229  < 2e-16 ***
+mealcatto100 -281.683      9.446 -29.821  < 2e-16 ***
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+Residual standard error: 68.89 on 396 degrees of freedom
+Multiple R-squared:  0.7672,	Adjusted R-squared:  0.7654
+F-statistic:   435 on 3 and 396 DF,  p-value: < 2.2e-16
+>
+> # Interpretation.
+> coefs <- summary(m.yrrndmealcat)$coefficients
+> coefs
+               Estimate Std. Error    t value      Pr(>|t|)
+(Intercept)   808.01313   6.039984 133.777362  0.000000e+00
+yr_rndnobr    -42.96006   9.361761  -4.588886  5.990293e-06
+mealcatto80  -163.73737   8.515015 -19.229252  1.128686e-58
+mealcatto100 -281.68318   9.445676 -29.821388 2.767252e-103
+> br.to46 <- coefs[1]
+> br.to80 <- coefs[1]+coefs[3]
+> br.to100 <- coefs[1]+coefs[4]
+> nobr.to46 <- coefs[1]+coefs[2]
+> nobr.to80 <- coefs[1]+coefs[2]+coefs[3]
+> nobr.to100 <- coefs[1]+coefs[2]+coefs[4]
+> cat(br.to46, br.to80, br.to100)
+808.0131 644.2758 526.33
+> cat(nobr.to46, nobr.to80, nobr.to100)
+765.0531 601.3157 483.3699>
 </code>
+The prediction equation is as follows.
+<code>
+y hat = 808.013 + -42.960*(yr_rndno_break) + -163.737(mealcat47-80) + -281.683(mealcat81-100)
-Then this variable is used in the regression equation.
+yr_rnd:
+break = has a vacation break
+no_break = no vacation break (year-round school)
-<code>regress
+mealcat:
- /dep = api00
+0-46% free meals
- /method = enter some_col yr_rnd yrXsome
+47-80%
- /save pre.
+81-100%
 </code>
+These can be interpreted as six cases, the product of the number of levels of each independent variable (2 and 3). That is, starting from
+''y hat = 808.013 + -42.960*(yr_rndno_break) + -163.737(mealcat47-80) + -281.683(mealcat81-100)''
+and plugging in each condition gives the following y hat values.
-<code>output:
+<wrap #two_dummy_table />
-		Model Summary(b)
+TABLE. Two dummy variables
-Model	R	R Square	Adjusted R Square	Std. Error of the Estimate
+^                 ^ mealcat0-46  ^ mealcat47-80   ^ mealcat81-100    ^
-	.532a	.283		.277			120.922
+| yr_rndbreak     | <wrap> yr_rndbreak = 1
-a. Predictors: (Constant), yrXsome, parent some college, year round school
+yr_rndno_break = 0
-b. Dependent Variable: api 2000
+mealcat0-46 = 1
+mealcat47-80 = 0
+mealcat81-100 = 0, so
+''y hat = 808.013''
+</wrap>  | <wrap> yr_rndbreak = 1
+yr_rndno_break = 0
+mealcat0-46 = 0
+mealcat47-80 = 1
+mealcat81-100 = 0, so
+''**y hat = 808.013 - 163.737 = 644.276**''
+</wrap>  | <wrap> yr_rndbreak = 1
+yr_rndno_break = 0
+mealcat0-46 = 0
+mealcat47-80 = 0
+mealcat81-100 = 1, so
+''**y hat = 808.013 - 281.683 = 526.33**''
+</wrap>  |
+| yr_rndno_break  | <wrap> yr_rndbreak = 0
+yr_rndno_break = 1
+mealcat0-46 = 1
+mealcat47-80 = 0
+mealcat81-100 = 0, so
+''**y hat = 808.013 - 42.960 = 765.053**''
+</wrap>  | <wrap> yr_rndbreak = 0
+yr_rndno_break = 1
+mealcat0-46 = 0
+mealcat47-80 = 1
+mealcat81-100 = 0, so
+''**y hat = 808.013 - 42.960 - 163.737 = 601.316**''
+</wrap>  | <wrap> yr_rndbreak = 0
+yr_rndno_break = 1
+mealcat0-46 = 0
+mealcat47-80 = 0
+mealcat81-100 = 1, so
+''**y hat = 808.013 - 42.960 - 281.683 = 483.37**''
+</wrap>  |
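The six cells of the table can also be generated in one step instead of adding coefficients by hand. The sketch below multiplies a dummy-coded design matrix by the coefficient vector. The coefficients are copied from the additive model output above; the baseline label `br` for yr_rnd is an assumption (the output only shows the non-baseline level `nobr`).

```r
# Coefficients copied from summary(m.yrrndmealcat)$coefficients above.
b <- c("(Intercept)"  =  808.01313,
       "yr_rndnobr"   =  -42.96006,
       "mealcatto80"  = -163.73737,
       "mealcatto100" = -281.68318)

# All 2 x 3 combinations of the two categorical IVs.
# The level name "br" is assumed; it only has to be the first (baseline) level.
grid <- expand.grid(
  yr_rnd  = factor(c("br", "nobr"), levels = c("br", "nobr")),
  mealcat = factor(c("to46", "to80", "to100"),
                   levels = c("to46", "to80", "to100"))
)

# Dummy-coded design matrix; its columns must line up with the coefficients.
X <- model.matrix(~ yr_rnd + mealcat, data = grid)
stopifnot(identical(colnames(X), names(b)))

grid$y_hat <- as.vector(X %*% b)

# Arrange the predictions as the 2 x 3 table (one value per cell).
cells <- xtabs(y_hat ~ yr_rnd + mealcat, data = grid)
print(round(cells, 3))
```

With the fitted model available, `predict(m.yrrndmealcat, newdata = grid)` should give the same six values, provided the factor levels in `grid` match those in `df`.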
-			ANOVA(b)
+<code>
-Model		Sum of Squares	df	Mean Square	F	Sig.
+> # Interpretation: interaction
-	Regression	2283345.485	3	761115.162	52.053	.000a
+> m.yrrndxmealcat <- lm(api00~yr_rnd*mealcat, data=df)
-	Residual	5790326.513	396	14622.037
+> summary(m.yrrndxmealcat)
-	Total		8073671.997	399
-a. Predictors: (Constant), yrXsome, parent some college, year round school
-b. Dependent Variable: api 2000
-			Coefficients(a)
+Call:
-		Unstandardized Coefficients		Standardized Coefficients
+lm(formula = api00 ~ yr_rnd * mealcat, data = df)
-Model				B		Std. Error	Beta	t	Sig.
-	(Constant)		655.110		14.035		46.677	.000
-	parent some college	1.409		.586		.112	2.407	.017
-	year round school	-248.071	29.859		-.735	-8.308	.000
-	__yrXsome__		5.993		1.577		.330	3.800	.000
-a. Dependent Variable: api 2000
-		Residuals Statistics(a)
+Residuals:
-			Minimum		Maximum		Mean	Std. Deviation	N
+     Min       1Q   Median       3Q      Max
-Predicted Value	407.04		749.54		647.62	75.648		400
+-207.533  -50.764   -1.843   48.874  179.000
-Residual		-275.118	279.252		.000	120.466		400
-Std. Predicted Value	-3.180		1.347		.000	1.000		400
+Coefficients:
-Std. Residual		-2.275		2.309		.000	.996		400
+                        Estimate Std. Error t value Pr(>|t|)
-a. Dependent Variable: api 2000
+(Intercept)              809.685      6.185 130.911  < 2e-16 ***
+yr_rndnobr               -74.257     26.756  -2.775  0.00578 **
+mealcatto80             -164.412      8.877 -18.522  < 2e-16 ***
+mealcatto100            -288.193     10.443 -27.597  < 2e-16 ***
+yr_rndnobr:mealcatto80    22.517     32.752   0.687  0.49217
+yr_rndnobr:mealcatto100   40.764     29.231   1.395  0.16394
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
+Residual standard error: 68.87 on 394 degrees of freedom
+Multiple R-squared:  0.7685,	Adjusted R-squared:  0.7656
+F-statistic: 261.6 on 5 and 394 DF,  p-value: < 2.2e-16
+>
+> coefs <- summary(m.yrrndxmealcat)$coefficients
+> coefs
+                          Estimate Std. Error     t value     Pr(>|t|)
+(Intercept)              809.68548   6.184993 130.9113019 0.000000e+00
+yr_rndnobr               -74.25691  26.756287  -2.7753071 5.777728e-03
+mealcatto80             -164.41198   8.876767 -18.5216067 1.529685e-55
+mealcatto100            -288.19295  10.442837 -27.5971894 4.297367e-94
+yr_rndnobr:mealcatto80    22.51674  32.751732   0.6874977 4.921736e-01
+yr_rndnobr:mealcatto100   40.76438  29.231183   1.3945510 1.639371e-01
+> br.to46 <- coefs[1]
+> br.to80 <- coefs[1]+coefs[3]
+> br.to100 <- coefs[1]+coefs[4]
+> nobr.to46 <- coefs[1]+coefs[2]
+> nobr.to80 <- coefs[1]+coefs[2]+coefs[3]+coefs[5]
+> nobr.to100 <- coefs[1]+coefs[2]+coefs[4]+coefs[6]
+> cat(br.to46, br.to80, br.to100)
+809.6855 645.2735 521.4925
+> cat(nobr.to46, nobr.to80, nobr.to100)
+735.4286 593.5333 488
+>
 </code>
+The test above has two categorical independent variables and a numeric dependent variable, so it meets the conditions for a factorial ANOVA as well. The result is shown below.
+<code>
+> mod4 <- lm(api00 ~ yr_rnd + mealcat + yr_rnd:mealcat, data=datavar)
+> summary(mod4)
-Above, the b coefficient of __yrXsome__ is 5.993 and is statistically significant (t = 3.800, p < .001), so there is an interaction effect between the two variables. Plotting it again shows that the slopes of the two groups differ.
+Call:
+lm(formula = api00 ~ yr_rnd + mealcat + yr_rnd:mealcat, data = datavar)
-{{regression-cat2-interctionx.jpg}}
+Residuals:
-{{regression-cat2-interaction-each.jpg}}
+     Min       1Q   Median       3Q      Max
+-207.533  -50.764   -1.843   48.874  179.000
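-In the test above, one of the two independent variables is categorical and the other is numeric. Regression was used to examine the effect of each variable and also their interaction. This procedure is the same as the one covered in FactorialAnova. In fact, the research question (hypothesis) above can also be tested with ANOVA.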
+Coefficients:
+                             Estimate Std. Error t value Pr(>|t|)
+(Intercept)                   809.685      6.185 130.911  < 2e-16 ***
+yr_rndno_break                -74.257     26.756  -2.775  0.00578 **
+mealcat47-80                 -164.412      8.877 -18.522  < 2e-16 ***
+mealcat81-100                -288.193     10.443 -27.597  < 2e-16 ***
+yr_rndno_break:mealcat47-80    22.517     32.752   0.687  0.49217
+yr_rndno_break:mealcat81-100   40.764     29.231   1.395  0.16394
+---
+Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
-<code>glm
+Residual standard error: 68.87 on 394 degrees of freedom
-  api00 BY yr_rnd WITH some_col
+Multiple R-squared:  0.7685,	Adjusted R-squared:  0.7656
-  /DESIGN = some_col yr_rnd yr_rnd*some_col.
+F-statistic: 261.6 on 5 and 394 DF,  p-value: < 2.2e-16
+</code>
-or
+<code>
-UNIANOVA api00 BY yr_rnd WITH some_col
+Coefficients:
-  /DESIGN=yr_rnd some_col some_col*yr_rnd
+                             Estimate Std. Error t value Pr(>|t|)
-  /print=parameter .
+(Intercept)                   809.685      6.185 130.911  < 2e-16 ***
+yr_rndno_break                -74.257     26.756  -2.775  0.00578 **
+mealcat47-80                 -164.412      8.877 -18.522  < 2e-16 ***
+mealcat81-100                -288.193     10.443 -27.597  < 2e-16 ***
+yr_rndno_break:mealcat47-80    22.517     32.752   0.687  0.49217
+yr_rndno_break:mealcat81-100   40.764     29.231   1.395  0.16394
+---
+Previous equation (no interaction)
+y hat = 808.013 + -42.960 * (nobr) + -163.737 * (to80) + -281.683 * (to100)
+Equation above (with interaction)
+y hat = 809.685 +  -74.257*(nobr) +
+                  -164.412*(to80) +
+                  -288.193*(to100) +
+                    22.517*(nobr:to80) +   --> aaaaa case
+                    40.764*(nobr:to100)    --> bbbbb case
+yr_rnd:
+break = has a vacation break
+no_break = no vacation break (year-round school)
+mealcat:
+0-46% free meals
+47-80%
+81-100%
 </code>
+^                 ^ mealcat0-46  ^ mealcat47-80   ^ mealcat81-100    ^
+| yr_rndbreak     | <wrap> baseline
+yr_rndno_break = 0
+mealcat47-80 = 0
+mealcat81-100 = 0, so
+''y hat = 809.685''
+</wrap>  | <wrap> yr_rndno_break = 0
+mealcat0-46 = 0
+mealcat81-100 = 0, so
+''y hat = 809.685 -
+          164.412
+        = 645.273''
+</wrap>  | <wrap>
+yr_rndno_break = 0
+mealcat0-46 = 0
+mealcat47-80 = 0, so
+''y hat = 809.685 -
+          288.193
+        = 521.492''
+</wrap>  |
+| yr_rndno_break  | <wrap>
+yr_rndbreak = 0
+mealcat47-80 = 0
+mealcat81-100 = 0, so
+''y hat = 809.685 -
+          74.257
+        = 735.4''
+</wrap>  | <wrap> aaaaa
+yr_rndbreak = 0
+mealcat0-46 = 0
+mealcat81-100 = 0, so
+''y hat = 809.685 -
+          74.257 -
+          164.412 +
+          <fc #ff0000>22.517</fc>
+        = 593.5''
+</wrap>  | <wrap> bbbbb
+yr_rndbreak = 0
+mealcat0-46 = 0
+mealcat47-80 = 0, so
+''y hat = 809.685 -
+          74.257 -
+          288.193 +
+          <fc #ff0000>40.764</fc>
+        = 488''
+</wrap>  |
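+In the last two cases (aaaaa and bbbbb), the drop in scores for meal categories 2 and 3 tends to be somewhat smaller among the no_break schools, but neither interaction term is statistically significant (p = .492 and p = .164).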
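The two interaction terms can also be tested jointly by comparing the additive and interaction models. The following is a rough sketch that rebuilds the F statistic for that comparison from the R-squared values and degrees of freedom printed above. Because those R-squared values are rounded to four decimals, the result is only approximate; the variable names are illustrative.

```r
# Values read off the two summary() outputs above (rounded, so results are approximate).
r2_additive    <- 0.7672  # api00 ~ yr_rnd + mealcat
r2_interaction <- 0.7685  # api00 ~ yr_rnd * mealcat
df_extra <- 2             # two interaction terms added
df_resid <- 394           # residual df of the interaction model

# F test for the increase in R-squared when the interaction terms are added.
f_stat <- ((r2_interaction - r2_additive) / df_extra) /
          ((1 - r2_interaction) / df_resid)
p_value <- pf(f_stat, df_extra, df_resid, lower.tail = FALSE)

cat("F =", round(f_stat, 2), " p =", round(p_value, 3), "\n")
```

The F statistic comes out at roughly 1.1, far from significance at the .05 level, which agrees with the individual t tests. With the fitted models at hand, `anova(m.yrrndmealcat, m.yrrndxmealcat)` performs the same comparison without the rounding error.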
+</tabbox>