Differences

This shows you the differences between two versions of the page.

--- types_of_error [2026/04/20 04:30] – [ro.type.ii.error] hkimscil
+++ types_of_error [2026/04/21 22:13] (current) – hkimscil
@@ Line 30: / Line 30: @@
 {{tabembedded>:types_of_error:code01|R script, types_of_error:output01|R output}}
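+The example below assumes the time, in minutes, that second-year high school students spend on games per day during vacation. The population mean (m.p) is 140 minutes and the standard deviation (sigma) is 20. We also assume a population that has received media education (something that almost never happens in reality), whose mean (m.pe) is 130 minutes, 10 minutes less than 140. We now draw a sample of n = 40, give that sample media education, and test whether the education had an effect.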
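As a minimal sketch of this setup, the standard error for n = 40 and the interval in which the test would fail to reject can be computed as below. The names m.p, m.pe and sigma follow the text; n, se, lower and upper are illustrative names, and the cutoff of 2 standard errors around m.p is an assumption (the text does not state the cutoff; `qnorm(0.975)`, about 1.96, would be the usual alternative).

```r
m.p   <- 140   # mean of the untreated population (from the text)
m.pe  <- 130   # mean of the media-educated population (from the text)
sigma <- 20    # population standard deviation (from the text)
n     <- 40    # sample size (from the text)

# Standard error of the sample mean
se <- sigma / sqrt(n)

# Assumed non-rejection interval: m.p plus/minus 2 standard errors
lower <- m.p - 2 * se
upper <- m.p + 2 * se

print(round(c(se = se, lower = lower, upper = upper), 2))
```

A sample mean between lower and upper would lead to a failure to reject the null hypothesis under this assumed cutoff.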
 <tabbox rs.type.ii.error>
@@ Line 140: / Line 142: @@
 >
 </code>
 </tabbox>
+{{pasted:20260420-042828.png}}
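+If the mean of the n = 40 sample falls between the red dashed lines, the hypothesis test fails (we fail to reject the null hypothesis). The error we can commit in this case is a Type II error. Evaluated against the blue distribution, the population in which the treatment has an effect, the probability of this error is
+  * ''pnorm(score_at_which_the_test_fails, 130, se, lower.tail=F) = 0.123''
+That is, the probability that the score I obtained came from population p2 is as high as 12.3%. In other words, the probability that my judgment (that the hypothesis test failed) is wrong is 12.3%.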
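The 0.123 figure can be reproduced with a short sketch, assuming that the lower red dashed line sits 2 standard errors below the null mean of 140 (this cutoff is an assumption, not stated in the text; the variable names cutoff and beta are illustrative).

```r
sigma <- 20
n     <- 40
se    <- sigma / sqrt(n)          # about 3.16

# Assumed lower boundary of the non-rejection region (2 se below 140)
cutoff <- 140 - 2 * se

# Upper-tail area of the treated population's sampling distribution
# (mean 130) beyond the cutoff: the Type II error probability
beta <- pnorm(cutoff, mean = 130, sd = se, lower.tail = FALSE)
print(round(beta, 3))             # 0.123
```

Strictly, the Type II error probability is the area between both boundaries; the part above the upper boundary is negligible here, so the one-sided call gives the same value to three decimals.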
+{{pasted:20260420-042902.png}}
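+The graph above increases the sample size from n = 40 to n = 400. Here, even if the mean of my sample falls within the interval in which the hypothesis test fails, the probability that this score came from p2 is practically 0. That is, the probability that my judgment of "hypothesis test failed" is wrong is practically 0.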
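The same calculation for n = 400 can be sketched as follows, again assuming a cutoff 2 standard errors below 140 (an assumption; se400, cutoff400 and beta400 are illustrative names).

```r
sigma <- 20
n     <- 400
se400 <- sigma / sqrt(n)            # 20 / 20 = 1

# Assumed lower boundary of the non-rejection region
cutoff400 <- 140 - 2 * se400        # 138

# Area of the treated population's sampling distribution (mean 130)
# above the cutoff; the cutoff is 8 standard errors above 130
beta400 <- pnorm(cutoff400, mean = 130, sd = se400, lower.tail = FALSE)
print(beta400)
```

The result is not exactly zero, but it is so small that it is zero for any practical purpose.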
+{{pasted:20260420-043927.png}}
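+At the other extreme, if the sample size is 10, the se is relatively large (20/sqrt(10)), so the two curves overlap much more. In this case, even when the hypothesis test fails, the probability that the score came from p2 is as high as 33.7%.
 The explanation above can be summarized as in the table below.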
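Why the overlap changes with sample size can be seen by comparing the standard errors directly. The sketch below is illustrative only (the names n, se and gap.in.se are not from the original script); it expresses the 10-minute distance between the two population means in units of the standard error.

```r
sigma <- 20
n     <- c(10, 40, 400)

# Standard error of the sample mean for each sample size
se <- sigma / sqrt(n)

# Distance between the two means (140 and 130) in standard-error units;
# the smaller this is, the more the two sampling distributions overlap
gap.in.se <- (140 - 130) / se

print(data.frame(n = n, se = round(se, 3), gap.in.se = round(gap.in.se, 2)))
```

With n = 10 the means are less than 2 standard errors apart, so the curves overlap heavily; with n = 400 they are 10 standard errors apart and hardly overlap at all.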
@@ Line 156: / Line 168: @@
-====== E.G. ======
-This becomes clearer in the example below.
-<code>
-rnorm2 <- function(n,mean,sd) { mean+sd*scale(rnorm(n)) }
-potato_sample <- rnorm2(25, 194,20)
-mean(potato_sample)
-sqrt(var(potato_sample))
-t.test(potato_sample, mu=200)
-	One Sample t-test
-data:  potato_sample
-t = -1.5, df = 24, p-value = 0.1467
-alternative hypothesis: true mean is not equal to 200
-95 percent confidence interval:
- 185.7444 202.2556
-sample estimates:
-mean of x
-      194
->
-</code>
-To understand the qt function below, see [[https://www.quora.com/In-R-what-is-the-difference-between-dt-pt-and-qt-in-reference-to-the-student-t-distribution|the t-distribution function document]].
-<code>
-> abs(qt(0.05/2, 24))
-[1] 2.063899
-</code>
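-That is, the null hypothesis can be rejected only if the t-score is beyond ±2.063899. The current t-score is -1.5, so the null hypothesis cannot be rejected.
-By the formula for se, sqrt(25) = 5, so se = 20/5 = 4. If n (the sample size) were 2500, the se would be 0.4 (see below).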
-<code>
-> 20/sqrt(length(potato_sample))
-[1] 4
-</code>
-<code>
-potato_sample_large <- rnorm2(2500, 194,20)
-mean(potato_sample_large)
-[1] 194
-> sqrt(var(potato_sample_large))
-[1]   20
-t.test(potato_sample_large, mu=200)
-	One Sample t-test
-data:  potato_sample_large
-t = -15, df = 2499, p-value < 2.2e-16
-alternative hypothesis: true mean is not equal to 200
-95 percent confidence interval:
- 193.2156 194.7844
-sample estimates:
-mean of x
-      194
-</code>
-<code>
-> abs(qt(0.05/2, 2499))
-[1] 1.960914
-</code>
-In this case the critical t value is ±1.960914 (approx. 2), and the calculated t value is -15, so the null hypothesis can be rejected.
-<code>
-> # standard error value
-> 20/sqrt(length(potato_sample_large))
-[1] 0.4
->
-</code>
-<WRAP help> Compare the se of the two cases above, and explain the difference in relation to Type I and Type II errors. </WRAP>
-With mu = 200 and sigma = 20, where the number after a___ is the sample size,
-<code>> a25 <- rnorm(50000, 200, 4) # 4 = 20/sqrt(25) = standard error
-> a100 <- rnorm(50000, 200, 2)
-> a400 <- rnorm(50000, 200, 1)
-> a900 <- rnorm(50000, 200, .667)
-> a1600 <- rnorm(50000, 200, .5)
-> a2500 <- rnorm(50000, 200, .4)
-> a3600 <- rnorm(50000, 200, .333)
-> a4900 <- rnorm(50000, 200, .286)
-> a6400 <- rnorm(50000, 200, .25)
-> a8100 <- rnorm(50000, 200, .222)
-> pa25 <- hist(a25)
-> pa100 <- hist(a100)
-> pa400 <- hist(a400)
-> pa900 <- hist(a900)
-> pa1600 <- hist(a1600)
-> pa2500 <- hist(a2500)
-> pa3600 <- hist(a3600)
-> pa4900 <- hist(a4900)
-> pa6400 <- hist(a6400)
-> pa8100 <- hist(a8100)
-> plot(pa25, col=rgb(.1,.1,.1,.1), xlim=c(185,215), ylim=c(0,15000))
-> plot(pa100, col=rgb(.2,.2,.2,.2), xlim=c(185,215), ylim=c(0,15000), add=T)
-> plot(pa400, col=rgb(.3,.3,.3,.3), xlim=c(185,215), ylim=c(0,15000), add=T)
-> plot(pa900, col=rgb(.4,.4,.4,.4), xlim=c(185,215), ylim=c(0,15000), add=T)
-> plot(pa1600, col=rgb(.5,.5,.5,.5), xlim=c(185,215), ylim=c(0,15000), add=T)
-> plot(pa2500, col=rgb(.6,.6,.6,.6), xlim=c(185,215), ylim=c(0,15000), add=T)
-> plot(pa3600, col=rgb(.7,.7,.7,.7), xlim=c(185,215), ylim=c(0,15000), add=T)
-> plot(pa4900, col=rgb(.8,.8,.8,.8), xlim=c(185,215), ylim=c(0,15000), add=T)
-> plot(pa6400, col=rgb(.9,.9,.9,.9), xlim=c(185,215), ylim=c(0,15000), add=T)
-> plot(pa8100, col=rgb(1,1,1,1), xlim=c(185,215), ylim=c(0,15000), add=T)
-</code>
-{{:sampling_distribution_25_to_8100.png}}
-{{:sampling_distribution_25_to_8100_big.png}}
-{{:sampling_distribution_25_to_8100.pdf}}
-{{:sampling_distribution_25_2500.png}}
-Where is my 194 (sample's mean)?
 {{tag>"1종오류" "2종오류" "오류의 종류" "types of error" "type 1 error" "type 2 error"}}