Each assignment submission must include answers to the prompts given (such as stating the hypotheses), the R commands and their output, and your interpretation of the results.

====== E.g. 1 ======

Using the Cars93 data set from the MASS package, compare city mileage (MPG.city), highway mileage (MPG.highway), and engine size (EngineSize) by Origin.

  - State the hypotheses:
    * $\text{MPG.city: } \mu_{\text{USA}} \ne \mu_{\text{non-USA}}$
    * $\text{MPG.highway: } \mu_{\text{USA}} \ne \mu_{\text{non-USA}}$
    * $\text{EngineSize: } \mu_{\text{USA}} \ne \mu_{\text{non-USA}}$
  - State the null hypotheses:
    * $\text{MPG.city: } \mu_{\text{USA}} = \mu_{\text{non-USA}}$
    * $\text{MPG.highway: } \mu_{\text{USA}} = \mu_{\text{non-USA}}$
    * $\text{EngineSize: } \mu_{\text{USA}} = \mu_{\text{non-USA}}$
  - Mean and standard deviation of each group
  - Hypothesis test
  - Test results

<code>
> library(MASS)
> CarData <- subset(Cars93, select = c(Origin, MPG.city, MPG.highway, EngineSize))
> CarData
    Origin MPG.city MPG.highway EngineSize
1  non-USA       25          31        1.8
2  non-USA       18          25        3.2
3  non-USA       20          26        2.8
4  non-USA       19          26        2.8
5  non-USA       22          30        3.5
6      USA       22          31        2.2
7      USA       19          28        3.8
8      USA       16          25        5.7
9      USA       19          27        3.8
10     USA       16          25        4.9
11     USA       16          25        4.6
12     USA       25          36        2.2
13     USA       25          34        2.2
14     USA       19          28        3.4
15     USA       21          29        2.2
16     USA       18          23        3.8
17     USA       15          20        4.3
18     USA       17          26        5.0
19     USA       17          25        5.7
20     USA       20          28        3.3
21     USA       23          28        3.0
22     USA       20          26        3.3
23     USA       29          33        1.5
24     USA       23          29        2.2
25     USA       22          27        2.5
26     USA       17          21        3.0
27     USA       21          27        2.5
28     USA       18          24        3.0
29     USA       29          33        1.5
30     USA       20          28        3.5
31     USA       31          33        1.3
32     USA       23          30        1.8
33     USA       22          27        2.3
34     USA       22          29        2.3
35     USA       24          30        2.0
36     USA       15          20        3.0
37     USA       21          30        3.0
38     USA       18          26        4.6
39 non-USA       46          50        1.0
40 non-USA       30          36        1.6
41 non-USA       24          31        2.3
42 non-USA       42          46        1.5
43 non-USA       24          31        2.2
44 non-USA       29          33        1.5
45 non-USA       22          29        1.8
46 non-USA       26          34        1.5
47 non-USA       20          27        2.0
48 non-USA       17          22        4.5
49 non-USA       18          24        3.0
50 non-USA       18          23        3.0
51     USA       17          26        3.8
52     USA       18          26        4.6
53 non-USA       29          37        1.6
54 non-USA       28          36        1.8
55 non-USA       26          34        2.5
56 non-USA       18          24        3.0
57 non-USA       17          25        1.3
58 non-USA       20          29        2.3
59 non-USA       19          25        3.2
60     USA       23          26        1.6
61     USA       19          26        3.8
62 non-USA       29          33        1.5
63 non-USA       18          24        3.0
64 non-USA       29          33        1.6
65 non-USA       24          30        2.4
66 non-USA       17          23        3.0
67 non-USA       21          26        3.0
68     USA       24          31        2.3
69     USA       23          31        2.2
70     USA       18          23        3.8
71     USA       19          28        3.8
72     USA       23          30        1.8
73     USA       31          41        1.6
74     USA       23          31        2.0
75     USA       19          28        3.4
76     USA       19          27        3.4
77     USA       19          28        3.8
78 non-USA       20          26        2.1
79     USA       28          38        1.9
80 non-USA       33          37        1.2
81 non-USA       25          30        1.8
82 non-USA       23          30        2.2
83 non-USA       39          43        1.3
84 non-USA       32          37        1.5
85 non-USA       25          32        2.2
86 non-USA       22          29        2.2
87 non-USA       18          22        2.4
88 non-USA       25          33        1.8
89 non-USA       17          21        2.5
90 non-USA       21          30        2.0
91 non-USA       18          25        2.8
92 non-USA       21          28        2.3
93 non-USA       20          28        2.4
> sapply(CarData, summary)
$Origin
    USA non-USA
     48      45

$MPG.city
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  15.00   18.00   21.00   22.37   25.00   46.00

$MPG.highway
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.00   26.00   28.00   29.09   31.00   50.00

$EngineSize
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.800   2.400   2.668   3.300   5.700

> attach(CarData)
> tapply(CarData$MPG.city, CarData$Origin, summary)
$USA
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  15.00   18.00   20.00   20.96   23.00   31.00

$`non-USA`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  17.00   19.00   22.00   23.87   26.00   46.00

> tapply(MPG.city, Origin, sd)
     USA  non-USA
3.994455 6.672876
> plot(MPG.city~Origin)
</code>

{{t-test_mpg.city.png}}

<code>
> t.test(MPG.city~Origin)

        Welch Two Sample t-test

data:  MPG.city by Origin
t = -2.5296, df = 71.024, p-value = 0.01364
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.2008385 -0.6158282
sample estimates:
    mean in group USA mean in group non-USA
             20.95833              23.86667

> t.test(MPG.city~Origin, var.equal=TRUE)

        Two Sample t-test

data:  MPG.city by Origin
t = -2.5688, df = 91, p-value = 0.01183
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.1572298 -0.6594368
sample estimates:
    mean in group USA mean in group non-USA
             20.95833              23.86667

> tapply(MPG.highway, Origin, summary)
$USA
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.00   26.00   28.00   28.15   30.00   41.00

$`non-USA`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  21.00   25.00   30.00   30.09   33.00   50.00

> tapply(MPG.highway, Origin, sd)
     USA  non-USA
4.151337 6.247990
> plot(MPG.highway~Origin)
</code>

{{t-test_mpghighway.png}}

<code>
> t.test(MPG.highway~Origin)

        Welch Two Sample t-test

data:  MPG.highway by Origin
t = -1.7545, df = 75.802, p-value = 0.08339
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.1489029  0.2627918
sample estimates:
    mean in group USA mean in group non-USA
             28.14583              30.08889

> tapply(EngineSize, Origin, summary)
$USA
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.300   2.200   3.000   3.067   3.800   5.700

$`non-USA`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.600   2.200   2.242   2.800   4.500

> tapply(EngineSize, Origin, sd)
      USA   non-USA
1.1353757 0.7171563
> plot(EngineSize~Origin)
</code>

{{t-test_enginesize.png}}

<code>
> t.test(EngineSize~Origin)

        Welch Two Sample t-test

data:  EngineSize by Origin
t = 4.2135, df = 80.033, p-value = 6.55e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.4350602 1.2138287
sample estimates:
    mean in group USA mean in group non-USA
             3.066667              2.242222
</code>

====== E.g. 2 ======

  - Load the Seatbelts data.
  - Compare driver fatalities (DriversKilled) before and after the seat belt law took effect.
  - hypothesis
  - null hypothesis
  - test result

<code>
> sb <- as.data.frame(Seatbelts)
> attach(sb)
The following objects are masked from sb (pos = 3):

    drivers, DriversKilled, front, kms, law, PetrolPrice, rear, VanKilled

The following object is masked from package:MASS:

    drivers

> tapply(DriversKilled,law,summary)
$`0`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   79.0   108.0   121.0   125.9   140.0   198.0

$`1`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   60.0    85.0    92.0   100.3   119.0   154.0

> tapply(DriversKilled,law,sd)
       0        1
24.26088 22.22860
> t.test(DriversKilled~law)

        Welch Two Sample t-test

data:  DriversKilled by law
t = 5.1253, df = 29.609, p-value = 1.693e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 15.39892 35.81899
sample estimates:
mean in group 0 mean in group 1
       125.8698        100.2609
</code>

====== E.g. 3 ======

  - Describe the anorexia data.
  - Extract only the FT (family treatment) group using the subset function, and compare Prewt with Postwt.
  - State the hypotheses,
  - run the test,
  - and report the result.

<code>
> anorexia
. . . .
> md = subset(anorexia, Treat=="FT")
> md
   Treat Prewt Postwt
56    FT  83.8   95.2
57    FT  83.3   94.3
58    FT  86.0   91.5
59    FT  82.5   91.9
60    FT  86.7  100.3
61    FT  79.6   76.7
62    FT  76.9   76.8
63    FT  94.2  101.6
64    FT  73.4   94.9
65    FT  80.5   75.2
66    FT  81.6   77.8
67    FT  82.1   95.5
68    FT  77.6   90.7
69    FT  83.5   92.5
70    FT  89.9   93.8
71    FT  86.0   91.7
72    FT  87.3   98.0
> t.test(md$Prewt, md$Postwt, paired=TRUE)

        Paired t-test

data:  md$Prewt and md$Postwt
t = -4.1849, df = 16, p-value = 0.0007003
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.94471  -3.58470
sample estimates:
mean of the differences
              -7.264706
</code>

====== E.g. 4 ======

A: 175, 168, 168, 190, 156, 181, 182, 175, 174, 179

B: 185, 169, 173, 173, 188, 186, 175, 174, 179, 180

Compare the means of the two groups.

<code>
> a
 [1] 175 168 168 190 156 181 182 175 174 179
> b
 [1] 185 169 173 173 188 186 175 174 179 180
> ab <- data.frame(a,b)
> ab
     a   b
1  175 185
2  168 169
3  168 173
4  190 173
5  156 188
6  181 186
7  182 175
8  175 174
9  174 179
10 179 180
> summary(ab)
       a               b
 Min.   :156.0   Min.   :169.0
 1st Qu.:169.5   1st Qu.:173.2
 Median :175.0   Median :177.0
 Mean   :174.8   Mean   :178.2
 3rd Qu.:180.5   3rd Qu.:183.8
 Max.   :190.0   Max.   :188.0
> abs <- stack(ab)
> tapply(abs$values, abs$ind, summary)
$a
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  156.0   169.5   175.0   174.8   180.5   190.0

$b
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  169.0   173.2   177.0   178.2   183.8   188.0

> tapply(abs$values, abs$ind, sd)
       a        b
9.342852 6.442912
> t.test(ab$a,ab$b)

        Welch Two Sample t-test

data:  ab$a and ab$b
t = -0.94737, df = 15.981, p-value = 0.3576
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.008795   4.208795
sample estimates:
mean of x mean of y
    174.8     178.2
</code>

====== E.g. 5 ======

Below are the bacteria counts (MPN/g) found in ice cream sampled from nine particular factories:

0.593 0.142 0.329 0.691 0.231 0.793 0.519 0.392 0.418

Can we say that the bacteria level in the ice cream exceeds 0.3 MPN/g, making it unsafe to distribute?

<code>
> ir <- c(0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418)
> ir
[1] 0.593 0.142 0.329 0.691 0.231 0.793 0.519 0.392 0.418
> t.test(ir, mu=.3)

        One Sample t-test

data:  ir
t = 2.2051, df = 8, p-value = 0.05853
alternative hypothesis: true mean is not equal to 0.3
95 percent confidence interval:
 0.2928381 0.6200508
sample estimates:
mean of x
0.4564444

> t.test(ir, alternative="greater", mu=.3)

        One Sample t-test

data:  ir
t = 2.2051, df = 8, p-value = 0.02927
alternative hypothesis: true mean is greater than 0.3
95 percent confidence interval:
 0.3245133       Inf
sample estimates:
mean of x
0.4564444
</code>

====== E.g. 6 ======

Below are the memory test results for a non-smoker group and a smoker group.

Non-smokers = 18,22,21,17,20,17,23,20,22,21

Smokers = 16,20,14,21,20,18,13,15,17,21

Can we say that smoking affects memory?

Note: in the transcript below, the vector named smoke holds the non-smokers' scores and nosmoke holds the smokers' scores, so the group labels in the output are reversed relative to the problem statement. The non-smokers' mean is 20.1 and the smokers' mean is 17.5.

<code>
> smoke <- c(18,22,21,17,20,17,23,20,22,21)
> nosmoke <- c(16,20,14,21,20,18,13,15,17,21)
> sn <- data.frame(smoke, nosmoke)
> ss <- stack(sn)
> plot(ss$values~ss$ind)
> t.test(ss$values~ss$ind)

        Welch Two Sample t-test

data:  ss$values by ss$ind
t = -2.2573, df = 16.376, p-value = 0.03798
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.0371795 -0.1628205
sample estimates:
mean in group nosmoke   mean in group smoke
                 17.5                  20.1
</code>

====== E.g. 7 ======

  - Load the MASS package, then use the survey data to run a hypothesis test on the relationship between smoking and amount of exercise.
  - What happens if you regroup the exercise variable into two groups, those who exercise frequently (freq) and those who exercise sometimes or never (none to some), and run the hypothesis test again?

====== E.g. 8 ======

  - In the data above, does the degree of smoking differ between the sexes?
  - If the smoking variable is split into smokers and non-smokers, is there a difference in smoking between the sexes?
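Exercises 7 and 8 give no worked transcript, so here is one possible starting point, not a model answer. It assumes that a chi-squared test of independence is the intended method, since Smoke, Exer, and Sex in the MASS survey data are all categorical. The names exer2 and smoker are hypothetical helper variables introduced only for this sketch.

```r
# Sketch only: assumes a chi-squared test of independence is what is intended.
library(MASS)

# E.g. 7: smoking (Smoke) vs. exercise (Exer). table() drops rows containing NA.
smoke_exer <- table(survey$Smoke, survey$Exer)
print(smoke_exer)
# R may warn here if some expected cell counts are small.
res_exer <- chisq.test(smoke_exer)
print(res_exer)

# Regroup exercise into two groups: "Freq" vs. "None"/"Some" (hypothetical name exer2).
exer2 <- factor(ifelse(survey$Exer == "Freq", "Freq", "NoneToSome"))
smoke_exer2 <- table(survey$Smoke, exer2)
res_exer2 <- chisq.test(smoke_exer2)
print(res_exer2)

# E.g. 8: sex vs. degree of smoking.
sex_smoke <- table(survey$Sex, survey$Smoke)
res_sex <- chisq.test(sex_smoke)
print(res_sex)

# Collapse smoking into smoker / non-smoker (hypothetical name smoker).
smoker <- factor(ifelse(survey$Smoke == "Never", "NonSmoker", "Smoker"))
sex_smoker <- table(survey$Sex, smoker)
# For a 2 x 2 table, chisq.test applies Yates' continuity correction by default.
res_sex2 <- chisq.test(sex_smoker)
print(res_sex2)
```

As in the earlier examples, state the hypothesis and null hypothesis first, then compare each p-value with your chosen significance level. If R warns about small expected counts, fisher.test on the same table is one alternative to consider.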