First, review type I and type II errors: [[:types of error]] [[:z-test]] [[:t-test]]

''Q.'' The effect of alcohol during pregnancy: A researcher studying the effects of alcohol during pregnancy is interested in how alcohol consumption during pregnancy affects birth weight. A random sample of n = 16 pregnant rats is obtained, and each mother rat is given a fixed daily dose of alcohol. One pup is selected from each litter, giving a sample of n = 16 newborns whose mean weight is $\overline{X}$ = 15 grams. The researcher knows that for ordinary rats the mean birth weight is $\mu = 18$ grams with $\sigma = 4$. How should the researcher test the effect of alcohol?

[[https://www.easycalculation.com/statistics/t-distribution-critical-value-table.php|T dist. table]]

> # rnorm2: draw n values rescaled so that the sample mean and sd are exactly `mean` and `sd`
> rnorm2 <- function(n,mean,sd) { mean+sd*scale(rnorm(n)) }
> rat <- rnorm2(16, 15, 4)
> # t.test() estimates the standard deviation from the sample and has no sd argument
> t.test(rat, mu=18)

	One Sample t-test

data:  rat
t = -3, df = 15, p-value = 0.008973
alternative hypothesis: true mean is not equal to 18
95 percent confidence interval:
 12.86855 17.13145
sample estimates:
mean of x 
       15 

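Because the question states that $\sigma = 4$ is known, a z-test is also a natural choice here. The following is a minimal sketch using only the numbers given in the question; the variable names are chosen for illustration. Since rnorm2 forces the sample standard deviation to equal 4, the statistic has the same value as the t above, but it is compared against the standard normal distribution rather than a t distribution with 15 degrees of freedom.

```r
# Values taken from the question
mu    <- 18   # population mean birth weight (grams)
sigma <- 4    # known population standard deviation (grams)
n     <- 16   # sample size
xbar  <- 15   # observed sample mean (grams)

se <- sigma / sqrt(n)               # standard error of the mean
z  <- (xbar - mu) / se              # z statistic
p_two_sided <- 2 * pnorm(-abs(z))   # two-sided p-value under the standard normal
z_crit <- qnorm(0.975)              # critical value for alpha = 0.05, two-sided

z
p_two_sided
abs(z) > z_crit   # TRUE means the null hypothesis is rejected at alpha = 0.05
```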

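SAT scores of 28 students: the effect of reasonable guessing. Suppose each question has five answer choices. When students answer the questions by reasonable guessing, can we say that the guessing has an effect? (The t-test below uses 20 as the reference score.)

58, 48, 48, 41, 34, 43, 38, 53, 41, 60, 55, 44, 43, 49, 47, 33, 47, 40, 46, 53, 40, 45, 39, 47, 50, 53, 46, 53 . . .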

> sec12.9 <- c(58, 48, 48, 41, 34, 
43, 38, 53, 41, 60, 55, 44, 43, 49, 47, 
33, 47, 40, 46, 53, 40, 45, 39, 47, 50, 
53, 46, 53)

> mean(sec12.9)
[1] 46.21429

> sqrt(var(sec12.9))
[1] 6.729466

> length(sec12.9)
[1] 28

> t.test(sec12.9, mu=20)

	One Sample t-test

data:  sec12.9
t = 20.6128, df = 27, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 20
95 percent confidence interval:
 43.60487 48.82370
sample estimates:
mean of x 
 46.21429 


> # numerator: difference between the sample mean and the hypothesized mean
> num <- mean(sec12.9)-20
> # denominator: standard error of the mean, s/sqrt(n)
> denom <- sqrt(var(sec12.9))/sqrt(length(sec12.9))
> tvalue <- num/denom
> tvalue
[1] 20.61277
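The hand computation above stops at the t value. As a rough sketch, the remaining pieces reported by t.test (degrees of freedom, two-sided p-value, and 95 percent confidence interval) can be derived the same way. The names df, p_value, t_crit and ci are illustrative and do not appear in the original session.

```r
sec12.9 <- c(58, 48, 48, 41, 34,
             43, 38, 53, 41, 60, 55, 44, 43, 49, 47,
             33, 47, 40, 46, 53, 40, 45, 39, 47, 50,
             53, 46, 53)

n      <- length(sec12.9)
se     <- sd(sec12.9) / sqrt(n)          # standard error of the mean
tvalue <- (mean(sec12.9) - 20) / se      # same t statistic as above
df     <- n - 1                          # degrees of freedom

p_value <- 2 * pt(-abs(tvalue), df)      # two-sided p-value
t_crit  <- qt(0.975, df)                 # critical t for a 95% interval
ci      <- mean(sec12.9) + c(-1, 1) * t_crit * se

tvalue
p_value
ci
```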

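t test summary
  * Of the two kinds of hypotheses, difference and association, the t-test addresses hypotheses of difference.
    * Use a t-test when the independent variable has two attributes (categories).
    * remind: see [[:hypothesis]], [[:types of variable]], [[:level of measurement]]
  * Situations in which a difference (between two groups) is examined; see [[:t-test]]
    * Difference between a population and a sample
      * population with known $\mu$ and $\sigma$
      * population with known $\mu$, but unknown $\sigma$
    * Difference between two samples
      * comparison between two groups
      * e.g., the difference between men and women in game adaptation ability
    * Difference within one sample over time
      * e.g., the effect observed after taking a drug

Chapter 1. You should already be familiar with this material.

Chapter 2. ?trees will explain what the data set is.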

Description

This data set provides measurements of the girth, 
height and volume of timber in 31 felled black 
cherry trees. Note that girth is the diameter of 
the tree (in inches) measured at 4 ft 6 in above 
the ground.

Usage

trees

Format

A data frame with 31 observations on 3 variables.

[,1]	Girth	 numeric	 Tree diameter in inches
[,2]	Height	 numeric	 Height in ft
[,3]	Volume	 numeric	 Volume of timber in cubic ft

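Mean
mean(trees$Volume)

Variance
var(trees$Volume)

The sample variance s² is computed by dividing the sum of squared deviations from the mean by n-1 rather than n, because the expected value of s² computed with n-1 equals the population variance $ \sigma^{2} $ ((see [[:why n-1]])). Therefore, when the data cover the entire population, the population variance is obtained by multiplying the value of var() by (n-1)/n.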

attach(trees)
n <- length(Volume)
var_as_population <- var(Volume) * (n-1) / n 
var_as_population
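As a quick check of the (n-1)/n rescaling, the sketch below computes the population variance directly from its definition (the mean squared deviation) and compares it with the rescaled var(). It reads trees$Volume directly instead of relying on attach(); the names var_pop_direct and var_pop_rescaled are illustrative.

```r
x <- trees$Volume
n <- length(x)

var_sample       <- var(x)                  # divides by n - 1
var_pop_direct   <- mean((x - mean(x))^2)   # divides by n
var_pop_rescaled <- var_sample * (n - 1) / n

# The two population-variance computations should agree
all.equal(var_pop_direct, var_pop_rescaled)
```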

Standard Deviation
sd(Volume) or sqrt(var(Volume))

Standard Error . . . . Mathematically, the standard deviation is $ \sqrt{n} $ times as large as the standard error of the mean (SE = s / $ \sqrt{n} $).


# trees was already attached above, so it is not attached again here
n <- length(Volume)
se_value <- sd(Volume)/sqrt(n)   # standard error of the mean
se_value

Median, quartiles, boxplot
fivenum(Volume)

quantile(Volume)
  0%  25%  50%  75% 100% 
10.2 19.4 24.2 37.3 77.0

IQR = 75% value - 25% value
IQR(Volume)

Boxplot
boxplot(Volume, col="red")
colors()   # lists the color names that can be passed to col

histogram


hist(Volume, probability=TRUE)  # histogram on the density scale
lines(density(Volume), col="blue") # kernel density estimate overlaid as a curve

stem(Volume)

  The decimal point is 1 digit(s) to the right of the |

  1 | 00066899
  2 | 00111234567
  3 | 24568
  4 | 3
  5 | 12568
  6 | 
  7 | 7

qqnorm(Volume)
qqline(Volume, col="red")

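In the Q-Q plot, the straight line drawn by qqline shows where the points would fall under a normal distribution. If the observed points do not deviate much from this line, Volume can be regarded as approximately normally distributed. The qqnorm function provides this rough visual check. For comparison, the following draws the same plot for a sample that really does come from a normal distribution.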

> x <- rnorm(n=31, 0, 1)
> qqnorm(x)
> qqline(x)
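To make clear what qqnorm actually plots, the sketch below rebuilds the plotted coordinates by hand: the sorted observations are paired with standard normal quantiles at the probability points returned by ppoints(). The variable names are illustrative, and plot.it = FALSE is used so that the coordinates are returned without drawing.

```r
v <- trees$Volume
n <- length(v)

theoretical <- qnorm(ppoints(n))   # normal quantiles at the plotting positions
observed    <- sort(v)             # ordered sample values

qq <- qqnorm(v, plot.it = FALSE)   # coordinates qqnorm would plot

# The hand-built coordinates should match those from qqnorm
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), observed)
```

By default qqline draws its reference line through the first and third quartiles of the data and of the normal distribution, so a few extreme points do not move the line much.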

Writing a function

> se = function(x) sd(x)/sqrt(length(x))
> se(Volume)
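The one-line se function returns NA as soon as the input contains a missing value. One possible variant is sketched below; the name se_na and its na.rm argument are hypothetical and not part of the original notes.

```r
# Standard error of the mean, optionally dropping missing values first
se_na <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]   # remove NAs before counting n
  sd(x) / sqrt(length(x))
}

se_na(trees$Volume)                        # same as se(Volume) above
se_na(c(trees$Volume, NA), na.rm = TRUE)   # the NA is ignored
se_na(c(1, 2, NA))                         # NA, since na.rm defaults to FALSE
```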