====== Estimating Populations and Samples: Making Predictions ======
{{tablelayout?colwidth="350px"&rowsHeaderSource=1&rowsVisible=2&float=right}}
|{{:b:head_first_statistics:pasted:20191125-101010.png}}|

So how can we use the results of the sample taste test to tell us the mean amount of time gumball flavor lasts in the general gumball population?

The answer is actually pretty intuitive. We assume that the mean flavor duration of the gumballs in the sample matches that of the population. In other words, we find the mean of the sample and use it as the mean for the population too.

Here's a sketch showing the distribution of the sample, and what you’d expect the distribution of the population to look like based on the sample.

You’d expect the distribution of the population to be a similar shape to that of the sample, so you can assume that the mean of the sample and the mean of the population have about the same value.

$$\mu \quad \quad \hat\mu$$

$\hat\mu$: See this hat I’m wearing? It means I’m a point estimator. If you don’t have the exact value of the mean, then I'm the next best thing.

Symbols with a hat, such as $\hat{Y}$ and $\hat{\mu}$, usually denote estimates (predicted or estimated values).
The point estimator of the population mean is the sample mean:

\begin{align*}
\overline{X} & = \frac {\sum{X}}{n} \\
& = \frac{ \sum_{i=1}^{n} X_{i} } {n} \\
& = \hat{\mu}
\end{align*}

====== Estimating population variance ======
|{{:b:head_first_statistics:pasted:20191125-102110.png?400}} | There are fewer values in \\ the sample, so there’s a good chance \\ that more extreme values will be \\ excluded.|
| {{:b:head_first_statistics:pasted:20191125-102318.png?600}} ||

The population variance, where $n$ is the number of values in the population:

\begin{eqnarray*}
\sigma^{2} & = & \frac{\sum(X-\mu)^2}{n}
\end{eqnarray*}

The point estimator of the population variance, where $n$ is the sample size. Dividing by $n-1$ instead of $n$ makes the estimate slightly larger, which compensates for the extreme values the sample is likely to miss:

\begin{eqnarray*}
\hat{\sigma}^{2} & = & \frac{\sum(X-\overline{X})^2}{n-1}
\end{eqnarray*}

This estimator is written $s^2$:

\begin{eqnarray*}
\hat{\sigma}^{2} & = & s^2 \\
s^2 & = & \frac {\sum(X-\overline{X})^2}{n-1}
\end{eqnarray*}

{{:b:head_first_statistics:pasted:20191125-103417.png}}
{{:b:head_first_statistics:pasted:20191125-103603.png}}
{{:b:head_first_statistics:pasted:20191125-103510.png}}

See also [[:Why N-1]].

<code>
# sample data
x <- c(61.9, 62.6, 63.3, 64.8, 65.1, 66.4, 67.1, 67.2, 68.7, 69.9)
mean(x)   # sample mean, the point estimate of the population mean
var(x)    # sample variance; R's var() divides by n - 1
</code>

<code>
> x <- c(61.9, 62.6, 63.3, 64.8, 65.1, 66.4, 67.1, 67.2, 68.7, 69.9)
> mean(x)
[1] 65.7
> var(x)
[1] 6.924444
</code>

The same variance worked out by hand, where ds is the deviation of each value from the mean:

^ x ^ mean ^ x - mean ^ ds%%^%%2 (squared deviation) ^
| 61.9 | 65.7 | -3.8 | 14.44 |
| 62.6 | 65.7 | -3.1 | 9.61 |
| 63.3 | 65.7 | -2.4 | 5.76 |
| 64.8 | 65.7 | -0.9 | 0.81 |
| 65.1 | 65.7 | -0.6 | 0.36 |
| 66.4 | 65.7 | 0.7 | 0.49 |
| 67.1 | 65.7 | 1.4 | 1.96 |
| 67.2 | 65.7 | 1.5 | 2.25 |
| 68.7 | 65.7 | 3 | 9 |
| 69.9 | 65.7 | 4.2 | 17.64 |
| $\sum{ds^2}$ | | | 62.32 |
| $n-1$ | | | 9 |
| $Var(x)$ | | | 6.924444 |

====== Estimating proportion ======
{{:b:head_first_statistics:pasted:20191125-105807.png}}
{{:b:head_first_statistics:pasted:20191125-105856.png}}

p = 32/40 = 0.8

Mighty Gumball takes another sample of their super-long-lasting gumballs, and finds that in the sample, 10 out of 40 people prefer the pink gumballs to all other colors. What proportion of people prefer pink gumballs in the population? What’s the probability of choosing someone from the population who doesn’t prefer pink gumballs?
The point estimator of the population proportion is the sample proportion $P_{s}$:

\begin{eqnarray*}
\hat{P} = P_{s} & = & \frac {10}{40} \\
& = & 0.25
\end{eqnarray*}

The estimated probability of choosing someone who does not prefer pink gumballs is the complement:

\begin{eqnarray*}
\hat{Q} & = & 1 - \hat{P} \\
& = & 1 - 0.25 \\
& = & 0.75
\end{eqnarray*}

====== Sampling distribution of proportions ======
===== Expectation of sample proportions (Ps) =====
The proportion of red gumballs in the population is 0.25. A jumbo box contains 100 gumballs (n = 100). What is the probability of getting exactly 40 red gumballs in one box? That is, if $X \sim B(100, 1/4)$, what is $P(X=40)$?

<code>
> dbinom(40, 100, 1/4)
[1] 0.0003626268
</code>

Population: if 25% of the gumballs are red and we draw a single gumball, what are the expectation and variance? By the Bernoulli distribution, with the outcome coded as 1 (red) or 0 (not red), the expectation and variance of a single draw are:

\begin{eqnarray*}
E(Y) & = & p = 1/4 \\
Var(Y) & = & p * q = 3/16
\end{eqnarray*}

With 100 independent trials of this kind, $X \sim B(100, 1/4)$, and the mean and variance are:

\begin{eqnarray*}
E(X) & = & n * p = 100 * 1/4 = 25 \\
Var(X) & = & n * p * q = 100 * 1/4 * 3/4 = 18.75
\end{eqnarray*}

With $n = 100$, each sample (box) gives its own observed sample proportion ($\hat{P}$), for example:

\begin{align*}
n = 100, \\
\hat{P}_{1} & = \frac{X_{1}}{n} = 0.34, (X_{1} = 34) \\
\hat{P}_{2} & = \frac{X_{2}}{n} = 0.43, (X_{2} = 43) \\
\hat{P}_{3} & = \frac{X_{3}}{n} = 0.32, (X_{3} = 32) \\
\hat{P}_{4} & = \frac{X_{4}}{n} = 0.42, (X_{4} = 42) \\
\cdots \cdots \cdots \\
\hat{P}_{k} & = \frac{X_{k}}{n} = 0.24, (X_{k} = 24)
\end{align*}

That is, when $X \sim B(n, p)$, the sample proportion is $P_{s} = \dfrac{X}{n}$ ($X$ = number of red gumballs in the sample, $n$ = sample size).

{{:b:head_first_statistics:pasted:20191126-073028.png}}

If we keep repeating the sampling above, we get results like (1)~(6) (see the figure below).
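The repeated sampling described above can also be tried out in R. The following is a minimal sketch, not part of the original notes: the seed, the number of simulated boxes (`reps`), and the variable names are assumptions chosen for illustration. It draws many boxes of 100 gumballs from a population with 25% red gumballs and looks at how the sample proportions $P_{s} = X/n$ behave.

```r
# Simulate many boxes of n gumballs and record the proportion of red ones.
# Assumptions for illustration only: the seed, reps, and all variable names.
set.seed(1)
p    <- 0.25     # population proportion of red gumballs
n    <- 100      # gumballs per box (sample size)
reps <- 10000    # number of simulated boxes (hypothetical choice)

x  <- rbinom(reps, size = n, prob = p)  # red gumballs in each box, X ~ B(n, p)
ps <- x / n                             # sample proportion of each box

mean.ps <- mean(ps)   # average of all the sample proportions
var.ps  <- var(ps)    # how much the sample proportions vary from box to box

head(ps)   # the first few sample proportions
mean.ps
var.ps
```

With this many boxes, the average of the sample proportions should settle close to the population value 0.25, while the individual boxes still vary noticeably around it.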
{{:b:head_first_statistics:pasted:20191126-073652.png}}

If we keep sampling like this and compute the sample proportion each time, its expectation is:

\begin{eqnarray*}
E(\text{sample proportion}) & = & E(P_{s}) \\
& = & E \left(\frac{X}{n} \right) \\
& = & \frac{E(X)}{n} \\
& = & \frac{np}{n} \\
& = & p
\end{eqnarray*}

^ references ^
| {{youtube>Br067hrasc8}} Sampling distribution of sample proportion part 1, \\ AP Statistics, Khan Academy |
| {{youtube>fuGwbG9_W1c}} The Sampling Distribution of the Sample Proportion |

===== What about variance =====
\begin{eqnarray*}
Var(\text{sample proportion}) & = & Var(P_{s}) \\
& = & Var\left(\frac{X}{n}\right) \\
& = & \frac {Var(X)}{n^{2}} \\
& = & \frac {npq}{n^{2}} \\
& = & \frac {pq}{n}
\end{eqnarray*}

\begin{eqnarray*}
\text{Standard deviation of sample proportions} & = & \sqrt{\frac{pq}{n}} \\
& = & \text{Standard error of sample proportions}
\end{eqnarray*}

Putting this together, the expectation and variance of the sample proportions are as follows (see the figure).

$$E(P_{s}) = p \qquad\qquad\qquad Var(P_{s}) = \displaystyle \frac{pq}{n}$$

{{:b:head_first_statistics:pasted:20191126-075541.png}}

When $n$ is large, the sample proportion is approximately normally distributed:

$$P_{s} \sim N\left(p,\; \frac{pq}{n}\right) $$

Because $X$ is discrete, a continuity correction is needed when using this normal approximation:

$$\pm \frac{1}{2n}$$

===== Exercise =====
25% of the gumball population are red. What’s the probability that in a box of 100 gumballs, at least 40% will be red? We’ll guide you through the steps.

1. If Ps is the proportion of red gumballs in the box, how is Ps distributed?

Since the question is about a sample proportion, we know that the sample proportion is distributed as follows.

\begin{eqnarray*}
P_{s} & \sim & N\left(p,\; \frac{pq}{n}\right) \\
& \sim & N\left(0.25,\; \frac{0.25*0.75}{100}\right)
\end{eqnarray*}

2. What’s the value of P(Ps ≥ 0.4)? Hint: Remember that you need to apply a continuity correction.

\begin{eqnarray*}
P(P_{s} \ge 0.4) & = & P(P_{s} > 0.4 - (1/(2*100)) ) \\
& = & P(P_{s} > 0.395)
\end{eqnarray*}

Find the standard score at 0.395, then find the area to its right.
\begin{eqnarray*}
z & = & \frac {P_{s} - E(P_{s})}{\sqrt{Var(P_{s})}} \\
& = & \frac {0.395 - 0.25}{\sqrt{\frac{0.25*0.75}{100}}} \\
& = & \frac {0.145}{\sqrt{0.001875}} \\
& = & \frac {0.145}{0.04330127} \\
& = & 3.348632 \\
& \approx & 3.35
\end{eqnarray*}

Based on this calculation, the value we need is:

\begin{eqnarray*}
P(Z > 3.35) & = & 1 - P (Z < 3.35) \\
& = & 1 - 0.9996 \\
& = & 0.0004
\end{eqnarray*}

<code>
p <- 0.25
q <- 1-p
n <- 100
var <- (p*q)/(n)         # variance of the sample proportion
se <- sqrt((p*q)/(n))    # standard error of the sample proportion
o <- .4                  # observed proportion
o.c <- .4 - (1/(2*n))    # with the continuity correction applied
o.c
pnorm(o.c, p, se, lower.tail = F)   # area to the right of o.c
</code>

<code>
> p <- 0.25
> q <- 1-p
> n <- 100
> var <- (p*q)/(n)
> se <- sqrt((p*q)/(n))
> o <- .4
> o.c <- .4 - (1/(2*n))
> o.c
[1] 0.395
> pnorm(o.c, p, se, lower.tail = F)
[1] 0.0004060586
</code>

====== Sampling distribution of sample mean ======
According to Mighty Gumball’s statistics for the population, the mean number of gumballs in each packet is 10, and the variance is 1. The trouble is they’ve had a complaint. One of their most faithful customers bought 30 packets of gumballs, and he found that the average number of gumballs per packet in his sample is only 8.5.

{{:b:head_first_statistics:pasted:20191126-083127.png}}
{{:b:head_first_statistics:pasted:20191126-083201.png}}
{{:b:head_first_statistics:pasted:20191126-083349.png}}
{{:b:head_first_statistics:pasted:20191126-083432.png}}
{{:b:head_first_statistics:pasted:20191126-083532.png}}

\begin{eqnarray*}
\overline{X} = \frac{X_{1} + X_{2} + \cdots + X_{n}}{n}
\end{eqnarray*}

The formula above is the mean of one sample of 30 gumball packets. The derivation below gives the mean of these sample means when the sampling is repeated over and over.

\begin{eqnarray*}
E(\overline{X}) & = & E\left(\frac{X_{1} + X_{2} + \cdots + X_{n}}{n}\right) \\
& = & \frac{1}{n}\: E \left(X_{1} + X_{2} + \cdots + X_{n}\right) \\
& = & \frac{1}{n}\: \left(E(X_{1}) + E(X_{2}) + \cdots + E(X_{n})\right) \\
& = & \frac{1}{n}\: \left(\mu + \mu + \cdots + \mu \right)\\
& = & \frac{1}{n}\: n * \mu \\
& = & \mu
\end{eqnarray*}

===== Variance of sample means =====
==== Statistics Magnets ====
\begin{eqnarray*}
\overline{X} = \frac{X_{1} + X_{2} + \cdots + X_{n}}{n}
\end{eqnarray*}

Because the $X_{i}$ are independent and each has variance $\sigma^2$, the variance of their sum is the sum of their variances:

\begin{align*}
Var(\overline{X}) & = Var \left(\frac{X_{1} + X_{2} + \cdots + X_{n}}{n}\right) \\
& = \frac {1}{n^2} Var \left(X_{1} + X_{2} + \cdots + X_{n} \right) \\
& = \frac{1}{n^2} (\sigma^2 + \sigma^2 + \cdots + \sigma^2) \\
& = \frac{1}{n^2} n * (\sigma^2) \\
& = \frac{\sigma^2}{n}
\end{align*}

\begin{eqnarray}
E(\overline{X}) & = & \mu_{\overline{X}} \; = \; \mu \\
Var(\overline{X}) & = & \sigma^{2}_{\overline{X}} \; = \; \frac{\sigma^2}{n} \\
SD(\overline{X}) & = & \sigma_{\overline{X}} \; = \; \frac{\sigma}{\sqrt{n}}
\end{eqnarray}

\begin{eqnarray*}
\text{standard error} & = & \text{standard deviation of sample means} \\
& = & \frac{\sigma}{\sqrt{n}} \\
& = & \sqrt{\frac{\sigma^{2}}{n}}
\end{eqnarray*}

{{:b:head_first_statistics:pasted:20191126-093924.png}}

\begin{eqnarray*}
\text{If} \; X \sim N(\mu, \ \sigma^2), \;\; \text{then} \; \overline{X} \sim N\left(\mu, \ \frac{\sigma^2}{n}\right)
\end{eqnarray*}

^ reference ^
| {{youtube>q50GpTdFYyI}} sampling distribution of x bar (sample mean) \\ we will see this video |
| {{youtube>JLmD0sJId1M}} deriving mean and variance of (sampling distribution of) sample mean |

===== Distribution of sample means: CLT =====
Even though X may NOT be normally distributed, $\overline{X}$ is approximately normal if n is large enough. For the CLT, see [[:Central Limit Theorem]].

===== Using CLT for the binomial distribution =====
For $X \sim B(n, p)$, $\mu = np$ and $\sigma^2 = npq$. When n is greater than 30, the distribution of the sample mean is approximately normal, so substituting into $\overline{X} \sim N(\mu, \frac{\sigma^2}{n})$ (with the sample size taken to be the same n, so that $\frac{\sigma^2}{n} = \frac{npq}{n} = pq$) gives:

$$\overline{X} \sim N(np, \; pq) $$

{{:b:head_first_statistics:pasted:20191126-095122.png}}

===== for the Poisson distribution =====
For $X \sim Po(\lambda)$, $\mu = \sigma^2 = \lambda$. When n is greater than 30, substituting into $\overline{X} \sim N(\mu, \frac{\sigma^2}{n})$ gives:

$$\overline{X} \sim N\left(\lambda, \; \frac{\lambda}{n}\right) $$

===== Exercise =====
Let’s apply this to Mighty Gumball’s problem.
  * The mean number of gumballs per packet is 10, and the variance is 1.
  * If you take a sample of 30 packets,
  * what’s the probability that the sample mean is 8.5 gumballs per packet or fewer?

We’ll guide you through the steps.

How is $\overline{X}$ distributed? Since $\overline{X} \sim N(\mu, \frac{\sigma^2}{n})$, with n = 30 we have

$$\overline{X} \sim N\left(10, \frac{1}{30}\right)$$

The question asks for $P (\overline{X} < 8.5)$, so we standardize:

\begin{eqnarray*}
z & = & \frac{8.5-10}{\sqrt{\frac{1}{30}}} \\
& \approx & -8.22
\end{eqnarray*}

The problem is therefore asking for $P(Z < z) = P(Z < -8.22)$.

<code>
> pnorm(-8.22)
[1] 1.017516e-16
> pnorm(8.5, 10, sqrt(1/30))   # or, without standardizing first
[1] 1.053435e-16
</code>

The two results differ slightly because the first call uses z rounded to -8.22. Using the unrounded z reproduces the second result:

<code>
> a <- sqrt(1/30)
> b <- 8.5-10
> b/a
[1] -8.215838
> pnorm(b/a)
[1] 1.053435e-16
</code>
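As a rough check on the results used in this exercise, $E(\overline{X}) = \mu$ and $Var(\overline{X}) = \sigma^2/n$, the sampling distribution of the sample mean can be simulated. This is only a sketch under assumptions that are not in the original problem: the number of gumballs per packet is treated as normally distributed with mean 10 and variance 1, and the seed, `reps`, and the variable names are arbitrary choices.

```r
# Simulate the sampling distribution of the mean of 30 packets.
# Assumptions for illustration only: normal population, seed, reps, names.
set.seed(1)
mu    <- 10      # population mean (gumballs per packet)
sigma <- 1       # population standard deviation (variance is 1)
n     <- 30      # packets per sample
reps  <- 10000   # number of simulated samples (hypothetical choice)

# Each replicate draws one sample of n packets and keeps its mean.
xbar <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

mean.xbar <- mean(xbar)   # compare with mu
var.xbar  <- var(xbar)    # compare with sigma^2 / n

mean.xbar
var.xbar
sigma^2 / n
min(xbar)                 # smallest sample mean seen in the simulation
```

Under these assumptions, even the smallest simulated sample mean should stay well above 8.5, which is consistent with the tiny probability computed above.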