**We use area because the values along the x-axis cannot be enumerated discretely (one by one).**

===== exercise =====

===== Headline =====

__The Case of the Missing Parameters__

\end{eqnarray*}

\begin{eqnarray}
-2.61 \sigma & = & 5 - \mu \\
1.8 \sigma & = & 15 - \mu
\end{eqnarray}

Subtracting (2) from (1):

\begin{eqnarray*}
(-2.61 - 1.8) \sigma & = & 5 - \mu - 15 + \mu \\
-4.41 \sigma & = & -10 \\
\sigma & = & \frac {-10}{-4.41} \\
& = & 2.267574
\end{eqnarray*}

Substituting this into (2):

\begin{eqnarray*}
1.8 * 2.267574 & = & 15 - \mu \\
\mu & = & 15 - (1.8 * 2.267574) \\
& = & 10.91837
\end{eqnarray*}

Therefore,

\begin{eqnarray*}
\mu & = & 10.9184 \\
\sigma & = & 2.27
\end{eqnarray*}

====== Using the normal distribution II ======

\begin{eqnarray*}
\text{bride} \sim N(150, 400) \\
\text{groom} \sim N(190, 500)
\end{eqnarray*}

For the roller coaster ride, the combined weight of the bride and groom must be under 380 lbs.
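As a quick sanity check on these figures, the combined-weight setup can be simulated in R. This is a sketch, not from the book; the seed and sample size are arbitrary choices:

```r
set.seed(42)                     # arbitrary seed, for reproducibility
n <- 100000                     # arbitrary number of simulated couples
bride <- rnorm(n, mean = 150, sd = sqrt(400))  # sd is the square root of the variance
groom <- rnorm(n, mean = 190, sd = sqrt(500))
mean(bride + groom < 380)       # empirical P(combined weight < 380), about 0.91
```

The empirical proportion should land very close to the exact value ''%%pnorm(380, 340, sqrt(900))%%''.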
{{:b:head_first_statistics:pasted:20191113-142020.png}}

{{:b:head_first_statistics:pasted:20191113-142210.png}}

{{:b:head_first_statistics:pasted:20191113-142042.png}}

From the earlier expectation and variance rules:

\begin{eqnarray*}
E(X + Y) & = & E(X) + E(Y) \\
E(X - Y) & = & E(X) - E(Y) \\
Var(X + Y) & = & Var(X) + Var(Y) \\
Var(X - Y) & = & Var(X) + Var(Y)
\end{eqnarray*}

{{:b:head_first_statistics:pasted:20191113-141902.png}}

===== X + Y =====
\begin{eqnarray*}
X \sim N(\mu_{X}, \sigma^{2}_{X}) \\
Y \sim N(\mu_{Y}, \sigma^{2}_{Y})
\end{eqnarray*}

\begin{eqnarray*}
X + Y \sim N(\mu, \sigma^{2})
\end{eqnarray*}

\begin{eqnarray*}
\mu & = & \mu_{X} + \mu_{Y} \\
\sigma^{2} & = & \sigma^{2}_{X} + \sigma^{2}_{Y}
\end{eqnarray*}

That is, the mean of the combined distribution is the sum of the two means, and its variance is the sum of the two variances.

{{:b:head_first_statistics:pasted:20191113-143012.png}}
That is,
{{:b:head_first_statistics:pasted:20191113-142020.png}}

===== X - Y =====
\begin{eqnarray*}
X \sim N(\mu_{X}, \sigma^{2}_{X}) \\
Y \sim N(\mu_{Y}, \sigma^{2}_{Y})
\end{eqnarray*}

\begin{eqnarray*}
X - Y \sim N(\mu, \sigma^{2})
\end{eqnarray*}

\begin{eqnarray*}
\mu & = & \mu_{X} - \mu_{Y} \\
\sigma^{2} & = & \sigma^{2}_{X} + \sigma^{2}_{Y}
\end{eqnarray*}

That is, the mean of the X - Y distribution is the difference of the two means, but its variance is still the sum of the two variances.

{{:b:head_first_statistics:pasted:20191113-143619.png}}

Find the probability that the combined weight of the bride and groom is less than 380 pounds using the following three steps.

Given X ~ N(150, 400) and Y ~ N(190, 500),
$$X + Y \sim N(340, 900)$$

Since the combined weight limit is 380,

\begin{eqnarray*}
z & = & \frac {(X + Y) - \mu}{ \sigma } \\
& = & \frac {380 - 340}{ 30 } \\
& = & \frac {40}{30} \\
& = & 1.333
\end{eqnarray*}

To find P(X + Y < 380), consult the [[:z-table]] or use R.
pnorm in R gives the cumulative probability (**P**ercentage) up to a given z score:

<code>
> pnorm(1.333)
[1] 0.9082409
</code>

Note that R can do the standardization for you, so you can also compute the value directly:

<code>
> pnorm(380, 340, sqrt(900))
# 900 is the variance, so pass sqrt(900) as the sd
# pnorm's arguments are (q, mean = 0, sd = 1),
# i.e., find the percentile to the left of q
# for the given mean and sd
[1] 0.9087888
</code>

<code>
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
</code>

Therefore,
$$P(X + Y < 380) = 0.9082409$$

===== exercise =====

Julie’s matchmaker is at it again. What's the **probability that a man will be at least 5 inches taller than a woman**? In Statsville, the height of men in inches is distributed as N(71, 20.25), and the height of women in inches is distributed as N(64, 16).

\begin{eqnarray*}
M & \sim & N(71, 20.25) \\
F & \sim & N(64, 16)
\end{eqnarray*}

The **probability that a man will be at least 5 inches taller than a woman** means we need $P(M > F + 5)$, and
\begin{eqnarray*}
P(M > F + 5) = P(M - F > 5)
\end{eqnarray*}

Now,
\begin{eqnarray*}
M - F & \sim & N(\mu_{m} - \mu_{f}, \sigma^{2}_{m} + \sigma^{2}_{f}) \\
& \sim & N(7, 36.25)
\end{eqnarray*}

That is, the difference between a man's and a woman's height follows $M - F \sim N(7, 36.25)$.

Here $\sigma = \sqrt{36.25} \approx 6.02$.

The z score at M - F = 5 is
\begin{eqnarray*}
z & = & \frac {(M - F) - \mu}{\sigma} \\
& = & \frac {5 - 7}{6.02} \\
& = & -0.3322259
\end{eqnarray*}

Therefore the answer is ''%%1 - pnorm(-0.33)%%'':

<code>
> 1 - pnorm(-0.3322259, 0, 1)
[1] 0.6301407
# or compute directly:
> 1 - pnorm(5, 7, sqrt(36.25))
[1] 0.6301241
</code>

===== Linear Transform =====

A four-person roller coaster car has a weight limit of 800 lbs. The weights of the people of Statsville have a mean of 180 and a variance of 625. What is the probability that the combined weight of four people is less than 800 lbs?

**The distribution of 4X** is actually **a linear transform of X**. It’s a transformation of X in the form aX + b, where a is equal to 4, and b is equal to 0. This is exactly the same sort of transform as we encountered earlier with discrete probability distributions. Linear transforms describe **underlying changes to the size of the values in the probability distribution**. This means that 4X actually describes the weight of an individual adult whose weight has been multiplied by 4.

{{:b:head_first_statistics:pasted:20191114-075704.png}}

That is, the transformation 4X means a single person's weight multiplied by four, not the weights of four people added together. In the former case the distribution is:

{{:b:head_first_statistics:pasted:20191114-072427.png}}

===== Independent Observation =====
Rather than transforming the weight of each adult, what we really need to figure out is the probability distribution for the combined weight of four separate adults. In other words, we need to work out the probability distribution of four independent observations of X.

{{:b:head_first_statistics:pasted:20191114-075914.png}}

That is, the number of people increases one at a time to four; it is not one person's weight becoming 4X.

{{:b:head_first_statistics:pasted:20191114-080104.png}}

Each person's weight is an independent observation. In this case the distribution is:

{{:b:head_first_statistics:pasted:20191114-080220.png}}

Q: So what’s the difference between linear transforms and independent observations?
A: Linear transforms affect the underlying values in your probability distribution. As an example, if you have a length of rope of a particular length, then applying a linear transform affects the length of the rope. Independent observations have to do with the quantity of things you’re dealing with.
As an example, if you have n independent observations of a piece of rope, then you’re talking about n pieces of rope. In general, __if the quantity changes__, you’re dealing with **independent observations**. __If the underlying values change__, then you’re dealing with a **transform**.

Q: Do I really have to know which is which? What difference does it make?
A: You have to know which is which because it makes a difference in your probability calculations. You calculate the expectation for linear transforms and independent observations in the same way, but there’s a big difference in the way the variance is calculated. If you have n independent observations, then the variance is n times the original. If you transform your probability distribution as aX + b, then your variance becomes $a^{2}$ times the original.

Q: Can I have both independent observations and linear transforms in the same probability distribution?
A: Yes you can. To work out the probability distribution, just follow the basic rules for calculating expectation and variance. You use the same rules for both discrete and continuous probability distributions.

So the answer to the problem above can be written as
$X_{1} + X_{2} + X_{3} + X_{4} \sim N(720, 2500)$.

To find $P(X_{1} + X_{2} + X_{3} + X_{4} < 800)$, compute the z score for 800 and then look up the cumulative percentage.

\begin{eqnarray*}
z & = & \frac {x-\mu}{\sigma} \\
& = & \frac {800-720}{50} \\
& = & \frac {80}{50} \\
& = & 1.6
\end{eqnarray*}

So the answer is pnorm(1.6) = 0.9452007.
<code>
> pnorm(1.6)
[1] 0.9452007
</code>

===== Swivel chair again =====

{{:b:head_first_statistics:pasted:20191114-081620.png}}

Before going further:

So what’s the probability of getting 30 or more questions right out of 40? That will help us determine whether to keep playing, or walk away.

There are 40 questions, which means there are 40 trials.
The outcome of each trial can be **a success or failure**, and we want to find the probability of getting a certain number of successes.

In order to do this, we need to use **the binomial distribution**. We use **n = 40**, and as each question has four possible answers, **p is 1/4**, or 0.25.

If X is the number of questions we get right, then we **want to find P(X ≥ 30)**.

This means we have to calculate and add together **the probabilities for P(X = 30) up to P(X = 40)**.

We can find the mean and variance using n, p and q, where q = 1 - p. The mean is equal to np, and the variance is equal to npq. This gives us a **mean of 40 x 0.25 = 10**, and a variance of **40 x 0.25 x 0.75 = 7.5**.

<code>
> pbinom(29, 40, 1/4, lower.tail = F)
[1] 4.630881e-11
</code>

In R, the following functions work with the binomial distribution: ''%%dbinom, pbinom, qbinom, rbinom%%''

<code>
# When X ~ B(10, 0.2),
# what is the probability that X = 2, i.e., P(X = 2)?
dbinom(2, 10, 0.2)

k <- c(0:50) # or
k <- seq(0, 50, 1)

b <- dbinom(k, 50, 0.2)
plot(k, b, type = "l")

b <- dbinom(k, 50, 0.6)
plot(k, b, type = "l")

k <- seq(0, 100, 1)
b <- dbinom(k, 100, 0.6)
plot(k, b, type = "l")
</code>

<code>
# When X ~ B(10, 0.2),
# what is the probability that X = 2, i.e., P(X = 2)?
dbinom(2, 10, 0.2)

k <- c(0:50) # or
k <- seq(0, 50, 1)

b <- dbinom(k, 50, 0.2)
plot(k, b, type = "p")

b <- dbinom(k, 50, 0.6)
plot(k, b, type = "p")

k <- seq(0, 100, 1)
b <- dbinom(k, 100, 0.6)
plot(k, b, type = "p")
</code>

Solving the problem this way:
<code>
k <- seq(30, 40, 1)
k
b <- dbinom(k, 40, 0.25)
b
sum(b)
</code>

<code>
> k <- seq(30, 40, 1)
> k
 [1] 30 31 32 33 34 35 36 37 38 39 40
> b <- dbinom(k, 40, 0.25)
> b
 [1] 4.140329e-11 4.451967e-12 4.173719e-13 3.372702e-14 2.314599e-15 1.322628e-16 6.123279e-18 2.206587e-19
 [9] 5.806808e-21 9.926167e-23 8.271806e-25
> sum(b)
[1] 4.630881e-11
>
</code>

{{:b:head_first_statistics:pasted:20191114-082244.png}}

===== Normal distribution to the rescue =====
{{:b:head_first_statistics:pasted:20191114-082458.png}}

<code>
n20 <- 20
n5 <- 5
p1 <- .1
p5 <- .5
x <- c(0:30)

a <- dbinom(x, n5, p1)
c <- dbinom(x, n20, p1)
b <- dbinom(x, n5, p5)
d <- dbinom(x, n20, p5)

par(mfcol=c(2,2))

barplot(a, names.arg=x, main="n=5, p=0.1")
barplot(c, names.arg=x, main="n=20, p=0.1")
barplot(b, names.arg=x, main="n=5, p=0.5")
barplot(d, names.arg=x, main="n=20, p=0.5")
par(mfcol=c(1,1))
</code>

{{:b:head_first_statistics:pasted:20191118-232847.png}}
{{:b:head_first_statistics:pasted:20191118-231159.png}}
{{:b:head_first_statistics:pasted:20191118-231510.png}}
===== When to approximate the binomial distribution with the normal =====
We saw in the last exercise that the binomial distribution looks very similar to the normal distribution where p is **around 0.5** and n is **around 20**. As a general rule, **you can use the normal distribution to approximate the binomial when np and nq are both greater than 5**.
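This rule of thumb is easy to encode. A minimal sketch (the helper name ''%%normal_ok%%'' is ours, not from the book):

```r
# Normal approximation to X ~ B(n, p) is considered reasonable
# when both np > 5 and nq > 5, where q = 1 - p
normal_ok <- function(n, p) {
  q <- 1 - p
  (n * p > 5) && (n * q > 5)
}

normal_ok(12, 0.50)   # TRUE:  np = 6,   nq = 6
normal_ok(40, 0.25)   # TRUE:  np = 10,  nq = 30
normal_ok(5, 0.10)    # FALSE: np = 0.5
```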
That is, **when np and nq both exceed 5**.

In that case the approximating normal distribution is N(np, npq).
{{:b:head_first_statistics:pasted:20191118-111847.png}}
{{:b:head_first_statistics:pasted:20191118-111749.png}}

Before we use the normal distribution for the full 40 questions for Who Wants To Win A Swivel Chair, let’s tackle a simpler problem to make sure it works. Let’s try finding the probability that we get 5 or fewer questions correct out of 12, where there are only two possible choices for each question.

Let’s start off by working this out using the binomial distribution. Use the binomial distribution to find P(X < 6) where X ~ B(12, 0.5).

Since we want P(X < 6), compute P(X = 0) through P(X = 5) and add them together.
{{:b:head_first_statistics:pasted:20191118-095610.png}}
{{:b:head_first_statistics:pasted:20191118-095652.png}}

Using R:
<code>
pbinom(5, 12, 1/2)
</code>

<code>
> pbinom(5, 12, 1/2)
[1] 0.387207
</code>

Following the book's advice, let's redo this with the normal distribution:

\begin{eqnarray*}
X & \sim & B(12, 1/2) \\
n & = & 12 \\
p & = & 1/2 \\
q & = & 1/2
\end{eqnarray*}

Since np and nq are both greater than 5, the normal approximation applies, and X should follow $X \sim N(np, npq)$. That is, with $X \sim N(6, 3)$, we need P(X < 6).

\begin{eqnarray*}
z & = & \frac {(6 - 6)}{\sqrt{3}} \\
& = & 0
\end{eqnarray*}
The corresponding probability is $P(X < 6) = 0.5$, which differs from the binomial calculation above: $0.387 \ne 0.5$. The reason is . . . .
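The gap between the two answers can be reproduced directly in R (a sketch restating the two calculations above):

```r
pbinom(5, 12, 0.5)    # exact binomial P(X < 6): 0.387207
pnorm(6, 6, sqrt(3))  # naive normal approximation at z = 0: 0.5
```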
===== Revisiting Normal Approximation =====
{{:b:head_first_statistics:pasted:20191118-104800.png}}

{{:b:head_first_statistics:pasted:20191118-102001.png}}
{{:b:head_first_statistics:pasted:20191118-102042.png}}

===== Apply a continuity correction =====

So, with $X \sim N(6, 3)$, what is $P(X < 5.5)$?
\begin{eqnarray*}
z & = & \frac {(5.5-6)}{\sqrt{3}} \\
& = & - 0.29
\end{eqnarray*}
<code>
> pnorm(-0.29)
[1] 0.3859081
</code>

This value is close to the 0.387 obtained above.

  * In particular circumstances you can **use the normal distribution to approximate the binomial**. If X ~ B(n, p) and np > 5 and nq > 5, then you can approximate X using X ~ N(np, npq).
  * If you’re approximating the binomial distribution with the normal distribution, then you need to **apply a continuity correction** to make sure your results are accurate.

{{:b:head_first_statistics:pasted:20191118-103328.png}}

Q: Does it really save time to approximate the binomial distribution with the normal?

A: It can save a lot of time. Calculating binomial probabilities can be time-consuming because you generally have to work out the probability of lots of different values. You have no way of simply calculating binomial probabilities over a range of values. If you approximate the binomial distribution with the normal distribution, then it’s a lot quicker. You can look probabilities up in standard tables and also deal with whole ranges at once.

Q: So is it really accurate?

A: Yes, it’s accurate enough for most purposes. The key thing to remember is that you need to apply a continuity correction. If you don’t, then your results will be less accurate.

Q: What about continuity corrections for < and >? Do I treat those the same way as the ones for ≤ and ≥?

A: There’s a difference, and it all comes down to which values you want to include and exclude.
When you’re working out probabilities using ≤ and ≥, you need to make sure that you include the value in the inequality in your probability range. So if, say, you need to work out P(X ≤ 10), you need to make sure your probability includes the value 10. This means you need to consider P(X < 10.5). When you’re working out probabilities using < or >, you need to make sure that you exclude the value in the inequality from your probability range. This means that if you need to work out P(X < 10), you need to make sure that your probability excludes 10. You need to consider P(X < 9.5).

Q: You can approximate the binomial distribution with both the normal and Poisson distributions. Which should I use?

A: It all depends on your circumstances. If X ~ B(n, p), then you can use the normal distribution to approximate the binomial distribution if np > 5 and nq > 5. You can use the Poisson distribution to approximate the binomial distribution if n > 50 and p < 0.1.

Remember, you need to apply a **continuity correction** when you approximate the binomial distribution with the normal distribution.

===== Pool Puzzle =====

  X < 3        ----  X < 2.5
  X > 3        ----  X > 3.5
  X ≤ 3        ----  X < 3.5
  X ≥ 3        ----  X > 2.5
  3 ≤ X < 10   ----  2.5 < X < 9.5
  X = 0        ----  -0.5 < X < 0.5
  3 ≤ X ≤ 10   ----  2.5 < X < 10.5
  3 < X ≤ 10   ----  3.5 < X < 10.5
  X > 0        ----  0.5 < X
  3 < X < 10   ----  3.5 < X < 9.5

===== exercise =====

What’s the probability of you winning the jackpot on today’s edition of Who Wants to Win a Swivel Chair? See if you can find the probability of getting at least 30 questions correct out of 40, where each question has a choice of 4 possible answers.

This asks for $P(X \ge 30)$ under $X \sim B(40, 1/4)$.
$np = 40 * 1/4 = 10$
$npq = 40 * 1/4 * 3/4 = 10 * 3/4 = 7.5$
So with X ~ N(np, npq), we use N(10, 7.5) . . . .
(It is unclear why the book gives X ~ N(10, 30) here. . . .)
\begin{eqnarray*}
z & = & \frac {X - \mu}{\sigma} \\
& = & \frac {29.5 - 10}{\sqrt{7.5}} \\
& = & 7.120393
\end{eqnarray*}

A z score of 7.120393 is far beyond the range of the [[:z-table]] . . . .

<code>
> pnorm(29.5, 10, sqrt(7.5))
[1] 1
> 1 - pnorm(29.5, 10, sqrt(7.5))
[1] 5.381251e-13
>
</code>
That is, the probability is essentially zero.

===== All aboard the Love Train =====
{{:b:head_first_statistics:pasted:20191118-113020.png}}

When $\lambda > 15$, the Poisson distribution $X \sim Po(\lambda)$ behaves approximately like $X \sim N(\lambda, \lambda)$.

e.g.)
<code>
par(mfcol=c(2,2))
barplot(dpois(c(1:10),.25), main = "lambda \n =0.25")
barplot(dpois(c(1:20),5), main = "lambda \n = 5")
barplot(dpois(c(1:10),1), main = "lambda \n = 1")
barplot(dpois(c(1:40),20), main = "lambda \n = 20")
par(mfcol=c(1,1))
</code>

{{:b:head_first_statistics:pasted:20191120-230151.png}}

Dexter’s found some statistics on the Internet about the model of roller coaster he’s been trying out, and according to one site, you can expect the ride to break down 40 times a year.

Given the huge profit the Love Train is bound to make, Dexter thinks that it’s still worth going ahead with the ride if there’s a high probability of it __breaking down__ **less than 52 times a year**. So how do we work out that probability?

What sort of probability distribution does this follow? How would you work out the probability of the ride breaking down less than 52 times in a year?

This asks for $P(X < 52)$ where $X \sim Po(\lambda = 40)$.

Since $\lambda > 15$, we can use the normal approximation $X \sim N(\lambda, \lambda)$.

Because $X \sim Po(40)$, we find $P(X < 52)$ with $X \sim N(40, 40)$.

With the continuity correction, X < 52 becomes X < 51.5, and we use $\mu = 40$, $\sigma = \sqrt{40}$.
\begin{eqnarray*}
z & = & \frac {51.5 - 40}{\sqrt{40}} \\
& = & 1.82
\end{eqnarray*}

<code>
> 11.5/sqrt(40)
[1] 1.81831
> pnorm(1.82)
[1] 0.9656205
# Or
> pnorm(1.81831)
[1] 0.9654916
# Or
> pnorm(51.5, 40, sqrt(40))
[1] 0.9654916
</code>

$0.9654916 \approx 0.9656205$

===== Check up =====
{{:b:head_first_statistics:pasted:20191119-100005.png}}
{{:b:head_first_statistics:pasted:20191119-100038.png}}
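Unlike the binomial jackpot problem, the Love Train calculation can also be checked against the exact Poisson probability, since R's ppois gives the Poisson CDF directly (a sketch; the tolerance below is our own choice):

```r
ppois(51, 40)              # exact Poisson P(X <= 51), i.e. P(X < 52)
pnorm(51.5, 40, sqrt(40))  # normal approximation with continuity correction
# the two values agree to roughly two decimal places
```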