Differences

This shows you the differences between two versions of the page.

--- b:head_first_statistics:using_the_normal_distribution [2019/11/06 08:47] – [Headline] hkimscil
+++ b:head_first_statistics:using_the_normal_distribution [2024/10/28 08:09] (current) – […then squash the width] hkimscil
@@ Line 35: / Line 35: @@
 \end{eqnarray*}
-We use area because not every case on the x-axis can be represented discretely.
+**We use area because <fc #ff0000>not every case on the x-axis can be represented discretely</fc>.**
 ===== exercise =====
@@ Line 76: / Line 76: @@
 Note: this is written as $X \sim N(\mu, \sigma^{2})$.
-\begin{eqnarray*}
+\begin{align*}
-\mu$ & = & \text{mean}  \\
+\mu & = \text{mean}  \\
-\sigma^{2} & = & \text{variance} \\
+\sigma^{2} & = \text{variance} \\
-\sqrt{\sigma^{2}} & = & \sigma \\
+\sqrt{\sigma^{2}} & = \sigma \\
-& = & \text{standard deviation}
+& = \text{standard deviation}
-\end{eqnarray*}
+\end{align*}
 {{:b:head_first_statistics:pasted:20191106-065833.png}}
@@ Line 88: / Line 88: @@
 ===== So how do we find normal probabilities? =====
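+For a normal distribution with mean 0 and standard deviation 1, the probabilities have already been tabulated, as in the PDF linked below
+(useful if you are not using R): [[https://ux1.eiu.edu/~aalvarado2/z_table.pdf|z table]].
+A distribution whose mean and standard deviation are not 0 and 1 is first transformed so that they become 0 and 1 (standardization), and then the probability is looked up.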
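+The standardization step can be sketched in R as follows. This is a minimal illustration, assuming the $N(71, 20.25)$ height distribution and the value 64 that are used later on this page; the variable names are only for illustration.
+<code>
+# Assumed example values: X ~ N(71, 20.25), cutoff x = 64
+mu <- 71
+sigma <- sqrt(20.25)   # sd is the square root of the variance
+x <- 64
+# Standardize: z = (x - mu) / sigma
+z <- (x - mu) / sigma
+# Looking up z in N(0, 1) gives the same probability as using N(mu, sigma^2) directly
+p_from_z <- pnorm(z)
+p_direct <- pnorm(x, mean = mu, sd = sigma)
+all.equal(p_from_z, p_direct)
+</code>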
 {{:b:head_first_statistics:pasted:20191106-070200.png}}
@@ Line 118: / Line 123: @@
 \begin{eqnarray*}
-\displaystyle\frac {X - 71} {\sqrt(20.25)} & \sim & N(0, 1) \\
+\displaystyle\frac {X - 71} {\sqrt{20.25}} & \sim & N(0, 1) \\
 \displaystyle\frac {X - 71} {4.5} & \sim & N(0, 1)
 \end{eqnarray*}
@@ Line 132: / Line 137: @@
 z & = & \displaystyle \frac {x - \mu}{\sigma} \\
 & = & \frac {64-71} {4.5} \\
-& = & 1.56
+& = & - 1.56
 \end{eqnarray*}
-Therefore, take the standard score 1.56 and look up the area above 1.56 in the standard score table.
+Therefore, take the standard score -1.56 and look up the area above -1.56 in the standard score table.
-<code>> a <- c(1:100)
+<code>
-> scale(a)
+> 1 - pnorm(-1.56)
-              [,1]
+[1] 0.9406201
-  [1,] -1.70622042
+> pnorm(-1.56, lower.tail = FALSE)
-  [2,] -1.67175132
+[1] 0.9406201
-  [3,] -1.63728222
+> pnorm(64, 71, sqrt(20.25), lower.tail = FALSE)
-  [4,] -1.60281312
+[1] 0.9400931
-  [5,] -1.56834402
+>
-  [6,] -1.53387492
+</code>
-  [7,] -1.49940582
-  [8,] -1.46493672
-  [9,] -1.43046762
- [10,] -1.39599852
- [11,] -1.36152943
- [12,] -1.32706033
- [13,] -1.29259123
- [14,] -1.25812213
- [15,] -1.22365303
- [16,] -1.18918393
- [17,] -1.15471483
- [18,] -1.12024573
- [19,] -1.08577663
- [20,] -1.05130753
- [21,] -1.01683843
- [22,] -0.98236933
- [23,] -0.94790023
- [24,] -0.91343113
- [25,] -0.87896203
- [26,] -0.84449293
- [27,] -0.81002384
- [28,] -0.77555474
- [29,] -0.74108564
- [30,] -0.70661654
- [31,] -0.67214744
- [32,] -0.63767834
- [33,] -0.60320924
- [34,] -0.56874014
- [35,] -0.53427104
- [36,] -0.49980194
- [37,] -0.46533284
- [38,] -0.43086374
- [39,] -0.39639464
- [40,] -0.36192554
- [41,] -0.32745644
- [42,] -0.29298734
- [43,] -0.25851825
- [44,] -0.22404915
- [45,] -0.18958005
- [46,] -0.15511095
- [47,] -0.12064185
- [48,] -0.08617275
- [49,] -0.05170365
- [50,] -0.01723455
- [51,]  0.01723455
- [52,]  0.05170365
- [53,]  0.08617275
- [54,]  0.12064185
- [55,]  0.15511095
- [56,]  0.18958005
- [57,]  0.22404915
- [58,]  0.25851825
- [59,]  0.29298734
- [60,]  0.32745644
- [61,]  0.36192554
- [62,]  0.39639464
- [63,]  0.43086374
- [64,]  0.46533284
- [65,]  0.49980194
- [66,]  0.53427104
- [67,]  0.56874014
- [68,]  0.60320924
- [69,]  0.63767834
- [70,]  0.67214744
- [71,]  0.70661654
- [72,]  0.74108564
- [73,]  0.77555474
- [74,]  0.81002384
- [75,]  0.84449293
- [76,]  0.87896203
- [77,]  0.91343113
- [78,]  0.94790023
- [79,]  0.98236933
- [80,]  1.01683843
- [81,]  1.05130753
- [82,]  1.08577663
- [83,]  1.12024573
- [84,]  1.15471483
- [85,]  1.18918393
- [86,]  1.22365303
- [87,]  1.25812213
- [88,]  1.29259123
- [89,]  1.32706033
- [90,]  1.36152943
- [91,]  1.39599852
- [92,]  1.43046762
- [93,]  1.46493672
- [94,]  1.49940582
- [95,]  1.53387492
- [96,]  1.56834402
- [97,]  1.60281312
- [98,]  1.63728222
- [99,]  1.67175132
-[100,]  1.70622042
-attr(,"scaled:center")
-[1] 50.5
-attr(,"scaled:scale")
-[1] 29.01149
-> aa <- scale(a)
-> mean(aa)
-[1] 0
-> sd(aa)
-[1] 1
-> </code>
 ==== exercise ====
@@ Line 288: / Line 189: @@
 ===== Exercise =====
 Julie with 5" heels = 64 + 5 = 69
+Remember X ~ N(71, 20.25)
+mean = 71
+variance = 20.25
+sd = 4.5
+z = (69-71)/4.5
 z score = -0.44
@@ Line 301: / Line 207: @@
 </code>
-====== Headline ======
+<WRAP box>
+rnorm
+<code>
+> set.seed(101)
+> rnorm(5)
+[1] -0.3260365  0.5524619 -0.6749438  0.2143595  0.3107692
+> rnorm(5, mean=0, sd=1)
+[1]  1.1739663  0.6187899 -0.1127343  0.9170283 -0.2232594
+>
+</code>
+<code>
+> set.seed(101)
+> rnorm(5, mean=100, sd=10)
+[1]  96.73964 105.52462  93.25056 102.14359 103.10769
+>
+</code>
+<code>
+> set.seed(101)
+> s1 <- rnorm(100, mean=100, sd=10)
+> s1
+  [1]  96.73964 105.52462  93.25056 102.14359 103.10769 111.73966 106.18790
+  [8]  98.87266 109.17028  97.76741 105.26448  92.05156 114.27756  85.33180
+ [15]  97.63317  98.06662  91.50245 100.58465  91.82330  79.49692  98.36244
+ [22] 107.08522  97.32019  85.36078 107.44436  85.89610 104.67068  98.80680
+ [29] 104.67239 104.98136 108.94937 102.79152 110.07866  79.26894 111.89853
+ [36]  92.75626 101.67984 109.20335  83.28395 104.48469 104.82459 107.58214
+ [43]  76.80673  95.40495  88.94616 104.02928 105.68935  92.93917  97.09909
+ [50]  85.16122  88.49745  97.25529 105.77901  86.03097 107.49058  89.48813
+ [57] 101.65381 111.29809 111.73722  95.72137  97.40198  85.88827  93.58642
+ [64] 101.12458 104.22604 103.86835  93.12202 101.48902  99.42350  99.25177
+ [71] 115.09897 116.19937 111.53158  99.22396  81.81065  89.62555 103.02492
+ [78]  87.22054 101.38339  99.49016 118.52148 111.11675  94.88625  94.56119
+ [85]  82.71073 104.70750 100.05387 113.48046 107.24097 115.52549 113.25470
+ [92]  99.65735  96.38987  92.79835 102.82015  92.09474  95.55095 113.64993
+ [99] 104.97454  91.85604
+>
+> mean(s1)
+[1] 99.62809
+> sd(s1)
+[1] 9.34071
+>
+</code>
+pnorm
+qnorm
+dnorm
+<code>
+> set.seed(101)
+> dnorm(0)
+[1] 0.3989423
+> dnorm(0, mean=0, sd=1)
+[1] 0.3989423
+> dnorm(0, mean=0, sd=5)
+[1] 0.07978846
+>
+</code>
+pnorm
+<code>
+Mean <- 100
+Sd <- 10
+# X grid for non-standard normal distribution
+x <- seq(-4, 4, length = 100) * Sd + Mean
+# Density function
+f <- dnorm(x, Mean, Sd)
+plot(x, f, type = "l", lwd = 2, col = "blue", ylab = "", xlab = "Weight")
+abline(v = Mean) # Vertical line on the mean
+</code>
+{{:b:head_first_statistics:pasted:20221027-222851.png?400}}
+<code>
+# mean: mean of the Normal variable
+# sd: standard deviation of the Normal variable
+# lb: lower bound of the area
+# ub: upper bound of the area
+# acolor: color of the area
+# ...: additional arguments to be passed to lines function
+normal_area <- function(mean = 0, sd = 1, lb, ub, acolor = "lightgray", ...) {
+    x <- seq(mean - 3 * sd, mean + 3 * sd, length = 100)
+    if (missing(lb)) {
+        lb <- min(x)
+    }
+    if (missing(ub)) {
+        ub <- max(x)
+    }
+    x2 <- seq(lb, ub, length = 100)
+    plot(x, dnorm(x, mean, sd), type = "n", ylab = "")
+    y <- dnorm(x2, mean, sd)
+    polygon(c(lb, x2, ub), c(0, y, 0), col = acolor)
+    lines(x, dnorm(x, mean, sd), type = "l", ...)
+}
+</code>
+<code>
+normal_area(mean = 0, sd = 1, lb = -1, ub = 2, lwd = 2)
+</code>
+{{:b:head_first_statistics:pasted:20221027-224243.png?500}}
+<code>
+pnorm(2)
+pnorm(-1)
+pnorm(2)-pnorm(-1)
+ar <- round(pnorm(2)-pnorm(-1),3)
+</code>
+<code>
+> pnorm(2)
+[1] 0.9772499
+> pnorm(-1)
+[1] 0.1586553
+> pnorm(2)-pnorm(-1)
+[1] 0.8185946
+> ar <- round(pnorm(2)-pnorm(-1),3)
+>
+</code>
+<code>
+m.s <- 100
+sd.s <- 15
+lb <- 80
+ub <- 110
+normal_area(mean = m.s, sd = sd.s, lb = lb, ub = ub, lwd = 2)
+ar <- round(pnorm(ub, m.s, sd.s)-pnorm(lb, m.s, sd.s),3)
+text(m.s, .01, ar)
+</code>
+{{:b:head_first_statistics:pasted:20221027-225952.png?500}}
+<code>
+m.s <- 100
+sd.s <- 15
+lb <- m.s - sd.s
+ub <- m.s + sd.s
+normal_area(mean = m.s, sd = sd.s, lb = lb, ub = ub, lwd = 2)
+ar <- round(pnorm(ub, m.s, sd.s)-pnorm(lb, m.s, sd.s),3)
+text(m.s, .01, ar)
+</code>
+</WRAP>
+===== Headline =====
 <WRAP box>
 __The Case of the Missing Parameters__
@@ Line 330: / Line 376: @@
 \end{eqnarray*}
-Solving this gives
+\begin{eqnarray}
+-2.61 \sigma & = & 5-\mu \\
+1.8 \sigma & = & 15-\mu
+\end{eqnarray}
+Subtracting (2) from (1) gives
 \begin{eqnarray*}
-\mu & = & 10.914 \\
+(-2.61 - 1.8) \sigma & = & 5-\mu - 15 + \mu \\
+-4.41 \sigma & = & -10 \\
+\sigma & = & \frac {-10}{-4.41} \\
+& = & 2.267574
+\end{eqnarray*}
+Substituting this into (2) gives
+\begin{eqnarray*}
+1.8 * 2.267574 & = & 15-\mu  \\
+\mu & = & 15 - (1.8 * 2.267574) \\
+& = & 10.91837
+\end{eqnarray*}
+Therefore
+\begin{eqnarray*}
+\mu & = & 10.9184 \\
 \sigma & = & 2.27
 \end{eqnarray*}
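+The two simultaneous equations above can also be solved numerically. The following is a sketch in R, assuming the z values -2.61 and 1.8 used above; the matrix and variable names are illustrative.
+<code>
+# Each row encodes mu + z * sigma = x
+A <- matrix(c(1, -2.61,
+              1,  1.8), nrow = 2, byrow = TRUE)
+b <- c(5, 15)
+sol <- solve(A, b)   # solves A %*% c(mu, sigma) = b
+mu <- sol[1]
+sigma <- sol[2]
+round(c(mu = mu, sigma = sigma), 4)
+</code>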
+====== Using the normal distribution II ======
+\begin{eqnarray*}
+\text{bride} \sim N(150, 400) \\
+\text{groom} \sim N(190, 500)
+\end{eqnarray*}
+For a roller coaster ride: should be under 380 lbs combined bride and groom.
+{{:b:head_first_statistics:pasted:20191113-142020.png}}
+{{:b:head_first_statistics:pasted:20191113-142210.png}}
+{{:b:head_first_statistics:pasted:20191113-142042.png}}
+From the earlier expectation calculations:
+\begin{eqnarray*}
+E(X + Y) & = & E(X) + E(Y) \\
+E(X - Y) & = & E(X) - E(Y) \\
+Var(X + Y) & = & Var(X) + Var(Y) \\
+Var(X - Y) & = & Var(X) + Var(Y)
+\end{eqnarray*}
+{{:b:head_first_statistics:pasted:20191113-141902.png}}
+===== X + Y =====
+\begin{eqnarray*}
+X \sim N(\mu_{X}, \sigma^{2}_{X})  \\
+Y \sim N(\mu_{Y}, \sigma^{2}_{Y})
+\end{eqnarray*}
+\begin{eqnarray*}
+X + Y \sim N(\mu, \sigma^{2})
+\end{eqnarray*}
+\begin{eqnarray*}
+\mu & = & \mu_{X} + \mu_{Y} \\
+\sigma^{2} & = & \sigma^{2}_{X} + \sigma^{2}_{Y}
+\end{eqnarray*}
+That is, the mean of the sum is the men's mean plus the women's mean, and the variance of the sum is the variance of X plus the variance of Y.
+{{:b:head_first_statistics:pasted:20191113-143012.png}}
+That is,
+{{:b:head_first_statistics:pasted:20191113-142020.png}}
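+A quick simulation can make this rule plausible. The sketch below assumes the bride and groom distributions given above and independent draws; the sample size and seed are arbitrary choices.
+<code>
+set.seed(1)
+n <- 100000
+bride <- rnorm(n, mean = 150, sd = sqrt(400))
+groom <- rnorm(n, mean = 190, sd = sqrt(500))
+total <- bride + groom
+mean(total)        # should be close to 150 + 190 = 340
+var(total)         # should be close to 400 + 500 = 900
+mean(total < 380)  # should be close to pnorm(380, 340, sqrt(900))
+</code>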
+===== X - Y =====
+\begin{eqnarray*}
+X \sim N(\mu_{X}, \sigma^{2}_{X})  \\
+Y \sim N(\mu_{Y}, \sigma^{2}_{Y})
+\end{eqnarray*}
+\begin{eqnarray*}
+X - Y \sim N(\mu, \sigma^{2})
+\end{eqnarray*}
+\begin{eqnarray*}
+\mu & = & \mu_{X} - \mu_{Y} \\
+\sigma^{2} & = & \sigma^{2}_{X} + \sigma^{2}_{Y}
+\end{eqnarray*}
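+That is, the mean of the X - Y distribution equals the difference between the men's and women's means, and its variance equals the variance of X plus the variance of Y.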
+{{:b:head_first_statistics:pasted:20191113-143619.png}}
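+One way to see why the variances add even for a difference is to treat $X - Y$ as $X + (-1)Y$ and use $Var(aY) = a^{2}Var(Y)$ (this assumes X and Y are independent):
+\begin{eqnarray*}
+Var(X - Y) & = & Var(X) + Var(-Y) \\
+& = & Var(X) + (-1)^{2} Var(Y) \\
+& = & Var(X) + Var(Y)
+\end{eqnarray*}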
+<WRAP box>
+Find the probability that the combined weight of the bride and groom is less than 380 pounds using the following three steps.
+Given X ~ N(150, 400) and Y ~ N(190, 500),
+$$X + Y \sim N(340, 900)$$
+The combined weight in question is 380, so
+\begin{eqnarray*}
+z & = & \frac {(X + Y) - \mu}{ \sigma } \\
+& = & \frac {380 - 340}{ 30 } \\
+& = & \frac {40}{30} \\
+& = & 1.333
+\end{eqnarray*}
+</WRAP>
+To find P(X + Y < 380), consult the [[:z-table]] or use R.
+<code>> pnorm(1.333)
+[1] 0.9082409
+</code>
+<WRAP info 70%>
+pnorm in R: the cumulative probability (<fc #ff0000>**P**</fc>ercentage) corresponding to a standard score
+<code>
+> pnorm(1.333)
+[1] 0.9082409
+</code>
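+Note that R can do the standardization for you, so the value can also be obtained as follows.
+<code>> pnorm(380, 340, sqrt(900))
+[1] 0.9087888
+# 900 is the variance, so pass its square root as sd.
+# The arguments of pnorm are (q, mean = 0, sd = 1).
+# That is, for the given mean and sd (defaults 0 and 1, the standard normal),
+# pnorm returns the cumulative probability to the left of q.
+</code>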
+<code>
+dnorm(x, mean = 0, sd = 1, log = FALSE)
+pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
+qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
+rnorm(n, mean = 0, sd = 1)
+</code>
+[{{  :b:head_first_statistics:pasted:20201204-175705.png  }}]
+</WRAP>
+Therefore
+$$P(X + Y < 380) = 0.9082409 $$
+===== exercise =====
+<WRAP info 60%>
+Julie’s matchmaker is at it again. What's the **probability that a man will be at least 5 inches taller than a woman**? In Statsville, the height of men in inches is distributed as N(71, 20.25), and the height of women in inches is distributed as N(64, 16).
+</WRAP>
+\begin{eqnarray*}
+M & \sim & N(71, 20.25) \\
+F & \sim & N(64, 16)
+\end{eqnarray*}
+**probability that a man will be at least 5 inches taller than a woman**? = "probability that a man will be at least 5 inches taller than (an average) woman", so the task is to find $P(M > F + 5)$.
+\begin{align*}
+P(M > F + 5) & = P(M - F > 5)
+\end{align*}
+Now,
+\begin{eqnarray*}
+M - F & \sim & N(\mu_{m} - \mu_{f}, \sigma^{2}_{m} + \sigma^{2}_{f}) \\
+      & \sim & N(7, 36.25)
+\end{eqnarray*}
+That is, the difference between a man's and a woman's height follows $M - F \sim N(7, 36.25)$.
+Here $\sigma = \sqrt{\sigma^2} = \sqrt{36.25} \approx 6.02$.
+The standard score at M - F = 5 is
+\begin{eqnarray*}
+z & = & \frac {(M-F)-\mu}{\sigma} \\
+& = & \frac {5-7}{6.02} \\
+& = & -0.3322259
+\end{eqnarray*}
+Therefore the answer is ''%%1 - pnorm(-0.33)%%'', i.e.
+<code>
+> 1 - pnorm(-0.3322259, 0, 1)
+[1] 0.6301407
+# or compute it directly as below.
+> 1 - pnorm(5, 7, sqrt(36.25))
+[1] 0.6301241
+# Instead of "1 -", lower.tail = FALSE can also be used.
+> pnorm(5, 7, sqrt(36.25), lower.tail = FALSE)
+[1] 0.6301241
+</code>
+===== Linear Transform =====
+<WRAP alert 60%>
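+The load capacity of the four-person roller coaster is said to be 800 lbs. Suppose the mean weight of people in Statsville is 180 and the variance is 625. What is the probability that the combined weight of four people is less than 800 lbs?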
+</WRAP>
+**The distribution of 4X** is actually **a linear transform of X**. It’s a transformation of X in the form aX + b, where a is equal to 4, and b is equal to 0. This is exactly the same sort of transform as we encountered earlier with discrete probability distributions. Linear transforms describe **<fc #ff0000>underlying changes to the size of the values in the probability distribution</fc>**. This means that 4X actually describes the weight of an individual adult whose weight has been multiplied by 4.
+{{:b:head_first_statistics:pasted:20191114-075704.png}}
+That is, the transformation describes <fc #ff0000>a single individual whose weight becomes 4 times as large</fc>; it does not mean simply adding up four people. In the former case, the distribution is as follows.
+{{:b:head_first_statistics:pasted:20191114-072427.png}}
+===== Independent Observation  =====
+Rather than transforming the weight of each adult, what we really need to figure out is <fc #ff0000>the probability distribution for the combined weight of four separate adults</fc>. In other words, we need to work out <fc #ff0000>the probability distribution of four independent observations of X</fc>.
+{{:b:head_first_statistics:pasted:20191114-075914.png}}
+That is, individuals are added one at a time until there are four. It is not one individual's weight becoming 4X (X = weight).
+{{:b:head_first_statistics:pasted:20191114-080104.png}}
+Each individual's weight is an independent observation. In this case the distribution is as follows.
+{{:b:head_first_statistics:pasted:20191114-080220.png}}
+<WRAP info 60%>
+Q: So what’s the difference between linear transforms and independent observations?
+A: Linear transforms affect the underlying values in your probability distribution. As an example, if you have a length of rope of a particular length, then applying a linear transform affects the length of the rope. Independent observations have to do with the quantity of things you’re dealing with. As an example, if you have n independent observations of a piece of rope, then you’re talking about n pieces of rope. In general, __if the quantity changes__, you’re dealing with **independent observations**. __If the underlying values change__, then you’re dealing with a **transform**.
+Q: Do I really have to know which is which? What difference does it make?
+A: You have to know which is which because it makes a difference in your probability calculations. You calculate the expectation for linear transforms and independent observations in the same way, but there’s a big difference in the way the variance is calculated. If you have n independent observations then the variance is n times the original. If you transform your probability distribution as aX + b, then your variance becomes $a^{2}$ times the original.
+Q: Can I have both independent observations and linear transforms in the same probability distribution?
+A: Yes you can. To work out the probability distribution, just follow the basic rules for calculating expectation and variance. You use the same rules for both discrete and continuous probability distributions.
+</WRAP>
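+The difference in variance can be checked by simulation. This is a sketch assuming $X \sim N(180, 625)$ from the problem above; the seed and sample size are arbitrary.
+<code>
+set.seed(1)
+n <- 100000
+# Linear transform: one adult's weight multiplied by 4
+x <- rnorm(n, mean = 180, sd = sqrt(625))
+transformed <- 4 * x
+# Independent observations: four separate adults added together
+four_adults <- rnorm(n, 180, 25) + rnorm(n, 180, 25) + rnorm(n, 180, 25) + rnorm(n, 180, 25)
+c(mean(transformed), mean(four_adults))   # both close to 4 * 180 = 720
+var(transformed)                          # close to 4^2 * 625 = 10000
+var(four_adults)                          # close to 4 * 625 = 2500
+</code>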
+So the answer to the earlier problem is:
+The combined weight can be written as $X_{1} + X_{2} + X_{3} + X_{4} \sim N(720, 2500)$.
+To find $P(X_{1} + X_{2} + X_{3} + X_{4} < 800)$, compute the standard score for 800 and then find the cumulative probability.
+\begin{eqnarray*}
+z & = & \frac {x-\mu}{\sigma} \\
+& = & \frac {800-720}{50} \\
+& = & \frac {80}{50} \\
+& = & 1.6
+\end{eqnarray*}
+Therefore the answer is pnorm(1.6), which is 0.9452007.
+<code>
+> pnorm(1.6)
+[1] 0.9452007
+# or
+> pnorm(800, 720, sqrt(2500), lower.tail = TRUE)
+[1] 0.9452007
+</code>
+===== Swivel chair again =====
+{{:b:head_first_statistics:pasted:20191114-081620.png}}
+Before going further:
+<WRAP alert 60%>
+So what’s the probability of getting 30 or more questions right out of 40? That will help us determine whether to keep playing, or walk away.
+</WRAP>
+<WRAP info 60%>
+There are 40 questions, which means there are 40 trials.
+The outcome of each trial can be **a success or failure**, and we want to find the probability of getting a certain number of successes.
+In order to do this, we need to use **the binomial distribution**. We use **n = 40**, and as each question has four possible answers, **p is 1/4** or 0.25.
+If X is the number of questions we get right, then we **want to find P(X ≥ 30)**.
+This means we have to calculate and add together **the probabilities for P(X = 30) up to P(X = 40)**.
+We can find the mean and variance using n, p and q, where q = 1 - p. The mean is equal to np, and the variance is equal to npq. This gives us a **mean of 40 x 0.25 = 10**, and a variance of **40 x 0.25 x 0.75 = 7.5**.
+</WRAP>
+<WRAP center info>
+<code>
+> pbinom(29,40, 1/4, lower.tail = F)
+[1] 4.630881e-11
+</code>
+In R, the following functions are used to work with the binomial distribution: ''%%dbinom, pbinom, qbinom, rbinom%%''
+<code>
+# When X ~ B(10, 0.2),
+# what is the probability that X is 2? That is, P(X = 2)?
+dbinom(2, 10, 0.2)
+k <- c(0:50) # or
+k <- seq(0, 50, 1)
+b <- dbinom(k, 50, 0.2)
+plot(k,b, type = "l")
+b <- dbinom(k, 50, 0.6)
+plot(k,b, type = "l")
+k <- seq(0, 100, 1)
+b <- dbinom(k, 100, 0.6)
+plot(k, b, type = "l")
+</code>
+<code>
+# When X ~ B(10, 0.2),
+# what is the probability that X is 2? That is, P(X = 2)?
+dbinom(2, 10, 0.2)
+k <- c(0:50) # or
+k <- seq(0, 50, 1)
+b <- dbinom(k, 50, 0.2)
+plot(k, b, type = "p")
+b <- dbinom(k, 50, 0.6)
+plot(k, b, type = "p")
+k <- seq(0, 100, 1)
+b <- dbinom(k, 100, 0.6)
+plot(k, b, type = "p")
+</code>
+Solving the problem in the same way:
+<code>
+k <- seq(30, 40, 1)
+k
+b <- dbinom(k, 40, 0.25)
+b
+sum(b)
+</code>
+<code>
+> k <- seq(30, 40, 1)
+> k
+ [1] 30 31 32 33 34 35 36 37 38 39 40
+> b <- dbinom(k, 40, 0.25)
+> b
+ [1] 4.140329e-11 4.451967e-12 4.173719e-13 3.372702e-14 2.314599e-15 1.322628e-16 6.123279e-18 2.206587e-19
+ [9] 5.806808e-21 9.926167e-23 8.271806e-25
+> sum(b)
+[1] 4.630881e-11
+>
+</code>
+</WRAP>
+{{:b:head_first_statistics:pasted:20191114-082244.png}}
+===== Normal distribution to the rescue =====
+{{:b:head_first_statistics:pasted:20191114-082458.png}}
+<code>
+n20 <- 20
+n5 <- 5
+p1 <- .1
+p5 <- .5
+x <- c(0:30)
+a <- dbinom(x, n5, p1)
+c <- dbinom(x, n20, p1)
+b <- dbinom(x, n5, p5)
+d <- dbinom(x, n20, p5)
+par(mfcol=c(2,2))
+barplot(a, names.arg=x, main="n=5, p=0.1")
+barplot(c, names.arg=x, main="n=20, p=0.1")
+barplot(b, names.arg=x, main="n=5, p=0.5")
+barplot(d, names.arg=x, main="n=20, p=0.5")
+par(mfcol=c(1,1))
+</code>
+{{:b:head_first_statistics:pasted:20191118-232847.png}}
+{{:b:head_first_statistics:pasted:20191118-231159.png}}
+{{:b:head_first_statistics:pasted:20191118-231510.png}}
+===== When to approximate the binomial distribution with the normal =====
+We saw in the last exercise that the binomial distribution looks very similar to the normal distribution where p is **around 0.5**, and n is **around 20**. As a general rule, **you can use the normal distribution to approximate the binomial when np and nq are both greater than 5**.
+<fc #ff0000>**When both np and nq are greater than 5**</fc>
+In that case the approximating normal distribution is N(np, npq).
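+The rule of thumb can be written as a small check in R. The function name ''%%can_use_normal%%'' is hypothetical, introduced here only for illustration.
+<code>
+# TRUE when both np and nq exceed 5 (q = 1 - p)
+can_use_normal <- function(n, p) {
+    q <- 1 - p
+    (n * p > 5) && (n * q > 5)
+}
+can_use_normal(12, 0.5)    # np = 6, nq = 6
+can_use_normal(40, 0.25)   # np = 10, nq = 30
+can_use_normal(20, 0.1)    # np = 2, nq = 18
+</code>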
+{{:b:head_first_statistics:pasted:20191118-111847.png}}
+{{:b:head_first_statistics:pasted:20191118-111749.png}}
+<WRAP help 60%>
+Before we use the normal distribution for the full 40 questions for Who Wants To Win A Swivel Chair, let’s tackle a simpler problem to make sure it works. Let’s try finding the probability that we get 5 or fewer questions correct out of 12, where there are only two possible choices for each question.
+Let’s start off by working this out using the binomial distribution. Use the binomial distribution to find P(X < 6) where X ~ B(12, 0.5).
+</WRAP>
+Since we want P(X < 6), compute P(X=0), P(X=1), P(X=2), P(X=3), P(X=4), and P(X=5), and add them up.
+{{:b:head_first_statistics:pasted:20191118-095610.png}}
+{{:b:head_first_statistics:pasted:20191118-095652.png}}
+<WRAP info 60%>
+Computing this with R:
+<code>
+pbinom(5, 12, 1/2)
+</code>
+<code>
+> pbinom(5, 12, 1/2)
+[1] 0.387207
+</code>
+Understand that the above is the same as the following:
+<code>
+> sum(dbinom(c(0:5),12,1/2))
+[1] 0.387207
+>
+</code>
+</WRAP>
+To compute this with the normal distribution, as the textbook suggests:
+\begin{eqnarray*}
+X & \sim & B(12, 1/2) \\
+n & = & 12 \\
+p & = & 1/2 \\
+q & = & 1/2
+\end{eqnarray*}
+and since $np$ and $nq$ are both greater than 5, the normal approximation can be used. X should therefore follow $X \sim N(np, npq)$.
+That is, we need P(X < 6) when $X \sim N(6, 3)$.
+\begin{eqnarray*}
+z & = & \frac {(6 - 6)}{\sqrt{3}} \\
+& = & 0
+\end{eqnarray*}
+The corresponding probability is $P(X < 6) = 0.5$. This differs from the value computed earlier with the binomial distribution: $0.387 \ne 0.5$. The reason is . . . .
+===== Revisiting Normal Approximation =====
+{{:b:head_first_statistics:pasted:20191118-104800.png}}
+{{:b:head_first_statistics:pasted:20191118-102001.png}}
+{{:b:head_first_statistics:pasted:20191118-102042.png}}
+===== Apply a continuity correction =====
+So, given $X \sim N(6, 3)$, what is $P(X < 5.5)$?
+\begin{eqnarray*}
+z & = & \frac {(5.5-6)}{\sqrt{3}} \\
+& = & - 0.29
+\end{eqnarray*}
+<code>
+> pnorm(-0.29)
+[1] 0.3859081
+# the same calculation done directly, without rounding z
+> n <- 12
+> p <- 1/2
+> q <- 1-p
+> pnorm(5.5, n*p, sqrt(n*p*q))
+[1] 0.386415
+>
+</code>
+This value is close to the 0.387 obtained above.
+<WRAP info 60%>
+  * In particular circumstances you can **use the normal distribution to approximate the binomial**. If X ~ B(n, p) and np > 5 and nq > 5 then you can approximate X using X ~ N(np, npq)
+  * If you’re approximating the binomial distribution with the normal distribution, then you need to **<fc #ff0000>apply a continuity correction</fc>** to make sure your results are accurate.
+</WRAP>
+{{:b:head_first_statistics:pasted:20191118-103328.png}}
+<WRAP info 70%>
+Q: Does it really save time to approximate the binomial distribution with the normal?
+A: It can save a lot of time. Calculating binomial probabilities can be time-consuming because you generally have to work out the probability of lots of different values. You have no way of simply calculating binomial probabilities over a range of values. If you approximate the binomial distribution with the normal distribution, then it’s a lot quicker. You can look probabilities up in standard tables and also deal with whole ranges at once.
+Q: So is it really accurate?
+A: Yes, it’s accurate enough for most purposes. The key thing to remember is that you need to apply a continuity correction. If you don’t, then your results will be less accurate.
+Q: What about continuity corrections for < and >? Do I treat those the same way as the ones for ≤ and ≥?
+A: There’s a difference, and it all comes down to which values you want to include and exclude. When you’re working out probabilities using ≤ and ≥, you need to make sure that you include the value in the inequality in your probability range. So if, say, you need to work out P(X ≤ 10), you need to make sure your probability includes the value 10. This means you need to consider P(X < 10.5). When you’re working out probabilities using < or >, you need to make sure that you exclude the value in the inequality from your probability range. This means that if you need to work out P(X < 10), you need to make sure that your probability excludes 10. You need to consider P(X < 9.5).
+Q: You can approximate the binomial distribution with both the normal and Poisson distributions. Which should I use?
+A: It all depends on your circumstances. If X ~ B(n, p), then you can use the normal distribution to approximate the binomial distribution if np > 5 and nq > 5. You can use the Poisson distribution to approximate the binomial distribution if n > 50 and p < 0.1.
+Remember, you need to apply a <fc #ff0000>**continuity correction**</fc> when you approximate the binomial distribution with the normal distribution.
+</WRAP>
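+The ≤ versus < cases can be compared against exact binomial values in R. The sketch below assumes $X \sim B(12, 0.5)$ from the earlier example; the variable names are illustrative.
+<code>
+n <- 12
+p <- 0.5
+mu <- n * p
+sigma <- sqrt(n * p * (1 - p))
+# P(X <= 5): include 5, so use 5.5
+exact_le <- pbinom(5, n, p)
+approx_le <- pnorm(5.5, mu, sigma)
+# P(X < 5): exclude 5, so use 4.5
+exact_lt <- pbinom(4, n, p)
+approx_lt <- pnorm(4.5, mu, sigma)
+round(c(exact_le, approx_le, exact_lt, approx_lt), 4)
+</code>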
+===== Pool Puzzle =====
+<wrap #continuity_correction_egs />
+<WRAP help 60%>
+X < 3  ----  <wrap spoiler> X < 2.5 </wrap>
+X > 3  ----  <wrap spoiler> X > 3.5 </wrap>
+X ≤ 3  ----  <wrap spoiler> X < 3.5 </wrap>
+X ≥ 3  ----  <wrap spoiler> X > 2.5 </wrap>
+3 ≤ X < 10   ----  <wrap spoiler> 2.5 < X < 9.5 </wrap>
+X = 0  ----  <wrap spoiler> -0.5 < X < 0.5 </wrap>
+3 ≤ X ≤ 10  ----  <wrap spoiler> 2.5 < X < 10.5 </wrap>
+3 < X ≤ 10  ----  <wrap spoiler> 3.5 < X < 10.5 </wrap>
+X > 0  ----  <wrap spoiler> X > 0.5 </wrap>
+3 < X < 10  ----  <wrap spoiler> 3.5 < X < 9.5 </wrap>
+</WRAP>
+===== exercise =====
+<WRAP help 60%>
+What’s the probability of you winning the jackpot on today’s edition of Who Wants to Win a Swivel Chair? See if you can find the probability of getting at least 30 questions correct out of 40, where each question has a choice of 4 possible answers.
+</WRAP>
+The task is to find $P(X \ge 30)$ for $X \sim B(40, 1/4)$.
+$np = 40 * 1/4 = 10$
+$npq = 40 * 1/4 * 3/4 = 10 * 3/4 = 7.5$
+So with X ~ N(np, npq), we have N(10, 7.5) . . . .
+It is unclear why the textbook says X ~ N(10, 30). . . .
+\begin{eqnarray*}
+z & = & \frac {X - \mu}{\sigma} \\
+& = & \frac {29.5 - 10}{\sqrt{7.5}} \\
+& = & 7.120393
+\end{eqnarray*}
+A z score of 7.120393 cannot be looked up in the [[:z-table]]. . . .
+<code>
+> pnorm(29.5, 10, sqrt(7.5))
+[1] 1
+> 1 - pnorm(29.5, 10, sqrt(7.5))
+[1] 5.381251e-13
+>
+</code>
+That is, the probability is close to 0.
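+For comparison, the exact binomial tail computed earlier on this page can be put next to the normal approximation. Both calls appear above; only the variable names are new. The two values differ by orders of magnitude in relative terms, which suggests the approximation is rough this far out in the tail, although both are practically 0.
+<code>
+exact <- pbinom(29, 40, 1/4, lower.tail = FALSE)           # P(X >= 30) for B(40, 1/4)
+approx <- pnorm(29.5, 10, sqrt(7.5), lower.tail = FALSE)   # normal approximation with continuity correction
+c(exact = exact, approx = approx)
+exact / approx   # ratio of the two tail probabilities
+</code>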
+===== All aboard the Love Train =====
+{{:b:head_first_statistics:pasted:20191118-113020.png}}
+When $\lambda > 15$, the Poisson distribution $X \sim Po(\lambda)$ behaves like $X \sim N(\lambda, \lambda)$.
+Example:
+<code>
+par(mfcol=c(2,2))
+barplot(dpois(c(1:10),.25), main = "lambda \n =0.25")
+barplot(dpois(c(1:20),5), main = "lambda \n = 5")
+barplot(dpois(c(1:10),1), main = "lambda \n = 1")
+barplot(dpois(c(1:40),20), main = "lambda \n = 20")
+par(mfcol=c(1,1))
+</code>
+{{:b:head_first_statistics:pasted:20191120-230151.png}}
+<WRAP help 60%>
+Dexter’s found some statistics on the Internet about the model of roller coaster he’s been trying out, and according to one site, you can expect the ride to break down 40 times a year.
+Given the huge profit the Love Train is bound to make, Dexter thinks that it’s still worth going ahead with the ride if there’s a high probability of it __breaking down__ **less than 52 times a year**. So how do we work out that probability?
+What sort of probability distribution does this follow? How would you work out the probability of the ride breaking down less than 52 times in a year?
+</WRAP>
+The task is to find $P(X < 52)$ when $X \sim Po(\lambda = 40)$.
+The normal approximation for this situation is $X \sim N(\lambda, \lambda)$.
+Since $X \sim Po(40)$,
+we find $P(X < 52)$ under $X \sim N(40, 40)$.
+With the continuity correction, X < 52 becomes X < 51.5, and we use $\mu = 40$, $\sigma = \sqrt{40}$.
+\begin{eqnarray*}
+z & = & \frac {51.5 - 40}{\sqrt{40}} \\
+& = & 1.82
+\end{eqnarray*}
+<code>
+> 11.5/sqrt(40)
+[1] 1.81831
+> pnorm(1.82)
+[1] 0.9656205
+# Or
+> pnorm(1.81831)
+[1] 0.9654916
+# Or
+> pnorm(51.5, 40, sqrt(40))
+[1] 0.9654916
+</code>
+$0.9654916 \approx 0.9656205$
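+The approximation can be compared with the exact Poisson probability using ''%%ppois%%''. This is a sketch; since X is a count, $P(X < 52) = P(X \le 51)$.
+<code>
+lambda <- 40
+exact <- ppois(51, lambda)                   # exact P(X <= 51)
+approx <- pnorm(51.5, lambda, sqrt(lambda))  # normal approximation with continuity correction
+c(exact = exact, approx = approx)
+abs(exact - approx)
+</code>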
+===== Check up =====
+^ Situation  ^ Distribution  ^ Condition  ^
+| $X + Y$ \\  $\text{when}$ \\ $X \sim N(\mu^{\;}_{X}, \sigma^{2}_{X})$ \\ $Y\sim N(\mu^{  }_{Y}, \sigma^{2}_{Y}) \qquad\qquad $  |    |    |
+| $X - Y$ \\  $\text{when}$ \\ $X \sim N(\mu^{\;}_{X}, \sigma^{2}_{X})$ \\ $Y\sim N(\mu^{  }_{Y}, \sigma^{2}_{Y})$  |   |   |
+| $aX + b$ \\  $\text{when}$ \\ $X \sim N(\mu_{X}, \sigma^{2}_{X})$  | $\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad$   |$\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad$   |
+| $X_{1} + X_{2} + . . . X_{n}$ \\ \\ $X \sim N(\mu, \sigma^{2})$    | $\qquad\qquad\qquad\qquad$   |$\qquad\qquad\qquad\qquad$   |
+| $\text{Normal approximation of X} $ \\ \\ $X \sim B(n, p)$         | $\qquad\qquad\qquad\qquad$   |$\qquad\qquad\qquad\qquad$   |
+| $\text{Normal approximation of X} $ \\ \\ $X \sim Po(\lambda)$     | $\qquad\qquad\qquad\qquad$   |$\qquad\qquad\qquad\qquad\;\;$   |
+^ Situation  ^ Distribution  ^ Condition  ^
+| $X + Y$ \\  $\text{when}$ \\ $X \sim N(\mu^{\;}_{X}, \sigma^{2}_{X})$ \\ $Y\sim N(\mu^{  }_{Y}, \sigma^{2}_{Y}) \qquad\qquad $  | $X + Y \sim N(\mu_{X} + \mu_{Y}, \sigma^{2}_{X} + \sigma^{2}_{Y}) \qquad\;\; $ | X, Y events are independent $\qquad$ |
+| $X - Y$ \\  $\text{when}$ \\ $X \sim N(\mu^{\;}_{X}, \sigma^{2}_{X})$ \\ $Y\sim N(\mu^{  }_{Y}, \sigma^{2}_{Y})$  | $X - Y \sim N(\mu_{X} - \mu_{Y}, \sigma^{2}_{X} + \sigma^{2}_{Y}) $  | X, Y events are independent  |
+| $aX + b$ \\  $\text{when}$ \\ $X \sim N(\mu_{X}, \sigma^{2}_{X})$  | $aX + b \sim N \left(a\mu_{X} + b, a^{2}\sigma^{2}_{X}\right)$  | a, b are constant.  |
+| $X_{1} + X_{2} + . . . X_{n}$ \\ \\ $X \sim N(\mu, \sigma^{2})$  | $X_{1} + X_{2} + . . . X_{n} \sim N(n\mu, n\sigma^2)$   | $X_{1} + X_{2} + . . . X_{n}$ are independent \\ observations of X   |
+| $\text{Normal approximation of X} $ \\ \\ $X \sim B \left(n, p \right)$  | $X \sim N \left(np, npq\right)$  | $\text{when }\quad np > 5, \; nq > 5$ \\ continuity correction required  |
+| $\text{Normal approximation of X} $ \\ \\ $X \sim Po \left(\lambda\right)$  | $X \sim N (\lambda, \lambda)$   | $\text{when }\quad \lambda > 15$  \\ continuity correction required $\qquad\qquad\;\;\;$   |