Differences

This shows you the differences between two versions of the page.

--- standard_error [2016/10/17 09:23] – hkimscil
+++ standard_error [2020/05/17 13:44] (current) – hkimscil
@@ Line 1: / Line 1: @@
-Standard error (표준오차) is short for the standard deviation of sample means, or the standard deviation of the sample means distribution. The term is therefore easy to understand once you grasp the properties of the standard deviation and of the sample means distribution.
-When we collect the means of samples of a given size, ''n'', and look at how they are distributed, the result is called the sample means distribution or sampling distribution (the distribution of sample means); it is described in detail in [[Sampling Distribution]].
-<WRAP box>
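-Secretary-General Ban rose in the Yeongnam and Honam regions, among people in their 40s and 50s, among independents, and among moderates, but fell sharply in the capital area, especially Seoul, and also declined among people in their 30s and those 60 and older, Saenuri Party supporters, and both conservatives and progressives.
-In daily figures, Ban recorded 22.0% on Monday the 10th, down 1.5%p from the previous week's weekly tally, rebounded to 24.8% on Tuesday the 11th, and slipped again to 23.8% on Wednesday the 12th.
-. . . .
-In daily figures, former party leader Moon started at 21.4% on Monday the 10th, up 3.5%p from the previous week's weekly tally, following his weekend remark about a 'provisional halt to the THAAD deployment process'; he fell to 18.9% on Tuesday the 11th and rose again to 20.5% on Wednesday the 12th.
-. . . .
-This midweek tally surveyed **1,509** eligible voters aged 19 and older nationwide over the three days from October 10 to 12, 2016, using a mix of mobile phone interviews (17%), a smartphone app (38%), and mobile (25%) and landline (20%) automated response, with random digit dialing (RDD) across mobile (80%) and landline (20%) numbers and random digit smartphone-pushing (RDSP). The response rate was 17.2% for phone interviews, 42.6% for the smartphone app, and 5.7% for automated response, or 10.3% overall (1,509 completed responses out of 14,650 call attempts). Results were weighted by sex, age, and region based on the Ministry of the Interior's resident registration population statistics as of the end of June 2016, and **the sampling error is ±2.5%p at the 95% confidence level**. The daily tallies use a two-day rolling time-series covering 1,013 respondents on the 10th, 1,005 on the 11th, and 1,006 on the 12th, with response rates of 10.4%, 10.2%, and 10.3% respectively; the sampling error is ±3.1%p at the 95% confidence level on all three days. The daily tallies are weighted in the same way as the midweek tally.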
-</WRAP>
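The sampling errors quoted in the box can be checked with a short calculation. The following is a minimal sketch, assuming the pollster used the conservative worst case p = 0.5 and z = 1.96 for the 95% confidence level (the excerpt states neither); `margin_of_error` is a name made up for this illustration.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a z-based confidence interval for a proportion.

    p=0.5 and z=1.96 are assumptions, not values given in the poll report.
    """
    return z * math.sqrt(p * (1 - p) / n)

weekly = margin_of_error(1509)  # midweek tally, n = 1,509
daily = margin_of_error(1005)   # one of the daily tallies, n = 1,005

print(f"n=1509: +/-{weekly * 100:.1f}%p")
print(f"n=1005: +/-{daily * 100:.1f}%p")
```

Under these assumptions the two results round to ±2.5%p and ±3.1%p, which agrees with the figures in the report.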
-====== Standard deviation again ======
-Before talking about the standard error, let's take a brief look at the concepts of the normal curve and the standard deviation.
-Normal curve and standard deviation:
-\begin{equation}
-\displaystyle P(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} * e^{\frac{-(x - \mu)^2}{2 \sigma^2}}
-\end{equation}
-see [[:normal distribution]]
-<code>Please ignore the next two lines
-P(x) = 1 / sqrt(2 * pi * sigma^2) * e^(-(x - mu)^2 / (2 * sigma^2))
-P(x) = 1 / sqrt(2 * pi * (3^2))  *  (2.718 ^ (-((x-68)^2) / (2*(3^2))))
-. . . . dplot formula for distrib. curve with mu=68, sd=3
-</code>
-This is the mathematical formula for a normal curve. It isn't quite easy to understand, is it? Still, I want to briefly go over what the signs and symbols in this formula mean. First, some symbols, such as $\pi$ and $e$, are numbers. You probably know what $\pi$ means (it is about 3.1416), and $e$ is about 2.718. Even though they look difficult to read, they are just numbers like the "2" found in the same formula. The other kind is $\mu$ and $\sigma$, which represent the mean and the standard deviation. So, if we assume a normally distributed sample whose mean ( $\mu$ ) is 0 and standard deviation ( $\sigma$ ) is 1 and draw a graph, it is going to look like the highest curve in the graph below. Now, remember that we discussed the standard deviation? The formula is (this time, I will show the formula for the variance, $s^2$ ):
-$ \displaystyle s^2 = \frac {\displaystyle  \sum_{i=1}^n (X_i-\overline{X})^2}{n-1} $
-The point was that if the upper part of the formula (the sigma thing) gets big, the variance -- hence, the stdev -- will be big. For the upper part to be big, the individual sample units ( $X_i$ ) should vary a lot from the sample mean ( $\overline{X}$ ). If the sample units vary a lot from the mean, the distribution graph should be widely spread.
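To make the point about the numerator concrete, here is a small sketch that computes the sample variance by hand for two made-up data sets with the same mean. The data values and the helper name `sample_variance` are assumptions for illustration only; the result is cross-checked against Python's `statistics.variance`, which also divides by n - 1.

```python
import math
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

tight = [9, 10, 10, 11]  # hypothetical values close to their mean of 10
wide = [2, 8, 12, 18]    # hypothetical values far from the same mean of 10

for name, xs in (("tight", tight), ("wide", wide)):
    var = sample_variance(xs)
    print(name, round(var, 3), round(math.sqrt(var), 3))
```

The widely spread data set has the larger sum of squared deviations and therefore the larger variance and standard deviation.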
-Now, going back to the normal curve formula, what if the standard deviation is 2 and the mean is 0? The corresponding graph is the second-highest one in the graphs below. I also put in graphs whose stdev values are 3 and 4 -- the mean is 0 in all cases. As you see, the larger the stdev value, the more spread out the corresponding graph is.
-{{:std-err.jpg}}
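The same comparison can be made numerically. The sketch below evaluates the normal curve formula at the mean and at x = 3 for stdev values 1 through 4; `normal_pdf` is a helper name chosen for this illustration and is a direct transcription of the formula, not code from the original page.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x for the given mean and stdev."""
    coefficient = 1 / math.sqrt(2 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Height at the center (x = 0) and in the tail (x = 3) for each stdev.
for sigma in (1, 2, 3, 4):
    print(sigma, round(normal_pdf(0, 0, sigma), 4), round(normal_pdf(3, 0, sigma), 4))
```

As the stdev grows, the peak gets lower while the height out in the tail gets larger, which is what "more spread out" means.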
-So, if we draw a normal distribution curve whose mean is 56.88 and stdev is 15.67, the graph should look like the one below.
-{{quiz-distrib.jpg}}
 ====== Standard Error ======
+For a variable measured numerically, the standard error is an index of how far the mean of a sample drawn from a population is expected to fall from the population mean (the random error around the mean); it is the standard deviation of the set of sample means ([[:Sampling distribution]]).
-Now let's talk about the standard error. Before taking a look at the formula, let's clarify the term. Your immediate response to it may be: error of what? It is the error in estimating the population parameter -- you know the difference between a parameter and a statistic, right? The term is slightly out of focus, though it describes what it does very well. The formal name for the standard error should be the standard deviation of the sampling distribution (here we are, back to the concept of standard deviation again).
+For a variable measured in categories, use the method below.
+[[:c/mrm/standard_error#퍼센티지에서의_표준오차]]
-There are two kinds of standard errors you need to know.
-  * The first one is the standard deviation of the sampling distribution of $ \hat{p}$ , or "p hat" -- a proportion (probability).
-  * The second one is the standard deviation of the sampling distribution of the mean. The underlying concepts are the same. I will focus on the former first -- the standard deviation of the sampling distribution of a proportion.
-Suppose you are an employee at Gallup. Public opinion polls are conducted to estimate the fraction of US citizens who trust the president. You are assigned to this job and are responsible for the report. What should you do to satisfy the firm? The general idea is: (1) you take a sample; (2) examine how many of them trust the president; and (3) report it. Generally, that seems OK. But when your superior asks whether the result represents the whole population, what are you going to say? Hmm, it looks like more work. So you refine the steps: (1) take a sample that is representative of the US population; (2) examine how many of them trust the president; (3) guess, given the result in this particular sample, what the result would be for the whole population (US citizens); (4) report your guess to your superior. Basically, one thing has been added to your plan -- making a connection between your sample and the population.
-1,000 people were randomly chosen, and 637 answered that they trust the president. So far, (1) and (2) in the to-do list have been taken care of.
-So, the proportion of people who trust the president in your particular sample is: $\hat{p} = \frac{637}{1000} = 0.637$ .
-This is what most people would report when they do a survey -- a percentage. But it is not enough for you, who studied research methods. You need to do more than this.
-That is, what you really want to know is not $\hat{p}$ . Rather, you want to know the proportion among all US citizens, which can be called " $p$ ."
-Let's stop here and talk about standard error. But the above will be used again.
-The standard deviation of the sampling distribution of $\hat{p}$ <wrap id #standard_error_nominal>is as follows:</wrap>
-$\displaystyle \sigma_{\hat{p}} = \sqrt{\frac{p*q}{n}} , \;\;\; q = (1 - p)$ , where " $\hat{p}$ " represents the sample statistic (probability) and " $p$ " represents the population parameter (probability). In English, it says that . . .
-  * the standard deviation of the sample probability ( $\sigma_{\hat{p}}$ )
-  * is equal to
-  * $\sqrt{\frac{p * q}{n}}$ , where $n$ is the sample size
-What it indicates is that if you know the population probability, you can calculate the standard error of a sample probability. In relation to the above example, it says
-  * if you know the proportion (probability) of US citizens who trust the president,
-  * you can calculate the standard deviation of your sample's proportion (probability).
-Wait . . . . ?!?! You might ask: "We wanted to know the population probability in the first place. That is why we took a sample (n = 1000). But now Hyo is saying that in order to know the standard deviation of the sample probability, we need to know the population probability? It is nonsense! If we knew the population probability in the first place, we would not do sampling!!!"
-Yes!! That is absolutely true. In other words, in the above formula, we DO NOT know the value of $p$ . But here is the magic -- it is not a big problem if we replace $p$ with the $\hat{p}$ obtained from the sample $(n = 1000)$ : if we are sure that the sample really represents the population, the actual (unknown) $p$ is about the same as $\hat{p}$ .
-Besides, the value of " $\sqrt{p * (1-p)}$ " is relatively insensitive to changes in $p$ . Look at the table below:
-| p    | q    | pq    | square \\ root (pq)  |                                                                    |
-| 0.1  | 0.9  | 0.09  | 0.30     | note that the values here are \\ not much different \\ as found in p and q           |
-| 0.2  | 0.8  | 0.16  | 0.40              |:::                                                                    |
-| 0.3  | 0.7  | 0.21  | 0.46              |:::                                                                    |
-| 0.4  | 0.6  | 0.24  | 0.49              |:::                                                                    |
-| 0.5  | 0.5  | 0.25  | 0.50              |:::                                                                    |
-| 0.6  | 0.4  | 0.24  | 0.49              |:::                                                                    |
-| 0.7  | 0.3  | 0.21  | 0.46              |:::                                                                    |
-| 0.8  | 0.2  | 0.16  | 0.40              |:::                                                                    |
-| 0.9  | 0.1  | 0.09  | 0.30              |:::                                                                    |
-  * Note that the value of $q$ (column q) is determined by $p$ . Suppose that we got 0.6 as the sample probability ( $\hat{p}$ ) and use it as the population probability ( $p$ ). Further, suppose that the real value of the population probability is 0.5. Then the difference between the two calculations, (0.6 * 0.4) vs. (0.5 * 0.5), is only 0.01.
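The table can be regenerated with a few lines of code, which also shows how little the resulting standard error moves when 0.6 is used in place of a true value of 0.5. This is a sketch: `sqrt_pq` is a helper name made up here, and n = 1000 is taken from the running example.

```python
import math

def sqrt_pq(p):
    """Square root of p * q, where q = 1 - p."""
    return math.sqrt(p * (1 - p))

# One row per p = 0.1, 0.2, ..., 0.9, rounded as in the table.
rows = [(k / 10, round(sqrt_pq(k / 10), 2)) for k in range(1, 10)]
for p, value in rows:
    print(p, value)

# Standard error with n = 1000 when p is 0.6 versus 0.5.
n = 1000
se_06 = sqrt_pq(0.6) / math.sqrt(n)
se_05 = sqrt_pq(0.5) / math.sqrt(n)
print(round(se_06, 4), round(se_05, 4))
```

Between p = 0.3 and p = 0.7 the square root stays within about 0.04 of its maximum of 0.5, so a moderately wrong p changes the standard error only slightly.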
-So, we replace the value $p$ with your $\hat{p}$ and assume that the result is not going to be much different. Hence the standard deviation of the sampling distribution is:
-$\displaystyle \sqrt{\frac{(p * q)}{n}} \approx \sqrt{\frac{0.637*0.363}{1000}} \approx 0.015$ , where $n$ is the sample size.
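The arithmetic is easy to check by evaluating the formula directly. The following is a minimal sketch; the function name `standard_error_proportion` is an assumption made for this example.

```python
import math

def standard_error_proportion(p_hat, n):
    """Standard error of a sample proportion, using p_hat in place of p."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

# 637 of 1,000 respondents said they trust the president.
se = standard_error_proportion(637 / 1000, 1000)
print(round(se, 4))
```

Evaluated this way, the standard error comes out at roughly 0.0152, so twice the standard error is about 0.03.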
-What is the left part called? The standard deviation of the sampling distribution.
-  * We know what standard deviation means.
-  * Sampling distribution? not clear . . . (I will be writing about this soon). But, briefly
-It means that
-  - if we take a sample (size $n = 1000$ ), measure the proportion in the sample, and record it; and
-  - repeat the above step many, many times,
-  - we get a distribution curve whose shape is normal.
-  - If we take a look at this distribution, we will find that
-  - its mean (the most frequently appearing proportion) is the same as the population probability; and
-  - its standard deviation is about 0.015, i.e. $\sqrt{0.637 \times 0.363 / 1000}$ .
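The repeated-sampling idea in the steps above can be imitated with a small simulation. This is only a sketch under stated assumptions: it pretends the true population probability is 0.637 (in reality that value is unknown), draws 2,000 samples of size 1,000, and uses a fixed seed so the run is repeatable. The function name `simulate_sample_proportions` is made up for the example.

```python
import math
import random
import statistics

def simulate_sample_proportions(p, n, repeats, seed=0):
    """Draw `repeats` samples of size n and return each sample's proportion."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(repeats)]

# Assumed population probability; 0.637 is borrowed from the example sample.
p_hats = simulate_sample_proportions(p=0.637, n=1000, repeats=2000)

mean_of_p_hats = statistics.mean(p_hats)
sd_of_p_hats = statistics.stdev(p_hats)
theoretical_se = math.sqrt(0.637 * 0.363 / 1000)

print(round(mean_of_p_hats, 3), round(sd_of_p_hats, 4), round(theoretical_se, 4))
```

The mean of the simulated proportions should sit close to the assumed population probability, and their standard deviation should sit close to the value given by the formula.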
-This particular value, standard deviation, has a special meaning [This is important]
-  - Keeping in mind that the distribution curve is obtained from steps 1 and 2 above -- from sample probabilities --
-  - we realize that the chance of getting a particular sample probability follows this distribution.
-  - That is, if we take one sample [we are coming back to the standard deviation rule here],
-  - in 68 out of 100 cases, the sample probability will be found within the mean of the above distribution PLUS/MINUS ONE standard deviation.
-  - in 95 out of 100 cases, the sample probability will be found within the mean of the above distribution PLUS/MINUS TWO standard deviations.
-  - in 99 out of 100 cases, the sample probability will be found within the mean of the above distribution PLUS/MINUS THREE standard deviations.
-AND, this PARTICULAR standard deviation is called STANDARD ERROR.
-The below is a graphical illustration of it. I am repeating it, here. The distribution curve is obtained from many many sample probabilities. If we take one sample, the chance of getting the sample probability depends on this below distribution curve.
-{{trust-stdev-00.jpg}}
-That is,
-<WRAP code>mean (+-) 1stdev (about 0.015) : 0.622 to 0.652   ---   red line = 68%
-mean (+-) 2stdev (about 0.030) : 0.607 to 0.667   ---  blue line = 95%
-mean (+-) 3stdev (about 0.046) : 0.591 to 0.683  ---  yellow = 99%
-</WRAP>
-Now think carefully about what you have gotten from your sample statistics. You took a sample whose size is 1000. The purpose was not to look at the sample statistics but to estimate the population parameters. That's why you carefully did random sampling (though the procedures were omitted in this discussion) -- to make your sample representative of the population. Then you calculated the standard deviation of the sampling distribution (the standard error). This very thing (the standard error) is supposed to show how the sample means would be distributed if you kept taking samples and computing their means (proportions, in this case). In other words (this distinction is important!), the standard error is not about your sample itself. It is about the possible means from "many samples" (you did not take many samples, though; you took only one). To put this differently, you took a random sample and believed that the sample represents the population. From the statistics you got from the sample, you calculated a number called the standard error.
-And since the standard error is really the standard deviation of the sampling distribution, you can play a guessing game with it.
-That is,
-  * your sample probability (from the sample you took in the first place) +- one standard error has about a 68% chance of containing the true population probability.
-  * The sample probability +- two standard errors has about a 95% chance of containing the true population probability.
-  * The sample probability +- three standard errors has about a 99% chance of containing the true population probability.
-^ range (SE about 0.015)  ^ interval  ^ chance of containing $p$  ^
-| mean (+-) 1sd:  | 0.622 to 0.652  | red line = 68%  |
-| mean (+-) 2sd:  | 0.607 to 0.667  | blue line = 95%  |
-| mean (+-) 3sd:  | 0.591 to 0.683  | yellow = 99%  |
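The three ranges, together with the share of a normal curve that lies within one, two, and three standard deviations, can be computed directly from the sample figures. This is a sketch: `interval` and `coverage` are helper names invented for the example, and `statistics.NormalDist` from the Python standard library stands in for a normal table.

```python
import math
from statistics import NormalDist

def interval(p_hat, n, k):
    """Return (lower, upper) for p_hat plus/minus k standard errors."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - k * se, p_hat + k * se

def coverage(k):
    """Share of a normal distribution within k standard deviations of its mean."""
    standard_normal = NormalDist()
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    lower, upper = interval(0.637, 1000, k)
    print(k, round(lower, 3), round(upper, 3), round(coverage(k), 4))
```

The exact coverage values are about 68.3%, 95.4%, and 99.7%, so the 68/95/99 figures used in the text are convenient roundings.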
-Now, with a standard error of about 0.015, you think that the first option (0.622 to 0.652) cannot be chosen, because the certainty about the true proportion of the population is merely 68%. It may look cool to say that the true proportion lies between 0.622 and 0.652, because this is the narrowest range of your guess. But if you want to do this, you also need to state that you have (only) 68% certainty in your claim. This will not give you credibility. You don't want to be blamed for spending the research money on a guess that is right only about two times out of three (68%). How about choosing the last option (0.591 to 0.683)? It has the reverse dilemma! You may look cool saying that you have 99% certainty in your claim. But the claim itself has the widest range for the possible true proportion (0.591 to 0.683). So, you decide to choose the middle option: you are 95% certain that the true proportion is between 0.607 and 0.667. It is a compromise between the first and the third options.
-In tomorrow's newspaper, there will be a story about your study. The report will say that your firm's research revealed that 63.7% of 1,000 randomly selected people trust the president. From this research, the report will also say, we can use the statistic (63.7%) as the true proportion among US citizens, with a margin of error of plus or minus 3 percentage points (two standard errors, about 0.03).