standard_error
Differences
This shows you the differences between two versions of the page.
standard_error [2018/03/19 08:50] – hkimscil | standard_error [2020/05/17 13:44] (current) – hkimscil | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | Standard error is short for the standard deviation of sample means, or the standard deviation of the distribution of sample means. The term is therefore easy to understand once you are familiar with the properties of the standard deviation and of the distribution of sample means. | ||
- | |||
- | A particular number, '' | ||
- | |||
- | |||
- | ====== Standard deviation again ====== | ||
- | Before talking about standard error, let's take a brief look at the concepts of the normal curve and standard deviation. | ||
- | |||
- | Normal Curve and Standard Deviation, | ||
- | |||
- | \begin{equation} | ||
- | \displaystyle P(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} * e^{\frac{-(x - \mu)^2}{2 \sigma^2}} | ||
- | \end{equation} | ||
- | |||
- | see [[:normal distribution]] | ||
- | |||
- | < | ||
- | P(x) = 1 / sqrt(2 pi sigma^2)*e^(-(x-mu)^2/ | ||
- | P(x) = 1 / sqrt(2 * pi * (3^2)) | ||
- | . . . . dplot formula for distrib. curve with mu=68, sd=3 | ||
- | </ | ||
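To make the formula concrete, here is a minimal Python sketch that evaluates the normal curve at a couple of points. It uses mu=68 and sd=3 from the plot note above; the function name `normal_pdf` is only an illustrative choice, not something from the original notes.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# mu=68 and sd=3 are the values mentioned in the plot note above.
peak = normal_pdf(68, mu=68, sigma=3)         # height at the mean
one_sd_away = normal_pdf(71, mu=68, sigma=3)  # height one sd above the mean
print(peak, one_sd_away)
```

The curve is highest at the mean, and one standard deviation away its height drops by a factor of e^(-1/2), whatever the values of mu and sigma are.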
- | |||
- | This is the mathematical formula for a normal curve. It isn't exactly easy to understand, is it? Still, I want to briefly go over what the signs and symbols in this formula mean. | ||
- | |||
- | $ \displaystyle s^2 = \frac {\displaystyle \sum_{i=1}^{n} (X_i - \overline{X})^2}{n - 1} $ | ||
- | |||
- | The point was that if the upper part of the formula (the sigma thing) gets big, the variance -- and hence the stdev -- will be big. For the upper part to be big, the individual sample units ($X_i$) must vary a lot from the sample mean (x bar). If the sample units vary a lot from the mean, the distribution graph will be widely spread. | ||
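A small Python sketch of this point, assuming the usual sample variance with n - 1 in the denominator (the formula in these notes is cut off, so that denominator is an assumption). The two data sets are made-up numbers with the same mean and very different spread.

```python
def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1 (assumed)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

# Hypothetical data: both samples have mean 50.
tight = [49, 50, 50, 51]   # units close to the mean
wide = [20, 40, 60, 80]    # units far from the mean

print(sample_variance(tight))  # small numerator, small variance
print(sample_variance(wide))   # large numerator, large variance
```

Only the distances from the mean differ between the two samples, and that alone is what makes the second variance so much larger.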
- | |||
- | Now, going back to the normal curve formula, what if the standard deviation is 2 and the mean is 0? The corresponding graph is the second-highest one in the graphs below. I also included graphs whose stdev values are 3 and 4 -- the mean is 0 in all cases. | ||
- | |||
- | {{: | ||
- | |||
- | So, if we draw a normal distribution curve whose mean is 56.88 and whose stdev is 15.67, the graph should look like the one below. | ||
- | |||
- | {{quiz-distrib.jpg}} | ||
- | |||
====== Standard Error ====== | ====== Standard Error ====== | ||
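+ | For a variable measured numerically, the standard error is an index of how far the mean of a sample drawn from the population is likely to fall from the population mean (the random error around the mean); it is the standard deviation of the set of sample means (standard deviation of sample means) ([[: | ||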
- | Now let's talk about standard error. Before looking at the formula, let's clarify the term. Your immediate response to it may be: error of what? Error in estimating the population parameter -- you know the difference between a parameter and a statistic, right? The term is slightly imprecise, though it describes what it does very well. The formal name for the standard error is the standard deviation of the sampling distribution (here we are back to the concept of standard deviation again). | + | For a variable measured in categories, the method below is used. |
- | + | [[:c/ | |
- | There are two kinds of standard errors you need to know. | + | |
- | + | ||
- | * The first one is the standard deviation of the sampling distribution of $ \hat{p} $ or "p cap" -- a probability (proportion). | + |
- | * The second one is the standard deviation of the sampling distribution of the mean. The underlying concepts are the same. I will focus on the first one first -- the standard deviation of the sampling distribution of a probability. | + |
- | + | ||
- | Suppose you are an employee at Gallup. Public opinion polls are conducted to estimate the fraction of US citizens who trust the president. You are assigned to this job and are responsible for the report. What should you do to satisfy the firm? The general idea is: (1) take a sample; (2) examine how many of them trust the president; and (3) report it. That seems fine in general. But when your superior asks whether the result represents the whole population, what are you going to say? Hmm, it appears there is more work to do. So you refine the steps: (1) take a sample that is representative of the US population; (2) examine how many of them trust the president; (3) given the result from this particular sample, estimate what the result would be for the whole population (US citizens); (4) report your estimate to your superior. | + |
- | + | ||
- | 1000 people were randomly chosen and 637 answered that they trust the president. So far, the (1) and (2) in the " | + | |
- | + | ||
- | So, the proportion of people who trust the president in your particular sample is: $ \hat{p} = \frac{637}{1000} = 0.637 $ . | + | |
- | + | ||
- | This is what most people would report when they do a survey -- a percentage. But it is not enough for you, who have studied research methods. That is, you need to do more than this. | + |
- | + | ||
- | That is, what you really want to know is not $\hat{p}$ . Rather, you want to know the proportion among all US citizens, which can be called " $ p $ ." | + |
- | + | ||
- | Let's stop here and talk about standard error. But the above will be used again. | + | |
- | + | ||
- | The standard deviation of the sampling distribution of $ \hat{p} $ <wrap id #standard_error_nominal> | + | |
- | + | ||
- | $ \displaystyle \sigma_{\hat{p}} = \sqrt{\frac{p*q}{n}} , \;\;\; q = (1 - p) $ and " $ \hat{p} $ " represents sampling statistics (probability), | + | |
- | + | ||
- | * standard deviation of sampling probability ( $ \sigma_{\hat{p}} $ ) | + | |
- | * is equal to | + | |
- | * $ \sqrt{\frac{p * q}{n}} $ , where $ n $ is the sample size | + |
- | + | ||
- | What it indicates is that if you know your population probability, | + | |
- | + | ||
- | * if you know the proportion (probability) of US citizens' | + | |
- | * you can calculate the standard deviation of your sample' | + | |
- | + | ||
- | Wait . . . . ?!?! You might ask: "we wanted to know the population probability in the first place. That is why we took a sample (N=1000). But, now Hyo is saying that in order to know the standard deviation of sample probability, | + | |
- | + | ||
- | Yes!! That is absolutely true. In other words, in the above formula, we DO NOT know the value of $p$ . But here is the magic -- it would not be a big problem if we replace $p$ with the $\hat{p}$ obtained from the sample $ (n = 1000) $ : if we are sure that the sample really represents the population, the actual (unknown) $p$ is about the same as $ \hat{p} $ . | + |
- | + | ||
- | Besides, the value of " $ \sqrt{p * (1-p)} $ " is relatively insensitive to changes in $ p $ , as long as $ p $ is not close to 0 or 1. Look at the table below: | + |
- | + | ||
- | | p | q | pq | square \\ root (pq) | | | + | |
- | | 0.1 | 0.9 | 0.09 | 0.30 | note that the values here are \\ not much different \\ as found in p and q | | + | |
- | | 0.2 | 0.8 | 0.16 | 0.40 |::: | | + | |
- | | 0.3 | 0.7 | 0.21 | 0.46 |::: | | + | |
- | | 0.4 | 0.6 | 0.24 | 0.49 |::: | | + | |
- | | 0.5 | 0.5 | 0.25 | 0.50 |::: | | + | |
- | | 0.6 | 0.4 | 0.24 | 0.49 |::: | | + | |
- | | 0.7 | 0.3 | 0.21 | 0.46 |::: | | + | |
- | | 0.8 | 0.2 | 0.16 | 0.40 |::: | | + | |
- | | 0.9 | 0.1 | 0.09 | 0.30 |::: | | + | |
- | + | ||
- | * Note that the value of $ q $ (column q) is determined by $ p $ . Suppose that we got 0.6 as the sample probability ( $ \hat{p} $ ) and use it as the population probability ( $ p $ ). Further, suppose that the real value of the population probability is 0.5. Then the difference between the two calculations is only 0.01: (0.6 * 0.4 = 0.24) vs. (0.5 * 0.5 = 0.25). | + |
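The table can be reproduced with a few lines of Python. This is just a sketch for checking the numbers; the variable name `rows` is an illustrative choice.

```python
import math

rows = []
for k in range(1, 10):
    p = k / 10                    # p = 0.1, 0.2, ..., 0.9
    q = 1 - p                     # q is fully determined by p
    rows.append((p, round(q, 1), round(p * q, 2), round(math.sqrt(p * q), 2)))

for p, q, pq, root in rows:
    print(p, q, pq, root)
```

Across the middle rows (p from 0.3 to 0.7) the last column only moves between 0.46 and 0.50, which is why plugging in a slightly wrong p changes the result very little.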
- | + | ||
- | So, we replace the value $p$ with your $\hat{p}$ and assume that the result is not going to be much different. Hence the standard deviation of the sampling distribution is: | + | |
- | + | ||
- | $ \displaystyle \sqrt{\frac{(p * q)}{n}} = \sqrt{\frac{0.637*0.363}{1000}} = \sqrt{0.000231} \approx 0.015 $ , where $ n $ is the sample size. | + |
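As a check on the arithmetic, the same calculation in Python. This is a minimal sketch; the function name `standard_error_of_proportion` is made up for illustration, and it plugs the sample proportion in for the unknown p, as discussed above.

```python
import math

def standard_error_of_proportion(p_hat, n):
    """sqrt(p*q/n), with the sample proportion p_hat standing in for the unknown p."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

# 637 of the 1000 sampled people said they trust the president.
se = standard_error_of_proportion(637 / 1000, 1000)
print(round(se, 4))  # → 0.0152
```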
- | + | ||
- | What is the left part called? The standard deviation of the sampling distribution. | + |
- | + | ||
- | * We know what standard deviation means. | + | |
- | * Sampling distribution? | + | |
- | + | ||
- | It means that | + | |
- | + | ||
- | - If we take a sample (size, n = 1000); measure the probability of the sample and record it; and | + | |
- | - repeat the above step again and again for many many times, | + | |
- | - we get a kind of distribution curve, whose shape is approximately normal. | + |
- | - If we take a look at this distribution, | + | |
- | - the mean of it (the most frequently appearing probabilities) will be the same as the population probability; | + | |
- | - the standard deviation of it should be about 0.015, i.e. $ \sqrt{0.637 * 0.363 / 1000} $ (from the above calculation). | + |
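The repeated-sampling story in the steps above can be tried out with a small simulation. This is only a sketch under stated assumptions: the true population proportion is unknown in practice, so `TRUE_P = 0.637` is a hypothetical value, and `REPEATS = 2000` and the random seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)      # fixed seed so the run is repeatable
TRUE_P = 0.637      # hypothetical population proportion (unknown in real life)
N = 1000            # sample size, as in the text
REPEATS = 2000      # how many times we repeat the sampling (arbitrary)

def sample_proportion(p, n):
    """Ask n random people a yes/no question with P(yes) = p; return the share of yes."""
    return sum(random.random() < p for _ in range(n)) / n

# Steps 1 and 2: take a sample, record its proportion, and repeat many times.
p_hats = [sample_proportion(TRUE_P, N) for _ in range(REPEATS)]

print(statistics.mean(p_hats))   # close to TRUE_P
print(statistics.stdev(p_hats))  # close to sqrt(p*q/n)
```

The mean of the recorded proportions lands near the population proportion, and their standard deviation lands near the value given by the formula, which is what the list above claims.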
- | + | ||
- | This particular value, standard deviation, has a special meaning [This is important] | + | |
- | + | ||
- | - Keeping it in mind that the distribution curve is obtained from the above step 1 and 2 -- sample probabilities, | + | |
- | - we realize that the chance of getting a particular sample probability will follow this distribution. | + | |
- | - That is, if we take one sample, [we are coming back to standard deviation rule here], | + | |
- | - In 68 out of 100 cases, the sample probability will be found within the mean of the above distribution PLUS/MINUS ONE standard deviation. | + |
- | - In 95 out of 100 cases, the sample probability will be found within the mean of the above distribution PLUS/MINUS TWO standard deviations. | + |
- | - In 99 out of 100 cases, the sample probability will be found within the mean of the above distribution PLUS/MINUS THREE standard deviations. | + |
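The 68 / 95 / 99 figures are rounded properties of the normal curve. A short Python sketch, assuming the sampling distribution is exactly normal, computes them from the error function; `coverage` is an illustrative name.

```python
import math

def coverage(k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(coverage(k), 4))
# → 1 0.6827
# → 2 0.9545
# → 3 0.9973
```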
- | + | ||
- | AND, this PARTICULAR standard deviation is called STANDARD ERROR. | + | |
- | + | ||
- | Below is a graphical illustration of it. I am repeating myself here: the distribution curve is obtained from many, many sample probabilities. If we take one sample, the chance of getting a given sample probability follows the distribution curve below. | + |
- | + | ||
- | {{trust-stdev-00.jpg}} | + | |
- | + | ||
- | That is, | + | |
- | <WRAP code> | + | |
- | mean (+-) 2stdev : 0.607 to 0.667 | + |
- | mean (+-) 3stdev : 0.592 to 0.682 --- yellow = 99% | + |
- | </ | + | |
- | + | ||
- | Now think carefully about what you have gotten from your sample statistics. You took a sample of size 1000. The purpose was not to look at the sample statistics themselves but to estimate the population parameters. That's why you carefully (though the procedures were omitted in this discussion) did random sampling -- to make your sample representative of the population. Then you calculated the standard deviation of the sampling distribution (the standard error). This very thing (the standard error) is supposed to show how the sample means are distributed if you keep taking samples and getting the means (proportions, | + |
- | + | ||
- | And, since the standard error is really the standard deviation of sampling distributions, | + | |
- | + | ||
- | That is, | + | |
- | + | ||
- | * Your sample probability (acquired from the sample you took in the first place) +- one standard error has about a 68% chance of containing the true population probability. | + |
- | * The sample probability +- two standard errors has about a 95% chance of containing the true population probability. | + |
- | * The sample probability +- three standard errors has about a 99% chance of containing the true population probability. | + |
- | + | ||
- | | | + | |
- | mean (+-) 1sd: | 0.622 to 0.652 | red line = 68% | | + |
- | mean (+-) 2sd: | 0.607 to 0.667 | blue line = 95% | | + |
- | mean (+-) 3sd: | 0.592 to 0.682 | yellow = 99% | | + |
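Ranges like the ones in this table can be recomputed directly from the sample. The sketch below assumes the sample proportion 0.637 and n = 1000 from the example and builds the range for one, two and three standard errors; `intervals` is an illustrative name.

```python
import math

p_hat, n = 0.637, 1000
se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion

# Range p_hat +- k standard errors for k = 1, 2, 3.
intervals = {k: (p_hat - k * se, p_hat + k * se) for k in (1, 2, 3)}

for k, (low, high) in intervals.items():
    print(f"mean (+-) {k}sd: {low:.3f} to {high:.3f}")
```

Each extra standard error widens the range by the same amount on both sides, so more certainty is always paid for with a less precise statement.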
- | + | ||
- | Now, you think that the first option (mean +- 1sd) cannot be chosen, because the certainty about the true mean of the population is merely 68%. But it may look cool to say that the true mean resides within such a narrow range -- | + |
- | + | ||
- | On tomorrow' | + | |
standard_error.1521418819.txt.gz · Last modified: 2018/03/19 08:50 by hkimscil