Differences

This shows you the differences between two versions of the page.

--- sampling [2018/03/13 16:48] – [Sample statistics] hkimscil
+++ sampling [2019/09/13 11:08] – hkimscil
@@ Line 18: / Line 18: @@
 That same year, George Gallup, an advertising executive who had begun a scientific poll, predicted that Roosevelt would win the election, based on a **quota sample** of 50,000 people. He also predicted that the //Literary Digest// would mis-predict the results. His correct predictions made public opinion polling a critical element of elections for journalists and indeed for politicians. The Gallup Poll would become a staple of future presidential elections, and remains one of the most prominent election polling organizations.
 -- http://en.wikipedia.org/wiki/United_States_presidential_election,_1936
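+  * In 1916, the Literary Digest polled voters about presidential candidates Woodrow Wilson and Charles Hughes and correctly predicted the winner.
+  * It ran a poll every four years after that, and its predictions kept proving accurate.
+  * In 1936, its Landon vs. Roosevelt prediction failed badly.
+    * It mailed a survey to ten million (10 million) people, and 2.3 million responded.
+    * It predicted Landon 57% vs. Roosevelt 43%.
+    * Roosevelt, however, won with 61%.
+  * The prediction failed because of how the Literary Digest obtained the sample it surveyed:
+    * Telephone directories + lists of automobile owners → about 2.3 million of the 10 million responded (a return rate, or response rate, of roughly 23%)
+    * In 1936, however, people who owned telephones and automobiles were mostly upper-middle class,
+    * and the majority of them were Republican supporters.
+    * The result was therefore not a random sample but a biased sample.
+  * That same year, a small, newly founded polling firm correctly predicted that Roosevelt would win.
+  * That firm was Gallup, founded by a young man named George Gallup.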
+  * Gallup used the quota sampling method to draw a sample matching the population composition.
+====== Sampling ======
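+===== Terms =====
+A researcher defines the group of subjects relevant to the research question. The group defined in this way is called the [[Population|population]]. If a researcher is interested in adolescents' use of mp3 music and poses a research question about it, the population provisionally defined by the study is adolescents. As with [[Conceptualization]] and [[Operationalization]], the population needs to be defined clearly. That is, in the example above, the researcher must state concretely who counts as an adolescent.
+Even with a clearly defined population, the researcher often cannot survey the entire group, except when the [[:Population|population]] is small or in cases such as a census. Surveying every member of the group is called enumeration, and it is usually ruled out because of its high cost.
+The researcher therefore selects a subset of the population and infers the characteristics of the population from a study of that subset. The part of the population selected in this way is called a [[:Sample|sample]].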
+<WRAP box>
+In statistical terms, a mathematically summarized characteristic of the population is called a parameter, and a characteristic of a sample is called a statistic.
+</WRAP>
+<WRAP box>
+  * Alf Landon vs. Franklin Roosevelt
+  * Literary Digest vs. George Gallup
+This election is notable for the Literary Digest poll, which was based on **10 million** questionnaires mailed to readers and potential readers; over two million were returned.
+That same year, George Gallup, an advertising executive who had begun a scientific poll, predicted that Roosevelt would win the election, based on a **quota sample** of 50,000 people. He also predicted that the //Literary Digest// would mis-predict the results. His correct predictions made public opinion polling a critical element of elections for journalists and indeed for politicians. The Gallup Poll would become a staple of future presidential elections, and remains one of the most prominent election polling organizations.
+-- http://en.wikipedia.org/wiki/United_States_presidential_election,_1936
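+  * In 1916, the Literary Digest polled voters about presidential candidates Woodrow Wilson and Charles Hughes and correctly predicted the winner.
+  * It ran a poll every four years after that, and its predictions kept proving accurate.
+  * In 1936, its Landon vs. Roosevelt prediction failed badly.
+    * It mailed a survey to ten million (10 million) people, and 2.3 million responded.
+    * It predicted Landon 57% vs. Roosevelt 43%.
+    * Roosevelt, however, won with 61%.
+  * The prediction failed because of how the Literary Digest obtained the sample it surveyed:
+    * Telephone directories + lists of automobile owners → about 2.3 million of the 10 million responded (a return rate, or response rate, of roughly 23%)
+    * In 1936, however, people who owned telephones and automobiles were mostly upper-middle class,
+    * and the majority of them were Republican supporters.
+    * The result was therefore not a random sample but a biased sample.
+  * That same year, a small, newly founded polling firm correctly predicted that Roosevelt would win.
+  * That firm was Gallup, founded by a young man named George Gallup.
+  * Gallup used the quota sampling method to draw a sample that matched the composition of the population.
+  * The correct call made the firm widely known, and it grew into today's Gallup organization.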
 </WRAP>
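
The sampling-frame problem described in the box can be illustrated with a small simulation. The sketch below is hypothetical: the population size, the share of telephone or car owners, and the support rates in each group are made-up values chosen only to mimic the direction of the 1936 bias, not historical estimates. Within each quota the sketch selects people at random, which is a simplification of real quota sampling.

```r
set.seed(1936)                      # fixed seed so the sketch is reproducible

## Hypothetical population (all numbers below are illustrative assumptions)
pop_size <- 100000
owner    <- runif(pop_size) < 0.25          # TRUE = owns a telephone or a car
## Assumed support for Roosevelt: lower among owners, higher among non-owners
p_support <- ifelse(owner, 0.40, 0.68)
roosevelt <- runif(pop_size) < p_support    # TRUE = votes for Roosevelt

true_share <- mean(roosevelt)               # population parameter

## Biased frame: a large sample drawn only from telephone/car owners
biased_idx   <- sample(which(owner), 20000)
biased_share <- mean(roosevelt[biased_idx])

## Quota-style sample: much smaller, but the owner / non-owner split
## matches the split in the population
n_quota   <- 2000
n_owner   <- round(n_quota * mean(owner))
quota_idx <- c(sample(which(owner), n_owner),
               sample(which(!owner), n_quota - n_owner))
quota_share <- mean(roosevelt[quota_idx])

round(c(true = true_share, biased = biased_share, quota = quota_share), 3)
```

Under these assumed numbers, the much larger owner-only sample misses the true share by a wide margin, while the small composition-matched sample lands close to it. A larger sample does not compensate for a biased frame.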
@@ Line 36: / Line 87: @@
 ==== 원리  ====
   * Representativeness (대표성)
-    * ECBS (Equal Chance of Being Selected)
+    * ECoBS (Equal Chance of Being Selected)
   * Sampling bias (샘플링 바이어스)
     * conscious
@@ Line 66: / Line 117: @@
 var_ <- new.env()
-n<-20            ## Sample n individuals at a time
+n<-20        ## Sample n individuals at a time
-p_mean<-0        ## Population mean
+p_mean<-0    ## Population mean
-p_sd<-1            ## Population standard deviation
+p_sd<-1      ## Population standard deviation
-N<-500            ## Number of times the experiment (sampling) is replicated
+N<-500       ## Number of times the experiment (sampling) is replicated
 pdf('SE.pdf')
-for(i in 1:N)                                ## do the experiment N times
+for(i in 1:N)     ## do the experiment N times
 {
-smp<-rnorm(n,p_mean,p_sd)                 ## sample n data points from the population
+smp<-rnorm(n,p_mean,p_sd)    ## sample n data points from the population
+var_$x_bar<-c(var_$x_bar,mean(smp))     ## keep track of the mean (x_bar) from each sample
-var_$x_bar<-c(var_$x_bar,mean(smp))         ## keep track of the mean (x_bar) from each sample
+hist(var_$x_bar,probability=TRUE,col="red",xlim=c(-4,4),xlab="x / x_bar",main="",ylim=c(0,2.2))
+# Plot a histogram of x_bar values
-hist(var_$x_bar,probability=TRUE,col="red",xlim=c(-4,4),xlab="x / x_bar",main="",ylim=c(0,2.2))  # Plot a histogram of x_bar values
 points(mean(smp),0,pch=19,cex=1.5,col='black')
 curve(dnorm(x,p_mean,p_sd/sqrt(n)),lwd=3,add=TRUE)
@@ Line 86: / Line 139: @@
 text(2.5,1.5,labels=paste('standard deviation of\nsample means = ',round(sd(var_$x_bar),2),sep='') )
-curve(dnorm(x,p_mean,p_sd),main="",ylab="",xlim=c(-4,4),xlab="X",col="blue",lwd=3,add=TRUE) ## Plot the sample
+curve(dnorm(x,p_mean,p_sd),main="",ylab="",xlim=c(-4,4),xlab="X",col="blue",lwd=3,add=TRUE)
+## Plot the population distribution
 text(2.5,0.5,labels=paste('# of means drawn = ',i,sep=''))
@@ Line 100: / Line 154: @@
 dev.off()
 </code>
+{{SE.pdf}}
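
The script above draws N samples of size n and overlays the theoretical curve dnorm(x, p_mean, p_sd/sqrt(n)) on the histogram of sample means. A minimal non-plotting sketch of the same check follows, reusing the script's parameter names; x_bar, se_theory, and se_sim are names introduced here for illustration only.

```r
set.seed(1)       # fixed seed so the result is reproducible

n      <- 20      # size of each sample
p_mean <- 0       # population mean
p_sd   <- 1       # population standard deviation
N      <- 500     # number of replicated samples

## Mean of each of the N samples drawn from the normal population
x_bar <- replicate(N, mean(rnorm(n, p_mean, p_sd)))

se_theory <- p_sd / sqrt(n)   # theoretical standard error of the mean
se_sim    <- sd(x_bar)        # standard deviation of the simulated means

round(c(theory = se_theory, simulated = se_sim), 3)
```

The two values should be close to each other, and the gap should tend to shrink as N grows, since sd(x_bar) is itself an estimate based on N simulated means.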
   * Variation (see [[:Variance]]): 225.0584138 (approximately 15^2)