Differences

This shows you the differences between two versions of the page.

--- r:general_statistics [2016/11/02 01:31] – [e.g.,] hkimscil
+++ r:general_statistics [2019/10/10 22:56] (current) – [Forming a Confidence Interval for a Mean] hkimscil
@@ Line 53: / Line 53: @@
 <code>suburbs <- read.csv("suburbs.csv", head=T, sep="	")
+suburbs <- read.csv("http://commres.net/wiki/_export/code/r/general_statistics?codeblock=1", head=T, sep="\t")
 </code>
@@ Line 76: / Line 77: @@
 ====== Calculating Relative Frequencies ======
 <code>
-> mean(Cars93$MPG.city > 14)  # see the summary(Cars93$MPG.city) the above
+> mean(Cars93$MPG.city > 14)  # returns 1 (100%) because min = 15; see summary(Cars93$MPG.city) above
 [1] 1
@@ Line 180: / Line 181: @@
 </code>
+<code>> library(MASS)
+> cardata <- data.frame(Cars93$Origin, Cars93$Type)
+> cardata
+   Cars93.Origin Cars93.Type
+1        non-USA       Small
+2        non-USA     Midsize
+3        non-USA     Compact
+4        non-USA     Midsize
+5        non-USA     Midsize
+6            USA     Midsize
+7            USA       Large
+8            USA       Large
+9            USA     Midsize
+10           USA       Large
+11           USA     Midsize
+12           USA     Compact
+13           USA     Compact
+14           USA      Sporty
+15           USA     Midsize
+16           USA         Van
+17           USA         Van
+18           USA       Large
+19           USA      Sporty
+20           USA       Large
+21           USA     Compact
+22           USA       Large
+23           USA       Small
+24           USA       Small
+25           USA     Compact
+26           USA         Van
+27           USA     Midsize
+28           USA      Sporty
+29           USA       Small
+30           USA       Large
+31           USA       Small
+32           USA       Small
+33           USA     Compact
+34           USA      Sporty
+35           USA      Sporty
+36           USA         Van
+37           USA     Midsize
+38           USA       Large
+39       non-USA       Small
+40       non-USA      Sporty
+41       non-USA      Sporty
+42       non-USA       Small
+43       non-USA     Compact
+44       non-USA       Small
+45       non-USA       Small
+46       non-USA      Sporty
+47       non-USA     Midsize
+48       non-USA     Midsize
+49       non-USA     Midsize
+50       non-USA     Midsize
+51           USA     Midsize
+52           USA       Large
+53       non-USA       Small
+54       non-USA       Small
+55       non-USA     Compact
+56       non-USA         Van
+57       non-USA      Sporty
+58       non-USA     Compact
+59       non-USA     Midsize
+60           USA      Sporty
+61           USA     Midsize
+62       non-USA       Small
+63       non-USA     Midsize
+64       non-USA       Small
+65       non-USA     Compact
+66       non-USA         Van
+67       non-USA     Midsize
+68           USA     Compact
+69           USA     Midsize
+70           USA         Van
+71           USA       Large
+72           USA      Sporty
+73           USA       Small
+74           USA     Compact
+75           USA      Sporty
+76           USA     Midsize
+77           USA       Large
+78       non-USA     Compact
+79           USA       Small
+80       non-USA       Small
+81       non-USA       Small
+82       non-USA     Compact
+83       non-USA       Small
+84       non-USA       Small
+85       non-USA      Sporty
+86       non-USA     Midsize
+87       non-USA         Van
+88       non-USA       Small
+89       non-USA         Van
+90       non-USA     Compact
+91       non-USA      Sporty
+92       non-USA     Compact
+93       non-USA     Midsize
+> cartbl <- table(cardata)
+> cartbl
+             Cars93.Type
+Cars93.Origin Compact Large Midsize Small Sporty Van
+      USA           7    11      10     7      8   5
+      non-USA       9     0      12    14      6   4
+> summary(cartbl)
+Number of cases in table: 93
+Number of factors: 2
+Test for independence of all factors:
+	Chisq = 14.08, df = 5, p-value = 0.01511
+	Chi-squared approximation may be incorrect
+> chisq.test(cartbl)
+	Pearson's Chi-squared test
+data:  cartbl
+X-squared = 14.08, df = 5, p-value = 0.01511
+Warning message:
+In chisq.test(cartbl) : Chi-squared approximation may be incorrect
+>
+</code>
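The warning in the output above is raised when some expected cell counts are small. As a minimal sketch (not part of the original page), the expected counts can be inspected directly, and a Monte Carlo p-value is one possible alternative to the chi-squared approximation. The names `res`, `min_expected` and `res_sim`, the seed, and the choice of `B = 2000` are illustrative assumptions.

```r
library(MASS)  # provides the Cars93 data set

cartbl <- table(Cars93$Origin, Cars93$Type)

# Run the test quietly and keep the result object
res <- suppressWarnings(chisq.test(cartbl))

# Expected counts under independence; chisq.test() warns when any is below 5
res$expected
min_expected <- min(res$expected)
min_expected

# Alternative: p-value from simulated tables instead of the chi-squared approximation
set.seed(1)  # arbitrary seed, only for reproducibility
res_sim <- chisq.test(cartbl, simulate.p.value = TRUE, B = 2000)
res_sim$p.value
```

If `min_expected` is below 5, the approximation warning is expected, and the simulated p-value gives a rough cross-check of the conclusion.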
 ====== Calculating Quantiles (and Quartiles) of a Dataset ======
@@ Line 199: / Line 320: @@
 <code>> dur <- faithful$eruptions
+> dur > mean(dur)
+  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE
+ [15]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE
+ [29]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
+ [43]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
+ [57]  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
+ [71]  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
+ [85]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
+ [99] FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
+[113]  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
+[127] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
+[141]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
+[155]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
+[169] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
+[183]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
+[197]  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
+[211] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
+[225]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE
+[239]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
+[253]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE
+[267]  TRUE  TRUE FALSE  TRUE FALSE  TRUE
 > mean(dur > mean(dur))
 [1] 0.6176471
@@ Line 283: / Line 425: @@
 > round(mean(zdur))
 [1] 0
+> round(sd(zdur))
+[1] 1
 </code>
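The two checks above (mean 0, standard deviation 1) are what standardization is meant to produce. The following sketch assumes that `zdur` holds the z-scores of `dur`, which this excerpt does not show; `zdur_scaled` is an illustrative name.

```r
dur <- faithful$eruptions

# Standardize: subtract the mean, then divide by the standard deviation
zdur <- (dur - mean(dur)) / sd(dur)

# scale() performs the same computation but returns a one-column matrix
zdur_scaled <- as.numeric(scale(dur))

round(mean(zdur), 10)  # effectively 0
round(sd(zdur), 10)    # effectively 1
```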
@@ Line 352: / Line 497: @@
 ====== Forming a Confidence Interval for a Mean ======
-<code>> s <- sd(x)
+<code>
+> set.seed(1024)
+> x <- rnorm(50, mean=100, sd=15)
+> s <- sd(x)
 > m <- mean(x)
 > n <- length(x)
@@ Line 364: / Line 512: @@
 > SE
 [1] 2.458358
-> E <- qt(.975, df=n-1)*SE
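+## qt function: qt(prob, df). Which t value plays the role of a z-score of about 2?
+> qtv <- qt(.975, df=n-1)
+> qtv
+[1] 2.009575
+## qtv (about 2) is the critical t value for a 95 percent confidence level
+## the margin of error E of the CI is then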
+> E <- qtv*SE
 > E
 [1] 4.940254
@@ Line 372: / Line 526: @@
 </code>
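As a cross-check of the steps above, the interval built by hand as mean ± E should agree with the interval reported by `t.test`. This is a sketch under the assumption that `SE` is computed as `sd(x) / sqrt(n)`. The names `ci_manual` and `ci_ttest` are illustrative, and the numbers it prints need not match the `SE` and `E` values shown above, which may come from a different sample.

```r
set.seed(1024)
x <- rnorm(50, mean = 100, sd = 15)

m  <- mean(x)
n  <- length(x)
SE <- sd(x) / sqrt(n)              # standard error of the mean
E  <- qt(0.975, df = n - 1) * SE   # margin of error at the 95% level

ci_manual <- c(m - E, m + E)
ci_ttest  <- as.numeric(t.test(x)$conf.int)  # drops the conf.level attribute

ci_manual
ci_ttest
```

The confidence interval does not depend on the `mu` argument of `t.test`, which is why the interval is the same for every hypothesized mean.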
-<code>> t.test(x)
+<code>
+> t.test(x, mu=98)
 	One Sample t-test
 data:  x
-t = 39.052, df = 49, p-value < 2.2e-16
+t = 0.37089, df = 49, p-value = 0.7123
-alternative hypothesis: true mean is not equal to 0
+alternative hypothesis: true mean is not equal to 98
 95 percent confidence interval:
- 91.0636 100.9441
+ 94.32303 103.34143
 sample estimates:
 mean of x
- 96.00386
+ 98.83223
+> t.test(x, mu=100)
+	One Sample t-test
+data:  x
+t = -0.52043, df = 49, p-value = 0.6051
+alternative hypothesis: true mean is not equal to 100
+95 percent confidence interval:
+ 94.32303 103.34143
+sample estimates:
+mean of x
+ 98.83223
+> t.test(x, mu=95)
+	One Sample t-test
+data:  x
+t = 1.7079, df = 49, p-value = 0.09399
+alternative hypothesis: true mean is not equal to 95
+95 percent confidence interval:
+ 94.32303 103.34143
+sample estimates:
+mean of x
+ 98.83223
+>
 </code>
@@ Line 396: / Line 579: @@
 W = 0.97415, p-value = 0.3386
 </code>
+The large p-value suggests the underlying population could be normally distributed. The next example reports a small p-value for y, so it is unlikely that this sample came from a normal population:
+equal variances assumed -> var.equal=T (Student's t-test)
+equal variances not assumed -> var.equal=F (Welch's t-test, the default)
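The `var.equal` argument of `t.test` concerns the equality of the two group variances, while `shapiro.test` addresses normality. One possible workflow, sketched here as an assumption rather than as the page's own method, checks both and then sets `var.equal` from the F test. The 0.05 cutoff and the names `vt`, `equal_var` and `res` are illustrative.

```r
a <- c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
b <- c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)

# Normality check for each sample (large p-values give no evidence against normality)
shapiro.test(a)$p.value
shapiro.test(b)$p.value

# F test for equality of variances
vt <- var.test(a, b)
equal_var <- vt$p.value > 0.05   # illustrative cutoff

# Student's t-test if variances look equal, Welch's t-test otherwise
res <- t.test(a, b, var.equal = equal_var)
res
```

Many analysts skip the preliminary F test and use the Welch test by default, so treat this decision rule as one option among several.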
 ====== Comparing the Means of Two Samples ======
@@ Line 448: / Line 636: @@
 mpg.auto = mtcars[L,]$mpg
 mpg.auto                    # automatic transmission mileage
- [1] 21.4 18.7 18.1 14.3 24.4 ...
+ [1] 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 21.5 15.5 15.2
+[18] 13.3 19.2
 mpg.manual = mtcars[!L,]$mpg
 mpg.manual                  # manual transmission mileage
- [1] 21.0 21.0 22.8 32.4 30.4 ...
+ [1] 21.0 21.0 22.8 32.4 30.4 33.9 27.3 26.0 30.4 15.8 19.7 15.0 21.4
 t.test(mpg.auto, mpg.manual)
@@ Line 466: / Line 655: @@
 mean of x mean of y
 17.14737  24.39231
+</code>
+Or, equivalently, using the formula interface:
+<code>> t.test(mtcars$mpg~mtcars$am)
+	Welch Two Sample t-test
+data:  mtcars$mpg by mtcars$am
+t = -3.7671, df = 18.332, p-value = 0.001374
+alternative hypothesis: true difference in means is not equal to 0
+95 percent confidence interval:
+ -11.280194  -3.209684
+sample estimates:
+mean in group 0 mean in group 1
+       17.14737        24.39231
 </code>
 Another example:
-<code>a = c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
+<code>> a = c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
-b = c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)
+> b = c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)
 </code>
-<code>t.test(a,b, var.equal=TRUE, paired=FALSE)
+<code>> t.test(a,b, var.equal=TRUE, paired=FALSE)
 Two Sample t-test
@@ Line 489: / Line 694: @@
-qt(0.975, 18)
+> qt(0.975, 18)
 [1] 2.100922
 </code>
-<code>var.test(a,b)
+<code>> var.test(a,b)
      F test to compare two variances
@@ Line 507: / Line 712: @@
-qf(0.95, 9, 9)
+> qf(0.95, 9, 9)
 [1] 3.178893
@@ Line 518: / Line 723: @@
 ===== e.g., =====
 <code>> sleep
-#>    extra group ID
+>    extra group ID
-#> 1    0.7     1  1
+> 1    0.7     1  1
-#> 2   -1.6     1  2
+> 2   -1.6     1  2
-#> 3   -0.2     1  3
+> 3   -0.2     1  3
-#> 4   -1.2     1  4
+> 4   -1.2     1  4
-#> 5   -0.1     1  5
+> 5   -0.1     1  5
-#> 6    3.4     1  6
+> 6    3.4     1  6
-#> 7    3.7     1  7
+> 7    3.7     1  7
-#> 8    0.8     1  8
+> 8    0.8     1  8
-#> 9    0.0     1  9
+> 9    0.0     1  9
-#> 10   2.0     1 10
+> 10   2.0     1 10
-#> 11   1.9     2  1
+> 11   1.9     2  1
-#> 12   0.8     2  2
+> 12   0.8     2  2
-#> 13   1.1     2  3
+> 13   1.1     2  3
-#> 14   0.1     2  4
+> 14   0.1     2  4
-#> 15  -0.1     2  5
+> 15  -0.1     2  5
-#> 16   4.4     2  6
+> 16   4.4     2  6
-#> 17   5.5     2  7
+> 17   5.5     2  7
-#> 18   1.6     2  8
+> 18   1.6     2  8
-#> 19   4.6     2  9
+> 19   4.6     2  9
-#> 20   3.4     2 10
+> 20   3.4     2 10
 </code>
-<code>sleep_wide <- data.frame(
+<code>> sleep_wide <- data.frame(
     ID=1:10,
     group1=sleep$extra[1:10],
@@ Line 546: / Line 751: @@
 )
 sleep_wide
-#>    ID group1 group2
+>    ID group1 group2
-#> 1   1    0.7    1.9
+> 1   1    0.7    1.9
-#> 2   2   -1.6    0.8
+> 2   2   -1.6    0.8
-#> 3   3   -0.2    1.1
+> 3   3   -0.2    1.1
-#> 4   4   -1.2    0.1
+> 4   4   -1.2    0.1
-#> 5   5   -0.1   -0.1
+> 5   5   -0.1   -0.1
-#> 6   6    3.4    4.4
+> 6   6    3.4    4.4
-#> 7   7    3.7    5.5
+> 7   7    3.7    5.5
-#> 8   8    0.8    1.6
+> 8   8    0.8    1.6
-#> 9   9    0.0    4.6
+> 9   9    0.0    4.6
-#> 10 10    2.0    3.4
+> 10 10    2.0    3.4
 </code>
 Ignore the ID variable for convenience.
@@ Line 563: / Line 768: @@
 # Welch t-test
 t.test(extra ~ group, sleep)
+>
-#>
+> 	Welch Two Sample t-test
-#> 	Welch Two Sample t-test
+>
-#>
+> data:  extra by group
-#> data:  extra by group
+> t = -1.8608, df = 17.776, p-value = 0.07939
-#> t = -1.8608, df = 17.776, p-value = 0.07939
+> alternative hypothesis: true difference in means is not equal to 0
-#> alternative hypothesis: true difference in means is not equal to 0
+> 95 percent confidence interval:
-#> 95 percent confidence interval:
+>  -3.3654832  0.2054832
-#>  -3.3654832  0.2054832
+> sample estimates:
-#> sample estimates:
+> mean in group 1 mean in group 2
-#> mean in group 1 mean in group 2
+>            0.75            2.33
-#>            0.75            2.33
 # Same for wide data (two separate vectors)
-# t.test(sleep_wide$group1, sleep_wide$group2)
+> t.test(sleep_wide$group1, sleep_wide$group2)
 </code>
@@ Line 584: / Line 788: @@
 <code>
 # Student t-test
-t.test(extra ~ group, sleep, var.equal=TRUE)
+> t.test(extra ~ group, sleep, var.equal=TRUE)
-#>
+>
-#> 	Two Sample t-test
+> 	Two Sample t-test
-#>
+>
-#> data:  extra by group
+> data:  extra by group
-#> t = -1.8608, df = 18, p-value = 0.07919
+> t = -1.8608, df = 18, p-value = 0.07919
-#> alternative hypothesis: true difference in means is not equal to 0
+> alternative hypothesis: true difference in means is not equal to 0
-#> 95 percent confidence interval:
+> 95 percent confidence interval:
-#>  -3.363874  0.203874
+>  -3.363874  0.203874
-#> sample estimates:
+> sample estimates:
-#> mean in group 1 mean in group 2
+> mean in group 1 mean in group 2
-#>            0.75            2.33
+>             0.75            2.33
 </code>
-<code># Same for wide data (two separate vectors)
+<code>#  Same for wide data (two separate vectors)
-# t.test(sleep_wide$group1, sleep_wide$group2, var.equal=TRUE)
+> t.test(sleep_wide$group1, sleep_wide$group2, var.equal=TRUE)
 </code>
@@ Line 608: / Line 812: @@
 <code>
 # Sort by group then ID
-sleep <- sleep[order(sleep$group, sleep$ID), ]
+> sleep <- sleep[order(sleep$group, sleep$ID), ]
 # Paired t-test
-t.test(extra ~ group, sleep, paired=TRUE)
+> t.test(extra ~ group, sleep, paired=TRUE)
-#>
-#> 	Paired t-test
+ 	Paired t-test
-#>
-#> data:  extra by group
+ data:  extra by group
-#> t = -4.0621, df = 9, p-value = 0.002833
+ t = -4.0621, df = 9, p-value = 0.002833
-#> alternative hypothesis: true difference in means is not equal to 0
+ alternative hypothesis: true difference in means is not equal to 0
-#> 95 percent confidence interval:
+95 percent confidence interval:
-#>  -2.4598858 -0.7001142
+  -2.4598858 -0.7001142
-#> sample estimates:
+ sample estimates:
-#> mean of the differences
+ mean of the differences
-#>                   -1.58
+                   -1.58
 </code>
 <code># Same for wide data (two separate vectors)
-# t.test(sleep.wide$group1, sleep.wide$group2, paired=TRUE)
+> t.test(sleep_wide$group1, sleep_wide$group2, paired=TRUE)
+	Paired t-test
+data:  sleep_wide$group1 and sleep_wide$group2
+t = -4.0621, df = 9, p-value = 0.002833
+alternative hypothesis: true difference in means is not equal to 0
+95 percent confidence interval:
+ -2.4598858 -0.7001142
+sample estimates:
+mean of the differences
+                  -1.58
 </code>
 The paired t-test is equivalent to testing whether the differences between paired observations have a population mean of 0. (See below for comparing a single group to a population mean.)
-<code>t.test(sleep.wide$group1 - sleep.wide$group2, mu=0, var.equal=TRUE)
+<code>> t.test(sleep_wide$group1 - sleep_wide$group2, mu=0, var.equal=TRUE)
-#> Error in t.test(sleep.wide$group1 - sleep.wide$group2, mu = 0, var.equal = TRUE): object 'sleep.wide' not found
+	One Sample t-test
+data:  sleep_wide$group1 - sleep_wide$group2
+t = -4.0621, df = 9, p-value = 0.002833
+alternative hypothesis: true mean is not equal to 0
+95 percent confidence interval:
+ -2.4598858 -0.7001142
+sample estimates:
+mean of x
+    -1.58
 </code>
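The equivalence can also be verified by computing the statistic by hand from the pairwise differences. This is a minimal sketch using the built-in `sleep` data; `g1`, `g2`, `d`, `t_manual` and `paired` are illustrative names. It passes two vectors rather than a formula, because some recent R versions are reported to reject `paired=TRUE` in the formula interface.

```r
g1 <- sleep$extra[sleep$group == 1]
g2 <- sleep$extra[sleep$group == 2]

# Pairwise differences (rows are already ordered by ID within each group)
d <- g1 - g2

# One-sample t statistic on the differences: mean / standard error
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))

# Paired t-test on the two vectors
paired <- t.test(g1, g2, paired = TRUE)

t_manual
unname(paired$statistic)
```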
@@ Line 641: / Line 868: @@
 <code>
 t.test(sleep$extra, mu=0)
-#>
+>
-#> 	One Sample t-test
+> 	One Sample t-test
-#>
+>
-#> data:  sleep$extra
+> data:  sleep$extra
-#> t = 3.413, df = 19, p-value = 0.002918
+> t = 3.413, df = 19, p-value = 0.002918
-#> alternative hypothesis: true mean is not equal to 0
+> alternative hypothesis: true mean is not equal to 0
-#> 95 percent confidence interval:
+> 95 percent confidence interval:
-#>  0.5955845 2.4844155
+>  0.5955845 2.4844155
-#> sample estimates:
+> sample estimates:
-#> mean of x
+> mean of x
-#>      1.54
+>      1.54
 </code>