Variability and Spread

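Which player (A, B, or C) would you pick for the upcoming basketball game? For each player, the first row lists the scores and the second row lists how often each score occurred.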

A
7	8	9	10	11	12	13
1	1	2	2	2	1	1

B
7	9	10	11	13
1	2	4	2	1

C
3	6	7	10	11	13	30
2	1	2	3	1	1	1

a <- c(7,8,9,9,10,10,11,11,12,13)
b <- c(7,9,9,10,10,10,10,11,11,13)
c <- c(3,3,6,7,7,10,10,10,11,13,30)
## c <- c(3,3,6,7,7,10,11,13,15,20,30)

data <- list(a,b,c)
data
sapply(data,mean)
sapply(data,median)
sapply(data,range)

sapply(data,sd)
sapply(data,var)

> a <- c(7,8,9,9,10,10,11,11,12,13)
> b <- c(7,9,9,10,10,10,10,11,11,13)
> c <- c(3,3,6,7,7,10,10,10,11,13,30)
> 
> data <- list(a,b,c)
> data
[[1]]
 [1]  7  8  9  9 10 10 11 11 12 13

[[2]]
 [1]  7  9  9 10 10 10 10 11 11 13

[[3]]
 [1]  3  3  6  7  7 10 10 10 11 13 30

> sapply(data,mean)
[1] 10 10 10
> sapply(data,median)
[1] 10 10 10
> sapply(data,range)
     [,1] [,2] [,3]
[1,]    7    7    3
[2,]   13   13   30
>
>

> sapply(data,sd)
[1] 1.825742 1.563472 7.362065
> sapply(data,var)
[1]  3.333333  2.444444 54.200000
>

Range

range
The textbook defines the range as the difference between the upper bound and the lower bound, but in R, `range()` returns the lower and upper bounds themselves. That is,

> sapply(data,range)
     [,1] [,2] [,3]
[1,]    7    7    3
[2,]   13   13   30

A: 13 - 7 = 6
B: 13 - 7 = 6
C: 30 - 3 = 27
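
To get the textbook's single-number range in R, one option is to take the difference of the two values that `range()` returns. A minimal sketch, reusing the three score vectors from above (the names `scores` and `textbook_range` are introduced here only for illustration):

```r
# Scores for players A, B, and C, copied from the vectors above
scores <- list(
  A = c(7, 8, 9, 9, 10, 10, 11, 11, 12, 13),
  B = c(7, 9, 9, 10, 10, 10, 10, 11, 11, 13),
  C = c(3, 3, 6, 7, 7, 10, 10, 10, 11, 13, 30)
)

# range() returns c(min, max); diff() turns that pair into max - min
textbook_range <- sapply(scores, function(x) diff(range(x)))
print(textbook_range)
```

This should reproduce the three subtractions above: 6, 6, and 27.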

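However, the range does not describe the distribution of the data accurately either. The first and second data sets below both have a range of 4 (8 to 12), yet their individual scores are distributed differently.

That is,

The problem of outliers (extreme values)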

a <- c(1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,5,5,5)
b <- c(1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,5,5,5, 10)

range(a) vs. range(b)

This difference in range between the two groups is caused by the outlier.
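
The comparison can be run directly. The sketch below uses hypothetical variable names (`without_outlier`, `with_outlier`) so that it does not overwrite the vectors used earlier, and it adds the median purely as a point of contrast:

```r
# Same scores, with and without one extreme value (10)
without_outlier <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5)
with_outlier <- c(without_outlier, 10)

# Textbook range = max - min
range_without <- diff(range(without_outlier))
range_with <- diff(range(with_outlier))

# The median, by contrast, is not moved by the single extreme value
median_without <- median(without_outlier)
median_with <- median(with_outlier)

print(c(range_without = range_without, range_with = range_with))
print(c(median_without = median_without, median_with = median_with))
```

A single added score more than doubles the range (from 4 to 9) while the median stays at 3.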

Quartile

quartile

> basket <- c(3,3,6,7,7,10,10,10,11,13,30)
> basket <- sort(basket)
> basket
 [1]  3  3  6  7  7 10 10 10 11 13 30
>

> quantile(basket)
  0%  25%  50%  75% 100% 
 3.0  6.5 10.0 10.5 30.0
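
R's `quantile()` interpolates between neighbouring values by default (its documented default is `type = 7`), which is why 6.5 and 10.5 appear even though neither is in the data. A sketch of how the 25% value comes about under that default; `h` and `q25_by_hand` are names made up for this example:

```r
basket <- sort(c(3, 3, 6, 7, 7, 10, 10, 10, 11, 13, 30))
n <- length(basket)

# Default (type 7) position of the p-th quantile: h = (n - 1) * p + 1
p <- 0.25
h <- (n - 1) * p + 1   # 3.5, halfway between the 3rd and 4th values

# Linear interpolation between the two neighbouring sorted values
lower <- basket[floor(h)]
upper <- basket[ceiling(h)]
q25_by_hand <- lower + (h - floor(h)) * (upper - lower)

print(q25_by_hand)
print(quantile(basket, 0.25))
```

Other `type` values use different position rules, so hand calculations from a textbook may not match the default output exactly.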

Percentile

How to find percentile

First of all, line all your values up in ascending order.
To find the position of the kth percentile out of n numbers, start off by calculating $ k(\frac{n}{100}) $.
If this gives you an integer, then your percentile is halfway between the value at position $ k(\frac{n}{100})$ and the next number along. Take the average of the numbers at these two positions to give you your percentile.
If $ k(\frac{n}{100})$ is not an integer, then round it up. This then gives you the position of the percentile.

> k <- c(1:125)
> length(k)
[1] 125
> k
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
 [21]  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40
 [41]  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
 [61]  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80
 [81]  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
[101] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
[121] 121 122 123 124 125
>

To find the 10th percentile:
10 * ( 125 / 100) = 12.5
This is not an integer, so round it up to 13; the 13th number is the 10th percentile (13).

> k <- c(1:10)
> length(k)
[1] 10
> k
 [1]  1  2  3  4  5  6  7  8  9 10

To find the 20th percentile:
$ 20 * (10 /100) = 2 $, which is an integer,
so the percentile is the average of the 2nd and 3rd scores, which is 2.5.
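
The rule and the two worked examples above can be sketched as a small function. `textbook_percentile` is a hypothetical helper written for these notes, not a base R function. Clamping the position to the ends of the data (for k near 0 or 100) is an assumption added here, since the rule above does not cover those cases.

```r
# Percentile by the rule described above (not R's default quantile method)
textbook_percentile <- function(x, k) {
  x <- sort(x)
  n <- length(x)
  pos <- k * n / 100
  if (abs(pos - round(pos)) < 1e-9) {
    # Integer position: average the value there and the next one along
    pos <- round(pos)
    lo <- max(pos, 1)      # assumption: clamp to the first value
    hi <- min(pos + 1, n)  # assumption: clamp to the last value
    return(mean(c(x[lo], x[hi])))
  }
  # Non-integer position: round up and take that value
  x[ceiling(pos)]
}

p10 <- textbook_percentile(1:125, 10)  # position 12.5, so the 13th value
p20 <- textbook_percentile(1:10, 20)   # position 2, so average of 2nd and 3rd
print(c(p10 = p10, p20 = p20))
```

R's own `quantile()` uses a different interpolation rule by default, so its results generally differ from this one; `quantile(x, probs, type = 2)` is documented as averaging at discontinuities and appears to correspond to the rule used here.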

Boxplot

# j <- c(6,7,7,8,9,10,10,11,11,13)
j <- c(7,9,9,10,10,10,10,11,11,13)
# m <- c(3,3,6,7,7,10,10,10,11,13,30)
m <- c(3,3,6,7,8,9,9,10,11,13,30)

median(j)
median(m)

boxplot(j)
boxplot(m)

boxplot(j, m)
boxplot(j, m, horizontal = TRUE)
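
To see the numbers behind the plots, `boxplot.stats()` (from the grDevices package that ships with R) returns the values `boxplot()` draws. A short sketch using the same `j` and `m`; the names `stats_j` and `stats_m` are introduced here for illustration:

```r
j <- c(7, 9, 9, 10, 10, 10, 10, 11, 11, 13)
m <- c(3, 3, 6, 7, 8, 9, 9, 10, 11, 13, 30)

# boxplot.stats() returns, among other things:
#   $stats = lower whisker, lower hinge, median, upper hinge, upper whisker
#   $out   = points beyond the whiskers, drawn as individual dots
stats_j <- boxplot.stats(j)
stats_m <- boxplot.stats(m)

print(stats_j$stats)
print(stats_j$out)
print(stats_m$stats)
print(stats_m$out)
```

For `m`, the score 30 lies more than 1.5 box lengths above the upper hinge, so it is reported in `$out` and the upper whisker stops at 13.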

Variance

variance

$ \sum \text{deviation score}^2 = \sum \text{ds}^2 $

$ \sum \text{error}^2 $
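- error = the error made when an individual value is predicted from the mean
- the sum of the squared errors (made when predicting from the mean)
- the sum of the squares (of the errors)
- the sum of squares
- Sum of Squares (SS)

$ \sum \text{ds}^2 = \text{SS} = \text{Sum of Squares} $ ¹⁾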

$$ \text{variance} = \frac {SS}{n-1} = \frac {SS}{df}$$
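
The definition can be checked step by step in R. This is a minimal sketch using player A's scores from the beginning of the page; `ds`, `ss`, `df`, and `variance_by_hand` are names chosen for this example.

```r
x <- c(7, 8, 9, 9, 10, 10, 11, 11, 12, 13)  # player A's scores

ds <- x - mean(x)     # deviation scores (errors around the mean)
ss <- sum(ds^2)       # Sum of Squares
df <- length(x) - 1   # degrees of freedom, n - 1
variance_by_hand <- ss / df

print(c(SS = ss, df = df, variance = variance_by_hand))
print(var(x))         # built-in sample variance, for comparison
```

Here SS is 30 and df is 9, so the variance is 30 / 9, which agrees with the 3.333333 reported by `var()` earlier on the page.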
Calculation of variance (an easy way); see variance calculation.
- $ \displaystyle \frac{\sum X_{i}^2}{N} - \mu^2$ (this gives the population variance, $ \frac{SS}{N} $)
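
In its usual form the shortcut is the mean of the squared scores minus the squared mean, $ \frac{\sum X_i^2}{N} - \mu^2 $, and it yields $ SS/N $ rather than $ SS/(n-1) $. A quick numerical check under that reading, again with player A's scores (`shortcut`, `definition`, and `sample_var` are illustrative names):

```r
x <- c(7, 8, 9, 9, 10, 10, 11, 11, 12, 13)
N <- length(x)
mu <- mean(x)

shortcut <- sum(x^2) / N - mu^2       # mean of squares minus squared mean
definition <- sum((x - mu)^2) / N     # SS / N, the population variance
sample_var <- shortcut * N / (N - 1)  # rescaled to SS / (n - 1)

print(c(shortcut = shortcut, definition = definition))
print(c(sample_var = sample_var, var_builtin = var(x)))
```

The first two values agree with each other, and the rescaled value agrees with `var()`, which divides by n - 1.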