Differences

This shows you the differences between two versions of the page.

--- r:data_transformations [2016/10/11 23:03] – created hkimscil
+++ r:data_transformations [2019/09/19 09:23] (current) – [Splitting a Vector into Groups] hkimscil
@@ Line 6: / Line 6: @@
 Warning message:
 패키지 ‘MASS’는 R 버전 3.2.5에서 작성되었습니다
-> split(Cars93$MPG.city, Cars93$Origin)
+> split(Cars93$MPG.city, Cars93$Origin) # Origin별로 MPG.city를 나눠라
 $USA
  [1] 22 19 16 19 16 16 25 25 19 21 18 15
@@ Line 39: / Line 39: @@
 [1] 23.86667
 >
+# or
+> sapply(g, mean)
+     USA  non-USA
+.95833 23.86667
+# or retain list format
+> lapply(g, mean)
+$USA
+[1] 20.95833
+$`non-USA`
+[1] 23.86667
 </code>
+====== Applying a Function to Each List Element ======
+<code>S1 <- c(89, 85, 85, 86, 88, 89, 86, 82, 96, 85, 93, 91,
+, 87, 94, 77, 87, 98, 85, 89, 95, 85, 93, 93,
+, 71, 97, 93, 75, 68, 98, 95, 79, 94, 98, 95)
+S2 <- c(60, 98, 94, 95, 99, 97, 100, 73, 93, 91, 98,
+, 66, 83, 77, 97, 91, 93, 71, 91, 95, 100,
+, 96, 91, 76, 100, 97, 99, 95, 97, 77, 94,
+, 88, 100, 94, 93, 86)
+S3 <- c(95, 86, 90, 90, 75, 83, 96, 85, 83, 84, 81, 98,
+, 94, 84, 89, 93, 99, 91, 77, 95, 90, 91, 87,
+, 76, 99, 99, 97, 97, 97, 77, 93, 96, 90, 87,
+, 88)
+S4 <- c(67, 93, 63, 83, 87, 97, 96, 92, 93, 96, 87, 90,
+, 90, 82, 91, 85, 93, 83, 90, 87, 99, 94, 88,
+, 72, 81, 93, 93, 94, 97, 89, 96, 95, 82, 97)
+scores <- list(S1=S1,S2=S2,S3=S3,S4=S4)
+</code>
+<code>scores
+$S1
+ [1] 89 85 85 86 88 89 86 82 96 85 93 91 98 87 94 77 87 98 85 89
+[21] 95 85 93 93 97 71 97 93 75 68 98 95 79 94 98 95
+$S2
+ [1]  60  98  94  95  99  97 100  73  93  91  98  86  66  83  77
+[16]  97  91  93  71  91  95 100  72  96  91  76 100  97  99  95
+[31]  97  77  94  99  88 100  94  93  86
+$S3
+ [1] 95 86 90 90 75 83 96 85 83 84 81 98 77 94 84 89 93 99 91 77
+[21] 95 90 91 87 85 76 99 99 97 97 97 77 93 96 90 87 97 88
+$S4
+ [1] 67 93 63 83 87 97 96 92 93 96 87 90 94 90 82 91 85 93 83 90
+[21] 87 99 94 88 90 72 81 93 93 94 97 89 96 95 82 97
+</code>
+**lapply(list_name, function)**
+<code>lapply(scores, length)
+$S1
+[1] 36
+$S2
+[1] 39
+$S3
+[1] 38
+$S4
+[1] 36
+</code>
+**sapply(list_name, function)**
+<code>> sapply(scores, length)
+S1 S2 S3 S4
+39 38 36
+</code>
+<code>> sapply(scores, mean)
+      S1       S2       S3       S4
+.77778 89.79487 89.23684 88.86111
+> sapply(scores, sd)
+       S1        S2        S3        S4
+.720515 10.543592  7.178926  8.208542
+</code>
+If the called function returns a vector, sapply will form the results into **a matrix**. The range function, for example, returns a two-element vector:
+<code>> sapply(scores, range)
+     S1  S2 S3 S4
+[1,] 68  60 75 63
+[2,] 98 100 99 99
+</code>
+If the called function returns a structured object, such as a list, then you will need to use lapply rather than sapply. Structured objects cannot be put into a vector. Suppose we want to perform a t test on every semester. The t.test function returns a list, so we must use lapply:
+<code>> tests <- lapply(scores, t.test)
+</code>
+====== Applying a Function to Every Row ======
+<code>> longdata<- c(-1.850152, -1.406571, -1.0104817, -3.7170704,
+           -0.2804896, 0.9496313, 1.346517, -0.1580926, 1.6272786,
+           -2.4483321, -0.5407272, -1.708678, -0.3480616, -0.2757667,
+           -1.2177024)
+> long <- matrix(longdata, 3,5)
+> colnames(long) <- c("trial1","trial2","trial3","trial4","trial5")
+> rownames(long) <- c("Moe", "Larry", "Curly")
+> long
+         trial1     trial2     trial3     trial4     trial5
+Moe   -1.850152 -3.7170704  1.3465170  2.4483321 -0.3480616
+Larry -1.406571 -0.2804896 -0.1580926 -0.5407272 -0.2757667
+Curly -1.010482  0.9496313  1.6272786 -1.7086780 -1.2177024
+</code>
+<code>apply(long, 1, mean)
+       Moe      Larry      Curly
+-1.6529530  1.2427334 -0.8181872
+</code>
+<code>apply(long, 1, range)
+            Moe      Larry      Curly
+[1,] -3.7170704 -0.1580926 -1.7086779
+[2,] -0.2804896  2.4483321 -0.2757667
+</code>
+====== Applying a Function to Every Column ======
+<code>apply(matrix, 2, function)
+</code>
+-> row by row
+-> column by column
+====== Applying a Function to Groups of Data ======
+<code>> tapply(vector, factor, function)
+</code>
+<code csv suburbs.csv>
+city	county	state	pop
+Chicago	Cook	IL	2853114
+Kenosha	Kenosha	WI	90352
+Aurora	Kane	IL	171782
+Elgin	Kane	IL	94487
+Gary	Lake(IN)	IN	102746
+Joliet	Kendall	IL	106221
+Naperville	DuPage	IL	147779
+Arlington Heights	Cook	IL	76031
+Bolingbrook	Will	IL	70834
+Cicero	Cook	IL	72616
+Evanston	Cook	IL	74239
+Hammond	Lake(IN)	IN	83048
+Palatine	Cook	IL	67232
+Schaumburg	Cook	IL	75386
+Skokie	Cook	IL	63348
+Waukegan	Lake(IL)	IL	91452
+</code>
+<code>suburbs <- read.csv("suburbs.csv", head=T, sep="	")
+suburbs <- read.csv("http://commres.net/wiki/_export/code/r/data_transformations?codeblock=15", head=T, sep="	")
+</code>
+<code>
+> attach(suburbs)
+> pop
+ [1] 2853114   90352  171782   94487  102746  106221  147779   76031   70834
+[10]   72616   74239   83048   67232   75386   63348   91452
+We can easily compute sums and averages for all the cities:
+> sum(pop)
+[1] 4240667
+> mean(pop)
+[1] 265041.7
+</code>
+factors by county = 8
+<code>> county
+ [1] Cook     Kenosha  Kane     Kane     Lake(IN) Kendall
+ [7] DuPage   Cook     Will     Cook     Cook     Lake(IN)
+[13] Cook     Cook     Cook     Lake(IL)
+Levels: Cook DuPage Kane Kendall Kenosha Lake(IL) ... Will
+</code>
+<code>> tapply(pop, county, sum)
+    Cook   DuPage     Kane  Kendall  Kenosha Lake(IL) Lake(IN)     Will
+ 3281966   147779   266269   106221    90352    91452   185794    70834
+> tapply(pop,county,mean)
+    Cook   DuPage     Kane  Kendall  Kenosha Lake(IL) Lake(IN)     Will
+.3 147779.0 133134.5 106221.0  90352.0  91452.0  92897.0  70834.0
+</code>
+The function given to tapply should expect a single argument: a vector containing all the members of one group. A good example is the length function, which takes a vector parameter and returns the vector’s length. Use it to count the number of data in each group; in this case, the number of cities in each county:
+<code>> tapply(pop,county,length)
+    Cook   DuPage     Kane  Kendall  Kenosha Lake(IL) Lake(IN)     Will
+        1        2        1        1        1        2        1
+</code>
+====== Applying a Function to Groups of Rows ======
+<code>> by(dfrm, fact, fun)</code>
+dfrm = the data frame,
+fact = grouping factor,
+fun = function. The function should expect one argument, a data frame.
+<code>library("MASS")
+sel <- Cars93[c("Origin", "Manufacturer", "MPG.city", "MPG.highway", "EngineSize")]
+> by(sel, sel$Orig, summary)
+sel$Orig: USA
+     Origin       Manufacturer    MPG.city      MPG.highway      EngineSize
+ USA    :48   Chevrolet : 8    Min.   :15.00   Min.   :20.00   Min.   :1.300
+ non-USA: 0   Ford      : 8    1st Qu.:18.00   1st Qu.:26.00   1st Qu.:2.200
+              Dodge     : 6    Median :20.00   Median :28.00   Median :3.000
+              Pontiac   : 5    Mean   :20.96   Mean   :28.15   Mean   :3.067
+              Buick     : 4    3rd Qu.:23.00   3rd Qu.:30.00   3rd Qu.:3.800
+              Oldsmobile: 4    Max.   :31.00   Max.   :41.00   Max.   :5.700
+              (Other)   :13
+------------------------------------------------------------------
+sel$Orig: non-USA
+     Origin       Manufacturer    MPG.city      MPG.highway      EngineSize
+ USA    : 0   Mazda     : 5    Min.   :17.00   Min.   :21.00   Min.   :1.000
+ non-USA:45   Hyundai   : 4    1st Qu.:19.00   1st Qu.:25.00   1st Qu.:1.600
+              Nissan    : 4    Median :22.00   Median :30.00   Median :2.200
+              Toyota    : 4    Mean   :23.87   Mean   :30.09   Mean   :2.242
+              Volkswagen: 4    3rd Qu.:26.00   3rd Qu.:33.00   3rd Qu.:2.800
+              Honda     : 3    Max.   :46.00   Max.   :50.00   Max.   :4.500
+              (Other)   :21
+</code>
+<code>tapply(suburbs$pop, suburbs$state, summary)
+</code>
+<code>by(suburbs$pop, suburbs$state, summary)
+</code>