# COMMunicationRESearch.NET


# DATA

## Vectors

Vectors have the following properties:

1. Vectors are homogeneous: All elements of a vector must have the same type.
2. Vectors can be indexed by position: v[2] refers to the second element of v.
3. Vectors can be indexed by multiple positions, returning a subvector: v[c(2,3)] is a subvector of v that consists of the second and third elements.
4. Vector elements can have names: Vectors have a names property, the same length as the vector itself, that gives names to the elements:
> v <- c(10, 20, 30)
> names(v) <- c("Moe", "Larry", "Curly")
> print(v)
Moe Larry Curly
10    20    30
> v["Larry"]
Larry
20

## Lists

1. Lists are heterogeneous: Lists can contain elements of different types; in R terminology, list elements may have different modes. Lists can even contain other structured objects, such as lists and data frames; this allows you to create recursive data structures.
2. Lists can be indexed by position: lst[[2]] refers to the second element of lst. Note the double square brackets.
3. Lists let you extract sublists: lst[c(2,3)] is a sublist of lst that consists of the second and third elements. Note the single square brackets.
4. List elements can have names: Both lst[["Moe"]] and lst$Moe refer to the element named "Moe".

n = c(2, 3, 5)
s = c("aa", "bb", "cc", "dd", "ee")
b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
x = list(n, s, b, 3)   # x contains copies of n, s, b
> x[2]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"

> x[c(2, 4)]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"

[[2]]
[1] 3

> x[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
> x[[2]][2]
[1] "bb"
x[[2]][1] <- "xx"   # instead of "aa"
x[[2]]

## Mode: Physical Type

| Object | Example | Mode |
|---|---|---|
| Number | 3.1415 | numeric |
| Vector of numbers | c(2, 7.182, 3.1415) | numeric |
| Character string | "Moe" | character |
| Vector of character strings | c("Moe", "Larry", "Curly") | character |
| Factor | factor(c("NY", "CA", "IL")) | numeric |
| List | list("Moe", "Larry", "Curly") | list |
| Data frame | data.frame(x=1:3, y=c("NY", "CA", "IL")) | list |
| Function | print | function |

## Class

In R, every object also has a class, which defines its abstract type. The terminology is borrowed from object-oriented programming. A single number could represent many different things: a distance, a point in time, a weight. All those objects have a mode of "numeric" because they are stored as a number; but they could have different classes to indicate their interpretation.

For example, a Date object consists of a single number:

> d <- as.Date("2010-03-15")
> mode(d)
[1] "numeric"
> length(d)
[1] 1

But it has a class of Date, telling us how to interpret that number; namely, as the number of days since January 1, 1970:

> class(d)
[1] "Date"

## Scalars

## Matrices

> A <- 1:6
> dim(A)
NULL
> print(A)
[1] 1 2 3 4 5 6

We give dimensions to the vector when we set its dim attribute. Watch what happens when we set our vector dimensions to 2 × 3 and print it:

> dim(A) <- c(2,3)
> print(A)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

A matrix can be created from a list, too. Like a vector, a list has a dim attribute, which is initially NULL:

> B <- list(1,2,3,4,5,6)
> dim(B)
NULL

If we set the dim attribute, it gives the list a shape:

> dim(B) <- c(2,3)
> print(B)
     [,1] [,2] [,3]
[1,] 1    3    5
[2,] 2    4    6

## Arrays

The discussion of matrices can be generalized to 3-dimensional or even n-dimensional structures: just assign more dimensions to the underlying vector (or list). The following example creates a 3-dimensional array with dimensions 2 × 3 × 2:

> D <- 1:12
> dim(D) <- c(2,3,2)
> print(D)
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

## Factors

A factor looks like a vector, but it has special properties. R keeps track of the unique values in a vector, and each unique value is called a level of the associated factor. R uses a compact representation for factors, which makes them efficient for storage in data frames. In other programming languages, a factor would be represented by a vector of enumerated values.

There are two key uses for factors:

1. Categorical variables: A factor can represent a categorical variable. Categorical variables are used in contingency tables, linear regression, analysis of variance (ANOVA), logistic regression, and many other areas.
2. Grouping: This is a technique for labeling or tagging your data items according to their group. See the Introduction to Chapter 6.

> A <- c(1,2,2,3,3,4,4,4,4,2,1,2,3,3)
> A
 [1] 1 2 2 3 3 4 4 4 4 2 1 2 3 3
> str(A)
 num [1:14] 1 2 2 3 3 4 4 4 4 2 ...
> fA <- factor(A)
> fA
 [1] 1 2 2 3 3 4 4 4 4 2 1 2 3 3
Levels: 1 2 3 4
> str(fA)
 Factor w/ 4 levels "1","2","3","4": 1 2 2 3 3 4 4 4 4 2 ...

## Data Frames

A data frame is a powerful and flexible structure. Most serious R applications involve data frames. A data frame is intended to mimic a dataset, such as one you might encounter in SAS or SPSS.

A data frame is a tabular (rectangular) data structure, which means that it has rows and columns. It is not implemented by a matrix, however. Rather, a data frame is a list:

• The elements of the list are vectors and/or factors (see footnote 1).
• Those vectors and factors are the columns of the data frame.
• The vectors and factors must all have the same length; in other words, all columns must have the same height.
• The equal-height columns give a rectangular shape to the data frame.
• The columns must have names.

Because a data frame is both a list and a rectangular structure, R provides two different paradigms for accessing its contents:

• You can use list operators to extract columns from a data frame, such as dfrm[i], dfrm[[i]], or dfrm$name.
• You can use matrix-like notation, such as dfrm[i,j], dfrm[i,], or dfrm[,j].
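
The "data frame is a list of equal-height columns" idea can be checked directly. The following is only a small sketch; it reuses the stooges values that appear later on this page, and the variable names col_lengths, by_list, by_matrix, and cell are chosen here for illustration.

```r
# A small data frame (values taken from the stooges example on this page)
stooges <- data.frame(name    = c("Moe", "Larry", "Curly"),
                      n.marry = c(1, 1, 4),
                      n.child = c(2, 2, 2))

is.list(stooges)                         # a data frame is a list underneath
length(stooges)                          # list length = number of columns
col_lengths <- sapply(stooges, length)   # every column has the same height

by_list   <- stooges[[2]]                # list paradigm: second column
by_matrix <- stooges[, 2]                # matrix paradigm: same column
cell      <- stooges[3, "n.marry"]       # matrix paradigm: a single cell
```

Both paradigms reach the same column here; which one reads better is mostly a matter of taste.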

# Appending Data to a Vector

> v <- c(1,2,3)
> v <- c(v,4)           # Append a single value to v
> v
[1] 1 2 3 4
> w <- c(5,6,7,8)
> v <- c(v,w)           # Append an entire vector to v
> v
[1] 1 2 3 4 5 6 7 8
> v <- c(1,2,3)         # Create a vector of three elements
> v[10] <- 10           # Assign to the 10th element
> v                     # R extends the vector automatically
[1]  1  2  3 NA NA NA NA NA NA 10

# Inserting Data into a Vector

> append(1:10, 99)
[1]  1  2  3  4  5  6  7  8  9 10 99
> append(1:10, 99, after=5)
[1]  1  2  3  4  5 99  6  7  8  9 10
> append(1:10, 99, after=0)
[1] 99  1  2  3  4  5  6  7  8  9 10

# Understanding the Recycling Rule

> (1:6) + (1:3)
[1] 2 4 6 5 7 9
>
1 2 3 4 5 6
1 2 3 1 2 3
2 4 6 5 7 9
> cbind(1:6)
[,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
[5,]    5
[6,]    6
> cbind(1:3)
[,1]
[1,]    1
[2,]    2
[3,]    3
> cbind(1:6, 1:3)
[,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
[4,]    4    1
[5,]    5    2
[6,]    6    3
> (1:6) + (1:5)          # Oops! 1:5 is one element too short
[1]  2  4  6  8 10  7
Warning message:
In (1:6) + (1:5) :
longer object length is not a multiple of shorter object length
> (1:6) + 10
[1] 11 12 13 14 15 16

# Creating a Factor (Categorical Variable)

> f <- factor(v)          # v is a vector of strings or integers
> f <- factor(v, levels)
> f <- factor(c("Win","Win","Lose","Tie","Win","Lose"))
> f
[1] Win  Win  Lose Tie  Win  Lose
Levels: Lose Tie Win

Add the below line before entering the textbook code.

> wday <- c("Wed", "Thu", "Mon", "Wed", "Thu", "Thu", "Thu", "Tue", "Thu", "Tue")
> f <- factor(wday)
> f
[1] Wed Thu Mon Wed Thu Thu Thu Tue Thu Tue
Levels: Mon Thu Tue Wed
> f <- factor(wday, c("Mon","Tue","Wed","Thu","Fri")) # c(...) part means "levels" not data
> f  # note that there is no Fri in the below output.
[1] Wed Thu Mon Wed Thu Thu Thu Tue Thu Tue
Levels: Mon Tue Wed Thu Fri
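
Declaring a level that never occurs in the data (Fri above) still has an effect. The sketch below, with names counts, codes, and f2 picked for illustration, shows that table reports a zero count for the unused level, that the factor stores positions within the levels, and that droplevels removes unused levels again.

```r
wday <- c("Wed", "Thu", "Mon", "Wed", "Thu", "Thu", "Thu", "Tue", "Thu", "Tue")
f <- factor(wday, levels = c("Mon", "Tue", "Wed", "Thu", "Fri"))

counts <- table(f)       # Fri is reported with a count of 0, not omitted
codes  <- as.integer(f)  # each value stored as its position in levels(f)
f2     <- droplevels(f)  # drop the unused level Fri

print(counts)
levels(f2)
```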

# Combining Multiple Vectors into One Vector and a Factor

> comb <- stack(list(v1=v1, v2=v2, v3=v3))     # Combine 3 vectors

Why in the world would you want to mash all your data into one big vector and a parallel factor? The reason is that many important statistical functions require the data in that format. Suppose you survey freshmen, sophomores, and juniors regarding their confidence level (“What percentage of the time do you feel confident in school?”). Now you have three vectors, called freshmen, sophomores, and juniors. You want to perform an ANOVA analysis of the differences between the groups. The ANOVA function, aov, requires one vector with the survey results as well as a parallel factor that identifies the group. You can combine the groups using the stack function:

  freshmen sophomores juniors
1      .60        .70     .76
2      .35        .61     .72
3      .44        .63     .92
4      .62        .87     .87
5      .60        .85
6                 .70
7                 .64
freshmen <- c(0.6, 0.35, 0.44, 0.62, 0.6)
sophomores <- c(0.7, 0.61, 0.63, 0.87, 0.85, 0.7, 0.64)
juniors <- c(.76, .72, .92, .87)
> comb <- stack(list(fresh=freshmen, soph=sophomores, jrs=juniors))
> print(comb)
values   ind
1    0.60 fresh
2    0.35 fresh
3    0.44 fresh
4    0.62 fresh
5    0.60 fresh
6    0.70  soph
7    0.61  soph
8    0.63  soph
9    0.87  soph
10   0.85  soph
11   0.70  soph
12   0.64  soph
13   0.76   jrs
14   0.72   jrs
15   0.92   jrs
16   0.87   jrs

Now you can perform the ANOVA analysis on the two columns:

> aov(values ~ ind, data=comb)

When building the list we must provide tags for the list elements (the tags are fresh, soph, and jrs in this example). Those tags are required because stack uses them as the levels of the parallel factor.

Annoyed by the funky variable names (column names)?

colnames(comb) <- c("score", "year")
aov(score ~ year, data=comb)
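
Putting the pieces of this recipe together, a minimal end-to-end sketch might look like the following. The names group_means and fit are introduced here for illustration; tapply is used only to show the per-group means before the ANOVA table is printed.

```r
freshmen   <- c(0.6, 0.35, 0.44, 0.62, 0.6)
sophomores <- c(0.7, 0.61, 0.63, 0.87, 0.85, 0.7, 0.64)
juniors    <- c(.76, .72, .92, .87)

# One value column plus a parallel factor naming the group
comb <- stack(list(fresh = freshmen, soph = sophomores, jrs = juniors))
colnames(comb) <- c("score", "year")

group_means <- tapply(comb$score, comb$year, mean)  # mean score per group
fit <- aov(score ~ year, data = comb)               # one-way ANOVA
summary(fit)                                        # print the ANOVA table
```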

# Creating a List

> lst <- list(0.5, 0.841, 0.977)
> lst
[[1]]
[1] 0.5

[[2]]
[1] 0.841

[[3]]
[1] 0.977

When R prints the list, it identifies each list element by its position (1, 2, 3) and prints the element’s value (e.g., [1] 0.5) under its position. More usefully, lists can — unlike vectors — contain elements of different modes (types). Here is an extreme example of a mongrel created from a scalar, a character string, a vector, and a function:

> lst <- list(3.14, "Moe", c(1,1,2,3), mean)
> lst
[[1]]
[1] 3.14

[[2]]
[1] "Moe"

[[3]]
[1] 1 1 2 3

[[4]]
function (x, ...)
UseMethod("mean")
<environment: namespace:base>

You can also build a list by creating an empty list and populating it. Here is our “mongrel” example built in that way:

> lst <- list()
> lst[[1]] <- 3.14
> lst[[2]] <- "Moe"
> lst[[3]] <- c(1,1,2,3)
> lst[[4]] <- mean

You can also name the elements when you create the list:

> lst <- list(mid=0.5, right=0.841, far.right=0.977)
> lst
$mid
[1] 0.5

$right
[1] 0.841

$far.right
[1] 0.977

# Selecting List Elements by Position

> years <- list(1960, 1964, 1976, 1994)
> years
[[1]]
[1] 1960

[[2]]
[1] 1964

[[3]]
[1] 1976

[[4]]
[1] 1994

> years[[1]]
[1] 1960

lst[[n]]

This is an element, not a list. It is the nth element of lst.

lst[n]

This is a list, not an element. The list contains one element, taken from the nth element of lst. This is a special case of lst[c(n1, n2, …, nk)] in which we eliminated the c(…) construct because there is only one n.

> class(years[[1]])
[1] "numeric"
> class(years[1])
[1] "list"

# Selecting List Elements by Name

Use one of these forms. Here, lst is a list variable:

lst[["name"]]

Selects the element called name. Returns NULL if no element has that name.

lst$name

Same as previous, just different syntax.

lst[c(name1, name2, ..., namek)]

Returns a list built from the indicated elements of lst.
Note that the first two forms return an element whereas the third form returns a list.

> years <- list(Kennedy=1960, Johnson=1964, Carter=1976, Clinton=1994)

The below has the same effects as the above.

years <- list(1960, 1964, 1976, 1994)
names(years) <- c("Kennedy", "Johnson", "Carter", "Clinton")

These next two expressions return the same value—namely, the element that is named “Kennedy”:

> years[["Kennedy"]]
[1] 1960
> years$Kennedy
[1] 1960

The following two expressions return sublists extracted from years:

> years[c("Kennedy","Johnson")]
$Kennedy
[1] 1960

$Johnson
[1] 1964

> years["Carter"]
$Carter
[1] 1976

# Removing an Element from a List

> years <- list(Kennedy=1960, Johnson=1964, Carter=1976, Clinton=1994)
> years
$Kennedy
[1] 1960

$Johnson
[1] 1964

$Carter
[1] 1976

$Clinton
[1] 1994

> years[["Johnson"]] <- NULL              # Remove the element labeled "Johnson"
> years
$Kennedy
[1] 1960

$Carter
[1] 1976

$Clinton
[1] 1994

You can remove multiple elements this way, too:

> years[c("Carter","Clinton")] <- NULL    # Remove two elements
> years
$Kennedy
[1] 1960

# Removing NULL Elements from a List

> lst[sapply(lst, is.null)] <- NULL
1. R calls sapply to apply the is.null function to every element of the list.
2. sapply returns a vector of logical values that are TRUE wherever the corresponding list element is NULL.
3. R selects values from the list according to that vector.
4. R assigns NULL to the selected items, removing them from the list.
> lst <- list("Moe", NULL, "Curly")          # Create list with NULL element
> lst
[[1]]
[1] "Moe"

[[2]]
NULL

[[3]]
[1] "Curly"

> lst[sapply(lst, is.null)] <- NULL          # Remove NULL element from list
> lst
[[1]]
[1] "Moe"

[[2]]
[1] "Curly"
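
The same pattern removes elements selected by other conditions, such as negative values, zeros, or NAs (this assumes each element of lst is a single number):

> lst[lst < 0] <- NULL          # Remove negative elements
> lst[lst == 0] <- NULL         # Remove elements equal to zero
> lst[is.na(lst)] <- NULL       # Remove NA elements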

# Initializing a Matrix

> theData <- c(1.1, 1.2, 2.1, 2.2, 3.1, 3.2)
> mat <- matrix(theData, 2, 3)
> mat
[,1] [,2] [,3]
[1,]  1.1  2.1  3.1
[2,]  1.2  2.2  3.2
matrix(data, row, col)

If data is a single value, the recycling rule fills every cell of the matrix with it.

> matrix(0, 2, 3)          # Create an all-zeros matrix
[,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
> matrix(NA, 2, 3)         # Create a matrix populated with NA
[,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA

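The data can also be given inline. By default, matrix fills the matrix column by column:

> mat <- matrix(c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3), 2, 3)

With byrow=TRUE the matrix is filled row by row, so the layout of the code matches the layout of the matrix:

> theData <- c(1.1, 1.2, 1.3,
+              2.1, 2.2, 2.3)
> mat <- matrix(theData, 2, 3, byrow=TRUE)

Condensed version:

> mat <- matrix(c(1.1, 1.2, 1.3,
+                 2.1, 2.2, 2.3),
+               2, 3, byrow=TRUE)

Setting the dim attribute of a vector has the same effect as matrix(v, 2, 3); the values are again laid out column by column: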

> v <- c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3)
> dim(v) <- c(2,3)
> v
[,1] [,2] [,3]
[1,]  1.1  1.3  2.2
[2,]  1.2  2.1  2.3

# Performing Matrix Operations

t(A)
Matrix transposition of A
solve(A)
Matrix inverse of A
A %*% B
Matrix multiplication of A and B
diag(n)
An n-by-n diagonal (identity) matrix
> mat <- matrix(c(1.1, 1.2, 1.3,
2.1, 2.2, 2.3),
2, 3, byrow=TRUE)
> mat
[,1] [,2] [,3]
[1,]  1.1  1.2  1.3
[2,]  2.1  2.2  2.3
> mat%*%t(mat)
[,1]  [,2]
[1,] 4.34  7.94
[2,] 7.94 14.54
> t(mat)%*%mat
[,1] [,2] [,3]
[1,] 5.62 5.94 6.26
[2,] 5.94 6.28 6.62
[3,] 6.26 6.62 6.98
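
The example above exercises t and %*% only. A short sketch of solve and diag follows; the 2 × 2 matrix A is made up for illustration and is chosen to be invertible.

```r
# Hypothetical invertible matrix, for illustration only
A <- matrix(c(2, 1,
              0, 3), 2, 2, byrow = TRUE)

A_inv <- solve(A)    # matrix inverse of A
I2    <- diag(2)     # 2-by-2 identity matrix
tA    <- t(A)        # transpose of A

# A times its inverse should equal the identity, up to rounding error
ok <- isTRUE(all.equal(A %*% A_inv, I2))
print(ok)
```

all.equal is used instead of == because the product is computed in floating point.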


# Giving Names to the Rows and Columns of a Matrix

> rownames(mat) <- c("rowname1", "rowname2", ..., "rownamem")
> colnames(mat) <- c("colname1", "colname2", ..., "colnamen")
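
As a concrete sketch of the template above, the names below (r1, r2, a, b, c) are invented for illustration; once set, they can be used in place of numeric positions when indexing.

```r
mat <- matrix(c(1.1, 1.2, 1.3,
                2.1, 2.2, 2.3), 2, 3, byrow = TRUE)

rownames(mat) <- c("r1", "r2")       # hypothetical row names
colnames(mat) <- c("a", "b", "c")    # hypothetical column names

one_cell <- mat["r2", "b"]   # index by name instead of position
col_c    <- mat[, "c"]       # a column, carrying the row names along
print(mat)
```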

# Selecting One Row or Column from a Matrix

> vec <- mat[1,]          # First row
> vec <- mat[,3]          # Third column

Normally, when you select one row or column from a matrix, R strips off the dimensions. The result is a dimensionless vector:

> mat[1,]
[1]  1  4  7 10
> mat[,3]
[1] 7 8 9

When you include the drop=FALSE argument, however, R retains the dimensions. In that case, selecting a row returns a row vector (a 1 × n matrix):

> mat[1,,drop=FALSE]
[,1] [,2] [,3] [,4]
[1,]    1    4    7   10

Likewise, selecting a column with drop=FALSE returns a column vector (an n × 1 matrix):

> mat[,3,drop=FALSE]
[,1]
[1,]    7
[2,]    8
[3,]    9
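
The matrix mat used in this section is not defined on the page. The sketch below assumes it is matrix(1:12, 3, 4), which reproduces the outputs shown above, and compares the dim attribute with and without drop=FALSE.

```r
mat <- matrix(1:12, 3, 4)   # assumption: matches the outputs shown above

row_vec <- mat[1, ]                  # dimensions stripped: plain vector
row_mat <- mat[1, , drop = FALSE]    # dimensions kept: 1 x 4 matrix
col_mat <- mat[, 3, drop = FALSE]    # dimensions kept: 3 x 1 matrix

dim(row_vec)   # NULL
dim(row_mat)
dim(col_mat)
```

Keeping the dimensions matters when the result is passed on to matrix operations such as %*%.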

# Initializing a Data Frame from Column Data

Combining vectors

> dfrm <- data.frame(v1, v2, v3, f1, f2)

Combining lists

> dfrm <- as.data.frame(list.of.vectors)
pred1 <- c(-2.7528917, -0.3626909, -1.0416039, 1.266682, 0.7806372, -1.0832624, -2.0883305, -0.7063653, -0.8394022, -0.4966884)
pred2 <- c(-1.4078413, 0.31286963, -0.69685664, -1.27511434, -0.27292745, 0.73383339, 0.96816822, -0.84476203, 0.31530793, -0.08030948)
pred3 <- c("AM", "AM", "PM", "PM", "AM", "AM", "PM", "PM", "PM", "AM")
resp <- c(12.57715, 21.02418, 18.94694, 18.98153, 19.59455, 20.71605, 22.70062, 18.40691, 21.0093, 19.31253)
> dfrm <- data.frame(pred1, pred2, pred3, resp)
> dfrm
pred1       pred2 pred3     resp
1  -2.7528917 -1.40784130    AM 12.57715
2  -0.3626909  0.31286963    AM 21.02418
3  -1.0416039 -0.69685664    PM 18.94694
4   1.2666820 -1.27511434    PM 18.98153
5   0.7806372 -0.27292745    AM 19.59455
6  -1.0832624  0.73383339    AM 20.71605
7  -2.0883305  0.96816822    PM 22.70062
8  -0.7063653 -0.84476203    PM 18.40691
9  -0.8394022  0.31530793    PM 21.00930
10 -0.4966884 -0.08030948    AM 19.31253
> dfrm <- data.frame(p1=pred1, p2=pred2, p3=pred3, r=resp)
> dfrm
p1          p2 p3        r
1  -2.7528917 -1.40784130 AM 12.57715
2  -0.3626909  0.31286963 AM 21.02418
3  -1.0416039 -0.69685664 PM 18.94694
.
. (etc.)
.

Suppose that resp and pred3 contain additional data, like the below:

pred3 <- c(pred3, "PM")
resp <- c(resp, 20,30,40)

Now combining these vectors into a data frame fails, because the columns no longer have the same length:

dfrm <- data.frame(pred1, pred2, pred3, resp)

To fix this, you want to remove the additional data from the vectors so that each vector has 10 data elements. How would you do that?

pred3 <- pred3[-c(11)]
resp <- resp[-c(11:13)]
dfrm <- data.frame(pred1, pred2, pred3, resp)

Or, pad the short columns with NAs. How would you do that?

> lst <- list(p1=pred1, p2=pred2, p3=pred3, r=resp)
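
One possible answer to the NA question, offered as a sketch rather than the textbook's method: assigning a larger value to length(x) pads x with NA, so every column can be stretched to the longest length before conversion. The short vectors below are stand-ins invented for illustration, not the data above.

```r
# Shortened stand-ins for pred1, pred3 and resp (unequal lengths on purpose)
pred1 <- c(-2.75, -0.36, -1.04)
pred3 <- c("AM", "AM", "PM", "PM")
resp  <- c(12.6, 21.0, 18.9, 19.0, 19.6)

cols <- list(p1 = pred1, p3 = pred3, r = resp)
n <- max(lengths(cols))               # length of the longest column

# Extending a vector with length<- fills the new positions with NA
padded <- lapply(cols, function(x) { length(x) <- n; x })
dfrm2 <- as.data.frame(padded)
print(dfrm2)
```

Whether padding is appropriate depends on the data: the NAs line up with the last rows, which is only meaningful if the missing values really belong there.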

Alternatively, build a list of the (equal-length) vectors and convert it with as.data.frame:

> lst <- list(p1=pred1, p2=pred2, p3=pred3, r=resp)

> as.data.frame(lst)
p1          p2 p3        r
1  -2.7528917 -1.40784130 AM 12.57715
2  -0.3626909  0.31286963 AM 21.02418
3  -1.0416039 -0.69685664 PM 18.94694
.
. (etc.)
.

# Initializing a Data Frame from Row Data

The rbind() function combines vectors, matrices, or data frames by rows.

data1.csv
Subtype	Gender	Expression
A	m	-0.54
A	f	-0.8
B	f	-1.03
C	m	-0.41

data2.csv
Subtype	Gender	Expression
D	m	3.22
D	f	1.02
D	f	0.21
D	m	-0.04
D	m	2.11
B	m	-1.21
A	f	-0.2

> x1 <- read.csv("http://commres.net/wiki/_export/code/r/data_structures?codeblock=83", head=T, sep="	")
> x2 <- read.csv("http://commres.net/wiki/_export/code/r/data_structures?codeblock=84", head=T, sep="	")

> x <- rbind(x1,x2)
> x

Subtype Gender Expression
1        A      m      -0.54
2        A      f      -0.80
3        B      f      -1.03
4        C      m      -0.41
5        D      m       3.22
6        D      f       1.02
7        D      f       0.21
8        D      m      -0.04
9        D      m       2.11
10       B      m      -1.21
11       A      f      -0.20

# Appending Rows to a Data Frame

suburbs.csv
	city	county	state	pop
1	Chicago	Cook	IL	2853114
2	Kenosha	Kenosha	WI	90352
3	Aurora	Kane	IL	171782
4	Elgin	Kane	IL	94487
5	Gary	Lake(IN)	IN	102746
6	Joliet	Kendall	IL	106221
7	Naperville	DuPage	IL	147779
8	Arlington Heights	Cook	IL	76031
9	Bolingbrook	Will	IL	70834
10	Cicero	Cook	IL	72616
11	Evanston	Cook	IL	74239
12	Hammond	Lake(IN)	IN	83048
13	Palatine	Cook	IL	67232
14	Schaumburg	Cook	IL	75386
15	Skokie	Cook	IL	63348
16	Waukegan	Lake(IL)	IL	91452
suburbs <- read.csv("http://commres.net/wiki/_export/code/r/data_structures?codeblock=89", head=T, sep="	")
suburbs
. . . . .
suburbs$X <- NULL   # The X column should be deleted.
newRow <- data.frame(city="West Dundee", county="Kane", state="IL", pop=5428)
suburbs <- rbind(suburbs, newRow)
suburbs
                city   county state     pop
1            Chicago     Cook    IL 2853114
2            Kenosha  Kenosha    WI   90352
3             Aurora     Kane    IL  171782
4              Elgin     Kane    IL   94487
5               Gary Lake(IN)    IN  102746
6             Joliet  Kendall    IL  106221
7         Naperville   DuPage    IL  147779
8  Arlington Heights     Cook    IL   76031
9        Bolingbrook     Will    IL   70834
10            Cicero     Cook    IL   72616
11          Evanston     Cook    IL   74239
12           Hammond Lake(IN)    IN   83048
13          Palatine     Cook    IL   67232
14        Schaumburg     Cook    IL   75386
15            Skokie     Cook    IL   63348
16          Waukegan Lake(IL)    IL   91452
17       West Dundee     Kane    IL    5428

# Preallocating a Data Frame

suburbs.csv
	city	county	state	pop
1	Chicago	Cook	IL	2853114
2	Kenosha	Kenosha	WI	90352
3	Aurora	Kane	IL	171782
4	Elgin	Kane	IL	94487
5	Gary	Lake(IN)	IN	102746
6	Joliet	Kendall	IL	106221
7	Naperville	DuPage	IL	147779
8	Arlington Heights	Cook	IL	76031
9	Bolingbrook	Will	IL	70834
10	Cicero	Cook	IL	72616
11	Evanston	Cook	IL	74239
12	Hammond	Lake(IN)	IN	83048
13	Palatine	Cook	IL	67232
14	Schaumburg	Cook	IL	75386
15	Skokie	Cook	IL	63348
16	Waukegan	Lake(IL)	IL	91452
suburbs <- read.csv("http://commres.net/wiki/_export/code/r/data_structures?codeblock=92", head=T, sep="\t")
> suburbs[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
> suburbs[[3]]
 [1] Cook     Kenosha  Kane     Kane     Lake(IN) Kendall  DuPage   Cook     Will     Cook     Cook     Lake(IN) Cook     Cook     Cook
[16] Lake(IL)
Levels: Cook DuPage Kane Kendall Kenosha Lake(IL) Lake(IN) Will
> suburbs[[4]]
 [1] IL WI IL IL IN IL IL IL IL IL IL IN IL IL IL IL
Levels: IL IN WI

suburbs[[1]]
This returns one column.
suburbs[1]
This returns a data frame, and the data frame contains exactly one column. This is a special case of dfrm[c(n1,n2, …, nk)]. We don't need the c(…) construct because there is only one n.

# Selecting Data Frame Columns by Position

suburbs.csv
city	county	state	pop
Chicago	Cook	IL	2853114
Kenosha	Kenosha	WI	90352
Aurora	Kane	IL	171782
Elgin	Kane	IL	94487
Gary	Lake(IN)	IN	102746
Joliet	Kendall	IL	106221
Naperville	DuPage	IL	147779
Arlington Heights	Cook	IL	76031
Bolingbrook	Will	IL	70834
Cicero	Cook	IL	72616
Evanston	Cook	IL	74239
Hammond	Lake(IN)	IN	83048
Palatine	Cook	IL	67232
Schaumburg	Cook	IL	75386
Skokie	Cook	IL	63348
Waukegan	Lake(IL)	IL	91452
suburbs <- read.csv("http://commres.net/wiki/_export/code/r/data_structures?codeblock=97", head=T, sep="\t")
> suburbs[[1]]
 [1] "Chicago"           "Kenosha"           "Aurora"            "Elgin"
 [5] "Gary"              "Joliet"            "Naperville"        "Arlington Heights"
 [9] "Bolingbrook"       "Cicero"            "Evanston"          "Hammond"
[13] "Palatine"          "Schaumburg"        "Skokie"            "Waukegan"
> suburbs[1]
                city
1            Chicago
2            Kenosha
3             Aurora
4              Elgin
5               Gary
6             Joliet
7         Naperville
8  Arlington Heights
9        Bolingbrook
10            Cicero
11          Evanston
12           Hammond
13          Palatine
14        Schaumburg
15            Skokie
16          Waukegan
> suburbs[c(1,4)]
                city     pop
1            Chicago 2853114
2            Kenosha   90352
3             Aurora  171782
4              Elgin   94487
5               Gary  102746
6             Joliet  106221
7         Naperville  147779
8  Arlington Heights   76031
9        Bolingbrook   70834
10            Cicero   72616
11          Evanston   74239
12           Hammond   83048
13          Palatine   67232
14        Schaumburg   75386
15            Skokie   63348
16          Waukegan   91452

# Selecting Data Frame Columns by Name

dfrm[["name"]]
Returns one column, the column called name.
dfrm$name
Same as previous, just different syntax.
To select one or more columns and package them in a data frame, use these list expressions:
dfrm["name"]
Selects one column and packages it inside a data frame object.
dfrm[c("name1", "name2", ..., "namek")]
Selects several columns and packages them in a data frame.
You can use matrix-style subscripting to select one or more columns:
dfrm[, "name"]
Returns the named column.
dfrm[, c("name1", "name2", ..., "namek")]
Selects several columns and packages them in a data frame.
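
The forms listed above differ mainly in what they return. The sketch below checks the return type of each one on a small data frame (the stooges values from this page); the result names v1, d1, m1 and so on are chosen for illustration.

```r
dfrm <- data.frame(name    = c("Moe", "Larry", "Curly"),
                   n.marry = c(1, 1, 4),
                   n.child = c(2, 2, 2))

v1 <- dfrm[["n.marry"]]               # the column itself (a vector)
v2 <- dfrm$n.marry                    # same thing, different syntax
d1 <- dfrm["n.marry"]                 # one column, packaged in a data frame
d2 <- dfrm[c("name", "n.child")]      # several columns, in a data frame
m1 <- dfrm[, "n.marry"]               # matrix style: returns the column as a vector
m2 <- dfrm[, c("name", "n.child")]    # matrix style: returns a data frame

# Classes make the difference visible
sapply(list(v1 = v1, d1 = d1, m1 = m1, m2 = m2), class)
```

If a data frame is wanted even for a single column in matrix style, adding drop=FALSE, as in dfrm[, "n.marry", drop=FALSE], keeps it one.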

# Selecting Rows and Columns More Easily

Data set used in this section: Cars93 in the MASS package

install.packages("MASS")
library(MASS)
Cars93
Manufacturer          Model    Type Min.Price Price Max.Price MPG.city MPG.highway            AirBags DriveTrain Cylinders EngineSize
1          Acura        Integra   Small      12.9  15.9      18.8       25          31               None      Front         4        1.8
2          Acura         Legend Midsize      29.2  33.9      38.7       18          25 Driver & Passenger      Front         6        3.2
3           Audi             90 Compact      25.9  29.1      32.3       20          26        Driver only      Front         6        2.8
4           Audi            100 Midsize      30.8  37.7      44.6       19          26 Driver & Passenger      Front         6        2.8
. . . . .
subset(Cars93, select=Model, subset=(MPG.city > 30))
Model
31 Festiva
39   Metro
42   Civic
.
. (etc.)
.
subset(Cars93, select=c(Model,Min.Price,Max.Price),
+        subset=(Cylinders == 4 & Origin == "USA"))
Model Min.Price Max.Price
6        Century      14.2      17.3
12      Cavalier       8.5      18.3
13       Corsica      11.4      11.4
.
. (etc.)
.
subset(Cars93, select=c(Manufacturer,Model),
+        subset=c(MPG.highway > median(MPG.highway)))
Manufacturer         Model
1          Acura       Integra
5            BMW          535i
6          Buick       Century
.
. (etc.)
.

# Changing the Names of Data Frame Columns

mat <- c(-0.818, -0.667, -0.494, -0.819, -0.946, -0.205, 0.385, 1.531, -0.611, -2.155, -0.535, -0.316)
dim(mat) <- c(4,3)
mat
       [,1]   [,2]   [,3]
[1,] -0.818 -0.946 -0.611
[2,] -0.667 -0.205 -2.155
[3,] -0.494  0.385 -0.535
[4,] -0.819  1.531 -0.316

Converting the matrix gives the columns bland default names (V1, V2, V3):

as.data.frame(mat)
      V1     V2     V3
1 -0.818 -0.946 -0.611
2 -0.667 -0.205 -2.155
3 -0.494  0.385 -0.535
4 -0.819  1.531 -0.316
colnames(mat) <- c("before","treatment","after")
> mat
before treatment  after
[1,] -0.818    -0.946 -0.611
[2,] -0.667    -0.205 -2.155
[3,] -0.494     0.385 -0.535
[4,] -0.819     1.531 -0.316

> as.data.frame(mat)
before treatment  after
1 -0.818    -0.946 -0.611
2 -0.667    -0.205 -2.155
3 -0.494     0.385 -0.535
4 -0.819     1.531 -0.316

# Editing a Data Frame

> temp <- edit(mat)
mat <- temp      # Overwrite only if you're happy with the changes!
mat2 <- temp     # or....
# then, close the edit window

Can you save it as "mat.csv" and then read it back into the R workspace?

When you read the CSV file back, how would you avoid output like the below? That is, how would you avoid the X column?

  X before treatment  after
1 1 -0.818    -0.946 -0.611
2 2 -0.667    -0.205 -2.155
3 3 -0.494     0.385 -0.535
4 4 -0.819     1.531 -0.316

Or, better, how would you save the CSV file without the X column in the first place?
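
A possible answer to these questions, as a sketch: the X column is the row names that write.csv writes by default. Either tell read.csv that the first column holds row names, or leave the row names out when writing. The file is written to a temporary directory here (an assumption made for the example) rather than to the working directory.

```r
mat <- matrix(c(-0.818, -0.667, -0.494, -0.819, -0.946, -0.205,
                0.385, 1.531, -0.611, -2.155, -0.535, -0.316), 4, 3)
colnames(mat) <- c("before", "treatment", "after")
path <- file.path(tempdir(), "mat.csv")   # temporary location for the example

write.csv(mat, path)                      # default: row names are written too
with_x <- read.csv(path)                  # ...and come back as a column named X
no_x   <- read.csv(path, row.names = 1)   # option 1: read column 1 as row names

write.csv(mat, path, row.names = FALSE)   # option 2: do not write row names
clean <- read.csv(path)
names(clean)
```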

# Removing NAs from a Data Frame

Use na.omit to remove rows that contain any NA values.

> clean <- na.omit(dfrm)
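
A small sketch of what na.omit does, using a made-up data frame: a row is dropped if any of its columns is NA, which is the same set of rows that complete.cases marks as FALSE.

```r
# Hypothetical data with NAs in different columns
dfrm <- data.frame(x = c(1, NA, 3, 4),
                   y = c("a", "b", NA, "d"))

keep  <- complete.cases(dfrm)   # TRUE for rows without any NA
clean <- na.omit(dfrm)          # drops rows 2 and 3
print(clean)
```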

# Excluding Columns by Name

> subset(dfrm, select = -badboy)          # All columns except badboy
> cor(patient.data)
patient.id         pre     dosage        post
patient.id  1.00000000  0.02286906  0.3643084 -0.13798149
pre         0.02286906  1.00000000  0.2270821 -0.03269263
dosage      0.36430837  0.22708208  1.0000000 -0.42006280
post       -0.13798149 -0.03269263 -0.4200628  1.00000000

This correlation matrix includes the meaningless “correlation” between patient ID and other variables, which is annoying. We can exclude the patient ID column to clean up the output:

> cor(subset(patient.data, select = -patient.id))
pre     dosage        post
pre     1.00000000  0.2270821 -0.03269264
dosage  0.22708207  1.0000000 -0.42006280
post   -0.03269264 -0.4200628  1.00000000

We can exclude multiple columns by giving a vector of negated names:

> cor(subset(patient.data, select = c(-patient.id,-dosage)))
pre        post
pre   1.00000000 -0.03269264
post -0.03269264  1.00000000

# Combining Two Data Frames

> stooges
name n.marry n.child
1   Moe       1       2
2 Larry       1       2
3 Curly       4       2
> birth
birth.year  birth.place
1       1887  Bensonhurst
2       1902 Philadelphia
3       1903     Brooklyn
> cbind(stooges,birth)
name n.marry n.child birth.year  birth.place
1   Moe       1       2       1887  Bensonhurst
2 Larry       1       2       1902 Philadelphia
3 Curly       4       2       1903     Brooklyn

Use rbind to stack the rows of two data frames:

> stooges
name n.marry n.child
1   Moe       1       2
2 Larry       1       2
3 Curly       4       2
> guys
name n.marry n.child
1   Tom       4       2
2  Dick       1       4
3 Harry       1       1
> rbind(stooges,guys)
name n.marry n.child
1   Moe       1       2
2 Larry       1       2
3 Curly       4       2
4   Tom       4       2
5  Dick       1       4
6 Harry       1       1
1)
A data frame can be built from a mixture of vectors, factors, and matrices. The columns of the matrices become columns in the data frame. The number of rows in each matrix must match the length of the vectors and factors. In other words, all elements of a data frame must have the same height.
b/r_cookbook/data_structures.txt · Last modified: 2020/05/26 17:40 by hkimscil