DATA

Vectors

  1. Vector
  2. Vectors are homogeneous: All elements of a vector must have the same type.
  3. Vectors can be indexed by position: v[2] refers to the second element of v.
  4. Vectors can be indexed by multiple positions, returning a subvector: v[c(2,3)] is a subvector of v that consists of the second and third elements.
  5. Vector elements can have names: Vectors have a names property, the same length as the vector itself, that gives names to the elements:
> v <- c(10, 20, 30)
> names(v) <- c("Moe", "Larry", "Curly")
> print(v)
  Moe Larry Curly 
   10    20    30
> v["Larry"]
Larry 
   20
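
As a quick check of the properties listed above, here is a minimal sketch. It is not part of the original notes, and the variable name mixed is made up for illustration. It shows that combining values of different types does not fail; R coerces all elements to one common type, which keeps the vector homogeneous.

```r
# Mixing types in c() coerces every element to the most general type
# present (here, character), so the vector stays homogeneous.
mixed <- c(1, "a", TRUE)
print(mode(mixed))

v <- c(Moe = 10, Larry = 20, Curly = 30)
print(v[2])          # second element, selected by position
print(v[c(2, 3)])    # subvector: second and third elements
print(v["Larry"])    # the same element as v[2], selected by name
```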

Lists

  1. Lists are heterogeneous: Lists can contain elements of different types; in R terminology, list elements may have different modes. Lists can even contain other structured objects, such as lists and data frames; this allows you to create recursive data structures.
  2. Lists can be indexed by position: lst[[2]] refers to the second element of lst. Note the double square brackets.
  3. Lists let you extract sublists: lst[c(2,3)] is a sublist of lst that consists of the second and third elements. Note the single square brackets.
  4. List elements can have names: Both lst[["Moe"]] and lst$Moe refer to the element named “Moe”.
n = c(2, 3, 5) 
s = c("aa", "bb", "cc", "dd", "ee") 
b = c(TRUE, FALSE, TRUE, FALSE, FALSE) 
x = list(n, s, b, 3)   # x contains copies of n, s, b
> x[2] 
[[1]] 
[1] "aa" "bb" "cc" "dd" "ee"
> x[c(2, 4)] 
[[1]] 
[1] "aa" "bb" "cc" "dd" "ee" 
 
[[2]] 
[1] 3
> x[[2]] 
[1] "aa" "bb" "cc" "dd" "ee"

> x[[2]][2]
[1] "bb"
> x[[2]][1] <- "xx"   # Replace "aa" with "xx"
> x[[2]]
[1] "xx" "bb" "cc" "dd" "ee"

Mode: Physical Type

Object                        Example                                   Mode
Number                        3.1415                                    numeric
Vector of numbers             c(2, 7.182, 3.1415)                       numeric
Character string              "Moe"                                     character
Vector of character strings   c("Moe", "Larry", "Curly")                character
Factor                        factor(c("NY", "CA", "IL"))               numeric
List                          list("Moe", "Larry", "Curly")             list
Data frame                    data.frame(x=1:3, y=c("NY", "CA", "IL"))  list
Function                      print                                     function
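
The modes in the table can be checked directly with the mode() function. The sketch below is an illustration only; the list name objects and its element names are made up for the example.

```r
# Build one object of each kind from the table and ask R for its mode.
objects <- list(
  number     = 3.1415,
  numbers    = c(2, 7.182, 3.1415),
  string     = "Moe",
  strings    = c("Moe", "Larry", "Curly"),
  factor     = factor(c("NY", "CA", "IL")),
  list       = list("Moe", "Larry", "Curly"),
  data_frame = data.frame(x = 1:3, y = c("NY", "CA", "IL")),
  fun        = print
)
modes <- sapply(objects, mode)   # named character vector of modes
print(modes)
```

Note that the factor reports a mode of numeric and the data frame a mode of list, which reflects how they are stored rather than how they are used.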

Class

In R, every object also has a class, which defines its abstract type. The terminology is borrowed from object-oriented programming. A single number could represent many different things: a distance, a point in time, a weight. All those objects have a mode of “numeric” because they are stored as a number; but they could have different classes to indicate their interpretation.
For example, a Date object consists of a single number:

> d <- as.Date("2010-03-15")
> mode(d)
[1] "numeric"
> length(d)
[1] 1

But it has a class of Date, telling us how to interpret that number; namely, as the number of days since January 1, 1970:

> class(d)
[1] "Date"
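
To see the number hiding behind the Date class, one option is to strip the class with unclass(). This is a small sketch, not taken from the original notes; the variable name days is an assumption made for the example.

```r
d <- as.Date("2010-03-15")
days <- unclass(d)     # remove the class, leaving only the stored number
print(days)            # days elapsed since 1970-01-01
print(class(days))     # no longer a Date, just a number

# Adding that many days to the origin reproduces the original date.
print(as.Date("1970-01-01") + days)
```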

Scalars

Matrices

> A <- 1:6
> dim(A)
NULL
> print(A)
[1] 1 2 3 4 5 6

We give dimensions to the vector when we set its dim attribute. Watch what happens when we set our vector dimensions to 2 × 3 and print it:

> dim(A) <- c(2,3)
> print(A)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

A matrix can be created from a list, too. Like a vector, a list has a dim attribute, which is initially NULL:

> B <- list(1,2,3,4,5,6)
> dim(B)
NULL

If we set the dim attribute, it gives the list a shape:

> dim(B) <- c(2,3)
> print(B)
     [,1] [,2] [,3]
[1,] 1    3    5   
[2,] 2    4    6 

Arrays

The discussion of matrices can be generalized to 3-dimensional or even n-dimensional structures: just assign more dimensions to the underlying vector (or list). The following example creates a 3-dimensional array with dimensions 2 × 3 × 2:

> D <- 1:12
> dim(D) <- c(2,3,2)
> print(D)
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
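
Indexing extends in the same way: a 3-dimensional array takes three subscripts. The following sketch is an illustration under the assumption that array() is used as a shorthand for assigning dim to a vector; it is not from the original notes.

```r
# array() builds the same 2 x 3 x 2 structure in one step.
D <- array(1:12, dim = c(2, 3, 2))

print(D[2, 3, 1])       # row 2, column 3, first layer
print(D[1, 2, 2])       # row 1, column 2, second layer
print(dim(D[, , 2]))    # a single layer is an ordinary 2 x 3 matrix
```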

Factors

A factor looks like a vector, but it has special properties. R keeps track of the unique values in a vector, and each unique value is called a level of the associated factor. R uses a compact representation for factors, which makes them efficient for storage in data frames. In other programming languages, a factor would be represented by a vector of enumerated values.

There are two key uses for factors:

Categorical variables: A factor can represent a categorical variable. Categorical variables are used in contingency tables, linear regression, analysis of variance (ANOVA), logistic regression, and many other areas.

Grouping: This is a technique for labeling or tagging your data items according to their group. See the Introduction to Chapter 6.

> A <- c(1,2,2,3,3,4,4,4,4,2,1,2,3,3)
> A
 [1] 1 2 2 3 3 4 4 4 4 2 1 2 3 3
> str(A)
 num [1:14] 1 2 2 3 3 4 4 4 4 2 ...
> fA <- factor(A)
> fA
 [1] 1 2 2 3 3 4 4 4 4 2 1 2 3 3
Levels: 1 2 3 4
> str(fA)
 Factor w/ 4 levels "1","2","3","4": 1 2 2 3 3 4 4 4 4 2 ...
> 
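
The compact representation mentioned above can be inspected with a few base R functions. This is a minimal sketch for illustration, reusing the vector A from the example.

```r
A  <- c(1, 2, 2, 3, 3, 4, 4, 4, 4, 2, 1, 2, 3, 3)
fA <- factor(A)

print(levels(fA))       # the unique values, stored once as character labels
print(as.integer(fA))   # the integer codes, one per element
print(table(fA))        # how many elements fall into each level
```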

Data Frames

A data frame is a powerful and flexible structure. Most serious R applications involve data frames. A data frame is intended to mimic a dataset, such as one you might encounter in SAS or SPSS.

A data frame is a tabular (rectangular) data structure, which means that it has rows and columns. It is not implemented by a matrix, however. Rather, a data frame is a list:

  • The elements of the list are vectors and/or factors.1)
  • Those vectors and factors are the columns of the data frame.
  • The vectors and factors must all have the same length; in other words, all columns must have the same height.
  • The equal-height columns give a rectangular shape to the data frame.
  • The columns must have names.

Because a data frame is both a list and a rectangular structure, R provides two different paradigms for accessing its contents:

  • You can use list operators to extract columns from a data frame, such as dfrm[i], dfrm[[i]], or dfrm$name.
  • You can use matrix-like notation, such as dfrm[i,j], dfrm[i,], or dfrm[,j].
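
The two access paradigms can be compared side by side on a small example. The data frame below is hypothetical (made-up names and ages) and serves only to show what each form returns.

```r
# Hypothetical data, for illustration only.
dfrm <- data.frame(name = c("Moe", "Larry", "Curly"),
                   age  = c(40, 35, 34))

# List-style access
print(dfrm[2])       # a data frame containing one column
print(dfrm[[2]])     # the column itself, as a vector
print(dfrm$age)      # the same column, selected by name

# Matrix-style access
print(dfrm[1, 2])    # row 1, column 2
print(dfrm[1, ])     # the first row, still a data frame
print(dfrm[, 2])     # the second column, as a vector
```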

Appending Data to a Vector

> v <- c(1,2,3)
> v <- c(v,4)           # Append a single value to v
> v
[1] 1 2 3 4
> w <- c(5,6,7,8)
> v <- c(v,w)           # Append an entire vector to v
> v
[1] 1 2 3 4 5 6 7 8
> v <- c(1,2,3)         # Create a vector of three elements
> v[10] <- 10           # Assign to the 10th element
> v                     # R extends the vector automatically
 [1]  1  2  3 NA NA NA NA NA NA 10

Inserting Data into a Vector

> append(1:10, 99)
 [1]  1  2  3  4  5  6  7  8  9 10 99
> append(1:10, 99, after=5)
 [1]  1  2  3  4  5 99  6  7  8  9 10
> append(1:10, 99, after=0)
 [1] 99  1  2  3  4  5  6  7  8  9 10

Understanding the Recycling Rule

> (1:6) + (1:3)
[1] 2 4 6 5 7 9
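The shorter vector is recycled to the length of the longer one, and then the elements are added pairwise:

   1 2 3 4 5 6      (1:6)
 + 1 2 3 1 2 3      (1:3, recycled)
 = 2 4 6 5 7 9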
> cbind(1:6)
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
[5,]    5
[6,]    6
> cbind(1:3)
     [,1]
[1,]    1
[2,]    2
[3,]    3
> cbind(1:6, 1:3)
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
[4,]    4    1
[5,]    5    2
[6,]    6    3
> (1:6) + (1:5)          # Oops! 1:5 is one element too short
[1]  2  4  6  8 10  7
Warning message:
In (1:6) + (1:5) :
  longer object length is not a multiple of shorter object length
> (1:6) + 10
[1] 11 12 13 14 15 16

Creating a Factor (Categorical Variable)

> f <- factor(v)          # v is a vector of strings or integers
> f <- factor(v, levels)
> f <- factor(c("Win","Win","Lose","Tie","Win","Lose"))
> f
[1] Win  Win  Lose Tie  Win  Lose
Levels: Lose Tie Win

Define wday before entering the textbook code:

> wday <- c("Wed", "Thu", "Mon", "Wed", "Thu", "Thu", "Thu", "Tue", "Thu", "Tue")
> f <- factor(wday)
> f
 [1] Wed Thu Mon Wed Thu Thu Thu Tue Thu Tue
Levels: Mon Thu Tue Wed
> f <- factor(wday, c("Mon","Tue","Wed","Thu","Fri")) # c(...) part means "levels" not data 
> f  # note that there is no Fri in the below output.
 [1] Wed Thu Mon Wed Thu Thu Thu Tue Thu Tue
Levels: Mon Tue Wed Thu Fri
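
One way to see the effect of declaring a level that never occurs is to tabulate the factor. This sketch is an illustration and assumes nothing beyond the wday data shown above; the name counts is made up.

```r
wday <- c("Wed", "Thu", "Mon", "Wed", "Thu", "Thu", "Thu", "Tue", "Thu", "Tue")
f <- factor(wday, levels = c("Mon", "Tue", "Wed", "Thu", "Fri"))

counts <- table(f)
print(counts)        # Fri is listed, with a count of zero
print(nlevels(f))    # all five declared levels are kept
```

Keeping the empty level is often what you want, because summaries and plots then show Friday explicitly instead of silently omitting it.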

Combining Multiple Vectors into One Vector and a Factor

> comb <- stack(list(v1=v1, v2=v2, v3=v3))     # Combine 3 vectors

Why in the world would you want to mash all your data into one big vector and a parallel factor? The reason is that many important statistical functions require the data in that format. Suppose you survey freshmen, sophomores, and juniors regarding their confidence level (“What percentage of the time do you feel confident in school?”). Now you have three vectors, called freshmen, sophomores, and juniors. You want to perform an ANOVA analysis of the differences between the groups. The ANOVA function, aov, requires one vector with the survey results as well as a parallel factor that identifies the group. You can combine the groups using the stack function:

   freshmen sophomores juniors
1      0.60       0.70    0.76
2      0.35       0.61    0.72
3      0.44       0.63    0.92
4      0.62       0.87    0.87
5      0.60       0.85
6                 0.70
7                 0.64
freshmen <- c(0.6, 0.35, 0.44, 0.62, 0.6)
sophomores <- c(0.7, 0.61, 0.63, 0.87, 0.85, 0.7, 0.64)
juniors <- c(.76, .72, .92, .87)
> comb <- stack(list(fresh=freshmen, soph=sophomores, jrs=juniors))
> print(comb)
   values   ind
1    0.60 fresh
2    0.35 fresh
3    0.44 fresh
4    0.62 fresh
5    0.60 fresh
6    0.70  soph
7    0.61  soph
8    0.63  soph
9    0.87  soph
10   0.85  soph
11   0.70  soph
12   0.64  soph
13   0.76   jrs
14   0.72   jrs
15   0.92   jrs
16   0.87   jrs

Now you can perform the ANOVA analysis on the two columns:

> aov(values ~ ind, data=comb)

When building the list we must provide tags for the list elements (the tags are fresh, soph, and jrs in this example). Those tags are required because stack uses them as the levels of the parallel factor.

Annoyed by the funky variable names (column names)?

colnames(comb) <- c("score", "year")
aov(score ~ year, data=comb)

Creating a List

> lst <- list(0.5, 0.841, 0.977)
> lst
[[1]]
[1] 0.5

[[2]]
[1] 0.841

[[3]]
[1] 0.977

When R prints the list, it identifies each list element by its position (1, 2, 3) and prints the element’s value (e.g., [1] 0.5) under its position. More usefully, lists can — unlike vectors — contain elements of different modes (types). Here is an extreme example of a mongrel created from a scalar, a character string, a vector, and a function:

> lst <- list(3.14, "Moe", c(1,1,2,3), mean)
> lst
[[1]]
[1] 3.14

[[2]]
[1] "Moe"

[[3]]
[1] 1 1 2 3

[[4]]
function (x, ...) 
UseMethod("mean")
<environment: namespace:base>

You can also build a list by creating an empty list and populating it. Here is our “mongrel” example built in that way:

> lst <- list()
> lst[[1]] <- 3.14
> lst[[2]] <- "Moe"
> lst[[3]] <- c(1,1,2,3)
> lst[[4]] <- mean

List elements can also be given names when the list is created:

> lst <- list(mid=0.5, right=0.841, far.right=0.977)
> lst
$mid
[1] 0.5

$right
[1] 0.841

$far.right
[1] 0.977

Selecting List Elements by Position

> years <- list(1960, 1964, 1976, 1994)
> years
[[1]]
[1] 1960

[[2]]
[1] 1964

[[3]]
[1] 1976

[[4]]
[1] 1994
> years[[1]]
[1] 1960
lst[[n]]

This is an element, not a list. It is the nth element of lst.

lst[n]

This is a list, not an element. The list contains one element, taken from the nth element of lst. This is a special case of lst[c(n1, n2, …, nk)] in which we eliminated the c(…) construct because there is only one n.

> class(years[[1]])
[1] "numeric"
> class(years[1])
[1] "list"

Selecting List Elements by Name

Use one of these forms. Here, lst is a list variable:

lst[["name"]]

Selects the element called name. Returns NULL if no element has that name.

lst$name

Same as previous, just different syntax.

lst[c(name1, name2, ..., namek)]

Returns a list built from the indicated elements of lst.
Note that the first two forms return an element whereas the third form returns a list.

> years <- list(Kennedy=1960, Johnson=1964, Carter=1976, Clinton=1994)

The following has the same effect:

years <- list(1960, 1964, 1976, 1994)
names(years) <- c("Kennedy", "Johnson", "Carter", "Clinton")

These next two expressions return the same value—namely, the element that is named “Kennedy”:

> years[["Kennedy"]]
[1] 1960
> years$Kennedy
[1] 1960

The following two expressions return sublists extracted from years:

> years[c("Kennedy","Johnson")]
$Kennedy
[1] 1960

$Johnson
[1] 1964
> years["Carter"]
$Carter
[1] 1976

Removing an Element from a List

> years <- list(Kennedy=1960, Johnson=1964, Carter=1976, Clinton=1994)
> years
$Kennedy
[1] 1960

$Johnson
[1] 1964

$Carter
[1] 1976

$Clinton
[1] 1994

> years[["Johnson"]] <- NULL              # Remove the element labeled "Johnson"
> years
$Kennedy
[1] 1960

$Carter
[1] 1976

$Clinton
[1] 1994

You can remove multiple elements this way, too:

> years[c("Carter","Clinton")] <- NULL     # Remove two elements
> years
$Kennedy
[1] 1960

Removing NULL Elements from a List

> lst[sapply(lst, is.null)] <- NULL 
  1. R calls sapply to apply the is.null function to every element of the list.
  2. sapply returns a vector of logical values that are TRUE wherever the corresponding list element is NULL.
  3. R selects values from the list according to that vector.
  4. R assigns NULL to the selected items, removing them from the list.
> lst <- list("Moe", NULL, "Curly")          # Create list with NULL element
> lst
[[1]]
[1] "Moe"

[[2]]
NULL

[[3]]
[1] "Curly"

> lst[sapply(lst, is.null)] <- NULL          # Remove NULL element from list
> lst
[[1]]
[1] "Moe"

[[2]]
[1] "Curly"
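
The same idiom removes elements that meet other conditions:

> lst[lst < 0] <- NULL        # Remove negative values
> lst[lst == 0] <- NULL       # Remove zeros
> lst[is.na(lst)] <- NULL     # Remove NA values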
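
The four steps described at the start of this section can be traced one at a time. The sketch below is an illustration; the variable names is_null and lst2 are made up, and the Filter() variant at the end is an alternative from base R, not the method described in the notes.

```r
lst <- list("Moe", NULL, "Curly")

# Steps 1 and 2: sapply applies is.null to each element and returns
# a logical vector marking the NULL elements.
is_null <- sapply(lst, is.null)
print(is_null)

# Steps 3 and 4: select those elements and assign NULL to remove them.
lst[is_null] <- NULL
print(length(lst))

# Alternative: keep only the elements that are not NULL.
lst2 <- Filter(Negate(is.null), list("Moe", NULL, "Curly"))
print(lst2)
```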

Initializing a Matrix

> theData <- c(1.1, 1.2, 2.1, 2.2, 3.1, 3.2)
> mat <- matrix(theData, 2, 3)
> mat
     [,1] [,2] [,3]
[1,]  1.1  2.1  3.1
[2,]  1.2  2.2  3.2
matrix(data, row, col)

If data is a single value, the recycling rule fills the whole matrix with it.

> matrix(0, 2, 3)          # Create an all-zeros matrix
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
> matrix(NA, 2, 3)         # Create a matrix populated with NA
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA

The data vector can also be written inline. By default the matrix is filled column by column:

> mat <- matrix(c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3), 2, 3)

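With byrow=TRUE the matrix is filled row by row, so the data can be laid out in the code the way it will appear in the matrix, which is easier to read: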

> theData <- c(1.1, 1.2, 1.3,
+              2.1, 2.2, 2.3)
> mat <- matrix(theData, 2, 3, byrow=TRUE)

Condensed version:

> mat <- matrix(c(1.1, 1.2, 1.3,
+                 2.1, 2.2, 2.3),
+               2, 3, byrow=TRUE)

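Setting the dim attribute of a vector also creates a matrix. Like matrix() without byrow=TRUE, it fills column by column: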

> v <- c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3)
> dim(v) <- c(2,3)
> v
     [,1] [,2] [,3]
[1,]  1.1  1.3  2.2
[2,]  1.2  2.1  2.3

Performing Matrix Operations

t(A)
    Matrix transposition of A
solve(A)
    Matrix inverse of A
A %*% B
    Matrix multiplication of A and B
diag(n)
    An n-by-n diagonal (identity) matrix
> mat <- matrix(c(1.1, 1.2, 1.3,
                  2.1, 2.2, 2.3), 
                  2, 3, byrow=TRUE)
> mat
     [,1] [,2] [,3]
[1,]  1.1  1.2  1.3
[2,]  2.1  2.2  2.3
> mat%*%t(mat)
     [,1]  [,2]
[1,] 4.34  7.94
[2,] 7.94 14.54
> t(mat)%*%mat
     [,1] [,2] [,3]
[1,] 5.62 5.94 6.26
[2,] 5.94 6.28 6.62
[3,] 6.26 6.62 6.98
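
The example above exercises t() and %*%. For solve() and diag(), a square invertible matrix is needed; the sketch below uses a hypothetical 2 × 2 matrix chosen only for illustration and checks that the inverse times the original is (numerically) the identity.

```r
# Hypothetical invertible matrix, for illustration only.
A <- matrix(c(2, 1,
              1, 3), 2, 2, byrow = TRUE)

A_inv <- solve(A)                  # matrix inverse
print(A_inv %*% A)                 # approximately the 2 x 2 identity
print(diag(2))                     # the exact identity, for comparison

# Floating-point results should be compared with a tolerance, not with ==.
print(all.equal(A_inv %*% A, diag(2)))
```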

Naming the Rows and Columns of a Matrix

> rownames(mat) <- c("rowname1", "rowname2", ..., "rownamem")
> colnames(mat) <- c("colname1", "colname2", ..., "colnamen")
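
A concrete instance of the template above might look like the following sketch. The row and column names are made up for illustration; once names are set, they can be used for indexing in place of positions.

```r
mat <- matrix(c(1.1, 1.2, 1.3,
                2.1, 2.2, 2.3), 2, 3, byrow = TRUE)

rownames(mat) <- c("first", "second")   # hypothetical names
colnames(mat) <- c("a", "b", "c")
print(mat)

print(mat["second", "b"])   # select a cell by name instead of position
```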

Selecting One Row or Column from a Matrix

> vec <- mat[1,]          # First row
> vec <- mat[,3]          # Third column

Normally, when you select one row or column from a matrix, R strips off the dimensions. The result is a dimensionless vector (the output below is consistent with mat being the 3 × 4 matrix created by matrix(1:12, 3, 4)):

> mat[1,]
[1]  1  4  7 10
> mat[,3]
[1] 7 8 9

When you include the drop=FALSE argument, however, R retains the dimensions. In that case, selecting a row returns a row vector (a 1 × n matrix):

> mat[1,,drop=FALSE]
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10

Likewise, selecting a column with drop=FALSE returns a column vector (an n × 1 matrix):

> mat[,3,drop=FALSE]
     [,1]
[1,]    7
[2,]    8
[3,]    9

Initializing a Data Frame from Column Data

Combining vectors

> dfrm <- data.frame(v1, v2, v3, f1, f2)

Combining lists

> dfrm <- as.data.frame(list.of.vectors)
pred1 <- c(-2.7528917, -0.3626909, -1.0416039, 1.266682, 0.7806372, -1.0832624, -2.0883305, -0.7063653, -0.8394022, -0.4966884)
pred2 <- c(-1.4078413, 0.31286963, -0.69685664, -1.27511434, -0.27292745, 0.73383339, 0.96816822, -0.84476203, 0.31530793, -0.08030948)
pred3 <- c("AM", "AM", "PM", "PM", "AM", "AM", "PM", "PM", "PM", "AM")
resp <- c(12.57715, 21.02418, 18.94694, 18.98153, 19.59455, 20.71605, 22.70062, 18.40691, 21.0093, 19.31253)
> dfrm <- data.frame(pred1, pred2, pred3, resp)
> dfrm
        pred1       pred2 pred3     resp
1  -2.7528917 -1.40784130    AM 12.57715
2  -0.3626909  0.31286963    AM 21.02418
3  -1.0416039 -0.69685664    PM 18.94694
4   1.2666820 -1.27511434    PM 18.98153
5   0.7806372 -0.27292745    AM 19.59455
6  -1.0832624  0.73383339    AM 20.71605
7  -2.0883305  0.96816822    PM 22.70062
8  -0.7063653 -0.84476203    PM 18.40691
9  -0.8394022  0.31530793    PM 21.00930
10 -0.4966884 -0.08030948    AM 19.31253
> dfrm <- data.frame(p1=pred1, p2=pred2, p3=pred3, r=resp)
> dfrm
           p1          p2 p3        r
1  -2.7528917 -1.40784130 AM 12.57715
2  -0.3626909  0.31286963 AM 21.02418
3  -1.0416039 -0.69685664 PM 18.94694
.
. (etc.)
.

Suppose that pred3 and resp contain additional data, as follows:

pred3 <- c(pred3, "PM")
resp <- c(resp, 20,30,40)

Now combining these vectors into a data frame fails, because the columns no longer have the same length (pred1 and pred2 have 10 elements, pred3 has 11, and resp has 13):

dfrm <- data.frame(pred1, pred2, pred3, resp)

To fix this, remove the additional data from the vectors so that each vector has 10 elements. How would you do that?

pred3 <- pred3[-c(11)]
resp <- resp[-c(11:13)]
dfrm <- data.frame(pred1, pred2, pred3, resp)

Or pad the shorter columns with NAs. How would you do that?
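
One possible answer, offered as a sketch rather than the definitive approach, is to extend every vector to the length of the longest one. Assigning a larger value to length() pads a vector with NA. The data below are shortened, made-up stand-ins for pred1, pred3, and resp, and the names cols, padded, and dfrm2 are assumptions for the example.

```r
# Shortened, hypothetical stand-ins for the vectors in the text.
pred1 <- c(-2.75, -0.36, -1.04)
pred3 <- c("AM", "AM", "PM", "PM")
resp  <- c(12.6, 21.0, 18.9, 19.0, 20.0)

cols <- list(pred1 = pred1, pred3 = pred3, resp = resp)
n <- max(lengths(cols))            # length of the longest vector

# Setting length(x) to a larger value pads x with NA.
padded <- lapply(cols, function(x) {
  length(x) <- n
  x
})

dfrm2 <- as.data.frame(padded)
print(dfrm2)
```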

Alternatively, build a list and convert it with as.data.frame:

> lst <- list(p1=pred1, p2=pred2, p3=pred3, r=resp)

> as.data.frame(lst)
           p1          p2 p3        r
1  -2.7528917 -1.40784130 AM 12.57715
2  -0.3626909  0.31286963 AM 21.02418
3  -1.0416039 -0.69685664 PM 18.94694
.
. (etc.)
.

Initializing a Data Frame from Row Data

The rbind() function combines vectors, matrices, or data frames by rows.

data1.csv
Subtype	Gender	Expression
A	m	-0.54
A	f	-0.8
B	f	-1.03
C	m	-0.41
 
data2.csv
Subtype	Gender	Expression
D	m	3.22
D	f	1.02
D	f	0.21
D	m	-0.04
D	m	2.11
B	m	-1.21
A	f	-0.2
 
> x1 <- read.csv("http://commres.net/wiki/_export/code/r/data_structures?codeblock=83", header=TRUE, sep="\t")
> x2 <- read.csv("http://commres.net/wiki/_export/code/r/data_structures?codeblock=84", header=TRUE, sep="\t")

> x <- rbind(x1,x2)
> x

   Subtype Gender Expression
1        A      m      -0.54
2        A      f      -0.80
3        B      f      -1.03
4        C      m      -0.41
5        D      m       3.22
6        D      f       1.02
7        D      f       0.21
8        D      m      -0.04
9        D      m       2.11
10       B      m      -1.21
11       A      f      -0.20

Appending Rows to a Data Frame

suburbs.csv
	city	county	state	pop
1	Chicago	Cook	IL	2853114
2	Kenosha	Kenosha	WI	90352
3	Aurora	Kane	IL	171782
4	Elgin	Kane	IL	94487
5	Gary	Lake(IN)	IN	102746
6	Joliet	Kendall	IL	106221
7	Naperville	DuPage	IL	147779
8	Arlington Heights	Cook	IL	76031
9	Bolingbrook	Will	IL	70834
10	Cicero	Cook	IL	72616
11	Evanston	Cook	IL	74239
12	Hammond	Lake(IN)	IN	83048
13	Palatine	Cook	IL	67232
14	Schaumburg	Cook	IL	75386
15	Skokie	Cook	IL	63348
16	Waukegan	Lake(IL)	IL	91452
suburbs <- read.csv("http://commres.net/wiki/_export/code/r/data_structures?codeblock=89", header=TRUE, sep="\t")
suburbs
. . . . .
suburbs$X <- NULL    # Drop the row-number column, which read.csv names X
newRow <- data.frame(city="West Dundee", county="Kane", state="IL", pop=5428)
suburbs <- rbind(suburbs, newRow)
suburbs
                city   county state     pop
1            Chicago     Cook    IL 2853114
2            Kenosha  Kenosha    WI   90352
3             Aurora     Kane    IL  171782
4              Elgin     Kane    IL   94487
5               Gary Lake(IN)    IN  102746
6             Joliet  Kendall    IL  106221
7         Naperville   DuPage    IL  147779
8  Arlington Heights     Cook    IL   76031
9        Bolingbrook     Will    IL   70834
10            Cicero     Cook    IL   72616
11          Evanston     Cook    IL   74239
12           Hammond Lake(IN)    IN   83048
13          Palatine     Cook    IL   67232
14        Schaumburg     Cook    IL   75386
15            Skokie     Cook    IL   63348
16          Waukegan Lake(IL)    IL   91452
17       West Dundee     Kane    IL    5428

Preallocating a Data Frame
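
A minimal sketch of preallocation, under the assumption that the column types are known in advance: create each column as an empty vector of the right type and length, then fill the rows in later. The row count N and the two columns below are hypothetical and chosen only for illustration.

```r
N <- 1000   # hypothetical number of rows to reserve

# numeric(N) is a vector of N zeros; character(N) is N empty strings.
dfrm <- data.frame(city = character(N),
                   pop  = numeric(N),
                   stringsAsFactors = FALSE)

# Fill in rows as the data becomes available.
dfrm$city[1] <- "Chicago"
dfrm$pop[1]  <- 2853114

print(dim(dfrm))
print(head(dfrm, 3))
```

Filling a preallocated data frame is generally faster than growing one row at a time with rbind.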

suburbs.csv
	city	county	state	pop
1	Chicago	Cook	IL	2853114
2	Kenosha	Kenosha	WI	90352
3	Aurora	Kane	IL	171782
4	Elgin	Kane	IL	94487
5	Gary	Lake(IN)	IN	102746
6	Joliet	Kendall	IL	106221
7	Naperville	DuPage	IL	147779
8	Arlington Heights	Cook	IL	76031
9	Bolingbrook	Will	IL	70834
10	Cicero	Cook	IL	72616
11	Evanston	Cook	IL	74239
12	Hammond	Lake(IN)	IN	83048
13	Palatine	Cook	IL	67232
14	Schaumburg	Cook	IL	75386
15	Skokie	Cook	IL	63348
16	Waukegan	Lake(IL)	IL	91452
suburbs <- read.csv("http://commres.net/wiki/_export/code/r/data_structures?codeblock=92", header=TRUE, sep="\t")
> suburbs[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
> suburbs[[3]]
 [1] Cook     Kenosha  Kane     Kane     Lake(IN) Kendall  DuPage   Cook     Will     Cook     Cook     Lake(IN) Cook     Cook     Cook    
[16] Lake(IL)
Levels: Cook DuPage Kane Kendall Kenosha Lake(IL) Lake(IN) Will
> suburbs[[4]]
 [1] IL WI IL IL IN IL IL IL IL IL IL IN IL IL IL IL
Levels: IL IN WI
suburbs[[1]]

This returns one column.

suburbs[1]

This returns a data frame, and the data frame contains exactly one column. This is a special case of dfrm[c(n1,n2, …, nk)]. We don’t need the c(…) construct because there is only one n.

Selecting data frame columns by position

suburbs.csv
city	county	state	pop
Chicago	Cook	IL	2853114
Kenosha	Kenosha	WI	90352
Aurora	Kane	IL	171782
Elgin	Kane	IL	94487
Gary	Lake(IN)	IN	102746
Joliet	Kendall	IL	106221
Naperville	DuPage	IL	147779
Arlington Heights	Cook	IL	76031
Bolingbrook	Will	IL	70834
Cicero	Cook	IL	72616
Evanston	Cook	IL	74239
Hammond	Lake(IN)	IN	83048
Palatine	Cook	IL	67232
Schaumburg	Cook	IL	75386
Skokie	Cook	IL	63348
Waukegan	Lake(IL)	IL	91452
suburbs <- read.csv("http://commres.net/wiki/_export/code/r/data_structures?codeblock=97", header=TRUE, sep="\t")
> suburbs[[1]]
 [1] "Chicago"           "Kenosha"           "Aurora"            "Elgin"            
 [5] "Gary"              "Joliet"            "Naperville"        "Arlington Heights"
 [9] "Bolingbrook"       "Cicero"            "Evanston"          "Hammond"          
[13] "Palatine"          "Schaumburg"        "Skokie"            "Waukegan"
> suburbs[1]
                city
1            Chicago
2            Kenosha
3             Aurora
4              Elgin
5               Gary
6             Joliet
7         Naperville
8  Arlington Heights
9        Bolingbrook
10            Cicero
11          Evanston
12           Hammond
13          Palatine
14        Schaumburg
15            Skokie
16          Waukegan
> suburbs[c(1,4)]
                city     pop
1            Chicago 2853114
2            Kenosha   90352
3             Aurora  171782
4              Elgin   94487
5               Gary  102746
6             Joliet  106221
7         Naperville  147779
8  Arlington Heights   76031
9        Bolingbrook   70834
10            Cicero   72616
11          Evanston   74239
12           Hammond   83048
13          Palatine   67232
14        Schaumburg   75386
15            Skokie   63348
16          Waukegan   91452

Selecting Data Frame Columns by Name

dfrm[["name"]]
    Returns one column, the column called name.
dfrm$name
    Same as previous, just different syntax.

To select one or more columns and package them in a data frame, use these list expressions:

dfrm["name"]
    Selects one column and packages it inside a data frame object.
dfrm[c("name1", "name2", ..., "namek")]
    Selects several columns and packages them in a data frame.

You can use matrix-style subscripting to select one or more columns:

dfrm[, "name"]
    Returns the named column.
dfrm[, c("name1", "name2", ..., "namek")]
    Selects several columns and packages them in a data frame.

Selecting Rows and Columns More Easily

Data set used in this section: Cars93 in the MASS package

install.packages("MASS")
library(MASS)
Cars93
    Manufacturer          Model    Type Min.Price Price Max.Price MPG.city MPG.highway            AirBags DriveTrain Cylinders EngineSize
1          Acura        Integra   Small      12.9  15.9      18.8       25          31               None      Front         4        1.8
2          Acura         Legend Midsize      29.2  33.9      38.7       18          25 Driver & Passenger      Front         6        3.2
3           Audi             90 Compact      25.9  29.1      32.3       20          26        Driver only      Front         6        2.8
4           Audi            100 Midsize      30.8  37.7      44.6       19          26 Driver & Passenger      Front         6        2.8
. . . . .
> subset(Cars93, select=Model, subset=(MPG.city > 30))
     Model
31 Festiva
39   Metro
42   Civic
.
. (etc.)
.
> subset(Cars93, select=c(Model,Min.Price,Max.Price),
+        subset=(Cylinders == 4 & Origin == "USA"))
           Model Min.Price Max.Price
6        Century      14.2      17.3
12      Cavalier       8.5      18.3
13       Corsica      11.4      11.4
.
. (etc.)
.
> subset(Cars93, select=c(Manufacturer,Model),
+        subset=c(MPG.highway > median(MPG.highway)))
    Manufacturer         Model
1          Acura       Integra
5            BMW          535i
6          Buick       Century
.
. (etc.)
.

Changing the Names of Data Frame Columns

mat <- c(-0.818, -0.667, -0.494, -0.819, -0.946, -0.205, 0.385, 1.531, -0.611, -2.155, -0.535, -0.316)
dim(mat) <- c(4,3)
mat 
       [,1]   [,2]   [,3]
[1,] -0.818 -0.946 -0.611
[2,] -0.667 -0.205 -2.155
[3,] -0.494  0.385 -0.535
[4,] -0.819  1.531 -0.316

Converting the matrix to a data frame gives vanilla column names (V1, V2, V3):

as.data.frame(mat)
      V1     V2     V3
1 -0.818 -0.946 -0.611
2 -0.667 -0.205 -2.155
3 -0.494  0.385 -0.535
4 -0.819  1.531 -0.316
colnames(mat) <- c("before","treatment","after")
> mat
     before treatment  after
[1,] -0.818    -0.946 -0.611
[2,] -0.667    -0.205 -2.155
[3,] -0.494     0.385 -0.535
[4,] -0.819     1.531 -0.316

> as.data.frame(mat)
  before treatment  after
1 -0.818    -0.946 -0.611
2 -0.667    -0.205 -2.155
3 -0.494     0.385 -0.535
4 -0.819     1.531 -0.316

Editing a Data Frame

> temp <- edit(mat)
mat <- temp      # Overwrite only if you're happy with the changes!
mat2 <- temp     # or....
# then, close the edit window

Can you save it as “mat.csv” and then read it back into your R session?

When you read the CSV file back, how would you avoid output like the following, that is, how would you avoid the X column?

  X before treatment  after
1 1 -0.818    -0.946 -0.611
2 2 -0.667    -0.205 -2.155
3 3 -0.494     0.385 -0.535
4 4 -0.819     1.531 -0.316

Or, how would you save the CSV file without the X column in the first place?
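
One possible answer to these questions, sketched with base R only: the X column comes from the row names that write.csv saves by default. You can either skip the row names when writing, or tell read.csv that the first column holds row names. The temporary file path below is an assumption made so that the example does not touch your working directory.

```r
mat <- matrix(c(-0.818, -0.667, -0.494, -0.819,
                -0.946, -0.205,  0.385,  1.531,
                -0.611, -2.155, -0.535, -0.316), 4, 3)
colnames(mat) <- c("before", "treatment", "after")

path <- file.path(tempdir(), "mat.csv")   # temporary location for the demo

# Option 1: do not write the row names, so nothing is read back as X.
write.csv(mat, path, row.names = FALSE)
back <- read.csv(path)
print(names(back))

# Option 2: the file already contains row names; read the first column
# as row names instead of as data.
write.csv(mat, path)
back2 <- read.csv(path, row.names = 1)
print(names(back2))
```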

Removing NAs from a Data Frame

Use na.omit to remove rows that contain any NA values.

> clean <- na.omit(dfrm)
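
A small illustration of what na.omit keeps and drops, using made-up data (the data frame below is hypothetical). A row is removed if any of its columns is NA.

```r
# Hypothetical data with one NA in each of rows 2 and 3.
dfrm <- data.frame(x = c(1, NA, 3, 4),
                   y = c("a", "b", NA, "d"))

clean <- na.omit(dfrm)
print(clean)          # only the complete rows remain
print(nrow(clean))
```

The surviving rows keep their original row names, which makes it easy to see which rows were dropped.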

Excluding Columns by Name

> subset(dfrm, select = -badboy)          # All columns except badboy
> cor(patient.data)
            patient.id         pre     dosage        post
patient.id  1.00000000  0.02286906  0.3643084 -0.13798149
pre         0.02286906  1.00000000  0.2270821 -0.03269263
dosage      0.36430837  0.22708208  1.0000000 -0.42006280
post       -0.13798149 -0.03269263 -0.4200628  1.00000000

This correlation matrix includes the meaningless “correlation” between patient ID and other variables, which is annoying. We can exclude the patient ID column to clean up the output:

> cor(subset(patient.data, select = -patient.id))
               pre     dosage        post
pre     1.00000000  0.2270821 -0.03269264
dosage  0.22708207  1.0000000 -0.42006280
post   -0.03269264 -0.4200628  1.00000000

We can exclude multiple columns by giving a vector of negated names:

> cor(subset(patient.data, select = c(-patient.id,-dosage)))
             pre        post
pre   1.00000000 -0.03269264
post -0.03269264  1.00000000

Combining Two Data Frames

> stooges
   name n.marry n.child
1   Moe       1       2
2 Larry       1       2
3 Curly       4       2
> birth
  birth.year  birth.place
1       1887  Bensonhurst
2       1902 Philadelphia
3       1903     Brooklyn
> cbind(stooges,birth)
   name n.marry n.child birth.year  birth.place
1   Moe       1       2       1887  Bensonhurst
2 Larry       1       2       1902 Philadelphia
3 Curly       4       2       1903     Brooklyn

rbind stacks the rows of two data frames that have the same columns:

> stooges
   name n.marry n.child
1   Moe       1       2
2 Larry       1       2
3 Curly       4       2
> guys
   name n.marry n.child
1   Tom       4       2
2  Dick       1       4
3 Harry       1       1
> rbind(stooges,guys)
   name n.marry n.child
1   Moe       1       2
2 Larry       1       2
3 Curly       4       2
4   Tom       4       2
5  Dick       1       4
6 Harry       1       1
1)
A data frame can be built from a mixture of vectors, factors, and matrices. The columns of the matrices become columns in the data frame. The number of rows in each matrix must match the length of the vectors and factors. In other words, all elements of a data frame must have the same height.
b/r_cookbook/data_structures.txt · Last modified: 2020/05/26 17:40 by hkimscil