DATA
Vectors
- Vector
- Vectors are homogeneous: All elements of a vector must have the same type.
- Vectors can be indexed by position: v[2] refers to the second element of v.
- Vectors can be indexed by multiple positions, returning a subvector: v[c(2,3)] is a subvector of v that consists of the second and third elements.
- Vector elements can have names: Vectors have a names property, the same length as the vector itself, that gives names to the elements:
> v <- c(10, 20, 30)
> names(v) <- c("Moe", "Larry", "Curly")
> print(v)
  Moe Larry Curly
   10    20    30
> v["Larry"]
Larry
   20
Lists
- Lists are heterogeneous: Lists can contain elements of different types; in R terminology, list elements may have different modes. Lists can even contain other structured objects, such as lists and data frames; this allows you to create recursive data structures.
- Lists can be indexed by position: lst[[2]] refers to the second element of lst. Note the double square brackets.
- Lists let you extract sublists: lst[c(2,3)] is a sublist of lst that consists of the second and third elements. Note the single square brackets.
- List elements can have names: Both lst[["Moe"]] and lst$Moe refer to the element named "Moe".
n = c(2, 3, 5)
s = c("aa", "bb", "cc", "dd", "ee")
b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
x = list(n, s, b, 3)   # x contains copies of n, s, b
> x[2]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
> x[c(2, 4)]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"

[[2]]
[1] 3
> x[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
> x[[2]][2]
[1] "bb"
x[[2]][1] <- "xx"   # replace "aa" with "xx"
x[[2]]
Mode: Physical Type
Object | Example | Mode |
---|---|---|
Number | 3.1415 | numeric |
Vector of numbers | c(2.7182, 3.1415) | numeric |
Character string | "Moe" | character |
Vector of character strings | c("Moe", "Larry", "Curly") | character |
Factor | factor(c("NY", "CA", "IL")) | numeric |
List | list("Moe", "Larry", "Curly") | list |
Data frame | data.frame(x=1:3, y=c("NY", "CA", "IL")) | list |
Function | print | function |
Class
In R, every object also has a class, which defines its abstract type. The terminology is borrowed from object-oriented programming. A single number could represent many different things: a distance, a point in time, a weight. All those objects have a mode of “numeric” because they are stored as a number; but they could have different classes to indicate their interpretation.
For example, a Date object consists of a single number:
> d <- as.Date("2010-03-15")
> mode(d)
[1] "numeric"
> length(d)
[1] 1
But it has a class of Date, telling us how to interpret that number; namely, as the number of days since January 1, 1970:
> class(d)
[1] "Date"
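To see the number behind the Date, you can strip its class with unclass. This is a minimal sketch that redefines d as in the example above; the variable name days is only for illustration.

```r
d <- as.Date("2010-03-15")

class(d)            # "Date": tells R how to interpret the value
days <- unclass(d)  # the stored number: days since 1970-01-01
days

# One day after the origin is stored as 1
unclass(as.Date("1970-01-02"))
```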
Scalars
Matrices
> A <- 1:6
> dim(A)
NULL
> print(A)
[1] 1 2 3 4 5 6
We give dimensions to the vector when we set its dim attribute. Watch what happens when we set our vector dimensions to 2 × 3 and print it:
> dim(A) <- c(2,3)
> print(A)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
A matrix can be created from a list, too. Like a vector, a list has a dim attribute, which is initially NULL:
> B <- list(1,2,3,4,5,6)
> dim(B)
NULL
If we set the dim attribute, it gives the list a shape:
> dim(B) <- c(2,3)
> print(B)
     [,1] [,2] [,3]
[1,] 1    3    5
[2,] 2    4    6
Arrays
The discussion of matrices can be generalized to 3-dimensional or even n-dimensional structures: just assign more dimensions to the underlying vector (or list). The following example creates a 3-dimensional array with dimensions 2 × 3 × 2:
> D <- 1:12
> dim(D) <- c(2,3,2)
> print(D)
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
Factors
A factor looks like a vector, but it has special properties. R keeps track of the unique values in a vector, and each unique value is called a level of the associated factor. R uses a compact representation for factors, which makes them efficient for storage in data frames. In other programming languages, a factor would be represented by a vector of enumerated values.
There are two key uses for factors:
Categorical variables: A factor can represent a categorical variable. Categorical variables are used in contingency tables, linear regression, analysis of variance (ANOVA), logistic regression, and many other areas.
Grouping: This is a technique for labeling or tagging your data items according to their group. See the Introduction to Chapter 6.
> A <- c(1,2,2,3,3,4,4,4,4,2,1,2,3,3)
> A
 [1] 1 2 2 3 3 4 4 4 4 2 1 2 3 3
> str(A)
 num [1:14] 1 2 2 3 3 4 4 4 4 2 ...
> fA <- factor(A)
> fA
 [1] 1 2 2 3 3 4 4 4 4 2 1 2 3 3
Levels: 1 2 3 4
> str(fA)
 Factor w/ 4 levels "1","2","3","4": 1 2 2 3 3 4 4 4 4 2 ...
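The compact representation mentioned above can be inspected directly: a factor stores each level once and keeps one integer code per element. A small sketch, reusing the vector A from the example (the names lv and codes are only for illustration):

```r
A <- c(1, 2, 2, 3, 3, 4, 4, 4, 4, 2, 1, 2, 3, 3)
fA <- factor(A)

lv <- levels(fA)         # the unique values, stored once as character strings
codes <- as.integer(fA)  # one integer code per element, indexing into the levels

lv
codes
nlevels(fA)              # number of distinct levels
```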
Data Frames
A data frame is a powerful and flexible structure. Most serious R applications involve data frames. A data frame is intended to mimic a dataset, such as one you might encounter in SAS or SPSS.
A data frame is a tabular (rectangular) data structure, which means that it has rows and columns. It is not implemented by a matrix, however. Rather, a data frame is a list:
- The elements of the list are vectors and/or factors.
- Those vectors and factors are the columns of the data frame.
- The vectors and factors must all have the same length; in other words, all columns must have the same height.
- The equal-height columns give a rectangular shape to the data frame.
- The columns must have names.
Because a data frame is both a list and a rectangular structure, R provides two different paradigms for accessing its contents:
- You can use list operators to extract columns from a data frame, such as dfrm[i], dfrm[[i]], or dfrm$name.
- You can use matrix-like notation, such as dfrm[i,j], dfrm[i,], or dfrm[,j].
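The two access paradigms can be compared side by side. The data frame below is hypothetical and exists only to illustrate the notation; the column names and values are not from the text.

```r
# Hypothetical data frame, for illustration only
dfrm <- data.frame(name = c("Moe", "Larry", "Curly"),
                   age  = c(40, 35, 34))

# List-style access
col_df  <- dfrm[2]     # single brackets: a data frame with one column
col_vec <- dfrm[[2]]   # double brackets: the column itself
by_name <- dfrm$age    # by name: the column itself

# Matrix-style access
cell <- dfrm[1, 2]     # row 1, column 2
row1 <- dfrm[1, ]      # first row, still a data frame
col2 <- dfrm[, 2]      # second column, returned as a vector
```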
Appending Data to a Vector
> v <- c(1,2,3)
> v <- c(v,4)    # Append a single value to v
> v
[1] 1 2 3 4
> w <- c(5,6,7,8)
> v <- c(v,w)    # Append an entire vector to v
> v
[1] 1 2 3 4 5 6 7 8
> v <- c(1,2,3)  # Create a vector of three elements
> v[10] <- 10    # Assign to the 10th element
> v              # R extends the vector automatically
 [1]  1  2  3 NA NA NA NA NA NA 10
Inserting Data into a Vector
> append(1:10, 99)
 [1]  1  2  3  4  5  6  7  8  9 10 99
> append(1:10, 99, after=5)
 [1]  1  2  3  4  5 99  6  7  8  9 10
> append(1:10, 99, after=0)
 [1] 99  1  2  3  4  5  6  7  8  9 10
Understanding the Recycling Rule
> (1:6) + (1:3)
[1] 2 4 6 5 7 9
The shorter vector is recycled to match the longer one:
1:6             1 2 3 4 5 6
1:3 (recycled)  1 2 3 1 2 3
(1:6) + (1:3)   2 4 6 5 7 9
> cbind(1:6)
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
[5,]    5
[6,]    6
> cbind(1:3)
     [,1]
[1,]    1
[2,]    2
[3,]    3
> cbind(1:6, 1:3)
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
[4,]    4    1
[5,]    5    2
[6,]    6    3
> (1:6) + (1:5)   # Oops! 1:5 is one element too short
[1]  2  4  6  8 10  7
Warning message:
In (1:6) + (1:5) :
  longer object length is not a multiple of shorter object length
> (1:6) + 10
[1] 11 12 13 14 15 16
Creating a Factor (Categorical Variable)
> f <- factor(v) # v is a vector of strings or integers
> f <- factor(v, levels)
> f <- factor(c("Win","Win","Lose","Tie","Win","Lose"))
> f
[1] Win  Win  Lose Tie  Win  Lose
Levels: Lose Tie Win
Define wday as below before entering the textbook code.
> wday <- c("Wed", "Thu", "Mon", "Wed", "Thu", "Thu", "Thu", "Tue", "Thu", "Tue")
> f <- factor(wday)
> f
 [1] Wed Thu Mon Wed Thu Thu Thu Tue Thu Tue
Levels: Mon Thu Tue Wed
> f <- factor(wday, c("Mon","Tue","Wed","Thu","Fri"))   # the second argument gives the levels, not the data
> f   # Fri is listed as a level even though it never occurs in the data
 [1] Wed Thu Mon Wed Thu Thu Thu Tue Thu Tue
Levels: Mon Tue Wed Thu Fri
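One practical effect of declaring the levels explicitly is that summaries keep every level, in the declared order, including levels with no observations. A short sketch using the same wday data; the name counts is only for illustration.

```r
wday <- c("Wed", "Thu", "Mon", "Wed", "Thu", "Thu", "Thu", "Tue", "Thu", "Tue")
f <- factor(wday, levels = c("Mon", "Tue", "Wed", "Thu", "Fri"))

counts <- table(f)  # one count per level, in level order
counts              # Fri appears with a count of 0
```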
Combining Multiple Vectors into One Vector and a Factor
> comb <- stack(list(v1=v1, v2=v2, v3=v3)) # Combine 3 vectors
Why in the world would you want to mash all your data into one big vector and a parallel factor? The reason is that many important statistical functions require the data in that format. Suppose you survey freshmen, sophomores, and juniors regarding their confidence level (“What percentage of the time do you feel confident in school?”). Now you have three vectors, called freshmen, sophomores, and juniors. You want to perform an ANOVA analysis of the differences between the groups. The ANOVA function, aov, requires one vector with the survey results as well as a parallel factor that identifies the group. You can combine the groups using the stack function:
Row | freshmen | sophomores | juniors |
---|---|---|---|
1 | .60 | .70 | .76 |
2 | .35 | .61 | .72 |
3 | .44 | .63 | .92 |
4 | .62 | .87 | .87 |
5 | .60 | .85 | |
6 | | .70 | |
7 | | .64 | |
freshmen <- c(0.6, 0.35, 0.44, 0.62, 0.6)
sophomores <- c(0.7, 0.61, 0.63, 0.87, 0.85, 0.7, 0.64)
juniors <- c(.76, .72, .92, .87)
> comb <- stack(list(fresh=freshmen, soph=sophomores, jrs=juniors))
> print(comb)
   values   ind
1    0.60 fresh
2    0.35 fresh
3    0.44 fresh
4    0.62 fresh
5    0.60 fresh
6    0.70  soph
7    0.61  soph
8    0.63  soph
9    0.87  soph
10   0.85  soph
11   0.70  soph
12   0.64  soph
13   0.76   jrs
14   0.72   jrs
15   0.92   jrs
16   0.87   jrs
Now you can perform the ANOVA analysis on the two columns:
> aov(values ~ ind, data=comb)
When building the list we must provide tags for the list elements (the tags are fresh, soph, and jrs in this example). Those tags are required because stack uses them as the levels of the parallel factor.
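To confirm that the tags become the levels of the parallel factor, you can inspect the ind column after stacking. A small sketch reusing the survey vectors from above:

```r
freshmen   <- c(0.6, 0.35, 0.44, 0.62, 0.6)
sophomores <- c(0.7, 0.61, 0.63, 0.87, 0.85, 0.7, 0.64)
juniors    <- c(0.76, 0.72, 0.92, 0.87)

comb <- stack(list(fresh = freshmen, soph = sophomores, jrs = juniors))

is.factor(comb$ind)  # the parallel column is a factor
levels(comb$ind)     # its levels are the list tags
table(comb$ind)      # number of responses per group
```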
Annoyed by the funky variable names (column names)?
colnames(comb) <- c("score", "year")
aov(score ~ year, data=comb)
Creating a List
> lst <- list(0.5, 0.841, 0.977)
> lst
[[1]]
[1] 0.5

[[2]]
[1] 0.841

[[3]]
[1] 0.977
When R prints the list, it identifies each list element by its position (1, 2, 3) and prints the element’s value (e.g., [1] 0.5) under its position. More usefully, lists can — unlike vectors — contain elements of different modes (types). Here is an extreme example of a mongrel created from a scalar, a character string, a vector, and a function:
> lst <- list(3.14, "Moe", c(1,1,2,3), mean)
> lst
[[1]]
[1] 3.14

[[2]]
[1] "Moe"

[[3]]
[1] 1 1 2 3

[[4]]
function (x, ...)
UseMethod("mean")
<environment: namespace:base>
You can also build a list by creating an empty list and populating it. Here is our “mongrel” example built in that way:
> lst <- list()
> lst[[1]] <- 3.14
> lst[[2]] <- "Moe"
> lst[[3]] <- c(1,1,2,3)
> lst[[4]] <- mean
List elements can also be named when the list is created:
> lst <- list(mid=0.5, right=0.841, far.right=0.977)
> lst
$mid
[1] 0.5

$right
[1] 0.841

$far.right
[1] 0.977
Selecting List Elements by Position
> years <- list(1960, 1964, 1976, 1994)
> years
[[1]]
[1] 1960

[[2]]
[1] 1964

[[3]]
[1] 1976

[[4]]
[1] 1994
> years[[1]]
[1] 1960
lst[[n]]
This is an element, not a list. It is the nth element of lst.
lst[n]
This is a list, not an element. The list contains one element, taken from the nth element of lst. This is a special case of lst[c(n1, n2, …, nk)] in which we eliminated the c(…) construct because there is only one n.
> class(years[[1]])
[1] "numeric"
> class(years[1])
[1] "list"
Selecting List Elements by Name
Use one of these forms. Here, lst is a list variable:
lst[["name"]]
Selects the element called name. Returns NULL if no element has that name.
lst$name
Same as previous, just different syntax.
lst[c(name1, name2, ..., namek)]
Returns a list built from the indicated elements of lst.
Note that the first two forms return an element whereas the third form returns a list.
> years <- list(Kennedy=1960, Johnson=1964, Carter=1976, Clinton=1994)
The following has the same effect as the above.
years <- list(1960, 1964, 1976, 1994)
names(years) <- c("Kennedy", "Johnson", "Carter", "Clinton")
These next two expressions return the same value—namely, the element that is named “Kennedy”:
> years[["Kennedy"]]
[1] 1960
> years$Kennedy
[1] 1960
The following two expressions return sublists extracted from years:
> years[c("Kennedy","Johnson")]
$Kennedy
[1] 1960

$Johnson
[1] 1964
> years["Carter"]
$Carter
[1] 1976
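The note that lst[["name"]] returns NULL for a missing name is easy to check. In this sketch "Nixon" is simply a name that is not in the list:

```r
years <- list(Kennedy = 1960, Johnson = 1964, Carter = 1976, Clinton = 1994)

missing_elem <- years[["Nixon"]]  # no such element: NULL rather than an error
is.null(missing_elem)

is.null(years$Nixon)              # the $ form behaves the same way

missing_sub <- years["Nixon"]     # single brackets still return a list
length(missing_sub)               # a list of length 1 whose element is NULL
```

Because a missing name does not raise an error, it can be worth testing with is.null before using the result.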
Removing an Element from a List
> years <- list(Kennedy=1960, Johnson=1964, Carter=1976, Clinton=1994)
> years
$Kennedy
[1] 1960

$Johnson
[1] 1964

$Carter
[1] 1976

$Clinton
[1] 1994
> years[["Johnson"]] <- NULL   # Remove the element labeled "Johnson"
> years
$Kennedy
[1] 1960

$Carter
[1] 1976

$Clinton
[1] 1994
You can remove multiple elements this way, too:
> years[c("Carter","Clinton")] <- NULL   # Remove two elements
> years
$Kennedy
[1] 1960
Removing NULL Elements from a List
> lst[sapply(lst, is.null)] <- NULL
- R calls sapply to apply the is.null function to every element of the list.
- sapply returns a vector of logical values that are TRUE wherever the corresponding list element is NULL.
- R selects values from the list according to that vector.
- R assigns NULL to the selected items, removing them from the list.
> lst <- list("Moe", NULL, "Curly")   # Create list with NULL element
> lst
[[1]]
[1] "Moe"

[[2]]
NULL

[[3]]
[1] "Curly"
> lst[sapply(lst, is.null)] <- NULL   # Remove NULL element from list
> lst
[[1]]
[1] "Moe"

[[2]]
[1] "Curly"
The same idiom removes elements that satisfy other conditions:
> lst[lst < 0] <- NULL       # Remove negative elements
> lst[lst == 0] <- NULL      # Remove zero elements
> lst[is.na(lst)] <- NULL    # Remove NA elements
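Conditional removal can also be written with sapply, mirroring the is.null idiom. This is a sketch on a made-up list of single numbers; it builds the logical index explicitly instead of comparing the list directly.

```r
lst <- list(3, -2, 0, 7)  # hypothetical list of single numbers

# Remove negative elements
is_neg <- sapply(lst, function(x) x < 0)
lst[is_neg] <- NULL

# Remove zero elements from what is left
is_zero <- sapply(lst, function(x) x == 0)
lst[is_zero] <- NULL

lst  # only the positive values remain
```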
Initializing a Matrix
> theData <- c(1.1, 1.2, 2.1, 2.2, 3.1, 3.2)
> mat <- matrix(theData, 2, 3)
> mat
     [,1] [,2] [,3]
[1,]  1.1  2.1  3.1
[2,]  1.2  2.2  3.2
matrix(data, row, col)
If data is a single value, the recycling rule is applied.
> matrix(0, 2, 3)    # Create an all-zeros matrix
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
> matrix(NA, 2, 3)   # Create a matrix populated with NA
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
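The data can also be supplied inline:
> mat <- matrix(c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3), 2, 3)
Easier to read, with the data laid out row by row and byrow=TRUE:
> theData <- c(1.1, 1.2, 1.3,
+              2.1, 2.2, 2.3)
> mat <- matrix(theData, 2, 3, byrow=TRUE)
Condensed version:
> mat <- matrix(c(1.1, 1.2, 1.3,
+                 2.1, 2.2, 2.3),
+               2, 3, byrow=TRUE)
Setting the dim attribute directly also creates a matrix, but it fills column by column, like matrix(v, 2, 3) without byrow=TRUE:
> v <- c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3)
> dim(v) <- c(2,3)
> v
     [,1] [,2] [,3]
[1,]  1.1  1.3  2.2
[2,]  1.2  2.1  2.3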
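The difference between column-wise and row-wise filling can be checked with identical. A brief sketch using the same six values; the names by_col, by_row, and w are only for illustration.

```r
v <- c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3)

by_col <- matrix(v, 2, 3)                # default: fill column by column
by_row <- matrix(v, 2, 3, byrow = TRUE)  # fill row by row

w <- v
dim(w) <- c(2, 3)                        # setting dim also fills by column

identical(w, by_col)                     # TRUE
identical(w, by_row)                     # FALSE

# A row-wise fill equals the transpose of a column-wise 3 x 2 fill
identical(by_row, t(matrix(v, 3, 2)))    # TRUE
```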
Performing Matrix Operations
t(A)
Matrix transposition of A
solve(A)
Matrix inverse of A
A %*% B
Matrix multiplication of A and B
diag(n)
An n-by-n diagonal (identity) matrix
> mat <- matrix(c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3), 2, 3, byrow=TRUE)
> mat
     [,1] [,2] [,3]
[1,]  1.1  1.2  1.3
[2,]  2.1  2.2  2.3
> mat%*%t(mat)
     [,1]  [,2]
[1,] 4.34  7.94
[2,] 7.94 14.54
> t(mat)%*%mat
     [,1] [,2] [,3]
[1,] 5.62 5.94 6.26
[2,] 5.94 6.28 6.62
[3,] 6.26 6.62 6.98
Giving Names to the Rows and Columns of a Matrix
> rownames(mat) <- c("rowname1", "rowname2", ..., "rownamem")
> colnames(mat) <- c("colname1", "colname2", ..., "colnamen")
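Here is a concrete sketch of the template above, applied to the 2 × 3 matrix from the previous section. The row and column names are hypothetical placeholders.

```r
mat <- matrix(c(1.1, 1.2, 1.3, 2.1, 2.2, 2.3), 2, 3, byrow = TRUE)

rownames(mat) <- c("first", "second")  # hypothetical row names
colnames(mat) <- c("a", "b", "c")      # hypothetical column names
mat

# Once named, elements can be selected by name instead of by position
mat["second", "b"]
```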
Selecting One Row or Column from a Matrix
> vec <- mat[1,]   # First row
> vec <- mat[,3]   # Third column
Normally, when you select one row or column from a matrix, R strips off the dimensions. The result is a dimensionless vector:
> mat[1,]
[1]  1  4  7 10
> mat[,3]
[1] 7 8 9
When you include the drop=FALSE argument, however, R retains the dimensions. In that case, selecting a row returns a row vector (a 1 × n matrix):
> mat[1,,drop=FALSE]
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
Likewise, selecting a column with drop=FALSE returns a column vector (an n × 1 matrix):
> mat[,3,drop=FALSE]
     [,1]
[1,]    7
[2,]    8
[3,]    9
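The matrix used in this section is not defined in the text. The outputs shown are consistent with a 3 × 4 matrix holding 1 through 12, so the sketch below assumes that definition in order to compare dimensions with and without drop=FALSE.

```r
mat <- matrix(1:12, 3, 4)            # assumed definition; reproduces the outputs shown

row_vec  <- mat[1, ]                 # dimensions dropped: a plain vector
row_mat  <- mat[1, , drop = FALSE]   # dimensions kept: a 1 x 4 matrix
col_mat  <- mat[, 3, drop = FALSE]   # dimensions kept: a 3 x 1 matrix

dim(row_vec)  # NULL
dim(row_mat)  # 1 4
dim(col_mat)  # 3 1
```

Keeping the dimensions matters when the result is passed on to matrix operations such as %*%, which depend on the row or column orientation.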
Initializing a Data Frame from Column Data
Combining vectors
> dfrm <- data.frame(v1, v2, v3, f1, f2)
Combining lists
> dfrm <- as.data.frame(list.of.vectors)
pred1 <- c(-2.7528917, -0.3626909, -1.0416039, 1.266682, 0.7806372, -1.0832624, -2.0883305, -0.7063653, -0.8394022, -0.4966884)
pred2 <- c(-1.4078413, 0.31286963, -0.69685664, -1.27511434, -0.27292745, 0.73383339, 0.96816822, -0.84476203, 0.31530793, -0.08030948)
pred3 <- c("AM", "AM", "PM", "PM", "AM", "AM", "PM", "PM", "PM", "AM")
resp <- c(12.57715, 21.02418, 18.94694, 18.98153, 19.59455, 20.71605, 22.70062, 18.40691, 21.0093, 19.31253)
> dfrm <- data.frame(pred1, pred2, pred3, resp)
> dfrm
        pred1       pred2 pred3     resp
1  -2.7528917 -1.40784130    AM 12.57715
2  -0.3626909  0.31286963    AM 21.02418
3  -1.0416039 -0.69685664    PM 18.94694
4   1.2666820 -1.27511434    PM 18.98153
5   0.7806372 -0.27292745    AM 19.59455
6  -1.0832624  0.73383339    AM 20.71605
7  -2.0883305  0.96816822    PM 22.70062
8  -0.7063653 -0.84476203    PM 18.40691
9  -0.8394022  0.31530793    PM 21.00930
10 -0.4966884 -0.08030948    AM 19.31253
> dfrm <- data.frame(p1=pred1, p2=pred2, p3=pred3, r=resp)
> dfrm
           p1          p2 p3        r
1  -2.7528917 -1.40784130 AM 12.57715
2  -0.3626909  0.31286963 AM 21.02418
3  -1.0416039 -0.69685664 PM 18.94694
.
. (etc.)
.
Suppose that pred3 and resp contain additional data, as below:
pred3 <- c(pred3, "PM")
resp <- c(resp, 20,30,40)
Now combining these vectors into a data frame fails, because the columns no longer have the same length:
dfrm <- data.frame(pred1, pred2, pred3, resp)
To fix this, remove the additional data from the vectors so that each vector has 10 elements. How would you do that?
pred3 <- pred3[-c(11)]
resp <- resp[-c(11:13)]
dfrm <- data.frame(pred1, pred2, pred3, resp)
Or, add NAs to the short columns. How would you do that?
> lst <- list(p1=pred1, p2=pred2, p3=pred3, r=resp)
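One possible way to pad the short columns with NA is to extend every vector in the list to the length of the longest one; assigning a larger length to a vector fills the new positions with NA. The sketch below uses shortened, made-up stand-ins for the vectors, and the names n, padded, and dfrm2 are only for illustration.

```r
# Shortened stand-ins for the columns (values are illustrative only)
lst <- list(p1 = c(-2.75, -0.36, -1.04),
            p3 = c("AM", "AM", "PM", "PM"),
            r  = c(12.6, 21.0, 18.9, 19.0, 19.6))

n <- max(lengths(lst))  # length of the longest column

# Extend each vector to length n; the new positions are filled with NA
padded <- lapply(lst, function(x) {
  length(x) <- n
  x
})

dfrm2 <- as.data.frame(padded)
dfrm2
```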
Alternatively, list → as.data.frame
> lst <- list(p1=pred1, p2=pred2, p3=pred3, r=resp)
> as.data.frame(lst)
           p1          p2 p3        r
1  -2.7528917 -1.40784130 AM 12.57715
2  -0.3626909  0.31286963 AM 21.02418
3  -1.0416039 -0.69685664 PM 18.94694
.
. (etc.)
.
Initializing a Data Frame from Row Data
The rbind() function combines vectors, matrices, or data frames by rows.
- data1.csv
Subtype Gender Expression
A m -0.54
A f -0.8
B f -1.03
C m -0.41
- data2.csv
Subtype Gender Expression
D m 3.22
D f 1.02
D f 0.21
D m -0.04
D m 2.11
B m -1.21
A f -0.2
> x1 <- read.csv("http://commres.net/wiki/_export/code/r/data_structures?codeblock=83", head=T, sep=" ")
> x2 <- read.csv("http://commres.net/wiki/_export/code/r/data_structures?codeblock=84", head=T, sep=" ")
> x <- rbind(x1,x2)
> x
   Subtype Gender Expression
1        A      m      -0.54
2        A      f      -0.80
3        B      f      -1.03
4        C      m      -0.41
5        D      m       3.22
6        D      f       1.02
7        D      f       0.21
8        D      m      -0.04
9        D      m       2.11
10       B      m      -1.21
11       A      f      -0.20
Appending Rows to a Data Frame
- suburbs.csv
city county state pop 1 Chicago Cook IL 2853114 2 Kenosha Kenosha WI 90352 3 Aurora Kane IL 171782 4 Elgin Kane IL 94487 5 Gary Lake(IN) IN 102746 6 Joliet Kendall IL 106221 7 Naperville DuPage IL 147779 8 Arlington Heights Cook IL 76031 9 Bolingbrook Will IL 70834 10 Cicero Cook IL 72616 11 Evanston Cook IL 74239 12 Hammond Lake(IN) IN 83048 13 Palatine Cook IL 67232 14 Schaumburg Cook IL 75386 15 Skokie Cook IL 63348 16 Waukegan Lake(IL) IL 91452
suburbs <- read.csv("http://commres.net/wiki/_export/code/r/data_structures?codeblock=88", head=T, sep=" ")
suburbs
. . . . .
suburbs$X <- NULL   # the X column should be deleted
newRow <- data.frame(city="West Dundee", county="Kane", state="IL", pop=5428) suburbs <- rbind(suburbs, newRow) suburbs city county state pop 1 Chicago Cook IL 2853114 2 Kenosha Kenosha WI 90352 3 Aurora Kane IL 171782 4 Elgin Kane IL 94487 5 Gary Lake(IN) IN 102746 6 Joliet Kendall IL 106221 7 Naperville DuPage IL 147779 8 Arlington Heights Cook IL 76031 9 Bolingbrook Will IL 70834 10 Cicero Cook IL 72616 11 Evanston Cook IL 74239 12 Hammond Lake(IN) IN 83048 13 Palatine Cook IL 67232 14 Schaumburg Cook IL 75386 15 Skokie Cook IL 63348 16 Waukegan Lake(IL) IL 91452 17 West Dundee Kane IL 5428
Preallocating a Data Frame
- suburbs.csv
city county state pop 1 Chicago Cook IL 2853114 2 Kenosha Kenosha WI 90352 3 Aurora Kane IL 171782 4 Elgin Kane IL 94487 5 Gary Lake(IN) IN 102746 6 Joliet Kendall IL 106221 7 Naperville DuPage IL 147779 8 Arlington Heights Cook IL 76031 9 Bolingbrook Will IL 70834 10 Cicero Cook IL 72616 11 Evanston Cook IL 74239 12 Hammond Lake(IN) IN 83048 13 Palatine Cook IL 67232 14 Schaumburg Cook IL 75386 15 Skokie Cook IL 63348 16 Waukegan Lake(IL) IL 91452
suburbs <- read.csv("http://commres.net/wiki/_export/code/r/data_structures?codeblock=91", head=T, sep=" ")
> suburbs[[1]] [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 > suburbs[[3]] [1] Cook Kenosha Kane Kane Lake(IN) Kendall DuPage Cook Will Cook Cook Lake(IN) Cook Cook Cook [16] Lake(IL) Levels: Cook DuPage Kane Kendall Kenosha Lake(IL) Lake(IN) Will > suburbs[[4]] [1] IL WI IL IL IN IL IL IL IL IL IL IN IL IL IL IL Levels: IL IN WI
suburbs[[1]]
This returns one column.
suburbs[1]
This returns a data frame, and the data frame contains exactly one column. This is a special case of dfrm[c(n1,n2, …, nk)]. We don’t need the c(…) construct because there is only one n.
Selecting data frame columns by position
- suburbs.csv
city county state pop Chicago Cook IL 2853114 Kenosha Kenosha WI 90352 Aurora Kane IL 171782 Elgin Kane IL 94487 Gary Lake(IN) IN 102746 Joliet Kendall IL 106221 Naperville DuPage IL 147779 Arlington Heights Cook IL 76031 Bolingbrook Will IL 70834 Cicero Cook IL 72616 Evanston Cook IL 74239 Hammond Lake(IN) IN 83048 Palatine Cook IL 67232 Schaumburg Cook IL 75386 Skokie Cook IL 63348 Waukegan Lake(IL) IL 91452
suburbs <- read.csv("http://commres.net/wiki/_export/code/r/data_structures?codeblock=96", head=T, sep=" ")
> suburbs[[1]] [1] "Chicago" "Kenosha" "Aurora" "Elgin" [5] "Gary" "Joliet" "Naperville" "Arlington Heights" [9] "Bolingbrook" "Cicero" "Evanston" "Hammond" [13] "Palatine" "Schaumburg" "Skokie" "Waukegan"
> suburbs[1] city 1 Chicago 2 Kenosha 3 Aurora 4 Elgin 5 Gary 6 Joliet 7 Naperville 8 Arlington Heights 9 Bolingbrook 10 Cicero 11 Evanston 12 Hammond 13 Palatine 14 Schaumburg 15 Skokie 16 Waukegan
> suburbs[c(1,4)] city pop 1 Chicago 2853114 2 Kenosha 90352 3 Aurora 171782 4 Elgin 94487 5 Gary 102746 6 Joliet 106221 7 Naperville 147779 8 Arlington Heights 76031 9 Bolingbrook 70834 10 Cicero 72616 11 Evanston 74239 12 Hammond 83048 13 Palatine 67232 14 Schaumburg 75386 15 Skokie 63348 16 Waukegan 91452
Selecting Data Frame Columns by Name
dfrm[["name"]]
Returns one column, the column called name.
dfrm$name
Same as previous, just different syntax.
To select one or more columns and package them in a data frame, use these list expressions:
dfrm["name"]
Selects one column and packages it inside a data frame object.
dfrm[c("name1", "name2", ..., "namek")]
Selects several columns and packages them in a data frame.
You can use matrix-style subscripting to select one or more columns:
dfrm[, "name"]
Returns the named column.
dfrm[, c("name1", "name2", ..., "namek")]
Selects several columns and packages them in a data frame.
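The forms above can be tried on a small data frame. This sketch uses the first three rows of the suburbs data, typed in by hand so that it runs without downloading the file; the result names are only for illustration.

```r
# First three rows of the suburbs data, entered by hand
dfrm <- data.frame(city   = c("Chicago", "Kenosha", "Aurora"),
                   county = c("Cook", "Kenosha", "Kane"),
                   pop    = c(2853114, 90352, 171782))

v1 <- dfrm[["pop"]]               # the column itself
v2 <- dfrm$pop                    # same, different syntax
d1 <- dfrm["pop"]                 # a one-column data frame
d2 <- dfrm[c("city", "pop")]      # a two-column data frame
m1 <- dfrm[, "pop"]               # matrix-style: the column itself
m2 <- dfrm[, c("city", "pop")]    # matrix-style: a two-column data frame
```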
Selecting Rows and Columns More Easily
Dataset used in this section: Cars93 from the MASS package
install.packages("MASS") library(MASS) Cars93 Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway AirBags DriveTrain Cylinders EngineSize 1 Acura Integra Small 12.9 15.9 18.8 25 31 None Front 4 1.8 2 Acura Legend Midsize 29.2 33.9 38.7 18 25 Driver & Passenger Front 6 3.2 3 Audi 90 Compact 25.9 29.1 32.3 20 26 Driver only Front 6 2.8 4 Audi 100 Midsize 30.8 37.7 44.6 19 26 Driver & Passenger Front 6 2.8 . . . . .
> subset(Cars93, select=Model, subset=(MPG.city > 30))
     Model
31 Festiva
39   Metro
42   Civic
.
. (etc.)
.
> subset(Cars93, select=c(Model,Min.Price,Max.Price),
+        subset=(Cylinders == 4 & Origin == "USA"))
      Model Min.Price Max.Price
6   Century      14.2      17.3
12 Cavalier       8.5      18.3
13  Corsica      11.4      11.4
.
. (etc.)
.
> subset(Cars93, select=c(Manufacturer,Model),
+        subset=c(MPG.highway > median(MPG.highway)))
  Manufacturer   Model
1        Acura Integra
5          BMW    535i
6        Buick Century
.
. (etc.)
.
Changing the Names of Data Frame Columns
mat <- c(-0.818, -0.667, -0.494, -0.819, -0.946, -0.205, 0.385, 1.531, -0.611, -2.155, -0.535, -0.316)
dim(mat) <- c(4,3)
mat
       [,1]   [,2]   [,3]
[1,] -0.818 -0.946 -0.611
[2,] -0.667 -0.205 -2.155
[3,] -0.494  0.385 -0.535
[4,] -0.819  1.531 -0.316
Vanilla variable names!
as.data.frame(mat)
      V1     V2     V3
1 -0.818 -0.946 -0.611
2 -0.667 -0.205 -2.155
3 -0.494  0.385 -0.535
4 -0.819  1.531 -0.316
> colnames(mat) <- c("before","treatment","after")
> mat
     before treatment  after
[1,] -0.818    -0.946 -0.611
[2,] -0.667    -0.205 -2.155
[3,] -0.494     0.385 -0.535
[4,] -0.819     1.531 -0.316
> as.data.frame(mat)
  before treatment  after
1 -0.818    -0.946 -0.611
2 -0.667    -0.205 -2.155
3 -0.494     0.385 -0.535
4 -0.819     1.531 -0.316
Editing a Data Frame
> temp <- edit(mat)   # Opens the editor; close the edit window when done
> mat <- temp         # Overwrite only if you're happy with the changes!
> mat2 <- temp        # ...or keep the edited copy under a new name
Can you save it as "mat.csv" and then read it back into R?
When you read the CSV file back, how would you avoid output like the one below? That is, how do you avoid the X column?
  X before treatment  after
1 1 -0.818    -0.946 -0.611
2 2 -0.667    -0.205 -2.155
3 3 -0.494     0.385 -0.535
4 4 -0.819     1.531 -0.316
Or, how would you save the CSV file without the X column in the first place?
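One possible answer to both questions, sketched under the assumption that the X column comes from the row names that write.csv stores by default. The files are written to temporary paths here rather than to mat.csv, so the sketch leaves no file behind.

```r
mat <- matrix(c(-0.818, -0.667, -0.494, -0.819,
                -0.946, -0.205,  0.385,  1.531,
                -0.611, -2.155, -0.535, -0.316), 4, 3)
colnames(mat) <- c("before", "treatment", "after")

# Option 1: do not write the row names at all
path1 <- tempfile(fileext = ".csv")
write.csv(mat, path1, row.names = FALSE)
back1 <- read.csv(path1)
names(back1)

# Option 2: the file already contains row names;
# tell read.csv that the first column holds them
path2 <- tempfile(fileext = ".csv")
write.csv(mat, path2)
back2 <- read.csv(path2, row.names = 1)
names(back2)
```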
Removing NAs from a Data Frame
Use na.omit to remove rows that contain any NA values.
> clean <- na.omit(dfrm)
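A small sketch of what na.omit does, on a made-up data frame: any row with an NA in any column is dropped, and the remaining rows keep their original row names.

```r
# Hypothetical data frame with missing values in different columns
dfrm <- data.frame(x = c(1, NA, 3, 4),
                   y = c("a", "b", NA, "d"))

clean <- na.omit(dfrm)  # rows 2 and 3 each contain an NA and are removed
clean
nrow(clean)
```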
Excluding Columns by Name
> subset(dfrm, select = -badboy) # All columns except badboy
> cor(patient.data)
            patient.id         pre     dosage        post
patient.id  1.00000000  0.02286906  0.3643084 -0.13798149
pre         0.02286906  1.00000000  0.2270821 -0.03269263
dosage      0.36430837  0.22708208  1.0000000 -0.42006280
post       -0.13798149 -0.03269263 -0.4200628  1.00000000
This correlation matrix includes the meaningless “correlation” between patient ID and other variables, which is annoying. We can exclude the patient ID column to clean up the output:
> cor(subset(patient.data, select = -patient.id))
               pre     dosage        post
pre     1.00000000  0.2270821 -0.03269264
dosage  0.22708207  1.0000000 -0.42006280
post   -0.03269264 -0.4200628  1.00000000
We can exclude multiple columns by giving a vector of negated names:
> cor(subset(patient.data, select = c(-patient.id,-dosage)))
             pre        post
pre   1.00000000 -0.03269264
post -0.03269264  1.00000000
Combining Two Data Frames
Use cbind to combine the columns of two data frames side by side:
> stooges
   name n.marry n.child
1   Moe       1       2
2 Larry       1       2
3 Curly       4       2
> birth
  birth.year  birth.place
1       1887  Bensonhurst
2       1902 Philadelphia
3       1903     Brooklyn
> cbind(stooges,birth)
   name n.marry n.child birth.year  birth.place
1   Moe       1       2       1887  Bensonhurst
2 Larry       1       2       1902 Philadelphia
3 Curly       4       2       1903     Brooklyn
Use rbind to stack the rows of two data frames:
> stooges
   name n.marry n.child
1   Moe       1       2
2 Larry       1       2
3 Curly       4       2
> guys
   name n.marry n.child
1   Tom       4       2
2  Dick       1       4
3 Harry       1       1
> rbind(stooges,guys)
   name n.marry n.child
1   Moe       1       2
2 Larry       1       2
3 Curly       4       2
4   Tom       4       2
5  Dick       1       4
6 Harry       1       1