A Philosophical Note on R's Input System

Several of my Statistical Analysis System (SAS) friends are disappointed with the input facilities of R. They point out that SAS has an elaborate set of commands for reading and parsing input files in many formats. R does not, and this leads them to conclude that R is not ready for real work. After all, if it can’t read your data, what good is it?

I think they do not understand the design philosophy behind R, which is based on a statistical package called S. The authors of S worked at Bell Labs and were steeped in the Unix design philosophy. A keystone of that philosophy is the idea of modular tools. Programs in Unix are not large, monolithic programs that try to do everything. Instead, they are smaller, specialized tools that each do one thing well. The Unix user joins the programs together like building blocks, creating systems from the components.

R does statistics and graphics well. Very well, in fact. It is superior in that way to many commercial packages.

R is not a great tool for preprocessing data files, however. The authors of S assumed you would perform that munging with some other tool: perl, awk, sed, cut, paste, whatever floats your boat. Why should they duplicate that capability?

If your data is difficult to access or difficult to parse, consider using an outboard tool to preprocess the data before loading it into R. Let R do what R does best.

====== Entering Data from the Keyboard ======

For a small amount of data, combine the values with c():

  > scores <- c(61, 66, 90, 88, 100)

====== Digit Control ======

By default R prints seven significant digits. Use the digits argument of print to change that:

  > pi
  [1] 3.141593
  > 100*pi
  [1] 314.1593
  > print(pi, digits=4)
  [1] 3.142
  > print(100*pi, digits=4)
  [1] 314.2

====== Redirecting Output to a File ======

To send the output of a single cat call to a file, use its file argument:

  > cat("The answer is", answer, "\n", file="filename")

Use the sink function to redirect all output from both print and cat. Call sink with a filename argument to begin redirecting console output to that file. When you are done, use sink with no argument to close the file and resume output to the console:

  > sink("filename")            # Begin writing output to file
  
  . . . other session work . . .
  
  > sink()                      # Resume writing output to console

For example, to capture the output of a script:

  > sink("script_output.txt")   # Redirect output to file
  > source("script.R")          # Run the script, capturing its output
  > sink()                      # Resume writing output to console

Several cat calls can write to the same file if the later ones append:

  cat(data, file="analysisReport.out")
  cat(results, file="analysisRepart.out", append=TRUE)
  cat(conclusion, file="analysisReport.out", append=TRUE)

Without append=TRUE, each call **overwrites** whatever the file already contained. This approach is also tedious and error-prone: the second line above misspells the filename (analysisRepart.out), so results ends up in a different file. Opening a connection to the file with the file function is better:

  con <- file("analysisReport.out", "w")
  cat(data, file=con)
  cat(results, file=con)
  cat(conclusion, file=con)
  close(con)

The call file("filename.txt", "w") opens filename.txt for writing, and the connection stays open until you call close on it. Because every cat writes to the same open connection, append=TRUE is not necessary.

====== Listing Files ======

  list.files()

On Windows, a backslash inside an R string starts an escape sequence, so write paths with forward slashes or with doubled backslashes:

  > samp <- read.csv("C:\Data\sample-data.csv")     # wrong: \D is read as an escape sequence
  > samp <- read.csv("C:/Data/sample-data.csv")     # good: forward slashes
  > samp <- read.csv("C:\\Data\\sample-data.csv")   # okay: each backslash escaped as \\

====== Fixed-Width Records ======

  > records <- read.fwf("filename", widths=c(w1, w2, ..., wn))

Sample data file:

  Fisher    R.A.      1890 1962
  Pearson   Karl      1857 1936
  Cox       Gertrude  1900 1978
  Yates     Frank     1902 1994
  Smith     Kirstine  1878 1939

  > records <- read.fwf("http://commres.net/wiki/_export/code/r/input_output?codeblock=11", widths=c(10,10,4,-1,4))

Note: a negative width skips that many characters. Here -1 skips the single space between the two years.

  > records
            V1         V2   V3   V4
  1 Fisher     R.A.       1890 1962
  2 Pearson    Karl       1857 1936
  3 Cox        Gertrude   1900 1978
  4 Yates      Frank      1902 1994
  5 Smith      Kirstine   1878 1939

The default variable names (V1, V2, . . .) are not descriptive. Supply your own with the col.names option.
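As a self-contained sketch of reading fixed-width records with named columns, the snippet below writes the five sample records to a temporary file and reads them back. The temporary file stands in for the wiki URL, and strip.white=TRUE and stringsAsFactors=FALSE are assumptions added here to trim the padding and keep the names as strings; they are not part of the original call.

```r
# Write the sample records to a temporary file (a stand-in for the real data file).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Fisher    R.A.      1890 1962",
  "Pearson   Karl      1857 1936",
  "Cox       Gertrude  1900 1978",
  "Yates     Frank     1902 1994",
  "Smith     Kirstine  1878 1939"
), path)

# Two 10-character name fields, a 4-digit year, one skipped character, a 4-digit year.
records <- read.fwf(path,
                    widths = c(10, 10, 4, -1, 4),
                    col.names = c("Last", "First", "Born", "Died"),
                    strip.white = TRUE,        # assumption: drop the padding spaces
                    stringsAsFactors = FALSE)  # assumption: keep names as strings

print(records)

# Remove the temporary file once the data is in memory.
unlink(path)
```

Born and Died come back as integer columns, so an expression such as records$Died - records$Born works without further conversion.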
  > records <- read.fwf("http://commres.net/wiki/_export/code/r/input_output?codeblock=11", widths=c(10,10,4,-1,4),
  +                     col.names=c("Last","First","Born","Died"))
  > records
          Last      First Born Died
  1 Fisher     R.A.       1890 1962
  2 Pearson    Karl       1857 1936
  3 Cox        Gertrude   1900 1978
  4 Yates      Frank      1902 1994
  5 Smith      Kirstine   1878 1939

====== Reading Tabular Data Files ======

  > dfrm <- read.table("filename")

Sample data file (statisticians.txt), with fields separated by whitespace:

  Fisher R.A. 1890 1962
  Pearson Karl 1857 1936
  Cox Gertrude 1900 1978
  Yates Frank 1902 1994
  Smith Kirstine 1878 1939

NOTE: the final data line should end with a newline (press Enter after it); otherwise read.table may warn about an incomplete final line.

  > dfrm <- read.table("statisticians.txt")
  > print(dfrm)
         V1       V2   V3   V4
  1  Fisher     R.A. 1890 1962
  2 Pearson     Karl 1857 1936
  3     Cox Gertrude 1900 1978
  4   Yates    Frank 1902 1994
  5   Smith Kirstine 1878 1939

If the file uses a separator other than whitespace, specify it with sep. For a colon-separated file:

  > dfrm <- read.table("statisticians.txt", sep=":")

read.table may convert character columns to factors (this was the default before R 4.0.0):

  > class(dfrm$V1)
  [1] "factor"

Set stringsAsFactors=FALSE to keep them as character strings:

  > dfrm <- read.table("statisticians.txt", stringsAsFactors=FALSE)
  > class(dfrm$V1)
  [1] "character"

NOTE: a factor is R's representation of a nominal (categorical) variable.

If the file has a header line, set header=TRUE so that the first line supplies the column names:

  lastname firstname born died
  Fisher R.A. 1890 1962
  Pearson Karl 1857 1936
  Cox Gertrude 1900 1978
  Yates Frank 1902 1994
  Smith Kirstine 1878 1939

  > dfrm <- read.table("statisticians.txt", header=TRUE, stringsAsFactors=FALSE)
  > print(dfrm)

Lines that start with # are treated as comments and ignored:

  # This is a data file of famous statisticians.
  # Last edited on 1994-06-18
  lastname firstname born died
  Fisher R.A. 1890 1962
  Pearson Karl 1857 1936
  Cox Gertrude 1900 1978
  Yates Frank 1902 1994
  Smith Kirstine 1878 1939

====== Reading from CSV Files ======

Sample data:

  label lbound ubound
  low   0.000  0.674
  mid   0.674  1.640
  high  1.640  2.330

  > tbl <- read.csv("filename.csv")
  > tbl
  > tbl <- read.csv("filename.csv", header=FALSE)             # the file has no header line
  > tbl
  > tbl <- read.csv("filename.csv", header=FALSE, sep="\t")   # fields are separated by tabs
  > tbl

  > str(tbl)
  'data.frame':   3 obs. of  3 variables:
   $ label : Factor w/ 3 levels "high","low","mid": 2 3 1
   $ lbound: num  0 0.674 1.64
   $ ubound: num  0.674 1.64 2.33

Use as.is=TRUE to keep character columns as strings instead of converting them to factors:

  > tbl <- read.csv("table-data.csv", as.is=TRUE)
  > str(tbl)
  'data.frame':   3 obs. of  3 variables:
   $ label : chr  "low" "mid" "high"
   $ lbound: num  0 0.674 1.64
   $ ubound: num  0.674 1.64 2.33

====== Writing to CSV Files ======

  > write.csv(x, file="filename", row.names=FALSE)

  > print(tbl)
    label lbound ubound
  1   low  0.000  0.674
  2   mid  0.674  1.640
  3  high  1.640  2.330
  > write.csv(tbl, file="table-data.csv", row.names=FALSE)

The resulting file:

  "label","lbound","ubound"
  "low",0,0.674
  "mid",0.674,1.64
  "high",1.64,2.33

If we do not specify row.names=FALSE, the function prepends each row with a label taken from the row.names attribute of your data. If your data doesn’t have row names, then the function just uses the row numbers, which creates a CSV file like this:

  "","label","lbound","ubound"
  "1","low",0,0.674
  "2","mid",0.674,1.64
  "3","high",1.64,2.33

====== Reading Tabular or CSV Data from the Web ======

  > tbl <- read.csv("http://www.example.com/download/data.csv")
  > tbl <- read.table("ftp://ftp.example.com/download/data.txt")

====== Reading Data from HTML Tables ======

  > library(XML)
  > url <- 'http://www.example.com/data/table.html'
  > tbls <- readHTMLTable(url)            # a list of all tables on the page
  > tbl <- readHTMLTable(url, which=3)    # only the third table

  > library(XML)
  > url <- 'http://en.wikipedia.org/wiki/World_population'
  > tbls <- readHTMLTable(url)

The above does not work for me, possibly because the page is now served over HTTPS, which readHTMLTable cannot download by itself. Fetching the page with httr first and passing the content to readHTMLTable works:

  library(httr)
  library(XML)
  url <- 'http://en.wikipedia.org/wiki/World_population'
  resp <- GET(url)
  tables <- readHTMLTable(rawToChar(resp$content))
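To try readHTMLTable without network access, here is a minimal sketch that parses a small HTML table held in a string. The table reuses the label/lbound/ubound values from the CSV section; the names html, doc, and tbl are illustrative, and the explicit header=TRUE and the numeric conversion are choices made for this example rather than part of the workflow above.

```r
library(XML)

# Hypothetical HTML page built in memory, so nothing has to be downloaded.
html <- paste0(
  "<html><body><table>",
  "<tr><th>label</th><th>lbound</th><th>ubound</th></tr>",
  "<tr><td>low</td><td>0.000</td><td>0.674</td></tr>",
  "<tr><td>mid</td><td>0.674</td><td>1.640</td></tr>",
  "<tr><td>high</td><td>1.640</td><td>2.330</td></tr>",
  "</table></body></html>"
)

# Parse the string as HTML, then extract the first (and only) table.
doc <- htmlParse(html, asText = TRUE)
tbl <- readHTMLTable(doc, which = 1, header = TRUE, stringsAsFactors = FALSE)

# Convert the numeric columns explicitly, since table cells are read as text.
tbl$lbound <- as.numeric(tbl$lbound)
tbl$ubound <- as.numeric(tbl$ubound)

str(tbl)
```

The same two steps (parse, then readHTMLTable with which) apply to content fetched with GET, with rawToChar(resp$content) taking the place of the html string.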