A Philosophical Note on R's Input System

Several of my Statistical Analysis System (SAS) friends are disappointed with the input facilities of R. They point out that SAS has an elaborate set of commands for reading and parsing input files in many formats. R does not, and this leads them to conclude that R is not ready for real work. After all, if it can’t read your data, what good is it?

I think they do not understand the design philosophy behind R, which is based on a statistical package called S. The authors of S worked at Bell Labs and were steeped in the Unix design philosophy. A keystone of that philosophy is the idea of modular tools. Programs in Unix are not large, monolithic programs that try to do everything. Instead, they are smaller, specialized tools that each do one thing well. The Unix user joins the programs together like building blocks, creating systems from the components.

R does statistics and graphics well. Very well, in fact. It is superior in that way to many commercial packages.
R is not a great tool for preprocessing data files, however. The authors of S assumed you would perform that munging with some other tool: perl, awk, sed, cut, paste, whatever floats your boat. Why should they duplicate that capability?

If your data is difficult to access or difficult to parse, consider using an outboard tool to preprocess the data before loading it into R. Let R do what R does best.

Entering Data from the Keyboard

> scores <- c(61, 66, 90, 88, 100)
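For a small table rather than a single vector, the same c() calls can be combined into a data frame. A minimal sketch; the label column and its values are made up for illustration and are not part of the original example:

```r
# Scores typed in directly, as in the example above
scores <- c(61, 66, 90, 88, 100)

# Hypothetical labels, for illustration only
labels <- c("a", "b", "c", "d", "e")

# Combine the vectors column-wise into a data frame
points <- data.frame(label = labels, score = scores, stringsAsFactors = FALSE)

print(points)
print(mean(points$score))
```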

Digit Control

> pi
[1] 3.141593
> 100*pi
[1] 314.1593
> print(pi, digits=4)
[1] 3.142
> print(100*pi, digits=4)
[1] 314.2
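
The digits argument of print affects only that one call. As a rough sketch of two related base R options: format returns the rounded value as a string, which is handy inside cat (cat has no digits argument), and options(digits=) changes the default for the rest of the session. The variable names here are illustrative.

```r
# format() returns a character string with the requested significant digits
pi4 <- format(pi, digits = 4)
cat("pi is about", pi4, "\n")

# The same works for other values
big4 <- format(100 * pi, digits = 4)
cat("100*pi is about", big4, "\n")

# options(digits=) changes the session default; save and restore the old setting
old <- options(digits = 4)
print(pi)
options(old)
```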

Redirecting Output to a File

> cat("The answer is", answer, "\n", file="filename")

Use the sink function to redirect all output from both print and cat. Call sink with a filename argument to begin redirecting console output to that file. When you are done, use sink with no argument to close the file and resume output to the console:

> sink("filename")                     # Begin writing output to file

. . . other session work . . .

> sink()                               # Resume writing output to console

> sink("script_output.txt")            # Redirect output to file
> source("script.R")                   # Run the script, capturing its output
> sink()                               # Resume writing output to console

cat(data, file="analysisReport.out")
cat(results, file="analysisRepart.out", append=TRUE)
cat(conclusion, file="analysisReport.out", append=TRUE)

Without the append=TRUE option, each cat call overwrites what the previous call wrote. Repeating the filename is also tedious and error prone: the second line above misspells it (analysisRepart.out), so the results end up in a different file. The method below (opening a connection to the file with the file function) is better.

con <- file("analysisReport.out", "w")
cat(data, file=con)
cat(results, file=con)
cat(conclusion, file=con)
close(con)

file("filename.txt", "w") opens filename.txt for writing, and the connection stays open until you call close() on it. Because every cat call writes to the same open connection, the append=TRUE option is not necessary.
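
The connection approach can be tried end to end with a temporary file. This is a minimal sketch: the section strings are placeholders, and tempfile is used only so that nothing is left behind in the working directory.

```r
# Write three pieces of output through one connection, then read them back
out_file <- tempfile(fileext = ".out")   # temporary file, illustration only

con <- file(out_file, "w")               # open the file for writing
cat("data section\n", file = con)
cat("results section\n", file = con)
cat("conclusion section\n", file = con)
close(con)                               # flush and release the file

written <- readLines(out_file)
print(written)
```

All three lines arrive in the same file, in order, without any append=TRUE.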

Listing Files

list.files()
> samp <- read.csv("C:\Data\sample-data.csv")     # wrong: a lone \ starts an escape sequence
> samp <- read.csv("C:/Data/sample-data.csv")     # good: forward slashes work on Windows too
> samp <- read.csv("C:\\Data\\sample-data.csv")   # okay: each backslash is escaped as \\
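
One way to sidestep the backslash problem is to let R build the path. The sketch below uses base R's file.path and list.files; the temporary directory and the file names in it are made up for the demonstration.

```r
# file.path() joins path components with "/" on every platform
csv_path <- file.path("C:", "Data", "sample-data.csv")
print(csv_path)

# In a string literal, "\\" stands for a single backslash character
win_path <- "C:\\Data\\sample-data.csv"
print(nchar(win_path))                   # counts characters, not keystrokes

# list.files() can filter by regular expression; here: CSV files only
demo_dir <- tempfile("demo")             # hypothetical directory, illustration only
dir.create(demo_dir)
file.create(file.path(demo_dir, c("a.csv", "b.csv", "notes.txt")))
csv_files <- list.files(demo_dir, pattern = "\\.csv$")
print(csv_files)
```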

Fixed-Width Records

> records <- read.fwf("filename", widths=c(w1, w2, ..., wn))
fixed-width.txt
Fisher    R.A.      1890 1962
Pearson   Karl      1857 1936
Cox       Gertrude  1900 1978
Yates     Frank     1902 1994
Smith     Kirstine  1878 1939
> records <- read.fwf("http://commres.net/wiki/_export/code/r/input_output?codeblock=11", widths=c(10,10,4,-1,4))

Note: a negative width skips that many characters. Here -1 skips the single space between the two years.

> records
          V1         V2   V3   V4
1 Fisher     R.A.       1890 1962
2 Pearson    Karl       1857 1936
3 Cox        Gertrude   1900 1978
4 Yates      Frank      1902 1994
5 Smith      Kirstine   1878 1939

The default variable names (V1, V2, . . .) are not informative. Supply your own with the col.names option:

> records <- read.fwf("http://commres.net/wiki/_export/code/r/input_output?codeblock=11", widths=c(10,10,4,-1,4),
+                     col.names=c("Last","First","Born","Died"))
> records
        Last      First Born Died
1 Fisher     R.A.       1890 1962
2 Pearson    Karl       1857 1936
3 Cox        Gertrude   1900 1978
4 Yates      Frank      1902 1994
5 Smith      Kirstine   1878 1939
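
The same read can be reproduced without the web link by writing the five records to a temporary file first. This is a sketch under two assumptions that go beyond the example above: strip.white=TRUE (passed through to read.table) trims the padding around each field, and stringsAsFactors=FALSE keeps the names as plain strings.

```r
# The five records from fixed-width.txt, written to a temporary file
fwf_lines <- c(
  "Fisher    R.A.      1890 1962",
  "Pearson   Karl      1857 1936",
  "Cox       Gertrude  1900 1978",
  "Yates     Frank     1902 1994",
  "Smith     Kirstine  1878 1939"
)
fwf_file <- tempfile(fileext = ".txt")
writeLines(fwf_lines, fwf_file)

records <- read.fwf(
  fwf_file,
  widths = c(10, 10, 4, -1, 4),                   # -1 skips the space between the years
  col.names = c("Last", "First", "Born", "Died"),
  strip.white = TRUE,                             # trim the padding in each field
  stringsAsFactors = FALSE                        # keep names as character strings
)

print(records)
print(records$Died - records$Born)                # age at death, as a sanity check
```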

Reading Tabular Data Files

> dfrm <- read.table("filename")
statisticians.txt
Fisher R.A. 1890 1962
Pearson Karl 1857 1936
Cox Gertrude 1900 1978
Yates Frank 1902 1994
Smith Kirstine 1878 1939

NOTE: the final data line must end with a newline (press Enter after the last record).

> dfrm <- read.table("statisticians.txt")
> print(dfrm)
       V1       V2   V3   V4
1  Fisher     R.A. 1890 1962
2 Pearson     Karl 1857 1936
3     Cox Gertrude 1900 1978
4   Yates    Frank 1902 1994
5   Smith Kirstine 1878 1939
> dfrm <- read.table("statisticians.txt", sep=":")     # sep sets the field separator (here a colon)
> class(dfrm$V1)
[1] "factor"
> dfrm <- read.table("statisticians.txt", stringsAsFactors=FALSE)
> class(dfrm$V1)
[1] "character"

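NOTE: a factor is R's representation of a categorical (nominal) variable. Before R 4.0.0, read.table converted character columns to factors by default; since R 4.0.0 the default is stringsAsFactors=FALSE.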
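
To see the difference between the two column types, the sketch below parses the same three records twice. It uses read.table's text= argument to read from a string instead of a file; the variable names are illustrative.

```r
txt <- "Fisher R.A. 1890 1962
Pearson Karl 1857 1936
Cox Gertrude 1900 1978"

# text= lets read.table parse a string instead of a file
as_chr <- read.table(text = txt, stringsAsFactors = FALSE)
as_fct <- read.table(text = txt, stringsAsFactors = TRUE)

print(class(as_chr$V1))
print(class(as_fct$V1))
print(levels(as_fct$V1))        # the distinct values that the factor can take
print(as.integer(as_fct$V1))    # each row is stored as an index into the levels
```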

statisticians2.txt
lastname firstname born died
Fisher R.A. 1890 1962
Pearson Karl 1857 1936
Cox Gertrude 1900 1978
Yates Frank 1902 1994
Smith Kirstine 1878 1939
> dfrm <- read.table("statisticians2.txt", header=TRUE, stringsAsFactors=FALSE)
> print(dfrm)

Lines that begin with # are treated as comments and ignored.

statisticians3.txt
# This is a data file of famous statisticians.
# Last edited on 1994-06-18
lastname firstname born died
Fisher R.A. 1890 1962
Pearson Karl 1857 1936
Cox Gertrude 1900 1978
Yates Frank 1902 1994
Smith Kirstine 1878 1939
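
A quick check that both the header line and the comment lines are handled as described. This sketch writes the contents of statisticians3.txt to a temporary file and reads it back; the variable names are illustrative.

```r
# Contents of statisticians3.txt, including the two comment lines
stat_lines <- c(
  "# This is a data file of famous statisticians.",
  "# Last edited on 1994-06-18",
  "lastname firstname born died",
  "Fisher R.A. 1890 1962",
  "Pearson Karl 1857 1936",
  "Cox Gertrude 1900 1978",
  "Yates Frank 1902 1994",
  "Smith Kirstine 1878 1939"
)
stat_file <- tempfile(fileext = ".txt")
writeLines(stat_lines, stat_file)

# header=TRUE takes the first non-comment line as column names
dfrm <- read.table(stat_file, header = TRUE, stringsAsFactors = FALSE)

print(dfrm)
print(names(dfrm))
```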

Reading from CSV Files

filename.csv
label	lbound	ubound
low	0.000	0.674
mid	0.674	1.640
high	1.640	2.330
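> tbl <- read.csv("filename.csv")                   # first line is used as the header
> tbl
> tbl <- read.csv("filename.csv", header=FALSE)     # no header line: columns become V1, V2, ...
> tbl
> tbl <- read.csv("filename.csv", sep="\t")         # tab-separated fields, header line kept
> tbl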
> str(tbl)
'data.frame':   3 obs. of  3 variables:
 $ label : Factor w/ 3 levels "high","low","mid": 2 3 1
 $ lbound: num  0 0.674 1.64
 $ ubound: num  0.674 1.64 2.33

Use as.is=TRUE to keep character columns as strings instead of converting them to factors:

> tbl <- read.csv("table-data.csv", as.is=TRUE)
> str(tbl)
'data.frame':   3 obs. of  3 variables:
 $ label : chr  "low" "mid" "high"
 $ lbound: num  0 0.674 1.64
 $ ubound: num  0.674 1.64 2.33
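
For tab-separated data such as the table above, base R also provides read.delim, which is read.table with sep="\t" and header=TRUE as defaults. A small sketch, reading from a string via text= so that it runs without a file; the variable names are illustrative.

```r
# The same three rows, with tab-separated fields
tab_lines <- c(
  "label\tlbound\tubound",
  "low\t0.000\t0.674",
  "mid\t0.674\t1.640",
  "high\t1.640\t2.330"
)
tab_text <- paste(tab_lines, collapse = "\n")

# read.delim assumes tabs and a header line
tbl <- read.delim(text = tab_text, as.is = TRUE)
str(tbl)

# Equivalent call through read.csv with an explicit separator
same <- read.csv(text = tab_text, sep = "\t", as.is = TRUE)
print(isTRUE(all.equal(tbl, same)))
```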

Writing to CSV Files

> write.csv(x, file="filename", row.names=FALSE)
> print(tbl)
  label lbound ubound
1   low  0.000  0.674
2   mid  0.674  1.640
3  high  1.640  2.330
> write.csv(tbl, file="table-data.csv", row.names=FALSE)
table-data.csv
"label","lbound","ubound"
"low",0,0.674
"mid",0.674,1.64
"high",1.64,2.33

If we do not specify row.names=FALSE, the function prepends each row with a label taken from the row.names attribute of your data. If your data doesn’t have row names then the function just uses the row numbers, which creates a CSV file like this:

"","label","lbound","ubound"
"1","low",0,0.674
"2","mid",0.674,1.64
"3","high",1.64,2.33

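To confirm that a file written with row.names=FALSE reads back as the same table, a round trip through a temporary file can be sketched as follows. The data frame is rebuilt here from the values printed above; the variable names are illustrative.

```r
# Rebuild the example table from the values shown above
tbl <- data.frame(
  label  = c("low", "mid", "high"),
  lbound = c(0, 0.674, 1.64),
  ubound = c(0.674, 1.64, 2.33),
  stringsAsFactors = FALSE
)

csv_file <- tempfile(fileext = ".csv")
write.csv(tbl, file = csv_file, row.names = FALSE)

# Show the raw file contents
csv_lines <- readLines(csv_file)
cat(csv_lines, sep = "\n")

# Read it back and compare with the original
tbl2 <- read.csv(csv_file, as.is = TRUE)
print(isTRUE(all.equal(tbl, tbl2)))
```

With the default row.names=TRUE, the file would gain an extra unnamed first column, and reading it back would not reproduce the original table without further options.
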
Reading Tabular or CSV Data from the Web

> tbl <- read.csv("http://www.example.com/download/data.csv")
> tbl <- read.table("ftp://ftp.example.com/download/data.txt")
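
Remote reads can fail (server down, wrong URL), so it may help to wrap them. The helper below is hypothetical, not part of R: it returns NULL instead of stopping the script. The demonstration uses a local file:// URL, which assumes Unix-style paths, so that the sketch runs without network access.

```r
# Hypothetical helper: returns a data frame, or NULL if the read fails
read_csv_or_null <- function(url, ...) {
  tryCatch(
    read.csv(url, ...),
    error = function(e) {
      message("Could not read ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Demonstration with a local file:// URL (assumes Unix-style paths)
local_file <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2", "3,4"), local_file)
local_url <- paste0("file://", local_file)

tbl <- read_csv_or_null(local_url)
print(tbl)

# A URL that points at nothing yields NULL rather than an error
missing_tbl <- suppressWarnings(
  read_csv_or_null(paste0("file://", tempfile(fileext = ".csv")))
)
print(is.null(missing_tbl))
```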

Reading Data from HTML Tables

> library(XML)
> url <- 'http://www.example.com/data/table.html'
> tbls <- readHTMLTable(url)                 # list of all tables on the page
> tbl <- readHTMLTable(url, which=3)         # only the third table

> url <- 'http://en.wikipedia.org/wiki/World_population'
> tbls <- readHTMLTable(url)

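The above did not work for me, most likely because Wikipedia redirects to HTTPS, which readHTMLTable cannot fetch on its own. Downloading the page with httr first and passing the text to readHTMLTable works: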

library(httr)
url <- 'http://en.wikipedia.org/wiki/World_population'
resp <- GET(url)                                       # download the page with httr
tables <- readHTMLTable(rawToChar(resp$content))       # parse the tables from the HTML text