Saturday, November 20, 2010

R function for reading big tables

HugeFileLoader = function(path, sep = "\t", skip = 0, header = TRUE, nrows = 10){

### counts the number of lines using shell wc command, and converts the output to numeric
line.count = paste("wc -l ", shQuote(path), sep = "")
row.count = as.numeric(strsplit(system(line.count, intern = TRUE), split=" ")[[1]][1]) - skip

### reads in the first nrows lines of the file and determines the type of each column
first.rows = read.table(path, header = header, nrows = nrows, skip = skip, sep = sep)
tab.classes = sapply(first.rows, class)

### reads in the data
tab = read.table(path, header=header, colClasses=tab.classes, comment.char="#", nrows=row.count, skip=skip, sep=sep)
return(tab)
}
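
To see the two-pass idea in isolation, here is a minimal sketch on a small temporary file. The data and the variable names (tmp, demo, sample.rows, col.classes, full) are made up for this example, and the line count comes from readLines rather than wc so the snippet does not depend on a shell.

```r
# Write a small tab-separated file to a temporary location (made-up data)
tmp = tempfile(fileext = ".txt")
demo = data.frame(id = 1:20,
                  score = seq(0.5, 10, by = 0.5),
                  label = rep(c("a", "b"), 10))
write.table(demo, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Pass 1: read only a few rows and record the class of each column
sample.rows = read.table(tmp, header = TRUE, nrows = 5, sep = "\t",
                         stringsAsFactors = FALSE)
col.classes = sapply(sample.rows, class)

# Pass 2: read the whole file, passing the column classes and an upper
# bound on the number of rows so read.table does not have to guess either
line.total = length(readLines(tmp))
full = read.table(tmp, header = TRUE, colClasses = col.classes,
                  nrows = line.total, sep = "\t")
print(col.classes)
```

One thing to keep in mind: if a column looks like integers in the sampled rows but contains decimals further down, the second read will stop with an error, so a larger nrows for the sample can be worth the extra time.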

If you are using R on a Mac, you have to change the index used when parsing the wc -l output ([[1]][1]). On a Mac, wc pads the count with leading spaces, so the first element after splitting on a space is an empty string, while on a Linux machine the first element is the number of lines.
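
One way to avoid changing the index per platform is to strip the leading whitespace before splitting. This is only a sketch: it assumes a Unix-like system with wc on the PATH, and CountLines is a hypothetical helper name, not part of the function above.

```r
# Hypothetical helper: count lines with wc -l on both Linux and Mac
CountLines = function(path){
  out = system(paste("wc -l", shQuote(path)), intern = TRUE)
  # drop any leading padding, then take the first whitespace-separated field
  fields = strsplit(sub("^\\s+", "", out), "\\s+")[[1]]
  as.numeric(fields[1])
}

# Example on a temporary three-line file
tmp = tempfile()
writeLines(c("a", "b", "c"), tmp)
n = CountLines(tmp)
print(n)
```

Note that wc -l counts newline characters, so a file whose last line has no trailing newline is reported as one line shorter.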
