Reading large files in R

Reading large data files into R using read.table can be painfully slow because of the many checks it performs on the data (inferring column types, counting columns, and so on). Below are a number of tips to improve its speed:

  1. Set nrows to the number of records in your dataset. An easy way to obtain this is with wc -l in the terminal (Unix only). A handy function to do this from R:
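One possible implementation of such a helper (a sketch; the function name count_lines is an assumption, and it shells out to wc, so it only works where wc is available):

```r
# Count the lines in a file by shelling out to `wc -l` (Unix only).
# wc prints e.g. "  1234 file.txt", so we take the first field.
count_lines <- function(path) {
  out <- system(paste("wc -l", shQuote(path)), intern = TRUE)
  as.integer(strsplit(trimws(out), "\\s+")[[1]][1])
}
```

The result can then be passed straight to the nrows argument of read.table.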
  2. Set comment.char = "" to turn off the interpretation of comments.
  3. Supply the column types via colClasses, e.g. colClasses = c("numeric", "character", "logical").
  4. Setting multi.line = FALSE can improve the performance of the underlying scan.
  5. In general, it is worth setting as.is = TRUE to prevent strings from being converted to factors automatically. If you provide colClasses, this is redundant (the column types are then determined by colClasses).
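Putting the tips above together, a call might look like the following (the file name "big.csv" and the column types are placeholders for illustration):

```r
# Number of records, e.g. obtained via wc -l as described above.
n <- 1000000

# read.table skips type inference because colClasses is given,
# skips comment scanning because comment.char is empty, and can
# allocate its result up front because nrows is known.
df <- read.table("big.csv",
                 header = TRUE,
                 sep = ",",
                 nrows = n,
                 comment.char = "",
                 colClasses = c("numeric", "character", "logical"))
```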

If your data is very large (5GB+), you should look at the data.table package. In particular, its fread function can read very large files into data tables far faster than read.table.
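A minimal sketch, assuming data.table is installed and again using the placeholder file name "big.csv":

```r
library(data.table)

# fread automatically detects the separator and column types,
# and returns a data.table (which is also a data.frame).
dt <- fread("big.csv")
```

Unlike read.table, fread performs its type detection on a sample of the file, so it stays fast even without the tuning arguments discussed above.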