Efficiently Filtering While Reading Data Into R (With Python?!)

Posted on Wed 17 July 2019 in how-to

Working with large amounts of tabular data is a daily occurrence for both bioinformaticians and data scientists. There's a lot the two groups can learn from each other (great future post material). However, I recently ran into a situation that I was sure had to be relatively common. Apparently it wasn't, and I had very little luck finding a solution in my usual genomics/bioinformatics circles, or in the data science material I had on-hand.

The Problem

I recently received output from a large BLAST search: something on the order of 200,000 queries, some of which had many thousands of hits each. The search was run with minimal filtering, the idea being that it's easy to post-filter, but you can't get back hits that were thrown away. Fair enough. The results were also split across 100+ files, output in BLAST's "format 7" (run with -outfmt 7), which means tabular (.tsv) with comment lines scattered throughout. This collection of files was too big to load into R (where I do most exploratory data analysis) and then filter, so I figured there had to be a way to combine loading and filtering in a satisfactory way. Of course, you could pre-filter it with awk or Python line-by-line and write the result back to disk, but this problem interested me.
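
For reference, format 7 looks roughly like this (abbreviated; the rows are borrowed from results shown later in this post, and the exact comment lines vary with the BLAST program and version):

# BLASTN <version>
# Query: query_1
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 4 hits found
query_1	scaffold88	100.000	869	0	0	1	869	733052	733920	0.0	1605
query_1	scaffold88	95.732	867	34	1	3	869	734435	735298	0.0	1393
query_1	scaffold1	100.000	869	0	0	1	869	4053	4921	0.0	1605
query_1	scaffold1	95.732	867	34	1	3	869	5436	6299	0.0	1393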

TLDR

If you have to load AND filter you should use the lesser-known readr function read_delim_chunked() (and its derivatives, read_{tsv|csv|table}_chunked()) or write a parser in Python and translate the resulting object (list of lists, dictionary of lists, or Pandas dataframe) to R with reticulate. The reason behind this is that iterating through a file and filtering line-by-line, while a seemingly common thing to do, is horrifically slow in R as far as I can tell. I'm happy to eat my words if other useRs can prove I'm wrong.

Attempt #1: Writing a Line-by-line Parser in R

I read all the warnings. They say that R is slow. But is this really true? I frequently read pretty large files into R with readr or data.table, and they're wicked fast. What I failed to immediately realize is that these packages are fast because they're written in C/C++ and are effectively compiled programs that interact with R through its API.

I decided to test this on one of the smaller files (~2.6 GB), as well as on just its first 10,000 lines. The tsv files have comment lines denoted with '#' and 12 columns that mix character, float, and integer values:

# Example data
input_file <- 'test_blast_fmt7.out'
system('head -n 10000 test_blast_fmt7.out > test_blast_fmt7.head10000.out')
input_file_head <- 'test_blast_fmt7.head10000.out'
col_names <- c('query', 'subject', 'identity', 'align_length',
               'mismatches', 'gap_opens', 'q_start', 'q_end',
               's_start', 's_end', 'evalue', 'bit_score')

I first attempted to solve this problem by writing the following function (don't do this).

read_filter_blast7_lbl_base <- function (input_file, header, min_perc_id, min_al_len, max_evalue) {
  # Initialize a line counter
  i = 1

  # Open a file connection, yield one line at a time
  out_list <- list()
  conn <- file(input_file, open = 'r')
  while (length(n_line <- readLines(conn, n = 1, warn = FALSE)) > 0) {

    # Split the line at the separator to yield a list, turn into a vector
    line <- unlist(strsplit(n_line, '\t'))

    # Skip comment lines
    if (!startsWith(line[1], '#')) {
      # Coerce the fields used for filtering to numeric first; comparing the raw
      # character values against numbers would fall back to string comparison
      if (as.numeric(line[3]) >= min_perc_id &&
          as.numeric(line[4]) >= min_al_len &&
          as.numeric(line[11]) <= max_evalue) {
        out_list[[i]] <- line
      }
    }
    # Count lines
    i = i + 1
  }
  close(conn)

  # Bind the lines as a data frame, don't convert strings to factors
  out_df <- as.data.frame(do.call(rbind, out_list), stringsAsFactors=FALSE)
  colnames(out_df) <- header

  # Set the column classes
  for (i in header[3:12]) {
    out_df[, i] <- as.numeric(out_df[, i])
  }

  # Return the filtered data frame (columns 3 through 12 are now numeric)
  return(out_df)
}

Wow! What an abomination. I'm not sure if this says more about R's unsuitability for this kind of task or the obvious "Python-think" seeping in here. It's a mess. And it's slow. I tried testing on the first 10,000 lines of one of the files:

library(microbenchmark)

# Benchmark 100 iterations (default) over the first 10000 lines
> microbenchmark(read_filter_blast7_lbl_base(input_file_head, col_names, 80, 432, 1e-50), times = 100)
Unit: milliseconds
                                                                         expr
 read_filter_blast7_lbl_base(input_file_head, col_names, 80, 432,      1e-50)
      min       lq     mean   median       uq    max neval
 338.4403 346.2422 351.0633 349.2319 352.5961 404.62   100

It works, but it doesn't scale to a full file (it sat for ages until I killed it), and it suffers from R's well-known problem of growing objects inside a loop, which forces copying rather than appending in place. There are clearly other issues too, because my attempts to pre-allocate a list or data frame of the correct size (along the lines of the sketch below) did not speed it up. This means that I might be doing something wrong. Regardless, this is too much work to do something so simple. I welcome others to find a pure base R implementation that's better; it seems like there should be a way to do it.
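
For the curious, the pre-allocation variant looked roughly like this (a sketch with the filter thresholds hard-coded; the list length is just an over-estimate that gets trimmed afterwards):

# Sketch of the pre-allocated version (it did not meaningfully change the timings)
n_guess <- 1e6                       # over-estimate of lines that will pass the filter
out_list <- vector('list', n_guess)  # pre-allocate the list up front
j <- 0
conn <- file(input_file, open = 'r')
while (length(n_line <- readLines(conn, n = 1, warn = FALSE)) > 0) {
  line <- unlist(strsplit(n_line, '\t'))
  if (!startsWith(line[1], '#') &&
      as.numeric(line[3]) >= 80 &&
      as.numeric(line[4]) >= 432 &&
      as.numeric(line[11]) <= 1e-50) {
    j <- j + 1
    out_list[[j]] <- line
  }
}
close(conn)
out_df <- as.data.frame(do.call(rbind, out_list[seq_len(j)]), stringsAsFactors = FALSE)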

However, there are better options.

Attempt #2: Using sqldf to Filter a Temporary SQLite Database

The internet led me to believe that this isn't really something people do in R, and that if you can't load it all into memory you should put it in a database and query that. It seems excessive, but the sqldf R package can do this. It even includes a function, read.csv.sql(), that creates the database on the fly while reading. I totally understand using SQL or similar when you have a reason to query a dataset often and it's already stored in an SQL database, but I question the wisdom of that suggestion for this purpose. I was able to use it as follows:

read_filter_blast7_sqldf <- function(input_file, header, min_perc_id, min_al_len, max_evalue) {
  temp_df <- read.csv.sql(
    file = input_file,
    filter = "sed -e '/^#/d'",
    sql = paste0('SELECT * FROM file WHERE V3 >= ', min_perc_id,
                 ' AND V4 >= ', min_al_len, ' AND V11 <= ', max_evalue),
    header = FALSE,
    sep = '\t',
    colClasses = c(rep('character', 2), 'numeric', rep('integer', 7), rep('numeric', 2))
  )
  colnames(temp_df) <- header
  return(temp_df)
}

One of the key limitations here is that you need to pipe the file through a shell command (sed) to remove the comment lines. Not the biggest deal, but having to write a sed command does take you out of your flow in an R or Jupyter notebook. Let's see how it performs:

# Benchmark 100 iterations (default) over the first 10000 lines
> microbenchmark(read_filter_blast7_sqldf(input_file_head, col_names, 80, 432, 1e-50), times = 100)
Unit: milliseconds
                                                                      expr
 read_filter_blast7_sqldf(input_file_head, col_names, 80, 432,      1e-50)
      min       lq     mean   median       uq      max neval
 72.27077 73.30782 93.27967 79.38143 103.4807 199.8507   100

What's going on here? The max is much slower than the min because the first iteration has to import the data into a temporary SQLite file! That import takes even longer for a larger file, and the temporary database is the size of the full, unfiltered input. Since in real use you load each file once, and each file needs its own temporary SQLite DB, the max time is actually the only timing that matters! Plus, it has the problem of filling up your /tmp directory. What happens when I try to load the smallest whole file (2.6 GB)?

# Benchmark the whole file once
> microbenchmark(read_filter_blast7_sqldf(input_file, col_names, 80, 432, 1e-50), times = 1)
Unit: seconds
                                                            expr      min
 read_filter_blast7_sqldf(input_file, col_names, 80, 432, 1e-50) 165.4224
       lq     mean   median       uq      max neval
 165.4224 165.4224 165.4224 165.4224 165.4224     1

That's decent performance, but because it makes temporary files it will fill up your /tmp directory:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
dev             7.8G     0  7.8G   0% /dev
run             7.8G  1.4M  7.8G   1% /run
/dev/nvme0n1p2  423G   34G  368G   9% /
tmpfs           7.8G  170M  7.6G   3% /dev/shm
tmpfs           7.8G     0  7.8G   0% /sys/fs/cgroup
tmpfs           7.8G  6.9G  913M  89% /tmp
/dev/nvme0n1p1  300M  348K  300M   1% /boot/efi
tmpfs           1.6G   32K  1.6G   1% /run/user/1000

And running it on a second file fails:

# Benchmark the whole file once
> microbenchmark(read_filter_blast7_sqldf(input_file, col_names, 80, 432, 1e-50), times = 1)
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) : 
  RS_sqlite_import: database or disk is full
In addition: Warning message:
In .Internal(gc(verbose, reset, full)) :
  closing unused connection 4 (test_blast_fmt7.out)
Error in result_create(conn@ptr, statement) : 
  cannot rollback - no transaction is active

I then had to sudo rm -r /tmp/Rtmp* because my SSD was full.

On a system with tons of space it could be fine; running this on my laptop during a flight doesn't help. You can also tell read.csv.sql() where to build those databases (sketched below). However, the largest file I needed to work with is 30 GB, there were several of them, and this exact problem happened on our lab's server too. (Note to self: ask my PI to upgrade the root drive.)
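
If you must go this route on a cramped root drive, the dbname argument of read.csv.sql() lets you build the temporary SQLite database somewhere roomier (the /scratch path here is just a stand-in for wherever you have space):

# Sketch: park the temporary SQLite DB on a bigger drive, then clean it up
temp_db <- '/scratch/blast_filter_tmp.sqlite'  # hypothetical location with free space
temp_df <- read.csv.sql(
  file = input_file,
  filter = "sed -e '/^#/d'",
  sql = 'SELECT * FROM file WHERE V3 >= 80 AND V4 >= 432 AND V11 <= 1e-50',
  header = FALSE,
  sep = '\t',
  dbname = temp_db
)
unlink(temp_db)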

Still not a great solution.

Attempt #3: readr read_delim_chunked()

This function isn't well documented, but it's the fastest option I found. It's not quite the line-by-line implementation I had imagined, but it's close: readr reads the file in chunks, applies a user-defined callback function to each chunk, and binds the results together. Getting the best performance requires tuning the chunk size to the largest your memory can reasonably handle; I stuck with 10,000 to keep the comparison with the other options fair.

library(readr)

read_filter_blast7_readr_chunked <- function (input_file, header, min_perc_id, min_al_len, max_evalue) {

  # Callback applied to each chunk: keep only the rows passing the filters
  f <- function(x, pos) subset(x, identity >= min_perc_id & align_length >= min_al_len & evalue <= max_evalue)

  # Read 10,000 lines at a time, skip '#' comments, and row-bind the filtered chunks
  read_tsv_chunked(input_file, chunk_size = 10000, col_names = header,
                   comment = '#', callback = DataFrameCallback$new(f))
}
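
Incidentally, you can squeeze a bit more out of this by handing readr an explicit col_types specification (so it doesn't have to guess column types for every chunk) and by raising chunk_size as far as memory allows. A sketch with an arbitrary chunk size; I kept the simpler version above for the benchmarks that follow:

# Sketch: explicit column types plus a larger chunk size
blast7_cols <- cols(
  query = col_character(), subject = col_character(),
  identity = col_double(), align_length = col_integer(),
  mismatches = col_integer(), gap_opens = col_integer(),
  q_start = col_integer(), q_end = col_integer(),
  s_start = col_integer(), s_end = col_integer(),
  evalue = col_double(), bit_score = col_double()
)
f <- function(x, pos) subset(x, identity >= 80 & align_length >= 432 & evalue <= 1e-50)
filtered <- read_tsv_chunked(input_file, chunk_size = 100000, col_names = col_names,
                             col_types = blast7_cols, comment = '#',
                             callback = DataFrameCallback$new(f))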

Benchmarking shows some very solid performance:

# Benchmark 100 iterations (default) over the first 10000 lines
> microbenchmark(read_filter_blast7_readr_chunked(input_file_head, col_names, 80, 432, 1e-50), times = 100)
Unit: milliseconds
                                                                              expr
 read_filter_blast7_readr_chunked(input_file_head, col_names,      80, 432, 1e-50)
      min       lq     mean   median       uq      max neval
 28.90464 29.10356 30.87358 29.92421 31.28273 92.10364   100

That's not too surprising, since with only 10,000 lines it's basically just reading the whole thing in a single chunk, and readr is fast. So, how does it do on the full file?

# Benchmark the whole file once
> microbenchmark(read_filter_blast7_readr_chunked(input_file, col_names, 80, 432, 1e-50), times = 1)
Unit: seconds
                                                                         expr
 read_filter_blast7_readr_chunked(input_file, col_names, 80, 432,      1e-50)
      min       lq     mean   median       uq      max neval
 76.59867 76.59867 76.59867 76.59867 76.59867 76.59867     1

Really good performance, and as sketched above, you can tune it further as well. This is probably your best bet without getting too weird. But let's get weird ;)

Attempt #4: Parse With Python, Translate to R With reticulate

Let's do this in Python! Sort of...

Writing a function to read and filter a file in a for loop is a common thing for me in Python. I usually don't use Python for exploratory analysis, though, and am less familiar with pandas et al. than I am with R's ecosystem. However, I was able to figure out that pandas will automatically turn a list of lists into a DataFrame, which is pretty neat. A solid win over the base R equivalent:

import pandas as pd

def filter_blast7(input_blast_results, header, min_perc_id, min_al_len,
                  max_evalue):
    df_list = []
    with open(input_blast_results, 'r') as input_handle:
        for line in input_handle:
            # Skip the '#' comment lines
            if not line.startswith('#'):
                entry = line.strip().split()
                perc_id = float(entry[2])
                al_len = int(entry[3])
                evalue = float(entry[10])
                # Keep only hits passing all three filters
                if (perc_id >= min_perc_id and al_len >= min_al_len
                        and evalue <= max_evalue):
                    df_list.append(entry)
    # pandas turns a list of lists directly into a DataFrame
    return pd.DataFrame(df_list, columns=header)

This function, when tested in Python, returns a data frame as expected:

>>> filter_blast7(input_file_head, col_names, 80, 432, 1e-50).head()
   query     subject    identity align_length mismatches  ... q_end s_start s_end  evalue bit_score
0  query_1  scaffold88  100.000          869          0  ...   869  733052  733920    0.0      1605
1  query_1  scaffold88   95.732          867         34  ...   869  734435  735298    0.0      1393
2  query_1   scaffold1  100.000          869          0  ...   869    4053    4921    0.0      1605
3  query_1   scaffold1   95.732          867         34  ...   869    5436    6299    0.0      1393
4  query_3  scaffold88  100.000          786          0  ...   786  735004  735789    0.0      1452

Nice, but I thought we were using R? Well, we can actually use the R package reticulate to run this Python code and translate its output to an R data frame. Pretty neat! You can get it from CRAN with install.packages('reticulate'). With it installed, you'll want to make sure it can find your Python installation:

> library(reticulate)
> py_discover_config()
python:         /home/groverj3/.pyenv/shims/python
libpython:      /home/groverj3/.pyenv/versions/3.7.2/lib/libpython3.7m.so
pythonhome:     /home/groverj3/.pyenv/versions/3.7.2:/home/groverj3/.pyenv/versions/3.7.2
version:        3.7.2 (default, Mar 17 2019, 02:15:50)  [GCC 8.2.1 20181127]
numpy:          /home/groverj3/.local/lib/python3.7/site-packages/numpy
numpy_version:  1.16.4

python versions found: 
 /home/groverj3/.pyenv/shims/python
 /usr/bin/python
 /usr/bin/python3
> py_available()
[1] FALSE
> py_available(initialize = TRUE)
[1] TRUE
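
As an aside, if you want to be explicit about which interpreter reticulate uses, use_python() is the way to do it; a minimal sketch using the pyenv path from above:

# Point reticulate at a specific interpreter before Python is initialized;
# required = TRUE makes it fail loudly if that interpreter can't be used
use_python('/home/groverj3/.pyenv/shims/python', required = TRUE)
py_available(initialize = TRUE)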

Even though py_discover_config() found my Python installation, py_available() initially returned FALSE? I had simply forgotten to initialize Python, which you can do with use_python() (as sketched above) or with py_available(initialize = TRUE). You must also have the Python shared library installed, which pyenv does not build by default. Now, you can either source the Python parsing function from a saved .py script, or run it inline as follows and coerce it to an R function:

read_filter_blast7_lbl_py <- py_run_string(
"import pandas as pd

def filter_blast7(input_blast_results, header, min_perc_id, min_al_len,
                   max_evalue):
    df_list = []
    with open(input_blast_results, 'r') as input_handle:
        for line in input_handle:
            if not line.startswith('#'):
                entry = line.strip().split()
                perc_id = float(entry[2])
                al_len = int(entry[3])
                evalue = float(entry[10])
                if (
                    perc_id >= min_perc_id
                    and al_len >= min_al_len
                    and evalue <= max_evalue
                ):
                    df_list.append(entry)
    return pd.DataFrame(df_list, columns=header)"
)$filter_blast7

The result shows up as a "python.builtin.function":

class(read_filter_blast7_lbl_py)
[1] "python.builtin.function" "python.builtin.object"  

Now, let's benchmark that:

> # Benchmark 100 iterations over the first 10000 lines
> microbenchmark(read_filter_blast7_lbl_py(input_file_head, col_names, 80, 432, 1e-50), times = 100)
Unit: milliseconds
                                                                       expr
 read_filter_blast7_lbl_py(input_file_head, col_names, 80, 432,      1e-50)
      min       lq     mean   median       uq      max neval
 60.73332 62.57396 67.58808 63.75887 69.98544 105.8659   100
> 
> # Benchmark the whole file once
> microbenchmark(read_filter_blast7_lbl_py(input_file, col_names, 80, 432, 1e-50), times = 1)
Unit: seconds
                                                             expr      min
 read_filter_blast7_lbl_py(input_file, col_names, 80, 432, 1e-50) 139.1212
       lq     mean   median       uq      max neval
 139.1212 139.1212 139.1212 139.1212 139.1212     1
> 

It also returns a data frame identical to the Python one:

head(read_filter_blast7_lbl_py(input_file_head, col_names, 80, 432, 1e-50))
  query    subject    identity align_length mismatches
1 query_1 scaffold88  100.000          869          0
2 query_1 scaffold88   95.732          867         34
3 query_1  scaffold1  100.000          869          0
4 query_1  scaffold1   95.732          867         34
5 query_3 scaffold88  100.000          786          0
6 query_3  scaffold1  100.000          786          0
  gap_opens q_start q_end s_start  s_end evalue bit_score
1         0       1   869  733052 733920    0.0      1605
2         1       3   869  734435 735298    0.0      1393
3         0       1   869    4053   4921    0.0      1605
4         1       3   869    5436   6299    0.0      1393
5         0       1   786  735004 735789    0.0      1452
6         0       1   786    6005   6790    0.0      1452

It's definitely not as fast as reading it chunked with readr. However, that Python function was easy to write and performed far better than the base R way of reading and filtering line-by-line. In the future, if I have something I already know how to do in Python, I may not try to translate it to R at all. Just run the Python code with reticulate!

Wrapping Things Up

You would think that filtering huge datasets while reading them would be a commonly done thing. Apparently it's not common enough in R for good documentation to exist. In the end, readr wins again with read_delim_chunked(). It does require some tuning to get the best performance, but if you pick a sane chunk_size then further optimization is probably unnecessary and it will work fine. However, the revelation that communicating between Python and R works so well opens up a lot of future possibilities! Some things are just better suited to one language or the other. While Python has pandas DataFrames and a large ecosystem around them, none of it is as intuitive (to me) as the tidyverse. Approaches like the two winners here are ideal for situations where you have a huge file to load, but you know that most of the file will not meet your filtering criteria.

Using Python to do general purpose programming and communicating those results to R for statistical testing and visualization is definitely something to consider.

Addendum: What's the Overhead of Filtering While Reading?

Let's check it out. First, the readr method:

> microbenchmark(read_tsv(input_file, col_names=col_names, comment='#'), times = 1)
Parsed with column specification:
cols(
  query = col_character(),
  subject = col_character(),
  identity = col_double(),
  align_length = col_double(),
  mismatches = col_double(),
  gap_opens = col_double(),
  q_start = col_double(),
  q_end = col_double(),
  s_start = col_double(),
  s_end = col_double(),
  evalue = col_double(),
  bit_score = col_double()
)
|=================================================================| 100% 2685 MB
Unit: seconds
                                                       expr      min       lq
 read_tsv(input_file, col_names = col_names, comment = "#") 64.35369 64.35369
     mean   median       uq      max neval
 64.35369 64.35369 64.35369 64.35369     1

Compare that with 76.59867 seconds for reading plus filtering: the chunked filtering adds roughly 12 seconds. How about reading it in with Python using pandas' read_csv?

> read_csv_py <- py_run_string(
"import pandas as pd

def read_blast7(input_blast_results, header):
    return pd.read_csv(input_blast_results, sep='\t', comment='#', names=header)"
)$read_blast7

> # Benchmark the whole file once
> microbenchmark(read_csv_py(input_file, col_names), times = 1)
Unit: seconds
                               expr      min       lq     mean   median
 read_csv_py(input_file, col_names) 97.45004 97.45004 97.45004 97.45004
       uq      max neval
 97.45004 97.45004     1

There's clearly some overhead with both of these methods, but it's not a dealbreaker: roughly 12 extra seconds for the chunked readr approach and about 42 extra seconds for the Python parser on this 2.6 GB file.