Just Write Your Own Python Parsers for .fastq Files
Posted on Thu 22 August 2019 in commentary
In contrast to the zen of python there are actually many ways to handle sequence data in Python. There are several packages on PyPI that provide parsers for sequence formats like .fastq and .fasta. I've never bothered with these, including the oft-used Biopython. I vaguely remembered Biopython being slower than any parser I'd written myself early-on in learning bioinformatics, and it not actually being simpler to implement. However, I'd never looked at this in detail. Additionally, I'd recently run across a few posts on biostars where users were deriding people for asking "What is the most efficient way to parse a huge .fastq file" for something similar.
First of all, don't discourage people who are trying to learn. Secondly, this is a good question! As scientists, we should know that just because data exists doesn't meant it's good. Likewise, just because software exists doesn't mean it's the best tool for any given job. Plus, writing simple parsers for common formats is a good way to practice file processing for when you eventually need to do something hard and no ready-made parser exists in a package.
Rather than vaguely saying "X package is slow, do this instead" I thought it'd be best to actually benchmark some different .fastq parser options.
The Contenders
There are several packages that include parsers for biological sequence data. These include:
I'm familiar with Biopython from the recommendations that abound in the community
for exactly this task, and HTSeq mostly for HTSeq-count
. Scikit-bio seems to be
newer and under current development, so results from testing that are subject to
change. Just in case someone looks at this yers after it's written and wonders
why I got the performance that I did.
When it comes to dealing with .fastq files I checked through my library of Python scripts and came across two patterns that I'll also test compared to these packages:
- Reading line-by-line, using a counter to yield records
- Reading line-by-line, using
zip_longest()
fromitertools
to yield records
Setting up the Test
I did this in a jupyter notebook, since that's what I use on a day-to-day basis. Most of my interactive "data science" work is done in R, which is mostly a consequence of at one point needing to use some R packages that have no Python equivalents, and just rolling with that. So, actually using Python in jupyter is a bit of a departure from the norm for me.
First, you need the necessary packages. I just use pip with pyenv:
pip install biopython HTSeq scikit-bio
Then, started a new jupyter notebook with jupyterlab (a sweet new UI for jupyter that you should use!). Your first step is always to do your imports.
from Bio import SeqIO
from HTSeq import FastqReader
from itertools import zip_longest
import skbio
I'm only using one function from skbio, but it's just called read()
which is
too generic a name to just import that single function without causing all sorts
of annoyances and gnashing of teeth.
Also, it's important with any parsing problem to understand the file format. The .fastq format is ubiquitous in bioinformatics and looks like this:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
You can understand it as a repeated series of four lines:
- Sequence ID, starting with "@"
- Sequence (ATCG)
- Separator (+)
- Quality score for each base call (same length as sequence)
The catch here is that you can't use @ as a record separator. It's a valid character in the score line, too. So, you really do need to group the lines in batches of four, as it's possible @ will exist in position 1 of the score line.
Define Some Functions to Test
In order to make the benchmarking easier to follow, I figured I'd define the functions I want to bechmark in a consistent way:
# Using Biopython
def parse_biopython(input_fastq):
for record in SeqIO.parse(input_fastq, 'fastq'):
yield record
# Using HTSeq
def parse_htseq(input_fastq):
for record in FastqReader(input_fastq):
yield record
# HTSeq raw
def parse_htseq_raw(input_fastq):
for record in FastqReader(input_fastq, raw_iterator=True):
yield record
# Skbio
def parse_skbio(input_fastq):
for record in skbio.io.read(input_fastq, format='fastq'):
yield record
# Line by line with counter
def parse_lbl_counter(input_fastq):
with open(input_fastq, 'r') as input_handle:
record = []
n = 0
for line in input_handle:
n += 1
record.append(line.rstrip())
if n == 4:
yield record
n = 0
fq_record = []
# Line by line with zip_longest
def parse_zip_longest(input_fastq):
with open(input_fastq, 'r') as input_handle:
fastq_iterator = (l.rstrip() for l in input_handle)
for record in zip_longest(*[fastq_iterator] * 4):
yield record
Here I intended to use two different methods from HTSeq, one which just returns
bare tuples rather than objects with other kinds of validation based on the
definition of the format. However, neither HTSeq method worked. Instead, giving a
StopIteration
error when it reached the end of a file. Trying to catch that
with a try:
except:
block didn't seem to work? It did parse until it reached
the end of a file though. I think this is a bug, and I may raise it with the
HTSeq people. So it is, regrettably, not included in my benchmarking results.
Also, in both custom parsers, str.rstrip()
was marginally faster than
str.strip()
so I went with that instead.
Run Some Benchmarks
I decided I would try each of these with 1 million lines from a whole-genome bisulfite experiment. These are the R1 mates from 75bp paired end reads:
%timeit -n 10 -r 10 [record for record in parse_biopython('JWG3_2_2_R1.head.fastq')]
2.86 s ± 56.7 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%timeit -n 10 -r 10 [record for record in parse_skbio('JWG3_2_2_R1.head.fastq')]
1min 33s ± 13.7 s per loop (mean ± std. dev. of 10 runs, 10 loops each)
%timeit -n 10 -r 10 [record for record in parse_lbl_counter('JWG3_2_2_R1.head.fastq')]
295 ms ± 14.7 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%timeit -n 10 -r 10 [record for record in parse_zip_longest('JWG3_2_2_R1.head.fastq')]
249 ms ± 2.57 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
The %timeit
function there is some ipython "line magic." It simplifies timing
a single line of code. The %%timeit
is the "cell magic" version.
It seems that skbio isn't ready for primetime just yet. The real question then
is, would biopython suffice for day-to-day work? Perhaps yes, ~1M lines in < 3s
(349650.35 lines per second) is a timescale that people might be willing to work
with. Keep in mind this is on my personal laptop, so it's hardly a compute
cluster. In contrast, the very simple line counter-based parser that I wrote as a
master's student back in 2013 as a python-learning exercise is nearly 10x faster!
There is also an improvement in speed for using zip_longest()
from itertools
(a trick I'm pretty sure I saw in a post from Brent Pedersen on stackoverflow).
Visualize
I'm usually a ggplot2 useR for visualizations, but I'm already in python here so let's use this as an excuse to try the great python plotting library altair. It's declarative, like ggplot2, and you build your plot by "mapping" your "variables" (columns) to "encodings" (analogous to "aesthetics" in ggplot2). I ran several other benchmarks and turned them into a pandas data frame. First you'll need to do some imports:
import pandas as pd
import numpy as np
import altair as alt
Then make the data frame
# Create a dataframe Pandas style
timing_data = pd.DataFrame({'Method': np.repeat(['biopython', 'skbio', 'lbl', 'zip'], 5),
'Reads': (np.tile([100, 1000, 10000, 100000, 1000000], 4) / 4),
'Time (s)': [(670 / 1e6), (4.4 / 1000), (40.49 / 1000), (418 / 1000), 2.86,
(14.2 /1000), (132 / 1000), 1.32, 13.9, (60 + 33),
(181 / 1e6), (442 / 1e6), (3.92 / 1000), (40.5 / 1000), (295 / 1000),
(70.2 / 1e6), (352 / 1e6), (3.19 / 1000), (32.5 / 1000), (249 / 1000)]})
Since each record is 4 lines, converting lines to # of reads requires dividing by four. Likewise, the benchmarking results are in various time units, so I've converted all of them to seconds. Not particular efficiently, but for this simple example it's fine.
Now we can visualize with Altair. It has a very nice syntax inspired by ggplot2's "grammar of graphics." It's based on vega-lite under the hood and allows you to easily save your plot from jupyterlab. Here's the code:
# Plot as a scatterplot
alt.Chart(timing_data).mark_point().encode(
x='Reads',
y='Time (s)',
color='Method'
)
# Plot on log scale
alt.Chart(timing_data).mark_point().encode(
alt.X('Reads', scale=alt.Scale(type='log', base=10)),
alt.Y('Time (s)', scale=alt.Scale(type='log', base=10)),
color='Method'
)
Everything scales linearly, but at massively different rates. Sci-kit bio is in another universe in terms of time, such that you can't even visualize it with the others in a meaningful way until you log scale everything. By the log scale, you can essentially see that biopython is an order of magnitude faster than skbio, and either simple parser are an order of magnitude faster again. The difference between the two simple parsers is pretty insignificant.
Note: Altair is great! Not quite as full-featured as ggplot2 in R, but it's definitely promising and something to watch for in the future. They definitely should make it work with jupyterlab's dark theme though. Due to the transparent plot backgrounds it requres a light theme.
To Wrap Things Up
I'm not saying you should never use biopython, I suspect its parser does some
extra validation that my simple parsers don't. It also returns objects with some
possibly useful methods. However, if you just want to read files quckly then the
simple line-by-line parsers aren't actually very complicated to write. Plus, you
don't even need to import anything unless you want a minor speed boost from
itertools
. Additionally, if you didn't need to strip newlines you'd get a boost
from not calling an str.strip()
method on each line.
If you're ok with living dangerously, and are sure your files are formatted correctly you can easily write something that will outperform standard implementations with little effort when it comes to .fastq parsing.