[Biopython-dev] Parsing fastq files with SeqIO.parser(handle)

Peter Cock p.j.a.cock at googlemail.com
Fri Apr 19 13:24:39 UTC 2013


Hi Alex,

Are you talking about FASTA or FASTQ or both?

On Fri, Apr 19, 2013 at 12:29 PM, Alex Leach <albl500 at york.ac.uk> wrote:
> Dear BioPython Devs,
>
> Probably a strange request, but I was wondering if it might be a good idea
> to make the fasta parser raise an error when it is asked to parse
> incorrectly formatted files.

The FASTA parser (based on the older parsers in Biopython) tolerates
random text before the first record. This has some practical uses, such
as parsing the FASTA block at the end of a GFF file. Changing that
might cause trouble...
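For illustration (my own sketch, not from the original thread), the behaviour
looks like this - text before the first ">" is skipped, and a FASTQ file
parsed as FASTA silently yields no records at all:

from io import StringIO
from Bio import SeqIO

# Leading non-FASTA lines (e.g. GFF annotation) are simply skipped:
mixed = StringIO("##gff-version 3\n# annotation lines...\n>seq1\nACGT\n")
print(sum(1 for _ in SeqIO.parse(mixed, "fasta")))  # 1 record

# A FASTQ file contains no ">" lines, so nothing is returned and no
# error is raised:
fastq = StringIO("@read1\nACGT\n+\n!!!!\n")
print(sum(1 for _ in SeqIO.parse(fastq, "fasta")))  # 0 records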

> I ask, because a while ago, I made a simple command line utility to convert
> sequence files to/from various formats, using SeqIO.parser. It's attached if
> anyone's interested.

For simple jobs, Bio.SeqIO.convert is nice and often faster too.
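For example (the filenames here are just placeholders):

from Bio import SeqIO

# Convert Sanger-encoded FASTQ to FASTA; returns the record count.
count = SeqIO.convert("reads.fastq", "fastq", "reads.fasta", "fasta")
print("Converted %i records" % count)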

I see you're making heavy use of the SeqRecord's format method.
That is not a good idea for speed (as the docstring and documentation
try to explain), since it makes a StringIO handle and calls SeqIO.write
internally - just call SeqIO.write directly with your file handle.
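Roughly, the faster pattern for a length filter would be something like
this (an untested sketch with made-up filenames and a 100bp cut-off):

from Bio import SeqIO

def long_reads(handle, fmt, min_len):
    # Generator of records passing the length cut-off.
    for record in SeqIO.parse(handle, fmt):
        if len(record) >= min_len:
            yield record

with open("reads.fastq") as in_h, open("filtered.fastq", "w") as out_h:
    # SeqIO.write accepts any iterator of SeqRecords plus a handle.
    count = SeqIO.write(long_reads(in_h, "fastq", 100), out_h, "fastq")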

> My supervisor's now using it to filter fastq formatted sequences by length,
> but keeps forgetting to add a '-format fastq' option. The script assumes
> fasta formatted sequences by default, which (like SeqIO.parser) is by design,
> but the problem is that the parser doesn't mind at all when a fastq file
> doesn't contain a single ">" character.
>
> Are there any interfaces to make the fasta parser stricter? This error is
> completely silent until picked up by external programs; hmmer, in this
> instance. Ideally, an error would be raised much earlier in the process,
> especially as the department's NFS servers take ages to retrieve and convert
> an IonTorrent dataset. (I've got him using /var/tmp for the converted files,
> but he keeps the original fastq's in an NFS home folder, which is
> sloooooow).

In your shoes I'd write some sanity testing into the script, for example
checking the file size for hints of truncation or an empty file. I'd also
try parsing just enough of it to check that the first few reads are sane.

> The department's using BioPython 1.57 btw.
>

That's quite elderly now, released two years ago.

> Thanks for your time.
> Kind regards,
> Alex
>
> p.s. Don't suppose there's any plans to implement any parsers as
> C-extensions?

I don't have any such plans. It would be possible and likely faster
for the special case where you can push the file handle stuff all
to the C level, but then it won't cope with general Python handles
(e.g. StringIO, network handles, decompressed files, etc). And
it won't work under Jython, and becomes more complex for PyPy.

Also from the benchmarking I've done, even with FASTA and
FASTQ one of the major time sinks is building the Python objects.
If you stick with plain strings, using for example:

from Bio.SeqIO.QualityIO import FastqGeneralIterator
from Bio.SeqIO.FastaIO import SimpleFastaParser

then even parsing in pure Python is much quicker. See:

http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
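As a rough illustration (again with made-up filenames), the same length
filter written with plain string tuples would look like:

from Bio.SeqIO.QualityIO import FastqGeneralIterator

with open("reads.fastq") as in_h, open("filtered.fastq", "w") as out_h:
    # Each record comes back as (title, sequence, quality) strings,
    # with no SeqRecord object overhead.
    for title, seq, qual in FastqGeneralIterator(in_h):
        if len(seq) >= 100:
            out_h.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))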

There are other ideas about, e.g. lazy loading which I suggested
as a possible GSoC project this year:
http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html


