[Biopython-dev] Parsing fastq files with SeqIO.parser(handle)

Fri Apr 19 16:04:28 UTC 2013

Hi Peter,

On Fri, 19 Apr 2013 14:24:39 +0100, Peter Cock <p.j.a.cock at googlemail.com>
wrote:

>>
>> p.s. Don't suppose there's any plans to implement any parsers as
>> C-extensions?
>
> I don't have any such plans. It would be possible and likely faster
> for the special case where you can push the file handle stuff all
> to the C level, but then it won't cope with general Python handles
> (e.g. StringIO, network handles, decompressed files, etc). And
> it won't work under Jython, and becomes more complex for PyPy.

I haven't done any in-depth reading into Jython or PyPy, but don't all
file-like objects aim to provide the same API? At least on the Python
side, the only difference should be whether a file-like object is
seek-able or not, I would have thought...
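
As a rough illustration of that point on the Python side, here is a minimal sketch in which one function consumes three different handles. The names count_fastq_records, LineOnlyHandle and make_handles are made up for this example, and the input is assumed to be strict four-line FASTQ:

```python
import gzip
import io

FASTQ_TEXT = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nHHHH\n"


class LineOnlyHandle:
    """Hypothetical handle that supports iteration only (no seek/tell)."""

    def __init__(self, text):
        self._lines = text.splitlines(keepends=True)

    def __iter__(self):
        return iter(self._lines)


def count_fastq_records(handle):
    """Count records in any iterable of lines (assumes 4 lines per record)."""
    line_count = sum(1 for _ in handle)
    return line_count // 4


def make_handles():
    """Build an in-memory, a gzip-decompressing and a non-seekable handle."""
    gz_bytes = io.BytesIO(gzip.compress(FASTQ_TEXT.encode("ascii")))
    return {
        "StringIO": io.StringIO(FASTQ_TEXT),
        "gzip": gzip.open(gz_bytes, mode="rt", encoding="ascii"),
        "line-only": LineOnlyHandle(FASTQ_TEXT),
    }


if __name__ == "__main__":
    for name, handle in make_handles().items():
        print(name, count_fastq_records(handle), hasattr(handle, "seek"))
```

The function only relies on iteration, so it never needs to know whether the handle can seek.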

Bearing this in mind, I would have thought that even unusual handles,
like StringIO objects, could be driven from C through the Python/C API's
Object Protocol functions[1].

[1] - http://docs.python.org/2/c-api/object.html

e.g.

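#include <Python.h>

void write_to_obj_handle(PyObject * obj_handle, PyObject * sequence_string)
{
      // Get the "write" method on "obj_handle" (new reference, or NULL)
      PyObject * handle_writer = PyObject_GetAttrString(obj_handle, "write");
      if (handle_writer == NULL)
            return;  // the AttributeError is left set for the caller

      // Call "obj_handle.write(sequence_string)"
      PyObject * result =
            PyObject_CallFunctionObjArgs(handle_writer, sequence_string, NULL);
      Py_DECREF(handle_writer);
      Py_XDECREF(result);  // NULL here means write() raised an exception
}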

Still, it would be a horrendous amount of work to get anything nearly as
flexible as the current parsers and, as you say, the result might not be
much faster anyway.

>
> Also from the benchmarking I've done, even with FASTA and
> FASTQ one of the major time sinks is building the Python objects.
> If you stick with plain strings, for example, then even parsing in
> pure Python is much quicker. See:

Yes, creating millions of PyObjects does add a lot of overhead... I've
been investigating various extension module techniques, and it seems the
"best" ones, e.g. numpy arrays, use a dedicated container object
(MultipleSeqAlignment?) which holds C-type records. An individual item or
slice is converted into a PyObject (SeqRecord?) only when Python asks for
it, while functions written in C, Fortran and C++ can still use the raw
C-level API. For heavy computations on big matrices, this removes
essentially all of the heap-allocation overhead of creating a PyObject
per element.
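
To make the container idea concrete, here is a minimal pure-Python sketch. It assumes ASCII names and sequences, and PackedSeqStore and Record are hypothetical names, not Biopython classes. Records live in two flat byte buffers plus offset arrays, and a Python-level record object is built only when an item is requested:

```python
from array import array
from collections import namedtuple

# Lightweight record built on demand; stands in for something like SeqRecord.
Record = namedtuple("Record", ["name", "seq"])


class PackedSeqStore:
    """Hypothetical container keeping all records in flat byte buffers."""

    def __init__(self):
        self._names = bytearray()
        self._seqs = bytearray()
        self._name_ends = array("Q")  # end offset of each name
        self._seq_ends = array("Q")   # end offset of each sequence

    def append(self, name, seq):
        self._names += name.encode("ascii")
        self._seqs += seq.encode("ascii")
        self._name_ends.append(len(self._names))
        self._seq_ends.append(len(self._seqs))

    def __len__(self):
        return len(self._seq_ends)

    def __getitem__(self, index):
        # Only integer indices are supported in this sketch (no slices).
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("record index out of range")
        name_start = self._name_ends[index - 1] if index else 0
        seq_start = self._seq_ends[index - 1] if index else 0
        name = self._names[name_start:self._name_ends[index]].decode("ascii")
        seq = self._seqs[seq_start:self._seq_ends[index]].decode("ascii")
        return Record(name, seq)

    def total_gc(self):
        """Bulk count over the raw buffer; no per-record objects are built."""
        return self._seqs.count(b"G") + self._seqs.count(b"C")


if __name__ == "__main__":
    store = PackedSeqStore()
    store.append("read1", "ACGT")
    store.append("read2", "TTGA")
    print(store[1], store.total_gc())
```

A real extension would keep these buffers at the C level; the sketch only shows the access pattern of bulk work on raw storage with objects created lazily.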

>
> http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
> from Bio.SeqIO.QualityIO import FastqGeneralIterator
> from Bio.SeqIO.FastaIO import SimpleFastaParser

Thanks, I've now read the article. Some good info in there! I haven't
done much with the different FASTQ variants yet, but that day will no
doubt come soon...
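
For reference, the "stick with strings" approach can be sketched in a few lines of pure Python. This is a simplified stand-in, not Biopython's implementation: iter_fastq_strings is a hypothetical name, and it assumes strict four-line records with no line wrapping:

```python
import io


def iter_fastq_strings(handle):
    """Yield (title, sequence, quality) string tuples from four-line FASTQ."""
    while True:
        title_line = handle.readline()
        if not title_line:
            return  # clean end of file
        seq = handle.readline().rstrip("\n")
        plus_line = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not title_line.startswith("@") or not plus_line.startswith("+"):
            raise ValueError("not a simple four-line FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Plain strings only: no record or sequence objects are created.
        yield title_line[1:].rstrip("\n"), seq, qual


if __name__ == "__main__":
    example = io.StringIO("@read1 desc\nACGT\n+\nIIII\n@read2\nTTGA\n+\nHHHH\n")
    for title, seq, qual in iter_fastq_strings(example):
        print(title, seq, qual)
```

Yielding tuples of strings avoids building a full record object per read, which is where the benchmarking mentioned above found much of the time going.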

>
> There are other ideas about, e.g. lazy loading which I suggested
> as a possible GSoC project this year:
> http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html

Just re-read your email and looked back through my old ArbIO.py module[2].
Now I know why I didn't say anything! Looks like I wrote that a looong
time ago, and the code is a bit of a mess. It was designed to work only
with Arb Silva's almost unique way of formatting FASTA files. It worked
very well for that use case. Your ideas regarding BAM, tabix and BioSQL
sound like a much better approach than using pickle dumps to save
indexes, though. Shame the GSoC proposals got bounced...
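
For anyone following along, the basic offset-index idea behind this kind of lazy loading can be sketched as below. It is only an illustration, assuming a seekable binary handle and ASCII FASTA. build_offset_index and fetch_record are hypothetical names, unrelated to ArbIO.py or to anything in Biopython:

```python
import io


def build_offset_index(handle):
    """Map each FASTA record ID to the byte offset of its '>' line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            record_id = fields[0].decode("ascii") if fields else ""
            index[record_id] = offset
    return index


def fetch_record(handle, index, record_id):
    """Seek to one record and read only that record's lines."""
    handle.seek(index[record_id])
    title = handle.readline()[1:].rstrip().decode("ascii")
    chunks = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # start of the next record
        chunks.append(line.strip())
    return title, b"".join(chunks).decode("ascii")


if __name__ == "__main__":
    fasta = io.BytesIO(b">seq1 first\nACGT\nTTGA\n>seq2\nGGCC\n")
    offsets = build_offset_index(fasta)
    print(offsets)
    print(fetch_record(fasta, offsets, "seq2"))
```

How the index is persisted (pickle, SQLite, a tabix-style file) is a separate choice; the sketch keeps it in a plain dict.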

Kind regards,
Alex

[2] -
http://code.google.com/p/ssummo/source/browse/trunk/ssummo/lib/ArbIO.py?r=5

-- 
---
Alex Leach. BSc, MRes
Chong & Redeker Labs
Department of Biology
University of York
YO10 5DD
Tel: 07940 480 771
