[Biopython] Shuffler for wrapped fastas

Sat Sep 19 11:17:50 UTC 2009

Hi Tanya,

Sorry for the slight delay - your email didn't appear in my inbox
for a couple of days. Odd.

On Wed, Sep 16, 2009 at 1:01 PM, Tanya Golubchik wrote:
>
> A single write is definitely better, though it's still so much slower than
> plain text shuffling that it's not ideal for millions of short reads unless
> we want to do something useful like convert the scores to Phred in the
> process. In that case we'd be using 'format' anyway, I assume, unless there
> is a neat trick to reformat a whole lot of reads at once.

If guess you mean the SeqRecord's format method? It isn't intended
for output of multiple records to a file, but rather is a convenience
method for a single record. The (slow) approach using a loop and
many calls to SeqRecord.format(...) is also less general than using
a single call to Bio.SeqIO.write(...). Consider interleaved file formats
or those with a header - the for loop won't work here.

Using Bio.SeqIO and combining the parse and write functions already
allows simple conversion of a range of sequence file formats, including
the three FASTQ variants. This is covered in the tutorial and the wiki,
http://biopython.org/wiki/SeqIO

The soon to be released Biopython 1.52 will make this even easier
(and in some cases like FASTQ conversion also faster) with the
addition of a Bio.SeqIO.convert(...) function.

> In general I find myself using Biopython for longer sequences (fasta or
> fastq), because of the neatness and flexibility of Biopython, but sticking
> to plain text for short reads because of the overheads.

In some cases that is the best thing to do. If you haven't already
done so, have a look at the FastqGeneralIterator function in
Bio.SeqIO.QualityIO which returns a tuple of three strings (so
no overhead from Seq and SeqRecord objects).

> BTW, itertools.izip does exactly what your interleave method
> does, so I'm not sure there's any need to rewrite it.

No it doesn't. The builtin function zip, and itertools.izip both
return tuples (pairs of entries). Consider:

>>> a = range(0,10,2)
>>> b = range(1, 10, 2)
>>> zip(a,b)
[(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

Or, using itertools,

>>> import itertools
>>> list(itertools.izip(a,b))
[(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

Using the (not very general) interlace function I wrote earlier:
http://lists.open-bio.org/pipermail/biopython/2009-September/005583.html

>>> def interlace(iter1, iter2) :
...     while True :
...         yield iter1.next()
...         yield iter2.next()
...
>>> list(interlace(iter(a),iter(b)))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

I hope that illustrates the difference. Here you get back ten items,
but with zip or izip you get five pairs of iterms. Via Google you
can easily find much more general interlace functions in Python.

Peter