[Biopython] [Velvet-users] Shuffler for wrapped fastas

Wed Sep 16 11:08:42 UTC 2009

Hi Velvet users,

I've also CC'd the biopython mailing list (and added links to the velvet
mailing list archive posts), as this conversation might be better off
continued there.

Peter wrote:
http://listserver.ebi.ac.uk/pipermail/velvet-users/2009-September/000567.html
>> Nice to see people using Biopython - although that doesn't look like the
>> most efficient way to write the code (I say without timing it!). See also:
>> http://lists.open-bio.org/pipermail/biopython/2009-September/005579.html
>>
>> And of course these Bio.SeqIO scripts would also work for FASTQ files
>> if you just switch the format name. If you wanted to do more error checking,
>> you could compare the record IDs (to check they are paired reads), and
>> you should make sure both files have the same number of records.

Tanya wrote:
http://listserver.ebi.ac.uk/pipermail/velvet-users/2009-September/000568.html
> Absolutely, that's why I said it was slow -- but easy and efficient
> enough for moderate-size fasta files. If you look at the attached
> script, I wrote the function shuffle_reads in a way that lets you import
> it and change the format type (or you could change the command line args
> to pass it in, of course, but I couldn't be bothered). Also easy enough
> to check the headers match.  However, fastq does require the latest
> Biopython release, and not everyone has it.
>
> Personally I don't use Biopython for shuffling fastq files because with
> millions of short reads, it's hugely inefficient (tenfold runtime
> difference) to go through creating a Biopython object for each read.
> So for short reads, I'd go with straightforward shuffling, plain text.

You are right that if performance really is a bottleneck, the Biopython
SeqIO system can be a limit due to the creation of SeqRecord objects.

However, it isn't quite as bad as you think - your code is making it worse
by using the format method on each record (which internally calls SeqIO),
instead of a single call to Bio.SeqIO.write() as recommended in our
documentation.

This version is about twice as fast as your original:

import sys
from Bio import SeqIO
def interleave(iter1, iter2) :
    while True :
        yield iter1.next()
        yield iter2.next()
f1, f2 = open(sys.argv[1]), open(sys.argv[2])
outfile = open(sys.argv[3], 'w')
format = 'fasta' #or "fastq" or ...
records = interleave(SeqIO.parse(f1, format), SeqIO.parse(f2, format))
count = SeqIO.write(records, outfile, format)
outfile.close()

(It still assumes the input files are properly paired)

I agree that a lower level FASTA parser using python strings could
be another five times faster - but you are aware, this gives up the
generality of the SeqIO system whereby this could be used on any
supported file format.

Peter