[Biopython] [Velvet-users] Shuffler for wrapped fastas
peter at maubp.freeserve.co.uk
Wed Sep 16 07:08:42 EDT 2009
Hi Velvet users,
I've also CC'd the biopython mailing list (and added links to the velvet
mailing list archive posts), as this conversation might be better off
>> Nice to see people using Biopython - although that doesn't look like the
>> most efficient way to write the code (I say without timing it!). See also:
>> And of course these Bio.SeqIO scripts would also work for FASTQ files
>> if you just switch the format name. If you wanted to do more error checking,
>> you could compare the record IDs (to check they are paired reads), and
>> you should make sure both files have the same number of records.
> Absolutely, that's why I said it was slow -- but easy and efficient
> enough for moderate-size fasta files. If you look at the attached
> script, I wrote the function shuffle_reads in a way that lets you import
> it and change the format type (or you could change the command line args
> to pass it in, of course, but I couldn't be bothered). Also easy enough
> to check the headers match. However, fastq does require the latest
> Biopython release, and not everyone has it.
> Personally I don't use Biopython for shuffling fastq files because with
> millions of short reads, it's hugely inefficient (tenfold runtime
> difference) to go through creating a Biopython object for each read.
> So for short reads, I'd go with straightforward shuffling, plain text.
You are right that if performance really is a bottleneck, the Biopython
SeqIO system can be a limit due to the creation of SeqRecord objects.
However, it isn't quite as bad as you think - your code is making it worse
by using the format method on each record (which internally calls SeqIO),
instead of a single call to Bio.SeqIO.write() as recommended in our
This version is about twice as fast as your original:
from Bio import SeqIO
def interleave(iter1, iter2) :
while True :
f1, f2 = open(sys.argv), open(sys.argv)
outfile = open(sys.argv, 'w')
format = 'fasta' #or "fastq" or ...
records = interleave(SeqIO.parse(f1, format), SeqIO.parse(f2, format))
count = SeqIO.write(records, outfile, format)
(It still assumes the input files are properly paired)
I agree that a lower level FASTA parser using python strings could
be another five times faster - but you are aware, this gives up the
generality of the SeqIO system whereby this could be used on any
supported file format.
More information about the Biopython