[Biopython] sort fasta file

Peter biopython at maubp.freeserve.co.uk
Thu Mar 18 10:44:09 UTC 2010


On Wed, Mar 17, 2010 at 12:01 PM, xyz <mitlox at op.pl> wrote:
> On Wed, 17 Mar 2010 10:22:48 +0000
> Peter <biopython at maubp.freeserve.co.uk> wrote:
>> For example,
>>
>> handle = open("example.fasta", "rU")
>> records = list(SeqIO.parse(handle, "fasta"))
>> handle.close()
>> records.sort(cmp=lambda x,y: cmp(len(x), len(y)))
>> #records.sort(cmp=reverse=True)
>> out_handle = open("sorted.fasta", "w")
>> SeqIO.write(records, out_handle, "fasta")
>> out_handle.close()
>>
>> Peter
>
> Thank you for the code. I only changed this and it works.
>
> records.sort(cmp=lambda x,y: len(y.seq) - len(x.seq))
>
> If I could not hold all the records in memory at once what could I do?

I would use Bio.SeqIO.index() to give random access to the
records. You would also need to load and sort the record
identifiers and the lengths. Something like this:

from Bio import SeqIO
#Get the lengths and ids, and sort on length
len_and_ids = sorted((len(rec), rec.id) for rec in \
            SeqIO.parse(open("ls_orchid.fasta"),"fasta"))
#Once sorted only need the ids, so can free some memory
ids = [id for (length, id) in len_and_ids]
del len_and_ids
#Now prepare the index
record_index = SeqIO.index("ls_orchid.fasta", "fasta")
#Now prepare a generator expression to give the
#records one-by-one for output
records = (record_index[id] for id in ids)
#Finally write these to a file
handle = open("sorted.fasta", "w")
count = SeqIO.write(records, handle, "fasta")
handle.close()
print "Sorted %i records" % count

That code should work for any file format support by
the Bio.SeqIO parse, index and write functions (e.g.
GenBank files, FASTQ, etc).

Notice that it actually reads though the input file twice,
once to get the ids and lengths, and once to build the
index (getting the ids and file offsets). If you wanted to
get a bit more low level you could do this in a single
pass - but it would be more effort than using the SeqIO
functions.

I wonder if this example is useful enough to go in the
tutorial? What do you think?

Peter



More information about the Biopython mailing list