[Biopython] sort fasta file

Peter biopython at maubp.freeserve.co.uk
Wed Mar 17 10:22:48 UTC 2010


On Wed, Mar 17, 2010 at 10:08 AM, xyz <mitlox at op.pl> wrote:
> Hello,
> I would like to sort a multi-record FASTA file by sequence length,
> i.e. from the read with the longest sequence to the read with the
> shortest sequence.
>
> I have tried to do it, but I do not know how to sort the records by
> sequence length.
>
> from Bio import SeqIO
>
> handle = open("example.fasta", "rU")
> records = list(SeqIO.parse(handle, "fasta"))
> records.sort(reverse=True)
>
> Thank you in advance.
>
> Best regards,

If you can hold all the records in memory at once (which it looks
like you can) then this is pretty easy. You need to do a custom
sort - the built-in list help is a bit terse:

>>> help([].sort)
Help on built-in function sort:

sort(...)
    L.sort(cmp=None, key=None, reverse=False) -- stable sort *IN PLACE*;
    cmp(x, y) -> -1, 0, 1

You need to pass in a function as the cmp argument, which will
take two objects (here SeqRecords) and return -1, 0 or 1. The
concise way to do this is with a lambda that reuses the built-in
function cmp, applied to the lengths of the records (len() on a
SeqRecord gives the length of its sequence).

For example,

from Bio import SeqIO

handle = open("example.fasta", "rU")
records = list(SeqIO.parse(handle, "fasta"))
handle.close()
# Compare y against x so that the longest record comes first
records.sort(cmp=lambda x, y: cmp(len(y), len(x)))
out_handle = open("sorted.fasta", "w")
SeqIO.write(records, out_handle, "fasta")
out_handle.close()
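
On Python 3 the cmp argument to sort() and the built-in cmp function
no longer exist, so the same ordering would have to be expressed with
the key argument. Here is a minimal sketch of that, assuming a
Biopython recent enough that SeqIO.parse and SeqIO.write accept
filenames. The names sort_by_length and sort_fasta are made up for
illustration and are not part of Biopython.

```python
def sort_by_length(records):
    """Return a new list of records ordered from longest to shortest.

    Works for any objects that support len(), which includes
    Biopython SeqRecord objects (len(record) is the sequence length).
    """
    # sorted() is stable, so records of equal length keep their input order.
    return sorted(records, key=len, reverse=True)


def sort_fasta(in_path, out_path):
    """Hypothetical helper: write the records of in_path to out_path, longest first."""
    from Bio import SeqIO  # imported here so sort_by_length works without Biopython

    records = sort_by_length(SeqIO.parse(in_path, "fasta"))
    return SeqIO.write(records, out_path, "fasta")  # number of records written


# Plain strings stand in for SeqRecord objects in this quick check.
print(sort_by_length(["ACGT", "A", "ACGTAC", "TTGA"]))
```

Calling sort_fasta("example.fasta", "sorted.fasta") should then do the
same job as the script above. Like that script, it holds every record
in memory at once.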

Peter



