[Biopython] Cookbook suggestion

Sat Apr 13 20:27:24 UTC 2013

Hi Justin,

On Sat, Apr 13, 2013 at 9:13 PM, Justin Gibbons <jgibbons1 at mail.usf.edu> wrote:
> I want to add the following to the cookbook but I am unable to create an
> account.

Hmm - we should fix that. Is there a specific error message
from the wiki?

> #using SeqIO.write() without holding records in memory.
>
> from Bio import SeqIO
>
>
> seq_ids=set() #create an empty set to hold the sequence IDs.
> indexed_fasta=SeqIO.index(file_path, 'fasta') #Can be searched by sequence
> ID but is not held in memory
>
> for seq_record in SeqIO.parse(file_path, 'fasta'):
>     #Filter according to some critria:
>         seq_ids.add(seq_record.id)

Why do call SeqIO.index, but not use it and instead get
the ID list by doing a full parse of the file? Note that calling
SeqIO.index is likely faster than SeqIO.parse because the
index code doesn't actually load the sequence information
etc - just the record identifier. This speed difference is more
obvious on heavier file formats like GenBank. e.g. These
single lines both get all the identifiers as a list:

seq_ids = SeqIO.parse(file_path, 'fasta').keys()

vs:

seq_ids = [rec.id for rec in SeqIO.parse(file_path, 'fasta')]

Also note that using a set rather than a list for the ids
means the order is lost - which may be important.

> #write the fasta records to a new file using SeqIO.write()
>
> SeqIO.write([indexed_fasta[seq_id] for seq_id in seq_ids], new_file_path,
> 'fasta')
>

That last line uses a list comprehension,
[indexed_fasta[seq_id] for seq_id in seq_ids]

That will therefore load all the records into memory as a list of
SeqRecord objects, which can be avoided with a list comprehension:

(indexed_fasta[seq_id] for seq_id in seq_ids)

i.e. round brackets not square.

> So if someone who can edit the cookbook wants to add it feel free to.
>
> Justin Gibbons

Feedback on the documentation and efforts to improve it
are always welcome. However, I'm not sure what your example
is trying to do yet - it seems to rewrite a FASTA file with the
records in a new order (with the order given by however
Python sorts the set of IDs).

Thanks,

Peter