[Biopython] SeqIO.index for csfasta files

Tue Jan 19 09:32:45 UTC 2010

On Tue, Jan 19, 2010 at 5:50 AM, Kevin Lam <aboulia at gmail.com> wrote:
> Hi all
> I know csfasta isn't listed on the SeqIO page, but can I use index on it as
> well to retrieve a subset of reads from a csfasta file? (qual files are OK)
> http://news.open-bio.org/news/2009/09/biopython-seqio-index/
>
> Cheers
> Kevin

We don't explicitly support color space FASTA, but it should work if
you treat the file as plain FASTA. By that I mean the parser will just
give you the sequences as is (e.g. A1231232), with a default generic
alphabet object and no color space decoding.
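
As a rough sketch of what that looks like, the snippet below indexes a
small color space file using the plain "fasta" format name and pulls out
two reads by ID. The read names, the sequences, and the fetch_reads
helper are made up for illustration, and this assumes a reasonably
recent Biopython:

```python
import os
import tempfile

from Bio import SeqIO

# Hypothetical color space reads, invented for this sketch.
CSFASTA_TEXT = """>read_1
T0120
>read_2
T3321
>read_3
A1231232
"""


def fetch_reads(csfasta_path, read_ids):
    """Return {read_id: sequence string} for the requested reads.

    The file is indexed as ordinary FASTA, so the color calls come
    back untouched (no decoding into bases).
    """
    index = SeqIO.index(csfasta_path, "fasta")
    try:
        return {read_id: str(index[read_id].seq) for read_id in read_ids}
    finally:
        # Release the file handle held by the index.
        index.close()


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "reads.csfasta")
    with open(path, "w") as handle:
        handle.write(CSFASTA_TEXT)
    subset = fetch_reads(path, ["read_3", "read_1"])

print(subset)
```

Only the file offsets are held in memory by the index; each record is
parsed when you ask for it by ID.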

Depending on the number of reads and the size of the subset, you
may find that using Bio.SeqIO.parse and Bio.SeqIO.write together
works better (lower memory requirements, since the records are
streamed one at a time). I would suggest building a Python set of
the desired IDs, then using something like this:

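from Bio import SeqIO

# Use a set to test membership (hash based, faster than a list)
wanted_ids = set(...)
# This generator expression is memory efficient: one record at a time
wanted = (rec for rec in SeqIO.parse(...) if rec.id in wanted_ids)
# The with statement closes the output file even if writing fails
with open(..., "w") as handle:
    # SeqIO.write returns the number of records written
    count = SeqIO.write(wanted, handle, "fasta")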
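
How you fill in wanted_ids depends on where your IDs live. As one
possible sketch, assuming they sit in a plain text file with one read ID
per line (the load_wanted_ids helper and the sample IDs are hypothetical,
and an in-memory handle stands in for the real file):

```python
import io


def load_wanted_ids(handle):
    """Collect read IDs from a text handle, one ID per line."""
    wanted_ids = set()
    for line in handle:
        read_id = line.strip()
        # Skip blank lines; drop a leading ">" copied from a FASTA title.
        if read_id:
            wanted_ids.add(read_id.lstrip(">"))
    return wanted_ids


# Stand-in for open("my_ids.txt"); duplicates collapse in the set.
id_file = io.StringIO("read_1\n\n>read_3\nread_1\n")
wanted_ids = load_wanted_ids(id_file)
print(sorted(wanted_ids))
```

The resulting set can be used directly in the "rec.id in wanted_ids"
test shown earlier.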

Peter