[Biopython] Retrieving fasta seqs

Tue Feb 2 14:13:44 UTC 2010

My version uses a set to store the IDs. It fails with too many records
(60 million) on a 64-bit CentOS machine with 31 GB of RAM running
Python 2.4, and I can't figure out why. It works well with 1 million IDs.
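
A minimal sketch of the set-based approach described above, using only
the standard library so that it runs anywhere. The helper names
read_fasta and filter_fasta are hypothetical, and the hand-written reader
is only a stand-in for Bio.SeqIO.parse; it is not the script discussed in
this message. Records are streamed one at a time, so only the ID set has
to fit in memory.

```python
import io


def read_fasta(handle):
    """Yield (header_line, sequence_lines) pairs from a FASTA handle.

    Hypothetical stand-in for Bio.SeqIO.parse: only one record is held
    in memory at a time.
    """
    header, lines = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line, []
        elif header is not None and line:
            lines.append(line)
    if header is not None:
        yield header, lines


def filter_fasta(in_handle, out_handle, wanted_ids):
    """Copy records whose ID is in wanted_ids; return the number written."""
    wanted = set(wanted_ids)  # average-case O(1) membership tests
    count = 0
    for header, lines in read_fasta(in_handle):
        # The record ID is the first whitespace-delimited word after ">".
        parts = header[1:].split(None, 1)
        rec_id = parts[0] if parts else ""
        if rec_id in wanted:
            out_handle.write(header + "\n")
            for seq_line in lines:
                out_handle.write(seq_line + "\n")
            count += 1
    return count


if __name__ == "__main__":
    source = io.StringIO(">seq1 first\nACGT\n>seq2\nGGCC\n>seq3\nAAAA\n")
    target = io.StringIO()
    filter_fasta(source, target, ["seq1", "seq3"])
    print(target.getvalue(), end="")
```

With tens of millions of IDs, the set itself is the main memory cost,
since each ID is stored as a separate Python string object.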

Can I propose that this be part of the tutorial? It seems to be quite a
popular request. I was going to post it on my blog, but I think more
people will benefit if it's on the wiki. I don't mind contributing the
code and lessons.

Kevin

On 02-Feb-2010, at 9:49 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Tue, Feb 2, 2010 at 1:09 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>>
>> Finally, iterate through the large FASTA file, and write records of
>> interest:
>>
>> sec = open(sys.argv[1], 'r')
>> for rec in SeqIO.parse(sec, "fasta"):
>>    if rec.id in listita:
>>        SeqIO.write([rec], out_handle, "fasta")
>>
>
> Or, once you have read about generator expressions,
> this version might seem nicer - but perhaps a bit too
> complicated for a beginner:
>
> records = SeqIO.parse(open(sys.argv[1], 'r'), "fasta")
> wanted = (rec for rec in records if rec.id in listita)
> SeqIO.write(wanted, out_handle, "fasta")
>
> Another alternative, which could be quicker to run
> depending on the size of the files and the relative
> number of records wanted would be to use the
> Bio.SeqIO.index() function to pull out the desired
> records from the FASTA input file.
>
> Peter
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
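
The Bio.SeqIO.index() alternative mentioned in the quoted message could
look roughly like the sketch below. It assumes a Biopython release that
provides Bio.SeqIO.index() and its close() method; the function name
extract_by_index and its parameters are hypothetical. The index stores
file offsets rather than whole records, so this is likely to pay off
mainly when only a small fraction of the records is wanted.

```python
def extract_by_index(fasta_path, wanted_ids, out_path):
    """Write the wanted records from fasta_path to out_path.

    Returns the number of records written. IDs that are not present in
    the FASTA file are skipped silently.
    """
    # Imported here so that this sketch can be loaded even where
    # Biopython is not installed; normally this goes at module level.
    from Bio import SeqIO

    # Build a dictionary-like index of record ID -> file offset.
    index = SeqIO.index(fasta_path, "fasta")
    try:
        # Records are parsed lazily, one per lookup.
        found = (index[rec_id] for rec_id in wanted_ids if rec_id in index)
        with open(out_path, "w") as out_handle:
            return SeqIO.write(found, out_handle, "fasta")
    finally:
        index.close()
```

A call might look like extract_by_index("big.fasta", ["seq1", "seq3"],
"subset.fasta"), where both file names are placeholders.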