[Bioperl-l] what's the optimal way to search a fasta file for matching ID's?

Cook, Malcolm MEC at stowers-institute.org
Thu Oct 25 22:17:04 UTC 2007


If you have the fasta database already indexed for blast searching, then
you should use fastacmd, which comes with the blast package, for
extracting (sub)sequences based on ID (and indices).
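
If the file has not been formatted for blast, the same idea (look each ID
up in an index rather than rescanning the file for every hit) can be
sketched by hand. The Python below is only an illustration under stated
assumptions, not what fastacmd does internally: the function names
build_offset_index and fetch_sequence are hypothetical, the ID is assumed
to be the first whitespace-delimited token after ">", and IDs are assumed
to be unique and ASCII. Only the ID-to-offset map is held in memory, so
90,000 entries should cost far less RAM than keeping the sequences.

```python
def build_offset_index(fasta_path):
    """Map each sequence ID to the byte offset of its '>' header line.

    Assumption: the ID is the first whitespace-delimited token of the header.
    """
    index = {}
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()      # position of the line about to be read
            line = handle.readline()
            if not line:                # end of file
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch_sequence(fasta_path, index, seq_id):
    """Seek straight to one record and return its full sequence as a string."""
    chunks = []
    with open(fasta_path, "rb") as handle:
        handle.seek(index[seq_id])
        handle.readline()               # skip the header line itself
        for line in handle:
            if line.startswith(b">"):   # next record begins, stop here
                break
            chunks.append(line.strip())
    return b"".join(chunks).decode("ascii")
```

The index is built in a single pass over the file; each hit ID parsed from
the blast report then costs one seek plus a read of that record alone.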

Malcolm Cook
Database Applications Manager - Bioinformatics
Stowers Institute for Medical Research - Kansas City, Missouri
  

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org 
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Joseph Fass
> Sent: Thursday, October 25, 2007 4:50 PM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] what's the optimal way to search a fasta 
> file for matching ID's?
> 
> I would appreciate any advice, big or small, on this ...
> 
> I've got a decent-sized database ... 90,000 sequences or so 
> in a single fasta-format file.  Then, I've got sequence ID's 
> from that database that show up in blast reports.  I want to 
> collect those ID's and their sequences (for the purposes of 
> exploring possible contigs).  Since the blast report only 
> includes sub-sequences (from alignments) of my sequences, I 
> want to parse the report, then match each hit ID against an 
> ID in the database, so I can pull out its full sequence.  Is 
> there a faster way to do this than opening the database file 
> each time I have a new hit ID, so I can search it from 
> beginning to end?  If I push each sequence onto a list or 
> hash, it's liable to chew up a lot of RAM, I'm guessing.  Any 
> suggestions?
> 
> Thanks in advance,
> ~joe
> 
> --
> Joseph Fass
> joseph.fass at gmail.com  ||  joefass at hotmail.com
> 970.227.5928 (c)  ||  530.754.7978 (w)
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 



