[Bioperl-l] Getting sequences by ID

Wed Apr 5 22:14:01 UTC 2006

On Wed, 2006-04-05 at 18:03 +0100, Yuval Itan wrote:
> I would be grateful for an advice from you regarding Bioperl, after I was 
> fiddling around trying to write the Perl script for that from scratch.
> I have a large fasta file of about 20,000 genes, and another file which is a 
> list of about 2,000 gene IDs (no sequences), all included in the large file. 
> I need to create a fasta file which will include only the genes with these 
> specific 200 IDs. I was wondering if there is a method in Bioperl that will 
> allow me to do the following pseudocode:
> 
> For each $ID from 200_IDs_set_file
> {
> 	$my_seq = get_sequence_by_ID(from large_fasta_file, $ID)
> 	write $my_seq into file
> }

There are many possibilities involving combinations of pure Perl and
BioPerl modules, and some even involving no Perl, but rather using
commands like 'formatdb' and 'fastacmd -s'. There are probably EMBOSS
solutions too.

Using your pseudo code, you could use Bio::Index::Fasta to index your
20,000 genes. Then loop over each ID, and retrieve the Seq via the
index, and write it out using Bio::SeqIO.

Perhaps look at it from another perspective:

# put all the IDs we want into a hash (read from file)
my %want_id = .... ; 

foreach $seq (use Seq::IO to read large_fasta_file) {
  if $want_id{$seq->id} then
    use Seq::IO to write this $seq out
  end
}

-- 
Torsten Seemann <torsten.seemann at infotech.monash.edu.au>
Victorian Bioinformatics Consortium