[Bioperl-l] Getting sequences by ID

Ryan Golhar golharam at umdnj.edu
Thu Apr 6 15:34:57 UTC 2006


Here's how I'm doing it with bioperl, but with large genbank files (such
as chromosomes) it takes a while:

my $inseq = Bio::SeqIO->new(...);
while (my $seqobj = $inseq->next_seq) {
	next if ($seqobj->accession ne $id);

	# process the sequence here
}


-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Torsten
Seemann
Sent: Wednesday, April 05, 2006 6:14 PM
To: Yuval Itan
Cc: bioperl-l at bioperl.org
Subject: Re: [Bioperl-l] Getting sequences by ID


On Wed, 2006-04-05 at 18:03 +0100, Yuval Itan wrote:
> I would be grateful for advice from you regarding Bioperl, after
> fiddling around trying to write the Perl script for this from scratch.
> I have a large fasta file of about 20,000 genes, and another file which
> is a list of about 2,000 gene IDs (no sequences), all included in the
> large file. I need to create a fasta file which will include only the
> genes with these specific 200 IDs. I was wondering if there is a method
> in Bioperl that will allow me to do the following pseudocode:
> 
> For each $ID from 200_IDs_set_file
> {
> 	$my_seq = get_sequence_by_ID(from large_fasta_file, $ID)
> 	write $my_seq into file
> }

There are many possibilities involving combinations of pure Perl and
BioPerl modules, and some even involving no Perl, but rather using
commands like 'formatdb' and 'fastacmd -s'. There are probably EMBOSS
solutions too.

Using your pseudo code, you could use Bio::Index::Fasta to index your
20,000 genes. Then loop over each ID, and retrieve the Seq via the
index, and write it out using Bio::SeqIO.
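
The index approach might look roughly like the sketch below. The helper
name extract_by_index, the four file arguments, and the one-ID-per-line
layout of the ID file are assumptions made for illustration; only the
Bio::Index::Fasta and Bio::SeqIO calls come from BioPerl itself.

```perl
use strict;
use warnings;
use Bio::Index::Fasta;
use Bio::SeqIO;

# Hypothetical helper: copy the records named in $id_file from $fasta_file
# to $out_file, looking each one up through an on-disk index ($index_file).
# Assumes the ID file holds one ID per line.
sub extract_by_index {
    my ($fasta_file, $id_file, $out_file, $index_file) = @_;

    # Build the index; -write_flag => 1 is needed to create or update it.
    my $inx = Bio::Index::Fasta->new(-filename   => $index_file,
                                     -write_flag => 1);
    $inx->make_index($fasta_file);

    my $out = Bio::SeqIO->new(-file => ">$out_file", -format => 'fasta');

    open my $ids, '<', $id_file or die "Cannot open $id_file: $!";
    my $written = 0;
    while (my $id = <$ids>) {
        $id =~ s/^\s+|\s+$//g;          # trim whitespace and newline
        next unless length $id;
        my $seq = $inx->fetch($id);     # undef if the ID is not indexed
        unless ($seq) { warn "ID '$id' not found\n"; next; }
        $out->write_seq($seq);
        $written++;
    }
    close $ids;
    $out->close;
    return $written;                    # number of records written
}
```

The index only has to be built once, so this pays off mainly when the
same large file is queried repeatedly.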

Perhaps look at it from another perspective:

# put all the IDs we want into a hash (read from file)
my %want_id = .... ; 

foreach $seq (use Bio::SeqIO to read large_fasta_file) {
  if $want_id{$seq->id} then
    use Bio::SeqIO to write this $seq out
  end
}
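
Written out as real Perl, the pseudocode above could look something like
the following sketch. The helper name filter_by_id, its file arguments,
and the assumption that the ID file holds one ID per line are mine, not
part of BioPerl.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical helper: stream $fasta_file once and write every record whose
# ID is listed in $id_file (assumed: one ID per line) to $out_file.
sub filter_by_id {
    my ($fasta_file, $id_file, $out_file) = @_;

    # Put all the IDs we want into a hash for constant-time lookup.
    open my $ids, '<', $id_file or die "Cannot open $id_file: $!";
    my %want_id;
    while (my $id = <$ids>) {
        $id =~ s/^\s+|\s+$//g;          # trim whitespace and newline
        $want_id{$id} = 1 if length $id;
    }
    close $ids;

    my $in  = Bio::SeqIO->new(-file => $fasta_file,  -format => 'fasta');
    my $out = Bio::SeqIO->new(-file => ">$out_file", -format => 'fasta');

    my $written = 0;
    while (my $seq = $in->next_seq) {
        next unless $want_id{ $seq->id };   # skip sequences we did not ask for
        $out->write_seq($seq);
        $written++;
    }
    $out->close;
    return $written;                        # number of records written
}
```

This reads the large file exactly once and needs no index, at the cost
of writing the sequences in file order rather than in ID-list order.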



-- 
Torsten Seemann <torsten.seemann at infotech.monash.edu.au>
Victorian Bioinformatics Consortium
