Yuval, if this is all you're doing, it may not be worth using Bioperl. Here's a Perl one-liner that does it: cut and paste he whole thing onto the command line and replace the filenames (id_list, input.fasta, and found.fasta at the end) with your own. perl -e '($id,$fasta)=@ARGV; open(ID,$id); while () {s/\r?\n//; /^>?(\S+)/; $ids{$1}++;} $num_ids = keys %ids; open(F, $fasta); $s_read = $s_wrote = $print_it = 0; while () { if (/^>(\S+)/) {$s_read++; if ($ids{$1}) {$s_wrote++; $print_it = 1; delete $ids{$1}} else {$print_it = 0}}; if ($print_it) {print $_}}; END {warn "Searched $s_read FASTA records.\nFound $s_wrote IDs out of $num_ids in the ID list.\n"}' id_list input.fasta > found.fasta http://www.cgr.harvard.edu/cbg/scriptome/UNIX/Tools/Choose.html#choose_a_set _of_fasta_sequences_from_a_file__choose_fastas_from_list_ (There's a teensy-bit different version for Windows Perl too.) You can email me off-list for more info. - Amir Karger Computational Biology Group Bauer Center for Genomics Research Harvard University 617-496-0626 > -----Original Message----- > From: Torsten Seemann [mailto:torsten.seemann@infotech.monash.edu.au] > Sent: Wednesday, April 05, 2006 6:14 PM > To: Yuval Itan > Cc: bioperl-l@bioperl.org > Subject: Re: [Bioperl-l] Getting sequences by ID > > On Wed, 2006-04-05 at 18:03 +0100, Yuval Itan wrote: > > I would be grateful for an advice from you regarding > Bioperl, after I was > > fiddling around trying to write the Perl script for that > from scratch. > > I have a large fasta file of about 20,000 genes, and > another file which is a > > list of about 2,000 gene IDs (no sequences), all included > in the large file. > > I need to create a fasta file which will include only the > genes with these > > specific 200 IDs. I was wondering if there is a method in > Bioperl that will > > allow me to do the following pseudocode: > > > > For each $ID from 200_IDs_set_file > > { > > $my_seq = get_sequence_by_ID(from large_fasta_file, $ID) > > write $my_seq into file > > } > > There are many possibilities involving combinations of pure Perl and > BioPerl modules, and some even involving no Perl, but rather using > commands like 'formatdb' and 'fastacmd -s'. There are probably EMBOSS > solutions too. > > Using your pseudo code, you could use Bio::Index::Fasta to index your > 20,000 genes. Then loop over each ID, and retrieve the Seq via the > index, and write it out using Bio::SeqIO. > > Perhaps look at it from another perspective: > > # put all the IDs we want into a hash (read from file) > my %want_id = .... ; > > foreach $seq (use Seq::IO to read large_fasta_file) { > if $want_id{$seq->id} then > use Seq::IO to write this $seq out > end > } > > > > -- > Torsten Seemann > Victorian Bioinformatics Consortium > > >