[Bioperl-l] BIO::DB::FASTA ID

Staffa, Nick (NIH/NIEHS) staffa at niehs.nih.gov
Thu Jun 21 18:36:12 UTC 2007


This program below returns only  1527 IDs from a fasta file that I have
constructed, which has
mildred> grep -c "^>Dpse" T_orthologs_Dpse_genes.fa
1820
.
It actually does not return the first 3 ids,
nor the 5th, nor 7..36, 38,39,41..44......
The header lines are of variable length and the sequence lines are 80
characters except at the ends when they might be shorter.
Is there some caveat that I am ignoring in my format that breaks
bio::db::fasta?


#!/usr/bin/perl
#
#
#
use strict;
use Bio::DB::Fasta;
use Bio::Tools::SeqWords;
use Bio::Seq;
use Bio::SeqIO;
$|=1;
#
#
my $Dpse_UTR_file_for_T_orthologs =
"/home/staffa/clients/Kari/D_pse_genome/testit/T_orthologs_Dpse_genes.fa";
my $db = Bio::DB::Fasta->new
('/home/staffa/clients/Kari/D_pse_genome/testit/T_orthologs_Dpse_genes.fa',
  -reindex,  -makeid => \&make_my_id);
my @ids = $db->ids;
my $number_in = @ids;
print "number of Dpse IDs = $number_in\n";
foreach my $id (@ids){
print "$id\n";
}
sub make_my_id {
#       parse header line:
#       >Dpse_GA13134 CG14636 NO UTR has 2 TATTTAT 117 145, 0 TTATTTATT
    my $line = shift;
#    print "line = $line\n";
    $line =~ />(\w+) /;
    my $ID = $1;
#    print "ID = $ID\n";
    return $ID;
      }

-------------- next part --------------
A non-text attachment was scrubbed...
Name: T_orthologs_Dpse_genes.fa
Type: application/octet-stream
Size: 5033676 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20070621/07c354d0/attachment-0004.obj>


More information about the Bioperl-l mailing list