[Bioperl-l] BIO::DB::FASTA ID

Michael Kiwala mkiwala at watson.wustl.edu
Thu Jun 21 21:23:46 UTC 2007


You have only 1527 unique IDs in the file. Bio::DB::Fasta indexes sequences by ID, so entries that share an ID collapse into one:

~$ grep '^>' Desktop/T_orthologs_Dpse_genes.fa|cut -d\  -f1|sort -u|wc -l
1527


Change your make_id function so that the IDs it returns are unique.
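
To see which IDs collide before rewriting make_id, the pipeline above can be extended to print each repeated ID with its count. This is only a sketch: dup_ids is a made-up helper name, and it assumes, like the cut command above, that the ID is the first space-delimited field of each header line.

```shell
# Hypothetical helper: list every ID (first space-delimited field of a
# '>' header line) that occurs more than once, preceded by its count.
dup_ids() {
    grep '^>' "$1" | cut -d' ' -f1 | sort | uniq -c | awk '$1 > 1 { print $1, $2 }'
}

# Usage (path as in the command above):
#   dup_ids Desktop/T_orthologs_Dpse_genes.fa
```

Each output line is a count followed by the header ID, so the IDs listed there are the ones make_id needs to tell apart, for example by also using a later field of the header.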



Staffa, Nick (NIH/NIEHS) wrote:
> The program below returns only 1527 IDs from a FASTA file that I have
> constructed, which has 1820 header lines:
> mildred> grep -c "^>Dpse" T_orthologs_Dpse_genes.fa
> 1820
> It actually does not return the first 3 ids,
> nor the 5th, nor 7..36, 38,39,41..44......
> The header lines are of variable length and the sequence lines are 80
> characters except at the ends when they might be shorter.
> Is there some caveat I am overlooking in my format that breaks
> Bio::DB::Fasta?
>
>
> #!/usr/bin/perl
> #
> #
> #
> use strict;
> use Bio::DB::Fasta;
> use Bio::Tools::SeqWords;
> use Bio::Seq;
> use Bio::SeqIO;
> $|=1;
> #
> #
> my $Dpse_UTR_file_for_T_orthologs =
> "/home/staffa/clients/Kari/D_pse_genome/testit/T_orthologs_Dpse_genes.fa";
> my $db = Bio::DB::Fasta->new
> ('/home/staffa/clients/Kari/D_pse_genome/testit/T_orthologs_Dpse_genes.fa',
>   -reindex,  -makeid => \&make_my_id);
> my @ids = $db->ids;
> my $number_in = @ids;
> print "number of Dpse IDs = $number_in\n";
> foreach my $id (@ids){
> print "$id\n";
> }
> sub make_my_id {
> #       parse header line:
> #       >Dpse_GA13134 CG14636 NO UTR has 2 TATTTAT 117 145, 0 TTATTTATT
>     my $line = shift;
> #    print "line = $line\n";
>     $line =~ />(\w+) /;
>     my $ID = $1;
> #    print "ID = $ID\n";
>     return $ID;
>       }
>   


