[Bioperl-l] QUestions about Bio::Index and Bio::DB

Tue, 2 Apr 2002 20:26:41 -0500 (EST)

On Tue, 2 Apr 2002 morgarws@mh.us.sbphrd.com wrote:

> I have a couple of questions about these two methods for random access into
> FASTA files:
>
> 1. Since a GB FASTA file normally has gi and gb identifiers concatenated
> together it appears the Bio::Index then can only access a sequence by the
> concatenated ID. Is this the expected correct behavior?
>
You can override the id parsing with your own function - and it can
specify any number of IDs to be indexed by.
(from my tutorial problem 6 with a couple of added lines:
http://users.bioperl.org/~jason/niehs_tutorial/problems/problem6.html
)

use Bio::Index::Fasta;
use Bio::SeqIO;

# the first item on the cmd line is the index filename to create
my $Index_File_Name = 'myindex.idx';
my $inx = Bio::Index::Fasta->new('-filename' => $Index_File_Name,
                                 '-write_flag' => 1);
$inx->id_parser(\&parse_ncbi_id);
# everything else on the cmdline
$inx->make_index('/home/data/libs/ecoli.aa');
my $gi = '1786188';
my $acc = 'AAC73118';

print "got a seq for $acc\n" if( my $seq = $inx->fetch($acc));
print "got a seq for $gi\n" if( my $seq = $inx->fetch($gi));

sub parse_ncbi_id {
    my @retvals;
# you can implement this to of course return as many ids as you want -
# there are some limits to index key length (Lincoln knows) in the
# Berkeley 1.x and 2.x implementations so I'm not sure how many you can
# squeeze in, gi and accession usually get me by just fine.

    if( $_[0] =~ /^>(\S+)/ ) {
        my $val = $1;
        push @retvals, $val;
        my (@elements) = split(/\|/, $val);
        while( @elements ) {
            my $id = shift @elements;
            if( $id eq 'gb' || $id eq 'gi' || $id eq 'ref' ) {
                next;
            }
            push @retvals, $id;
            if( $id =~ /(\S+)\./) {
                push @retvals, $1;
            }
        }
    }
    return @retvals;
}

> 2. Bio::DB::Fasta seems to be able to generate an index (via the -makeid
> option) for one of the IDs but not for all that appear, ie the routine
> specified with the makeid option is supposed to return a scalar. Is it planned
> or in the works to allow the makeid routine to return a list of IDs that the
> sequence is indexed by?
>
Not sure - this is Lincoln's implementation and I think we wanted to try
and merge the DB::Fasta with Index::Fasta as they essentially provide the
same thing, there was some hangup which we discussed that I've since
forgotten.

> 3. Both the latest version of WashU BLAST and (I believe though haven't
> checked for myself) NCBI BLAST have the ability to generate an index file of
> the FASTA file which can be used to randomly access the sequences in that file
> (WASHU provides this with its xdformat and xdget commands). Is anybody working
> on adding support to either Bio::Index or Bio::DB for these indexes? And if
> someone wanted to do it for which module would it make the most sense to add
> it to?

We don't support the NCBI format yet - the index file format changes
between versions which is annoying.  We support a homegrown berkeleydb
format and now the open-bio projects have settled on a standard which is
part of the OBDA spec (http://obda.open-bio.org - see mailing list
open-bio-l at http://open-bio.org/mailman/listinfo/open-bio-l for more
info) which supports 2 different flat file indexing formats - these are
provided in the Bio::DB::Flat interface.

Some people have volunteered to look into the NCBI and WU-BLAST index
formats and we would welcome a pure-Perl implementation for reading these
index files.  Not sure how far these have advanced...

>
> Thanks in Advance,
>
> Bill Morgart
> morgarws@molbio.sbphrd.com
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>

-- 
Jason Stajich
Duke University
jason@cgt.mc.duke.edu