Bioperl: Accessing sequences via Bio::DB::SeqI

Thu, 27 Apr 2000 17:17:58 -0700

Hi,

I'm looking at implementing a class to access a GCG SeqStore database (Oracle
backend) using BioPerl.  I'm trying to integrate this as cleanly as possible,
perhaps more for my own practice and a sense of elegance than for anything else.
In other words, I think I could hack this together easily, but my questions are
more along the lines of: "What were the original implementors trying to achieve?"

I'm looking at the abstract classes Bio::DB::SeqI.pm and
Bio::DB::RandomAccessI.pm, and at the non-abstract (concrete?) class
Bio::DB::GenBank.pm.

It looks like RandomAccessI.pm is merely a subset of SeqI.pm, with the "iterate
through the database" methods stripped out.  Is there a reason for this other
than "it got left in the development tree when we renamed RandomAccessI to
SeqI"?  Or does it really represent a different abstract class with less
functionality?  If so, why would this be done?  Isn't that the purpose of the
subroutine stubs in the abstract class to begin with?
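
To make the second possibility concrete, here is a minimal sketch of how a
smaller "random access" interface and a larger "iterable" interface could
coexist.  This is only my guess at the intent, not the implementors' stated
design.  The My::* package names and the in-memory hash are hypothetical
stand-ins and are not the real Bioperl modules.

```perl
use strict;
use warnings;

# Hypothetical small interface: lookup by ID only.
package My::RandomAccessI;
sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}
# Stub: a concrete class is expected to override this.
sub get_Seq_by_id { die ref(shift) . " must implement get_Seq_by_id\n" }

# Hypothetical larger interface: everything above, plus iteration.
package My::SeqI;
use parent -norequire, 'My::RandomAccessI';
sub get_all_ids           { die ref(shift) . " must implement get_all_ids\n" }
sub get_PrimarySeq_stream { die ref(shift) . " must implement get_PrimarySeq_stream\n" }

# A database that can only answer "give me this ID" (for example, one
# reached over the network) promises the small interface and nothing more.
package My::RemoteDB;
use parent -norequire, 'My::RandomAccessI';
sub get_Seq_by_id {
    my ($self, $id) = @_;
    return $self->{seqs}{$id};
}

package main;
my $db = My::RemoteDB->new(seqs => { X1 => 'ACGT' });
print $db->get_Seq_by_id('X1'), "\n";
print $db->isa('My::SeqI') ? "iterable\n" : "random access only\n";
```

With this arrangement a caller can ask `$db->isa('My::SeqI')` before trying to
iterate, instead of discovering at run time that a stub dies.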

Looking at GenBank.pm, its comments say that it implements SeqI.pm, yet it
inherits from RandomAccessI.pm!  I assume it was in the process of being
switched from one to the other, but I can't tell in which direction.  Does
anyone know?

Looking at SeqI.pm, which I am taking as the "master" abstract class, I wonder
about the iterator methods:
    @ids = $seqdb->get_all_ids();
    $stream = $seqdb->get_PrimarySeq_stream();
    while(my $seq= $stream->next_seq()) {
        # $seq is a PrimarySeqI compliant object
    }
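
For reference, here is a minimal sketch of what such a stream might look like
behind the scenes: it holds a list of IDs and fetches each sequence only when
next_seq() is called.  The My::SeqStream class, the fetch callback, and the
in-memory %store are all hypothetical; this is not how Bioperl implements it.

```perl
use strict;
use warnings;

package My::SeqStream;    # hypothetical stream class

sub new {
    my ($class, $fetch, @ids) = @_;
    # $fetch is a code ref that turns one ID into one sequence.
    return bless { fetch => $fetch, ids => [@ids] }, $class;
}

sub next_seq {
    my $self = shift;
    my $id = shift @{ $self->{ids} };
    return unless defined $id;          # stream exhausted
    return $self->{fetch}->($id);       # fetch lazily, one record at a time
}

package main;
my %store = (A1 => 'ACGT', B2 => 'GGCC');   # stand-in for the database
my $stream = My::SeqStream->new(sub { $store{ $_[0] } }, sort keys %store);

my @seen;
while (defined(my $seq = $stream->next_seq())) {
    push @seen, $seq;
}
print "@seen\n";
```

Only the ID list is held in memory; the sequences themselves are pulled one at
a time, which matters once the database gets large.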

Given the increasing sizes of the databases (the one I am working with is
huge!), I wonder if this iterator should permit some kind of selection
function.  For SeqStore, for example, where the sequences are stored in an
Oracle DB, why not accept a set of criteria, or even a SELECT statement?

Then I could say:
    @ids = $seqdb->get_ids( { length => '> 1500',
                              type  => 'cDNA',
                              species => 'Homo sapiens'});
where the parameters are defined by the specific implementation (correct word?)
of the class derived from the abstract class.  A derived class that accesses an
SQL database might permit direct SQL queries to return the subset of IDs.
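
As a rough sketch of what an SQL-backed derived class might do with such a
criteria hash, the function below turns it into a SELECT statement with bind
placeholders.  The table name (sequences), the column names, and the function
name build_id_query are all assumptions made up for illustration; they are not
the SeqStore schema or any existing Bioperl method.

```perl
use strict;
use warnings;

# Hypothetical mapping from criteria keys to column names.  Unknown keys are
# rejected, so callers cannot inject arbitrary column names into the SQL.
my %COLUMN = (
    length  => 'seq_length',
    type    => 'mol_type',
    species => 'species',
);

sub build_id_query {
    my ($criteria) = @_;
    my (@where, @bind);
    for my $key (sort keys %$criteria) {
        my $col = $COLUMN{$key} or die "Unknown criterion '$key'\n";
        # Accept an optional leading comparison operator, e.g. '> 1500'.
        my ($op, $value) =
            $criteria->{$key} =~ /^\s*(<=|>=|<|>|=)?\s*(.+?)\s*$/
            or die "Cannot parse value for '$key'\n";
        $op = '=' unless defined $op;
        push @where, "$col $op ?";
        push @bind,  $value;            # values go in as bind parameters
    }
    my $sql = 'SELECT id FROM sequences';
    $sql .= ' WHERE ' . join(' AND ', @where) if @where;
    return ($sql, @bind);
}

my ($sql, @bind) = build_id_query({
    length  => '> 1500',
    type    => 'cDNA',
    species => 'Homo sapiens',
});
print "$sql\n";
print join(', ', @bind), "\n";
```

The resulting statement and bind list could then be handed to whatever
database layer the derived class uses; a non-SQL class could interpret the
same hash in its own way, or reject it.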

Also, should such an iterator return a full Bio::Seq object rather than a
Bio::PrimarySeq object (or should that be selectable)?  I certainly hope that
our database will contain a great deal of annotation in addition to the bare
sequence.

Comments?

Mark

--
Mark Dalphin                          email: mdalphin@amgen.com
Mail Stop: 29-2-A                     phone: +1-805-447-4951 (work)
One Amgen Center Drive                       +1-805-375-0680 (home)
Thousand Oaks, CA 91320                 fax: +1-805-499-9955 (work)
