Bioperl: Accessing sequences via Bio::DB::SeqI

Ewan Birney birney@ebi.ac.uk
Fri, 28 Apr 2000 10:14:24 +0100 (BST)


On Thu, 27 Apr 2000, Mark Dalphin wrote:

> Hi,
> 
> I'm looking at implementing a class to access a GCG SeqStore database (Oracle
> backend) using BioPerl.  I'm trying to integrate this as cleanly as possible,
> perhaps more for my practice and a sense of elegance than for anything else.  In
> other words, I think I could hack this easily, but my questions go more towards:
> "What were the original implementors trying to achieve?".

Great stuff Mark. I am responsible (as always) for the semantic mire down
here...

> 
> I'm looking at the abstract classes: Bio::DB::SeqI.pm and
> Bio::DB::RandomAccessI.pm and the non-abstract (real?) class
> Bio::DB::GenBank.pm.

Yup. These are the right classes to look at.


> 
> It looks like RandomAccessI.pm is merely a subset of SeqI.pm, with the "iterate
> through the database" stripped out.  Is this for some reason other than "it got
> left in the development tree when we switched its name from RandomAccessI to
> SeqI" or is it because it does represent a different abstract class with less
> functionality?  If so, why would this be done? Isn't that the purpose of the
> subroutine stubs in the abstract class to begin with?

For many of the databases you might interface to across the web, it is
impossible or inadvisable to implement database-iterator functionality.
Hence the split into RandomAccessI and SeqI: SeqI is what you would want
to implement, while RandomAccessI is the subset for these web databases.
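
A minimal sketch of that split, using made-up package names (My::RandomAccessI,
My::SeqI, My::WebDB, My::LocalDB) rather than the real Bioperl modules. It only
illustrates the idea that the iterating interface extends the random-access
one, so a web database can stop at the smaller interface:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the two interfaces; not the real Bioperl modules.
package My::RandomAccessI;
sub new { my ($class, %args) = @_; return bless {%args}, $class }
# Abstract stub: concrete classes must override it.
sub get_Seq_by_id { my ($self) = @_; die ref($self) . " must implement get_Seq_by_id\n" }

package My::SeqI;
our @ISA = ('My::RandomAccessI');
# Enumeration is only promised by the larger interface.
sub get_all_ids { my ($self) = @_; die ref($self) . " must implement get_all_ids\n" }

# A web-style database: lookups by id only, no way to enumerate.
package My::WebDB;
our @ISA = ('My::RandomAccessI');
sub get_Seq_by_id { my ($self, $id) = @_; return "seq-for-$id" }

# A local database: supports both lookup and enumeration.
package My::LocalDB;
our @ISA = ('My::SeqI');
sub get_Seq_by_id { my ($self, $id) = @_; return $self->{seqs}{$id} }
sub get_all_ids   { my ($self) = @_; return sort keys %{ $self->{seqs} } }

package main;

my $web   = My::WebDB->new;
my $local = My::LocalDB->new(seqs => { A1 => 'ACGT', B2 => 'TTGA' });

print $web->get_Seq_by_id('X99'), "\n";
print join(',', $local->get_all_ids), "\n";
print $web->can('get_all_ids') ? "iterable\n" : "random access only\n";
```

Code that only needs lookups can accept either object; code that needs to walk
the whole database can check for get_all_ids before relying on it.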

> 
> Looking at GenBank.pm, it comments that its class is SeqI.pm and then inherits
> RandomAccessI.pm! I assume it was in the process of being switched from one to
> the other; I can't tell which way it was moving, however.  Anyone know?
> 

The documentation is wrong. It should say in the docs that it inherits
from RandomAccessI.pm. When we made the /Index classes have iterators I
needed to split out SeqI from RandomAccessI.

Documentation is at fault, and the @ISA is correct.


> Looking at SeqI.pm, which I am taking as the "master" abstract class, I wonder
> about the iterator function:
>     @ids = $seqdb->get_all_ids();
>     $stream = $seqdb->get_PrimarySeq_stream();
>     while(my $seq= $stream->next_seq()) {
>         # $seq is a PrimarySeqI compliant object
>     }
> 
> Given the increasing sizes of the databases (I know the one I am working with is
> huge!), I wonder if this iterator should permit some kind of a selection
> function. That is, for example for SeqStore, where the sequences are stored in
> an Oracle DB, why not include a set of criteria, or even a SELECT statement?
> 

Grrrr. Then we get into Object Query Language and a lot of mess. I would
prefer a system where the "selection" criteria are part of the concrete
database object, and hence can be specialised for individual databases,
rather than being part of the interface, i.e.:


  # 'locator' and 'subset' are SeqStore-specific constructor arguments
  $seqdb = Bio::DB::SeqStore->new ( 'locator' => $oracle_locator_handle,
                                    'subset'  => { length => 1500,
                                                   type   => 'cdna' });

  @ids = $seqdb->get_ids();


In my view this is much nicer, and it means you can just think about
SeqStore-style query constraints rather than us tackling the far more
complex problem of "I want a way to query efficiently on objects through
any implementation of the database".
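
To make the idea concrete, here is a rough sketch with an in-memory record list
standing in for the Oracle backend. The class name My::SeqStore, the 'records'
argument and the 'min_length' key are all invented for illustration; a real
SeqStore class would presumably turn the constraints into a WHERE clause
instead of filtering in Perl:

```perl
use strict;
use warnings;

# Hypothetical concrete class; the 'subset' keys are invented for illustration.
package My::SeqStore;

sub new {
    my ($class, %args) = @_;
    return bless {
        records => $args{records} || [],
        subset  => $args{subset}  || {},
    }, $class;
}

# True if a record satisfies every constraint given at construction time.
sub _matches {
    my ($self, $rec) = @_;
    my $c = $self->{subset};
    return 0 if defined $c->{min_length} && $rec->{length} < $c->{min_length};
    return 0 if defined $c->{type} && lc($rec->{type}) ne lc($c->{type});
    return 1;
}

# No query arguments here: the selection lives in the object, not the call.
sub get_all_ids {
    my ($self) = @_;
    return map { $_->{id} } grep { $self->_matches($_) } @{ $self->{records} };
}

package main;

my @records = (
    { id => 'S1', length => 2000, type => 'cDNA' },
    { id => 'S2', length =>  900, type => 'cDNA' },
    { id => 'S3', length => 3100, type => 'genomic' },
);

my $seqdb = My::SeqStore->new(
    records => \@records,
    subset  => { min_length => 1500, type => 'cdna' },
);

print join(',', $seqdb->get_all_ids), "\n";
```

The caller of get_all_ids never sees the constraints, so the same calling code
works against any other database object that has no notion of subsets.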


Does this make sense?


> Then I could say:
>     @ids = $seqdb->get_ids( { length => '> 1500',
>                               type  => 'cDNA',
>                               species => 'Homo sapiens'});
> where the parameters were defined by the specific instance (correct word?) of
> the class derived from the abstract class. A derived class which accessed an SQL
> database might permit direct SQL queries to return the subset of IDs.
> 
> Also, should such an iterator return a full Bio::Seq object rather than a
> Bio::PrimarySeq object (or should it be selectable?); I certainly hope that our
> database will contain a great deal of annotation in addition to merely sequence.
> 

My 'view' on this is that for the fully annotated objects you get a list
of ids and then use the RandomAccess stuff to get the annotated objects.

However, I have a feeling that I am going to be bullied into suggesting
that we make a dual iterator. If the iterators behaved in the SeqIO
fashion of having

    $iterator->next_seq();
    $iterator->next_primary_seq();

would this be a good idea?
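
A sketch of what such a dual iterator might look like, assuming a made-up
My::SeqStream class and plain hashes standing in for PrimarySeq and Seq
objects. Both methods advance the same cursor; next_seq simply adds the
annotation on top of what next_primary_seq returns:

```perl
use strict;
use warnings;

# Hypothetical stream over an id list; hashes stand in for sequence objects.
package My::SeqStream;

sub new {
    my ($class, %args) = @_;
    # Copy the id list so the caller's array is left untouched.
    return bless { db => $args{db}, ids => [ @{ $args{ids} } ] }, $class;
}

# Cheap variant: id and sequence only.
sub next_primary_seq {
    my ($self) = @_;
    my $id = shift @{ $self->{ids} };
    return unless defined $id;
    return { id => $id, seq => $self->{db}{$id}{seq} };
}

# Full variant: the primary record plus its annotation.
sub next_seq {
    my ($self) = @_;
    my $primary = $self->next_primary_seq or return;
    return { %$primary, annotation => $self->{db}{ $primary->{id} }{annotation} };
}

package main;

my %db = (
    S1 => { seq => 'ACGT', annotation => 'kinase' },
    S2 => { seq => 'TTGA', annotation => 'unknown' },
);

my $stream = My::SeqStream->new(db => \%db, ids => [qw(S1 S2)]);

my $first  = $stream->next_primary_seq;   # sequence only
my $second = $stream->next_seq;           # sequence and annotation

print "$first->{id}: $first->{seq}\n";
print "$second->{id}: $second->{seq} ($second->{annotation})\n";
```

The caller picks the cheaper or the fuller object per call, which keeps the
decision out of the database interface itself.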


> Comments?
> 


Great of you to take the time over this. Feel free to suggest other
improvements...


> Mark
> 
> --
> Mark Dalphin                          email: mdalphin@amgen.com
> Mail Stop: 29-2-A                     phone: +1-805-447-4951 (work)
> One Amgen Center Drive                       +1-805-375-0680 (home)
> Thousand Oaks, CA 91320                 fax: +1-805-499-9955 (work)
> 
> 
> 
> =========== Bioperl Project Mailing List Message Footer =======
> Project URL: http://bio.perl.org/
> For info about how to (un)subscribe, where messages are archived, etc:
> http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
> ====================================================================
> 

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------
