[Bioperl-l] Walking multiple bioentries using bioperl-db

Wed Jul 19 11:31:50 EDT 2006

On Jul 19, 2006, at 9:43 AM, Jay Hannah wrote:

> Howdy --
>
> I'm using bioperl-db + biosql-schema + mySQL.
>
> I can now successfully build a biosql-schema instance in mySQL, load
> taxonomy, then using bioperl-db load a GenBank file from disk,
> committing the sequences I want. For a given accession number +
> version + namespace, I can tell bioperl-db to delete that from mySQL
> and it does. Yay!! I'll be throwing a "Using bioperl-db" document onto
> the wiki over the next week.

Excellent!

>
> What I am currently baffled by:
>
> How do I ask bioperl-db to walk over multiple bioentries in my
> database so I can do things with them? The simplest possible example:
> print a list of all bioentries in my database.
>
> It is trivially easy to just query mySQL directly, but if I'm reading /
> understanding the documentation correctly bioperl-db intends to be
> database schema and RDBMS agnostic. In that case, I should use
> bioperl-db to walk my records. So, how do I do that?

Bioperl-db indeed intends to be schema(-variant) and RDBMS agnostic,  
but that doesn't mean that you have to be as well. If you find it  
trivially easy to query your database using SQL and DBI and you don't  
care about being RDBMS or schema-variant agnostic, then by all means  
don't feel obligated to go through the bioperl-db API for querying.

Note you can obtain the DBI database handle being used by a  
persistence adaptor by calling dbh():

	my $dbh = $adaptor->dbh();

(The advantage of this is that you use the same connection, and  
therefore the same machinery for obtaining connection parameters and  
building the DSN that the rest of bioperl-db uses. Also, you have the  
ability to see transactions in progress that have not been committed  
yet by the adaptor.)

What you should not do through SQL directly is modify (UPDATE or
DELETE) entities which bioperl-db also holds in a cache (by default,
terms and dbxrefs), unless you also take care to clear the cache of
the respective adaptor.
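
A read-only query through that shared handle could be sketched as
follows. The helper name list_bioentries_sql is made up for this
illustration, and the table and column names (bioentry, biodatabase,
accession, version, name) assume the stock BioSQL schema; adjust them
if your schema variant differs. It only SELECTs, so the cache caveat
above does not apply.

```perl
use strict;
use warnings;

# Hypothetical helper: list every bioentry as "namespace:accession.version"
# through a DBI handle, e.g. the one returned by $adaptor->dbh().
# Assumes the stock BioSQL tables bioentry and biodatabase.
sub list_bioentries_sql {
    my ($dbh) = @_;
    my $sql = <<'SQL';
SELECT d.name, e.accession, e.version
FROM bioentry e
JOIN biodatabase d ON d.biodatabase_id = e.biodatabase_id
ORDER BY d.name, e.accession, e.version
SQL
    # selectall_arrayref() returns a reference to an array of row arrays
    my $rows = $dbh->selectall_arrayref($sql);
    return map {
        my @col = map { $_ // '' } @$_;    # tolerate NULL columns
        sprintf("%s:%s.%s", @col);
    } @$rows;
}
```

With an adaptor in hand, usage would be along these lines:

	my $dbh = $adaptor->dbh();
	print "$_\n" for list_bioentries_sql($dbh);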

>
> Is Bio::DB::Query::BioQuery the way to do this? The only way?

Well, yes, unless you want to use SQL directly (which is not a
despised option, see above).

>
> If so then can someone help me understand the datacollections() and
> where() methods?

datacollections() in essence corresponds to the FROM clause of a SQL
statement, including its JOINs. '=>' joins two entities in a 1:n
relationship; '<=>' joins two entities in an n:n relationship. Instead
of tables you give the (Bioperl) objects that are to be joined, and
bioperl-db translates the objects to database entities, i.e., tables.
Each object may be followed by an alias, which makes it easier to
refer to the object (entity) in the query constraint part (where()).
A single alias following a join expression always applies to the
master object (table), i.e., the left-hand side of '=>'.

>
> perldoc Bio::DB::Query::BioQuery
>
>           # all mouse sequences loaded under namespace ensembl that
>           # have receptor in their description
>           $query->datacollections(["Bio::PrimarySeqI e",
>                                  "Bio::Species=>Bio::PrimarySeqI sp",
>                                  "BioNamespace=>Bio::PrimarySeqI db"]);

This is short for

           $query->datacollections([ # enumerate the objects we need:
                                  "Bio::PrimarySeqI e",
                                  "Bio::Species sp",
                                  "BioNamespace db",
                                  # specify master-detail relationships
                                  "Bio::Species=>Bio::PrimarySeqI",
                                  "BioNamespace=>Bio::PrimarySeqI"]);

because the alias following the join statement applies to the master  
entity.
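
To make the shorthand rule concrete, here is a toy expansion in plain
Perl. It is an illustration only and not how bioperl-db parses
datacollections; the function name expand_join_alias is made up, and
it handles only the '=>' form with a trailing alias, passing
everything else through unchanged.

```perl
use strict;
use warnings;

# Illustration only: NOT bioperl-db's parser. Expands the shorthand
# "Master=>Detail alias" into the two entries it stands for.
sub expand_join_alias {
    my ($entry) = @_;
    if ($entry =~ /^([\w:]+)=>([\w:]+)\s+(\w+)$/) {
        my ($master, $detail, $alias) = ($1, $2, $3);
        # the alias belongs to the master (left-hand) object
        return ("$master $alias", "$master=>$detail");
    }
    return ($entry);    # plain objects and other forms pass through
}

my @short = ("Bio::PrimarySeqI e",
             "Bio::Species=>Bio::PrimarySeqI sp",
             "BioNamespace=>Bio::PrimarySeqI db");
print "$_\n" for map { expand_join_alias($_) } @short;
```

The printed entries are the same five as in the long form above, only
in a different order (each aliased object is immediately followed by
its join).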

>           $query->where(["sp.binomial like 'Mus *'",
>                          "e.desc like '*receptor*'",
>                          "db.namespace = 'ensembl'"]);

The where() method corresponds to the WHERE clause in SQL. The
default logical operator between constraints is AND. There is more
documentation on the syntax for expressing constraints in
Bio::DB::Query::QueryConstraint.

The column whose value is to be constrained is given as the attribute
(method) of the (bioperl) object. If there are multiple objects in
the 'datacollections' then you need to qualify each attribute by
prefixing it with the object, or the alias assigned in
datacollections(), followed by a dot, corresponding to typical OO
syntax.

>
>           # all mouse sequences loaded under namespace ensembl that
>           # have receptor in their description, and that also have a
>           # cross-reference with SWISS as the database
>           $query->datacollections(["Bio::PrimarySeqI e",
>                                  "Bio::Species=>Bio::PrimarySeqI sp",
>                                  "BioNamespace=>Bio::PrimarySeqI db",
>                                  "Bio::Annotation::DBLink xref",
>
> I'm bewildered by this API. Please forgive my ignorance.

I understand. This part of the API is by far the one with the  
skimpiest documentation.

There are a considerable number of tests in t/query.t which may serve
as examples, and as long as those tests pass, the queries in them are
known to work. The tests don't actually execute any query; instead
some internal guts are used to test the translation to SQL, so if you
know SQL you may be able to understand better what's going on by
seeing the object-level query and the SQL-level query side by side.

>
> 1) How do I get *all* bioentries out of my database?

Your datacollections would consist of the single object Bio::SeqI (or  
Bio::PrimarySeqI if you didn't want any annotation), and there would  
be no query constraint:

	my $query = Bio::DB::Query::BioQuery->new(-datacollections => ["Bio::SeqI"]);
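
Executing such a query and walking the result could be sketched like
this. The helper name walk_query is made up for illustration; it
relies only on find_by_query() on the persistence adaptor and
next_object() on the result, which to the best of my knowledge are
the calls bioperl-db provides for this (check the perldoc of your
installed version).

```perl
use strict;
use warnings;

# Hypothetical helper: run $query through a persistence adaptor and
# hand every object found to $callback. Returns the number of objects.
# Relies only on find_by_query() and next_object().
sub walk_query {
    my ($adaptor, $query, $callback) = @_;
    my $result = $adaptor->find_by_query($query);
    my $count  = 0;
    while (my $obj = $result->next_object()) {
        $callback->($obj);
        $count++;
    }
    return $count;
}
```

Wired up to a database, with all connection values being placeholders:

	use Bio::DB::BioDB;
	use Bio::DB::Query::BioQuery;

	my $db = Bio::DB::BioDB->new(-database => 'biosql',
	                             -driver   => 'mysql',
	                             -host     => 'localhost',  # placeholder
	                             -dbname   => 'biosql',     # placeholder
	                             -user     => 'me',         # placeholder
	                             -pass     => 'secret');    # placeholder
	my $adaptor = $db->get_object_adaptor("Bio::SeqI");
	my $query   = Bio::DB::Query::BioQuery->new(-datacollections => ["Bio::SeqI"]);
	walk_query($adaptor, $query,
	           sub { print $_[0]->accession_number, "\n" });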

>
> 2) Say I did want just the "namespace" 'Pico' (one of my
> biodatabase.name's). Where did
>
>     "BioNamespace=>Bio::PrimarySeqI db"]);
>
> come from? How was I supposed to figure out the left hand side of that
> mapping? The right hand side? If that line wasn't sitting in that
> document was there a way for me to figure it out as a *user* of
> bioperl-db?

You would not know from Bioperl itself. The right hand side is a
Bioperl class. The left hand side is a kludge: Bioperl does not have
a namespace class; instead, objects that have a namespace implement
the Bio::IdentifiableI interface directly. This kind of one class
mapping to two database entities (biodatabase is a table separate
from, in fact a master for, bioentry) is extremely cumbersome to
express in a generic way, so I chose to create a
Bio::DB::Persistent::BioNamespace class to represent that for the
purpose of queries.

> Or would I need to be a *programmer* of bioperl-db reading source to
> figure this out? Where did
>
>     "db.namespace = 'ensembl'"]);
>
> come from? Again, do I have to read source code to know how to invoke
> that magic?

Well, I'm not sure even reading the source code clears it all up ;)  
As I said before, the part before the dot is the alias or object, the  
part after is the attribute (or method) to be constrained.
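
Applied to the 'Pico' namespace from the question, and following the
pattern of the perldoc example quoted earlier, the two lists could be
built like this. The helper name namespace_query_parts is made up for
illustration, and that this exact pair of datacollections and
constraint works is an assumption by analogy with the perldoc example,
not something tested here against a database.

```perl
use strict;
use warnings;

# Hypothetical helper: build the datacollections and where lists for
# "all sequences loaded under one namespace", modelled on the perldoc
# example (alias 'db' refers to the BioNamespace master object).
sub namespace_query_parts {
    my ($namespace) = @_;
    # The name ends up inside a quoted literal, so refuse embedded quotes
    die "namespace must not contain a single quote\n"
        if $namespace =~ /'/;
    my @datacollections = ("Bio::PrimarySeqI e",
                           "BioNamespace=>Bio::PrimarySeqI db");
    my @where = ("db.namespace = '$namespace'");
    return (\@datacollections, \@where);
}
```

The two lists would then be handed to the query object:

	my ($dc, $where) = namespace_query_parts('Pico');
	my $query = Bio::DB::Query::BioQuery->new();
	$query->datacollections($dc);
	$query->where($where);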

>
> Sorry if I sound like a jerk. That is not my intention. Hopefully I
> can document the answers for future bioperl-db'ers.

No problem, that's fine - and whatever you would be willing to  
contribute to documentation would be highly appreciated.

	-hilmar

>
> Thanks in advance,
>
> j
> my current plaything: http://openlab.jays.net

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================