[Bioperl-l] bioperl-db: Updating existing Databases with load_seqdatabase.pl and load_ontology

Hilmar Lapp hlapp at gnf.org
Tue Oct 7 14:55:49 EDT 2003


I don't know what you mean by 'everything'. Updating is a known problem, and
known to be completely non-trivial.

The load_XXX scripts themselves just issue an update to the persistent
objects (if they have been found by prior lookup). The update call as
implemented in the bioperl-db adaptors translates to SQL UPDATE statements
*only* (where undefined attributes are prevented from overriding database
column values). By 'only' I mean that the adaptors won't magically perform
any DELETEs on entity associations.
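
To make the "UPDATE only, undefined attributes don't override" behaviour
concrete, here is a minimal Python sketch using sqlite3. The table layout and
the helper update_defined are made up for illustration; this is not the
bioperl-db adaptor code, nor the actual BioSQL schema.

```python
import sqlite3

def update_defined(conn, table, key_col, key, attrs):
    """Issue a plain UPDATE, skipping attributes whose value is None.

    Hypothetical helper: table and column names are interpolated into the
    SQL string, so they must come from trusted code, not from user input.
    """
    defined = {col: val for col, val in attrs.items() if val is not None}
    if not defined:
        return 0  # nothing defined, so nothing to update
    assignments = ", ".join(f"{col} = ?" for col in defined)
    sql = f"UPDATE {table} SET {assignments} WHERE {key_col} = ?"
    cur = conn.execute(sql, [*defined.values(), key])
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entry (accession TEXT PRIMARY KEY, description TEXT, version INTEGER)"
)
conn.execute("INSERT INTO entry VALUES ('P12345', 'old description', 1)")

# The incoming object has no description (None), so the stored one survives;
# only the version column is overwritten.
update_defined(conn, "entry", "accession", "P12345",
               {"description": None, "version": 2})
row = conn.execute(
    "SELECT description, version FROM entry WHERE accession = 'P12345'"
).fetchone()
print(row)
```

Note that nothing in this sketch ever deletes a row; rows in other tables
that point at the entry are left exactly as they were.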

'Updating' the associations is the main issue that makes a database update
non-trivial, because there is no SQL UPDATE equivalent: the association is
either present or absent. If this is confusing, take sequences and their
associated seqfeatures as a very simple example. If the number of features
changes for a sequence, you can't reflect that by issuing a SQL UPDATE
against some table, because seqfeature and bioentry are two tables connected
by a foreign key. You either a) figure out which features were modified,
added, and deleted, and correspondingly update, insert, and delete those, or
b) delete all existing features and insert all features of the updated
sequence entry from scratch.
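
The two options can be sketched as follows in Python with sqlite3. The
two-table layout (entry, feature) and the function names are assumptions
made for this example and are much simpler than the real bioentry and
seqfeature tables. Since the features here carry no identity of their own,
a modified feature shows up as one delete plus one insert in option a).

```python
import sqlite3

def setup():
    """Create a toy database: one entry with two features."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entry (entry_id INTEGER PRIMARY KEY, accession TEXT)")
    conn.execute(
        "CREATE TABLE feature ("
        " entry_id INTEGER REFERENCES entry(entry_id),"
        " name TEXT, start_pos INTEGER, end_pos INTEGER)"
    )
    conn.execute("INSERT INTO entry VALUES (1, 'P12345')")
    conn.executemany(
        "INSERT INTO feature VALUES (1, ?, ?, ?)",
        [("signal", 1, 20), ("domain", 30, 90)],
    )
    return conn

def stored_features(conn, entry_id):
    rows = conn.execute(
        "SELECT name, start_pos, end_pos FROM feature WHERE entry_id = ?",
        (entry_id,),
    )
    return set(rows.fetchall())

def sync_by_diff(conn, entry_id, new_features):
    """Option a): touch only the features that differ."""
    old = stored_features(conn, entry_id)
    new = set(new_features)
    for feat in old - new:  # no longer present in the updated entry
        conn.execute(
            "DELETE FROM feature WHERE entry_id = ? AND name = ?"
            " AND start_pos = ? AND end_pos = ?",
            (entry_id, *feat),
        )
    for feat in new - old:  # newly present in the updated entry
        conn.execute("INSERT INTO feature VALUES (?, ?, ?, ?)", (entry_id, *feat))
    return len(old - new), len(new - old)

def sync_by_replace(conn, entry_id, new_features):
    """Option b): delete all features, then insert from scratch."""
    conn.execute("DELETE FROM feature WHERE entry_id = ?", (entry_id,))
    conn.executemany(
        "INSERT INTO feature VALUES (?, ?, ?, ?)",
        [(entry_id, *feat) for feat in new_features],
    )

# Updated entry: 'domain' changed its end position and 'site' is new.
updated = [("signal", 1, 20), ("domain", 30, 95), ("site", 100, 100)]

conn_a = setup()
deleted, inserted = sync_by_diff(conn_a, 1, updated)

conn_b = setup()
sync_by_replace(conn_b, 1, updated)
```

Both variants end up with the same set of features for the entry. Option a)
leaves the unchanged 'signal' row alone, while option b) rewrites every row.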

There are arguments for and against both a) and b). The approach of the
load_XXX scripts is to leave this decision to the user. It is the script
supplied via --mergeobjs that does the magic if any. A couple of (working)
example scripts for this purpose come with the package in the scripts/biosql
directory, and if you check them out you'll see exactly what's going on (as
there is no magic anywhere else other than cascading plain UPDATEs):

    freshen-annot.pl         # replace all old annotations with the new
                             # annotations regardless
    merge-unique-ann.pl      # leave old annotation untouched, add on new that
                             # are different from any of the old annotations
    update-on-new-version.pl # like freshen, but only for bioentries that
                             # changed version
    update-on-new-date.pl    # like freshen, but only for bioentries that
                             # have a more recent date
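
The four policies can be paraphrased in Python as below. This is only a
sketch of the descriptions above, not a translation of the Perl scriptlets:
the real ones operate on Bioperl objects, whereas the dict layout
(version, date, annotations) and the function names here are assumptions
made for the example.

```python
from datetime import date

def freshen(old, new):
    """Replace all old annotations with the new ones, regardless."""
    return list(new["annotations"])

def merge_unique(old, new):
    """Keep old annotations untouched; add new ones not already present."""
    merged = list(old["annotations"])
    for ann in new["annotations"]:
        if ann not in merged:
            merged.append(ann)
    return merged

def update_on_new_version(old, new):
    """Like freshen, but only if the version changed."""
    if new["version"] != old["version"]:
        return freshen(old, new)
    return list(old["annotations"])

def update_on_new_date(old, new):
    """Like freshen, but only if the new entry has a more recent date."""
    if new["date"] > old["date"]:
        return freshen(old, new)
    return list(old["annotations"])

old = {"version": 1, "date": date(2003, 1, 1),
       "annotations": ["kinase", "my own note"]}
new = {"version": 2, "date": date(2003, 9, 1),
       "annotations": ["kinase", "ATP-binding"]}

freshened = freshen(old, new)       # the locally added note is lost
merged = merge_unique(old, new)     # the locally added note is kept
same_version = update_on_new_version(old, dict(new, version=1))
newer_date = update_on_new_date(old, new)
```

The difference that matters most in practice is visible in the first two
results: freshening drops anything you added yourself, merging keeps it.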

Note that databanks not only update and add entries, they also obsolete
entries. This is not at all handled by load_seqdatabase.pl.
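
For illustration only, candidates for obsoleted entries could at least be
listed by comparing the accessions you have stored with those in the new
release. The function name and the plain lists of accessions are assumptions
for this sketch; it is not something load_seqdatabase.pl does, and an
accession missing from a release is not necessarily deleted (it may have
been merged into another entry).

```python
def possibly_obsolete(stored_accessions, release_accessions):
    """Accessions held locally that no longer appear in the new release."""
    return sorted(set(stored_accessions) - set(release_accessions))

stored = ["P00001", "P00002", "P00003"]
release = ["P00001", "P00003", "P00004"]
candidates = possibly_obsolete(stored, release)
print(candidates)
```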

So, returning to your question, it depends on your situation. If you are
never going to add your own annotations, purging and re-populating is
guaranteed to leave you in a state consistent with the datasource, but is
likely to be slower than a smart update (smart meaning update only if the
entry indeed changed). If you add your own annotations, purging the content
is not an option, and you need to decide how to merge an existing entry with
its updated counterpart using either one of the aforementioned scriptlets,
or one you write yourself. Similarly for ontology update: if you want to
update ontologies w/o updating everything else too, purging is not an
option, and you'll need to decide how you want to merge.

Merging entries currently leaves you with possibly obsoleted bioentries
hanging around in your biosql instance. I have this very problem myself and
can't present a good solution yet, as I'll only start to address it within
the next few weeks.

Hth.

    -hilmar

On 10/7/03 9:24 AM, "Raphael A. Bauer"
<raphael.bauer at informatik.hu-berlin.de> wrote:

> Hello bioperl-l members,
> 
> I wonder how to perform a "safe" (incremental) update on a
> working biosql schema.
> 
> My setup is the following: BioSQL schema on Postgres equipped with
> Swissprot (via load_seqdatabase.pl) and GeneOntology (via load_ontology.pl).
> 
> Now I simply want to feed the existing BioSQL schema with
> new versions of both databases.
> The question is how to do it in a safe way - especially so that no
> links in tables are dead and all information in the relevant tables and
> dbxrefs is updated too.
> 
> There is of course the --lookup flag for both scripts (and the
> update-on-new-date.pl for SProt) - but does it take care of everything?
> Or would it be better to build up the database from scratch in order
> to remain in a consistent state...
> 
> Many thanks in advance,
> 
> Raphael

-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------




