[BioSQL-l] How to get a Seq object from Bio::DB::Persistent::Seq

Hilmar Lapp hlapp at gnf.org
Mon Jun 7 19:52:26 EDT 2004


On Jun 3, 2004, at 1:49 AM, jochen wrote:

> Hi,
>
> I have a similar problem, namely I want to modify some sequences and
> store them back in the database, without overwriting any of the
> original sequences, basically this:
>
> # retrieve an existing sequence
> my $seq = Bio::Seq::RichSeq->new( -display_id => 'something' );

Note that display_id (bioentry.name) is not constrained by a unique 
index and therefore you may easily get duplicate records (which will 
cause an exception if searching by unique key).

> $seq = $seqadaptor->find_by_unique_key($seq);
>
> # make sure $seq isn't persistent anymore
> my $buffer = new IO::String;
> my $out = new Bio::SeqIO(-fh => $buffer, -format => 'embl');
> $out->write_seq($seq);
> $buffer->setpos(0);
> my $in = new Bio::SeqIO(-fh => $buffer, -format => 'embl');
> $seq = $in->next_seq;
>
> # modify it a little
> $seq->primary_id('NEW001');
>
> # create a new copy (fails, just overwrites the old one)
> $seq->create()

With the code above, this line should throw a Perl error for calling a
non-existent method on the object: a sequence stream will never give
you a persistent object.

Should I assume that between the lines you created a persistent object 
from the object that the SeqIO stream returned to you?
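For reference, that missing step could look roughly like the sketch
below. It assumes $db is the adaptor object returned by
Bio::DB::BioDB->new(-database => 'biosql', ...) and that
create_persistent(), create() and commit() behave as in the Bioperl-db
version at hand. The helper name store_seq is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a plain sequence object and insert it.
# $db  - assumed to be the adaptor returned by Bio::DB::BioDB->new(...)
# $seq - a plain Bio::SeqI object, e.g. from Bio::SeqIO->next_seq
sub store_seq {
    my ($db, $seq) = @_;

    # Only the persistent wrapper has create(); the plain object
    # returned by a sequence stream does not.
    my $pseq = $db->create_persistent($seq);

    $pseq->create();    # make sure an equivalent entry is in the database
    $pseq->commit();    # end the transaction

    return $pseq;
}
```

Calling store_seq($db, $in->next_seq) would then replace the bare
$seq->create() from the quoted code.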


> A little debugging revealed that there are several unique constraints 
> on the bioentry (using postgresql here), which prevent me from 
> creating two objects, if they have
>
> o the same primary_id and/or
> o the same (accession_number,version,namespace)
>
> Isn't this an unnecessary restriction? Especially, why is primary_id a
> unique constraint, and not (primary_id,namespace)?
>

This was suggested before, and in fact you can change that constraint
to include the namespace. I thought it was in the schema as a
commented-out option, but apparently it is not (yet).

Bioperl-db will use, but not mandate, the namespace as additional 
constraint when doing a lookup by primary_id.

(accession_number,version,namespace) is a well-established uniqueness 
constraint on sequences in order to guarantee a minimal amount of 
sanity.

> Even worse, $seq->create in most cases doesn't give an error if there 
> is already a similar sequence, but just writes over the existing 
> sequence:

It doesn't write over an existing sequence. It will update the 
attributes of the object you wanted to create to match those of the 
existing object in the database, unless you pass in an object factory 
(-obj_factory => $myseqfactory).

>
> In Bio/DB/BioSQL/BasePersistenceAdaptor.pm, lines 196-213, you try to
> insert the new object. If this fails, you conclude this object
> already exists and retrieve it from the DB. Now this behaviour is ok
> for creating any missing foreign key objects. However, if I
> invoke create() on a sequence object, I'd expect this object to be
> newly created or to receive an error.
>

If that's what you expect then run a find_by_unique_key() first to make 
sure it's not present already. (Note that this is still no guarantee 
because between the time you get the negative result and the time you 
commit the create() transaction somebody else may have inserted the 
same sequence.)
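A minimal sketch of that check-then-create pattern follows. It assumes
$db comes from Bio::DB::BioDB->new(-database => 'biosql', ...) and that
find_by_unique_key() returns undef when nothing matches. The helper
name create_or_die is hypothetical, and the race described above is
still present.

```perl
use strict;
use warnings;

# Hypothetical helper: insert $seq only if no matching entry is found.
# $db  - assumed to be the adaptor returned by Bio::DB::BioDB->new(...)
# $seq - a plain Bio::SeqI object carrying its unique key attributes
sub create_or_die {
    my ($db, $seq) = @_;

    my $adp = $db->get_object_adaptor("Bio::SeqI");

    # Look the sequence up by its unique key first.
    my $found = $adp->find_by_unique_key($seq);
    die "sequence is already in the database\n" if $found;

    # Not found: wrap it and insert. Another client may still insert
    # the same sequence before this transaction commits.
    my $pseq = $db->create_persistent($seq);
    $pseq->create();
    $pseq->commit();

    return $pseq;
}
```

Wrapping the call in eval { ... } lets the caller decide what to do
when the sequence is already there.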

Note that the method is named create(), not insert_or_fail(). The 
purpose is that after the call returns successfully the object on which 
you invoked create() has an equivalent entry in the database. It is not 
an error if the respective row that you wanted to be present in the 
database is already there.

If it were, you'd force the user to run, in almost all cases, the very
logic you found at this place whenever an exception occurs. I.e., you'd
require the user to worry about a lot of
absence/presence/concurrency/transactional possibilities when all that
he/she wanted was to make sure the sequence (as identified by its
unique key) is in the database.

Bioperl-db is not a SQL interface. It's an OR mapper. You use it if you 
want to live and navigate in object land, not when you want to be close 
to the RDBMS vibe. At least that's the goal ...


> What do you think about this? Did I miss something there?
>
> I'd suggest fixing that by introducing two different create functions
> (or a parameter) that controls whether it's ok to retrieve an 
> eventually existing object (i.e. when creating the foreign key 
> objects) or whether the whole method should fail if there is an 
> already existing object.

It's easily achievable on the client end by running
find_by_unique_key() first.

>
>> ...
>> # trigger insert by making the object forget
>> # its primary key
>> $pseq->primary_key(undef);
>> # we need to duplicate dependent objects
>> # (children) too, like features
>> foreach my $pfea ($pseq->get_SeqFeatures) {
>> 	$pfea->primary_key(undef)
>> 		if $pfea->isa("Bio::DB::PersistentObjectI");
>> 	# features have locations
>> 	$pfea->location->primary_key(undef)
>> 		if $pfea->location->isa("Bio::DB::PersistentObjectI");
>> }
>> # do the insert
>> $pseq->create();
>
> assuming you just changed the namespace, this code example won't work,
> because you didn't change the primary_id, thus violating the unique
> constraint

Right. It wasn't meant as bullet-proof code. (Note that primary_id is 
optional.)

I'm inclined to make the tuple of (identifier,namespace) the default 
for the future; there seem to be too many subtle issues otherwise if 
you're unsuspecting.

	-hilmar

>
> kind regards
> -- jochen
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------


