From daniel.lang at biologie.uni-freiburg.de  Wed Jun  2 04:44:51 2004
From: daniel.lang at biologie.uni-freiburg.de (Daniel Lang)
Date: Wed Jun  2 04:48:24 2004
Subject: [BioSQL-l] How to get a Seq object from Bio::DB::Persistent::Seq
Message-ID: <40BD9383.3090603@biologie.uni-freiburg.de>

Hi,
I'm retrieving sequences out of a biosql db using Bio::DB::Query's...
When calling the next_object function on the QueryResult, you get the
persistence object for seq.
I want to copy the object into a fresh seq object, to add new data and
store it afterwards as a new entry with a different namespace.
The solution I'm using now is quite awkward...I copy it using SeqIO:(
Is there a method to retrieve seq objects directly?

Additionally, I'm quite confused by the mapping of bioperl objects to
biosql tables (e.g. for generating a BioQuery with datacollections):
connections like bioentry<->SeqI are obvious, but the rest?
Is there something like an overview list of the object mapping?

An example script for Query and Constraints would be great:)

Thanks in advance

Daniel


-- 

Daniel Lang
University of Freiburg, Plant Biotechnology
Sonnenstr. 5, D-79104 Freiburg
phone: +49 761 203 6988
homepage:  http://www.plant-biotech.net/
e-mail: daniel.lang@biologie.uni-freiburg.de

#################################################
 >REALITY.SYS corrupted: Reboot universe? (Y/N/A)
#################################################

Join MOSS 2004 in Freiburg, Germany from September 12th - 15th:
registration and information @ http://www.plant-biotech.net/moss2004


From Marc.Logghe at devgen.com  Wed Jun  2 05:31:23 2004
From: Marc.Logghe at devgen.com (Marc Logghe)
Date: Wed Jun  2 05:35:12 2004
Subject: [BioSQL-l] How to get a Seq object from Bio::DB::Persistent::Seq
Message-ID: <BEE28BF86078B6429D6C780635718E21904D01@morelia.be.devgen.com>

Hi Daniel,

> Hi,
> I'm retrieving sequences out of a biosql db using Bio::DB::Query's...
> When calling the next_object function on the QueryResult, you get the
> persistence object for seq.
> I want to copy the object into a fresh seq object, to add new data and
> store it afterwards as a new entry with a different namespace.
> The solution I'm using now is quite awkward...I copy it using SeqIO:(
> Is there a method to retrieve seq objects directly?

I don't know what will happen if you change the namespace of the persistent object and store it. Probably a lot of constraints ;-) (Not tested though !)
A route you could follow is to 
1. fetch the plain seq object
2. change the namespace and add some features
3. make it persistent and 
4. store it.

Suppose you have your persistent seq in $pseq:
my $seq = $pseq->obj;
$seq->namespace('my_new_namespace');

# do some other stuff

my $new_pseq = $db->create_persistent($seq);
$new_pseq->create;

> 
> Additionally, I'm quite confused by the mapping of bioperl objects to
> biosql tables (e.g. for generating a BioQuery with datacollections):
> connections like bioentry<->SeqI are obvious, but the rest?
> Is there something like an overview list of the object mapping?
Have a look at  perldoc -m Bio::DB::BioSQL::BaseDriver, more precisely at the %object_entity_map variable.
Examples of queries you might find in http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-db/t/query.t?rev=1.9&cvsroot=bioperl&content-type=text/vnd.viewcvs-markup
and also the presentation given at BOSC2003:
http://open-bio.org/bosc2003/slides/Persistent_Bioperl_BOSC03.pdf

HTH,
Marc

 
From hlapp at gnf.org  Wed Jun  2 14:02:09 2004
From: hlapp at gnf.org (Hilmar Lapp)
Date: Wed Jun  2 14:05:55 2004
Subject: [BioSQL-l] How to get a Seq object from Bio::DB::Persistent::Seq
In-Reply-To: <BEE28BF86078B6429D6C780635718E21904D01@morelia.be.devgen.com>
References: <BEE28BF86078B6429D6C780635718E21904D01@morelia.be.devgen.com>
Message-ID: <F7AA6DD6-B4BE-11D8-9796-000A95AE92B0@gnf.org>


On Jun 2, 2004, at 2:31 AM, Marc Logghe wrote:

> Hi Daniel,
>
>> I want to copy the object into a fresh seq object, to add new data and
>> store it afterwards as a new entry with a different namespace.
>
> I don't know what will happen if you change the namespace of the  
> persistent object and store it. Probably a lot of constraints ;-) (Not  
> tested though !)
> A route you could follow is to
> 1. fetch the plain seq object

You don't really need to do this even.

> 2. change the namespace and add some features
> 3. make it persistent and
> 4. store it.
>

Right, that would be the way.

> suppose you have your persistent seq in $pseq;
> my $seq = $pseq->obj;
> $seq->namespace('my_new_namespace');
>

Again, no real reason to get the wrapped object unless you explicitly  
need a non-persistent object.

Persistent objects in bioperl-db speak are not tightly coupled to the  
database; in fact you might say they are uncoupled. What I mean is that  
you may change any attribute or property of the persistent object  
without having any effect on what is stored in the database. Only once  
you ask the object to store itself will it sync the changes to the  
database.

So, you may simply do the following:

	while (my $pseq = $query->next_object) {
		# e.g. change namespace
		$pseq->namespace("my namespace");
		# change other things, e.g., tack on another feature
		# (which may or may not be a persistent object)
		$pseq->add_SeqFeature($myfeature);
		# ...
		# when done making changes, sync to database
		$pseq->store();
	}

Note that this will update bioentries to change their namespace, not  
duplicate them in another namespace. If you wanted to duplicate a  
sequence in another namespace, possibly with some changes on the  
annotation, replace $pseq->store() with the following:

		...
		# trigger insert by making the object forget
		# its primary key
		$pseq->primary_key(undef);
		# we need to duplicate dependent objects
		# (children) too, like features
		foreach my $pfea ($pseq->get_SeqFeatures) {
			$pfea->primary_key(undef)
				if $pfea->isa("Bio::DB::PersistentObjectI");
			# features have locations
			$pfea->location->primary_key(undef)
				if $pfea->location->isa("Bio::DB::PersistentObjectI");
		}
		# do the insert
		$pseq->create();

You will note that this sample code actually does not cover all  
possible cases; e.g., if there are sub-features, or split locations.  
But you get the idea. Nevertheless, there is indeed a case for having a  
convenience method for de-persisting objects to better support those  
who want to duplicate them.
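
The sub-feature and split-location cases could be handled by walking
the object tree recursively. What follows is only a minimal sketch, not
bioperl-db code: forget_primary_keys() is a hypothetical helper, and it
assumes that sub-features are reachable through get_SeqFeatures()
(Bio::FeatureHolderI) and the parts of a split location through
sub_Location() (Bio::Location::SplitLocationI). Other dependent objects,
such as annotation, are not covered.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical helper (not part of bioperl-db): make an object and its
# dependent objects forget their primary keys, so that a later create()
# inserts new rows instead of touching the existing ones.
sub forget_primary_keys {
    my ($obj) = @_;
    return unless blessed($obj);

    # only persistent objects carry a primary key
    $obj->primary_key(undef)
        if $obj->isa("Bio::DB::PersistentObjectI");

    # split locations: descend into the component locations
    if ($obj->isa("Bio::Location::SplitLocationI")) {
        forget_primary_keys($_) for $obj->sub_Location;
    }
    # features have a location
    if ($obj->isa("Bio::SeqFeatureI")) {
        forget_primary_keys($obj->location);
    }
    # sequences hold features, and features may hold sub-features
    if ($obj->isa("Bio::FeatureHolderI")) {
        forget_primary_keys($_) for $obj->get_SeqFeatures;
    }
    return;
}
```

With such a helper the foreach loop above would shrink to
forget_primary_keys($pseq) followed by $pseq->create().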

> # do some other stuff
>
> my $new_pseq = $db->create_persistent($seq);
> $new_pseq->create;
>

Note that this approach has the problems outlined above:

	- if you wanted to update the sequence, this would not do that
	- you will update the features, though, so that the original
	  sequence won't have any features anymore (a feature has a
	  foreign key to exactly one bioentry)

>>
>> Additionally, I'm quite confused by the mapping of bioperl objects to
>> biosql tables (e.g. for generating a BioQuery with datacollections):
>> connections like bioentry<->SeqI are obvious, but the rest?
>> Is there something like an overview list of the object mapping?
> Have a look at  perldoc -m Bio::DB::BioSQL::BaseDriver, more precisely  
> at the %object_entity_map variable.
> Examples of queries you might find in  
> http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-db/t/ 
> query.t?rev=1.9&cvsroot=bioperl&content-type=text/vnd.viewcvs-markup
> and also the presentation given at BOSC2003:
> http://open-bio.org/bosc2003/slides/Persistent_Bioperl_BOSC03.pdf
>

Right. Thanks for helping Marc.

	-hilmar

-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From daniel.lang at biologie.uni-freiburg.de  Thu Jun  3 04:01:00 2004
From: daniel.lang at biologie.uni-freiburg.de (Daniel Lang)
Date: Thu Jun  3 04:04:48 2004
Subject: [BioSQL-l] How to get a Seq object from Bio::DB::Persistent::Seq
In-Reply-To: <F7AA6DD6-B4BE-11D8-9796-000A95AE92B0@gnf.org>
References: <BEE28BF86078B6429D6C780635718E21904D01@morelia.be.devgen.com>
	<F7AA6DD6-B4BE-11D8-9796-000A95AE92B0@gnf.org>
Message-ID: <40BEDABC.90304@biologie.uni-freiburg.de>

Thank you both for your extensive and quick answers...


-- 

Daniel Lang
University of Freiburg, Plant Biotechnology
Sonnenstr. 5, D-79104 Freiburg
phone: +49 761 203 6988
homepage:  http://www.plant-biotech.net/
e-mail: daniel.lang@biologie.uni-freiburg.de



From jochen at penguin-breeder.org  Thu Jun  3 04:49:06 2004
From: jochen at penguin-breeder.org (jochen)
Date: Thu Jun  3 04:52:15 2004
Subject: [BioSQL-l] How to get a Seq object from Bio::DB::Persistent::Seq
Message-ID: <20040603084906.GA27454@coffee.homeunix.org>

Hi,

I have a similar problem, namely I want to modify some sequences and
store them back in the database, without overwriting any of the original
sequences, basically this:

# retrieve an existing sequence
my $seq = Bio::Seq::RichSeq->new( -display_id => 'something' );
$seq = $seqadaptor->find_by_unique_key($seq);

# make sure $seq isn't persistent anymore
my $buffer = new IO::String;
my $out = new Bio::SeqIO(-fh => $buffer, -format => 'embl');
$out->write_seq($seq);
$buffer->setpos(0);
my $in = new Bio::SeqIO(-fh => $buffer, -format => 'embl');
$seq = $in->next_seq;

# modify it a little
$seq->primary_id('NEW001');

# create a new copy (fails, just overwrites the old one)
$seq->create()

A little debugging revealed that there are several unique constraints on
the bioentry (using postgresql here), which prevent me from creating two
objects, if they have

o the same primary_id and/or
o the same (accession_number,version,namespace)

Isn't this an unnecessary restriction? Especially, why is primary_id a
unique constraint, and not (primary_id,namespace)?

Even worse, $seq->create in most cases doesn't give an error if there is
already a similar sequence, but just writes over the existing sequence:

In Bio/DB/BioSQL/BasePersistenceAdaptor.pm, lines 196-213, you try to 
insert the new object. If this fails, you conclude that the object already 
exists and retrieve it from the DB. This behaviour is fine for creating 
any missing foreign key objects. However, if I invoke create() 
on a sequence object, I'd expect this object to be newly created or to 
receive an error.

What do you think about this? Did I miss something there?

I'd suggest fixing that by introducing two different create functions
(or a parameter) that controls whether it's ok to retrieve a possibly
existing object (i.e. when creating the foreign key objects) or whether 
the whole method should fail if the object already exists.

> ...
> # trigger insert by making the object forget
> # its primary key
> $pseq->primary_key(undef);
> # we need to duplicate dependent objects
> # (children) too, like features
> foreach my $pfea ($pseq->get_SeqFeatures) {
> 	$pfea->primary_key(undef)
> 		if $pfea->isa("Bio::DB::PersistentObjectI");
> 	# features have locations
> 	$pfea->location->primary_key(undef)
> 		if $pfea->location->isa("Bio::DB::PersistentObjectI");
> }
> # do the insert
> $pseq->create();

Assuming you just changed the namespace, this code example won't work, 
because you didn't change the primary_id, thus violating the unique
constraint.

kind regards
-- jochen
From hlapp at gnf.org  Fri Jun  4 14:16:06 2004
From: hlapp at gnf.org (Hilmar Lapp)
Date: Fri Jun  4 14:19:22 2004
Subject: [BioSQL-l] Re: [Bioperl-l] Biosql documentation request
In-Reply-To: <GAEDKMGOKFBLJPKCLKCCOEMIDMAA.brian_osborne@cognia.com>
References: <GAEDKMGOKFBLJPKCLKCCOEMIDMAA.brian_osborne@cognia.com>
Message-ID: <3EE5990A-B653-11D8-AB9B-000A95AE92B0@gnf.org>

It pretty much is the latest. Amazing, isn't it? The schema is *very* 
stable.

There are a few additions in the Oracle version which aren't really 
officially blessed yet (meaning, they're not in the MySQL/Pg versions 
but will be soon), and none of which breaks backwards compatibility.

I'll try and see whether I can get postgres_autodoc installed over the 
weekend.

Or maybe somebody on the biosql list has this setup already?

	-hilmar

On Jun 4, 2004, at 4:38 AM, Brian Osborne wrote:

> Hilmar,
>
> Neither does the ERD show nullability. The ERD is good but some useful
> information is missing, yes.
>
> The ERD is dated 6/4/2003, is this the latest version? Pardon my 
> ignorance.
>
> Brian O.
>
> -----Original Message-----
> From: bioperl-l-bounces@portal.open-bio.org
> [mailto:bioperl-l-bounces@portal.open-bio.org]On Behalf Of Hilmar Lapp
> Sent: Friday, June 04, 2004 2:20 AM
> To: Brian Osborne
> Cc: bioperl-l@bioperl.org
> Subject: Re: [Bioperl-l] Biosql documentation request
>
> Brian, if I understand the output correctly it only documents the
> schema elements. Do you feel that the ERD (doc/biosql-ERD.pdf) does not
> fulfill this purpose well enough?
>
> The ERD diagram actually doesn't show the unique key constraints, so
> that would be a difference indeed.
>
>         -hilmar
>
> On Thursday, June 3, 2004, at 05:56  AM, Brian Osborne wrote:
>
>> Bioperl-l,
>>
>> Dave Howorth has provided a detailed critique of the bioperl-db/biosql
>> documentation which I'm working through. One thing that he noticed was
>> that
>> the Biosql file doc/biosql.html was out-of-date. This file was created
>> by
>> running a script called postgres_autodoc.pl on a Postgres instance of
>> the
>> biosql schema. Can anyone provide me with a current version of this
>> file? I
>> run biosql on Mysql myself and I haven't found a script or utility
>> equivalent to postgres_autodoc.pl. postgres_autodoc.pl is available at
>> http://www.rbt.ca/autodoc/.
>>
>> Brian O.
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From hlapp at gnf.org  Fri Jun  4 23:41:30 2004
From: hlapp at gnf.org (Hilmar Lapp)
Date: Fri Jun  4 23:44:43 2004
Subject: [BioSQL-l] Re: [Bioperl-l] Biosql documentation request
In-Reply-To: <GAEDKMGOKFBLJPKCLKCCGENCDMAA.brian_osborne@cognia.com>
References: <GAEDKMGOKFBLJPKCLKCCGENCDMAA.brian_osborne@cognia.com>
Message-ID: <3B611DD0-B6A2-11D8-AB9B-000A95AE92B0@gnf.org>

The kudos w.r.t. the INSTALL document should go to Ewan, who I 
believe wrote (and tested) that heroically during the Singapore 
hackathon.

Great though that it proved useful.

	-hilmar

On Jun 4, 2004, at 6:15 PM, Brian Osborne wrote:

> Hilmar,
>
> I went ahead and installed postgres as well as the biosql schema when I
> found out that Cygwin would install postgres. Kudos to you: the 
> installation
> of postgres, postgres initialization, and biosql database creation took
> about 10 minutes and I've never used postgres before. I was basically
> following the INSTALL instructions. The new biosql.html has been 
> committed.
>
> Brian O.
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From brian_osborne at cognia.com  Fri Jun  4 21:15:56 2004
From: brian_osborne at cognia.com (Brian Osborne)
Date: Sat Jun  5 20:57:07 2004
Subject: [BioSQL-l] RE: [Bioperl-l] Biosql documentation request
In-Reply-To: <3EE5990A-B653-11D8-AB9B-000A95AE92B0@gnf.org>
Message-ID: <GAEDKMGOKFBLJPKCLKCCGENCDMAA.brian_osborne@cognia.com>

Hilmar,

I went ahead and installed postgres as well as the biosql schema when I
found out that Cygwin would install postgres. Kudos to you: the installation
of postgres, postgres initialization, and biosql database creation took
about 10 minutes and I've never used postgres before. I was basically
following the INSTALL instructions. The new biosql.html has been committed.

Brian O.


From hlapp at gnf.org  Mon Jun  7 19:52:26 2004
From: hlapp at gnf.org (Hilmar Lapp)
Date: Mon Jun  7 19:55:32 2004
Subject: [BioSQL-l] How to get a Seq object from Bio::DB::Persistent::Seq
In-Reply-To: <20040603084906.GA27454@coffee.homeunix.org>
References: <20040603084906.GA27454@coffee.homeunix.org>
Message-ID: <BA627563-B8DD-11D8-A130-000A95AE92B0@gnf.org>


On Jun 3, 2004, at 1:49 AM, jochen wrote:

> Hi,
>
> I have a similar problem, namely I want to modify some sequences and
> store them back in the database, without overwriting any of the 
> original
> sequences, basically this:
>
> # retrieve an existing sequence
> my $seq = Bio::Seq::RichSeq->new( -display_id => 'something' );

Note that display_id (bioentry.name) is not constrained by a unique 
index and therefore you may easily get duplicate records (which will 
cause an exception if searching by unique key).

> $seq = $seqadaptor->find_by_unique_key($seq);
>
> # make sure $seq isn't persistent anymore
> my $buffer = new IO::String;
> my $out = new Bio::SeqIO(-fh => $buffer, -format => 'embl');
> $out->write_seq($seq);
> $buffer->setpos(0);
> my $in = new Bio::SeqIO(-fh => $buffer, -format => 'embl');
> $seq = $in->next_seq;
>
> # modify it a little
> $seq->primary_id('NEW001');
>
> # create a new copy (fails, just overwrites the old one)
> $seq->create()

With the above code this line should throw a Perl error for calling a 
non-existent method on an object. A sequence stream will never give 
you a persistent object.

Should I assume that between the lines you created a persistent object 
from the object that the SeqIO stream returned to you?


> A little debugging revealed that there are several unique constraints 
> on the bioentry (using postgresql here), which prevent me from 
> creating two objects, if they have
>
> o the same primary_id and/or
> o the same (accession_number,version,namespace)
>
> Isn't this an unnecessary restriction? Especially, why is primary_id a
> unique constraint, and not (primary_id,namespace)?
>

This was suggested before, and in fact you can change that constraint 
to include the namespace. I thought it was in the schema as a 
commented-out option, but apparently it is not (yet).

Bioperl-db will use, but not mandate, the namespace as additional 
constraint when doing a lookup by primary_id.

(accession_number,version,namespace) is a well-established uniqueness 
constraint on sequences in order to guarantee a minimal amount of 
sanity.

> Even worse, $seq->create in most cases doesn't give an error if there 
> is already a similar sequence, but just writes over the existing 
> sequence:

It doesn't write over an existing sequence. It will update the 
attributes of the object you wanted to create to match those of the 
existing object in the database, unless you pass in an object factory 
(-obj_factory => $myseqfactory).

>
> In Bio/DB/BioSQL/BasePersistenceAdaptor.pm, lines 196-213, you try to
> insert the new object. If this fails, you conclude that the object 
> already exists and retrieve it from the DB. This behaviour is fine 
> for creating any missing foreign key objects. However, if I 
> invoke create() on a sequence object, I'd expect this object to be 
> newly created or to receive an error.
>

If that's what you expect then run a find_by_unique_key() first to make 
sure it's not present already. (Note that this is still no guarantee 
because between the time you get the negative result and the time you 
commit the create() transaction somebody else may have inserted the 
same sequence.)

Note that the method is named create(), not insert_or_fail(). The 
purpose is that after the call returns successfully the object on which 
you invoked create() has an equivalent entry in the database. It is not 
an error if the respective row that you wanted to be present in the 
database is already there.

If it were, you'd force the user to run, in almost all cases, the same 
logic you found at this place whenever an exception occurs. I.e., you'd 
require the user to worry about a lot of 
absence/presence/concurrency/transactional possibilities when all that 
he/she wanted was to make sure the sequence (as identified by its 
unique key) is in the database.
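
To illustrate the idea outside of bioperl-db, here is a minimal sketch 
in plain DBI. It is not the adaptor code: the table "entry", the helper 
ensure_entry(), and the use of DBD::SQLite with an in-memory database 
are all assumptions made up for this example. It tries the insert first 
and falls back to a lookup by unique key if the insert fails, so the 
caller ends up with the row's primary key either way.

```perl
use strict;
use warnings;
use DBI;

# Assumption: DBD::SQLite is installed; any DBI driver would do.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1, PrintError => 0 });

# Hypothetical table: surrogate primary key plus a unique key.
$dbh->do(q{
    CREATE TABLE entry (
        entry_id   INTEGER PRIMARY KEY,
        identifier TEXT NOT NULL,
        namespace  TEXT NOT NULL,
        UNIQUE (identifier, namespace)
    )
});

# Hypothetical helper: make sure a row with this unique key exists.
# Returns its primary key and a flag telling whether we inserted it.
sub ensure_entry {
    my ($dbh, $identifier, $namespace) = @_;
    # Try the insert first; with RaiseError a unique key violation dies.
    my $inserted = eval {
        $dbh->do("INSERT INTO entry (identifier, namespace) VALUES (?, ?)",
                 undef, $identifier, $namespace);
        1;
    } ? 1 : 0;
    # Whether or not we inserted, the row should be there now.
    my ($id) = $dbh->selectrow_array(
        "SELECT entry_id FROM entry WHERE identifier = ? AND namespace = ?",
        undef, $identifier, $namespace);
    die "insert failed for a reason other than a duplicate key\n"
        unless defined $id;
    return ($id, $inserted);
}

my ($id1, $new1) = ensure_entry($dbh, "NEW001", "mynamespace");
my ($id2, $new2) = ensure_entry($dbh, "NEW001", "mynamespace");
print "same row\n" if $id1 == $id2;   # second call found the first row
print "second call inserted nothing\n" unless $new2;
```

If you really want insert-or-fail semantics on top of this, check the 
second return value and die yourself.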

Bioperl-db is not a SQL interface. It's an OR mapper. You use it if you 
want to live and navigate in object land, not when you want to be close 
to the RDBMS vibe. At least that's the goal ...


> What do you think about this? Did I miss something there?
>
> I'd suggest fixing that by introducing two different create functions
> (or a parameter) that controls whether it's ok to retrieve a 
> possibly existing object (i.e. when creating the foreign key 
> objects) or whether the whole method should fail if there is an 
> already existing object.

It's easily achievable on the client end by running 
find_by_unique_key() first.

>
>> ...
>> # trigger insert by making the object forget
>> # its primary key
>> $pseq->primary_key(undef);
>> # we need to duplicate dependent objects
>> # (children) too, like features
>> foreach my $pfea ($pseq->get_SeqFeatures) {
>> 	$pfea->primary_key(undef)
>> 		if $pfea->isa("Bio::DB::PersistentObjectI");
>> 	# features have locations
>> 	$pfea->location->primary_key(undef)
>> 		if $pfea->location->isa("Bio::DB::PersistentObjectI");
>> }
>> # do the insert
>> $pseq->create();
>
> assuming you just changed the namespace, this code example won't work,
> because you didn't change the primary_id, thus violating the unique
> constraint

Right. It wasn't meant as bullet-proof code. (Note that primary_id is 
optional.)

I'm inclined to make the tuple of (identifier,namespace) the default 
for the future; there seem to be too many subtle issues otherwise if 
you're unsuspecting.

	-hilmar

>
> kind regards
> -- jochen
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l@open-bio.org
> http://open-bio.org/mailman/listinfo/biosql-l
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From jochen at penguin-breeder.org  Tue Jun  8 04:42:58 2004
From: jochen at penguin-breeder.org (Jochen Eisinger)
Date: Tue Jun  8 04:45:55 2004
Subject: [BioSQL-l] How to get a Seq object from Bio::DB::Persistent::Seq
In-Reply-To: <BA627563-B8DD-11D8-A130-000A95AE92B0@gnf.org>
References: <20040603084906.GA27454@coffee.homeunix.org>
	<BA627563-B8DD-11D8-A130-000A95AE92B0@gnf.org>
Message-ID: <20040608084258.GA10233@coffee.homeunix.org>

Hi,

thanks for your clarifying answer!

On Mon, Jun 07, 2004 at 04:52:26PM -0700, Hilmar Lapp wrote:
> >$seq = $seqadaptor->find_by_unique_key($seq);
> >
> ># make sure $seq isn't persistent anymore
> >my $buffer = new IO::String;
> >my $out = new Bio::SeqIO(-fh => $buffer, -format => 'embl');
> >$out->write_seq($seq);
> >$buffer->setpos(0);
> >my $in = new Bio::SeqIO(-fh => $buffer, -format => 'embl');
> >$seq = $in->next_seq;
> >
> ># modify it a little
> >$seq->primary_id('NEW001');
> >
> ># create a new copy (fails, just overwrites the old one)
> >$seq->create()
> 
> With the above code this line needs to throw a perl error for calling a 
> non-existent function on an object. A sequence stream will never give 
> you a persistent object.

Ah, yes, I forgot

  $seq = $db->create_persistent($seq) 

before the create() in the above example.

> (accession_number,version,namespace) is a well-established uniqueness 
> constraint on sequences in order to guarantee a minimal amount of 
> sanity.

Why isn't this the primary key btw? I'm quite new to biosql and may
still be missing some points... I'm rather surprised you're using
artificial columns as primary keys and add unique constraints to the
table, instead of using them as primary keys and dropping these
integer-valued id columns.

> 
> >Even worse, $seq->create in most cases doesn't give an error if there 
> >is already a similar sequence, but just writes over the existing 
> >sequence:
> 
> It doesn't write over an existing sequence. It will update the 
> attributes of the object you wanted to create to match those of the 
> existing object in the database, unless you pass in an object factory 
> (-obj_factory => $myseqfactory).

It won't update the record in any case. If you change the length of the
sequence for example, you will get an error "tried to lie about sequence
length"

> >In Bio/DB/BioSQL/BasePersistenceAdaptor.pm, lines 196-213, you try to
> >insert the new object. If this fails, you conclude this object 
> >already exists and retrieve it from the DB. Now this behaviour is ok 
> >for creating any missing foreign key objects. However, if I 
> >invoke create() on a sequence object, I'd expect this object to be 
> >newly created or to receive an error.
> >
> 
> If that's what you expect then run a find_by_unique_key() first to make 
> sure it's not present already. (Note that this is still no guarantee 
> because between the time you get the negative result and the time you 
> commit the create() transaction somebody else may have inserted the 
> same sequence.)
That should not be possible; the DB's transaction system should take care
of this.

> Note that the method is named create(), not insert_or_fail(). The 
> purpose is that after the call returns successfully the object on which 
> you invoked create() has an equivalent entry in the database. It is not 
> an error if the respective row that you wanted to be present in the 
> database is already there.

I expected store() to do this, and create() to be insert_or_fail()-like.

> Bioperl-db is not a SQL interface. It's an OR mapper. You use it if you 
> want to live and navigate in object land, not when you want to be close 
> to the RDBMS vibe. At least that's the goal ...

Ok

> I'm inclined to make the tuple of (identifier,namespace) the default 
> for the future; there seem to be too many subtle issues otherwise if 
> you're unsuspecting.

I guess that would be a good thing to do. Otherwise it's quite
impossible to have the same sequence in multiple versions in a single
database. 

In my case, I need to have sequences with several different annotations
stored in one db. Changing the primary id of the sequences is not an
option here.

kind regards
-- jochen
From jochen at penguin-breeder.org  Tue Jun  8 10:25:35 2004
From: jochen at penguin-breeder.org (Jochen Eisinger)
Date: Tue Jun  8 10:28:30 2004
Subject: [BioSQL-l] SimpleValueAdaptor does not accept values of 0
Message-ID: <20040608142535.GA23458@coffee.homeunix.org>

Hi,

I ran into the problem that 0 values won't be retrieved from the
database. I found the same bug reported in the bugzilla db:

http://bugzilla.bioperl.org/show_bug.cgi?id=1586

The solution suggested there works for me.

kind regards
-- jochen
From hlapp at gnf.org  Tue Jun  8 13:08:49 2004
From: hlapp at gnf.org (Hilmar Lapp)
Date: Tue Jun  8 13:11:57 2004
Subject: [BioSQL-l] Bioperl-db: Added -flat_only option to find_by_query()
Message-ID: <826BD7E0-B96E-11D8-9CC5-000A95AE92B0@gnf.org>

Disregard if you aren't using bioperl-db.

This option was previously only available with find_by_unique_key(). 
You can now pass it to find_by_query() as well. -flat_only means 
retrieved objects will not get their children retrieved and attached.

E.g., when retrieving a Bio::SeqI object with this flag set to true, 
the returned object(s) will have neither features nor annotation 
attached. This is useful to save time if you aren't going to query 
those attributes anyway in your script.

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From hlapp at gnf.org  Tue Jun  8 13:09:07 2004
From: hlapp at gnf.org (Hilmar Lapp)
Date: Tue Jun  8 13:12:11 2004
Subject: [BioSQL-l] SimpleValueAdaptor does not accept values of 0
In-Reply-To: <20040608142535.GA23458@coffee.homeunix.org>
References: <20040608142535.GA23458@coffee.homeunix.org>
Message-ID: <8D0BA10C-B96E-11D8-9CC5-000A95AE92B0@gnf.org>

Fixed in the repository. -hilmar

-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From hlapp at gmx.net  Sat Jun 12 20:11:11 2004
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat Jun 12 20:25:43 2004
Subject: [BioSQL-l] How to get a Seq object from Bio::DB::Persistent::Seq
In-Reply-To: <20040608084258.GA10233@coffee.homeunix.org>
Message-ID: <2D5D0E01-BCCE-11D8-8A0A-000A959EB4C4@gmx.net>


On Tuesday, June 8, 2004, at 01:42  AM, Jochen Eisinger wrote:

>
>> (accession_number,version,namespace) is a well-established uniqueness
>> constraint on sequences in order to guarantee a minimal amount of
>> sanity.
>
> Why isn't this the primary key btw? I'm quite new to biosql and may
> still be missing some points... I'm rather surprised you're using
> artificial columns as primary keys and add unique constraints to the
> table, instead of using them as primary keys and dropping these
> integer-valued id columns.

Who uses the natural primary key as the physical primary key? It's 
common and best practice not to do so, because 1) a natural primary key 
will change if you change the attribute(s), which means you'll have to 
change the foreign keys referencing it too, and 2) multi-column keys 
especially are slow to join (but even a single character column is 
slower than an integer). There are plenty of relational database design 
and theory textbooks out there that explain this a lot better and in 
depth.

>>
>> It doesn't write over an existing sequence. It will update the
>> attributes of the object you wanted to create to match those of the
>> existing object in the database, unless you pass in an object factory
>> (-obj_factory => $myseqfactory).
>
> It won't update the record in any case. If you change the length of the
> sequence for example, you will get an error "tried to lie about
> sequence length"

It will update the object, I said, not the record in the database. You 
cannot set $seq->length to a value other than the actual length of the 
sequence if there is one.

>>
>> If that's what you expect then run a find_by_unique_key() first to
>> make sure it's not present already. (Note that this is still no
>> guarantee because between the time you get the negative result and
>> the time you commit the create() transaction somebody else may have
>> inserted the same sequence.)
> That should not be possible; the DB's transaction system should take
> care of this.

Tell me how it should be able to accomplish this. Transactions don't 
cure wrong assumptions, they just isolate concurrent access.

Let's assume you have a record to be inserted with unique key 'foo'. At 
the time you make a lookup on that key somebody else has inserted a 
record with the same key but hasn't committed the transaction yet. Your 
lookup will return no record. Now you go ahead and insert the record. 
If the other user's transaction isn't rolled back, your insert will 
either fail immediately if he committed meanwhile, or it will block and 
fail once he commits.

>
> In my case, I need to have sequences with several different annotations
> stored in one db. Changing the primary id of the sequences is not an
> option here.
>

If you change the UK constraint to include the namespace you should be 
fine.

	-hilmar

-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------


From gwu at molbio.mgh.harvard.edu  Tue Jun 15 12:38:08 2004
From: gwu at molbio.mgh.harvard.edu (Gang Wu)
Date: Tue Jun 15 12:41:13 2004
Subject: [BioSQL-l] how to quickly retrieve feature sequences
Message-ID: <AAEKLJJMIGMKFDEIPJFEGEFFCLAA.gwu@molbio.mgh.harvard.edu>

Hi,

I just loaded the 5 Arabidopsis thaliana GenBank genome files into my
sequence database (BioSQL 1.38). My question is: how can I efficiently
retrieve all gene sequences from the database? I tried to do that by joining
the seqfeature, seqfeature_qualifier_value, location, term and biosequence
tables, but it turned out to be extremely slow (see the attached SQL; 2
records take about 20 seconds on my Dell PowerEdge 2650 with dual 2.6G
Xeons). Does anyone have a better way to do it?

All I can imagine to do this faster is (in Java or another language): pull all
gene location info; pull the related sequence from the biosequence table; loop
through the gene location list and retrieve the substring of the sequence.
But this does not seem attractive to me, since for different applications I
have to write the code to pull the sequences myself. Is it possible to
extend/modify the BioSQL schema to serve this purpose better?

My understanding is that a lot of subsequent applications would only be
interested in certain pieces of the whole genome sequences, and there must be
an efficient way to do that. If everyone has to invent their own method,
BioSQL might be a little bit too limited. Any ideas on this?

Gang

From gwu at molbio.mgh.harvard.edu  Tue Jun 15 13:12:36 2004
From: gwu at molbio.mgh.harvard.edu (Gang Wu)
Date: Tue Jun 15 13:15:23 2004
Subject: [BioSQL-l] how to quickly retrieve feature sequences
In-Reply-To: <AAEKLJJMIGMKFDEIPJFEGEFFCLAA.gwu@molbio.mgh.harvard.edu>
Message-ID: <AAEKLJJMIGMKFDEIPJFEOEFFCLAA.gwu@molbio.mgh.harvard.edu>

Just forgot to attach the SQL.

=========================================
ATTACHMENT 1
=========================================
CREATE TABLE `term_relationship_term` (
  `term_relationship_id` int(11) NOT NULL default '0',
  `term_id` int(11) NOT NULL default '0',
  PRIMARY KEY  (`term_relationship_id`,`term_id`),
  UNIQUE KEY `term_relationship_id` (`term_relationship_id`),
  UNIQUE KEY `term_id` (`term_id`)
) TYPE=InnoDB;
========================================

Gang


From gwu at molbio.mgh.harvard.edu  Tue Jun 15 13:31:28 2004
From: gwu at molbio.mgh.harvard.edu (Gang Wu)
Date: Tue Jun 15 13:34:11 2004
Subject: [BioSQL-l] how to quickly retrieve feature sequences
In-Reply-To: <AAEKLJJMIGMKFDEIPJFEGEFFCLAA.gwu@molbio.mgh.harvard.edu>
Message-ID: <AAEKLJJMIGMKFDEIPJFEEEFGCLAA.gwu@molbio.mgh.harvard.edu>

SQL again:

SELECT t1.seqfeature_id,t1.bioentry_id,t2.start_pos, t2.end_pos, t2.strand,
t4.value locus_tag,
substring(t6.seq, t2.start_pos,t2.end_pos) seq
FROM `seqfeature` t1 inner join location t2 on
t1.seqfeature_id=t2.seqfeature_id
inner join term t3 on t1.type_term_id=t3.term_id
inner join seqfeature_qualifier_value t4 on
t1.seqfeature_id=t4.seqfeature_id
inner join term t5 on t4.term_id=t5.term_id
inner join biosequence t6 on t1.bioentry_id=t6.bioentry_id
where t3.name='gene' and t5.name='locus_tag'
limit 2

Gang

From hlapp at gnf.org  Sun Jun 20 09:21:28 2004
From: hlapp at gnf.org (Hilmar Lapp)
Date: Sun Jun 20 09:24:14 2004
Subject: [BioSQL-l] how to quickly retrieve feature sequences
In-Reply-To: <AAEKLJJMIGMKFDEIPJFEGEFFCLAA.gwu@molbio.mgh.harvard.edu>
Message-ID: <BCC114E6-C2BC-11D8-B4DB-000A959EB4C4@gnf.org>

Gang,

Do you want to do this in high throughput? Otherwise you could use 
bioperl and bioperl-db as the language binding and then use the bioperl 
object model to retrieve the information.

I'm away from my desk for a week, so I won't be able to elaborate 
further before the week after next week.

	-hilmar

-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------


From gwu at molbio.mgh.harvard.edu  Mon Jun 21 09:42:52 2004
From: gwu at molbio.mgh.harvard.edu (Gang Wu)
Date: Mon Jun 21 09:47:31 2004
Subject: [BioSQL-l] how to quickly retrieve feature sequences
In-Reply-To: <AAEKLJJMIGMKFDEIPJFEEEFGCLAA.gwu@molbio.mgh.harvard.edu>
Message-ID: <AAEKLJJMIGMKFDEIPJFEKEGCCLAA.gwu@molbio.mgh.harvard.edu>

It turned out it's quick enough to retrieve sequences such as genes, promoters
etc. The SQL I provided in my last message had a 'bug' on the "substring(t6.seq,
t2.start_pos,t2.end_pos) seq" line, which retrieves the subsequence
starting at "t2.start_pos" with a length of "t2.end_pos". But what I needed is
the gene sequences, which should be "substring(t6.seq,
t2.start_pos,t2.end_pos-t2.start_pos+1) seq".
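
To make the difference concrete, here is a small sketch in plain Perl. The
helper names are made up for illustration, and I assume start_pos and end_pos
are 1-based and inclusive. Like the SQL substring(), Perl's substr() wants a
length as its last argument, not an end position (and substr() counts offsets
from 0).

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of BioSQL or bioperl.
# Coordinates are assumed to be 1-based and inclusive.

# Mirrors the buggy SQL: passes the end position where a length is expected.
sub subseq_buggy {
    my ($seq, $start, $end) = @_;
    return substr($seq, $start - 1, $end);
}

# Mirrors the corrected SQL: the length is end - start + 1.
sub subseq_fixed {
    my ($seq, $start, $end) = @_;
    return substr($seq, $start - 1, $end - $start + 1);
}

my $seq = "AACCGGTTAACC";
print subseq_buggy($seq, 3, 6), "\n";   # too long: 6 characters from position 3
print subseq_fixed($seq, 3, 6), "\n";   # the 4 characters at positions 3 to 6
```

This ignores the strand; for features on the minus strand the result would
still need to be reverse-complemented.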

If the average length of the gene sequences is 1-1.5k, retrieving every 1000 gene
sequences takes about 2-4 seconds on our server (Dell PowerEdge 2650 with
dual Xeon 2.6G, 512K). Is this fast enough for you guys?

Gang

