From Marc.Logghe at devgen.com  Mon Jul  4 05:56:37 2005
From: Marc.Logghe at devgen.com (Marc Logghe)
Date: Mon Jul  4 05:47:37 2005
Subject: [BioSQL-l] FW: SeqWithQuality and biosql
Message-ID: <0C528E3670D8CE4B8E013F6749231AA62F53C8@ANTARESIA.be.devgen.com>

Apologies for cross-posting, I had picked the wrong mail address :-(

-----Original Message-----
From: Marc Logghe 
Sent: Monday, July 04, 2005 11:43 AM
To: bioperl-l@portal.open-bio.org
Subject: SeqWithQuality and biosql

Hi all,
I am currently exploring the possibility of storing a
Bio::Seq::SeqWithQuality object in biosql.
Has anyone ever tried this?
One possibility would be to
1) split up the Bio::Seq::SeqWithQuality object into a plain
Bio::Seq::RichSeq and a Bio::Seq::PrimaryQual
2) store them separately in biosql; different namespaces
3) link them with a relation term.
4) make a custom adaptor to fetch the persistent objects from biosql and
reconstruct the Bio::Seq::SeqWithQuality
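
The four steps above can be sketched roughly as follows. This is only an
illustration in Python, not BioPerl or bioperl-db code: the SeqWithQuality
class, the namespace names "reads_seq" and "reads_qual", and the relation
term "quality_of" are all hypothetical stand-ins, and plain dictionaries
stand in for the stored records.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SeqWithQuality:
    """Hypothetical stand-in for a sequence with per-base quality values."""
    display_id: str
    seq: str
    qual: List[int]


def split(swq):
    """Steps 1-3: build two records in different namespaces plus a relation."""
    seq_rec = {"namespace": "reads_seq", "id": swq.display_id, "data": swq.seq}
    qual_rec = {
        "namespace": "reads_qual",
        "id": swq.display_id,
        "data": " ".join(str(q) for q in swq.qual),
    }
    relation = {
        "subject": (qual_rec["namespace"], qual_rec["id"]),
        "object": (seq_rec["namespace"], seq_rec["id"]),
        "term": "quality_of",  # hypothetical relation term
    }
    return seq_rec, qual_rec, relation


def reconstruct(seq_rec, qual_rec):
    """Step 4: rebuild the combined object from the two fetched records."""
    qual = [int(q) for q in qual_rec["data"].split()]
    if len(qual) != len(seq_rec["data"]):
        raise ValueError("sequence and quality lengths differ")
    return SeqWithQuality(seq_rec["id"], seq_rec["data"], qual)
```

The length check in reconstruct() is the only thing tying the two records
together here, which is one reason the coupling between them matters.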

Does that make sense? Any other suggestions/possibilities?
As a test I tried to load a Bio::Seq::PrimaryQual into biosql using
load_seqdatabase.pl, but it fails because Bio::Seq::PrimaryQual does not
have a namespace method.
I hope I'm wrong but I have the impression there is a long way to go ;-)

Marc


From mark.schreiber at novartis.com  Tue Jul  5 01:44:10 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Tue Jul  5 01:35:16 2005
Subject: [BioSQL-l] FW: SeqWithQuality and biosql
Message-ID: <OFC101E129.DB1FE8C3-ON48257035.001E4939-48257035.001F8333@EU.novartis.net>

Hello -

I was wondering about similar issues with biojava. As you may (or may not) 
know biojava can make sequences from symbols in any alphabet, two examples 
are DNA and the integer alphabet (a collection of Symbols that are 
integers). Biojava can also make compound alphabets, one such example is 
the Phred alphabet which is the multiplication of DNA x Integer 
(technically a subset of Integer from 0 to 99).

Because sequence in BioSQL is stored in a CLOB, if you can encode your 
SeqWithQuality as a String of characters you can store it. With the case 
above (which is probably similar to yours) you would need 400 distinct 
characters (4 bases x 100 quality values) to encode it with one character 
per position, which is too many for ASCII but could be done in Unicode. The 
downside is that your persistence layer needs to know how to encode and 
decode your SeqWithQuality. I'm not familiar with how BioPerl would do 
this. BioJava would need to implement a SymbolTokenizer for the alphabet 
and then persistence would happen automatically (assuming your DB is OK 
with Unicode). An alternative would be to make a tokenizer that uses more 
than single character tokens for encoding (eg A23 G40 T34 C22 etc).
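
As a rough illustration of the one-character-per-position idea, the sketch
below maps each (base, quality) pair to a single Unicode code point. It is
written in Python rather than BioJava, and the function names, the choice of
the four unambiguous bases, and the starting offset 0x0100 are assumptions
made for the example, not anything BioJava or BioSQL defines.

```python
BASES = "ACGT"    # assumption: unambiguous DNA only
MAX_QUAL = 100    # quality values 0..99, as in the Phred alphabet above
OFFSET = 0x0100   # arbitrary start above the ASCII range (an assumption)


def encode_unicode(seq, quals):
    """Encode each (base, quality) pair as one Unicode character."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    chars = []
    for base, q in zip(seq.upper(), quals):
        if not 0 <= q < MAX_QUAL:
            raise ValueError("quality out of range: %r" % q)
        # str.index raises ValueError for a base outside BASES
        chars.append(chr(OFFSET + BASES.index(base) * MAX_QUAL + q))
    return "".join(chars)


def decode_unicode(text):
    """Invert encode_unicode, returning (sequence, list of qualities)."""
    seq, quals = [], []
    for ch in text:
        code = ord(ch) - OFFSET
        if not 0 <= code < len(BASES) * MAX_QUAL:
            raise ValueError("character outside the encoded range: %r" % ch)
        idx, q = divmod(code, MAX_QUAL)
        seq.append(BASES[idx])
        quals.append(q)
    return "".join(seq), quals
```

The stored string has the same length as the sequence, but it is no longer
readable by eye, which is the trade-off against multi-character tokens.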

The alternative you suggest of storing two sequences with a relationship 
is also nice (because you can retrieve each part separately) but also 
requires your persistence layer to know about it. However, it has big 
disadvantages because the two records are not strongly tied to each other. 
If you manipulate one you might invalidate the other. Also if you delete 
one the other will probably not be deleted in a cascade.
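
The cascade concern can be illustrated with a small SQLite sketch. The two
tables below are simplified stand-ins invented for this example and are not
the BioSQL schema; the point is only that a cascading foreign key on a link
table removes the link row, not the record on the other end of the link.

```python
import sqlite3


def orphan_demo():
    """Delete one of two linked records and report what is left behind."""
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
    con.executescript("""
        CREATE TABLE entry (id INTEGER PRIMARY KEY, name TEXT, data TEXT);
        CREATE TABLE relationship (
            subject_id INTEGER NOT NULL REFERENCES entry(id) ON DELETE CASCADE,
            object_id  INTEGER NOT NULL REFERENCES entry(id) ON DELETE CASCADE
        );
        INSERT INTO entry VALUES (1, 'read1.seq', 'ACGT');
        INSERT INTO entry VALUES (2, 'read1.qual', '22 30 32 35');
        INSERT INTO relationship VALUES (2, 1);
    """)
    # Remove the sequence record; the cascade removes the link row only.
    con.execute("DELETE FROM entry WHERE id = 1")
    entries = [row[0] for row in con.execute("SELECT name FROM entry")]
    links = con.execute("SELECT COUNT(*) FROM relationship").fetchone()[0]
    con.close()
    return entries, links
```

In this sketch the quality record survives with nothing pointing at it, so
cleaning it up would be left to the persistence layer.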

Not sure if any of this helps but a consensus on how to store this kind of 
information would be good so the bio* projects do it the same way. 
Consensus in this case will probably mean whatever the first 
implementation is.

- Mark


From hollandr at gis.a-star.edu.sg  Tue Jul  5 02:33:07 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Tue Jul  5 02:26:53 2005
Subject: [BioSQL-l] FW: SeqWithQuality and biosql
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D5601DCB1B5@BIONIC.biopolis.one-north.com>

I'd think storing it in BioSQL as 2-byte pairs would be good. First byte
is the base (an ASCII character), second byte is the quality (an 8-bit
integer). Sure it wastes a few bits but so does normal DNA...
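
A minimal sketch of the 2-byte-pair layout, written in Python with
hypothetical function names. One assumption worth flagging: the result is
raw bytes (a quality of 0 becomes a NUL byte), so a character column may not
accept it as-is and a binary column or some escaping might be needed.

```python
def pack_pairs(seq, quals):
    """Pack (base, quality) pairs as consecutive bytes: base, then quality."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    out = bytearray()
    for base, q in zip(seq, quals):
        if ord(base) > 127:
            raise ValueError("base is not an ASCII character: %r" % base)
        if not 0 <= q <= 255:
            raise ValueError("quality does not fit in one byte: %r" % q)
        out.append(ord(base))
        out.append(q)
    return bytes(out)


def unpack_pairs(data):
    """Invert pack_pairs, returning (sequence, list of qualities)."""
    if len(data) % 2 != 0:
        raise ValueError("expected an even number of bytes")
    seq = data[0::2].decode("ascii")  # every first byte is a base
    quals = list(data[1::2])          # every second byte is a quality
    return seq, quals
```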


Richard Holland
Bioinformatics Specialist
GIS extension 8199



From Marc.Logghe at devgen.com  Tue Jul  5 03:39:28 2005
From: Marc.Logghe at devgen.com (Marc Logghe)
Date: Tue Jul  5 03:32:24 2005
Subject: [BioSQL-l] FW: SeqWithQuality and biosql
Message-ID: <0C528E3670D8CE4B8E013F6749231AA62F53D1@ANTARESIA.be.devgen.com>

Thanks for the feedback.
Good to know I am not alone in this ;-)
I totally agree with Mark that there should be a kind of consensus on
how to store this in Bio*.
Yesterday I mistakenly posted my original mail to the bioperl list.
Heikki responded to that; it might be a good starting point but I am not
familiar with it:
http://portal.open-bio.org/pipermail/bioperl-l/2005-July/019271.html
So much for the long-term solution.
In the short term, to have at least something that works, I'll experiment a
little with storing separate objects. I remember one of Hilmar's
presentations, where he gave the example of making an adaptor
and storing 2 sequence objects that interacted with each other as a
result of a two-hybrid experiment in yeast.
Cheers,
Marc



From hlapp at gnf.org  Tue Jul  5 14:55:10 2005
From: hlapp at gnf.org (Hilmar Lapp)
Date: Tue Jul  5 14:43:37 2005
Subject: [BioSQL-l] Re: SeqWithQuality and biosql
In-Reply-To: <0C528E3670D8CE4B8E013F6749231AA62F53D1@ANTARESIA.be.devgen.com>
References: <0C528E3670D8CE4B8E013F6749231AA62F53D1@ANTARESIA.be.devgen.com>
Message-ID: <4672e7ad470df9973b998dd1188db923@gnf.org>

(I don't think posting to bioperl was a mistake, so I'm including it 
here again)

I think I like Mark's proposal best, i.e., the fundamental model of at 
most one sequence for each bioentry (e.g., Bio::SeqI object) is left 
intact, and the problem is reformulated as how to encode/decode 
sequences from alphabet cross-products as strings.

Encoding/decoding wouldn't be difficult to implement, even such that 
the encoded string is still human-readable. Biojava has a natural 
provision for doing this (SymbolTokenizer?), but Bioperl does not, 
i.e., in Bioperl the object model assumes that the sequence is a flat 
string, and the alphabet is also a flat string; there is no object you 
could ask to provide you with an encoder/decoder appropriate for either 
the alphabet or the type of sequence object.

I'd like to hear some feedback from the Bioperl folks as to whether 
you'd consider this capability a generally useful addition to Bioperl. 
(It could be designed in a number of ways ranging from more intrusive 
to completely neutral - e.g., adding this as a method to SeqI [like 
$seq->seq_encoder()], or making $seq->alphabet() return an object with 
this and other capabilities, or creating a separate factory class that 
would return the appropriate encoder known to [or registered with] it 
based on a given alphabet and type of sequence object.)
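
The least intrusive of these designs, a separate factory that hands out
encoders registered for an alphabet and sequence type, might look roughly
like the sketch below. It is in Python purely for illustration; none of the
names (register_encoder, get_encoder, PlainEncoder) exist in Bioperl, and
the alphabet and type keys are made-up strings.

```python
_ENCODERS = {}  # (alphabet, sequence type) -> encoder object


def register_encoder(alphabet, seq_type, encoder):
    """Make an encoder known to the factory for one alphabet/type pair."""
    _ENCODERS[(alphabet, seq_type)] = encoder


def get_encoder(alphabet, seq_type):
    """Return the registered encoder, or fail loudly if there is none."""
    try:
        return _ENCODERS[(alphabet, seq_type)]
    except KeyError:
        raise LookupError(
            "no encoder registered for %s/%s" % (alphabet, seq_type)
        ) from None


class PlainEncoder:
    """Identity encoder for sequences that already are flat strings."""

    def encode(self, seq):
        return seq

    def decode(self, text):
        return text


# Hypothetical registration for ordinary DNA sequences.
register_encoder("dna", "Seq", PlainEncoder())
```

A persistence layer would then ask get_encoder() before writing or after
reading the stored string, without the sequence classes having to change.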

As for Bio::Seq::MetaI, this could certainly be the interface for 
SeqWithQuality, but wouldn't solve the de/serialization problem. Also, 
at least conceptually MetaI-derived classes could represent 
multi-dimensional meta-information, right? That is, the problem of how 
to encode/decode the meta-information isn't trivial or restricted to 
two dimensions here either.

As for creating a specialized adaptor in Bioperl-db, that would 
certainly work too and would most likely be the fastest way to get 
something that works. However, long-term it would solve the problem 
only for SeqWithQuality and not for the more general problem of how to 
store sequences that are based on cross-product alphabets. BTW if you 
do implement a specialized adaptor, then instead of storing two 
bioentries and connecting them you might as well implement the sequence 
encoding/decoding for this particular object in the adaptor - you'd 
gain speed because instead of increasing the number of database 
operations you'd spend a couple more CPU cycles in Perl code, and you 
wouldn't be burdened with two bioentries that aren't coupled by a foreign 
key constraint.

As for consensus for how to encode sequence with quality values, I'd 
include a delimiter between the alphabet operands in the cross-product. 
I.e., using e.g. slash as the delimiter: 'A/22 T/30 A/32 G/35 C/35'. 
This can be easily extended to multi-dimensional cross-products so long 
as the delimiter between them isn't a symbol in any of the alphabets.
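
A small sketch of this delimited format, in Python with hypothetical
function names. It takes any number of equal-length columns (one per
alphabet in the cross-product) and rejects symbols that contain the
delimiter or a space, which is the condition stated above.

```python
def encode_delimited(columns, delimiter="/"):
    """Join one symbol from each column per position, e.g. 'A/22 T/30'."""
    if len({len(col) for col in columns}) > 1:
        raise ValueError("all columns must have the same length")
    tokens = []
    for parts in zip(*columns):
        parts = [str(p) for p in parts]
        for p in parts:
            if not p or delimiter in p or " " in p:
                raise ValueError("symbol cannot be encoded: %r" % p)
        tokens.append(delimiter.join(parts))
    return " ".join(tokens)


def decode_delimited(text, delimiter="/"):
    """Split the encoded string back into columns of string symbols."""
    rows = [token.split(delimiter) for token in text.split()]
    return [list(col) for col in zip(*rows)]


encoded = encode_delimited(["ATAGC", [22, 30, 32, 35, 35]])
# encoded == 'A/22 T/30 A/32 G/35 C/35'
```

Decoding returns every symbol as a string, so turning the quality column
back into integers is left to the caller, who knows which alphabet each
column belongs to.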

	-hilmar


-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From mark.schreiber at novartis.com  Tue Jul  5 22:55:40 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Tue Jul  5 22:46:54 2005
Subject: [BioSQL-l] Re: SeqWithQuality and biosql
Message-ID: <OFAA2A729F.FA88750C-ON48257036.000C8955-48257036.00101614@EU.novartis.net>

The BioJava SymbolTokenizer can either tokenize to characters or Strings. 
Obviously not all alphabets can sensibly tokenize to characters (eg large 
compound alphabets). Currently by default it would tokenize a compound 
symbol to its compound names. For example the codon ACA would be 

(adenosine cytosine adenosine)

This is obviously not ideal for a database and it can easily be changed in 
biojava without breaking things (to be honest, tokenization of compound 
alphas in biojava is not a common task at all). I would propose the 
following for compound alphabets...

(aca)(gtc) for codon alphabets.
(g17)(t40) for quality type alphabets.
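
For the quality case, the proposed form could be produced and read back
along these lines. This is a Python sketch with hypothetical function
names, not BioJava's SymbolTokenizer, and it assumes single-letter
lower-case bases followed by a non-negative integer.

```python
import re

_TOKEN = re.compile(r"\(([a-z])(\d+)\)")  # one bracketed base+quality pair


def encode_bracketed(seq, quals):
    """Write each (base, quality) pair as '(g17)' and concatenate them."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    return "".join("(%s%d)" % (b.lower(), q) for b, q in zip(seq, quals))


def decode_bracketed(text):
    """Read back a string of '(g17)(t40)' tokens, rejecting anything else."""
    seq, quals = [], []
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError("unexpected text at position %d" % pos)
        seq.append(match.group(1))
        quals.append(int(match.group(2)))
        pos = match.end()
    return "".join(seq), quals
```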

I like the use of brackets because it is possible in biojava to do 
something like this ((DNAxDNAxDNA)xPROTEIN) which would represent an 
alignment of codons with their amino acids or even 
((DNAxDNA)x(DNAxDNAxDNA)), which I'm not sure you would ever use but there 
might be a good reason for it. The brackets help to disambiguate better 
than spaces would. For example

((ctc)S) for the first example or,
((atg)(gc)) for the second example.
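
To show that the nested form can be read back unambiguously, here is a
small recursive parser sketch in Python. The function name and the
nested-list result are assumptions for illustration; it treats each letter
as one symbol and each run of digits as one integer symbol.

```python
def parse_compound(text):
    """Parse bracketed compound symbols into nested Python lists."""
    pos = 0

    def group():
        nonlocal pos
        if pos >= len(text) or text[pos] != "(":
            raise ValueError("expected '(' at position %d" % pos)
        pos += 1
        items = []
        while pos < len(text) and text[pos] != ")":
            ch = text[pos]
            if ch == "(":
                items.append(group())          # nested compound symbol
            elif "0" <= ch <= "9":
                start = pos
                while pos < len(text) and "0" <= text[pos] <= "9":
                    pos += 1
                items.append(int(text[start:pos]))  # integer symbol
            elif ch.isalpha():
                items.append(ch)               # single-letter symbol
                pos += 1
            else:
                raise ValueError("unexpected %r at position %d" % (ch, pos))
        if pos >= len(text):
            raise ValueError("unbalanced brackets")
        pos += 1  # consume ')'
        return items

    symbols = []
    while pos < len(text):
        symbols.append(group())
    return symbols
```

With this reading, "((ctc)S)" yields one symbol made of a codon and an
amino acid, while "(g17)(t40)" yields two base and quality pairs.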

To make this work there also needs to be a uniform way to store the 
alphabet name in the sequence table. The above examples show how biojava 
constructs alphabet names but there may be (probably are) better ways.

For quality information you could use (DNAxINTEGER); technically the 
biojava name would be (DNAxSubIntegerAlphabet[0..99]). Of course you don't 
have to use this convention and aliasing would be nice (eg the 'official' 
name for INTEGER in BioJava would be 'Alphabet of all integers' which is a 
bit long-winded!)

- Mark


Hilmar Lapp <hlapp@gnf.org>
07/06/2005 02:55 AM

 
        To:     "Marc Logghe" <Marc.Logghe@devgen.com>
        cc:     Mark Schreiber/GP/Novartis@PH, Bioperl <bioperl-l@bioperl.org>, OBDA 
BioSQL <biosql-l@open-bio.org>, Richard HOLLAND 
<hollandr@gis.a-star.edu.sg>
        Subject:        Re: SeqWithQuality and biosql


(I don't think posting to bioperl was a mistake, so I'm including it 
here again)

I think I like Mark's proposal best, i.e., the fundamental model of at 
most one sequence for each bioentry (e.g., Bio::SeqI object) is left 
intact, and the problem is reformulated as how to encode/decode 
sequences from alphabet cross-products as strings.

Encoding/decoding wouldn't be difficult to implement, even such that 
the encoded string is still humanly readable. Biojava has a natural 
provision for doing this (SymbolTokenizer?), but Bioperl does not, 
i.e., in Bioperl the object model assumes that the sequence is a flat 
string, and the alphabet is also a flat string; there is no object you 
could ask to provide you with an encoder/decoder appropriate for either 
the alphabet or the type of sequence object.

I'd like to hear some feedback from the Bioperl folks as to whether 
you'd consider this capability a generally useful addition to Bioperl. 
(It could be designed in a number of ways ranging from more intrusive 
to completely neutral - e.g., adding this as a method to SeqI [like 
$seq->seq_encoder()], or making $seq->alphabet() return an object with 
this and other capabilities, or creating a separate factory class that 
would return the appropriate encoder known to [or registered with] it 
based on a given alphabet and type of sequence object.)

As for Bio::Seq::MetaI, this could certainly be the interface for 
SeqWithQuality, but wouldn't solve the de/serialization problem. Also, 
at least conceptually MetaI-derived classes could represent 
multi-dimensional meta-information, right? That is, the problem of how 
to encode/decode the meta-information isn't trivial or restricted to 
two dimensions here either.

As for creating a specialized adaptor in Bioperl-db, that would 
certainly work too and would most likely be the fastest way to get 
something that works. However, long-term it would solve the problem 
only for SeqWithQuality and not for the more general problem of how to 
store sequences that are based on cross-product alphabets. BTW if you 
do implement a specialized adaptor, then instead of storing two 
bioentries and connecting them you might as well implement the sequence 
encoding/decoding for this particular object in the adaptor - you'd 
gain speed because instead of increasing the number of database 
operations you'd spend a couple more CPU cycles in Perl code, and you 
wouldn't be burdened with two bioentries that aren't coupled by foreign 
key constraint.

As for consensus for how to encode sequence with quality values, I'd 
include a delimiter between the alphabet operands in the cross-product. 
I.e., using e.g. slash as the delimiter: 'A/22 T/30 A/32 G/35 C/35'. 
This can be easily extended to multi-dimensional cross-products so long 
as the delimiter between them isn't a symbol in any of the alphabets.

                 -hilmar


On Jul 5, 2005, at 12:39 AM, Marc Logghe wrote:

> Thanks for the feedback.
> Good to know I am not alone in this ;-)
> I totally agree with Mark that there should be a kind of consensus on
> how to store this in Bio*.
> Yesterday I mistakenly posted my original mail to the bioperl list.
> Heikki responded to that; it might be a good starting point but I am 
> not
> familiar with it:
> http://portal.open-bio.org/pipermail/bioperl-l/2005-July/019271.html
> So far the long term solustion.
> In short term, to have at least something that works, I'll experiment a
> little with storing separate objects. I remember one of the
> presentations of Hilmar, where he gave the example of making an adaptor
> and storing 2 sequence objects that interacted with each other as a
> result of a Two Hybrid experiment in yeast.
> Cheers,
> Marc
>
>
>>
>> I'd think storing it in BioSQL as 2-byte pairs would be good.
>> First byte is the base (an ASCII character), second byte is
>> the quality (an 8-bit integer). Sure it wastes a few bits but
>> so does normal DNA...
>>
>>
>> Richard Holland
>> Bioinformatics Specialist
>> GIS extension 8199
>> ---------------------------------------------
>> This email is confidential and may be privileged. If you are
>> not the intended recipient, please delete it and notify us
>> immediately. Please do not copy or use it for any purpose, or
>> disclose its content to any other person. Thank you.
>> ---------------------------------------------
>>
>>
>>> -----Original Message-----
>>> From: biosql-l-bounces@portal.open-bio.org
>>> [mailto:biosql-l-bounces@portal.open-bio.org] On Behalf Of
>>> mark.schreiber@novartis.com
>>> Sent: Tuesday, July 05, 2005 1:44 PM
>>> To: Marc Logghe
>>> Cc: biosql-l-bounces@portal.open-bio.org; biosql-l@open-bio.org
>>> Subject: Re: [BioSQL-l] FW: SeqWithQuality and biosql
>>>
>>>
>>> Hello -
>>>
>>> I was wondering about similar issues with biojava. As you may (or may
>>> not) know biojava can make sequences from symbols in any alphabet, two
>>> examples are DNA and the integer alphabet (a collection of Symbols
>>> that are integers). Biojava can also make compound alphabets, one such
>>> example is the Phred alphabet which is the multiplication of DNA x
>>> Integer (technically a subset of Integer from 0 to 99).
>>>
>>> Because sequence in BioSQL is stored in a CLOB, if you can encode your
>>> SeqWithQuality as a String of characters you can store it.
>>> With the case above (which is probably similar to yours) you would
>>> need 400 characters to store it, which is too large for ASCII but could
>>> be done in Unicode. The downside is your persistence layer needs to
>>> know how to encode and decode your SeqWithQuality. I'm not familiar
>>> with how BioPerl would do this. BioJava would need to implement a
>>> SymbolTokenizer for the alphabet and then persistence would happen
>>> automatically (assuming your DB is OK with Unicode). An alternative
>>> would be to make a tokenizer that uses more than single character
>>> tokens for encoding (eg A23 G40 T34 C22 etc).
>>>
>>> The alternative you suggest of storing two sequences with a
>>> relationship is also nice (because you can retrieve each part
>>> separately) but also requires your persistence layer to know about it.
>>> However, it has big disadvantages because they are not strongly tied
>>> to each other. If you manipulate one you might invalidate the other.
>>> Also if you delete one the other will probably not be deleted in a
>>> cascade.
>>>
>>> Not sure if any of this helps but a consensus on how to store this
>>> kind of information would be good so the bio* projects do it the same
>>> way.
>>> Consensus in this case will probably mean whatever the first
>>> implementation is.
>>>
>>> - Mark
>>>
>>>
>>>
>>>
>>>
>>> "Marc Logghe" <Marc.Logghe@devgen.com> Sent by:
>>> biosql-l-bounces@portal.open-bio.org
>>> 07/04/2005 05:56 PM
>>>
>>>
>>>         To:     <biosql-l@open-bio.org>
>>>         cc:     (bcc: Mark Schreiber/GP/Novartis)
>>>         Subject:        [BioSQL-l] FW: SeqWithQuality and biosql
>>>
>>>
>>> Apologies for cross posting, I had picked the wrong mail address :-(
>>>
>>> -----Original Message-----
>>> From: Marc Logghe
>>> Sent: Monday, July 04, 2005 11:43 AM
>>> To: bioperl-l@portal.open-bio.org
>>> Subject: SeqWithQuality and biosql
>>>
>>> Hi all,
>>> I am currently exploring the possibility to store a
>>> Bio::Seq::SeqWithQuality object in biosql.
>>> Has anyone ever tried this ?
>>> One possibility would be to
>>> 1) split up the Bio::Seq::SeqWithQuality object into a plain
>>> Bio::Seq::RichSeq and a Bio::Seq::PrimaryQual
>>> 2) store them separately in biosql; different namespaces
>>> 3) link them with a relation term.
>>> 4) make a custom adaptor to fetch the persistent objects from biosql
>>> and reconstruct the Bio::Seq::SeqWithQuality
>>>
>>> Does that make sense ? Any other suggestions/possibilities ?
>>> As a test I tried to load a Bio::Seq::PrimaryQual in biosql using the
>>> load_seqdatabase.pl but it fails because Bio::Seq::PrimaryQual does
>>> not have a namespace method.
>>> I hope I'm wrong but I have the impression there is a long way to go
>>> ;-)
>>>
>>> Marc
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> BioSQL-l mailing list
>>> BioSQL-l@open-bio.org
>>> http://open-bio.org/mailman/listinfo/biosql-l
>>
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------


From hollandr at gis.a-star.edu.sg  Tue Jul  5 23:38:51 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Tue Jul  5 23:30:57 2005
Subject: [BioSQL-l] RE: SeqWithQuality and biosql
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D5601DCB226@BIONIC.biopolis.one-north.com>

Good point.

To correctly represent compound alphabets in a consistent manner would
require extra tables in BioSQL (version 1.1?). Some kind of alphabet
table with a name and a related table with alphabet ids and ranks to
construct cross products etc.

Why not store the delimiter as an attribute of the alphabet in this
table? That way we can use whatever delimiters we like. I don't think
grouping is necessary - after all we know from the alphabet definition
that there are a fixed number of tokens per symbol and what order they
come in, so we just read the first three tokens to build the first
symbol, and so on.
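
A minimal Python sketch of the decoding described above, assuming a
hypothetical CompoundAlphabet record that stands in for a row of the
proposed alphabet table (the class, its field names and the space
delimiter are illustrative only and are not part of BioSQL). It splits
the stored string on the alphabet's delimiter and groups a fixed number
of tokens into each symbol, without any grouping characters.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class CompoundAlphabet:
    """Hypothetical stand-in for a row in the proposed alphabet table."""
    name: str
    delimiter: str          # stored as an attribute of the alphabet
    tokens_per_symbol: int  # number of component alphabets in the product


def decode(alphabet: CompoundAlphabet, encoded: str) -> List[Tuple[str, ...]]:
    """Split on the delimiter, then group a fixed number of tokens per symbol."""
    tokens = encoded.split(alphabet.delimiter) if encoded else []
    n = alphabet.tokens_per_symbol
    if len(tokens) % n != 0:
        raise ValueError("token count is not a multiple of tokens_per_symbol")
    return [tuple(tokens[i:i + n]) for i in range(0, len(tokens), n)]


def encode(alphabet: CompoundAlphabet,
           symbols: Iterable[Tuple[str, ...]]) -> str:
    """Inverse of decode: flatten the symbols and join with the delimiter."""
    return alphabet.delimiter.join(tok for sym in symbols for tok in sym)


# Assumed example alphabets: base plus quality, and a three-base codon.
phred = CompoundAlphabet(name="DNA x Integer", delimiter=" ", tokens_per_symbol=2)
codon = CompoundAlphabet(name="DNA x DNA x DNA", delimiter=" ", tokens_per_symbol=3)

print(decode(phred, "g 17 t 40"))    # [('g', '17'), ('t', '40')]
print(decode(codon, "a c a g t c"))  # [('a', 'c', 'a'), ('g', 't', 'c')]
```

The cost of this layout is one delimiter per token, but the reader only
needs the alphabet definition to know where each symbol ends.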


Richard Holland
Bioinformatics Specialist
GIS extension 8199
---------------------------------------------
This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose, or disclose its content to any
other person. Thank you.
---------------------------------------------


> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp@gnf.org] 
> Sent: Wednesday, July 06, 2005 11:30 AM
> To: mark.schreiber@novartis.com
> Cc: Bioperl; Richard HOLLAND
> Subject: Re: SeqWithQuality and biosql
> 
> 
> 
> On Jul 5, 2005, at 7:55 PM, mark.schreiber@novartis.com wrote:
> 
> > I would propose the
> > following for compound alphabets...
> >
> > (aca)(gtc) for codon alphabets.
> > (g17)(t40) for quality type alphabets.
> 
> In your convention wouldn't this need to be
> (g(17))(t(40))
> 
> Otherwise you'd have trouble representing higher-dimensional 
> cross-products unless you alternate chars and digits which would be a 
> useless restriction.
> 
> 	-hilmar
> -- 
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp at gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------
> 
> 

From mark.schreiber at novartis.com  Wed Jul  6 01:37:21 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Wed Jul  6 01:28:31 2005
Subject: [BioSQL-l] RE: SeqWithQuality and biosql
Message-ID: <OF4F0555F1.941C3730-ON48257036.001EA669-48257036.001EE388@EU.novartis.net>

Actually under my proposal

(a(17)) would imply (DNAx(SubInteger[0..9]xSubInteger[0..9]))

From hlapp at gnf.org  Wed Jul  6 12:30:06 2005
From: hlapp at gnf.org (Hilmar Lapp)
Date: Wed Jul  6 12:21:16 2005
Subject: [BioSQL-l] Re: [Bioperl-l] RE: SeqWithQuality and biosql
In-Reply-To: <OF4F0555F1.941C3730-ON48257036.001EA669-48257036.001EE388@EU.novartis.net>
References: <OF4F0555F1.941C3730-ON48257036.001EA669-48257036.001EE388@EU.novartis.net>
Message-ID: <aa7f992af71e51f4ecce18469d13370c@gnf.org>


On Jul 5, 2005, at 10:37 PM, mark.schreiber@novartis.com wrote:

> Actually under my proposal
>
> (a(17)) would imply (DNAx(SubInteger[0..9]xSubInteger[0..9]))
>

That's why I didn't like it - how would you encode 
(DNAx(SubInteger[0..99]xSubInteger[0..99])) in this proposal? Require 
each component to be two-digit? There ought to be delimiters between 
the operands, no?
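
The ambiguity can be illustrated with a short Python sketch. The helper
name possible_pairs is hypothetical; it simply enumerates every way an
undelimited run of digits could be read as two integers in a given range.
With single digits (0..9) there is only one reading, but with 0..99 the
same string can decode to more than one pair unless a delimiter or a
fixed width is imposed.

```python
from typing import List, Tuple


def possible_pairs(digits: str, upper: int = 99) -> List[Tuple[int, int]]:
    """All ways to read an undelimited digit run as two integers in 0..upper."""
    pairs = []
    for cut in range(1, len(digits)):
        left, right = digits[:cut], digits[cut:]
        # Skip zero-padded forms such as "07" so each value has one spelling.
        if (len(left) > 1 and left[0] == "0") or (len(right) > 1 and right[0] == "0"):
            continue
        a, b = int(left), int(right)
        if a <= upper and b <= upper:
            pairs.append((a, b))
    return pairs


print(possible_pairs("17", upper=9))  # [(1, 7)]
print(possible_pairs("174"))          # [(1, 74), (17, 4)]

# With an explicit delimiter (a comma is assumed here) there is one reading.
print(tuple(int(x) for x in "17,4".split(",")))  # (17, 4)
```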

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------


From mark.schreiber at novartis.com  Wed Jul  6 20:59:13 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Wed Jul  6 20:50:16 2005
Subject: [BioSQL-l] Re: [Bioperl-l] RE: SeqWithQuality and biosql
Message-ID: <OF819DE46D.80C07DD4-ON48257037.00052036-48257037.00056CAD@EU.novartis.net>

Good point. I would prefer a system that only uses delimiters for 
ambiguous cases like the one you show, but I guess that's pretty complex, 
so maybe delimiters for every sub-alphabet.
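
As a rough sketch of what reading such a fully delimited string could
look like, the Python below parses nested parentheses into nested lists.
The function name parse_groups and the exact bracket convention are
assumptions for illustration; the thread has not settled on a syntax.

```python
from typing import List


def parse_groups(text: str) -> List:
    """Parse a fully parenthesised string into nested lists of tokens."""
    root: List = []
    stack = [root]
    literal = ""
    for ch in text:
        if ch == "(":
            if literal:                 # flush text seen before this group
                stack[-1].append(literal)
                literal = ""
            group: List = []
            stack[-1].append(group)     # attach the new group to its parent
            stack.append(group)
        elif ch == ")":
            if literal:
                stack[-1].append(literal)
                literal = ""
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            stack.pop()
        else:
            literal += ch
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    if literal:
        root.append(literal)
    return root


print(parse_groups("(g(17))(t(40))"))  # [['g', ['17']], ['t', ['40']]]
print(parse_groups("(a(17)(40))"))     # [['a', ['17'], ['40']]]
```

Because every sub-alphabet carries its own brackets, (a(17)(40)) and
(a(1)(7)) stay distinct without any rule about digit widths.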

- Mark

