[Bioperl-l] Re: SeqWithQuality and biosql

Tue Jul 5 22:55:40 EDT 2005

The BioJava SymbolTokenizer can either tokenize to characters or Strings. 
Obviously not all alphabets can sensibly tokenize to characters (eg large 
compound alphabets). Currently by default it would tokenize a compound 
symbol to its compound names. For example the codon ACA would be 

(adenosine cytosine adenosine)

This is obviously not ideal for a database and it can easily be changed in
biojava without breaking things (to be honest, tokenization of compound
alphabets in biojava is not a common task at all). I would propose the
following for compound alphabets...

(aca)(gtc) for codon alphabets.
(g17)(t40) for quality type alphabets.

I like the use of brackets because it is possible in biojava to do
something like this ((DNAxDNAxDNA)xPROTEIN) which would represent an
alignment of codons with their amino acids or even
((DNAxDNA)x(DNAxDNAxDNA)), which I'm not sure you would ever use but there
might be a good reason for it. The brackets help to disambiguate better
than spaces would. For example

((ctc)S) for the first example or,
((atg)(gc)) for the second example.
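
A minimal sketch of how such a bracketed string could be written and split
back into top-level tokens. It is in Python for brevity and is not the
BioJava API; encode_bracketed and decode_bracketed are hypothetical names.

```python
def encode_bracketed(tokens):
    """Wrap each token in brackets: ['aca', 'gtc'] -> '(aca)(gtc)'."""
    return "".join("(" + tok + ")" for tok in tokens)


def decode_bracketed(text):
    """Split '(a)(b)...' into top-level tokens.

    Nested brackets are left untouched inside the token they belong to.
    """
    tokens = []
    depth = 0
    start = 0
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i + 1  # token begins after the opening bracket
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')' at position %d" % i)
            if depth == 0:
                tokens.append(text[start:i])
        elif depth == 0:
            # Every symbol must sit inside some bracket pair.
            raise ValueError("symbol outside brackets at position %d" % i)
    if depth != 0:
        raise ValueError("unbalanced '('")
    return tokens
```

A nested token such as (ctc)S comes back as one piece; how to split it
further depends on the component alphabets and is not shown here.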

To make this work there also needs to be a uniform way to store the
alphabet name in the sequence table. The above examples show how biojava
constructs alphabet names but there may be (probably are) better ways.

For quality information you could use (DNAxINTEGER); technically the
biojava name would be (DNAxSubIntegerAlphabet[0..99]). Of course you don't
have to use this convention and aliasing would be nice (eg the 'official'
name for INTEGER in BioJava would be 'Alphabet of all integers' which is a
bit long-winded!)

- Mark

Hilmar Lapp <hlapp at gnf.org>
07/06/2005 02:55 AM

        To:     "Marc Logghe" <Marc.Logghe at devgen.com>
        cc:     Mark Schreiber/GP/Novartis at PH, Bioperl <bioperl-l at bioperl.org>,
                OBDA BioSQL <biosql-l at open-bio.org>,
                Richard HOLLAND <hollandr at gis.a-star.edu.sg>
        Subject:        Re: SeqWithQuality and biosql

(I don't think posting to bioperl was a mistake, so I'm including it 
here again)

I think I like Mark's proposal best, i.e., the fundamental model of at 
most one sequence for each bioentry (e.g., Bio::SeqI object) is left 
intact, and the problem is reformulated as how to encode/decode 
sequences from alphabet cross-products as strings.

Encoding/decoding wouldn't be difficult to implement, even such that
the encoded string is still human-readable. Biojava has a natural
provision for doing this (SymbolTokenizer?), but Bioperl does not:
the Bioperl object model assumes that the sequence is a flat
string, and the alphabet is also a flat string, so there is no object you
could ask to provide you with an encoder/decoder appropriate for either
the alphabet or the type of sequence object.

I'd like to hear some feedback from the Bioperl folks as to whether 
you'd consider this capability a generally useful addition to Bioperl. 
(It could be designed in a number of ways ranging from more intrusive 
to completely neutral - e.g., adding this as a method to SeqI [like 
$seq->seq_encoder()], or making $seq->alphabet() return an object with 
this and other capabilities, or creating a separate factory class that 
would return the appropriate encoder known to [or registered with] it 
based on a given alphabet and type of sequence object.)
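
To make the last option concrete, here is a rough sketch of a separate
factory that hands out encoders by alphabet and sequence type. It is written
in Python only to keep it short; EncoderRegistry, IdentityEncoder and the
registration keys are all hypothetical and correspond to nothing that exists
in Bioperl.

```python
class IdentityEncoder:
    """Encoder for flat alphabets: the stored string is the sequence."""

    def encode(self, symbols):
        return "".join(symbols)

    def decode(self, text):
        return list(text)


class EncoderRegistry:
    """Maps (alphabet name, sequence type) to an encoder object."""

    def __init__(self):
        self._encoders = {}

    def register(self, alphabet, seq_type, encoder):
        self._encoders[(alphabet, seq_type)] = encoder

    def encoder_for(self, alphabet, seq_type):
        try:
            return self._encoders[(alphabet, seq_type)]
        except KeyError:
            raise LookupError(
                "no encoder registered for %s / %s" % (alphabet, seq_type)
            )


# Usage: the persistence layer asks the registry, not the sequence object.
registry = EncoderRegistry()
registry.register("dna", "Bio::PrimarySeq", IdentityEncoder())
encoder = registry.encoder_for("dna", "Bio::PrimarySeq")
stored = encoder.encode(["a", "c", "g", "t"])
```

This variant leaves SeqI and the alphabet untouched, which is why I'd call
it the neutral end of the range.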

As for Bio::Seq::MetaI, this could certainly be the interface for 
SeqWithQuality, but wouldn't solve the de/serialization problem. Also, 
at least conceptually MetaI-derived classes could represent 
multi-dimensional meta-information, right? That is, the problem of how 
to encode/decode the meta-information isn't trivial or restricted to 
two dimensions here either.

As for creating a specialized adaptor in Bioperl-db, that would 
certainly work too and would most likely be the fastest way to get 
something that works. However, long-term it would solve the problem 
only for SeqWithQuality and not for the more general problem of how to 
store sequences that are based on cross-product alphabets. BTW if you 
do implement a specialized adaptor, then instead of storing two 
bioentries and connecting them you might as well implement the sequence 
encoding/decoding for this particular object in the adaptor - you'd 
gain speed because instead of increasing the number of database 
operations you'd spend a couple more CPU cycles in Perl code, and you 
wouldn't be burdened with two bioentries that aren't coupled by a foreign
key constraint.

As for a consensus on how to encode sequence with quality values, I'd
include a delimiter between the alphabet operands in the cross-product,
e.g., using slash as the delimiter: 'A/22 T/30 A/32 G/35 C/35'.
This can be easily extended to multi-dimensional cross-products so long
as the delimiter between them isn't a symbol in any of the alphabets.
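
A minimal sketch of that slash-delimited form, in Python for brevity. The
function names encode_qual and decode_qual are made up for illustration, and
the space between symbols is simply taken from the example string above.

```python
def encode_qual(bases, quals, pair_sep="/", symbol_sep=" "):
    """Encode bases and integer qualities as 'A/22 T/30 ...'."""
    if len(bases) != len(quals):
        raise ValueError("one quality value is needed per base")
    return symbol_sep.join(
        "%s%s%d" % (base, pair_sep, qual) for base, qual in zip(bases, quals)
    )


def decode_qual(text, pair_sep="/", symbol_sep=" "):
    """Decode 'A/22 T/30 ...' back into (sequence string, quality list)."""
    if not text:
        return "", []
    bases = []
    quals = []
    for token in text.split(symbol_sep):
        # Each token must hold exactly one base and one quality value.
        base, qual = token.split(pair_sep)
        bases.append(base)
        quals.append(int(qual))
    return "".join(bases), quals


encoded = encode_qual("ATAGC", [22, 30, 32, 35, 35])
decoded = decode_qual(encoded)
```

The round trip only works because neither '/' nor the space is a symbol in
the DNA or integer alphabets.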

                 -hilmar

On Jul 5, 2005, at 12:39 AM, Marc Logghe wrote:

> Thanks for the feedback.
> Good to know I am not alone in this ;-)
> I totally agree with Mark that there should be a kind of consensus on
> how to store this in Bio*.
> Yesterday I mistakenly posted my original mail to the bioperl list.
> Heikki responded to that; it might be a good starting point but I am
> not familiar with it:
> http://portal.open-bio.org/pipermail/bioperl-l/2005-July/019271.html
> So much for the long-term solution.
> In the short term, to have at least something that works, I'll experiment a
> little with storing separate objects. I remember one of the
> presentations of Hilmar, where he gave the example of making an adaptor
> and storing 2 sequence objects that interacted with each other as a
> result of a Two Hybrid experiment in yeast.
> Cheers,
> Marc
>
>
>>
>> I'd think storing it in BioSQL as 2-byte pairs would be good.
>> First byte is the base (an ASCII character), second byte is
>> the quality (an 8-bit integer). Sure it wastes a few bits but
>> so does normal DNA...
>>
>>
>> Richard Holland
>> Bioinformatics Specialist
>> GIS extension 8199
>>
>>
>>> -----Original Message-----
>>> From: biosql-l-bounces at portal.open-bio.org
>>> [mailto:biosql-l-bounces at portal.open-bio.org] On Behalf Of
>>> mark.schreiber at novartis.com
>>> Sent: Tuesday, July 05, 2005 1:44 PM
>>> To: Marc Logghe
>>> Cc: biosql-l-bounces at portal.open-bio.org; biosql-l at open-bio.org
>>> Subject: Re: [BioSQL-l] FW: SeqWithQuality and biosql
>>>
>>>
>>> Hello -
>>>
>>> I was wondering about similar issues with biojava. As you may (or may
>>> not) know biojava can make sequences from symbols in any alphabet, two
>>> examples are DNA and the integer alphabet (a collection of Symbols
>>> that are integers). Biojava can also make compound alphabets, one such
>>> example is the Phred alphabet which is the multiplication of DNA x
>>> Integer (technically a subset of Integer from 0 to 99).
>>>
>>> Because sequence in BioSQL is stored in a CLOB if you can encode your
>>> SeqWithQuality as a String of characters you can store it. With the
>>> case above (which is probably similar to yours) you would need 400
>>> distinct characters (4 bases x 100 quality values) to store it which
>>> is too large for ASCII but could be done in Unicode. The downside is
>>> your persistence layer needs to know how to encode and decode your
>>> SeqWithQuality. I'm not familiar how BioPerl would do this. BioJava
>>> would need to implement a SymbolTokenizer for the alphabet and then
>>> persistence would happen automatically (assuming your DB is OK with
>>> Unicode). An alternative would be to make a tokenizer that uses more
>>> than single character tokens for encoding (eg A23 G40 T34 C22 etc).
>>>
>>> The alternative you suggest of storing two sequences with a
>>> relationship is also nice (because you can retrieve each part
>>> separately) but also requires your persistence layer to know about it.
>>> However, it has big disadvantages because they are not strongly tied
>>> to each other. If you manipulate one you might invalidate the other.
>>> Also if you delete one the other will probably not be deleted in a
>>> cascade.
>>>
>>> Not sure if any of this helps but a consensus on how to store this
>>> kind of information would be good so the bio* projects do it the same
>>> way.
>>> Consensus in this case will probably mean whatever the first
>>> implementation is.
>>>
>>> - Mark
>>>
>>>
>>>
>>>
>>>
>>> "Marc Logghe" <Marc.Logghe at devgen.com> Sent by:
>>> biosql-l-bounces at portal.open-bio.org
>>> 07/04/2005 05:56 PM
>>>
>>>
>>>         To:     <biosql-l at open-bio.org>
>>>         cc:     (bcc: Mark Schreiber/GP/Novartis)
>>>         Subject:        [BioSQL-l] FW: SeqWithQuality and biosql
>>>
>>>
>>> Apologies for cross posting, I had picked the wrong mail address :-(
>>>
>>> -----Original Message-----
>>> From: Marc Logghe
>>> Sent: Monday, July 04, 2005 11:43 AM
>>> To: bioperl-l at portal.open-bio.org
>>> Subject: SeqWithQuality and biosql
>>>
>>> Hi all,
>>> I am currently exploring the possibility to store a
>>> Bio::Seq::SeqWithQuality object in biosql.
>>> Has anyone ever tried this ?
>>> One possibility would be to
>>> 1) split up the Bio::Seq::SeqWithQuality object into a plain
>>> Bio::Seq::RichSeq and a Bio::Seq::PrimaryQual
>>> 2) store them separately in biosql; different namespaces
>>> 3) link them with a relation term.
>>> 4) make a custom adaptor to fetch the persistent objects from biosql
>>> and reconstruct the Bio::Seq::SeqWithQuality
>>>
>>> Does that make sense ? Any other suggestions/possibilities ?
>>> As a test I tried to load a Bio::Seq::PrimaryQual in biosql using the
>>> load_seqdatabase.pl but it fails because Bio::Seq::PrimaryQual does
>>> not have a namespace method.
>>> I hope I'm wrong but I have the impression there is a long way to go
>>> ;-)
>>>
>>> Marc
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> BioSQL-l mailing list
>>> BioSQL-l at open-bio.org
>>> http://open-bio.org/mailman/listinfo/biosql-l
>>>
>>>
>>
>
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------