From Marc.Logghe at devgen.com Mon Jul 4 05:56:37 2005
From: Marc.Logghe at devgen.com (Marc Logghe)
Date: Mon Jul 4 05:47:37 2005
Subject: [BioSQL-l] FW: SeqWithQuality and biosql
Message-ID: <0C528E3670D8CE4B8E013F6749231AA62F53C8@ANTARESIA.be.devgen.com>

Apologies for cross posting, I had picked the wrong mail address :-(

-----Original Message-----
From: Marc Logghe
Sent: Monday, July 04, 2005 11:43 AM
To: bioperl-l@portal.open-bio.org
Subject: SeqWithQuality and biosql

Hi all,
I am currently exploring the possibility of storing a Bio::Seq::SeqWithQuality object in biosql. Has anyone ever tried this?
One possibility would be to
1) split up the Bio::Seq::SeqWithQuality object into a plain Bio::Seq::RichSeq and a Bio::Seq::PrimaryQual
2) store them separately in biosql, in different namespaces
3) link them with a relation term
4) make a custom adaptor to fetch the persistent objects from biosql and reconstruct the Bio::Seq::SeqWithQuality

Does that make sense? Any other suggestions/possibilities?
As a test I tried to load a Bio::Seq::PrimaryQual into biosql using load_seqdatabase.pl, but it fails because Bio::Seq::PrimaryQual does not have a namespace method.
I hope I'm wrong but I have the impression there is a long way to go ;-)

Marc

From mark.schreiber at novartis.com Tue Jul 5 01:44:10 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Tue Jul 5 01:35:16 2005
Subject: [BioSQL-l] FW: SeqWithQuality and biosql
Message-ID:

Hello -

I was wondering about similar issues with biojava. As you may (or may not) know, biojava can make sequences from symbols in any alphabet; two examples are DNA and the integer alphabet (a collection of Symbols that are integers). Biojava can also make compound alphabets; one such example is the Phred alphabet, which is the cross product DNA x Integer (technically a subset of Integer from 0 to 99).
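As a rough illustration of the compound alphabet described above, the sketch below enumerates a Phred-style alphabet as the cross product of four DNA symbols and the integers 0 to 99. It is plain Python rather than BioJava, and the names DNA, QUALITIES, PHRED and INDEX are made up for this example.

```python
from itertools import product

# Assumption: an unambiguous four-letter DNA alphabet and quality scores 0..99.
DNA = "acgt"
QUALITIES = range(100)

# Every compound symbol is a (base, quality) pair.
PHRED = list(product(DNA, QUALITIES))

# Position of each compound symbol within the alphabet; any scheme that
# stores one token per symbol needs a mapping of this kind.
INDEX = {symbol: i for i, symbol in enumerate(PHRED)}

print(len(PHRED))         # 400 compound symbols
print(INDEX[("a", 0)])    # 0
print(INDEX[("t", 99)])   # 399
```

The alphabet has 4 x 100 = 400 symbols, more than the 128 code points of ASCII, so it cannot be written with one ASCII character per symbol.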

Because sequence in BioSQL is stored in a CLOB, if you can encode your SeqWithQuality as a String of characters you can store it. With the case above (which is probably similar to yours) you would need 400 distinct characters (4 bases x 100 quality values) to store it, which is too many for ASCII but could be done in Unicode. The downside is that your persistence layer needs to know how to encode and decode your SeqWithQuality. I'm not familiar with how BioPerl would do this. BioJava would need to implement a SymbolTokenizer for the alphabet and then persistence would happen automatically (assuming your DB is OK with Unicode). An alternative would be to make a tokenizer that uses more than single character tokens for encoding (eg A23 G40 T34 C22 etc).

The alternative you suggest of storing two sequences with a relationship is also nice (because you can retrieve each part separately) but also requires your persistence layer to know about it. However, it has big disadvantages because the two sequences are not strongly tied to each other. If you manipulate one you might invalidate the other. Also, if you delete one the other will probably not be deleted in a cascade.

Not sure if any of this helps but a consensus on how to store this kind of information would be good so the bio* projects do it the same way. Consensus in this case will probably mean whatever the first implementation is.

- Mark

_______________________________________________
BioSQL-l mailing list
BioSQL-l@open-bio.org
http://open-bio.org/mailman/listinfo/biosql-l

From hollandr at gis.a-star.edu.sg Tue Jul 5 02:33:07 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Tue Jul 5 02:26:53 2005
Subject: [BioSQL-l] FW: SeqWithQuality and biosql
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D5601DCB1B5@BIONIC.biopolis.one-north.com>

I'd think storing it in BioSQL as 2-byte pairs would be good. The first byte is the base (an ASCII character), the second byte is the quality (an 8-bit integer). Sure, it wastes a few bits, but so does normal DNA...

Richard Holland
Bioinformatics Specialist
GIS extension 8199
---------------------------------------------
This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its content to any other person. Thank you.
---------------------------------------------
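
The 2-byte pair layout suggested above could be sketched as follows. This is only an illustration in plain Python; encode_pairs and decode_pairs are hypothetical helper names, not part of BioSQL, BioPerl or BioJava.

```python
def encode_pairs(bases, qualities):
    """Interleave one ASCII base byte and one quality byte per position."""
    if len(bases) != len(qualities):
        raise ValueError("need exactly one quality value per base")
    out = bytearray()
    for base, qual in zip(bases, qualities):
        if ord(base) > 127:
            raise ValueError("base must be an ASCII character")
        if not 0 <= qual <= 255:
            raise ValueError("quality must fit in one byte")
        out.append(ord(base))  # first byte: the base
        out.append(qual)       # second byte: the quality as an 8-bit integer
    return bytes(out)


def decode_pairs(data):
    """Split the interleaved bytes back into a base string and a quality list."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    return data[0::2].decode("ascii"), list(data[1::2])


packed = encode_pairs("ATAGC", [22, 30, 32, 35, 35])
print(len(packed))           # 10 bytes for 5 bases
print(decode_pairs(packed))  # ('ATAGC', [22, 30, 32, 35, 35])
```

Note that the result is binary data: quality bytes can fall outside the printable range, so whether it can be stored unchanged in a character column depends on the database and its encoding.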

From Marc.Logghe at devgen.com Tue Jul 5 03:39:28 2005
From: Marc.Logghe at devgen.com (Marc Logghe)
Date: Tue Jul 5 03:32:24 2005
Subject: [BioSQL-l] FW: SeqWithQuality and biosql
Message-ID: <0C528E3670D8CE4B8E013F6749231AA62F53D1@ANTARESIA.be.devgen.com>

Thanks for the feedback.
Good to know I am not alone in this ;-)
I totally agree with Mark that there should be a kind of consensus on how to store this in Bio*.
Yesterday I mistakenly posted my original mail to the bioperl list. Heikki responded to that; his suggestion might be a good starting point but I am not familiar with it:
http://portal.open-bio.org/pipermail/bioperl-l/2005-July/019271.html
So much for the long-term solution.
In the short term, to have at least something that works, I'll experiment a little with storing separate objects. I remember one of Hilmar's presentations, where he gave the example of making an adaptor and storing 2 sequence objects that interacted with each other as a result of a Two Hybrid experiment in yeast.
Cheers,
Marc

From hlapp at gnf.org Tue Jul 5 14:55:10 2005
From: hlapp at gnf.org (Hilmar Lapp)
Date: Tue Jul 5 14:43:37 2005
Subject: [BioSQL-l] Re: SeqWithQuality and biosql
In-Reply-To: <0C528E3670D8CE4B8E013F6749231AA62F53D1@ANTARESIA.be.devgen.com>
References: <0C528E3670D8CE4B8E013F6749231AA62F53D1@ANTARESIA.be.devgen.com>
Message-ID: <4672e7ad470df9973b998dd1188db923@gnf.org>

(I don't think posting to bioperl was a mistake, so I'm including it here again)

I think I like Mark's proposal best, i.e., the fundamental model of at most one sequence for each bioentry (e.g., Bio::SeqI object) is left intact, and the problem is reformulated as how to encode/decode sequences from alphabet cross-products as strings. Encoding/decoding wouldn't be difficult to implement, even such that the encoded string is still human-readable. Biojava has a natural provision for doing this (SymbolTokenizer?), but Bioperl does not, i.e., in Bioperl the object model assumes that the sequence is a flat string, and the alphabet is also a flat string; there is no object you could ask to provide you with an encoder/decoder appropriate for either the alphabet or the type of sequence object.

I'd like to hear some feedback from the Bioperl folks as to whether you'd consider this capability a generally useful addition to Bioperl. (It could be designed in a number of ways ranging from more intrusive to completely neutral - e.g., adding this as a method to SeqI [like $seq->seq_encoder()], or making $seq->alphabet() return an object with this and other capabilities, or creating a separate factory class that would return the appropriate encoder known to [or registered with] it based on a given alphabet and type of sequence object.)

As for Bio::Seq::MetaI, this could certainly be the interface for SeqWithQuality, but wouldn't solve the de/serialization problem. Also, at least conceptually MetaI-derived classes could represent multi-dimensional meta-information, right? That is, the problem of how to encode/decode the meta-information isn't trivial or restricted to two dimensions here either.

As for creating a specialized adaptor in Bioperl-db, that would certainly work too and would most likely be the fastest way to get something that works. However, long-term it would solve the problem only for SeqWithQuality and not for the more general problem of how to store sequences that are based on cross-product alphabets.

BTW if you do implement a specialized adaptor, then instead of storing two bioentries and connecting them you might as well implement the sequence encoding/decoding for this particular object in the adaptor - you'd gain speed because instead of increasing the number of database operations you'd spend a couple more CPU cycles in Perl code, and you wouldn't be burdened with two bioentries that aren't coupled by a foreign key constraint.

As for consensus on how to encode sequence with quality values, I'd include a delimiter between the alphabet operands in the cross-product. I.e., using e.g. slash as the delimiter: 'A/22 T/30 A/32 G/35 C/35'. This can be easily extended to multi-dimensional cross-products so long as the delimiter between them isn't a symbol in any of the alphabets.

-hilmar
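
A minimal sketch of the slash-delimited encoding proposed above, assuming one base character and one decimal quality value per position. The function names encode_slash and decode_slash, and the delim and sep parameters, are invented for illustration and do not exist in Bioperl or Bioperl-db.

```python
def encode_slash(bases, qualities, delim="/", sep=" "):
    """Join each base with its quality value, e.g. 'A/22 T/30'."""
    if len(bases) != len(qualities):
        raise ValueError("need exactly one quality value per base")
    return sep.join(f"{b}{delim}{q}" for b, q in zip(bases, qualities))


def decode_slash(text, delim="/", sep=" "):
    """Recover the base string and the list of quality values."""
    if not text:
        return "", []
    bases, qualities = [], []
    for token in text.split(sep):
        base, qual = token.split(delim)
        bases.append(base)
        qualities.append(int(qual))
    return "".join(bases), qualities


encoded = encode_slash("ATAGC", [22, 30, 32, 35, 35])
print(encoded)                # A/22 T/30 A/32 G/35 C/35
print(decode_slash(encoded))  # ('ATAGC', [22, 30, 32, 35, 35])
```

The encoded string stays readable and round-trips as long as neither the delimiter nor the separator is itself a symbol of one of the alphabets.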
--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From mark.schreiber at novartis.com Tue Jul 5 22:55:40 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Tue Jul 5 22:46:54 2005
Subject: [BioSQL-l] Re: SeqWithQuality and biosql
Message-ID:

The BioJava SymbolTokenizer can tokenize either to characters or to Strings. Obviously not all alphabets can sensibly tokenize to characters (eg large compound alphabets). Currently by default it would tokenize a compound symbol to its compound names. For example the codon ACA would be (adenosine cytosine adenosine). This is obviously not ideal for a database and it can easily be changed in biojava without breaking things (to be honest, tokenization of compound alphas in biojava is not a common task at all). I would propose the following for compound alphabets...

(aca)(gtc) for codon alphabets.
(g17)(t40) for quality type alphabets.

I like the use of brackets because it is possible in biojava to do something like this ((DNAxDNAxDNA)xPROTEIN), which would represent an alignment of codons with their amino acids, or even ((DNAxDNA)x(DNAxDNAxDNA)), which I'm not sure you would ever use but there might be a good reason for it. The brackets help to disambiguate better than spaces would. For example ((ctc)S) for the first example or ((gc)(atg)) for the second example.

To make this work there also needs to be a uniform way to store the alphabet name in the sequence table. The above examples show how biojava constructs alphabet names but there may be (probably are) better ways.

For quality information you could use (DNAxINTEGER); technically the biojava name would be (DNAxSubIntegerAlphabet[0..99]). Of course you don't have to use this convention, and aliasing would be nice (eg the 'official' name for INTEGER in BioJava would be 'Alphabet of all integers', which is a bit long-winded!)
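
To show how the bracketed notation above disambiguates nested compound symbols, here is a small parser sketch. It assumes that every token inside the brackets is a single character, which is a simplification made for this example; parse_brackets is a hypothetical name and this is not how BioJava's tokenization is implemented.

```python
def parse_brackets(text):
    """Turn '(aca)(gtc)' into nested lists, one list per bracket pair."""
    stack = [[]]
    for ch in text:
        if ch == "(":
            stack.append([])          # open a new (possibly nested) symbol
        elif ch == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced closing bracket")
            group = stack.pop()
            stack[-1].append(group)   # attach the finished symbol to its parent
        else:
            stack[-1].append(ch)      # assumption: one character per leaf token
    if len(stack) != 1:
        raise ValueError("unbalanced opening bracket")
    return stack[0]


print(parse_brackets("(aca)(gtc)"))  # [['a', 'c', 'a'], ['g', 't', 'c']]
print(parse_brackets("((ctc)S)"))    # [[['c', 't', 'c'], 'S']]
```

Under the single-character assumption a multi-digit quality value would be split into separate tokens, so a real tokenizer would need to know the alphabet in order to read such values as one symbol.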

- Mark

From hollandr at gis.a-star.edu.sg Tue Jul 5 23:38:51 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Tue Jul 5 23:30:57 2005
Subject: [BioSQL-l] RE: SeqWithQuality and biosql
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D5601DCB226@BIONIC.biopolis.one-north.com>

Good point. To correctly represent compound alphabets in a consistent manner would require extra tables in BioSQL (version 1.1?). Some kind of alphabet table with a name and a related table with alphabet ids and ranks to construct cross products etc. Why not store the delimiter as an attribute of the alphabet in this table? That way we can use whatever delimiters we like.
I don't think grouping is necessary - after all we know from the alphabet definition that there are a fixed number of tokens per symbol and what order they come in, so we just read the first three tokens to build the first symbol, and so on.

Richard Holland
Bioinformatics Specialist
GIS extension 8199

> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp@gnf.org]
> Sent: Wednesday, July 06, 2005 11:30 AM
> To: mark.schreiber@novartis.com
> Cc: Bioperl; Richard HOLLAND
> Subject: Re: SeqWithQuality and biosql
>
> On Jul 5, 2005, at 7:55 PM, mark.schreiber@novartis.com wrote:
>
> > I would propose the
> > following for compound alphabets...
> >
> > (aca)(gtc) for codon alphabets.
> > (g17)(t40) for quality type alphabets.
>
> In your convention wouldn't this need to be
> (g(17))(t(40))
>
> Otherwise you'd have trouble representing higher-dimensional
> cross-products unless you alternate chars and digits which would be a
> useless restriction.
>
> -hilmar

From mark.schreiber at novartis.com Wed Jul 6 01:37:21 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Wed Jul 6 01:28:31 2005
Subject: [BioSQL-l] RE: SeqWithQuality and biosql
Message-ID:

Actually under my proposal

(a(17)) would imply (DNAx(SubInteger[0..9]xSubInteger[0..9]))

From hlapp at gnf.org Wed Jul 6 12:30:06 2005
From: hlapp at gnf.org (Hilmar Lapp)
Date: Wed Jul 6 12:21:16 2005
Subject: [BioSQL-l] Re: [Bioperl-l] RE: SeqWithQuality and biosql
In-Reply-To:
References:
Message-ID:

On Jul 5, 2005, at 10:37 PM, mark.schreiber@novartis.com wrote:

> Actually under my proposal
>
> (a(17)) would imply (DNAx(SubInteger[0..9]xSubInteger[0..9]))
>

That's why I didn't like it - how would you encode (DNAx(SubInteger[0..99]xSubInteger[0..99])) in this proposal? Require each component to be two-digit? There ought to be delimiters between the operands, no?

-hilmar
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------

From mark.schreiber at novartis.com Wed Jul 6 20:59:13 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Wed Jul 6 20:50:16 2005
Subject: [BioSQL-l] Re: [Bioperl-l] RE: SeqWithQuality and biosql
Message-ID:

Good point. I would prefer a system that only uses delimiters for ambiguous cases like the one you show, but I guess that's pretty complex, so maybe delimiters for every sub-alphabet.

- Mark
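
For readers of the archive, here is a minimal sketch of the "delimiters for every sub-alphabet" idea the thread ends on, written in Python purely for illustration. The helper names `encode` and `decode` and the exact bracket syntax are assumptions made for this example; nothing in the thread fixes a final syntax, and no such functions exist in BioSQL, BioPerl or BioJava as far as the messages above show. Each compound symbol is wrapped in parentheses and each integer component gets its own pair, so `(a(17))` (one component, 0..99) and `(a(1)(7))` (two components, 0..9 each) can no longer be confused.

```python
import re

# Hypothetical helpers for illustration only; not part of any bio* project.
# A compound symbol is a tuple: a one-letter base followed by one or more
# non-negative integers, e.g. ("g", 17) or ("a", 1, 7).

# One whole symbol: "(" base, then one or more "(digits)" groups, then ")".
_SYMBOL = re.compile(r"\(([a-z])((?:\(\d+\))+)\)")
# One integer component inside a symbol.
_COMPONENT = re.compile(r"\((\d+)\)")


def encode(symbols):
    """Serialise compound symbols into a single delimited string."""
    return "".join(
        "(" + base + "".join(f"({n})" for n in numbers) + ")"
        for base, *numbers in symbols
    )


def decode(text):
    """Parse a delimited string back into a list of compound symbols."""
    symbols = []
    pos = 0
    while pos < len(text):
        match = _SYMBOL.match(text, pos)
        if match is None:
            raise ValueError(f"malformed symbol at offset {pos}")
        numbers = [int(n) for n in _COMPONENT.findall(match.group(2))]
        symbols.append((match.group(1), *numbers))
        pos = match.end()
    return symbols


if __name__ == "__main__":
    read = [("g", 17), ("t", 40)]
    stored = encode(read)
    print(stored)                  # (g(17))(t(40))
    print(decode(stored) == read)  # True
    print(decode("(a(1)(7))"))     # [('a', 1, 7)]
```

The resulting string is plain ASCII, so it would fit in a character column without the Unicode requirement mentioned earlier in the thread, at the cost of several stored characters per base.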