From jimp at compbio.dundee.ac.uk Wed Dec 3 09:56:16 2008
From: jimp at compbio.dundee.ac.uk (James Procter)
Date: Wed, 03 Dec 2008 14:56:16 +0000
Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate
Message-ID: <49369E10.6000502@compbio.dundee.ac.uk>

Hi.

I was wondering - are bioentry accessions intended to be case insensitive under the archetypal BioSQL schema ?

If so - are the various Bio* language bindings supposed to honour the case insensitivity ?

The reason that I am asking is that I have just encountered this issue in relation to biosql backed sequence database queries made via biojava-x's hibernate bindings with a biosql 1.01 install on PostgreSQL.

If Bioentry accessions are, in fact, case insensitive, then I can petition for an update to the biojava-x bindings to honour this. If not, then I'll just continue with my own non-standard hack ;)

thanks in advance.
Jim

ps. on a side issue - have the various Bio* language bindings actually been specified formally ? If so - where might I find them ?

--
-------------------------------------------------------------------
J. B. Procter (ENFIN/VAMSAS) Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.

From markjschreiber at gmail.com Thu Dec 4 01:00:30 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Thu, 4 Dec 2008 14:00:30 +0800
Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate
In-Reply-To: <49369E10.6000502@compbio.dundee.ac.uk>
References: <49369E10.6000502@compbio.dundee.ac.uk>
Message-ID: <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com>

Good question...

Are there any situations where Accessions would not be case insensitive? I feel they should be case-insensitive, but partly this is going to depend on Hibernate's default behaviour and the default of the underlying DB.
Some DBs are case insensitive by default; if this is the case then Hibernate's behaviour will probably not impact this.

- Mark

On Wed, Dec 3, 2008 at 10:56 PM, James Procter wrote:
>
> Hi.
>
> I was wondering - are bioentry accessions intended to be case
> insensitive under the archetypal BioSQL schema ?
>
> If so - are the various Bio* language bindings supposed to honour
> the case insensitivity ?
>
> The reason that I am asking is that I have just encountered this issue
> in relation to biosql backed sequence database queries made via
> biojava-x's hibernate bindings with a biosql 1.01 install on PostgreSQL.
>
> If Bioentry accessions are, in fact, case insensitive, then I can
> petition for an update to the biojava-x bindings to honour this. If not,
> then I'll just continue with my own non-standard hack ;)
>
> thanks in advance.
> Jim
>
> ps. on a side issue - have the various Bio* language bindings actually
> been specified formally ? If so - where might I find them ?
>
> --
> -------------------------------------------------------------------
> J. B. Procter (ENFIN/VAMSAS) Barton Bioinformatics Research Group
> Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk
> The University of Dundee is a Scottish Registered Charity, No. SC015096.
From biopython at maubp.freeserve.co.uk Thu Dec 4 05:35:55 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 10:35:55 +0000
Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate
In-Reply-To: <320fb6e00812040230n3d78061ay96d538ea75d3673e@mail.gmail.com>
References: <49369E10.6000502@compbio.dundee.ac.uk> <320fb6e00812040230n3d78061ay96d538ea75d3673e@mail.gmail.com>
Message-ID: <320fb6e00812040235s6e9d70b6y70b9b8443c7e2312@mail.gmail.com>

Sending again - last time I didn't use my mailing list email address - sorry James, you'll get this twice.

James Procter wrote:
>
> ps. on a side issue - have the various Bio* language bindings actually
> been specified formally ? If so - where might I find them ?
>

I think the answer to that is sadly a no. For Biopython work, I have been treating BioPerl as the reference implementation of BioSQL, and have tried to get some details clarified here on this list, e.g. regarding ontologies:

http://lists.open-bio.org/pipermail/biosql-l/2008-November/001412.html
http://lists.open-bio.org/pipermail/biosql-l/2008-November/001414.html
http://lists.open-bio.org/pipermail/biosql-l/2008-November/001413.html

Peter

From jimp at compbio.dundee.ac.uk Thu Dec 4 05:56:48 2008
From: jimp at compbio.dundee.ac.uk (James Procter)
Date: Thu, 04 Dec 2008 10:56:48 +0000
Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate
In-Reply-To: <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com>
References: <49369E10.6000502@compbio.dundee.ac.uk> <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com>
Message-ID: <4937B770.7050000@compbio.dundee.ac.uk>

Thanks for the reply - Mark.

Mark Schreiber wrote:
> Are there any situations where Accessions would not be case
> insensitive?
AFAIK the public sequence, structure and gene id databases all have case insensitive Accessions, so I'd reckon 'No' is the answer there. However, it's easy to imagine some legacy in-house databases relying on case sensitivity in some way.

> I feel they should be case-insensitive but partly this is
> going to depend on Hibernate's default behaviour and the default of the
> underlying DB. Some DBs are case insensitive by default; if this is
> the case then Hibernate's behaviour will probably not impact this.

Again, agreed.

Assuming for the moment that accessions are case insensitive in BioSQL, then case insensitivity should also be built into the BioSQL schema implementations for each DB, and a case insensitive column should be used if the DB supports it. However, regardless of whether the DB supports that kind of attribution, the language bindings should also have the case-insensitivity built in. In the case of Hibernate, the bindings have to be matched to the underlying database anyway, so I guess the problem I encountered is really due to a bug in the biojavax schema.

of course - if case insensitivity isn't in the BioSQL spec - then it's all immaterial, so the question still remains. Can anyone enlighten us further ?

Jim

ps. I've started another thread for the reply regarding biosql/Bio* object mappings.

From jimp at compbio.dundee.ac.uk Thu Dec 4 06:10:31 2008
From: jimp at compbio.dundee.ac.uk (James Procter)
Date: Thu, 04 Dec 2008 11:10:31 +0000
Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate
In-Reply-To: <320fb6e00812040235s6e9d70b6y70b9b8443c7e2312@mail.gmail.com>
References: <49369E10.6000502@compbio.dundee.ac.uk> <320fb6e00812040230n3d78061ay96d538ea75d3673e@mail.gmail.com> <320fb6e00812040235s6e9d70b6y70b9b8443c7e2312@mail.gmail.com>
Message-ID: <4937BAA7.9030003@compbio.dundee.ac.uk>

Peter wrote:
> Sending again - last time I didn't use my mailing list email address -
> sorry James, you'll get this twice.

no worries.
>
> James Procter wrote:
>> ps. on a side issue - have the various Bio* language bindings actually
>> been specified formally ? If so - where might I find them ?
>>
>
> I think the answer to that is sadly a no. For Biopython work, I have
> been treating BioPerl as the reference implementation of BioSQL, and have
> tried to get some details clarified here on this list, e.g. regarding
> ontologies:
>
> http://lists.open-bio.org/pipermail/biosql-l/2008-November/001412.html
> http://lists.open-bio.org/pipermail/biosql-l/2008-November/001414.html
> http://lists.open-bio.org/pipermail/biosql-l/2008-November/001413.html

Ah - yes - I wasn't quite up to speed on that thread. I think it's probably a better place to continue this discussion....

Jim.

From holland at eaglegenomics.com Thu Dec 4 06:43:10 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 04 Dec 2008 11:43:10 +0000
Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate
In-Reply-To: <4937B770.7050000@compbio.dundee.ac.uk>
References: <49369E10.6000502@compbio.dundee.ac.uk> <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com> <4937B770.7050000@compbio.dundee.ac.uk>
Message-ID: <4937C24E.7070402@eaglegenomics.com>

Hibernate reflects the case-sensitivity of the database it is connected to. There are no options in it for changing that, and so you have to use it the way the database underneath expects you to.

However, you can modify your queries so that you convert your query terms to a specific case in advance of the search using the toUpperCase() or toLowerCase() methods of the String class, then when performing the search, use HQL's lower() or upper() functions inside the query HQL to convert values to the same case when making the comparisons.

Someone will need to search through the BJX code to find the spots where explicit queries are made against accessions or any other case-insensitive data, then modify it to use the above technique.
This would possibly involve introducing HQL queries where currently only direct Hibernate object references are being made. cheers, Richard James Procter wrote: > Thanks for the reply - Mark. > > Mark Schreiber wrote: >> Are there any situations where Accessions would not be case >> insensitive? > AFAIK the public sequence, structure and gene id databases all have case > insensitive Accessions, so I'd reckon 'No' being the answer there. > However, its easy to imagine that some legacy in-house databases relying > on case sensitivity in some way. > I feel they should be case-insensitive but partly this is >> going to depend on Hibernates default behaivor and the default of the >> underlying DB. Some DB's are case insensitive by default if this is >> the case then hibernates behaivour will probably not impact this. > Again, agreed. > > Assuming for the moment that accessions are case insensitive in BioSQL, > then case sensitivity should also be built into the BioSQL schema > implementations for each DB, and a case insensitive column should be > used if the DB supports it. However, regardless of whether the DB > supports that kind of attribution, the language bindings should also > have the case-insensitivity built in. In the case of Hibernate, the > bindings have to be matched to the underlying database anyway, so I > guess the problem I encountered is really due to a bug in the biojavax > schema. > > of course - if case insensitivity isn't in the BioSQL spec - then its > all immaterial, so the question still remains. Can anyone enlighten us > further ? > > Jim > > ps. I've started another thread for the reply regarding biosql/Bio* > object mappings. 
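For readers following the thread, the normalise-both-sides technique described at the top of this message can be sketched outside Hibernate. The example below is illustrative only and is not BioJavaX code: it uses Python's built-in sqlite3 module as a stand-in for the BioSQL backend, a cut-down `bioentry` table (an assumption; the real table has more columns), and a hypothetical helper `find_by_accession`. SQL's lower() plays the role that HQL's lower() would play in the query.

```python
import sqlite3

# Stand-in for a BioSQL backend: a cut-down bioentry table (assumption:
# the real table has many more columns and constraints).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, accession TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO bioentry (accession) VALUES (?)", [("P12345",), ("q9xyz1",)]
)


def find_by_accession(connection, accession):
    """Hypothetical helper: case-insensitive lookup of accessions."""
    # Normalise the query term in application code first...
    term = accession.lower()
    # ...then normalise the stored value inside the query itself.
    rows = connection.execute(
        "SELECT accession FROM bioentry WHERE lower(accession) = ?",
        (term,),
    )
    return [row[0] for row in rows]


print(find_by_accession(conn, "p12345"))
print(find_by_accession(conn, "Q9XYZ1"))
```

Both lookups find their row even though the stored spelling differs in case. One thing to weigh with this approach: wrapping the column in lower() may stop the database from using a plain index on accession, so whether it is acceptable depends on the backend and the table size.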
> _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From jimp at compbio.dundee.ac.uk Thu Dec 4 06:45:24 2008 From: jimp at compbio.dundee.ac.uk (James Procter) Date: Thu, 04 Dec 2008 11:45:24 +0000 Subject: [BioSQL-l] BioSQL and ontology "standards". In-Reply-To: <320fb6e00811281204i3bae31e4kc18f70121244b4d1@mail.gmail.com> References: <320fb6e00811281057r2d3a1145j3072b6a537112e12@mail.gmail.com> <49304392.4080908@eaglegenomics.com> <320fb6e00811281204i3bae31e4kc18f70121244b4d1@mail.gmail.com> Message-ID: <4937C2D4.1020701@compbio.dundee.ac.uk> Hi - I'm very sorry to break the thread a little - particularly with the deep discussion that's going on. Peter drew my attention to the thread in his reply to my ps. on another thread: Peter's reply to my original PS: >> ps. on a side issue - have the various Bio* language bindings actually >> been specified formally ? If so - where might I find them ? >> > > I think the answer to that is sadly a no. For Biopython work, I have > been treating BioPerl as the reference implementation BioSQL. Peter wrote: > On Fri, Nov 28, 2008 at 7:16 PM, Richard Holland wrote: >> BioJava does what BioPerl does and pretty much makes it up as it goes >> along, using whatever the input files tell it. > > OK, good. As a brutal summary, leaving all Peter's questions unanswered, that statement suggests a consensus - BioPerl is the 'reference' mapping. However, I personally do not yet know enough about each Bio* sequence feature structure to verify that this is the case. >> I think the best approach is to always to use what the file says, and >> trust that it's accurate. 
>> What needs to be agreed between projects is
>> any additional annotations that get introduced outside the context of
>> file parsing, and the names of the ontologies used for the file
>> annotations so that all projects use the same ontologies and don't
>> replicate them inside the BioSQL database. It would be nice to
>> standardise these names and the additional custom terms across the
>> projects, in much the same way as people tried already to standardise
>> the way general objects get mapped to BioSQL.
>
> This is what I am trying to get at here - documenting the existing "ad
> hoc" ontology usage. My impression is that it has not been
> documented, and that the BioPerl behaviour is the de facto BioSQL
> standard.
>
> I'd like to pin down this standard, and extend it for situations like
> the location_qualifier_value.term_id and perhaps location.term_id
> where BioPerl seems to ignore the ontology issue.

I'm adding my support for documentation here. However, to put into perspective why this verification is necessary, I should explain my problem:

I've been evaluating the use of BioSQL as a back end database for DAS source deployment. We are using both BioPerl and BioJava to interact with the BioSQL database, but ultimately aim to serve bioentry annotation as DAS features. This means that there needs to be a clear mapping between a BioSQL bioentry's annotation and the attributes of one or more DAS features, and that mapping needs to be honoured by all the Bio* object bindings utilised by the various programs interacting with the BioSQL database.

DAS features are actually pretty simple.
To begin with, I'm only interested in unambiguously mapping the core DAS/1 feature attributes: - location (start,end and strand) - type (which may additionally have a sequence annotation ontology term) - label (free text relating to the type term) - feature score (again associated with the type) - URLs (often added as href properties) - Method (free text but often has associated evidence code) - notes (free text which may include additional ontological terms) I'm building on the mapping started by Benjamin Schuster Bockler and implemented in Dazzle. However, I've already run into some mismatches and I now need to clarify whether we are misusing the BioPerl sequence feature binding, or whether the Biojava->DAS part of the mapping is broken. A formal specification, or at the very least a mapping diagram, is therefore pretty much essential. It will also enable better 'out of the box' support for access to BioSQL datasources in other applications. Jim. From jimp at compbio.dundee.ac.uk Thu Dec 4 07:19:15 2008 From: jimp at compbio.dundee.ac.uk (James Procter) Date: Thu, 04 Dec 2008 12:19:15 +0000 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <4937C24E.7070402@eaglegenomics.com> References: <49369E10.6000502@compbio.dundee.ac.uk> <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com> <4937B770.7050000@compbio.dundee.ac.uk> <4937C24E.7070402@eaglegenomics.com> Message-ID: <4937CAC3.4070906@compbio.dundee.ac.uk> Thanks for replying... Richard Holland wrote: > Hibernate reflects the case-sensitivity of the database it is connected > to. There's no options in it for changing that, and so you have to use > it the way the database underneath expects you to. agreed. 
> However, you can modify your queries so that you convert your query
> terms to a specific case in advance of the search using the toUpperCase() or
> toLowerCase() methods of the String class, then when performing the
> search, use HQL's lower() or upper() functions inside the query HQL to
> convert values to the same case when making the comparisons.

agreed (again).

>
> Someone will need to search through the BJX code to find the spots where
> explicit queries are made against accessions or any other
> case-insensitive data, then modify it to use the above technique. This
> would possibly involve introducing HQL queries where currently only
> direct Hibernate object references are being made.

ok - this isn't too hard to do. Do you know who currently maintains the BiojavaX bindings to BJX ? I can send them patches - or I guess submit them directly... but.. it won't fix the issue.

In a situation where the backend DB is case sensitive, new accessions inserted into the DB must be forced to the same case that the accession query string is forced to. This modification is again straightforward, but the case change would also have to be propagated to existing entries in the BioSQL database. Furthermore, modifying the Biojava-x bindings would only ensure case-insensitivity for Biojava queries; the same modification would also be necessary for the other Bio* bindings.

This isn't actually a burning issue for me at the moment even though it might sound like it from the way I'm posting about it - I had already hacked biojava-x to accommodate the case for my specific BioSQL database. However, it is important to clarify this situation in order to ensure interoperability between biosql databases and off the shelf middleware.

all the best!
Jim From holland at eaglegenomics.com Thu Dec 4 07:27:03 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Thu, 04 Dec 2008 12:27:03 +0000 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <4937CAC3.4070906@compbio.dundee.ac.uk> References: <49369E10.6000502@compbio.dundee.ac.uk> <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com> <4937B770.7050000@compbio.dundee.ac.uk> <4937C24E.7070402@eaglegenomics.com> <4937CAC3.4070906@compbio.dundee.ac.uk> Message-ID: <4937CC97.3060602@eaglegenomics.com> I'm the maintainer for BJX so you can post patches to me. However, I agree that a consensus needs to be reached on how to store this data amongst the various projects. There's no point one fixing it one way and another fixing it the other. This is, from my point of view, the main problem with having something like BioSQL defined as a schema rather than as an API that can include defined behaviour (business logic) as well as database structure (the Oracle version for instance would expose a public set of PL/SQL stored procedures and projects would then interact solely via those). Also, BioJava itself may not be case-insensitive (can't remember how I coded it now...) so would need changes throughout to things like equals() methods as well (this has implications going back into the original code as well as the BJX extensions which build on that). cheers, Richard James Procter wrote: > Thanks for replying... > > Richard Holland wrote: >> Hibernate reflects the case-sensitivity of the database it is connected >> to. There's no options in it for changing that, and so you have to use >> it the way the database underneath expects you to. > agreed. 
> >> However, you can modify your queries so that you convert your query >> terms to a specific case in advance of the search using the toUpper() or >> toLower() functions of the String class, then when performing the >> search, use HQL's lower() or upper() functions inside the query HQL to >> convert values to the same case when making the comparisons. > agreed (again). >> Someone will need to search through the BJX code to find the spots where >> explicit queries are made against accessions or any other >> case-insensitive data, then modify it to use the above technique. This >> would possibly involve introducing HQL queries where currently only >> direct Hibernate object references are being made. > ok - this isn't too hard to do. Do you know who currently maintains the > BiojavaX bindings to BJX ? I can send them patches - or I guess submit > them directly... but.. it wont fix the issue. > > In a situation where the backend DB is case sensitive, new accessions > inserted into the DB must be forced to be the same case in the same way > as the accession query string will be forced to be. This modification is > again straightforward, but the case change would also have to be > propagated to existing entries in the BioSQL. Furthermore, modifying the > Biojava-x bindings would only ensure case-insensitivity for Biojava > queries, the same modification would also be necessary for the other > bio-* bindings. > > This isn't actually a burning issue for me at the moment even though it > might sound like it from the way I'm posting about it - I had already > hacked biojava-x to accomodate the case for my specific BioSQL database. > However, it is important to clarify this situation in order to ensure > interoperability between biosql databases and off the shelf middleware. > > all the best! 
> Jim > > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From jimp at compbio.dundee.ac.uk Thu Dec 4 07:41:27 2008 From: jimp at compbio.dundee.ac.uk (James Procter) Date: Thu, 04 Dec 2008 12:41:27 +0000 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <4937CC97.3060602@eaglegenomics.com> References: <49369E10.6000502@compbio.dundee.ac.uk> <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com> <4937B770.7050000@compbio.dundee.ac.uk> <4937C24E.7070402@eaglegenomics.com> <4937CAC3.4070906@compbio.dundee.ac.uk> <4937CC97.3060602@eaglegenomics.com> Message-ID: <4937CFF7.7040200@compbio.dundee.ac.uk> Richard Holland wrote: > I'm the maintainer for BJX so you can post patches to me. ah - {bows} - sorry - should've realised that. > > However, I agree that a consensus needs to be reached on how to store > this data amongst the various projects. There's no point one fixing it > one way and another fixing it the other. This is, from my point of view, > the main problem with having something like BioSQL defined as a schema > rather than as an API that can include defined behaviour (business > logic) as well as database structure (the Oracle version for instance > would expose a public set of PL/SQL stored procedures and projects would > then interact solely via those). yes. I was hoping there were at least some additional notes on how the entities were further restricted which might inform us about this. Perhaps that might be something to be considered for the next BioSQL release. > Also, BioJava itself may not be case-insensitive (can't remember how I > coded it now...) so would need changes throughout to things like > equals() methods as well (this has implications going back into the > original code as well as the BJX extensions which build on that). 
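As an illustration of the equals() concern quoted above: if accessions are to compare case-insensitively inside the object model, equality and hashing have to agree, or hash-based collections will misbehave. The sketch below is in Python rather than Java, and the `Accession` class is hypothetical; it is not part of BioJava or BioJavaX, only a minimal picture of the idea.

```python
class Accession:
    """Hypothetical value object: keeps the original spelling but
    compares and hashes on a case-folded key."""

    def __init__(self, value):
        self.value = value
        self._key = value.casefold()

    def __eq__(self, other):
        if not isinstance(other, Accession):
            return NotImplemented
        return self._key == other._key

    def __hash__(self):
        # Derived from the same key as __eq__, so that equal objects
        # always land in the same hash bucket.
        return hash(self._key)

    def __repr__(self):
        return f"Accession({self.value!r})"


seen = {Accession("P12345")}
print(Accession("p12345") in seen)
```

The same pairing applies to equals() and hashCode() in Java: changing one without the other is the kind of side effect that would need unit tests.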
eeks - that could definitely get hairy and almost certainly have some additional side effects. But, I think that it might be possible to introduce the change cleanly so whenever a biosql backed object was being worked with, case insensitivity would apply. Is case sensitivity covered in the biojavax unit tests ?

thanks for all your input on this!
Jim

From holland at eaglegenomics.com Thu Dec 4 07:55:48 2008
From: holland at eaglegenomics.com (Richard Holland)
Date: Thu, 04 Dec 2008 12:55:48 +0000
Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate
In-Reply-To: <4937CFF7.7040200@compbio.dundee.ac.uk>
References: <49369E10.6000502@compbio.dundee.ac.uk> <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com> <4937B770.7050000@compbio.dundee.ac.uk> <4937C24E.7070402@eaglegenomics.com> <4937CAC3.4070906@compbio.dundee.ac.uk> <4937CC97.3060602@eaglegenomics.com> <4937CFF7.7040200@compbio.dundee.ac.uk>
Message-ID: <4937D354.5040706@eaglegenomics.com>

> eeks - that could definitely get hairy and almost certainly have some
> additional side effects. But, I think that it might be possible to
> introduce the change cleanly so whenever a biosql backed object was
> being worked with, case insensitivity would apply. Is case sensitivity
> covered in the biojavax unit tests ?

No. Sounds like an area of need!

> thanks for all your input on this!
> Jim
>

--
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From lpritc at scri.ac.uk Thu Dec 4 08:25:50 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Thu, 04 Dec 2008 13:25:50 +0000
Subject: [BioSQL-l] BioSQL and ontology "standards".
In-Reply-To: <4937C2D4.1020701@compbio.dundee.ac.uk> Message-ID: On 04/12/2008 11:45, "James Procter" wrote: > Peter wrote: >> On Fri, Nov 28, 2008 at 7:16 PM, Richard Holland wrote: >>> I think the best approach is to always to use what the file says, and >>> trust that it's accurate. What needs to be agreed between projects is >>> any additional annotations that get introduced outside the context of >>> file parsing, and the names of the ontologies used for the file >>> annotations so that all projects use the same ontologies and don't >>> replicate them inside the BioSQL database. It would be nice to >>> standardise these names and the additional custom terms across the >>> projects, in much the same way as people tried already to standardise >>> the way general objects get mapped to BioSQL. >> >> This is what I am trying to get at here - documenting the existing "ad >> hoc" ontology usage. My impression is that it has not been >> documented, and that the BioPerl behaviour is the defacto BioSQL >> standard. >> >> I'd like to pin down this standard, and extend it for situations like >> the location_qualifier_value.term_id and perhaps location.term_id >> where BioPerl seems to ignore the ontology issue. Hi, Just to add some of my experience with BioSQL and Biopython to the discussion... When I began to look at this issue a couple of years ago, it was clear that the Biopython loader (and, to the best of my knowledge, Bioperl does this, too) for GenBank files and BioSQL put pretty much everything under an 'ontology' called 'Annotation Tags', with no definitions and only rudimentary error-checking. Now, BioSQL seems to have taken great care to ensure that, whatever one's choice of ontology, it can be accommodated in the database schema. There is, as far as I can tell, no reason to favour one ontology over another on the grounds of BioSQL compatibility and, if anything, the BioSQL schema pretty much forced me to start considering ontologies in a serious manner. 
My understanding is that BioSQL is ontology-neutral, and that the appropriate choice of ontology is dependent on the data with which you want to populate your database. This suggests to me that the Bio* loaders are the things that need to be dynamically ontology-aware, first to check if the appropriate ontology (as selected by the user) for the data is present in the database, and then to populate the database using those ontology terms, raising errors as appropriate (e.g. for extraneous terms, mis-spellings, inappropriate data types, etc.).

If your reason, like mine, for using an ontology is to ensure that annotation terms have well-defined (or at least defined) meanings, and perhaps incidentally to carry out a check on the validity of a particular annotation file within the domain of that ontology, then that can readily be done in BioSQL. I have managed this with both the Gene Ontology and Sequence Ontology ontologies, and locally-defined ontologies. BioSQL copes with these very nicely, as does a modified Biopython Loader.py.

However, the current Biopython (and AFAIAA Bio*) behaviour with 'Annotation Tags' doesn't correspond well to the above. I think that this is a bad thing in general, and that there is room for improvement, if we want it.

With apologies if I'm misinterpreting the tide of discussion, but I would be disappointed to see a default behaviour of "bung everything under 'Annotation Tags', typos and all" become a 'standard' of any sort, rather than a placeholder for future development of ontology-aware Bio* code that queries and populates BioSQL appropriately.

I see the situation as pretty much analogous to the effective requirement for NCBI taxon data in BioSQL, when using Biopython: you need to load in the NCBI taxon data before your own data can be imported in a taxon-aware manner.
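The loader behaviour described above (check that the chosen ontology is present, then accept only terms it defines) can be sketched as follows. This is a toy illustration, not the modified Loader.py mentioned in this message: it uses an in-memory sqlite3 database, cut-down `ontology` and `term` tables (an assumption; the real BioSQL tables carry more columns), and the hypothetical names `require_ontology`, `term_id_for` and `UnknownTermError`.

```python
import sqlite3


class UnknownTermError(ValueError):
    """Raised when an annotation key is not a term in the chosen ontology."""


def require_ontology(conn, name):
    """Return the ontology_id for `name`, or fail if it was never loaded."""
    row = conn.execute(
        "SELECT ontology_id FROM ontology WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise LookupError(f"ontology {name!r} must be loaded before importing data")
    return row[0]


def term_id_for(conn, ontology_id, term_name):
    """Look up an existing term; never create one silently."""
    row = conn.execute(
        "SELECT term_id FROM term WHERE ontology_id = ? AND name = ?",
        (ontology_id, term_name),
    ).fetchone()
    if row is None:
        raise UnknownTermError(f"{term_name!r} is not defined in this ontology")
    return row[0]


# Toy schema and data (assumption: simplified from the BioSQL tables).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ontology (ontology_id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.execute(
    "CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT, "
    "ontology_id INTEGER REFERENCES ontology(ontology_id))"
)
conn.execute("INSERT INTO ontology (name) VALUES ('Sequence Ontology')")
conn.execute("INSERT INTO term (name, ontology_id) VALUES ('gene', 1)")

so_id = require_ontology(conn, "Sequence Ontology")
print(term_id_for(conn, so_id, "gene"))
try:
    term_id_for(conn, so_id, "gnee")  # a typo is rejected, not stored
except UnknownTermError as err:
    print("rejected:", err)
```

The design choice here is that a missing ontology or an unknown term stops the import, so the problem gets fixed either in the source data or in the ontology, instead of a new term being recorded silently.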
I would prefer to see a similar, but perhaps even more draconian imposition of requiring an appropriate ontology (or ontologies) to be present in the database before importing data, and a specification of which ontology/ies is/are to be used when loading the data. Then, where a term is not yet known to an ontology in BioSQL, this might be an error in the source data, or an oversight of the ontology. Correcting either of these improves the quality of the data and/or its description. The catch-all 'Annotation Tag' 'ontology' seems to silently record a new term with a different ID, permits no error correction and, for my own part, I would rather this behaviour went away, eventually. Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. 
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________________________

From biopython at maubp.freeserve.co.uk Thu Dec 4 10:04:44 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 15:04:44 +0000
Subject: [BioSQL-l] BioSQL and ontology "standards".
In-Reply-To:
References: <4937C2D4.1020701@compbio.dundee.ac.uk>
Message-ID: <320fb6e00812040704i71c23144m35de4de6df882902@mail.gmail.com>

On Thu, Dec 4, 2008 at 1:25 PM, Leighton Pritchard wrote:
> With apologies if I'm misinterpreting the tide of discussion, but I would be
> disappointed to see a default behaviour of "bung everything under
> 'Annotation Tags', typos and all" become a 'standard' of any sort, rather
> than a placeholder for future development of ontology-aware Bio* code that
> queries and populates BioSQL appropriately.

Overall, I agree. It isn't ideal, but the current ad-hoc "ontology" is useful in that its looseness allows any parsable GenBank file to be imported into the database. Pinning down the current behaviour as a "standard" for better intercompatibility between the Bio* projects is a good thing, even if this is only a short term goal. In the long term, yes, maybe all Bio* projects should be able to cope with any (optional) strict ontology instead.

> I see the situation as pretty much analogous to the effective requirement
> for NCBI taxon data in BioSQL, when using Biopython: you need to load in the
> NCBI taxon data before your own data can be imported in a taxon-aware
> manner.

This is going off topic, but that's not really true any more.
It used to be the case that if you wanted to record the NCBI taxonomy when loading GenBank files into BioSQL with Biopython that you would ideally first prepopulate the taxonomy tables with the BioSQL load_ncbi_taxonomy.pl script. I should go and update http://www.biopython.org/wiki/BioSQL now that Biopython 1.49 is out, as it can optionally fill in the lineage on demand by querying NCBI Entrez. Either way, it does "play nice" with running load_ncbi_taxonomy.pl before or after loading records with Biopython. Peter From lpritc at scri.ac.uk Thu Dec 4 10:14:37 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 04 Dec 2008 15:14:37 +0000 Subject: [BioSQL-l] BioSQL and ontology "standards". In-Reply-To: <320fb6e00812040704i71c23144m35de4de6df882902@mail.gmail.com> Message-ID: On 04/12/2008 15:04, "Peter" wrote: > On Thu, Dec 4, 2008 at 1:25 PM, Leighton Pritchard wrote: >> With apologies if I'm misinterpreting the tide of discussion, but I would be >> disappointed to see a default behaviour of "bung everything under >> 'Annotation Tags', typos and all" become a 'standard' of any sort, rather >> than a placeholder for future development of ontology-aware Bio* code that >> queries and populates BioSQL appropriately. > > Overall, I agree. It isn't ideal, but the current ad-hoc "ontology" > is useful in that its looseness allows any parsable GenBank file to be > imported into the database. I think that this may be a matter of perspective: you see an advantage, I see an accident waiting to happen ;) >> I see the situation as pretty much analogous to the effective requirement >> for NCBI taxon data in BioSQL, when using Biopython: you need to load in the >> NCBI taxon data before your own data can be imported in a taxon-aware >> manner. > > This is going off topic, but that's not really true any more. Ah... Good point - my bad. L. 
______________________________________________________________________ From hlapp at gmx.net Thu Dec 4 10:57:14 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 4 Dec 2008 10:57:14 -0500 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <49369E10.6000502@compbio.dundee.ac.uk> References: <49369E10.6000502@compbio.dundee.ac.uk> Message-ID: On Dec 3, 2008, at 9:56 AM, James Procter wrote: > I was wondering - are bioentry accessions intended to be case > insensitive under the archetypal BioSQL schema ? Sorry for chiming in a bit late here. Accessions are case sensitive in BioSQL at the level of the relational model. In fact, this is enforced for MySQL (which unilaterally chose to treat the SQL VARCHAR datatype as case-insensitive) by making the type VARCHAR BINARY. I'm rather disinclined to change that, I have to say. I realize that many (all?) of the databases we typically use treat accessions as case- insensitive. But I doubt that that's part of the specs in each case, and there is no standard that would oblige future databases to do the same. Rather, I think it's application (or data source) level semantics, and should hence be implemented at that level if desired. In full-featured RDBMSs that's actually not very difficult. For example, you can build a function index on UPPER(accession), which gives you indexed access to case-insensitive accessions w/o changing the model itself. As Mark mentioned, Hibernate can be taught to use functions like these too. Does that make sense? 
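A minimal sketch of the function-index approach described above, offered as an illustration rather than as part of the BioSQL schema. It uses Python's sqlite3 module only to stay self-contained (SQLite is not one of the back ends BioSQL ships a schema for in this thread); the two-column bioentry table, the index name bioentry_acc_upper, the helper functions and the accessions are all hypothetical. Stored accessions keep their original case, and case folding happens only in the query.

```python
import sqlite3

# Simplified stand-in for the BioSQL bioentry table (assumption: the real
# table has more columns and constraints than shown here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bioentry ("
    " bioentry_id INTEGER PRIMARY KEY,"
    " accession TEXT NOT NULL)"
)
# Expression ("function") index on the upper-cased accession.
conn.execute("CREATE INDEX bioentry_acc_upper ON bioentry (UPPER(accession))")
# Made-up accessions that differ only by case, plus one unrelated entry.
conn.executemany(
    "INSERT INTO bioentry (accession) VALUES (?)",
    [("ACC0001",), ("acc0001",), ("XYZ0002",)],
)


def find_exact(accession):
    """Case-sensitive lookup, as in the relational model."""
    rows = conn.execute(
        "SELECT accession FROM bioentry WHERE accession = ?"
        " ORDER BY bioentry_id",
        (accession,),
    )
    return [row[0] for row in rows]


def find_ignoring_case(accession):
    """Case-insensitive lookup done on the fly with UPPER()."""
    rows = conn.execute(
        "SELECT accession FROM bioentry"
        " WHERE UPPER(accession) = UPPER(?)"
        " ORDER BY bioentry_id",
        (accession,),
    )
    return [row[0] for row in rows]
```

Because nothing is rewritten on insert, other Bio* bindings reading the same database still see the accessions exactly as they were loaded.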
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Thu Dec 4 10:59:59 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 4 Dec 2008 10:59:59 -0500 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <320fb6e00812040235s6e9d70b6y70b9b8443c7e2312@mail.gmail.com> References: <49369E10.6000502@compbio.dundee.ac.uk> <320fb6e00812040230n3d78061ay96d538ea75d3673e@mail.gmail.com> <320fb6e00812040235s6e9d70b6y70b9b8443c7e2312@mail.gmail.com> Message-ID: On Dec 4, 2008, at 5:35 AM, Peter wrote: > James Procter wrote: >> >> ps. on a side issue - have the various Bio* language bindings >> actually >> been specified formally ? If so - where might I find them ? >> > > I think the answer to that is sadly a no. I agree (with both the sadly and the no). Maybe I have New Years resolution coming at me here ... Indeed though, this needs addressing. I think I was about to start something on the wiki and then got sucked elsewhere. If anyone has energy to start this, please don't wait - wiki allows account creation by anyone. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jimp at compbio.dundee.ac.uk Thu Dec 4 11:17:05 2008 From: jimp at compbio.dundee.ac.uk (James Procter) Date: Thu, 04 Dec 2008 16:17:05 +0000 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: References: <49369E10.6000502@compbio.dundee.ac.uk> Message-ID: <49380281.6070405@compbio.dundee.ac.uk> Hi Hilmar Hilmar Lapp wrote: > Sorry for chiming in a bit late here. Accessions are case sensitive in > BioSQL at the level of the relational model. Ah. OK. This is a definitive answer. 
> In fact, this is enforced for MySQL (which unilaterally chose to treat > the SQL VARCHAR datatype as case-insensitive) by making the type VARCHAR > BINARY. I did wonder about that. This makes sense. > I'm rather disinclined to change that, I have to say. I realize that > many (all?) of the databases we typically use treat accessions as > case-insensitive. But I doubt that that's part of the specs in each > case, and there is no standard that would oblige future databases to do > the same. > > Rather, I think it's application (or data source) level semantics, and > should hence be implemented at that level if desired. In full-featured > RDBMSs that's actually not very difficult. For example, you can build a > function index on UPPER(accession), which gives you indexed access to > case-insensitive accessions w/o changing the model itself. As Mark > mentioned, Hibernate can be taught to use functions like these too. Fair enough. This squarely places the onus on the application/datasource to be careful about case. It also corroborates the transparent behaviour exhibited by Biojava and Bioperl. The worrying aspect is that as far as I can tell, protocols like DAS do not play nicely with this... case is usually ignored for ID lookup. I guess this means that the middleware that does the transformation is always going to have to (optionally) use specific case-insensitive BioSQL language bindings, and the datasource deployer will simply have to configure it accordingly for their BioSQL database. cheers! 
Jim From hlapp at gmx.net Thu Dec 4 11:19:54 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 4 Dec 2008 11:19:54 -0500 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <4937CAC3.4070906@compbio.dundee.ac.uk> References: <49369E10.6000502@compbio.dundee.ac.uk> <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com> <4937B770.7050000@compbio.dundee.ac.uk> <4937C24E.7070402@eaglegenomics.com> <4937CAC3.4070906@compbio.dundee.ac.uk> Message-ID: <49ECA0AB-1ADF-4A83-8645-ABA258CE3F51@gmx.net> On Dec 4, 2008, at 7:19 AM, James Procter wrote: > In a situation where the backend DB is case sensitive, new accessions > inserted into the DB must be forced to be the same case in the same > way > as the accession query string will be forced to be. I would advise against this if you are interested in cross-Bio* interoperability at all. As I pointed out previously, such transformations need not be permanent but can be done on-the-fly. > This modification is again straightforward, but the case change > would also have to be propagated to existing entries in the BioSQL. > Furthermore, modifying the > Biojava-x bindings would only ensure case-insensitivity for Biojava > queries, the same modification would also be necessary for the other > bio-* bindings. Right, and I'm not sure that's a good rule to impose on all implementations. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Thu Dec 4 11:22:53 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 4 Dec 2008 11:22:53 -0500 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <49380281.6070405@compbio.dundee.ac.uk> References: <49369E10.6000502@compbio.dundee.ac.uk> <49380281.6070405@compbio.dundee.ac.uk> Message-ID: <7E1F7CF2-1B67-440D-A92B-29659C03096A@gmx.net> On Dec 4, 2008, at 11:17 AM, James Procter wrote: > The worrying aspect is that as far as I can tell, protocols like DAS > do not play nicely with this... case is usually ignored for ID lookup. Right. Data exchange protocols and query APIs are a different story from the data model of the databases that underpin them. > I guess this means that the middleware that does the transformation > is always going to have to (optionally) use specific case- > insensitive BioSQL language bindings Right. I'm not sure though that this is a big onus. Is it? -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jimp at compbio.dundee.ac.uk Thu Dec 4 11:38:05 2008 From: jimp at compbio.dundee.ac.uk (James Procter) Date: Thu, 04 Dec 2008 16:38:05 +0000 Subject: [BioSQL-l] BioSQL and ontology "standards". 
In-Reply-To: References: Message-ID: <4938076D.7010307@compbio.dundee.ac.uk> Can I just add comment to refocus discussion here: In part its probably my fault, but I think we're starting to mix two (as I see it) distinct aspects of ontologies within the BioSQL schema: Leighton Pritchard wrote: > On 04/12/2008 15:04, "Peter" wrote: >>> With apologies if I'm misinterpreting the tide of discussion, but I would be >>> disappointed to see a default behaviour of "bung everything under >>> 'Annotation Tags', typos and all" become a 'standard' of any sort, rather >>> than a placeholder for future development of ontology-aware Bio* code that >>> queries and populates BioSQL appropriately. >> Overall, I agree. It isn't ideal, but the current ad-hoc "ontology" >> is useful in that its looseness allows any parsable GenBank file to be >> imported into the database. > > I think that this may be a matter of perspective: you see an advantage, I > see an accident waiting to happen ;) I think this aspect of BioSQL ontology is strictly semantic, and can probably only be handled when interpreting the data retrieved from a BioSQL database in specific context. Within a particular type of datamodel, inconsistencies arise in the free-text derived from flat-file data records. These inconsistencies (such as typos, synonyms, case variation, etc) are really aspects to be addressed by data cleaning prior to insertion into the database. If you don't care about putting dirty data into a bioSQL database, then you should still be able to do it, but you should then not expect someone else to connect to the database and make perfect sense of the data. In particular, don't expect a program to magically interpret your mis-annotated 'eXons' as coding regions, for instance, or your disulphide bonds as disulfide (or vice versa). The ramification of this is that in practice, clients that ultimately consume and interpret any kind of BioSQL datastore have to have some form of robustness built in. 
This would be in the same way that file parsers have to cope with the known variations of freetext feature tags in genbank records. In this situation one assumes that the client at least understands that particular terms in the biosql annotation really are freetext feature tags, and that brings us to the other aspect, which in comparison is much more prosaic. The aspect that I was talking about is 'structural' consistency rather than semantic consistency (for want of a better word). For instance, a bioperl Generic Feature has a 'score' attribute - this should map to the 'score' attribute on a BioPython generic feature, and also to a biojava score attribute. As far as I understand it (and I may be wrong here), hierarchical relationships are faithfully preserved when a bio* feature is persisted in BioSQL - and I'd hope all the attributes were too. This kind of thing is definitely worth writing down, and even making test cases against! Just to duplicate what Hilmar wrote in response to the Bio* binding comment here: >>>> ps. on a side issue - have the various Bio* language bindings actually >>> been specified formally ? If so - where might I find them ? >>> >> >> I think the answer to that is sadly a no. > > > I agree (with both the sadly and the no). Maybe I have New Years > resolution coming at me here ... > > Indeed though, this needs addressing. I think I was about to start > something on the wiki and then got sucked elsewhere. If anyone has > energy to start this, please don't wait - wiki allows account creation > by anyone. I guess we (at least Peter, Myself, and anyone else) should get to it then. At least for the structural mapping betwen biosql terms and feature structure! as for the semantic ontology aspect - is there a way that one might tag a biosql database as using external ontologies ? 
Jim

From Bank.Beszteri at awi.de Thu Dec 4 11:28:13 2008
From: Bank.Beszteri at awi.de (Bánk Beszteri)
Date: Thu, 04 Dec 2008 17:28:13 +0100
Subject: [BioSQL-l] alternative taxonomic hierarchies in BioSQL?
Message-ID: <4938051D.4030101@awi.de>

Dear BioSQLers,

do I understand right that the current BioSQL schema allows for a single taxonomy per database only? When looking into the tables taxon and taxon_name, it looks like neither taxa nor their neighborhood relationships can belong to different taxonomies. Is this correct, or am I missing something?

If this is so: are there any plans to add such a feature in the future? I think (besides that I could use it) it could probably be useful for others as well (to have the possibility to e.g. have an ITIS taxonomy or just a user's own private taxonomy parallel to NCBI taxonomy in a single BioSQL DB). I didn't find anything about this on the BioSQL pages, please direct me to the right place if I missed it!

Bank

--
Dr. Bank Beszteri
Bioinformatics Working Group
Alfred Wegener Institute for Polar and Marine Research
Am Handelshafen 12
Bremerhaven
Germany

From biopython at maubp.freeserve.co.uk Thu Dec 4 12:06:59 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 17:06:59 +0000
Subject: [BioSQL-l] alternative taxonomic hierarchies in BioSQL?
In-Reply-To: <4938051D.4030101@awi.de>
References: <4938051D.4030101@awi.de>
Message-ID: <320fb6e00812040906w12533956g859e4ec0415665e3@mail.gmail.com>

On Thu, Dec 4, 2008 at 4:28 PM, Bánk Beszteri wrote:
>
> Dear BioSQLers,
>
> do I understand right that the current BioSQL schema allows for a single
> taxonomy per database only?

Not quite. If you ignore the fact that the taxon table's external taxonomy ID is explicitly labelled as the ncbi_taxon_id, you could store any taxonomy in the taxon and taxon_name tables. You could even have multiple independent taxonomies in these tables.
However, each bioentry can only point to one taxon entry (and thus belongs to only one taxonomy), which is a big limitation.

It would be useful to have a bioentry point to multiple taxon entries (and thus multiple taxonomies, e.g. NCBI and ITIS), which might require some sort of link table between the taxon and bioentry tables. This would also solve the issue of how to support chimeric sequences which has been noted on http://www.biosql.org/wiki/Enhancement_Requests

> When looking into the tables taxon and taxon_name, it looks like neither
> taxa nor their neighborhood relationships can belong to different taxonomies.
> Is this correct, or am I missing something?

True - but why would you want to interlink taxon entries like that?

> If this is so: are there any plans to add such a feature in the future? I
> think (besides that I could use it) it could probably be useful for others
> as well (to have the possibility to e.g. have an ITIS taxonomy or just a
> user's own private taxonomy parallel to NCBI taxonomy in a single BioSQL
> DB). I didn't find anything about this on the BioSQL pages, please direct me
> to the right place if I missed it!

I think the issue has been raised before on the mailing list, and IIRC it was agreed that there was room for improvement. Maybe this is something for BioSQL v1.1.x?

Peter

From biopython at maubp.freeserve.co.uk Thu Dec 4 12:09:38 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 17:09:38 +0000
Subject: [BioSQL-l] BioSQL and ontology "standards".
In-Reply-To: <4938076D.7010307@compbio.dundee.ac.uk>
References: <4938076D.7010307@compbio.dundee.ac.uk>
Message-ID: <320fb6e00812040909o386f2ba6h8d044aba31799dae@mail.gmail.com>

On Thu, Dec 4, 2008 at 4:38 PM, James Procter wrote:
>
> as for the semantic ontology aspect - is there a way that one might tag
> a biosql database as using external ontologies ?
>

I'm not sure what you mean here - you could look at the ontology and term tables.
Peter

From biopython at maubp.freeserve.co.uk Thu Dec 4 12:16:58 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 17:16:58 +0000
Subject: [BioSQL-l] Passwords on biosql databases
In-Reply-To: <492BDCF2.6F09.00E0.0@dundee.ac.uk>
References: <492BDCF2.6F09.00E0.0@dundee.ac.uk>
Message-ID: <320fb6e00812040916k7c12f6c8qb81ec124053c428b@mail.gmail.com>

On Tue, Nov 25, 2008 at 11:09 AM, David Martin wrote:
>
> I have set up a biosql database on Postgres. The Bio::DB::BioDB module
> croaks complaining that it needs the password. I have tried the obvious
> things (-password -passwd and reading what docs I could find) but to no avail.
>
> Any clues?

Your email only just reached me - maybe there was a delay somewhere. However, in case you are still stuck, and assuming no one else has answered in the meantime, I'll try and help. As an aside, I would say this is really a question for the BioPerl mailing list, as it is about the BioPerl bindings to BioSQL.

From a quick look at the BioPerl code for load_seqdatabase.pl, my guess is you need to use "pass" as the argument name, e.g.

my $db = Bio::DB::BioDB->new(-database => "biosql",
                             -printerror => $printerror,
                             -host => $host,
                             -port => $port,
                             -dbname => $dbname,
                             -driver => $driver,
                             -user => $dbuser,
                             -pass => $dbpass,   # "-pass", not "-password" or "-passwd"
                             -dsn => $dsn,
                             -schema => $schema,
                             -initrc => $initrc,
                             );

Peter

From hlapp at gmx.net Thu Dec 4 13:43:17 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 4 Dec 2008 13:43:17 -0500
Subject: [BioSQL-l] Passwords on biosql databases
In-Reply-To: <320fb6e00812040916k7c12f6c8qb81ec124053c428b@mail.gmail.com>
References: <492BDCF2.6F09.00E0.0@dundee.ac.uk> <320fb6e00812040916k7c12f6c8qb81ec124053c428b@mail.gmail.com>
Message-ID:

On Dec 4, 2008, at 12:16 PM, Peter wrote:

> From a quick look at the BioPerl code for load_seqdatabase.pl, my
> guess is you need to use "pass" as the argument name, e.g.
>
> my $db = Bio::DB::BioDB->new(-database => "biosql",
>                              -printerror => $printerror,
>                              -host => $host,
>                              -port => $port,
>                              -dbname => $dbname,
>                              -driver => $driver,
>                              -user => $dbuser,
>                              -pass => $dbpass,
>                              -dsn => $dsn,
>                              -schema => $schema,
>                              -initrc => $initrc,
>                              );

Yes, and the same is true if you are using Bio::DB::BioDB directly. B::D::BioDB::new() accepts the options documented there, plus (as noted there, actually), all options accepted by Bio::DB::SimpleDBContext.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Thu Dec 4 17:07:39 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 4 Dec 2008 17:07:39 -0500
Subject: [BioSQL-l] alternative taxonomic hierarchies in BioSQL?
In-Reply-To: <320fb6e00812040906w12533956g859e4ec0415665e3@mail.gmail.com>
References: <4938051D.4030101@awi.de> <320fb6e00812040906w12533956g859e4ec0415665e3@mail.gmail.com>
Message-ID:

On Dec 4, 2008, at 12:06 PM, Peter wrote:

> On Thu, Dec 4, 2008 at 4:28 PM, Bánk Beszteri wrote:
>>
>> Dear BioSQLers,
>>
>> do I understand right that the current BioSQL schema allows for a single
>> taxonomy per database only?
>
> Not quite. If you ignore the fact that the taxon table's external
> taxonomy ID is explicitly labelled as the ncbi_taxon_id, you could
> store any taxonomy in the taxon and taxon_name tables. You could even
> have multiple independent taxonomies in these tables.

Right. Though it's certainly ugly to call something a ncbi_taxon_id when really it is a ITIS ID, for example.

Aside from that, the load_ncbi_taxonomy.pl script that comes with BioSQL can't really deal with other taxonomies being stored in the taxon tables, too.
First, it will consider all nodes that it can't find in NCBI (by ID) as having been obsoleted and will delete them, and even if it somehow failed to do that, it would fail to compute the nested set enumeration for all other taxonomies. Changing that would basically require namespacing taxon nodes. Though it's an option, it has increasingly struck me as a duplication of what the PhyloDB module provides already (see other comments below), so I am actually less and less in favor of it. I think the appropriate way to look at the taxon tables is as the reference taxonomy for bioentries (and so calling the identifier ncbi_taxon_id is still bad as it prescribes the NCBI taxonomy as the reference). In this context: > However, each bioentry can only point to one taxon entry (and thus > belongs to only one taxonomy), which is a big limitation. This is well motivated in biological applications and current object models. I'm not sure about the other Bio* toolkits, but BioPerl for example doesn't support multiple species objects for a sequence. > It would be useful to have a bioentry point to multiple taxon entries > (and thus multiple taxonomies, e.g. NCBI and ITIS), which might > require some sort of link table between the taxon and bioentry tables. Note that the PhyloDB module supports this. Nodes in a tree (or taxonomy) can be associated with one or more bioentries (and, in fact, reference taxon nodes). > [...] >> When looking into the tables taxon and taxon_name, it looks like >> neither >> taxa nor their neighborhood relationships can belong to different >> taxonomies. >> Is this correct, or am I missing something? > > True - but why would you want to interlink taxon entries like that? There may be use-cases for this. For example, to relate taxa named differently between two taxonomies but that really are synonymous. Or one taxonomy containing a synonym that the other doesn't. Not your molecular sequence database/analysis type of thing, sure. But still legitimate. 
>
>
>> If this is so: are there any plans to add such a feature in the future? I
>> think (besides that I could use it) it could probably be useful for others
>> as well (to have the possibility to e.g. have an ITIS taxonomy

Note that the svn / main trunk version of BioSQL has a script load_itis_taxonomy.pl. It loads it into the PhyloDB module, though. ITIS isn't a single tree but actually 5; there is no common root. So it ends up as 5 trees in the PhyloDB tables.

>> or just a user's own private taxonomy parallel to NCBI taxonomy in
>> a single BioSQL DB).

Yeah; I've been wanting to write a general taxonomy loader, or more precisely a loader that utilizes Bio::TreeIO for reading. Just haven't had time around to do that. (Need another hackathon :-)

> [...] I think the issue has been raised before on the mailing list, and IIRC
> it was agreed that there was room for improvement. Maybe this is
> something for BioSQL v1.1.x?

Fixing the ncbi_taxon_id column name definitely. As for letting the taxon tables duplicate the same capabilities as the PhyloDB tables, I'm not sure that that's the best route to go.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From chapmanb at 50mail.com Sun Dec 14 20:17:50 2008
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 14 Dec 2008 20:17:50 -0500
Subject: [BioSQL-l] BioSQL and ontology "standards".
In-Reply-To: <320fb6e00812040909o386f2ba6h8d044aba31799dae@mail.gmail.com>
References: <4938076D.7010307@compbio.dundee.ac.uk> <320fb6e00812040909o386f2ba6h8d044aba31799dae@mail.gmail.com>
Message-ID: <20081215011749.GA59046@kunkel>

Hi all;

I wanted to reply to the BioSQL ontology discussion Peter started up last month.
He summarized how naming is currently done in the ontology and term tables and some of the potential downsides to that: > Currently BioPerl and Biopython (and I assume the other projects but > haven't checked) use a couple of ad-hoc ontology names for storing > annotation. In particular, if there is no predefined entry for a > novel ontology term, it gets added on the fly. This is very > convenient as it means a BioSQL database can be used without first > importing a predefined ontology. However there are downsides, for > example spelling errors in the keys of a GenBank file get treated as > a ontology entries. There was some general consensus that a more formalized, or at least documented, naming scheme would be good, provided there is some leniency for adding terms if they don't fall into the scheme. I agree, and think this suggestion by Peter is good: > On a related point, it might make more sense to use a predefined > ontology, like SOFA or SO from http://www.sequenceontology.org/ Towards a start for this, I put together a mapping of GenBank header, feature and qualifier keys to the SO ontology (and also standard ontologies like Dublin Core). If this is a direction we'd like to go, this would provide the high level documentation for reference implementations. It is currently about 3/4 finished but should give a good notion; I'd need some help from someone more familiar with SO for some of the missing terms. It got a little out of control for a mailing list post, so I wrote up the motivation and details here: http://bcbio.wordpress.com/2008/12/14/standard-ontologies-in-biosql/ The tab delimited mapping file with GenBank terms to ontology terms is there as a starting place. 
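The trade-off discussed in this thread, between adding unknown annotation keys on the fly and rejecting them, might be sketched as follows. The mapping rows are hypothetical placeholders, not the contents of the mapping file linked above, and load_mapping and resolve are illustrative names rather than any Bio* API.

```python
import csv
import io

# Hypothetical two-column, tab-delimited mapping:
# GenBank key <TAB> ontology term name.
MAPPING_TSV = "CDS\tCDS\ngene\tgene\nexon\texon\n"


def load_mapping(text):
    """Parse tab-delimited text into a {genbank_key: term_name} dict."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return {key: term for key, term in reader}


def resolve(key, mapping, strict=True):
    """Return the ontology term for a GenBank key.

    strict=True  -> unknown keys raise KeyError, so typos surface at load time.
    strict=False -> unknown keys are added as-is (the permissive behaviour).
    """
    if key in mapping:
        return mapping[key]
    if strict:
        raise KeyError("unknown annotation key: %r" % key)
    mapping[key] = key
    return key
```

With strict=True a misspelt key such as "eXon" is reported instead of silently becoming a new term; with strict=False it is recorded, typo and all.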
Brad

From vanaquisl at gmail.com Mon Dec 15 05:24:06 2008
From: vanaquisl at gmail.com (vanaquisl vanaquisl)
Date: Mon, 15 Dec 2008 11:24:06 +0100
Subject: [BioSQL-l] SQLite support
Message-ID: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com>

Hi folks,

I am a brand new BioSQL user and I'd like to know if there will be SQLite support in the near future? I am about to use SQLite for a standalone lightweight Biopython application and it would be very helpful if I could use BioSQL directly.

Thanks for any advice.

V.

From biopython at maubp.freeserve.co.uk Mon Dec 15 05:43:33 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 15 Dec 2008 10:43:33 +0000
Subject: [BioSQL-l] SQLite support
In-Reply-To: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com>
References: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com>
Message-ID: <320fb6e00812150243w4b0dc223g40abcf684af1ccf5@mail.gmail.com>

On Mon, Dec 15, 2008, vanaquisl vanaquisl wrote:
>
> Hi folks,
>
> I am a brand new BioSQL user and I'd like to know if there will be a SQLite
> support in the near future?

As far as I know, BioSQL does not currently include a schema for SQLite.

> I am about to use SQLite for a standalone lightweight Biopython application
> and it would be very helpful if I could use BioSQL directly.

Biopython can talk to a BioSQL database run on MySQL (using MySQLdb) or PostgreSQL (using either Psycopg or Psycopg2). In theory it could be extended to use any supported BioSQL database (provided there are Python bindings for the database software; in the case of SQLite, using pysqlite would probably be fine).

Peter

From Bank.Beszteri at awi.de Fri Dec 5 05:23:20 2008
From: Bank.Beszteri at awi.de (Bánk Beszteri)
Date: Fri, 05 Dec 2008 11:23:20 +0100
Subject: [BioSQL-l] alternative taxonomic hierarchies in BioSQL?
In-Reply-To:
References: <4938051D.4030101@awi.de> <320fb6e00812040906w12533956g859e4ec0415665e3@mail.gmail.com>
Message-ID: <49390118.9030104@awi.de>

Hi Hilmar & Peter,

so it looks like using PhyloDB is probably the way to explore further for this, I somehow missed that point (I limited myself to think within the biosql core tables). Thanks for this direction and all the ideas / insights!

Bank

Hilmar Lapp wrote:
>
> On Dec 4, 2008, at 12:06 PM, Peter wrote:
>
>> On Thu, Dec 4, 2008 at 4:28 PM, Bánk Beszteri wrote:
>>>
>>> Dear BioSQLers,
>>>
>>> do I understand right that the current BioSQL schema allows for a single
>>> taxonomy per database only?
>>
>> Not quite. If you ignore the fact that the taxon table's external
>> taxonomy ID is explicitly labelled as the ncbi_taxon_id, you could
>> store any taxonomy in the taxon and taxon_name tables. You could even
>> have multiple independent taxonomies in these tables.
>
> Right. Though it's certainly ugly to call something a ncbi_taxon_id
> when really it is a ITIS ID, for example.
>
> Aside from that, the load_ncbi_taxonomy.pl script that comes with
> BioSQL can't really deal with other taxonomies being stored in the
> taxon tables, too. First, it will consider all nodes that it can't
> find in NCBI (by ID) as having been obsoleted and will delete them,
> and even if it somehow failed to do that, it would fail to compute the
> nested set enumeration for all other taxonomies.
>
> Changing that would basically require namespacing taxon nodes. Though
> it's an option, it has increasingly struck me as a duplication of what
> the PhyloDB module provides already (see other comments below), so I
> am actually less and less in favor of it.
>
> I think the appropriate way to look at the taxon tables is as the
> reference taxonomy for bioentries (and so calling the identifier
> ncbi_taxon_id is still bad as it prescribes the NCBI taxonomy as the
> reference).
In this context: > >> However, each bioentry can only point to one taxon entry (and thus >> belongs to only one taxonomy), which is a big limitation. > > This is well motivated in biological applications and current object > models. I'm not sure about the other Bio* toolkits, but BioPerl for > example doesn't support multiple species objects for a sequence. > >> It would be useful to have a bioentry point to multiple taxon entries >> (and thus multiple taxonomies, e.g. NCBI and ITIS), which might >> require some sort of link table between the taxon and bioentry tables. > > Note that the PhyloDB module supports this. Nodes in a tree (or > taxonomy) can be associated with one or more bioentries (and, in fact, > reference taxon nodes). > >> [...] >>> When looking into the tables taxon and taxon_name, it looks like >>> neither >>> taxa nor their neighborhood relationships can belong to different >>> taxonomies. >>> Is this correct, or am I missing something? >> >> True - but why would you want to interlink taxon entries like that? > > There may be use-cases for this. For example, to relate taxa named > differently between two taxonomies but that really are synonymous. Or > one taxonomy containing a synonym that the other doesn't. > > Not your molecular sequence database/analysis type of thing, sure. But > still legitimate. > >> >> >>> If this is so: are there any plans to add such a feature in the >>> future? I >>> think (besides that I could use it) it could probably be useful for >>> others >>> as well (to have the possibility to e.g. have an ITIS taxonomy > > Note that the svn / main trunk version of BioSQL has a script > load_itis_taxonomy.pl. It loads it into the PhyloDB module, though. > ITIS isn't a single tree but actually 5; there is no common root. So > it ends up as 5 trees in the PhyloDB tables. > >>> or just a user's own private taxonomy parallel to NCBI taxonomy in a >>> single BioSQL >>> DB).
> > Yeah; I've been wanting to write a general taxonomy loader, or more > precisely a loader that utilizes Bio::TreeIO for reading. Just haven't > had time around to do that. (Need another hackathon :-) > >> [...] I think the issue has been raised before on the mailing list, >> and IIRC >> it was agreed that there was room for improvement. Maybe this is >> something for BioSQL v1.1.x? > > Fixing the ncbi_taxon_id column name definitely. As for letting the > taxon tables duplicate the same capabilities as the PhyloDB tables, > I'm not sure that that's the best route to go. > > -hilmar > From biopython at maubp.freeserve.co.uk Wed Dec 17 09:32:36 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 17 Dec 2008 14:32:36 +0000 Subject: [BioSQL-l] The bioentry and biosequence tables In-Reply-To: <320fb6e00812170627ma0d87fdi4af3da0fc65e7bfa@mail.gmail.com> References: <320fb6e00812170627ma0d87fdi4af3da0fc65e7bfa@mail.gmail.com> Message-ID: <320fb6e00812170632k1f7b993bq843fc36c38588673@mail.gmail.com> Hello all, Looking at the BioSQL schema, it initially appears that each bioentry can have multiple biosequence entries (each with their own alphabet, version and length). On reading http://biosql.org/wiki/Schema_Overview#BIOSEQUENCE I see that in fact the schema should make sure each bioentry can have at most one biosequence - which is good news. If this was not the case, then the location of any seqfeature locations would be ambiguous. Was there a reason for not just putting optional sequence, length and alphabet fields into the bioentry table directly (instead of having a separate biosequence table)? Does doing it as a separate table speed up accessing the core (non-sequence) bioentry information? 
Thanks, Peter From biopython at maubp.freeserve.co.uk Wed Dec 17 09:55:02 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 17 Dec 2008 14:55:02 +0000 Subject: [BioSQL-l] The bioentry and biosequence tables In-Reply-To: References: <320fb6e00812170627ma0d87fdi4af3da0fc65e7bfa@mail.gmail.com> <320fb6e00812170632k1f7b993bq843fc36c38588673@mail.gmail.com> Message-ID: <320fb6e00812170655j6f0811ber817cc69b1dcb159e@mail.gmail.com> On Wed, Dec 17, 2008 at 2:49 PM, Hilmar Lapp wrote: > > The bioentry table is for any biological database entry with a stable and > unique identifier. > > In practical terms most of these will be sequence database entries, but they > don't have to be. Among the not so far fetched examples are gene > records/models (such as from LocusLink or Entrez Gene) and (e.g. EST or > protein) sequence clusters. Yes - as another example I was thinking you could import an NCBI Protein Tables (PTT) table like this, you'd have lots of seqfeature entries but no actual sequence. > A (at present) more exotic example would be museum specimen records. > >> Does doing it as a separate table speed up accessing the core >> (non-sequence) bioentry information? > > Possibly. To what extent will depend on the RDBMS, obviously. But at least > several years ago, when BioSQL was first designed, some RDBMSs would indeed > be faster in full table scans if the table didn't contain an LOB. OK - I thought that might have been the case. > The way to look at it conceptually for us has been similar to object > orientation. The bioentry table is the base table for all biodatabase > entries, and the biosequence table is in joined for derived objects that > also have a sequence. > > Does that make sense? Yes it does. Thanks. 
Peter From hlapp at gmx.net Wed Dec 17 09:49:28 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 17 Dec 2008 09:49:28 -0500 Subject: [BioSQL-l] The bioentry and biosequence tables In-Reply-To: <320fb6e00812170632k1f7b993bq843fc36c38588673@mail.gmail.com> References: <320fb6e00812170627ma0d87fdi4af3da0fc65e7bfa@mail.gmail.com> <320fb6e00812170632k1f7b993bq843fc36c38588673@mail.gmail.com> Message-ID: On Dec 17, 2008, at 9:32 AM, Peter wrote: > Was there a reason for not just putting optional sequence, length and > alphabet fields into the bioentry table directly (instead of having a > separate biosequence table)? The bioentry table is for any biological database entry with a stable and unique identifier. In practical terms most of these will be sequence database entries, but they don't have to be. Among the not so far fetched examples are gene records/models (such as from LocusLink or Entrez Gene) and (e.g. EST or protein) sequence clusters. A (at present) more exotic example would be museum specimen records. > Does doing it as a separate table speed up accessing the core (non- > sequence) bioentry information? Possibly. To what extent will depend on the RDBMS, obviously. But at least several years ago, when BioSQL was first designed, some RDBMSs would indeed be faster in full table scans if the table didn't contain an LOB. The way to look at it conceptually for us has been similar to object orientation. The bioentry table is the base table for all biodatabase entries, and the biosequence table is in joined for derived objects that also have a sequence. Does that make sense? 
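The base-table / joined-table arrangement described above can be sketched as follows. This is only a minimal illustration in Python with an in-memory SQLite database: the two tables are cut-down, hypothetical stand-ins for bioentry and biosequence (the real schema has more columns and constraints), and the accessions are made up.

```python
import sqlite3

# Cut-down stand-ins for the BioSQL tables (assumption: the real schema
# has many more columns; only the relationship is illustrated here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bioentry (
        bioentry_id INTEGER PRIMARY KEY,
        accession   TEXT NOT NULL
    );
    -- Using bioentry_id as the primary key allows at most one
    -- biosequence row per bioentry.
    CREATE TABLE biosequence (
        bioentry_id INTEGER PRIMARY KEY REFERENCES bioentry(bioentry_id),
        alphabet    TEXT,
        length      INTEGER,
        seq         TEXT
    );
""")

# One entry with a sequence, one without (e.g. a gene record).
# Both accessions are invented for this example.
conn.execute("INSERT INTO bioentry VALUES (1, 'X00001')")
conn.execute("INSERT INTO bioentry VALUES (2, 'GENE0001')")
conn.execute("INSERT INTO biosequence VALUES (1, 'dna', 4, 'ACGT')")


def fetch_entries(conn):
    """Return every bioentry, joined to its sequence details if it has any."""
    return conn.execute(
        """
        SELECT b.accession, s.alphabet, s.length
        FROM bioentry AS b
        LEFT JOIN biosequence AS s ON s.bioentry_id = b.bioentry_id
        ORDER BY b.bioentry_id
        """
    ).fetchall()


rows = fetch_entries(conn)
```

Entries without a sequence simply come back with NULLs in the sequence columns, and a scan of bioentry alone never touches the (potentially large) seq column.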
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From dmitry.turin at belarusbank.minsk.by Mon Dec 29 12:41:21 2008 From: dmitry.turin at belarusbank.minsk.by (Dmitry Turin) Date: Mon, 29 Dec 2008 19:41:21 +0200 Subject: [BioSQL-l] PostGraph II Message-ID: <666803384.20081229194121@belarusbank.minsk.by> Hi, Biosql-l. I have carefully read the presentation "BioPostgres_Overview_BOSC_2006.ppt" and the paper at "phenomics.cs.ucla.edu/PostGraph/", to which the presentation refers. I venture to suggest implementing the statements [1] SELECT * FROM table1.table2.table3; UPDATE table1.table2.table3 SET ... ; DELETE * FROM table1.table2.table3; instead of or in addition to PostGraph, because these statements are more powerful and more convenient for working with graphs. [1] blogs.ingres.com/technology/2008/07/31/bringing-dbms-in-line-with-modern-communication-requirements-sql2009/ sql50.euro.ru/sql5.16.4.pdf Dmitry (SQL50, HTML60) From jimp at compbio.dundee.ac.uk Wed Dec 3 14:56:16 2008 From: jimp at compbio.dundee.ac.uk (James Procter) Date: Wed, 03 Dec 2008 14:56:16 +0000 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate Message-ID: <49369E10.6000502@compbio.dundee.ac.uk> Hi. I was wondering - are bioentry accessions intended to be case insensitive under the archetypal BioSQL schema? If so - are the various Bio* language bindings supposed to honour the case insensitivity? The reason that I am asking is that I have just encountered this issue in relation to biosql backed sequence database queries made via biojava-x's hibernate bindings with a biosql 1.01 install on PostgreSQL. If Bioentry accessions are, in fact, case insensitive, then I can petition for an update to the biojava-x bindings to honour this. If not, then I'll just continue with my own non-standard hack ;) thanks in advance. Jim ps.
on a side issue - have the various Bio* language bindings actually been specified formally? If so - where might I find them? -- ------------------------------------------------------------------- J. B. Procter (ENFIN/VAMSAS) Barton Bioinformatics Research Group Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk The University of Dundee is a Scottish Registered Charity, No. SC015096. From markjschreiber at gmail.com Thu Dec 4 06:00:30 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Thu, 4 Dec 2008 14:00:30 +0800 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <49369E10.6000502@compbio.dundee.ac.uk> References: <49369E10.6000502@compbio.dundee.ac.uk> Message-ID: <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com> Good question... Are there any situations where Accessions would not be case insensitive? I feel they should be case-insensitive but partly this is going to depend on Hibernate's default behaviour and the default of the underlying DB. Some DBs are case insensitive by default; if this is the case then Hibernate's behaviour will probably not affect this. - Mark On Wed, Dec 3, 2008 at 10:56 PM, James Procter wrote: > > Hi. > > I was wondering - are bioentry accessions intended to be case > insensitive under the archetypal BioSQL schema? > > If so - are the various Bio* language bindings supposed to honour > the case insensitivity? > > The reason that I am asking is that I have just encountered this issue > in relation to biosql backed sequence database queries made via > biojava-x's hibernate bindings with a biosql 1.01 install on PostgreSQL. > > If Bioentry accessions are, in fact, case insensitive, then I can > petition for an update to the biojava-x bindings to honour this. If not, > then I'll just continue with my own non-standard hack ;) > > thanks in advance. > Jim > > ps.
on a side issue - have the various Bio* language bindings actually > been specified formally ? If so - where might I find them ? > > -- > ------------------------------------------------------------------- > J. B. Procter (ENFIN/VAMSAS) Barton Bioinformatics Research Group > Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk > The University of Dundee is a Scottish Registered Charity, No. SC015096. > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l From biopython at maubp.freeserve.co.uk Thu Dec 4 10:35:55 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 4 Dec 2008 10:35:55 +0000 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <320fb6e00812040230n3d78061ay96d538ea75d3673e@mail.gmail.com> References: <49369E10.6000502@compbio.dundee.ac.uk> <320fb6e00812040230n3d78061ay96d538ea75d3673e@mail.gmail.com> Message-ID: <320fb6e00812040235s6e9d70b6y70b9b8443c7e2312@mail.gmail.com> Sending again - last time I didn't use my mailing list email address - sorry James, you'll get this twice. James Procter wrote: > > ps. on a side issue - have the various Bio* language bindings actually > been specified formally ? If so - where might I find them ? > I think the answer to that is sadly a no. For Biopython work, I have been treating BioPerl as the reference implementation BioSQL, and have tried to get some details clarified here on this list, e.g. 
regarding ontologies: http://lists.open-bio.org/pipermail/biosql-l/2008-November/001412.html http://lists.open-bio.org/pipermail/biosql-l/2008-November/001414.html http://lists.open-bio.org/pipermail/biosql-l/2008-November/001413.html Peter From jimp at compbio.dundee.ac.uk Thu Dec 4 10:56:48 2008 From: jimp at compbio.dundee.ac.uk (James Procter) Date: Thu, 04 Dec 2008 10:56:48 +0000 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com> References: <49369E10.6000502@compbio.dundee.ac.uk> <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com> Message-ID: <4937B770.7050000@compbio.dundee.ac.uk> Thanks for the reply - Mark. Mark Schreiber wrote: > Are there any situations where Accessions would not be case > insensitive? AFAIK the public sequence, structure and gene id databases all have case insensitive Accessions, so I'd reckon 'No' is the answer there. However, it's easy to imagine some legacy in-house databases relying on case sensitivity in some way. > I feel they should be case-insensitive but partly this is > going to depend on Hibernate's default behaviour and the default of the > underlying DB. Some DBs are case insensitive by default; if this is > the case then Hibernate's behaviour will probably not affect this. Again, agreed. Assuming for the moment that accessions are case insensitive in BioSQL, then case insensitivity should also be built into the BioSQL schema implementations for each DB, and a case insensitive column should be used if the DB supports it. However, regardless of whether the DB supports that kind of attribution, the language bindings should also have the case-insensitivity built in. In the case of Hibernate, the bindings have to be matched to the underlying database anyway, so I guess the problem I encountered is really due to a bug in the biojavax schema.
of course - if case insensitivity isn't in the BioSQL spec - then its all immaterial, so the question still remains. Can anyone enlighten us further ? Jim ps. I've started another thread for the reply regarding biosql/Bio* object mappings. From jimp at compbio.dundee.ac.uk Thu Dec 4 11:10:31 2008 From: jimp at compbio.dundee.ac.uk (James Procter) Date: Thu, 04 Dec 2008 11:10:31 +0000 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <320fb6e00812040235s6e9d70b6y70b9b8443c7e2312@mail.gmail.com> References: <49369E10.6000502@compbio.dundee.ac.uk> <320fb6e00812040230n3d78061ay96d538ea75d3673e@mail.gmail.com> <320fb6e00812040235s6e9d70b6y70b9b8443c7e2312@mail.gmail.com> Message-ID: <4937BAA7.9030003@compbio.dundee.ac.uk> Peter wrote: > Sending again - last time I didn't use my mailing list email address - > sorry James, you'll get this twice. no worries. > > James Procter wrote: >> ps. on a side issue - have the various Bio* language bindings actually >> been specified formally ? If so - where might I find them ? >> > > I think the answer to that is sadly a no. For Biopython work, I have > been treating BioPerl as the reference implementation BioSQL, and have > tried to get some details clarified here on this list, e.g. regarding > ontologies: > > http://lists.open-bio.org/pipermail/biosql-l/2008-November/001412.html > http://lists.open-bio.org/pipermail/biosql-l/2008-November/001414.html > http://lists.open-bio.org/pipermail/biosql-l/2008-November/001413.html Ah - yes - I wasn't quite up to speed on that thread. I think its probably a better place to continue this discussion.... Jim. 
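For the accession question in this thread: one way a binding could build case-insensitivity in, whatever the database's default comparison is, is to fold both the stored value and the query term to a single case at comparison time. A minimal sketch in Python with an in-memory SQLite database; the table is a cut-down, hypothetical stand-in for bioentry, and the accessions and the `find_by_accession` helper are made up for illustration, not taken from any Bio* binding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down stand-in for the bioentry table (assumption: one column is
# enough to show the comparison).
conn.execute(
    "CREATE TABLE bioentry ("
    " bioentry_id INTEGER PRIMARY KEY,"
    " accession TEXT NOT NULL)"
)
# Made-up accessions stored in mixed case.
conn.executemany(
    "INSERT INTO bioentry (accession) VALUES (?)",
    [("P12345",), ("q9xyz1",)],
)


def find_by_accession(conn, accession):
    """Return the ids of bioentries whose accession matches, ignoring case.

    The query term is lower-cased in the client and the stored value is
    lower-cased by the database, so the result does not depend on the
    default case sensitivity of '='. Accessions are assumed to be ASCII
    (SQLite's built-in lower() only folds ASCII characters).
    """
    rows = conn.execute(
        "SELECT bioentry_id FROM bioentry"
        " WHERE lower(accession) = ? ORDER BY bioentry_id",
        (accession.lower(),),
    ).fetchall()
    return [row[0] for row in rows]


hits = find_by_accession(conn, "p12345")
```

Note that a plain index on the accession column would not normally serve this predicate; where the database supports it, an index on the lower-cased expression would be the matching choice.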
From holland at eaglegenomics.com Thu Dec 4 11:43:10 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Thu, 04 Dec 2008 11:43:10 +0000 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <4937B770.7050000@compbio.dundee.ac.uk> References: <49369E10.6000502@compbio.dundee.ac.uk> <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com> <4937B770.7050000@compbio.dundee.ac.uk> Message-ID: <4937C24E.7070402@eaglegenomics.com> Hibernate reflects the case-sensitivity of the database it is connected to. There's no options in it for changing that, and so you have to use it the way the database underneath expects you to. However, you can modify your queries so that you convert your query terms to a specific case in advance of the search using the toUpper() or toLower() functions of the String class, then when performing the search, use HQL's lower() or upper() functions inside the query HQL to convert values to the same case when making the comparisons. Someone will need to search through the BJX code to find the spots where explicit queries are made against accessions or any other case-insensitive data, then modify it to use the above technique. This would possibly involve introducing HQL queries where currently only direct Hibernate object references are being made. cheers, Richard James Procter wrote: > Thanks for the reply - Mark. > > Mark Schreiber wrote: >> Are there any situations where Accessions would not be case >> insensitive? > AFAIK the public sequence, structure and gene id databases all have case > insensitive Accessions, so I'd reckon 'No' being the answer there. > However, its easy to imagine that some legacy in-house databases relying > on case sensitivity in some way. > I feel they should be case-insensitive but partly this is >> going to depend on Hibernates default behaivor and the default of the >> underlying DB. 
Some DBs are case insensitive by default; if this is >> the case then Hibernate's behaviour will probably not affect this. > Again, agreed. > > Assuming for the moment that accessions are case insensitive in BioSQL, > then case insensitivity should also be built into the BioSQL schema > implementations for each DB, and a case insensitive column should be > used if the DB supports it. However, regardless of whether the DB > supports that kind of attribution, the language bindings should also > have the case-insensitivity built in. In the case of Hibernate, the > bindings have to be matched to the underlying database anyway, so I > guess the problem I encountered is really due to a bug in the biojavax > schema. > > of course - if case insensitivity isn't in the BioSQL spec - then it's > all immaterial, so the question still remains. Can anyone enlighten us > further? > > Jim > > ps. I've started another thread for the reply regarding biosql/Bio* > object mappings. > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From jimp at compbio.dundee.ac.uk Thu Dec 4 11:45:24 2008 From: jimp at compbio.dundee.ac.uk (James Procter) Date: Thu, 04 Dec 2008 11:45:24 +0000 Subject: [BioSQL-l] BioSQL and ontology "standards". In-Reply-To: <320fb6e00811281204i3bae31e4kc18f70121244b4d1@mail.gmail.com> References: <320fb6e00811281057r2d3a1145j3072b6a537112e12@mail.gmail.com> <49304392.4080908@eaglegenomics.com> <320fb6e00811281204i3bae31e4kc18f70121244b4d1@mail.gmail.com> Message-ID: <4937C2D4.1020701@compbio.dundee.ac.uk> Hi - I'm very sorry to break the thread a little - particularly with the deep discussion that's going on. Peter drew my attention to the thread in his reply to my ps.
on another thread: Peter's reply to my original PS: >> ps. on a side issue - have the various Bio* language bindings actually >> been specified formally ? If so - where might I find them ? >> > > I think the answer to that is sadly a no. For Biopython work, I have > been treating BioPerl as the reference implementation BioSQL. Peter wrote: > On Fri, Nov 28, 2008 at 7:16 PM, Richard Holland wrote: >> BioJava does what BioPerl does and pretty much makes it up as it goes >> along, using whatever the input files tell it. > > OK, good. As a brutal summary, leaving all Peter's questions unanswered, that statement suggests a consensus - BioPerl is the 'reference' mapping. However, I personally do not yet know enough about each Bio* sequence feature structure to verify that this is the case. >> I think the best approach is to always to use what the file says, and >> trust that it's accurate. What needs to be agreed between projects is >> any additional annotations that get introduced outside the context of >> file parsing, and the names of the ontologies used for the file >> annotations so that all projects use the same ontologies and don't >> replicate them inside the BioSQL database. It would be nice to >> standardise these names and the additional custom terms across the >> projects, in much the same way as people tried already to standardise >> the way general objects get mapped to BioSQL. > > This is what I am trying to get at here - documenting the existing "ad > hoc" ontology usage. My impression is that it has not been > documented, and that the BioPerl behaviour is the defacto BioSQL > standard. > > I'd like to pin down this standard, and extend it for situations like > the location_qualifier_value.term_id and perhaps location.term_id > where BioPerl seems to ignore the ontology issue. I'm adding my support for documentation here. 
However, to put into perspective why this verification is necessary, I should explain my problem: I've been evaluating the use of BioSQL as a back end database for DAS source deployment. We are using both BioPerl and BioJava to interact with the BioSQL database, but ultimately aim to serve bioentry annotation as DAS features. This means that there needs to be a clear between a BioSQL bioentry's annotation and the attributes of one or more DAS features, and that mapping needs to be honoured by all the Bio* object bindings utilised by the various programs interacting with the BioSQL database. DAS features are actually pretty simple. To begin with, I'm only interested in unambiguously mapping the core DAS/1 feature attributes: - location (start,end and strand) - type (which may additionally have a sequence annotation ontology term) - label (free text relating to the type term) - feature score (again associated with the type) - URLs (often added as href properties) - Method (free text but often has associated evidence code) - notes (free text which may include additional ontological terms) I'm building on the mapping started by Benjamin Schuster Bockler and implemented in Dazzle. However, I've already run into some mismatches and I now need to clarify whether we are misusing the BioPerl sequence feature binding, or whether the Biojava->DAS part of the mapping is broken. A formal specification, or at the very least a mapping diagram, is therefore pretty much essential. It will also enable better 'out of the box' support for access to BioSQL datasources in other applications. Jim. 
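The core DAS/1 feature attributes listed above could be collected in a small record type before deciding how each one maps onto BioSQL. The sketch below is only an illustration in Python: the class name, field names and the strand encoding ('+', '-', '0') are assumptions made for this example, not the Dazzle or BioJava mapping.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DasFeature:
    """Hypothetical container for the core DAS/1 feature attributes."""

    start: int
    end: int
    strand: str                               # assumed encoding: '+', '-' or '0'
    feature_type: str                         # the DAS type
    type_ontology_term: Optional[str] = None  # optional ontology term for the type
    label: str = ""                           # free text relating to the type
    score: Optional[float] = None             # feature score, if any
    method: str = ""                          # free text, may carry an evidence code
    links: List[str] = field(default_factory=list)  # URLs
    notes: List[str] = field(default_factory=list)  # free-text notes

    def __post_init__(self):
        # Basic sanity checks on the location.
        if self.start > self.end:
            raise ValueError("start must not be greater than end")
        if self.strand not in ("+", "-", "0"):
            raise ValueError("strand must be '+', '-' or '0'")


# A made-up feature, purely for illustration.
feature = DasFeature(start=10, end=42, strand="+", feature_type="exon",
                     label="exon 1", notes=["made-up example"])
```

Writing the attributes down like this makes it easier to tabulate, field by field, which BioSQL table and column each Bio* binding reads it from.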
From jimp at compbio.dundee.ac.uk Thu Dec 4 12:19:15 2008 From: jimp at compbio.dundee.ac.uk (James Procter) Date: Thu, 04 Dec 2008 12:19:15 +0000 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <4937C24E.7070402@eaglegenomics.com> References: <49369E10.6000502@compbio.dundee.ac.uk> <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com> <4937B770.7050000@compbio.dundee.ac.uk> <4937C24E.7070402@eaglegenomics.com> Message-ID: <4937CAC3.4070906@compbio.dundee.ac.uk> Thanks for replying... Richard Holland wrote: > Hibernate reflects the case-sensitivity of the database it is connected > to. There's no options in it for changing that, and so you have to use > it the way the database underneath expects you to. agreed. > However, you can modify your queries so that you convert your query > terms to a specific case in advance of the search using the toUpper() or > toLower() functions of the String class, then when performing the > search, use HQL's lower() or upper() functions inside the query HQL to > convert values to the same case when making the comparisons. agreed (again). > > Someone will need to search through the BJX code to find the spots where > explicit queries are made against accessions or any other > case-insensitive data, then modify it to use the above technique. This > would possibly involve introducing HQL queries where currently only > direct Hibernate object references are being made. ok - this isn't too hard to do. Do you know who currently maintains the BiojavaX bindings to BJX ? I can send them patches - or I guess submit them directly... but.. it wont fix the issue. In a situation where the backend DB is case sensitive, new accessions inserted into the DB must be forced to be the same case in the same way as the accession query string will be forced to be. This modification is again straightforward, but the case change would also have to be propagated to existing entries in the BioSQL. 
Furthermore, modifying the Biojava-x bindings would only ensure case-insensitivity for Biojava queries; the same modification would also be necessary for the other bio-* bindings. This isn't actually a burning issue for me at the moment even though it might sound like it from the way I'm posting about it - I had already hacked biojava-x to accommodate the case for my specific BioSQL database. However, it is important to clarify this situation in order to ensure interoperability between biosql databases and off-the-shelf middleware. all the best! Jim From holland at eaglegenomics.com Thu Dec 4 12:27:03 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Thu, 04 Dec 2008 12:27:03 +0000 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <4937CAC3.4070906@compbio.dundee.ac.uk> References: <49369E10.6000502@compbio.dundee.ac.uk> <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com> <4937B770.7050000@compbio.dundee.ac.uk> <4937C24E.7070402@eaglegenomics.com> <4937CAC3.4070906@compbio.dundee.ac.uk> Message-ID: <4937CC97.3060602@eaglegenomics.com> I'm the maintainer for BJX so you can post patches to me. However, I agree that a consensus needs to be reached on how to store this data amongst the various projects. There's no point in one fixing it one way and another fixing it the other. This is, from my point of view, the main problem with having something like BioSQL defined as a schema rather than as an API that can include defined behaviour (business logic) as well as database structure (the Oracle version for instance would expose a public set of PL/SQL stored procedures and projects would then interact solely via those). Also, BioJava itself may not be case-insensitive (can't remember how I coded it now...) so would need changes throughout to things like equals() methods as well (this has implications going back into the original code as well as the BJX extensions which build on that).
cheers, Richard James Procter wrote: > Thanks for replying... > > Richard Holland wrote: >> Hibernate reflects the case-sensitivity of the database it is connected >> to. There's no options in it for changing that, and so you have to use >> it the way the database underneath expects you to. > agreed. > >> However, you can modify your queries so that you convert your query >> terms to a specific case in advance of the search using the toUpper() or >> toLower() functions of the String class, then when performing the >> search, use HQL's lower() or upper() functions inside the query HQL to >> convert values to the same case when making the comparisons. > agreed (again). >> Someone will need to search through the BJX code to find the spots where >> explicit queries are made against accessions or any other >> case-insensitive data, then modify it to use the above technique. This >> would possibly involve introducing HQL queries where currently only >> direct Hibernate object references are being made. > ok - this isn't too hard to do. Do you know who currently maintains the > BiojavaX bindings to BJX ? I can send them patches - or I guess submit > them directly... but.. it wont fix the issue. > > In a situation where the backend DB is case sensitive, new accessions > inserted into the DB must be forced to be the same case in the same way > as the accession query string will be forced to be. This modification is > again straightforward, but the case change would also have to be > propagated to existing entries in the BioSQL. Furthermore, modifying the > Biojava-x bindings would only ensure case-insensitivity for Biojava > queries, the same modification would also be necessary for the other > bio-* bindings. > > This isn't actually a burning issue for me at the moment even though it > might sound like it from the way I'm posting about it - I had already > hacked biojava-x to accomodate the case for my specific BioSQL database. 
> However, it is important to clarify this situation in order to ensure > interoperability between biosql databases and off the shelf middleware. > > all the best! > Jim > > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From jimp at compbio.dundee.ac.uk Thu Dec 4 12:41:27 2008 From: jimp at compbio.dundee.ac.uk (James Procter) Date: Thu, 04 Dec 2008 12:41:27 +0000 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <4937CC97.3060602@eaglegenomics.com> References: <49369E10.6000502@compbio.dundee.ac.uk> <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com> <4937B770.7050000@compbio.dundee.ac.uk> <4937C24E.7070402@eaglegenomics.com> <4937CAC3.4070906@compbio.dundee.ac.uk> <4937CC97.3060602@eaglegenomics.com> Message-ID: <4937CFF7.7040200@compbio.dundee.ac.uk> Richard Holland wrote: > I'm the maintainer for BJX so you can post patches to me. ah - {bows} - sorry - should've realised that. > > However, I agree that a consensus needs to be reached on how to store > this data amongst the various projects. There's no point one fixing it > one way and another fixing it the other. This is, from my point of view, > the main problem with having something like BioSQL defined as a schema > rather than as an API that can include defined behaviour (business > logic) as well as database structure (the Oracle version for instance > would expose a public set of PL/SQL stored procedures and projects would > then interact solely via those). yes. I was hoping there were at least some additional notes on how the entities were further restricted which might inform us about this. Perhaps that might be something to be considered for the next BioSQL release. > Also, BioJava itself may not be case-insensitive (can't remember how I > coded it now...) 
so would need changes throughout to things like > equals() methods as well (this has implications going back into the > original code as well as the BJX extensions which build on that). eeks - that could definitely get hairy and almost certainly have some additional side effects. But, I think that it might be possible to introduce the change cleanly so whenever a biosql backed object was being worked with, case sensitivity would apply. Is case sensitivity covered in the biojavax unit tests ? thanks for all your input on this! Jim From holland at eaglegenomics.com Thu Dec 4 12:55:48 2008 From: holland at eaglegenomics.com (Richard Holland) Date: Thu, 04 Dec 2008 12:55:48 +0000 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <4937CFF7.7040200@compbio.dundee.ac.uk> References: <49369E10.6000502@compbio.dundee.ac.uk> <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com> <4937B770.7050000@compbio.dundee.ac.uk> <4937C24E.7070402@eaglegenomics.com> <4937CAC3.4070906@compbio.dundee.ac.uk> <4937CC97.3060602@eaglegenomics.com> <4937CFF7.7040200@compbio.dundee.ac.uk> Message-ID: <4937D354.5040706@eaglegenomics.com> > eeks - that could definitely get hairy and almost certainly have some > additional side effects. But, I think that it might be possible to > introduce the change cleanly so whenever a biosql backed object was > being worked with, case sensitivity would apply. Is case sensitivity > covered in the biojavax unit tests ? No. Sounds like an area of need! > thanks for all your input on this! > Jim > -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd M: +44 7500 438846 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From lpritc at scri.ac.uk Thu Dec 4 13:25:50 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 04 Dec 2008 13:25:50 +0000 Subject: [BioSQL-l] BioSQL and ontology "standards". 
In-Reply-To: <4937C2D4.1020701@compbio.dundee.ac.uk> Message-ID: On 04/12/2008 11:45, "James Procter" wrote: > Peter wrote: >> On Fri, Nov 28, 2008 at 7:16 PM, Richard Holland wrote: >>> I think the best approach is to always to use what the file says, and >>> trust that it's accurate. What needs to be agreed between projects is >>> any additional annotations that get introduced outside the context of >>> file parsing, and the names of the ontologies used for the file >>> annotations so that all projects use the same ontologies and don't >>> replicate them inside the BioSQL database. It would be nice to >>> standardise these names and the additional custom terms across the >>> projects, in much the same way as people tried already to standardise >>> the way general objects get mapped to BioSQL. >> >> This is what I am trying to get at here - documenting the existing "ad >> hoc" ontology usage. My impression is that it has not been >> documented, and that the BioPerl behaviour is the defacto BioSQL >> standard. >> >> I'd like to pin down this standard, and extend it for situations like >> the location_qualifier_value.term_id and perhaps location.term_id >> where BioPerl seems to ignore the ontology issue. Hi, Just to add some of my experience with BioSQL and Biopython to the discussion... When I began to look at this issue a couple of years ago, it was clear that the Biopython loader (and, to the best of my knowledge, Bioperl does this, too) for GenBank files and BioSQL put pretty much everything under an 'ontology' called 'Annotation Tags', with no definitions and only rudimentary error-checking. Now, BioSQL seems to have taken great care to ensure that, whatever one's choice of ontology, it can be accommodated in the database schema. There is, as far as I can tell, no reason to favour one ontology over another on the grounds of BioSQL compatibility and, if anything, the BioSQL schema pretty much forced me to start considering ontologies in a serious manner. 
My understanding is that BioSQL is ontology-neutral, and that the appropriate choice of ontology is dependent on the data with which you want to populate your database. This suggests to me that the Bio* loaders are the things that need to be dynamically ontology-aware, first to check if the appropriate ontology (as selected by the user) for the data is present in the database, and then to populate the database using those ontology terms, calling errors as appropriate (e.g. for extraneous terms, mis-spellings, inappropriate data types, etc.). If your reason, like mine, for using an ontology is either to ensure that annotation terms have well-defined (or at least defined) meanings, and perhaps incidentally to carry out a check on the validity of a particular annotation file within the domain of that ontology, then that can readily be done in BioSQL. I have managed this with both the Gene Ontology and Sequence Ontology ontologies, and locally-defined ontologies. BioSQL copes with these very nicely, as does a modified Biopython Loader.py. However, the current Biopython (and AFIAA Bio*) behaviour with 'Annotation Tags' doesn't correspond well to the above. I think that this is a bad thing in general, and that there is room for improvement, if we want it. With apologies if I'm misinterpreting the tide of discussion, but I would be disappointed to see a default behaviour of "bung everything under 'Annotation Tags', typos and all" become a 'standard' of any sort, rather than a placeholder for future development of ontology-aware Bio* code that queries and populates BioSQL appropriately. I see the situation as pretty much analogous to the effective requirement for NCBI taxon data in BioSQL, when using Biopython: you need to load in the NCBI taxon data before your own data can be imported in a taxon-aware manner. 
I would prefer to see a similar, but perhaps even more draconian imposition of requiring an appropriate ontology (or ontologies) to be present in the database before importing data, and a specification of which ontology/ies is/are to be used when loading the data. Then, where a term is not yet known to an ontology in BioSQL, this might be an error in the source data, or an oversight of the ontology. Correcting either of these improves the quality of the data and/or its description. The catch-all 'Annotation Tag' 'ontology' seems to silently record a new term with a different ID, permits no error correction and, for my own part, I would rather this behaviour went away, eventually. Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. 
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From biopython at maubp.freeserve.co.uk Thu Dec 4 15:04:44 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 4 Dec 2008 15:04:44 +0000 Subject: [BioSQL-l] BioSQL and ontology "standards". In-Reply-To: References: <4937C2D4.1020701@compbio.dundee.ac.uk> Message-ID: <320fb6e00812040704i71c23144m35de4de6df882902@mail.gmail.com> On Thu, Dec 4, 2008 at 1:25 PM, Leighton Pritchard wrote: > With apologies if I'm misinterpreting the tide of discussion, but I would be > disappointed to see a default behaviour of "bung everything under > 'Annotation Tags', typos and all" become a 'standard' of any sort, rather > than a placeholder for future development of ontology-aware Bio* code that > queries and populates BioSQL appropriately. Overall, I agree. It isn't ideal, but the current ad-hoc "ontology" is useful in that its looseness allows any parsable GenBank file to be imported into the database. Pinning down the current behaviour as a "standard" for better intercompatibility between the Bio* projects is a good thing, even if this is only a short-term goal. In the long term, yes, maybe all Bio* projects should be able to cope with any (optional) strict ontology instead. > I see the situation as pretty much analogous to the effective requirement > for NCBI taxon data in BioSQL, when using Biopython: you need to load in the > NCBI taxon data before your own data can be imported in a taxon-aware > manner. This is going off topic, but that's not really true any more.
It used to be the case that if you wanted to record the NCBI taxonomy when loading GenBank files into BioSQL with Biopython that you would ideally first prepopulate the taxonomy tables with the BioSQL load_ncbi_taxonomy.pl script. I should go and update http://www.biopython.org/wiki/BioSQL now that Biopython 1.49 is out, as it can optionally fill in the lineage on demand by querying NCBI Entrez. Either way, it does "play nice" with running load_ncbi_taxonomy.pl before or after loading records with Biopython. Peter From lpritc at scri.ac.uk Thu Dec 4 15:14:37 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 04 Dec 2008 15:14:37 +0000 Subject: [BioSQL-l] BioSQL and ontology "standards". In-Reply-To: <320fb6e00812040704i71c23144m35de4de6df882902@mail.gmail.com> Message-ID: On 04/12/2008 15:04, "Peter" wrote: > On Thu, Dec 4, 2008 at 1:25 PM, Leighton Pritchard wrote: >> With apologies if I'm misinterpreting the tide of discussion, but I would be >> disappointed to see a default behaviour of "bung everything under >> 'Annotation Tags', typos and all" become a 'standard' of any sort, rather >> than a placeholder for future development of ontology-aware Bio* code that >> queries and populates BioSQL appropriately. > > Overall, I agree. It isn't ideal, but the current ad-hoc "ontology" > is useful in that its looseness allows any parsable GenBank file to be > imported into the database. I think that this may be a matter of perspective: you see an advantage, I see an accident waiting to happen ;) >> I see the situation as pretty much analogous to the effective requirement >> for NCBI taxon data in BioSQL, when using Biopython: you need to load in the >> NCBI taxon data before your own data can be imported in a taxon-aware >> manner. > > This is going off topic, but that's not really true any more. Ah... Good point - my bad. L. 
-- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
From hlapp at gmx.net Thu Dec 4 15:57:14 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 4 Dec 2008 10:57:14 -0500 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <49369E10.6000502@compbio.dundee.ac.uk> References: <49369E10.6000502@compbio.dundee.ac.uk> Message-ID: On Dec 3, 2008, at 9:56 AM, James Procter wrote: > I was wondering - are bioentry accessions intended to be case > insensitive under the archetypal BioSQL schema ? Sorry for chiming in a bit late here. Accessions are case sensitive in BioSQL at the level of the relational model. In fact, this is enforced for MySQL (which unilaterally chose to treat the SQL VARCHAR datatype as case-insensitive) by making the type VARCHAR BINARY. I'm rather disinclined to change that, I have to say. I realize that many (all?) of the databases we typically use treat accessions as case-insensitive. But I doubt that that's part of the specs in each case, and there is no standard that would oblige future databases to do the same. Rather, I think it's application (or data source) level semantics, and should hence be implemented at that level if desired. In full-featured RDBMSs that's actually not very difficult. For example, you can build a function index on UPPER(accession), which gives you indexed access to case-insensitive accessions w/o changing the model itself. As Mark mentioned, Hibernate can be taught to use functions like these too. Does that make sense?
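The function-index approach described above can be sketched roughly as follows. This is only an illustration, not BioSQL or Bio* code: it assumes Python's built-in sqlite3 module and a cut-down, hypothetical bioentry table standing in for a real BioSQL install, and the index name and helper functions are invented for the example. The stored accessions are left untouched; only the query applies UPPER().

```python
import sqlite3

# In-memory stand-in for a BioSQL database (assumption: only the two
# columns needed for the illustration, not the real bioentry table).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bioentry ("
    " bioentry_id INTEGER PRIMARY KEY,"
    " accession TEXT NOT NULL)"
)

# Expression ("function") index: lets UPPER(accession) lookups use an
# index without changing how accessions are stored.
conn.execute(
    "CREATE INDEX bioentry_accession_upper ON bioentry (UPPER(accession))"
)

conn.executemany(
    "INSERT INTO bioentry (accession) VALUES (?)",
    [("P12345",), ("p12345",), ("Q9XYZ1",)],
)


def find_exact(conn, accession):
    """Case-sensitive lookup, matching the relational model as stored."""
    rows = conn.execute(
        "SELECT accession FROM bioentry WHERE accession = ? "
        "ORDER BY bioentry_id",
        (accession,),
    )
    return [row[0] for row in rows]


def find_ignoring_case(conn, accession):
    """Application-level case-insensitive lookup via UPPER() on both sides."""
    rows = conn.execute(
        "SELECT accession FROM bioentry WHERE UPPER(accession) = UPPER(?) "
        "ORDER BY bioentry_id",
        (accession,),
    )
    return [row[0] for row in rows]
```

With this layout, a binding that wants DAS-style case-insensitive IDs would call something like find_ignoring_case(), while other Bio* projects reading the same rows still see the accessions exactly as they were loaded.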
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Thu Dec 4 15:59:59 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 4 Dec 2008 10:59:59 -0500 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <320fb6e00812040235s6e9d70b6y70b9b8443c7e2312@mail.gmail.com> References: <49369E10.6000502@compbio.dundee.ac.uk> <320fb6e00812040230n3d78061ay96d538ea75d3673e@mail.gmail.com> <320fb6e00812040235s6e9d70b6y70b9b8443c7e2312@mail.gmail.com> Message-ID: On Dec 4, 2008, at 5:35 AM, Peter wrote: > James Procter wrote: >> >> ps. on a side issue - have the various Bio* language bindings >> actually >> been specified formally ? If so - where might I find them ? >> > > I think the answer to that is sadly a no. I agree (with both the sadly and the no). Maybe I have a New Year's resolution coming at me here ... Indeed though, this needs addressing. I think I was about to start something on the wiki and then got sucked elsewhere. If anyone has energy to start this, please don't wait - wiki allows account creation by anyone. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jimp at compbio.dundee.ac.uk Thu Dec 4 16:17:05 2008 From: jimp at compbio.dundee.ac.uk (James Procter) Date: Thu, 04 Dec 2008 16:17:05 +0000 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: References: <49369E10.6000502@compbio.dundee.ac.uk> Message-ID: <49380281.6070405@compbio.dundee.ac.uk> Hi Hilmar Hilmar Lapp wrote: > Sorry for chiming in a bit late here. Accessions are case sensitive in > BioSQL at the level of the relational model. Ah. OK. This is a definitive answer.
> In fact, this is enforced for MySQL (which unilaterally chose to treat > the SQL VARCHAR datatype as case-insensitive) by making the type VARCHAR > BINARY. I did wonder about that. This makes sense. > I'm rather disinclined to change that, I have to say. I realize that > many (all?) of the databases we typically use treat accessions as > case-insensitive. But I doubt that that's part of the specs in each > case, and there is no standard that would oblige future databases to do > the same. > > Rather, I think it's application (or data source) level semantics, and > should hence be implemented at that level if desired. In full-featured > RDBMSs that's actually not very difficult. For example, you can build a > function index on UPPER(accession), which gives you indexed access to > case-insensitive accessions w/o changing the model itself. As Mark > mentioned, Hibernate can be taught to use functions like these too. Fair enough. This squarely places the onus on the application/datasource to be careful about case. It also corroborates the transparent behaviour exhibited by Biojava and Bioperl. The worrying aspect is that as far as I can tell, protocols like DAS do not play nicely with this... case is usually ignored for ID lookup. I guess this means that the middleware that does the transformation is always going to have to (optionally) use specific case-insensitive BioSQL language bindings, and the datasource deployer will simply have to configure it accordingly for their BioSQL database. cheers! 
Jim From hlapp at gmx.net Thu Dec 4 16:19:54 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 4 Dec 2008 11:19:54 -0500 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <4937CAC3.4070906@compbio.dundee.ac.uk> References: <49369E10.6000502@compbio.dundee.ac.uk> <93b45ca50812032200h241c7e36u8fce4facfb86676e@mail.gmail.com> <4937B770.7050000@compbio.dundee.ac.uk> <4937C24E.7070402@eaglegenomics.com> <4937CAC3.4070906@compbio.dundee.ac.uk> Message-ID: <49ECA0AB-1ADF-4A83-8645-ABA258CE3F51@gmx.net> On Dec 4, 2008, at 7:19 AM, James Procter wrote: > In a situation where the backend DB is case sensitive, new accessions > inserted into the DB must be forced to be the same case in the same > way > as the accession query string will be forced to be. I would advise against this if you are interested in cross-Bio* interoperability at all. As I pointed out previously, such transformations need not be permanent but can be done on-the-fly. > This modification is again straightforward, but the case change > would also have to be propagated to existing entries in the BioSQL. > Furthermore, modifying the > Biojava-x bindings would only ensure case-insensitivity for Biojava > queries, the same modification would also be necessary for the other > bio-* bindings. Right, and I'm not sure that's a good rule to impose on all implementations. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Thu Dec 4 16:22:53 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 4 Dec 2008 11:22:53 -0500 Subject: [BioSQL-l] case sensitivity in biosql accessions under BioJavax/Hibernate In-Reply-To: <49380281.6070405@compbio.dundee.ac.uk> References: <49369E10.6000502@compbio.dundee.ac.uk> <49380281.6070405@compbio.dundee.ac.uk> Message-ID: <7E1F7CF2-1B67-440D-A92B-29659C03096A@gmx.net> On Dec 4, 2008, at 11:17 AM, James Procter wrote: > The worrying aspect is that as far as I can tell, protocols like DAS > do not play nicely with this... case is usually ignored for ID lookup. Right. Data exchange protocols and query APIs are a different story from the data model of the databases that underpin them. > I guess this means that the middleware that does the transformation > is always going to have to (optionally) use specific case- > insensitive BioSQL language bindings Right. I'm not sure though that this is a big onus. Is it? -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jimp at compbio.dundee.ac.uk Thu Dec 4 16:38:05 2008 From: jimp at compbio.dundee.ac.uk (James Procter) Date: Thu, 04 Dec 2008 16:38:05 +0000 Subject: [BioSQL-l] BioSQL and ontology "standards". 
In-Reply-To: References: Message-ID: <4938076D.7010307@compbio.dundee.ac.uk> Can I just add a comment to refocus discussion here: In part it's probably my fault, but I think we're starting to mix two (as I see it) distinct aspects of ontologies within the BioSQL schema: Leighton Pritchard wrote: > On 04/12/2008 15:04, "Peter" wrote: >>> With apologies if I'm misinterpreting the tide of discussion, but I would be >>> disappointed to see a default behaviour of "bung everything under >>> 'Annotation Tags', typos and all" become a 'standard' of any sort, rather >>> than a placeholder for future development of ontology-aware Bio* code that >>> queries and populates BioSQL appropriately. >> Overall, I agree. It isn't ideal, but the current ad-hoc "ontology" >> is useful in that its looseness allows any parsable GenBank file to be >> imported into the database. > > I think that this may be a matter of perspective: you see an advantage, I > see an accident waiting to happen ;) I think this aspect of BioSQL ontology is strictly semantic, and can probably only be handled when interpreting the data retrieved from a BioSQL database in specific context. Within a particular type of datamodel, inconsistencies arise in the free-text derived from flat-file data records. These inconsistencies (such as typos, synonyms, case variation, etc) are really aspects to be addressed by data cleaning prior to insertion into the database. If you don't care about putting dirty data into a bioSQL database, then you should still be able to do it, but you should then not expect someone else to connect to the database and make perfect sense of the data. In particular, don't expect a program to magically interpret your mis-annotated 'eXons' as coding regions, for instance, or your disulphide bonds as disulfide (or vice versa). The ramification of this is that in practice, clients that ultimately consume and interpret any kind of BioSQL datastore have to have some form of robustness built in.
This would be in the same way that file parsers have to cope with the known variations of freetext feature tags in genbank records. In this situation one assumes that the client at least understands that particular terms in the biosql annotation really are freetext feature tags, and that brings us to the other aspect, which in comparison is much more prosaic. The aspect that I was talking about is 'structural' consistency rather than semantic consistency (for want of a better word). For instance, a bioperl Generic Feature has a 'score' attribute - this should map to the 'score' attribute on a BioPython generic feature, and also to a biojava score attribute. As far as I understand it (and I may be wrong here), hierarchical relationships are faithfully preserved when a bio* feature is persisted in BioSQL - and I'd hope all the attributes were too. This kind of thing is definitely worth writing down, and even making test cases against! Just to duplicate what Hilmar wrote in response to the Bio* binding comment here: >>>> ps. on a side issue - have the various Bio* language bindings actually >>> been specified formally ? If so - where might I find them ? >>> >> >> I think the answer to that is sadly a no. > > > I agree (with both the sadly and the no). Maybe I have a New Year's > resolution coming at me here ... > > Indeed though, this needs addressing. I think I was about to start > something on the wiki and then got sucked elsewhere. If anyone has > energy to start this, please don't wait - wiki allows account creation > by anyone. I guess we (at least Peter, myself, and anyone else) should get to it then. At least for the structural mapping between biosql terms and feature structure! as for the semantic ontology aspect - is there a way that one might tag a biosql database as using external ontologies ?
Jim From Bank.Beszteri at awi.de Thu Dec 4 16:28:13 2008 From: Bank.Beszteri at awi.de (Bánk Beszteri) Date: Thu, 04 Dec 2008 17:28:13 +0100 Subject: [BioSQL-l] alternative taxonomic hierarchies in BioSQL? Message-ID: <4938051D.4030101@awi.de> Dear BioSQLers, do I understand right that the current BioSQL schema allows for a single taxonomy per database only? When looking into the tables taxon and taxon_name, it looks like neither taxa nor their neighborhood relationships can belong to different taxonomies. Is this correct, or am I missing something? If this is so: are there any plans to add such a feature in the future? I think (besides that I could use it) it could probably be useful for others as well (to have the possibility to e.g. have an ITIS taxonomy or just a user's own private taxonomy parallel to NCBI taxonomy in a single BioSQL DB). I didn't find anything about this on the BioSQL pages, please direct me to the right place if I missed it! Bank -- Dr. Bank Beszteri Bioinformatics Working Group Alfred Wegener Institute for Polar and Marine Research Am Handelshafen 12 Bremerhaven Germany From biopython at maubp.freeserve.co.uk Thu Dec 4 17:06:59 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 4 Dec 2008 17:06:59 +0000 Subject: [BioSQL-l] alternative taxonomic hierarchies in BioSQL? In-Reply-To: <4938051D.4030101@awi.de> References: <4938051D.4030101@awi.de> Message-ID: <320fb6e00812040906w12533956g859e4ec0415665e3@mail.gmail.com> On Thu, Dec 4, 2008 at 4:28 PM, Bánk Beszteri wrote: > > Dear BioSQLers, > > do I understand right that the current BioSQL schema allows for a single > taxonomy per database only? Not quite. If you ignore the fact that the taxon table's external taxonomy ID is explicitly labelled as the ncbi_taxon_id, you could store any taxonomy in the taxon and taxon_name tables. You could even have multiple independent taxonomies in these tables.
However, each bioentry can only point to one taxon entry (and thus belongs to only one taxonomy), which is a big limitation. It would be useful to have a bioentry point to multiple taxon entries (and thus multiple taxonomies, e.g. NCBI and ITIS), which might require some sort of link table between the taxon and bioentry tables. This would also solve the issue of how to support chimeric sequences which has been noted on http://www.biosql.org/wiki/Enhancement_Requests > When looking into the tables taxon and taxon_name, it looks like neither > taxa nor their neighborhood relationships can belong to different taxonomies. > Is this correct, or am I missing something? True - but why would you want to interlink taxon entries like that? > If this is so: are there any plans to add such a feature in the future? I > think (besides that I could use it) it could probably be useful for others > as well (to have the possibility to e.g. have an ITIS taxonomy or just a > user's own private taxonomy parallel to NCBI taxonomy in a single BioSQL > DB). I didn't find anything about this on the BioSQL pages, please direct me > to the right place if I missed it! I think the issue has been raised before on the mailing list, and IIRC it was agreed that there was room for improvement. Maybe this is something for BioSQL v1.1.x? Peter From biopython at maubp.freeserve.co.uk Thu Dec 4 17:09:38 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 4 Dec 2008 17:09:38 +0000 Subject: [BioSQL-l] BioSQL and ontology "standards". In-Reply-To: <4938076D.7010307@compbio.dundee.ac.uk> References: <4938076D.7010307@compbio.dundee.ac.uk> Message-ID: <320fb6e00812040909o386f2ba6h8d044aba31799dae@mail.gmail.com> On Thu, Dec 4, 2008 at 4:38 PM, James Procter wrote: > > as for the semantic ontology aspect - is there a way that one might tag > a biosql database as using external ontologies ? > I'm not sure what you mean here - you could look at the ontology and term tables.
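To make the strict-versus-lenient distinction from this thread concrete, here is a minimal sketch built around the ontology and term tables mentioned above. It is an assumption-laden illustration rather than any Bio* loader's actual behaviour: it uses Python's sqlite3 module, a heavily simplified version of the two tables (only names and keys), and made-up helper names and a made-up demo ontology name. In strict mode an unknown term is reported as an error; in lenient mode it is silently added, which is roughly the catch-all 'Annotation Tags' behaviour being debated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the BioSQL ontology and term tables
# (assumption: the real tables carry more columns and constraints).
conn.executescript(
    """
    CREATE TABLE ontology (
        ontology_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE term (
        term_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        ontology_id INTEGER NOT NULL REFERENCES ontology (ontology_id),
        UNIQUE (name, ontology_id)
    );
    """
)


def get_or_create_ontology(conn, name):
    """Return the ontology_id for name, creating the ontology if needed."""
    row = conn.execute(
        "SELECT ontology_id FROM ontology WHERE name = ?", (name,)
    ).fetchone()
    if row is not None:
        return row[0]
    return conn.execute(
        "INSERT INTO ontology (name) VALUES (?)", (name,)
    ).lastrowid


def lookup_term(conn, ontology, name, strict=True):
    """Return the term_id for name within ontology.

    strict=True: unknown terms raise KeyError (possible typo in the data,
    or a gap in the ontology). strict=False: unknown terms are added on
    the fly, as with a catch-all tag ontology.
    """
    row = conn.execute(
        "SELECT t.term_id FROM term t JOIN ontology o USING (ontology_id) "
        "WHERE o.name = ? AND t.name = ?",
        (ontology, name),
    ).fetchone()
    if row is not None:
        return row[0]
    if strict:
        raise KeyError(f"{name!r} is not a term in ontology {ontology!r}")
    ontology_id = get_or_create_ontology(conn, ontology)
    return conn.execute(
        "INSERT INTO term (name, ontology_id) VALUES (?, ?)",
        (name, ontology_id),
    ).lastrowid


# Preload a tiny demo ontology before any "data" is imported.
demo_id = get_or_create_ontology(conn, "SO (demo subset)")
conn.executemany(
    "INSERT INTO term (name, ontology_id) VALUES (?, ?)",
    [("exon", demo_id), ("CDS", demo_id)],
)
```

Under these assumptions, lookup_term(conn, "SO (demo subset)", "eXon") raises KeyError, whereas lookup_term(conn, "Annotation Tags", "eXon", strict=False) quietly records the misspelt tag as a new term.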
Peter

From biopython at maubp.freeserve.co.uk Thu Dec 4 17:16:58 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Dec 2008 17:16:58 +0000
Subject: [BioSQL-l] Passwords on biosql databases
In-Reply-To: <492BDCF2.6F09.00E0.0@dundee.ac.uk>
References: <492BDCF2.6F09.00E0.0@dundee.ac.uk>
Message-ID: <320fb6e00812040916k7c12f6c8qb81ec124053c428b@mail.gmail.com>

On Tue, Nov 25, 2008 at 11:09 AM, David Martin wrote:
>
> I have set up a biosql database on Postgres. The Bio::DB::BioDB module
> croaks complaining that it needs the password. I have tried the obvious
> things (-password -passwd and reading what docs I could find) but to no avail.
>
> Any clues?

Your email only just reached me - maybe there was a delay somewhere. However, in case you are still stuck, and assuming no one else has answered in the mean time I'll try and help. As an aside, I would say this is really a question for the BioPerl mailing list, as it is about the BioPerl bindings to BioSQL.

From a quick look at the BioPerl code for load_seqdatabase.pl, my guess is you need to use "pass" as the argument name, e.g.

    my $db = Bio::DB::BioDB->new(-database   => "biosql",
                                 -printerror => $printerror,
                                 -host       => $host,
                                 -port       => $port,
                                 -dbname     => $dbname,
                                 -driver     => $driver,
                                 -user       => $dbuser,
                                 -pass       => $dbpass,
                                 -dsn        => $dsn,
                                 -schema     => $schema,
                                 -initrc     => $initrc,
                                 );

Peter

From hlapp at gmx.net Thu Dec 4 18:43:17 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 4 Dec 2008 13:43:17 -0500
Subject: [BioSQL-l] Passwords on biosql databases
In-Reply-To: <320fb6e00812040916k7c12f6c8qb81ec124053c428b@mail.gmail.com>
References: <492BDCF2.6F09.00E0.0@dundee.ac.uk> <320fb6e00812040916k7c12f6c8qb81ec124053c428b@mail.gmail.com>
Message-ID:

On Dec 4, 2008, at 12:16 PM, Peter wrote:

> From a quick look at the BioPerl code for load_seqdatabase.pl, my
> guess is you need to use "pass" as the argument name, e.g.
>
> my $db = Bio::DB::BioDB->new(-database   => "biosql",
>                              -printerror => $printerror,
>                              -host       => $host,
>                              -port       => $port,
>                              -dbname     => $dbname,
>                              -driver     => $driver,
>                              -user       => $dbuser,
>                              -pass       => $dbpass,
>                              -dsn        => $dsn,
>                              -schema     => $schema,
>                              -initrc     => $initrc,
>                              );

Yes, and the same is true if you are using Bio::DB::BioDB directly. B::D::BioDB::new() accepts the options documented there, plus (as noted there, actually), all options accepted by Bio::DB::SimpleDBContext.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Thu Dec 4 22:07:39 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 4 Dec 2008 17:07:39 -0500
Subject: [BioSQL-l] alternative taxonomic hierarchies in BioSQL?
In-Reply-To: <320fb6e00812040906w12533956g859e4ec0415665e3@mail.gmail.com>
References: <4938051D.4030101@awi.de> <320fb6e00812040906w12533956g859e4ec0415665e3@mail.gmail.com>
Message-ID:

On Dec 4, 2008, at 12:06 PM, Peter wrote:

> On Thu, Dec 4, 2008 at 4:28 PM, Bánk Beszteri
> wrote:
>>
>> Dear BioSQLers,
>>
>> do I understand right that the current BioSQL schema allows for a
>> single
>> taxonomy per database only?
>
> Not quite. If you ignore the fact that the taxon table's external
> taxonomy ID is explicitly labelled as the ncbi_taxon_id, you could
> store any taxonomy in the taxon and taxon_name tables. You could even
> have multiple independent taxonomies in these tables.

Right. Though it's certainly ugly to call something a ncbi_taxon_id when really it is a ITIS ID, for example. Aside from that, the load_ncbi_taxonomy.pl script that comes with BioSQL can't really deal with other taxonomies being stored in the taxon tables, too.
First, it will consider all nodes that it can't find in NCBI (by ID) as having been obsoleted and will delete them, and even if it somehow failed to do that, it would fail to compute the nested set enumeration for all other taxonomies. Changing that would basically require namespacing taxon nodes. Though it's an option, it has increasingly struck me as a duplication of what the PhyloDB module provides already (see other comments below), so I am actually less and less in favor of it. I think the appropriate way to look at the taxon tables is as the reference taxonomy for bioentries (and so calling the identifier ncbi_taxon_id is still bad as it prescribes the NCBI taxonomy as the reference). In this context: > However, each bioentry can only point to one taxon entry (and thus > belongs to only one taxonomy), which is a big limitation. This is well motivated in biological applications and current object models. I'm not sure about the other Bio* toolkits, but BioPerl for example doesn't support multiple species objects for a sequence. > It would be useful to have a bioentry point to multiple taxon entries > (and thus multiple taxonomies, e.g. NCBI and ITIS), which might > require some sort of link table between the taxon and bioentry tables. Note that the PhyloDB module supports this. Nodes in a tree (or taxonomy) can be associated with one or more bioentries (and, in fact, reference taxon nodes). > [...] >> When looking into the tables taxon and taxon_name, it looks like >> neither >> taxa nor their neighborhood relationships can belong to different >> taxonomies. >> Is this correct, or am I missing something? > > True - but why would you want to interlink taxon entries like that? There may be use-cases for this. For example, to relate taxa named differently between two taxonomies but that really are synonymous. Or one taxonomy containing a synonym that the other doesn't. Not your molecular sequence database/analysis type of thing, sure. But still legitimate. 
>
>> If this is so: are there any plans to add such a feature in the
>> future? I think (besides that I could use it) it could probably be
>> useful for others as well (to have the possibility to e.g. have an
>> ITIS taxonomy

Note that the svn / main trunk version of BioSQL has a script
load_itis_taxonomy.pl. It loads ITIS into the PhyloDB module, though.
ITIS isn't a single tree but actually 5; there is no common root. So it
ends up as 5 trees in the PhyloDB tables.

>> or just a user's own private taxonomy parallel to NCBI taxonomy in
>> a single BioSQL DB).

Yeah; I've been wanting to write a general taxonomy loader, or more
precisely a loader that utilizes Bio::TreeIO for reading. I just haven't
had the time to do that. (Need another hackathon :-)

> [...] I think the issue has been raised before on the mailing list,
> and IIRC it was agreed that there was room for improvement. Maybe
> this is something for BioSQL v1.1.x?

Fixing the ncbi_taxon_id column name, definitely. As for letting the
taxon tables duplicate the capabilities of the PhyloDB tables, I'm not
sure that's the best route to go.

	-hilmar
-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From chapmanb at 50mail.com Mon Dec 15 01:17:50 2008
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 14 Dec 2008 20:17:50 -0500
Subject: [BioSQL-l] BioSQL and ontology "standards".
In-Reply-To: <320fb6e00812040909o386f2ba6h8d044aba31799dae@mail.gmail.com>
References: <4938076D.7010307@compbio.dundee.ac.uk> <320fb6e00812040909o386f2ba6h8d044aba31799dae@mail.gmail.com>
Message-ID: <20081215011749.GA59046@kunkel>

Hi all;
I wanted to reply to the BioSQL ontology discussion Peter started up last month.
He summarized how naming is currently done in the ontology and term
tables, and some of the potential downsides of that approach:

> Currently BioPerl and Biopython (and I assume the other projects but
> haven't checked) use a couple of ad-hoc ontology names for storing
> annotation. In particular, if there is no predefined entry for a
> novel ontology term, it gets added on the fly. This is very
> convenient as it means a BioSQL database can be used without first
> importing a predefined ontology. However there are downsides, for
> example spelling errors in the keys of a GenBank file get treated as
> ontology entries.

There was some general consensus that a more formalized, or at least
documented, naming scheme would be good, provided there is some leniency
for adding terms if they don't fall into the scheme. I agree, and think
this suggestion by Peter is good:

> On a related point, it might make more sense to use a predefined
> ontology, like SOFA or SO from http://www.sequenceontology.org/

As a start on this, I put together a mapping of GenBank header, feature
and qualifier keys to SO (and also to standard ontologies like Dublin
Core). If this is a direction we'd like to go, this would provide the
high level documentation for reference implementations. It is currently
about 3/4 finished but should give a good idea; I'd need some help from
someone more familiar with SO for some of the missing terms.

It got a little out of control for a mailing list post, so I wrote up
the motivation and details here:

http://bcbio.wordpress.com/2008/12/14/standard-ontologies-in-biosql/

The tab-delimited mapping file of GenBank terms to ontology terms is there as a starting point.
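
To make the "documented scheme with some leniency" idea concrete, here is a minimal Python sketch of a lookup that prefers a predefined mapping and falls back to an ad-hoc term only when a key is unknown. The column names, the two sample rows and the fallback ontology name "ad-hoc" are assumptions for illustration; they are not taken from the actual mapping file linked above.

```python
import csv
import io

# Hypothetical three-column layout; the real mapping file may differ.
MAPPING_TSV = (
    "genbank_key\tontology\tterm\n"
    "CDS\tSO\tCDS\n"
    "gene\tSO\tgene\n"
)


def load_mapping(handle):
    """Read a tab-delimited mapping into {genbank_key: (ontology, term)}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["genbank_key"]: (row["ontology"], row["term"]) for row in reader}


def resolve_term(key, mapping, fallback_ontology="ad-hoc"):
    """Return (ontology, term) for a GenBank key.

    Known keys use the predefined mapping. Unknown keys are still
    accepted, but are filed under a separate fallback ontology so that
    misspelled keys do not end up mixed in with the standard terms.
    """
    return mapping.get(key, (fallback_ontology, key))


mapping = load_mapping(io.StringIO(MAPPING_TSV))
known = resolve_term("CDS", mapping)
unknown = resolve_term("prodcut", mapping)  # deliberately misspelled key
```

Keeping the fallback terms in their own ontology would make it easy to list them later and decide whether each one is a genuine new term or a typo.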
Brad

From vanaquisl at gmail.com Mon Dec 15 10:24:06 2008
From: vanaquisl at gmail.com (vanaquisl vanaquisl)
Date: Mon, 15 Dec 2008 11:24:06 +0100
Subject: [BioSQL-l] SQLite support
Message-ID: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com>

Hi folks,

I am a brand new BioSQL user and I'd like to know if there will be
SQLite support in the near future?
I am about to use SQLite for a standalone lightweight Biopython
application and it would be very helpful if I could use BioSQL directly.

Thanks for any advice.

V.

From biopython at maubp.freeserve.co.uk Mon Dec 15 10:43:33 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 15 Dec 2008 10:43:33 +0000
Subject: [BioSQL-l] SQLite support
In-Reply-To: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com>
References: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com>
Message-ID: <320fb6e00812150243w4b0dc223g40abcf684af1ccf5@mail.gmail.com>

On Mon, Dec 15, 2008, vanaquisl vanaquisl wrote:
>
> Hi folks,
>
> I am a brand new BioSQL user and I'd like to know if there will be
> SQLite support in the near future?

As far as I know, BioSQL does not currently include a schema for SQLite.

> I am about to use SQLite for a standalone lightweight Biopython
> application and it would be very helpful if I could use BioSQL directly.

Biopython can talk to a BioSQL database running on MySQL (using MySQLdb)
or PostgreSQL (using either Psycopg or Psycopg2). In theory it could be
extended to use any supported BioSQL database, provided there are Python
bindings for the database software; in the case of SQLite, using
pysqlite would probably be fine.

Peter

From Bank.Beszteri at awi.de Fri Dec 5 10:23:20 2008
From: Bank.Beszteri at awi.de (Bánk Beszteri)
Date: Fri, 05 Dec 2008 11:23:20 +0100
Subject: [BioSQL-l] alternative taxonomic hierarchies in BioSQL?
In-Reply-To: References: <4938051D.4030101@awi.de> <320fb6e00812040906w12533956g859e4ec0415665e3@mail.gmail.com> Message-ID: <49390118.9030104@awi.de> Hi Hilmar & Peter, so it looks like using PhyloDB is probably the way to explore further for this, I somehow missed that point (I limited myself to think within the biosql core tables). Thanks for this direction and all the ideas / insights! Bank Hilmar Lapp schrieb: > > On Dec 4, 2008, at 12:06 PM, Peter wrote: > >> On Thu, Dec 4, 2008 at 4:28 PM, B?nk Beszteri >> wrote: >>> >>> Dear BioSQLers, >>> >>> do I understand right that the current BioSQL schema allows for a >>> single >>> taxonomy per database only? >> >> Not quite. If you ignore that fact that the taxon table's external >> taxonomy ID is explicitly labelled as the ncbi_taxon_id, you could >> store any taxonomy in the taxon and taxon_name tables. You could even >> have multiple independent taxonomies in these tables. > > Right. Though it's certainly ugly to call something a ncbi_taxon_id > when really it is a ITIS ID, for example. > > Aside from that, the load_ncbi_taxonomy.pl script that comes with > BioSQL can't really deal with other taxonomies being stored in the > taxon tables, too. First, it will consider all nodes that it can't > find in NCBI (by ID) as having been obsoleted and will delete them, > and even if it somehow failed to do that, it would fail to compute the > nested set enumeration for all other taxonomies. > > Changing that would basically require namespacing taxon nodes. Though > it's an option, it has increasingly struck me as a duplication of what > the PhyloDB module provides already (see other comments below), so I > am actually less and less in favor of it. > > I think the appropriate way to look at the taxon tables is as the > reference taxonomy for bioentries (and so calling the identifier > ncbi_taxon_id is still bad as it prescribes the NCBI taxonomy as the > reference). 
In this context: > >> However, each bioentry can only point to one taxon entry (and thus >> belongs to only one taxonomy), which is a big limitation. > > This is well motivated in biological applications and current object > models. I'm not sure about the other Bio* toolkits, but BioPerl for > example doesn't support multiple species objects for a sequence. > >> It would be useful to have a bioentry point to multiple taxon entries >> (and thus multiple taxonomies, e.g. NCBI and ITIS), which might >> require some sort of link table between the taxon and bioentry tables. > > Note that the PhyloDB module supports this. Nodes in a tree (or > taxonomy) can be associated with one or more bioentries (and, in fact, > reference taxon nodes). > >> [...] >>> When looking into the tables taxon and taxon_name, it looks like >>> neither >>> taxa nor their neighborhood relationships can belong to different >>> taxonomies. >>> Is this correct, or am I missing something? >> >> True - but why would you want to interlink taxon entries like that? > > There may be use-cases for this. For example, to relate taxa named > differently between two taxonomies but that really are synonymous. Or > one taxonomy containing a synonym that the other doesn't. > > Not your molecular sequence database/analysis type of thing, sure. But > still legitimate. > >> >> >>> If this is so: are there any plans to add such a feature in the >>> future? I >>> think (besides that I could use it) it could probably be useful for >>> others >>> as well (to have the possibility to e.g. have an ITIS taxonomy > > Note that the svn / main trunk version of BioSQL has a script > load_itis_taxonomy.pl. It loads it into the PhyloDB module, though. > ITIS isn't a single tree but actually 5; there is no common root. So > it ends up as 5 trees in the PhyloDB tables. > >>> or just a user?s own private taxonomy parallel to NCBI taxonomy in a >>> single BioSQL >>> DB). 
> > Yeah; I've been wanting to write a general taxonomy loader, or more > precisely a loader that utilizes Bio::TreeIO for reading. Just haven't > had time around to do that. (Need another hackathon :-) > >> [...] I think the issue has been raised before on the mailing list, >> and IIRC >> it was agreed that there was room for improvement. Maybe this is >> something for BioSQL v1.1.x? > > Fixing the ncbi_taxon_id column name definitely. As for letting the > taxon tables duplicate the same capabilities as the PhyloDB tables, > I'm not sure that that's the best route to go. > > -hilmar > From biopython at maubp.freeserve.co.uk Wed Dec 17 14:32:36 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 17 Dec 2008 14:32:36 +0000 Subject: [BioSQL-l] The bioentry and biosequence tables In-Reply-To: <320fb6e00812170627ma0d87fdi4af3da0fc65e7bfa@mail.gmail.com> References: <320fb6e00812170627ma0d87fdi4af3da0fc65e7bfa@mail.gmail.com> Message-ID: <320fb6e00812170632k1f7b993bq843fc36c38588673@mail.gmail.com> Hello all, Looking at the BioSQL schema, it initially appears that each bioentry can have multiple biosequence entries (each with their own alphabet, version and length). On reading http://biosql.org/wiki/Schema_Overview#BIOSEQUENCE I see that in fact the schema should make sure each bioentry can have at most one biosequence - which is good news. If this was not the case, then the location of any seqfeature locations would be ambiguous. Was there a reason for not just putting optional sequence, length and alphabet fields into the bioentry table directly (instead of having a separate biosequence table)? Does doing it as a separate table speed up accessing the core (non-sequence) bioentry information? 
Thanks, Peter From biopython at maubp.freeserve.co.uk Wed Dec 17 14:55:02 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 17 Dec 2008 14:55:02 +0000 Subject: [BioSQL-l] The bioentry and biosequence tables In-Reply-To: References: <320fb6e00812170627ma0d87fdi4af3da0fc65e7bfa@mail.gmail.com> <320fb6e00812170632k1f7b993bq843fc36c38588673@mail.gmail.com> Message-ID: <320fb6e00812170655j6f0811ber817cc69b1dcb159e@mail.gmail.com> On Wed, Dec 17, 2008 at 2:49 PM, Hilmar Lapp wrote: > > The bioentry table is for any biological database entry with a stable and > unique identifier. > > In practical terms most of these will be sequence database entries, but they > don't have to be. Among the not so far fetched examples are gene > records/models (such as from LocusLink or Entrez Gene) and (e.g. EST or > protein) sequence clusters. Yes - as another example I was thinking you could import an NCBI Protein Tables (PTT) table like this, you'd have lots of seqfeature entries but no actual sequence. > A (at present) more exotic example would be museum specimen records. > >> Does doing it as a separate table speed up accessing the core >> (non-sequence) bioentry information? > > Possibly. To what extent will depend on the RDBMS, obviously. But at least > several years ago, when BioSQL was first designed, some RDBMSs would indeed > be faster in full table scans if the table didn't contain an LOB. OK - I thought that might have been the case. > The way to look at it conceptually for us has been similar to object > orientation. The bioentry table is the base table for all biodatabase > entries, and the biosequence table is in joined for derived objects that > also have a sequence. > > Does that make sense? Yes it does. Thanks. 
Peter

From hlapp at gmx.net Wed Dec 17 14:49:28 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 17 Dec 2008 09:49:28 -0500
Subject: [BioSQL-l] The bioentry and biosequence tables
In-Reply-To: <320fb6e00812170632k1f7b993bq843fc36c38588673@mail.gmail.com>
References: <320fb6e00812170627ma0d87fdi4af3da0fc65e7bfa@mail.gmail.com> <320fb6e00812170632k1f7b993bq843fc36c38588673@mail.gmail.com>
Message-ID:

On Dec 17, 2008, at 9:32 AM, Peter wrote:

> Was there a reason for not just putting optional sequence, length and
> alphabet fields into the bioentry table directly (instead of having a
> separate biosequence table)?

The bioentry table is for any biological database entry with a stable
and unique identifier.

In practical terms most of these will be sequence database entries, but
they don't have to be. Among the not-so-far-fetched examples are gene
records/models (such as from LocusLink or Entrez Gene) and (e.g. EST or
protein) sequence clusters. A more exotic example (at present) would be
museum specimen records.

> Does doing it as a separate table speed up accessing the core
> (non-sequence) bioentry information?

Possibly. To what extent will depend on the RDBMS, obviously. But at
least several years ago, when BioSQL was first designed, some RDBMSs
would indeed be faster in full table scans if the table didn't contain
an LOB.

The way to look at it conceptually for us has been similar to object
orientation. The bioentry table is the base table for all biodatabase
entries, and the biosequence table is joined in for derived objects that
also have a sequence.

Does that make sense?
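
The base-table / derived-table view described above can be tried out with a small Python sketch using the standard sqlite3 module. The two tables below are a heavily cut-down stand-in for the real BioSQL DDL (only a few columns, no biodatabase or taxon references), and the accessions are invented for illustration. The point is only that making bioentry_id the primary key of biosequence limits each bioentry to at most one sequence, and that a LEFT JOIN returns entries with and without a sequence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-ins for the BioSQL tables, not the official schema.
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL,
    name        TEXT NOT NULL
);
CREATE TABLE biosequence (
    -- Primary key on bioentry_id: at most one sequence per bioentry.
    bioentry_id INTEGER PRIMARY KEY REFERENCES bioentry(bioentry_id),
    alphabet    TEXT,
    length      INTEGER,
    seq         TEXT
);
""")

# One entry with a sequence, one without (e.g. a gene record).
conn.execute("INSERT INTO bioentry VALUES (1, 'X00001', 'with_seq')")
conn.execute("INSERT INTO bioentry VALUES (2, 'GENE0001', 'gene_record')")
conn.execute("INSERT INTO biosequence VALUES (1, 'dna', 8, 'ACGTACGT')")


def fetch_entries(connection):
    """Return (accession, alphabet, length) for every bioentry.

    The LEFT JOIN keeps entries that have no biosequence row; their
    alphabet and length come back as None.
    """
    return connection.execute(
        "SELECT e.accession, s.alphabet, s.length "
        "FROM bioentry e LEFT JOIN biosequence s USING (bioentry_id) "
        "ORDER BY e.bioentry_id"
    ).fetchall()


entries = fetch_entries(conn)
```

Queries that only need the core bioentry columns can skip the join entirely and never touch the potentially large seq column.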

	-hilmar
-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From dmitry.turin at belarusbank.minsk.by Mon Dec 29 17:41:21 2008
From: dmitry.turin at belarusbank.minsk.by (Dmitry Turin)
Date: Mon, 29 Dec 2008 19:41:21 +0200
Subject: [BioSQL-l] PostGraph II
Message-ID: <666803384.20081229194121@belarusbank.minsk.by>

Hi, Biosql-l.

I have carefully read the presentation "BioPostgres_Overview_BOSC_2006.ppt"
and the paper at "phenomics.cs.ucla.edu/PostGraph/" to which the
presentation refers. I would like to suggest implementing the statements [1]

  SELECT * FROM table1.table2.table3;
  UPDATE table1.table2.table3 SET ... ;
  DELETE * FROM table1.table2.table3;

instead of, or in addition to, PostGraph, because these statements are
more powerful and more convenient for working with graphs.

[1] blogs.ingres.com/technology/2008/07/31/bringing-dbms-in-line-with-modern-communication-requirements-sql2009/
    sql50.euro.ru/sql5.16.4.pdf

Dmitry (SQL50, HTML60)