From angel at mail.med.upenn.edu Thu Aug 3 13:28:18 2006
From: angel at mail.med.upenn.edu (Angel Pizarro)
Date: Thu, 03 Aug 2006 13:28:18 -0400
Subject: [BioSQL-l] database extensions
Message-ID: <44D23232.3020209@mail.med.upenn.edu>

Hello,

Relatively new to biosql, but I was wondering about a few aspects of the
schema/project.

First, about the ontology tables, what is the preferred way to map
ontology annotations to bioentries? via a seqfeature? Currently I just
added a new table to map GO associations with the evidence code from
GOA. Not optimal as there may be multiple lines of evidence for an
association, as in the godatabase schema.

Second, are primary keys up for discussion any time soon? I realize that
a lot of external projects rely on this schema, so it has to remain
stable, but the inconsistent use of UID, compound keys or even lack of a
key really put a hindrance on the use of off-the-shelf ORMs.

Third, how does one go about submitting proposals for schema extensions?
I am wanting to extend the schema with a few modules, mainly ripped out
of either GUS and/or chado, as well as adding a module for proteomics data.

Fourth, is the current practice for representation of biological
pathways and interactions to use the bioentryrelationship table?

Many thanks.

--
Angel Pizarro
Director, Bioinformatics Facility
Institute for Translational Medicine and Therapeutics
University of Pennsylvania
806 BRB II/III
421 Curie Blvd.
Philadelphia, PA 19104-6160

P: 215-573-3736
F: 215-573-9004
E: angel at mail.med.upenn.edu

From angel at mail.med.upenn.edu Thu Aug 3 13:12:24 2006
From: angel at mail.med.upenn.edu (Angel Pizarro)
Date: Thu, 03 Aug 2006 13:12:24 -0400
Subject: [BioSQL-l] null title and CRC
Message-ID: <44D22E78.6050505@mail.med.upenn.edu>

From hilmar:
> The CRC for references uses the authors, title, and location
> attributes in Bioperl-db, and empty (or null) strings default to the
> string "".
>
> If title is empty and authors and location do not distinguish two
> references, then why do you want to have two rows for those
> references? Basically, they are identical for all intents and
> purposes, or are they not?
>
> -hilmar

Sorry for not replying to the original thread, but I just joined this
list.
This was an issue for me with bioperl loading as well, since I was using
the same biosql instance to load two different biodatabases with the
same entry. Specifically, I loaded IPI, which has no feature table in
the entries, and the genbank equivalents to get the feature tables.
Namely, the constraint caused an error when the genbank record was
loaded.

I think that this is primarily an issue with bioperl, but I raise it
here to make the java folks aware of the potential pitfall and maybe ask
whether the CRC should be calculated with the biodatabase in mind?
Probably not, since as hilmar states, it's still the same reference.

BTW - I solved the issue by dropping the constraint, since I really
don't care about references. Not optimal, but certainly the easiest
thing to do ;)

-angel

From hlapp at gmx.net Sun Aug 6 14:37:03 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 6 Aug 2006 15:37:03 -0300
Subject: [BioSQL-l] null title and CRC
In-Reply-To: <44D22E78.6050505@mail.med.upenn.edu>
References: <44D22E78.6050505@mail.med.upenn.edu>
Message-ID: <30B51FAA-6372-4559-AE50-D535AABF8AA1@gmx.net>

I think I need to debug this. If bioperl-db stumbles over this, then
it sounds like that's what needs to be fixed.

Can you or somebody else provide me with two sample records that
exemplify (i.e., replicate) the problem and which I can turn into a
test case?

-hilmar

On Aug 3, 2006, at 2:12 PM, Angel Pizarro wrote:

> From hilmar:
>> The CRC for references uses the authors, title, and location
>> attributes in Bioperl-db, and empty (or null) strings default to the
>> string "".
>> >> If title is empty and authors and location do not distinguish two >> references, then why do you want to have two rows for those >> references? Basically, there are identical for all intents and >> purposes, or are they not? >> >> -hilmar > > Sorry for not replying to the original thread, but I just joined > this list. > This was an issue for me with bioperl loading as well, since I was > using > the same biosql instance to load two different biodatabases with the > same entry. Specifically, I loaded IPI, which has no feature table in > the entries, and the genbank equivalents to get the feature tables. > Namely the constraint caused an error when the the genbank record was > loaded. > > I think that this is primarily an issue with bioperl, but I raise it > here to make the java folks aware of the potential pitfall and > maybe ask > if whether the CRC should be calculated with the biodatabase in mind? > Probably not, since as hilmar states, it's still the same reference. > > BTW - I solved the issue by dropping the constraint, since I really > don't care about references. Not optimal, but certainly easiest > thing to > do ;) > > -angel > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sun Aug 6 14:32:43 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 6 Aug 2006 15:32:43 -0300 Subject: [BioSQL-l] database extensions In-Reply-To: <44D23232.3020209@mail.med.upenn.edu> References: <44D23232.3020209@mail.med.upenn.edu> Message-ID: <56F43732-8EB4-4276-8971-0986B30EEAD0@gmx.net> Hi Angel, sorry for the belated response, I was at BOSC. See my comments below. 
On Aug 3, 2006, at 2:28 PM, Angel Pizarro wrote:

> Hello,
>
> Relatively new to biosql, but I was wondering about a few aspects
> of the schema/project.
>
> First, about the ontology tables, what is the preferred way to map
> ontology annotations to bioentries? via a seqfeature? Currently I just
> added a new table to map GO associations with the evidence code from
> GOA. Not optimal as there may be multiple lines of evidence for an
> association, as in the godatabase schema.

You link ontology terms to bioentries through the
bioentry_qualifier_value table, i.e., as a value-less term association.
If you want to capture the evidence code for GO associations then you
can use the value field in bioentry_qualifier_value to hold the code.

This indeed won't work very well if there are multiple evidence codes.
You could collapse them into one delimited string but that will impair
your ability to constrain searches by evidence code. However, a LIKE
constraint instead of string equality may not make a big difference
since typically the value column isn't indexed anyway since you may
have big values there.

At any rate, if you do have multiple evidence codes and you do want to
constrain searches by evidence code then there needs to be a better
solution.

>
> Second, are primary keys up for discussion any time soon? I realize
> that a lot of external projects rely on this schema, so it has to
> remain stable, but the inconsistent use of UID, compound keys or even
> lack of a key really put a hindrance on the use of off-the-shelf ORMs.

Can you elaborate? Meanwhile most tables do have a surrogate key. Only
those that serve as association tables and aren't referenced
themselves (and only very few association tables are referenced by
foreign key) do not (they still have a unique key constraint though).

Just to make sure - you're looking at the CVS check-out version, not
at 0.1 or something?

>
> Third, how does one go about submitting proposals for schema
> extensions?
> I am wanting to extend the schema with a few modules, mainly ripped
> out of either GUS and/or chado, as well as adding a module for
> proteomics data.

You would send those to the list, ideally accompanied by some
comments on motivation and why the existing tables can't deal with the
data the new entities are supposed to capture. That would give people
a chance to comment.

I enthusiastically welcome proposals for additions especially if those
help to promote the utility of BioSQL.

>
> Fourth, is the current practice for representation of biological
> pathways and interactions to use the bioentryrelationship table?

Yes, that was my plan when I worked on the Symgene project. I never got
to implement that though, so I don't know how well it would really work.

I did implement bioentry graphs with the bioentry_relationship table,
and I had to add an evidence table to accomplish my goals. With that
it worked very well though.

This is the evidence table, I'll add it in the 1.1 version.

CREATE TABLE Evidence (
    Evidence_Id               INTEGER NOT NULL,
    Score                     NUMBER NULL,
    Last_Modified             DATE DEFAULT SYSDATE NOT NULL,
    Bioentry_Relationship_Id  INTEGER NOT NULL,
    Term_Id                   INTEGER NOT NULL,
    DBXref_Id                 INTEGER NULL,
    PRIMARY KEY (Evidence_Id),
    UNIQUE (Bioentry_Relationship_Id, Term_Id, DBXref_Id)
);

>
> Many thanks.

You're most welcome.

-hilmar

>
> --
> Angel Pizarro
> Director, Bioinformatics Facility
> Institute for Translational Medicine and Therapeutics
> University of Pennsylvania
> 806 BRB II/III
> 421 Curie Blvd.
> Philadelphia, PA 19104-6160 > > P: 215-573-3736 > F: 215-573-9004 > E: angel at mail.med.upenn.edu > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From angel at mail.med.upenn.edu Mon Aug 7 09:10:55 2006 From: angel at mail.med.upenn.edu (Angel Pizarro) Date: Mon, 07 Aug 2006 09:10:55 -0400 Subject: [BioSQL-l] null title and CRC In-Reply-To: <30B51FAA-6372-4559-AE50-D535AABF8AA1@gmx.net> References: <44D22E78.6050505@mail.med.upenn.edu> <30B51FAA-6372-4559-AE50-D535AABF8AA1@gmx.net> Message-ID: <44D73BDF.10806@mail.med.upenn.edu> Hilmar Lapp wrote: > I think I need to debug this. If bioperl-db stumbles over this, then > it sounds like that's what needs to be fixed. > > Can you or somebody else provide with two sample records that > exemplify (i.e., replicate) the problem and which I can turn into a > test case? > Since these where bulk loads, I am not sure which records conflicted, but I'll have a poke around and see if I can grab a test set for you. -angel > -hilmar > > On Aug 3, 2006, at 2:12 PM, Angel Pizarro wrote: > >> From hilmar: >>> The CRC for references uses the authors, title, and location >>> attributes in Bioperl-db, and empty (or null) strings default to the >>> string "". >>> >>> If title is empty and authors and location do not distinguish two >>> references, then why do you want to have two rows for those >>> references? Basically, there are identical for all intents and >>> purposes, or are they not? >>> >>> -hilmar >> >> Sorry for not replying to the original thread, but I just joined this >> list. 
>> This was an issue for me with bioperl loading as well, since I was using >> the same biosql instance to load two different biodatabases with the >> same entry. Specifically, I loaded IPI, which has no feature table in >> the entries, and the genbank equivalents to get the feature tables. >> Namely the constraint caused an error when the the genbank record was >> loaded. >> >> I think that this is primarily an issue with bioperl, but I raise it >> here to make the java folks aware of the potential pitfall and maybe ask >> if whether the CRC should be calculated with the biodatabase in mind? >> Probably not, since as hilmar states, it's still the same reference. >> >> BTW - I solved the issue by dropping the constraint, since I really >> don't care about references. Not optimal, but certainly easiest thing to >> do ;) >> >> -angel >> >> _______________________________________________ >> BioSQL-l mailing list >> BioSQL-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biosql-l >> > > --=========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > -- Angel Pizarro Director, Bioinformatics Facility Institute for Translational Medicine and Therapeutics University of Pennsylvania 806 BRB II/III 421 Curie Blvd. Philadelphia, PA 19104-6160 P: 215-573-3736 F: 215-573-9004 E: angel at mail.med.upenn.edu From angel at mail.med.upenn.edu Mon Aug 7 09:45:16 2006 From: angel at mail.med.upenn.edu (Angel Pizarro) Date: Mon, 07 Aug 2006 09:45:16 -0400 Subject: [BioSQL-l] database extensions In-Reply-To: <56F43732-8EB4-4276-8971-0986B30EEAD0@gmx.net> References: <44D23232.3020209@mail.med.upenn.edu> <56F43732-8EB4-4276-8971-0986B30EEAD0@gmx.net> Message-ID: <44D743EC.5030601@mail.med.upenn.edu> Hilmar Lapp wrote: > Hi Angel, sorry for the belated response, I was at BOSC. See my > comments below. 
> Yes, I missed my chance to go this year. Maybe next! > On Aug 3, 2006, at 2:28 PM, Angel Pizarro wrote: >> >> Second, are primary keys up for discussion any time soon? I realize that >> a lot of external projects rely on this schema, so it has to remain >> stable, but the inconsistent use of UID, compound keys or even lack of a >> key really put a hindrance on the use of off-the-shelf ORMs. > > Can you elaborate? Meanwhile most tables do have a surrogate key. Only > those that serve as association tables and aren't referenced > themselves (and only very few association tables are referenced by > foreign key) do not (they still have a unique key constraint though). > > Just to make sure - you're looking at the CVS check-out version, not > at 0.1 or something? I am looking at the CVS 1.0 schema. By "inconsistent" I mean that certain tables have a single PK, others have multiple and yet others have none. Alternate keys are not the issue here. Many of the simple off-the-shelf object relational mapping APIs, particularly those tied to the web app suites, assume a single primary key and that all persistent object have one. Personally as a database guy I really don't see a problem with the data model, but it is making my life a little more difficult than it needs to be in the app and language binding space, particularly python. Lastly, since I do want to make some schema proposals, guidelines on how to encode the proposed data models would be nice, and make less work for the reviewers. My extra needs are: Experimental results: There is no schema component for storing exp results of high-throughput data like microarray and proteomics. Experimental context: You can't divorce the experimental context from the results of microarray, proteomics and other high-throughput experimental technologies. Pathway and networks: Hilmar has provided a start in the previous reply, but may need extension to kinetic information. 
I probably won't get to this, but I do notice that it is missing.

-angel

>>
>> Third, how does one go about submitting proposals for schema extensions?
>> I am wanting to extend the schema with a few modules, mainly ripped out
>> of either GUS and/or chado, as well as adding a module for
>> proteomics data.
>
> You would send those to the list, ideally accompanied with some
> comments on motivation and why the existing tables can't deal with the
> data the new entities are supposed to capture. That would give people
> a chance to comment.
>
> I enthusiastically welcome proposals for additions especially if those
> help to promote the utility of BioSQL.

From roy at colibase.bham.ac.uk Mon Aug 7 14:32:28 2006
From: roy at colibase.bham.ac.uk (Roy Chaudhuri)
Date: Mon, 07 Aug 2006 19:32:28 +0100
Subject: [BioSQL-l] PostgreSQL Error: column "oid" does not exist
Message-ID: <44D7873C.9010200@colibase.bham.ac.uk>

Hi all.

I've just been playing around with a new install of PostgreSQL v8.1.4. I
was getting lots of error messages saying 'column "oid" does not exist'
when I loaded the BioSQL schema (biosqldb-pg.sql). After a bit of
research I found out that the default value for default_with_oids has
changed from true to false for PostgreSQL 8.1 onwards:
http://www.postgresql.org/docs/8.1/static/runtime-config-compatible.html#GUC-DEFAULT-WITH-OIDS

Altering the BioSQL table definitions to explicitly state 'WITH OIDS'
before the semicolon works as a fix. Perhaps this could be changed in
the CVS?

Sorry if this seems obvious to Pg experts but it would have saved a
novice like me a bit of time had I found mention of it in the mailing
list archives.

Cheers,
Roy.

--
Dr. Roy Chaudhuri
Bioinformatics Research Fellow
Division of Immunity and Infection
University of Birmingham, U.K.
http://xbase.bham.ac.uk From hlapp at gmx.net Mon Aug 7 16:11:19 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 7 Aug 2006 16:11:19 -0400 Subject: [BioSQL-l] PostgreSQL Error: column "oid" does not exist In-Reply-To: <44D7873C.9010200@colibase.bham.ac.uk> References: <44D7873C.9010200@colibase.bham.ac.uk> Message-ID: <4A69DDF2-904B-4B48-A868-BA253875C0F6@gmx.net> I think this can be considered a bug in the Pg version of BioSQL and needs to be fixed. It is also a nice example for how things come back to you and bite. At the time I wrote the RULEs (the OID doesn't appear anywhere else does it?) PostgreSQL didn't have nested transactions and so the RULE statements really were a kludge to the absence of nested transaction, which is what I would have really wanted. I think it is time now to get rid of that stuff by implementing the respective parts in bioperl-db using nested transactions, thereby obviating the need for RULE statements, thereby obviating the need for putting 'WITH OID' anywhere. -hilmar On Aug 7, 2006, at 2:32 PM, Roy Chaudhuri wrote: > Hi all. > > I've just been playing around with a new install of PostgreSQL > v8.14. I > was getting lots of error messages saying 'column "oid" does not > exist' > when I loaded the BioSQL schema (biosqldb-pg.sql). After a bit of > research I found out that the default value for default_with_oids has > changed from true to false for PostgreSQL 8.1 onwards: > http://www.postgresql.org/docs/8.1/static/runtime-config- > compatible.html#GUC-DEFAULT-WITH-OIDS > > Altering the BioSQL table definitions to explicitly state 'WITH OIDS' > before the semicolon works as a fix. Perhaps this could be changed in > the CVS? > > Sorry if this seems obvious to Pg experts but it would have saved a > novice like me a bit of time had I found mention of it in the mailing > list archives. > > Cheers, > Roy. > > -- > Dr. Roy Chaudhuri > Bioinformatics Research Fellow > Division of Immunity and Infection > University of Birmingham, U.K. 
> > http://xbase.bham.ac.uk > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From angel at mail.med.upenn.edu Wed Aug 9 09:47:41 2006 From: angel at mail.med.upenn.edu (Angel Pizarro) Date: Wed, 09 Aug 2006 09:47:41 -0400 Subject: [BioSQL-l] Experimental context and result modules Message-ID: <44D9E77D.7060003@mail.med.upenn.edu> File attached. Hopefully the included comments are enough... If not I can whip up an ERD and some more doc -angel -------------- next part -------------- A non-text attachment was scrubbed... Name: biosql_ext.sql Type: text/x-sql Size: 9281 bytes Desc: not available Url : http://lists.open-bio.org/pipermail/biosql-l/attachments/20060809/544cdb61/attachment-0001.bin From angel at mail.med.upenn.edu Thu Aug 10 16:05:08 2006 From: angel at mail.med.upenn.edu (Angel Pizarro) Date: Thu, 10 Aug 2006 16:05:08 -0400 Subject: [BioSQL-l] null title and CRC In-Reply-To: <44D73BDF.10806@mail.med.upenn.edu> References: <44D22E78.6050505@mail.med.upenn.edu> <30B51FAA-6372-4559-AE50-D535AABF8AA1@gmx.net> <44D73BDF.10806@mail.med.upenn.edu> Message-ID: <44DB9174.7080203@mail.med.upenn.edu> Here are a set of records that make a new install of biosql fail b/c of the CRC constraint using the script : bioperl-db/scripts/biosql/load_seqdatabase.pl My test setup was latest CVS tarball (as of last week ;) ) of bioperl, mysql 5.0. Also recreated the error on a fresh postgres 7.4.8 (and 8.1) install. 
I ran the script like so:

perl ~/bin/load_seqdatabase.pl --dsn "dbi:mysql:bstest" --format genbank --dbuser xxxx --dbpass xxxx --namespace gb --lookup test_load_seqdatabase_crc.gbff

Here is the debug error message from one of the runs I did:

> --------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed,
> values were ("","Danio rerio small nuclear ribonucleoprotein
> polypeptide C, mRNA (cDNA clone MGC:109792
> IMAGE:7292940)","Unpublished
> (2005)","CRC-0E44E80E2C988097","1","159","") FKs ()
> ERROR: duplicate key violates unique constraint "reference_crc_key"

Cheers,
-angel

Angel Pizarro wrote:
> Hilmar Lapp wrote:
>
>> I think I need to debug this. If bioperl-db stumbles over this, then
>> it sounds like that's what needs to be fixed.
>>
>> Can you or somebody else provide with two sample records that
>> exemplify (i.e., replicate) the problem and which I can turn into a
>> test case?
>>
>>
> Since these were bulk loads, I am not sure which records conflicted,
> but I'll have a poke around and see if I can grab a test set for you.
> -angel
>
>
>> -hilmar
>>
>> On Aug 3, 2006, at 2:12 PM, Angel Pizarro wrote:
>>
>>
>>> From hilmar:
>>>
>>>> The CRC for references uses the authors, title, and location
>>>> attributes in Bioperl-db, and empty (or null) strings default to the
>>>> string "".
>>>>
>>>> If title is empty and authors and location do not distinguish two
>>>> references, then why do you want to have two rows for those
>>>> references? Basically, there are identical for all intents and
>>>> purposes, or are they not?
>>>>
>>>> -hilmar
>>>>
>>> Sorry for not replying to the original thread, but I just joined this
>>> list.
>>> This was an issue for me with bioperl loading as well, since I was using
>>> the same biosql instance to load two different biodatabases with the
>>> same entry.
Specifically, I loaded IPI, which has no feature table in >>> the entries, and the genbank equivalents to get the feature tables. >>> Namely the constraint caused an error when the the genbank record was >>> loaded. >>> >>> I think that this is primarily an issue with bioperl, but I raise it >>> here to make the java folks aware of the potential pitfall and maybe ask >>> if whether the CRC should be calculated with the biodatabase in mind? >>> Probably not, since as hilmar states, it's still the same reference. >>> >>> BTW - I solved the issue by dropping the constraint, since I really >>> don't care about references. Not optimal, but certainly easiest thing to >>> do ;) >>> >>> -angel >>> >>> _______________________________________________ >>> BioSQL-l mailing list >>> BioSQL-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biosql-l >>> >>> >> --=========================================================== >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : >> =========================================================== >> >> >> >> >> >> > > -- Angel Pizarro Director, Bioinformatics Facility Institute for Translational Medicine and Therapeutics University of Pennsylvania 806 BRB II/III 421 Curie Blvd. Philadelphia, PA 19104-6160 P: 215-573-3736 F: 215-573-9004 E: angel at mail.med.upenn.edu -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: test_load_seqdatabase_crc.gbff Url: http://lists.open-bio.org/pipermail/biosql-l/attachments/20060810/1e7d3285/attachment.pl From muratem at eng.uah.edu Fri Aug 11 12:10:30 2006 From: muratem at eng.uah.edu (Mike Muratet) Date: Fri, 11 Aug 2006 11:10:30 -0500 (CDT) Subject: [BioSQL-l] load_seqdatabase fails when loading refseq plant files Message-ID: Hello all I am using biosql-schema/bioperl-db to load Refseq entries into a biosql database. 
I don't see any version info in the files, but I downloaded everything in the last month or so and everything passed all the tests when installed. I am using perl 5.8.5, mysql 5.0.22, DBI-1.5.1, DBD-mysql-3.006. I was loading plant file from Refseq rel 18: load_seqdatabase.pl --dbname biosql --lookup --u --namespace plant --format genbank --safe plant*.rna.gbff.gz and it crashed after about 30K of 60K records: at /usr/lib/perl5/site_perl/5.8.5/Bio/biosql-schema/sql/bioperl-db/scripts/biosql/load_seqdatabase.pl line 633 -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were ("","Direct Submission","Submitted (01-JUL-2004) National Center for Biotechnology Information, National Institutes of Health, Bethesda 20894, United States of America","CRC-6F1453182E2BAC3F","1","786","") FKs () Duplicate entry 'CRC-6F1453182E2BAC3F' for key 3 --------------------------------------------------- Could not store XM_472403: ------------- EXCEPTION ------------- MSG: create: object (Bio::Annotation::Reference) failed to insert or to be found by unique key STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:208 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254 STACK Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272 STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:219 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:216 t I traced the error back through the source and database and found that XM_472403 has the same CRC value as XM_473880. 
I actually got many errors of this type, but only the last one crashed
the script (in spite of --safe).

Should there be more info included in the CRC field? I am weak when
it comes to RDBMSs, but looking at the schema, I would guess that the
CRC field was added to make an otherwise degenerate key unique. Would it
help to add more fields to the CRC, or another key? The former might be
done without having to change a lot of code.

Thanks

Mike

From angel at mail.med.upenn.edu Fri Aug 11 14:57:35 2006
From: angel at mail.med.upenn.edu (Angel Pizarro)
Date: Fri, 11 Aug 2006 14:57:35 -0400
Subject: [BioSQL-l] load_seqdatabase fails when loading refseq plant files
In-Reply-To:
References:
Message-ID: <1155322655.4837.25.camel@gort.gcrc.upenn.edu>

Glad I am not the only one that ran into this problem! Mike, I had
reported this issue a few emails back and have provided the list with an
example file for testing, so it should be resolved soon.

FYI, you are correct that CRC is computed on load to determine if two
pub references are in fact the same. This is a feature to save database
space. The expected behaviour would be that subsequent entries with
the same CRC reference get an FK to the originating reference
entry, and do not insert a duplicate row into the reference table.

FYI #2, the --safe option explicitly states that it will continue to
process records after errors BUT do a roll-back at the end of the run.
This is to gather all of your errors in one shot, as opposed to fixing a
record, starting, error, fix, etc.

If you are impatient and do not care about references, you have three
choices.

1) drop the unique constraint on reference.crc (this will cause dups in
reference, and you can not go back to a unique CRC without some major SQL
data migration routine to fix FK's and delete the dups).

2) filter your records to not contain reference information

3) alter load_seqdatabase to not enter reference information.
This would be in the Bio::AnnotationCollection object: $seq->annotation()->remove_Annotations('reference'); The above command inserted someplace in the script line ~575 should do the trick. Obviously this means that all reference information is not loaded into the DB at all. -angel On Fri, 2006-08-11 at 11:10 -0500, Mike Muratet wrote: > Hello all > > I am using biosql-schema/bioperl-db to load Refseq entries into a biosql > database. I don't see any version info in the files, but I downloaded > everything in the last month or so and everything passed all the tests > when installed. I am using perl 5.8.5, mysql 5.0.22, DBI-1.5.1, > DBD-mysql-3.006. I was loading plant file from Refseq rel 18: > > load_seqdatabase.pl --dbname biosql > --lookup --u --namespace plant --format genbank --safe plant*.rna.gbff.gz > > and it crashed after about 30K of 60K records: > > at /usr/lib/perl5/site_perl/5.8.5/Bio/biosql-schema/sql/bioperl-db/scripts/biosql/load_seqdatabase.pl > line 633 > > -------------------- WARNING --------------------- > MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values > were ("","Direct Submission","Submitted (01-JUL-2004) National Center for > Biotechnology Information, National Institutes of Health, Bethesda 20894, > United States of America","CRC-6F1453182E2BAC3F","1","786","") FKs > () > Duplicate entry 'CRC-6F1453182E2BAC3F' for key 3 > --------------------------------------------------- > Could not store XM_472403: > ------------- EXCEPTION ------------- > MSG: create: object (Bio::Annotation::Reference) failed to insert or to be > found by unique key > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create > /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:208 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store > /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254 > STACK Bio::DB::Persistent::PersistentObject::store > 
/usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272 > STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children > /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:219 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create > /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:216 > t > > I traced the error back through the source and database and found that > XM_472403 has the same CRC value as XM_473880. I actually got many errors of this type, > but only the last one crashed the script (in spite of --safe). > > Should there be more info included in the CRC field? I am weak when > it comes to RDBMs, but looking at the schema, I would guess that the CRC field > was added to make an otherwise degenerate key unique. Would it help to add > more fields to the CRC, or another key? The former might be done without > have to change a lot of code. > > Thanks > > Mike > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l From muratem at eng.uah.edu Mon Aug 14 12:55:45 2006 From: muratem at eng.uah.edu (Mike Muratet) Date: Mon, 14 Aug 2006 11:55:45 -0500 (CDT) Subject: [BioSQL-l] load_seqdatabase fails when loading refseq plant files In-Reply-To: <1155322655.4837.25.camel@gort.gcrc.upenn.edu> References: <1155322655.4837.25.camel@gort.gcrc.upenn.edu> Message-ID: On Fri, 11 Aug 2006, Angel Pizarro wrote: > Date: Fri, 11 Aug 2006 14:57:35 -0400 > From: Angel Pizarro > To: BioSQL , Bioperl > Subject: Re: [BioSQL-l] load_seqdatabase fails when loading refseq plant files > > Glad I am not the only one that ran into this problem! Mike, I had > reported this issue a few emails back and have provided the list with an > example file for testing, so it should be resolved soon. > I must have missed it. Sorry. 
> FYI, you are correct that CRC is computed on load to determine if two > pub references are in fact the same. This is a feature to save database > space. The expected behaviour would be for the subsequent entries with > the same CRC reference should have an FK to the originating reference > entry, and not insert a duplicate row into the reference table. > > FYI #2, the --safe option explicitly states that it will continue to > process records after errors BUT do a roll-back at the end of the run. > This is to gather all of your errors in one shot, as opposed to fixing a > record, starting, error, fix, etc ,. > > If you are impatient and do not care about references, you have three > choices. > 1) drop the unique constraint on reference.crc (this will cause dups in > reference and you can not go back to a unique CRC without some major SQL > data migration routine to fix FK's and delete the dups. > > 2) filter your records to not contain reference information > > 3) alter load_seqdatabase to not enter reference information. This would > be in the Bio::AnnotationCollection object: > > $seq->annotation()->remove_Annotations('reference'); > > The above command inserted someplace in the script line ~575 should do > the trick. Obviously this means that all reference information is not > loaded into the DB at all. > I do need to get something working, and the references are not critical to the application, so I will probably alter load_seqdatabase. Thanks for the help! Cheers Mike > -angel > > On Fri, 2006-08-11 at 11:10 -0500, Mike Muratet wrote: >> Hello all >> >> I am using biosql-schema/bioperl-db to load Refseq entries into a biosql >> database. I don't see any version info in the files, but I downloaded >> everything in the last month or so and everything passed all the tests >> when installed. I am using perl 5.8.5, mysql 5.0.22, DBI-1.5.1, >> DBD-mysql-3.006. 
I was loading plant file from Refseq rel 18: >> >> load_seqdatabase.pl --dbname biosql >> --lookup --u --namespace plant --format genbank --safe plant*.rna.gbff.gz >> >> and it crashed after about 30K of 60K records: >> >> at /usr/lib/perl5/site_perl/5.8.5/Bio/biosql-schema/sql/bioperl-db/scripts/biosql/load_seqdatabase.pl >> line 633 >> >> -------------------- WARNING --------------------- >> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values >> were ("","Direct Submission","Submitted (01-JUL-2004) National Center for >> Biotechnology Information, National Institutes of Health, Bethesda 20894, >> United States of America","CRC-6F1453182E2BAC3F","1","786","") FKs >> () >> Duplicate entry 'CRC-6F1453182E2BAC3F' for key 3 >> --------------------------------------------------- >> Could not store XM_472403: >> ------------- EXCEPTION ------------- >> MSG: create: object (Bio::Annotation::Reference) failed to insert or to be >> found by unique key >> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create >> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:208 >> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store >> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254 >> STACK Bio::DB::Persistent::PersistentObject::store >> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272 >> STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children >> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:219 >> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create >> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:216 >> t >> >> I traced the error back through the source and database and found that >> XM_472403 has the same CRC value as XM_473880. I actually got many errors of this type, >> but only the last one crashed the script (in spite of --safe). >> >> Should there be more info included in the CRC field? 
I am weak when >> it comes to RDBMs, but looking at the schema, I would guess that the CRC field >> was added to make an otherwise degenerate key unique. Would it help to add >> more fields to the CRC, or another key? The former might be done without >> have to change a lot of code. >> >> Thanks >> >> Mike >> _______________________________________________ >> BioSQL-l mailing list >> BioSQL-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biosql-l > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l >
From hlapp at gmx.net Sun Aug 6 18:37:03 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 6 Aug 2006 15:37:03 -0300 Subject: [BioSQL-l] null title and CRC In-Reply-To: <44D22E78.6050505@mail.med.upenn.edu> References: <44D22E78.6050505@mail.med.upenn.edu> Message-ID: <30B51FAA-6372-4559-AE50-D535AABF8AA1@gmx.net> I think I need to debug this. If bioperl-db stumbles over this, then it sounds like that's what needs to be fixed. Can you or somebody else provide with two sample records that exemplify (i.e., replicate) the problem and which I can turn into a test case? -hilmar On Aug 3, 2006, at 2:12 PM, Angel Pizarro wrote: > From hilmar: >> The CRC for references uses the authors, title, and location >> attributes in Bioperl-db, and empty (or null) strings default to the >> string "". >> >> If title is empty and authors and location do not distinguish two >> references, then why do you want to have two rows for those >> references? Basically, there are identical for all intents and >> purposes, or are they not? >> >> -hilmar > > Sorry for not replying to the original thread, but I just joined > this list. > This was an issue for me with bioperl loading as well, since I was > using > the same biosql instance to load two different biodatabases with the > same entry. Specifically, I loaded IPI, which has no feature table in > the entries, and the genbank equivalents to get the feature tables. > Namely the constraint caused an error when the the genbank record was > loaded. > > I think that this is primarily an issue with bioperl, but I raise it > here to make the java folks aware of the potential pitfall and > maybe ask > if whether the CRC should be calculated with the biodatabase in mind? > Probably not, since as hilmar states, it's still the same reference. > > BTW - I solved the issue by dropping the constraint, since I really > don't care about references.
Not optimal, but certainly easiest > thing to > do ;) > > -angel > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sun Aug 6 18:32:43 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 6 Aug 2006 15:32:43 -0300 Subject: [BioSQL-l] database extensions In-Reply-To: <44D23232.3020209@mail.med.upenn.edu> References: <44D23232.3020209@mail.med.upenn.edu> Message-ID: <56F43732-8EB4-4276-8971-0986B30EEAD0@gmx.net> Hi Angel, sorry for the belated response, I was at BOSC. See my comments below. On Aug 3, 2006, at 2:28 PM, Angel Pizarro wrote: > Hello, > > Relatively new to biosql, but I was wondering about a few aspects > of the > schema/project. > > First, about the ontology tables, what is the preferred way to map > ontology annotations to bioentries? via a seqfeature? Currently I just > added a new table to map GO associations with the evidence code from > GOA. Not optimal as there may be multiple lines of evidence for an > association, as in the godatabase schema. You link ontology terms to bioentries through the bioentry_qualifier_value table, i.e., as a value-less term association. If you want to capture the evidence code for GO associations, then you can use the value field in bioentry_qualifier_value to hold the code. This indeed won't work very well if there are multiple evidence codes. You could collapse them into one delimited string but that will impair your ability to constrain searches by evidence code. However, a LIKE constraint instead of string equality may not make a big difference since typically the value column isn't indexed anyway since you may have big values there.
At any rate, if you do have multiple evidence codes and you do want to constrain searches by evidence code then there needs to be a better solution. > > Second, are primary keys up for discussion any time soon? I realize > that > a lot of external projects rely on this schema, so it has to remain > stable, but the inconsistent use of UID, compound keys or even lack > of a > key really put a hindrance on the use of off-the-shelf ORMs. Can you elaborate? Meanwhile most tables do have a surrogate key. Only those that serve as association tables and aren't referenced themselves (and only very few association tables are referenced by foreign key) do not (they still have a unique key constraint though). Just to make sure - you're looking at the CVS check-out version, not at 0.1 or something? > > Third, how does one go about submitting proposals for schema > extensions? > I am wanting to extend the schema with a few modules, mainly ripped > out > of either GUS and/or chado, as well as adding a module for > proteomics data. You would send those to the list, ideally accompanied with some comments on motivation and why the existing tables can't deal with the data the new entities are supposed to capture. That would give people a chance to comment. I enthusiastically welcome proposals for additions especially if those help to promote the utility of BioSQL. > > Fourth, is the current practice for representation of biological > pathways and interactions to use the bioentryrelationship table? Yes, that was my plan when I worked on the Symgene project. I didn't get to ever implement that though so don't know how well it would really work. I did implement bioentry graphs with the bioentry_relationship table, and I had to add an evidence table to accomplish my goals. With that it worked very well though. This is the evidence table, I'll add it in the 1.1 version. 
CREATE TABLE Evidence ( Evidence_Id INTEGER NOT NULL, Score NUMBER NULL, Last_Modified DATE DEFAULT SYSDATE NOT NULL, Bioentry_Relationship_Id INTEGER NOT NULL, Term_Id INTEGER NOT NULL, DBXref_Id INTEGER NULL, PRIMARY KEY (Evidence_Id), UNIQUE (Bioentry_Relationship_Id, Term_Id, DBXref_Id) ); > > Many thanks. You're most welcome. -hilmar > > -- > Angel Pizarro > Director, Bioinformatics Facility > Institute for Translational Medicine and Therapeutics > University of Pennsylvania > 806 BRB II/III > 421 Curie Blvd. > Philadelphia, PA 19104-6160 > > P: 215-573-3736 > F: 215-573-9004 > E: angel at mail.med.upenn.edu > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From angel at mail.med.upenn.edu Mon Aug 7 13:10:55 2006 From: angel at mail.med.upenn.edu (Angel Pizarro) Date: Mon, 07 Aug 2006 09:10:55 -0400 Subject: [BioSQL-l] null title and CRC In-Reply-To: <30B51FAA-6372-4559-AE50-D535AABF8AA1@gmx.net> References: <44D22E78.6050505@mail.med.upenn.edu> <30B51FAA-6372-4559-AE50-D535AABF8AA1@gmx.net> Message-ID: <44D73BDF.10806@mail.med.upenn.edu> Hilmar Lapp wrote: > I think I need to debug this. If bioperl-db stumbles over this, then > it sounds like that's what needs to be fixed. > > Can you or somebody else provide with two sample records that > exemplify (i.e., replicate) the problem and which I can turn into a > test case? > Since these were bulk loads, I am not sure which records conflicted, but I'll have a poke around and see if I can grab a test set for you.
-angel > -hilmar > > On Aug 3, 2006, at 2:12 PM, Angel Pizarro wrote: > >> From hilmar: >>> The CRC for references uses the authors, title, and location >>> attributes in Bioperl-db, and empty (or null) strings default to the >>> string "". >>> >>> If title is empty and authors and location do not distinguish two >>> references, then why do you want to have two rows for those >>> references? Basically, there are identical for all intents and >>> purposes, or are they not? >>> >>> -hilmar >> >> Sorry for not replying to the original thread, but I just joined this >> list. >> This was an issue for me with bioperl loading as well, since I was using >> the same biosql instance to load two different biodatabases with the >> same entry. Specifically, I loaded IPI, which has no feature table in >> the entries, and the genbank equivalents to get the feature tables. >> Namely the constraint caused an error when the the genbank record was >> loaded. >> >> I think that this is primarily an issue with bioperl, but I raise it >> here to make the java folks aware of the potential pitfall and maybe ask >> if whether the CRC should be calculated with the biodatabase in mind? >> Probably not, since as hilmar states, it's still the same reference. >> >> BTW - I solved the issue by dropping the constraint, since I really >> don't care about references. Not optimal, but certainly easiest thing to >> do ;) >> >> -angel >> >> _______________________________________________ >> BioSQL-l mailing list >> BioSQL-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biosql-l >> > > --=========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > -- Angel Pizarro Director, Bioinformatics Facility Institute for Translational Medicine and Therapeutics University of Pennsylvania 806 BRB II/III 421 Curie Blvd. 
Philadelphia, PA 19104-6160 P: 215-573-3736 F: 215-573-9004 E: angel at mail.med.upenn.edu From angel at mail.med.upenn.edu Mon Aug 7 13:45:16 2006 From: angel at mail.med.upenn.edu (Angel Pizarro) Date: Mon, 07 Aug 2006 09:45:16 -0400 Subject: [BioSQL-l] database extensions In-Reply-To: <56F43732-8EB4-4276-8971-0986B30EEAD0@gmx.net> References: <44D23232.3020209@mail.med.upenn.edu> <56F43732-8EB4-4276-8971-0986B30EEAD0@gmx.net> Message-ID: <44D743EC.5030601@mail.med.upenn.edu> Hilmar Lapp wrote: > Hi Angel, sorry for the belated response, I was at BOSC. See my > comments below. > Yes, I missed my chance to go this year. Maybe next! > On Aug 3, 2006, at 2:28 PM, Angel Pizarro wrote: >> >> Second, are primary keys up for discussion any time soon? I realize that >> a lot of external projects rely on this schema, so it has to remain >> stable, but the inconsistent use of UID, compound keys or even lack of a >> key really put a hindrance on the use of off-the-shelf ORMs. > > Can you elaborate? Meanwhile most tables do have a surrogate key. Only > those that serve as association tables and aren't referenced > themselves (and only very few association tables are referenced by > foreign key) do not (they still have a unique key constraint though). > > Just to make sure - you're looking at the CVS check-out version, not > at 0.1 or something? I am looking at the CVS 1.0 schema. By "inconsistent" I mean that certain tables have a single PK, others have multiple and yet others have none. Alternate keys are not the issue here. Many of the simple off-the-shelf object relational mapping APIs, particularly those tied to the web app suites, assume a single primary key and that all persistent object have one. Personally as a database guy I really don't see a problem with the data model, but it is making my life a little more difficult than it needs to be in the app and language binding space, particularly python. 
Lastly, since I do want to make some schema proposals, guidelines on how to encode the proposed data models would be nice, and make less work for the reviewers. My extra needs are: Experimental results: There is no schema component for storing exp results of high-throughput data like microarray and proteomics. Experimental context: You can't divorce the experimental context from the results of microarray, proteomics and other high-throughput experimental technologies. Pathway and networks: Hilmar has provided a start in the previous reply, but may need extension to kinetic information. I probably won't get to this, but I do notice that it is missing. -angel >> >> Third, how does one go about submitting proposals for schema extensions? >> I am wanting to extend the schema with a few modules, mainly ripped out >> of either GUS and/or chado, as well as adding a module for >> proteomics data. > > You would send those to the list, ideally accompanied with some > comments on motivation and why the existing tables can't deal with the > data the new entities are supposed to capture. That would give people > a chance to comment. > > I enthusiastically welcome proposals for additions especially if those > help to promote the utility of BioSQL. From roy at colibase.bham.ac.uk Mon Aug 7 18:32:28 2006 From: roy at colibase.bham.ac.uk (Roy Chaudhuri) Date: Mon, 07 Aug 2006 19:32:28 +0100 Subject: [BioSQL-l] PostgreSQL Error: column "oid" does not exist Message-ID: <44D7873C.9010200@colibase.bham.ac.uk> Hi all. I've just been playing around with a new install of PostgreSQL v8.1.4. I was getting lots of error messages saying 'column "oid" does not exist' when I loaded the BioSQL schema (biosqldb-pg.sql).
After a bit of research I found out that the default value for default_with_oids has changed from true to false for PostgreSQL 8.1 onwards: http://www.postgresql.org/docs/8.1/static/runtime-config-compatible.html#GUC-DEFAULT-WITH-OIDS Altering the BioSQL table definitions to explicitly state 'WITH OIDS' before the semicolon works as a fix. Perhaps this could be changed in the CVS? Sorry if this seems obvious to Pg experts but it would have saved a novice like me a bit of time had I found mention of it in the mailing list archives. Cheers, Roy. -- Dr. Roy Chaudhuri Bioinformatics Research Fellow Division of Immunity and Infection University of Birmingham, U.K. http://xbase.bham.ac.uk From hlapp at gmx.net Mon Aug 7 20:11:19 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 7 Aug 2006 16:11:19 -0400 Subject: [BioSQL-l] PostgreSQL Error: column "oid" does not exist In-Reply-To: <44D7873C.9010200@colibase.bham.ac.uk> References: <44D7873C.9010200@colibase.bham.ac.uk> Message-ID: <4A69DDF2-904B-4B48-A868-BA253875C0F6@gmx.net> I think this can be considered a bug in the Pg version of BioSQL and needs to be fixed. It is also a nice example for how things come back to you and bite. At the time I wrote the RULEs (the OID doesn't appear anywhere else does it?) PostgreSQL didn't have nested transactions and so the RULE statements really were a kludge to the absence of nested transaction, which is what I would have really wanted. I think it is time now to get rid of that stuff by implementing the respective parts in bioperl-db using nested transactions, thereby obviating the need for RULE statements, thereby obviating the need for putting 'WITH OID' anywhere. -hilmar On Aug 7, 2006, at 2:32 PM, Roy Chaudhuri wrote: > Hi all. > > I've just been playing around with a new install of PostgreSQL > v8.14. I > was getting lots of error messages saying 'column "oid" does not > exist' > when I loaded the BioSQL schema (biosqldb-pg.sql). 
After a bit of > research I found out that the default value for default_with_oids has > changed from true to false for PostgreSQL 8.1 onwards: > http://www.postgresql.org/docs/8.1/static/runtime-config- > compatible.html#GUC-DEFAULT-WITH-OIDS > > Altering the BioSQL table definitions to explicitly state 'WITH OIDS' > before the semicolon works as a fix. Perhaps this could be changed in > the CVS? > > Sorry if this seems obvious to Pg experts but it would have saved a > novice like me a bit of time had I found mention of it in the mailing > list archives. > > Cheers, > Roy. > > -- > Dr. Roy Chaudhuri > Bioinformatics Research Fellow > Division of Immunity and Infection > University of Birmingham, U.K. > > http://xbase.bham.ac.uk > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From angel at mail.med.upenn.edu Wed Aug 9 13:47:41 2006 From: angel at mail.med.upenn.edu (Angel Pizarro) Date: Wed, 09 Aug 2006 09:47:41 -0400 Subject: [BioSQL-l] Experimental context and result modules Message-ID: <44D9E77D.7060003@mail.med.upenn.edu> File attached. Hopefully the included comments are enough... If not I can whip up an ERD and some more doc -angel -------------- next part -------------- A non-text attachment was scrubbed... 
Name: biosql_ext.sql Type: text/x-sql Size: 9281 bytes Desc: not available URL: From angel at mail.med.upenn.edu Thu Aug 10 20:05:08 2006 From: angel at mail.med.upenn.edu (Angel Pizarro) Date: Thu, 10 Aug 2006 16:05:08 -0400 Subject: [BioSQL-l] null title and CRC In-Reply-To: <44D73BDF.10806@mail.med.upenn.edu> References: <44D22E78.6050505@mail.med.upenn.edu> <30B51FAA-6372-4559-AE50-D535AABF8AA1@gmx.net> <44D73BDF.10806@mail.med.upenn.edu> Message-ID: <44DB9174.7080203@mail.med.upenn.edu> Here are a set of records that make a new install of biosql fail b/c of the CRC constraint using the script : bioperl-db/scripts/biosql/load_seqdatabase.pl My test setup was latest CVS tarball (as of last week ;) ) of bioperl, mysql 5.0. Also recreated the error on a fresh postgres 7.4.8 (and 8.1) install. I ran the script like so: perl ~/bin/load_seqdatabase.pl --dsn "dbi:mysql:bstest" --format genbank --dbuser xxxx --dbpass xxxx --namespace gb --lookup test_load_seqdatabase_crc.gbff Here is the debug error message from one of the runs I did: > --------------- WARNING --------------------- > MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, > values were ("","Danio rerio small nuclear ribonucleoprotein > polypeptide C, mRNA (cDNA clone MGC:109792 > IMAGE:7292940)","Unpublished > (2005)","CRC-0E44E80E2C988097","1","159","") FKs () > ERROR: duplicate key violates unique constraint "reference_crc_key" Cheers, -angel Angel Pizarro wrote: > Hilmar Lapp wrote: > >> I think I need to debug this. If bioperl-db stumbles over this, then >> it sounds like that's what needs to be fixed. >> >> Can you or somebody else provide with two sample records that >> exemplify (i.e., replicate) the problem and which I can turn into a >> test case? >> >> > Since these where bulk loads, I am not sure which records conflicted, > but I'll have a poke around and see if I can grab a test set for you. 
> -angel > > >> -hilmar >> >> On Aug 3, 2006, at 2:12 PM, Angel Pizarro wrote: >> >> >>> From hilmar: >>> >>>> The CRC for references uses the authors, title, and location >>>> attributes in Bioperl-db, and empty (or null) strings default to the >>>> string "". >>>> >>>> If title is empty and authors and location do not distinguish two >>>> references, then why do you want to have two rows for those >>>> references? Basically, there are identical for all intents and >>>> purposes, or are they not? >>>> >>>> -hilmar >>>> >>> Sorry for not replying to the original thread, but I just joined this >>> list. >>> This was an issue for me with bioperl loading as well, since I was using >>> the same biosql instance to load two different biodatabases with the >>> same entry. Specifically, I loaded IPI, which has no feature table in >>> the entries, and the genbank equivalents to get the feature tables. >>> Namely the constraint caused an error when the the genbank record was >>> loaded. >>> >>> I think that this is primarily an issue with bioperl, but I raise it >>> here to make the java folks aware of the potential pitfall and maybe ask >>> if whether the CRC should be calculated with the biodatabase in mind? >>> Probably not, since as hilmar states, it's still the same reference. >>> >>> BTW - I solved the issue by dropping the constraint, since I really >>> don't care about references. 
Not optimal, but certainly easiest thing to >>> do ;) >>> >>> -angel >>> >>> _______________________________________________ >>> BioSQL-l mailing list >>> BioSQL-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biosql-l >>> >>> >> --=========================================================== >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : >> =========================================================== >> >> >> >> >> >> > > -- Angel Pizarro Director, Bioinformatics Facility Institute for Translational Medicine and Therapeutics University of Pennsylvania 806 BRB II/III 421 Curie Blvd. Philadelphia, PA 19104-6160 P: 215-573-3736 F: 215-573-9004 E: angel at mail.med.upenn.edu -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: test_load_seqdatabase_crc.gbff URL: From muratem at eng.uah.edu Fri Aug 11 16:10:30 2006 From: muratem at eng.uah.edu (Mike Muratet) Date: Fri, 11 Aug 2006 11:10:30 -0500 (CDT) Subject: [BioSQL-l] load_seqdatabase fails when loading refseq plant files Message-ID: Hello all I am using biosql-schema/bioperl-db to load Refseq entries into a biosql database. I don't see any version info in the files, but I downloaded everything in the last month or so and everything passed all the tests when installed. I am using perl 5.8.5, mysql 5.0.22, DBI-1.5.1, DBD-mysql-3.006. 
I was loading plant file from Refseq rel 18: load_seqdatabase.pl --dbname biosql --lookup --u --namespace plant --format genbank --safe plant*.rna.gbff.gz and it crashed after about 30K of 60K records: at /usr/lib/perl5/site_perl/5.8.5/Bio/biosql-schema/sql/bioperl-db/scripts/biosql/load_seqdatabase.pl line 633 -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were ("","Direct Submission","Submitted (01-JUL-2004) National Center for Biotechnology Information, National Institutes of Health, Bethesda 20894, United States of America","CRC-6F1453182E2BAC3F","1","786","") FKs () Duplicate entry 'CRC-6F1453182E2BAC3F' for key 3 --------------------------------------------------- Could not store XM_472403: ------------- EXCEPTION ------------- MSG: create: object (Bio::Annotation::Reference) failed to insert or to be found by unique key STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:208 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254 STACK Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272 STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:219 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:216 t I traced the error back through the source and database and found that XM_472403 has the same CRC value as XM_473880. I actually got many errors of this type, but only the last one crashed the script (in spite of --safe). Should there be more info included in the CRC field? I am weak when it comes to RDBMs, but looking at the schema, I would guess that the CRC field was added to make an otherwise degenerate key unique. 
Would it help to add more fields to the CRC, or another key? The former might be done without having to change a lot of code. Thanks Mike From angel at mail.med.upenn.edu Fri Aug 11 18:57:35 2006 From: angel at mail.med.upenn.edu (Angel Pizarro) Date: Fri, 11 Aug 2006 14:57:35 -0400 Subject: [BioSQL-l] load_seqdatabase fails when loading refseq plant files In-Reply-To: References: Message-ID: <1155322655.4837.25.camel@gort.gcrc.upenn.edu> Glad I am not the only one that ran into this problem! Mike, I had reported this issue a few emails back and have provided the list with an example file for testing, so it should be resolved soon. FYI, you are correct that CRC is computed on load to determine if two pub references are in fact the same. This is a feature to save database space. The expected behaviour would be for subsequent entries with the same CRC to have an FK to the originating reference entry, and not to insert a duplicate row into the reference table. FYI #2, the --safe option explicitly states that it will continue to process records after errors BUT do a roll-back at the end of the run. This is to gather all of your errors in one shot, as opposed to fixing a record, starting, error, fix, etc. If you are impatient and do not care about references, you have three choices. 1) drop the unique constraint on reference.crc (this will cause dups in reference and you can not go back to a unique CRC without some major SQL data migration routine to fix FK's and delete the dups). 2) filter your records to not contain reference information 3) alter load_seqdatabase to not enter reference information. This would be in the Bio::AnnotationCollection object: $seq->annotation()->remove_Annotations('reference'); The above command inserted someplace in the script line ~575 should do the trick. Obviously this means that all reference information is not loaded into the DB at all.
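For anyone who wants to see the expected behaviour spelled out, here is a rough sketch (in Python with an in-memory SQLite table, purely for illustration) of "look up by CRC first, insert only if nothing is found". This is not the bioperl-db code: the table is a cut-down stand-in for the BioSQL reference table, zlib.crc32 stands in for whatever CRC bioperl-db actually computes, and get_or_create_reference is a made-up helper name.

```python
import sqlite3
import zlib


def reference_crc(authors, title, location):
    # Missing (None) fields are treated as "", as described for Bioperl-db.
    # zlib.crc32 is only a stand-in for the real CRC (assumption).
    parts = [(p or "") for p in (authors, title, location)]
    checksum = zlib.crc32("\x00".join(parts).encode("utf-8")) & 0xFFFFFFFF
    return "CRC-%08X" % checksum


def get_or_create_reference(conn, authors, title, location):
    # Hypothetical helper: reuse the row with the same CRC if one exists,
    # otherwise insert a new row. Either way, return its primary key so the
    # caller can point a foreign key at it.
    crc = reference_crc(authors, title, location)
    row = conn.execute(
        "SELECT reference_id FROM reference WHERE crc = ?", (crc,)
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO reference (authors, title, location, crc) "
        "VALUES (?, ?, ?, ?)",
        (authors or "", title or "", location or "", crc),
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
# Simplified stand-in for the BioSQL reference table (assumption).
conn.execute(
    "CREATE TABLE reference ("
    " reference_id INTEGER PRIMARY KEY,"
    " authors TEXT, title TEXT, location TEXT,"
    " crc TEXT UNIQUE)"
)

# Two records carrying the same "Direct Submission" reference.
first = get_or_create_reference(
    conn, "", "Direct Submission", "Submitted (01-JUL-2004) NCBI")
second = get_or_create_reference(
    conn, None, "Direct Submission", "Submitted (01-JUL-2004) NCBI")
row_count = conn.execute("SELECT COUNT(*) FROM reference").fetchone()[0]
```

With a lookup like this in front of the insert, the second record gets the key of the existing row and the unique constraint on crc is never violated, which is what I would expect the loader to do.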
-angel

On Fri, 2006-08-11 at 11:10 -0500, Mike Muratet wrote:
> Hello all
>
> I am using biosql-schema/bioperl-db to load Refseq entries into a biosql
> database. I don't see any version info in the files, but I downloaded
> everything in the last month or so and everything passed all the tests
> when installed. I am using perl 5.8.5, mysql 5.0.22, DBI-1.5.1,
> DBD-mysql-3.006. I was loading plant files from Refseq rel 18:
>
> load_seqdatabase.pl --dbname biosql
> --lookup --u --namespace plant --format genbank --safe plant*.rna.gbff.gz
>
> and it crashed after about 30K of 60K records:
>
> at /usr/lib/perl5/site_perl/5.8.5/Bio/biosql-schema/sql/bioperl-db/scripts/biosql/load_seqdatabase.pl
> line 633
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values
> were ("","Direct Submission","Submitted (01-JUL-2004) National Center for
> Biotechnology Information, National Institutes of Health, Bethesda 20894,
> United States of America","CRC-6F1453182E2BAC3F","1","786","") FKs
> ()
> Duplicate entry 'CRC-6F1453182E2BAC3F' for key 3
> ---------------------------------------------------
> Could not store XM_472403:
> ------------- EXCEPTION -------------
> MSG: create: object (Bio::Annotation::Reference) failed to insert or to be
> found by unique key
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:208
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254
> STACK Bio::DB::Persistent::PersistentObject::store
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272
> STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:219
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
> /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:216
> t
>
> I traced the error back through the source and database and found that
> XM_472403 has the same CRC value as XM_473880. I actually got many errors
> of this type, but only the last one crashed the script (in spite of --safe).
>
> Should there be more info included in the CRC field? I am weak when
> it comes to RDBMSs, but looking at the schema, I would guess that the CRC
> field was added to make an otherwise degenerate key unique. Would it help
> to add more fields to the CRC, or another key? The former might be done
> without having to change a lot of code.
>
> Thanks
>
> Mike
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

From muratem at eng.uah.edu Mon Aug 14 16:55:45 2006
From: muratem at eng.uah.edu (Mike Muratet)
Date: Mon, 14 Aug 2006 11:55:45 -0500 (CDT)
Subject: [BioSQL-l] load_seqdatabase fails when loading refseq plant files
In-Reply-To: <1155322655.4837.25.camel@gort.gcrc.upenn.edu>
References: <1155322655.4837.25.camel@gort.gcrc.upenn.edu>
Message-ID:

On Fri, 11 Aug 2006, Angel Pizarro wrote:

> Date: Fri, 11 Aug 2006 14:57:35 -0400
> From: Angel Pizarro
> To: BioSQL , Bioperl
> Subject: Re: [BioSQL-l] load_seqdatabase fails when loading refseq plant files
>
> Glad I am not the only one that ran into this problem! Mike, I had
> reported this issue a few emails back and have provided the list with an
> example file for testing, so it should be resolved soon.
>

I must have missed it. Sorry.

> FYI, you are correct that the CRC is computed on load to determine whether
> two publication references are in fact the same. This is a feature to save
> database space. The expected behaviour is that subsequent entries with the
> same reference CRC get an FK to the originating reference entry, rather
> than inserting a duplicate row into the reference table.
>
> FYI #2, the --safe option explicitly states that it will continue to
> process records after errors BUT do a roll-back at the end of the run.
> This is to gather all of your errors in one shot, as opposed to fixing a
> record, restarting, hitting the next error, fixing it, etc.
>
> If you are impatient and do not care about references, you have three
> choices.
>
> 1) Drop the unique constraint on reference.crc. This will cause dups in
> reference, and you cannot go back to a unique CRC without some major SQL
> data migration routine to fix FKs and delete the dups.
>
> 2) Filter your records to not contain reference information.
>
> 3) Alter load_seqdatabase to not enter reference information. This would
> be in the Bio::Annotation::Collection object:
>
> $seq->annotation()->remove_Annotations('reference');
>
> The above command, inserted someplace in the script around line 575, should
> do the trick. Obviously this means that no reference information is
> loaded into the DB at all.
>

I do need to get something working, and the references are not critical
to the application, so I will probably alter load_seqdatabase.

Thanks for the help!

Cheers

Mike

> -angel
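For the second workaround mentioned in this thread (filtering reference information out of the records before loading), a minimal Python sketch follows. It assumes the standard GenBank flat-file layout, where a REFERENCE block starts with the keyword in column one and its AUTHORS, TITLE and JOURNAL lines are indented. The function name `strip_references` and the sample record are hypothetical and exist only for illustration.

```python
def strip_references(lines):
    """Yield GenBank flat-file lines with every REFERENCE block removed."""
    skipping = False
    for line in lines:
        if line.startswith("REFERENCE"):
            # Start of a reference block: drop the keyword line itself.
            skipping = True
            continue
        if skipping and line[:1].isspace():
            # Indented continuation (AUTHORS, TITLE, JOURNAL, ...) of the block.
            continue
        # Any line starting in column one ends the reference block.
        skipping = False
        yield line


# Hypothetical record used only to exercise the filter.
SAMPLE = (
    "LOCUS       EXAMPLE_1    12 bp    mRNA\n"
    "DEFINITION  hypothetical record for illustration.\n"
    "REFERENCE   1  (bases 1 to 12)\n"
    "  AUTHORS   .\n"
    "  TITLE     Direct Submission\n"
    "  JOURNAL   Submitted (01-JUL-2004) example\n"
    "FEATURES             Location/Qualifiers\n"
    "     source          1..12\n"
    "ORIGIN\n"
    "        1 atgcatgcat gc\n"
    "//\n"
)


def demo():
    """Return the filtered lines of the sample record."""
    return list(strip_references(SAMPLE.splitlines(keepends=True)))
```

The generator works line by line, so it can be fed from `gzip.open(path, "rt")` to filter compressed files such as plant*.rna.gbff.gz and write the result to a new file before running load_seqdatabase.pl. Unlike dropping the unique constraint, this leaves the schema untouched, at the cost of loading no reference information.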