From dillo at pcbi.upenn.edu Mon Apr 9 12:05:03 2007
From: dillo at pcbi.upenn.edu (Bryan Cardillo)
Date: Mon, 9 Apr 2007 12:05:03 -0400
Subject: [BioSQL-l] genbank, references, and crc's
Message-ID: <20070409160502.GD5285@rover.pcbi.upenn.edu>

This is probably more of a bioperl issue, but since it was
previously discussed here, this is where I'll continue the
discussion. I've just run into the same issues mentioned in
these threads while loading some refseq sequences.

http://lists.open-bio.org/pipermail/biosql-l/2006-July/001024.html
http://lists.open-bio.org/pipermail/biosql-l/2006-August/001048.html

I believe the bioperl-db patch below solves these issues.
The crux of the problem is that the _crc64 code uses the
authors, title, and location to determine a unique key.
However, the get_unique_key_query method only checks authors
before deferring to a crc lookup. The fix causes the crc key
to be used if any of authors, title, or location is
specified.

Cheers,
Bryan Cardillo
Penn Bioinformatics Core
University of Pennsylvania

 ReferenceAdaptor.pm | 2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Index: ./Bio/DB/BioSQL/ReferenceAdaptor.pm
===================================================================
RCS file: /home/repository/bioperl/bioperl-db/Bio/DB/BioSQL/ReferenceAdaptor.pm,v
retrieving revision 1.24
diff -u -r1.24 ReferenceAdaptor.pm
--- ./Bio/DB/BioSQL/ReferenceAdaptor.pm 4 Jul 2006 22:23:12 -0000 1.24
+++ ./Bio/DB/BioSQL/ReferenceAdaptor.pm 9 Apr 2007 15:38:35 -0000
@@ -426,7 +426,7 @@
             });
         }
     }
-    if($obj->authors()) {
+    if($obj->authors() || $obj->title() || $obj->location()) {
         push(@ukqueries, {
             'doc_id' => $self->_crc64($obj),
         });

From hlapp at gmx.net Tue Apr 10 12:09:43 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 10 Apr 2007 12:09:43 -0400
Subject: [BioSQL-l] genbank, references, and crc's
In-Reply-To: <20070409160502.GD5285@rover.pcbi.upenn.edu>
References: <20070409160502.GD5285@rover.pcbi.upenn.edu>
Message-ID: <3A873665-CA1D-489B-A6A8-6EDBB54C7858@gmx.net>

Hi Bryan,

thanks for tracking this down - great, I've committed it.

The 'correct' condition, as defined by the schema, would actually be
to test for author or title being specified, because location must be
non-empty, according to the schema.

I.e., at least theoretically, the condition will now always be true,
unless you removed the NOT NULL constraint locally on
reference.location.

Would you mind testing whether removing the location() part from the
if clause will still solve the issue?

-hilmar

On Apr 9, 2007, at 12:05 PM, Bryan Cardillo wrote:

> This is probably more of a bioperl issue, but since it was
> previously discussed here, this is where I'll continue the
> discussion. I've just run into the same issues mentioned in
> these threads while loading some refseq sequences.
>
> http://lists.open-bio.org/pipermail/biosql-l/2006-July/001024.html
> http://lists.open-bio.org/pipermail/biosql-l/2006-August/001048.html
>
> I believe the bioperl-db patch below solves these issues.
> The crux of the problem is that the _crc64 code uses the
> authors, title, and location to determine a unique key.
> However the get_unique_key_query method only checks authors
> before deferring to a crc lookup. The fix causes the crc key
> to be used if any of authors, title, or location is
> specified.
>
> Cheers,
> Bryan Cardillo
> Penn Bioinformatics Core
> University of Pennsylvania
>
> ReferenceAdaptor.pm | 2 +-
> 1 files changed, 1 insertion(+), 1 deletion(-)
>
> Index: ./Bio/DB/BioSQL/ReferenceAdaptor.pm
> ===================================================================
> RCS file: /home/repository/bioperl/bioperl-db/Bio/DB/BioSQL/ReferenceAdaptor.pm,v
> retrieving revision 1.24
> diff -u -r1.24 ReferenceAdaptor.pm
> --- ./Bio/DB/BioSQL/ReferenceAdaptor.pm 4 Jul 2006 22:23:12 -0000 1.24
> +++ ./Bio/DB/BioSQL/ReferenceAdaptor.pm 9 Apr 2007 15:38:35 -0000
> @@ -426,7 +426,7 @@
>              });
>          }
>      }
> -    if($obj->authors()) {
> +    if($obj->authors() || $obj->title() || $obj->location()) {
>          push(@ukqueries, {
>              'doc_id' => $self->_crc64($obj),
>          });
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From dillo at pcbi.upenn.edu Wed Apr 11 11:33:39 2007
From: dillo at pcbi.upenn.edu (Bryan Cardillo)
Date: Wed, 11 Apr 2007 11:33:39 -0400
Subject: [BioSQL-l] genbank, references, and crc's
In-Reply-To: <3A873665-CA1D-489B-A6A8-6EDBB54C7858@gmx.net>
References: <20070409160502.GD5285@rover.pcbi.upenn.edu> <3A873665-CA1D-489B-A6A8-6EDBB54C7858@gmx.net>
Message-ID: <20070411153337.GA5275@rover.pcbi.upenn.edu>

On Tue, Apr 10, 2007 at 12:09:43PM -0400, Hilmar Lapp wrote:
> thanks for tracking this down - great, I've committed it.
>
> The 'correct' condition, as defined by the schema, would actually be
> to test for author or title being specified, because location must be
> non-empty, according to the schema.
>
> I.e., at least theoretically, the condition will now always be true,
> unless you removed the NOT NULL constraint locally on
> reference.location.
>
> Would you mind testing whether removing the location() part from the
> if clause will still solve the issue?

you are correct, the test for location doesn't seem to be
necessary.

from a theoretical point of view, I'm not sure I agree with
removing the location test though. it seems to me that if you
have a field (i.e., location) which is used in generating a
unique identifier (crc64), then you should consult that field
when determining what the unique identifier is for a
particular object.

to put it another way, a reference instance with no authors,
no title, and a location can have a valid crc. so why should
the adaptor ignore this case?

all that being said, my understanding of how all this goes
together is still pretty shallow, so I'll defer to you as to
which solution is best ;)

--Bryan

From hlapp at gmx.net Sun Apr 15 22:54:19 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 15 Apr 2007 22:54:19 -0400
Subject: [BioSQL-l] genbank, references, and crc's
In-Reply-To: <20070411153337.GA5275@rover.pcbi.upenn.edu>
References: <20070409160502.GD5285@rover.pcbi.upenn.edu> <3A873665-CA1D-489B-A6A8-6EDBB54C7858@gmx.net> <20070411153337.GA5275@rover.pcbi.upenn.edu>
Message-ID: <52E4803B-7141-4D40-B46D-369626D15968@gmx.net>

On Apr 11, 2007, at 11:33 AM, Bryan Cardillo wrote:

> to put it another way, a reference instance with no authors,
> no title, and a location can have a valid crc. so why should
> the adaptor ignore this case?

You're right - I can't just remove the location() test. Instead, I
should be able to remove the bracketing if clause altogether.
I.e., in light of the schema, the construct

    if($obj->authors() || $obj->title() || $obj->location()) {
        push(@ukqueries, {
            'doc_id' => $self->_crc64($obj),
        });
    }

ought to be equivalent to

    push(@ukqueries, {
        'doc_id' => $self->_crc64($obj),
    });

The thing is that the BioPerl object model doesn't complain if you
leave all three of authors, title, and location empty, no matter how
nonsensical that is (it so happens that annotation parsed out from a
legitimate sequence file in e.g. genbank format will always have
location filled in).

I think I'll leave the if clause in and document that in reality for
all legitimate annotation sources the clause should always evaluate
to true.

Thanks for your observations - very sharp. I hope you'll stick around
with the code for a while, it can certainly benefit from another pair
of sharp eyes. Don't hesitate to let me know if you need any help.

-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From lpritc at scri.ac.uk Mon Apr 16 11:55:22 2007
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Mon, 16 Apr 2007 16:55:22 +0100
Subject: [BioSQL-l] Problem loading GO.
Message-ID: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>

Hi,

I've been trying to upload the GO into a clean BioSQL (MySQL, 1.4.1)
schema using the BioPerl bp_load_ontology.pl script, with the OBOv1.0,
OBOv1.2, and the most recent flatfiles from
http://www.geneontology.org/GO.downloads.ontology.shtml - none of my
attempts have been successful. The errors below are from a Linux
installation, but the same errors are thrown on OS X, too. I am using
the most recent versions of BioPerl and bioperl-db, installed via CPAN:

[lpritc at lplinuxdev sequence_data]$ perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"'
1.005002102

and bioperl-db 1.5.2.
I have attached the traceback below (running with --safe throws a number of equivalent errors), and I would be grateful for any help you might be able to offer with setting me on track to fixing this. SOFA is loaded without issues, you might be pleased to hear ;) Thanks in advance, L. ######## [lpritc at lplinuxdev sequence_data]$ bp_load_ontology.pl --host localhost --dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass ******** --format obo ~/Downloads/gene_ontology_edit.obo Loading ontology gene_ontology: ... terms ... relationships Done with gene_ontology. Loading ontology biological_process: ... terms -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values were ("","","0","") FKs () Column 'dbname' cannot be null --------------------------------------------------- Could not store term GO:0018901, name '2,4-dichlorophenoxyacetic acid metabolic process': ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be found by unique key STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206 STACK: Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK: Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271 STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805 STACK: /usr/bin/bp_load_ontology.pl:610 ----------------------------------------------------------- [lpritc 
at lplinuxdev sequence_data]$ bp_load_ontology.pl --host localhost --dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass ******** --format goflat --fmtargs ~/Downloads/GO.defs ~/Downloads/function.ontology Loading ontology Gene Ontology: ... terms -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values were ("MetaCyc","2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN","0","") FKs () Duplicate entry '2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RX-MetaCyc-0' for key 2 --------------------------------------------------- Could not store term GO:0047528, name '2\,3-dihydroxyindole 2 \,3-dioxygenase activity': ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be found by unique key STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206 STACK: Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK: Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271 STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805 STACK: /usr/bin/bp_load_ontology.pl:610 ----------------------------------------------------------- at /usr/bin/bp_load_ontology.pl line 817 main::persist_term('-term', 'Bio::Ontology::GOterm=HASH(0xa6afb9c)', '-db', 'Bio::DB::BioSQL::DBAdaptor=HASH(0x9b5afd0)', '-termfactory', 'Bio::Ontology::TermFactory=HASH(0x9f40d10)', '-throw', 'CODE(0x96f6b68)', '-mergeobs', ...) 
called at /usr/bin/bp_load_ontology.pl line 610 -- Dr Leighton Pritchard B.Sc.(Hons) MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA e:lpritc at scri.ac.uk w:http://bioinf.scri.ac.uk/lp gpg/pgp: 0xFEFC205C _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). From hlapp at gmx.net Tue Apr 17 00:00:55 2007 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 17 Apr 2007 00:00:55 -0400 Subject: [BioSQL-l] Problem loading GO. 
In-Reply-To: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> Message-ID: Hi Leighton, please see below: On Apr 16, 2007, at 11:55 AM, Leighton Pritchard wrote: > Hi, > > I've been trying to upload the GO into a clean BioSQL (MySQL, 1.4.1) > schema using the BioPerl bp_load_ontology.pl script, with the OBOv1.0, > OBOv1.2, and the most recent flatfiles from > http://www.geneontology.org/GO.downloads.ontology.shtml - none of my > attempts have been successful. The errors below are from a Linux > installation, but the same errors are thrown on OS X, too. I am using > the most recent versions of BioPerl and bioperl-db, installed via > CPAN: > > [lpritc at lplinuxdev sequence_data]$ perl -MBio::Root::Version -e 'print > $Bio::Root::Version::VERSION,"\n"' > 1.005002102 > > and bioperl-db 1.5.2. > > I have attached the traceback below (running with --safe throws a > number > of equivalent errors), Using --safe will throw the same errors, but will continue loading. I.e., you'd lose the one term, but keep everything else. I do realize that especially for a graph losing an internal node can be quite detrimental. > [...] > ######## > > [lpritc at lplinuxdev sequence_data]$ bp_load_ontology.pl --host > localhost > --dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass > ******** --format obo ~/Downloads/gene_ontology_edit.obo > Loading ontology gene_ontology: > ... terms > ... relationships > Done with gene_ontology. > Loading ontology biological_process: > ... terms > > -------------------- WARNING --------------------- > MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values > were ("","","0","") FKs () > Column 'dbname' cannot be null > --------------------------------------------------- This would point to a problem of the BioPerl obo parser. According to the message, both the database name and the accession of the db_xref for the term are - surely erroneously - empty. 
Apparently the parser fails to parse out database and accession for
this db_xref of term GO:0018901.

If you can edit the obo file, you can try deleting the db_xref(s) for
that term that look odd (or delete all if you don't need them). I'd
have to debug the obo parser to see exactly where it's going wrong in
parsing.

> Could not store term GO:0018901, name '2,4-dichlorophenoxyacetic acid
> metabolic process':
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> [...]
> [lpritc at lplinuxdev sequence_data]$ bp_load_ontology.pl --host
> localhost
> --dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass
> ******** --format goflat --fmtargs ~/Downloads/GO.defs

Note that the argument for --fmtargs here should read
"-defs_file,/path/to/Downloads/GO.defs". (Note that within the quotes
there is no tilde expansion.)

> ~/Downloads/function.ontology
> Loading ontology Gene Ontology:
> ... terms
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
> were ("MetaCyc","2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN","0","")
> FKs
> ()
> Duplicate entry '2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RX-
> MetaCyc-0' for
> key 2
> ---------------------------------------------------

This is one of the things why you've got to love MySQL (and I am
correct in inferring that you're using MySQL?).

The width of the dbxref.accession column (which is what the second
value in parentheses is for) is 40 chars. The apparently pre-existing
value ("2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RX-MetaCyc-0") is 50
chars, which when loaded should have resulted in an exception.
Instead, MySQL just simply and silently truncates it to 40 chars,
which makes it identical to the first 40 chars of
"2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN" (which is 41 chars in
length).

It may be necessary to widen the length of dbxref.accession here, for
example to 80 chars? Let me know if you need help with the DDL
command to do this.
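
A minimal sketch of the truncation effect described above, written in Python purely for illustration (it does not touch MySQL or bioperl-db). It assumes the column keeps only the first 40 characters of whatever is inserted. COLUMN_WIDTH, stored_form, and the table set are hypothetical names, and the lookup-then-insert sequence is one plausible reading of the error, not something the thread confirms.

```python
# Sketch only: models a 40-character column that silently truncates on insert.
# COLUMN_WIDTH, stored_form and table are illustrative names, not BioSQL API.
COLUMN_WIDTH = 40


def stored_form(value, width=COLUMN_WIDTH):
    """Return what a silently truncating column of `width` chars would keep."""
    return value[:width]


# The accession from the error message (41 characters, backslashes included).
accession = r"2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN"

# First insert: succeeds, but only the truncated form is stored.
table = set()
table.add(stored_form(accession))

# A later lookup by the full value misses the truncated row ...
found_by_full_value = accession in table

# ... so the loader would try to insert again, and the truncated form
# collides with the row that is already there (the unique key violation).
would_collide = stored_form(accession) in table

print(len(accession), found_by_full_value, would_collide)
```

With the width raised to 80, as suggested above, the stored and full values are identical, so the lookup succeeds and no second insert is attempted. In MySQL the widening itself is done with an ALTER TABLE statement; the exact column definition should be taken from your own schema file.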
Let me know how far this gets you. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From lpritc at scri.ac.uk Tue Apr 17 09:35:44 2007 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 17 Apr 2007 14:35:44 +0100 Subject: [BioSQL-l] Problem loading GO. In-Reply-To: References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> Message-ID: <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> Hi Hilmar, Thanks for the very quick response. Apologies for the long reply, but I thought it might be useful if anyone else happens across the same problems that I did. On Tue, 2007-04-17 at 00:00 -0400, Hilmar Lapp wrote: > Apparently the parser > fails to parse out database and accession for this db_xref of term GO: > 0018901. > > If you can edit the obo file, you can try deleting the db_xref(s) for > that term that look odd (or delete all if you don't need them). You're spot on - see further down for details... > Note that the argument for --fmtargs here should read > "-defs_file,/path/to/Downloads/GO.defs". (Note that within the quotes > there is no tilde expansion.) D'oh! Thanks for the note - my bad, there. > This is one the things why you've got to love MySQL (and I am correct > in inferring that you're using MySQL?). The 'choice' was forced upon me ;) > It may be necessary to widen the length of dbname.accession here, for > example to 80 chars? Let me know if you need help with the DDL > command to do this. 
I've fixed that now (and added it to my local biosqldb-mysql.sql schema), but with a clean BioSQL schema and using: [lpritc at lplinuxdev sql]$ bp_load_ontology.pl --host localhost --dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass ******** --format goflat --fmtargs "-defs_file,/home/lpritc/Downloads/GO.defs" /home/lpritc/Downloads/function.ontology I was still getting errors with the GO flatfile: Loading ontology Gene Ontology: ... terms -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values were ("","","0","") FKs () Column 'dbname' cannot be null --------------------------------------------------- Could not store term GO:0047554, name '2-pyrone-4,6-dicarboxylate lactonase activity': ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be found by unique key STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206 STACK: Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK: Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271 STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805 STACK: /usr/bin/bp_load_ontology.pl:610 ----------------------------------------------------------- at /usr/bin/bp_load_ontology.pl line 817 main::persist_term('-term', 'Bio::Ontology::GOterm=HASH(0x88497a4)', '-db', 'Bio::DB::BioSQL::DBAdaptor=HASH(0x897f074)', 
'-termfactory', 'Bio::Ontology::TermFactory=HASH(0x8d64ad8)', '-throw', 'CODE(0x851abc8)', '-mergeobs', ...) called at /usr/bin/bp_load_ontology.pl line 610 I tracked this down to an apparently poor formatting of the GO.defs file (note that the first and third definition_lines appear to be two halves of the same entry): term: 2-pyrone-4,6-dicarboxylate lactonase activity goid: GO:0047554 definition: Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O = 4-carboxy-2-hydroxyhexa-2,4-dienedioate. definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN definition_reference: EC:3.1.1.57 definition_reference: MetaCyc:2-PYRONE-4 I found 43 similar errors for other GOIDs, and it appears to result from the occurrence of the string "\," in a dbxref - mostly MetaCyc entries, but also some UM-BBD_pathwayID entries. These errors appear to have followed through into the generation of the OBO format files in each case, e.g.: def: "Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O = 4-carboxy-2-hydroxyhexa-2,4-dienedioate." [:6-DICARBOXYLATE-LACTONASE-RXN, EC:3.1.1.57, MetaCyc:2-PYRONE-4] and so is something for the GO guys to fix, I guess. Another error is thrown after fixing the above, though (with the same command as before): Loading ontology Gene Ontology: ... 
terms -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values were ("GO:0006905","vesicle transport","OBSOLETE (was not defined before being made obsolete).","X","") FKs (1) Duplicate entry 'vesicle transport-1-X' for key 3 --------------------------------------------------- Could not store term GO:0006905, name 'vesicle transport': ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: create: object (Bio::Ontology::GOterm) failed to insert or to be found by unique key STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK: Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271 STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805 STACK: /usr/bin/bp_load_ontology.pl:610 ----------------------------------------------------------- at /usr/bin/bp_load_ontology.pl line 817 main::persist_term('-term', 'Bio::Ontology::GOterm=HASH(0xbcac418)', '-db', 'Bio::DB::BioSQL::DBAdaptor=HASH(0x957805c)', '-termfactory', 'Bio::Ontology::TermFactory=HASH(0x995db20)', '-throw', 'CODE(0x9113bd0)', '-mergeobs', ...) called at /usr/bin/bp_load_ontology.pl line 610 There are duplicate terms, identical in the term table except for GOID: GO:0006905 and GO:0005480. They are both "vesicle transport", and obsoleted: term: vesicle transport goid: GO:0005480 definition: OBSOLETE (was not defined before being made obsolete). definition_reference: GOC:go_curators comment: This term was made obsolete because it represents a biological process and not a molecular function. 
To update annotations, use the biological process term 'vesicle-mediated transport ; GO:0016192'. term: vesicle transport goid: GO:0006905 definition: OBSOLETE (was not defined before being made obsolete). definition_reference: GOC:go_curators comment: This term was made obsolete because the meaning of the term is ambiguous. To update annotations, consider the biological process term 'vesicle-mediated transport ; GO:0016192'. I used the --noobsolete flag to avoid this error - reasoning that since I'm populating the database for the first time, ignoring the obsolete terms won't hurt - but finally this error was thrown: Loading ontology Gene Ontology: ... terms -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values were ("PMID","","0","") FKs () Column 'accession' cannot be null --------------------------------------------------- Could not store term GO:0032933, name 'SREBP-mediated signaling pathway': ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be found by unique key STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206 STACK: Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK: Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271 STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805 STACK: /usr/bin/bp_load_ontology.pl:610 
-----------------------------------------------------------
 at /usr/bin/bp_load_ontology.pl line 817
 main::persist_term('-term', 'Bio::Ontology::GOterm=HASH(0xbe18f14)', '-db', 'Bio::DB::BioSQL::DBAdaptor=HASH(0x99bbf2c)', '-termfactory', 'Bio::Ontology::TermFactory=HASH(0x9da0ad8)', '-throw', 'CODE(0x9556bb4)', '-mergeobs', ...) called at /usr/bin/bp_load_ontology.pl line 610

with the offending entry being

term: SREBP-mediated signaling pathway
goid: GO:0032933
definition: A series of molecular signals from the endoplasmic reticulum to the nucleus generated as a consequence of altered levels of one or more lipids, and resulting in the activation of transcription by SREBP.
definition_reference: GOC:mah
definition_reference: PMID:0

I commented out the definition_reference for PMID:0, which seemed to fix
matters.

The process.ontology and component.ontology files then went into the
database without a hitch. Thanks again for your help,

L.

--
Dr Leighton Pritchard B.Sc.(Hons) MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA
e:lpritc at scri.ac.uk w:http://bioinf.scri.ac.uk/lp
gpg/pgp: 0xFEFC205C

From hlapp at gmx.net Tue Apr 17 11:09:45 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 17 Apr 2007 11:09:45 -0400
Subject: [BioSQL-l] Problem loading GO.
In-Reply-To: <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
Message-ID: <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>

On Apr 17, 2007, at 9:35 AM, Leighton Pritchard wrote:

> Hi Hilmar,
>
> Thanks for the very quick response. Apologies for the long reply,
> but I
> thought it might be useful if anyone else happens across the same
> problems that I did.

Thanks for reporting all these.

> [...]
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
> were ("","","0","") FKs ()
> Column 'dbname' cannot be null
> ---------------------------------------------------
> Could not store term GO:0047554, name '2-pyrone-4,6-dicarboxylate
> lactonase activity':
> [...]
> I tracked this down to an apparently poor formatting of the GO.defs
> file
> (note that the first and third definition_lines appear to be two
> halves
> of the same entry):
>
> term: 2-pyrone-4,6-dicarboxylate lactonase activity
> goid: GO:0047554
> definition: Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate +
> H2O
> = 4-carboxy-2-hydroxyhexa-2,4-dienedioate.
> definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN I wonder whether this is the line that throws the parser off. It looks like the database part of the reference is missing - bad. > definition_reference: EC:3.1.1.57 > definition_reference: MetaCyc:2-PYRONE-4 > > I found 43 similar errors for other GOIDs, and it appears to result > from > the occurrence of the string "\," in a dbxref - mostly MetaCyc > entries, > but also some UM-BBD_pathwayID entries. I'm not sure - although the string "\," might indeed trip up the parser, would have to investigate to confirm. Could it be a coincidence with definition_references that lack the database part before the colon? > > These errors appear to have followed through into the generation of > the > OBO format files in each case, e.g.: > > def: "Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O = > 4-carboxy-2-hydroxyhexa-2,4-dienedioate." [:6-DICARBOXYLATE- > LACTONASE-RXN, EC:3.1.1.57, MetaCyc:2-PYRONE-4] Again, the first db_xref lacks the database in front of the colon. I can also see why "\," will trip up the parser in this format. > > and so is something for the GO guys to fix, I guess. The lack of a database for certain xrefs surely is. If the escaped comma does throw off the BioPerl parser then that part is for BioPerl to fix. It does seem to extract the parts correctly, if the error message is any indication, though you may argue that it should remove the escaping backslashes (and I'd certainly agree with that). > > > Another error is thrown after fixing the above, though (with the same > command as before): > > Loading ontology Gene Ontology: > ... 
terms > > -------------------- WARNING --------------------- > MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values > were > ("GO:0006905","vesicle transport","OBSOLETE (was not defined before > being made obsolete).","X","") FKs (1) > Duplicate entry 'vesicle transport-1-X' for key 3 > --------------------------------------------------- > Could not store term GO:0006905, name 'vesicle transport': > [...] > There are duplicate terms, identical in the term table except for > GOID: > GO:0006905 and GO:0005480. They are both "vesicle transport", and > obsoleted: That violates the uniqueness constraint, and this sounds more like a bug in the GO file. I'm also not sure what motivated them to create the same term multiple times only to obsolete it immediately. > [...] > -------------------- WARNING --------------------- > MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values > were ("PMID","","0","") FKs () > Column 'accession' cannot be null > --------------------------------------------------- > Could not store term GO:0032933, name 'SREBP-mediated signaling > pathway': > [...] > with the offending entry being > > term: SREBP-mediated signaling pathway > goid: GO:0032933 > definition: A series of molecular signals from the endoplasmic > reticulum > to the nucleus generated as a consequence of altered levels of one or > more lipids, and resulting in the activation of transcription by > SREBP. > definition_reference: GOC:mah > definition_reference: PMID:0 > > I commented out the definition_reference for PMID:0, which seemed > to fix > matters. Right, it seems to be a bogus reference. > > The process.ontology and component.ontology files then went into the > database without a hitch. Thanks again for your help, Fantastic you got it all loaded! Note that you also have the --computetc switch which will compute the transitive closure for you automatically. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From lpritc at scri.ac.uk Tue Apr 17 12:05:16 2007 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 17 Apr 2007 17:05:16 +0100 Subject: [BioSQL-l] Problem loading GO. In-Reply-To: <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> Message-ID: <1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk> Hello again, On Tue, 2007-04-17 at 11:09 -0400, Hilmar Lapp wrote: > Thanks for reporting all these. No problem at all. > On Apr 17, 2007, at 9:35 AM, Leighton Pritchard wrote: > > term: 2-pyrone-4,6-dicarboxylate lactonase activity [...] > > definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN > > I wonder whether this is the line that throws the parser off. It > looks like the database part of the reference is missing - bad. > > definition_reference: MetaCyc:2-PYRONE-4 I don't think the parser is to blame, here. Note that if you join the definition_reference strings from the GO.defs file, you get: MetaCyc:2-PYRONE-4:6-DICARBOXYLATE-LACTONASE-RXN Then if you replace the colon by "\," you get what should (I think) actually be the MetaCyc entry: MetaCyc:2-PYRONE-4\,6-DICARBOXYLATE-LACTONASE-RXN > > I found 43 similar errors for other GOIDs, and it appears to result > > from > > the occurrence of the string "\," in a dbxref - mostly MetaCyc > > entries, > > but also some UM-BBD_pathwayID entries. > > I'm not sure - although the string "\," might indeed trip up the > parser, would have to investigate to confirm. Could it be a > coincidence with definition_references that lack the database part > before the colon? 
Inspecting the troublesome entries by eye seems to turn up the same
problem as above consistently: a GO term in the GO.defs file is
malformed. The term should have a definition_reference field describing
a MetaCyc entry that matches the term field. In the term string, there
would be an escaped comma, but the string ends where we expect this.
The string that would follow the escaped comma is present as the first
definition_reference. This observation also extends to cases where
there should be two occurrences of "\," in the MetaCyc field, e.g.:

term: 2,3-dihydroxyindole 2,3-dioxygenase activity
goid: GO:0047528
definition: Catalysis of the reaction: 2,3-dihydroxyindole + O2 = anthranilate + CO2.
definition_reference: :3-DIHYDROXYINDOLE-2
definition_reference: :3-DIOXYGENASE-RXN
definition_reference: EC:1.13.11.2
definition_reference: MetaCyc:2

It then appears as though the GO flatfiles were used automatically to
generate the OBO format files, and propagated the same error into the
square brackets in each case.

> > and so is something for the GO guys to fix, I guess.
>
> The lack of a database for certain xrefs surely is. If the escaped
> comma does throw off the BioPerl parser then that part is for BioPerl
> to fix.

I think the problems are now all in the data I downloaded from
http://www.geneontology.org/GO.downloads.shtml - I believe the BioPerl
parser to be innocent of these charges ;) I've submitted the issue at
the GO site, and with any luck they'll handle it quite soon (if it is
in fact their problem).

> Note that you also have the --computetc switch which will compute the
> transitive closure for you automatically.

:D Excellent! Thanks for the pointer, and again for your efforts,

L.
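A minimal Perl sketch of the repair described above, for anyone who would rather patch a local copy of GO.defs by script than by hand. It assumes that the orphaned fragments (the definition_reference values starting with a bare colon) are still listed in their original order, and that the caller knows which database's accession was truncated. rejoin_split_dbxref is a made-up helper name, not part of BioPerl or bioperl-db.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): rebuild a dbxref that was
# split at escaped commas. $head_db names the database whose accession was
# truncated; fragments are assumed to appear in their original order.
sub rejoin_split_dbxref {
    my ($head_db, @refs) = @_;

    # References with no database part are the orphaned tails.
    my @fragments = map { substr($_, 1) } grep { /^:/ } @refs;

    my ($head, @intact);
    for my $ref (grep { !/^:/ } @refs) {
        if (!defined $head && $ref =~ /^\Q$head_db\E:/) {
            $head = $ref;          # the truncated reference
        } else {
            push @intact, $ref;    # untouched references
        }
    }

    # Nothing to repair: hand the list back unchanged.
    return @refs unless defined $head && @fragments;

    # Glue the tails back on with the escaped comma that was lost.
    return (@intact, join('\,', $head, @fragments));
}

my @fixed = rejoin_split_dbxref(
    'MetaCyc',
    ':6-DICARBOXYLATE-LACTONASE-RXN', 'EC:3.1.1.57', 'MetaCyc:2-PYRONE-4',
);
print "$_\n" for @fixed;
# prints:
#   EC:3.1.1.57
#   MetaCyc:2-PYRONE-4\,6-DICARBOXYLATE-LACTONASE-RXN
```

If the fragment order in the file is not reliable (for instance because the references were sorted), the result would need checking against the term name.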
-- Dr Leighton Pritchard B.Sc.(Hons) MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA e:lpritc at scri.ac.uk w:http://bioinf.scri.ac.uk/lp gpg/pgp: 0xFEFC205C From cjfields at uiuc.edu Tue Apr 17 12:18:19 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 17 Apr 2007 11:18:19 -0500 Subject: [BioSQL-l] Problem loading GO. In-Reply-To: <1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk> References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> <1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk> Message-ID: <146086E2-330B-4460-90AC-2632E82ED145@uiuc.edu> On Apr 17, 2007, at 11:05 AM, Leighton Pritchard wrote: ...
> >>> and so is something for the GO guys to fix, I guess. >> >> The lack of a database for certain xrefs surely is. If the escaped >> comma does throw off the BioPerl parser then that part is for BioPerl >> to fix. > > I thinkk the problems are now all in the data I downloaded from > http://www.geneontology.org/GO.downloads.shtml - I believe the BioPerl > parser to be innocent of these charges ;) I've submitted the issue at > the GO site, and with any luck they'll handle it quite soon (if it > is in > fact their problem). > >> Note that you also have the --computetc switch which will compute the >> transitive closure for you automatically. > > :D Excellent! Thanks for the pointer, and again for your efforts, > > L. ... If you do find anything that is BioSQL- or Bioperl-related then file a bug report so we can track it. I agree with Hilmar that it's likely the parser is partly to blame. http://bugzilla.open-bio.org/ We really appreciate the work you're putting into this! chris From lpritc at scri.ac.uk Tue Apr 17 12:55:38 2007 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 17 Apr 2007 17:55:38 +0100 Subject: [BioSQL-l] Problem loading GO. In-Reply-To: <146086E2-330B-4460-90AC-2632E82ED145@uiuc.edu> References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> <1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk> <146086E2-330B-4460-90AC-2632E82ED145@uiuc.edu> Message-ID: <1176828938.988.133.camel@lplinuxdev.scri.sari.ac.uk> Hi Chris, On Tue, 2007-04-17 at 11:18 -0500, Chris Fields wrote: > If you do find anything that is BioSQL- or Bioperl-related then file > a bug report so we can track it. I agree with Hilmar that it's > likely the parser is partly to blame. > > http://bugzilla.open-bio.org/ I've submitted a bug report, mostly replicating my first post in this thread. 
I added links to the appropriate point in the list archives so that the rest of the discussion can be considered, too. > We really appreciate the work you're putting into this! Thanks - I'm just grateful that the Bio* repertoire is there at all so that my problems are relatively minor (as opposed to attempting to replicate the functionality independently). L. -- Dr Leighton Pritchard B.Sc.(Hons) MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA e:lpritc at scri.ac.uk w:http://bioinf.scri.ac.uk/lp gpg/pgp: 0xFEFC205C From lpritc at scri.ac.uk Tue Apr 17 13:03:53 2007 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 17 Apr 2007 18:03:53 +0100 Subject: [BioSQL-l] [Bioperl-l] Problem loading GO.
In-Reply-To: References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> Message-ID: <1176829433.988.143.camel@lplinuxdev.scri.sari.ac.uk> On Tue, 2007-04-17 at 09:54 -0700, Chris Mungall wrote: > Is there any reason you're loading GO.defs? This is a legacy format > all the information is subsumed in the obo file. My only reason was that the parser originally failed to load the OBO format data - probably for the same reason that the flatfile failed - and I tried the flatfile to check if there were parser issues with the format. I just carried on with the flatfile after that because the terms with formatting errors were (subjectively, for me) easier to spot and fix by hand. I'm happy to use a fixed OBO file. > I didn't see your message to the GO folks re formatting errors - who > did you send it to & what was the subject? I'll see it gets seen to. I submitted it via the website interface - I'm afraid I have no idea where it would have gone after that. -- Dr Leighton Pritchard B.Sc.(Hons) MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA e:lpritc at scri.ac.uk w:http://bioinf.scri.ac.uk/lp gpg/pgp: 0xFEFC205C
From cjm at fruitfly.org Tue Apr 17 12:54:51 2007 From: cjm at fruitfly.org (Chris Mungall) Date: Tue, 17 Apr 2007 09:54:51 -0700 Subject: [BioSQL-l] [Bioperl-l] Problem loading GO. In-Reply-To: <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> Message-ID: Is there any reason you're loading GO.defs? This is a legacy format all the information is subsumed in the obo file. I didn't see your message to the GO folks re formatting errors - who did you send it to & what was the subject? I'll see it gets seen to. >> -------------------- WARNING --------------------- >> MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values >> were >> ("GO:0006905","vesicle transport","OBSOLETE (was not defined before >> being made obsolete).","X","") FKs (1) >> Duplicate entry 'vesicle transport-1-X' for key 3 >> --------------------------------------------------- >> Could not store term GO:0006905, name 'vesicle transport': >> [...] >> There are duplicate terms, identical in the term table except for >> GOID: >> GO:0006905 and GO:0005480. They are both "vesicle transport", and >> obsoleted: > > That violates the uniqueness constraint, and this sounds more like a > bug in the GO file. I'm also not sure what motivated them to create > the same term multiple times only to obsolete it immediately.
these things happen - the schema should be able to deal with it. it's a
pain I know. In Chado we have some hacky solution for this (I believe
it is concatenating the ID onto the name of obsolete terms).

I think that it's actually wrong to include obsoletes and actual terms
in the same table - however, it's obviously astoundingly useful to be
able to do this, but it requires the hack to get out of the uniqueness
violation.

The EBI loads all of OBO into BioSQL regularly - I wonder how they
handle this?

On Apr 17, 2007, at 8:09 AM, Hilmar Lapp wrote:

>
> On Apr 17, 2007, at 9:35 AM, Leighton Pritchard wrote:
>
>> Hi Hilmar,
>>
>> Thanks for the very quick response. Apologies for the long reply,
>> but I
>> thought it might be useful if anyone else happens across the same
>> problems that I did.
>
> Thanks for reporting all these.
>
>> [...]
>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
>> were ("","","0","") FKs ()
>> Column 'dbname' cannot be null
>> ---------------------------------------------------
>> Could not store term GO:0047554, name '2-pyrone-4,6-dicarboxylate
>> lactonase activity':
>> [...]
>> I tracked this down to an apparently poor formatting of the GO.defs
>> file
>> (note that the first and third definition_lines appear to be two
>> halves
>> of the same entry):
>>
>> term: 2-pyrone-4,6-dicarboxylate lactonase activity
>> goid: GO:0047554
>> definition: Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate +
>> H2O
>> = 4-carboxy-2-hydroxyhexa-2,4-dienedioate.
>> definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN
>
> I wonder whether this is the line that throws the parser off. It
> looks like the database part of the reference is missing - bad.
> >> definition_reference: EC:3.1.1.57 >> definition_reference: MetaCyc:2-PYRONE-4 >> >> I found 43 similar errors for other GOIDs, and it appears to result >> from >> the occurrence of the string "\," in a dbxref - mostly MetaCyc >> entries, >> but also some UM-BBD_pathwayID entries. > > I'm not sure - although the string "\," might indeed trip up the > parser, would have to investigate to confirm. Could it be a > coincidence with definition_references that lack the database part > before the colon? > >> >> These errors appear to have followed through into the generation of >> the >> OBO format files in each case, e.g.: >> >> def: "Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O = >> 4-carboxy-2-hydroxyhexa-2,4-dienedioate." [:6-DICARBOXYLATE- >> LACTONASE-RXN, EC:3.1.1.57, MetaCyc:2-PYRONE-4] > > Again, the first db_xref lacks the database in front of the colon. I > can also see why "\," will trip up the parser in this format. > >> >> and so is something for the GO guys to fix, I guess. > > The lack of a database for certain xrefs surely is. If the escaped > comma does throw off the BioPerl parser then that part is for BioPerl > to fix. It does seem to extract the parts correctly, if the error > message is any indication, though you may argue that it should remove > the escaping backslashes (and I'd certainly agree with that). > >> >> >> Another error is thrown after fixing the above, though (with the same >> command as before): >> >> Loading ontology Gene Ontology: >> ... terms >> >> -------------------- WARNING --------------------- >> MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values >> were >> ("GO:0006905","vesicle transport","OBSOLETE (was not defined before >> being made obsolete).","X","") FKs (1) >> Duplicate entry 'vesicle transport-1-X' for key 3 >> --------------------------------------------------- >> Could not store term GO:0006905, name 'vesicle transport': >> [...] 
>> There are duplicate terms, identical in the term table except for >> GOID: >> GO:0006905 and GO:0005480. They are both "vesicle transport", and >> obsoleted: > > That violates the uniqueness constraint, and this sounds more like a > bug in the GO file. I'm also not sure what motivated them to create > the same term multiple times only to obsolete it immediately. > >> [...] >> -------------------- WARNING --------------------- >> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values >> were ("PMID","","0","") FKs () >> Column 'accession' cannot be null >> --------------------------------------------------- >> Could not store term GO:0032933, name 'SREBP-mediated signaling >> pathway': >> [...] >> with the offending entry being >> >> term: SREBP-mediated signaling pathway >> goid: GO:0032933 >> definition: A series of molecular signals from the endoplasmic >> reticulum >> to the nucleus generated as a consequence of altered levels of one or >> more lipids, and resulting in the activation of transcription by >> SREBP. >> definition_reference: GOC:mah >> definition_reference: PMID:0 >> >> I commented out the definition_reference for PMID:0, which seemed >> to fix >> matters. > > Right, it seems to be a bogus reference. > >> >> The process.ontology and component.ontology files then went into the >> database without a hitch. Thanks again for your help, > > Fantastic you got it all loaded! > > Note that you also have the --computetc switch which will compute the > transitive closure for you automatically. 
> > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From rcote at ebi.ac.uk Wed Apr 18 03:08:50 2007 From: rcote at ebi.ac.uk (Richard Cote) Date: Wed, 18 Apr 2007 08:08:50 +0100 Subject: [BioSQL-l] [Bioperl-l] Problem loading GO. In-Reply-To: References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> Message-ID: <4625C402.5040809@ebi.ac.uk> Chris Mungall wrote: >>> Could not store term GO:0006905, name 'vesicle transport': >>> [...] >>> There are duplicate terms, identical in the term table except for >>> GOID: >>> GO:0006905 and GO:0005480. They are both "vesicle transport", and >>> obsoleted: >> > I think that its actually wrong to include obsoletes and actual terms in > the same table - however, it's obviously astoundingly useful to be able > to do this, but it requires the hack to get ou of the uniqueness violation. > > The EBI loads all of OBO into BioSQL regularly - I wonder how they > handle this? I simply avoid the issue. There's no uniqueness constraint in term name. The only constraint is term ID, and even that is only unique in the context of an ontology namespace (i.e. it would be perfectly allowable to have FOO:1234 and BAR:1234). The only unique (and primary) key is generated by the ORM layer so I don't even have to deal with that. We also have all the terms, obsoleted or not, in the same table because people are always querying on stuff that's been made obsolete but is still annotated with the old IDs. 
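For comparison, the Chado-style workaround Chris mentions (appending the ID to the name of obsolete terms) might look roughly like the sketch below. It is only an illustration of the idea, under the assumption that terms are plain hash references with id, name and is_obsolete keys; disambiguate_obsolete_names is a hypothetical helper, not code taken from Chado, BioSQL or bioperl-db.

```perl
use strict;
use warnings;

# Hypothetical sketch: give every obsolete term a name that cannot collide
# with another term's name by appending its identifier. Live terms are
# returned unchanged; the input hashes are copied, not modified.
sub disambiguate_obsolete_names {
    my @terms = @_;    # hash refs with keys id, name, is_obsolete
    return map {
        my %t = %$_;
        $t{name} = "$t{name} ($t{id})" if $t{is_obsolete};
        \%t;
    } @terms;
}

my @terms = disambiguate_obsolete_names(
    { id => 'GO:0006905', name => 'vesicle transport', is_obsolete => 1 },
    { id => 'GO:0005480', name => 'vesicle transport', is_obsolete => 1 },
);
print "$_->{id}\t$_->{name}\n" for @terms;
```

With names rewritten this way, the two obsolete "vesicle transport" terms would no longer share a name, so a unique key that includes the term name (as the 'vesicle transport-1-X' error suggests) should not be violated. The cost is that obsolete terms can no longer be looked up by their original name alone.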
Cheers, Rc -- Richard Cote Software Engineer - PRIDE Project Team (Sequence Database Group) European Bioinformatics Institute Wellcome Trust Genome Campus rcote at ebi.ac.uk Hinxton, Cambridge CB10 1SD Phone: (+44) 1223 492610 United Kingdom Fax : (+44) 1223 494468 From dillo at pcbi.upenn.edu Wed Apr 11 15:33:39 2007 From: dillo at pcbi.upenn.edu (Bryan Cardillo) Date: Wed, 11 Apr 2007 11:33:39 -0400 Subject: [BioSQL-l] genbank, references, and crc's In-Reply-To: <3A873665-CA1D-489B-A6A8-6EDBB54C7858@gmx.net> References: <20070409160502.GD5285@rover.pcbi.upenn.edu> <3A873665-CA1D-489B-A6A8-6EDBB54C7858@gmx.net> Message-ID:
<20070411153337.GA5275@rover.pcbi.upenn.edu>

On Tue, Apr 10, 2007 at 12:09:43PM -0400, Hilmar Lapp wrote:
> thanks for tracking this down - great, I've committed it.
>
> The 'correct' condition, as defined by the schema, would actually be
> test for author or title being specified, because location must be
> non-empty, according to the schema.
>
> I.e., at least theoretically, the condition will now always be true,
> unless you removed the NOT NULL constraint locally on
> reference.location.
>
> Would you mind testing whether removing the location() part from the
> if clause will still solve the issue?

you are correct, the test for location doesn't seem to be necessary.

from a theoretical point of view, I'm not sure I agree with removing
the location test though. it seems to me that if you have a field (ie,
location) which is used in generating a unique identifier (crc64), then
you should consult that field when determining what the unique
identifier is for a particular object.

to put it another way, a reference instance with no authors, no title,
and a location can have a valid crc. so why should the adaptor ignore
this case?

all that being said, my understanding of how all this goes together is
still pretty shallow, so I'll defer to you as to which solution is
best ;)

--Bryan

From hlapp at gmx.net Mon Apr 16 02:54:19 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 15 Apr 2007 22:54:19 -0400
Subject: [BioSQL-l] genbank, references, and crc's
In-Reply-To: <20070411153337.GA5275@rover.pcbi.upenn.edu>
References: <20070409160502.GD5285@rover.pcbi.upenn.edu> <3A873665-CA1D-489B-A6A8-6EDBB54C7858@gmx.net> <20070411153337.GA5275@rover.pcbi.upenn.edu>
Message-ID: <52E4803B-7141-4D40-B46D-369626D15968@gmx.net>

On Apr 11, 2007, at 11:33 AM, Bryan Cardillo wrote:

> to put it another way, a reference instance with no authors,
> no title, and a location can have a valid crc. so why should
> the adaptor ignore this case?
You're right - I can't just remove the location() test. Instead, I
should be able to remove the bracketing if clause altogether. I.e., in
light of the schema, the construct

    if($obj->authors() || $obj->title() || $obj->location()) {
        push(@ukqueries, {
            'doc_id' => $self->_crc64($obj),
        });
    }

ought to be equivalent to

    push(@ukqueries, {
        'doc_id' => $self->_crc64($obj),
    });

The thing is that the BioPerl object model doesn't complain if you
leave all three of authors, title, and location empty, no matter how
non-sensical that is (it so happens that annotation parsed out from a
legitimate sequence file in e.g. genbank format will always have
location filled in).

I think I'll leave the if clause in and document that in reality for
all legitimate annotation sources the clause should always evaluate to
true.

Thanks for your observations - very sharp. I hope you'll stick around
with the code for a while, it can certainly benefit from another pair
of sharp eyes. Don't hesitate to let me know if you need any help.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From lpritc at scri.ac.uk Mon Apr 16 15:55:22 2007
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Mon, 16 Apr 2007 16:55:22 +0100
Subject: [BioSQL-l] Problem loading GO.
Message-ID: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>

Hi,

I've been trying to upload the GO into a clean BioSQL (MySQL, 1.4.1)
schema using the BioPerl bp_load_ontology.pl script, with the OBOv1.0,
OBOv1.2, and the most recent flatfiles from
http://www.geneontology.org/GO.downloads.ontology.shtml - none of my
attempts have been successful. The errors below are from a Linux
installation, but the same errors are thrown on OS X, too.
I am using the most recent versions of BioPerl and bioperl-db, installed via CPAN: [lpritc at lplinuxdev sequence_data]$ perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"' 1.005002102 and bioperl-db 1.5.2. I have attached the traceback below (running with --safe throws a number of equivalent errors), and I would be grateful for any help you might be able to offer with setting me on track to fixing this. SOFA is loaded without issues, you might be pleased to hear ;) Thanks in advance, L. ######## [lpritc at lplinuxdev sequence_data]$ bp_load_ontology.pl --host localhost --dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass ******** --format obo ~/Downloads/gene_ontology_edit.obo Loading ontology gene_ontology: ... terms ... relationships Done with gene_ontology. Loading ontology biological_process: ... terms -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values were ("","","0","") FKs () Column 'dbname' cannot be null --------------------------------------------------- Could not store term GO:0018901, name '2,4-dichlorophenoxyacetic acid metabolic process': ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be found by unique key STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206 STACK: Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK: Bio::DB::Persistent::PersistentObject::store 
/usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271 STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805 STACK: /usr/bin/bp_load_ontology.pl:610 ----------------------------------------------------------- [lpritc at lplinuxdev sequence_data]$ bp_load_ontology.pl --host localhost --dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass ******** --format goflat --fmtargs ~/Downloads/GO.defs ~/Downloads/function.ontology Loading ontology Gene Ontology: ... terms -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values were ("MetaCyc","2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN","0","") FKs () Duplicate entry '2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RX-MetaCyc-0' for key 2 --------------------------------------------------- Could not store term GO:0047528, name '2\,3-dihydroxyindole 2 \,3-dioxygenase activity': ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be found by unique key STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206 STACK: Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK: Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271 STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805 STACK: /usr/bin/bp_load_ontology.pl:610 ----------------------------------------------------------- at /usr/bin/bp_load_ontology.pl line 
817 main::persist_term('-term', 'Bio::Ontology::GOterm=HASH(0xa6afb9c)', '-db', 'Bio::DB::BioSQL::DBAdaptor=HASH(0x9b5afd0)', '-termfactory', 'Bio::Ontology::TermFactory=HASH(0x9f40d10)', '-throw', 'CODE(0x96f6b68)', '-mergeobs', ...) called at /usr/bin/bp_load_ontology.pl line 610 -- Dr Leighton Pritchard B.Sc.(Hons) MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA e:lpritc at scri.ac.uk w:http://bioinf.scri.ac.uk/lp gpg/pgp: 0xFEFC205C _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). From hlapp at gmx.net Tue Apr 17 04:00:55 2007 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 17 Apr 2007 00:00:55 -0400 Subject: [BioSQL-l] Problem loading GO. 
In-Reply-To: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> Message-ID: Hi Leighton, please see below: On Apr 16, 2007, at 11:55 AM, Leighton Pritchard wrote: > Hi, > > I've been trying to upload the GO into a clean BioSQL (MySQL, 1.4.1) > schema using the BioPerl bp_load_ontology.pl script, with the OBOv1.0, > OBOv1.2, and the most recent flatfiles from > http://www.geneontology.org/GO.downloads.ontology.shtml - none of my > attempts have been successful. The errors below are from a Linux > installation, but the same errors are thrown on OS X, too. I am using > the most recent versions of BioPerl and bioperl-db, installed via > CPAN: > > [lpritc at lplinuxdev sequence_data]$ perl -MBio::Root::Version -e 'print > $Bio::Root::Version::VERSION,"\n"' > 1.005002102 > > and bioperl-db 1.5.2. > > I have attached the traceback below (running with --safe throws a > number > of equivalent errors), Using --safe will throw the same errors, but will continue loading. I.e., you'd lose the one term, but keep everything else. I do realize that especially for a graph losing an internal node can be quite detrimental. > [...] > ######## > > [lpritc at lplinuxdev sequence_data]$ bp_load_ontology.pl --host > localhost > --dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass > ******** --format obo ~/Downloads/gene_ontology_edit.obo > Loading ontology gene_ontology: > ... terms > ... relationships > Done with gene_ontology. > Loading ontology biological_process: > ... terms > > -------------------- WARNING --------------------- > MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values > were ("","","0","") FKs () > Column 'dbname' cannot be null > --------------------------------------------------- This would point to a problem of the BioPerl obo parser. According to the message, both the database name and the accession of the db_xref for the term are - surely erroneously - empty. 
Apparently the parser fails to parse out database and accession for this db_xref of term GO:0018901.

If you can edit the obo file, you can try deleting the db_xref(s) for that term that look odd (or delete all if you don't need them). I'd have to debug the obo parser to see exactly where it's going wrong in parsing.

> Could not store term GO:0018901, name '2,4-dichlorophenoxyacetic acid
> metabolic process':
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> [...]
> [lpritc at lplinuxdev sequence_data]$ bp_load_ontology.pl --host
> localhost
> --dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass
> ******** --format goflat --fmtargs ~/Downloads/GO.defs

Note that the argument for --fmtargs here should read "-defs_file,/path/to/Downloads/GO.defs". (Note that within the quotes there is no tilde expansion.)

> ~/Downloads/function.ontology
> Loading ontology Gene Ontology:
> ... terms
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
> were ("MetaCyc","2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN","0","")
> FKs
> ()
> Duplicate entry '2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RX-
> MetaCyc-0' for
> key 2
> ---------------------------------------------------

This is one of the things why you've got to love MySQL (and am I correct in inferring that you're using MySQL?). The width of the dbxref.accession column (which the second value in parentheses is for) is 40 chars. The apparently pre-existing value ("2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RX-MetaCyc-0") is 50 chars, which when loaded should have resulted in an exception. Instead, MySQL just simply and silently truncates it to 40 chars, which makes it identical to the first 40 chars of "2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN" (which is 41 chars in length).

It may be necessary to widen dbxref.accession here, for example to 80 chars? Let me know if you need help with the DDL command to do this.
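To make the truncation effect described above concrete, here is a minimal sketch in Python. It is only a toy model of a column that silently cuts values to a fixed width. The `store` function, the set standing in for the table, and the lookup-then-insert order are assumptions for illustration, not actual bioperl-db or MySQL code.

```python
# Toy model of a fixed-width column that truncates silently on insert.
# Everything here is hypothetical; it only mirrors the behaviour
# described in the message above.
ACCESSION_WIDTH = 40  # width of the accession column as stated in the thread


def store(table, dbname, accession, version=0, width=ACCESSION_WIDTH):
    """Look a row up by its full value, then insert a truncated copy."""
    if (dbname, accession, version) in table:
        return "found"
    key = (dbname, accession[:width], version)
    if key in table:
        # The truncated value is already present: unique key violation.
        return "duplicate entry"
    table.add(key)
    return "inserted"


accession = r"2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN"  # 41 characters

narrow = set()
first = store(narrow, "MetaCyc", accession)
second = store(narrow, "MetaCyc", accession)

wide = set()
store(wide, "MetaCyc", accession, width=80)
second_wide = store(wide, "MetaCyc", accession, width=80)

print(len(accession), first, second, second_wide)
```

With the 40-character width the second call can neither find the row by its full value nor insert it, which matches the "failed to insert or to be found by unique key" message. With a width of 80 the lookup succeeds. In MySQL the widening itself would presumably be something along the lines of `ALTER TABLE dbxref MODIFY accession VARCHAR(80) NOT NULL;`, but check that against your local schema before running it.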
Let me know how far this gets you. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From lpritc at scri.ac.uk Tue Apr 17 13:35:44 2007 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 17 Apr 2007 14:35:44 +0100 Subject: [BioSQL-l] Problem loading GO. In-Reply-To: References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> Message-ID: <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> Hi Hilmar, Thanks for the very quick response. Apologies for the long reply, but I thought it might be useful if anyone else happens across the same problems that I did. On Tue, 2007-04-17 at 00:00 -0400, Hilmar Lapp wrote: > Apparently the parser > fails to parse out database and accession for this db_xref of term GO: > 0018901. > > If you can edit the obo file, you can try deleting the db_xref(s) for > that term that look odd (or delete all if you don't need them). You're spot on - see further down for details... > Note that the argument for --fmtargs here should read > "-defs_file,/path/to/Downloads/GO.defs". (Note that within the quotes > there is no tilde expansion.) D'oh! Thanks for the note - my bad, there. > This is one the things why you've got to love MySQL (and I am correct > in inferring that you're using MySQL?). The 'choice' was forced upon me ;) > It may be necessary to widen the length of dbname.accession here, for > example to 80 chars? Let me know if you need help with the DDL > command to do this. 
I've fixed that now (and added it to my local biosqldb-mysql.sql schema), but with a clean BioSQL schema and using: [lpritc at lplinuxdev sql]$ bp_load_ontology.pl --host localhost --dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass ******** --format goflat --fmtargs "-defs_file,/home/lpritc/Downloads/GO.defs" /home/lpritc/Downloads/function.ontology I was still getting errors with the GO flatfile: Loading ontology Gene Ontology: ... terms -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values were ("","","0","") FKs () Column 'dbname' cannot be null --------------------------------------------------- Could not store term GO:0047554, name '2-pyrone-4,6-dicarboxylate lactonase activity': ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be found by unique key STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206 STACK: Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK: Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271 STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805 STACK: /usr/bin/bp_load_ontology.pl:610 ----------------------------------------------------------- at /usr/bin/bp_load_ontology.pl line 817 main::persist_term('-term', 'Bio::Ontology::GOterm=HASH(0x88497a4)', '-db', 'Bio::DB::BioSQL::DBAdaptor=HASH(0x897f074)', 
'-termfactory', 'Bio::Ontology::TermFactory=HASH(0x8d64ad8)', '-throw', 'CODE(0x851abc8)', '-mergeobs', ...) called at /usr/bin/bp_load_ontology.pl line 610 I tracked this down to an apparently poor formatting of the GO.defs file (note that the first and third definition_lines appear to be two halves of the same entry): term: 2-pyrone-4,6-dicarboxylate lactonase activity goid: GO:0047554 definition: Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O = 4-carboxy-2-hydroxyhexa-2,4-dienedioate. definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN definition_reference: EC:3.1.1.57 definition_reference: MetaCyc:2-PYRONE-4 I found 43 similar errors for other GOIDs, and it appears to result from the occurrence of the string "\," in a dbxref - mostly MetaCyc entries, but also some UM-BBD_pathwayID entries. These errors appear to have followed through into the generation of the OBO format files in each case, e.g.: def: "Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O = 4-carboxy-2-hydroxyhexa-2,4-dienedioate." [:6-DICARBOXYLATE-LACTONASE-RXN, EC:3.1.1.57, MetaCyc:2-PYRONE-4] and so is something for the GO guys to fix, I guess. Another error is thrown after fixing the above, though (with the same command as before): Loading ontology Gene Ontology: ... 
terms -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values were ("GO:0006905","vesicle transport","OBSOLETE (was not defined before being made obsolete).","X","") FKs (1) Duplicate entry 'vesicle transport-1-X' for key 3 --------------------------------------------------- Could not store term GO:0006905, name 'vesicle transport': ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: create: object (Bio::Ontology::GOterm) failed to insert or to be found by unique key STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK: Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271 STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805 STACK: /usr/bin/bp_load_ontology.pl:610 ----------------------------------------------------------- at /usr/bin/bp_load_ontology.pl line 817 main::persist_term('-term', 'Bio::Ontology::GOterm=HASH(0xbcac418)', '-db', 'Bio::DB::BioSQL::DBAdaptor=HASH(0x957805c)', '-termfactory', 'Bio::Ontology::TermFactory=HASH(0x995db20)', '-throw', 'CODE(0x9113bd0)', '-mergeobs', ...) called at /usr/bin/bp_load_ontology.pl line 610 There are duplicate terms, identical in the term table except for GOID: GO:0006905 and GO:0005480. They are both "vesicle transport", and obsoleted: term: vesicle transport goid: GO:0005480 definition: OBSOLETE (was not defined before being made obsolete). definition_reference: GOC:go_curators comment: This term was made obsolete because it represents a biological process and not a molecular function. 
To update annotations, use the biological process term 'vesicle-mediated transport ; GO:0016192'. term: vesicle transport goid: GO:0006905 definition: OBSOLETE (was not defined before being made obsolete). definition_reference: GOC:go_curators comment: This term was made obsolete because the meaning of the term is ambiguous. To update annotations, consider the biological process term 'vesicle-mediated transport ; GO:0016192'. I used the --noobsolete flag to avoid this error - reasoning that since I'm populating the database for the first time, ignoring the obsolete terms won't hurt - but finally this error was thrown: Loading ontology Gene Ontology: ... terms -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values were ("PMID","","0","") FKs () Column 'accession' cannot be null --------------------------------------------------- Could not store term GO:0032933, name 'SREBP-mediated signaling pathway': ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be found by unique key STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206 STACK: Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK: Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271 STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805 STACK: /usr/bin/bp_load_ontology.pl:610 
----------------------------------------------------------- at /usr/bin/bp_load_ontology.pl line 817 main::persist_term('-term', 'Bio::Ontology::GOterm=HASH(0xbe18f14)', '-db', 'Bio::DB::BioSQL::DBAdaptor=HASH(0x99bbf2c)', '-termfactory', 'Bio::Ontology::TermFactory=HASH(0x9da0ad8)', '-throw', 'CODE(0x9556bb4)', '-mergeobs', ...) called at /usr/bin/bp_load_ontology.pl line 610 with the offending entry being term: SREBP-mediated signaling pathway goid: GO:0032933 definition: A series of molecular signals from the endoplasmic reticulum to the nucleus generated as a consequence of altered levels of one or more lipids, and resulting in the activation of transcription by SREBP. definition_reference: GOC:mah definition_reference: PMID:0 I commented out the definition_reference for PMID:0, which seemed to fix matters. The process.ontology and component.ontology files then went into the database without a hitch. Thanks again for your help, L. -- Dr Leighton Pritchard B.Sc.(Hons) MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA e:lpritc at scri.ac.uk w:http://bioinf.scri.ac.uk/lp gpg/pgp: 0xFEFC205C
From hlapp at gmx.net Tue Apr 17 15:09:45 2007 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 17 Apr 2007 11:09:45 -0400 Subject: [BioSQL-l] Problem loading GO. In-Reply-To: <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> Message-ID: <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> On Apr 17, 2007, at 9:35 AM, Leighton Pritchard wrote: > Hi Hilmar, > > Thanks for the very quick response. Apologies for the long reply, > but I > thought it might be useful if anyone else happens across the same > problems that I did. Thanks for reporting all these. > [...] > -------------------- WARNING --------------------- > MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values > were ("","","0","") FKs () > Column 'dbname' cannot be null > --------------------------------------------------- > Could not store term GO:0047554, name '2-pyrone-4,6-dicarboxylate > lactonase activity': > [...] > I tracked this down to an apparently poor formatting of the GO.defs > file > (note that the first and third definition_lines appear to be two > halves > of the same entry): > > term: 2-pyrone-4,6-dicarboxylate lactonase activity > goid: GO:0047554 > definition: Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + > H2O > = 4-carboxy-2-hydroxyhexa-2,4-dienedioate.
> definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN I wonder whether this is the line that throws the parser off. It looks like the database part of the reference is missing - bad. > definition_reference: EC:3.1.1.57 > definition_reference: MetaCyc:2-PYRONE-4 > > I found 43 similar errors for other GOIDs, and it appears to result > from > the occurrence of the string "\," in a dbxref - mostly MetaCyc > entries, > but also some UM-BBD_pathwayID entries. I'm not sure - although the string "\," might indeed trip up the parser, would have to investigate to confirm. Could it be a coincidence with definition_references that lack the database part before the colon? > > These errors appear to have followed through into the generation of > the > OBO format files in each case, e.g.: > > def: "Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O = > 4-carboxy-2-hydroxyhexa-2,4-dienedioate." [:6-DICARBOXYLATE- > LACTONASE-RXN, EC:3.1.1.57, MetaCyc:2-PYRONE-4] Again, the first db_xref lacks the database in front of the colon. I can also see why "\," will trip up the parser in this format. > > and so is something for the GO guys to fix, I guess. The lack of a database for certain xrefs surely is. If the escaped comma does throw off the BioPerl parser then that part is for BioPerl to fix. It does seem to extract the parts correctly, if the error message is any indication, though you may argue that it should remove the escaping backslashes (and I'd certainly agree with that). > > > Another error is thrown after fixing the above, though (with the same > command as before): > > Loading ontology Gene Ontology: > ... 
terms > > -------------------- WARNING --------------------- > MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values > were > ("GO:0006905","vesicle transport","OBSOLETE (was not defined before > being made obsolete).","X","") FKs (1) > Duplicate entry 'vesicle transport-1-X' for key 3 > --------------------------------------------------- > Could not store term GO:0006905, name 'vesicle transport': > [...] > There are duplicate terms, identical in the term table except for > GOID: > GO:0006905 and GO:0005480. They are both "vesicle transport", and > obsoleted: That violates the uniqueness constraint, and this sounds more like a bug in the GO file. I'm also not sure what motivated them to create the same term multiple times only to obsolete it immediately. > [...] > -------------------- WARNING --------------------- > MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values > were ("PMID","","0","") FKs () > Column 'accession' cannot be null > --------------------------------------------------- > Could not store term GO:0032933, name 'SREBP-mediated signaling > pathway': > [...] > with the offending entry being > > term: SREBP-mediated signaling pathway > goid: GO:0032933 > definition: A series of molecular signals from the endoplasmic > reticulum > to the nucleus generated as a consequence of altered levels of one or > more lipids, and resulting in the activation of transcription by > SREBP. > definition_reference: GOC:mah > definition_reference: PMID:0 > > I commented out the definition_reference for PMID:0, which seemed > to fix > matters. Right, it seems to be a bogus reference. > > The process.ontology and component.ontology files then went into the > database without a hitch. Thanks again for your help, Fantastic you got it all loaded! Note that you also have the --computetc switch which will compute the transitive closure for you automatically. 
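For readers unfamiliar with the term, the following is a small Python sketch of what a transitive closure over ontology relationships means. It is not how bp_load_ontology.pl implements --computetc; the function name and the placeholder term names "A" to "D" are made up for illustration.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Return every (descendant, ancestor) pair implied by direct
    (child, parent) edges."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    closure = set()
    for start in list(parents):
        stack = list(parents[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            # Climb further up; .get avoids creating empty entries.
            stack.extend(parents.get(node, ()))
    return closure


# Placeholder graph: C and D are children of B, and B is a child of A.
edges = [("C", "B"), ("B", "A"), ("D", "B")]
closure = transitive_closure(edges)
print(sorted(closure))
```

Besides the three direct edges, the result also contains the implied pairs ("C", "A") and ("D", "A"), which is the kind of indirect ancestry a precomputed closure lets you query without walking the graph each time.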
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From lpritc at scri.ac.uk Tue Apr 17 16:05:16 2007 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 17 Apr 2007 17:05:16 +0100 Subject: [BioSQL-l] Problem loading GO. In-Reply-To: <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> Message-ID: <1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk> Hello again, On Tue, 2007-04-17 at 11:09 -0400, Hilmar Lapp wrote: > Thanks for reporting all these. No problem at all. > On Apr 17, 2007, at 9:35 AM, Leighton Pritchard wrote: > > term: 2-pyrone-4,6-dicarboxylate lactonase activity [...] > > definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN > > I wonder whether this is the line that throws the parser off. It > looks like the database part of the reference is missing - bad. > > definition_reference: MetaCyc:2-PYRONE-4 I don't think the parser is to blame, here. Note that if you join the definition_reference strings from the GO.defs file, you get: MetaCyc:2-PYRONE-4:6-DICARBOXYLATE-LACTONASE-RXN Then if you replace the colon by "\," you get what should (I think) actually be the MetaCyc entry: MetaCyc:2-PYRONE-4\,6-DICARBOXYLATE-LACTONASE-RXN > > I found 43 similar errors for other GOIDs, and it appears to result > > from > > the occurrence of the string "\," in a dbxref - mostly MetaCyc > > entries, > > but also some UM-BBD_pathwayID entries. > > I'm not sure - although the string "\," might indeed trip up the > parser, would have to investigate to confirm. Could it be a > coincidence with definition_references that lack the database part > before the colon? 
Inspecting the troublesome entries by eye seems to turn up the same problem as above consistently: a GO term in the GO.defs file is malformed. The term should have a definition_reference field describing a MetaCyc entry that matches the term field. In the term string, there would be an escaped comma, but the string ends where we expect this. The string that would follow the escaped comma is present as the first definition_reference. This observation also extends to cases where there should be two occurrences of "\," in the MetaCyc field, e.g.:

term: 2,3-dihydroxyindole 2,3-dioxygenase activity
goid: GO:0047528
definition: Catalysis of the reaction: 2,3-dihydroxyindole + O2 = anthranilate + CO2.
definition_reference: :3-DIHYDROXYINDOLE-2
definition_reference: :3-DIOXYGENASE-RXN
definition_reference: EC:1.13.11.2
definition_reference: MetaCyc:2

It then appears as though the GO flatfiles were used automatically to generate the OBO format files, and propagated the same error into the square brackets in each case.

> > and so is something for the GO guys to fix, I guess.
>
> The lack of a database for certain xrefs surely is. If the escaped
> comma does throw off the BioPerl parser then that part is for BioPerl
> to fix.

I think the problems are now all in the data I downloaded from http://www.geneontology.org/GO.downloads.shtml - I believe the BioPerl parser to be innocent of these charges ;) I've submitted the issue at the GO site, and with any luck they'll handle it quite soon (if it is in fact their problem).

> Note that you also have the --computetc switch which will compute the
> transitive closure for you automatically.

:D Excellent! Thanks for the pointer, and again for your efforts,

L.
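The splitting pattern described in this message can be sketched in a few lines of Python. This is only an illustration of the observation, assuming that a reference with nothing before its colon is an orphaned fragment. The helper names are hypothetical, and the caller has to say which reference the fragments belong to, since the file itself does not.

```python
def split_xref(ref):
    """Split 'DB:accession' at the first colon into (db, accession)."""
    db, _, accession = ref.partition(":")
    return db, accession


def orphan_fragments(refs):
    """Return the references whose database part is empty."""
    return [ref for ref in refs if split_xref(ref)[0] == ""]


def rejoin(head, orphans):
    """Re-attach orphan fragments to head, restoring each escaped comma."""
    return head + "".join("\\," + split_xref(o)[1] for o in orphans)


# definition_reference values as they appear in GO.defs for GO:0047554
refs_one = [":6-DICARBOXYLATE-LACTONASE-RXN", "EC:3.1.1.57", "MetaCyc:2-PYRONE-4"]
fixed_one = rejoin("MetaCyc:2-PYRONE-4", orphan_fragments(refs_one))

# and for GO:0047528, where two escaped commas were lost
refs_two = [":3-DIHYDROXYINDOLE-2", ":3-DIOXYGENASE-RXN", "EC:1.13.11.2", "MetaCyc:2"]
fixed_two = rejoin("MetaCyc:2", orphan_fragments(refs_two))

print(fixed_one)
print(fixed_two)
```

The second result has the same accession that shows up in the earlier "Duplicate entry" warning, which is consistent with the fragments having been produced by splitting on the escaped comma.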
-- Dr Leighton Pritchard B.Sc.(Hons) MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA e:lpritc at scri.ac.uk w:http://bioinf.scri.ac.uk/lp gpg/pgp: 0xFEFC205C From cjfields at uiuc.edu Tue Apr 17 16:18:19 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 17 Apr 2007 11:18:19 -0500 Subject: [BioSQL-l] Problem loading GO. In-Reply-To: <1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk> References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> <1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk> Message-ID: <146086E2-330B-4460-90AC-2632E82ED145@uiuc.edu> On Apr 17, 2007, at 11:05 AM, Leighton Pritchard wrote: ...
> >>> and so is something for the GO guys to fix, I guess. >> >> The lack of a database for certain xrefs surely is. If the escaped >> comma does throw off the BioPerl parser then that part is for BioPerl >> to fix. > > I thinkk the problems are now all in the data I downloaded from > http://www.geneontology.org/GO.downloads.shtml - I believe the BioPerl > parser to be innocent of these charges ;) I've submitted the issue at > the GO site, and with any luck they'll handle it quite soon (if it > is in > fact their problem). > >> Note that you also have the --computetc switch which will compute the >> transitive closure for you automatically. > > :D Excellent! Thanks for the pointer, and again for your efforts, > > L. ... If you do find anything that is BioSQL- or Bioperl-related then file a bug report so we can track it. I agree with Hilmar that it's likely the parser is partly to blame. http://bugzilla.open-bio.org/ We really appreciate the work you're putting into this! chris From lpritc at scri.ac.uk Tue Apr 17 16:55:38 2007 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 17 Apr 2007 17:55:38 +0100 Subject: [BioSQL-l] Problem loading GO. In-Reply-To: <146086E2-330B-4460-90AC-2632E82ED145@uiuc.edu> References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> <1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk> <146086E2-330B-4460-90AC-2632E82ED145@uiuc.edu> Message-ID: <1176828938.988.133.camel@lplinuxdev.scri.sari.ac.uk> Hi Chris, On Tue, 2007-04-17 at 11:18 -0500, Chris Fields wrote: > If you do find anything that is BioSQL- or Bioperl-related then file > a bug report so we can track it. I agree with Hilmar that it's > likely the parser is partly to blame. > > http://bugzilla.open-bio.org/ I've submitted a bug report, mostly replicating my first post in this thread. 
I added links to the appropriate point in the list archives so that the rest of the discussion can be considered, too. > We really appreciate the work you're putting into this! Thanks - I'm just grateful that the Bio* repertoire is there at all so that my problems are relatively minor (as opposed to attempting to replicate the functionality independently). L. -- Dr Leighton Pritchard B.Sc.(Hons) MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA e:lpritc at scri.ac.uk w:http://bioinf.scri.ac.uk/lp gpg/pgp: 0xFEFC205C From lpritc at scri.ac.uk Tue Apr 17 17:03:53 2007 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 17 Apr 2007 18:03:53 +0100 Subject: [BioSQL-l] [Bioperl-l] Problem loading GO.
In-Reply-To: References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> Message-ID: <1176829433.988.143.camel@lplinuxdev.scri.sari.ac.uk> On Tue, 2007-04-17 at 09:54 -0700, Chris Mungall wrote: > Is there any reason you're loading GO.defs? This is a legacy format > all the information is subsumed in the obo file. My only reason was that the parser originally failed to load the OBO format data - probably for the same reason that the flatfile failed - and I tried the flatfile to check if there were parser issues with the format. I just carried on with the flatfile after that because the terms with formatting errors were (subjectively, for me) easier to spot and fix by hand. I'm happy to use a fixed OBO file. > I didn't see your message to the GO folks re formatting errors - who > did you send it to & what was the subject? I'll see it gets seen to. I submitted it via the website interface - I'm afraid I have no idea where it would have gone after that. -- Dr Leighton Pritchard B.Sc.(Hons) MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA e:lpritc at scri.ac.uk w:http://bioinf.scri.ac.uk/lp gpg/pgp: 0xFEFC205C
From cjm at fruitfly.org Tue Apr 17 16:54:51 2007 From: cjm at fruitfly.org (Chris Mungall) Date: Tue, 17 Apr 2007 09:54:51 -0700 Subject: [BioSQL-l] [Bioperl-l] Problem loading GO. In-Reply-To: <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> Message-ID: Is there any reason you're loading GO.defs? This is a legacy format all the information is subsumed in the obo file. I didn't see your message to the GO folks re formatting errors - who did you send it to & what was the subject? I'll see it gets seen to. >> -------------------- WARNING --------------------- >> MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values >> were >> ("GO:0006905","vesicle transport","OBSOLETE (was not defined before >> being made obsolete).","X","") FKs (1) >> Duplicate entry 'vesicle transport-1-X' for key 3 >> --------------------------------------------------- >> Could not store term GO:0006905, name 'vesicle transport': >> [...] >> There are duplicate terms, identical in the term table except for >> GOID: >> GO:0006905 and GO:0005480. They are both "vesicle transport", and >> obsoleted: > > That violates the uniqueness constraint, and this sounds more like a > bug in the GO file. I'm also not sure what motivated them to create > the same term multiple times only to obsolete it immediately.
These things happen - the schema should be able to deal with it. It's a pain, I know. In Chado we have some hacky solution for this (I believe it is concatenating the ID onto the name of obsolete terms). I think that it's actually wrong to include obsoletes and actual terms in the same table - however, it's obviously astoundingly useful to be able to do this, but it requires the hack to get out of the uniqueness violation. The EBI loads all of OBO into BioSQL regularly - I wonder how they handle this? On Apr 17, 2007, at 8:09 AM, Hilmar Lapp wrote: > > On Apr 17, 2007, at 9:35 AM, Leighton Pritchard wrote: > >> Hi Hilmar, >> >> Thanks for the very quick response. Apologies for the long reply, >> but I >> thought it might be useful if anyone else happens across the same >> problems that I did. > > Thanks for reporting all these. > >> [...] >> -------------------- WARNING --------------------- >> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values >> were ("","","0","") FKs () >> Column 'dbname' cannot be null >> --------------------------------------------------- >> Could not store term GO:0047554, name '2-pyrone-4,6-dicarboxylate >> lactonase activity': >> [...] >> I tracked this down to an apparently poor formatting of the GO.defs >> file >> (note that the first and third definition_lines appear to be two >> halves >> of the same entry): >> >> term: 2-pyrone-4,6-dicarboxylate lactonase activity >> goid: GO:0047554 >> definition: Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + >> H2O >> = 4-carboxy-2-hydroxyhexa-2,4-dienedioate. >> definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN > > I wonder whether this is the line that throws the parser off. It > looks like the database part of the reference is missing - bad.
> >> definition_reference: EC:3.1.1.57 >> definition_reference: MetaCyc:2-PYRONE-4 >> >> I found 43 similar errors for other GOIDs, and it appears to result >> from >> the occurrence of the string "\," in a dbxref - mostly MetaCyc >> entries, >> but also some UM-BBD_pathwayID entries. > > I'm not sure - although the string "\," might indeed trip up the > parser, would have to investigate to confirm. Could it be a > coincidence with definition_references that lack the database part > before the colon? > >> >> These errors appear to have followed through into the generation of >> the >> OBO format files in each case, e.g.: >> >> def: "Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O = >> 4-carboxy-2-hydroxyhexa-2,4-dienedioate." [:6-DICARBOXYLATE- >> LACTONASE-RXN, EC:3.1.1.57, MetaCyc:2-PYRONE-4] > > Again, the first db_xref lacks the database in front of the colon. I > can also see why "\," will trip up the parser in this format. > >> >> and so is something for the GO guys to fix, I guess. > > The lack of a database for certain xrefs surely is. If the escaped > comma does throw off the BioPerl parser then that part is for BioPerl > to fix. It does seem to extract the parts correctly, if the error > message is any indication, though you may argue that it should remove > the escaping backslashes (and I'd certainly agree with that). > >> >> >> Another error is thrown after fixing the above, though (with the same >> command as before): >> >> Loading ontology Gene Ontology: >> ... terms >> >> -------------------- WARNING --------------------- >> MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values >> were >> ("GO:0006905","vesicle transport","OBSOLETE (was not defined before >> being made obsolete).","X","") FKs (1) >> Duplicate entry 'vesicle transport-1-X' for key 3 >> --------------------------------------------------- >> Could not store term GO:0006905, name 'vesicle transport': >> [...] 
>> There are duplicate terms, identical in the term table except for >> GOID: >> GO:0006905 and GO:0005480. They are both "vesicle transport", and >> obsoleted: > > That violates the uniqueness constraint, and this sounds more like a > bug in the GO file. I'm also not sure what motivated them to create > the same term multiple times only to obsolete it immediately. > >> [...] >> -------------------- WARNING --------------------- >> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values >> were ("PMID","","0","") FKs () >> Column 'accession' cannot be null >> --------------------------------------------------- >> Could not store term GO:0032933, name 'SREBP-mediated signaling >> pathway': >> [...] >> with the offending entry being >> >> term: SREBP-mediated signaling pathway >> goid: GO:0032933 >> definition: A series of molecular signals from the endoplasmic >> reticulum >> to the nucleus generated as a consequence of altered levels of one or >> more lipids, and resulting in the activation of transcription by >> SREBP. >> definition_reference: GOC:mah >> definition_reference: PMID:0 >> >> I commented out the definition_reference for PMID:0, which seemed >> to fix >> matters. > > Right, it seems to be a bogus reference. > >> >> The process.ontology and component.ontology files then went into the >> database without a hitch. Thanks again for your help, > > Fantastic you got it all loaded! > > Note that you also have the --computetc switch which will compute the > transitive closure for you automatically. 
> > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From rcote at ebi.ac.uk Wed Apr 18 07:08:50 2007 From: rcote at ebi.ac.uk (Richard Cote) Date: Wed, 18 Apr 2007 08:08:50 +0100 Subject: [BioSQL-l] [Bioperl-l] Problem loading GO. In-Reply-To: References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk> <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk> <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net> Message-ID: <4625C402.5040809@ebi.ac.uk> Chris Mungall wrote: >>> Could not store term GO:0006905, name 'vesicle transport': >>> [...] >>> There are duplicate terms, identical in the term table except for >>> GOID: >>> GO:0006905 and GO:0005480. They are both "vesicle transport", and >>> obsoleted: >> > I think that it's actually wrong to include obsoletes and actual terms in > the same table - however, it's obviously astoundingly useful to be able > to do this, but it requires the hack to get out of the uniqueness violation. > > The EBI loads all of OBO into BioSQL regularly - I wonder how they > handle this? I simply avoid the issue. There's no uniqueness constraint on term name. The only constraint is term ID, and even that is only unique in the context of an ontology namespace (i.e. it would be perfectly allowable to have FOO:1234 and BAR:1234). The only unique (and primary) key is generated by the ORM layer so I don't even have to deal with that. We also have all the terms, obsoleted or not, in the same table because people are always querying on stuff that's been made obsolete but is still annotated with the old IDs.
Cheers, Rc -- Richard Cote Software Engineer - PRIDE Project Team (Sequence Database Group) European Bioinformatics Institute Wellcome Trust Genome Campus rcote at ebi.ac.uk Hinxton, Cambridge CB10 1SD Phone: (+44) 1223 492610 United Kingdom Fax : (+44) 1223 494468
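
For anyone who lands on this thread with the same "Duplicate entry" error, here is a minimal sketch of the workaround Chris Mungall describes for Chado (appending the ID to the name of obsolete terms). It is written in Python against an in-memory SQLite table, not against BioSQL or bioperl-db. The table layout is a simplified stand-in, and the unique key on (name, ontology_id, is_obsolete) is an assumption inferred from the error text 'vesicle transport-1-X'. The helper names (make_term_table, store_term, demo) are hypothetical, as is the exact format of the disambiguated name.

```python
import sqlite3


def make_term_table():
    """Create a simplified, hypothetical stand-in for the term table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE term (
            term_id     INTEGER PRIMARY KEY,
            identifier  TEXT UNIQUE,
            name        TEXT NOT NULL,
            ontology_id INTEGER NOT NULL,
            is_obsolete TEXT,
            -- Assumed key, inferred from 'vesicle transport-1-X' in the error.
            UNIQUE (name, ontology_id, is_obsolete)
        )
        """
    )
    return conn


def store_term(conn, identifier, name, ontology_id, is_obsolete, disambiguate=True):
    """Insert one term; optionally append the ID to obsolete term names."""
    if disambiguate and is_obsolete == "X":
        # The "hack": make the name unique by concatenating the term ID.
        name = f"{name} ({identifier})"
    conn.execute(
        "INSERT INTO term (identifier, name, ontology_id, is_obsolete) "
        "VALUES (?, ?, ?, ?)",
        (identifier, name, ontology_id, is_obsolete),
    )


def demo():
    """Store the two obsolete terms from the thread, with and without the hack."""
    terms = [
        ("GO:0005480", "vesicle transport", 1, "X"),
        ("GO:0006905", "vesicle transport", 1, "X"),
    ]

    # Without the workaround, the second insert violates the unique key.
    plain = make_term_table()
    store_term(plain, *terms[0], disambiguate=False)
    try:
        store_term(plain, *terms[1], disambiguate=False)
        collided = False
    except sqlite3.IntegrityError:
        collided = True

    # With the ID appended to the name, both obsolete terms can be stored.
    fixed = make_term_table()
    for term in terms:
        store_term(fixed, *term)
    names = [
        row[0]
        for row in fixed.execute("SELECT name FROM term ORDER BY identifier")
    ]
    return collided, names


if __name__ == "__main__":
    print(demo())
```

The alternative Richard describes, having no uniqueness constraint on the term name at all, avoids the renaming entirely; the trade-off is that duplicate names are then allowed in the table and only the term ID (within its ontology namespace) distinguishes them.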