From dillo at pcbi.upenn.edu  Mon Apr  9 12:05:03 2007
From: dillo at pcbi.upenn.edu (Bryan Cardillo)
Date: Mon, 9 Apr 2007 12:05:03 -0400
Subject: [BioSQL-l] genbank, references, and crc's
Message-ID: <20070409160502.GD5285@rover.pcbi.upenn.edu>

        This is probably more of a bioperl issue, but since it was
        previously discussed here, this is where I'll continue the
        discussion.  I've just run into the same issues mentioned in
        these threads while loading some refseq sequences.

        http://lists.open-bio.org/pipermail/biosql-l/2006-July/001024.html
        http://lists.open-bio.org/pipermail/biosql-l/2006-August/001048.html


        I believe the bioperl-db patch below solves these issues.
        The crux of the problem is that the _crc64 code uses the
        authors, title, and location to determine a unique key.
        However the get_unique_key_query method only checks authors
        before deferring to a crc lookup.  The fix causes the crc key
        to be used if any of authors, title, or location is
        specified.
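
        To illustrate the mismatch, here is a minimal sketch.  My::Ref,
        old_guard and new_guard are made-up stand-ins for illustration,
        not bioperl-db code; only the accessor names authors(), title()
        and location() are taken from the patch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a reference object; it models only the three
# accessors discussed in this thread.
package My::Ref;
sub new      { my ($class, %args) = @_; return bless {%args}, $class }
sub authors  { $_[0]{authors} }
sub title    { $_[0]{title} }
sub location { $_[0]{location} }

package main;

# Guard before the patch: only authors() is consulted.
sub old_guard {
    my ($obj) = @_;
    return $obj->authors() ? 1 : 0;
}

# Guard after the patch: any field that feeds the CRC enables the lookup.
sub new_guard {
    my ($obj) = @_;
    return ($obj->authors() || $obj->title() || $obj->location()) ? 1 : 0;
}

# A reference with a title and a location but no authors.
my $ref = My::Ref->new(title => 'Some title', location => 'Some location');
printf "old guard: %d, new guard: %d\n", old_guard($ref), new_guard($ref);
```

        With the old guard such a reference never reaches the crc
        lookup, even though a crc can be computed for it.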

        Cheers,
        Bryan Cardillo
        Penn Bioinformatics Core
        University of Pennsylvania

 ReferenceAdaptor.pm |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Index: ./Bio/DB/BioSQL/ReferenceAdaptor.pm
===================================================================
RCS file: /home/repository/bioperl/bioperl-db/Bio/DB/BioSQL/ReferenceAdaptor.pm,v
retrieving revision 1.24
diff -u -r1.24 ReferenceAdaptor.pm
--- ./Bio/DB/BioSQL/ReferenceAdaptor.pm	4 Jul 2006 22:23:12 -0000	1.24
+++ ./Bio/DB/BioSQL/ReferenceAdaptor.pm	9 Apr 2007 15:38:35 -0000
@@ -426,7 +426,7 @@
 	    });
 	}
     }
-    if($obj->authors()) {
+    if($obj->authors() || $obj->title() || $obj->location()) {
 	push(@ukqueries, {
 	    'doc_id' => $self->_crc64($obj),
 	});

From hlapp at gmx.net  Tue Apr 10 12:09:43 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 10 Apr 2007 12:09:43 -0400
Subject: [BioSQL-l] genbank, references, and crc's
In-Reply-To: <20070409160502.GD5285@rover.pcbi.upenn.edu>
References: <20070409160502.GD5285@rover.pcbi.upenn.edu>
Message-ID: <3A873665-CA1D-489B-A6A8-6EDBB54C7858@gmx.net>

Hi Bryan,

thanks for tracking this down - great, I've committed it.

The 'correct' condition, as defined by the schema, would actually be
to test for author or title being specified, because location must be
non-empty, according to the schema.

I.e., at least theoretically, the condition will now always be true,  
unless you removed the NOT NULL constraint locally on  
reference.location.

Would you mind testing whether removing the location() part from the  
if clause will still solve the issue?

	-hilmar

On Apr 9, 2007, at 12:05 PM, Bryan Cardillo wrote:

>         This is probably more of a bioperl issue, but since it was
>         previously discussed here, this is where I'll continue the
>         discussion.  I've just run into the same issues mentioned in
>         these threads while loading some refseq sequences.
>
>         http://lists.open-bio.org/pipermail/biosql-l/2006-July/001024.html
>         http://lists.open-bio.org/pipermail/biosql-l/2006-August/001048.html
>
>
>         I believe the bioperl-db patch below solves these issues.
>         The crux of the problem is that the _crc64 code uses the
>         authors, title, and location to determine a unique key.
>         However the get_unique_key_query method only checks authors
>         before deferring to a crc lookup.  The fix causes the crc key
>         to be used if any of authors, title, or location is
>         specified.
>
>         Cheers,
>         Bryan Cardillo
>         Penn Bioinformatics Core
>         University of Pennsylvania
>
>  ReferenceAdaptor.pm |    2 +-
>  1 files changed, 1 insertion(+), 1 deletion(-)
>
> Index: ./Bio/DB/BioSQL/ReferenceAdaptor.pm
> ===================================================================
> RCS file: /home/repository/bioperl/bioperl-db/Bio/DB/BioSQL/ReferenceAdaptor.pm,v
> retrieving revision 1.24
> diff -u -r1.24 ReferenceAdaptor.pm
> --- ./Bio/DB/BioSQL/ReferenceAdaptor.pm	4 Jul 2006 22:23:12 -0000	1.24
> +++ ./Bio/DB/BioSQL/ReferenceAdaptor.pm	9 Apr 2007 15:38:35 -0000
> @@ -426,7 +426,7 @@
>  	    });
>  	}
>      }
> -    if($obj->authors()) {
> +    if($obj->authors() || $obj->title() || $obj->location()) {
>  	push(@ukqueries, {
>  	    'doc_id' => $self->_crc64($obj),
>  	});

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From dillo at pcbi.upenn.edu  Wed Apr 11 11:33:39 2007
From: dillo at pcbi.upenn.edu (Bryan Cardillo)
Date: Wed, 11 Apr 2007 11:33:39 -0400
Subject: [BioSQL-l] genbank, references, and crc's
In-Reply-To: <3A873665-CA1D-489B-A6A8-6EDBB54C7858@gmx.net>
References: <20070409160502.GD5285@rover.pcbi.upenn.edu>
	<3A873665-CA1D-489B-A6A8-6EDBB54C7858@gmx.net>
Message-ID: <20070411153337.GA5275@rover.pcbi.upenn.edu>

On Tue, Apr 10, 2007 at 12:09:43PM -0400, Hilmar Lapp wrote:
> thanks for tracking this down - great, I've committed it.
> 
> The 'correct' condition, as defined by the schema, would actually be
> to test for author or title being specified, because location must be
> non-empty, according to the schema.
> 
> I.e., at least theoretically, the condition will now always be true,  
> unless you removed the NOT NULL constraint locally on  
> reference.location.
> 
> Would you mind testing whether removing the location() part from the  
> if clause will still solve the issue?

        you are correct, the test for location doesn't seem to be
        necessary.

        from a theoretical point of view, I'm not sure I agree
        with removing the location test though.  it seems to me that
        if you have a field (i.e., location) which is used in
        generating a unique identifier (crc64), then you should
        consult that field when determining what the unique
        identifier is for a particular object.

        to put it another way, a reference instance with no authors,
        no title, and a location can have a valid crc.  so why should
        the adaptor ignore this case?

        all that being said, my understanding of how all this goes
        together is still pretty shallow, so I'll defer to you as to
        which solution is best ;)

        --Bryan

From hlapp at gmx.net  Sun Apr 15 22:54:19 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 15 Apr 2007 22:54:19 -0400
Subject: [BioSQL-l] genbank, references, and crc's
In-Reply-To: <20070411153337.GA5275@rover.pcbi.upenn.edu>
References: <20070409160502.GD5285@rover.pcbi.upenn.edu>
	<3A873665-CA1D-489B-A6A8-6EDBB54C7858@gmx.net>
	<20070411153337.GA5275@rover.pcbi.upenn.edu>
Message-ID: <52E4803B-7141-4D40-B46D-369626D15968@gmx.net>


On Apr 11, 2007, at 11:33 AM, Bryan Cardillo wrote:

> to put it another way, a reference instance with no authors,
>         no title, and a location can have a valid crc.  so why should
>         the adaptor ignore this case?

You're right - I can't just remove the location() test. Instead, I  
should be able to remove the bracketing if clause altogether. I.e.,  
in light of the schema, the construct

     if($obj->authors() || $obj->title() || $obj->location()) {
         push(@ukqueries, {
             'doc_id' => $self->_crc64($obj),
         });
     }

ought to be equivalent to

     push(@ukqueries, {
          'doc_id' => $self->_crc64($obj),
     });

The thing is that the BioPerl object model doesn't complain if you  
leave all three of authors, title, and location empty, no matter how  
nonsensical that is (it so happens that annotation parsed out from a
legitimate sequence file in e.g. genbank format will always have  
location filled in).

I think I'll leave the if clause in and document that in reality for  
all legitimate annotation sources the clause should always evaluate  
to true.

Thanks for your observations - very sharp. I hope you'll stick around  
with the code for a while, it can certainly benefit from another pair  
of sharp eyes. Don't hesitate to let me know if you need any help.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From lpritc at scri.ac.uk  Mon Apr 16 11:55:22 2007
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Mon, 16 Apr 2007 16:55:22 +0100
Subject: [BioSQL-l] Problem loading GO.
Message-ID: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>

Hi,

I've been trying to upload the GO into a clean BioSQL (MySQL, 1.4.1)
schema using the BioPerl bp_load_ontology.pl script, with the OBOv1.0,
OBOv1.2, and the most recent flatfiles from
http://www.geneontology.org/GO.downloads.ontology.shtml - none of my
attempts have been successful.  The errors below are from a Linux
installation, but the same errors are thrown on OS X, too.  I am using
the most recent versions of BioPerl and bioperl-db, installed via CPAN:

[lpritc at lplinuxdev sequence_data]$ perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"'
1.005002102

and bioperl-db 1.5.2.

I have attached the traceback below (running with --safe throws a number
of equivalent errors), and I would be grateful for any help you might be
able to offer with setting me on track to fixing this.

SOFA is loaded without issues, you might be pleased to hear ;)

Thanks in advance,

L.


########

[lpritc at lplinuxdev sequence_data]$ bp_load_ontology.pl --host localhost \
    --dbname biosql --namespace "Gene Ontology" --dbuser lpritc \
    --dbpass ******** --format obo ~/Downloads/gene_ontology_edit.obo
Loading ontology gene_ontology:
        ... terms
        ... relationships
        Done with gene_ontology.
Loading ontology biological_process:
        ... terms

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
were ("","","0","") FKs ()
Column 'dbname' cannot be null
---------------------------------------------------
Could not store term GO:0018901, name '2,4-dichlorophenoxyacetic acid
metabolic process':

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be
found by unique key
STACK: Error::throw
STACK:
Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK:
Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK:
Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271
STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805
STACK: /usr/bin/bp_load_ontology.pl:610
-----------------------------------------------------------


[lpritc at lplinuxdev sequence_data]$ bp_load_ontology.pl --host localhost \
    --dbname biosql --namespace "Gene Ontology" --dbuser lpritc \
    --dbpass ******** --format goflat --fmtargs ~/Downloads/GO.defs \
    ~/Downloads/function.ontology
Loading ontology Gene Ontology:
        ... terms

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
were ("MetaCyc","2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN","0","") FKs
()
Duplicate entry '2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RX-MetaCyc-0' for
key 2
---------------------------------------------------
Could not store term GO:0047528, name
'2\,3-dihydroxyindole 2\,3-dioxygenase activity':

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be
found by unique key
STACK: Error::throw
STACK:
Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK:
Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK:
Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271
STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805
STACK: /usr/bin/bp_load_ontology.pl:610
-----------------------------------------------------------

 at /usr/bin/bp_load_ontology.pl line 817
        main::persist_term('-term',
'Bio::Ontology::GOterm=HASH(0xa6afb9c)', '-db',
'Bio::DB::BioSQL::DBAdaptor=HASH(0x9b5afd0)', '-termfactory',
'Bio::Ontology::TermFactory=HASH(0x9f40d10)', '-throw',
'CODE(0x96f6b68)', '-mergeobs', ...) called
at /usr/bin/bp_load_ontology.pl line 610


-- 
Dr Leighton Pritchard B.Sc.(Hons) MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA
e:lpritc at scri.ac.uk            w:http://bioinf.scri.ac.uk/lp
gpg/pgp: 0xFEFC205C
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are confidential 
to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that addressee.
If you are not the intended recipient you are requested to preserve this 
confidentiality and you must not use, disclose, copy, print or rely on this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan the email 
and the attachments (if any).


From hlapp at gmx.net  Tue Apr 17 00:00:55 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 17 Apr 2007 00:00:55 -0400
Subject: [BioSQL-l] Problem loading GO.
In-Reply-To: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
Message-ID: <B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>

Hi Leighton, please see below:

On Apr 16, 2007, at 11:55 AM, Leighton Pritchard wrote:

> Hi,
>
> I've been trying to upload the GO into a clean BioSQL (MySQL, 1.4.1)
> schema using the BioPerl bp_load_ontology.pl script, with the OBOv1.0,
> OBOv1.2, and the most recent flatfiles from
> http://www.geneontology.org/GO.downloads.ontology.shtml - none of my
> attempts have been successful.  The errors below are from a Linux
> installation, but the same errors are thrown on OS X, too.  I am using
> the most recent versions of BioPerl and bioperl-db, installed via CPAN:
>
> [lpritc at lplinuxdev sequence_data]$ perl -MBio::Root::Version -e 'print
> $Bio::Root::Version::VERSION,"\n"'
> 1.005002102
>
> and bioperl-db 1.5.2.
>
> I have attached the traceback below (running with --safe throws a
> number of equivalent errors),

Using --safe will throw the same errors, but will continue loading.  
I.e., you'd lose the one term, but keep everything else.

I do realize that especially for a graph losing an internal node can  
be quite detrimental.

> [...]
> ########
>
> [lpritc at lplinuxdev sequence_data]$ bp_load_ontology.pl --host localhost
> --dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass
> ******** --format obo ~/Downloads/gene_ontology_edit.obo
> Loading ontology gene_ontology:
>         ... terms
>         ... relationships
>         Done with gene_ontology.
> Loading ontology biological_process:
>         ... terms
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
> were ("","","0","") FKs ()
> Column 'dbname' cannot be null
> ---------------------------------------------------

This would point to a problem of the BioPerl obo parser. According to  
the message, both the database name and the accession of the db_xref  
for the term are - surely erroneously - empty. Apparently the parser  
fails to parse out database and accession for this db_xref of term
GO:0018901.

If you can edit the obo file, you can try deleting the db_xref(s) for  
that term that look odd (or delete all if you don't need them).

I'd have to debug the obo parser to see exactly where it's going  
wrong in parsing.

> Could not store term GO:0018901, name '2,4-dichlorophenoxyacetic acid
> metabolic process':
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> [...]
> [lpritc at lplinuxdev sequence_data]$ bp_load_ontology.pl --host localhost
> --dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass
> ******** --format goflat --fmtargs ~/Downloads/GO.defs

Note that the argument for --fmtargs here should read
"-defs_file,/path/to/Downloads/GO.defs". (Note that within the quotes  
there is no tilde expansion.)
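
As a rough sketch of one way to sidestep this, assuming the argument is assembled in a small Perl wrapper (fmtargs_for is a hypothetical helper, not part of bp_load_ontology.pl), the leading tilde can be expanded by hand before the value is quoted:

```perl
use strict;
use warnings;

# Build the value for --fmtargs from a definitions-file path.
# A leading "~" is replaced with $HOME here, because the shell passes a
# tilde inside a quoted argument through unchanged.
sub fmtargs_for {
    my ($defs_path) = @_;
    $defs_path =~ s!^~(?=/|\z)!$ENV{HOME}!;
    return "-defs_file,$defs_path";
}

print fmtargs_for('~/Downloads/GO.defs'), "\n";
```

Simply typing the absolute path on the command line achieves the same thing.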

> ~/Downloads/function.ontology
> Loading ontology Gene Ontology:
>         ... terms
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
> were ("MetaCyc","2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN","0","") FKs
> ()
> Duplicate entry '2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RX-MetaCyc-0' for
> key 2
> ---------------------------------------------------

This is one of the things for which you've got to love MySQL (and am I
correct in inferring that you're using MySQL?). The width of the
dbxref.accession column (which holds the second value in parentheses)
is 40 chars. The accession being inserted,
"2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN", is 41 chars in length,
which when loaded should have resulted in an exception. Instead, MySQL
just silently truncates it to 40 chars, and the truncated value
collides with the apparently pre-existing entry shown in the message
("2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RX-MetaCyc-0", i.e. a 40-char
accession followed by database name and version).

It may be necessary to widen the dbxref.accession column here, for
example to 80 chars? Let me know if you need help with the DDL
command to do this.
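
To make the arithmetic concrete, here is a minimal sketch that mimics the truncation in plain Perl. No database is involved; stored_value and $WIDTH are illustrative names, and the 40-char width is simply the one quoted above.

```perl
use strict;
use warnings;

my $WIDTH = 40;    # assumed width of dbxref.accession, as quoted above

# Mimic a silent truncation on insert: keep only the first $WIDTH chars.
sub stored_value {
    my ($accession) = @_;
    return substr($accession, 0, $WIDTH);
}

# Single quotes keep the backslashes literal, as in the error message.
my $accession = '2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN';
printf "%d chars before, %d chars after: %s\n",
    length($accession), length(stored_value($accession)),
    stored_value($accession);
```

The truncated value ends in "-RX", which matches the accession part of the key reported in the "Duplicate entry" message.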

Let me know how far this gets you.

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From lpritc at scri.ac.uk  Tue Apr 17 09:35:44 2007
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 17 Apr 2007 14:35:44 +0100
Subject: [BioSQL-l] Problem loading GO.
In-Reply-To: <B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
Message-ID: <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>

Hi Hilmar, 

Thanks for the very quick response.  Apologies for the long reply, but I
thought it might be useful if anyone else happens across the same
problems that I did.

On Tue, 2007-04-17 at 00:00 -0400, Hilmar Lapp wrote:
> Apparently the parser fails to parse out database and accession for
> this db_xref of term GO:0018901.
> 
> If you can edit the obo file, you can try deleting the db_xref(s) for  
> that term that look odd (or delete all if you don't need them).

You're spot on - see further down for details...

> Note that the argument for --fmtargs here should read
> "-defs_file,/path/to/Downloads/GO.defs". (Note that within the quotes  
> there is no tilde expansion.)

D'oh!  Thanks for the note - my bad, there.

> This is one of the things for which you've got to love MySQL (and am I
> correct in inferring that you're using MySQL?).

The 'choice' was forced upon me ;)

> It may be necessary to widen the dbxref.accession column here, for
> example to 80 chars? Let me know if you need help with the DDL
> command to do this.

I've fixed that now (and added it to my local biosqldb-mysql.sql
schema), but with a clean BioSQL schema and using:

[lpritc at lplinuxdev sql]$ bp_load_ontology.pl --host localhost \
    --dbname biosql --namespace "Gene Ontology" --dbuser lpritc \
    --dbpass ******** --format goflat \
    --fmtargs "-defs_file,/home/lpritc/Downloads/GO.defs" \
    /home/lpritc/Downloads/function.ontology

I was still getting errors with the GO flatfile:

Loading ontology Gene Ontology:
        ... terms

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
were ("","","0","") FKs ()
Column 'dbname' cannot be null
---------------------------------------------------
Could not store term GO:0047554, name '2-pyrone-4,6-dicarboxylate
lactonase activity':

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be
found by unique key
STACK: Error::throw
STACK:
Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK:
Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK:
Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271
STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805
STACK: /usr/bin/bp_load_ontology.pl:610
-----------------------------------------------------------

 at /usr/bin/bp_load_ontology.pl line 817
        main::persist_term('-term',
'Bio::Ontology::GOterm=HASH(0x88497a4)', '-db',
'Bio::DB::BioSQL::DBAdaptor=HASH(0x897f074)', '-termfactory',
'Bio::Ontology::TermFactory=HASH(0x8d64ad8)', '-throw',
'CODE(0x851abc8)', '-mergeobs', ...) called
at /usr/bin/bp_load_ontology.pl line 610

I tracked this down to what looks like a formatting problem in the GO.defs
file (note that the first and third definition_reference lines appear to
be two halves of the same entry):

term: 2-pyrone-4,6-dicarboxylate lactonase activity
goid: GO:0047554
definition: Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O
= 4-carboxy-2-hydroxyhexa-2,4-dienedioate.
definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN
definition_reference: EC:3.1.1.57
definition_reference: MetaCyc:2-PYRONE-4

I found 43 similar errors for other GOIDs, and they appear to result from
the occurrence of the string "\," in a dbxref - mostly MetaCyc entries,
but also some UM-BBD_pathwayID entries.
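
A minimal sketch of how an escaped comma can split one dbxref in two. The full MetaCyc accession below is an assumption, pieced together from the two halves shown above; split_naive and split_escaped are illustrative helpers and not the code that generated the GO files or the BioPerl parser.

```perl
use strict;
use warnings;

# Split a dbxref list on every comma, ignoring escapes.
sub split_naive {
    my ($list) = @_;
    return split /,\s*/, $list;
}

# Split only on commas that are not preceded by a backslash, then
# turn each "\," back into a plain comma.
sub split_escaped {
    my ($list) = @_;
    return map { my $xref = $_; $xref =~ s/\\,/,/g; $xref }
           split /(?<!\\),\s*/, $list;
}

# Assumed original entry, reconstructed from the two halves above.
my $xrefs = 'EC:3.1.1.57, MetaCyc:2-PYRONE-4\,6-DICARBOXYLATE-LACTONASE-RXN';
print join(' | ', split_naive($xrefs)),   "\n";
print join(' | ', split_escaped($xrefs)), "\n";
```

The naive split yields three pieces, one of which has no database prefix, which is similar to the breakage seen in GO.defs; the escape-aware split keeps the MetaCyc reference whole.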

These errors appear to have followed through into the generation of the
OBO format files in each case, e.g.:

def: "Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O =
4-carboxy-2-hydroxyhexa-2,4-dienedioate." [:6-DICARBOXYLATE-LACTONASE-RXN, EC:3.1.1.57, MetaCyc:2-PYRONE-4]

and so is something for the GO guys to fix, I guess.


Another error is thrown after fixing the above, though (with the same
command as before):

Loading ontology Gene Ontology:
        ... terms

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values were
("GO:0006905","vesicle transport","OBSOLETE (was not defined before
being made obsolete).","X","") FKs (1)
Duplicate entry 'vesicle transport-1-X' for key 3
---------------------------------------------------
Could not store term GO:0006905, name 'vesicle transport':

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: create: object (Bio::Ontology::GOterm) failed to insert or to be
found by unique key
STACK: Error::throw
STACK:
Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK:
Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271
STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805
STACK: /usr/bin/bp_load_ontology.pl:610
-----------------------------------------------------------

 at /usr/bin/bp_load_ontology.pl line 817
        main::persist_term('-term',
'Bio::Ontology::GOterm=HASH(0xbcac418)', '-db',
'Bio::DB::BioSQL::DBAdaptor=HASH(0x957805c)', '-termfactory',
'Bio::Ontology::TermFactory=HASH(0x995db20)', '-throw',
'CODE(0x9113bd0)', '-mergeobs', ...) called
at /usr/bin/bp_load_ontology.pl line 610

There are duplicate terms, identical in the term table except for GOID:
GO:0006905 and GO:0005480.  They are both "vesicle transport", and
obsoleted:

term: vesicle transport
goid: GO:0005480
definition: OBSOLETE (was not defined before being made obsolete).
definition_reference: GOC:go_curators
comment: This term was made obsolete because it represents a biological
process and not a molecular function. To update annotations, use the
biological process term 'vesicle-mediated transport ; GO:0016192'.

term: vesicle transport
goid: GO:0006905
definition: OBSOLETE (was not defined before being made obsolete).
definition_reference: GOC:go_curators
comment: This term was made obsolete because the meaning of the term is
ambiguous. To update annotations, consider the biological process term
'vesicle-mediated transport ; GO:0016192'.
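
A minimal sketch for spotting such clashes before loading, assuming the unique key is the triple of name, ontology id and obsolete flag, as the 'vesicle transport-1-X' in the error message suggests. find_key_clashes and the hash layout are made up for illustration and are not part of bioperl-db.

```perl
use strict;
use warnings;

# Group GO ids by the assumed unique key (name, ontology id, obsolete
# flag) and return only the keys shared by more than one term.
sub find_key_clashes {
    my (@terms) = @_;
    my %goids_by_key;
    for my $term (@terms) {
        my $key = join '-', $term->{name}, $term->{ontology_id},
                            $term->{is_obsolete};
        push @{ $goids_by_key{$key} }, $term->{goid};
    }
    my %clashes;
    for my $key (keys %goids_by_key) {
        $clashes{$key} = $goids_by_key{$key}
            if @{ $goids_by_key{$key} } > 1;
    }
    return \%clashes;
}

my @terms = (
    { goid => 'GO:0005480', name => 'vesicle transport', ontology_id => 1, is_obsolete => 'X' },
    { goid => 'GO:0006905', name => 'vesicle transport', ontology_id => 1, is_obsolete => 'X' },
);
my $clashes = find_key_clashes(@terms);
for my $key (sort keys %$clashes) {
    print "$key: @{ $clashes->{$key} }\n";
}
```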

I used the --noobsolete flag to avoid this error - reasoning that since
I'm populating the database for the first time, ignoring the obsolete
terms won't hurt - but finally this error was thrown:

Loading ontology Gene Ontology:
        ... terms

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
were ("PMID","","0","") FKs ()
Column 'accession' cannot be null
---------------------------------------------------
Could not store term GO:0032933, name 'SREBP-mediated signaling
pathway':

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be
found by unique key
STACK: Error::throw
STACK:
Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK:
Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK:
Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271
STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805
STACK: /usr/bin/bp_load_ontology.pl:610
-----------------------------------------------------------

 at /usr/bin/bp_load_ontology.pl line 817
        main::persist_term('-term',
'Bio::Ontology::GOterm=HASH(0xbe18f14)', '-db',
'Bio::DB::BioSQL::DBAdaptor=HASH(0x99bbf2c)', '-termfactory',
'Bio::Ontology::TermFactory=HASH(0x9da0ad8)', '-throw',
'CODE(0x9556bb4)', '-mergeobs', ...) called
at /usr/bin/bp_load_ontology.pl line 610

with the offending entry being 

term: SREBP-mediated signaling pathway
goid: GO:0032933
definition: A series of molecular signals from the endoplasmic reticulum
to the nucleus generated as a consequence of altered levels of one or
more lipids, and resulting in the activation of transcription by SREBP.
definition_reference: GOC:mah
definition_reference: PMID:0

I commented out the definition_reference for PMID:0, which seemed to fix
matters.
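
One possible explanation, which is a guess and has not been checked against the BioPerl parser, is that the accession "0" gets dropped by a plain truth test, since Perl treats the string "0" as false. The sketch below only demonstrates that Perl behaviour; parse_xref_truthy and parse_xref_defined are hypothetical helpers, not BioPerl code.

```perl
use strict;
use warnings;

# Split "DB:accession"; a plain truth test on the accession loses "0".
sub parse_xref_truthy {
    my ($xref) = @_;
    my ($db, $acc) = split /:/, $xref, 2;
    return ($db, $acc ? $acc : '');
}

# Same split, but testing definedness and length instead of truth.
sub parse_xref_defined {
    my ($xref) = @_;
    my ($db, $acc) = split /:/, $xref, 2;
    return ($db, (defined $acc && length $acc) ? $acc : '');
}

printf "truthy:  (%s, '%s')\n", parse_xref_truthy('PMID:0');
printf "defined: (%s, '%s')\n", parse_xref_defined('PMID:0');
```

The first variant returns an empty accession for PMID:0, which would be consistent with the values ("PMID","","0","") in the warning, but that match alone does not confirm the cause.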

The process.ontology and component.ontology files then went into the
database without a hitch.  Thanks again for your help,

L.

-- 
Dr Leighton Pritchard B.Sc.(Hons) MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA
e:lpritc at scri.ac.uk            w:http://bioinf.scri.ac.uk/lp
gpg/pgp: 0xFEFC205C


From hlapp at gmx.net  Tue Apr 17 11:09:45 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 17 Apr 2007 11:09:45 -0400
Subject: [BioSQL-l] Problem loading GO.
In-Reply-To: <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
	<1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
Message-ID: <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>


On Apr 17, 2007, at 9:35 AM, Leighton Pritchard wrote:

> Hi Hilmar,
>
> Thanks for the very quick response.  Apologies for the long reply, but I
> thought it might be useful if anyone else happens across the same
> problems that I did.

Thanks for reporting all these.

> [...]
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
> were ("","","0","") FKs ()
> Column 'dbname' cannot be null
> ---------------------------------------------------
> Could not store term GO:0047554, name '2-pyrone-4,6-dicarboxylate
> lactonase activity':
> [...]
> I tracked this down to an apparently poor formatting of the GO.defs file
> (note that the first and third definition_lines appear to be two halves
> of the same entry):
>
> term: 2-pyrone-4,6-dicarboxylate lactonase activity
> goid: GO:0047554
> definition: Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O
> = 4-carboxy-2-hydroxyhexa-2,4-dienedioate.
> definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN

I wonder whether this is the line that throws the parser off. It  
looks like the database part of the reference is missing - bad.

> definition_reference: EC:3.1.1.57
> definition_reference: MetaCyc:2-PYRONE-4
>
> I found 43 similar errors for other GOIDs, and it appears to result from
> the occurrence of the string "\," in a dbxref - mostly MetaCyc entries,
> but also some UM-BBD_pathwayID entries.

I'm not sure - although the string "\," might indeed trip up the  
parser, would have to investigate to confirm. Could it be a  
coincidence with definition_references that lack the database part  
before the colon?

>
> These errors appear to have followed through into the generation of the
> OBO format files in each case, e.g.:
>
> def: "Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O =
> 4-carboxy-2-hydroxyhexa-2,4-dienedioate." [:6-DICARBOXYLATE-LACTONASE-RXN,
> EC:3.1.1.57, MetaCyc:2-PYRONE-4]

Again, the first db_xref lacks the database in front of the colon. I  
can also see why "\," will trip up the parser in this format.

>
> and so is something for the GO guys to fix, I guess.

The lack of a database for certain xrefs surely is. If the escaped  
comma does throw off the BioPerl parser then that part is for BioPerl  
to fix. It does seem to extract the parts correctly, if the error  
message is any indication, though you may argue that it should remove  
the escaping backslashes (and I'd certainly agree with that).
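
For anyone following along, here is a rough sketch of the splitting and unescaping discussed above. It is written in Python rather than Perl, it is not the BioPerl parser, and the function names are made up for illustration. The only rule it assumes is the one mentioned in this thread: a comma preceded by a backslash belongs to the dbxref and is not a separator.

```python
import re


def split_dbxrefs(bracket_body):
    """Split the text between [ and ] of an OBO def: line into dbxrefs.

    A comma preceded by a backslash is kept as part of the dbxref;
    the escaping backslash is removed once the split is done.
    """
    parts = re.split(r"(?<!\\),\s*", bracket_body.strip())
    return [part.replace("\\,", ",") for part in parts if part]


def split_db_and_accession(dbxref):
    """Split 'DB:accession' at the first colon only."""
    db, _, accession = dbxref.partition(":")
    return db, accession


xrefs = split_dbxrefs(
    r"MetaCyc:2-PYRONE-4\,6-DICARBOXYLATE-LACTONASE-RXN, EC:3.1.1.57"
)
for xref in xrefs:
    print(split_db_and_accession(xref))
```

Fed the malformed reference from the file instead (":6-DICARBOXYLATE-LACTONASE-RXN"), split_db_and_accession returns an empty database name, which matches the empty first value in the failed DBLinkAdaptor insert.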

>
>
> Another error is thrown after fixing the above, though (with the same
> command as before):
>
> Loading ontology Gene Ontology:
>         ... terms
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values were
> ("GO:0006905","vesicle transport","OBSOLETE (was not defined before
> being made obsolete).","X","") FKs (1)
> Duplicate entry 'vesicle transport-1-X' for key 3
> ---------------------------------------------------
> Could not store term GO:0006905, name 'vesicle transport':
> [...]
> There are duplicate terms, identical in the term table except for GOID:
> GO:0006905 and GO:0005480.  They are both "vesicle transport", and
> obsoleted:

That violates the uniqueness constraint, and this sounds more like a  
bug in the GO file. I'm also not sure what motivated them to create  
the same term multiple times only to obsolete it immediately.

> [...]
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
> were ("PMID","","0","") FKs ()
> Column 'accession' cannot be null
> ---------------------------------------------------
> Could not store term GO:0032933, name 'SREBP-mediated signaling
> pathway':
> [...]
> with the offending entry being
>
> term: SREBP-mediated signaling pathway
> goid: GO:0032933
> definition: A series of molecular signals from the endoplasmic reticulum
> to the nucleus generated as a consequence of altered levels of one or
> more lipids, and resulting in the activation of transcription by SREBP.
> definition_reference: GOC:mah
> definition_reference: PMID:0
>
> I commented out the definition_reference for PMID:0, which seemed to fix
> matters.

Right, it seems to be a bogus reference.

>
> The process.ontology and component.ontology files then went into the
> database without a hitch.  Thanks again for your help,

Fantastic you got it all loaded!

Note that you also have the --computetc switch which will compute the  
transitive closure for you automatically.
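
In case the term is unfamiliar: the transitive closure of an ontology pairs every term with all of its ancestors, not just its direct parents. The sketch below shows the idea in Python on a toy graph. It says nothing about how the --computetc switch is implemented, the term names are placeholders, and it assumes the graph has no cycles.

```python
def transitive_closure(parents):
    """Return {term: set of all ancestors} given {term: set of direct parents}.

    Assumes the relationships form a directed acyclic graph.
    """
    closure = {}

    def ancestors(term):
        # Memoised depth-first walk towards the root(s).
        if term not in closure:
            found = set()
            for parent in parents.get(term, ()):
                found.add(parent)
                found |= ancestors(parent)
            closure[term] = found
        return closure[term]

    for term in parents:
        ancestors(term)
    return closure


# Toy graph: "leaf" is a child of "middle", which is a child of "root".
toy = {"leaf": {"middle"}, "middle": {"root"}}
print(transitive_closure(toy)["leaf"])
```

The direct relationships give "leaf" one parent; the closure gives it two ancestors, which is what makes "find everything under this term" queries cheap.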

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From lpritc at scri.ac.uk  Tue Apr 17 12:05:16 2007
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 17 Apr 2007 17:05:16 +0100
Subject: [BioSQL-l] Problem loading GO.
In-Reply-To: <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
	<1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
	<5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
Message-ID: <1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk>

Hello again,

On Tue, 2007-04-17 at 11:09 -0400, Hilmar Lapp wrote:
> Thanks for reporting all these.

No problem at all.

> On Apr 17, 2007, at 9:35 AM, Leighton Pritchard wrote:
> > term: 2-pyrone-4,6-dicarboxylate lactonase activity
[...]
> > definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN
> 
> I wonder whether this is the line that throws the parser off. It  
> looks like the database part of the reference is missing - bad.

> > definition_reference: MetaCyc:2-PYRONE-4

I don't think the parser is to blame, here.  Note that if you join the
definition_reference strings from the GO.defs file, you get:

MetaCyc:2-PYRONE-4:6-DICARBOXYLATE-LACTONASE-RXN

Then if you replace the colon by "\," you get what should (I think)
actually be the MetaCyc entry:

MetaCyc:2-PYRONE-4\,6-DICARBOXYLATE-LACTONASE-RXN
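
The join described above can be written down as a small Python sketch. The function name is hypothetical, and it assumes you already know which piece kept its database prefix and in which order the orphaned pieces belong; the file itself does not record that.

```python
def rejoin_split_xref(head, fragments):
    """Rebuild a dbxref that was cut at each escaped comma.

    head      -- the piece that kept its database prefix,
                 e.g. 'MetaCyc:2-PYRONE-4'
    fragments -- the orphaned pieces, each starting with ':',
                 in the order they should be appended
    """
    pieces = [head]
    for fragment in fragments:
        if not fragment.startswith(":"):
            raise ValueError(f"not an orphaned fragment: {fragment!r}")
        pieces.append(fragment[1:])  # drop the stray colon
    # Put an escaped comma (backslash + comma) back at each cut.
    return "\\,".join(pieces)


print(rejoin_split_xref("MetaCyc:2-PYRONE-4", [":6-DICARBOXYLATE-LACTONASE-RXN"]))
```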

> > I found 43 similar errors for other GOIDs, and it appears to result from
> > the occurrence of the string "\," in a dbxref - mostly MetaCyc entries,
> > but also some UM-BBD_pathwayID entries.
> 
> I'm not sure - although the string "\," might indeed trip up the  
> parser, would have to investigate to confirm. Could it be a  
> coincidence with definition_references that lack the database part  
> before the colon?

Inspecting the troublesome entries by eye seems to turn up the same
problem as above consistently: a GO term in the GO.defs file is
malformed.  The term should have a definition_reference field describing
a MetaCyc entry that matches the term field.  In the term string, there
would be an escaped comma, but the string ends where we expect this.
The string that would follow the escaped comma is present as the first
definition_reference.

This observation also extends to cases where there should be two
occurrences of "\," in the MetaCyc field, e.g.:

term: 2,3-dihydroxyindole 2,3-dioxygenase activity
goid: GO:0047528
definition: Catalysis of the reaction: 2,3-dihydroxyindole + O2 =
anthranilate + CO2.
definition_reference: :3-DIHYDROXYINDOLE-2
definition_reference: :3-DIOXYGENASE-RXN
definition_reference: EC:1.13.11.2
definition_reference: MetaCyc:2

It then appears as though the GO flatfiles were used automatically to
generate the OBO format files, and propagated the same error into the
square brackets in each case.
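
Rather than inspecting entries by eye, the affected terms could be listed with something like the Python sketch below. It assumes only the "key: value" layout visible in the excerpts quoted in this thread, and the function name is made up; it is not part of BioPerl or of any GO tooling.

```python
def find_orphaned_references(defs_text):
    """Map each goid to its definition_reference values lacking a database part."""
    problems = {}
    goid = None
    for line in defs_text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # blank lines and wrapped continuation lines
        if key == "goid":
            goid = value.strip()
        elif key == "definition_reference" and value.startswith(":"):
            problems.setdefault(goid, []).append(value.strip())
    return problems


SAMPLE = """\
term: 2,3-dihydroxyindole 2,3-dioxygenase activity
goid: GO:0047528
definition: Catalysis of the reaction: 2,3-dihydroxyindole + O2 =
anthranilate + CO2.
definition_reference: :3-DIHYDROXYINDOLE-2
definition_reference: :3-DIOXYGENASE-RXN
definition_reference: EC:1.13.11.2
definition_reference: MetaCyc:2
"""

for goid, refs in find_orphaned_references(SAMPLE).items():
    print(goid, refs)
```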

> > and so is something for the GO guys to fix, I guess.
> 
> The lack of a database for certain xrefs surely is. If the escaped  
> comma does throw off the BioPerl parser then that part is for BioPerl  
> to fix. 

I think the problems are now all in the data I downloaded from
http://www.geneontology.org/GO.downloads.shtml - I believe the BioPerl
parser to be innocent of these charges ;)  I've submitted the issue at
the GO site, and with any luck they'll handle it quite soon (if it is in
fact their problem).

> Note that you also have the --computetc switch which will compute the  
> transitive closure for you automatically.

:D Excellent!  Thanks for the pointer, and again for your efforts,

L.

-- 
Dr Leighton Pritchard B.Sc.(Hons) MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA
e:lpritc at scri.ac.uk            w:http://bioinf.scri.ac.uk/lp
gpg/pgp: 0xFEFC205C


From cjfields at uiuc.edu  Tue Apr 17 12:18:19 2007
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 17 Apr 2007 11:18:19 -0500
Subject: [BioSQL-l] Problem loading GO.
In-Reply-To: <1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
	<1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
	<5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
	<1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk>
Message-ID: <146086E2-330B-4460-90AC-2632E82ED145@uiuc.edu>

On Apr 17, 2007, at 11:05 AM, Leighton Pritchard wrote:
...
>
>>> and so is something for the GO guys to fix, I guess.
>>
>> The lack of a database for certain xrefs surely is. If the escaped
>> comma does throw off the BioPerl parser then that part is for BioPerl
>> to fix.
>
> I think the problems are now all in the data I downloaded from
> http://www.geneontology.org/GO.downloads.shtml - I believe the BioPerl
> parser to be innocent of these charges ;)  I've submitted the issue at
> the GO site, and with any luck they'll handle it quite soon (if it is in
> fact their problem).
>
>> Note that you also have the --computetc switch which will compute the
>> transitive closure for you automatically.
>
> :D Excellent!  Thanks for the pointer, and again for your efforts,
>
> L.
...

If you do find anything that is BioSQL- or Bioperl-related then file  
a bug report so we can track it.  I agree with Hilmar that it's  
likely the parser is partly to blame.

http://bugzilla.open-bio.org/

We really appreciate the work you're putting into this!

chris

From lpritc at scri.ac.uk  Tue Apr 17 12:55:38 2007
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 17 Apr 2007 17:55:38 +0100
Subject: [BioSQL-l] Problem loading GO.
In-Reply-To: <146086E2-330B-4460-90AC-2632E82ED145@uiuc.edu>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
	<1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
	<5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
	<1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk>
	<146086E2-330B-4460-90AC-2632E82ED145@uiuc.edu>
Message-ID: <1176828938.988.133.camel@lplinuxdev.scri.sari.ac.uk>

Hi Chris,

On Tue, 2007-04-17 at 11:18 -0500, Chris Fields wrote:
> If you do find anything that is BioSQL- or Bioperl-related then file  
> a bug report so we can track it.  I agree with Hilmar that it's  
> likely the parser is partly to blame.
> 
> http://bugzilla.open-bio.org/

I've submitted a bug report, mostly replicating my first post in this
thread.  I added links to the appropriate point in the list archives so
that the rest of the discussion can be considered, too.

> We really appreciate the work you're putting into this!

Thanks - I'm just grateful that the Bio* repertoire is there at all so
that my problems are relatively minor (as opposed to attempting to
replicate the functionality independently).

L.

-- 
Dr Leighton Pritchard B.Sc.(Hons) MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA
e:lpritc at scri.ac.uk            w:http://bioinf.scri.ac.uk/lp
gpg/pgp: 0xFEFC205C


From lpritc at scri.ac.uk  Tue Apr 17 13:03:53 2007
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 17 Apr 2007 18:03:53 +0100
Subject: [BioSQL-l] [Bioperl-l]  Problem loading GO.
In-Reply-To: <E61FBBF0-0E65-43A4-BBB8-CD145447A042@fruitfly.org>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
	<1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
	<5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
	<E61FBBF0-0E65-43A4-BBB8-CD145447A042@fruitfly.org>
Message-ID: <1176829433.988.143.camel@lplinuxdev.scri.sari.ac.uk>

On Tue, 2007-04-17 at 09:54 -0700, Chris Mungall wrote:
> Is there any reason you're loading GO.defs? This is a legacy format;
> all the information is subsumed in the OBO file.

My only reason was that the parser originally failed to load the OBO 
format data - probably for the same reason that the flatfile failed - 
and I tried the flatfile to check if there were parser issues with the 
format.  I just carried on with the flatfile after that because the 
terms with formatting errors were (subjectively, for me) easier to 
spot and fix by hand.  I'm happy to use a fixed OBO file.

> I didn't see your message to the GO folks re formatting errors - who  
> did you send it to & what was the subject? I'll see it gets seen to.

I submitted it via the website interface - I'm afraid I have no idea
where it would have gone after that.

-- 
Dr Leighton Pritchard B.Sc.(Hons) MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA
e:lpritc at scri.ac.uk            w:http://bioinf.scri.ac.uk/lp
gpg/pgp: 0xFEFC205C


From cjm at fruitfly.org  Tue Apr 17 12:54:51 2007
From: cjm at fruitfly.org (Chris Mungall)
Date: Tue, 17 Apr 2007 09:54:51 -0700
Subject: [BioSQL-l] [Bioperl-l]  Problem loading GO.
In-Reply-To: <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
	<1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
	<5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
Message-ID: <E61FBBF0-0E65-43A4-BBB8-CD145447A042@fruitfly.org>


Is there any reason you're loading GO.defs? This is a legacy format;
all the information is subsumed in the OBO file.

I didn't see your message to the GO folks re formatting errors - who  
did you send it to & what was the subject? I'll see it gets seen to.

>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values
>> were
>> ("GO:0006905","vesicle transport","OBSOLETE (was not defined before
>> being made obsolete).","X","") FKs (1)
>> Duplicate entry 'vesicle transport-1-X' for key 3
>> ---------------------------------------------------
>> Could not store term GO:0006905, name 'vesicle transport':
>> [...]
>> There are duplicate terms, identical in the term table except for
>> GOID:
>> GO:0006905 and GO:0005480.  They are both "vesicle transport", and
>> obsoleted:
>
> That violates the uniqueness constraint, and this sounds more like a
> bug in the GO file. I'm also not sure what motivated them to create
> the same term multiple times only to obsolete it immediately.

These things happen - the schema should be able to deal with it. It's
a pain, I know. In Chado we have a hacky solution for this (I
believe it concatenates the ID onto the name of obsolete terms).

I think that it's actually wrong to include obsoletes and actual terms
in the same table. It's obviously astoundingly useful to be able to do
this, though, and it requires the hack to get out of the uniqueness
violation.

The EBI loads all of OBO into BioSQL regularly - I wonder how they  
handle this?
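
To make the hack concrete, here is a small Python/sqlite3 sketch. The uniqueness key (name, ontology_id, is_obsolete) is inferred from the 'vesicle transport-1-X' duplicate-entry message quoted in this thread; the table layout, function names, and the exact renaming format are assumptions for illustration, not the Chado or BioSQL code.

```python
import sqlite3


def make_term_table():
    """In-memory table with the uniqueness key suggested by the error message."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE term (
            identifier  TEXT,
            name        TEXT NOT NULL,
            ontology_id INTEGER NOT NULL,
            is_obsolete TEXT,
            UNIQUE (name, ontology_id, is_obsolete)
        )
        """
    )
    return conn


def store_term(conn, identifier, name, ontology_id, is_obsolete=None,
               rename_obsolete=False):
    """Insert one term; optionally append the ID to the name of obsolete terms."""
    if rename_obsolete and is_obsolete:
        name = f"{name} ({identifier})"  # hypothetical renaming format
    conn.execute(
        "INSERT INTO term (identifier, name, ontology_id, is_obsolete) "
        "VALUES (?, ?, ?, ?)",
        (identifier, name, ontology_id, is_obsolete),
    )


def load_duplicates(rename_obsolete):
    """Try to store both obsolete 'vesicle transport' terms; return rows stored."""
    conn = make_term_table()
    stored = 0
    for goid in ("GO:0006905", "GO:0005480"):
        try:
            store_term(conn, goid, "vesicle transport", 1, "X", rename_obsolete)
            stored += 1
        except sqlite3.IntegrityError:
            pass  # same (name, ontology_id, is_obsolete) as an earlier row
    return stored


print(load_duplicates(rename_obsolete=False))
print(load_duplicates(rename_obsolete=True))
```

Without the renaming, the second term is rejected; with it, both rows go in, at the cost of a stored name that no longer matches the source file.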



From rcote at ebi.ac.uk  Wed Apr 18 03:08:50 2007
From: rcote at ebi.ac.uk (Richard Cote)
Date: Wed, 18 Apr 2007 08:08:50 +0100
Subject: [BioSQL-l] [Bioperl-l]  Problem loading GO.
In-Reply-To: <E61FBBF0-0E65-43A4-BBB8-CD145447A042@fruitfly.org>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
	<1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
	<5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
	<E61FBBF0-0E65-43A4-BBB8-CD145447A042@fruitfly.org>
Message-ID: <4625C402.5040809@ebi.ac.uk>

Chris Mungall wrote:
>>> Could not store term GO:0006905, name 'vesicle transport':
>>> [...]
>>> There are duplicate terms, identical in the term table except for
>>> GOID:
>>> GO:0006905 and GO:0005480.  They are both "vesicle transport", and
>>> obsoleted:
>>
> I think that it's actually wrong to include obsoletes and actual terms in 
> the same table - however, it's obviously astoundingly useful to be able 
> to do this, but it requires the hack to get out of the uniqueness violation.
> 
> The EBI loads all of OBO into BioSQL regularly - I wonder how they 
> handle this?

I simply avoid the issue. There's no uniqueness constraint in term name. 
The only constraint is term ID, and even that is only unique in the 
context of an ontology namespace (i.e. it would be perfectly allowable 
to have FOO:1234 and BAR:1234). The only unique (and primary) key is 
generated by the ORM layer so I don't even have to deal with that.
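
As a rough illustration of that arrangement (not our actual schema; the table and column names here are invented), a Python/sqlite3 sketch with a generated surrogate key and uniqueness only on the ID within a namespace:

```python
import sqlite3


def make_term_table():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE term (
            term_pk     INTEGER PRIMARY KEY,   -- surrogate key, generated on insert
            namespace   TEXT NOT NULL,
            identifier  TEXT NOT NULL,
            name        TEXT NOT NULL,
            is_obsolete INTEGER NOT NULL DEFAULT 0,
            UNIQUE (namespace, identifier)     -- no constraint on name
        )
        """
    )
    return conn


def add_term(conn, namespace, identifier, name, is_obsolete=0):
    """Insert a term and return the generated surrogate key."""
    cur = conn.execute(
        "INSERT INTO term (namespace, identifier, name, is_obsolete) "
        "VALUES (?, ?, ?, ?)",
        (namespace, identifier, name, is_obsolete),
    )
    return cur.lastrowid


conn = make_term_table()
# Same name twice, both obsolete: accepted, because only the ID must differ.
add_term(conn, "GO", "GO:0006905", "vesicle transport", 1)
add_term(conn, "GO", "GO:0005480", "vesicle transport", 1)
# Same ID in two namespaces: also accepted.
add_term(conn, "FOO", "1234", "example term")
add_term(conn, "BAR", "1234", "example term")
print(conn.execute("SELECT COUNT(*) FROM term").fetchone()[0])
```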

We also have all the terms, obsoleted or not, in the same table because 
people are always querying on stuff that's been made obsolete but is 
still annotated with the old IDs.

Cheers,
Rc

-- 
Richard Cote
Software Engineer - PRIDE Project Team (Sequence Database Group)
European Bioinformatics Institute
Wellcome Trust Genome Campus                 rcote at ebi.ac.uk
Hinxton, Cambridge CB10 1SD                  Phone: (+44) 1223 492610
United Kingdom                               Fax  : (+44) 1223 494468

From dillo at pcbi.upenn.edu  Mon Apr  9 16:05:03 2007
From: dillo at pcbi.upenn.edu (Bryan Cardillo)
Date: Mon, 9 Apr 2007 12:05:03 -0400
Subject: [BioSQL-l] genbank, references, and crc's
Message-ID: <20070409160502.GD5285@rover.pcbi.upenn.edu>

        This is probably more of a bioperl issue, but since it was
        previously discussed here, this is where I'll continue the
        discussion.  I've just run into the same issues mentioned in
        these threads while loading some refseq sequences.

        http://lists.open-bio.org/pipermail/biosql-l/2006-July/001024.html
        http://lists.open-bio.org/pipermail/biosql-l/2006-August/001048.html


        I believe the bioperl-db patch below solves these issues.
        The crux of the problem is that the _crc64 code uses the
        authors, title, and location to determine a unique key.
        However the get_unique_key_query method only checks authors
        before deferring to a crc lookup.  The fix causes the crc key
        to be used if any of authors, title, or location is
        specified.
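
        To spell the mismatch out, here is a simplified sketch in Python
        (not the adaptor's Perl).  Only _crc64 and get_unique_key_query
        are real names from bioperl-db; the class, the functions, the
        sample reference, and the use of zlib.crc32 in place of a real
        CRC-64 are stand-ins for illustration.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Reference:
    """Hypothetical stand-in for a BioPerl reference object."""
    authors: str = ""
    title: str = ""
    location: str = ""


def doc_id(ref):
    """Stand-in for _crc64: a checksum over authors, title and location.

    zlib.crc32 replaces the real CRC-64 purely to keep the sketch short.
    """
    text = "|".join([ref.authors, ref.title, ref.location])
    return "CRC-%08X" % zlib.crc32(text.encode("utf-8"))


def crc_queries_before_patch(ref):
    """Only a reference with authors gets a doc_id lookup."""
    return [{"doc_id": doc_id(ref)}] if ref.authors else []


def crc_queries_after_patch(ref):
    """Any of the three fields that feed the checksum enables the lookup."""
    if ref.authors or ref.title or ref.location:
        return [{"doc_id": doc_id(ref)}]
    return []


# A made-up reference without authors still has a well-defined checksum,
# but only the patched guard turns it into a unique-key query.
ref = Reference(title="Direct Submission", location="Submitted (01-JAN-2007)")
print(crc_queries_before_patch(ref))
print(crc_queries_after_patch(ref))
```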

        Cheers,
        Bryan Cardillo
        Penn Bioinformatics Core
        University of Pennsylvania

 ReferenceAdaptor.pm |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

Index: ./Bio/DB/BioSQL/ReferenceAdaptor.pm
===================================================================
RCS file: /home/repository/bioperl/bioperl-db/Bio/DB/BioSQL/ReferenceAdaptor.pm,v
retrieving revision 1.24
diff -u -r1.24 ReferenceAdaptor.pm
--- ./Bio/DB/BioSQL/ReferenceAdaptor.pm	4 Jul 2006 22:23:12 -0000	1.24
+++ ./Bio/DB/BioSQL/ReferenceAdaptor.pm	9 Apr 2007 15:38:35 -0000
@@ -426,7 +426,7 @@
 	    });
 	}
     }
-    if($obj->authors()) {
+    if($obj->authors() || $obj->title() || $obj->location()) {
 	push(@ukqueries, {
 	    'doc_id' => $self->_crc64($obj),
 	});


From hlapp at gmx.net  Tue Apr 10 16:09:43 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 10 Apr 2007 12:09:43 -0400
Subject: [BioSQL-l] genbank, references, and crc's
In-Reply-To: <20070409160502.GD5285@rover.pcbi.upenn.edu>
References: <20070409160502.GD5285@rover.pcbi.upenn.edu>
Message-ID: <3A873665-CA1D-489B-A6A8-6EDBB54C7858@gmx.net>

Hi Bryan,

thanks for tracking this down - great, I've committed it.

The 'correct' condition, as defined by the schema, would actually be
to test for author or title being specified, because location must be
non-empty, according to the schema.

I.e., at least theoretically, the condition will now always be true,  
unless you removed the NOT NULL constraint locally on  
reference.location.

Would you mind testing whether removing the location() part from the  
if clause will still solve the issue?

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From dillo at pcbi.upenn.edu  Wed Apr 11 15:33:39 2007
From: dillo at pcbi.upenn.edu (Bryan Cardillo)
Date: Wed, 11 Apr 2007 11:33:39 -0400
Subject: [BioSQL-l] genbank, references, and crc's
In-Reply-To: <3A873665-CA1D-489B-A6A8-6EDBB54C7858@gmx.net>
References: <20070409160502.GD5285@rover.pcbi.upenn.edu>
	<3A873665-CA1D-489B-A6A8-6EDBB54C7858@gmx.net>
Message-ID: <20070411153337.GA5275@rover.pcbi.upenn.edu>

On Tue, Apr 10, 2007 at 12:09:43PM -0400, Hilmar Lapp wrote:
> thanks for tracking this down - great, I've committed it.
> 
> The 'correct' condition, as defined by the schema, would actually be
> to test for author or title being specified, because location must be
> non-empty, according to the schema.
> 
> I.e., at least theoretically, the condition will now always be true,  
> unless you removed the NOT NULL constraint locally on  
> reference.location.
> 
> Would you mind testing whether removing the location() part from the  
> if clause will still solve the issue?

        you are correct, the test for location doesn't seem to be
        necessary.

        from a theoretical point of view, I'm not sure I agree
        with removing the location test though.  it seems to me that
        if you have a field (ie, location) which is used in
        generating a unique identifier (crc64), then you should
        consult that field when determining what the unique
        identifier is for a particular object.

        to put it another way, a reference instance with no authors,
        no title, and a location can have a valid crc.  so why should
        the adaptor ignore this case?

        all that being said, my understanding of how all this goes
        together is still pretty shallow, so I'll defer to you as to
        which solution is best ;)

        --Bryan


From hlapp at gmx.net  Mon Apr 16 02:54:19 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 15 Apr 2007 22:54:19 -0400
Subject: [BioSQL-l] genbank, references, and crc's
In-Reply-To: <20070411153337.GA5275@rover.pcbi.upenn.edu>
References: <20070409160502.GD5285@rover.pcbi.upenn.edu>
	<3A873665-CA1D-489B-A6A8-6EDBB54C7858@gmx.net>
	<20070411153337.GA5275@rover.pcbi.upenn.edu>
Message-ID: <52E4803B-7141-4D40-B46D-369626D15968@gmx.net>


On Apr 11, 2007, at 11:33 AM, Bryan Cardillo wrote:

> to put it another way, a reference instance with no authors,
>         no title, and a location can have a valid crc.  so why should
>         the adaptor ignore this case?

You're right - I can't just remove the location() test. Instead, I  
should be able to remove the bracketing if clause altogether. I.e.,  
in light of the schema, the construct

     if($obj->authors() || $obj->title() || $obj->location()) {
         push(@ukqueries, {
             'doc_id' => $self->_crc64($obj),
         });
     }

ought to be equivalent to

     push(@ukqueries, {
          'doc_id' => $self->_crc64($obj),
     });

The thing is that the BioPerl object model doesn't complain if you
leave all three of authors, title, and location empty, no matter how
nonsensical that is (it so happens that annotation parsed out from a
legitimate sequence file in, e.g., GenBank format will always have
location filled in).

I think I'll leave the if clause in and document that in reality for  
all legitimate annotation sources the clause should always evaluate  
to true.
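
A minimal sketch of the lookup condition discussed above, written in
Python rather than Perl. The function names, the dict-based reference,
and the use of zlib.crc32 as a stand-in for the adaptor's _crc64 are
assumptions for illustration only; the one thing taken from this thread
is that the key is derived from authors, title, and location.

```python
import zlib

def doc_id(ref):
    """Stand-in for _crc64: checksum over authors, title, and location."""
    text = (ref.get("authors") or "") + (ref.get("title") or "") + (ref.get("location") or "")
    return "CRC-%08X" % zlib.crc32(text.encode("utf-8"))

def unique_key_queries(ref):
    """Build lookup queries; use the checksum key if any contributing field is set."""
    queries = []
    if ref.get("authors") or ref.get("title") or ref.get("location"):
        queries.append({"doc_id": doc_id(ref)})
    return queries

# A reference with only a location still yields a checksum-based lookup.
location_only = {"authors": None, "title": None, "location": "Unpublished"}
print(unique_key_queries(location_only))
```

With a NOT NULL location the condition is always true, so the guard
only matters for objects that have all three fields empty.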

Thanks for your observations - very sharp. I hope you'll stick around  
with the code for a while, it can certainly benefit from another pair  
of sharp eyes. Don't hesitate to let me know if you need any help.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From lpritc at scri.ac.uk  Mon Apr 16 15:55:22 2007
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Mon, 16 Apr 2007 16:55:22 +0100
Subject: [BioSQL-l] Problem loading GO.
Message-ID: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>

Hi,

I've been trying to upload the GO into a clean BioSQL (MySQL, 1.4.1)
schema using the BioPerl bp_load_ontology.pl script, with the OBOv1.0,
OBOv1.2, and the most recent flatfiles from
http://www.geneontology.org/GO.downloads.ontology.shtml - none of my
attempts have been successful.  The errors below are from a Linux
installation, but the same errors are thrown on OS X, too.  I am using
the most recent versions of BioPerl and bioperl-db, installed via CPAN:

[lpritc at lplinuxdev sequence_data]$ perl -MBio::Root::Version -e 'print
$Bio::Root::Version::VERSION,"\n"'
1.005002102

and bioperl-db 1.5.2.

I have attached the traceback below (running with --safe throws a number
of equivalent errors), and I would be grateful for any help you might be
able to offer with setting me on track to fixing this.

SOFA is loaded without issues, you might be pleased to hear ;)

Thanks in advance,

L.


########

[lpritc at lplinuxdev sequence_data]$ bp_load_ontology.pl --host localhost
--dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass
******** --format obo ~/Downloads/gene_ontology_edit.obo
Loading ontology gene_ontology:
        ... terms
        ... relationships
        Done with gene_ontology.
Loading ontology biological_process:
        ... terms

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
were ("","","0","") FKs ()
Column 'dbname' cannot be null
---------------------------------------------------
Could not store term GO:0018901, name '2,4-dichlorophenoxyacetic acid
metabolic process':

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be
found by unique key
STACK: Error::throw
STACK:
Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK:
Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK:
Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271
STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805
STACK: /usr/bin/bp_load_ontology.pl:610
-----------------------------------------------------------


[lpritc at lplinuxdev sequence_data]$ bp_load_ontology.pl --host localhost
--dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass
******** --format goflat --fmtargs ~/Downloads/GO.defs
~/Downloads/function.ontology   
Loading ontology Gene Ontology:
        ... terms

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
were ("MetaCyc","2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN","0","") FKs
()
Duplicate entry '2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RX-MetaCyc-0' for
key 2
---------------------------------------------------
Could not store term GO:0047528, name '2\,3-dihydroxyindole 2
\,3-dioxygenase activity':

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be
found by unique key
STACK: Error::throw
STACK:
Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK:
Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK:
Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271
STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805
STACK: /usr/bin/bp_load_ontology.pl:610
-----------------------------------------------------------

 at /usr/bin/bp_load_ontology.pl line 817
        main::persist_term('-term',
'Bio::Ontology::GOterm=HASH(0xa6afb9c)', '-db',
'Bio::DB::BioSQL::DBAdaptor=HASH(0x9b5afd0)', '-termfactory',
'Bio::Ontology::TermFactory=HASH(0x9f40d10)', '-throw',
'CODE(0x96f6b68)', '-mergeobs', ...) called
at /usr/bin/bp_load_ontology.pl line 610


-- 
Dr Leighton Pritchard B.Sc.(Hons) MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA
e:lpritc at scri.ac.uk            w:http://bioinf.scri.ac.uk/lp
gpg/pgp: 0xFEFC205C
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are confidential 
to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that addressee.
If you are not the intended recipient you are requested to preserve this 
confidentiality and you must not use, disclose, copy, print or rely on this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan the email 
and the attachments (if any).


From hlapp at gmx.net  Tue Apr 17 04:00:55 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 17 Apr 2007 00:00:55 -0400
Subject: [BioSQL-l] Problem loading GO.
In-Reply-To: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
Message-ID: <B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>

Hi Leighton, please see below:

On Apr 16, 2007, at 11:55 AM, Leighton Pritchard wrote:

> Hi,
>
> I've been trying to upload the GO into a clean BioSQL (MySQL, 1.4.1)
> schema using the BioPerl bp_load_ontology.pl script, with the OBOv1.0,
> OBOv1.2, and the most recent flatfiles from
> http://www.geneontology.org/GO.downloads.ontology.shtml - none of my
> attempts have been successful.  The errors below are from a Linux
> installation, but the same errors are thrown on OS X, too.  I am using
> the most recent versions of BioPerl and bioperl-db, installed via  
> CPAN:
>
> [lpritc at lplinuxdev sequence_data]$ perl -MBio::Root::Version -e 'print
> $Bio::Root::Version::VERSION,"\n"'
> 1.005002102
>
> and bioperl-db 1.5.2.
>
> I have attached the traceback below (running with --safe throws a  
> number
> of equivalent errors),

Using --safe will throw the same errors, but will continue loading.  
I.e., you'd lose the one term, but keep everything else.

I do realize that especially for a graph losing an internal node can  
be quite detrimental.

> [...]
> ########
>
> [lpritc at lplinuxdev sequence_data]$ bp_load_ontology.pl --host  
> localhost
> --dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass
> ******** --format obo ~/Downloads/gene_ontology_edit.obo
> Loading ontology gene_ontology:
>         ... terms
>         ... relationships
>         Done with gene_ontology.
> Loading ontology biological_process:
>         ... terms
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
> were ("","","0","") FKs ()
> Column 'dbname' cannot be null
> ---------------------------------------------------

This would point to a problem in the BioPerl OBO parser. According to
the message, both the database name and the accession of the db_xref
for the term are - surely erroneously - empty. Apparently the parser
fails to parse out database and accession for this db_xref of term
GO:0018901.

If you can edit the obo file, you can try deleting the db_xref(s) for  
that term that look odd (or delete all if you don't need them).

I'd have to debug the obo parser to see exactly where it's going  
wrong in parsing.

> Could not store term GO:0018901, name '2,4-dichlorophenoxyacetic acid
> metabolic process':
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> [...]
> [lpritc at lplinuxdev sequence_data]$ bp_load_ontology.pl --host  
> localhost
> --dbname biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass
> ******** --format goflat --fmtargs ~/Downloads/GO.defs

Note that the argument for --fmtargs here should read
"-defs_file,/path/to/Downloads/GO.defs". (Note that within the quotes  
there is no tilde expansion.)
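
If the command line is assembled by a script, the home directory can
be expanded before the value is passed on. A small sketch in Python;
the helper name goflat_fmtargs is made up for illustration, and only
the "-defs_file,<path>" form comes from the note above.

```python
import os.path

def goflat_fmtargs(defs_path):
    """Return a --fmtargs value with a leading '~' expanded to the home directory."""
    return "-defs_file," + os.path.expanduser(defs_path)

# The shell will not expand '~' inside quotes, so do it explicitly here.
print(goflat_fmtargs("~/Downloads/GO.defs"))
```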

> ~/Downloads/function.ontology
> Loading ontology Gene Ontology:
>         ... terms
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
> were ("MetaCyc","2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN","0","") FKs
> ()
> Duplicate entry '2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RX-MetaCyc-0' for
> key 2
> ---------------------------------------------------

This is one of the reasons why you've got to love MySQL (and am I
correct in inferring that you're using MySQL?). The width of the
dbxref.accession column (which holds the second value in parentheses)
is 40 chars. The apparently pre-existing value
("2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RX-MetaCyc-0") is 50 chars,
which when loaded should have resulted in an exception. Instead, MySQL
simply and silently truncates it to 40 chars, which makes it identical
to the first 40 chars of "2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN"
(which is 41 chars in length).

It may be necessary to widen the length of dbxref.accession here, for
example to 80 chars? Let me know if you need help with the DDL  
command to do this.
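
The effect of the truncation can be reproduced outside the database. A
minimal Python sketch, assuming the accession is exactly the string
from the error message (with literal backslashes) and that over-long
values are cut to the column width instead of being rejected:

```python
COLUMN_WIDTH = 40  # width of dbxref.accession as described in this thread

accession = r"2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN"
stored = accession[:COLUMN_WIDTH]  # what a silently truncating insert keeps

print(len(accession), len(stored))
# A lookup by the full value cannot match the truncated row ...
print(stored == accession)
# ... but a second insert truncates to the same value and collides.
print("-".join([stored, "MetaCyc", "0"]))
```

The last line prints the composite key that appears in the "Duplicate
entry" message, which presumably combines accession, database name,
and version.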

Let me know how far this gets you.

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From lpritc at scri.ac.uk  Tue Apr 17 13:35:44 2007
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 17 Apr 2007 14:35:44 +0100
Subject: [BioSQL-l] Problem loading GO.
In-Reply-To: <B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
Message-ID: <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>

Hi Hilmar, 

Thanks for the very quick response.  Apologies for the long reply, but I
thought it might be useful if anyone else happens across the same
problems that I did.

On Tue, 2007-04-17 at 00:00 -0400, Hilmar Lapp wrote:
> Apparently the parser
> fails to parse out database and accession for this db_xref of term
> GO:0018901.
> 
> If you can edit the obo file, you can try deleting the db_xref(s) for  
> that term that look odd (or delete all if you don't need them).

You're spot on - see further down for details...

> Note that the argument for --fmtargs here should read
> "-defs_file,/path/to/Downloads/GO.defs". (Note that within the quotes  
> there is no tilde expansion.)

D'oh!  Thanks for the note - my bad, there.

> This is one of the reasons why you've got to love MySQL (and am I
> correct in inferring that you're using MySQL?).

The 'choice' was forced upon me ;)

> It may be necessary to widen the length of dbxref.accession here, for
> example to 80 chars? Let me know if you need help with the DDL  
> command to do this.

I've fixed that now (and added it to my local biosqldb-mysql.sql
schema), but with a clean BioSQL schema and using:

[lpritc at lplinuxdev sql]$ bp_load_ontology.pl --host localhost --dbname
biosql --namespace "Gene Ontology" --dbuser lpritc --dbpass ********
--format goflat --fmtargs
"-defs_file,/home/lpritc/Downloads/GO.defs" /home/lpritc/Downloads/function.ontology 

I was still getting errors with the GO flatfile:

Loading ontology Gene Ontology:
        ... terms

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
were ("","","0","") FKs ()
Column 'dbname' cannot be null
---------------------------------------------------
Could not store term GO:0047554, name '2-pyrone-4,6-dicarboxylate
lactonase activity':

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be
found by unique key
STACK: Error::throw
STACK:
Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK:
Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK:
Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271
STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805
STACK: /usr/bin/bp_load_ontology.pl:610
-----------------------------------------------------------

 at /usr/bin/bp_load_ontology.pl line 817
        main::persist_term('-term',
'Bio::Ontology::GOterm=HASH(0x88497a4)', '-db',
'Bio::DB::BioSQL::DBAdaptor=HASH(0x897f074)', '-termfactory',
'Bio::Ontology::TermFactory=HASH(0x8d64ad8)', '-throw',
'CODE(0x851abc8)', '-mergeobs', ...) called
at /usr/bin/bp_load_ontology.pl line 610

I tracked this down to apparently malformed entries in the GO.defs file
(note that the first and third definition_reference lines appear to be
two halves of the same entry):

term: 2-pyrone-4,6-dicarboxylate lactonase activity
goid: GO:0047554
definition: Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O
= 4-carboxy-2-hydroxyhexa-2,4-dienedioate.
definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN
definition_reference: EC:3.1.1.57
definition_reference: MetaCyc:2-PYRONE-4

I found 43 similar errors for other GOIDs, and it appears to result from
the occurrence of the string "\," in a dbxref - mostly MetaCyc entries,
but also some UM-BBD_pathwayID entries.

These errors appear to have followed through into the generation of the
OBO format files in each case, e.g.:

def: "Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O =
4-carboxy-2-hydroxyhexa-2,4-dienedioate." [:6-DICARBOXYLATE-LACTONASE-RXN, EC:3.1.1.57, MetaCyc:2-PYRONE-4]

and so is something for the GO guys to fix, I guess.
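
To show how an escaped comma can break a reference list apart, here is
a small Python sketch. It assumes the intended accession contains "\,"
as suggested above; the function name split_xrefs and the sample string
are illustrative and do not describe how the GO tools or the BioPerl
parser actually work.

```python
import re

def split_xrefs(xref_list):
    """Split a comma-separated dbxref list, leaving escaped commas ('\\,') intact."""
    return [part.strip() for part in re.split(r"(?<!\\),", xref_list)]

raw = r"EC:3.1.1.57, MetaCyc:2-PYRONE-4\,6-DICARBOXYLATE-LACTONASE-RXN"
print(raw.split(","))    # naive split cuts the MetaCyc accession in two
print(split_xrefs(raw))  # escape-aware split keeps it whole
```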


Another error is thrown after fixing the above, though (with the same
command as before):

Loading ontology Gene Ontology:
        ... terms

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values were
("GO:0006905","vesicle transport","OBSOLETE (was not defined before
being made obsolete).","X","") FKs (1)
Duplicate entry 'vesicle transport-1-X' for key 3
---------------------------------------------------
Could not store term GO:0006905, name 'vesicle transport':

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: create: object (Bio::Ontology::GOterm) failed to insert or to be
found by unique key
STACK: Error::throw
STACK:
Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK:
Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271
STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805
STACK: /usr/bin/bp_load_ontology.pl:610
-----------------------------------------------------------

 at /usr/bin/bp_load_ontology.pl line 817
        main::persist_term('-term',
'Bio::Ontology::GOterm=HASH(0xbcac418)', '-db',
'Bio::DB::BioSQL::DBAdaptor=HASH(0x957805c)', '-termfactory',
'Bio::Ontology::TermFactory=HASH(0x995db20)', '-throw',
'CODE(0x9113bd0)', '-mergeobs', ...) called
at /usr/bin/bp_load_ontology.pl line 610

There are duplicate terms, identical in the term table except for GOID:
GO:0006905 and GO:0005480.  They are both "vesicle transport", and
obsoleted:

term: vesicle transport
goid: GO:0005480
definition: OBSOLETE (was not defined before being made obsolete).
definition_reference: GOC:go_curators
comment: This term was made obsolete because it represents a biological
process and not a molecular function. To update annotations, use the
biological process term 'vesicle-mediated transport ; GO:0016192'.

term: vesicle transport
goid: GO:0006905
definition: OBSOLETE (was not defined before being made obsolete).
definition_reference: GOC:go_curators
comment: This term was made obsolete because the meaning of the term is
ambiguous. To update annotations, consider the biological process term
'vesicle-mediated transport ; GO:0016192'.

I used the --noobsolete flag to avoid this error - reasoning that since
I'm populating the database for the first time, ignoring the obsolete
terms won't hurt - but finally this error was thrown:

Loading ontology Gene Ontology:
        ... terms

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
were ("PMID","","0","") FKs ()
Column 'accession' cannot be null
---------------------------------------------------
Could not store term GO:0032933, name 'SREBP-mediated signaling
pathway':

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: create: object (Bio::Annotation::DBLink) failed to insert or to be
found by unique key
STACK: Error::throw
STACK:
Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Root.pm:359
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK:
Bio::DB::BioSQL::TermAdaptor::store_children /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/TermAdaptor.pm:293
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK:
Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK:
Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271
STACK: main::persist_term /usr/bin/bp_load_ontology.pl:805
STACK: /usr/bin/bp_load_ontology.pl:610
-----------------------------------------------------------

 at /usr/bin/bp_load_ontology.pl line 817
        main::persist_term('-term',
'Bio::Ontology::GOterm=HASH(0xbe18f14)', '-db',
'Bio::DB::BioSQL::DBAdaptor=HASH(0x99bbf2c)', '-termfactory',
'Bio::Ontology::TermFactory=HASH(0x9da0ad8)', '-throw',
'CODE(0x9556bb4)', '-mergeobs', ...) called
at /usr/bin/bp_load_ontology.pl line 610

with the offending entry being 

term: SREBP-mediated signaling pathway
goid: GO:0032933
definition: A series of molecular signals from the endoplasmic reticulum
to the nucleus generated as a consequence of altered levels of one or
more lipids, and resulting in the activation of transcription by SREBP.
definition_reference: GOC:mah
definition_reference: PMID:0

I commented out the definition_reference for PMID:0, which seemed to fix
matters.
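
A quick pre-flight check on the definition_reference values can catch
such entries before loading. This is a sketch in Python with a made-up
helper name; treating an accession of "0" as a placeholder is my
assumption, since the thread does not establish why the loader ends up
with an empty accession for PMID:0.

```python
def check_reference(ref):
    """Return a list of problems found in one 'db:accession' reference."""
    db, sep, accession = ref.partition(":")
    problems = []
    if not sep:
        problems.append("no colon separator")
    if not db:
        problems.append("missing database name")
    if accession in ("", "0"):
        problems.append("empty or placeholder accession")
    return problems

for ref in [":6-DICARBOXYLATE-LACTONASE-RXN", "EC:3.1.1.57", "PMID:0"]:
    print(ref, check_reference(ref))
```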

The process.ontology and component.ontology files then went into the
database without a hitch.  Thanks again for your help,

L.

-- 
Dr Leighton Pritchard B.Sc.(Hons) MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA
e:lpritc at scri.ac.uk            w:http://bioinf.scri.ac.uk/lp
gpg/pgp: 0xFEFC205C


From hlapp at gmx.net  Tue Apr 17 15:09:45 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 17 Apr 2007 11:09:45 -0400
Subject: [BioSQL-l] Problem loading GO.
In-Reply-To: <1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
	<1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
Message-ID: <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>


On Apr 17, 2007, at 9:35 AM, Leighton Pritchard wrote:

> Hi Hilmar,
>
> Thanks for the very quick response.  Apologies for the long reply,  
> but I
> thought it might be useful if anyone else happens across the same
> problems that I did.

Thanks for reporting all these.

> [...]
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
> were ("","","0","") FKs ()
> Column 'dbname' cannot be null
> ---------------------------------------------------
> Could not store term GO:0047554, name '2-pyrone-4,6-dicarboxylate
> lactonase activity':
> [...]
> I tracked this down to an apparently poor formatting of the GO.defs  
> file
> (note that the first and third definition_lines appear to be two  
> halves
> of the same entry):
>
> term: 2-pyrone-4,6-dicarboxylate lactonase activity
> goid: GO:0047554
> definition: Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate +  
> H2O
> = 4-carboxy-2-hydroxyhexa-2,4-dienedioate.
> definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN

I wonder whether this is the line that throws the parser off. It  
looks like the database part of the reference is missing - bad.

> definition_reference: EC:3.1.1.57
> definition_reference: MetaCyc:2-PYRONE-4
>
> I found 43 similar errors for other GOIDs, and it appears to result  
> from
> the occurrence of the string "\," in a dbxref - mostly MetaCyc  
> entries,
> but also some UM-BBD_pathwayID entries.

I'm not sure - although the string "\," might indeed trip up the  
parser, would have to investigate to confirm. Could it be a  
coincidence with definition_references that lack the database part  
before the colon?

>
> These errors appear to have followed through into the generation of  
> the
> OBO format files in each case, e.g.:
>
> def: "Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O =
> 4-carboxy-2-hydroxyhexa-2,4-dienedioate."
> [:6-DICARBOXYLATE-LACTONASE-RXN, EC:3.1.1.57, MetaCyc:2-PYRONE-4]

Again, the first db_xref lacks the database in front of the colon. I  
can also see why "\," will trip up the parser in this format.

>
> and so is something for the GO guys to fix, I guess.

The lack of a database for certain xrefs surely is. If the escaped  
comma does throw off the BioPerl parser then that part is for BioPerl  
to fix. It does seem to extract the parts correctly, if the error  
message is any indication, though you may argue that it should remove  
the escaping backslashes (and I'd certainly agree with that).

>
>
> Another error is thrown after fixing the above, though (with the same
> command as before):
>
> Loading ontology Gene Ontology:
>         ... terms
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values  
> were
> ("GO:0006905","vesicle transport","OBSOLETE (was not defined before
> being made obsolete).","X","") FKs (1)
> Duplicate entry 'vesicle transport-1-X' for key 3
> ---------------------------------------------------
> Could not store term GO:0006905, name 'vesicle transport':
> [...]
> There are duplicate terms, identical in the term table except for  
> GOID:
> GO:0006905 and GO:0005480.  They are both "vesicle transport", and
> obsoleted:

That violates the uniqueness constraint, and this sounds more like a  
bug in the GO file. I'm also not sure what motivated them to create  
the same term multiple times only to obsolete it immediately.
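
Such collisions can be found in the input before loading. A minimal
Python sketch, assuming the unique key consists of name, ontology, and
obsolete flag (inferred from the 'vesicle transport-1-X' message, not
checked against the schema); the dict layout and the third sample term
are illustrative.

```python
from collections import defaultdict

def find_duplicate_terms(terms):
    """Group term IDs by (name, ontology, obsolete flag); keep groups with more than one ID."""
    groups = defaultdict(list)
    for term in terms:
        key = (term["name"], term["ontology"], term["is_obsolete"])
        groups[key].append(term["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

terms = [
    {"id": "GO:0005480", "name": "vesicle transport", "ontology": 1, "is_obsolete": "X"},
    {"id": "GO:0006905", "name": "vesicle transport", "ontology": 1, "is_obsolete": "X"},
    {"id": "GO:0016192", "name": "vesicle-mediated transport", "ontology": 1, "is_obsolete": ""},
]
print(find_duplicate_terms(terms))
```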

> [...]
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
> were ("PMID","","0","") FKs ()
> Column 'accession' cannot be null
> ---------------------------------------------------
> Could not store term GO:0032933, name 'SREBP-mediated signaling
> pathway':
> [...]
> with the offending entry being
>
> term: SREBP-mediated signaling pathway
> goid: GO:0032933
> definition: A series of molecular signals from the endoplasmic  
> reticulum
> to the nucleus generated as a consequence of altered levels of one or
> more lipids, and resulting in the activation of transcription by  
> SREBP.
> definition_reference: GOC:mah
> definition_reference: PMID:0
>
> I commented out the definition_reference for PMID:0, which seemed  
> to fix
> matters.

Right, it seems to be a bogus reference.

>
> The process.ontology and component.ontology files then went into the
> database without a hitch.  Thanks again for your help,

Fantastic you got it all loaded!

Note that you also have the --computetc switch which will compute the  
transitive closure for you automatically.
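
For readers unfamiliar with the term, the transitive closure maps every
term to all of its ancestors, not just its direct parents. The Python
sketch below shows the general idea on a made-up three-term chain; it
is not a description of what --computetc does internally.

```python
def transitive_closure(parents):
    """Map each term to the set of all its ancestors, given direct parent links."""
    closure = {}

    def ancestors(term):
        if term not in closure:
            closure[term] = set()  # placeholder also guards against cycles
            for parent in parents.get(term, ()):
                closure[term].add(parent)
                closure[term] |= ancestors(parent)
        return closure[term]

    for term in parents:
        ancestors(term)
    return closure

# Hypothetical is_a chain C -> B -> A; the term names are made up.
parents = {"C": ["B"], "B": ["A"], "A": []}
print(transitive_closure(parents))
```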

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From lpritc at scri.ac.uk  Tue Apr 17 16:05:16 2007
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 17 Apr 2007 17:05:16 +0100
Subject: [BioSQL-l] Problem loading GO.
In-Reply-To: <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
	<1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
	<5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
Message-ID: <1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk>

Hello again,

On Tue, 2007-04-17 at 11:09 -0400, Hilmar Lapp wrote:
> Thanks for reporting all these.

No problem at all.

> On Apr 17, 2007, at 9:35 AM, Leighton Pritchard wrote:
> > term: 2-pyrone-4,6-dicarboxylate lactonase activity
[...]
> > definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN
> 
> I wonder whether this is the line that throws the parser off. It  
> looks like the database part of the reference is missing - bad.

> > definition_reference: MetaCyc:2-PYRONE-4

I don't think the parser is to blame, here.  Note that if you join the
definition_reference strings from the GO.defs file, you get:

MetaCyc:2-PYRONE-4:6-DICARBOXYLATE-LACTONASE-RXN

Then if you replace the colon by "\," you get what should (I think)
actually be the MetaCyc entry:

MetaCyc:2-PYRONE-4\,6-DICARBOXYLATE-LACTONASE-RXN

> > I found 43 similar errors for other GOIDs, and it appears to result  
> > from
> > the occurrence of the string "\," in a dbxref - mostly MetaCyc  
> > entries,
> > but also some UM-BBD_pathwayID entries.
> 
> I'm not sure - although the string "\," might indeed trip up the  
> parser, would have to investigate to confirm. Could it be a  
> coincidence with definition_references that lack the database part  
> before the colon?

Inspecting the troublesome entries by eye seems to turn up the same
problem as above consistently: a GO term in the GO.defs file is
malformed.  The term should have a definition_reference field describing
a MetaCyc entry that matches the term field.  In the term string, there
would be an escaped comma, but the string ends where we expect this.
The string that would follow the escaped comma is present as the first
definition_reference.

This observation also extends to cases where there should be two
occurrences of "\," in the MetaCyc field, e.g.:

term: 2,3-dihydroxyindole 2,3-dioxygenase activity
goid: GO:0047528
definition: Catalysis of the reaction: 2,3-dihydroxyindole + O2 =
anthranilate + CO2.
definition_reference: :3-DIHYDROXYINDOLE-2
definition_reference: :3-DIOXYGENASE-RXN
definition_reference: EC:1.13.11.2
definition_reference: MetaCyc:2

It then appears as though the GO flatfiles were used automatically to
generate the OBO format files, and propagated the same error into the
square brackets in each case.
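
Rather than inspecting entries by eye, the same symptom can be searched for mechanically. The following is a rough sketch, assuming the "key: value" line layout of the GO.defs stanzas quoted above; the function name find_headless_refs is hypothetical, and the check only flags definition_reference values whose database part is empty. It does not try to repair them.

```python
def find_headless_refs(lines):
    """Return (goid, reference) pairs whose database part is empty."""
    problems = []
    goid = None
    for line in lines:
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank or continuation line without "key: value"
        key, value = key.strip(), value.strip()
        if key == "goid":
            goid = value
        elif key == "definition_reference" and value.startswith(":"):
            # Nothing before the colon: the database name is missing.
            problems.append((goid, value))
    return problems


stanza = """\
term: 2,3-dihydroxyindole 2,3-dioxygenase activity
goid: GO:0047528
definition: Catalysis of the reaction: 2,3-dihydroxyindole + O2 =
anthranilate + CO2.
definition_reference: :3-DIHYDROXYINDOLE-2
definition_reference: :3-DIOXYGENASE-RXN
definition_reference: EC:1.13.11.2
definition_reference: MetaCyc:2
""".splitlines()

for goid, ref in find_headless_refs(stanza):
    print(goid, ref)
```

On the stanza above this reports the two fragments beginning with a colon and leaves the EC and MetaCyc lines alone.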

> > and so is something for the GO guys to fix, I guess.
> 
> The lack of a database for certain xrefs surely is. If the escaped  
> comma does throw off the BioPerl parser then that part is for BioPerl  
> to fix. 

I think the problems are now all in the data I downloaded from
http://www.geneontology.org/GO.downloads.shtml - I believe the BioPerl
parser to be innocent of these charges ;)  I've submitted the issue at
the GO site, and with any luck they'll handle it quite soon (if it is in
fact their problem).

> Note that you also have the --computetc switch which will compute the  
> transitive closure for you automatically.

:D Excellent!  Thanks for the pointer, and again for your efforts,

L.

-- 
Dr Leighton Pritchard B.Sc.(Hons) MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA
e:lpritc at scri.ac.uk            w:http://bioinf.scri.ac.uk/lp
gpg/pgp: 0xFEFC205C
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are confidential 
to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that addressee.
If you are not the intended recipient you are requested to preserve this 
confidentiality and you must not use, disclose, copy, print or rely on this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan the email 
and the attachments (if any).


From cjfields at uiuc.edu  Tue Apr 17 16:18:19 2007
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 17 Apr 2007 11:18:19 -0500
Subject: [BioSQL-l] Problem loading GO.
In-Reply-To: <1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
	<1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
	<5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
	<1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk>
Message-ID: <146086E2-330B-4460-90AC-2632E82ED145@uiuc.edu>

On Apr 17, 2007, at 11:05 AM, Leighton Pritchard wrote:
...
>
>>> and so is something for the GO guys to fix, I guess.
>>
>> The lack of a database for certain xrefs surely is. If the escaped
>> comma does throw off the BioPerl parser then that part is for BioPerl
>> to fix.
>
> I think the problems are now all in the data I downloaded from
> http://www.geneontology.org/GO.downloads.shtml - I believe the BioPerl
> parser to be innocent of these charges ;)  I've submitted the issue at
> the GO site, and with any luck they'll handle it quite soon (if it is
> in fact their problem).
>
>> Note that you also have the --computetc switch which will compute the
>> transitive closure for you automatically.
>
> :D Excellent!  Thanks for the pointer, and again for your efforts,
>
> L.
...

If you do find anything that is BioSQL- or Bioperl-related then file  
a bug report so we can track it.  I agree with Hilmar that it's  
likely the parser is partly to blame.

http://bugzilla.open-bio.org/

We really appreciate the work you're putting into this!

chris


From lpritc at scri.ac.uk  Tue Apr 17 16:55:38 2007
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 17 Apr 2007 17:55:38 +0100
Subject: [BioSQL-l] Problem loading GO.
In-Reply-To: <146086E2-330B-4460-90AC-2632E82ED145@uiuc.edu>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
	<1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
	<5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
	<1176825916.988.121.camel@lplinuxdev.scri.sari.ac.uk>
	<146086E2-330B-4460-90AC-2632E82ED145@uiuc.edu>
Message-ID: <1176828938.988.133.camel@lplinuxdev.scri.sari.ac.uk>

Hi Chris,

On Tue, 2007-04-17 at 11:18 -0500, Chris Fields wrote:
> If you do find anything that is BioSQL- or Bioperl-related then file  
> a bug report so we can track it.  I agree with Hilmar that it's  
> likely the parser is partly to blame.
> 
> http://bugzilla.open-bio.org/

I've submitted a bug report, mostly replicating my first post in this
thread.  I added links to the appropriate point in the list archives so
that the rest of the discussion can be considered, too.

> We really appreciate the work you're putting into this!

Thanks - I'm just grateful that the Bio* repertoire is there at all so
that my problems are relatively minor (as opposed to attempting to
replicate the functionality independently).

L.

-- 
Dr Leighton Pritchard B.Sc.(Hons) MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA
e:lpritc at scri.ac.uk            w:http://bioinf.scri.ac.uk/lp
gpg/pgp: 0xFEFC205C


From lpritc at scri.ac.uk  Tue Apr 17 17:03:53 2007
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 17 Apr 2007 18:03:53 +0100
Subject: [BioSQL-l] [Bioperl-l]  Problem loading GO.
In-Reply-To: <E61FBBF0-0E65-43A4-BBB8-CD145447A042@fruitfly.org>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
	<1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
	<5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
	<E61FBBF0-0E65-43A4-BBB8-CD145447A042@fruitfly.org>
Message-ID: <1176829433.988.143.camel@lplinuxdev.scri.sari.ac.uk>

On Tue, 2007-04-17 at 09:54 -0700, Chris Mungall wrote:
> Is there any reason you're loading GO.defs? This is a legacy format;  
> all the information is subsumed in the obo file.

My only reason was that the parser originally failed to load the OBO 
format data - probably for the same reason that the flatfile failed - 
and I tried the flatfile to check if there were parser issues with the 
format.  I just carried on with the flatfile after that because the 
terms with formatting errors were (subjectively, for me) easier to 
spot and fix by hand.  I'm happy to use a fixed OBO file.

> I didn't see your message to the GO folks re formatting errors - who  
> did you send it to & what was the subject? I'll see it gets seen to.

I submitted it via the website interface - I'm afraid I have no idea
where it would have gone after that.

-- 
Dr Leighton Pritchard B.Sc.(Hons) MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA
e:lpritc at scri.ac.uk            w:http://bioinf.scri.ac.uk/lp
gpg/pgp: 0xFEFC205C


From cjm at fruitfly.org  Tue Apr 17 16:54:51 2007
From: cjm at fruitfly.org (Chris Mungall)
Date: Tue, 17 Apr 2007 09:54:51 -0700
Subject: [BioSQL-l] [Bioperl-l]  Problem loading GO.
In-Reply-To: <5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
	<1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
	<5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
Message-ID: <E61FBBF0-0E65-43A4-BBB8-CD145447A042@fruitfly.org>


Is there any reason you're loading GO.defs? This is a legacy format;  
all the information is subsumed in the obo file.

I didn't see your message to the GO folks re formatting errors - who  
did you send it to & what was the subject? I'll see it gets seen to.

>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values
>> were
>> ("GO:0006905","vesicle transport","OBSOLETE (was not defined before
>> being made obsolete).","X","") FKs (1)
>> Duplicate entry 'vesicle transport-1-X' for key 3
>> ---------------------------------------------------
>> Could not store term GO:0006905, name 'vesicle transport':
>> [...]
>> There are duplicate terms, identical in the term table except for
>> GOID:
>> GO:0006905 and GO:0005480.  They are both "vesicle transport", and
>> obsoleted:
>
> That violates the uniqueness constraint, and this sounds more like a
> bug in the GO file. I'm also not sure what motivated them to create
> the same term multiple times only to obsolete it immediately.

these things happen - the schema should be able to deal with it. it's  
a pain I know. In Chado we have some hacky solution for this (I  
believe it is concatenating the ID onto the name of obsolete terms).

I think that it's actually wrong to include obsoletes and actual terms  
in the same table - however, it's obviously astoundingly useful to be  
able to do this, but it requires the hack to get out of the uniqueness  
violation.
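
A minimal sketch of the kind of hack described above, assuming the approach really is to append the ID to the name of obsolete terms. The function name storage_name and the suffix format are invented for illustration and are not Chado's actual scheme.

```python
def storage_name(name, term_id, is_obsolete):
    """Return the name to store, made unique for obsolete terms."""
    if is_obsolete:
        # Appending the ID keeps two obsolete terms with the same
        # name from colliding on a unique key that includes the name.
        return "%s (obsolete %s)" % (name, term_id)
    return name


for term_id in ("GO:0006905", "GO:0005480"):
    print(storage_name("vesicle transport", term_id, True))
```

The two obsolete "vesicle transport" terms then get different stored names, while names of live terms are left untouched.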

The EBI loads all of OBO into BioSQL regularly - I wonder how they  
handle this?

On Apr 17, 2007, at 8:09 AM, Hilmar Lapp wrote:

>
> On Apr 17, 2007, at 9:35 AM, Leighton Pritchard wrote:
>
>> Hi Hilmar,
>>
>> Thanks for the very quick response.  Apologies for the long reply,
>> but I
>> thought it might be useful if anyone else happens across the same
>> problems that I did.
>
> Thanks for reporting all these.
>
>> [...]
>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
>> were ("","","0","") FKs ()
>> Column 'dbname' cannot be null
>> ---------------------------------------------------
>> Could not store term GO:0047554, name '2-pyrone-4,6-dicarboxylate
>> lactonase activity':
>> [...]
>> I tracked this down to an apparently poor formatting of the GO.defs
>> file
>> (note that the first and third definition_lines appear to be two
>> halves
>> of the same entry):
>>
>> term: 2-pyrone-4,6-dicarboxylate lactonase activity
>> goid: GO:0047554
>> definition: Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate +
>> H2O
>> = 4-carboxy-2-hydroxyhexa-2,4-dienedioate.
>> definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN
>
> I wonder whether this is the line that throws the parser off. It
> looks like the database part of the reference is missing - bad.
>
>> definition_reference: EC:3.1.1.57
>> definition_reference: MetaCyc:2-PYRONE-4
>>
>> I found 43 similar errors for other GOIDs, and it appears to result
>> from
>> the occurrence of the string "\," in a dbxref - mostly MetaCyc
>> entries,
>> but also some UM-BBD_pathwayID entries.
>
> I'm not sure - although the string "\," might indeed trip up the
> parser, would have to investigate to confirm. Could it be a
> coincidence with definition_references that lack the database part
> before the colon?
>
>>
>> These errors appear to have followed through into the generation of
>> the
>> OBO format files in each case, e.g.:
>>
>> def: "Catalysis of the reaction: 2-pyrone-4,6-dicarboxylate + H2O =
>> 4-carboxy-2-hydroxyhexa-2,4-dienedioate." [:6-DICARBOXYLATE-
>> LACTONASE-RXN, EC:3.1.1.57, MetaCyc:2-PYRONE-4]
>
> Again, the first db_xref lacks the database in front of the colon. I
> can also see why "\," will trip up the parser in this format.
>
>>
>> and so is something for the GO guys to fix, I guess.
>
> The lack of a database for certain xrefs surely is. If the escaped
> comma does throw off the BioPerl parser then that part is for BioPerl
> to fix. It does seem to extract the parts correctly, if the error
> message is any indication, though you may argue that it should remove
> the escaping backslashes (and I'd certainly agree with that).
>
>>
>>
>> Another error is thrown after fixing the above, though (with the same
>> command as before):
>>
>> Loading ontology Gene Ontology:
>>         ... terms
>>
>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values
>> were
>> ("GO:0006905","vesicle transport","OBSOLETE (was not defined before
>> being made obsolete).","X","") FKs (1)
>> Duplicate entry 'vesicle transport-1-X' for key 3
>> ---------------------------------------------------
>> Could not store term GO:0006905, name 'vesicle transport':
>> [...]
>> There are duplicate terms, identical in the term table except for
>> GOID:
>> GO:0006905 and GO:0005480.  They are both "vesicle transport", and
>> obsoleted:
>
> That violates the uniqueness constraint, and this sounds more like a
> bug in the GO file. I'm also not sure what motivated them to create
> the same term multiple times only to obsolete it immediately.
>
>> [...]
>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::DBLinkAdaptor (driver) failed, values
>> were ("PMID","","0","") FKs ()
>> Column 'accession' cannot be null
>> ---------------------------------------------------
>> Could not store term GO:0032933, name 'SREBP-mediated signaling
>> pathway':
>> [...]
>> with the offending entry being
>>
>> term: SREBP-mediated signaling pathway
>> goid: GO:0032933
>> definition: A series of molecular signals from the endoplasmic
>> reticulum
>> to the nucleus generated as a consequence of altered levels of one or
>> more lipids, and resulting in the activation of transcription by
>> SREBP.
>> definition_reference: GOC:mah
>> definition_reference: PMID:0
>>
>> I commented out the definition_reference for PMID:0, which seemed
>> to fix
>> matters.
>
> Right, it seems to be a bogus reference.
>
>>
>> The process.ontology and component.ontology files then went into the
>> database without a hitch.  Thanks again for your help,
>
> Fantastic you got it all loaded!
>
> Note that you also have the --computetc switch which will compute the
> transitive closure for you automatically.
>
> 	-hilmar
> -- 
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
>


From rcote at ebi.ac.uk  Wed Apr 18 07:08:50 2007
From: rcote at ebi.ac.uk (Richard Cote)
Date: Wed, 18 Apr 2007 08:08:50 +0100
Subject: [BioSQL-l] [Bioperl-l]  Problem loading GO.
In-Reply-To: <E61FBBF0-0E65-43A4-BBB8-CD145447A042@fruitfly.org>
References: <1176738922.988.26.camel@lplinuxdev.scri.sari.ac.uk>
	<B8DA7982-89F5-4D46-8736-A1D25EA7B504@gmx.net>
	<1176816944.988.83.camel@lplinuxdev.scri.sari.ac.uk>
	<5D5DDFF3-1C01-4D3D-80F8-CD777DEA38D5@gmx.net>
	<E61FBBF0-0E65-43A4-BBB8-CD145447A042@fruitfly.org>
Message-ID: <4625C402.5040809@ebi.ac.uk>

Chris Mungall wrote:
>>> Could not store term GO:0006905, name 'vesicle transport':
>>> [...]
>>> There are duplicate terms, identical in the term table except for
>>> GOID:
>>> GO:0006905 and GO:0005480.  They are both "vesicle transport", and
>>> obsoleted:
>>
> I think that it's actually wrong to include obsoletes and actual terms in 
> the same table - however, it's obviously astoundingly useful to be able 
> to do this, but it requires the hack to get out of the uniqueness violation.
> 
> The EBI loads all of OBO into BioSQL regularly - I wonder how they 
> handle this?

I simply avoid the issue. There's no uniqueness constraint in term name. 
The only constraint is term ID, and even that is only unique in the 
context of an ontology namespace (i.e. it would be perfectly allowable 
to have FOO:1234 and BAR:1234). The only unique (and primary) key is 
generated by the ORM layer so I don't even have to deal with that.

We also have all the terms, obsoleted or not, in the same table because 
people are always querying on stuff that's been made obsolete but is 
still annotated with the old IDs.
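
The constraints described above can be illustrated with a small SQLite table. This is a sketch under stated assumptions, not the EBI schema and not the BioSQL term table: the table and column names are hypothetical, and an INTEGER PRIMARY KEY stands in for the key generated by the ORM layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE term (
        term_pk     INTEGER PRIMARY KEY,      -- surrogate key, generated
        namespace   TEXT NOT NULL,
        identifier  TEXT NOT NULL,
        name        TEXT NOT NULL,            -- deliberately not unique
        is_obsolete INTEGER NOT NULL DEFAULT 0,
        UNIQUE (namespace, identifier)        -- ID unique per namespace only
    )
    """
)
rows = [
    ("GO", "0006905", "vesicle transport", 1),
    ("GO", "0005480", "vesicle transport", 1),  # same name: accepted
    ("FOO", "1234", "some term", 0),
    ("BAR", "1234", "another term", 0),         # same ID, other namespace
]
conn.executemany(
    "INSERT INTO term (namespace, identifier, name, is_obsolete) "
    "VALUES (?, ?, ?, ?)",
    rows,
)

# Re-using an ID inside the same namespace is the only thing rejected.
try:
    conn.execute(
        "INSERT INTO term (namespace, identifier, name) "
        "VALUES ('GO', '0006905', 'x')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

stored = conn.execute("SELECT COUNT(*) FROM term").fetchone()[0]
print(stored, duplicate_rejected)
```

With this layout both obsolete "vesicle transport" terms load into the same table as live terms, and FOO:1234 and BAR:1234 coexist.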

Cheers,
Rc

-- 
Richard Cote
Software Engineer - PRIDE Project Team (Sequence Database Group)
European Bioinformatics Institute
Wellcome Trust Genome Campus                 rcote at ebi.ac.uk
Hinxton, Cambridge CB10 1SD                  Phone: (+44) 1223 492610
United Kingdom                               Fax  : (+44) 1223 494468