[BioSQL-l] Problem loading GO.

Leighton Pritchard lpritc at scri.ac.uk
Tue Apr 17 16:05:16 UTC 2007


Hello again,

On Tue, 2007-04-17 at 11:09 -0400, Hilmar Lapp wrote:
> Thanks for reporting all these.

No problem at all.

> On Apr 17, 2007, at 9:35 AM, Leighton Pritchard wrote:
> > term: 2-pyrone-4,6-dicarboxylate lactonase activity
[...]
> > definition_reference: :6-DICARBOXYLATE-LACTONASE-RXN
> 
> I wonder whether this is the line that throws the parser off. It  
> looks like the database part of the reference is missing - bad.

> > definition_reference: MetaCyc:2-PYRONE-4

I don't think the parser is to blame, here.  Note that if you join the
definition_reference strings from the GO.defs file, you get:

MetaCyc:2-PYRONE-4:6-DICARBOXYLATE-LACTONASE-RXN

Then if you replace the second colon with "\," you get what should (I
think) actually be the MetaCyc entry:

MetaCyc:2-PYRONE-4\,6-DICARBOXYLATE-LACTONASE-RXN
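
The join described above can be sketched in a few lines of Python. This
is only an illustration of the reasoning, not part of BioPerl or the GO
tooling; the function name rejoin_split_reference is hypothetical, and it
assumes the second fragment always starts with the stray colon.

```python
def rejoin_split_reference(head: str, tail: str) -> str:
    """Join two fragments of a dbxref that was split at an escaped comma.

    `head` is the fragment that still carries the database prefix and
    `tail` is the fragment that starts with a stray colon (both names
    are hypothetical, chosen for this sketch).
    """
    if not tail.startswith(":"):
        raise ValueError(f"tail fragment should start with ':', got {tail!r}")
    # Drop the stray leading colon and put the escaped comma in its place.
    return head + "\\," + tail[1:]


head = "MetaCyc:2-PYRONE-4"
tail = ":6-DICARBOXYLATE-LACTONASE-RXN"
print(rejoin_split_reference(head, tail))
# prints MetaCyc:2-PYRONE-4\,6-DICARBOXYLATE-LACTONASE-RXN
```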

> > I found 43 similar errors for other GOIDs, and it appears to result  
> > from
> > the occurrence of the string "\," in a dbxref - mostly MetaCyc  
> > entries,
> > but also some UM-BBD_pathwayID entries.
> 
> I'm not sure - although the string "\," might indeed trip up the  
> parser, would have to investigate to confirm. Could it be a  
> coincidence with definition_references that lack the database part  
> before the colon?

Inspecting the troublesome entries by eye consistently turns up the same
problem as above: the GO term's record in the GO.defs file is malformed.
The record should have a definition_reference field describing a MetaCyc
entry that matches the term field.  That reference string should contain
an escaped comma, but the string ends at the point where we expect it.
The text that should follow the escaped comma is present instead as the
first definition_reference.
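
A rough way to list the affected records automatically, rather than by
eye, is sketched below in Python. It assumes the "key: value" layout
shown in the records quoted in this thread, with the goid line appearing
before the definition_reference lines; the function name
find_suspect_references is hypothetical.

```python
def find_suspect_references(lines):
    """Return {goid: [references]} for references with no database part.

    Assumes each record lists its goid before its definition_reference
    lines, as in the records quoted in this thread.
    """
    suspects = {}
    goid = None
    for raw in lines:
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # blank line or wrapped continuation text
        key, value = key.strip(), value.strip()
        if key == "term":
            goid = None  # a new record starts here
        elif key == "goid":
            goid = value
        elif key == "definition_reference" and value.startswith(":"):
            # Nothing before the colon, so the database part is missing.
            suspects.setdefault(goid, []).append(value)
    return suspects


record = """\
term: 2,3-dihydroxyindole 2,3-dioxygenase activity
goid: GO:0047528
definition_reference: :3-DIHYDROXYINDOLE-2
definition_reference: :3-DIOXYGENASE-RXN
definition_reference: EC:1.13.11.2
definition_reference: MetaCyc:2
""".splitlines()

print(find_suspect_references(record))
```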

This observation also extends to cases where there should be two
occurrences of "\," in the MetaCyc field, e.g.:

term: 2,3-dihydroxyindole 2,3-dioxygenase activity
goid: GO:0047528
definition: Catalysis of the reaction: 2,3-dihydroxyindole + O2 =
anthranilate + CO2.
definition_reference: :3-DIHYDROXYINDOLE-2
definition_reference: :3-DIOXYGENASE-RXN
definition_reference: EC:1.13.11.2
definition_reference: MetaCyc:2
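
For a record like this one, the repair could be sketched as follows in
Python. This rests on several assumptions that the data may not always
satisfy: the fragments are listed in the order they should be rejoined
(true here, but possibly only because the lines are sorted), exactly one
reference in the record carries the expected database prefix, and that
prefix is known in advance. The function name repair_references and its
database parameter are hypothetical.

```python
def repair_references(refs, database="MetaCyc"):
    """Rejoin colon-prefixed fragments onto the single `database` reference.

    Assumes the fragments appear in joining order and that exactly one
    reference in the record starts with `database + ":"`.
    """
    fragments = [r for r in refs if r.startswith(":")]
    if not fragments:
        return list(refs)  # nothing to repair
    heads = [r for r in refs if r.startswith(database + ":")]
    if len(heads) != 1:
        raise ValueError("cannot tell which reference the fragments belong to")
    # Each fragment's stray leading colon stands in for an escaped comma.
    repaired = heads[0] + "".join("\\," + f[1:] for f in fragments)
    return [repaired if r == heads[0] else r
            for r in refs if not r.startswith(":")]


refs = [
    ":3-DIHYDROXYINDOLE-2",
    ":3-DIOXYGENASE-RXN",
    "EC:1.13.11.2",
    "MetaCyc:2",
]
for ref in repair_references(refs):
    print(ref)
# prints:
# EC:1.13.11.2
# MetaCyc:2\,3-DIHYDROXYINDOLE-2\,3-DIOXYGENASE-RXN
```

The same call with database="UM-BBD_pathwayID" would cover the other
affected prefix, under the same assumptions.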

It then appears that the OBO format files were generated automatically
from the GO flatfiles, which propagated the same error into the square
brackets in each case.

> > and so is something for the GO guys to fix, I guess.
> 
> The lack of a database for certain xrefs surely is. If the escaped  
> comma does throw off the BioPerl parser then that part is for BioPerl  
> to fix. 

I think the problems are now all in the data I downloaded from
http://www.geneontology.org/GO.downloads.shtml - I believe the BioPerl
parser to be innocent of these charges ;)  I've submitted the issue at
the GO site, and with any luck they'll handle it quite soon (if it is in
fact their problem).

> Note that you also have the --computetc switch which will compute the  
> transitive closure for you automatically.

:D Excellent!  Thanks for the pointer, and again for your efforts,

L.

-- 
Dr Leighton Pritchard B.Sc.(Hons) MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA
e:lpritc at scri.ac.uk            w:http://bioinf.scri.ac.uk/lp
gpg/pgp: 0xFEFC205C