[Bioperl-l] GO categories and load_ontology.pl

Law, Annie Annie.Law at nrc-cnrc.gc.ca
Tue Apr 27 10:41:53 EDT 2004


Hi Hilmar and other bioperl enthusiasts,

Thanks for your response. I now have a better understanding of why about
200 entries are not in GO. My focus now is on the items that are in GO
and are not obsolete.

My main concern now is this: given a GO id, how do I find out which GO
category it belongs to? I want to know whether it belongs to molecular
function, biological process, or cellular component, using the simplest
method available with bioperl. I don't see any obvious way of doing this,
since all of the entries in the term table have Ontology Id = 1 (Gene
ontology).

Another helpful bioperler suggested that I could check whether my term is
a child of the GO id for molecular function, biological process, or
cellular component. This could be done by looking at the path table, if
the transitive closure was computed. I am not too sure how to go about
doing this and would be open to other solutions.
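
The "child of a root term" idea can be sketched outside the database as
follows. This is only an illustration in Python, not bioperl or BioSQL
code: the function name go_category and the small child-to-parent map are
hypothetical, and the handful of edges is a toy fragment typed in by hand,
not loaded from a GO release. The three root ids are the GO ids of the
category terms themselves.

```python
# Root terms of the three GO categories.
GO_ROOTS = {
    "GO:0003674": "molecular function",
    "GO:0008150": "biological process",
    "GO:0005575": "cellular component",
}


def go_category(term_id, parents):
    """Walk parent links upward from term_id until a root term is reached.

    `parents` maps a GO id to a list of its direct parent ids (hypothetical
    in-memory stand-in for the relationship data in the database).
    Returns the category name, or None if no root is reachable.
    """
    seen = set()
    stack = [term_id]
    while stack:
        current = stack.pop()
        if current in GO_ROOTS:
            return GO_ROOTS[current]
        if current in seen:
            continue  # already visited via another parent
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return None


# Toy fragment for illustration only (child -> direct parents).
parents = {
    "GO:0016301": ["GO:0016772"],  # kinase activity
    "GO:0016772": ["GO:0016740"],
    "GO:0016740": ["GO:0003824"],  # transferase activity
    "GO:0003824": ["GO:0003674"],  # catalytic activity
}

print(go_category("GO:0016301", parents))
```

If the transitive closure has been computed, the upward walk is not needed:
every ancestor of a term is already stored, so checking for one of the
three root ids among a term's ancestors should be a single lookup.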

The other alternative I can think of is to load the GO.terms_and_ids file,
which is provided by Gene Ontology, into my database. This is basically a
tab-delimited text file which includes the GO id and a letter such as F,
P, or C next to the id to indicate the category:
F = molecular function, P = biological process, C = cellular component.
This is not really my first choice if the same information is already
made available by the load_ontology.pl script, which I have already used.
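
For reference, reading such a file into a lookup table could look roughly
like the sketch below. It is written in Python for illustration and rests
on assumptions not stated above: that each data line has the GO id in the
first tab-separated column and the F/P/C letter in the last one, and that
lines starting with "!" are comments. The function name load_go_aspects
and the sample lines are made up for the example.

```python
import io

# Letter codes as described above.
ASPECTS = {
    "F": "molecular function",
    "P": "biological process",
    "C": "cellular component",
}


def load_go_aspects(handle):
    """Return a dict mapping GO id -> category name.

    Assumes tab-delimited lines with the GO id first and the F/P/C letter
    last; comment lines ("!"), blank lines and malformed lines are skipped.
    """
    aspects = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        go_id, code = fields[0].strip(), fields[-1].strip()
        if code in ASPECTS:
            aspects[go_id] = ASPECTS[code]
    return aspects


# Sample input in the assumed layout (id, term name, letter).
sample = io.StringIO(
    "! assumed comment line\n"
    "GO:0000001\tmitochondrion inheritance\tP\n"
    "GO:0003674\tmolecular_function\tF\n"
    "GO:0005575\tcellular_component\tC\n"
)
go_aspects = load_go_aspects(sample)
print(go_aspects["GO:0000001"])
```

With a real file, the io.StringIO object would be replaced by an opened
file handle, and the resulting dict could be used to fill an extra column
or table in the database.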

Thanks very much,
Annie.




-----------------------
To: Law, Annie
Cc: 'bioperl-l at bioperl.org'
Subject: Re: [Bioperl-l] GO categories and load_ontology.pl


Annie, I still owe you an answer for your earlier email. I haven't 
managed to get to that yet. See below for my response to this one.

On Wednesday, March 17, 2004, at 08:50  AM, Law, Annie wrote:

> It seems that most of the entries in the term table are of Ontology Id
> = 1 (Gene ontology), and only around 200 entries are molecular
> function, biological process, and cellular component put together, when
> there are about 16000 entries in the term table.
> This is only true if I load locuslink into the database.

This is because LocusLink lags behind the latest version of GO in terms 
of the release it uses for annotating sequences. I.e., LocusLink 
uses some terms which have meanwhile been retired or obsoleted from GO. 
Either those terms are still in GO's .defs file but are missing from 
your database because you chose to ignore obsoleted entries (which is not 
a bad choice at all per se), or they aren't part of GO anymore at all.

LocusLink doesn't give the ontology of GO terms (which would be 'Gene 
Ontology'); rather it gives the category. Because a term must have an 
ontology associated, the SeqIO LL parser interprets as the ontology 
what NCBI really meant to be the category.

You'd have the following choices to proceed.

	- Ignore the 200 entries which aren't in Gene Ontology. You're not 
going to miss a significant amount of your annotation, and it's 
annotation with obsoleted terms anyway.

	- Load GO including obsoleted terms, and see how many non-Gene 
Ontology terms that leaves you with. If it's a lot fewer than 200, you 
may just want to ignore the rest.

	- Build a SeqProcessor module (see Bio::Factory::SeqProcessorI and 
Bio::Seq::BaseSeqProcessor) which takes the seq objects as the LL 
parser returns them, goes in and retrieves all GO term annotations, and 
replaces the ontology for those with 'Gene Ontology.' Then you pass 
your SeqProcessor to load_seqdatabase.pl using the --pipeline 
command-line option (see the script's POD).

The last option may sound like a lot of work, but it really isn't if you 
can program Perl. Note, however, that you still wouldn't have any 
relationships for those terms; they have simply been retired.

Depending on what your project is, just ignoring those 200 may be the 
most reasonable way to go.

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------


