[Bioperl-l] Interpro parsing problems - and solutions

Holland, Richard Richard.Holland at agresearch.co.nz
Mon Jan 5 18:58:20 EST 2004


Hi all. A long one this but I hope it's worth the read.

I found a possible bug in Bio/OntologyIO/InterProParser.pm. The default
behaviour for OntologyEngineI implementors is supposed to be a 'simple'
OntologyEngine. However, the code in InterProParser does not default to
anything if no engine is specified - it tries to set up an 'undef'
engine, which fails with a not-yet-implemented message. The other
parsers based on dagflat.pm do exhibit the default behaviour, so to make
InterProParser.pm do the same I added the following line at line 130
(just before "if(lc($eng_type) eq 'simple')"):

	$eng_type = 'simple' unless $eng_type;
	
Another possible bug in the same module - InterProParser.pm assumes that
it has been passed a filename not a filehandle. As OntologyIO allows for
both cases, I have adapted InterProParser.pm to suit:

	sub parse{
	   my $self = shift;

	   my $ret;
	   if ($self->file()) {
	        $ret = $self->{_parser}->parse( Source => {
	                SystemId => $self->file() } );
	   } elsif ($self->_fh()) {
	        $ret = $self->{_parser}->parse( Source => {
	                ByteStream => $self->_fh() } );
	   } else {
	        $ret = undef;
	        $self->throw("You have not provided any InterPro XML to
parse.\n");
	   }
	   $self->_is_parsed(1);
	   return $ret;
	}

Also, in Bio/OntologyIO/Handlers/InterProHandler.pm (note different
module!), a definite bug this time, it does not know about two of the
interpro types - Active_site and Binding_site. Adding lines to
start_element as follows seems to work:

	sub start_element {
	  my ($self, $element) = @_;
	  my $ont = $self->ontology();
	  my $fact = $self->term_factory();

	  if ($element->{Name} eq 'interprodb') {
	    # TWO NEW LINES FOLLOW
	    $ont->add_term($fact->create_object(-identifier =>
"Active_site",
	                                        -name => "Active Site")
);
	    $ont->add_term($fact->create_object(-identifier =>
"Binding_site",
	                                        -name => "Binding Site")
);
	    # CONTINUE WITH ORIGINAL CODE
	    $ont->add_term($fact->create_object(-identifier => "Family",
	                                        -name => "Family") );
	    $ont->add_term($fact->create_object(-identifier => "Domain",
	                                        -name => "Domain") );
	    $ont->add_term($fact->create_object(-identifier => "Repeat",
	                                        -name => "Repeat") );
	    $ont->add_term($fact->create_object(-identifier => "PTM",
	                                 -name => "post-translational
modification"));

There is a problem with the relation loader in
Bio/Ontology/SimpleOntologyEngine.pm as well - get_term_by_identifier
returns an array of @terms, but the get_relationships subroutine calls
it twice and both times expects it to only return a single term. Other
calls to get_term_by_identifier in this module are correct. I have
adapted the two incorrect references in get_relationships to loop
through all terms returned by get_term_by_identifier instead, as
follows:

        # if a term is supplied, add a relationship for the parent to
the term
        # except if the parent is the term itself (we added that one
before)
        if($term && ($parent_id ne $term->identifier())) {
            my @parent_terms =
$self->get_term_by_identifier($parent_id);
            foreach my $parent_term (@parent_terms) {
               push(@rels,
                    $relfact->create_object(-object_term    =>
$parent_term,
                                            -subject_term   => $term,
                                            -predicate_term =>
 
$parent_entry->{$term->identifier},
                                            -ontology       =>
$term->ontology()
                                            )
                    );
            }

        } else {
            # otherwise, i.e., no term supplied, or the parent equals
the
            # supplied term
            my @parent_terms = $term ?
                ($term) : $self->get_term_by_identifier($parent_id);
            foreach my $child_id (keys %$parent_entry) {
                my $rel_info = $parent_entry->{$child_id};

                foreach my $parent_term (@parent_terms) {
                   push(@rels,
                        $relfact->create_object(-object_term    =>
$parent_term,
                                                -subject_term   =>
 
$self->get_term_by_identifier(
 
$child_id),
                                                -predicate_term =>
$rel_info,
                                                -ontology
=>$parent_term->ontology
                                                )
                        );
                }
            }
        }

There is also a bug in Bio/OntologyIO/Handlers/InterProHandler.pm which
does not set the current ontology of new temporary terms when used in
relationships. This causes "cannot call method 'name' on an undefined
value" errors later on when get_relationships tries to find out the name
of the ontology of each relationship. It also does not set the 'name'
values for these terms, so that you get 'cannot insert null into
SG_TERM.NAME' errors when the terms are later committed to the database,
because the name column of sg_term has a 'not null' constraint, at least
in the Oracle implementation of BioSQL. These problems can both be fixed
by altering the call to the create_object function, in sub
_create_relationship, to include the -ontology and -name parameters:

    $term_temp = $ont->engine->add_term( $fact->create_object(
-InterPro_id => $ref_id,
							          -name
=> $ref_id,
	
-ontology => $ont ) );

And in sub start_element in the same module the same fix needs applying
(no need for -ontology here as the existing code following this
statement does this for us explicitly using a term->ontology() call):

    $self->_term(
                 (!defined $term_temp)
                 ? $ont->add_term( $fact->create_object(-InterPro_id =>
$id,
 
-name => $id ) )
                 : $term_temp
                );

Note that this call defaults the name to be the same as the InterPro ID.
Most temporary terms will be overridden with permanent data later in the
process, but for those that slip through without permanent definitions
(possibly a bug in the InterPro XML datafile?) the above change is
necessary to allow them to be inserted into the database and their
relations to other terms kept valid.

Lastly, the definition field of the term table is too short to take, for
example, IPR001967 - I have extended it to the Oracle varchar2 maximum
of 4000 chars (up from 2000) in my BioSQL instance, but found it still
to be too short, so I added a quick fix to Bio/Ontology/Term.pm to
truncate definitions to be no more than this length, but would like to
know if this is a more general problem:

	sub definition {
	    my $self = shift;

	    return $self->{'definition'} = substr(shift,0,4000) if @_;
	    return $self->{'definition'};
	} # definition

Finally, Bio/OntologyIO/Handlers/InterProHandler.pm fails to set the
ontology for the base IS_A, CONTAINS and FOUND_IN relationship types.
This causes a null foreign key error on SG_TERM.ONT_OID when it tries to
insert them into the database. After checking
Bio/Ontology/SimpleGOEngine.pm and finding that it does set the ontology
names for these relationships, I modified InterProHandler to do the same
by adding the following three lines in the constructor (new) method
directly after the instantiation of the three relationship types:

  $is_a_rel->ontology($self->ontology());
  $contains_rel->ontology($self->ontology());
  $found_in_rel->ontology($self->ontology());

With all these changes in place,
"bioperl-db/scripts/biosql/load_ontology.pl --format interpro
interpro.xml" works correctly for the current release of interpro.xml.
It _does_ give many "unique constraint violated" errors on
TERM_RELATIONSHIP while inserting the relationships into BioSQL, but I
suspect this is because the get_relationships function in
InterProHandler.pm does not check to see if a relation has already been
pushed before pushing it onto the array of relations it returns. To fix
this would slow down the code considerably so it's probably best to just
live with the messages.

Are these acceptable modifications or is there a better way to fix these
problems? I admit my Perl may not be the best but hopefully you get the
idea of how my fixes work and may be able to code them in a better way.

If any of these changes are acceptable could someone implement them for
me? I do not have access to CVS.

Finally, if you think that I have been spending too much time trying to
fix these bugs when in fact there is an alternative and fully functional
way to load InterPro XML data into BioPerl/BioSQL, please feel free to
slap me around a bit with a wet haddock. This is my first ever
contribution to an open source project so I could do with some advice on
how to do things better next time.

cheers,
Richard

PS. Totally unrelated but also a database field length problem - the
description field in bioentry is too short to hold SwissProt P05067 - in
my case I have lengthened the field to 1024 chars which does the trick.

PPS. Certain annotations in RefSeqs cause a similar overflow in
seqfeature_qualifier_value.value. I'm guessing that without using CLOBs
for all cases where free text is allowed, we'll never see the end of
these kinds of problems in the Oracle version of BioSQL?

---
Richard Holland
Bioinformatics Database Developer
ITS, Agresearch Invermay x3279


=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================



More information about the Bioperl-l mailing list