[Bioperl-l] bad entries in interpro

Fri Nov 26 21:58:35 EST 2004

The problem is that iprscan (the program that produces the InterPro XML 
files) does not properly XML-escape some characters.  There is an if-block 
in the module that tries to catch things like quotes and ampersands, but 
it's by no means exhaustive.

The preferred solutions to this, in order of descending difficulty for you,
are to:

A) complain to the InterPro authors/maintainers and get them to produce valid
XML.

B) write an if-block that will exhaustively escape characters that should 
be escaped.

C) hack the current if-block to support your special character.
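
For case B, something along these lines might be a starting point.  It is 
only a sketch: escape_xml_value is a made-up name (nothing in interpro.pm is 
called that), and it assumes you have already pulled the attribute value out 
of the line, since running it over a whole line would mangle the markup 
itself.  It covers the five predefined XML entities and leaves 
already-escaped references alone.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of interpro.pm: escape the five characters
# that XML reserves, given a bare attribute value or text node.
sub escape_xml_value {
    my ($value) = @_;

    # Escape '&' first, but skip ampersands that already start an entity
    # or character reference (e.g. &apos; or &#39;), so nothing is escaped twice.
    $value =~ s/&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)/&amp;/g;

    $value =~ s/</&lt;/g;
    $value =~ s/>/&gt;/g;
    $value =~ s/"/&quot;/g;
    $value =~ s/'/&apos;/g;

    return $value;
}

print escape_xml_value(q{3'5'-cyclic AMP & "GMP"}), "\n";
# prints: 3&apos;5&apos;-cyclic AMP &amp; &quot;GMP&quot;
```

The ampersand substitution has to run before the others, otherwise the 
entities they generate would themselves get escaped.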

I'll be happy to merge in a patch for you for case B or C.  Go ahead and 
modify the module, and run:

% diff -Bbup interpro.pm interpro.pm.new > interpro.patch

and post the file to the list.

-allen

On Tue, 23 Nov 2004, Robson Francisco de Souza {S} wrote:

> Hi everyone,
> 
> A few days ago, Mikko Arvas sent an e-mail to this list asking how to
> ignore bad entries in the matches.xml file from the InterPro database.
> Hilmar Lapp answered asking him to locate the position in the file that
> raises the error message 
> 
> >> not well-formed (invalid token) at line 2, column 53, byte 131 at 
> >> /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm 
> >> line 187
> 
> Well, I saw no answers on the list, therefore I'm sending the problematic
> entry below:
> 
> <protein id="O00408" name="CN2A_HUMAN" length="941" 
>  crc64="9797609B487FD64E">
>     <interpro id="IPR002073" name="3&apos;5&apos;-cyclic nucleotide
>     phosphodiesterase" type="Domain" parent_id="IPR003607">
> 
> The problem seems to be the "&apos;" annotation at the second line.
> 
> I also tested if an eval clause could be used to bypass such entries
> without crashing a script. The example script below worked fine and
> reported a problem with the entry above without crashing.
> 
> Would it be too difficult to make interpro.pm able to parse names like
> the one above?
> 
> Robson
> 
> ##################################################
> #!/usr/bin/perl -w
> 
> use strict;
> use Bio::SeqIO;
> 
> my $in = Bio::SeqIO->new(-file=>$ARGV[0],
>      -format=>"interpro");
> 
> my $i=1;
> while (1) {
>    my $seq;
>    eval {
>      $seq = $in->next_seq;
>    };
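>    # Check for a parse error first: $seq is also undefined after a failed
>    # eval, so testing it first would end the loop without reporting anything.
>    if ($@) { print STDERR "Problem parsing sequence $i...\n"; $i++; next };
>    last if (!defined $seq);
>    print STDERR $seq->id,"\n";
>    print "<=== ",$seq->id,"===>\n";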
>     foreach my $f ($seq->get_all_SeqFeatures) {
>       print $f->gff_string,"\n";
>       foreach my $key ($f->annotation->get_all_annotation_keys) {
>         foreach my $value ($f->annotation->get_Annotations($key)) {
>           print $key,":",$value->as_text,"\n";
>         }
>       }
>     }
>     $i++;
> }
> 
> exit 0;