[Bioperl-l] bad entries in interpro

Hilmar Lapp hlapp at gmx.net
Sat Nov 27 01:06:57 EST 2004


On Tuesday, November 23, 2004, at 04:30 PM, Robson Francisco de Souza 
wrote:

>
>>> not well-formed (invalid token) at line 2, column 53, byte 131 at
>>> /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm
>>> line 187
>
> Well, I saw no answers on the list, therefore I'm sending the
> problematic entry below:
>
> <protein id="O00408" name="CN2A_HUMAN" length="941"
>  crc64="9797609B487FD64E">
>     <interpro id="IPR002073" name="3&apos;5&apos;-cyclic nucleotide
>     phosphodiesterase" type="Domain" parent_id="IPR003607">
>
> The problem seems to be the "&apos;" annotation at the second line.

Did you try deleting the two &apos; from the entry, and did it then parse 
fine? If not, the &apos; is not the problem. It is one of XML's 
predefined entities, so expat should accept it.
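One way to check is to look at what actually sits at the position expat reports. Below is a minimal sketch in plain Perl, assuming the offending block has been saved to a file of its own. The helper name context_at and the 20-byte window are illustrative choices, not part of BioPerl or XML::Parser. The byte offset in the message appears to be relative to the text handed to the parser, so it may not match the offset in the full file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the bytes around a given offset, with every
# byte outside printable ASCII shown as \xNN, so that stray control or
# high-bit characters become visible.
sub context_at {
    my ($text, $offset, $window) = @_;
    $window = 20 unless defined $window;
    my $start = $offset - $window;
    $start = 0 if $start < 0;
    return '' if $start > length $text;
    my $chunk = substr($text, $start, 2 * $window);
    $chunk =~ s/([^\x20-\x7E])/sprintf("\\x%02X", ord($1))/ge;
    return $chunk;
}

# Usage: perl context.pl <file> <byte offset>
if (@ARGV == 2) {
    my ($file, $offset) = @ARGV;
    open(my $fh, '<:raw', $file) or die "Cannot open $file: $!";
    local $/;    # slurp the whole file
    my $text = <$fh>;
    close($fh);
    print context_at($text, $offset), "\n";
}
```

If the output shows an escaped byte near the reported offset, that character is a more likely culprit than the &apos; entities.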

>
> I also tested if an eval clause could be used to bypass such entries
> without crashing a script. The example script below worked fine and
> reported a problem with the entry above without crashing.

This will work as long as you don't need to resume parsing of the block 
of text that raised the exception, and as long as the file pointer is 
properly advanced. Given the way SeqIO::interpro.pm works, neither seems 
to be a problem.
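To illustrate the two conditions, here is a self-contained sketch of the pattern using a stand-in reader instead of Bio::SeqIO. The names make_reader and read_all are hypothetical. The stand-in advances its position before it dies, which is the property the loop relies on; whether a real parser behaves the same way has to be checked for that parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a record reader: returns a closure that hands
# out one record per call, dies on records starting with BAD, and returns
# undef at the end. The position is advanced before the die, so calling
# it again after an exception moves on to the next record.
sub make_reader {
    my @records = @_;
    my $pos = 0;
    return sub {
        return if $pos >= @records;
        my $rec = $records[$pos++];
        die "cannot parse record '$rec'\n" if $rec =~ /^BAD/;
        return $rec;
    };
}

# Read everything, skipping records that raise an exception.
sub read_all {
    my ($next) = @_;
    my (@good, @errors);
    while (1) {
        my $rec = eval { $next->() };
        if ($@) {    # test for an error before testing for end of input
            push @errors, $@;
            next;
        }
        last unless defined $rec;
        push @good, $rec;
    }
    return (\@good, \@errors);
}

my ($good, $errors) = read_all(make_reader(qw(A BAD1 B)));
print "parsed: @$good; skipped: ", scalar(@$errors), "\n";
# prints "parsed: A B; skipped: 1"
```

The error check comes before the end-of-input check because the record is undefined in both cases; with the order reversed, the loop would stop at the first bad record instead of skipping it.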

>
> Would it be too difficult to make interpro.pm able to parse names like
> the one above?

What throws the exception is the XML parser (expat). Once that has 
happened, there is nothing interpro.pm can do to mitigate it. The only 
remedy is to prepare the text block to be parsed such that it won't 
raise exceptions.
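As a sketch of what such preparation could look like: characters that XML 1.0 does not allow at all (most ASCII control characters) can be stripped from the block before it is handed to the parser. This is only an assumption about the kind of problem involved; nothing above establishes what the invalid token in this entry actually is. The function name strip_invalid_xml_chars is made up for illustration and is not part of interpro.pm.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: remove the ASCII control
# characters that XML 1.0 forbids (everything below 0x20 except tab,
# line feed and carriage return). Bytes >= 0x80 are left untouched,
# so an encoding problem would need separate handling.
sub strip_invalid_xml_chars {
    my ($block) = @_;
    (my $clean = $block) =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F]//g;
    return $clean;
}

# Made-up sample with a vertical tab (0x0B) embedded in the name.
my $block = "<interpro name=\"3&apos;5&apos;-cyclic\x0B nucleotide\"/>\n";
my $clean = strip_invalid_xml_chars($block);
print $clean;
```

The entities themselves pass through unchanged, since they are legal XML.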

	-hilmar

>
> Robson
>
> ##################################################
> #!/usr/bin/perl -w
>
> use strict;
> use Bio::SeqIO;
>
> my $in = Bio::SeqIO->new(-file=>$ARGV[0],
>      -format=>"interpro");
>
> my $i = 0;
> while (1) {
>    $i++;
>    my $seq;
>    # Trap parser exceptions so that one bad entry does not end the run.
>    eval {
>      $seq = $in->next_seq;
>    };
>    # Check for an error before checking for end of input: $seq is
>    # undefined in both cases.
>    if ($@) { print STDERR "Problem parsing sequence $i: $@\n"; next; }
>    last if (!defined $seq);
>    print STDERR $seq->id,"\n";
>    print "<=== ",$seq->id," ===>\n";
>    foreach my $f ($seq->get_all_SeqFeatures) {
>      print $f->gff_string,"\n";
>      foreach my $key ($f->annotation->get_all_annotation_keys) {
>        foreach my $value ($f->annotation->get_Annotations($key)) {
>          print $key,":",$value->as_text,"\n";
>        }
>      }
>    }
> }
>
> exit 0;
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------
