[Bioperl-l] Re: bad interpro entries

Dave Howorth dhoworth at mrc-lmb.cam.ac.uk
Wed Dec 8 07:32:17 EST 2004


Mikko Arvas wrote:
> Thank you all so much for your help! But still no progress.
> I have SuSE 8.1, BioPerl 1.4, XML::Parser 2.34, and the latest
> match.xml from:
> ftp://ftp.ebi.ac.uk/pub/databases/interpro
> match.xml.gz 2004-11-29
> 
> As Dave suggested, just parsing with XML::Parser works fine with:

> But if I do this:
> my $infeat = Bio::SeqIO->new('-file'   => "<$opt_i",
>                              '-format' => 'interpro');
> while (my $feat = $infeat->next_seq) {
>     print $feat->accession_number() . "\n";
> }
> 
> I still get:
> not well-formed (invalid token) at line 2, column 53, byte 131 at 
> /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm 
> line 187
> 
> from protein id o00408.

 > PS. Here is the whole entry again, just in case:

Well, I tested this entry with the validator and with that little test
script, and it appears to be good data. How did you obtain it? Was it
the way Hilmar suggested:

 > There is no other editing of the chunks going on though except for a
 > haphazard substitution of certain double-quotes. In order to see the
 > chunk before it gets sent to the parser instance edit
 > Bio/SeqIO/interpro.pm and before the line
 >
 >       $self->parse_xml($xml_fragment);
 >
 > put a print statement that prints out the content of $xml_fragment.
 > That should also give the exact source XML that trips up the parser.

If you printed it some other way, I'd suggest trying Hilmar's approach
next. If you did print it that way, call in the wizards!
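
If it helps, here is a rough sketch of a helper for looking at what
actually sits at the offset the parser complains about. It is my own
illustration, not part of BioPerl: the name context_at_byte is made up,
and it assumes the "byte 131" in the error is counted from the start of
the fragment handed to the parser, which I have not verified.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): return the text around
# a given offset, with every character outside printable ASCII shown
# as a \xNN escape so stray control or high-bit bytes become visible.
sub context_at_byte {
    my ($data, $offset, $radius) = @_;
    $radius = 20 unless defined $radius;

    # Clamp the window start to the beginning of the data.
    my $start = $offset - $radius;
    $start = 0 if $start < 0;

    # Nothing to show if the window starts past the end of the data.
    return '' if $start >= length $data;

    my $chunk = substr($data, $start, 2 * $radius);
    $chunk =~ s/([^\x20-\x7e])/sprintf('\\x%02X', ord($1))/ge;
    return $chunk;
}

# Assumed usage, placed just before $self->parse_xml($xml_fragment)
# in Bio/SeqIO/interpro.pm (131 is the offset from the error message):
#
#   print STDERR context_at_byte($xml_fragment, 131), "\n";
```

Newlines show up as \x0A, so the output stays on one line, and anything
that is not plain ASCII near the reported position should stand out.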

Cheers, Dave
-- 
Dave Howorth
MRC Centre for Protein Engineering
Hills Road, Cambridge, CB2 2QH
01223 252960


