[Bioperl-l] Re: bad entries in interpro again

Dave Howorth dhoworth at mrc-lmb.cam.ac.uk
Thu Dec 2 09:04:46 EST 2004


Mikko Arvas wrote:
> Sorry about that I should have tested it before mailing. The problem is 
> not non-ascii characters it seems to be specifically the combination of 
> two & inside individual <>. I tried various combinations and other 
> non-ascii characters (even in abundance) don't break it and a single & 
> does neither.
> 
> Here is again the problematic line:
> <interpro id="IPR002073" name="3&apos;5&apos;-cyclic nucleotide 
> phosphodiesterase" type="Domain" parent_id="IPR003607">
> 
> And its error:
> not well-formed (invalid token) at line 2, column 54, byte 132 at  
> /usr/lib/perl5/vendor_perl/5.8.3/i386-linux-thread-multi/XML/Parser.pm
> line 187
> 
> So which way to proceed?

I think some extra details might make it easier to see what is going on.

Which file are you scanning? A new version of Interpro has been released 
since your original post, so I suggest giving a URL on the Interpro FTP 
site so that everybody can be sure of looking at the same file. I have 
just run the Sun XML validator on 
ftp://ftp.ebi.ac.uk/pub/databases/interpro/match.xml.gz (after unpacking 
it) and it validates as correct XML.

What version of XML::Parser are you using? I have just parsed that file 
with no errors using XML::Parser V2.34 on Suse 9.1 and this test script:

#!/usr/bin/perl
use strict;
use warnings;

use XML::Parser;

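# No handlers are installed, so this only checks well-formedness;
# parsefile dies with a message on the first error.
my $pl = XML::Parser->new;
$pl->parsefile('match.xml');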

So on the surface, the problem doesn't seem to be with either the 
Interpro data or the XML parser.
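
To check the cited element on its own, independently of any file, a 
minimal sketch along these lines could be used. The helper name 
parse_error is made up for illustration, and the closing </interpro> tag 
is added here only so that the fragment is a complete document. 
ErrorContext is a documented XML::Parser option that adds surrounding 
input lines to the error message. Since &apos; is a predefined XML 
entity, I would expect the element to be reported as well-formed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Parser;

# Hypothetical helper: return undef if $xml is well-formed,
# otherwise the parser's error message.
sub parse_error {
    my ($xml) = @_;
    # ErrorContext => 1 adds one line of input either side of the error.
    my $parser = XML::Parser->new(ErrorContext => 1);
    my $ok = eval { $parser->parse($xml); 1 };
    return $ok ? undef : $@;
}

# The element from the report, closed so that it is a complete document.
my $cited = '<interpro id="IPR002073" '
          . 'name="3&apos;5&apos;-cyclic nucleotide phosphodiesterase" '
          . 'type="Domain" parent_id="IPR003607"></interpro>';

my $err = parse_error($cited);
print defined $err ? "error: $err" : "well-formed\n";
```

If this fragment parses cleanly with your XML::Parser version too, then 
the two &apos; entities are unlikely to be the cause.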

The file contains many lines identical to the one cited, all of which 
are valid XML in accordance with the Interpro DTD, but none of them is 
on line 2! So it looks as though different data has been passed to 
XML::Parser.
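
One way to see what the parser was actually given is to print the line 
named in the error message from whatever input your script reads. The 
following is only a sketch: the helper name line_of_file is made up for 
illustration, and it assumes the input is a plain file on disk rather 
than a string or a stream built up inside your program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return line number $want (1-based) of $file,
# or undef if the file has fewer lines than that.
sub line_of_file {
    my ($file, $want) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        if ($. == $want) {
            close $fh;
            chomp $line;
            return $line;
        }
    }
    close $fh;
    return;
}

# Usage: perl this_script FILE [LINE]
# LINE defaults to 2, the line reported in the error above.
if (@ARGV) {
    my ($file, $want) = @ARGV;
    $want = 2 unless defined $want;
    my $line = line_of_file($file, $want);
    print defined $line
        ? "$want: $line\n"
        : "$file has fewer than $want lines\n";
}
```

If line 2 of that input turns out to be the <interpro ...> element, the 
parser is being handed something other than the file from the FTP site.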

Cheers, Dave
-- 
Dave Howorth
MRC Centre for Protein Engineering
Hills Road, Cambridge, CB2 2QH
01223 252960
