[Bioperl-l] (no subject)

Mikko Arvas Mikko.Arvas at vtt.fi
Wed Dec 8 07:01:40 EST 2004


Hi,

thank you all so much for your help! But still no progress.
I have SuSE 8.1, BioPerl 1.4, XML::Parser 2.34 and the latest
match.xml from:
ftp://ftp.ebi.ac.uk/pub/databases/interpro
match.xml.gz 2004-11-29

As Dave suggested, just parsing with XML::Parser works fine with:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;
my $pl = XML::Parser->new;
$pl->parsefile('match.xml');
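
For anyone who wants to repeat this check and see the parser's message instead of a bare die, here is a minimal sketch. The sub name xml_error is hypothetical, and ErrorContext => 2 is just one reasonable setting that asks XML::Parser to show two lines either side of the failure.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Return '' if $file is well-formed XML, otherwise the parser's error message.
# (xml_error is a hypothetical helper name, not part of XML::Parser.)
sub xml_error {
    my ($file) = @_;
    my $parser = XML::Parser->new(ErrorContext => 2);
    # parsefile dies on the first well-formedness error, so trap it.
    my $ok = eval { $parser->parsefile($file); 1 };
    return $ok ? '' : $@;
}

# Usage: perl check_xml.pl match.xml
if (@ARGV) {
    my $err = xml_error($ARGV[0]);
    print $err eq '' ? "$ARGV[0]: well-formed\n" : "$ARGV[0]: $err\n";
}
```

Run against the whole file, this should agree with the plain parsefile call above; it only adds context to the error text if there is one.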

But if I do this:
use Bio::SeqIO;
my $infeat = Bio::SeqIO->new('-file'   => "<$opt_i",
			    '-format' => 'interpro' );
while (my $feat = $infeat->next_seq) {
    print $feat->accession_number()."\n";
}

I still get:
not well-formed (invalid token) at line 2, column 53, byte 131 at 
/usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/Parser.pm line 187

from protein id o00408.

And I can still make the problem go away by removing the 2nd & from the line
<interpro id="IPR002073" name="3&apos;5&apos;-cyclic nucleotide 
phosphodiesterase" type="Domain" parent_id="IPR003607">

I can see no difference in the quoting of this entry between the new and
old versions of match.xml.

There are about 2286 lines in match.xml with two & characters, and if I simply run:
tr "&" "_" <match.xml>match_user_friendly.xml

then I can parse match_user_friendly.xml, until the script above happily fills
all the available memory and crashes (but that is another story).
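
On the memory side, one possible way around it is to skip Bio::SeqIO and stream the file with XML::Parser handlers, since plain XML::Parser reads it without complaint. The sketch below is not how the BioPerl interpro module works; protein_ids is a hypothetical helper, and it assumes the id attribute of each <protein> element is the accession wanted.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Call $callback with the id attribute of every <protein> element.
# Start handlers fire as the file is read, so no tree is built in memory.
# (protein_ids is a hypothetical helper name.)
sub protein_ids {
    my ($file, $callback) = @_;
    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $element, %attr) = @_;
                $callback->($attr{id})
                    if $element eq 'protein' && defined $attr{id};
            },
        },
    );
    $parser->parsefile($file);
    return;
}

# Usage: perl list_ids.pl match.xml
if (@ARGV) {
    protein_ids($ARGV[0], sub { print "$_[0]\n" });
}
```

This only lists identifiers; anything that needs the match and location data would have to collect it in the same handler.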

So is this just my system, or does somebody else have the same problem too?
If it is just me, I'll be lazy and use tr; enough time spent already.
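
If the tr route is taken, a narrower variant might be worth considering, because tr also rewrites entities such as &amp; and &lt; that have nothing to do with this error. The sketch below counts the lines with two or more & and replaces only &apos; with a literal apostrophe. Both sub names are hypothetical, and the replacement is only safe under the assumption that every attribute value in match.xml is delimited by double quotes, as in the entry shown here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the lines read from $fh that contain at least two '&' characters.
sub count_double_amp_lines {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        # tr with an empty replacement list only counts, it changes nothing
        $count++ if ($line =~ tr/&//) >= 2;
    }
    return $count;
}

# Replace only the &apos; entity with a literal apostrophe.
# Assumption: attribute values are always delimited by double quotes.
sub unescape_apos {
    my ($line) = @_;
    $line =~ s/&apos;/'/g;
    return $line;
}

# Usage: perl fix_apos.pl match.xml > match_user_friendly.xml
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    print unescape_apos($_) while <$in>;
    close $in;
}
```

Unlike the tr command, this keeps the names readable (3'5'-cyclic rather than 3_apos;5_apos;-cyclic) and leaves all other entities alone.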

Cheers,
mikko

PS. Here is the whole entry, just in case:

<protein id="O00408" name="CN2A_HUMAN" length="941" crc64="9797609B487FD64E">
<interpro id="IPR002073" name="3&apos;5&apos;-cyclic nucleotide 
phosphodiesterase" type="Domain" parent_id="IPR003607">
<match id="PF00233" name="PDEase_I" dbname="PFAM">
<location start="655" end="892" status="T" evidence="HMMPfam" score="0.0" />
</match>
<match id="PR00387" name="PDIESTERASE1" dbname="PRINTS">
<location start="651" end="664" status="T" evidence="FPrintScan" 
score="7.399999999999999E-30" />
<location start="682" end="695" status="T" evidence="FPrintScan" 
score="7.399999999999999E-30" />
<location start="696" end="711" status="T" evidence="FPrintScan" 
score="7.399999999999999E-30" />
<location start="724" end="740" status="T" evidence="FPrintScan" 
score="7.399999999999999E-30" />
<location start="804" end="817" status="T" evidence="FPrintScan" 
score="7.399999999999999E-30" />
<location start="821" end="837" status="T" evidence="FPrintScan" 
score="7.399999999999999E-30" />
</match>
<match id="PS00126" name="PDEASE_I" dbname="PROSITE">
<location start="696" end="707" status="T" evidence="AddProsite" 
score="8.0E-5" />
</match>
<match id="SSF48547" name="PDEase" dbname="SSF">
<location start="573" end="898" status="T" evidence="HMMPfam" 
score="4.38E-43" />
</match>
</interpro>
<interpro id="IPR003018" name="GAF" type="Domain">
<match id="PF01590" name="GAF" dbname="PFAM">
<location start="241" end="377" status="T" evidence="HMMPfam" 
score="5.7E-10" />
<location start="409" end="548" status="T" evidence="HMMPfam" 
score="1.3E-25" />
</match>
<match id="PS50813" name="GAF" dbname="PREFILE">
<location start="396" end="550" status="T" evidence="PrfScan" 
score="11.073" />
</match>
<match id="SM00065" name="GAF" dbname="SMART">
<location start="241" end="387" status="T" evidence="Smart" score="7.3E-18" />
<location start="409" end="558" status="T" evidence="Smart" score="6.1E-38" />
</match>
</interpro>
<interpro id="IPR003607" name="Metal-dependent phosphohydrolase, HD region" 
type="Domain">
<match id="SM00471" name="HDc" dbname="SMART">
<location start="653" end="822" status="T" evidence="Smart" score="1.0E-6" />
</match>
</interpro>
</protein>
Mikko Arvas
VTT Biotechnology

e-mail:            mikko.arvas at vtt.fi
tel:                 +358-(0)9-456 5827
mobile:           +358-(0)44-381 0502
fax:                +358-(0)9-455 2103
mail:               Tietotie 2, Espoo
                       P.O. Box 1500
                       FIN-02044 VTT, Finland



