[Bioperl-l] parse multi xml

Jordi Durban jordi.durban at gmail.com
Mon Nov 22 18:34:03 EST 2010


Thanks Peter.
That's exactly what I was looking for, but so far I haven't been able to do
it properly.
Any ideas?


2010/11/22 Peter <biopython at maubp.freeserve.co.uk>

> On Mon, Nov 22, 2010 at 8:01 PM, Jordi Durban <jordi.durban at gmail.com>
> wrote:
> > Hi all,
> > I'm a newbie on the list, although I've been using BioPerl for 2 years.
> > Now I have a problem with an XML file and I don't know how to parse it.
> > The file contains 795 XML declarations (that is, <?xml version="1.0"?>)
> > because it was produced by the Blast2GO software, and I suppose it is the
> > result of concatenating multiple BLAST results.
>
> Such a file is NOT a valid XML file (but see below); you can't just
> concatenate XML files. I'm pretty sure people have posted scripts
> to fix such files on the blast2go mailing list.
>
> > Well, I would like to split the 795 XML chunks into 795 separate
> > files in order to parse them, looking for the best hit.
> > The problem appears when using the blastxml parser
> > (Bio::SearchIO::blastxml) because (and that's a personal opinion)
> > there's another top tag that is not expected, and I get an error
> > message once the first BLAST result has been parsed.
> > How can I do that split?
> > I hope I was clear.
> > Thanks
>
> Historically the NCBI standalone BLAST used to create these
> concatenated XML files when used on multiple queries. It has
> since been fixed, but perhaps BioPerl has code still in it to
> handle these legacy invalid XML files?
>
> My suggestion (until a BioPerl guru speaks up) would be to
> split the file into chunks (in memory) by looking for the string
> <?xml version="1.0"?>, and parsing each chunk individually.
> Each chunk should be a valid XML file on its own.
>
> Peter
>
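
For reference, the split Peter describes above might look something like
the sketch below. It is written in Python with only the standard library,
not BioPerl, and it is an illustration rather than a tested fix for
Blast2GO output. The function name split_concatenated_xml and the toy
<doc>/<id> elements are made up for the example, and it assumes the string
"<?xml" appears only at the start of each concatenated document.

```python
import xml.etree.ElementTree as ET

XML_DECL = "<?xml"  # assumed to occur only at the start of each document


def split_concatenated_xml(text):
    """Return one string per XML document found in `text` (hypothetical helper)."""
    chunks = []
    start = text.find(XML_DECL)
    while start != -1:
        # The next declaration, if any, marks the end of the current chunk.
        end = text.find(XML_DECL, start + len(XML_DECL))
        chunk = text[start:] if end == -1 else text[start:end]
        chunks.append(chunk.strip())
        start = end
    return chunks


# Toy input: two tiny documents glued together, standing in for BLAST output.
sample = (
    '<?xml version="1.0"?>\n<doc><id>q1</id></doc>\n'
    '<?xml version="1.0"?>\n<doc><id>q2</id></doc>\n'
)

for chunk in split_concatenated_xml(sample):
    root = ET.fromstring(chunk)  # each chunk parses as a standalone document
    print(root.findtext("id"))
```

The same idea should carry over to Perl: split the file contents on the
XML declaration, then hand each chunk (or a file written from it) to the
parser separately.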



-- 
Jordi


