[Biopython] About BLAST parser

Thu Oct 22 10:19:02 UTC 2009

On Thu, Oct 22, 2009 at 11:06 AM, Manu Tamminen <mavata at gmail.com> wrote:
>
> Hi Peter! Thanks for your prompt reply! I've run the BLAST analysis on a
> supercomputer cluster, saved the results into a XML file and then
> transferred the output file to my computer. I then run the script on my
> computer to parse the results into a tab separated file. With the current
> dataset I have 1115 sequences of around 500 bp each.
> Manu

Based on the Biopython error message, I suspect your XML file is
broken. How big is the XML file (in MB)? There are online tools for
validating XML, but uploading a large file is out of the question. You
could also open the file in a suitable editor, go to the line number
given in the Biopython error message, and look at the file by eye to
see if anything is obviously wrong.
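
If the file is too big for an editor, a few lines of Python can do a
similar check. This is only a sketch using the expat parser from the
Python standard library (not Biopython's own code), and the function
name is my own invention. It streams through the file and reports
where the first well-formedness error is:

```python
import xml.parsers.expat


def find_xml_error(path):
    """Return None if the file is well-formed XML.

    Otherwise return a (line, column, message) tuple describing the
    first error. The file is read in chunks, so large BLAST XML files
    should not need to fit in memory.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        with open(path, "rb") as handle:
            parser.ParseFile(handle)
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return err.lineno, err.offset, message
    return None
```

A truncated file will typically give an error at the very end, while
damage in the middle should show up with an earlier line number.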

It is possible that the XML file was corrupted when you copied it to
your local machine (e.g. a network error). You could try zipping it
up, and then copying it again. It is also possible that the XML file
was corrupted on the disk on the cluster (rare, but this can happen).
In this case you might be able to fix the XML by hand, or re-run the
BLAST search.
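
One way to tell these two cases apart is to compare a checksum of the
file on the cluster with one taken on your own machine. A minimal
sketch, assuming the cluster has something like the md5sum command to
give you the first value (the function name here is made up):

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read one chunk at a time so a big file is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the two values differ, the copy went wrong. If they match, any
damage happened before the transfer.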

Alternatively, it is possible that the file is valid, and the Biopython
parser (or the Python library we use internally) has a bug. As long as
the XML file isn't too big (say under 10MB), you could email it to me
personally (NOT the mailing list) and I can try to have a look at it.

Personally, I would break up the task into several smaller jobs (maybe
six jobs of up to 200 sequences each - or even one sequence per job). On
most clusters this is a good idea anyway, as they can then be
handled by different cluster nodes. For the analysis, you just have
to parse the separate XML files. Any corrupted XML file will then
only affect a few sequences, and checking it or re-running it is
going to be much quicker and easier.
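
As a rough sketch of the splitting step, something like the following
would group the query sequences into batches. It is plain Python and
works on any iterable; the function name and the file names mentioned
in the comments are just placeholders I have picked for illustration:

```python
from itertools import islice


def batches(records, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# With Biopython installed, the records could come from
# SeqIO.parse("queries.fasta", "fasta") (hypothetical file name), and
# each batch could be saved with SeqIO.write(batch, filename, "fasta")
# before submitting one BLAST job per file.
sizes = [len(batch) for batch in batches(range(1115), 200)]
print(sizes)
```

For 1115 sequences and a batch size of 200 this gives six batches, the
last one holding the remaining 115 sequences.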

Peter