[Biopython-dev] Blast records

Peter biopython at maubp.freeserve.co.uk
Wed Sep 23 10:34:42 EDT 2009


On Wed, Sep 23, 2009 at 2:51 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> --- On Tue, 9/22/09, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> As I recall (backed up by what I wrote in the tutorial),
>> when I last checked, the plain text PSI-BLAST output
>> (i.e. from the command line tool blastpgp) included a
>> lot of information missing in the XML output. Perhaps
>> this has improved? If it hasn't, I am inclined to leave
>> things as they are. If the current PSI-BLAST outputs
>> more details in the XML we may be able to do a better job.
>
> As far as I can tell, the XML contains the same information
> as the plain-text psiblast output, but the XML parser doesn't
> parse it correctly, since it assumes it is dealing with regular
> blast rather than psi-blast.

It sounds like the NCBI have changed the PSI-BLAST XML
output then.

>> The next bit is my recollection of some of the background
>> to this:
>> Classic BLAST (and also RPS-BLAST) allow multiple queries
>> and use the "iteration" block in the XML file for each query.
>> This was an odd choice of naming, but I think the XML tag was
>> originally only intended for the PSI-BLAST output where each
>> "iteration" block in the XML corresponds to each step of the
>> algorithm. You may recall early versions of BLAST would output
>> "concatenated" XML files for multiple queries - which were not
>> true XML files.
>
> That is correct. To make things more complex, if you run
> psi-blast with multiple queries you get concatenated XML
> files again, with the iteration blocks corresponding to the
> psi-blast iterations for each query.

Odd - and arguably a bug, since a file holding several
concatenated documents isn't well-formed XML.
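
As a stopgap, concatenated output could be split back into separate
documents before parsing. The following is only a minimal sketch: it
assumes every document starts with its own "<?xml" declaration, and the
helper name split_concatenated_xml and the sample data are made up for
illustration. It is not what Bio.Blast does.

```python
import re
import xml.etree.ElementTree as ET


def split_concatenated_xml(text):
    """Split text holding several XML documents back to back.

    Assumption: each document begins with its own <?xml ...?> declaration.
    """
    starts = [m.start() for m in re.finditer(r"<\?xml\s", text)]
    if not starts:
        # No declaration found: treat the whole text as one document.
        return [text] if text.strip() else []
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


# Made-up sample: two tiny documents glued together, one per query.
sample = (
    '<?xml version="1.0"?>\n'
    "<BlastOutput>"
    "<BlastOutput_query-def>query1</BlastOutput_query-def>"
    "</BlastOutput>\n"
    '<?xml version="1.0"?>\n'
    "<BlastOutput>"
    "<BlastOutput_query-def>query2</BlastOutput_query-def>"
    "</BlastOutput>\n"
)

docs = split_concatenated_xml(sample)
for doc in docs:
    root = ET.fromstring(doc)  # each piece parses on its own
    print(root.findtext("BlastOutput_query-def"))
```

Passing the unsplit sample to ET.fromstring raises ParseError, which is
the problem described above.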

>> I guess they fixed this by reusing the existing "iteration"
>> structure for multiple queries (rather than adding new XML
>> tags). With this in mind the current parsing of the XML from
>> PSI-BLAST makes sense.
>
> I don't know if it really makes sense. For a single psi-blast
> query, we're getting multiple Blast records. For multiple
> psi-blast queries, we're iterating over the iteration blocks
> while ignoring the fact that they can come from different
> queries.

Is a single Blast record object for each PSI-BLAST iteration
such a bad thing?

> Ideally, we should be able to see from the XML whether
> it was regular blast with multiple queries, or psi-blast with
> a single query. Right now that is possible by looking at
> the query-def lines, but I wonder if NCBI is considering
> a better solution for this. I'll write an email to them to find out.

Certainly clarification from the NCBI sounds useful.
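
In the meantime, the query-def approach mentioned above could be
sketched roughly as follows. This is an illustration only: the
IterationRecord tuple is a hypothetical stand-in for whatever the parser
returns per iteration block, and the function names are invented. The
heuristic is also imperfect, since a PSI-BLAST run that stops after one
iteration looks like regular BLAST, and two consecutive regular queries
with identical query-def lines look like PSI-BLAST.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for one parsed <Iteration> block.
IterationRecord = namedtuple("IterationRecord", ["query_def", "iter_num"])


def group_by_query(records):
    """Collect consecutive records sharing a query-def, one list per query."""
    return [
        (query_def, list(group))
        for query_def, group in groupby(records, key=lambda r: r.query_def)
    ]


def looks_like_psiblast(records):
    """Guess PSI-BLAST if any query owns more than one iteration block."""
    return any(len(group) > 1 for _, group in group_by_query(records))


# Made-up data: two queries, three PSI-BLAST rounds for the first one.
psi_records = [
    IterationRecord("query1", 1),
    IterationRecord("query1", 2),
    IterationRecord("query1", 3),
    IterationRecord("query2", 1),
]
# Made-up data: regular BLAST, one iteration block per query.
plain_records = [
    IterationRecord("query1", 1),
    IterationRecord("query2", 2),
]

for query_def, group in group_by_query(psi_records):
    print(query_def, len(group))
print(looks_like_psiblast(psi_records), looks_like_psiblast(plain_records))
```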

Peter

