[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?

Tue Sep 25 11:03:15 UTC 2012

On Tue, Sep 25, 2012 at 11:34 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> The same goes with the attribute names. I suppose I
>> could add one table in the draft tutorial to list the
>> new attribute names, but I prefer not to have any
>> Bio.Blast-compatible names in the code.
>
>>> I would have to lookup how the record attributes
>>> changed(=renamed) from those specific for blast to
>>> those generalized and used(=promoted)
>>> by SearchIO.
>
>>> I see, hsp.sbjct_start is renamed to hsp.hit_start ...
>
> I would suggest to use the same names as in the XML
> source file. Then we are consistent with NCBI, we don't
> have to come up with our own names, and we won't
> have to provide a list of biopython-defined record
> attributes. Dropping the "Hsp" in <Hsp_hit-from>, that
> would be "hit-from".

We can't be fully consistent with the NCBI since they
have more than one naming convention ;)

Personally I find the NCBI's human readable column
names used in the tabular output far nicer than the
verbose terms in the XML, which are not really human
readable, e.g.

     slen means Subject sequence length
   qstart means Start of alignment in query
     qend means End of alignment in query
   sstart means Start of alignment in subject
     send means End of alignment in subject
     qseq means Aligned part of query sequence
     sseq means Aligned part of subject sequence

The term 'subject' for the hit sequence is quite BLAST
specific, but otherwise these terms are reasonably broad
and could make sense in SearchIO beyond BLAST
(assuming you don't find shortening the subject/query
prefix to a single letter confusing).
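
To show how those short names read in practice, here is a
minimal sketch that loads one tabular line into a dict keyed
by the column names above. The column order and the data
line are made up for illustration; the order has to match
whatever columns were requested when the file was written.

```python
import csv
import io

# Column names taken from the list above. The order is an assumption:
# it must match the columns requested when the tabular file was written.
COLUMNS = ["qstart", "qend", "sstart", "send", "slen"]

# Made-up, tab separated example line (not real BLAST output).
example = "1\t120\t35\t154\t900\n"

reader = csv.DictReader(io.StringIO(example), fieldnames=COLUMNS, delimiter="\t")

# Every column chosen here is numeric, so convert the strings to int.
rows = [{name: int(value) for name, value in row.items()} for row in reader]

print(rows[0]["sstart"], rows[0]["send"])
```

The single letter prefix keeps the keys short, at the cost
of having to remember that "s" is the subject (hit) and "q"
is the query.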

Currently the HSP object in SearchIO uses hit_start,
hit_end, query_start and query_end - but also note
that we're using Python counting (zero-based, end
exclusive), so the values differ from the one-based
numbers in the BLAST output.
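
As a rough sketch of what the counting difference means,
assuming BLAST reports one-based inclusive positions and
that "Python counting" is zero-based with an exclusive end:
the helper name to_python_counting is hypothetical and is
not part of SearchIO, and the numbers are invented.

```python
def to_python_counting(start, end):
    """Turn a one-based inclusive pair into a zero-based, end-exclusive pair.

    Hypothetical helper for illustration only. The pair is sorted first,
    so a reversed pair (start > end) gives the same slice bounds.
    """
    low, high = sorted((start, end))
    return low - 1, high


# Invented example: sstart=35, send=154 on a subject of length 900.
hit_start, hit_end = to_python_counting(35, 154)

# The converted pair can be used directly as a slice.
subject = "N" * 900
aligned = subject[hit_start:hit_end]
print(hit_start, hit_end, len(aligned))
```

With this convention hit_end - hit_start gives the length of
the aligned region directly, with no "+ 1" needed.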

Peter