[Biopython-dev] [Biopython (old issues only) - Bug #2176] XML Blast parser: miscellaneous bug fixes and cleanup

redmine at redmine.open-bio.org redmine at redmine.open-bio.org
Fri Nov 11 14:13:16 UTC 2016

Issue #2176 has been updated by Vincent Davis.

Description updated
Assignee changed from Biopython Dev Mailing List to Vincent Davis

Looks like the final comments were never acted on. That is, the removal proposed in: "We could perhaps deprecate record.database_letters immediately, and at a later point, record.query_letters"
I see a TODO in the code to remove database_letters.

I'll create an issue on github.

Bug #2176: XML Blast parser: miscellaneous bug fixes and cleanup

* Author: Jacob Joseph
* Status: New
* Priority: Normal
* Assignee: Vincent Davis
* Category: Main Distribution
* Target version: Not Applicable
* URL: 
This follows the discussion started in bug 2051.  The blast XML parser does now work (Thanks!), but could still use a little work.  Here's a list of the issues I can see now.  I'll follow with patches to correct a few.

In Record.py, HSP.identities, HSP.gaps, and HSP.positives are still
defined as (None,None) tuples.  However, in NCBIXML.py, these
variables are set as integers.  I don't see a point of a tuple at all,
at least for NCBIXML.  (I realize it is used in NCBIStandalone.py).
Most importantly, the inconsistency makes it difficult to handle cases
when the parameter is not set.  It seems easiest, though, to just
retain the tuple format.
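
To illustrate why the mixed types are awkward for callers, here is a minimal sketch. The HSP class below is a hypothetical stand-in, not the actual Record.py code, and the (count, alignment length) meaning of the tuple is an assumption made for the example. The helper has to accept both shapes just to find out whether a value was set.

```python
# Hypothetical stand-in for the HSP class in Record.py; not Biopython's code.
class HSP:
    def __init__(self):
        # Defaults declared as two-element tuples.
        self.identities = (None, None)
        self.positives = (None, None)
        self.gaps = (None, None)


def identity_count(hsp):
    """Return the number of identities, or None if it was never set.

    Accepts both the tuple form and the plain-integer form.
    """
    value = hsp.identities
    if isinstance(value, tuple):
        return value[0]
    return value


unset = HSP()                     # the parser never saw the field
from_xml = HSP()
from_xml.identities = 42          # plain integer, as an XML parser might store it
from_text = HSP()
from_text.identities = (42, 50)   # assumed (count, alignment length) tuple

print(identity_count(unset), identity_count(from_xml), identity_count(from_text))
```

If every parser stored the same shape, the isinstance check would not be needed and "not set" would have a single representation.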

In the past, I worried that the order of tuple building for
self._blast.gap_penalties or ka_params could cause the tuple to have
an incorrect ordering.  I seem to remember hitting an issue where the
tuple was built with the wrong length, but I can't be specific.  In
general, it remains odd to me to not just use a list and set each
element respectively.  If necessary, one could convert to a tuple when
finished or use some other approach that does not rely upon order.
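
As a rough sketch of the list-based approach described above: each value is written to a fixed slot, so the result does not depend on the order in which the parser encounters the elements. The class, slot names, and penalty values here are hypothetical and only illustrate the idea.

```python
# Hypothetical slot layout; names are illustrative only.
GAP_OPEN, GAP_EXTEND = 0, 1


class PenaltyBuilder:
    """Collect gap penalties by position, independent of arrival order."""

    def __init__(self):
        self._slots = [None, None]

    def set_open(self, value):
        self._slots[GAP_OPEN] = value

    def set_extend(self, value):
        self._slots[GAP_EXTEND] = value

    def as_tuple(self):
        # Convert once, when parsing is finished.
        return tuple(self._slots)


in_order = PenaltyBuilder()
in_order.set_open(11)
in_order.set_extend(1)

reversed_order = PenaltyBuilder()
reversed_order.set_extend(1)
reversed_order.set_open(11)

partial = PenaltyBuilder()        # only one of the two values was seen
partial.set_extend(1)

print(in_order.as_tuple(), reversed_order.as_tuple(), partial.as_tuple())
```

With this layout the tuple always has length two, and a missing value shows up as None in its own slot rather than shifting the other value.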

Why not use query_len, as defined in the XML file, or query_length
instead of query_letters as a variable name?  In
BlastParser._end_Iteration, self._blast.query_letters is set.  This is
not defined/documented in the Parameters class in Record.py.  Rather,
query_length is defined there.  In the Header class, though, the name
query_letters is used.  There also seems to be some confusion between
num_letters_in_database, num_sequences_in_database, database_letters,
and database_sequences.  Note that even if this naming is not
corrected, NCBIXML.py:186 is wrong with "self._blast_query_letters"
rather than "self._blast.query_letters".
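
The effect of the underscore-for-dot typo can be shown with a small sketch. Record and Parser below are hypothetical stand-ins, not the NCBIXML.py classes; the point is only that Python silently creates a new attribute on the parser instead of raising an error.

```python
class Record:
    """Hypothetical stand-in for a parsed BLAST record."""

    def __init__(self):
        self.query_letters = None


class Parser:
    """Hypothetical stand-in for the XML parser."""

    def __init__(self):
        self._blast = Record()

    def set_length_buggy(self, value):
        # Typo: this creates a new attribute on the parser itself.
        self._blast_query_letters = value

    def set_length_fixed(self, value):
        # Intended: set the attribute on the record.
        self._blast.query_letters = value


buggy = Parser()
buggy.set_length_buggy(120)
fixed = Parser()
fixed.set_length_fixed(120)

print(buggy._blast.query_letters, fixed._blast.query_letters)
```

In the buggy case the record's query_letters keeps its default, so the value is lost without any exception.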

Similarly, why store the bit score and E-value as 'bits' and
'_hsp.expect'/'descr.e' rather than just using bit_score and
evalue, as in the blast XML output?

I make use of <Hsp_align-len> in 2.2.13.  This value is missing.

The parsing of <Hit_id> and <Hit_def> is confusing.  For example,
results in _hit.title set to "gnl|BL_ORD_ID|0 3377250".  I would
rather they remain separate (or both methods be used).  
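
One way to keep both fields separate while still offering the combined title is sketched below. The XML fragment is an assumption pieced together from the title string quoted above, not the original example, and the Hit class and parse_hit function are hypothetical names rather than Biopython's API.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, assumed from the title string quoted above.
HIT_XML = """
<Hit>
  <Hit_id>gnl|BL_ORD_ID|0</Hit_id>
  <Hit_def>3377250</Hit_def>
</Hit>
"""


class Hit:
    """Hypothetical container that keeps the two fields apart."""

    def __init__(self, hit_id, hit_def):
        self.hit_id = hit_id
        self.hit_def = hit_def

    @property
    def title(self):
        # Combined form, derived on demand from the separate fields.
        return f"{self.hit_id} {self.hit_def}"


def parse_hit(xml_text):
    node = ET.fromstring(xml_text.strip())
    return Hit(node.findtext("Hit_id"), node.findtext("Hit_def"))


hit = parse_hit(HIT_XML)
print(hit.hit_id, hit.hit_def, hit.title, sep="\n")
```

Deriving the title from the stored fields means code that relies on the combined string keeps working, while new code can read the identifier and definition directly.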

This is certainly not an exhaustive list.  I'm happy to provide
another patch correcting many of these inconsistencies.  At the
very least, the variable names defined in Record.py should be
used in NCBIXML.py.  May I modify at least the above names to
correspond more closely to the names used in the XML?  I know
I've found this particularly confusing.


NCBIXML.patch (2.6 KB)
Record.patch (3.37 KB)
no_blast_tuples.patch (1.42 KB)
