[Biopython-dev] [Bug 2176] New: XML Blast parser: miscellaneous bug fixes and cleanup

bugzilla-daemon at portal.open-bio.org
Wed Jan 3 21:48:45 UTC 2007


http://bugzilla.open-bio.org/show_bug.cgi?id=2176

           Summary: XML Blast parser: miscellaneous bug fixes and cleanup
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: jmjoseph at andrew.cmu.edu


This follows the discussion started in bug 2051.  The BLAST XML parser now
works (thanks!), but could still use some cleanup.  Here's a list of the
issues I can see so far.  I'll follow up with patches to correct a few.

In Record.py, HSP.identities, HSP.gaps, and HSP.positives are still
defined as (None,None) tuples.  However, in NCBIXML.py, these
variables are set as integers.  I don't see the point of a tuple at
all, at least for NCBIXML.  (I realize it is used in
NCBIStandalone.py.)  Most importantly, the inconsistency makes it
difficult to handle cases where the parameter is not set.  It seems
easiest, though, to just retain the tuple format.
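
To illustrate why the mixed representation is awkward for calling code,
here is a minimal sketch of a helper that accepts either form.  The
function name count_of is hypothetical, and the (count, total) layout of
the tuple is an assumption about what the plain-text parser stores; it
is not taken from Record.py.

```python
def count_of(value):
    """Return the count from an int or a (count, total) tuple.

    Returns None when the field was never set, i.e. it is None or
    still holds the (None, None) default.
    """
    if value is None:
        return None
    if isinstance(value, tuple):
        # Assumed layout: (count, total); an unset field is (None, None).
        return value[0]
    return value


# The three shapes a caller may currently encounter:
unset = count_of((None, None))   # default tuple, nothing parsed
from_xml = count_of(42)          # integer, as an XML parser might set it
from_text = count_of((42, 100))  # tuple, as a text parser might set it
```

Keeping a single representation in both parsers would make a helper
like this unnecessary.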

In the past, I worried that the order of tuple building for
self._blast.gap_penalties or ka_params could cause the tuple to have
an incorrect ordering.  I seem to remember hitting an issue where the
tuple was built with the wrong length, but I can't be specific.  In
general, it still seems odd to me not to just use a list and set each
element by position.  If necessary, one could convert to a tuple when
finished, or use some other approach that does not rely upon order.
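
One order-independent approach could look like the sketch below.  The
function build_gap_penalties and the field names gap_open and gap_extend
are hypothetical, chosen only for illustration; they are not the names
used in NCBIXML.py.

```python
def build_gap_penalties(events):
    """Build a (gap_open, gap_extend) tuple from (name, value) pairs.

    The pairs may arrive in any order; a missing field stays None, so
    the result always has length two.
    """
    fields = {"gap_open": None, "gap_extend": None}
    for name, value in events:
        if name not in fields:
            raise KeyError("unexpected field: %r" % name)
        fields[name] = int(value)
    # Convert to a tuple only once every value has been collected.
    return (fields["gap_open"], fields["gap_extend"])


# The extension penalty arrives first, yet the ordering stays correct.
penalties = build_gap_penalties([("gap_extend", "1"), ("gap_open", "11")])
```

Because each value is stored under its own name, neither the order of
the XML elements nor a missing element can change the tuple's layout.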

Why not use query_len, as defined in the XML file, or query_length
instead of query_letters as a variable name?  In
BlastParser._end_Iteration, self._blast.query_letters is set.  This is
not defined/documented in the Parameters class in Record.py.  Rather,
query_length is defined there.  In the Header class, though, the name
query_letters is used.  There also seems to be some confusion between
num_letters_in_database, num_sequences_in_database, database_letters,
and database_sequences.  Note that even if this naming is not
corrected, NCBIXML.py:186 is wrong: it uses
"self._blast_query_letters" rather than "self._blast.query_letters".
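
The typo is easy to miss because Python raises no error for it.  The
sketch below uses stand-in classes (Blast and Parser are hypothetical,
not the real Bio.Blast classes, and the method names are invented) to
show that assigning to the misspelled name silently creates a new
attribute on the parser and leaves the record untouched.

```python
class Blast:
    """Stand-in for the record object being filled in."""
    query_letters = None


class Parser:
    """Stand-in for the XML parser that owns a record."""

    def __init__(self):
        self._blast = Blast()

    def set_query_letters_buggy(self, text):
        # Underscore instead of dot: creates a new attribute on the parser.
        self._blast_query_letters = int(text)

    def set_query_letters_fixed(self, text):
        # Dot: stores the value on the record, as intended.
        self._blast.query_letters = int(text)


parser = Parser()
parser.set_query_letters_buggy("120")
value_after_buggy_call = parser._blast.query_letters  # still None

parser.set_query_letters_fixed("120")
value_after_fixed_call = parser._blast.query_letters  # now 120
```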

Similarly, why store the bit score and E-value as 'bits' and
'_hsp.expect'/'descr.e' rather than just using bit_score and
evalue, as in the blast XML output?

I make use of <Hsp_align-len> in 2.2.13.  This value is missing
entirely.

The parsing of <Hit_id> and <Hit_def> is confusing.  For example,
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gnl|BL_ORD_ID|0</Hit_id>
  <Hit_def>3377250</Hit_def>
  ...
results in _hit.title set to "gnl|BL_ORD_ID|0 3377250".  I would
rather they remain separate (or that both forms be provided).
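
As a rough sketch of keeping the two values separate while still
offering the combined form, the example below parses the fragment above
with the standard library's xml.etree.ElementTree.  The fragment is
closed off without the elided elements, and the function parse_hit and
the keys hit_id, hit_def and title are hypothetical names, not the
attributes NCBIXML.py actually sets.

```python
import xml.etree.ElementTree as ET

# The fragment from above, without the elided elements.
HIT_XML = """<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gnl|BL_ORD_ID|0</Hit_id>
  <Hit_def>3377250</Hit_def>
</Hit>"""


def parse_hit(xml_text):
    """Return the hit id and definition separately, plus a joined title."""
    hit = ET.fromstring(xml_text)
    hit_id = hit.findtext("Hit_id")
    hit_def = hit.findtext("Hit_def")
    return {
        "hit_id": hit_id,
        "hit_def": hit_def,
        # Joined form, matching what _hit.title holds today.
        "title": "%s %s" % (hit_id, hit_def),
    }


hit = parse_hit(HIT_XML)
```

With both pieces stored, code that needs only the identifier does not
have to split the title string again.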

This is certainly not an exhaustive list.  I'm happy to provide
another patch correcting many of these inconsistencies.  At the
very least, the variable names defined in Record.py should be
used in NCBIXML.py.  May I modify at least the above names to
correspond more closely to the names used in the XML?  I know
I've found this particularly confusing.

-Jacob




