[Biopython] Converting from NCBIXML to SearchIO

Fri Feb 14 22:57:25 UTC 2014

Hi Bow,
   regarding the missing .gap_num attribute and others like it: I believe it is reasonable
for BLAST XML output to omit them to save some space when there are no gaps in the alignment,
or identity is 100%, etc. However, the objects instantiated while parsing should still have them.
I don't like some instances of the same class having more attributes than others.
I wouldn't mind a global hook in SearchIO that forces such a strict mode and
affects the default parameters inherited from the BLAST-result-related classes while parsing XML.

   Another issue I see now: I used to step through two iterators in a while loop, checking
that each of the iterators returned a result object (evaluating as True). The reason
for this ugliness was/is two-fold:

   1. "for blah in zip(iter1, iter2):" would only iterate over as many items as the shorter iterator has,
but I wanted to make sure iter1 and iter2 did NOT accidentally have different lengths.
One of the iterators came from the XML output stream, and counting its entries
would need an expensive extra sweep. The items of iter2 could be counted
rather cheaply. However, outside Biopython I could grep through the XML stream.

   2. The second reason for the ugly checks for _record evaluating as True was that
blastall interleaves the XML stream with dummy entries (which NCBIXML.parse() returns as objects
evaluating as False) and also, from time to time, blastall places the very
first result into the stream again. So, I used to check that _record.id is not the same as the _record.id I got
when I started parsing the XML stream (I cache the very first result id, how ugly,
right?). I have already mentioned both issues in Biopython's Bugzilla and on this mailing list and,
notably, reported them to NCBI. Unfortunately, they answered that they won't fix either of them
(look into the archives of this Biopython list from about a year ago or so?).
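
The length check from point 1 can be sketched without a manual while loop. This is only an illustration, assuming Python 3 (on Python 2 the same function is called itertools.izip_longest); the helper name paired() and the _MISSING sentinel are made up for this example and are not part of Biopython.

```python
from itertools import zip_longest

# Unique sentinel object; it cannot collide with any real record.
_MISSING = object()


def paired(iter1, iter2):
    """Yield (a, b) pairs and raise if one iterable ends before the other."""
    for a, b in zip_longest(iter1, iter2, fillvalue=_MISSING):
        # Identity test, not truth test: a record that evaluates as False
        # is still a real item and must not be mistaken for exhaustion.
        if a is _MISSING or b is _MISSING:
            raise ValueError("iterators have different lengths")
        yield a, b
```

Because exhaustion is detected by identity with the sentinel, records that evaluate as False pass through unchanged, and neither iterator has to be counted in advance.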

   Back to the NCBIXML.parse() to SearchIO.parse() transition. It seemed I could replace

if _record:
     ...

with

if _record.id:
    ....

but that is unnecessarily expensive because Python must reach deeper into the object.

Unfortunately, this won't help me deal with the "empty" objects SearchIO creates when no match
was found. I am talking about this XML section, which results in an object evaluating as False although
_record.id gives 'FL40XAE01A1L3P':

     <Iteration>
       <Iteration_iter-num>2</Iteration_iter-num>
       <Iteration_query-ID>lcl|2_0</Iteration_query-ID>
       <Iteration_query-def>FL40XAE01A1L3P length=127 xy=0311_1171 region=1 run=R_2008_12_17_15_21_00_</Iteration_query-def>
       <Iteration_query-len>374</Iteration_query-len>
       <Iteration_stat>
         <Statistics>
           <Statistics_db-num>99</Statistics_db-num>
           <Statistics_db-len>47536</Statistics_db-len>
           <Statistics_hsp-len>0</Statistics_hsp-len>
           <Statistics_eff-space>0</Statistics_eff-space>
           <Statistics_kappa>0.41</Statistics_kappa>
           <Statistics_lambda>0.625</Statistics_lambda>
           <Statistics_entropy>0.78</Statistics_entropy>
         </Statistics>
       </Iteration_stat>
       <Iteration_message>No hits found</Iteration_message>
     </Iteration>

Here is the same through SearchIO:

>>> _record =_blastn_iterator.next()
>>> print _record
Program: blastn (2.2.26)
   Query: FL40XAE01A1L3P (374)
          length=127 xy=0311_1171 region=1 run=R_2008_12_17_15_21_00_
  Target: queries.fasta queries2.fasta
    Hits: 0
>>>
>>> if _record:
...     print "true"
... else:
...     print "false"
...
false
>>>

I understand that the object evaluates as False because it contains no hits and therefore
appears to be "empty", but it is a real result. I understand you want to follow some universal
Biopython logic about empty/non-empty objects, but I don't think it is a good idea in this case.
Or do you want me to check for _record.hits evaluating as True?
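
The behaviour above matches Python's general truth-testing rule: an object that defines __len__ (and no __bool__) is falsy when its length is zero. Below is a small sketch with a hypothetical stand-in class, FakeQueryResult, which is not Biopython's actual implementation; it only assumes that a result object behaves like a container of hits.

```python
class FakeQueryResult:
    """Hypothetical stand-in for a container-like result (not Biopython's class)."""

    def __init__(self, id, hits=()):
        self.id = id
        self.hits = list(hits)

    def __len__(self):
        # Python falls back to __len__ for truth testing when __bool__ is absent.
        return len(self.hits)


empty = FakeQueryResult("FL40XAE01A1L3P")
print(bool(empty))         # False: zero hits, so the object is falsy
print(empty is not None)   # True: the object itself exists
print(len(empty) == 0)     # True: explicit "no hits" test
```

Testing "is not None" and the number of hits separately keeps "valid result without hits" apart from "no object at all".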

In my original pseudocode I had

if _record:
     # either a match was found
     # or no match was found but the object is valid and evaluates as True
else:
     # reached EOF
     # or
     # reached broken XML item interleaved in the stream (just ignore the crap)

That would now read:

if _record.id:
     if _record.hits:
         # a match was found
     else:
         # no match was found
else:
     # reached EOF
     # reached broken XML item interleaved in the stream (just ignore the crap)
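
One possible shape for the checks above, as a sketch only. It assumes a single iterator consumed with a for loop, so EOF is signalled by the loop ending and not by a falsy record, and it assumes dummy entries come back with an empty id. Record and classify() are hypothetical names for this illustration; real SearchIO objects carry much more.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed record.
Record = namedtuple("Record", ["id", "hits"])


def classify(records):
    """Yield (id, status) per usable record; the loop ending is the EOF signal."""
    first_id = None
    for record in records:
        if not record.id:
            continue  # dummy entry interleaved in the stream: ignore it
        if first_id is None:
            first_id = record.id  # cache the id of the very first result
        elif record.id == first_id:
            continue  # the first result showing up again later in the stream
        yield record.id, ("hit" if record.hits else "no hit")
```

For example, list(classify([Record("q1", ["h"]), Record("", []), Record("q2", [])])) gives [("q1", "hit"), ("q2", "no hit")].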

It looks like I can accomplish what I used to have, but I would like to know your opinion and
get some coding-style advice before I get on my way. ;-)

Thank you,
Martin