[Biopython] Converting from NCBIXML to SearchIO

Thu Mar 6 19:34:48 UTC 2014

Hi list mates,
  I am rather happy with SearchIO, I haven't found more issues while converting my code. Let's see what devs do with proposed sanitization of objects lacking certain attributes. More impressions at the very end of the email.

Martin Mokrejs wrote:
> Martin Mokrejs wrote:
>> Hi,
>>    I am in the process of conversion to the new XML parsing code written by Bow.
>> So far, I have deciphered the following replacement strings (somewhat written in sed(1) format):
>>
>>
>> /hsp.identities/hsp.ident_num/
>> /hsp.score/hsp.bitscore/
>> /hsp.expect/hsp.evalue/
>> /hsp.bits/hsp.bitscore/
>> /hsp.gaps/hsp.gap_num/
>> /hsp.bits/hsp.bitscore_raw/
> 
> Aside from the fact I pasted twice the _hsp.bits line, my guess was wrong. The code works now but needed the following changes from NCBIXML to SearchIO names:
> 
> /_hsp.score/_hsp.bitscore_raw/
> /_hsp.bits/_hsp.bitscore/
> 
> 
>> /hsp.positives/hsp.pos_num/
>> /hsp.sbjct_start/hsp.hit_start/
>> /hsp.sbjct_end/hsp.hit_end/
>> # hsp.query_start # no change from NCBIXML
>> # hsp.query_end # no change from NCBIXML
>> /record.query.split()[0]/record.id/
>> /alignment.hit_def.split(' ')[0]/alignment.hit_id/
>> /record.alignments/record.hits/
>>
>> /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML (don't remember whether the counts include minus signs of the alignment or not)
>>
>>
>>
>>
>> Now I am uncertain. There used to be hsp.sbjct_length and alignment.length. I think the former length was including the minus sign for gaps while the latter is just the real length of the query sequence.
>>
>> Nevertheless, what did alignment.length transform into? Into len(hsp.query_all)? I don't think hsp.query_span but who knows. ;)
> 
> 
> Answering myself:
> 
> /alignment.hit_id/alignment.id/
> /alignment.length/_record.hits[0].seq_len/
> 
> 
> Other changes:
> 
> _hsp.sbjct/_hsp.hit.seq.tostring() # aligned sequence including dashes [ATGCNatgcn-]
> _hsp.query/_hsp.query.seq.tostring() # aligned sequence including dashes [ATGCNatgcn-]
> _hsp.match/_hsp.aln_annotation['homology']/ # e.g. '||||||||||||||||||||||||||||||||||| |||||||||| |   ||| ||   ||||||| |||||'
> 
> I think the dictionary key should have been better named "similarity".
> 
> 
> 
> The strand does not translate simply to SearchIO, one needs to do:
> /_hsp.strand/(_hsp.query_strand, _hsp.hit_strand)/ # the tuple will be e.g. (1, 1) while I think it used to be under NCBIXML as either ('Plus', 'Plus'), ('Plus, 'Minus'), (None, None), etc.
> 
> 
> 
>>
>>
>>
>> Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks that has been added to SearchIO in 1.63. so, that's all from me now until I upgrade. ;)
> 
> I got around with try/except although it is more expensive than previously sufficient if/else tests:
> 
>         # undo the off-by-one change in SearchIO and transform back to real-life numbers
>         _hit_start = _hsp.hit_start + 1
>         _query_start = _hsp.query_start + 1
> 
>         try:
>             _ident_num = _hsp.ident_num
>         except:
>             _ident_num = 0
> 
>         try:
>             _pos_num = _hsp.pos_num
>         except:
>             _pos_num = 0
> 
>         try:
>             _gap_num = _hsp.gap_num
>         except:
>             # calculate gaps count missing sometimes in legacy blast XML output
>             # see also https://redmine.open-bio.org/issues/3363 saying that also _multimer_hsp_identities and _multimer_hsp_positives are affected
>             _gap_num = _hsp.aln_span - _ident_num
> 
> 
> 
> 
> 
> 
> So far I can conclude, that by transition from NCBIXML to SearchIO I got 30% wallclock speedup, but the most important will be for me whether it will save me memory used for parsing of huge XML files (>100GB uncompressed) . That I don't know yet, am still testing.

After a while of using I could say that now with SearchIO I get at least 2x, mostly 4x faster XML parsing speed (wallclock) and notably, 256GB large XML files from blastn take now only 200-300MB of RAM (unlike 25GB of RAM using NCBIXML before). Congratulations, Bow, seemed silly I couldn't use my laptop to parse such huge files which are generate in a few hours but parsing takes days!

  [ Why I need old blastn and XML is out of question here. ;) 
      -- I just need them because blastn+ with tabular plaintext output does not give me required data. ]

Martin