[Biopython] Converting from NCBIXML to SearchIO

Sat Feb 15 16:28:18 UTC 2014

Martin Mokrejs wrote:
> Hi,
>    I am in the process of conversion to the new XML parsing code written by Bow.
> So far, I have deciphered the following replacement strings (somewhat written in sed(1) format):
>
>
> /hsp.identities/hsp.ident_num/
> /hsp.score/hsp.bitscore/
> /hsp.expect/hsp.evalue/
> /hsp.bits/hsp.bitscore/
> /hsp.gaps/hsp.gap_num/
> /hsp.bits/hsp.bitscore_raw/

Aside from the fact I pasted twice the _hsp.bits line, my guess was wrong. The code works now but needed the following changes from NCBIXML to SearchIO names:

/_hsp.score/_hsp.bitscore_raw/
/_hsp.bits/_hsp.bitscore/

> /hsp.positives/hsp.pos_num/
> /hsp.sbjct_start/hsp.hit_start/
> /hsp.sbjct_end/hsp.hit_end/
> # hsp.query_start # no change from NCBIXML
> # hsp.query_end # no change from NCBIXML
> /record.query.split()[0]/record.id/
> /alignment.hit_def.split(' ')[0]/alignment.hit_id/
> /record.alignments/record.hits/
>
> /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML (don't remember whether the counts include minus signs of the alignment or not)
>
>
>
>
> Now I am uncertain. There used to be hsp.sbjct_length and alignment.length. I think the former length was including the minus sign for gaps while the latter is just the real length of the query sequence.
>
> Nevertheless, what did alignment.length transform into? Into len(hsp.query_all)? I don't think hsp.query_span but who knows. ;)

Answering myself:

/alignment.hit_id/alignment.id/
/alignment.length/_record.hits[0].seq_len/

Other changes:

_hsp.sbjct/_hsp.hit.seq.tostring() # aligned sequence including dashes [ATGCNatgcn-]
_hsp.query/_hsp.query.seq.tostring() # aligned sequence including dashes [ATGCNatgcn-]
_hsp.match/_hsp.aln_annotation['homology']/ # e.g. '||||||||||||||||||||||||||||||||||| |||||||||| |   ||| ||   ||||||| |||||'

I think the dictionary key should have been better named "similarity".

The strand does not translate simply to SearchIO, one needs to do:
/_hsp.strand/(_hsp.query_strand, _hsp.hit_strand)/ # the tuple will be e.g. (1, 1) while I think it used to be under NCBIXML as either ('Plus', 'Plus'), ('Plus, 'Minus'), (None, None), etc.

>
>
>
> Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks that has been added to SearchIO in 1.63. so, that's all from me now until I upgrade. ;)

I got around with try/except although it is more expensive than previously sufficient if/else tests:

         # undo the off-by-one change in SearchIO and transform back to real-life numbers
         _hit_start = _hsp.hit_start + 1
         _query_start = _hsp.query_start + 1

         try:
             _ident_num = _hsp.ident_num
         except:
             _ident_num = 0

         try:
             _pos_num = _hsp.pos_num
         except:
             _pos_num = 0

         try:
             _gap_num = _hsp.gap_num
         except:
             # calculate gaps count missing sometimes in legacy blast XML output
             # see also https://redmine.open-bio.org/issues/3363 saying that also _multimer_hsp_identities and _multimer_hsp_positives are affected
             _gap_num = _hsp.aln_span - _ident_num

So far I can conclude, that by transition from NCBIXML to SearchIO I got 30% wallclock speedup, but the most important will be for me whether it will save me memory used for parsing of huge XML files (>100GB uncompressed) . That I don't know yet, am still testing.

Martin