[Biopython] Converting from NCBIXML to SearchIO

Martin Mokrejs mmokrejs at fold.natur.cuni.cz
Thu Feb 13 22:06:44 UTC 2014


Martin Mokrejs wrote:
> Hi Bow,
>    thank you for thorough guidance. Comments interleaved.
>
> Wibowo Arindrarto wrote:
>> Hi Martin,
>>
>> Here's the 'convention' I use on the length-related attributes in
>> SearchIO's blast parsers:
>>
>> * 'aln_span' attribute denote the length of the alignment itself,
>> which means this includes the gaps sign ('-'). In Blast, this is
>> always parsed from the file. You're right that this used to be
>> hsp.align_length.
>>
>> * 'seq_len' attributes denote the length of either the query (in
>> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the
>> gaps. These are parsed from the BLAST XML file itself. One of these,
>> hit.seq_len, is the one that used to be alignment.length.
>
> How about record.seq_len in SearchIO, isn't that same as well? At least
> I am hoping that the length (163 below) of the original query sequence, stored in
>
>   <BlastOutput_query-len>163</BlastOutput_query-len>
>
> of the XML input file. Having access to its value from under hsp object would be the best for me.
>
>
>> * 'query_span' and 'hit_span' are always computed by SearchIO (always
>> end coordinate - start coordinate of the query / hit match of the HSP,
>> so they do not count the gap characters). They may or may not be equal
>> to their seq_len counterparts, depending on how much the HSP covers
>> the query / hit sequences.
>
> I hope you wanted to say "end - start + 1" ;-)
>
>>
>> (I couldn't find any reference to sbjct_length in the current
>> codebase, perhaps it was removed some time ago?)
>
> I have the feelings that either blast or biopython used subjct_* with the 'u' in the name.
>
>
>> Since this is SearchIO, it also applies to other formats as well (e.g.
>> aln_span always counts the gap character).
>
> Fine with me, I need both values describing length region covered in the HSP, with and without the minus signs.
>
>
>> The 'gap_num' error sounds a bit weird, though. If I recall correctly,
>> it should work in 1.62 (it was added very early in the beginning).
>> What problems are you having?
>
>
> if str(_hsp.gap_num) == '(None, None)':
>      ....
> AttributeError: 'HSP' object has no attribute 'gap_num'

Yeah, I know why. You told me once ( https://github.com/biopython/biopython/issues/222 ) that it is optional. Indeed, the XML file lacks in this case the <Hsp_gaps> section. Actually, this old silly test for (None, None) is in my code just because of that bug. I would prefer if SearchIO provided

hsp.gap_num == None

and likewise for the other, optional attributes to sanitize the blast XML output with some default values. I use None for such cases so that if an integer is later expected python chokes on the None value, which is good. Mostly I only check is the variable returns true or false so the None default is ok for me.

alternatively, I have to check the dictionary of hsp whether it contains gap_num, which is inconvenient.

Martin



More information about the Biopython mailing list