[Biopython] Converting from NCBIXML to SearchIO

Martin Mokrejs mmokrejs at fold.natur.cuni.cz
Thu Feb 13 22:37:38 UTC 2014


Hi Bow,

Wibowo Arindrarto wrote:
> Hi Martin,
>
>>> Here's the 'convention' I use on the length-related attributes in
>>> SearchIO's blast parsers:
>>>
>>> * 'aln_span' attribute denote the length of the alignment itself,
>>> which means this includes the gaps sign ('-'). In Blast, this is
>>> always parsed from the file. You're right that this used to be
>>> hsp.align_length.
>>>
>>> * 'seq_len' attributes denote the length of either the query (in
>>> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the
>>> gaps. These are parsed from the BLAST XML file itself. One of these,
>>> hit.seq_len, is the one that used to be alignment.length.
>>
>>
>> How about record.seq_len in SearchIO, isn't that same as well? At least
>> I am hoping that the length (163 below) of the original query sequence,
>> stored in
>>
>>   <BlastOutput_query-len>163</BlastOutput_query-len>
>>
>> of the XML input file. Having access to its value from under hsp object
>> would be the best for me.
>
> if by 'record' you're referring to the top-most container (the
> QueryResult), then record.seq_len denotes the length of the full query
> sequence. This may or may not be the same as hit.seq_len.
>
> I did not choose to store it under the HSP object, for the following
> reasons because the HSP object is never meant to be used alone, always
> with Hit and QueryResult. So whenever one has access to an HSP, he/she
> must also have access to the containing Hit and QueryResult. Since the
> seq_len are attributes common to all HSPs (originating from the
> hit/query sequences), storing them in Hit and QueryResult objects
> seems most appropriate.

So far I had in one of my functions only hsp object and from it I accessed hsp.align_length. Due to the transition to SearchIO I have to modify the function so that it has access to record.seq_len (or QueryResult as you say). yes, I did it now but please consider some functionality is missing. I don't mind my own API change but others might be concerned. I believe I want record.seq_len and not pray on hit.seq_len. I am not sure if we are talking about the same but my testsuite will complain once the code compiles.


>
>>> * 'query_span' and 'hit_span' are always computed by SearchIO (always
>>> end coordinate - start coordinate of the query / hit match of the HSP,
>>> so they do not count the gap characters). They may or may not be equal
>>> to their seq_len counterparts, depending on how much the HSP covers
>>> the query / hit sequences.
>>
>>
>> I hope you wanted to say "end - start + 1" ;-)
>
> This is related to your comment below, I think. For better or worse,

Damn, right, in this case 4-1+1 = 4-0 ;)


> we needed to adhere to one consistent indexing and numbering system.
> Python's system was chosen based on the fact that anyone using
> Biopython should be (or is already) familiar with them and that
> SearchIO aims to unify all the different coordinate system that
> different programs use. Of course you'll notice that the consequence
> of this system is that one can calculate the length (or span, really)
> of the hit / query sequences by computing `end -start` instead of `end
> - start + 1` :).

Well, took me a while. ;)

>
>>> (I couldn't find any reference to sbjct_length in the current
>>> codebase, perhaps it was removed some time ago?)
>>
>>
>> I have the feelings that either blast or biopython used subjct_* with the
>> 'u' in the name.
>
> Couldn't find that either :/..
>
>>> The 'gap_num' error sounds a bit weird, though. If I recall correctly,
>>> it should work in 1.62 (it was added very early in the beginning).
>>> What problems are you having?
>>
>>
>
> (pasting the comment from your other email)
>
>>> if str(_hsp.gap_num) == '(None, None)':
>>>       ....
>>> AttributeError: 'HSP' object has no attribute 'gap_num'
>>
>>
>> Yeah, I know why. You told me once (
>> https://github.com/biopython/biopython/issues/222 ) that it is optional.
>> Indeed, the XML file lacks in this case the <Hsp_gaps> section. Actually,
>> this old silly test for (None, None) is in my code just because of that bug.
>> I would prefer if SearchIO provided
>>
>> hsp.gap_num == None
>>
>> and likewise for the other, optional attributes to sanitize the blast XML
>> output with some default values. I use None for such cases so that if an
>> integer is later expected python chokes on the None value, which is good.
>> Mostly I only check is the variable returns true or false so the None
>> default is ok for me.
>>
>> alternatively, I have to check the dictionary of hsp whether it contains
>> gap_num, which is inconvenient.
>
> Guess you solved it. But yeah, I was a bit ambivalent on the issue on
> whether to note missing attributes as None or simply nothing (as in,
> not having the attribute at all). To me (others, feel free to weigh in
> here), having it store nothing at all seems more preferred. If the
> former is chosen, the only way to be consistent is to store all other
> attributes from other search programs (e.g. HMMER's parameter in a
> BLAST HSP) as None (otherwise we use None for one missing attribute
> and not for the other?). This seems a bit cumbersome, so I chose to
> store nothing at all.

I will see in how many places I have to wrap access to any of these three (or maybe more) optional values and wrap them by an extra if conditional. I think I will just carelessly force my own defaults, that will keep the code shorter and easier to read. I understand your concern about defining defaults for all possible values but I have opposite opinions. Let's see what other say.

The "good" thing is that now hsp.gap_num does not exist while before hsp.gaps was (None, None) hence the tests for True succeeded. Now the code breaks, cool. :))

Martin



More information about the Biopython mailing list