[Biopython] Converting from NCBIXML to SearchIO

Wibowo Arindrarto w.arindrarto at gmail.com
Thu Feb 13 22:13:36 UTC 2014


Hi Martin,

>> Here's the 'convention' I use on the length-related attributes in
>> SearchIO's blast parsers:
>>
>> * 'aln_span' attribute denote the length of the alignment itself,
>> which means this includes the gaps sign ('-'). In Blast, this is
>> always parsed from the file. You're right that this used to be
>> hsp.align_length.
>>
>> * 'seq_len' attributes denote the length of either the query (in
>> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the
>> gaps. These are parsed from the BLAST XML file itself. One of these,
>> hit.seq_len, is the one that used to be alignment.length.
>
>
> How about record.seq_len in SearchIO, isn't that same as well? At least
> I am hoping that the length (163 below) of the original query sequence,
> stored in
>
>  <BlastOutput_query-len>163</BlastOutput_query-len>
>
> of the XML input file. Having access to its value from under hsp object
> would be the best for me.

if by 'record' you're referring to the top-most container (the
QueryResult), then record.seq_len denotes the length of the full query
sequence. This may or may not be the same as hit.seq_len.

I did not choose to store it under the HSP object, for the following
reasons because the HSP object is never meant to be used alone, always
with Hit and QueryResult. So whenever one has access to an HSP, he/she
must also have access to the containing Hit and QueryResult. Since the
seq_len are attributes common to all HSPs (originating from the
hit/query sequences), storing them in Hit and QueryResult objects
seems most appropriate.

>> * 'query_span' and 'hit_span' are always computed by SearchIO (always
>> end coordinate - start coordinate of the query / hit match of the HSP,
>> so they do not count the gap characters). They may or may not be equal
>> to their seq_len counterparts, depending on how much the HSP covers
>> the query / hit sequences.
>
>
> I hope you wanted to say "end - start + 1" ;-)

This is related to your comment below, I think. For better or worse,
we needed to adhere to one consistent indexing and numbering system.
Python's system was chosen based on the fact that anyone using
Biopython should be (or is already) familiar with them and that
SearchIO aims to unify all the different coordinate system that
different programs use. Of course you'll notice that the consequence
of this system is that one can calculate the length (or span, really)
of the hit / query sequences by computing `end -start` instead of `end
- start + 1` :).

>> (I couldn't find any reference to sbjct_length in the current
>> codebase, perhaps it was removed some time ago?)
>
>
> I have the feelings that either blast or biopython used subjct_* with the
> 'u' in the name.

Couldn't find that either :/..

>> The 'gap_num' error sounds a bit weird, though. If I recall correctly,
>> it should work in 1.62 (it was added very early in the beginning).
>> What problems are you having?
>
>

(pasting the comment from your other email)

>> if str(_hsp.gap_num) == '(None, None)':
>>      ....
>> AttributeError: 'HSP' object has no attribute 'gap_num'
>
>
> Yeah, I know why. You told me once (
> https://github.com/biopython/biopython/issues/222 ) that it is optional.
> Indeed, the XML file lacks in this case the <Hsp_gaps> section. Actually,
> this old silly test for (None, None) is in my code just because of that bug.
> I would prefer if SearchIO provided
>
> hsp.gap_num == None
>
> and likewise for the other, optional attributes to sanitize the blast XML
> output with some default values. I use None for such cases so that if an
> integer is later expected python chokes on the None value, which is good.
> Mostly I only check is the variable returns true or false so the None
> default is ok for me.
>
> alternatively, I have to check the dictionary of hsp whether it contains
> gap_num, which is inconvenient.

Guess you solved it. But yeah, I was a bit ambivalent on the issue on
whether to note missing attributes as None or simply nothing (as in,
not having the attribute at all). To me (others, feel free to weigh in
here), having it store nothing at all seems more preferred. If the
former is chosen, the only way to be consistent is to store all other
attributes from other search programs (e.g. HMMER's parameter in a
BLAST HSP) as None (otherwise we use None for one missing attribute
and not for the other?). This seems a bit cumbersome, so I chose to
store nothing at all.

> A new comment:
>
> The off-by-one change in SearchIO only complicates matters for me, so I
> immediately fix it to natural numbering, via:
>
> _query_start = hsp.query_start + 1
> _hit_start = hsp.hit_start + 1
>
> I know we talked about this in the past and this is just to say that I did
> not change my mind here. ;) Same with SffIO although there are two reason
> for off-by-one numberings, one due to the SFF specs but the other is
> likewise, to keep in sync with pythonic numbering. These always caused more
> troubles to me than anything good. Any values I have in variables are
> 1-based and in the few cases I need to do python slicing, I adjust
> appropriately, but in remaining cases I am always printing or storing the
> 1-based values. So, this concept (
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec114 ) is only for
> the sake of being pythonic, but bad for users.

This was addressed above :).


Cheers,
Bow



More information about the Biopython mailing list