[Biopython] Converting from NCBIXML to SearchIO

Martin Mokrejs mmokrejs at fold.natur.cuni.cz
Thu Feb 13 21:46:51 UTC 2014


Hi Bow,
   thank you for thorough guidance. Comments interleaved.

Wibowo Arindrarto wrote:
> Hi Martin,
>
> Here's the 'convention' I use on the length-related attributes in
> SearchIO's blast parsers:
>
> * 'aln_span' attribute denote the length of the alignment itself,
> which means this includes the gaps sign ('-'). In Blast, this is
> always parsed from the file. You're right that this used to be
> hsp.align_length.
>
> * 'seq_len' attributes denote the length of either the query (in
> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the
> gaps. These are parsed from the BLAST XML file itself. One of these,
> hit.seq_len, is the one that used to be alignment.length.

How about record.seq_len in SearchIO, isn't that same as well? At least
I am hoping that the length (163 below) of the original query sequence, stored in

  <BlastOutput_query-len>163</BlastOutput_query-len>

of the XML input file. Having access to its value from under hsp object would be the best for me.


> * 'query_span' and 'hit_span' are always computed by SearchIO (always
> end coordinate - start coordinate of the query / hit match of the HSP,
> so they do not count the gap characters). They may or may not be equal
> to their seq_len counterparts, depending on how much the HSP covers
> the query / hit sequences.

I hope you wanted to say "end - start + 1" ;-)

>
> (I couldn't find any reference to sbjct_length in the current
> codebase, perhaps it was removed some time ago?)

I have the feelings that either blast or biopython used subjct_* with the 'u' in the name.


> Since this is SearchIO, it also applies to other formats as well (e.g.
> aln_span always counts the gap character).

Fine with me, I need both values describing length region covered in the HSP, with and without the minus signs.


> The 'gap_num' error sounds a bit weird, though. If I recall correctly,
> it should work in 1.62 (it was added very early in the beginning).
> What problems are you having?


if str(_hsp.gap_num) == '(None, None)':
     ....
AttributeError: 'HSP' object has no attribute 'gap_num'


Here is the hsp object structure:

_hsp=['_NON_STICKY_ATTRS', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_aln_span_get', '_get_coords', '_hit_end_get', '_hit_inter_ranges_get', '_hit_inter_spans_get', '_hit_range_get', '_hit_span_get', '_hit_start_get', '_inter_ranges_get', '_inter_spans_get', '_items', '_query_end_get', '_query_inter_ranges_get', '_query_inter_spans_get', '_query_range_get', '_query_span_get', '_query_start_get', '_str_hsp_header', '_transfer_attrs', '_validate_fragment', 'aln', 'aln_all', 'aln_annotation', 'aln_annotation_all', 'aln_span', 'alphabet', 'bitscore', 'bitscore_raw', 'evalue', 'fragment', 'fragments', 'hit', 'hit_all', 'hit_description', 'hit_end', 'hit_end_all', 'hit_features', 'hit_
 f
eatures_all', 'hit_frame', 'hit_frame_all', 'hit_id', 'hit_inter_ranges', 'hit_inter_spans', 'hit_range', 'hit_range_all', 'hit_span', 'hit_span_all', 'hit_start', 'hit_start_all', 'hit_strand', 'hit_strand_all', 'ident_num', 'is_fragmented', 'pos_num', 'query', 'query_all', 'query_description', 'query_end', 'query_end_all', 'query_features', 'query_features_all', 'query_frame', 'query_frame_all', 'query_id', 'query_inter_ranges', 'query_inter_spans', 'query_range', 'query_range_all', 'query_span', 'query_span_all', 'query_start', 'query_start_all', 'query_strand', 'query_strand_all']



And eventually if that matters, the super-parent/blast record object:

['_NON_STICKY_ATTRS', '_QueryResult__marker', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_blast_id', '_description', '_hit_key_function', '_id', '_items', '_transfer_attrs', 'absorb', 'append', 'description', 'fragments', 'hit_filter', 'hit_keys', 'hit_map', 'hits', 'hsp_filter', 'hsp_map', 'hsps', 'id', 'index', 'items', 'iterhit_keys', 'iterhits', 'iteritems', 'param_evalue_threshold', 'param_filter', 'param_gap_extend', 'param_gap_open', 'param_score_match', 'param_score_mismatch', 'pop', 'program', 'reference', 'seq_len', 'sort', 'stat_db_len', 'stat_db_num', 'stat_eff_space', 'stat_entropy', 'stat_hsp_len', 'stat_kappa', 'stat_lambda', 'target', 'version']



A new comment:

The off-by-one change in SearchIO only complicates matters for me, so I immediately fix it to natural numbering, via:

_query_start = hsp.query_start + 1
_hit_start = hsp.hit_start + 1

I know we talked about this in the past and this is just to say that I did not change my mind here. ;) Same with SffIO although there are two reason for off-by-one numberings, one due to the SFF specs but the other is likewise, to keep in sync with pythonic numbering. These always caused more troubles to me than anything good. Any values I have in variables are 1-based and in the few cases I need to do python slicing, I adjust appropriately, but in remaining cases I am always printing or storing the 1-based values. So, this concept ( http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec114 ) is only for the sake of being pythonic, but bad for users.


Thanks,
Martin

>
> Cheers,
> Bow
>
> On Thu, Feb 13, 2014 at 9:38 PM, Martin Mokrejs
> <mmokrejs at fold.natur.cuni.cz> wrote:
>> Hi,
>>    I am in the process of conversion to the new XML parsing code written by
>> Bow.
>> So far, I have deciphered the following replacement strings (somewhat
>> written in sed(1) format):
>>
>>
>> /hsp.identities/hsp.ident_num/
>> /hsp.score/hsp.bitscore/
>> /hsp.expect/hsp.evalue/
>> /hsp.bits/hsp.bitscore/
>> /hsp.gaps/hsp.gap_num/
>> /hsp.bits/hsp.bitscore_raw/
>> /hsp.positives/hsp.pos_num/
>> /hsp.sbjct_start/hsp.hit_start/
>> /hsp.sbjct_end/hsp.hit_end/
>> # hsp.query_start # no change from NCBIXML
>> # hsp.query_end # no change from NCBIXML
>> /record.query.split()[0]/record.id/
>> /alignment.hit_def.split(' ')[0]/alignment.hit_id/
>> /record.alignments/record.hits/
>>
>> /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML
>> (don't remember whether the counts include minus signs of the alignment or
>> not)
>>
>>
>>
>>
>> Now I am uncertain. There used to be hsp.sbjct_length and alignment.length.
>> I think the former length was including the minus sign for gaps while the
>> latter is just the real length of the query sequence.
>>
>> Nevertheless, what did alignment.length transform into? Into
>> len(hsp.query_all)? I don't think hsp.query_span but who knows. ;)
>>
>>
>>
>> Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks that
>> has been added to SearchIO in 1.63. so, that's all from me now until I
>> upgrade. ;)
>>
>>
>> Thank you,
>> Martin
>> _______________________________________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
>



More information about the Biopython mailing list