[Biopython] Converting from NCBIXML to SearchIO

Wibowo Arindrarto w.arindrarto at gmail.com
Thu Feb 13 21:22:13 UTC 2014


Hi Martin,

Here's the 'convention' I use for the length-related attributes in
SearchIO's BLAST parsers:

* The 'aln_span' attribute denotes the length of the alignment itself,
which means it includes the gap characters ('-'). For BLAST, this is
always parsed from the file. You're right that this used to be
hsp.align_length.

* The 'seq_len' attributes denote the length of either the query (in
qresult.seq_len) or the hit (in hit.seq_len) sequence, excluding
gaps. These are parsed from the BLAST XML file itself. One of these,
hit.seq_len, is the one that used to be alignment.length.

* 'query_span' and 'hit_span' are always computed by SearchIO (always
the end coordinate minus the start coordinate of the query / hit match
of the HSP, so they do not count the gap characters). They may or may
not be equal to their seq_len counterparts, depending on how much of
the query / hit sequence the HSP covers.
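
To make the three conventions above concrete, here is a minimal toy
sketch in plain Python. It does not use Biopython at all: the aligned
strings, the start coordinates and the helper ungapped_len are made up
for illustration, and it assumes zero-based start coordinates with the
end computed as start plus the number of residues covered.

```python
# Toy HSP (hypothetical data, not taken from a real BLAST run).
query_aln = "ACGT-ACGTTA"   # aligned query, '-' marks a gap
hit_aln = "ACGTTAC--TA"     # aligned hit, same length as query_aln

# Assumed start coordinates on the ungapped sequences (zero-based).
query_start, hit_start = 10, 100


def ungapped_len(aligned):
    """Number of residues in an aligned string, ignoring '-' gaps."""
    return len(aligned.replace("-", ""))


# End coordinates follow from the number of residues the HSP covers.
query_end = query_start + ungapped_len(query_aln)
hit_end = hit_start + ungapped_len(hit_aln)

aln_span = len(query_aln)             # counts the gap characters
query_span = query_end - query_start  # end - start, gaps not counted
hit_span = hit_end - hit_start        # end - start, gaps not counted

print(aln_span, query_span, hit_span)
```

In this toy case aln_span is larger than both spans because each
aligned string contains at least one gap character.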

(I couldn't find any reference to sbjct_length in the current
codebase, perhaps it was removed some time ago?)

Since this is SearchIO, the same convention applies to the other
formats as well (e.g. aln_span always counts the gap characters).
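
Because the attribute names are shared across formats, code that reads
them does not need to know which parser produced the HSP. A rough
sketch of that idea follows; length_summary and fake_hsp are
hypothetical names, and the SimpleNamespace object is only a stand-in
for a real HSP so that the snippet runs without Biopython.

```python
from types import SimpleNamespace


def length_summary(hsp):
    """Collect the length-related attributes of an HSP-like object."""
    names = ("aln_span", "query_span", "hit_span")
    # getattr with a default keeps this usable for objects that lack
    # one of the attributes.
    return {name: getattr(hsp, name, None) for name in names}


# Stand-in for a real HSP object (hypothetical values). With Biopython
# installed, HSP objects would instead come from iterating over the
# hits of a QueryResult yielded by Bio.SearchIO.parse(path, format).
fake_hsp = SimpleNamespace(aln_span=11, query_span=10, hit_span=9)
print(length_summary(fake_hsp))
```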

The 'gap_num' error sounds a bit weird, though. If I recall correctly,
it should work in 1.62 (it was added very early on).
What problems are you having?

Cheers,
Bow

On Thu, Feb 13, 2014 at 9:38 PM, Martin Mokrejs
<mmokrejs at fold.natur.cuni.cz> wrote:
> Hi,
>   I am in the process of conversion to the new XML parsing code written by
> Bow.
> So far, I have deciphered the following replacement strings (somewhat
> written in sed(1) format):
>
>
> /hsp.identities/hsp.ident_num/
> /hsp.score/hsp.bitscore_raw/
> /hsp.expect/hsp.evalue/
> /hsp.bits/hsp.bitscore/
> /hsp.gaps/hsp.gap_num/
> /hsp.positives/hsp.pos_num/
> /hsp.sbjct_start/hsp.hit_start/
> /hsp.sbjct_end/hsp.hit_end/
> # hsp.query_start # no change from NCBIXML
> # hsp.query_end # no change from NCBIXML
> /record.query.split()[0]/record.id/
> /alignment.hit_def.split(' ')[0]/alignment.hit_id/
> /record.alignments/record.hits/
>
> /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML
> (don't remember whether the counts include minus signs of the alignment or
> not)
>
>
>
>
> Now I am uncertain. There used to be hsp.sbjct_length and alignment.length.
> I think the former length was including the minus sign for gaps while the
> latter is just the real length of the query sequence.
>
> Nevertheless, what did alignment.length transform into? Into
> len(hsp.query_all)? I don't think hsp.query_span but who knows. ;)
>
>
>
> Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num; it looks
> like that was added to SearchIO in 1.63. So that's all from me for now,
> until I upgrade. ;)
>
>
> Thank you,
> Martin
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


