[Biopython-dev] Start/end co-ordinates in SeqRecord objects?

Tue May 22 05:44:27 EDT 2012

On Tue, May 22, 2012 at 2:25 AM, Brad Chapman <chapmanb at 50mail.com> wrote:
>> Thoughts or feedback please? Would a worked example
>> help with my explanation?
>
> A worked example might help: not totally sure I grasp all the
> subtleties,
> Brad

OK. This will work best in a mono-spaced font. This was
picked out of one of our unit tests, bt009.txt - I just looked
for a BLAST pairwise alignment with some gaps:

 Score = 151 bits (378), Expect = 9e-37
 Identities = 88/201 (43%), Positives = 128/201 (62%), Gaps = 9/201 (4%)

Query: 1 MTRISHITRNTKETQIELSINLDGTGQADISTGIGFLDHML-TLLTFHSDFDLKIIGHGD 59
           M+R ++ITR TKET+IE+ +++D G+ +ST I F +HML TLLT+ + I+ D
Sbjct: 1 MSRSANITRETKETKIEVLLDIDRKGEVKVSTPIPFFNHMLITLLTYMNS--TAIVSATD 58

Query: 60 HETVGMDPHHLIEDVAIALGKCISEDLGNKLGIRRYGSFTIPMDEALVTCDLDISGRPYL 119
              + D HH++EDVAI LG I LG+K GI+R+ IPMD+ALV LDIS R
Sbjct: 59 K--LPYDDHHIVEDVAITLGLAIKTALGDKRGIKRFSHQIIPMDDALVLVSLDISNRGMA 116

Query: 120 VFHADLSGNQKLGGYDTEMTEEFFRALAFNAGITLHLNEHYGQNTHHIIEGMFKSTARAL 179
             + +L ++ +GG TE FF++ A+N+GITLH+++ G NTHHIIE FK+ AL
Sbjct: 117 FVNLNLKRSE-IGGLATENVPHFFQSFAYNSGITLHISQLSGYNTHHIIEASFKALGLAL 175

Query: 180 KQAVSIDESKVGEIPSSKGVL 200
            +A I ++ EI S+KG++
Sbjct: 176 YEATRIVDN---EIRSTKGII 193

When looking at this as a pairwise alignment, for the query SeqRecord
the sequence would be MTRISHITRNT...KGVL (with gaps), running
from residue 1 to 200 inclusive (one based counting, or 0 to 200 in
Python). There are 200 letters plus one gap, meaning the gapped
sequence is 201 letters long.

Similarly the matched sequence SeqRecord (or subject in BLAST's
terminology) is also 201 letters, this time 193 amino acids (residue
1 to 193 inclusive, one based counting) plus 8 gaps.

To turn this into code, something like this:

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> query = SeqRecord(id="query", seq=Seq("MTRISHITRNTKETQIELSINLDGTGQADISTGIGFLDHML-TLLTFHSDFDLKIIGHGDHETVGMDPHHLIEDVAIALGKCISEDLGNKLGIRRYGSFTIPMDEALVTCDLDISGRPYLVFHADLSGNQKLGGYDTEMTEEFFRALAFNAGITLHLNEHYGQNTHHIIEGMFKSTARALKQAVSIDESKVGEIPSSKGVL"))
>>> match = SeqRecord(id="match", seq=Seq("MSRSANITRETKETKIEVLLDIDRKGEVKVSTPIPFFNHMLITLLTYMNS--TAIVSATDK--LPYDDHHIVEDVAITLGLAIKTALGDKRGIKRFSHQIIPMDDALVLVSLDISNRGMAFVNLNLKRSE-IGGLATENVPHFFQSFAYNSGITLHISQLSGYNTHHIIEASFKALGLALYEATRIVDN---EIRSTKGII"))

Turn it into an alignment,

>>> from Bio.Align import MultipleSeqAlignment
>>> align = MultipleSeqAlignment([query, match])
>>> print align
Alphabet() alignment with 2 rows and 201 columns
MTRISHITRNTKETQIELSINLDGTGQADISTGIGFLDHML-TL...GVL query
MSRSANITRETKETKIEVLLDIDRKGEVKVSTPIPFFNHMLITL...GII match

Assume the start/end co-ordinates are also stored somewhere
(and for nucleotide sequences, the strand too).

[As an aside, a pairwise multiple sequence alignment subclass or
similar for the SearchIO project could have a nicer pretty print
__str__ method showing where the two sequences agree - as
done in the BLAST text output etc.]

Now, suppose we slice this - for simplicity let's take the third
chunk as shown in the original text BLAST output above,
i.e. columns 121 to 180 inclusive (one based counting):

>>> print align[:,120:180]
Alphabet() alignment with 2 rows and 60 columns
VFHADLSGNQKLGGYDTEMTEEFFRALAFNAGITLHLNEHYGQN...RAL query
FVNLNLKRSE-IGGLATENVPHFFQSFAYNSGITLHISQLSGYN...LAL match

We know that for this sub-alignment (by looking at the BLAST
text output) that the query fragment is base 120 to 179 inclusive
(one based counting) and the match fragment is base 117 to 175.
I would like the SeqRecord slicing (done by the alignment object)
to be able to deduce these new start/end co-ordinates from the
original co-ordinates.

This means when we do match[120:180] and query[120:180], we
need to look at the position of the gaps, and thus convert from the
ungapped coordinates (used for the start and end values) into the
gapped coordinates (used for the alignment columns).

Essentially here the new start is the old start plus the number of
non-gap letters being removed from the start (before the slice
point). The new end can then be calculated by adding the number
of non-gap letters in the selected sequence, or from the old end
value reducing it by the number of non-gap letters removed from
the end.

In fact there is no need to store the end coordinate in memory -
it can be found on the fly from the start, sequence, and for
nucleotides, strand - which is all you get in MAF format. Doing
this would avoid inconsistent sets of values, but imposes a
number of complications on the object representation.

This is doable - but a sensible question is how common a
use case is it to slice alignments (or SeqRecord objects) and
care about their co-ordinates? This may actually be more
important for classical multiple sequence alignments like
Stockholm and MAF than for SearchIO.

Peter