[BioPython] From alignment column pos to seq nucleotide pos

Mon, 8 Oct 2001 19:05:12 -0400

Anders:
> : Given the column position in an alignment object,
> : how do I easiest get the corresponding nucleotide
> : position in one of the sequence members?

Iddo:
> Quick & dirty:
> 
> --acat---acca-aaatgcgt      # record 0
> a-ccattag-ccataa--gcgt      # record 1
> agc--ttag-ccataa--gcgt      # record 2
>        ^
> 	   column  7
> 
> Now, you want to find the sequence number if the "a" in record 1, on column 7.
[the corrected one:] 
> s = -1
> for i in c_align._records[1].seq[:8]:
>     if i <> "-":   # is this is a bad way for gapchecking??
> 	    s += 1

I think it would be nice to have something like this in the
Alignment class, since this is a general question people should be
able to ask (in my opinion). Based on Iddo's idea, I coded up a
function to do this. I changed the algorithm a little bit: instead
of using the assumption that the original sequence has no gap
characters, I make the assumption that the user of the function is
passing in stuff that "makes sense" (ie. the sequence you want to
find the position in actually matches the aligned sequence you say
it matches).

The code is pasted in below. What do people think about this? Would
it be a good thing to include in the Alignment class (or in
AlignInfo.SummaryInfo)? Any other ideas/suggestions/comments?

    def original_sequence_pos(self, col_pos, original_seq, seq_number):
        """Given an alignment position, find the position in a sequence.

        When given col_pos, the number of a column in this alignment, this
        function finds the corresponding position is the original (unaligned)
        sequence.

        original_seq is the sequence you want to find the position in
        and seq_number is the numerical position of the corresponding 
        aligned sequence in this alignment.

        This works by moving along the aligned sequence and correspondingly
        stepping along the original sequence until we come to col_pos. 
        The position we are at in the sequence is then returned.
        """
        aligned_seq = self._records[seq_number].seq
        seq_pos = 0
        for index in range(col_pos):
            # if we hit a residue that was in the original sequence
            # (not a character from the alignment), then we increment
            # the position we are at in the original sequence
            if aligned_seq[index].upper() == original_seq[seq_pos].upper():
                seq_pos += 1
            # otherwise we should have hit something in the alignment,
            # which should be a gap character
            else:
                assert aligned_seq[index] == aligned_seq.alphabet.gap_char, \
                  "Got an unexpected non-aligned character: %s" % \
                  (aligned_seq[index])

        # an last minute sanity check to be sure that we were asked
        # to find a position that was actually in the original sequence
        assert aligned_seq[col_pos] == original_seq[seq_pos], \
          "Residue in original sequence does not match aligned sequence."

        return seq_pos

Brad
-- 
PGP public key available from http://pgp.mit.edu/