[BioPython] Iterating through an alignment to calculate the number of gaps and their lengths

Peter biopython at maubp.freeserve.co.uk
Wed Feb 6 21:57:48 UTC 2008


On Feb 6, 2008 9:21 PM, Matthew Abravanel <vmatthewa at gmail.com> wrote:
> Hi Everyone,
>
> I was wondering if anyone could help, I am trying to write a little python
> script to iterate through an alignment and determine the number of gaps the
> alignment has and their lengths and output that information as a list.
>  Such as this made up alignemt:
>
>  Seq1 ATT-AGC-C
>  Seq2 AT--AGCTC
>
> and your program runs and outputs like  2 gaps of length 1 outputted as a
> list like this [1,1] or something like that. I am still learning about
> python strings and iterators and am not sure how you would approach this?
> Appreciate any help you could give. Thanks.

I would start with using Bio.SeqIO to read in the sequences as
SeqRecord objects - I'm assuming you have them in a file (e.g. fasta
format, or maybe clustal?).  See the tutorial or
http://biopython.org/wiki/SeqIO

e.g.

from Bio import SeqIO
handle = open("example.fasta")
for rec in SeqIO.parse(handle, "fasta") :
    print rec.id, len(rec.seq), rec.seq.count("-")

The above code will simple count the number of gap characters.  I
think you wanted to look at the sequence strings and how long each
stretch of gap characters is? Rather than counting the number of gap
characters?  Well that is a little more complicated... perhaps
something like this:

from Bio import SeqIO
handle = open("example.fasta")
gap = "-"
for rec in SeqIO.parse(handle, "fasta") :
    print rec.id, rec.seq
    #TODO - Handle leading or trailing gaps
    in_gap = False
    gap_len = 0
    for letter in rec.seq :
        if letter == gap and not in_gap :
            #Start of a gap
            in_gap = True
            assert gap_len == 0, "Logic error?"
            gap_len = 1
        elif in_gap and letter == gap :
            #Continuation of a gap
            gap_len += 1
        elif in_gap and letter <> gap :
            #End of the gap...
            print " - Found a gap of length %i" % gap_len
            #Reset
            in_gap = False
            gap_len = 0



Note that this doesn't record a running tally of the gap lengths
found, for which a python dictionary might be sensible.

Peter



More information about the Biopython mailing list