[BioPython] Iterating through an alignment to calculate the	number of gaps and their lengths
    Peter 
    biopython at maubp.freeserve.co.uk
       
    Wed Feb  6 21:57:48 UTC 2008
    
    
  
On Feb 6, 2008 9:21 PM, Matthew Abravanel <vmatthewa at gmail.com> wrote:
> Hi Everyone,
>
> I was wondering if anyone could help, I am trying to write a little python
> script to iterate through an alignment and determine the number of gaps the
> alignment has and their lengths and output that information as a list.
>  Such as this made up alignemt:
>
>  Seq1 ATT-AGC-C
>  Seq2 AT--AGCTC
>
> and your program runs and outputs like  2 gaps of length 1 outputted as a
> list like this [1,1] or something like that. I am still learning about
> python strings and iterators and am not sure how you would approach this?
> Appreciate any help you could give. Thanks.
I would start with using Bio.SeqIO to read in the sequences as
SeqRecord objects - I'm assuming you have them in a file (e.g. fasta
format, or maybe clustal?).  See the tutorial or
http://biopython.org/wiki/SeqIO
e.g.
from Bio import SeqIO
handle = open("example.fasta")
for rec in SeqIO.parse(handle, "fasta") :
    print rec.id, len(rec.seq), rec.seq.count("-")
The above code will simple count the number of gap characters.  I
think you wanted to look at the sequence strings and how long each
stretch of gap characters is? Rather than counting the number of gap
characters?  Well that is a little more complicated... perhaps
something like this:
from Bio import SeqIO
handle = open("example.fasta")
gap = "-"
for rec in SeqIO.parse(handle, "fasta") :
    print rec.id, rec.seq
    #TODO - Handle leading or trailing gaps
    in_gap = False
    gap_len = 0
    for letter in rec.seq :
        if letter == gap and not in_gap :
            #Start of a gap
            in_gap = True
            assert gap_len == 0, "Logic error?"
            gap_len = 1
        elif in_gap and letter == gap :
            #Continuation of a gap
            gap_len += 1
        elif in_gap and letter <> gap :
            #End of the gap...
            print " - Found a gap of length %i" % gap_len
            #Reset
            in_gap = False
            gap_len = 0
Note that this doesn't record a running tally of the gap lengths
found, for which a python dictionary might be sensible.
Peter
    
    
More information about the Biopython
mailing list