[Biopython-dev] Sequences and simple plots

Peter biopython at maubp.freeserve.co.uk
Wed Oct 15 07:41:25 EDT 2008


On Fri, Sep 26 Peter wrote:
> On Fri, Sep 26 Jared wrote:
>> On Sep 26 Peter wrote:
>>
>>> Did you try the dot-plot example?
>>
>> I didn't, but it looked good.
>
> Hopefully I've pitched it right - I've tried to make it as simple as
> possible, but the nested list comprehension is perhaps non-obvious.

Old output:
http://biopython.org/DIST/docs/tutorial/images/dot_plot.png

I recently wanted to draw a dot plot for a larger pair of sequences,
and found that the example code didn't scale well.  There were two
issues, the naive calculation and the fact that pylab.imshow has an
upper limit for the size of matrix (due to memory).

I've added a second more complicated version to the Tutorial in CVS
using pylab.scatter for the plotting:
http://biopython.org/DIST/docs/tutorial/images/dot_plot_scatter.png

#Load two SeqRecord objects
from Bio import SeqIO
handle = open("ls_orchid.fasta")
record_iterator = SeqIO.parse(handle, "fasta")
rec_one = record_iterator.next()
rec_two = record_iterator.next()
handle.close()

window = 7
step = 1
#Map every window sized sub-sequence's location in a dict
dict_one = {}
dict_two = {}
for (seq, section_dict) in [(rec_one.seq.tostring().upper(), dict_one),
                            (rec_two.seq.tostring().upper(), dict_two)] :
    for i in range(0, len(seq)-window, step) :
        section = seq[i:i+window]
        try :
            section_dict[section].append(i)
        except KeyError :
            section_dict[section] = [i]
#Now find any sub-sequences found in both sequences
matches = set(dict_one).intersection(dict_two)
print "%i unique matches" % len(matches)

#Create lists of x and y co-ordinates for scatter plot
x = []
y = []
for section in matches :
    for i in dict_one[section] :
        for j in dict_two[section] :
            x.append(i)
            y.append(j)

#Now draw it
import pylab
pylab.gray()
pylab.scatter(x,y)
pylab.xlim(0, len(seq_one)-window)
pylab.ylim(0, len(seq_two)-window)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window)
pylab.show()

Using pylab.scatter is still a bit slow, but it does actually work.  I
was wondering if this dot-plot code were to use reportlab instead,
would it make a sensible addition to the Bio.Graphics module?

Peter


More information about the Biopython-dev mailing list