[Biopython-dev] Sequences and simple plots

Jared Flatow jflatow at northwestern.edu
Fri Sep 26 09:40:46 EDT 2008


On Sep 26, 2008, at 5:15 AM, Peter wrote:

> Cut and paste for people to comment on directly,

Ok, cool.

> The first shows a histogram of sequence lengths in a FASTA file (based
> having recently done this for some real assembly data).  Sample  
> output:
> http://biopython.org/DIST/docs/tutorial/images/hist_plot.png
>
> from Bio import SeqIO
> handle = open("ls_orchid.fasta")
> sizes = [len(seq_record) for seq_record in SeqIO.parse(handle,  
> "fasta")]
> handle.close()
>
> import pylab
> pylab.hist(sizes, bins=20)
> pylab.title("%i orchid sequences\nLengths %i to %i" \
>        % (len(sizes),min(sizes),max(sizes)))
> pylab.xlabel("Sequence length (bp)")
> pylab.ylabel("Count")
> pylab.show()

Its a perfectly fine example, my only comment would be to do something  
like this:

seqs = list(SeqIO.parse(handle, 'fasta'))
hist([len(seq) for seq in seqs], bins=20)

I like to keep the whole sequences in memory, especially if I am just  
digging around the data. Also I use the alpha parameter a lot for  
histograms, especially when doing overlapping ones. So then you can  
also do something like this:

hist([len(seq) for seq in seqs if GC(seq.seq) < .5], bins=20, alpha=. 
5, fc='r')
hist([len(seq) for seq in seqs if GC(seq.seq) >= .5], bins=20, alpha=. 
5, fc='b')

> The second is based on the GC% example we used for the BOSC 2008
> presentation: http://biopython.org/DIST/docs/tutorial/images/gc_plot.png
>
> from Bio import SeqIO
> from Bio.SeqUtils import GC
> handle = open("ls_orchid.fasta")
> gc_values = [GC(seq_record.seq) for seq_record in  
> SeqIO.parse(handle, "fasta")]
> gc_values.sort()
> handle.close()
>
> import pylab
> pylab.plot(gc_values) pylab.title("%i orchid sequences\nGC%% %0.1f  
> to %0.1f" \
>        % (len(gc_values),min(gc_values),max(gc_values)))
> pylab.xlabel("Genes")
> pylab.ylabel("GC%")
> pylab.show()

Again, if you had all the sequences in a list:

plot(sorted(GC(seq.seq) for seq in seqs))

jared


More information about the Biopython-dev mailing list