[Biopython-dev] [Bug 2705] New: Nicer GC and AT content and skew functions

Tue Dec 9 16:59:21 UTC 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2705

           Summary: Nicer GC and AT content and skew functions
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

This bug started out as a discussion on Bug 2671, based on some nucleotide
scoring functions in GenomeDiagram which were used for plotting sequence
properties along a sequence using a sliding window.  The basic underlying
functions could make a nice addition under Bio.SeqUtils (rather than hiding
them under Bio.Graphics.GenomeDiagram).

In particular, GenomeDiagram's Utilities.py included the following
(non-windowed) nucleotide composition functions:

calc_gc_content - returns a float in the range 0 to 1.
calc_at_content - returns a float in the range 0 to 1.
calc_gc_skew - returns a float [*]
calc_at_skew - returns a float [*]

[*] As discussed on Bug 2671, these currently give zero if there is no AT
content, which was a reasonable shortcut given these functions were originally
used for plotting only.  They should instead raise an exception or return None
or NaN instead.

Also, as implemented in GenomeDiagram, these functions do not cope with mixed
case sequences (easily rectified).  Also, for GC and AT content these do not
deal with ambiguous nucleotides (where we could follow the existing
Bio.SeqUtils convention).

Bio.SeqUtils already has several related functions including:

GC - returns a float (a percentage in the range 0 to 100)
GC123 - returns a tuple of four floats (percentages between 0 and 100)

GC_skew - returns a list of floats using a default window size of 100bp.  Gives
a floating point exception if there is no GC content in any window.

Personally I don't like the fact that the existing GC function returns a number
between 0 and 100 (rather than 0 and 1).  Leighton agreed.

I don't think the current GC_skew function is intuitive and doesn't cover the
non-windowed use-case where you want the GC_skew of the whole sequence passed
in.  This is important if you want to do your own windowing (e.g. comparing GC
skew of individual genes to the whole genome).

Because they differ from the existing Bio.SeqUtils code, I think there is a
case for adding the four non-windowed functions from GenomeDiagram's
Utilities.py under Bio.SeqUtils.  Each would take a single argument, a sequence
(coping with a string, Seq object or MutableSeq object).  I have no
particularly strong views on the naming of these functions.  Perhaps they could
be located under a sub module like Bio.SeqUtils.Nucleotides or
Bio.SeqUtils.NucUtils?  The existing GC functions in Bio.SeqUtils could be
deprecated or at least declared obsolete.

This would also be a good opportunity to explicitly specify what we expect to
get back for the GC content when there are ambiguous nucleotides.

e.g. Following Bio.SeqUtils.GC, only count C, G and S (which means C or G) (in
either case) and divide by the length giving a lower bound.  Here GC("ACGTN")
is 40%.  An alternative approach might be to treat an N as 50% GC, and H (which
is A, C or T) as 66.6% GC etc, meaning GC("ACGTN") gives 50%.

The same approach should be used for the AT percentage, for example the current
lower bound approach would count only A, T and W characters (in either case).

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.