[Biopython-dev] Statistics code

Ralph Haygood rhaygood at duke.edu
Tue Oct 2 23:59:43 UTC 2007


Tiago,

Sorry to be so long replying---I've been almost drowning in work.

Use anything you find useful in my code.  If you do write an article
about it, I'd be glad to be a coauthor, not just in name but actually
to help with writing the discussion of sequence statistics.

There *is* a lot of stuff in my code, not all of it generally
important.  For example, few people will care about indel statistics,
beyond counting them and maybe getting the frequency distribution of
their lengths.  The things most people will care about are K (the
number of polymorphic sites), Watterson's theta, pi, Tajima's D, Fu
and Li's D, Fay and Wu's H, F_ST, and McDonald--Kreitman testing.

As for ambiguous nucleotides, my code handles them in one of two ways,
at the programmer's option.  By default, a site at which any sequence
in the alignment contains an ambiguous nucleotide is ignored; for
example,

 	ACRGTY
 	ACAGTC

is effectively equivalent to

 	ACGT
 	ACGT .

However, if the 'expand_diplotypes' option is specified when the
Sample object is constructed, each sequence in the alignment is
interpreted as a diplotype and converted into a pair of pseudo-
haplotypes, two-fold ambiguous nucleotides (R, Y, W, S, M, and K)
being interpreted as heterozygous; for example,

 	ACRGTY
 	ACAGTC

is effectively equivalent to

 	ACAGTC
 	ACGGTT
 	ACAGTC
 	ACAGTC .

In expand_diplotypes mode, sites containing three- or four-fold
ambiguous nucleotides are still ignored.  Also, you'll get a warning
if you request a statistic that depends on correct SNP phasing, which
most statistics don't.  So far, I've found these two operating modes
sufficient for my needs.

I think your plan sounds very reasonable, just adding sequence
statistics at a pace that's comfortable for you.  Any time you have
questions, feel free to ask me, and I'll give you whatever benefit
there is in my opinion and experience.

I'm happy for all this to happen on biopython-dev, so that other
people (e.g., Alex Lancaster) can add to it.  I'll leave it to the
core developers to tell us if we're too noisy.  (I'd recommend still
sending messages to me with copies to biopython-dev, however, so that
I don't accidentally miss them on biopython-dev, which I don't always
read carefully.)

Ralph

On Sat, 29 Sep 2007, Tiago Antão wrote:

> Hi Ralph,
>
> Hope all is good with you. I am now finally starting to commit
> statistics code to Biopython. But before I go ahead I would like to
> ask some advice to you (plus some extra comments):
>
> About code merging and authorship:
>
> I am finally looking to your code. There is really lots of stuff
> there! Would it be OK with you if I merged your code with mine into
> Bio.PopGen.Stats? Obviously the copyright/authorship for the module
> would be co-shared as would any authorship of any article deriving
> from it...
>
> About a strategy to advance:
>
> 1. I personally don't have any experience, really, with working with
> sequence data (My background are SNPs, microsatellites/STRs, AFLPs and
> that sort of stuff)
> 2. Starting on Monday I am beginning a PhD which will require, part
> time, sequence analysis
> 3. What I mean from 1 and 2 is that I currently don't have maturity to
> architect and design a good framework for sequence analysis but I will
> gain it with time.
> My plan is then to defer all sequence code until I fell I know what I
> am doing (although I was still thinking in providing something like
> BioPerl's facility of extracting all SNPs from sequences)
> If this is OK with you I plan to start committing code the week
> starting on this Monday,
>
> About request for insight:
>
> If you have any comments to offer on issues regarding representing
> indels and ambiguous data (ie ambiguous nucleotides) they might be
> useful, as I suppose that is the biggest issue that makes me afraid of
> sequence code.
>
>
> Finally: I would summarize our discussion here on biopython-dev (I am
> not taking it there directly just because you might not want your code
> on Biopython or might want it in other terms).
>
> Thanks,
> Tiago
>


More information about the Biopython-dev mailing list