[Biopython] Statistical similarity in microarray data
Sean Davis
sdavis2 at mail.nih.gov
Tue Feb 16 23:10:11 UTC 2010
On Tue, Feb 16, 2010 at 8:37 AM, Peter Saffrey <pzs at dcs.gla.ac.uk> wrote:
> This isn't strictly a biopython question, but I hoped I might find some
> expertise here.
>
> I need to compare two microarrays for similarity. Each file is a set of
> spots and their corresponding values. By ordering the values by the spot
> id and discarding points that are missing from either set, I can compare
> the two experiments. We are trying to show that samples using a new
> method correlate with the old method.
Any correlation method will likely do.
> Up until recently, we were using a Pearson correlation (from
> scipy.stats) but this assumes the data is normally distributed, which it
> probably isn't. The correlations were a little unreliable.
You'll need to look at the data to decide. If you have log ratios for
the arrays or you take the log of single-channel intensities, then I
think you will find that the data are often close enough to normal to
use Pearson correlation. However, as I mentioned above, any standard
correlation measure such as Pearson or Spearman will likely do just
fine.
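
A minimal sketch of the comparison described above, assuming each
array has already been read into a dict mapping spot id to a
single-channel intensity. The spot ids, the values, and the helper
name paired_log_values are made up for illustration; only
scipy.stats.pearsonr and scipy.stats.spearmanr are real library calls.

```python
import numpy as np
from scipy import stats


def paired_log_values(array_a, array_b):
    """Return log2 values for spots that are usable in both arrays.

    array_a and array_b are hypothetical dicts of spot id -> intensity.
    Spots missing from either dict, or with non-finite or non-positive
    intensities, are discarded before taking logs.
    """
    shared = sorted(set(array_a) & set(array_b))
    a = np.array([array_a[s] for s in shared], dtype=float)
    b = np.array([array_b[s] for s in shared], dtype=float)
    keep = np.isfinite(a) & np.isfinite(b) & (a > 0) & (b > 0)
    return np.log2(a[keep]), np.log2(b[keep])


# Made-up intensities: spot5 and spot6 each appear in only one array.
old_method = {"spot1": 120.0, "spot2": 950.0, "spot3": 40.0,
              "spot4": 3100.0, "spot5": 15.0}
new_method = {"spot1": 150.0, "spot2": 870.0, "spot3": 55.0,
              "spot4": 2800.0, "spot6": 400.0}

x, y = paired_log_values(old_method, new_method)

# Pearson measures linear association; Spearman works on ranks and
# makes no normality assumption.
pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)

print("spots compared:", len(x))
print("Pearson r:", pearson_r)
print("Spearman rho:", spearman_rho)
```

If the two coefficients disagree noticeably on real data, that is
usually a hint to look at the scatter of the values for outliers or
saturation rather than to trust either number on its own.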
> After a bit of digging, I tried using a Wilcoxon (also from
> scipy.stats), but this seems to give high correlations for things it
> shouldn't, like files that are different samples. It also seems to lack
> precision. I get p-values of 0 quite a lot; even 1e-80 would reassure me
> that something is really happening underneath.
What you are likely doing is testing whether the correlation between
the two assays differs from zero. Since the correlation values
between array platforms tend to be fairly good (well away from
zero), it is not at all unusual to have a p-value that is practically
zero (so it isn't very important to report the p-value).
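
One point worth checking with a small simulation: scipy.stats.wilcoxon
is a paired signed-rank test of whether the differences between the two
samples are centred on zero, so it does not measure correlation at all.
The sketch below uses simulated log2 values (all names and parameters
are assumptions, not real array data) to compare what the correlation
test and the Wilcoxon test report for a related and an unrelated pair.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_spots = 5000

# Simulated log2 signal shared by two measurements of the same sample.
signal = rng.normal(8.0, 2.0, n_spots)
array_a = signal + rng.normal(0.0, 0.5, n_spots)
array_b = signal + rng.normal(0.0, 0.5, n_spots)

# A different sample on the same overall scale, unrelated spot by spot.
unrelated = rng.normal(8.0, 2.0, n_spots)

# Correlation: the coefficient is the useful number; with thousands of
# spots the p-value for a strong correlation is vanishingly small.
r_same, p_same = stats.pearsonr(array_a, array_b)
r_diff, p_diff = stats.pearsonr(array_a, unrelated)

# Wilcoxon signed-rank: only asks whether paired differences are
# shifted away from zero, not whether the arrays track each other.
w_same_stat, w_same_p = stats.wilcoxon(array_a, array_b)
w_diff_stat, w_diff_p = stats.wilcoxon(array_a, unrelated)

print("same sample:      r =", r_same, " p =", p_same)
print("different sample: r =", r_diff, " p =", p_diff)
print("Wilcoxon p, same sample:     ", w_same_p)
print("Wilcoxon p, different sample:", w_diff_p)
```

In this setup the correlation coefficient separates the two cases
clearly, while the Wilcoxon p-value says only whether one array is
systematically higher than the other. That may explain why unrelated
samples on a similar scale looked "similar" under that test.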
> Does anybody have any experience with this type of statistical work?
Between-platform comparisons are notoriously difficult to do well, but
having a correlation measure is usually enough to get started. Also,
a scatter plot of one array versus the other is a useful visualization
tool. If you want to look at a more formal approach, look at the MAQC
papers in PubMed.
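
A scatter plot of one array against the other could be drawn along
these lines. This is a sketch assuming matplotlib is available; the
simulated values, axis labels, and the output filename
array_scatter.png are all placeholders.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render to a file; no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n_spots = 2000

# Simulated log2 values standing in for the two matched arrays.
signal = rng.normal(8.0, 2.0, n_spots)
old_method = signal + rng.normal(0.0, 0.5, n_spots)
new_method = signal + rng.normal(0.0, 0.5, n_spots)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(old_method, new_method, s=4, alpha=0.3)

# Identity line: points on it have equal values in both arrays.
low = min(old_method.min(), new_method.min())
high = max(old_method.max(), new_method.max())
ax.plot([low, high], [low, high], color="black", linewidth=1)

ax.set_xlabel("old method (log2 value)")
ax.set_ylabel("new method (log2 value)")
ax.set_title("Spot-by-spot comparison")
fig.savefig("array_scatter.png", dpi=150)
```

Curvature away from the identity line, or a fan shape at low
intensities, is the kind of pattern a single correlation coefficient
can hide.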
All these comments are very general. You'll probably want to be a bit
more specific about your experimental design and your goals.
Finally, while Biopython provides an excellent set of tools for many
biological problems, you might take a look at the Bioconductor project
if you are looking to get into microarrays in any depth.
Sean