[Biopython] quantile normalization method

Sat Mar 20 18:26:27 UTC 2010

>
> @Laurent Gautier

> The algorithm is fairly straightforward, as you noted it, but beware of
> details such as missing values, ability to normalize against a target
> distribution, or ties when ranking (although I'd have to check if those
> receive a special treatment). The quantile normalization code in the R
> package "preprocessCore" is in C and might outperform a pure Python
> implementation.

Not sure about speed. I have 84 microarray samples with ~190,000 probes each,
and it normalizes them in 7 sec. I have no idea how fast R is or how many
arrays people typically normalize at once.
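
For anyone following along, below is a minimal NumPy sketch of the four steps
from the Bolstad et al. paper quoted further down. It is only an illustration,
not the implementation timed above. The function name quantile_normalize is
made up for this example, and the sketch assumes there are no missing values
and that ties are simply broken by sort order (none of the special handling
Laurent mentions).

```python
import numpy as np


def quantile_normalize(x):
    """Quantile-normalize the columns of a 2-D array.

    Rows are probes, columns are arrays (samples).
    Assumes no NaNs; tied values are ranked in order of appearance.
    """
    x = np.asarray(x, dtype=float)
    # Step 2: sort each column independently, remembering the ordering.
    order = np.argsort(x, axis=0, kind="stable")
    x_sorted = np.take_along_axis(x, order, axis=0)
    # Step 3: mean across each row of the sorted matrix.
    row_means = x_sorted.mean(axis=1)
    # Step 4: write the row means back in each column's original ordering.
    normalized = np.empty_like(x)
    np.put_along_axis(normalized, order, row_means[:, np.newaxis], axis=0)
    return normalized


# Small example: 4 probes (rows) measured on 3 arrays (columns).
x = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 0.0, 6.0],
              [4.0, 2.0, 8.0]])
out = quantile_normalize(x)
```

After the call, every column of out contains the same set of values (the row
means of the sorted matrix), arranged in the rank order of the corresponding
input column.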

> There is a variety of normalization methods in bioconductor, and it might
> make sense to embrace it as a dependency (rather than reimplement it). I
> have bindings for Bioconductor up my sleeve about to be distributed to a few
> people for testing. The public release might be around ISMB, BOSC time.

I considered this, and in the long run you might be right. But I don't know R,
and I placed more value on understanding the normalization than on learning R.
This is in part because there is little advantage to using R in the next
steps of my analysis. Bindings seem like a good idea, but they would be a
black box to me. Since most of this is new to me, the value of implementing
my own normalization, both for learning more Python and for understanding the
normalization, outweighs the benefits of doing it in R.

As a side question: why use Biopython? Are there ways in which it is better
than R?
For me it is purely that I know Python (a little) and know nothing about R.
Sure, if I am just going through step-by-step instructions from a
Bioconductor user manual I am fine, but once I want to do something new I
am lost. Not that I can't learn; I am just prioritizing my learning.

And thanks for this

> norm_a = numpy.array(normq(m))
>
> can be replaced by
>
> norm_a = numpy.asarray(normq(m))
>
> to improve performance whenever m is of substantial size (as no copy is
> made - see
> http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy
> )
>
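
The copy versus no-copy difference between numpy.array and numpy.asarray can
be checked on a plain NumPy array, as in the small sketch below. It only shows
NumPy's own behaviour. Whether the result of an rpy2 call shares memory with
the R object is an assumption that depends on the rpy2 version, so see the
linked rpy2 documentation for that part.

```python
import numpy as np

a = np.arange(6, dtype=float).reshape(2, 3)

# numpy.array copies by default (copy=True), so a new buffer is allocated.
copied = np.array(a)
# numpy.asarray returns the input unchanged when it is already an ndarray
# of a suitable dtype, so no data is copied.
viewed = np.asarray(a)

print(np.shares_memory(a, copied))  # False
print(viewed is a)                  # True
```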

  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>

On Sat, Mar 20, 2010 at 12:05 PM, Laurent Gautier <lgautier at gmail.com> wrote:

> Hi Bartek and Vincent,
>
> A few comments:
>
> A/
>
> The algorithm is fairly straightforward, as you noted it, but beware of
> details such as missing values, ability to normalize against a target
> distribution, or ties when ranking (although I'd have to check if those
> receive a special treatment).
> The quantile normalization code in the R package "preprocessCore" is in C
> and might outperform a pure Python implementation.
>
> B/
>
> There is a variety of normalization methods in bioconductor, and it might
> make sense to embrace it as a dependency (rather than reimplement it). I
> have bindings for Bioconductor up my sleeve about to be distributed to a few
> people for testing. The public release might be around ISMB, BOSC time.
>
> C/
>
>
> norm_a = numpy.array(normq(m))
>
> can be replaced by
>
> norm_a = numpy.asarray(normq(m))
>
> to improve performance whenever m is of substantial size (as no copy is
> made - see
> http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy)
>
>
>
> Best,
>
>
> Laurent
>
>
>
>
> On 3/20/10 5:00 PM, biopython-request at lists.open-bio.org wrote:
>
>>> Is there a quantile normalization method in Biopython? I searched but did
>>> not find one. If not, it looks straightforward. Would it be of any
>>> interest to the community for me to contribute a method?
>>>
>>> 1. given n arrays of length p, form X of dimension
>>>    p × n where each array is a column;
>>> 2. sort each column of X to give X_sort;
>>> 3. take the means across rows of X_sort and assign this
>>>    mean to each element in the row to get X'_sort;
>>> 4. get X_normalized by rearranging each column of
>>>    X'_sort to have the same ordering as original X
>>>
>>> From
>>> "A comparison of normalization methods for high
>>> density oligonucleotide array data based on
>>> variance and bias"
>>> B. M. Bolstad, R. A. Irizarry, M. Astrand and T. P. Speed
>>>
>>  Hi,
>>
>> I don't think there is such a method available.
>>
>> I'm myself using the original R implementation by Bolstad et al. It
>> requires rpy2 and R to be installed. It can be achieved in a few lines
>> of code:
>>
>> <pre>
>> import numpy
>> import rpy2.robjects as robjects
>>
>> # ll = flat list of the concatenated values to normalize
>> v = robjects.FloatVector(ll)
>> # numrows = number of vectors that were concatenated into ll
>> m = robjects.r['matrix'](v, nrow=numrows, byrow=True)
>> # load the R package that provides normalize.quantiles
>> robjects.r('require("preprocessCore")')
>> normq = robjects.r('normalize.quantiles')
>> # norm_a = normalized array
>> norm_a = numpy.array(normq(m))
>> </pre>
>>
>> If your method is a pure Python implementation that is comparably fast, I
>> think it would be worth having in Biopython, since the method is (in my
>> opinion) quite useful and it would remove the dependency on R from some of
>> my scripts.
>>
>> cheers
>>  Bartek
>>
>>
>