[Biopython] quantile normalization method

Mon Mar 22 04:02:20 UTC 2010

I found a mistake, the np.zeros_like(A) array need to be set as a float64,
otherwise it was assumed int. So the final results would have been rounded
to int.

def quantile_normalization(anarray):
        """
        anarray with samples in the columns and probes across the rows
        import numpy as np
        """
        anarray.dtype = np.float64
        A=anarray
        AA = np.float64(np.zeros_like(A))
        I = np.argsort(A,axis=0)
        AA[I,np.arange(A.shape[1])] =
np.float64(np.mean(A[I,np.arange(A.shape[1])],axis=1)[:,np.newaxis])
        return AA

*Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>

On Sat, Mar 20, 2010 at 1:35 PM, Vincent Davis <vincent at vincentdavis.net>wrote:

> @Laurent Gautier,  I agree with everything you said :)
>
> What I could really use is some to test the python code against R
> Just to help very if that the results are not completely wrong.
>
>   *Vincent Davis
> 720-301-3003 *
> vincent at vincentdavis.net
>  my blog <http://vincentdavis.net> | LinkedIn<http://www.linkedin.com/in/vincentdavis>
>
>
> On Sat, Mar 20, 2010 at 1:30 PM, Laurent Gautier <lgautier at gmail.com>wrote:
>
>> On 3/20/10 7:26 PM, Vincent Davis wrote:
>>
>>>    @Laurent Gautier
>>>
>>>    The algorithm is fairly straightforward, as you noted it, but beware
>>>    of details such missing values, ability to normalize against a
>>>    target distribution, or ties when ranking (although I'd have to
>>>    check if those receive a special treatment).The quantile
>>>    normalization code in the R package "preprocessCore" is in C and
>>>    might outperform a pure Python implementation.
>>>
>>>
>>> Not sure about speed. I have 84 microarrays samples with ~190,000 probes
>>> and it normalizes in 7 sec. I have no idea how fast R is or how many
>>> arrays are common to normalize.
>>>
>>
>> So speed is not an issue for your use-case; even a 10x speedup might not
>> justify the effort required to move to C, as this operation is performed
>> once in a while (once per dataset mostly).
>>
>> I am not sure there is a "common" number. When still working with arrays,
>> I can find myself with several hundred arrays with ~2 million probes each.
>>
>>
>>     There is a variety of normalization methods in bioconductor, and it
>>>    might make sense to embrace it as a dependency (rather than
>>>    reimplement it). I have bindings for Bioconductor up my sleeve about
>>>    to be distributed to few people for testing. The public release
>>>    might be around ISMB, BOSC time.
>>>
>>>
>>> I considered this and in the long run you might be right. But I don't
>>> know R and I placed more value on understanding the normalization than
>>> learning R. This is in part because there is little advantage in using R
>>> in the next steps of my analysis.
>>>
>>
>> Surprising, but you'll know best.
>>
>>
>>  Bindings seem like a good idea but
>>> they would be a black box to me. I guess for me since most of this is
>>> new the value of implementing my own normalization in both learning more
>>> about python and understanding the normalization out ways the benefits
>>> of implementing it in R.
>>>
>>
>> Everyone's mileage will vary. I often like building on existing libraries
>> (although I frequently read how methods work): this makes my palette of
>> tools richer than if I had to reimplement everything, and gives me time to
>> create my own.
>> Having this said, learning a language by implementing is a great way to
>> go.
>>
>>
>>  As a side question, why use biopython, are there ways in which it is
>>> better than R ?
>>>
>>
>> In short (and therefore with some imprecision and/or distortion),
>> Biopython is a "Python package" (i.e., collection of modules) for
>> bioinformatics, with a forte in handling a number of bioinformatics file
>> formats. R is a language for statistics, data analysis and graphics.
>>
>>
>>  For me it is purely that I know python (a little) and can nothing about
>>> R. Sure If I am just doing through step by step instruction from
>>> a bioconductor use manual I am fine but once I what to do something new
>>> am am lost. Not that I can't learn I am just prioritizing my learning.
>>>
>>
>> Then the idea is that you consider R/bioconductor as a Python library.
>> Should you want something new, you can then implement it in Python.
>>
>>
>>
>> Laurent
>>
>>
>>> And thanks for this
>>>
>>>    norm_a = numpy.array(normq(m))
>>>
>>>    can be replaced by
>>>
>>>    norm_a = numpy.as_array(normq(m))
>>>
>>>    to improve performances whenever m is of substantial size (as no
>>>    copy is made - see
>>>
>>> http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy
>>> )
>>>
>>>
>>>
>>>
>>>
>>> *Vincent Davis
>>> 720-301-3003 *
>>> vincent at vincentdavis.net <mailto:vincent at vincentdavis.net>
>>>
>>> my blog <http://vincentdavis.net> | LinkedIn
>>> <http://www.linkedin.com/in/vincentdavis>
>>>
>>>
>>>
>>>
>>> On Sat, Mar 20, 2010 at 12:05 PM, Laurent Gautier <lgautier at gmail.com
>>> <mailto:lgautier at gmail.com>> wrote:
>>>
>>>    Hi Bartek and Vincent,
>>>
>>>    Few comments:
>>>
>>>    A/
>>>
>>>    The algorithm is fairly straightforward, as you noted it, but beware
>>>    of details such missing values, ability to normalize against a
>>>    target distribution, or ties when ranking (although I'd have to
>>>    check if those receive a special treatment).
>>>    The quantile normalization code in the R package "preprocessCore" is
>>>    in C and might outperform a pure Python implementation.
>>>
>>>    B/
>>>
>>>    There is a variety of normalization methods in bioconductor, and it
>>>    might make sense to embrace it as a dependency (rather than
>>>    reimplement it). I have bindings for Bioconductor up my sleeve about
>>>    to be distributed to few people for testing. The public release
>>>    might be around ISMB, BOSC time.
>>>
>>>    C/
>>>
>>>
>>>    norm_a = numpy.array(normq(m))
>>>
>>>    can be replaced by
>>>
>>>    norm_a = numpy.as_array(normq(m))
>>>
>>>    to improve performances whenever m is of substantial size (as no
>>>    copy is made - see
>>>
>>> http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy
>>>    )
>>>
>>>
>>>
>>>    Best,
>>>
>>>
>>>    Laurent
>>>
>>>
>>>
>>>
>>>    On 3/20/10 5:00 PM, biopython-request at lists.open-bio.org
>>>    <mailto:biopython-request at lists.open-bio.org> wrote:
>>>
>>>             >  Is there a quantile normalization method in biopython, I
>>>            search but did not
>>>             >  find. If not it looks straight forward would it be of
>>>            any interest to the
>>>             >  community for me to contribute a method
>>>             >
>>>             >  1. given n arrays of length p, form X of dimension
>>>             >  p ? n where each array is a column;
>>>             >  2. sort each column of X to give X sort ;
>>>             >  3. take the means across rows of X sort and assign this
>>>             >  mean to each element in the row to get X sort ;
>>>             >  4. get X normalized by rearranging each column of
>>>             >  X sort to have the same ordering as original X
>>>             >
>>>             >  From
>>>             >  A comparison of normalization methods for high
>>>             >  density oligonucleotide array data based on
>>>             >  variance and bias
>>>             >  B. M. Bolstad 1,?, R. A. Irizarry 2, M. Astrand 3 and T.
>>>            P. Speed 4, 5
>>>             >  ?
>>>             >
>>>
>>>          Hi,
>>>
>>>        I don't think there is such a method available.
>>>
>>>        I'm myself using the original R implementation by Bolstad et al.
>>>        It requires
>>>        rPy and R installed. It can be achieved in a few lines of code:
>>>
>>>        <pre>
>>>        import rpy2.robjects as robjects
>>>        #ll = list of concatenated values to normalize
>>>        v = robjects.FloatVector(ll)
>>>        #numrows=number of vectors that made up ll
>>>        m = robjects.r['matrix'](v, nrow = numrows, byrow=True)
>>>        robjects.r('require("preprocessCore")')
>>>        normq=robjects.r('normalize.quantiles')
>>>        norm_a=numpy.array(normq(m))
>>>        #norm_a=normalized array
>>>        </pre>
>>>
>>>        If your method is a pure python implementation which is
>>>        comparably fast I
>>>        think it would be worth to have it in Biopython since the method
>>>        is (in my
>>>        opinion) quite useful and it would remove the dependency on R
>>>        from some of
>>>        my scripts.
>>>
>>>        cheers
>>>          Bartek
>>>
>>>
>>>
>>>
>>
>