[Biopython] quantile normalization method

Sat Mar 20 19:35:33 UTC 2010

@Laurent Gautier,  I agree with everything you said :)

What I could really use is some to test the python code against R
Just to help very if that the results are not completely wrong.

  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>

On Sat, Mar 20, 2010 at 1:30 PM, Laurent Gautier <lgautier at gmail.com> wrote:

> On 3/20/10 7:26 PM, Vincent Davis wrote:
>
>>    @Laurent Gautier
>>
>>    The algorithm is fairly straightforward, as you noted it, but beware
>>    of details such missing values, ability to normalize against a
>>    target distribution, or ties when ranking (although I'd have to
>>    check if those receive a special treatment).The quantile
>>    normalization code in the R package "preprocessCore" is in C and
>>    might outperform a pure Python implementation.
>>
>>
>> Not sure about speed. I have 84 microarrays samples with ~190,000 probes
>> and it normalizes in 7 sec. I have no idea how fast R is or how many
>> arrays are common to normalize.
>>
>
> So speed is not an issue for your use-case; even a 10x speedup might not
> justify the effort required to move to C, as this operation is performed
> once in a while (once per dataset mostly).
>
> I am not sure there is a "common" number. When still working with arrays, I
> can find myself with several hundred arrays with ~2 million probes each.
>
>
>     There is a variety of normalization methods in bioconductor, and it
>>    might make sense to embrace it as a dependency (rather than
>>    reimplement it). I have bindings for Bioconductor up my sleeve about
>>    to be distributed to few people for testing. The public release
>>    might be around ISMB, BOSC time.
>>
>>
>> I considered this and in the long run you might be right. But I don't
>> know R and I placed more value on understanding the normalization than
>> learning R. This is in part because there is little advantage in using R
>> in the next steps of my analysis.
>>
>
> Surprising, but you'll know best.
>
>
>  Bindings seem like a good idea but
>> they would be a black box to me. I guess for me since most of this is
>> new the value of implementing my own normalization in both learning more
>> about python and understanding the normalization out ways the benefits
>> of implementing it in R.
>>
>
> Everyone's mileage will vary. I often like building on existing libraries
> (although I frequently read how methods work): this makes my palette of
> tools richer than if I had to reimplement everything, and gives me time to
> create my own.
> Having this said, learning a language by implementing is a great way to go.
>
>
>  As a side question, why use biopython, are there ways in which it is
>> better than R ?
>>
>
> In short (and therefore with some imprecision and/or distortion), Biopython
> is a "Python package" (i.e., collection of modules) for bioinformatics, with
> a forte in handling a number of bioinformatics file formats. R is a language
> for statistics, data analysis and graphics.
>
>
>  For me it is purely that I know python (a little) and can nothing about
>> R. Sure If I am just doing through step by step instruction from
>> a bioconductor use manual I am fine but once I what to do something new
>> am am lost. Not that I can't learn I am just prioritizing my learning.
>>
>
> Then the idea is that you consider R/bioconductor as a Python library.
> Should you want something new, you can then implement it in Python.
>
>
>
> Laurent
>
>
>> And thanks for this
>>
>>    norm_a = numpy.array(normq(m))
>>
>>    can be replaced by
>>
>>    norm_a = numpy.as_array(normq(m))
>>
>>    to improve performances whenever m is of substantial size (as no
>>    copy is made - see
>>
>> http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy
>> )
>>
>>
>>
>>
>>
>> *Vincent Davis
>> 720-301-3003 *
>> vincent at vincentdavis.net <mailto:vincent at vincentdavis.net>
>>
>> my blog <http://vincentdavis.net> | LinkedIn
>> <http://www.linkedin.com/in/vincentdavis>
>>
>>
>>
>>
>> On Sat, Mar 20, 2010 at 12:05 PM, Laurent Gautier <lgautier at gmail.com
>> <mailto:lgautier at gmail.com>> wrote:
>>
>>    Hi Bartek and Vincent,
>>
>>    Few comments:
>>
>>    A/
>>
>>    The algorithm is fairly straightforward, as you noted it, but beware
>>    of details such missing values, ability to normalize against a
>>    target distribution, or ties when ranking (although I'd have to
>>    check if those receive a special treatment).
>>    The quantile normalization code in the R package "preprocessCore" is
>>    in C and might outperform a pure Python implementation.
>>
>>    B/
>>
>>    There is a variety of normalization methods in bioconductor, and it
>>    might make sense to embrace it as a dependency (rather than
>>    reimplement it). I have bindings for Bioconductor up my sleeve about
>>    to be distributed to few people for testing. The public release
>>    might be around ISMB, BOSC time.
>>
>>    C/
>>
>>
>>    norm_a = numpy.array(normq(m))
>>
>>    can be replaced by
>>
>>    norm_a = numpy.as_array(normq(m))
>>
>>    to improve performances whenever m is of substantial size (as no
>>    copy is made - see
>>
>> http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy
>>    )
>>
>>
>>
>>    Best,
>>
>>
>>    Laurent
>>
>>
>>
>>
>>    On 3/20/10 5:00 PM, biopython-request at lists.open-bio.org
>>    <mailto:biopython-request at lists.open-bio.org> wrote:
>>
>>             >  Is there a quantile normalization method in biopython, I
>>            search but did not
>>             >  find. If not it looks straight forward would it be of
>>            any interest to the
>>             >  community for me to contribute a method
>>             >
>>             >  1. given n arrays of length p, form X of dimension
>>             >  p ? n where each array is a column;
>>             >  2. sort each column of X to give X sort ;
>>             >  3. take the means across rows of X sort and assign this
>>             >  mean to each element in the row to get X sort ;
>>             >  4. get X normalized by rearranging each column of
>>             >  X sort to have the same ordering as original X
>>             >
>>             >  From
>>             >  A comparison of normalization methods for high
>>             >  density oligonucleotide array data based on
>>             >  variance and bias
>>             >  B. M. Bolstad 1,?, R. A. Irizarry 2, M. Astrand 3 and T.
>>            P. Speed 4, 5
>>             >  ?
>>             >
>>
>>          Hi,
>>
>>        I don't think there is such a method available.
>>
>>        I'm myself using the original R implementation by Bolstad et al.
>>        It requires
>>        rPy and R installed. It can be achieved in a few lines of code:
>>
>>        <pre>
>>        import rpy2.robjects as robjects
>>        #ll = list of concatenated values to normalize
>>        v = robjects.FloatVector(ll)
>>        #numrows=number of vectors that made up ll
>>        m = robjects.r['matrix'](v, nrow = numrows, byrow=True)
>>        robjects.r('require("preprocessCore")')
>>        normq=robjects.r('normalize.quantiles')
>>        norm_a=numpy.array(normq(m))
>>        #norm_a=normalized array
>>        </pre>
>>
>>        If your method is a pure python implementation which is
>>        comparably fast I
>>        think it would be worth to have it in Biopython since the method
>>        is (in my
>>        opinion) quite useful and it would remove the dependency on R
>>        from some of
>>        my scripts.
>>
>>        cheers
>>          Bartek
>>
>>
>>
>>
>