[Biopython-dev] [Biopython] SciPy paper: documenting statistical data structure design issues

Tue May 25 01:53:14 EDT 2010

Hi:

On Tuesday 25 May 2010 07:03:19 Vincent Davis wrote:

> Discussions on the pystatsmodels mailing list are, I am sure, relevant, but
> it might be more beneficial to discuss this first on the biopython list, as
> the pystatsmodels discussions sometimes get long and tend to be about
> economic-type data.
> The google group/mailing list is
> http://groups.google.ca/group/pystatsmodels

> I think a few good examples of a "typical" biopy data set and/or some of
> the typical difficulties would be good to have on the wiki. This might help
> start collaboration between statsmodels and biopython on this subject. I
> think there are few people that cross over between economics and
> bioinformatics.

My main concern with the current tools is memory use. For instance, when I
try to build a distribution of sequence lengths or qualities from NGS data,
I end up with millions of numbers. That is too much to hold in memory on any
reasonable computer. I've worked around the problem by using disk caches that
behave as iterators. I'm sure that this is not the most efficient solution.
It's just a hack, and I would certainly like to use better tools.
If you want to take a look at my current solution, go to:

http://github.com/JoseBlanca/franklin/blob/master/franklin/utils/itertools_.py
http://github.com/JoseBlanca/franklin/blob/master/franklin/statistics.py
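
A minimal sketch of the streaming idea, not the franklin code linked above:
when the values are small integers such as read lengths or qualities, a
counter keyed by value needs memory proportional to the number of distinct
values rather than the number of reads. The function names and the tiny
FASTQ sample are made up for illustration, and the parser assumes strictly
four lines per record.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def iter_fastq_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the sequence length of each record, one at a time.

    Assumes plain 4-line FASTQ records (no wrapped sequence lines).
    """
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the second line of each record is the sequence
            yield len(line.rstrip("\n"))


def summarize(counts: Counter) -> Tuple[int, float, int]:
    """Return (n, mean, lower median) computed from the counts alone."""
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no values were counted")
    mean = sum(value * count for value, count in counts.items()) / n
    target = (n + 1) // 2  # rank of the lower median
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= target:
            return n, mean, value
    raise AssertionError("unreachable: counts were non-empty")


# Hypothetical sample; in practice `lines` would be an open file handle,
# which is also consumed lazily, line by line.
sample = [
    "@r1\n", "ACGT\n", "+\n", "IIII\n",
    "@r2\n", "ACGT\n", "+\n", "IIII\n",
    "@r3\n", "ACGTAC\n", "+\n", "IIIIII\n",
]

# Counter consumes the generator item by item, so the full list of
# lengths is never materialized.
length_counts = Counter(iter_fastq_lengths(sample))
n, mean, median = summarize(length_counts)
print(length_counts, n, mean, median)
```

This only covers statistics that can be derived from value counts (mean,
quantiles, histograms). Anything that needs the individual values in their
original order would still need something like a disk cache.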

Best regards,

Jose Blanca

> Also, if you know of other groups that would be interested, please share
> this link/information.
>
> > Thanks,
> > --Michiel.
> >
> > --- On Mon, 5/24/10, Vincent Davis <vincent at vincentdavis.net> wrote:
> > > From: Vincent Davis <vincent at vincentdavis.net>
> > > Subject: [Biopython] SciPy paper: documenting statistical data
> > > structure design issues
> > > To: "biopython" <biopython at lists.open-bio.org>
> > > Date: Monday, May 24, 2010, 3:45 PM
> > > "see the message below, cross posted
> > > from pystatsmodels"
> > >
> > > We have been having some discussion on the pystatsmodels mailing list
> > > about data objects, numpy arrays... I think it would be valuable for
> > > some biopython users to contribute some comments, examples or ideas to
> > > the scipy wiki that has been set up for this. I think at the heart of
> > > this is that although almost anything can be done with a numpy array,
> > > we run into many problems that are difficult to solve with the current
> > > tools for numpy arrays. Because of this I think some nice examples of
> > > the data design problems that you have faced in biopython, and how
> > > they have been solved, would be valuable.
> > >
> > > Thanks
> > > Vincent
> > >
> > > On Sat, May 22, 2010 at 7:22 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> > > > For my SciPy talk and paper in a little over a month, I was hoping
> > > > to render a somewhat coherent discussion of the design needs of
> > > > statistical data structures, based on my experience developing
> > > > pandas for quant finance research. I think these broadly fall into
> > > > a few categories: implementation ease, usability (for the
> > > > non-developer IPython-based console user), performance, and
> > > > flexibility. Hopefully this will be useful information that will
> > > > help guide future development efforts. What do you folks think?
> > > >
> > > > As part of this, I was thinking maybe we should start a wiki page
> > > > (or pages) somewhere to start listing out the various design issues
> > > > (big and small) where people can write their opinions and we can
> > > > have a structured discussion (e-mail is a bit hard for this sort of
> > > > thing). I'd also like to spend some time reading through other
> > > > people's code (e.g. all of the larry code) and writing down what I
> > > > think about their design choices in a constructive way.
> > > >
> > > > Part of what prompted my idea for a wiki was reading some of the
> > > > larry code and wanting to share my thoughts on various parts of it.
> > > > Of course I'm also prepared for other people to attack (and for me
> > > > to have to defend) my own code. For most of these things there
> > > > isn't a "right" and "wrong" and I am only interested in having
> > > > constructive discussions and hearing people's perspectives. Here's
> > > > an example: in pandas when adding two different-labeled 2d arrays,
> > > > the result has the *union* of all the labels. In la you get the
> > > > intersection. There are certainly pros and cons for either approach
> > > > (in my case I don't want to lose information, even if it's nulled
> > > > out).
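
A small sketch of the label-alignment difference described in the quoted
paragraph, assuming a current pandas install. The frames and labels are made
up for illustration. The intersection behaviour is emulated here with
pandas' own `DataFrame.align(join="inner")`; it does not call the la
library, so it only illustrates the idea and not la's actual API.

```python
import pandas as pd

# Two 2-D labeled arrays that overlap only at row "r2", column "y".
a = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=["r1", "r2"])
b = pd.DataFrame({"y": [10.0, 20.0], "z": [30.0, 40.0]}, index=["r2", "r3"])

# pandas behaviour: the result carries the union of the labels, and every
# cell that is missing from either operand becomes NaN.
union_sum = a + b

# Intersection behaviour, emulated: keep only labels present in both frames
# before adding, so no NaN cells appear but the other labels are dropped.
a_inner, b_inner = a.align(b, join="inner")
intersection_sum = a_inner + b_inner

print(union_sum)
print(intersection_sum)
```

The union result keeps track of which labels existed at all, at the cost of
null cells; the intersection result is dense but silently drops labels.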
> > > >
> > > > We should also have a place where we document differences in
> > > > performance for various operations. I spent a lot of time even
> > > > before pandas was open-source obsessing over speed-- I'd like to
> > > > think I learned a few things but I was operating in a bubble so I
> > > > might have missed really obvious speedups. I also learned lots of
> > > > odd things about NumPy (did you know fancy indexing is a LOT slower
> > > > than ndarray.take?). We should probably establish some
> > > > apples-to-apples performance benchmarks to help people decide what
> > > > to use for their applications if speed matters.
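
A minimal sketch of the kind of apples-to-apples check mentioned in the
quoted paragraph, comparing fancy indexing with `ndarray.take` on the same
array and the same indices. The array sizes and repeat count are arbitrary
choices for illustration. The "LOT slower" remark refers to NumPy as of
2010, and the gap may well be different on a current release, so no timings
are claimed here; the point is only how to measure it locally.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
arr = rng.standard_normal(1_000_000)
idx = rng.integers(0, arr.size, size=100_000)

# Both forms select the same elements from a 1-D array.
fancy = arr[idx]
taken = arr.take(idx)
assert np.array_equal(fancy, taken)

# Time each form over the same number of repetitions.
repeats = 50
t_fancy = timeit.timeit(lambda: arr[idx], number=repeats)
t_take = timeit.timeit(lambda: arr.take(idx), number=repeats)

print(f"fancy indexing: {t_fancy / repeats * 1e3:.3f} ms per call")
print(f"ndarray.take:   {t_take / repeats * 1e3:.3f} ms per call")
```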
> > > >
> > > > Best,
> > > > Wes
> > >
> > >    *Vincent Davis
> > > 720-301-3003 *
> > > vincent at vincentdavis.net
> > >  my blog <http://vincentdavis.net> |
> > > LinkedIn<http://www.linkedin.com/in/vincentdavis>
> > > _______________________________________________
> > > Biopython mailing list  -  Biopython at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biopython
>

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)