From mjldehoon at yahoo.com  Wed Oct  1 08:18:24 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 1 Oct 2008 05:18:24 -0700 (PDT)
Subject: [BioPython] Bio.distance
Message-ID: <924102.72843.qm@web62403.mail.re1.yahoo.com>

Hi everybody,

Since the 1.48 release, Biopython has been making good progress in the migration from Numerical Python to NumPy. As part of this process, we are now reviewing and consolidating the code in Biopython that makes use of Numerical Python / NumPy. Specifically, we are thinking to merge the code in Bio.distance into Bio.kNN, and to deprecate Bio.distance and Bio.cdistance. Since Bio.kNN is the only Biopython module in Biopython that makes use of Bio.distance, we think that this won't affect anybody. However, if you are using Bio.distance outside of Bio.kNN, please let us know so we can find an alternative solution.

--Michiel.


From bsouthey at gmail.com  Wed Oct  1 11:49:53 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Wed, 01 Oct 2008 10:49:53 -0500
Subject: [BioPython] Bio.distance
In-Reply-To: <924102.72843.qm@web62403.mail.re1.yahoo.com>
References: <924102.72843.qm@web62403.mail.re1.yahoo.com>
Message-ID: <48E39C21.8010603@gmail.com>

Michiel de Hoon wrote:
> Hi everybody,
>
> Since the 1.48 release, Biopython has been making good progress in the migration from Numerical Python to NumPy. As part of this process, we are now reviewing and consolidating the code in Biopython that makes use of Numerical Python / NumPy. Specifically, we are thinking to merge the code in Bio.distance into Bio.kNN, and to deprecate Bio.distance and Bio.cdistance. Since Bio.kNN is the only Biopython module in Biopython that makes use of Bio.distance, we think that this won't affect anybody. However, if you are using Bio.distance outside of Bio.kNN, please let us know so we can find an alternative solution.
>
> --Michiel.
>
>
>       
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>   
Hi,
Under the 'standard' install I do not think that there is any advantage 
of using Bio.cdistance within Bio.kNN. I tested this on a bioinformatics 
data set with almost 1500 data points, 8 explanatory variables and k=9. 
I only got a one second difference between using Bio.cdistance or 
commenting it out on my system (after removing the build directory and 
reinstalling everything). Actual maximum times across three runs were 
under 16.6 seconds with it and under 17.4 seconds without it.

My system runs linux x86_64 (fedora 10) but it is not a 'clean' system 
due to other cpu intensive processes running. I used Python 2.5 and 
Numeric 2.4 as I forgot the order of imports. In my version the default 
distance without Bio.cdistance uses the Numeric dot (I did not try the 
python version) so I would expect this to be noticeably faster if lapack 
or atlas are installed than if these are not present.  (I used Fedora 
supplied Numeric so while I think this timing is without lapack and 
atlas I am not completely sure of that.)

I did not see an examples for k-nearest neighbor so below is (very bad) 
code using the logistic regression example 
(http://biopython.org/DIST/docs/cookbook/LogisticRegression.html).

Regards
Bruce


from Bio import kNN
xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30], [11, 
-220.94], [85, -193.94], [16, -182.71], [15, -180.41], [-26, -181.73], 
[58, -259.87], [126, -414.53], [191, -249.57], [113, -265.28], [145, 
-312.99], [154, -213.83], [147, -380.85], [93, -291.13]]
ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
model = kNN.train(xs, ys, 3)
ccr=0
tobs=0
for px, py in zip(xs, ys):
        cp=kNN.classify(model, px)
        tobs +=1
        if cp==py:
                ccr +=1
print tobs, ccr


From biopython at maubp.freeserve.co.uk  Wed Oct  1 11:52:05 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Oct 2008 16:52:05 +0100
Subject: [BioPython] More string methods for the Seq object
In-Reply-To: <320fb6e00809290506p8aa2b51p4901b693ebb268bf@mail.gmail.com>
References: <320fb6e00809260859r23c7915buc114c5c0b71e195@mail.gmail.com>
	<48DD2DE6.10908@gmail.com>
	<320fb6e00809261422n6e4c4889p734508613898cc3f@mail.gmail.com>
	<48DD59DF.1000504@gmail.com>
	<320fb6e00809261457j65dc0876hd59d17aee01bc983@mail.gmail.com>
	<bbcd77d00809261855oecdcfc4w7b762b45a0aa7dc4@mail.gmail.com>
	<320fb6e00809270557n73b81b5ayb93fe85f0f466626@mail.gmail.com>
	<bbcd77d00809271806x16457c7dyc4812dbd09d5f8a9@mail.gmail.com>
	<320fb6e00809290450m6fedbaacu15a75107e5c39658@mail.gmail.com>
	<320fb6e00809290506p8aa2b51p4901b693ebb268bf@mail.gmail.com>
Message-ID: <320fb6e00810010852j5cf8e3ak7dc788372568251f@mail.gmail.com>

On Mon, Sep 29, 2008 at 1:06 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> I assume you [Bruce] are agreeing with ... follow[ing] the
>> string defaults of white space for stipping or splitting (for
>> consistency, even though this won't typically be useful for
>> sequences).  On balance this would probably be best from
>> a principle of consistency and least surprise for the user -
>> I'll update the patches.
>
> New patch for Seq object split, strip, lstrip and rstrip methods on
> Bug 2596 which follows the python string defaults (splitting on or
> stripping of white space).
> http://bugzilla.open-bio.org/show_bug.cgi?id=2596

There is now a second version of this patch on that bug, which will
also accept Seq objects as arguments to the split, strip, lstrip and
rstrip methods, plus has the start of some tests too.

We (Peter, Martin, Bruce and Leighton) seem to have reached an
agreement about adding split, strip, lstrip and rstrip methods to the
Seq object with the behaviour (arguments and defaults) to follow those
of the python string as closely as possible.

I'd like to encourage others lurking on the list to comment too, but
unless anyone objects, I intend to add these methods in CVS this week,
together with an updated unit test and updates to the tutorial.

Peter

From biopython at maubp.freeserve.co.uk  Wed Oct  1 12:03:22 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Oct 2008 17:03:22 +0100
Subject: [BioPython] Bio.distance
In-Reply-To: <48E39C21.8010603@gmail.com>
References: <924102.72843.qm@web62403.mail.re1.yahoo.com>
	<48E39C21.8010603@gmail.com>
Message-ID: <320fb6e00810010903u253c6384ld401e1a771ee141e@mail.gmail.com>

On Wed, Oct 1, 2008 at 4:49 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>
> Hi,
> Under the 'standard' install I do not think that there is any advantage of
> using Bio.cdistance within Bio.kNN. I tested this on a bioinformatics data
> set with almost 1500 data points, 8 explanatory variables and k=9. ...
> Actual maximum times across three runs were under 16.6 seconds with
> it [Bio.cdistance] and under 17.4 seconds without it [Bio.distance using
> Numeric]

Its interesting that the C version is only slightly faster than
Numeric - of course as you point out there are lots of possible
complications here like lapack and atlas (plus compiler options and
CPU features).

I think your numbers are good support for Michiel's proposition that
we should deprecate Bio.cdistance and Bio.distance and just use numpy
in Bio.kNN - this will simplify our code base and make very little
difference to the speed.

Peter

From biopython at maubp.freeserve.co.uk  Wed Oct  1 12:17:10 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Oct 2008 17:17:10 +0100
Subject: [BioPython]  Bio.kNN documentation
Message-ID: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com>

Bruce wrote:
> I tested this [Bio.kNN] on a bioinformatics data set with almost 1500
> data points, 8 explanatory variables and k=9. ...

Do you think this larger example could be adapted into something for
the Biopython documentation?  Otherwise the next bit of code looks
interesting.

> I did not see an examples for k-nearest neighbor so below is (very bad)
> code using the logistic regression example
> (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html).

This is a set of Bacillus subtilis gene pairs for which the operon
structure is known, with the intergene distance and gene expression
score as explanatory variables, with the class being same operon or
different operons.

> from Bio import kNN
> xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30], [11,
> -220.94], [85, -193.94], [16, -182.71], [15, -180.41], [-26, -181.73], [58,
> -259.87], [126, -414.53], [191, -249.57], [113, -265.28], [145, -312.99],
> [154, -213.83], [147, -380.85], [93, -291.13]]
> ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
> model = kNN.train(xs, ys, 3)
> ccr=0
> tobs=0
> for px, py in zip(xs, ys):
>       cp=kNN.classify(model, px)
>       tobs +=1
>       if cp==py:
>               ccr +=1
> print tobs, ccr

Could you expand on the cryptic variable names?  ccr = correct call
rate? tobs = total observations?

Coupled with a scatter plot (say with pylab, showing the two classes
in different colours), this could be turned into a nice little example
for the cookbook section of the tutorial.  Notice that later on in the
logistic regression example there is a second table of "test data"
which could be used to make de novo predictions.

Thanks,

Peter

From bsouthey at gmail.com  Wed Oct  1 14:40:41 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Wed, 01 Oct 2008 13:40:41 -0500
Subject: [BioPython] Bio.kNN documentation
In-Reply-To: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com>
References: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com>
Message-ID: <48E3C429.1020004@gmail.com>

Peter wrote:
> Bruce wrote:
>   
>> I tested this [Bio.kNN] on a bioinformatics data set with almost 1500
>> data points, 8 explanatory variables and k=9. ...
>>     
>
> Do you think this larger example could be adapted into something for
> the Biopython documentation?  Otherwise the next bit of code looks
> interesting.
>
>   
>> I did not see an examples for k-nearest neighbor so below is (very bad)
>> code using the logistic regression example
>> (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html).
>>     
>
> This is a set of Bacillus subtilis gene pairs for which the operon
> structure is known, with the intergene distance and gene expression
> score as explanatory variables, with the class being same operon or
> different operons.
>
>   
>> from Bio import kNN
>> xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30], [11,
>> -220.94], [85, -193.94], [16, -182.71], [15, -180.41], [-26, -181.73], [58,
>> -259.87], [126, -414.53], [191, -249.57], [113, -265.28], [145, -312.99],
>> [154, -213.83], [147, -380.85], [93, -291.13]]
>> ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
>> model = kNN.train(xs, ys, 3)
>> ccr=0
>> tobs=0
>> for px, py in zip(xs, ys):
>>       cp=kNN.classify(model, px)
>>       tobs +=1
>>       if cp==py:
>>               ccr +=1
>> print tobs, ccr
>>     
>
> Could you expand on the cryptic variable names?  ccr = correct call
> rate? tobs = total observations?
>
> Coupled with a scatter plot (say with pylab, showing the two classes
> in different colours), this could be turned into a nice little example
> for the cookbook section of the tutorial.  Notice that later on in the
> logistic regression example there is a second table of "test data"
> which could be used to make de novo predictions.
>
> Thanks,
>
> Peter
>
>   
I did realize that this was coming... :-)
(I guess I am volunteering myself to provide some material on machine 
learning with BioPython. So this is a start.)

I wanted something quick and dirty to output for testing, so tobs is the 
total number of observations and ccr is number of correctly classified 
points - I was to lazy to divide it by tobs to get the correct 
classification rate.

Here is an more extended sample code that also uses logistic regression. 
(Python is so great to with here!) I don't have plotting packages 
installed but someone could add the plots.

Regards
Bruce


-------------- next part --------------
A non-text attachment was scrubbed...
Name: knn_lr_example.py
Type: text/x-python
Size: 3257 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20081001/40109831/attachment.py>

From biopython at maubp.freeserve.co.uk  Wed Oct  1 17:40:55 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Oct 2008 22:40:55 +0100
Subject: [BioPython] problem installing mxTextTools
In-Reply-To: <48896815.10104@berkeley.edu>
References: <4889645B.9080400@berkeley.edu> <48896815.10104@berkeley.edu>
Message-ID: <320fb6e00810011440v4bd80263hf3830d8d9f548e63@mail.gmail.com>

On Fri, Jul 25, 2008 at 6:43 AM, Nick Matzke <matzke at berkeley.edu> wrote:
> Hi all,
>
> An update -- I found a solution by copying the .pck file the download
> actually gave me to the filename that the install was apparently looking
> for.  This was not exactly obvious (!!!!) but apparently it worked:
> ...
> >>> print now()
> 2008-07-24 22:39:17.66
>

Was this an old email you accidently forwarded to the list?

For the next release of Biopython the only bits of code still using
mxTextTools have been deprecated, so the Biopython setup won't even
look for mxTextTools at all.  Right now with Biopython 1.48 you can
just install without mxTextTools (as the setup.py prompt should make
clear).

Peter

From biopython at maubp.freeserve.co.uk  Wed Oct  1 17:44:34 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Oct 2008 22:44:34 +0100
Subject: [BioPython] problem installing mxTextTools
In-Reply-To: <320fb6e00810011440v4bd80263hf3830d8d9f548e63@mail.gmail.com>
References: <4889645B.9080400@berkeley.edu> <48896815.10104@berkeley.edu>
	<320fb6e00810011440v4bd80263hf3830d8d9f548e63@mail.gmail.com>
Message-ID: <320fb6e00810011444u7e5bf37fh2801c1980bd38a2a@mail.gmail.com>

On Wed, Oct 1, 2008 at 10:40 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Fri, Jul 25, 2008 at 6:43 AM, Nick Matzke <matzke at berkeley.edu> wrote:
>> Hi all,
>>
>> An update -- I found a solution by copying the .pck file the download
>> actually gave me to the filename that the install was apparently looking
>> for.  This was not exactly obvious (!!!!) but apparently it worked:
>> ...
>> >>> print now()
>> 2008-07-24 22:39:17.66
>>
>
> Was this an old email you accidently forwarded to the list?

Sorry about this Nick & everyone else - it was a mistake at my end.
It looks like a glitch (perhaps in GoogleMail itself?) marked this old
thread as unread and bumped it to the top of my to read list.  Odd,
but I didn't notice until after sending my confused reply.

Peter

From kteague at bcgsc.ca  Wed Oct  1 17:53:44 2008
From: kteague at bcgsc.ca (Kevin Teague)
Date: Wed, 1 Oct 2008 14:53:44 -0700
Subject: [BioPython] development question
References: <48B5BD98.8050101@heckler-koch.cz><48B65C9B.4000407@heckler-koch.cz>
	<20080828090431.GD5801@inb.uni-luebeck.de>
Message-ID: <36BEEFA2DF192944BF71E072F7A5F4656043D6@xchange1.phage.bcgsc.ca>


On Thu, Aug 28, 2008 at 10:06:51AM +0200, Pavel SRB wrote:
> so now to biopython. On my system i have biopython from debian repository 
> via apt-get. But i would like to have second version of biopython in system 
> just to check, log and change the code to learn more. This can be done with 
> removing sys.path.remove("/var/lib/python-support/python2.5")
> and importing Bio from some other development directory. But this way i 
> loose all modules in direcotory mentioned above and i believe it can be 
> done more clearly

You might want to check out VirtualEnv:

http://pypi.python.org/pypi/virtualenv

This tool will let you "clone" your system Python, so that you have your own isolated [virtualpythonname]/bin and [virtualpythonname/lib/python/site-packages/ directories. If you create a virtualenv with the --no-site-packages, then the /var/lib/python-support/python2.5/ location will be not be in the created virtual python's sys.path. Otherwise by default this location will be included, but your own isolated [virtualpythonname/lib/python/site-packages/ location will have precendence on sys.path, so if you install a newer BioPython into there it will get imported instead of the system one.

You can of course do all of this by manually fiddling with sys.path, but VirtualEnv just wraps up a few of these common practices into one handy tool - great for experimentation or trying out different packages.


From lunt at ctbp.ucsd.edu  Sat Oct  4 17:50:33 2008
From: lunt at ctbp.ucsd.edu (Bryan Lunt)
Date: Sat, 4 Oct 2008 14:50:33 -0700
Subject: [BioPython] Copy Constructors for Bio.Seq.Seq?
In-Reply-To: <b34be8bd0810041445k382c3253ybbfec492717a85a8@mail.gmail.com>
References: <b34be8bd0810041445k382c3253ybbfec492717a85a8@mail.gmail.com>
Message-ID: <b34be8bd0810041450n56cfb161p7c281f947340a7d@mail.gmail.com>

Greetings All!
I would like to make the following humble suggestion:
A copy-constructor for Bio.Seq.Seq would be helpful, currently it
seems that calling Bio.Align.Generic.Alignment.add_sequence on a Seq
object breaks because it tries to initialize a new Seq object on
whatever data you provided, and there is no copy-constructor, nor does
Bio.Align.Generic.Alignment.add_sequence handled just adding a Seq
object directly.

Thanks for considering this, I think this addition will help make
client-code cleaner.

-Bryan Lunt

From biopython at maubp.freeserve.co.uk  Sun Oct  5 07:06:57 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 5 Oct 2008 12:06:57 +0100
Subject: [BioPython] Copy Constructors for Bio.Seq.Seq?
In-Reply-To: <b34be8bd0810041450n56cfb161p7c281f947340a7d@mail.gmail.com>
References: <b34be8bd0810041445k382c3253ybbfec492717a85a8@mail.gmail.com>
	<b34be8bd0810041450n56cfb161p7c281f947340a7d@mail.gmail.com>
Message-ID: <320fb6e00810050406t41d25043oe7011745055a1fc7@mail.gmail.com>

On Sat, Oct 4, 2008 at 10:50 PM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
> Greetings All!
> I would like to make the following humble suggestion:
> A copy-constructor for Bio.Seq.Seq would be helpful, ...

You can use the string idiom of my_seq[:] to make a copy of a Seq object.

> currently it
> seems that calling Bio.Align.Generic.Alignment.add_sequence on a
> Seq object breaks because it tries to initialize a new Seq object on
> whatever data you provided, and there is no copy-constructor, nor does
> Bio.Align.Generic.Alignment.add_sequence handled just adding a Seq
> object directly.

Yes, the Bio.Align.Generic.Alignment.add_sequence() method currently
expects a string (which its docstring is fairly clear about), and
giving it a Seq does fail.  I suppose allowing it to take a Seq object
would be sensible (with a check on the alphabet being compatible with
that declared for the alignment).

We have been debating making the generic Alignment a little more list
like, by allowing .append() or .extend() for use with SeqRecord
objects (Bug 2553).
http://bugzilla.open-bio.org/show_bug.cgi?id=2553

> Thanks for considering this, I think this addition will help make
> client-code cleaner.

Would the SeqRecord append/extend idea suit you just as well?

Peter

From biopython at maubp.freeserve.co.uk  Sun Oct  5 08:16:28 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 5 Oct 2008 13:16:28 +0100
Subject: [BioPython] Migrating from Numerical Python to numpy
In-Reply-To: <623262.17729.qm@web62407.mail.re1.yahoo.com>
References: <623262.17729.qm@web62407.mail.re1.yahoo.com>
Message-ID: <320fb6e00810050516i20822ebcwf15cd058af0c9759@mail.gmail.com>

On Sat, Sep 20, 2008 at 4:02 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Dear all,
>
> As you probably are well aware, Biopython releases to date have used
> the now obsolete Numeric python library.  This is no longer being
> maintained and has been superseded by the numpy library.  See
> http://www.scipy.org/History_of_SciPy for more about details on the
> history of numerical python.  Biopython 1.48 should be the last
> Numeric only release of Biopython - we have already started moving to
> numpy in CVS.
>
> Supporting both Numeric and numpy ought to be fairly straightforward
> for the pure python modules in Biopython. However, we also have C code
> which must interact with Numeric/numpy, and trying to support both
> would be harder.
>
> Would anyone be inconvenienced if the next release of Biopython
> supported numpy ONLY (dropping support for Numeric)?  If so please
> speak up now - either here or on the development mailing list.
> Otherwise, a simple switch from Numeric to numpy will probably be the
> most straightforward migration plan.

No one has objected, and a simple switch from Numeric to numpy is
underway in CVS.  The next release of Biopython will suport numpy only
(dropping support for Numeric).

As an aside, from my own testing Biopython CVS looks happy with numpy
1.0, 1.1 and the just released 1.2 (although if we have missed any
deprecation warnings please let us know).

For preparing Windows installers for Biopython, it might be helpful to
know what version of numpy most Windows users (will) have installed
(this is important due to numpy C API changes between versions).

Thanks,

Peter

From biopython at maubp.freeserve.co.uk  Mon Oct  6 06:39:15 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 6 Oct 2008 11:39:15 +0100
Subject: [BioPython] Bio.kNN documentation
In-Reply-To: <48E3C429.1020004@gmail.com>
References: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com>
	<48E3C429.1020004@gmail.com>
Message-ID: <320fb6e00810060339t427d4c5dme2690fbc36b30c81@mail.gmail.com>

Bruce wrote:
>>> I did not see an examples for k-nearest neighbor so below is
>>> (very bad) code using the logistic regression example
>>> (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html).

Peter wrote:
>> This is a set of Bacillus subtilis gene pairs for which the operon
>> structure is known, with the intergene distance and gene expression
>> score as explanatory variables, with the class being same operon or
>> different operons.
>> ...
>> Coupled with a scatter plot (say with pylab, showing the two classes
>> in different colours), this could be turned into a nice little example
>> for the cookbook section of the tutorial.  Notice that later on in the
>> logistic regression example there is a second table of "test data"
>> which could be used to make de novo predictions.

Bruce wrote:
> I did realize that this was coming... :-)
> (I guess I am volunteering myself to provide some material on
> machine learning with BioPython. So this is a start.)

Michiel has suggested adding a whole chapter to the tutorial about
supervised learning, presumably incorporating his logistic regression
example as part of this.  Have a look at thread "Bio.MarkovModel;
Bio.Popgen, Bio.PDB documentation" on the dev mailing list.  I'm sure
you can contribute (even if just by proof reading).

Peter

From fkauff at biologie.uni-kl.de  Tue Oct  7 04:02:12 2008
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Tue, 07 Oct 2008 10:02:12 +0200
Subject: [BioPython] Creating and traversing an ultrametric tree
In-Reply-To: <320fb6e00809241326i16a337das844f4ac74766b459@mail.gmail.com>
References: <73045cca0809231713v219c3ec3tfc24461c7af6b453@mail.gmail.com>	<320fb6e00809240200y144500cbl86f9023cb868da89@mail.gmail.com>	<73045cca0809241132x30bc4d63t7ac0b9967a20e76c@mail.gmail.com>
	<320fb6e00809241326i16a337das844f4ac74766b459@mail.gmail.com>
Message-ID: <48EB1784.50803@biologie.uni-kl.de>


Peter wrote:
> On Wed, Sep 24, 2008 at 7:32 PM, aditya shukla
> <adityashukla1983 at gmail.com> wrote:
>   
>> Hello Peter ,
>>
>> Thanks for the reply ,
>> I have attached a file with  of the kind of data that i wanna parse.
>> I tried using Thomas Mailund's Newick tree parser but this dosen't
>> seem to work , so is there any other module that can help?
>>     
>
> Your file looks like this (in case anyone on the mailing list recognises it),
>
> /T_0_size=105((-bin-ulockmgr_server:0.99[&&NHX:C=0.195.0],
> (((-bin-hostname:0.00[&&NHX:C=200.0.0],
> (-bin-dnsdomainname:0.00[&&NHX:C=200.0.0],
> ...):0.99):0.99):0.99):0.99);
>
> [with a large chunk removed, and new lines inserted]
>
> I'm guessing this is some kind of computer system profile - nothing to
> do with bioinformatics.
>
> I'm not 100% sure this is Newick format - it might be worth trying to
> parse everything after the "/T_0_size=105" text which looks out of
> place to me.
>
> If it is a valid Newick format tree file, then it is using named
> internal nodes which is something Biopython can't currently parse (see
> Bug 2543, http://bugzilla.open-bio.org/show_bug.cgi?id=2543 ).  So I
> don't think you can use the Bio.Nexus module in Biopython to read this
> tree.
>
>   
Nexus.Trees has been extended to deal with internal node names, or 
"special comments" in the format [& blablalba]. Such comments comments 
can appear directly after the taxon label, after the closing 
parentheses, or between branchlength / support values attached to a node 
or a taxon labels, such as

(a,(b,(c,d)[&hi there]))
(a,(b[&hi there],c))
(a,(b:0.123[&hi there],c[&heyho]:0.3))
(a,(b,c)0.4[&comment]:0.95)

The comments are stored without change in the corresponding node object 
and can be accessed like

 >>> t=Trees.Tree('(a,(b:0.123[&hi there],c[&heyho]:0.3))')
 >>> print t.node(3).data.comment
[&hi there]
 >>> print t.node(4).data.comment
[&heyho]
 >>>

The comments are not parsed in any way - internal labels vary greatly in 
syntax, and are used to store all kinds of information. But at least 
they are now parsed and stored, and users can deal with them in any way 
they like.

Frank

> The only other python package I can suggest you try is NetworkX,
> https://networkx.lanl.gov/wiki
>
> Good luck,
>
> Peter
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>   


From mjldehoon at yahoo.com  Tue Oct  7 19:10:12 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 7 Oct 2008 16:10:12 -0700 (PDT)
Subject: [BioPython] Bio.kNN documentation
In-Reply-To: <320fb6e00810060339t427d4c5dme2690fbc36b30c81@mail.gmail.com>
Message-ID: <381879.37032.qm@web62403.mail.re1.yahoo.com>

> Bruce wrote:
> > (I guess I am volunteering myself to provide some
> material on
> > machine learning with BioPython. So this is a start.)
> 
> Michiel has suggested adding a whole chapter to the
> tutorial about
> supervised learning, presumably incorporating his logistic
> regression
> example as part of this.  Have a look at thread
> "Bio.MarkovModel;
> Bio.Popgen, Bio.PDB documentation" on the dev mailing
> list.  I'm sure
> you can contribute (even if just by proof reading).

Some more documentation on machine learning would definitely be useful. Recently I started a chapter on supervised learning methods in the tutorial. Right now it only covers logistic regression, but it should also include Bio.MarkovModel, Bio.MaxEntropy, Bio.NaiveBayes, and Bio.kNN. If you are planning to write some documentation on any of these, please let us know so we can avoid duplicated efforts. The new tutorial is in CVS; I put a copy of the HTML output of the latest version at
http://biopython.org/DIST/docs/tutorial/Tutorial.new.html.

Thanks!

--Michiel


From bsouthey at gmail.com  Tue Oct  7 21:35:51 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Tue, 7 Oct 2008 20:35:51 -0500
Subject: [BioPython] Bio.kNN documentation
In-Reply-To: <381879.37032.qm@web62403.mail.re1.yahoo.com>
References: <320fb6e00810060339t427d4c5dme2690fbc36b30c81@mail.gmail.com>
	<381879.37032.qm@web62403.mail.re1.yahoo.com>
Message-ID: <bbcd77d00810071835r44a1cf92pba05eb436c7e1734@mail.gmail.com>

On Tue, Oct 7, 2008 at 6:10 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> Bruce wrote:
>> > (I guess I am volunteering myself to provide some
>> material on
>> > machine learning with BioPython. So this is a start.)
>>
>> Michiel has suggested adding a whole chapter to the
>> tutorial about
>> supervised learning, presumably incorporating his logistic
>> regression
>> example as part of this.  Have a look at thread
>> "Bio.MarkovModel;
>> Bio.Popgen, Bio.PDB documentation" on the dev mailing
>> list.  I'm sure
>> you can contribute (even if just by proof reading).
>
> Some more documentation on machine learning would definitely be useful. Recently I started a chapter on supervised learning methods in the tutorial. Right now it only covers logistic regression, but it should also include Bio.MarkovModel, Bio.MaxEntropy, Bio.NaiveBayes, and Bio.kNN. If you are planning to write some documentation on any of these, please let us know so we can avoid duplicated efforts. The new tutorial is in CVS; I put a copy of the HTML output of the latest version at
> http://biopython.org/DIST/docs/tutorial/Tutorial.new.html.
>
> Thanks!
>
> --Michiel
>

Hi,
I have not given it too much thought at present but this reflects some
of the work I have been doing or involved with. I do not know enough
about Bio.MarkovModel, Bio.MaxEntropy and Bio.NaiveBayes to really
help. But I did think to start with trying to extend the supervised
learning material to be more general.  One aspect is to get provide
working code using different methodologies for different examples.

Regards
Bruce

From stephan80 at mac.com  Wed Oct  8 07:33:51 2008
From: stephan80 at mac.com (Stephan)
Date: Wed, 08 Oct 2008 13:33:51 +0200
Subject: [BioPython] Entrez.efetch
Message-ID: <75573950382669954948356356615157751492-Webmail2@me.com>

Hi,

I am using biopython for a week or so. The package is amazing, I wonder how I possibly ignored this for so long now.
Since I am not only new to biopython I am also new in this mailing list, so forgive me if this is not the right forum for a question like this.

Anyway, here is a weird little problem with the Bio.Entrez.efetch tool:
(I use python 2.5 and the latest Biopython 1.48)
I want to run the following little test-code, using etetch to get chromosome 4 of Drosophila melanogaster as a genbank-file:

---------------------------CODE------------------------------------
from Bio import Entrez, SeqIO

print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"]
handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
print "downloading to SeqRecord..."
record = SeqIO.read(handle, "genbank")
print "...done"

handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
filehandle = open("NCBI_DroMel", "w")
print "downloading to file..."
filehandle.write(handle.read())
print "...done"

handle = open("NCBI_DroMel")
print "reading from file..."
record = SeqIO.read(handle, "genbank")
---------------------------END-CODE------------------------------------

In the last line we have a crash, see the output of the code:

---------------------------OUTPUT------------------------------------
Drosophila melanogaster chromosome 4, complete sequence
downloading to SeqRecord...
...done
downloading to file...
...done
reading chr2L from file...
Traceback (most recent call last):
  File "efetch-test.py", line 17, in <module>
    record = SeqIO.read(handle, "genbank")
  File "HOME/lib/python/Bio/SeqIO/__init__.py", line 366, in read
    first = iterator.next()
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 410, in parse_records
    record = self.parse(handle)
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 393, in parse
    if self.feed(handle, consumer) :
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 370, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 723, in parse_footer
    raise ValueError("Premature end of file in sequence data")
ValueError: Premature end of file in sequence data
---------------------------END-OUTPUT------------------------------------

It seems that downloading the file to disk will corrupt the genbank file, while downloading directly into biopythons SeqIO.read() function works properly. I dont get it!
When I download this chromosome manually from the NCBI-website, I indeed find a difference in one line, namely in line 3 of the genbank file. In the manually downloaded file line 3 reads: "ACCESSION   NC_004353 REGION: 1..1351857", while in the file produced from my code I have only: "ACCESSION   NC_004353". So without that region-information, the biopython parser of course runs to a premature end.

I rather use the cPickle-module now to save the whole SeqRecord-instance. Thats works fine, so I dont need an immediate solution for the above posted problem, but I thought it might be interesting maybe...

Any hints?

Regards, Stephan

From chapmanb at 50mail.com  Wed Oct  8 08:35:33 2008
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 8 Oct 2008 08:35:33 -0400
Subject: [BioPython] Entrez.efetch
In-Reply-To: <75573950382669954948356356615157751492-Webmail2@me.com>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
Message-ID: <20081008123533.GE57379@sobchak.mgh.harvard.edu>

Hi Stephan;

> It seems that downloading the file to disk will corrupt the genbank
> file, while downloading directly into biopythons SeqIO.read() function
> works properly. I dont get it! 
>
> When I download this chromosome manually from the NCBI-website,
> I indeed find a difference in one line, namely in line 3 of the
> genbank file. In the manually downloaded file line 3 reads:
> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced
> from my code I have only: "ACCESSION NC_004353". So without that
> region-information, the biopython parser of course runs to a premature
> end.

This is a tricky problem that I ran into as well and is fixed in the
latest CVS version. The issue is that the Biopython reader is using an
UndoHandle instead of a standard python handle. By default some of these
operations appear to be assuming an iterator, but UndoHandle did not
provide this.

As a result, you can lose the first couple of lines which are
previously examined to determine the filetype. The fix is to make
this a proper iterator. You can either check out current CVS, or
make the addition manually to Bio/File.py in your current version:

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython

Hope this helps,
Brad

From biopython at maubp.freeserve.co.uk  Wed Oct  8 09:37:24 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 14:37:24 +0100
Subject: [BioPython] Entrez.efetch
In-Reply-To: <75573950382669954948356356615157751492-Webmail2@me.com>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
Message-ID: <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com>

On Wed, Oct 8, 2008 at 12:33 PM, Stephan <stephan80 at mac.com> wrote:
> Hi,
>
> I am using biopython for a week or so. The package is amazing, I wonder how I possibly ignored this for so long now.
> Since I am not only new to biopython I am also new in this mailing list, so forgive me if this is not the right forum for a question like this.
>
> Anyway, here is a weird little problem with the Bio.Entrez.efetch tool:
> (I use python 2.5 and the latest Biopython 1.48)
> I want to run the following little test-code, using etetch to get chromosome 4 of Drosophila melanogaster as a genbank-file:
>
> ---------------------------CODE------------------------------------
> from Bio import Entrez, SeqIO
>
> print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"]
> handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
> print "downloading to SeqRecord..."
> record = SeqIO.read(handle, "genbank")
> print "...done"

I assume this is just test code - as it would be silly to download the
GenBank file twice in a real script.

> handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
> filehandle = open("NCBI_DroMel", "w")
> print "downloading to file..."
> filehandle.write(handle.read())

You should now close the file, which should ensure it is fully written to disk:
filehandle.close()

> print "...done"
>
> handle = open("NCBI_DroMel")
> print "reading from file..."
> record = SeqIO.read(handle, "genbank")
> ---------------------------END-CODE------------------------------------
>
> In the last line we have a crash,
>  ...
> ValueError: Premature end of file in sequence data

This is because you started reading in the file without finishing
writing to it - the parser could only read in part of the data, and is
complaining about it ending prematurely.

Peter

From p.j.a.cock at googlemail.com  Wed Oct  8 09:46:25 2008
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 8 Oct 2008 14:46:25 +0100
Subject: [BioPython] Entrez.efetch
In-Reply-To: <20081008123533.GE57379@sobchak.mgh.harvard.edu>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
	<20081008123533.GE57379@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>

Stephan wrote:
>> When I download this chromosome manually from the NCBI-website,
>> I indeed find a difference in one line, namely in line 3 of the
>> genbank file. In the manually downloaded file line 3 reads:
>> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced
>> from my code I have only: "ACCESSION NC_004353". So without that
>> region-information, the biopython parser of course runs to a premature
>> end.

Stephan - when you say manually, do you mean via a web browser?  If so
it is likely to be using a subtly different URL, which might explain
the NCBI generating slightly different data on the fly.  Either way,
this ACCESSION line difference shouldn't trigger the "Premature end of
file in sequence data" error in the GenBank parser.

On Wed, Oct 8, 2008 at 1:35 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> This is a tricky problem that I ran into as well and is fixed in the
> latest CVS version. The issue is that the Biopython reader is using an
> UndoHandle instead of a standard python handle. By default some of these
> operations appear to be assuming an iterator, but UndoHandle did not
> provide this.

Brad, I'm pretty sure the GenBank parser is NOT using the UndoHandle.
Just adding the close made Stephan's example work for me.  What
exactly was the problem you ran into (one of the other parsers
perhaps?).

> As a result, you can lose the first couple of lines which are
> previously examined to determine the filetype. The fix is to make
> this a proper iterator. You can either check out current CVS, or
> make the addition manually to Bio/File.py in your current version:
>
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython

Adding this to the UndoHandle seems a sensible improvement - but I
don't see how it can affect Stephan's script.

Peter

From p.j.a.cock at googlemail.com  Wed Oct  8 09:46:25 2008
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 8 Oct 2008 14:46:25 +0100
Subject: [BioPython] Entrez.efetch
In-Reply-To: <20081008123533.GE57379@sobchak.mgh.harvard.edu>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
	<20081008123533.GE57379@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>

Stephan wrote:
>> When I download this chromosome manually from the NCBI-website,
>> I indeed find a difference in one line, namely in line 3 of the
>> genbank file. In the manually downloaded file line 3 reads:
>> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced
>> from my code I have only: "ACCESSION NC_004353". So without that
>> region-information, the biopython parser of course runs to a premature
>> end.

Stephan - when you say manually, do you mean via a web browser?  If so
it is likely to be using a subtly different URL, which might explain
the NCBI generating slightly different data on the fly.  Either way,
this ACCESSION line difference shouldn't trigger the "Premature end of
file in sequence data" error in the GenBank parser.

On Wed, Oct 8, 2008 at 1:35 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> This is a tricky problem that I ran into as well and is fixed in the
> latest CVS version. The issue is that the Biopython reader is using an
> UndoHandle instead of a standard python handle. By default some of these
> operations appear to be assuming an iterator, but UndoHandle did not
> provide this.

Brad, I'm pretty sure the GenBank parser is NOT using the UndoHandle.
Just adding the close made Stephan's example work for me.  What
exactly was the problem you ran into (one of the other parsers
perhaps?).

> As a result, you can lose the first couple of lines which are
> previously examined to determine the filetype. The fix is to make
> this a proper iterator. You can either check out current CVS, or
> make the addition manually to Bio/File.py in your current version:
>
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython

Adding this to the UndoHandle seems a sensible improvement - but I
don't see how it can affect Stephan's script.

Peter

From stephan80 at mac.com  Wed Oct  8 09:48:25 2008
From: stephan80 at mac.com (Stephan)
Date: Wed, 08 Oct 2008 15:48:25 +0200
Subject: [BioPython] Entrez.efetch
In-Reply-To: <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
	<320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com>
Message-ID: <128043477953580677661042463273686413408-Webmail2@me.com>

Hi guys,

OK, there is two different problems here that Brad and Peter independently pointed out to me. Peter, you are right that not closing the file actually caused the error. Your hint fixes that, thanks.
But that doesnt fix that there is a part of line 3 missing over the download, and although I actually updated to the newest cvs-version of biopython as Brad suggested (sorry for accidently putting my answer not on the mailing-list) that does not fix that line...

Best,
Stephan

 
Am Mittwoch 08 Oktober 2008 um 03:37PM schrieb "Peter" <biopython at maubp.freeserve.co.uk>:
>On Wed, Oct 8, 2008 at 12:33 PM, Stephan <stephan80 at mac.com> wrote:
>> Hi,
>>
>> I am using biopython for a week or so. The package is amazing, I wonder how I possibly ignored this for so long now.
>> Since I am not only new to biopython I am also new in this mailing list, so forgive me if this is not the right forum for a question like this.
>>
>> Anyway, here is a weird little problem with the Bio.Entrez.efetch tool:
>> (I use python 2.5 and the latest Biopython 1.48)
>> I want to run the following little test-code, using etetch to get chromosome 4 of Drosophila melanogaster as a genbank-file:
>>
>> ---------------------------CODE------------------------------------
>> from Bio import Entrez, SeqIO
>>
>> print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"]
>> handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
>> print "downloading to SeqRecord..."
>> record = SeqIO.read(handle, "genbank")
>> print "...done"
>
>I assume this is just test code - as it would be silly to download the
>GenBank file twice in a real script.
>
>> handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
>> filehandle = open("NCBI_DroMel", "w")
>> print "downloading to file..."
>> filehandle.write(handle.read())
>
>You should now close the file, which should ensure it is fully written to disk:
>filehandle.close()
>
>> print "...done"
>>
>> handle = open("NCBI_DroMel")
>> print "reading from file..."
>> record = SeqIO.read(handle, "genbank")
>> ---------------------------END-CODE------------------------------------
>>
>> In the last line we have a crash,
>>  ...
>> ValueError: Premature end of file in sequence data
>
>This is because you started reading in the file without finishing
>writing to it - the parser could only read in part of the data, and is
>complaining about it ending prematurely.
>
>Peter
>
>

From stephan80 at mac.com  Wed Oct  8 10:00:31 2008
From: stephan80 at mac.com (Stephan)
Date: Wed, 08 Oct 2008 16:00:31 +0200
Subject: [BioPython] Entrez.efetch
In-Reply-To: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
	<20081008123533.GE57379@sobchak.mgh.harvard.edu>
	<320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>
Message-ID: <72537648433629820630731006204512761040-Webmail2@me.com>

>Stephan - when you say manually, do you mean via a web browser?  If so 
>it is likely to be using a subtly different URL, which might explain 
>the NCBI generating slightly different data on the fly.  Either way, 
>this ACCESSION line difference shouldn't trigger the "Premature end of 
>file in sequence data" error in the GenBank parser. 

Thanks, that must be it! 
Now I guess everything is solved, closing the handle makes my code run properly and the download-from-NCBI-webpage-issue explains the difference in line 3. 

>Adding this to the UndoHandle seems a sensible improvement - but I 
>don't see how it can affect Stephan's script. 

There I agree, thanks anyway, Brad. 

Regards, 
Stephan

From biopython at maubp.freeserve.co.uk  Wed Oct  8 10:02:54 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 15:02:54 +0100
Subject: [BioPython] Entrez.efetch
In-Reply-To: <128043477953580677661042463273686413408-Webmail2@me.com>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
	<320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com>
	<128043477953580677661042463273686413408-Webmail2@me.com>
Message-ID: <320fb6e00810080702q6774f58ap52a02073d62cb75a@mail.gmail.com>

On Wed, Oct 8, 2008 at 2:48 PM, Stephan <stephan80 at mac.com> wrote:
>
> Hi guys,
>
> OK, there is two different problems here that Brad and Peter independently
> pointed out to me. Peter, you are right that not closing the file actually
> caused the error. Your hint fixes that, thanks.

Great.

> But that doesnt fix that there is a part of line 3 missing over the download,
> and although I actually updated to the newest cvs-version of biopython as
> Brad suggested (sorry for accidently putting my answer not on the mailing-list)
> that does not fix that line...

This is the issue where you get different GenBank files using
Bio.Entrez.efetch and a "manual download"?  First of all what did you
mean by "manual download" - for example FTP (what URL), or from a
browser?  Secondly, does this difference to the ACCESSION line (line
3) actually have any ill effects?

To be clear using Bio.Entrez.efetch as in your script, I get this:

LOCUS       NC_004353            1351857 bp    DNA     linear   INV 14-MAY-2008
DEFINITION  Drosophila melanogaster chromosome 4, complete sequence.
ACCESSION   NC_004353
VERSION     NC_004353.3  GI:116010290
PROJECT     GenomeProject:164
KEYWORDS    .
SOURCE      Drosophila melanogaster (fruit fly)
  ORGANISM  Drosophila melanogaster
...

Using FTP from ftp://ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster/CHR_4/NC_004353.gbk
I get something similar but different:

LOCUS       NC_004353            1351857 bp    DNA     linear   INV 14-MAY-2008
DEFINITION  Drosophila melanogaster chromosome 4, complete sequence.
ACCESSION   NC_004353
VERSION     NC_004353.3  GI:116010290
KEYWORDS    .
SOURCE      Drosophila melanogaster (fruit fly)
  ORGANISM  Drosophila melanogaster
...

Notice the FTP file lacks the PROJECT line, and also differs slightly
in its feature table.

Using the NCBI website I suspect you can get other slight variations
(like the different ACCESSION line you reported).

Peter

From stephan80 at mac.com  Wed Oct  8 09:52:07 2008
From: stephan80 at mac.com (Stephan)
Date: Wed, 08 Oct 2008 15:52:07 +0200
Subject: [BioPython] Entrez.efetch
In-Reply-To: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
	<20081008123533.GE57379@sobchak.mgh.harvard.edu>
	<320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>
Message-ID: <56009583349175862359179071289436480391-Webmail2@me.com>


>Stephan - when you say manually, do you mean via a web browser?  If so
>it is likely to be using a subtly different URL, which might explain
>the NCBI generating slightly different data on the fly.  Either way,
>this ACCESSION line difference shouldn't trigger the "Premature end of
>file in sequence data" error in the GenBank parser.

Thanks, that must be it!
Now I guess everything is solved, closing the handle makes my code run properly and the download-from-NCBI-webpage-issue explains the difference in line 3.

>Adding this to the UndoHandle seems a sensible improvement - but I
>don't see how it can affect Stephan's script.

There I agree, thanks anyway, Brad.

Regards,
Stephan

From biopythonlist at gmail.com  Wed Oct  8 12:23:32 2008
From: biopythonlist at gmail.com (dr goettel)
Date: Wed, 8 Oct 2008 18:23:32 +0200
Subject: [BioPython] taxonomic tree
Message-ID: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com>

Hello, I'm new in this list and in BioPython.
I would like to create a NCBI-like taxonomic tree and then fill it with the
organisms that I have in a file. Is there an easy way to do this? I started
using biopython's function at 7.11.4 (finding the lineage of an organism) in
the tutorial, but I need to do this tens of thousands times so it spends too
much time querying NCBI database. Therefore I built a taxonomic database
locally and implemented something similar to 7.11.4 tutorial's function so I
get, for every sequence, the lineage in the same way:

'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina;
 Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta;
 Liliopsida; Asparagales; Orchidaceae'


Now I need to create a tree, or fill an already created one. And then search
it by some criteria.

Please could anybody help me with this? Any idea?
Thankyou very much

From biopython at maubp.freeserve.co.uk  Wed Oct  8 12:38:31 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 17:38:31 +0100
Subject: [BioPython] taxonomic tree
In-Reply-To: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com>
References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com>
Message-ID: <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com>

On Wed, Oct 8, 2008 at 5:23 PM, dr goettel <biopythonlist at gmail.com> wrote:
> Hello, I'm new in this list and in BioPython.

Hello :)

> I would like to create a NCBI-like taxonomic tree and then fill it with the
> organisms that I have in a file. Is there an easy way to do this? I started
> using biopython's function at 7.11.4 (finding the lineage of an organism) in
> the tutorial, ...

For anyone reading this later on, note that the tutorial section
numbers tend to change with each release of Biopython.  This section
just uses Bio.Entrez to fetch taxonomy information for a particular
NCBI taxon id.

> but I need to do this tens of thousands times so it spends too
> much time querying NCBI database.

Also calling Bio.Entrez 10000 times might annoy the NCBI ;)

> Therefore I built a taxonomic database
> locally and implemented something similar to 7.11.4 tutorial's function so I
> get, for every sequence, the lineage in the same way:
>
> 'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina;
>  Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta;
>  Liliopsida; Asparagales; Orchidaceae'

I assume you used the NCBI provided taxdump files to populate the
database?  See ftp://ftp.ncbi.nih.gov/pub/taxonomy/

Personally rather than designing my own database just for this (and
writing a parser for the taxonomy files), I would have suggested
installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl
to download and import the data for you.  This is a simple perl script
- you don't need BioPerl.  See http://www.biopython.org/wiki/BioSQL
for details.

> Now I need to create a tree, or fill an already created one. And then search
> it by some criteria.

What kind of tree do you mean?  Are you talking about creating a
Newick tree, or an in memory structure?  Perhaps the Bio.Nexus
module's tree functionality would help.

If you are interested, the BioSQL tables record the taxonomy tree
using two methods, each node has a parent node allowing you to walk up
the lineage.  There are also left/right values allowing selection of
all child nodes efficiently via an SQL select statement.

Peter

From biopython at maubp.freeserve.co.uk  Wed Oct  8 12:57:37 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 17:57:37 +0100
Subject: [BioPython] Current tutorial in CVS
Message-ID: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com>

Michiel wrote:
> ... The new tutorial is in CVS; I put a copy of the HTML output
> of the latest version at
> http://biopython.org/DIST/docs/tutorial/Tutorial.new.html.

This also gives people a chance to look at the three plotting examples
I added to the "Cookbook" section a couple of weeks back,

http://www.biopython.org/DIST/docs/tutorial/Tutorial.new.html#chapter:cookbook

Suggestions for any additional biologically motivated simple plots
would be nice - especially for different plot types.  A scatter plot
could be added, are there any suggestions for this other than melting
temperature versus length or GC%?  See also this thread on the
dev-mailing list:
http://www.biopython.org/pipermail/biopython-dev/2008-September/004277.html

Note that the file at this URL is only temporary, and will probably be
removed before the next release.  The current tutorial is at:

http://www.biopython.org/DIST/docs/tutorial/Tutorial.html
http://www.biopython.org/DIST/docs/tutorial/Tutorial.pdf

Peter

From stephan80 at mac.com  Wed Oct  8 13:11:25 2008
From: stephan80 at mac.com (Stephan)
Date: Wed, 08 Oct 2008 19:11:25 +0200
Subject: [BioPython] Entrez.efetch large files
Message-ID: <133483072970409871957631124263040035200-Webmail2@me.com>

Sorry to have an Entrez.efetch-issue again, but somehow there seems to be a problem with very large files.

So when I run the following code using the newest cvs-version of biopython:

------------------------------------CODE-----------------------------------
from Bio import Entrez, SeqIO

id = "57"
print Entrez.read(Entrez.esummary(db="genome", id=id))[0]["Title"]
handle = Entrez.efetch(db="genome", id=id, rettype="genbank")
print "downloading to SeqRecord..."
record = SeqIO.read(handle, "genbank")
print "...done"
------------------------------------END-CODE-----------------------------

it fails with the output:

------------------------------------OUTPUT-----------------------------
Drosophila melanogaster chromosome X, complete sequence
downloading to SeqRecord...
Traceback (most recent call last):
  File "efetch-test.py", line 7, in <module>
    record = SeqIO.read(handle, "genbank")
  File "/NetUsers/stschiff/lib/python/Bio/SeqIO/__init__.py", line 366, in read
    first = iterator.next()
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 410, in parse_records
    record = self.parse(handle)
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 393, in parse
    if self.feed(handle, consumer) :
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 370, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 723, in parse_footer
    raise ValueError("Premature end of file in sequence data")
ValueError: Premature end of file in sequence data
------------------------------------END-OUTPUT-----------------------------


If I change the id to "56" (chromosome 4, which is shorter) it works. But for all the other chromosomes (ids: 57 - 61) it fails.
If I download the genbank files manually from the ftp-server and then use SeqIO.read() it works, so the download-process corrupts the genbank files if they are very large (about 35 MB) I guess...

Any hints?

Best,
Stephan

From biopython at maubp.freeserve.co.uk  Wed Oct  8 14:57:08 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 19:57:08 +0100
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <133483072970409871957631124263040035200-Webmail2@me.com>
References: <133483072970409871957631124263040035200-Webmail2@me.com>
Message-ID: <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>

On Wed, Oct 8, 2008 at 6:11 PM, Stephan <stephan80 at mac.com> wrote:
> Sorry to have an Entrez.efetch-issue again, but somehow there
> seems to be a problem with very large files.
> ...
> If I change the id to "56" (chromosome 4, which is shorter) it works.
> But for all the other chromosomes (ids: 57 - 61) it fails.
> If I download the genbank files manually from the ftp-server and
> then use SeqIO.read() it works, so the download-process corrupts
> the genbank files if they are very large (about 35 MB) I guess...
>
> Any hints?

Yes - one big hint: DON'T try and parse these large files directly
from the internet.  Use efetch to download the file and save it to
disk.  Then open this local file for parsing.

There are several good reasons for this:

(1) Rerunning the script (e.g. during development) needn't re-download
the file, which wastes time and money (yours and more importantly the
NCBI's).  You may be fine, but the NCBI can and do ban people's IP
addresses if they breach the guidelines.

(2) If the parsing fails, there is something to debug easily (the
local file).  You can open the file in a text editor to check it etc.

That being said, downloading and parsing in one go should work - I
would expect an IO error if the network timed out, rather than what
appears to be the data ending prematurely.  However, I don't expect
this to be easy to resolve - quite possibly this is a network time out
somewhere, maybe at your end, maybe on one of the ISP connections in
between.

On the bright side, at least the parser isn't silently ignoring the
end of the file, which would leave you with a truncated sequence
without any warnings :)

Do you think the Biopython tutorial should be more explicit about this
topic?  e.g. In chapter 4 (on Bio.SeqIO) I wrote:

>> Note that just because you can download sequence data and
>> parse it into a SeqRecord object in one go doesn't mean this
>> is always a good idea. In general, you should probably download
>> sequences once and save them to a file for reuse.

Maybe I should have said "... doesn't mean this is a good idea..." instead?

Peter

From biopython at maubp.freeserve.co.uk  Wed Oct  8 15:32:59 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 20:32:59 +0100
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>
References: <133483072970409871957631124263040035200-Webmail2@me.com>
	<320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>
Message-ID: <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com>

> Yes - one big hint: DON'T try and parse these large files directly
> from the internet.  Use efetch to download the file and save it to
> disk.  Then open this local file for parsing.
> ...
> Do you think the Biopython tutorial should be more explicit about this
> topic?

I've changed the tutorial (the SeqIO and Entrez chapters) in CVS to
make this advice more explicit, and included an example of doing this
too.

import os
from Bio import SeqIO
from Bio import Entrez
Entrez.email = "A.N.Other at example.com"     # Always tell NCBI who you are
filename = "gi_186972394.gbk"
if not os.path.isfile(filename) :
    print "Downloading..."
    net_handle = Entrez.efetch(db="nucleotide",id="186972394",rettype="genbank")
    out_handle = open(filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    net_handle.close()
    print "Saved"

print "Parsing..."
record = SeqIO.read(open(filename), "genbank")
print record


Peter

From biopython at maubp.freeserve.co.uk  Wed Oct  8 16:57:03 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 21:57:03 +0100
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com>
References: <133483072970409871957631124263040035200-Webmail2@me.com>
	<320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>
	<320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com>
	<2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com>
Message-ID: <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com>

On Wed, Oct 8, 2008 at 9:37 PM, Stephan Schiffels <stephan80 at mac.com> wrote:
>
>  Hi Peter,
>
> OK, first of all... you were right of course, with
> out_handle.write(net_handle.read()) the download works properly and reading
> the file from disk also works.The tutorial is very clear on that point, I
> agree.

OK - hopefully I've just made it clearer still ;)

> To illustrate why I made the mistake even though I read the tutorial:
> I made some code like:
>
> try:
>        unpickling a file as SeqRecord...
> except IOError:
>        download file into SeqRecord AND pickle afterwards to disk
>
> So, as you can see, I already tried to make the download only once!

I see - interesting.

> The disk-saving step, I realized, was smarter to do via cPickle since then
> reading from it also goes faster than parsing the genbank file each time. So
> my goal was to either load a pickled SeqRecord, or download into SeqRecord
> and then pickle to disk. I hope you agree that concerning resources from
> NCBI this way is (at least in principle) already quite optimal.

You approach is clever, and I agree, it shouldn't make any difference
to the number of downloads from the NCBI (once you have the script
debugged and working).

I'm curious - do you have any numbers for the relative times to load a
SeqRecord from a pickle, or re-parse it from the GenBank file?  I'm
aware of some "hot spots" in the GenBank parser which take more time
than they really need to (feature location parsing in particular).

However, even if using pickles is much faster, I would personally
still rather use this approach:

if file not present:
   download from NCBI and save it
parse file

I think it is safer to keep the original data in the NCBI provided
format, rather than as a python pickle.  Some of my reasons include:

* you might want to parse the files with a different tool one day
(e.g. grep, or maybe BioPerl, or EMBOSS)
* different versions of Biopython will parse the file slightly
differently (e.g. once Bugs 2225 and 2578 are fixed the SeqRecord
should include slightly more information from a GenBank file) while
your pickle will be static
* if the SeqRecord or Seq objects themselves change slightly between
versions of Biopython, the pickle may not work
* more generally, is it safe to transfer the pickly files between
different computers (e.g. different versions of python or Biopython,
different OS, different line endings)?

These issues may not be a problem in your setting.

More generally, you could consider using BioSQL, but this may be
overkill for your needs.

> However, as you pointed out, parsing from the internet makes problems.

If you do work out exactly what is going wrong, I would be interested
to hear about it.

> I think the advantages of not having to download each time were clear to me
> from the tutorial. Just that downloading AND parsing at the same time makes
> problems didnt appear to me. The addings to the tutorial seem to give some
> idea.

Your approach all makes sense. Thanks for explaining your thoughts.  I
don't think I'd ever tried efetch on such a large GenBank file in the
first place - for genomes I have usually used FTP instead.

Peter

From chapmanb at 50mail.com  Wed Oct  8 17:11:25 2008
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 8 Oct 2008 17:11:25 -0400
Subject: [BioPython] Entrez.efetch
In-Reply-To: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
	<20081008123533.GE57379@sobchak.mgh.harvard.edu>
	<320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>
Message-ID: <20081008211125.GB17555@sobchak.mgh.harvard.edu>

Peter and Stephan;
My fault -- sorry about the red herring on this one. I shouldn't
have tried to answer this e-mail in 5 minutes before work this
morning. Sounds like y'all have it resolved with the missing close
so I will keep my mouth shut.

Peter, I don't remember my exact problem as it was in some
throw-away script and the fix seemed non-problematic. I was thrown
off by the "line 3" information Stephan mentioned because my issue
was with the first couple of lines missing when iterating with an
UndoHandle. No matter.

Thanks for coming up with the right fix!
Brad

> Stephan wrote:
> >> When I download this chromosome manually from the NCBI-website,
> >> I indeed find a difference in one line, namely in line 3 of the
> >> genbank file. In the manually downloaded file line 3 reads:
> >> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced
> >> from my code I have only: "ACCESSION NC_004353". So without that
> >> region-information, the biopython parser of course runs to a premature
> >> end.
> 
> Stephan - when you say manually, do you mean via a web browser?  If so
> it is likely to be using a subtly different URL, which might explain
> the NCBI generating slightly different data on the fly.  Either way,
> this ACCESSION line difference shouldn't trigger the "Premature end of
> file in sequence data" error in the GenBank parser.
> 
> On Wed, Oct 8, 2008 at 1:35 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> > This is a tricky problem that I ran into as well and is fixed in the
> > latest CVS version. The issue is that the Biopython reader is using an
> > UndoHandle instead of a standard python handle. By default some of these
> > operations appear to be assuming an iterator, but UndoHandle did not
> > provide this.
> 
> Brad, I'm pretty sure the GenBank parser is NOT using the UndoHandle.
> Just adding the close made Stephan's example work for me.  What
> exactly was the problem you ran into (one of the other parsers
> perhaps?).
> 
> > As a result, you can lose the first couple of lines which are
> > previously examined to determine the filetype. The fix is to make
> > this a proper iterator. You can either check out current CVS, or
> > make the addition manually to Bio/File.py in your current version:
> >
> > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython
> 
> Adding this to the UndoHandle seems a sensible improvement - but I
> don't see how it can affect Stephan's script.
> 
> Peter

From stephan80 at mac.com  Wed Oct  8 16:37:17 2008
From: stephan80 at mac.com (Stephan Schiffels)
Date: Wed, 08 Oct 2008 22:37:17 +0200
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com>
References: <133483072970409871957631124263040035200-Webmail2@me.com>
	<320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>
	<320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com>
Message-ID: <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com>

  Hi Peter,

OK, first of all... you were right of course, with out_handle.write 
(net_handle.read()) the download works properly and reading the file  
from disk also works.The tutorial is very clear on that point, I agree.

To illustrate why I made the mistake even though I read the tutorial:
I made some code like:

try:
	unpickling a file as SeqRecord...
except IOError:
	download file into SeqRecord AND pickle afterwards to disk

So, as you can see, I already tried to make the download only once!  
The disk-saving step, I realized, was smarter to do via cPickle since  
then reading from it also goes faster than parsing the genbank file  
each time. So my goal was to either load a pickled SeqRecord, or  
download into SeqRecord and then pickle to disk. I hope you agree  
that concerning resources from NCBI this way is (at least in  
principle) already quite optimal. However, as you pointed out,  
parsing from the internet makes problems.

I think the advantages of not having to download each time were clear  
to me from the tutorial. Just that downloading AND parsing at the  
same time makes problems didnt appear to me. The addings to the  
tutorial seem to give some idea.

Thanks and Regards,
Stephan

Am 08.10.2008 um 21:32 schrieb Peter:

>> Yes - one big hint: DON'T try and parse these large files directly
>> from the internet.  Use efetch to download the file and save it to
>> disk.  Then open this local file for parsing.
>> ...
>> Do you think the Biopython tutorial should be more explicit about  
>> this
>> topic?
>
> I've changed the tutorial (the SeqIO and Entrez chapters) in CVS to
> make this advice more explicit, and included an example of doing this
> too.
>
> import os
> from Bio import SeqIO
> from Bio import Entrez
> Entrez.email = "A.N.Other at example.com"     # Always tell NCBI who  
> you are
> filename = "gi_186972394.gbk"
> if not os.path.isfile(filename) :
>     print "Downloading..."
>     net_handle = Entrez.efetch 
> (db="nucleotide",id="186972394",rettype="genbank")
>     out_handle = open(filename, "w")
>     out_handle.write(net_handle.read())
>     out_handle.close()
>     net_handle.close()
>     print "Saved"
>
> print "Parsing..."
> record = SeqIO.read(open(filename), "genbank")
> print record
>
>
> Peter


From biopythonlist at gmail.com  Thu Oct  9 04:52:42 2008
From: biopythonlist at gmail.com (dr goettel)
Date: Thu, 9 Oct 2008 10:52:42 +0200
Subject: [BioPython] taxonomic tree
In-Reply-To: <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com>
References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com>
	<320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com>
Message-ID: <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com>

On Wed, Oct 8, 2008 at 6:38 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Wed, Oct 8, 2008 at 5:23 PM, dr goettel <biopythonlist at gmail.com>
> wrote:
> > Hello, I'm new in this list and in BioPython.
>
> Hello :)
>
> > I would like to create a NCBI-like taxonomic tree and then fill it with
> the
> > organisms that I have in a file. Is there an easy way to do this? I
> started
> > using biopython's function at 7.11.4 (finding the lineage of an organism)
> in
> > the tutorial, ...
>
> For anyone reading this later on, note that the tutorial section
> numbers tend to change with each release of Biopython.  This section
> just uses Bio.Entrez to fetch taxonomy information for a particular
> NCBI taxon id.
>
> > but I need to do this tens of thousands times so it spends too
> > much time querying NCBI database.
>
> Also calling Bio.Entrez 10000 times might annoy the NCBI ;)
>
> > Therefore I built a taxonomic database
> > locally and implemented something similar to 7.11.4 tutorial's function
> so I
> > get, for every sequence, the lineage in the same way:
> >
> > 'cellular organisms; Eukaryota; Viridiplantae; Streptophyta;
> Streptophytina;
> >  Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta;
> >  Liliopsida; Asparagales; Orchidaceae'
>
> I assume you used the NCBI provided taxdump files to populate the
> database?  See ftp://ftp.ncbi.nih.gov/pub/taxonomy/
>

Yes I did.


>
> Personally rather than designing my own database just for this (and
> writing a parser for the taxonomy files), I would have suggested
> installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl
> to download and import the data for you.  This is a simple perl script
> - you don't need BioPerl.  See http://www.biopython.org/wiki/BioSQL
> for details.
>

I also used the load_ncbi_taxonomy.pl script. It worked great!


>
> > Now I need to create a tree, or fill an already created one. And then
> search
> > it by some criteria.
>
> What kind of tree do you mean?  Are you talking about creating a
> Newick tree, or an in memory structure?  Perhaps the Bio.Nexus
> module's tree functionality would help.
>

Thankyou very much. I still don't know if I want Newick tree or the other
one. I'll take a look on Bio.Nexus module


>
> If you are interested, the BioSQL tables record the taxonomy tree
> using two methods, each node has a parent node allowing you to walk up
> the lineage.  There are also left/right values allowing selection of
> all child nodes efficiently via an SQL select statement.
>
> Peter
>

This is what I was trying to do, from the name of the organism (the leaf of
the tree) and getting every node using the parent_node field of the taxon
table, until reaching the root node. Once I have all the steps to the root
node then I have to create/filling the tree with my data in order to
examinate the number of organisms integrating certain
class/order/family/genus... etc
Any ideas will be very apreciated.

Thankyou very much for your answer and I'll take a look on Bio.Nexus module.

drG

From biopython at maubp.freeserve.co.uk  Thu Oct  9 05:31:16 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 9 Oct 2008 10:31:16 +0100
Subject: [BioPython] taxonomic tree
In-Reply-To: <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com>
References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com>
	<320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com>
	<9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com>
Message-ID: <320fb6e00810090231w723e3b29m5e070c55166d3bfc@mail.gmail.com>

>> Personally rather than designing my own database just for this (and
>> writing a parser for the taxonomy files), I would have suggested
>> installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl
>> to download and import the data for you.  This is a simple perl script
>> - you don't need BioPerl.  See http://www.biopython.org/wiki/BioSQL
>> for details.
>
> I also used the load_ncbi_taxonomy.pl script. It worked great!

Good.  I would encourage you to use the version from BioSQL v1.0.1 if
you are not already, as the version with BioSQL v1.0.0 makes an
additional unnecessary assumption about the database keys matching the
NCBI taxon ID.

>> If you are interested, the BioSQL tables record the taxonomy tree
>> using two methods, each node has a parent node allowing you to walk up
>> the lineage.  There are also left/right values allowing selection of
>> all child nodes efficiently via an SQL select statement.
>
> This is what I was trying to do, from the name of the organism (the leaf of
> the tree) and getting every node using the parent_node field of the taxon
> table, until reaching the root node. Once I have all the steps to the root
> node then I have to create/filling the tree with my data in order to
> examinate the number of organisms integrating certain
> class/order/family/genus... etc
> Any ideas will be very apreciated.

To do this in Biopython you'll have to write some SQL commands - but
first you need to understand how the left/right values work if you
want to take advantage of them.  I refer you to this thread on the
BioSQL mailing list earlier in the year:
http://lists.open-bio.org/pipermail/biosql-l/2008-April/001234.html

In particular, Hilmar referred to Joe Celko's SQL for Smarties books,
and the introduction to this nested-set representation given here:
http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html

Alternatively, if you wanted to avoid the left/right values, you could
use recursion or loops on the parent ID links to build up the tree.
For a single lineage this is fine - but for a full try I would expect
the left/right values to be faster.

Note that Biopython (in CVS now) ignores the left/right values.  This
is for two reasons - for pulling out a single lineage, Eric found this
was faster.  Also, when adding new entries to the database
re-calculating the left/right values is too slow, so we leave them as
NULL (and let the user (re)run load_ncbi_taxonomy.pl later if they
care).  This means we don't want to depend on the left/right values
being present.

Peter

From stephan.schiffels at uni-koeln.de  Thu Oct  9 09:01:11 2008
From: stephan.schiffels at uni-koeln.de (Stephan Schiffels)
Date: Thu, 9 Oct 2008 15:01:11 +0200
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com>
References: <133483072970409871957631124263040035200-Webmail2@me.com>
	<320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>
	<320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com>
	<2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com>
	<320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com>
Message-ID: <171A75DA-EE34-44AB-8E16-DEC626F7164C@uni-koeln.de>

Hi Peter,

Am 08.10.2008 um 22:57 schrieb Peter:

> I'm curious - do you have any numbers for the relative times to load a
> SeqRecord from a pickle, or re-parse it from the GenBank file?  I'm
> aware of some "hot spots" in the GenBank parser which take more time
> than they really need to (feature location parsing in particular).

So, here is a little profiling of reading a large chromosome both as  
genbank and from a pickled SeqRecord (both from disk of course):
 >>> t = Timer("a = cPickle.load(open('DroMel_chr2L.pickle'))",  
"import cPickle")
 >>> t.timeit(number=1)
5.2086620330810547
 >>> t = Timer("a = SeqIO.read(open('DroMel_chr2L.gbk'), 'genbank')",  
"from Bio import SeqIO")
 >>> t.timeit(number=1)
53.902437925338745
 >>>

As you see there is an amazing 10fold speed-gain using cPickle in  
comparison to SeqIO.read() ... not bad! The pickled file is a bit  
larger than the genbank file, but not much.

> However, even if using pickles is much faster, I would personally
> still rather use this approach:
>
> if file not present:
>    download from NCBI and save it
> parse file
>
Thats precisely how I do it now. Works cool!


> I think it is safer to keep the original data in the NCBI provided
> format, rather than as a python pickle.  Some of my reasons include:
>
> * you might want to parse the files with a different tool one day
> (e.g. grep, or maybe BioPerl, or EMBOSS)
> * different versions of Biopython will parse the file slightly
> differently (e.g. once Bugs 2225 and 2578 are fixed the SeqRecord
> should include slightly more information from a GenBank file) while
> your pickle will be static
> * if the SeqRecord or Seq objects themselves change slightly between
> versions of Biopython, the pickle may not work
> * more generally, is it safe to transfer the pickly files between
> different computers (e.g. different versions of python or Biopython,
> different OS, different line endings)?
>
> These issues may not be a problem in your setting.

You are right and in fact I now safe both the genbank file and the  
pickled file to disk, so I have all the backup.

>
> More generally, you could consider using BioSQL, but this may be
> overkill for your needs.
>
BioSQL is something that I like a lot. I have not yet digged my way  
through it but hopefully there will be options for me from that side  
as well.

>> However, as you pointed out, parsing from the internet makes  
>> problems.
>
> If you do work out exactly what is going wrong, I would be interested
> to hear about it.
>
Hmm, probably I wont find it out. Parsing from the internet works for  
small files, it must be some network-issue, dont know. Since I am in  
the university-web I doubt that the error starts at my side, maybe  
NCBI clears the connection if the other side is too slow, which is  
the case for the parsing process... But I understand too little about  
networking.

>> I think the advantages of not having to download each time were  
>> clear to me
>> from the tutorial. Just that downloading AND parsing at the same  
>> time makes
>> problems didnt appear to me. The addings to the tutorial seem to  
>> give some
>> idea.
>
> Your approach all makes sense. Thanks for explaining your thoughts.  I
> don't think I'd ever tried efetch on such a large GenBank file in the
> first place - for genomes I have usually used FTP instead.
>
> Peter

Regards,
Stephan


From biopython at maubp.freeserve.co.uk  Thu Oct  9 10:18:52 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 9 Oct 2008 15:18:52 +0100
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <171A75DA-EE34-44AB-8E16-DEC626F7164C@uni-koeln.de>
References: <133483072970409871957631124263040035200-Webmail2@me.com>
	<320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>
	<320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com>
	<2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com>
	<320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com>
	<171A75DA-EE34-44AB-8E16-DEC626F7164C@uni-koeln.de>
Message-ID: <320fb6e00810090718g3420729fh50520a4760c5d27@mail.gmail.com>

Peter wrote:
>> I'm curious - do you have any numbers for the relative times to load a
>> SeqRecord from a pickle, or re-parse it from the GenBank file?  I'm
>> aware of some "hot spots" in the GenBank parser which take more time
>> than they really need to (feature location parsing in particular).

Stephan wrote:
> So, here is a little profiling of reading a large chromosome both as genbank
> and from a pickled SeqRecord (both from disk of course):
>>>> t = Timer("a = cPickle.load(open('DroMel_chr2L.pickle'))", "import
>>>> cPickle")
>>>> t.timeit(number=1)
> 5.2086620330810547
>>>> t = Timer("a = SeqIO.read(open('DroMel_chr2L.gbk'), 'genbank')", "from
>>>> Bio import SeqIO")
>>>> t.timeit(number=1)
> 53.902437925338745
>>>>
>
> As you see there is an amazing 10fold speed-gain using cPickle in comparison
> to SeqIO.read() ... not bad! The pickled file is a bit larger than the
> genbank file, but not much.

I'm seeing more like a three fold speed-gain (using cPickle protocol
0, with Python 2.5.2 on a Mac), which is less impressive.  For a 10
fold speed up I can see why the complexity overhead of using pickle
could be worthwhile.

cPickle.load() took 8.5s
cPickle.load() took 10.0s
cPickle.load() took 9.9s
SeqIO.read() took 29.9s
SeqIO.read() took 29.8s
SeqIO.read() took 29.8s

(Script below)

I'm not very impressed with the 30 seconds needed to parse a 30MB
file.  There is certainly scope for speeding up the GenBank parsing
here.

Peter

---------------

My timing script:

import os
import cPickle
import time
from Bio import Entrez, SeqIO
#Entrez.email = "..."

id="57"
genbank_filename = "NC_004354.gbk"
pickle_filename = "NC_004354.pickle"

if not os.path.isfile(genbank_filename) :
    print "Downloading..."
    net_handle = Entrez.efetch(db="genome", id=id, rettype="genbank")
    out_handle = open(genbank_filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    print "Saved"

if not os.path.isfile(pickle_filename) :
    print "Parsing..."
    record = SeqIO.read(open(genbank_filename), 'genbank')
    print "Pickling..."
    out_handle = open(pickle_filename ,"w")
    cPickle.dump(record, out_handle)
    out_handle.close()
    print "Saved"

print "Profiling..."
for i in range(3) :
    start = time.time()
    record = cPickle.load(open(pickle_filename))
    print "cPickle.load() took %0.1fs" % (time.time() - start)
for i in range(3) :
    start = time.time()
    record = SeqIO.read(open(genbank_filename), 'genbank')
    print "SeqIO.read() took %0.1fs" % (time.time() - start)
print "Done"

From biopython at maubp.freeserve.co.uk  Thu Oct  9 11:48:26 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 9 Oct 2008 16:48:26 +0100
Subject: [BioPython] Deprecating Bio.PubMed and some of Bio.GenBank
Message-ID: <320fb6e00810090848g2a516877i5950f515e748b9d0@mail.gmail.com>

Dear Biopythoneers,

Those of you who looked at the release notes for Biopython 1.48 might
have read this bit:

>> Bio.PubMed and the online code in Bio.GenBank are now considered
>> obsolete, and we intend to deprecate them after the next release.
>> For accessing PubMed and GenBank, please use Bio.Entrez instead.

These bits of code are effectively simple wrappers for Bio.Entrez.
While they may be simple to use, they cannot take advantage of the
NCBI's Entrez utils history functionality.  This means they discourage
users from following the NCBI's preferred usage patterns.

We're already trying to encouraging the use of Bio.Entrez by
documenting it prominently in the tutorial (which seems to be working
given the recent questions on the mailing list), but for Biopython
1.49 I'm suggesting we go further and deprecate Bio.PubMed and the
online code in Bio.GenBank.  This would mean a warning message would
appear when this code is used, and (barring feedback) after a couple
of releases this code would be removed completely.

Any comments or objections?  In particular, is anyone using this
"obsolete" functionality now?

Peter

From biopythonlist at gmail.com  Thu Oct  9 12:32:11 2008
From: biopythonlist at gmail.com (dr goettel)
Date: Thu, 9 Oct 2008 18:32:11 +0200
Subject: [BioPython] taxonomic tree
In-Reply-To: <320fb6e00810090231w723e3b29m5e070c55166d3bfc@mail.gmail.com>
References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com>
	<320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com>
	<9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com>
	<320fb6e00810090231w723e3b29m5e070c55166d3bfc@mail.gmail.com>
Message-ID: <9b15d9f30810090932qb22ca8boc6edc871bf285154@mail.gmail.com>

> To do this in Biopython you'll have to write some SQL commands - but
> first you need to understand how the left/right values work if you
> want to take advantage of them.  I refer you to this thread on the
> BioSQL mailing list earlier in the year:
> http://lists.open-bio.org/pipermail/biosql-l/2008-April/001234.html
>
> In particular, Hilmar referred to Joe Celko's SQL for Smarties books,
> and the introduction to this nested-set representation given here:
> http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html
>

That's great!!
Taking advantage of the left/right values will help me!! They 're  great.
I started writing a lot of code to do something that in fact can be done
with some sql statements.
In fact the sql statements are quite difficult for me so I have to deep
inside "inner joins".

Thankyou very much


drG

From biopython at maubp.freeserve.co.uk  Mon Oct 13 08:38:56 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Oct 2008 13:38:56 +0100
Subject: [BioPython] Translation method for Seq object
Message-ID: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>

Dear Biopythoneers,

This is a request for feedback about proposed additions to the Seq
object for the next release of Biopython.  I'd like people to pick (a)
to (e) in the list below (with additional comments or counter
suggestions welcome).

Enhancement bug 2381 is about adding transcription and translation
methods to the Seq object, allowing an object orientated style of
programming.

e.g. Current functional programming style:

>>> from Bio.Seq import Seq, transcribe
>>> from Bio.Alphabet import generic_dna
>>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna)
>>> my_seq
Seq('CAGTGACGTTAGTCCG', DNAAlphabet())
>>> transcribe(my_seq)
Seq('CAGUGACGUUAGUCCG', RNAAlphabet())

With the latest Biopython in CVS, you can now invoke a Seq object
method instead for transcription (or back transcription):

>>> my_seq.transcribe()
Seq('CAGUGACGUUAGUCCG', RNAAlphabet())

For a comparison, compare the shift from python string functions to
string methods.  This also makes the functionality more discoverable
via dir(my_seq).

Adding Seq object methods "transcribe" and "back_transcribe" doesn't
cause any confusion with the python string methods.  However, for
translation, the python string has an existing "translate" method:

> S.translate(table [,deletechars]) -> string
>
> Return a copy of the string S, where all characters occurring
> in the optional argument deletechars are removed, and the
> remaining characters have been mapped through the given
> translation table, which must be a string of length 256.

I don't think this functionality is really of direct use for sequences, and
having a Seq object "translate" method do a biological translation into
a protein sequence is much more intuitive. However, this could cause
confusion if the Seq object is passed to non-Biopython code which
expects a string like translate method.

To avoid this naming clash, a different method name would needed.

This is where some user feedback would be very welcome - I think
the following cover all the alternatives of what to call a biological
translation function (nucleotide to protein):

(a) Just use translate (ignore the existing string method)
(b) Use translate_ (trailing underscore, see PEP8)
(c) Use translation (a noun rather than verb; different style).
(d) Use something else (e.g. bio_translate or ...)
(e) Don't add a biological translation method at all because ...

Thanks,

Peter

See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381

From ericgibert at yahoo.fr  Mon Oct 13 10:38:02 2008
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Mon, 13 Oct 2008 22:38:02 +0800
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
Message-ID: <BFAD764F5E77416099835E06AA4016EC@Gecko>


(a) Seq is an object, string is another object... each of them have various
methods and coincidently two of them have the same name...

Eric


-----Original Message-----
From: biopython-bounces at lists.open-bio.org
[mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter
Sent: Monday, October 13, 2008 8:39 PM
To: BioPython Mailing List
Subject: [BioPython] Translation method for Seq object

Dear Biopythoneers,

This is a request for feedback about proposed additions to the Seq
object for the next release of Biopython.  I'd like people to pick (a)
to (e) in the list below (with additional comments or counter
suggestions welcome).

Enhancement bug 2381 is about adding transcription and translation
methods to the Seq object, allowing an object orientated style of
programming.

e.g. Current functional programming style:

>>> from Bio.Seq import Seq, transcribe
>>> from Bio.Alphabet import generic_dna
>>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna)
>>> my_seq
Seq('CAGTGACGTTAGTCCG', DNAAlphabet())
>>> transcribe(my_seq)
Seq('CAGUGACGUUAGUCCG', RNAAlphabet())

With the latest Biopython in CVS, you can now invoke a Seq object
method instead for transcription (or back transcription):

>>> my_seq.transcribe()
Seq('CAGUGACGUUAGUCCG', RNAAlphabet())

For a comparison, compare the shift from python string functions to
string methods.  This also makes the functionality more discoverable
via dir(my_seq).

Adding Seq object methods "transcribe" and "back_transcribe" doesn't
cause any confusion with the python string methods.  However, for
translation, the python string has an existing "translate" method:

> S.translate(table [,deletechars]) -> string
>
> Return a copy of the string S, where all characters occurring
> in the optional argument deletechars are removed, and the
> remaining characters have been mapped through the given
> translation table, which must be a string of length 256.

I don't think this functionality is really of direct use for sequences, and
having a Seq object "translate" method do a biological translation into
a protein sequence is much more intuitive. However, this could cause
confusion if the Seq object is passed to non-Biopython code which
expects a string like translate method.

To avoid this naming clash, a different method name would needed.

This is where some user feedback would be very welcome - I think
the following cover all the alternatives of what to call a biological
translation function (nucleotide to protein):

(a) Just use translate (ignore the existing string method)
(b) Use translate_ (trailing underscore, see PEP8)
(c) Use translation (a noun rather than verb; different style).
(d) Use something else (e.g. bio_translate or ...)
(e) Don't add a biological translation method at all because ...

Thanks,

Peter

See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381
_______________________________________________
BioPython mailing list  -  BioPython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython


From bsouthey at gmail.com  Mon Oct 13 10:58:07 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Mon, 13 Oct 2008 09:58:07 -0500
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
Message-ID: <48F361FF.103@gmail.com>

Peter wrote:
> Dear Biopythoneers,
>
> This is a request for feedback about proposed additions to the Seq
> object for the next release of Biopython.  I'd like people to pick (a)
> to (e) in the list below (with additional comments or counter
> suggestions welcome).
>
> Enhancement bug 2381 is about adding transcription and translation
> methods to the Seq object, allowing an object orientated style of
> programming.
>
> e.g. Current functional programming style:
>
>   
>>>> from Bio.Seq import Seq, transcribe
>>>> from Bio.Alphabet import generic_dna
>>>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna)
>>>> my_seq
>>>>         
> Seq('CAGTGACGTTAGTCCG', DNAAlphabet())
>   
>>>> transcribe(my_seq)
>>>>         
> Seq('CAGUGACGUUAGUCCG', RNAAlphabet())
>
> With the latest Biopython in CVS, you can now invoke a Seq object
> method instead for transcription (or back transcription):
>
>   
>>>> my_seq.transcribe()
>>>>         
> Seq('CAGUGACGUUAGUCCG', RNAAlphabet())
>
> For a comparison, compare the shift from python string functions to
> string methods.  This also makes the functionality more discoverable
> via dir(my_seq).
>
> Adding Seq object methods "transcribe" and "back_transcribe" doesn't
> cause any confusion with the python string methods.  However, for
> translation, the python string has an existing "translate" method:
>
>   
>> S.translate(table [,deletechars]) -> string
>>
>> Return a copy of the string S, where all characters occurring
>> in the optional argument deletechars are removed, and the
>> remaining characters have been mapped through the given
>> translation table, which must be a string of length 256.
>>     
>
> I don't think this functionality is really of direct use for sequences, and
> having a Seq object "translate" method do a biological translation into
> a protein sequence is much more intuitive. However, this could cause
> confusion if the Seq object is passed to non-Biopython code which
> expects a string like translate method.
>
> To avoid this naming clash, a different method name would needed.
>
> This is where some user feedback would be very welcome - I think
> the following cover all the alternatives of what to call a biological
> translation function (nucleotide to protein):
>
> (a) Just use translate (ignore the existing string method)
> (b) Use translate_ (trailing underscore, see PEP8)
> (c) Use translation (a noun rather than verb; different style).
> (d) Use something else (e.g. bio_translate or ...)
> (e) Don't add a biological translation method at all because ...
>
> Thanks,
>
> Peter
>
> See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>   
Hi,
My thoughts on this is that it is generally best to avoid any confusion 
when possible. But 'translate' is not a reserved word and the Python 
documentation notes that the unicode version lacks the optional 
deletechars argument (so there is precedent for using the same word). 
Also it involves the methods versus functions argument but many of the 
string functions have been depreciated and will get removed in Python 
3.0 (so in Python 3.0 I think it will be hard to get a name clash 
without some strange inheritance going on).

Therefore, provided 'translate' is a method of Seq then I do not see any 
strong reason to avoid it except that it is long (but shorter than 
translation) :-)

Would be too cryptic to have dna(), rna() and protein() methods that 
provide the appropriate conversion based on the Seq type?
Obviously reverse translation of a protein sequence to a DNA sequence is 
complex if there are many solutions.

Regards
Bruce


From mjldehoon at yahoo.com  Mon Oct 13 10:57:28 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 13 Oct 2008 07:57:28 -0700 (PDT)
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
Message-ID: <421846.1946.qm@web62403.mail.re1.yahoo.com>

(f) Use .translate both for the Python .translate and for the Biopython .translate.

S.translate() ===> Biopython .translate

S.translate(table [,deletechars]) ===> Python .translate

We can tell from the presence or absence of arguments whether the user intends Python's translate or Biopython's translate.

--Michiel.


--- On Mon, 10/13/08, Peter <biopython at maubp.freeserve.co.uk> wrote:

> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: [BioPython] Translation method for Seq object
> To: "BioPython Mailing List" <biopython at lists.open-bio.org>
> Date: Monday, October 13, 2008, 8:38 AM
> Dear Biopythoneers,
> 
> This is a request for feedback about proposed additions to
> the Seq
> object for the next release of Biopython.  I'd like
> people to pick (a)
> to (e) in the list below (with additional comments or
> counter
> suggestions welcome).
> 
> Enhancement bug 2381 is about adding transcription and
> translation
> methods to the Seq object, allowing an object orientated
> style of
> programming.
> 
> e.g. Current functional programming style:
> 
> >>> from Bio.Seq import Seq, transcribe
> >>> from Bio.Alphabet import generic_dna
> >>> my_seq = Seq("CAGTGACGTTAGTCCG",
> generic_dna)
> >>> my_seq
> Seq('CAGTGACGTTAGTCCG', DNAAlphabet())
> >>> transcribe(my_seq)
> Seq('CAGUGACGUUAGUCCG', RNAAlphabet())
> 
> With the latest Biopython in CVS, you can now invoke a Seq
> object
> method instead for transcription (or back transcription):
> 
> >>> my_seq.transcribe()
> Seq('CAGUGACGUUAGUCCG', RNAAlphabet())
> 
> For a comparison, compare the shift from python string
> functions to
> string methods.  This also makes the functionality more
> discoverable
> via dir(my_seq).
> 
> Adding Seq object methods "transcribe" and
> "back_transcribe" doesn't
> cause any confusion with the python string methods. 
> However, for
> translation, the python string has an existing
> "translate" method:
> 
> > S.translate(table [,deletechars]) -> string
> >
> > Return a copy of the string S, where all characters
> occurring
> > in the optional argument deletechars are removed, and
> the
> > remaining characters have been mapped through the
> given
> > translation table, which must be a string of length
> 256.
> 
> I don't think this functionality is really of direct
> use for sequences, and
> having a Seq object "translate" method do a
> biological translation into
> a protein sequence is much more intuitive. However, this
> could cause
> confusion if the Seq object is passed to non-Biopython code
> which
> expects a string like translate method.
> 
> To avoid this naming clash, a different method name would
> needed.
> 
> This is where some user feedback would be very welcome - I
> think
> the following cover all the alternatives of what to call a
> biological
> translation function (nucleotide to protein):
> 
> (a) Just use translate (ignore the existing string method)
> (b) Use translate_ (trailing underscore, see PEP8)
> (c) Use translation (a noun rather than verb; different
> style).
> (d) Use something else (e.g. bio_translate or ...)
> (e) Don't add a biological translation method at all
> because ...
> 
> Thanks,
> 
> Peter
> 
> See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


From biopython at maubp.freeserve.co.uk  Mon Oct 13 11:27:37 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Oct 2008 16:27:37 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <421846.1946.qm@web62403.mail.re1.yahoo.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
	<421846.1946.qm@web62403.mail.re1.yahoo.com>
Message-ID: <320fb6e00810130827j3ec07434s2f58e370743f9537@mail.gmail.com>

So I did manage to leave off at least one other option from my short list :)

Michiel de Hoon wrote:
>
> (f) Use .translate both for the Python .translate and for the Biopython .translate.
>
> S.translate() ===> Biopython .translate
>
> S.translate(table [,deletechars]) ===> Python .translate
>
> We can tell from the presence or absence of arguments whether the user intends Python's translate or Biopython's translate.

Sadly its not quite that simple.

For a biological translation we'd probably want to offer optional
arguments for at least the codon table and stop symbol (like the
current Bio.Seq.translate() function), with other further arguments
possible (e.g. to treat the sequence as a complete CDS where the start
codon should be validated and taken as M).

It would still be possible to automatically detect which translation
was required, but it wouldn't be very nice.  So overall I'm not keen
on this approach.

Peter

From biopython at maubp.freeserve.co.uk  Mon Oct 13 11:54:32 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Oct 2008 16:54:32 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <48F361FF.103@gmail.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
	<48F361FF.103@gmail.com>
Message-ID: <320fb6e00810130854m38f37075gf85b798cb4a98e21@mail.gmail.com>

Bruce wrote:
> ...
> Therefore, provided 'translate' is a method of Seq then I do not see any
> strong reason to avoid it except that it is long (but shorter than
> translation) :-)

Good - that sounds like another vote for option (a) in my original list.

> Would be too cryptic to have dna(), rna() and protein() methods that provide
> the appropriate conversion based on the Seq type?

Or in a similar vein, to_dna, to_rna, and to_protein? Or toDNA, toRNA,
toProtein?  I'd have to go and consult the current python style guide
for what is the current best practice. Something like that does sounds
reasonable (and they are short), but historically all related
Biopython functions have used the terms (back) transcription and
(back) translation so I would prefer to stick with those.

> Obviously reverse translation of a protein sequence to a DNA sequence is
> complex if there are many solutions.

Yes, back-translation is tricky because there is generally more than
one codon for any amino acid.  Ambiguous nucleotides can be used to
describe several possible codons giving that amino acid, but in
general it is not possible to do this and describe all the possible
codons which could have been used.  This topic is worth of an entire
thread... for the record, I would envisage a back_translate method for
the Seq object (assuming we settle on translate as the name for the
forward translation from nucleotide to protein).

Peter

From mjldehoon at yahoo.com  Mon Oct 13 20:50:14 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 13 Oct 2008 17:50:14 -0700 (PDT)
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
Message-ID: <900752.12970.qm@web62408.mail.re1.yahoo.com>

> (a) Just use translate (ignore the existing string method)
> (b) Use translate_ (trailing underscore, see PEP8)
> (c) Use translation (a noun rather than verb; different
> style).
> (d) Use something else (e.g. bio_translate or ...)
> (e) Don't add a biological translation method at all
> because ...


(a).
Note also that once Seq objects inherit from string, the Python .translate method is still accessible as str.translate(seq).

--Michiel.


From biopython at maubp.freeserve.co.uk  Tue Oct 14 06:18:13 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 14 Oct 2008 11:18:13 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <900752.12970.qm@web62408.mail.re1.yahoo.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
	<900752.12970.qm@web62408.mail.re1.yahoo.com>
Message-ID: <320fb6e00810140318i14c6362eq8a51030b1da660ae@mail.gmail.com>

OK, we seem to have a consensus :)

In Biopython's CVS, the Seq object now has a translate method which
does a biological translation.  If anyone comes up with a better
proposal before the next release, we can still rename this.  Otherwise
I will update the Tutorial in CVS shortly...

Note that for now, I have followed the existing Bio.Seq.translate(...)
function and the new Seq object translate(...) method takes only two
optional parameters - the codon table and the stop symbol.  I have
noted some suggestions for possible additional arguments on Bug 2381.

The adventurous among you may want to use CVS to update your Biopython
installations to try this out.  Please note that you will now need
numpy instead of Numeric (there is nothing to stop you having both
numpy and Numeric installed at the same time).  If you do try out the
CVS code, please run the unit tests and report any issues.

Thanks,

Peter

From biopython at maubp.freeserve.co.uk  Tue Oct 14 07:11:20 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 14 Oct 2008 12:11:20 +0100
Subject: [BioPython] Deprecating Bio.PubMed and some of Bio.GenBank
In-Reply-To: <320fb6e00810090848g2a516877i5950f515e748b9d0@mail.gmail.com>
References: <320fb6e00810090848g2a516877i5950f515e748b9d0@mail.gmail.com>
Message-ID: <320fb6e00810140411o341df854x49ef3e61421193b8@mail.gmail.com>

On Thu, Oct 9, 2008 at 4:48 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Dear Biopythoneers,
>
> Those of you who looked at the release notes for Biopython 1.48 might
> have read this bit:
>
>>> Bio.PubMed and the online code in Bio.GenBank are now considered
>>> obsolete, and we intend to deprecate them after the next release.
>>> For accessing PubMed and GenBank, please use Bio.Entrez instead.
>
> These bits of code are effectively simple wrappers for Bio.Entrez.
> While they may be simple to use, they cannot take advantage of the
> NCBI's Entrez utils history functionality.  This means they discourage
> users from following the NCBI's preferred usage patterns.
>
> We're already trying to encouraging the use of Bio.Entrez by
> documenting it prominently in the tutorial (which seems to be working
> given the recent questions on the mailing list), but for Biopython
> 1.49 I'm suggesting we go further and deprecate Bio.PubMed and the
> online code in Bio.GenBank.  This would mean a warning message would
> appear when this code is used, and (barring feedback) after a couple
> of releases this code would be removed completely.
>
> Any comments or objections?  In particular, is anyone using this
> "obsolete" functionality now?

I've just deprecated Bio.PubMed in CVS - meaning for the next release
of Biopython you'll see a warning message when you import the PubMed
module.  If you are using this module please say something sooner rather
than later.  This can still be undone.

Thanks,

Peter

From dalloliogm at gmail.com  Thu Oct 16 06:02:46 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 16 Oct 2008 12:02:46 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
Message-ID: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>

Hi,
I was going to write a python program to calculate Fst statistics from a
sample of SNP data.
Is there any module already available to do that in biopython, that I am
missing?
I saw there is a 'PopGen' module, but the Cookbook says it doesn't support
sequence data.
Is someone actually writing any module in python to calculate such
statistics?

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk  Thu Oct 16 06:23:12 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 16 Oct 2008 11:23:12 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
Message-ID: <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>

On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Hi,
> I was going to write a python program to calculate Fst statistics from a
> sample of SNP data.  Is there any module already available to do that
> in biopython, that I am missing?  I saw there is a 'PopGen' module, but
> the Cookbook says it doesn't support sequence data.
> Is someone actually writing any module in python to calculate such
> statistics?

I think this will be a question for Tiago (the Bio.PopGen author),
although others on the list may have also tackled similar questions.

In terms of reading in the SNP data, what file format will you be
loading?  Does Bio.SeqIO currently suffice?

Have you looked into what (if any) additional python libraries you
would need?  For any Biopython addition, a dependency on just numpy
that would be preferable, but Tiago has previously suggested an
optional dependency on scipy for additional statistics needed in
population genetics.

Peter

From tiagoantao at gmail.com  Thu Oct 16 10:10:47 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 16 Oct 2008 15:10:47 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
Message-ID: <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>

Hi,

On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Hi,
> I was going to write a python program to calculate Fst statistics from a
> sample of SNP data.
> Is there any module already available to do that in biopython, that I am
> missing?
> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support
> sequence data.
> Is someone actually writing any module in python to calculate such
> statistics?

The answer to this has to be done in parts, because it is actually a
bunch of related (but different) issues


On the data
1. Sequence support. Bio.PopGen doesn't support statistics for
sequences (like Tajima D and the like), BUT that is not relevant if
you want to do frequency based statistics (like good old Fst), you
just have to count frequencies and put into a "frequency format"
2. SNPs is actually not a sequence, but a single element, so it
becomes easier. What you need at the end is something like this:
For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0
For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0
For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10
And so on... You have to end up with frequency counts per population
So, as long as you convert data (sequence, SNP, microsatellite) to
frequency counts per population, there are no issues with the type of
data.

On calculating the statistics (Fst)
1. I am fully aware that core statistics like Fst (I work with Fst a
lot myself) are fundamental in a population genetics module, but I
sincerely don't know how to proceed because a long term solution
requires generic statistical support (e.g., chi-square tests
Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy
(and I will not maintain generic statistics code myself). I know that
Bio.PopGen is of little use without support for standard statistics.
2. A workaround (for which I have code written - but not commited to
the repository - I can give it to you) is to invoke GenePop and get
the Fst estimation. This requires the data to be in GenePop format
(again you can convert SNPs and even sequences to frequency based
format)
3. That being said, I have code to estimate Fst (Cockerham and Wier
theta and a variation from Mark Beaumont) in Python. I can give it to
you (but is not much tested).


On sequence data formats:
1. Note that sequence data files (that I know off) have no provision
for population structure (you cannot say, in a standard way, sequence
X belongs to population Y). You have to do it in adhoc way. That means
you have to invent your own convention for your private use.
2. Anyway, in your case I suppose you still have to extract the SNPs
from the sequence.
3. If you want do frequency based analysis on your SNPs, I suggest you
do a conversion to GenePop anyway (therefore you can import your data
in most population structure software as GenePop format is the defacto
standard)...
4. Because of the above there is actually no good solution for
automated conversion from sequence information to frequency based one
(in biopython or in any platform whatsoever)
I can give more suggestions if you give more details or have more
specific questions.

From tiagoantao at gmail.com  Thu Oct 16 10:14:28 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 16 Oct 2008 15:14:28 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
Message-ID: <6d941f120810160714s61eb6d1cx87d1943c4068d491@mail.gmail.com>

Just a minor point: I am so used to work in Fst that I mentally
converted your "F-statistics" to Fst. Most of my mail still stands.
The only point that changes a bit is that I only have code for Fst, so
I cannot help you with any other.

On Thu, Oct 16, 2008 at 3:10 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:
> Hi,
>
> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
>> Hi,
>> I was going to write a python program to calculate Fst statistics from a
>> sample of SNP data.
>> Is there any module already available to do that in biopython, that I am
>> missing?
>> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support
>> sequence data.
>> Is someone actually writing any module in python to calculate such
>> statistics?
>
> The answer to this has to be done in parts, because it is actually a
> bunch of related (but different) issues
>
>
> On the data
> 1. Sequence support. Bio.PopGen doesn't support statistics for
> sequences (like Tajima D and the like), BUT that is not relevant if
> you want to do frequency based statistics (like good old Fst), you
> just have to count frequencies and put into a "frequency format"
> 2. SNPs is actually not a sequence, but a single element, so it
> becomes easier. What you need at the end is something like this:
> For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0
> For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0
> For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10
> And so on... You have to end up with frequency counts per population
> So, as long as you convert data (sequence, SNP, microsatellite) to
> frequency counts per population, there are no issues with the type of
> data.
>
> On calculating the statistics (Fst)
> 1. I am fully aware that core statistics like Fst (I work with Fst a
> lot myself) are fundamental in a population genetics module, but I
> sincerely don't know how to proceed because a long term solution
> requires generic statistical support (e.g., chi-square tests
> Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy
> (and I will not maintain generic statistics code myself). I know that
> Bio.PopGen is of little use without support for standard statistics.
> 2. A workaround (for which I have code written - but not commited to
> the repository - I can give it to you) is to invoke GenePop and get
> the Fst estimation. This requires the data to be in GenePop format
> (again you can convert SNPs and even sequences to frequency based
> format)
> 3. That being said, I have code to estimate Fst (Cockerham and Wier
> theta and a variation from Mark Beaumont) in Python. I can give it to
> you (but is not much tested).
>
>
> On sequence data formats:
> 1. Note that sequence data files (that I know off) have no provision
> for population structure (you cannot say, in a standard way, sequence
> X belongs to population Y). You have to do it in adhoc way. That means
> you have to invent your own convention for your private use.
> 2. Anyway, in your case I suppose you still have to extract the SNPs
> from the sequence.
> 3. If you want do frequency based analysis on your SNPs, I suggest you
> do a conversion to GenePop anyway (therefore you can import your data
> in most population structure software as GenePop format is the defacto
> standard)...
> 4. Because of the above there is actually no good solution for
> automated conversion from sequence information to frequency based one
> (in biopython or in any platform whatsoever)
> I can give more suggestions if you give more details or have more
> specific questions.
>


-- 
"Data always beats theories. 'Look at data three times and then come
to a conclusion,' versus 'coming to a conclusion and searching for
some data.' The former will win every time."
?Matthew Simmons,
http://www.tiago.org


From biopython at maubp.freeserve.co.uk  Thu Oct 16 11:11:27 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 16 Oct 2008 16:11:27 +0100
Subject: [BioPython]  back-translation method for Seq object?
Message-ID: <320fb6e00810160811s19e580b2ped86c43b32c401bb@mail.gmail.com>

Quoting from the recent thread about adding a translation method to
the Seq object, Bruce brought up back-translation:

Peter wrote:
> Bruce wrote:
>> Obviously reverse translation of a protein sequence to a DNA sequence is
>> complex if there are many solutions.
>
> Yes, back-translation is tricky because there is generally more than
> one codon for any amino acid.  Ambiguous nucleotides can be used to
> describe several possible codons giving that amino acid, but in
> general it is not possible to do this and describe all the possible
> codons which could have been used.  This topic is worth of an entire
> thread... for the record, I would envisage a back_translate method for
> the Seq object (assuming we settle on translate as the name for the
> forward translation from nucleotide to protein).

Do we actually need a back_translate method?  Can anyone suggest an
actual use-case for this?  It seems difficult to imagine that any
simple version would please everyone.

Bio.Translate (a semi-obsolete module whose deprecation has been
suggested) provides a back_translate method which picks an essentially
arbitrary but unambiguous codon for each amino acid.  Crude but
simple.  A more meaningful choice would require suppling codon
frequencies for the organism under consideration.

Other possibilities include using ambiguous nucleotides to try and
cover all the possibilities (e.g. "L" -> "CTN"), but even here in some
cases this is arbritary.  e.g. The standard three stop codons ['TAA',
'TAG', 'TGA'] could be represented as ['TAR', 'TGA'] or ['TRA', 'TAG']
but not by a single ambiguous codon ('TRR' also covers 'TGG' which
codes for 'W').

Potentially of use would be a generator function which returned all
possible back translations - but this would be complex and typically
overkill.

As a final point, a Seq object back-translation method could give RNA
or DNA.  From a biological point of view giving DNA by default would
make sense.  This choice is handled in Bio.Translate when creating the
translator object (part of what makes Bio.Translate relatively complex
to use).

Peter

From sdavis2 at mail.nih.gov  Thu Oct 16 11:16:51 2008
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 16 Oct 2008 11:16:51 -0400
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
Message-ID: <264855a00810160816g6ea765dcl65fd7e5aa38ef20c@mail.gmail.com>

On Thu, Oct 16, 2008 at 10:10 AM, Tiago Ant?o <tiagoantao at gmail.com> wrote:
> Hi,
>
> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
>> Hi,
>> I was going to write a python program to calculate Fst statistics from a
>> sample of SNP data.
>> Is there any module already available to do that in biopython, that I am
>> missing?
>> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support
>> sequence data.
>> Is someone actually writing any module in python to calculate such
>> statistics?
>
> The answer to this has to be done in parts, because it is actually a
> bunch of related (but different) issues
>
>
> On the data
> 1. Sequence support. Bio.PopGen doesn't support statistics for
> sequences (like Tajima D and the like), BUT that is not relevant if
> you want to do frequency based statistics (like good old Fst), you
> just have to count frequencies and put into a "frequency format"
> 2. SNPs is actually not a sequence, but a single element, so it
> becomes easier. What you need at the end is something like this:
> For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0
> For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0
> For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10
> And so on... You have to end up with frequency counts per population
> So, as long as you convert data (sequence, SNP, microsatellite) to
> frequency counts per population, there are no issues with the type of
> data.
>
> On calculating the statistics (Fst)
> 1. I am fully aware that core statistics like Fst (I work with Fst a
> lot myself) are fundamental in a population genetics module, but I
> sincerely don't know how to proceed because a long term solution
> requires generic statistical support (e.g., chi-square tests
> Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy
> (and I will not maintain generic statistics code myself). I know that
> Bio.PopGen is of little use without support for standard statistics.
> 2. A workaround (for which I have code written - but not commited to
> the repository - I can give it to you) is to invoke GenePop and get
> the Fst estimation. This requires the data to be in GenePop format
> (again you can convert SNPs and even sequences to frequency based
> format)
> 3. That being said, I have code to estimate Fst (Cockerham and Wier
> theta and a variation from Mark Beaumont) in Python. I can give it to
> you (but is not much tested).
>
>
> On sequence data formats:
> 1. Note that sequence data files (that I know off) have no provision
> for population structure (you cannot say, in a standard way, sequence
> X belongs to population Y). You have to do it in adhoc way. That means
> you have to invent your own convention for your private use.
> 2. Anyway, in your case I suppose you still have to extract the SNPs
> from the sequence.
> 3. If you want do frequency based analysis on your SNPs, I suggest you
> do a conversion to GenePop anyway (therefore you can import your data
> in most population structure software as GenePop format is the defacto
> standard)...
> 4. Because of the above there is actually no good solution for
> automated conversion from sequence information to frequency based one
> (in biopython or in any platform whatsoever)
> I can give more suggestions if you give more details or have more
> specific questions.

Just a little note that the R programming language has some packages
for population genetics and, of course, has excellent statistical
tools.  One can interface with it via rpy.  I'm not advocating going
this route, but just wanted to let people know about another option.

Sean


From tiagoantao at gmail.com  Thu Oct 16 11:26:52 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 16 Oct 2008 16:26:52 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <264855a00810160816g6ea765dcl65fd7e5aa38ef20c@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
	<264855a00810160816g6ea765dcl65fd7e5aa38ef20c@mail.gmail.com>
Message-ID: <6d941f120810160826q2bf25382m41890fb39a4226a0@mail.gmail.com>

The task view on Genetics for R provides a good starting point to find
R packages related to the field:

http://www.freestatistics.org/cran/web/views/Genetics.html


On Thu, Oct 16, 2008 at 4:16 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> On Thu, Oct 16, 2008 at 10:10 AM, Tiago Ant?o <tiagoantao at gmail.com> wrote:
>> Hi,
>>
>> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
>> <dalloliogm at gmail.com> wrote:
>>> Hi,
>>> I was going to write a python program to calculate Fst statistics from a
>>> sample of SNP data.
>>> Is there any module already available to do that in biopython, that I am
>>> missing?
>>> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support
>>> sequence data.
>>> Is someone actually writing any module in python to calculate such
>>> statistics?
>>
>> The answer to this has to be done in parts, because it is actually a
>> bunch of related (but different) issues
>>
>>
>> On the data
>> 1. Sequence support. Bio.PopGen doesn't support statistics for
>> sequences (like Tajima D and the like), BUT that is not relevant if
>> you want to do frequency based statistics (like good old Fst), you
>> just have to count frequencies and put into a "frequency format"
>> 2. SNPs is actually not a sequence, but a single element, so it
>> becomes easier. What you need at the end is something like this:
>> For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0
>> For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0
>> For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10
>> And so on... You have to end up with frequency counts per population
>> So, as long as you convert data (sequence, SNP, microsatellite) to
>> frequency counts per population, there are no issues with the type of
>> data.
>>
>> On calculating the statistics (Fst)
>> 1. I am fully aware that core statistics like Fst (I work with Fst a
>> lot myself) are fundamental in a population genetics module, but I
>> sincerely don't know how to proceed because a long term solution
>> requires generic statistical support (e.g., chi-square tests
>> Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy
>> (and I will not maintain generic statistics code myself). I know that
>> Bio.PopGen is of little use without support for standard statistics.
>> 2. A workaround (for which I have code written - but not commited to
>> the repository - I can give it to you) is to invoke GenePop and get
>> the Fst estimation. This requires the data to be in GenePop format
>> (again you can convert SNPs and even sequences to frequency based
>> format)
>> 3. That being said, I have code to estimate Fst (Cockerham and Wier
>> theta and a variation from Mark Beaumont) in Python. I can give it to
>> you (but is not much tested).
>>
>>
>> On sequence data formats:
>> 1. Note that sequence data files (that I know off) have no provision
>> for population structure (you cannot say, in a standard way, sequence
>> X belongs to population Y). You have to do it in adhoc way. That means
>> you have to invent your own convention for your private use.
>> 2. Anyway, in your case I suppose you still have to extract the SNPs
>> from the sequence.
>> 3. If you want do frequency based analysis on your SNPs, I suggest you
>> do a conversion to GenePop anyway (therefore you can import your data
>> in most population structure software as GenePop format is the defacto
>> standard)...
>> 4. Because of the above there is actually no good solution for
>> automated conversion from sequence information to frequency based one
>> (in biopython or in any platform whatsoever)
>> I can give more suggestions if you give more details or have more
>> specific questions.
>
> Just a little note that the R programming language has some packages
> for population genetics and, of course, has excellent statistical
> tools.  One can interface with it via rpy.  I'm not advocating going
> this route, but just wanted to let people know about another option.
>
> Sean
>
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


-- 
"Data always beats theories. 'Look at data three times and then come
to a conclusion,' versus 'coming to a conclusion and searching for
some data.' The former will win every time."
?Matthew Simmons,
http://www.tiago.org


From lpritc at scri.ac.uk  Fri Oct 17 04:24:43 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 17 Oct 2008 09:24:43 +0100
Subject: [BioPython] back-translation method for Seq object?
In-Reply-To: <320fb6e00810160811s19e580b2ped86c43b32c401bb@mail.gmail.com>
Message-ID: <C51E0A5B.17DCE%lpritc@scri.ac.uk>


On 16/10/2008 16:11, "Peter" <biopython at maubp.freeserve.co.uk> wrote:

> Quoting from the recent thread about adding a translation method to
> the Seq object, Bruce brought up back-translation:
> 
> Peter wrote:
>> Bruce wrote:
>>> Obviously reverse translation of a protein sequence to a DNA sequence is
>>> complex if there are many solutions.

This is the key problem.  Forward translation is - for a given codon table -
a one-one mapping.  Reverse translation is (for many amino acids) one-many.
If the goal is to produce the coding sequence that actually encoded a
particular protein sequence, the problem is combinatorial and rapidly
becomes messy with increasing sequence length.  And that's not considering
the problem of splice variants/intron-exon boundaries if attempting to
relate the sequence back to some genome or genome fragment - more a problem
in eukaryotes.

>> Yes, back-translation is tricky because there is generally more than
>> one codon for any amino acid.  Ambiguous nucleotides can be used to
>> describe several possible codons giving that amino acid, but in
>> general it is not possible to do this and describe all the possible
>> codons which could have been used.  This topic is worth of an entire
>> thread... for the record, I would envisage a back_translate method for
>> the Seq object (assuming we settle on translate as the name for the
>> forward translation from nucleotide to protein).
> 
> Do we actually need a back_translate method?  Can anyone suggest an
> actual use-case for this?  It seems difficult to imagine that any
> simple version would please everyone.

I agree - I can't think of an occasion where I might want to back-translate
a protein in this way that wouldn't better be handled by other means.  Not
that I'm the fount of all use-cases but, given the number of ways in which
one *could* back-translate, perhaps it would be better not to pick/guess at
any single one.

Some choices to be made in deciding how to back-translate are (and I'm sure
you've already thought of them, but they're worth writing down):

I) Protein to unambiguous RNA:
  a) Codon table: arbitrary; organism-specific; user-defined?
  b) Codon choice: arbitrary and random; arbitrary and consistent; complete
set of possibilities; most common codon (if information available); other
favoured codon (if specified)?
II) Protein to ambiguous RNA:
  a) Return a Seq, string or some other representation of ambiguity?
  b) IUPAC ambiguity symbols; choice of codons; alternative representation
of ambiguity?

The most common back-translation I do is taking aligned protein sequences
back to their known coding sequences, and this is really a case of mapping
known codons onto predefined positions, rather than the interpolation of
unknown codons that is required for back-translation as implied above.
T-coffee handles this pretty well, IIRC.

To find coding sequences for a particular protein in the originating
sequence (if known), I use BLAST.  I guess there might be value in having
the ability to identify regions of the coding sequence that are least likely
to be variable (by generating them combinatorially) so that probes might be
designed if the coding sequence is not known.  But that doesn't appear to be
the way that most sequences are obtained these days: much cheaper to bung
RNA through 454 or Solexa and work through the output than to put someone on
the task of making an array of probes to find a sequence that may or may not
encode your sequenced protein...

> Bio.Translate (a semi-obsolete module whose deprecation has been
> suggested) provides a back_translate method which picks an essentially
> arbitrary but unambiguous codon for each amino acid.  Crude but
> simple.  A more meaningful choice would require suppling codon
> frequencies for the organism under consideration.

These can be found - for many organisms - in Emboss codon usage table (.cut)
files, if you have Emboss locally.  However, is requiring Emboss as a
dependency the cleanest or wisest solution for Biopython?  This approach
solves only one problem:  given a particular codon usage table, what is the
most likely sequence that would have produced this protein.  That's not a
problem I've ever come across in anger, but given a table of 'most efficient
codons' for some biological expression system, I can see this potentially
having some use.  However, given that many microbiologists can already tell
you the preferred codons for K12 without pausing for breath, I'm not sure
there's a problem looking for this solution.
 
> Other possibilities include using ambiguous nucleotides to try and
> cover all the possibilities (e.g. "L" -> "CTN"), but even here in some
> cases this is arbritary.  e.g. The standard three stop codons ['TAA',
> 'TAG', 'TGA'] could be represented as ['TAR', 'TGA'] or ['TRA', 'TAG']
> but not by a single ambiguous codon ('TRR' also covers 'TGG' which
> codes for 'W').

If Seq had an ambiguity-aware sequence representation, this could be
handled.  For example, a regular expression-based sequence representation
(which could lie alongside Seq.data, perhaps as Seq.regex) could represent
these variants as (TAA|TAG|TGA), and alternatively the usual ambiguity codes
could also be handled in a similar way (e.g. R as [AG]).  This would be of
some limited use, but would permit sequence searching within Biopython, at
least.

> Potentially of use would be a generator function which returned all
> possible back translations - but this would be complex and typically
> overkill.

I think that, for large sequences, this could quickly swamp the user.  What
do you see as the use of this output?

> As a final point, a Seq object back-translation method could give RNA
> or DNA.  From a biological point of view giving DNA by default would
> make sense.  This choice is handled in Bio.Translate when creating the
> translator object (part of what makes Bio.Translate relatively complex
> to use).

Since there is a one-one map of RNA to DNA, I'm easy about either choice on
a computational level.  Biologically-speaking, DNA -> RNA is transcription,
and RNA -> protein is translation, so I'd expect back-translation to convert
protein -> RNA, and back-transcription to convert RNA -> DNA.

Cheers,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________

From dalloliogm at gmail.com  Fri Oct 17 05:39:41 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 17 Oct 2008 11:39:41 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
Message-ID: <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>

On Thu, Oct 16, 2008 at 12:23 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
> > Hi,
> > I was going to write a python program to calculate Fst statistics from a
> > sample of SNP data.  Is there any module already available to do that
> > in biopython, that I am missing?  I saw there is a 'PopGen' module, but
> > the Cookbook says it doesn't support sequence data.
> > Is someone actually writing any module in python to calculate such
> > statistics?
>
> I think this will be a question for Tiago (the Bio.PopGen author),
> although others on the list may have also tackled similar questions.
>
> In terms of reading in the SNP data, what file format will you be
> loading?  Does Bio.SeqIO currently suffice?
>

Hi,
thank you very much all of you for the replies.
Actually I am going to use tped[1] and tfam[1] files as input, formatted
with the plink program[2].
Bio.SeqIO doesn't support these format, but this is right because they don't
cointain only sequences but rather elements like Tiago was saying.

Let's say I try to write a parser for these two file formats. In which
biopython object should I save them? Is there any kind of 'Individual' or
'Population' object in biopython?
I see from the cookbook that Bio.GenPop.Record is representanting
populations and individual as list[3], and that there is not a 'Population'
or 'Individual' object.
I think that it is a good approach, because these kind of files tend to be
very big and instantiating an Individual object instead of a tuple for every
line of the file would be take much memory.
But are you going to implement some kind of 'Individual' or 'Population'
object?
Moreover, python 2.6 will implement a new kind of data object, called 'named
tuple' [4], to implement these kind of records. It could be a good
compromise (maybe I'll better start a new thread about this and explain
better).


[1] tped, tfam: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#tr
[2] plink: http://pngu.mgh.harvard.edu/~purcell/plink/index.shtml
[3] biopython cookbook, popgen:
http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc112
[4] named tuples in python 2.6: http://code.activestate.com/recipes/500261/


>
> Have you looked into what (if any) additional python libraries you
> would need?  For any Biopython addition, a dependency on just numpy
> that would be preferable, but Tiago has previously suggested an
> optional dependency on scipy for additional statistics needed in
> population genetics.
>
> Peter
>


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From dalloliogm at gmail.com  Fri Oct 17 06:03:32 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 17 Oct 2008 12:03:32 +0200
Subject: [BioPython] named tuples for biopython?
Message-ID: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com>

Hi,
python 2.6 is going to implement a new kind of data (like lists, strings,
etc..) called 'named_tuple'.
It is intended to be a better data format to be used when parsing record
files and databases.

You can download the recipe from here (it should be included experimentally
in python 2.6):
- http://code.activestate.com/recipes/500261/

Basically, you instantiate a named_tuple object with this syntax:

>> Person = NamedTuple("Person name surname")

"Person" is a label for the named_tuple; the following fields, 'name' and
'surname'
Then you will have named_tuple object which is basically a mix between a
dictionary, a custom class and a tuple:

>> Person = NamedTuple("Person name surname")
>> Einstein = Person('Albert', 'Einstein')
>> Einstein.name
'Albert'

>> Einstein.surname
'Einstein'

>> people = []
>> for line in f.readlines():
>>     people.append(Person(line.split())
>>
>> for person in people:
>>     print person.name, person.surname

named_tuples are also read-only object, so they should be less
memory-expensive
It is like tuples against lists, but more customizable.

I am really not good ad explaining, and I can't find a good tutorial that
illustrate this.
I read a good article about named_tuples, but it is in italian language (
http://stacktrace.it/2008/05/gestione-dei-record-python-1/). Maybe you can
understand the code examples.

Has any of you heard about this new data type? Do you think it could be
useful for biopython? There is a lot of file parsing / database interfacing
in bioinformatics :)


p.s. since I didn't trust HTML-based mails to keep code formatting, I also
posted this same message on nodalpoint:
http://www.nodalpoint.org/2008/10/17/python_2_6_will_implement_a_new_data_format_named_tuple_can_it_be_of_use_for_biopython


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From dalloliogm at gmail.com  Fri Oct 17 06:11:23 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 17 Oct 2008 12:11:23 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810160714s61eb6d1cx87d1943c4068d491@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
	<6d941f120810160714s61eb6d1cx87d1943c4068d491@mail.gmail.com>
Message-ID: <5aa3b3570810170311g1d92dc52q41616cd6dc58fb03@mail.gmail.com>

On Thu, Oct 16, 2008 at 4:14 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:

> Just a minor point: I am so used to work in Fst that I mentally
> converted your "F-statistics" to Fst. Most of my mail still stands.
> The only point that changes a bit is that I only have code for Fst, so
> I cannot help you with any other.
>
> On Thu, Oct 16, 2008 at 3:10 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:
>
> > 3. That being said, I have code to estimate Fst (Cockerham and Wier
> > theta and a variation from Mark Beaumont) in Python. I can give it to
> > you (but is not much tested).
> >
>


Thank you.. Can you please send me this code that you are using to calculate
Fst statistics with python?
I can't guarantee I will use it (most of the people here use perl and
bioperl, but I would prefer python), but maybe I can help you testing it.


>
>
>
>
>
> --
> "Data always beats theories. 'Look at data three times and then come
> to a conclusion,' versus 'coming to a conclusion and searching for
> some data.' The former will win every time."
> ?Matthew Simmons,
> http://www.tiago.org
>
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Fri Oct 17 06:17:51 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 17 Oct 2008 11:17:51 +0100
Subject: [BioPython] named tuples for biopython?
In-Reply-To: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com>
References: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com>
Message-ID: <320fb6e00810170317w24fe34a4p1884c4264f3e7363@mail.gmail.com>

On Fri, Oct 17, 2008 at 11:03 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Hi,
> python 2.6 is going to implement a new kind of data (like lists, strings,
> etc..) called 'named_tuple'.  It is intended to be a better data format to
> be used when parsing record files and databases.

I'd just seen this today actually via another mailing list.  Here is a
short example which actually works on python 2.6 (the details have
changed slightly from your quote),

>>> from collections import namedtuple
>>> Person = namedtuple("Person", "name surname")
>>> x = Person("Albert", "Einstein")
>>> x
Person(name='Albert', surname='Einstein')
>>> x.name
'Albert'
>>> x.surname
'Einstein'
>>> x.keys()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Person' object has no attribute 'keys'
>>> x["name"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: tuple indices must be integers, not str
>>> x[0]
'Albert'
>>> x[1]
'Einstein'

So this doesn't act much like a dictionary (in terms of the x[...]
usage), so we can't use it as a drop in enhancement for existing
dictionaries in Biopython.  I expect there are some places where a
namedtuple would make sense (although using it might break backwards
compatibility).

Also, if we did want to use NamedTuple in Biopython we'd have to
include a copy for use on older versions of python.  This is probably
possible under the python license... but would require an
implementation that still worked on pre 2.6.

Peter

From lpritc at scri.ac.uk  Fri Oct 17 06:52:33 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 17 Oct 2008 11:52:33 +0100
Subject: [BioPython] named tuples for biopython?
In-Reply-To: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com>
Message-ID: <C51E2D01.17DF2%lpritc@scri.ac.uk>


On 17/10/2008 11:03, "Giovanni Marco Dall'Olio" <dalloliogm at gmail.com>
wrote:

> Hi,
> python 2.6 is going to implement a new kind of data (like lists, strings,
> etc..) called 'named_tuple'.
> It is intended to be a better data format to be used when parsing record
> files and databases.
> 
> You can download the recipe from here (it should be included experimentally
> in python 2.6):
> - http://code.activestate.com/recipes/500261/

The explanation here was pretty clear, to me:

http://docs.python.org/dev/library/collections.html#collections.namedtuple

> Has any of you heard about this new data type?

Not until you mentioned it - thanks for the heads-up.

> Do you think it could be
> useful for biopython? There is a lot of file parsing / database interfacing
> in bioinformatics :)

I can see it being a useful collection type.  It reminds me of C structs,
and looks like a near-perfect fit to many db table entries, and to
csv/ATF-format files for which the column headers can be used to define
attributes.  

I guess that one disadvantage of namedtuples, compared to, e.g. a dictionary
in which each value is itself a dictionary of attributes (with attribute
names for keys), is that there's a restricted character/word set available
for attribute names in the namedtuple, but this is not important for
dictionary keys, so some additional tally of header to attribute name may be
necessary.  This has a real use-case in, say, parsing ATF format files...

http://www.moleculardevices.com/pages/software/gn_genepix_file_formats.html

... where on-the-fly creation of attributes with the same name as in the
parsed file or table row may not be possible with a namedtuple.  If you know
of the column/field names in advance though, it shouldn't be an issue.

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________

From tiagoantao at gmail.com  Fri Oct 17 14:07:18 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Fri, 17 Oct 2008 19:07:18 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
Message-ID: <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>

Hi,

On Fri, Oct 17, 2008 at 10:39 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Let's say I try to write a parser for these two file formats. In which
> biopython object should I save them? Is there any kind of 'Individual' or
> 'Population' object in biopython?
> I see from the cookbook that Bio.GenPop.Record is representanting
> populations and individual as list[3], and that there is not a 'Population'
> or 'Individual' object.

No, there are no concepts of individuals or populations for now.
Bio.PopGen.GenePop is just a representation of a GenePop file (which
is a de facto standard in frequency based population genetics).
Currently Bio.PopGen philosophy is more of a wrapper for existing
software (e.g., I don't implement a coalescent simulator, like in
BioPerl, I wrap Simcoal2). The disadvantage is that it is not "Pure
Python" and is dependent on external applications. The advantage is
that, if the external application is good, than good functionality
becomes available inside Biopython. For example, coalescent simulation
in BioPerl is (at least last time I've checked it) orders of magnitude
less flexible than BioPython's (based on SimCoal2).
In this philosophy, I now have a (partial) wrapper for the GenePop
application to calculate statistics (voila, Fst).
That doesn't mean that core statistics functionality should not be
available in Bio.PopGen. I think it should be (that is why I have
quite done work on that - implementing from scratch Fst, allelic
richness, expected heterosigosity, ...). The same goes to the concept
of Population and Individual.
For a number of cumulative reasons, the work on that front is stalled.
But, if there is some interest, I would more than welcome reopening
that front...


> Moreover, python 2.6 will implement a new kind of data object, called 'named
> tuple' [4], to implement these kind of records. It could be a good
> compromise (maybe I'll better start a new thread about this and explain
> better).

I think the ad-hoc policy in Biopython is to support previous versions
of Python, so I don't think it will be easy to do things in a 2.6 only
way (although, for NEW functionality, from my part, I don't see a
problem with it).

Tiago

From bsouthey at gmail.com  Fri Oct 17 14:46:19 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Fri, 17 Oct 2008 13:46:19 -0500
Subject: [BioPython] back-translation method for Seq object?
In-Reply-To: <C51E0A5B.17DCE%lpritc@scri.ac.uk>
References: <C51E0A5B.17DCE%lpritc@scri.ac.uk>
Message-ID: <48F8DD7B.7010909@gmail.com>

Leighton Pritchard wrote:
> On 16/10/2008 16:11, "Peter" <biopython at maubp.freeserve.co.uk> wrote:
>
>   
>> Quoting from the recent thread about adding a translation method to
>> the Seq object, Bruce brought up back-translation:
>>
>> Peter wrote:
>>     
>>> Bruce wrote:
>>>       
>>>> Obviously reverse translation of a protein sequence to a DNA sequence is
>>>> complex if there are many solutions.
>>>>         
>
> This is the key problem.  Forward translation is - for a given codon table -
> a one-one mapping.  Reverse translation is (for many amino acids) one-many.
> If the goal is to produce the coding sequence that actually encoded a
> particular protein sequence, the problem is combinatorial and rapidly
> becomes messy with increasing sequence length.  And that's not considering
> the problem of splice variants/intron-exon boundaries if attempting to
> relate the sequence back to some genome or genome fragment - more a problem
> in eukaryotes.
>   
If you use a regular expression or a tree structure then there is a 
one-one mapping but then that would probably best as a subclass of Seq. 
Note you still would need a method to transverse it if you wanted to get 
a sequence from it as well as an reverse complement. It is fairly 
trivial to get a regular expression for it for the standard genetic code 
but I did not get my reverse complement to work satisfactory nor did I 
try to get DNA sequence from the regular expression.

I would suggest tools like Wise2 and exonerate 
(http://www.ebi.ac.uk/~guy/exonerate/) are the solution to solving gene 
structure problems than using a Seq object.

Obviously if you start with a DNA sequence, then you could create object 
that has a DNA/RNA Seq object and a protein Seq object(s) that contain 
the translation(s) like in Genbank DNA records that contain the 
translation. But that really avoids the issue here.

>>> Yes, back-translation is tricky because there is generally more than
>>> one codon for any amino acid.  Ambiguous nucleotides can be used to
>>> describe several possible codons giving that amino acid, but in
>>> general it is not possible to do this and describe all the possible
>>> codons which could have been used.  This topic is worth of an entire
>>> thread... for the record, I would envisage a back_translate method for
>>> the Seq object (assuming we settle on translate as the name for the
>>> forward translation from nucleotide to protein).
>>>       
>> Do we actually need a back_translate method?  Can anyone suggest an
>> actual use-case for this?  It seems difficult to imagine that any
>> simple version would please everyone.
>>     
>
> I agree - I can't think of an occasion where I might want to back-translate
> a protein in this way that wouldn't better be handled by other means.  Not
> that I'm the fount of all use-cases but, given the number of ways in which
> one *could* back-translate, perhaps it would be better not to pick/guess at
> any single one.
>   
Apart from the academic aspect, my main use is searching for protein 
motifs/domains, enzyme cleavage sites, finding very short combinations 
of amino acids and binding sites (I do not do this but it is the same) 
in DNA sequences especially genomic sequence. These are usually very 
small and, thus, unsuitable for most tools. One of my uses is with 
peptide identification and de novo sequencing using mass spectrometry 
when you don't know the actual protein or gene sequence. It also has the 
problem that certain amino acids have very similar mass so you would 
need to  Regardless of whether you use a regular expression query or not 
you still need a back translation of the protein query and probably the 
reverse complement.

Another case where it would be useful is that tools like TBLASTN gives 
protein alignments so you must open the DNA sequence and find the DNA 
region based on the protein alignment.

Bruce


From dalloliogm at gmail.com  Sun Oct 19 10:50:54 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Sun, 19 Oct 2008 16:50:54 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
Message-ID: <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>

On Sat, Oct 18, 2008 at 6:50 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:

>  > here have used bioperl Bio::PopGen::PopStat, but we saw that using that
> > module as it is now in bioperl is too much computationally-expensive for
> our
> > resources.
> > So, we are going to either refactor the bioperl function, or to write
> custom
> > scripts in python to calculate Fst.
> > I can program perl, but I would prefer to use python in use, since I like
> > object oriented programming.
>
> You can find my (completely unofficial, completely untested) PopGen module
> here:
> http://popgen.eu/PopGen.tar.gz
> You should take a biopython distro and replace the PopGen directory
> with the contents of this one.
>


ok, thank you very much!!
I would like to use git to keep track of the changes I will make to the
code.
What do you think if I'll upload it to http://github.com and then upload it
back on biopython when it is finished?
I am not sure, but I think it would be possible to convert the logs back to
cvs to reintegrate the changes in biopython.


> There are 2 ways to calculate Fst:
> Doing something this:
> from Bio.PopGen.Stats.Structural import Fst
>
> fst = Fst()


> fst.add_pop('Pop 1', [('a', 'a'), ('a', 'c'), ('a','c')])
> fst.add_pop('Pop 2', [('a', 'c'), ('a', 'c'), ('a','c')])
>

One of the problems we are having here, is that it takes too much RAM memory
to store all the information about characters for every population.
I was going to write a Population object, in which I'll store only the total
count of heterozygotes, individuals, and what is needed, instead of the
information about characters (('a', 'a'), ('a', 'c'), ...)

It is something like this:
class Population:
  markers = []

class Marker:
  total_heterozygotes_count = 0
  total_population_count = 0
  total_Purines_count = 0 # this could be renamed, of course
  total_Pyrimidines_count = 0


>
> Or using the new GenePop code (see GenePop/Controller.py), by using
> genepop to calculate Fsts.
>
> A few comments:
> 1. I don't trust my own Fst code (not tested at all, I am actually
> using GenePop as above). You can find it on PopGen.Stats.Structural
> (Fst, and also FstBeaumont). There is code there for Fst, Fis and Fit.
> Also Fk (I trust the Fk code, but its the only one)


I will ask my group leader to help me in writing down some good test data.
I'll let you know when I will speak with him.


>
> 2. If your problem is performance, I think you have to go to a faster
> language. Scripting languages strongly underperfom on the speed issue.
> I find this problem lots of times. C, C++ and Java (yes, java for
> performance) is what I use. Perl, Python and other scripting languages
> are quite bad performance-wise.


I know.. but I think this time, the problem is in memory usage.


> 3. You can find a Fst implementation in C++ on simuPop (see file
> stator.cpp). GenePop code must also have Fst implemented.

4. I have a Fst based application using Biopython PopGen with Fst (but
> for another application) - Fdist, you can find it at:
> http://www.biomedcentral.com/1471-2105/9/323 . Module Bio.PopGen.FDist
> (incidentally, you can also use this to calculate Fst ;) ).
> 5. My code on Bio.PopGen.Stats is surely not in its final form. I have
> a plan to change it massively. If you are interested in participating
> in the discussion, you are welcome.
>
> > This is to say that if you want, we can work on the same code, and
> > contribute it to biopython.
>
> This would be most welcome. I have almost no sense of propriety for
> the code that is on Bio.PopGen. So, if you work on this, go ahead!
>
>
> > I am writing a ped file parser (everybody here is used to this format,
> and I
> > don't know GenePop :( ), and a simple script that calculates Fst with the
> > most basic formula.
> > I am also trying to design some good tests, and I am using subversion as
> as
> > source control system.
> > Maybe I can also send this to you, so you can have a look (but it is
> still
> > very basic, I started yesterday).
>
> Again, any contribution would be most welcome. Regarding parsers I
> would suggest you to have a look at how parsers are done in Biopython.
> I am following the "standard". You can find an example on
> Bio.PopGen.GenePop.__init__.py. From my point of view I have nothing
> against a "non standard" parser as long as it is documented and
> commented.


Thank you very much.. I know more or less how parsers are written in
biopython, but I have never written one myself.


>
>
> Again, feel free to take this discussion to biopython-dev, especially
> if you are willing to contribute.
>


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From mmokrejs at ribosome.natur.cuni.cz  Sun Oct 19 11:52:29 2008
From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=)
Date: Sun, 19 Oct 2008 17:52:29 +0200
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
Message-ID: <48FB57BD.7070705@ribosome.natur.cuni.cz>

Hi,
  I have been away for 2 weeks but although late, let
me oppose that string.translate() is of use. Here is my
current code:

    # make sure no unallowed chars are present in the sequence
    if type == "DNA":
        if not _sequence.translate(string.maketrans('', ''),'GgAaTtCc'):
            if not _sequence.translate(string.maketrans('', ''),'GgAaTtCcBbDdSsWw'):
                if not _sequence.translate(string.maketrans('', ''),'GgAaTtCcRrYyWwSsMmKkHhBbVvDdNn'):
                    raise ValueError, "DNA sequence contains unallowed characters: " + str(_sequence.translate(string.maketrans('', ''),'GgAaTtCcRrYyWwSsMmKkHhBbVvDdNn'))
                else:
                    _warning = "DNA sequence contains IUPACAmbiguousDNA characters, which cannot be interpreted uniquely. Please try to find sequence of higher quality."
            else:
                _warning = "DNA sequence contains ExtendedIUPACDNA characters. " + str(_sequence.translate(string.maketrans('', ''),'GATC')) + " Please try to find sequence of higher quality."
    elif type == "RNA":
        if not _sequence.translate(string.maketrans('', ''),'GgAaUuCc'):
            if not _sequence.translate(string.maketrans('', ''),'GgAaUuCcRrYyWwSsMmKkHhBbVvDdNn'):
                raise ValueError, "RNA sequence contains unallowed characters: " + str(_sequence.translate(string.maketrans('', ''),'GgAaTtCcRrYyWwSsMmKkHhBbVvDdNn'))
            else:
                _warning = "RNA sequence contains ExtendedIUPACDNA characters. " + str(_sequence.translate(string.maketrans('', ''),'GgAaUuCc')) + " Please try to find sequence of higher quality."
        _sequence = _sequence.translate(string.maketrans('Uu', 'Tt'))
    return (_warning, _type, _description, _sequence)


I would have voted for b) or c).
Martin


Peter wrote:
> Dear Biopythoneers,
> 
> This is a request for feedback about proposed additions to the Seq
> object for the next release of Biopython.  I'd like people to pick (a)
> to (e) in the list below (with additional comments or counter
> suggestions welcome).
> 
> Enhancement bug 2381 is about adding transcription and translation
> methods to the Seq object, allowing an object orientated style of
> programming.
> 
> e.g. Current functional programming style:
> 
>>>> from Bio.Seq import Seq, transcribe
>>>> from Bio.Alphabet import generic_dna
>>>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna)
>>>> my_seq
> Seq('CAGTGACGTTAGTCCG', DNAAlphabet())
>>>> transcribe(my_seq)
> Seq('CAGUGACGUUAGUCCG', RNAAlphabet())
> 
> With the latest Biopython in CVS, you can now invoke a Seq object
> method instead for transcription (or back transcription):
> 
>>>> my_seq.transcribe()
> Seq('CAGUGACGUUAGUCCG', RNAAlphabet())
> 
> For a comparison, compare the shift from python string functions to
> string methods.  This also makes the functionality more discoverable
> via dir(my_seq).
> 
> Adding Seq object methods "transcribe" and "back_transcribe" doesn't
> cause any confusion with the python string methods.  However, for
> translation, the python string has an existing "translate" method:
> 
>> S.translate(table [,deletechars]) -> string
>>
>> Return a copy of the string S, where all characters occurring
>> in the optional argument deletechars are removed, and the
>> remaining characters have been mapped through the given
>> translation table, which must be a string of length 256.
> 
> I don't think this functionality is really of direct use for sequences, and
> having a Seq object "translate" method do a biological translation into
> a protein sequence is much more intuitive. However, this could cause
> confusion if the Seq object is passed to non-Biopython code which
> expects a string like translate method.
> 
> To avoid this naming clash, a different method name would needed.
> 
> This is where some user feedback would be very welcome - I think
> the following cover all the alternatives of what to call a biological
> translation function (nucleotide to protein):
> 
> (a) Just use translate (ignore the existing string method)
> (b) Use translate_ (trailing underscore, see PEP8)
> (c) Use translation (a noun rather than verb; different style).
> (d) Use something else (e.g. bio_translate or ...)
> (e) Don't add a biological translation method at all because ...
> 
> Thanks,
> 
> Peter
> 
> See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381

From mmokrejs at ribosome.natur.cuni.cz  Sun Oct 19 12:17:50 2008
From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=)
Date: Sun, 19 Oct 2008 18:17:50 +0200
Subject: [BioPython] Current tutorial in CVS
In-Reply-To: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com>
References: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com>
Message-ID: <48FB5DAE.1050600@ribosome.natur.cuni.cz>

Hi Peter,

Peter wrote:
> Michiel wrote:
>> ... The new tutorial is in CVS; I put a copy of the HTML output
>> of the latest version at
>> http://biopython.org/DIST/docs/tutorial/Tutorial.new.html.
> 
> This also gives people a chance to look at the three plotting examples
> I added to the "Cookbook" section a couple of weeks back,
> 
> http://www.biopython.org/DIST/docs/tutorial/Tutorial.new.html#chapter:cookbook

  for those lazy would you please show how to save the generated plots into
e.g. jpg or .svg file?

Thanks, ;-)
Martin

From biopython at maubp.freeserve.co.uk  Sun Oct 19 12:34:46 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 19 Oct 2008 17:34:46 +0100
Subject: [BioPython] Current tutorial in CVS
In-Reply-To: <48FB5DAE.1050600@ribosome.natur.cuni.cz>
References: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com>
	<48FB5DAE.1050600@ribosome.natur.cuni.cz>
Message-ID: <320fb6e00810190934l432dd320ue8ff3a40fa497530@mail.gmail.com>

>
>  for those lazy would you please show how to save the generated plots into
> e.g. jpg or .svg file?

Instead or as well as pylab.show(), use pylab.savefig(...), for example:

pylab.savefig("dot_plot.png", dpi=75)
pylab.savefig("dot_plot.pdf")

On a related note - it looks like the pylab tutorial as moved, I'm
getting a 404 error on http://matplotlib.sourceforge.net/tutorial.html
now :(

It looks like http://matplotlib.sourceforge.net/api/pyplot_api.html is
the replacement.

Peter

From biopython at maubp.freeserve.co.uk  Sun Oct 19 14:15:59 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 19 Oct 2008 19:15:59 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <48FB57BD.7070705@ribosome.natur.cuni.cz>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
	<48FB57BD.7070705@ribosome.natur.cuni.cz>
Message-ID: <320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com>

On Sun, Oct 19, 2008 at 4:52 PM, Martin MOKREJ?
<mmokrejs at ribosome.natur.cuni.cz> wrote:
> Hi,
>  I have been away for 2 weeks but although late,

Untill we release a new biopython, its not too late to change the Seq
object's new methods.

> let me oppose that string.translate() is of use.
> Here is my current code:
> ...

Your code seems to be doing two things with the python string
translate() method:

(1) Using the deletechars argument (with an empty mapping) to look for
unexpected letters.  It took me a while to work out what your code was
doing - personally I would have used a python set for this, rather
than the string translate method.  Note also unicode strings don't
support the deletechars argument, and that python 3.0 removes the
deletechars argument from the string style objects.

(2) Using the translate mapping to switch "U" and "u" into "T" and "t"
to back transcribe RNA into DNA.  For this, Biopython already has a
Bio.Seq.back_transcribe function (which does work on strings), and in
CVS the Seq object gets a back_transcribe method too.  These do both
use the string translate method internally.

Neither of these operations convice me that the Seq object should
support the python string translate method.

Note that if you still need to use the python string translate method,
it is accessable by first turning the Seq object into a string (e.g.
str(my_seq).translate(mapping, delete_chars)), or as Michiel suggested
earlier, you could use the string module translate function on the Seq
object.

Also note that (as in your example using the string translate to do
back transcription) the translate method by its nature makes it
impossible to know if the original Seq object alphabet still applies
to the result.

Peter


From mmokrejs at ribosome.natur.cuni.cz  Sun Oct 19 14:28:38 2008
From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=)
Date: Sun, 19 Oct 2008 20:28:38 +0200
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>	
	<48FB57BD.7070705@ribosome.natur.cuni.cz>
	<320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com>
Message-ID: <48FB7C56.6010408@ribosome.natur.cuni.cz>

Peter,
  you are right in your points. I think the translate() trick
had some speed advantages over other approaches to zap unwanted
characters - I don't remember but if it is gonna break in future
python releases I will have to rewrite this anyway.
I just wanted to say I really do use the string translate
function and that it has use in bioinformatics as well. ;-)
  Still, I think the name clash is asking for disaster, but
overloading is a feature of python so it might be expected.
Do whatever you want. ;)
Cheers,
M.

Peter wrote:
> On Sun, Oct 19, 2008 at 4:52 PM, Martin MOKREJ?
> <mmokrejs at ribosome.natur.cuni.cz> wrote:
>> Hi,
>>  I have been away for 2 weeks but although late,
> 
> Untill we release a new biopython, its not too late to change the Seq
> object's new methods.
> 
>> let me oppose that string.translate() is of use.
>> Here is my current code:
>> ...
> 
> Your code seems to be doing two things with the python string
> translate() method:
> 
> (1) Using the deletechars argument (with an empty mapping) to look for
> unexpected letters.  It took me a while to work out what your code was
> doing - personally I would have used a python set for this, rather
> than the string translate method.  Note also unicode strings don't
> support the deletechars argument, and that python 3.0 removes the
> deletechars argument from the string style objects.
> 
> (2) Using the translate mapping to switch "U" and "u" into "T" and "t"
> to back transcribe RNA into DNA.  For this, Biopython already has a
> Bio.Seq.back_transcribe function (which does work on strings), and in
> CVS the Seq object gets a back_transcribe method too.  These do both
> use the string translate method internally.
> 
> Neither of these operations convice me that the Seq object should
> support the python string translate method.
> 
> Note that if you still need to use the python string translate method,
> it is accessable by first turning the Seq object into a string (e.g.
> str(my_seq).translate(mapping, delete_chars)), or as Michiel suggested
> earlier, you could use the string module translate function on the Seq
> object.
> 
> Also note that (as in your example using the string translate to do
> back transcription) the translate method by its nature makes it
> impossible to know if the original Seq object alphabet still applies
> to the result.


From biopython at maubp.freeserve.co.uk  Sun Oct 19 14:52:06 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 19 Oct 2008 19:52:06 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <48FB7C56.6010408@ribosome.natur.cuni.cz>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
	<48FB57BD.7070705@ribosome.natur.cuni.cz>
	<320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com>
	<48FB7C56.6010408@ribosome.natur.cuni.cz>
Message-ID: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com>

On Sun, Oct 19, 2008 at 7:28 PM, Martin MOKREJ?
<mmokrejs at ribosome.natur.cuni.cz> wrote:
> Peter,
>  you are right in your points. I think the translate() trick
> had some speed advantages over other approaches to
> zap unwanted characters ...

I haven't profiled this - you may be right.  On the other hand, using
the translate method in this way doesn't make the purpose of the code
obvious.

>- I don't remember but if it is gonna break in future
> python releases I will have to rewrite this anyway.

Certainly the deletechars argument seems to be gone in Python 3.0, but
you may not need to worry about that for a while.

> I just wanted to say I really do use the string translate
> function and that it has use in bioinformatics as well. ;-)

Using the string translate for (back)transcription is an obvious
example, but this is a special case that is already handled within
Biopython.

Does anyone have a non-transcription sequence example where the
mapping part of the translate method is actually used?

Using the string translate method just to remove characters is an
interesting one.  How common is this in typical python code?  I've
always used the string replace method (but usually I only want to
remove one character).  Maybe we should have a remove characters
method for the Seq object?  Here at least dealing with the alphabet is
fairly simple.  On another thread I'd suggested a "remove gaps" method
as a special case of this.

> Still, I think the name clash is asking for disaster, but
> overloading is a feature of python so it might be expected.
> Do whatever you want. ;)
> Cheers,
> M.

I'm still a tiny bit uneasy about the name clash myself... anyone else
what to join in the debate?

Peter


From biopython at maubp.freeserve.co.uk  Sun Oct 19 14:59:23 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 19 Oct 2008 19:59:23 +0100
Subject: [BioPython] Current tutorial in CVS
In-Reply-To: <320fb6e00810190934l432dd320ue8ff3a40fa497530@mail.gmail.com>
References: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com>
	<48FB5DAE.1050600@ribosome.natur.cuni.cz>
	<320fb6e00810190934l432dd320ue8ff3a40fa497530@mail.gmail.com>
Message-ID: <320fb6e00810191159j52bb78al4c38b1f7804c268f@mail.gmail.com>

Peter wrote:
> Marting wrote:
>> for those lazy would you please show how to save the generated
>> plots into e.g. jpg or .svg file?
>
> Instead or as well as pylab.show(), use pylab.savefig(...), for example:
>
> pylab.savefig("dot_plot.png", dpi=75)
> pylab.savefig("dot_plot.pdf")

I've added a note about this in the example in the CVS version of the Tutorial.

> On a related note - it looks like the pylab tutorial as moved, I'm
> getting a 404 error on http://matplotlib.sourceforge.net/tutorial.html
> now :(

I've updated this link to point at http://matplotlib.sourceforge.net/ instead
(which at the time of writing includes a quick summary of the pylab functions).

Peter

From tiagoantao at gmail.com  Mon Oct 20 01:41:56 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 20 Oct 2008 06:41:56 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
Message-ID: <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>

Hi,

On Sun, Oct 19, 2008 at 3:50 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> ok, thank you very much!!
> I would like to use git to keep track of the changes I will make to the
> code.
> What do you think if I'll upload it to http://github.com and then upload it
> back on biopython when it is finished?
> I am not sure, but I think it would be possible to convert the logs back to
> cvs to reintegrate the changes in biopython.

I think it is a good idea. When we reintegrate back I think there will
be no need to backport the commit logs anyway.


> One of the problems we are having here, is that it takes too much RAM memory
> to store all the information about characters for every population.
> I was going to write a Population object, in which I'll store only the total
> count of heterozygotes, individuals, and what is needed, instead of the
> information about characters (('a', 'a'), ('a', 'c'), ...)

I am afraid that this is not enough. Even for Fst. I suppose you are
acquainted with a formula with just heterozigosities. That is more of
just a textbook formula only.  The Fst standard estimator is really
Cockerham and Wier Theta estimator (1984 paper), and I think it needs
individual information (or at the very least allele counts). Check my
implementation of Fst, which should be it (less the bugs that are in).
Maybe my implementation of theta is wrong, which is a possiblity. But
theta is the standard.

May I do a suggestion for your problem? Split in SNP groups (like 100
at a time) and calculate 100 Fsts at time. Store the calculated Fsts
to disk and then join them at the end.

As a general rule, whatever goes into biopython has to be general
enough to accomodate all standard statistics (not just Fs). One cannot
make a solution that is taliored to solve just our personal research
issues.

I am currently traveling (which seems to be my constant state). When I
arrive back at office, on Wednsday, I will make a few suggestions on
how we can structure things. I have a few ideas that I would like to
share and discuss.

> class Marker:
>   total_heterozygotes_count = 0
>   total_population_count = 0
>   total_Purines_count = 0 # this could be renamed, of course
>   total_Pyrimidines_count = 0


Also, your representation seems to be targetted toward SNPs, people
use lots of other things (microsatellites are still used a lot). We
have to think about something that is useful to the general public.
Let me get back to you on Wednesday we ideas. If you are interested we
can work together to make a nice population genetics module that can
be used in a wide range of situations.

From lpritc at scri.ac.uk  Mon Oct 20 05:09:51 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Mon, 20 Oct 2008 10:09:51 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com>
Message-ID: <C522096F.17F00%lpritc@scri.ac.uk>

On 19/10/2008 19:52, "Peter" <biopython at maubp.freeserve.co.uk> wrote:

> I'm still a tiny bit uneasy about the name clash myself... anyone else
> what to join in the debate?

The problem domain for biological sequences implies a natural definition for
the application of 'translate' to a DNA/RNA sequence that is the translation
into protein sequence.  The string.translate() method is not consistent with
this natural use of the language of the problem domain.

I take Martin's point that there are valid uses for the string.translate()
method in bioinformatics and elsewhere, but I think that overloading
translate() is as valid here as overloading __mul__ would be for an
implementation of matrix algebra, or complex numbers.  For biological
sequences as much as for number types, I think the problem domain and
expected behaviour of the object being represented in code should take
precedence over emulation of an object type that was never intended to
provide the functionality required for a biological sequence.

I think also that if the string.translate() method is required, an explicit
call to string.translate() implies: "translate this biological sequence as
if it were a string, and not a biological sequence".  The converse
application of a Bio.translate() method would to me imply "translate this
biological sequence as if it were a biological sequence, and not a string";
which seems to me to defeat part of the purpose of representing the
biological sequence with its own object.

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________

From biopython at maubp.freeserve.co.uk  Mon Oct 20 05:22:39 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Oct 2008 10:22:39 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <C522096F.17F00%lpritc@scri.ac.uk>
References: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com>
	<C522096F.17F00%lpritc@scri.ac.uk>
Message-ID: <320fb6e00810200222s641e165eqef3b209893a8d976@mail.gmail.com>

Leighton wrote:
> Peter wrote:
>> I'm still a tiny bit uneasy about the name clash myself... anyone else
>> what to join in the debate?
>
> The problem domain for biological sequences implies a natural definition for
> the application of 'translate' to a DNA/RNA sequence that is the translation
> into protein sequence.  The string.translate() method is not consistent with
> this natural use of the language of the problem domain.
> ...

I thought that was well argued and nicely put.  Of course, someone is
still bound to try calling the translate method with a string mapping.
 Maybe we should add a bit of defensive code to check the table
argument, and print a helpful error message when this happens?  We
currently only expect the codon table argument to be an NCBI genetic
code table name or ID (string or integer).

Earlier I wrote:
>> In Biopython's CVS, the Seq object now has a translate method
>> which does a biological translation.  If anyone comes up with a
>> better proposal before the next release, we can still rename this.
>> Otherwise I will update the Tutorial in CVS shortly...

I have since updated the Tutorial in CVS to use the new transcribe,
back_transcribe and translate methods.  Maybe we should put an updated
"preview" online for comment?

Peter

From lpritc at scri.ac.uk  Mon Oct 20 05:38:10 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Mon, 20 Oct 2008 10:38:10 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <48F8DD7B.7010909@gmail.com>
Message-ID: <C5221012.17F0D%lpritc@scri.ac.uk>


On 17/10/2008 19:46, "Bruce Southey" <bsouthey at gmail.com> wrote:

> Leighton Pritchard wrote:
>> This is the key problem.  Forward translation is - for a given codon table -
>> a one-one mapping.  Reverse translation is (for many amino acids) one-many.
>> If the goal is to produce the coding sequence that actually encoded a
>> particular protein sequence, the problem is combinatorial and rapidly
>> becomes messy with increasing sequence length.
>>   
> If you use a regular expression or a tree structure then there is a
> one-one mapping but then that would probably best as a subclass of Seq.

I don't see this, I'm afraid.

Each codon -> one amino acid : one-one mapping
Arg -> set of 6 possible codons : one-many mapping

It doesn't matter how it's represented in code, the problem of a one-many
mapping still exists for amino acid -> codon translation in most cases.

The combinatorial nature of the overall problem can be illustrated by
considering the unlikely case of a protein that comprises 100 arginines.
The number of potential coding sequences is 6**100 = 6.5e77.  That you *can*
choose any one of these to be your potential coding sequence doesn't negate
the fact that there are still (6.5e77)-1 other possibilities... It doesn't
get much better if you use the the average number of codons per amino acid:
61/20 ~= 3.  A 100aa protein would typically have 3**100 ~= 5e47 potential
coding sequences.  I wouldn't want to guess which one was correct, and I
can't see a back_translate method in this instance doing more than producing
a nucleotide sequence that is potentially capable of producing the passed
protein sequence, but for which no claims can be made about biological
plausibility.

Now, a back_translate() that takes a protein sequence alignment and, when
passed the coding sequences for each component sequence, returns the
corresponding alignment of the nucleotide sequences, makes sense to me.  But
that's a discussion for Bio.Alignment objects...

> I would suggest tools like Wise2 and exonerate
> (http://www.ebi.ac.uk/~guy/exonerate/) are the solution to solving gene
> structure problems than using a Seq object.

I wouldn't suggest using a Seq object for this purpose, either... ;)

>> I agree - I can't think of an occasion where I might want to back-translate
>> a protein in this way that wouldn't better be handled by other means.  Not
>> that I'm the fount of all use-cases but, given the number of ways in which
>> one *could* back-translate, perhaps it would be better not to pick/guess at
>> any single one.
>>   
> Apart from the academic aspect, my main use is searching for protein
> motifs/domains, enzyme cleavage sites, finding very short combinations
> of amino acids and binding sites (I do not do this but it is the same)
> in DNA sequences especially genomic sequence. These are usually very
> small and, thus, unsuitable for most tools.

I do much the same, and haven't found a pressing use for back-translation,
yet - YMMV.

> One of my uses is with
> peptide identification and de novo sequencing using mass spectrometry
> when you don't know the actual protein or gene sequence. It also has the
> problem that certain amino acids have very similar mass so you would
> need to  Regardless of whether you use a regular expression query or not
> you still need a back translation of the protein query and probably the
> reverse complement.

Perhaps I'm being dense, but I don't see why that is.  Can you give an
example?

> Another case where it would be useful is that tools like TBLASTN gives
> protein alignments so you must open the DNA sequence and find the DNA
> region based on the protein alignment.

You could use TBLASTN output - which provides start and stop coordinates for
the match on the subject sequence - to extract this directly, without the
need for backtranslation.  Example output where subject coordinates give the
match location below:

>ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete
genome
          Length = 5064019

 Score =  731 bits (1887), Expect = 0.0
 Identities = 363/376 (96%), Positives = 363/376 (96%)
 Frame = +3

Query: 1      MFHXXXXXXXXXXXXXTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
60
              MFH             TISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
Sbjct: 477432 MFHLPKLKQKPLALLLTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
477611

[...]

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________

From dalloliogm at gmail.com  Mon Oct 20 09:57:27 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Mon, 20 Oct 2008 15:57:27 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
Message-ID: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>

On Mon, Oct 20, 2008 at 7:41 AM, Tiago Ant?o <tiagoantao at gmail.com> wrote:

> Hi,
>
> On Sun, Oct 19, 2008 at 3:50 PM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
> > ok, thank you very much!!
> > I would like to use git to keep track of the changes I will make to the
> > code.
> > What do you think if I'll upload it to http://github.com and then upload
> it
> > back on biopython when it is finished?
> > I am not sure, but I think it would be possible to convert the logs back
> to
> > cvs to reintegrate the changes in biopython.
>
> I think it is a good idea. When we reintegrate back I think there will
> be no need to backport the commit logs anyway.


Ok, I have uploaded the code to:
- http://github.com/dalloliogm/biopython---popgen

I put the code I wrote before writing in this mailing list in the folder
PopGen/Gio
-
http://github.com/dalloliogm/biopython---popgen/tree/6f6fa66cda1908dc8334ab6e9e69b7c85290a8be/src/PopGen/Gio
However, I plan to integrate these scripts with your code or re-write the
completely (well, your code is a lot better than mine :) ).

Just a curiosity: why do you use the '<>' operator instead of '!='?
Is it better supported in python 3.0?


> > One of the problems we are having here, is that it takes too much RAM
> memory
> > to store all the information about characters for every population.
> > I was going to write a Population object, in which I'll store only the
> total
> > count of heterozygotes, individuals, and what is needed, instead of the
> > information about characters (('a', 'a'), ('a', 'c'), ...)
>
> I am afraid that this is not enough. Even for Fst. I suppose you are
> acquainted with a formula with just heterozigosities.


Yes, I was trying to implement a very basic formula at first.


> That is more of
> just a textbook formula only.  The Fst standard estimator is really
> Cockerham and Wier Theta estimator (1984 paper)


Bioperl's Bio::PopGen::PopStats uses the same formula:
-
http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/PopGen/PopStats.html#POD3

"""
Bioperl's Bio:Based on diploid method in Weir BS, Genetics Data
Analysis II, 1996
           page 178.
"""

, and I think it needs
> individual information (or at the very least allele counts). Check my
> implementation of Fst, which should be it (less the bugs that are in).
> Maybe my implementation of theta is wrong, which is a possiblity. But
> theta is the standard.
>
> May I do a suggestion for your problem? Split in SNP groups (like 100
> at a time) and calculate 100 Fsts at time. Store the calculated Fsts
> to disk and then join them at the end.
>

Thanks - that's a good suggestion


>
>
> I am currently traveling (which seems to be my constant state). When I
> arrive back at office, on Wednsday, I will make a few suggestions on
> how we can structure things. I have a few ideas that I would like to
> share and discuss.
>

Have a nice trip!


>
> > class Marker:
> >   total_heterozygotes_count = 0
> >   total_population_count = 0
> >   total_Purines_count = 0 # this could be renamed, of course
> >   total_Pyrimidines_count = 0
>
>
> Also, your representation seems to be targetted toward SNPs, people
> use lots of other things (microsatellites are still used a lot). We
> have to think about something that is useful to the general public.
> Let me get back to you on Wednesday we ideas. If you are interested we
> can work together to make a nice population genetics module that can
> be used in a wide range of situations.
>

Yes, I agree. It was just a first try. We should collect some good
use-cases.


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Mon Oct 20 10:04:02 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Oct 2008 15:04:02 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
Message-ID: <320fb6e00810200704g578ec3aak6b9df1a5a90a2fc7@mail.gmail.com>

On Mon, Oct 20, 2008 at 2:57 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Just a curiosity: why do you use the '<>' operator instead of '!='?
> Is it better supported in python 3.0?

Python 2.x supports both <> and != for not equal, and people use both
depending on their personal preference (or exposure to other
languages).  Most Biopython code used to use <> which I personally do
by habit.

Python 3.x supports only != so I have recently gone through Biopython
in CVS switching all the <> to != instead.

I would recommend you use != in all new python code.

Peter

From biopython at maubp.freeserve.co.uk  Mon Oct 20 10:23:23 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Oct 2008 15:23:23 +0100
Subject: [BioPython] Bio.AlignIO feedback - seq_count
Message-ID: <320fb6e00810200723p2fcbe12ey125dd1fd67d195a7@mail.gmail.com>

Dear Biopythoneers,

I'm hoping some of you on the mailing list have actually used
Bio.AlignIO, and I'd like to ask for some feedback.

In particular, when loading in sequence files, did you ever use the
optional seq_count argument to declare how many sequences you expected
in each alignment?  The rational of this optional argument is
discussed in the Tutorial,
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:AlignIO-count-argument

I'm curious if anyone actually found this useful in real life.

Thanks

Peter

From bsouthey at gmail.com  Tue Oct 21 10:13:15 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Tue, 21 Oct 2008 09:13:15 -0500
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <C5221012.17F0D%lpritc@scri.ac.uk>
References: <C5221012.17F0D%lpritc@scri.ac.uk>
Message-ID: <48FDE37B.5040301@gmail.com>

Leighton Pritchard wrote:
> On 17/10/2008 19:46, "Bruce Southey" <bsouthey at gmail.com> wrote:
>
>   
>> Leighton Pritchard wrote:
>>     
>>> This is the key problem.  Forward translation is - for a given codon table -
>>> a one-one mapping.  Reverse translation is (for many amino acids) one-many.
>>> If the goal is to produce the coding sequence that actually encoded a
>>> particular protein sequence, the problem is combinatorial and rapidly
>>> becomes messy with increasing sequence length.
>>>   
>>>       
>> If you use a regular expression or a tree structure then there is a
>> one-one mapping but then that would probably best as a subclass of Seq.
>>     
>
> I don't see this, I'm afraid.
>
> Each codon -> one amino acid : one-one mapping
> Arg -> set of 6 possible codons : one-many mapping
>   
If you believed this then your answer below is incorrect. The genetic 
code allow for 1 amino acid to map to a three nucleotides but not any 
three nor any more or any less than three. So to be clear there is a one 
to one mapping between a codon and amino acid as well amino acid and a 
codon. Therefore it is impossible for Arg to map to six possible codons 
as only one is correct. Under the standard genetic code, each amino acid 
can be represented in an regular expression either as the bases or 
ambiguous nucleotide codes:
Ala/A =(GCT|GCC|GCA|GCG) = GCN
Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR)
Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR)
Lys/K =(AAA|AAG) = AAR
Asn/N =(AAT|AAC) =AAY
Met/M =ATG =ATG
Asp/D =(GAT|GAC) =GAY
Phe/F =(TTT|TTC) =TTY
Cys/C =(TGT|TGC) =TGY
Pro/P =(CCT|CCC|CCA|CCG) =CCN
Gln/Q =(CAA|CAG) =CAR
Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY)
Glu/E =(GAA|GAG) = GAR
Thr/T =(ACT|ACC|ACA|ACG)  =ACN
Gly/G =(GGT|GGC|GGA|GGG) =GGN
Trp/W =TGG  =TGG
His/H =(CAT|CAC)  = CAY
Tyr/Y =(TAT|TAC) = TAY
Ile/I =(ATT|ATC|ATA) =ATH
Val/V =(GTT|GTC|GTA|GTG) =GTN

This is still a one to one mapping between an amino acid and regular 
expression relationship of the triplet that encodes it. Unfortunately 
the ambiguous nucleotide codes can not be used directly in a regular 
expression search.


> It doesn't matter how it's represented in code, the problem of a one-many
> mapping still exists for amino acid -> codon translation in most cases.
>
> The combinatorial nature of the overall problem can be illustrated by
> considering the unlikely case of a protein that comprises 100 arginines.
> The number of potential coding sequences is 6**100 = 6.5e77.  That you *can*
> choose any one of these to be your potential coding sequence doesn't negate
> the fact that there are still (6.5e77)-1 other possibilities... It doesn't
> get much better if you use the the average number of codons per amino acid:
> 61/20 ~= 3.  A 100aa protein would typically have 3**100 ~= 5e47 potential
> coding sequences.  I wouldn't want to guess which one was correct, and I
> can't see a back_translate method in this instance doing more than producing
> a nucleotide sequence that is potentially capable of producing the passed
> protein sequence, but for which no claims can be made about biological
> plausibility.
>   
You are not representing the one to six mapping you indicated above as 
sequence is composed of 300 nucleotides not 1800 as must occur with a 
one to 6 codon mapping. Rather you have provided the number of 
combinations of the six codons that can give you 100 Args based on a one 
to one mapping of one codon to one Arg.  If you use ambiguous nucleotide 
codes, you can reduce it down to 1.267651e+30 potential coding sequences 
for 100 amino acids as a worst case scenario.

It is not my position to argue what a user wants or how stupid I think 
that the request is. The user would quickly learn.
> Now, a back_translate() that takes a protein sequence alignment and, when
> passed the coding sequences for each component sequence, returns the
> corresponding alignment of the nucleotide sequences, makes sense to me.  But
> that's a discussion for Bio.Alignment objects...
>
>   
>> I would suggest tools like Wise2 and exonerate
>> (http://www.ebi.ac.uk/~guy/exonerate/) are the solution to solving gene
>> structure problems than using a Seq object.
>>     
>
> I wouldn't suggest using a Seq object for this purpose, either... ;)
>
>   
>>> I agree - I can't think of an occasion where I might want to back-translate
>>> a protein in this way that wouldn't better be handled by other means.  Not
>>> that I'm the fount of all use-cases but, given the number of ways in which
>>> one *could* back-translate, perhaps it would be better not to pick/guess at
>>> any single one.
>>>   
>>>       
>> Apart from the academic aspect, my main use is searching for protein
>> motifs/domains, enzyme cleavage sites, finding very short combinations
>> of amino acids and binding sites (I do not do this but it is the same)
>> in DNA sequences especially genomic sequence. These are usually very
>> small and, thus, unsuitable for most tools.
>>     
>
> I do much the same, and haven't found a pressing use for back-translation,
> yet - YMMV.
>
>   
>> One of my uses is with
>> peptide identification and de novo sequencing using mass spectrometry
>> when you don't know the actual protein or gene sequence. It also has the
>> problem that certain amino acids have very similar mass so you would
>> need to  Regardless of whether you use a regular expression query or not
>> you still need a back translation of the protein query and probably the
>> reverse complement.
>>     
>
> Perhaps I'm being dense, but I don't see why that is.  Can you give an
> example?
>   
Isoleucine and Leucine are the worst case (there are a couple of others 
that are close) because these have the same mass so you have to search for:
(TTA|TTG|CTT|CTC|CTA|CTG|ATT|ATC|ATA)

If you are searching say for an RFamide, you know that you need at least 
RFG, which means you need to do a query using regular expression on the 
plus strand using:
(CGT|CGC|CGA|CGG|AGA|AGG)(TTT|TTC)(GGT|GGC|GGA|GGG)

You then try to extend the match to more amino acids until you reach the 
desired mass (hopefully avoiding any introns) or sufficiently that you 
can use some other tool to help.
>   
>> Another case where it would be useful is that tools like TBLASTN gives
>> protein alignments so you must open the DNA sequence and find the DNA
>> region based on the protein alignment.
>>     
>
> You could use TBLASTN output - which provides start and stop coordinates for
> the match on the subject sequence - to extract this directly, without the
> need for backtranslation.  Example output where subject coordinates give the
> match location below:
>
>   
>> ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete
>>     
> genome
>           Length = 5064019
>
>  Score =  731 bits (1887), Expect = 0.0
>  Identities = 363/376 (96%), Positives = 363/376 (96%)
>  Frame = +3
>
> Query: 1      MFHXXXXXXXXXXXXXTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
> 60
>               MFH             TISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
> Sbjct: 477432 MFHLPKLKQKPLALLLTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
> 477611
>
> [...]
>
> L.
>
>   
Exactly my point, where is the DNA sequence? Only if you have direct 
access to the DNA sequence can you get it. Furthermore, the DNA sequence 
must be exactly the same because any change in the coordinates screws it 
up.

Bruce


From biopython at maubp.freeserve.co.uk  Tue Oct 21 10:26:49 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Oct 2008 15:26:49 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
	Seq object?
In-Reply-To: <48FDE37B.5040301@gmail.com>
References: <C5221012.17F0D%lpritc@scri.ac.uk> <48FDE37B.5040301@gmail.com>
Message-ID: <320fb6e00810210726g466292e4h3e8fe053d9107f48@mail.gmail.com>

Bruce wrote:
> Leighton wrote:
>> Each codon -> one amino acid : one-one mapping
>> Arg -> set of 6 possible codons : one-many mapping

I agree with Leighton.

> If you believed this then your answer below is incorrect.

No, I think you are just not using the terms one-to-one and
one-to-many as a mathematician would.

> The genetic code
> allow for 1 amino acid to map to a three nucleotides but not any three nor
> any more or any less than three. So to be clear there is a one to one
> mapping between a codon and amino acid as well amino acid and a codon.
> Therefore it is impossible for Arg to map to six possible codons as only one
> is correct. Under the standard genetic code, each amino acid can be
> represented in an regular expression either as the bases or ambiguous
> nucleotide codes:
> Ala/A =(GCT|GCC|GCA|GCG) = GCN

That is a one to four mapping using unambiguous nucleotides, or a one
to one mapping using ambiguous nucleotides.  This is a nice case.

> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR)

That is a one to six mapping using unambiguous nucleotides, or a one
to two mapping using ambiguous nucleotides.  This is a problem case.

> This is still a one to one mapping between an amino acid and regular
> expression relationship of the triplet that encodes it. Unfortunately the
> ambiguous nucleotide codes can not be used directly in a regular expression
> search.

The problem is that (TTN|CTR) or similar don't work in Seq objects -
would need a more advanced representation (perhaps based on regular
expressions).

Peter

From biopython at maubp.freeserve.co.uk  Tue Oct 21 10:45:57 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Oct 2008 15:45:57 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
	Seq object?
In-Reply-To: <48FDE37B.5040301@gmail.com>
References: <C5221012.17F0D%lpritc@scri.ac.uk> <48FDE37B.5040301@gmail.com>
Message-ID: <320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com>

Bruce wrote:
>>> Another case where it would be useful is that tools like TBLASTN gives
>>> protein alignments so you must open the DNA sequence and find the DNA
>>> region based on the protein alignment.

Leighton:
>> You could use TBLASTN output - which provides start and stop coordinates
>> for the match on the subject sequence - to extract this directly, without the
>> need for backtranslation.  Example output where subject coordinates give
>> the match location below:
>>
>>>
>>> ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete
>>>
>>
>> genome
>>          Length = 5064019
>>
>>  Score =  731 bits (1887), Expect = 0.0
>>  Identities = 363/376 (96%), Positives = 363/376 (96%)
>>  Frame = +3
>>
>> Query: 1      MFHXXXXXXXXXXXXXTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
>> 60
>>              MFH             TISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
>> Sbjct: 477432 MFHLPKLKQKPLALLLTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
>> 477611
>>
>> [...]

Bruce's reply:
> Exactly my point, where is the DNA sequence? Only if you have direct access
> to the DNA sequence can you get it. Furthermore, the DNA sequence must be
> exactly the same because any change in the coordinates screws it up.

You should have the original query from when you ran the BLAST
search, so using the co-ordinates given in the BLAST hit you can
recover the original nucleotide query which gives this match.

There is no reason to do a back-translation to try and find the original
query, which would be especially difficult in this example due to the
XXXXXX region (representing a region of low complexity which was
ignored by BLAST).  Even if you tried you could find more than one
match and without checking the the coordinates BLAST gives it would
not be clear which gave this BLAST match.

Peter

From lpritc at scri.ac.uk  Tue Oct 21 11:29:35 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 21 Oct 2008 16:29:35 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <48FDE37B.5040301@gmail.com>
Message-ID: <C523B3EF.18072%lpritc@scri.ac.uk>

Hi Bruce,

On 21/10/2008 15:13, "Bruce Southey" <bsouthey at gmail.com> wrote:
> Leighton Pritchard wrote:
>> I don't see this, I'm afraid.
>> 
>> Each codon -> one amino acid : one-one mapping
>> Arg -> set of 6 possible codons : one-many mapping
>>   
> If you believed this then your answer below is incorrect. The genetic
> code allow for 1 amino acid to map to a three nucleotides but not any
> three nor any more or any less than three.

I'm fine with this bit.  Each such set of three nucleotides is called a
'codon'.  Six such codons are able to code for an arginine, as you note:

> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG)

This is a one -> six mapping.  That is, one input (arginine), is capable of
being back-translated into any of six possible outputs (CGT, CGC, CGA, CGG,
AGA, or AGG).  

but you contradict this with the comment:

> So to be clear there is a one
> to one mapping between a codon and amino acid as well amino acid and a
> codon. Therefore it is impossible for Arg to map to six possible codons

I think that you're confusing the biological fact (only one codon actually
encoded this amino acid) with the back-translation problem (in the absence
of any other information, any one of six codons is equally likely to have
encoded this amino acid).

---
 
> This is still a one to one mapping between an amino acid and regular
> expression relationship of the triplet that encodes it.

Which is not the claim that I was making.  There are any number of ways of
forcing a one-one mapping of this sort.  You could arguably represent it as
a one-to-one mapping of 'arginine -> "the backtranslation of arginine"', but
that would not be informative in reconstructing the actual coding sequence
(if that was what you wanted - which is the point of the discussion: what is
the point of a back_translate() method?).  The regular expression mapping is
not useful for this, either.

> You are not representing the one to six mapping you indicated above as
> sequence is composed of 300 nucleotides not 1800 as must occur with a
> one to 6 codon mapping [...]

I think you've misunderstood what's going on here.

Imagine a reduced system, where there is only one amino acid - let's call it
A - and there are two possible codons that can produce this amino acid - XXX
and YYY (thanks, Coldplay).

Now, if we have a 'sequence' of only one amino acid: 'A', that might have
been encoded by the sequence 'XXX', or the sequence 'YYY'.  The sequence
that coded for 'A' is one of 'XXX' or 'YYY', and we don't know which; there
are two possibilities, therefore this is a 1->2 mapping.  2=2**1.  Note that
the nucleotide sequence is 3*1=3 long.

But if our sequence has two amino acids: 'AA', this could have been the
result of 'XXXXXX', 'XXXYYY', 'YYYXXX', or 'YYYYYY'.  The coding sequence is
one of four equally likely possibilities, and this is a 1->4 mapping (one
sequence, four possible outcomes).  4=2**2, and the nucleotide sequence is
3*2 long.

If we build longer sequences, we find that the number of potential outcomes
is 2**n, where n is the number of 'A's in the input sequence, and the
mapping is 1->2**n.  The nucleotide sequence is 3*n long.

If we make this more general, where there are m codons for this amino acid,
the number of potential outcomes is m**n, and the mapping is 1->m**n.  The
nucleotide sequence is, again, 3*n long.

In my previous example for arginine, m=6, n=100, the mapping is 1->6, and
the sequence is 300nt long, *not* 1800 nt long.  There are still 6e77 ways
of encoding a sequence of 100 arginines.  A back_translate() method that
pretends to find the 'correct' coding sequence in the absence of other
information, rather than 'a' coding sequence, is not making a plausible
claim.

> It is not my position to argue what a user wants or how stupid I think
> that the request is. The user would quickly learn.

While it is entirely possible to implement a function called
back_translate() that does something a user doesn't want or need, I'm not
sure that it's the approach that should be taken, here.

It is your position to argue what you want or need out of a back_translate()
method, and why, so that other people can see your point of view, and maybe
be swayed by it.  I don't see a use for such a method, even to produce a
regular expression for searching nucleotide sequences, because TBLASTN is so
much more efficient.

> Isoleucine and Leucine are the worst case (there are a couple of others
> that are close) because these have the same mass so you have to search for:
> (TTA|TTG|CTT|CTC|CTA|CTG|ATT|ATC|ATA)
> 
> If you are searching say for an RFamide, you know that you need at least
> RFG, which means you need to do a query using regular expression on the
> plus strand using:
> (CGT|CGC|CGA|CGG|AGA|AGG)(TTT|TTC)(GGT|GGC|GGA|GGG)
> 
> You then try to extend the match to more amino acids until you reach the
> desired mass (hopefully avoiding any introns) or sufficiently that you
> can use some other tool to help.

I think that, in your position, I'd compare timings with a six-frame,
three-frame or forward translation of (depending on the nature of the
nucleotide sequence) the nucleotide sequence you're searching against, and
then use a regular expression or string search with the protein sequence as
the query.  That's likely to be significantly faster than a regex search
with that many groups, with the effects more noticeable at larger query
sequence lengths; particularly so if you cache or save the translated
sequences for future searches.
   
>>> Another case where it would be useful is that tools like TBLASTN gives
>>> protein alignments so you must open the DNA sequence and find the DNA
>>> region based on the protein alignment.
>> You could use TBLASTN output - which provides start and stop coordinates for
>> the match on the subject sequence - to extract this directly, without the
>> need for backtranslation.

> Exactly my point, where is the DNA sequence?

It's in the database against which you queried; TBLASTN queries against
nucleotide databases.  Wait, that's not quite right - TBLASTN translates
nucleotide databases into protein databases and queries against them with
the protein sequence, partly because of the one-many mapping of
back-translation.

If the database is local, you can use fastacmd (part of BLAST) to dump the
entire database, to retrieve the single matching sequence from the database,
or even to extract only the region of the sequence that is the match.  Try
fastacmd --help at the command-line.

If your database is not local, you can (probably) obtain the sequence by
querying GenBank with the accession number.  If you can't do that, or ask
the people who compiled the database you're querying against, or if they
won't let you have the sequence, then you're stuck with guessing the coding
sequence.

> Only if you have direct access to the DNA sequence can you get it.

That's not true; fastacmd can extract FASTA-formatted sequences from any
(version number compatibilities notwithstanding) correctly-formatted BLAST
database.

> Furthermore, the DNA sequence
> must be exactly the same because any change in the coordinates screws it
> up.

I don't see how that is a great concern.  The coordinates of the match would
come from the same database you were searching, so should match.  If your
database is up-to-date, and you have to go to GenBank, then you should have
the most recent revision of the sequence in there, anyway.

Even if both of the above options fail, and you can acquire the new sequence
by some accession identifier, you can build a new local database from that
sequence alone, and find where the match is.  Or translate and search
directly in Python.

If you truly have no access to the DNA sequence (e.g. if it's proprietary
information, you can't access the BLAST database, and no-one will send you
the sequence) then, and only then, are you stuck with guessing the coding
sequence in *very* large parameter space.

Best,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________

From biopython at maubp.freeserve.co.uk  Tue Oct 21 11:59:00 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Oct 2008 16:59:00 +0100
Subject: [BioPython] back-translation method for Seq object?
In-Reply-To: <320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com>
References: <C5221012.17F0D%lpritc@scri.ac.uk> <48FDE37B.5040301@gmail.com>
	<320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com>
Message-ID: <320fb6e00810210859n1b922e7emd6a7456abd79cdc7@mail.gmail.com>

On Tue, Oct 21, 2008 at 3:45 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Bruce wrote:
>>>> Another case where it would be useful is that tools like TBLASTN gives
>>>> protein alignments so you must open the DNA sequence and find the DNA
>>>> region based on the protein alignment.
>
> Leighton:
>>> You could use TBLASTN output - which provides start and stop coordinates
>>> for the match on the subject sequence - to extract this directly, without the
>>> need for backtranslation.  Example output where subject coordinates give
>>> the match location below:
>>> ...
>
> Bruce's reply:
>> Exactly my point, where is the DNA sequence? Only if you have direct access
>> to the DNA sequence can you get it. Furthermore, the DNA sequence must be
>> exactly the same because any change in the coordinates screws it up.
>
> You should have the original query from when you ran the BLAST
> search, so using the co-ordinates given in the BLAST hit you can
> recover the original nucleotide query which gives this match.

Sorry - I was thinking of the wrong variant of BLAST.  As Leighton
pointed out, you would have to use fastacmd to extract the nucleotide
sequence of the match from the blast database (assuming you were
running stand alone blastall) or fetch it via its accession (if you
were running BLAST via the NCBI).

Peter

From biopython at maubp.freeserve.co.uk  Tue Oct 21 12:07:46 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Oct 2008 17:07:46 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
	Seq object?
In-Reply-To: <C523B3EF.18072%lpritc@scri.ac.uk>
References: <48FDE37B.5040301@gmail.com> <C523B3EF.18072%lpritc@scri.ac.uk>
Message-ID: <320fb6e00810210907s328c8f23ra007e14f1f32f5c9@mail.gmail.com>

Hi everyone,

I think we all agree that if we want a back-translation
method/function to return a simple string or Seq object (given no
additional information about the codon use), this cannot fully capture
all the possible codons.

If we want to provide a simple string or Seq object, we can either
pick an arbitrary codon in each case (as in the first attachment on
Bug 2618), or perhaps represent some of the possible codons using
ambiguous nucleotides.

e.g.
back_translate("MR") = "ATGCGT" #arbitrary codon for R unambiguous nucleotides

or,
back_translate("MR") = "ATGCGN" #arbitrary codon for R using ambiguous
nucleotides

Note in either example, the following nice property holds:
translate(back_translate("MR")) == "MR"

Even if improved by typical codon usage figures to give a more
biologically likely answer, neither of these simple approaches covers
the full set of six possible codons for Arg in the standard codon
table.

It was something like this that I envisioned as a candidate for a Seq
method (based on the behaviour of the existing Bio.Translate
functionality), but only if such a simple back_translate
method/function had any real uses.  And thus far, I haven't seen any.

A back translation method/function which dealt with all the possible
codon choices would have to use a more advanced representation
(possibly as Bruce suggested using regular expressions or some sort of
tree structure - ideally as a sub-class of the Seq object).  There is
also the option of returning multiple simple strings or Seq objects
(either as a list or preferable a generator) giving all possible back
translations, but I don't think this would be useful, except perhaps
on small examples, due to the potentially vast number of return
values.

Peter

From bsouthey at gmail.com  Tue Oct 21 15:46:58 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Tue, 21 Oct 2008 14:46:58 -0500
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <C523B3EF.18072%lpritc@scri.ac.uk>
References: <C523B3EF.18072%lpritc@scri.ac.uk>
Message-ID: <48FE31B2.8030509@gmail.com>

Leighton Pritchard wrote:
> Hi Bruce,
>
> On 21/10/2008 15:13, "Bruce Southey" <bsouthey at gmail.com> wrote:
>   
>> Leighton Pritchard wrote:
>>     
>>> I don't see this, I'm afraid.
>>>
>>> Each codon -> one amino acid : one-one mapping
>>> Arg -> set of 6 possible codons : one-many mapping
>>>   
>>>       
>> If you believed this then your answer below is incorrect. The genetic
>> code allow for 1 amino acid to map to a three nucleotides but not any
>> three nor any more or any less than three.
>>     
>
> I'm fine with this bit.  Each such set of three nucleotides is called a
> 'codon'.  Six such codons are able to code for an arginine, as you note:
>
>   
>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG)
>>     
>
> This is a one -> six mapping.  That is, one input (arginine), is capable of
> being back-translated into any of six possible outputs (CGT, CGC, CGA, CGG,
> AGA, or AGG).  
>
> but you contradict this with the comment:
>
>   
>> So to be clear there is a one
>> to one mapping between a codon and amino acid as well amino acid and a
>> codon. Therefore it is impossible for Arg to map to six possible codons
>>     
>
> I think that you're confusing the biological fact (only one codon actually
> encoded this amino acid) with the back-translation problem (in the absence
> of any other information, any one of six codons is equally likely to have
> encoded this amino acid).
>
> ---
>  
>   
>> This is still a one to one mapping between an amino acid and regular
>> expression relationship of the triplet that encodes it.
>>     
>
> Which is not the claim that I was making.  There are any number of ways of
> forcing a one-one mapping of this sort.  You could arguably represent it as
> a one-to-one mapping of 'arginine -> "the backtranslation of arginine"', but
> that would not be informative in reconstructing the actual coding sequence
> (if that was what you wanted - which is the point of the discussion: what is
> the point of a back_translate() method?).  The regular expression mapping is
> not useful for this, either.
>
>   
>> You are not representing the one to six mapping you indicated above as
>> sequence is composed of 300 nucleotides not 1800 as must occur with a
>> one to 6 codon mapping [...]
>>     
>
> I think you've misunderstood what's going on here.
>
> Imagine a reduced system, where there is only one amino acid - let's call it
> A - and there are two possible codons that can produce this amino acid - XXX
> and YYY (thanks, Coldplay).
>
> Now, if we have a 'sequence' of only one amino acid: 'A', that might have
> been encoded by the sequence 'XXX', or the sequence 'YYY'.  The sequence
> that coded for 'A' is one of 'XXX' or 'YYY', and we don't know which; there
> are two possibilities, therefore this is a 1->2 mapping.  2=2**1.  Note that
> the nucleotide sequence is 3*1=3 long.
>
> But if our sequence has two amino acids: 'AA', this could have been the
> result of 'XXXXXX', 'XXXYYY', 'YYYXXX', or 'YYYYYY'.  The coding sequence is
> one of four equally likely possibilities, and this is a 1->4 mapping (one
> sequence, four possible outcomes).  4=2**2, and the nucleotide sequence is
> 3*2 long.
>
> If we build longer sequences, we find that the number of potential outcomes
> is 2**n, where n is the number of 'A's in the input sequence, and the
> mapping is 1->2**n.  The nucleotide sequence is 3*n long.
>
> If we make this more general, where there are m codons for this amino acid,
> the number of potential outcomes is m**n, and the mapping is 1->m**n.  The
> nucleotide sequence is, again, 3*n long.
>
> In my previous example for arginine, m=6, n=100, the mapping is 1->6, and
> the sequence is 300nt long, *not* 1800 nt long.  There are still 6e77 ways
> of encoding a sequence of 100 arginines.  A back_translate() method that
> pretends to find the 'correct' coding sequence in the absence of other
> information, rather than 'a' coding sequence, is not making a plausible
> claim.
>
>   
Thank you for agreeing with me! I am glad that you realized that the 
genetic code prevents a true one to many relationship. In say relational 
databases where you can have one table for the journal issue and one 
table for the papers in it, you can get multiple papers in a single 
issue. Likewise, if we ignore the genetic code, there is one amino acid 
and one or more codons. However, the genetic code means that you only 
can select one of all the codons possible resulting in multiple 
combinations of one to one relationships.
>> It is not my position to argue what a user wants or how stupid I think
>> that the request is. The user would quickly learn.
>>     
>
> While it is entirely possible to implement a function called
> back_translate() that does something a user doesn't want or need, I'm not
> sure that it's the approach that should be taken, here.
>
> It is your position to argue what you want or need out of a back_translate()
> method, and why, so that other people can see your point of view, and maybe
> be swayed by it.  I don't see a use for such a method, even to produce a
> regular expression for searching nucleotide sequences, because TBLASTN is so
> much more efficient.
>   
This very much depends on how you want to use it. TBLASTN is not very 
good for very short sequences and can not handle protein domains/motifs 
such as those in Prosite.

>   
>> Isoleucine and Leucine are the worst case (there are a couple of others
>> that are close) because these have the same mass so you have to search for:
>> (TTA|TTG|CTT|CTC|CTA|CTG|ATT|ATC|ATA)
>>
>> If you are searching say for an RFamide, you know that you need at least
>> RFG, which means you need to do a query using regular expression on the
>> plus strand using:
>> (CGT|CGC|CGA|CGG|AGA|AGG)(TTT|TTC)(GGT|GGC|GGA|GGG)
>>
>> You then try to extend the match to more amino acids until you reach the
>> desired mass (hopefully avoiding any introns) or sufficiently that you
>> can use some other tool to help.
>>     
>
> I think that, in your position, I'd compare timings with a six-frame,
> three-frame or forward translation of (depending on the nature of the
> nucleotide sequence) the nucleotide sequence you're searching against, and
> then use a regular expression or string search with the protein sequence as
> the query.  That's likely to be significantly faster than a regex search
> with that many groups, with the effects more noticeable at larger query
> sequence lengths; particularly so if you cache or save the translated
> sequences for future searches.
>   
Thanks for the comments as I did not think about reusing the translation.
>    
>   
>>>> Another case where it would be useful is that tools like TBLASTN gives
>>>> protein alignments so you must open the DNA sequence and find the DNA
>>>> region based on the protein alignment.
>>>>         
>>> You could use TBLASTN output - which provides start and stop coordinates for
>>> the match on the subject sequence - to extract this directly, without the
>>> need for backtranslation.
>>>       
>
>   
>> Exactly my point, where is the DNA sequence?
>>     
>
> It's in the database against which you queried; TBLASTN queries against
> nucleotide databases.  Wait, that's not quite right - 
No, it is not even correct! :-)

> TBLASTN translates
> nucleotide databases into protein databases and queries against them with
> the protein sequence, partly because of the one-many mapping of
> back-translation.
>   
Not exactly as stop codons are not in protein databases except where 
they code for an amino acid.

> If the database is local, you can use fastacmd (part of BLAST) to dump the
> entire database, to retrieve the single matching sequence from the database,
> or even to extract only the region of the sequence that is the match.  Try
> fastacmd --help at the command-line.
>
> If your database is not local, you can (probably) obtain the sequence by
> querying GenBank with the accession number.  If you can't do that, or ask
> the people who compiled the database you're querying against, or if they
> won't let you have the sequence, then you're stuck with guessing the coding
> sequence.
>
>   
>> Only if you have direct access to the DNA sequence can you get it.
>>     
>
> That's not true; fastacmd can extract FASTA-formatted sequences from any
> (version number compatibilities notwithstanding) correctly-formatted BLAST
> database.
>
>   
Obviously because you still have direct access to the DNA sequence.
>> Furthermore, the DNA sequence
>> must be exactly the same because any change in the coordinates screws it
>> up.
>>     
>
> I don't see how that is a great concern.  The coordinates of the match would
> come from the same database you were searching, so should match.  If your
> database is up-to-date, and you have to go to GenBank, then you should have
> the most recent revision of the sequence in there, anyway.
>
> Even if both of the above options fail, and you can acquire the new sequence
> by some accession identifier, you can build a new local database from that
> sequence alone, and find where the match is.  Or translate and search
> directly in Python.
>   

These were some of the things that one was trying to avoid, especially 
repeating it all over again and hoping like crazy that it is still 
present. (Genome assemblies are not very forgiving.)
> If you truly have no access to the DNA sequence (e.g. if it's proprietary
> information, you can't access the BLAST database, and no-one will send you
> the sequence) then, and only then, are you stuck with guessing the coding
> sequence in *very* large parameter space.
>
> Best,
>
> L.
>
>   
Bruce


From bsouthey at gmail.com  Tue Oct 21 16:36:31 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Tue, 21 Oct 2008 15:36:31 -0500
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <320fb6e00810210907s328c8f23ra007e14f1f32f5c9@mail.gmail.com>
References: <48FDE37B.5040301@gmail.com> <C523B3EF.18072%lpritc@scri.ac.uk>
	<320fb6e00810210907s328c8f23ra007e14f1f32f5c9@mail.gmail.com>
Message-ID: <48FE3D4F.6060005@gmail.com>

Peter wrote:
> Hi everyone,
>
> I think we all agree that if we want a back-translation
> method/function to return a simple string or Seq object (given no
> additional information about the codon use), this cannot fully capture
> all the possible codons.
>   


For completeness as these are not 100% correct,
Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN

Ser is really so bad that one would suggest providing a strong warning 
and just use NTN, NGN, and NNN for Leu, Arg and Ser, respectively.


> If we want to provide a simple string or Seq object, we can either
> pick an arbitrary codon in each case (as in the first attachment on
> Bug 2618), or perhaps represent some of the possible codons using
> ambiguous nucleotides.
>
> e.g.
> back_translate("MR") = "ATGCGT" #arbitrary codon for R unambiguous nucleotides
>
> or,
> back_translate("MR") = "ATGCGN" #arbitrary codon for R using ambiguous
> nucleotides
>
> Note in either example, the following nice property holds:
> translate(back_translate("MR")) == "MR"
>
> Even if improved by typical codon usage figures to give a more
> biologically likely answer, neither of these simple approaches covers
> the full set of six possible codons for Arg in the standard codon
> table.
>
> It was something like this that I envisioned as a candidate for a Seq
> method (based on the behaviour of the existing Bio.Translate
> functionality), but only if such a simple back_translate
> method/function had any real uses.  And thus far, I haven't seen any.
>   
For you perhaps but my reasons are very real to me!
> A back translation method/function which dealt with all the possible
> codon choices would have to use a more advanced representation
> (possibly as Bruce suggested using regular expressions or some sort of
> tree structure - ideally as a sub-class of the Seq object).  There is
> also the option of returning multiple simple strings or Seq objects
> (either as a list or preferable a generator) giving all possible back
> translations, but I don't think this would be useful, except perhaps
> on small examples, due to the potentially vast number of return
> values.
>
> Peter
>
>   
In any situation, we are left with a ambiguous codons, a regular 
expression or some combination of sequence type (e.g., strings or Seq 
objects). None of these options are fully compatible with the Seq 
object.  So I do agree that back-translation can not be part of the Seq 
object. Also I agree that while first two could be return types for a 
Seq object method, the usage is probably too infrequent and too 
specialized for inclusion especially to handle codon usage frequencies.

Bruce


From lpritc at scri.ac.uk  Wed Oct 22 04:31:12 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Wed, 22 Oct 2008 09:31:12 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <48FE3D4F.6060005@gmail.com>
Message-ID: <C524A360.18103%lpritc@scri.ac.uk>

On 21/10/2008 21:36, "Bruce Southey" <bsouthey at gmail.com> wrote:

> For completeness as these are not 100% correct,
> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN

There are some difficulties with this encoding (IUPAC codes are at
http://www.chick.manchester.ac.uk/SiteSeer/IUPAC_codes.html)

YTN -> [CT]T[ACGT] -> {CTA, CTC, CTG, CTT, TTA, TTC, TTG, TTT}, two of which
do not encode leucine.

MGV -> [AC]G[ACG] -> {AGA, AGC, AGG, CGA, CGC, CGG}, of which AGC does not
encode arginine, and the resulting set does not include CGT, which does
encode arginine

WSN -> [AT][CG][ACGT] -> {ACA, ACC, ACG, ACT, AGA, AGC, AGG, AGT, TCA, TCC,
TCG, TCT, TGA, TGC, TGG, TGT}, of which 10 codons do not encode serine.

This would cause problems if we wanted to translate our back-translation
back to the original protein sequence (however we might want to do this).

> Ser is really so bad that one would suggest providing a strong warning
> and just use NTN, NGN, and NNN for Leu, Arg and Ser, respectively.

We could just backtranslate all amino acids to NNN and avoid the problem
entirely ;)

>> If we want to provide a simple string or Seq object, we can either
>> pick an arbitrary codon in each case (as in the first attachment on
>> Bug 2618), or perhaps represent some of the possible codons using
>> ambiguous nucleotides.
>> 
>> e.g.
>> back_translate("MR") = "ATGCGT" #arbitrary codon for R unambiguous
>> nucleotides
>> 
>> or,
>> back_translate("MR") = "ATGCGN" #arbitrary codon for R using ambiguous
>> nucleotides
>> 
>> Note in either example, the following nice property holds:
>> translate(back_translate("MR")) == "MR"

This would be an important consideration for a back_translate() method:
should translate() and back_translate() be inverse functions of each other?

I would say that this is a desirable property, or else a nested
translate(back_translate(translate(...(seq)...))) is likely to end up as a
string or sequence of ambiguity codons, which is not very useful.  If that
can't be done, then the opportunity to do so is probably best avoided...

To ensure that translate() and back_translate() are inverse functions, the
backtranslation of a particular amino acid should either return a single
unambiguous codon, or an ambiguous codon that cannot be translated to an
alternative amino acid (assuming a consistent codon table throughout).  If
we were not to choose arbitrarily an unambiguous codon, or subset of all
possible codons, then a representation of the ambiguity is required that is
not present in the Seq object, yet (e.g. For Ser, Leu or Arg as described
above).  A modification of translate() to spot, and accept such ambiguity
would be necessary.  This looks like harder work than it's worth.

>> It was something like this that I envisioned as a candidate for a Seq
>> method (based on the behaviour of the existing Bio.Translate
>> functionality), but only if such a simple back_translate
>> method/function had any real uses.  And thus far, I haven't seen any.
>>   
> For you perhaps but my reasons are very real to me!

I agree with Peter on this.  I don't see a single compelling use case for
back_translate() in a Seq object.

I can sort of see a potential use where, if you have a protein and want to
design a primer to the coding sequence (which is not known - otherwise there
are better ways to do this), then you might want to generate a sequence of
IUPAC ambiguity codes to guide primer design.  This might involve obtaining
a sequence only of the *certain* bases, e.g. Phe -> TTN; Ser -> NNN; Gly ->
GGN; Asp -> GAN, so that FGD -> TTNNNNGGN, and there are four of nine bases
around which primers might be designed.  However, I'm *really* stretching to
come up with this example.

I've outlined my views on some of the possible ways back_translate() might
work below:


Translate protein to its original coding sequence:
===================================================
Problem: this may be just guesswork in (very) large sequence space
Potential solution: guesswork may be guided by codon usage tables or user
preference for codons, but the biological utility/significance of the
result, which is still guessed at, is highly questionable.
Alternatives: If the originating organism's sequence is known, then TBLASTN
is fast, works well, and avoids the problem.  Alternatively, forward
translation followed by a search for the protein sequence is quicker and
less messy.


Translate protein to a single possible coding sequence (not necessarily
original):
============================================================================
Problem: Same one each time, or choose randomly? What is the point, anyway?
See above for solutions/alternatives


Translate protein to ambiguous representation (inverse translate and/or
return Seq):
============================================================================
Problem: changes required to the way sequences are represented in Seq
objects; this is a significant change at the heart of Biopython with many
inevitable side-effects.  Not clear how this would work, yet.
Potential solution: major coding upheaval and rewriting of Biopython
Alternatives: ignore the requirement that backtranslation is the inverse of
translation; do not return a Seq object, but instead store the
backtranslation as an attribute, or just return a string for the user to do
what they want with


Translate protein to ambiguous representation (not inverse of translate, do
not return Seq):
============================================================================
Problem: what's the point?  agreeing which ambiguous representation to use:
regex, IUPAC, something else; IUPAC ambiguities aren't a convenient
representation for Ser, Leu, Arg;
Potential solution: just use a regex; allow a choice; make an executive
decision; ignore it and hope it goes away


I think that the last behaviour here is the only one that is feasible, but I
still don't see much point in implementing it.  At least turning a protein
sequence into a regex of possible codons would be quick to code...

>> There is
>> also the option of returning multiple simple strings or Seq objects
>> (either as a list or preferable a generator) giving all possible back
>> translations, 

Eek! (for the reasons you mention)

L.


-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________

From biopython at maubp.freeserve.co.uk  Wed Oct 22 05:17:23 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Oct 2008 10:17:23 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
	Seq object?
In-Reply-To: <C524A360.18103%lpritc@scri.ac.uk>
References: <48FE3D4F.6060005@gmail.com> <C524A360.18103%lpritc@scri.ac.uk>
Message-ID: <320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com>

On Wed, Oct 22, 2008 at 9:31 AM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
> On 21/10/2008 21:36, "Bruce Southey" <bsouthey at gmail.com> wrote:
>
>> For completeness as these are not 100% correct,
>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN

I was going to jump up and down and disagree with you here Bruce, but
Leighton has already made the same point, (CGV | AGR) != MGV etc.

It is true that the ambiguous codon MGV would cover all the possible
Arg codons, but it includes more than that.  While this could be a
useful thing for certain back-translation reasons, it does break the
expectation that translate(back_translate(sequence)) == sequence
[currently the behaviour available in Bio.Translate].

>>> If we want to provide a simple string or Seq object, we can either
>>> pick an arbitrary codon in each case (as in the first attachment on
>>> Bug 2618), or perhaps represent some of the possible codons using
>>> ambiguous nucleotides.
>>> ...
>>> It was something like this that I envisioned as a candidate for a Seq
>>> method (based on the behaviour of the existing Bio.Translate
>>> functionality), but only if such a simple back_translate
>>> method/function had any real uses.  And thus far, I haven't seen any.
>>>
>> For you perhaps but my reasons are very real to me!

I was saying I don't see the need for a *simple* back_translate
function (giving a Seq object or a string), and that such a simple
function didn't seem to help with your examples.

I'm not denying that a complex back translation operation has real
utility (although I suspect there are multiple different solutions
which won't suit every problem - and makes justifying adding this to
the core Seq object hard to justify).  Perhaps a function in
Bio.SeqUtils to create a nucleotide regex describing possible back
translations from a protein sequence would suffice?

If one of your real-world examples can be solved with a back_translate
which returns a simple string or Seq object, could you clarify this.

Peter

From lpritc at scri.ac.uk  Wed Oct 22 06:03:32 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Wed, 22 Oct 2008 11:03:32 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <48FE31B2.8030509@gmail.com>
Message-ID: <C524B904.18127%lpritc@scri.ac.uk>

On 21/10/2008 20:46, "Bruce Southey" <bsouthey at gmail.com> wrote:
> Thank you for agreeing with me! I am glad that you realized that the
> genetic code prevents a true one to many relationship.

Bruce, I am not agreeing with you.  I'll try to clarify it another way:

More than one codon can encode the amino acid arginine (this is a many-one
relationship). The amino acid arginine can be 'decoded' to more than one
codon (this is a one-many relationship).

Imagine a function that accepts an amino acid as input and returns a valid
codon that could encode for the input amino acid.  This is 'decoding' as
described above, and is the process of back-translation for a single amino
acid.

For a single (i.e. 'one') amino acid, arginine, as input, the function might
correctly provide up to six (i.e. 'many') different valid answers.  This
makes it a one-many problem.  Further external constraints (e.g. Codon
tables) may be applied to restrict the number or likelihood of each codon
being correct in specific cases, but the fundamental problem is one-many.

Providing arginine as input to a particular coded version of this function
might in all cases only return a single codon as output (one-one), but the
problem itself is still one-many.

Furthermore, even though only one codon was responsible -
biologically-speaking - for encoding the arginine you're submitting to the
function (one-one), your question is the inverse: effectively 'what codon
encoded this arginine?'.  But (and it's a big but), if you don't know
beforehand what that codon is (and why else would you bother using the
function?), the problem is one-many, as any of the six solutions might be
correct.

Analogously, there are two possible values for the square root of a positive
real number, such as 4.  It is inherently a one-many problem.  For 4, the
return value could, correctly, be +2 or -2.  Now, the math.sqrt() function
in Python follows mathematical convention for the radical, and only returns
the positive value, but that does not make the relationship between the
value and its square root one-one, it only makes that implementation of the
function one-one, even though the answer could be, correctly, either
positive or negative.

Now, if your problem is: what is the length of side of a farmer's square
field with area four square miles (big field!), only one of these answers
makes sense (one-one), as the field is constrained by our reality and cannot
have negative length (this is effectively equivalent to saying that the
organism doesn't use five of the six possible codons for arginine, so only
one answer is possible).  However, the general problem of finding a square
root is still one-many, as you can see if you rephrase the problem as 'the
vector (a 0) has length 4; what is the value of a?'.  This is directly
analogous to the problem 'the amino acid arginine was encoded by a codon;
what codon was it?'.

> This very much depends on how you want to use it. TBLASTN is not very
> good for very short sequences and can not handle protein domains/motifs
> such as those in Prosite.

That's a fair point, and I wouldn't (and didn't ;) ) recommend TBLASTN as a
solution to all such problems.  I get acceptable results for exact matches
down to about 7aa on default settings, though.  Short query sequences can be
a problem whatever method you use, though.

>> TBLASTN queries against
>> nucleotide databases.  Wait, that's not quite right -
> No, it is not even correct! :-)

Yes, it is correct.  From:
http://www.ncbi.nlm.nih.gov/blast/blast_program.shtml (and other
references...)

"""
tblastn
compares a protein query sequence against a nucleotide sequence database
dynamically translated in all reading frames
""" 

They wrote it, so they should know.  Not that I've checked the code ;)

>> TBLASTN translates
>> nucleotide databases into protein databases and queries against them with
>> the protein sequence, partly because of the one-many mapping of
>> back-translation.
> Not exactly as stop codons are not in protein databases except where
> they code for an amino acid.

Stop codons are not (usually) in protein databases, that's true.  But they
*are* in nucleotide databases, which is what TBLASTN queries.  For example,
these are TBLASTN search results, in opposite directions on the same
nucleotide sequence, that span stop codons in the subject sequence,
indicated by '*' in the BLAST output (even though there are different stop
codons; Artemis handles this more elegantly):

>ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete
genome
          Length = 5064019

 Score = 79.0 bits (193), Expect = 8e-17
 Identities = 38/40 (95%), Positives = 38/40 (95%), Gaps = 2/40 (5%)
 Frame = +2

Query: 1  YPHSTAEYLILFE-INPRS-PFFCWIFWNLMLRDVDLENF 38
          YPHSTAEYLILFE INPRS PFFCWIFWNLMLRDVDLENF
Sbjct: 2  YPHSTAEYLILFE*INPRS*PFFCWIFWNLMLRDVDLENF 121

>ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete
genome
          Length = 5064019

 Score = 56.6 bits (135), Expect = 4e-10
 Identities = 29/32 (90%), Positives = 29/32 (90%), Gaps = 3/32 (9%)
 Frame = -3

Query: 1       CNGRWRC-SPL-CYISPRISCRSW-LKPSAIV 29
               CNGRWRC SPL CYISPRISCRSW LKPSAIV
Sbjct: 2851610 CNGRWRC*SPL*CYISPRISCRSW*LKPSAIV 2851515

>> That's not true; fastacmd can extract FASTA-formatted sequences from any
>> (version number compatibilities notwithstanding) correctly-formatted BLAST
>> database.
>>   
> Obviously because you still have direct access to the DNA sequence.

I'd call it indirect access if you've, say, downloaded a precompiled nt
database from NCBI and then have to extract the FASTA sequence from that
compiled database.  Either way, if you're querying a nucleotide database,
you've got to have a representation of the nucleotide sequence *somewhere*.

>> Even if both of the above options fail, and you can acquire the new sequence
>> by some accession identifier, you can build a new local database from that
>> sequence alone, and find where the match is.  Or translate and search
>> directly in Python.
>>   
> These were some of the things that one was trying to avoid, especially
> repeating it all over again and hoping like crazy that it is still
> present. 

Some things are just harder work than others ;)

> (Genome assemblies are not very forgiving.)

The genomes I've worked on have had stable sequences at revision points for
both assembly and annotation (though the old revision points have not been
kept publicly in all cases, which can be awkward).  All should, IMO.  But
that's a different thread on a different mailing list...

Best,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________

From dalloliogm at gmail.com  Wed Oct 22 06:25:57 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 22 Oct 2008 12:25:57 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
Message-ID: <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>

On Mon, Oct 20, 2008 at 3:57 PM, Giovanni Marco Dall'Olio <
dalloliogm at gmail.com> wrote:

>
>
> On Mon, Oct 20, 2008 at 7:41 AM, Tiago Ant?o <tiagoantao at gmail.com> wrote:
>
>> Hi,
>>
>> On Sun, Oct 19, 2008 at 3:50 PM, Giovanni Marco Dall'Olio
>> <dalloliogm at gmail.com> wrote:
>> > ok, thank you very much!!
>> > I would like to use git to keep track of the changes I will make to the
>> > code.
>> > What do you think if I'll upload it to http://github.com and then
>> upload it
>> > back on biopython when it is finished?
>> > I am not sure, but I think it would be possible to convert the logs back
>> to
>> > cvs to reintegrate the changes in biopython.
>>
>> I think it is a good idea. When we reintegrate back I think there will
>> be no need to backport the commit logs anyway.
>
>
> Ok, I have uploaded the code to:
> - http://github.com/dalloliogm/biopython---popgen
>

I wrote a prototype for a PED file parser which uses your PopGen.Record
object to store data.
It's available on github: I have still to finish the consumer object and to
test it, but I think I will be able to finish it for today.

I left you a few comments on the github wiki:
- http://github.com/dalloliogm/biopython---popgen/wikis/home

Maybe the biggest issue is that I will have to use this library to parse
very big files, so there are a few things we could change in the
implementation of the parser.
Is there any way in python to force the interpreter to store variables in
temporary files instead of RAM memory?
I was thinking about modules like shelve, cPickle, but I am not sure they
work in this way.
We could also modify the parser in a way that it can accept a list of
populations as argument, and create a populations list with only those
populations from the file.

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Wed Oct 22 06:34:21 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Oct 2008 11:34:21 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
Message-ID: <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>

On Wed, Oct 22, 2008 at 11:25 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Maybe the biggest issue is that I will have to use this library to parse
> very big files, so there are a few things we could change in the
> implementation of the parser.
> Is there any way in python to force the interpreter to store variables in
> temporary files instead of RAM memory?
> I was thinking about modules like shelve, cPickle, but I am not sure they
> work in this way.

I have not looked at the specifics here, but adopting an iterator
approach might make sense - returning the entries one by one as parsed
from the file.  This is the idea for the Bio.SeqIO and Bio.AlignIO
parsers.  The user can then turn the entries into a list (if they have
enough memory), filter them as the arrive, etc.  For example, you
could compile a list of only those desired population entries,
discarding the others on the fly.

Peter

From bsouthey at gmail.com  Wed Oct 22 11:04:29 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Wed, 22 Oct 2008 10:04:29 -0500
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com>
References: <48FE3D4F.6060005@gmail.com> <C524A360.18103%lpritc@scri.ac.uk>
	<320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com>
Message-ID: <48FF40FD.5020604@gmail.com>

Peter wrote:
> On Wed, Oct 22, 2008 at 9:31 AM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
>   
>> On 21/10/2008 21:36, "Bruce Southey" <bsouthey at gmail.com> wrote:
>>
>>     
>>> For completeness as these are not 100% correct,
>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN
>>>       
>
> I was going to jump up and down and disagree with you here Bruce, but
> Leighton has already made the same point, (CGV | AGR) != MGV etc.
> It is true that the ambiguous codon MGV would cover all the possible
> Arg codons, but it includes more than that.  While this could be a
> useful thing for certain back-translation reasons, it does break the
> expectation that translate(back_translate(sequence)) == sequence
> [currently the behaviour available in Bio.Translate].
>   
Leighton does show these are correct:
(CGV | AGR) == MGV
and MGV ==(CGV | AGR)

BUT I fully agree that MGV does stand for other other codons that are do 
not translate for Arg as Leighton pointed out. This was why I prefixed 
this by stating "these are not 100% correct" so I am sorry that I was 
not clear enough.  Yes, I am also very aware that this creates a problem 
for doing a translate(back_translate(sequence)) without using a special 
translation table (yet another reason for not including it in Seq object 
or just return an exception).

As I pointed in your other thread that I do not believe that a 
back-translation should be part of the Seq object. If for no other 
reason than back-translation just creates too many ambiguous nucleotides 
in one DNA sequence. This will cause some of the algorithms to determine 
protein or DNA sequences to fail (back_translate('AFLFQPQRFGR') gives 
'GCNTTYYTNTTYCARCCNCARMGVTTYGGNMGV', which causes NCBI's online BLASTN 
to say it is protein). In anycase, BLAST and such are not very good at 
handling multiple ambiguous nucleotides in a sequence when probably 
one-third to one-half of the sequence would be ambiguous nucleotides.

Bruce

From biopython at maubp.freeserve.co.uk  Wed Oct 22 11:33:00 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Oct 2008 16:33:00 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
	Seq object?
In-Reply-To: <48FF40FD.5020604@gmail.com>
References: <48FE3D4F.6060005@gmail.com> <C524A360.18103%lpritc@scri.ac.uk>
	<320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com>
	<48FF40FD.5020604@gmail.com>
Message-ID: <320fb6e00810220833j4089cec1i6fb9cee563b562d3@mail.gmail.com>

Bruce wrote:
>>>> For completeness as these are not 100% correct,
>>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
>>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
>>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN

Just for the record, in addition to the debate about the final equal
signs above, there is at least one error in the above - for the
leucine codons, (TTN|CTR) should read (TTR|CTN), but this doesn't
matter for the discussion in hand.

Bruce wrote:
> Leighton does show these are correct:
> (CGV | AGR) == MGV
> and MGV ==(CGV | AGR)

I don't think Leighton did mean to say that. A set of 6 codons is NOT
equal to a set of 8 codons.  However, if we say "sub set" or "super
set" here things are probably fine (I haven't double checked the
correct ambiguity codes are used here).

Similarly, Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTR|CTN) covers 6
unambiguous codons.
This is a subset of YTN = (TTC|TTA|TTG|TTT|CTC|CTA|CTG|CTT) which
covers 8 unambiguous codons.

Having back_translate("L") == "YTN" means
translate(back_translate("L")) == "X", which would surprise many.
Using "YTN" covers all the codons plus some extra ones.  This might be
useful for searching purposes, but otherwise its very misleading.

Having back_translate("L") == "CTN" means
translate(back_translate("L")) == "L", but doesn't cover the two
codons TTR (i.e. TTA or TTG).  At least this is better than
back_translate("L") == "TTR" which still has
translate(back_translate("L")) == "L", but doesn't cover the four
codons CTN.  Picking any one of the six codons also ensures
translate(back_translate("L")) == "L" but of course doesn't cover the
other five codons.  In all three cases, the utility of the back
translation is limited.

> Yes, I am also very aware that this creates a problem for doing a
> translate(back_translate(sequence)) without using a special translation
> table (yet another reason for not including it in Seq object or just return
> an exception).

Yes.

> As I pointed in your other thread that I do not believe that a
> back-translation should be part of the Seq object.

In the absence of a compelling use case, I agree.

> If for no other reason
> than back-translation just creates too many ambiguous nucleotides in one DNA
> sequence. This will cause some of the algorithms to determine protein or DNA
> sequences to fail (back_translate('AFLFQPQRFGR') gives
> 'GCNTTYYTNTTYCARCCNCARMGVTTYGGNMGV', which causes
> NCBI's online BLASTN to say it is protein).

In such cases, you can explicitly tell BLAST (or other tools) if they
are using nucleotides or proteins.  However this is a valid concern
for working with ambiguous nucleotides.

As an aside, zen of python "In the face of ambiguity, refuse the
temptation to guess." (here nucleotide versus protein)

> In anycase, BLAST and such are not very good at handling
> multiple ambiguous nucleotides in a sequence when probably
> one-third to one-half of the sequence would be ambiguous
> nucleotides.

Ambiguous searches are bound to be tricky.

Peter

From lpritc at scri.ac.uk  Wed Oct 22 11:34:47 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Wed, 22 Oct 2008 16:34:47 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <48FF40FD.5020604@gmail.com>
Message-ID: <C52506A7.18168%lpritc@scri.ac.uk>


On 22/10/2008 16:04, "Bruce Southey" <bsouthey at gmail.com> wrote:

> Peter wrote:
>> On Wed, Oct 22, 2008 at 9:31 AM, Leighton Pritchard <lpritc at scri.ac.uk>
>> wrote:
>>   
>>> On 21/10/2008 21:36, "Bruce Southey" <bsouthey at gmail.com> wrote:
   
>>>> For completeness as these are not 100% correct,
>>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
>>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
>>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN
>>>>       
>> 
>> I was going to jump up and down and disagree with you here Bruce, but
>> Leighton has already made the same point, (CGV | AGR) != MGV etc.
>> It is true that the ambiguous codon MGV would cover all the possible
>> Arg codons, but it includes more than that.
>>   
> Leighton does show these are correct:
> (CGV | AGR) == MGV
> and MGV ==(CGV | AGR)

I showed (and Peter also points out) that (TTN|CTR) is a subset of YTN, and
that (TCN|AGY) is a subset of WSN, and not that they are equivalent, which
is what you have written above.  For that equivalence, we would also require
that MGV is a subset of (CGV|AGR), which is not true.

Likewise I also showed that, although (CGV|AGR) is a subset of MGV, neither
CGV nor MGV include CGT, which is a valid codon for arginine.  Whether or
not this error is corrected to CGN/MGN, the regular expression is still only
a subset of those codons implied by the IUPAC ambiguity symbols.

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________

From bsouthey at gmail.com  Wed Oct 22 11:50:19 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Wed, 22 Oct 2008 10:50:19 -0500
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <320fb6e00810220833j4089cec1i6fb9cee563b562d3@mail.gmail.com>
References: <48FE3D4F.6060005@gmail.com> <C524A360.18103%lpritc@scri.ac.uk>	
	<320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com>	
	<48FF40FD.5020604@gmail.com>
	<320fb6e00810220833j4089cec1i6fb9cee563b562d3@mail.gmail.com>
Message-ID: <48FF4BBB.8020007@gmail.com>

Peter wrote:
> Bruce wrote:
>   
>>>>> For completeness as these are not 100% correct,
>>>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
>>>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
>>>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN
>>>>>           
>
> Just for the record, in addition to the debate about the final equal
> signs above, there is at least one error in the above - for the
> leucine codons, (TTN|CTR) should read (TTR|CTN), but this doesn't
> matter for the discussion in hand.
>
>   
Thanks for correctly that one.

Bruce

From tiagoantao at gmail.com  Wed Oct 22 11:52:19 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Wed, 22 Oct 2008 16:52:19 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
Message-ID: <6d941f120810220852m606b8825tafaaf871f22df48e@mail.gmail.com>

Hi,

[Back in office now]

> Ok, I have uploaded the code to:
> - http://github.com/dalloliogm/biopython---popgen
>
> I put the code I wrote before writing in this mailing list in the folder
> PopGen/Gio

Thanks I will have a look and get acquainted with GIT.


>> I am afraid that this is not enough. Even for Fst. I suppose you are
>> acquainted with a formula with just heterozigosities.
>
> Yes, I was trying to implement a very basic formula at first.

For publication and data analysis the standard is Cockerham and Wier's
theta. The Standard Ht/(Hs-Ht) (or a variation of this) might be
misleading in regards to the amount of information that is needed.


> Yes, I agree. It was just a first try. We should collect some good
> use-cases.


In my head I divide statistics in the following dimensions:
1. genetic versus genomic (e.g. Fst is single locus, LD can be seen as
requiring more than 1 locus, therefore is "genomic")
2. frequency based versus marker based (some statistics require
frequencies only - ie, you can calculate them irrespective of the type
of marker - This is the case of Fst. Others are marker dependent, say
Tajima D requires sequences and can only be used with sequences)
3. population structure versus no pop structure. Some stats require
population structure (again, Fst), others don't (e.g., allelic
richness)

>From my point of view, a long-term solution needs to take into account
these dimensions (and others that I might be forgetting).

One can think in a solution based on Populations and Individuals as
fundamental objects (as opposed to statistics), but, from my
experience it is very difficult to define what is an "individual"
(i.e., what kind of information you need to store - I can expand on
this). It is easier to think in terms of statistics.

One fundamental point is that we don't have many opportunities to make
it right: if we define an architecture which proves in the future to
be not sufficient, then  we will have to both maintain the old legacy
(because there will be users around whose code cannot be constantly
broken when a new version is made available) while hack the new
features in.

From tiagoantao at gmail.com  Wed Oct 22 12:00:39 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Wed, 22 Oct 2008 17:00:39 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
Message-ID: <6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com>

Hi,

On Wed, Oct 22, 2008 at 11:25 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> I wrote a prototype for a PED file parser which uses your PopGen.Record
> object to store data.

Don't feel obliged to use GenePop.Record. You can (maybe you should)
use one that is better for your PED record. The point is: your PED
files might have extra (or less) information than genepop files. For
instance, they might have population names. They might store the SNP
(A, C, T, G). With genepop you would have to convert (and thus loose)
the extra info.

> Maybe the biggest issue is that I will have to use this library to parse
> very big files, so there are a few things we could change in the
> implementation of the parser.

Yet another reason to develop your own record. I would not mind
helping you with that.

> We could also modify the parser in a way that it can accept a list of
> populations as argument, and create a populations list with only those
> populations from the file.

We have to be careful in modifying existing code. We can add new
functionality, add new interfaces. But changing existing interfaces or
removing them has to be dealt with exceptional care, because that will
break (existing) code done by users.

Tiago

From tiagoantao at gmail.com  Wed Oct 22 12:03:59 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Wed, 22 Oct 2008 17:03:59 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
Message-ID: <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>

On Wed, Oct 22, 2008 at 11:34 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> I have not looked at the specifics here, but adopting an iterator
> approach might make sense - returning the entries one by one as parsed
> from the file.  This is the idea for the Bio.SeqIO and Bio.AlignIO
> parsers.  The user can then turn the entries into a list (if they have
> enough memory), filter them as the arrive, etc.  For example, you
> could compile a list of only those desired population entries,
> discarding the others on the fly.

I will have look at iterators in Python. This idea from Giovannni is
actually floating around with current users for GenePop data which
have exactly the same problem (loooong records).

From dalloliogm at gmail.com  Wed Oct 22 13:10:45 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 22 Oct 2008 19:10:45 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
Message-ID: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>

On Wed, Oct 22, 2008 at 6:03 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:

> On Wed, Oct 22, 2008 at 11:34 AM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
> > I have not looked at the specifics here, but adopting an iterator
> > approach might make sense - returning the entries one by one as parsed
> > from the file.  This is the idea for the Bio.SeqIO and Bio.AlignIO
> > parsers.  The user can then turn the entries into a list (if they have
> > enough memory), filter them as the arrive, etc.  For example, you
> > could compile a list of only those desired population entries,
> > discarding the others on the fly.
>
> I will have look at iterators in Python. This idea from Giovannni is
> actually floating around with current users for GenePop data which
> have exactly the same problem (loooong records).
>


Iterators are more difficult to implement in Ped files, because in this
format every line of the file is an individual, so to write an iterator
which iterates by population we will need to read at list the first row of
every line of all the file.
I was also thinking of starting using a database to store data, instead of
files. This would probably solve the problem of out of memory when parsing
those long files.
I would probably use sqlalchemy to interface with this database: this is why
I would like to implement a Population and Individual objects, it will fit
better with relational mapping.

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From dalloliogm at gmail.com  Wed Oct 22 13:12:24 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 22 Oct 2008 19:12:24 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810220852m606b8825tafaaf871f22df48e@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<6d941f120810220852m606b8825tafaaf871f22df48e@mail.gmail.com>
Message-ID: <5aa3b3570810221012v3543a533u15f81196752cd52@mail.gmail.com>

On Wed, Oct 22, 2008 at 5:52 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:

> Hi,
>
> [Back in office now]
>
> > Ok, I have uploaded the code to:
> > - http://github.com/dalloliogm/biopython---popgen
> >
> > I put the code I wrote before writing in this mailing list in the folder
> > PopGen/Gio
>
> Thanks I will have a look and get acquainted with GIT.
>

It' s the first time I am using github for something serious, too.
Please tell me if you need me to add you as a 'collaborator' in the project
or something like this.
I am using eclipse with a plugin for git (http://www.jgit.org/update-site)
and it works very well.
I think there is a plugin for vim, too.
Sorry, today I couldn't do too much - I spent most of the day in seminars
and meetings :(.


>
>
> > Yes, I agree. It was just a first try. We should collect some good
> > use-cases.
>
>
> In my head I divide statistics in the following dimensions:
> 1. genetic versus genomic (e.g. Fst is single locus, LD can be seen as
> requiring more than 1 locus, therefore is "genomic")
> 2. frequency based versus marker based (some statistics require
> frequencies only - ie, you can calculate them irrespective of the type
> of marker - This is the case of Fst. Others are marker dependent, say
> Tajima D requires sequences and can only be used with sequences)
> 3. population structure versus no pop structure. Some stats require
> population structure (again, Fst), others don't (e.g., allelic
> richness)
>
> From my point of view, a long-term solution needs to take into account
> these dimensions (and others that I might be forgetting).
>
> One can think in a solution based on Populations and Individuals as
> fundamental objects (as opposed to statistics), but, from my
> experience it is very difficult to define what is an "individual"
> (i.e., what kind of information you need to store - I can expand on
> this). It is easier to think in terms of statistics.
>
> One fundamental point is that we don't have many opportunities to make
> it right: if we define an architecture which proves in the future to
> be not sufficient, then  we will have to both maintain the old legacy
> (because there will be users around whose code cannot be constantly
> broken when a new version is made available) while hack the new
> features in.
>

ok... but we can try :).
We could use the github's wiki to better organize these ideas.
I will answer to you better tomorrow (or tonight).
Now, I need a bit of fresh air! :)

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From dalloliogm at gmail.com  Wed Oct 22 13:12:41 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 22 Oct 2008 19:12:41 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com>
Message-ID: <5aa3b3570810221012w2c894977sd0f86297f42f9394@mail.gmail.com>

On Wed, Oct 22, 2008 at 6:00 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:

> Hi,
>
> On Wed, Oct 22, 2008 at 11:25 AM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
> > I wrote a prototype for a PED file parser which uses your PopGen.Record
> > object to store data.
>
> Don't feel obliged to use GenePop.Record. You can (maybe you should)
> use one that is better for your PED record. The point is: your PED
> files might have extra (or less) information than genepop files. For
> instance, they might have population names. They might store the SNP
> (A, C, T, G). With genepop you would have to convert (and thus loose)
> the extra info.


I first tried to write an AbstractPopRecord class from which to derive both
Ped.Record and your GenePop.Record classes.
Then, I realized that I wanted to use all of your methods and decided to
import your GenePop.Record instead of writing a new one.
Moreover, there are some methods (like GenePop.Record.split_in_pops) that
create Record objects, and I thought it would have been easier to always
refer to the same one.
Maybe we should write a generic PopGenRecord in which to store all general
informations about population genetics data.


>
>
> > Maybe the biggest issue is that I will have to use this library to parse
> > very big files, so there are a few things we could change in the
> > implementation of the parser.
>
> Yet another reason to develop your own record. I would not mind
> helping you with that.
>

>
> > We could also modify the parser in a way that it can accept a list of
> > populations as argument, and create a populations list with only those
> > populations from the file.
>
> We have to be careful in modifying existing code. We can add new
> functionality, add new interfaces. But changing existing interfaces or
> removing them has to be dealt with exceptional care, because that will
> break (existing) code done by users.
>
> Tiago
>


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Wed Oct 22 13:26:07 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Oct 2008 18:26:07 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
Message-ID: <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com>

On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
>
> Iterators are more difficult to implement in Ped files, because in this
> format every line of the file is an individual, so to write an iterator
> which iterates by population we will need to read at list the first row of
> every line of all the file.

It sounds like for Ped files it would make more sense to iterate over
the individuals.  The mental picture I have in mind is a big
spreadsheet, individuals as rows (lines), populations (and other
information) as columns.  By having the parser iterate over the
individuals one by one, the user could then "simplify" each individual
as they are read in, recording in memory just the interesting data.
This way the whole dataset need not be kept in memory.

> I was also thinking of starting using a database to store data, instead of
> files. This would probably solve the problem of out of memory when parsing
> those long files.
> I would probably use sqlalchemy to interface with this database: this is why
> I would like to implement a Population and Individual objects, it will fit
> better with relational mapping.

That would mean adding sqlalchemy as another (optional) dependency for
Biopython.  If you could use MySQLdb instead that would be better as
several existing modules use this.  However, I would encourage you to
avoid any database if possible because this makes the installation
much more complicated for the end user, and imposes your own arbitrary
schema as well.  It also means setting up suitable unit tests is also
a pain.

Peter

From rsclary at uncc.edu  Wed Oct 22 15:49:33 2008
From: rsclary at uncc.edu (Clary, Richard)
Date: Wed, 22 Oct 2008 15:49:33 -0400
Subject: [BioPython] Retrieving nucleotide sequence for given accession
	Entrez ID
Message-ID: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu>


Can anyone provide succinct Python function to retrieve the nucleotide sequence (as a string) for a given nucleotide accession ID?  Attempting to do this through E-Utils but having a difficult time figuring out the best way to do this without having to download a FASTA file...

Thanks in advance,
R


From biopython at maubp.freeserve.co.uk  Wed Oct 22 16:15:37 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Oct 2008 21:15:37 +0100
Subject: [BioPython] Retrieving nucleotide sequence for given accession
	Entrez ID
In-Reply-To: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu>
References: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu>
Message-ID: <320fb6e00810221315i31358bc2n2e5c9be405a77e42@mail.gmail.com>

On Wed, Oct 22, 2008 at 8:49 PM, Clary, Richard <rsclary at uncc.edu> wrote:
>
> Can anyone provide succinct Python function to retrieve the
> nucleotide sequence (as a string) for a given nucleotide
> accession ID?  Attempting to do this through E-Utils but
> having a difficult time figuring out the best way to do this
> without having to download a FASTA file...

Hi Richard,

Are you trying this using Bipython's Bio.Entrez, or accessing E-Utils directly?

Anyway, you'll want to use efetch (e.g. via the Bio.Entrez.efetch
function in Biopython)
http://www.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html

This documentation covers the possible return formats,
http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html

I think FASTA would be simplest (I don't see a plain or raw text
option), and has only a tiny overhead in the download size over the
raw sequence.  Getting the sequence out of a FASTA file as a string is
trivial - for example, using Biopython:

from Bio import Entrez, SeqIO
Entrez.email = "Richard at example.com" #Tell the NCBI who you are
handle = Entrez.efetch(db="nucleotide", id="186972394",rettype="fasta")
seq_str = str(SeqIO.read(handle, "fasta").seq)

Peter

From dalloliogm at gmail.com  Thu Oct 23 05:41:04 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 23 Oct 2008 11:41:04 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com>
Message-ID: <5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com>

On Wed, Oct 22, 2008 at 7:26 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
> >
> > Iterators are more difficult to implement in Ped files, because in this
> > format every line of the file is an individual, so to write an iterator
> > which iterates by population we will need to read at list the first row
> of
> > every line of all the file.
>
> It sounds like for Ped files it would make more sense to iterate over
> the individuals.  The mental picture I have in mind is a big
> spreadsheet, individuals as rows (lines), populations (and other
> information) as columns.  By having the parser iterate over the
> individuals one by one, the user could then "simplify" each individual
> as they are read in, recording in memory just the interesting data.
> This way the whole dataset need not be kept in memory.


This makes sense.
Basically, we should write a (Ped/GenePop)Iterator function, which should
read the file one line at a time, check if it a has correct syntax and is
not a comment, and then use 'yield' to create a Record object. Am I right?


>
> > I was also thinking of starting using a database to store data, instead
> of
> > files. This would probably solve the problem of out of memory when
> parsing
> > those long files.
> > I would probably use sqlalchemy to interface with this database: this is
> why
> > I would like to implement a Population and Individual objects, it will
> fit
> > better with relational mapping.
>
> That would mean adding sqlalchemy as another (optional) dependency for
> Biopython.  If you could use MySQLdb instead that would be better as
> several existing modules use this.  However, I would encourage you to
> avoid any database if possible because this makes the installation
> much more complicated for the end user, and imposes your own arbitrary
> schema as well.  It also means setting up suitable unit tests is also
> a pain.
>

Don't worry, I am not going to do that.
I will probably use sqlalchemy only in my scripts; I will use it to retrieve
data from the database, and then create Population/Marker/Individual objects
using the code I am writing now, or a adapt the objects created  by
sqlalchemy to be compatible with the functions I will have to use.


>
> Peter
>


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk  Thu Oct 23 05:57:38 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 23 Oct 2008 10:57:38 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com>
	<5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com>
Message-ID: <320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com>

On Thu, Oct 23, 2008, Giovanni Marco Dall'Olio wrote:
> On Wed, Oct 22, Peter wrote:
>> On Wed, Oct 22, Giovanni Marco Dall'Olio wrote:
>> >
>> > Iterators are more difficult to implement in Ped files, because in this
>> > format every line of the file is an individual, so to write an iterator
>> > which iterates by population we will need to read at list the first row
>> > of every line of all the file.
>>
>> It sounds like for Ped files it would make more sense to iterate over
>> the individuals.  The mental picture I have in mind is a big
>> spreadsheet, individuals as rows (lines), populations (and other
>> information) as columns.  By having the parser iterate over the
>> individuals one by one, the user could then "simplify" each individual
>> as they are read in, recording in memory just the interesting data.
>> This way the whole dataset need not be kept in memory.
>
> This makes sense.
> Basically, we should write a (Ped/GenePop)Iterator function, which should
> read the file one line at a time, check if it a has correct syntax and is
> not a comment, and then use 'yield' to create a Record object. Am I right?

Yes :)

Python functions written with "yield" are called  "generator functions", see:
http://www.python.org/dev/peps/pep-0255/

Peter

From m at pavis.biodec.com  Thu Oct 23 06:25:45 2008
From: m at pavis.biodec.com (m at pavis.biodec.com)
Date: Thu, 23 Oct 2008 12:25:45 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
References: <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
Message-ID: <20081023102545.GE3694@pavis.biodec.com>

* Giovanni Marco Dall'Olio (dalloliogm at gmail.com) [081022 19:12]:
> 
> I was also thinking of starting using a database to store data, instead of
> files. This would probably solve the problem of out of memory when parsing
> those long files.

If you just need to store data, i.e. you just need a thin layer above
file storage, I'd suggest evaluating ZODB

It's very simple, somehow pythonic, and you don't need to learn SQL to
manage the data (of course, SQL is just fine, and from a real DB you get
much more than just data storage, but since you are just writing about
alternatives to file storage, I assume that SQL would not be a plus)

HTH

--
 .*.                            finelli
 /V\
(/ \) --------------------------------------------------------------
(   )       Linux: Friends dont let friends use Piccolosoffice
^^-^^ --------------------------------------------------------------

It is easier to make a saint out of a libertine than out of a prig.
		-- George Santayana

From dalloliogm at gmail.com  Thu Oct 23 07:30:06 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 23 Oct 2008 13:30:06 +0200
Subject: [BioPython] [OT] Revision control and databases
Message-ID: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>

Hi,
I have a question (well, it's not directly related to biopython or pygr, but
to scientific computing).

I always used flat files to store results and data for my bioinformatics
analys, but not (as I was saying in another thread) I would like to start
using a database to do that.

The problem is I don't know if databases do Revision Control.
When I used flat files, I was used to save all the results in a git
repository, and, everytime something was changed or calculated again, I did
commit it.
Do you know how to do this with databases? Does MySQL provide support for
revision control?
Thanks :)

(sorry for cross-posting :( )

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From sdavis2 at mail.nih.gov  Thu Oct 23 08:10:16 2008
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 23 Oct 2008 08:10:16 -0400
Subject: [BioPython] [OT] Revision control and databases
In-Reply-To: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>
References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>
Message-ID: <264855a00810230510n37d05cb1gd7b88a63988d7191@mail.gmail.com>

On Thu, Oct 23, 2008 at 7:30 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Hi,
> I have a question (well, it's not directly related to biopython or pygr, but
> to scientific computing).
>
> I always used flat files to store results and data for my bioinformatics
> analys, but not (as I was saying in another thread) I would like to start
> using a database to do that.
>
> The problem is I don't know if databases do Revision Control.
> When I used flat files, I was used to save all the results in a git
> repository, and, everytime something was changed or calculated again, I did
> commit it.
> Do you know how to do this with databases? Does MySQL provide support for
> revision control?
> Thanks :)

No.  Relational databases just store data.  You could build such a
system, but that would require a fair amount of work.  I would suggest
storing metadata about your analyses in the database and storing the
actual results on the file system.

Sean

From lpritc at scri.ac.uk  Thu Oct 23 08:44:45 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Thu, 23 Oct 2008 13:44:45 +0100
Subject: [BioPython] [bip] [OT] Revision control and databases
In-Reply-To: <C5262F1C.18234%lpritc@scri.ac.uk>
Message-ID: <C526304D.18238%lpritc@scri.ac.uk>

Hi Giovanni (and others)

Ah, reading again I see you're already using git... Without knowing exactly
what you're doing, I assume that CVS and SVN would be no improvement
<cough>, so please ignore my last paragraph below ;)

L.

On 23/10/2008 13:39, "Leighton Pritchard" <lpritc at scri.ac.uk> wrote:

> Hi Giovanni,
> 
> On 23/10/2008 12:30, "Giovanni Marco Dall'Olio" <dalloliogm at gmail.com> wrote:
> 
>> The problem is I don't know if databases do Revision Control.
>> When I used flat files, I was used to save all the results in a git
>> repository, and, everytime something was changed or calculated again, I did
>> commit it.
>> Do you know how to do this with databases? Does MySQL provide support for
>> revision control?
> 
> Databases are just collections of data.  Database Management Systems (DBMS)
> such as MySQL and PostgreSQL do not (AFAIAA) do revision control themselves,
> but they can be used for it, if you build that capability into the schema and
> also control database submissions appropriately.  There are a number of
> content management systems that implement version/revision control on common
> DBMS, like this.
> 
> Stretching a definition, you could possibly argue that CVS, SVN and the like
> are a form of DBMS... I don't know what type of data you're storing, or how
> they might scale for your purposes but, in principle, neither CVS nor SVN care
> much about whether your data represents code, legal documents, or any other
> sort of data.  For example, I've used CVS/SVN to version control manuscripts.
> You might like to try one of them.
> 
> Cheers,
> 
> L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________

From lpritc at scri.ac.uk  Thu Oct 23 08:39:40 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Thu, 23 Oct 2008 13:39:40 +0100
Subject: [BioPython] [bip] [OT] Revision control and databases
In-Reply-To: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>
Message-ID: <C5262F1C.18234%lpritc@scri.ac.uk>

Hi Giovanni,

On 23/10/2008 12:30, "Giovanni Marco Dall'Olio" <dalloliogm at gmail.com>
wrote:

> The problem is I don't know if databases do Revision Control.
> When I used flat files, I was used to save all the results in a git
> repository, and, everytime something was changed or calculated again, I did
> commit it.
> Do you know how to do this with databases? Does MySQL provide support for
> revision control?

Databases are just collections of data.  Database Management Systems (DBMS)
such as MySQL and PostgreSQL do not (AFAIAA) do revision control themselves,
but they can be used for it, if you build that capability into the schema
and also control database submissions appropriately.  There are a number of
content management systems that implement version/revision control on common
DBMS, like this.

Stretching a definition, you could possibly argue that CVS, SVN and the like
are a form of DBMS... I don't know what type of data you're storing, or how
they might scale for your purposes but, in principle, neither CVS nor SVN
care much about whether your data represents code, legal documents, or any
other sort of data.  For example, I've used CVS/SVN to version control
manuscripts.  You might like to try one of them.

Cheers,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________

From bsouthey at gmail.com  Thu Oct 23 09:55:49 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Thu, 23 Oct 2008 08:55:49 -0500
Subject: [BioPython] [bip] [OT] Revision control and databases
In-Reply-To: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>
References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>
Message-ID: <49008265.3040205@gmail.com>

Giovanni Marco Dall'Olio wrote:
> Hi,
> I have a question (well, it's not directly related to biopython or 
> pygr, but to scientific computing).
>
> I always used flat files to store results and data for my 
> bioinformatics analys, but not (as I was saying in another thread) I 
> would like to start using a database to do that.
Of course Biopython's BioSQL interface may provide a starting point.
> The problem is I don't know if databases do Revision Control.
> When I used flat files, I was used to save all the results in a git 
> repository, and, everytime something was changed or calculated again, 
> I did commit it.
> Do you know how to do this with databases? Does MySQL provide support 
> for revision control?
> Thanks :)
I think you are asking the wrong questions because it depends on what 
you want to do and what you actually store. There are a number of 
questions that you need to ask yourself about what you really need to do 
(knowing you have used git helps refine these). Examples include:
How often do you use the old versions in your git repository?
How do you use the old revisions in your git repository?
Do you even use the information of an older version if a newer version 
exists?
Do you actually determine when 'something was changed or calculated 
again' or it this partly determined by an external source like a Genbank 
or UniProt update? (At least in a database approach you could automate 
this.)
How many users that can make changes?
How often do you have conflicts?
Are the conflicts hard to solve?

Revision control may be overkill for your use because this is aims to 
handle many tasks and change conflicts related to multiple users rather 
than a single user.  If you don't need all these fancy features then you 
can use a database. If you just want to store and retrieve a version 
then you can use a database but you need to at least force the inclusion 
a date and comment fields to be useful.


Regards
Bruce

From tiagoantao at gmail.com  Thu Oct 23 10:51:22 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 23 Oct 2008 15:51:22 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810221012w2c894977sd0f86297f42f9394@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com>
	<5aa3b3570810221012w2c894977sd0f86297f42f9394@mail.gmail.com>
Message-ID: <6d941f120810230751k3dee7b96y8ee13e4bf1c2a4ca@mail.gmail.com>

Hi,


> Moreover, there are some methods (like GenePop.Record.split_in_pops) that
> create Record objects, and I thought it would have been easier to always
> refer to the same one.
> Maybe we should write a generic PopGenRecord in which to store all general
> informations about population genetics data.

The problem with that is that it is
a) very difficult to come with a representation that is general enough
(and usable in the long run).
b) a general representation would be an hassle in specific cases

Let me elaborate:

Different kinds of genetic information have completely different
storage needs: If you are doing genomic studies you will probably want
to have location information (like this SNP is on chromosome X,
position Y). Others (probably the majority) only require frequency
information (or to know what the marker is, irrespective of position).
In most species you don't even know the genomic position of a certain
marker. So you would have to have an general representation capable to
handle both position information and no position information. Then, in
some cases, you need the whole marker (like if you want to do a Tajima
D) or just frequency information (for Fst). Some markers (microsats)
you can (in most, but not all) cases ignore the genetic pattern, you
just count the repeats.
  You could argue that one could try to have a most general
representation but that entails  three problems:
  1. It is very difficult to come by with a clever, correct and future
proof representation. At least I've thinking on this issue since 2005
and have found no clever answer.
  2. Performance: If you care about performance, having a most general
data representation will bring about a big performance cost
(converting from a certain general format to the format needed to do
computations).
  3. Different formats and statistics have different requirements: For
instance on GenePop you don't have population names, neither the
marker itself, but for arlequin format you have partial information on
markers and full information on population names. converting the minor
differences among formats to a "general" format would be complex.

From tiagoantao at gmail.com  Thu Oct 23 11:10:51 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 23 Oct 2008 16:10:51 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
Message-ID: <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com>

On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Iterators are more difficult to implement in Ped files, because in this
> format every line of the file is an individual, so to write an iterator
> which iterates by population we will need to read at list the first row of
> every line of all the file.

GenePop works population by population. Where I a getting at, is that
different formats might have completely different strategies.
I've used a strategy with the FDist parser that it might be interesting to you:
1. I read the fdist file
2. Convert it to genepop
3. do all operations in the genepop format
4. convert back if necessary.

This might not work in your case because the ped format seems to be
more informative than the genepop format (and thus you loose
information in the conversion process). Feel free to copy and adapt my
code to your own (like split_in_pops and split_in_loci)


> I would probably use sqlalchemy to interface with this database: this is why
> I would like to implement a Population and Individual objects, it will fit
> better with relational mapping.

You can go ahead and suggest formats for Populations and Individuals.
But I strongly suspect that your proposal will be biased towards your
needs (I've suffered the same problem myself). I think that in
biopython the idea is to try to have a solution that is useful to
everybody.

Also, if you want to put some SQL in the code module code, you will
have to have approval from the maintainers of biopython. They will
send you to the BioSQL people, which will say that there is none of
their business. Been there, done that, no success.

Don't take me wrong, I am not trying to discourage you in any way. But
I think it is better to gain some experience before proposing changes
to core concepts.
I've been doing this work for 3 years now, and I am convinced that it
would be very hard for me to suggest a good representation for
populations and individuals. Even populations are very hard to address
(like, some data is geo-referenced -> called landspace genetics, and
the more traditional one is not).

My suggestion: solve you problem the best way you can (e.g., do an
independent PED parser - you can use any of my code if you want).
Solve small problems, one after another.
Trying to solve the general problem is very hard and requires lots of
long term experience.

From dalloliogm at gmail.com  Thu Oct 23 12:25:29 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 23 Oct 2008 18:25:29 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com>
Message-ID: <5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com>

On Thu, Oct 23, 2008 at 5:10 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:

> On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
> > Iterators are more difficult to implement in Ped files, because in this
> > format every line of the file is an individual, so to write an iterator
> > which iterates by population we will need to read at list the first row
> of
> > every line of all the file.
>
> GenePop works population by population. Where I a getting at, is that
> different formats might have completely different strategies.
> I've used a strategy with the FDist parser that it might be interesting to
> you:
> 1. I read the fdist file
> 2. Convert it to genepop
> 3. do all operations in the genepop format
> 4. convert back if necessary.
>
> This might not work in your case because the ped format seems to be
> more informative than the genepop format (and thus you loose
> information in the conversion process). Feel free to copy and adapt my
> code to your own (like split_in_pops and split_in_loci)
>
>
> > I would probably use sqlalchemy to interface with this database: this is
> why
> > I would like to implement a Population and Individual objects, it will
> fit
> > better with relational mapping.
>
> You can go ahead and suggest formats for Populations and Individuals.
> But I strongly suspect that your proposal will be biased towards your
> needs (I've suffered the same problem myself). I think that in
> biopython the idea is to try to have a solution that is useful to
> everybody.
>
> Also, if you want to put some SQL in the code module code, you will
> have to have approval from the maintainers of biopython. They will
> send you to the BioSQL people, which will say that there is none of
> their business. Been there, done that, no success.
>
> Don't take me wrong, I am not trying to discourage you in any way. But
> I think it is better to gain some experience before proposing changes
> to core concepts.
> I've been doing this work for 3 years now, and I am convinced that it
> would be very hard for me to suggest a good representation for
> populations and individuals. Even populations are very hard to address
> (like, some data is geo-referenced -> called landspace genetics, and
> the more traditional one is not).
>
> My suggestion: solve you problem the best way you can (e.g., do an
> independent PED parser - you can use any of my code if you want).
> Solve small problems, one after another.
> Trying to solve the general problem is very hard and requires lots of
> long term experience.
>

Well, I agree with you... I don't have any idea on how this problem could be
resolved :).
However I think it would be good to add to biopython at least some
funcionality to calculate Fst statistics and parse these file formats, at
least at the level at which BioPerl does.
What if we just translate the same functionalities and copy the population
objects from bioperl into biopython?
I realize that it won't be the perfect solution: in fact, it is the same
reason why I started this discussion here, the bioperl code wasn't optimized
enought for what I want to do, but I didn't know how to modify perl modules
and preferred python.

Maybe we can just write a PED and GenePop parser and have let it work with
GenePop and your modules to calculate Fst.
We should agree with a population object that could be used as input for
GenePop.
I think it would be good anyway to release even incomplete code to the
public, because it could be useful for other people.


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From dalloliogm at gmail.com  Thu Oct 23 12:27:22 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 23 Oct 2008 18:27:22 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com>
	<5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com>
	<320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com>
Message-ID: <5aa3b3570810230927m696f9c27gdfec084647c2c509@mail.gmail.com>

On Thu, Oct 23, 2008 at 11:57 AM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Thu, Oct 23, 2008, Giovanni Marco Dall'Olio wrote:
> > On Wed, Oct 22, Peter wrote:
> >> On Wed, Oct 22, Giovanni Marco Dall'Olio wrote:
> >> >
> >> > Iterators are more difficult to implement in Ped files, because in
> this
> >> > format every line of the file is an individual, so to write an
> iterator
> >> > which iterates by population we will need to read at list the first
> row
> >> > of every line of all the file.
> >>
> >> It sounds like for Ped files it would make more sense to iterate over
> >> the individuals.  The mental picture I have in mind is a big
> >> spreadsheet, individuals as rows (lines), populations (and other
> >> information) as columns.  By having the parser iterate over the
> >> individuals one by one, the user could then "simplify" each individual
> >> as they are read in, recording in memory just the interesting data.
> >> This way the whole dataset need not be kept in memory.
> >
> > This makes sense.
> > Basically, we should write a (Ped/GenePop)Iterator function, which should
> > read the file one line at a time, check if it a has correct syntax and is
> > not a comment, and then use 'yield' to create a Record object. Am I
> right?
>
> Yes :)
>
> Python functions written with "yield" are called  "generator functions",
> see:
> http://www.python.org/dev/peps/pep-0255/
>

So, how should we modify the current GenePop parser to make it work as an
iterator?
Now it has a 'Scanner' and 'Consumer' methods. Should I remove them and
write a RecordIterator instead?
-
http://github.com/dalloliogm/biopython---popgen/tree/master/src/PopGen/Gio/Ped/__init__.py
Can you explain me more or less how the 'Consumer' object works? It is
mandatory to use it when creating biopython objects?

p.s. do you like the doctest to show how to use the parser?


> Peter
>


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk  Thu Oct 23 13:01:26 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 23 Oct 2008 18:01:26 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810230927m696f9c27gdfec084647c2c509@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com>
	<5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com>
	<320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com>
	<5aa3b3570810230927m696f9c27gdfec084647c2c509@mail.gmail.com>
Message-ID: <320fb6e00810231001w2345bbe5r8c1727ddf883553c@mail.gmail.com>

Giovanni wrote:
> So, how should we modify the current GenePop parser to make it work as an
> iterator?

I think this would mean breaking up the current Record object (which
holds everything) into sub-records which can be yielded one by one.
This would require an API change, unless you wanted to continue to
offer the two approaches in parallel (not elegant, but see
Bio/Sequencing/Ace.py for an example of where this made sense to do).

> Now it has a 'Scanner' and 'Consumer' methods. Should I remove them and
> write a RecordIterator instead?
> ...
> Can you explain me more or less how the 'Consumer' object works? It is
> mandatory to use it when creating biopython objects?

You can write an iterator with or without the Scanner/Consumer style of parser.

The Scanner/Consumer system is very flexible if you want to parse the
data into different objects (by using different consumers).  In theory
the end user could also use the provided scanner with their own
consumer.  However, in my opinion for parsing sequence file formats
this was overkill (needlessly complicated) - as only one object is
really needed to represent a sequence (we have the SeqRecord for
this), so most of the recent parsers in Bio.SeqIO and Bio.AlignIO do
not use the scanner/consumer setup.

See also the short Tutorial section "Parser Design".
http://biopython.org/DIST/docs/tutorial/Tutorial.html

For population genetics given there is no one universal record object,
perhaps the flexibility of the Scanner/Consumer system is worth while.
 On the other hand, Tiago currently has the scanner/consumer in
Bio.PopGen.GenePop as private objects so this is currently a private
implementation detail - one could replace the Scanner/Consumer details
without breaking the public API.

Peter

From biopython at maubp.freeserve.co.uk  Fri Oct 24 04:52:25 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Oct 2008 09:52:25 +0100
Subject: [BioPython] Retrieving nucleotide sequence for given accession
	Entrez ID
In-Reply-To: <61B0EE7C247C1349881F63414448FC1F078874C1@EXEVS06.its.uncc.edu>
References: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu>
	<320fb6e00810221315i31358bc2n2e5c9be405a77e42@mail.gmail.com>
	<61B0EE7C247C1349881F63414448FC1F078874C1@EXEVS06.its.uncc.edu>
Message-ID: <320fb6e00810240152t6e6123d3la00f1fe43121b985@mail.gmail.com>

Hi Richard,

I've taken the liberty of CC'ing this back to the mailing list,

Richard Clary wrote:
> Much appreciation Peter--it worked perfectly.

Good :)

> If you are wanting to
> retrieve multiple sequences, is a simple "+" string concatenation
> sufficient as the case when using eUtils or approach it by creating
> a tuple or dictionary and passing arguments?
>
> Richard

Moving on to your multi-sequence question, using "+" doesn't
seem to work - you should use a comma for concatenating the
IDs when calling eFetch.   What made you think of "+" here?

One other tweak is that Bio.SeqIO.read(...) is for when the handle
contains one and only one record.  In general you'll need to use
Bio.SeqIO.parse(...) instead and iterate over the records.

Depending on what you want to achieve, maybe:

from Bio import Entrez, SeqIO
id_list = ["186972394","12345678"]
Entrez.email = "Richard at example.com" #Tell the NCBI who you are
handle = Entrez.efetch(db="nucleotide", id=",".join(id_list),rettype="fasta")
for id,record in zip(id_list,SeqIO.parse(handle, "fasta")) :
    assert id in record.id, "Didn't get ID %s returned!" % id
    print "%s = %s" % (record.id, record.seq)
    #seq_str = str(record.seq)

If you still want just plain strings for the sequence, maybe:

from Bio import Entrez, SeqIO
id_list = ["186972394","12345678"]
Entrez.email = "Richard at example.com" #Tell the NCBI who you are
handle = Entrez.efetch(db="nucleotide", id=",".join(id_list),rettype="fasta")
seq_str_list = [str(record.seq) for record in SeqIO.parse(handle, "fasta")]

If you haven't already done so, please read the NCBI guidelines for
using Entrez,
http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements

Also, have a look at the Entrez chapter in the tutorial, especially
the "history" support which may be relevant.
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

Peter

From dalloliogm at gmail.com  Fri Oct 24 05:08:54 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 24 Oct 2008 11:08:54 +0200
Subject: [BioPython] [bip] [OT] Revision control and databases
In-Reply-To: <49008265.3040205@gmail.com>
References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>
	<49008265.3040205@gmail.com>
Message-ID: <5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com>

On Thu, Oct 23, 2008 at 3:55 PM, Bruce Southey <bsouthey at gmail.com> wrote:

> Giovanni Marco Dall'Olio wrote:
>
>> Hi,
>> I have a question (well, it's not directly related to biopython or pygr,
>> but to scientific computing).
>>
>> I always used flat files to store results and data for my bioinformatics
>> analys, but not (as I was saying in another thread) I would like to start
>> using a database to do that.
>>
> Of course Biopython's BioSQL interface may provide a starting point.


The problem is that BioSQL doesn't support yet Population Genetics record
(see another thread in biopython mailing list), so I would have to implement
something like that in BioSQL or wait for the developers to do it.
Maybe I will do this later, but now I don't have the time.


>
>  The problem is I don't know if databases do Revision Control.
>> When I used flat files, I was used to save all the results in a git
>> repository, and, everytime something was changed or calculated again, I did
>> commit it.
>> Do you know how to do this with databases? Does MySQL provide support for
>> revision control?
>> Thanks :)
>>
> I think you are asking the wrong questions because it depends on what you
> want to do and what you actually store. There are a number of questions that
> you need to ask yourself about what you really need to do (knowing you have
> used git helps refine these). Examples include:
> How often do you use the old versions in your git repository?
> How do you use the old revisions in your git repository?
> Do you even use the information of an older version if a newer version
> exists?
> How many users that can make changes?
> How often do you have conflicts?
> Are the conflicts hard to solve?


These are all very good questions.
The problem is that I consider revision control as a 'good practice': I
remember that when I was not used to keep an history of the changes to my
data, it was a mess. I would like to have at least a 'version' field, to
know how much my data is old.

I have found this :
- http://pgfoundry.org/projects/tablelog/
which seems interesting.
I think this is a big issue for bioinformatics. How is it possible that
nobody has never tried to implement such a functionality for databases?
Version Control could be difficult to implement, but not so much. There is
must be something that I can reuse...


Do you actually determine when 'something was changed or calculated again'
> or it this partly determined by an external source like a Genbank or UniProt
> update? (At least in a database approach you could automate this.)


Well, it could be useful to


>
>
> Revision control may be overkill for your use because this is aims to
> handle many tasks and change conflicts related to multiple users rather than
> a single user.  If you don't need all these fancy features then you can use
> a database. If you just want to store and retrieve a version then you can
> use a database but you need to at least force the inclusion a date and
> comment fields to be useful.


Maybe there are other similar tools.
This is a big issue for bioinformatics. I think it is a good, when working
with

Unfortunately I think revision control would be very useful for me.
The data in the database will be used and uploaded by 4 or 5 people.
It will be used also to store the results from some script:


>
>
>
> Regards
> Bruce
>


Thank you very much for all the replies.. I didn't expect so many of them.


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk  Fri Oct 24 05:25:31 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Oct 2008 10:25:31 +0100
Subject: [BioPython] [bip] [OT] Revision control and databases
In-Reply-To: <5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com>
References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>
	<49008265.3040205@gmail.com>
	<5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com>
Message-ID: <320fb6e00810240225y1c380de5y6144a80ece808b2c@mail.gmail.com>

Giovanni Marco Dall'Olio wrote:
> Bruce Southey wrote:
>> Of course Biopython's BioSQL interface may provide a starting point.
>
> The problem is that BioSQL doesn't support yet Population Genetics record
> (see another thread in biopython mailing list), so I would have to implement
> something like that in BioSQL or wait for the developers to do it.
> Maybe I will do this later, but now I don't have the time.

BioSQL currently focuses on annotated sequences, but they are working
on some phylogenetics support too.  See http://www.biosql.org/ and the
PhyloDB extension module.  If there was enough interest, perhaps a
BioSQL schema for Population Genetics could be devised too.

Giovanni Marco Dall'Olio wrote:
>>> The problem is I don't know if databases do Revision Control.
>>> When I used flat files, I was used to save all the results in a git
>>> repository, and, everytime something was changed or calculated
>>> again, I did commit it.
>>> Do you know how to do this with databases? Does MySQL
>>> provide support for revision control?

As other people have said, databases don't generally "waste" resources
on version control.  If you need this, then it is up to you to design
your schema to record this additional metadata.  For example, the
BioSQL sequences have a "version" field in the "bioentry" table
allowing multiple revisions of the same accession to be held.  When
querying the database, you could request a particular version, or
indeed the latest version.  Essentially AFAIK database version control
is a Do-It-Yourself affair when designing your database tables.

Peter

From lpritc at scri.ac.uk  Fri Oct 24 05:51:35 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 24 Oct 2008 10:51:35 +0100
Subject: [BioPython] [bip] [OT] Revision control and databases
In-Reply-To: <5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com>
Message-ID: <C5275937.18346%lpritc@scri.ac.uk>

On 24/10/2008 10:08, "Giovanni Marco Dall'Olio" <dalloliogm at gmail.com>
wrote:

> The problem is that BioSQL doesn't support yet Population Genetics record
> (see another thread in biopython mailing list), so I would have to implement
> something like that in BioSQL or wait for the developers to do it.
> Maybe I will do this later, but now I don't have the time.

To be fair, that's a different problem from version control...

>> How often do you use the old versions in your git repository?
>> How do you use the old revisions in your git repository?
>> Do you even use the information of an older version if a newer version
>> exists?
>> How many users that can make changes?
>> How often do you have conflicts?
>> Are the conflicts hard to solve?
> 
> These are all very good questions.
> The problem is that I consider revision control as a 'good practice'

I think that you're right - it is good practice, and Bruce raises excellent
questions here: what individuals want or need from version control depends
greatly on their own situation, and whether a particular package fits your
own needs will depend on what they are.  If you don't know what they are
before choosing a package, then there's the risk of making an unsuitable
choice.

It's worth noting that revision control can also mean slightly different
things to different people.  Some might say that a version number and an ID
for the entity (human or automated) making that change is sufficient.  Some
might say that you ought not to stop short of conflict resolution and branch
control.  It depends on the needs of your project, IMO.

> I think this is a big issue for bioinformatics. How is it possible that nobody
> has never tried to implement such a functionality for databases

Databases (DBMS, to be picky) are a general-purpose solution for many
different kinds of problem.  Revision control is an inhomogeneous problem
with no optimal solution that can be implemented in many ways and not only
using DBMS.  There are plenty of revision control examples implemented in
databases, and the examples that first come to mind in Python for me are
content management systems such as Zope and Plone.  I think that BASE
implements one, but it's a long time since I looked at it.

> Unfortunately I think revision control would be very useful for me.
> The data in the database will be used and uploaded by 4 or 5 people.

Then at a minimum you may need a solution that records version changes, and
associates versions with individuals (and perhaps individual runs of
scripts).  You may also need locking and collision detection/conflict
resolution (which DBMS like MySQL and PostgreSQL support internally via
transactions; they don't generally implement version control because it
would be wasteful), depending on whether you expect that multiple people
might modify the same file at or at about the same time.

Best,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________

From cy at cymon.org  Fri Oct 24 06:46:28 2008
From: cy at cymon.org (Cymon Cox)
Date: Fri, 24 Oct 2008 11:46:28 +0100
Subject: [BioPython] BioSQL / phylodb
Message-ID: <7265d4f0810240346m456e724ax49f7c18048a29749@mail.gmail.com>

Hi All,

Ive been looking at the phylodb extension to BioSQL. Does anyone have any
python code for uploading a tree?

Cheers, C.

-- 
____________________________________________________________________

Cymon J. Cox

From biopython at maubp.freeserve.co.uk  Fri Oct 24 06:54:28 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Oct 2008 11:54:28 +0100
Subject: [BioPython] BioSQL / phylodb
In-Reply-To: <7265d4f0810240346m456e724ax49f7c18048a29749@mail.gmail.com>
References: <7265d4f0810240346m456e724ax49f7c18048a29749@mail.gmail.com>
Message-ID: <320fb6e00810240354q2b3c2a93p3c0c45b5ed48df3c@mail.gmail.com>

On Fri, Oct 24, 2008 at 11:46 AM, Cymon Cox <cy at cymon.org> wrote:
> Hi All,
>
> Ive been looking at the phylodb extension to BioSQL. Does anyone have any
> python code for uploading a tree?
>
> Cheers, C.

Not that I'm aware of, no.

Adding support to Biopython's BioSQL module to do this, and also
retrieve the data as a tree would be nice.  The Bio.Nexus.Tree class
would seem a logical representation to try and use.  As an aside,
being about to load a taxonomy from the main BioSQL taxon/taxon_name
tables as a tree might be nice too.

Peter

From kteague at bcgsc.ca  Fri Oct 24 14:32:41 2008
From: kteague at bcgsc.ca (Kevin Teague)
Date: Fri, 24 Oct 2008 11:32:41 -0700
Subject: [BioPython] [bip] [OT] Revision control and databases
In-Reply-To: <C5275937.18346%lpritc@scri.ac.uk>
References: <C5275937.18346%lpritc@scri.ac.uk>
Message-ID: <3F2B0CD4-83DF-4A88-B22D-926B97503B7C@bcgsc.ca>

>
>> I think this is a big issue for bioinformatics. How is it possible  
>> that nobody
>> has never tried to implement such a functionality for databases
>
> Databases (DBMS, to be picky) are a general-purpose solution for many
> different kinds of problem.  Revision control is an inhomogeneous  
> problem
> with no optimal solution that can be implemented in many ways and  
> not only
> using DBMS.  There are plenty of revision control examples  
> implemented in
> databases, and the examples that first come to mind in Python for me  
> are
> content management systems such as Zope and Plone.  I think that BASE
> implements one, but it's a long time since I looked at it.

The default file storage for Zope Object Database (ZODB) appends all  
new database writes, keeping older transactions on disk (similar to  
the way PostgreSQL works). Back in the day (circa 2000) Zope 2 exposed  
this database-level feature at the application level in the Zope  
Management Interface (ZMI). So you could see all past writes to the  
database, and try and revert back to an older one if desired (using  
the "undo" tab of the ZMI).

Problems with this approach included using sysadmin tools on the  
database could break application behaviour. e.g. lets say you had a  
"Document" object and a "Page Counter" object, you would wish to be  
able to view older versions of Documents, but only care about the  
current state of the Page Counters. However, if your Page Counters are  
changing like crazy and taking up tonnes of disk space and generally  
slowing down queries against the history of the database, there was no  
way to say "delete all outdated ephemeral Page Counter versions, but  
keep Document-related transactions" (especially since a Page Counter  
change and a Document change often commited in the same transaction).  
ZWiki exposed older revisions using this feature, and the accepted  
practice was to put each wiki into it's own database so that other  
forms of database maintenance didn't accidently blow away your wiki  
history ... it wasn't so pretty :P

You also had problems reverting back to just a specific revision, for  
example if you were in Revision 3 and you had changes in Revision 1  
that you wanted to go back to, but you'd made changes in Revision 2  
that referenced Revision 1, then you first had to step-back to  
Revision 2 before you could revert back to Revision 1. Even though  
Revision 2 also contained a bunch of changes that you didn't want to  
revert, that you would then manually need to later re-apply. Ug!

Zope 2 also had a Version object, you could poke a button in the UI to  
start a new "transaction" and then start making changes to code 
+content in the database. This was just implemented as a long-running  
transaction - from the point of starting to commiting a transaction  
could sometimes last for a whole month :). The problem being that when  
you finally wanted to commit the transaction to roll-out new features  
on a web site, if there were any conflicts from changes that happened  
you were hosed and would end-up copying those changes into a new  
transaction based off the latest database version and commiting that.  
It wasn't pretty :(

It has long since been acknowledged by Zope developers that exposing  
database level features at the application level is a Bad Thing(TM)!

Today there is a whole plethora of products for Zope that do some form  
of versioning, but they are all implemented at the application level.  
There is a whole plethora of products because there are many ways to  
do versioning, and the choices of how versions are managed is really  
best left up to the specific application. Some of these products  
provide reasonable APIs for implementing specific versioning within a  
specific platform - e.g Plone has a package called plone.app.iterate  
and it has APIs that use standard versioning terminology (checkin,  
checkout, working copy) for example:


class ICheckinCheckoutTool( Interface ):

     def allowCheckin( content ):
         """
         denotes whether a checkin operation can be performed on the  
content.
         """

     def allowCheckout( content ):
         """
         denotes whether a checkout operation can be performed on the  
content.
         """

     def allowCancelCheckout( content ):
         """
         denotes whether a cancel checkout operation can be performed  
on the content.
         """

     def checkin( content, checkin_messsage ):
         """
         check the working copy in, this will merge the working copy  
with the baseline
         """

     def checkout( container, content ):
         """
         """

     def cancelCheckout( content ):
         """


From sbassi at gmail.com  Fri Oct 24 21:03:43 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Fri, 24 Oct 2008 22:03:43 -0300
Subject: [BioPython] Loading dbxrefs from a gbk file
Message-ID: <b43bf2080810241803r4f49e575p16f159fae9607eb9@mail.gmail.com>

I have a genbank file like this one: http://www.pastecode.com.ar/f231664eb
I parse it with SeqIO.parse and the SeqRecord object I get is:

SeqRecord(seq=Seq('GAGAAGGACGCGCGGCCCCCAGCGCCTCTTGGGTGGCCGCCTCGGAGCATGACC...ATA',
IUPACAmbiguousDNA()), id='NM_000208.2', name='NM_000208',
description='Homo sapiens insulin receptor (INSR), transcript variant
1, mRNA.', dbxrefs=[])

If you look at lines 130 to 133 (I highlighted in yellow) of the
genbank sequence, there is cross database information (db_xref), but
it is not associated with the SeqRecord, it is an empty list.
According to http://www.biopython.org/wiki/SeqRecord, this condition
is known, but I don't understand if this is on porpuse or is a bug.
Best,
SB.


-- 
Vendo isla: http://www.genesdigitales.com/isla/
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

"It is pitch black. You are likely to be eaten by a grue." -- Zork

From biopython at maubp.freeserve.co.uk  Sat Oct 25 13:22:27 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 25 Oct 2008 18:22:27 +0100
Subject: [BioPython] Loading dbxrefs from a gbk file
In-Reply-To: <b43bf2080810241803r4f49e575p16f159fae9607eb9@mail.gmail.com>
References: <b43bf2080810241803r4f49e575p16f159fae9607eb9@mail.gmail.com>
Message-ID: <320fb6e00810251022u577d4f16wa7602c7bdc664322@mail.gmail.com>

On Sat, Oct 25, 2008 at 2:03 AM, Sebastian Bassi <sbassi at gmail.com> wrote:
> I have a genbank file like this one: http://www.pastecode.com.ar/f231664eb
> ...
> If you look at lines 130 to 133 (I highlighted in yellow) of the
> genbank sequence, there is cross database information (db_xref), but
> it is not associated with the SeqRecord, it is an empty list.

What you have highlighted is part of a gene feature, and would not be
part of the SeqRecord's db_xref list.  It should however be present in
the relevant SeqRecord feature.  Try:

print my_record.features[1]

(seeing as this is the second feature in the file, i.e. feature 1
using zero-based counting).

Peter

From tiagoantao at gmail.com  Sat Oct 25 21:04:01 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Sun, 26 Oct 2008 02:04:01 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com>
	<5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com>
Message-ID: <6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com>

[Sorry for the delay in answering]

On Thu, Oct 23, 2008 at 5:25 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:

> However I think it would be good to add to biopython at least some
> funcionality to calculate Fst statistics and parse these file formats, at
> least at the level at which BioPerl does.

Agree. Statistics is fundamental. I decided to postpone stats when I
started because I didn't want to to start with the core issue in
population genetics (being unexperienced at the start would probably
cause serious design errors). But I think now is the time.

> What if we just translate the same functionalities and copy the population
> objects from bioperl into biopython?

I don' t think the population objects in bioperl scale well. It is not
clear to me that their popgen module is a priority for them, and that
they carefully designed them (altough that might have changed in the
near past).
I also don' t believe that my own code (which I supplied you) is in
perfect shape to achieve this also.
I have to write down my ideas and send them here as soon as possible.
I will try to do it in the next couple of days at most.
The core idea is that there is no good abstract population and
individual objects, but they are also not needed.
What is needed, in my view, are file parsers and statistics.
Statistics should be organized in a systematic way. Example: all
frequency based, population-structure statistics should present the
same interface, something like:
   add_population(pop_name, individual_allele_list)
I will submit a small document for discussion very soon.

> I realize that it won't be the perfect solution: in fact, it is the same
> reason why I started this discussion here, the bioperl code wasn't optimized
> enought for what I want to do, but I didn't know how to modify perl modules
> and preferred python.

The important thing to notice is that biopython should not be
optimized to your needs or mine, it has to be general enough to
accomodate the vast majority of potential users. What I' ve always
tried, was to do things in a way that could be reused by others.


> Maybe we can just write a PED and GenePop parser and have let it work with
> GenePop and your modules to calculate Fst.

My suggestion would be for you to go ahead and do a Bio.PopGen.PED .
You could do it in the best way you see fit.
Converting from PED to genepop will make you loose information, if I
understand well (as you have SNP info on PED files, which you don' t
on genepop).
The other formats that I support (Fdist on released code and FStat on
the code that you have) are very similar (or less informative) than
genepop.
Again, my suggestion is for an independent parser. Of which you would
have absolute control as you would be the implementor.

I understand that this might lead to some duplicated code (like
split_in_pops), but repeated code is less of a problem than a generic
object that ends up being wrong in the long run.

> We should agree with a population object that could be used as input for
> GenePop.

For the reasons above I will fight a general Population object. At
least for now. I don't feel confident that we have the experience to
design one.
It is important to notice that we cannot break backward compatibility
without a very good reason. I think that a generic population object
will be severely resived in the future.
In your specific case I also think you would suffer with a population
object, as you need performance (parsing file, creating object,
extrating information from object, calculating statistic).
As I see it, it would be a smaller chain (parse, convert to statistic
family format, calculate statistic).

> I think it would be good anyway to release even incomplete code to the
> public, because it could be useful for other people.

Incomplete is OK. But I think we would be releasing wrong code. Code
that it would be redone in the future (and break interfaces with the
past versions). Also, a generic object would have performance problems
(it would have to be able to store all the information).

Well, I am ranting and not proposing a decent alternative. I will try
to write down something decent. I will try to write up a proposal
until Tuesday. I'm afraid the error is on my part: I have to write
down what is in my head so that people can discuss if it is a good
idea or not.

From tiagoantao at gmail.com  Sat Oct 25 21:34:55 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Sun, 26 Oct 2008 02:34:55 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com>
	<5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com>
	<6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com>
Message-ID: <6d941f120810251834q87495d5re558cf179356a8b0@mail.gmail.com>

I just want add on an extra comment explaining why I oppose doing an
individual object:

I have the following questions (and others) in my mind, which I don't
know the answer. I am not looking for answers to them, I am just
trying to illustrate the difficulty of the problem.

1. For a certain marker, do we store the genomic position of the
marker? Some (most) statistics don't use this information. For many
species this information is not even available. But for some
statistics this information is mandatory...
2. For a microsatellite do we store the motif and number of repeats or
the whole sequence? (see 4)
3. If one is interested in SNPs and one has the full sequences does
one store the full sequences or just the SNPs? If you store just the
SNPs then you cannot do sequence based analysis in the future (say
Tajima D). If you store everything then you are consuming memory and
cpu.
4. If one just wants to do frequency statistics (Fst), do you store
the marker or just the assign each one an ID and store the ID? It is
much cheaper to store an ID than a full sequence.

Populations
1. Support for landscape genetics? I mean geo-referentiation
2. Support for hierarchical population structure?
3. Do we cache statistics results on Population objects?


Let me take your class marker:
class Marker:
  total_heterozygotes_count = 0
  total_population_count = 0
  total_Purines_count = 0 # this could be renamed, of course
  total_Pyrimidines_count = 0

How would this be useful for microsatellites? Why purines, and if my
marker is a protein? If it is a SNP I want to know the nucleotide? And
if I am studying proteins and I want to have the aminoacid?

Dont take me wrong, I have done this path. To solve my particular
problems is not very hard. To have a framework that is usable by
everybody, it is a damn hard problem. And we dont really need to solve
it (ok, it would be nice to do things to populations in general, that
I agree). But the fundamental is: read file, calculate statistics.
That doesnt need population and individual objects.

If we end up having too many formats a consolidation step might be
needed in the future (to avoid having 10 split_in_pops). That I agree.

From sbassi at gmail.com  Mon Oct 27 00:13:47 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Mon, 27 Oct 2008 01:13:47 -0300
Subject: [BioPython] Loading dbxrefs from a gbk file
In-Reply-To: <320fb6e00810251022u577d4f16wa7602c7bdc664322@mail.gmail.com>
References: <b43bf2080810241803r4f49e575p16f159fae9607eb9@mail.gmail.com>
	<320fb6e00810251022u577d4f16wa7602c7bdc664322@mail.gmail.com>
Message-ID: <b43bf2080810262113i69ffbbe7w5979655da2f11c53@mail.gmail.com>

On Sat, Oct 25, 2008 at 2:22 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> What you have highlighted is part of a gene feature, and would not be
> part of the SeqRecord's db_xref list.  It should however be present in
> the relevant SeqRecord feature.  Try:

OK, thank you.

From lueck at ipk-gatersleben.de  Mon Oct 27 09:43:49 2008
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Mon, 27 Oct 2008 14:43:49 +0100
Subject: [BioPython] ClustaW problem upwards Biopython 1.43
Message-ID: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de>

Hi!

I just releazed, that a ClustalW alignment gives an error message under Biopython 1.44 and 1.47 whereas under 1.43 everything works fine. 

The message is the following (example of the tutorial):

Traceback (most recent call last):
  File "I:\Final\pair_align.py", line 90, in pair_align
    alignment = Clustalw.do_alignment(cline)
  File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 79, in do_alignment
    status = run_clust.close()
IOError: [Errno 0] Error

Does someone know what's the problem?

Kind regards
Stefanie

From biopython at maubp.freeserve.co.uk  Mon Oct 27 11:12:13 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 27 Oct 2008 15:12:13 +0000
Subject: [BioPython] ClustaW problem upwards Biopython 1.43
In-Reply-To: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de>
References: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00810270812le76ae75m55f53107c2572a34@mail.gmail.com>

On Mon, Oct 27, 2008 at 1:43 PM, Stefanie L?ck <lueck at ipk-gatersleben.de> wrote:
> Hi!
>
> I just releazed, that a ClustalW alignment gives an error message under Biopython 1.44 and 1.47 whereas under 1.43 everything works fine.
>
> The message is the following (example of the tutorial):
>
> Traceback (most recent call last):
>  File "I:\Final\pair_align.py", line 90, in pair_align
>    alignment = Clustalw.do_alignment(cline)
>  File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 79, in do_alignment
>    status = run_clust.close()
> IOError: [Errno 0] Error
>
> Does someone know what's the problem?

There were some changes made between Biopython 1.43 and 1.44 to try
and deal with spaces in filenames.  Could you do:

print str(cline)

That should show the exact command line python is trying to run.  What
happens if you try this command at the "DOS" prompt?

Also, what version of clustalw do you have installed?

Thanks,

Peter


From biopython at maubp.freeserve.co.uk  Mon Oct 27 13:49:59 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 27 Oct 2008 17:49:59 +0000
Subject: [BioPython] Deprecating Bio.mathfns, Bio.stringfns and Bio.listfns?
Message-ID: <320fb6e00810271049t2aa3fac4s1907027307b035f1@mail.gmail.com>

Dear Biopythoneers,

Is anyone currently using Bio.mathfns, Bio.stringfns or Bio.listfns?
These provide a selection of maths, string and list functions - some
of which are apparently irrelevant with changes or additions to python
itself (e.g. sets).

I'd like to declare these as deprecated for the next release, or at
least obsolete and likely to be deprecated in future - so if you are
using these modules or would like to defend them, please speak up
soon.

Thanks,

Peter

P.S. If you care about the details, there is a longer discussion on
the dev-mailing list:
http://lists.open-bio.org/pipermail/biopython-dev/2008-October/004472.html

From biopython at maubp.freeserve.co.uk  Mon Oct 27 13:57:20 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 27 Oct 2008 17:57:20 +0000
Subject: [BioPython] Deprecating the obsolete Bio.Ndb module?
Message-ID: <320fb6e00810271057l181cbb1fw15aa8f03e4159328@mail.gmail.com>

Dear Biopythoneers,

The Bio.Ndb module (written six years ago) provides an HTML parser for
the NDB website (nucleotide database, a repository of
three-dimensional structural information about nucleic acids).

The URL has changed, but this service is still running.  However, the
webpage layout has changed considerably - Their front page mentions a
major revision in Jan 2008.

Unless anyone would like to volunteer to look after the Bio.Ndb module
and bring it up to date, I'm suggesting we deprecate it for the next
release of Biopython.

Peter

From lueck at ipk-gatersleben.de  Tue Oct 28 04:10:25 2008
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Tue, 28 Oct 2008 09:10:25 +0100
Subject: [BioPython] ClustaW problem upwards Biopython 1.43
References: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de>
	<320fb6e00810270812le76ae75m55f53107c2572a34@mail.gmail.com>
Message-ID: <001a01c938d4$a19887c0$1022a8c0@ipkgatersleben.de>

Hi!

>>> print str(cline)
clustalw  pb.fasta -OUTFILE=test2.aln

I'm using CLUSTAL W 2.0.

Under DOS everything works fine.

Regards
Stefanie


----- Original Message ----- 
From: "Peter" <biopython at maubp.freeserve.co.uk>
To: "Stefanie L?ck" <lueck at ipk-gatersleben.de>
Cc: <biopython at lists.open-bio.org>
Sent: Monday, October 27, 2008 4:12 PM
Subject: Re: [BioPython] ClustaW problem upwards Biopython 1.43


On Mon, Oct 27, 2008 at 1:43 PM, Stefanie L?ck <lueck at ipk-gatersleben.de> 
wrote:
> Hi!
>
> I just releazed, that a ClustalW alignment gives an error message under 
> Biopython 1.44 and 1.47 whereas under 1.43 everything works fine.
>
> The message is the following (example of the tutorial):
>
> Traceback (most recent call last):
>  File "I:\Final\pair_align.py", line 90, in pair_align
>    alignment = Clustalw.do_alignment(cline)
>  File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 79, 
> in do_alignment
>    status = run_clust.close()
> IOError: [Errno 0] Error
>
> Does someone know what's the problem?

There were some changes made between Biopython 1.43 and 1.44 to try
and deal with spaces in filenames.  Could you do:

print str(cline)

That should show the exact command line python is trying to run.  What
happens if you try this command at the "DOS" prompt?

Also, what version of clustalw do you have installed?

Thanks,

Peter


From dalloliogm at gmail.com  Tue Oct 28 06:46:39 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 28 Oct 2008 11:46:39 +0100
Subject: [BioPython] a common repository for test datasets/use cases for all
	Bio* projects
Message-ID: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com>

Hi,
I would like to make you a proposal.
Every module/program written in bioinformatics needs to be tested
before it can be used to produce results that can be published.

For example, let's say I want to write another fasta file parser, like
SeqIO.FastaIO in biopython : I would have have to test the script
against some real fasta files, just to make sure that it doesn't parse
them in a wrong way, or that it losts data.
Or, let's say I want to write a script to calculate Fst statistics
over some population genetics data: I will have to compare the results
of my scripts against other programs, check if it gives me the right
result for a set for which I already know the Fst value, and maybe
ideate some other kind of checks to be sure my script doesn't do weird
things, like losing input data on the way.

So, the point is.. what if we create a common repository for all this
kind of testing data, to be used in common with all the other Bio*
projects?
Wouldn't it be good if all the Bio* fasta parser are able to parse the
same files and give the same results, demonstrating that all of them
work fine or are wrong at the same time?

I am doing this because me (and Tiago) would like to develop a module
to calculate Fst statistics over SNP data, and there is no point of
collecting some good test datasets and not sharing them with other
similar projects in other programming languages.

The same goes for much of the documentation, like use cases: if we
collect a good base of use cases related to bioinformatics, it would
be easier to coordinate the efforts of all the Bio* projects and
compare the different approaches used to solve the same issue by the
different comunities.

At the moment, I have created a simple git repository on github:
- http://github.com/dalloliogm/bio-test-datasets-repository
but , it is still empty and maybe github is not the ideal hosting for
such a project, since the free account has a 100MB space limit.


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk  Tue Oct 28 06:55:04 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Oct 2008 10:55:04 +0000
Subject: [BioPython] a common repository for test datasets/use cases for
	all Bio* projects
In-Reply-To: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com>
References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com>
Message-ID: <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com>

On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Hi,
> I would like to make you a proposal.
> Every module/program written in bioinformatics needs to be tested
> before it can be used to produce results that can be published.
> ...
> So, the point is.. what if we create a common repository for all this
> kind of testing data, to be used in common with all the other Bio*
> projects?

You you made some other good points, and this is a good idea.  In
practice the licences are usually OK for use to "borrow" example input
files from each other (and this does happen), but a more organised
system to encourage interchange of examples would be good.

I think this sounds like an excellent topic for the (currently very
quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev
discussion, one of the OBF mailing lists, this should cover all the
Bio* project members interested).  See
http://lists.open-bio.org/mailman/listinfo

Peter

From dalloliogm at gmail.com  Tue Oct 28 07:00:42 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 28 Oct 2008 12:00:42 +0100
Subject: [BioPython] a common repository for test datasets/use cases for
	all Bio* projects
In-Reply-To: <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com>
References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com>
	<320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com>
Message-ID: <5aa3b3570810280400t510468d1sbce5bb0977ec772b@mail.gmail.com>

On Tue, Oct 28, 2008 at 11:55 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio
>
> I think this sounds like an excellent topic for the (currently very
> quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev
> discussion, one of the OBF mailing lists, this should cover all the
> Bio* project members interested).  See
> http://lists.open-bio.org/mailman/listinfo
>
> Peter


Thanks!!
I didn't know of this list!!

>


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk  Tue Oct 28 07:20:21 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Oct 2008 11:20:21 +0000
Subject: [BioPython] ClustalW problem upwards Biopython 1.43
Message-ID: <320fb6e00810280420t75f62774x55335e8a5aa11151@mail.gmail.com>

Stephanie wrote:
>
>>>> print str(cline)
>
> clustalw  pb.fasta -OUTFILE=test2.aln
>
> I'm using CLUSTAL W 2.0.

Are you sure?  The Clustal W 2.0 executable is normally called
clustalw2.exe rather than clustalw.exe - so based on the command line
above I would have expect Clustalw 1.x to be used.  Maybe you have
both versions of ClustalW installed?

Could you tell me where exactly (full paths) you have Clustalw.exe
and/or Clustalw2.exe installed?  This would be helpful for the new
unit test I'm working on.

> Under DOS everything works fine.

I've been having "fun" trying to get a new unit test for this to work
nicely on Windows - there a certainly some combinations of file name
arguments with spaces etc which won't work on Biopython 1.48.  I found
examples where the command line string ran "by hand" at the "DOS"
prompt worked fine, but would fail when invoked in python via os.popen
- on the bright side, using subprocess.Popen instead works much better
(although this isn't available for python 2.3).

If you want to try this new code, I would suggest you first install
Biopython 1.48, and then backup and update
C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py to revision
1.25 from CVS which you can download here (should be updated within
the hour):
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Clustalw/__init__.py?cvsroot=biopython

Thanks!

Peter

From peter at maubp.freeserve.co.uk  Tue Oct 28 07:36:15 2008
From: peter at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Oct 2008 11:36:15 +0000
Subject: [BioPython] Dropping Python 2.3 support?
Message-ID: <320fb6e00810280436m7cf48993v8b0562bb44919128@mail.gmail.com>

Dear all,

Those of you following the dev-mailing list will probably be aware
that we've been making excellent progress in CVS to get Biopython to
run fine on Python 2.6.  However, the downside is that continuing to
support Python 2.3 is beginning to be pain (triggered for the most
part by some older modules being deprecated in python 2.6).

Does anyone on the mailing list still use Python 2.3?  e.g. older
Linux servers, or people still using Apple Mac OS X 10.4 Tiger (or
older).

What I'd like to suggest is that the next one or two releases will
still support Python 2.3, but after that we'll drop support for Python
2.3.

Thanks,

Peter

P.S. For the record, until recently my main Windows machine ran Python
2.3 only - giving me a vested interesting in continuing Python 2.3
support ;)

From jblanca at btc.upv.es  Tue Oct 28 07:52:29 2008
From: jblanca at btc.upv.es (Jose Blanca)
Date: Tue, 28 Oct 2008 12:52:29 +0100
Subject: [BioPython] caf format support
Message-ID: <200810281252.29607.jblanca@btc.upv.es>

Hi,
I'm currently dealing with caf contig files. Has BioPython support for this 
format? Do you know of other alternatives in python or perl to deal with it?
Best regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk  Tue Oct 28 08:16:33 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Oct 2008 12:16:33 +0000
Subject: [BioPython] caf format support
In-Reply-To: <200810281252.29607.jblanca@btc.upv.es>
References: <200810281252.29607.jblanca@btc.upv.es>
Message-ID: <320fb6e00810280516j72af2c70q46790c217585b2c5@mail.gmail.com>

On Tue, Oct 28, 2008 at 11:52 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> Hi,
> I'm currently dealing with caf contig files. Has BioPython support for this
> format? Do you know of other alternatives in python or perl to deal with it?
> Best regards,

I'm not aware of any Biopython code for CAF contig files.  However,
have a look at http://www.sanger.ac.uk/Software/formats/CAF/userguide.shtml
where some perl tools are described, including some for converting CAF
into other formats.

We do have ACE and PHRED (used by PHRAP) parsers in Bio.Sequencing, so
adding Bio.Sequencing.CAF might be logical.

Peter

From cjfields at illinois.edu  Tue Oct 28 08:26:32 2008
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 28 Oct 2008 07:26:32 -0500
Subject: [BioPython] a common repository for test datasets/use cases for
	all Bio* projects
In-Reply-To: <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com>
References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com>
	<320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com>
Message-ID: <C58B47EC-54F4-42AB-8B93-2678D5D90B4F@illinois.edu>

All,

An open-bio repository had started up for this use at one point,  
though I don't think it made the transition to subversion yet (and it  
never really took off, not sure why).  You should try contacting open- 
bio support and maybe Jason or Chris D. can answer this in a bit more  
detail.

chris

On Oct 28, 2008, at 5:55 AM, Peter wrote:

> On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
>> Hi,
>> I would like to make you a proposal.
>> Every module/program written in bioinformatics needs to be tested
>> before it can be used to produce results that can be published.
>> ...
>> So, the point is.. what if we create a common repository for all this
>> kind of testing data, to be used in common with all the other Bio*
>> projects?
>
> You you made some other good points, and this is a good idea.  In
> practice the licences are usually OK for use to "borrow" example input
> files from each other (and this does happen), but a more organised
> system to encourage interchange of examples would be good.
>
> I think this sounds like an excellent topic for the (currently very
> quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev
> discussion, one of the OBF mailing lists, this should cover all the
> Bio* project members interested).  See
> http://lists.open-bio.org/mailman/listinfo
>
> Peter
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign


From bsouthey at gmail.com  Tue Oct 28 09:56:34 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Tue, 28 Oct 2008 08:56:34 -0500
Subject: [BioPython] a common repository for test datasets/use cases for
 all Bio* projects
In-Reply-To: <C58B47EC-54F4-42AB-8B93-2678D5D90B4F@illinois.edu>
References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com>	<320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com>
	<C58B47EC-54F4-42AB-8B93-2678D5D90B4F@illinois.edu>
Message-ID: <49071A12.8060705@gmail.com>

Chris Fields wrote:
> All,
>
> An open-bio repository had started up for this use at one point, 
> though I don't think it made the transition to subversion yet (and it 
> never really took off, not sure why).  You should try contacting 
> open-bio support and maybe Jason or Chris D. can answer this in a bit 
> more detail.
>
> chris
>
> On Oct 28, 2008, at 5:55 AM, Peter wrote:
>
>> On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio
>> <dalloliogm at gmail.com> wrote:
>>> Hi,
>>> I would like to make you a proposal.
>>> Every module/program written in bioinformatics needs to be tested
>>> before it can be used to produce results that can be published.
>>> ...
>>> So, the point is.. what if we create a common repository for all this
>>> kind of testing data, to be used in common with all the other Bio*
>>> projects?
>>
>> You you made some other good points, and this is a good idea.  In
>> practice the licences are usually OK for use to "borrow" example input
>> files from each other (and this does happen), but a more organised
>> system to encourage interchange of examples would be good.
>>
>> I think this sounds like an excellent topic for the (currently very
>> quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev
>> discussion, one of the OBF mailing lists, this should cover all the
>> Bio* project members interested).  See
>> http://lists.open-bio.org/mailman/listinfo
>>
>> Peter
>> _______________________________________________
>> BioPython mailing list  -  BioPython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Marie-Claude Hofmann
> College of Veterinary Medicine
> University of Illinois Urbana-Champaign
>
>
>
>
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
There had been some discussion on scipy lists on data sets that you 
should look for.

One of the most critical questions that you must address is copyright 
and who owns the data sets (credit where credit is due). Ultimately any 
data will be distributable in some form and thus really brings in 
copyright issues and such. This is also country specific because there 
is the question of whether or not a data set can be copyrighted and the 
terms of it - not a lawyer to know this. The Science Commons has various 
other useful information especially the FAQ on databases, 
http://sciencecommons.org/resources/faq/databases/, that states "In the 
United States, data will be protected by copyright only if they express 
creativity".

I do believe you would need to be very strict on what is acceptable 
because if it is distributable you can not rely on the user being 
responsible:
1) If has been used for publication, an extremely clear statement of the 
owner (publisher) that it can be made available is required.
2) If the data is created from publicly available sources that allow it 
eg Uniprot (http://www.uniprot.org/help/license) then exact recreatable 
sets must be made available so the data can be exactly obtained from 
that source (must include the specific release as databases change).
3) If the data is from private sources then it must be released on a 
suitable license that can not be superseded by publication or change in 
ownership.

Also, the submitted data should not change even if there are errors. For 
example, Fisher's iris data at 
http://archive.ics.uci.edu/ml/datasets/Iris has  documented errors. 
Rather it would be better to use version numbers.

Regards
Bruce


From biopython at maubp.freeserve.co.uk  Tue Oct 28 11:04:21 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Oct 2008 15:04:21 +0000
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810200222s641e165eqef3b209893a8d976@mail.gmail.com>
References: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com>
	<C522096F.17F00%lpritc@scri.ac.uk>
	<320fb6e00810200222s641e165eqef3b209893a8d976@mail.gmail.com>
Message-ID: <320fb6e00810280804k1ef53ec1od53c33915da61c3@mail.gmail.com>

On 20th Oct I wrote:
> Of course, someone is still bound to try calling the [Seq object's]
> translate method with a string mapping.  Maybe we should add a
> bit of defensive code to check the table argument, and print a
> helpful error message when this happens?

I've just added that in CVS, if the table argument is a 256 character
string then a ValueError is raised suggesting using
str(my_seq).translate(...) instead.

Peter

From biopython at maubp.freeserve.co.uk  Tue Oct 28 13:17:36 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Oct 2008 17:17:36 +0000
Subject: [BioPython] Should Bio.SeqIO.write(...) return the number of
	records?
Message-ID: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com>

Dear all,

I wanted to get some feedback on a possible enhancement to the
Bio.SeqIO.write(...) and Bio.AlignIO.write(...) functions to make them
return number of records/alignments written to the handle.  I've filed
enhancement Bug 2628 to track this idea.
http://bugzilla.open-bio.org/show_bug.cgi?id=2628

When creating a sequence (or alignment) file, it is sometimes useful
to know how many records (or alignments) were written out.  This is
easy if your records are in a list:

records = list(...)
SeqIO.write(records, handle, format)
print "Wrote %i records" % len(records)

If however your records are from a generator/iterator (e.g. a
generator expression, or some other iterator) you cannot use
len(records).  You could turn this into a list just to count them, but
this wastes memory.  It would therefore be useful to have the count
returned:

records = some_generator
count = SeqIO.write(records, handle, format)
print "Wrote %i records" % count

Currently Bio.SeqIO.write(...) and Bio.AlignIO.write(...) have no
return value, so adding a return value would be a backwards compatible
enhancement.  For a precedent, the BioSQL loader returns the number of
records loaded into the database.

Peter

From sbassi at gmail.com  Tue Oct 28 13:43:27 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Tue, 28 Oct 2008 14:43:27 -0300
Subject: [BioPython] Should Bio.SeqIO.write(...) return the number of
	records?
In-Reply-To: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com>
References: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com>
Message-ID: <b43bf2080810281043r515e13aega1d609f3165e0b69@mail.gmail.com>

On Tue, Oct 28, 2008 at 2:17 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> count = SeqIO.write(records, handle, format)
> print "Wrote %i records" % count

I'm for it. It doesn't hurt adding a backward compatible feature.

From biopython at maubp.freeserve.co.uk  Tue Oct 28 14:16:58 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Oct 2008 18:16:58 +0000
Subject: [BioPython] Should Bio.SeqIO.write(...) return the number of
	records?
In-Reply-To: <b43bf2080810281043r515e13aega1d609f3165e0b69@mail.gmail.com>
References: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com>
	<b43bf2080810281043r515e13aega1d609f3165e0b69@mail.gmail.com>
Message-ID: <320fb6e00810281116u6460c62fs77ece727689fba3b@mail.gmail.com>

Sebastian Bassi wrote:
>
> I'm for it. It doesn't hurt adding a backward compatible feature.
>

Well adding an unused feature does increase the long term maintainence
load - but if we agree this does seem useful, that's fine.

Also settling on the record/alignment count as the return value
prevents any future alternative.  But right now I can't think of any
other sensible return value.

I've written a patch against CVS to implement this - see Bug 2628 for details.

Peter

From tiagoantao at gmail.com  Thu Oct 30 17:36:00 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 30 Oct 2008 21:36:00 +0000
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810251834q87495d5re558cf179356a8b0@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com>
	<5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com>
	<6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com>
	<6d941f120810251834q87495d5re558cf179356a8b0@mail.gmail.com>
Message-ID: <6d941f120810301436m4bf12385s99d726bb000f7dd4@mail.gmail.com>

Hi,

FYI, I am going to continue this discussion to biopython-dev, as I
think it makes more sense there. Especially the parts about
implementation suggestions.

On Sun, Oct 26, 2008 at 1:34 AM, Tiago Ant?o <tiagoantao at gmail.com> wrote:
> I just want add on an extra comment explaining why I oppose doing an
> individual object:
>
> I have the following questions (and others) in my mind, which I don't
> know the answer. I am not looking for answers to them, I am just
> trying to illustrate the difficulty of the problem.
>
> 1. For a certain marker, do we store the genomic position of the
> marker? Some (most) statistics don't use this information. For many
> species this information is not even available. But for some
> statistics this information is mandatory...
> 2. For a microsatellite do we store the motif and number of repeats or
> the whole sequence? (see 4)
> 3. If one is interested in SNPs and one has the full sequences does
> one store the full sequences or just the SNPs? If you store just the
> SNPs then you cannot do sequence based analysis in the future (say
> Tajima D). If you store everything then you are consuming memory and
> cpu.
> 4. If one just wants to do frequency statistics (Fst), do you store
> the marker or just the assign each one an ID and store the ID? It is
> much cheaper to store an ID than a full sequence.
>
> Populations
> 1. Support for landscape genetics? I mean geo-referentiation
> 2. Support for hierarchical population structure?
> 3. Do we cache statistics results on Population objects?
>
>
> Let me take your class marker:
> class Marker:
>  total_heterozygotes_count = 0
>  total_population_count = 0
>  total_Purines_count = 0 # this could be renamed, of course
>  total_Pyrimidines_count = 0
>
> How would this be useful for microsatellites? Why purines, and if my
> marker is a protein? If it is a SNP I want to know the nucleotide? And
> if I am studying proteins and I want to have the aminoacid?
>
> Dont take me wrong, I have done this path. To solve my particular
> problems is not very hard. To have a framework that is usable by
> everybody, it is a damn hard problem. And we dont really need to solve
> it (ok, it would be nice to do things to populations in general, that
> I agree). But the fundamental is: read file, calculate statistics.
> That doesnt need population and individual objects.
>
> If we end up having too many formats a consolidation step might be
> needed in the future (to avoid having 10 split_in_pops). That I agree.
>


-- 
"Data always beats theories. 'Look at data three times and then come
to a conclusion,' versus 'coming to a conclusion and searching for
some data.' The former will win every time."
?Matthew Simmons,
http://www.tiago.org


From pingou at pingoured.fr  Fri Oct 31 12:29:27 2008
From: pingou at pingoured.fr (Pierre-Yves)
Date: Fri, 31 Oct 2008 17:29:27 +0100
Subject: [BioPython] Sequence graph
Message-ID: <490B3267.5020501@pingoured.fr>

Dear list,

I am sorry to come here to ask this question that must have been already 
asked in the past, but my search have been rather unsuccessful...

I would like to reproduce such graph:
http://www.bioperl.org/wiki/HOWTO:Graphics#Improving_the_Image but even 
if bioperl is nice I would like to do it through BioPython.

I have thus two questions :
* Is that possible ?
* Could someone point me to an example ?

Thanks in advance for your help,

Best regards,

Pierre

From bsouthey at gmail.com  Wed Oct 22 17:02:18 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Wed, 22 Oct 2008 21:02:18 -0000
Subject: [BioPython] back-translation method for Seq object?
In-Reply-To: <320fb6e00810210859n1b922e7emd6a7456abd79cdc7@mail.gmail.com>
References: <C5221012.17F0D%lpritc@scri.ac.uk> <48FDE37B.5040301@gmail.com>	
	<320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com>
	<320fb6e00810210859n1b922e7emd6a7456abd79cdc7@mail.gmail.com>
Message-ID: <48FF951C.4030700@gmail.com>

Hi,
Some of the neat things about Python is how easy it is to modify your 
own code and adapt others code into yours.

So here is some code (under the BSD license)  that may be useful on 
this. This is a simple back or reverse translation code with many of the 
things that I have been 'talking' about. This should be self-contained 
and works on Linux system with Python2.3+. It is oriented around an 
peptide sequence 'AFLFQPQRFGR' but hopefully is more general (I have not 
tested that).

a) Convert an amino acid sequence into both a regular expression or DNA 
sequence involving ambiguous codes. There are functions to convert the 
regular expression or DNA sequence involving ambiguous codes back to a 
protein sequence since neither of these are standard.

b) Regular expression search on a list of sequences in fasta format.

c) Obtain all possible DNA sequences from an regular expression form of 
the amino acid sequence. Obviously this is very large as for the above 
sequence there are  442368 combinations (but Python is fairly quick... 
about 10 seconds on my opteron  270 system bogomips =3991.08)


Enjoy
Bruce
-------------- next part --------------
A non-text attachment was scrubbed...
Name: reverse_trans.py
Type: text/x-python
Size: 10661 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20081022/248001f3/attachment-0001.py>

From mjldehoon at yahoo.com  Wed Oct  1 12:18:24 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 1 Oct 2008 05:18:24 -0700 (PDT)
Subject: [BioPython] Bio.distance
Message-ID: <924102.72843.qm@web62403.mail.re1.yahoo.com>

Hi everybody,

Since the 1.48 release, Biopython has been making good progress in the migration from Numerical Python to NumPy. As part of this process, we are now reviewing and consolidating the code in Biopython that makes use of Numerical Python / NumPy. Specifically, we are thinking to merge the code in Bio.distance into Bio.kNN, and to deprecate Bio.distance and Bio.cdistance. Since Bio.kNN is the only Biopython module in Biopython that makes use of Bio.distance, we think that this won't affect anybody. However, if you are using Bio.distance outside of Bio.kNN, please let us know so we can find an alternative solution.

--Michiel.


From bsouthey at gmail.com  Wed Oct  1 15:49:53 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Wed, 01 Oct 2008 10:49:53 -0500
Subject: [BioPython] Bio.distance
In-Reply-To: <924102.72843.qm@web62403.mail.re1.yahoo.com>
References: <924102.72843.qm@web62403.mail.re1.yahoo.com>
Message-ID: <48E39C21.8010603@gmail.com>

Michiel de Hoon wrote:
> Hi everybody,
>
> Since the 1.48 release, Biopython has been making good progress in the migration from Numerical Python to NumPy. As part of this process, we are now reviewing and consolidating the code in Biopython that makes use of Numerical Python / NumPy. Specifically, we are thinking to merge the code in Bio.distance into Bio.kNN, and to deprecate Bio.distance and Bio.cdistance. Since Bio.kNN is the only Biopython module in Biopython that makes use of Bio.distance, we think that this won't affect anybody. However, if you are using Bio.distance outside of Bio.kNN, please let us know so we can find an alternative solution.
>
> --Michiel.
>
>
>       
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>   
Hi,
Under the 'standard' install I do not think that there is any advantage 
of using Bio.cdistance within Bio.kNN. I tested this on a bioinformatics 
data set with almost 1500 data points, 8 explanatory variables and k=9. 
I only got a one second difference between using Bio.cdistance or 
commenting it out on my system (after removing the build directory and 
reinstalling everything). Actual maximum times across three runs were 
under 16.6 seconds with it and under 17.4 seconds without it.

My system runs linux x86_64 (fedora 10) but it is not a 'clean' system 
due to other cpu intensive processes running. I used Python 2.5 and 
Numeric 2.4 as I forgot the order of imports. In my version the default 
distance without Bio.cdistance uses the Numeric dot (I did not try the 
python version) so I would expect this to be noticeably faster if lapack 
or atlas are installed than if these are not present.  (I used Fedora 
supplied Numeric so while I think this timing is without lapack and 
atlas I am not completely sure of that.)

I did not see an examples for k-nearest neighbor so below is (very bad) 
code using the logistic regression example 
(http://biopython.org/DIST/docs/cookbook/LogisticRegression.html).

Regards
Bruce


from Bio import kNN
xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30], [11, 
-220.94], [85, -193.94], [16, -182.71], [15, -180.41], [-26, -181.73], 
[58, -259.87], [126, -414.53], [191, -249.57], [113, -265.28], [145, 
-312.99], [154, -213.83], [147, -380.85], [93, -291.13]]
ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
model = kNN.train(xs, ys, 3)
ccr=0
tobs=0
for px, py in zip(xs, ys):
        cp=kNN.classify(model, px)
        tobs +=1
        if cp==py:
                ccr +=1
print tobs, ccr


From biopython at maubp.freeserve.co.uk  Wed Oct  1 15:52:05 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Oct 2008 16:52:05 +0100
Subject: [BioPython] More string methods for the Seq object
In-Reply-To: <320fb6e00809290506p8aa2b51p4901b693ebb268bf@mail.gmail.com>
References: <320fb6e00809260859r23c7915buc114c5c0b71e195@mail.gmail.com>
	<48DD2DE6.10908@gmail.com>
	<320fb6e00809261422n6e4c4889p734508613898cc3f@mail.gmail.com>
	<48DD59DF.1000504@gmail.com>
	<320fb6e00809261457j65dc0876hd59d17aee01bc983@mail.gmail.com>
	<bbcd77d00809261855oecdcfc4w7b762b45a0aa7dc4@mail.gmail.com>
	<320fb6e00809270557n73b81b5ayb93fe85f0f466626@mail.gmail.com>
	<bbcd77d00809271806x16457c7dyc4812dbd09d5f8a9@mail.gmail.com>
	<320fb6e00809290450m6fedbaacu15a75107e5c39658@mail.gmail.com>
	<320fb6e00809290506p8aa2b51p4901b693ebb268bf@mail.gmail.com>
Message-ID: <320fb6e00810010852j5cf8e3ak7dc788372568251f@mail.gmail.com>

On Mon, Sep 29, 2008 at 1:06 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> I assume you [Bruce] are agreeing with ... follow[ing] the
>> string defaults of white space for stipping or splitting (for
>> consistency, even though this won't typically be useful for
>> sequences).  On balance this would probably be best from
>> a principle of consistency and least surprise for the user -
>> I'll update the patches.
>
> New patch for Seq object split, strip, lstrip and rstrip methods on
> Bug 2596 which follows the python string defaults (splitting on or
> stripping of white space).
> http://bugzilla.open-bio.org/show_bug.cgi?id=2596

There is now a second version of this patch on that bug, which will
also accept Seq objects as arguments to the split, strip, lstrip and
rstrip methods, plus has the start of some tests too.

We (Peter, Martin, Bruce and Leighton) seem to have reached an
agreement about adding split, strip, lstrip and rstrip methods to the
Seq object with the behaviour (arguments and defaults) to follow those
of the python string as closely as possible.

I'd like to encourage others lurking on the list to comment too, but
unless anyone objects, I intend to add these methods in CVS this week,
together with an updated unit test and updates to the tutorial.

Peter


From biopython at maubp.freeserve.co.uk  Wed Oct  1 16:03:22 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Oct 2008 17:03:22 +0100
Subject: [BioPython] Bio.distance
In-Reply-To: <48E39C21.8010603@gmail.com>
References: <924102.72843.qm@web62403.mail.re1.yahoo.com>
	<48E39C21.8010603@gmail.com>
Message-ID: <320fb6e00810010903u253c6384ld401e1a771ee141e@mail.gmail.com>

On Wed, Oct 1, 2008 at 4:49 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>
> Hi,
> Under the 'standard' install I do not think that there is any advantage of
> using Bio.cdistance within Bio.kNN. I tested this on a bioinformatics data
> set with almost 1500 data points, 8 explanatory variables and k=9. ...
> Actual maximum times across three runs were under 16.6 seconds with
> it [Bio.cdistance] and under 17.4 seconds without it [Bio.distance using
> Numeric]

Its interesting that the C version is only slightly faster than
Numeric - of course as you point out there are lots of possible
complications here like lapack and atlas (plus compiler options and
CPU features).

I think your numbers are good support for Michiel's proposition that
we should deprecate Bio.cdistance and Bio.distance and just use numpy
in Bio.kNN - this will simplify our code base and make very little
difference to the speed.

Peter


From biopython at maubp.freeserve.co.uk  Wed Oct  1 16:17:10 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Oct 2008 17:17:10 +0100
Subject: [BioPython]  Bio.kNN documentation
Message-ID: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com>

Bruce wrote:
> I tested this [Bio.kNN] on a bioinformatics data set with almost 1500
> data points, 8 explanatory variables and k=9. ...

Do you think this larger example could be adapted into something for
the Biopython documentation?  Otherwise the next bit of code looks
interesting.

> I did not see an examples for k-nearest neighbor so below is (very bad)
> code using the logistic regression example
> (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html).

This is a set of Bacillus subtilis gene pairs for which the operon
structure is known, with the intergene distance and gene expression
score as explanatory variables, with the class being same operon or
different operons.

> from Bio import kNN
> xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30], [11,
> -220.94], [85, -193.94], [16, -182.71], [15, -180.41], [-26, -181.73], [58,
> -259.87], [126, -414.53], [191, -249.57], [113, -265.28], [145, -312.99],
> [154, -213.83], [147, -380.85], [93, -291.13]]
> ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
> model = kNN.train(xs, ys, 3)
> ccr=0
> tobs=0
> for px, py in zip(xs, ys):
>       cp=kNN.classify(model, px)
>       tobs +=1
>       if cp==py:
>               ccr +=1
> print tobs, ccr

Could you expand on the cryptic variable names?  ccr = correct call
rate? tobs = total observations?

Coupled with a scatter plot (say with pylab, showing the two classes
in different colours), this could be turned into a nice little example
for the cookbook section of the tutorial.  Notice that later on in the
logistic regression example there is a second table of "test data"
which could be used to make de novo predictions.

Thanks,

Peter


From bsouthey at gmail.com  Wed Oct  1 18:40:41 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Wed, 01 Oct 2008 13:40:41 -0500
Subject: [BioPython] Bio.kNN documentation
In-Reply-To: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com>
References: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com>
Message-ID: <48E3C429.1020004@gmail.com>

Peter wrote:
> Bruce wrote:
>   
>> I tested this [Bio.kNN] on a bioinformatics data set with almost 1500
>> data points, 8 explanatory variables and k=9. ...
>>     
>
> Do you think this larger example could be adapted into something for
> the Biopython documentation?  Otherwise the next bit of code looks
> interesting.
>
>   
>> I did not see an examples for k-nearest neighbor so below is (very bad)
>> code using the logistic regression example
>> (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html).
>>     
>
> This is a set of Bacillus subtilis gene pairs for which the operon
> structure is known, with the intergene distance and gene expression
> score as explanatory variables, with the class being same operon or
> different operons.
>
>   
>> from Bio import kNN
>> xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30], [11,
>> -220.94], [85, -193.94], [16, -182.71], [15, -180.41], [-26, -181.73], [58,
>> -259.87], [126, -414.53], [191, -249.57], [113, -265.28], [145, -312.99],
>> [154, -213.83], [147, -380.85], [93, -291.13]]
>> ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
>> model = kNN.train(xs, ys, 3)
>> ccr=0
>> tobs=0
>> for px, py in zip(xs, ys):
>>       cp=kNN.classify(model, px)
>>       tobs +=1
>>       if cp==py:
>>               ccr +=1
>> print tobs, ccr
>>     
>
> Could you expand on the cryptic variable names?  ccr = correct call
> rate? tobs = total observations?
>
> Coupled with a scatter plot (say with pylab, showing the two classes
> in different colours), this could be turned into a nice little example
> for the cookbook section of the tutorial.  Notice that later on in the
> logistic regression example there is a second table of "test data"
> which could be used to make de novo predictions.
>
> Thanks,
>
> Peter
>
>   
I did realize that this was coming... :-)
(I guess I am volunteering myself to provide some material on machine 
learning with BioPython. So this is a start.)

I wanted something quick and dirty to output for testing, so tobs is the 
total number of observations and ccr is number of correctly classified 
points - I was to lazy to divide it by tobs to get the correct 
classification rate.

Here is an more extended sample code that also uses logistic regression. 
(Python is so great to with here!) I don't have plotting packages 
installed but someone could add the plots.

Regards
Bruce


-------------- next part --------------
A non-text attachment was scrubbed...
Name: knn_lr_example.py
Type: text/x-python
Size: 3257 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20081001/40109831/attachment-0002.py>

From biopython at maubp.freeserve.co.uk  Wed Oct  1 21:40:55 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Oct 2008 22:40:55 +0100
Subject: [BioPython] problem installing mxTextTools
In-Reply-To: <48896815.10104@berkeley.edu>
References: <4889645B.9080400@berkeley.edu> <48896815.10104@berkeley.edu>
Message-ID: <320fb6e00810011440v4bd80263hf3830d8d9f548e63@mail.gmail.com>

On Fri, Jul 25, 2008 at 6:43 AM, Nick Matzke <matzke at berkeley.edu> wrote:
> Hi all,
>
> An update -- I found a solution by copying the .pck file the download
> actually gave me to the filename that the install was apparently looking
> for.  This was not exactly obvious (!!!!) but apparently it worked:
> ...
> >>> print now()
> 2008-07-24 22:39:17.66
>

Was this an old email you accidently forwarded to the list?

For the next release of Biopython the only bits of code still using
mxTextTools have been deprecated, so the Biopython setup won't even
look for mxTextTools at all.  Right now with Biopython 1.48 you can
just install without mxTextTools (as the setup.py prompt should make
clear).

Peter


From biopython at maubp.freeserve.co.uk  Wed Oct  1 21:44:34 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Oct 2008 22:44:34 +0100
Subject: [BioPython] problem installing mxTextTools
In-Reply-To: <320fb6e00810011440v4bd80263hf3830d8d9f548e63@mail.gmail.com>
References: <4889645B.9080400@berkeley.edu> <48896815.10104@berkeley.edu>
	<320fb6e00810011440v4bd80263hf3830d8d9f548e63@mail.gmail.com>
Message-ID: <320fb6e00810011444u7e5bf37fh2801c1980bd38a2a@mail.gmail.com>

On Wed, Oct 1, 2008 at 10:40 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Fri, Jul 25, 2008 at 6:43 AM, Nick Matzke <matzke at berkeley.edu> wrote:
>> Hi all,
>>
>> An update -- I found a solution by copying the .pck file the download
>> actually gave me to the filename that the install was apparently looking
>> for.  This was not exactly obvious (!!!!) but apparently it worked:
>> ...
>> >>> print now()
>> 2008-07-24 22:39:17.66
>>
>
> Was this an old email you accidently forwarded to the list?

Sorry about this Nick & everyone else - it was a mistake at my end.
It looks like a glitch (perhaps in GoogleMail itself?) marked this old
thread as unread and bumped it to the top of my to read list.  Odd,
but I didn't notice until after sending my confused reply.

Peter


From kteague at bcgsc.ca  Wed Oct  1 21:53:44 2008
From: kteague at bcgsc.ca (Kevin Teague)
Date: Wed, 1 Oct 2008 14:53:44 -0700
Subject: [BioPython] development question
References: <48B5BD98.8050101@heckler-koch.cz><48B65C9B.4000407@heckler-koch.cz>
	<20080828090431.GD5801@inb.uni-luebeck.de>
Message-ID: <36BEEFA2DF192944BF71E072F7A5F4656043D6@xchange1.phage.bcgsc.ca>


On Thu, Aug 28, 2008 at 10:06:51AM +0200, Pavel SRB wrote:
> so now to biopython. On my system i have biopython from debian repository 
> via apt-get. But i would like to have second version of biopython in system 
> just to check, log and change the code to learn more. This can be done with 
> removing sys.path.remove("/var/lib/python-support/python2.5")
> and importing Bio from some other development directory. But this way i 
> loose all modules in direcotory mentioned above and i believe it can be 
> done more clearly

You might want to check out VirtualEnv:

http://pypi.python.org/pypi/virtualenv

This tool will let you "clone" your system Python, so that you have your own isolated [virtualpythonname]/bin and [virtualpythonname/lib/python/site-packages/ directories. If you create a virtualenv with the --no-site-packages, then the /var/lib/python-support/python2.5/ location will be not be in the created virtual python's sys.path. Otherwise by default this location will be included, but your own isolated [virtualpythonname/lib/python/site-packages/ location will have precendence on sys.path, so if you install a newer BioPython into there it will get imported instead of the system one.

You can of course do all of this by manually fiddling with sys.path, but VirtualEnv just wraps up a few of these common practices into one handy tool - great for experimentation or trying out different packages.


From lunt at ctbp.ucsd.edu  Sat Oct  4 21:50:33 2008
From: lunt at ctbp.ucsd.edu (Bryan Lunt)
Date: Sat, 4 Oct 2008 14:50:33 -0700
Subject: [BioPython] Copy Constructors for Bio.Seq.Seq?
In-Reply-To: <b34be8bd0810041445k382c3253ybbfec492717a85a8@mail.gmail.com>
References: <b34be8bd0810041445k382c3253ybbfec492717a85a8@mail.gmail.com>
Message-ID: <b34be8bd0810041450n56cfb161p7c281f947340a7d@mail.gmail.com>

Greetings All!
I would like to make the following humble suggestion:
A copy-constructor for Bio.Seq.Seq would be helpful, currently it
seems that calling Bio.Align.Generic.Alignment.add_sequence on a Seq
object breaks because it tries to initialize a new Seq object on
whatever data you provided, and there is no copy-constructor, nor does
Bio.Align.Generic.Alignment.add_sequence handled just adding a Seq
object directly.

Thanks for considering this, I think this addition will help make
client-code cleaner.

-Bryan Lunt


From biopython at maubp.freeserve.co.uk  Sun Oct  5 11:06:57 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 5 Oct 2008 12:06:57 +0100
Subject: [BioPython] Copy Constructors for Bio.Seq.Seq?
In-Reply-To: <b34be8bd0810041450n56cfb161p7c281f947340a7d@mail.gmail.com>
References: <b34be8bd0810041445k382c3253ybbfec492717a85a8@mail.gmail.com>
	<b34be8bd0810041450n56cfb161p7c281f947340a7d@mail.gmail.com>
Message-ID: <320fb6e00810050406t41d25043oe7011745055a1fc7@mail.gmail.com>

On Sat, Oct 4, 2008 at 10:50 PM, Bryan Lunt <lunt at ctbp.ucsd.edu> wrote:
> Greetings All!
> I would like to make the following humble suggestion:
> A copy-constructor for Bio.Seq.Seq would be helpful, ...

You can use the string idiom of my_seq[:] to make a copy of a Seq object.

> currently it
> seems that calling Bio.Align.Generic.Alignment.add_sequence on a
> Seq object breaks because it tries to initialize a new Seq object on
> whatever data you provided, and there is no copy-constructor, nor does
> Bio.Align.Generic.Alignment.add_sequence handled just adding a Seq
> object directly.

Yes, the Bio.Align.Generic.Alignment.add_sequence() method currently
expects a string (which its docstring is fairly clear about), and
giving it a Seq does fail.  I suppose allowing it to take a Seq object
would be sensible (with a check on the alphabet being compatible with
that declared for the alignment).

We have been debating making the generic Alignment a little more list
like, by allowing .append() or .extend() for use with SeqRecord
objects (Bug 2553).
http://bugzilla.open-bio.org/show_bug.cgi?id=2553

> Thanks for considering this, I think this addition will help make
> client-code cleaner.

Would the SeqRecord append/extend idea suit you just as well?

Peter


From biopython at maubp.freeserve.co.uk  Sun Oct  5 12:16:28 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 5 Oct 2008 13:16:28 +0100
Subject: [BioPython] Migrating from Numerical Python to numpy
In-Reply-To: <623262.17729.qm@web62407.mail.re1.yahoo.com>
References: <623262.17729.qm@web62407.mail.re1.yahoo.com>
Message-ID: <320fb6e00810050516i20822ebcwf15cd058af0c9759@mail.gmail.com>

On Sat, Sep 20, 2008 at 4:02 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Dear all,
>
> As you probably are well aware, Biopython releases to date have used
> the now obsolete Numeric python library.  This is no longer being
> maintained and has been superseded by the numpy library.  See
> http://www.scipy.org/History_of_SciPy for more about details on the
> history of numerical python.  Biopython 1.48 should be the last
> Numeric only release of Biopython - we have already started moving to
> numpy in CVS.
>
> Supporting both Numeric and numpy ought to be fairly straightforward
> for the pure python modules in Biopython. However, we also have C code
> which must interact with Numeric/numpy, and trying to support both
> would be harder.
>
> Would anyone be inconvenienced if the next release of Biopython
> supported numpy ONLY (dropping support for Numeric)?  If so please
> speak up now - either here or on the development mailing list.
> Otherwise, a simple switch from Numeric to numpy will probably be the
> most straightforward migration plan.

No one has objected, and a simple switch from Numeric to numpy is
underway in CVS.  The next release of Biopython will suport numpy only
(dropping support for Numeric).

As an aside, from my own testing Biopython CVS looks happy with numpy
1.0, 1.1 and the just released 1.2 (although if we have missed any
deprecation warnings please let us know).

For preparing Windows installers for Biopython, it might be helpful to
know what version of numpy most Windows users (will) have installed
(this is important due to numpy C API changes between versions).

Thanks,

Peter


From biopython at maubp.freeserve.co.uk  Mon Oct  6 10:39:15 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 6 Oct 2008 11:39:15 +0100
Subject: [BioPython] Bio.kNN documentation
In-Reply-To: <48E3C429.1020004@gmail.com>
References: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com>
	<48E3C429.1020004@gmail.com>
Message-ID: <320fb6e00810060339t427d4c5dme2690fbc36b30c81@mail.gmail.com>

Bruce wrote:
>>> I did not see an examples for k-nearest neighbor so below is
>>> (very bad) code using the logistic regression example
>>> (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html).

Peter wrote:
>> This is a set of Bacillus subtilis gene pairs for which the operon
>> structure is known, with the intergene distance and gene expression
>> score as explanatory variables, with the class being same operon or
>> different operons.
>> ...
>> Coupled with a scatter plot (say with pylab, showing the two classes
>> in different colours), this could be turned into a nice little example
>> for the cookbook section of the tutorial.  Notice that later on in the
>> logistic regression example there is a second table of "test data"
>> which could be used to make de novo predictions.

Bruce wrote:
> I did realize that this was coming... :-)
> (I guess I am volunteering myself to provide some material on
> machine learning with BioPython. So this is a start.)

Michiel has suggested adding a whole chapter to the tutorial about
supervised learning, presumably incorporating his logistic regression
example as part of this.  Have a look at thread "Bio.MarkovModel;
Bio.Popgen, Bio.PDB documentation" on the dev mailing list.  I'm sure
you can contribute (even if just by proof reading).

Peter


From fkauff at biologie.uni-kl.de  Tue Oct  7 08:02:12 2008
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Tue, 07 Oct 2008 10:02:12 +0200
Subject: [BioPython] Creating and traversing an ultrametric tree
In-Reply-To: <320fb6e00809241326i16a337das844f4ac74766b459@mail.gmail.com>
References: <73045cca0809231713v219c3ec3tfc24461c7af6b453@mail.gmail.com>	<320fb6e00809240200y144500cbl86f9023cb868da89@mail.gmail.com>	<73045cca0809241132x30bc4d63t7ac0b9967a20e76c@mail.gmail.com>
	<320fb6e00809241326i16a337das844f4ac74766b459@mail.gmail.com>
Message-ID: <48EB1784.50803@biologie.uni-kl.de>


Peter wrote:
> On Wed, Sep 24, 2008 at 7:32 PM, aditya shukla
> <adityashukla1983 at gmail.com> wrote:
>   
>> Hello Peter ,
>>
>> Thanks for the reply ,
>> I have attached a file with  of the kind of data that i wanna parse.
>> I tried using Thomas Mailund's Newick tree parser but this dosen't
>> seem to work , so is there any other module that can help?
>>     
>
> Your file looks like this (in case anyone on the mailing list recognises it),
>
> /T_0_size=105((-bin-ulockmgr_server:0.99[&&NHX:C=0.195.0],
> (((-bin-hostname:0.00[&&NHX:C=200.0.0],
> (-bin-dnsdomainname:0.00[&&NHX:C=200.0.0],
> ...):0.99):0.99):0.99):0.99);
>
> [with a large chunk removed, and new lines inserted]
>
> I'm guessing this is some kind of computer system profile - nothing to
> do with bioinformatics.
>
> I'm not 100% sure this is Newick format - it might be worth trying to
> parse everything after the "/T_0_size=105" text which looks out of
> place to me.
>
> If it is a valid Newick format tree file, then it is using named
> internal nodes which is something Biopython can't currently parse (see
> Bug 2543, http://bugzilla.open-bio.org/show_bug.cgi?id=2543 ).  So I
> don't think you can use the Bio.Nexus module in Biopython to read this
> tree.
>
>   
Nexus.Trees has been extended to deal with internal node names, or 
"special comments" in the format [& blablalba]. Such comments comments 
can appear directly after the taxon label, after the closing 
parentheses, or between branchlength / support values attached to a node 
or a taxon labels, such as

(a,(b,(c,d)[&hi there]))
(a,(b[&hi there],c))
(a,(b:0.123[&hi there],c[&heyho]:0.3))
(a,(b,c)0.4[&comment]:0.95)

The comments are stored without change in the corresponding node object 
and can be accessed like

 >>> t=Trees.Tree('(a,(b:0.123[&hi there],c[&heyho]:0.3))')
 >>> print t.node(3).data.comment
[&hi there]
 >>> print t.node(4).data.comment
[&heyho]
 >>>

The comments are not parsed in any way - internal labels vary greatly in 
syntax, and are used to store all kinds of information. But at least 
they are now parsed and stored, and users can deal with them in any way 
they like.

Frank

> The only other python package I can suggest you try is NetworkX,
> https://networkx.lanl.gov/wiki
>
> Good luck,
>
> Peter
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>   


From mjldehoon at yahoo.com  Tue Oct  7 23:10:12 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 7 Oct 2008 16:10:12 -0700 (PDT)
Subject: [BioPython] Bio.kNN documentation
In-Reply-To: <320fb6e00810060339t427d4c5dme2690fbc36b30c81@mail.gmail.com>
Message-ID: <381879.37032.qm@web62403.mail.re1.yahoo.com>

> Bruce wrote:
> > (I guess I am volunteering myself to provide some
> material on
> > machine learning with BioPython. So this is a start.)
> 
> Michiel has suggested adding a whole chapter to the
> tutorial about
> supervised learning, presumably incorporating his logistic
> regression
> example as part of this.  Have a look at thread
> "Bio.MarkovModel;
> Bio.Popgen, Bio.PDB documentation" on the dev mailing
> list.  I'm sure
> you can contribute (even if just by proof reading).

Some more documentation on machine learning would definitely be useful. Recently I started a chapter on supervised learning methods in the tutorial. Right now it only covers logistic regression, but it should also include Bio.MarkovModel, Bio.MaxEntropy, Bio.NaiveBayes, and Bio.kNN. If you are planning to write some documentation on any of these, please let us know so we can avoid duplicated efforts. The new tutorial is in CVS; I put a copy of the HTML output of the latest version at
http://biopython.org/DIST/docs/tutorial/Tutorial.new.html.

Thanks!

--Michiel


From bsouthey at gmail.com  Wed Oct  8 01:35:51 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Tue, 7 Oct 2008 20:35:51 -0500
Subject: [BioPython] Bio.kNN documentation
In-Reply-To: <381879.37032.qm@web62403.mail.re1.yahoo.com>
References: <320fb6e00810060339t427d4c5dme2690fbc36b30c81@mail.gmail.com>
	<381879.37032.qm@web62403.mail.re1.yahoo.com>
Message-ID: <bbcd77d00810071835r44a1cf92pba05eb436c7e1734@mail.gmail.com>

On Tue, Oct 7, 2008 at 6:10 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>> Bruce wrote:
>> > (I guess I am volunteering myself to provide some
>> material on
>> > machine learning with BioPython. So this is a start.)
>>
>> Michiel has suggested adding a whole chapter to the
>> tutorial about
>> supervised learning, presumably incorporating his logistic
>> regression
>> example as part of this.  Have a look at thread
>> "Bio.MarkovModel;
>> Bio.Popgen, Bio.PDB documentation" on the dev mailing
>> list.  I'm sure
>> you can contribute (even if just by proof reading).
>
> Some more documentation on machine learning would definitely be useful. Recently I started a chapter on supervised learning methods in the tutorial. Right now it only covers logistic regression, but it should also include Bio.MarkovModel, Bio.MaxEntropy, Bio.NaiveBayes, and Bio.kNN. If you are planning to write some documentation on any of these, please let us know so we can avoid duplicated efforts. The new tutorial is in CVS; I put a copy of the HTML output of the latest version at
> http://biopython.org/DIST/docs/tutorial/Tutorial.new.html.
>
> Thanks!
>
> --Michiel
>

Hi,
I have not given it too much thought at present but this reflects some
of the work I have been doing or involved with. I do not know enough
about Bio.MarkovModel, Bio.MaxEntropy and Bio.NaiveBayes to really
help. But I did think to start with trying to extend the supervised
learning material to be more general.  One aspect is to get provide
working code using different methodologies for different examples.

Regards
Bruce


From stephan80 at mac.com  Wed Oct  8 11:33:51 2008
From: stephan80 at mac.com (Stephan)
Date: Wed, 08 Oct 2008 13:33:51 +0200
Subject: [BioPython] Entrez.efetch
Message-ID: <75573950382669954948356356615157751492-Webmail2@me.com>

Hi,

I am using biopython for a week or so. The package is amazing, I wonder how I possibly ignored this for so long now.
Since I am not only new to biopython I am also new in this mailing list, so forgive me if this is not the right forum for a question like this.

Anyway, here is a weird little problem with the Bio.Entrez.efetch tool:
(I use python 2.5 and the latest Biopython 1.48)
I want to run the following little test-code, using etetch to get chromosome 4 of Drosophila melanogaster as a genbank-file:

---------------------------CODE------------------------------------
from Bio import Entrez, SeqIO

print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"]
handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
print "downloading to SeqRecord..."
record = SeqIO.read(handle, "genbank")
print "...done"

handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
filehandle = open("NCBI_DroMel", "w")
print "downloading to file..."
filehandle.write(handle.read())
print "...done"

handle = open("NCBI_DroMel")
print "reading from file..."
record = SeqIO.read(handle, "genbank")
---------------------------END-CODE------------------------------------

In the last line we have a crash, see the output of the code:

---------------------------OUTPUT------------------------------------
Drosophila melanogaster chromosome 4, complete sequence
downloading to SeqRecord...
...done
downloading to file...
...done
reading chr2L from file...
Traceback (most recent call last):
  File "efetch-test.py", line 17, in <module>
    record = SeqIO.read(handle, "genbank")
  File "HOME/lib/python/Bio/SeqIO/__init__.py", line 366, in read
    first = iterator.next()
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 410, in parse_records
    record = self.parse(handle)
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 393, in parse
    if self.feed(handle, consumer) :
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 370, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 723, in parse_footer
    raise ValueError("Premature end of file in sequence data")
ValueError: Premature end of file in sequence data
---------------------------END-OUTPUT------------------------------------

It seems that downloading the file to disk will corrupt the genbank file, while downloading directly into biopythons SeqIO.read() function works properly. I dont get it!
When I download this chromosome manually from the NCBI-website, I indeed find a difference in one line, namely in line 3 of the genbank file. In the manually downloaded file line 3 reads: "ACCESSION   NC_004353 REGION: 1..1351857", while in the file produced from my code I have only: "ACCESSION   NC_004353". So without that region-information, the biopython parser of course runs to a premature end.

I rather use the cPickle-module now to save the whole SeqRecord-instance. Thats works fine, so I dont need an immediate solution for the above posted problem, but I thought it might be interesting maybe...

Any hints?

Regards, Stephan


From chapmanb at 50mail.com  Wed Oct  8 12:35:33 2008
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 8 Oct 2008 08:35:33 -0400
Subject: [BioPython] Entrez.efetch
In-Reply-To: <75573950382669954948356356615157751492-Webmail2@me.com>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
Message-ID: <20081008123533.GE57379@sobchak.mgh.harvard.edu>

Hi Stephan;

> It seems that downloading the file to disk will corrupt the genbank
> file, while downloading directly into biopythons SeqIO.read() function
> works properly. I dont get it! 
>
> When I download this chromosome manually from the NCBI-website,
> I indeed find a difference in one line, namely in line 3 of the
> genbank file. In the manually downloaded file line 3 reads:
> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced
> from my code I have only: "ACCESSION NC_004353". So without that
> region-information, the biopython parser of course runs to a premature
> end.

This is a tricky problem that I ran into as well and is fixed in the
latest CVS version. The issue is that the Biopython reader is using an
UndoHandle instead of a standard python handle. By default some of these
operations appear to be assuming an iterator, but UndoHandle did not
provide this.

As a result, you can lose the first couple of lines which are
previously examined to determine the filetype. The fix is to make
this a proper iterator. You can either check out current CVS, or
make the addition manually to Bio/File.py in your current version:

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython

Hope this helps,
Brad


From biopython at maubp.freeserve.co.uk  Wed Oct  8 13:37:24 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 14:37:24 +0100
Subject: [BioPython] Entrez.efetch
In-Reply-To: <75573950382669954948356356615157751492-Webmail2@me.com>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
Message-ID: <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com>

On Wed, Oct 8, 2008 at 12:33 PM, Stephan <stephan80 at mac.com> wrote:
> Hi,
>
> I am using biopython for a week or so. The package is amazing, I wonder how I possibly ignored this for so long now.
> Since I am not only new to biopython I am also new in this mailing list, so forgive me if this is not the right forum for a question like this.
>
> Anyway, here is a weird little problem with the Bio.Entrez.efetch tool:
> (I use python 2.5 and the latest Biopython 1.48)
> I want to run the following little test-code, using etetch to get chromosome 4 of Drosophila melanogaster as a genbank-file:
>
> ---------------------------CODE------------------------------------
> from Bio import Entrez, SeqIO
>
> print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"]
> handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
> print "downloading to SeqRecord..."
> record = SeqIO.read(handle, "genbank")
> print "...done"

I assume this is just test code - as it would be silly to download the
GenBank file twice in a real script.

> handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
> filehandle = open("NCBI_DroMel", "w")
> print "downloading to file..."
> filehandle.write(handle.read())

You should now close the file, which should ensure it is fully written to disk:
filehandle.close()

> print "...done"
>
> handle = open("NCBI_DroMel")
> print "reading from file..."
> record = SeqIO.read(handle, "genbank")
> ---------------------------END-CODE------------------------------------
>
> In the last line we have a crash,
>  ...
> ValueError: Premature end of file in sequence data

This is because you started reading in the file without finishing
writing to it - the parser could only read in part of the data, and is
complaining about it ending prematurely.

Peter


From p.j.a.cock at googlemail.com  Wed Oct  8 13:46:25 2008
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 8 Oct 2008 14:46:25 +0100
Subject: [BioPython] Entrez.efetch
In-Reply-To: <20081008123533.GE57379@sobchak.mgh.harvard.edu>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
	<20081008123533.GE57379@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>

Stephan wrote:
>> When I download this chromosome manually from the NCBI-website,
>> I indeed find a difference in one line, namely in line 3 of the
>> genbank file. In the manually downloaded file line 3 reads:
>> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced
>> from my code I have only: "ACCESSION NC_004353". So without that
>> region-information, the biopython parser of course runs to a premature
>> end.

Stephan - when you say manually, do you mean via a web browser?  If so
it is likely to be using a subtly different URL, which might explain
the NCBI generating slightly different data on the fly.  Either way,
this ACCESSION line difference shouldn't trigger the "Premature end of
file in sequence data" error in the GenBank parser.

On Wed, Oct 8, 2008 at 1:35 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> This is a tricky problem that I ran into as well and is fixed in the
> latest CVS version. The issue is that the Biopython reader is using an
> UndoHandle instead of a standard python handle. By default some of these
> operations appear to be assuming an iterator, but UndoHandle did not
> provide this.

Brad, I'm pretty sure the GenBank parser is NOT using the UndoHandle.
Just adding the close made Stephan's example work for me.  What
exactly was the problem you ran into (one of the other parsers
perhaps?).

> As a result, you can lose the first couple of lines which are
> previously examined to determine the filetype. The fix is to make
> this a proper iterator. You can either check out current CVS, or
> make the addition manually to Bio/File.py in your current version:
>
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython

Adding this to the UndoHandle seems a sensible improvement - but I
don't see how it can affect Stephan's script.

Peter


From p.j.a.cock at googlemail.com  Wed Oct  8 13:46:25 2008
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 8 Oct 2008 14:46:25 +0100
Subject: [BioPython] Entrez.efetch
In-Reply-To: <20081008123533.GE57379@sobchak.mgh.harvard.edu>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
	<20081008123533.GE57379@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>

Stephan wrote:
>> When I download this chromosome manually from the NCBI-website,
>> I indeed find a difference in one line, namely in line 3 of the
>> genbank file. In the manually downloaded file line 3 reads:
>> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced
>> from my code I have only: "ACCESSION NC_004353". So without that
>> region-information, the biopython parser of course runs to a premature
>> end.

Stephan - when you say manually, do you mean via a web browser?  If so
it is likely to be using a subtly different URL, which might explain
the NCBI generating slightly different data on the fly.  Either way,
this ACCESSION line difference shouldn't trigger the "Premature end of
file in sequence data" error in the GenBank parser.

On Wed, Oct 8, 2008 at 1:35 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> This is a tricky problem that I ran into as well and is fixed in the
> latest CVS version. The issue is that the Biopython reader is using an
> UndoHandle instead of a standard python handle. By default some of these
> operations appear to be assuming an iterator, but UndoHandle did not
> provide this.

Brad, I'm pretty sure the GenBank parser is NOT using the UndoHandle.
Just adding the close made Stephan's example work for me.  What
exactly was the problem you ran into (one of the other parsers
perhaps?).

> As a result, you can lose the first couple of lines which are
> previously examined to determine the filetype. The fix is to make
> this a proper iterator. You can either check out current CVS, or
> make the addition manually to Bio/File.py in your current version:
>
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython

Adding this to the UndoHandle seems a sensible improvement - but I
don't see how it can affect Stephan's script.

Peter


From stephan80 at mac.com  Wed Oct  8 13:48:25 2008
From: stephan80 at mac.com (Stephan)
Date: Wed, 08 Oct 2008 15:48:25 +0200
Subject: [BioPython] Entrez.efetch
In-Reply-To: <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
	<320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com>
Message-ID: <128043477953580677661042463273686413408-Webmail2@me.com>

Hi guys,

OK, there is two different problems here that Brad and Peter independently pointed out to me. Peter, you are right that not closing the file actually caused the error. Your hint fixes that, thanks.
But that doesnt fix that there is a part of line 3 missing over the download, and although I actually updated to the newest cvs-version of biopython as Brad suggested (sorry for accidently putting my answer not on the mailing-list) that does not fix that line...

Best,
Stephan

 
Am Mittwoch 08 Oktober 2008 um 03:37PM schrieb "Peter" <biopython at maubp.freeserve.co.uk>:
>On Wed, Oct 8, 2008 at 12:33 PM, Stephan <stephan80 at mac.com> wrote:
>> Hi,
>>
>> I am using biopython for a week or so. The package is amazing, I wonder how I possibly ignored this for so long now.
>> Since I am not only new to biopython I am also new in this mailing list, so forgive me if this is not the right forum for a question like this.
>>
>> Anyway, here is a weird little problem with the Bio.Entrez.efetch tool:
>> (I use python 2.5 and the latest Biopython 1.48)
>> I want to run the following little test-code, using etetch to get chromosome 4 of Drosophila melanogaster as a genbank-file:
>>
>> ---------------------------CODE------------------------------------
>> from Bio import Entrez, SeqIO
>>
>> print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"]
>> handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
>> print "downloading to SeqRecord..."
>> record = SeqIO.read(handle, "genbank")
>> print "...done"
>
>I assume this is just test code - as it would be silly to download the
>GenBank file twice in a real script.
>
>> handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
>> filehandle = open("NCBI_DroMel", "w")
>> print "downloading to file..."
>> filehandle.write(handle.read())
>
>You should now close the file, which should ensure it is fully written to disk:
>filehandle.close()
>
>> print "...done"
>>
>> handle = open("NCBI_DroMel")
>> print "reading from file..."
>> record = SeqIO.read(handle, "genbank")
>> ---------------------------END-CODE------------------------------------
>>
>> In the last line we have a crash,
>>  ...
>> ValueError: Premature end of file in sequence data
>
>This is because you started reading in the file without finishing
>writing to it - the parser could only read in part of the data, and is
>complaining about it ending prematurely.
>
>Peter
>
>


From stephan80 at mac.com  Wed Oct  8 14:00:31 2008
From: stephan80 at mac.com (Stephan)
Date: Wed, 08 Oct 2008 16:00:31 +0200
Subject: [BioPython] Entrez.efetch
In-Reply-To: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
	<20081008123533.GE57379@sobchak.mgh.harvard.edu>
	<320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>
Message-ID: <72537648433629820630731006204512761040-Webmail2@me.com>

>Stephan - when you say manually, do you mean via a web browser?  If so 
>it is likely to be using a subtly different URL, which might explain 
>the NCBI generating slightly different data on the fly.  Either way, 
>this ACCESSION line difference shouldn't trigger the "Premature end of 
>file in sequence data" error in the GenBank parser. 

Thanks, that must be it! 
Now I guess everything is solved, closing the handle makes my code run properly and the download-from-NCBI-webpage-issue explains the difference in line 3. 

>Adding this to the UndoHandle seems a sensible improvement - but I 
>don't see how it can affect Stephan's script. 

There I agree, thanks anyway, Brad. 

Regards, 
Stephan


From biopython at maubp.freeserve.co.uk  Wed Oct  8 14:02:54 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 15:02:54 +0100
Subject: [BioPython] Entrez.efetch
In-Reply-To: <128043477953580677661042463273686413408-Webmail2@me.com>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
	<320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com>
	<128043477953580677661042463273686413408-Webmail2@me.com>
Message-ID: <320fb6e00810080702q6774f58ap52a02073d62cb75a@mail.gmail.com>

On Wed, Oct 8, 2008 at 2:48 PM, Stephan <stephan80 at mac.com> wrote:
>
> Hi guys,
>
> OK, there is two different problems here that Brad and Peter independently
> pointed out to me. Peter, you are right that not closing the file actually
> caused the error. Your hint fixes that, thanks.

Great.

> But that doesnt fix that there is a part of line 3 missing over the download,
> and although I actually updated to the newest cvs-version of biopython as
> Brad suggested (sorry for accidently putting my answer not on the mailing-list)
> that does not fix that line...

This is the issue where you get different GenBank files using
Bio.Entrez.efetch and a "manual download"?  First of all what did you
mean by "manual download" - for example FTP (what URL), or from a
browser?  Secondly, does this difference to the ACCESSION line (line
3) actually have any ill effects?

To be clear using Bio.Entrez.efetch as in your script, I get this:

LOCUS       NC_004353            1351857 bp    DNA     linear   INV 14-MAY-2008
DEFINITION  Drosophila melanogaster chromosome 4, complete sequence.
ACCESSION   NC_004353
VERSION     NC_004353.3  GI:116010290
PROJECT     GenomeProject:164
KEYWORDS    .
SOURCE      Drosophila melanogaster (fruit fly)
  ORGANISM  Drosophila melanogaster
...

Using FTP from ftp://ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster/CHR_4/NC_004353.gbk
I get something similar but different:

LOCUS       NC_004353            1351857 bp    DNA     linear   INV 14-MAY-2008
DEFINITION  Drosophila melanogaster chromosome 4, complete sequence.
ACCESSION   NC_004353
VERSION     NC_004353.3  GI:116010290
KEYWORDS    .
SOURCE      Drosophila melanogaster (fruit fly)
  ORGANISM  Drosophila melanogaster
...

Notice the FTP file lacks the PROJECT line, and also differs slightly
in its feature table.

Using the NCBI website I suspect you can get other slight variations
(like the different ACCESSION line you reported).

Peter


From stephan80 at mac.com  Wed Oct  8 13:52:07 2008
From: stephan80 at mac.com (Stephan)
Date: Wed, 08 Oct 2008 15:52:07 +0200
Subject: [BioPython] Entrez.efetch
In-Reply-To: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
	<20081008123533.GE57379@sobchak.mgh.harvard.edu>
	<320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>
Message-ID: <56009583349175862359179071289436480391-Webmail2@me.com>


>Stephan - when you say manually, do you mean via a web browser?  If so
>it is likely to be using a subtly different URL, which might explain
>the NCBI generating slightly different data on the fly.  Either way,
>this ACCESSION line difference shouldn't trigger the "Premature end of
>file in sequence data" error in the GenBank parser.

Thanks, that must be it!
Now I guess everything is solved, closing the handle makes my code run properly and the download-from-NCBI-webpage-issue explains the difference in line 3.

>Adding this to the UndoHandle seems a sensible improvement - but I
>don't see how it can affect Stephan's script.

There I agree, thanks anyway, Brad.

Regards,
Stephan


From biopythonlist at gmail.com  Wed Oct  8 16:23:32 2008
From: biopythonlist at gmail.com (dr goettel)
Date: Wed, 8 Oct 2008 18:23:32 +0200
Subject: [BioPython] taxonomic tree
Message-ID: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com>

Hello, I'm new in this list and in BioPython.
I would like to create a NCBI-like taxonomic tree and then fill it with the
organisms that I have in a file. Is there an easy way to do this? I started
using biopython's function at 7.11.4 (finding the lineage of an organism) in
the tutorial, but I need to do this tens of thousands times so it spends too
much time querying NCBI database. Therefore I built a taxonomic database
locally and implemented something similar to 7.11.4 tutorial's function so I
get, for every sequence, the lineage in the same way:

'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina;
 Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta;
 Liliopsida; Asparagales; Orchidaceae'


Now I need to create a tree, or fill an already created one. And then search
it by some criteria.

Please could anybody help me with this? Any idea?
Thankyou very much


From biopython at maubp.freeserve.co.uk  Wed Oct  8 16:38:31 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 17:38:31 +0100
Subject: [BioPython] taxonomic tree
In-Reply-To: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com>
References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com>
Message-ID: <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com>

On Wed, Oct 8, 2008 at 5:23 PM, dr goettel <biopythonlist at gmail.com> wrote:
> Hello, I'm new in this list and in BioPython.

Hello :)

> I would like to create a NCBI-like taxonomic tree and then fill it with the
> organisms that I have in a file. Is there an easy way to do this? I started
> using biopython's function at 7.11.4 (finding the lineage of an organism) in
> the tutorial, ...

For anyone reading this later on, note that the tutorial section
numbers tend to change with each release of Biopython.  This section
just uses Bio.Entrez to fetch taxonomy information for a particular
NCBI taxon id.

> but I need to do this tens of thousands times so it spends too
> much time querying NCBI database.

Also calling Bio.Entrez 10000 times might annoy the NCBI ;)

> Therefore I built a taxonomic database
> locally and implemented something similar to 7.11.4 tutorial's function so I
> get, for every sequence, the lineage in the same way:
>
> 'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina;
>  Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta;
>  Liliopsida; Asparagales; Orchidaceae'

I assume you used the NCBI provided taxdump files to populate the
database?  See ftp://ftp.ncbi.nih.gov/pub/taxonomy/

Personally rather than designing my own database just for this (and
writing a parser for the taxonomy files), I would have suggested
installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl
to download and import the data for you.  This is a simple perl script
- you don't need BioPerl.  See http://www.biopython.org/wiki/BioSQL
for details.

> Now I need to create a tree, or fill an already created one. And then search
> it by some criteria.

What kind of tree do you mean?  Are you talking about creating a
Newick tree, or an in memory structure?  Perhaps the Bio.Nexus
module's tree functionality would help.

If you are interested, the BioSQL tables record the taxonomy tree
using two methods, each node has a parent node allowing you to walk up
the lineage.  There are also left/right values allowing selection of
all child nodes efficiently via an SQL select statement.

Peter


From biopython at maubp.freeserve.co.uk  Wed Oct  8 16:57:37 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 17:57:37 +0100
Subject: [BioPython] Current tutorial in CVS
Message-ID: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com>

Michiel wrote:
> ... The new tutorial is in CVS; I put a copy of the HTML output
> of the latest version at
> http://biopython.org/DIST/docs/tutorial/Tutorial.new.html.

This also gives people a chance to look at the three plotting examples
I added to the "Cookbook" section a couple of weeks back,

http://www.biopython.org/DIST/docs/tutorial/Tutorial.new.html#chapter:cookbook

Suggestions for any additional biologically motivated simple plots
would be nice - especially for different plot types.  A scatter plot
could be added, are there any suggestions for this other than melting
temperature versus length or GC%?  See also this thread on the
dev-mailing list:
http://www.biopython.org/pipermail/biopython-dev/2008-September/004277.html

Note that the file at this URL is only temporary, and will probably be
removed before the next release.  The current tutorial is at:

http://www.biopython.org/DIST/docs/tutorial/Tutorial.html
http://www.biopython.org/DIST/docs/tutorial/Tutorial.pdf

Peter


From stephan80 at mac.com  Wed Oct  8 17:11:25 2008
From: stephan80 at mac.com (Stephan)
Date: Wed, 08 Oct 2008 19:11:25 +0200
Subject: [BioPython] Entrez.efetch large files
Message-ID: <133483072970409871957631124263040035200-Webmail2@me.com>

Sorry to have an Entrez.efetch-issue again, but somehow there seems to be a problem with very large files.

So when I run the following code using the newest cvs-version of biopython:

------------------------------------CODE-----------------------------------
from Bio import Entrez, SeqIO

id = "57"
print Entrez.read(Entrez.esummary(db="genome", id=id))[0]["Title"]
handle = Entrez.efetch(db="genome", id=id, rettype="genbank")
print "downloading to SeqRecord..."
record = SeqIO.read(handle, "genbank")
print "...done"
------------------------------------END-CODE-----------------------------

it fails with the output:

------------------------------------OUTPUT-----------------------------
Drosophila melanogaster chromosome X, complete sequence
downloading to SeqRecord...
Traceback (most recent call last):
  File "efetch-test.py", line 7, in <module>
    record = SeqIO.read(handle, "genbank")
  File "/NetUsers/stschiff/lib/python/Bio/SeqIO/__init__.py", line 366, in read
    first = iterator.next()
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 410, in parse_records
    record = self.parse(handle)
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 393, in parse
    if self.feed(handle, consumer) :
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 370, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 723, in parse_footer
    raise ValueError("Premature end of file in sequence data")
ValueError: Premature end of file in sequence data
------------------------------------END-OUTPUT-----------------------------


If I change the id to "56" (chromosome 4, which is shorter) it works. But for all the other chromosomes (ids: 57 - 61) it fails.
If I download the genbank files manually from the ftp-server and then use SeqIO.read() it works, so the download-process corrupts the genbank files if they are very large (about 35 MB) I guess...

Any hints?

Best,
Stephan


From biopython at maubp.freeserve.co.uk  Wed Oct  8 18:57:08 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 19:57:08 +0100
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <133483072970409871957631124263040035200-Webmail2@me.com>
References: <133483072970409871957631124263040035200-Webmail2@me.com>
Message-ID: <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>

On Wed, Oct 8, 2008 at 6:11 PM, Stephan <stephan80 at mac.com> wrote:
> Sorry to have an Entrez.efetch-issue again, but somehow there
> seems to be a problem with very large files.
> ...
> If I change the id to "56" (chromosome 4, which is shorter) it works.
> But for all the other chromosomes (ids: 57 - 61) it fails.
> If I download the genbank files manually from the ftp-server and
> then use SeqIO.read() it works, so the download-process corrupts
> the genbank files if they are very large (about 35 MB) I guess...
>
> Any hints?

Yes - one big hint: DON'T try and parse these large files directly
from the internet.  Use efetch to download the file and save it to
disk.  Then open this local file for parsing.

There are several good reasons for this:

(1) Rerunning the script (e.g. during development) needn't re-download
the file, which wastes time and money (yours and more importantly the
NCBI's).  You may be fine, but the NCBI can and do ban people's IP
addresses if they breach the guidelines.

(2) If the parsing fails, there is something to debug easily (the
local file).  You can open the file in a text editor to check it etc.

That being said, downloading and parsing in one go should work - I
would expect an IO error if the network timed out, rather than what
appears to be the data ending prematurely.  However, I don't expect
this to be easy to resolve - quite possibly this is a network time out
somewhere, maybe at your end, maybe on one of the ISP connections in
between.

On the bright side, at least the parser isn't silently ignoring the
end of the file, which would leave you with a truncated sequence
without any warnings :)

Do you think the Biopython tutorial should be more explicit about this
topic?  e.g. In chapter 4 (on Bio.SeqIO) I wrote:

>> Note that just because you can download sequence data and
>> parse it into a SeqRecord object in one go doesn't mean this
>> is always a good idea. In general, you should probably download
>> sequences once and save them to a file for reuse.

Maybe I should have said "... doesn't mean this is a good idea..." instead?

Peter


From biopython at maubp.freeserve.co.uk  Wed Oct  8 19:32:59 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 20:32:59 +0100
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>
References: <133483072970409871957631124263040035200-Webmail2@me.com>
	<320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>
Message-ID: <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com>

> Yes - one big hint: DON'T try and parse these large files directly
> from the internet.  Use efetch to download the file and save it to
> disk.  Then open this local file for parsing.
> ...
> Do you think the Biopython tutorial should be more explicit about this
> topic?

I've changed the tutorial (the SeqIO and Entrez chapters) in CVS to
make this advice more explicit, and included an example of doing this
too.

import os
from Bio import SeqIO
from Bio import Entrez
Entrez.email = "A.N.Other at example.com"     # Always tell NCBI who you are
filename = "gi_186972394.gbk"
if not os.path.isfile(filename) :
    print "Downloading..."
    net_handle = Entrez.efetch(db="nucleotide",id="186972394",rettype="genbank")
    out_handle = open(filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    net_handle.close()
    print "Saved"

print "Parsing..."
record = SeqIO.read(open(filename), "genbank")
print record


Peter


From biopython at maubp.freeserve.co.uk  Wed Oct  8 20:57:03 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 21:57:03 +0100
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com>
References: <133483072970409871957631124263040035200-Webmail2@me.com>
	<320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>
	<320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com>
	<2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com>
Message-ID: <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com>

On Wed, Oct 8, 2008 at 9:37 PM, Stephan Schiffels <stephan80 at mac.com> wrote:
>
>  Hi Peter,
>
> OK, first of all... you were right of course, with
> out_handle.write(net_handle.read()) the download works properly and reading
> the file from disk also works.The tutorial is very clear on that point, I
> agree.

OK - hopefully I've just made it clearer still ;)

> To illustrate why I made the mistake even though I read the tutorial:
> I made some code like:
>
> try:
>        unpickling a file as SeqRecord...
> except IOError:
>        download file into SeqRecord AND pickle afterwards to disk
>
> So, as you can see, I already tried to make the download only once!

I see - interesting.

> The disk-saving step, I realized, was smarter to do via cPickle since then
> reading from it also goes faster than parsing the genbank file each time. So
> my goal was to either load a pickled SeqRecord, or download into SeqRecord
> and then pickle to disk. I hope you agree that concerning resources from
> NCBI this way is (at least in principle) already quite optimal.

You approach is clever, and I agree, it shouldn't make any difference
to the number of downloads from the NCBI (once you have the script
debugged and working).

I'm curious - do you have any numbers for the relative times to load a
SeqRecord from a pickle, or re-parse it from the GenBank file?  I'm
aware of some "hot spots" in the GenBank parser which take more time
than they really need to (feature location parsing in particular).

However, even if using pickles is much faster, I would personally
still rather use this approach:

if file not present:
   download from NCBI and save it
parse file

I think it is safer to keep the original data in the NCBI provided
format, rather than as a python pickle.  Some of my reasons include:

* you might want to parse the files with a different tool one day
(e.g. grep, or maybe BioPerl, or EMBOSS)
* different versions of Biopython will parse the file slightly
differently (e.g. once Bugs 2225 and 2578 are fixed the SeqRecord
should include slightly more information from a GenBank file) while
your pickle will be static
* if the SeqRecord or Seq objects themselves change slightly between
versions of Biopython, the pickle may not work
* more generally, is it safe to transfer the pickly files between
different computers (e.g. different versions of python or Biopython,
different OS, different line endings)?

These issues may not be a problem in your setting.

More generally, you could consider using BioSQL, but this may be
overkill for your needs.

> However, as you pointed out, parsing from the internet makes problems.

If you do work out exactly what is going wrong, I would be interested
to hear about it.

> I think the advantages of not having to download each time were clear to me
> from the tutorial. Just that downloading AND parsing at the same time makes
> problems didnt appear to me. The addings to the tutorial seem to give some
> idea.

Your approach all makes sense. Thanks for explaining your thoughts.  I
don't think I'd ever tried efetch on such a large GenBank file in the
first place - for genomes I have usually used FTP instead.

Peter


From chapmanb at 50mail.com  Wed Oct  8 21:11:25 2008
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 8 Oct 2008 17:11:25 -0400
Subject: [BioPython] Entrez.efetch
In-Reply-To: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>
References: <75573950382669954948356356615157751492-Webmail2@me.com>
	<20081008123533.GE57379@sobchak.mgh.harvard.edu>
	<320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com>
Message-ID: <20081008211125.GB17555@sobchak.mgh.harvard.edu>

Peter and Stephan;
My fault -- sorry about the red herring on this one. I shouldn't
have tried to answer this e-mail in 5 minutes before work this
morning. Sounds like y'all have it resolved with the missing close
so I will keep my mouth shut.

Peter, I don't remember my exact problem as it was in some
throw-away script and the fix seemed non-problematic. I was thrown
off by the "line 3" information Stephan mentioned because my issue
was with the first couple of lines missing when iterating with an
UndoHandle. No matter.

Thanks for coming up with the right fix!
Brad

> Stephan wrote:
> >> When I download this chromosome manually from the NCBI-website,
> >> I indeed find a difference in one line, namely in line 3 of the
> >> genbank file. In the manually downloaded file line 3 reads:
> >> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced
> >> from my code I have only: "ACCESSION NC_004353". So without that
> >> region-information, the biopython parser of course runs to a premature
> >> end.
> 
> Stephan - when you say manually, do you mean via a web browser?  If so
> it is likely to be using a subtly different URL, which might explain
> the NCBI generating slightly different data on the fly.  Either way,
> this ACCESSION line difference shouldn't trigger the "Premature end of
> file in sequence data" error in the GenBank parser.
> 
> On Wed, Oct 8, 2008 at 1:35 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> > This is a tricky problem that I ran into as well and is fixed in the
> > latest CVS version. The issue is that the Biopython reader is using an
> > UndoHandle instead of a standard python handle. By default some of these
> > operations appear to be assuming an iterator, but UndoHandle did not
> > provide this.
> 
> Brad, I'm pretty sure the GenBank parser is NOT using the UndoHandle.
> Just adding the close made Stephan's example work for me.  What
> exactly was the problem you ran into (one of the other parsers
> perhaps?).
> 
> > As a result, you can lose the first couple of lines which are
> > previously examined to determine the filetype. The fix is to make
> > this a proper iterator. You can either check out current CVS, or
> > make the addition manually to Bio/File.py in your current version:
> >
> > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython
> 
> Adding this to the UndoHandle seems a sensible improvement - but I
> don't see how it can affect Stephan's script.
> 
> Peter


From stephan80 at mac.com  Wed Oct  8 20:37:17 2008
From: stephan80 at mac.com (Stephan Schiffels)
Date: Wed, 08 Oct 2008 22:37:17 +0200
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com>
References: <133483072970409871957631124263040035200-Webmail2@me.com>
	<320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>
	<320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com>
Message-ID: <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com>

  Hi Peter,

OK, first of all... you were right of course, with out_handle.write 
(net_handle.read()) the download works properly and reading the file  
from disk also works.The tutorial is very clear on that point, I agree.

To illustrate why I made the mistake even though I read the tutorial:
I made some code like:

try:
	unpickling a file as SeqRecord...
except IOError:
	download file into SeqRecord AND pickle afterwards to disk

So, as you can see, I already tried to make the download only once!  
The disk-saving step, I realized, was smarter to do via cPickle since  
then reading from it also goes faster than parsing the genbank file  
each time. So my goal was to either load a pickled SeqRecord, or  
download into SeqRecord and then pickle to disk. I hope you agree  
that concerning resources from NCBI this way is (at least in  
principle) already quite optimal. However, as you pointed out,  
parsing from the internet makes problems.

I think the advantages of not having to download each time were clear  
to me from the tutorial. Just that downloading AND parsing at the  
same time makes problems didnt appear to me. The addings to the  
tutorial seem to give some idea.

Thanks and Regards,
Stephan

Am 08.10.2008 um 21:32 schrieb Peter:

>> Yes - one big hint: DON'T try and parse these large files directly
>> from the internet.  Use efetch to download the file and save it to
>> disk.  Then open this local file for parsing.
>> ...
>> Do you think the Biopython tutorial should be more explicit about  
>> this
>> topic?
>
> I've changed the tutorial (the SeqIO and Entrez chapters) in CVS to
> make this advice more explicit, and included an example of doing this
> too.
>
> import os
> from Bio import SeqIO
> from Bio import Entrez
> Entrez.email = "A.N.Other at example.com"     # Always tell NCBI who  
> you are
> filename = "gi_186972394.gbk"
> if not os.path.isfile(filename) :
>     print "Downloading..."
>     net_handle = Entrez.efetch 
> (db="nucleotide",id="186972394",rettype="genbank")
>     out_handle = open(filename, "w")
>     out_handle.write(net_handle.read())
>     out_handle.close()
>     net_handle.close()
>     print "Saved"
>
> print "Parsing..."
> record = SeqIO.read(open(filename), "genbank")
> print record
>
>
> Peter


From biopythonlist at gmail.com  Thu Oct  9 08:52:42 2008
From: biopythonlist at gmail.com (dr goettel)
Date: Thu, 9 Oct 2008 10:52:42 +0200
Subject: [BioPython] taxonomic tree
In-Reply-To: <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com>
References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com>
	<320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com>
Message-ID: <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com>

On Wed, Oct 8, 2008 at 6:38 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Wed, Oct 8, 2008 at 5:23 PM, dr goettel <biopythonlist at gmail.com>
> wrote:
> > Hello, I'm new in this list and in BioPython.
>
> Hello :)
>
> > I would like to create a NCBI-like taxonomic tree and then fill it with
> the
> > organisms that I have in a file. Is there an easy way to do this? I
> started
> > using biopython's function at 7.11.4 (finding the lineage of an organism)
> in
> > the tutorial, ...
>
> For anyone reading this later on, note that the tutorial section
> numbers tend to change with each release of Biopython.  This section
> just uses Bio.Entrez to fetch taxonomy information for a particular
> NCBI taxon id.
>
> > but I need to do this tens of thousands times so it spends too
> > much time querying NCBI database.
>
> Also calling Bio.Entrez 10000 times might annoy the NCBI ;)
>
> > Therefore I built a taxonomic database
> > locally and implemented something similar to 7.11.4 tutorial's function
> so I
> > get, for every sequence, the lineage in the same way:
> >
> > 'cellular organisms; Eukaryota; Viridiplantae; Streptophyta;
> Streptophytina;
> >  Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta;
> >  Liliopsida; Asparagales; Orchidaceae'
>
> I assume you used the NCBI provided taxdump files to populate the
> database?  See ftp://ftp.ncbi.nih.gov/pub/taxonomy/
>

Yes I did.


>
> Personally rather than designing my own database just for this (and
> writing a parser for the taxonomy files), I would have suggested
> installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl
> to download and import the data for you.  This is a simple perl script
> - you don't need BioPerl.  See http://www.biopython.org/wiki/BioSQL
> for details.
>

I also used the load_ncbi_taxonomy.pl script. It worked great!


>
> > Now I need to create a tree, or fill an already created one. And then
> search
> > it by some criteria.
>
> What kind of tree do you mean?  Are you talking about creating a
> Newick tree, or an in memory structure?  Perhaps the Bio.Nexus
> module's tree functionality would help.
>

Thankyou very much. I still don't know if I want Newick tree or the other
one. I'll take a look on Bio.Nexus module


>
> If you are interested, the BioSQL tables record the taxonomy tree
> using two methods, each node has a parent node allowing you to walk up
> the lineage.  There are also left/right values allowing selection of
> all child nodes efficiently via an SQL select statement.
>
> Peter
>

This is what I was trying to do, from the name of the organism (the leaf of
the tree) and getting every node using the parent_node field of the taxon
table, until reaching the root node. Once I have all the steps to the root
node then I have to create/filling the tree with my data in order to
examinate the number of organisms integrating certain
class/order/family/genus... etc
Any ideas will be very apreciated.

Thankyou very much for your answer and I'll take a look on Bio.Nexus module.

drG


From biopython at maubp.freeserve.co.uk  Thu Oct  9 09:31:16 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 9 Oct 2008 10:31:16 +0100
Subject: [BioPython] taxonomic tree
In-Reply-To: <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com>
References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com>
	<320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com>
	<9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com>
Message-ID: <320fb6e00810090231w723e3b29m5e070c55166d3bfc@mail.gmail.com>

>> Personally rather than designing my own database just for this (and
>> writing a parser for the taxonomy files), I would have suggested
>> installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl
>> to download and import the data for you.  This is a simple perl script
>> - you don't need BioPerl.  See http://www.biopython.org/wiki/BioSQL
>> for details.
>
> I also used the load_ncbi_taxonomy.pl script. It worked great!

Good.  I would encourage you to use the version from BioSQL v1.0.1 if
you are not already, as the version with BioSQL v1.0.0 makes an
additional unnecessary assumption about the database keys matching the
NCBI taxon ID.

>> If you are interested, the BioSQL tables record the taxonomy tree
>> using two methods, each node has a parent node allowing you to walk up
>> the lineage.  There are also left/right values allowing selection of
>> all child nodes efficiently via an SQL select statement.
>
> This is what I was trying to do, from the name of the organism (the leaf of
> the tree) and getting every node using the parent_node field of the taxon
> table, until reaching the root node. Once I have all the steps to the root
> node then I have to create/filling the tree with my data in order to
> examinate the number of organisms integrating certain
> class/order/family/genus... etc
> Any ideas will be very apreciated.

To do this in Biopython you'll have to write some SQL commands - but
first you need to understand how the left/right values work if you
want to take advantage of them.  I refer you to this thread on the
BioSQL mailing list earlier in the year:
http://lists.open-bio.org/pipermail/biosql-l/2008-April/001234.html

In particular, Hilmar referred to Joe Celko's SQL for Smarties books,
and the introduction to this nested-set representation given here:
http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html

Alternatively, if you wanted to avoid the left/right values, you could
use recursion or loops on the parent ID links to build up the tree.
For a single lineage this is fine - but for a full try I would expect
the left/right values to be faster.

Note that Biopython (in CVS now) ignores the left/right values.  This
is for two reasons - for pulling out a single lineage, Eric found this
was faster.  Also, when adding new entries to the database
re-calculating the left/right values is too slow, so we leave them as
NULL (and let the user (re)run load_ncbi_taxonomy.pl later if they
care).  This means we don't want to depend on the left/right values
being present.

Peter


From stephan.schiffels at uni-koeln.de  Thu Oct  9 13:01:11 2008
From: stephan.schiffels at uni-koeln.de (Stephan Schiffels)
Date: Thu, 9 Oct 2008 15:01:11 +0200
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com>
References: <133483072970409871957631124263040035200-Webmail2@me.com>
	<320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>
	<320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com>
	<2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com>
	<320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com>
Message-ID: <171A75DA-EE34-44AB-8E16-DEC626F7164C@uni-koeln.de>

Hi Peter,

Am 08.10.2008 um 22:57 schrieb Peter:

> I'm curious - do you have any numbers for the relative times to load a
> SeqRecord from a pickle, or re-parse it from the GenBank file?  I'm
> aware of some "hot spots" in the GenBank parser which take more time
> than they really need to (feature location parsing in particular).

So, here is a little profiling of reading a large chromosome both as  
genbank and from a pickled SeqRecord (both from disk of course):
 >>> t = Timer("a = cPickle.load(open('DroMel_chr2L.pickle'))",  
"import cPickle")
 >>> t.timeit(number=1)
5.2086620330810547
 >>> t = Timer("a = SeqIO.read(open('DroMel_chr2L.gbk'), 'genbank')",  
"from Bio import SeqIO")
 >>> t.timeit(number=1)
53.902437925338745
 >>>

As you see there is an amazing 10fold speed-gain using cPickle in  
comparison to SeqIO.read() ... not bad! The pickled file is a bit  
larger than the genbank file, but not much.

> However, even if using pickles is much faster, I would personally
> still rather use this approach:
>
> if file not present:
>    download from NCBI and save it
> parse file
>
Thats precisely how I do it now. Works cool!


> I think it is safer to keep the original data in the NCBI provided
> format, rather than as a python pickle.  Some of my reasons include:
>
> * you might want to parse the files with a different tool one day
> (e.g. grep, or maybe BioPerl, or EMBOSS)
> * different versions of Biopython will parse the file slightly
> differently (e.g. once Bugs 2225 and 2578 are fixed the SeqRecord
> should include slightly more information from a GenBank file) while
> your pickle will be static
> * if the SeqRecord or Seq objects themselves change slightly between
> versions of Biopython, the pickle may not work
> * more generally, is it safe to transfer the pickly files between
> different computers (e.g. different versions of python or Biopython,
> different OS, different line endings)?
>
> These issues may not be a problem in your setting.

You are right and in fact I now safe both the genbank file and the  
pickled file to disk, so I have all the backup.

>
> More generally, you could consider using BioSQL, but this may be
> overkill for your needs.
>
BioSQL is something that I like a lot. I have not yet digged my way  
through it but hopefully there will be options for me from that side  
as well.

>> However, as you pointed out, parsing from the internet makes  
>> problems.
>
> If you do work out exactly what is going wrong, I would be interested
> to hear about it.
>
Hmm, probably I wont find it out. Parsing from the internet works for  
small files, it must be some network-issue, dont know. Since I am in  
the university-web I doubt that the error starts at my side, maybe  
NCBI clears the connection if the other side is too slow, which is  
the case for the parsing process... But I understand too little about  
networking.

>> I think the advantages of not having to download each time were  
>> clear to me
>> from the tutorial. Just that downloading AND parsing at the same  
>> time makes
>> problems didnt appear to me. The addings to the tutorial seem to  
>> give some
>> idea.
>
> Your approach all makes sense. Thanks for explaining your thoughts.  I
> don't think I'd ever tried efetch on such a large GenBank file in the
> first place - for genomes I have usually used FTP instead.
>
> Peter

Regards,
Stephan


From biopython at maubp.freeserve.co.uk  Thu Oct  9 14:18:52 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 9 Oct 2008 15:18:52 +0100
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <171A75DA-EE34-44AB-8E16-DEC626F7164C@uni-koeln.de>
References: <133483072970409871957631124263040035200-Webmail2@me.com>
	<320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>
	<320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com>
	<2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com>
	<320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com>
	<171A75DA-EE34-44AB-8E16-DEC626F7164C@uni-koeln.de>
Message-ID: <320fb6e00810090718g3420729fh50520a4760c5d27@mail.gmail.com>

Peter wrote:
>> I'm curious - do you have any numbers for the relative times to load a
>> SeqRecord from a pickle, or re-parse it from the GenBank file?  I'm
>> aware of some "hot spots" in the GenBank parser which take more time
>> than they really need to (feature location parsing in particular).

Stephan wrote:
> So, here is a little profiling of reading a large chromosome both as genbank
> and from a pickled SeqRecord (both from disk of course):
>>>> t = Timer("a = cPickle.load(open('DroMel_chr2L.pickle'))", "import
>>>> cPickle")
>>>> t.timeit(number=1)
> 5.2086620330810547
>>>> t = Timer("a = SeqIO.read(open('DroMel_chr2L.gbk'), 'genbank')", "from
>>>> Bio import SeqIO")
>>>> t.timeit(number=1)
> 53.902437925338745
>>>>
>
> As you see there is an amazing 10fold speed-gain using cPickle in comparison
> to SeqIO.read() ... not bad! The pickled file is a bit larger than the
> genbank file, but not much.

I'm seeing more like a three fold speed-gain (using cPickle protocol
0, with Python 2.5.2 on a Mac), which is less impressive.  For a 10
fold speed up I can see why the complexity overhead of using pickle
could be worthwhile.

cPickle.load() took 8.5s
cPickle.load() took 10.0s
cPickle.load() took 9.9s
SeqIO.read() took 29.9s
SeqIO.read() took 29.8s
SeqIO.read() took 29.8s

(Script below)

I'm not very impressed with the 30 seconds needed to parse a 30MB
file.  There is certainly scope for speeding up the GenBank parsing
here.

Peter

---------------

My timing script:

import os
import cPickle
import time
from Bio import Entrez, SeqIO
#Entrez.email = "..."

id="57"
genbank_filename = "NC_004354.gbk"
pickle_filename = "NC_004354.pickle"

if not os.path.isfile(genbank_filename) :
    print "Downloading..."
    net_handle = Entrez.efetch(db="genome", id=id, rettype="genbank")
    out_handle = open(genbank_filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    print "Saved"

if not os.path.isfile(pickle_filename) :
    print "Parsing..."
    record = SeqIO.read(open(genbank_filename), 'genbank')
    print "Pickling..."
    out_handle = open(pickle_filename ,"w")
    cPickle.dump(record, out_handle)
    out_handle.close()
    print "Saved"

print "Profiling..."
for i in range(3) :
    start = time.time()
    record = cPickle.load(open(pickle_filename))
    print "cPickle.load() took %0.1fs" % (time.time() - start)
for i in range(3) :
    start = time.time()
    record = SeqIO.read(open(genbank_filename), 'genbank')
    print "SeqIO.read() took %0.1fs" % (time.time() - start)
print "Done"


From biopython at maubp.freeserve.co.uk  Thu Oct  9 15:48:26 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 9 Oct 2008 16:48:26 +0100
Subject: [BioPython] Deprecating Bio.PubMed and some of Bio.GenBank
Message-ID: <320fb6e00810090848g2a516877i5950f515e748b9d0@mail.gmail.com>

Dear Biopythoneers,

Those of you who looked at the release notes for Biopython 1.48 might
have read this bit:

>> Bio.PubMed and the online code in Bio.GenBank are now considered
>> obsolete, and we intend to deprecate them after the next release.
>> For accessing PubMed and GenBank, please use Bio.Entrez instead.

These bits of code are effectively simple wrappers for Bio.Entrez.
While they may be simple to use, they cannot take advantage of the
NCBI's Entrez utils history functionality.  This means they discourage
users from following the NCBI's preferred usage patterns.

We're already trying to encouraging the use of Bio.Entrez by
documenting it prominently in the tutorial (which seems to be working
given the recent questions on the mailing list), but for Biopython
1.49 I'm suggesting we go further and deprecate Bio.PubMed and the
online code in Bio.GenBank.  This would mean a warning message would
appear when this code is used, and (barring feedback) after a couple
of releases this code would be removed completely.

Any comments or objections?  In particular, is anyone using this
"obsolete" functionality now?

Peter


From biopythonlist at gmail.com  Thu Oct  9 16:32:11 2008
From: biopythonlist at gmail.com (dr goettel)
Date: Thu, 9 Oct 2008 18:32:11 +0200
Subject: [BioPython] taxonomic tree
In-Reply-To: <320fb6e00810090231w723e3b29m5e070c55166d3bfc@mail.gmail.com>
References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com>
	<320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com>
	<9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com>
	<320fb6e00810090231w723e3b29m5e070c55166d3bfc@mail.gmail.com>
Message-ID: <9b15d9f30810090932qb22ca8boc6edc871bf285154@mail.gmail.com>

> To do this in Biopython you'll have to write some SQL commands - but
> first you need to understand how the left/right values work if you
> want to take advantage of them.  I refer you to this thread on the
> BioSQL mailing list earlier in the year:
> http://lists.open-bio.org/pipermail/biosql-l/2008-April/001234.html
>
> In particular, Hilmar referred to Joe Celko's SQL for Smarties books,
> and the introduction to this nested-set representation given here:
> http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html
>

That's great!!
Taking advantage of the left/right values will help me!! They 're  great.
I started writing a lot of code to do something that in fact can be done
with some sql statements.
In fact the sql statements are quite difficult for me so I have to deep
inside "inner joins".

Thankyou very much


drG


From biopython at maubp.freeserve.co.uk  Mon Oct 13 12:38:56 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Oct 2008 13:38:56 +0100
Subject: [BioPython] Translation method for Seq object
Message-ID: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>

Dear Biopythoneers,

This is a request for feedback about proposed additions to the Seq
object for the next release of Biopython.  I'd like people to pick (a)
to (e) in the list below (with additional comments or counter
suggestions welcome).

Enhancement bug 2381 is about adding transcription and translation
methods to the Seq object, allowing an object orientated style of
programming.

e.g. Current functional programming style:

>>> from Bio.Seq import Seq, transcribe
>>> from Bio.Alphabet import generic_dna
>>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna)
>>> my_seq
Seq('CAGTGACGTTAGTCCG', DNAAlphabet())
>>> transcribe(my_seq)
Seq('CAGUGACGUUAGUCCG', RNAAlphabet())

With the latest Biopython in CVS, you can now invoke a Seq object
method instead for transcription (or back transcription):

>>> my_seq.transcribe()
Seq('CAGUGACGUUAGUCCG', RNAAlphabet())

For a comparison, compare the shift from python string functions to
string methods.  This also makes the functionality more discoverable
via dir(my_seq).

Adding Seq object methods "transcribe" and "back_transcribe" doesn't
cause any confusion with the python string methods.  However, for
translation, the python string has an existing "translate" method:

> S.translate(table [,deletechars]) -> string
>
> Return a copy of the string S, where all characters occurring
> in the optional argument deletechars are removed, and the
> remaining characters have been mapped through the given
> translation table, which must be a string of length 256.

I don't think this functionality is really of direct use for sequences, and
having a Seq object "translate" method do a biological translation into
a protein sequence is much more intuitive. However, this could cause
confusion if the Seq object is passed to non-Biopython code which
expects a string like translate method.

To avoid this naming clash, a different method name would needed.

This is where some user feedback would be very welcome - I think
the following cover all the alternatives of what to call a biological
translation function (nucleotide to protein):

(a) Just use translate (ignore the existing string method)
(b) Use translate_ (trailing underscore, see PEP8)
(c) Use translation (a noun rather than verb; different style).
(d) Use something else (e.g. bio_translate or ...)
(e) Don't add a biological translation method at all because ...

Thanks,

Peter

See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381


From ericgibert at yahoo.fr  Mon Oct 13 14:38:02 2008
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Mon, 13 Oct 2008 22:38:02 +0800
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
Message-ID: <BFAD764F5E77416099835E06AA4016EC@Gecko>


(a) Seq is an object, string is another object... each of them have various
methods and coincidently two of them have the same name...

Eric


-----Original Message-----
From: biopython-bounces at lists.open-bio.org
[mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter
Sent: Monday, October 13, 2008 8:39 PM
To: BioPython Mailing List
Subject: [BioPython] Translation method for Seq object

Dear Biopythoneers,

This is a request for feedback about proposed additions to the Seq
object for the next release of Biopython.  I'd like people to pick (a)
to (e) in the list below (with additional comments or counter
suggestions welcome).

Enhancement bug 2381 is about adding transcription and translation
methods to the Seq object, allowing an object orientated style of
programming.

e.g. Current functional programming style:

>>> from Bio.Seq import Seq, transcribe
>>> from Bio.Alphabet import generic_dna
>>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna)
>>> my_seq
Seq('CAGTGACGTTAGTCCG', DNAAlphabet())
>>> transcribe(my_seq)
Seq('CAGUGACGUUAGUCCG', RNAAlphabet())

With the latest Biopython in CVS, you can now invoke a Seq object
method instead for transcription (or back transcription):

>>> my_seq.transcribe()
Seq('CAGUGACGUUAGUCCG', RNAAlphabet())

For a comparison, compare the shift from python string functions to
string methods.  This also makes the functionality more discoverable
via dir(my_seq).

Adding Seq object methods "transcribe" and "back_transcribe" doesn't
cause any confusion with the python string methods.  However, for
translation, the python string has an existing "translate" method:

> S.translate(table [,deletechars]) -> string
>
> Return a copy of the string S, where all characters occurring
> in the optional argument deletechars are removed, and the
> remaining characters have been mapped through the given
> translation table, which must be a string of length 256.

I don't think this functionality is really of direct use for sequences, and
having a Seq object "translate" method do a biological translation into
a protein sequence is much more intuitive. However, this could cause
confusion if the Seq object is passed to non-Biopython code which
expects a string like translate method.

To avoid this naming clash, a different method name would needed.

This is where some user feedback would be very welcome - I think
the following cover all the alternatives of what to call a biological
translation function (nucleotide to protein):

(a) Just use translate (ignore the existing string method)
(b) Use translate_ (trailing underscore, see PEP8)
(c) Use translation (a noun rather than verb; different style).
(d) Use something else (e.g. bio_translate or ...)
(e) Don't add a biological translation method at all because ...

Thanks,

Peter

See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381
_______________________________________________
BioPython mailing list  -  BioPython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython


From bsouthey at gmail.com  Mon Oct 13 14:58:07 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Mon, 13 Oct 2008 09:58:07 -0500
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
Message-ID: <48F361FF.103@gmail.com>

Peter wrote:
> Dear Biopythoneers,
>
> This is a request for feedback about proposed additions to the Seq
> object for the next release of Biopython.  I'd like people to pick (a)
> to (e) in the list below (with additional comments or counter
> suggestions welcome).
>
> Enhancement bug 2381 is about adding transcription and translation
> methods to the Seq object, allowing an object orientated style of
> programming.
>
> e.g. Current functional programming style:
>
>   
>>>> from Bio.Seq import Seq, transcribe
>>>> from Bio.Alphabet import generic_dna
>>>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna)
>>>> my_seq
>>>>         
> Seq('CAGTGACGTTAGTCCG', DNAAlphabet())
>   
>>>> transcribe(my_seq)
>>>>         
> Seq('CAGUGACGUUAGUCCG', RNAAlphabet())
>
> With the latest Biopython in CVS, you can now invoke a Seq object
> method instead for transcription (or back transcription):
>
>   
>>>> my_seq.transcribe()
>>>>         
> Seq('CAGUGACGUUAGUCCG', RNAAlphabet())
>
> For a comparison, compare the shift from python string functions to
> string methods.  This also makes the functionality more discoverable
> via dir(my_seq).
>
> Adding Seq object methods "transcribe" and "back_transcribe" doesn't
> cause any confusion with the python string methods.  However, for
> translation, the python string has an existing "translate" method:
>
>   
>> S.translate(table [,deletechars]) -> string
>>
>> Return a copy of the string S, where all characters occurring
>> in the optional argument deletechars are removed, and the
>> remaining characters have been mapped through the given
>> translation table, which must be a string of length 256.
>>     
>
> I don't think this functionality is really of direct use for sequences, and
> having a Seq object "translate" method do a biological translation into
> a protein sequence is much more intuitive. However, this could cause
> confusion if the Seq object is passed to non-Biopython code which
> expects a string like translate method.
>
> To avoid this naming clash, a different method name would needed.
>
> This is where some user feedback would be very welcome - I think
> the following cover all the alternatives of what to call a biological
> translation function (nucleotide to protein):
>
> (a) Just use translate (ignore the existing string method)
> (b) Use translate_ (trailing underscore, see PEP8)
> (c) Use translation (a noun rather than verb; different style).
> (d) Use something else (e.g. bio_translate or ...)
> (e) Don't add a biological translation method at all because ...
>
> Thanks,
>
> Peter
>
> See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>   
Hi,
My thoughts on this is that it is generally best to avoid any confusion 
when possible. But 'translate' is not a reserved word and the Python 
documentation notes that the unicode version lacks the optional 
deletechars argument (so there is precedent for using the same word). 
Also it involves the methods versus functions argument but many of the 
string functions have been depreciated and will get removed in Python 
3.0 (so in Python 3.0 I think it will be hard to get a name clash 
without some strange inheritance going on).

Therefore, provided 'translate' is a method of Seq then I do not see any 
strong reason to avoid it except that it is long (but shorter than 
translation) :-)

Would be too cryptic to have dna(), rna() and protein() methods that 
provide the appropriate conversion based on the Seq type?
Obviously reverse translation of a protein sequence to a DNA sequence is 
complex if there are many solutions.

Regards
Bruce


From mjldehoon at yahoo.com  Mon Oct 13 14:57:28 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 13 Oct 2008 07:57:28 -0700 (PDT)
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
Message-ID: <421846.1946.qm@web62403.mail.re1.yahoo.com>

(f) Use .translate both for the Python .translate and for the Biopython .translate.

S.translate() ===> Biopython .translate

S.translate(table [,deletechars]) ===> Python .translate

We can tell from the presence or absence of arguments whether the user intends Python's translate or Biopython's translate.

--Michiel.


--- On Mon, 10/13/08, Peter <biopython at maubp.freeserve.co.uk> wrote:

> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: [BioPython] Translation method for Seq object
> To: "BioPython Mailing List" <biopython at lists.open-bio.org>
> Date: Monday, October 13, 2008, 8:38 AM
> Dear Biopythoneers,
> 
> This is a request for feedback about proposed additions to
> the Seq
> object for the next release of Biopython.  I'd like
> people to pick (a)
> to (e) in the list below (with additional comments or
> counter
> suggestions welcome).
> 
> Enhancement bug 2381 is about adding transcription and
> translation
> methods to the Seq object, allowing an object orientated
> style of
> programming.
> 
> e.g. Current functional programming style:
> 
> >>> from Bio.Seq import Seq, transcribe
> >>> from Bio.Alphabet import generic_dna
> >>> my_seq = Seq("CAGTGACGTTAGTCCG",
> generic_dna)
> >>> my_seq
> Seq('CAGTGACGTTAGTCCG', DNAAlphabet())
> >>> transcribe(my_seq)
> Seq('CAGUGACGUUAGUCCG', RNAAlphabet())
> 
> With the latest Biopython in CVS, you can now invoke a Seq
> object
> method instead for transcription (or back transcription):
> 
> >>> my_seq.transcribe()
> Seq('CAGUGACGUUAGUCCG', RNAAlphabet())
> 
> For a comparison, compare the shift from python string
> functions to
> string methods.  This also makes the functionality more
> discoverable
> via dir(my_seq).
> 
> Adding Seq object methods "transcribe" and
> "back_transcribe" doesn't
> cause any confusion with the python string methods. 
> However, for
> translation, the python string has an existing
> "translate" method:
> 
> > S.translate(table [,deletechars]) -> string
> >
> > Return a copy of the string S, where all characters
> occurring
> > in the optional argument deletechars are removed, and
> the
> > remaining characters have been mapped through the
> given
> > translation table, which must be a string of length
> 256.
> 
> I don't think this functionality is really of direct
> use for sequences, and
> having a Seq object "translate" method do a
> biological translation into
> a protein sequence is much more intuitive. However, this
> could cause
> confusion if the Seq object is passed to non-Biopython code
> which
> expects a string like translate method.
> 
> To avoid this naming clash, a different method name would
> needed.
> 
> This is where some user feedback would be very welcome - I
> think
> the following cover all the alternatives of what to call a
> biological
> translation function (nucleotide to protein):
> 
> (a) Just use translate (ignore the existing string method)
> (b) Use translate_ (trailing underscore, see PEP8)
> (c) Use translation (a noun rather than verb; different
> style).
> (d) Use something else (e.g. bio_translate or ...)
> (e) Don't add a biological translation method at all
> because ...
> 
> Thanks,
> 
> Peter
> 
> See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


From biopython at maubp.freeserve.co.uk  Mon Oct 13 15:27:37 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Oct 2008 16:27:37 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <421846.1946.qm@web62403.mail.re1.yahoo.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
	<421846.1946.qm@web62403.mail.re1.yahoo.com>
Message-ID: <320fb6e00810130827j3ec07434s2f58e370743f9537@mail.gmail.com>

So I did manage to leave off at least one other option from my short list :)

Michiel de Hoon wrote:
>
> (f) Use .translate both for the Python .translate and for the Biopython .translate.
>
> S.translate() ===> Biopython .translate
>
> S.translate(table [,deletechars]) ===> Python .translate
>
> We can tell from the presence or absence of arguments whether the user intends Python's translate or Biopython's translate.

Sadly its not quite that simple.

For a biological translation we'd probably want to offer optional
arguments for at least the codon table and stop symbol (like the
current Bio.Seq.translate() function), with other further arguments
possible (e.g. to treat the sequence as a complete CDS where the start
codon should be validated and taken as M).

It would still be possible to automatically detect which translation
was required, but it wouldn't be very nice.  So overall I'm not keen
on this approach.

Peter


From biopython at maubp.freeserve.co.uk  Mon Oct 13 15:54:32 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Oct 2008 16:54:32 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <48F361FF.103@gmail.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
	<48F361FF.103@gmail.com>
Message-ID: <320fb6e00810130854m38f37075gf85b798cb4a98e21@mail.gmail.com>

Bruce wrote:
> ...
> Therefore, provided 'translate' is a method of Seq then I do not see any
> strong reason to avoid it except that it is long (but shorter than
> translation) :-)

Good - that sounds like another vote for option (a) in my original list.

> Would be too cryptic to have dna(), rna() and protein() methods that provide
> the appropriate conversion based on the Seq type?

Or in a similar vein, to_dna, to_rna, and to_protein? Or toDNA, toRNA,
toProtein?  I'd have to go and consult the current python style guide
for what is the current best practice. Something like that does sounds
reasonable (and they are short), but historically all related
Biopython functions have used the terms (back) transcription and
(back) translation so I would prefer to stick with those.

> Obviously reverse translation of a protein sequence to a DNA sequence is
> complex if there are many solutions.

Yes, back-translation is tricky because there is generally more than
one codon for any amino acid.  Ambiguous nucleotides can be used to
describe several possible codons giving that amino acid, but in
general it is not possible to do this and describe all the possible
codons which could have been used.  This topic is worth of an entire
thread... for the record, I would envisage a back_translate method for
the Seq object (assuming we settle on translate as the name for the
forward translation from nucleotide to protein).

Peter


From mjldehoon at yahoo.com  Tue Oct 14 00:50:14 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 13 Oct 2008 17:50:14 -0700 (PDT)
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
Message-ID: <900752.12970.qm@web62408.mail.re1.yahoo.com>

> (a) Just use translate (ignore the existing string method)
> (b) Use translate_ (trailing underscore, see PEP8)
> (c) Use translation (a noun rather than verb; different
> style).
> (d) Use something else (e.g. bio_translate or ...)
> (e) Don't add a biological translation method at all
> because ...


(a).
Note also that once Seq objects inherit from string, the Python .translate method is still accessible as str.translate(seq).

--Michiel.


From biopython at maubp.freeserve.co.uk  Tue Oct 14 10:18:13 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 14 Oct 2008 11:18:13 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <900752.12970.qm@web62408.mail.re1.yahoo.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
	<900752.12970.qm@web62408.mail.re1.yahoo.com>
Message-ID: <320fb6e00810140318i14c6362eq8a51030b1da660ae@mail.gmail.com>

OK, we seem to have a consensus :)

In Biopython's CVS, the Seq object now has a translate method which
does a biological translation.  If anyone comes up with a better
proposal before the next release, we can still rename this.  Otherwise
I will update the Tutorial in CVS shortly...

Note that for now, I have followed the existing Bio.Seq.translate(...)
function and the new Seq object translate(...) method takes only two
optional parameters - the codon table and the stop symbol.  I have
noted some suggestions for possible additional arguments on Bug 2381.

The adventurous among you may want to use CVS to update your Biopython
installations to try this out.  Please note that you will now need
numpy instead of Numeric (there is nothing to stop you having both
numpy and Numeric installed at the same time).  If you do try out the
CVS code, please run the unit tests and report any issues.

Thanks,

Peter


From biopython at maubp.freeserve.co.uk  Tue Oct 14 11:11:20 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 14 Oct 2008 12:11:20 +0100
Subject: [BioPython] Deprecating Bio.PubMed and some of Bio.GenBank
In-Reply-To: <320fb6e00810090848g2a516877i5950f515e748b9d0@mail.gmail.com>
References: <320fb6e00810090848g2a516877i5950f515e748b9d0@mail.gmail.com>
Message-ID: <320fb6e00810140411o341df854x49ef3e61421193b8@mail.gmail.com>

On Thu, Oct 9, 2008 at 4:48 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Dear Biopythoneers,
>
> Those of you who looked at the release notes for Biopython 1.48 might
> have read this bit:
>
>>> Bio.PubMed and the online code in Bio.GenBank are now considered
>>> obsolete, and we intend to deprecate them after the next release.
>>> For accessing PubMed and GenBank, please use Bio.Entrez instead.
>
> These bits of code are effectively simple wrappers for Bio.Entrez.
> While they may be simple to use, they cannot take advantage of the
> NCBI's Entrez utils history functionality.  This means they discourage
> users from following the NCBI's preferred usage patterns.
>
> We're already trying to encouraging the use of Bio.Entrez by
> documenting it prominently in the tutorial (which seems to be working
> given the recent questions on the mailing list), but for Biopython
> 1.49 I'm suggesting we go further and deprecate Bio.PubMed and the
> online code in Bio.GenBank.  This would mean a warning message would
> appear when this code is used, and (barring feedback) after a couple
> of releases this code would be removed completely.
>
> Any comments or objections?  In particular, is anyone using this
> "obsolete" functionality now?

I've just deprecated Bio.PubMed in CVS - meaning for the next release
of Biopython you'll see a warning message when you import the PubMed
module.  If you are using this module please say something sooner rather
than later.  This can still be undone.

Thanks,

Peter


From dalloliogm at gmail.com  Thu Oct 16 10:02:46 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 16 Oct 2008 12:02:46 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
Message-ID: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>

Hi,
I was going to write a python program to calculate Fst statistics from a
sample of SNP data.
Is there any module already available to do that in biopython, that I am
missing?
I saw there is a 'PopGen' module, but the Cookbook says it doesn't support
sequence data.
Is someone actually writing any module in python to calculate such
statistics?

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Thu Oct 16 10:23:12 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 16 Oct 2008 11:23:12 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
Message-ID: <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>

On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Hi,
> I was going to write a python program to calculate Fst statistics from a
> sample of SNP data.  Is there any module already available to do that
> in biopython, that I am missing?  I saw there is a 'PopGen' module, but
> the Cookbook says it doesn't support sequence data.
> Is someone actually writing any module in python to calculate such
> statistics?

I think this will be a question for Tiago (the Bio.PopGen author),
although others on the list may have also tackled similar questions.

In terms of reading in the SNP data, what file format will you be
loading?  Does Bio.SeqIO currently suffice?

Have you looked into what (if any) additional python libraries you
would need?  For any Biopython addition, a dependency on just numpy
that would be preferable, but Tiago has previously suggested an
optional dependency on scipy for additional statistics needed in
population genetics.

Peter


From tiagoantao at gmail.com  Thu Oct 16 14:10:47 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 16 Oct 2008 15:10:47 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
Message-ID: <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>

Hi,

On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Hi,
> I was going to write a python program to calculate Fst statistics from a
> sample of SNP data.
> Is there any module already available to do that in biopython, that I am
> missing?
> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support
> sequence data.
> Is someone actually writing any module in python to calculate such
> statistics?

The answer to this has to be done in parts, because it is actually a
bunch of related (but different) issues


On the data
1. Sequence support. Bio.PopGen doesn't support statistics for
sequences (like Tajima D and the like), BUT that is not relevant if
you want to do frequency based statistics (like good old Fst), you
just have to count frequencies and put into a "frequency format"
2. SNPs is actually not a sequence, but a single element, so it
becomes easier. What you need at the end is something like this:
For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0
For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0
For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10
And so on... You have to end up with frequency counts per population
So, as long as you convert data (sequence, SNP, microsatellite) to
frequency counts per population, there are no issues with the type of
data.

On calculating the statistics (Fst)
1. I am fully aware that core statistics like Fst (I work with Fst a
lot myself) are fundamental in a population genetics module, but I
sincerely don't know how to proceed because a long term solution
requires generic statistical support (e.g., chi-square tests
Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy
(and I will not maintain generic statistics code myself). I know that
Bio.PopGen is of little use without support for standard statistics.
2. A workaround (for which I have code written - but not commited to
the repository - I can give it to you) is to invoke GenePop and get
the Fst estimation. This requires the data to be in GenePop format
(again you can convert SNPs and even sequences to frequency based
format)
3. That being said, I have code to estimate Fst (Cockerham and Wier
theta and a variation from Mark Beaumont) in Python. I can give it to
you (but is not much tested).


On sequence data formats:
1. Note that sequence data files (that I know off) have no provision
for population structure (you cannot say, in a standard way, sequence
X belongs to population Y). You have to do it in adhoc way. That means
you have to invent your own convention for your private use.
2. Anyway, in your case I suppose you still have to extract the SNPs
from the sequence.
3. If you want do frequency based analysis on your SNPs, I suggest you
do a conversion to GenePop anyway (therefore you can import your data
in most population structure software as GenePop format is the defacto
standard)...
4. Because of the above there is actually no good solution for
automated conversion from sequence information to frequency based one
(in biopython or in any platform whatsoever)
I can give more suggestions if you give more details or have more
specific questions.


From tiagoantao at gmail.com  Thu Oct 16 14:14:28 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 16 Oct 2008 15:14:28 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
Message-ID: <6d941f120810160714s61eb6d1cx87d1943c4068d491@mail.gmail.com>

Just a minor point: I am so used to work in Fst that I mentally
converted your "F-statistics" to Fst. Most of my mail still stands.
The only point that changes a bit is that I only have code for Fst, so
I cannot help you with any other.

On Thu, Oct 16, 2008 at 3:10 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:
> Hi,
>
> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
>> Hi,
>> I was going to write a python program to calculate Fst statistics from a
>> sample of SNP data.
>> Is there any module already available to do that in biopython, that I am
>> missing?
>> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support
>> sequence data.
>> Is someone actually writing any module in python to calculate such
>> statistics?
>
> The answer to this has to be done in parts, because it is actually a
> bunch of related (but different) issues
>
>
> On the data
> 1. Sequence support. Bio.PopGen doesn't support statistics for
> sequences (like Tajima D and the like), BUT that is not relevant if
> you want to do frequency based statistics (like good old Fst), you
> just have to count frequencies and put into a "frequency format"
> 2. SNPs is actually not a sequence, but a single element, so it
> becomes easier. What you need at the end is something like this:
> For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0
> For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0
> For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10
> And so on... You have to end up with frequency counts per population
> So, as long as you convert data (sequence, SNP, microsatellite) to
> frequency counts per population, there are no issues with the type of
> data.
>
> On calculating the statistics (Fst)
> 1. I am fully aware that core statistics like Fst (I work with Fst a
> lot myself) are fundamental in a population genetics module, but I
> sincerely don't know how to proceed because a long term solution
> requires generic statistical support (e.g., chi-square tests
> Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy
> (and I will not maintain generic statistics code myself). I know that
> Bio.PopGen is of little use without support for standard statistics.
> 2. A workaround (for which I have code written - but not commited to
> the repository - I can give it to you) is to invoke GenePop and get
> the Fst estimation. This requires the data to be in GenePop format
> (again you can convert SNPs and even sequences to frequency based
> format)
> 3. That being said, I have code to estimate Fst (Cockerham and Wier
> theta and a variation from Mark Beaumont) in Python. I can give it to
> you (but is not much tested).
>
>
> On sequence data formats:
> 1. Note that sequence data files (that I know off) have no provision
> for population structure (you cannot say, in a standard way, sequence
> X belongs to population Y). You have to do it in adhoc way. That means
> you have to invent your own convention for your private use.
> 2. Anyway, in your case I suppose you still have to extract the SNPs
> from the sequence.
> 3. If you want do frequency based analysis on your SNPs, I suggest you
> do a conversion to GenePop anyway (therefore you can import your data
> in most population structure software as GenePop format is the defacto
> standard)...
> 4. Because of the above there is actually no good solution for
> automated conversion from sequence information to frequency based one
> (in biopython or in any platform whatsoever)
> I can give more suggestions if you give more details or have more
> specific questions.
>


-- 
"Data always beats theories. 'Look at data three times and then come
to a conclusion,' versus 'coming to a conclusion and searching for
some data.' The former will win every time."
?Matthew Simmons,
http://www.tiago.org


From biopython at maubp.freeserve.co.uk  Thu Oct 16 15:11:27 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 16 Oct 2008 16:11:27 +0100
Subject: [BioPython]  back-translation method for Seq object?
Message-ID: <320fb6e00810160811s19e580b2ped86c43b32c401bb@mail.gmail.com>

Quoting from the recent thread about adding a translation method to
the Seq object, Bruce brought up back-translation:

Peter wrote:
> Bruce wrote:
>> Obviously reverse translation of a protein sequence to a DNA sequence is
>> complex if there are many solutions.
>
> Yes, back-translation is tricky because there is generally more than
> one codon for any amino acid.  Ambiguous nucleotides can be used to
> describe several possible codons giving that amino acid, but in
> general it is not possible to do this and describe all the possible
> codons which could have been used.  This topic is worth of an entire
> thread... for the record, I would envisage a back_translate method for
> the Seq object (assuming we settle on translate as the name for the
> forward translation from nucleotide to protein).

Do we actually need a back_translate method?  Can anyone suggest an
actual use-case for this?  It seems difficult to imagine that any
simple version would please everyone.

Bio.Translate (a semi-obsolete module whose deprecation has been
suggested) provides a back_translate method which picks an essentially
arbitrary but unambiguous codon for each amino acid.  Crude but
simple.  A more meaningful choice would require suppling codon
frequencies for the organism under consideration.

Other possibilities include using ambiguous nucleotides to try and
cover all the possibilities (e.g. "L" -> "CTN"), but even here in some
cases this is arbritary.  e.g. The standard three stop codons ['TAA',
'TAG', 'TGA'] could be represented as ['TAR', 'TGA'] or ['TRA', 'TAG']
but not by a single ambiguous codon ('TRR' also covers 'TGG' which
codes for 'W').

Potentially of use would be a generator function which returned all
possible back translations - but this would be complex and typically
overkill.

As a final point, a Seq object back-translation method could give RNA
or DNA.  From a biological point of view giving DNA by default would
make sense.  This choice is handled in Bio.Translate when creating the
translator object (part of what makes Bio.Translate relatively complex
to use).

Peter


From sdavis2 at mail.nih.gov  Thu Oct 16 15:16:51 2008
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 16 Oct 2008 11:16:51 -0400
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
Message-ID: <264855a00810160816g6ea765dcl65fd7e5aa38ef20c@mail.gmail.com>

On Thu, Oct 16, 2008 at 10:10 AM, Tiago Ant?o <tiagoantao at gmail.com> wrote:
> Hi,
>
> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
>> Hi,
>> I was going to write a python program to calculate Fst statistics from a
>> sample of SNP data.
>> Is there any module already available to do that in biopython, that I am
>> missing?
>> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support
>> sequence data.
>> Is someone actually writing any module in python to calculate such
>> statistics?
>
> The answer to this has to be done in parts, because it is actually a
> bunch of related (but different) issues
>
>
> On the data
> 1. Sequence support. Bio.PopGen doesn't support statistics for
> sequences (like Tajima D and the like), BUT that is not relevant if
> you want to do frequency based statistics (like good old Fst), you
> just have to count frequencies and put into a "frequency format"
> 2. SNPs is actually not a sequence, but a single element, so it
> becomes easier. What you need at the end is something like this:
> For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0
> For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0
> For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10
> And so on... You have to end up with frequency counts per population
> So, as long as you convert data (sequence, SNP, microsatellite) to
> frequency counts per population, there are no issues with the type of
> data.
>
> On calculating the statistics (Fst)
> 1. I am fully aware that core statistics like Fst (I work with Fst a
> lot myself) are fundamental in a population genetics module, but I
> sincerely don't know how to proceed because a long term solution
> requires generic statistical support (e.g., chi-square tests
> Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy
> (and I will not maintain generic statistics code myself). I know that
> Bio.PopGen is of little use without support for standard statistics.
> 2. A workaround (for which I have code written - but not commited to
> the repository - I can give it to you) is to invoke GenePop and get
> the Fst estimation. This requires the data to be in GenePop format
> (again you can convert SNPs and even sequences to frequency based
> format)
> 3. That being said, I have code to estimate Fst (Cockerham and Wier
> theta and a variation from Mark Beaumont) in Python. I can give it to
> you (but is not much tested).
>
>
> On sequence data formats:
> 1. Note that sequence data files (that I know off) have no provision
> for population structure (you cannot say, in a standard way, sequence
> X belongs to population Y). You have to do it in adhoc way. That means
> you have to invent your own convention for your private use.
> 2. Anyway, in your case I suppose you still have to extract the SNPs
> from the sequence.
> 3. If you want do frequency based analysis on your SNPs, I suggest you
> do a conversion to GenePop anyway (therefore you can import your data
> in most population structure software as GenePop format is the defacto
> standard)...
> 4. Because of the above there is actually no good solution for
> automated conversion from sequence information to frequency based one
> (in biopython or in any platform whatsoever)
> I can give more suggestions if you give more details or have more
> specific questions.

Just a little note that the R programming language has some packages
for population genetics and, of course, has excellent statistical
tools.  One can interface with it via rpy.  I'm not advocating going
this route, but just wanted to let people know about another option.

Sean


From tiagoantao at gmail.com  Thu Oct 16 15:26:52 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 16 Oct 2008 16:26:52 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <264855a00810160816g6ea765dcl65fd7e5aa38ef20c@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
	<264855a00810160816g6ea765dcl65fd7e5aa38ef20c@mail.gmail.com>
Message-ID: <6d941f120810160826q2bf25382m41890fb39a4226a0@mail.gmail.com>

The task view on Genetics for R provides a good starting point to find
R packages related to the field:

http://www.freestatistics.org/cran/web/views/Genetics.html


On Thu, Oct 16, 2008 at 4:16 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> On Thu, Oct 16, 2008 at 10:10 AM, Tiago Ant?o <tiagoantao at gmail.com> wrote:
>> Hi,
>>
>> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
>> <dalloliogm at gmail.com> wrote:
>>> Hi,
>>> I was going to write a python program to calculate Fst statistics from a
>>> sample of SNP data.
>>> Is there any module already available to do that in biopython, that I am
>>> missing?
>>> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support
>>> sequence data.
>>> Is someone actually writing any module in python to calculate such
>>> statistics?
>>
>> The answer to this has to be done in parts, because it is actually a
>> bunch of related (but different) issues
>>
>>
>> On the data
>> 1. Sequence support. Bio.PopGen doesn't support statistics for
>> sequences (like Tajima D and the like), BUT that is not relevant if
>> you want to do frequency based statistics (like good old Fst), you
>> just have to count frequencies and put into a "frequency format"
>> 2. SNPs is actually not a sequence, but a single element, so it
>> becomes easier. What you need at the end is something like this:
>> For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0
>> For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0
>> For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10
>> And so on... You have to end up with frequency counts per population
>> So, as long as you convert data (sequence, SNP, microsatellite) to
>> frequency counts per population, there are no issues with the type of
>> data.
>>
>> On calculating the statistics (Fst)
>> 1. I am fully aware that core statistics like Fst (I work with Fst a
>> lot myself) are fundamental in a population genetics module, but I
>> sincerely don't know how to proceed because a long term solution
>> requires generic statistical support (e.g., chi-square tests
>> Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy
>> (and I will not maintain generic statistics code myself). I know that
>> Bio.PopGen is of little use without support for standard statistics.
>> 2. A workaround (for which I have code written - but not commited to
>> the repository - I can give it to you) is to invoke GenePop and get
>> the Fst estimation. This requires the data to be in GenePop format
>> (again you can convert SNPs and even sequences to frequency based
>> format)
>> 3. That being said, I have code to estimate Fst (Cockerham and Wier
>> theta and a variation from Mark Beaumont) in Python. I can give it to
>> you (but is not much tested).
>>
>>
>> On sequence data formats:
>> 1. Note that sequence data files (that I know off) have no provision
>> for population structure (you cannot say, in a standard way, sequence
>> X belongs to population Y). You have to do it in adhoc way. That means
>> you have to invent your own convention for your private use.
>> 2. Anyway, in your case I suppose you still have to extract the SNPs
>> from the sequence.
>> 3. If you want do frequency based analysis on your SNPs, I suggest you
>> do a conversion to GenePop anyway (therefore you can import your data
>> in most population structure software as GenePop format is the defacto
>> standard)...
>> 4. Because of the above there is actually no good solution for
>> automated conversion from sequence information to frequency based one
>> (in biopython or in any platform whatsoever)
>> I can give more suggestions if you give more details or have more
>> specific questions.
>
> Just a little note that the R programming language has some packages
> for population genetics and, of course, has excellent statistical
> tools.  One can interface with it via rpy.  I'm not advocating going
> this route, but just wanted to let people know about another option.
>
> Sean
>
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


-- 
"Data always beats theories. 'Look at data three times and then come
to a conclusion,' versus 'coming to a conclusion and searching for
some data.' The former will win every time."
?Matthew Simmons,
http://www.tiago.org


From lpritc at scri.ac.uk  Fri Oct 17 08:24:43 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 17 Oct 2008 09:24:43 +0100
Subject: [BioPython] back-translation method for Seq object?
In-Reply-To: <320fb6e00810160811s19e580b2ped86c43b32c401bb@mail.gmail.com>
Message-ID: <C51E0A5B.17DCE%lpritc@scri.ac.uk>


On 16/10/2008 16:11, "Peter" <biopython at maubp.freeserve.co.uk> wrote:

> Quoting from the recent thread about adding a translation method to
> the Seq object, Bruce brought up back-translation:
> 
> Peter wrote:
>> Bruce wrote:
>>> Obviously reverse translation of a protein sequence to a DNA sequence is
>>> complex if there are many solutions.

This is the key problem.  Forward translation is - for a given codon table -
a one-one mapping.  Reverse translation is (for many amino acids) one-many.
If the goal is to produce the coding sequence that actually encoded a
particular protein sequence, the problem is combinatorial and rapidly
becomes messy with increasing sequence length.  And that's not considering
the problem of splice variants/intron-exon boundaries if attempting to
relate the sequence back to some genome or genome fragment - more a problem
in eukaryotes.

>> Yes, back-translation is tricky because there is generally more than
>> one codon for any amino acid.  Ambiguous nucleotides can be used to
>> describe several possible codons giving that amino acid, but in
>> general it is not possible to do this and describe all the possible
>> codons which could have been used.  This topic is worth of an entire
>> thread... for the record, I would envisage a back_translate method for
>> the Seq object (assuming we settle on translate as the name for the
>> forward translation from nucleotide to protein).
> 
> Do we actually need a back_translate method?  Can anyone suggest an
> actual use-case for this?  It seems difficult to imagine that any
> simple version would please everyone.

I agree - I can't think of an occasion where I might want to back-translate
a protein in this way that wouldn't better be handled by other means.  Not
that I'm the fount of all use-cases but, given the number of ways in which
one *could* back-translate, perhaps it would be better not to pick/guess at
any single one.

Some choices to be made in deciding how to back-translate are (and I'm sure
you've already thought of them, but they're worth writing down):

I) Protein to unambiguous RNA:
  a) Codon table: arbitrary; organism-specific; user-defined?
  b) Codon choice: arbitrary and random; arbitrary and consistent; complete
set of possibilities; most common codon (if information available); other
favoured codon (if specified)?
II) Protein to ambiguous RNA:
  a) Return a Seq, string or some other representation of ambiguity?
  b) IUPAC ambiguity symbols; choice of codons; alternative representation
of ambiguity?

The most common back-translation I do is taking aligned protein sequences
back to their known coding sequences, and this is really a case of mapping
known codons onto predefined positions, rather than the interpolation of
unknown codons that is required for back-translation as implied above.
T-coffee handles this pretty well, IIRC.

To find coding sequences for a particular protein in the originating
sequence (if known), I use BLAST.  I guess there might be value in having
the ability to identify regions of the coding sequence that are least likely
to be variable (by generating them combinatorially) so that probes might be
designed if the coding sequence is not known.  But that doesn't appear to be
the way that most sequences are obtained these days: much cheaper to bung
RNA through 454 or Solexa and work through the output than to put someone on
the task of making an array of probes to find a sequence that may or may not
encode your sequenced protein...

> Bio.Translate (a semi-obsolete module whose deprecation has been
> suggested) provides a back_translate method which picks an essentially
> arbitrary but unambiguous codon for each amino acid.  Crude but
> simple.  A more meaningful choice would require suppling codon
> frequencies for the organism under consideration.

These can be found - for many organisms - in Emboss codon usage table (.cut)
files, if you have Emboss locally.  However, is requiring Emboss as a
dependency the cleanest or wisest solution for Biopython?  This approach
solves only one problem:  given a particular codon usage table, what is the
most likely sequence that would have produced this protein.  That's not a
problem I've ever come across in anger, but given a table of 'most efficient
codons' for some biological expression system, I can see this potentially
having some use.  However, given that many microbiologists can already tell
you the preferred codons for K12 without pausing for breath, I'm not sure
there's a problem looking for this solution.
 
> Other possibilities include using ambiguous nucleotides to try and
> cover all the possibilities (e.g. "L" -> "CTN"), but even here in some
> cases this is arbritary.  e.g. The standard three stop codons ['TAA',
> 'TAG', 'TGA'] could be represented as ['TAR', 'TGA'] or ['TRA', 'TAG']
> but not by a single ambiguous codon ('TRR' also covers 'TGG' which
> codes for 'W').

If Seq had an ambiguity-aware sequence representation, this could be
handled.  For example, a regular expression-based sequence representation
(which could lie alongside Seq.data, perhaps as Seq.regex) could represent
these variants as (TAA|TAG|TGA), and alternatively the usual ambiguity codes
could also be handled in a similar way (e.g. R as [AG]).  This would be of
some limited use, but would permit sequence searching within Biopython, at
least.

> Potentially of use would be a generator function which returned all
> possible back translations - but this would be complex and typically
> overkill.

I think that, for large sequences, this could quickly swamp the user.  What
do you see as the use of this output?

> As a final point, a Seq object back-translation method could give RNA
> or DNA.  From a biological point of view giving DNA by default would
> make sense.  This choice is handled in Bio.Translate when creating the
> translator object (part of what makes Bio.Translate relatively complex
> to use).

Since there is a one-one map of RNA to DNA, I'm easy about either choice on
a computational level.  Biologically-speaking, DNA -> RNA is transcription,
and RNA -> protein is translation, so I'd expect back-translation to convert
protein -> RNA, and back-transcription to convert RNA -> DNA.

Cheers,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________


From dalloliogm at gmail.com  Fri Oct 17 09:39:41 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 17 Oct 2008 11:39:41 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
Message-ID: <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>

On Thu, Oct 16, 2008 at 12:23 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
> > Hi,
> > I was going to write a python program to calculate Fst statistics from a
> > sample of SNP data.  Is there any module already available to do that
> > in biopython, that I am missing?  I saw there is a 'PopGen' module, but
> > the Cookbook says it doesn't support sequence data.
> > Is someone actually writing any module in python to calculate such
> > statistics?
>
> I think this will be a question for Tiago (the Bio.PopGen author),
> although others on the list may have also tackled similar questions.
>
> In terms of reading in the SNP data, what file format will you be
> loading?  Does Bio.SeqIO currently suffice?
>

Hi,
thank you very much all of you for the replies.
Actually I am going to use tped[1] and tfam[1] files as input, formatted
with the plink program[2].
Bio.SeqIO doesn't support these format, but this is right because they don't
cointain only sequences but rather elements like Tiago was saying.

Let's say I try to write a parser for these two file formats. In which
biopython object should I save them? Is there any kind of 'Individual' or
'Population' object in biopython?
I see from the cookbook that Bio.GenPop.Record is representanting
populations and individual as list[3], and that there is not a 'Population'
or 'Individual' object.
I think that it is a good approach, because these kind of files tend to be
very big and instantiating an Individual object instead of a tuple for every
line of the file would be take much memory.
But are you going to implement some kind of 'Individual' or 'Population'
object?
Moreover, python 2.6 will implement a new kind of data object, called 'named
tuple' [4], to implement these kind of records. It could be a good
compromise (maybe I'll better start a new thread about this and explain
better).


[1] tped, tfam: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#tr
[2] plink: http://pngu.mgh.harvard.edu/~purcell/plink/index.shtml
[3] biopython cookbook, popgen:
http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc112
[4] named tuples in python 2.6: http://code.activestate.com/recipes/500261/


>
> Have you looked into what (if any) additional python libraries you
> would need?  For any Biopython addition, a dependency on just numpy
> that would be preferable, but Tiago has previously suggested an
> optional dependency on scipy for additional statistics needed in
> population genetics.
>
> Peter
>


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From dalloliogm at gmail.com  Fri Oct 17 10:03:32 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 17 Oct 2008 12:03:32 +0200
Subject: [BioPython] named tuples for biopython?
Message-ID: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com>

Hi,
python 2.6 is going to implement a new kind of data (like lists, strings,
etc..) called 'named_tuple'.
It is intended to be a better data format to be used when parsing record
files and databases.

You can download the recipe from here (it should be included experimentally
in python 2.6):
- http://code.activestate.com/recipes/500261/

Basically, you instantiate a named_tuple object with this syntax:

>> Person = NamedTuple("Person name surname")

"Person" is a label for the named_tuple; the following fields, 'name' and
'surname'
Then you will have named_tuple object which is basically a mix between a
dictionary, a custom class and a tuple:

>> Person = NamedTuple("Person name surname")
>> Einstein = Person('Albert', 'Einstein')
>> Einstein.name
'Albert'

>> Einstein.surname
'Einstein'

>> people = []
>> for line in f.readlines():
>>     people.append(Person(line.split())
>>
>> for person in people:
>>     print person.name, person.surname

named_tuples are also read-only object, so they should be less
memory-expensive
It is like tuples against lists, but more customizable.

I am really not good ad explaining, and I can't find a good tutorial that
illustrate this.
I read a good article about named_tuples, but it is in italian language (
http://stacktrace.it/2008/05/gestione-dei-record-python-1/). Maybe you can
understand the code examples.

Has any of you heard about this new data type? Do you think it could be
useful for biopython? There is a lot of file parsing / database interfacing
in bioinformatics :)


p.s. since I didn't trust HTML-based mails to keep code formatting, I also
posted this same message on nodalpoint:
http://www.nodalpoint.org/2008/10/17/python_2_6_will_implement_a_new_data_format_named_tuple_can_it_be_of_use_for_biopython


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From dalloliogm at gmail.com  Fri Oct 17 10:11:23 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 17 Oct 2008 12:11:23 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810160714s61eb6d1cx87d1943c4068d491@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
	<6d941f120810160714s61eb6d1cx87d1943c4068d491@mail.gmail.com>
Message-ID: <5aa3b3570810170311g1d92dc52q41616cd6dc58fb03@mail.gmail.com>

On Thu, Oct 16, 2008 at 4:14 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:

> Just a minor point: I am so used to work in Fst that I mentally
> converted your "F-statistics" to Fst. Most of my mail still stands.
> The only point that changes a bit is that I only have code for Fst, so
> I cannot help you with any other.
>
> On Thu, Oct 16, 2008 at 3:10 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:
>
> > 3. That being said, I have code to estimate Fst (Cockerham and Wier
> > theta and a variation from Mark Beaumont) in Python. I can give it to
> > you (but is not much tested).
> >
>


Thank you.. Can you please send me this code that you are using to calculate
Fst statistics with python?
I can't guarantee I will use it (most of the people here use perl and
bioperl, but I would prefer python), but maybe I can help you testing it.


>
>
>
>
>
> --
> "Data always beats theories. 'Look at data three times and then come
> to a conclusion,' versus 'coming to a conclusion and searching for
> some data.' The former will win every time."
> ?Matthew Simmons,
> http://www.tiago.org
>
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Fri Oct 17 10:17:51 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 17 Oct 2008 11:17:51 +0100
Subject: [BioPython] named tuples for biopython?
In-Reply-To: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com>
References: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com>
Message-ID: <320fb6e00810170317w24fe34a4p1884c4264f3e7363@mail.gmail.com>

On Fri, Oct 17, 2008 at 11:03 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Hi,
> python 2.6 is going to implement a new kind of data (like lists, strings,
> etc..) called 'named_tuple'.  It is intended to be a better data format to
> be used when parsing record files and databases.

I'd just seen this today actually via another mailing list.  Here is a
short example which actually works on python 2.6 (the details have
changed slightly from your quote),

>>> from collections import namedtuple
>>> Person = namedtuple("Person", "name surname")
>>> x = Person("Albert", "Einstein")
>>> x
Person(name='Albert', surname='Einstein')
>>> x.name
'Albert'
>>> x.surname
'Einstein'
>>> x.keys()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Person' object has no attribute 'keys'
>>> x["name"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: tuple indices must be integers, not str
>>> x[0]
'Albert'
>>> x[1]
'Einstein'

So this doesn't act much like a dictionary (in terms of the x[...]
usage), so we can't use it as a drop in enhancement for existing
dictionaries in Biopython.  I expect there are some places where a
namedtuple would make sense (although using it might break backwards
compatibility).

Also, if we did want to use NamedTuple in Biopython we'd have to
include a copy for use on older versions of python.  This is probably
possible under the python license... but would require an
implementation that still worked on pre 2.6.

Peter


From lpritc at scri.ac.uk  Fri Oct 17 10:52:33 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 17 Oct 2008 11:52:33 +0100
Subject: [BioPython] named tuples for biopython?
In-Reply-To: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com>
Message-ID: <C51E2D01.17DF2%lpritc@scri.ac.uk>


On 17/10/2008 11:03, "Giovanni Marco Dall'Olio" <dalloliogm at gmail.com>
wrote:

> Hi,
> python 2.6 is going to implement a new kind of data (like lists, strings,
> etc..) called 'named_tuple'.
> It is intended to be a better data format to be used when parsing record
> files and databases.
> 
> You can download the recipe from here (it should be included experimentally
> in python 2.6):
> - http://code.activestate.com/recipes/500261/

The explanation here was pretty clear, to me:

http://docs.python.org/dev/library/collections.html#collections.namedtuple

> Has any of you heard about this new data type?

Not until you mentioned it - thanks for the heads-up.

> Do you think it could be
> useful for biopython? There is a lot of file parsing / database interfacing
> in bioinformatics :)

I can see it being a useful collection type.  It reminds me of C structs,
and looks like a near-perfect fit to many db table entries, and to
csv/ATF-format files for which the column headers can be used to define
attributes.  

I guess that one disadvantage of namedtuples, compared to, e.g. a dictionary
in which each value is itself a dictionary of attributes (with attribute
names for keys), is that there's a restricted character/word set available
for attribute names in the namedtuple, but this is not important for
dictionary keys, so some additional tally of header to attribute name may be
necessary.  This has a real use-case in, say, parsing ATF format files...

http://www.moleculardevices.com/pages/software/gn_genepix_file_formats.html

... where on-the-fly creation of attributes with the same name as in the
parsed file or table row may not be possible with a namedtuple.  If you know
of the column/field names in advance though, it shouldn't be an issue.

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________


From tiagoantao at gmail.com  Fri Oct 17 18:07:18 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Fri, 17 Oct 2008 19:07:18 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
Message-ID: <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>

Hi,

On Fri, Oct 17, 2008 at 10:39 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Let's say I try to write a parser for these two file formats. In which
> biopython object should I save them? Is there any kind of 'Individual' or
> 'Population' object in biopython?
> I see from the cookbook that Bio.GenPop.Record is representanting
> populations and individual as list[3], and that there is not a 'Population'
> or 'Individual' object.

No, there are no concepts of individuals or populations for now.
Bio.PopGen.GenePop is just a representation of a GenePop file (which
is a de facto standard in frequency based population genetics).
Currently Bio.PopGen philosophy is more of a wrapper for existing
software (e.g., I don't implement a coalescent simulator, like in
BioPerl, I wrap Simcoal2). The disadvantage is that it is not "Pure
Python" and is dependent on external applications. The advantage is
that, if the external application is good, than good functionality
becomes available inside Biopython. For example, coalescent simulation
in BioPerl is (at least last time I've checked it) orders of magnitude
less flexible than BioPython's (based on SimCoal2).
In this philosophy, I now have a (partial) wrapper for the GenePop
application to calculate statistics (voila, Fst).
That doesn't mean that core statistics functionality should not be
available in Bio.PopGen. I think it should be (that is why I have
quite done work on that - implementing from scratch Fst, allelic
richness, expected heterosigosity, ...). The same goes to the concept
of Population and Individual.
For a number of cumulative reasons, the work on that front is stalled.
But, if there is some interest, I would more than welcome reopening
that front...


> Moreover, python 2.6 will implement a new kind of data object, called 'named
> tuple' [4], to implement these kind of records. It could be a good
> compromise (maybe I'll better start a new thread about this and explain
> better).

I think the ad-hoc policy in Biopython is to support previous versions
of Python, so I don't think it will be easy to do things in a 2.6 only
way (although, for NEW functionality, from my part, I don't see a
problem with it).

Tiago


From bsouthey at gmail.com  Fri Oct 17 18:46:19 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Fri, 17 Oct 2008 13:46:19 -0500
Subject: [BioPython] back-translation method for Seq object?
In-Reply-To: <C51E0A5B.17DCE%lpritc@scri.ac.uk>
References: <C51E0A5B.17DCE%lpritc@scri.ac.uk>
Message-ID: <48F8DD7B.7010909@gmail.com>

Leighton Pritchard wrote:
> On 16/10/2008 16:11, "Peter" <biopython at maubp.freeserve.co.uk> wrote:
>
>   
>> Quoting from the recent thread about adding a translation method to
>> the Seq object, Bruce brought up back-translation:
>>
>> Peter wrote:
>>     
>>> Bruce wrote:
>>>       
>>>> Obviously reverse translation of a protein sequence to a DNA sequence is
>>>> complex if there are many solutions.
>>>>         
>
> This is the key problem.  Forward translation is - for a given codon table -
> a one-one mapping.  Reverse translation is (for many amino acids) one-many.
> If the goal is to produce the coding sequence that actually encoded a
> particular protein sequence, the problem is combinatorial and rapidly
> becomes messy with increasing sequence length.  And that's not considering
> the problem of splice variants/intron-exon boundaries if attempting to
> relate the sequence back to some genome or genome fragment - more a problem
> in eukaryotes.
>   
If you use a regular expression or a tree structure then there is a 
one-one mapping but then that would probably best as a subclass of Seq. 
Note you still would need a method to transverse it if you wanted to get 
a sequence from it as well as an reverse complement. It is fairly 
trivial to get a regular expression for it for the standard genetic code 
but I did not get my reverse complement to work satisfactory nor did I 
try to get DNA sequence from the regular expression.

I would suggest tools like Wise2 and exonerate 
(http://www.ebi.ac.uk/~guy/exonerate/) are the solution to solving gene 
structure problems than using a Seq object.

Obviously if you start with a DNA sequence, then you could create object 
that has a DNA/RNA Seq object and a protein Seq object(s) that contain 
the translation(s) like in Genbank DNA records that contain the 
translation. But that really avoids the issue here.

>>> Yes, back-translation is tricky because there is generally more than
>>> one codon for any amino acid.  Ambiguous nucleotides can be used to
>>> describe several possible codons giving that amino acid, but in
>>> general it is not possible to do this and describe all the possible
>>> codons which could have been used.  This topic is worth of an entire
>>> thread... for the record, I would envisage a back_translate method for
>>> the Seq object (assuming we settle on translate as the name for the
>>> forward translation from nucleotide to protein).
>>>       
>> Do we actually need a back_translate method?  Can anyone suggest an
>> actual use-case for this?  It seems difficult to imagine that any
>> simple version would please everyone.
>>     
>
> I agree - I can't think of an occasion where I might want to back-translate
> a protein in this way that wouldn't better be handled by other means.  Not
> that I'm the fount of all use-cases but, given the number of ways in which
> one *could* back-translate, perhaps it would be better not to pick/guess at
> any single one.
>   
Apart from the academic aspect, my main use is searching for protein 
motifs/domains, enzyme cleavage sites, finding very short combinations 
of amino acids and binding sites (I do not do this but it is the same) 
in DNA sequences especially genomic sequence. These are usually very 
small and, thus, unsuitable for most tools. One of my uses is with 
peptide identification and de novo sequencing using mass spectrometry 
when you don't know the actual protein or gene sequence. It also has the 
problem that certain amino acids have very similar mass so you would 
need to  Regardless of whether you use a regular expression query or not 
you still need a back translation of the protein query and probably the 
reverse complement.

Another case where it would be useful is that tools like TBLASTN gives 
protein alignments so you must open the DNA sequence and find the DNA 
region based on the protein alignment.

Bruce


From dalloliogm at gmail.com  Sun Oct 19 14:50:54 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Sun, 19 Oct 2008 16:50:54 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
Message-ID: <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>

On Sat, Oct 18, 2008 at 6:50 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:

>  > here have used bioperl Bio::PopGen::PopStat, but we saw that using that
> > module as it is now in bioperl is too much computationally-expensive for
> our
> > resources.
> > So, we are going to either refactor the bioperl function, or to write
> custom
> > scripts in python to calculate Fst.
> > I can program perl, but I would prefer to use python in use, since I like
> > object oriented programming.
>
> You can find my (completely unofficial, completely untested) PopGen module
> here:
> http://popgen.eu/PopGen.tar.gz
> You should take a biopython distro and replace the PopGen directory
> with the contents of this one.
>


ok, thank you very much!!
I would like to use git to keep track of the changes I will make to the
code.
What do you think if I'll upload it to http://github.com and then upload it
back on biopython when it is finished?
I am not sure, but I think it would be possible to convert the logs back to
cvs to reintegrate the changes in biopython.


> There are 2 ways to calculate Fst:
> Doing something this:
> from Bio.PopGen.Stats.Structural import Fst
>
> fst = Fst()


> fst.add_pop('Pop 1', [('a', 'a'), ('a', 'c'), ('a','c')])
> fst.add_pop('Pop 2', [('a', 'c'), ('a', 'c'), ('a','c')])
>

One of the problems we are having here, is that it takes too much RAM memory
to store all the information about characters for every population.
I was going to write a Population object, in which I'll store only the total
count of heterozygotes, individuals, and what is needed, instead of the
information about characters (('a', 'a'), ('a', 'c'), ...)

It is something like this:
class Population:
  markers = []

class Marker:
  total_heterozygotes_count = 0
  total_population_count = 0
  total_Purines_count = 0 # this could be renamed, of course
  total_Pyrimidines_count = 0


>
> Or using the new GenePop code (see GenePop/Controller.py), by using
> genepop to calculate Fsts.
>
> A few comments:
> 1. I don't trust my own Fst code (not tested at all, I am actually
> using GenePop as above). You can find it on PopGen.Stats.Structural
> (Fst, and also FstBeaumont). There is code there for Fst, Fis and Fit.
> Also Fk (I trust the Fk code, but its the only one)


I will ask my group leader to help me in writing down some good test data.
I'll let you know when I will speak with him.


>
> 2. If your problem is performance, I think you have to go to a faster
> language. Scripting languages strongly underperfom on the speed issue.
> I find this problem lots of times. C, C++ and Java (yes, java for
> performance) is what I use. Perl, Python and other scripting languages
> are quite bad performance-wise.


I know.. but I think this time, the problem is in memory usage.


> 3. You can find a Fst implementation in C++ on simuPop (see file
> stator.cpp). GenePop code must also have Fst implemented.

4. I have a Fst based application using Biopython PopGen with Fst (but
> for another application) - Fdist, you can find it at:
> http://www.biomedcentral.com/1471-2105/9/323 . Module Bio.PopGen.FDist
> (incidentally, you can also use this to calculate Fst ;) ).
> 5. My code on Bio.PopGen.Stats is surely not in its final form. I have
> a plan to change it massively. If you are interested in participating
> in the discussion, you are welcome.
>
> > This is to say that if you want, we can work on the same code, and
> > contribute it to biopython.
>
> This would be most welcome. I have almost no sense of propriety for
> the code that is on Bio.PopGen. So, if you work on this, go ahead!
>
>
> > I am writing a ped file parser (everybody here is used to this format,
> and I
> > don't know GenePop :( ), and a simple script that calculates Fst with the
> > most basic formula.
> > I am also trying to design some good tests, and I am using subversion as
> as
> > source control system.
> > Maybe I can also send this to you, so you can have a look (but it is
> still
> > very basic, I started yesterday).
>
> Again, any contribution would be most welcome. Regarding parsers I
> would suggest you to have a look at how parsers are done in Biopython.
> I am following the "standard". You can find an example on
> Bio.PopGen.GenePop.__init__.py. From my point of view I have nothing
> against a "non standard" parser as long as it is documented and
> commented.


Thank you very much.. I know more or less how parsers are written in
biopython, but I have never written one myself.


>
>
> Again, feel free to take this discussion to biopython-dev, especially
> if you are willing to contribute.
>


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From mmokrejs at ribosome.natur.cuni.cz  Sun Oct 19 15:52:29 2008
From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=)
Date: Sun, 19 Oct 2008 17:52:29 +0200
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
Message-ID: <48FB57BD.7070705@ribosome.natur.cuni.cz>

Hi,
  I have been away for 2 weeks but although late, let
me oppose that string.translate() is of use. Here is my
current code:

    # make sure no unallowed chars are present in the sequence
    if type == "DNA":
        if not _sequence.translate(string.maketrans('', ''),'GgAaTtCc'):
            if not _sequence.translate(string.maketrans('', ''),'GgAaTtCcBbDdSsWw'):
                if not _sequence.translate(string.maketrans('', ''),'GgAaTtCcRrYyWwSsMmKkHhBbVvDdNn'):
                    raise ValueError, "DNA sequence contains unallowed characters: " + str(_sequence.translate(string.maketrans('', ''),'GgAaTtCcRrYyWwSsMmKkHhBbVvDdNn'))
                else:
                    _warning = "DNA sequence contains IUPACAmbiguousDNA characters, which cannot be interpreted uniquely. Please try to find sequence of higher quality."
            else:
                _warning = "DNA sequence contains ExtendedIUPACDNA characters. " + str(_sequence.translate(string.maketrans('', ''),'GATC')) + " Please try to find sequence of higher quality."
    elif type == "RNA":
        if not _sequence.translate(string.maketrans('', ''),'GgAaUuCc'):
            if not _sequence.translate(string.maketrans('', ''),'GgAaUuCcRrYyWwSsMmKkHhBbVvDdNn'):
                raise ValueError, "RNA sequence contains unallowed characters: " + str(_sequence.translate(string.maketrans('', ''),'GgAaTtCcRrYyWwSsMmKkHhBbVvDdNn'))
            else:
                _warning = "RNA sequence contains ExtendedIUPACDNA characters. " + str(_sequence.translate(string.maketrans('', ''),'GgAaUuCc')) + " Please try to find sequence of higher quality."
        _sequence = _sequence.translate(string.maketrans('Uu', 'Tt'))
    return (_warning, _type, _description, _sequence)


I would have voted for b) or c).
Martin


Peter wrote:
> Dear Biopythoneers,
> 
> This is a request for feedback about proposed additions to the Seq
> object for the next release of Biopython.  I'd like people to pick (a)
> to (e) in the list below (with additional comments or counter
> suggestions welcome).
> 
> Enhancement bug 2381 is about adding transcription and translation
> methods to the Seq object, allowing an object orientated style of
> programming.
> 
> e.g. Current functional programming style:
> 
>>>> from Bio.Seq import Seq, transcribe
>>>> from Bio.Alphabet import generic_dna
>>>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna)
>>>> my_seq
> Seq('CAGTGACGTTAGTCCG', DNAAlphabet())
>>>> transcribe(my_seq)
> Seq('CAGUGACGUUAGUCCG', RNAAlphabet())
> 
> With the latest Biopython in CVS, you can now invoke a Seq object
> method instead for transcription (or back transcription):
> 
>>>> my_seq.transcribe()
> Seq('CAGUGACGUUAGUCCG', RNAAlphabet())
> 
> For a comparison, compare the shift from python string functions to
> string methods.  This also makes the functionality more discoverable
> via dir(my_seq).
> 
> Adding Seq object methods "transcribe" and "back_transcribe" doesn't
> cause any confusion with the python string methods.  However, for
> translation, the python string has an existing "translate" method:
> 
>> S.translate(table [,deletechars]) -> string
>>
>> Return a copy of the string S, where all characters occurring
>> in the optional argument deletechars are removed, and the
>> remaining characters have been mapped through the given
>> translation table, which must be a string of length 256.
> 
> I don't think this functionality is really of direct use for sequences, and
> having a Seq object "translate" method do a biological translation into
> a protein sequence is much more intuitive. However, this could cause
> confusion if the Seq object is passed to non-Biopython code which
> expects a string like translate method.
> 
> To avoid this naming clash, a different method name would needed.
> 
> This is where some user feedback would be very welcome - I think
> the following cover all the alternatives of what to call a biological
> translation function (nucleotide to protein):
> 
> (a) Just use translate (ignore the existing string method)
> (b) Use translate_ (trailing underscore, see PEP8)
> (c) Use translation (a noun rather than verb; different style).
> (d) Use something else (e.g. bio_translate or ...)
> (e) Don't add a biological translation method at all because ...
> 
> Thanks,
> 
> Peter
> 
> See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381


From mmokrejs at ribosome.natur.cuni.cz  Sun Oct 19 16:17:50 2008
From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=)
Date: Sun, 19 Oct 2008 18:17:50 +0200
Subject: [BioPython] Current tutorial in CVS
In-Reply-To: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com>
References: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com>
Message-ID: <48FB5DAE.1050600@ribosome.natur.cuni.cz>

Hi Peter,

Peter wrote:
> Michiel wrote:
>> ... The new tutorial is in CVS; I put a copy of the HTML output
>> of the latest version at
>> http://biopython.org/DIST/docs/tutorial/Tutorial.new.html.
> 
> This also gives people a chance to look at the three plotting examples
> I added to the "Cookbook" section a couple of weeks back,
> 
> http://www.biopython.org/DIST/docs/tutorial/Tutorial.new.html#chapter:cookbook

  for those lazy would you please show how to save the generated plots into
e.g. jpg or .svg file?

Thanks, ;-)
Martin


From biopython at maubp.freeserve.co.uk  Sun Oct 19 16:34:46 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 19 Oct 2008 17:34:46 +0100
Subject: [BioPython] Current tutorial in CVS
In-Reply-To: <48FB5DAE.1050600@ribosome.natur.cuni.cz>
References: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com>
	<48FB5DAE.1050600@ribosome.natur.cuni.cz>
Message-ID: <320fb6e00810190934l432dd320ue8ff3a40fa497530@mail.gmail.com>

>
>  for those lazy would you please show how to save the generated plots into
> e.g. jpg or .svg file?

Instead or as well as pylab.show(), use pylab.savefig(...), for example:

pylab.savefig("dot_plot.png", dpi=75)
pylab.savefig("dot_plot.pdf")

On a related note - it looks like the pylab tutorial as moved, I'm
getting a 404 error on http://matplotlib.sourceforge.net/tutorial.html
now :(

It looks like http://matplotlib.sourceforge.net/api/pyplot_api.html is
the replacement.

Peter


From biopython at maubp.freeserve.co.uk  Sun Oct 19 18:15:59 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 19 Oct 2008 19:15:59 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <48FB57BD.7070705@ribosome.natur.cuni.cz>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
	<48FB57BD.7070705@ribosome.natur.cuni.cz>
Message-ID: <320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com>

On Sun, Oct 19, 2008 at 4:52 PM, Martin MOKREJ?
<mmokrejs at ribosome.natur.cuni.cz> wrote:
> Hi,
>  I have been away for 2 weeks but although late,

Untill we release a new biopython, its not too late to change the Seq
object's new methods.

> let me oppose that string.translate() is of use.
> Here is my current code:
> ...

Your code seems to be doing two things with the python string
translate() method:

(1) Using the deletechars argument (with an empty mapping) to look for
unexpected letters.  It took me a while to work out what your code was
doing - personally I would have used a python set for this, rather
than the string translate method.  Note also unicode strings don't
support the deletechars argument, and that python 3.0 removes the
deletechars argument from the string style objects.

(2) Using the translate mapping to switch "U" and "u" into "T" and "t"
to back transcribe RNA into DNA.  For this, Biopython already has a
Bio.Seq.back_transcribe function (which does work on strings), and in
CVS the Seq object gets a back_transcribe method too.  These do both
use the string translate method internally.

Neither of these operations convice me that the Seq object should
support the python string translate method.

Note that if you still need to use the python string translate method,
it is accessable by first turning the Seq object into a string (e.g.
str(my_seq).translate(mapping, delete_chars)), or as Michiel suggested
earlier, you could use the string module translate function on the Seq
object.

Also note that (as in your example using the string translate to do
back transcription) the translate method by its nature makes it
impossible to know if the original Seq object alphabet still applies
to the result.

Peter


From mmokrejs at ribosome.natur.cuni.cz  Sun Oct 19 18:28:38 2008
From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=)
Date: Sun, 19 Oct 2008 20:28:38 +0200
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>	
	<48FB57BD.7070705@ribosome.natur.cuni.cz>
	<320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com>
Message-ID: <48FB7C56.6010408@ribosome.natur.cuni.cz>

Peter,
  you are right in your points. I think the translate() trick
had some speed advantages over other approaches to zap unwanted
characters - I don't remember but if it is gonna break in future
python releases I will have to rewrite this anyway.
I just wanted to say I really do use the string translate
function and that it has use in bioinformatics as well. ;-)
  Still, I think the name clash is asking for disaster, but
overloading is a feature of python so it might be expected.
Do whatever you want. ;)
Cheers,
M.

Peter wrote:
> On Sun, Oct 19, 2008 at 4:52 PM, Martin MOKREJ?
> <mmokrejs at ribosome.natur.cuni.cz> wrote:
>> Hi,
>>  I have been away for 2 weeks but although late,
> 
> Untill we release a new biopython, its not too late to change the Seq
> object's new methods.
> 
>> let me oppose that string.translate() is of use.
>> Here is my current code:
>> ...
> 
> Your code seems to be doing two things with the python string
> translate() method:
> 
> (1) Using the deletechars argument (with an empty mapping) to look for
> unexpected letters.  It took me a while to work out what your code was
> doing - personally I would have used a python set for this, rather
> than the string translate method.  Note also unicode strings don't
> support the deletechars argument, and that python 3.0 removes the
> deletechars argument from the string style objects.
> 
> (2) Using the translate mapping to switch "U" and "u" into "T" and "t"
> to back transcribe RNA into DNA.  For this, Biopython already has a
> Bio.Seq.back_transcribe function (which does work on strings), and in
> CVS the Seq object gets a back_transcribe method too.  These do both
> use the string translate method internally.
> 
> Neither of these operations convice me that the Seq object should
> support the python string translate method.
> 
> Note that if you still need to use the python string translate method,
> it is accessable by first turning the Seq object into a string (e.g.
> str(my_seq).translate(mapping, delete_chars)), or as Michiel suggested
> earlier, you could use the string module translate function on the Seq
> object.
> 
> Also note that (as in your example using the string translate to do
> back transcription) the translate method by its nature makes it
> impossible to know if the original Seq object alphabet still applies
> to the result.


From biopython at maubp.freeserve.co.uk  Sun Oct 19 18:52:06 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 19 Oct 2008 19:52:06 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <48FB7C56.6010408@ribosome.natur.cuni.cz>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
	<48FB57BD.7070705@ribosome.natur.cuni.cz>
	<320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com>
	<48FB7C56.6010408@ribosome.natur.cuni.cz>
Message-ID: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com>

On Sun, Oct 19, 2008 at 7:28 PM, Martin MOKREJ?
<mmokrejs at ribosome.natur.cuni.cz> wrote:
> Peter,
>  you are right in your points. I think the translate() trick
> had some speed advantages over other approaches to
> zap unwanted characters ...

I haven't profiled this - you may be right.  On the other hand, using
the translate method in this way doesn't make the purpose of the code
obvious.

>- I don't remember but if it is gonna break in future
> python releases I will have to rewrite this anyway.

Certainly the deletechars argument seems to be gone in Python 3.0, but
you may not need to worry about that for a while.

> I just wanted to say I really do use the string translate
> function and that it has use in bioinformatics as well. ;-)

Using the string translate for (back)transcription is an obvious
example, but this is a special case that is already handled within
Biopython.

Does anyone have a non-transcription sequence example where the
mapping part of the translate method is actually used?

Using the string translate method just to remove characters is an
interesting one.  How common is this in typical python code?  I've
always used the string replace method (but usually I only want to
remove one character).  Maybe we should have a remove characters
method for the Seq object?  Here at least dealing with the alphabet is
fairly simple.  On another thread I'd suggested a "remove gaps" method
as a special case of this.

> Still, I think the name clash is asking for disaster, but
> overloading is a feature of python so it might be expected.
> Do whatever you want. ;)
> Cheers,
> M.

I'm still a tiny bit uneasy about the name clash myself... anyone else
what to join in the debate?

Peter


From biopython at maubp.freeserve.co.uk  Sun Oct 19 18:59:23 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 19 Oct 2008 19:59:23 +0100
Subject: [BioPython] Current tutorial in CVS
In-Reply-To: <320fb6e00810190934l432dd320ue8ff3a40fa497530@mail.gmail.com>
References: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com>
	<48FB5DAE.1050600@ribosome.natur.cuni.cz>
	<320fb6e00810190934l432dd320ue8ff3a40fa497530@mail.gmail.com>
Message-ID: <320fb6e00810191159j52bb78al4c38b1f7804c268f@mail.gmail.com>

Peter wrote:
> Marting wrote:
>> for those lazy would you please show how to save the generated
>> plots into e.g. jpg or .svg file?
>
> Instead or as well as pylab.show(), use pylab.savefig(...), for example:
>
> pylab.savefig("dot_plot.png", dpi=75)
> pylab.savefig("dot_plot.pdf")

I've added a note about this in the example in the CVS version of the Tutorial.

> On a related note - it looks like the pylab tutorial as moved, I'm
> getting a 404 error on http://matplotlib.sourceforge.net/tutorial.html
> now :(

I've updated this link to point at http://matplotlib.sourceforge.net/ instead
(which at the time of writing includes a quick summary of the pylab functions).

Peter


From tiagoantao at gmail.com  Mon Oct 20 05:41:56 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 20 Oct 2008 06:41:56 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
Message-ID: <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>

Hi,

On Sun, Oct 19, 2008 at 3:50 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> ok, thank you very much!!
> I would like to use git to keep track of the changes I will make to the
> code.
> What do you think if I'll upload it to http://github.com and then upload it
> back on biopython when it is finished?
> I am not sure, but I think it would be possible to convert the logs back to
> cvs to reintegrate the changes in biopython.

I think it is a good idea. When we reintegrate back I think there will
be no need to backport the commit logs anyway.


> One of the problems we are having here, is that it takes too much RAM memory
> to store all the information about characters for every population.
> I was going to write a Population object, in which I'll store only the total
> count of heterozygotes, individuals, and what is needed, instead of the
> information about characters (('a', 'a'), ('a', 'c'), ...)

I am afraid that this is not enough. Even for Fst. I suppose you are
acquainted with a formula with just heterozigosities. That is more of
just a textbook formula only.  The Fst standard estimator is really
Cockerham and Wier Theta estimator (1984 paper), and I think it needs
individual information (or at the very least allele counts). Check my
implementation of Fst, which should be it (less the bugs that are in).
Maybe my implementation of theta is wrong, which is a possiblity. But
theta is the standard.

May I do a suggestion for your problem? Split in SNP groups (like 100
at a time) and calculate 100 Fsts at time. Store the calculated Fsts
to disk and then join them at the end.

As a general rule, whatever goes into biopython has to be general
enough to accomodate all standard statistics (not just Fs). One cannot
make a solution that is taliored to solve just our personal research
issues.

I am currently traveling (which seems to be my constant state). When I
arrive back at office, on Wednsday, I will make a few suggestions on
how we can structure things. I have a few ideas that I would like to
share and discuss.

> class Marker:
>   total_heterozygotes_count = 0
>   total_population_count = 0
>   total_Purines_count = 0 # this could be renamed, of course
>   total_Pyrimidines_count = 0


Also, your representation seems to be targetted toward SNPs, people
use lots of other things (microsatellites are still used a lot). We
have to think about something that is useful to the general public.
Let me get back to you on Wednesday we ideas. If you are interested we
can work together to make a nice population genetics module that can
be used in a wide range of situations.


From lpritc at scri.ac.uk  Mon Oct 20 09:09:51 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Mon, 20 Oct 2008 10:09:51 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com>
Message-ID: <C522096F.17F00%lpritc@scri.ac.uk>

On 19/10/2008 19:52, "Peter" <biopython at maubp.freeserve.co.uk> wrote:

> I'm still a tiny bit uneasy about the name clash myself... anyone else
> what to join in the debate?

The problem domain for biological sequences implies a natural definition for
the application of 'translate' to a DNA/RNA sequence that is the translation
into protein sequence.  The string.translate() method is not consistent with
this natural use of the language of the problem domain.

I take Martin's point that there are valid uses for the string.translate()
method in bioinformatics and elsewhere, but I think that overloading
translate() is as valid here as overloading __mul__ would be for an
implementation of matrix algebra, or complex numbers.  For biological
sequences as much as for number types, I think the problem domain and
expected behaviour of the object being represented in code should take
precedence over emulation of an object type that was never intended to
provide the functionality required for a biological sequence.

I think also that if the string.translate() method is required, an explicit
call to string.translate() implies: "translate this biological sequence as
if it were a string, and not a biological sequence".  The converse
application of a Bio.translate() method would to me imply "translate this
biological sequence as if it were a biological sequence, and not a string";
which seems to me to defeat part of the purpose of representing the
biological sequence with its own object.

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________


From biopython at maubp.freeserve.co.uk  Mon Oct 20 09:22:39 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Oct 2008 10:22:39 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <C522096F.17F00%lpritc@scri.ac.uk>
References: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com>
	<C522096F.17F00%lpritc@scri.ac.uk>
Message-ID: <320fb6e00810200222s641e165eqef3b209893a8d976@mail.gmail.com>

Leighton wrote:
> Peter wrote:
>> I'm still a tiny bit uneasy about the name clash myself... anyone else
>> what to join in the debate?
>
> The problem domain for biological sequences implies a natural definition for
> the application of 'translate' to a DNA/RNA sequence that is the translation
> into protein sequence.  The string.translate() method is not consistent with
> this natural use of the language of the problem domain.
> ...

I thought that was well argued and nicely put.  Of course, someone is
still bound to try calling the translate method with a string mapping.
 Maybe we should add a bit of defensive code to check the table
argument, and print a helpful error message when this happens?  We
currently only expect the codon table argument to be an NCBI genetic
code table name or ID (string or integer).

Earlier I wrote:
>> In Biopython's CVS, the Seq object now has a translate method
>> which does a biological translation.  If anyone comes up with a
>> better proposal before the next release, we can still rename this.
>> Otherwise I will update the Tutorial in CVS shortly...

I have since updated the Tutorial in CVS to use the new transcribe,
back_transcribe and translate methods.  Maybe we should put an updated
"preview" online for comment?

Peter


From lpritc at scri.ac.uk  Mon Oct 20 09:38:10 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Mon, 20 Oct 2008 10:38:10 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <48F8DD7B.7010909@gmail.com>
Message-ID: <C5221012.17F0D%lpritc@scri.ac.uk>


On 17/10/2008 19:46, "Bruce Southey" <bsouthey at gmail.com> wrote:

> Leighton Pritchard wrote:
>> This is the key problem.  Forward translation is - for a given codon table -
>> a one-one mapping.  Reverse translation is (for many amino acids) one-many.
>> If the goal is to produce the coding sequence that actually encoded a
>> particular protein sequence, the problem is combinatorial and rapidly
>> becomes messy with increasing sequence length.
>>   
> If you use a regular expression or a tree structure then there is a
> one-one mapping but then that would probably best as a subclass of Seq.

I don't see this, I'm afraid.

Each codon -> one amino acid : one-one mapping
Arg -> set of 6 possible codons : one-many mapping

It doesn't matter how it's represented in code, the problem of a one-many
mapping still exists for amino acid -> codon translation in most cases.

The combinatorial nature of the overall problem can be illustrated by
considering the unlikely case of a protein that comprises 100 arginines.
The number of potential coding sequences is 6**100 = 6.5e77.  That you *can*
choose any one of these to be your potential coding sequence doesn't negate
the fact that there are still (6.5e77)-1 other possibilities... It doesn't
get much better if you use the the average number of codons per amino acid:
61/20 ~= 3.  A 100aa protein would typically have 3**100 ~= 5e47 potential
coding sequences.  I wouldn't want to guess which one was correct, and I
can't see a back_translate method in this instance doing more than producing
a nucleotide sequence that is potentially capable of producing the passed
protein sequence, but for which no claims can be made about biological
plausibility.

Now, a back_translate() that takes a protein sequence alignment and, when
passed the coding sequences for each component sequence, returns the
corresponding alignment of the nucleotide sequences, makes sense to me.  But
that's a discussion for Bio.Alignment objects...

> I would suggest tools like Wise2 and exonerate
> (http://www.ebi.ac.uk/~guy/exonerate/) are the solution to solving gene
> structure problems than using a Seq object.

I wouldn't suggest using a Seq object for this purpose, either... ;)

>> I agree - I can't think of an occasion where I might want to back-translate
>> a protein in this way that wouldn't better be handled by other means.  Not
>> that I'm the fount of all use-cases but, given the number of ways in which
>> one *could* back-translate, perhaps it would be better not to pick/guess at
>> any single one.
>>   
> Apart from the academic aspect, my main use is searching for protein
> motifs/domains, enzyme cleavage sites, finding very short combinations
> of amino acids and binding sites (I do not do this but it is the same)
> in DNA sequences especially genomic sequence. These are usually very
> small and, thus, unsuitable for most tools.

I do much the same, and haven't found a pressing use for back-translation,
yet - YMMV.

> One of my uses is with
> peptide identification and de novo sequencing using mass spectrometry
> when you don't know the actual protein or gene sequence. It also has the
> problem that certain amino acids have very similar mass so you would
> need to  Regardless of whether you use a regular expression query or not
> you still need a back translation of the protein query and probably the
> reverse complement.

Perhaps I'm being dense, but I don't see why that is.  Can you give an
example?

> Another case where it would be useful is that tools like TBLASTN gives
> protein alignments so you must open the DNA sequence and find the DNA
> region based on the protein alignment.

You could use TBLASTN output - which provides start and stop coordinates for
the match on the subject sequence - to extract this directly, without the
need for backtranslation.  Example output where subject coordinates give the
match location below:

>ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete
genome
          Length = 5064019

 Score =  731 bits (1887), Expect = 0.0
 Identities = 363/376 (96%), Positives = 363/376 (96%)
 Frame = +3

Query: 1      MFHXXXXXXXXXXXXXTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
60
              MFH             TISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
Sbjct: 477432 MFHLPKLKQKPLALLLTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
477611

[...]

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________


From dalloliogm at gmail.com  Mon Oct 20 13:57:27 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Mon, 20 Oct 2008 15:57:27 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
Message-ID: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>

On Mon, Oct 20, 2008 at 7:41 AM, Tiago Ant?o <tiagoantao at gmail.com> wrote:

> Hi,
>
> On Sun, Oct 19, 2008 at 3:50 PM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
> > ok, thank you very much!!
> > I would like to use git to keep track of the changes I will make to the
> > code.
> > What do you think if I'll upload it to http://github.com and then upload
> it
> > back on biopython when it is finished?
> > I am not sure, but I think it would be possible to convert the logs back
> to
> > cvs to reintegrate the changes in biopython.
>
> I think it is a good idea. When we reintegrate back I think there will
> be no need to backport the commit logs anyway.


Ok, I have uploaded the code to:
- http://github.com/dalloliogm/biopython---popgen

I put the code I wrote before writing in this mailing list in the folder
PopGen/Gio
-
http://github.com/dalloliogm/biopython---popgen/tree/6f6fa66cda1908dc8334ab6e9e69b7c85290a8be/src/PopGen/Gio
However, I plan to integrate these scripts with your code or re-write the
completely (well, your code is a lot better than mine :) ).

Just a curiosity: why do you use the '<>' operator instead of '!='?
Is it better supported in python 3.0?


> > One of the problems we are having here, is that it takes too much RAM
> memory
> > to store all the information about characters for every population.
> > I was going to write a Population object, in which I'll store only the
> total
> > count of heterozygotes, individuals, and what is needed, instead of the
> > information about characters (('a', 'a'), ('a', 'c'), ...)
>
> I am afraid that this is not enough. Even for Fst. I suppose you are
> acquainted with a formula with just heterozigosities.


Yes, I was trying to implement a very basic formula at first.


> That is more of
> just a textbook formula only.  The Fst standard estimator is really
> Cockerham and Wier Theta estimator (1984 paper)


Bioperl's Bio::PopGen::PopStats uses the same formula:
-
http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/PopGen/PopStats.html#POD3

"""
Bioperl's Bio:Based on diploid method in Weir BS, Genetics Data
Analysis II, 1996
           page 178.
"""

, and I think it needs
> individual information (or at the very least allele counts). Check my
> implementation of Fst, which should be it (less the bugs that are in).
> Maybe my implementation of theta is wrong, which is a possiblity. But
> theta is the standard.
>
> May I do a suggestion for your problem? Split in SNP groups (like 100
> at a time) and calculate 100 Fsts at time. Store the calculated Fsts
> to disk and then join them at the end.
>

Thanks - that's a good suggestion


>
>
> I am currently traveling (which seems to be my constant state). When I
> arrive back at office, on Wednsday, I will make a few suggestions on
> how we can structure things. I have a few ideas that I would like to
> share and discuss.
>

Have a nice trip!


>
> > class Marker:
> >   total_heterozygotes_count = 0
> >   total_population_count = 0
> >   total_Purines_count = 0 # this could be renamed, of course
> >   total_Pyrimidines_count = 0
>
>
> Also, your representation seems to be targetted toward SNPs, people
> use lots of other things (microsatellites are still used a lot). We
> have to think about something that is useful to the general public.
> Let me get back to you on Wednesday we ideas. If you are interested we
> can work together to make a nice population genetics module that can
> be used in a wide range of situations.
>

Yes, I agree. It was just a first try. We should collect some good
use-cases.


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Mon Oct 20 14:04:02 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Oct 2008 15:04:02 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
Message-ID: <320fb6e00810200704g578ec3aak6b9df1a5a90a2fc7@mail.gmail.com>

On Mon, Oct 20, 2008 at 2:57 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Just a curiosity: why do you use the '<>' operator instead of '!='?
> Is it better supported in python 3.0?

Python 2.x supports both <> and != for not equal, and people use both
depending on their personal preference (or exposure to other
languages).  Most Biopython code used to use <> which I personally do
by habit.

Python 3.x supports only != so I have recently gone through Biopython
in CVS switching all the <> to != instead.

I would recommend you use != in all new python code.

Peter


From biopython at maubp.freeserve.co.uk  Mon Oct 20 14:23:23 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Oct 2008 15:23:23 +0100
Subject: [BioPython] Bio.AlignIO feedback - seq_count
Message-ID: <320fb6e00810200723p2fcbe12ey125dd1fd67d195a7@mail.gmail.com>

Dear Biopythoneers,

I'm hoping some of you on the mailing list have actually used
Bio.AlignIO, and I'd like to ask for some feedback.

In particular, when loading in sequence files, did you ever use the
optional seq_count argument to declare how many sequences you expected
in each alignment?  The rational of this optional argument is
discussed in the Tutorial,
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:AlignIO-count-argument

I'm curious if anyone actually found this useful in real life.

Thanks

Peter


From bsouthey at gmail.com  Tue Oct 21 14:13:15 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Tue, 21 Oct 2008 09:13:15 -0500
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <C5221012.17F0D%lpritc@scri.ac.uk>
References: <C5221012.17F0D%lpritc@scri.ac.uk>
Message-ID: <48FDE37B.5040301@gmail.com>

Leighton Pritchard wrote:
> On 17/10/2008 19:46, "Bruce Southey" <bsouthey at gmail.com> wrote:
>
>   
>> Leighton Pritchard wrote:
>>     
>>> This is the key problem.  Forward translation is - for a given codon table -
>>> a one-one mapping.  Reverse translation is (for many amino acids) one-many.
>>> If the goal is to produce the coding sequence that actually encoded a
>>> particular protein sequence, the problem is combinatorial and rapidly
>>> becomes messy with increasing sequence length.
>>>   
>>>       
>> If you use a regular expression or a tree structure then there is a
>> one-one mapping but then that would probably best as a subclass of Seq.
>>     
>
> I don't see this, I'm afraid.
>
> Each codon -> one amino acid : one-one mapping
> Arg -> set of 6 possible codons : one-many mapping
>   
If you believed this then your answer below is incorrect. The genetic 
code allow for 1 amino acid to map to a three nucleotides but not any 
three nor any more or any less than three. So to be clear there is a one 
to one mapping between a codon and amino acid as well amino acid and a 
codon. Therefore it is impossible for Arg to map to six possible codons 
as only one is correct. Under the standard genetic code, each amino acid 
can be represented in an regular expression either as the bases or 
ambiguous nucleotide codes:
Ala/A =(GCT|GCC|GCA|GCG) = GCN
Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR)
Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR)
Lys/K =(AAA|AAG) = AAR
Asn/N =(AAT|AAC) =AAY
Met/M =ATG =ATG
Asp/D =(GAT|GAC) =GAY
Phe/F =(TTT|TTC) =TTY
Cys/C =(TGT|TGC) =TGY
Pro/P =(CCT|CCC|CCA|CCG) =CCN
Gln/Q =(CAA|CAG) =CAR
Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY)
Glu/E =(GAA|GAG) = GAR
Thr/T =(ACT|ACC|ACA|ACG)  =ACN
Gly/G =(GGT|GGC|GGA|GGG) =GGN
Trp/W =TGG  =TGG
His/H =(CAT|CAC)  = CAY
Tyr/Y =(TAT|TAC) = TAY
Ile/I =(ATT|ATC|ATA) =ATH
Val/V =(GTT|GTC|GTA|GTG) =GTN

This is still a one to one mapping between an amino acid and regular 
expression relationship of the triplet that encodes it. Unfortunately 
the ambiguous nucleotide codes can not be used directly in a regular 
expression search.


> It doesn't matter how it's represented in code, the problem of a one-many
> mapping still exists for amino acid -> codon translation in most cases.
>
> The combinatorial nature of the overall problem can be illustrated by
> considering the unlikely case of a protein that comprises 100 arginines.
> The number of potential coding sequences is 6**100 = 6.5e77.  That you *can*
> choose any one of these to be your potential coding sequence doesn't negate
> the fact that there are still (6.5e77)-1 other possibilities... It doesn't
> get much better if you use the the average number of codons per amino acid:
> 61/20 ~= 3.  A 100aa protein would typically have 3**100 ~= 5e47 potential
> coding sequences.  I wouldn't want to guess which one was correct, and I
> can't see a back_translate method in this instance doing more than producing
> a nucleotide sequence that is potentially capable of producing the passed
> protein sequence, but for which no claims can be made about biological
> plausibility.
>   
You are not representing the one to six mapping you indicated above as 
sequence is composed of 300 nucleotides not 1800 as must occur with a 
one to 6 codon mapping. Rather you have provided the number of 
combinations of the six codons that can give you 100 Args based on a one 
to one mapping of one codon to one Arg.  If you use ambiguous nucleotide 
codes, you can reduce it down to 1.267651e+30 potential coding sequences 
for 100 amino acids as a worst case scenario.

It is not my position to argue what a user wants or how stupid I think 
that the request is. The user would quickly learn.
> Now, a back_translate() that takes a protein sequence alignment and, when
> passed the coding sequences for each component sequence, returns the
> corresponding alignment of the nucleotide sequences, makes sense to me.  But
> that's a discussion for Bio.Alignment objects...
>
>   
>> I would suggest tools like Wise2 and exonerate
>> (http://www.ebi.ac.uk/~guy/exonerate/) are the solution to solving gene
>> structure problems than using a Seq object.
>>     
>
> I wouldn't suggest using a Seq object for this purpose, either... ;)
>
>   
>>> I agree - I can't think of an occasion where I might want to back-translate
>>> a protein in this way that wouldn't better be handled by other means.  Not
>>> that I'm the fount of all use-cases but, given the number of ways in which
>>> one *could* back-translate, perhaps it would be better not to pick/guess at
>>> any single one.
>>>   
>>>       
>> Apart from the academic aspect, my main use is searching for protein
>> motifs/domains, enzyme cleavage sites, finding very short combinations
>> of amino acids and binding sites (I do not do this but it is the same)
>> in DNA sequences especially genomic sequence. These are usually very
>> small and, thus, unsuitable for most tools.
>>     
>
> I do much the same, and haven't found a pressing use for back-translation,
> yet - YMMV.
>
>   
>> One of my uses is with
>> peptide identification and de novo sequencing using mass spectrometry
>> when you don't know the actual protein or gene sequence. It also has the
>> problem that certain amino acids have very similar mass so you would
>> need to  Regardless of whether you use a regular expression query or not
>> you still need a back translation of the protein query and probably the
>> reverse complement.
>>     
>
> Perhaps I'm being dense, but I don't see why that is.  Can you give an
> example?
>   
Isoleucine and Leucine are the worst case (there are a couple of others 
that are close) because these have the same mass so you have to search for:
(TTA|TTG|CTT|CTC|CTA|CTG|ATT|ATC|ATA)

If you are searching say for an RFamide, you know that you need at least 
RFG, which means you need to do a query using regular expression on the 
plus strand using:
(CGT|CGC|CGA|CGG|AGA|AGG)(TTT|TTC)(GGT|GGC|GGA|GGG)

You then try to extend the match to more amino acids until you reach the 
desired mass (hopefully avoiding any introns) or sufficiently that you 
can use some other tool to help.
>   
>> Another case where it would be useful is that tools like TBLASTN gives
>> protein alignments so you must open the DNA sequence and find the DNA
>> region based on the protein alignment.
>>     
>
> You could use TBLASTN output - which provides start and stop coordinates for
> the match on the subject sequence - to extract this directly, without the
> need for backtranslation.  Example output where subject coordinates give the
> match location below:
>
>   
>> ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete
>>     
> genome
>           Length = 5064019
>
>  Score =  731 bits (1887), Expect = 0.0
>  Identities = 363/376 (96%), Positives = 363/376 (96%)
>  Frame = +3
>
> Query: 1      MFHXXXXXXXXXXXXXTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
> 60
>               MFH             TISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
> Sbjct: 477432 MFHLPKLKQKPLALLLTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
> 477611
>
> [...]
>
> L.
>
>   
Exactly my point, where is the DNA sequence? Only if you have direct 
access to the DNA sequence can you get it. Furthermore, the DNA sequence 
must be exactly the same because any change in the coordinates screws it 
up.

Bruce


From biopython at maubp.freeserve.co.uk  Tue Oct 21 14:26:49 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Oct 2008 15:26:49 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
	Seq object?
In-Reply-To: <48FDE37B.5040301@gmail.com>
References: <C5221012.17F0D%lpritc@scri.ac.uk> <48FDE37B.5040301@gmail.com>
Message-ID: <320fb6e00810210726g466292e4h3e8fe053d9107f48@mail.gmail.com>

Bruce wrote:
> Leighton wrote:
>> Each codon -> one amino acid : one-one mapping
>> Arg -> set of 6 possible codons : one-many mapping

I agree with Leighton.

> If you believed this then your answer below is incorrect.

No, I think you are just not using the terms one-to-one and
one-to-many as a mathematician would.

> The genetic code
> allow for 1 amino acid to map to a three nucleotides but not any three nor
> any more or any less than three. So to be clear there is a one to one
> mapping between a codon and amino acid as well amino acid and a codon.
> Therefore it is impossible for Arg to map to six possible codons as only one
> is correct. Under the standard genetic code, each amino acid can be
> represented in an regular expression either as the bases or ambiguous
> nucleotide codes:
> Ala/A =(GCT|GCC|GCA|GCG) = GCN

That is a one to four mapping using unambiguous nucleotides, or a one
to one mapping using ambiguous nucleotides.  This is a nice case.

> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR)

That is a one to six mapping using unambiguous nucleotides, or a one
to two mapping using ambiguous nucleotides.  This is a problem case.

> This is still a one to one mapping between an amino acid and regular
> expression relationship of the triplet that encodes it. Unfortunately the
> ambiguous nucleotide codes can not be used directly in a regular expression
> search.

The problem is that (TTN|CTR) or similar don't work in Seq objects -
would need a more advanced representation (perhaps based on regular
expressions).

Peter


From biopython at maubp.freeserve.co.uk  Tue Oct 21 14:45:57 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Oct 2008 15:45:57 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
	Seq object?
In-Reply-To: <48FDE37B.5040301@gmail.com>
References: <C5221012.17F0D%lpritc@scri.ac.uk> <48FDE37B.5040301@gmail.com>
Message-ID: <320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com>

Bruce wrote:
>>> Another case where it would be useful is that tools like TBLASTN gives
>>> protein alignments so you must open the DNA sequence and find the DNA
>>> region based on the protein alignment.

Leighton:
>> You could use TBLASTN output - which provides start and stop coordinates
>> for the match on the subject sequence - to extract this directly, without the
>> need for backtranslation.  Example output where subject coordinates give
>> the match location below:
>>
>>>
>>> ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete
>>>
>>
>> genome
>>          Length = 5064019
>>
>>  Score =  731 bits (1887), Expect = 0.0
>>  Identities = 363/376 (96%), Positives = 363/376 (96%)
>>  Frame = +3
>>
>> Query: 1      MFHXXXXXXXXXXXXXTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
>> 60
>>              MFH             TISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
>> Sbjct: 477432 MFHLPKLKQKPLALLLTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY
>> 477611
>>
>> [...]

Bruce's reply:
> Exactly my point, where is the DNA sequence? Only if you have direct access
> to the DNA sequence can you get it. Furthermore, the DNA sequence must be
> exactly the same because any change in the coordinates screws it up.

You should have the original query from when you ran the BLAST
search, so using the co-ordinates given in the BLAST hit you can
recover the original nucleotide query which gives this match.

There is no reason to do a back-translation to try and find the original
query, which would be especially difficult in this example due to the
XXXXXX region (representing a region of low complexity which was
ignored by BLAST).  Even if you tried you could find more than one
match and without checking the the coordinates BLAST gives it would
not be clear which gave this BLAST match.

Peter


From lpritc at scri.ac.uk  Tue Oct 21 15:29:35 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 21 Oct 2008 16:29:35 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <48FDE37B.5040301@gmail.com>
Message-ID: <C523B3EF.18072%lpritc@scri.ac.uk>

Hi Bruce,

On 21/10/2008 15:13, "Bruce Southey" <bsouthey at gmail.com> wrote:
> Leighton Pritchard wrote:
>> I don't see this, I'm afraid.
>> 
>> Each codon -> one amino acid : one-one mapping
>> Arg -> set of 6 possible codons : one-many mapping
>>   
> If you believed this then your answer below is incorrect. The genetic
> code allow for 1 amino acid to map to a three nucleotides but not any
> three nor any more or any less than three.

I'm fine with this bit.  Each such set of three nucleotides is called a
'codon'.  Six such codons are able to code for an arginine, as you note:

> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG)

This is a one -> six mapping.  That is, one input (arginine), is capable of
being back-translated into any of six possible outputs (CGT, CGC, CGA, CGG,
AGA, or AGG).  

but you contradict this with the comment:

> So to be clear there is a one
> to one mapping between a codon and amino acid as well amino acid and a
> codon. Therefore it is impossible for Arg to map to six possible codons

I think that you're confusing the biological fact (only one codon actually
encoded this amino acid) with the back-translation problem (in the absence
of any other information, any one of six codons is equally likely to have
encoded this amino acid).

---
 
> This is still a one to one mapping between an amino acid and regular
> expression relationship of the triplet that encodes it.

Which is not the claim that I was making.  There are any number of ways of
forcing a one-one mapping of this sort.  You could arguably represent it as
a one-to-one mapping of 'arginine -> "the backtranslation of arginine"', but
that would not be informative in reconstructing the actual coding sequence
(if that was what you wanted - which is the point of the discussion: what is
the point of a back_translate() method?).  The regular expression mapping is
not useful for this, either.

> You are not representing the one to six mapping you indicated above as
> sequence is composed of 300 nucleotides not 1800 as must occur with a
> one to 6 codon mapping [...]

I think you've misunderstood what's going on here.

Imagine a reduced system, where there is only one amino acid - let's call it
A - and there are two possible codons that can produce this amino acid - XXX
and YYY (thanks, Coldplay).

Now, if we have a 'sequence' of only one amino acid: 'A', that might have
been encoded by the sequence 'XXX', or the sequence 'YYY'.  The sequence
that coded for 'A' is one of 'XXX' or 'YYY', and we don't know which; there
are two possibilities, therefore this is a 1->2 mapping.  2=2**1.  Note that
the nucleotide sequence is 3*1=3 long.

But if our sequence has two amino acids: 'AA', this could have been the
result of 'XXXXXX', 'XXXYYY', 'YYYXXX', or 'YYYYYY'.  The coding sequence is
one of four equally likely possibilities, and this is a 1->4 mapping (one
sequence, four possible outcomes).  4=2**2, and the nucleotide sequence is
3*2 long.

If we build longer sequences, we find that the number of potential outcomes
is 2**n, where n is the number of 'A's in the input sequence, and the
mapping is 1->2**n.  The nucleotide sequence is 3*n long.

If we make this more general, where there are m codons for this amino acid,
the number of potential outcomes is m**n, and the mapping is 1->m**n.  The
nucleotide sequence is, again, 3*n long.

In my previous example for arginine, m=6, n=100, the mapping is 1->6, and
the sequence is 300nt long, *not* 1800 nt long.  There are still 6e77 ways
of encoding a sequence of 100 arginines.  A back_translate() method that
pretends to find the 'correct' coding sequence in the absence of other
information, rather than 'a' coding sequence, is not making a plausible
claim.

> It is not my position to argue what a user wants or how stupid I think
> that the request is. The user would quickly learn.

While it is entirely possible to implement a function called
back_translate() that does something a user doesn't want or need, I'm not
sure that it's the approach that should be taken, here.

It is your position to argue what you want or need out of a back_translate()
method, and why, so that other people can see your point of view, and maybe
be swayed by it.  I don't see a use for such a method, even to produce a
regular expression for searching nucleotide sequences, because TBLASTN is so
much more efficient.

> Isoleucine and Leucine are the worst case (there are a couple of others
> that are close) because these have the same mass so you have to search for:
> (TTA|TTG|CTT|CTC|CTA|CTG|ATT|ATC|ATA)
> 
> If you are searching say for an RFamide, you know that you need at least
> RFG, which means you need to do a query using regular expression on the
> plus strand using:
> (CGT|CGC|CGA|CGG|AGA|AGG)(TTT|TTC)(GGT|GGC|GGA|GGG)
> 
> You then try to extend the match to more amino acids until you reach the
> desired mass (hopefully avoiding any introns) or sufficiently that you
> can use some other tool to help.

I think that, in your position, I'd compare timings with a six-frame,
three-frame or forward translation of (depending on the nature of the
nucleotide sequence) the nucleotide sequence you're searching against, and
then use a regular expression or string search with the protein sequence as
the query.  That's likely to be significantly faster than a regex search
with that many groups, with the effects more noticeable at larger query
sequence lengths; particularly so if you cache or save the translated
sequences for future searches.
   
>>> Another case where it would be useful is that tools like TBLASTN gives
>>> protein alignments so you must open the DNA sequence and find the DNA
>>> region based on the protein alignment.
>> You could use TBLASTN output - which provides start and stop coordinates for
>> the match on the subject sequence - to extract this directly, without the
>> need for backtranslation.

> Exactly my point, where is the DNA sequence?

It's in the database against which you queried; TBLASTN queries against
nucleotide databases.  Wait, that's not quite right - TBLASTN translates
nucleotide databases into protein databases and queries against them with
the protein sequence, partly because of the one-many mapping of
back-translation.

If the database is local, you can use fastacmd (part of BLAST) to dump the
entire database, to retrieve the single matching sequence from the database,
or even to extract only the region of the sequence that is the match.  Try
fastacmd --help at the command-line.

If your database is not local, you can (probably) obtain the sequence by
querying GenBank with the accession number.  If you can't do that, or ask
the people who compiled the database you're querying against, or if they
won't let you have the sequence, then you're stuck with guessing the coding
sequence.

> Only if you have direct access to the DNA sequence can you get it.

That's not true; fastacmd can extract FASTA-formatted sequences from any
(version number compatibilities notwithstanding) correctly-formatted BLAST
database.

> Furthermore, the DNA sequence
> must be exactly the same because any change in the coordinates screws it
> up.

I don't see how that is a great concern.  The coordinates of the match would
come from the same database you were searching, so should match.  If your
database is up-to-date, and you have to go to GenBank, then you should have
the most recent revision of the sequence in there, anyway.

Even if both of the above options fail, and you can acquire the new sequence
by some accession identifier, you can build a new local database from that
sequence alone, and find where the match is.  Or translate and search
directly in Python.

If you truly have no access to the DNA sequence (e.g. if it's proprietary
information, you can't access the BLAST database, and no-one will send you
the sequence) then, and only then, are you stuck with guessing the coding
sequence in *very* large parameter space.

Best,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________


From biopython at maubp.freeserve.co.uk  Tue Oct 21 15:59:00 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Oct 2008 16:59:00 +0100
Subject: [BioPython] back-translation method for Seq object?
In-Reply-To: <320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com>
References: <C5221012.17F0D%lpritc@scri.ac.uk> <48FDE37B.5040301@gmail.com>
	<320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com>
Message-ID: <320fb6e00810210859n1b922e7emd6a7456abd79cdc7@mail.gmail.com>

On Tue, Oct 21, 2008 at 3:45 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Bruce wrote:
>>>> Another case where it would be useful is that tools like TBLASTN gives
>>>> protein alignments so you must open the DNA sequence and find the DNA
>>>> region based on the protein alignment.
>
> Leighton:
>>> You could use TBLASTN output - which provides start and stop coordinates
>>> for the match on the subject sequence - to extract this directly, without the
>>> need for backtranslation.  Example output where subject coordinates give
>>> the match location below:
>>> ...
>
> Bruce's reply:
>> Exactly my point, where is the DNA sequence? Only if you have direct access
>> to the DNA sequence can you get it. Furthermore, the DNA sequence must be
>> exactly the same because any change in the coordinates screws it up.
>
> You should have the original query from when you ran the BLAST
> search, so using the co-ordinates given in the BLAST hit you can
> recover the original nucleotide query which gives this match.

Sorry - I was thinking of the wrong variant of BLAST.  As Leighton
pointed out, you would have to use fastacmd to extract the nucleotide
sequence of the match from the blast database (assuming you were
running stand alone blastall) or fetch it via its accession (if you
were running BLAST via the NCBI).

Peter


From biopython at maubp.freeserve.co.uk  Tue Oct 21 16:07:46 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Oct 2008 17:07:46 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
	Seq object?
In-Reply-To: <C523B3EF.18072%lpritc@scri.ac.uk>
References: <48FDE37B.5040301@gmail.com> <C523B3EF.18072%lpritc@scri.ac.uk>
Message-ID: <320fb6e00810210907s328c8f23ra007e14f1f32f5c9@mail.gmail.com>

Hi everyone,

I think we all agree that if we want a back-translation
method/function to return a simple string or Seq object (given no
additional information about the codon use), this cannot fully capture
all the possible codons.

If we want to provide a simple string or Seq object, we can either
pick an arbitrary codon in each case (as in the first attachment on
Bug 2618), or perhaps represent some of the possible codons using
ambiguous nucleotides.

e.g.
back_translate("MR") = "ATGCGT" #arbitrary codon for R unambiguous nucleotides

or,
back_translate("MR") = "ATGCGN" #arbitrary codon for R using ambiguous
nucleotides

Note in either example, the following nice property holds:
translate(back_translate("MR")) == "MR"

Even if improved by typical codon usage figures to give a more
biologically likely answer, neither of these simple approaches covers
the full set of six possible codons for Arg in the standard codon
table.

It was something like this that I envisioned as a candidate for a Seq
method (based on the behaviour of the existing Bio.Translate
functionality), but only if such a simple back_translate
method/function had any real uses.  And thus far, I haven't seen any.

A back translation method/function which dealt with all the possible
codon choices would have to use a more advanced representation
(possibly as Bruce suggested using regular expressions or some sort of
tree structure - ideally as a sub-class of the Seq object).  There is
also the option of returning multiple simple strings or Seq objects
(either as a list or preferable a generator) giving all possible back
translations, but I don't think this would be useful, except perhaps
on small examples, due to the potentially vast number of return
values.

Peter


From bsouthey at gmail.com  Tue Oct 21 19:46:58 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Tue, 21 Oct 2008 14:46:58 -0500
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <C523B3EF.18072%lpritc@scri.ac.uk>
References: <C523B3EF.18072%lpritc@scri.ac.uk>
Message-ID: <48FE31B2.8030509@gmail.com>

Leighton Pritchard wrote:
> Hi Bruce,
>
> On 21/10/2008 15:13, "Bruce Southey" <bsouthey at gmail.com> wrote:
>   
>> Leighton Pritchard wrote:
>>     
>>> I don't see this, I'm afraid.
>>>
>>> Each codon -> one amino acid : one-one mapping
>>> Arg -> set of 6 possible codons : one-many mapping
>>>   
>>>       
>> If you believed this then your answer below is incorrect. The genetic
>> code allow for 1 amino acid to map to a three nucleotides but not any
>> three nor any more or any less than three.
>>     
>
> I'm fine with this bit.  Each such set of three nucleotides is called a
> 'codon'.  Six such codons are able to code for an arginine, as you note:
>
>   
>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG)
>>     
>
> This is a one -> six mapping.  That is, one input (arginine), is capable of
> being back-translated into any of six possible outputs (CGT, CGC, CGA, CGG,
> AGA, or AGG).  
>
> but you contradict this with the comment:
>
>   
>> So to be clear there is a one
>> to one mapping between a codon and amino acid as well amino acid and a
>> codon. Therefore it is impossible for Arg to map to six possible codons
>>     
>
> I think that you're confusing the biological fact (only one codon actually
> encoded this amino acid) with the back-translation problem (in the absence
> of any other information, any one of six codons is equally likely to have
> encoded this amino acid).
>
> ---
>  
>   
>> This is still a one to one mapping between an amino acid and regular
>> expression relationship of the triplet that encodes it.
>>     
>
> Which is not the claim that I was making.  There are any number of ways of
> forcing a one-one mapping of this sort.  You could arguably represent it as
> a one-to-one mapping of 'arginine -> "the backtranslation of arginine"', but
> that would not be informative in reconstructing the actual coding sequence
> (if that was what you wanted - which is the point of the discussion: what is
> the point of a back_translate() method?).  The regular expression mapping is
> not useful for this, either.
>
>   
>> You are not representing the one to six mapping you indicated above as
>> sequence is composed of 300 nucleotides not 1800 as must occur with a
>> one to 6 codon mapping [...]
>>     
>
> I think you've misunderstood what's going on here.
>
> Imagine a reduced system, where there is only one amino acid - let's call it
> A - and there are two possible codons that can produce this amino acid - XXX
> and YYY (thanks, Coldplay).
>
> Now, if we have a 'sequence' of only one amino acid: 'A', that might have
> been encoded by the sequence 'XXX', or the sequence 'YYY'.  The sequence
> that coded for 'A' is one of 'XXX' or 'YYY', and we don't know which; there
> are two possibilities, therefore this is a 1->2 mapping.  2=2**1.  Note that
> the nucleotide sequence is 3*1=3 long.
>
> But if our sequence has two amino acids: 'AA', this could have been the
> result of 'XXXXXX', 'XXXYYY', 'YYYXXX', or 'YYYYYY'.  The coding sequence is
> one of four equally likely possibilities, and this is a 1->4 mapping (one
> sequence, four possible outcomes).  4=2**2, and the nucleotide sequence is
> 3*2 long.
>
> If we build longer sequences, we find that the number of potential outcomes
> is 2**n, where n is the number of 'A's in the input sequence, and the
> mapping is 1->2**n.  The nucleotide sequence is 3*n long.
>
> If we make this more general, where there are m codons for this amino acid,
> the number of potential outcomes is m**n, and the mapping is 1->m**n.  The
> nucleotide sequence is, again, 3*n long.
>
> In my previous example for arginine, m=6, n=100, the mapping is 1->6, and
> the sequence is 300nt long, *not* 1800 nt long.  There are still 6e77 ways
> of encoding a sequence of 100 arginines.  A back_translate() method that
> pretends to find the 'correct' coding sequence in the absence of other
> information, rather than 'a' coding sequence, is not making a plausible
> claim.
>
>   
Thank you for agreeing with me! I am glad that you realized that the 
genetic code prevents a true one to many relationship. In say relational 
databases where you can have one table for the journal issue and one 
table for the papers in it, you can get multiple papers in a single 
issue. Likewise, if we ignore the genetic code, there is one amino acid 
and one or more codons. However, the genetic code means that you only 
can select one of all the codons possible resulting in multiple 
combinations of one to one relationships.
>> It is not my position to argue what a user wants or how stupid I think
>> that the request is. The user would quickly learn.
>>     
>
> While it is entirely possible to implement a function called
> back_translate() that does something a user doesn't want or need, I'm not
> sure that it's the approach that should be taken, here.
>
> It is your position to argue what you want or need out of a back_translate()
> method, and why, so that other people can see your point of view, and maybe
> be swayed by it.  I don't see a use for such a method, even to produce a
> regular expression for searching nucleotide sequences, because TBLASTN is so
> much more efficient.
>   
This very much depends on how you want to use it. TBLASTN is not very 
good for very short sequences and can not handle protein domains/motifs 
such as those in Prosite.

>   
>> Isoleucine and Leucine are the worst case (there are a couple of others
>> that are close) because these have the same mass so you have to search for:
>> (TTA|TTG|CTT|CTC|CTA|CTG|ATT|ATC|ATA)
>>
>> If you are searching say for an RFamide, you know that you need at least
>> RFG, which means you need to do a query using regular expression on the
>> plus strand using:
>> (CGT|CGC|CGA|CGG|AGA|AGG)(TTT|TTC)(GGT|GGC|GGA|GGG)
>>
>> You then try to extend the match to more amino acids until you reach the
>> desired mass (hopefully avoiding any introns) or sufficiently that you
>> can use some other tool to help.
>>     
>
> I think that, in your position, I'd compare timings with a six-frame,
> three-frame or forward translation of (depending on the nature of the
> nucleotide sequence) the nucleotide sequence you're searching against, and
> then use a regular expression or string search with the protein sequence as
> the query.  That's likely to be significantly faster than a regex search
> with that many groups, with the effects more noticeable at larger query
> sequence lengths; particularly so if you cache or save the translated
> sequences for future searches.
>   
Thanks for the comments as I did not think about reusing the translation.
>    
>   
>>>> Another case where it would be useful is that tools like TBLASTN gives
>>>> protein alignments so you must open the DNA sequence and find the DNA
>>>> region based on the protein alignment.
>>>>         
>>> You could use TBLASTN output - which provides start and stop coordinates for
>>> the match on the subject sequence - to extract this directly, without the
>>> need for backtranslation.
>>>       
>
>   
>> Exactly my point, where is the DNA sequence?
>>     
>
> It's in the database against which you queried; TBLASTN queries against
> nucleotide databases.  Wait, that's not quite right - 
No, it is not even correct! :-)

> TBLASTN translates
> nucleotide databases into protein databases and queries against them with
> the protein sequence, partly because of the one-many mapping of
> back-translation.
>   
Not exactly as stop codons are not in protein databases except where 
they code for an amino acid.

> If the database is local, you can use fastacmd (part of BLAST) to dump the
> entire database, to retrieve the single matching sequence from the database,
> or even to extract only the region of the sequence that is the match.  Try
> fastacmd --help at the command-line.
>
> If your database is not local, you can (probably) obtain the sequence by
> querying GenBank with the accession number.  If you can't do that, or ask
> the people who compiled the database you're querying against, or if they
> won't let you have the sequence, then you're stuck with guessing the coding
> sequence.
>
>   
>> Only if you have direct access to the DNA sequence can you get it.
>>     
>
> That's not true; fastacmd can extract FASTA-formatted sequences from any
> (version number compatibilities notwithstanding) correctly-formatted BLAST
> database.
>
>   
Obviously because you still have direct access to the DNA sequence.
>> Furthermore, the DNA sequence
>> must be exactly the same because any change in the coordinates screws it
>> up.
>>     
>
> I don't see how that is a great concern.  The coordinates of the match would
> come from the same database you were searching, so should match.  If your
> database is up-to-date, and you have to go to GenBank, then you should have
> the most recent revision of the sequence in there, anyway.
>
> Even if both of the above options fail, and you can acquire the new sequence
> by some accession identifier, you can build a new local database from that
> sequence alone, and find where the match is.  Or translate and search
> directly in Python.
>   

These were some of the things that one was trying to avoid, especially 
repeating it all over again and hoping like crazy that it is still 
present. (Genome assemblies are not very forgiving.)
> If you truly have no access to the DNA sequence (e.g. if it's proprietary
> information, you can't access the BLAST database, and no-one will send you
> the sequence) then, and only then, are you stuck with guessing the coding
> sequence in *very* large parameter space.
>
> Best,
>
> L.
>
>   
Bruce


From bsouthey at gmail.com  Tue Oct 21 20:36:31 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Tue, 21 Oct 2008 15:36:31 -0500
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <320fb6e00810210907s328c8f23ra007e14f1f32f5c9@mail.gmail.com>
References: <48FDE37B.5040301@gmail.com> <C523B3EF.18072%lpritc@scri.ac.uk>
	<320fb6e00810210907s328c8f23ra007e14f1f32f5c9@mail.gmail.com>
Message-ID: <48FE3D4F.6060005@gmail.com>

Peter wrote:
> Hi everyone,
>
> I think we all agree that if we want a back-translation
> method/function to return a simple string or Seq object (given no
> additional information about the codon use), this cannot fully capture
> all the possible codons.
>   


For completeness as these are not 100% correct,
Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN

Ser is really so bad that one would suggest providing a strong warning 
and just use NTN, NGN, and NNN for Leu, Arg and Ser, respectively.


> If we want to provide a simple string or Seq object, we can either
> pick an arbitrary codon in each case (as in the first attachment on
> Bug 2618), or perhaps represent some of the possible codons using
> ambiguous nucleotides.
>
> e.g.
> back_translate("MR") = "ATGCGT" #arbitrary codon for R unambiguous nucleotides
>
> or,
> back_translate("MR") = "ATGCGN" #arbitrary codon for R using ambiguous
> nucleotides
>
> Note in either example, the following nice property holds:
> translate(back_translate("MR")) == "MR"
>
> Even if improved by typical codon usage figures to give a more
> biologically likely answer, neither of these simple approaches covers
> the full set of six possible codons for Arg in the standard codon
> table.
>
> It was something like this that I envisioned as a candidate for a Seq
> method (based on the behaviour of the existing Bio.Translate
> functionality), but only if such a simple back_translate
> method/function had any real uses.  And thus far, I haven't seen any.
>   
For you perhaps but my reasons are very real to me!
> A back translation method/function which dealt with all the possible
> codon choices would have to use a more advanced representation
> (possibly as Bruce suggested using regular expressions or some sort of
> tree structure - ideally as a sub-class of the Seq object).  There is
> also the option of returning multiple simple strings or Seq objects
> (either as a list or preferable a generator) giving all possible back
> translations, but I don't think this would be useful, except perhaps
> on small examples, due to the potentially vast number of return
> values.
>
> Peter
>
>   
In any situation, we are left with a ambiguous codons, a regular 
expression or some combination of sequence type (e.g., strings or Seq 
objects). None of these options are fully compatible with the Seq 
object.  So I do agree that back-translation can not be part of the Seq 
object. Also I agree that while first two could be return types for a 
Seq object method, the usage is probably too infrequent and too 
specialized for inclusion especially to handle codon usage frequencies.

Bruce


From lpritc at scri.ac.uk  Wed Oct 22 08:31:12 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Wed, 22 Oct 2008 09:31:12 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <48FE3D4F.6060005@gmail.com>
Message-ID: <C524A360.18103%lpritc@scri.ac.uk>

On 21/10/2008 21:36, "Bruce Southey" <bsouthey at gmail.com> wrote:

> For completeness as these are not 100% correct,
> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN

There are some difficulties with this encoding (IUPAC codes are at
http://www.chick.manchester.ac.uk/SiteSeer/IUPAC_codes.html)

YTN -> [CT]T[ACGT] -> {CTA, CTC, CTG, CTT, TTA, TTC, TTG, TTT}, two of which
do not encode leucine.

MGV -> [AC]G[ACG] -> {AGA, AGC, AGG, CGA, CGC, CGG}, of which AGC does not
encode arginine, and the resulting set does not include CGT, which does
encode arginine

WSN -> [AT][CG][ACGT] -> {ACA, ACC, ACG, ACT, AGA, AGC, AGG, AGT, TCA, TCC,
TCG, TCT, TGA, TGC, TGG, TGT}, of which 10 codons do not encode serine.

This would cause problems if we wanted to translate our back-translation
back to the original protein sequence (however we might want to do this).

> Ser is really so bad that one would suggest providing a strong warning
> and just use NTN, NGN, and NNN for Leu, Arg and Ser, respectively.

We could just backtranslate all amino acids to NNN and avoid the problem
entirely ;)

>> If we want to provide a simple string or Seq object, we can either
>> pick an arbitrary codon in each case (as in the first attachment on
>> Bug 2618), or perhaps represent some of the possible codons using
>> ambiguous nucleotides.
>> 
>> e.g.
>> back_translate("MR") = "ATGCGT" #arbitrary codon for R unambiguous
>> nucleotides
>> 
>> or,
>> back_translate("MR") = "ATGCGN" #arbitrary codon for R using ambiguous
>> nucleotides
>> 
>> Note in either example, the following nice property holds:
>> translate(back_translate("MR")) == "MR"

This would be an important consideration for a back_translate() method:
should translate() and back_translate() be inverse functions of each other?

I would say that this is a desirable property, or else a nested
translate(back_translate(translate(...(seq)...))) is likely to end up as a
string or sequence of ambiguity codons, which is not very useful.  If that
can't be done, then the opportunity to do so is probably best avoided...

To ensure that translate() and back_translate() are inverse functions, the
backtranslation of a particular amino acid should either return a single
unambiguous codon, or an ambiguous codon that cannot be translated to an
alternative amino acid (assuming a consistent codon table throughout).  If
we were not to choose arbitrarily an unambiguous codon, or subset of all
possible codons, then a representation of the ambiguity is required that is
not present in the Seq object, yet (e.g. For Ser, Leu or Arg as described
above).  A modification of translate() to spot, and accept such ambiguity
would be necessary.  This looks like harder work than it's worth.

>> It was something like this that I envisioned as a candidate for a Seq
>> method (based on the behaviour of the existing Bio.Translate
>> functionality), but only if such a simple back_translate
>> method/function had any real uses.  And thus far, I haven't seen any.
>>   
> For you perhaps but my reasons are very real to me!

I agree with Peter on this.  I don't see a single compelling use case for
back_translate() in a Seq object.

I can sort of see a potential use where, if you have a protein and want to
design a primer to the coding sequence (which is not known - otherwise there
are better ways to do this), then you might want to generate a sequence of
IUPAC ambiguity codes to guide primer design.  This might involve obtaining
a sequence only of the *certain* bases, e.g. Phe -> TTN; Ser -> NNN; Gly ->
GGN; Asp -> GAN, so that FGD -> TTNNNNGGN, and there are four of nine bases
around which primers might be designed.  However, I'm *really* stretching to
come up with this example.

I've outlined my views on some of the possible ways back_translate() might
work below:


Translate protein to its original coding sequence:
===================================================
Problem: this may be just guesswork in (very) large sequence space
Potential solution: guesswork may be guided by codon usage tables or user
preference for codons, but the biological utility/significance of the
result, which is still guessed at, is highly questionable.
Alternatives: If the originating organism's sequence is known, then TBLASTN
is fast, works well, and avoids the problem.  Alternatively, forward
translation followed by a search for the protein sequence is quicker and
less messy.


Translate protein to a single possible coding sequence (not necessarily
original):
============================================================================
Problem: Same one each time, or choose randomly? What is the point, anyway?
See above for solutions/alternatives


Translate protein to ambiguous representation (inverse translate and/or
return Seq):
============================================================================
Problem: changes required to the way sequences are represented in Seq
objects; this is a significant change at the heart of Biopython with many
inevitable side-effects.  Not clear how this would work, yet.
Potential solution: major coding upheaval and rewriting of Biopython
Alternatives: ignore the requirement that backtranslation is the inverse of
translation; do not return a Seq object, but instead store the
backtranslation as an attribute, or just return a string for the user to do
what they want with


Translate protein to ambiguous representation (not inverse of translate, do
not return Seq):
============================================================================
Problem: what's the point?  agreeing which ambiguous representation to use:
regex, IUPAC, something else; IUPAC ambiguities aren't a convenient
representation for Ser, Leu, Arg;
Potential solution: just use a regex; allow a choice; make an executive
decision; ignore it and hope it goes away


I think that the last behaviour here is the only one that is feasible, but I
still don't see much point in implementing it.  At least turning a protein
sequence into a regex of possible codons would be quick to code...

>> There is
>> also the option of returning multiple simple strings or Seq objects
>> (either as a list or preferable a generator) giving all possible back
>> translations, 

Eek! (for the reasons you mention)

L.


-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________


From biopython at maubp.freeserve.co.uk  Wed Oct 22 09:17:23 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Oct 2008 10:17:23 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
	Seq object?
In-Reply-To: <C524A360.18103%lpritc@scri.ac.uk>
References: <48FE3D4F.6060005@gmail.com> <C524A360.18103%lpritc@scri.ac.uk>
Message-ID: <320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com>

On Wed, Oct 22, 2008 at 9:31 AM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
> On 21/10/2008 21:36, "Bruce Southey" <bsouthey at gmail.com> wrote:
>
>> For completeness as these are not 100% correct,
>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN

I was going to jump up and down and disagree with you here Bruce, but
Leighton has already made the same point, (CGV | AGR) != MGV etc.

It is true that the ambiguous codon MGV would cover all the possible
Arg codons, but it includes more than that.  While this could be a
useful thing for certain back-translation reasons, it does break the
expectation that translate(back_translate(sequence)) == sequence
[currently the behaviour available in Bio.Translate].

>>> If we want to provide a simple string or Seq object, we can either
>>> pick an arbitrary codon in each case (as in the first attachment on
>>> Bug 2618), or perhaps represent some of the possible codons using
>>> ambiguous nucleotides.
>>> ...
>>> It was something like this that I envisioned as a candidate for a Seq
>>> method (based on the behaviour of the existing Bio.Translate
>>> functionality), but only if such a simple back_translate
>>> method/function had any real uses.  And thus far, I haven't seen any.
>>>
>> For you perhaps but my reasons are very real to me!

I was saying I don't see the need for a *simple* back_translate
function (giving a Seq object or a string), and that such a simple
function didn't seem to help with your examples.

I'm not denying that a complex back translation operation has real
utility (although I suspect there are multiple different solutions
which won't suit every problem - and makes justifying adding this to
the core Seq object hard to justify).  Perhaps a function in
Bio.SeqUtils to create a nucleotide regex describing possible back
translations from a protein sequence would suffice?

If one of your real-world examples can be solved with a back_translate
which returns a simple string or Seq object, could you clarify this.

Peter


From lpritc at scri.ac.uk  Wed Oct 22 10:03:32 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Wed, 22 Oct 2008 11:03:32 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <48FE31B2.8030509@gmail.com>
Message-ID: <C524B904.18127%lpritc@scri.ac.uk>

On 21/10/2008 20:46, "Bruce Southey" <bsouthey at gmail.com> wrote:
> Thank you for agreeing with me! I am glad that you realized that the
> genetic code prevents a true one to many relationship.

Bruce, I am not agreeing with you.  I'll try to clarify it another way:

More than one codon can encode the amino acid arginine (this is a many-one
relationship). The amino acid arginine can be 'decoded' to more than one
codon (this is a one-many relationship).

Imagine a function that accepts an amino acid as input and returns a valid
codon that could encode for the input amino acid.  This is 'decoding' as
described above, and is the process of back-translation for a single amino
acid.

For a single (i.e. 'one') amino acid, arginine, as input, the function might
correctly provide up to six (i.e. 'many') different valid answers.  This
makes it a one-many problem.  Further external constraints (e.g. Codon
tables) may be applied to restrict the number or likelihood of each codon
being correct in specific cases, but the fundamental problem is one-many.

Providing arginine as input to a particular coded version of this function
might in all cases only return a single codon as output (one-one), but the
problem itself is still one-many.

Furthermore, even though only one codon was responsible -
biologically-speaking - for encoding the arginine you're submitting to the
function (one-one), your question is the inverse: effectively 'what codon
encoded this arginine?'.  But (and it's a big but), if you don't know
beforehand what that codon is (and why else would you bother using the
function?), the problem is one-many, as any of the six solutions might be
correct.

Analogously, there are two possible values for the square root of a positive
real number, such as 4.  It is inherently a one-many problem.  For 4, the
return value could, correctly, be +2 or -2.  Now, the math.sqrt() function
in Python follows mathematical convention for the radical, and only returns
the positive value, but that does not make the relationship between the
value and its square root one-one, it only makes that implementation of the
function one-one, even though the answer could be, correctly, either
positive or negative.

Now, if your problem is: what is the length of side of a farmer's square
field with area four square miles (big field!), only one of these answers
makes sense (one-one), as the field is constrained by our reality and cannot
have negative length (this is effectively equivalent to saying that the
organism doesn't use five of the six possible codons for arginine, so only
one answer is possible).  However, the general problem of finding a square
root is still one-many, as you can see if you rephrase the problem as 'the
vector (a 0) has length 4; what is the value of a?'.  This is directly
analogous to the problem 'the amino acid arginine was encoded by a codon;
what codon was it?'.

> This very much depends on how you want to use it. TBLASTN is not very
> good for very short sequences and can not handle protein domains/motifs
> such as those in Prosite.

That's a fair point, and I wouldn't (and didn't ;) ) recommend TBLASTN as a
solution to all such problems.  I get acceptable results for exact matches
down to about 7aa on default settings, though.  Short query sequences can be
a problem whatever method you use, though.

>> TBLASTN queries against
>> nucleotide databases.  Wait, that's not quite right -
> No, it is not even correct! :-)

Yes, it is correct.  From:
http://www.ncbi.nlm.nih.gov/blast/blast_program.shtml (and other
references...)

"""
tblastn
compares a protein query sequence against a nucleotide sequence database
dynamically translated in all reading frames
""" 

They wrote it, so they should know.  Not that I've checked the code ;)

>> TBLASTN translates
>> nucleotide databases into protein databases and queries against them with
>> the protein sequence, partly because of the one-many mapping of
>> back-translation.
> Not exactly as stop codons are not in protein databases except where
> they code for an amino acid.

Stop codons are not (usually) in protein databases, that's true.  But they
*are* in nucleotide databases, which is what TBLASTN queries.  For example,
these are TBLASTN search results, in opposite directions on the same
nucleotide sequence, that span stop codons in the subject sequence,
indicated by '*' in the BLAST output (even though there are different stop
codons; Artemis handles this more elegantly):

>ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete
genome
          Length = 5064019

 Score = 79.0 bits (193), Expect = 8e-17
 Identities = 38/40 (95%), Positives = 38/40 (95%), Gaps = 2/40 (5%)
 Frame = +2

Query: 1  YPHSTAEYLILFE-INPRS-PFFCWIFWNLMLRDVDLENF 38
          YPHSTAEYLILFE INPRS PFFCWIFWNLMLRDVDLENF
Sbjct: 2  YPHSTAEYLILFE*INPRS*PFFCWIFWNLMLRDVDLENF 121

>ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete
genome
          Length = 5064019

 Score = 56.6 bits (135), Expect = 4e-10
 Identities = 29/32 (90%), Positives = 29/32 (90%), Gaps = 3/32 (9%)
 Frame = -3

Query: 1       CNGRWRC-SPL-CYISPRISCRSW-LKPSAIV 29
               CNGRWRC SPL CYISPRISCRSW LKPSAIV
Sbjct: 2851610 CNGRWRC*SPL*CYISPRISCRSW*LKPSAIV 2851515

>> That's not true; fastacmd can extract FASTA-formatted sequences from any
>> (version number compatibilities notwithstanding) correctly-formatted BLAST
>> database.
>>   
> Obviously because you still have direct access to the DNA sequence.

I'd call it indirect access if you've, say, downloaded a precompiled nt
database from NCBI and then have to extract the FASTA sequence from that
compiled database.  Either way, if you're querying a nucleotide database,
you've got to have a representation of the nucleotide sequence *somewhere*.

>> Even if both of the above options fail, and you can acquire the new sequence
>> by some accession identifier, you can build a new local database from that
>> sequence alone, and find where the match is.  Or translate and search
>> directly in Python.
>>   
> These were some of the things that one was trying to avoid, especially
> repeating it all over again and hoping like crazy that it is still
> present. 

Some things are just harder work than others ;)

> (Genome assemblies are not very forgiving.)

The genomes I've worked on have had stable sequences at revision points for
both assembly and annotation (though the old revision points have not been
kept publicly in all cases, which can be awkward).  All should, IMO.  But
that's a different thread on a different mailing list...

Best,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________


From dalloliogm at gmail.com  Wed Oct 22 10:25:57 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 22 Oct 2008 12:25:57 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
Message-ID: <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>

On Mon, Oct 20, 2008 at 3:57 PM, Giovanni Marco Dall'Olio <
dalloliogm at gmail.com> wrote:

>
>
> On Mon, Oct 20, 2008 at 7:41 AM, Tiago Ant?o <tiagoantao at gmail.com> wrote:
>
>> Hi,
>>
>> On Sun, Oct 19, 2008 at 3:50 PM, Giovanni Marco Dall'Olio
>> <dalloliogm at gmail.com> wrote:
>> > ok, thank you very much!!
>> > I would like to use git to keep track of the changes I will make to the
>> > code.
>> > What do you think if I'll upload it to http://github.com and then
>> upload it
>> > back on biopython when it is finished?
>> > I am not sure, but I think it would be possible to convert the logs back
>> to
>> > cvs to reintegrate the changes in biopython.
>>
>> I think it is a good idea. When we reintegrate back I think there will
>> be no need to backport the commit logs anyway.
>
>
> Ok, I have uploaded the code to:
> - http://github.com/dalloliogm/biopython---popgen
>

I wrote a prototype for a PED file parser which uses your PopGen.Record
object to store data.
It's available on github: I have still to finish the consumer object and to
test it, but I think I will be able to finish it for today.

I left you a few comments on the github wiki:
- http://github.com/dalloliogm/biopython---popgen/wikis/home

Maybe the biggest issue is that I will have to use this library to parse
very big files, so there are a few things we could change in the
implementation of the parser.
Is there any way in python to force the interpreter to store variables in
temporary files instead of RAM memory?
I was thinking about modules like shelve, cPickle, but I am not sure they
work in this way.
We could also modify the parser in a way that it can accept a list of
populations as argument, and create a populations list with only those
populations from the file.

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Wed Oct 22 10:34:21 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Oct 2008 11:34:21 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
Message-ID: <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>

On Wed, Oct 22, 2008 at 11:25 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Maybe the biggest issue is that I will have to use this library to parse
> very big files, so there are a few things we could change in the
> implementation of the parser.
> Is there any way in python to force the interpreter to store variables in
> temporary files instead of RAM memory?
> I was thinking about modules like shelve, cPickle, but I am not sure they
> work in this way.

I have not looked at the specifics here, but adopting an iterator
approach might make sense - returning the entries one by one as parsed
from the file.  This is the idea for the Bio.SeqIO and Bio.AlignIO
parsers.  The user can then turn the entries into a list (if they have
enough memory), filter them as the arrive, etc.  For example, you
could compile a list of only those desired population entries,
discarding the others on the fly.

Peter


From bsouthey at gmail.com  Wed Oct 22 15:04:29 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Wed, 22 Oct 2008 10:04:29 -0500
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com>
References: <48FE3D4F.6060005@gmail.com> <C524A360.18103%lpritc@scri.ac.uk>
	<320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com>
Message-ID: <48FF40FD.5020604@gmail.com>

Peter wrote:
> On Wed, Oct 22, 2008 at 9:31 AM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
>   
>> On 21/10/2008 21:36, "Bruce Southey" <bsouthey at gmail.com> wrote:
>>
>>     
>>> For completeness as these are not 100% correct,
>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN
>>>       
>
> I was going to jump up and down and disagree with you here Bruce, but
> Leighton has already made the same point, (CGV | AGR) != MGV etc.
> It is true that the ambiguous codon MGV would cover all the possible
> Arg codons, but it includes more than that.  While this could be a
> useful thing for certain back-translation reasons, it does break the
> expectation that translate(back_translate(sequence)) == sequence
> [currently the behaviour available in Bio.Translate].
>   
Leighton does show these are correct:
(CGV | AGR) == MGV
and MGV ==(CGV | AGR)

BUT I fully agree that MGV does stand for other other codons that are do 
not translate for Arg as Leighton pointed out. This was why I prefixed 
this by stating "these are not 100% correct" so I am sorry that I was 
not clear enough.  Yes, I am also very aware that this creates a problem 
for doing a translate(back_translate(sequence)) without using a special 
translation table (yet another reason for not including it in Seq object 
or just return an exception).

As I pointed in your other thread that I do not believe that a 
back-translation should be part of the Seq object. If for no other 
reason than back-translation just creates too many ambiguous nucleotides 
in one DNA sequence. This will cause some of the algorithms to determine 
protein or DNA sequences to fail (back_translate('AFLFQPQRFGR') gives 
'GCNTTYYTNTTYCARCCNCARMGVTTYGGNMGV', which causes NCBI's online BLASTN 
to say it is protein). In anycase, BLAST and such are not very good at 
handling multiple ambiguous nucleotides in a sequence when probably 
one-third to one-half of the sequence would be ambiguous nucleotides.

Bruce


From biopython at maubp.freeserve.co.uk  Wed Oct 22 15:33:00 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Oct 2008 16:33:00 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
	Seq object?
In-Reply-To: <48FF40FD.5020604@gmail.com>
References: <48FE3D4F.6060005@gmail.com> <C524A360.18103%lpritc@scri.ac.uk>
	<320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com>
	<48FF40FD.5020604@gmail.com>
Message-ID: <320fb6e00810220833j4089cec1i6fb9cee563b562d3@mail.gmail.com>

Bruce wrote:
>>>> For completeness as these are not 100% correct,
>>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
>>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
>>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN

Just for the record, in addition to the debate about the final equal
signs above, there is at least one error in the above - for the
leucine codons, (TTN|CTR) should read (TTR|CTN), but this doesn't
matter for the discussion in hand.

Bruce wrote:
> Leighton does show these are correct:
> (CGV | AGR) == MGV
> and MGV ==(CGV | AGR)

I don't think Leighton did mean to say that. A set of 6 codons is NOT
equal to a set of 8 codons.  However, if we say "sub set" or "super
set" here things are probably fine (I haven't double checked the
correct ambiguity codes are used here).

Similarly, Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTR|CTN) covers 6
unambiguous codons.
This is a subset of YTN = (TTC|TTA|TTG|TTT|CTC|CTA|CTG|CTT) which
covers 8 unambiguous codons.

Having back_translate("L") == "YTN" means
translate(back_translate("L")) == "X", which would surprise many.
Using "YTN" covers all the codons plus some extra ones.  This might be
useful for searching purposes, but otherwise its very misleading.

Having back_translate("L") == "CTN" means
translate(back_translate("L")) == "L", but doesn't cover the two
codons TTR (i.e. TTA or TTG).  At least this is better than
back_translate("L") == "TTR" which still has
translate(back_translate("L")) == "L", but doesn't cover the four
codons CTN.  Picking any one of the six codons also ensures
translate(back_translate("L")) == "L" but of course doesn't cover the
other five codons.  In all three cases, the utility of the back
translation is limited.

> Yes, I am also very aware that this creates a problem for doing a
> translate(back_translate(sequence)) without using a special translation
> table (yet another reason for not including it in Seq object or just return
> an exception).

Yes.

> As I pointed in your other thread that I do not believe that a
> back-translation should be part of the Seq object.

In the absence of a compelling use case, I agree.

> If for no other reason
> than back-translation just creates too many ambiguous nucleotides in one DNA
> sequence. This will cause some of the algorithms to determine protein or DNA
> sequences to fail (back_translate('AFLFQPQRFGR') gives
> 'GCNTTYYTNTTYCARCCNCARMGVTTYGGNMGV', which causes
> NCBI's online BLASTN to say it is protein).

In such cases, you can explicitly tell BLAST (or other tools) if they
are using nucleotides or proteins.  However this is a valid concern
for working with ambiguous nucleotides.

As an aside, zen of python "In the face of ambiguity, refuse the
temptation to guess." (here nucleotide versus protein)

> In anycase, BLAST and such are not very good at handling
> multiple ambiguous nucleotides in a sequence when probably
> one-third to one-half of the sequence would be ambiguous
> nucleotides.

Ambiguous searches are bound to be tricky.

Peter


From lpritc at scri.ac.uk  Wed Oct 22 15:34:47 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Wed, 22 Oct 2008 16:34:47 +0100
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <48FF40FD.5020604@gmail.com>
Message-ID: <C52506A7.18168%lpritc@scri.ac.uk>


On 22/10/2008 16:04, "Bruce Southey" <bsouthey at gmail.com> wrote:

> Peter wrote:
>> On Wed, Oct 22, 2008 at 9:31 AM, Leighton Pritchard <lpritc at scri.ac.uk>
>> wrote:
>>   
>>> On 21/10/2008 21:36, "Bruce Southey" <bsouthey at gmail.com> wrote:
   
>>>> For completeness as these are not 100% correct,
>>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
>>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
>>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN
>>>>       
>> 
>> I was going to jump up and down and disagree with you here Bruce, but
>> Leighton has already made the same point, (CGV | AGR) != MGV etc.
>> It is true that the ambiguous codon MGV would cover all the possible
>> Arg codons, but it includes more than that.
>>   
> Leighton does show these are correct:
> (CGV | AGR) == MGV
> and MGV ==(CGV | AGR)

I showed (and Peter also points out) that (TTN|CTR) is a subset of YTN, and
that (TCN|AGY) is a subset of WSN, and not that they are equivalent, which
is what you have written above.  For that equivalence, we would also require
that MGV is a subset of (CGV|AGR), which is not true.

Likewise I also showed that, although (CGV|AGR) is a subset of MGV, neither
CGV nor MGV include CGT, which is a valid codon for arginine.  Whether or
not this error is corrected to CGN/MGN, the regular expression is still only
a subset of those codons implied by the IUPAC ambiguity symbols.

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________


From bsouthey at gmail.com  Wed Oct 22 15:50:19 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Wed, 22 Oct 2008 10:50:19 -0500
Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for
 Seq object?
In-Reply-To: <320fb6e00810220833j4089cec1i6fb9cee563b562d3@mail.gmail.com>
References: <48FE3D4F.6060005@gmail.com> <C524A360.18103%lpritc@scri.ac.uk>	
	<320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com>	
	<48FF40FD.5020604@gmail.com>
	<320fb6e00810220833j4089cec1i6fb9cee563b562d3@mail.gmail.com>
Message-ID: <48FF4BBB.8020007@gmail.com>

Peter wrote:
> Bruce wrote:
>   
>>>>> For completeness as these are not 100% correct,
>>>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN
>>>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV
>>>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN
>>>>>           
>
> Just for the record, in addition to the debate about the final equal
> signs above, there is at least one error in the above - for the
> leucine codons, (TTN|CTR) should read (TTR|CTN), but this doesn't
> matter for the discussion in hand.
>
>   
Thanks for correctly that one.

Bruce


From tiagoantao at gmail.com  Wed Oct 22 15:52:19 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Wed, 22 Oct 2008 16:52:19 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
Message-ID: <6d941f120810220852m606b8825tafaaf871f22df48e@mail.gmail.com>

Hi,

[Back in office now]

> Ok, I have uploaded the code to:
> - http://github.com/dalloliogm/biopython---popgen
>
> I put the code I wrote before writing in this mailing list in the folder
> PopGen/Gio

Thanks I will have a look and get acquainted with GIT.


>> I am afraid that this is not enough. Even for Fst. I suppose you are
>> acquainted with a formula with just heterozigosities.
>
> Yes, I was trying to implement a very basic formula at first.

For publication and data analysis the standard is Cockerham and Wier's
theta. The Standard Ht/(Hs-Ht) (or a variation of this) might be
misleading in regards to the amount of information that is needed.


> Yes, I agree. It was just a first try. We should collect some good
> use-cases.


In my head I divide statistics in the following dimensions:
1. genetic versus genomic (e.g. Fst is single locus, LD can be seen as
requiring more than 1 locus, therefore is "genomic")
2. frequency based versus marker based (some statistics require
frequencies only - ie, you can calculate them irrespective of the type
of marker - This is the case of Fst. Others are marker dependent, say
Tajima D requires sequences and can only be used with sequences)
3. population structure versus no pop structure. Some stats require
population structure (again, Fst), others don't (e.g., allelic
richness)

>From my point of view, a long-term solution needs to take into account
these dimensions (and others that I might be forgetting).

One can think in a solution based on Populations and Individuals as
fundamental objects (as opposed to statistics), but, from my
experience it is very difficult to define what is an "individual"
(i.e., what kind of information you need to store - I can expand on
this). It is easier to think in terms of statistics.

One fundamental point is that we don't have many opportunities to make
it right: if we define an architecture which proves in the future to
be not sufficient, then  we will have to both maintain the old legacy
(because there will be users around whose code cannot be constantly
broken when a new version is made available) while hack the new
features in.


From tiagoantao at gmail.com  Wed Oct 22 16:00:39 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Wed, 22 Oct 2008 17:00:39 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
Message-ID: <6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com>

Hi,

On Wed, Oct 22, 2008 at 11:25 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> I wrote a prototype for a PED file parser which uses your PopGen.Record
> object to store data.

Don't feel obliged to use GenePop.Record. You can (maybe you should)
use one that is better for your PED record. The point is: your PED
files might have extra (or less) information than genepop files. For
instance, they might have population names. They might store the SNP
(A, C, T, G). With genepop you would have to convert (and thus loose)
the extra info.

> Maybe the biggest issue is that I will have to use this library to parse
> very big files, so there are a few things we could change in the
> implementation of the parser.

Yet another reason to develop your own record. I would not mind
helping you with that.

> We could also modify the parser in a way that it can accept a list of
> populations as argument, and create a populations list with only those
> populations from the file.

We have to be careful in modifying existing code. We can add new
functionality, add new interfaces. But changing existing interfaces or
removing them has to be dealt with exceptional care, because that will
break (existing) code done by users.

Tiago


From tiagoantao at gmail.com  Wed Oct 22 16:03:59 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Wed, 22 Oct 2008 17:03:59 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
Message-ID: <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>

On Wed, Oct 22, 2008 at 11:34 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> I have not looked at the specifics here, but adopting an iterator
> approach might make sense - returning the entries one by one as parsed
> from the file.  This is the idea for the Bio.SeqIO and Bio.AlignIO
> parsers.  The user can then turn the entries into a list (if they have
> enough memory), filter them as the arrive, etc.  For example, you
> could compile a list of only those desired population entries,
> discarding the others on the fly.

I will have look at iterators in Python. This idea from Giovannni is
actually floating around with current users for GenePop data which
have exactly the same problem (loooong records).


From dalloliogm at gmail.com  Wed Oct 22 17:10:45 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 22 Oct 2008 19:10:45 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
Message-ID: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>

On Wed, Oct 22, 2008 at 6:03 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:

> On Wed, Oct 22, 2008 at 11:34 AM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
> > I have not looked at the specifics here, but adopting an iterator
> > approach might make sense - returning the entries one by one as parsed
> > from the file.  This is the idea for the Bio.SeqIO and Bio.AlignIO
> > parsers.  The user can then turn the entries into a list (if they have
> > enough memory), filter them as the arrive, etc.  For example, you
> > could compile a list of only those desired population entries,
> > discarding the others on the fly.
>
> I will have look at iterators in Python. This idea from Giovannni is
> actually floating around with current users for GenePop data which
> have exactly the same problem (loooong records).
>


Iterators are more difficult to implement in Ped files, because in this
format every line of the file is an individual, so to write an iterator
which iterates by population we will need to read at list the first row of
every line of all the file.
I was also thinking of starting using a database to store data, instead of
files. This would probably solve the problem of out of memory when parsing
those long files.
I would probably use sqlalchemy to interface with this database: this is why
I would like to implement a Population and Individual objects, it will fit
better with relational mapping.

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From dalloliogm at gmail.com  Wed Oct 22 17:12:24 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 22 Oct 2008 19:12:24 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810220852m606b8825tafaaf871f22df48e@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<6d941f120810220852m606b8825tafaaf871f22df48e@mail.gmail.com>
Message-ID: <5aa3b3570810221012v3543a533u15f81196752cd52@mail.gmail.com>

On Wed, Oct 22, 2008 at 5:52 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:

> Hi,
>
> [Back in office now]
>
> > Ok, I have uploaded the code to:
> > - http://github.com/dalloliogm/biopython---popgen
> >
> > I put the code I wrote before writing in this mailing list in the folder
> > PopGen/Gio
>
> Thanks I will have a look and get acquainted with GIT.
>

It' s the first time I am using github for something serious, too.
Please tell me if you need me to add you as a 'collaborator' in the project
or something like this.
I am using eclipse with a plugin for git (http://www.jgit.org/update-site)
and it works very well.
I think there is a plugin for vim, too.
Sorry, today I couldn't do too much - I spent most of the day in seminars
and meetings :(.


>
>
> > Yes, I agree. It was just a first try. We should collect some good
> > use-cases.
>
>
> In my head I divide statistics in the following dimensions:
> 1. genetic versus genomic (e.g. Fst is single locus, LD can be seen as
> requiring more than 1 locus, therefore is "genomic")
> 2. frequency based versus marker based (some statistics require
> frequencies only - ie, you can calculate them irrespective of the type
> of marker - This is the case of Fst. Others are marker dependent, say
> Tajima D requires sequences and can only be used with sequences)
> 3. population structure versus no pop structure. Some stats require
> population structure (again, Fst), others don't (e.g., allelic
> richness)
>
> From my point of view, a long-term solution needs to take into account
> these dimensions (and others that I might be forgetting).
>
> One can think in a solution based on Populations and Individuals as
> fundamental objects (as opposed to statistics), but, from my
> experience it is very difficult to define what is an "individual"
> (i.e., what kind of information you need to store - I can expand on
> this). It is easier to think in terms of statistics.
>
> One fundamental point is that we don't have many opportunities to make
> it right: if we define an architecture which proves in the future to
> be not sufficient, then  we will have to both maintain the old legacy
> (because there will be users around whose code cannot be constantly
> broken when a new version is made available) while hack the new
> features in.
>

ok... but we can try :).
We could use the github's wiki to better organize these ideas.
I will answer to you better tomorrow (or tonight).
Now, I need a bit of fresh air! :)

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From dalloliogm at gmail.com  Wed Oct 22 17:12:41 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 22 Oct 2008 19:12:41 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com>
Message-ID: <5aa3b3570810221012w2c894977sd0f86297f42f9394@mail.gmail.com>

On Wed, Oct 22, 2008 at 6:00 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:

> Hi,
>
> On Wed, Oct 22, 2008 at 11:25 AM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
> > I wrote a prototype for a PED file parser which uses your PopGen.Record
> > object to store data.
>
> Don't feel obliged to use GenePop.Record. You can (maybe you should)
> use one that is better for your PED record. The point is: your PED
> files might have extra (or less) information than genepop files. For
> instance, they might have population names. They might store the SNP
> (A, C, T, G). With genepop you would have to convert (and thus loose)
> the extra info.


I first tried to write an AbstractPopRecord class from which to derive both
Ped.Record and your GenePop.Record classes.
Then, I realized that I wanted to use all of your methods and decided to
import your GenePop.Record instead of writing a new one.
Moreover, there are some methods (like GenePop.Record.split_in_pops) that
create Record objects, and I thought it would have been easier to always
refer to the same one.
Maybe we should write a generic PopGenRecord in which to store all general
informations about population genetics data.


>
>
> > Maybe the biggest issue is that I will have to use this library to parse
> > very big files, so there are a few things we could change in the
> > implementation of the parser.
>
> Yet another reason to develop your own record. I would not mind
> helping you with that.
>

>
> > We could also modify the parser in a way that it can accept a list of
> > populations as argument, and create a populations list with only those
> > populations from the file.
>
> We have to be careful in modifying existing code. We can add new
> functionality, add new interfaces. But changing existing interfaces or
> removing them has to be dealt with exceptional care, because that will
> break (existing) code done by users.
>
> Tiago
>


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Wed Oct 22 17:26:07 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Oct 2008 18:26:07 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
Message-ID: <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com>

On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
>
> Iterators are more difficult to implement in Ped files, because in this
> format every line of the file is an individual, so to write an iterator
> which iterates by population we will need to read at list the first row of
> every line of all the file.

It sounds like for Ped files it would make more sense to iterate over
the individuals.  The mental picture I have in mind is a big
spreadsheet, individuals as rows (lines), populations (and other
information) as columns.  By having the parser iterate over the
individuals one by one, the user could then "simplify" each individual
as they are read in, recording in memory just the interesting data.
This way the whole dataset need not be kept in memory.

> I was also thinking of starting using a database to store data, instead of
> files. This would probably solve the problem of out of memory when parsing
> those long files.
> I would probably use sqlalchemy to interface with this database: this is why
> I would like to implement a Population and Individual objects, it will fit
> better with relational mapping.

That would mean adding sqlalchemy as another (optional) dependency for
Biopython.  If you could use MySQLdb instead that would be better as
several existing modules use this.  However, I would encourage you to
avoid any database if possible because this makes the installation
much more complicated for the end user, and imposes your own arbitrary
schema as well.  It also means setting up suitable unit tests is also
a pain.

Peter


From rsclary at uncc.edu  Wed Oct 22 19:49:33 2008
From: rsclary at uncc.edu (Clary, Richard)
Date: Wed, 22 Oct 2008 15:49:33 -0400
Subject: [BioPython] Retrieving nucleotide sequence for given accession
	Entrez ID
Message-ID: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu>


Can anyone provide succinct Python function to retrieve the nucleotide sequence (as a string) for a given nucleotide accession ID?  Attempting to do this through E-Utils but having a difficult time figuring out the best way to do this without having to download a FASTA file...

Thanks in advance,
R


From biopython at maubp.freeserve.co.uk  Wed Oct 22 20:15:37 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Oct 2008 21:15:37 +0100
Subject: [BioPython] Retrieving nucleotide sequence for given accession
	Entrez ID
In-Reply-To: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu>
References: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu>
Message-ID: <320fb6e00810221315i31358bc2n2e5c9be405a77e42@mail.gmail.com>

On Wed, Oct 22, 2008 at 8:49 PM, Clary, Richard <rsclary at uncc.edu> wrote:
>
> Can anyone provide succinct Python function to retrieve the
> nucleotide sequence (as a string) for a given nucleotide
> accession ID?  Attempting to do this through E-Utils but
> having a difficult time figuring out the best way to do this
> without having to download a FASTA file...

Hi Richard,

Are you trying this using Bipython's Bio.Entrez, or accessing E-Utils directly?

Anyway, you'll want to use efetch (e.g. via the Bio.Entrez.efetch
function in Biopython)
http://www.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html

This documentation covers the possible return formats,
http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html

I think FASTA would be simplest (I don't see a plain or raw text
option), and has only a tiny overhead in the download size over the
raw sequence.  Getting the sequence out of a FASTA file as a string is
trivial - for example, using Biopython:

from Bio import Entrez, SeqIO
Entrez.email = "Richard at example.com" #Tell the NCBI who you are
handle = Entrez.efetch(db="nucleotide", id="186972394",rettype="fasta")
seq_str = str(SeqIO.read(handle, "fasta").seq)

Peter


From dalloliogm at gmail.com  Thu Oct 23 09:41:04 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 23 Oct 2008 11:41:04 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com>
Message-ID: <5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com>

On Wed, Oct 22, 2008 at 7:26 PM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
> >
> > Iterators are more difficult to implement in Ped files, because in this
> > format every line of the file is an individual, so to write an iterator
> > which iterates by population we will need to read at list the first row
> of
> > every line of all the file.
>
> It sounds like for Ped files it would make more sense to iterate over
> the individuals.  The mental picture I have in mind is a big
> spreadsheet, individuals as rows (lines), populations (and other
> information) as columns.  By having the parser iterate over the
> individuals one by one, the user could then "simplify" each individual
> as they are read in, recording in memory just the interesting data.
> This way the whole dataset need not be kept in memory.


This makes sense.
Basically, we should write a (Ped/GenePop)Iterator function, which should
read the file one line at a time, check if it a has correct syntax and is
not a comment, and then use 'yield' to create a Record object. Am I right?


>
> > I was also thinking of starting using a database to store data, instead
> of
> > files. This would probably solve the problem of out of memory when
> parsing
> > those long files.
> > I would probably use sqlalchemy to interface with this database: this is
> why
> > I would like to implement a Population and Individual objects, it will
> fit
> > better with relational mapping.
>
> That would mean adding sqlalchemy as another (optional) dependency for
> Biopython.  If you could use MySQLdb instead that would be better as
> several existing modules use this.  However, I would encourage you to
> avoid any database if possible because this makes the installation
> much more complicated for the end user, and imposes your own arbitrary
> schema as well.  It also means setting up suitable unit tests is also
> a pain.
>

Don't worry, I am not going to do that.
I will probably use sqlalchemy only in my scripts; I will use it to retrieve
data from the database, and then create Population/Marker/Individual objects
using the code I am writing now, or a adapt the objects created  by
sqlalchemy to be compatible with the functions I will have to use.


>
> Peter
>


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Thu Oct 23 09:57:38 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 23 Oct 2008 10:57:38 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com>
	<5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com>
Message-ID: <320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com>

On Thu, Oct 23, 2008, Giovanni Marco Dall'Olio wrote:
> On Wed, Oct 22, Peter wrote:
>> On Wed, Oct 22, Giovanni Marco Dall'Olio wrote:
>> >
>> > Iterators are more difficult to implement in Ped files, because in this
>> > format every line of the file is an individual, so to write an iterator
>> > which iterates by population we will need to read at list the first row
>> > of every line of all the file.
>>
>> It sounds like for Ped files it would make more sense to iterate over
>> the individuals.  The mental picture I have in mind is a big
>> spreadsheet, individuals as rows (lines), populations (and other
>> information) as columns.  By having the parser iterate over the
>> individuals one by one, the user could then "simplify" each individual
>> as they are read in, recording in memory just the interesting data.
>> This way the whole dataset need not be kept in memory.
>
> This makes sense.
> Basically, we should write a (Ped/GenePop)Iterator function, which should
> read the file one line at a time, check if it a has correct syntax and is
> not a comment, and then use 'yield' to create a Record object. Am I right?

Yes :)

Python functions written with "yield" are called  "generator functions", see:
http://www.python.org/dev/peps/pep-0255/

Peter


From m at pavis.biodec.com  Thu Oct 23 10:25:45 2008
From: m at pavis.biodec.com (m at pavis.biodec.com)
Date: Thu, 23 Oct 2008 12:25:45 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
References: <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
Message-ID: <20081023102545.GE3694@pavis.biodec.com>

* Giovanni Marco Dall'Olio (dalloliogm at gmail.com) [081022 19:12]:
> 
> I was also thinking of starting using a database to store data, instead of
> files. This would probably solve the problem of out of memory when parsing
> those long files.

If you just need to store data, i.e. you just need a thin layer above
file storage, I'd suggest evaluating ZODB

It's very simple, somehow pythonic, and you don't need to learn SQL to
manage the data (of course, SQL is just fine, and from a real DB you get
much more than just data storage, but since you are just writing about
alternatives to file storage, I assume that SQL would not be a plus)

HTH

--
 .*.                            finelli
 /V\
(/ \) --------------------------------------------------------------
(   )       Linux: Friends dont let friends use Piccolosoffice
^^-^^ --------------------------------------------------------------

It is easier to make a saint out of a libertine than out of a prig.
		-- George Santayana


From dalloliogm at gmail.com  Thu Oct 23 11:30:06 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 23 Oct 2008 13:30:06 +0200
Subject: [BioPython] [OT] Revision control and databases
Message-ID: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>

Hi,
I have a question (well, it's not directly related to biopython or pygr, but
to scientific computing).

I always used flat files to store results and data for my bioinformatics
analys, but not (as I was saying in another thread) I would like to start
using a database to do that.

The problem is I don't know if databases do Revision Control.
When I used flat files, I was used to save all the results in a git
repository, and, everytime something was changed or calculated again, I did
commit it.
Do you know how to do this with databases? Does MySQL provide support for
revision control?
Thanks :)

(sorry for cross-posting :( )

-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From sdavis2 at mail.nih.gov  Thu Oct 23 12:10:16 2008
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 23 Oct 2008 08:10:16 -0400
Subject: [BioPython] [OT] Revision control and databases
In-Reply-To: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>
References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>
Message-ID: <264855a00810230510n37d05cb1gd7b88a63988d7191@mail.gmail.com>

On Thu, Oct 23, 2008 at 7:30 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Hi,
> I have a question (well, it's not directly related to biopython or pygr, but
> to scientific computing).
>
> I always used flat files to store results and data for my bioinformatics
> analys, but not (as I was saying in another thread) I would like to start
> using a database to do that.
>
> The problem is I don't know if databases do Revision Control.
> When I used flat files, I was used to save all the results in a git
> repository, and, everytime something was changed or calculated again, I did
> commit it.
> Do you know how to do this with databases? Does MySQL provide support for
> revision control?
> Thanks :)

No.  Relational databases just store data.  You could build such a
system, but that would require a fair amount of work.  I would suggest
storing metadata about your analyses in the database and storing the
actual results on the file system.

Sean


From lpritc at scri.ac.uk  Thu Oct 23 12:44:45 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Thu, 23 Oct 2008 13:44:45 +0100
Subject: [BioPython] [bip] [OT] Revision control and databases
In-Reply-To: <C5262F1C.18234%lpritc@scri.ac.uk>
Message-ID: <C526304D.18238%lpritc@scri.ac.uk>

Hi Giovanni (and others)

Ah, reading again I see you're already using git... Without knowing exactly
what you're doing, I assume that CVS and SVN would be no improvement
<cough>, so please ignore my last paragraph below ;)

L.

On 23/10/2008 13:39, "Leighton Pritchard" <lpritc at scri.ac.uk> wrote:

> Hi Giovanni,
> 
> On 23/10/2008 12:30, "Giovanni Marco Dall'Olio" <dalloliogm at gmail.com> wrote:
> 
>> The problem is I don't know if databases do Revision Control.
>> When I used flat files, I was used to save all the results in a git
>> repository, and, everytime something was changed or calculated again, I did
>> commit it.
>> Do you know how to do this with databases? Does MySQL provide support for
>> revision control?
> 
> Databases are just collections of data.  Database Management Systems (DBMS)
> such as MySQL and PostgreSQL do not (AFAIAA) do revision control themselves,
> but they can be used for it, if you build that capability into the schema and
> also control database submissions appropriately.  There are a number of
> content management systems that implement version/revision control on common
> DBMS, like this.
> 
> Stretching a definition, you could possibly argue that CVS, SVN and the like
> are a form of DBMS... I don't know what type of data you're storing, or how
> they might scale for your purposes but, in principle, neither CVS nor SVN care
> much about whether your data represents code, legal documents, or any other
> sort of data.  For example, I've used CVS/SVN to version control manuscripts.
> You might like to try one of them.
> 
> Cheers,
> 
> L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________


From lpritc at scri.ac.uk  Thu Oct 23 12:39:40 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Thu, 23 Oct 2008 13:39:40 +0100
Subject: [BioPython] [bip] [OT] Revision control and databases
In-Reply-To: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>
Message-ID: <C5262F1C.18234%lpritc@scri.ac.uk>

Hi Giovanni,

On 23/10/2008 12:30, "Giovanni Marco Dall'Olio" <dalloliogm at gmail.com>
wrote:

> The problem is I don't know if databases do Revision Control.
> When I used flat files, I was used to save all the results in a git
> repository, and, everytime something was changed or calculated again, I did
> commit it.
> Do you know how to do this with databases? Does MySQL provide support for
> revision control?

Databases are just collections of data.  Database Management Systems (DBMS)
such as MySQL and PostgreSQL do not (AFAIAA) do revision control themselves,
but they can be used for it, if you build that capability into the schema
and also control database submissions appropriately.  There are a number of
content management systems that implement version/revision control on common
DBMS, like this.

Stretching a definition, you could possibly argue that CVS, SVN and the like
are a form of DBMS... I don't know what type of data you're storing, or how
they might scale for your purposes but, in principle, neither CVS nor SVN
care much about whether your data represents code, legal documents, or any
other sort of data.  For example, I've used CVS/SVN to version control
manuscripts.  You might like to try one of them.

Cheers,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________


From bsouthey at gmail.com  Thu Oct 23 13:55:49 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Thu, 23 Oct 2008 08:55:49 -0500
Subject: [BioPython] [bip] [OT] Revision control and databases
In-Reply-To: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>
References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>
Message-ID: <49008265.3040205@gmail.com>

Giovanni Marco Dall'Olio wrote:
> Hi,
> I have a question (well, it's not directly related to biopython or 
> pygr, but to scientific computing).
>
> I always used flat files to store results and data for my 
> bioinformatics analys, but not (as I was saying in another thread) I 
> would like to start using a database to do that.
Of course Biopython's BioSQL interface may provide a starting point.
> The problem is I don't know if databases do Revision Control.
> When I used flat files, I was used to save all the results in a git 
> repository, and, everytime something was changed or calculated again, 
> I did commit it.
> Do you know how to do this with databases? Does MySQL provide support 
> for revision control?
> Thanks :)
I think you are asking the wrong questions because it depends on what 
you want to do and what you actually store. There are a number of 
questions that you need to ask yourself about what you really need to do 
(knowing you have used git helps refine these). Examples include:
How often do you use the old versions in your git repository?
How do you use the old revisions in your git repository?
Do you even use the information of an older version if a newer version 
exists?
Do you actually determine when 'something was changed or calculated 
again' or it this partly determined by an external source like a Genbank 
or UniProt update? (At least in a database approach you could automate 
this.)
How many users that can make changes?
How often do you have conflicts?
Are the conflicts hard to solve?

Revision control may be overkill for your use because this is aims to 
handle many tasks and change conflicts related to multiple users rather 
than a single user.  If you don't need all these fancy features then you 
can use a database. If you just want to store and retrieve a version 
then you can use a database but you need to at least force the inclusion 
a date and comment fields to be useful.


Regards
Bruce


From tiagoantao at gmail.com  Thu Oct 23 14:51:22 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 23 Oct 2008 15:51:22 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810221012w2c894977sd0f86297f42f9394@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com>
	<5aa3b3570810221012w2c894977sd0f86297f42f9394@mail.gmail.com>
Message-ID: <6d941f120810230751k3dee7b96y8ee13e4bf1c2a4ca@mail.gmail.com>

Hi,


> Moreover, there are some methods (like GenePop.Record.split_in_pops) that
> create Record objects, and I thought it would have been easier to always
> refer to the same one.
> Maybe we should write a generic PopGenRecord in which to store all general
> informations about population genetics data.

The problem with that is that it is
a) very difficult to come with a representation that is general enough
(and usable in the long run).
b) a general representation would be an hassle in specific cases

Let me elaborate:

Different kinds of genetic information have completely different
storage needs: If you are doing genomic studies you will probably want
to have location information (like this SNP is on chromosome X,
position Y). Others (probably the majority) only require frequency
information (or to know what the marker is, irrespective of position).
In most species you don't even know the genomic position of a certain
marker. So you would have to have an general representation capable to
handle both position information and no position information. Then, in
some cases, you need the whole marker (like if you want to do a Tajima
D) or just frequency information (for Fst). Some markers (microsats)
you can (in most, but not all) cases ignore the genetic pattern, you
just count the repeats.
  You could argue that one could try to have a most general
representation but that entails  three problems:
  1. It is very difficult to come by with a clever, correct and future
proof representation. At least I've thinking on this issue since 2005
and have found no clever answer.
  2. Performance: If you care about performance, having a most general
data representation will bring about a big performance cost
(converting from a certain general format to the format needed to do
computations).
  3. Different formats and statistics have different requirements: For
instance on GenePop you don't have population names, neither the
marker itself, but for arlequin format you have partial information on
markers and full information on population names. converting the minor
differences among formats to a "general" format would be complex.


From tiagoantao at gmail.com  Thu Oct 23 15:10:51 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 23 Oct 2008 16:10:51 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
Message-ID: <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com>

On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Iterators are more difficult to implement in Ped files, because in this
> format every line of the file is an individual, so to write an iterator
> which iterates by population we will need to read at list the first row of
> every line of all the file.

GenePop works population by population. Where I a getting at, is that
different formats might have completely different strategies.
I've used a strategy with the FDist parser that it might be interesting to you:
1. I read the fdist file
2. Convert it to genepop
3. do all operations in the genepop format
4. convert back if necessary.

This might not work in your case because the ped format seems to be
more informative than the genepop format (and thus you loose
information in the conversion process). Feel free to copy and adapt my
code to your own (like split_in_pops and split_in_loci)


> I would probably use sqlalchemy to interface with this database: this is why
> I would like to implement a Population and Individual objects, it will fit
> better with relational mapping.

You can go ahead and suggest formats for Populations and Individuals.
But I strongly suspect that your proposal will be biased towards your
needs (I've suffered the same problem myself). I think that in
biopython the idea is to try to have a solution that is useful to
everybody.

Also, if you want to put some SQL in the code module code, you will
have to have approval from the maintainers of biopython. They will
send you to the BioSQL people, which will say that there is none of
their business. Been there, done that, no success.

Don't take me wrong, I am not trying to discourage you in any way. But
I think it is better to gain some experience before proposing changes
to core concepts.
I've been doing this work for 3 years now, and I am convinced that it
would be very hard for me to suggest a good representation for
populations and individuals. Even populations are very hard to address
(like, some data is geo-referenced -> called landspace genetics, and
the more traditional one is not).

My suggestion: solve you problem the best way you can (e.g., do an
independent PED parser - you can use any of my code if you want).
Solve small problems, one after another.
Trying to solve the general problem is very hard and requires lots of
long term experience.


From dalloliogm at gmail.com  Thu Oct 23 16:25:29 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 23 Oct 2008 18:25:29 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com>
Message-ID: <5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com>

On Thu, Oct 23, 2008 at 5:10 PM, Tiago Ant?o <tiagoantao at gmail.com> wrote:

> On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
> > Iterators are more difficult to implement in Ped files, because in this
> > format every line of the file is an individual, so to write an iterator
> > which iterates by population we will need to read at list the first row
> of
> > every line of all the file.
>
> GenePop works population by population. Where I a getting at, is that
> different formats might have completely different strategies.
> I've used a strategy with the FDist parser that it might be interesting to
> you:
> 1. I read the fdist file
> 2. Convert it to genepop
> 3. do all operations in the genepop format
> 4. convert back if necessary.
>
> This might not work in your case because the ped format seems to be
> more informative than the genepop format (and thus you loose
> information in the conversion process). Feel free to copy and adapt my
> code to your own (like split_in_pops and split_in_loci)
>
>
> > I would probably use sqlalchemy to interface with this database: this is
> why
> > I would like to implement a Population and Individual objects, it will
> fit
> > better with relational mapping.
>
> You can go ahead and suggest formats for Populations and Individuals.
> But I strongly suspect that your proposal will be biased towards your
> needs (I've suffered the same problem myself). I think that in
> biopython the idea is to try to have a solution that is useful to
> everybody.
>
> Also, if you want to put some SQL in the code module code, you will
> have to have approval from the maintainers of biopython. They will
> send you to the BioSQL people, which will say that there is none of
> their business. Been there, done that, no success.
>
> Don't take me wrong, I am not trying to discourage you in any way. But
> I think it is better to gain some experience before proposing changes
> to core concepts.
> I've been doing this work for 3 years now, and I am convinced that it
> would be very hard for me to suggest a good representation for
> populations and individuals. Even populations are very hard to address
> (like, some data is geo-referenced -> called landspace genetics, and
> the more traditional one is not).
>
> My suggestion: solve you problem the best way you can (e.g., do an
> independent PED parser - you can use any of my code if you want).
> Solve small problems, one after another.
> Trying to solve the general problem is very hard and requires lots of
> long term experience.
>

Well, I agree with you... I don't have any idea on how this problem could be
resolved :).
However I think it would be good to add to biopython at least some
funcionality to calculate Fst statistics and parse these file formats, at
least at the level at which BioPerl does.
What if we just translate the same functionalities and copy the population
objects from bioperl into biopython?
I realize that it won't be the perfect solution: in fact, it is the same
reason why I started this discussion here, the bioperl code wasn't optimized
enought for what I want to do, but I didn't know how to modify perl modules
and preferred python.

Maybe we can just write a PED and GenePop parser and have let it work with
GenePop and your modules to calculate Fst.
We should agree with a population object that could be used as input for
GenePop.
I think it would be good anyway to release even incomplete code to the
public, because it could be useful for other people.


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From dalloliogm at gmail.com  Thu Oct 23 16:27:22 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 23 Oct 2008 18:27:22 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com>
	<5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com>
	<320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com>
Message-ID: <5aa3b3570810230927m696f9c27gdfec084647c2c509@mail.gmail.com>

On Thu, Oct 23, 2008 at 11:57 AM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> On Thu, Oct 23, 2008, Giovanni Marco Dall'Olio wrote:
> > On Wed, Oct 22, Peter wrote:
> >> On Wed, Oct 22, Giovanni Marco Dall'Olio wrote:
> >> >
> >> > Iterators are more difficult to implement in Ped files, because in
> this
> >> > format every line of the file is an individual, so to write an
> iterator
> >> > which iterates by population we will need to read at list the first
> row
> >> > of every line of all the file.
> >>
> >> It sounds like for Ped files it would make more sense to iterate over
> >> the individuals.  The mental picture I have in mind is a big
> >> spreadsheet, individuals as rows (lines), populations (and other
> >> information) as columns.  By having the parser iterate over the
> >> individuals one by one, the user could then "simplify" each individual
> >> as they are read in, recording in memory just the interesting data.
> >> This way the whole dataset need not be kept in memory.
> >
> > This makes sense.
> > Basically, we should write a (Ped/GenePop)Iterator function, which should
> > read the file one line at a time, check if it a has correct syntax and is
> > not a comment, and then use 'yield' to create a Record object. Am I
> right?
>
> Yes :)
>
> Python functions written with "yield" are called  "generator functions",
> see:
> http://www.python.org/dev/peps/pep-0255/
>

So, how should we modify the current GenePop parser to make it work as an
iterator?
Now it has a 'Scanner' and 'Consumer' methods. Should I remove them and
write a RecordIterator instead?
-
http://github.com/dalloliogm/biopython---popgen/tree/master/src/PopGen/Gio/Ped/__init__.py
Can you explain me more or less how the 'Consumer' object works? It is
mandatory to use it when creating biopython objects?

p.s. do you like the doctest to show how to use the parser?


> Peter
>


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Thu Oct 23 17:01:26 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 23 Oct 2008 18:01:26 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810230927m696f9c27gdfec084647c2c509@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com>
	<5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com>
	<320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com>
	<5aa3b3570810230927m696f9c27gdfec084647c2c509@mail.gmail.com>
Message-ID: <320fb6e00810231001w2345bbe5r8c1727ddf883553c@mail.gmail.com>

Giovanni wrote:
> So, how should we modify the current GenePop parser to make it work as an
> iterator?

I think this would mean breaking up the current Record object (which
holds everything) into sub-records which can be yielded one by one.
This would require an API change, unless you wanted to continue to
offer the two approaches in parallel (not elegant, but see
Bio/Sequencing/Ace.py for an example of where this made sense to do).

> Now it has a 'Scanner' and 'Consumer' methods. Should I remove them and
> write a RecordIterator instead?
> ...
> Can you explain me more or less how the 'Consumer' object works? It is
> mandatory to use it when creating biopython objects?

You can write an iterator with or without the Scanner/Consumer style of parser.

The Scanner/Consumer system is very flexible if you want to parse the
data into different objects (by using different consumers).  In theory
the end user could also use the provided scanner with their own
consumer.  However, in my opinion for parsing sequence file formats
this was overkill (needlessly complicated) - as only one object is
really needed to represent a sequence (we have the SeqRecord for
this), so most of the recent parsers in Bio.SeqIO and Bio.AlignIO do
not use the scanner/consumer setup.

See also the short Tutorial section "Parser Design".
http://biopython.org/DIST/docs/tutorial/Tutorial.html

For population genetics given there is no one universal record object,
perhaps the flexibility of the Scanner/Consumer system is worth while.
 On the other hand, Tiago currently has the scanner/consumer in
Bio.PopGen.GenePop as private objects so this is currently a private
implementation detail - one could replace the Scanner/Consumer details
without breaking the public API.

Peter


From biopython at maubp.freeserve.co.uk  Fri Oct 24 08:52:25 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Oct 2008 09:52:25 +0100
Subject: [BioPython] Retrieving nucleotide sequence for given accession
	Entrez ID
In-Reply-To: <61B0EE7C247C1349881F63414448FC1F078874C1@EXEVS06.its.uncc.edu>
References: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu>
	<320fb6e00810221315i31358bc2n2e5c9be405a77e42@mail.gmail.com>
	<61B0EE7C247C1349881F63414448FC1F078874C1@EXEVS06.its.uncc.edu>
Message-ID: <320fb6e00810240152t6e6123d3la00f1fe43121b985@mail.gmail.com>

Hi Richard,

I've taken the liberty of CC'ing this back to the mailing list,

Richard Clary wrote:
> Much appreciation Peter--it worked perfectly.

Good :)

> If you are wanting to
> retrieve multiple sequences, is a simple "+" string concatenation
> sufficient as the case when using eUtils or approach it by creating
> a tuple or dictionary and passing arguments?
>
> Richard

Moving on to your multi-sequence question, using "+" doesn't
seem to work - you should use a comma for concatenating the
IDs when calling eFetch.   What made you think of "+" here?

One other tweak is that Bio.SeqIO.read(...) is for when the handle
contains one and only one record.  In general you'll need to use
Bio.SeqIO.parse(...) instead and iterate over the records.

Depending on what you want to achieve, maybe:

from Bio import Entrez, SeqIO
id_list = ["186972394","12345678"]
Entrez.email = "Richard at example.com" #Tell the NCBI who you are
handle = Entrez.efetch(db="nucleotide", id=",".join(id_list),rettype="fasta")
for id,record in zip(id_list,SeqIO.parse(handle, "fasta")) :
    assert id in record.id, "Didn't get ID %s returned!" % id
    print "%s = %s" % (record.id, record.seq)
    #seq_str = str(record.seq)

If you still want just plain strings for the sequence, maybe:

from Bio import Entrez, SeqIO
id_list = ["186972394","12345678"]
Entrez.email = "Richard at example.com" #Tell the NCBI who you are
handle = Entrez.efetch(db="nucleotide", id=",".join(id_list),rettype="fasta")
seq_str_list = [str(record.seq) for record in SeqIO.parse(handle, "fasta")]

If you haven't already done so, please read the NCBI guidelines for
using Entrez,
http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements

Also, have a look at the Entrez chapter in the tutorial, especially
the "history" support which may be relevant.
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

Peter


From dalloliogm at gmail.com  Fri Oct 24 09:08:54 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 24 Oct 2008 11:08:54 +0200
Subject: [BioPython] [bip] [OT] Revision control and databases
In-Reply-To: <49008265.3040205@gmail.com>
References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>
	<49008265.3040205@gmail.com>
Message-ID: <5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com>

On Thu, Oct 23, 2008 at 3:55 PM, Bruce Southey <bsouthey at gmail.com> wrote:

> Giovanni Marco Dall'Olio wrote:
>
>> Hi,
>> I have a question (well, it's not directly related to biopython or pygr,
>> but to scientific computing).
>>
>> I always used flat files to store results and data for my bioinformatics
>> analys, but not (as I was saying in another thread) I would like to start
>> using a database to do that.
>>
> Of course Biopython's BioSQL interface may provide a starting point.


The problem is that BioSQL doesn't support yet Population Genetics record
(see another thread in biopython mailing list), so I would have to implement
something like that in BioSQL or wait for the developers to do it.
Maybe I will do this later, but now I don't have the time.


>
>  The problem is I don't know if databases do Revision Control.
>> When I used flat files, I was used to save all the results in a git
>> repository, and, everytime something was changed or calculated again, I did
>> commit it.
>> Do you know how to do this with databases? Does MySQL provide support for
>> revision control?
>> Thanks :)
>>
> I think you are asking the wrong questions because it depends on what you
> want to do and what you actually store. There are a number of questions that
> you need to ask yourself about what you really need to do (knowing you have
> used git helps refine these). Examples include:
> How often do you use the old versions in your git repository?
> How do you use the old revisions in your git repository?
> Do you even use the information of an older version if a newer version
> exists?
> How many users that can make changes?
> How often do you have conflicts?
> Are the conflicts hard to solve?


These are all very good questions.
The problem is that I consider revision control as a 'good practice': I
remember that when I was not used to keep an history of the changes to my
data, it was a mess. I would like to have at least a 'version' field, to
know how much my data is old.

I have found this :
- http://pgfoundry.org/projects/tablelog/
which seems interesting.
I think this is a big issue for bioinformatics. How is it possible that
nobody has never tried to implement such a functionality for databases?
Version Control could be difficult to implement, but not so much. There is
must be something that I can reuse...


Do you actually determine when 'something was changed or calculated again'
> or it this partly determined by an external source like a Genbank or UniProt
> update? (At least in a database approach you could automate this.)


Well, it could be useful to


>
>
> Revision control may be overkill for your use because this is aims to
> handle many tasks and change conflicts related to multiple users rather than
> a single user.  If you don't need all these fancy features then you can use
> a database. If you just want to store and retrieve a version then you can
> use a database but you need to at least force the inclusion a date and
> comment fields to be useful.


Maybe there are other similar tools.
This is a big issue for bioinformatics. I think it is a good, when working
with

Unfortunately I think revision control would be very useful for me.
The data in the database will be used and uploaded by 4 or 5 people.
It will be used also to store the results from some script:


>
>
>
> Regards
> Bruce
>


Thank you very much for all the replies.. I didn't expect so many of them.


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Fri Oct 24 09:25:31 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Oct 2008 10:25:31 +0100
Subject: [BioPython] [bip] [OT] Revision control and databases
In-Reply-To: <5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com>
References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com>
	<49008265.3040205@gmail.com>
	<5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com>
Message-ID: <320fb6e00810240225y1c380de5y6144a80ece808b2c@mail.gmail.com>

Giovanni Marco Dall'Olio wrote:
> Bruce Southey wrote:
>> Of course Biopython's BioSQL interface may provide a starting point.
>
> The problem is that BioSQL doesn't support yet Population Genetics record
> (see another thread in biopython mailing list), so I would have to implement
> something like that in BioSQL or wait for the developers to do it.
> Maybe I will do this later, but now I don't have the time.

BioSQL currently focuses on annotated sequences, but they are working
on some phylogenetics support too.  See http://www.biosql.org/ and the
PhyloDB extension module.  If there was enough interest, perhaps a
BioSQL schema for Population Genetics could be devised too.

Giovanni Marco Dall'Olio wrote:
>>> The problem is I don't know if databases do Revision Control.
>>> When I used flat files, I was used to save all the results in a git
>>> repository, and, everytime something was changed or calculated
>>> again, I did commit it.
>>> Do you know how to do this with databases? Does MySQL
>>> provide support for revision control?

As other people have said, databases don't generally "waste" resources
on version control.  If you need this, then it is up to you to design
your schema to record this additional metadata.  For example, the
BioSQL sequences have a "version" field in the "bioentry" table
allowing multiple revisions of the same accession to be held.  When
querying the database, you could request a particular version, or
indeed the latest version.  Essentially AFAIK database version control
is a Do-It-Yourself affair when designing your database tables.

Peter


From lpritc at scri.ac.uk  Fri Oct 24 09:51:35 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 24 Oct 2008 10:51:35 +0100
Subject: [BioPython] [bip] [OT] Revision control and databases
In-Reply-To: <5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com>
Message-ID: <C5275937.18346%lpritc@scri.ac.uk>

On 24/10/2008 10:08, "Giovanni Marco Dall'Olio" <dalloliogm at gmail.com>
wrote:

> The problem is that BioSQL doesn't support yet Population Genetics record
> (see another thread in biopython mailing list), so I would have to implement
> something like that in BioSQL or wait for the developers to do it.
> Maybe I will do this later, but now I don't have the time.

To be fair, that's a different problem from version control...

>> How often do you use the old versions in your git repository?
>> How do you use the old revisions in your git repository?
>> Do you even use the information of an older version if a newer version
>> exists?
>> How many users that can make changes?
>> How often do you have conflicts?
>> Are the conflicts hard to solve?
> 
> These are all very good questions.
> The problem is that I consider revision control as a 'good practice'

I think that you're right - it is good practice, and Bruce raises excellent
questions here: what individuals want or need from version control depends
greatly on their own situation, and whether a particular package fits your
own needs will depend on what they are.  If you don't know what they are
before choosing a package, then there's the risk of making an unsuitable
choice.

It's worth noting that revision control can also mean slightly different
things to different people.  Some might say that a version number and an ID
for the entity (human or automated) making that change is sufficient.  Some
might say that you ought not to stop short of conflict resolution and branch
control.  It depends on the needs of your project, IMO.

> I think this is a big issue for bioinformatics. How is it possible that nobody
> has never tried to implement such a functionality for databases

Databases (DBMS, to be picky) are a general-purpose solution for many
different kinds of problem.  Revision control is an inhomogeneous problem
with no optimal solution that can be implemented in many ways and not only
using DBMS.  There are plenty of revision control examples implemented in
databases, and the examples that first come to mind in Python for me are
content management systems such as Zope and Plone.  I think that BASE
implements one, but it's a long time since I looked at it.

> Unfortunately I think revision control would be very useful for me.
> The data in the database will be used and uploaded by 4 or 5 people.

Then at a minimum you may need a solution that records version changes, and
associates versions with individuals (and perhaps individual runs of
scripts).  You may also need locking and collision detection/conflict
resolution (which DBMS like MySQL and PostgreSQL support internally via
transactions; they don't generally implement version control because it
would be wasteful), depending on whether you expect that multiple people
might modify the same file at or at about the same time.

Best,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by
guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are
confidential

to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this

confidentiality and you must not use, disclose, copy, print or rely on
this 
e-mail in any way. Please notify postmaster at scri.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).
______________________________________________________________________


From cy at cymon.org  Fri Oct 24 10:46:28 2008
From: cy at cymon.org (Cymon Cox)
Date: Fri, 24 Oct 2008 11:46:28 +0100
Subject: [BioPython] BioSQL / phylodb
Message-ID: <7265d4f0810240346m456e724ax49f7c18048a29749@mail.gmail.com>

Hi All,

Ive been looking at the phylodb extension to BioSQL. Does anyone have any
python code for uploading a tree?

Cheers, C.

-- 
____________________________________________________________________

Cymon J. Cox


From biopython at maubp.freeserve.co.uk  Fri Oct 24 10:54:28 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Oct 2008 11:54:28 +0100
Subject: [BioPython] BioSQL / phylodb
In-Reply-To: <7265d4f0810240346m456e724ax49f7c18048a29749@mail.gmail.com>
References: <7265d4f0810240346m456e724ax49f7c18048a29749@mail.gmail.com>
Message-ID: <320fb6e00810240354q2b3c2a93p3c0c45b5ed48df3c@mail.gmail.com>

On Fri, Oct 24, 2008 at 11:46 AM, Cymon Cox <cy at cymon.org> wrote:
> Hi All,
>
> Ive been looking at the phylodb extension to BioSQL. Does anyone have any
> python code for uploading a tree?
>
> Cheers, C.

Not that I'm aware of, no.

Adding support to Biopython's BioSQL module to do this, and also
retrieve the data as a tree would be nice.  The Bio.Nexus.Tree class
would seem a logical representation to try and use.  As an aside,
being about to load a taxonomy from the main BioSQL taxon/taxon_name
tables as a tree might be nice too.

Peter


From kteague at bcgsc.ca  Fri Oct 24 18:32:41 2008
From: kteague at bcgsc.ca (Kevin Teague)
Date: Fri, 24 Oct 2008 11:32:41 -0700
Subject: [BioPython] [bip] [OT] Revision control and databases
In-Reply-To: <C5275937.18346%lpritc@scri.ac.uk>
References: <C5275937.18346%lpritc@scri.ac.uk>
Message-ID: <3F2B0CD4-83DF-4A88-B22D-926B97503B7C@bcgsc.ca>

>
>> I think this is a big issue for bioinformatics. How is it possible  
>> that nobody
>> has never tried to implement such a functionality for databases
>
> Databases (DBMS, to be picky) are a general-purpose solution for many
> different kinds of problem.  Revision control is an inhomogeneous  
> problem
> with no optimal solution that can be implemented in many ways and  
> not only
> using DBMS.  There are plenty of revision control examples  
> implemented in
> databases, and the examples that first come to mind in Python for me  
> are
> content management systems such as Zope and Plone.  I think that BASE
> implements one, but it's a long time since I looked at it.

The default file storage for Zope Object Database (ZODB) appends all  
new database writes, keeping older transactions on disk (similar to  
the way PostgreSQL works). Back in the day (circa 2000) Zope 2 exposed  
this database-level feature at the application level in the Zope  
Management Interface (ZMI). So you could see all past writes to the  
database, and try and revert back to an older one if desired (using  
the "undo" tab of the ZMI).

Problems with this approach included using sysadmin tools on the  
database could break application behaviour. e.g. lets say you had a  
"Document" object and a "Page Counter" object, you would wish to be  
able to view older versions of Documents, but only care about the  
current state of the Page Counters. However, if your Page Counters are  
changing like crazy and taking up tonnes of disk space and generally  
slowing down queries against the history of the database, there was no  
way to say "delete all outdated ephemeral Page Counter versions, but  
keep Document-related transactions" (especially since a Page Counter  
change and a Document change often commited in the same transaction).  
ZWiki exposed older revisions using this feature, and the accepted  
practice was to put each wiki into it's own database so that other  
forms of database maintenance didn't accidently blow away your wiki  
history ... it wasn't so pretty :P

You also had problems reverting back to just a specific revision, for  
example if you were in Revision 3 and you had changes in Revision 1  
that you wanted to go back to, but you'd made changes in Revision 2  
that referenced Revision 1, then you first had to step-back to  
Revision 2 before you could revert back to Revision 1. Even though  
Revision 2 also contained a bunch of changes that you didn't want to  
revert, that you would then manually need to later re-apply. Ug!

Zope 2 also had a Version object, you could poke a button in the UI to  
start a new "transaction" and then start making changes to code 
+content in the database. This was just implemented as a long-running  
transaction - from the point of starting to commiting a transaction  
could sometimes last for a whole month :). The problem being that when  
you finally wanted to commit the transaction to roll-out new features  
on a web site, if there were any conflicts from changes that happened  
you were hosed and would end-up copying those changes into a new  
transaction based off the latest database version and commiting that.  
It wasn't pretty :(

It has long since been acknowledged by Zope developers that exposing  
database level features at the application level is a Bad Thing(TM)!

Today there is a whole plethora of products for Zope that do some form  
of versioning, but they are all implemented at the application level.  
There is a whole plethora of products because there are many ways to  
do versioning, and the choices of how versions are managed is really  
best left up to the specific application. Some of these products  
provide reasonable APIs for implementing specific versioning within a  
specific platform - e.g Plone has a package called plone.app.iterate  
and it has APIs that use standard versioning terminology (checkin,  
checkout, working copy) for example:


class ICheckinCheckoutTool( Interface ):

     def allowCheckin( content ):
         """
         denotes whether a checkin operation can be performed on the  
content.
         """

     def allowCheckout( content ):
         """
         denotes whether a checkout operation can be performed on the  
content.
         """

     def allowCancelCheckout( content ):
         """
         denotes whether a cancel checkout operation can be performed  
on the content.
         """

     def checkin( content, checkin_messsage ):
         """
         check the working copy in, this will merge the working copy  
with the baseline
         """

     def checkout( container, content ):
         """
         """

     def cancelCheckout( content ):
         """


From sbassi at gmail.com  Sat Oct 25 01:03:43 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Fri, 24 Oct 2008 22:03:43 -0300
Subject: [BioPython] Loading dbxrefs from a gbk file
Message-ID: <b43bf2080810241803r4f49e575p16f159fae9607eb9@mail.gmail.com>

I have a genbank file like this one: http://www.pastecode.com.ar/f231664eb
I parse it with SeqIO.parse and the SeqRecord object I get is:

SeqRecord(seq=Seq('GAGAAGGACGCGCGGCCCCCAGCGCCTCTTGGGTGGCCGCCTCGGAGCATGACC...ATA',
IUPACAmbiguousDNA()), id='NM_000208.2', name='NM_000208',
description='Homo sapiens insulin receptor (INSR), transcript variant
1, mRNA.', dbxrefs=[])

If you look at lines 130 to 133 (I highlighted in yellow) of the
genbank sequence, there is cross database information (db_xref), but
it is not associated with the SeqRecord, it is an empty list.
According to http://www.biopython.org/wiki/SeqRecord, this condition
is known, but I don't understand if this is on porpuse or is a bug.
Best,
SB.


-- 
Vendo isla: http://www.genesdigitales.com/isla/
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

"It is pitch black. You are likely to be eaten by a grue." -- Zork


From biopython at maubp.freeserve.co.uk  Sat Oct 25 17:22:27 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 25 Oct 2008 18:22:27 +0100
Subject: [BioPython] Loading dbxrefs from a gbk file
In-Reply-To: <b43bf2080810241803r4f49e575p16f159fae9607eb9@mail.gmail.com>
References: <b43bf2080810241803r4f49e575p16f159fae9607eb9@mail.gmail.com>
Message-ID: <320fb6e00810251022u577d4f16wa7602c7bdc664322@mail.gmail.com>

On Sat, Oct 25, 2008 at 2:03 AM, Sebastian Bassi <sbassi at gmail.com> wrote:
> I have a genbank file like this one: http://www.pastecode.com.ar/f231664eb
> ...
> If you look at lines 130 to 133 (I highlighted in yellow) of the
> genbank sequence, there is cross database information (db_xref), but
> it is not associated with the SeqRecord, it is an empty list.

What you have highlighted is part of a gene feature, and would not be
part of the SeqRecord's db_xref list.  It should however be present in
the relevant SeqRecord feature.  Try:

print my_record.features[1]

(seeing as this is the second feature in the file, i.e. feature 1
using zero-based counting).

Peter


From tiagoantao at gmail.com  Sun Oct 26 01:04:01 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Sun, 26 Oct 2008 02:04:01 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com>
	<5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com>
Message-ID: <6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com>

[Sorry for the delay in answering]

On Thu, Oct 23, 2008 at 5:25 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:

> However I think it would be good to add to biopython at least some
> funcionality to calculate Fst statistics and parse these file formats, at
> least at the level at which BioPerl does.

Agree. Statistics is fundamental. I decided to postpone stats when I
started because I didn't want to to start with the core issue in
population genetics (being unexperienced at the start would probably
cause serious design errors). But I think now is the time.

> What if we just translate the same functionalities and copy the population
> objects from bioperl into biopython?

I don' t think the population objects in bioperl scale well. It is not
clear to me that their popgen module is a priority for them, and that
they carefully designed them (altough that might have changed in the
near past).
I also don' t believe that my own code (which I supplied you) is in
perfect shape to achieve this also.
I have to write down my ideas and send them here as soon as possible.
I will try to do it in the next couple of days at most.
The core idea is that there is no good abstract population and
individual objects, but they are also not needed.
What is needed, in my view, are file parsers and statistics.
Statistics should be organized in a systematic way. Example: all
frequency based, population-structure statistics should present the
same interface, something like:
   add_population(pop_name, individual_allele_list)
I will submit a small document for discussion very soon.

> I realize that it won't be the perfect solution: in fact, it is the same
> reason why I started this discussion here, the bioperl code wasn't optimized
> enought for what I want to do, but I didn't know how to modify perl modules
> and preferred python.

The important thing to notice is that biopython should not be
optimized to your needs or mine, it has to be general enough to
accomodate the vast majority of potential users. What I' ve always
tried, was to do things in a way that could be reused by others.


> Maybe we can just write a PED and GenePop parser and have let it work with
> GenePop and your modules to calculate Fst.

My suggestion would be for you to go ahead and do a Bio.PopGen.PED .
You could do it in the best way you see fit.
Converting from PED to genepop will make you loose information, if I
understand well (as you have SNP info on PED files, which you don' t
on genepop).
The other formats that I support (Fdist on released code and FStat on
the code that you have) are very similar (or less informative) than
genepop.
Again, my suggestion is for an independent parser. Of which you would
have absolute control as you would be the implementor.

I understand that this might lead to some duplicated code (like
split_in_pops), but repeated code is less of a problem than a generic
object that ends up being wrong in the long run.

> We should agree with a population object that could be used as input for
> GenePop.

For the reasons above I will fight a general Population object. At
least for now. I don't feel confident that we have the experience to
design one.
It is important to notice that we cannot break backward compatibility
without a very good reason. I think that a generic population object
will be severely resived in the future.
In your specific case I also think you would suffer with a population
object, as you need performance (parsing file, creating object,
extrating information from object, calculating statistic).
As I see it, it would be a smaller chain (parse, convert to statistic
family format, calculate statistic).

> I think it would be good anyway to release even incomplete code to the
> public, because it could be useful for other people.

Incomplete is OK. But I think we would be releasing wrong code. Code
that it would be redone in the future (and break interfaces with the
past versions). Also, a generic object would have performance problems
(it would have to be able to store all the information).

Well, I am ranting and not proposing a decent alternative. I will try
to write down something decent. I will try to write up a proposal
until Tuesday. I'm afraid the error is on my part: I have to write
down what is in my head so that people can discuss if it is a good
idea or not.


From tiagoantao at gmail.com  Sun Oct 26 01:34:55 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Sun, 26 Oct 2008 02:34:55 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com>
	<5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com>
	<6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com>
Message-ID: <6d941f120810251834q87495d5re558cf179356a8b0@mail.gmail.com>

I just want add on an extra comment explaining why I oppose doing an
individual object:

I have the following questions (and others) in my mind, which I don't
know the answer. I am not looking for answers to them, I am just
trying to illustrate the difficulty of the problem.

1. For a certain marker, do we store the genomic position of the
marker? Some (most) statistics don't use this information. For many
species this information is not even available. But for some
statistics this information is mandatory...
2. For a microsatellite do we store the motif and number of repeats or
the whole sequence? (see 4)
3. If one is interested in SNPs and one has the full sequences does
one store the full sequences or just the SNPs? If you store just the
SNPs then you cannot do sequence based analysis in the future (say
Tajima D). If you store everything then you are consuming memory and
cpu.
4. If one just wants to do frequency statistics (Fst), do you store
the marker or just the assign each one an ID and store the ID? It is
much cheaper to store an ID than a full sequence.

Populations
1. Support for landscape genetics? I mean geo-referentiation
2. Support for hierarchical population structure?
3. Do we cache statistics results on Population objects?


Let me take your class marker:
class Marker:
  total_heterozygotes_count = 0
  total_population_count = 0
  total_Purines_count = 0 # this could be renamed, of course
  total_Pyrimidines_count = 0

How would this be useful for microsatellites? Why purines, and if my
marker is a protein? If it is a SNP I want to know the nucleotide? And
if I am studying proteins and I want to have the aminoacid?

Dont take me wrong, I have done this path. To solve my particular
problems is not very hard. To have a framework that is usable by
everybody, it is a damn hard problem. And we dont really need to solve
it (ok, it would be nice to do things to populations in general, that
I agree). But the fundamental is: read file, calculate statistics.
That doesnt need population and individual objects.

If we end up having too many formats a consolidation step might be
needed in the future (to avoid having 10 split_in_pops). That I agree.


From sbassi at gmail.com  Mon Oct 27 04:13:47 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Mon, 27 Oct 2008 01:13:47 -0300
Subject: [BioPython] Loading dbxrefs from a gbk file
In-Reply-To: <320fb6e00810251022u577d4f16wa7602c7bdc664322@mail.gmail.com>
References: <b43bf2080810241803r4f49e575p16f159fae9607eb9@mail.gmail.com>
	<320fb6e00810251022u577d4f16wa7602c7bdc664322@mail.gmail.com>
Message-ID: <b43bf2080810262113i69ffbbe7w5979655da2f11c53@mail.gmail.com>

On Sat, Oct 25, 2008 at 2:22 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> What you have highlighted is part of a gene feature, and would not be
> part of the SeqRecord's db_xref list.  It should however be present in
> the relevant SeqRecord feature.  Try:

OK, thank you.


From lueck at ipk-gatersleben.de  Mon Oct 27 13:43:49 2008
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Mon, 27 Oct 2008 14:43:49 +0100
Subject: [BioPython] ClustaW problem upwards Biopython 1.43
Message-ID: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de>

Hi!

I just releazed, that a ClustalW alignment gives an error message under Biopython 1.44 and 1.47 whereas under 1.43 everything works fine. 

The message is the following (example of the tutorial):

Traceback (most recent call last):
  File "I:\Final\pair_align.py", line 90, in pair_align
    alignment = Clustalw.do_alignment(cline)
  File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 79, in do_alignment
    status = run_clust.close()
IOError: [Errno 0] Error

Does someone know what's the problem?

Kind regards
Stefanie


From biopython at maubp.freeserve.co.uk  Mon Oct 27 15:12:13 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 27 Oct 2008 15:12:13 +0000
Subject: [BioPython] ClustaW problem upwards Biopython 1.43
In-Reply-To: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de>
References: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00810270812le76ae75m55f53107c2572a34@mail.gmail.com>

On Mon, Oct 27, 2008 at 1:43 PM, Stefanie L?ck <lueck at ipk-gatersleben.de> wrote:
> Hi!
>
> I just releazed, that a ClustalW alignment gives an error message under Biopython 1.44 and 1.47 whereas under 1.43 everything works fine.
>
> The message is the following (example of the tutorial):
>
> Traceback (most recent call last):
>  File "I:\Final\pair_align.py", line 90, in pair_align
>    alignment = Clustalw.do_alignment(cline)
>  File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 79, in do_alignment
>    status = run_clust.close()
> IOError: [Errno 0] Error
>
> Does someone know what's the problem?

There were some changes made between Biopython 1.43 and 1.44 to try
and deal with spaces in filenames.  Could you do:

print str(cline)

That should show the exact command line python is trying to run.  What
happens if you try this command at the "DOS" prompt?

Also, what version of clustalw do you have installed?

Thanks,

Peter


From biopython at maubp.freeserve.co.uk  Mon Oct 27 17:49:59 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 27 Oct 2008 17:49:59 +0000
Subject: [BioPython] Deprecating Bio.mathfns, Bio.stringfns and Bio.listfns?
Message-ID: <320fb6e00810271049t2aa3fac4s1907027307b035f1@mail.gmail.com>

Dear Biopythoneers,

Is anyone currently using Bio.mathfns, Bio.stringfns or Bio.listfns?
These provide a selection of maths, string and list functions - some
of which are apparently irrelevant with changes or additions to python
itself (e.g. sets).

I'd like to declare these as deprecated for the next release, or at
least obsolete and likely to be deprecated in future - so if you are
using these modules or would like to defend them, please speak up
soon.

Thanks,

Peter

P.S. If you care about the details, there is a longer discussion on
the dev-mailing list:
http://lists.open-bio.org/pipermail/biopython-dev/2008-October/004472.html


From biopython at maubp.freeserve.co.uk  Mon Oct 27 17:57:20 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 27 Oct 2008 17:57:20 +0000
Subject: [BioPython] Deprecating the obsolete Bio.Ndb module?
Message-ID: <320fb6e00810271057l181cbb1fw15aa8f03e4159328@mail.gmail.com>

Dear Biopythoneers,

The Bio.Ndb module (written six years ago) provides an HTML parser for
the NDB website (nucleotide database, a repository of
three-dimensional structural information about nucleic acids).

The URL has changed, but this service is still running.  However, the
webpage layout has changed considerably - Their front page mentions a
major revision in Jan 2008.

Unless anyone would like to volunteer to look after the Bio.Ndb module
and bring it up to date, I'm suggesting we deprecate it for the next
release of Biopython.

Peter


From lueck at ipk-gatersleben.de  Tue Oct 28 08:10:25 2008
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Tue, 28 Oct 2008 09:10:25 +0100
Subject: [BioPython] ClustaW problem upwards Biopython 1.43
References: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de>
	<320fb6e00810270812le76ae75m55f53107c2572a34@mail.gmail.com>
Message-ID: <001a01c938d4$a19887c0$1022a8c0@ipkgatersleben.de>

Hi!

>>> print str(cline)
clustalw  pb.fasta -OUTFILE=test2.aln

I'm using CLUSTAL W 2.0.

Under DOS everything works fine.

Regards
Stefanie


----- Original Message ----- 
From: "Peter" <biopython at maubp.freeserve.co.uk>
To: "Stefanie L?ck" <lueck at ipk-gatersleben.de>
Cc: <biopython at lists.open-bio.org>
Sent: Monday, October 27, 2008 4:12 PM
Subject: Re: [BioPython] ClustaW problem upwards Biopython 1.43


On Mon, Oct 27, 2008 at 1:43 PM, Stefanie L?ck <lueck at ipk-gatersleben.de> 
wrote:
> Hi!
>
> I just releazed, that a ClustalW alignment gives an error message under 
> Biopython 1.44 and 1.47 whereas under 1.43 everything works fine.
>
> The message is the following (example of the tutorial):
>
> Traceback (most recent call last):
>  File "I:\Final\pair_align.py", line 90, in pair_align
>    alignment = Clustalw.do_alignment(cline)
>  File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 79, 
> in do_alignment
>    status = run_clust.close()
> IOError: [Errno 0] Error
>
> Does someone know what's the problem?

There were some changes made between Biopython 1.43 and 1.44 to try
and deal with spaces in filenames.  Could you do:

print str(cline)

That should show the exact command line python is trying to run.  What
happens if you try this command at the "DOS" prompt?

Also, what version of clustalw do you have installed?

Thanks,

Peter


From dalloliogm at gmail.com  Tue Oct 28 10:46:39 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 28 Oct 2008 11:46:39 +0100
Subject: [BioPython] a common repository for test datasets/use cases for all
	Bio* projects
Message-ID: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com>

Hi,
I would like to make you a proposal.
Every module/program written in bioinformatics needs to be tested
before it can be used to produce results that can be published.

For example, let's say I want to write another fasta file parser, like
SeqIO.FastaIO in biopython : I would have have to test the script
against some real fasta files, just to make sure that it doesn't parse
them in a wrong way, or that it losts data.
Or, let's say I want to write a script to calculate Fst statistics
over some population genetics data: I will have to compare the results
of my scripts against other programs, check if it gives me the right
result for a set for which I already know the Fst value, and maybe
ideate some other kind of checks to be sure my script doesn't do weird
things, like losing input data on the way.

So, the point is.. what if we create a common repository for all this
kind of testing data, to be used in common with all the other Bio*
projects?
Wouldn't it be good if all the Bio* fasta parser are able to parse the
same files and give the same results, demonstrating that all of them
work fine or are wrong at the same time?

I am doing this because me (and Tiago) would like to develop a module
to calculate Fst statistics over SNP data, and there is no point of
collecting some good test datasets and not sharing them with other
similar projects in other programming languages.

The same goes for much of the documentation, like use cases: if we
collect a good base of use cases related to bioinformatics, it would
be easier to coordinate the efforts of all the Bio* projects and
compare the different approaches used to solve the same issue by the
different comunities.

At the moment, I have created a simple git repository on github:
- http://github.com/dalloliogm/bio-test-datasets-repository
but , it is still empty and maybe github is not the ideal hosting for
such a project, since the free account has a 100MB space limit.


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Tue Oct 28 10:55:04 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Oct 2008 10:55:04 +0000
Subject: [BioPython] a common repository for test datasets/use cases for
	all Bio* projects
In-Reply-To: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com>
References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com>
Message-ID: <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com>

On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
> Hi,
> I would like to make you a proposal.
> Every module/program written in bioinformatics needs to be tested
> before it can be used to produce results that can be published.
> ...
> So, the point is.. what if we create a common repository for all this
> kind of testing data, to be used in common with all the other Bio*
> projects?

You you made some other good points, and this is a good idea.  In
practice the licences are usually OK for use to "borrow" example input
files from each other (and this does happen), but a more organised
system to encourage interchange of examples would be good.

I think this sounds like an excellent topic for the (currently very
quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev
discussion, one of the OBF mailing lists, this should cover all the
Bio* project members interested).  See
http://lists.open-bio.org/mailman/listinfo

Peter


From dalloliogm at gmail.com  Tue Oct 28 11:00:42 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 28 Oct 2008 12:00:42 +0100
Subject: [BioPython] a common repository for test datasets/use cases for
	all Bio* projects
In-Reply-To: <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com>
References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com>
	<320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com>
Message-ID: <5aa3b3570810280400t510468d1sbce5bb0977ec772b@mail.gmail.com>

On Tue, Oct 28, 2008 at 11:55 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio
>
> I think this sounds like an excellent topic for the (currently very
> quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev
> discussion, one of the OBF mailing lists, this should cover all the
> Bio* project members interested).  See
> http://lists.open-bio.org/mailman/listinfo
>
> Peter


Thanks!!
I didn't know of this list!!

>


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Tue Oct 28 11:20:21 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Oct 2008 11:20:21 +0000
Subject: [BioPython] ClustalW problem upwards Biopython 1.43
Message-ID: <320fb6e00810280420t75f62774x55335e8a5aa11151@mail.gmail.com>

Stephanie wrote:
>
>>>> print str(cline)
>
> clustalw  pb.fasta -OUTFILE=test2.aln
>
> I'm using CLUSTAL W 2.0.

Are you sure?  The Clustal W 2.0 executable is normally called
clustalw2.exe rather than clustalw.exe - so based on the command line
above I would have expect Clustalw 1.x to be used.  Maybe you have
both versions of ClustalW installed?

Could you tell me where exactly (full paths) you have Clustalw.exe
and/or Clustalw2.exe installed?  This would be helpful for the new
unit test I'm working on.

> Under DOS everything works fine.

I've been having "fun" trying to get a new unit test for this to work
nicely on Windows - there a certainly some combinations of file name
arguments with spaces etc which won't work on Biopython 1.48.  I found
examples where the command line string ran "by hand" at the "DOS"
prompt worked fine, but would fail when invoked in python via os.popen
- on the bright side, using subprocess.Popen instead works much better
(although this isn't available for python 2.3).

If you want to try this new code, I would suggest you first install
Biopython 1.48, and then backup and update
C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py to revision
1.25 from CVS which you can download here (should be updated within
the hour):
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Clustalw/__init__.py?cvsroot=biopython

Thanks!

Peter


From peter at maubp.freeserve.co.uk  Tue Oct 28 11:36:15 2008
From: peter at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Oct 2008 11:36:15 +0000
Subject: [BioPython] Dropping Python 2.3 support?
Message-ID: <320fb6e00810280436m7cf48993v8b0562bb44919128@mail.gmail.com>

Dear all,

Those of you following the dev-mailing list will probably be aware
that we've been making excellent progress in CVS to get Biopython to
run fine on Python 2.6.  However, the downside is that continuing to
support Python 2.3 is beginning to be pain (triggered for the most
part by some older modules being deprecated in python 2.6).

Does anyone on the mailing list still use Python 2.3?  e.g. older
Linux servers, or people still using Apple Mac OS X 10.4 Tiger (or
older).

What I'd like to suggest is that the next one or two releases will
still support Python 2.3, but after that we'll drop support for Python
2.3.

Thanks,

Peter

P.S. For the record, until recently my main Windows machine ran Python
2.3 only - giving me a vested interesting in continuing Python 2.3
support ;)


From jblanca at btc.upv.es  Tue Oct 28 11:52:29 2008
From: jblanca at btc.upv.es (Jose Blanca)
Date: Tue, 28 Oct 2008 12:52:29 +0100
Subject: [BioPython] caf format support
Message-ID: <200810281252.29607.jblanca@btc.upv.es>

Hi,
I'm currently dealing with caf contig files. Has BioPython support for this 
format? Do you know of other alternatives in python or perl to deal with it?
Best regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)


From biopython at maubp.freeserve.co.uk  Tue Oct 28 12:16:33 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Oct 2008 12:16:33 +0000
Subject: [BioPython] caf format support
In-Reply-To: <200810281252.29607.jblanca@btc.upv.es>
References: <200810281252.29607.jblanca@btc.upv.es>
Message-ID: <320fb6e00810280516j72af2c70q46790c217585b2c5@mail.gmail.com>

On Tue, Oct 28, 2008 at 11:52 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> Hi,
> I'm currently dealing with caf contig files. Has BioPython support for this
> format? Do you know of other alternatives in python or perl to deal with it?
> Best regards,

I'm not aware of any Biopython code for CAF contig files.  However,
have a look at http://www.sanger.ac.uk/Software/formats/CAF/userguide.shtml
where some perl tools are described, including some for converting CAF
into other formats.

We do have ACE and PHRED (used by PHRAP) parsers in Bio.Sequencing, so
adding Bio.Sequencing.CAF might be logical.

Peter


From cjfields at illinois.edu  Tue Oct 28 12:26:32 2008
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 28 Oct 2008 07:26:32 -0500
Subject: [BioPython] a common repository for test datasets/use cases for
	all Bio* projects
In-Reply-To: <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com>
References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com>
	<320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com>
Message-ID: <C58B47EC-54F4-42AB-8B93-2678D5D90B4F@illinois.edu>

All,

An open-bio repository had started up for this use at one point,  
though I don't think it made the transition to subversion yet (and it  
never really took off, not sure why).  You should try contacting open- 
bio support and maybe Jason or Chris D. can answer this in a bit more  
detail.

chris

On Oct 28, 2008, at 5:55 AM, Peter wrote:

> On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio
> <dalloliogm at gmail.com> wrote:
>> Hi,
>> I would like to make you a proposal.
>> Every module/program written in bioinformatics needs to be tested
>> before it can be used to produce results that can be published.
>> ...
>> So, the point is.. what if we create a common repository for all this
>> kind of testing data, to be used in common with all the other Bio*
>> projects?
>
> You you made some other good points, and this is a good idea.  In
> practice the licences are usually OK for use to "borrow" example input
> files from each other (and this does happen), but a more organised
> system to encourage interchange of examples would be good.
>
> I think this sounds like an excellent topic for the (currently very
> quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev
> discussion, one of the OBF mailing lists, this should cover all the
> Bio* project members interested).  See
> http://lists.open-bio.org/mailman/listinfo
>
> Peter
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign


From bsouthey at gmail.com  Tue Oct 28 13:56:34 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Tue, 28 Oct 2008 08:56:34 -0500
Subject: [BioPython] a common repository for test datasets/use cases for
 all Bio* projects
In-Reply-To: <C58B47EC-54F4-42AB-8B93-2678D5D90B4F@illinois.edu>
References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com>	<320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com>
	<C58B47EC-54F4-42AB-8B93-2678D5D90B4F@illinois.edu>
Message-ID: <49071A12.8060705@gmail.com>

Chris Fields wrote:
> All,
>
> An open-bio repository had started up for this use at one point, 
> though I don't think it made the transition to subversion yet (and it 
> never really took off, not sure why).  You should try contacting 
> open-bio support and maybe Jason or Chris D. can answer this in a bit 
> more detail.
>
> chris
>
> On Oct 28, 2008, at 5:55 AM, Peter wrote:
>
>> On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio
>> <dalloliogm at gmail.com> wrote:
>>> Hi,
>>> I would like to make you a proposal.
>>> Every module/program written in bioinformatics needs to be tested
>>> before it can be used to produce results that can be published.
>>> ...
>>> So, the point is.. what if we create a common repository for all this
>>> kind of testing data, to be used in common with all the other Bio*
>>> projects?
>>
>> You you made some other good points, and this is a good idea.  In
>> practice the licences are usually OK for use to "borrow" example input
>> files from each other (and this does happen), but a more organised
>> system to encourage interchange of examples would be good.
>>
>> I think this sounds like an excellent topic for the (currently very
>> quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev
>> discussion, one of the OBF mailing lists, this should cover all the
>> Bio* project members interested).  See
>> http://lists.open-bio.org/mailman/listinfo
>>
>> Peter
>> _______________________________________________
>> BioPython mailing list  -  BioPython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Marie-Claude Hofmann
> College of Veterinary Medicine
> University of Illinois Urbana-Champaign
>
>
>
>
> _______________________________________________
> BioPython mailing list  -  BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
There had been some discussion on scipy lists on data sets that you 
should look for.

One of the most critical questions that you must address is copyright 
and who owns the data sets (credit where credit is due). Ultimately any 
data will be distributable in some form and thus really brings in 
copyright issues and such. This is also country specific because there 
is the question of whether or not a data set can be copyrighted and the 
terms of it - not a lawyer to know this. The Science Commons has various 
other useful information especially the FAQ on databases, 
http://sciencecommons.org/resources/faq/databases/, that states "In the 
United States, data will be protected by copyright only if they express 
creativity".

I do believe you would need to be very strict on what is acceptable 
because if it is distributable you can not rely on the user being 
responsible:
1) If has been used for publication, an extremely clear statement of the 
owner (publisher) that it can be made available is required.
2) If the data is created from publicly available sources that allow it 
eg Uniprot (http://www.uniprot.org/help/license) then exact recreatable 
sets must be made available so the data can be exactly obtained from 
that source (must include the specific release as databases change).
3) If the data is from private sources then it must be released on a 
suitable license that can not be superseded by publication or change in 
ownership.

Also, the submitted data should not change even if there are errors. For 
example, Fisher's iris data at 
http://archive.ics.uci.edu/ml/datasets/Iris has  documented errors. 
Rather it would be better to use version numbers.

Regards
Bruce


From biopython at maubp.freeserve.co.uk  Tue Oct 28 15:04:21 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Oct 2008 15:04:21 +0000
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810200222s641e165eqef3b209893a8d976@mail.gmail.com>
References: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com>
	<C522096F.17F00%lpritc@scri.ac.uk>
	<320fb6e00810200222s641e165eqef3b209893a8d976@mail.gmail.com>
Message-ID: <320fb6e00810280804k1ef53ec1od53c33915da61c3@mail.gmail.com>

On 20th Oct I wrote:
> Of course, someone is still bound to try calling the [Seq object's]
> translate method with a string mapping.  Maybe we should add a
> bit of defensive code to check the table argument, and print a
> helpful error message when this happens?

I've just added that in CVS, if the table argument is a 256 character
string then a ValueError is raised suggesting using
str(my_seq).translate(...) instead.

Peter


From biopython at maubp.freeserve.co.uk  Tue Oct 28 17:17:36 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Oct 2008 17:17:36 +0000
Subject: [BioPython] Should Bio.SeqIO.write(...) return the number of
	records?
Message-ID: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com>

Dear all,

I wanted to get some feedback on a possible enhancement to the
Bio.SeqIO.write(...) and Bio.AlignIO.write(...) functions to make them
return number of records/alignments written to the handle.  I've filed
enhancement Bug 2628 to track this idea.
http://bugzilla.open-bio.org/show_bug.cgi?id=2628

When creating a sequence (or alignment) file, it is sometimes useful
to know how many records (or alignments) were written out.  This is
easy if your records are in a list:

records = list(...)
SeqIO.write(records, handle, format)
print "Wrote %i records" % len(records)

If however your records are from a generator/iterator (e.g. a
generator expression, or some other iterator) you cannot use
len(records).  You could turn this into a list just to count them, but
this wastes memory.  It would therefore be useful to have the count
returned:

records = some_generator
count = SeqIO.write(records, handle, format)
print "Wrote %i records" % count

Currently Bio.SeqIO.write(...) and Bio.AlignIO.write(...) have no
return value, so adding a return value would be a backwards compatible
enhancement.  For a precedent, the BioSQL loader returns the number of
records loaded into the database.

Peter


From sbassi at gmail.com  Tue Oct 28 17:43:27 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Tue, 28 Oct 2008 14:43:27 -0300
Subject: [BioPython] Should Bio.SeqIO.write(...) return the number of
	records?
In-Reply-To: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com>
References: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com>
Message-ID: <b43bf2080810281043r515e13aega1d609f3165e0b69@mail.gmail.com>

On Tue, Oct 28, 2008 at 2:17 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> count = SeqIO.write(records, handle, format)
> print "Wrote %i records" % count

I'm for it. It doesn't hurt adding a backward compatible feature.


From biopython at maubp.freeserve.co.uk  Tue Oct 28 18:16:58 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Oct 2008 18:16:58 +0000
Subject: [BioPython] Should Bio.SeqIO.write(...) return the number of
	records?
In-Reply-To: <b43bf2080810281043r515e13aega1d609f3165e0b69@mail.gmail.com>
References: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com>
	<b43bf2080810281043r515e13aega1d609f3165e0b69@mail.gmail.com>
Message-ID: <320fb6e00810281116u6460c62fs77ece727689fba3b@mail.gmail.com>

Sebastian Bassi wrote:
>
> I'm for it. It doesn't hurt adding a backward compatible feature.
>

Well adding an unused feature does increase the long term maintainence
load - but if we agree this does seem useful, that's fine.

Also settling on the record/alignment count as the return value
prevents any future alternative.  But right now I can't think of any
other sensible return value.

I've written a patch against CVS to implement this - see Bug 2628 for details.

Peter


From tiagoantao at gmail.com  Thu Oct 30 21:36:00 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 30 Oct 2008 21:36:00 +0000
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810251834q87495d5re558cf179356a8b0@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
	<5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com>
	<5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com>
	<320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com>
	<6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com>
	<5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com>
	<6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com>
	<5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com>
	<6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com>
	<6d941f120810251834q87495d5re558cf179356a8b0@mail.gmail.com>
Message-ID: <6d941f120810301436m4bf12385s99d726bb000f7dd4@mail.gmail.com>

Hi,

FYI, I am going to continue this discussion to biopython-dev, as I
think it makes more sense there. Especially the parts about
implementation suggestions.

On Sun, Oct 26, 2008 at 1:34 AM, Tiago Ant?o <tiagoantao at gmail.com> wrote:
> I just want add on an extra comment explaining why I oppose doing an
> individual object:
>
> I have the following questions (and others) in my mind, which I don't
> know the answer. I am not looking for answers to them, I am just
> trying to illustrate the difficulty of the problem.
>
> 1. For a certain marker, do we store the genomic position of the
> marker? Some (most) statistics don't use this information. For many
> species this information is not even available. But for some
> statistics this information is mandatory...
> 2. For a microsatellite do we store the motif and number of repeats or
> the whole sequence? (see 4)
> 3. If one is interested in SNPs and one has the full sequences does
> one store the full sequences or just the SNPs? If you store just the
> SNPs then you cannot do sequence based analysis in the future (say
> Tajima D). If you store everything then you are consuming memory and
> cpu.
> 4. If one just wants to do frequency statistics (Fst), do you store
> the marker or just the assign each one an ID and store the ID? It is
> much cheaper to store an ID than a full sequence.
>
> Populations
> 1. Support for landscape genetics? I mean geo-referentiation
> 2. Support for hierarchical population structure?
> 3. Do we cache statistics results on Population objects?
>
>
> Let me take your class marker:
> class Marker:
>  total_heterozygotes_count = 0
>  total_population_count = 0
>  total_Purines_count = 0 # this could be renamed, of course
>  total_Pyrimidines_count = 0
>
> How would this be useful for microsatellites? Why purines, and if my
> marker is a protein? If it is a SNP I want to know the nucleotide? And
> if I am studying proteins and I want to have the aminoacid?
>
> Dont take me wrong, I have done this path. To solve my particular
> problems is not very hard. To have a framework that is usable by
> everybody, it is a damn hard problem. And we dont really need to solve
> it (ok, it would be nice to do things to populations in general, that
> I agree). But the fundamental is: read file, calculate statistics.
> That doesnt need population and individual objects.
>
> If we end up having too many formats a consolidation step might be
> needed in the future (to avoid having 10 split_in_pops). That I agree.
>


-- 
"Data always beats theories. 'Look at data three times and then come
to a conclusion,' versus 'coming to a conclusion and searching for
some data.' The former will win every time."
?Matthew Simmons,
http://www.tiago.org


From pingou at pingoured.fr  Fri Oct 31 16:29:27 2008
From: pingou at pingoured.fr (Pierre-Yves)
Date: Fri, 31 Oct 2008 17:29:27 +0100
Subject: [BioPython] Sequence graph
Message-ID: <490B3267.5020501@pingoured.fr>

Dear list,

I am sorry to come here to ask this question that must have been already 
asked in the past, but my search have been rather unsuccessful...

I would like to reproduce such graph:
http://www.bioperl.org/wiki/HOWTO:Graphics#Improving_the_Image but even 
if bioperl is nice I would like to do it through BioPython.

I have thus two questions :
* Is that possible ?
* Could someone point me to an example ?

Thanks in advance for your help,

Best regards,

Pierre


From bsouthey at gmail.com  Wed Oct 22 21:02:18 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Wed, 22 Oct 2008 21:02:18 -0000
Subject: [BioPython] back-translation method for Seq object?
In-Reply-To: <320fb6e00810210859n1b922e7emd6a7456abd79cdc7@mail.gmail.com>
References: <C5221012.17F0D%lpritc@scri.ac.uk> <48FDE37B.5040301@gmail.com>	
	<320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com>
	<320fb6e00810210859n1b922e7emd6a7456abd79cdc7@mail.gmail.com>
Message-ID: <48FF951C.4030700@gmail.com>

Hi,
Some of the neat things about Python is how easy it is to modify your 
own code and adapt others code into yours.

So here is some code (under the BSD license)  that may be useful on 
this. This is a simple back or reverse translation code with many of the 
things that I have been 'talking' about. This should be self-contained 
and works on Linux system with Python2.3+. It is oriented around an 
peptide sequence 'AFLFQPQRFGR' but hopefully is more general (I have not 
tested that).

a) Convert an amino acid sequence into both a regular expression or DNA 
sequence involving ambiguous codes. There are functions to convert the 
regular expression or DNA sequence involving ambiguous codes back to a 
protein sequence since neither of these are standard.

b) Regular expression search on a list of sequences in fasta format.

c) Obtain all possible DNA sequences from an regular expression form of 
the amino acid sequence. Obviously this is very large as for the above 
sequence there are  442368 combinations (but Python is fairly quick... 
about 10 seconds on my opteron  270 system bogomips =3991.08)


Enjoy
Bruce
-------------- next part --------------
A non-text attachment was scrubbed...
Name: reverse_trans.py
Type: text/x-python
Size: 10661 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20081022/248001f3/attachment-0002.py>