From mjldehoon at yahoo.com Wed Oct 1 08:18:24 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 1 Oct 2008 05:18:24 -0700 (PDT) Subject: [BioPython] Bio.distance Message-ID: <924102.72843.qm@web62403.mail.re1.yahoo.com> Hi everybody, Since the 1.48 release, Biopython has been making good progress in the migration from Numerical Python to NumPy. As part of this process, we are now reviewing and consolidating the code in Biopython that makes use of Numerical Python / NumPy. Specifically, we are thinking to merge the code in Bio.distance into Bio.kNN, and to deprecate Bio.distance and Bio.cdistance. Since Bio.kNN is the only Biopython module in Biopython that makes use of Bio.distance, we think that this won't affect anybody. However, if you are using Bio.distance outside of Bio.kNN, please let us know so we can find an alternative solution. --Michiel. From bsouthey at gmail.com Wed Oct 1 11:49:53 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 01 Oct 2008 10:49:53 -0500 Subject: [BioPython] Bio.distance In-Reply-To: <924102.72843.qm@web62403.mail.re1.yahoo.com> References: <924102.72843.qm@web62403.mail.re1.yahoo.com> Message-ID: <48E39C21.8010603@gmail.com> Michiel de Hoon wrote: > Hi everybody, > > Since the 1.48 release, Biopython has been making good progress in the migration from Numerical Python to NumPy. As part of this process, we are now reviewing and consolidating the code in Biopython that makes use of Numerical Python / NumPy. Specifically, we are thinking to merge the code in Bio.distance into Bio.kNN, and to deprecate Bio.distance and Bio.cdistance. Since Bio.kNN is the only Biopython module in Biopython that makes use of Bio.distance, we think that this won't affect anybody. However, if you are using Bio.distance outside of Bio.kNN, please let us know so we can find an alternative solution. > > --Michiel. > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > Hi, Under the 'standard' install I do not think that there is any advantage of using Bio.cdistance within Bio.kNN. I tested this on a bioinformatics data set with almost 1500 data points, 8 explanatory variables and k=9. I only got a one second difference between using Bio.cdistance or commenting it out on my system (after removing the build directory and reinstalling everything). Actual maximum times across three runs were under 16.6 seconds with it and under 17.4 seconds without it. My system runs linux x86_64 (fedora 10) but it is not a 'clean' system due to other cpu intensive processes running. I used Python 2.5 and Numeric 2.4 as I forgot the order of imports. In my version the default distance without Bio.cdistance uses the Numeric dot (I did not try the python version) so I would expect this to be noticeably faster if lapack or atlas are installed than if these are not present. (I used Fedora supplied Numeric so while I think this timing is without lapack and atlas I am not completely sure of that.) I did not see an examples for k-nearest neighbor so below is (very bad) code using the logistic regression example (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html). 
Regards Bruce

from Bio import kNN

xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30],
      [11, -220.94], [85, -193.94], [16, -182.71], [15, -180.41],
      [-26, -181.73], [58, -259.87], [126, -414.53], [191, -249.57],
      [113, -265.28], [145, -312.99], [154, -213.83], [147, -380.85],
      [93, -291.13]]
ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
model = kNN.train(xs, ys, 3)
ccr = 0
tobs = 0
for px, py in zip(xs, ys):
    cp = kNN.classify(model, px)
    tobs += 1
    if cp == py:
        ccr += 1
print tobs, ccr

From biopython at maubp.freeserve.co.uk Wed Oct 1 11:52:05 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 1 Oct 2008 16:52:05 +0100 Subject: [BioPython] More string methods for the Seq object In-Reply-To: <320fb6e00809290506p8aa2b51p4901b693ebb268bf@mail.gmail.com> References: <320fb6e00809260859r23c7915buc114c5c0b71e195@mail.gmail.com> <48DD2DE6.10908@gmail.com> <320fb6e00809261422n6e4c4889p734508613898cc3f@mail.gmail.com> <48DD59DF.1000504@gmail.com> <320fb6e00809261457j65dc0876hd59d17aee01bc983@mail.gmail.com> <320fb6e00809270557n73b81b5ayb93fe85f0f466626@mail.gmail.com> <320fb6e00809290450m6fedbaacu15a75107e5c39658@mail.gmail.com> <320fb6e00809290506p8aa2b51p4901b693ebb268bf@mail.gmail.com> Message-ID: <320fb6e00810010852j5cf8e3ak7dc788372568251f@mail.gmail.com> On Mon, Sep 29, 2008 at 1:06 PM, Peter wrote: >> I assume you [Bruce] are agreeing with ... follow[ing] the >> string defaults of white space for stripping or splitting (for >> consistency, even though this won't typically be useful for >> sequences). On balance this would probably be best from >> a principle of consistency and least surprise for the user - >> I'll update the patches. > > New patch for Seq object split, strip, lstrip and rstrip methods on > Bug 2596 which follows the python string defaults (splitting on or > stripping of white space). > http://bugzilla.open-bio.org/show_bug.cgi?id=2596 There is now a second version of this patch on that bug, which will also accept Seq objects as arguments to the split, strip, lstrip and rstrip methods, plus has the start of some tests too. We (Peter, Martin, Bruce and Leighton) seem to have reached an agreement about adding split, strip, lstrip and rstrip methods to the Seq object with the behaviour (arguments and defaults) to follow those of the python string as closely as possible. I'd like to encourage others lurking on the list to comment too, but unless anyone objects, I intend to add these methods in CVS this week, together with an updated unit test and updates to the tutorial. Peter From biopython at maubp.freeserve.co.uk Wed Oct 1 12:03:22 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 1 Oct 2008 17:03:22 +0100 Subject: [BioPython] Bio.distance In-Reply-To: <48E39C21.8010603@gmail.com> References: <924102.72843.qm@web62403.mail.re1.yahoo.com> <48E39C21.8010603@gmail.com> Message-ID: <320fb6e00810010903u253c6384ld401e1a771ee141e@mail.gmail.com> On Wed, Oct 1, 2008 at 4:49 PM, Bruce Southey wrote: > > Hi, > Under the 'standard' install I do not think that there is any advantage of > using Bio.cdistance within Bio.kNN. I tested this on a bioinformatics data > set with almost 1500 data points, 8 explanatory variables and k=9. ...
> Actual maximum times across three runs were under 16.6 seconds with > it [Bio.cdistance] and under 17.4 seconds without it [Bio.distance using > Numeric] Its interesting that the C version is only slightly faster than Numeric - of course as you point out there are lots of possible complications here like lapack and atlas (plus compiler options and CPU features). I think your numbers are good support for Michiel's proposition that we should deprecate Bio.cdistance and Bio.distance and just use numpy in Bio.kNN - this will simplify our code base and make very little difference to the speed. Peter From biopython at maubp.freeserve.co.uk Wed Oct 1 12:17:10 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 1 Oct 2008 17:17:10 +0100 Subject: [BioPython] Bio.kNN documentation Message-ID: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com> Bruce wrote: > I tested this [Bio.kNN] on a bioinformatics data set with almost 1500 > data points, 8 explanatory variables and k=9. ... Do you think this larger example could be adapted into something for the Biopython documentation? Otherwise the next bit of code looks interesting. > I did not see an examples for k-nearest neighbor so below is (very bad) > code using the logistic regression example > (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html). This is a set of Bacillus subtilis gene pairs for which the operon structure is known, with the intergene distance and gene expression score as explanatory variables, with the class being same operon or different operons. > from Bio import kNN > xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30], [11, > -220.94], [85, -193.94], [16, -182.71], [15, -180.41], [-26, -181.73], [58, > -259.87], [126, -414.53], [191, -249.57], [113, -265.28], [145, -312.99], > [154, -213.83], [147, -380.85], [93, -291.13]] > ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0] > model = kNN.train(xs, ys, 3) > ccr=0 > tobs=0 > for px, py in zip(xs, ys): > cp=kNN.classify(model, px) > tobs +=1 > if cp==py: > ccr +=1 > print tobs, ccr Could you expand on the cryptic variable names? ccr = correct call rate? tobs = total observations? Coupled with a scatter plot (say with pylab, showing the two classes in different colours), this could be turned into a nice little example for the cookbook section of the tutorial. Notice that later on in the logistic regression example there is a second table of "test data" which could be used to make de novo predictions. Thanks, Peter From bsouthey at gmail.com Wed Oct 1 14:40:41 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 01 Oct 2008 13:40:41 -0500 Subject: [BioPython] Bio.kNN documentation In-Reply-To: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com> References: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com> Message-ID: <48E3C429.1020004@gmail.com> Peter wrote: > Bruce wrote: > >> I tested this [Bio.kNN] on a bioinformatics data set with almost 1500 >> data points, 8 explanatory variables and k=9. ... >> > > Do you think this larger example could be adapted into something for > the Biopython documentation? Otherwise the next bit of code looks > interesting. > > >> I did not see an examples for k-nearest neighbor so below is (very bad) >> code using the logistic regression example >> (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html). 
>> > > This is a set of Bacillus subtilis gene pairs for which the operon > structure is known, with the intergene distance and gene expression > score as explanatory variables, with the class being same operon or > different operons. > > >> from Bio import kNN >> xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30], [11, >> -220.94], [85, -193.94], [16, -182.71], [15, -180.41], [-26, -181.73], [58, >> -259.87], [126, -414.53], [191, -249.57], [113, -265.28], [145, -312.99], >> [154, -213.83], [147, -380.85], [93, -291.13]] >> ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0] >> model = kNN.train(xs, ys, 3) >> ccr=0 >> tobs=0 >> for px, py in zip(xs, ys): >> cp=kNN.classify(model, px) >> tobs +=1 >> if cp==py: >> ccr +=1 >> print tobs, ccr >> > > Could you expand on the cryptic variable names? ccr = correct call > rate? tobs = total observations? > > Coupled with a scatter plot (say with pylab, showing the two classes > in different colours), this could be turned into a nice little example > for the cookbook section of the tutorial. Notice that later on in the > logistic regression example there is a second table of "test data" > which could be used to make de novo predictions. > > Thanks, > > Peter > > I did realize that this was coming... :-) (I guess I am volunteering myself to provide some material on machine learning with BioPython. So this is a start.) I wanted something quick and dirty to output for testing, so tobs is the total number of observations and ccr is number of correctly classified points - I was to lazy to divide it by tobs to get the correct classification rate. Here is an more extended sample code that also uses logistic regression. (Python is so great to with here!) I don't have plotting packages installed but someone could add the plots. Regards Bruce -------------- next part -------------- A non-text attachment was scrubbed... Name: knn_lr_example.py Type: text/x-python Size: 3257 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Wed Oct 1 17:40:55 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 1 Oct 2008 22:40:55 +0100 Subject: [BioPython] problem installing mxTextTools In-Reply-To: <48896815.10104@berkeley.edu> References: <4889645B.9080400@berkeley.edu> <48896815.10104@berkeley.edu> Message-ID: <320fb6e00810011440v4bd80263hf3830d8d9f548e63@mail.gmail.com> On Fri, Jul 25, 2008 at 6:43 AM, Nick Matzke wrote: > Hi all, > > An update -- I found a solution by copying the .pck file the download > actually gave me to the filename that the install was apparently looking > for. This was not exactly obvious (!!!!) but apparently it worked: > ... > >>> print now() > 2008-07-24 22:39:17.66 > Was this an old email you accidently forwarded to the list? For the next release of Biopython the only bits of code still using mxTextTools have been deprecated, so the Biopython setup won't even look for mxTextTools at all. Right now with Biopython 1.48 you can just install without mxTextTools (as the setup.py prompt should make clear). 
Peter From biopython at maubp.freeserve.co.uk Wed Oct 1 17:44:34 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 1 Oct 2008 22:44:34 +0100 Subject: [BioPython] problem installing mxTextTools In-Reply-To: <320fb6e00810011440v4bd80263hf3830d8d9f548e63@mail.gmail.com> References: <4889645B.9080400@berkeley.edu> <48896815.10104@berkeley.edu> <320fb6e00810011440v4bd80263hf3830d8d9f548e63@mail.gmail.com> Message-ID: <320fb6e00810011444u7e5bf37fh2801c1980bd38a2a@mail.gmail.com> On Wed, Oct 1, 2008 at 10:40 PM, Peter wrote: > On Fri, Jul 25, 2008 at 6:43 AM, Nick Matzke wrote: >> Hi all, >> >> An update -- I found a solution by copying the .pck file the download >> actually gave me to the filename that the install was apparently looking >> for. This was not exactly obvious (!!!!) but apparently it worked: >> ... >> >>> print now() >> 2008-07-24 22:39:17.66 >> > > Was this an old email you accidently forwarded to the list? Sorry about this Nick & everyone else - it was a mistake at my end. It looks like a glitch (perhaps in GoogleMail itself?) marked this old thread as unread and bumped it to the top of my to read list. Odd, but I didn't notice until after sending my confused reply. Peter From kteague at bcgsc.ca Wed Oct 1 17:53:44 2008 From: kteague at bcgsc.ca (Kevin Teague) Date: Wed, 1 Oct 2008 14:53:44 -0700 Subject: [BioPython] development question References: <48B5BD98.8050101@heckler-koch.cz><48B65C9B.4000407@heckler-koch.cz> <20080828090431.GD5801@inb.uni-luebeck.de> Message-ID: <36BEEFA2DF192944BF71E072F7A5F4656043D6@xchange1.phage.bcgsc.ca> On Thu, Aug 28, 2008 at 10:06:51AM +0200, Pavel SRB wrote: > so now to biopython. On my system i have biopython from debian repository > via apt-get. But i would like to have second version of biopython in system > just to check, log and change the code to learn more. This can be done with > removing sys.path.remove("/var/lib/python-support/python2.5") > and importing Bio from some other development directory. But this way i > loose all modules in direcotory mentioned above and i believe it can be > done more clearly You might want to check out VirtualEnv: http://pypi.python.org/pypi/virtualenv This tool will let you "clone" your system Python, so that you have your own isolated [virtualpythonname]/bin and [virtualpythonname/lib/python/site-packages/ directories. If you create a virtualenv with the --no-site-packages, then the /var/lib/python-support/python2.5/ location will be not be in the created virtual python's sys.path. Otherwise by default this location will be included, but your own isolated [virtualpythonname/lib/python/site-packages/ location will have precendence on sys.path, so if you install a newer BioPython into there it will get imported instead of the system one. You can of course do all of this by manually fiddling with sys.path, but VirtualEnv just wraps up a few of these common practices into one handy tool - great for experimentation or trying out different packages. From lunt at ctbp.ucsd.edu Sat Oct 4 17:50:33 2008 From: lunt at ctbp.ucsd.edu (Bryan Lunt) Date: Sat, 4 Oct 2008 14:50:33 -0700 Subject: [BioPython] Copy Constructors for Bio.Seq.Seq? In-Reply-To: References: Message-ID: Greetings All! 
I would like to make the following humble suggestion: A copy-constructor for Bio.Seq.Seq would be helpful, currently it seems that calling Bio.Align.Generic.Alignment.add_sequence on a Seq object breaks because it tries to initialize a new Seq object on whatever data you provided, and there is no copy-constructor, nor does Bio.Align.Generic.Alignment.add_sequence handled just adding a Seq object directly. Thanks for considering this, I think this addition will help make client-code cleaner. -Bryan Lunt From biopython at maubp.freeserve.co.uk Sun Oct 5 07:06:57 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 5 Oct 2008 12:06:57 +0100 Subject: [BioPython] Copy Constructors for Bio.Seq.Seq? In-Reply-To: References: Message-ID: <320fb6e00810050406t41d25043oe7011745055a1fc7@mail.gmail.com> On Sat, Oct 4, 2008 at 10:50 PM, Bryan Lunt wrote: > Greetings All! > I would like to make the following humble suggestion: > A copy-constructor for Bio.Seq.Seq would be helpful, ... You can use the string idiom of my_seq[:] to make a copy of a Seq object. > currently it > seems that calling Bio.Align.Generic.Alignment.add_sequence on a > Seq object breaks because it tries to initialize a new Seq object on > whatever data you provided, and there is no copy-constructor, nor does > Bio.Align.Generic.Alignment.add_sequence handled just adding a Seq > object directly. Yes, the Bio.Align.Generic.Alignment.add_sequence() method currently expects a string (which its docstring is fairly clear about), and giving it a Seq does fail. I suppose allowing it to take a Seq object would be sensible (with a check on the alphabet being compatible with that declared for the alignment). We have been debating making the generic Alignment a little more list like, by allowing .append() or .extend() for use with SeqRecord objects (Bug 2553). http://bugzilla.open-bio.org/show_bug.cgi?id=2553 > Thanks for considering this, I think this addition will help make > client-code cleaner. Would the SeqRecord append/extend idea suit you just as well? Peter From biopython at maubp.freeserve.co.uk Sun Oct 5 08:16:28 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 5 Oct 2008 13:16:28 +0100 Subject: [BioPython] Migrating from Numerical Python to numpy In-Reply-To: <623262.17729.qm@web62407.mail.re1.yahoo.com> References: <623262.17729.qm@web62407.mail.re1.yahoo.com> Message-ID: <320fb6e00810050516i20822ebcwf15cd058af0c9759@mail.gmail.com> On Sat, Sep 20, 2008 at 4:02 AM, Michiel de Hoon wrote: > Dear all, > > As you probably are well aware, Biopython releases to date have used > the now obsolete Numeric python library. This is no longer being > maintained and has been superseded by the numpy library. See > http://www.scipy.org/History_of_SciPy for more about details on the > history of numerical python. Biopython 1.48 should be the last > Numeric only release of Biopython - we have already started moving to > numpy in CVS. > > Supporting both Numeric and numpy ought to be fairly straightforward > for the pure python modules in Biopython. However, we also have C code > which must interact with Numeric/numpy, and trying to support both > would be harder. > > Would anyone be inconvenienced if the next release of Biopython > supported numpy ONLY (dropping support for Numeric)? If so please > speak up now - either here or on the development mailing list. > Otherwise, a simple switch from Numeric to numpy will probably be the > most straightforward migration plan. 
No one has objected, and a simple switch from Numeric to numpy is underway in CVS. The next release of Biopython will suport numpy only (dropping support for Numeric). As an aside, from my own testing Biopython CVS looks happy with numpy 1.0, 1.1 and the just released 1.2 (although if we have missed any deprecation warnings please let us know). For preparing Windows installers for Biopython, it might be helpful to know what version of numpy most Windows users (will) have installed (this is important due to numpy C API changes between versions). Thanks, Peter From biopython at maubp.freeserve.co.uk Mon Oct 6 06:39:15 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 6 Oct 2008 11:39:15 +0100 Subject: [BioPython] Bio.kNN documentation In-Reply-To: <48E3C429.1020004@gmail.com> References: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com> <48E3C429.1020004@gmail.com> Message-ID: <320fb6e00810060339t427d4c5dme2690fbc36b30c81@mail.gmail.com> Bruce wrote: >>> I did not see an examples for k-nearest neighbor so below is >>> (very bad) code using the logistic regression example >>> (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html). Peter wrote: >> This is a set of Bacillus subtilis gene pairs for which the operon >> structure is known, with the intergene distance and gene expression >> score as explanatory variables, with the class being same operon or >> different operons. >> ... >> Coupled with a scatter plot (say with pylab, showing the two classes >> in different colours), this could be turned into a nice little example >> for the cookbook section of the tutorial. Notice that later on in the >> logistic regression example there is a second table of "test data" >> which could be used to make de novo predictions. Bruce wrote: > I did realize that this was coming... :-) > (I guess I am volunteering myself to provide some material on > machine learning with BioPython. So this is a start.) Michiel has suggested adding a whole chapter to the tutorial about supervised learning, presumably incorporating his logistic regression example as part of this. Have a look at thread "Bio.MarkovModel; Bio.Popgen, Bio.PDB documentation" on the dev mailing list. I'm sure you can contribute (even if just by proof reading). Peter From fkauff at biologie.uni-kl.de Tue Oct 7 04:02:12 2008 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Tue, 07 Oct 2008 10:02:12 +0200 Subject: [BioPython] Creating and traversing an ultrametric tree In-Reply-To: <320fb6e00809241326i16a337das844f4ac74766b459@mail.gmail.com> References: <73045cca0809231713v219c3ec3tfc24461c7af6b453@mail.gmail.com> <320fb6e00809240200y144500cbl86f9023cb868da89@mail.gmail.com> <73045cca0809241132x30bc4d63t7ac0b9967a20e76c@mail.gmail.com> <320fb6e00809241326i16a337das844f4ac74766b459@mail.gmail.com> Message-ID: <48EB1784.50803@biologie.uni-kl.de> Peter wrote: > On Wed, Sep 24, 2008 at 7:32 PM, aditya shukla > wrote: > >> Hello Peter , >> >> Thanks for the reply , >> I have attached a file with of the kind of data that i wanna parse. >> I tried using Thomas Mailund's Newick tree parser but this dosen't >> seem to work , so is there any other module that can help? 
>> > > Your file looks like this (in case anyone on the mailing list recognises it), > > /T_0_size=105((-bin-ulockmgr_server:0.99[&&NHX:C=0.195.0], > (((-bin-hostname:0.00[&&NHX:C=200.0.0], > (-bin-dnsdomainname:0.00[&&NHX:C=200.0.0], > ...):0.99):0.99):0.99):0.99); > > [with a large chunk removed, and new lines inserted] > > I'm guessing this is some kind of computer system profile - nothing to > do with bioinformatics. > > I'm not 100% sure this is Newick format - it might be worth trying to > parse everything after the "/T_0_size=105" text which looks out of > place to me. > > If it is a valid Newick format tree file, then it is using named > internal nodes which is something Biopython can't currently parse (see > Bug 2543, http://bugzilla.open-bio.org/show_bug.cgi?id=2543 ). So I > don't think you can use the Bio.Nexus module in Biopython to read this > tree. > > Nexus.Trees has been extended to deal with internal node names, or "special comments" in the format [& blablalba]. Such comments comments can appear directly after the taxon label, after the closing parentheses, or between branchlength / support values attached to a node or a taxon labels, such as (a,(b,(c,d)[&hi there])) (a,(b[&hi there],c)) (a,(b:0.123[&hi there],c[&heyho]:0.3)) (a,(b,c)0.4[&comment]:0.95) The comments are stored without change in the corresponding node object and can be accessed like >>> t=Trees.Tree('(a,(b:0.123[&hi there],c[&heyho]:0.3))') >>> print t.node(3).data.comment [&hi there] >>> print t.node(4).data.comment [&heyho] >>> The comments are not parsed in any way - internal labels vary greatly in syntax, and are used to store all kinds of information. But at least they are now parsed and stored, and users can deal with them in any way they like. Frank > The only other python package I can suggest you try is NetworkX, > https://networkx.lanl.gov/wiki > > Good luck, > > Peter > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From mjldehoon at yahoo.com Tue Oct 7 19:10:12 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 7 Oct 2008 16:10:12 -0700 (PDT) Subject: [BioPython] Bio.kNN documentation In-Reply-To: <320fb6e00810060339t427d4c5dme2690fbc36b30c81@mail.gmail.com> Message-ID: <381879.37032.qm@web62403.mail.re1.yahoo.com> > Bruce wrote: > > (I guess I am volunteering myself to provide some > material on > > machine learning with BioPython. So this is a start.) > > Michiel has suggested adding a whole chapter to the > tutorial about > supervised learning, presumably incorporating his logistic > regression > example as part of this. Have a look at thread > "Bio.MarkovModel; > Bio.Popgen, Bio.PDB documentation" on the dev mailing > list. I'm sure > you can contribute (even if just by proof reading). Some more documentation on machine learning would definitely be useful. Recently I started a chapter on supervised learning methods in the tutorial. Right now it only covers logistic regression, but it should also include Bio.MarkovModel, Bio.MaxEntropy, Bio.NaiveBayes, and Bio.kNN. If you are planning to write some documentation on any of these, please let us know so we can avoid duplicated efforts. The new tutorial is in CVS; I put a copy of the HTML output of the latest version at http://biopython.org/DIST/docs/tutorial/Tutorial.new.html. Thanks! 
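As one possible seed for that chapter, a minimal sketch of the operon example from earlier in the thread with the abbreviated names written out - the data and the Bio.kNN train/classify calls are exactly as Bruce posted them, while the descriptive variable names and the final rate calculation are only illustrative:

from Bio import kNN

# intergene distance and gene expression score for each B. subtilis gene pair
features = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30],
            [11, -220.94], [85, -193.94], [16, -182.71], [15, -180.41],
            [-26, -181.73], [58, -259.87], [126, -414.53], [191, -249.57],
            [113, -265.28], [145, -312.99], [154, -213.83], [147, -380.85],
            [93, -291.13]]
# operon class labels, as in the logistic regression example
classes = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

model = kNN.train(features, classes, 3)  # k = 3 nearest neighbours

total_observations = 0
correctly_classified = 0
for observation, true_class in zip(features, classes):
    predicted_class = kNN.classify(model, observation)
    total_observations += 1
    if predicted_class == true_class:
        correctly_classified += 1

print "Correct classification rate: %.2f" % (float(correctly_classified) / total_observations)

As in Bruce's version this simply re-classifies the training points; the "test data" table from the logistic regression example could be used for de novo predictions instead.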
--Michiel From bsouthey at gmail.com Tue Oct 7 21:35:51 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 7 Oct 2008 20:35:51 -0500 Subject: [BioPython] Bio.kNN documentation In-Reply-To: <381879.37032.qm@web62403.mail.re1.yahoo.com> References: <320fb6e00810060339t427d4c5dme2690fbc36b30c81@mail.gmail.com> <381879.37032.qm@web62403.mail.re1.yahoo.com> Message-ID: On Tue, Oct 7, 2008 at 6:10 PM, Michiel de Hoon wrote: >> Bruce wrote: >> > (I guess I am volunteering myself to provide some >> material on >> > machine learning with BioPython. So this is a start.) >> >> Michiel has suggested adding a whole chapter to the >> tutorial about >> supervised learning, presumably incorporating his logistic >> regression >> example as part of this. Have a look at thread >> "Bio.MarkovModel; >> Bio.Popgen, Bio.PDB documentation" on the dev mailing >> list. I'm sure >> you can contribute (even if just by proof reading). > > Some more documentation on machine learning would definitely be useful. Recently I started a chapter on supervised learning methods in the tutorial. Right now it only covers logistic regression, but it should also include Bio.MarkovModel, Bio.MaxEntropy, Bio.NaiveBayes, and Bio.kNN. If you are planning to write some documentation on any of these, please let us know so we can avoid duplicated efforts. The new tutorial is in CVS; I put a copy of the HTML output of the latest version at > http://biopython.org/DIST/docs/tutorial/Tutorial.new.html. > > Thanks! > > --Michiel > Hi, I have not given it too much thought at present but this reflects some of the work I have been doing or involved with. I do not know enough about Bio.MarkovModel, Bio.MaxEntropy and Bio.NaiveBayes to really help. But I did think to start with trying to extend the supervised learning material to be more general. One aspect is to provide working code using different methodologies for different examples. Regards Bruce From stephan80 at mac.com Wed Oct 8 07:33:51 2008 From: stephan80 at mac.com (Stephan) Date: Wed, 08 Oct 2008 13:33:51 +0200 Subject: [BioPython] Entrez.efetch Message-ID: <75573950382669954948356356615157751492-Webmail2@me.com> Hi, I am using biopython for a week or so. The package is amazing, I wonder how I possibly ignored this for so long now. Since I am not only new to biopython I am also new in this mailing list, so forgive me if this is not the right forum for a question like this. Anyway, here is a weird little problem with the Bio.Entrez.efetch tool: (I use python 2.5 and the latest Biopython 1.48) I want to run the following little test-code, using efetch to get chromosome 4 of Drosophila melanogaster as a genbank-file:

---------------------------CODE------------------------------------
from Bio import Entrez, SeqIO

print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"]
handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
print "downloading to SeqRecord..."
record = SeqIO.read(handle, "genbank")
print "...done"

handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
filehandle = open("NCBI_DroMel", "w")
print "downloading to file..."
filehandle.write(handle.read())
print "...done"

handle = open("NCBI_DroMel")
print "reading from file..."
record = SeqIO.read(handle, "genbank")
---------------------------END-CODE------------------------------------

In the last line we have a crash, see the output of the code:

---------------------------OUTPUT------------------------------------
Drosophila melanogaster chromosome 4, complete sequence
downloading to SeqRecord...
...done
downloading to file...
...done
reading chr2L from file...
Traceback (most recent call last):
  File "efetch-test.py", line 17, in
    record = SeqIO.read(handle, "genbank")
  File "HOME/lib/python/Bio/SeqIO/__init__.py", line 366, in read
    first = iterator.next()
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 410, in parse_records
    record = self.parse(handle)
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 393, in parse
    if self.feed(handle, consumer) :
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 370, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 723, in parse_footer
    raise ValueError("Premature end of file in sequence data")
ValueError: Premature end of file in sequence data
---------------------------END-OUTPUT------------------------------------

It seems that downloading the file to disk will corrupt the genbank file, while downloading directly into Biopython's SeqIO.read() function works properly. I don't get it! When I download this chromosome manually from the NCBI-website, I indeed find a difference in one line, namely in line 3 of the genbank file. In the manually downloaded file line 3 reads: "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced from my code I have only: "ACCESSION NC_004353". So without that region-information, the biopython parser of course runs to a premature end. I'd rather use the cPickle module now to save the whole SeqRecord instance. That works fine, so I don't need an immediate solution for the above posted problem, but I thought it might be interesting maybe... Any hints? Regards, Stephan From chapmanb at 50mail.com Wed Oct 8 08:35:33 2008 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Oct 2008 08:35:33 -0400 Subject: [BioPython] Entrez.efetch In-Reply-To: <75573950382669954948356356615157751492-Webmail2@me.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> Message-ID: <20081008123533.GE57379@sobchak.mgh.harvard.edu> Hi Stephan; > It seems that downloading the file to disk will corrupt the genbank > file, while downloading directly into Biopython's SeqIO.read() function > works properly. I don't get it! > > When I download this chromosome manually from the NCBI-website, > I indeed find a difference in one line, namely in line 3 of the > genbank file. In the manually downloaded file line 3 reads: > "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced > from my code I have only: "ACCESSION NC_004353". So without that > region-information, the biopython parser of course runs to a premature > end. This is a tricky problem that I ran into as well and is fixed in the latest CVS version. The issue is that the Biopython reader is using an UndoHandle instead of a standard python handle. By default some of these operations appear to be assuming an iterator, but UndoHandle did not provide this. As a result, you can lose the first couple of lines which are previously examined to determine the filetype. The fix is to make this a proper iterator.
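As a rough, generic illustration of what "a proper iterator" means here, consider a hypothetical handle wrapper with a push-back buffer - this is not the actual Bio.File.UndoHandle code (the real change is in the CVS diff linked below), just a sketch of the idea:

class PushbackHandle:
    # Hypothetical wrapper, for illustration only - not Bio.File.UndoHandle.
    def __init__(self, handle):
        self._handle = handle
        self._saved = []  # lines pushed back for re-reading

    def saveline(self, line):
        # remember a line (e.g. one peeked at to guess the file type)
        self._saved.insert(0, line)

    def readline(self):
        if self._saved:
            return self._saved.pop(0)
        return self._handle.readline()

    def __iter__(self):
        return self

    def next(self):
        # lets "for line in handle" go through readline(), so any
        # saved lines are returned first rather than being lost
        line = self.readline()
        if not line:
            raise StopIteration
        return line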
You can either check out current CVS, or make the addition manually to Bio/File.py in your current version: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython Hope this helps, Brad From biopython at maubp.freeserve.co.uk Wed Oct 8 09:37:24 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 14:37:24 +0100 Subject: [BioPython] Entrez.efetch In-Reply-To: <75573950382669954948356356615157751492-Webmail2@me.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> Message-ID: <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com> On Wed, Oct 8, 2008 at 12:33 PM, Stephan wrote: > Hi, > > I am using biopython for a week or so. The package is amazing, I wonder how I possibly ignored this for so long now. > Since I am not only new to biopython I am also new in this mailing list, so forgive me if this is not the right forum for a question like this. > > Anyway, here is a weird little problem with the Bio.Entrez.efetch tool: > (I use python 2.5 and the latest Biopython 1.48) > I want to run the following little test-code, using etetch to get chromosome 4 of Drosophila melanogaster as a genbank-file: > > ---------------------------CODE------------------------------------ > from Bio import Entrez, SeqIO > > print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"] > handle = Entrez.efetch(db="genome", id="56", rettype="genbank") > print "downloading to SeqRecord..." > record = SeqIO.read(handle, "genbank") > print "...done" I assume this is just test code - as it would be silly to download the GenBank file twice in a real script. > handle = Entrez.efetch(db="genome", id="56", rettype="genbank") > filehandle = open("NCBI_DroMel", "w") > print "downloading to file..." > filehandle.write(handle.read()) You should now close the file, which should ensure it is fully written to disk: filehandle.close() > print "...done" > > handle = open("NCBI_DroMel") > print "reading from file..." > record = SeqIO.read(handle, "genbank") > ---------------------------END-CODE------------------------------------ > > In the last line we have a crash, > ... > ValueError: Premature end of file in sequence data This is because you started reading in the file without finishing writing to it - the parser could only read in part of the data, and is complaining about it ending prematurely. Peter From p.j.a.cock at googlemail.com Wed Oct 8 09:46:25 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 8 Oct 2008 14:46:25 +0100 Subject: [BioPython] Entrez.efetch In-Reply-To: <20081008123533.GE57379@sobchak.mgh.harvard.edu> References: <75573950382669954948356356615157751492-Webmail2@me.com> <20081008123533.GE57379@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> Stephan wrote: >> When I download this chromosome manually from the NCBI-website, >> I indeed find a difference in one line, namely in line 3 of the >> genbank file. In the manually downloaded file line 3 reads: >> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced >> from my code I have only: "ACCESSION NC_004353". So without that >> region-information, the biopython parser of course runs to a premature >> end. Stephan - when you say manually, do you mean via a web browser? If so it is likely to be using a subtly different URL, which might explain the NCBI generating slightly different data on the fly. 
Either way, this ACCESSION line difference shouldn't trigger the "Premature end of file in sequence data" error in the GenBank parser. On Wed, Oct 8, 2008 at 1:35 PM, Brad Chapman wrote: > This is a tricky problem that I ran into as well and is fixed in the > latest CVS version. The issue is that the Biopython reader is using an > UndoHandle instead of a standard python handle. By default some of these > operations appear to be assuming an iterator, but UndoHandle did not > provide this. Brad, I'm pretty sure the GenBank parser is NOT using the UndoHandle. Just adding the close made Stephan's example work for me. What exactly was the problem you ran into (one of the other parsers perhaps?). > As a result, you can lose the first couple of lines which are > previously examined to determine the filetype. The fix is to make > this a proper iterator. You can either check out current CVS, or > make the addition manually to Bio/File.py in your current version: > > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython Adding this to the UndoHandle seems a sensible improvement - but I don't see how it can affect Stephan's script. Peter From p.j.a.cock at googlemail.com Wed Oct 8 09:46:25 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 8 Oct 2008 14:46:25 +0100 Subject: [BioPython] Entrez.efetch In-Reply-To: <20081008123533.GE57379@sobchak.mgh.harvard.edu> References: <75573950382669954948356356615157751492-Webmail2@me.com> <20081008123533.GE57379@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> Stephan wrote: >> When I download this chromosome manually from the NCBI-website, >> I indeed find a difference in one line, namely in line 3 of the >> genbank file. In the manually downloaded file line 3 reads: >> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced >> from my code I have only: "ACCESSION NC_004353". So without that >> region-information, the biopython parser of course runs to a premature >> end. Stephan - when you say manually, do you mean via a web browser? If so it is likely to be using a subtly different URL, which might explain the NCBI generating slightly different data on the fly. Either way, this ACCESSION line difference shouldn't trigger the "Premature end of file in sequence data" error in the GenBank parser. On Wed, Oct 8, 2008 at 1:35 PM, Brad Chapman wrote: > This is a tricky problem that I ran into as well and is fixed in the > latest CVS version. The issue is that the Biopython reader is using an > UndoHandle instead of a standard python handle. By default some of these > operations appear to be assuming an iterator, but UndoHandle did not > provide this. Brad, I'm pretty sure the GenBank parser is NOT using the UndoHandle. Just adding the close made Stephan's example work for me. What exactly was the problem you ran into (one of the other parsers perhaps?). > As a result, you can lose the first couple of lines which are > previously examined to determine the filetype. The fix is to make > this a proper iterator. You can either check out current CVS, or > make the addition manually to Bio/File.py in your current version: > > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython Adding this to the UndoHandle seems a sensible improvement - but I don't see how it can affect Stephan's script. 
Peter From stephan80 at mac.com Wed Oct 8 09:48:25 2008 From: stephan80 at mac.com (Stephan) Date: Wed, 08 Oct 2008 15:48:25 +0200 Subject: [BioPython] Entrez.efetch In-Reply-To: <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com> Message-ID: <128043477953580677661042463273686413408-Webmail2@me.com> Hi guys, OK, there is two different problems here that Brad and Peter independently pointed out to me. Peter, you are right that not closing the file actually caused the error. Your hint fixes that, thanks. But that doesnt fix that there is a part of line 3 missing over the download, and although I actually updated to the newest cvs-version of biopython as Brad suggested (sorry for accidently putting my answer not on the mailing-list) that does not fix that line... Best, Stephan Am Mittwoch 08 Oktober 2008 um 03:37PM schrieb "Peter" : >On Wed, Oct 8, 2008 at 12:33 PM, Stephan wrote: >> Hi, >> >> I am using biopython for a week or so. The package is amazing, I wonder how I possibly ignored this for so long now. >> Since I am not only new to biopython I am also new in this mailing list, so forgive me if this is not the right forum for a question like this. >> >> Anyway, here is a weird little problem with the Bio.Entrez.efetch tool: >> (I use python 2.5 and the latest Biopython 1.48) >> I want to run the following little test-code, using etetch to get chromosome 4 of Drosophila melanogaster as a genbank-file: >> >> ---------------------------CODE------------------------------------ >> from Bio import Entrez, SeqIO >> >> print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"] >> handle = Entrez.efetch(db="genome", id="56", rettype="genbank") >> print "downloading to SeqRecord..." >> record = SeqIO.read(handle, "genbank") >> print "...done" > >I assume this is just test code - as it would be silly to download the >GenBank file twice in a real script. > >> handle = Entrez.efetch(db="genome", id="56", rettype="genbank") >> filehandle = open("NCBI_DroMel", "w") >> print "downloading to file..." >> filehandle.write(handle.read()) > >You should now close the file, which should ensure it is fully written to disk: >filehandle.close() > >> print "...done" >> >> handle = open("NCBI_DroMel") >> print "reading from file..." >> record = SeqIO.read(handle, "genbank") >> ---------------------------END-CODE------------------------------------ >> >> In the last line we have a crash, >> ... >> ValueError: Premature end of file in sequence data > >This is because you started reading in the file without finishing >writing to it - the parser could only read in part of the data, and is >complaining about it ending prematurely. > >Peter > > From stephan80 at mac.com Wed Oct 8 10:00:31 2008 From: stephan80 at mac.com (Stephan) Date: Wed, 08 Oct 2008 16:00:31 +0200 Subject: [BioPython] Entrez.efetch In-Reply-To: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> <20081008123533.GE57379@sobchak.mgh.harvard.edu> <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> Message-ID: <72537648433629820630731006204512761040-Webmail2@me.com> >Stephan - when you say manually, do you mean via a web browser? If so >it is likely to be using a subtly different URL, which might explain >the NCBI generating slightly different data on the fly. 
Either way, >this ACCESSION line difference shouldn't trigger the "Premature end of >file in sequence data" error in the GenBank parser. Thanks, that must be it! Now I guess everything is solved, closing the handle makes my code run properly and the download-from-NCBI-webpage-issue explains the difference in line 3. >Adding this to the UndoHandle seems a sensible improvement - but I >don't see how it can affect Stephan's script. There I agree, thanks anyway, Brad. Regards, Stephan From biopython at maubp.freeserve.co.uk Wed Oct 8 10:02:54 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 15:02:54 +0100 Subject: [BioPython] Entrez.efetch In-Reply-To: <128043477953580677661042463273686413408-Webmail2@me.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com> <128043477953580677661042463273686413408-Webmail2@me.com> Message-ID: <320fb6e00810080702q6774f58ap52a02073d62cb75a@mail.gmail.com> On Wed, Oct 8, 2008 at 2:48 PM, Stephan wrote: > > Hi guys, > > OK, there is two different problems here that Brad and Peter independently > pointed out to me. Peter, you are right that not closing the file actually > caused the error. Your hint fixes that, thanks. Great. > But that doesnt fix that there is a part of line 3 missing over the download, > and although I actually updated to the newest cvs-version of biopython as > Brad suggested (sorry for accidently putting my answer not on the mailing-list) > that does not fix that line... This is the issue where you get different GenBank files using Bio.Entrez.efetch and a "manual download"? First of all what did you mean by "manual download" - for example FTP (what URL), or from a browser? Secondly, does this difference to the ACCESSION line (line 3) actually have any ill effects? To be clear using Bio.Entrez.efetch as in your script, I get this: LOCUS NC_004353 1351857 bp DNA linear INV 14-MAY-2008 DEFINITION Drosophila melanogaster chromosome 4, complete sequence. ACCESSION NC_004353 VERSION NC_004353.3 GI:116010290 PROJECT GenomeProject:164 KEYWORDS . SOURCE Drosophila melanogaster (fruit fly) ORGANISM Drosophila melanogaster ... Using FTP from ftp://ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster/CHR_4/NC_004353.gbk I get something similar but different: LOCUS NC_004353 1351857 bp DNA linear INV 14-MAY-2008 DEFINITION Drosophila melanogaster chromosome 4, complete sequence. ACCESSION NC_004353 VERSION NC_004353.3 GI:116010290 KEYWORDS . SOURCE Drosophila melanogaster (fruit fly) ORGANISM Drosophila melanogaster ... Notice the FTP file lacks the PROJECT line, and also differs slightly in its feature table. Using the NCBI website I suspect you can get other slight variations (like the different ACCESSION line you reported). Peter From stephan80 at mac.com Wed Oct 8 09:52:07 2008 From: stephan80 at mac.com (Stephan) Date: Wed, 08 Oct 2008 15:52:07 +0200 Subject: [BioPython] Entrez.efetch In-Reply-To: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> <20081008123533.GE57379@sobchak.mgh.harvard.edu> <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> Message-ID: <56009583349175862359179071289436480391-Webmail2@me.com> >Stephan - when you say manually, do you mean via a web browser? If so >it is likely to be using a subtly different URL, which might explain >the NCBI generating slightly different data on the fly. 
Either way, >this ACCESSION line difference shouldn't trigger the "Premature end of >file in sequence data" error in the GenBank parser. Thanks, that must be it! Now I guess everything is solved, closing the handle makes my code run properly and the download-from-NCBI-webpage-issue explains the difference in line 3. >Adding this to the UndoHandle seems a sensible improvement - but I >don't see how it can affect Stephan's script. There I agree, thanks anyway, Brad. Regards, Stephan From biopythonlist at gmail.com Wed Oct 8 12:23:32 2008 From: biopythonlist at gmail.com (dr goettel) Date: Wed, 8 Oct 2008 18:23:32 +0200 Subject: [BioPython] taxonomic tree Message-ID: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> Hello, I'm new in this list and in BioPython. I would like to create a NCBI-like taxonomic tree and then fill it with the organisms that I have in a file. Is there an easy way to do this? I started using biopython's function at 7.11.4 (finding the lineage of an organism) in the tutorial, but I need to do this tens of thousands times so it spends too much time querying NCBI database. Therefore I built a taxonomic database locally and implemented something similar to 7.11.4 tutorial's function so I get, for every sequence, the lineage in the same way: 'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; Liliopsida; Asparagales; Orchidaceae' Now I need to create a tree, or fill an already created one. And then search it by some criteria. Please could anybody help me with this? Any idea? Thankyou very much From biopython at maubp.freeserve.co.uk Wed Oct 8 12:38:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 17:38:31 +0100 Subject: [BioPython] taxonomic tree In-Reply-To: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> Message-ID: <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com> On Wed, Oct 8, 2008 at 5:23 PM, dr goettel wrote: > Hello, I'm new in this list and in BioPython. Hello :) > I would like to create a NCBI-like taxonomic tree and then fill it with the > organisms that I have in a file. Is there an easy way to do this? I started > using biopython's function at 7.11.4 (finding the lineage of an organism) in > the tutorial, ... For anyone reading this later on, note that the tutorial section numbers tend to change with each release of Biopython. This section just uses Bio.Entrez to fetch taxonomy information for a particular NCBI taxon id. > but I need to do this tens of thousands times so it spends too > much time querying NCBI database. Also calling Bio.Entrez 10000 times might annoy the NCBI ;) > Therefore I built a taxonomic database > locally and implemented something similar to 7.11.4 tutorial's function so I > get, for every sequence, the lineage in the same way: > > 'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; > Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; > Liliopsida; Asparagales; Orchidaceae' I assume you used the NCBI provided taxdump files to populate the database? See ftp://ftp.ncbi.nih.gov/pub/taxonomy/ Personally rather than designing my own database just for this (and writing a parser for the taxonomy files), I would have suggested installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl to download and import the data for you. 
This is a simple perl script - you don't need BioPerl. See http://www.biopython.org/wiki/BioSQL for details. > Now I need to create a tree, or fill an already created one. And then search > it by some criteria. What kind of tree do you mean? Are you talking about creating a Newick tree, or an in memory structure? Perhaps the Bio.Nexus module's tree functionality would help. If you are interested, the BioSQL tables record the taxonomy tree using two methods, each node has a parent node allowing you to walk up the lineage. There are also left/right values allowing selection of all child nodes efficiently via an SQL select statement. Peter From biopython at maubp.freeserve.co.uk Wed Oct 8 12:57:37 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 17:57:37 +0100 Subject: [BioPython] Current tutorial in CVS Message-ID: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com> Michiel wrote: > ... The new tutorial is in CVS; I put a copy of the HTML output > of the latest version at > http://biopython.org/DIST/docs/tutorial/Tutorial.new.html. This also gives people a chance to look at the three plotting examples I added to the "Cookbook" section a couple of weeks back, http://www.biopython.org/DIST/docs/tutorial/Tutorial.new.html#chapter:cookbook Suggestions for any additional biologically motivated simple plots would be nice - especially for different plot types. A scatter plot could be added, are there any suggestions for this other than melting temperature versus length or GC%? See also this thread on the dev-mailing list: http://www.biopython.org/pipermail/biopython-dev/2008-September/004277.html Note that the file at this URL is only temporary, and will probably be removed before the next release. The current tutorial is at: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html http://www.biopython.org/DIST/docs/tutorial/Tutorial.pdf Peter From stephan80 at mac.com Wed Oct 8 13:11:25 2008 From: stephan80 at mac.com (Stephan) Date: Wed, 08 Oct 2008 19:11:25 +0200 Subject: [BioPython] Entrez.efetch large files Message-ID: <133483072970409871957631124263040035200-Webmail2@me.com> Sorry to have an Entrez.efetch-issue again, but somehow there seems to be a problem with very large files. So when I run the following code using the newest cvs-version of biopython:

------------------------------------CODE-----------------------------------
from Bio import Entrez, SeqIO

id = "57"
print Entrez.read(Entrez.esummary(db="genome", id=id))[0]["Title"]
handle = Entrez.efetch(db="genome", id=id, rettype="genbank")
print "downloading to SeqRecord..."
record = SeqIO.read(handle, "genbank")
print "...done"
------------------------------------END-CODE-----------------------------

it fails with the output:

------------------------------------OUTPUT-----------------------------
Drosophila melanogaster chromosome X, complete sequence
downloading to SeqRecord...
Traceback (most recent call last):
  File "efetch-test.py", line 7, in
    record = SeqIO.read(handle, "genbank")
  File "/NetUsers/stschiff/lib/python/Bio/SeqIO/__init__.py", line 366, in read
    first = iterator.next()
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 410, in parse_records
    record = self.parse(handle)
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 393, in parse
    if self.feed(handle, consumer) :
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 370, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 723, in parse_footer
    raise ValueError("Premature end of file in sequence data")
ValueError: Premature end of file in sequence data
------------------------------------END-OUTPUT-----------------------------

If I change the id to "56" (chromosome 4, which is shorter) it works. But for all the other chromosomes (ids: 57 - 61) it fails. If I download the genbank files manually from the ftp-server and then use SeqIO.read() it works, so the download-process corrupts the genbank files if they are very large (about 35 MB) I guess... Any hints? Best, Stephan From biopython at maubp.freeserve.co.uk Wed Oct 8 14:57:08 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 19:57:08 +0100 Subject: [BioPython] Entrez.efetch large files In-Reply-To: <133483072970409871957631124263040035200-Webmail2@me.com> References: <133483072970409871957631124263040035200-Webmail2@me.com> Message-ID: <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> On Wed, Oct 8, 2008 at 6:11 PM, Stephan wrote: > Sorry to have an Entrez.efetch-issue again, but somehow there > seems to be a problem with very large files. > ... > If I change the id to "56" (chromosome 4, which is shorter) it works. > But for all the other chromosomes (ids: 57 - 61) it fails. > If I download the genbank files manually from the ftp-server and > then use SeqIO.read() it works, so the download-process corrupts > the genbank files if they are very large (about 35 MB) I guess... > > Any hints? Yes - one big hint: DON'T try and parse these large files directly from the internet. Use efetch to download the file and save it to disk. Then open this local file for parsing. There are several good reasons for this: (1) Rerunning the script (e.g. during development) needn't re-download the file, which wastes time and money (yours and more importantly the NCBI's). You may be fine, but the NCBI can and do ban people's IP addresses if they breach the guidelines. (2) If the parsing fails, there is something to debug easily (the local file). You can open the file in a text editor to check it etc. That being said, downloading and parsing in one go should work - I would expect an IO error if the network timed out, rather than what appears to be the data ending prematurely. However, I don't expect this to be easy to resolve - quite possibly this is a network time out somewhere, maybe at your end, maybe on one of the ISP connections in between. On the bright side, at least the parser isn't silently ignoring the end of the file, which would leave you with a truncated sequence without any warnings :) Do you think the Biopython tutorial should be more explicit about this topic? e.g. In chapter 4 (on Bio.SeqIO) I wrote: >> Note that just because you can download sequence data and >> parse it into a SeqRecord object in one go doesn't mean this >> is always a good idea.
In general, you should probably download >> sequences once and save them to a file for reuse. Maybe I should have said "... doesn't mean this is a good idea..." instead? Peter From biopython at maubp.freeserve.co.uk Wed Oct 8 15:32:59 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 20:32:59 +0100 Subject: [BioPython] Entrez.efetch large files In-Reply-To: <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> References: <133483072970409871957631124263040035200-Webmail2@me.com> <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> Message-ID: <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> > Yes - one big hint: DON'T try and parse these large files directly > from the internet. Use efetch to download the file and save it to > disk. Then open this local file for parsing. > ... > Do you think the Biopython tutorial should be more explicit about this > topic? I've changed the tutorial (the SeqIO and Entrez chapters) in CVS to make this advice more explicit, and included an example of doing this too.

import os
from Bio import SeqIO
from Bio import Entrez

Entrez.email = "A.N.Other at example.com" # Always tell NCBI who you are
filename = "gi_186972394.gbk"
if not os.path.isfile(filename) :
    print "Downloading..."
    net_handle = Entrez.efetch(db="nucleotide",id="186972394",rettype="genbank")
    out_handle = open(filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    net_handle.close()
    print "Saved"
print "Parsing..."
record = SeqIO.read(open(filename), "genbank")
print record

Peter From biopython at maubp.freeserve.co.uk Wed Oct 8 16:57:03 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 21:57:03 +0100 Subject: [BioPython] Entrez.efetch large files In-Reply-To: <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com> References: <133483072970409871957631124263040035200-Webmail2@me.com> <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com> Message-ID: <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com> On Wed, Oct 8, 2008 at 9:37 PM, Stephan Schiffels wrote: > > Hi Peter, > > OK, first of all... you were right of course, with > out_handle.write(net_handle.read()) the download works properly and reading > the file from disk also works. The tutorial is very clear on that point, I > agree.
I'm aware of some "hot spots" in the GenBank parser which take more time than they really need to (feature location parsing in particular). However, even if using pickles is much faster, I would personally still rather use this approach: if file not present: download from NCBI and save it parse file I think it is safer to keep the original data in the NCBI provided format, rather than as a python pickle. Some of my reasons include: * you might want to parse the files with a different tool one day (e.g. grep, or maybe BioPerl, or EMBOSS) * different versions of Biopython will parse the file slightly differently (e.g. once Bugs 2225 and 2578 are fixed the SeqRecord should include slightly more information from a GenBank file) while your pickle will be static * if the SeqRecord or Seq objects themselves change slightly between versions of Biopython, the pickle may not work * more generally, is it safe to transfer the pickly files between different computers (e.g. different versions of python or Biopython, different OS, different line endings)? These issues may not be a problem in your setting. More generally, you could consider using BioSQL, but this may be overkill for your needs. > However, as you pointed out, parsing from the internet makes problems. If you do work out exactly what is going wrong, I would be interested to hear about it. > I think the advantages of not having to download each time were clear to me > from the tutorial. Just that downloading AND parsing at the same time makes > problems didnt appear to me. The addings to the tutorial seem to give some > idea. Your approach all makes sense. Thanks for explaining your thoughts. I don't think I'd ever tried efetch on such a large GenBank file in the first place - for genomes I have usually used FTP instead. Peter From chapmanb at 50mail.com Wed Oct 8 17:11:25 2008 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Oct 2008 17:11:25 -0400 Subject: [BioPython] Entrez.efetch In-Reply-To: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> <20081008123533.GE57379@sobchak.mgh.harvard.edu> <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> Message-ID: <20081008211125.GB17555@sobchak.mgh.harvard.edu> Peter and Stephan; My fault -- sorry about the red herring on this one. I shouldn't have tried to answer this e-mail in 5 minutes before work this morning. Sounds like y'all have it resolved with the missing close so I will keep my mouth shut. Peter, I don't remember my exact problem as it was in some throw-away script and the fix seemed non-problematic. I was thrown off by the "line 3" information Stephan mentioned because my issue was with the first couple of lines missing when iterating with an UndoHandle. No matter. Thanks for coming up with the right fix! Brad > Stephan wrote: > >> When I download this chromosome manually from the NCBI-website, > >> I indeed find a difference in one line, namely in line 3 of the > >> genbank file. In the manually downloaded file line 3 reads: > >> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced > >> from my code I have only: "ACCESSION NC_004353". So without that > >> region-information, the biopython parser of course runs to a premature > >> end. > > Stephan - when you say manually, do you mean via a web browser? If so > it is likely to be using a subtly different URL, which might explain > the NCBI generating slightly different data on the fly. 
Either way, > this ACCESSION line difference shouldn't trigger the "Premature end of > file in sequence data" error in the GenBank parser. > > On Wed, Oct 8, 2008 at 1:35 PM, Brad Chapman wrote: > > This is a tricky problem that I ran into as well and is fixed in the > > latest CVS version. The issue is that the Biopython reader is using an > > UndoHandle instead of a standard python handle. By default some of these > > operations appear to be assuming an iterator, but UndoHandle did not > > provide this. > > Brad, I'm pretty sure the GenBank parser is NOT using the UndoHandle. > Just adding the close made Stephan's example work for me. What > exactly was the problem you ran into (one of the other parsers > perhaps?). > > > As a result, you can lose the first couple of lines which are > > previously examined to determine the filetype. The fix is to make > > this a proper iterator. You can either check out current CVS, or > > make the addition manually to Bio/File.py in your current version: > > > > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython > > Adding this to the UndoHandle seems a sensible improvement - but I > don't see how it can affect Stephan's script. > > Peter From stephan80 at mac.com Wed Oct 8 16:37:17 2008 From: stephan80 at mac.com (Stephan Schiffels) Date: Wed, 08 Oct 2008 22:37:17 +0200 Subject: [BioPython] Entrez.efetch large files In-Reply-To: <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> References: <133483072970409871957631124263040035200-Webmail2@me.com> <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> Message-ID: <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com> Hi Peter, OK, first of all... you were right of course, with out_handle.write (net_handle.read()) the download works properly and reading the file from disk also works.The tutorial is very clear on that point, I agree. To illustrate why I made the mistake even though I read the tutorial: I made some code like: try: unpickling a file as SeqRecord... except IOError: download file into SeqRecord AND pickle afterwards to disk So, as you can see, I already tried to make the download only once! The disk-saving step, I realized, was smarter to do via cPickle since then reading from it also goes faster than parsing the genbank file each time. So my goal was to either load a pickled SeqRecord, or download into SeqRecord and then pickle to disk. I hope you agree that concerning resources from NCBI this way is (at least in principle) already quite optimal. However, as you pointed out, parsing from the internet makes problems. I think the advantages of not having to download each time were clear to me from the tutorial. Just that downloading AND parsing at the same time makes problems didnt appear to me. The addings to the tutorial seem to give some idea. Thanks and Regards, Stephan Am 08.10.2008 um 21:32 schrieb Peter: >> Yes - one big hint: DON'T try and parse these large files directly >> from the internet. Use efetch to download the file and save it to >> disk. Then open this local file for parsing. >> ... >> Do you think the Biopython tutorial should be more explicit about >> this >> topic? > > I've changed the tutorial (the SeqIO and Entrez chapters) in CVS to > make this advice more explicit, and included an example of doing this > too. 
> > import os > from Bio import SeqIO > from Bio import Entrez > Entrez.email = "A.N.Other at example.com" # Always tell NCBI who > you are > filename = "gi_186972394.gbk" > if not os.path.isfile(filename) : > print "Downloading..." > net_handle = Entrez.efetch > (db="nucleotide",id="186972394",rettype="genbank") > out_handle = open(filename, "w") > out_handle.write(net_handle.read()) > out_handle.close() > net_handle.close() > print "Saved" > > print "Parsing..." > record = SeqIO.read(open(filename), "genbank") > print record > > > Peter From biopythonlist at gmail.com Thu Oct 9 04:52:42 2008 From: biopythonlist at gmail.com (dr goettel) Date: Thu, 9 Oct 2008 10:52:42 +0200 Subject: [BioPython] taxonomic tree In-Reply-To: <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com> References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com> Message-ID: <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com> On Wed, Oct 8, 2008 at 6:38 PM, Peter wrote: > On Wed, Oct 8, 2008 at 5:23 PM, dr goettel > wrote: > > Hello, I'm new in this list and in BioPython. > > Hello :) > > > I would like to create a NCBI-like taxonomic tree and then fill it with > the > > organisms that I have in a file. Is there an easy way to do this? I > started > > using biopython's function at 7.11.4 (finding the lineage of an organism) > in > > the tutorial, ... > > For anyone reading this later on, note that the tutorial section > numbers tend to change with each release of Biopython. This section > just uses Bio.Entrez to fetch taxonomy information for a particular > NCBI taxon id. > > > but I need to do this tens of thousands times so it spends too > > much time querying NCBI database. > > Also calling Bio.Entrez 10000 times might annoy the NCBI ;) > > > Therefore I built a taxonomic database > > locally and implemented something similar to 7.11.4 tutorial's function > so I > > get, for every sequence, the lineage in the same way: > > > > 'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; > Streptophytina; > > Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; > > Liliopsida; Asparagales; Orchidaceae' > > I assume you used the NCBI provided taxdump files to populate the > database? See ftp://ftp.ncbi.nih.gov/pub/taxonomy/ > Yes I did. > > Personally rather than designing my own database just for this (and > writing a parser for the taxonomy files), I would have suggested > installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl > to download and import the data for you. This is a simple perl script > - you don't need BioPerl. See http://www.biopython.org/wiki/BioSQL > for details. > I also used the load_ncbi_taxonomy.pl script. It worked great! > > > Now I need to create a tree, or fill an already created one. And then > search > > it by some criteria. > > What kind of tree do you mean? Are you talking about creating a > Newick tree, or an in memory structure? Perhaps the Bio.Nexus > module's tree functionality would help. > Thankyou very much. I still don't know if I want Newick tree or the other one. I'll take a look on Bio.Nexus module > > If you are interested, the BioSQL tables record the taxonomy tree > using two methods, each node has a parent node allowing you to walk up > the lineage. There are also left/right values allowing selection of > all child nodes efficiently via an SQL select statement. 
> > Peter > This is what I was trying to do, from the name of the organism (the leaf of the tree) and getting every node using the parent_node field of the taxon table, until reaching the root node. Once I have all the steps to the root node then I have to create/fill the tree with my data in order to examine the number of organisms in a certain class/order/family/genus... etc Any ideas will be very much appreciated. Thank you very much for your answer and I'll take a look at the Bio.Nexus module. drG From biopython at maubp.freeserve.co.uk Thu Oct 9 05:31:16 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Oct 2008 10:31:16 +0100 Subject: [BioPython] taxonomic tree In-Reply-To: <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com> References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com> <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com> Message-ID: <320fb6e00810090231w723e3b29m5e070c55166d3bfc@mail.gmail.com> >> Personally rather than designing my own database just for this (and >> writing a parser for the taxonomy files), I would have suggested >> installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl >> to download and import the data for you. This is a simple perl script >> - you don't need BioPerl. See http://www.biopython.org/wiki/BioSQL >> for details. > > I also used the load_ncbi_taxonomy.pl script. It worked great! Good. I would encourage you to use the version from BioSQL v1.0.1 if you are not already, as the version with BioSQL v1.0.0 makes an additional unnecessary assumption about the database keys matching the NCBI taxon ID. >> If you are interested, the BioSQL tables record the taxonomy tree >> using two methods, each node has a parent node allowing you to walk up >> the lineage. There are also left/right values allowing selection of >> all child nodes efficiently via an SQL select statement. > > This is what I was trying to do, from the name of the organism (the leaf of > the tree) and getting every node using the parent_node field of the taxon > table, until reaching the root node. Once I have all the steps to the root > node then I have to create/fill the tree with my data in order to > examine the number of organisms in a certain > class/order/family/genus... etc > Any ideas will be very much appreciated. To do this in Biopython you'll have to write some SQL commands - but first you need to understand how the left/right values work if you want to take advantage of them. I refer you to this thread on the BioSQL mailing list earlier in the year: http://lists.open-bio.org/pipermail/biosql-l/2008-April/001234.html In particular, Hilmar referred to Joe Celko's SQL for Smarties books, and the introduction to this nested-set representation given here: http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html Alternatively, if you wanted to avoid the left/right values, you could use recursion or loops on the parent ID links to build up the tree. For a single lineage this is fine - but for a full tree I would expect the left/right values to be faster. Note that Biopython (in CVS now) ignores the left/right values. This is for two reasons - for pulling out a single lineage, Eric found this was faster. Also, when adding new entries to the database re-calculating the left/right values is too slow, so we leave them as NULL (and let the user (re)run load_ncbi_taxonomy.pl later if they care).
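Just to make that concrete, the sort of nested-set query meant above might look something like this - a rough, untested sketch which assumes the standard BioSQL taxon / taxon_name tables (with their left_value / right_value columns) as loaded by load_ncbi_taxonomy.pl, a MySQL back end via MySQLdb, and made-up connection details:

import MySQLdb

# Count the species filed under a given family (here Orchidaceae,
# matching the lineage in the earlier email) using the nested-set
# left/right values: a node is a descendant of the family if its
# left_value falls between the family's left and right values.
con = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="biosql")
cur = con.cursor()
cur.execute("""
    SELECT COUNT(*)
    FROM taxon child, taxon parent, taxon_name parent_name
    WHERE parent_name.name = %s
      AND parent_name.name_class = 'scientific name'
      AND parent_name.taxon_id = parent.taxon_id
      AND child.left_value BETWEEN parent.left_value AND parent.right_value
      AND child.node_rank = 'species'
    """, ("Orchidaceae",))
print cur.fetchone()[0]

Of course this only works once the left/right values have actually been filled in.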
This means we don't want to depend on the left/right values being present. Peter From stephan.schiffels at uni-koeln.de Thu Oct 9 09:01:11 2008 From: stephan.schiffels at uni-koeln.de (Stephan Schiffels) Date: Thu, 9 Oct 2008 15:01:11 +0200 Subject: [BioPython] Entrez.efetch large files In-Reply-To: <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com> References: <133483072970409871957631124263040035200-Webmail2@me.com> <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com> <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com> Message-ID: <171A75DA-EE34-44AB-8E16-DEC626F7164C@uni-koeln.de> Hi Peter, Am 08.10.2008 um 22:57 schrieb Peter: > I'm curious - do you have any numbers for the relative times to load a > SeqRecord from a pickle, or re-parse it from the GenBank file? I'm > aware of some "hot spots" in the GenBank parser which take more time > than they really need to (feature location parsing in particular). So, here is a little profiling of reading a large chromosome both as genbank and from a pickled SeqRecord (both from disk of course): >>> t = Timer("a = cPickle.load(open('DroMel_chr2L.pickle'))", "import cPickle") >>> t.timeit(number=1) 5.2086620330810547 >>> t = Timer("a = SeqIO.read(open('DroMel_chr2L.gbk'), 'genbank')", "from Bio import SeqIO") >>> t.timeit(number=1) 53.902437925338745 >>> As you see there is an amazing 10fold speed-gain using cPickle in comparison to SeqIO.read() ... not bad! The pickled file is a bit larger than the genbank file, but not much. > However, even if using pickles is much faster, I would personally > still rather use this approach: > > if file not present: > download from NCBI and save it > parse file > Thats precisely how I do it now. Works cool! > I think it is safer to keep the original data in the NCBI provided > format, rather than as a python pickle. Some of my reasons include: > > * you might want to parse the files with a different tool one day > (e.g. grep, or maybe BioPerl, or EMBOSS) > * different versions of Biopython will parse the file slightly > differently (e.g. once Bugs 2225 and 2578 are fixed the SeqRecord > should include slightly more information from a GenBank file) while > your pickle will be static > * if the SeqRecord or Seq objects themselves change slightly between > versions of Biopython, the pickle may not work > * more generally, is it safe to transfer the pickly files between > different computers (e.g. different versions of python or Biopython, > different OS, different line endings)? > > These issues may not be a problem in your setting. You are right and in fact I now safe both the genbank file and the pickled file to disk, so I have all the backup. > > More generally, you could consider using BioSQL, but this may be > overkill for your needs. > BioSQL is something that I like a lot. I have not yet digged my way through it but hopefully there will be options for me from that side as well. >> However, as you pointed out, parsing from the internet makes >> problems. > > If you do work out exactly what is going wrong, I would be interested > to hear about it. > Hmm, probably I wont find it out. Parsing from the internet works for small files, it must be some network-issue, dont know. Since I am in the university-web I doubt that the error starts at my side, maybe NCBI clears the connection if the other side is too slow, which is the case for the parsing process... 
But I understand too little about networking. >> I think the advantages of not having to download each time were >> clear to me >> from the tutorial. Just that downloading AND parsing at the same >> time makes >> problems didnt appear to me. The addings to the tutorial seem to >> give some >> idea. > > Your approach all makes sense. Thanks for explaining your thoughts. I > don't think I'd ever tried efetch on such a large GenBank file in the > first place - for genomes I have usually used FTP instead. > > Peter Regards, Stephan From biopython at maubp.freeserve.co.uk Thu Oct 9 10:18:52 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Oct 2008 15:18:52 +0100 Subject: [BioPython] Entrez.efetch large files In-Reply-To: <171A75DA-EE34-44AB-8E16-DEC626F7164C@uni-koeln.de> References: <133483072970409871957631124263040035200-Webmail2@me.com> <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com> <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com> <171A75DA-EE34-44AB-8E16-DEC626F7164C@uni-koeln.de> Message-ID: <320fb6e00810090718g3420729fh50520a4760c5d27@mail.gmail.com> Peter wrote: >> I'm curious - do you have any numbers for the relative times to load a >> SeqRecord from a pickle, or re-parse it from the GenBank file? I'm >> aware of some "hot spots" in the GenBank parser which take more time >> than they really need to (feature location parsing in particular). Stephan wrote: > So, here is a little profiling of reading a large chromosome both as genbank > and from a pickled SeqRecord (both from disk of course): >>>> t = Timer("a = cPickle.load(open('DroMel_chr2L.pickle'))", "import >>>> cPickle") >>>> t.timeit(number=1) > 5.2086620330810547 >>>> t = Timer("a = SeqIO.read(open('DroMel_chr2L.gbk'), 'genbank')", "from >>>> Bio import SeqIO") >>>> t.timeit(number=1) > 53.902437925338745 >>>> > > As you see there is an amazing 10fold speed-gain using cPickle in comparison > to SeqIO.read() ... not bad! The pickled file is a bit larger than the > genbank file, but not much. I'm seeing more like a three fold speed-gain (using cPickle protocol 0, with Python 2.5.2 on a Mac), which is less impressive. For a 10 fold speed up I can see why the complexity overhead of using pickle could be worthwhile. cPickle.load() took 8.5s cPickle.load() took 10.0s cPickle.load() took 9.9s SeqIO.read() took 29.9s SeqIO.read() took 29.8s SeqIO.read() took 29.8s (Script below) I'm not very impressed with the 30 seconds needed to parse a 30MB file. There is certainly scope for speeding up the GenBank parsing here. Peter --------------- My timing script: import os import cPickle import time from Bio import Entrez, SeqIO #Entrez.email = "..." id="57" genbank_filename = "NC_004354.gbk" pickle_filename = "NC_004354.pickle" if not os.path.isfile(genbank_filename) : print "Downloading..." net_handle = Entrez.efetch(db="genome", id=id, rettype="genbank") out_handle = open(genbank_filename, "w") out_handle.write(net_handle.read()) out_handle.close() print "Saved" if not os.path.isfile(pickle_filename) : print "Parsing..." record = SeqIO.read(open(genbank_filename), 'genbank') print "Pickling..." out_handle = open(pickle_filename ,"w") cPickle.dump(record, out_handle) out_handle.close() print "Saved" print "Profiling..." 
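# (Aside: the pickle above was written with cPickle's default protocol 0,
# the slow ASCII format.  Passing protocol=2 to cPickle.dump() and opening
# the pickle file in binary mode ("wb"/"rb") would probably make the load
# faster and the file smaller - not timed here.)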
for i in range(3) : start = time.time() record = cPickle.load(open(pickle_filename)) print "cPickle.load() took %0.1fs" % (time.time() - start) for i in range(3) : start = time.time() record = SeqIO.read(open(genbank_filename), 'genbank') print "SeqIO.read() took %0.1fs" % (time.time() - start) print "Done" From biopython at maubp.freeserve.co.uk Thu Oct 9 11:48:26 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Oct 2008 16:48:26 +0100 Subject: [BioPython] Deprecating Bio.PubMed and some of Bio.GenBank Message-ID: <320fb6e00810090848g2a516877i5950f515e748b9d0@mail.gmail.com> Dear Biopythoneers, Those of you who looked at the release notes for Biopython 1.48 might have read this bit: >> Bio.PubMed and the online code in Bio.GenBank are now considered >> obsolete, and we intend to deprecate them after the next release. >> For accessing PubMed and GenBank, please use Bio.Entrez instead. These bits of code are effectively simple wrappers for Bio.Entrez. While they may be simple to use, they cannot take advantage of the NCBI's Entrez utils history functionality. This means they discourage users from following the NCBI's preferred usage patterns. We're already trying to encouraging the use of Bio.Entrez by documenting it prominently in the tutorial (which seems to be working given the recent questions on the mailing list), but for Biopython 1.49 I'm suggesting we go further and deprecate Bio.PubMed and the online code in Bio.GenBank. This would mean a warning message would appear when this code is used, and (barring feedback) after a couple of releases this code would be removed completely. Any comments or objections? In particular, is anyone using this "obsolete" functionality now? Peter From biopythonlist at gmail.com Thu Oct 9 12:32:11 2008 From: biopythonlist at gmail.com (dr goettel) Date: Thu, 9 Oct 2008 18:32:11 +0200 Subject: [BioPython] taxonomic tree In-Reply-To: <320fb6e00810090231w723e3b29m5e070c55166d3bfc@mail.gmail.com> References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com> <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com> <320fb6e00810090231w723e3b29m5e070c55166d3bfc@mail.gmail.com> Message-ID: <9b15d9f30810090932qb22ca8boc6edc871bf285154@mail.gmail.com> > To do this in Biopython you'll have to write some SQL commands - but > first you need to understand how the left/right values work if you > want to take advantage of them. I refer you to this thread on the > BioSQL mailing list earlier in the year: > http://lists.open-bio.org/pipermail/biosql-l/2008-April/001234.html > > In particular, Hilmar referred to Joe Celko's SQL for Smarties books, > and the introduction to this nested-set representation given here: > http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html > That's great!! Taking advantage of the left/right values will help me!! They 're great. I started writing a lot of code to do something that in fact can be done with some sql statements. In fact the sql statements are quite difficult for me so I have to deep inside "inner joins". 
Thank you very much drG From biopython at maubp.freeserve.co.uk Mon Oct 13 08:38:56 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Oct 2008 13:38:56 +0100 Subject: [BioPython] Translation method for Seq object Message-ID: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> Dear Biopythoneers, This is a request for feedback about proposed additions to the Seq object for the next release of Biopython. I'd like people to pick (a) to (e) in the list below (with additional comments or counter suggestions welcome). Enhancement bug 2381 is about adding transcription and translation methods to the Seq object, allowing an object orientated style of programming. e.g. Current functional programming style: >>> from Bio.Seq import Seq, transcribe >>> from Bio.Alphabet import generic_dna >>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna) >>> my_seq Seq('CAGTGACGTTAGTCCG', DNAAlphabet()) >>> transcribe(my_seq) Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) With the latest Biopython in CVS, you can now invoke a Seq object method instead for transcription (or back transcription): >>> my_seq.transcribe() Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) For a comparison, compare the shift from python string functions to string methods. This also makes the functionality more discoverable via dir(my_seq). Adding Seq object methods "transcribe" and "back_transcribe" doesn't cause any confusion with the python string methods. However, for translation, the python string has an existing "translate" method: > S.translate(table [,deletechars]) -> string > > Return a copy of the string S, where all characters occurring > in the optional argument deletechars are removed, and the > remaining characters have been mapped through the given > translation table, which must be a string of length 256. I don't think this functionality is really of direct use for sequences, and having a Seq object "translate" method do a biological translation into a protein sequence is much more intuitive. However, this could cause confusion if the Seq object is passed to non-Biopython code which expects a string-like translate method. To avoid this naming clash, a different method name would be needed. This is where some user feedback would be very welcome - I think the following cover all the alternatives of what to call a biological translation function (nucleotide to protein): (a) Just use translate (ignore the existing string method) (b) Use translate_ (trailing underscore, see PEP8) (c) Use translation (a noun rather than verb; different style). (d) Use something else (e.g. bio_translate or ...) (e) Don't add a biological translation method at all because ... Thanks, Peter See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381 From ericgibert at yahoo.fr Mon Oct 13 10:38:02 2008 From: ericgibert at yahoo.fr (Eric Gibert) Date: Mon, 13 Oct 2008 22:38:02 +0800 Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> Message-ID: (a) Seq is an object, string is another object... each of them has various methods and coincidentally two of them have the same name...
Eric -----Original Message----- From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter Sent: Monday, October 13, 2008 8:39 PM To: BioPython Mailing List Subject: [BioPython] Translation method for Seq object Dear Biopythoneers, This is a request for feedback about proposed additions to the Seq object for the next release of Biopython. I'd like people to pick (a) to (e) in the list below (with additional comments or counter suggestions welcome). Enhancement bug 2381 is about adding transcription and translation methods to the Seq object, allowing an object orientated style of programming. e.g. Current functional programming style: >>> from Bio.Seq import Seq, transcribe >>> from Bio.Alphabet import generic_dna >>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna) >>> my_seq Seq('CAGTGACGTTAGTCCG', DNAAlphabet()) >>> transcribe(my_seq) Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) With the latest Biopython in CVS, you can now invoke a Seq object method instead for transcription (or back transcription): >>> my_seq.transcribe() Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) For a comparison, compare the shift from python string functions to string methods. This also makes the functionality more discoverable via dir(my_seq). Adding Seq object methods "transcribe" and "back_transcribe" doesn't cause any confusion with the python string methods. However, for translation, the python string has an existing "translate" method: > S.translate(table [,deletechars]) -> string > > Return a copy of the string S, where all characters occurring > in the optional argument deletechars are removed, and the > remaining characters have been mapped through the given > translation table, which must be a string of length 256. I don't think this functionality is really of direct use for sequences, and having a Seq object "translate" method do a biological translation into a protein sequence is much more intuitive. However, this could cause confusion if the Seq object is passed to non-Biopython code which expects a string like translate method. To avoid this naming clash, a different method name would needed. This is where some user feedback would be very welcome - I think the following cover all the alternatives of what to call a biological translation function (nucleotide to protein): (a) Just use translate (ignore the existing string method) (b) Use translate_ (trailing underscore, see PEP8) (c) Use translation (a noun rather than verb; different style). (d) Use something else (e.g. bio_translate or ...) (e) Don't add a biological translation method at all because ... Thanks, Peter See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381 _______________________________________________ BioPython mailing list - BioPython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From bsouthey at gmail.com Mon Oct 13 10:58:07 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Mon, 13 Oct 2008 09:58:07 -0500 Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> Message-ID: <48F361FF.103@gmail.com> Peter wrote: > Dear Biopythoneers, > > This is a request for feedback about proposed additions to the Seq > object for the next release of Biopython. I'd like people to pick (a) > to (e) in the list below (with additional comments or counter > suggestions welcome). 
> > Enhancement bug 2381 is about adding transcription and translation > methods to the Seq object, allowing an object orientated style of > programming. > > e.g. Current functional programming style: > > >>>> from Bio.Seq import Seq, transcribe >>>> from Bio.Alphabet import generic_dna >>>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna) >>>> my_seq >>>> > Seq('CAGTGACGTTAGTCCG', DNAAlphabet()) > >>>> transcribe(my_seq) >>>> > Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) > > With the latest Biopython in CVS, you can now invoke a Seq object > method instead for transcription (or back transcription): > > >>>> my_seq.transcribe() >>>> > Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) > > For a comparison, compare the shift from python string functions to > string methods. This also makes the functionality more discoverable > via dir(my_seq). > > Adding Seq object methods "transcribe" and "back_transcribe" doesn't > cause any confusion with the python string methods. However, for > translation, the python string has an existing "translate" method: > > >> S.translate(table [,deletechars]) -> string >> >> Return a copy of the string S, where all characters occurring >> in the optional argument deletechars are removed, and the >> remaining characters have been mapped through the given >> translation table, which must be a string of length 256. >> > > I don't think this functionality is really of direct use for sequences, and > having a Seq object "translate" method do a biological translation into > a protein sequence is much more intuitive. However, this could cause > confusion if the Seq object is passed to non-Biopython code which > expects a string like translate method. > > To avoid this naming clash, a different method name would needed. > > This is where some user feedback would be very welcome - I think > the following cover all the alternatives of what to call a biological > translation function (nucleotide to protein): > > (a) Just use translate (ignore the existing string method) > (b) Use translate_ (trailing underscore, see PEP8) > (c) Use translation (a noun rather than verb; different style). > (d) Use something else (e.g. bio_translate or ...) > (e) Don't add a biological translation method at all because ... > > Thanks, > > Peter > > See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381 > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > Hi, My thoughts on this is that it is generally best to avoid any confusion when possible. But 'translate' is not a reserved word and the Python documentation notes that the unicode version lacks the optional deletechars argument (so there is precedent for using the same word). Also it involves the methods versus functions argument but many of the string functions have been depreciated and will get removed in Python 3.0 (so in Python 3.0 I think it will be hard to get a name clash without some strange inheritance going on). Therefore, provided 'translate' is a method of Seq then I do not see any strong reason to avoid it except that it is long (but shorter than translation) :-) Would be too cryptic to have dna(), rna() and protein() methods that provide the appropriate conversion based on the Seq type? Obviously reverse translation of a protein sequence to a DNA sequence is complex if there are many solutions. 
Regards Bruce From mjldehoon at yahoo.com Mon Oct 13 10:57:28 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 13 Oct 2008 07:57:28 -0700 (PDT) Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> Message-ID: <421846.1946.qm@web62403.mail.re1.yahoo.com> (f) Use .translate both for the Python .translate and for the Biopython .translate. S.translate() ===> Biopython .translate S.translate(table [,deletechars]) ===> Python .translate We can tell from the presence or absence of arguments whether the user intends Python's translate or Biopython's translate. --Michiel. --- On Mon, 10/13/08, Peter wrote: > From: Peter > Subject: [BioPython] Translation method for Seq object > To: "BioPython Mailing List" > Date: Monday, October 13, 2008, 8:38 AM > Dear Biopythoneers, > > This is a request for feedback about proposed additions to > the Seq > object for the next release of Biopython. I'd like > people to pick (a) > to (e) in the list below (with additional comments or > counter > suggestions welcome). > > Enhancement bug 2381 is about adding transcription and > translation > methods to the Seq object, allowing an object orientated > style of > programming. > > e.g. Current functional programming style: > > >>> from Bio.Seq import Seq, transcribe > >>> from Bio.Alphabet import generic_dna > >>> my_seq = Seq("CAGTGACGTTAGTCCG", > generic_dna) > >>> my_seq > Seq('CAGTGACGTTAGTCCG', DNAAlphabet()) > >>> transcribe(my_seq) > Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) > > With the latest Biopython in CVS, you can now invoke a Seq > object > method instead for transcription (or back transcription): > > >>> my_seq.transcribe() > Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) > > For a comparison, compare the shift from python string > functions to > string methods. This also makes the functionality more > discoverable > via dir(my_seq). > > Adding Seq object methods "transcribe" and > "back_transcribe" doesn't > cause any confusion with the python string methods. > However, for > translation, the python string has an existing > "translate" method: > > > S.translate(table [,deletechars]) -> string > > > > Return a copy of the string S, where all characters > occurring > > in the optional argument deletechars are removed, and > the > > remaining characters have been mapped through the > given > > translation table, which must be a string of length > 256. > > I don't think this functionality is really of direct > use for sequences, and > having a Seq object "translate" method do a > biological translation into > a protein sequence is much more intuitive. However, this > could cause > confusion if the Seq object is passed to non-Biopython code > which > expects a string like translate method. > > To avoid this naming clash, a different method name would > needed. > > This is where some user feedback would be very welcome - I > think > the following cover all the alternatives of what to call a > biological > translation function (nucleotide to protein): > > (a) Just use translate (ignore the existing string method) > (b) Use translate_ (trailing underscore, see PEP8) > (c) Use translation (a noun rather than verb; different > style). > (d) Use something else (e.g. bio_translate or ...) > (e) Don't add a biological translation method at all > because ... 
> > Thanks, > > Peter > > See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381 > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Mon Oct 13 11:27:37 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Oct 2008 16:27:37 +0100 Subject: [BioPython] Translation method for Seq object In-Reply-To: <421846.1946.qm@web62403.mail.re1.yahoo.com> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> <421846.1946.qm@web62403.mail.re1.yahoo.com> Message-ID: <320fb6e00810130827j3ec07434s2f58e370743f9537@mail.gmail.com> So I did manage to leave off at least one other option from my short list :) Michiel de Hoon wrote: > > (f) Use .translate both for the Python .translate and for the Biopython .translate. > > S.translate() ===> Biopython .translate > > S.translate(table [,deletechars]) ===> Python .translate > > We can tell from the presence or absence of arguments whether the user intends Python's translate or Biopython's translate. Sadly its not quite that simple. For a biological translation we'd probably want to offer optional arguments for at least the codon table and stop symbol (like the current Bio.Seq.translate() function), with other further arguments possible (e.g. to treat the sequence as a complete CDS where the start codon should be validated and taken as M). It would still be possible to automatically detect which translation was required, but it wouldn't be very nice. So overall I'm not keen on this approach. Peter From biopython at maubp.freeserve.co.uk Mon Oct 13 11:54:32 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Oct 2008 16:54:32 +0100 Subject: [BioPython] Translation method for Seq object In-Reply-To: <48F361FF.103@gmail.com> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> <48F361FF.103@gmail.com> Message-ID: <320fb6e00810130854m38f37075gf85b798cb4a98e21@mail.gmail.com> Bruce wrote: > ... > Therefore, provided 'translate' is a method of Seq then I do not see any > strong reason to avoid it except that it is long (but shorter than > translation) :-) Good - that sounds like another vote for option (a) in my original list. > Would be too cryptic to have dna(), rna() and protein() methods that provide > the appropriate conversion based on the Seq type? Or in a similar vein, to_dna, to_rna, and to_protein? Or toDNA, toRNA, toProtein? I'd have to go and consult the current python style guide for what is the current best practice. Something like that does sounds reasonable (and they are short), but historically all related Biopython functions have used the terms (back) transcription and (back) translation so I would prefer to stick with those. > Obviously reverse translation of a protein sequence to a DNA sequence is > complex if there are many solutions. Yes, back-translation is tricky because there is generally more than one codon for any amino acid. Ambiguous nucleotides can be used to describe several possible codons giving that amino acid, but in general it is not possible to do this and describe all the possible codons which could have been used. This topic is worth of an entire thread... for the record, I would envisage a back_translate method for the Seq object (assuming we settle on translate as the name for the forward translation from nucleotide to protein). 
Peter From mjldehoon at yahoo.com Mon Oct 13 20:50:14 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 13 Oct 2008 17:50:14 -0700 (PDT) Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> Message-ID: <900752.12970.qm@web62408.mail.re1.yahoo.com> > (a) Just use translate (ignore the existing string method) > (b) Use translate_ (trailing underscore, see PEP8) > (c) Use translation (a noun rather than verb; different > style). > (d) Use something else (e.g. bio_translate or ...) > (e) Don't add a biological translation method at all > because ... (a). Note also that once Seq objects inherit from string, the Python .translate method is still accessible as str.translate(seq). --Michiel. From biopython at maubp.freeserve.co.uk Tue Oct 14 06:18:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 14 Oct 2008 11:18:13 +0100 Subject: [BioPython] Translation method for Seq object In-Reply-To: <900752.12970.qm@web62408.mail.re1.yahoo.com> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> <900752.12970.qm@web62408.mail.re1.yahoo.com> Message-ID: <320fb6e00810140318i14c6362eq8a51030b1da660ae@mail.gmail.com> OK, we seem to have a consensus :) In Biopython's CVS, the Seq object now has a translate method which does a biological translation. If anyone comes up with a better proposal before the next release, we can still rename this. Otherwise I will update the Tutorial in CVS shortly... Note that for now, I have followed the existing Bio.Seq.translate(...) function and the new Seq object translate(...) method takes only two optional parameters - the codon table and the stop symbol. I have noted some suggestions for possible additional arguments on Bug 2381. The adventurous among you may want to use CVS to update your Biopython installations to try this out. Please note that you will now need numpy instead of Numeric (there is nothing to stop you having both numpy and Numeric installed at the same time). If you do try out the CVS code, please run the unit tests and report any issues. Thanks, Peter From biopython at maubp.freeserve.co.uk Tue Oct 14 07:11:20 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 14 Oct 2008 12:11:20 +0100 Subject: [BioPython] Deprecating Bio.PubMed and some of Bio.GenBank In-Reply-To: <320fb6e00810090848g2a516877i5950f515e748b9d0@mail.gmail.com> References: <320fb6e00810090848g2a516877i5950f515e748b9d0@mail.gmail.com> Message-ID: <320fb6e00810140411o341df854x49ef3e61421193b8@mail.gmail.com> On Thu, Oct 9, 2008 at 4:48 PM, Peter wrote: > Dear Biopythoneers, > > Those of you who looked at the release notes for Biopython 1.48 might > have read this bit: > >>> Bio.PubMed and the online code in Bio.GenBank are now considered >>> obsolete, and we intend to deprecate them after the next release. >>> For accessing PubMed and GenBank, please use Bio.Entrez instead. > > These bits of code are effectively simple wrappers for Bio.Entrez. > While they may be simple to use, they cannot take advantage of the > NCBI's Entrez utils history functionality. This means they discourage > users from following the NCBI's preferred usage patterns. > > We're already trying to encouraging the use of Bio.Entrez by > documenting it prominently in the tutorial (which seems to be working > given the recent questions on the mailing list), but for Biopython > 1.49 I'm suggesting we go further and deprecate Bio.PubMed and the > online code in Bio.GenBank. 
This would mean a warning message would > appear when this code is used, and (barring feedback) after a couple > of releases this code would be removed completely. > > Any comments or objections? In particular, is anyone using this > "obsolete" functionality now? I've just deprecated Bio.PubMed in CVS - meaning for the next release of Biopython you'll see a warning message when you import the PubMed module. If you are using this module please say something sooner rather than later. This can still be undone. Thanks, Peter From dalloliogm at gmail.com Thu Oct 16 06:02:46 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 16 Oct 2008 12:02:46 +0200 Subject: [BioPython] calculate F-Statistics from SNP data Message-ID: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> Hi, I was going to write a python program to calculate Fst statistics from a sample of SNP data. Is there any module already available to do that in biopython, that I am missing? I saw there is a 'PopGen' module, but the Cookbook says it doesn't support sequence data. Is someone actually writing any module in python to calculate such statistics? -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Thu Oct 16 06:23:12 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Oct 2008 11:23:12 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> Message-ID: <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio wrote: > Hi, > I was going to write a python program to calculate Fst statistics from a > sample of SNP data. Is there any module already available to do that > in biopython, that I am missing? I saw there is a 'PopGen' module, but > the Cookbook says it doesn't support sequence data. > Is someone actually writing any module in python to calculate such > statistics? I think this will be a question for Tiago (the Bio.PopGen author), although others on the list may have also tackled similar questions. In terms of reading in the SNP data, what file format will you be loading? Does Bio.SeqIO currently suffice? Have you looked into what (if any) additional python libraries you would need? For any Biopython addition, a dependency on just numpy that would be preferable, but Tiago has previously suggested an optional dependency on scipy for additional statistics needed in population genetics. Peter From tiagoantao at gmail.com Thu Oct 16 10:10:47 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 16 Oct 2008 15:10:47 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> Message-ID: <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com> Hi, On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio wrote: > Hi, > I was going to write a python program to calculate Fst statistics from a > sample of SNP data. > Is there any module already available to do that in biopython, that I am > missing? > I saw there is a 'PopGen' module, but the Cookbook says it doesn't support > sequence data. > Is someone actually writing any module in python to calculate such > statistics? 
The answer to this has to be done in parts, because it is actually a bunch of related (but different) issues On the data 1. Sequence support. Bio.PopGen doesn't support statistics for sequences (like Tajima D and the like), BUT that is not relevant if you want to do frequency based statistics (like good old Fst), you just have to count frequencies and put into a "frequency format" 2. SNPs is actually not a sequence, but a single element, so it becomes easier. What you need at the end is something like this: For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0 For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0 For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10 And so on... You have to end up with frequency counts per population So, as long as you convert data (sequence, SNP, microsatellite) to frequency counts per population, there are no issues with the type of data. On calculating the statistics (Fst) 1. I am fully aware that core statistics like Fst (I work with Fst a lot myself) are fundamental in a population genetics module, but I sincerely don't know how to proceed because a long term solution requires generic statistical support (e.g., chi-square tests Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy (and I will not maintain generic statistics code myself). I know that Bio.PopGen is of little use without support for standard statistics. 2. A workaround (for which I have code written - but not commited to the repository - I can give it to you) is to invoke GenePop and get the Fst estimation. This requires the data to be in GenePop format (again you can convert SNPs and even sequences to frequency based format) 3. That being said, I have code to estimate Fst (Cockerham and Wier theta and a variation from Mark Beaumont) in Python. I can give it to you (but is not much tested). On sequence data formats: 1. Note that sequence data files (that I know off) have no provision for population structure (you cannot say, in a standard way, sequence X belongs to population Y). You have to do it in adhoc way. That means you have to invent your own convention for your private use. 2. Anyway, in your case I suppose you still have to extract the SNPs from the sequence. 3. If you want do frequency based analysis on your SNPs, I suggest you do a conversion to GenePop anyway (therefore you can import your data in most population structure software as GenePop format is the defacto standard)... 4. Because of the above there is actually no good solution for automated conversion from sequence information to frequency based one (in biopython or in any platform whatsoever) I can give more suggestions if you give more details or have more specific questions. From tiagoantao at gmail.com Thu Oct 16 10:14:28 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 16 Oct 2008 15:14:28 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com> Message-ID: <6d941f120810160714s61eb6d1cx87d1943c4068d491@mail.gmail.com> Just a minor point: I am so used to work in Fst that I mentally converted your "F-statistics" to Fst. Most of my mail still stands. The only point that changes a bit is that I only have code for Fst, so I cannot help you with any other. 
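Just to make the "frequency counts per population" idea above concrete, a bare-bones Wright-style calculation (total versus mean within-population expected heterozygosity - NOT the Cockerham and Weir theta mentioned before, and untested) for a single bi-allelic SNP could look like this:

def expected_het(counts):
    # Expected heterozygosity: 1 - sum of squared allele frequencies.
    total = float(sum(counts))
    return 1.0 - sum((c / total) ** 2 for c in counts)

def basic_fst(pop_counts):
    # pop_counts is a list of per-population allele count lists,
    # e.g. [[10, 20], [20, 0]] for the SNP 1 example above.
    hs = sum(expected_het(c) for c in pop_counts) / len(pop_counts)
    pooled = [sum(col) for col in zip(*pop_counts)]
    ht = expected_het(pooled)
    return (ht - hs) / ht

print basic_fst([[10, 20], [20, 0]])

For anything serious you would still want a proper estimator (or just call GenePop, as above).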
On Thu, Oct 16, 2008 at 3:10 PM, Tiago Ant?o wrote: > Hi, > > On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio > wrote: >> Hi, >> I was going to write a python program to calculate Fst statistics from a >> sample of SNP data. >> Is there any module already available to do that in biopython, that I am >> missing? >> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support >> sequence data. >> Is someone actually writing any module in python to calculate such >> statistics? > > The answer to this has to be done in parts, because it is actually a > bunch of related (but different) issues > > > On the data > 1. Sequence support. Bio.PopGen doesn't support statistics for > sequences (like Tajima D and the like), BUT that is not relevant if > you want to do frequency based statistics (like good old Fst), you > just have to count frequencies and put into a "frequency format" > 2. SNPs is actually not a sequence, but a single element, so it > becomes easier. What you need at the end is something like this: > For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0 > For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0 > For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10 > And so on... You have to end up with frequency counts per population > So, as long as you convert data (sequence, SNP, microsatellite) to > frequency counts per population, there are no issues with the type of > data. > > On calculating the statistics (Fst) > 1. I am fully aware that core statistics like Fst (I work with Fst a > lot myself) are fundamental in a population genetics module, but I > sincerely don't know how to proceed because a long term solution > requires generic statistical support (e.g., chi-square tests > Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy > (and I will not maintain generic statistics code myself). I know that > Bio.PopGen is of little use without support for standard statistics. > 2. A workaround (for which I have code written - but not commited to > the repository - I can give it to you) is to invoke GenePop and get > the Fst estimation. This requires the data to be in GenePop format > (again you can convert SNPs and even sequences to frequency based > format) > 3. That being said, I have code to estimate Fst (Cockerham and Wier > theta and a variation from Mark Beaumont) in Python. I can give it to > you (but is not much tested). > > > On sequence data formats: > 1. Note that sequence data files (that I know off) have no provision > for population structure (you cannot say, in a standard way, sequence > X belongs to population Y). You have to do it in adhoc way. That means > you have to invent your own convention for your private use. > 2. Anyway, in your case I suppose you still have to extract the SNPs > from the sequence. > 3. If you want do frequency based analysis on your SNPs, I suggest you > do a conversion to GenePop anyway (therefore you can import your data > in most population structure software as GenePop format is the defacto > standard)... > 4. Because of the above there is actually no good solution for > automated conversion from sequence information to frequency based one > (in biopython or in any platform whatsoever) > I can give more suggestions if you give more details or have more > specific questions. > -- "Data always beats theories. 'Look at data three times and then come to a conclusion,' versus 'coming to a conclusion and searching for some data.' 
The former will win every time." ?Matthew Simmons, http://www.tiago.org From biopython at maubp.freeserve.co.uk Thu Oct 16 11:11:27 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Oct 2008 16:11:27 +0100 Subject: [BioPython] back-translation method for Seq object? Message-ID: <320fb6e00810160811s19e580b2ped86c43b32c401bb@mail.gmail.com> Quoting from the recent thread about adding a translation method to the Seq object, Bruce brought up back-translation: Peter wrote: > Bruce wrote: >> Obviously reverse translation of a protein sequence to a DNA sequence is >> complex if there are many solutions. > > Yes, back-translation is tricky because there is generally more than > one codon for any amino acid. Ambiguous nucleotides can be used to > describe several possible codons giving that amino acid, but in > general it is not possible to do this and describe all the possible > codons which could have been used. This topic is worth of an entire > thread... for the record, I would envisage a back_translate method for > the Seq object (assuming we settle on translate as the name for the > forward translation from nucleotide to protein). Do we actually need a back_translate method? Can anyone suggest an actual use-case for this? It seems difficult to imagine that any simple version would please everyone. Bio.Translate (a semi-obsolete module whose deprecation has been suggested) provides a back_translate method which picks an essentially arbitrary but unambiguous codon for each amino acid. Crude but simple. A more meaningful choice would require suppling codon frequencies for the organism under consideration. Other possibilities include using ambiguous nucleotides to try and cover all the possibilities (e.g. "L" -> "CTN"), but even here in some cases this is arbritary. e.g. The standard three stop codons ['TAA', 'TAG', 'TGA'] could be represented as ['TAR', 'TGA'] or ['TRA', 'TAG'] but not by a single ambiguous codon ('TRR' also covers 'TGG' which codes for 'W'). Potentially of use would be a generator function which returned all possible back translations - but this would be complex and typically overkill. As a final point, a Seq object back-translation method could give RNA or DNA. From a biological point of view giving DNA by default would make sense. This choice is handled in Bio.Translate when creating the translator object (part of what makes Bio.Translate relatively complex to use). Peter From sdavis2 at mail.nih.gov Thu Oct 16 11:16:51 2008 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 16 Oct 2008 11:16:51 -0400 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com> Message-ID: <264855a00810160816g6ea765dcl65fd7e5aa38ef20c@mail.gmail.com> On Thu, Oct 16, 2008 at 10:10 AM, Tiago Ant?o wrote: > Hi, > > On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio > wrote: >> Hi, >> I was going to write a python program to calculate Fst statistics from a >> sample of SNP data. >> Is there any module already available to do that in biopython, that I am >> missing? >> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support >> sequence data. >> Is someone actually writing any module in python to calculate such >> statistics? 
> > The answer to this has to be done in parts, because it is actually a > bunch of related (but different) issues > > > On the data > 1. Sequence support. Bio.PopGen doesn't support statistics for > sequences (like Tajima D and the like), BUT that is not relevant if > you want to do frequency based statistics (like good old Fst), you > just have to count frequencies and put into a "frequency format" > 2. SNPs is actually not a sequence, but a single element, so it > becomes easier. What you need at the end is something like this: > For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0 > For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0 > For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10 > And so on... You have to end up with frequency counts per population > So, as long as you convert data (sequence, SNP, microsatellite) to > frequency counts per population, there are no issues with the type of > data. > > On calculating the statistics (Fst) > 1. I am fully aware that core statistics like Fst (I work with Fst a > lot myself) are fundamental in a population genetics module, but I > sincerely don't know how to proceed because a long term solution > requires generic statistical support (e.g., chi-square tests > Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy > (and I will not maintain generic statistics code myself). I know that > Bio.PopGen is of little use without support for standard statistics. > 2. A workaround (for which I have code written - but not commited to > the repository - I can give it to you) is to invoke GenePop and get > the Fst estimation. This requires the data to be in GenePop format > (again you can convert SNPs and even sequences to frequency based > format) > 3. That being said, I have code to estimate Fst (Cockerham and Wier > theta and a variation from Mark Beaumont) in Python. I can give it to > you (but is not much tested). > > > On sequence data formats: > 1. Note that sequence data files (that I know off) have no provision > for population structure (you cannot say, in a standard way, sequence > X belongs to population Y). You have to do it in adhoc way. That means > you have to invent your own convention for your private use. > 2. Anyway, in your case I suppose you still have to extract the SNPs > from the sequence. > 3. If you want do frequency based analysis on your SNPs, I suggest you > do a conversion to GenePop anyway (therefore you can import your data > in most population structure software as GenePop format is the defacto > standard)... > 4. Because of the above there is actually no good solution for > automated conversion from sequence information to frequency based one > (in biopython or in any platform whatsoever) > I can give more suggestions if you give more details or have more > specific questions. Just a little note that the R programming language has some packages for population genetics and, of course, has excellent statistical tools. One can interface with it via rpy. I'm not advocating going this route, but just wanted to let people know about another option. 
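To make the "frequency format" Tiago describes above concrete, here is a minimal plain-Python sketch (this is not part of Bio.PopGen; the genotype data and the allele_counts name are invented for illustration) that collapses per-individual SNP genotypes into allele counts per population:

from collections import defaultdict

# One SNP; genotypes per population as (allele1, allele2) pairs.
snp1 = {
    "Pop 1": [("A", "A"), ("A", "C"), ("A", "C")],
    "Pop 2": [("A", "C"), ("A", "C"), ("A", "C")],
}

def allele_counts(pop_genotypes):
    """Collapse diploid genotypes into allele counts per population."""
    counts = {}
    for pop, individuals in pop_genotypes.items():
        tally = defaultdict(int)
        for allele1, allele2 in individuals:
            tally[allele1] += 1
            tally[allele2] += 1
        counts[pop] = dict(tally)
    return counts

print(allele_counts(snp1))
# e.g. {'Pop 1': {'A': 4, 'C': 2}, 'Pop 2': {'A': 3, 'C': 3}}

Microsatellite alleles or SNPs extracted from sequences could be pushed through the same function, which is Tiago's point that the original data type stops mattering once everything has been reduced to counts per population.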
Sean From tiagoantao at gmail.com Thu Oct 16 11:26:52 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 16 Oct 2008 16:26:52 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <264855a00810160816g6ea765dcl65fd7e5aa38ef20c@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com> <264855a00810160816g6ea765dcl65fd7e5aa38ef20c@mail.gmail.com> Message-ID: <6d941f120810160826q2bf25382m41890fb39a4226a0@mail.gmail.com> The task view on Genetics for R provides a good starting point to find R packages related to the field: http://www.freestatistics.org/cran/web/views/Genetics.html On Thu, Oct 16, 2008 at 4:16 PM, Sean Davis wrote: > On Thu, Oct 16, 2008 at 10:10 AM, Tiago Ant?o wrote: >> Hi, >> >> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio >> wrote: >>> Hi, >>> I was going to write a python program to calculate Fst statistics from a >>> sample of SNP data. >>> Is there any module already available to do that in biopython, that I am >>> missing? >>> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support >>> sequence data. >>> Is someone actually writing any module in python to calculate such >>> statistics? >> >> The answer to this has to be done in parts, because it is actually a >> bunch of related (but different) issues >> >> >> On the data >> 1. Sequence support. Bio.PopGen doesn't support statistics for >> sequences (like Tajima D and the like), BUT that is not relevant if >> you want to do frequency based statistics (like good old Fst), you >> just have to count frequencies and put into a "frequency format" >> 2. SNPs is actually not a sequence, but a single element, so it >> becomes easier. What you need at the end is something like this: >> For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0 >> For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0 >> For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10 >> And so on... You have to end up with frequency counts per population >> So, as long as you convert data (sequence, SNP, microsatellite) to >> frequency counts per population, there are no issues with the type of >> data. >> >> On calculating the statistics (Fst) >> 1. I am fully aware that core statistics like Fst (I work with Fst a >> lot myself) are fundamental in a population genetics module, but I >> sincerely don't know how to proceed because a long term solution >> requires generic statistical support (e.g., chi-square tests >> Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy >> (and I will not maintain generic statistics code myself). I know that >> Bio.PopGen is of little use without support for standard statistics. >> 2. A workaround (for which I have code written - but not commited to >> the repository - I can give it to you) is to invoke GenePop and get >> the Fst estimation. This requires the data to be in GenePop format >> (again you can convert SNPs and even sequences to frequency based >> format) >> 3. That being said, I have code to estimate Fst (Cockerham and Wier >> theta and a variation from Mark Beaumont) in Python. I can give it to >> you (but is not much tested). >> >> >> On sequence data formats: >> 1. Note that sequence data files (that I know off) have no provision >> for population structure (you cannot say, in a standard way, sequence >> X belongs to population Y). 
You have to do it in adhoc way. That means >> you have to invent your own convention for your private use. >> 2. Anyway, in your case I suppose you still have to extract the SNPs >> from the sequence. >> 3. If you want do frequency based analysis on your SNPs, I suggest you >> do a conversion to GenePop anyway (therefore you can import your data >> in most population structure software as GenePop format is the defacto >> standard)... >> 4. Because of the above there is actually no good solution for >> automated conversion from sequence information to frequency based one >> (in biopython or in any platform whatsoever) >> I can give more suggestions if you give more details or have more >> specific questions. > > Just a little note that the R programming language has some packages > for population genetics and, of course, has excellent statistical > tools. One can interface with it via rpy. I'm not advocating going > this route, but just wanted to let people know about another option. > > Sean > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- "Data always beats theories. 'Look at data three times and then come to a conclusion,' versus 'coming to a conclusion and searching for some data.' The former will win every time." ?Matthew Simmons, http://www.tiago.org From lpritc at scri.ac.uk Fri Oct 17 04:24:43 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Fri, 17 Oct 2008 09:24:43 +0100 Subject: [BioPython] back-translation method for Seq object? In-Reply-To: <320fb6e00810160811s19e580b2ped86c43b32c401bb@mail.gmail.com> Message-ID: On 16/10/2008 16:11, "Peter" wrote: > Quoting from the recent thread about adding a translation method to > the Seq object, Bruce brought up back-translation: > > Peter wrote: >> Bruce wrote: >>> Obviously reverse translation of a protein sequence to a DNA sequence is >>> complex if there are many solutions. This is the key problem. Forward translation is - for a given codon table - a one-one mapping. Reverse translation is (for many amino acids) one-many. If the goal is to produce the coding sequence that actually encoded a particular protein sequence, the problem is combinatorial and rapidly becomes messy with increasing sequence length. And that's not considering the problem of splice variants/intron-exon boundaries if attempting to relate the sequence back to some genome or genome fragment - more a problem in eukaryotes. >> Yes, back-translation is tricky because there is generally more than >> one codon for any amino acid. Ambiguous nucleotides can be used to >> describe several possible codons giving that amino acid, but in >> general it is not possible to do this and describe all the possible >> codons which could have been used. This topic is worth of an entire >> thread... for the record, I would envisage a back_translate method for >> the Seq object (assuming we settle on translate as the name for the >> forward translation from nucleotide to protein). > > Do we actually need a back_translate method? Can anyone suggest an > actual use-case for this? It seems difficult to imagine that any > simple version would please everyone. I agree - I can't think of an occasion where I might want to back-translate a protein in this way that wouldn't better be handled by other means. 
Not that I'm the fount of all use-cases but, given the number of ways in which one *could* back-translate, perhaps it would be better not to pick/guess at any single one. Some choices to be made in deciding how to back-translate are (and I'm sure you've already thought of them, but they're worth writing down): I) Protein to unambiguous RNA: a) Codon table: arbitrary; organism-specific; user-defined? b) Codon choice: arbitrary and random; arbitrary and consistent; complete set of possibilities; most common codon (if information available); other favoured codon (if specified)? II) Protein to ambiguous RNA: a) Return a Seq, string or some other representation of ambiguity? b) IUPAC ambiguity symbols; choice of codons; alternative representation of ambiguity? The most common back-translation I do is taking aligned protein sequences back to their known coding sequences, and this is really a case of mapping known codons onto predefined positions, rather than the interpolation of unknown codons that is required for back-translation as implied above. T-coffee handles this pretty well, IIRC. To find coding sequences for a particular protein in the originating sequence (if known), I use BLAST. I guess there might be value in having the ability to identify regions of the coding sequence that are least likely to be variable (by generating them combinatorially) so that probes might be designed if the coding sequence is not known. But that doesn't appear to be the way that most sequences are obtained these days: much cheaper to bung RNA through 454 or Solexa and work through the output than to put someone on the task of making an array of probes to find a sequence that may or may not encode your sequenced protein... > Bio.Translate (a semi-obsolete module whose deprecation has been > suggested) provides a back_translate method which picks an essentially > arbitrary but unambiguous codon for each amino acid. Crude but > simple. A more meaningful choice would require suppling codon > frequencies for the organism under consideration. These can be found - for many organisms - in Emboss codon usage table (.cut) files, if you have Emboss locally. However, is requiring Emboss as a dependency the cleanest or wisest solution for Biopython? This approach solves only one problem: given a particular codon usage table, what is the most likely sequence that would have produced this protein. That's not a problem I've ever come across in anger, but given a table of 'most efficient codons' for some biological expression system, I can see this potentially having some use. However, given that many microbiologists can already tell you the preferred codons for K12 without pausing for breath, I'm not sure there's a problem looking for this solution. > Other possibilities include using ambiguous nucleotides to try and > cover all the possibilities (e.g. "L" -> "CTN"), but even here in some > cases this is arbritary. e.g. The standard three stop codons ['TAA', > 'TAG', 'TGA'] could be represented as ['TAR', 'TGA'] or ['TRA', 'TAG'] > but not by a single ambiguous codon ('TRR' also covers 'TGG' which > codes for 'W'). If Seq had an ambiguity-aware sequence representation, this could be handled. For example, a regular expression-based sequence representation (which could lie alongside Seq.data, perhaps as Seq.regex) could represent these variants as (TAA|TAG|TGA), and alternatively the usual ambiguity codes could also be handled in a similar way (e.g. R as [AG]). 
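As a rough sketch of the regular-expression idea just described (illustrative only, not an existing Seq feature; the codon table below is hard-coded for a handful of residues rather than taken from Bio.Data.CodonTable):

import re

# Standard-code codons for a few amino acids (illustrative subset only).
codons = {
    "M": ["ATG"],
    "W": ["TGG"],
    "C": ["TGT", "TGC"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "*": ["TAA", "TAG", "TGA"],   # the three standard stop codons
}

def back_translate_pattern(protein):
    """Build a regular expression matching any DNA that could encode protein."""
    return "".join("(?:" + "|".join(codons[aa]) + ")" for aa in protein)

pattern = back_translate_pattern("MLC*")
print(pattern)   # (?:ATG)(?:TTA|TTG|CTT|CTC|CTA|CTG)(?:TGT|TGC)(?:TAA|TAG|TGA)
print(bool(re.match(pattern, "ATGCTGTGCTGA")))   # True

The same table could also drive itertools.product(*[codons[aa] for aa in protein]) to enumerate every explicit back-translation, which is exactly the combinatorial explosion discussed earlier in the thread.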
This would be of some limited use, but would permit sequence searching within Biopython, at least. > Potentially of use would be a generator function which returned all > possible back translations - but this would be complex and typically > overkill. I think that, for large sequences, this could quickly swamp the user. What do you see as the use of this output? > As a final point, a Seq object back-translation method could give RNA > or DNA. From a biological point of view giving DNA by default would > make sense. This choice is handled in Bio.Translate when creating the > translator object (part of what makes Bio.Translate relatively complex > to use). Since there is a one-one map of RNA to DNA, I'm easy about either choice on a computational level. Biologically-speaking, DNA -> RNA is transcription, and RNA -> protein is translation, so I'd expect back-translation to convert protein -> RNA, and back-transcription to convert RNA -> DNA. Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From dalloliogm at gmail.com Fri Oct 17 05:39:41 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Fri, 17 Oct 2008 11:39:41 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> Message-ID: <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> On Thu, Oct 16, 2008 at 12:23 PM, Peter wrote: > On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio > wrote: > > Hi, > > I was going to write a python program to calculate Fst statistics from a > > sample of SNP data. Is there any module already available to do that > > in biopython, that I am missing? I saw there is a 'PopGen' module, but > > the Cookbook says it doesn't support sequence data. > > Is someone actually writing any module in python to calculate such > > statistics? 
> > I think this will be a question for Tiago (the Bio.PopGen author), > although others on the list may have also tackled similar questions. > > In terms of reading in the SNP data, what file format will you be > loading? Does Bio.SeqIO currently suffice? > Hi, thank you very much all of you for the replies. Actually I am going to use tped[1] and tfam[1] files as input, formatted with the plink program[2]. Bio.SeqIO doesn't support these format, but this is right because they don't cointain only sequences but rather elements like Tiago was saying. Let's say I try to write a parser for these two file formats. In which biopython object should I save them? Is there any kind of 'Individual' or 'Population' object in biopython? I see from the cookbook that Bio.GenPop.Record is representanting populations and individual as list[3], and that there is not a 'Population' or 'Individual' object. I think that it is a good approach, because these kind of files tend to be very big and instantiating an Individual object instead of a tuple for every line of the file would be take much memory. But are you going to implement some kind of 'Individual' or 'Population' object? Moreover, python 2.6 will implement a new kind of data object, called 'named tuple' [4], to implement these kind of records. It could be a good compromise (maybe I'll better start a new thread about this and explain better). [1] tped, tfam: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#tr [2] plink: http://pngu.mgh.harvard.edu/~purcell/plink/index.shtml [3] biopython cookbook, popgen: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc112 [4] named tuples in python 2.6: http://code.activestate.com/recipes/500261/ > > Have you looked into what (if any) additional python libraries you > would need? For any Biopython addition, a dependency on just numpy > that would be preferable, but Tiago has previously suggested an > optional dependency on scipy for additional statistics needed in > population genetics. > > Peter > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From dalloliogm at gmail.com Fri Oct 17 06:03:32 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Fri, 17 Oct 2008 12:03:32 +0200 Subject: [BioPython] named tuples for biopython? Message-ID: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com> Hi, python 2.6 is going to implement a new kind of data (like lists, strings, etc..) called 'named_tuple'. It is intended to be a better data format to be used when parsing record files and databases. You can download the recipe from here (it should be included experimentally in python 2.6): - http://code.activestate.com/recipes/500261/ Basically, you instantiate a named_tuple object with this syntax: >> Person = NamedTuple("Person name surname") "Person" is a label for the named_tuple; the following fields, 'name' and 'surname' Then you will have named_tuple object which is basically a mix between a dictionary, a custom class and a tuple: >> Person = NamedTuple("Person name surname") >> Einstein = Person('Albert', 'Einstein') >> Einstein.name 'Albert' >> Einstein.surname 'Einstein' >> people = [] >> for line in f.readlines(): >> people.append(Person(line.split()) >> >> for person in people: >> print person.name, person.surname named_tuples are also read-only object, so they should be less memory-expensive It is like tuples against lists, but more customizable. 
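For the record, the loop quoted above will not run as written: the Person(...) call is missing a closing parenthesis and the split fields need to be unpacked into separate arguments. A corrected sketch against the Python 2.6 collections.namedtuple API (people.txt is a hypothetical file with one "name surname" pair per line) would be:

from collections import namedtuple

Person = namedtuple("Person", "name surname")

people = []
for line in open("people.txt"):            # hypothetical input file
    people.append(Person(*line.split()))   # note the * to unpack the two fields

for person in people:
    print person.name, person.surname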
I am really not good ad explaining, and I can't find a good tutorial that illustrate this. I read a good article about named_tuples, but it is in italian language ( http://stacktrace.it/2008/05/gestione-dei-record-python-1/). Maybe you can understand the code examples. Has any of you heard about this new data type? Do you think it could be useful for biopython? There is a lot of file parsing / database interfacing in bioinformatics :) p.s. since I didn't trust HTML-based mails to keep code formatting, I also posted this same message on nodalpoint: http://www.nodalpoint.org/2008/10/17/python_2_6_will_implement_a_new_data_format_named_tuple_can_it_be_of_use_for_biopython -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From dalloliogm at gmail.com Fri Oct 17 06:11:23 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Fri, 17 Oct 2008 12:11:23 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810160714s61eb6d1cx87d1943c4068d491@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com> <6d941f120810160714s61eb6d1cx87d1943c4068d491@mail.gmail.com> Message-ID: <5aa3b3570810170311g1d92dc52q41616cd6dc58fb03@mail.gmail.com> On Thu, Oct 16, 2008 at 4:14 PM, Tiago Ant?o wrote: > Just a minor point: I am so used to work in Fst that I mentally > converted your "F-statistics" to Fst. Most of my mail still stands. > The only point that changes a bit is that I only have code for Fst, so > I cannot help you with any other. > > On Thu, Oct 16, 2008 at 3:10 PM, Tiago Ant?o wrote: > > > 3. That being said, I have code to estimate Fst (Cockerham and Wier > > theta and a variation from Mark Beaumont) in Python. I can give it to > > you (but is not much tested). > > > Thank you.. Can you please send me this code that you are using to calculate Fst statistics with python? I can't guarantee I will use it (most of the people here use perl and bioperl, but I would prefer python), but maybe I can help you testing it. > > > > > > -- > "Data always beats theories. 'Look at data three times and then come > to a conclusion,' versus 'coming to a conclusion and searching for > some data.' The former will win every time." > ?Matthew Simmons, > http://www.tiago.org > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Fri Oct 17 06:17:51 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Oct 2008 11:17:51 +0100 Subject: [BioPython] named tuples for biopython? In-Reply-To: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com> References: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com> Message-ID: <320fb6e00810170317w24fe34a4p1884c4264f3e7363@mail.gmail.com> On Fri, Oct 17, 2008 at 11:03 AM, Giovanni Marco Dall'Olio wrote: > Hi, > python 2.6 is going to implement a new kind of data (like lists, strings, > etc..) called 'named_tuple'. It is intended to be a better data format to > be used when parsing record files and databases. I'd just seen this today actually via another mailing list. 
Here is a short example which actually works on python 2.6 (the details have changed slightly from your quote), >>> from collections import namedtuple >>> Person = namedtuple("Person", "name surname") >>> x = Person("Albert", "Einstein") >>> x Person(name='Albert', surname='Einstein') >>> x.name 'Albert' >>> x.surname 'Einstein' >>> x.keys() Traceback (most recent call last): File "", line 1, in AttributeError: 'Person' object has no attribute 'keys' >>> x["name"] Traceback (most recent call last): File "", line 1, in TypeError: tuple indices must be integers, not str >>> x[0] 'Albert' >>> x[1] 'Einstein' So this doesn't act much like a dictionary (in terms of the x[...] usage), so we can't use it as a drop in enhancement for existing dictionaries in Biopython. I expect there are some places where a namedtuple would make sense (although using it might break backwards compatibility). Also, if we did want to use NamedTuple in Biopython we'd have to include a copy for use on older versions of python. This is probably possible under the python license... but would require an implementation that still worked on pre 2.6. Peter From lpritc at scri.ac.uk Fri Oct 17 06:52:33 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Fri, 17 Oct 2008 11:52:33 +0100 Subject: [BioPython] named tuples for biopython? In-Reply-To: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com> Message-ID: On 17/10/2008 11:03, "Giovanni Marco Dall'Olio" wrote: > Hi, > python 2.6 is going to implement a new kind of data (like lists, strings, > etc..) called 'named_tuple'. > It is intended to be a better data format to be used when parsing record > files and databases. > > You can download the recipe from here (it should be included experimentally > in python 2.6): > - http://code.activestate.com/recipes/500261/ The explanation here was pretty clear, to me: http://docs.python.org/dev/library/collections.html#collections.namedtuple > Has any of you heard about this new data type? Not until you mentioned it - thanks for the heads-up. > Do you think it could be > useful for biopython? There is a lot of file parsing / database interfacing > in bioinformatics :) I can see it being a useful collection type. It reminds me of C structs, and looks like a near-perfect fit to many db table entries, and to csv/ATF-format files for which the column headers can be used to define attributes. I guess that one disadvantage of namedtuples, compared to, e.g. a dictionary in which each value is itself a dictionary of attributes (with attribute names for keys), is that there's a restricted character/word set available for attribute names in the namedtuple, but this is not important for dictionary keys, so some additional tally of header to attribute name may be necessary. This has a real use-case in, say, parsing ATF format files... http://www.moleculardevices.com/pages/software/gn_genepix_file_formats.html ... where on-the-fly creation of attributes with the same name as in the parsed file or table row may not be possible with a namedtuple. If you know of the column/field names in advance though, it shouldn't be an issue. L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. 
The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From tiagoantao at gmail.com Fri Oct 17 14:07:18 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 17 Oct 2008 19:07:18 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> Message-ID: <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> Hi, On Fri, Oct 17, 2008 at 10:39 AM, Giovanni Marco Dall'Olio wrote: > Let's say I try to write a parser for these two file formats. In which > biopython object should I save them? Is there any kind of 'Individual' or > 'Population' object in biopython? > I see from the cookbook that Bio.GenPop.Record is representanting > populations and individual as list[3], and that there is not a 'Population' > or 'Individual' object. No, there are no concepts of individuals or populations for now. Bio.PopGen.GenePop is just a representation of a GenePop file (which is a de facto standard in frequency based population genetics). Currently Bio.PopGen philosophy is more of a wrapper for existing software (e.g., I don't implement a coalescent simulator, like in BioPerl, I wrap Simcoal2). The disadvantage is that it is not "Pure Python" and is dependent on external applications. The advantage is that, if the external application is good, than good functionality becomes available inside Biopython. For example, coalescent simulation in BioPerl is (at least last time I've checked it) orders of magnitude less flexible than BioPython's (based on SimCoal2). In this philosophy, I now have a (partial) wrapper for the GenePop application to calculate statistics (voila, Fst). That doesn't mean that core statistics functionality should not be available in Bio.PopGen. I think it should be (that is why I have quite done work on that - implementing from scratch Fst, allelic richness, expected heterosigosity, ...). The same goes to the concept of Population and Individual. For a number of cumulative reasons, the work on that front is stalled. But, if there is some interest, I would more than welcome reopening that front... 
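Purely as a strawman for reopening that front (nothing like this exists in Bio.PopGen at the time of writing, and every name below is hypothetical), the kind of lightweight objects being discussed might look like:

class Individual(object):
    """One sampled individual: an identifier plus one genotype tuple per marker."""
    def __init__(self, ident, genotypes):
        self.ident = ident
        self.genotypes = genotypes        # e.g. [("a", "c"), ("140", "142")]

class Population(object):
    """A named collection of individuals typed at the same markers."""
    def __init__(self, name, marker_names):
        self.name = name
        self.marker_names = marker_names
        self.individuals = []

    def add_individual(self, individual):
        self.individuals.append(individual)

    def allele_counts(self, marker_index):
        """Allele counts for one marker - the 'frequency format' again."""
        counts = {}
        for ind in self.individuals:
            for allele in ind.genotypes[marker_index]:
                counts[allele] = counts.get(allele, 0) + 1
        return counts

pop = Population("Pop 1", ["SNP 1"])
pop.add_individual(Individual("ind1", [("a", "a")]))
pop.add_individual(Individual("ind2", [("a", "c")]))
print(pop.allele_counts(0))   # {'a': 3, 'c': 1}

Whether such classes belong in Bio.PopGen, or whether plain lists and GenePop records remain the lingua franca, is exactly the design question left open here.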
> Moreover, python 2.6 will implement a new kind of data object, called 'named > tuple' [4], to implement these kind of records. It could be a good > compromise (maybe I'll better start a new thread about this and explain > better). I think the ad-hoc policy in Biopython is to support previous versions of Python, so I don't think it will be easy to do things in a 2.6 only way (although, for NEW functionality, from my part, I don't see a problem with it). Tiago From bsouthey at gmail.com Fri Oct 17 14:46:19 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Fri, 17 Oct 2008 13:46:19 -0500 Subject: [BioPython] back-translation method for Seq object? In-Reply-To: References: Message-ID: <48F8DD7B.7010909@gmail.com> Leighton Pritchard wrote: > On 16/10/2008 16:11, "Peter" wrote: > > >> Quoting from the recent thread about adding a translation method to >> the Seq object, Bruce brought up back-translation: >> >> Peter wrote: >> >>> Bruce wrote: >>> >>>> Obviously reverse translation of a protein sequence to a DNA sequence is >>>> complex if there are many solutions. >>>> > > This is the key problem. Forward translation is - for a given codon table - > a one-one mapping. Reverse translation is (for many amino acids) one-many. > If the goal is to produce the coding sequence that actually encoded a > particular protein sequence, the problem is combinatorial and rapidly > becomes messy with increasing sequence length. And that's not considering > the problem of splice variants/intron-exon boundaries if attempting to > relate the sequence back to some genome or genome fragment - more a problem > in eukaryotes. > If you use a regular expression or a tree structure then there is a one-one mapping but then that would probably best as a subclass of Seq. Note you still would need a method to transverse it if you wanted to get a sequence from it as well as an reverse complement. It is fairly trivial to get a regular expression for it for the standard genetic code but I did not get my reverse complement to work satisfactory nor did I try to get DNA sequence from the regular expression. I would suggest tools like Wise2 and exonerate (http://www.ebi.ac.uk/~guy/exonerate/) are the solution to solving gene structure problems than using a Seq object. Obviously if you start with a DNA sequence, then you could create object that has a DNA/RNA Seq object and a protein Seq object(s) that contain the translation(s) like in Genbank DNA records that contain the translation. But that really avoids the issue here. >>> Yes, back-translation is tricky because there is generally more than >>> one codon for any amino acid. Ambiguous nucleotides can be used to >>> describe several possible codons giving that amino acid, but in >>> general it is not possible to do this and describe all the possible >>> codons which could have been used. This topic is worth of an entire >>> thread... for the record, I would envisage a back_translate method for >>> the Seq object (assuming we settle on translate as the name for the >>> forward translation from nucleotide to protein). >>> >> Do we actually need a back_translate method? Can anyone suggest an >> actual use-case for this? It seems difficult to imagine that any >> simple version would please everyone. >> > > I agree - I can't think of an occasion where I might want to back-translate > a protein in this way that wouldn't better be handled by other means. 
Not > that I'm the fount of all use-cases but, given the number of ways in which > one *could* back-translate, perhaps it would be better not to pick/guess at > any single one. > Apart from the academic aspect, my main use is searching for protein motifs/domains, enzyme cleavage sites, finding very short combinations of amino acids and binding sites (I do not do this but it is the same) in DNA sequences especially genomic sequence. These are usually very small and, thus, unsuitable for most tools. One of my uses is with peptide identification and de novo sequencing using mass spectrometry when you don't know the actual protein or gene sequence. It also has the problem that certain amino acids have very similar mass so you would need to Regardless of whether you use a regular expression query or not you still need a back translation of the protein query and probably the reverse complement. Another case where it would be useful is that tools like TBLASTN gives protein alignments so you must open the DNA sequence and find the DNA region based on the protein alignment. Bruce From dalloliogm at gmail.com Sun Oct 19 10:50:54 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Sun, 19 Oct 2008 16:50:54 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> Message-ID: <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> On Sat, Oct 18, 2008 at 6:50 PM, Tiago Ant?o wrote: > > here have used bioperl Bio::PopGen::PopStat, but we saw that using that > > module as it is now in bioperl is too much computationally-expensive for > our > > resources. > > So, we are going to either refactor the bioperl function, or to write > custom > > scripts in python to calculate Fst. > > I can program perl, but I would prefer to use python in use, since I like > > object oriented programming. > > You can find my (completely unofficial, completely untested) PopGen module > here: > http://popgen.eu/PopGen.tar.gz > You should take a biopython distro and replace the PopGen directory > with the contents of this one. > ok, thank you very much!! I would like to use git to keep track of the changes I will make to the code. What do you think if I'll upload it to http://github.com and then upload it back on biopython when it is finished? I am not sure, but I think it would be possible to convert the logs back to cvs to reintegrate the changes in biopython. > There are 2 ways to calculate Fst: > Doing something this: > from Bio.PopGen.Stats.Structural import Fst > > fst = Fst() > fst.add_pop('Pop 1', [('a', 'a'), ('a', 'c'), ('a','c')]) > fst.add_pop('Pop 2', [('a', 'c'), ('a', 'c'), ('a','c')]) > One of the problems we are having here, is that it takes too much RAM memory to store all the information about characters for every population. I was going to write a Population object, in which I'll store only the total count of heterozygotes, individuals, and what is needed, instead of the information about characters (('a', 'a'), ('a', 'c'), ...) 
It is something like this:

class Population:
    markers = []

class Marker:
    total_heterozygotes_count = 0
    total_population_count = 0
    total_Purines_count = 0  # this could be renamed, of course
    total_Pyrimidines_count = 0

> > Or using the new GenePop code (see GenePop/Controller.py), by using > genepop to calculate Fsts. > > A few comments: > 1. I don't trust my own Fst code (not tested at all, I am actually > using GenePop as above). You can find it on PopGen.Stats.Structural > (Fst, and also FstBeaumont). There is code there for Fst, Fis and Fit. > Also Fk (I trust the Fk code, but it's the only one) I will ask my group leader to help me in writing down some good test data. I'll let you know when I speak with him. > > 2. If your problem is performance, I think you have to go to a faster > language. Scripting languages strongly underperform on the speed issue. > I find this problem lots of times. C, C++ and Java (yes, Java for > performance) is what I use. Perl, Python and other scripting languages > are quite bad performance-wise. I know.. but I think this time, the problem is in memory usage. > 3. You can find a Fst implementation in C++ on simuPop (see file > stator.cpp). GenePop code must also have Fst implemented. > 4. I have a Fst based application using Biopython PopGen with Fst (but > for another application) - Fdist, you can find it at: > http://www.biomedcentral.com/1471-2105/9/323 . Module Bio.PopGen.FDist > (incidentally, you can also use this to calculate Fst ;) ). > 5. My code on Bio.PopGen.Stats is surely not in its final form. I have > a plan to change it massively. If you are interested in participating > in the discussion, you are welcome. > > > This is to say that if you want, we can work on the same code, and > > contribute it to biopython. > > This would be most welcome. I have almost no sense of ownership over > the code that is on Bio.PopGen. So, if you work on this, go ahead! > > > > I am writing a ped file parser (everybody here is used to this format, and I > > don't know GenePop :( ), and a simple script that calculates Fst with the > > most basic formula. > > I am also trying to design some good tests, and I am using subversion as a > > source control system. > > Maybe I can also send this to you, so you can have a look (but it is still > > very basic, I started yesterday). > > Again, any contribution would be most welcome. Regarding parsers I > would suggest you have a look at how parsers are done in Biopython. > I am following the "standard". You can find an example in > Bio.PopGen.GenePop.__init__.py. From my point of view I have nothing > against a "non standard" parser as long as it is documented and > commented. Thank you very much.. I know more or less how parsers are written in biopython, but I have never written one myself. > > > Again, feel free to take this discussion to biopython-dev, especially > if you are willing to contribute.
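Since Giovanni mentions wanting "a simple script that calculates Fst with the most basic formula", here is a sketch of that textbook estimator, Fst = (HT - HS) / HT, computed from allele counts per population. To be clear, this is only the naive heterozygosity-based formula, not the Weir and Cockerham theta estimator that Tiago recommends later in the thread, and it averages frequencies without weighting, so it implicitly assumes equal sample sizes; the function name and data are made up for illustration.

def basic_fst(pop_allele_counts):
    """Naive Fst = (HT - HS) / HT for one locus.

    pop_allele_counts is a list of {allele: count} dicts, one per population.
    """
    # Per-population allele frequencies
    freqs = []
    for counts in pop_allele_counts:
        total = float(sum(counts.values()))
        freqs.append(dict((allele, n / total) for allele, n in counts.items()))
    alleles = set()
    for f in freqs:
        alleles.update(f)
    npops = len(freqs)
    # HS: mean expected heterozygosity within subpopulations
    hs = sum(1.0 - sum(f.get(a, 0.0) ** 2 for a in alleles) for f in freqs) / npops
    # HT: expected heterozygosity of the pooled population (unweighted mean frequencies)
    mean_freq = dict((a, sum(f.get(a, 0.0) for f in freqs) / npops) for a in alleles)
    ht = 1.0 - sum(p ** 2 for p in mean_freq.values())
    return (ht - hs) / ht

print(basic_fst([{"a": 4, "c": 2}, {"a": 3, "c": 3}]))   # roughly 0.029

Allele-count dictionaries of this shape are exactly what a ped/tped parser would need to produce per population and per SNP.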
> -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From mmokrejs at ribosome.natur.cuni.cz Sun Oct 19 11:52:29 2008 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Sun, 19 Oct 2008 17:52:29 +0200 Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> Message-ID: <48FB57BD.7070705@ribosome.natur.cuni.cz> Hi, I have been away for 2 weeks but although late, let me oppose that string.translate() is of use. Here is my current code: # make sure no unallowed chars are present in the sequence if type == "DNA": if not _sequence.translate(string.maketrans('', ''),'GgAaTtCc'): if not _sequence.translate(string.maketrans('', ''),'GgAaTtCcBbDdSsWw'): if not _sequence.translate(string.maketrans('', ''),'GgAaTtCcRrYyWwSsMmKkHhBbVvDdNn'): raise ValueError, "DNA sequence contains unallowed characters: " + str(_sequence.translate(string.maketrans('', ''),'GgAaTtCcRrYyWwSsMmKkHhBbVvDdNn')) else: _warning = "DNA sequence contains IUPACAmbiguousDNA characters, which cannot be interpreted uniquely. Please try to find sequence of higher quality." else: _warning = "DNA sequence contains ExtendedIUPACDNA characters. " + str(_sequence.translate(string.maketrans('', ''),'GATC')) + " Please try to find sequence of higher quality." elif type == "RNA": if not _sequence.translate(string.maketrans('', ''),'GgAaUuCc'): if not _sequence.translate(string.maketrans('', ''),'GgAaUuCcRrYyWwSsMmKkHhBbVvDdNn'): raise ValueError, "RNA sequence contains unallowed characters: " + str(_sequence.translate(string.maketrans('', ''),'GgAaTtCcRrYyWwSsMmKkHhBbVvDdNn')) else: _warning = "RNA sequence contains ExtendedIUPACDNA characters. " + str(_sequence.translate(string.maketrans('', ''),'GgAaUuCc')) + " Please try to find sequence of higher quality." _sequence = _sequence.translate(string.maketrans('Uu', 'Tt')) return (_warning, _type, _description, _sequence) I would have voted for b) or c). Martin Peter wrote: > Dear Biopythoneers, > > This is a request for feedback about proposed additions to the Seq > object for the next release of Biopython. I'd like people to pick (a) > to (e) in the list below (with additional comments or counter > suggestions welcome). > > Enhancement bug 2381 is about adding transcription and translation > methods to the Seq object, allowing an object orientated style of > programming. > > e.g. Current functional programming style: > >>>> from Bio.Seq import Seq, transcribe >>>> from Bio.Alphabet import generic_dna >>>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna) >>>> my_seq > Seq('CAGTGACGTTAGTCCG', DNAAlphabet()) >>>> transcribe(my_seq) > Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) > > With the latest Biopython in CVS, you can now invoke a Seq object > method instead for transcription (or back transcription): > >>>> my_seq.transcribe() > Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) > > For a comparison, compare the shift from python string functions to > string methods. This also makes the functionality more discoverable > via dir(my_seq). > > Adding Seq object methods "transcribe" and "back_transcribe" doesn't > cause any confusion with the python string methods. 
However, for > translation, the python string has an existing "translate" method: > >> S.translate(table [,deletechars]) -> string >> >> Return a copy of the string S, where all characters occurring >> in the optional argument deletechars are removed, and the >> remaining characters have been mapped through the given >> translation table, which must be a string of length 256. > > I don't think this functionality is really of direct use for sequences, and > having a Seq object "translate" method do a biological translation into > a protein sequence is much more intuitive. However, this could cause > confusion if the Seq object is passed to non-Biopython code which > expects a string like translate method. > > To avoid this naming clash, a different method name would needed. > > This is where some user feedback would be very welcome - I think > the following cover all the alternatives of what to call a biological > translation function (nucleotide to protein): > > (a) Just use translate (ignore the existing string method) > (b) Use translate_ (trailing underscore, see PEP8) > (c) Use translation (a noun rather than verb; different style). > (d) Use something else (e.g. bio_translate or ...) > (e) Don't add a biological translation method at all because ... > > Thanks, > > Peter > > See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381 From mmokrejs at ribosome.natur.cuni.cz Sun Oct 19 12:17:50 2008 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Sun, 19 Oct 2008 18:17:50 +0200 Subject: [BioPython] Current tutorial in CVS In-Reply-To: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com> References: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com> Message-ID: <48FB5DAE.1050600@ribosome.natur.cuni.cz> Hi Peter, Peter wrote: > Michiel wrote: >> ... The new tutorial is in CVS; I put a copy of the HTML output >> of the latest version at >> http://biopython.org/DIST/docs/tutorial/Tutorial.new.html. > > This also gives people a chance to look at the three plotting examples > I added to the "Cookbook" section a couple of weeks back, > > http://www.biopython.org/DIST/docs/tutorial/Tutorial.new.html#chapter:cookbook for those lazy would you please show how to save the generated plots into e.g. jpg or .svg file? Thanks, ;-) Martin From biopython at maubp.freeserve.co.uk Sun Oct 19 12:34:46 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 19 Oct 2008 17:34:46 +0100 Subject: [BioPython] Current tutorial in CVS In-Reply-To: <48FB5DAE.1050600@ribosome.natur.cuni.cz> References: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com> <48FB5DAE.1050600@ribosome.natur.cuni.cz> Message-ID: <320fb6e00810190934l432dd320ue8ff3a40fa497530@mail.gmail.com> > > for those lazy would you please show how to save the generated plots into > e.g. jpg or .svg file? Instead or as well as pylab.show(), use pylab.savefig(...), for example: pylab.savefig("dot_plot.png", dpi=75) pylab.savefig("dot_plot.pdf") On a related note - it looks like the pylab tutorial as moved, I'm getting a 404 error on http://matplotlib.sourceforge.net/tutorial.html now :( It looks like http://matplotlib.sourceforge.net/api/pyplot_api.html is the replacement. 
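To put the savefig() calls above into a complete, runnable snippet (the data points and file names here are arbitrary placeholders):

import pylab

xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

pylab.scatter(xs, ys)
pylab.xlabel("variable 1")
pylab.ylabel("variable 2")
pylab.savefig("scatter_plot.png", dpi=75)   # bitmap output
pylab.savefig("scatter_plot.pdf")           # vector output; .svg and .eps also work
pylab.show()                                # optional, and best called after saving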
Peter From biopython at maubp.freeserve.co.uk Sun Oct 19 14:15:59 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 19 Oct 2008 19:15:59 +0100 Subject: [BioPython] Translation method for Seq object In-Reply-To: <48FB57BD.7070705@ribosome.natur.cuni.cz> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> <48FB57BD.7070705@ribosome.natur.cuni.cz> Message-ID: <320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com> On Sun, Oct 19, 2008 at 4:52 PM, Martin MOKREJ? wrote: > Hi, > I have been away for 2 weeks but although late, Untill we release a new biopython, its not too late to change the Seq object's new methods. > let me oppose that string.translate() is of use. > Here is my current code: > ... Your code seems to be doing two things with the python string translate() method: (1) Using the deletechars argument (with an empty mapping) to look for unexpected letters. It took me a while to work out what your code was doing - personally I would have used a python set for this, rather than the string translate method. Note also unicode strings don't support the deletechars argument, and that python 3.0 removes the deletechars argument from the string style objects. (2) Using the translate mapping to switch "U" and "u" into "T" and "t" to back transcribe RNA into DNA. For this, Biopython already has a Bio.Seq.back_transcribe function (which does work on strings), and in CVS the Seq object gets a back_transcribe method too. These do both use the string translate method internally. Neither of these operations convice me that the Seq object should support the python string translate method. Note that if you still need to use the python string translate method, it is accessable by first turning the Seq object into a string (e.g. str(my_seq).translate(mapping, delete_chars)), or as Michiel suggested earlier, you could use the string module translate function on the Seq object. Also note that (as in your example using the string translate to do back transcription) the translate method by its nature makes it impossible to know if the original Seq object alphabet still applies to the result. Peter From mmokrejs at ribosome.natur.cuni.cz Sun Oct 19 14:28:38 2008 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Sun, 19 Oct 2008 20:28:38 +0200 Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> <48FB57BD.7070705@ribosome.natur.cuni.cz> <320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com> Message-ID: <48FB7C56.6010408@ribosome.natur.cuni.cz> Peter, you are right in your points. I think the translate() trick had some speed advantages over other approaches to zap unwanted characters - I don't remember but if it is gonna break in future python releases I will have to rewrite this anyway. I just wanted to say I really do use the string translate function and that it has use in bioinformatics as well. ;-) Still, I think the name clash is asking for disaster, but overloading is a feature of python so it might be expected. Do whatever you want. ;) Cheers, M. Peter wrote: > On Sun, Oct 19, 2008 at 4:52 PM, Martin MOKREJ? > wrote: >> Hi, >> I have been away for 2 weeks but although late, > > Untill we release a new biopython, its not too late to change the Seq > object's new methods. > >> let me oppose that string.translate() is of use. >> Here is my current code: >> ... 
> > Your code seems to be doing two things with the python string > translate() method: > > (1) Using the deletechars argument (with an empty mapping) to look for > unexpected letters. It took me a while to work out what your code was > doing - personally I would have used a python set for this, rather > than the string translate method. Note also unicode strings don't > support the deletechars argument, and that python 3.0 removes the > deletechars argument from the string style objects. > > (2) Using the translate mapping to switch "U" and "u" into "T" and "t" > to back transcribe RNA into DNA. For this, Biopython already has a > Bio.Seq.back_transcribe function (which does work on strings), and in > CVS the Seq object gets a back_transcribe method too. These do both > use the string translate method internally. > > Neither of these operations convice me that the Seq object should > support the python string translate method. > > Note that if you still need to use the python string translate method, > it is accessable by first turning the Seq object into a string (e.g. > str(my_seq).translate(mapping, delete_chars)), or as Michiel suggested > earlier, you could use the string module translate function on the Seq > object. > > Also note that (as in your example using the string translate to do > back transcription) the translate method by its nature makes it > impossible to know if the original Seq object alphabet still applies > to the result. From biopython at maubp.freeserve.co.uk Sun Oct 19 14:52:06 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 19 Oct 2008 19:52:06 +0100 Subject: [BioPython] Translation method for Seq object In-Reply-To: <48FB7C56.6010408@ribosome.natur.cuni.cz> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> <48FB57BD.7070705@ribosome.natur.cuni.cz> <320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com> <48FB7C56.6010408@ribosome.natur.cuni.cz> Message-ID: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com> On Sun, Oct 19, 2008 at 7:28 PM, Martin MOKREJ? wrote: > Peter, > you are right in your points. I think the translate() trick > had some speed advantages over other approaches to > zap unwanted characters ... I haven't profiled this - you may be right. On the other hand, using the translate method in this way doesn't make the purpose of the code obvious. >- I don't remember but if it is gonna break in future > python releases I will have to rewrite this anyway. Certainly the deletechars argument seems to be gone in Python 3.0, but you may not need to worry about that for a while. > I just wanted to say I really do use the string translate > function and that it has use in bioinformatics as well. ;-) Using the string translate for (back)transcription is an obvious example, but this is a special case that is already handled within Biopython. Does anyone have a non-transcription sequence example where the mapping part of the translate method is actually used? Using the string translate method just to remove characters is an interesting one. How common is this in typical python code? I've always used the string replace method (but usually I only want to remove one character). Maybe we should have a remove characters method for the Seq object? Here at least dealing with the alphabet is fairly simple. On another thread I'd suggested a "remove gaps" method as a special case of this. > Still, I think the name clash is asking for disaster, but > overloading is a feature of python so it might be expected. 
> Do whatever you want. ;) > Cheers, > M. I'm still a tiny bit uneasy about the name clash myself... anyone else what to join in the debate? Peter From biopython at maubp.freeserve.co.uk Sun Oct 19 14:59:23 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 19 Oct 2008 19:59:23 +0100 Subject: [BioPython] Current tutorial in CVS In-Reply-To: <320fb6e00810190934l432dd320ue8ff3a40fa497530@mail.gmail.com> References: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com> <48FB5DAE.1050600@ribosome.natur.cuni.cz> <320fb6e00810190934l432dd320ue8ff3a40fa497530@mail.gmail.com> Message-ID: <320fb6e00810191159j52bb78al4c38b1f7804c268f@mail.gmail.com> Peter wrote: > Marting wrote: >> for those lazy would you please show how to save the generated >> plots into e.g. jpg or .svg file? > > Instead or as well as pylab.show(), use pylab.savefig(...), for example: > > pylab.savefig("dot_plot.png", dpi=75) > pylab.savefig("dot_plot.pdf") I've added a note about this in the example in the CVS version of the Tutorial. > On a related note - it looks like the pylab tutorial as moved, I'm > getting a 404 error on http://matplotlib.sourceforge.net/tutorial.html > now :( I've updated this link to point at http://matplotlib.sourceforge.net/ instead (which at the time of writing includes a quick summary of the pylab functions). Peter From tiagoantao at gmail.com Mon Oct 20 01:41:56 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 20 Oct 2008 06:41:56 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> Message-ID: <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> Hi, On Sun, Oct 19, 2008 at 3:50 PM, Giovanni Marco Dall'Olio wrote: > ok, thank you very much!! > I would like to use git to keep track of the changes I will make to the > code. > What do you think if I'll upload it to http://github.com and then upload it > back on biopython when it is finished? > I am not sure, but I think it would be possible to convert the logs back to > cvs to reintegrate the changes in biopython. I think it is a good idea. When we reintegrate back I think there will be no need to backport the commit logs anyway. > One of the problems we are having here, is that it takes too much RAM memory > to store all the information about characters for every population. > I was going to write a Population object, in which I'll store only the total > count of heterozygotes, individuals, and what is needed, instead of the > information about characters (('a', 'a'), ('a', 'c'), ...) I am afraid that this is not enough. Even for Fst. I suppose you are acquainted with a formula with just heterozigosities. That is more of just a textbook formula only. The Fst standard estimator is really Cockerham and Wier Theta estimator (1984 paper), and I think it needs individual information (or at the very least allele counts). Check my implementation of Fst, which should be it (less the bugs that are in). Maybe my implementation of theta is wrong, which is a possiblity. 
But theta is the standard. May I make a suggestion for your problem? Split the SNPs into groups (say 100 at a time) and calculate 100 Fsts at a time. Store the calculated Fsts to disk and then join them at the end. As a general rule, whatever goes into biopython has to be general enough to accommodate all standard statistics (not just Fst). One cannot make a solution that is tailored to solve just our personal research issues. I am currently traveling (which seems to be my constant state). When I arrive back at the office, on Wednesday, I will make a few suggestions on how we can structure things. I have a few ideas that I would like to share and discuss. > class Marker: > total_heterozygotes_count = 0 > total_population_count = 0 > total_Purines_count = 0 # this could be renamed, of course > total_Pyrimidines_count = 0 Also, your representation seems to be targeted toward SNPs; people use lots of other things (microsatellites are still used a lot). We have to think about something that is useful to the general public. Let me get back to you on Wednesday with ideas. If you are interested we can work together to make a nice population genetics module that can be used in a wide range of situations. From lpritc at scri.ac.uk Mon Oct 20 05:09:51 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Mon, 20 Oct 2008 10:09:51 +0100 Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com> Message-ID: On 19/10/2008 19:52, "Peter" wrote: > I'm still a tiny bit uneasy about the name clash myself... anyone else > what to join in the debate? The problem domain for biological sequences implies a natural definition for the application of 'translate' to a DNA/RNA sequence that is the translation into protein sequence. The string.translate() method is not consistent with this natural use of the language of the problem domain. I take Martin's point that there are valid uses for the string.translate() method in bioinformatics and elsewhere, but I think that overloading translate() is as valid here as overloading __mul__ would be for an implementation of matrix algebra, or complex numbers. For biological sequences as much as for number types, I think the problem domain and expected behaviour of the object being represented in code should take precedence over emulation of an object type that was never intended to provide the functionality required for a biological sequence. I think also that if the string.translate() method is required, an explicit call to string.translate() implies: "translate this biological sequence as if it were a string, and not a biological sequence". The converse application of a Bio.translate() method would to me imply "translate this biological sequence as if it were a biological sequence, and not a string"; which seems to me to defeat part of the purpose of representing the biological sequence with its own object. L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.
DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From biopython at maubp.freeserve.co.uk Mon Oct 20 05:22:39 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Oct 2008 10:22:39 +0100 Subject: [BioPython] Translation method for Seq object In-Reply-To: References: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com> Message-ID: <320fb6e00810200222s641e165eqef3b209893a8d976@mail.gmail.com> Leighton wrote: > Peter wrote: >> I'm still a tiny bit uneasy about the name clash myself... anyone else >> what to join in the debate? > > The problem domain for biological sequences implies a natural definition for > the application of 'translate' to a DNA/RNA sequence that is the translation > into protein sequence. The string.translate() method is not consistent with > this natural use of the language of the problem domain. > ... I thought that was well argued and nicely put. Of course, someone is still bound to try calling the translate method with a string mapping. Maybe we should add a bit of defensive code to check the table argument, and print a helpful error message when this happens? We currently only expect the codon table argument to be an NCBI genetic code table name or ID (string or integer). Earlier I wrote: >> In Biopython's CVS, the Seq object now has a translate method >> which does a biological translation. If anyone comes up with a >> better proposal before the next release, we can still rename this. >> Otherwise I will update the Tutorial in CVS shortly... I have since updated the Tutorial in CVS to use the new transcribe, back_transcribe and translate methods. Maybe we should put an updated "preview" online for comment? Peter From lpritc at scri.ac.uk Mon Oct 20 05:38:10 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Mon, 20 Oct 2008 10:38:10 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48F8DD7B.7010909@gmail.com> Message-ID: On 17/10/2008 19:46, "Bruce Southey" wrote: > Leighton Pritchard wrote: >> This is the key problem. Forward translation is - for a given codon table - >> a one-one mapping. Reverse translation is (for many amino acids) one-many. >> If the goal is to produce the coding sequence that actually encoded a >> particular protein sequence, the problem is combinatorial and rapidly >> becomes messy with increasing sequence length. >> > If you use a regular expression or a tree structure then there is a > one-one mapping but then that would probably best as a subclass of Seq. I don't see this, I'm afraid. 
Each codon -> one amino acid : one-one mapping Arg -> set of 6 possible codons : one-many mapping It doesn't matter how it's represented in code, the problem of a one-many mapping still exists for amino acid -> codon translation in most cases. The combinatorial nature of the overall problem can be illustrated by considering the unlikely case of a protein that comprises 100 arginines. The number of potential coding sequences is 6**100 = 6.5e77. That you *can* choose any one of these to be your potential coding sequence doesn't negate the fact that there are still (6.5e77)-1 other possibilities... It doesn't get much better if you use the the average number of codons per amino acid: 61/20 ~= 3. A 100aa protein would typically have 3**100 ~= 5e47 potential coding sequences. I wouldn't want to guess which one was correct, and I can't see a back_translate method in this instance doing more than producing a nucleotide sequence that is potentially capable of producing the passed protein sequence, but for which no claims can be made about biological plausibility. Now, a back_translate() that takes a protein sequence alignment and, when passed the coding sequences for each component sequence, returns the corresponding alignment of the nucleotide sequences, makes sense to me. But that's a discussion for Bio.Alignment objects... > I would suggest tools like Wise2 and exonerate > (http://www.ebi.ac.uk/~guy/exonerate/) are the solution to solving gene > structure problems than using a Seq object. I wouldn't suggest using a Seq object for this purpose, either... ;) >> I agree - I can't think of an occasion where I might want to back-translate >> a protein in this way that wouldn't better be handled by other means. Not >> that I'm the fount of all use-cases but, given the number of ways in which >> one *could* back-translate, perhaps it would be better not to pick/guess at >> any single one. >> > Apart from the academic aspect, my main use is searching for protein > motifs/domains, enzyme cleavage sites, finding very short combinations > of amino acids and binding sites (I do not do this but it is the same) > in DNA sequences especially genomic sequence. These are usually very > small and, thus, unsuitable for most tools. I do much the same, and haven't found a pressing use for back-translation, yet - YMMV. > One of my uses is with > peptide identification and de novo sequencing using mass spectrometry > when you don't know the actual protein or gene sequence. It also has the > problem that certain amino acids have very similar mass so you would > need to Regardless of whether you use a regular expression query or not > you still need a back translation of the protein query and probably the > reverse complement. Perhaps I'm being dense, but I don't see why that is. Can you give an example? > Another case where it would be useful is that tools like TBLASTN gives > protein alignments so you must open the DNA sequence and find the DNA > region based on the protein alignment. You could use TBLASTN output - which provides start and stop coordinates for the match on the subject sequence - to extract this directly, without the need for backtranslation. Example output where subject coordinates give the match location below: >ref|NC_004547.2| Erwinia carotovora subsp. 
atroseptica SCRI1043, complete genome Length = 5064019 Score = 731 bits (1887), Expect = 0.0 Identities = 363/376 (96%), Positives = 363/376 (96%) Frame = +3 Query: 1 MFHXXXXXXXXXXXXXTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY 60 MFH TISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY Sbjct: 477432 MFHLPKLKQKPLALLLTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY 477611 [...] L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From dalloliogm at gmail.com Mon Oct 20 09:57:27 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 20 Oct 2008 15:57:27 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> Message-ID: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> On Mon, Oct 20, 2008 at 7:41 AM, Tiago Ant?o wrote: > Hi, > > On Sun, Oct 19, 2008 at 3:50 PM, Giovanni Marco Dall'Olio > wrote: > > ok, thank you very much!! > > I would like to use git to keep track of the changes I will make to the > > code. > > What do you think if I'll upload it to http://github.com and then upload > it > > back on biopython when it is finished? > > I am not sure, but I think it would be possible to convert the logs back > to > > cvs to reintegrate the changes in biopython. > > I think it is a good idea. When we reintegrate back I think there will > be no need to backport the commit logs anyway. 
Ok, I have uploaded the code to: - http://github.com/dalloliogm/biopython---popgen I put the code I wrote before writing to this mailing list in the folder PopGen/Gio - http://github.com/dalloliogm/biopython---popgen/tree/6f6fa66cda1908dc8334ab6e9e69b7c85290a8be/src/PopGen/Gio However, I plan to integrate these scripts with your code or rewrite them completely (well, your code is a lot better than mine :) ). Just out of curiosity: why do you use the '<>' operator instead of '!='? Is it better supported in python 3.0? > > One of the problems we are having here, is that it takes too much RAM > memory > > to store all the information about characters for every population. > > I was going to write a Population object, in which I'll store only the > total > > count of heterozygotes, individuals, and what is needed, instead of the > > information about characters (('a', 'a'), ('a', 'c'), ...) > > I am afraid that this is not enough. Even for Fst. I suppose you are > acquainted with a formula with just heterozigosities. Yes, I was trying to implement a very basic formula at first. > That is more of > just a textbook formula only. The Fst standard estimator is really > Cockerham and Wier Theta estimator (1984 paper) Bioperl's Bio::PopGen::PopStats uses the same formula: - http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/PopGen/PopStats.html#POD3 """ Based on diploid method in Weir BS, Genetics Data Analysis II, 1996 page 178. """ , and I think it needs > individual information (or at the very least allele counts). Check my > implementation of Fst, which should be it (less the bugs that are in). > Maybe my implementation of theta is wrong, which is a possiblity. But > theta is the standard. > > May I do a suggestion for your problem? Split in SNP groups (like 100 > at a time) and calculate 100 Fsts at time. Store the calculated Fsts > to disk and then join them at the end. > Thanks - that's a good suggestion. > > > I am currently traveling (which seems to be my constant state). When I > arrive back at office, on Wednsday, I will make a few suggestions on > how we can structure things. I have a few ideas that I would like to > share and discuss. > Have a nice trip! > > > class Marker: > > total_heterozygotes_count = 0 > > total_population_count = 0 > > total_Purines_count = 0 # this could be renamed, of course > > total_Pyrimidines_count = 0 > > > Also, your representation seems to be targetted toward SNPs, people > use lots of other things (microsatellites are still used a lot). We > have to think about something that is useful to the general public. > Let me get back to you on Wednesday we ideas. If you are interested we > can work together to make a nice population genetics module that can > be used in a wide range of situations. > Yes, I agree. It was just a first try. We should collect some good use-cases.
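A rough sketch of the chunking idea Tiago suggests above: process the SNPs in groups of, say, 100, write each group's Fst values to disk, and join them at the end. Everything here is illustrative; calc_fst() is a placeholder for whichever single-locus estimator the module ends up with (e.g. Weir and Cockerham's theta), and the file names are made up.

import os
import cPickle

def calc_fst(snp):
    # Placeholder, not an existing Biopython function: return the
    # single-locus Fst for one SNP from per-population genotype counts.
    raise NotImplementedError

def chunks(items, size=100):
    # Yield successive groups of 'size' SNPs.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def fst_in_chunks(all_snps, workdir="fst_chunks"):
    if not os.path.isdir(workdir):
        os.mkdir(workdir)
    # Calculate and save one chunk at a time, keeping memory use low.
    for n, chunk in enumerate(chunks(all_snps)):
        results = [calc_fst(snp) for snp in chunk]
        cPickle.dump(results, open(os.path.join(workdir, "chunk_%05i.pck" % n), "wb"))
    # Join the per-chunk results back into a single list at the end.
    fsts = []
    for name in sorted(os.listdir(workdir)):
        fsts.extend(cPickle.load(open(os.path.join(workdir, name), "rb")))
    return fsts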
-- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Mon Oct 20 10:04:02 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Oct 2008 15:04:02 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> Message-ID: <320fb6e00810200704g578ec3aak6b9df1a5a90a2fc7@mail.gmail.com> On Mon, Oct 20, 2008 at 2:57 PM, Giovanni Marco Dall'Olio wrote: > Just a curiosity: why do you use the '<>' operator instead of '!='? > Is it better supported in python 3.0? Python 2.x supports both <> and != for not equal, and people use both depending on their personal preference (or exposure to other languages). Most Biopython code used to use <> which I personally do by habit. Python 3.x supports only != so I have recently gone through Biopython in CVS switching all the <> to != instead. I would recommend you use != in all new python code. Peter From biopython at maubp.freeserve.co.uk Mon Oct 20 10:23:23 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Oct 2008 15:23:23 +0100 Subject: [BioPython] Bio.AlignIO feedback - seq_count Message-ID: <320fb6e00810200723p2fcbe12ey125dd1fd67d195a7@mail.gmail.com> Dear Biopythoneers, I'm hoping some of you on the mailing list have actually used Bio.AlignIO, and I'd like to ask for some feedback. In particular, when loading in sequence files, did you ever use the optional seq_count argument to declare how many sequences you expected in each alignment? The rational of this optional argument is discussed in the Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:AlignIO-count-argument I'm curious if anyone actually found this useful in real life. Thanks Peter From bsouthey at gmail.com Tue Oct 21 10:13:15 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 21 Oct 2008 09:13:15 -0500 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: References: Message-ID: <48FDE37B.5040301@gmail.com> Leighton Pritchard wrote: > On 17/10/2008 19:46, "Bruce Southey" wrote: > > >> Leighton Pritchard wrote: >> >>> This is the key problem. Forward translation is - for a given codon table - >>> a one-one mapping. Reverse translation is (for many amino acids) one-many. >>> If the goal is to produce the coding sequence that actually encoded a >>> particular protein sequence, the problem is combinatorial and rapidly >>> becomes messy with increasing sequence length. >>> >>> >> If you use a regular expression or a tree structure then there is a >> one-one mapping but then that would probably best as a subclass of Seq. >> > > I don't see this, I'm afraid. > > Each codon -> one amino acid : one-one mapping > Arg -> set of 6 possible codons : one-many mapping > If you believed this then your answer below is incorrect. 
The genetic code allow for 1 amino acid to map to a three nucleotides but not any three nor any more or any less than three. So to be clear there is a one to one mapping between a codon and amino acid as well amino acid and a codon. Therefore it is impossible for Arg to map to six possible codons as only one is correct. Under the standard genetic code, each amino acid can be represented in an regular expression either as the bases or ambiguous nucleotide codes: Ala/A =(GCT|GCC|GCA|GCG) = GCN Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) Lys/K =(AAA|AAG) = AAR Asn/N =(AAT|AAC) =AAY Met/M =ATG =ATG Asp/D =(GAT|GAC) =GAY Phe/F =(TTT|TTC) =TTY Cys/C =(TGT|TGC) =TGY Pro/P =(CCT|CCC|CCA|CCG) =CCN Gln/Q =(CAA|CAG) =CAR Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) Glu/E =(GAA|GAG) = GAR Thr/T =(ACT|ACC|ACA|ACG) =ACN Gly/G =(GGT|GGC|GGA|GGG) =GGN Trp/W =TGG =TGG His/H =(CAT|CAC) = CAY Tyr/Y =(TAT|TAC) = TAY Ile/I =(ATT|ATC|ATA) =ATH Val/V =(GTT|GTC|GTA|GTG) =GTN This is still a one to one mapping between an amino acid and regular expression relationship of the triplet that encodes it. Unfortunately the ambiguous nucleotide codes can not be used directly in a regular expression search. > It doesn't matter how it's represented in code, the problem of a one-many > mapping still exists for amino acid -> codon translation in most cases. > > The combinatorial nature of the overall problem can be illustrated by > considering the unlikely case of a protein that comprises 100 arginines. > The number of potential coding sequences is 6**100 = 6.5e77. That you *can* > choose any one of these to be your potential coding sequence doesn't negate > the fact that there are still (6.5e77)-1 other possibilities... It doesn't > get much better if you use the the average number of codons per amino acid: > 61/20 ~= 3. A 100aa protein would typically have 3**100 ~= 5e47 potential > coding sequences. I wouldn't want to guess which one was correct, and I > can't see a back_translate method in this instance doing more than producing > a nucleotide sequence that is potentially capable of producing the passed > protein sequence, but for which no claims can be made about biological > plausibility. > You are not representing the one to six mapping you indicated above as sequence is composed of 300 nucleotides not 1800 as must occur with a one to 6 codon mapping. Rather you have provided the number of combinations of the six codons that can give you 100 Args based on a one to one mapping of one codon to one Arg. If you use ambiguous nucleotide codes, you can reduce it down to 1.267651e+30 potential coding sequences for 100 amino acids as a worst case scenario. It is not my position to argue what a user wants or how stupid I think that the request is. The user would quickly learn. > Now, a back_translate() that takes a protein sequence alignment and, when > passed the coding sequences for each component sequence, returns the > corresponding alignment of the nucleotide sequences, makes sense to me. But > that's a discussion for Bio.Alignment objects... > > >> I would suggest tools like Wise2 and exonerate >> (http://www.ebi.ac.uk/~guy/exonerate/) are the solution to solving gene >> structure problems than using a Seq object. >> > > I wouldn't suggest using a Seq object for this purpose, either... ;) > > >>> I agree - I can't think of an occasion where I might want to back-translate >>> a protein in this way that wouldn't better be handled by other means. 
Not >>> that I'm the fount of all use-cases but, given the number of ways in which >>> one *could* back-translate, perhaps it would be better not to pick/guess at >>> any single one. >>> >>> >> Apart from the academic aspect, my main use is searching for protein >> motifs/domains, enzyme cleavage sites, finding very short combinations >> of amino acids and binding sites (I do not do this but it is the same) >> in DNA sequences especially genomic sequence. These are usually very >> small and, thus, unsuitable for most tools. >> > > I do much the same, and haven't found a pressing use for back-translation, > yet - YMMV. > > >> One of my uses is with >> peptide identification and de novo sequencing using mass spectrometry >> when you don't know the actual protein or gene sequence. It also has the >> problem that certain amino acids have very similar mass so you would >> need to Regardless of whether you use a regular expression query or not >> you still need a back translation of the protein query and probably the >> reverse complement. >> > > Perhaps I'm being dense, but I don't see why that is. Can you give an > example? > Isoleucine and Leucine are the worst case (there are a couple of others that are close) because these have the same mass so you have to search for: (TTA|TTG|CTT|CTC|CTA|CTG|ATT|ATC|ATA) If you are searching say for an RFamide, you know that you need at least RFG, which means you need to do a query using regular expression on the plus strand using: (CGT|CGC|CGA|CGG|AGA|AGG)(TTT|TTC)(GGT|GGC|GGA|GGG) You then try to extend the match to more amino acids until you reach the desired mass (hopefully avoiding any introns) or sufficiently that you can use some other tool to help. > >> Another case where it would be useful is that tools like TBLASTN gives >> protein alignments so you must open the DNA sequence and find the DNA >> region based on the protein alignment. >> > > You could use TBLASTN output - which provides start and stop coordinates for > the match on the subject sequence - to extract this directly, without the > need for backtranslation. Example output where subject coordinates give the > match location below: > > >> ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete >> > genome > Length = 5064019 > > Score = 731 bits (1887), Expect = 0.0 > Identities = 363/376 (96%), Positives = 363/376 (96%) > Frame = +3 > > Query: 1 MFHXXXXXXXXXXXXXTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY > 60 > MFH TISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY > Sbjct: 477432 MFHLPKLKQKPLALLLTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY > 477611 > > [...] > > L. > > Exactly my point, where is the DNA sequence? Only if you have direct access to the DNA sequence can you get it. Furthermore, the DNA sequence must be exactly the same because any change in the coordinates screws it up. Bruce From biopython at maubp.freeserve.co.uk Tue Oct 21 10:26:49 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Oct 2008 15:26:49 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48FDE37B.5040301@gmail.com> References: <48FDE37B.5040301@gmail.com> Message-ID: <320fb6e00810210726g466292e4h3e8fe053d9107f48@mail.gmail.com> Bruce wrote: > Leighton wrote: >> Each codon -> one amino acid : one-one mapping >> Arg -> set of 6 possible codons : one-many mapping I agree with Leighton. > If you believed this then your answer below is incorrect. 
No, I think you are just not using the terms one-to-one and one-to-many as a mathematician would. > The genetic code > allow for 1 amino acid to map to a three nucleotides but not any three nor > any more or any less than three. So to be clear there is a one to one > mapping between a codon and amino acid as well amino acid and a codon. > Therefore it is impossible for Arg to map to six possible codons as only one > is correct. Under the standard genetic code, each amino acid can be > represented in an regular expression either as the bases or ambiguous > nucleotide codes: > Ala/A =(GCT|GCC|GCA|GCG) = GCN That is a one to four mapping using unambiguous nucleotides, or a one to one mapping using ambiguous nucleotides. This is a nice case. > Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) That is a one to six mapping using unambiguous nucleotides, or a one to two mapping using ambiguous nucleotides. This is a problem case. > This is still a one to one mapping between an amino acid and regular > expression relationship of the triplet that encodes it. Unfortunately the > ambiguous nucleotide codes can not be used directly in a regular expression > search. The problem is that (TTN|CTR) or similar don't work in Seq objects - would need a more advanced representation (perhaps based on regular expressions). Peter From biopython at maubp.freeserve.co.uk Tue Oct 21 10:45:57 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Oct 2008 15:45:57 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48FDE37B.5040301@gmail.com> References: <48FDE37B.5040301@gmail.com> Message-ID: <320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com> Bruce wrote: >>> Another case where it would be useful is that tools like TBLASTN gives >>> protein alignments so you must open the DNA sequence and find the DNA >>> region based on the protein alignment. Leighton: >> You could use TBLASTN output - which provides start and stop coordinates >> for the match on the subject sequence - to extract this directly, without the >> need for backtranslation. Example output where subject coordinates give >> the match location below: >> >>> >>> ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete >>> >> >> genome >> Length = 5064019 >> >> Score = 731 bits (1887), Expect = 0.0 >> Identities = 363/376 (96%), Positives = 363/376 (96%) >> Frame = +3 >> >> Query: 1 MFHXXXXXXXXXXXXXTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY >> 60 >> MFH TISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY >> Sbjct: 477432 MFHLPKLKQKPLALLLTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY >> 477611 >> >> [...] Bruce's reply: > Exactly my point, where is the DNA sequence? Only if you have direct access > to the DNA sequence can you get it. Furthermore, the DNA sequence must be > exactly the same because any change in the coordinates screws it up. You should have the original query from when you ran the BLAST search, so using the co-ordinates given in the BLAST hit you can recover the original nucleotide query which gives this match. There is no reason to do a back-translation to try and find the original query, which would be especially difficult in this example due to the XXXXXX region (representing a region of low complexity which was ignored by BLAST). Even if you tried you could find more than one match and without checking the the coordinates BLAST gives it would not be clear which gave this BLAST match. 
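(As an aside, for a plus-strand hit like the TBLASTN example quoted above, the matched region can be sliced straight out of the subject sequence using the reported coordinates, with no back-translation involved. This is only a sketch: the FASTA file name is made up, it assumes you have the subject sequence locally, and a minus-frame hit would need the coordinates swapped and the slice reverse-complemented.)

from Bio import SeqIO
from Bio.Seq import translate  # module-level helper; adjust if your Biopython predates it

record = SeqIO.read(open("NC_004547.fna"), "fasta")  # hypothetical local copy of the subject
start, end = 477432, 477611                          # subject coordinates from the BLAST report

region = str(record.seq)[start - 1:end]  # BLAST coordinates are one-based and inclusive
print region[:60]
print translate(region)[:20]             # should start with the aligned protein (MFHLPKLKQK...)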
Peter From lpritc at scri.ac.uk Tue Oct 21 11:29:35 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 21 Oct 2008 16:29:35 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48FDE37B.5040301@gmail.com> Message-ID: Hi Bruce, On 21/10/2008 15:13, "Bruce Southey" wrote: > Leighton Pritchard wrote: >> I don't see this, I'm afraid. >> >> Each codon -> one amino acid : one-one mapping >> Arg -> set of 6 possible codons : one-many mapping >> > If you believed this then your answer below is incorrect. The genetic > code allow for 1 amino acid to map to a three nucleotides but not any > three nor any more or any less than three. I'm fine with this bit. Each such set of three nucleotides is called a 'codon'. Six such codons are able to code for an arginine, as you note: > Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) This is a one -> six mapping. That is, one input (arginine), is capable of being back-translated into any of six possible outputs (CGT, CGC, CGA, CGG, AGA, or AGG). but you contradict this with the comment: > So to be clear there is a one > to one mapping between a codon and amino acid as well amino acid and a > codon. Therefore it is impossible for Arg to map to six possible codons I think that you're confusing the biological fact (only one codon actually encoded this amino acid) with the back-translation problem (in the absence of any other information, any one of six codons is equally likely to have encoded this amino acid). --- > This is still a one to one mapping between an amino acid and regular > expression relationship of the triplet that encodes it. Which is not the claim that I was making. There are any number of ways of forcing a one-one mapping of this sort. You could arguably represent it as a one-to-one mapping of 'arginine -> "the backtranslation of arginine"', but that would not be informative in reconstructing the actual coding sequence (if that was what you wanted - which is the point of the discussion: what is the point of a back_translate() method?). The regular expression mapping is not useful for this, either. > You are not representing the one to six mapping you indicated above as > sequence is composed of 300 nucleotides not 1800 as must occur with a > one to 6 codon mapping [...] I think you've misunderstood what's going on here. Imagine a reduced system, where there is only one amino acid - let's call it A - and there are two possible codons that can produce this amino acid - XXX and YYY (thanks, Coldplay). Now, if we have a 'sequence' of only one amino acid: 'A', that might have been encoded by the sequence 'XXX', or the sequence 'YYY'. The sequence that coded for 'A' is one of 'XXX' or 'YYY', and we don't know which; there are two possibilities, therefore this is a 1->2 mapping. 2=2**1. Note that the nucleotide sequence is 3*1=3 long. But if our sequence has two amino acids: 'AA', this could have been the result of 'XXXXXX', 'XXXYYY', 'YYYXXX', or 'YYYYYY'. The coding sequence is one of four equally likely possibilities, and this is a 1->4 mapping (one sequence, four possible outcomes). 4=2**2, and the nucleotide sequence is 3*2 long. If we build longer sequences, we find that the number of potential outcomes is 2**n, where n is the number of 'A's in the input sequence, and the mapping is 1->2**n. The nucleotide sequence is 3*n long. If we make this more general, where there are m codons for this amino acid, the number of potential outcomes is m**n, and the mapping is 1->m**n. 
The nucleotide sequence is, again, 3*n long. In my previous example for arginine, m=6, n=100, the mapping is 1->6, and the sequence is 300nt long, *not* 1800 nt long. There are still 6e77 ways of encoding a sequence of 100 arginines. A back_translate() method that pretends to find the 'correct' coding sequence in the absence of other information, rather than 'a' coding sequence, is not making a plausible claim. > It is not my position to argue what a user wants or how stupid I think > that the request is. The user would quickly learn. While it is entirely possible to implement a function called back_translate() that does something a user doesn't want or need, I'm not sure that it's the approach that should be taken, here. It is your position to argue what you want or need out of a back_translate() method, and why, so that other people can see your point of view, and maybe be swayed by it. I don't see a use for such a method, even to produce a regular expression for searching nucleotide sequences, because TBLASTN is so much more efficient. > Isoleucine and Leucine are the worst case (there are a couple of others > that are close) because these have the same mass so you have to search for: > (TTA|TTG|CTT|CTC|CTA|CTG|ATT|ATC|ATA) > > If you are searching say for an RFamide, you know that you need at least > RFG, which means you need to do a query using regular expression on the > plus strand using: > (CGT|CGC|CGA|CGG|AGA|AGG)(TTT|TTC)(GGT|GGC|GGA|GGG) > > You then try to extend the match to more amino acids until you reach the > desired mass (hopefully avoiding any introns) or sufficiently that you > can use some other tool to help. I think that, in your position, I'd compare timings with a six-frame, three-frame or forward translation of (depending on the nature of the nucleotide sequence) the nucleotide sequence you're searching against, and then use a regular expression or string search with the protein sequence as the query. That's likely to be significantly faster than a regex search with that many groups, with the effects more noticeable at larger query sequence lengths; particularly so if you cache or save the translated sequences for future searches. >>> Another case where it would be useful is that tools like TBLASTN gives >>> protein alignments so you must open the DNA sequence and find the DNA >>> region based on the protein alignment. >> You could use TBLASTN output - which provides start and stop coordinates for >> the match on the subject sequence - to extract this directly, without the >> need for backtranslation. > Exactly my point, where is the DNA sequence? It's in the database against which you queried; TBLASTN queries against nucleotide databases. Wait, that's not quite right - TBLASTN translates nucleotide databases into protein databases and queries against them with the protein sequence, partly because of the one-many mapping of back-translation. If the database is local, you can use fastacmd (part of BLAST) to dump the entire database, to retrieve the single matching sequence from the database, or even to extract only the region of the sequence that is the match. Try fastacmd --help at the command-line. If your database is not local, you can (probably) obtain the sequence by querying GenBank with the accession number. If you can't do that, or ask the people who compiled the database you're querying against, or if they won't let you have the sequence, then you're stuck with guessing the coding sequence. 
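A minimal sketch of the "six-frame translation, then search with the protein query" approach suggested earlier in this message. The file name and the peptide are placeholders, the sequence is assumed to be plain A/C/G/T, and translate/reverse_complement are the module-level helpers in Bio.Seq (adjust if your Biopython version predates them).

import re
from Bio import SeqIO
from Bio.Seq import reverse_complement, translate

record = SeqIO.read(open("genomic.fasta"), "fasta")  # hypothetical nucleotide sequence
nuc = str(record.seq).upper()
query = "RFG"                                        # short peptide of interest (placeholder)

for strand, seq in [("+", nuc), ("-", reverse_complement(nuc))]:
    for offset in range(3):
        # Trim this frame to a multiple of three before translating.
        frame = seq[offset:]
        frame = frame[:len(frame) - (len(frame) % 3)]
        protein = translate(frame)
        for match in re.finditer(query, protein):
            # Report strand, frame offset and protein coordinate of each hit.
            print strand, offset, match.start()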
> Only if you have direct access to the DNA sequence can you get it. That's not true; fastacmd can extract FASTA-formatted sequences from any (version number compatibilities notwithstanding) correctly-formatted BLAST database. > Furthermore, the DNA sequence > must be exactly the same because any change in the coordinates screws it > up. I don't see how that is a great concern. The coordinates of the match would come from the same database you were searching, so should match. If your database is up-to-date, and you have to go to GenBank, then you should have the most recent revision of the sequence in there, anyway. Even if both of the above options fail, and you can acquire the new sequence by some accession identifier, you can build a new local database from that sequence alone, and find where the match is. Or translate and search directly in Python. If you truly have no access to the DNA sequence (e.g. if it's proprietary information, you can't access the BLAST database, and no-one will send you the sequence) then, and only then, are you stuck with guessing the coding sequence in *very* large parameter space. Best, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From biopython at maubp.freeserve.co.uk Tue Oct 21 11:59:00 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Oct 2008 16:59:00 +0100 Subject: [BioPython] back-translation method for Seq object? In-Reply-To: <320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com> References: <48FDE37B.5040301@gmail.com> <320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com> Message-ID: <320fb6e00810210859n1b922e7emd6a7456abd79cdc7@mail.gmail.com> On Tue, Oct 21, 2008 at 3:45 PM, Peter wrote: > Bruce wrote: >>>> Another case where it would be useful is that tools like TBLASTN gives >>>> protein alignments so you must open the DNA sequence and find the DNA >>>> region based on the protein alignment. 
> > Leighton: >>> You could use TBLASTN output - which provides start and stop coordinates >>> for the match on the subject sequence - to extract this directly, without the >>> need for backtranslation. Example output where subject coordinates give >>> the match location below: >>> ... > > Bruce's reply: >> Exactly my point, where is the DNA sequence? Only if you have direct access >> to the DNA sequence can you get it. Furthermore, the DNA sequence must be >> exactly the same because any change in the coordinates screws it up. > > You should have the original query from when you ran the BLAST > search, so using the co-ordinates given in the BLAST hit you can > recover the original nucleotide query which gives this match. Sorry - I was thinking of the wrong variant of BLAST. As Leighton pointed out, you would have to use fastacmd to extract the nucleotide sequence of the match from the blast database (assuming you were running stand alone blastall) or fetch it via its accession (if you were running BLAST via the NCBI). Peter From biopython at maubp.freeserve.co.uk Tue Oct 21 12:07:46 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Oct 2008 17:07:46 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: References: <48FDE37B.5040301@gmail.com> Message-ID: <320fb6e00810210907s328c8f23ra007e14f1f32f5c9@mail.gmail.com> Hi everyone, I think we all agree that if we want a back-translation method/function to return a simple string or Seq object (given no additional information about the codon use), this cannot fully capture all the possible codons. If we want to provide a simple string or Seq object, we can either pick an arbitrary codon in each case (as in the first attachment on Bug 2618), or perhaps represent some of the possible codons using ambiguous nucleotides. e.g. back_translate("MR") = "ATGCGT" #arbitrary codon for R unambiguous nucleotides or, back_translate("MR") = "ATGCGN" #arbitrary codon for R using ambiguous nucleotides Note in either example, the following nice property holds: translate(back_translate("MR")) == "MR" Even if improved by typical codon usage figures to give a more biologically likely answer, neither of these simple approaches covers the full set of six possible codons for Arg in the standard codon table. It was something like this that I envisioned as a candidate for a Seq method (based on the behaviour of the existing Bio.Translate functionality), but only if such a simple back_translate method/function had any real uses. And thus far, I haven't seen any. A back translation method/function which dealt with all the possible codon choices would have to use a more advanced representation (possibly as Bruce suggested using regular expressions or some sort of tree structure - ideally as a sub-class of the Seq object). There is also the option of returning multiple simple strings or Seq objects (either as a list or preferable a generator) giving all possible back translations, but I don't think this would be useful, except perhaps on small examples, due to the potentially vast number of return values. Peter From bsouthey at gmail.com Tue Oct 21 15:46:58 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 21 Oct 2008 14:46:58 -0500 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? 
In-Reply-To: References: Message-ID: <48FE31B2.8030509@gmail.com> Leighton Pritchard wrote: > Hi Bruce, > > On 21/10/2008 15:13, "Bruce Southey" wrote: > >> Leighton Pritchard wrote: >> >>> I don't see this, I'm afraid. >>> >>> Each codon -> one amino acid : one-one mapping >>> Arg -> set of 6 possible codons : one-many mapping >>> >>> >> If you believed this then your answer below is incorrect. The genetic >> code allow for 1 amino acid to map to a three nucleotides but not any >> three nor any more or any less than three. >> > > I'm fine with this bit. Each such set of three nucleotides is called a > 'codon'. Six such codons are able to code for an arginine, as you note: > > >> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) >> > > This is a one -> six mapping. That is, one input (arginine), is capable of > being back-translated into any of six possible outputs (CGT, CGC, CGA, CGG, > AGA, or AGG). > > but you contradict this with the comment: > > >> So to be clear there is a one >> to one mapping between a codon and amino acid as well amino acid and a >> codon. Therefore it is impossible for Arg to map to six possible codons >> > > I think that you're confusing the biological fact (only one codon actually > encoded this amino acid) with the back-translation problem (in the absence > of any other information, any one of six codons is equally likely to have > encoded this amino acid). > > --- > > >> This is still a one to one mapping between an amino acid and regular >> expression relationship of the triplet that encodes it. >> > > Which is not the claim that I was making. There are any number of ways of > forcing a one-one mapping of this sort. You could arguably represent it as > a one-to-one mapping of 'arginine -> "the backtranslation of arginine"', but > that would not be informative in reconstructing the actual coding sequence > (if that was what you wanted - which is the point of the discussion: what is > the point of a back_translate() method?). The regular expression mapping is > not useful for this, either. > > >> You are not representing the one to six mapping you indicated above as >> sequence is composed of 300 nucleotides not 1800 as must occur with a >> one to 6 codon mapping [...] >> > > I think you've misunderstood what's going on here. > > Imagine a reduced system, where there is only one amino acid - let's call it > A - and there are two possible codons that can produce this amino acid - XXX > and YYY (thanks, Coldplay). > > Now, if we have a 'sequence' of only one amino acid: 'A', that might have > been encoded by the sequence 'XXX', or the sequence 'YYY'. The sequence > that coded for 'A' is one of 'XXX' or 'YYY', and we don't know which; there > are two possibilities, therefore this is a 1->2 mapping. 2=2**1. Note that > the nucleotide sequence is 3*1=3 long. > > But if our sequence has two amino acids: 'AA', this could have been the > result of 'XXXXXX', 'XXXYYY', 'YYYXXX', or 'YYYYYY'. The coding sequence is > one of four equally likely possibilities, and this is a 1->4 mapping (one > sequence, four possible outcomes). 4=2**2, and the nucleotide sequence is > 3*2 long. > > If we build longer sequences, we find that the number of potential outcomes > is 2**n, where n is the number of 'A's in the input sequence, and the > mapping is 1->2**n. The nucleotide sequence is 3*n long. > > If we make this more general, where there are m codons for this amino acid, > the number of potential outcomes is m**n, and the mapping is 1->m**n. The > nucleotide sequence is, again, 3*n long. 
> > In my previous example for arginine, m=6, n=100, the mapping is 1->6, and > the sequence is 300nt long, *not* 1800 nt long. There are still 6e77 ways > of encoding a sequence of 100 arginines. A back_translate() method that > pretends to find the 'correct' coding sequence in the absence of other > information, rather than 'a' coding sequence, is not making a plausible > claim. > > Thank you for agreeing with me! I am glad that you realized that the genetic code prevents a true one to many relationship. In say relational databases where you can have one table for the journal issue and one table for the papers in it, you can get multiple papers in a single issue. Likewise, if we ignore the genetic code, there is one amino acid and one or more codons. However, the genetic code means that you only can select one of all the codons possible resulting in multiple combinations of one to one relationships. >> It is not my position to argue what a user wants or how stupid I think >> that the request is. The user would quickly learn. >> > > While it is entirely possible to implement a function called > back_translate() that does something a user doesn't want or need, I'm not > sure that it's the approach that should be taken, here. > > It is your position to argue what you want or need out of a back_translate() > method, and why, so that other people can see your point of view, and maybe > be swayed by it. I don't see a use for such a method, even to produce a > regular expression for searching nucleotide sequences, because TBLASTN is so > much more efficient. > This very much depends on how you want to use it. TBLASTN is not very good for very short sequences and can not handle protein domains/motifs such as those in Prosite. > >> Isoleucine and Leucine are the worst case (there are a couple of others >> that are close) because these have the same mass so you have to search for: >> (TTA|TTG|CTT|CTC|CTA|CTG|ATT|ATC|ATA) >> >> If you are searching say for an RFamide, you know that you need at least >> RFG, which means you need to do a query using regular expression on the >> plus strand using: >> (CGT|CGC|CGA|CGG|AGA|AGG)(TTT|TTC)(GGT|GGC|GGA|GGG) >> >> You then try to extend the match to more amino acids until you reach the >> desired mass (hopefully avoiding any introns) or sufficiently that you >> can use some other tool to help. >> > > I think that, in your position, I'd compare timings with a six-frame, > three-frame or forward translation of (depending on the nature of the > nucleotide sequence) the nucleotide sequence you're searching against, and > then use a regular expression or string search with the protein sequence as > the query. That's likely to be significantly faster than a regex search > with that many groups, with the effects more noticeable at larger query > sequence lengths; particularly so if you cache or save the translated > sequences for future searches. > Thanks for the comments as I did not think about reusing the translation. > > >>>> Another case where it would be useful is that tools like TBLASTN gives >>>> protein alignments so you must open the DNA sequence and find the DNA >>>> region based on the protein alignment. >>>> >>> You could use TBLASTN output - which provides start and stop coordinates for >>> the match on the subject sequence - to extract this directly, without the >>> need for backtranslation. >>> > > >> Exactly my point, where is the DNA sequence? >> > > It's in the database against which you queried; TBLASTN queries against > nucleotide databases. 
Wait, that's not quite right - No, it is not even correct! :-) > TBLASTN translates > nucleotide databases into protein databases and queries against them with > the protein sequence, partly because of the one-many mapping of > back-translation. > Not exactly as stop codons are not in protein databases except where they code for an amino acid. > If the database is local, you can use fastacmd (part of BLAST) to dump the > entire database, to retrieve the single matching sequence from the database, > or even to extract only the region of the sequence that is the match. Try > fastacmd --help at the command-line. > > If your database is not local, you can (probably) obtain the sequence by > querying GenBank with the accession number. If you can't do that, or ask > the people who compiled the database you're querying against, or if they > won't let you have the sequence, then you're stuck with guessing the coding > sequence. > > >> Only if you have direct access to the DNA sequence can you get it. >> > > That's not true; fastacmd can extract FASTA-formatted sequences from any > (version number compatibilities notwithstanding) correctly-formatted BLAST > database. > > Obviously because you still have direct access to the DNA sequence. >> Furthermore, the DNA sequence >> must be exactly the same because any change in the coordinates screws it >> up. >> > > I don't see how that is a great concern. The coordinates of the match would > come from the same database you were searching, so should match. If your > database is up-to-date, and you have to go to GenBank, then you should have > the most recent revision of the sequence in there, anyway. > > Even if both of the above options fail, and you can acquire the new sequence > by some accession identifier, you can build a new local database from that > sequence alone, and find where the match is. Or translate and search > directly in Python. > These were some of the things that one was trying to avoid, especially repeating it all over again and hoping like crazy that it is still present. (Genome assemblies are not very forgiving.) > If you truly have no access to the DNA sequence (e.g. if it's proprietary > information, you can't access the BLAST database, and no-one will send you > the sequence) then, and only then, are you stuck with guessing the coding > sequence in *very* large parameter space. > > Best, > > L. > > Bruce From bsouthey at gmail.com Tue Oct 21 16:36:31 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 21 Oct 2008 15:36:31 -0500 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <320fb6e00810210907s328c8f23ra007e14f1f32f5c9@mail.gmail.com> References: <48FDE37B.5040301@gmail.com> <320fb6e00810210907s328c8f23ra007e14f1f32f5c9@mail.gmail.com> Message-ID: <48FE3D4F.6060005@gmail.com> Peter wrote: > Hi everyone, > > I think we all agree that if we want a back-translation > method/function to return a simple string or Seq object (given no > additional information about the codon use), this cannot fully capture > all the possible codons. > For completeness as these are not 100% correct, Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN Ser is really so bad that one would suggest providing a strong warning and just use NTN, NGN, and NNN for Leu, Arg and Ser, respectively. 
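One quick way to check shorthand like this is to expand each ambiguous codon with Biopython's IUPAC tables and translate every expansion; if more than one amino acid (or a stop) comes back, the neat translate/back-translate round trip is lost. A minimal sketch, using only the standard codon table and the codons under discussion:

from Bio.Data.IUPACData import ambiguous_dna_values
from Bio.Data.CodonTable import unambiguous_dna_by_id

standard = unambiguous_dna_by_id[1]  # the standard genetic code

def expand(ambiguous_codon):
    # Return every unambiguous codon matching a three-letter ambiguous codon.
    return ["".join([a, b, c])
            for a in ambiguous_dna_values[ambiguous_codon[0]]
            for b in ambiguous_dna_values[ambiguous_codon[1]]
            for c in ambiguous_dna_values[ambiguous_codon[2]]]

for codon in ["GCN", "YTN", "MGV", "WSN"]:
    amino_acids = set()
    for unambig in expand(codon):
        # Stop codons are absent from forward_table, so record them as "*".
        amino_acids.add(standard.forward_table.get(unambig, "*"))
    print codon, "".join(sorted(amino_acids))
# GCN comes back as just Ala, but YTN, MGV and WSN each cover more than one amino acid.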
> If we want to provide a simple string or Seq object, we can either > pick an arbitrary codon in each case (as in the first attachment on > Bug 2618), or perhaps represent some of the possible codons using > ambiguous nucleotides. > > e.g. > back_translate("MR") = "ATGCGT" #arbitrary codon for R unambiguous nucleotides > > or, > back_translate("MR") = "ATGCGN" #arbitrary codon for R using ambiguous > nucleotides > > Note in either example, the following nice property holds: > translate(back_translate("MR")) == "MR" > > Even if improved by typical codon usage figures to give a more > biologically likely answer, neither of these simple approaches covers > the full set of six possible codons for Arg in the standard codon > table. > > It was something like this that I envisioned as a candidate for a Seq > method (based on the behaviour of the existing Bio.Translate > functionality), but only if such a simple back_translate > method/function had any real uses. And thus far, I haven't seen any. > For you perhaps but my reasons are very real to me! > A back translation method/function which dealt with all the possible > codon choices would have to use a more advanced representation > (possibly as Bruce suggested using regular expressions or some sort of > tree structure - ideally as a sub-class of the Seq object). There is > also the option of returning multiple simple strings or Seq objects > (either as a list or preferable a generator) giving all possible back > translations, but I don't think this would be useful, except perhaps > on small examples, due to the potentially vast number of return > values. > > Peter > > In any situation, we are left with a ambiguous codons, a regular expression or some combination of sequence type (e.g., strings or Seq objects). None of these options are fully compatible with the Seq object. So I do agree that back-translation can not be part of the Seq object. Also I agree that while first two could be return types for a Seq object method, the usage is probably too infrequent and too specialized for inclusion especially to handle codon usage frequencies. Bruce From lpritc at scri.ac.uk Wed Oct 22 04:31:12 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 22 Oct 2008 09:31:12 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48FE3D4F.6060005@gmail.com> Message-ID: On 21/10/2008 21:36, "Bruce Southey" wrote: > For completeness as these are not 100% correct, > Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN > Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV > Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN There are some difficulties with this encoding (IUPAC codes are at http://www.chick.manchester.ac.uk/SiteSeer/IUPAC_codes.html) YTN -> [CT]T[ACGT] -> {CTA, CTC, CTG, CTT, TTA, TTC, TTG, TTT}, two of which do not encode leucine. MGV -> [AC]G[ACG] -> {AGA, AGC, AGG, CGA, CGC, CGG}, of which AGC does not encode arginine, and the resulting set does not include CGT, which does encode arginine WSN -> [AT][CG][ACGT] -> {ACA, ACC, ACG, ACT, AGA, AGC, AGG, AGT, TCA, TCC, TCG, TCT, TGA, TGC, TGG, TGT}, of which 10 codons do not encode serine. This would cause problems if we wanted to translate our back-translation back to the original protein sequence (however we might want to do this). > Ser is really so bad that one would suggest providing a strong warning > and just use NTN, NGN, and NNN for Leu, Arg and Ser, respectively. 
We could just backtranslate all amino acids to NNN and avoid the problem entirely ;) >> If we want to provide a simple string or Seq object, we can either >> pick an arbitrary codon in each case (as in the first attachment on >> Bug 2618), or perhaps represent some of the possible codons using >> ambiguous nucleotides. >> >> e.g. >> back_translate("MR") = "ATGCGT" #arbitrary codon for R unambiguous >> nucleotides >> >> or, >> back_translate("MR") = "ATGCGN" #arbitrary codon for R using ambiguous >> nucleotides >> >> Note in either example, the following nice property holds: >> translate(back_translate("MR")) == "MR" This would be an important consideration for a back_translate() method: should translate() and back_translate() be inverse functions of each other? I would say that this is a desirable property, or else a nested translate(back_translate(translate(...(seq)...))) is likely to end up as a string or sequence of ambiguity codons, which is not very useful. If that can't be done, then the opportunity to do so is probably best avoided... To ensure that translate() and back_translate() are inverse functions, the backtranslation of a particular amino acid should either return a single unambiguous codon, or an ambiguous codon that cannot be translated to an alternative amino acid (assuming a consistent codon table throughout). If we were not to choose arbitrarily an unambiguous codon, or subset of all possible codons, then a representation of the ambiguity is required that is not present in the Seq object, yet (e.g. For Ser, Leu or Arg as described above). A modification of translate() to spot, and accept such ambiguity would be necessary. This looks like harder work than it's worth. >> It was something like this that I envisioned as a candidate for a Seq >> method (based on the behaviour of the existing Bio.Translate >> functionality), but only if such a simple back_translate >> method/function had any real uses. And thus far, I haven't seen any. >> > For you perhaps but my reasons are very real to me! I agree with Peter on this. I don't see a single compelling use case for back_translate() in a Seq object. I can sort of see a potential use where, if you have a protein and want to design a primer to the coding sequence (which is not known - otherwise there are better ways to do this), then you might want to generate a sequence of IUPAC ambiguity codes to guide primer design. This might involve obtaining a sequence only of the *certain* bases, e.g. Phe -> TTN; Ser -> NNN; Gly -> GGN; Asp -> GAN, so that FGD -> TTNNNNGGN, and there are four of nine bases around which primers might be designed. However, I'm *really* stretching to come up with this example. I've outlined my views on some of the possible ways back_translate() might work below: Translate protein to its original coding sequence: =================================================== Problem: this may be just guesswork in (very) large sequence space Potential solution: guesswork may be guided by codon usage tables or user preference for codons, but the biological utility/significance of the result, which is still guessed at, is highly questionable. Alternatives: If the originating organism's sequence is known, then TBLASTN is fast, works well, and avoids the problem. Alternatively, forward translation followed by a search for the protein sequence is quicker and less messy. 
Translate protein to a single possible coding sequence (not necessarily original): ============================================================================ Problem: Same one each time, or choose randomly? What is the point, anyway? See above for solutions/alternatives Translate protein to ambiguous representation (inverse translate and/or return Seq): ============================================================================ Problem: changes required to the way sequences are represented in Seq objects; this is a significant change at the heart of Biopython with many inevitable side-effects. Not clear how this would work, yet. Potential solution: major coding upheaval and rewriting of Biopython Alternatives: ignore the requirement that backtranslation is the inverse of translation; do not return a Seq object, but instead store the backtranslation as an attribute, or just return a string for the user to do what they want with Translate protein to ambiguous representation (not inverse of translate, do not return Seq): ============================================================================ Problem: what's the point? agreeing which ambiguous representation to use: regex, IUPAC, something else; IUPAC ambiguities aren't a convenient representation for Ser, Leu, Arg; Potential solution: just use a regex; allow a choice; make an executive decision; ignore it and hope it goes away I think that the last behaviour here is the only one that is feasible, but I still don't see much point in implementing it. At least turning a protein sequence into a regex of possible codons would be quick to code... >> There is >> also the option of returning multiple simple strings or Seq objects >> (either as a list or preferable a generator) giving all possible back >> translations, Eek! (for the reasons you mention) L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). 
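To make the last of these options concrete, a regex of possible codons really is quick to code. This is only a sketch of the idea, not a proposal for how any eventual Bio.SeqUtils helper should look; the function name back_translate_regex is hypothetical and only the standard codon table is assumed:

import re
from Bio.Data import CodonTable

standard = CodonTable.unambiguous_dna_by_name["Standard"]

codons_for = {}
for codon, aa in standard.forward_table.items():
    codons_for.setdefault(aa, []).append(codon)

def back_translate_regex(protein):
    """Regular expression matching every unambiguous coding of the protein."""
    groups = []
    for aa in protein:
        groups.append("(?:%s)" % "|".join(sorted(codons_for[aa])))
    return "".join(groups)

pattern = back_translate_regex("FGD")
print(pattern)
# (?:TTC|TTT)(?:GGA|GGC|GGG|GGT)(?:GAC|GAT)
print(bool(re.match(pattern, "TTTGGCGAT")))  # True - one of the 16 codings of FGD

Such a pattern can be handed to re.finditer against a nucleotide sequence (or its reverse complement) for the searching use case, without ever constructing an ambiguous Seq object.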
______________________________________________________________________ From biopython at maubp.freeserve.co.uk Wed Oct 22 05:17:23 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Oct 2008 10:17:23 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: References: <48FE3D4F.6060005@gmail.com> Message-ID: <320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com> On Wed, Oct 22, 2008 at 9:31 AM, Leighton Pritchard wrote: > On 21/10/2008 21:36, "Bruce Southey" wrote: > >> For completeness as these are not 100% correct, >> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN >> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV >> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN I was going to jump up and down and disagree with you here Bruce, but Leighton has already made the same point, (CGV | AGR) != MGV etc. It is true that the ambiguous codon MGV would cover all the possible Arg codons, but it includes more than that. While this could be a useful thing for certain back-translation reasons, it does break the expectation that translate(back_translate(sequence)) == sequence [currently the behaviour available in Bio.Translate]. >>> If we want to provide a simple string or Seq object, we can either >>> pick an arbitrary codon in each case (as in the first attachment on >>> Bug 2618), or perhaps represent some of the possible codons using >>> ambiguous nucleotides. >>> ... >>> It was something like this that I envisioned as a candidate for a Seq >>> method (based on the behaviour of the existing Bio.Translate >>> functionality), but only if such a simple back_translate >>> method/function had any real uses. And thus far, I haven't seen any. >>> >> For you perhaps but my reasons are very real to me! I was saying I don't see the need for a *simple* back_translate function (giving a Seq object or a string), and that such a simple function didn't seem to help with your examples. I'm not denying that a complex back translation operation has real utility (although I suspect there are multiple different solutions which won't suit every problem - and makes justifying adding this to the core Seq object hard to justify). Perhaps a function in Bio.SeqUtils to create a nucleotide regex describing possible back translations from a protein sequence would suffice? If one of your real-world examples can be solved with a back_translate which returns a simple string or Seq object, could you clarify this. Peter From lpritc at scri.ac.uk Wed Oct 22 06:03:32 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 22 Oct 2008 11:03:32 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48FE31B2.8030509@gmail.com> Message-ID: On 21/10/2008 20:46, "Bruce Southey" wrote: > Thank you for agreeing with me! I am glad that you realized that the > genetic code prevents a true one to many relationship. Bruce, I am not agreeing with you. I'll try to clarify it another way: More than one codon can encode the amino acid arginine (this is a many-one relationship). The amino acid arginine can be 'decoded' to more than one codon (this is a one-many relationship). Imagine a function that accepts an amino acid as input and returns a valid codon that could encode for the input amino acid. This is 'decoding' as described above, and is the process of back-translation for a single amino acid. For a single (i.e. 'one') amino acid, arginine, as input, the function might correctly provide up to six (i.e. 
'many') different valid answers. This makes it a one-many problem. Further external constraints (e.g. Codon tables) may be applied to restrict the number or likelihood of each codon being correct in specific cases, but the fundamental problem is one-many. Providing arginine as input to a particular coded version of this function might in all cases only return a single codon as output (one-one), but the problem itself is still one-many. Furthermore, even though only one codon was responsible - biologically-speaking - for encoding the arginine you're submitting to the function (one-one), your question is the inverse: effectively 'what codon encoded this arginine?'. But (and it's a big but), if you don't know beforehand what that codon is (and why else would you bother using the function?), the problem is one-many, as any of the six solutions might be correct. Analogously, there are two possible values for the square root of a positive real number, such as 4. It is inherently a one-many problem. For 4, the return value could, correctly, be +2 or -2. Now, the math.sqrt() function in Python follows mathematical convention for the radical, and only returns the positive value, but that does not make the relationship between the value and its square root one-one, it only makes that implementation of the function one-one, even though the answer could be, correctly, either positive or negative. Now, if your problem is: what is the length of side of a farmer's square field with area four square miles (big field!), only one of these answers makes sense (one-one), as the field is constrained by our reality and cannot have negative length (this is effectively equivalent to saying that the organism doesn't use five of the six possible codons for arginine, so only one answer is possible). However, the general problem of finding a square root is still one-many, as you can see if you rephrase the problem as 'the vector (a 0) has length 4; what is the value of a?'. This is directly analogous to the problem 'the amino acid arginine was encoded by a codon; what codon was it?'. > This very much depends on how you want to use it. TBLASTN is not very > good for very short sequences and can not handle protein domains/motifs > such as those in Prosite. That's a fair point, and I wouldn't (and didn't ;) ) recommend TBLASTN as a solution to all such problems. I get acceptable results for exact matches down to about 7aa on default settings, though. Short query sequences can be a problem whatever method you use, though. >> TBLASTN queries against >> nucleotide databases. Wait, that's not quite right - > No, it is not even correct! :-) Yes, it is correct. From: http://www.ncbi.nlm.nih.gov/blast/blast_program.shtml (and other references...) """ tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames """ They wrote it, so they should know. Not that I've checked the code ;) >> TBLASTN translates >> nucleotide databases into protein databases and queries against them with >> the protein sequence, partly because of the one-many mapping of >> back-translation. > Not exactly as stop codons are not in protein databases except where > they code for an amino acid. Stop codons are not (usually) in protein databases, that's true. But they *are* in nucleotide databases, which is what TBLASTN queries. 
For example, these are TBLASTN search results, in opposite directions on the same nucleotide sequence, that span stop codons in the subject sequence, indicated by '*' in the BLAST output (even though there are different stop codons; Artemis handles this more elegantly): >ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete genome Length = 5064019 Score = 79.0 bits (193), Expect = 8e-17 Identities = 38/40 (95%), Positives = 38/40 (95%), Gaps = 2/40 (5%) Frame = +2 Query: 1 YPHSTAEYLILFE-INPRS-PFFCWIFWNLMLRDVDLENF 38 YPHSTAEYLILFE INPRS PFFCWIFWNLMLRDVDLENF Sbjct: 2 YPHSTAEYLILFE*INPRS*PFFCWIFWNLMLRDVDLENF 121 >ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete genome Length = 5064019 Score = 56.6 bits (135), Expect = 4e-10 Identities = 29/32 (90%), Positives = 29/32 (90%), Gaps = 3/32 (9%) Frame = -3 Query: 1 CNGRWRC-SPL-CYISPRISCRSW-LKPSAIV 29 CNGRWRC SPL CYISPRISCRSW LKPSAIV Sbjct: 2851610 CNGRWRC*SPL*CYISPRISCRSW*LKPSAIV 2851515 >> That's not true; fastacmd can extract FASTA-formatted sequences from any >> (version number compatibilities notwithstanding) correctly-formatted BLAST >> database. >> > Obviously because you still have direct access to the DNA sequence. I'd call it indirect access if you've, say, downloaded a precompiled nt database from NCBI and then have to extract the FASTA sequence from that compiled database. Either way, if you're querying a nucleotide database, you've got to have a representation of the nucleotide sequence *somewhere*. >> Even if both of the above options fail, and you can acquire the new sequence >> by some accession identifier, you can build a new local database from that >> sequence alone, and find where the match is. Or translate and search >> directly in Python. >> > These were some of the things that one was trying to avoid, especially > repeating it all over again and hoping like crazy that it is still > present. Some things are just harder work than others ;) > (Genome assemblies are not very forgiving.) The genomes I've worked on have had stable sequences at revision points for both assembly and annotation (though the old revision points have not been kept publicly in all cases, which can be awkward). All should, IMO. But that's a different thread on a different mailing list... Best, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. 
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From dalloliogm at gmail.com Wed Oct 22 06:25:57 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 22 Oct 2008 12:25:57 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> Message-ID: <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> On Mon, Oct 20, 2008 at 3:57 PM, Giovanni Marco Dall'Olio < dalloliogm at gmail.com> wrote: > > > On Mon, Oct 20, 2008 at 7:41 AM, Tiago Ant?o wrote: > >> Hi, >> >> On Sun, Oct 19, 2008 at 3:50 PM, Giovanni Marco Dall'Olio >> wrote: >> > ok, thank you very much!! >> > I would like to use git to keep track of the changes I will make to the >> > code. >> > What do you think if I'll upload it to http://github.com and then >> upload it >> > back on biopython when it is finished? >> > I am not sure, but I think it would be possible to convert the logs back >> to >> > cvs to reintegrate the changes in biopython. >> >> I think it is a good idea. When we reintegrate back I think there will >> be no need to backport the commit logs anyway. > > > Ok, I have uploaded the code to: > - http://github.com/dalloliogm/biopython---popgen > I wrote a prototype for a PED file parser which uses your PopGen.Record object to store data. It's available on github: I have still to finish the consumer object and to test it, but I think I will be able to finish it for today. I left you a few comments on the github wiki: - http://github.com/dalloliogm/biopython---popgen/wikis/home Maybe the biggest issue is that I will have to use this library to parse very big files, so there are a few things we could change in the implementation of the parser. Is there any way in python to force the interpreter to store variables in temporary files instead of RAM memory? I was thinking about modules like shelve, cPickle, but I am not sure they work in this way. We could also modify the parser in a way that it can accept a list of populations as argument, and create a populations list with only those populations from the file. 
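On the question of keeping parsed data out of RAM: the standard library shelve module does work in that way - it behaves like a dictionary whose values are pickled to a file on disk as they are assigned. A minimal sketch under assumed names (the .ped and shelf filenames, and the usual six leading PED columns - family, individual, father, mother, sex, phenotype - ahead of the genotypes, are all assumptions here):

import shelve

store = shelve.open("ped_records.shelve")  # disk-backed dict; values are pickled

handle = open("mystudy.ped")
for i, line in enumerate(handle):
    fields = line.split()
    if not fields or fields[0].startswith("#"):
        continue  # skip blank lines and comments
    # keep only what is needed; shelve keys must be strings
    store[str(i)] = {"family": fields[0],
                     "individual": fields[1],
                     "genotypes": fields[6:]}
handle.close()
store.close()

# later, possibly in another script:
store = shelve.open("ped_records.shelve")
print("%i individuals stored" % len(store))
store.close()

cPickle on its own just serialises one object at a time, so shelve (or the iterator approach suggested in the reply below) is the closer fit to "parse once, keep the working set small".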
-- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Wed Oct 22 06:34:21 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Oct 2008 11:34:21 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> Message-ID: <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> On Wed, Oct 22, 2008 at 11:25 AM, Giovanni Marco Dall'Olio wrote: > Maybe the biggest issue is that I will have to use this library to parse > very big files, so there are a few things we could change in the > implementation of the parser. > Is there any way in python to force the interpreter to store variables in > temporary files instead of RAM memory? > I was thinking about modules like shelve, cPickle, but I am not sure they > work in this way. I have not looked at the specifics here, but adopting an iterator approach might make sense - returning the entries one by one as parsed from the file. This is the idea for the Bio.SeqIO and Bio.AlignIO parsers. The user can then turn the entries into a list (if they have enough memory), filter them as the arrive, etc. For example, you could compile a list of only those desired population entries, discarding the others on the fly. Peter From bsouthey at gmail.com Wed Oct 22 11:04:29 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 22 Oct 2008 10:04:29 -0500 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com> References: <48FE3D4F.6060005@gmail.com> <320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com> Message-ID: <48FF40FD.5020604@gmail.com> Peter wrote: > On Wed, Oct 22, 2008 at 9:31 AM, Leighton Pritchard wrote: > >> On 21/10/2008 21:36, "Bruce Southey" wrote: >> >> >>> For completeness as these are not 100% correct, >>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN >>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV >>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN >>> > > I was going to jump up and down and disagree with you here Bruce, but > Leighton has already made the same point, (CGV | AGR) != MGV etc. > It is true that the ambiguous codon MGV would cover all the possible > Arg codons, but it includes more than that. While this could be a > useful thing for certain back-translation reasons, it does break the > expectation that translate(back_translate(sequence)) == sequence > [currently the behaviour available in Bio.Translate]. > Leighton does show these are correct: (CGV | AGR) == MGV and MGV ==(CGV | AGR) BUT I fully agree that MGV does stand for other other codons that are do not translate for Arg as Leighton pointed out. 
This was why I prefixed this by stating "these are not 100% correct" so I am sorry that I was not clear enough. Yes, I am also very aware that this creates a problem for doing a translate(back_translate(sequence)) without using a special translation table (yet another reason for not including it in Seq object or just return an exception). As I pointed in your other thread that I do not believe that a back-translation should be part of the Seq object. If for no other reason than back-translation just creates too many ambiguous nucleotides in one DNA sequence. This will cause some of the algorithms to determine protein or DNA sequences to fail (back_translate('AFLFQPQRFGR') gives 'GCNTTYYTNTTYCARCCNCARMGVTTYGGNMGV', which causes NCBI's online BLASTN to say it is protein). In anycase, BLAST and such are not very good at handling multiple ambiguous nucleotides in a sequence when probably one-third to one-half of the sequence would be ambiguous nucleotides. Bruce From biopython at maubp.freeserve.co.uk Wed Oct 22 11:33:00 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Oct 2008 16:33:00 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48FF40FD.5020604@gmail.com> References: <48FE3D4F.6060005@gmail.com> <320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com> <48FF40FD.5020604@gmail.com> Message-ID: <320fb6e00810220833j4089cec1i6fb9cee563b562d3@mail.gmail.com> Bruce wrote: >>>> For completeness as these are not 100% correct, >>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN >>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV >>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN Just for the record, in addition to the debate about the final equal signs above, there is at least one error in the above - for the leucine codons, (TTN|CTR) should read (TTR|CTN), but this doesn't matter for the discussion in hand. Bruce wrote: > Leighton does show these are correct: > (CGV | AGR) == MGV > and MGV ==(CGV | AGR) I don't think Leighton did mean to say that. A set of 6 codons is NOT equal to a set of 8 codons. However, if we say "sub set" or "super set" here things are probably fine (I haven't double checked the correct ambiguity codes are used here). Similarly, Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTR|CTN) covers 6 unambiguous codons. This is a subset of YTN = (TTC|TTA|TTG|TTT|CTC|CTA|CTG|CTT) which covers 8 unambiguous codons. Having back_translate("L") == "YTN" means translate(back_translate("L")) == "X", which would surprise many. Using "YTN" covers all the codons plus some extra ones. This might be useful for searching purposes, but otherwise its very misleading. Having back_translate("L") == "CTN" means translate(back_translate("L")) == "L", but doesn't cover the two codons TTR (i.e. TTA or TTG). At least this is better than back_translate("L") == "TTR" which still has translate(back_translate("L")) == "L", but doesn't cover the four codons CTN. Picking any one of the six codons also ensures translate(back_translate("L")) == "L" but of course doesn't cover the other five codons. In all three cases, the utility of the back translation is limited. > Yes, I am also very aware that this creates a problem for doing a > translate(back_translate(sequence)) without using a special translation > table (yet another reason for not including it in Seq object or just return > an exception). Yes. > As I pointed in your other thread that I do not believe that a > back-translation should be part of the Seq object. 
In the absence of a compelling use case, I agree. > If for no other reason > than back-translation just creates too many ambiguous nucleotides in one DNA > sequence. This will cause some of the algorithms to determine protein or DNA > sequences to fail (back_translate('AFLFQPQRFGR') gives > 'GCNTTYYTNTTYCARCCNCARMGVTTYGGNMGV', which causes > NCBI's online BLASTN to say it is protein). In such cases, you can explicitly tell BLAST (or other tools) if they are using nucleotides or proteins. However this is a valid concern for working with ambiguous nucleotides. As an aside, zen of python "In the face of ambiguity, refuse the temptation to guess." (here nucleotide versus protein) > In anycase, BLAST and such are not very good at handling > multiple ambiguous nucleotides in a sequence when probably > one-third to one-half of the sequence would be ambiguous > nucleotides. Ambiguous searches are bound to be tricky. Peter From lpritc at scri.ac.uk Wed Oct 22 11:34:47 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 22 Oct 2008 16:34:47 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48FF40FD.5020604@gmail.com> Message-ID: On 22/10/2008 16:04, "Bruce Southey" wrote: > Peter wrote: >> On Wed, Oct 22, 2008 at 9:31 AM, Leighton Pritchard >> wrote: >> >>> On 21/10/2008 21:36, "Bruce Southey" wrote: >>>> For completeness as these are not 100% correct, >>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN >>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV >>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN >>>> >> >> I was going to jump up and down and disagree with you here Bruce, but >> Leighton has already made the same point, (CGV | AGR) != MGV etc. >> It is true that the ambiguous codon MGV would cover all the possible >> Arg codons, but it includes more than that. >> > Leighton does show these are correct: > (CGV | AGR) == MGV > and MGV ==(CGV | AGR) I showed (and Peter also points out) that (TTN|CTR) is a subset of YTN, and that (TCN|AGY) is a subset of WSN, and not that they are equivalent, which is what you have written above. For that equivalence, we would also require that MGV is a subset of (CGV|AGR), which is not true. Likewise I also showed that, although (CGV|AGR) is a subset of MGV, neither CGV nor MGV include CGT, which is a valid codon for arginine. Whether or not this error is corrected to CGN/MGN, the regular expression is still only a subset of those codons implied by the IUPAC ambiguity symbols. L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. 
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From bsouthey at gmail.com Wed Oct 22 11:50:19 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 22 Oct 2008 10:50:19 -0500 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <320fb6e00810220833j4089cec1i6fb9cee563b562d3@mail.gmail.com> References: <48FE3D4F.6060005@gmail.com> <320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com> <48FF40FD.5020604@gmail.com> <320fb6e00810220833j4089cec1i6fb9cee563b562d3@mail.gmail.com> Message-ID: <48FF4BBB.8020007@gmail.com> Peter wrote: > Bruce wrote: > >>>>> For completeness as these are not 100% correct, >>>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN >>>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV >>>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN >>>>> > > Just for the record, in addition to the debate about the final equal > signs above, there is at least one error in the above - for the > leucine codons, (TTN|CTR) should read (TTR|CTN), but this doesn't > matter for the discussion in hand. > > Thanks for correctly that one. Bruce From tiagoantao at gmail.com Wed Oct 22 11:52:19 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 22 Oct 2008 16:52:19 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> Message-ID: <6d941f120810220852m606b8825tafaaf871f22df48e@mail.gmail.com> Hi, [Back in office now] > Ok, I have uploaded the code to: > - http://github.com/dalloliogm/biopython---popgen > > I put the code I wrote before writing in this mailing list in the folder > PopGen/Gio Thanks I will have a look and get acquainted with GIT. >> I am afraid that this is not enough. Even for Fst. I suppose you are >> acquainted with a formula with just heterozigosities. > > Yes, I was trying to implement a very basic formula at first. For publication and data analysis the standard is Cockerham and Wier's theta. The Standard Ht/(Hs-Ht) (or a variation of this) might be misleading in regards to the amount of information that is needed. > Yes, I agree. It was just a first try. We should collect some good > use-cases. In my head I divide statistics in the following dimensions: 1. genetic versus genomic (e.g. Fst is single locus, LD can be seen as requiring more than 1 locus, therefore is "genomic") 2. 
frequency based versus marker based (some statistics require frequencies only - ie, you can calculate them irrespective of the type of marker - This is the case of Fst. Others are marker dependent, say Tajima D requires sequences and can only be used with sequences) 3. population structure versus no pop structure. Some stats require population structure (again, Fst), others don't (e.g., allelic richness) >From my point of view, a long-term solution needs to take into account these dimensions (and others that I might be forgetting). One can think in a solution based on Populations and Individuals as fundamental objects (as opposed to statistics), but, from my experience it is very difficult to define what is an "individual" (i.e., what kind of information you need to store - I can expand on this). It is easier to think in terms of statistics. One fundamental point is that we don't have many opportunities to make it right: if we define an architecture which proves in the future to be not sufficient, then we will have to both maintain the old legacy (because there will be users around whose code cannot be constantly broken when a new version is made available) while hack the new features in. From tiagoantao at gmail.com Wed Oct 22 12:00:39 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 22 Oct 2008 17:00:39 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> Message-ID: <6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com> Hi, On Wed, Oct 22, 2008 at 11:25 AM, Giovanni Marco Dall'Olio wrote: > I wrote a prototype for a PED file parser which uses your PopGen.Record > object to store data. Don't feel obliged to use GenePop.Record. You can (maybe you should) use one that is better for your PED record. The point is: your PED files might have extra (or less) information than genepop files. For instance, they might have population names. They might store the SNP (A, C, T, G). With genepop you would have to convert (and thus loose) the extra info. > Maybe the biggest issue is that I will have to use this library to parse > very big files, so there are a few things we could change in the > implementation of the parser. Yet another reason to develop your own record. I would not mind helping you with that. > We could also modify the parser in a way that it can accept a list of > populations as argument, and create a populations list with only those > populations from the file. We have to be careful in modifying existing code. We can add new functionality, add new interfaces. But changing existing interfaces or removing them has to be dealt with exceptional care, because that will break (existing) code done by users. 
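For anyone following the Fst side of this thread, the heterozygosity-based estimator alluded to above is usually written Fst = (Ht - Hs) / Ht, with Hs the mean expected heterozygosity within populations and Ht the expected heterozygosity of the pooled allele frequencies. A toy single-locus sketch in plain Python (equal sample sizes assumed; as noted above, Weir and Cockerham's theta remains the estimator to report in practice):

def expected_het(freqs):
    """Expected heterozygosity: 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)

def basic_fst(pop_freqs):
    """(Ht - Hs) / Ht for one locus.

    pop_freqs is a list of per-population allele frequency lists,
    e.g. [[0.2, 0.8], [0.7, 0.3]] for a bi-allelic SNP in two populations.
    """
    n_pops = len(pop_freqs)
    n_alleles = len(pop_freqs[0])
    hs = sum(expected_het(freqs) for freqs in pop_freqs) / n_pops
    pooled = [sum(freqs[i] for freqs in pop_freqs) / n_pops
              for i in range(n_alleles)]
    ht = expected_het(pooled)
    if ht == 0:
        return 0.0  # monomorphic locus; Fst is undefined, report zero here
    return (ht - hs) / ht

print(basic_fst([[0.2, 0.8], [0.7, 0.3]]))  # about 0.2525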
Tiago From tiagoantao at gmail.com Wed Oct 22 12:03:59 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 22 Oct 2008 17:03:59 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> Message-ID: <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> On Wed, Oct 22, 2008 at 11:34 AM, Peter wrote: > I have not looked at the specifics here, but adopting an iterator > approach might make sense - returning the entries one by one as parsed > from the file. This is the idea for the Bio.SeqIO and Bio.AlignIO > parsers. The user can then turn the entries into a list (if they have > enough memory), filter them as the arrive, etc. For example, you > could compile a list of only those desired population entries, > discarding the others on the fly. I will have look at iterators in Python. This idea from Giovannni is actually floating around with current users for GenePop data which have exactly the same problem (loooong records). From dalloliogm at gmail.com Wed Oct 22 13:10:45 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 22 Oct 2008 19:10:45 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> Message-ID: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> On Wed, Oct 22, 2008 at 6:03 PM, Tiago Ant?o wrote: > On Wed, Oct 22, 2008 at 11:34 AM, Peter > wrote: > > I have not looked at the specifics here, but adopting an iterator > > approach might make sense - returning the entries one by one as parsed > > from the file. This is the idea for the Bio.SeqIO and Bio.AlignIO > > parsers. The user can then turn the entries into a list (if they have > > enough memory), filter them as the arrive, etc. For example, you > > could compile a list of only those desired population entries, > > discarding the others on the fly. > > I will have look at iterators in Python. This idea from Giovannni is > actually floating around with current users for GenePop data which > have exactly the same problem (loooong records). 
> Iterators are more difficult to implement in Ped files, because in this format every line of the file is an individual, so to write an iterator which iterates by population we will need to read at list the first row of every line of all the file. I was also thinking of starting using a database to store data, instead of files. This would probably solve the problem of out of memory when parsing those long files. I would probably use sqlalchemy to interface with this database: this is why I would like to implement a Population and Individual objects, it will fit better with relational mapping. -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From dalloliogm at gmail.com Wed Oct 22 13:12:24 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 22 Oct 2008 19:12:24 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810220852m606b8825tafaaf871f22df48e@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <6d941f120810220852m606b8825tafaaf871f22df48e@mail.gmail.com> Message-ID: <5aa3b3570810221012v3543a533u15f81196752cd52@mail.gmail.com> On Wed, Oct 22, 2008 at 5:52 PM, Tiago Ant?o wrote: > Hi, > > [Back in office now] > > > Ok, I have uploaded the code to: > > - http://github.com/dalloliogm/biopython---popgen > > > > I put the code I wrote before writing in this mailing list in the folder > > PopGen/Gio > > Thanks I will have a look and get acquainted with GIT. > It' s the first time I am using github for something serious, too. Please tell me if you need me to add you as a 'collaborator' in the project or something like this. I am using eclipse with a plugin for git (http://www.jgit.org/update-site) and it works very well. I think there is a plugin for vim, too. Sorry, today I couldn't do too much - I spent most of the day in seminars and meetings :(. > > > > Yes, I agree. It was just a first try. We should collect some good > > use-cases. > > > In my head I divide statistics in the following dimensions: > 1. genetic versus genomic (e.g. Fst is single locus, LD can be seen as > requiring more than 1 locus, therefore is "genomic") > 2. frequency based versus marker based (some statistics require > frequencies only - ie, you can calculate them irrespective of the type > of marker - This is the case of Fst. Others are marker dependent, say > Tajima D requires sequences and can only be used with sequences) > 3. population structure versus no pop structure. Some stats require > population structure (again, Fst), others don't (e.g., allelic > richness) > > From my point of view, a long-term solution needs to take into account > these dimensions (and others that I might be forgetting). 
> > One can think in a solution based on Populations and Individuals as > fundamental objects (as opposed to statistics), but, from my > experience it is very difficult to define what is an "individual" > (i.e., what kind of information you need to store - I can expand on > this). It is easier to think in terms of statistics. > > One fundamental point is that we don't have many opportunities to make > it right: if we define an architecture which proves in the future to > be not sufficient, then we will have to both maintain the old legacy > (because there will be users around whose code cannot be constantly > broken when a new version is made available) while hack the new > features in. > ok... but we can try :). We could use the github's wiki to better organize these ideas. I will answer to you better tomorrow (or tonight). Now, I need a bit of fresh air! :) -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From dalloliogm at gmail.com Wed Oct 22 13:12:41 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 22 Oct 2008 19:12:41 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com> Message-ID: <5aa3b3570810221012w2c894977sd0f86297f42f9394@mail.gmail.com> On Wed, Oct 22, 2008 at 6:00 PM, Tiago Ant?o wrote: > Hi, > > On Wed, Oct 22, 2008 at 11:25 AM, Giovanni Marco Dall'Olio > wrote: > > I wrote a prototype for a PED file parser which uses your PopGen.Record > > object to store data. > > Don't feel obliged to use GenePop.Record. You can (maybe you should) > use one that is better for your PED record. The point is: your PED > files might have extra (or less) information than genepop files. For > instance, they might have population names. They might store the SNP > (A, C, T, G). With genepop you would have to convert (and thus loose) > the extra info. I first tried to write an AbstractPopRecord class from which to derive both Ped.Record and your GenePop.Record classes. Then, I realized that I wanted to use all of your methods and decided to import your GenePop.Record instead of writing a new one. Moreover, there are some methods (like GenePop.Record.split_in_pops) that create Record objects, and I thought it would have been easier to always refer to the same one. Maybe we should write a generic PopGenRecord in which to store all general informations about population genetics data. > > > > Maybe the biggest issue is that I will have to use this library to parse > > very big files, so there are a few things we could change in the > > implementation of the parser. > > Yet another reason to develop your own record. I would not mind > helping you with that. 
> > > > We could also modify the parser in a way that it can accept a list of > > populations as argument, and create a populations list with only those > > populations from the file. > > We have to be careful in modifying existing code. We can add new > functionality, add new interfaces. But changing existing interfaces or > removing them has to be dealt with exceptional care, because that will > break (existing) code done by users. > > Tiago > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Wed Oct 22 13:26:07 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Oct 2008 18:26:07 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> Message-ID: <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com> On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio wrote: > > Iterators are more difficult to implement in Ped files, because in this > format every line of the file is an individual, so to write an iterator > which iterates by population we will need to read at list the first row of > every line of all the file. It sounds like for Ped files it would make more sense to iterate over the individuals. The mental picture I have in mind is a big spreadsheet, individuals as rows (lines), populations (and other information) as columns. By having the parser iterate over the individuals one by one, the user could then "simplify" each individual as they are read in, recording in memory just the interesting data. This way the whole dataset need not be kept in memory. > I was also thinking of starting using a database to store data, instead of > files. This would probably solve the problem of out of memory when parsing > those long files. > I would probably use sqlalchemy to interface with this database: this is why > I would like to implement a Population and Individual objects, it will fit > better with relational mapping. That would mean adding sqlalchemy as another (optional) dependency for Biopython. If you could use MySQLdb instead that would be better as several existing modules use this. However, I would encourage you to avoid any database if possible because this makes the installation much more complicated for the end user, and imposes your own arbitrary schema as well. It also means setting up suitable unit tests is also a pain. Peter From rsclary at uncc.edu Wed Oct 22 15:49:33 2008 From: rsclary at uncc.edu (Clary, Richard) Date: Wed, 22 Oct 2008 15:49:33 -0400 Subject: [BioPython] Retrieving nucleotide sequence for given accession Entrez ID Message-ID: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu> Can anyone provide succinct Python function to retrieve the nucleotide sequence (as a string) for a given nucleotide accession ID? 
Attempting to do this through E-Utils but having a difficult time figuring out the best way to do this without having to download a FASTA file... Thanks in advance, R From biopython at maubp.freeserve.co.uk Wed Oct 22 16:15:37 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Oct 2008 21:15:37 +0100 Subject: [BioPython] Retrieving nucleotide sequence for given accession Entrez ID In-Reply-To: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu> References: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu> Message-ID: <320fb6e00810221315i31358bc2n2e5c9be405a77e42@mail.gmail.com> On Wed, Oct 22, 2008 at 8:49 PM, Clary, Richard wrote: > > Can anyone provide succinct Python function to retrieve the > nucleotide sequence (as a string) for a given nucleotide > accession ID? Attempting to do this through E-Utils but > having a difficult time figuring out the best way to do this > without having to download a FASTA file... Hi Richard, Are you trying this using Bipython's Bio.Entrez, or accessing E-Utils directly? Anyway, you'll want to use efetch (e.g. via the Bio.Entrez.efetch function in Biopython) http://www.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html This documentation covers the possible return formats, http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html I think FASTA would be simplest (I don't see a plain or raw text option), and has only a tiny overhead in the download size over the raw sequence. Getting the sequence out of a FASTA file as a string is trivial - for example, using Biopython: from Bio import Entrez, SeqIO Entrez.email = "Richard at example.com" #Tell the NCBI who you are handle = Entrez.efetch(db="nucleotide", id="186972394",rettype="fasta") seq_str = str(SeqIO.read(handle, "fasta").seq) Peter From dalloliogm at gmail.com Thu Oct 23 05:41:04 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 23 Oct 2008 11:41:04 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com> Message-ID: <5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com> On Wed, Oct 22, 2008 at 7:26 PM, Peter wrote: > On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio > wrote: > > > > Iterators are more difficult to implement in Ped files, because in this > > format every line of the file is an individual, so to write an iterator > > which iterates by population we will need to read at list the first row > of > > every line of all the file. > > It sounds like for Ped files it would make more sense to iterate over > the individuals. The mental picture I have in mind is a big > spreadsheet, individuals as rows (lines), populations (and other > information) as columns. 
By having the parser iterate over the > individuals one by one, the user could then "simplify" each individual > as they are read in, recording in memory just the interesting data. > This way the whole dataset need not be kept in memory. This makes sense. Basically, we should write a (Ped/GenePop)Iterator function, which should read the file one line at a time, check if it a has correct syntax and is not a comment, and then use 'yield' to create a Record object. Am I right? > > > I was also thinking of starting using a database to store data, instead > of > > files. This would probably solve the problem of out of memory when > parsing > > those long files. > > I would probably use sqlalchemy to interface with this database: this is > why > > I would like to implement a Population and Individual objects, it will > fit > > better with relational mapping. > > That would mean adding sqlalchemy as another (optional) dependency for > Biopython. If you could use MySQLdb instead that would be better as > several existing modules use this. However, I would encourage you to > avoid any database if possible because this makes the installation > much more complicated for the end user, and imposes your own arbitrary > schema as well. It also means setting up suitable unit tests is also > a pain. > Don't worry, I am not going to do that. I will probably use sqlalchemy only in my scripts; I will use it to retrieve data from the database, and then create Population/Marker/Individual objects using the code I am writing now, or a adapt the objects created by sqlalchemy to be compatible with the functions I will have to use. > > Peter > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Thu Oct 23 05:57:38 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Oct 2008 10:57:38 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com> <5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com> Message-ID: <320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com> On Thu, Oct 23, 2008, Giovanni Marco Dall'Olio wrote: > On Wed, Oct 22, Peter wrote: >> On Wed, Oct 22, Giovanni Marco Dall'Olio wrote: >> > >> > Iterators are more difficult to implement in Ped files, because in this >> > format every line of the file is an individual, so to write an iterator >> > which iterates by population we will need to read at list the first row >> > of every line of all the file. >> >> It sounds like for Ped files it would make more sense to iterate over >> the individuals. The mental picture I have in mind is a big >> spreadsheet, individuals as rows (lines), populations (and other >> information) as columns. 
By having the parser iterate over the >> individuals one by one, the user could then "simplify" each individual >> as they are read in, recording in memory just the interesting data. >> This way the whole dataset need not be kept in memory. > > This makes sense. > Basically, we should write a (Ped/GenePop)Iterator function, which should > read the file one line at a time, check if it a has correct syntax and is > not a comment, and then use 'yield' to create a Record object. Am I right? Yes :) Python functions written with "yield" are called "generator functions", see: http://www.python.org/dev/peps/pep-0255/ Peter From m at pavis.biodec.com Thu Oct 23 06:25:45 2008 From: m at pavis.biodec.com (m at pavis.biodec.com) Date: Thu, 23 Oct 2008 12:25:45 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> References: <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> Message-ID: <20081023102545.GE3694@pavis.biodec.com> * Giovanni Marco Dall'Olio (dalloliogm at gmail.com) [081022 19:12]: > > I was also thinking of starting using a database to store data, instead of > files. This would probably solve the problem of out of memory when parsing > those long files. If you just need to store data, i.e. you just need a thin layer above file storage, I'd suggest evaluating ZODB It's very simple, somehow pythonic, and you don't need to learn SQL to manage the data (of course, SQL is just fine, and from a real DB you get much more than just data storage, but since you are just writing about alternatives to file storage, I assume that SQL would not be a plus) HTH -- .*. finelli /V\ (/ \) -------------------------------------------------------------- ( ) Linux: Friends dont let friends use Piccolosoffice ^^-^^ -------------------------------------------------------------- It is easier to make a saint out of a libertine than out of a prig. -- George Santayana From dalloliogm at gmail.com Thu Oct 23 07:30:06 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 23 Oct 2008 13:30:06 +0200 Subject: [BioPython] [OT] Revision control and databases Message-ID: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> Hi, I have a question (well, it's not directly related to biopython or pygr, but to scientific computing). I always used flat files to store results and data for my bioinformatics analys, but not (as I was saying in another thread) I would like to start using a database to do that. The problem is I don't know if databases do Revision Control. When I used flat files, I was used to save all the results in a git repository, and, everytime something was changed or calculated again, I did commit it. Do you know how to do this with databases? Does MySQL provide support for revision control? 
Thanks :) (sorry for cross-posting :( ) -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From sdavis2 at mail.nih.gov Thu Oct 23 08:10:16 2008 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 23 Oct 2008 08:10:16 -0400 Subject: [BioPython] [OT] Revision control and databases In-Reply-To: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> Message-ID: <264855a00810230510n37d05cb1gd7b88a63988d7191@mail.gmail.com> On Thu, Oct 23, 2008 at 7:30 AM, Giovanni Marco Dall'Olio wrote: > Hi, > I have a question (well, it's not directly related to biopython or pygr, but > to scientific computing). > > I always used flat files to store results and data for my bioinformatics > analys, but not (as I was saying in another thread) I would like to start > using a database to do that. > > The problem is I don't know if databases do Revision Control. > When I used flat files, I was used to save all the results in a git > repository, and, everytime something was changed or calculated again, I did > commit it. > Do you know how to do this with databases? Does MySQL provide support for > revision control? > Thanks :) No. Relational databases just store data. You could build such a system, but that would require a fair amount of work. I would suggest storing metadata about your analyses in the database and storing the actual results on the file system. Sean From lpritc at scri.ac.uk Thu Oct 23 08:44:45 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 23 Oct 2008 13:44:45 +0100 Subject: [BioPython] [bip] [OT] Revision control and databases In-Reply-To: Message-ID: Hi Giovanni (and others) Ah, reading again I see you're already using git... Without knowing exactly what you're doing, I assume that CVS and SVN would be no improvement , so please ignore my last paragraph below ;) L. On 23/10/2008 13:39, "Leighton Pritchard" wrote: > Hi Giovanni, > > On 23/10/2008 12:30, "Giovanni Marco Dall'Olio" wrote: > >> The problem is I don't know if databases do Revision Control. >> When I used flat files, I was used to save all the results in a git >> repository, and, everytime something was changed or calculated again, I did >> commit it. >> Do you know how to do this with databases? Does MySQL provide support for >> revision control? > > Databases are just collections of data. Database Management Systems (DBMS) > such as MySQL and PostgreSQL do not (AFAIAA) do revision control themselves, > but they can be used for it, if you build that capability into the schema and > also control database submissions appropriately. There are a number of > content management systems that implement version/revision control on common > DBMS, like this. > > Stretching a definition, you could possibly argue that CVS, SVN and the like > are a form of DBMS... I don't know what type of data you're storing, or how > they might scale for your purposes but, in principle, neither CVS nor SVN care > much about whether your data represents code, legal documents, or any other > sort of data. For example, I've used CVS/SVN to version control manuscripts. > You might like to try one of them. > > Cheers, > > L. 
-- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From lpritc at scri.ac.uk Thu Oct 23 08:39:40 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 23 Oct 2008 13:39:40 +0100 Subject: [BioPython] [bip] [OT] Revision control and databases In-Reply-To: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> Message-ID: Hi Giovanni, On 23/10/2008 12:30, "Giovanni Marco Dall'Olio" wrote: > The problem is I don't know if databases do Revision Control. > When I used flat files, I was used to save all the results in a git > repository, and, everytime something was changed or calculated again, I did > commit it. > Do you know how to do this with databases? Does MySQL provide support for > revision control? Databases are just collections of data. Database Management Systems (DBMS) such as MySQL and PostgreSQL do not (AFAIAA) do revision control themselves, but they can be used for it, if you build that capability into the schema and also control database submissions appropriately. There are a number of content management systems that implement version/revision control on common DBMS, like this. Stretching a definition, you could possibly argue that CVS, SVN and the like are a form of DBMS... I don't know what type of data you're storing, or how they might scale for your purposes but, in principle, neither CVS nor SVN care much about whether your data represents code, legal documents, or any other sort of data. For example, I've used CVS/SVN to version control manuscripts. You might like to try one of them. Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. 
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From bsouthey at gmail.com Thu Oct 23 09:55:49 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Thu, 23 Oct 2008 08:55:49 -0500 Subject: [BioPython] [bip] [OT] Revision control and databases In-Reply-To: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> Message-ID: <49008265.3040205@gmail.com> Giovanni Marco Dall'Olio wrote: > Hi, > I have a question (well, it's not directly related to biopython or > pygr, but to scientific computing). > > I always used flat files to store results and data for my > bioinformatics analys, but not (as I was saying in another thread) I > would like to start using a database to do that. Of course Biopython's BioSQL interface may provide a starting point. > The problem is I don't know if databases do Revision Control. > When I used flat files, I was used to save all the results in a git > repository, and, everytime something was changed or calculated again, > I did commit it. > Do you know how to do this with databases? Does MySQL provide support > for revision control? > Thanks :) I think you are asking the wrong questions because it depends on what you want to do and what you actually store. There are a number of questions that you need to ask yourself about what you really need to do (knowing you have used git helps refine these). Examples include: How often do you use the old versions in your git repository? How do you use the old revisions in your git repository? Do you even use the information of an older version if a newer version exists? Do you actually determine when 'something was changed or calculated again' or it this partly determined by an external source like a Genbank or UniProt update? (At least in a database approach you could automate this.) How many users that can make changes? How often do you have conflicts? Are the conflicts hard to solve? Revision control may be overkill for your use because this is aims to handle many tasks and change conflicts related to multiple users rather than a single user. If you don't need all these fancy features then you can use a database. If you just want to store and retrieve a version then you can use a database but you need to at least force the inclusion a date and comment fields to be useful. 
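To make that suggestion concrete, here is a minimal sketch using Python's bundled sqlite3 module. The database file, table and column names are made up purely for illustration (this is not a recommended schema): each stored result carries its own version number, timestamp and comment, so older revisions stay retrievable.

import sqlite3
import datetime

# Hypothetical layout: one row per (result name, version), as suggested above.
connection = sqlite3.connect("results.db")
cursor = connection.cursor()
cursor.execute("""CREATE TABLE IF NOT EXISTS analysis_result (
                      name     TEXT,
                      version  INTEGER,
                      created  TEXT,
                      comment  TEXT,
                      payload  TEXT,
                      PRIMARY KEY (name, version))""")

def store_result(name, payload, comment):
    # Next version number for this result name (1 if it is new).
    cursor.execute("SELECT MAX(version) FROM analysis_result WHERE name=?", (name,))
    current = cursor.fetchone()[0]
    version = (current or 0) + 1
    cursor.execute("INSERT INTO analysis_result VALUES (?,?,?,?,?)",
                   (name, version, datetime.datetime.now().isoformat(),
                    comment, payload))
    connection.commit()
    return version

def fetch_result(name, version=None):
    # Latest version by default, or a specific older revision on request.
    if version is None:
        cursor.execute("SELECT payload FROM analysis_result WHERE name=? "
                       "ORDER BY version DESC LIMIT 1", (name,))
    else:
        cursor.execute("SELECT payload FROM analysis_result WHERE name=? AND version=?",
                       (name, version))
    row = cursor.fetchone()
    return row and row[0]

store_result("fst_chr1", "0.0512", "recalculated after fixing sample 17")
print fetch_result("fst_chr1")

A multi-user setup would also want to record who made each change, but the basic point stands: in a DBMS this kind of versioning is something you design in yourself.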
Regards Bruce From tiagoantao at gmail.com Thu Oct 23 10:51:22 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 23 Oct 2008 15:51:22 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810221012w2c894977sd0f86297f42f9394@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com> <5aa3b3570810221012w2c894977sd0f86297f42f9394@mail.gmail.com> Message-ID: <6d941f120810230751k3dee7b96y8ee13e4bf1c2a4ca@mail.gmail.com> Hi, > Moreover, there are some methods (like GenePop.Record.split_in_pops) that > create Record objects, and I thought it would have been easier to always > refer to the same one. > Maybe we should write a generic PopGenRecord in which to store all general > informations about population genetics data. The problem with that is that it is a) very difficult to come with a representation that is general enough (and usable in the long run). b) a general representation would be an hassle in specific cases Let me elaborate: Different kinds of genetic information have completely different storage needs: If you are doing genomic studies you will probably want to have location information (like this SNP is on chromosome X, position Y). Others (probably the majority) only require frequency information (or to know what the marker is, irrespective of position). In most species you don't even know the genomic position of a certain marker. So you would have to have an general representation capable to handle both position information and no position information. Then, in some cases, you need the whole marker (like if you want to do a Tajima D) or just frequency information (for Fst). Some markers (microsats) you can (in most, but not all) cases ignore the genetic pattern, you just count the repeats. You could argue that one could try to have a most general representation but that entails three problems: 1. It is very difficult to come by with a clever, correct and future proof representation. At least I've thinking on this issue since 2005 and have found no clever answer. 2. Performance: If you care about performance, having a most general data representation will bring about a big performance cost (converting from a certain general format to the format needed to do computations). 3. Different formats and statistics have different requirements: For instance on GenePop you don't have population names, neither the marker itself, but for arlequin format you have partial information on markers and full information on population names. converting the minor differences among formats to a "general" format would be complex. 
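As a purely illustrative example of the trade-off described above: a frequency-only view of the data is all an Fst-type statistic needs and is cheap to keep in memory, while sequence- or position-based statistics (for example Tajima's D, or genomic scans) require a far heavier representation. The dictionaries below are hypothetical, not a proposed Biopython data structure.

# Frequency-only view: allele counts per (population, locus).
# Enough for Fst-style statistics, and cheap to store.
freq_view = {
    ("pop1", "locus1"): {"A": 18, "G": 2},
    ("pop2", "locus1"): {"A": 7, "G": 13},
}

# Sequence/position-aware view: needed for sequence-based statistics
# such as Tajima's D, or for genomic scans, but much more expensive.
seq_view = {
    ("pop1", "ind1", "locus1"): {"chromosome": "2L",
                                 "position": 123456,
                                 "sequence": "ACGTACGTGGA"},
}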
From tiagoantao at gmail.com Thu Oct 23 11:10:51 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 23 Oct 2008 16:10:51 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> Message-ID: <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com> On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio wrote: > Iterators are more difficult to implement in Ped files, because in this > format every line of the file is an individual, so to write an iterator > which iterates by population we will need to read at list the first row of > every line of all the file. GenePop works population by population. Where I a getting at, is that different formats might have completely different strategies. I've used a strategy with the FDist parser that it might be interesting to you: 1. I read the fdist file 2. Convert it to genepop 3. do all operations in the genepop format 4. convert back if necessary. This might not work in your case because the ped format seems to be more informative than the genepop format (and thus you loose information in the conversion process). Feel free to copy and adapt my code to your own (like split_in_pops and split_in_loci) > I would probably use sqlalchemy to interface with this database: this is why > I would like to implement a Population and Individual objects, it will fit > better with relational mapping. You can go ahead and suggest formats for Populations and Individuals. But I strongly suspect that your proposal will be biased towards your needs (I've suffered the same problem myself). I think that in biopython the idea is to try to have a solution that is useful to everybody. Also, if you want to put some SQL in the code module code, you will have to have approval from the maintainers of biopython. They will send you to the BioSQL people, which will say that there is none of their business. Been there, done that, no success. Don't take me wrong, I am not trying to discourage you in any way. But I think it is better to gain some experience before proposing changes to core concepts. I've been doing this work for 3 years now, and I am convinced that it would be very hard for me to suggest a good representation for populations and individuals. Even populations are very hard to address (like, some data is geo-referenced -> called landspace genetics, and the more traditional one is not). My suggestion: solve you problem the best way you can (e.g., do an independent PED parser - you can use any of my code if you want). Solve small problems, one after another. Trying to solve the general problem is very hard and requires lots of long term experience. 
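Taking up the suggestion of an independent PED parser, together with the "generator function" approach discussed earlier in the thread, here is a minimal sketch. The record class, the comment handling and the error checking are illustrative only; it assumes the usual whitespace-separated PED layout of family ID, individual ID, father, mother, sex and phenotype, followed by pairs of allele columns.

class PedIndividual:
    """One line of a PED file (hypothetical minimal record)."""
    def __init__(self, family, individual, father, mother,
                 sex, phenotype, genotypes):
        self.family = family
        self.individual = individual
        self.father = father
        self.mother = mother
        self.sex = sex
        self.phenotype = phenotype
        self.genotypes = genotypes  # list of (allele1, allele2) tuples

def parse_ped(handle):
    """Iterate over a PED file, yielding one PedIndividual per line."""
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
            raise ValueError("Malformed PED line: %r" % line)
        alleles = fields[6:]
        genotypes = zip(alleles[0::2], alleles[1::2])
        yield PedIndividual(fields[0], fields[1], fields[2],
                            fields[3], fields[4], fields[5], genotypes)

# Usage: nothing is kept in memory beyond the current individual.
# for indiv in parse_ped(open("example.ped")):
#     print indiv.family, indiv.individual, indiv.genotypes[:3]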
From dalloliogm at gmail.com Thu Oct 23 12:25:29 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 23 Oct 2008 18:25:29 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com> Message-ID: <5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com> On Thu, Oct 23, 2008 at 5:10 PM, Tiago Ant?o wrote: > On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio > wrote: > > Iterators are more difficult to implement in Ped files, because in this > > format every line of the file is an individual, so to write an iterator > > which iterates by population we will need to read at list the first row > of > > every line of all the file. > > GenePop works population by population. Where I a getting at, is that > different formats might have completely different strategies. > I've used a strategy with the FDist parser that it might be interesting to > you: > 1. I read the fdist file > 2. Convert it to genepop > 3. do all operations in the genepop format > 4. convert back if necessary. > > This might not work in your case because the ped format seems to be > more informative than the genepop format (and thus you loose > information in the conversion process). Feel free to copy and adapt my > code to your own (like split_in_pops and split_in_loci) > > > > I would probably use sqlalchemy to interface with this database: this is > why > > I would like to implement a Population and Individual objects, it will > fit > > better with relational mapping. > > You can go ahead and suggest formats for Populations and Individuals. > But I strongly suspect that your proposal will be biased towards your > needs (I've suffered the same problem myself). I think that in > biopython the idea is to try to have a solution that is useful to > everybody. > > Also, if you want to put some SQL in the code module code, you will > have to have approval from the maintainers of biopython. They will > send you to the BioSQL people, which will say that there is none of > their business. Been there, done that, no success. > > Don't take me wrong, I am not trying to discourage you in any way. But > I think it is better to gain some experience before proposing changes > to core concepts. > I've been doing this work for 3 years now, and I am convinced that it > would be very hard for me to suggest a good representation for > populations and individuals. Even populations are very hard to address > (like, some data is geo-referenced -> called landspace genetics, and > the more traditional one is not). > > My suggestion: solve you problem the best way you can (e.g., do an > independent PED parser - you can use any of my code if you want). > Solve small problems, one after another. > Trying to solve the general problem is very hard and requires lots of > long term experience. > Well, I agree with you... 
I don't have any idea on how this problem could be resolved :). However I think it would be good to add to biopython at least some funcionality to calculate Fst statistics and parse these file formats, at least at the level at which BioPerl does. What if we just translate the same functionalities and copy the population objects from bioperl into biopython? I realize that it won't be the perfect solution: in fact, it is the same reason why I started this discussion here, the bioperl code wasn't optimized enought for what I want to do, but I didn't know how to modify perl modules and preferred python. Maybe we can just write a PED and GenePop parser and have let it work with GenePop and your modules to calculate Fst. We should agree with a population object that could be used as input for GenePop. I think it would be good anyway to release even incomplete code to the public, because it could be useful for other people. -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From dalloliogm at gmail.com Thu Oct 23 12:27:22 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 23 Oct 2008 18:27:22 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com> <5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com> <320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com> Message-ID: <5aa3b3570810230927m696f9c27gdfec084647c2c509@mail.gmail.com> On Thu, Oct 23, 2008 at 11:57 AM, Peter wrote: > On Thu, Oct 23, 2008, Giovanni Marco Dall'Olio wrote: > > On Wed, Oct 22, Peter wrote: > >> On Wed, Oct 22, Giovanni Marco Dall'Olio wrote: > >> > > >> > Iterators are more difficult to implement in Ped files, because in > this > >> > format every line of the file is an individual, so to write an > iterator > >> > which iterates by population we will need to read at list the first > row > >> > of every line of all the file. > >> > >> It sounds like for Ped files it would make more sense to iterate over > >> the individuals. The mental picture I have in mind is a big > >> spreadsheet, individuals as rows (lines), populations (and other > >> information) as columns. By having the parser iterate over the > >> individuals one by one, the user could then "simplify" each individual > >> as they are read in, recording in memory just the interesting data. > >> This way the whole dataset need not be kept in memory. > > > > This makes sense. > > Basically, we should write a (Ped/GenePop)Iterator function, which should > > read the file one line at a time, check if it a has correct syntax and is > > not a comment, and then use 'yield' to create a Record object. Am I > right? > > Yes :) > > Python functions written with "yield" are called "generator functions", > see: > http://www.python.org/dev/peps/pep-0255/ > So, how should we modify the current GenePop parser to make it work as an iterator? Now it has a 'Scanner' and 'Consumer' methods. 
Should I remove them and write a RecordIterator instead? - http://github.com/dalloliogm/biopython---popgen/tree/master/src/PopGen/Gio/Ped/__init__.py Can you explain me more or less how the 'Consumer' object works? It is mandatory to use it when creating biopython objects? p.s. do you like the doctest to show how to use the parser? > Peter > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Thu Oct 23 13:01:26 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Oct 2008 18:01:26 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810230927m696f9c27gdfec084647c2c509@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com> <5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com> <320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com> <5aa3b3570810230927m696f9c27gdfec084647c2c509@mail.gmail.com> Message-ID: <320fb6e00810231001w2345bbe5r8c1727ddf883553c@mail.gmail.com> Giovanni wrote: > So, how should we modify the current GenePop parser to make it work as an > iterator? I think this would mean breaking up the current Record object (which holds everything) into sub-records which can be yielded one by one. This would require an API change, unless you wanted to continue to offer the two approaches in parallel (not elegant, but see Bio/Sequencing/Ace.py for an example of where this made sense to do). > Now it has a 'Scanner' and 'Consumer' methods. Should I remove them and > write a RecordIterator instead? > ... > Can you explain me more or less how the 'Consumer' object works? It is > mandatory to use it when creating biopython objects? You can write an iterator with or without the Scanner/Consumer style of parser. The Scanner/Consumer system is very flexible if you want to parse the data into different objects (by using different consumers). In theory the end user could also use the provided scanner with their own consumer. However, in my opinion for parsing sequence file formats this was overkill (needlessly complicated) - as only one object is really needed to represent a sequence (we have the SeqRecord for this), so most of the recent parsers in Bio.SeqIO and Bio.AlignIO do not use the scanner/consumer setup. See also the short Tutorial section "Parser Design". http://biopython.org/DIST/docs/tutorial/Tutorial.html For population genetics given there is no one universal record object, perhaps the flexibility of the Scanner/Consumer system is worth while. On the other hand, Tiago currently has the scanner/consumer in Bio.PopGen.GenePop as private objects so this is currently a private implementation detail - one could replace the Scanner/Consumer details without breaking the public API. 
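To illustrate the Scanner/Consumer idea in general terms (the classes below are a toy sketch for an imaginary line-oriented format where a "POP" header always precedes its individuals; they are not the actual Bio.PopGen.GenePop internals): the scanner walks the file once and fires events, and swapping in a different consumer turns the same scan into a different result.

class _Scanner:
    """Walks an (imaginary) line-oriented file and fires events."""
    def feed(self, handle, consumer):
        consumer.start_record()
        for line in handle:
            line = line.rstrip()
            if not line:
                continue
            if line.startswith("POP"):              # imaginary syntax
                consumer.start_population(line[3:].strip())
            else:
                consumer.individual(line.split())
        consumer.end_record()

class _RecordConsumer:
    """Builds one in-memory record from the scanner's events."""
    def start_record(self):
        self.populations = []
    def start_population(self, name):
        self.populations.append((name, []))
    def individual(self, fields):
        self.populations[-1][1].append(fields)
    def end_record(self):
        pass

class _CountingConsumer:
    """A different consumer: same events, but only keeps counts."""
    def start_record(self):
        self.counts = {}
    def start_population(self, name):
        self.current = name
        self.counts[name] = 0
    def individual(self, fields):
        self.counts[self.current] += 1
    def end_record(self):
        pass

# Usage (toy): scanner = _Scanner(); consumer = _RecordConsumer()
#              scanner.feed(open("example.txt"), consumer)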
Peter From biopython at maubp.freeserve.co.uk Fri Oct 24 04:52:25 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Oct 2008 09:52:25 +0100 Subject: [BioPython] Retrieving nucleotide sequence for given accession Entrez ID In-Reply-To: <61B0EE7C247C1349881F63414448FC1F078874C1@EXEVS06.its.uncc.edu> References: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu> <320fb6e00810221315i31358bc2n2e5c9be405a77e42@mail.gmail.com> <61B0EE7C247C1349881F63414448FC1F078874C1@EXEVS06.its.uncc.edu> Message-ID: <320fb6e00810240152t6e6123d3la00f1fe43121b985@mail.gmail.com> Hi Richard, I've taken the liberty of CC'ing this back to the mailing list, Richard Clary wrote: > Much appreciation Peter--it worked perfectly. Good :) > If you are wanting to > retrieve multiple sequences, is a simple "+" string concatenation > sufficient as the case when using eUtils or approach it by creating > a tuple or dictionary and passing arguments? > > Richard Moving on to your multi-sequence question, using "+" doesn't seem to work - you should use a comma for concatenating the IDs when calling eFetch. What made you think of "+" here? One other tweak is that Bio.SeqIO.read(...) is for when the handle contains one and only one record. In general you'll need to use Bio.SeqIO.parse(...) instead and iterate over the records. Depending on what you want to achieve, maybe: from Bio import Entrez, SeqIO id_list = ["186972394","12345678"] Entrez.email = "Richard at example.com" #Tell the NCBI who you are handle = Entrez.efetch(db="nucleotide", id=",".join(id_list),rettype="fasta") for id,record in zip(id_list,SeqIO.parse(handle, "fasta")) : assert id in record.id, "Didn't get ID %s returned!" % id print "%s = %s" % (record.id, record.seq) #seq_str = str(record.seq) If you still want just plain strings for the sequence, maybe: from Bio import Entrez, SeqIO id_list = ["186972394","12345678"] Entrez.email = "Richard at example.com" #Tell the NCBI who you are handle = Entrez.efetch(db="nucleotide", id=",".join(id_list),rettype="fasta") seq_str_list = [str(record.seq) for record in SeqIO.parse(handle, "fasta")] If you haven't already done so, please read the NCBI guidelines for using Entrez, http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements Also, have a look at the Entrez chapter in the tutorial, especially the "history" support which may be relevant. http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Peter From dalloliogm at gmail.com Fri Oct 24 05:08:54 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Fri, 24 Oct 2008 11:08:54 +0200 Subject: [BioPython] [bip] [OT] Revision control and databases In-Reply-To: <49008265.3040205@gmail.com> References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> <49008265.3040205@gmail.com> Message-ID: <5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com> On Thu, Oct 23, 2008 at 3:55 PM, Bruce Southey wrote: > Giovanni Marco Dall'Olio wrote: > >> Hi, >> I have a question (well, it's not directly related to biopython or pygr, >> but to scientific computing). >> >> I always used flat files to store results and data for my bioinformatics >> analys, but not (as I was saying in another thread) I would like to start >> using a database to do that. >> > Of course Biopython's BioSQL interface may provide a starting point. 
The problem is that BioSQL doesn't support yet Population Genetics record (see another thread in biopython mailing list), so I would have to implement something like that in BioSQL or wait for the developers to do it. Maybe I will do this later, but now I don't have the time. > > The problem is I don't know if databases do Revision Control. >> When I used flat files, I was used to save all the results in a git >> repository, and, everytime something was changed or calculated again, I did >> commit it. >> Do you know how to do this with databases? Does MySQL provide support for >> revision control? >> Thanks :) >> > I think you are asking the wrong questions because it depends on what you > want to do and what you actually store. There are a number of questions that > you need to ask yourself about what you really need to do (knowing you have > used git helps refine these). Examples include: > How often do you use the old versions in your git repository? > How do you use the old revisions in your git repository? > Do you even use the information of an older version if a newer version > exists? > How many users that can make changes? > How often do you have conflicts? > Are the conflicts hard to solve? These are all very good questions. The problem is that I consider revision control as a 'good practice': I remember that when I was not used to keep an history of the changes to my data, it was a mess. I would like to have at least a 'version' field, to know how much my data is old. I have found this : - http://pgfoundry.org/projects/tablelog/ which seems interesting. I think this is a big issue for bioinformatics. How is it possible that nobody has never tried to implement such a functionality for databases? Version Control could be difficult to implement, but not so much. There is must be something that I can reuse... Do you actually determine when 'something was changed or calculated again' > or it this partly determined by an external source like a Genbank or UniProt > update? (At least in a database approach you could automate this.) Well, it could be useful to > > > Revision control may be overkill for your use because this is aims to > handle many tasks and change conflicts related to multiple users rather than > a single user. If you don't need all these fancy features then you can use > a database. If you just want to store and retrieve a version then you can > use a database but you need to at least force the inclusion a date and > comment fields to be useful. Maybe there are other similar tools. This is a big issue for bioinformatics. I think it is a good, when working with Unfortunately I think revision control would be very useful for me. The data in the database will be used and uploaded by 4 or 5 people. It will be used also to store the results from some script: > > > > Regards > Bruce > Thank you very much for all the replies.. I didn't expect so many of them. 
-- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Fri Oct 24 05:25:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Oct 2008 10:25:31 +0100 Subject: [BioPython] [bip] [OT] Revision control and databases In-Reply-To: <5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com> References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> <49008265.3040205@gmail.com> <5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com> Message-ID: <320fb6e00810240225y1c380de5y6144a80ece808b2c@mail.gmail.com> Giovanni Marco Dall'Olio wrote: > Bruce Southey wrote: >> Of course Biopython's BioSQL interface may provide a starting point. > > The problem is that BioSQL doesn't support yet Population Genetics record > (see another thread in biopython mailing list), so I would have to implement > something like that in BioSQL or wait for the developers to do it. > Maybe I will do this later, but now I don't have the time. BioSQL currently focuses on annotated sequences, but they are working on some phylogenetics support too. See http://www.biosql.org/ and the PhyloDB extension module. If there was enough interest, perhaps a BioSQL schema for Population Genetics could be devised too. Giovanni Marco Dall'Olio wrote: >>> The problem is I don't know if databases do Revision Control. >>> When I used flat files, I was used to save all the results in a git >>> repository, and, everytime something was changed or calculated >>> again, I did commit it. >>> Do you know how to do this with databases? Does MySQL >>> provide support for revision control? As other people have said, databases don't generally "waste" resources on version control. If you need this, then it is up to you to design your schema to record this additional metadata. For example, the BioSQL sequences have a "version" field in the "bioentry" table allowing multiple revisions of the same accession to be held. When querying the database, you could request a particular version, or indeed the latest version. Essentially AFAIK database version control is a Do-It-Yourself affair when designing your database tables. Peter From lpritc at scri.ac.uk Fri Oct 24 05:51:35 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Fri, 24 Oct 2008 10:51:35 +0100 Subject: [BioPython] [bip] [OT] Revision control and databases In-Reply-To: <5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com> Message-ID: On 24/10/2008 10:08, "Giovanni Marco Dall'Olio" wrote: > The problem is that BioSQL doesn't support yet Population Genetics record > (see another thread in biopython mailing list), so I would have to implement > something like that in BioSQL or wait for the developers to do it. > Maybe I will do this later, but now I don't have the time. To be fair, that's a different problem from version control... >> How often do you use the old versions in your git repository? >> How do you use the old revisions in your git repository? >> Do you even use the information of an older version if a newer version >> exists? >> How many users that can make changes? >> How often do you have conflicts? >> Are the conflicts hard to solve? > > These are all very good questions. 
> The problem is that I consider revision control as a 'good practice' I think that you're right - it is good practice, and Bruce raises excellent questions here: what individuals want or need from version control depends greatly on their own situation, and whether a particular package fits your own needs will depend on what they are. If you don't know what they are before choosing a package, then there's the risk of making an unsuitable choice. It's worth noting that revision control can also mean slightly different things to different people. Some might say that a version number and an ID for the entity (human or automated) making that change is sufficient. Some might say that you ought not to stop short of conflict resolution and branch control. It depends on the needs of your project, IMO. > I think this is a big issue for bioinformatics. How is it possible that nobody > has never tried to implement such a functionality for databases Databases (DBMS, to be picky) are a general-purpose solution for many different kinds of problem. Revision control is an inhomogeneous problem with no optimal solution that can be implemented in many ways and not only using DBMS. There are plenty of revision control examples implemented in databases, and the examples that first come to mind in Python for me are content management systems such as Zope and Plone. I think that BASE implements one, but it's a long time since I looked at it. > Unfortunately I think revision control would be very useful for me. > The data in the database will be used and uploaded by 4 or 5 people. Then at a minimum you may need a solution that records version changes, and associates versions with individuals (and perhaps individual runs of scripts). You may also need locking and collision detection/conflict resolution (which DBMS like MySQL and PostgreSQL support internally via transactions; they don't generally implement version control because it would be wasteful), depending on whether you expect that multiple people might modify the same file at or at about the same time. Best, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). 
______________________________________________________________________ From cy at cymon.org Fri Oct 24 06:46:28 2008 From: cy at cymon.org (Cymon Cox) Date: Fri, 24 Oct 2008 11:46:28 +0100 Subject: [BioPython] BioSQL / phylodb Message-ID: <7265d4f0810240346m456e724ax49f7c18048a29749@mail.gmail.com> Hi All, Ive been looking at the phylodb extension to BioSQL. Does anyone have any python code for uploading a tree? Cheers, C. -- ____________________________________________________________________ Cymon J. Cox From biopython at maubp.freeserve.co.uk Fri Oct 24 06:54:28 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Oct 2008 11:54:28 +0100 Subject: [BioPython] BioSQL / phylodb In-Reply-To: <7265d4f0810240346m456e724ax49f7c18048a29749@mail.gmail.com> References: <7265d4f0810240346m456e724ax49f7c18048a29749@mail.gmail.com> Message-ID: <320fb6e00810240354q2b3c2a93p3c0c45b5ed48df3c@mail.gmail.com> On Fri, Oct 24, 2008 at 11:46 AM, Cymon Cox wrote: > Hi All, > > Ive been looking at the phylodb extension to BioSQL. Does anyone have any > python code for uploading a tree? > > Cheers, C. Not that I'm aware of, no. Adding support to Biopython's BioSQL module to do this, and also retrieve the data as a tree would be nice. The Bio.Nexus.Tree class would seem a logical representation to try and use. As an aside, being about to load a taxonomy from the main BioSQL taxon/taxon_name tables as a tree might be nice too. Peter From kteague at bcgsc.ca Fri Oct 24 14:32:41 2008 From: kteague at bcgsc.ca (Kevin Teague) Date: Fri, 24 Oct 2008 11:32:41 -0700 Subject: [BioPython] [bip] [OT] Revision control and databases In-Reply-To: References: Message-ID: <3F2B0CD4-83DF-4A88-B22D-926B97503B7C@bcgsc.ca> > >> I think this is a big issue for bioinformatics. How is it possible >> that nobody >> has never tried to implement such a functionality for databases > > Databases (DBMS, to be picky) are a general-purpose solution for many > different kinds of problem. Revision control is an inhomogeneous > problem > with no optimal solution that can be implemented in many ways and > not only > using DBMS. There are plenty of revision control examples > implemented in > databases, and the examples that first come to mind in Python for me > are > content management systems such as Zope and Plone. I think that BASE > implements one, but it's a long time since I looked at it. The default file storage for Zope Object Database (ZODB) appends all new database writes, keeping older transactions on disk (similar to the way PostgreSQL works). Back in the day (circa 2000) Zope 2 exposed this database-level feature at the application level in the Zope Management Interface (ZMI). So you could see all past writes to the database, and try and revert back to an older one if desired (using the "undo" tab of the ZMI). Problems with this approach included using sysadmin tools on the database could break application behaviour. e.g. lets say you had a "Document" object and a "Page Counter" object, you would wish to be able to view older versions of Documents, but only care about the current state of the Page Counters. However, if your Page Counters are changing like crazy and taking up tonnes of disk space and generally slowing down queries against the history of the database, there was no way to say "delete all outdated ephemeral Page Counter versions, but keep Document-related transactions" (especially since a Page Counter change and a Document change often commited in the same transaction). 
ZWiki exposed older revisions using this feature, and the accepted practice was to put each wiki into it's own database so that other forms of database maintenance didn't accidently blow away your wiki history ... it wasn't so pretty :P You also had problems reverting back to just a specific revision, for example if you were in Revision 3 and you had changes in Revision 1 that you wanted to go back to, but you'd made changes in Revision 2 that referenced Revision 1, then you first had to step-back to Revision 2 before you could revert back to Revision 1. Even though Revision 2 also contained a bunch of changes that you didn't want to revert, that you would then manually need to later re-apply. Ug! Zope 2 also had a Version object, you could poke a button in the UI to start a new "transaction" and then start making changes to code +content in the database. This was just implemented as a long-running transaction - from the point of starting to commiting a transaction could sometimes last for a whole month :). The problem being that when you finally wanted to commit the transaction to roll-out new features on a web site, if there were any conflicts from changes that happened you were hosed and would end-up copying those changes into a new transaction based off the latest database version and commiting that. It wasn't pretty :( It has long since been acknowledged by Zope developers that exposing database level features at the application level is a Bad Thing(TM)! Today there is a whole plethora of products for Zope that do some form of versioning, but they are all implemented at the application level. There is a whole plethora of products because there are many ways to do versioning, and the choices of how versions are managed is really best left up to the specific application. Some of these products provide reasonable APIs for implementing specific versioning within a specific platform - e.g Plone has a package called plone.app.iterate and it has APIs that use standard versioning terminology (checkin, checkout, working copy) for example: class ICheckinCheckoutTool( Interface ): def allowCheckin( content ): """ denotes whether a checkin operation can be performed on the content. """ def allowCheckout( content ): """ denotes whether a checkout operation can be performed on the content. """ def allowCancelCheckout( content ): """ denotes whether a cancel checkout operation can be performed on the content. """ def checkin( content, checkin_messsage ): """ check the working copy in, this will merge the working copy with the baseline """ def checkout( container, content ): """ """ def cancelCheckout( content ): """ From sbassi at gmail.com Fri Oct 24 21:03:43 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Fri, 24 Oct 2008 22:03:43 -0300 Subject: [BioPython] Loading dbxrefs from a gbk file Message-ID: I have a genbank file like this one: http://www.pastecode.com.ar/f231664eb I parse it with SeqIO.parse and the SeqRecord object I get is: SeqRecord(seq=Seq('GAGAAGGACGCGCGGCCCCCAGCGCCTCTTGGGTGGCCGCCTCGGAGCATGACC...ATA', IUPACAmbiguousDNA()), id='NM_000208.2', name='NM_000208', description='Homo sapiens insulin receptor (INSR), transcript variant 1, mRNA.', dbxrefs=[]) If you look at lines 130 to 133 (I highlighted in yellow) of the genbank sequence, there is cross database information (db_xref), but it is not associated with the SeqRecord, it is an empty list. According to http://www.biopython.org/wiki/SeqRecord, this condition is known, but I don't understand if this is on porpuse or is a bug. 
Best, SB. -- Vendo isla: http://www.genesdigitales.com/isla/ Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 "It is pitch black. You are likely to be eaten by a grue." -- Zork From biopython at maubp.freeserve.co.uk Sat Oct 25 13:22:27 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 25 Oct 2008 18:22:27 +0100 Subject: [BioPython] Loading dbxrefs from a gbk file In-Reply-To: References: Message-ID: <320fb6e00810251022u577d4f16wa7602c7bdc664322@mail.gmail.com> On Sat, Oct 25, 2008 at 2:03 AM, Sebastian Bassi wrote: > I have a genbank file like this one: http://www.pastecode.com.ar/f231664eb > ... > If you look at lines 130 to 133 (I highlighted in yellow) of the > genbank sequence, there is cross database information (db_xref), but > it is not associated with the SeqRecord, it is an empty list. What you have highlighted is part of a gene feature, and would not be part of the SeqRecord's db_xref list. It should however be present in the relevant SeqRecord feature. Try: print my_record.features[1] (seeing as this is the second feature in the file, i.e. feature 1 using zero-based counting). Peter From tiagoantao at gmail.com Sat Oct 25 21:04:01 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Sun, 26 Oct 2008 02:04:01 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com> <5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com> Message-ID: <6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com> [Sorry for the delay in answering] On Thu, Oct 23, 2008 at 5:25 PM, Giovanni Marco Dall'Olio wrote: > However I think it would be good to add to biopython at least some > funcionality to calculate Fst statistics and parse these file formats, at > least at the level at which BioPerl does. Agree. Statistics is fundamental. I decided to postpone stats when I started because I didn't want to to start with the core issue in population genetics (being unexperienced at the start would probably cause serious design errors). But I think now is the time. > What if we just translate the same functionalities and copy the population > objects from bioperl into biopython? I don' t think the population objects in bioperl scale well. It is not clear to me that their popgen module is a priority for them, and that they carefully designed them (altough that might have changed in the near past). I also don' t believe that my own code (which I supplied you) is in perfect shape to achieve this also. I have to write down my ideas and send them here as soon as possible. I will try to do it in the next couple of days at most. The core idea is that there is no good abstract population and individual objects, but they are also not needed. What is needed, in my view, are file parsers and statistics. 
Statistics should be organized in a systematic way. Example: all frequency-based, population-structure statistics should present the same interface, something like: add_population(pop_name, individual_allele_list) I will submit a small document for discussion very soon. > I realize that it won't be the perfect solution: in fact, it is the same > reason why I started this discussion here, the bioperl code wasn't optimized > enought for what I want to do, but I didn't know how to modify perl modules > and preferred python. The important thing to notice is that biopython should not be optimized to your needs or mine; it has to be general enough to accommodate the vast majority of potential users. What I've always tried is to do things in a way that can be reused by others. > Maybe we can just write a PED and GenePop parser and have let it work with > GenePop and your modules to calculate Fst. My suggestion would be for you to go ahead and do a Bio.PopGen.PED. You could do it in the best way you see fit. Converting from PED to genepop will make you lose information, if I understand well (as you have SNP info in PED files, which you don't have in genepop). The other formats that I support (Fdist in released code and FStat in the code that you have) are very similar to (or less informative than) genepop. Again, my suggestion is for an independent parser, of which you would have absolute control as the implementor. I understand that this might lead to some duplicated code (like split_in_pops), but repeated code is less of a problem than a generic object that ends up being wrong in the long run. > We should agree with a population object that could be used as input for > GenePop. For the reasons above I will argue against a general Population object, at least for now. I don't feel confident that we have the experience to design one. It is important to notice that we cannot break backward compatibility without a very good reason, and I think a generic population object would be severely revised in the future. In your specific case I also think you would suffer with a population object, as you need performance (parse file, create object, extract information from object, calculate statistic). As I see it, it would be a shorter chain (parse, convert to the statistic family's format, calculate statistic). > I think it would be good anyway to release even incomplete code to the > public, because it could be useful for other people. Incomplete is OK. But I think we would be releasing wrong code: code that would be redone in the future (and break interfaces with past versions). Also, a generic object would have performance problems (it would have to be able to store all the information). Well, I am ranting and not proposing a decent alternative. I will try to write down something decent, and will try to write up a proposal by Tuesday. I'm afraid the error is on my part: I have to write down what is in my head so that people can discuss whether it is a good idea or not.
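A minimal sketch of what such a shared interface could look like follows. The class names and the example statistic (per-population expected heterozygosity at a single locus) are chosen for illustration only and are not a proposed Biopython API; an Fst estimator could then be another subclass fed through exactly the same add_population calls.

class FrequencyBasedStatistic:
    """Base interface: feed populations in, then ask for the value."""
    def __init__(self):
        self.populations = {}
    def add_population(self, pop_name, individual_allele_list):
        # individual_allele_list: one (allele1, allele2) pair per
        # individual, for a single locus.
        self.populations[pop_name] = individual_allele_list
    def calculate(self):
        raise NotImplementedError

class ExpectedHeterozygosity(FrequencyBasedStatistic):
    """Per-population expected heterozygosity, 1 - sum(p_i^2)."""
    def calculate(self):
        results = {}
        for pop_name, genotypes in self.populations.items():
            counts = {}
            total = 0
            for allele1, allele2 in genotypes:
                for allele in (allele1, allele2):
                    counts[allele] = counts.get(allele, 0) + 1
                    total += 1
            hexp = 1.0 - sum((c / float(total)) ** 2 for c in counts.values())
            results[pop_name] = hexp
        return results

stat = ExpectedHeterozygosity()
stat.add_population("pop1", [("A", "A"), ("A", "G"), ("G", "G")])
stat.add_population("pop2", [("A", "G"), ("A", "G"), ("A", "A")])
print stat.calculate()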
From tiagoantao at gmail.com Sat Oct 25 21:34:55 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Sun, 26 Oct 2008 02:34:55 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com> <5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com> <6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com> Message-ID: <6d941f120810251834q87495d5re558cf179356a8b0@mail.gmail.com> I just want add on an extra comment explaining why I oppose doing an individual object: I have the following questions (and others) in my mind, which I don't know the answer. I am not looking for answers to them, I am just trying to illustrate the difficulty of the problem. 1. For a certain marker, do we store the genomic position of the marker? Some (most) statistics don't use this information. For many species this information is not even available. But for some statistics this information is mandatory... 2. For a microsatellite do we store the motif and number of repeats or the whole sequence? (see 4) 3. If one is interested in SNPs and one has the full sequences does one store the full sequences or just the SNPs? If you store just the SNPs then you cannot do sequence based analysis in the future (say Tajima D). If you store everything then you are consuming memory and cpu. 4. If one just wants to do frequency statistics (Fst), do you store the marker or just the assign each one an ID and store the ID? It is much cheaper to store an ID than a full sequence. Populations 1. Support for landscape genetics? I mean geo-referentiation 2. Support for hierarchical population structure? 3. Do we cache statistics results on Population objects? Let me take your class marker: class Marker: total_heterozygotes_count = 0 total_population_count = 0 total_Purines_count = 0 # this could be renamed, of course total_Pyrimidines_count = 0 How would this be useful for microsatellites? Why purines, and if my marker is a protein? If it is a SNP I want to know the nucleotide? And if I am studying proteins and I want to have the aminoacid? Dont take me wrong, I have done this path. To solve my particular problems is not very hard. To have a framework that is usable by everybody, it is a damn hard problem. And we dont really need to solve it (ok, it would be nice to do things to populations in general, that I agree). But the fundamental is: read file, calculate statistics. That doesnt need population and individual objects. If we end up having too many formats a consolidation step might be needed in the future (to avoid having 10 split_in_pops). That I agree. 
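Going back to the db_xref thread above (before Sebastian's follow-up below): a minimal sketch of the feature-level access Peter suggests. The filename is a placeholder, and which feature index holds the gene of interest depends on the particular GenBank file:

from Bio import SeqIO

# placeholder name for the GenBank file being discussed
record = SeqIO.read(open("my_file.gbk"), "genbank")

# record-level cross-references (often an empty list for a GenBank file)
print record.dbxrefs

# the gene's db_xref lines end up in the feature qualifiers instead
print record.features[1]

for index, feature in enumerate(record.features):
    if "db_xref" in feature.qualifiers:
        print index, feature.type, feature.qualifiers["db_xref"]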
From sbassi at gmail.com Mon Oct 27 00:13:47 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Mon, 27 Oct 2008 01:13:47 -0300 Subject: [BioPython] Loading dbxrefs from a gbk file In-Reply-To: <320fb6e00810251022u577d4f16wa7602c7bdc664322@mail.gmail.com> References: <320fb6e00810251022u577d4f16wa7602c7bdc664322@mail.gmail.com> Message-ID: On Sat, Oct 25, 2008 at 2:22 PM, Peter wrote: > What you have highlighted is part of a gene feature, and would not be > part of the SeqRecord's db_xref list. It should however be present in > the relevant SeqRecord feature. Try: OK, thank you. From lueck at ipk-gatersleben.de Mon Oct 27 09:43:49 2008 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Mon, 27 Oct 2008 14:43:49 +0100 Subject: [BioPython] ClustaW problem upwards Biopython 1.43 Message-ID: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de> Hi! I just releazed, that a ClustalW alignment gives an error message under Biopython 1.44 and 1.47 whereas under 1.43 everything works fine. The message is the following (example of the tutorial): Traceback (most recent call last): File "I:\Final\pair_align.py", line 90, in pair_align alignment = Clustalw.do_alignment(cline) File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 79, in do_alignment status = run_clust.close() IOError: [Errno 0] Error Does someone know what's the problem? Kind regards Stefanie From biopython at maubp.freeserve.co.uk Mon Oct 27 11:12:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Oct 2008 15:12:13 +0000 Subject: [BioPython] ClustaW problem upwards Biopython 1.43 In-Reply-To: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de> References: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00810270812le76ae75m55f53107c2572a34@mail.gmail.com> On Mon, Oct 27, 2008 at 1:43 PM, Stefanie L?ck wrote: > Hi! > > I just releazed, that a ClustalW alignment gives an error message under Biopython 1.44 and 1.47 whereas under 1.43 everything works fine. > > The message is the following (example of the tutorial): > > Traceback (most recent call last): > File "I:\Final\pair_align.py", line 90, in pair_align > alignment = Clustalw.do_alignment(cline) > File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 79, in do_alignment > status = run_clust.close() > IOError: [Errno 0] Error > > Does someone know what's the problem? There were some changes made between Biopython 1.43 and 1.44 to try and deal with spaces in filenames. Could you do: print str(cline) That should show the exact command line python is trying to run. What happens if you try this command at the "DOS" prompt? Also, what version of clustalw do you have installed? Thanks, Peter From biopython at maubp.freeserve.co.uk Mon Oct 27 13:49:59 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Oct 2008 17:49:59 +0000 Subject: [BioPython] Deprecating Bio.mathfns, Bio.stringfns and Bio.listfns? Message-ID: <320fb6e00810271049t2aa3fac4s1907027307b035f1@mail.gmail.com> Dear Biopythoneers, Is anyone currently using Bio.mathfns, Bio.stringfns or Bio.listfns? These provide a selection of maths, string and list functions - some of which are apparently irrelevant with changes or additions to python itself (e.g. sets). I'd like to declare these as deprecated for the next release, or at least obsolete and likely to be deprecated in future - so if you are using these modules or would like to defend them, please speak up soon. Thanks, Peter P.S. 
If you care about the details, there is a longer discussion on the dev-mailing list: http://lists.open-bio.org/pipermail/biopython-dev/2008-October/004472.html From biopython at maubp.freeserve.co.uk Mon Oct 27 13:57:20 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Oct 2008 17:57:20 +0000 Subject: [BioPython] Deprecating the obsolete Bio.Ndb module? Message-ID: <320fb6e00810271057l181cbb1fw15aa8f03e4159328@mail.gmail.com> Dear Biopythoneers, The Bio.Ndb module (written six years ago) provides an HTML parser for the NDB website (nucleotide database, a repository of three-dimensional structural information about nucleic acids). The URL has changed, but this service is still running. However, the webpage layout has changed considerably - Their front page mentions a major revision in Jan 2008. Unless anyone would like to volunteer to look after the Bio.Ndb module and bring it up to date, I'm suggesting we deprecate it for the next release of Biopython. Peter From lueck at ipk-gatersleben.de Tue Oct 28 04:10:25 2008 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Tue, 28 Oct 2008 09:10:25 +0100 Subject: [BioPython] ClustaW problem upwards Biopython 1.43 References: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de> <320fb6e00810270812le76ae75m55f53107c2572a34@mail.gmail.com> Message-ID: <001a01c938d4$a19887c0$1022a8c0@ipkgatersleben.de> Hi! >>> print str(cline) clustalw pb.fasta -OUTFILE=test2.aln I'm using CLUSTAL W 2.0. Under DOS everything works fine. Regards Stefanie ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Monday, October 27, 2008 4:12 PM Subject: Re: [BioPython] ClustaW problem upwards Biopython 1.43 On Mon, Oct 27, 2008 at 1:43 PM, Stefanie L?ck wrote: > Hi! > > I just releazed, that a ClustalW alignment gives an error message under > Biopython 1.44 and 1.47 whereas under 1.43 everything works fine. > > The message is the following (example of the tutorial): > > Traceback (most recent call last): > File "I:\Final\pair_align.py", line 90, in pair_align > alignment = Clustalw.do_alignment(cline) > File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 79, > in do_alignment > status = run_clust.close() > IOError: [Errno 0] Error > > Does someone know what's the problem? There were some changes made between Biopython 1.43 and 1.44 to try and deal with spaces in filenames. Could you do: print str(cline) That should show the exact command line python is trying to run. What happens if you try this command at the "DOS" prompt? Also, what version of clustalw do you have installed? Thanks, Peter From dalloliogm at gmail.com Tue Oct 28 06:46:39 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Tue, 28 Oct 2008 11:46:39 +0100 Subject: [BioPython] a common repository for test datasets/use cases for all Bio* projects Message-ID: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com> Hi, I would like to make you a proposal. Every module/program written in bioinformatics needs to be tested before it can be used to produce results that can be published. For example, let's say I want to write another fasta file parser, like SeqIO.FastaIO in biopython : I would have have to test the script against some real fasta files, just to make sure that it doesn't parse them in a wrong way, or that it losts data. 
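For the FASTA parser case, a minimal sketch of that kind of check using Bio.SeqIO: parse a shared test file, write it back out, re-parse it, and confirm that nothing was lost on the way (the filenames here are placeholders):

from Bio import SeqIO

original = list(SeqIO.parse(open("test_dataset.fasta"), "fasta"))

out_handle = open("roundtrip.fasta", "w")
SeqIO.write(original, out_handle, "fasta")
out_handle.close()

recovered = list(SeqIO.parse(open("roundtrip.fasta"), "fasta"))

# ids and sequences should survive the round trip unchanged
assert len(original) == len(recovered)
for old, new in zip(original, recovered):
    assert old.id == new.id
    assert str(old.seq) == str(new.seq)
print "Round trip OK for %i records" % len(original)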
Or, let's say I want to write a script to calculate Fst statistics over some population genetics data: I will have to compare the results of my scripts against other programs, check if it gives me the right result for a set for which I already know the Fst value, and maybe ideate some other kind of checks to be sure my script doesn't do weird things, like losing input data on the way. So, the point is.. what if we create a common repository for all this kind of testing data, to be used in common with all the other Bio* projects? Wouldn't it be good if all the Bio* fasta parser are able to parse the same files and give the same results, demonstrating that all of them work fine or are wrong at the same time? I am doing this because me (and Tiago) would like to develop a module to calculate Fst statistics over SNP data, and there is no point of collecting some good test datasets and not sharing them with other similar projects in other programming languages. The same goes for much of the documentation, like use cases: if we collect a good base of use cases related to bioinformatics, it would be easier to coordinate the efforts of all the Bio* projects and compare the different approaches used to solve the same issue by the different comunities. At the moment, I have created a simple git repository on github: - http://github.com/dalloliogm/bio-test-datasets-repository but , it is still empty and maybe github is not the ideal hosting for such a project, since the free account has a 100MB space limit. -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Tue Oct 28 06:55:04 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Oct 2008 10:55:04 +0000 Subject: [BioPython] a common repository for test datasets/use cases for all Bio* projects In-Reply-To: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com> References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com> Message-ID: <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com> On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio wrote: > Hi, > I would like to make you a proposal. > Every module/program written in bioinformatics needs to be tested > before it can be used to produce results that can be published. > ... > So, the point is.. what if we create a common repository for all this > kind of testing data, to be used in common with all the other Bio* > projects? You you made some other good points, and this is a good idea. In practice the licences are usually OK for use to "borrow" example input files from each other (and this does happen), but a more organised system to encourage interchange of examples would be good. I think this sounds like an excellent topic for the (currently very quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev discussion, one of the OBF mailing lists, this should cover all the Bio* project members interested). 
See http://lists.open-bio.org/mailman/listinfo Peter From dalloliogm at gmail.com Tue Oct 28 07:00:42 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Tue, 28 Oct 2008 12:00:42 +0100 Subject: [BioPython] a common repository for test datasets/use cases for all Bio* projects In-Reply-To: <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com> References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com> <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com> Message-ID: <5aa3b3570810280400t510468d1sbce5bb0977ec772b@mail.gmail.com> On Tue, Oct 28, 2008 at 11:55 AM, Peter wrote: > On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio > > I think this sounds like an excellent topic for the (currently very > quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev > discussion, one of the OBF mailing lists, this should cover all the > Bio* project members interested). See > http://lists.open-bio.org/mailman/listinfo > > Peter Thanks!! I didn't know of this list!! > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Tue Oct 28 07:20:21 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Oct 2008 11:20:21 +0000 Subject: [BioPython] ClustalW problem upwards Biopython 1.43 Message-ID: <320fb6e00810280420t75f62774x55335e8a5aa11151@mail.gmail.com> Stephanie wrote: > >>>> print str(cline) > > clustalw pb.fasta -OUTFILE=test2.aln > > I'm using CLUSTAL W 2.0. Are you sure? The Clustal W 2.0 executable is normally called clustalw2.exe rather than clustalw.exe - so based on the command line above I would have expect Clustalw 1.x to be used. Maybe you have both versions of ClustalW installed? Could you tell me where exactly (full paths) you have Clustalw.exe and/or Clustalw2.exe installed? This would be helpful for the new unit test I'm working on. > Under DOS everything works fine. I've been having "fun" trying to get a new unit test for this to work nicely on Windows - there a certainly some combinations of file name arguments with spaces etc which won't work on Biopython 1.48. I found examples where the command line string ran "by hand" at the "DOS" prompt worked fine, but would fail when invoked in python via os.popen - on the bright side, using subprocess.Popen instead works much better (although this isn't available for python 2.3). If you want to try this new code, I would suggest you first install Biopython 1.48, and then backup and update C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py to revision 1.25 from CVS which you can download here (should be updated within the hour): http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Clustalw/__init__.py?cvsroot=biopython Thanks! Peter From peter at maubp.freeserve.co.uk Tue Oct 28 07:36:15 2008 From: peter at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Oct 2008 11:36:15 +0000 Subject: [BioPython] Dropping Python 2.3 support? Message-ID: <320fb6e00810280436m7cf48993v8b0562bb44919128@mail.gmail.com> Dear all, Those of you following the dev-mailing list will probably be aware that we've been making excellent progress in CVS to get Biopython to run fine on Python 2.6. However, the downside is that continuing to support Python 2.3 is beginning to be pain (triggered for the most part by some older modules being deprecated in python 2.6). Does anyone on the mailing list still use Python 2.3? e.g. 
older Linux servers, or people still using Apple Mac OS X 10.4 Tiger (or older). What I'd like to suggest is that the next one or two releases will still support Python 2.3, but after that we'll drop support for Python 2.3. Thanks, Peter P.S. For the record, until recently my main Windows machine ran Python 2.3 only - giving me a vested interesting in continuing Python 2.3 support ;) From jblanca at btc.upv.es Tue Oct 28 07:52:29 2008 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 28 Oct 2008 12:52:29 +0100 Subject: [BioPython] caf format support Message-ID: <200810281252.29607.jblanca@btc.upv.es> Hi, I'm currently dealing with caf contig files. Has BioPython support for this format? Do you know of other alternatives in python or perl to deal with it? Best regards, -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Tue Oct 28 08:16:33 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Oct 2008 12:16:33 +0000 Subject: [BioPython] caf format support In-Reply-To: <200810281252.29607.jblanca@btc.upv.es> References: <200810281252.29607.jblanca@btc.upv.es> Message-ID: <320fb6e00810280516j72af2c70q46790c217585b2c5@mail.gmail.com> On Tue, Oct 28, 2008 at 11:52 AM, Jose Blanca wrote: > Hi, > I'm currently dealing with caf contig files. Has BioPython support for this > format? Do you know of other alternatives in python or perl to deal with it? > Best regards, I'm not aware of any Biopython code for CAF contig files. However, have a look at http://www.sanger.ac.uk/Software/formats/CAF/userguide.shtml where some perl tools are described, including some for converting CAF into other formats. We do have ACE and PHRED (used by PHRAP) parsers in Bio.Sequencing, so adding Bio.Sequencing.CAF might be logical. Peter From cjfields at illinois.edu Tue Oct 28 08:26:32 2008 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 28 Oct 2008 07:26:32 -0500 Subject: [BioPython] a common repository for test datasets/use cases for all Bio* projects In-Reply-To: <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com> References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com> <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com> Message-ID: All, An open-bio repository had started up for this use at one point, though I don't think it made the transition to subversion yet (and it never really took off, not sure why). You should try contacting open- bio support and maybe Jason or Chris D. can answer this in a bit more detail. chris On Oct 28, 2008, at 5:55 AM, Peter wrote: > On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio > wrote: >> Hi, >> I would like to make you a proposal. >> Every module/program written in bioinformatics needs to be tested >> before it can be used to produce results that can be published. >> ... >> So, the point is.. what if we create a common repository for all this >> kind of testing data, to be used in common with all the other Bio* >> projects? > > You you made some other good points, and this is a good idea. In > practice the licences are usually OK for use to "borrow" example input > files from each other (and this does happen), but a more organised > system to encourage interchange of examples would be good. 
> > I think this sounds like an excellent topic for the (currently very > quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev > discussion, one of the OBF mailing lists, this should cover all the > Bio* project members interested). See > http://lists.open-bio.org/mailman/listinfo > > Peter > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Christopher Fields Postdoctoral Researcher Lab of Dr. Marie-Claude Hofmann College of Veterinary Medicine University of Illinois Urbana-Champaign From bsouthey at gmail.com Tue Oct 28 09:56:34 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 28 Oct 2008 08:56:34 -0500 Subject: [BioPython] a common repository for test datasets/use cases for all Bio* projects In-Reply-To: References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com> <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com> Message-ID: <49071A12.8060705@gmail.com> Chris Fields wrote: > All, > > An open-bio repository had started up for this use at one point, > though I don't think it made the transition to subversion yet (and it > never really took off, not sure why). You should try contacting > open-bio support and maybe Jason or Chris D. can answer this in a bit > more detail. > > chris > > On Oct 28, 2008, at 5:55 AM, Peter wrote: > >> On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio >> wrote: >>> Hi, >>> I would like to make you a proposal. >>> Every module/program written in bioinformatics needs to be tested >>> before it can be used to produce results that can be published. >>> ... >>> So, the point is.. what if we create a common repository for all this >>> kind of testing data, to be used in common with all the other Bio* >>> projects? >> >> You you made some other good points, and this is a good idea. In >> practice the licences are usually OK for use to "borrow" example input >> files from each other (and this does happen), but a more organised >> system to encourage interchange of examples would be good. >> >> I think this sounds like an excellent topic for the (currently very >> quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev >> discussion, one of the OBF mailing lists, this should cover all the >> Bio* project members interested). See >> http://lists.open-bio.org/mailman/listinfo >> >> Peter >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Marie-Claude Hofmann > College of Veterinary Medicine > University of Illinois Urbana-Champaign > > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > There had been some discussion on scipy lists on data sets that you should look for. One of the most critical questions that you must address is copyright and who owns the data sets (credit where credit is due). Ultimately any data will be distributable in some form and thus really brings in copyright issues and such. This is also country specific because there is the question of whether or not a data set can be copyrighted and the terms of it - not a lawyer to know this. 
The Science Commons has various other useful information especially the FAQ on databases, http://sciencecommons.org/resources/faq/databases/, that states "In the United States, data will be protected by copyright only if they express creativity". I do believe you would need to be very strict on what is acceptable because if it is distributable you can not rely on the user being responsible: 1) If has been used for publication, an extremely clear statement of the owner (publisher) that it can be made available is required. 2) If the data is created from publicly available sources that allow it eg Uniprot (http://www.uniprot.org/help/license) then exact recreatable sets must be made available so the data can be exactly obtained from that source (must include the specific release as databases change). 3) If the data is from private sources then it must be released on a suitable license that can not be superseded by publication or change in ownership. Also, the submitted data should not change even if there are errors. For example, Fisher's iris data at http://archive.ics.uci.edu/ml/datasets/Iris has documented errors. Rather it would be better to use version numbers. Regards Bruce From biopython at maubp.freeserve.co.uk Tue Oct 28 11:04:21 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Oct 2008 15:04:21 +0000 Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810200222s641e165eqef3b209893a8d976@mail.gmail.com> References: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com> <320fb6e00810200222s641e165eqef3b209893a8d976@mail.gmail.com> Message-ID: <320fb6e00810280804k1ef53ec1od53c33915da61c3@mail.gmail.com> On 20th Oct I wrote: > Of course, someone is still bound to try calling the [Seq object's] > translate method with a string mapping. Maybe we should add a > bit of defensive code to check the table argument, and print a > helpful error message when this happens? I've just added that in CVS, if the table argument is a 256 character string then a ValueError is raised suggesting using str(my_seq).translate(...) instead. Peter From biopython at maubp.freeserve.co.uk Tue Oct 28 13:17:36 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Oct 2008 17:17:36 +0000 Subject: [BioPython] Should Bio.SeqIO.write(...) return the number of records? Message-ID: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com> Dear all, I wanted to get some feedback on a possible enhancement to the Bio.SeqIO.write(...) and Bio.AlignIO.write(...) functions to make them return number of records/alignments written to the handle. I've filed enhancement Bug 2628 to track this idea. http://bugzilla.open-bio.org/show_bug.cgi?id=2628 When creating a sequence (or alignment) file, it is sometimes useful to know how many records (or alignments) were written out. This is easy if your records are in a list: records = list(...) SeqIO.write(records, handle, format) print "Wrote %i records" % len(records) If however your records are from a generator/iterator (e.g. a generator expression, or some other iterator) you cannot use len(records). You could turn this into a list just to count them, but this wastes memory. It would therefore be useful to have the count returned: records = some_generator count = SeqIO.write(records, handle, format) print "Wrote %i records" % count Currently Bio.SeqIO.write(...) and Bio.AlignIO.write(...) have no return value, so adding a return value would be a backwards compatible enhancement. 
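Until such a return value exists, one workaround is to count the records as they stream past, for example by wrapping the iterator; the filenames and the length filter below are made up for illustration:

from Bio import SeqIO

def counting_iterator(records, counter):
    # pass each record through unchanged, tallying it in counter[0]
    for record in records:
        counter[0] += 1
        yield record

# e.g. keep only the longer records from some input file
records = (rec for rec in SeqIO.parse(open("input.fasta"), "fasta")
           if len(rec.seq) > 100)

counter = [0]
out_handle = open("filtered.fasta", "w")
SeqIO.write(counting_iterator(records, counter), out_handle, "fasta")
out_handle.close()
print "Wrote %i records" % counter[0]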
For a precedent, the BioSQL loader returns the number of records loaded into the database. Peter From sbassi at gmail.com Tue Oct 28 13:43:27 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Tue, 28 Oct 2008 14:43:27 -0300 Subject: [BioPython] Should Bio.SeqIO.write(...) return the number of records? In-Reply-To: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com> References: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com> Message-ID: On Tue, Oct 28, 2008 at 2:17 PM, Peter wrote: > count = SeqIO.write(records, handle, format) > print "Wrote %i records" % count I'm for it. It doesn't hurt adding a backward compatible feature. From biopython at maubp.freeserve.co.uk Tue Oct 28 14:16:58 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Oct 2008 18:16:58 +0000 Subject: [BioPython] Should Bio.SeqIO.write(...) return the number of records? In-Reply-To: References: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com> Message-ID: <320fb6e00810281116u6460c62fs77ece727689fba3b@mail.gmail.com> Sebastian Bassi wrote: > > I'm for it. It doesn't hurt adding a backward compatible feature. > Well adding an unused feature does increase the long term maintainence load - but if we agree this does seem useful, that's fine. Also settling on the record/alignment count as the return value prevents any future alternative. But right now I can't think of any other sensible return value. I've written a patch against CVS to implement this - see Bug 2628 for details. Peter From tiagoantao at gmail.com Thu Oct 30 17:36:00 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 30 Oct 2008 21:36:00 +0000 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810251834q87495d5re558cf179356a8b0@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com> <5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com> <6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com> <6d941f120810251834q87495d5re558cf179356a8b0@mail.gmail.com> Message-ID: <6d941f120810301436m4bf12385s99d726bb000f7dd4@mail.gmail.com> Hi, FYI, I am going to continue this discussion to biopython-dev, as I think it makes more sense there. Especially the parts about implementation suggestions. On Sun, Oct 26, 2008 at 1:34 AM, Tiago Ant?o wrote: > I just want add on an extra comment explaining why I oppose doing an > individual object: > > I have the following questions (and others) in my mind, which I don't > know the answer. I am not looking for answers to them, I am just > trying to illustrate the difficulty of the problem. > > 1. For a certain marker, do we store the genomic position of the > marker? Some (most) statistics don't use this information. For many > species this information is not even available. But for some > statistics this information is mandatory... > 2. For a microsatellite do we store the motif and number of repeats or > the whole sequence? (see 4) > 3. If one is interested in SNPs and one has the full sequences does > one store the full sequences or just the SNPs? 
If you store just the > SNPs then you cannot do sequence based analysis in the future (say > Tajima D). If you store everything then you are consuming memory and > cpu. > 4. If one just wants to do frequency statistics (Fst), do you store > the marker or just the assign each one an ID and store the ID? It is > much cheaper to store an ID than a full sequence. > > Populations > 1. Support for landscape genetics? I mean geo-referentiation > 2. Support for hierarchical population structure? > 3. Do we cache statistics results on Population objects? > > > Let me take your class marker: > class Marker: > total_heterozygotes_count = 0 > total_population_count = 0 > total_Purines_count = 0 # this could be renamed, of course > total_Pyrimidines_count = 0 > > How would this be useful for microsatellites? Why purines, and if my > marker is a protein? If it is a SNP I want to know the nucleotide? And > if I am studying proteins and I want to have the aminoacid? > > Dont take me wrong, I have done this path. To solve my particular > problems is not very hard. To have a framework that is usable by > everybody, it is a damn hard problem. And we dont really need to solve > it (ok, it would be nice to do things to populations in general, that > I agree). But the fundamental is: read file, calculate statistics. > That doesnt need population and individual objects. > > If we end up having too many formats a consolidation step might be > needed in the future (to avoid having 10 split_in_pops). That I agree. > -- "Data always beats theories. 'Look at data three times and then come to a conclusion,' versus 'coming to a conclusion and searching for some data.' The former will win every time." ?Matthew Simmons, http://www.tiago.org From pingou at pingoured.fr Fri Oct 31 12:29:27 2008 From: pingou at pingoured.fr (Pierre-Yves) Date: Fri, 31 Oct 2008 17:29:27 +0100 Subject: [BioPython] Sequence graph Message-ID: <490B3267.5020501@pingoured.fr> Dear list, I am sorry to come here to ask this question that must have been already asked in the past, but my search have been rather unsuccessful... I would like to reproduce such graph: http://www.bioperl.org/wiki/HOWTO:Graphics#Improving_the_Image but even if bioperl is nice I would like to do it through BioPython. I have thus two questions : * Is that possible ? * Could someone point me to an example ? Thanks in advance for your help, Best regards, Pierre From bsouthey at gmail.com Wed Oct 22 17:02:18 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 22 Oct 2008 21:02:18 -0000 Subject: [BioPython] back-translation method for Seq object? In-Reply-To: <320fb6e00810210859n1b922e7emd6a7456abd79cdc7@mail.gmail.com> References: <48FDE37B.5040301@gmail.com> <320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com> <320fb6e00810210859n1b922e7emd6a7456abd79cdc7@mail.gmail.com> Message-ID: <48FF951C.4030700@gmail.com> Hi, Some of the neat things about Python is how easy it is to modify your own code and adapt others code into yours. So here is some code (under the BSD license) that may be useful on this. This is a simple back or reverse translation code with many of the things that I have been 'talking' about. This should be self-contained and works on Linux system with Python2.3+. It is oriented around an peptide sequence 'AFLFQPQRFGR' but hopefully is more general (I have not tested that). a) Convert an amino acid sequence into both a regular expression or DNA sequence involving ambiguous codes. 
There are functions to convert the regular expression or DNA sequence involving ambiguous codes back to a protein sequence since neither of these are standard. b) Regular expression search on a list of sequences in fasta format. c) Obtain all possible DNA sequences from an regular expression form of the amino acid sequence. Obviously this is very large as for the above sequence there are 442368 combinations (but Python is fairly quick... about 10 seconds on my opteron 270 system bogomips =3991.08) Enjoy Bruce -------------- next part -------------- A non-text attachment was scrubbed... Name: reverse_trans.py Type: text/x-python Size: 10661 bytes Desc: not available URL:
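Bruce's reverse_trans.py attachment was scrubbed by the list software; as a stand-in (and not his code), here is a small sketch of just the combination-counting part of the idea, using the standard codon table from Bio.Data.CodonTable:

from Bio.Data import CodonTable

# invert the standard codon table: amino acid -> list of codons
standard_table = CodonTable.unambiguous_dna_by_id[1]
back_table = {}
for codon, amino_acid in standard_table.forward_table.items():
    back_table.setdefault(amino_acid, []).append(codon)

peptide = "AFLFQPQRFGR"

# the number of exact DNA back-translations is the product of the codon
# counts for each residue - 442368 for this peptide, as quoted above
total = 1
for amino_acid in peptide:
    total = total * len(back_table[amino_acid])
print "%s has %i possible coding sequences" % (peptide, total)

# one arbitrary representative back-translation (first codon of each residue)
print "".join([back_table[amino_acid][0] for amino_acid in peptide])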
Peter From biopython at maubp.freeserve.co.uk Wed Oct 1 16:03:22 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 1 Oct 2008 17:03:22 +0100 Subject: [BioPython] Bio.distance In-Reply-To: <48E39C21.8010603@gmail.com> References: <924102.72843.qm@web62403.mail.re1.yahoo.com> <48E39C21.8010603@gmail.com> Message-ID: <320fb6e00810010903u253c6384ld401e1a771ee141e@mail.gmail.com> On Wed, Oct 1, 2008 at 4:49 PM, Bruce Southey wrote: > > Hi, > Under the 'standard' install I do not think that there is any advantage of > using Bio.cdistance within Bio.kNN. I tested this on a bioinformatics data > set with almost 1500 data points, 8 explanatory variables and k=9. ... > Actual maximum times across three runs were under 16.6 seconds with > it [Bio.cdistance] and under 17.4 seconds without it [Bio.distance using > Numeric] Its interesting that the C version is only slightly faster than Numeric - of course as you point out there are lots of possible complications here like lapack and atlas (plus compiler options and CPU features). I think your numbers are good support for Michiel's proposition that we should deprecate Bio.cdistance and Bio.distance and just use numpy in Bio.kNN - this will simplify our code base and make very little difference to the speed. Peter From biopython at maubp.freeserve.co.uk Wed Oct 1 16:17:10 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 1 Oct 2008 17:17:10 +0100 Subject: [BioPython] Bio.kNN documentation Message-ID: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com> Bruce wrote: > I tested this [Bio.kNN] on a bioinformatics data set with almost 1500 > data points, 8 explanatory variables and k=9. ... Do you think this larger example could be adapted into something for the Biopython documentation? Otherwise the next bit of code looks interesting. > I did not see an examples for k-nearest neighbor so below is (very bad) > code using the logistic regression example > (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html). This is a set of Bacillus subtilis gene pairs for which the operon structure is known, with the intergene distance and gene expression score as explanatory variables, with the class being same operon or different operons. > from Bio import kNN > xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30], [11, > -220.94], [85, -193.94], [16, -182.71], [15, -180.41], [-26, -181.73], [58, > -259.87], [126, -414.53], [191, -249.57], [113, -265.28], [145, -312.99], > [154, -213.83], [147, -380.85], [93, -291.13]] > ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0] > model = kNN.train(xs, ys, 3) > ccr=0 > tobs=0 > for px, py in zip(xs, ys): > cp=kNN.classify(model, px) > tobs +=1 > if cp==py: > ccr +=1 > print tobs, ccr Could you expand on the cryptic variable names? ccr = correct call rate? tobs = total observations? Coupled with a scatter plot (say with pylab, showing the two classes in different colours), this could be turned into a nice little example for the cookbook section of the tutorial. Notice that later on in the logistic regression example there is a second table of "test data" which could be used to make de novo predictions. 
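For anyone following along before the cookbook entry exists, here is Bruce's snippet again with more descriptive names; note that it simply re-classifies the training data with the 3-nearest-neighbour model, so the resulting count is an optimistic self-classification check rather than a cross-validated accuracy:

from Bio import kNN

# intergene distance and gene expression score for the Bacillus subtilis
# gene pairs from the logistic regression example; class 1 = same operon,
# class 0 = different operons
xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30],
      [11, -220.94], [85, -193.94], [16, -182.71], [15, -180.41],
      [-26, -181.73], [58, -259.87], [126, -414.53], [191, -249.57],
      [113, -265.28], [145, -312.99], [154, -213.83], [147, -380.85],
      [93, -291.13]]
ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

model = kNN.train(xs, ys, 3)

total_observations = 0
correct_calls = 0
for observation, true_class in zip(xs, ys):
    predicted_class = kNN.classify(model, observation)
    total_observations += 1
    if predicted_class == true_class:
        correct_calls += 1

print "Correctly classified %i of %i gene pairs" % (correct_calls, total_observations)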
Thanks, Peter From bsouthey at gmail.com Wed Oct 1 18:40:41 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 01 Oct 2008 13:40:41 -0500 Subject: [BioPython] Bio.kNN documentation In-Reply-To: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com> References: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com> Message-ID: <48E3C429.1020004@gmail.com> Peter wrote: > Bruce wrote: > >> I tested this [Bio.kNN] on a bioinformatics data set with almost 1500 >> data points, 8 explanatory variables and k=9. ... >> > > Do you think this larger example could be adapted into something for > the Biopython documentation? Otherwise the next bit of code looks > interesting. > > >> I did not see an examples for k-nearest neighbor so below is (very bad) >> code using the logistic regression example >> (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html). >> > > This is a set of Bacillus subtilis gene pairs for which the operon > structure is known, with the intergene distance and gene expression > score as explanatory variables, with the class being same operon or > different operons. > > >> from Bio import kNN >> xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30], [11, >> -220.94], [85, -193.94], [16, -182.71], [15, -180.41], [-26, -181.73], [58, >> -259.87], [126, -414.53], [191, -249.57], [113, -265.28], [145, -312.99], >> [154, -213.83], [147, -380.85], [93, -291.13]] >> ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0] >> model = kNN.train(xs, ys, 3) >> ccr=0 >> tobs=0 >> for px, py in zip(xs, ys): >> cp=kNN.classify(model, px) >> tobs +=1 >> if cp==py: >> ccr +=1 >> print tobs, ccr >> > > Could you expand on the cryptic variable names? ccr = correct call > rate? tobs = total observations? > > Coupled with a scatter plot (say with pylab, showing the two classes > in different colours), this could be turned into a nice little example > for the cookbook section of the tutorial. Notice that later on in the > logistic regression example there is a second table of "test data" > which could be used to make de novo predictions. > > Thanks, > > Peter > > I did realize that this was coming... :-) (I guess I am volunteering myself to provide some material on machine learning with BioPython. So this is a start.) I wanted something quick and dirty to output for testing, so tobs is the total number of observations and ccr is number of correctly classified points - I was to lazy to divide it by tobs to get the correct classification rate. Here is an more extended sample code that also uses logistic regression. (Python is so great to with here!) I don't have plotting packages installed but someone could add the plots. Regards Bruce -------------- next part -------------- A non-text attachment was scrubbed... Name: knn_lr_example.py Type: text/x-python Size: 3257 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Wed Oct 1 21:40:55 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 1 Oct 2008 22:40:55 +0100 Subject: [BioPython] problem installing mxTextTools In-Reply-To: <48896815.10104@berkeley.edu> References: <4889645B.9080400@berkeley.edu> <48896815.10104@berkeley.edu> Message-ID: <320fb6e00810011440v4bd80263hf3830d8d9f548e63@mail.gmail.com> On Fri, Jul 25, 2008 at 6:43 AM, Nick Matzke wrote: > Hi all, > > An update -- I found a solution by copying the .pck file the download > actually gave me to the filename that the install was apparently looking > for. This was not exactly obvious (!!!!) 
but apparently it worked: > ... > >>> print now() > 2008-07-24 22:39:17.66 > Was this an old email you accidently forwarded to the list? For the next release of Biopython the only bits of code still using mxTextTools have been deprecated, so the Biopython setup won't even look for mxTextTools at all. Right now with Biopython 1.48 you can just install without mxTextTools (as the setup.py prompt should make clear). Peter From biopython at maubp.freeserve.co.uk Wed Oct 1 21:44:34 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 1 Oct 2008 22:44:34 +0100 Subject: [BioPython] problem installing mxTextTools In-Reply-To: <320fb6e00810011440v4bd80263hf3830d8d9f548e63@mail.gmail.com> References: <4889645B.9080400@berkeley.edu> <48896815.10104@berkeley.edu> <320fb6e00810011440v4bd80263hf3830d8d9f548e63@mail.gmail.com> Message-ID: <320fb6e00810011444u7e5bf37fh2801c1980bd38a2a@mail.gmail.com> On Wed, Oct 1, 2008 at 10:40 PM, Peter wrote: > On Fri, Jul 25, 2008 at 6:43 AM, Nick Matzke wrote: >> Hi all, >> >> An update -- I found a solution by copying the .pck file the download >> actually gave me to the filename that the install was apparently looking >> for. This was not exactly obvious (!!!!) but apparently it worked: >> ... >> >>> print now() >> 2008-07-24 22:39:17.66 >> > > Was this an old email you accidently forwarded to the list? Sorry about this Nick & everyone else - it was a mistake at my end. It looks like a glitch (perhaps in GoogleMail itself?) marked this old thread as unread and bumped it to the top of my to read list. Odd, but I didn't notice until after sending my confused reply. Peter From kteague at bcgsc.ca Wed Oct 1 21:53:44 2008 From: kteague at bcgsc.ca (Kevin Teague) Date: Wed, 1 Oct 2008 14:53:44 -0700 Subject: [BioPython] development question References: <48B5BD98.8050101@heckler-koch.cz><48B65C9B.4000407@heckler-koch.cz> <20080828090431.GD5801@inb.uni-luebeck.de> Message-ID: <36BEEFA2DF192944BF71E072F7A5F4656043D6@xchange1.phage.bcgsc.ca> On Thu, Aug 28, 2008 at 10:06:51AM +0200, Pavel SRB wrote: > so now to biopython. On my system i have biopython from debian repository > via apt-get. But i would like to have second version of biopython in system > just to check, log and change the code to learn more. This can be done with > removing sys.path.remove("/var/lib/python-support/python2.5") > and importing Bio from some other development directory. But this way i > loose all modules in direcotory mentioned above and i believe it can be > done more clearly You might want to check out VirtualEnv: http://pypi.python.org/pypi/virtualenv This tool will let you "clone" your system Python, so that you have your own isolated [virtualpythonname]/bin and [virtualpythonname/lib/python/site-packages/ directories. If you create a virtualenv with the --no-site-packages, then the /var/lib/python-support/python2.5/ location will be not be in the created virtual python's sys.path. Otherwise by default this location will be included, but your own isolated [virtualpythonname/lib/python/site-packages/ location will have precendence on sys.path, so if you install a newer BioPython into there it will get imported instead of the system one. You can of course do all of this by manually fiddling with sys.path, but VirtualEnv just wraps up a few of these common practices into one handy tool - great for experimentation or trying out different packages. 
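Whichever route is taken (virtualenv or manual sys.path editing), a quick way to confirm which copy of Biopython a given interpreter is actually importing; getattr is used because very old releases may not expose __version__:

import sys
import Bio

# where the imported Bio package lives on disk, and (if available) its version
print Bio.__file__
print getattr(Bio, "__version__", "version attribute not available")

# the search path that determined which copy won
for entry in sys.path:
    print entry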
From lunt at ctbp.ucsd.edu Sat Oct 4 21:50:33 2008 From: lunt at ctbp.ucsd.edu (Bryan Lunt) Date: Sat, 4 Oct 2008 14:50:33 -0700 Subject: [BioPython] Copy Constructors for Bio.Seq.Seq? In-Reply-To: References: Message-ID: Greetings All! I would like to make the following humble suggestion: A copy-constructor for Bio.Seq.Seq would be helpful, currently it seems that calling Bio.Align.Generic.Alignment.add_sequence on a Seq object breaks because it tries to initialize a new Seq object on whatever data you provided, and there is no copy-constructor, nor does Bio.Align.Generic.Alignment.add_sequence handled just adding a Seq object directly. Thanks for considering this, I think this addition will help make client-code cleaner. -Bryan Lunt From biopython at maubp.freeserve.co.uk Sun Oct 5 11:06:57 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 5 Oct 2008 12:06:57 +0100 Subject: [BioPython] Copy Constructors for Bio.Seq.Seq? In-Reply-To: References: Message-ID: <320fb6e00810050406t41d25043oe7011745055a1fc7@mail.gmail.com> On Sat, Oct 4, 2008 at 10:50 PM, Bryan Lunt wrote: > Greetings All! > I would like to make the following humble suggestion: > A copy-constructor for Bio.Seq.Seq would be helpful, ... You can use the string idiom of my_seq[:] to make a copy of a Seq object. > currently it > seems that calling Bio.Align.Generic.Alignment.add_sequence on a > Seq object breaks because it tries to initialize a new Seq object on > whatever data you provided, and there is no copy-constructor, nor does > Bio.Align.Generic.Alignment.add_sequence handled just adding a Seq > object directly. Yes, the Bio.Align.Generic.Alignment.add_sequence() method currently expects a string (which its docstring is fairly clear about), and giving it a Seq does fail. I suppose allowing it to take a Seq object would be sensible (with a check on the alphabet being compatible with that declared for the alignment). We have been debating making the generic Alignment a little more list like, by allowing .append() or .extend() for use with SeqRecord objects (Bug 2553). http://bugzilla.open-bio.org/show_bug.cgi?id=2553 > Thanks for considering this, I think this addition will help make > client-code cleaner. Would the SeqRecord append/extend idea suit you just as well? Peter From biopython at maubp.freeserve.co.uk Sun Oct 5 12:16:28 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 5 Oct 2008 13:16:28 +0100 Subject: [BioPython] Migrating from Numerical Python to numpy In-Reply-To: <623262.17729.qm@web62407.mail.re1.yahoo.com> References: <623262.17729.qm@web62407.mail.re1.yahoo.com> Message-ID: <320fb6e00810050516i20822ebcwf15cd058af0c9759@mail.gmail.com> On Sat, Sep 20, 2008 at 4:02 AM, Michiel de Hoon wrote: > Dear all, > > As you probably are well aware, Biopython releases to date have used > the now obsolete Numeric python library. This is no longer being > maintained and has been superseded by the numpy library. See > http://www.scipy.org/History_of_SciPy for more about details on the > history of numerical python. Biopython 1.48 should be the last > Numeric only release of Biopython - we have already started moving to > numpy in CVS. > > Supporting both Numeric and numpy ought to be fairly straightforward > for the pure python modules in Biopython. However, we also have C code > which must interact with Numeric/numpy, and trying to support both > would be harder. > > Would anyone be inconvenienced if the next release of Biopython > supported numpy ONLY (dropping support for Numeric)? 
If so please > speak up now - either here or on the development mailing list. > Otherwise, a simple switch from Numeric to numpy will probably be the > most straightforward migration plan. No one has objected, and a simple switch from Numeric to numpy is underway in CVS. The next release of Biopython will suport numpy only (dropping support for Numeric). As an aside, from my own testing Biopython CVS looks happy with numpy 1.0, 1.1 and the just released 1.2 (although if we have missed any deprecation warnings please let us know). For preparing Windows installers for Biopython, it might be helpful to know what version of numpy most Windows users (will) have installed (this is important due to numpy C API changes between versions). Thanks, Peter From biopython at maubp.freeserve.co.uk Mon Oct 6 10:39:15 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 6 Oct 2008 11:39:15 +0100 Subject: [BioPython] Bio.kNN documentation In-Reply-To: <48E3C429.1020004@gmail.com> References: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com> <48E3C429.1020004@gmail.com> Message-ID: <320fb6e00810060339t427d4c5dme2690fbc36b30c81@mail.gmail.com> Bruce wrote: >>> I did not see an examples for k-nearest neighbor so below is >>> (very bad) code using the logistic regression example >>> (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html). Peter wrote: >> This is a set of Bacillus subtilis gene pairs for which the operon >> structure is known, with the intergene distance and gene expression >> score as explanatory variables, with the class being same operon or >> different operons. >> ... >> Coupled with a scatter plot (say with pylab, showing the two classes >> in different colours), this could be turned into a nice little example >> for the cookbook section of the tutorial. Notice that later on in the >> logistic regression example there is a second table of "test data" >> which could be used to make de novo predictions. Bruce wrote: > I did realize that this was coming... :-) > (I guess I am volunteering myself to provide some material on > machine learning with BioPython. So this is a start.) Michiel has suggested adding a whole chapter to the tutorial about supervised learning, presumably incorporating his logistic regression example as part of this. Have a look at thread "Bio.MarkovModel; Bio.Popgen, Bio.PDB documentation" on the dev mailing list. I'm sure you can contribute (even if just by proof reading). Peter From fkauff at biologie.uni-kl.de Tue Oct 7 08:02:12 2008 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Tue, 07 Oct 2008 10:02:12 +0200 Subject: [BioPython] Creating and traversing an ultrametric tree In-Reply-To: <320fb6e00809241326i16a337das844f4ac74766b459@mail.gmail.com> References: <73045cca0809231713v219c3ec3tfc24461c7af6b453@mail.gmail.com> <320fb6e00809240200y144500cbl86f9023cb868da89@mail.gmail.com> <73045cca0809241132x30bc4d63t7ac0b9967a20e76c@mail.gmail.com> <320fb6e00809241326i16a337das844f4ac74766b459@mail.gmail.com> Message-ID: <48EB1784.50803@biologie.uni-kl.de> Peter wrote: > On Wed, Sep 24, 2008 at 7:32 PM, aditya shukla > wrote: > >> Hello Peter , >> >> Thanks for the reply , >> I have attached a file with of the kind of data that i wanna parse. >> I tried using Thomas Mailund's Newick tree parser but this dosen't >> seem to work , so is there any other module that can help? 
>> > > Your file looks like this (in case anyone on the mailing list recognises it), > > /T_0_size=105((-bin-ulockmgr_server:0.99[&&NHX:C=0.195.0], > (((-bin-hostname:0.00[&&NHX:C=200.0.0], > (-bin-dnsdomainname:0.00[&&NHX:C=200.0.0], > ...):0.99):0.99):0.99):0.99); > > [with a large chunk removed, and new lines inserted] > > I'm guessing this is some kind of computer system profile - nothing to > do with bioinformatics. > > I'm not 100% sure this is Newick format - it might be worth trying to > parse everything after the "/T_0_size=105" text which looks out of > place to me. > > If it is a valid Newick format tree file, then it is using named > internal nodes which is something Biopython can't currently parse (see > Bug 2543, http://bugzilla.open-bio.org/show_bug.cgi?id=2543 ). So I > don't think you can use the Bio.Nexus module in Biopython to read this > tree. > > Nexus.Trees has been extended to deal with internal node names, or "special comments" in the format [& blablalba]. Such comments comments can appear directly after the taxon label, after the closing parentheses, or between branchlength / support values attached to a node or a taxon labels, such as (a,(b,(c,d)[&hi there])) (a,(b[&hi there],c)) (a,(b:0.123[&hi there],c[&heyho]:0.3)) (a,(b,c)0.4[&comment]:0.95) The comments are stored without change in the corresponding node object and can be accessed like >>> t=Trees.Tree('(a,(b:0.123[&hi there],c[&heyho]:0.3))') >>> print t.node(3).data.comment [&hi there] >>> print t.node(4).data.comment [&heyho] >>> The comments are not parsed in any way - internal labels vary greatly in syntax, and are used to store all kinds of information. But at least they are now parsed and stored, and users can deal with them in any way they like. Frank > The only other python package I can suggest you try is NetworkX, > https://networkx.lanl.gov/wiki > > Good luck, > > Peter > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From mjldehoon at yahoo.com Tue Oct 7 23:10:12 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 7 Oct 2008 16:10:12 -0700 (PDT) Subject: [BioPython] Bio.kNN documentation In-Reply-To: <320fb6e00810060339t427d4c5dme2690fbc36b30c81@mail.gmail.com> Message-ID: <381879.37032.qm@web62403.mail.re1.yahoo.com> > Bruce wrote: > > (I guess I am volunteering myself to provide some > material on > > machine learning with BioPython. So this is a start.) > > Michiel has suggested adding a whole chapter to the > tutorial about > supervised learning, presumably incorporating his logistic > regression > example as part of this. Have a look at thread > "Bio.MarkovModel; > Bio.Popgen, Bio.PDB documentation" on the dev mailing > list. I'm sure > you can contribute (even if just by proof reading). Some more documentation on machine learning would definitely be useful. Recently I started a chapter on supervised learning methods in the tutorial. Right now it only covers logistic regression, but it should also include Bio.MarkovModel, Bio.MaxEntropy, Bio.NaiveBayes, and Bio.kNN. If you are planning to write some documentation on any of these, please let us know so we can avoid duplicated efforts. The new tutorial is in CVS; I put a copy of the HTML output of the latest version at http://biopython.org/DIST/docs/tutorial/Tutorial.new.html. Thanks! 
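To give a flavour of the kind of snippet such a chapter might contain, here is a throw-away logistic regression example - the one-variable training data below is made up purely for illustration, and it assumes the train/classify/calculate functions used in the cookbook example:

from Bio import LogisticRegression

xs = [[-10.0], [-5.0], [1.0], [-2.0], [4.0], [12.0]]  # one made-up explanatory variable
ys = [1, 1, 1, 0, 0, 0]                               # known class for each observation
model = LogisticRegression.train(xs, ys)              # fit the beta coefficients
print LogisticRegression.classify(model, [0.0])       # predicted class for a new value
print LogisticRegression.calculate(model, [0.0])      # probabilities for class 0 and class 1

The real chapter uses the Bacillus subtilis operon data rather than made-up numbers.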
--Michiel From bsouthey at gmail.com Wed Oct 8 01:35:51 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 7 Oct 2008 20:35:51 -0500 Subject: [BioPython] Bio.kNN documentation In-Reply-To: <381879.37032.qm@web62403.mail.re1.yahoo.com> References: <320fb6e00810060339t427d4c5dme2690fbc36b30c81@mail.gmail.com> <381879.37032.qm@web62403.mail.re1.yahoo.com> Message-ID: On Tue, Oct 7, 2008 at 6:10 PM, Michiel de Hoon wrote: >> Bruce wrote: >> > (I guess I am volunteering myself to provide some >> material on >> > machine learning with BioPython. So this is a start.) >> >> Michiel has suggested adding a whole chapter to the >> tutorial about >> supervised learning, presumably incorporating his logistic >> regression >> example as part of this. Have a look at thread >> "Bio.MarkovModel; >> Bio.Popgen, Bio.PDB documentation" on the dev mailing >> list. I'm sure >> you can contribute (even if just by proof reading). > > Some more documentation on machine learning would definitely be useful. Recently I started a chapter on supervised learning methods in the tutorial. Right now it only covers logistic regression, but it should also include Bio.MarkovModel, Bio.MaxEntropy, Bio.NaiveBayes, and Bio.kNN. If you are planning to write some documentation on any of these, please let us know so we can avoid duplicated efforts. The new tutorial is in CVS; I put a copy of the HTML output of the latest version at > http://biopython.org/DIST/docs/tutorial/Tutorial.new.html. > > Thanks! > > --Michiel > Hi, I have not given it too much thought at present but this reflects some of the work I have been doing or involved with. I do not know enough about Bio.MarkovModel, Bio.MaxEntropy and Bio.NaiveBayes to really help. But I did think to start with trying to extend the supervised learning material to be more general. One aspect is to get provide working code using different methodologies for different examples. Regards Bruce From stephan80 at mac.com Wed Oct 8 11:33:51 2008 From: stephan80 at mac.com (Stephan) Date: Wed, 08 Oct 2008 13:33:51 +0200 Subject: [BioPython] Entrez.efetch Message-ID: <75573950382669954948356356615157751492-Webmail2@me.com> Hi, I am using biopython for a week or so. The package is amazing, I wonder how I possibly ignored this for so long now. Since I am not only new to biopython I am also new in this mailing list, so forgive me if this is not the right forum for a question like this. Anyway, here is a weird little problem with the Bio.Entrez.efetch tool: (I use python 2.5 and the latest Biopython 1.48) I want to run the following little test-code, using etetch to get chromosome 4 of Drosophila melanogaster as a genbank-file: ---------------------------CODE------------------------------------ from Bio import Entrez, SeqIO print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"] handle = Entrez.efetch(db="genome", id="56", rettype="genbank") print "downloading to SeqRecord..." record = SeqIO.read(handle, "genbank") print "...done" handle = Entrez.efetch(db="genome", id="56", rettype="genbank") filehandle = open("NCBI_DroMel", "w") print "downloading to file..." filehandle.write(handle.read()) print "...done" handle = open("NCBI_DroMel") print "reading from file..." 
record = SeqIO.read(handle, "genbank") ---------------------------END-CODE------------------------------------ In the last line we have a crash, see the output of the code: ---------------------------OUTPUT------------------------------------ Drosophila melanogaster chromosome 4, complete sequence downloading to SeqRecord... ...done downloading to file... ...done reading chr2L from file... Traceback (most recent call last): File "efetch-test.py", line 17, in record = SeqIO.read(handle, "genbank") File "HOME/lib/python/Bio/SeqIO/__init__.py", line 366, in read first = iterator.next() File "HOME/lib/python/Bio/GenBank/Scanner.py", line 410, in parse_records record = self.parse(handle) File "HOME/lib/python/Bio/GenBank/Scanner.py", line 393, in parse if self.feed(handle, consumer) : File "HOME/lib/python/Bio/GenBank/Scanner.py", line 370, in feed misc_lines, sequence_string = self.parse_footer() File "HOME/lib/python/Bio/GenBank/Scanner.py", line 723, in parse_footer raise ValueError("Premature end of file in sequence data") ValueError: Premature end of file in sequence data ---------------------------END-OUTPUT------------------------------------ It seems that downloading the file to disk will corrupt the genbank file, while downloading directly into biopythons SeqIO.read() function works properly. I dont get it! When I download this chromosome manually from the NCBI-website, I indeed find a difference in one line, namely in line 3 of the genbank file. In the manually downloaded file line 3 reads: "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced from my code I have only: "ACCESSION NC_004353". So without that region-information, the biopython parser of course runs to a premature end. I rather use the cPickle-module now to save the whole SeqRecord-instance. Thats works fine, so I dont need an immediate solution for the above posted problem, but I thought it might be interesting maybe... Any hints? Regards, Stephan From chapmanb at 50mail.com Wed Oct 8 12:35:33 2008 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Oct 2008 08:35:33 -0400 Subject: [BioPython] Entrez.efetch In-Reply-To: <75573950382669954948356356615157751492-Webmail2@me.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> Message-ID: <20081008123533.GE57379@sobchak.mgh.harvard.edu> Hi Stephan; > It seems that downloading the file to disk will corrupt the genbank > file, while downloading directly into biopythons SeqIO.read() function > works properly. I dont get it! > > When I download this chromosome manually from the NCBI-website, > I indeed find a difference in one line, namely in line 3 of the > genbank file. In the manually downloaded file line 3 reads: > "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced > from my code I have only: "ACCESSION NC_004353". So without that > region-information, the biopython parser of course runs to a premature > end. This is a tricky problem that I ran into as well and is fixed in the latest CVS version. The issue is that the Biopython reader is using an UndoHandle instead of a standard python handle. By default some of these operations appear to be assuming an iterator, but UndoHandle did not provide this. As a result, you can lose the first couple of lines which are previously examined to determine the filetype. The fix is to make this a proper iterator. 
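Roughly speaking - and this is only a sketch of what "a proper iterator" means here, not the actual Bio.File code - a file-like wrapper becomes usable in a for-loop once it grows __iter__ and next methods that fall back on readline:

class HandleWrapper:
    # Illustrative wrapper only - the real UndoHandle must also hand
    # back any saved (un-read) lines before touching the raw handle.
    def __init__(self, handle):
        self._handle = handle
    def readline(self, *args, **kwds):
        return self._handle.readline(*args, **kwds)
    def __iter__(self):
        return self
    def next(self):
        line = self.readline()
        if not line:
            # readline returns an empty string at end of file
            raise StopIteration
        return line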
You can either check out current CVS, or make the addition manually to Bio/File.py in your current version: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython Hope this helps, Brad From biopython at maubp.freeserve.co.uk Wed Oct 8 13:37:24 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 14:37:24 +0100 Subject: [BioPython] Entrez.efetch In-Reply-To: <75573950382669954948356356615157751492-Webmail2@me.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> Message-ID: <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com> On Wed, Oct 8, 2008 at 12:33 PM, Stephan wrote: > Hi, > > I am using biopython for a week or so. The package is amazing, I wonder how I possibly ignored this for so long now. > Since I am not only new to biopython I am also new in this mailing list, so forgive me if this is not the right forum for a question like this. > > Anyway, here is a weird little problem with the Bio.Entrez.efetch tool: > (I use python 2.5 and the latest Biopython 1.48) > I want to run the following little test-code, using etetch to get chromosome 4 of Drosophila melanogaster as a genbank-file: > > ---------------------------CODE------------------------------------ > from Bio import Entrez, SeqIO > > print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"] > handle = Entrez.efetch(db="genome", id="56", rettype="genbank") > print "downloading to SeqRecord..." > record = SeqIO.read(handle, "genbank") > print "...done" I assume this is just test code - as it would be silly to download the GenBank file twice in a real script. > handle = Entrez.efetch(db="genome", id="56", rettype="genbank") > filehandle = open("NCBI_DroMel", "w") > print "downloading to file..." > filehandle.write(handle.read()) You should now close the file, which should ensure it is fully written to disk: filehandle.close() > print "...done" > > handle = open("NCBI_DroMel") > print "reading from file..." > record = SeqIO.read(handle, "genbank") > ---------------------------END-CODE------------------------------------ > > In the last line we have a crash, > ... > ValueError: Premature end of file in sequence data This is because you started reading in the file without finishing writing to it - the parser could only read in part of the data, and is complaining about it ending prematurely. Peter From p.j.a.cock at googlemail.com Wed Oct 8 13:46:25 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 8 Oct 2008 14:46:25 +0100 Subject: [BioPython] Entrez.efetch In-Reply-To: <20081008123533.GE57379@sobchak.mgh.harvard.edu> References: <75573950382669954948356356615157751492-Webmail2@me.com> <20081008123533.GE57379@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> Stephan wrote: >> When I download this chromosome manually from the NCBI-website, >> I indeed find a difference in one line, namely in line 3 of the >> genbank file. In the manually downloaded file line 3 reads: >> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced >> from my code I have only: "ACCESSION NC_004353". So without that >> region-information, the biopython parser of course runs to a premature >> end. Stephan - when you say manually, do you mean via a web browser? If so it is likely to be using a subtly different URL, which might explain the NCBI generating slightly different data on the fly. 
Either way, this ACCESSION line difference shouldn't trigger the "Premature end of file in sequence data" error in the GenBank parser. On Wed, Oct 8, 2008 at 1:35 PM, Brad Chapman wrote: > This is a tricky problem that I ran into as well and is fixed in the > latest CVS version. The issue is that the Biopython reader is using an > UndoHandle instead of a standard python handle. By default some of these > operations appear to be assuming an iterator, but UndoHandle did not > provide this. Brad, I'm pretty sure the GenBank parser is NOT using the UndoHandle. Just adding the close made Stephan's example work for me. What exactly was the problem you ran into (one of the other parsers perhaps?). > As a result, you can lose the first couple of lines which are > previously examined to determine the filetype. The fix is to make > this a proper iterator. You can either check out current CVS, or > make the addition manually to Bio/File.py in your current version: > > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython Adding this to the UndoHandle seems a sensible improvement - but I don't see how it can affect Stephan's script. Peter From p.j.a.cock at googlemail.com Wed Oct 8 13:46:25 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 8 Oct 2008 14:46:25 +0100 Subject: [BioPython] Entrez.efetch In-Reply-To: <20081008123533.GE57379@sobchak.mgh.harvard.edu> References: <75573950382669954948356356615157751492-Webmail2@me.com> <20081008123533.GE57379@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> Stephan wrote: >> When I download this chromosome manually from the NCBI-website, >> I indeed find a difference in one line, namely in line 3 of the >> genbank file. In the manually downloaded file line 3 reads: >> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced >> from my code I have only: "ACCESSION NC_004353". So without that >> region-information, the biopython parser of course runs to a premature >> end. Stephan - when you say manually, do you mean via a web browser? If so it is likely to be using a subtly different URL, which might explain the NCBI generating slightly different data on the fly. Either way, this ACCESSION line difference shouldn't trigger the "Premature end of file in sequence data" error in the GenBank parser. On Wed, Oct 8, 2008 at 1:35 PM, Brad Chapman wrote: > This is a tricky problem that I ran into as well and is fixed in the > latest CVS version. The issue is that the Biopython reader is using an > UndoHandle instead of a standard python handle. By default some of these > operations appear to be assuming an iterator, but UndoHandle did not > provide this. Brad, I'm pretty sure the GenBank parser is NOT using the UndoHandle. Just adding the close made Stephan's example work for me. What exactly was the problem you ran into (one of the other parsers perhaps?). > As a result, you can lose the first couple of lines which are > previously examined to determine the filetype. The fix is to make > this a proper iterator. You can either check out current CVS, or > make the addition manually to Bio/File.py in your current version: > > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython Adding this to the UndoHandle seems a sensible improvement - but I don't see how it can affect Stephan's script. 
Peter From stephan80 at mac.com Wed Oct 8 13:48:25 2008 From: stephan80 at mac.com (Stephan) Date: Wed, 08 Oct 2008 15:48:25 +0200 Subject: [BioPython] Entrez.efetch In-Reply-To: <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com> Message-ID: <128043477953580677661042463273686413408-Webmail2@me.com> Hi guys, OK, there is two different problems here that Brad and Peter independently pointed out to me. Peter, you are right that not closing the file actually caused the error. Your hint fixes that, thanks. But that doesnt fix that there is a part of line 3 missing over the download, and although I actually updated to the newest cvs-version of biopython as Brad suggested (sorry for accidently putting my answer not on the mailing-list) that does not fix that line... Best, Stephan Am Mittwoch 08 Oktober 2008 um 03:37PM schrieb "Peter" : >On Wed, Oct 8, 2008 at 12:33 PM, Stephan wrote: >> Hi, >> >> I am using biopython for a week or so. The package is amazing, I wonder how I possibly ignored this for so long now. >> Since I am not only new to biopython I am also new in this mailing list, so forgive me if this is not the right forum for a question like this. >> >> Anyway, here is a weird little problem with the Bio.Entrez.efetch tool: >> (I use python 2.5 and the latest Biopython 1.48) >> I want to run the following little test-code, using etetch to get chromosome 4 of Drosophila melanogaster as a genbank-file: >> >> ---------------------------CODE------------------------------------ >> from Bio import Entrez, SeqIO >> >> print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"] >> handle = Entrez.efetch(db="genome", id="56", rettype="genbank") >> print "downloading to SeqRecord..." >> record = SeqIO.read(handle, "genbank") >> print "...done" > >I assume this is just test code - as it would be silly to download the >GenBank file twice in a real script. > >> handle = Entrez.efetch(db="genome", id="56", rettype="genbank") >> filehandle = open("NCBI_DroMel", "w") >> print "downloading to file..." >> filehandle.write(handle.read()) > >You should now close the file, which should ensure it is fully written to disk: >filehandle.close() > >> print "...done" >> >> handle = open("NCBI_DroMel") >> print "reading from file..." >> record = SeqIO.read(handle, "genbank") >> ---------------------------END-CODE------------------------------------ >> >> In the last line we have a crash, >> ... >> ValueError: Premature end of file in sequence data > >This is because you started reading in the file without finishing >writing to it - the parser could only read in part of the data, and is >complaining about it ending prematurely. > >Peter > > From stephan80 at mac.com Wed Oct 8 14:00:31 2008 From: stephan80 at mac.com (Stephan) Date: Wed, 08 Oct 2008 16:00:31 +0200 Subject: [BioPython] Entrez.efetch In-Reply-To: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> <20081008123533.GE57379@sobchak.mgh.harvard.edu> <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> Message-ID: <72537648433629820630731006204512761040-Webmail2@me.com> >Stephan - when you say manually, do you mean via a web browser? If so >it is likely to be using a subtly different URL, which might explain >the NCBI generating slightly different data on the fly. 
Either way, >this ACCESSION line difference shouldn't trigger the "Premature end of >file in sequence data" error in the GenBank parser. Thanks, that must be it! Now I guess everything is solved, closing the handle makes my code run properly and the download-from-NCBI-webpage-issue explains the difference in line 3. >Adding this to the UndoHandle seems a sensible improvement - but I >don't see how it can affect Stephan's script. There I agree, thanks anyway, Brad. Regards, Stephan From biopython at maubp.freeserve.co.uk Wed Oct 8 14:02:54 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 15:02:54 +0100 Subject: [BioPython] Entrez.efetch In-Reply-To: <128043477953580677661042463273686413408-Webmail2@me.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com> <128043477953580677661042463273686413408-Webmail2@me.com> Message-ID: <320fb6e00810080702q6774f58ap52a02073d62cb75a@mail.gmail.com> On Wed, Oct 8, 2008 at 2:48 PM, Stephan wrote: > > Hi guys, > > OK, there is two different problems here that Brad and Peter independently > pointed out to me. Peter, you are right that not closing the file actually > caused the error. Your hint fixes that, thanks. Great. > But that doesnt fix that there is a part of line 3 missing over the download, > and although I actually updated to the newest cvs-version of biopython as > Brad suggested (sorry for accidently putting my answer not on the mailing-list) > that does not fix that line... This is the issue where you get different GenBank files using Bio.Entrez.efetch and a "manual download"? First of all what did you mean by "manual download" - for example FTP (what URL), or from a browser? Secondly, does this difference to the ACCESSION line (line 3) actually have any ill effects? To be clear using Bio.Entrez.efetch as in your script, I get this: LOCUS NC_004353 1351857 bp DNA linear INV 14-MAY-2008 DEFINITION Drosophila melanogaster chromosome 4, complete sequence. ACCESSION NC_004353 VERSION NC_004353.3 GI:116010290 PROJECT GenomeProject:164 KEYWORDS . SOURCE Drosophila melanogaster (fruit fly) ORGANISM Drosophila melanogaster ... Using FTP from ftp://ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster/CHR_4/NC_004353.gbk I get something similar but different: LOCUS NC_004353 1351857 bp DNA linear INV 14-MAY-2008 DEFINITION Drosophila melanogaster chromosome 4, complete sequence. ACCESSION NC_004353 VERSION NC_004353.3 GI:116010290 KEYWORDS . SOURCE Drosophila melanogaster (fruit fly) ORGANISM Drosophila melanogaster ... Notice the FTP file lacks the PROJECT line, and also differs slightly in its feature table. Using the NCBI website I suspect you can get other slight variations (like the different ACCESSION line you reported). Peter From stephan80 at mac.com Wed Oct 8 13:52:07 2008 From: stephan80 at mac.com (Stephan) Date: Wed, 08 Oct 2008 15:52:07 +0200 Subject: [BioPython] Entrez.efetch In-Reply-To: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> <20081008123533.GE57379@sobchak.mgh.harvard.edu> <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> Message-ID: <56009583349175862359179071289436480391-Webmail2@me.com> >Stephan - when you say manually, do you mean via a web browser? If so >it is likely to be using a subtly different URL, which might explain >the NCBI generating slightly different data on the fly. 
Either way, >this ACCESSION line difference shouldn't trigger the "Premature end of >file in sequence data" error in the GenBank parser. Thanks, that must be it! Now I guess everything is solved, closing the handle makes my code run properly and the download-from-NCBI-webpage-issue explains the difference in line 3. >Adding this to the UndoHandle seems a sensible improvement - but I >don't see how it can affect Stephan's script. There I agree, thanks anyway, Brad. Regards, Stephan From biopythonlist at gmail.com Wed Oct 8 16:23:32 2008 From: biopythonlist at gmail.com (dr goettel) Date: Wed, 8 Oct 2008 18:23:32 +0200 Subject: [BioPython] taxonomic tree Message-ID: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> Hello, I'm new in this list and in BioPython. I would like to create a NCBI-like taxonomic tree and then fill it with the organisms that I have in a file. Is there an easy way to do this? I started using biopython's function at 7.11.4 (finding the lineage of an organism) in the tutorial, but I need to do this tens of thousands times so it spends too much time querying NCBI database. Therefore I built a taxonomic database locally and implemented something similar to 7.11.4 tutorial's function so I get, for every sequence, the lineage in the same way: 'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; Liliopsida; Asparagales; Orchidaceae' Now I need to create a tree, or fill an already created one. And then search it by some criteria. Please could anybody help me with this? Any idea? Thankyou very much From biopython at maubp.freeserve.co.uk Wed Oct 8 16:38:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 17:38:31 +0100 Subject: [BioPython] taxonomic tree In-Reply-To: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> Message-ID: <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com> On Wed, Oct 8, 2008 at 5:23 PM, dr goettel wrote: > Hello, I'm new in this list and in BioPython. Hello :) > I would like to create a NCBI-like taxonomic tree and then fill it with the > organisms that I have in a file. Is there an easy way to do this? I started > using biopython's function at 7.11.4 (finding the lineage of an organism) in > the tutorial, ... For anyone reading this later on, note that the tutorial section numbers tend to change with each release of Biopython. This section just uses Bio.Entrez to fetch taxonomy information for a particular NCBI taxon id. > but I need to do this tens of thousands times so it spends too > much time querying NCBI database. Also calling Bio.Entrez 10000 times might annoy the NCBI ;) > Therefore I built a taxonomic database > locally and implemented something similar to 7.11.4 tutorial's function so I > get, for every sequence, the lineage in the same way: > > 'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; > Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; > Liliopsida; Asparagales; Orchidaceae' I assume you used the NCBI provided taxdump files to populate the database? See ftp://ftp.ncbi.nih.gov/pub/taxonomy/ Personally rather than designing my own database just for this (and writing a parser for the taxonomy files), I would have suggested installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl to download and import the data for you. 
This is a simple perl script - you don't need BioPerl. See http://www.biopython.org/wiki/BioSQL for details. > Now I need to create a tree, or fill an already created one. And then search > it by some criteria. What kind of tree do you mean? Are you talking about creating a Newick tree, or an in memory structure? Perhaps the Bio.Nexus module's tree functionality would help. If you are interested, the BioSQL tables record the taxonomy tree using two methods, each node has a parent node allowing you to walk up the lineage. There are also left/right values allowing selection of all child nodes efficiently via an SQL select statement. Peter From biopython at maubp.freeserve.co.uk Wed Oct 8 16:57:37 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 17:57:37 +0100 Subject: [BioPython] Current tutorial in CVS Message-ID: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com> Michiel wrote: > ... The new tutorial is in CVS; I put a copy of the HTML output > of the latest version at > http://biopython.org/DIST/docs/tutorial/Tutorial.new.html. This also gives people a chance to look at the three plotting examples I added to the "Cookbook" section a couple of weeks back, http://www.biopython.org/DIST/docs/tutorial/Tutorial.new.html#chapter:cookbook Suggestions for any additional biologically motivated simple plots would be nice - especially for different plot types. A scatter plot could be added, are there any suggestions for this other than melting temperature versus length or GC%? See also this thread on the dev-mailing list: http://www.biopython.org/pipermail/biopython-dev/2008-September/004277.html Note that the file at this URL is only temporary, and will probably be removed before the next release. The current tutorial is at: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html http://www.biopython.org/DIST/docs/tutorial/Tutorial.pdf Peter From stephan80 at mac.com Wed Oct 8 17:11:25 2008 From: stephan80 at mac.com (Stephan) Date: Wed, 08 Oct 2008 19:11:25 +0200 Subject: [BioPython] Entrez.efetch large files Message-ID: <133483072970409871957631124263040035200-Webmail2@me.com> Sorry to have an Entrez.efetch-issue again, but somehow there seems to be a problem with very large files. So when I run the following code using the newest cvs-version of biopython: ------------------------------------CODE----------------------------------- from Bio import Entrez, SeqIO id = "57" print Entrez.read(Entrez.esummary(db="genome", id=id))[0]["Title"] handle = Entrez.efetch(db="genome", id=id, rettype="genbank") print "downloading to SeqRecord..." record = SeqIO.read(handle, "genbank") print "...done" ------------------------------------END-CODE----------------------------- it fails with the output: ------------------------------------OUTPUT----------------------------- Drosophila melanogaster chromosome X, complete sequence downloading to SeqRecord... 
Traceback (most recent call last): File "efetch-test.py", line 7, in record = SeqIO.read(handle, "genbank") File "/NetUsers/stschiff/lib/python/Bio/SeqIO/__init__.py", line 366, in read first = iterator.next() File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 410, in parse_records record = self.parse(handle) File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 393, in parse if self.feed(handle, consumer) : File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 370, in feed misc_lines, sequence_string = self.parse_footer() File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 723, in parse_footer raise ValueError("Premature end of file in sequence data") ValueError: Premature end of file in sequence data ------------------------------------END-OUTPUT----------------------------- If I change the id to "56" (chromosome 4, which is shorter) it works. But for all the other chromosomes (ids: 57 - 61) it fails. If I download the genbank files manually from the ftp-server and then use SeqIO.read() it works, so the download-process corrupts the genbank files if they are very large (about 35 MB) I guess... Any hints? Best, Stephan From biopython at maubp.freeserve.co.uk Wed Oct 8 18:57:08 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 19:57:08 +0100 Subject: [BioPython] Entrez.efetch large files In-Reply-To: <133483072970409871957631124263040035200-Webmail2@me.com> References: <133483072970409871957631124263040035200-Webmail2@me.com> Message-ID: <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> On Wed, Oct 8, 2008 at 6:11 PM, Stephan wrote: > Sorry to have an Entrez.efetch-issue again, but somehow there > seems to be a problem with very large files. > ... > If I change the id to "56" (chromosome 4, which is shorter) it works. > But for all the other chromosomes (ids: 57 - 61) it fails. > If I download the genbank files manually from the ftp-server and > then use SeqIO.read() it works, so the download-process corrupts > the genbank files if they are very large (about 35 MB) I guess... > > Any hints? Yes - one big hint: DON'T try and parse these large files directly from the internet. Use efetch to download the file and save it to disk. Then open this local file for parsing. There are several good reasons for this: (1) Rerunning the script (e.g. during development) needn't re-download the file, which wastes time and money (yours and more importantly the NCBI's). You may be fine, but the NCBI can and do ban people's IP addresses if they breach the guidelines. (2) If the parsing fails, there is something to debug easily (the local file). You can open the file in a text editor to check it etc. That being said, downloading and parsing in one go should work - I would expect an IO error if the network timed out, rather than what appears to be the data ending prematurely. However, I don't expect this to be easy to resolve - quite possibly this is a network time out somewhere, maybe at your end, maybe on one of the ISP connections in between. On the bright side, at least the parser isn't silently ignoring the end of the file, which would leave you with a truncated sequence without any warnings :) Do you think the Biopython tutorial should be more explicit about this topic? e.g. In chapter 4 (on Bio.SeqIO) I wrote: >> Note that just because you can download sequence data and >> parse it into a SeqRecord object in one go doesn't mean this >> is always a good idea. 
In general, you should probably download >> sequences once and save them to a file for reuse. Maybe I should have said "... doesn't mean this is a good idea..." instead? Peter From biopython at maubp.freeserve.co.uk Wed Oct 8 19:32:59 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 20:32:59 +0100 Subject: [BioPython] Entrez.efetch large files In-Reply-To: <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> References: <133483072970409871957631124263040035200-Webmail2@me.com> <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> Message-ID: <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> > Yes - one big hint: DON'T try and parse these large files directly > from the internet. Use efetch to download the file and save it to > disk. Then open this local file for parsing. > ... > Do you think the Biopython tutorial should be more explicit about this > topic? I've changed the tutorial (the SeqIO and Entrez chapters) in CVS to make this advice more explicit, and included an example of doing this too. import os from Bio import SeqIO from Bio import Entrez Entrez.email = "A.N.Other at example.com" # Always tell NCBI who you are filename = "gi_186972394.gbk" if not os.path.isfile(filename) : print "Downloading..." net_handle = Entrez.efetch(db="nucleotide",id="186972394",rettype="genbank") out_handle = open(filename, "w") out_handle.write(net_handle.read()) out_handle.close() net_handle.close() print "Saved" print "Parsing..." record = SeqIO.read(open(filename), "genbank") print record Peter From biopython at maubp.freeserve.co.uk Wed Oct 8 20:57:03 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 21:57:03 +0100 Subject: [BioPython] Entrez.efetch large files In-Reply-To: <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com> References: <133483072970409871957631124263040035200-Webmail2@me.com> <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com> Message-ID: <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com> On Wed, Oct 8, 2008 at 9:37 PM, Stephan Schiffels wrote: > > Hi Peter, > > OK, first of all... you were right of course, with > out_handle.write(net_handle.read()) the download works properly and reading > the file from disk also works.The tutorial is very clear on that point, I > agree. OK - hopefully I've just made it clearer still ;) > To illustrate why I made the mistake even though I read the tutorial: > I made some code like: > > try: > unpickling a file as SeqRecord... > except IOError: > download file into SeqRecord AND pickle afterwards to disk > > So, as you can see, I already tried to make the download only once! I see - interesting. > The disk-saving step, I realized, was smarter to do via cPickle since then > reading from it also goes faster than parsing the genbank file each time. So > my goal was to either load a pickled SeqRecord, or download into SeqRecord > and then pickle to disk. I hope you agree that concerning resources from > NCBI this way is (at least in principle) already quite optimal. You approach is clever, and I agree, it shouldn't make any difference to the number of downloads from the NCBI (once you have the script debugged and working). I'm curious - do you have any numbers for the relative times to load a SeqRecord from a pickle, or re-parse it from the GenBank file? 
I'm aware of some "hot spots" in the GenBank parser which take more time than they really need to (feature location parsing in particular). However, even if using pickles is much faster, I would personally still rather use this approach: if file not present: download from NCBI and save it parse file I think it is safer to keep the original data in the NCBI provided format, rather than as a python pickle. Some of my reasons include: * you might want to parse the files with a different tool one day (e.g. grep, or maybe BioPerl, or EMBOSS) * different versions of Biopython will parse the file slightly differently (e.g. once Bugs 2225 and 2578 are fixed the SeqRecord should include slightly more information from a GenBank file) while your pickle will be static * if the SeqRecord or Seq objects themselves change slightly between versions of Biopython, the pickle may not work * more generally, is it safe to transfer the pickly files between different computers (e.g. different versions of python or Biopython, different OS, different line endings)? These issues may not be a problem in your setting. More generally, you could consider using BioSQL, but this may be overkill for your needs. > However, as you pointed out, parsing from the internet makes problems. If you do work out exactly what is going wrong, I would be interested to hear about it. > I think the advantages of not having to download each time were clear to me > from the tutorial. Just that downloading AND parsing at the same time makes > problems didnt appear to me. The addings to the tutorial seem to give some > idea. Your approach all makes sense. Thanks for explaining your thoughts. I don't think I'd ever tried efetch on such a large GenBank file in the first place - for genomes I have usually used FTP instead. Peter From chapmanb at 50mail.com Wed Oct 8 21:11:25 2008 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Oct 2008 17:11:25 -0400 Subject: [BioPython] Entrez.efetch In-Reply-To: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> <20081008123533.GE57379@sobchak.mgh.harvard.edu> <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> Message-ID: <20081008211125.GB17555@sobchak.mgh.harvard.edu> Peter and Stephan; My fault -- sorry about the red herring on this one. I shouldn't have tried to answer this e-mail in 5 minutes before work this morning. Sounds like y'all have it resolved with the missing close so I will keep my mouth shut. Peter, I don't remember my exact problem as it was in some throw-away script and the fix seemed non-problematic. I was thrown off by the "line 3" information Stephan mentioned because my issue was with the first couple of lines missing when iterating with an UndoHandle. No matter. Thanks for coming up with the right fix! Brad > Stephan wrote: > >> When I download this chromosome manually from the NCBI-website, > >> I indeed find a difference in one line, namely in line 3 of the > >> genbank file. In the manually downloaded file line 3 reads: > >> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced > >> from my code I have only: "ACCESSION NC_004353". So without that > >> region-information, the biopython parser of course runs to a premature > >> end. > > Stephan - when you say manually, do you mean via a web browser? If so > it is likely to be using a subtly different URL, which might explain > the NCBI generating slightly different data on the fly. 
Either way, > this ACCESSION line difference shouldn't trigger the "Premature end of > file in sequence data" error in the GenBank parser. > > On Wed, Oct 8, 2008 at 1:35 PM, Brad Chapman wrote: > > This is a tricky problem that I ran into as well and is fixed in the > > latest CVS version. The issue is that the Biopython reader is using an > > UndoHandle instead of a standard python handle. By default some of these > > operations appear to be assuming an iterator, but UndoHandle did not > > provide this. > > Brad, I'm pretty sure the GenBank parser is NOT using the UndoHandle. > Just adding the close made Stephan's example work for me. What > exactly was the problem you ran into (one of the other parsers > perhaps?). > > > As a result, you can lose the first couple of lines which are > > previously examined to determine the filetype. The fix is to make > > this a proper iterator. You can either check out current CVS, or > > make the addition manually to Bio/File.py in your current version: > > > > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython > > Adding this to the UndoHandle seems a sensible improvement - but I > don't see how it can affect Stephan's script. > > Peter From stephan80 at mac.com Wed Oct 8 20:37:17 2008 From: stephan80 at mac.com (Stephan Schiffels) Date: Wed, 08 Oct 2008 22:37:17 +0200 Subject: [BioPython] Entrez.efetch large files In-Reply-To: <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> References: <133483072970409871957631124263040035200-Webmail2@me.com> <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> Message-ID: <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com> Hi Peter, OK, first of all... you were right of course, with out_handle.write (net_handle.read()) the download works properly and reading the file from disk also works.The tutorial is very clear on that point, I agree. To illustrate why I made the mistake even though I read the tutorial: I made some code like: try: unpickling a file as SeqRecord... except IOError: download file into SeqRecord AND pickle afterwards to disk So, as you can see, I already tried to make the download only once! The disk-saving step, I realized, was smarter to do via cPickle since then reading from it also goes faster than parsing the genbank file each time. So my goal was to either load a pickled SeqRecord, or download into SeqRecord and then pickle to disk. I hope you agree that concerning resources from NCBI this way is (at least in principle) already quite optimal. However, as you pointed out, parsing from the internet makes problems. I think the advantages of not having to download each time were clear to me from the tutorial. Just that downloading AND parsing at the same time makes problems didnt appear to me. The addings to the tutorial seem to give some idea. Thanks and Regards, Stephan Am 08.10.2008 um 21:32 schrieb Peter: >> Yes - one big hint: DON'T try and parse these large files directly >> from the internet. Use efetch to download the file and save it to >> disk. Then open this local file for parsing. >> ... >> Do you think the Biopython tutorial should be more explicit about >> this >> topic? > > I've changed the tutorial (the SeqIO and Entrez chapters) in CVS to > make this advice more explicit, and included an example of doing this > too. 
> > import os > from Bio import SeqIO > from Bio import Entrez > Entrez.email = "A.N.Other at example.com" # Always tell NCBI who > you are > filename = "gi_186972394.gbk" > if not os.path.isfile(filename) : > print "Downloading..." > net_handle = Entrez.efetch > (db="nucleotide",id="186972394",rettype="genbank") > out_handle = open(filename, "w") > out_handle.write(net_handle.read()) > out_handle.close() > net_handle.close() > print "Saved" > > print "Parsing..." > record = SeqIO.read(open(filename), "genbank") > print record > > > Peter From biopythonlist at gmail.com Thu Oct 9 08:52:42 2008 From: biopythonlist at gmail.com (dr goettel) Date: Thu, 9 Oct 2008 10:52:42 +0200 Subject: [BioPython] taxonomic tree In-Reply-To: <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com> References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com> Message-ID: <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com> On Wed, Oct 8, 2008 at 6:38 PM, Peter wrote: > On Wed, Oct 8, 2008 at 5:23 PM, dr goettel > wrote: > > Hello, I'm new in this list and in BioPython. > > Hello :) > > > I would like to create a NCBI-like taxonomic tree and then fill it with > the > > organisms that I have in a file. Is there an easy way to do this? I > started > > using biopython's function at 7.11.4 (finding the lineage of an organism) > in > > the tutorial, ... > > For anyone reading this later on, note that the tutorial section > numbers tend to change with each release of Biopython. This section > just uses Bio.Entrez to fetch taxonomy information for a particular > NCBI taxon id. > > > but I need to do this tens of thousands times so it spends too > > much time querying NCBI database. > > Also calling Bio.Entrez 10000 times might annoy the NCBI ;) > > > Therefore I built a taxonomic database > > locally and implemented something similar to 7.11.4 tutorial's function > so I > > get, for every sequence, the lineage in the same way: > > > > 'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; > Streptophytina; > > Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; > > Liliopsida; Asparagales; Orchidaceae' > > I assume you used the NCBI provided taxdump files to populate the > database? See ftp://ftp.ncbi.nih.gov/pub/taxonomy/ > Yes I did. > > Personally rather than designing my own database just for this (and > writing a parser for the taxonomy files), I would have suggested > installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl > to download and import the data for you. This is a simple perl script > - you don't need BioPerl. See http://www.biopython.org/wiki/BioSQL > for details. > I also used the load_ncbi_taxonomy.pl script. It worked great! > > > Now I need to create a tree, or fill an already created one. And then > search > > it by some criteria. > > What kind of tree do you mean? Are you talking about creating a > Newick tree, or an in memory structure? Perhaps the Bio.Nexus > module's tree functionality would help. > Thankyou very much. I still don't know if I want Newick tree or the other one. I'll take a look on Bio.Nexus module > > If you are interested, the BioSQL tables record the taxonomy tree > using two methods, each node has a parent node allowing you to walk up > the lineage. There are also left/right values allowing selection of > all child nodes efficiently via an SQL select statement. 
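To make that concrete, the sort of statement meant is a single query over the left/right values - this is a sketch only, with table and column names taken from the stock BioSQL schema and the connection details and taxon id as placeholders:

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="bioseqdb")
cursor = conn.cursor()
parent_ncbi_id = 4747  # placeholder NCBI taxon id - use the node you care about
# Every taxon whose left_value lies inside the parent's left/right interval,
# i.e. the chosen node itself plus everything beneath it.
cursor.execute("""
    SELECT child.ncbi_taxon_id, child.node_rank
    FROM taxon AS child, taxon AS parent
    WHERE parent.ncbi_taxon_id = %s
      AND child.left_value BETWEEN parent.left_value AND parent.right_value
""", (parent_ncbi_id,))
for ncbi_taxon_id, node_rank in cursor.fetchall():
    print ncbi_taxon_id, node_rank

That of course relies on the left/right values having been filled in by load_ncbi_taxonomy.pl.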
> > Peter > This is what I was trying to do, from the name of the organism (the leaf of the tree) and getting every node using the parent_node field of the taxon table, until reaching the root node. Once I have all the steps to the root node then I have to create/filling the tree with my data in order to examinate the number of organisms integrating certain class/order/family/genus... etc Any ideas will be very apreciated. Thankyou very much for your answer and I'll take a look on Bio.Nexus module. drG From biopython at maubp.freeserve.co.uk Thu Oct 9 09:31:16 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Oct 2008 10:31:16 +0100 Subject: [BioPython] taxonomic tree In-Reply-To: <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com> References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com> <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com> Message-ID: <320fb6e00810090231w723e3b29m5e070c55166d3bfc@mail.gmail.com> >> Personally rather than designing my own database just for this (and >> writing a parser for the taxonomy files), I would have suggested >> installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl >> to download and import the data for you. This is a simple perl script >> - you don't need BioPerl. See http://www.biopython.org/wiki/BioSQL >> for details. > > I also used the load_ncbi_taxonomy.pl script. It worked great! Good. I would encourage you to use the version from BioSQL v1.0.1 if you are not already, as the version with BioSQL v1.0.0 makes an additional unnecessary assumption about the database keys matching the NCBI taxon ID. >> If you are interested, the BioSQL tables record the taxonomy tree >> using two methods, each node has a parent node allowing you to walk up >> the lineage. There are also left/right values allowing selection of >> all child nodes efficiently via an SQL select statement. > > This is what I was trying to do, from the name of the organism (the leaf of > the tree) and getting every node using the parent_node field of the taxon > table, until reaching the root node. Once I have all the steps to the root > node then I have to create/filling the tree with my data in order to > examinate the number of organisms integrating certain > class/order/family/genus... etc > Any ideas will be very apreciated. To do this in Biopython you'll have to write some SQL commands - but first you need to understand how the left/right values work if you want to take advantage of them. I refer you to this thread on the BioSQL mailing list earlier in the year: http://lists.open-bio.org/pipermail/biosql-l/2008-April/001234.html In particular, Hilmar referred to Joe Celko's SQL for Smarties books, and the introduction to this nested-set representation given here: http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html Alternatively, if you wanted to avoid the left/right values, you could use recursion or loops on the parent ID links to build up the tree. For a single lineage this is fine - but for a full try I would expect the left/right values to be faster. Note that Biopython (in CVS now) ignores the left/right values. This is for two reasons - for pulling out a single lineage, Eric found this was faster. Also, when adding new entries to the database re-calculating the left/right values is too slow, so we leave them as NULL (and let the user (re)run load_ncbi_taxonomy.pl later if they care). 
This means we don't want to depend on the left/right values being present. Peter From stephan.schiffels at uni-koeln.de Thu Oct 9 13:01:11 2008 From: stephan.schiffels at uni-koeln.de (Stephan Schiffels) Date: Thu, 9 Oct 2008 15:01:11 +0200 Subject: [BioPython] Entrez.efetch large files In-Reply-To: <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com> References: <133483072970409871957631124263040035200-Webmail2@me.com> <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com> <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com> Message-ID: <171A75DA-EE34-44AB-8E16-DEC626F7164C@uni-koeln.de> Hi Peter, Am 08.10.2008 um 22:57 schrieb Peter: > I'm curious - do you have any numbers for the relative times to load a > SeqRecord from a pickle, or re-parse it from the GenBank file? I'm > aware of some "hot spots" in the GenBank parser which take more time > than they really need to (feature location parsing in particular). So, here is a little profiling of reading a large chromosome both as genbank and from a pickled SeqRecord (both from disk of course): >>> t = Timer("a = cPickle.load(open('DroMel_chr2L.pickle'))", "import cPickle") >>> t.timeit(number=1) 5.2086620330810547 >>> t = Timer("a = SeqIO.read(open('DroMel_chr2L.gbk'), 'genbank')", "from Bio import SeqIO") >>> t.timeit(number=1) 53.902437925338745 >>> As you see there is an amazing 10fold speed-gain using cPickle in comparison to SeqIO.read() ... not bad! The pickled file is a bit larger than the genbank file, but not much. > However, even if using pickles is much faster, I would personally > still rather use this approach: > > if file not present: > download from NCBI and save it > parse file > Thats precisely how I do it now. Works cool! > I think it is safer to keep the original data in the NCBI provided > format, rather than as a python pickle. Some of my reasons include: > > * you might want to parse the files with a different tool one day > (e.g. grep, or maybe BioPerl, or EMBOSS) > * different versions of Biopython will parse the file slightly > differently (e.g. once Bugs 2225 and 2578 are fixed the SeqRecord > should include slightly more information from a GenBank file) while > your pickle will be static > * if the SeqRecord or Seq objects themselves change slightly between > versions of Biopython, the pickle may not work > * more generally, is it safe to transfer the pickly files between > different computers (e.g. different versions of python or Biopython, > different OS, different line endings)? > > These issues may not be a problem in your setting. You are right and in fact I now safe both the genbank file and the pickled file to disk, so I have all the backup. > > More generally, you could consider using BioSQL, but this may be > overkill for your needs. > BioSQL is something that I like a lot. I have not yet digged my way through it but hopefully there will be options for me from that side as well. >> However, as you pointed out, parsing from the internet makes >> problems. > > If you do work out exactly what is going wrong, I would be interested > to hear about it. > Hmm, probably I wont find it out. Parsing from the internet works for small files, it must be some network-issue, dont know. Since I am in the university-web I doubt that the error starts at my side, maybe NCBI clears the connection if the other side is too slow, which is the case for the parsing process... 
But I understand too little about networking. >> I think the advantages of not having to download each time were >> clear to me >> from the tutorial. Just that downloading AND parsing at the same >> time makes >> problems didnt appear to me. The addings to the tutorial seem to >> give some >> idea. > > Your approach all makes sense. Thanks for explaining your thoughts. I > don't think I'd ever tried efetch on such a large GenBank file in the > first place - for genomes I have usually used FTP instead. > > Peter Regards, Stephan From biopython at maubp.freeserve.co.uk Thu Oct 9 14:18:52 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Oct 2008 15:18:52 +0100 Subject: [BioPython] Entrez.efetch large files In-Reply-To: <171A75DA-EE34-44AB-8E16-DEC626F7164C@uni-koeln.de> References: <133483072970409871957631124263040035200-Webmail2@me.com> <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com> <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com> <171A75DA-EE34-44AB-8E16-DEC626F7164C@uni-koeln.de> Message-ID: <320fb6e00810090718g3420729fh50520a4760c5d27@mail.gmail.com> Peter wrote: >> I'm curious - do you have any numbers for the relative times to load a >> SeqRecord from a pickle, or re-parse it from the GenBank file? I'm >> aware of some "hot spots" in the GenBank parser which take more time >> than they really need to (feature location parsing in particular). Stephan wrote: > So, here is a little profiling of reading a large chromosome both as genbank > and from a pickled SeqRecord (both from disk of course): >>>> t = Timer("a = cPickle.load(open('DroMel_chr2L.pickle'))", "import >>>> cPickle") >>>> t.timeit(number=1) > 5.2086620330810547 >>>> t = Timer("a = SeqIO.read(open('DroMel_chr2L.gbk'), 'genbank')", "from >>>> Bio import SeqIO") >>>> t.timeit(number=1) > 53.902437925338745 >>>> > > As you see there is an amazing 10fold speed-gain using cPickle in comparison > to SeqIO.read() ... not bad! The pickled file is a bit larger than the > genbank file, but not much. I'm seeing more like a three fold speed-gain (using cPickle protocol 0, with Python 2.5.2 on a Mac), which is less impressive. For a 10 fold speed up I can see why the complexity overhead of using pickle could be worthwhile. cPickle.load() took 8.5s cPickle.load() took 10.0s cPickle.load() took 9.9s SeqIO.read() took 29.9s SeqIO.read() took 29.8s SeqIO.read() took 29.8s (Script below) I'm not very impressed with the 30 seconds needed to parse a 30MB file. There is certainly scope for speeding up the GenBank parsing here. Peter --------------- My timing script: import os import cPickle import time from Bio import Entrez, SeqIO #Entrez.email = "..." id="57" genbank_filename = "NC_004354.gbk" pickle_filename = "NC_004354.pickle" if not os.path.isfile(genbank_filename) : print "Downloading..." net_handle = Entrez.efetch(db="genome", id=id, rettype="genbank") out_handle = open(genbank_filename, "w") out_handle.write(net_handle.read()) out_handle.close() print "Saved" if not os.path.isfile(pickle_filename) : print "Parsing..." record = SeqIO.read(open(genbank_filename), 'genbank') print "Pickling..." out_handle = open(pickle_filename ,"w") cPickle.dump(record, out_handle) out_handle.close() print "Saved" print "Profiling..." 
for i in range(3) : start = time.time() record = cPickle.load(open(pickle_filename)) print "cPickle.load() took %0.1fs" % (time.time() - start) for i in range(3) : start = time.time() record = SeqIO.read(open(genbank_filename), 'genbank') print "SeqIO.read() took %0.1fs" % (time.time() - start) print "Done" From biopython at maubp.freeserve.co.uk Thu Oct 9 15:48:26 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Oct 2008 16:48:26 +0100 Subject: [BioPython] Deprecating Bio.PubMed and some of Bio.GenBank Message-ID: <320fb6e00810090848g2a516877i5950f515e748b9d0@mail.gmail.com> Dear Biopythoneers, Those of you who looked at the release notes for Biopython 1.48 might have read this bit: >> Bio.PubMed and the online code in Bio.GenBank are now considered >> obsolete, and we intend to deprecate them after the next release. >> For accessing PubMed and GenBank, please use Bio.Entrez instead. These bits of code are effectively simple wrappers for Bio.Entrez. While they may be simple to use, they cannot take advantage of the NCBI's Entrez utils history functionality. This means they discourage users from following the NCBI's preferred usage patterns. We're already trying to encouraging the use of Bio.Entrez by documenting it prominently in the tutorial (which seems to be working given the recent questions on the mailing list), but for Biopython 1.49 I'm suggesting we go further and deprecate Bio.PubMed and the online code in Bio.GenBank. This would mean a warning message would appear when this code is used, and (barring feedback) after a couple of releases this code would be removed completely. Any comments or objections? In particular, is anyone using this "obsolete" functionality now? Peter From biopythonlist at gmail.com Thu Oct 9 16:32:11 2008 From: biopythonlist at gmail.com (dr goettel) Date: Thu, 9 Oct 2008 18:32:11 +0200 Subject: [BioPython] taxonomic tree In-Reply-To: <320fb6e00810090231w723e3b29m5e070c55166d3bfc@mail.gmail.com> References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com> <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com> <320fb6e00810090231w723e3b29m5e070c55166d3bfc@mail.gmail.com> Message-ID: <9b15d9f30810090932qb22ca8boc6edc871bf285154@mail.gmail.com> > To do this in Biopython you'll have to write some SQL commands - but > first you need to understand how the left/right values work if you > want to take advantage of them. I refer you to this thread on the > BioSQL mailing list earlier in the year: > http://lists.open-bio.org/pipermail/biosql-l/2008-April/001234.html > > In particular, Hilmar referred to Joe Celko's SQL for Smarties books, > and the introduction to this nested-set representation given here: > http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html > That's great!! Taking advantage of the left/right values will help me!! They 're great. I started writing a lot of code to do something that in fact can be done with some sql statements. In fact the sql statements are quite difficult for me so I have to deep inside "inner joins". 
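For example, something along these lines is roughly what I am trying to write now (just a sketch -- I am assuming the standard BioSQL taxon and taxon_name tables with their left_value/right_value columns, and the MySQLdb connection details are invented):

# Sketch only: list every descendant of a given NCBI taxon using the
# BioSQL nested-set (left_value/right_value) columns.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="biosql", passwd="secret", db="biosqldb")
cursor = conn.cursor()
cursor.execute("""
    SELECT child.taxon_id, tn.name
    FROM taxon AS parent
    JOIN taxon AS child
         ON child.left_value BETWEEN parent.left_value AND parent.right_value
    JOIN taxon_name AS tn
         ON tn.taxon_id = child.taxon_id AND tn.name_class = 'scientific name'
    WHERE parent.ncbi_taxon_id = %s
    """, (7147,))   # e.g. everything below Diptera
for taxon_id, name in cursor.fetchall():
    print("%s\t%s" % (taxon_id, name))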
Thankyou very much drG From biopython at maubp.freeserve.co.uk Mon Oct 13 12:38:56 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Oct 2008 13:38:56 +0100 Subject: [BioPython] Translation method for Seq object Message-ID: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> Dear Biopythoneers, This is a request for feedback about proposed additions to the Seq object for the next release of Biopython. I'd like people to pick (a) to (e) in the list below (with additional comments or counter suggestions welcome). Enhancement bug 2381 is about adding transcription and translation methods to the Seq object, allowing an object orientated style of programming. e.g. Current functional programming style: >>> from Bio.Seq import Seq, transcribe >>> from Bio.Alphabet import generic_dna >>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna) >>> my_seq Seq('CAGTGACGTTAGTCCG', DNAAlphabet()) >>> transcribe(my_seq) Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) With the latest Biopython in CVS, you can now invoke a Seq object method instead for transcription (or back transcription): >>> my_seq.transcribe() Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) For a comparison, compare the shift from python string functions to string methods. This also makes the functionality more discoverable via dir(my_seq). Adding Seq object methods "transcribe" and "back_transcribe" doesn't cause any confusion with the python string methods. However, for translation, the python string has an existing "translate" method: > S.translate(table [,deletechars]) -> string > > Return a copy of the string S, where all characters occurring > in the optional argument deletechars are removed, and the > remaining characters have been mapped through the given > translation table, which must be a string of length 256. I don't think this functionality is really of direct use for sequences, and having a Seq object "translate" method do a biological translation into a protein sequence is much more intuitive. However, this could cause confusion if the Seq object is passed to non-Biopython code which expects a string like translate method. To avoid this naming clash, a different method name would needed. This is where some user feedback would be very welcome - I think the following cover all the alternatives of what to call a biological translation function (nucleotide to protein): (a) Just use translate (ignore the existing string method) (b) Use translate_ (trailing underscore, see PEP8) (c) Use translation (a noun rather than verb; different style). (d) Use something else (e.g. bio_translate or ...) (e) Don't add a biological translation method at all because ... Thanks, Peter See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381 From ericgibert at yahoo.fr Mon Oct 13 14:38:02 2008 From: ericgibert at yahoo.fr (Eric Gibert) Date: Mon, 13 Oct 2008 22:38:02 +0800 Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> Message-ID: (a) Seq is an object, string is another object... each of them have various methods and coincidently two of them have the same name... 
Eric -----Original Message----- From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter Sent: Monday, October 13, 2008 8:39 PM To: BioPython Mailing List Subject: [BioPython] Translation method for Seq object Dear Biopythoneers, This is a request for feedback about proposed additions to the Seq object for the next release of Biopython. I'd like people to pick (a) to (e) in the list below (with additional comments or counter suggestions welcome). Enhancement bug 2381 is about adding transcription and translation methods to the Seq object, allowing an object orientated style of programming. e.g. Current functional programming style: >>> from Bio.Seq import Seq, transcribe >>> from Bio.Alphabet import generic_dna >>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna) >>> my_seq Seq('CAGTGACGTTAGTCCG', DNAAlphabet()) >>> transcribe(my_seq) Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) With the latest Biopython in CVS, you can now invoke a Seq object method instead for transcription (or back transcription): >>> my_seq.transcribe() Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) For a comparison, compare the shift from python string functions to string methods. This also makes the functionality more discoverable via dir(my_seq). Adding Seq object methods "transcribe" and "back_transcribe" doesn't cause any confusion with the python string methods. However, for translation, the python string has an existing "translate" method: > S.translate(table [,deletechars]) -> string > > Return a copy of the string S, where all characters occurring > in the optional argument deletechars are removed, and the > remaining characters have been mapped through the given > translation table, which must be a string of length 256. I don't think this functionality is really of direct use for sequences, and having a Seq object "translate" method do a biological translation into a protein sequence is much more intuitive. However, this could cause confusion if the Seq object is passed to non-Biopython code which expects a string like translate method. To avoid this naming clash, a different method name would needed. This is where some user feedback would be very welcome - I think the following cover all the alternatives of what to call a biological translation function (nucleotide to protein): (a) Just use translate (ignore the existing string method) (b) Use translate_ (trailing underscore, see PEP8) (c) Use translation (a noun rather than verb; different style). (d) Use something else (e.g. bio_translate or ...) (e) Don't add a biological translation method at all because ... Thanks, Peter See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381 _______________________________________________ BioPython mailing list - BioPython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From bsouthey at gmail.com Mon Oct 13 14:58:07 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Mon, 13 Oct 2008 09:58:07 -0500 Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> Message-ID: <48F361FF.103@gmail.com> Peter wrote: > Dear Biopythoneers, > > This is a request for feedback about proposed additions to the Seq > object for the next release of Biopython. I'd like people to pick (a) > to (e) in the list below (with additional comments or counter > suggestions welcome). 
> > Enhancement bug 2381 is about adding transcription and translation > methods to the Seq object, allowing an object orientated style of > programming. > > e.g. Current functional programming style: > > >>>> from Bio.Seq import Seq, transcribe >>>> from Bio.Alphabet import generic_dna >>>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna) >>>> my_seq >>>> > Seq('CAGTGACGTTAGTCCG', DNAAlphabet()) > >>>> transcribe(my_seq) >>>> > Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) > > With the latest Biopython in CVS, you can now invoke a Seq object > method instead for transcription (or back transcription): > > >>>> my_seq.transcribe() >>>> > Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) > > For a comparison, compare the shift from python string functions to > string methods. This also makes the functionality more discoverable > via dir(my_seq). > > Adding Seq object methods "transcribe" and "back_transcribe" doesn't > cause any confusion with the python string methods. However, for > translation, the python string has an existing "translate" method: > > >> S.translate(table [,deletechars]) -> string >> >> Return a copy of the string S, where all characters occurring >> in the optional argument deletechars are removed, and the >> remaining characters have been mapped through the given >> translation table, which must be a string of length 256. >> > > I don't think this functionality is really of direct use for sequences, and > having a Seq object "translate" method do a biological translation into > a protein sequence is much more intuitive. However, this could cause > confusion if the Seq object is passed to non-Biopython code which > expects a string like translate method. > > To avoid this naming clash, a different method name would needed. > > This is where some user feedback would be very welcome - I think > the following cover all the alternatives of what to call a biological > translation function (nucleotide to protein): > > (a) Just use translate (ignore the existing string method) > (b) Use translate_ (trailing underscore, see PEP8) > (c) Use translation (a noun rather than verb; different style). > (d) Use something else (e.g. bio_translate or ...) > (e) Don't add a biological translation method at all because ... > > Thanks, > > Peter > > See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381 > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > Hi, My thoughts on this is that it is generally best to avoid any confusion when possible. But 'translate' is not a reserved word and the Python documentation notes that the unicode version lacks the optional deletechars argument (so there is precedent for using the same word). Also it involves the methods versus functions argument but many of the string functions have been depreciated and will get removed in Python 3.0 (so in Python 3.0 I think it will be hard to get a name clash without some strange inheritance going on). Therefore, provided 'translate' is a method of Seq then I do not see any strong reason to avoid it except that it is long (but shorter than translation) :-) Would be too cryptic to have dna(), rna() and protein() methods that provide the appropriate conversion based on the Seq type? Obviously reverse translation of a protein sequence to a DNA sequence is complex if there are many solutions. 
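As a rough illustration of the sort of thing I mean (purely hypothetical code, not the real Seq class, and the codon dictionary is truncated to keep it short):

# Hypothetical string subclass with conversion methods; illustration only.
MINI_CODON_TABLE = {"ATG": "M", "TGG": "W", "TTT": "F", "TAA": "*"}   # truncated!

class MySeq(str):
    def rna(self):
        return MySeq(self.replace("T", "U"))
    def dna(self):
        return MySeq(self.replace("U", "T"))
    def protein(self):
        nuc = self.dna().upper()
        codons = [nuc[i:i + 3] for i in range(0, len(nuc) - len(nuc) % 3, 3)]
        return MySeq("".join(MINI_CODON_TABLE.get(c, "X") for c in codons))

print(MySeq("ATGTGGTTTTAA").protein())   # MWF*
print(MySeq("AUGUGGUUUUAA").protein())   # same answer from the RNA spelling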
Regards Bruce From mjldehoon at yahoo.com Mon Oct 13 14:57:28 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 13 Oct 2008 07:57:28 -0700 (PDT) Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> Message-ID: <421846.1946.qm@web62403.mail.re1.yahoo.com> (f) Use .translate both for the Python .translate and for the Biopython .translate. S.translate() ===> Biopython .translate S.translate(table [,deletechars]) ===> Python .translate We can tell from the presence or absence of arguments whether the user intends Python's translate or Biopython's translate. --Michiel. --- On Mon, 10/13/08, Peter wrote: > From: Peter > Subject: [BioPython] Translation method for Seq object > To: "BioPython Mailing List" > Date: Monday, October 13, 2008, 8:38 AM > Dear Biopythoneers, > > This is a request for feedback about proposed additions to > the Seq > object for the next release of Biopython. I'd like > people to pick (a) > to (e) in the list below (with additional comments or > counter > suggestions welcome). > > Enhancement bug 2381 is about adding transcription and > translation > methods to the Seq object, allowing an object orientated > style of > programming. > > e.g. Current functional programming style: > > >>> from Bio.Seq import Seq, transcribe > >>> from Bio.Alphabet import generic_dna > >>> my_seq = Seq("CAGTGACGTTAGTCCG", > generic_dna) > >>> my_seq > Seq('CAGTGACGTTAGTCCG', DNAAlphabet()) > >>> transcribe(my_seq) > Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) > > With the latest Biopython in CVS, you can now invoke a Seq > object > method instead for transcription (or back transcription): > > >>> my_seq.transcribe() > Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) > > For a comparison, compare the shift from python string > functions to > string methods. This also makes the functionality more > discoverable > via dir(my_seq). > > Adding Seq object methods "transcribe" and > "back_transcribe" doesn't > cause any confusion with the python string methods. > However, for > translation, the python string has an existing > "translate" method: > > > S.translate(table [,deletechars]) -> string > > > > Return a copy of the string S, where all characters > occurring > > in the optional argument deletechars are removed, and > the > > remaining characters have been mapped through the > given > > translation table, which must be a string of length > 256. > > I don't think this functionality is really of direct > use for sequences, and > having a Seq object "translate" method do a > biological translation into > a protein sequence is much more intuitive. However, this > could cause > confusion if the Seq object is passed to non-Biopython code > which > expects a string like translate method. > > To avoid this naming clash, a different method name would > needed. > > This is where some user feedback would be very welcome - I > think > the following cover all the alternatives of what to call a > biological > translation function (nucleotide to protein): > > (a) Just use translate (ignore the existing string method) > (b) Use translate_ (trailing underscore, see PEP8) > (c) Use translation (a noun rather than verb; different > style). > (d) Use something else (e.g. bio_translate or ...) > (e) Don't add a biological translation method at all > because ... 
> > Thanks, > > Peter > > See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381 > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Mon Oct 13 15:27:37 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Oct 2008 16:27:37 +0100 Subject: [BioPython] Translation method for Seq object In-Reply-To: <421846.1946.qm@web62403.mail.re1.yahoo.com> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> <421846.1946.qm@web62403.mail.re1.yahoo.com> Message-ID: <320fb6e00810130827j3ec07434s2f58e370743f9537@mail.gmail.com> So I did manage to leave off at least one other option from my short list :) Michiel de Hoon wrote: > > (f) Use .translate both for the Python .translate and for the Biopython .translate. > > S.translate() ===> Biopython .translate > > S.translate(table [,deletechars]) ===> Python .translate > > We can tell from the presence or absence of arguments whether the user intends Python's translate or Biopython's translate. Sadly its not quite that simple. For a biological translation we'd probably want to offer optional arguments for at least the codon table and stop symbol (like the current Bio.Seq.translate() function), with other further arguments possible (e.g. to treat the sequence as a complete CDS where the start codon should be validated and taken as M). It would still be possible to automatically detect which translation was required, but it wouldn't be very nice. So overall I'm not keen on this approach. Peter From biopython at maubp.freeserve.co.uk Mon Oct 13 15:54:32 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 13 Oct 2008 16:54:32 +0100 Subject: [BioPython] Translation method for Seq object In-Reply-To: <48F361FF.103@gmail.com> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> <48F361FF.103@gmail.com> Message-ID: <320fb6e00810130854m38f37075gf85b798cb4a98e21@mail.gmail.com> Bruce wrote: > ... > Therefore, provided 'translate' is a method of Seq then I do not see any > strong reason to avoid it except that it is long (but shorter than > translation) :-) Good - that sounds like another vote for option (a) in my original list. > Would be too cryptic to have dna(), rna() and protein() methods that provide > the appropriate conversion based on the Seq type? Or in a similar vein, to_dna, to_rna, and to_protein? Or toDNA, toRNA, toProtein? I'd have to go and consult the current python style guide for what is the current best practice. Something like that does sounds reasonable (and they are short), but historically all related Biopython functions have used the terms (back) transcription and (back) translation so I would prefer to stick with those. > Obviously reverse translation of a protein sequence to a DNA sequence is > complex if there are many solutions. Yes, back-translation is tricky because there is generally more than one codon for any amino acid. Ambiguous nucleotides can be used to describe several possible codons giving that amino acid, but in general it is not possible to do this and describe all the possible codons which could have been used. This topic is worth of an entire thread... for the record, I would envisage a back_translate method for the Seq object (assuming we settle on translate as the name for the forward translation from nucleotide to protein). 
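To put some numbers on why enumerating every possibility is usually overkill, the count of exact back-translations grows multiplicatively with length - a quick sketch using the degeneracy of the standard genetic code (plain Python arithmetic, not a proposed API):

# Number of synonymous codons per amino acid in the standard genetic code.
CODONS_PER_AA = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3, "*": 3,
    "F": 2, "Y": 2, "H": 2, "Q": 2, "N": 2, "K": 2, "D": 2, "E": 2, "C": 2,
    "M": 1, "W": 1,
}

def count_back_translations(peptide):
    """How many exact (unambiguous) DNA back-translations exist?"""
    total = 1
    for aa in peptide.upper():
        total *= CODONS_PER_AA[aa]
    return total

print(count_back_translations("MAGIC"))        # 1*4*4*3*2 = 96
print(count_back_translations("LLLLLLLLLL"))   # 6**10 = 60466176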
Peter From mjldehoon at yahoo.com Tue Oct 14 00:50:14 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 13 Oct 2008 17:50:14 -0700 (PDT) Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> Message-ID: <900752.12970.qm@web62408.mail.re1.yahoo.com> > (a) Just use translate (ignore the existing string method) > (b) Use translate_ (trailing underscore, see PEP8) > (c) Use translation (a noun rather than verb; different > style). > (d) Use something else (e.g. bio_translate or ...) > (e) Don't add a biological translation method at all > because ... (a). Note also that once Seq objects inherit from string, the Python .translate method is still accessible as str.translate(seq). --Michiel. From biopython at maubp.freeserve.co.uk Tue Oct 14 10:18:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 14 Oct 2008 11:18:13 +0100 Subject: [BioPython] Translation method for Seq object In-Reply-To: <900752.12970.qm@web62408.mail.re1.yahoo.com> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> <900752.12970.qm@web62408.mail.re1.yahoo.com> Message-ID: <320fb6e00810140318i14c6362eq8a51030b1da660ae@mail.gmail.com> OK, we seem to have a consensus :) In Biopython's CVS, the Seq object now has a translate method which does a biological translation. If anyone comes up with a better proposal before the next release, we can still rename this. Otherwise I will update the Tutorial in CVS shortly... Note that for now, I have followed the existing Bio.Seq.translate(...) function and the new Seq object translate(...) method takes only two optional parameters - the codon table and the stop symbol. I have noted some suggestions for possible additional arguments on Bug 2381. The adventurous among you may want to use CVS to update your Biopython installations to try this out. Please note that you will now need numpy instead of Numeric (there is nothing to stop you having both numpy and Numeric installed at the same time). If you do try out the CVS code, please run the unit tests and report any issues. Thanks, Peter From biopython at maubp.freeserve.co.uk Tue Oct 14 11:11:20 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 14 Oct 2008 12:11:20 +0100 Subject: [BioPython] Deprecating Bio.PubMed and some of Bio.GenBank In-Reply-To: <320fb6e00810090848g2a516877i5950f515e748b9d0@mail.gmail.com> References: <320fb6e00810090848g2a516877i5950f515e748b9d0@mail.gmail.com> Message-ID: <320fb6e00810140411o341df854x49ef3e61421193b8@mail.gmail.com> On Thu, Oct 9, 2008 at 4:48 PM, Peter wrote: > Dear Biopythoneers, > > Those of you who looked at the release notes for Biopython 1.48 might > have read this bit: > >>> Bio.PubMed and the online code in Bio.GenBank are now considered >>> obsolete, and we intend to deprecate them after the next release. >>> For accessing PubMed and GenBank, please use Bio.Entrez instead. > > These bits of code are effectively simple wrappers for Bio.Entrez. > While they may be simple to use, they cannot take advantage of the > NCBI's Entrez utils history functionality. This means they discourage > users from following the NCBI's preferred usage patterns. > > We're already trying to encouraging the use of Bio.Entrez by > documenting it prominently in the tutorial (which seems to be working > given the recent questions on the mailing list), but for Biopython > 1.49 I'm suggesting we go further and deprecate Bio.PubMed and the > online code in Bio.GenBank. 
This would mean a warning message would > appear when this code is used, and (barring feedback) after a couple > of releases this code would be removed completely. > > Any comments or objections? In particular, is anyone using this > "obsolete" functionality now? I've just deprecated Bio.PubMed in CVS - meaning for the next release of Biopython you'll see a warning message when you import the PubMed module. If you are using this module please say something sooner rather than later. This can still be undone. Thanks, Peter From dalloliogm at gmail.com Thu Oct 16 10:02:46 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 16 Oct 2008 12:02:46 +0200 Subject: [BioPython] calculate F-Statistics from SNP data Message-ID: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> Hi, I was going to write a python program to calculate Fst statistics from a sample of SNP data. Is there any module already available to do that in biopython, that I am missing? I saw there is a 'PopGen' module, but the Cookbook says it doesn't support sequence data. Is someone actually writing any module in python to calculate such statistics? -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Thu Oct 16 10:23:12 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Oct 2008 11:23:12 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> Message-ID: <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio wrote: > Hi, > I was going to write a python program to calculate Fst statistics from a > sample of SNP data. Is there any module already available to do that > in biopython, that I am missing? I saw there is a 'PopGen' module, but > the Cookbook says it doesn't support sequence data. > Is someone actually writing any module in python to calculate such > statistics? I think this will be a question for Tiago (the Bio.PopGen author), although others on the list may have also tackled similar questions. In terms of reading in the SNP data, what file format will you be loading? Does Bio.SeqIO currently suffice? Have you looked into what (if any) additional python libraries you would need? For any Biopython addition, a dependency on just numpy that would be preferable, but Tiago has previously suggested an optional dependency on scipy for additional statistics needed in population genetics. Peter From tiagoantao at gmail.com Thu Oct 16 14:10:47 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 16 Oct 2008 15:10:47 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> Message-ID: <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com> Hi, On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio wrote: > Hi, > I was going to write a python program to calculate Fst statistics from a > sample of SNP data. > Is there any module already available to do that in biopython, that I am > missing? > I saw there is a 'PopGen' module, but the Cookbook says it doesn't support > sequence data. > Is someone actually writing any module in python to calculate such > statistics? 
The answer to this has to be done in parts, because it is actually a bunch of related (but different) issues On the data 1. Sequence support. Bio.PopGen doesn't support statistics for sequences (like Tajima D and the like), BUT that is not relevant if you want to do frequency based statistics (like good old Fst), you just have to count frequencies and put into a "frequency format" 2. SNPs is actually not a sequence, but a single element, so it becomes easier. What you need at the end is something like this: For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0 For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0 For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10 And so on... You have to end up with frequency counts per population So, as long as you convert data (sequence, SNP, microsatellite) to frequency counts per population, there are no issues with the type of data. On calculating the statistics (Fst) 1. I am fully aware that core statistics like Fst (I work with Fst a lot myself) are fundamental in a population genetics module, but I sincerely don't know how to proceed because a long term solution requires generic statistical support (e.g., chi-square tests Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy (and I will not maintain generic statistics code myself). I know that Bio.PopGen is of little use without support for standard statistics. 2. A workaround (for which I have code written - but not commited to the repository - I can give it to you) is to invoke GenePop and get the Fst estimation. This requires the data to be in GenePop format (again you can convert SNPs and even sequences to frequency based format) 3. That being said, I have code to estimate Fst (Cockerham and Wier theta and a variation from Mark Beaumont) in Python. I can give it to you (but is not much tested). On sequence data formats: 1. Note that sequence data files (that I know off) have no provision for population structure (you cannot say, in a standard way, sequence X belongs to population Y). You have to do it in adhoc way. That means you have to invent your own convention for your private use. 2. Anyway, in your case I suppose you still have to extract the SNPs from the sequence. 3. If you want do frequency based analysis on your SNPs, I suggest you do a conversion to GenePop anyway (therefore you can import your data in most population structure software as GenePop format is the defacto standard)... 4. Because of the above there is actually no good solution for automated conversion from sequence information to frequency based one (in biopython or in any platform whatsoever) I can give more suggestions if you give more details or have more specific questions. From tiagoantao at gmail.com Thu Oct 16 14:14:28 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 16 Oct 2008 15:14:28 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com> Message-ID: <6d941f120810160714s61eb6d1cx87d1943c4068d491@mail.gmail.com> Just a minor point: I am so used to work in Fst that I mentally converted your "F-statistics" to Fst. Most of my mail still stands. The only point that changes a bit is that I only have code for Fst, so I cannot help you with any other. 
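In the meantime, to give a flavour of the frequency-based calculation, here is a very rough sketch of the naive Nei-style estimator Fst = (HT - HS) / HT from allele counts per population (no sample size correction, not the Cockerham and Weir code I mentioned, and not part of Bio.PopGen):

def exp_het(counts):
    """Expected heterozygosity, 1 - sum(p_i**2), from an allele -> count dict."""
    total = float(sum(counts.values()))
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def naive_fst(pop_counts):
    """pop_counts: one dict of allele counts per population, for one locus."""
    # HS: mean within-population expected heterozygosity
    hs = sum(exp_het(c) for c in pop_counts) / float(len(pop_counts))
    # HT: heterozygosity of the (unweighted) mean allele frequencies
    alleles = set()
    for counts in pop_counts:
        alleles.update(counts)
    mean_freqs = []
    for allele in alleles:
        freqs = [c.get(allele, 0) / float(sum(c.values())) for c in pop_counts]
        mean_freqs.append(sum(freqs) / len(freqs))
    ht = 1.0 - sum(p ** 2 for p in mean_freqs)
    return (ht - hs) / ht if ht else 0.0

# The SNP 1 example above: population 1 has 10 As and 20 Cs, population 2 has 20 As
print(naive_fst([{"A": 10, "C": 20}, {"A": 20}]))   # 0.5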
On Thu, Oct 16, 2008 at 3:10 PM, Tiago Ant?o wrote: > Hi, > > On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio > wrote: >> Hi, >> I was going to write a python program to calculate Fst statistics from a >> sample of SNP data. >> Is there any module already available to do that in biopython, that I am >> missing? >> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support >> sequence data. >> Is someone actually writing any module in python to calculate such >> statistics? > > The answer to this has to be done in parts, because it is actually a > bunch of related (but different) issues > > > On the data > 1. Sequence support. Bio.PopGen doesn't support statistics for > sequences (like Tajima D and the like), BUT that is not relevant if > you want to do frequency based statistics (like good old Fst), you > just have to count frequencies and put into a "frequency format" > 2. SNPs is actually not a sequence, but a single element, so it > becomes easier. What you need at the end is something like this: > For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0 > For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0 > For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10 > And so on... You have to end up with frequency counts per population > So, as long as you convert data (sequence, SNP, microsatellite) to > frequency counts per population, there are no issues with the type of > data. > > On calculating the statistics (Fst) > 1. I am fully aware that core statistics like Fst (I work with Fst a > lot myself) are fundamental in a population genetics module, but I > sincerely don't know how to proceed because a long term solution > requires generic statistical support (e.g., chi-square tests > Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy > (and I will not maintain generic statistics code myself). I know that > Bio.PopGen is of little use without support for standard statistics. > 2. A workaround (for which I have code written - but not commited to > the repository - I can give it to you) is to invoke GenePop and get > the Fst estimation. This requires the data to be in GenePop format > (again you can convert SNPs and even sequences to frequency based > format) > 3. That being said, I have code to estimate Fst (Cockerham and Wier > theta and a variation from Mark Beaumont) in Python. I can give it to > you (but is not much tested). > > > On sequence data formats: > 1. Note that sequence data files (that I know off) have no provision > for population structure (you cannot say, in a standard way, sequence > X belongs to population Y). You have to do it in adhoc way. That means > you have to invent your own convention for your private use. > 2. Anyway, in your case I suppose you still have to extract the SNPs > from the sequence. > 3. If you want do frequency based analysis on your SNPs, I suggest you > do a conversion to GenePop anyway (therefore you can import your data > in most population structure software as GenePop format is the defacto > standard)... > 4. Because of the above there is actually no good solution for > automated conversion from sequence information to frequency based one > (in biopython or in any platform whatsoever) > I can give more suggestions if you give more details or have more > specific questions. > -- "Data always beats theories. 'Look at data three times and then come to a conclusion,' versus 'coming to a conclusion and searching for some data.' 
The former will win every time." ?Matthew Simmons, http://www.tiago.org From biopython at maubp.freeserve.co.uk Thu Oct 16 15:11:27 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 16 Oct 2008 16:11:27 +0100 Subject: [BioPython] back-translation method for Seq object? Message-ID: <320fb6e00810160811s19e580b2ped86c43b32c401bb@mail.gmail.com> Quoting from the recent thread about adding a translation method to the Seq object, Bruce brought up back-translation: Peter wrote: > Bruce wrote: >> Obviously reverse translation of a protein sequence to a DNA sequence is >> complex if there are many solutions. > > Yes, back-translation is tricky because there is generally more than > one codon for any amino acid. Ambiguous nucleotides can be used to > describe several possible codons giving that amino acid, but in > general it is not possible to do this and describe all the possible > codons which could have been used. This topic is worth of an entire > thread... for the record, I would envisage a back_translate method for > the Seq object (assuming we settle on translate as the name for the > forward translation from nucleotide to protein). Do we actually need a back_translate method? Can anyone suggest an actual use-case for this? It seems difficult to imagine that any simple version would please everyone. Bio.Translate (a semi-obsolete module whose deprecation has been suggested) provides a back_translate method which picks an essentially arbitrary but unambiguous codon for each amino acid. Crude but simple. A more meaningful choice would require suppling codon frequencies for the organism under consideration. Other possibilities include using ambiguous nucleotides to try and cover all the possibilities (e.g. "L" -> "CTN"), but even here in some cases this is arbritary. e.g. The standard three stop codons ['TAA', 'TAG', 'TGA'] could be represented as ['TAR', 'TGA'] or ['TRA', 'TAG'] but not by a single ambiguous codon ('TRR' also covers 'TGG' which codes for 'W'). Potentially of use would be a generator function which returned all possible back translations - but this would be complex and typically overkill. As a final point, a Seq object back-translation method could give RNA or DNA. From a biological point of view giving DNA by default would make sense. This choice is handled in Bio.Translate when creating the translator object (part of what makes Bio.Translate relatively complex to use). Peter From sdavis2 at mail.nih.gov Thu Oct 16 15:16:51 2008 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 16 Oct 2008 11:16:51 -0400 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com> Message-ID: <264855a00810160816g6ea765dcl65fd7e5aa38ef20c@mail.gmail.com> On Thu, Oct 16, 2008 at 10:10 AM, Tiago Ant?o wrote: > Hi, > > On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio > wrote: >> Hi, >> I was going to write a python program to calculate Fst statistics from a >> sample of SNP data. >> Is there any module already available to do that in biopython, that I am >> missing? >> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support >> sequence data. >> Is someone actually writing any module in python to calculate such >> statistics? 
> > The answer to this has to be done in parts, because it is actually a > bunch of related (but different) issues > > > On the data > 1. Sequence support. Bio.PopGen doesn't support statistics for > sequences (like Tajima D and the like), BUT that is not relevant if > you want to do frequency based statistics (like good old Fst), you > just have to count frequencies and put into a "frequency format" > 2. SNPs is actually not a sequence, but a single element, so it > becomes easier. What you need at the end is something like this: > For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0 > For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0 > For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10 > And so on... You have to end up with frequency counts per population > So, as long as you convert data (sequence, SNP, microsatellite) to > frequency counts per population, there are no issues with the type of > data. > > On calculating the statistics (Fst) > 1. I am fully aware that core statistics like Fst (I work with Fst a > lot myself) are fundamental in a population genetics module, but I > sincerely don't know how to proceed because a long term solution > requires generic statistical support (e.g., chi-square tests > Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy > (and I will not maintain generic statistics code myself). I know that > Bio.PopGen is of little use without support for standard statistics. > 2. A workaround (for which I have code written - but not commited to > the repository - I can give it to you) is to invoke GenePop and get > the Fst estimation. This requires the data to be in GenePop format > (again you can convert SNPs and even sequences to frequency based > format) > 3. That being said, I have code to estimate Fst (Cockerham and Wier > theta and a variation from Mark Beaumont) in Python. I can give it to > you (but is not much tested). > > > On sequence data formats: > 1. Note that sequence data files (that I know off) have no provision > for population structure (you cannot say, in a standard way, sequence > X belongs to population Y). You have to do it in adhoc way. That means > you have to invent your own convention for your private use. > 2. Anyway, in your case I suppose you still have to extract the SNPs > from the sequence. > 3. If you want do frequency based analysis on your SNPs, I suggest you > do a conversion to GenePop anyway (therefore you can import your data > in most population structure software as GenePop format is the defacto > standard)... > 4. Because of the above there is actually no good solution for > automated conversion from sequence information to frequency based one > (in biopython or in any platform whatsoever) > I can give more suggestions if you give more details or have more > specific questions. Just a little note that the R programming language has some packages for population genetics and, of course, has excellent statistical tools. One can interface with it via rpy. I'm not advocating going this route, but just wanted to let people know about another option. 
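As a tiny taster (this sketch assumes the newer rpy2 flavour of the bindings is installed, and the counts and null frequencies are invented), calling one of R's stock tests from Python is only a few lines:

# Minimal sketch using rpy2's robjects interface (the older rpy API differs).
from rpy2 import robjects

observed = robjects.IntVector([10, 20, 5, 5])               # made-up allele counts
null_probs = robjects.FloatVector([0.25, 0.25, 0.25, 0.25])
result = robjects.r["chisq.test"](observed, p=null_probs)
print(result.rx2("p.value")[0])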
Sean From tiagoantao at gmail.com Thu Oct 16 15:26:52 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 16 Oct 2008 16:26:52 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <264855a00810160816g6ea765dcl65fd7e5aa38ef20c@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com> <264855a00810160816g6ea765dcl65fd7e5aa38ef20c@mail.gmail.com> Message-ID: <6d941f120810160826q2bf25382m41890fb39a4226a0@mail.gmail.com> The task view on Genetics for R provides a good starting point to find R packages related to the field: http://www.freestatistics.org/cran/web/views/Genetics.html On Thu, Oct 16, 2008 at 4:16 PM, Sean Davis wrote: > On Thu, Oct 16, 2008 at 10:10 AM, Tiago Ant?o wrote: >> Hi, >> >> On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio >> wrote: >>> Hi, >>> I was going to write a python program to calculate Fst statistics from a >>> sample of SNP data. >>> Is there any module already available to do that in biopython, that I am >>> missing? >>> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support >>> sequence data. >>> Is someone actually writing any module in python to calculate such >>> statistics? >> >> The answer to this has to be done in parts, because it is actually a >> bunch of related (but different) issues >> >> >> On the data >> 1. Sequence support. Bio.PopGen doesn't support statistics for >> sequences (like Tajima D and the like), BUT that is not relevant if >> you want to do frequency based statistics (like good old Fst), you >> just have to count frequencies and put into a "frequency format" >> 2. SNPs is actually not a sequence, but a single element, so it >> becomes easier. What you need at the end is something like this: >> For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, G is 0 >> For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, G is 0 >> For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, G is 10 >> And so on... You have to end up with frequency counts per population >> So, as long as you convert data (sequence, SNP, microsatellite) to >> frequency counts per population, there are no issues with the type of >> data. >> >> On calculating the statistics (Fst) >> 1. I am fully aware that core statistics like Fst (I work with Fst a >> lot myself) are fundamental in a population genetics module, but I >> sincerely don't know how to proceed because a long term solution >> requires generic statistical support (e.g., chi-square tests >> Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy >> (and I will not maintain generic statistics code myself). I know that >> Bio.PopGen is of little use without support for standard statistics. >> 2. A workaround (for which I have code written - but not commited to >> the repository - I can give it to you) is to invoke GenePop and get >> the Fst estimation. This requires the data to be in GenePop format >> (again you can convert SNPs and even sequences to frequency based >> format) >> 3. That being said, I have code to estimate Fst (Cockerham and Wier >> theta and a variation from Mark Beaumont) in Python. I can give it to >> you (but is not much tested). >> >> >> On sequence data formats: >> 1. Note that sequence data files (that I know off) have no provision >> for population structure (you cannot say, in a standard way, sequence >> X belongs to population Y). 
You have to do it in adhoc way. That means >> you have to invent your own convention for your private use. >> 2. Anyway, in your case I suppose you still have to extract the SNPs >> from the sequence. >> 3. If you want do frequency based analysis on your SNPs, I suggest you >> do a conversion to GenePop anyway (therefore you can import your data >> in most population structure software as GenePop format is the defacto >> standard)... >> 4. Because of the above there is actually no good solution for >> automated conversion from sequence information to frequency based one >> (in biopython or in any platform whatsoever) >> I can give more suggestions if you give more details or have more >> specific questions. > > Just a little note that the R programming language has some packages > for population genetics and, of course, has excellent statistical > tools. One can interface with it via rpy. I'm not advocating going > this route, but just wanted to let people know about another option. > > Sean > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- "Data always beats theories. 'Look at data three times and then come to a conclusion,' versus 'coming to a conclusion and searching for some data.' The former will win every time." ?Matthew Simmons, http://www.tiago.org From lpritc at scri.ac.uk Fri Oct 17 08:24:43 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Fri, 17 Oct 2008 09:24:43 +0100 Subject: [BioPython] back-translation method for Seq object? In-Reply-To: <320fb6e00810160811s19e580b2ped86c43b32c401bb@mail.gmail.com> Message-ID: On 16/10/2008 16:11, "Peter" wrote: > Quoting from the recent thread about adding a translation method to > the Seq object, Bruce brought up back-translation: > > Peter wrote: >> Bruce wrote: >>> Obviously reverse translation of a protein sequence to a DNA sequence is >>> complex if there are many solutions. This is the key problem. Forward translation is - for a given codon table - a one-one mapping. Reverse translation is (for many amino acids) one-many. If the goal is to produce the coding sequence that actually encoded a particular protein sequence, the problem is combinatorial and rapidly becomes messy with increasing sequence length. And that's not considering the problem of splice variants/intron-exon boundaries if attempting to relate the sequence back to some genome or genome fragment - more a problem in eukaryotes. >> Yes, back-translation is tricky because there is generally more than >> one codon for any amino acid. Ambiguous nucleotides can be used to >> describe several possible codons giving that amino acid, but in >> general it is not possible to do this and describe all the possible >> codons which could have been used. This topic is worth of an entire >> thread... for the record, I would envisage a back_translate method for >> the Seq object (assuming we settle on translate as the name for the >> forward translation from nucleotide to protein). > > Do we actually need a back_translate method? Can anyone suggest an > actual use-case for this? It seems difficult to imagine that any > simple version would please everyone. I agree - I can't think of an occasion where I might want to back-translate a protein in this way that wouldn't better be handled by other means. 
Not that I'm the fount of all use-cases but, given the number of ways in which one *could* back-translate, perhaps it would be better not to pick/guess at any single one. Some choices to be made in deciding how to back-translate are (and I'm sure you've already thought of them, but they're worth writing down): I) Protein to unambiguous RNA: a) Codon table: arbitrary; organism-specific; user-defined? b) Codon choice: arbitrary and random; arbitrary and consistent; complete set of possibilities; most common codon (if information available); other favoured codon (if specified)? II) Protein to ambiguous RNA: a) Return a Seq, string or some other representation of ambiguity? b) IUPAC ambiguity symbols; choice of codons; alternative representation of ambiguity? The most common back-translation I do is taking aligned protein sequences back to their known coding sequences, and this is really a case of mapping known codons onto predefined positions, rather than the interpolation of unknown codons that is required for back-translation as implied above. T-coffee handles this pretty well, IIRC. To find coding sequences for a particular protein in the originating sequence (if known), I use BLAST. I guess there might be value in having the ability to identify regions of the coding sequence that are least likely to be variable (by generating them combinatorially) so that probes might be designed if the coding sequence is not known. But that doesn't appear to be the way that most sequences are obtained these days: much cheaper to bung RNA through 454 or Solexa and work through the output than to put someone on the task of making an array of probes to find a sequence that may or may not encode your sequenced protein... > Bio.Translate (a semi-obsolete module whose deprecation has been > suggested) provides a back_translate method which picks an essentially > arbitrary but unambiguous codon for each amino acid. Crude but > simple. A more meaningful choice would require suppling codon > frequencies for the organism under consideration. These can be found - for many organisms - in Emboss codon usage table (.cut) files, if you have Emboss locally. However, is requiring Emboss as a dependency the cleanest or wisest solution for Biopython? This approach solves only one problem: given a particular codon usage table, what is the most likely sequence that would have produced this protein. That's not a problem I've ever come across in anger, but given a table of 'most efficient codons' for some biological expression system, I can see this potentially having some use. However, given that many microbiologists can already tell you the preferred codons for K12 without pausing for breath, I'm not sure there's a problem looking for this solution. > Other possibilities include using ambiguous nucleotides to try and > cover all the possibilities (e.g. "L" -> "CTN"), but even here in some > cases this is arbritary. e.g. The standard three stop codons ['TAA', > 'TAG', 'TGA'] could be represented as ['TAR', 'TGA'] or ['TRA', 'TAG'] > but not by a single ambiguous codon ('TRR' also covers 'TGG' which > codes for 'W'). If Seq had an ambiguity-aware sequence representation, this could be handled. For example, a regular expression-based sequence representation (which could lie alongside Seq.data, perhaps as Seq.regex) could represent these variants as (TAA|TAG|TGA), and alternatively the usual ambiguity codes could also be handled in a similar way (e.g. R as [AG]). 
This would be of some limited use, but would permit sequence searching within Biopython, at least. > Potentially of use would be a generator function which returned all > possible back translations - but this would be complex and typically > overkill. I think that, for large sequences, this could quickly swamp the user. What do you see as the use of this output? > As a final point, a Seq object back-translation method could give RNA > or DNA. From a biological point of view giving DNA by default would > make sense. This choice is handled in Bio.Translate when creating the > translator object (part of what makes Bio.Translate relatively complex > to use). Since there is a one-one map of RNA to DNA, I'm easy about either choice on a computational level. Biologically-speaking, DNA -> RNA is transcription, and RNA -> protein is translation, so I'd expect back-translation to convert protein -> RNA, and back-transcription to convert RNA -> DNA. Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From dalloliogm at gmail.com Fri Oct 17 09:39:41 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Fri, 17 Oct 2008 11:39:41 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> Message-ID: <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> On Thu, Oct 16, 2008 at 12:23 PM, Peter wrote: > On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio > wrote: > > Hi, > > I was going to write a python program to calculate Fst statistics from a > > sample of SNP data. Is there any module already available to do that > > in biopython, that I am missing? I saw there is a 'PopGen' module, but > > the Cookbook says it doesn't support sequence data. > > Is someone actually writing any module in python to calculate such > > statistics? 
> > I think this will be a question for Tiago (the Bio.PopGen author), > although others on the list may have also tackled similar questions. > > In terms of reading in the SNP data, what file format will you be > loading? Does Bio.SeqIO currently suffice? > Hi, thank you very much all of you for the replies. Actually I am going to use tped[1] and tfam[1] files as input, formatted with the plink program[2]. Bio.SeqIO doesn't support these format, but this is right because they don't cointain only sequences but rather elements like Tiago was saying. Let's say I try to write a parser for these two file formats. In which biopython object should I save them? Is there any kind of 'Individual' or 'Population' object in biopython? I see from the cookbook that Bio.GenPop.Record is representanting populations and individual as list[3], and that there is not a 'Population' or 'Individual' object. I think that it is a good approach, because these kind of files tend to be very big and instantiating an Individual object instead of a tuple for every line of the file would be take much memory. But are you going to implement some kind of 'Individual' or 'Population' object? Moreover, python 2.6 will implement a new kind of data object, called 'named tuple' [4], to implement these kind of records. It could be a good compromise (maybe I'll better start a new thread about this and explain better). [1] tped, tfam: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#tr [2] plink: http://pngu.mgh.harvard.edu/~purcell/plink/index.shtml [3] biopython cookbook, popgen: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc112 [4] named tuples in python 2.6: http://code.activestate.com/recipes/500261/ > > Have you looked into what (if any) additional python libraries you > would need? For any Biopython addition, a dependency on just numpy > that would be preferable, but Tiago has previously suggested an > optional dependency on scipy for additional statistics needed in > population genetics. > > Peter > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From dalloliogm at gmail.com Fri Oct 17 10:03:32 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Fri, 17 Oct 2008 12:03:32 +0200 Subject: [BioPython] named tuples for biopython? Message-ID: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com> Hi, python 2.6 is going to implement a new kind of data (like lists, strings, etc..) called 'named_tuple'. It is intended to be a better data format to be used when parsing record files and databases. You can download the recipe from here (it should be included experimentally in python 2.6): - http://code.activestate.com/recipes/500261/ Basically, you instantiate a named_tuple object with this syntax: >> Person = NamedTuple("Person name surname") "Person" is a label for the named_tuple; the following fields, 'name' and 'surname' Then you will have named_tuple object which is basically a mix between a dictionary, a custom class and a tuple: >> Person = NamedTuple("Person name surname") >> Einstein = Person('Albert', 'Einstein') >> Einstein.name 'Albert' >> Einstein.surname 'Einstein' >> people = [] >> for line in f.readlines(): >> people.append(Person(line.split()) >> >> for person in people: >> print person.name, person.surname named_tuples are also read-only object, so they should be less memory-expensive It is like tuples against lists, but more customizable. 
I am really not good ad explaining, and I can't find a good tutorial that illustrate this. I read a good article about named_tuples, but it is in italian language ( http://stacktrace.it/2008/05/gestione-dei-record-python-1/). Maybe you can understand the code examples. Has any of you heard about this new data type? Do you think it could be useful for biopython? There is a lot of file parsing / database interfacing in bioinformatics :) p.s. since I didn't trust HTML-based mails to keep code formatting, I also posted this same message on nodalpoint: http://www.nodalpoint.org/2008/10/17/python_2_6_will_implement_a_new_data_format_named_tuple_can_it_be_of_use_for_biopython -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From dalloliogm at gmail.com Fri Oct 17 10:11:23 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Fri, 17 Oct 2008 12:11:23 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810160714s61eb6d1cx87d1943c4068d491@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com> <6d941f120810160714s61eb6d1cx87d1943c4068d491@mail.gmail.com> Message-ID: <5aa3b3570810170311g1d92dc52q41616cd6dc58fb03@mail.gmail.com> On Thu, Oct 16, 2008 at 4:14 PM, Tiago Ant?o wrote: > Just a minor point: I am so used to work in Fst that I mentally > converted your "F-statistics" to Fst. Most of my mail still stands. > The only point that changes a bit is that I only have code for Fst, so > I cannot help you with any other. > > On Thu, Oct 16, 2008 at 3:10 PM, Tiago Ant?o wrote: > > > 3. That being said, I have code to estimate Fst (Cockerham and Wier > > theta and a variation from Mark Beaumont) in Python. I can give it to > > you (but is not much tested). > > > Thank you.. Can you please send me this code that you are using to calculate Fst statistics with python? I can't guarantee I will use it (most of the people here use perl and bioperl, but I would prefer python), but maybe I can help you testing it. > > > > > > -- > "Data always beats theories. 'Look at data three times and then come > to a conclusion,' versus 'coming to a conclusion and searching for > some data.' The former will win every time." > ?Matthew Simmons, > http://www.tiago.org > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Fri Oct 17 10:17:51 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 17 Oct 2008 11:17:51 +0100 Subject: [BioPython] named tuples for biopython? In-Reply-To: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com> References: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com> Message-ID: <320fb6e00810170317w24fe34a4p1884c4264f3e7363@mail.gmail.com> On Fri, Oct 17, 2008 at 11:03 AM, Giovanni Marco Dall'Olio wrote: > Hi, > python 2.6 is going to implement a new kind of data (like lists, strings, > etc..) called 'named_tuple'. It is intended to be a better data format to > be used when parsing record files and databases. I'd just seen this today actually via another mailing list. 
Here is a short example which actually works on python 2.6 (the details have changed slightly from your quote), >>> from collections import namedtuple >>> Person = namedtuple("Person", "name surname") >>> x = Person("Albert", "Einstein") >>> x Person(name='Albert', surname='Einstein') >>> x.name 'Albert' >>> x.surname 'Einstein' >>> x.keys() Traceback (most recent call last): File "", line 1, in AttributeError: 'Person' object has no attribute 'keys' >>> x["name"] Traceback (most recent call last): File "", line 1, in TypeError: tuple indices must be integers, not str >>> x[0] 'Albert' >>> x[1] 'Einstein' So this doesn't act much like a dictionary (in terms of the x[...] usage), so we can't use it as a drop in enhancement for existing dictionaries in Biopython. I expect there are some places where a namedtuple would make sense (although using it might break backwards compatibility). Also, if we did want to use NamedTuple in Biopython we'd have to include a copy for use on older versions of python. This is probably possible under the python license... but would require an implementation that still worked on pre 2.6. Peter From lpritc at scri.ac.uk Fri Oct 17 10:52:33 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Fri, 17 Oct 2008 11:52:33 +0100 Subject: [BioPython] named tuples for biopython? In-Reply-To: <5aa3b3570810170303y5660ab63n592dc99a76011ad2@mail.gmail.com> Message-ID: On 17/10/2008 11:03, "Giovanni Marco Dall'Olio" wrote: > Hi, > python 2.6 is going to implement a new kind of data (like lists, strings, > etc..) called 'named_tuple'. > It is intended to be a better data format to be used when parsing record > files and databases. > > You can download the recipe from here (it should be included experimentally > in python 2.6): > - http://code.activestate.com/recipes/500261/ The explanation here was pretty clear, to me: http://docs.python.org/dev/library/collections.html#collections.namedtuple > Has any of you heard about this new data type? Not until you mentioned it - thanks for the heads-up. > Do you think it could be > useful for biopython? There is a lot of file parsing / database interfacing > in bioinformatics :) I can see it being a useful collection type. It reminds me of C structs, and looks like a near-perfect fit to many db table entries, and to csv/ATF-format files for which the column headers can be used to define attributes. I guess that one disadvantage of namedtuples, compared to, e.g. a dictionary in which each value is itself a dictionary of attributes (with attribute names for keys), is that there's a restricted character/word set available for attribute names in the namedtuple, but this is not important for dictionary keys, so some additional tally of header to attribute name may be necessary. This has a real use-case in, say, parsing ATF format files... http://www.moleculardevices.com/pages/software/gn_genepix_file_formats.html ... where on-the-fly creation of attributes with the same name as in the parsed file or table row may not be possible with a namedtuple. If you know of the column/field names in advance though, it shouldn't be an issue. L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. 
The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From tiagoantao at gmail.com Fri Oct 17 18:07:18 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 17 Oct 2008 19:07:18 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> Message-ID: <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> Hi, On Fri, Oct 17, 2008 at 10:39 AM, Giovanni Marco Dall'Olio wrote: > Let's say I try to write a parser for these two file formats. In which > biopython object should I save them? Is there any kind of 'Individual' or > 'Population' object in biopython? > I see from the cookbook that Bio.GenPop.Record is representanting > populations and individual as list[3], and that there is not a 'Population' > or 'Individual' object. No, there are no concepts of individuals or populations for now. Bio.PopGen.GenePop is just a representation of a GenePop file (which is a de facto standard in frequency based population genetics). Currently Bio.PopGen philosophy is more of a wrapper for existing software (e.g., I don't implement a coalescent simulator, like in BioPerl, I wrap Simcoal2). The disadvantage is that it is not "Pure Python" and is dependent on external applications. The advantage is that, if the external application is good, than good functionality becomes available inside Biopython. For example, coalescent simulation in BioPerl is (at least last time I've checked it) orders of magnitude less flexible than BioPython's (based on SimCoal2). In this philosophy, I now have a (partial) wrapper for the GenePop application to calculate statistics (voila, Fst). That doesn't mean that core statistics functionality should not be available in Bio.PopGen. I think it should be (that is why I have quite done work on that - implementing from scratch Fst, allelic richness, expected heterosigosity, ...). The same goes to the concept of Population and Individual. For a number of cumulative reasons, the work on that front is stalled. But, if there is some interest, I would more than welcome reopening that front... 
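Purely as an illustration of what Population and Individual objects might look like (nothing like this exists in Bio.PopGen yet, and all class and attribute names below are hypothetical), a lightweight sketch:

class Individual(object):
    """One sampled individual: a name plus one genotype per marker.

    Genotypes are tuples of alleles, e.g. ('a', 'c') for a diploid SNP
    or (104, 108) for a microsatellite.
    """
    def __init__(self, name, genotypes):
        self.name = name
        self.genotypes = genotypes


class Population(object):
    """A named collection of individuals typed at the same markers."""
    def __init__(self, name, individuals=None):
        self.name = name
        self.individuals = individuals or []

    def allele_counts(self, marker_index):
        # Count the alleles observed at one marker across the population.
        counts = {}
        for ind in self.individuals:
            for allele in ind.genotypes[marker_index]:
                counts[allele] = counts.get(allele, 0) + 1
        return counts


pop = Population("Pop 1", [Individual("ind1", [("a", "a")]),
                           Individual("ind2", [("a", "c")])])
print pop.allele_counts(0)    # {'a': 3, 'c': 1}

Whether genotypes should live on the individual, or only as per-population summaries, is exactly the kind of design question that would need discussing.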
> Moreover, python 2.6 will implement a new kind of data object, called 'named > tuple' [4], to implement these kind of records. It could be a good > compromise (maybe I'll better start a new thread about this and explain > better). I think the ad-hoc policy in Biopython is to support previous versions of Python, so I don't think it will be easy to do things in a 2.6 only way (although, for NEW functionality, from my part, I don't see a problem with it). Tiago From bsouthey at gmail.com Fri Oct 17 18:46:19 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Fri, 17 Oct 2008 13:46:19 -0500 Subject: [BioPython] back-translation method for Seq object? In-Reply-To: References: Message-ID: <48F8DD7B.7010909@gmail.com> Leighton Pritchard wrote: > On 16/10/2008 16:11, "Peter" wrote: > > >> Quoting from the recent thread about adding a translation method to >> the Seq object, Bruce brought up back-translation: >> >> Peter wrote: >> >>> Bruce wrote: >>> >>>> Obviously reverse translation of a protein sequence to a DNA sequence is >>>> complex if there are many solutions. >>>> > > This is the key problem. Forward translation is - for a given codon table - > a one-one mapping. Reverse translation is (for many amino acids) one-many. > If the goal is to produce the coding sequence that actually encoded a > particular protein sequence, the problem is combinatorial and rapidly > becomes messy with increasing sequence length. And that's not considering > the problem of splice variants/intron-exon boundaries if attempting to > relate the sequence back to some genome or genome fragment - more a problem > in eukaryotes. > If you use a regular expression or a tree structure then there is a one-one mapping but then that would probably best as a subclass of Seq. Note you still would need a method to transverse it if you wanted to get a sequence from it as well as an reverse complement. It is fairly trivial to get a regular expression for it for the standard genetic code but I did not get my reverse complement to work satisfactory nor did I try to get DNA sequence from the regular expression. I would suggest tools like Wise2 and exonerate (http://www.ebi.ac.uk/~guy/exonerate/) are the solution to solving gene structure problems than using a Seq object. Obviously if you start with a DNA sequence, then you could create object that has a DNA/RNA Seq object and a protein Seq object(s) that contain the translation(s) like in Genbank DNA records that contain the translation. But that really avoids the issue here. >>> Yes, back-translation is tricky because there is generally more than >>> one codon for any amino acid. Ambiguous nucleotides can be used to >>> describe several possible codons giving that amino acid, but in >>> general it is not possible to do this and describe all the possible >>> codons which could have been used. This topic is worth of an entire >>> thread... for the record, I would envisage a back_translate method for >>> the Seq object (assuming we settle on translate as the name for the >>> forward translation from nucleotide to protein). >>> >> Do we actually need a back_translate method? Can anyone suggest an >> actual use-case for this? It seems difficult to imagine that any >> simple version would please everyone. >> > > I agree - I can't think of an occasion where I might want to back-translate > a protein in this way that wouldn't better be handled by other means. 
Not > that I'm the fount of all use-cases but, given the number of ways in which > one *could* back-translate, perhaps it would be better not to pick/guess at > any single one. > Apart from the academic aspect, my main use is searching for protein motifs/domains, enzyme cleavage sites, finding very short combinations of amino acids and binding sites (I do not do this but it is the same) in DNA sequences especially genomic sequence. These are usually very small and, thus, unsuitable for most tools. One of my uses is with peptide identification and de novo sequencing using mass spectrometry when you don't know the actual protein or gene sequence. It also has the problem that certain amino acids have very similar mass so you would need to Regardless of whether you use a regular expression query or not you still need a back translation of the protein query and probably the reverse complement. Another case where it would be useful is that tools like TBLASTN gives protein alignments so you must open the DNA sequence and find the DNA region based on the protein alignment. Bruce From dalloliogm at gmail.com Sun Oct 19 14:50:54 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Sun, 19 Oct 2008 16:50:54 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> Message-ID: <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> On Sat, Oct 18, 2008 at 6:50 PM, Tiago Ant?o wrote: > > here have used bioperl Bio::PopGen::PopStat, but we saw that using that > > module as it is now in bioperl is too much computationally-expensive for > our > > resources. > > So, we are going to either refactor the bioperl function, or to write > custom > > scripts in python to calculate Fst. > > I can program perl, but I would prefer to use python in use, since I like > > object oriented programming. > > You can find my (completely unofficial, completely untested) PopGen module > here: > http://popgen.eu/PopGen.tar.gz > You should take a biopython distro and replace the PopGen directory > with the contents of this one. > ok, thank you very much!! I would like to use git to keep track of the changes I will make to the code. What do you think if I'll upload it to http://github.com and then upload it back on biopython when it is finished? I am not sure, but I think it would be possible to convert the logs back to cvs to reintegrate the changes in biopython. > There are 2 ways to calculate Fst: > Doing something this: > from Bio.PopGen.Stats.Structural import Fst > > fst = Fst() > fst.add_pop('Pop 1', [('a', 'a'), ('a', 'c'), ('a','c')]) > fst.add_pop('Pop 2', [('a', 'c'), ('a', 'c'), ('a','c')]) > One of the problems we are having here, is that it takes too much RAM memory to store all the information about characters for every population. I was going to write a Population object, in which I'll store only the total count of heterozygotes, individuals, and what is needed, instead of the information about characters (('a', 'a'), ('a', 'c'), ...) 
It is something like this:

class Population:
    markers = []

class Marker:
    total_heterozygotes_count = 0
    total_population_count = 0
    total_Purines_count = 0       # this could be renamed, of course
    total_Pyrimidines_count = 0

> > Or using the new GenePop code (see GenePop/Controller.py), by using > genepop to calculate Fsts. > > A few comments: > 1. I don't trust my own Fst code (not tested at all, I am actually > using GenePop as above). You can find it on PopGen.Stats.Structural > (Fst, and also FstBeaumont). There is code there for Fst, Fis and Fit. > Also Fk (I trust the Fk code, but it's the only one) I will ask my group leader to help me in writing down some good test data. I'll let you know when I speak with him. > > 2. If your problem is performance, I think you have to go to a faster > language. Scripting languages strongly underperform on the speed issue. > I find this problem lots of times. C, C++ and Java (yes, Java for > performance) is what I use. Perl, Python and other scripting languages > are quite bad performance-wise. I know... but I think this time the problem is in memory usage. > 3. You can find an Fst implementation in C++ in simuPOP (see file > stator.cpp). GenePop code must also have Fst implemented. > 4. I have an Fst-based application using Biopython PopGen with Fst (but > for another application) - Fdist, you can find it at: > http://www.biomedcentral.com/1471-2105/9/323 . Module Bio.PopGen.FDist > (incidentally, you can also use this to calculate Fst ;) ). > 5. My code on Bio.PopGen.Stats is surely not in its final form. I have > a plan to change it massively. If you are interested in participating > in the discussion, you are welcome. > > > This is to say that if you want, we can work on the same code, and > > contribute it to biopython. > > This would be most welcome. I have almost no sense of ownership of > the code that is on Bio.PopGen. So, if you work on this, go ahead! > > > > I am writing a ped file parser (everybody here is used to this format, and I > > don't know GenePop :( ), and a simple script that calculates Fst with the > > most basic formula. > > I am also trying to design some good tests, and I am using subversion as a > > source control system. > > Maybe I can also send this to you, so you can have a look (but it is still > > very basic, I started yesterday). > > Again, any contribution would be most welcome. Regarding parsers I > would suggest you have a look at how parsers are done in Biopython. > I am following the "standard". You can find an example in > Bio.PopGen.GenePop.__init__.py. From my point of view I have nothing > against a "non standard" parser as long as it is documented and > commented. Thank you very much. I know more or less how parsers are written in biopython, but I have never written one myself. > > Again, feel free to take this discussion to biopython-dev, especially if you are willing to contribute.
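As a rough sketch of what a tped parser could look like (this does not follow the Bio.PopGen parser conventions; it just assumes the four leading columns of the plink tped format, i.e. chromosome, SNP identifier, genetic distance and base-pair position, followed by two alleles per individual, and a hypothetical file name example.tped):

def parse_tped(handle):
    """Yield (chromosome, snp_id, distance, position, genotypes) per marker.

    genotypes is a list of two-allele tuples, one tuple per individual.
    Illustrative sketch only, not a Bio.PopGen parser.
    """
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        chrom, snp_id, distance, position = fields[:4]
        alleles = fields[4:]
        # pair up consecutive alleles: individual i has alleles 2i and 2i+1
        genotypes = [(alleles[i], alleles[i + 1])
                     for i in range(0, len(alleles), 2)]
        yield chrom, snp_id, distance, int(position), genotypes


handle = open("example.tped")
for chrom, snp_id, distance, position, genotypes in parse_tped(handle):
    print snp_id, position, len(genotypes), "individuals"
handle.close()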
> -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From mmokrejs at ribosome.natur.cuni.cz Sun Oct 19 15:52:29 2008 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Sun, 19 Oct 2008 17:52:29 +0200 Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> Message-ID: <48FB57BD.7070705@ribosome.natur.cuni.cz> Hi, I have been away for 2 weeks but although late, let me oppose that string.translate() is of use. Here is my current code: # make sure no unallowed chars are present in the sequence if type == "DNA": if not _sequence.translate(string.maketrans('', ''),'GgAaTtCc'): if not _sequence.translate(string.maketrans('', ''),'GgAaTtCcBbDdSsWw'): if not _sequence.translate(string.maketrans('', ''),'GgAaTtCcRrYyWwSsMmKkHhBbVvDdNn'): raise ValueError, "DNA sequence contains unallowed characters: " + str(_sequence.translate(string.maketrans('', ''),'GgAaTtCcRrYyWwSsMmKkHhBbVvDdNn')) else: _warning = "DNA sequence contains IUPACAmbiguousDNA characters, which cannot be interpreted uniquely. Please try to find sequence of higher quality." else: _warning = "DNA sequence contains ExtendedIUPACDNA characters. " + str(_sequence.translate(string.maketrans('', ''),'GATC')) + " Please try to find sequence of higher quality." elif type == "RNA": if not _sequence.translate(string.maketrans('', ''),'GgAaUuCc'): if not _sequence.translate(string.maketrans('', ''),'GgAaUuCcRrYyWwSsMmKkHhBbVvDdNn'): raise ValueError, "RNA sequence contains unallowed characters: " + str(_sequence.translate(string.maketrans('', ''),'GgAaTtCcRrYyWwSsMmKkHhBbVvDdNn')) else: _warning = "RNA sequence contains ExtendedIUPACDNA characters. " + str(_sequence.translate(string.maketrans('', ''),'GgAaUuCc')) + " Please try to find sequence of higher quality." _sequence = _sequence.translate(string.maketrans('Uu', 'Tt')) return (_warning, _type, _description, _sequence) I would have voted for b) or c). Martin Peter wrote: > Dear Biopythoneers, > > This is a request for feedback about proposed additions to the Seq > object for the next release of Biopython. I'd like people to pick (a) > to (e) in the list below (with additional comments or counter > suggestions welcome). > > Enhancement bug 2381 is about adding transcription and translation > methods to the Seq object, allowing an object orientated style of > programming. > > e.g. Current functional programming style: > >>>> from Bio.Seq import Seq, transcribe >>>> from Bio.Alphabet import generic_dna >>>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna) >>>> my_seq > Seq('CAGTGACGTTAGTCCG', DNAAlphabet()) >>>> transcribe(my_seq) > Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) > > With the latest Biopython in CVS, you can now invoke a Seq object > method instead for transcription (or back transcription): > >>>> my_seq.transcribe() > Seq('CAGUGACGUUAGUCCG', RNAAlphabet()) > > For a comparison, compare the shift from python string functions to > string methods. This also makes the functionality more discoverable > via dir(my_seq). > > Adding Seq object methods "transcribe" and "back_transcribe" doesn't > cause any confusion with the python string methods. 
However, for > translation, the python string has an existing "translate" method: > >> S.translate(table [,deletechars]) -> string >> >> Return a copy of the string S, where all characters occurring >> in the optional argument deletechars are removed, and the >> remaining characters have been mapped through the given >> translation table, which must be a string of length 256. > > I don't think this functionality is really of direct use for sequences, and > having a Seq object "translate" method do a biological translation into > a protein sequence is much more intuitive. However, this could cause > confusion if the Seq object is passed to non-Biopython code which > expects a string like translate method. > > To avoid this naming clash, a different method name would needed. > > This is where some user feedback would be very welcome - I think > the following cover all the alternatives of what to call a biological > translation function (nucleotide to protein): > > (a) Just use translate (ignore the existing string method) > (b) Use translate_ (trailing underscore, see PEP8) > (c) Use translation (a noun rather than verb; different style). > (d) Use something else (e.g. bio_translate or ...) > (e) Don't add a biological translation method at all because ... > > Thanks, > > Peter > > See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381 From mmokrejs at ribosome.natur.cuni.cz Sun Oct 19 16:17:50 2008 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Sun, 19 Oct 2008 18:17:50 +0200 Subject: [BioPython] Current tutorial in CVS In-Reply-To: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com> References: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com> Message-ID: <48FB5DAE.1050600@ribosome.natur.cuni.cz> Hi Peter, Peter wrote: > Michiel wrote: >> ... The new tutorial is in CVS; I put a copy of the HTML output >> of the latest version at >> http://biopython.org/DIST/docs/tutorial/Tutorial.new.html. > > This also gives people a chance to look at the three plotting examples > I added to the "Cookbook" section a couple of weeks back, > > http://www.biopython.org/DIST/docs/tutorial/Tutorial.new.html#chapter:cookbook for those lazy would you please show how to save the generated plots into e.g. jpg or .svg file? Thanks, ;-) Martin From biopython at maubp.freeserve.co.uk Sun Oct 19 16:34:46 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 19 Oct 2008 17:34:46 +0100 Subject: [BioPython] Current tutorial in CVS In-Reply-To: <48FB5DAE.1050600@ribosome.natur.cuni.cz> References: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com> <48FB5DAE.1050600@ribosome.natur.cuni.cz> Message-ID: <320fb6e00810190934l432dd320ue8ff3a40fa497530@mail.gmail.com> > > for those lazy would you please show how to save the generated plots into > e.g. jpg or .svg file? Instead or as well as pylab.show(), use pylab.savefig(...), for example: pylab.savefig("dot_plot.png", dpi=75) pylab.savefig("dot_plot.pdf") On a related note - it looks like the pylab tutorial as moved, I'm getting a 404 error on http://matplotlib.sourceforge.net/tutorial.html now :( It looks like http://matplotlib.sourceforge.net/api/pyplot_api.html is the replacement. 
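As a small self-contained example (assuming matplotlib/pylab is installed; the lengths below are made-up data, and SVG output may depend on which backend your matplotlib build uses):

import pylab

# made-up data: lengths of a handful of sequences
lengths = [245, 312, 298, 401, 167, 350, 289]

pylab.hist(lengths, bins=5)
pylab.title("Sequence length distribution (made-up data)")
pylab.xlabel("Length (bp)")
pylab.ylabel("Count")
pylab.savefig("lengths.png", dpi=75)    # bitmap output
pylab.savefig("lengths.svg")            # vector output, easy to edit later
# pylab.show()    # uncomment to display the figure interactively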
Peter From biopython at maubp.freeserve.co.uk Sun Oct 19 18:15:59 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 19 Oct 2008 19:15:59 +0100 Subject: [BioPython] Translation method for Seq object In-Reply-To: <48FB57BD.7070705@ribosome.natur.cuni.cz> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> <48FB57BD.7070705@ribosome.natur.cuni.cz> Message-ID: <320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com> On Sun, Oct 19, 2008 at 4:52 PM, Martin MOKREJ? wrote: > Hi, > I have been away for 2 weeks but although late, Untill we release a new biopython, its not too late to change the Seq object's new methods. > let me oppose that string.translate() is of use. > Here is my current code: > ... Your code seems to be doing two things with the python string translate() method: (1) Using the deletechars argument (with an empty mapping) to look for unexpected letters. It took me a while to work out what your code was doing - personally I would have used a python set for this, rather than the string translate method. Note also unicode strings don't support the deletechars argument, and that python 3.0 removes the deletechars argument from the string style objects. (2) Using the translate mapping to switch "U" and "u" into "T" and "t" to back transcribe RNA into DNA. For this, Biopython already has a Bio.Seq.back_transcribe function (which does work on strings), and in CVS the Seq object gets a back_transcribe method too. These do both use the string translate method internally. Neither of these operations convice me that the Seq object should support the python string translate method. Note that if you still need to use the python string translate method, it is accessable by first turning the Seq object into a string (e.g. str(my_seq).translate(mapping, delete_chars)), or as Michiel suggested earlier, you could use the string module translate function on the Seq object. Also note that (as in your example using the string translate to do back transcription) the translate method by its nature makes it impossible to know if the original Seq object alphabet still applies to the result. Peter From mmokrejs at ribosome.natur.cuni.cz Sun Oct 19 18:28:38 2008 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Sun, 19 Oct 2008 20:28:38 +0200 Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> <48FB57BD.7070705@ribosome.natur.cuni.cz> <320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com> Message-ID: <48FB7C56.6010408@ribosome.natur.cuni.cz> Peter, you are right in your points. I think the translate() trick had some speed advantages over other approaches to zap unwanted characters - I don't remember but if it is gonna break in future python releases I will have to rewrite this anyway. I just wanted to say I really do use the string translate function and that it has use in bioinformatics as well. ;-) Still, I think the name clash is asking for disaster, but overloading is a feature of python so it might be expected. Do whatever you want. ;) Cheers, M. Peter wrote: > On Sun, Oct 19, 2008 at 4:52 PM, Martin MOKREJ? > wrote: >> Hi, >> I have been away for 2 weeks but although late, > > Untill we release a new biopython, its not too late to change the Seq > object's new methods. > >> let me oppose that string.translate() is of use. >> Here is my current code: >> ... 
> > Your code seems to be doing two things with the python string > translate() method: > > (1) Using the deletechars argument (with an empty mapping) to look for > unexpected letters. It took me a while to work out what your code was > doing - personally I would have used a python set for this, rather > than the string translate method. Note also unicode strings don't > support the deletechars argument, and that python 3.0 removes the > deletechars argument from the string style objects. > > (2) Using the translate mapping to switch "U" and "u" into "T" and "t" > to back transcribe RNA into DNA. For this, Biopython already has a > Bio.Seq.back_transcribe function (which does work on strings), and in > CVS the Seq object gets a back_transcribe method too. These do both > use the string translate method internally. > > Neither of these operations convice me that the Seq object should > support the python string translate method. > > Note that if you still need to use the python string translate method, > it is accessable by first turning the Seq object into a string (e.g. > str(my_seq).translate(mapping, delete_chars)), or as Michiel suggested > earlier, you could use the string module translate function on the Seq > object. > > Also note that (as in your example using the string translate to do > back transcription) the translate method by its nature makes it > impossible to know if the original Seq object alphabet still applies > to the result. From biopython at maubp.freeserve.co.uk Sun Oct 19 18:52:06 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 19 Oct 2008 19:52:06 +0100 Subject: [BioPython] Translation method for Seq object In-Reply-To: <48FB7C56.6010408@ribosome.natur.cuni.cz> References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> <48FB57BD.7070705@ribosome.natur.cuni.cz> <320fb6e00810191115k120b64c3m237d0929d33b13fb@mail.gmail.com> <48FB7C56.6010408@ribosome.natur.cuni.cz> Message-ID: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com> On Sun, Oct 19, 2008 at 7:28 PM, Martin MOKREJ? wrote: > Peter, > you are right in your points. I think the translate() trick > had some speed advantages over other approaches to > zap unwanted characters ... I haven't profiled this - you may be right. On the other hand, using the translate method in this way doesn't make the purpose of the code obvious. >- I don't remember but if it is gonna break in future > python releases I will have to rewrite this anyway. Certainly the deletechars argument seems to be gone in Python 3.0, but you may not need to worry about that for a while. > I just wanted to say I really do use the string translate > function and that it has use in bioinformatics as well. ;-) Using the string translate for (back)transcription is an obvious example, but this is a special case that is already handled within Biopython. Does anyone have a non-transcription sequence example where the mapping part of the translate method is actually used? Using the string translate method just to remove characters is an interesting one. How common is this in typical python code? I've always used the string replace method (but usually I only want to remove one character). Maybe we should have a remove characters method for the Seq object? Here at least dealing with the alphabet is fairly simple. On another thread I'd suggested a "remove gaps" method as a special case of this. > Still, I think the name clash is asking for disaster, but > overloading is a feature of python so it might be expected. 
> Do whatever you want. ;) > Cheers, > M. I'm still a tiny bit uneasy about the name clash myself... anyone else what to join in the debate? Peter From biopython at maubp.freeserve.co.uk Sun Oct 19 18:59:23 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 19 Oct 2008 19:59:23 +0100 Subject: [BioPython] Current tutorial in CVS In-Reply-To: <320fb6e00810190934l432dd320ue8ff3a40fa497530@mail.gmail.com> References: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com> <48FB5DAE.1050600@ribosome.natur.cuni.cz> <320fb6e00810190934l432dd320ue8ff3a40fa497530@mail.gmail.com> Message-ID: <320fb6e00810191159j52bb78al4c38b1f7804c268f@mail.gmail.com> Peter wrote: > Marting wrote: >> for those lazy would you please show how to save the generated >> plots into e.g. jpg or .svg file? > > Instead or as well as pylab.show(), use pylab.savefig(...), for example: > > pylab.savefig("dot_plot.png", dpi=75) > pylab.savefig("dot_plot.pdf") I've added a note about this in the example in the CVS version of the Tutorial. > On a related note - it looks like the pylab tutorial as moved, I'm > getting a 404 error on http://matplotlib.sourceforge.net/tutorial.html > now :( I've updated this link to point at http://matplotlib.sourceforge.net/ instead (which at the time of writing includes a quick summary of the pylab functions). Peter From tiagoantao at gmail.com Mon Oct 20 05:41:56 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 20 Oct 2008 06:41:56 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> Message-ID: <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> Hi, On Sun, Oct 19, 2008 at 3:50 PM, Giovanni Marco Dall'Olio wrote: > ok, thank you very much!! > I would like to use git to keep track of the changes I will make to the > code. > What do you think if I'll upload it to http://github.com and then upload it > back on biopython when it is finished? > I am not sure, but I think it would be possible to convert the logs back to > cvs to reintegrate the changes in biopython. I think it is a good idea. When we reintegrate back I think there will be no need to backport the commit logs anyway. > One of the problems we are having here, is that it takes too much RAM memory > to store all the information about characters for every population. > I was going to write a Population object, in which I'll store only the total > count of heterozygotes, individuals, and what is needed, instead of the > information about characters (('a', 'a'), ('a', 'c'), ...) I am afraid that this is not enough. Even for Fst. I suppose you are acquainted with a formula with just heterozigosities. That is more of just a textbook formula only. The Fst standard estimator is really Cockerham and Wier Theta estimator (1984 paper), and I think it needs individual information (or at the very least allele counts). Check my implementation of Fst, which should be it (less the bugs that are in). Maybe my implementation of theta is wrong, which is a possiblity. 
But theta is the standard. May I do a suggestion for your problem? Split in SNP groups (like 100 at a time) and calculate 100 Fsts at time. Store the calculated Fsts to disk and then join them at the end. As a general rule, whatever goes into biopython has to be general enough to accomodate all standard statistics (not just Fs). One cannot make a solution that is taliored to solve just our personal research issues. I am currently traveling (which seems to be my constant state). When I arrive back at office, on Wednsday, I will make a few suggestions on how we can structure things. I have a few ideas that I would like to share and discuss. > class Marker: > total_heterozygotes_count = 0 > total_population_count = 0 > total_Purines_count = 0 # this could be renamed, of course > total_Pyrimidines_count = 0 Also, your representation seems to be targetted toward SNPs, people use lots of other things (microsatellites are still used a lot). We have to think about something that is useful to the general public. Let me get back to you on Wednesday we ideas. If you are interested we can work together to make a nice population genetics module that can be used in a wide range of situations. From lpritc at scri.ac.uk Mon Oct 20 09:09:51 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Mon, 20 Oct 2008 10:09:51 +0100 Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com> Message-ID: On 19/10/2008 19:52, "Peter" wrote: > I'm still a tiny bit uneasy about the name clash myself... anyone else > what to join in the debate? The problem domain for biological sequences implies a natural definition for the application of 'translate' to a DNA/RNA sequence that is the translation into protein sequence. The string.translate() method is not consistent with this natural use of the language of the problem domain. I take Martin's point that there are valid uses for the string.translate() method in bioinformatics and elsewhere, but I think that overloading translate() is as valid here as overloading __mul__ would be for an implementation of matrix algebra, or complex numbers. For biological sequences as much as for number types, I think the problem domain and expected behaviour of the object being represented in code should take precedence over emulation of an object type that was never intended to provide the functionality required for a biological sequence. I think also that if the string.translate() method is required, an explicit call to string.translate() implies: "translate this biological sequence as if it were a string, and not a biological sequence". The converse application of a Bio.translate() method would to me imply "translate this biological sequence as if it were a biological sequence, and not a string"; which seems to me to defeat part of the purpose of representing the biological sequence with its own object. L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. 
DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From biopython at maubp.freeserve.co.uk Mon Oct 20 09:22:39 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Oct 2008 10:22:39 +0100 Subject: [BioPython] Translation method for Seq object In-Reply-To: References: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com> Message-ID: <320fb6e00810200222s641e165eqef3b209893a8d976@mail.gmail.com> Leighton wrote: > Peter wrote: >> I'm still a tiny bit uneasy about the name clash myself... anyone else >> what to join in the debate? > > The problem domain for biological sequences implies a natural definition for > the application of 'translate' to a DNA/RNA sequence that is the translation > into protein sequence. The string.translate() method is not consistent with > this natural use of the language of the problem domain. > ... I thought that was well argued and nicely put. Of course, someone is still bound to try calling the translate method with a string mapping. Maybe we should add a bit of defensive code to check the table argument, and print a helpful error message when this happens? We currently only expect the codon table argument to be an NCBI genetic code table name or ID (string or integer). Earlier I wrote: >> In Biopython's CVS, the Seq object now has a translate method >> which does a biological translation. If anyone comes up with a >> better proposal before the next release, we can still rename this. >> Otherwise I will update the Tutorial in CVS shortly... I have since updated the Tutorial in CVS to use the new transcribe, back_transcribe and translate methods. Maybe we should put an updated "preview" online for comment? Peter From lpritc at scri.ac.uk Mon Oct 20 09:38:10 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Mon, 20 Oct 2008 10:38:10 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48F8DD7B.7010909@gmail.com> Message-ID: On 17/10/2008 19:46, "Bruce Southey" wrote: > Leighton Pritchard wrote: >> This is the key problem. Forward translation is - for a given codon table - >> a one-one mapping. Reverse translation is (for many amino acids) one-many. >> If the goal is to produce the coding sequence that actually encoded a >> particular protein sequence, the problem is combinatorial and rapidly >> becomes messy with increasing sequence length. >> > If you use a regular expression or a tree structure then there is a > one-one mapping but then that would probably best as a subclass of Seq. I don't see this, I'm afraid. 
Each codon -> one amino acid : one-one mapping Arg -> set of 6 possible codons : one-many mapping It doesn't matter how it's represented in code, the problem of a one-many mapping still exists for amino acid -> codon translation in most cases. The combinatorial nature of the overall problem can be illustrated by considering the unlikely case of a protein that comprises 100 arginines. The number of potential coding sequences is 6**100 = 6.5e77. That you *can* choose any one of these to be your potential coding sequence doesn't negate the fact that there are still (6.5e77)-1 other possibilities... It doesn't get much better if you use the the average number of codons per amino acid: 61/20 ~= 3. A 100aa protein would typically have 3**100 ~= 5e47 potential coding sequences. I wouldn't want to guess which one was correct, and I can't see a back_translate method in this instance doing more than producing a nucleotide sequence that is potentially capable of producing the passed protein sequence, but for which no claims can be made about biological plausibility. Now, a back_translate() that takes a protein sequence alignment and, when passed the coding sequences for each component sequence, returns the corresponding alignment of the nucleotide sequences, makes sense to me. But that's a discussion for Bio.Alignment objects... > I would suggest tools like Wise2 and exonerate > (http://www.ebi.ac.uk/~guy/exonerate/) are the solution to solving gene > structure problems than using a Seq object. I wouldn't suggest using a Seq object for this purpose, either... ;) >> I agree - I can't think of an occasion where I might want to back-translate >> a protein in this way that wouldn't better be handled by other means. Not >> that I'm the fount of all use-cases but, given the number of ways in which >> one *could* back-translate, perhaps it would be better not to pick/guess at >> any single one. >> > Apart from the academic aspect, my main use is searching for protein > motifs/domains, enzyme cleavage sites, finding very short combinations > of amino acids and binding sites (I do not do this but it is the same) > in DNA sequences especially genomic sequence. These are usually very > small and, thus, unsuitable for most tools. I do much the same, and haven't found a pressing use for back-translation, yet - YMMV. > One of my uses is with > peptide identification and de novo sequencing using mass spectrometry > when you don't know the actual protein or gene sequence. It also has the > problem that certain amino acids have very similar mass so you would > need to Regardless of whether you use a regular expression query or not > you still need a back translation of the protein query and probably the > reverse complement. Perhaps I'm being dense, but I don't see why that is. Can you give an example? > Another case where it would be useful is that tools like TBLASTN gives > protein alignments so you must open the DNA sequence and find the DNA > region based on the protein alignment. You could use TBLASTN output - which provides start and stop coordinates for the match on the subject sequence - to extract this directly, without the need for backtranslation. Example output where subject coordinates give the match location below: >ref|NC_004547.2| Erwinia carotovora subsp. 
atroseptica SCRI1043, complete genome Length = 5064019 Score = 731 bits (1887), Expect = 0.0 Identities = 363/376 (96%), Positives = 363/376 (96%) Frame = +3 Query: 1 MFHXXXXXXXXXXXXXTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY 60 MFH TISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY Sbjct: 477432 MFHLPKLKQKPLALLLTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY 477611 [...] L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From dalloliogm at gmail.com Mon Oct 20 13:57:27 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 20 Oct 2008 15:57:27 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> Message-ID: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> On Mon, Oct 20, 2008 at 7:41 AM, Tiago Ant?o wrote: > Hi, > > On Sun, Oct 19, 2008 at 3:50 PM, Giovanni Marco Dall'Olio > wrote: > > ok, thank you very much!! > > I would like to use git to keep track of the changes I will make to the > > code. > > What do you think if I'll upload it to http://github.com and then upload > it > > back on biopython when it is finished? > > I am not sure, but I think it would be possible to convert the logs back > to > > cvs to reintegrate the changes in biopython. > > I think it is a good idea. When we reintegrate back I think there will > be no need to backport the commit logs anyway. 
Ok, I have uploaded the code to: - http://github.com/dalloliogm/biopython---popgen I put the code I wrote before writing in this mailing list in the folder PopGen/Gio - http://github.com/dalloliogm/biopython---popgen/tree/6f6fa66cda1908dc8334ab6e9e69b7c85290a8be/src/PopGen/Gio However, I plan to integrate these scripts with your code or re-write the completely (well, your code is a lot better than mine :) ). Just a curiosity: why do you use the '<>' operator instead of '!='? Is it better supported in python 3.0? > > One of the problems we are having here, is that it takes too much RAM > memory > > to store all the information about characters for every population. > > I was going to write a Population object, in which I'll store only the > total > > count of heterozygotes, individuals, and what is needed, instead of the > > information about characters (('a', 'a'), ('a', 'c'), ...) > > I am afraid that this is not enough. Even for Fst. I suppose you are > acquainted with a formula with just heterozigosities. Yes, I was trying to implement a very basic formula at first. > That is more of > just a textbook formula only. The Fst standard estimator is really > Cockerham and Wier Theta estimator (1984 paper) Bioperl's Bio::PopGen::PopStats uses the same formula: - http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/PopGen/PopStats.html#POD3 """ Bioperl's Bio:Based on diploid method in Weir BS, Genetics Data Analysis II, 1996 page 178. """ , and I think it needs > individual information (or at the very least allele counts). Check my > implementation of Fst, which should be it (less the bugs that are in). > Maybe my implementation of theta is wrong, which is a possiblity. But > theta is the standard. > > May I do a suggestion for your problem? Split in SNP groups (like 100 > at a time) and calculate 100 Fsts at time. Store the calculated Fsts > to disk and then join them at the end. > Thanks - that's a good suggestion > > > I am currently traveling (which seems to be my constant state). When I > arrive back at office, on Wednsday, I will make a few suggestions on > how we can structure things. I have a few ideas that I would like to > share and discuss. > Have a nice trip! > > > class Marker: > > total_heterozygotes_count = 0 > > total_population_count = 0 > > total_Purines_count = 0 # this could be renamed, of course > > total_Pyrimidines_count = 0 > > > Also, your representation seems to be targetted toward SNPs, people > use lots of other things (microsatellites are still used a lot). We > have to think about something that is useful to the general public. > Let me get back to you on Wednesday we ideas. If you are interested we > can work together to make a nice population genetics module that can > be used in a wide range of situations. > Yes, I agree. It was just a first try. We should collect some good use-cases. 
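For instance, just to illustrate the kind of per-marker summary that stays small in memory and is not tied to SNPs (the class and attribute names are hypothetical):

class MarkerCounts(object):
    """Per-population summary of one marker: allele and genotype counts.

    Works for SNPs ('a'/'c'), microsatellites (104/108) or anything
    hashable, because it only keeps counts, not the raw genotypes.
    """
    def __init__(self):
        self.allele_counts = {}
        self.n_individuals = 0
        self.n_heterozygotes = 0

    def add_genotype(self, genotype):
        # genotype is a tuple of alleles, e.g. ('a', 'c') or (104, 108)
        self.n_individuals += 1
        if len(set(genotype)) > 1:
            self.n_heterozygotes += 1
        for allele in genotype:
            self.allele_counts[allele] = self.allele_counts.get(allele, 0) + 1


marker = MarkerCounts()
for genotype in [("a", "a"), ("a", "c"), ("a", "c")]:
    marker.add_genotype(genotype)
print marker.allele_counts      # {'a': 4, 'c': 2}
print marker.n_heterozygotes    # 2

As noted above, the Weir and Cockerham estimator needs at least per-population allele counts like these, so this is roughly the minimum worth keeping per marker.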
-- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Mon Oct 20 14:04:02 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Oct 2008 15:04:02 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> Message-ID: <320fb6e00810200704g578ec3aak6b9df1a5a90a2fc7@mail.gmail.com> On Mon, Oct 20, 2008 at 2:57 PM, Giovanni Marco Dall'Olio wrote: > Just a curiosity: why do you use the '<>' operator instead of '!='? > Is it better supported in python 3.0? Python 2.x supports both <> and != for not equal, and people use both depending on their personal preference (or exposure to other languages). Most Biopython code used to use <> which I personally do by habit. Python 3.x supports only != so I have recently gone through Biopython in CVS switching all the <> to != instead. I would recommend you use != in all new python code. Peter From biopython at maubp.freeserve.co.uk Mon Oct 20 14:23:23 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Oct 2008 15:23:23 +0100 Subject: [BioPython] Bio.AlignIO feedback - seq_count Message-ID: <320fb6e00810200723p2fcbe12ey125dd1fd67d195a7@mail.gmail.com> Dear Biopythoneers, I'm hoping some of you on the mailing list have actually used Bio.AlignIO, and I'd like to ask for some feedback. In particular, when loading in sequence files, did you ever use the optional seq_count argument to declare how many sequences you expected in each alignment? The rational of this optional argument is discussed in the Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:AlignIO-count-argument I'm curious if anyone actually found this useful in real life. Thanks Peter From bsouthey at gmail.com Tue Oct 21 14:13:15 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 21 Oct 2008 09:13:15 -0500 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: References: Message-ID: <48FDE37B.5040301@gmail.com> Leighton Pritchard wrote: > On 17/10/2008 19:46, "Bruce Southey" wrote: > > >> Leighton Pritchard wrote: >> >>> This is the key problem. Forward translation is - for a given codon table - >>> a one-one mapping. Reverse translation is (for many amino acids) one-many. >>> If the goal is to produce the coding sequence that actually encoded a >>> particular protein sequence, the problem is combinatorial and rapidly >>> becomes messy with increasing sequence length. >>> >>> >> If you use a regular expression or a tree structure then there is a >> one-one mapping but then that would probably best as a subclass of Seq. >> > > I don't see this, I'm afraid. > > Each codon -> one amino acid : one-one mapping > Arg -> set of 6 possible codons : one-many mapping > If you believed this then your answer below is incorrect. 
The genetic code allow for 1 amino acid to map to a three nucleotides but not any three nor any more or any less than three. So to be clear there is a one to one mapping between a codon and amino acid as well amino acid and a codon. Therefore it is impossible for Arg to map to six possible codons as only one is correct. Under the standard genetic code, each amino acid can be represented in an regular expression either as the bases or ambiguous nucleotide codes: Ala/A =(GCT|GCC|GCA|GCG) = GCN Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) Lys/K =(AAA|AAG) = AAR Asn/N =(AAT|AAC) =AAY Met/M =ATG =ATG Asp/D =(GAT|GAC) =GAY Phe/F =(TTT|TTC) =TTY Cys/C =(TGT|TGC) =TGY Pro/P =(CCT|CCC|CCA|CCG) =CCN Gln/Q =(CAA|CAG) =CAR Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) Glu/E =(GAA|GAG) = GAR Thr/T =(ACT|ACC|ACA|ACG) =ACN Gly/G =(GGT|GGC|GGA|GGG) =GGN Trp/W =TGG =TGG His/H =(CAT|CAC) = CAY Tyr/Y =(TAT|TAC) = TAY Ile/I =(ATT|ATC|ATA) =ATH Val/V =(GTT|GTC|GTA|GTG) =GTN This is still a one to one mapping between an amino acid and regular expression relationship of the triplet that encodes it. Unfortunately the ambiguous nucleotide codes can not be used directly in a regular expression search. > It doesn't matter how it's represented in code, the problem of a one-many > mapping still exists for amino acid -> codon translation in most cases. > > The combinatorial nature of the overall problem can be illustrated by > considering the unlikely case of a protein that comprises 100 arginines. > The number of potential coding sequences is 6**100 = 6.5e77. That you *can* > choose any one of these to be your potential coding sequence doesn't negate > the fact that there are still (6.5e77)-1 other possibilities... It doesn't > get much better if you use the the average number of codons per amino acid: > 61/20 ~= 3. A 100aa protein would typically have 3**100 ~= 5e47 potential > coding sequences. I wouldn't want to guess which one was correct, and I > can't see a back_translate method in this instance doing more than producing > a nucleotide sequence that is potentially capable of producing the passed > protein sequence, but for which no claims can be made about biological > plausibility. > You are not representing the one to six mapping you indicated above as sequence is composed of 300 nucleotides not 1800 as must occur with a one to 6 codon mapping. Rather you have provided the number of combinations of the six codons that can give you 100 Args based on a one to one mapping of one codon to one Arg. If you use ambiguous nucleotide codes, you can reduce it down to 1.267651e+30 potential coding sequences for 100 amino acids as a worst case scenario. It is not my position to argue what a user wants or how stupid I think that the request is. The user would quickly learn. > Now, a back_translate() that takes a protein sequence alignment and, when > passed the coding sequences for each component sequence, returns the > corresponding alignment of the nucleotide sequences, makes sense to me. But > that's a discussion for Bio.Alignment objects... > > >> I would suggest tools like Wise2 and exonerate >> (http://www.ebi.ac.uk/~guy/exonerate/) are the solution to solving gene >> structure problems than using a Seq object. >> > > I wouldn't suggest using a Seq object for this purpose, either... ;) > > >>> I agree - I can't think of an occasion where I might want to back-translate >>> a protein in this way that wouldn't better be handled by other means. 
Not >>> that I'm the fount of all use-cases but, given the number of ways in which >>> one *could* back-translate, perhaps it would be better not to pick/guess at >>> any single one. >>> >>> >> Apart from the academic aspect, my main use is searching for protein >> motifs/domains, enzyme cleavage sites, finding very short combinations >> of amino acids and binding sites (I do not do this but it is the same) >> in DNA sequences especially genomic sequence. These are usually very >> small and, thus, unsuitable for most tools. >> > > I do much the same, and haven't found a pressing use for back-translation, > yet - YMMV. > > >> One of my uses is with >> peptide identification and de novo sequencing using mass spectrometry >> when you don't know the actual protein or gene sequence. It also has the >> problem that certain amino acids have very similar mass so you would >> need to Regardless of whether you use a regular expression query or not >> you still need a back translation of the protein query and probably the >> reverse complement. >> > > Perhaps I'm being dense, but I don't see why that is. Can you give an > example? > Isoleucine and Leucine are the worst case (there are a couple of others that are close) because these have the same mass so you have to search for: (TTA|TTG|CTT|CTC|CTA|CTG|ATT|ATC|ATA) If you are searching say for an RFamide, you know that you need at least RFG, which means you need to do a query using regular expression on the plus strand using: (CGT|CGC|CGA|CGG|AGA|AGG)(TTT|TTC)(GGT|GGC|GGA|GGG) You then try to extend the match to more amino acids until you reach the desired mass (hopefully avoiding any introns) or sufficiently that you can use some other tool to help. > >> Another case where it would be useful is that tools like TBLASTN gives >> protein alignments so you must open the DNA sequence and find the DNA >> region based on the protein alignment. >> > > You could use TBLASTN output - which provides start and stop coordinates for > the match on the subject sequence - to extract this directly, without the > need for backtranslation. Example output where subject coordinates give the > match location below: > > >> ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete >> > genome > Length = 5064019 > > Score = 731 bits (1887), Expect = 0.0 > Identities = 363/376 (96%), Positives = 363/376 (96%) > Frame = +3 > > Query: 1 MFHXXXXXXXXXXXXXTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY > 60 > MFH TISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY > Sbjct: 477432 MFHLPKLKQKPLALLLTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY > 477611 > > [...] > > L. > > Exactly my point, where is the DNA sequence? Only if you have direct access to the DNA sequence can you get it. Furthermore, the DNA sequence must be exactly the same because any change in the coordinates screws it up. Bruce From biopython at maubp.freeserve.co.uk Tue Oct 21 14:26:49 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Oct 2008 15:26:49 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48FDE37B.5040301@gmail.com> References: <48FDE37B.5040301@gmail.com> Message-ID: <320fb6e00810210726g466292e4h3e8fe053d9107f48@mail.gmail.com> Bruce wrote: > Leighton wrote: >> Each codon -> one amino acid : one-one mapping >> Arg -> set of 6 possible codons : one-many mapping I agree with Leighton. > If you believed this then your answer below is incorrect. 
No, I think you are just not using the terms one-to-one and one-to-many as a mathematician would. > The genetic code > allow for 1 amino acid to map to a three nucleotides but not any three nor > any more or any less than three. So to be clear there is a one to one > mapping between a codon and amino acid as well amino acid and a codon. > Therefore it is impossible for Arg to map to six possible codons as only one > is correct. Under the standard genetic code, each amino acid can be > represented in an regular expression either as the bases or ambiguous > nucleotide codes: > Ala/A =(GCT|GCC|GCA|GCG) = GCN That is a one to four mapping using unambiguous nucleotides, or a one to one mapping using ambiguous nucleotides. This is a nice case. > Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) That is a one to six mapping using unambiguous nucleotides, or a one to two mapping using ambiguous nucleotides. This is a problem case. > This is still a one to one mapping between an amino acid and regular > expression relationship of the triplet that encodes it. Unfortunately the > ambiguous nucleotide codes can not be used directly in a regular expression > search. The problem is that (TTN|CTR) or similar don't work in Seq objects - would need a more advanced representation (perhaps based on regular expressions). Peter From biopython at maubp.freeserve.co.uk Tue Oct 21 14:45:57 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Oct 2008 15:45:57 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48FDE37B.5040301@gmail.com> References: <48FDE37B.5040301@gmail.com> Message-ID: <320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com> Bruce wrote: >>> Another case where it would be useful is that tools like TBLASTN gives >>> protein alignments so you must open the DNA sequence and find the DNA >>> region based on the protein alignment. Leighton: >> You could use TBLASTN output - which provides start and stop coordinates >> for the match on the subject sequence - to extract this directly, without the >> need for backtranslation. Example output where subject coordinates give >> the match location below: >> >>> >>> ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete >>> >> >> genome >> Length = 5064019 >> >> Score = 731 bits (1887), Expect = 0.0 >> Identities = 363/376 (96%), Positives = 363/376 (96%) >> Frame = +3 >> >> Query: 1 MFHXXXXXXXXXXXXXTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY >> 60 >> MFH TISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY >> Sbjct: 477432 MFHLPKLKQKPLALLLTISVGMMAPFTFAEAKTPGTLVEKAPLDSKNGLMEAGEQYRIQY >> 477611 >> >> [...] Bruce's reply: > Exactly my point, where is the DNA sequence? Only if you have direct access > to the DNA sequence can you get it. Furthermore, the DNA sequence must be > exactly the same because any change in the coordinates screws it up. You should have the original query from when you ran the BLAST search, so using the co-ordinates given in the BLAST hit you can recover the original nucleotide query which gives this match. There is no reason to do a back-translation to try and find the original query, which would be especially difficult in this example due to the XXXXXX region (representing a region of low complexity which was ignored by BLAST). Even if you tried you could find more than one match and without checking the the coordinates BLAST gives it would not be clear which gave this BLAST match. 
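To illustrate the point about re-using the coordinates BLAST reports, here is a rough sketch that pulls the subject start/end (and frame) out of TBLASTN results with Bio.Blast.NCBIXML. The file name is invented, and it assumes the search was saved as XML (e.g. with blastall -m 7):

from Bio.Blast import NCBIXML

handle = open("tblastn_results.xml")      # hypothetical output file
for record in NCBIXML.parse(handle):
    for alignment in record.alignments:
        for hsp in alignment.hsps:
            # The subject coordinates locate the matching nucleotide
            # region, so no back-translation is needed to recover it
            print alignment.title
            print "subject %i..%i, frame %s, E-value %g" \
                  % (hsp.sbjct_start, hsp.sbjct_end, hsp.frame, hsp.expect)
handle.close()

Those coordinates can then be handed to fastacmd, or used to slice the subject record if you already have its FASTA file.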
Peter From lpritc at scri.ac.uk Tue Oct 21 15:29:35 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 21 Oct 2008 16:29:35 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48FDE37B.5040301@gmail.com> Message-ID: Hi Bruce, On 21/10/2008 15:13, "Bruce Southey" wrote: > Leighton Pritchard wrote: >> I don't see this, I'm afraid. >> >> Each codon -> one amino acid : one-one mapping >> Arg -> set of 6 possible codons : one-many mapping >> > If you believed this then your answer below is incorrect. The genetic > code allow for 1 amino acid to map to a three nucleotides but not any > three nor any more or any less than three. I'm fine with this bit. Each such set of three nucleotides is called a 'codon'. Six such codons are able to code for an arginine, as you note: > Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) This is a one -> six mapping. That is, one input (arginine), is capable of being back-translated into any of six possible outputs (CGT, CGC, CGA, CGG, AGA, or AGG). but you contradict this with the comment: > So to be clear there is a one > to one mapping between a codon and amino acid as well amino acid and a > codon. Therefore it is impossible for Arg to map to six possible codons I think that you're confusing the biological fact (only one codon actually encoded this amino acid) with the back-translation problem (in the absence of any other information, any one of six codons is equally likely to have encoded this amino acid). --- > This is still a one to one mapping between an amino acid and regular > expression relationship of the triplet that encodes it. Which is not the claim that I was making. There are any number of ways of forcing a one-one mapping of this sort. You could arguably represent it as a one-to-one mapping of 'arginine -> "the backtranslation of arginine"', but that would not be informative in reconstructing the actual coding sequence (if that was what you wanted - which is the point of the discussion: what is the point of a back_translate() method?). The regular expression mapping is not useful for this, either. > You are not representing the one to six mapping you indicated above as > sequence is composed of 300 nucleotides not 1800 as must occur with a > one to 6 codon mapping [...] I think you've misunderstood what's going on here. Imagine a reduced system, where there is only one amino acid - let's call it A - and there are two possible codons that can produce this amino acid - XXX and YYY (thanks, Coldplay). Now, if we have a 'sequence' of only one amino acid: 'A', that might have been encoded by the sequence 'XXX', or the sequence 'YYY'. The sequence that coded for 'A' is one of 'XXX' or 'YYY', and we don't know which; there are two possibilities, therefore this is a 1->2 mapping. 2=2**1. Note that the nucleotide sequence is 3*1=3 long. But if our sequence has two amino acids: 'AA', this could have been the result of 'XXXXXX', 'XXXYYY', 'YYYXXX', or 'YYYYYY'. The coding sequence is one of four equally likely possibilities, and this is a 1->4 mapping (one sequence, four possible outcomes). 4=2**2, and the nucleotide sequence is 3*2 long. If we build longer sequences, we find that the number of potential outcomes is 2**n, where n is the number of 'A's in the input sequence, and the mapping is 1->2**n. The nucleotide sequence is 3*n long. If we make this more general, where there are m codons for this amino acid, the number of potential outcomes is m**n, and the mapping is 1->m**n. 
The nucleotide sequence is, again, 3*n long. In my previous example for arginine, m=6, n=100, the mapping is 1->6, and the sequence is 300nt long, *not* 1800 nt long. There are still 6e77 ways of encoding a sequence of 100 arginines. A back_translate() method that pretends to find the 'correct' coding sequence in the absence of other information, rather than 'a' coding sequence, is not making a plausible claim. > It is not my position to argue what a user wants or how stupid I think > that the request is. The user would quickly learn. While it is entirely possible to implement a function called back_translate() that does something a user doesn't want or need, I'm not sure that it's the approach that should be taken, here. It is your position to argue what you want or need out of a back_translate() method, and why, so that other people can see your point of view, and maybe be swayed by it. I don't see a use for such a method, even to produce a regular expression for searching nucleotide sequences, because TBLASTN is so much more efficient. > Isoleucine and Leucine are the worst case (there are a couple of others > that are close) because these have the same mass so you have to search for: > (TTA|TTG|CTT|CTC|CTA|CTG|ATT|ATC|ATA) > > If you are searching say for an RFamide, you know that you need at least > RFG, which means you need to do a query using regular expression on the > plus strand using: > (CGT|CGC|CGA|CGG|AGA|AGG)(TTT|TTC)(GGT|GGC|GGA|GGG) > > You then try to extend the match to more amino acids until you reach the > desired mass (hopefully avoiding any introns) or sufficiently that you > can use some other tool to help. I think that, in your position, I'd compare timings with a six-frame, three-frame or forward translation of (depending on the nature of the nucleotide sequence) the nucleotide sequence you're searching against, and then use a regular expression or string search with the protein sequence as the query. That's likely to be significantly faster than a regex search with that many groups, with the effects more noticeable at larger query sequence lengths; particularly so if you cache or save the translated sequences for future searches. >>> Another case where it would be useful is that tools like TBLASTN gives >>> protein alignments so you must open the DNA sequence and find the DNA >>> region based on the protein alignment. >> You could use TBLASTN output - which provides start and stop coordinates for >> the match on the subject sequence - to extract this directly, without the >> need for backtranslation. > Exactly my point, where is the DNA sequence? It's in the database against which you queried; TBLASTN queries against nucleotide databases. Wait, that's not quite right - TBLASTN translates nucleotide databases into protein databases and queries against them with the protein sequence, partly because of the one-many mapping of back-translation. If the database is local, you can use fastacmd (part of BLAST) to dump the entire database, to retrieve the single matching sequence from the database, or even to extract only the region of the sequence that is the match. Try fastacmd --help at the command-line. If your database is not local, you can (probably) obtain the sequence by querying GenBank with the accession number. If you can't do that, or ask the people who compiled the database you're querying against, or if they won't let you have the sequence, then you're stuck with guessing the coding sequence. 
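As a concrete (if hypothetical) illustration of the fastacmd route described here: with the old NCBI C-toolkit tools on the PATH and a database formatted with formatdb -o T (so lookups by identifier work), something along these lines should pull out just the matching region. The database name, accession and coordinates are invented, and the -L range option may vary between fastacmd versions, so treat this as a sketch:

import subprocess

cmd = ["fastacmd", "-d", "my_nt_db",        # local BLAST database (made up)
       "-s", "NC_004547.2",                 # subject identifier from the report
       "-L", "477432,477611"]               # subject coordinates from the HSP
child = subprocess.Popen(cmd, stdout=subprocess.PIPE)
fasta_text = child.communicate()[0]
print fasta_text

If the database is not local, fetching the record from GenBank by accession (for example with Bio.Entrez.efetch) is one alternative.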
> Only if you have direct access to the DNA sequence can you get it. That's not true; fastacmd can extract FASTA-formatted sequences from any (version number compatibilities notwithstanding) correctly-formatted BLAST database. > Furthermore, the DNA sequence > must be exactly the same because any change in the coordinates screws it > up. I don't see how that is a great concern. The coordinates of the match would come from the same database you were searching, so should match. If your database is up-to-date, and you have to go to GenBank, then you should have the most recent revision of the sequence in there, anyway. Even if both of the above options fail, and you can acquire the new sequence by some accession identifier, you can build a new local database from that sequence alone, and find where the match is. Or translate and search directly in Python. If you truly have no access to the DNA sequence (e.g. if it's proprietary information, you can't access the BLAST database, and no-one will send you the sequence) then, and only then, are you stuck with guessing the coding sequence in *very* large parameter space. Best, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From biopython at maubp.freeserve.co.uk Tue Oct 21 15:59:00 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Oct 2008 16:59:00 +0100 Subject: [BioPython] back-translation method for Seq object? In-Reply-To: <320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com> References: <48FDE37B.5040301@gmail.com> <320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com> Message-ID: <320fb6e00810210859n1b922e7emd6a7456abd79cdc7@mail.gmail.com> On Tue, Oct 21, 2008 at 3:45 PM, Peter wrote: > Bruce wrote: >>>> Another case where it would be useful is that tools like TBLASTN gives >>>> protein alignments so you must open the DNA sequence and find the DNA >>>> region based on the protein alignment. 
> > Leighton: >>> You could use TBLASTN output - which provides start and stop coordinates >>> for the match on the subject sequence - to extract this directly, without the >>> need for backtranslation. Example output where subject coordinates give >>> the match location below: >>> ... > > Bruce's reply: >> Exactly my point, where is the DNA sequence? Only if you have direct access >> to the DNA sequence can you get it. Furthermore, the DNA sequence must be >> exactly the same because any change in the coordinates screws it up. > > You should have the original query from when you ran the BLAST > search, so using the co-ordinates given in the BLAST hit you can > recover the original nucleotide query which gives this match. Sorry - I was thinking of the wrong variant of BLAST. As Leighton pointed out, you would have to use fastacmd to extract the nucleotide sequence of the match from the blast database (assuming you were running stand alone blastall) or fetch it via its accession (if you were running BLAST via the NCBI). Peter From biopython at maubp.freeserve.co.uk Tue Oct 21 16:07:46 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Oct 2008 17:07:46 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: References: <48FDE37B.5040301@gmail.com> Message-ID: <320fb6e00810210907s328c8f23ra007e14f1f32f5c9@mail.gmail.com> Hi everyone, I think we all agree that if we want a back-translation method/function to return a simple string or Seq object (given no additional information about the codon use), this cannot fully capture all the possible codons. If we want to provide a simple string or Seq object, we can either pick an arbitrary codon in each case (as in the first attachment on Bug 2618), or perhaps represent some of the possible codons using ambiguous nucleotides. e.g. back_translate("MR") = "ATGCGT" #arbitrary codon for R unambiguous nucleotides or, back_translate("MR") = "ATGCGN" #arbitrary codon for R using ambiguous nucleotides Note in either example, the following nice property holds: translate(back_translate("MR")) == "MR" Even if improved by typical codon usage figures to give a more biologically likely answer, neither of these simple approaches covers the full set of six possible codons for Arg in the standard codon table. It was something like this that I envisioned as a candidate for a Seq method (based on the behaviour of the existing Bio.Translate functionality), but only if such a simple back_translate method/function had any real uses. And thus far, I haven't seen any. A back translation method/function which dealt with all the possible codon choices would have to use a more advanced representation (possibly as Bruce suggested using regular expressions or some sort of tree structure - ideally as a sub-class of the Seq object). There is also the option of returning multiple simple strings or Seq objects (either as a list or preferable a generator) giving all possible back translations, but I don't think this would be useful, except perhaps on small examples, due to the potentially vast number of return values. Peter From bsouthey at gmail.com Tue Oct 21 19:46:58 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 21 Oct 2008 14:46:58 -0500 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? 
In-Reply-To: References: Message-ID: <48FE31B2.8030509@gmail.com> Leighton Pritchard wrote: > Hi Bruce, > > On 21/10/2008 15:13, "Bruce Southey" wrote: > >> Leighton Pritchard wrote: >> >>> I don't see this, I'm afraid. >>> >>> Each codon -> one amino acid : one-one mapping >>> Arg -> set of 6 possible codons : one-many mapping >>> >>> >> If you believed this then your answer below is incorrect. The genetic >> code allow for 1 amino acid to map to a three nucleotides but not any >> three nor any more or any less than three. >> > > I'm fine with this bit. Each such set of three nucleotides is called a > 'codon'. Six such codons are able to code for an arginine, as you note: > > >> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) >> > > This is a one -> six mapping. That is, one input (arginine), is capable of > being back-translated into any of six possible outputs (CGT, CGC, CGA, CGG, > AGA, or AGG). > > but you contradict this with the comment: > > >> So to be clear there is a one >> to one mapping between a codon and amino acid as well amino acid and a >> codon. Therefore it is impossible for Arg to map to six possible codons >> > > I think that you're confusing the biological fact (only one codon actually > encoded this amino acid) with the back-translation problem (in the absence > of any other information, any one of six codons is equally likely to have > encoded this amino acid). > > --- > > >> This is still a one to one mapping between an amino acid and regular >> expression relationship of the triplet that encodes it. >> > > Which is not the claim that I was making. There are any number of ways of > forcing a one-one mapping of this sort. You could arguably represent it as > a one-to-one mapping of 'arginine -> "the backtranslation of arginine"', but > that would not be informative in reconstructing the actual coding sequence > (if that was what you wanted - which is the point of the discussion: what is > the point of a back_translate() method?). The regular expression mapping is > not useful for this, either. > > >> You are not representing the one to six mapping you indicated above as >> sequence is composed of 300 nucleotides not 1800 as must occur with a >> one to 6 codon mapping [...] >> > > I think you've misunderstood what's going on here. > > Imagine a reduced system, where there is only one amino acid - let's call it > A - and there are two possible codons that can produce this amino acid - XXX > and YYY (thanks, Coldplay). > > Now, if we have a 'sequence' of only one amino acid: 'A', that might have > been encoded by the sequence 'XXX', or the sequence 'YYY'. The sequence > that coded for 'A' is one of 'XXX' or 'YYY', and we don't know which; there > are two possibilities, therefore this is a 1->2 mapping. 2=2**1. Note that > the nucleotide sequence is 3*1=3 long. > > But if our sequence has two amino acids: 'AA', this could have been the > result of 'XXXXXX', 'XXXYYY', 'YYYXXX', or 'YYYYYY'. The coding sequence is > one of four equally likely possibilities, and this is a 1->4 mapping (one > sequence, four possible outcomes). 4=2**2, and the nucleotide sequence is > 3*2 long. > > If we build longer sequences, we find that the number of potential outcomes > is 2**n, where n is the number of 'A's in the input sequence, and the > mapping is 1->2**n. The nucleotide sequence is 3*n long. > > If we make this more general, where there are m codons for this amino acid, > the number of potential outcomes is m**n, and the mapping is 1->m**n. The > nucleotide sequence is, again, 3*n long. 
> > In my previous example for arginine, m=6, n=100, the mapping is 1->6, and > the sequence is 300nt long, *not* 1800 nt long. There are still 6e77 ways > of encoding a sequence of 100 arginines. A back_translate() method that > pretends to find the 'correct' coding sequence in the absence of other > information, rather than 'a' coding sequence, is not making a plausible > claim. > > Thank you for agreeing with me! I am glad that you realized that the genetic code prevents a true one to many relationship. In say relational databases where you can have one table for the journal issue and one table for the papers in it, you can get multiple papers in a single issue. Likewise, if we ignore the genetic code, there is one amino acid and one or more codons. However, the genetic code means that you only can select one of all the codons possible resulting in multiple combinations of one to one relationships. >> It is not my position to argue what a user wants or how stupid I think >> that the request is. The user would quickly learn. >> > > While it is entirely possible to implement a function called > back_translate() that does something a user doesn't want or need, I'm not > sure that it's the approach that should be taken, here. > > It is your position to argue what you want or need out of a back_translate() > method, and why, so that other people can see your point of view, and maybe > be swayed by it. I don't see a use for such a method, even to produce a > regular expression for searching nucleotide sequences, because TBLASTN is so > much more efficient. > This very much depends on how you want to use it. TBLASTN is not very good for very short sequences and can not handle protein domains/motifs such as those in Prosite. > >> Isoleucine and Leucine are the worst case (there are a couple of others >> that are close) because these have the same mass so you have to search for: >> (TTA|TTG|CTT|CTC|CTA|CTG|ATT|ATC|ATA) >> >> If you are searching say for an RFamide, you know that you need at least >> RFG, which means you need to do a query using regular expression on the >> plus strand using: >> (CGT|CGC|CGA|CGG|AGA|AGG)(TTT|TTC)(GGT|GGC|GGA|GGG) >> >> You then try to extend the match to more amino acids until you reach the >> desired mass (hopefully avoiding any introns) or sufficiently that you >> can use some other tool to help. >> > > I think that, in your position, I'd compare timings with a six-frame, > three-frame or forward translation of (depending on the nature of the > nucleotide sequence) the nucleotide sequence you're searching against, and > then use a regular expression or string search with the protein sequence as > the query. That's likely to be significantly faster than a regex search > with that many groups, with the effects more noticeable at larger query > sequence lengths; particularly so if you cache or save the translated > sequences for future searches. > Thanks for the comments as I did not think about reusing the translation. > > >>>> Another case where it would be useful is that tools like TBLASTN gives >>>> protein alignments so you must open the DNA sequence and find the DNA >>>> region based on the protein alignment. >>>> >>> You could use TBLASTN output - which provides start and stop coordinates for >>> the match on the subject sequence - to extract this directly, without the >>> need for backtranslation. >>> > > >> Exactly my point, where is the DNA sequence? >> > > It's in the database against which you queried; TBLASTN queries against > nucleotide databases. 
Wait, that's not quite right - No, it is not even correct! :-) > TBLASTN translates > nucleotide databases into protein databases and queries against them with > the protein sequence, partly because of the one-many mapping of > back-translation. > Not exactly as stop codons are not in protein databases except where they code for an amino acid. > If the database is local, you can use fastacmd (part of BLAST) to dump the > entire database, to retrieve the single matching sequence from the database, > or even to extract only the region of the sequence that is the match. Try > fastacmd --help at the command-line. > > If your database is not local, you can (probably) obtain the sequence by > querying GenBank with the accession number. If you can't do that, or ask > the people who compiled the database you're querying against, or if they > won't let you have the sequence, then you're stuck with guessing the coding > sequence. > > >> Only if you have direct access to the DNA sequence can you get it. >> > > That's not true; fastacmd can extract FASTA-formatted sequences from any > (version number compatibilities notwithstanding) correctly-formatted BLAST > database. > > Obviously because you still have direct access to the DNA sequence. >> Furthermore, the DNA sequence >> must be exactly the same because any change in the coordinates screws it >> up. >> > > I don't see how that is a great concern. The coordinates of the match would > come from the same database you were searching, so should match. If your > database is up-to-date, and you have to go to GenBank, then you should have > the most recent revision of the sequence in there, anyway. > > Even if both of the above options fail, and you can acquire the new sequence > by some accession identifier, you can build a new local database from that > sequence alone, and find where the match is. Or translate and search > directly in Python. > These were some of the things that one was trying to avoid, especially repeating it all over again and hoping like crazy that it is still present. (Genome assemblies are not very forgiving.) > If you truly have no access to the DNA sequence (e.g. if it's proprietary > information, you can't access the BLAST database, and no-one will send you > the sequence) then, and only then, are you stuck with guessing the coding > sequence in *very* large parameter space. > > Best, > > L. > > Bruce From bsouthey at gmail.com Tue Oct 21 20:36:31 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 21 Oct 2008 15:36:31 -0500 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <320fb6e00810210907s328c8f23ra007e14f1f32f5c9@mail.gmail.com> References: <48FDE37B.5040301@gmail.com> <320fb6e00810210907s328c8f23ra007e14f1f32f5c9@mail.gmail.com> Message-ID: <48FE3D4F.6060005@gmail.com> Peter wrote: > Hi everyone, > > I think we all agree that if we want a back-translation > method/function to return a simple string or Seq object (given no > additional information about the codon use), this cannot fully capture > all the possible codons. > For completeness as these are not 100% correct, Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN Ser is really so bad that one would suggest providing a strong warning and just use NTN, NGN, and NNN for Leu, Arg and Ser, respectively. 
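A quick way to sanity-check ambiguous codons like these is to expand them with the IUPAC tables that ship with Biopython and see what each unambiguous codon actually encodes. A small sketch, where the expand() helper is made up for this example:

from Bio.Data import CodonTable, IUPACData

standard_table = CodonTable.unambiguous_dna_by_name["Standard"]

def expand(ambiguous_codon):
    # All unambiguous codons covered by a three-letter ambiguous codon
    first, second, third = [IUPACData.ambiguous_dna_values[base]
                            for base in ambiguous_codon]
    return [a + b + c for a in first for b in second for c in third]

for codon in expand("YTN"):
    # Stop codons are absent from forward_table, hence the .get() default
    print codon, standard_table.forward_table.get(codon, "Stop")

Running this for YTN lists six leucine codons plus TTT and TTC (phenylalanine), which is exactly the over-coverage problem Leighton works through in his reply.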
> If we want to provide a simple string or Seq object, we can either > pick an arbitrary codon in each case (as in the first attachment on > Bug 2618), or perhaps represent some of the possible codons using > ambiguous nucleotides. > > e.g. > back_translate("MR") = "ATGCGT" #arbitrary codon for R unambiguous nucleotides > > or, > back_translate("MR") = "ATGCGN" #arbitrary codon for R using ambiguous > nucleotides > > Note in either example, the following nice property holds: > translate(back_translate("MR")) == "MR" > > Even if improved by typical codon usage figures to give a more > biologically likely answer, neither of these simple approaches covers > the full set of six possible codons for Arg in the standard codon > table. > > It was something like this that I envisioned as a candidate for a Seq > method (based on the behaviour of the existing Bio.Translate > functionality), but only if such a simple back_translate > method/function had any real uses. And thus far, I haven't seen any. > For you perhaps but my reasons are very real to me! > A back translation method/function which dealt with all the possible > codon choices would have to use a more advanced representation > (possibly as Bruce suggested using regular expressions or some sort of > tree structure - ideally as a sub-class of the Seq object). There is > also the option of returning multiple simple strings or Seq objects > (either as a list or preferable a generator) giving all possible back > translations, but I don't think this would be useful, except perhaps > on small examples, due to the potentially vast number of return > values. > > Peter > > In any situation, we are left with a ambiguous codons, a regular expression or some combination of sequence type (e.g., strings or Seq objects). None of these options are fully compatible with the Seq object. So I do agree that back-translation can not be part of the Seq object. Also I agree that while first two could be return types for a Seq object method, the usage is probably too infrequent and too specialized for inclusion especially to handle codon usage frequencies. Bruce From lpritc at scri.ac.uk Wed Oct 22 08:31:12 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 22 Oct 2008 09:31:12 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48FE3D4F.6060005@gmail.com> Message-ID: On 21/10/2008 21:36, "Bruce Southey" wrote: > For completeness as these are not 100% correct, > Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN > Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV > Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN There are some difficulties with this encoding (IUPAC codes are at http://www.chick.manchester.ac.uk/SiteSeer/IUPAC_codes.html) YTN -> [CT]T[ACGT] -> {CTA, CTC, CTG, CTT, TTA, TTC, TTG, TTT}, two of which do not encode leucine. MGV -> [AC]G[ACG] -> {AGA, AGC, AGG, CGA, CGC, CGG}, of which AGC does not encode arginine, and the resulting set does not include CGT, which does encode arginine WSN -> [AT][CG][ACGT] -> {ACA, ACC, ACG, ACT, AGA, AGC, AGG, AGT, TCA, TCC, TCG, TCT, TGA, TGC, TGG, TGT}, of which 10 codons do not encode serine. This would cause problems if we wanted to translate our back-translation back to the original protein sequence (however we might want to do this). > Ser is really so bad that one would suggest providing a strong warning > and just use NTN, NGN, and NNN for Leu, Arg and Ser, respectively. 
We could just backtranslate all amino acids to NNN and avoid the problem entirely ;) >> If we want to provide a simple string or Seq object, we can either >> pick an arbitrary codon in each case (as in the first attachment on >> Bug 2618), or perhaps represent some of the possible codons using >> ambiguous nucleotides. >> >> e.g. >> back_translate("MR") = "ATGCGT" #arbitrary codon for R unambiguous >> nucleotides >> >> or, >> back_translate("MR") = "ATGCGN" #arbitrary codon for R using ambiguous >> nucleotides >> >> Note in either example, the following nice property holds: >> translate(back_translate("MR")) == "MR" This would be an important consideration for a back_translate() method: should translate() and back_translate() be inverse functions of each other? I would say that this is a desirable property, or else a nested translate(back_translate(translate(...(seq)...))) is likely to end up as a string or sequence of ambiguity codons, which is not very useful. If that can't be done, then the opportunity to do so is probably best avoided... To ensure that translate() and back_translate() are inverse functions, the backtranslation of a particular amino acid should either return a single unambiguous codon, or an ambiguous codon that cannot be translated to an alternative amino acid (assuming a consistent codon table throughout). If we were not to choose arbitrarily an unambiguous codon, or subset of all possible codons, then a representation of the ambiguity is required that is not present in the Seq object, yet (e.g. For Ser, Leu or Arg as described above). A modification of translate() to spot, and accept such ambiguity would be necessary. This looks like harder work than it's worth. >> It was something like this that I envisioned as a candidate for a Seq >> method (based on the behaviour of the existing Bio.Translate >> functionality), but only if such a simple back_translate >> method/function had any real uses. And thus far, I haven't seen any. >> > For you perhaps but my reasons are very real to me! I agree with Peter on this. I don't see a single compelling use case for back_translate() in a Seq object. I can sort of see a potential use where, if you have a protein and want to design a primer to the coding sequence (which is not known - otherwise there are better ways to do this), then you might want to generate a sequence of IUPAC ambiguity codes to guide primer design. This might involve obtaining a sequence only of the *certain* bases, e.g. Phe -> TTN; Ser -> NNN; Gly -> GGN; Asp -> GAN, so that FGD -> TTNNNNGGN, and there are four of nine bases around which primers might be designed. However, I'm *really* stretching to come up with this example. I've outlined my views on some of the possible ways back_translate() might work below: Translate protein to its original coding sequence: =================================================== Problem: this may be just guesswork in (very) large sequence space Potential solution: guesswork may be guided by codon usage tables or user preference for codons, but the biological utility/significance of the result, which is still guessed at, is highly questionable. Alternatives: If the originating organism's sequence is known, then TBLASTN is fast, works well, and avoids the problem. Alternatively, forward translation followed by a search for the protein sequence is quicker and less messy. 
Translate protein to a single possible coding sequence (not necessarily original): ============================================================================ Problem: Same one each time, or choose randomly? What is the point, anyway? See above for solutions/alternatives Translate protein to ambiguous representation (inverse translate and/or return Seq): ============================================================================ Problem: changes required to the way sequences are represented in Seq objects; this is a significant change at the heart of Biopython with many inevitable side-effects. Not clear how this would work, yet. Potential solution: major coding upheaval and rewriting of Biopython Alternatives: ignore the requirement that backtranslation is the inverse of translation; do not return a Seq object, but instead store the backtranslation as an attribute, or just return a string for the user to do what they want with Translate protein to ambiguous representation (not inverse of translate, do not return Seq): ============================================================================ Problem: what's the point? agreeing which ambiguous representation to use: regex, IUPAC, something else; IUPAC ambiguities aren't a convenient representation for Ser, Leu, Arg; Potential solution: just use a regex; allow a choice; make an executive decision; ignore it and hope it goes away I think that the last behaviour here is the only one that is feasible, but I still don't see much point in implementing it. At least turning a protein sequence into a regex of possible codons would be quick to code... >> There is >> also the option of returning multiple simple strings or Seq objects >> (either as a list or preferable a generator) giving all possible back >> translations, Eek! (for the reasons you mention) L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). 
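Since turning a protein sequence into a regex of possible codons "would be quick to code", here is a rough sketch of that idea, together with the m**n counting argument, built on the standard codon table. The names protein_to_regex() and count_back_translations() are invented for this example and are not part of Bio.SeqUtils:

from Bio.Data import CodonTable

table = CodonTable.unambiguous_dna_by_name["Standard"]
codons_for = {}
for codon, aa in table.forward_table.items():
    codons_for.setdefault(aa, []).append(codon)

def protein_to_regex(protein):
    # Regex matching any unambiguous coding sequence for this protein
    # (plus strand only; no frames, introns or codon-usage weighting)
    return "".join("(%s)" % "|".join(sorted(codons_for[aa]))
                   for aa in protein)

def count_back_translations(protein):
    # Number of possible unambiguous coding sequences (the m**n problem)
    total = 1
    for aa in protein:
        total *= len(codons_for[aa])
    return total

print protein_to_regex("RFG")
print count_back_translations("R" * 100)    # 6**100, about 6.5e77

The "RFG" line reproduces the codon sets in the RFamide search pattern given earlier in the thread.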
______________________________________________________________________ From biopython at maubp.freeserve.co.uk Wed Oct 22 09:17:23 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Oct 2008 10:17:23 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: References: <48FE3D4F.6060005@gmail.com> Message-ID: <320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com> On Wed, Oct 22, 2008 at 9:31 AM, Leighton Pritchard wrote: > On 21/10/2008 21:36, "Bruce Southey" wrote: > >> For completeness as these are not 100% correct, >> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN >> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV >> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN I was going to jump up and down and disagree with you here Bruce, but Leighton has already made the same point, (CGV | AGR) != MGV etc. It is true that the ambiguous codon MGV would cover all the possible Arg codons, but it includes more than that. While this could be a useful thing for certain back-translation reasons, it does break the expectation that translate(back_translate(sequence)) == sequence [currently the behaviour available in Bio.Translate]. >>> If we want to provide a simple string or Seq object, we can either >>> pick an arbitrary codon in each case (as in the first attachment on >>> Bug 2618), or perhaps represent some of the possible codons using >>> ambiguous nucleotides. >>> ... >>> It was something like this that I envisioned as a candidate for a Seq >>> method (based on the behaviour of the existing Bio.Translate >>> functionality), but only if such a simple back_translate >>> method/function had any real uses. And thus far, I haven't seen any. >>> >> For you perhaps but my reasons are very real to me! I was saying I don't see the need for a *simple* back_translate function (giving a Seq object or a string), and that such a simple function didn't seem to help with your examples. I'm not denying that a complex back translation operation has real utility (although I suspect there are multiple different solutions which won't suit every problem - and makes justifying adding this to the core Seq object hard to justify). Perhaps a function in Bio.SeqUtils to create a nucleotide regex describing possible back translations from a protein sequence would suffice? If one of your real-world examples can be solved with a back_translate which returns a simple string or Seq object, could you clarify this. Peter From lpritc at scri.ac.uk Wed Oct 22 10:03:32 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 22 Oct 2008 11:03:32 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48FE31B2.8030509@gmail.com> Message-ID: On 21/10/2008 20:46, "Bruce Southey" wrote: > Thank you for agreeing with me! I am glad that you realized that the > genetic code prevents a true one to many relationship. Bruce, I am not agreeing with you. I'll try to clarify it another way: More than one codon can encode the amino acid arginine (this is a many-one relationship). The amino acid arginine can be 'decoded' to more than one codon (this is a one-many relationship). Imagine a function that accepts an amino acid as input and returns a valid codon that could encode for the input amino acid. This is 'decoding' as described above, and is the process of back-translation for a single amino acid. For a single (i.e. 'one') amino acid, arginine, as input, the function might correctly provide up to six (i.e. 
'many') different valid answers. This makes it a one-many problem. Further external constraints (e.g. Codon tables) may be applied to restrict the number or likelihood of each codon being correct in specific cases, but the fundamental problem is one-many. Providing arginine as input to a particular coded version of this function might in all cases only return a single codon as output (one-one), but the problem itself is still one-many. Furthermore, even though only one codon was responsible - biologically-speaking - for encoding the arginine you're submitting to the function (one-one), your question is the inverse: effectively 'what codon encoded this arginine?'. But (and it's a big but), if you don't know beforehand what that codon is (and why else would you bother using the function?), the problem is one-many, as any of the six solutions might be correct. Analogously, there are two possible values for the square root of a positive real number, such as 4. It is inherently a one-many problem. For 4, the return value could, correctly, be +2 or -2. Now, the math.sqrt() function in Python follows mathematical convention for the radical, and only returns the positive value, but that does not make the relationship between the value and its square root one-one, it only makes that implementation of the function one-one, even though the answer could be, correctly, either positive or negative. Now, if your problem is: what is the length of side of a farmer's square field with area four square miles (big field!), only one of these answers makes sense (one-one), as the field is constrained by our reality and cannot have negative length (this is effectively equivalent to saying that the organism doesn't use five of the six possible codons for arginine, so only one answer is possible). However, the general problem of finding a square root is still one-many, as you can see if you rephrase the problem as 'the vector (a 0) has length 4; what is the value of a?'. This is directly analogous to the problem 'the amino acid arginine was encoded by a codon; what codon was it?'. > This very much depends on how you want to use it. TBLASTN is not very > good for very short sequences and can not handle protein domains/motifs > such as those in Prosite. That's a fair point, and I wouldn't (and didn't ;) ) recommend TBLASTN as a solution to all such problems. I get acceptable results for exact matches down to about 7aa on default settings, though. Short query sequences can be a problem whatever method you use, though. >> TBLASTN queries against >> nucleotide databases. Wait, that's not quite right - > No, it is not even correct! :-) Yes, it is correct. From: http://www.ncbi.nlm.nih.gov/blast/blast_program.shtml (and other references...) """ tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames """ They wrote it, so they should know. Not that I've checked the code ;) >> TBLASTN translates >> nucleotide databases into protein databases and queries against them with >> the protein sequence, partly because of the one-many mapping of >> back-translation. > Not exactly as stop codons are not in protein databases except where > they code for an amino acid. Stop codons are not (usually) in protein databases, that's true. But they *are* in nucleotide databases, which is what TBLASTN queries. 
For example, these are TBLASTN search results, in opposite directions on the same nucleotide sequence, that span stop codons in the subject sequence, indicated by '*' in the BLAST output (even though there are different stop codons; Artemis handles this more elegantly): >ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete genome Length = 5064019 Score = 79.0 bits (193), Expect = 8e-17 Identities = 38/40 (95%), Positives = 38/40 (95%), Gaps = 2/40 (5%) Frame = +2 Query: 1 YPHSTAEYLILFE-INPRS-PFFCWIFWNLMLRDVDLENF 38 YPHSTAEYLILFE INPRS PFFCWIFWNLMLRDVDLENF Sbjct: 2 YPHSTAEYLILFE*INPRS*PFFCWIFWNLMLRDVDLENF 121 >ref|NC_004547.2| Erwinia carotovora subsp. atroseptica SCRI1043, complete genome Length = 5064019 Score = 56.6 bits (135), Expect = 4e-10 Identities = 29/32 (90%), Positives = 29/32 (90%), Gaps = 3/32 (9%) Frame = -3 Query: 1 CNGRWRC-SPL-CYISPRISCRSW-LKPSAIV 29 CNGRWRC SPL CYISPRISCRSW LKPSAIV Sbjct: 2851610 CNGRWRC*SPL*CYISPRISCRSW*LKPSAIV 2851515 >> That's not true; fastacmd can extract FASTA-formatted sequences from any >> (version number compatibilities notwithstanding) correctly-formatted BLAST >> database. >> > Obviously because you still have direct access to the DNA sequence. I'd call it indirect access if you've, say, downloaded a precompiled nt database from NCBI and then have to extract the FASTA sequence from that compiled database. Either way, if you're querying a nucleotide database, you've got to have a representation of the nucleotide sequence *somewhere*. >> Even if both of the above options fail, and you can acquire the new sequence >> by some accession identifier, you can build a new local database from that >> sequence alone, and find where the match is. Or translate and search >> directly in Python. >> > These were some of the things that one was trying to avoid, especially > repeating it all over again and hoping like crazy that it is still > present. Some things are just harder work than others ;) > (Genome assemblies are not very forgiving.) The genomes I've worked on have had stable sequences at revision points for both assembly and annotation (though the old revision points have not been kept publicly in all cases, which can be awkward). All should, IMO. But that's a different thread on a different mailing list... Best, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. 
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From dalloliogm at gmail.com Wed Oct 22 10:25:57 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 22 Oct 2008 12:25:57 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> Message-ID: <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> On Mon, Oct 20, 2008 at 3:57 PM, Giovanni Marco Dall'Olio < dalloliogm at gmail.com> wrote: > > > On Mon, Oct 20, 2008 at 7:41 AM, Tiago Ant?o wrote: > >> Hi, >> >> On Sun, Oct 19, 2008 at 3:50 PM, Giovanni Marco Dall'Olio >> wrote: >> > ok, thank you very much!! >> > I would like to use git to keep track of the changes I will make to the >> > code. >> > What do you think if I'll upload it to http://github.com and then >> upload it >> > back on biopython when it is finished? >> > I am not sure, but I think it would be possible to convert the logs back >> to >> > cvs to reintegrate the changes in biopython. >> >> I think it is a good idea. When we reintegrate back I think there will >> be no need to backport the commit logs anyway. > > > Ok, I have uploaded the code to: > - http://github.com/dalloliogm/biopython---popgen > I wrote a prototype for a PED file parser which uses your PopGen.Record object to store data. It's available on github: I have still to finish the consumer object and to test it, but I think I will be able to finish it for today. I left you a few comments on the github wiki: - http://github.com/dalloliogm/biopython---popgen/wikis/home Maybe the biggest issue is that I will have to use this library to parse very big files, so there are a few things we could change in the implementation of the parser. Is there any way in python to force the interpreter to store variables in temporary files instead of RAM memory? I was thinking about modules like shelve, cPickle, but I am not sure they work in this way. We could also modify the parser in a way that it can accept a list of populations as argument, and create a populations list with only those populations from the file. 
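On the RAM question: shelve can indeed keep parsed records on disk rather than in memory, keyed like a dictionary. A toy sketch along those lines follows; this is not the real PopGen parser API, and the file names and the assumption that genotypes start at the seventh whitespace-separated column of a PED line are just for illustration:

import shelve

def iter_ped(handle):
    # Yield one individual at a time instead of building a big list
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        family_id, individual_id = fields[0], fields[1]
        genotypes = fields[6:]          # allele pairs, one pair per marker
        yield individual_id, (family_id, genotypes)

store = shelve.open("genotypes.shelf")   # records are written to disk
handle = open("example.ped")
for key, value in iter_ped(handle):
    store[key] = value
handle.close()
store.close()

An iterator like iter_ped() also makes it easy to keep only a chosen subset of populations as the file is read, without ever holding the whole file in memory.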
-- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Wed Oct 22 10:34:21 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Oct 2008 11:34:21 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> Message-ID: <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> On Wed, Oct 22, 2008 at 11:25 AM, Giovanni Marco Dall'Olio wrote: > Maybe the biggest issue is that I will have to use this library to parse > very big files, so there are a few things we could change in the > implementation of the parser. > Is there any way in python to force the interpreter to store variables in > temporary files instead of RAM memory? > I was thinking about modules like shelve, cPickle, but I am not sure they > work in this way. I have not looked at the specifics here, but adopting an iterator approach might make sense - returning the entries one by one as parsed from the file. This is the idea for the Bio.SeqIO and Bio.AlignIO parsers. The user can then turn the entries into a list (if they have enough memory), filter them as the arrive, etc. For example, you could compile a list of only those desired population entries, discarding the others on the fly. Peter From bsouthey at gmail.com Wed Oct 22 15:04:29 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 22 Oct 2008 10:04:29 -0500 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com> References: <48FE3D4F.6060005@gmail.com> <320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com> Message-ID: <48FF40FD.5020604@gmail.com> Peter wrote: > On Wed, Oct 22, 2008 at 9:31 AM, Leighton Pritchard wrote: > >> On 21/10/2008 21:36, "Bruce Southey" wrote: >> >> >>> For completeness as these are not 100% correct, >>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN >>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV >>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN >>> > > I was going to jump up and down and disagree with you here Bruce, but > Leighton has already made the same point, (CGV | AGR) != MGV etc. > It is true that the ambiguous codon MGV would cover all the possible > Arg codons, but it includes more than that. While this could be a > useful thing for certain back-translation reasons, it does break the > expectation that translate(back_translate(sequence)) == sequence > [currently the behaviour available in Bio.Translate]. > Leighton does show these are correct: (CGV | AGR) == MGV and MGV ==(CGV | AGR) BUT I fully agree that MGV does stand for other other codons that are do not translate for Arg as Leighton pointed out. 
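The codon sets under discussion are easy to enumerate with Biopython's own ambiguity and codon tables; the following is purely an illustrative check, not code anyone is proposing to add:

from Bio.Data import IUPACData, CodonTable

def expand(ambiguous_codon):
    # List every unambiguous codon covered by an ambiguous DNA codon.
    values = IUPACData.ambiguous_dna_values
    return [x + y + z for x in values[ambiguous_codon[0]]
                      for y in values[ambiguous_codon[1]]
                      for z in values[ambiguous_codon[2]]]

forward = CodonTable.unambiguous_dna_by_name["Standard"].forward_table
for ambiguous_codon in ["YTN", "MGV", "MGN"]:
    codons = expand(ambiguous_codon)
    # Stop codons are absent from forward_table, hence the .get() default.
    amino_acids = sorted(set(forward.get(codon, "*") for codon in codons))
    print("%s -> %i codons -> %s" % (ambiguous_codon, len(codons),
                                     "".join(amino_acids)))

# YTN expands to 8 codons (L and F); MGV expands to 6 codons (R and S, and
# misses the valid arginine codon CGT); MGN expands to 8 codons covering all
# six arginine codons plus two serine codons.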
This was why I prefixed this by stating "these are not 100% correct" so I am sorry that I was not clear enough. Yes, I am also very aware that this creates a problem for doing a translate(back_translate(sequence)) without using a special translation table (yet another reason for not including it in Seq object or just return an exception). As I pointed in your other thread that I do not believe that a back-translation should be part of the Seq object. If for no other reason than back-translation just creates too many ambiguous nucleotides in one DNA sequence. This will cause some of the algorithms to determine protein or DNA sequences to fail (back_translate('AFLFQPQRFGR') gives 'GCNTTYYTNTTYCARCCNCARMGVTTYGGNMGV', which causes NCBI's online BLASTN to say it is protein). In anycase, BLAST and such are not very good at handling multiple ambiguous nucleotides in a sequence when probably one-third to one-half of the sequence would be ambiguous nucleotides. Bruce From biopython at maubp.freeserve.co.uk Wed Oct 22 15:33:00 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Oct 2008 16:33:00 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48FF40FD.5020604@gmail.com> References: <48FE3D4F.6060005@gmail.com> <320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com> <48FF40FD.5020604@gmail.com> Message-ID: <320fb6e00810220833j4089cec1i6fb9cee563b562d3@mail.gmail.com> Bruce wrote: >>>> For completeness as these are not 100% correct, >>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN >>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV >>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN Just for the record, in addition to the debate about the final equal signs above, there is at least one error in the above - for the leucine codons, (TTN|CTR) should read (TTR|CTN), but this doesn't matter for the discussion in hand. Bruce wrote: > Leighton does show these are correct: > (CGV | AGR) == MGV > and MGV ==(CGV | AGR) I don't think Leighton did mean to say that. A set of 6 codons is NOT equal to a set of 8 codons. However, if we say "sub set" or "super set" here things are probably fine (I haven't double checked the correct ambiguity codes are used here). Similarly, Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTR|CTN) covers 6 unambiguous codons. This is a subset of YTN = (TTC|TTA|TTG|TTT|CTC|CTA|CTG|CTT) which covers 8 unambiguous codons. Having back_translate("L") == "YTN" means translate(back_translate("L")) == "X", which would surprise many. Using "YTN" covers all the codons plus some extra ones. This might be useful for searching purposes, but otherwise its very misleading. Having back_translate("L") == "CTN" means translate(back_translate("L")) == "L", but doesn't cover the two codons TTR (i.e. TTA or TTG). At least this is better than back_translate("L") == "TTR" which still has translate(back_translate("L")) == "L", but doesn't cover the four codons CTN. Picking any one of the six codons also ensures translate(back_translate("L")) == "L" but of course doesn't cover the other five codons. In all three cases, the utility of the back translation is limited. > Yes, I am also very aware that this creates a problem for doing a > translate(back_translate(sequence)) without using a special translation > table (yet another reason for not including it in Seq object or just return > an exception). Yes. > As I pointed in your other thread that I do not believe that a > back-translation should be part of the Seq object. 
In the absence of a compelling use case, I agree. > If for no other reason > than back-translation just creates too many ambiguous nucleotides in one DNA > sequence. This will cause some of the algorithms to determine protein or DNA > sequences to fail (back_translate('AFLFQPQRFGR') gives > 'GCNTTYYTNTTYCARCCNCARMGVTTYGGNMGV', which causes > NCBI's online BLASTN to say it is protein). In such cases, you can explicitly tell BLAST (or other tools) if they are using nucleotides or proteins. However this is a valid concern for working with ambiguous nucleotides. As an aside, zen of python "In the face of ambiguity, refuse the temptation to guess." (here nucleotide versus protein) > In anycase, BLAST and such are not very good at handling > multiple ambiguous nucleotides in a sequence when probably > one-third to one-half of the sequence would be ambiguous > nucleotides. Ambiguous searches are bound to be tricky. Peter From lpritc at scri.ac.uk Wed Oct 22 15:34:47 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 22 Oct 2008 16:34:47 +0100 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <48FF40FD.5020604@gmail.com> Message-ID: On 22/10/2008 16:04, "Bruce Southey" wrote: > Peter wrote: >> On Wed, Oct 22, 2008 at 9:31 AM, Leighton Pritchard >> wrote: >> >>> On 21/10/2008 21:36, "Bruce Southey" wrote: >>>> For completeness as these are not 100% correct, >>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN >>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV >>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN >>>> >> >> I was going to jump up and down and disagree with you here Bruce, but >> Leighton has already made the same point, (CGV | AGR) != MGV etc. >> It is true that the ambiguous codon MGV would cover all the possible >> Arg codons, but it includes more than that. >> > Leighton does show these are correct: > (CGV | AGR) == MGV > and MGV ==(CGV | AGR) I showed (and Peter also points out) that (TTN|CTR) is a subset of YTN, and that (TCN|AGY) is a subset of WSN, and not that they are equivalent, which is what you have written above. For that equivalence, we would also require that MGV is a subset of (CGV|AGR), which is not true. Likewise I also showed that, although (CGV|AGR) is a subset of MGV, neither CGV nor MGV include CGT, which is a valid codon for arginine. Whether or not this error is corrected to CGN/MGN, the regular expression is still only a subset of those codons implied by the IUPAC ambiguity symbols. L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. 
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From bsouthey at gmail.com Wed Oct 22 15:50:19 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 22 Oct 2008 10:50:19 -0500 Subject: [BioPython] [DETECTED AS SPAM] Re: back-translation method for Seq object? In-Reply-To: <320fb6e00810220833j4089cec1i6fb9cee563b562d3@mail.gmail.com> References: <48FE3D4F.6060005@gmail.com> <320fb6e00810220217l48aa0b76w5b30095706f43d26@mail.gmail.com> <48FF40FD.5020604@gmail.com> <320fb6e00810220833j4089cec1i6fb9cee563b562d3@mail.gmail.com> Message-ID: <48FF4BBB.8020007@gmail.com> Peter wrote: > Bruce wrote: > >>>>> For completeness as these are not 100% correct, >>>>> Leu/L =(TTA|TTG|CTT|CTC|CTA|CTG) = (TTN|CTR) = YTN >>>>> Arg/R =(CGT|CGC|CGA|CGG|AGA|AGG) =(CGV | AGR) = MGV >>>>> Ser/S =(TCT|TCC|TCA|TCG|AGT|AGC) =(TCN|AGY) = WSN >>>>> > > Just for the record, in addition to the debate about the final equal > signs above, there is at least one error in the above - for the > leucine codons, (TTN|CTR) should read (TTR|CTN), but this doesn't > matter for the discussion in hand. > > Thanks for correctly that one. Bruce From tiagoantao at gmail.com Wed Oct 22 15:52:19 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 22 Oct 2008 16:52:19 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> Message-ID: <6d941f120810220852m606b8825tafaaf871f22df48e@mail.gmail.com> Hi, [Back in office now] > Ok, I have uploaded the code to: > - http://github.com/dalloliogm/biopython---popgen > > I put the code I wrote before writing in this mailing list in the folder > PopGen/Gio Thanks I will have a look and get acquainted with GIT. >> I am afraid that this is not enough. Even for Fst. I suppose you are >> acquainted with a formula with just heterozigosities. > > Yes, I was trying to implement a very basic formula at first. For publication and data analysis the standard is Cockerham and Wier's theta. The Standard Ht/(Hs-Ht) (or a variation of this) might be misleading in regards to the amount of information that is needed. > Yes, I agree. It was just a first try. We should collect some good > use-cases. In my head I divide statistics in the following dimensions: 1. genetic versus genomic (e.g. Fst is single locus, LD can be seen as requiring more than 1 locus, therefore is "genomic") 2. 
frequency based versus marker based (some statistics require frequencies only - ie, you can calculate them irrespective of the type of marker - This is the case of Fst. Others are marker dependent, say Tajima D requires sequences and can only be used with sequences) 3. population structure versus no pop structure. Some stats require population structure (again, Fst), others don't (e.g., allelic richness) >From my point of view, a long-term solution needs to take into account these dimensions (and others that I might be forgetting). One can think in a solution based on Populations and Individuals as fundamental objects (as opposed to statistics), but, from my experience it is very difficult to define what is an "individual" (i.e., what kind of information you need to store - I can expand on this). It is easier to think in terms of statistics. One fundamental point is that we don't have many opportunities to make it right: if we define an architecture which proves in the future to be not sufficient, then we will have to both maintain the old legacy (because there will be users around whose code cannot be constantly broken when a new version is made available) while hack the new features in. From tiagoantao at gmail.com Wed Oct 22 16:00:39 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 22 Oct 2008 17:00:39 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> Message-ID: <6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com> Hi, On Wed, Oct 22, 2008 at 11:25 AM, Giovanni Marco Dall'Olio wrote: > I wrote a prototype for a PED file parser which uses your PopGen.Record > object to store data. Don't feel obliged to use GenePop.Record. You can (maybe you should) use one that is better for your PED record. The point is: your PED files might have extra (or less) information than genepop files. For instance, they might have population names. They might store the SNP (A, C, T, G). With genepop you would have to convert (and thus loose) the extra info. > Maybe the biggest issue is that I will have to use this library to parse > very big files, so there are a few things we could change in the > implementation of the parser. Yet another reason to develop your own record. I would not mind helping you with that. > We could also modify the parser in a way that it can accept a list of > populations as argument, and create a populations list with only those > populations from the file. We have to be careful in modifying existing code. We can add new functionality, add new interfaces. But changing existing interfaces or removing them has to be dealt with exceptional care, because that will break (existing) code done by users. 
Tiago From tiagoantao at gmail.com Wed Oct 22 16:03:59 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 22 Oct 2008 17:03:59 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> Message-ID: <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> On Wed, Oct 22, 2008 at 11:34 AM, Peter wrote: > I have not looked at the specifics here, but adopting an iterator > approach might make sense - returning the entries one by one as parsed > from the file. This is the idea for the Bio.SeqIO and Bio.AlignIO > parsers. The user can then turn the entries into a list (if they have > enough memory), filter them as the arrive, etc. For example, you > could compile a list of only those desired population entries, > discarding the others on the fly. I will have look at iterators in Python. This idea from Giovannni is actually floating around with current users for GenePop data which have exactly the same problem (loooong records). From dalloliogm at gmail.com Wed Oct 22 17:10:45 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 22 Oct 2008 19:10:45 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> Message-ID: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> On Wed, Oct 22, 2008 at 6:03 PM, Tiago Ant?o wrote: > On Wed, Oct 22, 2008 at 11:34 AM, Peter > wrote: > > I have not looked at the specifics here, but adopting an iterator > > approach might make sense - returning the entries one by one as parsed > > from the file. This is the idea for the Bio.SeqIO and Bio.AlignIO > > parsers. The user can then turn the entries into a list (if they have > > enough memory), filter them as the arrive, etc. For example, you > > could compile a list of only those desired population entries, > > discarding the others on the fly. > > I will have look at iterators in Python. This idea from Giovannni is > actually floating around with current users for GenePop data which > have exactly the same problem (loooong records). 
> Iterators are more difficult to implement in Ped files, because in this format every line of the file is an individual, so to write an iterator which iterates by population we will need to read at list the first row of every line of all the file. I was also thinking of starting using a database to store data, instead of files. This would probably solve the problem of out of memory when parsing those long files. I would probably use sqlalchemy to interface with this database: this is why I would like to implement a Population and Individual objects, it will fit better with relational mapping. -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From dalloliogm at gmail.com Wed Oct 22 17:12:24 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 22 Oct 2008 19:12:24 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810220852m606b8825tafaaf871f22df48e@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <6d941f120810220852m606b8825tafaaf871f22df48e@mail.gmail.com> Message-ID: <5aa3b3570810221012v3543a533u15f81196752cd52@mail.gmail.com> On Wed, Oct 22, 2008 at 5:52 PM, Tiago Ant?o wrote: > Hi, > > [Back in office now] > > > Ok, I have uploaded the code to: > > - http://github.com/dalloliogm/biopython---popgen > > > > I put the code I wrote before writing in this mailing list in the folder > > PopGen/Gio > > Thanks I will have a look and get acquainted with GIT. > It' s the first time I am using github for something serious, too. Please tell me if you need me to add you as a 'collaborator' in the project or something like this. I am using eclipse with a plugin for git (http://www.jgit.org/update-site) and it works very well. I think there is a plugin for vim, too. Sorry, today I couldn't do too much - I spent most of the day in seminars and meetings :(. > > > > Yes, I agree. It was just a first try. We should collect some good > > use-cases. > > > In my head I divide statistics in the following dimensions: > 1. genetic versus genomic (e.g. Fst is single locus, LD can be seen as > requiring more than 1 locus, therefore is "genomic") > 2. frequency based versus marker based (some statistics require > frequencies only - ie, you can calculate them irrespective of the type > of marker - This is the case of Fst. Others are marker dependent, say > Tajima D requires sequences and can only be used with sequences) > 3. population structure versus no pop structure. Some stats require > population structure (again, Fst), others don't (e.g., allelic > richness) > > From my point of view, a long-term solution needs to take into account > these dimensions (and others that I might be forgetting). 
> > One can think in a solution based on Populations and Individuals as > fundamental objects (as opposed to statistics), but, from my > experience it is very difficult to define what is an "individual" > (i.e., what kind of information you need to store - I can expand on > this). It is easier to think in terms of statistics. > > One fundamental point is that we don't have many opportunities to make > it right: if we define an architecture which proves in the future to > be not sufficient, then we will have to both maintain the old legacy > (because there will be users around whose code cannot be constantly > broken when a new version is made available) while hack the new > features in. > ok... but we can try :). We could use the github's wiki to better organize these ideas. I will answer to you better tomorrow (or tonight). Now, I need a bit of fresh air! :) -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From dalloliogm at gmail.com Wed Oct 22 17:12:41 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 22 Oct 2008 19:12:41 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810170239r1f2a3ddo345b15d5a8e91a0e@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com> Message-ID: <5aa3b3570810221012w2c894977sd0f86297f42f9394@mail.gmail.com> On Wed, Oct 22, 2008 at 6:00 PM, Tiago Ant?o wrote: > Hi, > > On Wed, Oct 22, 2008 at 11:25 AM, Giovanni Marco Dall'Olio > wrote: > > I wrote a prototype for a PED file parser which uses your PopGen.Record > > object to store data. > > Don't feel obliged to use GenePop.Record. You can (maybe you should) > use one that is better for your PED record. The point is: your PED > files might have extra (or less) information than genepop files. For > instance, they might have population names. They might store the SNP > (A, C, T, G). With genepop you would have to convert (and thus loose) > the extra info. I first tried to write an AbstractPopRecord class from which to derive both Ped.Record and your GenePop.Record classes. Then, I realized that I wanted to use all of your methods and decided to import your GenePop.Record instead of writing a new one. Moreover, there are some methods (like GenePop.Record.split_in_pops) that create Record objects, and I thought it would have been easier to always refer to the same one. Maybe we should write a generic PopGenRecord in which to store all general informations about population genetics data. > > > > Maybe the biggest issue is that I will have to use this library to parse > > very big files, so there are a few things we could change in the > > implementation of the parser. > > Yet another reason to develop your own record. I would not mind > helping you with that. 
> > > > We could also modify the parser in a way that it can accept a list of > > populations as argument, and create a populations list with only those > > populations from the file. > > We have to be careful in modifying existing code. We can add new > functionality, add new interfaces. But changing existing interfaces or > removing them has to be dealt with exceptional care, because that will > break (existing) code done by users. > > Tiago > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Wed Oct 22 17:26:07 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Oct 2008 18:26:07 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> Message-ID: <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com> On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio wrote: > > Iterators are more difficult to implement in Ped files, because in this > format every line of the file is an individual, so to write an iterator > which iterates by population we will need to read at list the first row of > every line of all the file. It sounds like for Ped files it would make more sense to iterate over the individuals. The mental picture I have in mind is a big spreadsheet, individuals as rows (lines), populations (and other information) as columns. By having the parser iterate over the individuals one by one, the user could then "simplify" each individual as they are read in, recording in memory just the interesting data. This way the whole dataset need not be kept in memory. > I was also thinking of starting using a database to store data, instead of > files. This would probably solve the problem of out of memory when parsing > those long files. > I would probably use sqlalchemy to interface with this database: this is why > I would like to implement a Population and Individual objects, it will fit > better with relational mapping. That would mean adding sqlalchemy as another (optional) dependency for Biopython. If you could use MySQLdb instead that would be better as several existing modules use this. However, I would encourage you to avoid any database if possible because this makes the installation much more complicated for the end user, and imposes your own arbitrary schema as well. It also means setting up suitable unit tests is also a pain. Peter From rsclary at uncc.edu Wed Oct 22 19:49:33 2008 From: rsclary at uncc.edu (Clary, Richard) Date: Wed, 22 Oct 2008 15:49:33 -0400 Subject: [BioPython] Retrieving nucleotide sequence for given accession Entrez ID Message-ID: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu> Can anyone provide succinct Python function to retrieve the nucleotide sequence (as a string) for a given nucleotide accession ID? 
Attempting to do this through E-Utils but having a difficult time figuring out the best way to do this without having to download a FASTA file... Thanks in advance, R From biopython at maubp.freeserve.co.uk Wed Oct 22 20:15:37 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Oct 2008 21:15:37 +0100 Subject: [BioPython] Retrieving nucleotide sequence for given accession Entrez ID In-Reply-To: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu> References: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu> Message-ID: <320fb6e00810221315i31358bc2n2e5c9be405a77e42@mail.gmail.com> On Wed, Oct 22, 2008 at 8:49 PM, Clary, Richard wrote: > > Can anyone provide succinct Python function to retrieve the > nucleotide sequence (as a string) for a given nucleotide > accession ID? Attempting to do this through E-Utils but > having a difficult time figuring out the best way to do this > without having to download a FASTA file... Hi Richard, Are you trying this using Bipython's Bio.Entrez, or accessing E-Utils directly? Anyway, you'll want to use efetch (e.g. via the Bio.Entrez.efetch function in Biopython) http://www.ncbi.nlm.nih.gov/entrez/query/static/efetch_help.html This documentation covers the possible return formats, http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html I think FASTA would be simplest (I don't see a plain or raw text option), and has only a tiny overhead in the download size over the raw sequence. Getting the sequence out of a FASTA file as a string is trivial - for example, using Biopython: from Bio import Entrez, SeqIO Entrez.email = "Richard at example.com" #Tell the NCBI who you are handle = Entrez.efetch(db="nucleotide", id="186972394",rettype="fasta") seq_str = str(SeqIO.read(handle, "fasta").seq) Peter From dalloliogm at gmail.com Thu Oct 23 09:41:04 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 23 Oct 2008 11:41:04 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com> Message-ID: <5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com> On Wed, Oct 22, 2008 at 7:26 PM, Peter wrote: > On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio > wrote: > > > > Iterators are more difficult to implement in Ped files, because in this > > format every line of the file is an individual, so to write an iterator > > which iterates by population we will need to read at list the first row > of > > every line of all the file. > > It sounds like for Ped files it would make more sense to iterate over > the individuals. The mental picture I have in mind is a big > spreadsheet, individuals as rows (lines), populations (and other > information) as columns. 
By having the parser iterate over the > individuals one by one, the user could then "simplify" each individual > as they are read in, recording in memory just the interesting data. > This way the whole dataset need not be kept in memory. This makes sense. Basically, we should write a (Ped/GenePop)Iterator function, which should read the file one line at a time, check if it a has correct syntax and is not a comment, and then use 'yield' to create a Record object. Am I right? > > > I was also thinking of starting using a database to store data, instead > of > > files. This would probably solve the problem of out of memory when > parsing > > those long files. > > I would probably use sqlalchemy to interface with this database: this is > why > > I would like to implement a Population and Individual objects, it will > fit > > better with relational mapping. > > That would mean adding sqlalchemy as another (optional) dependency for > Biopython. If you could use MySQLdb instead that would be better as > several existing modules use this. However, I would encourage you to > avoid any database if possible because this makes the installation > much more complicated for the end user, and imposes your own arbitrary > schema as well. It also means setting up suitable unit tests is also > a pain. > Don't worry, I am not going to do that. I will probably use sqlalchemy only in my scripts; I will use it to retrieve data from the database, and then create Population/Marker/Individual objects using the code I am writing now, or a adapt the objects created by sqlalchemy to be compatible with the functions I will have to use. > > Peter > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Thu Oct 23 09:57:38 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Oct 2008 10:57:38 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com> <5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com> Message-ID: <320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com> On Thu, Oct 23, 2008, Giovanni Marco Dall'Olio wrote: > On Wed, Oct 22, Peter wrote: >> On Wed, Oct 22, Giovanni Marco Dall'Olio wrote: >> > >> > Iterators are more difficult to implement in Ped files, because in this >> > format every line of the file is an individual, so to write an iterator >> > which iterates by population we will need to read at list the first row >> > of every line of all the file. >> >> It sounds like for Ped files it would make more sense to iterate over >> the individuals. The mental picture I have in mind is a big >> spreadsheet, individuals as rows (lines), populations (and other >> information) as columns. 
By having the parser iterate over the >> individuals one by one, the user could then "simplify" each individual >> as they are read in, recording in memory just the interesting data. >> This way the whole dataset need not be kept in memory. > > This makes sense. > Basically, we should write a (Ped/GenePop)Iterator function, which should > read the file one line at a time, check if it a has correct syntax and is > not a comment, and then use 'yield' to create a Record object. Am I right? Yes :) Python functions written with "yield" are called "generator functions", see: http://www.python.org/dev/peps/pep-0255/ Peter From m at pavis.biodec.com Thu Oct 23 10:25:45 2008 From: m at pavis.biodec.com (m at pavis.biodec.com) Date: Thu, 23 Oct 2008 12:25:45 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> References: <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> Message-ID: <20081023102545.GE3694@pavis.biodec.com> * Giovanni Marco Dall'Olio (dalloliogm at gmail.com) [081022 19:12]: > > I was also thinking of starting using a database to store data, instead of > files. This would probably solve the problem of out of memory when parsing > those long files. If you just need to store data, i.e. you just need a thin layer above file storage, I'd suggest evaluating ZODB It's very simple, somehow pythonic, and you don't need to learn SQL to manage the data (of course, SQL is just fine, and from a real DB you get much more than just data storage, but since you are just writing about alternatives to file storage, I assume that SQL would not be a plus) HTH -- .*. finelli /V\ (/ \) -------------------------------------------------------------- ( ) Linux: Friends dont let friends use Piccolosoffice ^^-^^ -------------------------------------------------------------- It is easier to make a saint out of a libertine than out of a prig. -- George Santayana From dalloliogm at gmail.com Thu Oct 23 11:30:06 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 23 Oct 2008 13:30:06 +0200 Subject: [BioPython] [OT] Revision control and databases Message-ID: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> Hi, I have a question (well, it's not directly related to biopython or pygr, but to scientific computing). I always used flat files to store results and data for my bioinformatics analys, but not (as I was saying in another thread) I would like to start using a database to do that. The problem is I don't know if databases do Revision Control. When I used flat files, I was used to save all the results in a git repository, and, everytime something was changed or calculated again, I did commit it. Do you know how to do this with databases? Does MySQL provide support for revision control? 
Thanks :) (sorry for cross-posting :( ) -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From sdavis2 at mail.nih.gov Thu Oct 23 12:10:16 2008 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 23 Oct 2008 08:10:16 -0400 Subject: [BioPython] [OT] Revision control and databases In-Reply-To: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> Message-ID: <264855a00810230510n37d05cb1gd7b88a63988d7191@mail.gmail.com> On Thu, Oct 23, 2008 at 7:30 AM, Giovanni Marco Dall'Olio wrote: > Hi, > I have a question (well, it's not directly related to biopython or pygr, but > to scientific computing). > > I always used flat files to store results and data for my bioinformatics > analys, but not (as I was saying in another thread) I would like to start > using a database to do that. > > The problem is I don't know if databases do Revision Control. > When I used flat files, I was used to save all the results in a git > repository, and, everytime something was changed or calculated again, I did > commit it. > Do you know how to do this with databases? Does MySQL provide support for > revision control? > Thanks :) No. Relational databases just store data. You could build such a system, but that would require a fair amount of work. I would suggest storing metadata about your analyses in the database and storing the actual results on the file system. Sean From lpritc at scri.ac.uk Thu Oct 23 12:44:45 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 23 Oct 2008 13:44:45 +0100 Subject: [BioPython] [bip] [OT] Revision control and databases In-Reply-To: Message-ID: Hi Giovanni (and others) Ah, reading again I see you're already using git... Without knowing exactly what you're doing, I assume that CVS and SVN would be no improvement , so please ignore my last paragraph below ;) L. On 23/10/2008 13:39, "Leighton Pritchard" wrote: > Hi Giovanni, > > On 23/10/2008 12:30, "Giovanni Marco Dall'Olio" wrote: > >> The problem is I don't know if databases do Revision Control. >> When I used flat files, I was used to save all the results in a git >> repository, and, everytime something was changed or calculated again, I did >> commit it. >> Do you know how to do this with databases? Does MySQL provide support for >> revision control? > > Databases are just collections of data. Database Management Systems (DBMS) > such as MySQL and PostgreSQL do not (AFAIAA) do revision control themselves, > but they can be used for it, if you build that capability into the schema and > also control database submissions appropriately. There are a number of > content management systems that implement version/revision control on common > DBMS, like this. > > Stretching a definition, you could possibly argue that CVS, SVN and the like > are a form of DBMS... I don't know what type of data you're storing, or how > they might scale for your purposes but, in principle, neither CVS nor SVN care > much about whether your data represents code, legal documents, or any other > sort of data. For example, I've used CVS/SVN to version control manuscripts. > You might like to try one of them. > > Cheers, > > L. 
-- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From lpritc at scri.ac.uk Thu Oct 23 12:39:40 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 23 Oct 2008 13:39:40 +0100 Subject: [BioPython] [bip] [OT] Revision control and databases In-Reply-To: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> Message-ID: Hi Giovanni, On 23/10/2008 12:30, "Giovanni Marco Dall'Olio" wrote: > The problem is I don't know if databases do Revision Control. > When I used flat files, I was used to save all the results in a git > repository, and, everytime something was changed or calculated again, I did > commit it. > Do you know how to do this with databases? Does MySQL provide support for > revision control? Databases are just collections of data. Database Management Systems (DBMS) such as MySQL and PostgreSQL do not (AFAIAA) do revision control themselves, but they can be used for it, if you build that capability into the schema and also control database submissions appropriately. There are a number of content management systems that implement version/revision control on common DBMS, like this. Stretching a definition, you could possibly argue that CVS, SVN and the like are a form of DBMS... I don't know what type of data you're storing, or how they might scale for your purposes but, in principle, neither CVS nor SVN care much about whether your data represents code, legal documents, or any other sort of data. For example, I've used CVS/SVN to version control manuscripts. You might like to try one of them. Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. 
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________________________ From bsouthey at gmail.com Thu Oct 23 13:55:49 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Thu, 23 Oct 2008 08:55:49 -0500 Subject: [BioPython] [bip] [OT] Revision control and databases In-Reply-To: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> Message-ID: <49008265.3040205@gmail.com> Giovanni Marco Dall'Olio wrote: > Hi, > I have a question (well, it's not directly related to biopython or > pygr, but to scientific computing). > > I always used flat files to store results and data for my > bioinformatics analys, but not (as I was saying in another thread) I > would like to start using a database to do that. Of course Biopython's BioSQL interface may provide a starting point. > The problem is I don't know if databases do Revision Control. > When I used flat files, I was used to save all the results in a git > repository, and, everytime something was changed or calculated again, > I did commit it. > Do you know how to do this with databases? Does MySQL provide support > for revision control? > Thanks :) I think you are asking the wrong questions because it depends on what you want to do and what you actually store. There are a number of questions that you need to ask yourself about what you really need to do (knowing you have used git helps refine these). Examples include: How often do you use the old versions in your git repository? How do you use the old revisions in your git repository? Do you even use the information of an older version if a newer version exists? Do you actually determine when 'something was changed or calculated again' or it this partly determined by an external source like a Genbank or UniProt update? (At least in a database approach you could automate this.) How many users that can make changes? How often do you have conflicts? Are the conflicts hard to solve? Revision control may be overkill for your use because this is aims to handle many tasks and change conflicts related to multiple users rather than a single user. If you don't need all these fancy features then you can use a database. If you just want to store and retrieve a version then you can use a database but you need to at least force the inclusion a date and comment fields to be useful. 
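A minimal sketch of that last point, using sqlite3 simply to keep the example self-contained (the table and column names are invented for illustration; the same layout would work in MySQL):

import sqlite3

connection = sqlite3.connect("results.db")
cursor = connection.cursor()
# Every stored result carries a timestamp and a free-text comment, which is
# the bare minimum needed to tell versions apart later.
cursor.execute("""CREATE TABLE IF NOT EXISTS result_versions (
                      id INTEGER PRIMARY KEY,
                      analysis TEXT,
                      created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                      comment TEXT,
                      payload TEXT)""")
cursor.execute("INSERT INTO result_versions (analysis, comment, payload) "
               "VALUES (?, ?, ?)",
               ("fst_scan", "recalculated with corrected genotypes", "..."))
connection.commit()
# The most recent version of a given analysis:
cursor.execute("SELECT created, comment FROM result_versions "
               "WHERE analysis = ? ORDER BY created DESC LIMIT 1",
               ("fst_scan",))
print(cursor.fetchone())
connection.close()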
Regards Bruce From tiagoantao at gmail.com Thu Oct 23 14:51:22 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 23 Oct 2008 15:51:22 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810221012w2c894977sd0f86297f42f9394@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810171107s5c874bf2ia5bdc31712e4036f@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <6d941f120810220900o620aa7x171fcd181f5248c@mail.gmail.com> <5aa3b3570810221012w2c894977sd0f86297f42f9394@mail.gmail.com> Message-ID: <6d941f120810230751k3dee7b96y8ee13e4bf1c2a4ca@mail.gmail.com> Hi, > Moreover, there are some methods (like GenePop.Record.split_in_pops) that > create Record objects, and I thought it would have been easier to always > refer to the same one. > Maybe we should write a generic PopGenRecord in which to store all general > informations about population genetics data. The problem with that is that it is a) very difficult to come with a representation that is general enough (and usable in the long run). b) a general representation would be an hassle in specific cases Let me elaborate: Different kinds of genetic information have completely different storage needs: If you are doing genomic studies you will probably want to have location information (like this SNP is on chromosome X, position Y). Others (probably the majority) only require frequency information (or to know what the marker is, irrespective of position). In most species you don't even know the genomic position of a certain marker. So you would have to have an general representation capable to handle both position information and no position information. Then, in some cases, you need the whole marker (like if you want to do a Tajima D) or just frequency information (for Fst). Some markers (microsats) you can (in most, but not all) cases ignore the genetic pattern, you just count the repeats. You could argue that one could try to have a most general representation but that entails three problems: 1. It is very difficult to come by with a clever, correct and future proof representation. At least I've thinking on this issue since 2005 and have found no clever answer. 2. Performance: If you care about performance, having a most general data representation will bring about a big performance cost (converting from a certain general format to the format needed to do computations). 3. Different formats and statistics have different requirements: For instance on GenePop you don't have population names, neither the marker itself, but for arlequin format you have partial information on markers and full information on population names. converting the minor differences among formats to a "general" format would be complex. 
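To make the frequency-based case concrete: for a biallelic SNP, the simple heterozygosity-based estimate Fst = (Ht - Hs) / Ht needs nothing beyond per-population allele frequencies (plus, ideally, sample sizes for weighting). The sketch below is only that basic estimator, not Weir and Cockerham's theta, and not code from any existing module:

def basic_fst(allele_freqs, sample_sizes=None):
    # allele_freqs: frequency of one allele of a biallelic SNP in each
    # subpopulation.  sample_sizes (optional) are used only as weights.
    if sample_sizes is None:
        sample_sizes = [1] * len(allele_freqs)
    total = float(sum(sample_sizes))
    weights = [n / total for n in sample_sizes]
    # Mean expected heterozygosity within subpopulations (Hs):
    hs = sum(w * 2 * p * (1 - p) for w, p in zip(weights, allele_freqs))
    # Expected heterozygosity of the pooled population (Ht):
    p_bar = sum(w * p for w, p in zip(weights, allele_freqs))
    ht = 2 * p_bar * (1 - p_bar)
    if ht == 0:
        return 0.0       # monomorphic SNP: no differentiation to measure
    return (ht - hs) / ht

print(basic_fst([0.2, 0.8]))   # roughly 0.36: strongly differentiated samples
print(basic_fst([0.5, 0.5]))   # 0.0 when the frequencies are identical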
From tiagoantao at gmail.com Thu Oct 23 15:10:51 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 23 Oct 2008 16:10:51 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810180508x7e7eb1a3pd63597c622ad2dff@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> Message-ID: <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com> On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio wrote: > Iterators are more difficult to implement in Ped files, because in this > format every line of the file is an individual, so to write an iterator > which iterates by population we will need to read at list the first row of > every line of all the file. GenePop works population by population. Where I a getting at, is that different formats might have completely different strategies. I've used a strategy with the FDist parser that it might be interesting to you: 1. I read the fdist file 2. Convert it to genepop 3. do all operations in the genepop format 4. convert back if necessary. This might not work in your case because the ped format seems to be more informative than the genepop format (and thus you loose information in the conversion process). Feel free to copy and adapt my code to your own (like split_in_pops and split_in_loci) > I would probably use sqlalchemy to interface with this database: this is why > I would like to implement a Population and Individual objects, it will fit > better with relational mapping. You can go ahead and suggest formats for Populations and Individuals. But I strongly suspect that your proposal will be biased towards your needs (I've suffered the same problem myself). I think that in biopython the idea is to try to have a solution that is useful to everybody. Also, if you want to put some SQL in the code module code, you will have to have approval from the maintainers of biopython. They will send you to the BioSQL people, which will say that there is none of their business. Been there, done that, no success. Don't take me wrong, I am not trying to discourage you in any way. But I think it is better to gain some experience before proposing changes to core concepts. I've been doing this work for 3 years now, and I am convinced that it would be very hard for me to suggest a good representation for populations and individuals. Even populations are very hard to address (like, some data is geo-referenced -> called landspace genetics, and the more traditional one is not). My suggestion: solve you problem the best way you can (e.g., do an independent PED parser - you can use any of my code if you want). Solve small problems, one after another. Trying to solve the general problem is very hard and requires lots of long term experience. 
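A small sketch of what such an independent, generator-based PED reader could look like, following the iterate-over-individuals idea from earlier in the thread. The six leading columns (family, individual, father, mother, sex, phenotype) and the use of the family ID as a population label are assumptions of the example, not a settled design:

def iter_ped_individuals(handle):
    # Yield individuals one at a time so the whole file is never in memory.
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        family, individual = fields[0], fields[1]
        genotypes = fields[6:]
        yield family, individual, genotypes

# Filtering on the fly, keeping only the populations of interest:
# wanted = set(["pop1", "pop2"])
# handle = open("example.ped")
# subset = [record for record in iter_ped_individuals(handle)
#           if record[0] in wanted]
# handle.close()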
From dalloliogm at gmail.com Thu Oct 23 16:25:29 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 23 Oct 2008 18:25:29 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810180950k5d62df2nf4c5d9706b3fb025@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com> Message-ID: <5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com> On Thu, Oct 23, 2008 at 5:10 PM, Tiago Ant?o wrote: > On Wed, Oct 22, 2008 at 6:10 PM, Giovanni Marco Dall'Olio > wrote: > > Iterators are more difficult to implement in Ped files, because in this > > format every line of the file is an individual, so to write an iterator > > which iterates by population we will need to read at list the first row > of > > every line of all the file. > > GenePop works population by population. Where I a getting at, is that > different formats might have completely different strategies. > I've used a strategy with the FDist parser that it might be interesting to > you: > 1. I read the fdist file > 2. Convert it to genepop > 3. do all operations in the genepop format > 4. convert back if necessary. > > This might not work in your case because the ped format seems to be > more informative than the genepop format (and thus you loose > information in the conversion process). Feel free to copy and adapt my > code to your own (like split_in_pops and split_in_loci) > > > > I would probably use sqlalchemy to interface with this database: this is > why > > I would like to implement a Population and Individual objects, it will > fit > > better with relational mapping. > > You can go ahead and suggest formats for Populations and Individuals. > But I strongly suspect that your proposal will be biased towards your > needs (I've suffered the same problem myself). I think that in > biopython the idea is to try to have a solution that is useful to > everybody. > > Also, if you want to put some SQL in the code module code, you will > have to have approval from the maintainers of biopython. They will > send you to the BioSQL people, which will say that there is none of > their business. Been there, done that, no success. > > Don't take me wrong, I am not trying to discourage you in any way. But > I think it is better to gain some experience before proposing changes > to core concepts. > I've been doing this work for 3 years now, and I am convinced that it > would be very hard for me to suggest a good representation for > populations and individuals. Even populations are very hard to address > (like, some data is geo-referenced -> called landspace genetics, and > the more traditional one is not). > > My suggestion: solve you problem the best way you can (e.g., do an > independent PED parser - you can use any of my code if you want). > Solve small problems, one after another. > Trying to solve the general problem is very hard and requires lots of > long term experience. > Well, I agree with you... 
I don't have any idea on how this problem could be resolved :). However I think it would be good to add to biopython at least some funcionality to calculate Fst statistics and parse these file formats, at least at the level at which BioPerl does. What if we just translate the same functionalities and copy the population objects from bioperl into biopython? I realize that it won't be the perfect solution: in fact, it is the same reason why I started this discussion here, the bioperl code wasn't optimized enought for what I want to do, but I didn't know how to modify perl modules and preferred python. Maybe we can just write a PED and GenePop parser and have let it work with GenePop and your modules to calculate Fst. We should agree with a population object that could be used as input for GenePop. I think it would be good anyway to release even incomplete code to the public, because it could be useful for other people. -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From dalloliogm at gmail.com Thu Oct 23 16:27:22 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 23 Oct 2008 18:27:22 +0200 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com> <5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com> <320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com> Message-ID: <5aa3b3570810230927m696f9c27gdfec084647c2c509@mail.gmail.com> On Thu, Oct 23, 2008 at 11:57 AM, Peter wrote: > On Thu, Oct 23, 2008, Giovanni Marco Dall'Olio wrote: > > On Wed, Oct 22, Peter wrote: > >> On Wed, Oct 22, Giovanni Marco Dall'Olio wrote: > >> > > >> > Iterators are more difficult to implement in Ped files, because in > this > >> > format every line of the file is an individual, so to write an > iterator > >> > which iterates by population we will need to read at list the first > row > >> > of every line of all the file. > >> > >> It sounds like for Ped files it would make more sense to iterate over > >> the individuals. The mental picture I have in mind is a big > >> spreadsheet, individuals as rows (lines), populations (and other > >> information) as columns. By having the parser iterate over the > >> individuals one by one, the user could then "simplify" each individual > >> as they are read in, recording in memory just the interesting data. > >> This way the whole dataset need not be kept in memory. > > > > This makes sense. > > Basically, we should write a (Ped/GenePop)Iterator function, which should > > read the file one line at a time, check if it a has correct syntax and is > > not a comment, and then use 'yield' to create a Record object. Am I > right? > > Yes :) > > Python functions written with "yield" are called "generator functions", > see: > http://www.python.org/dev/peps/pep-0255/ > So, how should we modify the current GenePop parser to make it work as an iterator? Now it has a 'Scanner' and 'Consumer' methods. 
Should I remove them and write a RecordIterator instead? - http://github.com/dalloliogm/biopython---popgen/tree/master/src/PopGen/Gio/Ped/__init__.py Can you explain me more or less how the 'Consumer' object works? It is mandatory to use it when creating biopython objects? p.s. do you like the doctest to show how to use the parser? > Peter > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Thu Oct 23 17:01:26 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Oct 2008 18:01:26 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810230927m696f9c27gdfec084647c2c509@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <320fb6e00810221026s842fa88jcce86dfe2a5c8ed0@mail.gmail.com> <5aa3b3570810230241w19596864qede75fdbee88c847@mail.gmail.com> <320fb6e00810230257y4a21952o95acff34a5367cb5@mail.gmail.com> <5aa3b3570810230927m696f9c27gdfec084647c2c509@mail.gmail.com> Message-ID: <320fb6e00810231001w2345bbe5r8c1727ddf883553c@mail.gmail.com> Giovanni wrote: > So, how should we modify the current GenePop parser to make it work as an > iterator? I think this would mean breaking up the current Record object (which holds everything) into sub-records which can be yielded one by one. This would require an API change, unless you wanted to continue to offer the two approaches in parallel (not elegant, but see Bio/Sequencing/Ace.py for an example of where this made sense to do). > Now it has a 'Scanner' and 'Consumer' methods. Should I remove them and > write a RecordIterator instead? > ... > Can you explain me more or less how the 'Consumer' object works? It is > mandatory to use it when creating biopython objects? You can write an iterator with or without the Scanner/Consumer style of parser. The Scanner/Consumer system is very flexible if you want to parse the data into different objects (by using different consumers). In theory the end user could also use the provided scanner with their own consumer. However, in my opinion for parsing sequence file formats this was overkill (needlessly complicated) - as only one object is really needed to represent a sequence (we have the SeqRecord for this), so most of the recent parsers in Bio.SeqIO and Bio.AlignIO do not use the scanner/consumer setup. See also the short Tutorial section "Parser Design". http://biopython.org/DIST/docs/tutorial/Tutorial.html For population genetics given there is no one universal record object, perhaps the flexibility of the Scanner/Consumer system is worth while. On the other hand, Tiago currently has the scanner/consumer in Bio.PopGen.GenePop as private objects so this is currently a private implementation detail - one could replace the Scanner/Consumer details without breaking the public API. 
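For illustration, here is a toy generator-style parser in that spirit - a sketch of the yield approach only, not the real Bio.PopGen.GenePop code:

def iter_populations(handle):
    """Toy generator function: yield one list of raw lines per population.

    This is only a sketch of the yield-based approach, not the real
    Bio.PopGen.GenePop parser; it assumes each population block starts
    with a line reading 'Pop' (any case), as in GenePop files.
    """
    in_pop = False
    current = []
    for line in handle:
        line = line.rstrip()
        if not line:
            continue
        if line.split()[0].upper() == "POP":
            # A new population block starts; hand the previous one back.
            if in_pop and current:
                yield current
            in_pop = True
            current = []
        elif in_pop:
            current.append(line)
    if in_pop and current:
        yield current  # don't forget the last population

# Usage (the file name is hypothetical); memory use stays low because the
# populations are produced one at a time:
for individuals in iter_populations(open("example.gen")):
    print "Population with %i individuals" % len(individuals)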
Peter From biopython at maubp.freeserve.co.uk Fri Oct 24 08:52:25 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Oct 2008 09:52:25 +0100 Subject: [BioPython] Retrieving nucleotide sequence for given accession Entrez ID In-Reply-To: <61B0EE7C247C1349881F63414448FC1F078874C1@EXEVS06.its.uncc.edu> References: <61B0EE7C247C1349881F63414448FC1F078874BC@EXEVS06.its.uncc.edu> <320fb6e00810221315i31358bc2n2e5c9be405a77e42@mail.gmail.com> <61B0EE7C247C1349881F63414448FC1F078874C1@EXEVS06.its.uncc.edu> Message-ID: <320fb6e00810240152t6e6123d3la00f1fe43121b985@mail.gmail.com> Hi Richard, I've taken the liberty of CC'ing this back to the mailing list, Richard Clary wrote: > Much appreciation Peter--it worked perfectly. Good :) > If you are wanting to > retrieve multiple sequences, is a simple "+" string concatenation > sufficient as the case when using eUtils or approach it by creating > a tuple or dictionary and passing arguments? > > Richard Moving on to your multi-sequence question, using "+" doesn't seem to work - you should use a comma for concatenating the IDs when calling eFetch. What made you think of "+" here? One other tweak is that Bio.SeqIO.read(...) is for when the handle contains one and only one record. In general you'll need to use Bio.SeqIO.parse(...) instead and iterate over the records. Depending on what you want to achieve, maybe: from Bio import Entrez, SeqIO id_list = ["186972394","12345678"] Entrez.email = "Richard at example.com" #Tell the NCBI who you are handle = Entrez.efetch(db="nucleotide", id=",".join(id_list),rettype="fasta") for id,record in zip(id_list,SeqIO.parse(handle, "fasta")) : assert id in record.id, "Didn't get ID %s returned!" % id print "%s = %s" % (record.id, record.seq) #seq_str = str(record.seq) If you still want just plain strings for the sequence, maybe: from Bio import Entrez, SeqIO id_list = ["186972394","12345678"] Entrez.email = "Richard at example.com" #Tell the NCBI who you are handle = Entrez.efetch(db="nucleotide", id=",".join(id_list),rettype="fasta") seq_str_list = [str(record.seq) for record in SeqIO.parse(handle, "fasta")] If you haven't already done so, please read the NCBI guidelines for using Entrez, http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements Also, have a look at the Entrez chapter in the tutorial, especially the "history" support which may be relevant. http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Peter From dalloliogm at gmail.com Fri Oct 24 09:08:54 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Fri, 24 Oct 2008 11:08:54 +0200 Subject: [BioPython] [bip] [OT] Revision control and databases In-Reply-To: <49008265.3040205@gmail.com> References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> <49008265.3040205@gmail.com> Message-ID: <5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com> On Thu, Oct 23, 2008 at 3:55 PM, Bruce Southey wrote: > Giovanni Marco Dall'Olio wrote: > >> Hi, >> I have a question (well, it's not directly related to biopython or pygr, >> but to scientific computing). >> >> I always used flat files to store results and data for my bioinformatics >> analys, but not (as I was saying in another thread) I would like to start >> using a database to do that. >> > Of course Biopython's BioSQL interface may provide a starting point. 
The problem is that BioSQL doesn't support yet Population Genetics record (see another thread in biopython mailing list), so I would have to implement something like that in BioSQL or wait for the developers to do it. Maybe I will do this later, but now I don't have the time. > > The problem is I don't know if databases do Revision Control. >> When I used flat files, I was used to save all the results in a git >> repository, and, everytime something was changed or calculated again, I did >> commit it. >> Do you know how to do this with databases? Does MySQL provide support for >> revision control? >> Thanks :) >> > I think you are asking the wrong questions because it depends on what you > want to do and what you actually store. There are a number of questions that > you need to ask yourself about what you really need to do (knowing you have > used git helps refine these). Examples include: > How often do you use the old versions in your git repository? > How do you use the old revisions in your git repository? > Do you even use the information of an older version if a newer version > exists? > How many users that can make changes? > How often do you have conflicts? > Are the conflicts hard to solve? These are all very good questions. The problem is that I consider revision control as a 'good practice': I remember that when I was not used to keep an history of the changes to my data, it was a mess. I would like to have at least a 'version' field, to know how much my data is old. I have found this : - http://pgfoundry.org/projects/tablelog/ which seems interesting. I think this is a big issue for bioinformatics. How is it possible that nobody has never tried to implement such a functionality for databases? Version Control could be difficult to implement, but not so much. There is must be something that I can reuse... Do you actually determine when 'something was changed or calculated again' > or it this partly determined by an external source like a Genbank or UniProt > update? (At least in a database approach you could automate this.) Well, it could be useful to > > > Revision control may be overkill for your use because this is aims to > handle many tasks and change conflicts related to multiple users rather than > a single user. If you don't need all these fancy features then you can use > a database. If you just want to store and retrieve a version then you can > use a database but you need to at least force the inclusion a date and > comment fields to be useful. Maybe there are other similar tools. This is a big issue for bioinformatics. I think it is a good, when working with Unfortunately I think revision control would be very useful for me. The data in the database will be used and uploaded by 4 or 5 people. It will be used also to store the results from some script: > > > > Regards > Bruce > Thank you very much for all the replies.. I didn't expect so many of them. 
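Just to make the idea concrete, here is a minimal sketch of such do-it-yourself versioning using sqlite3 (the table layout, analysis names and values are invented, not anything from BioSQL or tablelog):

import sqlite3

conn = sqlite3.connect(":memory:")  # throw-away example database
conn.execute("""CREATE TABLE result (
                    analysis TEXT,
                    version INTEGER,
                    created TEXT,
                    comment TEXT,
                    value REAL,
                    PRIMARY KEY (analysis, version))""")
conn.execute("INSERT INTO result VALUES (?, ?, datetime('now'), ?, ?)",
             ("fst_chr1", 1, "first run", 0.12))
conn.execute("INSERT INTO result VALUES (?, ?, datetime('now'), ?, ?)",
             ("fst_chr1", 2, "rerun with more SNPs", 0.15))
# Fetch the newest version of a stored result.
value, version = conn.execute(
    """SELECT value, version FROM result
       WHERE analysis = ? ORDER BY version DESC LIMIT 1""",
    ("fst_chr1",)).fetchone()
print "latest fst_chr1 = %s (version %i)" % (value, version)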
-- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Fri Oct 24 09:25:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Oct 2008 10:25:31 +0100 Subject: [BioPython] [bip] [OT] Revision control and databases In-Reply-To: <5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com> References: <5aa3b3570810230430v7968e792vda0f48fd6247b921@mail.gmail.com> <49008265.3040205@gmail.com> <5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com> Message-ID: <320fb6e00810240225y1c380de5y6144a80ece808b2c@mail.gmail.com> Giovanni Marco Dall'Olio wrote: > Bruce Southey wrote: >> Of course Biopython's BioSQL interface may provide a starting point. > > The problem is that BioSQL doesn't support yet Population Genetics record > (see another thread in biopython mailing list), so I would have to implement > something like that in BioSQL or wait for the developers to do it. > Maybe I will do this later, but now I don't have the time. BioSQL currently focuses on annotated sequences, but they are working on some phylogenetics support too. See http://www.biosql.org/ and the PhyloDB extension module. If there was enough interest, perhaps a BioSQL schema for Population Genetics could be devised too. Giovanni Marco Dall'Olio wrote: >>> The problem is I don't know if databases do Revision Control. >>> When I used flat files, I was used to save all the results in a git >>> repository, and, everytime something was changed or calculated >>> again, I did commit it. >>> Do you know how to do this with databases? Does MySQL >>> provide support for revision control? As other people have said, databases don't generally "waste" resources on version control. If you need this, then it is up to you to design your schema to record this additional metadata. For example, the BioSQL sequences have a "version" field in the "bioentry" table allowing multiple revisions of the same accession to be held. When querying the database, you could request a particular version, or indeed the latest version. Essentially AFAIK database version control is a Do-It-Yourself affair when designing your database tables. Peter From lpritc at scri.ac.uk Fri Oct 24 09:51:35 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Fri, 24 Oct 2008 10:51:35 +0100 Subject: [BioPython] [bip] [OT] Revision control and databases In-Reply-To: <5aa3b3570810240208j76a770c5ue175089176fa050@mail.gmail.com> Message-ID: On 24/10/2008 10:08, "Giovanni Marco Dall'Olio" wrote: > The problem is that BioSQL doesn't support yet Population Genetics record > (see another thread in biopython mailing list), so I would have to implement > something like that in BioSQL or wait for the developers to do it. > Maybe I will do this later, but now I don't have the time. To be fair, that's a different problem from version control... >> How often do you use the old versions in your git repository? >> How do you use the old revisions in your git repository? >> Do you even use the information of an older version if a newer version >> exists? >> How many users that can make changes? >> How often do you have conflicts? >> Are the conflicts hard to solve? > > These are all very good questions. 
> The problem is that I consider revision control as a 'good practice' I think that you're right - it is good practice, and Bruce raises excellent questions here: what individuals want or need from version control depends greatly on their own situation, and whether a particular package fits your own needs will depend on what they are. If you don't know what they are before choosing a package, then there's the risk of making an unsuitable choice. It's worth noting that revision control can also mean slightly different things to different people. Some might say that a version number and an ID for the entity (human or automated) making that change is sufficient. Some might say that you ought not to stop short of conflict resolution and branch control. It depends on the needs of your project, IMO. > I think this is a big issue for bioinformatics. How is it possible that nobody > has never tried to implement such a functionality for databases Databases (DBMS, to be picky) are a general-purpose solution for many different kinds of problem. Revision control is an inhomogeneous problem with no optimal solution that can be implemented in many ways and not only using DBMS. There are plenty of revision control examples implemented in databases, and the examples that first come to mind in Python for me are content management systems such as Zope and Plone. I think that BASE implements one, but it's a long time since I looked at it. > Unfortunately I think revision control would be very useful for me. > The data in the database will be used and uploaded by 4 or 5 people. Then at a minimum you may need a solution that records version changes, and associates versions with individuals (and perhaps individual runs of scripts). You may also need locking and collision detection/conflict resolution (which DBMS like MySQL and PostgreSQL support internally via transactions; they don't generally implement version control because it would be wasteful), depending on whether you expect that multiple people might modify the same file at or at about the same time. Best, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). 
______________________________________________________________________ From cy at cymon.org Fri Oct 24 10:46:28 2008 From: cy at cymon.org (Cymon Cox) Date: Fri, 24 Oct 2008 11:46:28 +0100 Subject: [BioPython] BioSQL / phylodb Message-ID: <7265d4f0810240346m456e724ax49f7c18048a29749@mail.gmail.com> Hi All, Ive been looking at the phylodb extension to BioSQL. Does anyone have any python code for uploading a tree? Cheers, C. -- ____________________________________________________________________ Cymon J. Cox From biopython at maubp.freeserve.co.uk Fri Oct 24 10:54:28 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Oct 2008 11:54:28 +0100 Subject: [BioPython] BioSQL / phylodb In-Reply-To: <7265d4f0810240346m456e724ax49f7c18048a29749@mail.gmail.com> References: <7265d4f0810240346m456e724ax49f7c18048a29749@mail.gmail.com> Message-ID: <320fb6e00810240354q2b3c2a93p3c0c45b5ed48df3c@mail.gmail.com> On Fri, Oct 24, 2008 at 11:46 AM, Cymon Cox wrote: > Hi All, > > Ive been looking at the phylodb extension to BioSQL. Does anyone have any > python code for uploading a tree? > > Cheers, C. Not that I'm aware of, no. Adding support to Biopython's BioSQL module to do this, and also retrieve the data as a tree would be nice. The Bio.Nexus.Tree class would seem a logical representation to try and use. As an aside, being about to load a taxonomy from the main BioSQL taxon/taxon_name tables as a tree might be nice too. Peter From kteague at bcgsc.ca Fri Oct 24 18:32:41 2008 From: kteague at bcgsc.ca (Kevin Teague) Date: Fri, 24 Oct 2008 11:32:41 -0700 Subject: [BioPython] [bip] [OT] Revision control and databases In-Reply-To: References: Message-ID: <3F2B0CD4-83DF-4A88-B22D-926B97503B7C@bcgsc.ca> > >> I think this is a big issue for bioinformatics. How is it possible >> that nobody >> has never tried to implement such a functionality for databases > > Databases (DBMS, to be picky) are a general-purpose solution for many > different kinds of problem. Revision control is an inhomogeneous > problem > with no optimal solution that can be implemented in many ways and > not only > using DBMS. There are plenty of revision control examples > implemented in > databases, and the examples that first come to mind in Python for me > are > content management systems such as Zope and Plone. I think that BASE > implements one, but it's a long time since I looked at it. The default file storage for Zope Object Database (ZODB) appends all new database writes, keeping older transactions on disk (similar to the way PostgreSQL works). Back in the day (circa 2000) Zope 2 exposed this database-level feature at the application level in the Zope Management Interface (ZMI). So you could see all past writes to the database, and try and revert back to an older one if desired (using the "undo" tab of the ZMI). Problems with this approach included using sysadmin tools on the database could break application behaviour. e.g. lets say you had a "Document" object and a "Page Counter" object, you would wish to be able to view older versions of Documents, but only care about the current state of the Page Counters. However, if your Page Counters are changing like crazy and taking up tonnes of disk space and generally slowing down queries against the history of the database, there was no way to say "delete all outdated ephemeral Page Counter versions, but keep Document-related transactions" (especially since a Page Counter change and a Document change often commited in the same transaction). 
ZWiki exposed older revisions using this feature, and the accepted practice was to put each wiki into it's own database so that other forms of database maintenance didn't accidently blow away your wiki history ... it wasn't so pretty :P You also had problems reverting back to just a specific revision, for example if you were in Revision 3 and you had changes in Revision 1 that you wanted to go back to, but you'd made changes in Revision 2 that referenced Revision 1, then you first had to step-back to Revision 2 before you could revert back to Revision 1. Even though Revision 2 also contained a bunch of changes that you didn't want to revert, that you would then manually need to later re-apply. Ug! Zope 2 also had a Version object, you could poke a button in the UI to start a new "transaction" and then start making changes to code +content in the database. This was just implemented as a long-running transaction - from the point of starting to commiting a transaction could sometimes last for a whole month :). The problem being that when you finally wanted to commit the transaction to roll-out new features on a web site, if there were any conflicts from changes that happened you were hosed and would end-up copying those changes into a new transaction based off the latest database version and commiting that. It wasn't pretty :( It has long since been acknowledged by Zope developers that exposing database level features at the application level is a Bad Thing(TM)! Today there is a whole plethora of products for Zope that do some form of versioning, but they are all implemented at the application level. There is a whole plethora of products because there are many ways to do versioning, and the choices of how versions are managed is really best left up to the specific application. Some of these products provide reasonable APIs for implementing specific versioning within a specific platform - e.g Plone has a package called plone.app.iterate and it has APIs that use standard versioning terminology (checkin, checkout, working copy) for example: class ICheckinCheckoutTool( Interface ): def allowCheckin( content ): """ denotes whether a checkin operation can be performed on the content. """ def allowCheckout( content ): """ denotes whether a checkout operation can be performed on the content. """ def allowCancelCheckout( content ): """ denotes whether a cancel checkout operation can be performed on the content. """ def checkin( content, checkin_messsage ): """ check the working copy in, this will merge the working copy with the baseline """ def checkout( container, content ): """ """ def cancelCheckout( content ): """ From sbassi at gmail.com Sat Oct 25 01:03:43 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Fri, 24 Oct 2008 22:03:43 -0300 Subject: [BioPython] Loading dbxrefs from a gbk file Message-ID: I have a genbank file like this one: http://www.pastecode.com.ar/f231664eb I parse it with SeqIO.parse and the SeqRecord object I get is: SeqRecord(seq=Seq('GAGAAGGACGCGCGGCCCCCAGCGCCTCTTGGGTGGCCGCCTCGGAGCATGACC...ATA', IUPACAmbiguousDNA()), id='NM_000208.2', name='NM_000208', description='Homo sapiens insulin receptor (INSR), transcript variant 1, mRNA.', dbxrefs=[]) If you look at lines 130 to 133 (I highlighted in yellow) of the genbank sequence, there is cross database information (db_xref), but it is not associated with the SeqRecord, it is an empty list. According to http://www.biopython.org/wiki/SeqRecord, this condition is known, but I don't understand if this is on porpuse or is a bug. 
Best, SB. -- Vendo isla: http://www.genesdigitales.com/isla/ Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6 Bioinformatics news: http://www.bioinformatica.info Tutorial libre de Python: http://tinyurl.com/2az5d5 "It is pitch black. You are likely to be eaten by a grue." -- Zork From biopython at maubp.freeserve.co.uk Sat Oct 25 17:22:27 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 25 Oct 2008 18:22:27 +0100 Subject: [BioPython] Loading dbxrefs from a gbk file In-Reply-To: References: Message-ID: <320fb6e00810251022u577d4f16wa7602c7bdc664322@mail.gmail.com> On Sat, Oct 25, 2008 at 2:03 AM, Sebastian Bassi wrote: > I have a genbank file like this one: http://www.pastecode.com.ar/f231664eb > ... > If you look at lines 130 to 133 (I highlighted in yellow) of the > genbank sequence, there is cross database information (db_xref), but > it is not associated with the SeqRecord, it is an empty list. What you have highlighted is part of a gene feature, and would not be part of the SeqRecord's db_xref list. It should however be present in the relevant SeqRecord feature. Try: print my_record.features[1] (seeing as this is the second feature in the file, i.e. feature 1 using zero-based counting). Peter From tiagoantao at gmail.com Sun Oct 26 01:04:01 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Sun, 26 Oct 2008 02:04:01 +0100 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810190750k2faaa689hc55c59154b39479e@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com> <5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com> Message-ID: <6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com> [Sorry for the delay in answering] On Thu, Oct 23, 2008 at 5:25 PM, Giovanni Marco Dall'Olio wrote: > However I think it would be good to add to biopython at least some > funcionality to calculate Fst statistics and parse these file formats, at > least at the level at which BioPerl does. Agree. Statistics is fundamental. I decided to postpone stats when I started because I didn't want to to start with the core issue in population genetics (being unexperienced at the start would probably cause serious design errors). But I think now is the time. > What if we just translate the same functionalities and copy the population > objects from bioperl into biopython? I don' t think the population objects in bioperl scale well. It is not clear to me that their popgen module is a priority for them, and that they carefully designed them (altough that might have changed in the near past). I also don' t believe that my own code (which I supplied you) is in perfect shape to achieve this also. I have to write down my ideas and send them here as soon as possible. I will try to do it in the next couple of days at most. The core idea is that there is no good abstract population and individual objects, but they are also not needed. What is needed, in my view, are file parsers and statistics. 
Statistics should be organized in a systematic way. Example: all frequency-based, population-structure statistics should present the same interface, something like:

add_population(pop_name, individual_allele_list)

I will submit a small document for discussion very soon.

> I realize that it won't be the perfect solution: in fact, it is the same
> reason why I started this discussion here, the bioperl code wasn't optimized
> enought for what I want to do, but I didn't know how to modify perl modules
> and preferred python.

The important thing to notice is that biopython should not be optimized for your needs or mine; it has to be general enough to accommodate the vast majority of potential users. I've always tried to do things in a way that could be reused by others.

> Maybe we can just write a PED and GenePop parser and have let it work with
> GenePop and your modules to calculate Fst.

My suggestion would be for you to go ahead and do a Bio.PopGen.PED. You could do it in whatever way you see fit. Converting from PED to GenePop will make you lose information, if I understand correctly (as you have SNP info in PED files, which you don't in GenePop). The other formats that I support (FDist in the released code and FStat in the code that you have) are very similar to (or less informative than) GenePop. Again, my suggestion is for an independent parser, over which you would have absolute control as you would be the implementor. I understand that this might lead to some duplicated code (like split_in_pops), but repeated code is less of a problem than a generic object that ends up being wrong in the long run.

> We should agree with a population object that could be used as input for
> GenePop.

For the reasons above I will fight against a general Population object, at least for now. I don't feel confident that we have the experience to design one. It is important to notice that we cannot break backward compatibility without a very good reason, and I think that a generic population object would be severely revised in the future. In your specific case I also think you would suffer with a population object, as you need performance (parsing the file, creating the object, extracting information from the object, calculating the statistic). As I see it, a smaller chain would do (parse, convert to the statistic family's format, calculate the statistic).

> I think it would be good anyway to release even incomplete code to the
> public, because it could be useful for other people.

Incomplete is OK, but I think we would be releasing wrong code - code that would be redone in the future (and break interfaces with past versions). Also, a generic object would have performance problems (it would have to be able to store all the information). Well, I am ranting and not proposing a decent alternative. I will try to write down something decent and write up a proposal by Tuesday. I'm afraid the error is on my part: I have to write down what is in my head so that people can discuss whether it is a good idea or not.
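To make that interface idea concrete, here is a rough sketch (the class name, the statistic and the data are invented; this is not a proposed final API):

class AlleleFrequencies:
    """Toy sketch of the interface style proposed above.

    The point is the add_population(pop_name, individual_allele_list)
    entry point, which an Fst calculation could share; the statistic
    computed here (plain allele frequencies) is just for illustration.
    """
    def __init__(self):
        self.pops = {}

    def add_population(self, pop_name, individual_allele_list):
        # One tuple of alleles per (diploid) individual at a single locus,
        # e.g. [("A", "A"), ("A", "G"), ("G", "G")].
        self.pops[pop_name] = individual_allele_list

    def frequencies(self, pop_name):
        counts = {}
        total = 0
        for genotype in self.pops[pop_name]:
            for allele in genotype:
                counts[allele] = counts.get(allele, 0) + 1
                total += 1
        return dict((allele, float(n) / total)
                    for allele, n in counts.items())

stat = AlleleFrequencies()
stat.add_population("popA", [("A", "A"), ("A", "G"), ("G", "G")])
stat.add_population("popB", [("A", "G"), ("G", "G"), ("G", "G")])
print stat.frequencies("popA")  # {'A': 0.5, 'G': 0.5}

An Fst or any other frequency-based statistic could expose exactly the same add_population entry point, differing only in what it computes and returns.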
From tiagoantao at gmail.com Sun Oct 26 01:34:55 2008
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Sun, 26 Oct 2008 02:34:55 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810192241j16c587c5qbcb31e2749a2311a@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com> <5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com> <6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com>
Message-ID: <6d941f120810251834q87495d5re558cf179356a8b0@mail.gmail.com>

I just want to add an extra comment explaining why I oppose doing an individual object. I have the following questions (and others) in my mind, to which I don't know the answers. I am not looking for answers to them; I am just trying to illustrate the difficulty of the problem.

1. For a certain marker, do we store the genomic position of the marker? Some (most) statistics don't use this information. For many species this information is not even available. But for some statistics this information is mandatory...
2. For a microsatellite, do we store the motif and number of repeats or the whole sequence? (see 4)
3. If one is interested in SNPs and one has the full sequences, does one store the full sequences or just the SNPs? If you store just the SNPs then you cannot do sequence-based analysis in the future (say Tajima D). If you store everything then you are consuming memory and CPU.
4. If one just wants to do frequency statistics (Fst), do you store the marker or just assign each one an ID and store the ID? It is much cheaper to store an ID than a full sequence.

Populations:
1. Support for landscape genetics? I mean geo-referencing.
2. Support for hierarchical population structure?
3. Do we cache statistics results on Population objects?

Let me take your Marker class:

class Marker:
    total_heterozygotes_count = 0
    total_population_count = 0
    total_Purines_count = 0 # this could be renamed, of course
    total_Pyrimidines_count = 0

How would this be useful for microsatellites? Why purines, and what if my marker is a protein? If it is a SNP, don't I want to know the nucleotide? And if I am studying proteins, don't I want to have the amino acid?

Don't get me wrong, I have been down this path. Solving my particular problems is not very hard; having a framework that is usable by everybody is a damn hard problem. And we don't really need to solve it (OK, it would be nice to do things to populations in general, that I agree). But the fundamental task is: read a file, calculate statistics. That doesn't need population and individual objects. If we end up having too many formats, a consolidation step might be needed in the future (to avoid having 10 copies of split_in_pops). That much I agree with.
From sbassi at gmail.com Mon Oct 27 04:13:47 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Mon, 27 Oct 2008 01:13:47 -0300 Subject: [BioPython] Loading dbxrefs from a gbk file In-Reply-To: <320fb6e00810251022u577d4f16wa7602c7bdc664322@mail.gmail.com> References: <320fb6e00810251022u577d4f16wa7602c7bdc664322@mail.gmail.com> Message-ID: On Sat, Oct 25, 2008 at 2:22 PM, Peter wrote: > What you have highlighted is part of a gene feature, and would not be > part of the SeqRecord's db_xref list. It should however be present in > the relevant SeqRecord feature. Try: OK, thank you. From lueck at ipk-gatersleben.de Mon Oct 27 13:43:49 2008 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Mon, 27 Oct 2008 14:43:49 +0100 Subject: [BioPython] ClustaW problem upwards Biopython 1.43 Message-ID: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de> Hi! I just releazed, that a ClustalW alignment gives an error message under Biopython 1.44 and 1.47 whereas under 1.43 everything works fine. The message is the following (example of the tutorial): Traceback (most recent call last): File "I:\Final\pair_align.py", line 90, in pair_align alignment = Clustalw.do_alignment(cline) File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 79, in do_alignment status = run_clust.close() IOError: [Errno 0] Error Does someone know what's the problem? Kind regards Stefanie From biopython at maubp.freeserve.co.uk Mon Oct 27 15:12:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Oct 2008 15:12:13 +0000 Subject: [BioPython] ClustaW problem upwards Biopython 1.43 In-Reply-To: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de> References: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00810270812le76ae75m55f53107c2572a34@mail.gmail.com> On Mon, Oct 27, 2008 at 1:43 PM, Stefanie L?ck wrote: > Hi! > > I just releazed, that a ClustalW alignment gives an error message under Biopython 1.44 and 1.47 whereas under 1.43 everything works fine. > > The message is the following (example of the tutorial): > > Traceback (most recent call last): > File "I:\Final\pair_align.py", line 90, in pair_align > alignment = Clustalw.do_alignment(cline) > File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 79, in do_alignment > status = run_clust.close() > IOError: [Errno 0] Error > > Does someone know what's the problem? There were some changes made between Biopython 1.43 and 1.44 to try and deal with spaces in filenames. Could you do: print str(cline) That should show the exact command line python is trying to run. What happens if you try this command at the "DOS" prompt? Also, what version of clustalw do you have installed? Thanks, Peter From biopython at maubp.freeserve.co.uk Mon Oct 27 17:49:59 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Oct 2008 17:49:59 +0000 Subject: [BioPython] Deprecating Bio.mathfns, Bio.stringfns and Bio.listfns? Message-ID: <320fb6e00810271049t2aa3fac4s1907027307b035f1@mail.gmail.com> Dear Biopythoneers, Is anyone currently using Bio.mathfns, Bio.stringfns or Bio.listfns? These provide a selection of maths, string and list functions - some of which are apparently irrelevant with changes or additions to python itself (e.g. sets). I'd like to declare these as deprecated for the next release, or at least obsolete and likely to be deprecated in future - so if you are using these modules or would like to defend them, please speak up soon. Thanks, Peter P.S. 
If you care about the details, there is a longer discussion on the dev-mailing list: http://lists.open-bio.org/pipermail/biopython-dev/2008-October/004472.html From biopython at maubp.freeserve.co.uk Mon Oct 27 17:57:20 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Oct 2008 17:57:20 +0000 Subject: [BioPython] Deprecating the obsolete Bio.Ndb module? Message-ID: <320fb6e00810271057l181cbb1fw15aa8f03e4159328@mail.gmail.com> Dear Biopythoneers, The Bio.Ndb module (written six years ago) provides an HTML parser for the NDB website (nucleotide database, a repository of three-dimensional structural information about nucleic acids). The URL has changed, but this service is still running. However, the webpage layout has changed considerably - Their front page mentions a major revision in Jan 2008. Unless anyone would like to volunteer to look after the Bio.Ndb module and bring it up to date, I'm suggesting we deprecate it for the next release of Biopython. Peter From lueck at ipk-gatersleben.de Tue Oct 28 08:10:25 2008 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Tue, 28 Oct 2008 09:10:25 +0100 Subject: [BioPython] ClustaW problem upwards Biopython 1.43 References: <008701c9383a$0aaa2070$1022a8c0@ipkgatersleben.de> <320fb6e00810270812le76ae75m55f53107c2572a34@mail.gmail.com> Message-ID: <001a01c938d4$a19887c0$1022a8c0@ipkgatersleben.de> Hi! >>> print str(cline) clustalw pb.fasta -OUTFILE=test2.aln I'm using CLUSTAL W 2.0. Under DOS everything works fine. Regards Stefanie ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Monday, October 27, 2008 4:12 PM Subject: Re: [BioPython] ClustaW problem upwards Biopython 1.43 On Mon, Oct 27, 2008 at 1:43 PM, Stefanie L?ck wrote: > Hi! > > I just releazed, that a ClustalW alignment gives an error message under > Biopython 1.44 and 1.47 whereas under 1.43 everything works fine. > > The message is the following (example of the tutorial): > > Traceback (most recent call last): > File "I:\Final\pair_align.py", line 90, in pair_align > alignment = Clustalw.do_alignment(cline) > File "C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py", line 79, > in do_alignment > status = run_clust.close() > IOError: [Errno 0] Error > > Does someone know what's the problem? There were some changes made between Biopython 1.43 and 1.44 to try and deal with spaces in filenames. Could you do: print str(cline) That should show the exact command line python is trying to run. What happens if you try this command at the "DOS" prompt? Also, what version of clustalw do you have installed? Thanks, Peter From dalloliogm at gmail.com Tue Oct 28 10:46:39 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Tue, 28 Oct 2008 11:46:39 +0100 Subject: [BioPython] a common repository for test datasets/use cases for all Bio* projects Message-ID: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com> Hi, I would like to make you a proposal. Every module/program written in bioinformatics needs to be tested before it can be used to produce results that can be published. For example, let's say I want to write another fasta file parser, like SeqIO.FastaIO in biopython : I would have have to test the script against some real fasta files, just to make sure that it doesn't parse them in a wrong way, or that it losts data. 
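For instance, a test of that sort might be as simple as the following sketch (the file name and expected values are of course made up):

from Bio import SeqIO

# A sanity check against a file whose contents are already known.
records = list(SeqIO.parse(open("known_test.fasta"), "fasta"))
assert len(records) == 3, "expected 3 records"
assert records[0].id == "seq1", "unexpected first record ID"
assert len(records[0].seq) == 120, "unexpected sequence length"
print "known_test.fasta parsed as expected"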
Or, let's say I want to write a script to calculate Fst statistics over some population genetics data: I will have to compare the results of my script against other programs, check whether it gives me the right result for a set for which I already know the Fst value, and maybe devise some other kinds of checks to be sure my script doesn't do weird things, like losing input data on the way.

So, the point is.. what if we create a common repository for all this kind of testing data, to be used in common with all the other Bio* projects? Wouldn't it be good if all the Bio* fasta parsers were able to parse the same files and give the same results, demonstrating that either all of them work fine or all are wrong at the same time?

I am doing this because Tiago and I would like to develop a module to calculate Fst statistics over SNP data, and there is no point in collecting some good test datasets and not sharing them with other similar projects in other programming languages.

The same goes for much of the documentation, like use cases: if we collect a good base of use cases related to bioinformatics, it would be easier to coordinate the efforts of all the Bio* projects and compare the different approaches used to solve the same issue by the different communities.

At the moment, I have created a simple git repository on github:
- http://github.com/dalloliogm/bio-test-datasets-repository
but it is still empty, and maybe github is not the ideal hosting for such a project, since the free account has a 100MB space limit.

-- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk Tue Oct 28 10:55:04 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Oct 2008 10:55:04 +0000
Subject: [BioPython] a common repository for test datasets/use cases for all Bio* projects
In-Reply-To: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com>
References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com>
Message-ID: <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com>

On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio wrote:
> Hi,
> I would like to make you a proposal.
> Every module/program written in bioinformatics needs to be tested
> before it can be used to produce results that can be published.
> ...
> So, the point is.. what if we create a common repository for all this
> kind of testing data, to be used in common with all the other Bio*
> projects?

You made some other good points, and this is a good idea. In practice the licences are usually OK for us to "borrow" example input files from each other (and this does happen), but a more organised system to encourage interchange of examples would be good. I think this sounds like an excellent topic for the (currently very quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev discussion, one of the OBF mailing lists, this should cover all the Bio* project members interested).
See http://lists.open-bio.org/mailman/listinfo Peter From dalloliogm at gmail.com Tue Oct 28 11:00:42 2008 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Tue, 28 Oct 2008 12:00:42 +0100 Subject: [BioPython] a common repository for test datasets/use cases for all Bio* projects In-Reply-To: <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com> References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com> <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com> Message-ID: <5aa3b3570810280400t510468d1sbce5bb0977ec772b@mail.gmail.com> On Tue, Oct 28, 2008 at 11:55 AM, Peter wrote: > On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio > > I think this sounds like an excellent topic for the (currently very > quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev > discussion, one of the OBF mailing lists, this should cover all the > Bio* project members interested). See > http://lists.open-bio.org/mailman/listinfo > > Peter Thanks!! I didn't know of this list!! > -- ----------------------------------------------------------- My Blog on Bioinformatics (italian): http://bioinfoblog.it From biopython at maubp.freeserve.co.uk Tue Oct 28 11:20:21 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Oct 2008 11:20:21 +0000 Subject: [BioPython] ClustalW problem upwards Biopython 1.43 Message-ID: <320fb6e00810280420t75f62774x55335e8a5aa11151@mail.gmail.com> Stephanie wrote: > >>>> print str(cline) > > clustalw pb.fasta -OUTFILE=test2.aln > > I'm using CLUSTAL W 2.0. Are you sure? The Clustal W 2.0 executable is normally called clustalw2.exe rather than clustalw.exe - so based on the command line above I would have expect Clustalw 1.x to be used. Maybe you have both versions of ClustalW installed? Could you tell me where exactly (full paths) you have Clustalw.exe and/or Clustalw2.exe installed? This would be helpful for the new unit test I'm working on. > Under DOS everything works fine. I've been having "fun" trying to get a new unit test for this to work nicely on Windows - there a certainly some combinations of file name arguments with spaces etc which won't work on Biopython 1.48. I found examples where the command line string ran "by hand" at the "DOS" prompt worked fine, but would fail when invoked in python via os.popen - on the bright side, using subprocess.Popen instead works much better (although this isn't available for python 2.3). If you want to try this new code, I would suggest you first install Biopython 1.48, and then backup and update C:\Python25\Lib\site-packages\Bio\Clustalw\__init__.py to revision 1.25 from CVS which you can download here (should be updated within the hour): http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Clustalw/__init__.py?cvsroot=biopython Thanks! Peter From peter at maubp.freeserve.co.uk Tue Oct 28 11:36:15 2008 From: peter at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Oct 2008 11:36:15 +0000 Subject: [BioPython] Dropping Python 2.3 support? Message-ID: <320fb6e00810280436m7cf48993v8b0562bb44919128@mail.gmail.com> Dear all, Those of you following the dev-mailing list will probably be aware that we've been making excellent progress in CVS to get Biopython to run fine on Python 2.6. However, the downside is that continuing to support Python 2.3 is beginning to be pain (triggered for the most part by some older modules being deprecated in python 2.6). Does anyone on the mailing list still use Python 2.3? e.g. 
older Linux servers, or people still using Apple Mac OS X 10.4 Tiger (or older). What I'd like to suggest is that the next one or two releases will still support Python 2.3, but after that we'll drop support for Python 2.3. Thanks, Peter P.S. For the record, until recently my main Windows machine ran Python 2.3 only - giving me a vested interesting in continuing Python 2.3 support ;) From jblanca at btc.upv.es Tue Oct 28 11:52:29 2008 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 28 Oct 2008 12:52:29 +0100 Subject: [BioPython] caf format support Message-ID: <200810281252.29607.jblanca@btc.upv.es> Hi, I'm currently dealing with caf contig files. Has BioPython support for this format? Do you know of other alternatives in python or perl to deal with it? Best regards, -- Jose M. Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From biopython at maubp.freeserve.co.uk Tue Oct 28 12:16:33 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Oct 2008 12:16:33 +0000 Subject: [BioPython] caf format support In-Reply-To: <200810281252.29607.jblanca@btc.upv.es> References: <200810281252.29607.jblanca@btc.upv.es> Message-ID: <320fb6e00810280516j72af2c70q46790c217585b2c5@mail.gmail.com> On Tue, Oct 28, 2008 at 11:52 AM, Jose Blanca wrote: > Hi, > I'm currently dealing with caf contig files. Has BioPython support for this > format? Do you know of other alternatives in python or perl to deal with it? > Best regards, I'm not aware of any Biopython code for CAF contig files. However, have a look at http://www.sanger.ac.uk/Software/formats/CAF/userguide.shtml where some perl tools are described, including some for converting CAF into other formats. We do have ACE and PHRED (used by PHRAP) parsers in Bio.Sequencing, so adding Bio.Sequencing.CAF might be logical. Peter From cjfields at illinois.edu Tue Oct 28 12:26:32 2008 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 28 Oct 2008 07:26:32 -0500 Subject: [BioPython] a common repository for test datasets/use cases for all Bio* projects In-Reply-To: <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com> References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com> <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com> Message-ID: All, An open-bio repository had started up for this use at one point, though I don't think it made the transition to subversion yet (and it never really took off, not sure why). You should try contacting open- bio support and maybe Jason or Chris D. can answer this in a bit more detail. chris On Oct 28, 2008, at 5:55 AM, Peter wrote: > On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio > wrote: >> Hi, >> I would like to make you a proposal. >> Every module/program written in bioinformatics needs to be tested >> before it can be used to produce results that can be published. >> ... >> So, the point is.. what if we create a common repository for all this >> kind of testing data, to be used in common with all the other Bio* >> projects? > > You you made some other good points, and this is a good idea. In > practice the licences are usually OK for use to "borrow" example input > files from each other (and this does happen), but a more organised > system to encourage interchange of examples would be good. 
> > I think this sounds like an excellent topic for the (currently very > quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev > discussion, one of the OBF mailing lists, this should cover all the > Bio* project members interested). See > http://lists.open-bio.org/mailman/listinfo > > Peter > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython

Christopher Fields Postdoctoral Researcher Lab of Dr. Marie-Claude Hofmann College of Veterinary Medicine University of Illinois Urbana-Champaign

From bsouthey at gmail.com Tue Oct 28 13:56:34 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 28 Oct 2008 08:56:34 -0500 Subject: [BioPython] a common repository for test datasets/use cases for all Bio* projects In-Reply-To: References: <5aa3b3570810280346o18eb0054q58748bc1637243ed@mail.gmail.com> <320fb6e00810280355y3830857bp13f04a861b39fd97@mail.gmail.com> Message-ID: <49071A12.8060705@gmail.com>

Chris Fields wrote: > All, > > An open-bio repository had started up for this use at one point, > though I don't think it made the transition to subversion yet (and it > never really took off, not sure why). You should try contacting > open-bio support and maybe Jason or Chris D. can answer this in a bit > more detail. > > chris > > On Oct 28, 2008, at 5:55 AM, Peter wrote: > >> On Tue, Oct 28, 2008 at 10:46 AM, Giovanni Marco Dall'Olio >> wrote: >>> Hi, >>> I would like to make you a proposal. >>> Every module/program written in bioinformatics needs to be tested >>> before it can be used to produce results that can be published. >>> ... >>> So, the point is.. what if we create a common repository for all this >>> kind of testing data, to be used in common with all the other Bio* >>> projects? >> >> You made some other good points, and this is a good idea. In >> practice the licences are usually OK for us to "borrow" example input >> files from each other (and this does happen), but a more organised >> system to encourage interchange of examples would be good. >> >> I think this sounds like an excellent topic for the (currently very >> quiet) Open-Bio-l mailing list (Open Bioinformatics Cross Project dev >> discussion, one of the OBF mailing lists, this should cover all the >> Bio* project members interested). See >> http://lists.open-bio.org/mailman/listinfo >> >> Peter >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Marie-Claude Hofmann > College of Veterinary Medicine > University of Illinois Urbana-Champaign > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython >

There has been some discussion on the scipy lists about data sets that you should look at. One of the most critical questions that you must address is copyright and who owns the data sets (credit where credit is due). Ultimately any data set will be distributed in some form, and that really brings in copyright issues and such. This is also country-specific, because there is the question of whether or not a data set can be copyrighted and under what terms - I am not a lawyer, so I do not know.
The Science Commons has various other useful information, especially the FAQ on databases, http://sciencecommons.org/resources/faq/databases/, which states "In the United States, data will be protected by copyright only if they express creativity". I do believe you would need to be very strict on what is acceptable, because once it is distributable you cannot rely on the user being responsible:

1) If it has been used for publication, an extremely clear statement from the owner (publisher) that it can be made available is required.
2) If the data is created from publicly available sources that allow it, e.g. Uniprot (http://www.uniprot.org/help/license), then exactly recreatable sets must be made available so the data can be obtained from that source (this must include the specific release, as databases change).
3) If the data is from private sources then it must be released under a suitable license that cannot be superseded by publication or a change in ownership.

Also, the submitted data should not change even if there are errors. For example, Fisher's iris data at http://archive.ics.uci.edu/ml/datasets/Iris has documented errors. Rather, it would be better to use version numbers.

Regards Bruce

From biopython at maubp.freeserve.co.uk Tue Oct 28 15:04:21 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Oct 2008 15:04:21 +0000 Subject: [BioPython] Translation method for Seq object In-Reply-To: <320fb6e00810200222s641e165eqef3b209893a8d976@mail.gmail.com> References: <320fb6e00810191152u13a2ee80pe21fe950dc3d046a@mail.gmail.com> <320fb6e00810200222s641e165eqef3b209893a8d976@mail.gmail.com> Message-ID: <320fb6e00810280804k1ef53ec1od53c33915da61c3@mail.gmail.com>

On 20th Oct I wrote: > Of course, someone is still bound to try calling the [Seq object's] > translate method with a string mapping. Maybe we should add a > bit of defensive code to check the table argument, and print a > helpful error message when this happens?

I've just added that in CVS: if the table argument is a 256 character string, then a ValueError is raised suggesting the use of str(my_seq).translate(...) instead.

Peter

From biopython at maubp.freeserve.co.uk Tue Oct 28 17:17:36 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Oct 2008 17:17:36 +0000 Subject: [BioPython] Should Bio.SeqIO.write(...) return the number of records? Message-ID: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com>

Dear all, I wanted to get some feedback on a possible enhancement to the Bio.SeqIO.write(...) and Bio.AlignIO.write(...) functions to make them return the number of records/alignments written to the handle. I've filed enhancement Bug 2628 to track this idea. http://bugzilla.open-bio.org/show_bug.cgi?id=2628

When creating a sequence (or alignment) file, it is sometimes useful to know how many records (or alignments) were written out. This is easy if your records are in a list:

records = list(...)
SeqIO.write(records, handle, format)
print "Wrote %i records" % len(records)

If however your records come from a generator/iterator (e.g. a generator expression, or some other iterator) you cannot use len(records). You could turn this into a list just to count them, but this wastes memory. It would therefore be useful to have the count returned:

records = some_generator
count = SeqIO.write(records, handle, format)
print "Wrote %i records" % count

Currently Bio.SeqIO.write(...) and Bio.AlignIO.write(...) have no return value, so adding a return value would be a backwards compatible enhancement. For a precedent, the BioSQL loader returns the number of records loaded into the database.

Peter
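A toy sketch of the proposed counting behaviour. This is not the real Bio.SeqIO code, just an invented bare-bones FASTA-style writer (records only need .id and .seq attributes, as SeqRecord objects have) showing how the count falls out naturally while consuming an iterator:

def write_and_count(records, handle):
    # 'records' can be a list or a generator; we count as we go rather
    # than calling len(), which generators do not support.
    count = 0
    for record in records:
        handle.write(">%s\n%s\n" % (record.id, record.seq))  # minimal FASTA output
        count += 1
    return count

# Usage with a generator - no need to build a list just to count it:
# count = write_and_count(some_generator, open("demo.fasta", "w"))
# print "Wrote %i records" % count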
From sbassi at gmail.com Tue Oct 28 17:43:27 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Tue, 28 Oct 2008 14:43:27 -0300 Subject: [BioPython] Should Bio.SeqIO.write(...) return the number of records? In-Reply-To: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com> References: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com> Message-ID:

On Tue, Oct 28, 2008 at 2:17 PM, Peter wrote: > count = SeqIO.write(records, handle, format) > print "Wrote %i records" % count

I'm for it. It doesn't hurt to add a backward compatible feature.

From biopython at maubp.freeserve.co.uk Tue Oct 28 18:16:58 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Oct 2008 18:16:58 +0000 Subject: [BioPython] Should Bio.SeqIO.write(...) return the number of records? In-Reply-To: References: <320fb6e00810281017m5a74a9dfh4aa18952b9a561be@mail.gmail.com> Message-ID: <320fb6e00810281116u6460c62fs77ece727689fba3b@mail.gmail.com>

Sebastian Bassi wrote: > > I'm for it. It doesn't hurt to add a backward compatible feature. >

Well, adding an unused feature does increase the long-term maintenance load - but if we agree this does seem useful, that's fine. Also, settling on the record/alignment count as the return value rules out any future alternative, but right now I can't think of any other sensible return value. I've written a patch against CVS to implement this - see Bug 2628 for details.

Peter

From tiagoantao at gmail.com Thu Oct 30 21:36:00 2008 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 30 Oct 2008 21:36:00 +0000 Subject: [BioPython] calculate F-Statistics from SNP data In-Reply-To: <6d941f120810251834q87495d5re558cf179356a8b0@mail.gmail.com> References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <5aa3b3570810200657i4ff7ded1p5198a801ff9eccd7@mail.gmail.com> <5aa3b3570810220325g563f6a22x3f30185ae3a01b4e@mail.gmail.com> <320fb6e00810220334n6aedc5a2m7a560c25ff703917@mail.gmail.com> <6d941f120810220903s6cdc034fhec369677ac5896c9@mail.gmail.com> <5aa3b3570810221010h787c74c7h65084e05964de71d@mail.gmail.com> <6d941f120810230810k4e48c48cp5c55722a851005cf@mail.gmail.com> <5aa3b3570810230925k1eccff39kd47f022842576a46@mail.gmail.com> <6d941f120810251804o31ed44cat49b407db36a6891e@mail.gmail.com> <6d941f120810251834q87495d5re558cf179356a8b0@mail.gmail.com> Message-ID: <6d941f120810301436m4bf12385s99d726bb000f7dd4@mail.gmail.com>

Hi, FYI, I am going to continue this discussion on biopython-dev, as I think it makes more sense there, especially the parts about implementation suggestions.

On Sun, Oct 26, 2008 at 1:34 AM, Tiago Antão wrote:
> I just want to add an extra comment explaining why I oppose having an individual object:
> I have the following questions (and others) in my mind, to which I don't know the answers. I am not looking for answers to them, I am just trying to illustrate the difficulty of the problem.
> 1. For a certain marker, do we store the genomic position of the marker? Some (most) statistics don't use this information. For many species this information is not even available. But for some statistics this information is mandatory...
> 2. For a microsatellite, do we store the motif and number of repeats, or the whole sequence? (see 4)
> 3. If one is interested in SNPs and one has the full sequences, does one store the full sequences or just the SNPs?
> If you store just the SNPs then you cannot do sequence-based analysis in the future (say Tajima's D). If you store everything then you are consuming memory and CPU.
> 4. If one just wants to do frequency statistics (Fst), do you store the marker, or just assign each one an ID and store the ID? It is much cheaper to store an ID than a full sequence.
>
> Populations:
> 1. Support for landscape genetics? I mean geo-referencing.
> 2. Support for hierarchical population structure?
> 3. Do we cache statistics results on Population objects?
>
> Let me take your class Marker:
> class Marker:
>     total_heterozygotes_count = 0
>     total_population_count = 0
>     total_Purines_count = 0  # this could be renamed, of course
>     total_Pyrimidines_count = 0
>
> How would this be useful for microsatellites? Why purines, and what if my marker is a protein? If it is a SNP, I want to know the nucleotide; and if I am studying proteins, I want to have the amino acid.
>
> Don't get me wrong, I have been down this path. Solving my particular problems is not very hard; having a framework that is usable by everybody is a damn hard problem. And we don't really need to solve it (OK, it would be nice to do things to populations in general, that I agree). But the fundamental task is: read file, calculate statistics. That doesn't need population and individual objects.
>
> If we end up having too many formats, a consolidation step might be needed in the future (to avoid having 10 split_in_pops). With that I agree.

--
"Data always beats theories. 'Look at data three times and then come to a conclusion,' versus 'coming to a conclusion and searching for some data.' The former will win every time." - Matthew Simmons
http://www.tiago.org
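To make Tiago's "read file, calculate statistics" point concrete, here is a small self-contained sketch (not Biopython code, and not Tiago's) that computes a basic Wright's Fst for one biallelic marker directly from per-population allele counts, with no Marker, Individual or Population classes. The counts are invented and the estimator deliberately ignores sample-size corrections:

def basic_fst(allele_counts):
    # allele_counts is a list of (count_of_A, count_of_a) tuples, one per
    # subpopulation.  Subpopulations are weighted equally, and this is NOT
    # the Weir & Cockerham estimator - just Fst = (Ht - Hs) / Ht.
    freqs = [float(a) / (a + b) for (a, b) in allele_counts]
    hs = sum([2 * p * (1 - p) for p in freqs]) / len(freqs)  # mean within-pop heterozygosity
    p_bar = sum(freqs) / len(freqs)                          # pooled allele frequency
    ht = 2 * p_bar * (1 - p_bar)                             # total expected heterozygosity
    if ht == 0:
        return 0.0  # marker is monomorphic overall
    return (ht - hs) / ht

print basic_fst([(60, 40), (10, 90), (35, 65)])  # made-up counts for three populations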
From pingou at pingoured.fr Fri Oct 31 16:29:27 2008 From: pingou at pingoured.fr (Pierre-Yves) Date: Fri, 31 Oct 2008 17:29:27 +0100 Subject: [BioPython] Sequence graph Message-ID: <490B3267.5020501@pingoured.fr>

Dear list, I am sorry to come here and ask a question that must have been asked already in the past, but my searches have been rather unsuccessful... I would like to reproduce a graph like this one: http://www.bioperl.org/wiki/HOWTO:Graphics#Improving_the_Image but even though BioPerl is nice, I would like to do it through Biopython. I thus have two questions:
* Is that possible?
* Could someone point me to an example?
Thanks in advance for your help, Best regards, Pierre

From bsouthey at gmail.com Wed Oct 22 21:02:18 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 22 Oct 2008 21:02:18 -0000 Subject: [BioPython] back-translation method for Seq object? In-Reply-To: <320fb6e00810210859n1b922e7emd6a7456abd79cdc7@mail.gmail.com> References: <48FDE37B.5040301@gmail.com> <320fb6e00810210745w32b37edjeec1607a3711f6ea@mail.gmail.com> <320fb6e00810210859n1b922e7emd6a7456abd79cdc7@mail.gmail.com> Message-ID: <48FF951C.4030700@gmail.com>

Hi, One of the neat things about Python is how easy it is to modify your own code and adapt others' code into yours. So here is some code (under the BSD license) that may be useful here. This is simple back (reverse) translation code with many of the things that I have been 'talking' about. It should be self-contained and works on a Linux system with Python 2.3+. It is oriented around a peptide sequence 'AFLFQPQRFGR' but hopefully is more general (I have not tested that).

a) Convert an amino acid sequence into either a regular expression or a DNA sequence involving ambiguous codes. There are functions to convert the regular expression or the DNA sequence involving ambiguous codes back to a protein sequence, since neither of these is standard.
b) Regular expression search on a list of sequences in FASTA format.
c) Obtain all possible DNA sequences from a regular expression form of the amino acid sequence. Obviously this can be very large, as for the above sequence there are 442368 combinations (but Python is fairly quick... about 10 seconds on my Opteron 270 system, bogomips = 3991.08).

Enjoy Bruce

-------------- next part -------------- A non-text attachment was scrubbed... Name: reverse_trans.py Type: text/x-python Size: 10661 bytes Desc: not available URL:
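Bruce's attachment did not survive in the archive (only the scrubbed-attachment notice above remains), so here is an independent sketch of ideas (a) and (b), not Bruce's reverse_trans.py: it turns a peptide into a regular expression over codons and searches a nucleotide sequence with it. The codon table below only covers the amino acids found in the example peptide, and the target DNA string is made up for the demonstration:

import re

# Standard codons for just the amino acids in the example peptide AFLFQPQRFGR.
CODONS = {
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "F": ["TTT", "TTC"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "Q": ["CAA", "CAG"],
    "P": ["CCT", "CCC", "CCA", "CCG"],
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
}

def peptide_to_regexp(peptide):
    # One non-capturing group of alternative codons per amino acid.
    parts = ["(?:%s)" % "|".join(CODONS[aa]) for aa in peptide]
    return re.compile("".join(parts))

pattern = peptide_to_regexp("AFLFQPQRFGR")
# A made-up target with one coding stretch for the peptide embedded in it:
dna = "GGGG" + "GCATTCCTGTTTCAGCCACAACGATTTGGGAGA" + "CCCC"
match = pattern.search(dna)
if match:
    print "Found at position %i: %s" % (match.start(), match.group())

Multiplying out the codon choices in the same table (4 x 2 x 6 x 2 x 2 x 4 x 2 x 6 x 2 x 4 x 6) gives the 442368 explicit DNA sequences Bruce mentions for idea (c), which is why enumerating them all gets expensive.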