From mjldehoon at yahoo.com Wed Oct 1 08:18:24 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 1 Oct 2008 05:18:24 -0700 (PDT) Subject: [BioPython] Bio.distance Message-ID: <924102.72843.qm@web62403.mail.re1.yahoo.com> Hi everybody, Since the 1.48 release, Biopython has been making good progress in the migration from Numerical Python to NumPy. As part of this process, we are now reviewing and consolidating the code in Biopython that makes use of Numerical Python / NumPy. Specifically, we are thinking to merge the code in Bio.distance into Bio.kNN, and to deprecate Bio.distance and Bio.cdistance. Since Bio.kNN is the only Biopython module in Biopython that makes use of Bio.distance, we think that this won't affect anybody. However, if you are using Bio.distance outside of Bio.kNN, please let us know so we can find an alternative solution. --Michiel. From bsouthey at gmail.com Wed Oct 1 11:49:53 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 01 Oct 2008 10:49:53 -0500 Subject: [BioPython] Bio.distance In-Reply-To: <924102.72843.qm@web62403.mail.re1.yahoo.com> References: <924102.72843.qm@web62403.mail.re1.yahoo.com> Message-ID: <48E39C21.8010603@gmail.com> Michiel de Hoon wrote: > Hi everybody, > > Since the 1.48 release, Biopython has been making good progress in the migration from Numerical Python to NumPy. As part of this process, we are now reviewing and consolidating the code in Biopython that makes use of Numerical Python / NumPy. Specifically, we are thinking to merge the code in Bio.distance into Bio.kNN, and to deprecate Bio.distance and Bio.cdistance. Since Bio.kNN is the only Biopython module in Biopython that makes use of Bio.distance, we think that this won't affect anybody. However, if you are using Bio.distance outside of Bio.kNN, please let us know so we can find an alternative solution. > > --Michiel. 
> > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > Hi, Under the 'standard' install I do not think that there is any advantage of using Bio.cdistance within Bio.kNN. I tested this on a bioinformatics data set with almost 1500 data points, 8 explanatory variables and k=9. I only got a one second difference between using Bio.cdistance or commenting it out on my system (after removing the build directory and reinstalling everything). Actual maximum times across three runs were under 16.6 seconds with it and under 17.4 seconds without it. My system runs linux x86_64 (fedora 10) but it is not a 'clean' system due to other cpu intensive processes running. I used Python 2.5 and Numeric 2.4 as I forgot the order of imports. In my version the default distance without Bio.cdistance uses the Numeric dot (I did not try the python version) so I would expect this to be noticeably faster if lapack or atlas are installed than if these are not present. (I used Fedora supplied Numeric so while I think this timing is without lapack and atlas I am not completely sure of that.) I did not see an examples for k-nearest neighbor so below is (very bad) code using the logistic regression example (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html). 
Regards
Bruce

from Bio import kNN
xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30],
      [11, -220.94], [85, -193.94], [16, -182.71], [15, -180.41],
      [-26, -181.73], [58, -259.87], [126, -414.53], [191, -249.57],
      [113, -265.28], [145, -312.99], [154, -213.83], [147, -380.85],
      [93, -291.13]]
ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
model = kNN.train(xs, ys, 3)
# ccr counts correctly classified points, tobs counts all observations
ccr=0
tobs=0
for px, py in zip(xs, ys):
    cp=kNN.classify(model, px)
    tobs +=1
    if cp==py:
        ccr +=1
print tobs, ccr

From biopython at maubp.freeserve.co.uk Wed Oct 1 11:52:05 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Oct 2008 16:52:05 +0100
Subject: [BioPython] More string methods for the Seq object
In-Reply-To: <320fb6e00809290506p8aa2b51p4901b693ebb268bf@mail.gmail.com>
References: <320fb6e00809260859r23c7915buc114c5c0b71e195@mail.gmail.com> <48DD2DE6.10908@gmail.com> <320fb6e00809261422n6e4c4889p734508613898cc3f@mail.gmail.com> <48DD59DF.1000504@gmail.com> <320fb6e00809261457j65dc0876hd59d17aee01bc983@mail.gmail.com> <320fb6e00809270557n73b81b5ayb93fe85f0f466626@mail.gmail.com> <320fb6e00809290450m6fedbaacu15a75107e5c39658@mail.gmail.com> <320fb6e00809290506p8aa2b51p4901b693ebb268bf@mail.gmail.com>
Message-ID: <320fb6e00810010852j5cf8e3ak7dc788372568251f@mail.gmail.com>

On Mon, Sep 29, 2008 at 1:06 PM, Peter wrote:
>> I assume you [Bruce] are agreeing with ... follow[ing] the
>> string defaults of white space for stripping or splitting (for
>> consistency, even though this won't typically be useful for
>> sequences). On balance this would probably be best from
>> a principle of consistency and least surprise for the user -
>> I'll update the patches.
>
> New patch for Seq object split, strip, lstrip and rstrip methods on
> Bug 2596 which follows the python string defaults (splitting on or
> stripping of white space).
> http://bugzilla.open-bio.org/show_bug.cgi?id=2596

There is now a second version of this patch on that bug, which will also accept Seq objects as arguments to the split, strip, lstrip and rstrip methods, and has the start of some tests too.

We (Peter, Martin, Bruce and Leighton) seem to have reached an agreement about adding split, strip, lstrip and rstrip methods to the Seq object, with the behaviour (arguments and defaults) following that of the python string as closely as possible. I'd like to encourage others lurking on the list to comment too, but unless anyone objects, I intend to add these methods in CVS this week, together with an updated unit test and updates to the tutorial.

Peter

From biopython at maubp.freeserve.co.uk Wed Oct 1 12:03:22 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Oct 2008 17:03:22 +0100
Subject: [BioPython] Bio.distance
In-Reply-To: <48E39C21.8010603@gmail.com>
References: <924102.72843.qm@web62403.mail.re1.yahoo.com> <48E39C21.8010603@gmail.com>
Message-ID: <320fb6e00810010903u253c6384ld401e1a771ee141e@mail.gmail.com>

On Wed, Oct 1, 2008 at 4:49 PM, Bruce Southey wrote:
>
> Hi,
> Under the 'standard' install I do not think that there is any advantage of
> using Bio.cdistance within Bio.kNN. I tested this on a bioinformatics data
> set with almost 1500 data points, 8 explanatory variables and k=9. ...
> Actual maximum times across three runs were under 16.6 seconds with
> it [Bio.cdistance] and under 17.4 seconds without it [Bio.distance using
> Numeric]

It's interesting that the C version is only slightly faster than Numeric - of course as you point out there are lots of possible complications here like lapack and atlas (plus compiler options and CPU features).

I think your numbers are good support for Michiel's proposition that we should deprecate Bio.cdistance and Bio.distance and just use numpy in Bio.kNN - this will simplify our code base and make very little difference to the speed.
Peter From biopython at maubp.freeserve.co.uk Wed Oct 1 12:17:10 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 1 Oct 2008 17:17:10 +0100 Subject: [BioPython] Bio.kNN documentation Message-ID: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com> Bruce wrote: > I tested this [Bio.kNN] on a bioinformatics data set with almost 1500 > data points, 8 explanatory variables and k=9. ... Do you think this larger example could be adapted into something for the Biopython documentation? Otherwise the next bit of code looks interesting. > I did not see an examples for k-nearest neighbor so below is (very bad) > code using the logistic regression example > (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html). This is a set of Bacillus subtilis gene pairs for which the operon structure is known, with the intergene distance and gene expression score as explanatory variables, with the class being same operon or different operons. > from Bio import kNN > xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30], [11, > -220.94], [85, -193.94], [16, -182.71], [15, -180.41], [-26, -181.73], [58, > -259.87], [126, -414.53], [191, -249.57], [113, -265.28], [145, -312.99], > [154, -213.83], [147, -380.85], [93, -291.13]] > ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0] > model = kNN.train(xs, ys, 3) > ccr=0 > tobs=0 > for px, py in zip(xs, ys): > cp=kNN.classify(model, px) > tobs +=1 > if cp==py: > ccr +=1 > print tobs, ccr Could you expand on the cryptic variable names? ccr = correct call rate? tobs = total observations? Coupled with a scatter plot (say with pylab, showing the two classes in different colours), this could be turned into a nice little example for the cookbook section of the tutorial. Notice that later on in the logistic regression example there is a second table of "test data" which could be used to make de novo predictions. 
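A minimal, library-free sketch of what the quoted example is doing, with the variable names spelled out. This is not Bio.kNN's implementation: the helper names (euclidean, knn_classify, resubstitution_counts) are made up for illustration, plain Euclidean distance with an unweighted majority vote is assumed, and the code is written for Python 3 rather than the Python 2.5 used in this thread. The data are the 17 gene pairs quoted above.

```python
from collections import Counter
from math import sqrt

# Training data copied from the example above:
# [intergene distance, gene expression score] for each gene pair.
xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30],
      [11, -220.94], [85, -193.94], [16, -182.71], [15, -180.41],
      [-26, -181.73], [58, -259.87], [126, -414.53], [191, -249.57],
      [113, -265.28], [145, -312.99], [154, -213.83], [147, -380.85],
      [93, -291.13]]
# Class labels: 1 = same operon, 0 = different operons.
ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_classify(train_x, train_y, query, k):
    """Return the most common label among the k training points nearest to query."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def resubstitution_counts(train_x, train_y, k):
    """Classify every training point against the full training set.

    Returns (total_observations, correctly_classified).
    """
    total_observations = 0
    correctly_classified = 0
    for point, label in zip(train_x, train_y):
        total_observations += 1
        if knn_classify(train_x, train_y, point, k) == label:
            correctly_classified += 1
    return total_observations, correctly_classified


total, correct = resubstitution_counts(xs, ys, k=3)
print(total, correct, "rate = %.2f" % (correct / total))
```

Because every point is one of its own nearest neighbours here, this count is optimistic; the separate "test data" table from the logistic regression example would give a fairer check.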
Thanks, Peter From bsouthey at gmail.com Wed Oct 1 14:40:41 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Wed, 01 Oct 2008 13:40:41 -0500 Subject: [BioPython] Bio.kNN documentation In-Reply-To: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com> References: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com> Message-ID: <48E3C429.1020004@gmail.com> Peter wrote: > Bruce wrote: > >> I tested this [Bio.kNN] on a bioinformatics data set with almost 1500 >> data points, 8 explanatory variables and k=9. ... >> > > Do you think this larger example could be adapted into something for > the Biopython documentation? Otherwise the next bit of code looks > interesting. > > >> I did not see an examples for k-nearest neighbor so below is (very bad) >> code using the logistic regression example >> (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html). >> > > This is a set of Bacillus subtilis gene pairs for which the operon > structure is known, with the intergene distance and gene expression > score as explanatory variables, with the class being same operon or > different operons. > > >> from Bio import kNN >> xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30], [11, >> -220.94], [85, -193.94], [16, -182.71], [15, -180.41], [-26, -181.73], [58, >> -259.87], [126, -414.53], [191, -249.57], [113, -265.28], [145, -312.99], >> [154, -213.83], [147, -380.85], [93, -291.13]] >> ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0] >> model = kNN.train(xs, ys, 3) >> ccr=0 >> tobs=0 >> for px, py in zip(xs, ys): >> cp=kNN.classify(model, px) >> tobs +=1 >> if cp==py: >> ccr +=1 >> print tobs, ccr >> > > Could you expand on the cryptic variable names? ccr = correct call > rate? tobs = total observations? > > Coupled with a scatter plot (say with pylab, showing the two classes > in different colours), this could be turned into a nice little example > for the cookbook section of the tutorial. 
Notice that later on in the
> logistic regression example there is a second table of "test data"
> which could be used to make de novo predictions.
>
> Thanks,
>
> Peter
>
>
I did realize that this was coming... :-)
(I guess I am volunteering myself to provide some material on machine learning with BioPython. So this is a start.)

I wanted something quick and dirty to output for testing, so tobs is the total number of observations and ccr is the number of correctly classified points - I was too lazy to divide it by tobs to get the correct classification rate.

Here is a more extended sample code that also uses logistic regression. (Python is so great to work with here!) I don't have plotting packages installed but someone could add the plots.

Regards
Bruce

-------------- next part --------------
A non-text attachment was scrubbed...
Name: knn_lr_example.py
Type: text/x-python
Size: 3257 bytes
Desc: not available
URL:

From biopython at maubp.freeserve.co.uk Wed Oct 1 17:40:55 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Oct 2008 22:40:55 +0100
Subject: [BioPython] problem installing mxTextTools
In-Reply-To: <48896815.10104@berkeley.edu>
References: <4889645B.9080400@berkeley.edu> <48896815.10104@berkeley.edu>
Message-ID: <320fb6e00810011440v4bd80263hf3830d8d9f548e63@mail.gmail.com>

On Fri, Jul 25, 2008 at 6:43 AM, Nick Matzke wrote:
> Hi all,
>
> An update -- I found a solution by copying the .pck file the download
> actually gave me to the filename that the install was apparently looking
> for. This was not exactly obvious (!!!!) but apparently it worked:
> ...
> >>> print now()
> 2008-07-24 22:39:17.66
>

Was this an old email you accidentally forwarded to the list?

For the next release of Biopython the only bits of code still using mxTextTools have been deprecated, so the Biopython setup won't even look for mxTextTools at all. Right now with Biopython 1.48 you can just install without mxTextTools (as the setup.py prompt should make clear).
Peter From biopython at maubp.freeserve.co.uk Wed Oct 1 17:44:34 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 1 Oct 2008 22:44:34 +0100 Subject: [BioPython] problem installing mxTextTools In-Reply-To: <320fb6e00810011440v4bd80263hf3830d8d9f548e63@mail.gmail.com> References: <4889645B.9080400@berkeley.edu> <48896815.10104@berkeley.edu> <320fb6e00810011440v4bd80263hf3830d8d9f548e63@mail.gmail.com> Message-ID: <320fb6e00810011444u7e5bf37fh2801c1980bd38a2a@mail.gmail.com> On Wed, Oct 1, 2008 at 10:40 PM, Peter wrote: > On Fri, Jul 25, 2008 at 6:43 AM, Nick Matzke wrote: >> Hi all, >> >> An update -- I found a solution by copying the .pck file the download >> actually gave me to the filename that the install was apparently looking >> for. This was not exactly obvious (!!!!) but apparently it worked: >> ... >> >>> print now() >> 2008-07-24 22:39:17.66 >> > > Was this an old email you accidently forwarded to the list? Sorry about this Nick & everyone else - it was a mistake at my end. It looks like a glitch (perhaps in GoogleMail itself?) marked this old thread as unread and bumped it to the top of my to read list. Odd, but I didn't notice until after sending my confused reply. Peter From kteague at bcgsc.ca Wed Oct 1 17:53:44 2008 From: kteague at bcgsc.ca (Kevin Teague) Date: Wed, 1 Oct 2008 14:53:44 -0700 Subject: [BioPython] development question References: <48B5BD98.8050101@heckler-koch.cz><48B65C9B.4000407@heckler-koch.cz> <20080828090431.GD5801@inb.uni-luebeck.de> Message-ID: <36BEEFA2DF192944BF71E072F7A5F4656043D6@xchange1.phage.bcgsc.ca> On Thu, Aug 28, 2008 at 10:06:51AM +0200, Pavel SRB wrote: > so now to biopython. On my system i have biopython from debian repository > via apt-get. But i would like to have second version of biopython in system > just to check, log and change the code to learn more. This can be done with > removing sys.path.remove("/var/lib/python-support/python2.5") > and importing Bio from some other development directory. 
But this way i
> lose all modules in directory mentioned above and i believe it can be
> done more clearly

You might want to check out VirtualEnv:

http://pypi.python.org/pypi/virtualenv

This tool will let you "clone" your system Python, so that you have your own isolated [virtualpythonname]/bin and [virtualpythonname]/lib/python2.5/site-packages/ directories. If you create a virtualenv with the --no-site-packages option, then the /var/lib/python-support/python2.5/ location will not be in the created virtual python's sys.path. Otherwise by default this location will be included, but your own isolated [virtualpythonname]/lib/python2.5/site-packages/ location will have precedence on sys.path, so if you install a newer BioPython into there it will get imported instead of the system one.

You can of course do all of this by manually fiddling with sys.path, but VirtualEnv just wraps up a few of these common practices into one handy tool - great for experimentation or trying out different packages.

From lunt at ctbp.ucsd.edu Sat Oct 4 17:50:33 2008
From: lunt at ctbp.ucsd.edu (Bryan Lunt)
Date: Sat, 4 Oct 2008 14:50:33 -0700
Subject: [BioPython] Copy Constructors for Bio.Seq.Seq?
In-Reply-To:
References:
Message-ID:

Greetings All!
I would like to make the following humble suggestion:
A copy-constructor for Bio.Seq.Seq would be helpful. Currently it seems that calling Bio.Align.Generic.Alignment.add_sequence on a Seq object breaks because it tries to initialize a new Seq object on whatever data you provided, and there is no copy-constructor, nor does Bio.Align.Generic.Alignment.add_sequence handle just adding a Seq object directly.

Thanks for considering this, I think this addition will help make client-code cleaner.
-Bryan Lunt

From biopython at maubp.freeserve.co.uk Sun Oct 5 07:06:57 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 5 Oct 2008 12:06:57 +0100
Subject: [BioPython] Copy Constructors for Bio.Seq.Seq?
In-Reply-To: References: Message-ID: <320fb6e00810050406t41d25043oe7011745055a1fc7@mail.gmail.com> On Sat, Oct 4, 2008 at 10:50 PM, Bryan Lunt wrote: > Greetings All! > I would like to make the following humble suggestion: > A copy-constructor for Bio.Seq.Seq would be helpful, ... You can use the string idiom of my_seq[:] to make a copy of a Seq object. > currently it > seems that calling Bio.Align.Generic.Alignment.add_sequence on a > Seq object breaks because it tries to initialize a new Seq object on > whatever data you provided, and there is no copy-constructor, nor does > Bio.Align.Generic.Alignment.add_sequence handled just adding a Seq > object directly. Yes, the Bio.Align.Generic.Alignment.add_sequence() method currently expects a string (which its docstring is fairly clear about), and giving it a Seq does fail. I suppose allowing it to take a Seq object would be sensible (with a check on the alphabet being compatible with that declared for the alignment). We have been debating making the generic Alignment a little more list like, by allowing .append() or .extend() for use with SeqRecord objects (Bug 2553). http://bugzilla.open-bio.org/show_bug.cgi?id=2553 > Thanks for considering this, I think this addition will help make > client-code cleaner. Would the SeqRecord append/extend idea suit you just as well? Peter From biopython at maubp.freeserve.co.uk Sun Oct 5 08:16:28 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 5 Oct 2008 13:16:28 +0100 Subject: [BioPython] Migrating from Numerical Python to numpy In-Reply-To: <623262.17729.qm@web62407.mail.re1.yahoo.com> References: <623262.17729.qm@web62407.mail.re1.yahoo.com> Message-ID: <320fb6e00810050516i20822ebcwf15cd058af0c9759@mail.gmail.com> On Sat, Sep 20, 2008 at 4:02 AM, Michiel de Hoon wrote: > Dear all, > > As you probably are well aware, Biopython releases to date have used > the now obsolete Numeric python library. 
This is no longer being
> maintained and has been superseded by the numpy library. See
> http://www.scipy.org/History_of_SciPy for more details on the
> history of numerical python. Biopython 1.48 should be the last
> Numeric only release of Biopython - we have already started moving to
> numpy in CVS.
>
> Supporting both Numeric and numpy ought to be fairly straightforward
> for the pure python modules in Biopython. However, we also have C code
> which must interact with Numeric/numpy, and trying to support both
> would be harder.
>
> Would anyone be inconvenienced if the next release of Biopython
> supported numpy ONLY (dropping support for Numeric)? If so please
> speak up now - either here or on the development mailing list.
> Otherwise, a simple switch from Numeric to numpy will probably be the
> most straightforward migration plan.

No one has objected, and a simple switch from Numeric to numpy is underway in CVS. The next release of Biopython will support numpy only (dropping support for Numeric).

As an aside, from my own testing Biopython CVS looks happy with numpy 1.0, 1.1 and the just released 1.2 (although if we have missed any deprecation warnings please let us know). For preparing Windows installers for Biopython, it might be helpful to know what version of numpy most Windows users (will) have installed (this is important due to numpy C API changes between versions).
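For anyone wanting to report which numpy they have, a small sketch follows. The function name installed_numpy_version is made up for illustration and the syntax is Python 3; it returns None rather than failing when numpy is not installed.

```python
import importlib


def installed_numpy_version():
    """Return numpy's version string, or None if numpy cannot be imported."""
    try:
        numpy = importlib.import_module("numpy")
    except ImportError:
        return None
    return getattr(numpy, "__version__", None)


print("numpy version:", installed_numpy_version())
```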
Thanks, Peter From biopython at maubp.freeserve.co.uk Mon Oct 6 06:39:15 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 6 Oct 2008 11:39:15 +0100 Subject: [BioPython] Bio.kNN documentation In-Reply-To: <48E3C429.1020004@gmail.com> References: <320fb6e00810010917x2373a122t6604c2444ffeef65@mail.gmail.com> <48E3C429.1020004@gmail.com> Message-ID: <320fb6e00810060339t427d4c5dme2690fbc36b30c81@mail.gmail.com> Bruce wrote: >>> I did not see an examples for k-nearest neighbor so below is >>> (very bad) code using the logistic regression example >>> (http://biopython.org/DIST/docs/cookbook/LogisticRegression.html). Peter wrote: >> This is a set of Bacillus subtilis gene pairs for which the operon >> structure is known, with the intergene distance and gene expression >> score as explanatory variables, with the class being same operon or >> different operons. >> ... >> Coupled with a scatter plot (say with pylab, showing the two classes >> in different colours), this could be turned into a nice little example >> for the cookbook section of the tutorial. Notice that later on in the >> logistic regression example there is a second table of "test data" >> which could be used to make de novo predictions. Bruce wrote: > I did realize that this was coming... :-) > (I guess I am volunteering myself to provide some material on > machine learning with BioPython. So this is a start.) Michiel has suggested adding a whole chapter to the tutorial about supervised learning, presumably incorporating his logistic regression example as part of this. Have a look at thread "Bio.MarkovModel; Bio.Popgen, Bio.PDB documentation" on the dev mailing list. I'm sure you can contribute (even if just by proof reading). 
Peter From fkauff at biologie.uni-kl.de Tue Oct 7 04:02:12 2008 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Tue, 07 Oct 2008 10:02:12 +0200 Subject: [BioPython] Creating and traversing an ultrametric tree In-Reply-To: <320fb6e00809241326i16a337das844f4ac74766b459@mail.gmail.com> References: <73045cca0809231713v219c3ec3tfc24461c7af6b453@mail.gmail.com> <320fb6e00809240200y144500cbl86f9023cb868da89@mail.gmail.com> <73045cca0809241132x30bc4d63t7ac0b9967a20e76c@mail.gmail.com> <320fb6e00809241326i16a337das844f4ac74766b459@mail.gmail.com> Message-ID: <48EB1784.50803@biologie.uni-kl.de> Peter wrote: > On Wed, Sep 24, 2008 at 7:32 PM, aditya shukla > wrote: > >> Hello Peter , >> >> Thanks for the reply , >> I have attached a file with of the kind of data that i wanna parse. >> I tried using Thomas Mailund's Newick tree parser but this dosen't >> seem to work , so is there any other module that can help? >> > > Your file looks like this (in case anyone on the mailing list recognises it), > > /T_0_size=105((-bin-ulockmgr_server:0.99[&&NHX:C=0.195.0], > (((-bin-hostname:0.00[&&NHX:C=200.0.0], > (-bin-dnsdomainname:0.00[&&NHX:C=200.0.0], > ...):0.99):0.99):0.99):0.99); > > [with a large chunk removed, and new lines inserted] > > I'm guessing this is some kind of computer system profile - nothing to > do with bioinformatics. > > I'm not 100% sure this is Newick format - it might be worth trying to > parse everything after the "/T_0_size=105" text which looks out of > place to me. > > If it is a valid Newick format tree file, then it is using named > internal nodes which is something Biopython can't currently parse (see > Bug 2543, http://bugzilla.open-bio.org/show_bug.cgi?id=2543 ). So I > don't think you can use the Bio.Nexus module in Biopython to read this > tree. > > Nexus.Trees has been extended to deal with internal node names, or "special comments" in the format [& blablalba]. 
Such comments can appear directly after the taxon label, after the closing parentheses, or between branchlength / support values attached to a node or a taxon label, such as

(a,(b,(c,d)[&hi there]))
(a,(b[&hi there],c))
(a,(b:0.123[&hi there],c[&heyho]:0.3))
(a,(b,c)0.4[&comment]:0.95)

The comments are stored without change in the corresponding node object and can be accessed like

>>> t=Trees.Tree('(a,(b:0.123[&hi there],c[&heyho]:0.3))')
>>> print t.node(3).data.comment
[&hi there]
>>> print t.node(4).data.comment
[&heyho]
>>>

The comments are not interpreted in any way - internal labels vary greatly in syntax, and are used to store all kinds of information. But at least they are now read and stored, and users can deal with them in any way they like.

Frank

> The only other python package I can suggest you try is NetworkX,
> https://networkx.lanl.gov/wiki
>
> Good luck,
>
> Peter
>

From mjldehoon at yahoo.com Tue Oct 7 19:10:12 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 7 Oct 2008 16:10:12 -0700 (PDT)
Subject: [BioPython] Bio.kNN documentation
In-Reply-To: <320fb6e00810060339t427d4c5dme2690fbc36b30c81@mail.gmail.com>
Message-ID: <381879.37032.qm@web62403.mail.re1.yahoo.com>

> Bruce wrote:
> > (I guess I am volunteering myself to provide some material on
> > machine learning with BioPython. So this is a start.)
>
> Michiel has suggested adding a whole chapter to the tutorial about
> supervised learning, presumably incorporating his logistic regression
> example as part of this. Have a look at thread "Bio.MarkovModel;
> Bio.Popgen, Bio.PDB documentation" on the dev mailing list. I'm sure
> you can contribute (even if just by proof reading).

Some more documentation on machine learning would definitely be useful.
Recently I started a chapter on supervised learning methods in the tutorial. Right now it only covers logistic regression, but it should also include Bio.MarkovModel, Bio.MaxEntropy, Bio.NaiveBayes, and Bio.kNN. If you are planning to write some documentation on any of these, please let us know so we can avoid duplicated efforts. The new tutorial is in CVS; I put a copy of the HTML output of the latest version at http://biopython.org/DIST/docs/tutorial/Tutorial.new.html. Thanks! --Michiel From bsouthey at gmail.com Tue Oct 7 21:35:51 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 7 Oct 2008 20:35:51 -0500 Subject: [BioPython] Bio.kNN documentation In-Reply-To: <381879.37032.qm@web62403.mail.re1.yahoo.com> References: <320fb6e00810060339t427d4c5dme2690fbc36b30c81@mail.gmail.com> <381879.37032.qm@web62403.mail.re1.yahoo.com> Message-ID: On Tue, Oct 7, 2008 at 6:10 PM, Michiel de Hoon wrote: >> Bruce wrote: >> > (I guess I am volunteering myself to provide some >> material on >> > machine learning with BioPython. So this is a start.) >> >> Michiel has suggested adding a whole chapter to the >> tutorial about >> supervised learning, presumably incorporating his logistic >> regression >> example as part of this. Have a look at thread >> "Bio.MarkovModel; >> Bio.Popgen, Bio.PDB documentation" on the dev mailing >> list. I'm sure >> you can contribute (even if just by proof reading). > > Some more documentation on machine learning would definitely be useful. Recently I started a chapter on supervised learning methods in the tutorial. Right now it only covers logistic regression, but it should also include Bio.MarkovModel, Bio.MaxEntropy, Bio.NaiveBayes, and Bio.kNN. If you are planning to write some documentation on any of these, please let us know so we can avoid duplicated efforts. The new tutorial is in CVS; I put a copy of the HTML output of the latest version at > http://biopython.org/DIST/docs/tutorial/Tutorial.new.html. > > Thanks! 
>
> --Michiel
>
Hi,
I have not given it too much thought at present but this reflects some of the work I have been doing or involved with. I do not know enough about Bio.MarkovModel, Bio.MaxEntropy and Bio.NaiveBayes to really help. But I did think to start with trying to extend the supervised learning material to be more general. One aspect is to provide working code using different methodologies for different examples.

Regards
Bruce

From stephan80 at mac.com Wed Oct 8 07:33:51 2008
From: stephan80 at mac.com (Stephan)
Date: Wed, 08 Oct 2008 13:33:51 +0200
Subject: [BioPython] Entrez.efetch
Message-ID: <75573950382669954948356356615157751492-Webmail2@me.com>

Hi,

I have been using Biopython for a week or so. The package is amazing, I wonder how I possibly ignored this for so long now.
Since I am new not only to Biopython but also to this mailing list, forgive me if this is not the right forum for a question like this.

Anyway, here is a weird little problem with the Bio.Entrez.efetch tool:
(I use python 2.5 and the latest Biopython 1.48)
I want to run the following little test-code, using efetch to get chromosome 4 of Drosophila melanogaster as a genbank-file:

---------------------------CODE------------------------------------
from Bio import Entrez, SeqIO

print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"]
handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
print "downloading to SeqRecord..."
record = SeqIO.read(handle, "genbank")
print "...done"

handle = Entrez.efetch(db="genome", id="56", rettype="genbank")
filehandle = open("NCBI_DroMel", "w")
print "downloading to file..."
filehandle.write(handle.read())
print "...done"

handle = open("NCBI_DroMel")
print "reading from file..."
record = SeqIO.read(handle, "genbank")
---------------------------END-CODE------------------------------------

In the last line we have a crash, see the output of the code:

---------------------------OUTPUT------------------------------------
Drosophila melanogaster chromosome 4, complete sequence
downloading to SeqRecord...
...done
downloading to file...
...done
reading chr2L from file...
Traceback (most recent call last):
  File "efetch-test.py", line 17, in <module>
    record = SeqIO.read(handle, "genbank")
  File "HOME/lib/python/Bio/SeqIO/__init__.py", line 366, in read
    first = iterator.next()
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 410, in parse_records
    record = self.parse(handle)
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 393, in parse
    if self.feed(handle, consumer) :
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 370, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "HOME/lib/python/Bio/GenBank/Scanner.py", line 723, in parse_footer
    raise ValueError("Premature end of file in sequence data")
ValueError: Premature end of file in sequence data
---------------------------END-OUTPUT------------------------------------

It seems that downloading the file to disk will corrupt the genbank file, while downloading directly into Biopython's SeqIO.read() function works properly. I don't get it!

When I download this chromosome manually from the NCBI-website, I indeed find a difference in one line, namely in line 3 of the genbank file. In the manually downloaded file line 3 reads: "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced from my code I have only: "ACCESSION NC_004353". So without that region-information, the biopython parser of course runs to a premature end.

For now I use the cPickle module to save the whole SeqRecord instance. That works fine, so I don't need an immediate solution for the above posted problem, but I thought it might be interesting maybe...

Any hints?
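One thing worth ruling out with the write-then-reread pattern above is that the output file is still open, and possibly not fully flushed to disk, when it is reopened for reading. A minimal sketch in Python 3, using an in-memory io.StringIO as a stand-in for the handle returned by Entrez.efetch so that it runs without network access; the helper name save_handle and the sample text are made up for illustration.

```python
import io
import os
import tempfile


def save_handle(handle, path):
    """Copy everything from a text handle to path, closing the file when done."""
    with open(path, "w") as out:  # leaving the with-block flushes and closes
        out.write(handle.read())


# Stand-in for a downloaded record; this is not real GenBank data.
sample_text = "LOCUS       EXAMPLE\nORIGIN\n        1 acgt\n//\n"
fake_download = io.StringIO(sample_text)

path = os.path.join(tempfile.mkdtemp(), "example.gb")
save_handle(fake_download, path)

# Only reopen for reading once the writer has been closed.
with open(path) as saved:
    text = saved.read()
print(text == sample_text)
```

In the Python 2 script above the equivalent step would be an explicit filehandle.close() after the write.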
Regards, Stephan From chapmanb at 50mail.com Wed Oct 8 08:35:33 2008 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Oct 2008 08:35:33 -0400 Subject: [BioPython] Entrez.efetch In-Reply-To: <75573950382669954948356356615157751492-Webmail2@me.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> Message-ID: <20081008123533.GE57379@sobchak.mgh.harvard.edu> Hi Stephan; > It seems that downloading the file to disk will corrupt the genbank > file, while downloading directly into biopythons SeqIO.read() function > works properly. I dont get it! > > When I download this chromosome manually from the NCBI-website, > I indeed find a difference in one line, namely in line 3 of the > genbank file. In the manually downloaded file line 3 reads: > "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced > from my code I have only: "ACCESSION NC_004353". So without that > region-information, the biopython parser of course runs to a premature > end. This is a tricky problem that I ran into as well and is fixed in the latest CVS version. The issue is that the Biopython reader is using an UndoHandle instead of a standard python handle. By default some of these operations appear to be assuming an iterator, but UndoHandle did not provide this. As a result, you can lose the first couple of lines which are previously examined to determine the filetype. The fix is to make this a proper iterator. 
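To illustrate the idea only: the following is a simplified stand-in, not the actual change made to Bio/File.py. The class name PeekableHandle and its details are invented for this sketch, and it uses Python 3's iterator protocol. A wrapper that lets a parser push lines back has to hand those saved lines out again through every reading path, including iteration, or they are silently lost.

```python
import io


class PeekableHandle:
    """Wrap a text handle so lines can be pushed back and read again."""

    def __init__(self, handle):
        self._handle = handle
        self._saved = []  # pushed-back lines, next-to-read first

    def readline(self):
        if self._saved:
            return self._saved.pop(0)
        return self._handle.readline()

    def saveline(self, line):
        """Push a line back so that the next read returns it."""
        if line:
            self._saved.insert(0, line)

    def peekline(self):
        """Return the next line without consuming it."""
        line = self.readline()
        self.saveline(line)
        return line

    def read(self):
        """Return all remaining text, pushed-back lines first."""
        text = "".join(self._saved) + self._handle.read()
        self._saved = []
        return text

    # Iterator protocol: goes through readline(), so pushed-back lines
    # are also seen by for-loops and by list().
    def __iter__(self):
        return self

    def __next__(self):
        line = self.readline()
        if not line:
            raise StopIteration
        return line


handle = PeekableHandle(io.StringIO("LOCUS X\nACCESSION Y\nORIGIN\n"))
first = handle.peekline()  # inspect the first line without consuming it
lines = list(handle)       # iteration still starts at the first line
print(first.strip(), len(lines))
```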
You can either check out current CVS, or make the addition manually to Bio/File.py in your current version: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython Hope this helps, Brad From biopython at maubp.freeserve.co.uk Wed Oct 8 09:37:24 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 14:37:24 +0100 Subject: [BioPython] Entrez.efetch In-Reply-To: <75573950382669954948356356615157751492-Webmail2@me.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> Message-ID: <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com> On Wed, Oct 8, 2008 at 12:33 PM, Stephan wrote: > Hi, > > I am using biopython for a week or so. The package is amazing, I wonder how I possibly ignored this for so long now. > Since I am not only new to biopython I am also new in this mailing list, so forgive me if this is not the right forum for a question like this. > > Anyway, here is a weird little problem with the Bio.Entrez.efetch tool: > (I use python 2.5 and the latest Biopython 1.48) > I want to run the following little test-code, using etetch to get chromosome 4 of Drosophila melanogaster as a genbank-file: > > ---------------------------CODE------------------------------------ > from Bio import Entrez, SeqIO > > print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"] > handle = Entrez.efetch(db="genome", id="56", rettype="genbank") > print "downloading to SeqRecord..." > record = SeqIO.read(handle, "genbank") > print "...done" I assume this is just test code - as it would be silly to download the GenBank file twice in a real script. > handle = Entrez.efetch(db="genome", id="56", rettype="genbank") > filehandle = open("NCBI_DroMel", "w") > print "downloading to file..." 
> filehandle.write(handle.read()) You should now close the file, which should ensure it is fully written to disk: filehandle.close() > print "...done" > > handle = open("NCBI_DroMel") > print "reading from file..." > record = SeqIO.read(handle, "genbank") > ---------------------------END-CODE------------------------------------ > > In the last line we have a crash, > ... > ValueError: Premature end of file in sequence data This is because you started reading in the file without finishing writing to it - the parser could only read in part of the data, and is complaining about it ending prematurely. Peter From p.j.a.cock at googlemail.com Wed Oct 8 09:46:25 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 8 Oct 2008 14:46:25 +0100 Subject: [BioPython] Entrez.efetch In-Reply-To: <20081008123533.GE57379@sobchak.mgh.harvard.edu> References: <75573950382669954948356356615157751492-Webmail2@me.com> <20081008123533.GE57379@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> Stephan wrote: >> When I download this chromosome manually from the NCBI-website, >> I indeed find a difference in one line, namely in line 3 of the >> genbank file. In the manually downloaded file line 3 reads: >> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced >> from my code I have only: "ACCESSION NC_004353". So without that >> region-information, the biopython parser of course runs to a premature >> end. Stephan - when you say manually, do you mean via a web browser? If so it is likely to be using a subtly different URL, which might explain the NCBI generating slightly different data on the fly. Either way, this ACCESSION line difference shouldn't trigger the "Premature end of file in sequence data" error in the GenBank parser. On Wed, Oct 8, 2008 at 1:35 PM, Brad Chapman wrote: > This is a tricky problem that I ran into as well and is fixed in the > latest CVS version. 
The issue is that the Biopython reader is using an > UndoHandle instead of a standard python handle. By default some of these > operations appear to be assuming an iterator, but UndoHandle did not > provide this. Brad, I'm pretty sure the GenBank parser is NOT using the UndoHandle. Just adding the close made Stephan's example work for me. What exactly was the problem you ran into (one of the other parsers perhaps?). > As a result, you can lose the first couple of lines which are > previously examined to determine the filetype. The fix is to make > this a proper iterator. You can either check out current CVS, or > make the addition manually to Bio/File.py in your current version: > > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython Adding this to the UndoHandle seems a sensible improvement - but I don't see how it can affect Stephan's script. Peter
From stephan80 at mac.com Wed Oct 8 09:48:25 2008 From: stephan80 at mac.com (Stephan) Date: Wed, 08 Oct 2008 15:48:25 +0200 Subject: [BioPython] Entrez.efetch In-Reply-To: <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com> Message-ID: <128043477953580677661042463273686413408-Webmail2@me.com> Hi guys, OK, there is two different problems here that Brad and Peter independently pointed out to me. Peter, you are right that not closing the file actually caused the error. Your hint fixes that, thanks.
But that doesnt fix that there is a part of line 3 missing over the download, and although I actually updated to the newest cvs-version of biopython as Brad suggested (sorry for accidently putting my answer not on the mailing-list) that does not fix that line... Best, Stephan Am Mittwoch 08 Oktober 2008 um 03:37PM schrieb "Peter" : >On Wed, Oct 8, 2008 at 12:33 PM, Stephan wrote: >> Hi, >> >> I am using biopython for a week or so. The package is amazing, I wonder how I possibly ignored this for so long now. >> Since I am not only new to biopython I am also new in this mailing list, so forgive me if this is not the right forum for a question like this. >> >> Anyway, here is a weird little problem with the Bio.Entrez.efetch tool: >> (I use python 2.5 and the latest Biopython 1.48) >> I want to run the following little test-code, using etetch to get chromosome 4 of Drosophila melanogaster as a genbank-file: >> >> ---------------------------CODE------------------------------------ >> from Bio import Entrez, SeqIO >> >> print Entrez.read(Entrez.esummary(db="genome", id="56"))[0]["Title"] >> handle = Entrez.efetch(db="genome", id="56", rettype="genbank") >> print "downloading to SeqRecord..." >> record = SeqIO.read(handle, "genbank") >> print "...done" > >I assume this is just test code - as it would be silly to download the >GenBank file twice in a real script. > >> handle = Entrez.efetch(db="genome", id="56", rettype="genbank") >> filehandle = open("NCBI_DroMel", "w") >> print "downloading to file..." >> filehandle.write(handle.read()) > >You should now close the file, which should ensure it is fully written to disk: >filehandle.close() > >> print "...done" >> >> handle = open("NCBI_DroMel") >> print "reading from file..." >> record = SeqIO.read(handle, "genbank") >> ---------------------------END-CODE------------------------------------ >> >> In the last line we have a crash, >> ... 
>> ValueError: Premature end of file in sequence data > >This is because you started reading in the file without finishing >writing to it - the parser could only read in part of the data, and is >complaining about it ending prematurely. > >Peter > > From stephan80 at mac.com Wed Oct 8 10:00:31 2008 From: stephan80 at mac.com (Stephan) Date: Wed, 08 Oct 2008 16:00:31 +0200 Subject: [BioPython] Entrez.efetch In-Reply-To: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> <20081008123533.GE57379@sobchak.mgh.harvard.edu> <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> Message-ID: <72537648433629820630731006204512761040-Webmail2@me.com> >Stephan - when you say manually, do you mean via a web browser? If so >it is likely to be using a subtly different URL, which might explain >the NCBI generating slightly different data on the fly. Either way, >this ACCESSION line difference shouldn't trigger the "Premature end of >file in sequence data" error in the GenBank parser. Thanks, that must be it! Now I guess everything is solved, closing the handle makes my code run properly and the download-from-NCBI-webpage-issue explains the difference in line 3. >Adding this to the UndoHandle seems a sensible improvement - but I >don't see how it can affect Stephan's script. There I agree, thanks anyway, Brad. 
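For reference, a minimal sketch of the fix discussed above. It is written in current Python 3 syntax rather than the Python 2.5 used in this thread, uses only the standard library, and the helper names (save_text, load_text) and the tiny stand-in record are made up for illustration. The point is only that a with-block closes the output file, and so flushes it to disk, before anything tries to read it back:

```python
import os
import tempfile


def save_text(path, text):
    """Write text to path; leaving the with-block closes and flushes the file."""
    with open(path, "w") as out_handle:
        out_handle.write(text)


def load_text(path):
    """Read the whole file back as one string."""
    with open(path) as in_handle:
        return in_handle.read()


if __name__ == "__main__":
    # Stand-in for the downloaded GenBank text; this is not real NCBI data.
    fake_record = "LOCUS       EXAMPLE\nORIGIN\n        1 acgt\n//\n"
    path = os.path.join(tempfile.mkdtemp(), "example.gbk")
    save_text(path, fake_record)
    # Because the file was closed before reading, nothing is missing.
    assert load_text(path) == fake_record
    print("round trip OK")
```

In the original script the same effect comes from calling filehandle.close() before reopening the file, as Peter suggested.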
Regards, Stephan From biopython at maubp.freeserve.co.uk Wed Oct 8 10:02:54 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 15:02:54 +0100 Subject: [BioPython] Entrez.efetch In-Reply-To: <128043477953580677661042463273686413408-Webmail2@me.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> <320fb6e00810080637p2d62083ej8dda855d0c10e92@mail.gmail.com> <128043477953580677661042463273686413408-Webmail2@me.com> Message-ID: <320fb6e00810080702q6774f58ap52a02073d62cb75a@mail.gmail.com> On Wed, Oct 8, 2008 at 2:48 PM, Stephan wrote: > > Hi guys, > > OK, there is two different problems here that Brad and Peter independently > pointed out to me. Peter, you are right that not closing the file actually > caused the error. Your hint fixes that, thanks. Great. > But that doesnt fix that there is a part of line 3 missing over the download, > and although I actually updated to the newest cvs-version of biopython as > Brad suggested (sorry for accidently putting my answer not on the mailing-list) > that does not fix that line... This is the issue where you get different GenBank files using Bio.Entrez.efetch and a "manual download"? First of all what did you mean by "manual download" - for example FTP (what URL), or from a browser? Secondly, does this difference to the ACCESSION line (line 3) actually have any ill effects? To be clear using Bio.Entrez.efetch as in your script, I get this: LOCUS NC_004353 1351857 bp DNA linear INV 14-MAY-2008 DEFINITION Drosophila melanogaster chromosome 4, complete sequence. ACCESSION NC_004353 VERSION NC_004353.3 GI:116010290 PROJECT GenomeProject:164 KEYWORDS . SOURCE Drosophila melanogaster (fruit fly) ORGANISM Drosophila melanogaster ... Using FTP from ftp://ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster/CHR_4/NC_004353.gbk I get something similar but different: LOCUS NC_004353 1351857 bp DNA linear INV 14-MAY-2008 DEFINITION Drosophila melanogaster chromosome 4, complete sequence. 
ACCESSION NC_004353 VERSION NC_004353.3 GI:116010290 KEYWORDS . SOURCE Drosophila melanogaster (fruit fly) ORGANISM Drosophila melanogaster ... Notice the FTP file lacks the PROJECT line, and also differs slightly in its feature table. Using the NCBI website I suspect you can get other slight variations (like the different ACCESSION line you reported). Peter From stephan80 at mac.com Wed Oct 8 09:52:07 2008 From: stephan80 at mac.com (Stephan) Date: Wed, 08 Oct 2008 15:52:07 +0200 Subject: [BioPython] Entrez.efetch In-Reply-To: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> <20081008123533.GE57379@sobchak.mgh.harvard.edu> <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> Message-ID: <56009583349175862359179071289436480391-Webmail2@me.com> >Stephan - when you say manually, do you mean via a web browser? If so >it is likely to be using a subtly different URL, which might explain >the NCBI generating slightly different data on the fly. Either way, >this ACCESSION line difference shouldn't trigger the "Premature end of >file in sequence data" error in the GenBank parser. Thanks, that must be it! Now I guess everything is solved, closing the handle makes my code run properly and the download-from-NCBI-webpage-issue explains the difference in line 3. >Adding this to the UndoHandle seems a sensible improvement - but I >don't see how it can affect Stephan's script. There I agree, thanks anyway, Brad. Regards, Stephan From biopythonlist at gmail.com Wed Oct 8 12:23:32 2008 From: biopythonlist at gmail.com (dr goettel) Date: Wed, 8 Oct 2008 18:23:32 +0200 Subject: [BioPython] taxonomic tree Message-ID: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> Hello, I'm new in this list and in BioPython. I would like to create a NCBI-like taxonomic tree and then fill it with the organisms that I have in a file. Is there an easy way to do this? 
I started using biopython's function at 7.11.4 (finding the lineage of an organism) in the tutorial, but I need to do this tens of thousands times so it spends too much time querying NCBI database. Therefore I built a taxonomic database locally and implemented something similar to 7.11.4 tutorial's function so I get, for every sequence, the lineage in the same way: 'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; Liliopsida; Asparagales; Orchidaceae' Now I need to create a tree, or fill an already created one. And then search it by some criteria. Please could anybody help me with this? Any idea? Thankyou very much From biopython at maubp.freeserve.co.uk Wed Oct 8 12:38:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 17:38:31 +0100 Subject: [BioPython] taxonomic tree In-Reply-To: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> Message-ID: <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com> On Wed, Oct 8, 2008 at 5:23 PM, dr goettel wrote: > Hello, I'm new in this list and in BioPython. Hello :) > I would like to create a NCBI-like taxonomic tree and then fill it with the > organisms that I have in a file. Is there an easy way to do this? I started > using biopython's function at 7.11.4 (finding the lineage of an organism) in > the tutorial, ... For anyone reading this later on, note that the tutorial section numbers tend to change with each release of Biopython. This section just uses Bio.Entrez to fetch taxonomy information for a particular NCBI taxon id. > but I need to do this tens of thousands times so it spends too > much time querying NCBI database. 
Also calling Bio.Entrez 10000 times might annoy the NCBI ;) > Therefore I built a taxonomic database > locally and implemented something similar to 7.11.4 tutorial's function so I > get, for every sequence, the lineage in the same way: > > 'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; > Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; > Liliopsida; Asparagales; Orchidaceae' I assume you used the NCBI provided taxdump files to populate the database? See ftp://ftp.ncbi.nih.gov/pub/taxonomy/ Personally rather than designing my own database just for this (and writing a parser for the taxonomy files), I would have suggested installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl to download and import the data for you. This is a simple perl script - you don't need BioPerl. See http://www.biopython.org/wiki/BioSQL for details. > Now I need to create a tree, or fill an already created one. And then search > it by some criteria. What kind of tree do you mean? Are you talking about creating a Newick tree, or an in memory structure? Perhaps the Bio.Nexus module's tree functionality would help. If you are interested, the BioSQL tables record the taxonomy tree using two methods, each node has a parent node allowing you to walk up the lineage. There are also left/right values allowing selection of all child nodes efficiently via an SQL select statement. Peter From biopython at maubp.freeserve.co.uk Wed Oct 8 12:57:37 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 8 Oct 2008 17:57:37 +0100 Subject: [BioPython] Current tutorial in CVS Message-ID: <320fb6e00810080957m3c7fbd40g82f7c0fd794aa334@mail.gmail.com> Michiel wrote: > ... The new tutorial is in CVS; I put a copy of the HTML output > of the latest version at > http://biopython.org/DIST/docs/tutorial/Tutorial.new.html. 
This also gives people a chance to look at the three plotting examples I added to the "Cookbook" section a couple of weeks back,
http://www.biopython.org/DIST/docs/tutorial/Tutorial.new.html#chapter:cookbook

Suggestions for any additional biologically motivated simple plots would be nice - especially for different plot types. A scatter plot could be added, are there any suggestions for this other than melting temperature versus length or GC%? See also this thread on the dev-mailing list:
http://www.biopython.org/pipermail/biopython-dev/2008-September/004277.html

Note that the file at this URL is only temporary, and will probably be removed before the next release. The current tutorial is at:
http://www.biopython.org/DIST/docs/tutorial/Tutorial.html
http://www.biopython.org/DIST/docs/tutorial/Tutorial.pdf

Peter

From stephan80 at mac.com Wed Oct 8 13:11:25 2008
From: stephan80 at mac.com (Stephan)
Date: Wed, 08 Oct 2008 19:11:25 +0200
Subject: [BioPython] Entrez.efetch large files
Message-ID: <133483072970409871957631124263040035200-Webmail2@me.com>

Sorry to have an Entrez.efetch-issue again, but somehow there seems to be a problem with very large files.
So when I run the following code using the newest cvs-version of biopython:

------------------------------------CODE-----------------------------------
from Bio import Entrez, SeqIO

id = "57"
print Entrez.read(Entrez.esummary(db="genome", id=id))[0]["Title"]
handle = Entrez.efetch(db="genome", id=id, rettype="genbank")
print "downloading to SeqRecord..."
record = SeqIO.read(handle, "genbank")
print "...done"
------------------------------------END-CODE-----------------------------

it fails with the output:

------------------------------------OUTPUT-----------------------------
Drosophila melanogaster chromosome X, complete sequence
downloading to SeqRecord...
Traceback (most recent call last):
  File "efetch-test.py", line 7, in <module>
    record = SeqIO.read(handle, "genbank")
  File "/NetUsers/stschiff/lib/python/Bio/SeqIO/__init__.py", line 366, in read
    first = iterator.next()
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 410, in parse_records
    record = self.parse(handle)
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 393, in parse
    if self.feed(handle, consumer) :
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 370, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "/NetUsers/stschiff/lib/python/Bio/GenBank/Scanner.py", line 723, in parse_footer
    raise ValueError("Premature end of file in sequence data")
ValueError: Premature end of file in sequence data
------------------------------------END-OUTPUT-----------------------------

If I change the id to "56" (chromosome 4, which is shorter) it works. But for all the other chromosomes (ids: 57 - 61) it fails.
If I download the genbank files manually from the ftp-server and then use SeqIO.read() it works, so the download-process corrupts the genbank files if they are very large (about 35 MB) I guess...

Any hints?

Best,
Stephan

From biopython at maubp.freeserve.co.uk Wed Oct 8 14:57:08 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 19:57:08 +0100
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <133483072970409871957631124263040035200-Webmail2@me.com>
References: <133483072970409871957631124263040035200-Webmail2@me.com>
Message-ID: <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>

On Wed, Oct 8, 2008 at 6:11 PM, Stephan wrote:
> Sorry to have an Entrez.efetch-issue again, but somehow there
> seems to be a problem with very large files.
> ...
> If I change the id to "56" (chromosome 4, which is shorter) it works.
> But for all the other chromosomes (ids: 57 - 61) it fails.
> If I download the genbank files manually from the ftp-server and > then use SeqIO.read() it works, so the download-process corrupts > the genbank files if they are very large (about 35 MB) I guess... > > Any hints? Yes - one big hint: DON'T try and parse these large files directly from the internet. Use efetch to download the file and save it to disk. Then open this local file for parsing. There are several good reasons for this: (1) Rerunning the script (e.g. during development) needn't re-download the file, which wastes time and money (yours and more importantly the NCBI's). You may be fine, but the NCBI can and do ban people's IP addresses if they breach the guidelines. (2) If the parsing fails, there is something to debug easily (the local file). You can open the file in a text editor to check it etc. That being said, downloading and parsing in one go should work - I would expect an IO error if the network timed out, rather than what appears to be the data ending prematurely. However, I don't expect this to be easy to resolve - quite possibly this is a network time out somewhere, maybe at your end, maybe on one of the ISP connections in between. On the bright side, at least the parser isn't silently ignoring the end of the file, which would leave you with a truncated sequence without any warnings :) Do you think the Biopython tutorial should be more explicit about this topic? e.g. In chapter 4 (on Bio.SeqIO) I wrote: >> Note that just because you can download sequence data and >> parse it into a SeqRecord object in one go doesn't mean this >> is always a good idea. In general, you should probably download >> sequences once and save them to a file for reuse. Maybe I should have said "... doesn't mean this is a good idea..." instead? 
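Once the download is saved to disk, one cheap way to tell a truncated file from a genuine parser problem is to look at how the file ends, since each record in a GenBank flat file is closed by a line containing only "//". The following is a rough sketch under that assumption, in current Python 3 with only the standard library. The function name ends_with_terminator and the two tiny example files are made up for illustration, and this is not how Bio.SeqIO itself detects the problem:

```python
import os
import tempfile


def ends_with_terminator(path):
    """Return True if the last non-blank line of the file is the '//' terminator."""
    last = ""
    with open(path) as handle:
        # Stream line by line so that a large chromosome file is never
        # held in memory all at once.
        for line in handle:
            if line.strip():
                last = line.strip()
    return last == "//"


if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    complete = os.path.join(folder, "complete.gbk")
    truncated = os.path.join(folder, "truncated.gbk")
    # Stand-in contents only; these are not real GenBank records.
    with open(complete, "w") as handle:
        handle.write("LOCUS       EXAMPLE\nORIGIN\n        1 acgtacgt\n//\n")
    with open(truncated, "w") as handle:
        handle.write("LOCUS       EXAMPLE\nORIGIN\n        1 acgt")
    print(ends_with_terminator(complete))
    print(ends_with_terminator(truncated))
```

A file that fails this check was most likely cut short in transit, in which case it is worth downloading again before looking at the parser.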
Peter

From biopython at maubp.freeserve.co.uk Wed Oct 8 15:32:59 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 20:32:59 +0100
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>
References: <133483072970409871957631124263040035200-Webmail2@me.com>
	<320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>
Message-ID: <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com>

> Yes - one big hint: DON'T try and parse these large files directly
> from the internet. Use efetch to download the file and save it to
> disk. Then open this local file for parsing.
> ...
> Do you think the Biopython tutorial should be more explicit about this
> topic?

I've changed the tutorial (the SeqIO and Entrez chapters) in CVS to make this advice more explicit, and included an example of doing this too.

import os
from Bio import SeqIO
from Bio import Entrez
Entrez.email = "A.N.Other at example.com" # Always tell NCBI who you are
filename = "gi_186972394.gbk"
if not os.path.isfile(filename) :
    print "Downloading..."
    net_handle = Entrez.efetch(db="nucleotide",id="186972394",rettype="genbank")
    out_handle = open(filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    net_handle.close()
    print "Saved"

print "Parsing..."
record = SeqIO.read(open(filename), "genbank")
print record

Peter

From biopython at maubp.freeserve.co.uk Wed Oct 8 16:57:03 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 8 Oct 2008 21:57:03 +0100
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com>
References: <133483072970409871957631124263040035200-Webmail2@me.com>
	<320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com>
	<320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com>
	<2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com>
Message-ID: <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com>

On Wed, Oct 8, 2008 at 9:37 PM, Stephan Schiffels wrote:
>
> Hi Peter,
>
> OK, first of all... you were right of course, with
> out_handle.write(net_handle.read()) the download works properly and reading
> the file from disk also works. The tutorial is very clear on that point, I
> agree.

OK - hopefully I've just made it clearer still ;)

> To illustrate why I made the mistake even though I read the tutorial:
> I made some code like:
>
> try:
>     unpickling a file as SeqRecord...
> except IOError:
>     download file into SeqRecord AND pickle afterwards to disk
>
> So, as you can see, I already tried to make the download only once!

I see - interesting.

> The disk-saving step, I realized, was smarter to do via cPickle since then
> reading from it also goes faster than parsing the genbank file each time. So
> my goal was to either load a pickled SeqRecord, or download into SeqRecord
> and then pickle to disk. I hope you agree that concerning resources from
> NCBI this way is (at least in principle) already quite optimal.

Your approach is clever, and I agree, it shouldn't make any difference to the number of downloads from the NCBI (once you have the script debugged and working).

I'm curious - do you have any numbers for the relative times to load a SeqRecord from a pickle, or re-parse it from the GenBank file?
I'm aware of some "hot spots" in the GenBank parser which take more time than they really need to (feature location parsing in particular).

However, even if using pickles is much faster, I would personally still rather use this approach:

if file not present:
    download from NCBI and save it
parse file

I think it is safer to keep the original data in the NCBI provided format, rather than as a python pickle. Some of my reasons include:

* you might want to parse the files with a different tool one day (e.g. grep, or maybe BioPerl, or EMBOSS)
* different versions of Biopython will parse the file slightly differently (e.g. once Bugs 2225 and 2578 are fixed the SeqRecord should include slightly more information from a GenBank file) while your pickle will be static
* if the SeqRecord or Seq objects themselves change slightly between versions of Biopython, the pickle may not work
* more generally, is it safe to transfer the pickle files between different computers (e.g. different versions of python or Biopython, different OS, different line endings)?

These issues may not be a problem in your setting. More generally, you could consider using BioSQL, but this may be overkill for your needs.

> However, as you pointed out, parsing from the internet makes problems.

If you do work out exactly what is going wrong, I would be interested to hear about it.

> I think the advantages of not having to download each time were clear to me
> from the tutorial. Just that downloading AND parsing at the same time makes
> problems didnt appear to me. The addings to the tutorial seem to give some
> idea.

Your approach all makes sense. Thanks for explaining your thoughts. I don't think I'd ever tried efetch on such a large GenBank file in the first place - for genomes I have usually used FTP instead.
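The download-once approach outlined earlier in this message (if the file is not present, download and save it, then work from the local copy) could be sketched roughly as follows. This is only an illustration in current Python 3 using the standard library. The name fetch_once, the ".part" suffix and the fetch argument are all hypothetical, and the actual download is left to whatever callable is passed in, so that the caching logic can be tried without touching the NCBI servers:

```python
import os
import tempfile


def fetch_once(path, fetch):
    """Return the text stored at path, calling fetch() to create it only if missing.

    fetch is a hypothetical zero-argument callable that returns the record
    text; in a real script it might wrap Entrez.efetch(...).read().
    """
    if not os.path.isfile(path):
        tmp_path = path + ".part"
        with open(tmp_path, "w") as out_handle:
            out_handle.write(fetch())
        # Rename only after the write has been closed, so an interrupted
        # download never leaves a half-written file under the final name.
        os.replace(tmp_path, path)
    with open(path) as in_handle:
        return in_handle.read()


if __name__ == "__main__":
    def fake_fetch():
        # Stand-in for a network download; not real NCBI data.
        print("Downloading...")
        return "LOCUS       EXAMPLE\nORIGIN\n        1 acgt\n//\n"

    filename = os.path.join(tempfile.mkdtemp(), "example.gbk")
    first = fetch_once(filename, fake_fetch)   # calls fake_fetch
    second = fetch_once(filename, fake_fetch)  # reads the saved copy instead
    assert first == second
```

The saved file stays in the original text format, so it can still be inspected in an editor or handed to another parser later, which is the main argument made above for preferring it over a pickle.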
Peter From chapmanb at 50mail.com Wed Oct 8 17:11:25 2008 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 8 Oct 2008 17:11:25 -0400 Subject: [BioPython] Entrez.efetch In-Reply-To: <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> References: <75573950382669954948356356615157751492-Webmail2@me.com> <20081008123533.GE57379@sobchak.mgh.harvard.edu> <320fb6e00810080646o3f2f9887m3d7f3927009fc53c@mail.gmail.com> Message-ID: <20081008211125.GB17555@sobchak.mgh.harvard.edu> Peter and Stephan; My fault -- sorry about the red herring on this one. I shouldn't have tried to answer this e-mail in 5 minutes before work this morning. Sounds like y'all have it resolved with the missing close so I will keep my mouth shut. Peter, I don't remember my exact problem as it was in some throw-away script and the fix seemed non-problematic. I was thrown off by the "line 3" information Stephan mentioned because my issue was with the first couple of lines missing when iterating with an UndoHandle. No matter. Thanks for coming up with the right fix! Brad > Stephan wrote: > >> When I download this chromosome manually from the NCBI-website, > >> I indeed find a difference in one line, namely in line 3 of the > >> genbank file. In the manually downloaded file line 3 reads: > >> "ACCESSION NC_004353 REGION: 1..1351857", while in the file produced > >> from my code I have only: "ACCESSION NC_004353". So without that > >> region-information, the biopython parser of course runs to a premature > >> end. > > Stephan - when you say manually, do you mean via a web browser? If so > it is likely to be using a subtly different URL, which might explain > the NCBI generating slightly different data on the fly. Either way, > this ACCESSION line difference shouldn't trigger the "Premature end of > file in sequence data" error in the GenBank parser. 
> > On Wed, Oct 8, 2008 at 1:35 PM, Brad Chapman wrote: > > This is a tricky problem that I ran into as well and is fixed in the > > latest CVS version. The issue is that the Biopython reader is using an > > UndoHandle instead of a standard python handle. By default some of these > > operations appear to be assuming an iterator, but UndoHandle did not > > provide this. > > Brad, I'm pretty sure the GenBank parser is NOT using the UndoHandle. > Just adding the close made Stephan's example work for me. What > exactly was the problem you ran into (one of the other parsers > perhaps?). > > > As a result, you can lose the first couple of lines which are > > previously examined to determine the filetype. The fix is to make > > this a proper iterator. You can either check out current CVS, or > > make the addition manually to Bio/File.py in your current version: > > > > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/File.py.diff?r1=1.17&r2=1.18&cvsroot=biopython > > Adding this to the UndoHandle seems a sensible improvement - but I > don't see how it can affect Stephan's script. > > Peter From stephan80 at mac.com Wed Oct 8 16:37:17 2008 From: stephan80 at mac.com (Stephan Schiffels) Date: Wed, 08 Oct 2008 22:37:17 +0200 Subject: [BioPython] Entrez.efetch large files In-Reply-To: <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> References: <133483072970409871957631124263040035200-Webmail2@me.com> <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> Message-ID: <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com> Hi Peter, OK, first of all... you were right of course, with out_handle.write (net_handle.read()) the download works properly and reading the file from disk also works.The tutorial is very clear on that point, I agree. To illustrate why I made the mistake even though I read the tutorial: I made some code like: try: unpickling a file as SeqRecord... 
except IOError: download file into SeqRecord AND pickle afterwards to disk

So, as you can see, I already tried to do the download only once! I realized it was smarter to do the disk-saving step via cPickle, since reading from a pickle is also faster than parsing the GenBank file each time. So my goal was to either load a pickled SeqRecord, or download into a SeqRecord and then pickle it to disk. I hope you agree that, as far as NCBI resources are concerned, this way is (at least in principle) already close to optimal. However, as you pointed out, parsing directly from the internet causes problems.

The advantages of not having to download each time were clear to me from the tutorial. It just did not occur to me that downloading AND parsing at the same time causes problems. The additions to the tutorial seem to make this clearer.

Thanks and Regards,
Stephan

On 08.10.2008 at 21:32, Peter wrote:

>> Yes - one big hint: DON'T try and parse these large files directly
>> from the internet. Use efetch to download the file and save it to
>> disk. Then open this local file for parsing.
>> ...
>> Do you think the Biopython tutorial should be more explicit about
>> this topic?
>
> I've changed the tutorial (the SeqIO and Entrez chapters) in CVS to
> make this advice more explicit, and included an example of doing this
> too.
>
> import os
> from Bio import SeqIO
> from Bio import Entrez
> Entrez.email = "A.N.Other at example.com" # Always tell NCBI who you are
> filename = "gi_186972394.gbk"
> if not os.path.isfile(filename) :
>     print "Downloading..."
>     net_handle = Entrez.efetch(db="nucleotide",id="186972394",rettype="genbank")
>     out_handle = open(filename, "w")
>     out_handle.write(net_handle.read())
>     out_handle.close()
>     net_handle.close()
>     print "Saved"
>
> print "Parsing..."
> record = SeqIO.read(open(filename), "genbank")
> print record
>
>
> Peter

From biopythonlist at gmail.com Thu Oct 9 04:52:42 2008
From: biopythonlist at gmail.com (dr goettel)
Date: Thu, 9 Oct 2008 10:52:42 +0200
Subject: [BioPython] taxonomic tree
In-Reply-To: <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com>
References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com>
Message-ID: <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com>

On Wed, Oct 8, 2008 at 6:38 PM, Peter wrote:

> On Wed, Oct 8, 2008 at 5:23 PM, dr goettel wrote:
> > Hello, I'm new in this list and in BioPython.
>
> Hello :)
>
> > I would like to create a NCBI-like taxonomic tree and then fill it with the
> > organisms that I have in a file. Is there an easy way to do this? I started
> > using biopython's function at 7.11.4 (finding the lineage of an organism) in
> > the tutorial, ...
>
> For anyone reading this later on, note that the tutorial section
> numbers tend to change with each release of Biopython. This section
> just uses Bio.Entrez to fetch taxonomy information for a particular
> NCBI taxon id.
>
> > but I need to do this tens of thousands times so it spends too
> > much time querying NCBI database.
>
> Also calling Bio.Entrez 10000 times might annoy the NCBI ;)
>
> > Therefore I built a taxonomic database
> > locally and implemented something similar to 7.11.4 tutorial's function so I
> > get, for every sequence, the lineage in the same way:
> >
> > 'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina;
> > Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta;
> > Liliopsida; Asparagales; Orchidaceae'
>
> I assume you used the NCBI provided taxdump files to populate the
> database? See ftp://ftp.ncbi.nih.gov/pub/taxonomy/
>
Yes I did.
> Personally rather than designing my own database just for this (and
> writing a parser for the taxonomy files), I would have suggested
> installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl
> to download and import the data for you. This is a simple perl script
> - you don't need BioPerl. See http://www.biopython.org/wiki/BioSQL
> for details.
>
I also used the load_ncbi_taxonomy.pl script. It worked great!

>
> > Now I need to create a tree, or fill an already created one. And then search
> > it by some criteria.
>
> What kind of tree do you mean? Are you talking about creating a
> Newick tree, or an in memory structure? Perhaps the Bio.Nexus
> module's tree functionality would help.
>
Thank you very much. I still don't know whether I want a Newick tree or an in-memory structure. I'll take a look at the Bio.Nexus module.

>
> If you are interested, the BioSQL tables record the taxonomy tree
> using two methods, each node has a parent node allowing you to walk up
> the lineage. There are also left/right values allowing selection of
> all child nodes efficiently via an SQL select statement.
>
> Peter
>
This is what I was trying to do: start from the name of the organism (the leaf of the tree) and fetch every node using the parent_node field of the taxon table, until reaching the root node. Once I have all the steps to the root node, I have to create or fill the tree with my data in order to examine the number of organisms belonging to a certain class/order/family/genus, etc.

Any ideas would be very much appreciated. Thank you very much for your answer; I'll take a look at the Bio.Nexus module.
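The parent-walking approach described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the table and its columns (taxon, taxon_id, parent_taxon_id, node_rank, name) are a simplified, hypothetical stand-in loosely modelled on the BioSQL taxon tables, the five rows are toy data rather than the real NCBI taxonomy, and the code is written for a current Python 3 interpreter rather than the Python 2.5 used elsewhere in this thread.

```python
import sqlite3
from collections import Counter

# Hypothetical, simplified stand-in for a local taxonomy table (toy data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon ("
    " taxon_id INTEGER PRIMARY KEY,"
    " parent_taxon_id INTEGER,"
    " node_rank TEXT,"
    " name TEXT)"
)
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?, ?, ?)",
    [
        (1, None, "no rank", "root"),
        (2, 1, "family", "Orchidaceae"),
        (3, 2, "genus", "Cypripedium"),
        (4, 3, "species", "Cypripedium calceolus"),
        (5, 3, "species", "Cypripedium reginae"),
    ],
)

SELECT_COLUMNS = "SELECT taxon_id, parent_taxon_id, node_rank, name FROM taxon"


def lineage(conn, name):
    """Return [(rank, name), ...] from the root down to the named node."""
    row = conn.execute(SELECT_COLUMNS + " WHERE name = ?", (name,)).fetchone()
    steps = []
    while row is not None:
        taxon_id, parent_id, rank, node_name = row
        steps.append((rank, node_name))
        # Stop at the root: no parent, or a root that is its own parent.
        if parent_id is None or parent_id == taxon_id:
            break
        row = conn.execute(
            SELECT_COLUMNS + " WHERE taxon_id = ?", (parent_id,)
        ).fetchone()
    steps.reverse()
    return steps


def count_by_rank(conn, organisms, rank):
    """Count how many of the given organisms fall under each node of a rank."""
    counts = Counter()
    for organism in organisms:
        for node_rank, node_name in lineage(conn, organism):
            if node_rank == rank:
                counts[node_name] += 1
    return counts


my_organisms = ["Cypripedium calceolus", "Cypripedium reginae"]
genus_counts = count_by_rank(conn, my_organisms, "genus")
```

Each lineage costs one query per level, which is fine for a local database but would be a poor fit for repeated remote lookups.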
drG From biopython at maubp.freeserve.co.uk Thu Oct 9 05:31:16 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Oct 2008 10:31:16 +0100 Subject: [BioPython] taxonomic tree In-Reply-To: <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com> References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com> <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com> Message-ID: <320fb6e00810090231w723e3b29m5e070c55166d3bfc@mail.gmail.com> >> Personally rather than designing my own database just for this (and >> writing a parser for the taxonomy files), I would have suggested >> installing BioSQL, and using the BioSQL script load_ncbi_taxonomy.pl >> to download and import the data for you. This is a simple perl script >> - you don't need BioPerl. See http://www.biopython.org/wiki/BioSQL >> for details. > > I also used the load_ncbi_taxonomy.pl script. It worked great! Good. I would encourage you to use the version from BioSQL v1.0.1 if you are not already, as the version with BioSQL v1.0.0 makes an additional unnecessary assumption about the database keys matching the NCBI taxon ID. >> If you are interested, the BioSQL tables record the taxonomy tree >> using two methods, each node has a parent node allowing you to walk up >> the lineage. There are also left/right values allowing selection of >> all child nodes efficiently via an SQL select statement. > > This is what I was trying to do, from the name of the organism (the leaf of > the tree) and getting every node using the parent_node field of the taxon > table, until reaching the root node. Once I have all the steps to the root > node then I have to create/filling the tree with my data in order to > examinate the number of organisms integrating certain > class/order/family/genus... etc > Any ideas will be very apreciated. 
To do this in Biopython you'll have to write some SQL commands - but first you need to understand how the left/right values work if you want to take advantage of them. I refer you to this thread on the BioSQL mailing list earlier in the year:
http://lists.open-bio.org/pipermail/biosql-l/2008-April/001234.html

In particular, Hilmar referred to Joe Celko's SQL for Smarties books, and the introduction to this nested-set representation given here:
http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html

Alternatively, if you wanted to avoid the left/right values, you could use recursion or loops on the parent ID links to build up the tree. For a single lineage this is fine - but for a full tree I would expect the left/right values to be faster.

Note that Biopython (in CVS now) ignores the left/right values. This is for two reasons. First, for pulling out a single lineage, Eric found that walking the parent links was faster. Second, when adding new entries to the database, re-calculating the left/right values is too slow, so we leave them as NULL (and let the user (re)run load_ncbi_taxonomy.pl later if they care). This means we don't want to depend on the left/right values being present.
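To make the left/right (nested-set) idea concrete, here is a minimal sketch in Python with SQLite. It assumes the left/right values have actually been populated (as noted above, they may be NULL in a real database), and it uses a simplified, hypothetical table with toy rows; the column names left_value and right_value are chosen to resemble the BioSQL taxon table, but nothing here is the real schema or real NCBI data. A node's descendants are exactly the rows whose left value lies between that node's left and right values, and its ancestors are the rows whose interval encloses it.

```python
import sqlite3

# Hypothetical toy table; the left/right numbers come from a depth-first walk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon ("
    " taxon_id INTEGER PRIMARY KEY,"
    " node_rank TEXT,"
    " name TEXT,"
    " left_value INTEGER,"
    " right_value INTEGER)"
)
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?, ?, ?, ?)",
    [
        (1, "no rank", "root", 1, 10),
        (2, "family", "Orchidaceae", 2, 9),
        (3, "genus", "Cypripedium", 3, 8),
        (4, "species", "Cypripedium calceolus", 4, 5),
        (5, "species", "Cypripedium reginae", 6, 7),
    ],
)


def species_under(conn, name):
    """All species inside the subtree of the named node, in one query."""
    rows = conn.execute(
        "SELECT sub.name FROM taxon AS anc"
        " JOIN taxon AS sub"
        "   ON sub.left_value BETWEEN anc.left_value AND anc.right_value"
        " WHERE anc.name = ? AND sub.node_rank = 'species'"
        " ORDER BY sub.left_value",
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


def ancestors(conn, name):
    """The lineage from the root down to the named node, in one query."""
    rows = conn.execute(
        "SELECT anc.name FROM taxon AS node"
        " JOIN taxon AS anc"
        "   ON node.left_value BETWEEN anc.left_value AND anc.right_value"
        " WHERE node.name = ?"
        " ORDER BY anc.left_value",
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


orchid_species = species_under(conn, "Orchidaceae")
reginae_lineage = ancestors(conn, "Cypripedium reginae")
```

The trade-off matches the one described above: subtree queries need no recursion, but every insertion would force the left/right values to be renumbered.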
Peter

From stephan.schiffels at uni-koeln.de Thu Oct 9 09:01:11 2008
From: stephan.schiffels at uni-koeln.de (Stephan Schiffels)
Date: Thu, 9 Oct 2008 15:01:11 +0200
Subject: [BioPython] Entrez.efetch large files
In-Reply-To: <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com>
References: <133483072970409871957631124263040035200-Webmail2@me.com> <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com> <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com>
Message-ID: <171A75DA-EE34-44AB-8E16-DEC626F7164C@uni-koeln.de>

Hi Peter,

On 08.10.2008 at 22:57, Peter wrote:

> I'm curious - do you have any numbers for the relative times to load a
> SeqRecord from a pickle, or re-parse it from the GenBank file? I'm
> aware of some "hot spots" in the GenBank parser which take more time
> than they really need to (feature location parsing in particular).

So, here is a little profiling of reading a large chromosome both as GenBank and from a pickled SeqRecord (both from disk, of course):

>>> from timeit import Timer
>>> t = Timer("a = cPickle.load(open('DroMel_chr2L.pickle'))", "import cPickle")
>>> t.timeit(number=1)
5.2086620330810547
>>> t = Timer("a = SeqIO.read(open('DroMel_chr2L.gbk'), 'genbank')", "from Bio import SeqIO")
>>> t.timeit(number=1)
53.902437925338745
>>>

As you see, there is an amazing 10-fold speed gain using cPickle in comparison to SeqIO.read() ... not bad! The pickled file is a bit larger than the GenBank file, but not much.

> However, even if using pickles is much faster, I would personally
> still rather use this approach:
>
> if file not present:
>     download from NCBI and save it
> parse file
>

That's precisely how I do it now. Works well!

> I think it is safer to keep the original data in the NCBI provided
> format, rather than as a python pickle. Some of my reasons include:
>
> * you might want to parse the files with a different tool one day
> (e.g. grep, or maybe BioPerl, or EMBOSS)
> * different versions of Biopython will parse the file slightly
> differently (e.g. once Bugs 2225 and 2578 are fixed the SeqRecord
> should include slightly more information from a GenBank file) while
> your pickle will be static
> * if the SeqRecord or Seq objects themselves change slightly between
> versions of Biopython, the pickle may not work
> * more generally, is it safe to transfer the pickle files between
> different computers (e.g. different versions of python or Biopython,
> different OS, different line endings)?
>
> These issues may not be a problem in your setting.

You are right, and in fact I now save both the GenBank file and the pickled file to disk, so I have a full backup.

>
> More generally, you could consider using BioSQL, but this may be
> overkill for your needs.
>

BioSQL is something that I like a lot. I have not yet dug my way through it, but hopefully there will be options for me from that side as well.

>> However, as you pointed out, parsing from the internet makes
>> problems.
>
> If you do work out exactly what is going wrong, I would be interested
> to hear about it.
>

Hmm, I probably won't find out. Parsing from the internet works for small files, so it must be some network issue; I don't know. Since I am on the university network I doubt that the error starts on my side. Maybe NCBI drops the connection if the other side is too slow, which is the case during parsing... But I understand too little about networking.

>> I think the advantages of not having to download each time were
>> clear to me from the tutorial. Just that downloading AND parsing at
>> the same time makes problems didnt appear to me. The addings to the
>> tutorial seem to give some idea.
>
> Your approach all makes sense. Thanks for explaining your thoughts. I
> don't think I'd ever tried efetch on such a large GenBank file in the
> first place - for genomes I have usually used FTP instead.
> > Peter Regards, Stephan From biopython at maubp.freeserve.co.uk Thu Oct 9 10:18:52 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 9 Oct 2008 15:18:52 +0100 Subject: [BioPython] Entrez.efetch large files In-Reply-To: <171A75DA-EE34-44AB-8E16-DEC626F7164C@uni-koeln.de> References: <133483072970409871957631124263040035200-Webmail2@me.com> <320fb6e00810081157o486b7dd7pbc032788555ce73d@mail.gmail.com> <320fb6e00810081232u1521190fm5468a836816cdaeb@mail.gmail.com> <2C675D8E-9724-4D01-88BF-D61AB5B999A4@mac.com> <320fb6e00810081357s65d1fa8h80e64546dd0929a2@mail.gmail.com> <171A75DA-EE34-44AB-8E16-DEC626F7164C@uni-koeln.de> Message-ID: <320fb6e00810090718g3420729fh50520a4760c5d27@mail.gmail.com> Peter wrote: >> I'm curious - do you have any numbers for the relative times to load a >> SeqRecord from a pickle, or re-parse it from the GenBank file? I'm >> aware of some "hot spots" in the GenBank parser which take more time >> than they really need to (feature location parsing in particular). Stephan wrote: > So, here is a little profiling of reading a large chromosome both as genbank > and from a pickled SeqRecord (both from disk of course): >>>> t = Timer("a = cPickle.load(open('DroMel_chr2L.pickle'))", "import >>>> cPickle") >>>> t.timeit(number=1) > 5.2086620330810547 >>>> t = Timer("a = SeqIO.read(open('DroMel_chr2L.gbk'), 'genbank')", "from >>>> Bio import SeqIO") >>>> t.timeit(number=1) > 53.902437925338745 >>>> > > As you see there is an amazing 10fold speed-gain using cPickle in comparison > to SeqIO.read() ... not bad! The pickled file is a bit larger than the > genbank file, but not much. I'm seeing more like a three fold speed-gain (using cPickle protocol 0, with Python 2.5.2 on a Mac), which is less impressive. For a 10 fold speed up I can see why the complexity overhead of using pickle could be worthwhile. 
cPickle.load() took 8.5s
cPickle.load() took 10.0s
cPickle.load() took 9.9s
SeqIO.read() took 29.9s
SeqIO.read() took 29.8s
SeqIO.read() took 29.8s

(Script below)

I'm not very impressed with the 30 seconds needed to parse a 30MB file. There is certainly scope for speeding up the GenBank parsing here.

Peter

---------------

My timing script:

import os
import cPickle
import time
from Bio import Entrez, SeqIO

#Entrez.email = "..."
id="57"
genbank_filename = "NC_004354.gbk"
pickle_filename = "NC_004354.pickle"

if not os.path.isfile(genbank_filename) :
    print "Downloading..."
    net_handle = Entrez.efetch(db="genome", id=id, rettype="genbank")
    out_handle = open(genbank_filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    print "Saved"

if not os.path.isfile(pickle_filename) :
    print "Parsing..."
    record = SeqIO.read(open(genbank_filename), 'genbank')
    print "Pickling..."
    out_handle = open(pickle_filename ,"w")
    cPickle.dump(record, out_handle)
    out_handle.close()
    print "Saved"

print "Profiling..."
for i in range(3) :
    start = time.time()
    record = cPickle.load(open(pickle_filename))
    print "cPickle.load() took %0.1fs" % (time.time() - start)
for i in range(3) :
    start = time.time()
    record = SeqIO.read(open(genbank_filename), 'genbank')
    print "SeqIO.read() took %0.1fs" % (time.time() - start)
print "Done"

From biopython at maubp.freeserve.co.uk Thu Oct 9 11:48:26 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 9 Oct 2008 16:48:26 +0100
Subject: [BioPython] Deprecating Bio.PubMed and some of Bio.GenBank
Message-ID: <320fb6e00810090848g2a516877i5950f515e748b9d0@mail.gmail.com>

Dear Biopythoneers,

Those of you who looked at the release notes for Biopython 1.48 might have read this bit:

>> Bio.PubMed and the online code in Bio.GenBank are now considered
>> obsolete, and we intend to deprecate them after the next release.
>> For accessing PubMed and GenBank, please use Bio.Entrez instead.

These bits of code are effectively simple wrappers for Bio.Entrez.
While they may be simple to use, they cannot take advantage of the NCBI's Entrez utils history functionality. This means they discourage users from following the NCBI's preferred usage patterns.

We're already trying to encourage the use of Bio.Entrez by documenting it prominently in the tutorial (which seems to be working given the recent questions on the mailing list), but for Biopython 1.49 I'm suggesting we go further and deprecate Bio.PubMed and the online code in Bio.GenBank. This would mean a warning message would appear when this code is used, and (barring feedback) after a couple of releases this code would be removed completely.

Any comments or objections? In particular, is anyone using this "obsolete" functionality now?

Peter

From biopythonlist at gmail.com Thu Oct 9 12:32:11 2008
From: biopythonlist at gmail.com (dr goettel)
Date: Thu, 9 Oct 2008 18:32:11 +0200
Subject: [BioPython] taxonomic tree
In-Reply-To: <320fb6e00810090231w723e3b29m5e070c55166d3bfc@mail.gmail.com>
References: <9b15d9f30810080923i26e6323do2ca243362ce239dc@mail.gmail.com> <320fb6e00810080938v51dd369dg8d419ac43dd8b6a1@mail.gmail.com> <9b15d9f30810090152lc508bbcy70da92f6a8304a99@mail.gmail.com> <320fb6e00810090231w723e3b29m5e070c55166d3bfc@mail.gmail.com>
Message-ID: <9b15d9f30810090932qb22ca8boc6edc871bf285154@mail.gmail.com>

> To do this in Biopython you'll have to write some SQL commands - but
> first you need to understand how the left/right values work if you
> want to take advantage of them. I refer you to this thread on the
> BioSQL mailing list earlier in the year:
> http://lists.open-bio.org/pipermail/biosql-l/2008-April/001234.html
>
> In particular, Hilmar referred to Joe Celko's SQL for Smarties books,
> and the introduction to this nested-set representation given here:
> http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html
>

That's great!! Taking advantage of the left/right values will help me!! They're great.
I started writing a lot of code to do something that can in fact be done with a few SQL statements. The SQL statements are quite difficult for me, though, so I will have to dig deeper into "inner joins".

Thank you very much,
drG

From biopython at maubp.freeserve.co.uk Mon Oct 13 08:38:56 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Oct 2008 13:38:56 +0100
Subject: [BioPython] Translation method for Seq object
Message-ID: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>

Dear Biopythoneers,

This is a request for feedback about proposed additions to the Seq object for the next release of Biopython. I'd like people to pick (a) to (e) in the list below (with additional comments or counter suggestions welcome).

Enhancement bug 2381 is about adding transcription and translation methods to the Seq object, allowing an object orientated style of programming.

e.g. Current functional programming style:

>>> from Bio.Seq import Seq, transcribe
>>> from Bio.Alphabet import generic_dna
>>> my_seq = Seq("CAGTGACGTTAGTCCG", generic_dna)
>>> my_seq
Seq('CAGTGACGTTAGTCCG', DNAAlphabet())
>>> transcribe(my_seq)
Seq('CAGUGACGUUAGUCCG', RNAAlphabet())

With the latest Biopython in CVS, you can now invoke a Seq object method instead for transcription (or back transcription):

>>> my_seq.transcribe()
Seq('CAGUGACGUUAGUCCG', RNAAlphabet())

For comparison, consider the shift from python string functions to string methods. This also makes the functionality more discoverable via dir(my_seq).

Adding Seq object methods "transcribe" and "back_transcribe" doesn't cause any confusion with the python string methods. However, for translation, the python string has an existing "translate" method:

> S.translate(table [,deletechars]) -> string
>
> Return a copy of the string S, where all characters occurring
> in the optional argument deletechars are removed, and the
> remaining characters have been mapped through the given
> translation table, which must be a string of length 256.
I don't think this functionality is really of direct use for sequences, and having a Seq object "translate" method do a biological translation into a protein sequence is much more intuitive. However, this could cause confusion if the Seq object is passed to non-Biopython code which expects a string-like translate method.

To avoid this naming clash, a different method name would be needed.

This is where some user feedback would be very welcome - I think the following cover all the alternatives of what to call a biological translation function (nucleotide to protein):

(a) Just use translate (ignore the existing string method)
(b) Use translate_ (trailing underscore, see PEP8)
(c) Use translation (a noun rather than verb; different style).
(d) Use something else (e.g. bio_translate or ...)
(e) Don't add a biological translation method at all because ...

Thanks,

Peter

See also http://bugzilla.open-bio.org/show_bug.cgi?id=2381

From ericgibert at yahoo.fr Mon Oct 13 10:38:02 2008
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Mon, 13 Oct 2008 22:38:02 +0800
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
Message-ID:

(a)

Seq is an object, string is another object... each of them has various methods, and coincidentally two of them have the same name...

Eric

From bsouthey at gmail.com Mon Oct 13 10:58:07 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Mon, 13 Oct 2008 09:58:07 -0500
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
Message-ID: <48F361FF.103@gmail.com>

Hi,

My thought on this is that it is generally best to avoid any confusion when possible. But 'translate' is not a reserved word, and the Python documentation notes that the unicode version lacks the optional deletechars argument (so there is precedent for using the same word). It also involves the methods versus functions argument, but many of the string functions have been deprecated and will be removed in Python 3.0 (so in Python 3.0 I think it will be hard to get a name clash without some strange inheritance going on).

Therefore, provided 'translate' is a method of Seq, I do not see any strong reason to avoid it except that it is long (but shorter than translation) :-)

Would it be too cryptic to have dna(), rna() and protein() methods that provide the appropriate conversion based on the Seq type? Obviously reverse translation of a protein sequence to a DNA sequence is complex if there are many solutions.

Regards

Bruce

From mjldehoon at yahoo.com Mon Oct 13 10:57:28 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 13 Oct 2008 07:57:28 -0700 (PDT)
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
Message-ID: <421846.1946.qm@web62403.mail.re1.yahoo.com>

(f) Use .translate both for the Python .translate and for the Biopython .translate.

S.translate() ===> Biopython .translate
S.translate(table [,deletechars]) ===> Python .translate

We can tell from the presence or absence of arguments whether the user intends Python's translate or Biopython's translate.

--Michiel.

From biopython at maubp.freeserve.co.uk Mon Oct 13 11:27:37 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Oct 2008 16:27:37 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <421846.1946.qm@web62403.mail.re1.yahoo.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> <421846.1946.qm@web62403.mail.re1.yahoo.com>
Message-ID: <320fb6e00810130827j3ec07434s2f58e370743f9537@mail.gmail.com>

So I did manage to leave off at least one other option from my short list :)

Michiel de Hoon wrote:
>
> (f) Use .translate both for the Python .translate and for the Biopython .translate.
>
> S.translate() ===> Biopython .translate
>
> S.translate(table [,deletechars]) ===> Python .translate
>
> We can tell from the presence or absence of arguments whether the user intends Python's translate or Biopython's translate.

Sadly, it's not quite that simple.
For a biological translation we'd probably want to offer optional arguments for at least the codon table and stop symbol (like the current Bio.Seq.translate() function), with other further arguments possible (e.g. to treat the sequence as a complete CDS where the start codon should be validated and taken as M). It would still be possible to automatically detect which translation was required, but it wouldn't be very nice. So overall I'm not keen on this approach.

Peter

From biopython at maubp.freeserve.co.uk Mon Oct 13 11:54:32 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Oct 2008 16:54:32 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <48F361FF.103@gmail.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> <48F361FF.103@gmail.com>
Message-ID: <320fb6e00810130854m38f37075gf85b798cb4a98e21@mail.gmail.com>

Bruce wrote:
> ...
> Therefore, provided 'translate' is a method of Seq then I do not see any
> strong reason to avoid it except that it is long (but shorter than
> translation) :-)

Good - that sounds like another vote for option (a) in my original list.

> Would be too cryptic to have dna(), rna() and protein() methods that provide
> the appropriate conversion based on the Seq type?

Or in a similar vein, to_dna, to_rna, and to_protein? Or toDNA, toRNA, toProtein? I'd have to go and consult the current python style guide for what is the current best practice. Something like that does sound reasonable (and they are short), but historically all related Biopython functions have used the terms (back) transcription and (back) translation so I would prefer to stick with those.

> Obviously reverse translation of a protein sequence to a DNA sequence is
> complex if there are many solutions.

Yes, back-translation is tricky because there is generally more than one codon for any amino acid. Ambiguous nucleotides can be used to describe several possible codons giving that amino acid, but in general it is not possible to do this and describe all the possible codons which could have been used. This topic is worthy of an entire thread... for the record, I would envisage a back_translate method for the Seq object (assuming we settle on translate as the name for the forward translation from nucleotide to protein).

Peter

From mjldehoon at yahoo.com Mon Oct 13 20:50:14 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 13 Oct 2008 17:50:14 -0700 (PDT)
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com>
Message-ID: <900752.12970.qm@web62408.mail.re1.yahoo.com>

> (a) Just use translate (ignore the existing string method)
> (b) Use translate_ (trailing underscore, see PEP8)
> (c) Use translation (a noun rather than verb; different style).
> (d) Use something else (e.g. bio_translate or ...)
> (e) Don't add a biological translation method at all because ...

(a). Note also that once Seq objects inherit from string, the Python .translate method is still accessible as str.translate(seq).

--Michiel.

From biopython at maubp.freeserve.co.uk Tue Oct 14 06:18:13 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 14 Oct 2008 11:18:13 +0100
Subject: [BioPython] Translation method for Seq object
In-Reply-To: <900752.12970.qm@web62408.mail.re1.yahoo.com>
References: <320fb6e00810130538h3776350dndbb9e82eb47f9c33@mail.gmail.com> <900752.12970.qm@web62408.mail.re1.yahoo.com>
Message-ID: <320fb6e00810140318i14c6362eq8a51030b1da660ae@mail.gmail.com>

OK, we seem to have a consensus :)

In Biopython's CVS, the Seq object now has a translate method which does a biological translation. If anyone comes up with a better proposal before the next release, we can still rename this. Otherwise I will update the Tutorial in CVS shortly...
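
As a rough illustration of option (a) and of Michiel's remark about str.translate(seq), here is a minimal sketch of a string subclass whose translate method performs a biological translation while the built-in string translation stays reachable. MiniSeq and STANDARD_TABLE are hypothetical names made up for this example; this is not Biopython's Seq class or its codon table code. It also assumes Python 3, where str.translate takes a mapping rather than the 256-character table described earlier in the thread.

```python
from itertools import product

# Standard genetic code (NCBI table 1), with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


class MiniSeq(str):
    """Hypothetical str subclass for illustration; NOT Biopython's Seq."""

    def translate(self, table=None, stop_symbol="*"):
        """Biological translation, shadowing the built-in str.translate."""
        table = STANDARD_TABLE if table is None else table
        dna = self.upper().replace("U", "T")
        usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
        protein = "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))
        return protein.replace("*", stop_symbol)


seq = MiniSeq("CAGTGACGTTAGTCCG")
# Biological translation through the overriding method.
protein = seq.translate()
# The character-mapping translate is still available via the base class.
complemented = str.translate(seq, str.maketrans("ACGT", "TGCA"))
print(protein)
print(complemented)
```

The two optional parameters mirror the codon table and stop symbol mentioned in the thread; anything beyond that (for example CDS validation) is left out of this sketch.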
Note that for now, I have followed the existing Bio.Seq.translate(...) function and the new Seq object translate(...) method takes only two optional parameters - the codon table and the stop symbol. I have noted some suggestions for possible additional arguments on Bug 2381.

The adventurous among you may want to use CVS to update your Biopython installations to try this out. Please note that you will now need numpy instead of Numeric (there is nothing to stop you having both numpy and Numeric installed at the same time). If you do try out the CVS code, please run the unit tests and report any issues.

Thanks,

Peter

From biopython at maubp.freeserve.co.uk Tue Oct 14 07:11:20 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 14 Oct 2008 12:11:20 +0100
Subject: [BioPython] Deprecating Bio.PubMed and some of Bio.GenBank
In-Reply-To: <320fb6e00810090848g2a516877i5950f515e748b9d0@mail.gmail.com>
References: <320fb6e00810090848g2a516877i5950f515e748b9d0@mail.gmail.com>
Message-ID: <320fb6e00810140411o341df854x49ef3e61421193b8@mail.gmail.com>

On Thu, Oct 9, 2008 at 4:48 PM, Peter wrote:
> Dear Biopythoneers,
>
> Those of you who looked at the release notes for Biopython 1.48 might
> have read this bit:
>
>>> Bio.PubMed and the online code in Bio.GenBank are now considered
>>> obsolete, and we intend to deprecate them after the next release.
>>> For accessing PubMed and GenBank, please use Bio.Entrez instead.
>
> These bits of code are effectively simple wrappers for Bio.Entrez.
> While they may be simple to use, they cannot take advantage of the
> NCBI's Entrez utils history functionality. This means they discourage
> users from following the NCBI's preferred usage patterns.
>
> We're already trying to encourage the use of Bio.Entrez by
> documenting it prominently in the tutorial (which seems to be working
> given the recent questions on the mailing list), but for Biopython
> 1.49 I'm suggesting we go further and deprecate Bio.PubMed and the
> online code in Bio.GenBank. This would mean a warning message would
> appear when this code is used, and (barring feedback) after a couple
> of releases this code would be removed completely.
>
> Any comments or objections? In particular, is anyone using this
> "obsolete" functionality now?

I've just deprecated Bio.PubMed in CVS - meaning for the next release of Biopython you'll see a warning message when you import the PubMed module. If you are using this module please say something sooner rather than later. This can still be undone.

Thanks,

Peter

From dalloliogm at gmail.com Thu Oct 16 06:02:46 2008
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 16 Oct 2008 12:02:46 +0200
Subject: [BioPython] calculate F-Statistics from SNP data
Message-ID: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>

Hi,
I was going to write a python program to calculate Fst statistics from a sample of SNP data. Is there any module already available to do that in biopython, that I am missing? I saw there is a 'PopGen' module, but the Cookbook says it doesn't support sequence data. Is someone actually writing any module in python to calculate such statistics?

--
-----------------------------------------------------------
My Blog on Bioinformatics (italian): http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk Thu Oct 16 06:23:12 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 16 Oct 2008 11:23:12 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
Message-ID: <320fb6e00810160323g552d0503nb3a8a6809b3464de@mail.gmail.com>

On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio wrote:
> Hi,
> I was going to write a python program to calculate Fst statistics from a
> sample of SNP data. Is there any module already available to do that
> in biopython, that I am missing? I saw there is a 'PopGen' module, but
> the Cookbook says it doesn't support sequence data.
> Is someone actually writing any module in python to calculate such
> statistics?

I think this will be a question for Tiago (the Bio.PopGen author), although others on the list may have also tackled similar questions.

In terms of reading in the SNP data, what file format will you be loading? Does Bio.SeqIO currently suffice?

Have you looked into what (if any) additional python libraries you would need? For any Biopython addition, a dependency on just numpy would be preferable, but Tiago has previously suggested an optional dependency on scipy for additional statistics needed in population genetics.

Peter

From tiagoantao at gmail.com Thu Oct 16 10:10:47 2008
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 16 Oct 2008 15:10:47 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com>
Message-ID: <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>

Hi,

On Thu, Oct 16, 2008 at 11:02 AM, Giovanni Marco Dall'Olio wrote:
> Hi,
> I was going to write a python program to calculate Fst statistics from a
> sample of SNP data.
> Is there any module already available to do that in biopython, that I am
> missing?
> I saw there is a 'PopGen' module, but the Cookbook says it doesn't support
> sequence data.
> Is someone actually writing any module in python to calculate such
> statistics?

The answer to this has to be given in parts, because it is actually a bunch of related (but different) issues.

On the data:
1. Sequence support. Bio.PopGen doesn't support statistics for sequences (like Tajima's D and the like), BUT that is not relevant if you want to do frequency-based statistics (like good old Fst): you just have to count frequencies and put them into a "frequency format".
2. A SNP is actually not a sequence but a single element, so it becomes easier. What you need at the end is something like this:
For population 1 and SNP 1 the number of As is 10, Cs is 20, Ts is 0, Gs is 0
For population 2 and SNP 1 the number of As is 20, Cs is 0, Ts is 0, Gs is 0
For population 2 and SNP 2 the number of As is 0, Cs is 20, Ts is 0, Gs is 10
And so on... You have to end up with frequency counts per population.
So, as long as you convert data (sequence, SNP, microsatellite) to frequency counts per population, there are no issues with the type of data.

On calculating the statistics (Fst):
1. I am fully aware that core statistics like Fst (I work with Fst a lot myself) are fundamental in a population genetics module, but I sincerely don't know how to proceed, because a long-term solution requires generic statistical support (e.g., chi-square tests, Hardy-Weinberg equilibrium...) and I am not allowed to depend on scipy (and I will not maintain generic statistics code myself). I know that Bio.PopGen is of little use without support for standard statistics.
2. A workaround (for which I have code written, but not committed to the repository; I can give it to you) is to invoke GenePop and get the Fst estimation. This requires the data to be in GenePop format (again, you can convert SNPs and even sequences to a frequency-based format).
3. That being said, I have code to estimate Fst (Cockerham and Weir's theta and a variation from Mark Beaumont) in Python. I can give it to you (but it is not much tested).

On sequence data formats:
1. Note that sequence data files (that I know of) have no provision for population structure (you cannot say, in a standard way, sequence X belongs to population Y). You have to do it in an ad hoc way. That means you have to invent your own convention for your private use.
2. Anyway, in your case I suppose you still have to extract the SNPs from the sequence.
3. If you want to do frequency-based analysis on your SNPs, I suggest you do a conversion to GenePop anyway (that way you can import your data into most population structure software, as GenePop format is the de facto standard)...
4. Because of the above there is actually no good solution for automated conversion from sequence information to frequency-based information (in biopython or in any platform whatsoever).

I can give more suggestions if you give more details or have more specific questions.
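
To make the "frequency counts per population" idea concrete, here is a minimal pure-Python sketch. It assumes a deliberately simple heterozygosity-based estimate, Fst = (H_T - H_S) / H_T, with populations weighted equally and no sample-size correction. It is not the Cockerham and Weir theta or the Beaumont variant mentioned above, and it is not Bio.PopGen code; the function names are made up for this example, and every population is assumed to have at least one observed allele.

```python
def allele_frequencies(counts):
    """Turn allele counts for one population into frequencies."""
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def expected_heterozygosity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared frequencies."""
    return 1.0 - sum(p * p for p in freqs.values())


def simple_fst(population_counts):
    """Naive Fst for one SNP from a list of per-population count dicts."""
    freqs = [allele_frequencies(c) for c in population_counts]
    # H_S: mean within-population expected heterozygosity.
    h_s = sum(expected_heterozygosity(f) for f in freqs) / len(freqs)
    # H_T: expected heterozygosity of the unweighted mean frequencies.
    alleles = set().union(*freqs)
    mean_freqs = {
        a: sum(f.get(a, 0.0) for f in freqs) / len(freqs) for a in alleles
    }
    h_t = expected_heterozygosity(mean_freqs)
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


# SNP 1 counts for populations 1 and 2, taken from the example above.
snp1 = [
    {"A": 10, "C": 20, "T": 0, "G": 0},
    {"A": 20, "C": 0, "T": 0, "G": 0},
]
fst_snp1 = simple_fst(snp1)
print(round(fst_snp1, 3))
```

For real analyses a sample-size-corrected estimator or an established tool such as GenePop (as suggested in the message) would be the safer choice; this sketch only shows the shape of the input data and the arithmetic.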

From tiagoantao at gmail.com Thu Oct 16 10:14:28 2008
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 16 Oct 2008 15:14:28 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
Message-ID: <6d941f120810160714s61eb6d1cx87d1943c4068d491@mail.gmail.com>

Just a minor point: I am so used to working with Fst that I mentally converted your "F-statistics" to Fst. Most of my mail still stands. The only point that changes a bit is that I only have code for Fst, so I cannot help you with any other.

--
"Data always beats theories. 'Look at data three times and then come to a conclusion,' versus 'coming to a conclusion and searching for some data.' The former will win every time." -- Matthew Simmons,
http://www.tiago.org

From biopython at maubp.freeserve.co.uk Thu Oct 16 11:11:27 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 16 Oct 2008 16:11:27 +0100
Subject: [BioPython] back-translation method for Seq object?
Message-ID: <320fb6e00810160811s19e580b2ped86c43b32c401bb@mail.gmail.com>

Quoting from the recent thread about adding a translation method to the Seq object, Bruce brought up back-translation:

Peter wrote:
> Bruce wrote:
>> Obviously reverse translation of a protein sequence to a DNA sequence is
>> complex if there are many solutions.
>
> Yes, back-translation is tricky because there is generally more than
> one codon for any amino acid. Ambiguous nucleotides can be used to
> describe several possible codons giving that amino acid, but in
> general it is not possible to do this and describe all the possible
> codons which could have been used. This topic is worthy of an entire
> thread... for the record, I would envisage a back_translate method for
> the Seq object (assuming we settle on translate as the name for the
> forward translation from nucleotide to protein).

Do we actually need a back_translate method? Can anyone suggest an actual use-case for this? It seems difficult to imagine that any simple version would please everyone.
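
As a small sketch of why back-translation is awkward, the code below builds the reverse of the standard codon table, checks whether the codons for an amino acid can be written as one IUPAC-ambiguous codon without also covering other codons, and counts how many nucleotide sequences back-translate to a given protein. All names here (CODONS_FOR, covering_codon, is_exact, count_back_translations) are hypothetical helpers for this illustration and are not part of Bio.Translate or the Seq object; only the standard genetic code and the IUPAC ambiguity codes are assumed.

```python
from itertools import product

# Standard genetic code (NCBI table 1), with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Reverse table: amino acid (or "*" for stop) -> list of codons.
CODONS_FOR = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_FOR.setdefault(aa, []).append("".join(codon))

# IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def covering_codon(codons):
    """Smallest single ambiguous codon that covers every given codon."""
    return "".join(
        CODE_FOR[frozenset(c[i] for c in codons)] for i in range(3)
    )


def is_exact(codons):
    """True if the covering codon expands to exactly the given codons."""
    ambiguous = covering_codon(codons)
    expanded = {"".join(p) for p in product(*(IUPAC[ch] for ch in ambiguous))}
    return expanded == set(codons)


def count_back_translations(protein):
    """Number of distinct nucleotide sequences encoding the protein."""
    total = 1
    for aa in protein:
        total *= len(CODONS_FOR[aa])
    return total


stop_cover = covering_codon(CODONS_FOR["*"])
print(stop_cover, is_exact(CODONS_FOR["*"]))
print(covering_codon(CODONS_FOR["A"]), is_exact(CODONS_FOR["A"]))
print(count_back_translations("MLW"))
```

Alanine is the easy case, since its four codons collapse exactly to one ambiguous codon; the stop codons and leucine do not, so any single-sequence answer has to either over-cover or pick arbitrarily.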

Bio.Translate (a semi-obsolete module whose deprecation has been suggested) provides a back_translate method which picks an essentially arbitrary but unambiguous codon for each amino acid. Crude but simple. A more meaningful choice would require supplying codon frequencies for the organism under consideration. Other possibilities include using ambiguous nucleotides to try and cover all the possibilities (e.g. "L" -> "CTN"), but even here in some cases this is arbitrary. e.g. The standard three stop codons ['TAA', 'TAG', 'TGA'] could be represented as ['TAR', 'TGA'] or ['TRA', 'TAG'] but not by a single ambiguous codon ('TRR' also covers 'TGG' which codes for 'W'). Potentially of use would be a generator function which returned all possible back translations - but this would be complex and typically overkill.

As a final point, a Seq object back-translation method could give RNA or DNA. From a biological point of view giving DNA by default would make sense. This choice is handled in Bio.Translate when creating the translator object (part of what makes Bio.Translate relatively complex to use).

Peter

From sdavis2 at mail.nih.gov Thu Oct 16 11:16:51 2008
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 16 Oct 2008 11:16:51 -0400
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com>
Message-ID: <264855a00810160816g6ea765dcl65fd7e5aa38ef20c@mail.gmail.com>

Just a little note that the R programming language has some packages for population genetics and, of course, has excellent statistical tools. One can interface with it via rpy. I'm not advocating going this route, but just wanted to let people know about another option.

Sean

From tiagoantao at gmail.com Thu Oct 16 11:26:52 2008
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 16 Oct 2008 16:26:52 +0100
Subject: [BioPython] calculate F-Statistics from SNP data
In-Reply-To: <264855a00810160816g6ea765dcl65fd7e5aa38ef20c@mail.gmail.com>
References: <5aa3b3570810160302q48df31d8h777cb760b763b77d@mail.gmail.com> <6d941f120810160710t25c4e0c8n406093de7e397f6@mail.gmail.com> <264855a00810160816g6ea765dcl65fd7e5aa38ef20c@mail.gmail.com>
Message-ID: <6d941f120810160826q2bf25382m41890fb39a4226a0@mail.gmail.com>

The task view on Genetics for R provides a good starting point to find R packages related to the field:
http://www.freestatistics.org/cran/web/views/Genetics.html

--
"Data always beats theories. 'Look at data three times and then come to a conclusion,' versus 'coming to a conclusion and searching for some data.' The former will win every time." -- Matthew Simmons,
http://www.tiago.org

From lpritc at scri.ac.uk Fri Oct 17 04:24:43 2008
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 17 Oct 2008 09:24:43 +0100
Subject: [BioPython] back-translation method for Seq object?
In-Reply-To: <320fb6e00810160811s19e580b2ped86c43b32c401bb@mail.gmail.com>
Message-ID:

On 16/10/2008 16:11, "Peter" wrote:
> Quoting from the recent thread about adding a translation method to
> the Seq object, Bruce brought up back-translation:
>
> Peter wrote:
>> Bruce wrote:
>>> Obviously reverse translation of a protein sequence to a DNA sequence is
>>> complex if there are many solutions.

This is the key problem. Forward translation is - for a given codon table - a one-one mapping. Reverse tra