From mjldehoon at yahoo.com Sun Sep 4 02:09:13 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 3 Sep 2011 23:09:13 -0700 (PDT)
Subject: [Biopython-dev] Bio.GenBank
Message-ID: <1315116553.25037.YahooMailClassic@web161210.mail.bf1.yahoo.com>

Dear all,

Currently, Bio/GenBank/__init__.py imports Bio.ParserSupport but uses
very little of it. Therefore I would like to suggest to remove this
dependency on ParserSupport from Bio/GenBank/__init__.py. I copied the
corresponding patch below. Any objections, anybody?

Best,
--Michiel

diff --git a/Bio/GenBank/__init__.py b/Bio/GenBank/__init__.py
index 43c10d4..df38abe 100644
--- a/Bio/GenBank/__init__.py
+++ b/Bio/GenBank/__init__.py
@@ -47,7 +47,6 @@ import re
 
 # other Biopython stuff
 from Bio import SeqFeature
-from Bio.ParserSupport import AbstractConsumer
 from Bio import Entrez
 
 # other Bio.GenBank stuff
@@ -389,7 +388,7 @@ class RecordParser(object):
         self._scanner.feed(handle, self._consumer)
         return self._consumer.data
 
-class _BaseGenBankConsumer(AbstractConsumer):
+class _BaseGenBankConsumer(object):
     """Abstract GenBank consumer providing useful general functions.
 
     This just helps to eliminate some duplication in things that most
@@ -404,6 +403,12 @@ class _BaseGenBankConsumer(AbstractConsumer):
     def __init__(self):
         pass
 
+    def _unhandled(self, data):
+        pass
+
+    def __getattr__(self, attr):
+        return self._unhandled
+
     def _split_keywords(self, keyword_string):
         """Split a string of keywords into a nice clean list.
         """

From p.j.a.cock at googlemail.com Mon Sep 5 06:04:27 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 5 Sep 2011 11:04:27 +0100
Subject: [Biopython-dev] Bio.GenBank
In-Reply-To: <1315116553.25037.YahooMailClassic@web161210.mail.bf1.yahoo.com>
References: <1315116553.25037.YahooMailClassic@web161210.mail.bf1.yahoo.com>
Message-ID:

On Sun, Sep 4, 2011 at 7:09 AM, Michiel de Hoon wrote:
> Dear all,
>
> Currently, Bio/GenBank/__init__.py imports Bio.ParserSupport
> but uses very little of it. Therefore I would like to suggest to
> remove this dependency on ParserSupport from
> Bio/GenBank/__init__.py. I copied the corresponding patch below.
> Any objections, anybody?

Hi Michiel,

I'd have to dig into the code to understand the patch, but
I presume there is a follow up question coming - can we
then deprecate Bio.ParserSupport since right now only the
GenBank and "pending deprecation" plain text BLAST
parsers use it (plus Compass which you recently fixed)?

Peter

From mjldehoon at yahoo.com Mon Sep 5 07:08:43 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 5 Sep 2011 04:08:43 -0700 (PDT)
Subject: [Biopython-dev] Bio.GenBank
In-Reply-To:
Message-ID: <1315220923.37754.YahooMailClassic@web161211.mail.bf1.yahoo.com>

Hi Peter,

> I'd have to dig into the code to understand the patch, but
> I presume there is a follow up question coming - can we
> then deprecate Bio.ParserSupport since right now only the
> GenBank and "pending deprecation" plain text BLAST
> parsers use it (plus Compass which you recently fixed)?

Yes. With this patch, the plain text BLAST parser is the last
piece of code that uses Bio.ParserSupport.

Best,
--Michiel.

--- On Mon, 9/5/11, Peter Cock wrote:

> From: Peter Cock
> Subject: Re: [Biopython-dev] Bio.GenBank
> To: "Michiel de Hoon"
> Cc: biopython-dev at biopython.org
> Date: Monday, September 5, 2011, 6:04 AM
> On Sun, Sep 4, 2011 at 7:09 AM, Michiel de Hoon wrote:
> > Dear all,
> >
> > Currently, Bio/GenBank/__init__.py imports Bio.ParserSupport
> > but uses very little of it. Therefore I would like to suggest to
> > remove this dependency on ParserSupport from
> > Bio/GenBank/__init__.py. I copied the corresponding patch below.
> > Any objections, anybody?
>
> Hi Michiel,
>
> I'd have to dig into the code to understand the patch, but
> I presume there is a follow up question coming - can we
> then deprecate Bio.ParserSupport since right now only the
> GenBank and "pending deprecation" plain text BLAST
> parsers use it (plus Compass which you recently fixed)?
>
> Peter
>

From p.j.a.cock at googlemail.com Wed Sep 7 08:58:51 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 7 Sep 2011 13:58:51 +0100
Subject: [Biopython-dev] Bio.GenBank
In-Reply-To: <1315220923.37754.YahooMailClassic@web161211.mail.bf1.yahoo.com>
References: <1315220923.37754.YahooMailClassic@web161211.mail.bf1.yahoo.com>
Message-ID:

On Mon, Sep 5, 2011 at 12:08 PM, Michiel de Hoon wrote:
> Hi Peter,
>
>> I'd have to dig into the code to understand the patch, but
>> I presume there is a follow up question coming - can we
>> then deprecate Bio.ParserSupport since right now only the
>> GenBank and "pending deprecation" plain text BLAST
>> parsers use it (plus Compass which you recently fixed)?
>
> Yes. With this patch, the plain text BLAST parser is the last
> piece of code that uses Bio.ParserSupport.

I'm OK with modifying Bio.GenBank not to depend on
Bio.ParserSupport, and if you want to, adding an "obsolete"
comment or more explicitly a PendingDeprecationWarning to
Bio.ParserSupport seems sensible too.

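[For anyone who, like Peter, would rather not dig into the code: the
following is a minimal sketch of the fallback that the patch adds in
place of inheriting from AbstractConsumer. Only _unhandled and
__getattr__ come from the patch; the class names BaseConsumer and
LocusConsumer and the event names are made up for illustration, and this
is not the Bio.GenBank code itself.]

```python
class BaseConsumer(object):
    """Toy stand-in for _BaseGenBankConsumer (hypothetical name)."""

    def _unhandled(self, data):
        # Silently ignore any event that has no dedicated handler.
        pass

    def __getattr__(self, attr):
        # Python only calls __getattr__ when normal lookup fails, so
        # handler methods defined on subclasses still take priority.
        return self._unhandled


class LocusConsumer(BaseConsumer):
    """Hypothetical consumer that only cares about one event."""

    def __init__(self):
        self.names = []

    def locus(self, name):
        self.names.append(name)


consumer = LocusConsumer()
consumer.locus("X52960")        # handled by LocusConsumer.locus
consumer.definition("ignored")  # no such method: falls back to _unhandled
print(consumer.names)
```

One side effect of this design is that every attribute name appears to
exist (hasattr() is always true), so a misspelled event name is ignored
rather than reported.
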
Peter

From mjldehoon at yahoo.com Wed Sep 7 09:53:22 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 7 Sep 2011 06:53:22 -0700 (PDT)
Subject: [Biopython-dev] Bio.File
Message-ID: <1315403602.80271.YahooMailClassic@web161203.mail.bf1.yahoo.com>

Hi all,

Bio.File makes three classes available:
Bio.File.UndoHandle
Bio.File.StringHandle (which simply points to StringIO.StringIO)
Bio.File.SGMLStripper (which has a pending deprecation warning)

Bio.File.StringHandle is currently used only in
Bio.Blast.NCBIStandalone and Bio.ParserSupport, both of which now have
a pending deprecation warning.

Bio.File.UndoHandle is used in three modules that now have a pending
deprecation warning (Bio.Blast.NCBIStandalone, Bio.ParserSupport,
Bio.UniGene.UniGene), as well as in Bio.SCOP.__init__. I don't know why
the UndoHandle is used in that module; the relevant code looks like
this:

def _open(cgi, params={}, get=1):
    ...
    handle = urllib.urlopen(cgi, options)
    uhandle = File.UndoHandle(handle)
    return uhandle

If there is no pressing reason for using File.UndoHandle here and we
can remove it, then we could add a PendingDeprecationWarning to
Bio.File.

Best,
--Michiel.

From p.j.a.cock at googlemail.com Wed Sep 7 10:36:43 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 7 Sep 2011 15:36:43 +0100
Subject: [Biopython-dev] Bio.File
In-Reply-To: <1315403602.80271.YahooMailClassic@web161203.mail.bf1.yahoo.com>
References: <1315403602.80271.YahooMailClassic@web161203.mail.bf1.yahoo.com>
Message-ID:

On Wed, Sep 7, 2011 at 2:53 PM, Michiel de Hoon wrote:
> Hi all,
>
> Bio.File makes three classes available:
> Bio.File.UndoHandle
> Bio.File.StringHandle (which simply points to StringIO.StringIO)
> Bio.File.SGMLStripper (which has a pending deprecation warning)
>
> Bio.File.StringHandle is currently used only in
> Bio.Blast.NCBIStandalone and Bio.ParserSupport,
> both of which now have a pending deprecation warning.

We can just switch them to use StringIO directly, and immediately
deprecate Bio.File.StringHandle.

We can probably deprecate SGMLStripper now as well (which
means indirectly deprecating the bit of Bio.ParserSupport
which uses it).

> Bio.File.UndoHandle is used in three modules that now have a
> pending deprecation warning (Bio.Blast.NCBIStandalone,
> Bio.ParserSupport, Bio.UniGene.UniGene), as well as in
> Bio.SCOP.__init__. I don't know why the UndoHandle is
> used in that module; the relevant code looks like this:
>
> def _open(cgi, params={}, get=1):
>     ...
>     handle = urllib.urlopen(cgi, options)
>     uhandle = File.UndoHandle(handle)
>     return uhandle
>
> If there is no pressing reason for using File.UndoHandle here
> and we can remove it, then we could add a
> PendingDeprecationWarning to Bio.File.

Unless there is something similar in the standard library,
I think the UndoHandle is still useful.

UndoHandle used to be used in Bio.Entrez for spotting
error conditions, but now we trust the NCBI to set an
HTTP return code:

https://github.com/biopython/biopython/commit/2c4d8b99fc1b2dffa726e7d9956d766f7013164d

I'm using the same trick in my TogoWS wrapper (something
I'm hoping will be ready to include in the next Biopython,
once the TogoWS team have fixed a couple of server side
issues). If the server could be relied on to always give an
HTTP error code this wouldn't be needed:

https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py

I imagine the use of an UndoHandle in SCOP search was
to allow the user to make similar sanity checks.

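[The "peek at the first few lines, then hand the untouched stream to the
parser" trick described above can be sketched with the standard library
alone. PeekableHandle and check_for_error are hypothetical names rather
than Biopython's, and the "Error: " prefix is only an assumed server
convention; treat this as an illustration of the idea, not as the
Bio.File.UndoHandle implementation.]

```python
import io


class PeekableHandle(object):
    """Minimal undo-style wrapper around a text handle (hypothetical)."""

    def __init__(self, handle):
        self._handle = handle
        self._saved = []  # pushed-back lines, next one to return first

    def readline(self):
        if self._saved:
            return self._saved.pop(0)
        return self._handle.readline()

    def saveline(self, line):
        # Push a line back so that the next readline() returns it again.
        if line:
            self._saved.insert(0, line)

    def read(self):
        # Pushed-back lines first, then whatever is left in the handle.
        data = "".join(self._saved) + self._handle.read()
        self._saved = []
        return data


def check_for_error(uhandle, n=10):
    """Inspect the first n lines, then restore them for the real parser."""
    lines = [uhandle.readline() for _ in range(n)]
    for line in reversed(lines):
        uhandle.saveline(line)
    if "".join(lines).startswith("Error: "):
        raise IOError("Server replied with an error message")
    return uhandle


handle = check_for_error(PeekableHandle(io.StringIO(">seq1\nACGT\n")))
print(handle.read())
```

After the check the caller still sees the complete stream, which is the
whole point of wrapping the handle instead of reading from it directly.
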
Peter

From mjldehoon at yahoo.com Thu Sep 8 10:35:38 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 8 Sep 2011 07:35:38 -0700 (PDT)
Subject: [Biopython-dev] Bio.File
In-Reply-To:
Message-ID: <1315492538.36803.YahooMailClassic@web161206.mail.bf1.yahoo.com>

--- On Wed, 9/7/11, Peter Cock wrote:
> > Bio.File.StringHandle is currently used only in
> > Bio.Blast.NCBIStandalone and Bio.ParserSupport,
> > both of which now have a pending deprecation warning.
>
> We can just switch them to use StringIO directly, and immediately
> deprecate Bio.File.StringHandle.
>
> We can probably deprecate SGMLStripper now as well (which
> means indirectly deprecating the bit of Bio.ParserSupport
> which uses it).
>

OK, done.

--Michiel.

From mjldehoon at yahoo.com Thu Sep 8 10:49:09 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 8 Sep 2011 07:49:09 -0700 (PDT)
Subject: [Biopython-dev] Bio.File
In-Reply-To:
Message-ID: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>

--- On Wed, 9/7/11, Peter Cock wrote:
> UndoHandle used to be used in Bio.Entrez for spotting
> error conditions, but now we trust the NCBI to set an
> HTTP return code:
>
> https://github.com/biopython/biopython/commit/2c4d8b99fc1b2dffa726e7d9956d766f7013164d

No, we shouldn't rely on an HTTP return code. The idea is that only the
parser can know if the output returned by NCBI is valid, as in:

handle = Entrez.efetch(...something...)
try:
    record = Entrez.read(handle)
except Exception:
    # NCBI returned something invalid, or at least
    # something that we don't know how to parse

> If the server could be relied on to always give an
> HTTP error code this wouldn't be needed:
>
> https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py
>

I don't like this approach much, as it depends on exactly what the
error message looks like, and misses any other problems, such as
incomplete output.
There will be a certain false positive rate, with return values that
pass the checking of the first 10 lines but are still unusable. Even
worse, the false positive rate can suddenly go up if the server
maintainers decide to change anything in their error messages. This
kind of checking should be done by the parser, which can tell you
exactly if the data are valid, or if not, what is wrong with it.

Best,
--Michiel.

[copied from Bio/TogoWS/__init__.py]:

    # Wrap the handle inside an UndoHandle.
    uhandle = File.UndoHandle(handle)

    # Check for errors in the first 10 lines.
    # This is kind of ugly.
    lines = []
    for i in range(10):
        lines.append(uhandle.readline())
    for i in range(9, -1, -1):
        uhandle.saveline(lines[i])
    data = ''.join(lines)

    if data == '':
        #ValueError? This can occur with an invalid formats or fields
        #e.g. http://togows.dbcls.jp/entry/pubmed/16381885.au
        #which is an invalid file format, I meant to try this
        #instead http://togows.dbcls.jp/entry/pubmed/16381885/au
        raise IOError("TogoWS replied with no data:\n%s" % url)
    if data == ' ':
        #I've seen this on things which should work, e.g.
        #e.g. http://togows.dbcls.jp/entry/genome/X52960.fasta
        raise IOError("TogoWS replied with just a single space:\n%s" % url)
    if data.startswith("Error: "):
        #TODO - Should this be a value error (in some cases?)
        raise IOError("TogoWS replied with an error message:\n\n%s\n\n%s" \
                      % (data, url))
    if "We're sorry, but something went wrong" in data:
        #ValueError? This can occur with an invalid formats or fields
        raise IOError("TogoWS replied: We're sorry, but something went wrong:\n%s" \
                      % url)

From andrea at biocomp.unibo.it Thu Sep 8 10:47:15 2011
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 8 Sep 2011 16:47:15 +0200 (CEST)
Subject: [Biopython-dev] Biograpy 1.0 beta released
In-Reply-To:
References:
Message-ID: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it>

Hi,
one year ago we were talking about a library I was developing basically
to draw seqrecord in a similar way to the BioPerl Bio::Graphics module.
Today, I'm releasing the public beta version of that software that is
much more mature than one year ago.

The library is called BioGraPy and is based on matplotlib for drawings
and on biopython objects for input. Basically you can give BioGraPy a
SeqRecord and it will draw it and save it in any of the matplotlib
supported formats (including png, SVG and PDF). But you can also use it
at a lower level, deciding exactly how and where to plot every feature,
and building very complex drawings. It comes with integrated help for
web usage, such as clickable SVG and html maps. BioGraPy also supports
continuous features such as a hydrophobicity plot and seqrecord
per-letter annotations (if numerical).

All the code is documented with sphinx, and I'm also completing a
comprehensive tutorial. The source code and the documentation are
available at: http://apierleoni.github.com/BioGraPy/

BioGraPy is released under the LGPL license.

This is an open project, so anyone willing to contribute, test or
simply suggest improvements is welcome.

You cannot plot circular drawings from Biograpy, but you have
GenomeDiagram for that.

I hope (and think) this will be useful, significantly extending the
biopython plotting capabilities.

Andrea

From p.j.a.cock at googlemail.com Thu Sep 8 11:25:17 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 8 Sep 2011 16:25:17 +0100
Subject: [Biopython-dev] Bio.File
In-Reply-To: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
References: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
Message-ID:

On Thu, Sep 8, 2011 at 3:49 PM, Michiel de Hoon wrote:
>
> No, we shouldn't rely on an HTTP return code. The idea is that only
> the parser can know if the output returned by NCBI is valid, as in:
>
> handle = Entrez.efetch(...something...)
> try:
>     record = Entrez.read(handle)
> except Exception:
>     # NCBI returned something invalid, or at least
>     # something that we don't know how to parse

In theory, yes, but quite often parsers look for certain
patterns and if you feed them something else they may
just say "no data". For example, the GenBank parser
ignores anything before the LOCUS line (in order to
cope with the free text header in the large multi-record
files on the NCBI FTP site). As a side effect, you can
give it almost any plain text file and the parser won't
raise an error - it will just say no GenBank records
found.

>> If the server could be relied on to always give an
>> HTTP error code this wouldn't be needed:
>>
>> https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py
>>
>
> I don't like this approach much, as it depends on exactly
> what the error message looks like, and misses any other
> problems, such as incomplete output. There will be a
> certain false positive rate, with return values that pass
> the checking of the first 10 lines but are still unusable.

Yes, in theory the server should detect and handle
errors nicely - but there are sometimes bugs in
web-services. Certainly from memory I have had HTTP
return code 200 (OK) with invalid data from both the
NCBI and TogoWS.

> Even worse, the false positive rate can suddenly go up
> if the server maintainers decide to change anything in
> their error messages.

The checks are deliberately designed to avoid false
positives - at the cost of missing some errors early.

> This kind of checking should be
> done by the parser, which can tell you exactly if the
> data are valid, or if not, what is wrong with it.

That isn't always possible, since so many bioinformatics
file formats are so vague that validation is hard.

I accept checking the first 10 lines for common errors
specific to that webservice is inelegant, but it is
practical.

[Some of those TogoWS checks are probably superfluous
right now, I'm still polishing the error handling - some of
which will rely on TogoWS itself catching more conditions]

Regards,

Peter

From p.j.a.cock at googlemail.com Thu Sep 8 11:44:53 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 8 Sep 2011 16:44:53 +0100
Subject: [Biopython-dev] Biograpy 1.0 beta released
In-Reply-To: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it>
References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it>
Message-ID:

On Thu, Sep 8, 2011 at 3:47 PM, Andrea Pierleoni wrote:
> Hi,
> one year ago we were talking about a library I was developing basically to
> draw seqrecord in a similar way to the BioPerl Bio::Graphics module.
> Today, I'm releasing the public beta version of that software ...
> http://apierleoni.github.com/BioGraPy/

Are you doing anything with "join" features from GenBank files (or
similar compound features)? This is something I'm thinking about
changing in the Biopython SeqFeature objects - having a single
SeqFeature with a compound location, rather than as now having
a parent SeqFeature with child SeqFeatures for the sub parts
(which does not make sense with things like GFF3 where there
are real parent/child relationships between features).

>
> BioGraPy is released under the LGPL license.
>

I'm curious about the license choice - LGPL prevents Biopython
adopting it for example.

Peter

From andrea at biocomp.unibo.it Thu Sep 8 12:11:15 2011
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 8 Sep 2011 18:11:15 +0200 (CEST)
Subject: [Biopython-dev] Biograpy 1.0 beta released
In-Reply-To:
References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it>
Message-ID: <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it>

> On Thu, Sep 8, 2011 at 3:47 PM, Andrea Pierleoni
> wrote:
>> Hi,
>> one year ago we were talking about a library I was developing basically
>> to
>> draw seqrecord in a similar way to the BioPerl Bio::Graphics module.
>> Today, I'm releasing the public beta version of that software ...
>> http://apierleoni.github.com/BioGraPy/
>
> Are you doing anything with "join" features from GenBank files (or
> similar compound features)? This is something I'm thinking about
> changing in the Biopython SeqFeature objects - having a single
> SeqFeature with a compound location, rather than as now having
> a parent SeqFeature with child SeqFeatures for the sub parts
> (which does not make sense with things like GFF3 where there
> are real parent/child relationships between features).
>

Yes, I'm using 'join' features, there is a specific "graphic feature"
for features with 'join'. I think it can be easily changed accordingly.
Actually I'm also guessing a hierarchy when plotting directly a gene
seqrecord/seqfeature with attached joined subfeatures.
Being able to trace parent/child relationships would be a big
improvement, and not just for this library of course.

>>
>> BioGraPy is released under the LGPL license.
>>
>
> I'm curious about the license choice - LGPL prevents Biopython
> adopting it for example.
>

Then I think it's time to change the license :)
Why is it preventing biopython to adopt it?
Which one do you suggest?
I could also use the biopython license, I don't need a strict control
on the code, I just want the library to be used by everybody willing to,
even closed source programs.

Andrea

From p.j.a.cock at googlemail.com Thu Sep 8 13:08:50 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 8 Sep 2011 18:08:50 +0100
Subject: [Biopython-dev] Biograpy 1.0 beta released
In-Reply-To: <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it>
References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it> <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it>
Message-ID:

On Thursday, September 8, 2011, Andrea Pierleoni wrote:
>> On Thu, Sep 8, 2011 at 3:47 PM, Andrea Pierleoni
>> wrote:
>>> Hi,
>>> one year ago we were talking about a library I was developing basically
>>> to draw seqrecord in a similar way to the BioPerl Bio::Graphics module.
>>> Today, I'm releasing the public beta version of that software ...
>>> http://apierleoni.github.com/BioGraPy/
>>
>> Are you doing anything with "join" features from GenBank files (or
>> similar compound features)? This is something I'm thinking about
>> changing in the Biopython SeqFeature objects - having a single
>> SeqFeature with a compound location, rather than as now having
>> a parent SeqFeature with child SeqFeatures for the sub parts
>> (which does not make sense with things like GFF3 where there
>> are real parent/child relationships between features).
>>
>
> Yes, I'm using 'join' features, there is a specific "graphic feature"
> for features with 'join'. I think it can be easily changed accordingly.
> Actually I'm also guessing a hierarchy when plotting directly a gene
> seqrecord/seqfeature with attached joined subfeatures.
> Being able to trace parent/child relationships would be a big
> improvement, and not just for this library of course.

I'll write more about this later, once my code gets a bit
closer to being ready.

>>>
>>> BioGraPy is released under the LGPL license.
>>>
>>
>> I'm curious about the license choice - LGPL prevents Biopython
>> adopting it for example.
>>
>
> Then I think it's time to change the license :)
> Why is it preventing biopython to adopt it?

Adopt in the sense of include into Biopython.

> Which one do you suggest?
> I could also use the biopython license, I don't need a strict control
> on the code, I just want the library to be used by everybody willing to,
> even closed source programs.
>

As I recall, Biopython, NumPy, SciPy etc all use a very
Liberal MIT/BSD type licence, while LGPL tends to
scare commercial users ;)

Peter

From andrea at biocomp.unibo.it Thu Sep 8 15:02:50 2011
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 8 Sep 2011 21:02:50 +0200 (CEST)
Subject: [Biopython-dev] Biograpy 1.0 beta released
In-Reply-To:
References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it> <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it>
Message-ID: <302333674995860f369795196ea1df10.squirrel@lipid.biocomp.unibo.it>

>> Yes, I'm using 'join' features, there is a specific "graphic feature"
>> for features with 'join'. I think it can be easily changed accordingly.
>> Actually I'm also guessing a hierarchy when plotting directly a gene
>> seqrecord/seqfeature with attached joined subfeatures.
>> Being able to trace parent/child relationships would be a big
>> improvement, and not just for this library of course.
>
> I'll write more about this later, once my code gets a bit
> closer to being ready.
>

ok, let me know.

>>>>
>>>> BioGraPy is released under the LGPL license.
>>>>
>>>
>>> I'm curious about the license choice - LGPL prevents Biopython
>>> adopting it for example.
>>>
>>
>> Then I think it's time to change the license :)
>> Why is it preventing biopython to adopt it?
>
> Adopt in the sense of include into Biopython.
>

well if you think it is worth it, biograpy can of course be included in
biopython.
the good thing is that it is all sphinx documented, so if biopython is
moving to sphinx too, this part is ready.
Biograpy requires matplotlib (and thus of course numpy), but could be
just an optional installation for those who want to use this graphic
package, as reportlab is for GenomeDiagram.
Also, now that there is a drawing library it should be easy to complete
the DAS client, and have something very similar to DASTY that given a
protein id is able to fetch all the das annotation and even draw them
with an html4 (image maps) or html5 (svg) friendly result.

>> Which one do you suggest?
>> I could also use the biopython license, I don't need a strict control
>> on the code, I just want the library to be used by everybody willing to,
>> even closed source programs.
>>
>
> As I recall, Biopython, NumPy, SciPy etc all use a very
> Liberal MIT/BSD type licence, while LGPL tends to
> scare commercial users ;)
>
>

it's funny, since I chose the LGPL license not to scare commercial
users :)
can you send me a link to the license so that I can include it in
biograpy?
thanks

Andrea

From mjldehoon at yahoo.com Sat Sep 10 23:22:15 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 10 Sep 2011 20:22:15 -0700 (PDT)
Subject: [Biopython-dev] Bio.GenBank (was: Bio.File)
In-Reply-To:
Message-ID: <1315711335.33798.YahooMailClassic@web161215.mail.bf1.yahoo.com>

Hi all,

There are several issues here. Let's talk about Bio.GenBank first.

I think it's OK to have a module Bio.GenBank in addition to Bio.SeqIO,
but it's a bit unclear to me which code in Bio.GenBank is still
relevant and which (if any) can potentially be deprecated.

Also we'd need some documentation for Bio.GenBank.

In particular it's not clear to me which classes in Bio.GenBank are
intended to be used by users. The description at the top of Bio.GenBank
says that only Bio.GenBank.RecordParser should be used directly.
However, in the test code in Bio.Graphics.GenomeDiagram (after
"if __name__=='__main__':") Bio.GenBank.FeatureParser is used. Should
that be replaced by Bio.SeqIO then?

Also I think that the RecordParser should raise an Exception if it
cannot find a record when parsing. Compare the following:

>>> from Bio import SeqIO
>>> from StringIO import StringIO
>>> handle = StringIO("no record here")
>>> SeqIO.read(handle, 'fasta')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 617, in read
    raise ValueError("No records found in handle")
ValueError: No records found in handle

>>> from Bio import GenBank
>>> parser = GenBank.RecordParser()
>>> handle = StringIO("no record here")
>>> parser.parse(handle)
>>> # no error raised

This still lets us ignore header text before the actual start of a
GenBank record; the error should only be raised if no GenBank record
can be found anywhere.

Best,
--Michiel.

--- On Thu, 9/8/11, Peter Cock wrote:

> From: Peter Cock
> Subject: Re: [Biopython-dev] Bio.File
> To: "Michiel de Hoon"
> Cc: biopython-dev at biopython.org
> Date: Thursday, September 8, 2011, 11:25 AM
> On Thu, Sep 8, 2011 at 3:49 PM, Michiel de Hoon wrote:
> >
> > No, we shouldn't rely on an HTTP return code. The idea is that only
> > the parser can know if the output returned by NCBI is valid, as in:
> >
> > handle = Entrez.efetch(...something...)
> > try:
> >     record = Entrez.read(handle)
> > except Exception:
> >     # NCBI returned something invalid, or at least
> >     # something that we don't know how to parse
>
> In theory, yes, but quite often parsers look for certain
> patterns and if you feed them something else they may
> just say "no data". For example, the GenBank parser
> ignores anything before the LOCUS line (in order to
> cope with the free text header in the large multi-record
> files on the NCBI FTP site).
> As a side effect, you can
> give it almost any plain text file and the parser won't
> raise an error - it will just say no GenBank records
> found.
>
> >> If the server could be relied on to always give an
> >> HTTP error code this wouldn't be needed:
> >>
> >> https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py
> >>
> >
> > I don't like this approach much, as it depends on exactly
> > what the error message looks like, and misses any other
> > problems, such as incomplete output. There will be a
> > certain false positive rate, with return values that pass
> > the checking of the first 10 lines but are still unusable.
>
> Yes, in theory the server should detect and handle
> errors nicely - but there are sometimes bugs in
> web-services. Certainly from memory I have had HTTP
> return code 200 (OK) with invalid data from both the
> NCBI and TogoWS.
>
> > Even worse, the false positive rate can suddenly go up
> > if the server maintainers decide to change anything in
> > their error messages.
>
> The checks are deliberately designed to avoid false
> positives - at the cost of missing some errors early.
>
> > This kind of checking should be
> > done by the parser, which can tell you exactly if the
> > data are valid, or if not, what is wrong with it.
>
> That isn't always possible, since so many bioinformatics
> file formats are so vague that validation is hard.
>
> I accept checking the first 10 lines for common errors
> specific to that webservice is inelegant, but it is
> practical.
>
> [Some of those TogoWS checks are probably superfluous
> right now, I'm still polishing the error handling - some of
> which will rely on TogoWS itself catching more conditions]
>
> Regards,
>
> Peter
>

From p.j.a.cock at googlemail.com Sun Sep 11 10:06:13 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 11 Sep 2011 15:06:13 +0100
Subject: [Biopython-dev] Bio.GenBank (was: Bio.File)
In-Reply-To: <1315711335.33798.YahooMailClassic@web161215.mail.bf1.yahoo.com>
References: <1315711335.33798.YahooMailClassic@web161215.mail.bf1.yahoo.com>
Message-ID:

On Sun, Sep 11, 2011 at 4:22 AM, Michiel de Hoon wrote:
> Hi all,
>
> There are several issues here.
> Let's talk about Bio.GenBank first.
>
> I think it's OK to have a module Bio.GenBank in addition
> to Bio.SeqIO, but it's a bit unclear to me which code in
> Bio.GenBank is still relevant and which (if any) can
> potentially be deprecated.

Bio.GenBank uses a scanner/consumer to offer two object
models for GenBank/EMBL files.

First, SeqRecord objects, which are wrapped by Bio.SeqIO.

Second, a more faithful GenBank record object which
also supports non-sequence based GenBank whole genome
shotgun master records. These are GenBank files that
summarize the content of a project, and provide lists of
scaffold and contig files in the project. I have never used
this - Iddo has though.

So currently none of Bio.GenBank can really be deprecated.

If we don't care about WGS records, then perhaps the
RecordParser could be deprecated and later with some
refactoring Bio.SeqIO could parse things directly. That
would be my long term ideal.

Maybe we can represent the WGS records as SeqRecord
objects without a sequence, but I don't like that idea
really. Such files are NOT sequence files at all.

>
> Also we'd need some documentation for Bio.GenBank.
>

In general it would be a good idea to have a worked
example parsing a (small) GenBank file and showing
where in the SeqRecord each bit of annotation goes.
Doing this as a doctest (embedded in the Tutorial
perhaps) would keep the documentation up to date
(any changes should show up as a unit test failure).

> In particular it's not clear to me which classes in
> Bio.GenBank are intended to be used by users.
> The description at the top of Bio.GenBank says
> that only Bio.GenBank.RecordParser should be
> used directly.

What it says is "Currently the ONLY reason to use
Bio.GenBank directly is for the RecordParser which turns a
GenBank file into GenBank-specific Record objects.", by
which I mean if you want SeqRecord objects, use Bio.SeqIO
instead (which will call Bio.GenBank.FeatureParser
internally), since that is our standard API for parsing
as SeqRecords.

> However, in the test code in
> Bio.Graphics.GenomeDiagram (after
> "if __name__=='__main__':") Bio.GenBank.FeatureParser
> is used. Should that be replaced by Bio.SeqIO then?

Yes. If the code is needed at all...

> Also I think that the RecordParser should
> raise an Exception if it cannot find a record
> when parsing.

I disagree (or at least, when exposed via Bio.SeqIO
I disagree).

> Compare the following:
>
>>>> from Bio import SeqIO
>>>> from StringIO import StringIO
>>>> handle = StringIO("no record here")
>>>> SeqIO.read(handle, 'fasta')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 617, in read
>     raise ValueError("No records found in handle")
> ValueError: No records found in handle

That's fine - the read function says it will raise an
exception if there is not exactly one record. Perhaps
you meant to use parse here as in the following
example? If you do, you get no records and no exception.

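[The read/parse distinction can be sketched without Biopython. The toy
FASTA-like parser below is hypothetical and much simpler than
Bio.SeqIO; it only illustrates the contract being discussed: parse may
yield zero records, while read insists on exactly one.]

```python
import io


def parse(handle):
    """Yield (title, sequence) pairs; text before the first '>' is ignored."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def read(handle):
    """Return the single record in the handle, or raise ValueError."""
    records = list(parse(handle))
    if not records:
        raise ValueError("No records found in handle")
    if len(records) > 1:
        raise ValueError("More than one record found in handle")
    return records[0]


# An empty file is a valid input for parse: it simply yields nothing.
print(list(parse(io.StringIO(""))))

# The same input is an error for read, which promises exactly one record.
try:
    read(io.StringIO(""))
except ValueError as err:
    print(err)
```

With this split, a generic pipeline can loop over parse() on an empty
file without special-casing it, while a caller that needs exactly one
record gets an explicit error from read().
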
>>>> from Bio import GenBank >>>> parser = GenBank.RecordParser() >>>> handle = StringIO("no record here") >>>> parser.parse(handle) >>>> # no error raised > > This still lets us ignore header text before > the actual start of a GenBank record; the > error should only be raised if no GenBank > record can be found anywhere. > If you used Bio.SeqIO.read(...) with GenBank format on an empty file you'd also get an exception. I explicitly test the SeqIO parsers to check they handle an empty file gracefully - and for simple sequential formats like FASTA and GenBank that means returns no records. This is an important special case, and it should be handled this way for generic pipelines. I often have empty FASTA files. Peter From p.j.a.cock at googlemail.com Sun Sep 11 10:12:19 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 11 Sep 2011 15:12:19 +0100 Subject: [Biopython-dev] Biograpy 1.0 beta released In-Reply-To: <302333674995860f369795196ea1df10.squirrel@lipid.biocomp.unibo.it> References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it> <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it> <302333674995860f369795196ea1df10.squirrel@lipid.biocomp.unibo.it> Message-ID: On Thu, Sep 8, 2011 at 8:02 PM, Andrea Pierleoni wrote: > >> >> Adopt in the sense of include into Biopython. >> > > well if you think it is worth it, biograpy can of course > be included in biopython. > > ... > >>> Which one do you suggest? >>> I could also use the biopython license, I don't need a strict control >>> on the code, I just want the library to be used by everybody willing to, >>> even closed source programs. 
>>> >> >> As I recall, Biopythin, NumPy, SciPy etc all use a very >> Liberal MIT/BSD type licence, while LGPL tends to >> scare commercial users ;) > > it's funny, since I choose the LGPL license not to scare > commercial users :) Well, its better than the GPL from that point of view ;) > can you send me a link to the license so that I can include it in biograpy? > thanks The Biopython licence is just: http://www.biopython.org/DIST/LICENSE If in the medium/long term you'd like to consider incorporating this into Biopython, then my recommendation is either use a compatible licence now, or ensure you get copyright assignment for all code contributions so that you can change the license later. My worry is if you use LGPL and take third party author contributions, then later wanted to change the license you'd need to contact all those 3rd party authors to get their permission. Regards, Peter From p.j.a.cock at googlemail.com Sun Sep 11 18:01:36 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 11 Sep 2011 23:01:36 +0100 Subject: [Biopython-dev] Bio.GenBank (was: Bio.File) In-Reply-To: References: <1315711335.33798.YahooMailClassic@web161215.mail.bf1.yahoo.com> Message-ID: On Sun, Sep 11, 2011 at 3:06 PM, Peter Cock wrote: > On Sun, Sep 11, 2011 at 4:22 AM, Michiel de Hoon wrote: >> However, in the test code in >> Bio.Graphics.GenomeDiagram (after >> "if name=='__main__':") Bio.GenBank.FeatureParser >> is used. Should that be replaced by Bio.SeqIO then? > > Yes. If the code is needed at all... > Updated, but those two mini-tests are probably superflous and if not should be merged into the unit tests. Peter From p.j.a.cock at googlemail.com Mon Sep 12 05:07:49 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 12 Sep 2011 10:07:49 +0100 Subject: [Biopython-dev] Calculating bootstrap values from a set of trees Message-ID: Hi Eric, I'm wondering if there is any code in Bio.Phylo for calculating bootstrap values from a set of trees? e.g. 
I have a master tree created from an alignment, and 1000 bootstrap trees (created from 1000 re-sampled alignments). I want to annotate each branch with the number/percentage of times it is found in the 1000 bootstrap trees. I once implemented this in Python using binary strings to represent each branch as a split or partition of the nodes into two groups. I'm not sure where I put this script... but it pre-dated Bio.Phylo anyway. Alternatively, which standalone tool would you recommend for this? Thanks, Peter From mjldehoon at yahoo.com Mon Sep 12 08:49:35 2011 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 12 Sep 2011 05:49:35 -0700 (PDT) Subject: [Biopython-dev] Bio.GenBank (was: Bio.File) In-Reply-To: Message-ID: <1315831775.55207.YahooMailClassic@web161206.mail.bf1.yahoo.com> --- On Sun, 9/11/11, Peter Cock wrote: > So currently none of Bio.GenBank can really be > deprecated. OK. > Maybe we can represent the WGS records as > SeqRecord objects without a sequence, but I > don't like that idea really. Such files are NOT > sequence files at all. I agree. > > > > > Also we'd need some documentation for Bio.GenBank. > > > > In general it would be a good idea to have a > worked example parsing a (small) GenBank > file and showing where in the SeqRecord > each bit of annotation goes. That would be good, but we also need some documentation for Bio.GenBank itself, to clarify how Bio.GenBank is meant to be used by users (and also to clarify that Bio.SeqIO produces SeqRecords, and Bio.GenBank its own GenBank-specific records). > > Also I think that the RecordParser should > > raise an Exception if it cannot find a record > > when parsing. > > I disagree (or at least, when exposed via > Bio.SeqIO I disagree). After reading your comments, I realized that my mail was confusing. I think we actually agree.
This is what I meant to say: Compare the following: >>> from Bio import SeqIO >>> from StringIO import StringIO >>> handle = StringIO("no record here") >>> SeqIO.read(handle, 'genbank') Traceback (most recent call last): File "", line 1, in File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 617, in read raise ValueError("No records found in handle") ValueError: No records found in handle That's fine - the read function says it will raise an exception if there is not exactly one record. With SeqIO.parse, we don't get an Exception: >>> handle = StringIO("no record here") >>> records = SeqIO.parse(handle, 'genbank') >>> for record in records: print record.id ... >>> This is also OK. SeqIO.parse expects zero, one, or multiple records. Now for Bio.GenBank: >>> from Bio import GenBank >>> parser = GenBank.RecordParser() >>> handle = StringIO("no record here") >>> parser.parse(handle) >>> # no error raised This I think is not OK. GenBank.RecordParser().parse expects one record; it should raise an Exception if it does not find one. Likewise, the parser does not raise an Exception if there are multiple records in the handle. And for Bio.GenBank.Iterator: >>> from Bio.GenBank import Iterator >>> from Bio.GenBank import RecordParser >>> from StringIO import StringIO >>> handle = StringIO("no record here") >>> parser = RecordParser() >>> records = Iterator(handle, parser) >>> for record in records: print record.locus ... >>> which is the same behavior as for Bio.SeqIO.parse, which I think is OK. Assuming that the RecordParser and the Iterator are the only two classes that are intended for the end-user, it's probably better to add a Bio.GenBank.read and a Bio.GenBank.parse function to be consistent with the other Biopython modules. Sorry for the confusion! --Michiel.
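The read/parse contract discussed in this message (parse tolerates zero, one, or many records; read insists on exactly one) can be sketched in plain Python 3 without Biopython. This is only an illustration under stated assumptions: parse_records and read_record are hypothetical names, and the toy record format (a LOCUS line up to a // line) stands in for a real GenBank parser. It is not the code that went into Bio.GenBank.

```python
from io import StringIO


def parse_records(handle):
    """Yield each record as a list of lines (hypothetical toy parser).

    A record runs from a line starting with LOCUS to a line starting
    with //. Text before the first LOCUS line is ignored, so a handle
    with no records simply yields nothing. An unterminated record is
    silently dropped in this sketch.
    """
    current = None
    for line in handle:
        if line.startswith("LOCUS"):
            current = [line.rstrip("\n")]
        elif line.startswith("//") and current is not None:
            yield current
            current = None
        elif current is not None:
            current.append(line.rstrip("\n"))


def read_record(handle):
    """Return the single record in the handle, or raise ValueError."""
    iterator = parse_records(handle)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found in handle")
    try:
        next(iterator)
    except StopIteration:
        return first
    raise ValueError("More than one record found in handle")


# The lenient iterator returns an empty list rather than failing.
empty = list(parse_records(StringIO("no record here")))
```

Building the strict function on top of the lenient iterator keeps the empty-file case cheap for pipelines while still giving a clear error to callers who expect exactly one record.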
From eric.talevich at gmail.com Mon Sep 12 09:14:15 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 12 Sep 2011 09:14:15 -0400 Subject: [Biopython-dev] Calculating bootstrap values from a set of trees In-Reply-To: References: Message-ID: Hi Peter, On Mon, Sep 12, 2011 at 5:07 AM, Peter Cock wrote: > Hi Eric, > > I'm wondering if there is any code in Bio.Phylo for calculating > bootstrap values from a set of trees? > > e.g. I have a master tree created from an alignment, and 1000 > bootstrap trees (created from 1000 re-sampled alignments). I want to > annotate each branch with the number/percentage of times is it found > in the 1000 bootsrap trees. > I haven't implemented this in Bio.Phylo yet, unfortunately. The closest thing is Bio.Nexus.Trees.consensus. It would be worthwhile to port this to Bio.Phylo eventually. I once implemented this in python using binary strings to represent > each branch as a split or partition of the nodes into two groups. I'm > not sure where I put this script... but it pre-dated Bio.Phylo anyway. > > Alternatively, which standalone tool would you recommend for this? > > I think Phylip's seqboot and consense will do the trick. Normally I let RAxML do this sort of thing for me. Cheers, Eric From p.j.a.cock at googlemail.com Mon Sep 12 09:29:19 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 12 Sep 2011 14:29:19 +0100 Subject: [Biopython-dev] Bio.GenBank (was: Bio.File) In-Reply-To: <1315831775.55207.YahooMailClassic@web161206.mail.bf1.yahoo.com> References: <1315831775.55207.YahooMailClassic@web161206.mail.bf1.yahoo.com> Message-ID: On Mon, Sep 12, 2011 at 1:49 PM, Michiel de Hoon wrote: > >>>> from Bio import GenBank >>>> parser = GenBank.RecordParser() >>>> handle = StringIO("no record here") >>>> parser.parse(handle) >>>> # no error raised > > This I think is not OK. GenBank.RecordParser().parse expects one > record; it should raise an Exception if it does not one. 
Likewise, the > parser does not raise an Exception if there are multiple records in > the handle. > > and for Bio.GenBank.Iterator: > >>>> from Bio.GenBank import Iterator >>>> from Bio.GenBank import RecordParser >>>> from StringIO import StringIO >>>> handle = StringIO("no record here") >>>> parser = RecordParser() >>>> records = Iterator(handle, parser) >>>> for record in records: print record.locus > ... >>>> > > which is the same behavior as for Bio.SeqIO.parse, which I think is OK. OK, yes - I see what you mean now. > Assuming that the RecordParser and the Iterator are the only > two classes that are intended for the end-user, it's probably > better to add a Bio.GenBank.read and a Bio.GenBank.parse > function to be consistent with the other Biopython modules. Good plan - and then we can discourage direct use of the rest of Bio.GenBank (i.e. RecordParser, Iterator etc). How's this? https://github.com/biopython/biopython/commit/ff72037efbae2da6eb6db550aa6d02b883ec1345 > Sorry for the confusion! > No problem. Peter From p.j.a.cock at googlemail.com Mon Sep 12 09:40:46 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 12 Sep 2011 14:40:46 +0100 Subject: [Biopython-dev] Calculating bootstrap values from a set of trees In-Reply-To: References:

Message-ID: On Mon, Sep 12, 2011 at 2:14 PM, Eric Talevich wrote: > Hi Peter, > > On Mon, Sep 12, 2011 at 5:07 AM, Peter Cock > wrote: >> >> Hi Eric, >> >> I'm wondering if there is any code in Bio.Phylo for calculating >> bootstrap values from a set of trees? >> >> e.g. I have a master tree created from an alignment, and 1000 >> bootstrap trees (created from 1000 re-sampled alignments). I want to >> annotate each branch with the number/percentage of times is it found >> in the 1000 bootsrap trees. > > I haven't implemented this in Bio.Phylo yet, unfortunately. The closest > thing is Bio.Nexus.Trees.consensus. It would be worthwhile to port this to > Bio.Phylo eventually. > > >> I once implemented this in python using binary strings to represent >> each branch as a split or partition of the nodes into two groups. I'm >> not sure where I put this script... but it pre-dated Bio.Phylo anyway. >> >> Alternatively, which standalone tool would you recommend for this? >> > > I think Phylip's seqboot and consense will do the trick. http://evolution.genetics.washington.edu/phylip/doc/consense.html My understanding was Phylip's consense takes a set of trees and finds a consensus - there is no obvious way to tell it you want to use a particular pre-determined tree. > > Normally I let RAxML do this sort of thing for me. > I'm unclear if RAxML will accept some 3rd party master tree (via -t) and a set of bootstrapped trees (via -z) without also wanting the original alignment and a choice of model... My reason for wanting to decouple bootstrapping the trees and applying the bootstraps to the master tree is for splitting large jobs across a cluster. Each cluster node can generate bootstrap trees independently of the other cluster nodes (no network IO or synchronisation needed). These trees are then collated (concatenated into a big multiple entry tree file), with the final step combining the bootstrapped trees onto the master tree to assess support being comparatively quick. 
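The final "apply the bootstraps to the master tree" step described here can be sketched with splits (bipartitions of the leaf set), the same idea as the binary-string approach mentioned earlier in the thread. This is a minimal Python 3 sketch under assumptions, not Bio.Phylo code: trees are plain nested tuples of leaf names, every tree is assumed to have the same leaves, and the function names are hypothetical.

```python
from collections import Counter


def leaves(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the non-trivial splits of a tree as frozensets of leaves.

    Each internal branch divides the leaves into two groups. To make the
    result independent of rooting, the group that does NOT contain the
    alphabetically first leaf is stored.
    """
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Skip the root and branches that separate off a single leaf.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def bootstrap_support(master, replicates):
    """Map each split of the master tree to its percentage support."""
    counts = Counter()
    total = 0
    for replicate in replicates:
        counts.update(splits(replicate))
        total += 1
    if total == 0:
        raise ValueError("No bootstrap trees given")
    return {s: 100.0 * counts[s] / total for s in splits(master)}


master = ((("A", "B"), "C"), ("D", "E"))
replicates = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
    (("A", "B"), ("C", ("D", "E"))),
    ((("A", "D"), "B"), ("C", "E")),
]
support = bootstrap_support(master, replicates)
```

Because the replicates are only iterated over once, they could be streamed from the concatenated multi-tree file that the cluster nodes produce, which fits the decoupled workflow outlined in the message.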
Peter From cy at cymon.org Mon Sep 12 11:13:42 2011 From: cy at cymon.org (Cymon Cox) Date: Mon, 12 Sep 2011 16:13:42 +0100 Subject: [Biopython-dev] Calculating bootstrap values from a set of trees In-Reply-To: References:

Message-ID: Peter, I don't know of any standalone software to automate the annotation of nodes of a target tree with labels - I'm assuming you want to add labels (in this case ML bootstrap support values) to a Newick tree description (e.g. an ML optimal tree). Most wouldn't do this, but manually label the tree in graphics software when preparing the figure for publication. If you want support values for all nodes in your master/target tree, you could loop over all the clades in your tree and use DendroPy to help calculate the bootstrap values from your bootstrap trees. Cheers, Cymon On 12 September 2011 14:40, Peter Cock wrote: > On Mon, Sep 12, 2011 at 2:14 PM, Eric Talevich > wrote: > > Hi Peter, > > > > On Mon, Sep 12, 2011 at 5:07 AM, Peter Cock > > wrote: > >> > >> Hi Eric, > >> > >> I'm wondering if there is any code in Bio.Phylo for calculating > >> bootstrap values from a set of trees? > >> > >> e.g. I have a master tree created from an alignment, and 1000 > >> bootstrap trees (created from 1000 re-sampled alignments). I want to > >> annotate each branch with the number/percentage of times it is found > >> in the 1000 bootstrap trees. > > > > I haven't implemented this in Bio.Phylo yet, unfortunately. The closest > > thing is Bio.Nexus.Trees.consensus. It would be worthwhile to port this > to > > Bio.Phylo eventually. > > > > > >> I once implemented this in python using binary strings to represent > >> each branch as a split or partition of the nodes into two groups. I'm > >> not sure where I put this script... but it pre-dated Bio.Phylo anyway. > >> > >> Alternatively, which standalone tool would you recommend for this? > >> > > > > I think Phylip's seqboot and consense will do the trick. > > http://evolution.genetics.washington.edu/phylip/doc/consense.html > > My understanding was Phylip's consense takes a set of trees > and finds a consensus - there is no obvious way to tell it you > want to use a particular pre-determined tree.
> > > > > Normally I let RAxML do this sort of thing for me. > > > > I'm unclear if RAxML will accept some 3rd party master tree > (via -t) and a set of bootstrapped trees (via -z) without also > wanting the original alignment and a choice of model... > > My reason for wanting to decouple bootstrapping the trees > and applying the bootstraps to the master tree is for splitting > large jobs across a cluster. Each cluster node can generate > bootstrap trees independently of the other cluster nodes > (no network IO or synchronisation needed). These trees are > then collated (concatenated into a big multiple entry tree > file), with the final step combining the bootstrapped trees > onto the master tree to assess support being comparatively > quick. > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- ____________________________________________________________________ Cymon J. Cox From p.j.a.cock at googlemail.com Mon Sep 12 11:18:32 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 12 Sep 2011 16:18:32 +0100 Subject: [Biopython-dev] Calculating bootstrap values from a set of trees In-Reply-To: References:

Message-ID: On Mon, Sep 12, 2011 at 4:13 PM, Cymon Cox wrote: > Peter, > > I don't know of any stand alone software to automate the annotate of nodes > of a target tree with labels - I'm assuming you want to add labels (in this > case ML bootstrap support values) to a Newick tree description (eg an ML > optimal tree). Yes, or NJ bootstraps, or whatever. > Most wouldn't do this, but manually label the tree in a > graphics software when preparing the figure for publication. Huh. I guess it depends on the size of tree ;) > If you want support values for all nodes in your master/target tree, you > could loop over all the clades in your tree and use dendropy to help > calculate the bootstrap values for you bootstrap trees. > > Cheers, Cymon Thanks - looks like I'm not overlooking some really obvious tool to do this then. Peter From cy at cymon.org Mon Sep 12 11:25:44 2011 From: cy at cymon.org (Cymon Cox) Date: Mon, 12 Sep 2011 16:25:44 +0100 Subject: [Biopython-dev] Calculating bootstrap values from a set of trees In-Reply-To: References:

Message-ID: Peter, On 12 September 2011 16:18, Peter Cock wrote: > On Mon, Sep 12, 2011 at 4:13 PM, Cymon Cox wrote: > > Peter, > > > > I don't know of any stand alone software to automate the annotate of > nodes > > of a target tree with labels - I'm assuming you want to add labels (in > this > > case ML bootstrap support values) to a Newick tree description (eg an ML > > optimal tree). > > Yes, or NJ bootstraps, or whatever. > > > Most wouldn't do this, but manually label the tree in a > > graphics software when preparing the figure for publication. > > Huh. I guess it depends on the size of tree ;) > Well, yes. One of mine had >600 taxa - I didn't do it manually ;) > > If you want support values for all nodes in your master/target tree, you > > could loop over all the clades in your tree and use dendropy to help > > calculate the bootstrap values for you bootstrap trees. > > > > Cheers, Cymon > > Thanks - looks like I'm not overlooking some really obvious tool > to do this then. > Nothing obvious - but I have a vague recollection that I've seen this as an option in a tree graphics programme before - for the life of me I can't remember which though! If it comes to me I'll let you know ;) C. -- ____________________________________________________________________ Cymon J. Cox From andrea at biocomp.unibo.it Mon Sep 12 11:34:55 2011 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Mon, 12 Sep 2011 17:34:55 +0200 (CEST) Subject: [Biopython-dev] Biograpy 1.0 beta released In-Reply-To: References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it> <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it> <302333674995860f369795196ea1df10.squirrel@lipid.biocomp.unibo.it> Message-ID: <7fcb72ae2b632b6f9923b889513ab9ea.squirrel@lipid.biocomp.unibo.it> > >> can you send me a link to the license so that I can include it in >> biograpy?
>> thanks > > The Biopython licence is just: > http://www.biopython.org/DIST/LICENSE Yes, I saw that license, but I didn't find any reference to MIT or anything else, so I was not sure this was the right one... > > If in the medium/long term you'd like to consider incorporating > this into Biopython, then my recommendation is either use a > compatible licence now, or ensure you get copyright assignment > for all code contributions so that you can change the license later. > > My worry is if you use LGPL and take third party author > contributions, then later wanted to change the license you'd > need to contact all those 3rd party authors to get their > permission. > Well we can easily change the license to the BioPython one. This is intended to be a free library. the more people can use it, the better, even for commercial purposes. BioGraPy can of course be incorporated in BioPython for commodity, and/or be shipped as a separate package. Personally I'd prefer to ship also with BioPython so we can be sure that the right versions are always packed together. Eg. If you are going to change subfeatures, than a compatible version of BioGraPy must be used. From p.j.a.cock at googlemail.com Mon Sep 12 11:50:39 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 12 Sep 2011 16:50:39 +0100 Subject: [Biopython-dev] Biograpy 1.0 beta released In-Reply-To: <7fcb72ae2b632b6f9923b889513ab9ea.squirrel@lipid.biocomp.unibo.it> References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it> <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it> <302333674995860f369795196ea1df10.squirrel@lipid.biocomp.unibo.it> <7fcb72ae2b632b6f9923b889513ab9ea.squirrel@lipid.biocomp.unibo.it> Message-ID: On Mon, Sep 12, 2011 at 4:34 PM, Andrea Pierleoni wrote: > >> >>> can you send me a link to the license so that I can include it in >>> biograpy? 
>>> thanks >> >> The Biopython licence is just: >> http://www.biopython.org/DIST/LICENSE > > Yes, I saw that license, but I didn't find any reference to MIT or > anything else, so I was not sure this was the right one... > >> >> If in the medium/long term you'd like to consider incorporating >> this into Biopython, then my recommendation is either use a >> compatible licence now, or ensure you get copyright assignment >> for all code contributions so that you can change the license later. >> >> My worry is if you use LGPL and take third party author >> contributions, then later wanted to change the license you'd >> need to contact all those 3rd party authors to get their >> permission. >> > > Well we can easily change the license to the BioPython one. This is > intended to be a free ?library. the more people can use it, the better, > even for commercial purposes. > BioGraPy can of course be incorporated in BioPython for commodity, > and/or be shipped as a separate package. That how GenomeDiagram started. > Personally I'd prefer to ship also with BioPython so we can be sure > that the right versions are always packed together. > Eg. If you are going to change subfeatures, than a compatible version > of BioGraPy must be used. Yeah - changing SeqFeature locations is a potential minefield, so I will want to try and make any transition as smooth as possible with a backwards compatibility hack. 
Peter From andrea at biocomp.unibo.it Mon Sep 12 12:04:59 2011 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Mon, 12 Sep 2011 18:04:59 +0200 (CEST) Subject: [Biopython-dev] Biograpy 1.0 beta released In-Reply-To: References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it> <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it> <302333674995860f369795196ea1df10.squirrel@lipid.biocomp.unibo.it> <7fcb72ae2b632b6f9923b889513ab9ea.squirrel@lipid.biocomp.unibo.it> Message-ID: <0b787347db4866e30d0768fe306e7ca4.squirrel@lipid.biocomp.unibo.it> > On Mon, Sep 12, 2011 at 4:34 PM, Andrea Pierleoni > wrote: >> >>> >>>> can you send me a link to the license so that I can include it in >>>> biograpy? >>>> thanks >>> >>> The Biopython licence is just: >>> http://www.biopython.org/DIST/LICENSE >> >> Yes, I saw that license, but I didn't find any reference to MIT or >> anything else, so I was not sure this was the right one... >> >>> >>> If in the medium/long term you'd like to consider incorporating >>> this into Biopython, then my recommendation is either use a >>> compatible licence now, or ensure you get copyright assignment >>> for all code contributions so that you can change the license later. >>> >>> My worry is if you use LGPL and take third party author >>> contributions, then later wanted to change the license you'd >>> need to contact all those 3rd party authors to get their >>> permission. >>> >> >> Well we can easily change the license to the BioPython one. This is >> intended to be a free ?library. the more people can use it, the better, >> even for commercial purposes. >> BioGraPy can of course be incorporated in BioPython for commodity, >> and/or be shipped as a separate package. > > That how GenomeDiagram started. > >> Personally I'd prefer to ship also with BioPython so we can be sure >> that the right versions are always packed together. >> Eg. 
If you are going to change subfeatures, than a compatible version >> of BioGraPy must be used. > > Yeah - changing SeqFeature locations is a potential minefield, > so I will want to try and make any transition as smooth as > possible with a backwards compatibility hack. > > Peter > Backwards compatibility is always needed when feasible... :) Andrea From nicolas.rochette at univ-lyon1.fr Mon Sep 12 15:49:06 2011 From: nicolas.rochette at univ-lyon1.fr (Nicolas Rochette) Date: Mon, 12 Sep 2011 21:49:06 +0200 Subject: [Biopython-dev] Calculating bootstrap values from a set of trees In-Reply-To: References:

Message-ID: <4E6E6232.50805@univ-lyon1.fr> Hi Peter, What you are looking for exists in the bppconsense program from the "Bio++ Suite" http://home.gna.org/bppsuite/ With something like : bppconsense input.tree.file=NEWICK_FILE method=Input input.trees.file=BOOTSTRAPS_FILE output.tree.file=OUTPUT_FILE Regards, Nicolas Rochette From mjldehoon at yahoo.com Wed Sep 14 11:34:01 2011 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 14 Sep 2011 08:34:01 -0700 (PDT) Subject: [Biopython-dev] Bio.GenBank (was: Bio.File) In-Reply-To: Message-ID: <1316014441.82644.YahooMailClassic@web161216.mail.bf1.yahoo.com> Hi Peter, --- On Mon, 9/12/11, Peter Cock wrote: > How's this? > https://github.com/biopython/biopython/commit/ff72037efbae2da6eb6db550aa6d02b883ec1345 The code looks good. About the documentation, at the top of the module you say that using Bio.GenBank can be useful for WGS master records. That is true, but people with particular interests may have other reasons to use Bio.GenBank, and maybe WGS master records will not be stored as GenBank files in the future. So it may be good to keep the documentation a bit more generic, so it's still valid in a few years. But I agree that in most cases and for most users, Bio.SeqIO is the appropriate module rather than Bio.GenBank. Does Bio.SeqIO still need to use Bio.GenBank's FeatureParser? Or can it also use Bio.GenBank.read() or Bio.GenBank.parse()? --Michiel. From p.j.a.cock at googlemail.com Wed Sep 14 16:48:48 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 14 Sep 2011 21:48:48 +0100 Subject: [Biopython-dev] Bio.GenBank (was: Bio.File) In-Reply-To: <1316014441.82644.YahooMailClassic@web161216.mail.bf1.yahoo.com> References: <1316014441.82644.YahooMailClassic@web161216.mail.bf1.yahoo.com> Message-ID: On Wed, Sep 14, 2011 at 4:34 PM, Michiel de Hoon wrote: > Hi Peter, > > --- On Mon, 9/12/11, Peter Cock wrote: >> How's this? 
>> https://github.com/biopython/biopython/commit/ff72037efbae2da6eb6db550aa6d02b883ec1345 > > The code looks good. OK. > About the documentation, at the top of the module you say > that using Bio.GenBank can be useful for WGS master > records. That is true, but people with particular interests > may have other reasons to use Bio.GenBank, and maybe > WGS master records will not be stored as GenBank files > in the future. So it may be good to keep the documentation > a bit more generic, so it's still valid in a few years. But I > agree that in most cases and for most users, Bio.SeqIO > is the appropriate module rather than Bio.GenBank. Please go ahead and try to make it clearer. > Does Bio.SeqIO still need to use Bio.GenBank's > FeatureParser? Or can it also use Bio.GenBank.read() > or Bio.GenBank.parse()? Yes, and no, respectively. At least as written - I guess the new read/parse functions could take an optional argument to control this but I fear that would just be confusing. Essentially both are both using the scanner/consumer model, but one uses the Record producing consumer and the other the SeqRecord producing consumer. Peter From p.j.a.cock at googlemail.com Fri Sep 16 12:31:13 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 16 Sep 2011 17:31:13 +0100 Subject: [Biopython-dev] SeqFeature start/end and making positions act like ints Message-ID: Hi all, We've previously discussed adding start/end properties to the SeqFeature returning integers - which would be useful but inconsistent with the FeatureLocation which returns Position objects: https://redmine.open-bio.org/issues/2818 After an interesting discussion with Leighton, I spent the afternoon making (most of the) Position objects subclass int - so that they can be used like integers (with the fuzzy information retained but generally ignored except for writing the features out again). 
This means we can have SeqFeature start/end properties which, like those of the FeatureLocation, return position objects - and they are actually easy to use (except for some very extreme cases). For example, you can use them to slice a sequence. The code is on a branch here: https://github.com/peterjc/biopython/tree/int_pos It is almost 100% backwards compatible. Some of the arguments for creating a fuzzy position (and their __repr__) have changed, and some of their attributes, but we feel this is unlikely to actually affect anyone. We rather suspect only the SeqIO parsers actually create or use the fuzzy objects in the first place! In terms of usability I think this is a worthwhile improvement. The new class hierarchy is a bit more complex though - and I have not looked at the performance implications at all. Would anyone like to review this please? Peter From redmine at redmine.open-bio.org Fri Sep 16 12:45:50 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 16 Sep 2011 16:45:50 +0000 Subject: [Biopython-dev] [Biopython - Bug #2818] Add start and end properties to SeqFeature object References: Message-ID: Issue #2818 has been updated by Peter Cock. See also this proposal: http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009172.html ---------------------------------------- Bug #2818: Add start and end properties to SeqFeature object https://redmine.open-bio.org/issues/2818 Author: Peter Cock Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: Not Applicable URL: An enhancement proposed on the mailing list would add start and end properties to the SeqFeature returning plain integers (non-fuzzy approximations to the start and end locations) suitable for slicing most parent sequences. Dealing with a join location would still be tricky.
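The idea of a position object that subclasses int, so that it slices sequences directly while keeping its fuzzy meaning for output, can be sketched in a few lines of Python 3. This is a hedged illustration only: SketchBeforePosition is a hypothetical class invented here, not the class on the int_pos branch, and the "<5" string form is just one plausible way to write a "before" position.

```python
class SketchBeforePosition(int):
    """A fuzzy 'before this coordinate' position that behaves as an int.

    Hypothetical class for illustration; not the Biopython API.
    """

    def __new__(cls, position):
        # int is immutable, so the value must be set in __new__.
        return super().__new__(cls, position)

    def __repr__(self):
        return "%s(%d)" % (type(self).__name__, int(self))

    def __str__(self):
        # Keep the fuzzy information for writing the feature out again.
        return "<%d" % int(self)


sequence = "ACGTACGTACGT"
start = SketchBeforePosition(5)
end = 10

# The position is accepted anywhere an integer index is expected.
fragment = sequence[start:end]

# Arithmetic falls back to a plain int, so the fuzziness is dropped.
shifted = start + 1
```

Slicing works because int subclasses inherit the integer index protocol, so no special support is needed in the sequence class.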
Example usage: >>> from Bio import SeqIO >>> record = SeqIO.read(open("NC_005816.gb"),"gb") >>> feature = record.features[2] >>> print feature type: gene location: [86:1109] ref: None:None strand: 1 qualifiers: Key: db_xref, Value: ['GeneID:2767718'] Key: locus_tag, Value: ['YP_pPCP01'] >>> record[feature.start:feature.end] SeqRecord(seq=Seq('ATGGTCACTTTTGAGACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAATG...TGA', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=[]) >>> record.seq[feature.start:feature.end] Seq('ATGGTCACTTTTGAGACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAATG...TGA', IUPACAmbiguousDNA()) Patch to follow. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Fri Sep 16 13:07:29 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 16 Sep 2011 18:07:29 +0100 Subject: [Biopython-dev] Biopython under PyPy Message-ID: Hi all, I've been trying Biopython under PyPy 1.6, and the unit tests for a lot of things work fine. In the short term I'm skipping all the C extensions (not clear how easy they will be under PyPy): https://github.com/biopython/biopython/commit/2a26ceebed01508a69aefd6a3a6437245347a5a2 PyPy ships with a minimal numpy implementation, but it seems to be very minimal - e.g. there is no dot function. This is actually a bit annoying as "import numpy" works but you don't get everything! Anyway, there are some easy checks we can add to individual unit tests to skip them under pypy. What is interesting is running the full test suite reports some false positives (tests which when run on their own, or as part of a smaller group pass), and the test suite itself never finishes: error: Too many open files I'm not sure what this is from... 
I fixed an obvious handle leak: https://github.com/biopython/biopython/commit/f7ce81b3751745970c32cc813836507e93da3c30 I suspect the problem is some of the individual tests are leaking handles - which we know already from warnings under Python 3 etc. Peter From eric.talevich at gmail.com Fri Sep 16 16:14:31 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 16 Sep 2011 16:14:31 -0400 Subject: [Biopython-dev] Biopython under PyPy In-Reply-To: References: Message-ID: On Fri, Sep 16, 2011 at 1:07 PM, Peter Cock wrote: > Hi all, > > I've been trying Biopython under PyPy 1.6, and the unit tests for > a lot of things work fine. In the short term I'm skipping all the C > extensions (not clear how easy they will be under PyPy): > > https://github.com/biopython/biopython/commit/2a26ceebed01508a69aefd6a3a6437245347a5a2 > > Neato! Here's the relevant bug in Redmine: https://redmine.open-bio.org/issues/3236 > PyPy ships with a minimal numpy implementation, but it seems > to be very minimal - e.g. there is no dot function. This is actually > a bit annoying as "import numpy" works but you don't get everything! > Anyway, there are some easy checks we can add to individual > unit tests to skip them under pypy. > Presumably this will get better in future releases of numpy, but yeah, it will be awkward to have to check that the numpy module not only exists, but is in fact the 'real' numpy. > > What is interesting is running the full test suite reports some > false positives (tests which when run on their own, or as part > of a smaller group pass), and the test suite itself never finishes: > error: Too many open files > > I'm not sure what this is from... I fixed an obvious handle leak: > > https://github.com/biopython/biopython/commit/f7ce81b3751745970c32cc813836507e93da3c30 > > I suspect the problem is some of the individual tests are > leaking handles - which we know already from warnings > under Python 3 etc. 
> Now that we've ditched Py2.4, we can start using context managers ('with') instead of explicit open/close. This should help ensure handles are closed when exceptions are raised. The other noteworthy bug the unit tests uncovered, for me, was in test_Restriction. It wasn't clear at all to me why this error is raised -- some subtle difference in magic-method access between implementations, maybe? -Eric From eric.talevich at gmail.com Fri Sep 16 16:33:19 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 16 Sep 2011 16:33:19 -0400 Subject: [Biopython-dev] SeqFeature start/end and making positions act like ints In-Reply-To: References: Message-ID: On Fri, Sep 16, 2011 at 12:31 PM, Peter Cock wrote: > Hi all, > > We've previously discussed adding start/end properties > to the SeqFeature returning integers - which would be > useful but inconsistent with the FeatureLocation which > returns Position objects: > > https://redmine.open-bio.org/issues/2818 > > After an interesting discussion with Leighton, I spent > the afternoon making (most of the) Position objects > subclass int - so that they can be used like integers > (with the fuzzy information retained but generally > ignored except for writing the features out again). > > This means we can have SeqFeature start/end > properties which like those of the FeatureLocation > return position objects - and they are actually easy > to use (except for some very extreme cases). > e.g. You can use them to slice a sequence. > > The code is on a branch here: > https://github.com/peterjc/biopython/tree/int_pos > > It is almost 100% backwards compatible. Some > of the arguments for creating a fuzzy position > (and their __repr__) have changed, and some > of their attributes, but we feel this is unlikely to > actually affect anyone. We rather suspect only > the SeqIO parsers actually create or use the > fuzzy objects in the first place! > > In terms of usability I think this is a worthwhile > improvement. 
The new class hierarchy is a bit
> more complex though - and I have not looked
> at the performance implications at all.
>
> Would anyone like to review this please?
>

Here's another way to do it, maybe -- modify Seq.Seq.__getitem__ to also
check if it's been given a SeqFeature, and if so, handle the joins there.
The handling of fuzziness could happen in here or use the new .start and
.end properties.

Outline:

    def __getitem__(self, index):
        """Returns a subsequence of single letter, use my_seq[index]."""
        if isinstance(index, int):
            #Return a single letter as a string
            return self._data[index]
        elif isinstance(index, SeqFeature):
            # NEW -- handle start/end/join voodoo safely
            # if there's a join, extract the subsequences and then concatenate them
            return the_result
        else:
            #Return the (sub)sequence as another Seq object
            return Seq(self._data[index], self.alphabet)

Think that would work?

-Eric

From p.j.a.cock at googlemail.com Fri Sep 16 18:56:25 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 16 Sep 2011 23:56:25 +0100
Subject: [Biopython-dev] Biopython under PyPy
In-Reply-To:
References:

Message-ID: On Fri, Sep 16, 2011 at 9:14 PM, Eric Talevich wrote: > On Fri, Sep 16, 2011 at 1:07 PM, Peter Cock > wrote: >> >> Hi all, >> >> I've been trying Biopython under PyPy 1.6, and the unit tests for >> a lot of things work fine. In the short term I'm skipping all the C >> extensions (not clear how easy they will be under PyPy): >> >> https://github.com/biopython/biopython/commit/2a26ceebed01508a69aefd6a3a6437245347a5a2 >> > > Neato! Here's the relevant bug in Redmine: > https://redmine.open-bio.org/issues/3236 Oh yeah - what I did to setup.py is almost the same. >> >> PyPy ships with a minimal numpy implementation, but it seems >> to be very minimal - e.g. there is no dot function. This is actually >> a bit annoying as "import numpy" works but you don't get everything! >> Anyway, there are some easy checks we can add to individual >> unit tests to skip them under pypy. > > Presumably this will get better in future releases of numpy, > but yeah, it will be awkward to have to check that the numpy > module not only exists, but is in fact the 'real' numpy. I'm hoping we just need to check if it is good enough, i.e. has the bits of numpy required for that module. That's my aim with the test_*.py changes. >> What is interesting is running the full test suite reports some >> false positives (tests which when run on their own, or as part >> of a smaller group pass), and the test suite itself never finishes: >> error: Too many open files >> >> I'm not sure what this is from... I fixed an obvious handle leak: >> >> https://github.com/biopython/biopython/commit/f7ce81b3751745970c32cc813836507e93da3c30 >> >> I suspect the problem is some of the individual tests are >> leaking handles - which we know already from warnings >> under Python 3 etc. > > Now that we've ditched Py2.4, we can start using context managers ('with') > instead of explicit open/close. This should help ensure handles are closed > when exceptions are raised. 
Yeah - in the example above can you put the with statement inside an if? > The other noteworthy bug the unit tests uncovered, for me, was in > test_Restriction. It wasn't clear at all to me why this error is raised -- > some subtle difference in magic-method access between implementations, > maybe? I'd noticed that too, and agree it probably falls into the "too much magic" category :( Peter From p.j.a.cock at googlemail.com Fri Sep 16 19:01:18 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 17 Sep 2011 00:01:18 +0100 Subject: [Biopython-dev] SeqFeature start/end and making positions act like ints In-Reply-To: References: Message-ID: On Fri, Sep 16, 2011 at 9:33 PM, Eric Talevich wrote: > On Fri, Sep 16, 2011 at 12:31 PM, Peter Cock > wrote: >> >> Hi all, >> >> We've previously discussed adding start/end properties >> to the SeqFeature returning integers - which would be >> useful but inconsistent with the FeatureLocation which >> returns Position objects: >> >> https://redmine.open-bio.org/issues/2818 >> >> After an interesting discussion with Leighton, I spent >> the afternoon making (most of the) Position objects >> subclass int - so that they can be used like integers >> (with the fuzzy information retained but generally >> ignored except for writing the features out again). >> >> This means we can have SeqFeature start/end >> properties which like those of the FeatureLocation >> return position objects - and they are actually easy >> to use (except for some very extreme cases). >> e.g. You can use them to slice a sequence. >> >> The code is on a branch here: >> https://github.com/peterjc/biopython/tree/int_pos >> >> It is almost 100% backwards compatible. Some >> of the arguments for creating a fuzzy position >> (and their __repr__) have changed, and some >> of their attributes, but we feel this is unlikely to >> actually affect anyone. We rather suspect only >> the SeqIO parsers actually create or use the >> fuzzy objects in the first place! 
>>
>> In terms of usability I think this is a worthwhile
>> improvement. The new class hierarchy is a bit
>> more complex though - and I have not looked
>> at the performance implications at all.
>>
>> Would anyone like to review this please?
>>
>
> Here's another way to do it, maybe -- modify Seq.Seq.__getitem__ to also
> check if it's been given a SeqFeature, and if so, handle the joins there.
> The handling of fuzziness could happen in here or use the new .start and
> .end properties.
>
> Outline:
>
>     def __getitem__(self, index):
>         """Returns a subsequence of single letter, use my_seq[index]."""
>         if isinstance(index, int):
>             #Return a single letter as a string
>             return self._data[index]
>         elif isinstance(index, SeqFeature):
>             # NEW -- handle start/end/join voodoo safely
>             # if there's a join, extract the subsequences and then concatenate them
>             return the_result
>         else:
>             #Return the (sub)sequence as another Seq object
>             return Seq(self._data[index], self.alphabet)
>
>
> Think that would work?

Yes - in fact I've done that on another branch, but to avoid
circular imports I used hasattr(index, "extract") instead. It solves
a different problem to making start/end easier to use.

Peter

From eric.talevich at gmail.com Fri Sep 16 22:37:20 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 16 Sep 2011 22:37:20 -0400
Subject: [Biopython-dev] Biopython under PyPy
In-Reply-To:
References:

Message-ID: On Fri, Sep 16, 2011 at 6:56 PM, Peter Cock wrote: > On Fri, Sep 16, 2011 at 9:14 PM, Eric Talevich > wrote: > > On Fri, Sep 16, 2011 at 1:07 PM, Peter Cock > > wrote: > >> > >> Hi all, > >> > >> I've been trying Biopython under PyPy 1.6, and the unit tests for > >> a lot of things work fine. In the short term I'm skipping all the C > >> extensions (not clear how easy they will be under PyPy): > >> > >> > https://github.com/biopython/biopython/commit/2a26ceebed01508a69aefd6a3a6437245347a5a2 > >> > > > > Neato! Here's the relevant bug in Redmine: > > https://redmine.open-bio.org/issues/3236 > > Oh yeah - what I did to setup.py is almost the same. > > >> > >> PyPy ships with a minimal numpy implementation, but it seems > >> to be very minimal - e.g. there is no dot function. This is actually > >> a bit annoying as "import numpy" works but you don't get everything! > >> Anyway, there are some easy checks we can add to individual > >> unit tests to skip them under pypy. > > > > Presumably this will get better in future releases of numpy, > > but yeah, it will be awkward to have to check that the numpy > > module not only exists, but is in fact the 'real' numpy. > > I'm hoping we just need to check if it is good enough, > i.e. has the bits of numpy required for that module. > That's my aim with the test_*.py changes. > > >> What is interesting is running the full test suite reports some > >> false positives (tests which when run on their own, or as part > >> of a smaller group pass), and the test suite itself never finishes: > >> error: Too many open files > >> > >> I'm not sure what this is from... I fixed an obvious handle leak: > >> > >> > https://github.com/biopython/biopython/commit/f7ce81b3751745970c32cc813836507e93da3c30 > >> > >> I suspect the problem is some of the individual tests are > >> leaking handles - which we know already from warnings > >> under Python 3 etc. 
> >
> > Now that we've ditched Py2.4, we can start using context managers ('with')
> > instead of explicit open/close. This should help ensure handles are closed
> > when exceptions are raised.
>
> Yeah - in the example above can you put the with statement
> inside an if?
>

In runTest it's a little strange because the open() call depends on the
Python version. So we could do:

if sys.version_info[0] >= 3:
    #Python 3 problem: Can't use utf8 on output/test_geo
    #due to micro (\xb5) and degrees (\xb0) symbols
    open_kwargs = {'encoding': 'latin'}
else:
    open_kwargs = {'mode': 'rU'}
with open(outputfile, **open_kwargs) as expected:
    # Everything else...

But then we lose the try/except block that catches the missing-file error.
The cleanest solution would be a separate context handler:

@contextlib.contextmanager
def open_outputfile(fname):
    # Meant to be defined inside runTest, so 'self' is the test case
    try:
        if sys.version_info[0] >= 3:
            #Python 3 problem: Can't use utf8 on output/test_geo
            #due to micro (\xb5) and degrees (\xb0) symbols
            expected = open(fname, encoding="latin")
        else:
            expected = open(fname, 'rU')
    except IOError:
        self.fail("Warning: Can't open %s for test %s" % (fname, self.name))
    try:
        yield expected
    finally:
        expected.close()

I think that would do everything we want.

-Eric

From redmine at redmine.open-bio.org Sat Sep 17 01:25:18 2011
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sat, 17 Sep 2011 05:25:18 +0000
Subject: [Biopython-dev] [Biopython - Bug #3291] (New) Bio.PDB.PDBIO preserve atoms' serial number
Message-ID:

Issue #3291 has been reported by Carlos Rios.
---------------------------------------- Bug #3291: Bio.PDB.PDBIO preserve atoms' serial number https://redmine.open-bio.org/issues/3291 Author: Carlos Rios Status: New Priority: Normal Assignee: Category: Target version: URL: http://lists.open-bio.org/pipermail/biopython/2009-May/005163.html As we know, Bio.PDB.PDBIO.save() renumbers the atom serial number starting with 1 in the first atom (per model), but sometimes people needs conserve the original serial number. A request of this can be found in http://lists.open-bio.org/pipermail/biopython/2009-May/005163.html I made a little patch, where I add a new parameter in Bio.PDB.PDBIO.save(), `conserve_atoms_number`, with default value conserve_atoms_number=False for backward compatibility. I hope it helps. Regards ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From eric.talevich at gmail.com Sat Sep 17 09:44:21 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 17 Sep 2011 09:44:21 -0400 Subject: [Biopython-dev] SeqFeature start/end and making positions act like ints In-Reply-To: References:

Message-ID:

On Fri, Sep 16, 2011 at 7:01 PM, Peter Cock wrote:

> On Fri, Sep 16, 2011 at 9:33 PM, Eric Talevich wrote:
> > On Fri, Sep 16, 2011 at 12:31 PM, Peter Cock wrote:
> >>
> >> Hi all,
> >>
> >> We've previously discussed adding start/end properties
> >> to the SeqFeature returning integers - which would be
> >> useful but inconsistent with the FeatureLocation which
> >> returns Position objects:
> >>
> >> https://redmine.open-bio.org/issues/2818
> >>
> >> After an interesting discussion with Leighton, I spent
> >> the afternoon making (most of the) Position objects
> >> subclass int - so that they can be used like integers
> >> (with the fuzzy information retained but generally
> >> ignored except for writing the features out again).
> >>
> >> This means we can have SeqFeature start/end
> >> properties which like those of the FeatureLocation
> >> return position objects - and they are actually easy
> >> to use (except for some very extreme cases).
> >> e.g. You can use them to slice a sequence.
> >>
> >> The code is on a branch here:
> >> https://github.com/peterjc/biopython/tree/int_pos
> >>
> >> It is almost 100% backwards compatible. Some
> >> of the arguments for creating a fuzzy position
> >> (and their __repr__) have changed, and some
> >> of their attributes, but we feel this is unlikely to
> >> actually affect anyone. We rather suspect only
> >> the SeqIO parsers actually create or use the
> >> fuzzy objects in the first place!
> >>
> >> In terms of usability I think this is a worthwhile
> >> improvement. The new class hierarchy is a bit
> >> more complex though - and I have not looked
> >> at the performance implications at all.
> >>
> >> Would anyone like to review this please?
> >>
> >
> > Here's another way to do it, maybe -- modify Seq.Seq.__getitem__ to also
> > check if it's been given a SeqFeature, and if so, handle the joins there.
> > The handling of fuzziness could happen in here or use the new .start and
> > .end properties.
> >
> > [...]
> >
> > Think that would work?
>
> Yes - in fact I've done that on another branch, but to avoid
> circular imports I used hasattr(index, "extract") instead. It solves
> a different problem to making start/end easier to use.
>

OK, you're way ahead of me. The new start/end properties you implemented
look good to me, and I doubt there would be a serious hit to performance --
plus, code that doesn't need these shortcuts doesn't have to use them.

These will be handy for writing code that visualizes SeqFeatures, too.

From p.j.a.cock at googlemail.com Sat Sep 17 15:38:53 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 17 Sep 2011 20:38:53 +0100
Subject: [Biopython-dev] SeqFeature start/end and making positions act like ints
In-Reply-To:
References:

Message-ID: On Sat, Sep 17, 2011 at 2:44 PM, Eric Talevich wrote: > On Fri, Sep 16, 2011 at 7:01 PM, Peter Cock wrote: >> >> On Fri, Sep 16, 2011 at 9:33 PM, Eric Talevich wrote: >> > >> > Think that would work? >> >> Yes - in fact I've done that on another branch but with to avoid >> circular imports used hasattr(index, "extract") instead. It solves >> a different problem to making start/end easier to use. > > OK, you're way ahead of me. Well, I've been thinking about this on and off for a while now. One issue with the __getitem__ trick is what would we do for the SeqRecord when sliced with a SeqFeature? Should it use the id and annotation from the SeqFeature or the SeqRecord? > The new start/end properties you implemented > look good to me, and I doubt there would be a serious hit > to performance -- plus, code that didn't need these shortcuts > don't have to use them. Good. I've realised I need to double check the integer methods (equals, sorting, hashes etc), but they should be fine. > These will be handy for writing code that visualizes > SeqFeatures, too. Well, slightly easier - I have some more dramatic changes to the SeqFeature and FeatureLocation objects planned, but I'm still playing with this. Peter From p.j.a.cock at googlemail.com Mon Sep 19 05:03:59 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 19 Sep 2011 10:03:59 +0100 Subject: [Biopython-dev] SeqFeature start/end and making positions act like ints In-Reply-To: References:

Message-ID:

On Sat, Sep 17, 2011 at 8:38 PM, Peter Cock wrote:
> On Sat, Sep 17, 2011 at 2:44 PM, Eric Talevich wrote:
>> On Fri, Sep 16, 2011 at 7:01 PM, Peter Cock wrote:
>>>
>>> On Fri, Sep 16, 2011 at 9:33 PM, Eric Talevich wrote:
>>> >
>>> > Think that would work?
>>>
>>> Yes - in fact I've done that on another branch, but to avoid
>>> circular imports I used hasattr(index, "extract") instead. It solves
>>> a different problem to making start/end easier to use.
>>
>> OK, you're way ahead of me.

The actual commit wasn't that far ahead of you:
https://github.com/peterjc/biopython/commit/db4553c7e0bcb8a7eca137aeb24d713d9bf9dd93

> Well, I've been thinking about this on and off for a while now.
> One issue with the __getitem__ trick is what would we do for
> the SeqRecord when sliced with a SeqFeature? Should it use
> the id and annotation from the SeqFeature or the SeqRecord?

This needs some thought.

>> The new start/end properties you implemented
>> look good to me, and I doubt there would be a serious hit
>> to performance -- plus, code that doesn't need these shortcuts
>> doesn't have to use them.
>
> Good. I've realised I need to double check the integer
> methods (equals, sorting, hashes etc), but they should
> be fine.

Thinking about this more, the current _shift method of the
position objects (used in SeqRecord slicing) would make
sense as the __add__ method, thus:

BeforePosition(5) + 10 --> BeforePosition(15)

rather than currently:

BeforePosition(5)._shift(10) --> BeforePosition(15)

However, perhaps that is just making work for ourselves,
as we'd have to implement code for all the mixture cases, e.g.

BeforePosition(5) + AfterPosition(10) --> UncertainPosition(15)

>> These will be handy for writing code that visualizes
>> SeqFeatures, too.
>
> Well, slightly easier - I have some more dramatic changes to
> the SeqFeature and FeatureLocation objects planned, but I'm
> still playing with this.
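
To make the __add__ idea a little more concrete, here is a minimal sketch. The class below is a hypothetical stand-in written just for this message, not the code on the int_pos branch, and it only covers the simple case of shifting by a plain integer (none of the mixed fuzzy cases):

```python
class BeforePosition(int):
    """Hypothetical stand-in for a fuzzy '<N' position that acts like an int."""

    def __repr__(self):
        return "%s(%d)" % (self.__class__.__name__, int(self))

    def __add__(self, offset):
        # Shifting by an integer keeps the fuzzy type instead of
        # silently falling back to a plain int.
        return self.__class__(int(self) + int(offset))

    # Assumption: integer + position should behave the same way.
    __radd__ = __add__


start = BeforePosition(5)
end = start + 10
print(repr(end))

# Because the class subclasses int, it can be used directly for slicing.
print("ACGTACGTACGTACGTACGT"[start:end])
```

Comparison, hashing and sorting are all inherited from int here, which is the part that still needs double checking on the real classes.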

One of the key changes (which can be done without really
changing the API) is to move the database & accession
and the strand from the SeqFeature to the FeatureLocation.
These are intimately connected with the location, as much
as the start/end. This is one of the things I've been working
on here: https://github.com/peterjc/biopython/commits/f_loc

The other key change on that experimental branch is moving
away from sub_features for join locations (etc). Here I was
trying a new CompoundLocation object, but am still wondering
if this should be done in the SeqFeature or FeatureLocation
object instead (or if SeqFeature should subclass FeatureLocation).

Peter

From p.j.a.cock at googlemail.com Mon Sep 19 12:33:47 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 19 Sep 2011 17:33:47 +0100
Subject: [Biopython-dev] Biopython under PyPy
In-Reply-To:
References:

Message-ID:

On Fri, Sep 16, 2011 at 1:07 PM, Peter Cock wrote:
>>>> What is interesting is running the full test suite reports some
>>>> false positives (tests which when run on their own, or as part
>>>> of a smaller group pass), and the test suite itself never finishes:
>>>> error: Too many open files
>>>>
>>>> I'm not sure what this is from... I fixed an obvious handle leak:
>>>>
>>>> https://github.com/biopython/biopython/commit/f7ce81b3751745970c32cc813836507e93da3c30
>>>>
>>>> I suspect the problem is some of the individual tests are
>>>> leaking handles - which we know already from warnings
>>>> under Python 3 etc.

Adding a gc.collect() call after each test_XXX.py is run
seems to solve that in as much as run_tests.py finishes.

We're down to these failures,

test_CAPS.py
test_Pathway.py
test_Restriction.py
test_SeqIO_index.py - leaking handles

I'll look at test_SeqIO_index.py a bit more, but the others
are more curious. They could indicate some fragile code
in Biopython which is implementation specific, or perhaps
they hit a bug in PyPy. Is anyone interested in finding out?

Otherwise we can skip them for now, so that the whole
test suite passes, and get PyPy added to the BuildBot
for nightly regression testing.

On Sat, Sep 17, 2011 at 3:37 AM, Eric Talevich wrote:
> On Fri, Sep 16, 2011 at 6:56 PM, Peter Cock wrote:
>> Yeah - in the example above can you put the with statement
>> inside an if?
>
> In runTest it's a little strange because the open() call depends on the
> Python version. So we could do:
>
> if sys.version_info[0] >= 3:
>     #Python 3 problem: Can't use utf8 on output/test_geo
>     #due to micro (\xb5) and degrees (\xb0) symbols
>     open_kwargs = {'encoding': 'latin'}
> else:
>     open_kwargs = {'mode': 'rU'}
> with open(outputfile, **open_kwargs) as expected:
>     # Everything else...
>
>
> But then we lose the try/except block that catches the missing-file error.
> The cleanest solution would be a separate context handler:
>
> @contextlib.contextmanager
> def open_outputfile(fname):
>     # Meant to be defined inside runTest, so 'self' is the test case
>     try:
>         if sys.version_info[0] >= 3:
>             #Python 3 problem: Can't use utf8 on output/test_geo
>             #due to micro (\xb5) and degrees (\xb0) symbols
>             expected = open(fname, encoding="latin")
>         else:
>             expected = open(fname, 'rU')
>     except IOError:
>         self.fail("Warning: Can't open %s for test %s" % (fname, self.name))
>     try:
>         yield expected
>     finally:
>         expected.close()
>
>
> I think that would do everything we want.

Yeah... frankly I find the explicit open/close easier to read here.
We can at least get rid of one level of try/except nesting now
we've dropped Python 2.4 support...

Peter

From eric.talevich at gmail.com Mon Sep 19 18:13:32 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 19 Sep 2011 18:13:32 -0400
Subject: [Biopython-dev] Biopython under PyPy
In-Reply-To:
References:

Message-ID: On Mon, Sep 19, 2011 at 12:33 PM, Peter Cock wrote: > On Fri, Sep 16, 2011 at 1:07 PM, Peter Cock wrote: > >>>> What is interesting is running the full test suite reports some > >>>> false positives (tests which when run on their own, or as part > >>>> of a smaller group pass), and the test suite itself never finishes: > >>>> error: Too many open files > >>>> > >>>> I'm not sure what this is from... I fixed an obvious handle leak: > >>>> > >>>> > >>>> > https://github.com/biopython/biopython/commit/f7ce81b3751745970c32cc813836507e93da3c30 > >>>> > >>>> I suspect the problem is some of the individual tests are > >>>> leaking handles - which we know already from warnings > >>>> under Python 3 etc. > > Adding a gc.collect() call after each test_XXX.py is run > seems to solve that in as much as run_tests.py finishes. > > We're down to these failures, > > test_CAPS.py > test_Pathway.py > test_Restriction.py > test_SeqIO_index.py - leaking handles > > I'll look at test_SeqIO_index.py a bit more, but the others > are more curious. They could indicate some fragile code > in Biopython which is implementation specific, or perhaps > they hit a bug in PyPy. Is anyone interested in finding out? > > Otherwise we can skip them for now, so that the whole > test suite passes, and get PyPy added to the BuildBot > for nightly regression testing. > I could take a look at test_Restriction.py this weekend, I think. -Eric From p.j.a.cock at googlemail.com Mon Sep 19 19:05:21 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 20 Sep 2011 00:05:21 +0100 Subject: [Biopython-dev] Biopython under PyPy In-Reply-To: References:

Message-ID: >> We're down to these failures, >> >> test_CAPS.py >> test_Pathway.py >> test_Restriction.py >> test_SeqIO_index.py - leaking handles >> >> I'll look at test_SeqIO_index.py a bit more, but the others >> are more curious. They could indicate some fragile code >> in Biopython which is implementation specific, or perhaps >> they hit a bug in PyPy. Is anyone interested in finding out? >> >> Otherwise we can skip them for now, so that the whole >> test suite passes, and get PyPy added to the BuildBot >> for nightly regression testing. > > I could take a look at test_Restriction.py this weekend, I think. > Great. I think test_SeqIO_index.py is working now. Peter From redmine at redmine.open-bio.org Fri Sep 23 15:38:08 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 23 Sep 2011 19:38:08 +0000 Subject: [Biopython-dev] [Biopython - Feature #3295] (New) SeqIO read support for the PDB format Message-ID: Issue #3295 has been reported by Eric Talevich. ---------------------------------------- Feature #3295: SeqIO read support for the PDB format https://redmine.open-bio.org/issues/3295 Author: Eric Talevich Status: New Priority: Normal Assignee: Category: Target version: URL: It's useful to be able to retrieve primary sequences from a PDB file. I propose an implementation in SeqIO that uses PDBParser and PPBuilder to extract the peptide sequences from a given structure and yield each one as a SeqRecord. Basically:

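>>> from Bio import PDB
>>> from Bio.SeqRecord import SeqRecord
>>> def pdb_seqs(pdb_id, fname):  # hypothetical name, for illustration only
...     struct = PDB.PDBParser().get_structure(pdb_id, fname)
...     for peptide in PDB.PPBuilder().build_peptides(struct):
...         chain_id = peptide[0].get_parent().id  # chain holding this peptide
...         yield SeqRecord(peptide.get_sequence(),  # already a protein Seq
...                         id=pdb_id+chain_id)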

but with proper metadata. This will be read-only support, of course. Sound good? ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From eric.talevich at gmail.com Sat Sep 24 21:56:07 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 24 Sep 2011 21:56:07 -0400 Subject: [Biopython-dev] Biopython under PyPy In-Reply-To: References:

Message-ID:

On Mon, Sep 19, 2011 at 6:13 PM, Eric Talevich wrote:

> On Mon, Sep 19, 2011 at 12:33 PM, Peter Cock wrote:
>
>>
>> We're down to these failures,
>>
>> test_CAPS.py
>> test_Pathway.py
>> test_Restriction.py
>> test_SeqIO_index.py - leaking handles
>>
>> I'll look at test_SeqIO_index.py a bit more, but the others
>> are more curious. They could indicate some fragile code
>> in Biopython which is implementation specific, or perhaps
>> they hit a bug in PyPy. Is anyone interested in finding out?
>>
>> Otherwise we can skip them for now, so that the whole
>> test suite passes, and get PyPy added to the BuildBot
>> for nightly regression testing.
>>
>
> I could take a look at test_Restriction.py this weekend, I think.
>

I think it could be a bug in PyPy, but I'm not sure.

The underlying error in test_Restriction and test_CAPS is triggered by
"import Bio.Restriction.Restriction":

Traceback:
[...]
    super(RestrictionType, cls).__init__(cls, name, bases, dct)
TypeError: unbound method __init__() must be called with BssMI instance as first argument (got RestrictionType instance instead)

Adding a print statement before the failing line:

        print cls, name, isinstance(cls, RestrictionType)
        super(RestrictionType, cls).__init__(cls, name, bases, dct)

Gives:

BssMI BssMI True
[the same error...]

I'm not even sure how to turn this into a simple test case for the PyPy
folks to look at. Anyone else want to take a crack at it? The fun starts
near the bottom of Bio/Restriction/Restriction.py -- search for the word
"magic".

"We live on a placid island of ignorance in the midst of black seas of infinity, and it was not meant that we should voyage far.
The sciences, each straining in its own direction, have hitherto harmed us little; but some day the piecing together of dissociated knowledge will open up such terrifying vistas of reality, and of our frightful position therein, that we shall either go mad from the revelation or flee from the light into the peace and safety of a new dark age." -- H.P. Lovecraft, The Call of Cthulhu All the best, Eric From redmine at redmine.open-bio.org Sun Sep 25 00:14:59 2011 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sun, 25 Sep 2011 04:14:59 +0000 Subject: [Biopython-dev] [Biopython - Bug #3263] (Closed) Phylo: Move clade 'color' and 'width' attributes to BaseTree References: Message-ID: Issue #3263 has been updated by Eric Talevich. Status changed from New to Closed % Done changed from 0 to 100 Did it: https://github.com/biopython/biopython/commit/5608398c558801db2505dffaa6fb85b47435a3fc Nice to be able to add colors for Phylo.draw() without remembering the incantation of as_phyloxml(). ---------------------------------------- Bug #3263: Phylo: Move clade 'color' and 'width' attributes to BaseTree https://redmine.open-bio.org/issues/3263 Author: Eric Talevich Status: Closed Priority: Normal Assignee: Biopython Dev Mailing List Category: Target version: URL: The 'color' and 'width' attributes are associated with PhyloXML trees right now, but are useful enough to be associated with the base Tree object (which you'd get from parsing a Newick or Nexus file), even though Newick and Nexus can't serialize this info. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Sun Sep 25 08:33:49 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 25 Sep 2011 13:33:49 +0100 Subject: [Biopython-dev] Biopython under PyPy In-Reply-To: References:

Message-ID:

On Sun, Sep 25, 2011 at 2:56 AM, Eric Talevich wrote:
>>
>> I could take a look at test_Restriction.py this weekend, I think.
>>
>
> I think it could be a bug in PyPy, but I'm not sure.
>
> The underlying error in test_Restriction and test_CAPS is triggered by
> "import Bio.Restriction.Restriction":
>
> Traceback:
> [...]
>     super(RestrictionType, cls).__init__(cls, name, bases, dct)
> TypeError: unbound method __init__() must be called with BssMI instance as
> first argument (got RestrictionType instance instead)
>
>
> Adding a print statement before the failing line:
>
>         print cls, name, isinstance(cls, RestrictionType)
>         super(RestrictionType, cls).__init__(cls, name, bases, dct)
>
> Gives:
>
> BssMI BssMI True
> [the same error...]
>
> I'm not even sure how to turn this into a simple test case for the PyPy
> folks to look at. Anyone else want to take a crack at it? The fun starts
> near the bottom of Bio/Restriction/Restriction.py -- search for the word
> "magic".
>

That doesn't surprise me; we had trouble with that bit before going to Python 2.6:

http://bugzilla.open-bio.org/show_bug.cgi?id=2604

Peter

From p.j.a.cock at googlemail.com Wed Sep 28 06:59:28 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 28 Sep 2011 11:59:28 +0100
Subject: [Biopython-dev] Fwd: [biopython] trie - reducing memory use (#19)
In-Reply-To:
References:
Message-ID:

Who's the best person to review this Bio.Trie change?

Peter

---------- Forwarded message ----------
From: timwintle
Date: Wed, Sep 28, 2011 at 11:34 AM
Subject: [biopython] trie - reducing memory use (#19)
To: Peter Cock

This adds space for static 1-character strings which are used for transitions in the trie implementation - which avoids multiple allocations for the same string. Over a large (dense) set of 8-character strings, I observed memory use reduced from 388M to 283M after this change.
You can merge this Pull Request by running:

  git pull https://github.com/timwintle/biopython trie-memory-use

Or you can view, comment on it, or merge it online at:

  https://github.com/biopython/biopython/pull/19

-- Commit Summary --

* Pre-allocation of single char strings
* Removing old functions

-- File Changes --

M Bio/trie.c (113)
M Bio/triemodule.c (1)

-- Patch Links --

  https://github.com/biopython/biopython/pull/19.patch
  https://github.com/biopython/biopython/pull/19.diff

--
Reply to this email directly or view it on GitHub:
https://github.com/biopython/biopython/pull/19

From p.j.a.cock at googlemail.com Wed Sep 28 10:29:06 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 28 Sep 2011 15:29:06 +0100
Subject: [Biopython-dev] SeqFeature start/end and making positions act like ints
In-Reply-To:
References:

Message-ID: On Mon, Sep 19, 2011 at 10:03 AM, Peter Cock wrote: > On Sat, Sep 17, 2011 at 8:38 PM, Peter Cock wrote: >> On Sat, Sep 17, 2011 at 2:44 PM, Eric Talevich wrote: >>> The new start/end properties you implemented >>> look good to me, and I doubt there would be a serious hit >>> to performance -- plus, code that didn't need these shortcuts >>> don't have to use them. >> >> Good. I've realised I need to double check the integer >> methods (equals, sorting, hashes etc), but they should >> be fine. > > Thinking about this more, the current _shift method of > the position objects (used in SeqRecord slicing) would > make sense as the __add__ method, thus: > > BeforePosition(5) + 10 --> BeforePosition(15) > > rather than currently: > > BeforePosition(5)._shift(10) --> BeforePosition(15) > > However, perhaps that is just making work for ourselves, > we'd have to implement code for all the mixture cases, e.g. > > BeforePosition(5) + AfterPosition(10) --> UncertainPosition(15) I went with the practical option - for all the maths operations etc you just get the basic int behaviour. Much simpler! Having done a bit of testing to reassure myself there was no unexpected performance regression, I have committed this to the trunk (as a single commit - it seemed cleaner to me): https://github.com/biopython/biopython/commit/c52e986a3da571a5793b00958c5bbcde1d581526 Note I have not included the SeqFeature start/end proxy methods. There is a reason for this related to the other location changes I've been playing with. I've been thinking it makes more sense for the start/end of a join etc to give the lowest start and the highest end of the sub-locations. In general that means no change to the current situation, but it does matter for origin spanning & out-of-order splicing. The min/max like behaviour seems more useful (for both visualisation, but also bounds checking). 
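The two ideas above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in the commit linked above: the `BeforePosition` class here is a simplified stand-in that subclasses `int`, and `compound_bounds` is a hypothetical helper name for the min/max behaviour of a join.

```python
class BeforePosition(int):
    """Simplified stand-in for a fuzzy '<N' position that behaves like an int."""

    def __repr__(self):
        return "BeforePosition(%d)" % int(self)


def compound_bounds(parts):
    """Return (start, end) spanning every (start, end) sub-location in parts.

    Hypothetical helper: the start is the lowest sub-location start and
    the end is the highest sub-location end, whatever order the parts are in.
    """
    starts = [start for start, _ in parts]
    ends = [end for _, end in parts]
    return min(starts), max(ends)


# Arithmetic is inherited from int, so the result is a plain int and the
# "before" fuzziness is dropped rather than propagated.
shifted = BeforePosition(5) + 10
print(type(shifted).__name__, shifted)

# An origin-spanning join such as join(9001..10000,1..500), written here
# as 0-based half-open pairs with the parts deliberately out of order.
print(compound_bounds([(9000, 10000), (0, 500)]))
```

With this approach the mixed cases (for example a before-position plus an after-position) need no special code, at the cost of losing the fuzzy type after any arithmetic.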
This branch is now defunct, and I may delete it at some point: https://github.com/peterjc/biopython/tree/int_pos >>> These will be handy for writing code that visualizes >>> SeqFeatures, too. >> >> Well, slightly easier - I have some more dramatic changes to >> the SeqFeature and FeatureLocation objects planned, but I'm >> still playing with this. > > One of the key changes (which can be done without > really changing the API) is to move the database & > accession and the strand from the SeqFeature to the > FeatureLocation. These are intimately connected with > the location, as much as the start/end. I think these changes can be applied to the trunk for the next release. > This is one of the things I've been working on here: > https://github.com/peterjc/biopython/commits/f_loc > > The other key change on that experimental branch > is moving away from sub_features for join locations > (etc). Here I was trying a new CoupoundLocation > object, but am still wondering if this should be done > in the SeqFeature or FeatureLocation object instead > (or if SeqFeature should subclass FeatureLocation). I'm still thinking about this - but haven't done any more code on it just recently. I'll return to this issue later. Peter From mjldehoon at yahoo.com Sun Sep 4 06:09:13 2011 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 3 Sep 2011 23:09:13 -0700 (PDT) Subject: [Biopython-dev] Bio.GenBank Message-ID: <1315116553.25037.YahooMailClassic@web161210.mail.bf1.yahoo.com> Dear all, Currently, Bio/GenBank/__init__.py imports Bio.ParserSupport but uses very little of it. Therefore I would like to suggest to remove this dependency on ParserSupport from Bio/GenBank/__init__.py. I copied the corresponding patch below. Any objections, anybody? 
Best, --Michiel diff --git a/Bio/GenBank/__init__.py b/Bio/GenBank/__init__.py index 43c10d4..df38abe 100644 --- a/Bio/GenBank/__init__.py +++ b/Bio/GenBank/__init__.py @@ -47,7 +47,6 @@ import re # other Biopython stuff from Bio import SeqFeature -from Bio.ParserSupport import AbstractConsumer from Bio import Entrez # other Bio.GenBank stuff @@ -389,7 +388,7 @@ class RecordParser(object): self._scanner.feed(handle, self._consumer) return self._consumer.data -class _BaseGenBankConsumer(AbstractConsumer): +class _BaseGenBankConsumer(object): """Abstract GenBank consumer providing useful general functions. This just helps to eliminate some duplication in things that most @@ -404,6 +403,12 @@ class _BaseGenBankConsumer(AbstractConsumer): def __init__(self): pass + def _unhandled(self, data): + pass + + def __getattr__(self, attr): + return self._unhandled + def _split_keywords(self, keyword_string): """Split a string of keywords into a nice clean list. """ From p.j.a.cock at googlemail.com Mon Sep 5 10:04:27 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Sep 2011 11:04:27 +0100 Subject: [Biopython-dev] Bio.GenBank In-Reply-To: <1315116553.25037.YahooMailClassic@web161210.mail.bf1.yahoo.com> References: <1315116553.25037.YahooMailClassic@web161210.mail.bf1.yahoo.com> Message-ID: On Sun, Sep 4, 2011 at 7:09 AM, Michiel de Hoon wrote: > Dear all, > > Currently, Bio/GenBank/__init__.py imports Bio.ParserSupport > but uses very little of it. Therefore I would like to suggest to > remove this dependency on ParserSupport from > Bio/GenBank/__init__.py. I copied the corresponding patch below. > Any objections, anybody? Hi Michiel, I'd have to dig into the code to understand the patch, but I presume there is a follow up question coming - can we then deprecate Bio.ParserSupport since right now only the GenBank and "pending deprecation" plain text BLAST parsers use it (plus Compass which you recently fixed)? 
Peter From mjldehoon at yahoo.com Mon Sep 5 11:08:43 2011 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 5 Sep 2011 04:08:43 -0700 (PDT) Subject: [Biopython-dev] Bio.GenBank In-Reply-To: Message-ID: <1315220923.37754.YahooMailClassic@web161211.mail.bf1.yahoo.com> Hi Peter, > I'd have to dig into the code to understand the patch, but > I presume there is a follow up question coming - can we > then deprecate Bio.ParserSupport since right now only the > GenBank and "pending deprecation" plain text BLAST > parsers use it (plus Compass which you recently fixed)? Yes. With this patch, the plain text BLAST parser is the last piece of code that uses Bio.ParserSupport. Best, --Michiel. --- On Mon, 9/5/11, Peter Cock wrote: > From: Peter Cock > Subject: Re: [Biopython-dev] Bio.GenBank > To: "Michiel de Hoon" > Cc: biopython-dev at biopython.org > Date: Monday, September 5, 2011, 6:04 AM > On Sun, Sep 4, 2011 at 7:09 AM, > Michiel de Hoon > wrote: > > Dear all, > > > > Currently, Bio/GenBank/__init__.py imports > Bio.ParserSupport > > but uses very little of it. Therefore I would like to > suggest to > > remove this dependency on ParserSupport from > > Bio/GenBank/__init__.py. I copied the corresponding > patch below. > > Any objections, anybody? > > Hi Michiel, > > I'd have to dig into the code to understand the patch, but > I presume there is a follow up question coming - can we > then deprecate Bio.ParserSupport since right now only the > GenBank and "pending deprecation" plain text BLAST > parsers use it (plus Compass which you recently fixed)? 
> > Peter > From p.j.a.cock at googlemail.com Wed Sep 7 12:58:51 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 7 Sep 2011 13:58:51 +0100 Subject: [Biopython-dev] Bio.GenBank In-Reply-To: <1315220923.37754.YahooMailClassic@web161211.mail.bf1.yahoo.com> References: <1315220923.37754.YahooMailClassic@web161211.mail.bf1.yahoo.com> Message-ID: On Mon, Sep 5, 2011 at 12:08 PM, Michiel de Hoon wrote: > Hi Peter, > >> I'd have to dig into the code to understand the patch, but >> I presume there is a follow up question coming - can we >> then deprecate Bio.ParserSupport since right now only the >> GenBank and "pending deprecation" plain text BLAST >> parsers use it (plus Compass which you recently fixed)? > > Yes. With this patch, the plain text BLAST parser is the last > piece of code that uses Bio.ParserSupport. I'm OK with modifying Bio.GenBank not to depend on Bio.ParserSupport, and if you want to adding an "obsolete" comment or more explicitly a PendingDeprecationWarning to Bio.ParserSupport seems sensible too. Peter From mjldehoon at yahoo.com Wed Sep 7 13:53:22 2011 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 7 Sep 2011 06:53:22 -0700 (PDT) Subject: [Biopython-dev] Bio.File Message-ID: <1315403602.80271.YahooMailClassic@web161203.mail.bf1.yahoo.com> Hi all, Bio.File makes three classes available: Bio.File.UndoHandle Bio.File.StringHandle (which simply points to StringIO.StringIO) Bio.File.SGMLStripper (which has a pending deprecation warning) Bio.File.StringHandle is currently used only in Bio.Blast.NCBIStandalone and Bio.ParserSupport, both of which now have a pending deprecation warning. Bio.File.UndoHandle is used in three modules that now have a pending deprecation warning (Bio.Blast.NCBIStandalone, Bio.ParserSupport, Bio.UniGene.UniGene), as well as in Bio.SCOP.__init__. I don't know why the UndoHandle is used in that module; the relevant code looks like this: def _open(cgi, params={}, get=1): ... 
    handle = urllib.urlopen(cgi, options)
    uhandle = File.UndoHandle(handle)
    return uhandle

If there is no pressing reason for using File.UndoHandle here and we can remove it, then we could add a PendingDeprecationWarning to Bio.File.

Best,
--Michiel.

From p.j.a.cock at googlemail.com Wed Sep 7 14:36:43 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 7 Sep 2011 15:36:43 +0100
Subject: [Biopython-dev] Bio.File
In-Reply-To: <1315403602.80271.YahooMailClassic@web161203.mail.bf1.yahoo.com>
References: <1315403602.80271.YahooMailClassic@web161203.mail.bf1.yahoo.com>
Message-ID:

On Wed, Sep 7, 2011 at 2:53 PM, Michiel de Hoon wrote:
> Hi all,
>
> Bio.File makes three classes available:
> Bio.File.UndoHandle
> Bio.File.StringHandle (which simply points to StringIO.StringIO)
> Bio.File.SGMLStripper (which has a pending deprecation warning)
>
> Bio.File.StringHandle is currently used only in
> Bio.Blast.NCBIStandalone and Bio.ParserSupport,
> both of which now have a pending deprecation warning.

We can just switch them to use StringIO directly, and immediately deprecate Bio.File.StringHandle.

We can probably deprecate SGMLStripper now as well (which means indirectly deprecating the bit of Bio.ParserSupport which uses it).

> Bio.File.UndoHandle is used in three modules that now have a
> pending deprecation warning (Bio.Blast.NCBIStandalone,
> Bio.ParserSupport, Bio.UniGene.UniGene), as well as in
> Bio.SCOP.__init__. I don't know why the UndoHandle is
> used in that module; the relevant code looks like this:
>
> def _open(cgi, params={}, get=1):
>     ...
>     handle = urllib.urlopen(cgi, options)
>     uhandle = File.UndoHandle(handle)
>     return uhandle
>
> If there is no pressing reason for using File.UndoHandle here
> and we can remove it, then we could add a
> PendingDeprecationWarning to Bio.File.

Unless there is something similar in the standard library, I think the UndoHandle is still useful.
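For readers unfamiliar with the idea, here is a minimal sketch of what an undo handle does: it wraps another handle and lets the caller push lines back so they can be read again. This is a simplified re-implementation for illustration, not the Bio.File source; only the `readline`, `saveline`, `peekline` and `read` behaviours are modelled, and the sample text is made up.

```python
import io


class UndoHandle(object):
    """Illustrative wrapper that lets a caller push lines back onto a handle."""

    def __init__(self, handle):
        self._handle = handle
        self._saved = []  # pushed-back lines; index 0 is returned first

    def readline(self):
        # Serve pushed-back lines before touching the wrapped handle.
        if self._saved:
            return self._saved.pop(0)
        return self._handle.readline()

    def saveline(self, line):
        """Push a line back so that the next readline() returns it."""
        if line:
            self._saved.insert(0, line)

    def peekline(self):
        """Return the next line without consuming it."""
        line = self.readline()
        self.saveline(line)
        return line

    def read(self):
        """Return any pushed-back lines followed by the rest of the handle."""
        saved = "".join(self._saved)
        self._saved = []
        return saved + self._handle.read()


# Peek at the first line of a (made-up) server reply, then hand the
# untouched handle on to whatever would normally parse it.
uhandle = UndoHandle(io.StringIO(u"Error: bad id\nmore text\n"))
first = uhandle.peekline()
if first.startswith("Error: "):
    print("server reported a problem: " + first.strip())
print(repr(uhandle.read()))
```

The point of the wrapper is that a sanity check can look at the start of a stream (such as a network reply, which cannot be rewound) without the downstream parser losing those lines.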
UndoHandle used to be used in Bio.Entrez for spotting error conditions, but now we trust the NCBI to set an HTTP return code:

https://github.com/biopython/biopython/commit/2c4d8b99fc1b2dffa726e7d9956d766f7013164d

I'm using the same trick in my TogoWS wrapper (something I'm hoping will be ready to include in the next Biopython, once the TogoWS team have fixed a couple of server side issues). If the server could be relied on to always give an HTTP error code this wouldn't be needed:

https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py

I imagine the use of an UndoHandle in SCOP search was to allow the user to make similar sanity checks.

Peter

From mjldehoon at yahoo.com Thu Sep 8 14:35:38 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 8 Sep 2011 07:35:38 -0700 (PDT)
Subject: [Biopython-dev] Bio.File
In-Reply-To:
Message-ID: <1315492538.36803.YahooMailClassic@web161206.mail.bf1.yahoo.com>

--- On Wed, 9/7/11, Peter Cock wrote:
> > Bio.File.StringHandle is currently used only in
> > Bio.Blast.NCBIStandalone and Bio.ParserSupport,
> > both of which now have a pending deprecation warning.
>
> We can just switch them to use StringIO directly, and immediately
> deprecate Bio.File.StringHandle.
>
> We can probably deprecate SGMLStripper now as well (which
> means indirectly deprecating the bit of Bio.ParserSupport
> which uses it).
>

OK, done.

--Michiel.

From mjldehoon at yahoo.com Thu Sep 8 14:49:09 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 8 Sep 2011 07:49:09 -0700 (PDT)
Subject: [Biopython-dev] Bio.File
In-Reply-To:
Message-ID: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>

--- On Wed, 9/7/11, Peter Cock wrote:
> UndoHandle used to be used in Bio.Entrez for spotting
> error conditions, but now we trust the NCBI to set an
> HTTP return code:
>
> https://github.com/biopython/biopython/commit/2c4d8b99fc1b2dffa726e7d9956d766f7013164d

No, we shouldn't rely on an HTTP return code.
The idea is that only the parser can know if the output returned by NCBI is valid, as in:

handle = Entrez.efetch(...something...)
try:
    record = Entrez.read(handle)
except Exception:
    # NCBI returned something invalid, or at least
    # something that we don't know how to parse

> If the server could be relied on to always give an
> HTTP error code this wouldn't be needed:
>
> https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py
>

I don't like this approach much, as it depends on exactly what the error message looks like, and misses any other problems, such as incomplete output. There will be a certain false positive rate, with return values that pass the checking of the first 10 lines but are still unusable. Even worse, the false positive rate can suddenly go up if the server maintainers decide to change anything in their error messages. This kind of checking should be done by the parser, which can tell you exactly if the data are valid, or if not, what is wrong with it.

Best,
--Michiel.

[copied from Bio/TogoWS/__init__.py]:

    # Wrap the handle inside an UndoHandle.
    uhandle = File.UndoHandle(handle)

    # Check for errors in the first 10 lines.
    # This is kind of ugly.
    lines = []
    for i in range(10):
        lines.append(uhandle.readline())
    for i in range(9, -1, -1):
        uhandle.saveline(lines[i])
    data = ''.join(lines)
    if data == '':
        #ValueError? This can occur with invalid formats or fields
        #e.g. http://togows.dbcls.jp/entry/pubmed/16381885.au
        #which is an invalid file format, I meant to try this
        #instead http://togows.dbcls.jp/entry/pubmed/16381885/au
        raise IOError("TogoWS replied with no data:\n%s" % url)
    if data == ' ':
        #I've seen this on things which should work, e.g.
        #http://togows.dbcls.jp/entry/genome/X52960.fasta
        raise IOError("TogoWS replied with just a single space:\n%s" % url)
    if data.startswith("Error: "):
        #TODO - Should this be a value error (in some cases?)
        raise IOError("TogoWS replied with an error message:\n\n%s\n\n%s" \
                      % (data, url))
    if "We're sorry, but something went wrong" in data:
        #ValueError? This can occur with invalid formats or fields
        raise IOError("TogoWS replied: We're sorry, but something went wrong:\n%s" \
                      % url)

From andrea at biocomp.unibo.it Thu Sep 8 14:47:15 2011
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 8 Sep 2011 16:47:15 +0200 (CEST)
Subject: [Biopython-dev] Biograpy 1.0 beta released
In-Reply-To:
References:
Message-ID: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it>

Hi,
one year ago we were talking about a library I was developing basically to draw seqrecord in a similar way to the BioPerl Bio::Graphics module. Today, I'm releasing the public beta version of that software, which is much more mature than one year ago. The library is called BioGraPy and is based on matplotlib for drawings and on biopython objects for input. Basically you can give BioGraPy a SeqRecord and it will draw it and save it in any of the matplotlib supported formats (including png, SVG and PDF). But you can also use it at a lower level, deciding exactly how and where to plot every feature, and also build very complex drawings. It comes with integrated help for web usage, such as clickable SVG and html maps. BioGraPy also supports continuous features such as a hydrophobicity plot and seqrecord per-letter annotations (if numerical).

All the code is documented with sphinx, and I'm also completing a comprehensive tutorial.

The source code and the documentation are available at:
http://apierleoni.github.com/BioGraPy/

BioGraPy is released under the LGPL license.

This is an open project, so anyone willing to contribute, test or simply suggest improvements is welcome.

You cannot plot circular drawings from Biograpy, but you have GenomeDiagram for that.

I hope (and think) this will be useful, significantly extending the biopython plotting capabilities.
Andrea

From p.j.a.cock at googlemail.com Thu Sep 8 15:25:17 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 8 Sep 2011 16:25:17 +0100
Subject: [Biopython-dev] Bio.File
In-Reply-To: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
References: <1315493349.29125.YahooMailClassic@web161216.mail.bf1.yahoo.com>
Message-ID:

On Thu, Sep 8, 2011 at 3:49 PM, Michiel de Hoon wrote:
>
> No, we shouldn't rely on an HTTP return code. The idea is that only
> the parser can know if the output returned by NCBI is valid, as in:
>
> handle = Entrez.efetch(...something...)
> try:
>     record = Entrez.read(handle)
> except Exception:
>     # NCBI returned something invalid, or at least
>     # something that we don't know how to parse

In theory, yes, but quite often parsers look for certain patterns and if you feed them something else they may just say "no data". For example, the GenBank parser ignores anything before the LOCUS line (in order to cope with the free text header in the large multi-record files on the NCBI FTP site). As a side effect, you can give it almost any plain text file and the parser won't raise an error - it will just say no GenBank records found.

>> If the server could be relied on to always give an
>> HTTP error code this wouldn't be needed:
>>
>> https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py
>>
>
> I don't like this approach much, as it depends on exactly
> what the error message looks like, and misses any other
> problems, such as incomplete output. There will be a
> certain false positive rate, with return values that pass
> the checking of the first 10 lines but are still unusable.

Yes, in theory the server should detect and handle errors nicely - but there are sometimes bugs in web services. Certainly from memory I have had HTTP return code 200 (OK) with invalid data from both the NCBI and TogoWS.
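The side effect described above is easy to reproduce with plain Python. The sketch below is not the Bio.GenBank scanner; it assumes, as a simplification, that a record is recognised only by a line starting with LOCUS. The function name `find_genbank_records` and its `strict` flag are hypothetical, and the sample text is a placeholder rather than a real record.

```python
import io


def find_genbank_records(handle, strict=False):
    """Return the LOCUS lines found in handle, skipping any free-text header.

    Hypothetical helper: with strict=False, arbitrary text silently yields
    an empty list; with strict=True, finding no LOCUS line anywhere raises
    ValueError instead.
    """
    locus_lines = [line.rstrip("\n") for line in handle
                   if line.startswith("LOCUS")]
    if strict and not locus_lines:
        raise ValueError("No GenBank records found in handle")
    return locus_lines


# Header text before the first record is ignored, as intended.
text = u"Free text release notes\n\nLOCUS       EXAMPLE1\n"
print(find_genbank_records(io.StringIO(text)))

# An error page or any other plain text gives no records and no error...
print(find_genbank_records(io.StringIO(u"no record here")))

# ...unless the caller opts in to the strict check.
try:
    find_genbank_records(io.StringIO(u"no record here"), strict=True)
except ValueError as err:
    print("strict mode: %s" % err)
```

A strict mode like this keeps the header-skipping behaviour while turning the silent "no data" case into an error, which is one way to reconcile lenient parsing with early failure on bad server replies.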
> Even worse, the false positive rate can suddenly go up > if the server maintainers decide to change anything in > their error messages. The checks are deliberately designed to avoid false positives - at the cost of missing some errors early. > This kind of checking should be > done by the parser, which can tell you exactly if the > data are valid, or if not, what is wrong with it. That isn't always possible, since so many bioinformatics file formats are so vague that validation is hard. I accept checking the first 10 lines for common errors specific to that webservice is inelegant, but it is practical. [Some of those TogoWS checks are probably superfluous right now, I'm still polishing the error handling - some of which will rely on TogoWS itself catching more conditions] Regards, Peter From p.j.a.cock at googlemail.com Thu Sep 8 15:44:53 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 8 Sep 2011 16:44:53 +0100 Subject: [Biopython-dev] Biograpy 1.0 beta released In-Reply-To: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it> References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it> Message-ID: On Thu, Sep 8, 2011 at 3:47 PM, Andrea Pierleoni wrote: > Hi, > one year ago we were talking about a library I was developing basically to > draw seqrecord in a similar way to the BioPerl Bio::Graphics module. > Today, I'm releasing the public beta version of that software ... > http://apierleoni.github.com/BioGraPy/ Are you doing anything with "join" features from GenBank files (or similar compound features)? This is something I'm thinking about changing in the Biopython SeqFeature objects - having a single SeqFeature with a compound location, rather than as now having a parent SeqFeature with child SeqFeatures for the sub parts (which does not make sense with things like GFF3 where there are real parent/child relationships between features). > > BioGraPy is released under the LGPL license. 
> I'm curious about the license choice - LGPL prevents Biopython adopting it for example. Peter From andrea at biocomp.unibo.it Thu Sep 8 16:11:15 2011 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Thu, 8 Sep 2011 18:11:15 +0200 (CEST) Subject: [Biopython-dev] Biograpy 1.0 beta released In-Reply-To: References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it> Message-ID: <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it> > On Thu, Sep 8, 2011 at 3:47 PM, Andrea Pierleoni > wrote: >> Hi, >> one year ago we were talking about a library I was developing basically >> to >> draw seqrecord in a similar way to the BioPerl Bio::Graphics module. >> Today, I'm releasing the public beta version of that software ... >> http://apierleoni.github.com/BioGraPy/ > > Are you doing anything with "join" features from GenBank files (or > similar compound features)? This is something I'm thinking about > changing in the Biopython SeqFeature objects - having a single > SeqFeature with a compound location, rather than as now having > a parent SeqFeature with child SeqFeatures for the sub parts > (which does not make sense with things like GFF3 where there > are real parent/child relationships between features). > Yes, I'm using 'join' features, there is a specific "graphic feature" for features with 'join'. I think it can be easily changed accordingly. Actually I'm also guessing a hierarchy when plotting directly a gene seqrecord/seqfeature with attached joined subfeatures. Being able to trace parent/child relationships would be a big improvement, and not just for this library of course. >> >> BioGraPy is released under the LGPL license. >> > > I'm curious about the license choice - LGPL prevents Biopython > adopting it for example. > Then I think it's time to change the license :) Why is it preventing biopython to adopt it? Which one do you suggest? 
I could also use the biopython license, I don't need a strict control on the code, I just want the library to be used by everybody willing to, even closed source programs. Andrea From p.j.a.cock at googlemail.com Thu Sep 8 17:08:50 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 8 Sep 2011 18:08:50 +0100 Subject: [Biopython-dev] Biograpy 1.0 beta released In-Reply-To: <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it> References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it> <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it> Message-ID: On Thursday, September 8, 2011, Andrea Pierleoni wrote: >> On Thu, Sep 8, 2011 at 3:47 PM, Andrea Pierleoni >> wrote: >>> Hi, >>> one year ago we were talking about a library I was developing basically >>> to draw seqrecord in a similar way to the BioPerl Bio::Graphics module. >>> Today, I'm releasing the public beta version of that software ... >>> http://apierleoni.github.com/BioGraPy/ >> >> Are you doing anything with "join" features from GenBank files (or >> similar compound features)? This is something I'm thinking about >> changing in the Biopython SeqFeature objects - having a single >> SeqFeature with a compound location, rather than as now having >> a parent SeqFeature with child SeqFeatures for the sub parts >> (which does not make sense with things like GFF3 where there >> are real parent/child relationships between features). >> > > Yes, I'm using 'join' features, there is a specific "graphic feature" > for features with 'join'. I think it can be easily changed accordingly. > Actually I'm also guessing a hierarchy when plotting directly a gene > seqrecord/seqfeature with attached joined subfeatures. > Being able to trace parent/child relationships would be a big > improvement, and not just for this library of course. I'll write more about this later, once my code gets a bit closer to being ready. >>> >>> BioGraPy is released under the LGPL license. 
>>>
>>
>> I'm curious about the license choice - LGPL prevents Biopython
>> adopting it for example.
>>
>
> Then I think it's time to change the license :)
> Why is it preventing biopython to adopt it?

Adopt in the sense of include into Biopython.

> Which one do you suggest?
> I could also use the biopython license, I don't need a strict control
> on the code, I just want the library to be used by everybody willing to,
> even closed source programs.
>

As I recall, Biopython, NumPy, SciPy etc all use a very liberal MIT/BSD type licence, while LGPL tends to scare commercial users ;)

Peter

From andrea at biocomp.unibo.it Thu Sep 8 19:02:50 2011
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 8 Sep 2011 21:02:50 +0200 (CEST)
Subject: [Biopython-dev] Biograpy 1.0 beta released
In-Reply-To:
References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it> <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it>
Message-ID: <302333674995860f369795196ea1df10.squirrel@lipid.biocomp.unibo.it>

>> Yes, I'm using 'join' features, there is a specific "graphic feature"
>> for features with 'join'. I think it can be easily changed accordingly.
>> Actually I'm also guessing a hierarchy when plotting directly a gene
>> seqrecord/seqfeature with attached joined subfeatures.
>> Being able to trace parent/child relationships would be a big
>> improvement, and not just for this library of course.
>
> I'll write more about this later, once my code gets a bit
> closer to being ready.
>

ok, let me know.

>>>>
>>>> BioGraPy is released under the LGPL license.
>>>>
>>>
>>> I'm curious about the license choice - LGPL prevents Biopython
>>> adopting it for example.
>>>
>>
>> Then I think it's time to change the license :)
>> Why is it preventing biopython to adopt it?
>
> Adopt in the sense of include into Biopython.
>

well if you think it is worth it, biograpy can of course be included in biopython. the good thing is that it is all sphinx documented, so if biopython is moving to sphinx too, this part is ready. Biograpy requires matplotlib (and thus of course numpy), but could be just an optional installation for those who want to use this graphic package, as reportlab is for GenomeDiagram.

Also, now that there is a drawing library it should be easy to complete the DAS client, and have something very similar to DASTY that given a protein id is able to fetch all the das annotation and even draw them with an html4 (image maps) or html5 (svg) friendly result.

>> Which one do you suggest?
>> I could also use the biopython license, I don't need a strict control
>> on the code, I just want the library to be used by everybody willing to,
>> even closed source programs.
>>
>
> As I recall, Biopython, NumPy, SciPy etc all use a very
> liberal MIT/BSD type licence, while LGPL tends to
> scare commercial users ;)
>
>

it's funny, since I chose the LGPL license not to scare commercial users :)
can you send me a link to the license so that I can include it in biograpy?

thanks

Andrea

From mjldehoon at yahoo.com Sun Sep 11 03:22:15 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 10 Sep 2011 20:22:15 -0700 (PDT)
Subject: [Biopython-dev] Bio.GenBank (was: Bio.File)
In-Reply-To:
Message-ID: <1315711335.33798.YahooMailClassic@web161215.mail.bf1.yahoo.com>

Hi all,

There are several issues here.
Let's talk about Bio.GenBank first.

I think it's OK to have a module Bio.GenBank in addition to Bio.SeqIO, but it's a bit unclear to me which code in Bio.GenBank is still relevant and which (if any) can potentially be deprecated.

Also we'd need some documentation for Bio.GenBank. In particular it's not clear to me which classes in Bio.GenBank are intended to be used by users. The description at the top of Bio.GenBank says that only Bio.GenBank.RecordParser should be used directly.
However, in the test code in Bio.Graphics.GenomeDiagram (after "if __name__=='__main__':") Bio.GenBank.FeatureParser is used. Should that be replaced by Bio.SeqIO then?

Also I think that the RecordParser should raise an Exception if it cannot find a record when parsing. Compare the following:

>>> from Bio import SeqIO
>>> from StringIO import StringIO
>>> handle = StringIO("no record here")
>>> SeqIO.read(handle, 'fasta')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 617, in read
    raise ValueError("No records found in handle")
ValueError: No records found in handle

>>> from Bio import GenBank
>>> parser = GenBank.RecordParser()
>>> handle = StringIO("no record here")
>>> parser.parse(handle)
>>> # no error raised

This still lets us ignore header text before the actual start of a GenBank record; the error should only be raised if no GenBank record can be found anywhere.

Best,
--Michiel.

--- On Thu, 9/8/11, Peter Cock wrote:

> From: Peter Cock
> Subject: Re: [Biopython-dev] Bio.File
> To: "Michiel de Hoon"
> Cc: biopython-dev at biopython.org
> Date: Thursday, September 8, 2011, 11:25 AM
> On Thu, Sep 8, 2011 at 3:49 PM, Michiel de Hoon wrote:
> >
> > No, we shouldn't rely on an HTTP return code. The idea is that only
> > the parser can know if the output returned by NCBI is valid, as in:
> >
> > handle = Entrez.efetch(...something...)
> > try:
> >     record = Entrez.read(handle)
> > except Exception:
> >     # NCBI returned something invalid, or at least
> >     # something that we don't know how to parse
>
> In theory, yes, but quite often parsers look for certain
> patterns and if you feed them something else they may
> just say "no data". For example, the GenBank parser
> ignores anything before the LOCUS line (in order to
> cope with the free text header in the large multi-record
> files on the NCBI FTP site).
As a side effect, you can > give it almost any plain text file and the parser won't > raise an error - it will just say no GenBank records > found. > > >> If the server could be relied on to always give > an > >> HTTP error code this wouldn't be needed: > >> > >> https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py > >> > > > > I don't like this approach much, as it depends on > exactly > > what the error message looks like, and misses any > other > > problems, such as incomplete output. There will be a > > certain false positive rate, with return values that > pass > > the checking of the first 10 lines but are still > unusable. > > Yes, in theory the server should detect and handle > errors nicely - but there are sometimes bugs in web- > services. Certainly from memory I have had HTTP > return code 200 (OK) with invalid data from both the > NCBI and TogoWS. > > > Even worse, the false positive rate can suddenly go > up > > if the server maintainers decide to change anything > in > > their error messages. > > The checks are deliberately designed to avoid false > positives - at the cost of missing some errors early. > > > This kind of checking should be > > done by the parser, which can tell you exactly if the > > data are valid, or if not, what is wrong with it. > > That isn't always possible, since so many bioinformatics > file formats are so vague that validation is hard. > > I accept checking the first 10 lines for common errors > specific to that webservice is inelegant, but it is > practical. 
> > [Some of those TogoWS checks are probably superfluous > right now, I'm still polishing the error handling - some > of > which will rely on TogoWS itself catching more conditions] > > Regards, > > Peter > From p.j.a.cock at googlemail.com Sun Sep 11 14:06:13 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 11 Sep 2011 15:06:13 +0100 Subject: [Biopython-dev] Bio.GenBank (was: Bio.File) In-Reply-To: <1315711335.33798.YahooMailClassic@web161215.mail.bf1.yahoo.com> References: <1315711335.33798.YahooMailClassic@web161215.mail.bf1.yahoo.com> Message-ID: On Sun, Sep 11, 2011 at 4:22 AM, Michiel de Hoon wrote: > Hi all, > > There are several issues here. > Let's talk about Bio.GenBank first. > > I think it's OK to have a module Bio.GenBank in addition > to Bio.SeqIO, but it's a bit unclear to me which code in > Bio.GenBank is still relevant and which (if any) can > potentially be deprecated. Bio.GenBank uses a scanner/consumer to offer two object models for GenBank/EMBL files. First, SeqRecord objects which is wrapped by Bio.SeqIO. Second, a more faithful GenBank record object which also supports non-sequence based GenBank whole genome shotgun master records. These are GenBank files that summarize the content of a project, and provide lists of scaffold and contig files in the project. I have never used this - Iddo has though. So currently none of Bio.GenBank can really be deprecated. If we don't care about WGS records, then perhaps the RecordParser could be deprecated and later with some refactoring Bio.SeqIO could parse things directly. That would be my long term ideal. Maybe we can represent the WGS records as SeqRecord objects without a sequence, but I don't like that idea really. Such files are NOT sequence files at all. > > Also we'd need some documentation for Bio.GenBank. > In general it would be a good idea to have a worked example parsing a (small) GenBank file and showing where in the SeqRecord each bit of annotation goes. 
Doing this as a doctest (embedded in the Tutorial perhaps) would keep the documentation up to date (any changes should show up as a unit test failure).

> In particular it's not clear to me which classes in
> Bio.GenBank are intended to be used by users.
> The description at the top of Bio.GenBank says
> that only Bio.GenBank.RecordParser should be
> used directly.

What it says is "Currently the ONLY reason to use Bio.GenBank directly is for the RecordParser which turns a GenBank file into GenBank-specific Record objects.", by which I mean if you want SeqRecord objects, use Bio.SeqIO instead (which will call Bio.GenBank.FeatureParser internally), since that is our standard API for parsing as SeqRecords.

> However, in the test code in
> Bio.Graphics.GenomeDiagram (after
> "if __name__=='__main__':") Bio.GenBank.FeatureParser
> is used. Should that be replaced by Bio.SeqIO then?

Yes. If the code is needed at all...

> Also I think that the RecordParser should
> raise an Exception if it cannot find a record
> when parsing.

I disagree (or at least, when exposed via Bio.SeqIO I disagree).

> Compare the following:
>
>>>> from Bio import SeqIO
>>>> from StringIO import StringIO
>>>> handle = StringIO("no record here")
>>>> SeqIO.read(handle, 'fasta')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 617, in read
>     raise ValueError("No records found in handle")
> ValueError: No records found in handle

That's fine - the read function says it will raise an exception if there is not exactly one record. Perhaps you meant to use parse here as in the following example? If you do, you get no records and no exception.
>>>> from Bio import GenBank >>>> parser = GenBank.RecordParser() >>>> handle = StringIO("no record here") >>>> parser.parse(handle) >>>> # no error raised > > This still lets us ignore header text before > the actual start of a GenBank record; the > error should only be raised if no GenBank > record can be found anywhere. > If you used Bio.SeqIO.read(...) with GenBank format on an empty file you'd also get an exception. I explicitly test the SeqIO parsers to check they handle an empty file gracefully - and for simple sequential formats like FASTA and GenBank that means returns no records. This is an important special case, and it should be handled this way for generic pipelines. I often have empty FASTA files. Peter From p.j.a.cock at googlemail.com Sun Sep 11 14:12:19 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 11 Sep 2011 15:12:19 +0100 Subject: [Biopython-dev] Biograpy 1.0 beta released In-Reply-To: <302333674995860f369795196ea1df10.squirrel@lipid.biocomp.unibo.it> References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it> <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it> <302333674995860f369795196ea1df10.squirrel@lipid.biocomp.unibo.it> Message-ID: On Thu, Sep 8, 2011 at 8:02 PM, Andrea Pierleoni wrote: > >> >> Adopt in the sense of include into Biopython. >> > > well if you think it is worth it, biograpy can of course > be included in biopython. > > ... > >>> Which one do you suggest? >>> I could also use the biopython license, I don't need a strict control >>> on the code, I just want the library to be used by everybody willing to, >>> even closed source programs. 
>>>
>>
>> As I recall, Biopython, NumPy, SciPy etc all use a very
>> Liberal MIT/BSD type licence, while LGPL tends to
>> scare commercial users ;)
>
> it's funny, since I chose the LGPL license not to scare
> commercial users :)

Well, it's better than the GPL from that point of view ;)

> can you send me a link to the license so that I can include it in biograpy?
> thanks

The Biopython licence is just:
http://www.biopython.org/DIST/LICENSE

If in the medium/long term you'd like to consider incorporating this into Biopython, then my recommendation is either use a compatible licence now, or ensure you get copyright assignment for all code contributions so that you can change the license later.

My worry is if you use LGPL and take third party author contributions, then later wanted to change the license you'd need to contact all those 3rd party authors to get their permission.

Regards,

Peter

From p.j.a.cock at googlemail.com Sun Sep 11 22:01:36 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 11 Sep 2011 23:01:36 +0100
Subject: [Biopython-dev] Bio.GenBank (was: Bio.File)
In-Reply-To:
References: <1315711335.33798.YahooMailClassic@web161215.mail.bf1.yahoo.com>
Message-ID:

On Sun, Sep 11, 2011 at 3:06 PM, Peter Cock wrote:
> On Sun, Sep 11, 2011 at 4:22 AM, Michiel de Hoon wrote:
>> However, in the test code in
>> Bio.Graphics.GenomeDiagram (after
>> "if __name__=='__main__':") Bio.GenBank.FeatureParser
>> is used. Should that be replaced by Bio.SeqIO then?
>
> Yes. If the code is needed at all...
>

Updated, but those two mini-tests are probably superfluous and if not should be merged into the unit tests.

Peter

From p.j.a.cock at googlemail.com Mon Sep 12 09:07:49 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 12 Sep 2011 10:07:49 +0100
Subject: [Biopython-dev] Calculating bootstrap values from a set of trees
Message-ID:

Hi Eric,

I'm wondering if there is any code in Bio.Phylo for calculating bootstrap values from a set of trees?

e.g.
I have a master tree created from an alignment, and 1000 bootstrap trees (created from 1000 re-sampled alignments). I want to annotate each branch with the number/percentage of times it is found in the 1000 bootstrap trees.

I once implemented this in Python using binary strings to represent each branch as a split or partition of the nodes into two groups. I'm not sure where I put this script... but it pre-dated Bio.Phylo anyway.

Alternatively, which standalone tool would you recommend for this?

Thanks,

Peter

From mjldehoon at yahoo.com Mon Sep 12 12:49:35 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 12 Sep 2011 05:49:35 -0700 (PDT)
Subject: [Biopython-dev] Bio.GenBank (was: Bio.File)
In-Reply-To:
Message-ID: <1315831775.55207.YahooMailClassic@web161206.mail.bf1.yahoo.com>

--- On Sun, 9/11/11, Peter Cock wrote:
> So currently none of Bio.GenBank can really be
> deprecated.

OK.

> Maybe we can represent the WGS records as
> SeqRecord objects without a sequence, but I
> don't like that idea really. Such files are NOT
> sequence files at all.

I agree.

> >
> > Also we'd need some documentation for Bio.GenBank.
> >
>
> In general it would be a good idea to have a
> worked example parsing a (small) GenBank
> file and showing where in the SeqRecord
> each bit of annotation goes.

That would be good, but we also need some documentation for Bio.GenBank itself, to clarify how Bio.GenBank is meant to be used by users (and also to clarify that Bio.SeqIO produces SeqRecords, and Bio.GenBank its own GenBank-specific records).

> > Also I think that the RecordParser should
> > raise an Exception if it cannot find a record
> > when parsing.
>
> I disagree (or at least, when exposed via
> Bio.SeqIO I disagree).

After reading your comments, I realized that my mail was confusing. I think we actually agree.
This is what I meant to say:

Compare the following:

>>> from Bio import SeqIO
>>> from StringIO import StringIO
>>> handle = StringIO("no record here")
>>> SeqIO.read(handle, 'genbank')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 617, in read
    raise ValueError("No records found in handle")
ValueError: No records found in handle

That's fine - the read function says it will raise an exception if there is not exactly one record. With SeqIO.parse, we don't get an Exception:

>>> handle = StringIO("no record here")
>>> records = SeqIO.parse(handle, 'genbank')
>>> for record in records: print record.id
...
>>>

This is also OK. SeqIO.parse expects zero, one, or multiple records.

Now for Bio.GenBank:

>>> from Bio import GenBank
>>> parser = GenBank.RecordParser()
>>> handle = StringIO("no record here")
>>> parser.parse(handle)
>>> # no error raised

This I think is not OK. GenBank.RecordParser().parse expects one record; it should raise an Exception if it does not find one. Likewise, the parser does not raise an Exception if there are multiple records in the handle.

And for Bio.GenBank.Iterator:

>>> from Bio.GenBank import Iterator
>>> from Bio.GenBank import RecordParser
>>> from StringIO import StringIO
>>> handle = StringIO("no record here")
>>> parser = RecordParser()
>>> records = Iterator(handle, parser)
>>> for record in records: print record.locus
...
>>>

which is the same behavior as for Bio.SeqIO.parse, which I think is OK.

Assuming that the RecordParser and the Iterator are the only two classes that are intended for the end-user, it's probably better to add a Bio.GenBank.read and a Bio.GenBank.parse function to be consistent with the other Biopython modules.

Sorry for the confusion!
--Michiel.
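
[Editor's sketch] The read/parse contract discussed in this message (parse tolerates zero, one, or many records; read insists on exactly one) can be illustrated without Biopython. The helper name read_exactly_one is hypothetical and this is not the implementation that was committed to Bio.GenBank; it only assumes that "parse" yields an iterable of records.

```python
def read_exactly_one(records):
    """Return the only record in an iterable, or raise ValueError.

    Hypothetical helper: 'records' stands in for whatever a parse()
    style iterator would yield.
    """
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        # Zero records: fine for parse(), an error for read().
        raise ValueError("No records found in handle") from None
    try:
        next(iterator)
    except StopIteration:
        # Exactly one record: the success case.
        return first
    raise ValueError("More than one record found in handle")
```

Layering the strict function on top of the lenient iterator keeps empty input harmless for pipelines that loop over records, while callers who expect a single record get an error in both the "none" and the "too many" case.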
From eric.talevich at gmail.com Mon Sep 12 13:14:15 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 12 Sep 2011 09:14:15 -0400 Subject: [Biopython-dev] Calculating bootstrap values from a set of trees In-Reply-To: References: Message-ID: Hi Peter, On Mon, Sep 12, 2011 at 5:07 AM, Peter Cock wrote: > Hi Eric, > > I'm wondering if there is any code in Bio.Phylo for calculating > bootstrap values from a set of trees? > > e.g. I have a master tree created from an alignment, and 1000 > bootstrap trees (created from 1000 re-sampled alignments). I want to > annotate each branch with the number/percentage of times is it found > in the 1000 bootsrap trees. > I haven't implemented this in Bio.Phylo yet, unfortunately. The closest thing is Bio.Nexus.Trees.consensus. It would be worthwhile to port this to Bio.Phylo eventually. I once implemented this in python using binary strings to represent > each branch as a split or partition of the nodes into two groups. I'm > not sure where I put this script... but it pre-dated Bio.Phylo anyway. > > Alternatively, which standalone tool would you recommend for this? > > I think Phylip's seqboot and consense will do the trick. Normally I let RAxML do this sort of thing for me. Cheers, Eric From p.j.a.cock at googlemail.com Mon Sep 12 13:29:19 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 12 Sep 2011 14:29:19 +0100 Subject: [Biopython-dev] Bio.GenBank (was: Bio.File) In-Reply-To: <1315831775.55207.YahooMailClassic@web161206.mail.bf1.yahoo.com> References: <1315831775.55207.YahooMailClassic@web161206.mail.bf1.yahoo.com> Message-ID: On Mon, Sep 12, 2011 at 1:49 PM, Michiel de Hoon wrote: > >>>> from Bio import GenBank >>>> parser = GenBank.RecordParser() >>>> handle = StringIO("no record here") >>>> parser.parse(handle) >>>> # no error raised > > This I think is not OK. GenBank.RecordParser().parse expects one > record; it should raise an Exception if it does not one. 
Likewise, the > parser does not raise an Exception if there are multiple records in > the handle. > > and for Bio.GenBank.Iterator: > >>>> from Bio.GenBank import Iterator >>>> from Bio.GenBank import RecordParser >>>> from StringIO import StringIO >>>> handle = StringIO("no record here") >>>> parser = RecordParser() >>>> records = Iterator(handle, parser) >>>> for record in records: print record.locus > ... >>>> > > which is the same behavior as for Bio.SeqIO.parse, which I think is OK. OK, yes - I see what you mean now. > Assuming that the RecordParser and the Iterator are the only > two classes that are intended for the end-user, it's probably > better to add a Bio.GenBank.read and a Bio.GenBank.parse > function to be consistent with the other Biopython modules. Good plan - and then we can discourage direct use of the rest of Bio.GenBank (i.e. RecordParser, Iterator etc). How's this? https://github.com/biopython/biopython/commit/ff72037efbae2da6eb6db550aa6d02b883ec1345 > Sorry for the confusion! > No problem. Peter From p.j.a.cock at googlemail.com Mon Sep 12 13:40:46 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 12 Sep 2011 14:40:46 +0100 Subject: [Biopython-dev] Calculating bootstrap values from a set of trees In-Reply-To: References:

Message-ID: On Mon, Sep 12, 2011 at 2:14 PM, Eric Talevich wrote: > Hi Peter, > > On Mon, Sep 12, 2011 at 5:07 AM, Peter Cock > wrote: >> >> Hi Eric, >> >> I'm wondering if there is any code in Bio.Phylo for calculating >> bootstrap values from a set of trees? >> >> e.g. I have a master tree created from an alignment, and 1000 >> bootstrap trees (created from 1000 re-sampled alignments). I want to >> annotate each branch with the number/percentage of times is it found >> in the 1000 bootsrap trees. > > I haven't implemented this in Bio.Phylo yet, unfortunately. The closest > thing is Bio.Nexus.Trees.consensus. It would be worthwhile to port this to > Bio.Phylo eventually. > > >> I once implemented this in python using binary strings to represent >> each branch as a split or partition of the nodes into two groups. I'm >> not sure where I put this script... but it pre-dated Bio.Phylo anyway. >> >> Alternatively, which standalone tool would you recommend for this? >> > > I think Phylip's seqboot and consense will do the trick. http://evolution.genetics.washington.edu/phylip/doc/consense.html My understanding was Phylip's consense takes a set of trees and finds a consensus - there is no obvious way to tell it you want to use a particular pre-determined tree. > > Normally I let RAxML do this sort of thing for me. > I'm unclear if RAxML will accept some 3rd party master tree (via -t) and a set of bootstrapped trees (via -z) without also wanting the original alignment and a choice of model... My reason for wanting to decouple bootstrapping the trees and applying the bootstraps to the master tree is for splitting large jobs across a cluster. Each cluster node can generate bootstrap trees independently of the other cluster nodes (no network IO or synchronisation needed). These trees are then collated (concatenated into a big multiple entry tree file), with the final step combining the bootstrapped trees onto the master tree to assess support being comparatively quick. 
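
[Editor's sketch] The final collation step described above (counting how often each branch of a fixed master tree appears among independently generated bootstrap trees) might look roughly like the following. It is not Bio.Phylo code. Trees are assumed to be nested tuples of taxon names, every tree is assumed to share the same taxa, and each branch is represented as a split normalised to the side that excludes the alphabetically first taxon, so rooting does not matter. The function names are hypothetical.

```python
from collections import Counter


def splits(tree):
    """Return the set of non-trivial splits (bipartitions) of a tree.

    'tree' is a nested tuple whose leaves are taxon-name strings.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_taxa = leaves(tree)
    anchor = min(all_taxa)  # reference taxon used to normalise each split

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Keep the side of the branch that does not contain the anchor.
        side = all_taxa - below if anchor in below else below
        # Skip trivial splits (one taxon versus the rest) and the root.
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def bootstrap_support(master, replicates):
    """Count, for each split of the master tree, the replicates containing it."""
    counts = Counter()
    for replicate in replicates:
        counts.update(splits(replicate))
    return {split: counts[split] for split in splits(master)}


master = (("A", "B"), ("C", ("D", "E")))
replicates = [
    (("A", "B"), ("C", ("D", "E"))),
    (("A", "B"), ("D", ("C", "E"))),
    (("A", "C"), ("B", ("D", "E"))),
]
support = bootstrap_support(master, replicates)
```

Because each replicate only contributes a set of splits, the counts from separate cluster nodes could also be summed before the master tree is consulted.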
Peter From cy at cymon.org Mon Sep 12 15:13:42 2011 From: cy at cymon.org (Cymon Cox) Date: Mon, 12 Sep 2011 16:13:42 +0100 Subject: [Biopython-dev] Calculating bootstrap values from a set of trees In-Reply-To: References:

Message-ID: Peter, I don't know of any stand alone software to automate the annotate of nodes of a target tree with labels - I'm assuming you want to add labels (in this case ML bootstrap support values) to a Newick tree description (eg an ML optimal tree). Most wouldn't do this, but manually label the tree in a graphics software when preparing the figure for publication. If you want support values for all nodes in your master/target tree, you could loop over all the clades in your tree and use dendropy to help calculate the bootstrap values for you bootstrap trees. Cheers, Cymon On 12 September 2011 14:40, Peter Cock wrote: > On Mon, Sep 12, 2011 at 2:14 PM, Eric Talevich > wrote: > > Hi Peter, > > > > On Mon, Sep 12, 2011 at 5:07 AM, Peter Cock > > wrote: > >> > >> Hi Eric, > >> > >> I'm wondering if there is any code in Bio.Phylo for calculating > >> bootstrap values from a set of trees? > >> > >> e.g. I have a master tree created from an alignment, and 1000 > >> bootstrap trees (created from 1000 re-sampled alignments). I want to > >> annotate each branch with the number/percentage of times is it found > >> in the 1000 bootsrap trees. > > > > I haven't implemented this in Bio.Phylo yet, unfortunately. The closest > > thing is Bio.Nexus.Trees.consensus. It would be worthwhile to port this > to > > Bio.Phylo eventually. > > > > > >> I once implemented this in python using binary strings to represent > >> each branch as a split or partition of the nodes into two groups. I'm > >> not sure where I put this script... but it pre-dated Bio.Phylo anyway. > >> > >> Alternatively, which standalone tool would you recommend for this? > >> > > > > I think Phylip's seqboot and consense will do the trick. > > http://evolution.genetics.washington.edu/phylip/doc/consense.html > > My understanding was Phylip's consense takes a set of trees > and finds a consensus - there is no obvious way to tell it you > want to use a particular pre-determined tree. 
> > > > > Normally I let RAxML do this sort of thing for me. > > > > I'm unclear if RAxML will accept some 3rd party master tree > (via -t) and a set of bootstrapped trees (via -z) without also > wanting the original alignment and a choice of model... > > My reason for wanting to decouple bootstrapping the trees > and applying the bootstraps to the master tree is for splitting > large jobs across a cluster. Each cluster node can generate > bootstrap trees independently of the other cluster nodes > (no network IO or synchronisation needed). These trees are > then collated (concatenated into a big multiple entry tree > file), with the final step combining the bootstrapped trees > onto the master tree to assess support being comparatively > quick. > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- ____________________________________________________________________ Cymon J. Cox From p.j.a.cock at googlemail.com Mon Sep 12 15:18:32 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 12 Sep 2011 16:18:32 +0100 Subject: [Biopython-dev] Calculating bootstrap values from a set of trees In-Reply-To: References:

Message-ID: On Mon, Sep 12, 2011 at 4:13 PM, Cymon Cox wrote: > Peter, > > I don't know of any stand alone software to automate the annotate of nodes > of a target tree with labels - I'm assuming you want to add labels (in this > case ML bootstrap support values) to a Newick tree description (eg an ML > optimal tree). Yes, or NJ bootstraps, or whatever. > Most wouldn't do this, but manually label the tree in a > graphics software when preparing the figure for publication. Huh. I guess it depends on the size of tree ;) > If you want support values for all nodes in your master/target tree, you > could loop over all the clades in your tree and use dendropy to help > calculate the bootstrap values for you bootstrap trees. > > Cheers, Cymon Thanks - looks like I'm not overlooking some really obvious tool to do this then. Peter From cy at cymon.org Mon Sep 12 15:25:44 2011 From: cy at cymon.org (Cymon Cox) Date: Mon, 12 Sep 2011 16:25:44 +0100 Subject: [Biopython-dev] Calculating bootstrap values from a set of trees In-Reply-To: References:

Message-ID: Peter, On 12 September 2011 16:18, Peter Cock wrote: > On Mon, Sep 12, 2011 at 4:13 PM, Cymon Cox wrote: > > Peter, > > > > I don't know of any stand alone software to automate the annotate of > nodes > > of a target tree with labels - I'm assuming you want to add labels (in > this > > case ML bootstrap support values) to a Newick tree description (eg an ML > > optimal tree). > > Yes, or NJ bootstraps, or whatever. > > > Most wouldn't do this, but manually label the tree in a > > graphics software when preparing the figure for publication. > > Huh. I guess it depends on the size of tree ;) > Well, yes. One of mine had >600 taxa - I didnt do it manually ;) > > If you want support values for all nodes in your master/target tree, you > > could loop over all the clades in your tree and use dendropy to help > > calculate the bootstrap values for you bootstrap trees. > > > > Cheers, Cymon > > Thanks - looks like I'm not overlooking some really obvious tool > to do this then. > Nothing obvious - but I have a vague recollection that Ive seen this as an option in a tree graphics programme before - for the life of me I cant remember which though! If I comes to me I'll let you know ;) C. -- ____________________________________________________________________ Cymon J. Cox From andrea at biocomp.unibo.it Mon Sep 12 15:34:55 2011 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Mon, 12 Sep 2011 17:34:55 +0200 (CEST) Subject: [Biopython-dev] Biograpy 1.0 beta released In-Reply-To: References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it> <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it> <302333674995860f369795196ea1df10.squirrel@lipid.biocomp.unibo.it> Message-ID: <7fcb72ae2b632b6f9923b889513ab9ea.squirrel@lipid.biocomp.unibo.it> > >> can you send me a link to the license so that I can include it in >> biograpy? 
>> thanks > > The Biopython licence is just: > http://www.biopython.org/DIST/LICENSE Yes, I saw that license, but I didn't find any reference to MIT or anything else, so I was not sure this was the right one... > > If in the medium/long term you'd like to consider incorporating > this into Biopython, then my recommendation is either use a > compatible licence now, or ensure you get copyright assignment > for all code contributions so that you can change the license later. > > My worry is if you use LGPL and take third party author > contributions, then later wanted to change the license you'd > need to contact all those 3rd party authors to get their > permission. > Well we can easily change the license to the BioPython one. This is intended to be a free library. the more people can use it, the better, even for commercial purposes. BioGraPy can of course be incorporated in BioPython for commodity, and/or be shipped as a separate package. Personally I'd prefer to ship also with BioPython so we can be sure that the right versions are always packed together. Eg. If you are going to change subfeatures, than a compatible version of BioGraPy must be used. From p.j.a.cock at googlemail.com Mon Sep 12 15:50:39 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 12 Sep 2011 16:50:39 +0100 Subject: [Biopython-dev] Biograpy 1.0 beta released In-Reply-To: <7fcb72ae2b632b6f9923b889513ab9ea.squirrel@lipid.biocomp.unibo.it> References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it> <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it> <302333674995860f369795196ea1df10.squirrel@lipid.biocomp.unibo.it> <7fcb72ae2b632b6f9923b889513ab9ea.squirrel@lipid.biocomp.unibo.it> Message-ID: On Mon, Sep 12, 2011 at 4:34 PM, Andrea Pierleoni wrote: > >> >>> can you send me a link to the license so that I can include it in >>> biograpy? 
>>> thanks >> >> The Biopython licence is just: >> http://www.biopython.org/DIST/LICENSE > > Yes, I saw that license, but I didn't find any reference to MIT or > anything else, so I was not sure this was the right one... > >> >> If in the medium/long term you'd like to consider incorporating >> this into Biopython, then my recommendation is either use a >> compatible licence now, or ensure you get copyright assignment >> for all code contributions so that you can change the license later. >> >> My worry is if you use LGPL and take third party author >> contributions, then later wanted to change the license you'd >> need to contact all those 3rd party authors to get their >> permission. >> > > Well we can easily change the license to the BioPython one. This is > intended to be a free ?library. the more people can use it, the better, > even for commercial purposes. > BioGraPy can of course be incorporated in BioPython for commodity, > and/or be shipped as a separate package. That how GenomeDiagram started. > Personally I'd prefer to ship also with BioPython so we can be sure > that the right versions are always packed together. > Eg. If you are going to change subfeatures, than a compatible version > of BioGraPy must be used. Yeah - changing SeqFeature locations is a potential minefield, so I will want to try and make any transition as smooth as possible with a backwards compatibility hack. 
Peter From andrea at biocomp.unibo.it Mon Sep 12 16:04:59 2011 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Mon, 12 Sep 2011 18:04:59 +0200 (CEST) Subject: [Biopython-dev] Biograpy 1.0 beta released In-Reply-To: References: <374c3b3c0359b131ccfbd354e0c2cca6.squirrel@lipid.biocomp.unibo.it> <60d254c9e1990dc9f7946f24d66a7aea.squirrel@lipid.biocomp.unibo.it> <302333674995860f369795196ea1df10.squirrel@lipid.biocomp.unibo.it> <7fcb72ae2b632b6f9923b889513ab9ea.squirrel@lipid.biocomp.unibo.it> Message-ID: <0b787347db4866e30d0768fe306e7ca4.squirrel@lipid.biocomp.unibo.it> > On Mon, Sep 12, 2011 at 4:34 PM, Andrea Pierleoni > wrote: >> >>> >>>> can you send me a link to the license so that I can include it in >>>> biograpy? >>>> thanks >>> >>> The Biopython licence is just: >>> http://www.biopython.org/DIST/LICENSE >> >> Yes, I saw that license, but I didn't find any reference to MIT or >> anything else, so I was not sure this was the right one... >> >>> >>> If in the medium/long term you'd like to consider incorporating >>> this into Biopython, then my recommendation is either use a >>> compatible licence now, or ensure you get copyright assignment >>> for all code contributions so that you can change the license later. >>> >>> My worry is if you use LGPL and take third party author >>> contributions, then later wanted to change the license you'd >>> need to contact all those 3rd party authors to get their >>> permission. >>> >> >> Well we can easily change the license to the BioPython one. This is >> intended to be a free ?library. the more people can use it, the better, >> even for commercial purposes. >> BioGraPy can of course be incorporated in BioPython for commodity, >> and/or be shipped as a separate package. > > That how GenomeDiagram started. > >> Personally I'd prefer to ship also with BioPython so we can be sure >> that the right versions are always packed together. >> Eg. 
If you are going to change subfeatures, than a compatible version >> of BioGraPy must be used. > > Yeah - changing SeqFeature locations is a potential minefield, > so I will want to try and make any transition as smooth as > possible with a backwards compatibility hack. > > Peter > Backwards compatibility is always needed when feasible... :) Andrea From nicolas.rochette at univ-lyon1.fr Mon Sep 12 19:49:06 2011 From: nicolas.rochette at univ-lyon1.fr (Nicolas Rochette) Date: Mon, 12 Sep 2011 21:49:06 +0200 Subject: [Biopython-dev] Calculating bootstrap values from a set of trees In-Reply-To: References:

Message-ID: <4E6E6232.50805@univ-lyon1.fr>

Hi Peter,

What you are looking for exists in the bppconsense program from the "Bio++ Suite"
http://home.gna.org/bppsuite/

With something like:

    bppconsense input.tree.file=NEWICK_FILE method=Input input.trees.file=BOOTSTRAPS_FILE output.tree.file=OUTPUT_FILE

Regards,

Nicolas Rochette

From mjldehoon at yahoo.com Wed Sep 14 15:34:01 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 14 Sep 2011 08:34:01 -0700 (PDT)
Subject: [Biopython-dev] Bio.GenBank (was: Bio.File)
In-Reply-To:
Message-ID: <1316014441.82644.YahooMailClassic@web161216.mail.bf1.yahoo.com>

Hi Peter,

--- On Mon, 9/12/11, Peter Cock wrote:
> How's this?
> https://github.com/biopython/biopython/commit/ff72037efbae2da6eb6db550aa6d02b883ec1345

The code looks good.
About the documentation, at the top of the module you say that using Bio.GenBank can be useful for WGS master records. That is true, but people with particular interests may have other reasons to use Bio.GenBank, and maybe WGS master records will not be stored as GenBank files in the future. So it may be good to keep the documentation a bit more generic, so it's still valid in a few years. But I agree that in most cases and for most users, Bio.SeqIO is the appropriate module rather than Bio.GenBank.

Does Bio.SeqIO still need to use Bio.GenBank's FeatureParser? Or can it also use Bio.GenBank.read() or Bio.GenBank.parse()?

--Michiel.

From p.j.a.cock at googlemail.com Wed Sep 14 20:48:48 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 14 Sep 2011 21:48:48 +0100
Subject: [Biopython-dev] Bio.GenBank (was: Bio.File)
In-Reply-To: <1316014441.82644.YahooMailClassic@web161216.mail.bf1.yahoo.com>
References: <1316014441.82644.YahooMailClassic@web161216.mail.bf1.yahoo.com>
Message-ID:

On Wed, Sep 14, 2011 at 4:34 PM, Michiel de Hoon wrote:
> Hi Peter,
>
> --- On Mon, 9/12/11, Peter Cock wrote:
>> How's this?
>> https://github.com/biopython/biopython/commit/ff72037efbae2da6eb6db550aa6d02b883ec1345
>
> The code looks good.

OK.

> About the documentation, at the top of the module you say
> that using Bio.GenBank can be useful for WGS master
> records. That is true, but people with particular interests
> may have other reasons to use Bio.GenBank, and maybe
> WGS master records will not be stored as GenBank files
> in the future. So it may be good to keep the documentation
> a bit more generic, so it's still valid in a few years. But I
> agree that in most cases and for most users, Bio.SeqIO
> is the appropriate module rather than Bio.GenBank.

Please go ahead and try to make it clearer.

> Does Bio.SeqIO still need to use Bio.GenBank's
> FeatureParser? Or can it also use Bio.GenBank.read()
> or Bio.GenBank.parse()?

Yes, and no, respectively. At least as written - I guess the
new read/parse functions could take an optional argument
to control this but I fear that would just be confusing.
Essentially both are using the scanner/consumer model,
but one uses the Record producing consumer and the other
the SeqRecord producing consumer.

Peter

From p.j.a.cock at googlemail.com Fri Sep 16 16:31:13 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 16 Sep 2011 17:31:13 +0100
Subject: [Biopython-dev] SeqFeature start/end and making positions act like ints
Message-ID:

Hi all,

We've previously discussed adding start/end properties
to the SeqFeature returning integers - which would be
useful but inconsistent with the FeatureLocation which
returns Position objects:

https://redmine.open-bio.org/issues/2818

After an interesting discussion with Leighton, I spent
the afternoon making (most of the) Position objects
subclass int - so that they can be used like integers
(with the fuzzy information retained but generally
ignored except for writing the features out again).
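
As a rough illustration of the "position that subclasses int" idea, here is a minimal sketch. It is not the code on the branch: the class name BeforePositionSketch and its "<5" rendering are hypothetical, chosen only to show how the fuzzy marker can ride along on a value that otherwise behaves as a plain integer.

```python
class BeforePositionSketch(int):
    """Hypothetical fuzzy position meaning 'somewhere before this point'.

    Illustrative only; not the class used on the int_pos branch.
    """

    def __new__(cls, position):
        # int is immutable, so the value has to be set in __new__.
        return super().__new__(cls, position)

    def __repr__(self):
        return "%s(%d)" % (self.__class__.__name__, int(self))

    def __str__(self):
        # Render the fuzziness GenBank-style, e.g. "<5".
        return "<%d" % int(self)


start = BeforePositionSketch(5)
end = 12
sequence = "ACGTACGTACGTACGT"

# The position works as a plain int for slicing and arithmetic ...
fragment = sequence[start:end]
length = end - start
# ... while str() still carries the fuzzy marker for writing out.
label = str(start)
```

Note that in this sketch arithmetic such as end - start returns an ordinary int, so the fuzzy information survives only on the original object.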
This means we can have SeqFeature start/end
properties which like those of the FeatureLocation
return position objects - and they are actually easy
to use (except for some very extreme cases).
e.g. You can use them to slice a sequence.

The code is on a branch here:
https://github.com/peterjc/biopython/tree/int_pos

It is almost 100% backwards compatible. Some
of the arguments for creating a fuzzy position
(and their __repr__) have changed, and some
of their attributes, but we feel this is unlikely to
actually affect anyone. We rather suspect only
the SeqIO parsers actually create or use the
fuzzy objects in the first place!

In terms of usability I think this is a worthwhile
improvement. The new class hierarchy is a bit
more complex though - and I have not looked
at the performance implications at all.

Would anyone like to review this please?

Peter

From redmine at redmine.open-bio.org Fri Sep 16 16:45:50 2011
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 16 Sep 2011 16:45:50 +0000
Subject: [Biopython-dev] [Biopython - Bug #2818] Add start and end properties to SeqFeature object
References:
Message-ID:

Issue #2818 has been updated by Peter Cock.

See also this proposal:
http://lists.open-bio.org/pipermail/biopython-dev/2011-September/009172.html

----------------------------------------
Bug #2818: Add start and end properties to SeqFeature object
https://redmine.open-bio.org/issues/2818

Author: Peter Cock
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: Not Applicable
URL:

An enhancement proposed on the mailing list would add start and end
properties to the SeqFeature returning plain integers (non-fuzzy
approximations to the start and end locations) suitable for slicing most
parent sequences. Dealing with a join location would still be tricky.
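
To make the join caveat concrete, here is a small sketch using plain strings and (start, end) tuples rather than Biopython objects. The helper extract_join and the toy data are hypothetical, and strand handling is ignored entirely; the point is only that slicing with the outer start and end also picks up the gap between the parts.

```python
def extract_join(parent, parts):
    """Concatenate the slices named by (start, end) pairs, in order."""
    return "".join(parent[start:end] for start, end in parts)


parent = "AAACCCGGGTTT"
# Hypothetical join of two segments: [0:3] and [6:9].
parts = [(0, 3), (6, 9)]

# Slicing with the outermost start and end includes the gap ...
span = parent[parts[0][0]:parts[-1][1]]
# ... whereas the joined feature should contain only the two segments.
joined = extract_join(parent, parts)
```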
Example usage:

>>> from Bio import SeqIO
>>> record = SeqIO.read(open("NC_005816.gb"),"gb")
>>> feature = record.features[2]
>>> print feature
type: gene
location: [86:1109]
ref: None:None
strand: 1
qualifiers:
    Key: db_xref, Value: ['GeneID:2767718']
    Key: locus_tag, Value: ['YP_pPCP01']

>>> record[feature.start:feature.end]
SeqRecord(seq=Seq('ATGGTCACTTTTGAGACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAATG...TGA', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=[])
>>> record.seq[feature.start:feature.end]
Seq('ATGGTCACTTTTGAGACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAATG...TGA', IUPACAmbiguousDNA())

Patch to follow.

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From p.j.a.cock at googlemail.com Fri Sep 16 17:07:29 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 16 Sep 2011 18:07:29 +0100
Subject: [Biopython-dev] Biopython under PyPy
Message-ID:

Hi all,

I've been trying Biopython under PyPy 1.6, and the unit tests for
a lot of things work fine. In the short term I'm skipping all the C
extensions (not clear how easy they will be under PyPy):

https://github.com/biopython/biopython/commit/2a26ceebed01508a69aefd6a3a6437245347a5a2

PyPy ships with a minimal numpy implementation, but it seems
to be very minimal - e.g. there is no dot function. This is actually
a bit annoying as "import numpy" works but you don't get everything!
Anyway, there are some easy checks we can add to individual
unit tests to skip them under pypy.

What is interesting is running the full test suite reports some
false positives (tests which when run on their own, or as part
of a smaller group pass), and the test suite itself never finishes:
error: Too many open files

I'm not sure what this is from...
I fixed an obvious handle leak:

https://github.com/biopython/biopython/commit/f7ce81b3751745970c32cc813836507e93da3c30

I suspect the problem is some of the individual tests are
leaking handles - which we know already from warnings
under Python 3 etc.

Peter

From eric.talevich at gmail.com Fri Sep 16 20:14:31 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 16 Sep 2011 16:14:31 -0400
Subject: [Biopython-dev] Biopython under PyPy
In-Reply-To:
References:
Message-ID:

On Fri, Sep 16, 2011 at 1:07 PM, Peter Cock wrote:

> Hi all,
>
> I've been trying Biopython under PyPy 1.6, and the unit tests for
> a lot of things work fine. In the short term I'm skipping all the C
> extensions (not clear how easy they will be under PyPy):
>
> https://github.com/biopython/biopython/commit/2a26ceebed01508a69aefd6a3a6437245347a5a2
>
>

Neato! Here's the relevant bug in Redmine:
https://redmine.open-bio.org/issues/3236

> PyPy ships with a minimal numpy implementation, but it seems
> to be very minimal - e.g. there is no dot function. This is actually
> a bit annoying as "import numpy" works but you don't get everything!
> Anyway, there are some easy checks we can add to individual
> unit tests to skip them under pypy.
>

Presumably this will get better in future releases of numpy, but yeah, it
will be awkward to have to check that the numpy module not only exists, but
is in fact the 'real' numpy.

>
> What is interesting is running the full test suite reports some
> false positives (tests which when run on their own, or as part
> of a smaller group pass), and the test suite itself never finishes:
> error: Too many open files
>
> I'm not sure what this is from... I fixed an obvious handle leak:
>
> https://github.com/biopython/biopython/commit/f7ce81b3751745970c32cc813836507e93da3c30
>
> I suspect the problem is some of the individual tests are
> leaking handles - which we know already from warnings
> under Python 3 etc.
>

Now that we've ditched Py2.4, we can start using context managers ('with')
instead of explicit open/close. This should help ensure handles are closed
when exceptions are raised.

The other noteworthy bug the unit tests uncovered, for me, was in
test_Restriction. It wasn't clear at all to me why this error is raised --
some subtle difference in magic-method access between implementations,
maybe?

-Eric

From eric.talevich at gmail.com Fri Sep 16 20:33:19 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 16 Sep 2011 16:33:19 -0400
Subject: [Biopython-dev] SeqFeature start/end and making positions act like ints
In-Reply-To:
References:
Message-ID:

On Fri, Sep 16, 2011 at 12:31 PM, Peter Cock wrote:

> Hi all,
>
> We've previously discussed adding start/end properties
> to the SeqFeature returning integers - which would be
> useful but inconsistent with the FeatureLocation which
> returns Position objects:
>
> https://redmine.open-bio.org/issues/2818
>
> After an interesting discussion with Leighton, I spent
> the afternoon making (most of the) Position objects
> subclass int - so that they can be used like integers
> (with the fuzzy information retained but generally
> ignored except for writing the features out again).
>
> This means we can have SeqFeature start/end
> properties which like those of the FeatureLocation
> return position objects - and they are actually easy
> to use (except for some very extreme cases).
> e.g. You can use them to slice a sequence.
>
> The code is on a branch here:
> https://github.com/peterjc/biopython/tree/int_pos
>
> It is almost 100% backwards compatible. Some
> of the arguments for creating a fuzzy position
> (and their __repr__) have changed, and some
> of their attributes, but we feel this is unlikely to
> actually affect anyone. We rather suspect only
> the SeqIO parsers actually create or use the
> fuzzy objects in the first place!
>
> In terms of usability I think this is a worthwhile
> improvement.
The new class hierarchy is a bit
> more complex though - and I have not looked
> at the performance implications at all.
>
> Would anyone like to review this please?
>
>

Here's another way to do it, maybe -- modify Seq.Seq.__getitem__ to also
check if it's been given a SeqFeature, and if so, handle the joins there.
The handling of fuzziness could happen in here or use the new .start and
.end properties.

Outline:

    def __getitem__(self, index):
        """Returns a subsequence of single letter, use my_seq[index]."""
        if isinstance(index, int):
            #Return a single letter as a string
            return self._data[index]
        elif isinstance(index, SeqFeature):
            # NEW -- handle start/end/join voodoo safely
            # if there's a join, extract the subsequences and then concatenate them
            return the_result
        else:
            #Return the (sub)sequence as another Seq object
            return Seq(self._data[index], self.alphabet)

Think that would work?

-Eric

From p.j.a.cock at googlemail.com Fri Sep 16 22:56:25 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 16 Sep 2011 23:56:25 +0100
Subject: [Biopython-dev] Biopython under PyPy
In-Reply-To:
References: