From krother at rubor.de Thu Jul 1 09:01:41 2010 From: krother at rubor.de (Kristian Rother) Date: Thu, 1 Jul 2010 15:01:41 +0200 Subject: [Biopython-dev] RNA Alphabet with modified nucleotides Message-ID: Hi, I've commited code + tests for representing RNA sequences with modified nucleotides to a branch on Github. See: http://github.com/krother/biopython/commits/rna_alphabet I'm done with my list of 'most wanted' features for this class, including suggestions from Peter. What could I do next to help integrating the new code with the rest of Biopython? Best Regards, Kristian From biopython at maubp.freeserve.co.uk Thu Jul 1 09:26:57 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 1 Jul 2010 14:26:57 +0100 Subject: [Biopython-dev] RNA Alphabet with modified nucleotides In-Reply-To: References: Message-ID: On Thu, Jul 1, 2010 at 2:01 PM, Kristian Rother wrote: > > Hi, > > I've commited code + tests for representing RNA sequences with modified > nucleotides to a branch on Github. See: > > http://github.com/krother/biopython/commits/rna_alphabet > > I'm done with my list of 'most wanted' features for this class, including > suggestions from Peter. > What could I do next to help integrating the new code with the rest of > Biopython? Hi Kristian, I haven't had a play with the code, just a very brief look at it. You'll need to add licence and copyright statements. A few embedded doctests in the docstrings would be very nice to help explain how the new classes are to be used. What happens if you add some of the new DNA seq objects to test_Seq_objs.py? Is it all fine? Are you planning to add a reverse complement method etc? Or does the current fall back on the Seq implementation work OK? Peter From biopython at maubp.freeserve.co.uk Fri Jul 2 09:42:13 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 2 Jul 2010 14:42:13 +0100 Subject: [Biopython-dev] Active projects list? 
(BOSC Biopython Project Update)
Message-ID:

Hi all,

BOSC is rapidly approaching, so I have been working on slides for the
Biopython Project Update. One thing I would really like help with is listing
current active projects, as I think the wiki is out of date here:
http://biopython.org/wiki/Active_projects

In addition to the GSoC work, my list currently has the following (in some
cases just from looking at github - for example I don't recall Tamas posting
on the mailing lists):

Brad Chapman - GFF parsing
Andrea Pierleoni - UniProt XML parsing
Kristian Rother - Modified RNA sequences
Chris Lasher, Kyle Ellrott, Tamás Nepusz - Gene Ontology
Kyle Ellrott - HMMER parser
Uri Laserson, Peter Cock - IMGT files (EMBL like)

I know Michiel has mentioned some ideas for updating our BLAST parsers,
and I have several smaller things on the side (e.g. an on disk index for
Bio.SeqIO.index, enhancements to SeqFeature and FeatureLocation).

What are we missing that should be there?

Thanks,

Peter

From eric.talevich at gmail.com Fri Jul 2 10:47:49 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 2 Jul 2010 10:47:49 -0400
Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update)
In-Reply-To:
References:
Message-ID:

On Fri, Jul 2, 2010 at 9:42 AM, Peter wrote:

> Hi all,
>
> BOSC is rapidly approaching, so I have been working on slides for the
> Biopython Project Update. One thing I would really like help with is
> listing
> current active projects, as I think the wiki is out of date here:
> http://biopython.org/wiki/Active_projects
> [...]
> What are we missing that should be there?
>

Biopython's network on GitHub is a good resource for tracking active
projects:
http://github.com/biopython/biopython/network

Should we add a link to that in the preamble? Not every project has its own
public branch, but for those that do, GitHub will always be up to date.
-Eric

From biopython at maubp.freeserve.co.uk Fri Jul 2 10:52:31 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 2 Jul 2010 15:52:31 +0100
Subject: [Biopython-dev] Switching to GitHub Organization
Message-ID:

Hi all,

Following Chris' lead with BioPerl (see below), I've also switched
Biopython's github account to an organization. There should be no
differences for fetching code or committing for those of you with access.

Peter

---------- Forwarded message ----------
From: Chris Fields
Date: Fri, Jul 2, 2010 at 2:48 PM
Subject: [Bioperl-l] BioPerl Switching to GitHub Organization
To: BioPerl List

GitHub (as expected) just released their setup for organizations, including
open-source projects. The announcement is here:

http://github.com/blog/674-introducing-organizations

I have already moved bioperl over to an organization account and have added
a few co-owners of the github repository. The move is transparent, no one
should notice any difference in checking out code. I'm working on
reassigning teams to projects at this time, so please post here if there are
any problems.

chris
_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l

From kellrott at gmail.com Fri Jul 2 10:53:09 2010
From: kellrott at gmail.com (Kyle)
Date: Fri, 2 Jul 2010 07:53:09 -0700
Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update)
In-Reply-To:
References:
Message-ID:

I also have a fork for adding zxjdbc (the Jython's java database
system) support to BioSQL. And one for parsing MetaGeneAnnotator files
( http://metagene.cb.k.u-tokyo.ac.jp/ )

Kyle

On Fri, Jul 2, 2010 at 7:47 AM, Eric Talevich wrote:

> On Fri, Jul 2, 2010 at 9:42 AM, Peter wrote:
>
>> Hi all,
>>
>> BOSC is rapidly approaching, so I have been working on slides for the
>> Biopython Project Update.
One thing I would really like help with is >> listing >> current active projects, as I think the wiki is out of date here: >> http://biopython.org/wiki/Active_projects >> [...] >> What are we missing that should be there? >> > > Biopython's network on GitHub is a good resource for tracking active > projects: > http://github.com/biopython/biopython/network > > Should we add a link to that in the preamble? Not every project has its own > public branch, but for those that do, GitHub will always be up to date. > > -Eric > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython at maubp.freeserve.co.uk Fri Jul 2 11:09:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 2 Jul 2010 16:09:05 +0100 Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: Message-ID: On Fri, Jul 2, 2010 at 3:53 PM, Kyle wrote: > I also have a fork for adding zxjdbc (the Jython's java database > system) support to BioSQL. And one for parsing MetaGeneAnnotator files > ( http://metagene.cb.k.u-tokyo.ac.jp/ ) Maybe this will be two slides then - or small font ;) > On Fri, Jul 2, 2010 at 7:47 AM, Eric Talevich wrote: >> >> Biopython's network on GitHub is a good resource for tracking active >> projects: >> http://github.com/biopython/biopython/network >> >> Should we add a link to that in the preamble? Not every project has its own >> public branch, but for those that do, GitHub will always be up to date. >> >> -Eric Good idea (I'd been trawling it to make the original list). Peter From andrea at biocomp.unibo.it Sat Jul 3 02:52:26 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Sat, 3 Jul 2010 08:52:26 +0200 (CEST) Subject: [Biopython-dev] Active projects list? 
(BOSC Biopython Project Update) In-Reply-To: References: Message-ID: > Hi all, > > BOSC is rapidly approaching, so I have been working on slides for the > Biopython Project Update. One thing I would really like help with is > listing > current active projects, as I think the wiki is out of date here: > http://biopython.org/wiki/Active_projects > > In addition to the GSoC work, my list currently has the following (in some > cases just from looking at github - for example I don't recall Tamas > posting > on the mailing lists): > > Brad Chapman ? GFF parsing > Andrea Pierleoni - UniProt XML parsing > Kristian Rother ? Modified RNA sequences > Chris Lasher, Kyle Ellrott, Tam?s Nepusz ? Gene Ontology > Kyle Ellrott - HMMER parser > Uri Laserson, Peter Cock - IMGT files (EMBL like) > > I know Michiel has mentioned some ideas for updating our BLAST parsers, > and I have several smaller things on the side (e.g. an on disk index for > Bio.SeqIO.index, enhancements to SeqFeature and FeatureLocation). > > What are we missing that should be there? > > Thanks, > > Peter > Dear Peter, I'm actually working on two more projects than the XML parsing, that could be useful in biopython. 1) together with Mauro Amico, we hare developing a graphical library very similar to the Bio::Graphics module pf BioPerl. The project is at good point, and will come with documentation and tutorial as a standalone package we call BioGraPy. I know that in biopython one can already use GenomeDiagram to draw, for example, seqrecord features, but this could extend biopython plotting capability significantly. You can use BioGraPy to plot a blast output (with its HTML map), to plot hydrophobicity plot along the sequence (read as per letter annotations), mRNA and CDS with their splicing sites, and so on... BioGrapy relies on matplotlib, so this will be an additional external dependence, but worthwhile in my opinion. 
2) Since I'm working with the web2py web framework, and I work with biosql databases, I spent some time adapting the current BioSQL code to be used with the web2py DAL (Database Abstraction Layer). DAL is much more simpler (and sometimes faster) than SQLAlchemy, and its syntax and use are very similar to SQL queries, so it was very easy to adapt the current code to use the DAL. Main advantages of using the web2py DAL are that it can be used on almost any DB engine. listing from the web2py site: SQLite, PostgreSQL, MySQL, MSSQL, FireBird, Oracle, IBM DB2, Informix, Ingres, and Google App Engine. I've succesfully tested with both Postgres and SQLite, but should be tested for the other. Since the Web2py code is GPL2, I can incorporate the modules needed for DAL directly into Biopython, so there will be no external dependences. I know that Brad Chapman and some others were working on implementing BioSQL with SQLAlchemy, so let me know if this could be an addition to Biopython. Cheers, Andrea From tiagoantao at gmail.com Sat Jul 3 06:01:34 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Sat, 3 Jul 2010 11:01:34 +0100 Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: Message-ID: On Fri, Jul 2, 2010 at 2:42 PM, Peter wrote: > What are we missing that should be there? The Population genetics code is still alive, though I have to update the documentation a bit. I want to support the dfdist application soon. Most unexpectedly the fdist code is being used quite a bit (via an application), currently 33 citations on scholar. And people constantly ask me for dfdist support. A close second is support for large genepop files supporting thousands of markers. By the way, I suppose python 2 to 3 is the elephant in the room? I bet all of us have run 2to3 on biopython ;) ... The results are not that bad... 
From biopython at maubp.freeserve.co.uk Sat Jul 3 06:12:54 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 3 Jul 2010 11:12:54 +0100 Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: Message-ID: On Sat, Jul 3, 2010 at 7:52 AM, Andrea Pierleoni wrote: > Dear Peter, > I'm actually working on two more projects than the XML parsing, that could > be useful in biopython. > > 1) together with Mauro Amico, we hare developing a graphical library very > similar to the Bio::Graphics module pf BioPerl. The project is at good > point, and will come with documentation and tutorial as a standalone > package we call BioGraPy. I know that in biopython one can already use > GenomeDiagram to draw, for example, seqrecord features, but this could > extend biopython plotting capability significantly. You can use BioGraPy > to plot a blast output (with its HTML map), to plot hydrophobicity plot > along the sequence (read as per letter annotations), mRNA and CDS with > their splicing sites, and so on... BioGrapy relies on matplotlib, so this > will be an additional external dependence, but worthwhile in my opinion. That does sound interesting. I'm not saying it couldn't be rolled into Biopython, but perhaps shipping it a separate package building on Biopython and matplotlib is a good plan. There are advantages either way. > 2) Since I'm working with the web2py web framework, and I work with biosql > databases, I spent some time adapting the current BioSQL code to be used > with the web2py DAL (Database Abstraction Layer). DAL is much more simpler > (and sometimes faster) than SQLAlchemy, and its syntax and use are very > similar to SQL queries, so it was very easy to adapt the current code to > use the DAL. Main advantages of using the web2py DAL are that it can be > used on almost any DB engine. 
listing from the web2py site: SQLite, > PostgreSQL, MySQL, MSSQL, FireBird, Oracle, IBM DB2, Informix, Ingres, and > Google App Engine. I've succesfully tested with both Postgres and SQLite, > but should be tested for the other. Since the Web2py code is GPL2, I can > incorporate the modules needed for DAL directly into Biopython, so there > will be no external dependences. I know that Brad Chapman and some others > were working on implementing BioSQL with SQLAlchemy, so let me know if > this could be an addition to Biopython. I'm not sure we can easily include GPL code in Biopython... it would complicate things. Kyle has also been working on using the JVM DB API for BioSQL under Jython - I'd rather we ended up with a runtime choice of drivers (database specific like mysqldb, and others like the abstractions SQLAlchemy or the web2py DAL) which would all be external to Biopython. Peter From biopython at maubp.freeserve.co.uk Sat Jul 3 06:14:57 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 3 Jul 2010 11:14:57 +0100 Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: Message-ID: 2010/7/3 Tiago Ant?o : > > By the way, I suppose python 2 to 3 is the elephant in the room? I bet > all of us have run 2to3 on biopython ;) ... The results are not that > bad... Could you start a new thread with a summary of what 2to3 reports? I believe the latest NumPy in their repository builds fine on Python 3.2, so we can't use waiting for them as an excuse much longer ;) Peter From tiagoantao at gmail.com Sat Jul 3 06:25:45 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Sat, 3 Jul 2010 11:25:45 +0100 Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: Message-ID: On Sat, Jul 3, 2010 at 7:52 AM, Andrea Pierleoni wrote: > 1) together with Mauro Amico, we hare developing a graphical library very > similar to the Bio::Graphics module pf BioPerl. 
The project is at good > point, and will come with documentation and tutorial as a standalone > package we call BioGraPy. I know that in biopython one can already use > GenomeDiagram to draw, for example, seqrecord features, but this could > extend biopython plotting capability significantly. You can use BioGraPy > to plot a blast output (with its HTML map), to plot hydrophobicity plot > along the sequence (read as per letter annotations), mRNA and CDS with > their splicing sites, and so on... BioGrapy relies on matplotlib, so this > will be an additional external dependence, but worthwhile in my opinion. 2 comments: 1. Strong support for matplotlib dependence. As usual it is very easy to shield the code against forcing people to install matplotlib (this is not a C library type of dependency where things would be more serious). The dependency is only needed for people who want to use your code. So this is not a big problem. matplotlib is also very standard in scientific python, not a marginal application. Thumbs up, IMHO. matplotlib, numpy and scipy are no brainers in my opinion. 2. The bioperl name bio::graphics strikes me as not completely perfect. I say this because there is more to bioinformatics than sequence analysis. Whatever naming convention is assumed in biopython for any kind of graphics, there should be some care with this. ;) . That being said, I think it is great that charting support exists. My .02 ? (loosing value by the day), Tiago From tiagoantao at gmail.com Sat Jul 3 06:44:01 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Sat, 3 Jul 2010 11:44:01 +0100 Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: Message-ID: 2010/7/3 Peter : > Could you start a new thread with a summary of what 2to3 reports? 
> I believe the latest NumPy in their repository builds fine on Python > 3.2, so we can't use waiting for them as an excuse much longer ;) I will, let me just tidy up the output and put some stats to help people out. I will put this up on Monday, ahead of BOSC. Maybe it will end up being an interesting discussion topic in Boston ;) Tiago From bugzilla-daemon at portal.open-bio.org Sun Jul 4 13:41:05 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 4 Jul 2010 13:41:05 -0400 Subject: [Biopython-dev] [Bug 3105] New: Bio.Nexus useless line Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3105 Summary: Bio.Nexus useless line Product: Biopython Version: 1.54 Platform: All OS/Version: All Status: NEW Severity: minor Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: tiagoantao at gmail.com There is a line on Bio.Nexus that is wrong/useless: elif hasattr(file, "write"): This is checking if the built-in file class has an attribute called write (which it also has). This is the same as elif True: This is either useless or wrong. This becomes a hurdle for automated conversion to python 3 as there is no file class on python 3. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Jul 4 14:09:38 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 4 Jul 2010 14:09:38 -0400 Subject: [Biopython-dev] [Bug 3105] Bio.Nexus useless line In-Reply-To: Message-ID: <201007041809.o64I9cNM016424@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3105 ------- Comment #1 from eric.talevich at gmail.com 2010-07-04 14:09 EST ------- I think it's a typo. 
The function write_nexus_data takes an argument
"filename", and this code block is supposed to figure out whether that's an
open file handle or a file name.

So it should be:

    if hasattr(filename, 'write'): ...

But we actually do it a different way now, checking for strings:

    if isinstance(filename, basestring): # open it
    else: # it's a handle

From bugzilla-daemon at portal.open-bio.org Sun Jul 4 15:34:44 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 4 Jul 2010 15:34:44 -0400
Subject: [Biopython-dev] [Bug 3105] Bio.Nexus useless line
In-Reply-To:
Message-ID: <201007041934.o64JYiIv019836@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3105

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-04 15:34 EST -------
I think Eric is right, it could just be a typo. The Nexus API accepts
either filenames or handles and so needs to check which it has. Given other
bits of Biopython now do the same, we could perhaps have a single bit
of shared code for this - or at least consistent coding style.
From bugzilla-daemon at portal.open-bio.org Sun Jul 4 16:22:28 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 4 Jul 2010 16:22:28 -0400
Subject: [Biopython-dev] [Bug 3105] Bio.Nexus useless line
In-Reply-To:
Message-ID: <201007042022.o64KMS5W021453@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3105

------- Comment #3 from eric.talevich at gmail.com 2010-07-04 16:22 EST -------
(In reply to comment #2)
> Given other
> bits of Biopython now do the same, we could perhaps have a single bit
> of shared code for this - or at least consistent coding style.
>

Here's a snippet I use for myself:

    import contextlib

    @contextlib.contextmanager
    def maybe_open(infile, mode='r'):
        """Take a file name or a handle, and return a handle.

        Simplifies creating functions that automagically accept either a
        file name or an already opened file handle.
        """
        do_close = False
        if isinstance(infile, basestring):
            do_close = True
            handle = open(infile, mode)
        else:
            handle = infile
        yield handle
        if do_close:
            handle.close()

Use like:

>>> with maybe_open(filename_or_handle) as handle:
...

For Py2.4 compliance, you can just drop the @contextlib.contextmanager
decorator and leave the function as it is. Then this works:

>>> for handle in maybe_open(fname):
...

It's an iterator of one item, taking care of loose ends when it terminates.
Neat, huh?

I suspect that yielding from a try/finally block, which is forbidden in
Py2.4, is related to the with statement under the hood in Py2.5+. Since
maybe_open kind of needs that protection to work safely, I think the
copy/paste approach is fine until we officially drop Py2.4 support.
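For readers following the bug thread, here is a minimal sketch of the same helper with the try/finally protection that comment #3 mentions, assuming Python 2.5+ or Python 3. The name string_types and the basestring fallback are introduced here for illustration only; they are not taken from Biopython or from the snippet above.

```python
import contextlib

try:
    string_types = (basestring,)  # Python 2 (hypothetical helper name)
except NameError:
    string_types = (str,)  # Python 3 has no basestring


@contextlib.contextmanager
def maybe_open(infile, mode="r"):
    """Yield a handle for either a file name or an already opened handle."""
    if isinstance(infile, string_types):
        handle = open(infile, mode)
        try:
            yield handle
        finally:
            # Close the file we opened, even if the with-body raised.
            handle.close()
    else:
        # The caller owns this handle, so it is left open.
        yield infile
```

The design choice is that only handles opened by the helper get closed; a handle passed in by the caller is returned untouched and stays open after the with block.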
From tiagoantao at gmail.com Sun Jul 4 16:24:42 2010
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Sun, 4 Jul 2010 21:24:42 +0100
Subject: [Biopython-dev] 2to3 ramblings
Message-ID:

Hi,

Here are my findings on the attempt of converting biopython to python 3.

What I did:
1. Tried to convert Bio (not BioSQL)
2. No C code
3. No external apps

No external apps just because I don't have most of them around here.

Things are going much faster than expected: 52 out of 144 tests are failing.
Less than 6 hours of work to get to this. With the exception of sff
processing I chose the most complicated ones that I found (many of the
existing failing tests are of the easy kind).

Some general issues that I am finding that impact us:
1. import exception is no more
2. Many lists are now iterators (e.g. map results)
3. 2to3 of course is not complete. Also sometimes there are some small
mistakes (things one would expect to convert that are not)
4. sgmllib is no more. 2 options: include it (from python 2.6, which I
am doing) OR use htmllib.
5. slices [:], have to be ints (which is mildly problematic with the
fact that division is now float). Thus
   myPos = x/2
   x[myPos:]
has to become
   myLen = int(x/2)
6. Doctests have to be converted (2to3 does it)
7. Default open is now non-binary, so open sometimes requires rb. file
is no more
8. Many order functions do not accept None, e.g. max([None,1,2]) will fail
9. StringType, *Type are no more
10. sort has no cmp function anymore
11. urllib namespace refactored
12. unit tests really help!
13!!!: The biggest problem has been bytes versus strings and
encodings. Most existing complex problems are about this

Biggest issues have been with Nexus and, above all, Sff (mostly 13
above - encoding formats).

With the exception of Sff, I think I could easily sort out everything myself.

The big unknown seems to be the C code. But I will assume that
conversion is easy for the rest of the discussion. I also have to test
the code that executes external apps.
>From my point of view the conversion is not the big issue. The big issue is the maintenance of a version that works on both 2 and 3 at the same time (we dont want to maintain 2 codebases, correct?). Somethings are easy, but some are unknowns. It is possible to make _some_ code (that currently works only on 2) work on both pythons with little effort. Other code (e.g. prints) can be automatically converted on build. But some issues are still unknown to me. What numpy does (at least partially) is, on build: if python 3 is detected then call 2to3 to convert a python2 codebase to python3. Seems to work quite well. My gut feeling is that code of the form if python.version==2: a_version else: b_version can be almost non-existent. But it is just a gut feeling. So I think the python codebase can be easily shared between python 2 and 3 with little ugliness. About the C codebase? I don' t have any idea for now. This is not as much work as it seems. I think it is possible to have almost everything working on python3 for BOSC (assuming the current pace). But again, the main issue is not the conversion but maintaining a single code base. In practice, I think the first step is to have a build system like numpy: which detects the python version and calls 2to3. A single code base that can be built and tested on both 2 and 3. Suggested readings http://coderazzi.net/tnotes/python/migrating2to3.html http://diveintopython3.org/porting-code-to-python-3-with-2to3.html http://dbaktiar.wordpress.com/2009/08/20/python-3-1-file-open-is-no-longer-binary-by-default/ Well, these are my 0.02?. I can work on putting a github version of this if you are interested... -- "If you want to get laid, go to college. If you want an education, go to the library." 
- Frank Zappa

From bioinformed at gmail.com Sun Jul 4 16:50:53 2010
From: bioinformed at gmail.com (Kevin Jacobs)
Date: Sun, 4 Jul 2010 16:50:53 -0400
Subject: [Biopython-dev] 2to3 ramblings
In-Reply-To:
References:
Message-ID:

2010/7/4 Tiago Antão

> myPos = x/2

I strongly recommend:

   myPos = x//2

versus anything that ventures into float territory and then retreats back
into integer-land.

-Kevin

From bugzilla-daemon at portal.open-bio.org Mon Jul 5 03:40:20 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 5 Jul 2010 03:40:20 -0400
Subject: [Biopython-dev] [Bug 3105] Bio.Nexus useless line
In-Reply-To:
Message-ID: <201007050740.o657eKfs025732@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3105

fkauff at biologie.uni-kl.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #4 from fkauff at biologie.uni-kl.de 2010-07-05 03:40 EST -------
It's a typo (a rather old one).

Type checking has been changed to isinstance.

Frank

(In reply to comment #1)
> I think it's a typo. The function write_nexus_data takes an argument
> "filename", and this code block is supposed to figure out whether that's an
> open file handle or a file name.
>
> So it should be:
>
>     if hasattr(filename, 'write'): ...
>
> But we actually do it a different way now, checking for strings:
>
>     if isinstance(filename, basestring): # open it
>     else: # it's a handle
>
From fkauff at biologie.uni-kl.de Mon Jul 5 03:46:59 2010 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Mon, 05 Jul 2010 09:46:59 +0200 Subject: [Biopython-dev] 2to3 ramblings In-Reply-To: References: Message-ID: <4C318DF3.1080706@biologie.uni-kl.de> Hi Tiago, On 07/04/2010 10:24 PM, Tiago Ant?o wrote: > Hi, > > Here are my findings on the attempt of converting biopython to python 3. > ... > > Biggest issues have been with Nexus and, above all, Sff (mostly 13 > above - encoding formats). > > I'd be happy to help with Nexus.py. You have some sort of list with the lines that failed? Frank > With the exception of Sff, I think I could easily sort out everything myself. > > The big incognito seems to be the C code. But I will assume that > conversion is easy for the rest of the discussion. I have also to test > process code that executes external apps. > > > > From my point of view the conversion is not the big issue. The big > issue is the maintenance of a version that works on both 2 and 3 at > the same time (we dont want to maintain 2 codebases, correct?). > Somethings are easy, but some are unknowns. It is possible to make > _some_ code (that currently works only on 2) work on both pythons with > little effort. Other code (e.g. prints) can be automatically converted > on build. But some issues are still unknown to me. > > What numpy does (at least partially) is, on build: if python 3 is > detected then call 2to3 to convert a python2 codebase to python3. > Seems to work quite well. My gut feeling is that code of the form > if python.version==2: > a_version > else: > b_version > can be almost non-existent. > But it is just a gut feeling. > > So I think the python codebase can be easily shared between python 2 > and 3 with little ugliness. About the C codebase? I don' t have any > idea for now. > > This is not as much work as it seems. I think it is possible to have > almost everything working on python3 for BOSC (assuming the current > pace). 
But again, the main issue is not the conversion but maintaining > a single code base. In practice, I think the first step is to have a > build system like numpy: which detects the python version and calls > 2to3. A single code base that can be built and tested on both 2 and 3. > > > Suggested readings > http://coderazzi.net/tnotes/python/migrating2to3.html > http://diveintopython3.org/porting-code-to-python-3-with-2to3.html > http://dbaktiar.wordpress.com/2009/08/20/python-3-1-file-open-is-no-longer-binary-by-default/ > > > Well, these are my 0.02?. I can work on putting a github version of > this if you are interested... > > From tiagoantao at gmail.com Mon Jul 5 05:11:22 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 5 Jul 2010 10:11:22 +0100 Subject: [Biopython-dev] 2to3 ramblings In-Reply-To: <4C318DF3.1080706@biologie.uni-kl.de> References: <4C318DF3.1080706@biologie.uni-kl.de> Message-ID: On Mon, Jul 5, 2010 at 8:46 AM, Frank Kauff wrote: > I'd be happy to help with Nexus.py. You have some sort of list with the > lines that failed? Thanks for the help. I have restarted the whole process in order make things easier for everybody else. As soon as I get there (again) I will send you existing problems. I will start opening tickets with patches for several components and putting there solutions. I hope it will be clear to everybody the level of triviality of patches required. Nexus (and mainly SFF) were the issues I stumbled in. Lets see on the second run. 
Tiago From bugzilla-daemon at portal.open-bio.org Mon Jul 5 05:16:01 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 5 Jul 2010 05:16:01 -0400 Subject: [Biopython-dev] [Bug 3105] Bio.Nexus useless line In-Reply-To: Message-ID: <201007050916.o659G10F030478@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3105 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-05 05:16 EST ------- (In reply to comment #4) > It's a typo (a rather old one). > > Type checking has been changed to isinstance. > > Frank Thanks Frank - it turns out to be my old typo from 4 July 2008. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jul 5 05:44:17 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 5 Jul 2010 05:44:17 -0400 Subject: [Biopython-dev] [Bug 3106] New: Making Bio.Sequencing.Ace Python 3 compliant Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3106 Summary: Making Bio.Sequencing.Ace Python 3 compliant Product: Biopython Version: 1.54 Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: tiagoantao at gmail.com The patch attached serves to make Bio.Sequencing.Ace Python 3 compliant A few notes: 1. It is a patch to the test (replacing / with // as per Kevin suggestion). The core code needs no patch 2. It still requires running 2to3, but that is normal 3. 
Was tested on both 3.1.2 and 2.6.5 (ie, not on 2.5 and 2.4) This is a typical pattern: the change is trivial and has no impact on the 2 codebase -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jul 5 05:45:33 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 5 Jul 2010 05:45:33 -0400 Subject: [Biopython-dev] [Bug 3106] Making Bio.Sequencing.Ace Python 3 compliant In-Reply-To: Message-ID: <201007050945.o659jX00031812@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3106 ------- Comment #1 from tiagoantao at gmail.com 2010-07-05 05:45 EST ------- Created an attachment (id=1519) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1519&action=view) Patch to make test_Ace py3k compliant -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jul 5 05:48:48 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 5 Jul 2010 05:48:48 -0400 Subject: [Biopython-dev] [Bug 3106] Making Bio.Sequencing.Ace Python 3 compliant In-Reply-To: Message-ID: <201007050948.o659mm42031951@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3106 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-05 05:48 EST ------- Tiago - please go ahead and apply these and any further / to // changes to use explicit integer division required to help 2to3 (without bothering with more bug reports - a summary email to the dev list would be more than enough). Thanks! 
Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jul 5 05:54:21 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 5 Jul 2010 05:54:21 -0400 Subject: [Biopython-dev] [Bug 3106] Making Bio.Sequencing.Ace Python 3 compliant In-Reply-To: Message-ID: <201007050954.o659sLWt032216@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3106 tiagoantao at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from tiagoantao at gmail.com 2010-07-05 05:54 EST ------- (In reply to comment #2) > Tiago - please go ahead and apply these and any further / to // changes to use > explicit integer division required to help 2to3 (without bothering with more > bug > reports - a summary email to the dev list would be more than enough). Thanks! OK, I will just two notes: 1. Apologies in advance if I make a blunder with git, I am a bzr person and my git skills are limited 2. I will go to biopython-dev whenever something conceptually new arises that I think requires discussion before commit. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
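For reference, the `/` to `//` change discussed in this bug can be sketched as below. This is only an illustration, not the actual test_Ace patch; `midpoint`, `coverage` and the `reads` list are hypothetical names. On Python 2, `/` between two integers truncates, while on Python 3 it returns a float, so code that needs an integer result (an index, for example) should use `//`, which behaves the same on both.

```python
def midpoint(start, end):
    """Return an integer index half way between start and end."""
    # // is floor division on both Python 2 and Python 3.
    return (start + end) // 2


def coverage(covered, total):
    """Return a fraction; float() forces true division on Python 2 as well."""
    return float(covered) / total


# Hypothetical data: an integer result is required to index the list.
reads = ["r1", "r2", "r3", "r4", "r5", "r6", "r7"]
middle_read = reads[midpoint(0, len(reads))]
```

With a plain `/`, the indexing line would raise a TypeError on Python 3, because the result would be the float 3.5.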
From bugzilla-daemon at portal.open-bio.org Mon Jul 5 06:04:18 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 5 Jul 2010 06:04:18 -0400 Subject: [Biopython-dev] [Bug 3106] Making Bio.Sequencing.Ace Python 3 compliant In-Reply-To: Message-ID: <201007051004.o65A4I7g032704@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3106 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-05 06:04 EST ------- (In reply to comment #3) > (In reply to comment #2) > > Tiago - please go ahead and apply these and any further / to // changes to > > use explicit integer division required to help 2to3 (without bothering with > > more bug reports - a summary email to the dev list would be more than > > enough). Thanks! > > OK, I will just two notes: > > 1. Apologies in advance if I make a blunder with git, I am a bzr person and > my git skills are limited Looks fine so far :) > 2. I will go to biopython-dev whenever something conceptually new arises > that I think requires discussion before commit. Great. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jul 5 06:31:39 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 5 Jul 2010 06:31:39 -0400 Subject: [Biopython-dev] [Bug 3105] Bio.Nexus useless line In-Reply-To: Message-ID: <201007051031.o65AVdUv001589@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3105 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-05 06:31 EST ------- (In reply to comment #4) > It's a typo (a rather old one). > > Type checking has been changed to isinstance. > > Frank I've just changed it back to method checking, i.e. 
as Eric suggested with my typo fixed:

    if hasattr(filename, 'write'):

The trouble with isinstance(filename, file) is that it doesn't allow for
file-like objects - specifically a StringIO handle as used in the unit
tests, meaning test_AlignIO.py was failing.

Peter

From tiagoantao at gmail.com Mon Jul 5 06:34:17 2010
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 5 Jul 2010 11:34:17 +0100
Subject: [Biopython-dev] test_AlignIO to python 3
Message-ID:

Hi,

test_AlignIO provides a far more interesting case (but not
complicated, not at all).

The issues are as follows:

1. list sorting
Bio.Data.CodonTable has a:

    possible.sort(_sort)

Py3 has no compare function (that _sort is a 5 line function defined
just above). That can be "forced in", but there is normally a simpler
dialect, with keywords. The line above becomes:

    if sys.version_info[0] == 3:
        possible.sort(key=lambda x:self.ambiguous_protein[x])
    else:
        possible.sort(_sort)

2. Strings and bytes
Bio.Seq requires

    if sys.version_info[0] == 3 :
        return str.maketrans(before, after)
    else:
        return string.maketrans(before, after)

The way p3 handles strings and bytes is the biggest issue that I
think we will face from a technical perspective.

3. The big one: No sgmllib in p3.
The obvious solution is to include it (I suppose the licenses are
compatible?). The alternative (using htmllib) might be more long-term,
in my opinion.

This is all that is needed (plus 1 import sys line).

I was inclined to commit 1 and 2. But 3 needs to be discussed...
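As a rough sketch of point 1, a comparison function can usually be replaced by a key function that returns a tuple. The `ambiguous` mapping and both function names are hypothetical stand-ins rather than the real Bio.Data.CodonTable code, and the comparison shown (fewest possibilities first, then alphabetical) is only an assumption about what a `_sort`-style function might do. `functools.cmp_to_key` is used here purely to check that the two orderings agree.

```python
from functools import cmp_to_key

# Hypothetical data: ambiguous letter -> the residues it may stand for.
ambiguous = {"X": "ACDEFG", "B": "DN", "Z": "EQ", "J": "IL", "A": "A"}


def by_length_then_name(a, b):
    """Old-style comparison: fewer possibilities first, then alphabetical."""
    len_a, len_b = len(ambiguous[a]), len(ambiguous[b])
    if len_a != len_b:
        return -1 if len_a < len_b else 1
    return (a > b) - (a < b)


def sort_key(letter):
    """Key equivalent: tuples compare element by element."""
    return (len(ambiguous[letter]), letter)


old_style = sorted(ambiguous, key=cmp_to_key(by_length_then_name))
new_style = sorted(ambiguous, key=sort_key)
```

The key form needs no version check, since the `key` argument is accepted by both Python 2.4+ and Python 3.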
Tiago

From p.j.a.cock at googlemail.com Mon Jul 5 07:01:42 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 5 Jul 2010 12:01:42 +0100
Subject: [Biopython-dev] test_AlignIO to python 3
In-Reply-To:
References:
Message-ID:

2010/7/5 Tiago Antão :
> Hi,
>
> test_AlignIO provides a far more interesting case (but not
> complicated, not at all).

Or just test_seq.py or test_Seq_objs.py which are more low level ;)

> The issues are as follows:
>
> 1. list sorting
> Bio.Data.CodonTable has a:
> possible.sort(_sort)
> Py3 has no compare function (that _sort is a 5 line function defined
> just above). That can be "forced in", but there is normally a simpler
> dialect, with keywords. The line above becomes:
> if sys.version_info[0] == 3:
>     possible.sort(key=lambda x:self.ambiguous_protein[x])
> else:
>     possible.sort(_sort)

I think Python 2.4 added support for the key argument, so can we
just unconditionally change it to:

    possible.sort(key=lambda x:self.ambiguous_protein[x])

However, that isn't doing quite the same thing. The old sort was by
table length first to try and get the least ambiguous mapping or
something like that... we probably need some more unit tests first.

> 2. Strings and bytes
> Bio.Seq requires
>    if sys.version_info[0] == 3 :
>        return str.maketrans(before, after)
>    else:
>        return string.maketrans(before, after)

This is within our private _maketrans function only? That looks sensible
but I wonder why 2to3 doesn't handle this on its own.

Would moving the "import string" into the function help for clarity?
def _maketrans(complement_mapping):
    """Makes a python string translation table (PRIVATE)."""
    before = ''.join(complement_mapping.keys())
    after = ''.join(complement_mapping.values())
    before = before + before.lower()
    after = after + after.lower()
    if sys.version_info[0] == 3 :
        return str.maketrans(before, after)
    else:
        import string
        return string.maketrans(before, after)

> The way p3 handles strings and bytes are the biggest issue that I
> think we will face from a technical perspective.

I agree that strings vs bytes will be an issue for us (potentially
from a memory point of view for Seq objects).

> 3. The big one: No sgmllib in p3.
>   The obvious solution is to include it (I suppose the licenses are
> compatible?). The alternative (using htmllib) might be more long-term,
> in my opinion

A lot of the things using sgmllib are already deprecated (e.g.
Bio.NetCatch and Bio.Prosite). I think that leaves just Bio.UniGene
and Bio.InterPro - which isn't such a big issue.

Peter

From fkauff at biologie.uni-kl.de Mon Jul 5 07:07:22 2010
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Mon, 05 Jul 2010 13:07:22 +0200
Subject: [Biopython-dev] [Bug 3105] Bio.Nexus useless line
In-Reply-To: <201007051031.o65AVdUv001589@portal.open-bio.org>
References: <201007051031.o65AVdUv001589@portal.open-bio.org>
Message-ID: <4C31BCEA.1000007@biologie.uni-kl.de>

On 07/05/2010 12:31 PM, bugzilla-daemon at portal.open-bio.org wrote:
> http://bugzilla.open-bio.org/show_bug.cgi?id=3105
>
> ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-05 06:31 EST -------
> (In reply to comment #4)
>
>> It's a typo (a rather old one).
>>
>> Type checking has been changed to isinstance.
>>
>> Frank
>>
> I've just changed it back to method checking, i.e.
as Eric suggested with
> my typo fixed:
>
> if hasattr(filename, 'write'):
>
> The trouble with isinstance(filename, file) is that it doesn't allow for file
> like objects - specifically a StringIO handle as used in the unit tests,
> meaning test_AlignIO.py was failing.
>
> Peter
>
>
Good catch. I didn't remember that.

Frank

From tiagoantao at gmail.com Mon Jul 5 07:13:50 2010
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 5 Jul 2010 12:13:50 +0100
Subject: [Biopython-dev] test_AlignIO to python 3
In-Reply-To:
References:
Message-ID:

Hi,

2010/7/5 Peter Cock :
> I think Python 2.4 added support for the key argument, so can we
> just unconditionally change it to:
>
> possible.sort(key=lambda x:self.ambiguous_protein[x])
>
> However, that isn't doing quite the same thing. The old sort was by
> table length first to try and get the least ambiguous mapping or
> something like that... we probably need some more unit tests first.

erm, my mistake

    possible.sort(key=lambda x:len(self.ambiguous_protein[x]))

I think this sorts this out?

> This is within our private _maketrans function only? That looks sensible
> but I wonder why 2to3 doesn't handle this on its own.

Because (I think), there are now 2 possible alternatives (one
byte-wise and one string-wise), so 2to3 does not know which to choose.

> Would moving the "import string" into the function help for clarity?

If it is only used there, maybe it makes sense.

>> 3. The big one: No sgmllib in p3.
>>   The obvious solution is to include it (I suppose the licenses are
>> compatible?). The alternative (using htmllib) might be more long-term,
>> in my opinion
>
> A lot of the things using sgmllib are already deprecated (e.g.
> Bio.NetCatch and Bio.Prosite). I think that leaves just Bio.UniGene
> and Bio.InterPro - which isn't such a big issue.

I know very little about those parts of the code, but there was an
import required for sgmllib in test_AlignIO.
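On the remark above that there are two alternatives, one byte-wise and one string-wise: on Python 3 both `str.maketrans` and `bytes.maketrans` exist, which may be why 2to3 cannot pick one automatically. The sketch below assumes Python 3 and uses a hypothetical four-letter complement mapping rather than the real Bio.Seq tables.

```python
# Hypothetical, unambiguous DNA complement mapping (not the Bio.Seq data).
complement = {"A": "T", "C": "G", "G": "C", "T": "A"}

before = "".join(complement.keys())
after = "".join(complement.values())
before = before + before.lower()
after = after + after.lower()

# String-wise table: maps code points, for use with str.translate.
text_table = str.maketrans(before, after)

# Byte-wise table: a 256-byte lookup, for use with bytes.translate.
byte_table = bytes.maketrans(before.encode("ascii"), after.encode("ascii"))

text_result = "ACGTacgt".translate(text_table)
byte_result = b"ACGTacgt".translate(byte_table)
```

Which table is appropriate depends on whether the sequence data is held as text or as bytes, and that is a design decision rather than something a conversion tool can infer.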
Tiago -- "If you want to get laid, go to college. If you want an education, go to the library." - Frank Zappa From p.j.a.cock at googlemail.com Mon Jul 5 07:26:36 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Jul 2010 12:26:36 +0100 Subject: [Biopython-dev] test_AlignIO to python 3 In-Reply-To: References: Message-ID: 2010/7/5 Tiago Ant?o : > Hi, > > 2010/7/5 Peter Cock : >> I think Python 2.4 added support for the key argument, so can we >> just unconditionally change it to: >> >> possible.sort(key=lambda x:self.ambiguous_protein[x]) >> >> However, that isn't doing quite the same thing. The old sort was by >> table length first to try and get the least ambiguous mapping or >> something like that... we probably need some more unit tests first. > > erm, my mistake > possible.sort(key=lambda x:len(self.ambiguous_protein[x])) > > I think this sorts this out? Probably. >> This is within our private _maketrans function only? That looks sensible >> but I wonder why 2to3 doesn't handle this on its own. > > Because (I think), there are now 2 possible alternatives (one > byte-wise and one string-wise), so 2to3 does not know which to choose. True. >> Would moving the "import string" into the function help for clarity? > > It it is only used there, maybe it makes sense. OK. >>> 3. The big one: No sgmllib in p3. >>> ? The obvious solution is to include it (I suppose the licenses are >>> compatible?). The alternative (using htmllib) might be more long-term, >>> in my opinion >> >> A lot of the things using sgmllib are already deprecated (e.g. >> Bio.NetCatch and Bio.Prosite). I think that leaves just Bio.UniGene >> and Bio.InterPro - which isn't such a big issue. > > I know very little about those parts of the code, but there was an > import required for sgmllib in test_AlignIO. This is due to Bio/File.py trying to import sgmllib, and Bio.File is used by several of the SeqIO/AlignIO parsers (e.g. Bio.GenBank). 
That code needing sgmllib was deprecated in Biopython 1.52 (Sept 2009), and so we should be keeping it until Sept 2010... I think making it a lazy import will do the trick. Peter P.S. I've just committed this, so do a pull before more changes: http://github.com/biopython/biopython/commit/4f2650c309224e74bd18758b4ee2be24879c15dd From p.j.a.cock at googlemail.com Mon Jul 5 07:36:25 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Jul 2010 12:36:25 +0100 Subject: [Biopython-dev] test_AlignIO to python 3 In-Reply-To: References: Message-ID: >>>> 3. The big one: No sgmllib in p3. >>>> ? The obvious solution is to include it (I suppose the licenses are >>>> compatible?). The alternative (using htmllib) might be more long-term, >>>> in my opinion >>> >>> A lot of the things using sgmllib are already deprecated (e.g. >>> Bio.NetCatch and Bio.Prosite). I think that leaves just Bio.UniGene >>> and Bio.InterPro - which isn't such a big issue. >> >> I know very little about those parts of the code, but there was an >> import required for sgmllib in test_AlignIO. > > This is due to Bio/File.py trying to import sgmllib, and Bio.File is used > by several of the SeqIO/AlignIO parsers (e.g. Bio.GenBank). That > code needing sgmllib was deprecated in Biopython 1.52 (Sept 2009), > and so we should be keeping it until Sept 2010... I think making it a > lazy import will do the trick. How's this? http://github.com/biopython/biopython/commit/e9ab0b353ae4a914db20a53f2377a34bc56c30a6 Peter From tiagoantao at gmail.com Mon Jul 5 07:38:20 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 5 Jul 2010 12:38:20 +0100 Subject: [Biopython-dev] test_AlignIO to python 3 In-Reply-To: References: Message-ID: 2010/7/5 Peter Cock : > Or just test_seq.py or test_Seq_objs.py which are more low level ;) Glad you raise these 2, as I want to discuss them. 2 changes: 1. 
to test_Seq_objs.py add

    import sys
    if sys.version_info[0] == 3:
        maketrans = str.maketrans
    else:
        from string import maketrans

2. (more serious) array.array("c", ...) is no more (the c).
Maybe self.data = array.array("u", data) ? With ifs per version. This
affects test_seq.py and Seq.py

Regarding commits (e.g. the sort case).
I can commit general corrections, e.g. a single sort with a key for
all versions or put some ifs (use the old code for 2.x and new code
for 3). The first option is cleaner, the second safer. I warm up to
the cleaner version: the changes are trivial (and trivial to roll
back, should the need arise).

From tiagoantao at gmail.com Mon Jul 5 07:38:50 2010
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 5 Jul 2010 12:38:50 +0100
Subject: [Biopython-dev] test_AlignIO to python 3
In-Reply-To:
References:
Message-ID:

> How's this?
>
> http://github.com/biopython/biopython/commit/e9ab0b353ae4a914db20a53f2377a34bc56c30a6

Makes things much cleaner and easier...

--
"If you want to get laid, go to college. If you want an education, go to
the library." - Frank Zappa

From p.j.a.cock at googlemail.com Mon Jul 5 07:44:54 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 5 Jul 2010 12:44:54 +0100
Subject: [Biopython-dev] test_AlignIO to python 3
In-Reply-To:
References:
Message-ID:

2010/7/5 Tiago Antão :
> 2010/7/5 Peter Cock :
>> Or just test_seq.py or test_Seq_objs.py which are more low level ;)
>
> Glad you raise these 2, as I want to discuss them.
> 2 changes:
> 1. to test_Seq_objs.py add
> import sys
> if sys.version_info[0] == 3:
>    maketrans = str.maketrans
> else:
>    from string import maketrans

OK, do a "git pull origin master" and then make that change (and
move the import string into the function too). It seems to be a fairly
simple way to cope with Python 2.x and Python 3.x.

> 2. (more serious) array.array("c", ...) is no more (the c).
> Maybe self.data = array.array("u", data) ? With ifs per version.
This > affects test_seq.py and Seq.py This is the MutableSeq object, right? Try some local changes and see, but I fear we may have to redo the internals of that more substantially. > Regarding commits (e.g. the sort case). > I can commit general corrections, e.g. a single sort with a key for > all versions or put some ifs (use the old code for 2.x and new code > for 3). The first option is cleaner, the second safer. I warm up to > the cleaner version: the changes are trivial (and trivial to roll > back, should the need arise). For the codon sort case, the old code effectively did two sorts (one by length with a tie breaker). If you can write some unit tests to check we don't alter the behaviour, the clean fix is nicer. Also 2to3 is suggesting we use for loops in Bio/Sequencing/Ace.py in place of side effects with a map call (lines 474, 480, 484). That does seem like good advice. Peter From mjldehoon at yahoo.com Mon Jul 5 07:47:21 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 5 Jul 2010 04:47:21 -0700 (PDT) Subject: [Biopython-dev] test_AlignIO to python 3 In-Reply-To: Message-ID: <298020.45436.qm@web62407.mail.re1.yahoo.com> --- On Mon, 7/5/10, Tiago Ant?o wrote: > >> 3. The big one: No sgmllib in p3. > > A lot of the things using sgmllib are already > deprecated (e.g. > > Bio.NetCatch and Bio.Prosite). I think that leaves > > just Bio.UniGene and Bio.InterPro - which isn't such > a big issue. > I know very little about those parts of the code, but there > was an import required for sgmllib in test_AlignIO. In Bio.UniGene and Bio.InterPro, sgmllib is used for parsing HTML pages, which tends to break easily anyway because the HTML format keeps changing. As a case in point, the parser in Bio.InterPro doesn't seem to work with current HTML pages from InterPro. I haven't tried Bio.UniGene, but Bio.UniGene can also parse UniGene flat files so I doubt that there is a real need to parse UniGene html files. 
In test_AlignIO, the import for sgmllib is coming from the SGMLStripper
class in Bio.File, imported from Bio.ParserSupport, imported from
Bio.GenBank, imported from Bio.SeqIO. But Bio.SeqIO doesn't actually use
SGMLStripper, which has been deprecated.

So I suggest that instead of fixing the modules that depend on sgmllib,
we replace the relevant pieces of code by a NotImplementedError, and see
if anybody complains.

For the longer term, it would be nice if the code in Bio.GenBank could be
moved to Bio.SeqIO, and made independent of Bio.ParserSupport.

--Michiel.

From tiagoantao at gmail.com Mon Jul 5 08:05:26 2010
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 5 Jul 2010 13:05:26 +0100
Subject: [Biopython-dev] test_AlignIO to python 3
In-Reply-To:
References:
Message-ID:

2010/7/5 Peter Cock :
> For the codon sort case, the old code effectively did two sorts
> (one by length with a tie breaker). If you can write some unit tests
> to check we don't alter the behaviour, the clean fix is nicer.

>>> a=[(1,2),(1,1),(2,1),(1,0)]
>>> a.sort(key=lambda x:(x[0],x[1]))
>>> a
[(1, 0), (1, 1), (1, 2), (2, 1)]

Multi-level sorting is possible ;)
thus

    possible.sort(key=lambda x:(len(self.ambiguous_protein[x]), x))

From p.j.a.cock at googlemail.com Mon Jul 5 09:18:12 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 5 Jul 2010 14:18:12 +0100
Subject: [Biopython-dev] test_AlignIO to python 3
In-Reply-To: <298020.45436.qm@web62407.mail.re1.yahoo.com>
References: <298020.45436.qm@web62407.mail.re1.yahoo.com>
Message-ID:

Tiago wrote:
>>> 3. The big one: No sgmllib in p3.

Peter wrote:
>> A lot of the things using sgmllib are already deprecated (e.g.
>> Bio.NetCatch and Bio.Prosite). I think that leaves just Bio.UniGene
>> and Bio.InterPro - which isn't such a big issue.

Michiel wrote:
> In Bio.UniGene and Bio.InterPro, sgmllib is used for parsing HTML pages,
> which tends to break easily anyway because the HTML format keeps
> changing.
As a case in point, the parser in Bio.InterPro doesn't seem to > work with current HTML pages from InterPro. So that one is ready for deprecation (assuming no one steps forward to update it). > I haven't tried Bio.UniGene, but Bio.UniGene can also parse UniGene > flat files so I doubt that there is a real need to parse UniGene html files. Again, perhaps this HTML parser can be deprecated. > In test_AlignIO, the import for sgmllib is coming from the SGMLStripper > class in Bio.File, imported from Bio.ParserSupport, imported from > Bio.GenBank, imported from Bio.SeqIO. But Bio.SeqIO doesn't > actually use SGMLStripper, which has been deprecated. That's been fixed by making Bio.File ignore the deprecated SGML stuff if sgmllib isn't available. > So I suggest that instead of fixing the modules that depend on sgmllib, > we replace the relevant pieces of code by a NotImplementedError, and > see if anybody complains. How about just deprecation instead? > For the longer term, it would be nice if the code in Bio.GenBank > could be moved to Bio.SeqIO, and made independent of > Bio.ParserSupport. That makes sense except for the fact that Bio.GenBank is still useful for "low level" work (not using a SeqRecord), for example WGS files. Certainly long term I think we could drop Bio.GenBank and have a simplified SeqRecord only parser in Bio.SeqIO. My recent location parsing work is a step in that direction. Peter From p.j.a.cock at googlemail.com Mon Jul 5 09:18:53 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Jul 2010 14:18:53 +0100 Subject: [Biopython-dev] test_AlignIO to python 3 In-Reply-To: References: Message-ID: 2010/7/5 Tiago Ant?o : > 2010/7/5 Peter Cock : >> For the codon sort case, the old code effectively did two sorts >> (one by length with a tie breaker). If you can write some unit tests >> to check we don't alter the behaviour, the clean fix is nicer. 
> > >>>> a=[(1,2),(1,1),(2,1),(1,0)] >>>> a.sort(key=lambda x:(x[0],x[1])) >>>> a > [(1, 0), (1, 1), (1, 2), (2, 1)] > > > Multi-level sorting is possible ;) > thus > possible.sort(key=lambda x:(len(self.ambiguous_protein[x]), x)) > Neat - with a sensible comment to explain why, that looks good. Peter From tiagoantao at gmail.com Mon Jul 5 10:28:33 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 5 Jul 2010 15:28:33 +0100 Subject: [Biopython-dev] test_Entrez 3.x Message-ID: Hi, A pre-read, http://dbaktiar.wordpress.com/2009/08/20/python-3-1-file-open-is-no-longer-binary-by-default/ I am not completely sure that the text above is totally correct, but it does introduce the problem quite well. expat seems to want a byte stream. In the core code this is minor, Expat.Parser gets one open(,"rb") on externalEntityRefHandler and it is ready to roll (at least passes the test_Entrez test). But test_Entrez does need quite a few files open as rb. I do not know if I like this idea of opening a text file as binary. But at least the core code is barely touched. It is more an issue with the test. -- "If you want to get laid, go to college. If you want an education, go to the library." - Frank Zappa From biopython at maubp.freeserve.co.uk Mon Jul 5 10:38:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 5 Jul 2010 15:38:05 +0100 Subject: [Biopython-dev] test_Entrez 3.x In-Reply-To: References: Message-ID: 2010/7/5 Tiago Ant?o : > Hi, > > A pre-read, > http://dbaktiar.wordpress.com/2009/08/20/python-3-1-file-open-is-no-longer-binary-by-default/ > I am not completely sure that the text above is totally correct, but > it does introduce the problem quite well. > > expat seems to want a byte stream. > In the core code this is minor, Expat.Parser gets one open(,"rb") on > externalEntityRefHandler and it is ready to roll (at least passes the > test_Entrez test). > But test_Entrez does need quite a few files ?open as rb. 
> > I do not know if I like this idea of opening a text file as binary. > But at least the core code is barely touched. It is more an issue with > the test. If Expat wants bytes, then on Python 3 we need to open the file in binary mode. This should be harmless on Python 2, although we should confirm this by running the unit tests on Windows - the only difference I would expect this will disable the magic new line conversion. Peter From tiagoantao at gmail.com Mon Jul 5 11:52:40 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 5 Jul 2010 16:52:40 +0100 Subject: [Biopython-dev] NCBIXML Message-ID: Hi, Finally something less trivial. NCBIXML test has different results when running (with py3) using python3 run_tests.py test_NCBIXML.py or just python3 test_NCBIXML.py First fails. Second works. I' ve discovered that expat parsing is assuming that the encoding is ascii and sends an error (no encoding is specified in the file), whereas with utf-8 all is fine. Passing an encoding to ParserCreate gives no joy. Maybe somebody has had experiences with test having different outcomes depending on how they are invoked? Regards, Tiago -- "If you want to get laid, go to college. If you want an education, go to the library." - Frank Zappa From biopython at maubp.freeserve.co.uk Mon Jul 5 12:04:50 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 5 Jul 2010 17:04:50 +0100 Subject: [Biopython-dev] NCBIXML In-Reply-To: References: Message-ID: 2010/7/5 Tiago Ant?o : > Hi, > > Finally something less trivial. > NCBIXML test has different results when running (with py3) using > python3 run_tests.py test_NCBIXML.py > or just > python3 test_NCBIXML.py > > First fails. Second works. > I' ve discovered that expat parsing is assuming that the encoding is > ascii and sends an error (no encoding is specified in the file), > whereas with utf-8 all is fine. > Passing an encoding to ParserCreate gives no joy. 
>
> Maybe somebody has had experiences with test having different outcomes
> depending on how they are invoked?

I suspect this will be down to the run_tests.py magic which attempts to
run the test using the compiled files in the build directory. Have you
run "python3 setup.py install" or not? If the build directory and the
installed Biopython are the same this problem may go away...

Peter

From eric.talevich at gmail.com Mon Jul 5 12:07:57 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 5 Jul 2010 12:07:57 -0400
Subject: [Biopython-dev] test_AlignIO to python 3
In-Reply-To:
References:
Message-ID:

2010/7/5 Tiago Antão
>
> 1. to test_Seq_objs.py add
> import sys
> if sys.version_info[0] == 3:
>     maketrans = str.maketrans
> else:
>     from string import maketrans
>

You could skip importing sys by checking if the attribute is there on str:

    if hasattr(str, 'maketrans'):
        maketrans = str.maketrans
    else:
        from string import maketrans

-E

From biopython at maubp.freeserve.co.uk Mon Jul 5 14:16:18 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 5 Jul 2010 19:16:18 +0100
Subject: [Biopython-dev] Python 3 porting
Message-ID:

Hi all,

While Tiago and I have sorted out some of the easy stuff, there is still
plenty to do to make Biopython via 2to3 work nicely on Python 3 (and that's
just looking at the Python code, ignoring our C code, and also ignoring
things using NumPy).

We still have quite a few warnings using the -3 switch on Python 2.6 or 2.7
which we should probably concentrate on first. Note that deprecation
warnings in Python 2.7 are silent by default (so as not to bother end users,
which makes sense as this is the last Python 2.x series).
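Since deprecation warnings are silent by default, a test run has to enable them explicitly to see them. A minimal sketch using the standard `warnings` module follows; `old_api` is a hypothetical function, not part of Biopython.

```python
import warnings


def old_api():
    """Hypothetical deprecated function used only for this illustration."""
    warnings.warn("old_api is deprecated", DeprecationWarning, stacklevel=2)
    return 42


# Record every warning raised inside the block, including ones that the
# default filters would hide.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api()

messages = [str(w.message) for w in caught
            if issubclass(w.category, DeprecationWarning)]
```

From the command line, running the interpreter with `-Wd` (or `-W error::DeprecationWarning` to turn them into exceptions) has a similar effect for a whole test run.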
Peter From tiagoantao at gmail.com Mon Jul 5 16:13:00 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 5 Jul 2010 21:13:00 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: Hi, On Mon, Jul 5, 2010 at 7:16 PM, Peter wrote: > While Tiago and I have sorted out some of the easy stuff, there is still plenty > to do make Biopython via 2to3 work nicely on Python 3 (and that's just looking > at the Python code, ignoring our C code, and also ignoring things using NumPy). PopGen is now in. The interesting thing is that it has Bio.Application examples and they presented no problem at all. Nexus is also in. I also converted test_lowess (a VERY SIMPLE numpy example). Something seems to have broken one of the seqio tests as it blocks the test system (on py3) PhyloXML I am really stuck and NCBXML seems to have a problem only inside run_tests. Tomorrow I will have a look at PDB, KEGG, Emboss and clustalw. From chapmanb at 50mail.com Mon Jul 5 21:30:50 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 5 Jul 2010 21:30:50 -0400 Subject: [Biopython-dev] Slides for Biopython talk at BOSC 2010 Message-ID: <20100706013050.GE1664@kunkel> Hey all; I've got the honor of presenting Biopython at BOSC 2010, and have put the slides up here: http://www.slideshare.net/chapmanb/biopython-at-bosc-2010 The talk tries to place the lessons I've learned from the Biopython community this year within the broader framework of open source work. It's been great to see the community grow so much, and so please pay special attention to slide 6; did I miss your name? I suck: e-mail me so I can correct that. Happy to get any other thoughts or comments and looking forward to seeing folks in person who are coming to BOSC. If you will be in Boston on Thursday evening, think about stopping by my place for BBQ and beers: http://www.open-bio.org/wiki/Codefest_2010#BBQ Drop me an e-mail for my number and better directions. 
Thanks, Brad From biopython at maubp.freeserve.co.uk Tue Jul 6 06:03:29 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jul 2010 11:03:29 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: On Mon, Jul 5, 2010 at 7:16 PM, Peter wrote: > Hi all, > > While Tiago and I have sorted out some of the easy stuff, there is still plenty > to do make Biopython via 2to3 work nicely on Python 3 (and that's just looking > at the Python code, ignoring our C code, and also ignoring things using NumPy). I can get SFF output working - first by using the new io.BytesIO module (in Python 2.6+ as well) in place of StringIO for testing writing binary files (i.e. SFF output). This also requires a sprinkling of decode/encode calls in SffIO.py when reading/ writing the binary file - I'm not sure how to get that automatically with 2to3. Note that the SeqRecord's format method can currently return a single read in SFF file as a binary string. This won't be so sensible on Python 3 where a byte string makes more sense than unicode, so I think we should deprecate supporting binary files (i.e. SFF) in the SeqRecord's format method. > We still have quite a few warnings using the -3 switch on Python 2.6 or 2.7 > which we should probably concentrate on first. A lot of these are with changes to object comparison (the __cmp__ method is no more), which will need a little extra care, and the related issue of using cmp in list sorting (again, not supported anymore). Peter From biopython at maubp.freeserve.co.uk Tue Jul 6 06:36:39 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jul 2010 11:36:39 +0100 Subject: [Biopython-dev] Deprecating Bio.Crystal in next release? Message-ID: Hi all, Given recent discussion (and the lack of interest on the dev list on previous occasions), is there any objection to deprecating Bio.Crystal in the next release of Biopython? 
http://lists.open-bio.org/pipermail/biopython/2010-July/006633.html http://lists.open-bio.org/pipermail/biopython-dev/2008-October/004405.html http://lists.open-bio.org/pipermail/biopython-dev/2007-July/002901.html Thanks, Peter From biopython at maubp.freeserve.co.uk Tue Jul 6 09:40:51 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jul 2010 14:40:51 +0100 Subject: [Biopython-dev] Deprecating Bio.InterPro Message-ID: Hi all, Another old module which hasn't been updated for some time is Bio.InterPro, a parser for the HTML (webpages) at the EBI, e.g. http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001064 The parser doesn't work with the current website, and also uses a Python library called sgmllib which was deprecated as of Python 2.6. Website parsers are in general a bad idea because they tend to need a lot of work to keep up to date. Perhaps in this case there are suitable plain text files on the FTP site which might be used? Unless anyone has a good reason not to, we are going to deprecate the Bio.InterPro module in the next release of Biopython. Peter From biopython at maubp.freeserve.co.uk Tue Jul 6 10:03:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jul 2010 15:03:05 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/6 Tiago Antão : > > On Tue, Jul 6, 2010 at 11:03 AM, Peter wrote: >> This also requires a sprinkling of decode/encode calls in SffIO.py when reading/ >> writing the binary file - I'm not sure how to get that automatically with 2to3. > > Would it be problematic if the 2.x code had that? 2.6 at least > supports decode/encode. Of course I do not know the implications in > code that is highly string intensive like SeqIO stuff... but in other > places (test cases, very simple) it seems to work OK.
Python 2.4+ strings and unicode objects do support encode and decode, but we don't want to be converting from strings to unicode on Python 2.x - I want everything to stay as plain strings. Adding explicit decode calls would have side effects on Python 2.x (things becoming unicode), but would be needed for SFF parsing on Python 3. I could add explicit encode calls which would help SFF output under Python 3.x. This shouldn't change the functionality on Python 2.x, but I am a little concerned about it having a negative impact on the speed, but I have not measured this. We may need some big if statements... Peter From tiagoantao at gmail.com Tue Jul 6 09:43:29 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 6 Jul 2010 14:43:29 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: On Tue, Jul 6, 2010 at 11:03 AM, Peter wrote: > This also requires a sprinkling of decode/encode calls in SffIO.py when reading/ > writing the binary file - I'm not sure how to get that automatically with 2to3. Would it be problematic if the 2.x code had that? 2.6 at least supports decode/encode. Of course I do not know the implications in code that is highly string intensive like SeqIO stuff... but it other places (test cases, very simple) it seems to work OK. From tiagoantao at gmail.com Tue Jul 6 10:05:33 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 6 Jul 2010 15:05:33 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/6 Peter : > Python 2.4+ strings and unicode objects do support encode and decode, > but we don't want to be converting from strings to unicode on Python 2.x - > I want everything to stay as plain strings. Adding explicit decode calls > would have side effects on Python 2.x (things becoming unicode), but > would be needed for SFF parsing on Python 3. > Argh... I will have to correct some code I submitted (with decode). I am testing on 2.6.5. 
I will start testing on 2.4, it is safer From biopython at maubp.freeserve.co.uk Tue Jul 6 11:03:32 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jul 2010 16:03:32 +0100 Subject: [Biopython-dev] Extending test_PDB.py coverage? Message-ID: Hi all, I've been running the unit tests with the Python 3 warnings enabled (this needs either Python 2.6 or Python 2.7), e.g. python2.6 -3 run_tests.py python2.6 -3 run_tests.py test_PDB.py python2.6 -3 test_PDB.py There is a harmless glitch with test_1_warning in this mode (because it isn't expecting all the extra warnings). I was getting some DeprecationWarning messages about using "k in d" rather than d.has_key(k), which I fixed: http://github.com/biopython/biopython/commit/9b508b6a6391ac9d379a74cbb3cca1127e3c7aba Looking at the Bio/PDB/*.py files there are still quite a few more examples of has_key being used - but these are not being picked up by the unit tests: AbstractPropertyMap.py: def has_key(self, id): AbstractPropertyMap.py: >>> if map.has_key((chain_id, res_id)): AbstractPropertyMap.py: return self.property_dict.has_key(translated_id) DSSP.py: print d.has_key(('A', 1)) Entity.py: return self.child_dict.has_key(id) Entity.py: return self.child_dict.has_key(id) FragmentMapper.py: def has_key(self, res): FragmentMapper.py: return self.fd.has_key(res) FragmentMapper.py: if fm.has_key(r): MMCIFParser.py: if mmcif_dict.has_key("_atom_site.auth_seq_id"): NACCESS.py: if naccess_dict.has_key((chain_id, res_id)): NACCESS.py: if self.naccess_atom_dict.has_key(full_id): Residue.py: if _atom_name_dict.has_key(name1): Residue.py: if _atom_name_dict.has_key(name2): Selection.py: if not d.has_key(i): While we could just fix the has_key usage, this would be a good point to first extend the unit coverage - just in case we break something. 
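As a minimal sketch of the kind of change being discussed (the class below is a hypothetical stand-in, not the actual Bio.PDB code), a dict-backed class can gain a __contains__ method so that "k in d" works on both Python 2 and Python 3, while has_key is kept as a thin wrapper for existing callers:

```python
class PropertyMap(object):
    """Hypothetical dict-backed map; an assumption for illustration only."""

    def __init__(self, property_dict):
        self.property_dict = property_dict

    def __contains__(self, key):
        # Supports "key in obj", which works on Python 2 and Python 3.
        return key in self.property_dict

    def has_key(self, key):
        # Kept for backwards compatibility; delegates to __contains__.
        return key in self


prop_map = PropertyMap({("A", 1): 0.5})
found = ("A", 1) in prop_map
missing = ("B", 2) in prop_map
legacy = prop_map.has_key(("A", 1))
```

A unit test written against both spellings before the change would catch any accidental difference in behaviour afterwards.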
Some of these like DSSP and NACCESS are wrappers for command line tools, so new files test_PDB_DSSP.py and test_PDB_NACCESS.py would be sensible which can check for and run the tool if installed. Others like the Residue, Entity and Selection modules should be more straight forward to add directly to test_PDB.py itself. Are there any volunteers? Thanks, Peter From biopython at maubp.freeserve.co.uk Tue Jul 6 11:20:38 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jul 2010 16:20:38 +0100 Subject: [Biopython-dev] Deprecating Bio.Index? Message-ID: Hello all, Is anyone using the Bio.Index module in Biopython in their own code? This supported file indexing and was used in other parts of Biopython which have all now been deprecated (e.g. Bio.SwissProt.SProt and Bio.Prosite) or removed. The more recent Bio.SeqIO module provides a general approach to indexing sequence files. Would it inconvenience anyone if Bio.Index was deprecated in the next release (triggering warnings when imported, but still functional), and then removed later on? Thanks, Peter From eric.talevich at gmail.com Wed Jul 7 13:47:47 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 7 Jul 2010 13:47:47 -0400 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/5 Tiago Ant?o > Hi, > > On Mon, Jul 5, 2010 at 7:16 PM, Peter > wrote: > > While Tiago and I have sorted out some of the easy stuff, there is still > plenty > > to do make Biopython via 2to3 work nicely on Python 3 (and that's just > looking > > at the Python code, ignoring our C code, and also ignoring things using > NumPy). > > PhyloXML I am really stuck and NCBXML seems to have a problem only > inside run_tests. > > Hello, I ran "python -3 test_PhyloXML.py" and found one warning specific to PhyloXML, about comparing unequal types in BaseTree.py. I have a fix for this, shall I push it to GitHub? Was there anything else in Bio/Phylo/ that was causing problems? 
I'm just running the unit tests with the -3 flag, and didn't find any other issues that way. Thanks, Eric From bugzilla-daemon at portal.open-bio.org Wed Jul 7 21:45:33 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 7 Jul 2010 21:45:33 -0400 Subject: [Biopython-dev] [Bug 3109] New: Record class in Bio.SCOP.Cla has hierarchy member as list instead of dictionary Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3109 Summary: Record class in Bio.SCOP.Cla has hierarchy member as list instead of dictionary Product: Biopython Version: 1.54b Platform: PC URL: http://github.com/jfinkels/biopython/commit/6d2257dd0c46 abdf1ecd14b8bc660e32a205630a OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: jeffrey.finkelstein at gmail.com The Record class in the Bio.SCOP.Cla module has the hierarchy member as a list of key-value 2-tuples, but it should be a dictionary of key-value pairs. The SCOP Classification file format, http://scop.mrc-lmb.cam.ac.uk/scop/release-notes.html#scop-parseable-files , states that the order of the hierarchy key-value pairs in each record is unordered. This also allows easier access to the key-value pairs in a way that corresponds with the semantics of the file format specification. I have provided a fix at my own GitHub fork of Biopython. http://github.com/jfinkels/biopython/commit/6d2257dd0c46abdf1ecd14b8bc660e32a205630a In fixing this bug and the associated unit tests, I also changed the Record.__str__ method to output a string WITHOUT a trailing newline (which matches Python convention anyway). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
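To illustrate the difference described in bug 3109, here is a small sketch using made-up values (the keys follow the SCOP level abbreviations, but the numbers and variable names are assumptions, not taken from the patch). It contrasts looking up a hierarchy entry in a list of 2-tuples with looking it up in a dictionary:

```python
# Hypothetical hierarchy data in the old form: a list of (key, value) 2-tuples.
hierarchy_as_list = [("cl", 46456), ("cf", 46457), ("sf", 46458)]

# Proposed form: a dictionary, so lookups do not depend on position in the record.
hierarchy = dict(hierarchy_as_list)

# Direct lookup by key with the dictionary form.
superfamily = hierarchy["sf"]

# The same lookup on the list form needs a scan over all the pairs.
superfamily_from_list = [value for key, value in hierarchy_as_list if key == "sf"][0]
```

Since the file format does not guarantee an ordering of the pairs, the dictionary form matches its semantics more closely.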
From bugzilla-daemon at portal.open-bio.org Wed Jul 7 21:47:53 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 7 Jul 2010 21:47:53 -0400 Subject: [Biopython-dev] [Bug 3109] Record class in Bio.SCOP.Cla has hierarchy member as list instead of dictionary In-Reply-To: Message-ID: <201007080147.o681lrsM008729@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3109 ------- Comment #1 from jeffrey.finkelstein at gmail.com 2010-07-07 21:47 EST ------- Created an attachment (id=1522) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1522&action=view) Patch for bug 3109. This can also be found on my fork of Biopython at GitHub: http://github.com/jfinkels/biopython/commit/6d2257dd0c46abdf1ecd14b8bc660e32a205630a -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Jul 8 03:34:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 8 Jul 2010 08:34:10 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/7 Eric Talevich : > 2010/7/5 Tiago Antão > >> PhyloXML I am really stuck and NCBXML seems to have a problem only >> inside run_tests. > > Hello, > > I ran "python -3 test_PhyloXML.py" and found one warning specific to > PhyloXML, about comparing unequal types in BaseTree.py. I have a fix for > this, shall I push it to GitHub? > > Was there anything else in Bio/Phylo/ that was causing problems? I'm just > running the unit tests with the -3 flag, and didn't find any other issues > that way. Running the tests in Python 2.6 or 2.7 with -3 will spot a number of issues, and if we can fix them we should. Assuming your comparison fix is simple please go ahead and commit it. This will not spot everything (e.g. unicode and string problems).
Actually running 2to3 and then trying the tests on Python 3 will spot more or different problems (such as unicode/bytes problems). I think this is where Tiago was having trouble with phyloXML. Note that the 2to3 script will be slightly different depending on which copy you are using (i.e. which version of Python it came with). Peter From biopython at maubp.freeserve.co.uk Thu Jul 8 08:24:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 8 Jul 2010 13:24:28 +0100 Subject: [Biopython-dev] Equality in Bio.Restriction.RestrictionType Message-ID: Hi Frédéric et al, One of the things in Python 3 is that overriding equality (done with __eq__ only since __cmp__ has gone) requires you also override __hash__. One remaining example of this, which triggers a deprecation warning within our test suite when running with the -3 switch, is in Bio.Restriction. I therefore had a look at how __eq__ and __ne__ are defined in the RestrictionType class - and strangely they do NOT seem to be inverses.

    def __eq__(cls, other):
        """RE == other -> bool

        True if RE and other are the same enzyme."""
        return other is cls

    def __ne__(cls, other):
        """RE != other -> bool.

        isoschizomer strict, same recognition site, same restriction -> False
        all the other-> True"""
        if not isinstance(other, RestrictionType):
            return True
        elif cls.charac == other.charac:
            return False
        else:
            return True

Frédéric - could you clarify the intent here? Thanks, Peter From tiagoantao at gmail.com Thu Jul 8 16:55:05 2010 From: tiagoantao at gmail.com (Tiago Antão) Date: Thu, 8 Jul 2010 21:55:05 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: > Actually running 2to3 and then trying the tests on Python 3 will spot more > or different problems (such as unicode/bytes problems). I think this is where > Tiago was having trouble with phyloXML.
I suppose (correct me if I am wrong), that the main objective of the exercise is to make all the tests pass with Python 3 (while maintaining Python 2 compatibility). The second objective would be to find potential points of error that can be introduced by the changes and create even more tests on those points. The third would be to not let performance (speed/memory) degrade (String processing being the big issue here). -- "If you want to get laid, go to college. If you want an education, go to the library." - Frank Zappa From eric.talevich at gmail.com Thu Jul 8 18:19:07 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 8 Jul 2010 18:19:07 -0400 Subject: [Biopython-dev] Documentation Message-ID: Hello everyone, I recently read this interesting article by one of the Django developers: http://jacobian.org/writing/great-documentation/what-to-write/ The post describes three kinds of documentation a software project should have: 1. A tutorial giving an overview of the project's major areas -- not covering every feature, but giving the user a good enough understanding of the whole project. The Biopython Tutorial and Cookbook already covers this very well. If anything, we may have put more detailed information than necessary into the Tutorial. The length may also be a bit overwhelming for newcomers. 2. Topic guides for each of the project's components. As I understand it, the wiki should fill this role. We could manage this (and #1, simultaneously) by converting some less-essential portions of the Tutorial to wiki pages. 3. A detailed reference for the complete API. The article specifically states that docstring converters like epydoc are insufficient, and may give developers a false sense of having taken care of this part of the documentation. The Python project uses Sphinx now, as do quite a few other projects. It uses ReStructuredText as the markup syntax, and can (1) pull in docstrings automatically, and (2) run doctest on code samples. 
I think this would work nicely for Biopython. http://sphinx.pocoo.org/ http://docutils.sourceforge.net/ This would add a developer dependency on Docutils, a very healthy project, and of course Sphinx. Epydoc can also accept ReStructuredText as the markup syntax in docstrings, in place of epytext, if docutils is available. So, if we were to go that route, the upgrade path would look like: 1. Add docutils as a dependency for building the API docs, in addition to epydoc. 2. Convert the docstrings that use epytext to use ReStructuredText instead. (grep will help, and the changes are pretty robotic.) 3. When all docstrings are rst-compatible (plain text is OK), try running Sphinx with a stub page that just pulls in all the docstrings under Bio. (Or something like that.) Does it work? 4. If it works, figure out how to put the Sphinx-generated docs on biopython.org so people can use them. 5. Now that we have a bunch of stub pages that pull in each module's docstrings, start adding value to those stubs by moving API-reference-style parts of the wiki and Tutorial into the sphinx stubs. 6. Semi-independently of this, try trimming the Tutorial a bit to make some nice wiki pages. Does this sound worthwhile? All the best, Eric From tiagoantao at gmail.com Fri Jul 9 08:19:41 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 9 Jul 2010 13:19:41 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/9 Peter : > The primary aim is to get the main Biopython functionality working on > Python 3 (with an eye on performance), while maintaining Python 2 > support. Getting the unit tests working is just a step towards this - and > the more test coverage we have the more useful this will be for us. > But that is probably what you meant? Actually, a bit more: I don't know how to deal with cases for which there are no unit tests (and 2to3 warnings). -- "If you want to get laid, go to college.? 
If you want an education, go to the library." - Frank Zappa From biopython at maubp.freeserve.co.uk Fri Jul 9 09:25:40 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Jul 2010 14:25:40 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/9 Tiago Antão : > > 2010/7/9 Peter : >> The primary aim is to get the main Biopython functionality working on >> Python 3 (with an eye on performance), while maintaining Python 2 >> support. Getting the unit tests working is just a step towards this - and >> the more test coverage we have the more useful this will be for us. >> But that is probably what you meant? > > Actually, a bit more: I don't know how to deal with cases for which > there are no unit tests (and 2to3 warnings). The simple answer is we really need to write more unit tests ;) This will be tedious, but useful for improving the robustness of Biopython on Python 2.x as well as helping with porting to Python 3.x. For example, I recently asked if anyone would like to write some more tests for Bio.PDB (lots of things using has_key have no test coverage). Peter From tiagoantao at gmail.com Fri Jul 9 08:30:36 2010 From: tiagoantao at gmail.com (Tiago Antão) Date: Fri, 9 Jul 2010 13:30:36 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Thu, Jul 8, 2010 at 11:19 PM, Eric Talevich wrote: > The Python project uses Sphinx now, as do quite a few other projects. It > uses ReStructuredText as the markup syntax, and can (1) pull in docstrings > automatically, and (2) run doctest on code samples. I think this would work > nicely for Biopython. Just to show another example along these lines (a computational biology one), from the forward-time population genetics simulator, simuPOP.
http://simupop.sourceforge.net/Main/Documentation Tiago From biopython at maubp.freeserve.co.uk Fri Jul 9 05:40:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Jul 2010 10:40:10 +0100 Subject: [Biopython-dev] Python 3 subprocess bytes vs unicode Message-ID: Hi all, Many of the unit tests failing on Python 3.1 after using 2to3 are when calling external command line tools. Interestingly in Py3k the sys.stdin, sys.stdout and sys.stderr are in text mode by default - they automatically give you unicode strings instead of the raw bytes. This makes sense to me (and you can get at the bytes if you want them): http://docs.python.org/py3k/library/sys.html However, the stdin, stdout and stderr of any child process created with subprocess default to binary mode, and so return or expect bytes - not unicode strings: http://docs.python.org/py3k/library/subprocess.html It looks like we'll want to use universal_newlines=True when calling subprocess so that we can treat subprocess handles as text mode (i.e. unicode strings not bytes). This option is also present on Python 2, where it just controls the automatic handling of new line characters - so should be harmless (or even a good idea). This seems like a more elegant option than adding lots of encode/decode calls when doing IO with child processes (which I think Tiago has tried). Peter P.S. if we make our command line wrappers callable (or add some kind of run method) as previously discussed, it can set this option when calling subprocess. From biopython at maubp.freeserve.co.uk Fri Jul 9 04:01:43 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Jul 2010 09:01:43 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/8 Tiago Antão : >> Actually running 2to3 and then trying the tests on Python 3 will spot more >> or different problems (such as unicode/bytes problems). I think this is where >> Tiago was having trouble with phyloXML.
> > I suppose (correct me if I am wrong), that the main objective of the > exercise is to make all the tests pass with Python 3 (while > maintaining Python 2 compatibility). The second objective would be to > find potential points of error that can be introduced by the changes > and create even more tests on those points. The third would be to not > let performance (speed/memory) degrade (String processing being the > big issue here). The primary aim is to get the main Biopython functionality working on Python 3 (with an eye on performance), while maintaining Python 2 support. Getting the unit tests working is just a step towards this - and the more test coverage we have the more useful this will be for us. But that is probably what you meant? Peter From biopython at maubp.freeserve.co.uk Fri Jul 9 04:15:31 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Jul 2010 09:15:31 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Thu, Jul 8, 2010 at 11:19 PM, Eric Talevich wrote: > Hello everyone, > > I recently read this interesting article by one of the Django developers: > http://jacobian.org/writing/great-documentation/what-to-write/ I don't agree with everything he said, but interesting. > The post describes three kinds of documentation a software project should > have: > > 1. A tutorial giving an overview of the project's major areas -- not > covering every feature, but giving the user a good enough understanding of > the whole project. > > The Biopython Tutorial and Cookbook already covers this very well. If > anything, we may have put more detailed information than necessary into the > Tutorial. The length may also be a bit overwhelming for newcomers. > > 2. Topic guides for each of the project's components. > > As I understand it, the wiki should fill this role. We could manage this > (and #1, simultaneously) by converting some less-essential portions of the > Tutorial to wiki pages. > > 3. 
A detailed reference for the complete API. > > The article specifically states that docstring converters like epydoc are > insufficient, and may give developers a false sense of having taken care of > this part of the documentation. His idea of an introductory tutorial is more a walk though example. > The Python project uses Sphinx now, as do quite a few other projects. It > uses ReStructuredText as the markup syntax, and can (1) pull in docstrings > automatically, and (2) run doctest on code samples. I think this would work > nicely for Biopython. > > http://sphinx.pocoo.org/ > http://docutils.sourceforge.net/ > > This would add a developer dependency on Docutils, a very healthy project, > and of course Sphinx. Epydoc can also accept ReStructuredText as the markup > syntax in docstrings, in place of epytext, if docutils is available. > > So, if we were to go that route, the upgrade path would look like: > > 1. Add docutils as a dependency for building the API docs, in addition to > epydoc. > 2. Convert the docstrings that use epytext to use ReStructuredText instead. > (grep will help, and the changes are pretty robotic.) > 3. When all docstrings are rst-compatible (plain text is OK), try running > Sphinx with a stub page that just pulls in all the docstrings under Bio. (Or > something like that.) Does it work? > 4. If it works, figure out how to put the Sphinx-generated docs on > biopython.org so people can use them. > 5. Now that we have a bunch of stub pages that pull in each module's > docstrings, start adding value to those stubs by moving API-reference-style > parts of the wiki and Tutorial into the sphinx stubs. i.e. Move from epydoc to sphinx? That would probably make things much prettier - and could make the docstrings more accessible. We could even move the main tutorial from LaTeX to sphinx as well - it can make nice HTML and PDF files. > 6. Semi-independently of this, try trimming the Tutorial a bit to make some > nice wiki pages. 
Wiki pages have some major drawbacks for primary documentation - they are not in git for a start which means version tracking is separate from the code version tracking. They also would be hard to bundle into the offline documentation. I'm not keen on this, beyond moving some "cookbook" examples to the wiki. Peter From tiagoantao at gmail.com Sat Jul 10 16:42:00 2010 From: tiagoantao at gmail.com (Tiago Antão) Date: Sat, 10 Jul 2010 21:42:00 +0100 Subject: [Biopython-dev] 2to3 and doctests Message-ID: Hi, There are a couple of issues with 2to3 and biopython doctests. 1. There is a bug in 2to3 which crashes the tool with some doctests. This bug was recognized by the python team and corrected (but only on svn). It is very easy to solve: in file refactor.py (lib2to3 python library) replace
    if self.log.isEnabledFor(logging.DEBUG):
with
    if self.logger.isEnabledFor(logging.DEBUG):
See http://svn.python.org/view/sandbox/trunk/2to3/lib2to3/refactor.py?r1=81478&r2=82779 And http://bugs.python.org/issue9217 This probably affects all versions of 2to3 (2.6.5 to 3.1.2) 2. Some of our doctests are incorrectly specified, one example from Phylo/BaseTree.py
        >>> for clade in tree.find_clades(branch_length=True, order='level'):
        >>>     if (clade.branch_length < .5 and
        >>>         not clade.is_terminal() and
        >>>         clade is not self.root):
        >>>         tree.collapse(clade)
According to documentation we are supposed to use "..." on continuation lines, not ">>>". See http://bugs.python.org/issue9221 2to3 seems to be more sensitive to this than python when running the tests. If nobody opposes, I will convert all doctests to correct variations -- "If you want to get laid, go to college. If you want an education, go to the library."
- Frank Zappa From eric.talevich at gmail.com Sat Jul 10 23:34:18 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 10 Jul 2010 23:34:18 -0400 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: Hi guys, NumPy is keeping notes on what they did to make their code work on Python 3. Have you seen this? http://projects.scipy.org/numpy/browser/trunk/doc/Py3K.txt They use 2to3 in setup.py, too. (Sorry for the lag, my internet access here is shaky.) Cheers, Eric From tiagoantao at gmail.com Sun Jul 11 05:30:04 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Sun, 11 Jul 2010 10:30:04 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/11 Eric Talevich : > NumPy is keeping notes on what they did to make their code work on Python 3. > Have you seen this? > http://projects.scipy.org/numpy/browser/trunk/doc/Py3K.txt > > They use 2to3 in setup.py, too. I did not know about that link, many thanks. But their use of 2to3 on setup.py seems very good (BTW, the setup that I've sent you in a previous message does that and is inspired in numpy). Inspired on numpy, here is a suggestion on how things might work in a biopython version that is both 2 and 3 compatible: 1. There is a single code base written in Python 2. This code base is "3-aware" (just check Peter's commits in the last few days for lots of examples of this) in the sense that some constructs are not possible. A few (very rare?) if sys.version_info[0]==3 do exist. 2. On setup.py, if python3 is detected 2to3 is called and the code is converted. As the code base was sensibly prepared, the code will compile on 3 with just 2to3 (no need for manual intervention at all). This means a single code base (no branching). Let me repeat this, as I think it is important from a maintenance perspective: no need for different branches! 
Also note that my prototype setup.py (anyone interested please send me an email and I will send a copy out of list - just to avoid attachments to the list) is both 2 and 3 compatible (runs on both versions unchanged) but it still has some flaws: no doctest conversion and no test conversion. But it illustrates the point that a setup.py (2to3 based) like numpy works for biopython. This means development proceeds in 2.x (code is converted from 2 to 3, not the opposite). I was thinking of doing a small script that every night does a git pull, runs the tests in python3 and, if something that was py3k compatible in the past does break, then it sends an email to biopython-dev. The point of this would be to make development the least cumbersome possible: people do not want to have to test everything in BOTH 2 and 3 (just 2). They only have to intervene (and are only informed) if there is a new problem. Best, Tiago From biopython at maubp.freeserve.co.uk Sun Jul 11 05:42:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 11 Jul 2010 10:42:24 +0100 Subject: [Biopython-dev] 2to3 and doctests In-Reply-To: References: Message-ID: 2010/7/10 Tiago Antão : > Hi, > > There are a couple of issues with 2to3 and biopython doctests. > > 1. There is a bug in 2to3 which crashes the tool with some doctests. > This bug was recognized by the python team and corrected (but only on > svn). > It is very easy solve, in file refactor.py (lib2to3 python library) replace > if self.log.isEnabledFor(logging.DEBUG): > with > if self.logger.isEnabledFor(logging.DEBUG): > See > http://svn.python.org/view/sandbox/trunk/2to3/lib2to3/refactor.py?r1=81478&r2=82779 > And > http://bugs.python.org/issue9217 > This affects probably all versions of 2to3 (2.6.5 to 3.1.2) Thanks for the alert & links > 2. Some of our doctests are incorrectly specified, one example from > Phylo/BaseTree.py
>         >>> for clade in tree.find_clades(branch_length=True, order='level'):
>         >>>     if (clade.branch_length < .5 and
>         >>>         not clade.is_terminal() and
>         >>>         clade is not self.root):
>         >>>         tree.collapse(clade)
> According to documentation we are supposed to use "..." on > continuation lines, not ">>>". > See http://bugs.python.org/issue9221 > 2to3 seems to be more sensitive to this than python when running the tests. > > If nobody opposes, I will convert all doctests to correct variations Yes, those >>> should be ... so please go ahead and fix them. Thanks, Peter From biopython at maubp.freeserve.co.uk Sun Jul 11 05:47:09 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 11 Jul 2010 10:47:09 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/11 Tiago Antão : > 2010/7/11 Eric Talevich : >> NumPy is keeping notes on what they did to make their code work on Python 3. >> Have you seen this? >> http://projects.scipy.org/numpy/browser/trunk/doc/Py3K.txt >> >> They use 2to3 in setup.py, too. > > I did not know about that link, many thanks. > > But their use of 2to3 on setup.py seems very good (BTW, the setup that > I've sent you in a previous message does that and is inspired in > numpy). > > Inspired on numpy, here is a suggestion on how things might work in a > biopython version that is both 2 and 3 compatible: Hi all, While at EuroSciPy 2010 I've been chatting to Pauli Virtanen and David Cournapeau about how NumPy etc are doing things - they have got a working single code base written in Python 2.x which supports Python 3 via the 2to3 script, and plan to continue like this for the medium term. For their C code, the usual #ifdef tricks are used.
See also: http://mail.scipy.org/pipermail/numpy-discussion/2010-July/051436.html Peter
From biopython at maubp.freeserve.co.uk Sun Jul 11 05:52:26 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 11 Jul 2010 10:52:26 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/11 Tiago Antão : > > I was thinking in doing a small script that every night does a git > pull, runs the tests in python3 and, if something that was py3k > compatible in the past does break, then it sends an email to > biopython-dev. The point of this would be to make development the > least cumbersome possible: people do not want to have to test > everything in BOTH 2 and 3 (just 2). They only have to intervene > (and are only informed) if there is a new problem. > That is worth doing, but beyond that I've been thinking about some kind of buildbot doing nightly builds and tests on assorted machines, pushing the reports to the webserver. Doing this on Python 3.1 as well as Python 2.4 to 2.7 would be great. We could probably have a simple HTML upload to the server using an SSH key (no new services or software required on the server), ideally via a new restricted user account with access only to one folder on the website. Peter
From tiagoantao at gmail.com Sun Jul 11 12:44:04 2010 From: tiagoantao at gmail.com (Tiago Antão) Date: Sun, 11 Jul 2010 17:44:04 +0100 Subject: [Biopython-dev] Extending test_PDB.py coverage? In-Reply-To: References: Message-ID: On Tue, Jul 6, 2010 at 4:03 PM, Peter wrote: > Looking at the Bio/PDB/*.py files there are still quite a few more examples > of has_key being used - but these are not being picked up by the unit tests: Just a side note: There are doctests in Bio.PDB, but these are not activated in run_tests.py. Is this correct?
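As a rough illustration of the side note above, module doctests can be gathered into an ordinary unittest run with the standard library's doctest.DocTestSuite. This is only a sketch: how run_tests.py actually collects doctests is not shown in this thread, and the helper name build_doctest_suite and the stand-in module demo_module are hypothetical (a real run would pass the Bio.PDB modules instead).

```python
import doctest
import types
import unittest


def build_doctest_suite(modules):
    """Collect the doctests of each given module into one unittest suite."""
    suite = unittest.TestSuite()
    for module in modules:
        # DocTestSuite wraps every docstring example found in the module.
        suite.addTests(doctest.DocTestSuite(module))
    return suite


# Hypothetical stand-in module so the sketch runs without Biopython installed.
demo = types.ModuleType("demo_module")
exec(
    'def double(x):\n'
    '    """Return twice x.\n'
    '\n'
    '    >>> double(21)\n'
    '    42\n'
    '    """\n'
    '    return 2 * x\n',
    demo.__dict__,
)

suite = build_doctest_suite([demo])
result = unittest.TestResult()
suite.run(result)
print("doctests run:", result.testsRun, "ok:", result.wasSuccessful())
```

The same helper could, in principle, be handed a list of imported modules from a test runner, so that docstring examples fail the build the same way unit tests do.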
From eric.talevich at gmail.com Mon Jul 12 11:47:55 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 12 Jul 2010 11:47:55 -0400 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Fri, Jul 9, 2010 at 4:15 AM, Peter wrote: > On Thu, Jul 8, 2010 at 11:19 PM, Eric Talevich > wrote: > > So, if we were to go that route, the upgrade path would look like: > > > > 1. Add docutils as a dependency for building the API docs, in addition to > > epydoc. > > 2. Convert the docstrings that use epytext to use ReStructuredText > instead. > > (grep will help, and the changes are pretty robotic.) > > 3. When all docstrings are rst-compatible (plain text is OK), try running > > Sphinx with a stub page that just pulls in all the docstrings under Bio. > (Or > > something like that.) Does it work? > > 4. If it works, figure out how to put the Sphinx-generated docs on > > biopython.org so people can use them. > > 5. Now that we have a bunch of stub pages that pull in each module's > > docstrings, start adding value to those stubs by moving > API-reference-style > > parts of the wiki and Tutorial into the sphinx stubs. > > i.e. Move from epydoc to sphinx? That would probably make things much > prettier - and could make the docstrings more accessible. We could even > move the main tutorial from LaTeX to sphinx as well - it can make nice > HTML and PDF files. > OK, I'll start a branch for this on GitHub. Do you have a preference for how I handle the new docutils dependency? I thought I'd just document it somewhere, similar to how the Tutorial's current hevea dependency is mentioned. I'll work on getting epydoc to work with docutils/ReStructuredText first, then start a reference manual under Doc/reference/ after that. > > 6. Semi-independently of this, try trimming the Tutorial a bit to make > some > > nice wiki pages. 
> > Wiki pages have some major drawbacks for primary documentation - they > are not in git for a start which means version tracking is separate from > the > code version tracking. They also would be hard to bundle into the offline > documentation. I'm not keen on this, beyond moving some "cookbook" > examples to the wiki. > > OK. I'll leave this part for the end, then, and just make note of which parts of the Tutorial seem tangential enough to be moved to cookbook pages on the wiki. I expect a bigger portion of the Tutorial could be moved to the new reference manual instead -- for example, most of the API explanations in the Phylo chapter. I like the way simuPOP separates the user guide and reference, although the Table of Contents pages are a little unwieldy... Best, Eric From biopython at maubp.freeserve.co.uk Tue Jul 13 05:59:38 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 13 Jul 2010 10:59:38 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: 2010/7/12 Eric Talevich : > Peter wrote: >> i.e. Move from epydoc to sphinx? That would probably make things much >> prettier - and could make the docstrings more accessible. We could even >> move the main tutorial from LaTeX to sphinx as well - it can make nice >> HTML and PDF files. > > OK, I'll start a branch for this on GitHub. Do you have a preference for how > I handle the new docutils dependency? I thought I'd just document it > somewhere, similar to how the Tutorial's current hevea dependency is > mentioned. > > I'll work on getting epydoc to work with docutils/ReStructuredText first, > ... So in order to move the API docs to Sphinx, they have to be formatted as reStructuredText (rather than plain text or epytext as we use now)? The good news is epydoc can also support reStructuredText (important during transition). That will be a big bit of work, but can be done on a module by module basis. See: http://epydoc.sourceforge.net/othermarkup.html#restructuredtext > ... 
> then start a reference manual under Doc/reference/ after that. Can you clarify what you idea is here? Split the current Tutorial.tex LaTeX file into a more introductory walk through, and a more technical reference manual? I'd rather more technical material into the API docs (i,e. the docstrings) and keep Tutorial.tex more introductory. Regards, Peter From eric.talevich at gmail.com Tue Jul 13 11:17:50 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 13 Jul 2010 11:17:50 -0400 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: 2010/7/13 Peter > 2010/7/12 Eric Talevich : > > Peter wrote: > >> i.e. Move from epydoc to sphinx? That would probably make things much > >> prettier - and could make the docstrings more accessible. We could even > >> move the main tutorial from LaTeX to sphinx as well - it can make nice > >> HTML and PDF files. > > > > OK, I'll start a branch for this on GitHub. Do you have a preference for > how > > I handle the new docutils dependency? I thought I'd just document it > > somewhere, similar to how the Tutorial's current hevea dependency is > > mentioned. > > > > I'll work on getting epydoc to work with docutils/ReStructuredText first, > > ... > > So in order to move the API docs to Sphinx, they have to be formatted > as reStructuredText (rather than plain text or epytext as we use now)? > The good news is epydoc can also support reStructuredText (important > during transition). That will be a big bit of work, but can be done on a > module by module basis. > > See: > http://epydoc.sourceforge.net/othermarkup.html#restructuredtext > Yeah, I think that's the best way to go. I once considered using reStructuredText for Bio.Phylo instead of epytext, but was deterred by the extra dependency. So, my branch for this (not on github yet) will first just convert all the docstrings to at least work with reStructuredText, and hopefully the plain-text docstrings will generally Just Work. 
Once that's done, and Epydoc will handle all the docstrings as reStructuredText without any problems, I think it would be a good time to merge that work into the trunk so we can all start/continue writing rst-compatible docstrings. > > ... > > then start a reference manual under Doc/reference/ after that. > > Can you clarify what you idea is here? Split the current Tutorial.tex > LaTeX file into a more introductory walk through, and a more technical > reference manual? I'd rather more technical material into the API > docs (i,e. the docstrings) and keep Tutorial.tex more introductory. > > As I understand it, using Sphinx for API docs requires creating a .rst document for each sub-package. The document can be a stub, containing just a command to pull in the module docstrings: http://sphinx.pocoo.org/ext/autosummary.html Incidentally, we could set it up to run doctest from here, too: http://sphinx.pocoo.org/ext/doctest.html In any case, I won't touch Tutorial.tex at first. I'll just set up the stubs for pulling in docstrings, and call that a minimal Sphinx reference manual, separate from the Tutorial. Then we should figure out how to make the reference manual easy to view (at least for anyone with a Git branch), and at least think about how it should be published on biopython.org -- I think it's just static .html files, so this shouldn't be too hard. Once we're happy with Sphinx as a replacement for Epydoc, and are able to make the reference manual available through the same sources as the Tutorial, then we'd be free to move pieces of the Tutorial to the reference manual, as appropriate -- adding longer descriptions and examples to the .rst documents that were previously just stubs. For example, my Bio.Phylo chapter in the Tutorial has detailed API descriptions that should be moved to the reference. The BLAST chapter has a complete class diagram which also seems like reference material to me. 
There's also some BioSQL material scattered around the internet that would be more helpful if aggregated into a complete, up-to-date reference. Sound like a plan? -Eric From biopython at maubp.freeserve.co.uk Tue Jul 13 11:56:47 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 13 Jul 2010 16:56:47 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Tue, Jul 13, 2010 at 4:17 PM, Eric Talevich wrote: >Peter wrote: >> So in order to move the API docs to Sphinx, they have to be formatted >> as reStructuredText (rather than plain text or epytext as we use now)? >> The good news is epydoc can also support reStructuredText (important >> during transition). That will be a big bit of work, but can be done on a >> module by module basis. >> >> See: >> http://epydoc.sourceforge.net/othermarkup.html#restructuredtext >> > > Yeah, I think that's the best way to go. I once considered using > reStructuredText for Bio.Phylo instead of epytext, but was deterred by the > extra dependency. So, my branch for this (not on github yet) will first just > convert all the docstrings to at least work with reStructuredText, and > hopefully the plain-text docstrings will generally Just Work. > > Once that's done, and Epydoc will handle all the docstrings as > reStructuredText without any problems, I think it would be a good time to > merge that work into the trunk so we can all start/continue writing > rst-compatible docstrings. Sounds good. I would like to keep the reStructuredText as simple as possible for human readers (i.e. when looking at the doctext within Python). I think the NumPy project had similar aims, and have documented this - e.g. here: http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines >> > ... >> > then start a reference manual under Doc/reference/ after that. >> >> Can you clarify what you idea is here? 
Split the current Tutorial.tex >> LaTeX file into a more introductory walk through, and a more technical >> reference manual? I'd rather more technical material into the API >> docs (i,e. the docstrings) and keep Tutorial.tex more introductory. >> >> > As I understand it, using Sphinx for API docs requires creating a .rst > document for each sub-package. The document can be a stub, containing > just a command to pull in the module docstrings: > http://sphinx.pocoo.org/ext/autosummary.html That fits with my impression from chatting to NumPy folk at EuroSciPy 2010. Loads of stub RST files sounds like a bit of a pain, but I can live with it. > Incidentally, we could set it up to run doctest from here, too: > http://sphinx.pocoo.org/ext/doctest.html > > In any case, I won't touch Tutorial.tex at first. I'll just set up the stubs > for pulling in docstrings, and call that a minimal Sphinx reference manual, > separate from the Tutorial. Then we should figure out how to make the > reference manual easy to view (at least for anyone with a Git branch), and > at least think about how it should be published on biopython.org -- I think > it's just static .html files, so this shouldn't be too hard. Sounds OK... > Once we're happy with Sphinx as a replacement for Epydoc, and are able to > make the reference manual available through the same sources as the > Tutorial, then we'd be free to move pieces of the Tutorial to the reference > manual, as appropriate -- adding longer descriptions and examples to the > .rst documents that were previously just stubs. Or move them into the module docstrings instead? > For example, my Bio.Phylo chapter in the Tutorial has detailed API > descriptions that should be moved to the reference. The BLAST chapter has a > complete class diagram which also seems like reference material to me. > There's also some BioSQL material scattered around the internet that would > be more helpful if aggregated into a complete, up-to-date reference. 
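To make the "stub .rst file per sub-package" idea concrete, here is a small sketch that generates such stubs. It assumes the Sphinx autodoc extension (its automodule directive) is enabled in the project configuration; the helper name make_autodoc_stub and the choice of heading style are illustrative only, not something agreed in this thread.

```python
def make_autodoc_stub(module_name):
    """Return the text of a minimal .rst stub for one module or package."""
    title = module_name
    underline = "=" * len(title)
    return (
        f"{title}\n"
        f"{underline}\n"
        "\n"
        # automodule pulls the module docstring in; :members: adds the
        # docstrings of its public classes and functions.
        f".. automodule:: {module_name}\n"
        "   :members:\n"
    )


stub = make_autodoc_stub("Bio.Phylo")
print(stub)
```

Longer descriptions and examples could later be added by hand above or below the directive, which is the "adding value to the stubs" step described earlier.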
Regarding BioSQL, what online bits are you referring to beyond this: http://www.biopython.org/wiki/BioSQL (and the LaTeX file referenced)? Peter From vsbuffalo at gmail.com Tue Jul 13 18:24:50 2010 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Tue, 13 Jul 2010 15:24:50 -0700 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: Hi All, I'd like to become more active in the Biopython project, and porting the documentation to Sphinx seems like an excellent way to begin. Is there a wiki or other website for allocating docstrings/other documentation to be rewritten in reStructuredText? Vince -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." -Richard Dawkins From biopython at maubp.freeserve.co.uk Tue Jul 13 19:04:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Jul 2010 00:04:28 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Tue, Jul 13, 2010 at 11:24 PM, Vince S. Buffalo wrote: > Hi All, > > I'd like to become more active in the Biopython project, and porting the > documentation to Sphinx seems like an excellent way to begin. Is there a > wiki or other website for allocating docstrings/other documentation to be > rewritten in reStructuredText? > > Vince Hi Vince, Volunteers to help would be great. In terms of a wiki or website system, I guess you are aware of or have used the NumPy system. They put a lot of effort into setting up a workflow to edit docstrings via a wiki, before manual merging into the code base. We don't have anything like that. For now, it would be a case of making a fork on github, and editing Python source code files one by one to convert their docstrings into reStructuredText (plus checking the output works in epydoc, and making sure this doesn't break any doctests). We'd then be able to pull your changes into the trunk (manually). 
Are you familiar with git, github, epydoc and/or Sphinx? Regards, Peter From vsbuffalo at gmail.com Tue Jul 13 19:26:51 2010 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Tue, 13 Jul 2010 16:26:51 -0700 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: I am familiar with git, github and Sphinx, but not epydoc. Would initial draft version of the tutorial to Sphinx be a good first move? best, Vince On Tue, Jul 13, 2010 at 4:04 PM, Peter wrote: > On Tue, Jul 13, 2010 at 11:24 PM, Vince S. Buffalo > wrote: > > Hi All, > > > > I'd like to become more active in the Biopython project, and porting the > > documentation to Sphinx seems like an excellent way to begin. Is there a > > wiki or other website for allocating docstrings/other documentation to be > > rewritten in reStructuredText? > > > > Vince > > Hi Vince, > > Volunteers to help would be great. In terms of a wiki or website system, > I guess you are aware of or have used the NumPy system. They put a > lot of effort into setting up a workflow to edit docstrings via a wiki, > before > manual merging into the code base. We don't have anything like that. > > For now, it would be a case of making a fork on github, and editing > Python source code files one by one to convert their docstrings into > reStructuredText (plus checking the output works in epydoc, and > making sure this doesn't break any doctests). We'd then be able > to pull your changes into the trunk (manually). > > Are you familiar with git, github, epydoc and/or Sphinx? > > Regards, > > Peter > -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." 
-Richard Dawkins From eric.talevich at gmail.com Tue Jul 13 19:42:47 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 13 Jul 2010 19:42:47 -0400 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Tue, Jul 13, 2010 at 7:26 PM, Vince S. Buffalo wrote: > I am familiar with git, github and Sphinx, but not epydoc. Would initial > draft version of the tutorial to Sphinx be a good first move? > > best, > Vince > > Hi Vince, Converting both to Sphinx would be awesome, but if you're looking to learn about Biopython in depth, I'd recommend starting by converting the docstrings to reStructuredText. In the current Biopython source tree, you can grep for "__docformat__" to identify modules that are already using Epytext markup; those should be converted first. See: http://epydoc.sourceforge.net/manual-othermarkup.html Then, you can try running Epydoc with the option to interpret all docstrings as reStructuredText, rather than plain text. Make sure you're in a new, empty directory outside the Biopython source tree, and use the command: epydoc --html --verbose --docformat restructuredtext Bio BioSQL This should identify any remaining issues, including any dependencies you're missing. Best, Eric From biopython at maubp.freeserve.co.uk Wed Jul 14 06:12:03 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Jul 2010 11:12:03 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Wed, Jul 14, 2010 at 12:42 AM, Eric Talevich wrote: > On Tue, Jul 13, 2010 at 7:26 PM, Vince S. Buffalo wrote: > >> I am familiar with git, github and Sphinx, but not epydoc. You'll be fine then - once epydoc is installed (easy on Linux as it should be in your distribution's packages) its just one line to execute - see also: http://biopython.org/wiki/Building_a_release >> Would initial draft version of the tutorial to Sphinx be a good first move? 
>> >> best, >> Vince >> >> > Hi Vince, > > Converting both to Sphinx would be awesome, but if you're looking to learn > about Biopython in depth, I'd recommend starting by converting the > docstrings to reStructuredText. As Eric says, we would suggest starting with docstrings. > In the current Biopython source tree, you can grep for "__docformat__" to > identify modules that are already using Epytext markup; those should be > converted first. See: > http://epydoc.sourceforge.net/manual-othermarkup.html Note that with epydoc you can have different python files using different mark up (this is the __docformat__ thing Eric mentioned). Most of ours are plain text, some use epytext, soon some will use reStructuredText. The advantage of this is we can translate things gradually (file by file). Anything already using epytext should be quite clear and easy to convert to reStructuredText. Anything using plain text may need a little more work. Personally I'd suggest you pick modules you are familiar with to update first. Peter From biopython at maubp.freeserve.co.uk Wed Jul 14 06:49:16 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Jul 2010 11:49:16 +0100 Subject: [Biopython-dev] 2to3 and doctests In-Reply-To: References: Message-ID: 2010/7/11 Peter : > 2010/7/10 Tiago Ant?o : >> Hi, >> >> There are a couple of issues with 2to3 and biopython doctests. >> >> 1. There is a bug in 2to3 which crashes the tool with some doctests. >> This bug was recognized by the python team and corrected ... >> >> 2. Some of our doctests are incorrectly specified, one example from >> Phylo/BaseTree.py ... 
We're not the only people to run into problems with 2to3 and doctest not working properly: http://lucumr.pocoo.org/2010/2/11/porting-to-python-3-a-guide Most of our doctests still work after 2to3 and I guess we can look at the failures on a case by case basis (and for expediency move them into proper unit tests or remove them if they can't be tweaked to work on both Python 2.x and 3.x). Peter
From tiagoantao at gmail.com Wed Jul 14 07:13:48 2010 From: tiagoantao at gmail.com (Tiago Antão) Date: Wed, 14 Jul 2010 12:13:48 +0100 Subject: [Biopython-dev] 2to3 and doctests In-Reply-To: References: Message-ID: 2010/7/14 Peter : > We're not the only people to run into problems with 2to3 and doctest > not working properly: > > http://lucumr.pocoo.org/2010/2/11/porting-to-python-3-a-guide > > Most of our doctests still work after 2to3 and I guess we can look at > the failures on a case by case basis (and for expediency move them > into proper unit tests or remove them if they can't be tweaked to > work on both Python 2.x and 3.x). I don't have the cases here, but they are only a handful. I suppose they can either be converted to (ugly) single-liners or to unit-tests. Anyway, it is a pity that 2to3 doctests are in such a state. Because all the rest seems to work quite well. Tiago PS - I can do this today, just tell me if you prefer single-liners or unit-tests
From biopython at maubp.freeserve.co.uk Wed Jul 14 07:52:48 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Jul 2010 12:52:48 +0100 Subject: [Biopython-dev] 2to3 and doctests In-Reply-To: References: Message-ID: 2010/7/14 Tiago Antão : > > 2010/7/14 Peter : >> We're not the only people to run into problems with 2to3 and doctest >> not working properly: >> >> http://lucumr.pocoo.org/2010/2/11/porting-to-python-3-a-guide >> >> Most of our doctests still work after 2to3 and I guess we can look at >> the failures on a case by case basis (and for expediency move them >> into proper unit tests or remove them if they can't be tweaked to >> work on both Python 2.x and 3.x). > > I don't have the cases here, but they are only a handful. I suppose > they can either be converted to (ugly) single-liners or to unit-tests. > > Anyway, it is a pity that 2to3 doctests are in such a state. Because > all the rest seems to work quite well. > > Tiago > PS - I can do this today, just tell me if you prefer single-liners or unit-tests So is it basically just multi-line doctests with slash continuation chars we have a problem with? Those were always a bit ugly - but seemed the best solution given the desire to limit ourselves to 80 character lines. Could you make them single liners for now (least work to get the tests to pass), and I'll take a look at the commit later. If these are from my examples I'll then have a think about how better to handle them.
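For reference, the sketch below contrasts the two prompt styles discussed in this thread, using only the standard doctest module. The doctest texts and the helper run_doctest_text are invented for illustration and are not taken from Biopython. With "..." on the continuation lines the loop is parsed as one example; with ">>>" repeated, each line is parsed as its own example and, as far as I can tell, fails on its own.

```python
import doctest

# Continuation lines use "..." so the whole loop is a single example.
GOOD = '''
>>> short = []
>>> for length in [0.2, 0.7, 0.4]:
...     if (length < .5 and
...             length > 0):
...         short.append(length)
>>> short
[0.2, 0.4]
'''

# Every line starts with ">>>", so each is treated as a separate example.
BAD = '''
>>> for length in [0.2, 0.7, 0.4]:
>>>     print(length)
'''


def run_doctest_text(text, name):
    """Parse and run doctest examples held in a plain string."""
    test = doctest.DocTestParser().get_doctest(text, {}, name, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the counts are of interest here.
    return runner.run(test, out=lambda message: None)


good = run_doctest_text(GOOD, "good_prompts")
bad = run_doctest_text(BAD, "bad_prompts")
print("good:", good)
print("bad:", bad)
```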
Peter
From peter at maubp.freeserve.co.uk Wed Jul 14 12:27:55 2010 From: peter at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Jul 2010 17:27:55 +0100 Subject: [Biopython-dev] Fwd: [Gmod-schema] GFF3 Is_circular In-Reply-To: <5B028E4D-30B2-4DCA-B41A-FF59ABDC4898@mac.com> References: <5B028E4D-30B2-4DCA-B41A-FF59ABDC4898@mac.com> Message-ID: Hi Brad, Something to be aware of for GFF work - the spec finally has explicit support for circular genomes :) Peter ---------- Forwarded message ---------- From: Andrew McArthur Date: Wed, Jul 14, 2010 at 5:17 PM Subject: [Gmod-schema] GFF3 Is_circular To: gmod-schema at lists.sourceforge.net Hello all, The definition of GFF3 at the Sequence Ontology site (http://www.sequenceontology.org/gff3.shtml) now has format definitions for supporting circular molecules such as plasmids or bacterial genomes. This is done using a new Is_circular flag in the GFF3 attributes field. Notably, "For features that cross the origin of a circular feature (e.g. most bacterial genomes, plasmids, and some viral genomes), the requirement for start to be less than or equal to end is satisfied by making end = the position of the end + the length of the landmark feature." Are Chado 1.1 and gmod_bulk_load_gff3.pl supporting this change to GFF3 or should I wait before changing my GFF3 files? Thanks, Andrew McArthur ------ Andrew G. McArthur, Ph.D. Bioinformatics Consulting Services Email: amcarthur at mac.com, Web: http://mcarthurlab.blogspot.com Phone: 905.296.3252, Mobile: 905.745.2794, Fax: 647.439.0829 AIM: amcarthur at mac.com, Skype: agmcarthur
From biopython at maubp.freeserve.co.uk Wed Jul 14 13:47:45 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Jul 2010 18:47:45 +0100 Subject: [Biopython-dev] Python 3 and Bio.SeqIO.index() Message-ID: Hi all, From background reading I knew that text IO speed was very slow in Python 3.0, but this had been improved in Python 3.1 - however there was still an overhead for the unicode conversion. e.g. http://dabeaz.blogspot.com/2010/01/reexamining-python-3-text-io.html First some good news - using Bio.SeqIO.convert() for FASTQ to FASTA seems to be faster under Python 3.1 than Python 2.7 (on a Windows XP 32bit machine). Now for the bad news - using Bio.SeqIO.index() is much slower. I decided to simplify this down to a minimal test case, and confirmed my hunch: indexing files in the new default unicode text mode comes with a major time penalty (a factor of about one hundred). I've attached four versions of the same script which scans a FASTA file building a dictionary of record offsets.
* fast in Python 2 using the default non-unicode strings
* slower in Python 3 using the default unicode strings
* slower in Python 3 using Latin encoded unicode strings
* faster in Python 3 using binary mode and bytes
The basic Python 3 script was created using 2to3 from the Python 2 version. I manually changed this to make the latin variant, and the binary bytes version.
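Since the attachments were scrubbed from the archive, here is a guess at what a binary-mode offset scan of this kind might look like. It is not the original index3b.py: the function name index_fasta_offsets, the use of the first word after ">" as the key, and the in-memory demo data are all assumptions made for illustration.

```python
import io


def index_fasta_offsets(handle):
    """Map each record identifier to the byte offset of its '>' line.

    The handle must be opened in binary mode, so the offsets are plain
    byte positions that can be passed straight back to handle.seek().
    """
    offsets = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            fields = line[1:].split(None, 1)
            key = fields[0].decode("ascii") if fields else ""
            offsets[key] = offset
    return offsets


# Tiny in-memory example mixing Unix and Windows line endings.
data = io.BytesIO(b">seq1 first\nACGT\n>seq2\r\nGG\r\n")
offsets = index_fasta_offsets(data)
print(offsets)

# Jump straight back to the second record using its stored offset.
data.seek(offsets["seq2"])
print(data.readline())
```

With a real file the handle would come from open(filename, "rb"); only the identifier is decoded, and everything else stays as bytes.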
Sample output on an example file with just 94 entries:
c:\python27\python index2.py ls_orchid.fasta - Indexed in 0.02s
c:\python31\python index3.py ls_orchid.fasta - Indexed in 12.20s
c:\python31\python index3latin.py ls_orchid.fasta - Indexed in 11.78s
c:\python31\python index3b.py ls_orchid.fasta - Indexed in 0.02s
Here the Python 2 version and the Python 3 binary examples are both extremely fast, while Python 3 unicode is very slow. There may be a tiny benefit to using the Latin encoding as suggested on the blog post I linked to above. Using a FASTA file with 7 million entries (converted from SRA entry SRR001666_1.fastq), we have:
c:\python27\python index2.py SRR001666_1.fasta - Indexed in 79.45s
c:\python31\python index3latin.py SRR001666_1.fasta - Over an hour (I killed it)
c:\python31\python index3b.py SRR001666_1.fasta - Indexed in 34.31s
I think the reason that Python 3 binary is faster than Python 2 is we are using universal read lines mode in Python 2, which will add an overhead (both for reading, and in calculating the offset). Given the way the Bio.SeqIO.index() API works, we have control over the file mode. I think we are going to have to open the file in binary mode for indexing efficiently. This may mean an extra wrapper for handling cross platform new line characters (something that Python 2.x does for us). I'd also be interested to try making the optimized functions in Bio.SeqIO.convert() use binary mode too and see if that makes them any faster (even on Python 2). In general, perhaps it would be useful if on Python 3 Bio.SeqIO could cope with opening text files in either unicode text mode or in binary mode? These issues may also influence what we decide to use for Seq objects by default (bytes versus unicode). Of course, the more special cases like this we have to worry about, the more complex a single codebase becomes... Peter
-------------- next part -------------- Non-text attachments were scrubbed: index3.py (625 bytes), index3b.py (637 bytes), index3latin.py (645 bytes), index2.py (607 bytes).
From biopython at maubp.freeserve.co.uk Wed Jul 14 14:09:15 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Jul 2010 19:09:15 +0100 Subject: [Biopython-dev] Python 3 and Bio.SeqIO.index() In-Reply-To: References: Message-ID: On Wed, Jul 14, 2010 at 6:47 PM, Peter wrote: > > Using a FASTA file with 7 million entries (converted from SRA > entry SRR001666_1.fastq), we have: > > c:\python27\python index2.py SRR001666_1.fasta - Indexed in 79.45s > c:\python31\python index3latin.py SRR001666_1.fasta - Over an hour (I killed it) > c:\python31\python index3b.py SRR001666_1.fasta - Indexed in 34.31s > > I think the reason that Python 3 binary is faster than Python 2 > is we are using universal read lines mode in Python 2, which will > add an overhead (both for reading, and in calculating the offset). Confirmed - switching the mode from "rU" to "rb" to give index2b.py,
c:\python27\python index2.py SRR001666_1.fasta - Indexed in 76.96s
c:\python27\python index2b.py SRR001666_1.fasta - Indexed in 36.62s
I've had a quick go at doing this for Bio.SeqIO.index(), and with the catch that the get_raw() functionality then returns the underlying newlines (which we can fix if need be) it seems to work (unit tests pass). This may be worth following up on regardless of the Python 3 work, since the speed up is pretty good (from 97s to 52s on this example on Windows).
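If the raw newlines returned in binary mode do turn out to matter, one possible fix is a small normalising step like the sketch below. The helper name normalize_newlines is hypothetical and this is not what the attached patch does; it only shows how "\r\n" and lone "\r" endings could be mapped to "\n" on bytes, roughly mimicking what universal newlines mode did for text.

```python
def normalize_newlines(raw):
    """Convert Windows and old Mac line endings in a bytes string to '\\n'."""
    # Replace the two-byte ending first so it is not turned into two newlines.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


record = b">id\r\nAC\rGT\n"
print(normalize_newlines(record))
```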
We'd need more testing for the cross platform issues of course. I wonder if the same speed up happens on Linux / Mac OS X? Something to try tomorrow I guess. Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: seqio_index_b.patch Type: application/octet-stream Size: 1288 bytes Desc: not available URL: From vsbuffalo at gmail.com Wed Jul 14 14:24:40 2010 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Wed, 14 Jul 2010 11:24:40 -0700 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: Thanks Eric and Peter, I'll get started on this! best, Vince On Wed, Jul 14, 2010 at 3:12 AM, Peter wrote: > On Wed, Jul 14, 2010 at 12:42 AM, Eric Talevich > wrote: > > On Tue, Jul 13, 2010 at 7:26 PM, Vince S. Buffalo >wrote: > > > >> I am familiar with git, github and Sphinx, but not epydoc. > > You'll be fine then - once epydoc is installed (easy on Linux as it should > be in your distribution's packages) its just one line to execute - see > also: > http://biopython.org/wiki/Building_a_release > > >> Would initial draft version of the tutorial to Sphinx be a good first > move? > >> > >> best, > >> Vince > >> > >> > > Hi Vince, > > > > Converting both to Sphinx would be awesome, but if you're looking to > learn > > about Biopython in depth, I'd recommend starting by converting the > > docstrings to reStructuredText. > > As Eric says, we would suggest starting with docstrings. > > > In the current Biopython source tree, you can grep for "__docformat__" to > > identify modules that are already using Epytext markup; those should be > > converted first. See: > > http://epydoc.sourceforge.net/manual-othermarkup.html > > Note that with epydoc you can have different python files using different > mark up (this is the __docformat__ thing Eric mentioned). Most of ours are > plain text, some use epytext, soon some will use reStructuredText. The > advantage of this is we can translate things gradually (file by file). 
> > Anything already using epytext should be quite clear and easy to convert > to reStructuredText. Anything using plain text may need a little more work. > Personally I'd suggest you pick modules you are familiar with to update > first. > > Peter > -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." -Richard Dawkins From vsbuffalo at gmail.com Wed Jul 14 14:38:51 2010 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Wed, 14 Jul 2010 11:38:51 -0700 Subject: [Biopython-dev] Python 3 and Bio.SeqIO.index() In-Reply-To: References: Message-ID: There's a slight (perhaps non-significant) speed up on OS X. In Python 2.7 on OS X 10.5.8: vinceb$ python index2.py s_7_1_sequence.fasta s_7_1_sequence.fasta Indexed in 32.35s vinceb$ python index2b.py s_7_1_sequence.fasta s_7_1_sequence.fasta Indexed in 26.01s best, Vince On Wed, Jul 14, 2010 at 11:09 AM, Peter wrote: > SRR001666_1.fasta -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." -Richard Dawkins From vsbuffalo at gmail.com Thu Jul 15 00:00:42 2010 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Wed, 14 Jul 2010 21:00:42 -0700 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: Hi all, I've started the conversion process with the file Bio/SeqIO/__init__.py. A few questions came up. First, are docstrings (with the autodoc extension) going to be the primary form of documentation, or are we going to copy/paste them into a separate documentation tree? I believe the latter is what Python and Django do, and may give us freedom to target different audiences with docstrings and the separate documentation. Also, after some Googling about autodoc, it seems complex. Also, has a branch been created on github? 
At this point, I'll continuing going through the "robotic" steps of converting epydoc formatting to ReST. Given my youthfulness working on this project, I'll try to keep you guys well updated. Preemptive apologies for future questions :-) Also, to test ReST + Sphinx on one section, I ran Sphinx on a copy/pasted docstring. I have to say, Sphinx is beautiful: http://imgur.com/4gNok best, Vince On Wed, Jul 14, 2010 at 11:24 AM, Vince S. Buffalo wrote: > Thanks Eric and Peter, I'll get started on this! > > best, > Vince > > > On Wed, Jul 14, 2010 at 3:12 AM, Peter wrote: > >> On Wed, Jul 14, 2010 at 12:42 AM, Eric Talevich >> wrote: >> > On Tue, Jul 13, 2010 at 7:26 PM, Vince S. Buffalo > >wrote: >> > >> >> I am familiar with git, github and Sphinx, but not epydoc. >> >> You'll be fine then - once epydoc is installed (easy on Linux as it should >> be in your distribution's packages) its just one line to execute - see >> also: >> http://biopython.org/wiki/Building_a_release >> >> >> Would initial draft version of the tutorial to Sphinx be a good first >> move? >> >> >> >> best, >> >> Vince >> >> >> >> >> > Hi Vince, >> > >> > Converting both to Sphinx would be awesome, but if you're looking to >> learn >> > about Biopython in depth, I'd recommend starting by converting the >> > docstrings to reStructuredText. >> >> As Eric says, we would suggest starting with docstrings. >> >> > In the current Biopython source tree, you can grep for "__docformat__" >> to >> > identify modules that are already using Epytext markup; those should be >> > converted first. See: >> > http://epydoc.sourceforge.net/manual-othermarkup.html >> >> Note that with epydoc you can have different python files using different >> mark up (this is the __docformat__ thing Eric mentioned). Most of ours are >> plain text, some use epytext, soon some will use reStructuredText. The >> advantage of this is we can translate things gradually (file by file). 
>>
>> Anything already using epytext should be quite clear and easy to convert
>> to reStructuredText. Anything using plain text may need a little more
>> work.
>> Personally I'd suggest you pick modules you are familiar with to update
>> first.
>>
>> Peter
>>
>
>
>
> --
> Vince Buffalo
> Programmer
> Bioinformatics Core
> UC Davis Genome Center
> University of California, Davis
>
> "There's real poetry in the real world. Science is the poetry of reality."
> -Richard Dawkins
>
>

--
Vince Buffalo
Programmer
Bioinformatics Core
UC Davis Genome Center
University of California, Davis

"There's real poetry in the real world. Science is the poetry of reality."
-Richard Dawkins

From biopython at maubp.freeserve.co.uk Thu Jul 15 05:18:28 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jul 2010 10:18:28 +0100
Subject: [Biopython-dev] Documentation
In-Reply-To:
References:
Message-ID:

On Thu, Jul 15, 2010 at 5:00 AM, Vince S. Buffalo wrote:
> Hi all,
>
> I've started the conversion process with the file Bio/SeqIO/__init__.py. A
> few questions came up.
>
> First, are docstrings (with the autodoc extension) going to be the primary
> form of documentation, or are we going to copy/paste them into a separate
> documentation tree? I believe the latter is what Python and Django do, and
> may give us freedom to target different audiences with docstrings and the
> separate documentation. Also, after some Googling about autodoc, it seems
> complex.

I think the docstrings should be the primary API documentation,
and the Tutorial the primary introductory text.

We currently have three forms,

* Biopython Tutorial (PDF & HTML, written in LaTeX) which is the main documentation and should be introductory.
* Module docstrings for the API, more technical (shown online with epydoc which is functional but ugly, later this will use Sphinx)
* Some wiki pages, more for recent things still in flux, and some user contributed "Cookbook" entries.
The wiki is nice to edit for user contributions, but not under source code control.

>
> Also, has a branch been created on github?
>

No - I was suggesting you make a fork and your own branch, and we will
periodically review your changes and apply them to the trunk. Is that OK?

> At this point, I'll continuing going through the "robotic" steps of
> converting epydoc formatting to ReST. Given my youthfulness working on this
> project, I'll try to keep you guys well updated. Preemptive apologies for
> future questions :-)

If in doubt, it's better to ask first - so not a problem at all.

> Also, to test ReST + Sphinx on one section, I ran Sphinx on a copy/pasted
> docstring. I have to say, Sphinx is beautiful: http://imgur.com/4gNok

The epydoc version is here (deep linking to avoid the frames):
http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html

The core text isn't so different, I'm more excited about the section
names and navigation side of things with Sphinx.

Peter

From biopython at maubp.freeserve.co.uk Thu Jul 15 06:23:30 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jul 2010 11:23:30 +0100
Subject: [Biopython-dev] Documentation
In-Reply-To:
References:
Message-ID:

On Thu, Jul 15, 2010 at 10:18 AM, Peter wrote:
>>
>> Also, has a branch been created on github?
>>
>
> No - I was suggesting you make a fork and your own branch, and we will
> periodically review your changes and apply them to the trunk. Is that OK?

It looks like you are already doing that - great.

A few things from looking over your first two commits, for SeqIO and AlignIO,
http://github.com/vsbuffalo/biopython/commit/a77d168cdf3f4a2c36708b5553531eef216f8aec
http://github.com/vsbuffalo/biopython/commit/76ba2d5e9c5d915230bbdee73fa3a3a962f814df

(1) Until most of the docstrings are using reStructuredText, we need to keep
using epydoc (before switching to Sphinx). During this transition we will
have a mix of markup in different modules.
The __docformat__ setting is important to tell epydoc this. So rather than
deleting any existing value like:

__docformat__ = "epytext en"

it should probably be replaced with:

__docformat__ = "reStructuredText en"

See http://epydoc.sourceforge.net/othermarkup.html

(2) I'm not keen on things like :mod:`Bio.AlignIO` or :func:`write` in the
markup. They look ugly and confusing to me (for looking at the raw text
at the Python terminal). Have you looked at NumPy's guidelines
http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines
and whatever pre-processors they use to assist Sphinx?

(3) Do you think we should also be standardising how we describe
parameters in docstrings? e.g. Follow what NumPy is doing?

Peter

From biopython at maubp.freeserve.co.uk Thu Jul 15 09:31:29 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jul 2010 14:31:29 +0100
Subject: [Biopython-dev] Python 3 and Bio.SeqIO.index()
In-Reply-To:
References:
Message-ID:

On Wed, Jul 14, 2010 at 7:38 PM, Vince S. Buffalo wrote:
> There's a slight (perhaps non-significant) speed up on OS X. In Python 2.7
> on OS X 10.5.8:
> vinceb$ python index2.py s_7_1_sequence.fasta
> s_7_1_sequence.fasta
> Indexed in 32.35s
> vinceb$ python index2b.py s_7_1_sequence.fasta
> s_7_1_sequence.fasta
> Indexed in 26.01s
> best,
> Vince

I don't have Python 3 on my Mac yet, so I've tried things out under Linux.

7 million entry FASTA file with Unix line endings (LF), on Linux:

python2.7 index2.py SRR001666_1.lf.fasta - 19s
python2.7 index2b.py SRR001666_1.lf.fasta - 19s
python3.1 index3.py SRR001666_1.lf.fasta - Over an hour (I killed it)
python3.1 index3b.py SRR001666_1.lf.fasta - 29s

Again, I gave up on the Python 3 plain text unicode string version.
7 million entry FASTA file with DOS line endings (CR LF), on Linux: python2.7 index2.py SRR001666_1.crlf.fasta - 19 or 20s python2.7 index2b.py SRR001666_1.crlf.fasta - 19 or 20s python3.1 index3.py SRR001666_1.crlf.fasta - not tested python3.1 index3b.py SRR001666_1.crlf.fasta - 29s Interestingly the line endings make almost no difference to the timings. On this machine the python3.1 bytes version is slower than either of the Python 2.7 versions. This may be down to compiler options or something (I compiled the Python 3.1 myself with the defaults). Recall on the Windows machine Python 3.1 (binary mode) was faster than Python 2.7 (binary mode or universal new lines mode). Regarding possible speed ups under Python 2 by avoiding universal new lines mode, as you can see above on this Linux Python 2.7 setup timing on index2.py and index2b.py are practically equal (~19s), unlike on the Windows machine where this did seem to help. I think the clear message (from both Windows and Linux) is that for Bio.SeqIO.index() to perform at a tolerable speed on Python 3 we can't use the default text mode with unicode strings, we are going to have to use binary mode with bytes. Peter From vsbuffalo at gmail.com Thu Jul 15 11:22:55 2010 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Thu, 15 Jul 2010 08:22:55 -0700 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: > > I think the docstrings should be the primmary API documentation, and the Tutorial the primary introductory text. I like the idea of code and documentation living together, but one thing that concerns me is that as the documentation grows larger and filled with more examples, it may begin to clutter the code quite a bit. Separate documentation and code allow greedy search and replace in documentation with the guarantee it won't damage code. And in Emacs (and other editors I presume) there are ReST editing modes that highlight syntax that do not work in docstrings. 
The benefits of documentation and code living together are that developers
can more easily find and update documentation on their functions, which is
not to be underestimated. It is interesting that numpy seems entirely
documented in docstrings, but django and other projects less so.

> (2) I'm not keen on things like :mod:`Bio.AlignIO` or :func:`write` in the
> markup. They look ugly and confusing to me (for looking at the raw text
> at the Python terminal). Have you looked at NumPy's guidelines
> http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines
> and whatever pre-processors they use to assist Sphinx?
>

Ah, in looking at numpy's ReST source (e.g.
http://docs.scipy.org/numpy/source/numpy/dist/lib64/python2.4/site-packages/numpy/lib/function_base.py#347)
it is much more terse. I can switch to this approach and skim their source
to find their preprocessor.

> (3) Do you think we should we also be standardising how we describe
> parameters in docstrings? e.g. Follow what NumPy is doing?

I was getting this same feeling as I was working. It might not be a bad
idea to create a stub-type docstring for every non-internal function so at
the very least something ends up in the documentation. This would also
provide a template for standardizing parameters (e.g. indicating return
value types, etc). This would likely increase the length of all code files
quite a bit though, but the documentation coverage would be higher.

--
Vince Buffalo
Programmer
Bioinformatics Core
UC Davis Genome Center
University of California, Davis

"There's real poetry in the real world. Science is the poetry of reality."
-Richard Dawkins

From biopython at maubp.freeserve.co.uk Thu Jul 15 11:38:24 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jul 2010 16:38:24 +0100
Subject: [Biopython-dev] Documentation
In-Reply-To:
References:
Message-ID:

On Thu, Jul 15, 2010 at 4:22 PM, Vince S.
Buffalo wrote:
>>
>> I think the docstrings should be the primary API documentation,
>> and the Tutorial the primary introductory text.
>
> I like the idea of code and documentation living together, but one thing
> that concerns me is that as the documentation grows larger and filled with
> more examples, it may begin to clutter the code quite a bit.

We can cross that bridge if we come to it - right now I would say
most modules really need more docstrings. If you think that any of
the docstrings you've looked at are too long, we can discuss
shortening them (ideally by relocating good content or tests).

Peter

From biopython at maubp.freeserve.co.uk Thu Jul 15 12:32:01 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jul 2010 17:32:01 +0100
Subject: [Biopython-dev] Python 3 porting
In-Reply-To:
References:
Message-ID:

2010/7/6 Peter :
>
> I could add explicit encode calls which would help SFF output under
> Python 3.x. This shouldn't change the functionality on Python 2.x, but
> I am a little concerned about it having a negative impact on the speed,
> but I have not measured this.
>

With my recent commits, SFF support now seems to work on Python 3.
This includes test_SeqIO_index.py although there are other issues here:
http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008004.html

I had to add some conditional code to handle bytes <-> unicode, which
may have a measurable slow down on Python 2.

Peter

From tiagoantao at gmail.com Thu Jul 15 12:41:37 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 15 Jul 2010 17:41:37 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/14 Peter :
> Could you make them single liners for now (least work to get the tests
> to pass), and I'll take a look at the commit later. If these are from my
> examples I'll then have a think about how better to handle them.

Actually no need for single liners.
In some cases it was this

>>> xxx \
yyy

To this

>>> xxx \
... yyy

Also

"""
>>> \"""
... \"""
"""

to

"""
>>> '''
... '''
"""

I will commit these changes in a few minutes.

--
"If you want to get laid, go to college. If you want an education, go to
the library." - Frank Zappa

From bioinformed at gmail.com Thu Jul 15 12:58:52 2010
From: bioinformed at gmail.com (Kevin Jacobs)
Date: Thu, 15 Jul 2010 12:58:52 -0400
Subject: [Biopython-dev] Python 3 porting
In-Reply-To:
References:
Message-ID:

2010/7/15 Peter

> With my recent commits, SFF support now seems to work on Python 3.
> This includes test_SeqIO_index.py although there are other issues here:
> http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008004.html
>
> I had to add some conditional code to handle bytes <-> unicode, which
> may have a measurable slow down on Python 2.
>
>

I'm in the midst of processing a great deal of SFF data, so I'll give
the new SFF code a try under Python 2.7.

-Kevin

From biopython at maubp.freeserve.co.uk Thu Jul 15 13:13:22 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jul 2010 18:13:22 +0100
Subject: [Biopython-dev] Python 3 porting
In-Reply-To:
References:
Message-ID:

On Thu, Jul 15, 2010 at 5:58 PM, Kevin Jacobs wrote:
> 2010/7/15 Peter
>
>> With my recent commits, SFF support now seems to work on Python 3.
>> This includes test_SeqIO_index.py although there are other issues here:
>> http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008004.html
>>
>> I had to add some conditional code to handle bytes <-> unicode, which
>> may have a measurable slow down on Python 2.
>
> I'm in the midst of processing a great deal of SFF data, so I'll try to give
> the new SFF code a try under Python 2.7.

Excellent - I take it you were already using the SFF support in Biopython?

Peter

From vsbuffalo at gmail.com Thu Jul 15 13:20:26 2010
From: vsbuffalo at gmail.com (Vince S.
Buffalo) Date: Thu, 15 Jul 2010 10:20:26 -0700 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: <201005141705.o4EH56ok028481@portal.open-bio.org> References: <201005141705.o4EH56ok028481@portal.open-bio.org> Message-ID: Sorry to bump this old topic, but are there plans to merge this into the main project? I do a lot of processing with the SAM format and it would be great to use Biopython for this. Does the pure Python implementation run as quickly as the pysam version? Is anyone still considering forking pysam and rewriting the C wrappers? Vince On Fri, May 14, 2010 at 10:05 AM, wrote: > http://bugzilla.open-bio.org/show_bug.cgi?id=2905 > > > > > > ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-14 13:05 EST ------- > The code on my branch has been updated, and now supports SAM and BAM > parsing > (currently it only extracts the read name, sequence and quality scores), > indexing by name with Bio.SeqIO.index(), and fast conversion to FASTA or > Sanger FASTQ with Bio.SeqIO.convert() which is handy for redoing a mapping: > > http://github.com/peterjc/biopython/tree/seqio-sam-bam > > Note that suffixes of "/1" or "/2" are added to forward or reverse read > names to make them unique. This matches the Illumina pipeline convention > and is handled by most tools which take paired end data. > > I'm actually using this code at the moment: I've started with BAM files of > paired end Illumina transcriptome reads mapped onto a draft assembly. I > then > used the convert code to convert these to FASTQ files, then split them into > a pair of FASTQ files (forward and reverse) and used BWA to remap them to a > different reference (giving new SAM files). > > > -- > Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are the assignee for the bug, or are watching the assignee. 
> _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." -Richard Dawkins From biopython at maubp.freeserve.co.uk Thu Jul 15 14:35:59 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Jul 2010 19:35:59 +0100 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: References: <201005141705.o4EH56ok028481@portal.open-bio.org> Message-ID: On Thu, Jul 15, 2010 at 6:20 PM, Vince S. Buffalo wrote: > Sorry to bump this old topic, but are there plans to merge this into the > main project? I do a lot of processing with the SAM format and it would be > great to use Biopython for this. > > Does the pure Python implementation run as quickly as the pysam > version? Is anyone still considering forking pysam and rewriting the > C wrappers? > > Vince EMBOSS now has limited SAM/BAM support, http://lists.open-bio.org/pipermail/emboss-dev/2010-July/000656.html BioLib is also now taking an interest in SAM/BAM support, I'd expect to see something on their mailing list soon: http://biolib.open-bio.org/wiki/Main_Page Can I ask what you want to do with SAM/BAM files? I did quite a bit of exploratory work for SAM/BAM in SeqIO, focussing on the raw reads (not the alignment side). This is very different from what you can do with PySam. It has allowed me to do SAM/BAM back to FASTQ which has been helpful in real work. 
There are branches on github, but still quite experimental and not necessarily going to be committed: http://github.com/peterjc/biopython/tree/seqio-sam-bam http://github.com/peterjc/biopython/tree/seqio-sam-bam-index Peter From bioinformed at gmail.com Thu Jul 15 14:43:29 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Thu, 15 Jul 2010 14:43:29 -0400 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: References: <201005141705.o4EH56ok028481@portal.open-bio.org> Message-ID: On Thu, Jul 15, 2010 at 1:20 PM, Vince S. Buffalo wrote: > Sorry to bump this old topic, but are there plans to merge this into the > main project? I do a lot of processing with the SAM format and it would be > great to use Biopython for this. > > Does the pure Python implementation run as quickly as the pysam version? Is > anyone still considering forking pysam and rewriting the C wrappers? > > I also started writing a pure Python SAM/BAM reader/writer with Cython accelerators, but quickly got distracted by the gaps in the "standard" and quirks in the various implementations. Instead, I've improved the base pysam implementation, fixed the parts that weren't working for me, and have posted a clone on the Google code site: http://code.google.com/r/bioinformed-pysam/ Of course, this doesn't help with how best to add functionality to BioPython... -Kevin From vsbuffalo at gmail.com Thu Jul 15 16:05:12 2010 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Thu, 15 Jul 2010 13:05:12 -0700 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: References: <201005141705.o4EH56ok028481@portal.open-bio.org> Message-ID: Our group has used the SAM format in parsing CIGAR strings to find hybrid mapped reads for various projects. We primarily use the pileup format in looking for SNP candidates and in differential expression analysis with RNA-seq. 
cDNA reads are mapped back to a reference transcriptome, and then we parse the pileup format to form counts for transcripts, which then go to R for differential expression analysis. As we look towards pipelining some common tasks, it would be nice if pysam's functionality were in Biopython. Also, I wonder if other folks work with the pileup format as frequently as we do - if so, this may be a worthy candidate for a parser. I'll look into BioLib and EMBOSS, thanks Peter. Vince On Thu, Jul 15, 2010 at 11:35 AM, Peter wrote: > On Thu, Jul 15, 2010 at 6:20 PM, Vince S. Buffalo wrote: > > Sorry to bump this old topic, but are there plans to merge this into the > > main project? I do a lot of processing with the SAM format and it would > be > > great to use Biopython for this. > > > > Does the pure Python implementation run as quickly as the pysam > > version? Is anyone still considering forking pysam and rewriting the > > C wrappers? > > > > Vince > > EMBOSS now has limited SAM/BAM support, > http://lists.open-bio.org/pipermail/emboss-dev/2010-July/000656.html > > BioLib is also now taking an interest in SAM/BAM support, > I'd expect to see something on their mailing list soon: > http://biolib.open-bio.org/wiki/Main_Page > > Can I ask what you want to do with SAM/BAM files? > > I did quite a bit of exploratory work for SAM/BAM in SeqIO, > focussing on the raw reads (not the alignment side). This > is very different from what you can do with PySam. It has > allowed me to do SAM/BAM back to FASTQ which has been > helpful in real work. There are branches on github, but still > quite experimental and not necessarily going to be committed: > http://github.com/peterjc/biopython/tree/seqio-sam-bam > http://github.com/peterjc/biopython/tree/seqio-sam-bam-index > > Peter > -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." 
-Richard Dawkins

From biopython at maubp.freeserve.co.uk Fri Jul 16 09:50:57 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jul 2010 14:50:57 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

Hi Tiago,

You've been looking more carefully at 2to3 and doctests than I have,
perhaps you can answer this query for me: It seems to me it does
not automatically fix doctests.

I'm aware of the -d or --doctests_only option, but that means we have
to run 2to3 twice I think (once for the code, once for the doctests).

Is there an extra flag or something obvious I am missing here? I want
to call 2to3 once and have it fix the code including the doctests.

Peter

From tiagoantao at gmail.com Fri Jul 16 11:24:20 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Fri, 16 Jul 2010 16:24:20 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/16 Peter :
> You've been looking more carefully at 2to3 and doctests than I have,
> perhaps you can answer this query for me: It seems to me it does
> not automatically fix doctests.
>
> I'm aware of the -d or --doctests_only option, but that means we have
> to run 2to3 twice I think (once for the code, once for the doctests).
>
> Is there an extra flag or something obvious I am missing here? I want
> to call 2to3 once and have it fix the code including the doctests.

My assessment is exactly the same as yours.
I call the app 2 times: one for code, another for doctests.
The setup.py that I provided only does code precisely because of this.
I still did not have time to, programmatically, call both
transformations.
So: yes it sucks.

Talking about setup.py, its current incarnation is broken on Python 3.
Even if the objective is for it to print some information on calling
2to3 it will not work.
Just putting the prints with () should sort it
(and work everywhere)

Anyway, I think we can make setup.py much more helpful in the p3 case
by calling 2to3 (like numpy). The tests would also need to be
transformed, I think.

Regards,
Tiago

From biopython at maubp.freeserve.co.uk Fri Jul 16 11:31:51 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jul 2010 16:31:51 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/16 Tiago Antão :
> 2010/7/16 Peter :
>> You've been looking more carefully at 2to3 and doctests than I have,
>> perhaps you can answer this query for me: It seems to me it does
>> not automatically fix doctests.
>>
>> I'm aware of the -d or --doctests_only option, but that means we have
>> to run 2to3 twice I think (once for the code, once for the doctests).
>>
>> Is there an extra flag or something obvious I am missing here? I want
>> to call 2to3 once and have it fix the code including the doctests.
>
> My assessment is exactly the same as yours.
> I call the app 2 times: one for code, another for doctests.
> The setup.py that I provided only does code precisely because of this.
> I still did not have time to, programatically, call both
> transformations.
> So: yes it sucks.

Maybe we should file an enhancement bug report?

> Talking about setup.py, its current incarnation is broken on Python 3.
> Even if the objective is for it to print some information on calling
> 2to3 it will not work. Just putting the prints with () should sort it
> (and work everywhere)

I like that plan except for a "bug" in 2to3, it will turn this example which
works for BOTH python 2 and python 3:

print("Hello world")

into this:

print(("Hello world"))

Using this syntax for simple prints is actually a tip here:
http://wiki.python.org/moin/PortingPythonToPy3k

> Anyway, I think we can make setup.py much more helpful in the p3 case
> by calling 2to3 (like numpy). The tests would also need to be
> transformed, I think.
I think it's a little premature for that - but once we have the full
conversion running smoothly it makes sense. For now transforming
the code in situ makes working in Python 3 to debug something much
easier I think.

Peter

From tiagoantao at gmail.com Fri Jul 16 11:44:35 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Fri, 16 Jul 2010 16:44:35 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/16 Peter :
> Maybe we should file an enhancement bug report?

Good idea, I will do that.

> I like that plan except for a "bug" in 2to3, it will turn this example which
> works for BOTH python 2 and python 3:
>
> print("Hello world")
>
> into this:
>
> print(("Hello world"))

Well, as I see it, setup.py will never need to be converted by 2to3.
It is possible to do a single file that works in all versions,
therefore that problem does not apply (unless people try to convert it
explicitly - I think we need to recommend against that). This seems to
be the case with numpy setup.py.

My view is this:
Current case:
person calls setup.py, always works. In the p3 case just prints the
warning and 2to3 recommendation.
Future (stable):
person calls setup.py, and it does everything necessary (calling 2to3
if needed)
Never:
Person calls 2to3
person calls setup.py

In fact it does not make much sense as it is now: the person has to
call 2to3 against setup.py in order to be informed to... call 2to3 ;)
See my point?

Tiago

From biopython at maubp.freeserve.co.uk Fri Jul 16 11:58:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jul 2010 16:58:03 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/16 Tiago Antão :
>
> 2010/7/16 Peter :
>
>> Maybe we should file an enhancement bug report?
>
> Good idea, I will do that.
>
>> I like that plan except for a "bug" in 2to3, it will turn this example which
>> works for BOTH python 2 and python 3:
>>
>> print("Hello world")
>>
>> into this:
>>
>> print(("Hello world"))
>
> Well, as I see it, setup.py will never need to be converted by 2to3.
> Its is possible to do a single file that works in all versions,
> therefore that problem does not apply (unless people try to convert it
> explicitly - I think we need to recommend against that). This seems to
> be the case with numpy setup.py.
>
> My view is this:
> Current case:
> person calls setup.py, always works. In the p3 case just prints the
> warning and 2to3 recommendation.
> Future (stable):
> person calls setup.py , and it does everything necessary (calling 2to3
> if needed)
> Never:
> Person calls 2to3
> person calls setup.py
>
> In fact it does not make much sense as it is now: the person has to
> call 2to3 against setup.py in order to be informed to... call 2to3 ;)
> See my point?

I agree that we should tweak setup.py to run under both Python 2
(life as normal) and Python 3 (tells you to manually run 2to3 on
the source code etc, but then continues as normal).

We'll need to tweak input vs raw_input (Python 3 vs Python 2).

Peter

From tiagoantao at gmail.com Fri Jul 16 12:06:50 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Fri, 16 Jul 2010 17:06:50 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/16 Peter :
> We'll need to tweak input vs raw_input (Python 3 vs Python 2).

Me thinks this is probably enough?

if sys.version_info[0] == 3:
    def raw_input():
        return input()

From biopython at maubp.freeserve.co.uk Fri Jul 16 12:12:04 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jul 2010 17:12:04 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/16 Tiago Antão :
> 2010/7/16 Peter :
>> We'll need to tweak input vs raw_input (Python 3 vs Python 2).
>
> Me thinks this is probably enough?
> if sys.version_info[0] == 3:
>     def raw_input():
>         return input()

Unless 2to3 does something horrible to that, yes. Do you want to test this and check it in now? I've got some other things to be getting on with so I'll take a break from updating the trunk with small Python 3 changes ;)

Peter

From tiagoantao at gmail.com Fri Jul 16 12:16:50 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Fri, 16 Jul 2010 17:16:50 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

Ok, I will check this in, but I think a small note should be added somewhere on how to use 2to3. I can do that if you tell me your preferred place (README?).

2010/7/16 Peter :
> 2010/7/16 Tiago Antão :
>> 2010/7/16 Peter :
>>> We'll need to tweak input vs raw_input (Python 3 vs Python 2).
>>
>> Me thinks this is probably enough?
>> if sys.version_info[0] == 3:
>>     def raw_input():
>>         return input()
>
> Unless 2to3 does something horrible to that, yes. Do you want
> to test this and check it in now? I've got some other things to
> be getting on with so I'll take a break from updating the trunk
> with small Python 3 changes ;)
>
> Peter
>

--
"If you want to get laid, go to college. If you want an education, go to the library." - Frank Zappa

From biopython at maubp.freeserve.co.uk Fri Jul 16 12:24:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jul 2010 17:24:03 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/16 Tiago Antão :
> Ok, I will check this in, but I think a small note should be added
> somewhere on how to use 2to3. I can do that if you tell me your
> preferred place (README?).

Good point - yes, add something to the README file and in the message from setup.py tell them to read that. Of course, this is just an interim measure while we are still working on Python 3 porting.
Peter From kellrott at gmail.com Fri Jul 16 15:25:44 2010 From: kellrott at gmail.com (Kyle) Date: Fri, 16 Jul 2010 12:25:44 -0700 Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54 In-Reply-To: <320fb6e01003181234m71cc777bxaf5f29f2fbe1f21f@mail.gmail.com> References: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com> <320fb6e01003181234m71cc777bxaf5f29f2fbe1f21f@mail.gmail.com> Message-ID: After much delay, I've made the change and posted it to the zxjdbc branch on Github. Now users can call BioSeqDatabase.open_database(backend = 'MySQL' ) and it will work the same on Python and Jython. Kyle On Thu, Mar 18, 2010 at 12:34 PM, Peter wrote: > On Thu, Mar 18, 2010 at 7:28 PM, Kyle wrote: >> What should the parameter be called? Possibilities: >> 'backend', 'dbtype', ... ?ideas anyone? > > Just database would be too vague. I quite like backend. > > Peter > From anaryin at gmail.com Fri Jul 16 17:20:54 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 16 Jul 2010 14:20:54 -0700 Subject: [Biopython-dev] GSOC Mid-Term Evaluation Message-ID: Hello all, I've been quite silent lately and I feel I should apologize :) I'm leaving the U.S. back to Europe and it's been quite hectic with packing and finishing some last minute stuff - namely my Thesis - so GSOC has been put a bit aside for the past week.. Still, I'll be working on unit tests and documentation for what I've done so far. It's not a big list of things but they do require a bit of effort to be well documented and most of all, assure they are working properly. Hope to be back to work fully on Monday! Best to all of you and thanks for the evaluation :) I know only the mentors' word was taken into account for them but if anyone has suggestions, criticism, feel free to do so. 
Again, the code is hosted here: http://github.com/JoaoRodrigues/biopython/tree/GSOC2010 I added some examples and a short list of what I've done so far here: http://www.biopython.org/wiki/GSOC2010_Joao#Project_Progress Jo?o [...] Rodrigues @ http://doeidoei.wordpress.org From bugzilla-daemon at portal.open-bio.org Sat Jul 17 10:47:13 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 17 Jul 2010 10:47:13 -0400 Subject: [Biopython-dev] [Bug 2578] The GenBank SeqRecord parser does not record molecule type or if circular In-Reply-To: Message-ID: <201007171447.o6HElDUD030395@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2578 claude at 2xlibre.net changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |claude at 2xlibre.net -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Jul 17 17:04:24 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 17 Jul 2010 17:04:24 -0400 Subject: [Biopython-dev] [Bug 3118] New: isinstance should use basestring for detecting string type Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3118 Summary: isinstance should use basestring for detecting string type Product: Biopython Version: Not Applicable Platform: All OS/Version: All Status: NEW Severity: normal Priority: P5 Component: Other AssignedTo: biopython-dev at biopython.org ReportedBy: claude at 2xlibre.net I've been bitten by this issue today, when I gave a Unicode string to annotation["date"] and the SeqIO writer for GenBank format tested it as isinstance(..., str) which returned False (Bio/SeqIO/InsdcIO.py). I saw that the code had a mix of isinstance( , str) and isinstance( , basestring). 
I chased all remaining str type comparisons for cooking the following patch. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Jul 17 17:05:19 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 17 Jul 2010 17:05:19 -0400 Subject: [Biopython-dev] [Bug 3118] isinstance should use basestring for detecting string type In-Reply-To: Message-ID: <201007172105.o6HL5Jm3010666@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3118 ------- Comment #1 from claude at 2xlibre.net 2010-07-17 17:05 EST ------- Created an attachment (id=1523) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1523&action=view) Replace all remaining isinstance(, str) by isinstance(, basestring) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Jul 17 18:30:20 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 17 Jul 2010 18:30:20 -0400 Subject: [Biopython-dev] [Bug 3118] isinstance should use basestring for detecting string type In-Reply-To: Message-ID: <201007172230.o6HMUKvV012752@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3118 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-17 18:30 EST ------- Hi Claude, Some of those should really be str (e.g. for the SeqFeature extract test in SeqFeatureExtractionWritingReading if the input is a str then the output should be too; also for BioSQL some of the adaptors do care about string vs unicode so that needs more checking), but in general you have a good point. 
In this particular case, yes - thank you:
http://github.com/biopython/biopython/commit/450b1a9024490feb2cdbbbc30f1dc429620d8c41

I think we need some more unit tests here (especially for BioSQL), which will help with the current Python 3 testing via 2to3, where string vs unicode is a big issue.

Leaving bug open...

Peter

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From vsbuffalo at gmail.com Sun Jul 18 02:50:59 2010
From: vsbuffalo at gmail.com (Vince S. Buffalo)
Date: Sat, 17 Jul 2010 23:50:59 -0700
Subject: [Biopython-dev] Documentation
In-Reply-To:
References:
Message-ID:

I dug into how Numpy is processing their own ReST dialect, and the answer lies in doc/HOWTO_BUILD_DOCS.txt. There is an extension that can be obtained from PyPi, or included manually in a doc/sphinxext directory.

Are these extension requirements alright (before I continue changing the format)? Some more information is below.

    Numpy's documentation uses several custom extensions to Sphinx. These
    are shipped in the ``sphinxext/`` directory, and are automatically
    enabled when building Numpy's documentation.

    However, if you want to make use of these extensions in third-party
    projects, they are available on PyPi_ as the numpydoc_ package, and
    can be installed with::

        easy_install numpydoc

    In addition, you will need to add::

        extensions = ['numpydoc']

On Thu, Jul 15, 2010 at 8:38 AM, Peter wrote:
> On Thu, Jul 15, 2010 at 4:22 PM, Vince S. Buffalo wrote:
> >>
> >> I think the docstrings should be the primary API documentation,
> >> and the Tutorial the primary introductory text.
> >
> > I like the idea of code and documentation living together, but one thing
> > that concerns me is that as the documentation grows larger and filled with
> > more examples, it may begin to clutter the code quite a bit.
> > We can cross that bridge if we come to it - right now I would say most > modules really need more docstrings. If you think that any of the > docstrings > you've looked at are too long, we can discuss shortening them (ideally by > relocating good content or tests). > > Peter > -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." -Richard Dawkins From biopython at maubp.freeserve.co.uk Sun Jul 18 06:52:58 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 18 Jul 2010 11:52:58 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Sun, Jul 18, 2010 at 7:50 AM, Vince S. Buffalo wrote: > I dug into how Numpy is processing their own ReST dialect, and the > answer lies in doc/HOWTO_BUILD_DOCS.txt. There is an extension > that can be obtained from PyPi, or included manually in a doc/sphinxext > directory. > > Are these extension requirements alright (before I continue changing the > format)? Some more information is below. > If they are useful, then I'm OK with that. We can probably even take a copy and add it to the Biopython source code since it is under the BSD licence: http://pypi.python.org/pypi/numpydoc/ We may be fine with just restricted reStructuredText - see how you get on with that first? Peter From biopython at maubp.freeserve.co.uk Sun Jul 18 08:28:14 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 18 Jul 2010 13:28:14 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Sun, Jul 18, 2010 at 11:52 AM, Peter wrote: > On Sun, Jul 18, 2010 at 7:50 AM, Vince S. Buffalo wrote: >> I dug into how Numpy is processing their own ReST dialect, and the >> answer lies in doc/HOWTO_BUILD_DOCS.txt. There is an extension >> that can be obtained from PyPi, or included manually in a doc/sphinxext >> directory. 
>> >> Are these extension requirements alright (before I continue changing the >> format)? Some more information is below. > > If they are useful, then I'm OK with that. We can probably even take > a copy and add it to the Biopython source code since it is under the > BSD licence: http://pypi.python.org/pypi/numpydoc/ > > We may be fine with just restricted reStructuredText - see how you > get on with that first? Plus of course in the short term we'll still be using epydoc anyway. Peter From vsbuffalo at gmail.com Sun Jul 18 15:10:40 2010 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Sun, 18 Jul 2010 12:10:40 -0700 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: Sounds good. I'll add a copy to doc/sphinxext as in Numpy. By restricted, you mean without the :class:`ClassName` type annotation? Vince On Sun, Jul 18, 2010 at 5:28 AM, Peter wrote: > On Sun, Jul 18, 2010 at 11:52 AM, Peter wrote: > > On Sun, Jul 18, 2010 at 7:50 AM, Vince S. Buffalo wrote: > >> I dug into how Numpy is processing their own ReST dialect, and the > >> answer lies in doc/HOWTO_BUILD_DOCS.txt. There is an extension > >> that can be obtained from PyPi, or included manually in a doc/sphinxext > >> directory. > >> > >> Are these extension requirements alright (before I continue changing the > >> format)? Some more information is below. > > > > If they are useful, then I'm OK with that. We can probably even take > > a copy and add it to the Biopython source code since it is under the > > BSD licence: http://pypi.python.org/pypi/numpydoc/ > > > > We may be fine with just restricted reStructuredText - see how you > > get on with that first? > > Plus of course in the short term we'll still be using epydoc anyway. > > Peter > -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." 
-Richard Dawkins From bugzilla-daemon at portal.open-bio.org Sun Jul 18 15:23:48 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 18 Jul 2010 15:23:48 -0400 Subject: [Biopython-dev] [Bug 3118] isinstance should use basestring for detecting string type In-Reply-To: Message-ID: <201007181923.o6IJNmbf007400@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3118 ------- Comment #3 from claude at 2xlibre.net 2010-07-18 15:23 EST ------- Thanks Peter for the fix you committed. It resolves my issue. I understand my search/replace strategy was a bit rude :-) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Jul 19 04:34:43 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 19 Jul 2010 09:34:43 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Sun, Jul 18, 2010 at 8:10 PM, Vince S. Buffalo wrote: > Sounds good. I'll add a copy to doc/sphinxext as in Numpy. > > By restricted, you mean without the :class:`ClassName` type annotation? > > Vince Yes - to me that looks horrible as plain text. I was hoping NumPy had a clear definition of their restricted subset of reStructuredText we could follow... maybe I haven't looked hard enough. Have you been able to run epydoc with reStructuredText yet? 
Peter

From bugzilla-daemon at portal.open-bio.org Mon Jul 19 10:40:58 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 19 Jul 2010 10:40:58 -0400
Subject: [Biopython-dev] [Bug 3119] New: Bio.Nexus can't parse file from Prank 100701 (1st July 2010)
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3119

           Summary: Bio.Nexus can't parse file from Prank 100701 (1st July 2010)
           Product: Biopython
           Version: 1.54
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk

I've been updating test_Prank_tool.py to cope with the latest version of Prank, 1 July 2010 from http://www.ebi.ac.uk/goldman-srv/prank/src/prank/

Some changes are simple, such as removing tests that use features of Prank which have been removed. One test is failing due to some big changes in the NEXUS output from Prank, and this may be due to a problem with our parser:

>>> from Bio.Nexus import Nexus
>>> n = Nexus.Nexus("output_prank_v100701.nex")
Traceback (most recent call last):
...
Bio.Nexus.Nexus.NexusError: Taxon AK1H_ECOLI: Illegal character / in line: (check dimensions / interleaving)

I will attach the file; it is created by the unit test as output.2.nex but is usually deleted.

Peter

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Mon Jul 19 10:42:31 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 19 Jul 2010 10:42:31 -0400 Subject: [Biopython-dev] [Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010) In-Reply-To: Message-ID: <201007191442.o6JEgVgj020619@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3119 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-19 10:42 EST ------- Created an attachment (id=1524) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1524&action=view) Sample NEXUS output from prank v100701 This file is from Prank v100701 (1 July 2010), compiled and run on Linux from: http://www.ebi.ac.uk/goldman-srv/prank/src/prank/prank.src.100701.tgz -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jul 19 10:45:22 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 19 Jul 2010 10:45:22 -0400 Subject: [Biopython-dev] [Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010) In-Reply-To: Message-ID: <201007191445.o6JEjMRN020749@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3119 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-19 10:45 EST ------- Created an attachment (id=1525) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1525&action=view) Sample NEXUS output from prank v081202 Equivalent output from Prank v.081202 (2 Dec 2008), compiled and run on Mac OS X. Bio.Nexus can parse this. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jul 19 10:49:36 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 19 Jul 2010 10:49:36 -0400 Subject: [Biopython-dev] [Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010) In-Reply-To: Message-ID: <201007191449.o6JEna91020945@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3119 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-19 10:49 EST ------- I have for the moment added a hack to avoid the test failure, http://github.com/biopython/biopython/commit/ca6a5958415d4d026b2b799a35fd3a6371491024 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From andrea at biocomp.unibo.it Tue Jul 20 10:51:50 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Tue, 20 Jul 2010 16:51:50 +0200 (CEST) Subject: [Biopython-dev] DAS client in biopython Message-ID: Hi all, I've been working a little do develop a DAS client in python, and I thought it could be a nice addition to biopython. So I build up a branch on github that can be found here: http://github.com/apierleoni/biopython/tree/das-client The DAS module is under Bio and can be imported using >>> from Bio.DAS.DASpy import DASpy some code examples are included in the DASpy.py file. cool things you can do with DASpy: - fetch all the available DAS servers listed at dasregistry - connect to each of them and use 'das1:sequence' and 'das1:feature' methods to retrieve sequences, features and annotations from DAS servers. - build a SeqRecord starting from multiple DAS servers (one for the sequence and the others for features and annotations) Eg. you can build a SeqRecord object that will list all abailable DAS annotations given a uniprot ID. I'm actually the only user of the code, so I'll appreciate any comment about it. 
Hope this turns useful to someone else. Andrea From andrea at biocomp.unibo.it Tue Jul 20 11:06:22 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Tue, 20 Jul 2010 17:06:22 +0200 (CEST) Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: Message-ID: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> > I'm not sure we can easily include GPL code in Biopython... it would > complicate things. Kyle has also been working on using the JVM DB > API for BioSQL under Jython - I'd rather we ended up with a runtime > choice of drivers (database specific like mysqldb, and others like the > abstractions SQLAlchemy or the web2py DAL) which would all be > external to Biopython. > > Peter > I've checked online and, actually, web2py code comes under: "GPL2 License with an exception for easier commercialization of applications." and they states: "Applications built with web2py can be released under any license the author wishes as long they do not contain web2py code. In particular they can be bytecode compiled and distributed in closed source. The admin interface provides a button to byte-code compile. It is fine to distribute web2py (source or compiled) with your applications as long as you make it clear in the license where your application ends and web2py starts." I don't think this will cause any problem given that the web2py code is acknowledged. Anyhow, are there any plan in extending the BioSQL interface? We could make some methods useful to people not skilled with SQL, that can boost their experience with BioSQL. something like selecting all the bioentries carrying a given feature type or a qualifier value or even a dbxref. Allowing people to use the BioSQL schema without exactly knowing the schema and have to write complex queries could be a big addition. 
Andrea From biopython at maubp.freeserve.co.uk Tue Jul 20 11:19:56 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Jul 2010 16:19:56 +0100 Subject: [Biopython-dev] DAS client in biopython In-Reply-To: References: Message-ID: On Tue, Jul 20, 2010 at 3:51 PM, Andrea Pierleoni wrote: > Hi all, > I've been working a little do develop a DAS client in python, and I > thought it could be a nice addition to biopython. Hi Andrea, This does look interesting - I've never needed to work with DAS but maybe one day... > So I build up a branch on github that can be found here: > > http://github.com/apierleoni/biopython/tree/das-client It looks like you have lots of other code on that branch too, like BioSQL2py (your BioSQL via web2py DAL) - this isn't a problem for now but would complicate merging later. > The DAS module is under Bio and can be imported using > >>>> from Bio.DAS.DASpy import DASpy The heirachy seems unnecessarily nested, why not move the code in Bio/DAS/DASpy.py into Bio/DAS/__init__.py? Or even into Bio/DAS.py instead? Then that import becomes: from Bio.DAS import DASpy, which also avoids the ambiguity of DASpy for a module and a class. Are you expecting to have other files under Bio/DAS? Also the name DASpy confuses me, maybe the class should be something about DAS Servers? Would it be right to regard the class DASSeq as a subclass of SeqRecord? It looks like a minimally annotated sequence. See also the DBSeqRecord in BioSQL. Regards, Peter From biopython at maubp.freeserve.co.uk Tue Jul 20 11:23:16 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Jul 2010 16:23:16 +0100 Subject: [Biopython-dev] Active projects list? 
(BOSC Biopython Project Update) In-Reply-To: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> Message-ID: On Tue, Jul 20, 2010 at 4:06 PM, Andrea Pierleoni wrote: > > >> I'm not sure we can easily include GPL code in Biopython... it would >> complicate things. Kyle has also been working on using the JVM DB >> API for BioSQL under Jython - I'd rather we ended up with a runtime >> choice of drivers (database specific like mysqldb, and others like the >> abstractions SQLAlchemy or the web2py DAL) which would all be >> external to Biopython. >> >> Peter >> > > I've checked online and, actually, web2py code comes under: > > "GPL2 License with an exception for easier commercialization of > applications." > > and they states: > > "Applications built with web2py can be released under any license the > author wishes as long they do not contain web2py code. In particular they > can be bytecode compiled and distributed in closed source. The admin > interface provides a button to byte-code compile. > It is fine to distribute web2py (source or compiled) with your > applications as long as you make it clear in the license where your > application ends and web2py starts." > > I don't think this will cause any problem given that the web2py code is > acknowledged. I wouldn't want to ship web2py with Biopython - we'd just list it as another optional package you might want to install for use with BioSQL (as we do with MySQLdb etc). > Anyhow, are there any plan in extending the BioSQL interface? > We could make some methods useful to people not skilled with SQL, that can > boost their experience with BioSQL. something like selecting all the bioentries > carrying a given feature type or a qualifier value ?or even a dbxref. > Allowing people to use the BioSQL schema without exactly knowing the > schema and have to write complex queries could be a big addition. 
There are already several query methods, but more wouldn't be a bad idea. I was thinking we could implement dictionary like access, and support for iterator over all the records. Peter From andrea at biocomp.unibo.it Tue Jul 20 12:00:34 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Tue, 20 Jul 2010 18:00:34 +0200 (CEST) Subject: [Biopython-dev] DAS client in biopython In-Reply-To: References: Message-ID: > On Tue, Jul 20, 2010 at 3:51 PM, Andrea Pierleoni wrote: >> Hi all, >> I've been working a little do develop a DAS client in python, and I >> thought it could be a nice addition to biopython. > > Hi Andrea, > > This does look interesting - I've never needed to work with > DAS but maybe one day... > >> So I build up a branch on github that can be found here: >> >> http://github.com/apierleoni/biopython/tree/das-client > > It looks like you have lots of other code on that branch too, > like BioSQL2py (your BioSQL via web2py DAL) - this isn't > a problem for now but would complicate merging later. > BioSQL2py is just an empty directory on that branch, It will be filled in an other specific branch (actually it shouldn't be there :) ) >> The DAS module is under Bio and can be imported using >> >>>>> from Bio.DAS.DASpy import DASpy > > The heirachy seems unnecessarily nested, why not move the > code in Bio/DAS/DASpy.py into Bio/DAS/__init__.py? Or > even into Bio/DAS.py instead? Then that import becomes: > from Bio.DAS import DASpy, which also avoids the ambiguity > of DASpy for a module and a class. Are you expecting to have > other files under Bio/DAS? > I'm not planning on having other file. but since this was a proposal, I build the Bio/DAS structure to host any additional client available, if there are any. howver if it will be the only way to parse DAS file we can simplify to a Bio/DAS.py file. much better to me. > Also the name DASpy confuses me, maybe the class > should be something about DAS Servers? 
>

DASpy is what I usually call this client, and it is the main class, but it can be renamed to something more meaningful.

> Would it be right to regard the class DASSeq as a subclass
> of SeqRecord? It looks like a minimally annotated sequence.
> See also the DBSeqRecord in BioSQL.
>

Well, I think a DASSeq can fit comfortably in a SeqRecord. This would also simplify building a SeqRecord object in DASpy.fetch_to_seqrec.

Thanks for the advice, Peter.

Andrea

From biopython at maubp.freeserve.co.uk Tue Jul 20 12:03:44 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Jul 2010 17:03:44 +0100
Subject: [Biopython-dev] subprocess and calling application wrappers
In-Reply-To:
References: <20100601132355.GU1054@sobchak.mgh.harvard.edu>
Message-ID:

On Wed, Jun 2, 2010 at 12:36 PM, Peter wrote:
> On Tue, Jun 1, 2010 at 4:15 PM, Peter wrote:
>> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman wrote:
>>> I'd suggest having an option to not capture stdout and stderr, which
>>> would help users avoid those cases where a program spews a lot to
>>> stdout and it's unwieldy to capture and stick it into a string.
>>
>> We need to avoid any risk of deadlocks, so I guess the safe
>> implementation here would be call subprocess with stdout and
>> stderr sent to dev null.
>
> How does this look? Tested on Mac and Windows:
> http://github.com/peterjc/biopython/tree/app-exec2
>
> Example usage without capturing the output:
>
>     from Bio.Emboss.Applications import WaterCommandline
>     water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
>                                  asequence="a.fasta", bsequence="b.fasta")
>     print "About to run:\n%s" % water_cmd
>     return_code = water_cmd()
>     print "Return code: %i" % return_code
>
> Example usage with stdout and stderr capture:
>
>     from Bio.Emboss.Applications import WaterCommandline
>     water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
>                                  asequence="a.fasta", bsequence="b.fasta")
>     print "About to run:\n%s" % water_cmd
>     stdout, stderr, return_code = water_cmd(capture=True)
>     print "Return code: %i" % return_code
>     print "Tool output:\n%s" % stdout
>
> Note in this implementation it either returns an integer error level
> (the default) or a tuple of stdout, stderr and the error level return
> code. If we opt for adding methods rather than using __call__
> these could be different methods instead.
>
> Another potentially useful option would be to copy the
> subprocess.check_call() function in Python 2.5+ which verifies
> the return code (error level) is zero and raises an exception if not
> (probably only sensible if not capturing the output?). Maybe this
> could even be the default behaviour?
>
> [I would prefer to keep the interface as simple as possible though,
> less options is better! KISS principle.]
>
> Peter

Interestingly, in Python 2.7 subprocess gained a new function called check_output which returns a string (stdout, optionally combined with stderr as a single string). If there is a non-zero return code you get a CalledProcessError exception (with return code and output):
http://docs.python.org/library/subprocess.html

In some ways there are too many choices - how unpythonic ;)

Having thought about this for a while, I realised that in almost every case I have never cared about the exact return code, just whether it is zero (success) or not (failure). Therefore the behaviour of the subprocess functions check_call (Python 2.5+) and check_output (Python 2.7+) seems desirable (you get an exception if the return code is non-zero).

That just leaves what to return: stdout and/or stderr. I personally have never needed to merge stderr and stdout into a single pipe or string - the only use case for this I can think of is to capture the output into a file for logging purposes. Generally it makes more sense to keep them separate. This leaves the question: should we return just stdout, or both?
Sometimes stderr is useful, so I think both. So, in yet-another-branch, I wrote a __call__ implementation which raises an exception on non-zero return codes, but otherwise returns stdout and stderr as a tuple of two strings: http://github.com/peterjc/biopython/commits/app-exec3 I'm pretty confident this will suffice for most use cases, and propose we implement this in Biopython 1.55. Thoughts? Peter From andrea at biocomp.unibo.it Tue Jul 20 12:07:29 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Tue, 20 Jul 2010 18:07:29 +0200 (CEST) Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> Message-ID: > > I wouldn't want to ship web2py with Biopython - we'd just list it as > another optional package you might want to install for use with BioSQL > (as we do with MySQLdb etc). > that sounds reasonable. > > There are already several query methods, but more wouldn't be a bad > idea. I was thinking we could implement dictionary like access, and > support for iterator over all the records. > dictionary and iterators would be very pythonic, and useful. are you working on it? correct me if I'm wrong, but the standard policy in Biopython BioSQL to update a bioentry record is to delete the old one and create a new one (ore make a new version). wouldn't be useful to enable in biopython some minor modifications to a bioentry like adding/removing features and qualifiers? maybe I can help with this. Andrea From biopython at maubp.freeserve.co.uk Tue Jul 20 12:18:16 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Jul 2010 17:18:16 +0100 Subject: [Biopython-dev] Active projects list? 
(BOSC Biopython Project Update) In-Reply-To: References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> Message-ID: On Tue, Jul 20, 2010 at 5:07 PM, Andrea Pierleoni wrote: > >> >> I wouldn't want to ship web2py with Biopython - we'd just list it as >> another optional package you might want to install for use with BioSQL >> (as we do with MySQLdb etc). >> > > that sounds reasonable. > >> >> There are already several query methods, but more wouldn't be a bad >> idea. I was thinking we could implement dictionary like access, and >> support for iterator over all the records. >> > > dictionary and iterators would be very pythonic, and useful. are you > working on it? Not right now, no - if you want to try soon please go ahead. > correct me if I'm wrong, but the standard policy in Biopython BioSQL to > update a bioentry record is to delete the old one and create a new one > (or make a new version). wouldn't be useful to enable in biopython some > minor modifications to a bioentry like adding/removing features and > qualifiers? maybe I can help with this. The current functionality is limited to loading and retrieving records (and retreiving is done in a lazy or on demand way which saves memory and DB access). As a consequence, if you want to edit a record in the database you have to either do it directly (bypass our BioSQL code) or load a new record. The BioSQL schema doesn't have any sort of audit trail (unlike CHADO if I remember correctly), so for many uses this almost read only setup is actually a plus point. Here we use BioSQL essentially as a container for NCBI GenBank / RefSeq dumps - although we do add additional annotations on top. I can see advantages in allowing the DBSeqRecord to write back to the database - it would need a lots of refactoring through (e.g. most of the loader code would get moved). 
I would start by creating a read only proxy for the DBSeqFeature (something I think Leighton Pritchard did in some of his code) because editing feature annotations would be an important part of this. Peter From biopython at maubp.freeserve.co.uk Wed Jul 21 06:47:14 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Jul 2010 11:47:14 +0100 Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54 In-Reply-To: References: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com> <320fb6e01003181234m71cc777bxaf5f29f2fbe1f21f@mail.gmail.com> Message-ID: On Fri, Jul 16, 2010 at 8:25 PM, Kyle wrote: > After much delay, I've made the change and posted it to the zxjdbc > branch on Github. Now users can call > BioSeqDatabase.open_database(backend = 'MySQL' ) and it will work the > same on Python and Jython. Nice. I'll have to look at your code, but we can have it try a series of supported adaptors (e.g. there are several for PostgreSQL), which will make things a little easier on the user even on C Python. Peter From bioinformed at gmail.com Wed Jul 21 07:31:39 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Wed, 21 Jul 2010 07:31:39 -0400 Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54 In-Reply-To: References: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com> Message-ID: On Thu, Mar 18, 2010 at 3:28 PM, Kyle wrote: > What should the parameter by called? Possibilities: 'backend', 'dbtype', > ... > ideas anyone? > > I suggest 'driver', since it is explicit and precise about what is being chosen. This allows users to select among several drivers, even alternatives for the same database backend. It also allows the creation of default aliases for meta-drivers like 'mysql' or 'postgresql', which could search among a list of compatible drivers and the most suitable one that is found to be installed. 
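
A rough sketch of what such an alias lookup might look like, using only
the standard library. The alias table, the function name find_driver
and the candidate module names are assumptions for illustration, not
existing BioSQL code:

```python
import importlib

# Hypothetical alias table: backend alias -> candidate DB-API modules,
# listed in order of preference. Both the aliases and the module names
# are illustrative assumptions.
DRIVER_ALIASES = {
    "mysql": ["MySQLdb"],
    "postgresql": ["psycopg2", "psycopg", "pgdb"],
    "sqlite": ["sqlite3"],
}


def find_driver(name, aliases=DRIVER_ALIASES):
    """Return the first importable driver module for the given name.

    If name is a known alias, each candidate is tried in turn;
    otherwise name is treated as an explicit driver module name.
    """
    candidates = aliases.get(name.lower(), [name])
    for module_name in candidates:
        try:
            return importlib.import_module(module_name)
        except ImportError:
            continue  # not installed, try the next candidate
    raise ImportError(
        "No installed driver found for %r (tried: %s)"
        % (name, ", ".join(candidates))
    )


if __name__ == "__main__":
    driver = find_driver("sqlite")
    print(driver.__name__)
```

An explicit driver name still works unchanged, so the aliases would
only add a convenience layer on top of the existing choice.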
-Kevin

From biopython at maubp.freeserve.co.uk Wed Jul 21 07:55:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Jul 2010 12:55:10 +0100
Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54
In-Reply-To:
References: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com>
Message-ID:

On Wed, Jul 21, 2010 at 12:31 PM, Kevin Jacobs wrote:
> On Thu, Mar 18, 2010 at 3:28 PM, Kyle wrote:
>
>> What should the parameter be called? Possibilities: 'backend', 'dbtype',
>> ...
>> ideas anyone?
>>
>>
> I suggest 'driver', since it is explicit and precise about what is being
> chosen. This allows users to select among several drivers, even
> alternatives for the same database backend. It also allows the creation of
> default aliases for meta-drivers like 'mysql' or 'postgresql', which could
> search among a list of compatible drivers and the most suitable one that is
> found to be installed.

We already have a parameter called driver (e.g. set to MySQLdb,
psycopg2, psycopg, pgdb, sqlite3), which would then have to take on a
double meaning (the Python driver versus the underlying back end
database: MySQL, PostgreSQL, SQLite).

Peter

From biopython at maubp.freeserve.co.uk Wed Jul 21 07:58:21 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Jul 2010 12:58:21 +0100
Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update)
In-Reply-To:
References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it>
Message-ID:

On Tue, Jul 20, 2010 at 5:18 PM, Peter wrote:
> On Tue, Jul 20, 2010 at 5:07 PM, Andrea Pierleoni wrote:
>>>
>>> There are already several query methods, but more wouldn't be a bad
>>> idea. I was thinking we could implement dictionary like access, and
>>> support for iterator over all the records.
>>>
>>
>> dictionary and iterators would be very pythonic, and useful. are you
>> working on it?
>
> Not right now, no - if you want to try soon please go ahead.
> Well, I went and did the basics to be consistent with the existing limited dict like support in BioSeqDatabase. Would you mind testing it? This can be improved by iterating over the cursor rather than building a list of identifiers in memory. Likewise __len__ and __contains__ can be turned into SQL statements to be more efficient. Do you fancy trying that? Peter From andrea at biocomp.unibo.it Wed Jul 21 11:43:30 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Wed, 21 Jul 2010 17:43:30 +0200 (CEST) Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> Message-ID: > > Well, I went and did the basics to be consistent with the existing limited > dict like support in BioSeqDatabase. Would you mind testing it? > > This can be improved by iterating over the cursor rather than building a > list of identifiers in memory. Likewise __len__ and __contains__ can be > turned into SQL statements to be more efficient. Do you fancy trying that? > > Peter > I've tested the new BioSeqDatabase in postgres BioSQL db containing 50000 bioentry, and it works very fast (I'm using python 2.6) even in this way. howver using SQL will be much better of course. I will take a try, as soon as I fix the DAS client and UniprotIO. Andrea From andrea at biocomp.unibo.it Wed Jul 21 11:48:40 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Wed, 21 Jul 2010 17:48:40 +0200 (CEST) Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> Message-ID: <806685d7b693bb7c85a03846c9160b56.squirrel@lipid.biocomp.unibo.it> > The current functionality is limited to loading and retrieving records > (and retreiving is done in a lazy or on demand way which saves > memory and DB access). 
As a consequence, if you want to edit a > record in the database you have to either do it directly (bypass our > BioSQL code) or load a new record. > > The BioSQL schema doesn't have any sort of audit trail (unlike CHADO > if I remember correctly), so for many uses this almost read only setup > is actually a plus point. Here we use BioSQL essentially as a container > for NCBI GenBank / RefSeq dumps - although we do add additional > annotations on top. > > I can see advantages in allowing the DBSeqRecord to write back > to the database - it would need a lots of refactoring through (e.g. most > of the loader code would get moved). I would start by creating a read > only proxy for the DBSeqFeature (something I think Leighton Pritchard > did in some of his code) because editing feature annotations would > be an important part of this. > maybe I'll succeed in the next month in writing some methods to modify bioentry directly in the SQL db. we can talk about this later, as soon as we have some code to work on. however audit trail will not be possible in the current BioSQL schema, unless using separate tables (as I'm actually doing). but I don't think this will be easily integrable in biopython. Is there anyone needing user logs? From biopython at maubp.freeserve.co.uk Wed Jul 21 11:59:15 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Jul 2010 16:59:15 +0100 Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: <806685d7b693bb7c85a03846c9160b56.squirrel@lipid.biocomp.unibo.it> References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> <806685d7b693bb7c85a03846c9160b56.squirrel@lipid.biocomp.unibo.it> Message-ID: On Wed, Jul 21, 2010 at 4:48 PM, Andrea Pierleoni wrote: >> The current functionality is limited to loading and retrieving records >> (and retreiving is done in a lazy or on demand way which saves >> memory and DB access). 
As a consequence, if you want to edit a >> record in the database you have to either do it directly (bypass our >> BioSQL code) or load a new record. >> >> The BioSQL schema doesn't have any sort of audit trail (unlike CHADO >> if I remember correctly), so for many uses this almost read only setup >> is actually a plus point. Here we use BioSQL essentially as a container >> for NCBI GenBank / RefSeq dumps - although we do add additional >> annotations on top. >> >> I can see advantages in allowing the DBSeqRecord to write back >> to the database - it would need a lots of refactoring through (e.g. most >> of the loader code would get moved). I would start by creating a read >> only proxy for the DBSeqFeature (something I think Leighton Pritchard >> did in some of his code) because editing feature annotations would >> be an important part of this. >> > > maybe I'll succeed in the next month in writing some methods to modify > bioentry directly in the SQL db. we can talk about this later, as soon as > we have some code to work on. Sure - there is no hurry. > however audit trail will not be possible in the current BioSQL schema, > unless using separate tables (as I'm actually doing). but I don't think > this will be easily integrable in biopython. Is there anyone needing > user logs? I agree, and didn't mean to suggest adding audit tables to the BioSQL schema. I was just pointing out this issue (depending on the intended usage, this may or may not be a problem). Peter From biopython at maubp.freeserve.co.uk Wed Jul 21 12:40:49 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Jul 2010 17:40:49 +0100 Subject: [Biopython-dev] Active projects list? 
(BOSC Biopython Project Update) In-Reply-To: References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> Message-ID: On Wed, Jul 21, 2010 at 4:43 PM, Andrea Pierleoni wrote: >> >> Well, I went and did the basics to be consistent with the existing limited >> dict like support in BioSeqDatabase. Would you mind testing it? >> >> This can be improved by iterating over the cursor rather than building a >> list of identifiers in memory. Likewise __len__ and __contains__ can be >> turned into SQL statements to be more efficient. Do you fancy trying that? >> >> Peter >> > > I've tested the new BioSeqDatabase in postgres BioSQL db containing 50000 > bioentry, and it works very fast (I'm using python 2.6) even in this way. Good :) > howver using SQL will be much better of course. I will take a try, as soon > as I fix the DAS client and UniprotIO. I had time this afternoon to do __len__ and __contains__ with SQL, and add a couple of tests here too. Memory efficient Iteration can wait for another day - I'm going home now. We should probably have started a new thread for this BioSQL discussion. Peter From andrea at biocomp.unibo.it Thu Jul 22 10:13:21 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Thu, 22 Jul 2010 16:13:21 +0200 (CEST) Subject: [Biopython-dev] DAS client in biopython In-Reply-To: References: Message-ID: > The heirachy seems unnecessarily nested, why not move the > code in Bio/DAS/DASpy.py into Bio/DAS/__init__.py? Or > even into Bio/DAS.py instead? Then that import becomes: > from Bio.DAS import DASpy, which also avoids the ambiguity > of DASpy for a module and a class. Are you expecting to have > other files under Bio/DAS? > hierarchy is now simplified to a single file DAS.py under Bio. > Also the name DASpy confuses me, maybe the class > should be something about DAS Servers? > I renamed the DASpy class to DASregistry so the main call now is: from Bio.DAS import DASregistry das = DASregistry() simplier... 
> Would it be right to regard the class DASSeq as a subclass
> of SeqRecord? It looks like a minimally annotated sequence.
> See also the DBSeqRecord in BioSQL.
>

I've been thinking about it and, actually, the DASSeq class corresponds
exactly to the information and methods available in the DAS sequence
method, so I'd leave it this way. Most of the time this class shouldn't
be accessed, and a clean SeqRecord object can be obtained using the
"fetch_to_seqrec" method in DASregistry.

Andrea

From biopython at maubp.freeserve.co.uk Thu Jul 22 10:33:57 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 22 Jul 2010 15:33:57 +0100
Subject: [Biopython-dev] DAS client in biopython
In-Reply-To:
References:
Message-ID:

On Thu, Jul 22, 2010 at 3:13 PM, Andrea Pierleoni wrote:
>
>> The hierarchy seems unnecessarily nested, why not move the
>> code in Bio/DAS/DASpy.py into Bio/DAS/__init__.py? Or
>> even into Bio/DAS.py instead? Then that import becomes:
>> from Bio.DAS import DASpy, which also avoids the ambiguity
>> of DASpy for a module and a class. Are you expecting to have
>> other files under Bio/DAS?
>>
>
> hierarchy is now simplified to a single file DAS.py under Bio.
>
>> Also the name DASpy confuses me, maybe the class
>> should be something about DAS Servers?
>>
>
> I renamed the DASpy class to DASregistry so the main call now
> is:
>
> from Bio.DAS import DASregistry
>
> das = DASregistry()
>
> simpler...
>

That appears to make sense :)

>> Would it be right to regard the class DASSeq as a subclass
>> of SeqRecord? It looks like a minimally annotated sequence.
>> See also the DBSeqRecord in BioSQL.
>>
>
> I've been thinking about it and, actually, the DASSeq class
> corresponds exactly to information and methods available in
> the DAS sequence method, so I'd leave it this way.

So what DAS calls a sequence is closer to Biopython's SeqRecord than
Biopython's Seq object? Hmm - that could cause confusion, whatever you
call your class.
> Most of the time this class shouldn't be accessed, and a clean
> SeqRecord object can be obtained using the "fetch_to_seqrec"
> method in DASregistry.

I'll take another look at your code later.

Peter

From andrea at biocomp.unibo.it Thu Jul 22 10:51:45 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Thu, 22 Jul 2010 16:51:45 +0200 (CEST)
Subject: [Biopython-dev] DAS client in biopython
In-Reply-To:
References:
Message-ID:

>
> So what DAS calls a sequence is closer to Biopython's SeqRecord
> than Biopython Seq object? Hmm - that could cause confusion,
> whatever you call your class.
>

Here is an example of a DASSEQUENCE:

MLAKATLAIVLSAASLPVLAAQCEATIESNDAMQYNLKEMVVDKSCKQFTVHLKHVGKMAKVAMGHNWVLTKEADKQGVATDGMNAGLAQDYVKAGDTRVIAHTKVIGGGESDSVTFDVSKLTPGEAYAYFCSFPGHWAMMKGTLKLSN

It is basically a Seq object with some associated metadata that I'm
keeping. The moltype is used to set the Alphabet. It has an ID, so it
could also fit a SeqRecord, but the DASseq class should not be used
outside of DAS.py. Then you can link to this sequence the features and
annotations that are parsed from the DASGFF XML response. The big
confusion here is that both SeqRecord annotations and features come
with DASGFF. Annotations have start and end positions equal to 0.

>> Most of the time this class shouldn't be accessed, and a clean
>> SeqRecord object can be obtained using the "fetch_to_seqrec"
>> method in DASregistry.
>
> I'll take another look at your code later.
> any comment is welcome, thanks Andrea From chapmanb at 50mail.com Fri Jul 23 07:48:06 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 23 Jul 2010 07:48:06 -0400 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> Message-ID: <20100723114806.GA1868@sobchak.mgh.harvard.edu> Peter; [Simplified interface for calling commandline programs] > Having thought about this for a while, I realised that in almost every > case I have never cared about the exact return code, just if it is zero > (success) or not (failure). Therefore the behaviour of the subprocess > functions check_call (Python 2.5+) and check_output (Python 2.7+) > seems desirable (you get an exception if the return code is non zero). This makes good sense. > That just leaves what to return: stdout and/or stderr. I personally > have never needed to merge stderr and stdout into a single pipe > or string - the only use case for this I can think of is to capture the > output into a file for logging purposes. Generally it makes more sense > to keep them separate. This leaves the question should we return > just stdout, or both? Sometimes stderr is useful, so I think both. Both is also my preference. > So, in yet-another-branch, I wrote a __call__ implementation which > raises an exception on non-zero return codes, but otherwise returns > stdout and stderr as a tuple of two strings: > > http://github.com/peterjc/biopython/commits/app-exec3 Generally the idea and implementation are great. My only specific suggestion is regarding the default handling of stdout and stderr when you don't want to capture them. Currently you are eating those by writing to /dev/null. Would it be clearer to just use the default, which is to continue to route the programs stdout and stderr through the main instance? 
This gives friendly feedback that the program is running and makes debugging errors easier, especially if an external program doesn't use error codes correctly. Awesome to see this going in, Brad From biopython at maubp.freeserve.co.uk Fri Jul 23 09:19:37 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 23 Jul 2010 14:19:37 +0100 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: <20100723114806.GA1868@sobchak.mgh.harvard.edu> References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> <20100723114806.GA1868@sobchak.mgh.harvard.edu> Message-ID: On Fri, Jul 23, 2010 at 12:48 PM, Brad Chapman wrote: > Peter; > > [Simplified interface for calling commandline programs] >> Having thought about this for a while, I realised that in almost every >> case I have never cared about the exact return code, just if it is zero >> (success) or not (failure). Therefore the behaviour of the subprocess >> functions check_call (Python 2.5+) and check_output (Python 2.7+) >> seems desirable (you get an exception if the return code is non zero). > > This makes good sense. > Good. >> That just leaves what to return: stdout and/or stderr. I personally >> have never needed to merge stderr and stdout into a single pipe >> or string - the only use case for this I can think of is to capture the >> output into a file for logging purposes. Generally it makes more sense >> to keep them separate. This leaves the question should we return >> just stdout, or both? Sometimes stderr is useful, so I think both. > > Both is also my preference. > Good. >> So, in yet-another-branch, I wrote a __call__ implementation which >> raises an exception on non-zero return codes, but otherwise returns >> stdout and stderr as a tuple of two strings: >> >> http://github.com/peterjc/biopython/commits/app-exec3 > > Generally the idea and implementation are great. 
My only specific > suggestion is regarding the default handling of stdout and stderr > when you don't want to capture them. Currently you are eating those > by writing to /dev/null. Would it be clearer to just use the > default, which is to continue to route the programs stdout and > stderr through the main instance? This gives friendly > feedback that the program is running and makes debugging errors > easier, especially if an external program doesn't use error codes > correctly. Fair point. Personally I'd either want to capture the output (default) or completely ignore it (hence the implementation in this branch). Anyone else want to comment on this aspect? > Awesome to see this going in, > Brad Peter From biopython at maubp.freeserve.co.uk Mon Jul 26 11:04:41 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Jul 2010 16:04:41 +0100 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: Message-ID: Andrea Pierleoni wrote: > > Hi Everyone, > I've been using a lot biopython in the last couple of years, it is very > useful to me. So now it's my turn to contribute and be helpful to someone > else. > I wrote a parser for the Uniprot XML format, that is reasonably fast (8000 > entries/min on a core2duo mainstream PC). The main improvements with the > actual SwissProt flat file parser are a deeper parsing of comment fields, > and a Seqrecord containing features. > > The parser is based on the ElementTree library and was successfully tested > on the complete SwissProt database (v57.12). Thus I think it is ready to > be released. 
>
> I followed the rules to develop a new parser for SeqIO, filed an
> enhancement bug to bugzilla (bug 2992), and included the parser in a
> public biopython fork on github available at:
>
> http://github.com/apierleoni/biopython/tree/uniprotxml-branch
>
> the new parser is in the "uniprotxml-branch" branch, and the parser code
> is in Bio/SeqIO/UniprotIO.py
>
> The parser can be used from SeqIO using:
>
> iterator=SeqIO.parse(handle,'uniprot')
>
> I think this could be easily integrated in Biopython, unit test is still
> missing, but should be very easy to do.
> Anyhow any code review or suggestions are welcome.
>
> Andrea

Hi Andrea,

As you have probably noticed via github, I have been trying out your
code. I noticed you hadn't implemented indexing support so I have done
this on my branch as a quick hack:

http://github.com/peterjc/biopython/commits/uniprot

What I want to be able to do is seek to the start of an <entry> element
in the XML handle, and have the parser continue from that point. I've
done this by the nasty trick of extracting the record from the XML file
as a string (using the get_raw method of the index class), then adding
the XML header and footer to it, and then invoking your parser. There
should be a better way to do this, but I am not familiar enough with
ElementTree to see it right away. Can you improve on this?

I'd also like to have SeqFeature parsing done for the plain text
"swiss" parser as well, which can double as a cross check for your
parser. Did you look at my old patch?
http://bugzilla.open-bio.org/show_bug.cgi?id=2235 We should also run a comparison test of the "swiss" plain text and "uniprot" XML parsers on the full downloads of UniProtKB/Swiss-Prot and/or UniProtKB/TrEMBL, see http://www.uniprot.org/downloads Peter From biopython at maubp.freeserve.co.uk Mon Jul 26 11:12:36 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Jul 2010 16:12:36 +0100 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: <20100723114806.GA1868@sobchak.mgh.harvard.edu> References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> <20100723114806.GA1868@sobchak.mgh.harvard.edu> Message-ID: On Fri, Jul 23, 2010 at 12:48 PM, Brad Chapman wrote: > Peter; > > [Simplified interface for calling commandline programs] > ... > > Awesome to see this going in, > Brad It is in now, I cherry-picked the changes I'd made on the app-exec3 branch (seemed a bit silly to do a merge for a little thing like this and make the history even more confusing). I haven't update the tutorial yet... Peter From biopython at maubp.freeserve.co.uk Mon Jul 26 12:08:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Jul 2010 17:08:10 +0100 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> <20100723114806.GA1868@sobchak.mgh.harvard.edu> Message-ID: On Mon, Jul 26, 2010 at 4:12 PM, Peter wrote: > On Fri, Jul 23, 2010 at 12:48 PM, Brad Chapman wrote: >> Peter; >> >> [Simplified interface for calling commandline programs] >> ... >> >> Awesome to see this going in, >> Brad > > It is in now, I cherry-picked the changes I'd made on the app-exec3 branch > (seemed a bit silly to do a merge for a little thing like this and make the > history even more confusing). > > I haven't updated the tutorial yet... 
I have updated the tutorial now - note that this just uses the default
__call__ functionality; for simplicity I am avoiding mentioning the
optional arguments (they are covered in the docstring of course).

Peter

From biopython at maubp.freeserve.co.uk Mon Jul 26 12:47:51 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Jul 2010 17:47:51 +0100
Subject: [Biopython-dev] long for longitude in Bio.Phylo and Python 3
Message-ID:

Hi Eric et al.,

Background: Eric has found a problem in Bio.Phylo with variables,
arguments and properties called "long" for longitude which the 2to3
script is wrongly converting into "int", see:
http://bugs.python.org/issue2734

Even if the remaining issue with Bug 2734 is fixed, we would still have
a problem running the conversion with 2to3 as included with all
releases of Python to date (i.e. 2.6, 2.7, 3.1), which would complicate
deployment.

Eric: It could break backwards compatibility, but would a switch from
lat & long to latitude and longitude be the least painful solution? Do
you think we could support both names as part of a deprecation cycle?

Peter

From eric.talevich at gmail.com Mon Jul 26 13:04:24 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 26 Jul 2010 13:04:24 -0400
Subject: [Biopython-dev] long for longitude in Bio.Phylo and Python 3
In-Reply-To:
References:
Message-ID:

On Mon, Jul 26, 2010 at 12:47 PM, Peter wrote:
> Hi Eric et al.,
>
> Background: Eric has found a problem in Bio.Phylo with variables, arguments
> and properties called "long" for longitude which the 2to3 script is wrongly
> converting into "int", see: http://bugs.python.org/issue2734
>
> If the remaining issue with Bug 2734 is fixed, we would still have a
> problem
> running the conversion with 2to3 as included with all releases of Python to
> date (i.e. 2.6, 2.7, 3.1), which would complicate deployment.
> > Eric: It could break backwards compatibility, but would a switch from lat & > long to latitude and longitude be the least painful solution? Do you think > we could support both names as part of a deprecation cycle? > > Peter > The names "lat", "long" and "alt" are from the phyloXML spec, so it's convenient to keep them the same in Biopython. But I could change them to the longer form if that's needed. The parser and serializer assume the attribute names match the XML spec in general, and special-case names that won't work in Python (like "from"). Deprecation: Since we note in the Tutorial that Bio.Phylo is semi-beta, I'd like to use an accelerated deprecation cycle for name changes like this: 1 transitional release with shims that trigger a warning, then remove the shims in the release after that. Is that OK? I haven't had a chance to try "2to3 --nofix=long" on the entire codebase yet. Best, Eric From biopython at maubp.freeserve.co.uk Mon Jul 26 13:19:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Jul 2010 18:19:24 +0100 Subject: [Biopython-dev] long for longitude in Bio.Phylo and Python 3 In-Reply-To: References: Message-ID: On Mon, Jul 26, 2010 at 6:04 PM, Eric Talevich wrote: > On Mon, Jul 26, 2010 at 12:47 PM, Peter wrote: > >> Hi Eric et all, >> >> Background: Eric has found a problem in Bio.Phylo with variables, arguments >> and properties called "long" for longitude which the 2to3 script is wrongly >> converting into "int", see: http://bugs.python.org/issue2734 >> >> If the remaining issue with Bug 2734 is fixed, we would still have a >> problem >> running the conversion with 2to3 as included with all releases of Python to >> date (i.e. 2.6, 2.7, 3.1), which would complicate deployment. >> >> Eric: It could break backwards compatibility, but would a switch from lat & >> long to latitude and longitude be the least painful solution? Do you think >> we could support both names as part of a deprecation cycle? 
>>
>> Peter
>>
>
> The names "lat", "long" and "alt" are from the phyloXML spec, so it's
> convenient to keep them the same in Biopython. But I could change them to
> the longer form if that's needed. The parser and serializer assume the
> attribute names match the XML spec in general, and special-case names that
> won't work in Python (like "from").
>
> Deprecation: Since we note in the Tutorial that Bio.Phylo is semi-beta, I'd
> like to use an accelerated deprecation cycle for name changes like this: 1
> transitional release with shims that trigger a warning, then remove the
> shims in the release after that. Is that OK?
>
> I haven't had a chance to try "2to3 --nofix=long" on the entire codebase
> yet.

Assuming that using "2to3 --nofix=long" on the entire codebase isn't
going to work, then I'm OK with an accelerated deprecation for
switching lat/long in Bio.Phylo. If "2to3 --nofix=long" doesn't cause
us problems elsewhere, that will be a neater solution.

Peter

From biopython at maubp.freeserve.co.uk Tue Jul 27 06:28:29 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Jul 2010 11:28:29 +0100
Subject: [Biopython-dev] WARNING - I've just been rewriting history
Message-ID:

Hi all,

If anyone has been trying to use the git repository in the last 12
hours or so, please note I have just re-written recent history. If in
any doubt, do a fresh clone. According to the github network no one
else has committed anything recently, which is good.

Re-writing history in git is possible but is generally considered a
"bad thing" because someone might have already taken and worked from
the "erased" changes. Hopefully I got away with it without messing
anyone up...

What I did and why: One of our team made a bad merge, and pushed it to
the master. If this had been spotted BEFORE being made public a local
revert could have been done.
The standard procedure here is to do a merge revert, but unfortunately it seems they reverted to the wrong branch (merge reverts can be done back to either of the two parents). At this point we had two unwanted commits, and the best way to fix this wasn't clear [at least not to us - has anyone got advice here for future reference?]. I took the (rash?) choice first thing this morning to take a new branch from just before the bad merge, and then via a few renames made that the new master branch, and deleted the problematic branch. The git history is now "clean", but has been changed. *** To repeat - if anyone did a git pull in the last 12 hours or so, please discard those changes and take a fresh clone. *** As a general warning, please think twice before any merge. Then check twice before pushing to github. I don't want to point fingers or spread blame around - we're all still learning git. I'm guilty of unnecessary merges this too - most recently 17 July, a brief fork and merge of two versions of the master branch, where with hindsight a "git rebase origin master" would have been wise before that commit. If you are not confident about merging branches, perhaps sending a merge pull request might be safer - get someone else to go it ;) Would anyone other than me feel happy handling merge requests? Regards, Peter From tiagoantao at gmail.com Tue Jul 27 08:41:25 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 27 Jul 2010 13:41:25 +0100 Subject: [Biopython-dev] WARNING - I've just been rewriting history In-Reply-To: References: Message-ID: On Tue, Jul 27, 2010 at 11:28 AM, Peter wrote: > What I did and why: One of our team made a bad merge, and pushed it to "One of our team", erhm... that would be me. Worse, it is the second time that I make the exact same mistake. My sincere apologies. Will not happen again, I will never do a merge again, in any case. One was fool, two was freakish. Three won't happen. 
Tiago From biopython at maubp.freeserve.co.uk Tue Jul 27 09:23:27 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Jul 2010 14:23:27 +0100 Subject: [Biopython-dev] Python 3 and encoding for online resources Message-ID: Hi all, One of the remaining (pure Python) problems with Biopython under Python 3 relates to parsing online resources like the NCBI Entrez API or even Bio.ExPASy.get_sprot_raw(). See for example test_SeqIO_online.py for a failure. In Python 2, urlopen from urllib or urllib2 would give a string handle. In Python 3, you get a bytes handle (not a unicode handle, and choosing the encoding is tricky): http://docs.python.org/py3k/library/urllib.request.html In the case of resources like the NCBI and ExPASy we should be able to assume an encoding (maybe UTF-8 or Latin) for all the plain text output, while from XML/HTML there are ways for the data to specify this itself. I think we may need to transform the urllib bytes handle into a unicode string handle for parsing. One option would be to extend the Bio.File.UndoHandle class (or invent a subclass) which applies the decoding. This seems simple since Bio.Entrez and Bio.ExPASy already use this class. Another option (which I suggested on the Bio.SeqIO.index() thread [1]) would be to extend our parsers to cope with both byte and unicode handles. That could be a lot of work though... Thoughts? Peter [1] http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008004.html From andrea at biocomp.unibo.it Tue Jul 27 09:50:53 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Tue, 27 Jul 2010 15:50:53 +0200 (CEST) Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: Message-ID: <3b674cf220d52226cf9b2e189598fe61.squirrel@lipid.biocomp.unibo.it> > > Hi Andrea, > > As you have probably noticed via github, I have been trying out your code.
> > I noticed you hadn't implemented indexing support so I have done this on > my branch as a quick hack: > > http://github.com/peterjc/biopython/commits/uniprot good, are we going to continue developing on two separate branches/repos? if you want I can grant you access to my repo, no problem, just to make things simpler... > > What I want to be able to do is seek to the start of an in the > XML handle, and have the parser continue from that point. I've done this > by the nasty trick of extracting the record from the XML file as a string > (using the get_raw method of the index class), then adding the XML > header and footer to it, and then invoking your parser. There should > be a better way to do this, but I am not familiar enough with > ElementTree to see it right away. Can you improve on this? > well it can be done using ElementTree, maybe it will also be faster than using the re module (actually I don't know if the re module is used by etree). however using cElementTree, when possible, will improve performance. by using ElementTree we can also handle namespaces, returning a valid uniprot XML file/string. > I'd also like to have SeqFeature parsing done for the plain text "swiss" > parser as well, which can double as a cross check for your parser. Did you > look at my old patch? http://bugzilla.open-bio.org/show_bug.cgi?id=2235 > yes I looked at it, and Mauro built some unit tests to compare the results between the two parsers, take a look at Tests / test_Uniprot.py in my repo: http://github.com/apierleoni/biopython/blob/uniprotxml-branch/Tests/test_Uniprot.py > We should also run a comparison test of the "swiss" plain text and > "uniprot" XML parsers on the full downloads of UniProtKB/Swiss-Prot > and/or UniProtKB/TrEMBL, see http://www.uniprot.org/downloads > I've successfully tested the latest version in my branch on the current version of UniprotKB/Swiss-Prot.
the main differences between the two formats will be the comment field, and I don't see how they can match, sincce they are very different from the two original uniprot files. any idea? just to be clear, are we going to call this parser format just "uniprot" or "uniprot-xml"? Andrea From biopython at maubp.freeserve.co.uk Tue Jul 27 10:04:01 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Jul 2010 15:04:01 +0100 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: <3b674cf220d52226cf9b2e189598fe61.squirrel@lipid.biocomp.unibo.it> References: <3b674cf220d52226cf9b2e189598fe61.squirrel@lipid.biocomp.unibo.it> Message-ID: On Tue, Jul 27, 2010 at 2:50 PM, Andrea Pierleoni wrote: >> >> Hi Andrea, >> >> As you have probably noticed via github, I have been trying out your code. >> >> I noticed you hadn't implemented indexing support so I have done this on >> my branch as a quick hack: >> >> http://github.com/peterjc/biopython/commits/uniprot > > good, are we going to continue developing on two separate branches/repos? > if you want I can grant you acces to my repo, no problem, just to make > things simpler... Partly it was because you had some unrelated stuff on your uniprot branch (something in the FASTA m10 parser - I'd be interested to see an example file which triggered your change). >> >> What I want to be able to do is seek to the start of an in the >> XML handle, and have the parser continue from that point. I've done this >> by the nasty trick of extracting the record from the XML file as a string >> (using the get_raw method of the index class), then adding the XML >> header and footer to it, and then invoking your parser. There should >> be a better way to do this, but I am not familiar enough with >> ElementTree to see it right away. Can you improve on this? >> > > well it can be done using ElementTree, maybe it will also be faster than > using > the re module (actually I don't know if the re module is used by etree). 
> however using cElementTree, when possible, will improve performance. > by using ElementTree we can also handle namespace, > rteurning a valid uniprot XML file/string. If you can do this via (c)ElementTree, without building a dummy XML single record as a string in memory first, that would be worth trying. >> I'd also like to have SeqFeature parsing done for the plain text "swiss" >> parser as well, which can double as a cross check for your parser. Did you >> look at my old patch? http://bugzilla.open-bio.org/show_bug.cgi?id=2235 >> > > yes I looked at it, At some point I'll try the patch and test it against your UniProt XML feature generation. If I recall correctly there were some special cases with features at the very start of the protein which puzzled me. Hopefully the XML descriptions are clearer. > ... and Mauro build some unit testing to compare the results > between the two parsers, take a look at Tests / test_Uniprot.py in my repo: > > http://github.com/apierleoni/biopython/blob/uniprotxml-branch/Tests/test_Uniprot.py I thought I tried your version of the test but the seq_tests_common function compare_records seemed to strict... >> We should also run a comparison test of the "swiss" plain text and >> "uniprot" XML parsers on the full downloads of UniProtKB/Swiss-Prot >> and/or UniProtKB/TrEMBL, see http://www.uniprot.org/downloads >> > > I've succesfully tested the last version in my ranch on the current > version of UniprotKB/Swiss-Prot. Good. > the main differences between the two formats will be the comment field, > and I don't see how they can match, sincce they are very different from > the two original uniprot files. > > any idea? I avoided this issue in the test on my branch ;) I think we should update the plain text parser and BioSQL wrapper to support use the same nesting as BioPerl is using. i.e. Start by running BioPerl to import a record into BioSQL, and see how the comment ended up. 
> just to be clear, are we going to call this parser format just "uniprot" or > "uniprot-xml"? Another open question, I recall asking this on the open-bio cross project mailing list, but can't find it in the archive. Maybe I just meant to write an email and forgot? Do you remember this - I would have CC'd you. Basically I don't have a strong view on "uniprot" versus "uniprot-xml" but would like to agree this with BioPerl and EMBOSS. Peter From biopython at maubp.freeserve.co.uk Tue Jul 27 10:44:53 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Jul 2010 15:44:53 +0100 Subject: [Biopython-dev] Python 3 status (ignoring numpy and our C code) Message-ID: Hi all, I haven't gotten round to installing NumPy under Python 3 on this machine. Summary of test output (ignoring all the passes and skipped tests) using 2to3 with default settings. ------------------------------------------------------------------------ test_CAPS ... ERROR test_Restriction ... ERROR TypeError: unhashable type: 'RestrictionType' This is a tricky issue, see: http://lists.open-bio.org/pipermail/biopython-dev/2010-July/007975.html ------------------------------------------------------------------------ test_Crystal ... FAIL Slicing issues, we could fix them or just deprecate Bio.Crystal http://lists.open-bio.org/pipermail/biopython-dev/2010-July/thread.html ------------------------------------------------------------------------ test_LocationParser ... Syntax error at or near `467' token Something in the spark parser isn't handled by 2to3, not urgent as I want to deprecate Bio.GenBank.LocationParser which is the only thing using spark. ------------------------------------------------------------------------ test_NCBI_BLAST_tools ... FAIL Not Python 3 specific, the latest BLAST+ has changed some switches. ------------------------------------------------------------------------ test_PhyloXML ...
FAIL Longitude versus long problem with 2to3: http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008071.html ------------------------------------------------------------------------ test_SeqIO_index ... ok Test passes but is very very slow, see: http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008004.html ------------------------------------------------------------------------ test_SeqIO_online ... FAIL May need to turn all online byte handles into unicode handles, http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008076.html ------------------------------------------------------------------------ test_property_manager ... FAIL I think this is a change to the default object's __repr__ method, and/or module name vs __main__ but in any case I'm tempted to deprecate Bio.PropertyManager because we don't really use it and I don't understand it ("Here be dragons!") ------------------------------------------------------------------------ Not looking too bad. Now I really should install NumPy on this machine... Peter From andrea at biocomp.unibo.it Tue Jul 27 10:55:20 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Tue, 27 Jul 2010 16:55:20 +0200 (CEST) Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: <3b674cf220d52226cf9b2e189598fe61.squirrel@lipid.biocomp.unibo.it> Message-ID: > Partly it was because you had some unrelated stuff on your uniprot branch > (something in the FASTA m10 parser - I'd be interested to see an example > file which triggered your change). > yes, I know, about the FASTA parser, but actually that change did not fix the problem, just get better. the m10 parser has problems when parsing from glsearch output, but we could discuss that in a separe thread. > If you can do this via (c)ElementTree, without building a dummy XML > single record as a string in memory first, that would be worth trying. > yes it can be done, I'll put this in my work list. 
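A minimal sketch of the kind of record-by-record parsing with ElementTree discussed above, assuming the UniProt XML namespace is http://uniprot.org/uniprot and using two made-up entries (the accessions and names are hypothetical). This is not the code on either branch, and it streams from the start of the handle rather than seeking to an offset, so it only illustrates the namespace handling and the memory-friendly iteration:

```python
import io
import xml.etree.ElementTree as ElementTree

# Assumed namespace for UniProt XML; check it against a real file.
NS = "{http://uniprot.org/uniprot}"

# Made-up two-record document, for illustration only.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot">
    <accession>P00001</accession>
    <name>DEMO1_HUMAN</name>
  </entry>
  <entry dataset="Swiss-Prot">
    <accession>P00002</accession>
    <name>DEMO2_HUMAN</name>
  </entry>
</uniprot>
"""


def iter_entries(handle):
    """Yield (accession, name) for each entry element, one at a time."""
    for event, elem in ElementTree.iterparse(handle, events=("end",)):
        if elem.tag == NS + "entry":
            accession = elem.findtext(NS + "accession")
            name = elem.findtext(NS + "name")
            yield accession, name
            # Drop the children of the finished entry to keep memory flat.
            elem.clear()


records = list(iter_entries(io.BytesIO(SAMPLE)))
```

Each entry is handled as soon as its closing tag is seen, so no dummy single-record XML string has to be assembled first; whether this can be combined with an offset-based index is a separate question.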
> > At some point I'll try the patch and test it against your UniProt XML > feature generation. If I recall correctly there were some special cases > with features at the very start of the protein which puzzled me. Hopefully > the XML descriptions are clearer. > XML descriptions are clearer, but have some problems as well. some features do not have a start and end point. in this case I skipped them. >> ... and Mauro build some unit testing to compare the results >> between the two parsers, take a look at Tests / test_Uniprot.py in my >> repo: >> >> http://github.com/apierleoni/biopython/blob/uniprotxml-branch/Tests/test_Uniprot.py > > I thought I tried your version of the test but the seq_tests_common > function > compare_records seemed to strict... > It depends on how closely we want the plain-text and XML parsers to match. I don't think we could end up with 100% identical seqrecords, and some flexibility should be used. > > I avoided this issue in the test on my branch ;) > > I think we should update the plain text parser and BioSQL wrapper to > support > use the same nesting as BioPerl is using. i.e. Start by running > BioPerl to import > a record into BioSQL, and see how the comment ended up. > well, BioPerl guys weren't very collaborative on the BioSQL mailing list. however I just read a couple of messages at that time. they are using their schema and BioJava is not using the same schema. I don't know about other projects. I think we have 3 choices: 1) follow BioPerl whatever they do (could be good) 2) try to define our rules (bad) 3) set a defined open schema and propose it to BioSQL (good) In my parser I'm storing information from the comment as annotations in the seqrecords, building annotation keys on the basis of the XML tree. this is a quick and dirty hack, but can be done much better. we could store complex comment fields with XML, but I'm not inclined to use just a big XML string in the comment field.
Also keep in mind that the "comment" field is no longer called comments in the uniprot web-site but "general annotations", so maybe it makes sense to store this data as annotation in some other place. >> just to be clear, are we going to call this parser format just >> "uniprot" or >> "uniprot-xml"? > > Another open question, I recall asking this on the open-bio cross project > mailing list, but can't find it in the archive. Maybe I just meant to > write an > email and forgot? Do you remember this - I would have CC'd you. > Basically I don't have a strong view on "uniprot" versus "uniprot-xml" but > would like to agree this with BioPerl and EMBOSS. The issue here was that I started calling this format "uniprot", then I realized that in the EBI REST services the file format is referred to as "uniprot-xml". currently in my branch it is called uniprot-xml Andrea From biopython at maubp.freeserve.co.uk Tue Jul 27 11:16:00 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Jul 2010 16:16:00 +0100 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: <3b674cf220d52226cf9b2e189598fe61.squirrel@lipid.biocomp.unibo.it> Message-ID: On Tue, Jul 27, 2010 at 3:55 PM, Andrea Pierleoni wrote: > >> At some point I'll try the patch and test it against your UniProt XML >> feature generation. If I recall correctly there were some special cases >> with features at the very start of the protein which puzzled me. Hopefully >> the XML descriptions are clearer. >> > > XML descriptions are clearer, but have some probvlem as well. > some features do not have a stat and end point. in this case I skipped them. If you have some specific examples (IDs) to hand that would be useful. >>> ...
and Mauro build some unit testing to compare the results >>> between the two parsers, take a look at Tests / test_Uniprot.py in my >>> repo: >>> >>> http://github.com/apierleoni/biopython/blob/uniprotxml-branch/Tests/test_Uniprot.py >> >> I thought I tried your version of the test but the seq_tests_common >> function compare_records seemed to strict... >> > > I depends how how well we want to fit the plain-text vs xml parser. > I don't think we could end up in 100% identical seqrecords, and some > flexibility should be used. I agree we're not going to get 100% identical records. >> I think we should update the plain text parser and BioSQL wrapper to >> support use the same nesting as BioPerl is using. i.e. Start by running >> BioPerl to import a record into BioSQL, and see how the comment >> ended up. >> > > well, BioPerl guys weren't very collaborative on the BioSQL mailing list. > however I just read a couple of messages at that time. > > they are using their schema and BioJava is not using the same schema. > I don't know about other projects. Perhaps you are using "schema" in a different way that I would. All the projects use the same schema (where I mean database tables), but there are differences in the details of how each file format gets parsed and ends up stored in those tables. > I think we have 3 choiches: > > 1) follow BioPerl whatever they does (could be good) > 2) try to define our rules (bad) > 3) set a defined open schema and propose it to BioSQL (good) If in (3) you mean we should have some clear examples of major file formats and how each field should end up in BioSQL, I agree. In the short to medium term I regard the bioperl-db mapping as the reference implementation (although their code does continue to change), i.e. (1). 
I found one of the threads I was thinking about in the archive, http://bioperl.org/pipermail/biosql-l/2010-January/001672.html http://bioperl.org/pipermail/bioperl-l/2010-January/031993.html http://bioperl.org/pipermail/open-bio-l/2010-January/000609.html > In my parser I'm storing information from the comment as annotations > in the seqrecords, buinding annotation key on the basis of the XML > tree. this is a quick and dirty hack, but can be done much better. > > we could store complex comment field with XML, but I'm not incline > in using just a big XML string in the comment field. Some sort of nested structure like a dictionary? Are you familiar with the Perl TagTree, which is what BioPerl are using here? I think Richard Holland said (in the above linked thread) that BioJava just sticks the DE section as an XML string into their record object (and thus puts XML in the BioSQL database?). > Also keep in mind that the "comment" field is no longer called comments > in the uniprot web-site but "general annotations", so maybe it makes sense > to store this data as annotation in some other place. Sounds sensible. >>> just to be clear, are we going to call this parser format just >>> "uniprot" or >>> "uniprot-xml"? >> >> Another open question, I recall asking this on the open-bio cross project >> mailing list, but can't find it in the archive. Maybe I just meant to write >> an email and forgot? Do you remember this - I would have CC'd you. >> Basically I don't have a strong view on "uniprot" versus "uniprot-xml" but >> would like to agree this with BioPerl and EMBOSS. > > > The issue here was that I started calling this format "uniprot" then I > realize in the EBI REST services the file format is referred as > "uniprot-xml". currently in my branch it is called uniprot-xml > I'll (re-)post that as a specific query on the open-bio-l mailing list...
Peter From andrea at biocomp.unibo.it Tue Jul 27 12:37:59 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Tue, 27 Jul 2010 18:37:59 +0200 (CEST) Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: <3b674cf220d52226cf9b2e189598fe61.squirrel@lipid.biocomp.unibo.it> Message-ID: <8ec7479153894f66ea029abd059e06c5.squirrel@lipid.biocomp.unibo.it> >> XML descriptions are clearer, but have some probvlem as well. >> some features do not have a stat and end point. in this case I skipped >> them. > > If you have some specific examples (IDs) to hand that would be useful. > try this: http://www.uniprot.org/uniprot/Q8NE62.xml the "error" refers to old '?' symbol in feature positions it carries this feature: I'm actually skipping al the features/comments carrying a status="unknown" attrib in start or end positions, or both. other examples: 3HIDH_DICDI ADAM1_RAT ADAM1_RAT ADM1B_MOUSE ADM1B_MOUSE CARDH_CYNCA CARDH_CYNCA CHDH_HUMAN COQ41_PARTE COQ4_CHAGB COQ4_LEIMA COX11_DICDI COX11_DICDI COX16_NEUCR ... I'm actually skipping all the features having a > > I agree we're not going to get 100% identical records. good > > Perhaps you are using "schema" in a different way that I would. All the > projects use the same schema (where I mean database tables), but > there are differences in the details of how each file format gets parsed > and ends up stored in those tables. Yes I'm referring to data schema in general, not strictly the BioSQL schema. I don't mean to change the BioSQL schema. > >> I think we have 3 choiches: >> >> 1) follow BioPerl whatever they does (could be good) >> 2) try to define our rules (bad) >> 3) set a defined open schema and propose it to BioSQL (good) > > If in (3) you mean we should have some clear examples of major file > formats and how each field should end up in BioSQL, I agree. 
In the > short to medium term I regard the bioperl-db mapping as the reference > implementation (although their code does continue to change), i.e. (1). > > I found one of the threads I was thinking about in the archive, > http://bioperl.org/pipermail/biosql-l/2010-January/001672.html > http://bioperl.org/pipermail/bioperl-l/2010-January/031993.html > http://bioperl.org/pipermail/open-bio-l/2010-January/000609.html so does it make sense to follow their code and their changes? this would be valid just for BioPerl and BioPython. > >> In my parser I'm storing information from the comment as annotations >> in the seqrecords, buinding annotation key on the basis of the XML >> tree. this is a quick and dirty hack, but can be done much better. >> >> we could store complex comment field with XML, but I'm not incline >> in using just a big XML string in the comment field. > > Some sorted of nested structure like a dictionary? Are you familiar > with the Perl TagTree which is what BioPerl are using here. I think > Richard Holland said (in the above linked thread) that BioJava just > sticks the DE section as an XML string into their record object > (and thus puts XML in the BioSQL database?). > I'm not familiar with the TagTree but I looked at it when there was the discussion, and I do not see any advantage in using this explicitly in the db fields instead of XML. I would save XML text in the DB, easily readable by every language and even humans. XML text can also be queried easily. Then I'd represent this XML in a nested dictionary structure similar to the Perl TagTree. I don't know if there is any implementation of this in Python... >> Also keep in mind that the "comment" field is no longer called comments >> in the uniprot web-site but "general annotations", so maybe it makes >> sense >> to store this data as annotation in some other place. > > Sounds sensible. you can use XML here too, if needed.
Also by using XML, we could be able to store dictionary-containing seqrecords in a BioSQL db. A big plus to me. > > I'll (re-)post that as a specific query on the open-bio-l mailing list... > it looks like anybody is agreeing with "uniprot-xml" Andrea From biopython at maubp.freeserve.co.uk Tue Jul 27 12:40:23 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Jul 2010 17:40:23 +0100 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: <8ec7479153894f66ea029abd059e06c5.squirrel@lipid.biocomp.unibo.it> References: <3b674cf220d52226cf9b2e189598fe61.squirrel@lipid.biocomp.unibo.it> <8ec7479153894f66ea029abd059e06c5.squirrel@lipid.biocomp.unibo.it> Message-ID: On Tue, Jul 27, 2010 at 5:37 PM, Andrea Pierleoni wrote: > >> >> I'll (re-)post that as a specific query on the open-bio-l mailing list... >> > > it looks like anybody is agreeing with "uniprot-xml" > Yes - so far at least :) http://bioperl.org/pipermail/open-bio-l/2010-July/000701.html ... http://open-bio.org/pipermail/open-bio-l/2010-July/000704.html Peter From bugzilla-daemon at portal.open-bio.org Wed Jul 28 04:20:41 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 28 Jul 2010 04:20:41 -0400 Subject: [Biopython-dev] [Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010) In-Reply-To: Message-ID: <201007280820.o6S8Kfj3001278@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3119 ------- Comment #4 from fkauff at biologie.uni-kl.de 2010-07-28 04:20 EST ------- Slashes in Taxon names may cause troubles (even when properly quoted), not only for Bio.Nexus, but also for many other programs. If you want to use / or other special characters in taxon names, better use a " or ' around them. It might be best to avoid them entirely, my experience is that at one point during file processing there will be a software that complains. 
The translate statement in the nexus file ends both with a , AND a ; after the second taxon, which is also not nexus compliant. Frank (In reply to comment #0) > I've been updating test_Prank_tool.py to cope with the latest version of Prank, > 1 July 2010 from http://www.ebi.ac.uk/goldman-srv/prank/src/prank/ > > Some changes are simple, such as removing tests using feature of Prank which > have been removed. One test is failing due to some big changes in the NEXUS > output from Prank, and this may be due to a problem with our parser: > > >>> from Bio.Nexus import Nexus > >>> n = Nexus.Nexus("output_prank_v100701.nex") > Traceback (most recent call last): > ... > Bio.Nexus.Nexus.NexusError: Taxon AK1H_ECOLI: Illegal character / in line: > (check dimensions / interleaving) > > I will attach the file, it is created by the unit test as output.2.nex but > is usually deleted. > > Peter > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 28 04:26:38 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 28 Jul 2010 04:26:38 -0400 Subject: [Biopython-dev] [Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010) In-Reply-To: Message-ID: <201007280826.o6S8QcMJ001475@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3119 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-28 04:26 EST ------- (In reply to comment #4) > Slashes in Taxon names may cause troubles (even when properly quoted), not > only for Bio.Nexus, but also for many other programs. If you want to use / or > other special characters in taxon names, better use a " or ' around them. 
It > might be best to avoid them entirely, my experience is that at one point > during file processing there will be a software that complains. Sure - but on the other hand, this why we test things too ;) > The translate statement in the nexus file ends both with a , AND a ; after the > second taxon, which is also not nexus compliant. > > Frank So you think there is a problem with PRANK's output here? Would you like to report this or should I? Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 28 04:49:17 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 28 Jul 2010 04:49:17 -0400 Subject: [Biopython-dev] [Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010) In-Reply-To: Message-ID: <201007280849.o6S8nHbQ002167@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3119 ------- Comment #6 from fkauff at biologie.uni-kl.de 2010-07-28 04:49 EST ------- I think this is a bug - taxa in a translate statement are separated by commas, and after the last one, there is a semicolon, not both. Which makes sense. You're welcome to report it - probably you have more info at hand how the file was generated... Frank PS. I Updated tree parsing in Nexus to handle the tree * PRANK = ... statement. (In reply to comment #5) > (In reply to comment #4) > > Slashes in Taxon names may cause troubles (even when properly quoted), not > > only for Bio.Nexus, but also for many other programs. If you want to use / or > > other special characters in taxon names, better use a " or ' around them. It > > might be best to avoid them entirely, my experience is that at one point > > during file processing there will be a software that complains. 
> > Sure - but on the other hand, this why we test things too ;) > > > The translate statement in the nexus file ends both with a , AND a ; after the > > second taxon, which is also not nexus compliant. > > > > Frank > > So you think there is a problem with PRANK's output here? Would you like to > report this or should I? > > Peter > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 28 06:46:19 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 28 Jul 2010 06:46:19 -0400 Subject: [Biopython-dev] [Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010) In-Reply-To: Message-ID: <201007281046.o6SAkJt3006529@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3119 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-28 06:46 EST ------- Created an attachment (id=1530) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1530&action=view) Hand corrected NEXUS output from prank v100701 I am attaching a hand edited version of the PRANK v100701 NEXUS output where I have wrapped the names with single quotes, and removed the stray comma in the translate statement. See below for details. Bio.Nexus is happy with this file. (In reply to comment #4) > Slashes in Taxon names may cause troubles (even when properly quoted), not > only for Bio.Nexus, but also for many other programs. If you want to use / > or other special characters in taxon names, better use a " or ' around them. > It might be best to avoid them entirely, my experience is that at one point > during file processing there will be a software that complains. I should have been clearer earlier: Yes, I understand that special characters like slash will cause some tools problems, but they are nevertheless common. 
In particular, PFAM alignments take the form name/start-end to encode which
subregion of a protein is being shown - like the example here which uses
AK1H_ECOLI/1-378 and AKH_HAEIN/1-382 as the taxa names.

I have just checked in a change to the error message, which I think throws
more light on the issue:
http://github.com/biopython/biopython/commit/d8a4a6edc98fa69885b6865336020db02035ff0b

Now I get:

>>> from Bio.Nexus import Nexus
>>> n = Nexus.Nexus("output_prank_v100701.nex")
Traceback (most recent call last):
...
Bio.Nexus.Nexus.NexusError: Taxon AK1H_ECOLI: Illegal character / in sequence
/1-378CPDSINAALICRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRR (check
dimensions/interleaving)

Notice that the tail of the taxon name ('/1-378') is being treated as part of
the sequence.

Having looked at the code and read the relevant bits of the NEXUS
specification (Maddison et al.), I think that PRANK is producing invalid taxa
labels. In order to include characters like slashes and dashes (minus signs)
that are considered punctuation (and thus indicate the end of the taxa label)
the labels should have been wrapped in single quotes. See the attachment.

> The translate statement in the nexus file ends both with a , AND a ; after the
> second taxon, which is also not nexus compliant.

(In reply to comment #6)
> I think this is a bug - taxa in a translate statement are separated by commas,
> and after the last one, there is a semicolon, not both. Which makes sense.

I have not looked at this aspect in detail, but will take your word for it.
See the attachment.

(In reply to comment #6)
>
> You're welcome to report it - probably you have more info at hand how the file
> was generated...
>

For the record, the file was generated as follows. The input file, in FASTA
format, has two sequences which already have gaps in them:

http://biopython.open-bio.org/SRC/biopython/Tests/Fasta/fa01
http://github.com/biopython/biopython/raw/master/Tests/Fasta/fa01

Then run prank (here using v081202), from the same directory:

$ prank -d=fa01 -f=17 -noxml -notree
Warning: option '+F' is not selected. You can select it by adding flag "+F".

PRANK: aligning sequences in 'fa01', writing results to 'output.?.nex' [plain
alignment].

Generating approximate guidetree.
Generating multiple alignment.
#1#(1/1): 95% aligned
Generating improved guidetree.
Generating improved multiple alignment.
#1#(1/1): computing full probability
Alignment done. Total time 1s

$ diff output.1.nex output.2.nex
$ more output.2.nex
#NEXUS
...

See attachment 1524, posted previously, for the output.

(In reply to comment #6)
>
> Frank
>
> PS. I updated tree parsing in Nexus to handle the
>
> tree * PRANK = ...
>
> statement.
>

Great.

Peter

From biopython at maubp.freeserve.co.uk Wed Jul 28 10:38:41 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Jul 2010 15:38:41 +0100
Subject: [Biopython-dev] Python 3 status (ignoring numpy and our C code)
In-Reply-To:
References:
Message-ID:

One other outstanding test failure (which I thought I'd fixed, but in doing so
broke Python 2) is the Bio.Seq doctests which include exceptions.

This is a known issue due to a change in the traceback for Python 2.7 and
Python 3 to include the exception module name, making it difficult to write
doctests with exceptions which also pass on both Python 2 and 3.
This seems to have been fixed in Python 2.7, while there will be a workaround
available in Python 3.2 (but apparently not in Python 3.0 or 3.1) via
doctest.IGNORE_EXCEPTION_DETAIL, see: http://bugs.python.org/issue7490

For now I have taken the pragmatic choice of skipping the Bio.Seq doctest
under Python 3.1.

Peter

From biopython at maubp.freeserve.co.uk Wed Jul 28 12:08:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Jul 2010 17:08:46 +0100
Subject: [Biopython-dev] Equality in Bio.Restriction.RestrictionType
In-Reply-To:
References:
Message-ID:

2010/7/8 Peter :
> Hi Frédéric et al,
>
> One of the things in Python 3 is that overriding equality (done with __eq__
> only since __cmp__ has gone) requires you also override __hash__. One
> remaining example of this which triggers a deprecation warning within our
> test suite when running with the -3 switch is in Bio.Restriction.
>
> I therefore had a look at how __eq__ and __ne__ are defined in the
> RestrictionType class - and strangely they do NOT seem to be inverses.
>
>     def __eq__(cls, other):
>         """RE == other -> bool
>
>         True if RE and other are the same enzyme."""
>         return other is cls
>
>     def __ne__(cls, other):
>         """RE != other -> bool.
>         isoschizomer strict, same recognition site, same restriction -> False
>         all the other-> True"""
>         if not isinstance(other, RestrictionType):
>             return True
>         elif cls.charac == other.charac:
>             return False
>         else:
>             return True
>
> Frédéric - could you clarify the intent here?

Hi Frédéric,

As implemented, __eq__ just seems to check for object identity, effectively
id(a) == id(b), so to make the unit test pass on Python 3 all I needed to do
here was define __hash__ explicitly to return id(self), which is the default
behaviour under Python 2.

I'm still puzzled about the reasoning behind the comparisons.
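The __eq__/__hash__ point above can be sketched as follows. This is a minimal illustration, not the Bio.Restriction code: the metaclass names NoHashType and WithHashType, the class Broken and the helper is_hashable are made up for the example, and it uses Python 3 metaclass syntax. Only the idea (an identity-based __eq__ plus an explicit __hash__ returning id) comes from the message.

```python
# Hypothetical stand-ins for RestrictionType; not the real Bio.Restriction code.
# Python 3 syntax (metaclass=...).


class NoHashType(type):
    """Overrides __eq__ only, so Python 3 sets __hash__ to None."""

    def __eq__(cls, other):
        return other is cls


class WithHashType(type):
    """Same identity-based __eq__, plus an explicit __hash__ based on id()."""

    def __eq__(cls, other):
        return other is cls

    def __hash__(cls):
        return id(cls)


class Broken(metaclass=NoHashType):
    site = "GGTACC"


class Acc65I(metaclass=WithHashType):
    site = "GGTACC"


class Asp718I(metaclass=WithHashType):
    site = "GGTACC"


def is_hashable(obj):
    """Return True if hash(obj) works, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


if __name__ == "__main__":
    print(is_hashable(Broken))     # class built by the __eq__-only metaclass
    print(is_hashable(Acc65I))     # class built by the metaclass with __hash__
    print(Acc65I == Asp718I)       # identity comparison of two distinct classes
    print(len({Acc65I, Asp718I}))  # both can now be stored in a set
```

Returning id() from __hash__ is only consistent with __eq__ because equality here is pure identity; an __eq__ that compared enzyme characteristics would need a hash built from the same data.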
Clearly you had something special in mind with these definitions as shown by
the test_Restriction.py unit tests under test_comparisons,

assert Acc65I == Acc65I
assert not(Acc65I == Asp718I)
assert not(Acc65I != Asp718I)

Note that Acc65I.site == Asp718I.site == 'GGTACC', and also
Acc65I.charac == Asp718I.charac == (1, -1, None, None, 'GGTACC')

It looks to me like Acc65I and Asp718I differ only in name, and you wanted
both Acc65I == Asp718I and Acc65I != Asp718I to return False. i.e. They are
neither equal nor non-equal, but somewhere in between?

Regards,

Peter

From biopython at maubp.freeserve.co.uk Wed Jul 28 12:17:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Jul 2010 17:17:07 +0100
Subject: [Biopython-dev] Python 3 status (ignoring numpy and our C code)
In-Reply-To:
References:
Message-ID:

Regarding the remaining Python 3 unit test failures,

On Tue, Jul 27, 2010 at 3:44 PM, Peter wrote:
>
> test_CAPS ... ERROR
> test_Restriction ... ERROR
>
> TypeError: unhashable type: 'RestrictionType'
>
> This is a tricky issue, see:
> http://lists.open-bio.org/pipermail/biopython-dev/2010-July/007975.html

This is now fixed - it turns out that I didn't need to understand the full
complexities of the restriction object comparisons, just what __eq__ was
doing:
http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008089.html

Peter

From biopython at maubp.freeserve.co.uk Wed Jul 28 12:38:01 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Jul 2010 17:38:01 +0100
Subject: [Biopython-dev] Python 3 status (ignoring numpy and our C code)
In-Reply-To:
References:
Message-ID:

More progress on the Python 3 front,

On Tue, Jul 27, 2010 at 3:44 PM, Peter wrote:
>
> test_Crystal ... FAIL
>
> Slicing issues, we could fix them or just deprecate Bio.Crystal
> http://lists.open-bio.org/pipermail/biopython-dev/2010-July/thread.html
>

I decided to just fix it - test_Crystal.py seems to cover all the basic
cases for slicing.
http://github.com/biopython/biopython/commit/faefe401af626656c3f8b457c066627c0ab5ef79

Peter

From eric.talevich at gmail.com Wed Jul 28 22:22:12 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 28 Jul 2010 22:22:12 -0400
Subject: [Biopython-dev] WARNING - I've just been rewriting history
In-Reply-To:
References:
Message-ID:

On Tue, Jul 27, 2010 at 6:28 AM, Peter wrote:
> Hi all,
>
> If anyone has been trying to use the git repository in the last 12 hours
> or so, please note I have just re-written recent history. If in any doubt,
> do a fresh clone. According to the github network no one else has committed
> anything recently, which is good.
>
> Re-writing history in git is possible but is generally considered a "bad
> thing" because someone might have already taken and worked from the
> "erased" changes. Hopefully I got away with it without messing anyone up...
>

(emerges battered and bruised from the wreckage)

If anyone else besides me got hit by this, here's a summary of how to fix
your local repository without nuking all your local branches:

# We're on the "master" branch, a clone of "upstream/master"
# This has an alternate history of biopython/biopython/master
# so "git pull upstream master" doesn't work anymore
git branch -m master borked
git checkout -b master upstream/master
git pull upstream master
# If everything looks OK...
git branch -d borked

Note that this only recreates a fresh copy of Biopython's official master
branch; if you've made commits on top of the borked history, or merged it
into other branches, you should probably just make a fresh clone and export
your local branches as patch sets.

> What I did and why: One of our team made a bad merge, and pushed it to
> the master. If this had been spotted BEFORE being made public a local
> revert could have been done.
> The standard procedure here is to do a merge revert, but unfortunately it
> seems they reverted to the wrong branch (merge reverts can be done back to
> either of the two parents). At this point we had two unwanted commits, and
> the best way to fix this wasn't clear [at least not to us - has anyone got
> advice here for future reference?].
>

As usual in git, there's probably a way to do this, but I sure don't know
what it is.

> If you are not confident about merging branches, perhaps sending a
> merge pull request might be safer - get someone else to do it ;)
> Would anyone other than me feel happy handling merge requests?
>

Starting a month or so from now, I'd be willing to take a crack at it.

Another suggestion for avoiding accidentally pushing weird changes to the
main repo: point your "master" branch at your personal fork on github
(normally called "origin"), rather than upstream. Then "git push" will do
the safe thing by default.

Regards,
Eric

From eric.talevich at gmail.com Thu Jul 29 00:08:37 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 29 Jul 2010 00:08:37 -0400
Subject: [Biopython-dev] long for longitude in Bio.Phylo and Python 3
In-Reply-To:
References:
Message-ID:

On Mon, Jul 26, 2010 at 1:19 PM, Peter wrote:
> On Mon, Jul 26, 2010 at 6:04 PM, Eric Talevich wrote:
> > On Mon, Jul 26, 2010 at 12:47 PM, Peter wrote:
> >
> >> Hi Eric et al,
> >>
> >> Background: Eric has found a problem in Bio.Phylo with variables,
> >> arguments and properties called "long" for longitude which the 2to3
> >> script is wrongly converting into "int", see:
> >> http://bugs.python.org/issue2734
> >>
> >> If the remaining issue with Bug 2734 is fixed, we would still have a
> >> problem running the conversion with 2to3 as included with all releases
> >> of Python to date (i.e. 2.6, 2.7, 3.1), which would complicate
> >> deployment.
> >>
> >> Eric: It could break backwards compatibility, but would a switch from
> >> lat & long to latitude and longitude be the least painful solution? Do
> >> you think we could support both names as part of a deprecation cycle?
> >>
> >> Peter
> >>
> >
> > The names "lat", "long" and "alt" are from the phyloXML spec, so it's
> > convenient to keep them the same in Biopython. But I could change them to
> > the longer form if that's needed. The parser and serializer assume the
> > attribute names match the XML spec in general, and special-case names
> > that won't work in Python (like "from").
> >
> [...]
>
> If "2to3 --nofix=long" doesn't cause us problems elsewhere, that will
> be a neater solution.
>

From my testing just now, "2to3 --nofix=long" seems to be fine. I don't see
any new errors introduced by it.

-Eric

From biopython at maubp.freeserve.co.uk Thu Jul 29 04:30:17 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Jul 2010 09:30:17 +0100
Subject: [Biopython-dev] WARNING - I've just been rewriting history
In-Reply-To:
References:
Message-ID:

Eric Talevich wrote:
> Peter wrote:
>
>> Hi all,
>>
>> If anyone has been trying to use the git repository in the last 12 hours or
>> so, please note I have just re-written recent history. If in any doubt, do a
>> fresh clone. According to the github network no one else has committed
>> anything recently, which is good.
>>
>> Re-writing history in git is possible but is generally considered a "bad
>> thing" because someone might have already taken and worked from
>> the "erased" changes. Hopefully I got away with it without messing
>> anyone up...
>>
>
> (emerges battered and bruised from the wreckage)
>

Ah - sorry Eric, but at least you sorted it out. Did you see the email
first or discover something wrong the hard way?
> If anyone else besides me got hit by this, here's a summary of how to fix
> your local repository without nuking all your local branches:
>
> # We're on the "master" branch, a clone of "upstream/master"
> # This has an alternate history of biopython/biopython/master
> # so "git pull upstream master" doesn't work anymore
> git branch -m master borked
> git checkout -b master upstream/master
> git pull upstream master
> # If everything looks OK...
> git branch -d borked
>

i.e. Rename your local copy of the borked master, get a clean
copy of the rewritten master, delete renamed borked master.
Looks very sensible.

>
> Note that this only recreates a fresh copy of Biopython's official master
> branch; if you've made commits on top of the borked history, or merged it
> into other branches, you should probably just make a fresh clone and
> export your local branches as patch sets.
>
>> What I did and why: One of our team made a bad merge, and pushed it to
>> the master. If this had been spotted BEFORE being made public a local
>> revert could have been done. The standard procedure here is to do a
>> merge revert, but unfortunately it seems they reverted to the wrong
>> branch (merge reverts can be done back to either of the two parents).
>> At this point we had two unwanted commits, and the best way to fix this
>> wasn't clear [at least not to us - has anyone got advice here for future
>> reference?].
>>
>
> As usual in git, there's probably a way to do this, but I sure don't know
> what it is.
>

Laurent sent me this link off-list, it sounds very complicated:

http://www.kernel.org/pub/software/scm/git/docs/howto/revert-a-faulty-merge.txt

>> If you are not confident about merging branches, perhaps sending a
>> merge pull request might be safer - get someone else to do it ;)
>> Would anyone other than me feel happy handling merge requests?
>>
>
> Starting a month or so from now, I'd be willing to take a crack at it.
>
> Another suggestion for avoiding accidentally pushing weird changes to the
> main repo: point your "master" branch at your personal fork on github
> (normally called "origin"), rather than upstream. Then "git push" will do
> the safe thing by default.

i.e. Push to your personal github repository's master first? That way
it won't harm the official repository?

Peter

From biopython at maubp.freeserve.co.uk Thu Jul 29 06:29:28 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Jul 2010 11:29:28 +0100
Subject: [Biopython-dev] Bytes, Strings and Unicode (Python 2 vs 3)
Message-ID:

Hi all,

I'm forwarding something from the NumPy mailing list regarding strings and
unicode:

On Thu, Jul 29, 2010 at 4:40 AM, Fernando Perez wrote:
>
> On Wed, Jul 28, 2010 at 12:36 PM, Fernando Perez wrote:
>> The official Python 2.x unicode story is well explained here:
>> http://docs.python.org/howto/unicode.html
>>
>> and here is the corresponding document for 3.x:
>> http://docs.python.org/release/3.1.2/howto/unicode.html
>
> Just in case you're still thirsty for more info on Unicode... :)
>
> Min Ragan-Kelley just did a great summary writeup of these questions
> from a low-level perspective: for pyzmq we need to handle strings
> (i.e. unicode) at the python level, but efficiently and unambiguously
> communicate with a networking layer written in C. We spent a lot of
> time thinking about this, and his writeup is a great resource for
> anyone who needs to look at this from a C/low-level angle:
>
> http://ptsg.berkeley.edu/~minrk/zmq/unicode.html
>
> This adds a view that isn't made very explicit in any of the docs I'd
> previously sent.
>
> Cheers,
>
> f

The fact that on most Linux distributions Python 3's unicode strings will
take 4x the memory of plain byte strings, and even Windows and Mac will
take 2x the memory is concerning for me (since I've been using Biopython for
some next gen sequencing stuff where memory is already sometimes the main
bottleneck).
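The memory concern above can be checked directly on whatever interpreter is at hand. The sketch below is only an illustration under stated assumptions: it relies on CPython's sys.getsizeof, the function name compare_sizes and the one-million-base test sequence are made up for the example, and the numbers depend on the Python build. As far as I know, CPython 3.3 and later (PEP 393) store ASCII-only strings with one byte per character, so the 2x/4x penalty described for the Python 3.1-era builds shows up there only when a string contains characters outside the narrower ranges.

```python
import sys


def compare_sizes(n_bases=1000000):
    """Return the in-memory size (bytes) of one sequence stored three ways.

    The function name and the test data are illustrative assumptions.
    """
    seq_text = "ACGT" * (n_bases // 4)     # plain ASCII unicode string
    seq_bytes = seq_text.encode("ascii")   # the same sequence as bytes
    # One character outside the Basic Multilingual Plane; on CPython with
    # PEP 393 this forces the widest internal representation for the string.
    wide_text = seq_text + "\U0001F9EC"
    return {
        "bytes": sys.getsizeof(seq_bytes),
        "ascii_str": sys.getsizeof(seq_text),
        "wide_str": sys.getsizeof(wide_text),
    }


if __name__ == "__main__":
    for label, size in compare_sizes().items():
        print(label, size)
```

Comparing the "bytes" and "wide_str" figures gives a feel for the worst case discussed in the message; the "ascii_str" figure shows what a given build actually charges for ordinary sequence data.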
I think we will want to make the Seq object use bytes internally, rather
than unicode strings. We'll also want to make sure the Seq module functions
will cope with bytes, unicode or Seq type objects.

For most annotation (e.g. in SeqRecord and SeqFeature objects), I guess
the default of unicode strings will be OK. Perhaps the SeqRecord's
id/name/description might be borderline cases...

Peter

From eric.talevich at gmail.com Thu Jul 29 12:00:22 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 29 Jul 2010 12:00:22 -0400
Subject: [Biopython-dev] WARNING - I've just been rewriting history
In-Reply-To:
References:
Message-ID:

On Thu, Jul 29, 2010 at 4:30 AM, Peter wrote:
> Eric Talevich wrote:
> > Peter wrote:
> >
> >> Hi all,
> >>
> >> If anyone has been trying to use the git repository in the last 12
> >> hours or so, please note I have just re-written recent history. If in
> >> any doubt, do a fresh clone. According to the github network no one
> >> else has committed anything recently, which is good.
> >>
> >> Re-writing history in git is possible but is generally considered a "bad
> >> thing" because someone might have already taken and worked from
> >> the "erased" changes. Hopefully I got away with it without messing
> >> anyone up...
> >>
> >
> > (emerges battered and bruised from the wreckage)
> >
>
> Ah - sorry Eric, but at least you sorted it out. Did you see the email
> first or discover something wrong the hard way?
>

The hard way. I had a small uncommitted change to PhyloXMLIO.py, and wanted
to apply it to the tip of the master branch. So I stashed my change, pulled
from upstream (just after the bad merge reversion), and popped the stash. My
change no longer applied cleanly, even though the history showed no new
commits affecting PhyloXMLIO.py. Suck.
Having burned myself with "git rebase -i" on my own github fork last summer,
I recognized the problem after you rewrote the upstream history: After
pulling from upstream (or origin), the local copy claims to be several
commits ahead of the public branch it's supposed to mirror.

> > If anyone else besides me got hit by this, here's a summary of how to fix
> > your local repository without nuking all your local branches:
> >
> > # We're on the "master" branch, a clone of "upstream/master"
> > # This has an alternate history of biopython/biopython/master
> > # so "git pull upstream master" doesn't work anymore
> > git branch -m master borked
> > git checkout -b master upstream/master
> > git pull upstream master
> > # If everything looks OK...
> > git branch -d borked
> >
>
> i.e. Rename your local copy of the borked master, get a clean
> copy of the rewritten master, delete renamed borked master.
> Looks very sensible.
>

Yes. Plus: after pulling a clean copy of upstream/master, "git fetch
upstream" helps set things right again.

> Laurent sent me this link off-list, it sounds very complicated:
>
> http://www.kernel.org/pub/software/scm/git/docs/howto/revert-a-faulty-merge.txt
>

This part looks key:

    If at all possible, for example, if you find a problem that got merged
    into the main tree, rather than revert the merge, try _really_ hard to
    bisect the problem down into the branch you merged, and just fix it,
    or try to revert the individual commit that caused it.

> > Another suggestion for avoiding accidentally pushing weird changes to the
> > main repo: point your "master" branch at your personal fork on github
> > (normally called "origin"), rather than upstream. Then "git push" will do
> > the safe thing by default.
>
> i.e. Push to your personal github repository's master first? That way
> it won't harm the official repository?
>

Yeah, mainly for psychological reasons -- pushing to origin satisfies a
certain urge to publish new work, but typing "git push upstream master"
makes me think more carefully about whether a change set is ready for the
official repository.

-Eric
The project is at good > point, and will come with documentation and tutorial as a standalone > package we call BioGraPy. I know that in biopython one can already use > GenomeDiagram to draw, for example, seqrecord features, but this could > extend biopython plotting capability significantly. You can use BioGraPy > to plot a blast output (with its HTML map), to plot hydrophobicity plot > along the sequence (read as per letter annotations), mRNA and CDS with > their splicing sites, and so on... BioGrapy relies on matplotlib, so this > will be an additional external dependence, but worthwhile in my opinion. 2 comments: 1. Strong support for matplotlib dependence. As usual it is very easy to shield the code against forcing people to install matplotlib (this is not a C library type of dependency where things would be more serious). The dependency is only needed for people who want to use your code. So this is not a big problem. matplotlib is also very standard in scientific python, not a marginal application. Thumbs up, IMHO. matplotlib, numpy and scipy are no brainers in my opinion. 2. The bioperl name bio::graphics strikes me as not completely perfect. I say this because there is more to bioinformatics than sequence analysis. Whatever naming convention is assumed in biopython for any kind of graphics, there should be some care with this. ;) . That being said, I think it is great that charting support exists. My .02 ? (loosing value by the day), Tiago From tiagoantao at gmail.com Sat Jul 3 10:44:01 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Sat, 3 Jul 2010 11:44:01 +0100 Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: Message-ID: 2010/7/3 Peter : > Could you start a new thread with a summary of what 2to3 reports? 
> I believe the latest NumPy in their repository builds fine on Python > 3.2, so we can't use waiting for them as an excuse much longer ;) I will, let me just tidy up the output and put some stats to help people out. I will put this up on Monday, ahead of BOSC. Maybe it will end up being an interesting discussion topic in Boston ;) Tiago From bugzilla-daemon at portal.open-bio.org Sun Jul 4 17:41:05 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 4 Jul 2010 13:41:05 -0400 Subject: [Biopython-dev] [Bug 3105] New: Bio.Nexus useless line Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3105 Summary: Bio.Nexus useless line Product: Biopython Version: 1.54 Platform: All OS/Version: All Status: NEW Severity: minor Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: tiagoantao at gmail.com There is a line on Bio.Nexus that is wrong/useless: elif hasattr(file, "write"): This is checking if the built-in file class has an attribute called write (which it also has). This is the same as elif True: This is either useless or wrong. This becomes a hurdle for automated conversion to python 3 as there is no file class on python 3. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sun Jul 4 18:09:38 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 4 Jul 2010 14:09:38 -0400 Subject: [Biopython-dev] [Bug 3105] Bio.Nexus useless line In-Reply-To: Message-ID: <201007041809.o64I9cNM016424@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3105 ------- Comment #1 from eric.talevich at gmail.com 2010-07-04 14:09 EST ------- I think it's a typo. 
The function write_nexus_data takes an argument "filename", and this code block is supposed to figure out whether that's an open file handle or a file name.

So it should be:

    if hasattr(filename, 'write'): ...

But we actually do it a different way now, checking for strings:

    if isinstance(filename, basestring): # open it
    else: # it's a handle

From bugzilla-daemon at portal.open-bio.org Sun Jul 4 19:34:44 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 4 Jul 2010 15:34:44 -0400
Subject: [Biopython-dev] [Bug 3105] Bio.Nexus useless line
In-Reply-To:
Message-ID: <201007041934.o64JYiIv019836@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3105

------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-04 15:34 EST -------
I think Eric is right, it could just be a typo. The Nexus API accepts either filenames or handles and so needs to check which it has. Given other bits of Biopython now do the same, we could perhaps have a single bit of shared code for this - or at least consistent coding style.
From bugzilla-daemon at portal.open-bio.org Sun Jul 4 20:22:28 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 4 Jul 2010 16:22:28 -0400
Subject: [Biopython-dev] [Bug 3105] Bio.Nexus useless line
In-Reply-To:
Message-ID: <201007042022.o64KMS5W021453@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3105

------- Comment #3 from eric.talevich at gmail.com 2010-07-04 16:22 EST -------
(In reply to comment #2)
> Given other
> bits of Biopython now do the same, we could perhaps have a single bit
> of shared code for this - or at least consistent coding style.
>

Here's a snippet I use for myself:

    import contextlib

    @contextlib.contextmanager
    def maybe_open(infile, mode='r'):
        """Take a file name or a handle, and return a handle.

        Simplifies creating functions that automagically accept either a
        file name or an already opened file handle.
        """
        do_close = False
        if isinstance(infile, basestring):
            do_close = True
            handle = open(infile, mode)
        else:
            handle = infile
        yield handle
        if do_close:
            handle.close()

Use like:

    >>> with maybe_open(filename_or_handle) as handle: ...

For Py2.4 compliance, you can just drop the @contextlib.contextmanager decorator and leave the function as it is. Then this works:

    >>> for handle in maybe_open(fname): ...

It's an iterator of one item, taking care of loose ends when it terminates. Neat, huh?

I suspect that yielding from a try/finally block, which is forbidden in Py2.4, is related to the with statement under the hood in Py2.5+. Since maybe_open kind of needs that protection to work safely, I think the copy/paste approach is fine until we officially drop Py2.4 support.
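A minimal sketch of the same maybe_open idea for Python 3, assuming that `str` stands in for `basestring` and that a try/finally around the yield is acceptable once Py2.4 support is no longer a concern. The helper `read_first_line` is hypothetical and only shows a caller; this is an illustration, not the code that went into Biopython.

```python
import contextlib
import io


@contextlib.contextmanager
def maybe_open(infile, mode="r"):
    """Yield a handle, given either a file name or an already opened handle."""
    if isinstance(infile, str):
        # A file name: we open it here, so we are responsible for closing it,
        # even if the body of the with statement raises.
        handle = open(infile, mode)
        try:
            yield handle
        finally:
            handle.close()
    else:
        # Anything else is treated as a handle (e.g. io.StringIO).
        # The caller opened it, so the caller keeps ownership.
        yield infile


def read_first_line(filename_or_handle):
    """Hypothetical caller that accepts either kind of argument."""
    with maybe_open(filename_or_handle) as handle:
        return handle.readline().rstrip("\n")


# A StringIO handle works and is left open for the caller.
buffer = io.StringIO("#NEXUS\nbegin data;\n")
first_line = read_first_line(buffer)
```

Checking for a string rather than for a `write` attribute means the same helper can be used for reading and writing; file-like objects such as StringIO simply fall through to the second branch.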
From tiagoantao at gmail.com Sun Jul 4 20:24:42 2010
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Sun, 4 Jul 2010 21:24:42 +0100
Subject: [Biopython-dev] 2to3 ramblings
Message-ID:

Hi,

Here are my findings on the attempt of converting biopython to python 3.

What I did:
1. Tried to convert Bio (not BioSQL)
2. No C code
3. No external apps

No external apps just because I don't have most of them around here.

Things are going much faster than expected: 52 out of 144 tests are failing. Less than 6 hours of work to get to this. With the exception of sff processing, I chose the most complicated ones that I've found (many of the existing failing tests are of the easy kind).

Some general issues that I am finding that impact us:
1. import exception is no more
2. Many lists are now iterators (e.g. map results)
3. 2to3 of course is not complete. Also sometimes there are some small mistakes (things one would expect to convert that are not)
4. sgmllib is no more. 2 options: include it (from python 2.6, which I am doing) OR use htmllib.
5. slices [:], have to be ints (which is mildly problematic with the fact that division is now float). Thus
       myPos = x/2
       x[myPos:]
   has to become
       myPos = int(x/2)
6. Doctests have to be converted (2to3 does it)
7. Default open is now non-binary, so open sometimes requires rb. file is no more
8. Many order functions do not accept None, e.g. max([None,1,2]) will fail
9. StringType, *Type are no more
10. sort has no cmp function anymore
11. urllib namespace refactored
12. unit tests really help!
13!!!: The biggest problem has been bytes versus strings and encodings. Most existing complex problems are about this

Biggest issues have been with Nexus and, above all, Sff (mostly 13 above - encoding formats).

With the exception of Sff, I think I could easily sort out everything myself.

The big unknown seems to be the C code. But I will assume that conversion is easy for the rest of the discussion. I also have to test the code that executes external apps.

From my point of view the conversion is not the big issue. The big issue is the maintenance of a version that works on both 2 and 3 at the same time (we don't want to maintain 2 codebases, correct?). Some things are easy, but some are unknowns. It is possible to make _some_ code (that currently works only on 2) work on both pythons with little effort. Other code (e.g. prints) can be automatically converted on build. But some issues are still unknown to me.

What numpy does (at least partially) is, on build: if python 3 is detected then call 2to3 to convert a python2 codebase to python3. Seems to work quite well. My gut feeling is that code of the form

    if python.version==2:
        a_version
    else:
        b_version

can be almost non-existent. But it is just a gut feeling.

So I think the python codebase can be easily shared between python 2 and 3 with little ugliness. About the C codebase? I don't have any idea for now.

This is not as much work as it seems. I think it is possible to have almost everything working on python3 for BOSC (assuming the current pace). But again, the main issue is not the conversion but maintaining a single code base. In practice, I think the first step is to have a build system like numpy: which detects the python version and calls 2to3. A single code base that can be built and tested on both 2 and 3.

Suggested readings
http://coderazzi.net/tnotes/python/migrating2to3.html
http://diveintopython3.org/porting-code-to-python-3-with-2to3.html
http://dbaktiar.wordpress.com/2009/08/20/python-3-1-file-open-is-no-longer-binary-by-default/

Well, these are my 0.02€. I can work on putting a github version of this if you are interested...

--
"If you want to get laid, go to college. If you want an education, go to the library."
- Frank Zappa

From bioinformed at gmail.com Sun Jul 4 20:50:53 2010
From: bioinformed at gmail.com (Kevin Jacobs)
Date: Sun, 4 Jul 2010 16:50:53 -0400
Subject: [Biopython-dev] 2to3 ramblings
In-Reply-To:
References:
Message-ID:

2010/7/4 Tiago Antão

> myPos = x/2

I strongly recommend:

    myPos = x//2

versus anything that ventures into float territory and then retreats back into integer-land.

-Kevin

From bugzilla-daemon at portal.open-bio.org Mon Jul 5 07:40:20 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 5 Jul 2010 03:40:20 -0400
Subject: [Biopython-dev] [Bug 3105] Bio.Nexus useless line
In-Reply-To:
Message-ID: <201007050740.o657eKfs025732@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3105

fkauff at biologie.uni-kl.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #4 from fkauff at biologie.uni-kl.de 2010-07-05 03:40 EST -------
It's a typo (a rather old one).

Type checking has been changed to isinstance.

Frank

(In reply to comment #1)
> I think it's a typo. The function write_nexus_data takes an argument
> "filename", and this code block is supposed to figure out whether that's an
> open file handle or a file name.
>
> So it should be:
>
>   if hasattr(filename, 'write'): ...
>
> But we actually do it a different way now, checking for strings:
>
>   if isinstance(filename, basestring): # open it
>   else: # it's a handle
>
From fkauff at biologie.uni-kl.de Mon Jul 5 07:46:59 2010 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Mon, 05 Jul 2010 09:46:59 +0200 Subject: [Biopython-dev] 2to3 ramblings In-Reply-To: References: Message-ID: <4C318DF3.1080706@biologie.uni-kl.de> Hi Tiago, On 07/04/2010 10:24 PM, Tiago Ant?o wrote: > Hi, > > Here are my findings on the attempt of converting biopython to python 3. > ... > > Biggest issues have been with Nexus and, above all, Sff (mostly 13 > above - encoding formats). > > I'd be happy to help with Nexus.py. You have some sort of list with the lines that failed? Frank > With the exception of Sff, I think I could easily sort out everything myself. > > The big incognito seems to be the C code. But I will assume that > conversion is easy for the rest of the discussion. I have also to test > process code that executes external apps. > > > > From my point of view the conversion is not the big issue. The big > issue is the maintenance of a version that works on both 2 and 3 at > the same time (we dont want to maintain 2 codebases, correct?). > Somethings are easy, but some are unknowns. It is possible to make > _some_ code (that currently works only on 2) work on both pythons with > little effort. Other code (e.g. prints) can be automatically converted > on build. But some issues are still unknown to me. > > What numpy does (at least partially) is, on build: if python 3 is > detected then call 2to3 to convert a python2 codebase to python3. > Seems to work quite well. My gut feeling is that code of the form > if python.version==2: > a_version > else: > b_version > can be almost non-existent. > But it is just a gut feeling. > > So I think the python codebase can be easily shared between python 2 > and 3 with little ugliness. About the C codebase? I don' t have any > idea for now. > > This is not as much work as it seems. I think it is possible to have > almost everything working on python3 for BOSC (assuming the current > pace). 
But again, the main issue is not the conversion but maintaining > a single code base. In practice, I think the first step is to have a > build system like numpy: which detects the python version and calls > 2to3. A single code base that can be built and tested on both 2 and 3. > > > Suggested readings > http://coderazzi.net/tnotes/python/migrating2to3.html > http://diveintopython3.org/porting-code-to-python-3-with-2to3.html > http://dbaktiar.wordpress.com/2009/08/20/python-3-1-file-open-is-no-longer-binary-by-default/ > > > Well, these are my 0.02?. I can work on putting a github version of > this if you are interested... > > From tiagoantao at gmail.com Mon Jul 5 09:11:22 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 5 Jul 2010 10:11:22 +0100 Subject: [Biopython-dev] 2to3 ramblings In-Reply-To: <4C318DF3.1080706@biologie.uni-kl.de> References: <4C318DF3.1080706@biologie.uni-kl.de> Message-ID: On Mon, Jul 5, 2010 at 8:46 AM, Frank Kauff wrote: > I'd be happy to help with Nexus.py. You have some sort of list with the > lines that failed? Thanks for the help. I have restarted the whole process in order make things easier for everybody else. As soon as I get there (again) I will send you existing problems. I will start opening tickets with patches for several components and putting there solutions. I hope it will be clear to everybody the level of triviality of patches required. Nexus (and mainly SFF) were the issues I stumbled in. Lets see on the second run. 
Tiago From bugzilla-daemon at portal.open-bio.org Mon Jul 5 09:16:01 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 5 Jul 2010 05:16:01 -0400 Subject: [Biopython-dev] [Bug 3105] Bio.Nexus useless line In-Reply-To: Message-ID: <201007050916.o659G10F030478@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3105 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-05 05:16 EST ------- (In reply to comment #4) > It's a typo (a rather old one). > > Type checking has been changed to isinstance. > > Frank Thanks Frank - it turns out to be my old typo from 4 July 2008. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jul 5 09:44:17 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 5 Jul 2010 05:44:17 -0400 Subject: [Biopython-dev] [Bug 3106] New: Making Bio.Sequencing.Ace Python 3 compliant Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3106 Summary: Making Bio.Sequencing.Ace Python 3 compliant Product: Biopython Version: 1.54 Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: tiagoantao at gmail.com The patch attached serves to make Bio.Sequencing.Ace Python 3 compliant A few notes: 1. It is a patch to the test (replacing / with // as per Kevin suggestion). The core code needs no patch 2. It still requires running 2to3, but that is normal 3. 
Was tested on both 3.1.2 and 2.6.5 (ie, not on 2.5 and 2.4) This is a typical pattern: the change is trivial and has no impact on the 2 codebase -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jul 5 09:45:33 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 5 Jul 2010 05:45:33 -0400 Subject: [Biopython-dev] [Bug 3106] Making Bio.Sequencing.Ace Python 3 compliant In-Reply-To: Message-ID: <201007050945.o659jX00031812@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3106 ------- Comment #1 from tiagoantao at gmail.com 2010-07-05 05:45 EST ------- Created an attachment (id=1519) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1519&action=view) Patch to make test_Ace py3k compliant -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jul 5 09:48:48 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 5 Jul 2010 05:48:48 -0400 Subject: [Biopython-dev] [Bug 3106] Making Bio.Sequencing.Ace Python 3 compliant In-Reply-To: Message-ID: <201007050948.o659mm42031951@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3106 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-05 05:48 EST ------- Tiago - please go ahead and apply these and any further / to // changes to use explicit integer division required to help 2to3 (without bothering with more bug reports - a summary email to the dev list would be more than enough). Thanks! 
Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jul 5 09:54:21 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 5 Jul 2010 05:54:21 -0400 Subject: [Biopython-dev] [Bug 3106] Making Bio.Sequencing.Ace Python 3 compliant In-Reply-To: Message-ID: <201007050954.o659sLWt032216@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3106 tiagoantao at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #3 from tiagoantao at gmail.com 2010-07-05 05:54 EST ------- (In reply to comment #2) > Tiago - please go ahead and apply these and any further / to // changes to use > explicit integer division required to help 2to3 (without bothering with more > bug > reports - a summary email to the dev list would be more than enough). Thanks! OK, I will just two notes: 1. Apologies in advance if I make a blunder with git, I am a bzr person and my git skills are limited 2. I will go to biopython-dev whenever something conceptually new arises that I think requires discussion before commit. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
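To illustrate why the `/` to `//` change matters, here is a small sketch assuming Python 3 semantics. The sequence, the `split_in_half` helper and the variable names are made up for illustration and are not taken from test_Ace.

```python
def split_in_half(seq):
    """Split a sequence at its midpoint using floor division."""
    half = len(seq) // 2  # always an int, on Python 2 and on Python 3
    return seq[:half], seq[half:]


seq = "ACGTACGTA"
first, second = split_in_half(seq)

# On Python 3, "/" is true division and returns a float even for two ints,
# so the result cannot be used as a slice index.
try:
    seq[: len(seq) / 2]
    slicing_with_float_works = True
except TypeError:
    slicing_with_float_works = False

# int(x / 2) takes a detour through a float and can lose precision for
# large integers; x // 2 never leaves integer arithmetic.
big = 2 ** 54 + 2
via_float = int(big / 2)
via_floor = big // 2
```

Since `//` behaves the same way on Python 2, the change is safe to commit to the shared code base without any version check.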
From bugzilla-daemon at portal.open-bio.org Mon Jul 5 10:04:18 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 5 Jul 2010 06:04:18 -0400 Subject: [Biopython-dev] [Bug 3106] Making Bio.Sequencing.Ace Python 3 compliant In-Reply-To: Message-ID: <201007051004.o65A4I7g032704@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3106 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-05 06:04 EST ------- (In reply to comment #3) > (In reply to comment #2) > > Tiago - please go ahead and apply these and any further / to // changes to > > use explicit integer division required to help 2to3 (without bothering with > > more bug reports - a summary email to the dev list would be more than > > enough). Thanks! > > OK, I will just two notes: > > 1. Apologies in advance if I make a blunder with git, I am a bzr person and > my git skills are limited Looks fine so far :) > 2. I will go to biopython-dev whenever something conceptually new arises > that I think requires discussion before commit. Great. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jul 5 10:31:39 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 5 Jul 2010 06:31:39 -0400 Subject: [Biopython-dev] [Bug 3105] Bio.Nexus useless line In-Reply-To: Message-ID: <201007051031.o65AVdUv001589@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3105 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-05 06:31 EST ------- (In reply to comment #4) > It's a typo (a rather old one). > > Type checking has been changed to isinstance. > > Frank I've just changed it back to method checking, i.e. 
as Eric suggested with my typo fixed:

    if hasattr(filename, 'write'):

The trouble with isinstance(filename, file) is that it doesn't allow for file like objects - specifically a StringIO handle as used in the unit tests, meaning test_AlignIO.py was failing.

Peter

From tiagoantao at gmail.com Mon Jul 5 10:34:17 2010
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 5 Jul 2010 11:34:17 +0100
Subject: [Biopython-dev] test_AlignIO to python 3
Message-ID:

Hi,

test_AlignIO provides a far more interesting case (but not complicated, not at all).

The issues are as follows:

1. list sorting
Bio.Data.CodonTable has a:

    possible.sort(_sort)

Py3 has no compare function (that _sort is a 5 line function defined just above). That can be "forced in", but there is normally a simpler dialect, with keywords. The line above becomes:

    if sys.version_info[0] == 3:
        possible.sort(key=lambda x:self.ambiguous_protein[x])
    else:
        possible.sort(_sort)

2. Strings and bytes
Bio.Seq requires

    if sys.version_info[0] == 3 :
        return str.maketrans(before, after)
    else:
        return string.maketrans(before, after)

The way p3 handles strings and bytes are the biggest issue that I think we will face from a technical perspective.

3. The big one: No sgmllib in p3.
   The obvious solution is to include it (I suppose the licenses are compatible?). The alternative (using htmllib) might be more long-term, in my opinion

This is all that is needed (plus 1 import sys line).

I was inclined to commit 1 and 2. But 3 needs to be discussed...
Tiago

From p.j.a.cock at googlemail.com Mon Jul 5 11:01:42 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 5 Jul 2010 12:01:42 +0100
Subject: [Biopython-dev] test_AlignIO to python 3
In-Reply-To:
References:
Message-ID:

2010/7/5 Tiago Antão:
> Hi,
>
> test_AlignIO provides a far more interesting case (but not
> complicated, not at all).

Or just test_seq.py or test_Seq_objs.py which are more low level ;)

> The issues are as follows:
>
> 1. list sorting
> Bio.Data.CodonTable has a:
> possible.sort(_sort)
> Py3 has no compare function (that _sort is a 5 line function defined
> just above). That can be "forced in", but there is normally a simpler
> dialect, with keywords. The line above becomes:
> if sys.version_info[0] == 3:
>     possible.sort(key=lambda x:self.ambiguous_protein[x])
> else:
>     possible.sort(_sort)

I think Python 2.4 added support for the key argument, so can we just unconditionally change it to:

    possible.sort(key=lambda x:self.ambiguous_protein[x])

However, that isn't doing quite the same thing. The old sort was by table length first to try and get the least ambiguous mapping or something like that... we probably need some more unit tests first.

> 2. Strings and bytes
> Bio.Seq requires
>    if sys.version_info[0] == 3 :
>        return str.maketrans(before, after)
>    else:
>        return string.maketrans(before, after)

This is within our private _maketrans function only? That looks sensible but I wonder why 2to3 doesn't handle this on its own.

Would moving the "import string" into the function help for clarity?

    def _maketrans(complement_mapping):
        """Makes a python string translation table (PRIVATE)."""
        before = ''.join(complement_mapping.keys())
        after = ''.join(complement_mapping.values())
        before = before + before.lower()
        after = after + after.lower()
        if sys.version_info[0] == 3 :
            return str.maketrans(before, after)
        else:
            import string
            return string.maketrans(before, after)

> The way p3 handles strings and bytes are the biggest issue that I
> think we will face from a technical perspective.

I agree that strings vs bytes will be an issue for us (potentially from a memory point of view for Seq objects).

> 3. The big one: No sgmllib in p3.
>   The obvious solution is to include it (I suppose the licenses are
> compatible?). The alternative (using htmllib) might be more long-term,
> in my opinion

A lot of the things using sgmllib are already deprecated (e.g. Bio.NetCatch and Bio.Prosite). I think that leaves just Bio.UniGene and Bio.InterPro - which isn't such a big issue.

Peter

From fkauff at biologie.uni-kl.de Mon Jul 5 11:07:22 2010
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Mon, 05 Jul 2010 13:07:22 +0200
Subject: [Biopython-dev] [Bug 3105] Bio.Nexus useless line
In-Reply-To: <201007051031.o65AVdUv001589@portal.open-bio.org>
References: <201007051031.o65AVdUv001589@portal.open-bio.org>
Message-ID: <4C31BCEA.1000007@biologie.uni-kl.de>

On 07/05/2010 12:31 PM, bugzilla-daemon at portal.open-bio.org wrote:
> http://bugzilla.open-bio.org/show_bug.cgi?id=3105
>
> ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-05 06:31 EST -------
> (In reply to comment #4)
>
>> It's a typo (a rather old one).
>>
>> Type checking has been changed to isinstance.
>>
>> Frank
>>
> I've just changed it back to method checking, i.e.
as Eric suggested with
> my typo fixed:
>
> if hasattr(filename, 'write'):
>
> The trouble with isinstance(filename, file) is that it doesn't allow for file
> like objects - specifically a StringIO handle as used in the unit tests,
> meaning test_AlignIO.py was failing.
>
> Peter
>

Good catch. I didn't remember that.

Frank

From tiagoantao at gmail.com Mon Jul 5 11:13:50 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Mon, 5 Jul 2010 12:13:50 +0100
Subject: [Biopython-dev] test_AlignIO to python 3
In-Reply-To:
References:
Message-ID:

Hi,

2010/7/5 Peter Cock :
> I think Python 2.4 added support for the key argument, so can we
> just unconditionally change it to:
>
> possible.sort(key=lambda x:self.ambiguous_protein[x])
>
> However, that isn't doing quite the same thing. The old sort was by
> table length first to try and get the least ambiguous mapping or
> something like that... we probably need some more unit tests first.

erm, my mistake

possible.sort(key=lambda x:len(self.ambiguous_protein[x]))

I think this sorts this out?

> This is within our private _maketrans function only? That looks sensible
> but I wonder why 2to3 doesn't handle this on its own.

Because (I think), there are now 2 possible alternatives (one
byte-wise and one string-wise), so 2to3 does not know which to choose.

> Would moving the "import string" into the function help for clarity?

If it is only used there, maybe it makes sense.

>> 3. The big one: No sgmllib in p3.
>>   The obvious solution is to include it (I suppose the licenses are
>> compatible?). The alternative (using htmllib) might be more long-term,
>> in my opinion
>
> A lot of the things using sgmllib are already deprecated (e.g.
> Bio.NetCatch and Bio.Prosite). I think that leaves just Bio.UniGene
> and Bio.InterPro - which isn't such a big issue.

I know very little about those parts of the code, but there was an
import required for sgmllib in test_AlignIO.
Tiago -- "If you want to get laid, go to college. If you want an education, go to the library." - Frank Zappa From p.j.a.cock at googlemail.com Mon Jul 5 11:26:36 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Jul 2010 12:26:36 +0100 Subject: [Biopython-dev] test_AlignIO to python 3 In-Reply-To: References: Message-ID: 2010/7/5 Tiago Ant?o : > Hi, > > 2010/7/5 Peter Cock : >> I think Python 2.4 added support for the key argument, so can we >> just unconditionally change it to: >> >> possible.sort(key=lambda x:self.ambiguous_protein[x]) >> >> However, that isn't doing quite the same thing. The old sort was by >> table length first to try and get the least ambiguous mapping or >> something like that... we probably need some more unit tests first. > > erm, my mistake > possible.sort(key=lambda x:len(self.ambiguous_protein[x])) > > I think this sorts this out? Probably. >> This is within our private _maketrans function only? That looks sensible >> but I wonder why 2to3 doesn't handle this on its own. > > Because (I think), there are now 2 possible alternatives (one > byte-wise and one string-wise), so 2to3 does not know which to choose. True. >> Would moving the "import string" into the function help for clarity? > > It it is only used there, maybe it makes sense. OK. >>> 3. The big one: No sgmllib in p3. >>> ? The obvious solution is to include it (I suppose the licenses are >>> compatible?). The alternative (using htmllib) might be more long-term, >>> in my opinion >> >> A lot of the things using sgmllib are already deprecated (e.g. >> Bio.NetCatch and Bio.Prosite). I think that leaves just Bio.UniGene >> and Bio.InterPro - which isn't such a big issue. > > I know very little about those parts of the code, but there was an > import required for sgmllib in test_AlignIO. This is due to Bio/File.py trying to import sgmllib, and Bio.File is used by several of the SeqIO/AlignIO parsers (e.g. Bio.GenBank). 
That code needing sgmllib was deprecated in Biopython 1.52 (Sept 2009), and so we should be keeping it until Sept 2010... I think making it a lazy import will do the trick. Peter P.S. I've just committed this, so do a pull before more changes: http://github.com/biopython/biopython/commit/4f2650c309224e74bd18758b4ee2be24879c15dd From p.j.a.cock at googlemail.com Mon Jul 5 11:36:25 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Jul 2010 12:36:25 +0100 Subject: [Biopython-dev] test_AlignIO to python 3 In-Reply-To: References: Message-ID: >>>> 3. The big one: No sgmllib in p3. >>>> ? The obvious solution is to include it (I suppose the licenses are >>>> compatible?). The alternative (using htmllib) might be more long-term, >>>> in my opinion >>> >>> A lot of the things using sgmllib are already deprecated (e.g. >>> Bio.NetCatch and Bio.Prosite). I think that leaves just Bio.UniGene >>> and Bio.InterPro - which isn't such a big issue. >> >> I know very little about those parts of the code, but there was an >> import required for sgmllib in test_AlignIO. > > This is due to Bio/File.py trying to import sgmllib, and Bio.File is used > by several of the SeqIO/AlignIO parsers (e.g. Bio.GenBank). That > code needing sgmllib was deprecated in Biopython 1.52 (Sept 2009), > and so we should be keeping it until Sept 2010... I think making it a > lazy import will do the trick. How's this? http://github.com/biopython/biopython/commit/e9ab0b353ae4a914db20a53f2377a34bc56c30a6 Peter From tiagoantao at gmail.com Mon Jul 5 11:38:20 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 5 Jul 2010 12:38:20 +0100 Subject: [Biopython-dev] test_AlignIO to python 3 In-Reply-To: References: Message-ID: 2010/7/5 Peter Cock : > Or just test_seq.py or test_Seq_objs.py which are more low level ;) Glad you raise these 2, as I want to discuss them. 2 changes: 1. 
to test_Seq_objs.py add

import sys
if sys.version_info[0] == 3:
    maketrans = str.maketrans
else:
    from string import maketrans

2. (more serious) array.array("c", ...) is no more (the c).
Maybe self.data = array.array("u", data) ? With ifs per version. This
affects test_seq.py and Seq.py

Regarding commits (e.g. the sort case).
I can commit general corrections, e.g. a single sort with a key for
all versions or put some ifs (use the old code for 2.x and new code
for 3). The first option is cleaner, the second safer. I warm up to
the cleaner version: the changes are trivial (and trivial to roll
back, should the need arise).

From tiagoantao at gmail.com Mon Jul 5 11:38:50 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Mon, 5 Jul 2010 12:38:50 +0100
Subject: [Biopython-dev] test_AlignIO to python 3
In-Reply-To:
References:
Message-ID:

> How's this?
>
> http://github.com/biopython/biopython/commit/e9ab0b353ae4a914db20a53f2377a34bc56c30a6

Makes things much cleaner and easier...

--
"If you want to get laid, go to college. If you want an education, go
to the library." - Frank Zappa

From p.j.a.cock at googlemail.com Mon Jul 5 11:44:54 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 5 Jul 2010 12:44:54 +0100
Subject: [Biopython-dev] test_AlignIO to python 3
In-Reply-To:
References:
Message-ID:

2010/7/5 Tiago Antão :
> 2010/7/5 Peter Cock :
>> Or just test_seq.py or test_Seq_objs.py which are more low level ;)
>
> Glad you raise these 2, as I want to discuss them.
> 2 changes:
> 1. to test_Seq_objs.py add
> import sys
> if sys.version_info[0] == 3:
>     maketrans = str.maketrans
> else:
>     from string import maketrans

OK, do a "git pull origin master" and then make that change (and
move the import string into the function too). It seems to be a
fairly simple way to cope with Python 2.x and Python 3.x.

> 2. (more serious) array.array("c", ...) is no more (the c).
> Maybe self.data = array.array("u", data) ? With ifs per version.
This > affects test_seq.py and Seq.py This is the MutableSeq object, right? Try some local changes and see, but I fear we may have to redo the internals of that more substantially. > Regarding commits (e.g. the sort case). > I can commit general corrections, e.g. a single sort with a key for > all versions or put some ifs (use the old code for 2.x and new code > for 3). The first option is cleaner, the second safer. I warm up to > the cleaner version: the changes are trivial (and trivial to roll > back, should the need arise). For the codon sort case, the old code effectively did two sorts (one by length with a tie breaker). If you can write some unit tests to check we don't alter the behaviour, the clean fix is nicer. Also 2to3 is suggesting we use for loops in Bio/Sequencing/Ace.py in place of side effects with a map call (lines 474, 480, 484). That does seem like good advice. Peter From mjldehoon at yahoo.com Mon Jul 5 11:47:21 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 5 Jul 2010 04:47:21 -0700 (PDT) Subject: [Biopython-dev] test_AlignIO to python 3 In-Reply-To: Message-ID: <298020.45436.qm@web62407.mail.re1.yahoo.com> --- On Mon, 7/5/10, Tiago Ant?o wrote: > >> 3. The big one: No sgmllib in p3. > > A lot of the things using sgmllib are already > deprecated (e.g. > > Bio.NetCatch and Bio.Prosite). I think that leaves > > just Bio.UniGene and Bio.InterPro - which isn't such > a big issue. > I know very little about those parts of the code, but there > was an import required for sgmllib in test_AlignIO. In Bio.UniGene and Bio.InterPro, sgmllib is used for parsing HTML pages, which tends to break easily anyway because the HTML format keeps changing. As a case in point, the parser in Bio.InterPro doesn't seem to work with current HTML pages from InterPro. I haven't tried Bio.UniGene, but Bio.UniGene can also parse UniGene flat files so I doubt that there is a real need to parse UniGene html files. 
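For readers following the sgmllib discussion in this thread, here is a minimal sketch of guarding an optional import so that a module stays importable when a standard-library dependency is missing. It is not the code from the Biopython commits linked in this thread; the helper names optional_import and require are hypothetical and exist only for illustration.

```python
import importlib


def optional_import(module_name):
    """Return the named module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Covers modules that were removed from the standard library.
        return None


def require(module, module_name):
    """Raise a clear error at call time if an optional module is missing."""
    if module is None:
        raise NotImplementedError(
            "%s is not available on this Python version" % module_name
        )
    return module
```

With this pattern the import failure is deferred: importing the package succeeds, and only code paths that actually need the missing module raise an error.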
In test_AlignIO, the import for sgmllib is coming from the SGMLStripper class in Bio.File, imported from Bio.ParserSupport, imported from Bio.GenBank, imported from Bio.SeqIO. But Bio.SeqIO doesn't actually use SGMLStripper, which has been deprecated. So I suggest that instead of fixing the modules that depend on sgmllib, we replace the relevant pieces of code by a NotImplementedError, and see if anybody complains. For the longer term, it would be nice if the code in Bio.GenBank could be moved to Bio.SeqIO, and made independent of Bio.ParserSupport. --Michiel. From tiagoantao at gmail.com Mon Jul 5 12:05:26 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 5 Jul 2010 13:05:26 +0100 Subject: [Biopython-dev] test_AlignIO to python 3 In-Reply-To: References: Message-ID: 2010/7/5 Peter Cock : > For the codon sort case, the old code effectively did two sorts > (one by length with a tie breaker). If you can write some unit tests > to check we don't alter the behaviour, the clean fix is nicer. >>> a=[(1,2),(1,1),(2,1),(1,0)] >>> a.sort(key=lambda x:(x[0],x[1])) >>> a [(1, 0), (1, 1), (1, 2), (2, 1)] Multi-level sorting is possible ;) thus possible.sort(key=lambda x:(len(self.ambiguous_protein[x]), x)) From p.j.a.cock at googlemail.com Mon Jul 5 13:18:12 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Jul 2010 14:18:12 +0100 Subject: [Biopython-dev] test_AlignIO to python 3 In-Reply-To: <298020.45436.qm@web62407.mail.re1.yahoo.com> References: <298020.45436.qm@web62407.mail.re1.yahoo.com> Message-ID: Tiago wrote: >>> 3. The big one: No sgmllib in p3. Peter wrote: >> A lot of the things using sgmllib are already deprecated (e.g. >> Bio.NetCatch and Bio.Prosite). I think that leaves just Bio.UniGene >> and Bio.InterPro - which isn't such a big issue. Michiel wrote: > In Bio.UniGene and Bio.InterPro, sgmllib is used for parsing HTML pages, > which tends to break easily anyway because the HTML format keeps > changing. 
As a case in point, the parser in Bio.InterPro doesn't seem to > work with current HTML pages from InterPro. So that one is ready for deprecation (assuming no one steps forward to update it). > I haven't tried Bio.UniGene, but Bio.UniGene can also parse UniGene > flat files so I doubt that there is a real need to parse UniGene html files. Again, perhaps this HTML parser can be deprecated. > In test_AlignIO, the import for sgmllib is coming from the SGMLStripper > class in Bio.File, imported from Bio.ParserSupport, imported from > Bio.GenBank, imported from Bio.SeqIO. But Bio.SeqIO doesn't > actually use SGMLStripper, which has been deprecated. That's been fixed by making Bio.File ignore the deprecated SGML stuff if sgmllib isn't available. > So I suggest that instead of fixing the modules that depend on sgmllib, > we replace the relevant pieces of code by a NotImplementedError, and > see if anybody complains. How about just deprecation instead? > For the longer term, it would be nice if the code in Bio.GenBank > could be moved to Bio.SeqIO, and made independent of > Bio.ParserSupport. That makes sense except for the fact that Bio.GenBank is still useful for "low level" work (not using a SeqRecord), for example WGS files. Certainly long term I think we could drop Bio.GenBank and have a simplified SeqRecord only parser in Bio.SeqIO. My recent location parsing work is a step in that direction. Peter From p.j.a.cock at googlemail.com Mon Jul 5 13:18:53 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 5 Jul 2010 14:18:53 +0100 Subject: [Biopython-dev] test_AlignIO to python 3 In-Reply-To: References: Message-ID: 2010/7/5 Tiago Ant?o : > 2010/7/5 Peter Cock : >> For the codon sort case, the old code effectively did two sorts >> (one by length with a tie breaker). If you can write some unit tests >> to check we don't alter the behaviour, the clean fix is nicer. 
> > >>>> a=[(1,2),(1,1),(2,1),(1,0)] >>>> a.sort(key=lambda x:(x[0],x[1])) >>>> a > [(1, 0), (1, 1), (1, 2), (2, 1)] > > > Multi-level sorting is possible ;) > thus > possible.sort(key=lambda x:(len(self.ambiguous_protein[x]), x)) > Neat - with a sensible comment to explain why, that looks good. Peter From tiagoantao at gmail.com Mon Jul 5 14:28:33 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 5 Jul 2010 15:28:33 +0100 Subject: [Biopython-dev] test_Entrez 3.x Message-ID: Hi, A pre-read, http://dbaktiar.wordpress.com/2009/08/20/python-3-1-file-open-is-no-longer-binary-by-default/ I am not completely sure that the text above is totally correct, but it does introduce the problem quite well. expat seems to want a byte stream. In the core code this is minor, Expat.Parser gets one open(,"rb") on externalEntityRefHandler and it is ready to roll (at least passes the test_Entrez test). But test_Entrez does need quite a few files open as rb. I do not know if I like this idea of opening a text file as binary. But at least the core code is barely touched. It is more an issue with the test. -- "If you want to get laid, go to college. If you want an education, go to the library." - Frank Zappa From biopython at maubp.freeserve.co.uk Mon Jul 5 14:38:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 5 Jul 2010 15:38:05 +0100 Subject: [Biopython-dev] test_Entrez 3.x In-Reply-To: References: Message-ID: 2010/7/5 Tiago Ant?o : > Hi, > > A pre-read, > http://dbaktiar.wordpress.com/2009/08/20/python-3-1-file-open-is-no-longer-binary-by-default/ > I am not completely sure that the text above is totally correct, but > it does introduce the problem quite well. > > expat seems to want a byte stream. > In the core code this is minor, Expat.Parser gets one open(,"rb") on > externalEntityRefHandler and it is ready to roll (at least passes the > test_Entrez test). > But test_Entrez does need quite a few files ?open as rb. 
> > I do not know if I like this idea of opening a text file as binary. > But at least the core code is barely touched. It is more an issue with > the test. If Expat wants bytes, then on Python 3 we need to open the file in binary mode. This should be harmless on Python 2, although we should confirm this by running the unit tests on Windows - the only difference I would expect this will disable the magic new line conversion. Peter From tiagoantao at gmail.com Mon Jul 5 15:52:40 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 5 Jul 2010 16:52:40 +0100 Subject: [Biopython-dev] NCBIXML Message-ID: Hi, Finally something less trivial. NCBIXML test has different results when running (with py3) using python3 run_tests.py test_NCBIXML.py or just python3 test_NCBIXML.py First fails. Second works. I' ve discovered that expat parsing is assuming that the encoding is ascii and sends an error (no encoding is specified in the file), whereas with utf-8 all is fine. Passing an encoding to ParserCreate gives no joy. Maybe somebody has had experiences with test having different outcomes depending on how they are invoked? Regards, Tiago -- "If you want to get laid, go to college. If you want an education, go to the library." - Frank Zappa From biopython at maubp.freeserve.co.uk Mon Jul 5 16:04:50 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 5 Jul 2010 17:04:50 +0100 Subject: [Biopython-dev] NCBIXML In-Reply-To: References: Message-ID: 2010/7/5 Tiago Ant?o : > Hi, > > Finally something less trivial. > NCBIXML test has different results when running (with py3) using > python3 run_tests.py test_NCBIXML.py > or just > python3 test_NCBIXML.py > > First fails. Second works. > I' ve discovered that expat parsing is assuming that the encoding is > ascii and sends an error (no encoding is specified in the file), > whereas with utf-8 all is fine. > Passing an encoding to ParserCreate gives no joy. 
> > Maybe somebody has had experiences with test having different outcomes > depending on how they are invoked? I suspect this will be down to the run_test.py magic which attempts to run the test using the compiled files in the build directory. Have you run "python3 setup.py install" or not? If the build directory and the installed Biopython are the same this problem may go away... Peter From eric.talevich at gmail.com Mon Jul 5 16:07:57 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 5 Jul 2010 12:07:57 -0400 Subject: [Biopython-dev] test_AlignIO to python 3 In-Reply-To: References: Message-ID: 2010/7/5 Tiago Ant?o > > 1. to test_Seq_objs.py add > import sys > if sys.version_info[0] == 3: > maketrans = str.maketrans > else: > from string import maketrans > You could skip importing sys by checking if the attribute is there on str: if hasattr(str, 'maketrans'): maketrans = str.maketrans else: from string import maketrans -E From biopython at maubp.freeserve.co.uk Mon Jul 5 18:16:18 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 5 Jul 2010 19:16:18 +0100 Subject: [Biopython-dev] Python 3 porting Message-ID: Hi all, While Tiago and I have sorted out some of the easy stuff, there is still plenty to do make Biopython via 2to3 work nicely on Python 3 (and that's just looking at the Python code, ignoring our C code, and also ignoring things using NumPy). We still have quite a few warnings using the -3 switch on Python 2.6 or 2.7 which we should probably concentrate on first. Note that deprecation warnings in Python 2.7 are silent by default (so as not to bother end users, which makes sense as this is the last Python 2.x series). 
Peter From tiagoantao at gmail.com Mon Jul 5 20:13:00 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 5 Jul 2010 21:13:00 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: Hi, On Mon, Jul 5, 2010 at 7:16 PM, Peter wrote: > While Tiago and I have sorted out some of the easy stuff, there is still plenty > to do make Biopython via 2to3 work nicely on Python 3 (and that's just looking > at the Python code, ignoring our C code, and also ignoring things using NumPy). PopGen is now in. The interesting thing is that it has Bio.Application examples and they presented no problem at all. Nexus is also in. I also converted test_lowess (a VERY SIMPLE numpy example). Something seems to have broken one of the seqio tests as it blocks the test system (on py3) PhyloXML I am really stuck and NCBXML seems to have a problem only inside run_tests. Tomorrow I will have a look at PDB, KEGG, Emboss and clustalw. From chapmanb at 50mail.com Tue Jul 6 01:30:50 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 5 Jul 2010 21:30:50 -0400 Subject: [Biopython-dev] Slides for Biopython talk at BOSC 2010 Message-ID: <20100706013050.GE1664@kunkel> Hey all; I've got the honor of presenting Biopython at BOSC 2010, and have put the slides up here: http://www.slideshare.net/chapmanb/biopython-at-bosc-2010 The talk tries to place the lessons I've learned from the Biopython community this year within the broader framework of open source work. It's been great to see the community grow so much, and so please pay special attention to slide 6; did I miss your name? I suck: e-mail me so I can correct that. Happy to get any other thoughts or comments and looking forward to seeing folks in person who are coming to BOSC. If you will be in Boston on Thursday evening, think about stopping by my place for BBQ and beers: http://www.open-bio.org/wiki/Codefest_2010#BBQ Drop me an e-mail for my number and better directions. 
Thanks, Brad From biopython at maubp.freeserve.co.uk Tue Jul 6 10:03:29 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jul 2010 11:03:29 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: On Mon, Jul 5, 2010 at 7:16 PM, Peter wrote: > Hi all, > > While Tiago and I have sorted out some of the easy stuff, there is still plenty > to do make Biopython via 2to3 work nicely on Python 3 (and that's just looking > at the Python code, ignoring our C code, and also ignoring things using NumPy). I can get SFF output working - first by using the new io.BytesIO module (in Python 2.6+ as well) in place of StringIO for testing writing binary files (i.e. SFF output). This also requires a sprinkling of decode/encode calls in SffIO.py when reading/ writing the binary file - I'm not sure how to get that automatically with 2to3. Note that the SeqRecord's format method can currently return a single read in SFF file as a binary string. This won't be so sensible on Python 3 where a byte string makes more sense than unicode, so I think we should deprecate supporting binary files (i.e. SFF) in the SeqRecord's format method. > We still have quite a few warnings using the -3 switch on Python 2.6 or 2.7 > which we should probably concentrate on first. A lot of these are with changes to object comparison (the __cmp__ method is no more), which will need a little extra care, and the related issue of using cmp in list sorting (again, not supported anymore). Peter From biopython at maubp.freeserve.co.uk Tue Jul 6 10:36:39 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jul 2010 11:36:39 +0100 Subject: [Biopython-dev] Deprecating Bio.Crystal in next release? Message-ID: Hi all, Given recent discussion (and the lack of interest on the dev list on previous occasions), is there any objection to deprecating Bio.Crystal in the next release of Biopython? 
http://lists.open-bio.org/pipermail/biopython/2010-July/006633.html
http://lists.open-bio.org/pipermail/biopython-dev/2008-October/004405.html
http://lists.open-bio.org/pipermail/biopython-dev/2007-July/002901.html

Thanks,

Peter

From biopython at maubp.freeserve.co.uk Tue Jul 6 13:40:51 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 6 Jul 2010 14:40:51 +0100
Subject: [Biopython-dev] Deprecating Bio.InterPro
Message-ID:

Hi all,

Another old module which hasn't been updated for some time is
Bio.InterPro, a parser for the HTML (webpages) at the EBI, e.g.
http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001064

The parser doesn't work with the current website, and also uses a
Python library called sgmllib which was deprecated as of Python 2.6.

Website parsers are in general a bad idea because they tend to need
a lot of work to keep up to date. Perhaps in this case there are suitable
plain text files on the FTP site which might be used?

Unless anyone has a good reason not to, we are going to deprecate
the Bio.InterPro module in the next release of Biopython.

Peter

From biopython at maubp.freeserve.co.uk Tue Jul 6 14:03:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 6 Jul 2010 15:03:05 +0100
Subject: [Biopython-dev] Python 3 porting
In-Reply-To:
References:
Message-ID:

2010/7/6 Tiago Antão :
>
> On Tue, Jul 6, 2010 at 11:03 AM, Peter wrote:
>> This also requires a sprinkling of decode/encode calls in SffIO.py when reading/
>> writing the binary file - I'm not sure how to get that automatically with 2to3.
>
> Would it be problematic if the 2.x code had that? 2.6 at least
> supports decode/encode. Of course I do not know the implications in
> code that is highly string intensive like SeqIO stuff... but in other
> places (test cases, very simple) it seems to work OK.
Python 2.4+ strings and unicode objects do support encode and decode, but we don't want to be converting from strings to unicode on Python 2.x - I want everything to stay as plain strings. Adding explicit decode calls would have side effects on Python 2.x (things becoming unicode), but would be needed for SFF parsing on Python 3. I could add explicit encode calls which would help SFF output under Python 3.x. This shouldn't change the functionality on Python 2.x, but I am a little concerned about it having a negative impact on the speed, but I have not measured this. We may need some big if statements... Peter From tiagoantao at gmail.com Tue Jul 6 13:43:29 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 6 Jul 2010 14:43:29 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: On Tue, Jul 6, 2010 at 11:03 AM, Peter wrote: > This also requires a sprinkling of decode/encode calls in SffIO.py when reading/ > writing the binary file - I'm not sure how to get that automatically with 2to3. Would it be problematic if the 2.x code had that? 2.6 at least supports decode/encode. Of course I do not know the implications in code that is highly string intensive like SeqIO stuff... but it other places (test cases, very simple) it seems to work OK. From tiagoantao at gmail.com Tue Jul 6 14:05:33 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 6 Jul 2010 15:05:33 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/6 Peter : > Python 2.4+ strings and unicode objects do support encode and decode, > but we don't want to be converting from strings to unicode on Python 2.x - > I want everything to stay as plain strings. Adding explicit decode calls > would have side effects on Python 2.x (things becoming unicode), but > would be needed for SFF parsing on Python 3. > Argh... I will have to correct some code I submitted (with decode). I am testing on 2.6.5. 
I will start testing on 2.4, it is safer

From biopython at maubp.freeserve.co.uk Tue Jul 6 15:03:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 6 Jul 2010 16:03:32 +0100
Subject: [Biopython-dev] Extending test_PDB.py coverage?
Message-ID:

Hi all,

I've been running the unit tests with the Python 3 warnings enabled
(this needs either Python 2.6 or Python 2.7), e.g.

python2.6 -3 run_tests.py
python2.6 -3 run_tests.py test_PDB.py
python2.6 -3 test_PDB.py

There is a harmless glitch with test_1_warning in this mode (because
it isn't expecting all the extra warnings).

I was getting some DeprecationWarning messages about using "k in d"
rather than d.has_key(k), which I fixed:
http://github.com/biopython/biopython/commit/9b508b6a6391ac9d379a74cbb3cca1127e3c7aba

Looking at the Bio/PDB/*.py files there are still quite a few more examples
of has_key being used - but these are not being picked up by the unit tests:

AbstractPropertyMap.py: def has_key(self, id):
AbstractPropertyMap.py: >>> if map.has_key((chain_id, res_id)):
AbstractPropertyMap.py: return self.property_dict.has_key(translated_id)
DSSP.py: print d.has_key(('A', 1))
Entity.py: return self.child_dict.has_key(id)
Entity.py: return self.child_dict.has_key(id)
FragmentMapper.py: def has_key(self, res):
FragmentMapper.py: return self.fd.has_key(res)
FragmentMapper.py: if fm.has_key(r):
MMCIFParser.py: if mmcif_dict.has_key("_atom_site.auth_seq_id"):
NACCESS.py: if naccess_dict.has_key((chain_id, res_id)):
NACCESS.py: if self.naccess_atom_dict.has_key(full_id):
Residue.py: if _atom_name_dict.has_key(name1):
Residue.py: if _atom_name_dict.has_key(name2):
Selection.py: if not d.has_key(i):

While we could just fix the has_key usage, this would be a good point
to first extend the unit coverage - just in case we break something.
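The has_key cleanup described above can be sketched as follows. This is only an illustration under stated assumptions: PropertyMap is a hypothetical stand-in for a dict-backed class, not one of the real Bio.PDB classes, and keeping has_key as a deprecated alias is one possible transition strategy rather than what the project decided.

```python
import warnings


class PropertyMap(object):
    """Hypothetical dict-backed class, loosely modelled on the listing above."""

    def __init__(self, property_dict):
        self.property_dict = property_dict

    def __contains__(self, key):
        # Supports the "key in obj" form, which works on Python 2 and 3.
        return key in self.property_dict

    def has_key(self, key):
        # Old spelling kept as a thin alias that warns and delegates.
        warnings.warn(
            "has_key is deprecated; use 'key in obj' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return key in self
```

Callers would then write `if (chain_id, res_id) in prop_map:` instead of `prop_map.has_key((chain_id, res_id))`.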
Some of these like DSSP and NACCESS are wrappers for command line tools, so new files test_PDB_DSSP.py and test_PDB_NACCESS.py would be sensible which can check for and run the tool if installed. Others like the Residue, Entity and Selection modules should be more straight forward to add directly to test_PDB.py itself. Are there any volunteers? Thanks, Peter From biopython at maubp.freeserve.co.uk Tue Jul 6 15:20:38 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jul 2010 16:20:38 +0100 Subject: [Biopython-dev] Deprecating Bio.Index? Message-ID: Hello all, Is anyone using the Bio.Index module in Biopython in their own code? This supported file indexing and was used in other parts of Biopython which have all now been deprecated (e.g. Bio.SwissProt.SProt and Bio.Prosite) or removed. The more recent Bio.SeqIO module provides a general approach to indexing sequence files. Would it inconvenience anyone if Bio.Index was deprecated in the next release (triggering warnings when imported, but still functional), and then removed later on? Thanks, Peter From eric.talevich at gmail.com Wed Jul 7 17:47:47 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 7 Jul 2010 13:47:47 -0400 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/5 Tiago Ant?o > Hi, > > On Mon, Jul 5, 2010 at 7:16 PM, Peter > wrote: > > While Tiago and I have sorted out some of the easy stuff, there is still > plenty > > to do make Biopython via 2to3 work nicely on Python 3 (and that's just > looking > > at the Python code, ignoring our C code, and also ignoring things using > NumPy). > > PhyloXML I am really stuck and NCBXML seems to have a problem only > inside run_tests. > > Hello, I ran "python -3 test_PhyloXML.py" and found one warning specific to PhyloXML, about comparing unequal types in BaseTree.py. I have a fix for this, shall I push it to GitHub? Was there anything else in Bio/Phylo/ that was causing problems? 
I'm just running the unit tests with the -3 flag, and didn't find any other issues that way. Thanks, Eric From bugzilla-daemon at portal.open-bio.org Thu Jul 8 01:45:33 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 7 Jul 2010 21:45:33 -0400 Subject: [Biopython-dev] [Bug 3109] New: Record class in Bio.SCOP.Cla has hierarchy member as list instead of dictionary Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3109 Summary: Record class in Bio.SCOP.Cla has hierarchy member as list instead of dictionary Product: Biopython Version: 1.54b Platform: PC URL: http://github.com/jfinkels/biopython/commit/6d2257dd0c46 abdf1ecd14b8bc660e32a205630a OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: jeffrey.finkelstein at gmail.com The Record class in the Bio.SCOP.Cla module has the hierarchy member as a list of key-value 2-tuples, but it should be a dictionary of key-value pairs. The SCOP Classification file format, http://scop.mrc-lmb.cam.ac.uk/scop/release-notes.html#scop-parseable-files , states that the order of the hierarchy key-value pairs in each record is unordered. This also allows easier access to the key-value pairs in a way that corresponds with the semantics of the file format specification. I have provided a fix at my own GitHub fork of Biopython. http://github.com/jfinkels/biopython/commit/6d2257dd0c46abdf1ecd14b8bc660e32a205630a In fixing this bug and the associated unit tests, I also changed the Record.__str__ method to output a string WITHOUT a trailing newline (which matches Python convention anyway). -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
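To illustrate the change proposed in bug 3109, here is a minimal sketch of storing the hierarchy as a dictionary instead of a list of 2-tuples. It is not the patch from the linked commit. The function name parse_hierarchy is hypothetical, and both the "key=value,key=value" field layout and the integer values are assumptions about the SCOP classification format.

```python
def parse_hierarchy(field):
    """Parse a 'key=value,key=value' hierarchy field into a dictionary.

    Assumes each value is an integer identifier.
    """
    hierarchy = {}
    for item in field.split(","):
        key, value = item.split("=", 1)
        hierarchy[key] = int(value)
    return hierarchy


# Hypothetical example field; the identifiers are illustrative only.
record_hierarchy = parse_hierarchy("cl=46456,cf=46457,sf=46458")
```

With a dictionary, lookups such as `record_hierarchy["cf"]` no longer depend on the order in which the pairs appear in the file.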
From bugzilla-daemon at portal.open-bio.org Thu Jul 8 01:47:53 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 7 Jul 2010 21:47:53 -0400 Subject: [Biopython-dev] [Bug 3109] Record class in Bio.SCOP.Cla has hierarchy member as list instead of dictionary In-Reply-To: Message-ID: <201007080147.o681lrsM008729@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3109 ------- Comment #1 from jeffrey.finkelstein at gmail.com 2010-07-07 21:47 EST ------- Created an attachment (id=1522) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1522&action=view) Patch for bug 3109. This can also be found on my fork of Biopython at GitHub: http://github.com/jfinkels/biopython/commit/6d2257dd0c46abdf1ecd14b8bc660e32a205630a -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Thu Jul 8 07:34:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 8 Jul 2010 08:34:10 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/7 Eric Talevich : > 2010/7/5 Tiago Antão > >> PhyloXML I am really stuck and NCBXML seems to have a problem only >> inside run_tests. > > Hello, > > I ran "python -3 test_PhyloXML.py" and found one warning specific to > PhyloXML, about comparing unequal types in BaseTree.py. I have a fix for > this, shall I push it to GitHub? > > Was there anything else in Bio/Phylo/ that was causing problems? I'm just > running the unit tests with the -3 flag, and didn't find any other issues > that way. Running the tests in Python 2.6 or 2.7 with -3 will spot a number of issues, and if we can fix them we should. Assuming your comparison fix is simple please go ahead and commit it. This will not spot everything (e.g. unicode and string problems).
Actually running 2to3 and then trying the tests on Python 3 will spot more or different problems (such as unicode/bytes problems). I think this is where Tiago was having trouble with phyloXML. Note that the 2to3 script will be slightly different depending on which copy you are using (i.e. which version of Python it came with). Peter From biopython at maubp.freeserve.co.uk Thu Jul 8 12:24:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 8 Jul 2010 13:24:28 +0100 Subject: [Biopython-dev] Equality in Bio.Restriction.RestrictionType Message-ID: Hi Frédéric et al, One of the things in Python 3 is that overriding equality (done with __eq__ only since __cmp__ has gone) requires you also override __hash__. One remaining example of this, which triggers a deprecation warning within our test suite when running with the -3 switch, is in Bio.Restriction. I therefore had a look at how __eq__ and __ne__ are defined in the RestrictionType class - and strangely they do NOT seem to be inverses.

    def __eq__(cls, other):
        """RE == other -> bool

        True if RE and other are the same enzyme."""
        return other is cls

    def __ne__(cls, other):
        """RE != other -> bool.

        isoschizomer strict, same recognition site, same restriction -> False
        all the other-> True"""
        if not isinstance(other, RestrictionType):
            return True
        elif cls.charac == other.charac:
            return False
        else:
            return True

Frédéric - could you clarify the intent here? Thanks, Peter From tiagoantao at gmail.com Thu Jul 8 20:55:05 2010 From: tiagoantao at gmail.com (Tiago Antão) Date: Thu, 8 Jul 2010 21:55:05 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: > Actually running 2to3 and then trying the tests on Python 3 will spot more > or different problems (such as unicode/bytes problems). I think this is where > Tiago was having trouble with phyloXML.
I suppose (correct me if I am wrong), that the main objective of the exercise is to make all the tests pass with Python 3 (while maintaining Python 2 compatibility). The second objective would be to find potential points of error that can be introduced by the changes and create even more tests on those points. The third would be to not let performance (speed/memory) degrade (String processing being the big issue here). -- "If you want to get laid, go to college. If you want an education, go to the library." - Frank Zappa From eric.talevich at gmail.com Thu Jul 8 22:19:07 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 8 Jul 2010 18:19:07 -0400 Subject: [Biopython-dev] Documentation Message-ID: Hello everyone, I recently read this interesting article by one of the Django developers: http://jacobian.org/writing/great-documentation/what-to-write/ The post describes three kinds of documentation a software project should have: 1. A tutorial giving an overview of the project's major areas -- not covering every feature, but giving the user a good enough understanding of the whole project. The Biopython Tutorial and Cookbook already covers this very well. If anything, we may have put more detailed information than necessary into the Tutorial. The length may also be a bit overwhelming for newcomers. 2. Topic guides for each of the project's components. As I understand it, the wiki should fill this role. We could manage this (and #1, simultaneously) by converting some less-essential portions of the Tutorial to wiki pages. 3. A detailed reference for the complete API. The article specifically states that docstring converters like epydoc are insufficient, and may give developers a false sense of having taken care of this part of the documentation. The Python project uses Sphinx now, as do quite a few other projects. It uses ReStructuredText as the markup syntax, and can (1) pull in docstrings automatically, and (2) run doctest on code samples. 
I think this would work nicely for Biopython. http://sphinx.pocoo.org/ http://docutils.sourceforge.net/ This would add a developer dependency on Docutils, a very healthy project, and of course Sphinx. Epydoc can also accept ReStructuredText as the markup syntax in docstrings, in place of epytext, if docutils is available. So, if we were to go that route, the upgrade path would look like: 1. Add docutils as a dependency for building the API docs, in addition to epydoc. 2. Convert the docstrings that use epytext to use ReStructuredText instead. (grep will help, and the changes are pretty robotic.) 3. When all docstrings are rst-compatible (plain text is OK), try running Sphinx with a stub page that just pulls in all the docstrings under Bio. (Or something like that.) Does it work? 4. If it works, figure out how to put the Sphinx-generated docs on biopython.org so people can use them. 5. Now that we have a bunch of stub pages that pull in each module's docstrings, start adding value to those stubs by moving API-reference-style parts of the wiki and Tutorial into the sphinx stubs. 6. Semi-independently of this, try trimming the Tutorial a bit to make some nice wiki pages. Does this sound worthwhile? All the best, Eric From tiagoantao at gmail.com Fri Jul 9 12:19:41 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 9 Jul 2010 13:19:41 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/9 Peter : > The primary aim is to get the main Biopython functionality working on > Python 3 (with an eye on performance), while maintaining Python 2 > support. Getting the unit tests working is just a step towards this - and > the more test coverage we have the more useful this will be for us. > But that is probably what you meant? Actually, a bit more: I don't know how to deal with cases for which there are no unit tests (and 2to3 warnings). -- "If you want to get laid, go to college.? 
If you want an education, go to the library." - Frank Zappa From biopython at maubp.freeserve.co.uk Fri Jul 9 13:25:40 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Jul 2010 14:25:40 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/9 Tiago Antão : > > 2010/7/9 Peter : >> The primary aim is to get the main Biopython functionality working on >> Python 3 (with an eye on performance), while maintaining Python 2 >> support. Getting the unit tests working is just a step towards this - and >> the more test coverage we have the more useful this will be for us. >> But that is probably what you meant? > > Actually, a bit more: I don't know how to deal with cases for which > there are no unit tests (and 2to3 warnings). The simple answer is we really need to write more unit tests ;) This will be tedious, but useful for improving the robustness of Biopython on Python 2.x as well as helping with porting to Python 3.x. For example, I recently asked if anyone would like to write some more tests for Bio.PDB (lots of things using has_key have no test coverage). Peter From tiagoantao at gmail.com Fri Jul 9 12:30:36 2010 From: tiagoantao at gmail.com (Tiago Antão) Date: Fri, 9 Jul 2010 13:30:36 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Thu, Jul 8, 2010 at 11:19 PM, Eric Talevich wrote: > The Python project uses Sphinx now, as do quite a few other projects. It > uses ReStructuredText as the markup syntax, and can (1) pull in docstrings > automatically, and (2) run doctest on code samples. I think this would work > nicely for Biopython. Just to show another example along these lines (a computational biology one), from the forward-time population genetics simulator, simuPOP.
http://simupop.sourceforge.net/Main/Documentation Tiago From biopython at maubp.freeserve.co.uk Fri Jul 9 09:40:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Jul 2010 10:40:10 +0100 Subject: [Biopython-dev] Python 3 subprocess bytes vs unicode Message-ID: Hi all, Many of the unit test failures on Python 3.1 after using 2to3 occur when calling external command line tools. Interestingly in Py3k the sys.stdin, sys.stdout and sys.stderr are in text mode by default - they automatically give you unicode strings instead of the raw bytes. This makes sense to me (and you can get at the bytes if you want them): http://docs.python.org/py3k/library/sys.html However, the stdin, stdout and stderr of any child process created with subprocess default to binary mode, and so return or expect bytes - not unicode strings: http://docs.python.org/py3k/library/subprocess.html It looks like we'll want to use universal_newlines=True when calling subprocess so that we can treat subprocess handles as text mode (i.e. unicode strings not bytes). This option is also present on Python 2, where it just controls the automatic handling of new line characters - so should be harmless (or even a good idea). This seems like a more elegant option than adding lots of encode/decode calls when doing IO with child processes (which I think Tiago has tried). Peter P.S. if we make our command line wrappers callable (or add some kind of run method) as previously discussed, it can set this option when calling subprocess. From biopython at maubp.freeserve.co.uk Fri Jul 9 08:01:43 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Jul 2010 09:01:43 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/8 Tiago Antão : >> Actually running 2to3 and then trying the tests on Python 3 will spot more >> or different problems (such as unicode/bytes problems). I think this is where >> Tiago was having trouble with phyloXML.
> > I suppose (correct me if I am wrong), that the main objective of the > exercise is to make all the tests pass with Python 3 (while > maintaining Python 2 compatibility). The second objective would be to > find potential points of error that can be introduced by the changes > and create even more tests on those points. The third would be to not > let performance (speed/memory) degrade (String processing being the > big issue here). The primary aim is to get the main Biopython functionality working on Python 3 (with an eye on performance), while maintaining Python 2 support. Getting the unit tests working is just a step towards this - and the more test coverage we have the more useful this will be for us. But that is probably what you meant? Peter From biopython at maubp.freeserve.co.uk Fri Jul 9 08:15:31 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 9 Jul 2010 09:15:31 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Thu, Jul 8, 2010 at 11:19 PM, Eric Talevich wrote: > Hello everyone, > > I recently read this interesting article by one of the Django developers: > http://jacobian.org/writing/great-documentation/what-to-write/ I don't agree with everything he said, but interesting. > The post describes three kinds of documentation a software project should > have: > > 1. A tutorial giving an overview of the project's major areas -- not > covering every feature, but giving the user a good enough understanding of > the whole project. > > The Biopython Tutorial and Cookbook already covers this very well. If > anything, we may have put more detailed information than necessary into the > Tutorial. The length may also be a bit overwhelming for newcomers. > > 2. Topic guides for each of the project's components. > > As I understand it, the wiki should fill this role. We could manage this > (and #1, simultaneously) by converting some less-essential portions of the > Tutorial to wiki pages. > > 3. 
A detailed reference for the complete API. > > The article specifically states that docstring converters like epydoc are > insufficient, and may give developers a false sense of having taken care of > this part of the documentation. His idea of an introductory tutorial is more a walk though example. > The Python project uses Sphinx now, as do quite a few other projects. It > uses ReStructuredText as the markup syntax, and can (1) pull in docstrings > automatically, and (2) run doctest on code samples. I think this would work > nicely for Biopython. > > http://sphinx.pocoo.org/ > http://docutils.sourceforge.net/ > > This would add a developer dependency on Docutils, a very healthy project, > and of course Sphinx. Epydoc can also accept ReStructuredText as the markup > syntax in docstrings, in place of epytext, if docutils is available. > > So, if we were to go that route, the upgrade path would look like: > > 1. Add docutils as a dependency for building the API docs, in addition to > epydoc. > 2. Convert the docstrings that use epytext to use ReStructuredText instead. > (grep will help, and the changes are pretty robotic.) > 3. When all docstrings are rst-compatible (plain text is OK), try running > Sphinx with a stub page that just pulls in all the docstrings under Bio. (Or > something like that.) Does it work? > 4. If it works, figure out how to put the Sphinx-generated docs on > biopython.org so people can use them. > 5. Now that we have a bunch of stub pages that pull in each module's > docstrings, start adding value to those stubs by moving API-reference-style > parts of the wiki and Tutorial into the sphinx stubs. i.e. Move from epydoc to sphinx? That would probably make things much prettier - and could make the docstrings more accessible. We could even move the main tutorial from LaTeX to sphinx as well - it can make nice HTML and PDF files. > 6. Semi-independently of this, try trimming the Tutorial a bit to make some > nice wiki pages. 
Wiki pages have some major drawbacks for primary documentation - they are not in git for a start which means version tracking is separate from the code version tracking. They also would be hard to bundle into the offline documentation. I'm not keen on this, beyond moving some "cookbook" examples to the wiki. Peter From tiagoantao at gmail.com Sat Jul 10 20:42:00 2010 From: tiagoantao at gmail.com (Tiago Antão) Date: Sat, 10 Jul 2010 21:42:00 +0100 Subject: [Biopython-dev] 2to3 and doctests Message-ID: Hi, There are a couple of issues with 2to3 and biopython doctests.

1. There is a bug in 2to3 which crashes the tool with some doctests. This bug was recognized by the python team and corrected (but only on svn). It is very easy to solve: in file refactor.py (lib2to3 python library) replace
    if self.log.isEnabledFor(logging.DEBUG):
with
    if self.logger.isEnabledFor(logging.DEBUG):
See http://svn.python.org/view/sandbox/trunk/2to3/lib2to3/refactor.py?r1=81478&r2=82779 And http://bugs.python.org/issue9217 This probably affects all versions of 2to3 (2.6.5 to 3.1.2)

2. Some of our doctests are incorrectly specified, one example from Phylo/BaseTree.py
    >>> for clade in tree.find_clades(branch_length=True, order='level'):
    >>>     if (clade.branch_length < .5 and
    >>>         not clade.is_terminal() and
    >>>         clade is not self.root):
    >>>         tree.collapse(clade)
According to documentation we are supposed to use "..." on continuation lines, not ">>>". See http://bugs.python.org/issue9221 2to3 seems to be more sensitive to this than python when running the tests.

If nobody opposes, I will convert all doctests to the correct form -- "If you want to get laid, go to college. If you want an education, go to the library."
- Frank Zappa From eric.talevich at gmail.com Sun Jul 11 03:34:18 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 10 Jul 2010 23:34:18 -0400 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: Hi guys, NumPy is keeping notes on what they did to make their code work on Python 3. Have you seen this? http://projects.scipy.org/numpy/browser/trunk/doc/Py3K.txt They use 2to3 in setup.py, too. (Sorry for the lag, my internet access here is shaky.) Cheers, Eric From tiagoantao at gmail.com Sun Jul 11 09:30:04 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Sun, 11 Jul 2010 10:30:04 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/11 Eric Talevich : > NumPy is keeping notes on what they did to make their code work on Python 3. > Have you seen this? > http://projects.scipy.org/numpy/browser/trunk/doc/Py3K.txt > > They use 2to3 in setup.py, too. I did not know about that link, many thanks. But their use of 2to3 on setup.py seems very good (BTW, the setup that I've sent you in a previous message does that and is inspired in numpy). Inspired on numpy, here is a suggestion on how things might work in a biopython version that is both 2 and 3 compatible: 1. There is a single code base written in Python 2. This code base is "3-aware" (just check Peter's commits in the last few days for lots of examples of this) in the sense that some constructs are not possible. A few (very rare?) if sys.version_info[0]==3 do exist. 2. On setup.py, if python3 is detected 2to3 is called and the code is converted. As the code base was sensibly prepared, the code will compile on 3 with just 2to3 (no need for manual intervention at all). This means a single code base (no branching). Let me repeat this, as I think it is important from a maintenance perspective: no need for different branches! 
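A rough sketch of the version check described in point 2, for illustration only: the function name and both directory names are hypothetical, and nothing here is taken from the prototype setup.py or from NumPy's build scripts. It shows only the decision a single-code-base setup.py has to make, not the 2to3 conversion itself.

```python
import sys


def pick_source_dir(version_info=sys.version_info,
                    original="Bio",
                    converted="build/py3k/Bio"):
    """Choose which tree to install from.

    On Python 2 the sources are used unchanged; on Python 3 a copy that
    has already been run through 2to3 is used instead. Both directory
    names are placeholders.
    """
    if version_info[0] >= 3:
        return converted
    return original
```

Because the version is passed in as an argument, the choice can be checked on either interpreter, for example pick_source_dir((2, 6, 5)) versus pick_source_dir((3, 1, 2)).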
Also note that my prototype setup.py (anyone interested please send me an email and I will send a copy off-list - just to avoid attachments to the list) is both 2 and 3 compatible (runs on both versions unchanged) but it still has some flaws: no doctest conversion and no test conversion. But it illustrates the point that a setup.py (2to3 based) like numpy's works for biopython. This means development proceeds in 2.x (code is converted from 2 to 3, not the opposite). I was thinking of writing a small script that every night does a git pull, runs the tests in python3 and, if something that was py3k compatible in the past does break, then it sends an email to biopython-dev. The point of this would be to make development as little cumbersome as possible: people do not want to have to test everything in BOTH 2 and 3 (just 2). They only have to intervene (and are only informed) if there is a new problem. Best, Tiago From biopython at maubp.freeserve.co.uk Sun Jul 11 09:42:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 11 Jul 2010 10:42:24 +0100 Subject: [Biopython-dev] 2to3 and doctests In-Reply-To: References: Message-ID: 2010/7/10 Tiago Antão : > Hi, > > There are a couple of issues with 2to3 and biopython doctests. > > 1. There is a bug in 2to3 which crashes the tool with some doctests. > This bug was recognized by the python team and corrected (but only on > svn). > It is very easy solve, in file refactor.py (lib2to3 python library) replace > if self.log.isEnabledFor(logging.DEBUG): > with > if self.logger.isEnabledFor(logging.DEBUG): > See > http://svn.python.org/view/sandbox/trunk/2to3/lib2to3/refactor.py?r1=81478&r2=82779 > And > http://bugs.python.org/issue9217 > This affects probably all versions of 2to3 (2.6.5 to 3.1.2) Thanks for the alert & links > 2. Some of our doctests are incorrectly specified, one example from > Phylo/BaseTree.py
>     >>> for clade in tree.find_clades(branch_length=True, order='level'):
>     >>>     if (clade.branch_length < .5 and
>     >>>         not clade.is_terminal() and
>     >>>         clade is not self.root):
>     >>>         tree.collapse(clade)
> According to documentation we are supposed to use "..." on > continuation lines, not ">>>". > See http://bugs.python.org/issue9221 > 2to3 seems to be more sensitive to this than python when running the tests. > > If nobody opposes, I will convert all doctests to correct variations Yes, those >>> should be ... so please go ahead and fix them. Thanks, Peter From biopython at maubp.freeserve.co.uk Sun Jul 11 09:47:09 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 11 Jul 2010 10:47:09 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/11 Tiago Antão : > 2010/7/11 Eric Talevich : >> NumPy is keeping notes on what they did to make their code work on Python 3. >> Have you seen this? >> http://projects.scipy.org/numpy/browser/trunk/doc/Py3K.txt >> >> They use 2to3 in setup.py, too. > > I did not know about that link, many thanks. > > But their use of 2to3 on setup.py seems very good (BTW, the setup that > I've sent you in a previous message does that and is inspired in > numpy). > > Inspired on numpy, here is a suggestion on how things might work in a > biopython version that is both 2 and 3 compatible: Hi all, While at EuroSciPy 2010 I've been chatting to Pauli Virtanen and David Cournapeau about how NumPy etc are doing things - they have got a working single code base written in Python 2.x which supports Python 3 via the 2to3 script, and plan to continue like this for the medium term. For their C code, the usual #ifdef tricks are used.
See also: http://mail.scipy.org/pipermail/numpy-discussion/2010-July/051436.html Peter From biopython at maubp.freeserve.co.uk Sun Jul 11 09:52:26 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 11 Jul 2010 10:52:26 +0100 Subject: [Biopython-dev] Python 3 porting In-Reply-To: References: Message-ID: 2010/7/11 Tiago Ant?o : > > I was thinking in doing a small script that every night does a git > pull, runs the tests in python3 and, if something that was py3k > compatible in the past does break, then it sends an email to > biopython-dev. The point of this would be to make development the > least cumbersome possible: people do not want to have to test > everything in BOTH 2 and 3 (just 2). They only have to intervene > (and are only informed) if there is a new problem. > That is worth doing, but beyond that I've been thinking about some kind of buildbot doing nightly builds and tests on assorted machines, pushing the reports to the webserver. Doing this on Python 3.1 as well as Python 2.4 to 2.7 would be great. We could probably have a simple HTML upload to the server using an SSH key (no new services or software required on the server), ideally via a new restricted user account with access only to one folder on the website. Peter From tiagoantao at gmail.com Sun Jul 11 16:44:04 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Sun, 11 Jul 2010 17:44:04 +0100 Subject: [Biopython-dev] Extending test_PDB.py coverage? In-Reply-To: References: Message-ID: On Tue, Jul 6, 2010 at 4:03 PM, Peter wrote: > Looking at the Bio/PDB/*.py files there are still quite a few more examples > of has_key being used - but these are not being picked up by the unit tests: Just a side note: There are doctests on Bio.PDB, but these are not activated on run_tests.py. Is this correct? 
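As an aside on checking docstring tests by hand, here is a minimal sketch using only the standard doctest module. The function count_short and its docstring are hypothetical stand-ins, not Bio.PDB code, and this is not how run_tests.py is implemented; it only shows a doctest that uses "..." on continuation lines being run programmatically.

```python
import doctest


def count_short(lengths, cutoff=0.5):
    """Count the values that fall below a cutoff.

    The multi-line statement uses "..." on its continuation lines:

    >>> total = 0
    >>> for x in [0.1, 0.7, 0.3]:
    ...     if x < 0.5:
    ...         total += 1
    >>> total
    2
    >>> count_short([0.1, 0.7, 0.3])
    2
    """
    return sum(1 for x in lengths if x < cutoff)


def run_docstring_tests(obj):
    """Run the doctests in obj's docstring and return (failures, tries)."""
    parser = doctest.DocTestParser()
    # Give the examples a namespace that contains the object under test.
    globs = {obj.__name__: obj}
    test = parser.get_doctest(obj.__doc__, globs, obj.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    runner.run(test)
    return runner.failures, runner.tries
```

A failure count of zero together with a non-zero number of tries is a quick sign that the examples were actually found and executed rather than silently skipped.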
From eric.talevich at gmail.com Mon Jul 12 15:47:55 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 12 Jul 2010 11:47:55 -0400 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Fri, Jul 9, 2010 at 4:15 AM, Peter wrote: > On Thu, Jul 8, 2010 at 11:19 PM, Eric Talevich > wrote: > > So, if we were to go that route, the upgrade path would look like: > > > > 1. Add docutils as a dependency for building the API docs, in addition to > > epydoc. > > 2. Convert the docstrings that use epytext to use ReStructuredText > instead. > > (grep will help, and the changes are pretty robotic.) > > 3. When all docstrings are rst-compatible (plain text is OK), try running > > Sphinx with a stub page that just pulls in all the docstrings under Bio. > (Or > > something like that.) Does it work? > > 4. If it works, figure out how to put the Sphinx-generated docs on > > biopython.org so people can use them. > > 5. Now that we have a bunch of stub pages that pull in each module's > > docstrings, start adding value to those stubs by moving > API-reference-style > > parts of the wiki and Tutorial into the sphinx stubs. > > i.e. Move from epydoc to sphinx? That would probably make things much > prettier - and could make the docstrings more accessible. We could even > move the main tutorial from LaTeX to sphinx as well - it can make nice > HTML and PDF files. > OK, I'll start a branch for this on GitHub. Do you have a preference for how I handle the new docutils dependency? I thought I'd just document it somewhere, similar to how the Tutorial's current hevea dependency is mentioned. I'll work on getting epydoc to work with docutils/ReStructuredText first, then start a reference manual under Doc/reference/ after that. > > 6. Semi-independently of this, try trimming the Tutorial a bit to make > some > > nice wiki pages. 
> > Wiki pages have some major drawbacks for primary documentation - they > are not in git for a start which means version tracking is separate from > the > code version tracking. They also would be hard to bundle into the offline > documentation. I'm not keen on this, beyond moving some "cookbook" > examples to the wiki. > > OK. I'll leave this part for the end, then, and just make note of which parts of the Tutorial seem tangential enough to be moved to cookbook pages on the wiki. I expect a bigger portion of the Tutorial could be moved to the new reference manual instead -- for example, most of the API explanations in the Phylo chapter. I like the way simuPOP separates the user guide and reference, although the Table of Contents pages are a little unwieldy... Best, Eric From biopython at maubp.freeserve.co.uk Tue Jul 13 09:59:38 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 13 Jul 2010 10:59:38 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: 2010/7/12 Eric Talevich : > Peter wrote: >> i.e. Move from epydoc to sphinx? That would probably make things much >> prettier - and could make the docstrings more accessible. We could even >> move the main tutorial from LaTeX to sphinx as well - it can make nice >> HTML and PDF files. > > OK, I'll start a branch for this on GitHub. Do you have a preference for how > I handle the new docutils dependency? I thought I'd just document it > somewhere, similar to how the Tutorial's current hevea dependency is > mentioned. > > I'll work on getting epydoc to work with docutils/ReStructuredText first, > ... So in order to move the API docs to Sphinx, they have to be formatted as reStructuredText (rather than plain text or epytext as we use now)? The good news is epydoc can also support reStructuredText (important during transition). That will be a big bit of work, but can be done on a module by module basis. See: http://epydoc.sourceforge.net/othermarkup.html#restructuredtext > ... 
> then start a reference manual under Doc/reference/ after that. Can you clarify what you idea is here? Split the current Tutorial.tex LaTeX file into a more introductory walk through, and a more technical reference manual? I'd rather more technical material into the API docs (i,e. the docstrings) and keep Tutorial.tex more introductory. Regards, Peter From eric.talevich at gmail.com Tue Jul 13 15:17:50 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 13 Jul 2010 11:17:50 -0400 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: 2010/7/13 Peter > 2010/7/12 Eric Talevich : > > Peter wrote: > >> i.e. Move from epydoc to sphinx? That would probably make things much > >> prettier - and could make the docstrings more accessible. We could even > >> move the main tutorial from LaTeX to sphinx as well - it can make nice > >> HTML and PDF files. > > > > OK, I'll start a branch for this on GitHub. Do you have a preference for > how > > I handle the new docutils dependency? I thought I'd just document it > > somewhere, similar to how the Tutorial's current hevea dependency is > > mentioned. > > > > I'll work on getting epydoc to work with docutils/ReStructuredText first, > > ... > > So in order to move the API docs to Sphinx, they have to be formatted > as reStructuredText (rather than plain text or epytext as we use now)? > The good news is epydoc can also support reStructuredText (important > during transition). That will be a big bit of work, but can be done on a > module by module basis. > > See: > http://epydoc.sourceforge.net/othermarkup.html#restructuredtext > Yeah, I think that's the best way to go. I once considered using reStructuredText for Bio.Phylo instead of epytext, but was deterred by the extra dependency. So, my branch for this (not on github yet) will first just convert all the docstrings to at least work with reStructuredText, and hopefully the plain-text docstrings will generally Just Work. 
Once that's done, and Epydoc will handle all the docstrings as reStructuredText without any problems, I think it would be a good time to merge that work into the trunk so we can all start/continue writing rst-compatible docstrings. > > ... > > then start a reference manual under Doc/reference/ after that. > > Can you clarify what you idea is here? Split the current Tutorial.tex > LaTeX file into a more introductory walk through, and a more technical > reference manual? I'd rather more technical material into the API > docs (i,e. the docstrings) and keep Tutorial.tex more introductory. > > As I understand it, using Sphinx for API docs requires creating a .rst document for each sub-package. The document can be a stub, containing just a command to pull in the module docstrings: http://sphinx.pocoo.org/ext/autosummary.html Incidentally, we could set it up to run doctest from here, too: http://sphinx.pocoo.org/ext/doctest.html In any case, I won't touch Tutorial.tex at first. I'll just set up the stubs for pulling in docstrings, and call that a minimal Sphinx reference manual, separate from the Tutorial. Then we should figure out how to make the reference manual easy to view (at least for anyone with a Git branch), and at least think about how it should be published on biopython.org -- I think it's just static .html files, so this shouldn't be too hard. Once we're happy with Sphinx as a replacement for Epydoc, and are able to make the reference manual available through the same sources as the Tutorial, then we'd be free to move pieces of the Tutorial to the reference manual, as appropriate -- adding longer descriptions and examples to the .rst documents that were previously just stubs. For example, my Bio.Phylo chapter in the Tutorial has detailed API descriptions that should be moved to the reference. The BLAST chapter has a complete class diagram which also seems like reference material to me. 
There's also some BioSQL material scattered around the internet that would be more helpful if aggregated into a complete, up-to-date reference. Sound like a plan? -Eric From biopython at maubp.freeserve.co.uk Tue Jul 13 15:56:47 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 13 Jul 2010 16:56:47 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Tue, Jul 13, 2010 at 4:17 PM, Eric Talevich wrote: >Peter wrote: >> So in order to move the API docs to Sphinx, they have to be formatted >> as reStructuredText (rather than plain text or epytext as we use now)? >> The good news is epydoc can also support reStructuredText (important >> during transition). That will be a big bit of work, but can be done on a >> module by module basis. >> >> See: >> http://epydoc.sourceforge.net/othermarkup.html#restructuredtext >> > > Yeah, I think that's the best way to go. I once considered using > reStructuredText for Bio.Phylo instead of epytext, but was deterred by the > extra dependency. So, my branch for this (not on github yet) will first just > convert all the docstrings to at least work with reStructuredText, and > hopefully the plain-text docstrings will generally Just Work. > > Once that's done, and Epydoc will handle all the docstrings as > reStructuredText without any problems, I think it would be a good time to > merge that work into the trunk so we can all start/continue writing > rst-compatible docstrings. Sounds good. I would like to keep the reStructuredText as simple as possible for human readers (i.e. when looking at the doctext within Python). I think the NumPy project had similar aims, and have documented this - e.g. here: http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines >> > ... >> > then start a reference manual under Doc/reference/ after that. >> >> Can you clarify what you idea is here? 
Split the current Tutorial.tex >> LaTeX file into a more introductory walk through, and a more technical >> reference manual? I'd rather more technical material into the API >> docs (i,e. the docstrings) and keep Tutorial.tex more introductory. >> >> > As I understand it, using Sphinx for API docs requires creating a .rst > document for each sub-package. The document can be a stub, containing > just a command to pull in the module docstrings: > http://sphinx.pocoo.org/ext/autosummary.html That fits with my impression from chatting to NumPy folk at EuroSciPy 2010. Loads of stub RST files sounds like a bit of a pain, but I can live with it. > Incidentally, we could set it up to run doctest from here, too: > http://sphinx.pocoo.org/ext/doctest.html > > In any case, I won't touch Tutorial.tex at first. I'll just set up the stubs > for pulling in docstrings, and call that a minimal Sphinx reference manual, > separate from the Tutorial. Then we should figure out how to make the > reference manual easy to view (at least for anyone with a Git branch), and > at least think about how it should be published on biopython.org -- I think > it's just static .html files, so this shouldn't be too hard. Sounds OK... > Once we're happy with Sphinx as a replacement for Epydoc, and are able to > make the reference manual available through the same sources as the > Tutorial, then we'd be free to move pieces of the Tutorial to the reference > manual, as appropriate -- adding longer descriptions and examples to the > .rst documents that were previously just stubs. Or move them into the module docstrings instead? > For example, my Bio.Phylo chapter in the Tutorial has detailed API > descriptions that should be moved to the reference. The BLAST chapter has a > complete class diagram which also seems like reference material to me. > There's also some BioSQL material scattered around the internet that would > be more helpful if aggregated into a complete, up-to-date reference. 
Regarding BioSQL, what online bits are you referring to beyond this: http://www.biopython.org/wiki/BioSQL (and the LaTeX file referenced)? Peter From vsbuffalo at gmail.com Tue Jul 13 22:24:50 2010 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Tue, 13 Jul 2010 15:24:50 -0700 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: Hi All, I'd like to become more active in the Biopython project, and porting the documentation to Sphinx seems like an excellent way to begin. Is there a wiki or other website for allocating docstrings/other documentation to be rewritten in reStructuredText? Vince -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." -Richard Dawkins From biopython at maubp.freeserve.co.uk Tue Jul 13 23:04:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Jul 2010 00:04:28 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Tue, Jul 13, 2010 at 11:24 PM, Vince S. Buffalo wrote: > Hi All, > > I'd like to become more active in the Biopython project, and porting the > documentation to Sphinx seems like an excellent way to begin. Is there a > wiki or other website for allocating docstrings/other documentation to be > rewritten in reStructuredText? > > Vince Hi Vince, Volunteers to help would be great. In terms of a wiki or website system, I guess you are aware of or have used the NumPy system. They put a lot of effort into setting up a workflow to edit docstrings via a wiki, before manual merging into the code base. We don't have anything like that. For now, it would be a case of making a fork on github, and editing Python source code files one by one to convert their docstrings into reStructuredText (plus checking the output works in epydoc, and making sure this doesn't break any doctests). We'd then be able to pull your changes into the trunk (manually). 
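As a rough sketch of what one converted file might end up looking like (the function below is hypothetical and not taken from Biopython, and the :param:/:return: field style is only one option, an assumption rather than an agreed convention):

```python
# Tell epydoc that this module's docstrings use reStructuredText.
__docformat__ = "restructuredtext en"


def gc_fraction_sketch(seq):
    """Return the fraction of G and C letters in a sequence string.

    :param seq: nucleotide sequence as a plain string
    :return: float between 0 and 1 (0.0 for an empty string)

    The embedded examples double as doctests:

    >>> gc_fraction_sketch("GGCC")
    1.0
    >>> gc_fraction_sketch("ATGC")
    0.5
    """
    if not seq:
        return 0.0
    seq = seq.upper()
    # float() keeps the division exact on Python 2 as well as Python 3
    return (seq.count("G") + seq.count("C")) / float(len(seq))
```

The docstring stays readable as raw text at the Python prompt, which matters while epydoc and Sphinx are both in use.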
Are you familiar with git, github, epydoc and/or Sphinx? Regards, Peter From vsbuffalo at gmail.com Tue Jul 13 23:26:51 2010 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Tue, 13 Jul 2010 16:26:51 -0700 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: I am familiar with git, github and Sphinx, but not epydoc. Would initial draft version of the tutorial to Sphinx be a good first move? best, Vince On Tue, Jul 13, 2010 at 4:04 PM, Peter wrote: > On Tue, Jul 13, 2010 at 11:24 PM, Vince S. Buffalo > wrote: > > Hi All, > > > > I'd like to become more active in the Biopython project, and porting the > > documentation to Sphinx seems like an excellent way to begin. Is there a > > wiki or other website for allocating docstrings/other documentation to be > > rewritten in reStructuredText? > > > > Vince > > Hi Vince, > > Volunteers to help would be great. In terms of a wiki or website system, > I guess you are aware of or have used the NumPy system. They put a > lot of effort into setting up a workflow to edit docstrings via a wiki, > before > manual merging into the code base. We don't have anything like that. > > For now, it would be a case of making a fork on github, and editing > Python source code files one by one to convert their docstrings into > reStructuredText (plus checking the output works in epydoc, and > making sure this doesn't break any doctests). We'd then be able > to pull your changes into the trunk (manually). > > Are you familiar with git, github, epydoc and/or Sphinx? > > Regards, > > Peter > -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." 
-Richard Dawkins From eric.talevich at gmail.com Tue Jul 13 23:42:47 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 13 Jul 2010 19:42:47 -0400 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Tue, Jul 13, 2010 at 7:26 PM, Vince S. Buffalo wrote: > I am familiar with git, github and Sphinx, but not epydoc. Would initial > draft version of the tutorial to Sphinx be a good first move? > > best, > Vince > > Hi Vince, Converting both to Sphinx would be awesome, but if you're looking to learn about Biopython in depth, I'd recommend starting by converting the docstrings to reStructuredText. In the current Biopython source tree, you can grep for "__docformat__" to identify modules that are already using Epytext markup; those should be converted first. See: http://epydoc.sourceforge.net/manual-othermarkup.html Then, you can try running Epydoc with the option to interpret all docstrings as reStructuredText, rather than plain text. Make sure you're in a new, empty directory outside the Biopython source tree, and use the command: epydoc --html --verbose --docformat restructuredtext Bio BioSQL This should identify any remaining issues, including any dependencies you're missing. Best, Eric From biopython at maubp.freeserve.co.uk Wed Jul 14 10:12:03 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Jul 2010 11:12:03 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Wed, Jul 14, 2010 at 12:42 AM, Eric Talevich wrote: > On Tue, Jul 13, 2010 at 7:26 PM, Vince S. Buffalo wrote: > >> I am familiar with git, github and Sphinx, but not epydoc. You'll be fine then - once epydoc is installed (easy on Linux as it should be in your distribution's packages) its just one line to execute - see also: http://biopython.org/wiki/Building_a_release >> Would initial draft version of the tutorial to Sphinx be a good first move? 
>> >> best, >> Vince >> >> > Hi Vince, > > Converting both to Sphinx would be awesome, but if you're looking to learn > about Biopython in depth, I'd recommend starting by converting the > docstrings to reStructuredText. As Eric says, we would suggest starting with docstrings. > In the current Biopython source tree, you can grep for "__docformat__" to > identify modules that are already using Epytext markup; those should be > converted first. See: > http://epydoc.sourceforge.net/manual-othermarkup.html Note that with epydoc you can have different python files using different mark up (this is the __docformat__ thing Eric mentioned). Most of ours are plain text, some use epytext, soon some will use reStructuredText. The advantage of this is we can translate things gradually (file by file). Anything already using epytext should be quite clear and easy to convert to reStructuredText. Anything using plain text may need a little more work. Personally I'd suggest you pick modules you are familiar with to update first. Peter From biopython at maubp.freeserve.co.uk Wed Jul 14 10:49:16 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Jul 2010 11:49:16 +0100 Subject: [Biopython-dev] 2to3 and doctests In-Reply-To: References: Message-ID: 2010/7/11 Peter : > 2010/7/10 Tiago Ant?o : >> Hi, >> >> There are a couple of issues with 2to3 and biopython doctests. >> >> 1. There is a bug in 2to3 which crashes the tool with some doctests. >> This bug was recognized by the python team and corrected ... >> >> 2. Some of our doctests are incorrectly specified, one example from >> Phylo/BaseTree.py ... 
We're not the only people to run into problems with 2to3 and doctest not working properly: http://lucumr.pocoo.org/2010/2/11/porting-to-python-3-a-guide Most of our doctests still work after 2to3 and I guess we can look at the failures on a case by case basis (and for expediency move them into proper unit tests or remove them if they can't be tweaked to work on both Python 2.x and 3.x). Peter From tiagoantao at gmail.com Wed Jul 14 11:13:48 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 14 Jul 2010 12:13:48 +0100 Subject: [Biopython-dev] 2to3 and doctests In-Reply-To: References: Message-ID: 2010/7/14 Peter : > We're not the only people to run into problems with 2to3 and doctest > not working properly: > > http://lucumr.pocoo.org/2010/2/11/porting-to-python-3-a-guide > > Most of our doctests still work after 2to3 and I guess we can look at > the failures on a case by case basis (and for expediency move them > into proper unit tests or remove them if they can't be tweaked to > work on both Python 2.x and 3.x). I don't have the cases here, but they are only a handful. I suppose they can either be converted to (ugly) single-liners or to unit-tests. Anyway, it is a pity that 2to3 doctests are in such a state. Because all the rest seems to work quite well. 
Tiago PS - I can do this today, just tell me if you prefer single-liners or unit-tests From biopython at maubp.freeserve.co.uk Wed Jul 14 11:52:48 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Jul 2010 12:52:48 +0100 Subject: [Biopython-dev] 2to3 and doctests In-Reply-To: References: Message-ID: 2010/7/14 Tiago Ant?o : > > 2010/7/14 Peter : >> We're not the only people to run into problems with 2to3 and doctest >> not working properly: >> >> http://lucumr.pocoo.org/2010/2/11/porting-to-python-3-a-guide >> >> Most of our doctests still work after 2to3 and I guess we can look at >> the failures on a case by case basis (and for expediency move them >> into proper unit tests or remove them if they can't be tweaked to >> work on both Python 2.x and 3.x). > > I don't have the cases here, but they are only a handful. I suppose > they can either be converted to (ugly) single-liners or to unit-tests. > > Anyway, it is a pity that 2to3 doctests are in such a state. Because > all the rest seems to work quite well. > > Tiago > PS - I can do this today, just tell me if you prefer single-liners or unit-tests So is it basically just multi-line doctests with slash continuation chars we have a problem with? Those were always a bit ugly - but seemed the best solution given the desire to limit ourselves to 80 character lines. Could you make them single liners for now (least work to get the tests to pass), and I'll take a look at the commit later. If these are from my examples I'll then have a think about how better to handle them. 
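For reference, one layout that keeps a long doctest statement under 80 characters without a backslash is implicit continuation inside parentheses. This is only a sketch: the function is hypothetical, and whether it sidesteps the 2to3 problem discussed here has not been checked.

```python
import doctest


def join_ids(first, second, third):
    """Join three identifiers with commas.

    The call below spans several lines using "..." continuation
    prompts and open parentheses, with no trailing backslash:

    >>> print(join_ids("gi|2765658",
    ...                "gi|2765657",
    ...                "gi|2765656"))
    gi|2765658,gi|2765657,gi|2765656
    """
    return ",".join([first, second, third])


if __name__ == "__main__":
    # Run the embedded example when executed as a script.
    doctest.testmod()
```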
Peter From peter at maubp.freeserve.co.uk Wed Jul 14 16:27:55 2010 From: peter at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Jul 2010 17:27:55 +0100 Subject: [Biopython-dev] Fwd: [Gmod-schema] GFF3 Is_circular In-Reply-To: <5B028E4D-30B2-4DCA-B41A-FF59ABDC4898@mac.com> References: <5B028E4D-30B2-4DCA-B41A-FF59ABDC4898@mac.com> Message-ID: Hi Brad, Something to be aware of for GFF work - the spec finally has explicit support for circular genomes :) Peter ---------- Forwarded message ---------- From: Andrew McArthur Date: Wed, Jul 14, 2010 at 5:17 PM Subject: [Gmod-schema] GFF3 Is_circular To: gmod-schema at lists.sourceforge.net Hello all, The definition of GFF3 at the Sequence Ontology site (http://www.sequenceontology.org/gff3.shtml) now has format definitions for supporting circular molecules such as plasmids or bacterial genomes. This is done using a new Is_circular flag in the GFF3 attributes field. Notably, "For features that cross the origin of a circular feature (e.g. most bacterial genomes, plasmids, and some viral genomes), the requirement for start to be less than or equal to end is satisfied by making end = the position of the end + the length of the landmark feature." Are Chado 1.1 and gmod_bulk_load_gff3.pl supporting this change to GFF3 or should I wait before changing my GFF3 files? Thanks, Andrew McArthur ------ Andrew G. McArthur, Ph.D. Bioinformatics Consulting Services Email: amcarthur at mac.com, Web: http://mcarthurlab.blogspot.com Phone: 905.296.3252, Mobile: 905.745.2794, Fax: 647.439.0829 AIM: amcarthur at mac.com, Skype: agmcarthur
_______________________________________________ Gmod-schema mailing list Gmod-schema at lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/gmod-schema From biopython at maubp.freeserve.co.uk Wed Jul 14 17:47:45 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Jul 2010 18:47:45 +0100 Subject: [Biopython-dev] Python 3 and Bio.SeqIO.index() Message-ID: Hi all, From background reading I knew that text IO speed was very slow in Python 3.0, but this had been improved in Python 3.1 - however there was still an overhead for the unicode conversion. e.g. http://dabeaz.blogspot.com/2010/01/reexamining-python-3-text-io.html First some good news - using Bio.SeqIO.convert() for FASTQ to FASTA seems to be faster under Python 3.1 than Python 2.7 (on a Windows XP 32bit machine). Now for the bad news - using Bio.SeqIO.index() is much slower. I decided to simplify this down to a minimal test case, and confirmed my hunch: indexing files in the new default unicode text mode comes with a major time penalty (a factor of about one hundred). I've attached four versions of the same script which scans a FASTA file building a dictionary of record offsets. * fast in Python 2 using the default non-unicode strings * slower in Python 3 using the default unicode strings * slower in Python 3 using Latin encoded unicode strings * faster in Python 3 using binary mode and bytes The basic Python 3 script was created using 2to3 from the Python 2 version. I manually changed this to make the latin variant, and the binary bytes version.
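A minimal sketch of the kind of scan described above, assuming binary mode and bytes. This is not the attached index3b.py, and the function name is hypothetical:

```python
def index_fasta_offsets(filename):
    """Map each FASTA record identifier to the byte offset of its '>' line."""
    offsets = {}
    # Binary mode: no unicode decoding, and tell() gives true byte offsets
    with open(filename, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                # The identifier is the first word after '>'; split() with
                # no separator also drops a trailing LF or CR LF.
                words = line[1:].split(None, 1)
                if words:
                    offsets[words[0].decode("ascii")] = offset
    return offsets
```

Because the file is opened with "rb", the recorded offsets are the same whether the file uses LF or CR LF line endings, so a later seek() to an offset lands on the '>' of the record.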
Sample output on an example file with just 94 entries: c:\python27\python index2.py ls_orchid.fasta - Indexed in 0.02s c:\python31\python index3.py ls_orchid.fasta - Indexed in 12.20s c:\python31\python index3latin.py ls_orchid.fasta - Indexed in 11.78s c:\python31\python index3b.py ls_orchid.fasta - Indexed in 0.02s Here the Python 2 version and the Python 3 binary examples are both extremely fast, while Python 3 unicode is very slow. There may be a tiny benefit to using the Latin encoding as suggested on the blog post I linked to above. Using a FASTA file with 7 million entries (converted from SRA entry SRR001666_1.fastq), we have: c:\python27\python index2.py SRR001666_1.fasta - Indexed in 79.45s c:\python31\python index3latin.py SRR001666_1.fasta - Over an hour (I killed it) c:\python31\python index3b.py SRR001666_1.fasta - Indexed in 34.31s I think the reason that Python 3 binary is faster than Python 2 is we are using universal read lines mode in Python 2, which will add an overhead (both for reading, and in calculating the offset). Given the way the Bio.SeqIO.index() API works, we have control over the file mode. I think we are going to have to open the file in binary mode for indexing efficiently. This may mean an extra wrapper for handling cross platform new line characters (something that Python 2.x does for us). I'd also be interested to try making the optimized functions in Bio.SeqIO.convert() use binary mode too and see if that makes them any faster (even on Python 2). In general, perhaps it would be useful if on Python 3 Bio.SeqIO could cope with opening text files in either unicode text mode or in binary mode? These issues may also influence what we decide to use for Seq objects by default (bytes versus unicode). Of course, the more special cases like this we have to worry about, the more complex a single codebase becomes... Peter -------------- next part -------------- A non-text attachment was scrubbed... 
Name: index3.py Type: application/octet-stream Size: 625 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: index3b.py Type: application/octet-stream Size: 637 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: index3latin.py Type: application/octet-stream Size: 645 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: index2.py Type: application/octet-stream Size: 607 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Wed Jul 14 18:09:15 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Jul 2010 19:09:15 +0100 Subject: [Biopython-dev] Python 3 and Bio.SeqIO.index() In-Reply-To: References: Message-ID: On Wed, Jul 14, 2010 at 6:47 PM, Peter wrote: > > Using a FASTA file with 7 million entries (converted from SRA > entry SRR001666_1.fastq), we have: > > c:\python27\python index2.py SRR001666_1.fasta - Indexed in 79.45s > c:\python31\python index3latin.py SRR001666_1.fasta - Over an hour (I killed it) > c:\python31\python index3b.py SRR001666_1.fasta - Indexed in 34.31s > > I think the reason that Python 3 binary is faster than Python 2 > is we are using universal read lines mode in Python 2, which will > add an overhead (both for reading, and in calculating the offset). Confirmed - switching the mode from "rU" to "rb" to give index2b.py, c:\python27\python index2.py SRR001666_1.fasta - Indexed in 76.96s c:\python27\python index2b.py SRR001666_1.fasta - Indexed in 36.62s I've had a quick go at doing this for Bio.SeqIO.index(), and with the catch that the get_raw() functionality then returns the underlying newlines (which we can fix if need be) it seems to work (unit tests pass). This may be worth following up on regardless of the Python 3 work, since the speed up is pretty good (from 97s to 52s on this example on Windows).
We'd need more testing for the cross platform issues of course. I wonder if the same speed up happens on Linux / Mac OS X? Something to try tomorrow I guess. Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: seqio_index_b.patch Type: application/octet-stream Size: 1288 bytes Desc: not available URL: From vsbuffalo at gmail.com Wed Jul 14 18:24:40 2010 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Wed, 14 Jul 2010 11:24:40 -0700 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: Thanks Eric and Peter, I'll get started on this! best, Vince On Wed, Jul 14, 2010 at 3:12 AM, Peter wrote: > On Wed, Jul 14, 2010 at 12:42 AM, Eric Talevich > wrote: > > On Tue, Jul 13, 2010 at 7:26 PM, Vince S. Buffalo >wrote: > > > >> I am familiar with git, github and Sphinx, but not epydoc. > > You'll be fine then - once epydoc is installed (easy on Linux as it should > be in your distribution's packages) its just one line to execute - see > also: > http://biopython.org/wiki/Building_a_release > > >> Would initial draft version of the tutorial to Sphinx be a good first > move? > >> > >> best, > >> Vince > >> > >> > > Hi Vince, > > > > Converting both to Sphinx would be awesome, but if you're looking to > learn > > about Biopython in depth, I'd recommend starting by converting the > > docstrings to reStructuredText. > > As Eric says, we would suggest starting with docstrings. > > > In the current Biopython source tree, you can grep for "__docformat__" to > > identify modules that are already using Epytext markup; those should be > > converted first. See: > > http://epydoc.sourceforge.net/manual-othermarkup.html > > Note that with epydoc you can have different python files using different > mark up (this is the __docformat__ thing Eric mentioned). Most of ours are > plain text, some use epytext, soon some will use reStructuredText. The > advantage of this is we can translate things gradually (file by file). 
> > Anything already using epytext should be quite clear and easy to convert > to reStructuredText. Anything using plain text may need a little more work. > Personally I'd suggest you pick modules you are familiar with to update > first. > > Peter > -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." -Richard Dawkins From vsbuffalo at gmail.com Wed Jul 14 18:38:51 2010 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Wed, 14 Jul 2010 11:38:51 -0700 Subject: [Biopython-dev] Python 3 and Bio.SeqIO.index() In-Reply-To: References: Message-ID: There's a slight (perhaps non-significant) speed up on OS X. In Python 2.7 on OS X 10.5.8: vinceb$ python index2.py s_7_1_sequence.fasta s_7_1_sequence.fasta Indexed in 32.35s vinceb$ python index2b.py s_7_1_sequence.fasta s_7_1_sequence.fasta Indexed in 26.01s best, Vince On Wed, Jul 14, 2010 at 11:09 AM, Peter wrote: > SRR001666_1.fasta -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." -Richard Dawkins From vsbuffalo at gmail.com Thu Jul 15 04:00:42 2010 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Wed, 14 Jul 2010 21:00:42 -0700 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: Hi all, I've started the conversion process with the file Bio/SeqIO/__init__.py. A few questions came up. First, are docstrings (with the autodoc extension) going to be the primary form of documentation, or are we going to copy/paste them into a separate documentation tree? I believe the latter is what Python and Django do, and may give us freedom to target different audiences with docstrings and the separate documentation. Also, after some Googling about autodoc, it seems complex. Also, has a branch been created on github? 
At this point, I'll continuing going through the "robotic" steps of converting epydoc formatting to ReST. Given my youthfulness working on this project, I'll try to keep you guys well updated. Preemptive apologies for future questions :-) Also, to test ReST + Sphinx on one section, I ran Sphinx on a copy/pasted docstring. I have to say, Sphinx is beautiful: http://imgur.com/4gNok best, Vince On Wed, Jul 14, 2010 at 11:24 AM, Vince S. Buffalo wrote: > Thanks Eric and Peter, I'll get started on this! > > best, > Vince > > > On Wed, Jul 14, 2010 at 3:12 AM, Peter wrote: > >> On Wed, Jul 14, 2010 at 12:42 AM, Eric Talevich >> wrote: >> > On Tue, Jul 13, 2010 at 7:26 PM, Vince S. Buffalo > >wrote: >> > >> >> I am familiar with git, github and Sphinx, but not epydoc. >> >> You'll be fine then - once epydoc is installed (easy on Linux as it should >> be in your distribution's packages) its just one line to execute - see >> also: >> http://biopython.org/wiki/Building_a_release >> >> >> Would initial draft version of the tutorial to Sphinx be a good first >> move? >> >> >> >> best, >> >> Vince >> >> >> >> >> > Hi Vince, >> > >> > Converting both to Sphinx would be awesome, but if you're looking to >> learn >> > about Biopython in depth, I'd recommend starting by converting the >> > docstrings to reStructuredText. >> >> As Eric says, we would suggest starting with docstrings. >> >> > In the current Biopython source tree, you can grep for "__docformat__" >> to >> > identify modules that are already using Epytext markup; those should be >> > converted first. See: >> > http://epydoc.sourceforge.net/manual-othermarkup.html >> >> Note that with epydoc you can have different python files using different >> mark up (this is the __docformat__ thing Eric mentioned). Most of ours are >> plain text, some use epytext, soon some will use reStructuredText. The >> advantage of this is we can translate things gradually (file by file). 
>> >> Anything already using epytext should be quite clear and easy to convert >> to reStructuredText. Anything using plain text may need a little more >> work. >> Personally I'd suggest you pick modules you are familiar with to update >> first. >> >> Peter >> > > > > -- > Vince Buffalo > Programmer > Bioinformatics Core > UC Davis Genome Center > University of California, Davis > > "There's real poetry in the real world. Science is the poetry of reality." > -Richard Dawkins > > -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." -Richard Dawkins From biopython at maubp.freeserve.co.uk Thu Jul 15 09:18:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Jul 2010 10:18:28 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Thu, Jul 15, 2010 at 5:00 AM, Vince S. Buffalo wrote: > Hi all, > > I've started the conversion process with the file Bio/SeqIO/__init__.py. A > few questions came up. > > First, are docstrings (with the autodoc extension) going to be the primary > form of documentation, or are we going to copy/paste them into a separate > documentation tree? I believe the latter is what Python and Django do, and > may give us freedom to target different audiences with docstrings and the > separate documentation. Also, after some Googling about autodoc, it seems > complex. I think the docstrings should be the primary API documentation, and the Tutorial the primary introductory text. We currently have three forms, * Biopython Tutorial (PDF & HTML, written in LaTeX) which is the main documentation and should be introductory. * Module docstrings for the API, more technical (shown online with epydoc which is functional but ugly, later this will use Sphinx) * Some wiki pages, more for recent things still in flux, and some user contributed "Cookbook" entries.
The wiki is nice to edit for user contributions, but not under source code control. > > Also, has a branch been created on github? > No - I was suggesting you make a fork and your own branch, and we will periodically review your changes and apply them to the trunk. Is that OK? > At this point, I'll continuing going through the "robotic" steps of > converting epydoc formatting to ReST. Given my youthfulness working on this > project, I'll try to keep you guys well updated. Preemptive apologies for > future questions :-) If in doubt, it's better to ask first - so not a problem at all. > Also, to test ReST + Sphinx on one section, I ran Sphinx on a copy/pasted > docstring. I have to say, Sphinx is beautiful: http://imgur.com/4gNok The epydoc version is here (deep linking to avoid the frames): http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html The core text isn't so different, I'm more excited about the section names and navigation side of things with Sphinx. Peter From biopython at maubp.freeserve.co.uk Thu Jul 15 10:23:30 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Jul 2010 11:23:30 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Thu, Jul 15, 2010 at 10:18 AM, Peter wrote: >> >> Also, has a branch been created on github? >> > > No - I was suggesting you make a fork and your own branch, and we will > periodically review your changes and apply them to the trunk. Is that OK? It looks like you are already doing that - great. A few things from looking over your first two commits, for SeqIO and AlignIO, http://github.com/vsbuffalo/biopython/commit/a77d168cdf3f4a2c36708b5553531eef216f8aec http://github.com/vsbuffalo/biopython/commit/76ba2d5e9c5d915230bbdee73fa3a3a962f814df (1) Until most of the docstrings are using reStructuredText, we need to keep using epydoc (before switching to Sphinx). During this transition we will have a mix of mark up in different modules.
The __docformat__ setting is important to tell epydoc this. So rather than deleting any existing value like: __docformat__ = "epytext en" it should probably be replaced with: __docformat__ = "reStructuredText en" See http://epydoc.sourceforge.net/othermarkup.html (2) I'm not keen on things like :mod:`Bio.AlignIO` or :func:`write` in the markup. They look ugly and confusing to me (for looking at the raw text at the Python terminal). Have you looked at NumPy's guidelines http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines and whatever pre-processors they use to assist Sphinx? (3) Do you think we should we also be standardising how we describe parameters in docstrings? e.g. Follow what NumPy is doing? Peter From biopython at maubp.freeserve.co.uk Thu Jul 15 13:31:29 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Jul 2010 14:31:29 +0100 Subject: [Biopython-dev] Python 3 and Bio.SeqIO.index() In-Reply-To: References: Message-ID: On Wed, Jul 14, 2010 at 7:38 PM, Vince S. Buffalo wrote: > There's a slight (perhaps non-significant) speed up on OS X.?In Python 2.7 > on OS X 10.5.8: > vinceb$ python index2.py s_7_1_sequence.fasta > s_7_1_sequence.fasta > Indexed in 32.35s > vinceb$ python index2b.py s_7_1_sequence.fasta > s_7_1_sequence.fasta > Indexed in 26.01s > best, > Vince I don't have Python 3 on my Mac yet, so I've tried things out under Linux. 7 million entry FASTA file with Unix line endings (LF), on Linux: python2.7 index2.py SRR001666_1.lf.fasta - 19s python2.7 index2b.py SRR001666_1.lf.fasta - 19s python3.1 index3.py SRR001666_1.lf.fasta - Over an hour (I killed it) python3.1 index3b.py SRR001666_1.lf.fasta - 29s Again, I gave up on the Python 3 plain text unicode string version. 
7 million entry FASTA file with DOS line endings (CR LF), on Linux:

python2.7 index2.py SRR001666_1.crlf.fasta - 19 or 20s
python2.7 index2b.py SRR001666_1.crlf.fasta - 19 or 20s
python3.1 index3.py SRR001666_1.crlf.fasta - not tested
python3.1 index3b.py SRR001666_1.crlf.fasta - 29s

Interestingly the line endings make almost no difference to the timings.
On this machine the python3.1 bytes version is slower than either of the
Python 2.7 versions. This may be down to compiler options or something
(I compiled Python 3.1 myself with the defaults). Recall on the
Windows machine Python 3.1 (binary mode) was faster than Python 2.7
(binary mode or universal new lines mode).

Regarding possible speed ups under Python 2 by avoiding universal new
lines mode, as you can see above on this Linux Python 2.7 setup the
timings for index2.py and index2b.py are practically equal (~19s), unlike
on the Windows machine where this did seem to help.

I think the clear message (from both Windows and Linux) is that for
Bio.SeqIO.index() to perform at a tolerable speed on Python 3 we can't
use the default text mode with unicode strings, we are going to have
to use binary mode with bytes.

Peter

From vsbuffalo at gmail.com Thu Jul 15 15:22:55 2010
From: vsbuffalo at gmail.com (Vince S. Buffalo)
Date: Thu, 15 Jul 2010 08:22:55 -0700
Subject: [Biopython-dev] Documentation
In-Reply-To:
References:
Message-ID:

> I think the docstrings should be the primary API documentation, and the
> Tutorial the primary introductory text.

I like the idea of code and documentation living together, but one thing
that concerns me is that as the documentation grows larger and filled with
more examples, it may begin to clutter the code quite a bit. Separate
documentation and code allow greedy search and replace in documentation
with the guarantee it won't damage code. And in Emacs (and other editors
I presume) there are ReST editing modes that highlight syntax that do not
work in docstrings.
The benefits of documentation and code living together are that developers
can more easily find and update documentation on their functions, which is
not to be underestimated. It is interesting that numpy seems entirely
documented in docstrings, but django and other projects less so.

> (2) I'm not keen on things like :mod:`Bio.AlignIO` or :func:`write` in the
> markup. They look ugly and confusing to me (for looking at the raw text
> at the Python terminal). Have you looked at NumPy's guidelines
> http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines
> and whatever pre-processors they use to assist Sphinx?

Ah, in looking at numpy's ReST source (i.e.
http://docs.scipy.org/numpy/source/numpy/dist/lib64/python2.4/site-packages/numpy/lib/function_base.py#347)
it is much more terse. I can switch to this approach and skim their
source to find their preprocessor.

> (3) Do you think we should also be standardising how we describe
> parameters in docstrings? e.g. Follow what NumPy is doing?

I was getting this same feeling as I was working. It might not be a bad
idea to create a stub-type docstring for every non-internal function so at
the very least something ends up in the documentation. This would also
provide a template for standardizing parameters (e.g. indicating return
value types, etc). This would likely increase the length of all code files
quite a bit though, but the documentation coverage would be higher.

--
Vince Buffalo
Programmer
Bioinformatics Core
UC Davis Genome Center
University of California, Davis

"There's real poetry in the real world. Science is the poetry of reality."
-Richard Dawkins

From biopython at maubp.freeserve.co.uk Thu Jul 15 15:38:24 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jul 2010 16:38:24 +0100
Subject: [Biopython-dev] Documentation
In-Reply-To:
References:
Message-ID:

On Thu, Jul 15, 2010 at 4:22 PM, Vince S.
Buffalo wrote:
>> I think the docstrings should be the primary API documentation,
>> and the Tutorial the primary introductory text.
>
> I like the idea of code and documentation living together, but one thing
> that concerns me is that as the documentation grows larger and filled with
> more examples, it may begin to clutter the code quite a bit.

We can cross that bridge if we come to it - right now I would say most
modules really need more docstrings. If you think that any of the docstrings
you've looked at are too long, we can discuss shortening them (ideally by
relocating good content or tests).

Peter

From biopython at maubp.freeserve.co.uk Thu Jul 15 16:32:01 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jul 2010 17:32:01 +0100
Subject: [Biopython-dev] Python 3 porting
In-Reply-To:
References:
Message-ID:

2010/7/6 Peter:
> I could add explicit encode calls which would help SFF output under
> Python 3.x. This shouldn't change the functionality on Python 2.x, but
> I am a little concerned about it having a negative impact on the speed,
> but I have not measured this.

With my recent commits, SFF support now seems to work on Python 3.
This includes test_SeqIO_index.py although there are other issues here:
http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008004.html

I had to add some conditional code to handle bytes <-> unicode, which
may have a measurable slow down on Python 2.

Peter

From tiagoantao at gmail.com Thu Jul 15 16:41:37 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 15 Jul 2010 17:41:37 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/14 Peter:
> Could you make them single liners for now (least work to get the tests
> to pass), and I'll take a look at the commit later. If these are from my
> examples I'll then have a think about how better to handle them.

Actually no need for single liners.
In some cases it was this

>>> xxx \
yyy

To this

>>> xxx \
... yyy

Also

"""
>>> \"""
... \"""
"""

to

"""
>>> '''
... '''
"""

I will commit these changes in a few minutes.

--
"If you want to get laid, go to college. If you want an education, go to
the library." - Frank Zappa

From bioinformed at gmail.com Thu Jul 15 16:58:52 2010
From: bioinformed at gmail.com (Kevin Jacobs)
Date: Thu, 15 Jul 2010 12:58:52 -0400
Subject: [Biopython-dev] Python 3 porting
In-Reply-To:
References:
Message-ID:

2010/7/15 Peter

> With my recent commits, SFF support now seems to work on Python 3.
> This includes test_SeqIO_index.py although there are other issues here:
> http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008004.html
>
> I had to add some conditional code to handle bytes <-> unicode, which
> may have a measurable slow down on Python 2.

I'm in the midst of processing a great deal of SFF data, so I'll try to give
the new SFF code a try under Python 2.7.

-Kevin

From biopython at maubp.freeserve.co.uk Thu Jul 15 17:13:22 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jul 2010 18:13:22 +0100
Subject: [Biopython-dev] Python 3 porting
In-Reply-To:
References:
Message-ID:

On Thu, Jul 15, 2010 at 5:58 PM, Kevin Jacobs wrote:
> 2010/7/15 Peter
>
>> With my recent commits, SFF support now seems to work on Python 3.
>> This includes test_SeqIO_index.py although there are other issues here:
>> http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008004.html
>>
>> I had to add some conditional code to handle bytes <-> unicode, which
>> may have a measurable slow down on Python 2.
>
> I'm in the midst of processing a great deal of SFF data, so I'll try to give
> the new SFF code a try under Python 2.7.

Excellent - I take it you were already using the SFF support in Biopython?

Peter

From vsbuffalo at gmail.com Thu Jul 15 17:20:26 2010
From: vsbuffalo at gmail.com (Vince S.
Buffalo) Date: Thu, 15 Jul 2010 10:20:26 -0700 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: <201005141705.o4EH56ok028481@portal.open-bio.org> References: <201005141705.o4EH56ok028481@portal.open-bio.org> Message-ID: Sorry to bump this old topic, but are there plans to merge this into the main project? I do a lot of processing with the SAM format and it would be great to use Biopython for this. Does the pure Python implementation run as quickly as the pysam version? Is anyone still considering forking pysam and rewriting the C wrappers? Vince On Fri, May 14, 2010 at 10:05 AM, wrote: > http://bugzilla.open-bio.org/show_bug.cgi?id=2905 > > > > > > ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2010-05-14 13:05 EST ------- > The code on my branch has been updated, and now supports SAM and BAM > parsing > (currently it only extracts the read name, sequence and quality scores), > indexing by name with Bio.SeqIO.index(), and fast conversion to FASTA or > Sanger FASTQ with Bio.SeqIO.convert() which is handy for redoing a mapping: > > http://github.com/peterjc/biopython/tree/seqio-sam-bam > > Note that suffixes of "/1" or "/2" are added to forward or reverse read > names to make them unique. This matches the Illumina pipeline convention > and is handled by most tools which take paired end data. > > I'm actually using this code at the moment: I've started with BAM files of > paired end Illumina transcriptome reads mapped onto a draft assembly. I > then > used the convert code to convert these to FASTQ files, then split them into > a pair of FASTQ files (forward and reverse) and used BWA to remap them to a > different reference (giving new SAM files). > > > -- > Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are the assignee for the bug, or are watching the assignee. 
> _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." -Richard Dawkins From biopython at maubp.freeserve.co.uk Thu Jul 15 18:35:59 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Jul 2010 19:35:59 +0100 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: References: <201005141705.o4EH56ok028481@portal.open-bio.org> Message-ID: On Thu, Jul 15, 2010 at 6:20 PM, Vince S. Buffalo wrote: > Sorry to bump this old topic, but are there plans to merge this into the > main project? I do a lot of processing with the SAM format and it would be > great to use Biopython for this. > > Does the pure Python implementation run as quickly as the pysam > version? Is anyone still considering forking pysam and rewriting the > C wrappers? > > Vince EMBOSS now has limited SAM/BAM support, http://lists.open-bio.org/pipermail/emboss-dev/2010-July/000656.html BioLib is also now taking an interest in SAM/BAM support, I'd expect to see something on their mailing list soon: http://biolib.open-bio.org/wiki/Main_Page Can I ask what you want to do with SAM/BAM files? I did quite a bit of exploratory work for SAM/BAM in SeqIO, focussing on the raw reads (not the alignment side). This is very different from what you can do with PySam. It has allowed me to do SAM/BAM back to FASTQ which has been helpful in real work. 
There are branches on github, but still quite experimental and not necessarily going to be committed: http://github.com/peterjc/biopython/tree/seqio-sam-bam http://github.com/peterjc/biopython/tree/seqio-sam-bam-index Peter From bioinformed at gmail.com Thu Jul 15 18:43:29 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Thu, 15 Jul 2010 14:43:29 -0400 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: References: <201005141705.o4EH56ok028481@portal.open-bio.org> Message-ID: On Thu, Jul 15, 2010 at 1:20 PM, Vince S. Buffalo wrote: > Sorry to bump this old topic, but are there plans to merge this into the > main project? I do a lot of processing with the SAM format and it would be > great to use Biopython for this. > > Does the pure Python implementation run as quickly as the pysam version? Is > anyone still considering forking pysam and rewriting the C wrappers? > > I also started writing a pure Python SAM/BAM reader/writer with Cython accelerators, but quickly got distracted by the gaps in the "standard" and quirks in the various implementations. Instead, I've improved the base pysam implementation, fixed the parts that weren't working for me, and have posted a clone on the Google code site: http://code.google.com/r/bioinformed-pysam/ Of course, this doesn't help with how best to add functionality to BioPython... -Kevin From vsbuffalo at gmail.com Thu Jul 15 20:05:12 2010 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Thu, 15 Jul 2010 13:05:12 -0700 Subject: [Biopython-dev] [Bug 2905] Short read alignment format SAM / BAM In-Reply-To: References: <201005141705.o4EH56ok028481@portal.open-bio.org> Message-ID: Our group has used the SAM format in parsing CIGAR strings to find hybrid mapped reads for various projects. We primarily use the pileup format in looking for SNP candidates and in differential expression analysis with RNA-seq. 
cDNA reads are mapped back to a reference transcriptome, and then we parse the pileup format to form counts for transcripts, which then go to R for differential expression analysis. As we look towards pipelining some common tasks, it would be nice if pysam's functionality were in Biopython. Also, I wonder if other folks work with the pileup format as frequently as we do - if so, this may be a worthy candidate for a parser. I'll look into BioLib and EMBOSS, thanks Peter. Vince On Thu, Jul 15, 2010 at 11:35 AM, Peter wrote: > On Thu, Jul 15, 2010 at 6:20 PM, Vince S. Buffalo wrote: > > Sorry to bump this old topic, but are there plans to merge this into the > > main project? I do a lot of processing with the SAM format and it would > be > > great to use Biopython for this. > > > > Does the pure Python implementation run as quickly as the pysam > > version? Is anyone still considering forking pysam and rewriting the > > C wrappers? > > > > Vince > > EMBOSS now has limited SAM/BAM support, > http://lists.open-bio.org/pipermail/emboss-dev/2010-July/000656.html > > BioLib is also now taking an interest in SAM/BAM support, > I'd expect to see something on their mailing list soon: > http://biolib.open-bio.org/wiki/Main_Page > > Can I ask what you want to do with SAM/BAM files? > > I did quite a bit of exploratory work for SAM/BAM in SeqIO, > focussing on the raw reads (not the alignment side). This > is very different from what you can do with PySam. It has > allowed me to do SAM/BAM back to FASTQ which has been > helpful in real work. There are branches on github, but still > quite experimental and not necessarily going to be committed: > http://github.com/peterjc/biopython/tree/seqio-sam-bam > http://github.com/peterjc/biopython/tree/seqio-sam-bam-index > > Peter > -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." 
-Richard Dawkins

From biopython at maubp.freeserve.co.uk Fri Jul 16 13:50:57 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jul 2010 14:50:57 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

Hi Tiago,

You've been looking more carefully at 2to3 and doctests than I have,
perhaps you can answer this query for me: It seems to me it does
not automatically fix doctests.

I'm aware of the -d or --doctests_only option, but that means we have
to run 2to3 twice I think (once for the code, once for the doctests).

Is there an extra flag or something obvious I am missing here? I want
to call 2to3 once and have it fix the code including the doctests.

Peter

From tiagoantao at gmail.com Fri Jul 16 15:24:20 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Fri, 16 Jul 2010 16:24:20 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/16 Peter:
> You've been looking more carefully at 2to3 and doctests than I have,
> perhaps you can answer this query for me: It seems to me it does
> not automatically fix doctests.
>
> I'm aware of the -d or --doctests_only option, but that means we have
> to run 2to3 twice I think (once for the code, once for the doctests).
>
> Is there an extra flag or something obvious I am missing here? I want
> to call 2to3 once and have it fix the code including the doctests.

My assessment is exactly the same as yours.
I call the app 2 times: one for code, another for doctests.
The setup.py that I provided only does code precisely because of this.
I still did not have time to, programmatically, call both
transformations.
So: yes it sucks.

Talking about setup.py, its current incarnation is broken on Python 3.
Even if the objective is for it to print some information on calling
2to3 it will not work.
Just putting the prints with () should sort it
(and work everywhere)

Anyway, I think we can make setup.py much more helpful in the p3 case
by calling 2to3 (like numpy). The tests would also need to be
transformed, I think.

Regards,
Tiago

From biopython at maubp.freeserve.co.uk Fri Jul 16 15:31:51 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jul 2010 16:31:51 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/16 Tiago Antão:
> 2010/7/16 Peter:
>> You've been looking more carefully at 2to3 and doctests than I have,
>> perhaps you can answer this query for me: It seems to me it does
>> not automatically fix doctests.
>>
>> I'm aware of the -d or --doctests_only option, but that means we have
>> to run 2to3 twice I think (once for the code, once for the doctests).
>>
>> Is there an extra flag or something obvious I am missing here? I want
>> to call 2to3 once and have it fix the code including the doctests.
>
> My assessment is exactly the same as yours.
> I call the app 2 times: one for code, another for doctests.
> The setup.py that I provided only does code precisely because of this.
> I still did not have time to, programmatically, call both
> transformations.
> So: yes it sucks.

Maybe we should file an enhancement bug report?

> Talking about setup.py, its current incarnation is broken on Python 3.
> Even if the objective is for it to print some information on calling
> 2to3 it will not work. Just putting the prints with () should sort it
> (and work everywhere)

I like that plan except for a "bug" in 2to3, it will turn this example which
works for BOTH python 2 and python 3:

print("Hello world")

into this:

print(("Hello world"))

Using this syntax for simple prints is actually a tip here:
http://wiki.python.org/moin/PortingPythonToPy3k

> Anyway, I think we can make setup.py much more helpful in the p3 case
> by calling 2to3 (like numpy). The tests would also need to be
> transformed, I think.
I think it's a little premature for that - but once we have the full
conversion running smoothly it makes sense. For now transforming the
code in situ makes working in Python 3 to debug something much easier
I think.

Peter

From tiagoantao at gmail.com Fri Jul 16 15:44:35 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Fri, 16 Jul 2010 16:44:35 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/16 Peter:
> Maybe we should file an enhancement bug report?

Good idea, I will do that.

> I like that plan except for a "bug" in 2to3, it will turn this example which
> works for BOTH python 2 and python 3:
>
> print("Hello world")
>
> into this:
>
> print(("Hello world"))

Well, as I see it, setup.py will never need to be converted by 2to3.
It is possible to do a single file that works in all versions,
therefore that problem does not apply (unless people try to convert it
explicitly - I think we need to recommend against that). This seems to
be the case with numpy setup.py.

My view is this:
Current case:
person calls setup.py, always works. In the p3 case just prints the
warning and 2to3 recommendation.
Future (stable):
person calls setup.py, and it does everything necessary (calling 2to3
if needed)
Never:
Person calls 2to3
person calls setup.py

In fact it does not make much sense as it is now: the person has to
call 2to3 against setup.py in order to be informed to... call 2to3 ;)
See my point?

Tiago

From biopython at maubp.freeserve.co.uk Fri Jul 16 15:58:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jul 2010 16:58:03 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/16 Tiago Antão:
>
> 2010/7/16 Peter:
>
>> Maybe we should file an enhancement bug report?
>
> Good idea, I will do that.
>
>> I like that plan except for a "bug" in 2to3, it will turn this example which
>> works for BOTH python 2 and python 3:
>>
>> print("Hello world")
>>
>> into this:
>>
>> print(("Hello world"))
>
> Well, as I see it, setup.py will never need to be converted by 2to3.
> It is possible to do a single file that works in all versions,
> therefore that problem does not apply (unless people try to convert it
> explicitly - I think we need to recommend against that). This seems to
> be the case with numpy setup.py.
>
> My view is this:
> Current case:
> person calls setup.py, always works. In the p3 case just prints the
> warning and 2to3 recommendation.
> Future (stable):
> person calls setup.py, and it does everything necessary (calling 2to3
> if needed)
> Never:
> Person calls 2to3
> person calls setup.py
>
> In fact it does not make much sense as it is now: the person has to
> call 2to3 against setup.py in order to be informed to... call 2to3 ;)
> See my point?

I agree that we should tweak setup.py to run under both Python 2 (life
as normal) and Python 3 (tells you to manually run 2to3 on the source
code etc, but then continues as normal).

We'll need to tweak input vs raw_input (Python 3 vs Python 2).

Peter

From tiagoantao at gmail.com Fri Jul 16 16:06:50 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Fri, 16 Jul 2010 17:06:50 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/16 Peter:
> We'll need to tweak input vs raw_input (Python 3 vs Python 2).

Me thinks this is probably enough?
if sys.version_info[0] == 3:
    def raw_input():
        return input()

From biopython at maubp.freeserve.co.uk Fri Jul 16 16:12:04 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jul 2010 17:12:04 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/16 Tiago Antão:
> 2010/7/16 Peter:
>> We'll need to tweak input vs raw_input (Python 3 vs Python 2).
>
> Me thinks this is probably enough?
> if sys.version_info[0] == 3:
>     def raw_input():
>         return input()

Unless 2to3 does something horrible to that, yes. Do you want
to test this and check it in now? I've got some other things to
be getting on with so I'll take a break from updating the trunk
with small Python 3 changes ;)

Peter

From tiagoantao at gmail.com Fri Jul 16 16:16:50 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Fri, 16 Jul 2010 17:16:50 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

Ok, I will check this in, but I think a small note should be added
somewhere on how to use 2to3. I can do that if you tell me your
preferred place (README?).

2010/7/16 Peter:
> 2010/7/16 Tiago Antão:
>> 2010/7/16 Peter:
>>> We'll need to tweak input vs raw_input (Python 3 vs Python 2).
>>
>> Me thinks this is probably enough?
>> if sys.version_info[0] == 3:
>>     def raw_input():
>>         return input()
>
> Unless 2to3 does something horrible to that, yes. Do you want
> to test this and check it in now? I've got some other things to
> be getting on with so I'll take a break from updating the trunk
> with small Python 3 changes ;)
>
> Peter
>

--
"If you want to get laid, go to college. If you want an education, go to
the library." - Frank Zappa

From biopython at maubp.freeserve.co.uk Fri Jul 16 16:24:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jul 2010 17:24:03 +0100
Subject: [Biopython-dev] 2to3 and doctests
In-Reply-To:
References:
Message-ID:

2010/7/16 Tiago Antão:
> Ok, I will check this in, but I think a small note should be added
> somewhere on how to use 2to3. I can do that if you tell me your
> preferred place (README?).

Good point - yes, add something to the README file and in the
message from setup.py tell them to read that. Of course, this
is just an interim measure while we are still working on Python 3
porting.
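
A minimal sketch of the raw_input shim discussed in this thread, written so
that the same source runs unchanged on Python 2 and Python 3 without being
passed through 2to3. Unlike the two-line version quoted above, it keeps the
prompt argument, since an alias accepts whatever input() accepts. The helper
ask_yes_no and its reader parameter are hypothetical and only illustrate a
call site; they are not taken from Biopython's setup.py.

```python
import sys

# On Python 3 the built-in raw_input() no longer exists and input() has
# taken over its behaviour, so alias it under the old name. On Python 2
# the built-in raw_input() is left untouched.
if sys.version_info[0] >= 3:
    raw_input = input


def ask_yes_no(prompt, reader=None):
    """Return True if the user answers 'y' or 'yes' (hypothetical helper)."""
    if reader is None:
        reader = raw_input  # resolved at call time, on either Python version
    answer = reader(prompt).strip().lower()
    return answer in ("y", "yes")
```

Passing the reader in as a parameter is just a convenience so the helper can
be exercised without a terminal; a setup script would normally call it with
the default.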
Peter

From kellrott at gmail.com Fri Jul 16 19:25:44 2010
From: kellrott at gmail.com (Kyle)
Date: Fri, 16 Jul 2010 12:25:44 -0700
Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54
In-Reply-To: <320fb6e01003181234m71cc777bxaf5f29f2fbe1f21f@mail.gmail.com>
References: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com> <320fb6e01003181234m71cc777bxaf5f29f2fbe1f21f@mail.gmail.com>
Message-ID:

After much delay, I've made the change and posted it to the zxjdbc branch on
Github. Now users can call BioSeqDatabase.open_database(backend = 'MySQL' )
and it will work the same on Python and Jython.

Kyle

On Thu, Mar 18, 2010 at 12:34 PM, Peter wrote:
> On Thu, Mar 18, 2010 at 7:28 PM, Kyle wrote:
>> What should the parameter be called? Possibilities:
>> 'backend', 'dbtype', ... ideas anyone?
>
> Just database would be too vague. I quite like backend.
>
> Peter
>

From anaryin at gmail.com Fri Jul 16 21:20:54 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 16 Jul 2010 14:20:54 -0700
Subject: [Biopython-dev] GSOC Mid-Term Evaluation
Message-ID:

Hello all,

I've been quite silent lately and I feel I should apologize :) I'm leaving
the U.S. to go back to Europe and it's been quite hectic with packing and
finishing some last minute stuff - namely my Thesis - so GSOC has been put
a bit aside for the past week.

Still, I'll be working on unit tests and documentation for what I've done so
far. It's not a big list of things but they do require a bit of effort to be
well documented and, most of all, to ensure they are working properly.

Hope to be back to work fully on Monday!

Best to all of you and thanks for the evaluation :) I know only the mentors'
word was taken into account for them but if anyone has suggestions,
criticism, feel free to do so.
Again, the code is hosted here:
http://github.com/JoaoRodrigues/biopython/tree/GSOC2010

I added some examples and a short list of what I've done so far here:
http://www.biopython.org/wiki/GSOC2010_Joao#Project_Progress

João [...] Rodrigues
@ http://doeidoei.wordpress.org

From bugzilla-daemon at portal.open-bio.org Sat Jul 17 14:47:13 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 17 Jul 2010 10:47:13 -0400
Subject: [Biopython-dev] [Bug 2578] The GenBank SeqRecord parser does not record molecule type or if circular
In-Reply-To:
Message-ID: <201007171447.o6HElDUD030395@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2578

claude at 2xlibre.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |claude at 2xlibre.net

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Sat Jul 17 21:04:24 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 17 Jul 2010 17:04:24 -0400
Subject: [Biopython-dev] [Bug 3118] New: isinstance should use basestring for detecting string type
Message-ID:

http://bugzilla.open-bio.org/show_bug.cgi?id=3118

           Summary: isinstance should use basestring for detecting string type
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Other
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: claude at 2xlibre.net

I've been bitten by this issue today, when I gave a Unicode string to
annotation["date"] and the SeqIO writer for GenBank format tested it as
isinstance(..., str) which returned False (Bio/SeqIO/InsdcIO.py).

I saw that the code had a mix of isinstance( , str) and isinstance( ,
basestring).
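
A sketch of the kind of check this report describes, assuming a module-level
alias so that the same test works on Python 2 (where str and unicode share
the basestring base class) and on Python 3 (where basestring no longer
exists). The names string_types and is_text are hypothetical and are not
part of Biopython or of the attached patch.

```python
try:
    # Python 2: basestring is the common base class of str and unicode.
    string_types = (basestring,)
except NameError:
    # Python 3: basestring is gone and str is the only text type.
    string_types = (str,)


def is_text(value):
    """Return True for string values, False for lists, None, numbers, etc."""
    return isinstance(value, string_types)
```

With this in place a unicode date such as u"17-JUL-2010" passes the same
test as a plain str on Python 2, which is the case that tripped up the
GenBank writer here.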
I chased all remaining str type comparisons for cooking the following patch. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Jul 17 21:05:19 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 17 Jul 2010 17:05:19 -0400 Subject: [Biopython-dev] [Bug 3118] isinstance should use basestring for detecting string type In-Reply-To: Message-ID: <201007172105.o6HL5Jm3010666@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3118 ------- Comment #1 from claude at 2xlibre.net 2010-07-17 17:05 EST ------- Created an attachment (id=1523) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1523&action=view) Replace all remaining isinstance(, str) by isinstance(, basestring) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Sat Jul 17 22:30:20 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 17 Jul 2010 18:30:20 -0400 Subject: [Biopython-dev] [Bug 3118] isinstance should use basestring for detecting string type In-Reply-To: Message-ID: <201007172230.o6HMUKvV012752@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3118 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-17 18:30 EST ------- Hi Claude, Some of those should really be str (e.g. for the SeqFeature extract test in SeqFeatureExtractionWritingReading if the input is a str then the output should be too; also for BioSQL some of the adaptors do care about string vs unicode so that needs more checking), but in general you have a good point. 
In this particular case, yes - thank you:
http://github.com/biopython/biopython/commit/450b1a9024490feb2cdbbbc30f1dc429620d8c41

I think we need some more unit tests here (especially for BioSQL), which
will help with the current Python 3 testing via 2to3, where string vs
unicode is a big issue.

Leaving bug open...

Peter

--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From vsbuffalo at gmail.com Sun Jul 18 06:50:59 2010
From: vsbuffalo at gmail.com (Vince S. Buffalo)
Date: Sat, 17 Jul 2010 23:50:59 -0700
Subject: [Biopython-dev] Documentation
In-Reply-To:
References:
Message-ID:

I dug into how Numpy is processing their own ReST dialect, and the answer
lies in doc/HOWTO_BUILD_DOCS.txt. There is an extension that can be obtained
from PyPi, or included manually in a doc/sphinxext directory.

Are these extension requirements alright (before I continue changing the
format)? Some more information is below.

    Numpy's documentation uses several custom extensions to Sphinx. These
    are shipped in the ``sphinxext/`` directory, and are automatically
    enabled when building Numpy's documentation.

    However, if you want to make use of these extensions in third-party
    projects, they are available on PyPi_ as the numpydoc_ package, and
    can be installed with::

        easy_install numpydoc

    In addition, you will need to add::

        extensions = ['numpydoc']

On Thu, Jul 15, 2010 at 8:38 AM, Peter wrote:
> On Thu, Jul 15, 2010 at 4:22 PM, Vince S. Buffalo wrote:
> >> I think the docstrings should be the primary API documentation,
> >> and the Tutorial the primary introductory text.
> >
> > I like the idea of code and documentation living together, but one thing
> > that concerns me is that as the documentation grows larger and filled with
> > more examples, it may begin to clutter the code quite a bit.
> > We can cross that bridge if we come to it - right now I would say most > modules really need more docstrings. If you think that any of the > docstrings > you've looked at are too long, we can discuss shortening them (ideally by > relocating good content or tests). > > Peter > -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." -Richard Dawkins From biopython at maubp.freeserve.co.uk Sun Jul 18 10:52:58 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 18 Jul 2010 11:52:58 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Sun, Jul 18, 2010 at 7:50 AM, Vince S. Buffalo wrote: > I dug into how Numpy is processing their own ReST dialect, and the > answer lies in doc/HOWTO_BUILD_DOCS.txt. There is an extension > that can be obtained from PyPi, or included manually in a doc/sphinxext > directory. > > Are these extension requirements alright (before I continue changing the > format)? Some more information is below. > If they are useful, then I'm OK with that. We can probably even take a copy and add it to the Biopython source code since it is under the BSD licence: http://pypi.python.org/pypi/numpydoc/ We may be fine with just restricted reStructuredText - see how you get on with that first? Peter From biopython at maubp.freeserve.co.uk Sun Jul 18 12:28:14 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 18 Jul 2010 13:28:14 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Sun, Jul 18, 2010 at 11:52 AM, Peter wrote: > On Sun, Jul 18, 2010 at 7:50 AM, Vince S. Buffalo wrote: >> I dug into how Numpy is processing their own ReST dialect, and the >> answer lies in doc/HOWTO_BUILD_DOCS.txt. There is an extension >> that can be obtained from PyPi, or included manually in a doc/sphinxext >> directory. 
>> >> Are these extension requirements alright (before I continue changing the >> format)? Some more information is below. > > If they are useful, then I'm OK with that. We can probably even take > a copy and add it to the Biopython source code since it is under the > BSD licence: http://pypi.python.org/pypi/numpydoc/ > > We may be fine with just restricted reStructuredText - see how you > get on with that first? Plus of course in the short term we'll still be using epydoc anyway. Peter From vsbuffalo at gmail.com Sun Jul 18 19:10:40 2010 From: vsbuffalo at gmail.com (Vince S. Buffalo) Date: Sun, 18 Jul 2010 12:10:40 -0700 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: Sounds good. I'll add a copy to doc/sphinxext as in Numpy. By restricted, you mean without the :class:`ClassName` type annotation? Vince On Sun, Jul 18, 2010 at 5:28 AM, Peter wrote: > On Sun, Jul 18, 2010 at 11:52 AM, Peter wrote: > > On Sun, Jul 18, 2010 at 7:50 AM, Vince S. Buffalo wrote: > >> I dug into how Numpy is processing their own ReST dialect, and the > >> answer lies in doc/HOWTO_BUILD_DOCS.txt. There is an extension > >> that can be obtained from PyPi, or included manually in a doc/sphinxext > >> directory. > >> > >> Are these extension requirements alright (before I continue changing the > >> format)? Some more information is below. > > > > If they are useful, then I'm OK with that. We can probably even take > > a copy and add it to the Biopython source code since it is under the > > BSD licence: http://pypi.python.org/pypi/numpydoc/ > > > > We may be fine with just restricted reStructuredText - see how you > > get on with that first? > > Plus of course in the short term we'll still be using epydoc anyway. > > Peter > -- Vince Buffalo Programmer Bioinformatics Core UC Davis Genome Center University of California, Davis "There's real poetry in the real world. Science is the poetry of reality." 
-Richard Dawkins From bugzilla-daemon at portal.open-bio.org Sun Jul 18 19:23:48 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 18 Jul 2010 15:23:48 -0400 Subject: [Biopython-dev] [Bug 3118] isinstance should use basestring for detecting string type In-Reply-To: Message-ID: <201007181923.o6IJNmbf007400@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3118 ------- Comment #3 from claude at 2xlibre.net 2010-07-18 15:23 EST ------- Thanks Peter for the fix you committed. It resolves my issue. I understand my search/replace strategy was a bit rude :-) -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Mon Jul 19 08:34:43 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 19 Jul 2010 09:34:43 +0100 Subject: [Biopython-dev] Documentation In-Reply-To: References: Message-ID: On Sun, Jul 18, 2010 at 8:10 PM, Vince S. Buffalo wrote: > Sounds good. I'll add a copy to doc/sphinxext as in Numpy. > > By restricted, you mean without the :class:`ClassName` type annotation? > > Vince Yes - to me that looks horrible as plain text. I was hoping NumPy had a clear definition of their restricted subset of reStructuredText we could follow... maybe I haven't looked hard enough. Have you been able to run epydoc with reStructuredText yet? 
Peter From bugzilla-daemon at portal.open-bio.org Mon Jul 19 14:40:58 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 19 Jul 2010 10:40:58 -0400 Subject: [Biopython-dev] [Bug 3119] New: Bio.Nexus can't parse file from Prank 100701 (1st July 2010) Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=3119 Summary: Bio.Nexus can't parse file from Prank 100701 (1st July 2010) Product: Biopython Version: 1.54 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk I've been updating test_Prank_tool.py to cope with the latest version of Prank, 1 July 2010 from http://www.ebi.ac.uk/goldman-srv/prank/src/prank/ Some changes are simple, such as removing tests using feature of Prank which have been removed. One test is failing due to some big changes in the NEXUS output from Prank, and this may be due to a problem with our parser: >>> from Bio.Nexus import Nexus >>> n = Nexus.Nexus("output_prank_v100701.nex") Traceback (most recent call last): ... Bio.Nexus.Nexus.NexusError: Taxon AK1H_ECOLI: Illegal character / in line: (check dimensions / interleaving) I will attach the file, it is created by the unit test as output.2.nex but is usually deleted. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jul 19 14:42:31 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 19 Jul 2010 10:42:31 -0400 Subject: [Biopython-dev] [Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010) In-Reply-To: Message-ID: <201007191442.o6JEgVgj020619@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3119 ------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-19 10:42 EST ------- Created an attachment (id=1524) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1524&action=view) Sample NEXUS output from prank v100701 This file is from Prank v100701 (1 July 2010), compiled and run on Linux from: http://www.ebi.ac.uk/goldman-srv/prank/src/prank/prank.src.100701.tgz -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jul 19 14:45:22 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 19 Jul 2010 10:45:22 -0400 Subject: [Biopython-dev] [Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010) In-Reply-To: Message-ID: <201007191445.o6JEjMRN020749@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3119 ------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-19 10:45 EST ------- Created an attachment (id=1525) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1525&action=view) Sample NEXUS output from prank v081202 Equivalent output from Prank v.081202 (2 Dec 2008), compiled and run on Mac OS X. Bio.Nexus can parse this. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jul 19 14:49:36 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 19 Jul 2010 10:49:36 -0400 Subject: [Biopython-dev] [Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010) In-Reply-To: Message-ID: <201007191449.o6JEna91020945@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3119 ------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-19 10:49 EST ------- I have for the moment added a hack to avoid the test failure, http://github.com/biopython/biopython/commit/ca6a5958415d4d026b2b799a35fd3a6371491024 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From andrea at biocomp.unibo.it Tue Jul 20 14:51:50 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Tue, 20 Jul 2010 16:51:50 +0200 (CEST) Subject: [Biopython-dev] DAS client in biopython Message-ID: Hi all, I've been working a little to develop a DAS client in python, and I thought it could be a nice addition to biopython. So I set up a branch on github that can be found here: http://github.com/apierleoni/biopython/tree/das-client The DAS module is under Bio and can be imported using >>> from Bio.DAS.DASpy import DASpy some code examples are included in the DASpy.py file. cool things you can do with DASpy:
- fetch all the available DAS servers listed at dasregistry
- connect to each of them and use 'das1:sequence' and 'das1:feature' methods to retrieve sequences, features and annotations from DAS servers.
- build a SeqRecord starting from multiple DAS servers (one for the sequence and the others for features and annotations)
E.g. you can build a SeqRecord object that will list all available DAS annotations given a uniprot ID. I'm currently the only user of the code, so I'll appreciate any comment about it.
Hope this turns useful to someone else. Andrea From andrea at biocomp.unibo.it Tue Jul 20 15:06:22 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Tue, 20 Jul 2010 17:06:22 +0200 (CEST) Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: Message-ID: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> > I'm not sure we can easily include GPL code in Biopython... it would > complicate things. Kyle has also been working on using the JVM DB > API for BioSQL under Jython - I'd rather we ended up with a runtime > choice of drivers (database specific like mysqldb, and others like the > abstractions SQLAlchemy or the web2py DAL) which would all be > external to Biopython. > > Peter > I've checked online and, actually, web2py code comes under: "GPL2 License with an exception for easier commercialization of applications." and they states: "Applications built with web2py can be released under any license the author wishes as long they do not contain web2py code. In particular they can be bytecode compiled and distributed in closed source. The admin interface provides a button to byte-code compile. It is fine to distribute web2py (source or compiled) with your applications as long as you make it clear in the license where your application ends and web2py starts." I don't think this will cause any problem given that the web2py code is acknowledged. Anyhow, are there any plan in extending the BioSQL interface? We could make some methods useful to people not skilled with SQL, that can boost their experience with BioSQL. something like selecting all the bioentries carrying a given feature type or a qualifier value or even a dbxref. Allowing people to use the BioSQL schema without exactly knowing the schema and have to write complex queries could be a big addition. 
Andrea From biopython at maubp.freeserve.co.uk Tue Jul 20 15:19:56 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Jul 2010 16:19:56 +0100 Subject: [Biopython-dev] DAS client in biopython In-Reply-To: References: Message-ID: On Tue, Jul 20, 2010 at 3:51 PM, Andrea Pierleoni wrote: > Hi all, > I've been working a little do develop a DAS client in python, and I > thought it could be a nice addition to biopython. Hi Andrea, This does look interesting - I've never needed to work with DAS but maybe one day... > So I build up a branch on github that can be found here: > > http://github.com/apierleoni/biopython/tree/das-client It looks like you have lots of other code on that branch too, like BioSQL2py (your BioSQL via web2py DAL) - this isn't a problem for now but would complicate merging later. > The DAS module is under Bio and can be imported using > >>>> from Bio.DAS.DASpy import DASpy The heirachy seems unnecessarily nested, why not move the code in Bio/DAS/DASpy.py into Bio/DAS/__init__.py? Or even into Bio/DAS.py instead? Then that import becomes: from Bio.DAS import DASpy, which also avoids the ambiguity of DASpy for a module and a class. Are you expecting to have other files under Bio/DAS? Also the name DASpy confuses me, maybe the class should be something about DAS Servers? Would it be right to regard the class DASSeq as a subclass of SeqRecord? It looks like a minimally annotated sequence. See also the DBSeqRecord in BioSQL. Regards, Peter From biopython at maubp.freeserve.co.uk Tue Jul 20 15:23:16 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Jul 2010 16:23:16 +0100 Subject: [Biopython-dev] Active projects list? 
(BOSC Biopython Project Update) In-Reply-To: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> Message-ID: On Tue, Jul 20, 2010 at 4:06 PM, Andrea Pierleoni wrote: > > >> I'm not sure we can easily include GPL code in Biopython... it would >> complicate things. Kyle has also been working on using the JVM DB >> API for BioSQL under Jython - I'd rather we ended up with a runtime >> choice of drivers (database specific like mysqldb, and others like the >> abstractions SQLAlchemy or the web2py DAL) which would all be >> external to Biopython. >> >> Peter >> > > I've checked online and, actually, web2py code comes under: > > "GPL2 License with an exception for easier commercialization of > applications." > > and they states: > > "Applications built with web2py can be released under any license the > author wishes as long they do not contain web2py code. In particular they > can be bytecode compiled and distributed in closed source. The admin > interface provides a button to byte-code compile. > It is fine to distribute web2py (source or compiled) with your > applications as long as you make it clear in the license where your > application ends and web2py starts." > > I don't think this will cause any problem given that the web2py code is > acknowledged. I wouldn't want to ship web2py with Biopython - we'd just list it as another optional package you might want to install for use with BioSQL (as we do with MySQLdb etc). > Anyhow, are there any plan in extending the BioSQL interface? > We could make some methods useful to people not skilled with SQL, that can > boost their experience with BioSQL. something like selecting all the bioentries > carrying a given feature type or a qualifier value or even a dbxref. > Allowing people to use the BioSQL schema without exactly knowing the > schema and have to write complex queries could be a big addition.
There are already several query methods, but more wouldn't be a bad idea. I was thinking we could implement dictionary like access, and support for iterator over all the records. Peter From andrea at biocomp.unibo.it Tue Jul 20 16:00:34 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Tue, 20 Jul 2010 18:00:34 +0200 (CEST) Subject: [Biopython-dev] DAS client in biopython In-Reply-To: References: Message-ID: > On Tue, Jul 20, 2010 at 3:51 PM, Andrea Pierleoni wrote: >> Hi all, >> I've been working a little do develop a DAS client in python, and I >> thought it could be a nice addition to biopython. > > Hi Andrea, > > This does look interesting - I've never needed to work with > DAS but maybe one day... > >> So I build up a branch on github that can be found here: >> >> http://github.com/apierleoni/biopython/tree/das-client > > It looks like you have lots of other code on that branch too, > like BioSQL2py (your BioSQL via web2py DAL) - this isn't > a problem for now but would complicate merging later. > BioSQL2py is just an empty directory on that branch. It will be filled in on another, dedicated branch (actually it shouldn't be there :) ) >> The DAS module is under Bio and can be imported using >> >>>>> from Bio.DAS.DASpy import DASpy > > The heirachy seems unnecessarily nested, why not move the > code in Bio/DAS/DASpy.py into Bio/DAS/__init__.py? Or > even into Bio/DAS.py instead? Then that import becomes: > from Bio.DAS import DASpy, which also avoids the ambiguity > of DASpy for a module and a class. Are you expecting to have > other files under Bio/DAS? > I'm not planning on having other files, but since this was a proposal, I built the Bio/DAS structure to host any additional clients, if there are any. However, if it will be the only way to parse DAS files, we can simplify to a single Bio/DAS.py file, which seems much better to me. > Also the name DASpy confuses me, maybe the class > should be something about DAS Servers? 
> DASpy is what I usually call this client, and that is the main class, but it can be renamed to something more meaningful > Would it be right to regard the class DASSeq as a subclass > of SeqRecord? It looks like a minimally annotated sequence. > See also the DBSeqRecord in BioSQL. > well, I think a DASSeq can fit comfortably in a SeqRecord. this would also simplify the build of a SeqRecord object in DASpy.fetch_to_seqrec. Thanks for the advice, Peter Andrea From biopython at maubp.freeserve.co.uk Tue Jul 20 16:03:44 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Jul 2010 17:03:44 +0100 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> Message-ID: On Wed, Jun 2, 2010 at 12:36 PM, Peter wrote: > On Tue, Jun 1, 2010 at 4:15 PM, Peter wrote: >> On Tue, Jun 1, 2010 at 2:23 PM, Brad Chapman wrote: >>> I'd suggest having an option to not capture stdout and stderr, which >>> would help users avoid those cases where a program spews a lot to >>> stdout and it's unwieldy to capture and stick it into a string. >> >> We need to avoid any risk of deadlocks, so I guess the safe >> implementation here would be call subprocess with stdout and >> stderr sent to dev null. > > How does this look? Tested on Mac and Windows: > http://github.com/peterjc/biopython/tree/app-exec2 > > Example usage without capturing the output: >
>    from Bio.Emboss.Applications import WaterCommandline
>    water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
>                                 asequence="a.fasta", bsequence="b.fasta")
>    print "About to run:\n%s" % water_cmd
>    return_code = water_cmd()
>    print "Return code: %i" % return_code
>
> Example usage with stdout and stderr capture: >
>    from Bio.Emboss.Applications import WaterCommandline
>    water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True,
>                                 asequence="a.fasta", bsequence="b.fasta")
>    print "About to run:\n%s" % water_cmd
>    stdout, stderr, return_code = water_cmd(capture=True)
>    print "Return code: %i" % return_code
>    print "Tool output:\n%s" % stdout
>
> Note in this implementation it either returns an integer error level > (the default) or a tuple of stdout, stderr and the error level return > code. If we opt for adding methods rather than using __call__ > these could be different methods instead. > > Another potentially useful option would be to copy the > subprocess.check_call() function in Python 2.5+ which verifies > the return code (error level) is zero and raises an exception if not > (probably only sensible if not capturing the output?). Maybe this > could even be the default behaviour? > > [I would prefer to keep the interface as simple as possible though, > less options is better! KISS principle.] > > Peter Interestingly in Python 2.7 subprocess gained a new function called check_output which returns a string (stdout, optionally combined with stderr as a single string). If there is a non-zero return code you get a CalledProcessError exception (with return code and output): http://docs.python.org/library/subprocess.html In some ways there are too many choices - how unpythonic ;) Having thought about this for a while, I realised that in almost every case I have never cared about the exact return code, just if it is zero (success) or not (failure). Therefore the behaviour of the subprocess functions check_call (Python 2.5+) and check_output (Python 2.7+) seems desirable (you get an exception if the return code is non zero). That just leaves what to return: stdout and/or stderr. I personally have never needed to merge stderr and stdout into a single pipe or string - the only use case for this I can think of is to capture the output into a file for logging purposes. Generally it makes more sense to keep them separate. This leaves the question should we return just stdout, or both?
Sometimes stderr is useful, so I think both. So, in yet-another-branch, I wrote a __call__ implementation which raises an exception on non-zero return codes, but otherwise returns stdout and stderr as a tuple of two strings: http://github.com/peterjc/biopython/commits/app-exec3 I'm pretty confident this will suffice for most use cases, and propose we implement this in Biopython 1.55. Thoughts? Peter From andrea at biocomp.unibo.it Tue Jul 20 16:07:29 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Tue, 20 Jul 2010 18:07:29 +0200 (CEST) Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> Message-ID: > > I wouldn't want to ship web2py with Biopython - we'd just list it as > another optional package you might want to install for use with BioSQL > (as we do with MySQLdb etc). > that sounds reasonable. > > There are already several query methods, but more wouldn't be a bad > idea. I was thinking we could implement dictionary like access, and > support for iterator over all the records. > dictionary and iterators would be very pythonic, and useful. are you working on it? correct me if I'm wrong, but the standard policy in Biopython BioSQL to update a bioentry record is to delete the old one and create a new one (ore make a new version). wouldn't be useful to enable in biopython some minor modifications to a bioentry like adding/removing features and qualifiers? maybe I can help with this. Andrea From biopython at maubp.freeserve.co.uk Tue Jul 20 16:18:16 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Jul 2010 17:18:16 +0100 Subject: [Biopython-dev] Active projects list? 
(BOSC Biopython Project Update) In-Reply-To: References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> Message-ID: On Tue, Jul 20, 2010 at 5:07 PM, Andrea Pierleoni wrote: > >> >> I wouldn't want to ship web2py with Biopython - we'd just list it as >> another optional package you might want to install for use with BioSQL >> (as we do with MySQLdb etc). >> > > that sounds reasonable. > >> >> There are already several query methods, but more wouldn't be a bad >> idea. I was thinking we could implement dictionary like access, and >> support for iterator over all the records. >> > > dictionary and iterators would be very pythonic, and useful. are you > working on it? Not right now, no - if you want to try soon please go ahead. > correct me if I'm wrong, but the standard policy in Biopython BioSQL to > update a bioentry record is to delete the old one and create a new one > (or make a new version). wouldn't be useful to enable in biopython some > minor modifications to a bioentry like adding/removing features and > qualifiers? maybe I can help with this. The current functionality is limited to loading and retrieving records (and retrieving is done in a lazy or on demand way which saves memory and DB access). As a consequence, if you want to edit a record in the database you have to either do it directly (bypass our BioSQL code) or load a new record. The BioSQL schema doesn't have any sort of audit trail (unlike CHADO if I remember correctly), so for many uses this almost read only setup is actually a plus point. Here we use BioSQL essentially as a container for NCBI GenBank / RefSeq dumps - although we do add additional annotations on top. I can see advantages in allowing the DBSeqRecord to write back to the database - it would need a lot of refactoring though (e.g. most of the loader code would get moved).
I would start by creating a read only proxy for the DBSeqFeature (something I think Leighton Pritchard did in some of his code) because editing feature annotations would be an important part of this. Peter From biopython at maubp.freeserve.co.uk Wed Jul 21 10:47:14 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Jul 2010 11:47:14 +0100 Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54 In-Reply-To: References: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com> <320fb6e01003181234m71cc777bxaf5f29f2fbe1f21f@mail.gmail.com> Message-ID: On Fri, Jul 16, 2010 at 8:25 PM, Kyle wrote: > After much delay, I've made the change and posted it to the zxjdbc > branch on Github. Now users can call > BioSeqDatabase.open_database(backend = 'MySQL' ) and it will work the > same on Python and Jython. Nice. I'll have to look at your code, but we can have it try a series of supported adaptors (e.g. there are several for PostgreSQL), which will make things a little easier on the user even on C Python. Peter From bioinformed at gmail.com Wed Jul 21 11:31:39 2010 From: bioinformed at gmail.com (Kevin Jacobs ) Date: Wed, 21 Jul 2010 07:31:39 -0400 Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54 In-Reply-To: References: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com> Message-ID: On Thu, Mar 18, 2010 at 3:28 PM, Kyle wrote: > What should the parameter by called? Possibilities: 'backend', 'dbtype', > ... > ideas anyone? > > I suggest 'driver', since it is explicit and precise about what is being chosen. This allows users to select among several drivers, even alternatives for the same database backend. It also allows the creation of default aliases for meta-drivers like 'mysql' or 'postgresql', which could search among a list of compatible drivers and the most suitable one that is found to be installed. 
-Kevin From biopython at maubp.freeserve.co.uk Wed Jul 21 11:55:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Jul 2010 12:55:10 +0100 Subject: [Biopython-dev] BioSQL drivers, was: Planning for Biopython 1.54 In-Reply-To: References: <320fb6e01003180419x7e376966o2ad655b639438503@mail.gmail.com> Message-ID: On Wed, Jul 21, 2010 at 12:31 PM, Kevin Jacobs wrote: > On Thu, Mar 18, 2010 at 3:28 PM, Kyle wrote: > >> What should the parameter by called? Possibilities: 'backend', 'dbtype', >> ... >> ideas anyone? >> >> > I suggest 'driver', since it is explicit and precise about what is being > chosen. This allows users to select among several drivers, even > alternatives for the same database backend. It also allows the creation of > default aliases for meta-drivers like 'mysql' or 'postgresql', which could > search among a list of compatible drivers and the most suitable one that is > found to be installed. We already have a parameter called driver (e.g. set to MySQLdb, psycopg2, psycopg, pgdb, sqlite3) which would then have to take on a double meaning (the python driver versus the underlying back end database, MySQL, PostgreSQL, SQLite). Peter From biopython at maubp.freeserve.co.uk Wed Jul 21 11:58:21 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Jul 2010 12:58:21 +0100 Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> Message-ID: On Tue, Jul 20, 2010 at 5:18 PM, Peter wrote: > On Tue, Jul 20, 2010 at 5:07 PM, Andrea Pierleoni wrote: >>> >>> There are already several query methods, but more wouldn't be a bad >>> idea. I was thinking we could implement dictionary like access, and >>> support for iterator over all the records. >>> >> >> dictionary and iterators would be very pythonic, and useful. are you >> working on it? > > Not right now, no - if you want to try soon please go ahead.
> Well, I went and did the basics to be consistent with the existing limited dict like support in BioSeqDatabase. Would you mind testing it? This can be improved by iterating over the cursor rather than building a list of identifiers in memory. Likewise __len__ and __contains__ can be turned into SQL statements to be more efficient. Do you fancy trying that? Peter From andrea at biocomp.unibo.it Wed Jul 21 15:43:30 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Wed, 21 Jul 2010 17:43:30 +0200 (CEST) Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> Message-ID: > > Well, I went and did the basics to be consistent with the existing limited > dict like support in BioSeqDatabase. Would you mind testing it? > > This can be improved by iterating over the cursor rather than building a > list of identifiers in memory. Likewise __len__ and __contains__ can be > turned into SQL statements to be more efficient. Do you fancy trying that? > > Peter > I've tested the new BioSeqDatabase in postgres BioSQL db containing 50000 bioentry, and it works very fast (I'm using python 2.6) even in this way. howver using SQL will be much better of course. I will take a try, as soon as I fix the DAS client and UniprotIO. Andrea From andrea at biocomp.unibo.it Wed Jul 21 15:48:40 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Wed, 21 Jul 2010 17:48:40 +0200 (CEST) Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> Message-ID: <806685d7b693bb7c85a03846c9160b56.squirrel@lipid.biocomp.unibo.it> > The current functionality is limited to loading and retrieving records > (and retreiving is done in a lazy or on demand way which saves > memory and DB access). 
As a consequence, if you want to edit a > record in the database you have to either do it directly (bypass our > BioSQL code) or load a new record. > > The BioSQL schema doesn't have any sort of audit trail (unlike CHADO > if I remember correctly), so for many uses this almost read only setup > is actually a plus point. Here we use BioSQL essentially as a container > for NCBI GenBank / RefSeq dumps - although we do add additional > annotations on top. > > I can see advantages in allowing the DBSeqRecord to write back > to the database - it would need a lots of refactoring through (e.g. most > of the loader code would get moved). I would start by creating a read > only proxy for the DBSeqFeature (something I think Leighton Pritchard > did in some of his code) because editing feature annotations would > be an important part of this. > maybe I'll succeed in the next month in writing some methods to modify bioentry directly in the SQL db. we can talk about this later, as soon as we have some code to work on. however audit trail will not be possible in the current BioSQL schema, unless using separate tables (as I'm actually doing). but I don't think this will be easily integrable in biopython. Is there anyone needing user logs? From biopython at maubp.freeserve.co.uk Wed Jul 21 15:59:15 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Jul 2010 16:59:15 +0100 Subject: [Biopython-dev] Active projects list? (BOSC Biopython Project Update) In-Reply-To: <806685d7b693bb7c85a03846c9160b56.squirrel@lipid.biocomp.unibo.it> References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> <806685d7b693bb7c85a03846c9160b56.squirrel@lipid.biocomp.unibo.it> Message-ID: On Wed, Jul 21, 2010 at 4:48 PM, Andrea Pierleoni wrote: >> The current functionality is limited to loading and retrieving records >> (and retreiving is done in a lazy or on demand way which saves >> memory and DB access). 
As a consequence, if you want to edit a >> record in the database you have to either do it directly (bypass our >> BioSQL code) or load a new record. >> >> The BioSQL schema doesn't have any sort of audit trail (unlike CHADO >> if I remember correctly), so for many uses this almost read only setup >> is actually a plus point. Here we use BioSQL essentially as a container >> for NCBI GenBank / RefSeq dumps - although we do add additional >> annotations on top. >> >> I can see advantages in allowing the DBSeqRecord to write back >> to the database - it would need a lots of refactoring through (e.g. most >> of the loader code would get moved). I would start by creating a read >> only proxy for the DBSeqFeature (something I think Leighton Pritchard >> did in some of his code) because editing feature annotations would >> be an important part of this. >> > > maybe I'll succeed in the next month in writing some methods to modify > bioentry directly in the SQL db. we can talk about this later, as soon as > we have some code to work on. Sure - there is no hurry. > however audit trail will not be possible in the current BioSQL schema, > unless using separate tables (as I'm actually doing). but I don't think > this will be easily integrable in biopython. Is there anyone needing > user logs? I agree, and didn't mean to suggest adding audit tables to the BioSQL schema. I was just pointing out this issue (depending on the intended usage, this may or may not be a problem). Peter From biopython at maubp.freeserve.co.uk Wed Jul 21 16:40:49 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Jul 2010 17:40:49 +0100 Subject: [Biopython-dev] Active projects list? 
(BOSC Biopython Project Update) In-Reply-To: References: <40eb2c096de34b449b67e05a062dbd06.squirrel@lipid.biocomp.unibo.it> Message-ID: On Wed, Jul 21, 2010 at 4:43 PM, Andrea Pierleoni wrote: >> >> Well, I went and did the basics to be consistent with the existing limited >> dict like support in BioSeqDatabase. Would you mind testing it? >> >> This can be improved by iterating over the cursor rather than building a >> list of identifiers in memory. Likewise __len__ and __contains__ can be >> turned into SQL statements to be more efficient. Do you fancy trying that? >> >> Peter >> > > I've tested the new BioSeqDatabase in postgres BioSQL db containing 50000 > bioentry, and it works very fast (I'm using python 2.6) even in this way. Good :) > howver using SQL will be much better of course. I will take a try, as soon > as I fix the DAS client and UniprotIO. I had time this afternoon to do __len__ and __contains__ with SQL, and add a couple of tests here too. Memory efficient Iteration can wait for another day - I'm going home now. We should probably have started a new thread for this BioSQL discussion. Peter From andrea at biocomp.unibo.it Thu Jul 22 14:13:21 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Thu, 22 Jul 2010 16:13:21 +0200 (CEST) Subject: [Biopython-dev] DAS client in biopython In-Reply-To: References: Message-ID: > The heirachy seems unnecessarily nested, why not move the > code in Bio/DAS/DASpy.py into Bio/DAS/__init__.py? Or > even into Bio/DAS.py instead? Then that import becomes: > from Bio.DAS import DASpy, which also avoids the ambiguity > of DASpy for a module and a class. Are you expecting to have > other files under Bio/DAS? > hierarchy is now simplified to a single file DAS.py under Bio. > Also the name DASpy confuses me, maybe the class > should be something about DAS Servers? > I renamed the DASpy class to DASregistry so the main call now is: from Bio.DAS import DASregistry das = DASregistry() simplier... 
> Would it be right to regard the class DASSeq as a subclass > of SeqRecord? It looks like a minimally annotated sequence. > See also the DBSeqRecord in BioSQL. > I've been thinking about it and, actually, the DASSeq class corresponds exactly to information and methods available in the DAS sequence method, so I'd leave it this way. Most of the time this class shouldn't be accessed, and a clean SeqRecord object can be obtained using the "fetch_to_seqrec" method in DASregistry. Andrea From biopython at maubp.freeserve.co.uk Thu Jul 22 14:33:57 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 22 Jul 2010 15:33:57 +0100 Subject: [Biopython-dev] DAS client in biopython In-Reply-To: References: Message-ID: On Thu, Jul 22, 2010 at 3:13 PM, Andrea Pierleoni wrote: > >> The hierarchy seems unnecessarily nested, why not move the >> code in Bio/DAS/DASpy.py into Bio/DAS/__init__.py? Or >> even into Bio/DAS.py instead? Then that import becomes: >> from Bio.DAS import DASpy, which also avoids the ambiguity >> of DASpy for a module and a class. Are you expecting to have >> other files under Bio/DAS? >> > > hierarchy is now simplified to a single file DAS.py under Bio. > >> Also the name DASpy confuses me, maybe the class >> should be something about DAS Servers? >> > > I renamed the DASpy class to DASregistry so the main call now > is: > > from Bio.DAS import DASregistry > > das = DASregistry() > > simpler... > That appears to make sense :) >> Would it be right to regard the class DASSeq as a subclass >> of SeqRecord? It looks like a minimally annotated sequence. >> See also the DBSeqRecord in BioSQL. >> > > I've been thinking about it and, actually, the DASSeq class > corresponds exactly to information and methods available in > the DAS sequence method, so I'd leave it this way. So what DAS calls a sequence is closer to Biopython's SeqRecord than Biopython's Seq object? Hmm - that could cause confusion, whatever you call your class.
> Most of the time this class shouldn't be accessed, and a clean > SeqRecord object can be obtained using the "fetch_to_seqrec" > method in DASregistry. I'll take another look at your code later. Peter From andrea at biocomp.unibo.it Thu Jul 22 14:51:45 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Thu, 22 Jul 2010 16:51:45 +0200 (CEST) Subject: [Biopython-dev] DAS client in biopython In-Reply-To: References: Message-ID: > > So what DAS calls a sequence is closer to Biopython's SeqRecord > than Biopython's Seq object? Hmm - that could cause confusion, > whatever you call your class. > Here is an example of a DASSEQUENCE: MLAKATLAIVLSAASLPVLAAQCEATIESNDAMQYNLKEMVVDKSCKQFTVHLKHVGKMAKVAMGHNWVLTKEADKQGVATDGMNAGLAQDYVKAGDTRVIAHTKVIGGGESDSVTFDVSKLTPGEAYAYFCSFPGHWAMMKGTLKLSN It is basically a Seq object with some associated metadata that I'm keeping. The moltype is used to set the Alphabet. It has an ID so it could also fit a SeqRecord, but the DASseq class should not be used outside of DAS.py. Then you can link to this sequence the features and annotations that are parsed from the DASGFF XML response. The big confusion here is that both SeqRecord annotations and features come with DASGFF. Annotations have start and end positions equal to 0. >> Most of the time this class shouldn't be accessed, and a clean >> SeqRecord object can be obtained using the "fetch_to_seqrec" >> method in DASregistry. > > I'll take another look at your code later.
> any comment is welcome, thanks Andrea From chapmanb at 50mail.com Fri Jul 23 11:48:06 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 23 Jul 2010 07:48:06 -0400 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> Message-ID: <20100723114806.GA1868@sobchak.mgh.harvard.edu> Peter; [Simplified interface for calling commandline programs] > Having thought about this for a while, I realised that in almost every > case I have never cared about the exact return code, just if it is zero > (success) or not (failure). Therefore the behaviour of the subprocess > functions check_call (Python 2.5+) and check_output (Python 2.7+) > seems desirable (you get an exception if the return code is non zero). This makes good sense. > That just leaves what to return: stdout and/or stderr. I personally > have never needed to merge stderr and stdout into a single pipe > or string - the only use case for this I can think of is to capture the > output into a file for logging purposes. Generally it makes more sense > to keep them separate. This leaves the question should we return > just stdout, or both? Sometimes stderr is useful, so I think both. Both is also my preference. > So, in yet-another-branch, I wrote a __call__ implementation which > raises an exception on non-zero return codes, but otherwise returns > stdout and stderr as a tuple of two strings: > > http://github.com/peterjc/biopython/commits/app-exec3 Generally the idea and implementation are great. My only specific suggestion is regarding the default handling of stdout and stderr when you don't want to capture them. Currently you are eating those by writing to /dev/null. Would it be clearer to just use the default, which is to continue to route the programs stdout and stderr through the main instance? 
This gives friendly feedback that the program is running and makes debugging errors easier, especially if an external program doesn't use error codes correctly. Awesome to see this going in, Brad From biopython at maubp.freeserve.co.uk Fri Jul 23 13:19:37 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 23 Jul 2010 14:19:37 +0100 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: <20100723114806.GA1868@sobchak.mgh.harvard.edu> References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> <20100723114806.GA1868@sobchak.mgh.harvard.edu> Message-ID: On Fri, Jul 23, 2010 at 12:48 PM, Brad Chapman wrote: > Peter; > > [Simplified interface for calling commandline programs] >> Having thought about this for a while, I realised that in almost every >> case I have never cared about the exact return code, just if it is zero >> (success) or not (failure). Therefore the behaviour of the subprocess >> functions check_call (Python 2.5+) and check_output (Python 2.7+) >> seems desirable (you get an exception if the return code is non zero). > > This makes good sense. > Good. >> That just leaves what to return: stdout and/or stderr. I personally >> have never needed to merge stderr and stdout into a single pipe >> or string - the only use case for this I can think of is to capture the >> output into a file for logging purposes. Generally it makes more sense >> to keep them separate. This leaves the question should we return >> just stdout, or both? Sometimes stderr is useful, so I think both. > > Both is also my preference. > Good. >> So, in yet-another-branch, I wrote a __call__ implementation which >> raises an exception on non-zero return codes, but otherwise returns >> stdout and stderr as a tuple of two strings: >> >> http://github.com/peterjc/biopython/commits/app-exec3 > > Generally the idea and implementation are great. 
My only specific > suggestion is regarding the default handling of stdout and stderr > when you don't want to capture them. Currently you are eating those > by writing to /dev/null. Would it be clearer to just use the > default, which is to continue to route the programs stdout and > stderr through the main instance? This gives friendly > feedback that the program is running and makes debugging errors > easier, especially if an external program doesn't use error codes > correctly. Fair point. Personally I'd either want to capture the output (default) or completely ignore it (hence the implementation in this branch). Anyone else want to comment on this aspect? > Awesome to see this going in, > Brad Peter From biopython at maubp.freeserve.co.uk Mon Jul 26 15:04:41 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Jul 2010 16:04:41 +0100 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: Message-ID: Andrea Pierleoni wrote: > > Hi Everyone, > I've been using a lot biopython in the last couple of years, it is very > useful to me. So now it's my turn to contribute and be helpful to someone > else. > I wrote a parser for the Uniprot XML format, that is reasonably fast (8000 > entries/min on a core2duo mainstream PC). The main improvements with the > actual SwissProt flat file parser are a deeper parsing of comment fields, > and a Seqrecord containing features. > > The parser is based on the ElementTree library and was successfully tested > on the complete SwissProt database (v57.12). Thus I think it is ready to > be released. 
> > I followed the rules to develop a new parser for SeqIO, filed an > enhancement bug to bugzilla (bug 2992), and included the parser in a > public biopython fork on github available at: > > http://github.com/apierleoni/biopython/tree/uniprotxml-branch > > the new parser is in the "uniprotxml-branch" branch, and the parser code > is in Bio/SeqIO/UniprotIO.py > > The parser can be used from SeqIO using: > > iterator=SeqIO.parse(handle,'uniprot') > > I think this could be easily integrated in Biopython, ?unit test is still > missing, but should be very easy to do. > Anyhow any code review or suggestions are welcome. > > Andrea Hi Andrea, As you have probably noticed via github, I have been trying out your code. I noticed you hadn't implemented indexing support so I have done this on my branch as a quick hack: http://github.com/peterjc/biopython/commits/uniprot What I want to be able to do is seek to the start of an in the XML handle, and have the parser continue from that point. I've done this by the nasty trick of extracting the record from the XML file as a string (using the get_raw method of the index class), then adding the XML header and footer to it, and then invoking your parser. There should be a better way to do this, but I am not familiar enough with ElementTree to see it right away. Can you improve on this? I'd also like to have SeqFeature parsing done for the plain text "swiss" parser as well, which can double as a cross check for your parser. Did you look at my old patch? 
http://bugzilla.open-bio.org/show_bug.cgi?id=2235 We should also run a comparison test of the "swiss" plain text and "uniprot" XML parsers on the full downloads of UniProtKB/Swiss-Prot and/or UniProtKB/TrEMBL, see http://www.uniprot.org/downloads Peter From biopython at maubp.freeserve.co.uk Mon Jul 26 15:12:36 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Jul 2010 16:12:36 +0100 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: <20100723114806.GA1868@sobchak.mgh.harvard.edu> References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> <20100723114806.GA1868@sobchak.mgh.harvard.edu> Message-ID: On Fri, Jul 23, 2010 at 12:48 PM, Brad Chapman wrote: > Peter; > > [Simplified interface for calling commandline programs] > ... > > Awesome to see this going in, > Brad It is in now, I cherry-picked the changes I'd made on the app-exec3 branch (seemed a bit silly to do a merge for a little thing like this and make the history even more confusing). I haven't update the tutorial yet... Peter From biopython at maubp.freeserve.co.uk Mon Jul 26 16:08:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Jul 2010 17:08:10 +0100 Subject: [Biopython-dev] subprocess and calling application wrappers In-Reply-To: References: <20100601132355.GU1054@sobchak.mgh.harvard.edu> <20100723114806.GA1868@sobchak.mgh.harvard.edu> Message-ID: On Mon, Jul 26, 2010 at 4:12 PM, Peter wrote: > On Fri, Jul 23, 2010 at 12:48 PM, Brad Chapman wrote: >> Peter; >> >> [Simplified interface for calling commandline programs] >> ... >> >> Awesome to see this going in, >> Brad > > It is in now, I cherry-picked the changes I'd made on the app-exec3 branch > (seemed a bit silly to do a merge for a little thing like this and make the > history even more confusing). > > I haven't updated the tutorial yet... 
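The calling convention discussed in this thread (raise an exception on a non-zero return code, otherwise return stdout and stderr as a tuple of two strings) can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the code from the app-exec3 branch: the helper name `run_command` is hypothetical, and it takes a plain argument list rather than a Biopython command line wrapper object.

```python
import subprocess
import sys


def run_command(args):
    """Run args; return (stdout, stderr) as strings, raise on failure.

    Hypothetical helper for illustration only, not Biopython's API.
    """
    child = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,   # capture stdout separately...
        stderr=subprocess.PIPE,   # ...from stderr, as preferred in the thread
        universal_newlines=True,  # give back text rather than bytes
    )
    stdout, stderr = child.communicate()
    if child.returncode != 0:
        # Same idea as subprocess.check_call: only zero counts as success.
        raise subprocess.CalledProcessError(child.returncode, args)
    return stdout, stderr


if __name__ == "__main__":
    out, err = run_command([sys.executable, "-c", "print('hello')"])
    print(out.strip())
```

Keeping the two streams separate means the caller can parse stdout without having to filter out warnings written to stderr.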
I have updated the tutorial now - note that this just uses the default __call__ functionality, for simplicity I am avoiding mentioning the optional arguments (they are covered in the docstring of course). Peter From biopython at maubp.freeserve.co.uk Mon Jul 26 16:47:51 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Jul 2010 17:47:51 +0100 Subject: [Biopython-dev] long for longitude in Bio.Phylo and Python 3 Message-ID: Hi Eric et all, Background: Eric has found a problem in Bio.Phylo with variables, arguments and properties called "long" for longitude which the 2to3 script is wrongly converting into "int", see: http://bugs.python.org/issue2734 If the remaining issue with Bug 2734 is fixed, we would still have a problem running the conversion with 2to3 as included with all releases of Python to date (i.e. 2.6, 2.7, 3.1), which would complicate deployment. Eric: It could break backwards compatibility, but would a switch from lat & long to latitude and longitude be the least painful solution? Do you think we could support both names as part of a deprecation cycle? Peter From eric.talevich at gmail.com Mon Jul 26 17:04:24 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 26 Jul 2010 13:04:24 -0400 Subject: [Biopython-dev] long for longitude in Bio.Phylo and Python 3 In-Reply-To: References: Message-ID: On Mon, Jul 26, 2010 at 12:47 PM, Peter wrote: > Hi Eric et all, > > Background: Eric has found a problem in Bio.Phylo with variables, arguments > and properties called "long" for longitude which the 2to3 script is wrongly > converting into "int", see: http://bugs.python.org/issue2734 > > If the remaining issue with Bug 2734 is fixed, we would still have a > problem > running the conversion with 2to3 as included with all releases of Python to > date (i.e. 2.6, 2.7, 3.1), which would complicate deployment. 
> > Eric: It could break backwards compatibility, but would a switch from lat & > long to latitude and longitude be the least painful solution? Do you think > we could support both names as part of a deprecation cycle? > > Peter > The names "lat", "long" and "alt" are from the phyloXML spec, so it's convenient to keep them the same in Biopython. But I could change them to the longer form if that's needed. The parser and serializer assume the attribute names match the XML spec in general, and special-case names that won't work in Python (like "from"). Deprecation: Since we note in the Tutorial that Bio.Phylo is semi-beta, I'd like to use an accelerated deprecation cycle for name changes like this: 1 transitional release with shims that trigger a warning, then remove the shims in the release after that. Is that OK? I haven't had a chance to try "2to3 --nofix=long" on the entire codebase yet. Best, Eric From biopython at maubp.freeserve.co.uk Mon Jul 26 17:19:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Jul 2010 18:19:24 +0100 Subject: [Biopython-dev] long for longitude in Bio.Phylo and Python 3 In-Reply-To: References: Message-ID: On Mon, Jul 26, 2010 at 6:04 PM, Eric Talevich wrote: > On Mon, Jul 26, 2010 at 12:47 PM, Peter wrote: > >> Hi Eric et all, >> >> Background: Eric has found a problem in Bio.Phylo with variables, arguments >> and properties called "long" for longitude which the 2to3 script is wrongly >> converting into "int", see: http://bugs.python.org/issue2734 >> >> If the remaining issue with Bug 2734 is fixed, we would still have a >> problem >> running the conversion with 2to3 as included with all releases of Python to >> date (i.e. 2.6, 2.7, 3.1), which would complicate deployment. >> >> Eric: It could break backwards compatibility, but would a switch from lat & >> long to latitude and longitude be the least painful solution? Do you think >> we could support both names as part of a deprecation cycle? 
>> >> Peter >> > > The names "lat", "long" and "alt" are from the phyloXML spec, so it's > convenient to keep them the same in Biopython. But I could change them to > the longer form if that's needed. The parser and serializer assume the > attribute names match the XML spec in general, and special-case names that > won't work in Python (like "from"). > > Deprecation: Since we note in the Tutorial that Bio.Phylo is semi-beta, I'd > like to use an accelerated deprecation cycle for name changes like this: 1 > transitional release with shims that trigger a warning, then remove the > shims in the release after that. Is that OK? > > I haven't had a chance to try "2to3 --nofix=long" on the entire codebase > yet. Assuming that using "2to3 --nofix=long" on the entire codebase isn't going to work, then I'm OK with an accelerated deprecation for switching lat/long in Bio.Phylo. If "2to3 --nofix=long" doesn't cause us problems elsewhere, that will be a neater solution. Peter From biopython at maubp.freeserve.co.uk Tue Jul 27 10:28:29 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Jul 2010 11:28:29 +0100 Subject: [Biopython-dev] WARNING - I've just been rewriting history Message-ID: Hi all, If anyone has been trying to use the git repository in the last 12 hours or so, please note I have just re-written recent history. If in any doubt, do a fresh clone. According the github network no one else has committed anything recently, which is good. Re-writing history in git is possible but is generally considered a "bad thing" because someone might have already taken and worked from the "erased" changes. Hopefully I got away with it without messing anyone up... What I did and why: One of our team made a bad merge, and pushed it to the master. If this had been spotted BEFORE being made public a local revert could have been done. 
The standard procedure here is to do a merge revert, but unfortunately it seems they reverted to the wrong branch (merge reverts can be done back to either of the two parents). At this point we had two unwanted commits, and the best way to fix this wasn't clear [at least not to us - has anyone got advice here for future reference?]. I took the (rash?) choice first thing this morning to take a new branch from just before the bad merge, and then via a few renames made that the new master branch, and deleted the problematic branch. The git history is now "clean", but has been changed. *** To repeat - if anyone did a git pull in the last 12 hours or so, please discard those changes and take a fresh clone. *** As a general warning, please think twice before any merge. Then check twice before pushing to github. I don't want to point fingers or spread blame around - we're all still learning git. I'm guilty of unnecessary merges this too - most recently 17 July, a brief fork and merge of two versions of the master branch, where with hindsight a "git rebase origin master" would have been wise before that commit. If you are not confident about merging branches, perhaps sending a merge pull request might be safer - get someone else to go it ;) Would anyone other than me feel happy handling merge requests? Regards, Peter From tiagoantao at gmail.com Tue Jul 27 12:41:25 2010 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 27 Jul 2010 13:41:25 +0100 Subject: [Biopython-dev] WARNING - I've just been rewriting history In-Reply-To: References: Message-ID: On Tue, Jul 27, 2010 at 11:28 AM, Peter wrote: > What I did and why: One of our team made a bad merge, and pushed it to "One of our team", erhm... that would be me. Worse, it is the second time that I make the exact same mistake. My sincere apologies. Will not happen again, I will never do a merge again, in any case. One was fool, two was freakish. Three won't happen. 
Tiago From biopython at maubp.freeserve.co.uk Tue Jul 27 13:23:27 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Jul 2010 14:23:27 +0100 Subject: [Biopython-dev] Python 3 and encoding for online resources Message-ID: Hi all, One of the remaining (pure Python) problems with Biopython under Python 3 relates to parsing online resources like the NCBI Entrez API or even Bio.ExPASy.get_sprot_raw(). See for example test_SeqIO_online.py for a failure. In Python 2, urlopen from urllib or urllib2 would give a string handle. In Python 3, you get a bytes handle (not a unicode handle and choosing the encoding is tricky): http://docs.python.org/py3k/library/urllib.request.html In the case of resources like the NCBI and ExPASy we should be able to assume an encoding (maybe UTF-8 or Latin) for all the plain text output, while from XML/HTML there are ways for the data to specify this itself. I think we may need to transform the urllib bytes handle into a unicode string handle for parsing. One option would be to extend the Bio.File.UndoHandle class (or invent a subclass) which applies the decoding. This seems simple since Bio.Entrez and Bio.ExPASy already use this class. Another option (which I suggested on the Bio.SeqIO.index() thread [1]) would be to extend our parsers to cope with both byte and unicode handles. That could be a lot of work though... Thoughts? Peter [1] http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008004.html From andrea at biocomp.unibo.it Tue Jul 27 13:50:53 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Tue, 27 Jul 2010 15:50:53 +0200 (CEST) Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: Message-ID: <3b674cf220d52226cf9b2e189598fe61.squirrel@lipid.biocomp.unibo.it> > > Hi Andrea, > > As you have probably noticed via github, I have been trying out your code.
> > I noticed you hadn't implemented indexing support so I have done this on > my branch as a quick hack: > > http://github.com/peterjc/biopython/commits/uniprot good, are we going to continue developing on two separate branches/repos? if you want I can grant you access to my repo, no problem, just to make things simpler... > > What I want to be able to do is seek to the start of an in the > XML handle, and have the parser continue from that point. I've done this > by the nasty trick of extracting the record from the XML file as a string > (using the get_raw method of the index class), then adding the XML > header and footer to it, and then invoking your parser. There should > be a better way to do this, but I am not familiar enough with > ElementTree to see it right away. Can you improve on this? > well it can be done using ElementTree, maybe it will also be faster than using the re module (actually I don't know if the re module is used by etree). however using cElementTree, when possible, will improve performance. by using ElementTree we can also handle namespaces, returning a valid uniprot XML file/string. > I'd also like to have SeqFeature parsing done for the plain text "swiss" > parser as well, which can double as a cross check for your parser. Did you > look at my old patch? http://bugzilla.open-bio.org/show_bug.cgi?id=2235 > yes I looked at it, and Mauro built some unit tests to compare the results between the two parsers, take a look at Tests / test_Uniprot.py in my repo: http://github.com/apierleoni/biopython/blob/uniprotxml-branch/Tests/test_Uniprot.py > We should also run a comparison test of the "swiss" plain text and > "uniprot" XML parsers on the full downloads of UniProtKB/Swiss-Prot > and/or UniProtKB/TrEMBL, see http://www.uniprot.org/downloads > I've successfully tested the latest version in my branch on the current version of UniprotKB/Swiss-Prot.
the main differences between the two formats will be the comment field, and I don't see how they can match, sincce they are very different from the two original uniprot files. any idea? just to be clear, are we going to call this parser format just "uniprot" or "uniprot-xml"? Andrea From biopython at maubp.freeserve.co.uk Tue Jul 27 14:04:01 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Jul 2010 15:04:01 +0100 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: <3b674cf220d52226cf9b2e189598fe61.squirrel@lipid.biocomp.unibo.it> References: <3b674cf220d52226cf9b2e189598fe61.squirrel@lipid.biocomp.unibo.it> Message-ID: On Tue, Jul 27, 2010 at 2:50 PM, Andrea Pierleoni wrote: >> >> Hi Andrea, >> >> As you have probably noticed via github, I have been trying out your code. >> >> I noticed you hadn't implemented indexing support so I have done this on >> my branch as a quick hack: >> >> http://github.com/peterjc/biopython/commits/uniprot > > good, are we going to continue developing on two separate branches/repos? > if you want I can grant you acces to my repo, no problem, just to make > things simpler... Partly it was because you had some unrelated stuff on your uniprot branch (something in the FASTA m10 parser - I'd be interested to see an example file which triggered your change). >> >> What I want to be able to do is seek to the start of an in the >> XML handle, and have the parser continue from that point. I've done this >> by the nasty trick of extracting the record from the XML file as a string >> (using the get_raw method of the index class), then adding the XML >> header and footer to it, and then invoking your parser. There should >> be a better way to do this, but I am not familiar enough with >> ElementTree to see it right away. Can you improve on this? >> > > well it can be done using ElementTree, maybe it will also be faster than > using > the re module (actually I don't know if the re module is used by etree). 
> however using cElementTree, when possible, will improve performance. > by using ElementTree we can also handle namespace, > rteurning a valid uniprot XML file/string. If you can do this via (c)ElementTree, without building a dummy XML single record as a string in memory first, that would be worth trying. >> I'd also like to have SeqFeature parsing done for the plain text "swiss" >> parser as well, which can double as a cross check for your parser. Did you >> look at my old patch? http://bugzilla.open-bio.org/show_bug.cgi?id=2235 >> > > yes I looked at it, At some point I'll try the patch and test it against your UniProt XML feature generation. If I recall correctly there were some special cases with features at the very start of the protein which puzzled me. Hopefully the XML descriptions are clearer. > ... and Mauro build some unit testing to compare the results > between the two parsers, take a look at Tests / test_Uniprot.py in my repo: > > http://github.com/apierleoni/biopython/blob/uniprotxml-branch/Tests/test_Uniprot.py I thought I tried your version of the test but the seq_tests_common function compare_records seemed to strict... >> We should also run a comparison test of the "swiss" plain text and >> "uniprot" XML parsers on the full downloads of UniProtKB/Swiss-Prot >> and/or UniProtKB/TrEMBL, see http://www.uniprot.org/downloads >> > > I've succesfully tested the last version in my ranch on the current > version of UniprotKB/Swiss-Prot. Good. > the main differences between the two formats will be the comment field, > and I don't see how they can match, sincce they are very different from > the two original uniprot files. > > any idea? I avoided this issue in the test on my branch ;) I think we should update the plain text parser and BioSQL wrapper to support use the same nesting as BioPerl is using. i.e. Start by running BioPerl to import a record into BioSQL, and see how the comment ended up. 
> just to be clear, are we going to call this parser format just ?"uniprot" or > "uniprot-xml"? Another open question, I recall asking this on the open-bio cross project mailing list, but can't find it in the archive. Maybe I just meant to write an email and forgot? Do you remember this - I would have CC'd you. Basically I don't have a strong view on "uniprot" versus "uniprot-xml" but would like to agree this with BioPerl and EMBOSS. Peter From biopython at maubp.freeserve.co.uk Tue Jul 27 14:44:53 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Jul 2010 15:44:53 +0100 Subject: [Biopython-dev] Python 3 status (ignoring numpy and our C code) Message-ID: Hi all, I haven't gotten round to installing NumPy under Python 3 on this machine. Summary of test output (ignoring all the passes and skipped tests) using 2to3 with default settings. ------------------------------------------------------------------------ test_CAPS ... ERROR test_Restriction ... ERROR TypeError: unhashable type: 'RestrictionType' This is a tricky issue, see: http://lists.open-bio.org/pipermail/biopython-dev/2010-July/007975.html ------------------------------------------------------------------------ test_Crystal ... FAIL Slicing issues, we could fix them or just deprecate Bio.Crystal http://lists.open-bio.org/pipermail/biopython-dev/2010-July/thread.html ------------------------------------------------------------------------ test_LocationParser ... Syntax error at or near `467' token Something in the spark parser isn't handled by 2to3, not urgent as I want to deprecate Bio.GenBank.LocationParser which is the only thing using spark. ------------------------------------------------------------------------ test_NCBI_BLAST_tools ... FAIL Not Python 3 specific, the latest BLAST+ has changed some switches. ------------------------------------------------------------------------ test_PhyloXML ... 
FAIL Longitude versus long problem with 2to3: http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008071.html ------------------------------------------------------------------------ test_SeqIO_index ... ok Test passes but is very very slow, see: http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008004.html ------------------------------------------------------------------------ test_SeqIO_online ... FAIL May need to turn all online byte handles into unicode handles, http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008076.html ------------------------------------------------------------------------ test_property_manager ... FAIL I think this is a change to the default object's __repr__ method, and/or module name vs __main__ but in any case I'm tempted to deprecate Bio.PropertyManager because we don't really use it and I don't understand it ("Here be dragons!") ------------------------------------------------------------------------ Not looking too bad. Now I really should install NumPy on this machine... Peter From andrea at biocomp.unibo.it Tue Jul 27 14:55:20 2010 From: andrea at biocomp.unibo.it (Andrea Pierleoni) Date: Tue, 27 Jul 2010 16:55:20 +0200 (CEST) Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: <3b674cf220d52226cf9b2e189598fe61.squirrel@lipid.biocomp.unibo.it> Message-ID: > Partly it was because you had some unrelated stuff on your uniprot branch > (something in the FASTA m10 parser - I'd be interested to see an example > file which triggered your change). > yes, I know, about the FASTA parser, but actually that change did not fix the problem, just get better. the m10 parser has problems when parsing from glsearch output, but we could discuss that in a separe thread. > If you can do this via (c)ElementTree, without building a dummy XML > single record as a string in memory first, that would be worth trying. > yes it can be done, I'll put this in my work list. 
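A minimal sketch of the streaming approach discussed above, assuming the standard library's ElementTree.iterparse and the UniProt namespace "http://uniprot.org/uniprot". The helper names (iter_entries, accessions), the sample document and its accession values are made up for illustration; this is not the code on the uniprot branch.

```python
import io
from xml.etree import ElementTree

# Assumption: UniProt XML elements live in this namespace.
NS = "{http://uniprot.org/uniprot}"

# Made-up two-record document, only for demonstration.
SAMPLE = (
    b'<uniprot xmlns="http://uniprot.org/uniprot">'
    b"<entry><accession>TEST01</accession></entry>"
    b"<entry><accession>TEST02</accession></entry>"
    b"</uniprot>"
)


def iter_entries(handle):
    """Yield one <entry> element at a time from a UniProt XML handle.

    No per-record XML string is built; each element is cleared once the
    caller has finished with it, so memory use stays roughly constant.
    """
    for _event, elem in ElementTree.iterparse(handle, events=("end",)):
        if elem.tag == NS + "entry":
            yield elem
            elem.clear()  # runs when the caller asks for the next entry


def accessions(handle):
    """Collect the first accession of every entry (hypothetical helper)."""
    return [entry.findtext(NS + "accession") for entry in iter_entries(handle)]


RESULT = accessions(io.BytesIO(SAMPLE))
```

Each yielded element could then be handed to whatever code builds the SeqRecord; the point is only that the record boundary comes from the parser events rather than from string splitting.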
>
> At some point I'll try the patch and test it against your UniProt XML
> feature generation. If I recall correctly there were some special cases
> with features at the very start of the protein which puzzled me. Hopefully
> the XML descriptions are clearer.
>

XML descriptions are clearer, but have some problems as well.
some features do not have a start and end point. in this case I skipped them.

>> ... and Mauro built some unit tests to compare the results
>> between the two parsers, take a look at Tests/test_Uniprot.py in my
>> repo:
>>
>> http://github.com/apierleoni/biopython/blob/uniprotxml-branch/Tests/test_Uniprot.py
>
> I thought I tried your version of the test but the seq_tests_common
> function compare_records seemed too strict...
>

It depends on how well we want to fit the plain-text vs xml parser.
I don't think we could end up with 100% identical seqrecords, and some
flexibility should be used.

>
> I avoided this issue in the test on my branch ;)
>
> I think we should update the plain text parser and BioSQL wrapper to
> use the same nesting as BioPerl is using. i.e. Start by running
> BioPerl to import a record into BioSQL, and see how the comment ended up.
>

well, BioPerl guys weren't very collaborative on the BioSQL mailing list.
however I just read a couple of messages at that time.

they are using their schema and BioJava is not using the same schema.
I don't know about other projects.
I think we have 3 choices:

1) follow BioPerl whatever they do (could be good)
2) try to define our rules (bad)
3) set a defined open schema and propose it to BioSQL (good)

In my parser I'm storing information from the comment as annotations
in the seqrecords, building annotation keys on the basis of the XML
tree. this is a quick and dirty hack, but can be done much better.

we could store complex comment fields with XML, but I'm not inclined
to use just a big XML string in the comment field.
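To make the idea of annotation keys built from the XML tree concrete, here is a rough sketch. The key scheme ("comment_" + type + tag path), the function names and the sample element are all hypothetical; the actual scheme used on the branch is not shown in this thread and may well differ.

```python
from xml.etree import ElementTree


def _local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def _walk(elem, prefix):
    """Yield (key, text) pairs, extending the key with each nested tag."""
    key = prefix + "_" + _local_name(elem.tag)
    text = (elem.text or "").strip()
    if text:
        yield key, text
    for child in elem:
        yield from _walk(child, key)


def comment_annotations(comment):
    """Flatten one <comment> element into a dict of lists (hypothetical scheme)."""
    root_key = "comment_" + comment.get("type", "unknown").replace(" ", "")
    annotations = {}
    for child in comment:
        for key, text in _walk(child, root_key):
            annotations.setdefault(key, []).append(text)
    return annotations


# Made-up comment element, only for demonstration.
SAMPLE = '<comment type="function"><text>Made-up example text.</text></comment>'
RESULT = comment_annotations(ElementTree.fromstring(SAMPLE))
```

Values are kept as lists because the same key can occur more than once in a record. A flat key like this is easy to put in SeqRecord.annotations, at the cost of losing the nesting that a TagTree-like structure would keep.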
Also keep in mind that the "comment" field is no longer called comments in the uniprot web-site but "general annotations", so maybe it makes sense to store this data as annotation in some other place. >> just to be clear, are we going to call this parser format just >> ?"uniprot" or >> "uniprot-xml"? > > Another open question, I recall asking this on the open-bio cross project > mailing list, but can't find it in the archive. Maybe I just meant to > write an > email and forgot? Do you remember this - I would have CC'd you. > Basically I don't have a strong view on "uniprot" versus "uniprot-xml" but > would like to agree this with BioPerl and EMBOSS. The issue here was that I started calling this format "uniprot" then I realize in the EBI REST services the file format is referred as "uniprot-xml". currently in my branch it is called uniprot-xml Andrea From biopython at maubp.freeserve.co.uk Tue Jul 27 15:16:00 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Jul 2010 16:16:00 +0100 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: References: <3b674cf220d52226cf9b2e189598fe61.squirrel@lipid.biocomp.unibo.it> Message-ID: On Tue, Jul 27, 2010 at 3:55 PM, Andrea Pierleoni wrote: > >> At some point I'll try the patch and test it against your UniProt XML >> feature generation. If I recall correctly there were some special cases >> with features at the very start of the protein which puzzled me. Hopefully >> the XML descriptions are clearer. >> > > XML descriptions are clearer, but have some probvlem as well. > some features do not have a stat and end point. in this case I skipped them. If you have some specific examples (IDs) to hand that would be useful. >>> ... 
and Mauro build some unit testing to compare the results >>> between the two parsers, take a look at Tests / test_Uniprot.py in my >>> repo: >>> >>> http://github.com/apierleoni/biopython/blob/uniprotxml-branch/Tests/test_Uniprot.py >> >> I thought I tried your version of the test but the seq_tests_common >> function compare_records seemed to strict... >> > > I depends how how well we want to fit the plain-text vs xml parser. > I don't think we could end up in 100% identical seqrecords, and some > flexibility should be used. I agree we're not going to get 100% identical records. >> I think we should update the plain text parser and BioSQL wrapper to >> support use the same nesting as BioPerl is using. i.e. Start by running >> BioPerl to import a record into BioSQL, and see how the comment >> ended up. >> > > well, BioPerl guys weren't very collaborative on the BioSQL mailing list. > however I just read a couple of messages at that time. > > they are using their schema and BioJava is not using the same schema. > I don't know about other projects. Perhaps you are using "schema" in a different way that I would. All the projects use the same schema (where I mean database tables), but there are differences in the details of how each file format gets parsed and ends up stored in those tables. > I think we have 3 choiches: > > 1) follow BioPerl whatever they does (could be good) > 2) try to define our rules (bad) > 3) set a defined open schema and propose it to BioSQL (good) If in (3) you mean we should have some clear examples of major file formats and how each field should end up in BioSQL, I agree. In the short to medium term I regard the bioperl-db mapping as the reference implementation (although their code does continue to change), i.e. (1). 
I found one of the threads I was thinking about in the archive, http://bioperl.org/pipermail/biosql-l/2010-January/001672.html http://bioperl.org/pipermail/bioperl-l/2010-January/031993.html http://bioperl.org/pipermail/open-bio-l/2010-January/000609.html > In my parser I'm storing information from the comment as annotations > in the seqrecords, buinding annotation key on the basis of the XML > tree. this is a quick and dirty hack, but can be done much better. > > we could store complex comment field with XML, but I'm not incline > in using just a big XML string in the comment field. Some sorted of nested structure like a dictionary? Are you familiar with the Perl TagTree which is what BioPerl are using here. I think Richard Holland said (in the above linked thread) that BioJava just sticks the DE section as an XML string into their record object (and thus puts XML in the BioSQL database?). > Also keep in mind that the "comment" field is no longer called comments > in the uniprot web-site but "general annotations", so maybe it makes sense >?to store this data as annotation in some other place. Sounds sensible. >>> just to be clear, are we going to call this parser format just >>> ?"uniprot" or >>> "uniprot-xml"? >> >> Another open question, I recall asking this on the open-bio cross project >> mailing list, but can't find it in the archive. Maybe I just meant to write >> an email and forgot? Do you remember this - I would have CC'd you. >> Basically I don't have a strong view on "uniprot" versus "uniprot-xml" but >> would like to agree this with BioPerl and EMBOSS. > > > The issue here was that I started calling this format "uniprot" then I > realize in the EBI REST services the file format is referred as > "uniprot-xml". currently in my branch it is called uniprot-xml > I'll (re-)post that as a specific query on the open-bio-l mailing list... 
Peter

From andrea at biocomp.unibo.it Tue Jul 27 16:37:59 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Tue, 27 Jul 2010 18:37:59 +0200 (CEST)
Subject: [Biopython-dev] New: Uniprot XML parser
In-Reply-To:
References: <3b674cf220d52226cf9b2e189598fe61.squirrel@lipid.biocomp.unibo.it>
Message-ID: <8ec7479153894f66ea029abd059e06c5.squirrel@lipid.biocomp.unibo.it>

>> XML descriptions are clearer, but have some problems as well.
>> some features do not have a start and end point. in this case I skipped
>> them.
>
> If you have some specific examples (IDs) to hand that would be useful.
>

try this:
http://www.uniprot.org/uniprot/Q8NE62.xml

the "error" refers to the old '?' symbol in feature positions

it carries this feature:

I'm actually skipping all the features/comments carrying a status="unknown"
attrib in start or end positions, or both.

other examples:
3HIDH_DICDI ADAM1_RAT ADAM1_RAT ADM1B_MOUSE ADM1B_MOUSE CARDH_CYNCA
CARDH_CYNCA CHDH_HUMAN COQ41_PARTE COQ4_CHAGB COQ4_LEIMA COX11_DICDI
COX11_DICDI COX16_NEUCR ...

I'm actually skipping all the features having a

>
> I agree we're not going to get 100% identical records.

good

>
> Perhaps you are using "schema" in a different way than I would. All the
> projects use the same schema (where I mean database tables), but
> there are differences in the details of how each file format gets parsed
> and ends up stored in those tables.

Yes I'm referring to data schema in general, not strictly the BioSQL
schema. I don't mean to change the BioSQL schema.

>
>> I think we have 3 choices:
>>
>> 1) follow BioPerl whatever they do (could be good)
>> 2) try to define our rules (bad)
>> 3) set a defined open schema and propose it to BioSQL (good)
>
> If in (3) you mean we should have some clear examples of major file
> formats and how each field should end up in BioSQL, I agree.
In the > short to medium term I regard the bioperl-db mapping as the reference > implementation (although their code does continue to change), i.e. (1). > > I found one of the threads I was thinking about in the archive, > http://bioperl.org/pipermail/biosql-l/2010-January/001672.html > http://bioperl.org/pipermail/bioperl-l/2010-January/031993.html > http://bioperl.org/pipermail/open-bio-l/2010-January/000609.html so does it make sens to follow their code and their change? this would be valid just for BioPerl and BioPython. > >> In my parser I'm storing information from the comment as annotations >> in the seqrecords, buinding annotation key on the basis of the XML >> tree. this is a quick and dirty hack, but can be done much better. >> >> we could store complex comment field with XML, but I'm not incline >> in using just a big XML string in the comment field. > > Some sorted of nested structure like a dictionary? Are you familiar > with the Perl TagTree which is what BioPerl are using here. I think > Richard Holland said (in the above linked thread) that BioJava just > sticks the DE section as an XML string into their record object > (and thus puts XML in the BioSQL database?). > I'm not familiar with the TagTree but I've looked at it when there was the discussion, and I do not see any advantage on using this explicitly on the db fields instead of an XML. I would save an XML text on the DB easily readable by every language and even humans. XML text can be also queried easily. Then I'd represent this XML in a nested dictionary structure similar to the perl TagTree. I don't know if there is any implementation in python about this... >> Also keep in mind that the "comment" field is no longer called comments >> in the uniprot web-site but "general annotations", so maybe it makes >> sense >>?to store this data as annotation in some other place. > > Sounds sensible. you can use XML here too, if needed. 
Also by using XML, we could be able to store dictionary-containing seqrecords in a BioSQL db. A big plus to me. > > I'll (re-)post that as a specific query on the open-bio-l mailing list... > it looks like anybody is agreeing with "uniprot-xml" Andrea From biopython at maubp.freeserve.co.uk Tue Jul 27 16:40:23 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Jul 2010 17:40:23 +0100 Subject: [Biopython-dev] New: Uniprot XML parser In-Reply-To: <8ec7479153894f66ea029abd059e06c5.squirrel@lipid.biocomp.unibo.it> References: <3b674cf220d52226cf9b2e189598fe61.squirrel@lipid.biocomp.unibo.it> <8ec7479153894f66ea029abd059e06c5.squirrel@lipid.biocomp.unibo.it> Message-ID: On Tue, Jul 27, 2010 at 5:37 PM, Andrea Pierleoni wrote: > >> >> I'll (re-)post that as a specific query on the open-bio-l mailing list... >> > > it looks like anybody is agreeing with "uniprot-xml" > Yes - so far at least :) http://bioperl.org/pipermail/open-bio-l/2010-July/000701.html ... http://open-bio.org/pipermail/open-bio-l/2010-July/000704.html Peter From bugzilla-daemon at portal.open-bio.org Wed Jul 28 08:20:41 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 28 Jul 2010 04:20:41 -0400 Subject: [Biopython-dev] [Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010) In-Reply-To: Message-ID: <201007280820.o6S8Kfj3001278@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3119 ------- Comment #4 from fkauff at biologie.uni-kl.de 2010-07-28 04:20 EST ------- Slashes in Taxon names may cause troubles (even when properly quoted), not only for Bio.Nexus, but also for many other programs. If you want to use / or other special characters in taxon names, better use a " or ' around them. It might be best to avoid them entirely, my experience is that at one point during file processing there will be a software that complains. 
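As an illustration of the quoting advice above, here is a small sketch of a helper that wraps a taxon label in single quotes when it contains punctuation or whitespace. The function name is made up, and the punctuation set is my reading of the NEXUS format description, so treat it as an assumption; this is not what Bio.Nexus or PRANK actually do.

```python
# Assumption: characters treated as punctuation in NEXUS labels.
NEXUS_PUNCTUATION = set("()[]{}/\\,;:=*'\"`+-<>")


def quote_taxon_label(name):
    """Return name, single-quoted if it holds punctuation or whitespace.

    Embedded single quotes are doubled inside a quoted label.
    """
    if any(ch in NEXUS_PUNCTUATION or ch.isspace() for ch in name):
        return "'" + name.replace("'", "''") + "'"
    return name


QUOTED = quote_taxon_label("AK1H_ECOLI/1-378")
PLAIN = quote_taxon_label("AK1H_ECOLI")
```

With this, a PFAM style name such as AK1H_ECOLI/1-378 is written as a single quoted token, while a plain name is left untouched.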
The translate statement in the nexus file ends both with a , AND a ; after the second taxon, which is also not nexus compliant. Frank (In reply to comment #0) > I've been updating test_Prank_tool.py to cope with the latest version of Prank, > 1 July 2010 from http://www.ebi.ac.uk/goldman-srv/prank/src/prank/ > > Some changes are simple, such as removing tests using feature of Prank which > have been removed. One test is failing due to some big changes in the NEXUS > output from Prank, and this may be due to a problem with our parser: > > >>> from Bio.Nexus import Nexus > >>> n = Nexus.Nexus("output_prank_v100701.nex") > Traceback (most recent call last): > ... > Bio.Nexus.Nexus.NexusError: Taxon AK1H_ECOLI: Illegal character / in line: > (check dimensions / interleaving) > > I will attach the file, it is created by the unit test as output.2.nex but > is usually deleted. > > Peter > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 28 08:26:38 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 28 Jul 2010 04:26:38 -0400 Subject: [Biopython-dev] [Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010) In-Reply-To: Message-ID: <201007280826.o6S8QcMJ001475@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3119 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-28 04:26 EST ------- (In reply to comment #4) > Slashes in Taxon names may cause troubles (even when properly quoted), not > only for Bio.Nexus, but also for many other programs. If you want to use / or > other special characters in taxon names, better use a " or ' around them. 
It > might be best to avoid them entirely, my experience is that at one point > during file processing there will be a software that complains. Sure - but on the other hand, this why we test things too ;) > The translate statement in the nexus file ends both with a , AND a ; after the > second taxon, which is also not nexus compliant. > > Frank So you think there is a problem with PRANK's output here? Would you like to report this or should I? Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 28 08:49:17 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 28 Jul 2010 04:49:17 -0400 Subject: [Biopython-dev] [Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010) In-Reply-To: Message-ID: <201007280849.o6S8nHbQ002167@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3119 ------- Comment #6 from fkauff at biologie.uni-kl.de 2010-07-28 04:49 EST ------- I think this is a bug - taxa in a translate statement are separated by commas, and after the last one, there is a semicolon, not both. Which makes sense. You're welcome to report it - probably you have more info at hand how the file was generated... Frank PS. I Updated tree parsing in Nexus to handle the tree * PRANK = ... statement. (In reply to comment #5) > (In reply to comment #4) > > Slashes in Taxon names may cause troubles (even when properly quoted), not > > only for Bio.Nexus, but also for many other programs. If you want to use / or > > other special characters in taxon names, better use a " or ' around them. It > > might be best to avoid them entirely, my experience is that at one point > > during file processing there will be a software that complains. 
> > Sure - but on the other hand, this why we test things too ;) > > > The translate statement in the nexus file ends both with a , AND a ; after the > > second taxon, which is also not nexus compliant. > > > > Frank > > So you think there is a problem with PRANK's output here? Would you like to > report this or should I? > > Peter > -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jul 28 10:46:19 2010 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 28 Jul 2010 06:46:19 -0400 Subject: [Biopython-dev] [Bug 3119] Bio.Nexus can't parse file from Prank 100701 (1st July 2010) In-Reply-To: Message-ID: <201007281046.o6SAkJt3006529@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=3119 ------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2010-07-28 06:46 EST ------- Created an attachment (id=1530) --> (http://bugzilla.open-bio.org/attachment.cgi?id=1530&action=view) Hand corrected NEXUS output from prank v100701 I am attaching a hand edited version of the PRANK v100701 NEXUS output where I have wrapped the names with single quotes, and removed the stray comma in the translate statement. See below for details. Bio.Nexus is happy with this file. (In reply to comment #4) > Slashes in Taxon names may cause troubles (even when properly quoted), not > only for Bio.Nexus, but also for many other programs. If you want to use / > or other special characters in taxon names, better use a " or ' around them. > It might be best to avoid them entirely, my experience is that at one point > during file processing there will be a software that complains. I should have been clearer earlier: Yes, I understand that special characters like slash will cause some tools problems, but they are nevertheless common. 
In particular, PFAM alignments take the form name/start-end to encode
which subregion of a protein is being shown - like the example here which
uses AK1H_ECOLI/1-378 and AKH_HAEIN/1-382 as the taxa names.

I have just checked in a change to the error message, which I think throws
more light on the issue:

http://github.com/biopython/biopython/commit/d8a4a6edc98fa69885b6865336020db02035ff0b

Now I get:

>>> from Bio.Nexus import Nexus
>>> n = Nexus.Nexus("output_prank_v100701.nex")
Traceback (most recent call last):
...
Bio.Nexus.Nexus.NexusError: Taxon AK1H_ECOLI: Illegal character / in sequence /1-378CPDSINAALICRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRR (check dimensions/interleaving)

Notice that the tail of the taxon name ('/1-378') is being treated as part
of the sequence.

Having looked at the code and read the relevant bits of the NEXUS
specification (Maddison et al), I think that PRANK is producing invalid
taxa labels. In order to include characters like slashes and dashes
(minus signs) that are considered punctuation (and thus indicate the end
of the taxa label) the labels should have been wrapped in single quotes.
See the attachment.

> The translate statement in the nexus file ends both with a , AND a ; after the
> second taxon, which is also not nexus compliant.

(In reply to comment #6)
> I think this is a bug - taxa in a translate statement are separated by commas,
> and after the last one, there is a semicolon, not both. Which makes sense.

I have not looked at this aspect in detail, but will take your word for it.
See the attachment.

(In reply to comment #6)
>
> You're welcome to report it - probably you have more info at hand how the file
> was generated...
> For the record, the file was generated with the following, input file in FASTA format has two sequences which already have gaps in them: http://biopython.open-bio.org/SRC/biopython/Tests/Fasta/fa01 http://github.com/biopython/biopython/raw/master/Tests/Fasta/fa01 Then run prank (here using v081202), from the same directory: $ prank -d=fa01 -f=17 -noxml -notree Warning: option '+F' is not selected. You can select it by adding flag "+F". PRANK: aligning sequences in 'fa01', writing results to 'output.?.nex' [plain alignment]. Generating approximate guidetree. Generating multiple alignment. #1#(1/1): 95% aligned Generating improved guidetree. Generating improved multiple alignment. #1#(1/1): computing full probability Alignment done. Total time 1s $ diff output.1.nex output.2.nex $ more output.2.nex #NEXUS ... See previously attachment 1524 for the output. (In reply to comment #6) > > Frank > > PS. I Updated tree parsing in Nexus to handle the > > tree * PRANK = ... > > statement. > Great. Peter -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython at maubp.freeserve.co.uk Wed Jul 28 14:38:41 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 28 Jul 2010 15:38:41 +0100 Subject: [Biopython-dev] Python 3 status (ignoring numpy and our C code) In-Reply-To: References: Message-ID: One other outstanding test failure (which I thought I'd fixed, but in doing so broke Python 2) is the Bio.Seq doctests which include exceptions. This is a known issue due to a change in the traceback for Python 2.7 and Python 3 to include the exception module name, making it difficult to write doctests with exceptions which also pass on both Python 2 and 3. 
This seems to have been fixed in Python 2.7, while there will be a work
around available in Python 3.2 (but apparently not in Python 3.0 or 3.1)
via doctest.IGNORE_EXCEPTION_DETAIL, see: http://bugs.python.org/issue7490

For now I have taken the pragmatic choice of skipping the Bio.Seq doctest
under Python 3.1

Peter

From biopython at maubp.freeserve.co.uk Wed Jul 28 16:08:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Jul 2010 17:08:46 +0100
Subject: [Biopython-dev] Equality in Bio.Restriction.RestrictionType
In-Reply-To:
References:
Message-ID:

2010/7/8 Peter :
> Hi Frédéric et al,
>
> One of the things in Python 3 is that overriding equality (done with __eq__
> only since __cmp__ has gone) requires you also override __hash__. One
> remaining example of this which triggers a deprecation warning within our
> test suite when running with the -3 switch is in Bio.Restriction.
>
> I therefore had a look at how __eq__ and __ne__ are defined in the
> RestrictionType class - and strangely they do NOT seem to be inverses.
>
>     def __eq__(cls, other):
>         """RE == other -> bool
>
>         True if RE and other are the same enzyme."""
>         return other is cls
>
>     def __ne__(cls, other):
>         """RE != other -> bool.
>         isoschizomer strict, same recognition site, same restriction -> False
>         all the other-> True"""
>         if not isinstance(other, RestrictionType):
>             return True
>         elif cls.charac == other.charac:
>             return False
>         else:
>             return True
>
> Frédéric - could you clarify the intent here?

Hi Frédéric,

As implemented, __eq__ just seems to check for object identity,
effectively id(a) == id(b), so to make the unit test pass on Python 3
all I needed to do here was define __hash__ explicitly to return
id(self), which is the default behaviour under Python 2.

I'm still puzzled about the reasoning behind the comparisons.
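The Python 3 behaviour described above can be reproduced with a small stand-alone sketch. The classes here are ordinary made-up classes, not the Bio.Restriction metaclass, and the names are hypothetical; they only show why defining __eq__ alone makes instances unhashable on Python 3 and how an explicit identity based __hash__ restores the old behaviour.

```python
class Enzyme:
    """Hypothetical stand-in: defines __eq__ but no __hash__."""

    def __init__(self, name, site):
        self.name = name
        self.site = site

    def __eq__(self, other):
        # Identity comparison, like the __eq__ quoted above.
        return other is self


class HashableEnzyme(Enzyme):
    """Same comparison, plus an explicit identity based hash."""

    def __hash__(self):
        return id(self)


def is_hashable(obj):
    """Return True if hash(obj) works, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


PLAIN = Enzyme("Acc65I", "GGTACC")
FIXED = HashableEnzyme("Acc65I", "GGTACC")
```

On Python 3, is_hashable(PLAIN) is False because a class body that defines __eq__ without __hash__ gets __hash__ set to None, while FIXED can be used as a dictionary key or set member again.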
Clearly you had something special in mind with these definitions as shown
by the test_Restriction.py unit tests under test_comparisons,

assert Acc65I == Acc65I
assert not(Acc65I == Asp718I)
assert not(Acc65I != Asp718I)

Note that Acc65I.site == Asp718I.site == 'GGTACC', and also
Acc65I.charac == Asp718I.charac == (1, -1, None, None, 'GGTACC')

It looks to me like Acc65I and Asp718I differ only in name, and you
wanted both Acc65I == Asp718I and Acc65I != Asp718I to return False.
i.e. They are neither equal nor non-equal, but somewhere in between?

Regards,

Peter

From biopython at maubp.freeserve.co.uk Wed Jul 28 16:17:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Jul 2010 17:17:07 +0100
Subject: [Biopython-dev] Python 3 status (ignoring numpy and our C code)
In-Reply-To:
References:
Message-ID:

Regarding the remaining Python 3 unit test failures,

On Tue, Jul 27, 2010 at 3:44 PM, Peter wrote:
>
> test_CAPS ... ERROR
> test_Restriction ... ERROR
>
> TypeError: unhashable type: 'RestrictionType'
>
> This is a tricky issue, see:
> http://lists.open-bio.org/pipermail/biopython-dev/2010-July/007975.html

This is now fixed - it turns out that I didn't need to understand
the full complexities of the restriction object comparisons, just
what __eq__ was doing:
http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008089.html

Peter

From biopython at maubp.freeserve.co.uk Wed Jul 28 16:38:01 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Jul 2010 17:38:01 +0100
Subject: [Biopython-dev] Python 3 status (ignoring numpy and our C code)
In-Reply-To:
References:
Message-ID:

More progress on the Python 3 front,

On Tue, Jul 27, 2010 at 3:44 PM, Peter wrote:
>
> test_Crystal ... FAIL
>
> Slicing issues, we could fix them or just deprecate Bio.Crystal
> http://lists.open-bio.org/pipermail/biopython-dev/2010-July/thread.html
>

I decided to just fix it - test_Crystal.py seems to cover all the
basic cases for slicing.
http://github.com/biopython/biopython/commit/faefe401af626656c3f8b457c066627c0ab5ef79

Peter

From eric.talevich at gmail.com Thu Jul 29 02:22:12 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 28 Jul 2010 22:22:12 -0400
Subject: [Biopython-dev] WARNING - I've just been rewriting history
In-Reply-To:
References:
Message-ID:

On Tue, Jul 27, 2010 at 6:28 AM, Peter wrote:
> Hi all,
>
> If anyone has been trying to use the git repository in the last 12 hours
> or so, please note I have just re-written recent history. If in any doubt,
> do a fresh clone. According to the github network no one else has
> committed anything recently, which is good.
>
> Re-writing history in git is possible but is generally considered a "bad
> thing" because someone might have already taken and worked from the
> "erased" changes. Hopefully I got away with it without messing anyone up...
>

(emerges battered and bruised from the wreckage)

If anyone else besides me got hit by this, here's a summary of how to fix
your local repository without nuking all your local branches:

# We're on the "master" branch, a clone of "upstream/master"
# This has an alternate history of biopython/biopython/master
# so "git pull upstream master" doesn't work anymore
git branch -m master borked
git checkout -b master upstream/master
git pull upstream master
# If everything looks OK...
git branch -d borked

Note that this only recreates a fresh copy of Biopython's official master
branch; if you've made commits on top of the borked history, or merged it
into other branches, you should probably just make a fresh clone and
export your local branches as patch sets.

> What I did and why: One of our team made a bad merge, and pushed it to
> the master. If this had been spotted BEFORE being made public a local
> revert could have been done.
The standard procedure here is to do a > merge revert, but unfortunately it seems they reverted to the wrong branch > (merge reverts can be done back to either of the two parents). At this > point > we had two unwanted commits, and the best way to fix this wasn't clear > [at least not to us - has anyone got advice here for future reference?]. > As usual in git, there's probably a way to do this, but I sure don't know what it is. > If you are not confident about merging branches, perhaps sending a > merge pull request might be safer - get someone else to go it ;) > Would anyone other than me feel happy handling merge requests? > Starting a month or so from now, I'd be willing to take a crack at it. Another suggestion for avoiding accidentally pushing weird changes to the main repo: point your "master" branch at your personal fork on github (normally called "origin"), rather than upstream. Then "git push" will do the safe thing by default. Regards, Eric From eric.talevich at gmail.com Thu Jul 29 04:08:37 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 29 Jul 2010 00:08:37 -0400 Subject: [Biopython-dev] long for longitude in Bio.Phylo and Python 3 In-Reply-To: References: Message-ID: On Mon, Jul 26, 2010 at 1:19 PM, Peter wrote: > On Mon, Jul 26, 2010 at 6:04 PM, Eric Talevich > wrote: > > On Mon, Jul 26, 2010 at 12:47 PM, Peter >wrote: > > > >> Hi Eric et all, > >> > >> Background: Eric has found a problem in Bio.Phylo with variables, > arguments > >> and properties called "long" for longitude which the 2to3 script is > wrongly > >> converting into "int", see: http://bugs.python.org/issue2734 > >> > >> If the remaining issue with Bug 2734 is fixed, we would still have a > >> problem > >> running the conversion with 2to3 as included with all releases of Python > to > >> date (i.e. 2.6, 2.7, 3.1), which would complicate deployment. 
> >> > >> Eric: It could break backwards compatibility, but would a switch from > lat & > >> long to latitude and longitude be the least painful solution? Do you > think > >> we could support both names as part of a deprecation cycle? > >> > >> Peter > >> > > > > The names "lat", "long" and "alt" are from the phyloXML spec, so it's > > convenient to keep them the same in Biopython. But I could change them to > > the longer form if that's needed. The parser and serializer assume the > > attribute names match the XML spec in general, and special-case names > that > > won't work in Python (like "from"). > > > [...] > > If "2to3 --nofix=long" doesn't cause us problems elsewhere, that will > be a neater solution. > >From my testing just now, "2to3 --nofix=long" seems to be fine. I don't see any new errors introduced by it. -Eric From biopython at maubp.freeserve.co.uk Thu Jul 29 08:30:17 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 29 Jul 2010 09:30:17 +0100 Subject: [Biopython-dev] WARNING - I've just been rewriting history In-Reply-To: References: Message-ID: Eric Talevich wrote: > Peter wrote: > >> Hi all, >> >> If anyone has been trying to use the git repository in the last 12 hours or >> so, please note I have just re-written recent history. If in any doubt, do a >> fresh clone. According the github network no one else has committed >> anything recently, which is good. >> >> Re-writing history in git is possible but is generally considered a "bad >> thing" because someone might have already taken and worked from >> the "erased" changes. Hopefully I got away with it without messing >> anyone up... >> > > (emerges battered and bruised from the wreckage) > Ah - sorry Eric, but at least you sorted it out. Did you see the email first or discover something wrong the hard way? 

> If anyone else besides me got hit by this, here's a summary of how to fix
> your local repository without nuking all your local branches:
>
> # We're on the "master" branch, a clone of "upstream/master"
> # This has an alternate history of biopython/biopython/master
> # so "git pull upstream master" doesn't work anymore
> git branch -m master borked
> git checkout -b master upstream/master
> git pull upstream master
> # If everything looks OK...
> git branch -d borked
>

i.e. Rename your local copy of the borked master, get a clean
copy of the rewritten master, delete renamed borked master.
Looks very sensible.

>
> Note that this only recreates a fresh copy of Biopython's official master
> branch; if you've made commits on top of the borked history, or merged it
> into other branches, you should probably just make a fresh clone and
> export your local branches as patch sets.
>
>> What I did and why: One of our team made a bad merge, and pushed it to
>> the master. If this had been spotted BEFORE being made public a local
>> revert could have been done. The standard procedure here is to do a
>> merge revert, but unfortunately it seems they reverted to the wrong
>> branch (merge reverts can be done back to either of the two parents).
>> At this point we had two unwanted commits, and the best way to fix this
>> wasn't clear [at least not to us - has anyone got advice here for future
>> reference?].
>>
>
> As usual in git, there's probably a way to do this, but I sure don't know
> what it is.
>

Laurent sent me this link off-list, it sounds very complicated:
http://www.kernel.org/pub/software/scm/git/docs/howto/revert-a-faulty-merge.txt

>> If you are not confident about merging branches, perhaps sending a
>> merge pull request might be safer - get someone else to do it ;)
>> Would anyone other than me feel happy handling merge requests?
>>
>
> Starting a month or so from now, I'd be willing to take a crack at it.
>
> Another suggestion for avoiding accidentally pushing weird changes to the
> main repo: point your "master" branch at your personal fork on github
> (normally called "origin"), rather than upstream. Then "git push" will do
> the safe thing by default.

i.e. Push to your personal github repository's master first? That way
it won't harm the official repository?

Peter

From biopython at maubp.freeserve.co.uk Thu Jul 29 10:29:28 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Jul 2010 11:29:28 +0100
Subject: [Biopython-dev] Bytes, Strings and Unicode (Python 2 vs 3)
Message-ID:

Hi all,

I'm forwarding something from the NumPy mailing list regarding
strings and unicode:

On Thu, Jul 29, 2010 at 4:40 AM, Fernando Perez wrote:
>
> On Wed, Jul 28, 2010 at 12:36 PM, Fernando Perez wrote:
>> The official Python 2.x unicode story is well explained here:
>> http://docs.python.org/howto/unicode.html
>>
>> and here is the corresponding document for 3.x:
>> http://docs.python.org/release/3.1.2/howto/unicode.html
>
> Just in case you're still thirsty for more info on Unicode... :)
>
> Min Ragan-Kelley just did a great summary writeup of these questions
> from a low-level perspective: for pyzmq we need to handle strings
> (i.e. unicode) at the python level, but efficiently and unambiguously
> communicate with a networking layer written in C. We spent a lot of
> time thinking about this, and his writeup is a great resource for
> anyone who needs to look at this from a C/low-level angle:
>
> http://ptsg.berkeley.edu/~minrk/zmq/unicode.html
>
> This adds a view that isn't made very explicit in any of the docs I'd
> previously sent.
>
> Cheers,
>
> f

The fact that on most Linux distributions Python 3's unicode strings
will take 4x the memory of plain byte strings, and even Windows and Mac
will take 2x the memory is concerning for me (since I've been using
Biopython for some next gen sequencing stuff where memory is already
sometimes the main bottleneck).

I think we will want to make the Seq object use bytes internally,
rather than unicode strings. We'll also want to make sure the Seq
module functions will cope with bytes, unicode or Seq type objects.

For most annotation (e.g. in SeqRecord and SeqFeature objects), I guess
the default of unicode strings will be OK. Perhaps the SeqRecord's
id/name/description might be borderline cases...

Peter

From eric.talevich at gmail.com Thu Jul 29 16:00:22 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 29 Jul 2010 12:00:22 -0400
Subject: [Biopython-dev] WARNING - I've just been rewriting history
In-Reply-To:
References:
Message-ID:

On Thu, Jul 29, 2010 at 4:30 AM, Peter wrote:

> Eric Talevich wrote:
> > Peter wrote:
> >
> >> Hi all,
> >>
> >> If anyone has been trying to use the git repository in the last 12 hours or
> >> so, please note I have just re-written recent history. If in any doubt, do a
> >> fresh clone. According to the github network no one else has committed
> >> anything recently, which is good.
> >>
> >> Re-writing history in git is possible but is generally considered a "bad
> >> thing" because someone might have already taken and worked from
> >> the "erased" changes. Hopefully I got away with it without messing
> >> anyone up...
> >>
> >
> > (emerges battered and bruised from the wreckage)
> >
>
> Ah - sorry Eric, but at least you sorted it out. Did you see the email
> first or discover something wrong the hard way?
>

The hard way. I had a small uncommitted change to PhyloXMLIO.py, and wanted
to apply it to the tip of the master branch. So I stashed my change, pulled
from upstream (just after the bad merge reversion), and popped the stash. My
change no longer applied cleanly, even though the history showed no new
commits affecting PhyloXMLIO.py. Suck.

Having burned myself with "git rebase -i" on my own github fork last summer,
I recognized the problem after you rewrote the upstream history: After
pulling from upstream (or origin), the local copy claims to be several
commits ahead of the public branch it's supposed to mirror.

> > If anyone else besides me got hit by this, here's a summary of how to fix
> > your local repository without nuking all your local branches:
> >
> > # We're on the "master" branch, a clone of "upstream/master"
> > # This has an alternate history of biopython/biopython/master
> > # so "git pull upstream master" doesn't work anymore
> > git branch -m master borked
> > git checkout -b master upstream/master
> > git pull upstream master
> > # If everything looks OK...
> > git branch -d borked
> >
>
> i.e. Rename your local copy of the borked master, get a clean
> copy of the rewritten master, delete renamed borked master.
> Looks very sensible.
>

Yes. Plus: after pulling a clean copy of upstream/master, "git fetch
upstream" helps set things right again.

> Laurent sent me this link off-list, it sounds very complicated:
>
> http://www.kernel.org/pub/software/scm/git/docs/howto/revert-a-faulty-merge.txt
>

This part looks key:

If at all possible, for example, if you find a problem that got merged
into the main tree, rather than revert the merge, try _really_ hard to
bisect the problem down into the branch you merged, and just fix it,
or try to revert the individual commit that caused it.

> > Another suggestion for avoiding accidentally pushing weird changes to the
> > main repo: point your "master" branch at your personal fork on github
> > (normally called "origin"), rather than upstream. Then "git push" will do
> > the safe thing by default.
>
> i.e. Push to your personal github repository's master first? That way
> it won't harm the official repository?
>

Yeah, mainly for psychological reasons -- pushing to origin satisfies a
certain urge to publish new work, but typing "git push upstream master"
makes me think more carefully about whether a change set is ready for the
official repository.

-Eric
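
The suggestion discussed in this thread (have the local "master" track the
personal fork, so that a bare "git push" never reaches the official
repository) could be set up roughly as sketched below. These commands are
not taken from the messages themselves. The two bare repositories are
throwaway stand-ins for the GitHub remotes, and the remote names "origin"
and "upstream" are assumed to follow the convention described above. In an
existing clone only the two "git config" lines would be needed.

```shell
#!/bin/sh
# Sketch only: aim a bare "git push" on master at your personal fork.
# The bare repositories below are stand-ins for the GitHub remotes;
# in a real clone, "origin" and "upstream" would already exist.
set -e

work=$(mktemp -d)
git init -q --bare "$work/upstream.git"   # stand-in for the official repository
git init -q --bare "$work/origin.git"     # stand-in for your personal fork
git init -q "$work/clone"
cd "$work/clone"
git remote add upstream "$work/upstream.git"
git remote add origin "$work/origin.git"

# The actual change: have the local "master" branch track origin's master.
git config branch.master.remote origin
git config branch.master.merge refs/heads/master

# Read the settings back to confirm which remote master is now tied to.
tracked_remote=$(git config --get branch.master.remote)
tracked_branch=$(git config --get branch.master.merge)
echo "master -> $tracked_remote ($tracked_branch)"
```

With this in place, publishing to the official repository takes the explicit
"git push upstream master" mentioned above. Newer git releases also offer
"git branch --set-upstream-to=origin/master master", which should write the
same two settings once "origin" has been fetched.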