From p.j.a.cock at googlemail.com Mon Jan 7 13:55:25 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 7 Jan 2013 18:55:25 +0000 Subject: [Biopython-dev] Dropping Python 2.5 and Jython 2.5 support? In-Reply-To: References: Message-ID: On Mon, Oct 22, 2012 at 6:17 PM, Peter Cock wrote: > Dear Biopythoneers, > > Would anyone object to us preparing to drop support for Python 2.5 and > Jython 2.5, perhaps after the next Biopython release? > > To reassure those of you using Jython, we'd wait until Jython 2.7 is out > first. Jython 2.7 is already in alpha, and brings support for C Python 2.7 > language features. > > Thanks, > > Peter Hello all, Having recently back-ported some Python 3 code with a C extension to Python 2.6 and 2.7, I can now more clearly appreciate the benefits dropping Python 2.5 support has for writing code for both Python 2 and 3 - and am keen to be able to exploit this for Biopython. Given no major objections to the email I sent round in October last year (thank you for your input Nathan), we will press ahead with phasing out support for Python 2.5, provisionally supporting it in the forthcoming Biopython 1.61 and at least one more release (which would mean Biopython 1.62 due Summer 2013). https://github.com/biopython/biopython/commit/3f17f75b320fb6624d332809ef07314bab97477c My only significant concern is for Jython users, since this will also mean dropping support for Jython 2.5 (which implements the Python 2.5 language). The replacement Jython 2.7 is still only at the alpha release stage. Regards, Peter From kai.blin at biotech.uni-tuebingen.de Tue Jan 8 05:28:31 2013 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Tue, 08 Jan 2013 11:28:31 +0100 Subject: [Biopython-dev] More relaxed parsing of wonky GenBank files Message-ID: <50EBF4CF.9080901@biotech.uni-tuebingen.de> Hi folks, I've recently pushed into production use a new version of my software that uses BioPython parsers instead of our own hand-written parsers. 
One big thing we noticed is that BioPython is waaay more picky as to what a proper GenBank file is supposed to look like. Sadly, many of our users seem to be creating their GenBank files with programs that only have a rough understanding what the file format is supposed to look like. Most of the invalid input can safely be ignored, and I would propose to extend the GenBank parser to cope with the most common errors I'm seeing in day to day use. I'm happy to provide the patches, but before starting this work I'd like to make sure that they would be acceptable in principle. So, any reason to rather blow up in our user's face than to try and cope with invalid input? Cheers, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-Universität Tübingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 Tübingen Fax : ++49 7071 29-5979 Germany Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From mjldehoon at yahoo.com Tue Jan 8 06:11:46 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 8 Jan 2013 03:11:46 -0800 (PST) Subject: [Biopython-dev] More relaxed parsing of wonky GenBank files In-Reply-To: <50EBF4CF.9080901@biotech.uni-tuebingen.de> Message-ID: <1357643506.32308.YahooMailClassic@web164005.mail.gq1.yahoo.com> Entrez.parse has a "validate" argument to allow parsing of XML files that contain tags that are not represented in the corresponding DTD. If validate==True, the parser raises an Exception if any tags are missing. If False, then the parser will ignore missing tags. Maybe SeqIO.parse could have a similar "validate" argument? Best, -Michiel.
From p.j.a.cock at googlemail.com Tue Jan 8 08:27:20 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 8 Jan 2013 13:27:20 +0000 Subject: [Biopython-dev] More relaxed parsing of wonky GenBank files In-Reply-To: <50EBF4CF.9080901@biotech.uni-tuebingen.de> References: <50EBF4CF.9080901@biotech.uni-tuebingen.de> Message-ID: On Tuesday, January 8, 2013, Kai Blin wrote: > Hi folks, > > I've recently pushed into production use a new version of my software > that uses BioPython parsers instead of our own hand-written parsers. > > One big thing we noticed is that BioPython is waaay more picky as to > what a proper GenBank file is supposed to look like. Sadly, many of > our users seem to be creating their GenBank files with programs that > only have a rough understanding what the file format is supposed to > look like. Most of the invalid input can safely be ignored, and I > would propose to extend the GenBank parser to cope with the most > common errors I'm seeing in day to day use. > > I'm happy to provide the patches, but before starting this work I'd > like to make sure that they would be acceptable in principle. So, any > reason to rather blow up in our user's face than to try and cope with > invalid input? > > Cheers, > Kai > We already try to be tolerant, and issue warnings where it seems safe to take a broken file (e.g. Unrecognised first line, mismatch between length given in first line and actual sequence), but in these cases not all the mis-formed data will or can be parsed. Sometimes a file is broken to the point it is unwise to attempt to parse it any further and an exception is the best course of action. Clearly you've found a whole load more dodgy files.
If you can work out which buggy tools are producing them, please do try and report the issues to the tool authors. I know that BioEdit is one source, but maintenance of that popular free Windows tool stopped many years ago. If you can prepare some (small) example files illustrating the rule-breaking files (for testing), and with patches too if you like, I will certainly review them for inclusion. Note if the user wants an exception, they can use the warnings module to catch and upgrade our parser warnings. As Michiel pointed out, other bits of Biopython have an explicit validation or strict mode like the Entrez and PDB parsers. In the case of the PDB parser this just toggles between issuing warnings and raising exceptions. I'm not sure if the GenBank (and any other SeqIO parsers) need a validate/permissive option given this can already be achieved with the warnings module. After all, broken GenBank files should be in the minority. (My understanding of the Entrez setting is also about dealing with missing DTD files and cases where the NCBI has a bug and their XML and DTD disagree.) Peter From kai.blin at biotech.uni-tuebingen.de Tue Jan 8 08:55:42 2013 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Tue, 08 Jan 2013 14:55:42 +0100 Subject: [Biopython-dev] More relaxed parsing of wonky GenBank files In-Reply-To: References: <50EBF4CF.9080901@biotech.uni-tuebingen.de> Message-ID: <50EC255E.5040904@biotech.uni-tuebingen.de> On 2013-01-08 14:27, Peter Cock wrote: > We already try to be tolerant, and issue warnings where it seems > safe to take a broken file (e.g. Unrecognised first line, mismatch > between length given in first line and actual sequence), but in > these cases not all the mis-formed data will or can be parsed. > Sometimes a file is broken to the point it is unwise to attempt to > parse it any further and an exception is the best course of > action.
Yeah, I started looking into the code and realized that it already tries to handle a lot of special cases. > Clearly you've found a whole load more dodgy files. If you can work > out which buggy tools are producing them, please do try and report > the issues to the tool authors. I know that BioEdit is one source, > but maintenance of that popular free Windows tool stopped many > years ago. Unfortunately I often have no way to contact the uploaders of the broken sequence files, unless they chose to provide an email address. > If you can prepare some (small) example files illustrating the > rule-breaking files (for testing), and with patches too if you > like, I will certainly review them for inclusion. The two most common things I saw in the last week are single record files without the '//' end-of-record marker, and files where the sequence lines are indented by one space more than expected (my favourite). I've added two sample files for these issues, I'm currently working on patches that make them pass the tests. Thanks for the comments. I'll push to my github fork once I've got something. Cheers, Kai -- Dipl.-Inform.
Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-Universität Tübingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 Tübingen Fax : ++49 7071 29-5979 Germany Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From kai.blin at biotech.uni-tuebingen.de Tue Jan 8 09:36:03 2013 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Tue, 08 Jan 2013 15:36:03 +0100 Subject: [Biopython-dev] More relaxed parsing of wonky GenBank files In-Reply-To: <50EC255E.5040904@biotech.uni-tuebingen.de> References: <50EBF4CF.9080901@biotech.uni-tuebingen.de> <50EC255E.5040904@biotech.uni-tuebingen.de> Message-ID: <50EC2ED3.8000401@biotech.uni-tuebingen.de> On 2013-01-08 14:55, Kai Blin wrote: > Thanks for the comments. I'll push to my github fork once I've got > something. Pull request is at https://github.com/biopython/biopython/pull/145 Cheers, Kai -- Dipl.-Inform.
Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-Universität Tübingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 Tübingen Fax : ++49 7071 29-5979 Germany Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From redmine at redmine.open-bio.org Wed Jan 9 17:58:25 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 9 Jan 2013 22:58:25 +0000 Subject: [Biopython-dev] [Biopython - Bug #3403] (New) PDBList fails to download large PDB structures Message-ID: Issue #3403 has been reported by David Cain. ---------------------------------------- Bug #3403: PDBList fails to download large PDB structures https://redmine.open-bio.org/issues/3403 Author: David Cain Status: New Priority: High Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: https://github.com/DavidCain/biopython/tree/fix_pdb_dl The current @PDBList@ module will often fail to download large PDB files.
>>> from Bio.PDB import PDBList
>>> pdbl = PDBList()
>>> pdbl.retrieve_pdb_file("1hgg")
Downloading PDB structure '1hgg'...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.7/Bio/PDB/PDBList.py", line 247, in retrieve_pdb_file
    out.writelines(gz.read())
  File "/usr/lib/python2.7/gzip.py", line 249, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 303, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0x21d7a5f7 != 0x4b5eabb6L
>>>
The source of this problem is that the entire gzipped file must be read into memory before it's written to disk locally. With large archives, the local file can be truncated prematurely, which causes gzip to crash on extraction. I fixed this issue on my "GitHub branch":https://github.com/DavidCain/biopython/tree/fix_pdb_dl, which I've made a pull request for. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Wed Jan 9 18:08:28 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 9 Jan 2013 23:08:28 +0000 Subject: [Biopython-dev] [Biopython - Bug #3403] PDBList fails to download large PDB structures References: Message-ID: Issue #3403 has been updated by David Cain. (Pull request "here":https://github.com/biopython/biopython/pull/146) ---------------------------------------- Bug #3403: PDBList fails to download large PDB structures https://redmine.open-bio.org/issues/3403 Author: David Cain Status: New Priority: High Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: https://github.com/DavidCain/biopython/tree/fix_pdb_dl The current @PDBList@ module will often fail to download large PDB files.
>>> from Bio.PDB import PDBList
>>> pdbl = PDBList()
>>> pdbl.retrieve_pdb_file("1hgg")
Downloading PDB structure '1hgg'...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.7/Bio/PDB/PDBList.py", line 247, in retrieve_pdb_file
    out.writelines(gz.read())
  File "/usr/lib/python2.7/gzip.py", line 249, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 303, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0x21d7a5f7 != 0x4b5eabb6L
>>>
The source of this problem is that the entire gzipped file must be read into memory before it's written to disk locally. With large archives, the local file can be truncated prematurely, which causes gzip to crash on extraction. I fixed this issue on my "GitHub branch":https://github.com/DavidCain/biopython/tree/fix_pdb_dl, which I've made a pull request for. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Wed Jan 9 18:55:13 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 9 Jan 2013 23:55:13 +0000 Subject: [Biopython-dev] Fwd: [biopython] Fix broken downloading of large PDB structures (#146) In-Reply-To: References: Message-ID: FYI ---------- Forwarded message ---------- From: David Cain Date: Wed, Jan 9, 2013 at 10:59 PM Subject: [biopython] Fix broken downloading of large PDB structures (#146) To: biopython/biopython
Summary of changes
- Fix failure to download large PDB files
- Use with statements for safer file I/O
- Remove obsolete parameters
- PEP 8 changes, update documentation
Failure to download large PDB files (See: Redmine bug #3403)
The current PDBList module will often fail to download large PDB files.
>>> from Bio.PDB import PDBList
>>> pdbl = PDBList()
>>> pdbl.retrieve_pdb_file("1hgg")
...
IOError: CRC check failed 0x21d7a5f7 != 0x4b5eabb6L
>>>
The source of this problem is that the entire gzipped file must be read into memory before it's written to disk locally. Instead of this memory-intensive approach, I changed the downloading to use urllib.urlretrieve, which is more readable and far more efficient.
Obsolete parameters
The long-obsolete parameters to retrieve_pdb_file() have been removed. Formerly, the function allowed the user to specify compression and/or a system utility to perform decompression.
But all archives are now gzipped, and PDBList uses Python's gzip module to decompress archives. These parameters have been obsolete for over a year (they were marked deprecated with commit 7ebf6e9 ). ------------------------------ You can merge this Pull Request by running git pull https://github.com/DavidCain/biopython fix_pdb_dl Or view, comment on, or merge it at: https://github.com/biopython/biopython/pull/146 Commit Summary - Use urlretrieve to smartly download PDB archives - Use 'with' statement for safer file I/O - Collapse unwieldy if-else structure - PEP8 fixes within retrieve_pdb_file - Remove deprecated parameters - Update with clarifying comments - PEP8 fixes, updated comments for file - Use urlretrieve in other instance of save to disk File Changes - *M* Bio/PDB/PDBList.py (217) Patch Links: - https://github.com/biopython/biopython/pull/146.patch - https://github.com/biopython/biopython/pull/146.diff From mjldehoon at yahoo.com Thu Jan 10 04:21:34 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 10 Jan 2013 01:21:34 -0800 (PST) Subject: [Biopython-dev] Bio._utils iterlen not needed Message-ID: <1357809694.20781.YahooMailClassic@web164003.mail.gq1.yahoo.com> Dear all, As far as I can tell the iterlen function in Bio._utils is not needed. Simply calling len(items) does exactly what iterlen does, and is much faster too. For the other functions, are they important enough to warrant a separate module? From our previous experience in Biopython, these kinds of utility modules tend to be underused. This is because the functions are simple and therefore easy to replicate, and often they do not do exactly what is needed in a particular module. Similar utility modules in Biopython in the past were forgotten after a while, and then deprecated and removed. Best, -Michiel. 
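Regarding the PDBList fix forwarded above (#146): the memory concern in that description applies to the decompression step as well as to the download. The following is a minimal sketch of decompressing a gzipped file to disk in fixed-size chunks instead of calling read() on the whole archive. It is not the code from the pull request; `gunzip_to_file` and its `chunk_size` parameter are hypothetical names chosen for illustration, and only standard-library calls are used.

```python
import gzip
import shutil


def gunzip_to_file(gz_path, out_path, chunk_size=64 * 1024):
    """Decompress gz_path into out_path, copying chunk_size bytes at a time.

    Hypothetical helper for illustration only; not part of Bio.PDB.PDBList.
    """
    # Both files are closed when the block exits, even if gzip raises
    # (for example on a CRC mismatch from a truncated archive).
    with gzip.open(gz_path, "rb") as gz, open(out_path, "wb") as out:
        # copyfileobj reads and writes in chunks, so the whole
        # decompressed structure never has to sit in memory at once.
        shutil.copyfileobj(gz, out, chunk_size)
    return out_path
```

A truncated download would still fail the CRC check here; the sketch only avoids holding the full file in memory and leaves no file handle open when that happens.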
From p.j.a.cock at googlemail.com Thu Jan 10 08:03:50 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Jan 2013 13:03:50 +0000 Subject: [Biopython-dev] Bio._utils iterlen not needed In-Reply-To: <1357809694.20781.YahooMailClassic@web164003.mail.gq1.yahoo.com> References: <1357809694.20781.YahooMailClassic@web164003.mail.gq1.yahoo.com> Message-ID: On Thu, Jan 10, 2013 at 9:21 AM, Michiel de Hoon wrote: > > Dear all, > > As far as I can tell the iterlen function in Bio._utils is not needed. > Simply calling len(items) does exactly what iterlen does, and is much faster too. No, the raison d'être for iterlen is that you can't use len on an iterator, e.g.
>>> len(iter("abcde"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'iterator' has no len()
>>> from Bio._utils import iterlen
>>> iterlen(iter("abcde"))
5
Perhaps the function needs a little more documentation... > For the other functions, are they important enough to warrant > a separate module? From our previous experience in Biopython, > these kinds of utility modules tend to be underused. This is > because the functions are simple and therefore easy to > replicate, and often they do not do exactly what is needed > in a particular module. Similar utility modules in Biopython > in the past were forgotten after a while, and then deprecated > and removed. Note that Bio._utils has a leading underscore - these are therefore a 'private' API which we don't have to worry about maintaining and deprecating etc. in the same way as a public API. We don't expect end users to use this module ;) The functions here were originally helper functions used in Bio.Phylo which are now also used in Bio.SearchIO - I think a shared private module like this is a good compromise between code duplication and top level modules.
Peter From mjldehoon at yahoo.com Thu Jan 10 12:24:14 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 10 Jan 2013 09:24:14 -0800 (PST) Subject: [Biopython-dev] Bio._utils iterlen not needed In-Reply-To: Message-ID: <1357838654.1021.YahooMailClassic@web164001.mail.gq1.yahoo.com> --- On Thu, 1/10/13, Peter Cock wrote: > > Simply calling len(items) does exactly what iterlen does, and is much faster too. > > No, the raison d'être for iterlen is that you can't use len > on an iterator, e.g. > > >>> len(iter("abcde")) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > TypeError: object of type 'iterator' has no len() > You're right. Actually it depends on the iterator. For example, len(xrange(100)) works (xrange also returns an iterator). I guess in general an iterator can't have a len() function because it's not clear that the iterator will ever end. That said, currently the iterlen function is used in only one place, in Bio/Phylo/BaseTree.py as follows:
def count_terminals(self):
    return _utils.iterlen(self.find_clades(terminal=True))
But here you could simply have
def count_terminals(self):
    clades = self.find_clades(terminal=True)
    count = 0
    for clade in clades:
        count += 1
    return count
I don't see why we need a function iterlen for this, and if we do have such a function, why it should be in Bio._utils. Best, -Michiel. From p.j.a.cock at googlemail.com Thu Jan 10 16:16:12 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Jan 2013 21:16:12 +0000 Subject: [Biopython-dev] Bio._utils iterlen not needed In-Reply-To: <1357838654.1021.YahooMailClassic@web164001.mail.gq1.yahoo.com> References: <1357838654.1021.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Thu, Jan 10, 2013 at 5:24 PM, Michiel de Hoon wrote: > --- On Thu, 1/10/13, Peter Cock wrote: >> > Simply calling len(items) does exactly what iterlen >> > does, and is much faster too.
>> >> No, the raison d'être for iterlen is that you can't use len >> on an iterator, e.g. >> >> >>> len(iter("abcde")) >> Traceback (most recent call last): >> File "<stdin>", line 1, in <module> >> TypeError: object of type 'iterator' has no len() > > You're right. Actually it depends on the iterator. For example, > len(xrange(100)) works (xrange also returns an iterator). I guess > in general an iterator can't have a len() function because it's not > clear that the iterator will ever end. Good point - I didn't know xrange defined __len__, and you are right in general - other iterator objects could also do that: https://github.com/biopython/biopython/commit/57ae89cdedbc1e18495ffb615a3a1d2c9feb0296 > That said, currently the iterlen function is used in only one place, > in Bio/Phylo/BaseTree.py as follows: True. I hadn't checked that - I assumed it was used more than once. If there are no other natural places where it would make sense then yes, it might as well be done in line once, and Bio._utils.iterlen could be removed. When written, iterlen was in private module Bio.Phylo._sugar (CC'ing Eric) which Bow moved to Bio._utils as he wanted to use some of it in SearchIO. Peter From eric.talevich at gmail.com Thu Jan 10 16:50:45 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 10 Jan 2013 16:50:45 -0500 Subject: [Biopython-dev] Bio._utils iterlen not needed In-Reply-To: References: <1357838654.1021.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Thu, Jan 10, 2013 at 4:16 PM, Peter Cock wrote: > On Thu, Jan 10, 2013 at 5:24 PM, Michiel de Hoon > wrote: > > That said, currently the iterlen function is used in only one place, > > in Bio/Phylo/BaseTree.py as follows: > > True. I hadn't checked that - I assumed it was used more > than once. If there are no other natural places where it would > make sense then yes, it might as well be done in line once, > and Bio._utils.iterlen could be removed.
> > When written, iterlen was in private module Bio.Phylo._sugar > (CC'ing Eric) which Bow moved to Bio._utils as he wanted to > use some of it in SearchIO. > That's all true. I created _sugar.py during GSoC 2009 for utility code that Bio.Phylo needed, but wasn't related to trees in any way -- similar to Bow's thinking. I probably meant to get rid of the module entirely after the grand merge (hence the note at the top of _sugar.py to keep the file as small as possible). IIRC, I made it a separate function while testing whether "enumerate" or "cnt += 1" would be faster. I have no objections to getting rid of the function now. -E From mjldehoon at yahoo.com Fri Jan 11 07:36:15 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 11 Jan 2013 04:36:15 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone Message-ID: <1357907775.13851.YahooMailClassic@web164001.mail.gq1.yahoo.com> Hi everybody, Bio.ParserSupport has had a PendingDeprecationWarning since Biopython 1.59, so we may consider upgrading this to a BiopythonDeprecationWarning in Biopython 1.61 before removing Bio.ParserSupport. The only tricky point is that then we would also have to upgrade the PendingDeprecationWarning in Bio/Blast/NCBIStandalone.py to a BiopythonDeprecationWarning, as that code relies on Bio.ParserSupport. Bio.Blast.NCBIStandalone has had this PendingDeprecationWarning since Biopython release 1.56. Any objections? This may help giving Bow's Bio.SearchIO module some more prominence. On a related point, the fact that we are deprecating Bio.ParserSupport (which was a painful process) suggests that having a new module Bio._utils with a set of generic utility functions is not a good idea. Best, -Michiel. 
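For readers less familiar with the warning levels being discussed, here is a minimal standard-library sketch of how a deprecation warning is emitted and how calling code can either record it or promote it to an exception with the warnings module. `ExampleDeprecationWarning` and `legacy_parser` are hypothetical stand-ins, not Biopython names, and deriving the stand-in class directly from `Warning` is an assumption made for this illustration.

```python
import warnings


class ExampleDeprecationWarning(Warning):
    """Hypothetical stand-in for a project-specific deprecation warning."""


def legacy_parser(text):
    """Hypothetical deprecated helper: it warns, then still does its job."""
    warnings.warn(
        "legacy_parser is deprecated; use the newer parser instead",
        ExampleDeprecationWarning,
        stacklevel=2,
    )
    return text.split()


def call_recorded(text):
    """Call legacy_parser and return (result, list of warning messages)."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" makes sure the warning is not swallowed by earlier filters.
        warnings.simplefilter("always")
        result = legacy_parser(text)
    return result, [str(w.message) for w in caught]


def call_strict(text):
    """Call legacy_parser with the warning promoted to an exception."""
    with warnings.catch_warnings():
        # "error" turns matching warnings into raised exceptions.
        warnings.simplefilter("error", ExampleDeprecationWarning)
        return legacy_parser(text)
```

The same filter mechanism is what lets a user who prefers hard failures upgrade a library's warnings without the library needing a separate strict mode.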
From p.j.a.cock at googlemail.com Fri Jan 11 10:33:05 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 11 Jan 2013 15:33:05 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1357907775.13851.YahooMailClassic@web164001.mail.gq1.yahoo.com> References: <1357907775.13851.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Fri, Jan 11, 2013 at 12:36 PM, Michiel de Hoon wrote: > Hi everybody, > > Bio.ParserSupport has had a PendingDeprecationWarning since Biopython > 1.59, so we may consider upgrading this to a BiopythonDeprecationWarning in > Biopython 1.61 before removing Bio.ParserSupport. The only tricky point is > that then we would also have to upgrade the PendingDeprecationWarning in > Bio/Blast/NCBIStandalone.py to a BiopythonDeprecationWarning, as that code > relies on Bio.ParserSupport. Bio.Blast.NCBIStandalone has had this > PendingDeprecationWarning since Biopython release 1.56. > > Any objections? This may help giving Bow's Bio.SearchIO module some more > prominence. Bow's SearchIO is using Bio.Blast.NCBIStandalone to handle plain text, https://github.com/biopython/biopython/blob/master/Bio/SearchIO/BlastIO/blast_text.py We'd discussed a new parser targeting just the plain text from BLAST+ (and if not too different maybe the final legacy BLAST release), which should be less diverse than the current range of BLAST quirks built up over the years. > On a related point, the fact that we are deprecating Bio.ParserSupport > (which was a painful process) suggests that having a new module Bio._utils > with a set of generic utility functions is not a good idea. That's why Bio._utils is a private module - we can drop/change/etc this without worrying about breaking other people's code. The issue with Bio.ParserSupport is it was a public API.
Regards, Peter From w.arindrarto at gmail.com Sun Jan 13 10:22:13 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sun, 13 Jan 2013 16:22:13 +0100 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: References: <1357907775.13851.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: Hi everyone, >> Bio.ParserSupport has had a PendingDeprecationWarning since Biopython >> 1.59, so we may consider upgrading this to a BiopythonDeprecationWarning in >> Biopython 1.61 before removing Bio.ParserSupport. The only tricky point is >> that then we would also have to upgrade the PendingDeprecationWarning in >> Bio/Blast/NCBIStandalone.py to a BiopythonDeprecationWarning, as that code >> relies on Bio.ParserSupport. Bio.Blast.NCBIStandalone has had this >> PendingDeprecationWarning since Biopython release 1.56. >> >> Any objections? This may help giving Bow's Bio.SearchIO module some more >> prominence. > > Bow's SearchIO is using Bio.Blast.NCBIStandalone to handle plain text, > https://github.com/biopython/biopython/blob/master/Bio/SearchIO/BlastIO/blast_text.py > > We'd discussed a new parser targeting just the plain text from BLAST+ > (and if not too different maybe the final legacy BLAST release), which > should be less diverse that the current range of BLAST quirks built up > over the years. Yes. Until such a parser is ready, Bio.ParserSupport is still needed. We may still deprecate it from the visible / public namespace and move it into a private module, though. If we are also deprecating Bio.BLAST, then moving Bio.BLAST.NCBIStandalone into a private module as well seems like an ok fix for the time being. regards, Bow From p.j.a.cock at googlemail.com Tue Jan 15 10:28:07 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 15 Jan 2013 15:28:07 +0000 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? 
In-Reply-To:
References:
Message-ID:

On Fri, Dec 14, 2012 at 12:48 PM, Wibowo Arindrarto wrote:
> Hi everyone,
>
>>> It's reproducible in my machine: Arch Linux 64 bit running
>>> Python3.1.5. Haven't figured out a fix yet, but trying to see if I
>>> can.
>>
>> Great. We haven't really proved this is down to a change in
>> either Python 3.1.4 or 3.1.5 but it does look likely.
>
> It's reproduced in my local 3.1.4 installation. Seems like an unfixed
> bug that went through to 3.1.5.

Regarding this issue with test_Emboss.py,

    AttributeError: '_io.FileIO' object has no attribute 'read1'

http://lists.open-bio.org/pipermail/biopython-dev/2012-December/010156.html

I've now tried downgrading Python 3.1 on this machine, and it does
seem to be a problem under Python 3.1.4 and 3.1.5 but not 3.1.3.
For now I have simply left this buildslave running 3.1.3 instead.

I will also downgrade Python 3.1 on the second 64 bit Linux server.
That should take care of the annoying buildbot failures (and the
daily email I've been getting).

This thread may help someone else with a similar issue, but I
don't feel inclined to try and explore in any more depth what
exactly is going wrong under Python 3.1.4 and 3.1.5, and if
there is a Python bug we should report.

Regards,

Peter

From kai.blin at biotech.uni-tuebingen.de Tue Jan 15 10:54:45 2013
From: kai.blin at biotech.uni-tuebingen.de (Kai Blin)
Date: Tue, 15 Jan 2013 16:54:45 +0100
Subject: [Biopython-dev] More 'fun' with GenBank
Message-ID: <50F57BC5.7020607@biotech.uni-tuebingen.de>

Hi folks,

as people are hitting my web service with all sorts of wonky GenBank
files, I've stumbled over another one that throws the GenBank parser off
track.

The culprit is a SeqFeature with a location line like:

     CDS             join(complement(4093..4338),complement(3876..4011),
                     complement(3655..3809),complement(3284..3585),
                     complement(2421..2813),complement(2057..2303))

Now, the way I read the GenBank spec, this is not a valid location line,
but should instead be a complement() of joins(). Unfortunately, the NCBI
seems to disagree with its own specs, and put the record into their
Nucleotide database as CABT02000004, which means that for all practical
purposes, it _is_ a valid GenBank file and the parser should cope.

The parser looks at this location and creates a feature on the -1
strand, from 4092:2303. This is caused by the feature location
calculation on
https://github.com/biopython/biopython/blob/master/Bio/GenBank/__init__.py#L1049
and the lines after.

In short, we do

    s = cur_feature.sub_features[0].location.start
    e = cur_feature.sub_features[-1].location.end
    cur_feature.location = SeqFeature.FeatureLocation(s, e, strand)

And when the join() looks like the record I'm dealing with, this is
clearly the wrong way around.

I decided to fix this by sorting the subfeatures by start, end
coordinates, and that fixes this issue for me. Unfortunately, this also
breaks an existing test, the extra_keywords.gb test.

https://github.com/biopython/biopython/blob/master/Tests/GenBank/extra_keywords.gb#L647
has a feature that has a location of

     CDS             join(153490..154269,AL121804.2:41..610,
                     AL121804.2:672..1487)

Here, we probably do want the feature from 153489:1487, even though I'm
not sure how useful such a location really is.

So I decided to fix this by sorting the subfeatures first on their ref,
and then on start, end. This again breaks a test, this time in one_of.gb

https://github.com/biopython/biopython/blob/master/Tests/GenBank/one_of.gb#L39
where the location line is

     CDS             join(2201..2479,U18267.1:120..246,U18268.1:130..288,
                     U18270.1:4691..4788,U18269.1:82..>128)

Here, the U18270.1 record seems to come before the U18269.1 record.
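
To make the first/last versus sorted computation above concrete, here is a
small sketch on plain (start, end) pairs. It is not the parser code and not
the attached patch; the function names are hypothetical, and it only covers
the single-record case (no AL121804.2-style references).

```python
# Sub-feature spans as 0-based, end-exclusive (start, end) pairs, taken
# from the join(complement(...), ...) location of CABT02000004 quoted above.
parts = [(4092, 4338), (3875, 4011), (3654, 3809),
         (3283, 3585), (2420, 2813), (2056, 2303)]


def span_first_last(parts):
    """Start of the first part and end of the last part (order dependent)."""
    return parts[0][0], parts[-1][1]


def span_min_max(parts):
    """Smallest start and largest end (order independent)."""
    return min(s for s, _ in parts), max(e for _, e in parts)


print(span_first_last(parts))          # (4092, 2303)
print(span_min_max(parts))             # (2056, 4338)
print(span_first_last(sorted(parts)))  # (2056, 4338)
```

Sorting and taking first/last gives the same span as min/max here; the two
only differ in whether the stored order of the sub-features is changed.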

Now, we're again spanning a feature across multiple contigs, none of
which are accessible to the extract() function as far as I'm aware.
Sorting the locations by start, end (and maybe ref first) at least
fixes the case that breaks for CABT02000004, where we have a chance of
getting extract() to work.

The attached patch is my proposed change, but I wanted to get some
feedback first before opening a bug and/or submitting a pull request.

Cheers,
Kai

-- 
Dipl.-Inform. Kai Blin         kai.blin at biotech.uni-tuebingen.de
Institute for Microbiology and Infection Medicine
Division of Microbiology/Biotechnology
Eberhard-Karls-Universität Tübingen
Auf der Morgenstelle 28                 Phone : ++49 7071 29-78841
D-72076 Tübingen                        Fax :   ++49 7071 29-5979
Germany
Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-GenBank-Sort-subfeatures-by-ref-and-start-end-positi.patch
Type: text/x-patch
Size: 9059 bytes
Desc: not available
URL:

From p.j.a.cock at googlemail.com Tue Jan 15 11:41:32 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 15 Jan 2013 16:41:32 +0000
Subject: [Biopython-dev] More 'fun' with GenBank
In-Reply-To: <50F57BC5.7020607@biotech.uni-tuebingen.de>
References: <50F57BC5.7020607@biotech.uni-tuebingen.de>
Message-ID:

On Tue, Jan 15, 2013 at 3:54 PM, Kai Blin wrote:
> Hi folks,
>
> as people are hitting my web service with all sorts of wonky GenBank
> files, I've stumbled over another one that throws the GenBank parser off
> track.
>
> The culprit is a SeqFeature with a location line like:
>
>      CDS             join(complement(4093..4338),complement(3876..4011),
>                      complement(3655..3809),complement(3284..3585),
>                      complement(2421..2813),complement(2057..2303))
>
> Now, the way I read the GenBank spec, this is not a valid location line,
> but should instead be a complement() of joins().
> Unfortunately, the NCBI
> seems to disagree with its own specs, and put the record into their
> Nucleotide database as CABT02000004, which means that by all practical
> purposes, it _is_ a valid GenBank file and the parser should cope.

That should work - for a while GenBank and EMBL didn't agree about
joins on the complement strand, one did complement(join(a..b,c..d))
and the other join(complement(c..d),complement(a..b)), notice the
order of the sub-regions flips.

> The parser looks at this location and creates a feature on the -1
> strand, from 4092:2303. This is caused by the feature location
> calculation on
> https://github.com/biopython/biopython/blob/master/Bio/GenBank/__init__.py#L1049
> and the lines after.
>
> In short, we do
>     s = cur_feature.sub_features[0].location.start
>     e = cur_feature.sub_features[-1].location.end
>     cur_feature.location = SeqFeature.FeatureLocation(s, e, strand)

For join feature locations, the sub-feature locations should be fine
but the overall feature location is a bit weird/broken for negative
and mixed strands.

This was one of the things the re-factoring on this branch aimed to
fix, https://github.com/peterjc/biopython/tree/f_loc4/
http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html

I was intending to bring this up again after the next release (which
could be later this month or February 2013), but perhaps it would
be worth doing now?

Peter

From arklenna at gmail.com Tue Jan 15 12:19:48 2013
From: arklenna at gmail.com (Lenna Peterson)
Date: Tue, 15 Jan 2013 12:19:48 -0500
Subject: [Biopython-dev] More 'fun' with GenBank
In-Reply-To:
References: <50F57BC5.7020607@biotech.uni-tuebingen.de>
Message-ID:

+1 for f_loc4. The FeatureLocation/CompoundLocation classes will hopefully
make handling joins and other GenBank operators a little more logical. Not
to mention my CoordinateMapper is based on this branch!

Lenna

From p.j.a.cock at googlemail.com Tue Jan 15 14:03:51 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 15 Jan 2013 19:03:51 +0000
Subject: [Biopython-dev] More 'fun' with GenBank
In-Reply-To:
References: <50F57BC5.7020607@biotech.uni-tuebingen.de>
Message-ID:

On Tue, Jan 15, 2013 at 5:19 PM, Lenna Peterson wrote:
> +1 for f_loc4. The FeatureLocation/CompoundLocation classes will hopefully
> make handling joins and other GenBank operators a little more logical. Not
> to mention my CoordinateMapper is based on this branch!
>
> Lenna

It will need a bit of work to rebase (some of the PEP8 changes have
touched the same lines of code), but I will try and do that this week.

Peter

From antony.lee at berkeley.edu Tue Jan 15 16:45:19 2013
From: antony.lee at berkeley.edu (Antony Lee)
Date: Tue, 15 Jan 2013 13:45:19 -0800
Subject: [Biopython-dev] Circular sequences
Message-ID: <20130115214519.GC8511@gmail.com>

Hi all,

While working on a (more sane?) rewrite of the Restriction library
(https://github.com/biopython/biopython/pull/148), I found the need to
add a circular/linear attribute to sequence objects (just as the
currently existing Restriction library does). So I quickly added such a
class, independently of whatever Biopython currently provides. But it
seems like the module would be better integrated in the rest of
Biopython if it used Bio.Seq.Seq instead.

I saw that CircularSeqs have already been discussed on the mailing list,
and the main issue was with indexing and slicing. So here are my
thoughts about how such an object should behave.

Assume a circular seq s of length 10. Simple indexing works modulo 10
(and negative indices work identically). Methods that return one or
more indices return the indices modulo 10. Slicing with both ends
defined (i.e. s[x:y(:z)]) wraps as many times as needed around the
sequence if y >= x, and makes at most one complete cycle if y < x (i.e.
add len(s) as many times as needed to y to make it bigger than x, and
stop there). Slicing with one or both ends undefined (i.e. s[:], s[x:],
s[:y]) raises an IndexError (because, well, I read s[x:] as "return the
elements of s starting from the x'th until the end"... but there is no
such end.). (A second option would be to return an infinite iterable
for s[x:], but that doesn't take care of s[:y] anyways, not to mention
the bugs that may appear from that.)

A few other issues were addressed in the previous thread. I think that
adding CircularSeqs does not make sense at all (so __add__ raises a
ValueError), and translation can either check for the presence of a stop
codon and raise ValueError otherwise, or return an infinite iterator.

Another thing that may be useful for a restriction analysis library is a
good way to represent a dsDNA sequence with some overhangs.

Any thoughts?

Antony

From kai.blin at biotech.uni-tuebingen.de Wed Jan 16 03:28:06 2013
From: kai.blin at biotech.uni-tuebingen.de (Kai Blin)
Date: Wed, 16 Jan 2013 09:28:06 +0100
Subject: [Biopython-dev] More 'fun' with GenBank
In-Reply-To:
References: <50F57BC5.7020607@biotech.uni-tuebingen.de>
Message-ID: <50F66496.8000109@biotech.uni-tuebingen.de>

On 2013-01-15 20:03, Peter Cock wrote:

Hi Peter,

> It will need a bit of work to rebase (some of the PEP8 changes have
> touched the same lines of code), but I will try and do that this week.

Your f_loc4 branch certainly fixes the problem I'm seeing. Is there
anything I can do to help with getting it merged? I'm happy to give a
closer look at the rebase conflicts coming up during the merge if you
don't mind me asking the occasional question if I can't work out reasons
for a code change from the commit messages.

Cheers,
Kai

-- 
Dipl.-Inform. Kai Blin         kai.blin at biotech.uni-tuebingen.de
Institute for Microbiology and Infection Medicine
Division of Microbiology/Biotechnology
Eberhard-Karls-University of Tübingen
Auf der Morgenstelle 28                 Phone : ++49 7071 29-78841
D-72076 Tübingen                        Fax :   ++49 7071 29-5979
Deutschland
Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben

From Markus.Piotrowski at ruhr-uni-bochum.de Wed Jan 16 04:42:54 2013
From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski)
Date: 16 Jan 2013 10:42:54 +0100
Subject: [Biopython-dev] Circular sequences
In-Reply-To: <20130115214519.GC8511@gmail.com>
References: <20130115214519.GC8511@gmail.com>
Message-ID: <50F6761E.9000606@ruhr-uni-bochum.de>

On 15.01.2013 22:45, Antony Lee wrote:
> needed to y to make it bigger than x, and stop there). Slicing with one
> or both ends undefined (ie. s[:], s[x:], s[:y]) raises an IndexError
> (because, well, I read s[x:] as "return the elements of s starting from
> the x'th until the end"... but there is no such end.). (A second option
> would be to return an infinite iterable for s[x:], but that doesn't take
> care of s[:y] anyways, not to mention the bugs that may appear from
> that.)

Another possibility, which makes some biological sense (thinking of
restriction), would be that s[x:] (or s[:y]) returns a linear sequence
starting at x and ending with x-1 (or ending with y and starting at
y+1). Thus, s[x:] would mean 'cut my circle at x and return the linear
sequence starting at x'.

Markus

From p.j.a.cock at googlemail.com Wed Jan 16 05:24:13 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 16 Jan 2013 10:24:13 +0000
Subject: [Biopython-dev] Circular sequences
In-Reply-To: <50F6761E.9000606@ruhr-uni-bochum.de>
References: <20130115214519.GC8511@gmail.com> <50F6761E.9000606@ruhr-uni-bochum.de>
Message-ID:

For those that missed it last time, I think the most recent in depth
discussion about circular sequences and slicing was here:

http://lists.open-bio.org/pipermail/biopython/2011-March/007075.html
...
http://lists.open-bio.org/pipermail/biopython/2011-March/007085.html

On Wed, Jan 16, 2013 at 9:42 AM, Markus Piotrowski wrote:
> Am 15.01.2013 22:45, schrieb Antony Lee:
>
>> needed to y to make it bigger than x, and stop there). Slicing with one
>> or both ends undefined (ie. s[:], s[x:], s[:y]) raises an IndexError
>> (because, well, I read s[x:] as "return the elements of s starting from
>> the x'th until the end"... but there is no such end.). (A second option
>> would be to return an infinite iterable for s[x:], but that doesn't take
>> care of s[:y] anyways, not to mention the bugs that may appear from
>> that.)
>
>
> Another possibility, which makes some biological sense (thinking on
> restriction), would be that
> s[x:] (or s[:y]) returns a linear sequence starting at x and ending with x-1
> (or ending with y and starting at y+1). Thus, s[x:] would mean 'cut my
> circle at x and return the linear sequence starting at x'.

That's exactly the kind of behaviour which would make me nervous
given in general the Biopython sequence objects mimic Python strings.
There are many examples where that 'extra' sequence would be
unexpected. For instance, writing out line wrapped sequence data.

I would prefer an explicit method like 'cut' on a circular sequence
object returning a full length linear sequence. Similarly a 'roll' or
'rotate' method could shift the origin to a new coordinate.
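
As a rough illustration of what explicit 'cut' and 'rotate' methods might look
like, here is a minimal sketch on plain strings. The CircularSeq class and its
method names are hypothetical; nothing like this exists in Biopython as far as
this thread shows.

```python
class CircularSeq(object):
    """Hypothetical circular sequence wrapper; not part of Biopython."""

    def __init__(self, data):
        self.data = str(data)

    def __len__(self):
        return len(self.data)

    def rotate(self, origin):
        """Return a new circular sequence whose origin is at `origin`."""
        # Reduce modulo the length so any integer coordinate is accepted.
        k = origin % len(self.data) if self.data else 0
        return CircularSeq(self.data[k:] + self.data[:k])

    def cut(self, position):
        """Open the circle at `position`; return a full-length linear string."""
        return self.rotate(position).data


plasmid = CircularSeq("ABCDEFGHIJ")
print(plasmid.cut(3))           # DEFGHIJABC
print(plasmid.rotate(3).data)   # DEFGHIJABC
```

The two methods share the same arithmetic; the difference is only in the
return type, which keeps ordinary slicing free of any wrap-around surprises.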

One simple solution to the complexities of the slice behaviour is
the practical one: They act like Python strings, basically all we
would be adding would be an 'is circular' flag and some logic about
how to propagate that flag in operations like addition and slicing.
If we went that route it might still be possible to make the find and
'in' functionality origin aware... but that may just cause trouble.

This would solve where to store if a sequence is circular (e.g. when
reading GenBank and EMBL files - or for handling restriction
enzyme digests), but other than that not add much utility.

Thoughts?

Peter

From antony.lee at berkeley.edu Wed Jan 16 14:09:32 2013
From: antony.lee at berkeley.edu (Antony Lee)
Date: Wed, 16 Jan 2013 11:09:32 -0800
Subject: [Biopython-dev] Circular sequences
In-Reply-To:
References: <20130115214519.GC8511@gmail.com> <50F6761E.9000606@ruhr-uni-bochum.de>
Message-ID: <20130116190932.GA1962@gmail.com>

I think the proposed behaviour makes biological sense (now s[x:] and
s[:y] mean "cut the sequence before x (or before y) and keep the
downstream (or upstream) sequence, whatever it is"). But I understand
Peter's concerns as well. A quick grep showed me around 400 instances
of "[:" showing up in the current code base, and as many ":]", and most
of them seem to be related to string (as opposed to sequence) processing
so checking these may not be impossible (though not very fun of course),
but this won't protect against future mis-uses of sequence indexing.

So I think methods such as cut and roll are fine too (and go back to
raising ValueError when either or both ends of the slice are None). Now
it would be the responsibility of sequence-consuming functions to start
by .cut()ting the sequence before slicing it.
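
One way an origin-aware 'in' test could work is to search the sequence
extended by a short wrap-around prefix. The sketch below uses plain strings
and a hypothetical helper name; treating queries longer than the circle as
absent is an assumption made here, not something the thread settles.

```python
def circular_contains(circular, query):
    """Return True if `query` occurs in `circular`, allowing wrap-around."""
    if not query:
        return True
    if len(query) > len(circular):
        # Assumption: a match may not wrap onto itself.
        return False
    # A match crossing the origin needs at most len(query) - 1 extra letters.
    extended = circular + circular[:len(query) - 1]
    return query in extended


print("AAGA" in "GAATTCAAAA")                   # False
print(circular_contains("GAATTCAAAA", "AAGA"))  # True
```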

find and __contains__ can be implemented easily (though perhaps
inelegantly) by changing "foo in circular(bar)" into "foo in linear(bar)
+ linear(bar)[:len(foo)-1]" (which is essentially what is done in both
Restriction libraries, the old and the new one).

Finally let me say that right now I don't use most of the rest of
Biopython (and don't really think I'll use most of it in the near
future) so I care little about whether this specific feature gets
integrated or not; however I do think it is needed in a proper
restriction analysis library. Indeed, one could say that we just have
to add a "circular=True|False" keyword argument to methods such as
search and catalyze, but that is not enough to distinguish e.g. if a
circular plasmid is digested once or not at all (of course, one can
check separately but what I mean there is that circularity is a natural
"output" of the functions, not just input).

Antony

From redmine at redmine.open-bio.org Fri Jan 18 04:43:26 2013
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 18 Jan 2013 09:43:26 +0000
Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets
References:
Message-ID:

Issue #3395 has been updated by Michiel de Hoon.

Michał, can you confirm that the fixed Bio.trie works for you? Then we
can close this bug report.

----------------------------------------
Bug #3395: Biopython trie implementation can't load large data sets
https://redmine.open-bio.org/issues/3395

Author: Michał Nowotka
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version:
URL:

Imagine I have Biopython trie:

    from Bio import trie
    import gzip

    f = gzip.open('/tmp/trie.dat.gz', 'w')
    tr = trie.trie()
    #fill in the trie
    trie.save(f, tr)

Now /tmp/trie.dat.gz is about 50MB. Let's try to read it:

    from Bio import trie
    import gzip

    f = gzip.open('/tmp/trie.dat.gz', 'r')
    tr = trie.load(f)

Unfortunately I'm getting meaningless error saying:

    "loading failed for some reason"

Any hints?

-- 
You have received this notification because you have either subscribed
to it, or are involved in it. To change your notification preferences,
please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Fri Jan 18 10:17:43 2013
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 18 Jan 2013 15:17:43 +0000
Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets
References:
Message-ID:

Issue #3395 has been updated by Michał Nowotka.

Can you just give me two more weeks? I need some time to evaluate it.

From eric.talevich at gmail.com Fri Jan 18 20:20:11 2013
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 18 Jan 2013 20:20:11 -0500
Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo
In-Reply-To:
References:
Message-ID:

On Fri, Dec 28, 2012 at 10:50 AM, Ben Morris wrote:
> On Tue, Dec 25, 2012 at 2:18 AM, Eric Talevich wrote:
> >
> > On Mon, Dec 24, 2012 at 8:58 AM, Ben Morris wrote:
> >>
> >> Hi all,
> >>
> >> I've implemented support for two new phylogenetic tree formats: NeXML and
> >> RDF (conforming to the Comparative Data Analysis Ontology).
> >>
> >> I noticed that NeXML support was planned, but I didn't see anyone working
> >> on it on GitHub and the feature request hadn't been updated in about a
> >> year, so I went ahead and implemented a simple version. At first I tried
> >> the generateDS.py approach, but the generated writer doesn't give very much
> >> control over the output, so I ended up writing my own parser/writer using
> >> ElementTree.
> >>
> >> As for the RDF/CDAO format, AFAIK this is not a format that's supported by
> >> any other phylogenetic libraries, so I'm not sure how useful this is to
> >> everyone else. It provides a simple, standards-compliant format that can be
> >> imported to a triple store and supports annotation. We'll be using it at
> >> NESCent so I wanted to make it available to everyone else as well. The
> >> parser and writer require the Redlands Python bindings.
> >>
> >> The code is available in my fork of Biopython,
> >>
> >> https://github.com/bendmorris/biopython
> >>
> >> under branches "cdao" and "nexml." I'd love to get everyone's thoughts and
> >> see if these contributions would be a good fit for the Biopython project.
> >
> >
> >
> > Thanks for letting us know! I'll try it out soonish. Looking at the code
> > on your nexml branch, I have a few comments:
> >
> > - The parser uses ElementTree.parse rather than iterparse, so in its
> > current state it would not be able to parse massive files (those larger
> > than available RAM). Worth fixing eventually?
>
> Great point. I rewrote it to use iterparse instead.
>
> > - The parser creates Newick.Tree and Newick.Clade objects, which is
> > nearly correct in my opinion. I would suggest subclassing BaseTree.Tree and
> > BaseTree.Clade to create NeXML-specific Tree and Clade classes, even if you
> > don't have any additional attributes to attach to those classes at the
> > moment. (These would go in a new file NeXML.py, similar to PhyloXML.py and
> > PhyloXMLIO.py.)
>
> Went ahead and did this as well.
>

Thanks! Sorry for the pace of this, I'm in the midst of a dissertation.

> - The 'confidence' or 'confidences' attribute isn't used (for e.g.
> bootstrap support values). Does NeXML define it?
>
> Not that I'm aware of, but I'm not sure. I searched
> http://nexml.org/nexml/html/doc/schema-1/ and didn't find anything.
> I'm going to ask some people who know more about this than I do.
>

I would like for Bio.Phylo's I/O modules to be able to successfully
round-trip a file from Newick to phyloXML to NeXML and back to Newick
without losing support values.
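
On the parse-versus-iterparse point quoted above, the following is a minimal
sketch of streaming with the standard library. The element names are a toy
example, not real NeXML (which is namespaced), and count_elements is a
hypothetical helper rather than anything from the nexml branch.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(handle, tag):
    """Stream an XML file, counting `tag` elements without keeping them."""
    count = 0
    # "end" events fire once an element and all of its children are read.
    for event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # drop the element's children, text and attributes
    return count


doc = b"<trees><tree id='t1'><node/><node/></tree><tree id='t2'/></trees>"
print(count_elements(io.BytesIO(doc), "tree"))  # 2
```

Note that the root still keeps references to the emptied elements, so a
parser for very large files would also need to detach them from the root.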

I found these two examples of how to add this data to a NeXML document
by referencing CDAO:

https://www.nescent.org/wg_evoinfo/NeXML_Test_Files#Bootstraps_represented_using_the_.22meta.22_tag
https://www.nescent.org/wg_evoinfo/NeXML_Test_Files#Bootstraps_represented_without_new_tags_or_elements

That's the standard way to store bootstrap supports in NeXML (Hilmar
confirms).

How do your NeXML and CDAO modules interact, if at all? Would the CDAO
modules be useful to properly support NeXML metadata like
support/confidence values, or would it be simpler to just hard-code the
few tags we're specifically interested in?

Relatedly, those look like good test files. I see you've started writing
NeXML unit tests already; if you would like help with any of this, just
let me know.

-Eric

From mjldehoon at yahoo.com Sun Jan 20 02:30:24 2013
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 19 Jan 2013 23:30:24 -0800 (PST)
Subject: [Biopython-dev] Bio.Motif update
Message-ID: <1358667024.24762.YahooMailClassic@web164004.mail.gq1.yahoo.com>

Dear all,

As we discussed previously, I've been going over Bio.Motif to update it
and make its usage more explicit. I'm pretty much done. While I have
been uploading my changes to the main biopython github repository, this
does not mean that these changes are final; comments and suggestions for
changes are welcome.

In many cases, there is a difference in the syntax between the old
Bio.Motif and the new Bio.Motif. For example, motif.consensus is a
method in the old Bio.Motif, but a property in the new Bio.Motif.
While I tried to put PendingDeprecationWarnings on all changes
consistently, there may be some corner cases that I missed.
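
To show what a method-to-property change means for calling code, here is a
toy sketch. Both classes are hypothetical stand-ins and the consensus rule
(most common letter per column, ties broken alphabetically) is an assumption
for illustration, not the Bio.Motif implementation.

```python
from collections import Counter


def _consensus(instances):
    # Most common letter per column; ties broken alphabetically.
    return "".join(
        min(Counter(col).items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for col in zip(*instances)
    )


class OldStyleMotif(object):
    """Toy stand-in: consensus is a method, called as m.consensus()."""

    def __init__(self, instances):
        self.instances = list(instances)

    def consensus(self):
        return _consensus(self.instances)


class NewStyleMotif(object):
    """Toy stand-in: consensus is a read-only property, used as m.consensus."""

    def __init__(self, instances):
        self.instances = list(instances)

    @property
    def consensus(self):
        return _consensus(self.instances)


sites = ["ACGT", "ACGA", "TCGT"]
print(OldStyleMotif(sites).consensus())  # ACGT
print(NewStyleMotif(sites).consensus)    # ACGT
```

Old-style calls such as m.consensus() against the property version fail with
a TypeError, because the returned string is not callable; this is the kind of
silent API break a separate namespace would avoid.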

For this reason, and also to make the documentation more understandable,
it may be better to put the new Bio.Motif code in a module Bio.motifs,
to put the old Bio.Motif code back into Bio.Motif (so that Bio.Motif in
release 1.61 will be identical to the Bio.Motif in release 1.60), and
(assuming that we are happy with the new Bio.motifs modules) put a
PendingDeprecationWarning on Bio.Motif as a whole. Then in the
documentation we'll have one chapter on Bio.Motif and one chapter on
Bio.motifs. Also we'll have one set of tests for Bio.Motif, and one set
of tests for Bio.motifs.

Any objections to creating a separate Bio.motifs module?

Here you can find the relevant chapter in the current documentation on
the new Bio.Motif:

http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html#htoc190

Best,
-Michiel

From p.j.a.cock at googlemail.com Sun Jan 20 14:03:45 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 20 Jan 2013 19:03:45 +0000
Subject: [Biopython-dev] More 'fun' with GenBank
In-Reply-To: <50F66496.8000109@biotech.uni-tuebingen.de>
References: <50F57BC5.7020607@biotech.uni-tuebingen.de> <50F66496.8000109@biotech.uni-tuebingen.de>
Message-ID:

On Wed, Jan 16, 2013 at 8:28 AM, Kai Blin wrote:
> On 2013-01-15 20:03, Peter Cock wrote:
>
> Hi Peter,
>
>> It will need a bit of work to rebase (some of the PEP8 changes have
>> touched the same lines of code), but I will try and do that this week.
>
> Your f_loc4 branch certainly fixes the problem I'm seeing. Is there
> anything I can do to help with getting it merged? I'm happy to give a
> closer look at the rebase conflicts coming up during the merge if you
> don't mind me asking the occasional question if I can't work out reasons
> for a code change from the commit messages.
>
> Cheers,
> Kai

I've done the rebase - all the tests still pass so if I missed
anything it should just be minor:

https://github.com/peterjc/biopython/commits/f_loc4 (old)
https://github.com/peterjc/biopython/commits/f_loc5 (rebased)

Kai - would you mind retesting with f_loc5 (the rebased branch)?

Everyone - does it seem sensible to include this now, ready for the
upcoming release (*)? Or perhaps just after the release?

Peter

(*) See other thread about Bio.Motif, which I think is all we
need to address before doing the release:
http://lists.open-bio.org/pipermail/biopython-dev/2013-January/010235.html

From bartek at rezolwenta.eu.org Sun Jan 20 17:34:42 2013
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Sun, 20 Jan 2013 23:34:42 +0100
Subject: [Biopython-dev] Bio.Motif update
In-Reply-To: <1358667024.24762.YahooMailClassic@web164004.mail.gq1.yahoo.com>
References: <1358667024.24762.YahooMailClassic@web164004.mail.gq1.yahoo.com>
Message-ID:

Hi,

great job Michiel! It looks very nice overall.

As the code that will be using the new library needs to be changed, I
would vote for the change in the namespace, but given that the userbase
of the Bio.Motif was quite limited, I think it wouldn't cause major
problems to keep the name as is.

best
Bartek

-- 
Bartek Wilczynski

From kai.blin at biotech.uni-tuebingen.de Mon Jan 21 04:49:31 2013
From: kai.blin at biotech.uni-tuebingen.de (Kai Blin)
Date: Mon, 21 Jan 2013 10:49:31 +0100
Subject: [Biopython-dev] More 'fun' with GenBank
In-Reply-To:
References: <50F57BC5.7020607@biotech.uni-tuebingen.de> <50F66496.8000109@biotech.uni-tuebingen.de>
Message-ID: <50FD0F2B.1080606@biotech.uni-tuebingen.de>

On 2013-01-20 20:03, Peter Cock wrote:
> Kai - would you mind retesting with f_loc5 (the rebased branch)?

The location of the feature that caused trouble for me still looks
correct. I'm currently running some more sequences, but I'm pretty
confident that the code will work just fine. The tests I added to the
genbank parser code for all the problem cases I had pass, after all. :)

> Everyone - does it seem sensible to include this now, ready for the
> upcoming release (*)? Or perhaps just after the release?
I'd prefer having this in the next release if possible, but of course if the release after that is coming up within a reasonable time frame, that would work as well. Cheers, Kai - -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-Universität Tübingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 Tübingen Fax : ++49 7071 29-5979 Germany Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with undefined - http://www.enigmail.net/ iQEcBAEBAgAGBQJQ/Q8rAAoJEKM5lwBiwTTP9oEIAIoa543zGerNtxNg67ybV4uE jzOkyBzJIxkGAjIxcuNnYTo+OgYHkMQekeo7wkGgPKN558+LE8zKza3JdWbVqV/M bEd6mYo5LsfveK3Vn397GJcPCOaQtb5MvNUOPJWstzReRVIM6lN3WXm3HxicuTji 2aFZG5dtaMXjZhxxMo4IRz2Jtrr01nZu1OVP02mco4LDoEkRInunDcWJcz/DOsJd h4vJzVa4veMKFfJV4U9PGZnuatcwKgMLVQ1heKh4/efEOQ4dIjdlYG29FjHsZvy6 RjwL4ZZpGZfZwgBJPGiYqn5ZsgzVqgS5aWdw8/9jN5dpETP24DnzVi6vlIRTWqg= =uUeG -----END PGP SIGNATURE----- From redmine at redmine.open-bio.org Tue Jan 22 21:30:31 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 23 Jan 2013 02:30:31 +0000 Subject: [Biopython-dev] [Biopython - Bug #3403] (Closed) PDBList fails to download large PDB structures References: Message-ID: Issue #3403 has been updated by Eric Talevich. Status changed from New to Closed % Done changed from 0 to 100 Fixed by David Cain. Thanks!
https://github.com/biopython/biopython/pull/146 First commit in the series here: https://github.com/biopython/biopython/commit/7282e80ed6a65a10c5c624b2a7ec787656437a15 ---------------------------------------- Bug #3403: PDBList fails to download large PDB structures https://redmine.open-bio.org/issues/3403 Author: David Cain Status: Closed Priority: High Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: https://github.com/DavidCain/biopython/tree/fix_pdb_dl The current @PDBList@ module will often fail to download large PDB files.
>>> from Bio.PDB import PDBList
>>> pdbl = PDBList()
>>> pdbl.retrieve_pdb_file("1hgg")
Downloading PDB structure '1hgg'...
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/pymodules/python2.7/Bio/PDB/PDBList.py", line 247, in retrieve_pdb_file
    out.writelines(gz.read())
  File "/usr/lib/python2.7/gzip.py", line 249, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 303, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0x21d7a5f7 != 0x4b5eabb6L
>>>
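For reference, the failing call in the traceback above is a single gz.read() over the whole archive. One common way to avoid holding everything in memory is to decompress in fixed-size chunks. The following is only a minimal sketch under that assumption: gunzip_to_file and its chunk_size parameter are hypothetical names chosen for illustration, and this is not the code from the pull request.

```python
import gzip
import shutil


def gunzip_to_file(gz_path, out_path, chunk_size=64 * 1024):
    """Decompress gz_path into out_path one chunk at a time.

    Hypothetical helper for illustration only; it is not part of Bio.PDB.
    """
    # gzip.open gives a file-like object that yields decompressed bytes,
    # and copyfileobj moves them across in chunk_size pieces, so the
    # full structure file is never held in memory at once.
    with gzip.open(gz_path, "rb") as compressed:
        with open(out_path, "wb") as out:
            shutil.copyfileobj(compressed, out, chunk_size)
```

Both handles are closed by the with blocks, so the output file is flushed before anything else tries to read it.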
The source of this problem is that the entire gzipped file must be read into memory before it's written to disk locally. With large archives, the local file can be truncated prematurely, which causes gzip to crash on extraction. I fixed this issue on my "GitHub branch":https://github.com/DavidCain/biopython/tree/fix_pdb_dl, which I've made a pull request for. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From mjldehoon at yahoo.com Sat Jan 26 23:45:46 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 26 Jan 2013 20:45:46 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone Message-ID: <1359261946.16561.YahooMailClassic@web164001.mail.gq1.yahoo.com> [This message previously got lost in cyberspace. Sending it again.] --- On Fri, 1/11/13, Peter Cock wrote: > Bow's SearchIO is using Bio.Blast.NCBIStandalone to handle > plain text, > https://github.com/biopython/biopython/blob/master/Bio/SearchIO/BlastIO/blast_text.py OK then let's keep Bio.ParserSupport as is for now. > That's why Bio._utils is a private module - we can > drop/change/etc this without worrying about breaking > other people's code. The issue with Bio.ParserSupport > is it was a public API. Its API being public was not the problem -- we have deprecated and removed lots of public modules over the years. The problem with Bio.ParserSupport was twofold. First, it ended up making parsers more complex and difficult to understand for people not familiar with Bio.ParserSupport, in particular for newcomers and users trying to fix a bug. So Bio.ParserSupport never made us really happy. As a case in point, Bio._utils was created rather than reusing the code in Bio.ParserSupport. 
The second problem was that many modules were using bits and pieces of Bio.ParserSupport, so we could not drop or change Bio.ParserSupport easily. Bio.ParserSupport has been officially obsolete but not deprecated for years. > That's why Bio._utils is a private module - we can > drop/change/etc this without worrying about breaking > other people's code. Let's drop it. Just it being a private module doesn't make it "free". It clutters up the code base. This is particularly true for top-level modules. Best, -Michiel. From mjldehoon at yahoo.com Sat Jan 26 23:46:47 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 26 Jan 2013 20:46:47 -0800 (PST) Subject: [Biopython-dev] Bio.Motif update In-Reply-To: Message-ID: <1359262007.25151.YahooMailClassic@web164002.mail.gq1.yahoo.com> OK, thanks! I separated Bio.Motif into Bio.Motif (essentially the same as in Biopython release 1.60) and Bio.motifs (the new code). Best, -Michiel. --- On Sun, 1/20/13, Bartek Wilczynski wrote: > From: Bartek Wilczynski > Subject: Re: [Biopython-dev] Bio.Motif update > To: "Michiel de Hoon" > Cc: "BioPython-Dev" > Date: Sunday, January 20, 2013, 5:34 PM > Hi, > > great job Michiel! It looks very nice overall. As the code > that will > be using the new library needs to be changed, I would vote > for the > change in the namespace, but given that the userbase of the > Bio.Motif > was quite limited, I think it wouldn't cause major problems > to keep > the name as is. > > best > Bartek > > On Sun, Jan 20, 2013 at 8:30 AM, Michiel de Hoon > wrote: > > Dear all, > > > > As we discussed previously, I've been going over > Bio.Motif to update it and make its usage more explicit. I'm > pretty much done. While I have been uploading my changes to > the main biopython github repository, this does not mean > that these changes are final; comments and suggestions for > changes are welcome. > > > > In many cases, there is a difference in the syntax > between the old Bio.Motif and the new Bio.Motif. 
For > example, motif.consensus is a method in the old Bio.Motif, > but a property in the new Bio.Motif. > > While I tried to put PendingDeprecationWarnings on all > changes consistently, there may be some corner cases that I > missed. > > > > For this reason, and also to make the documentation > more understandable, it may be better to put the new > Bio.Motif code in a module Bio.motifs, to put the old > Bio.Motif code back into Bio.Motif (so that Bio.Motif in > release 1.61 will be identical to the Bio.Motif in release > 1.60), and (assuming that we are happy with the new > Bio.motifs modules) put a PendingDeprecationWarning on > Bio.Motif as a whole. Then in the documentation we'll have > one chapter on Bio.Motif and one chapter on Bio.motifs. Also > we'll have one set of tests for Bio.Motif, and one set of > tests for Bio.motifs. > > > > Any objections to creating a separate Bio.motifs > module? > > > > Here you can find the relevant chapter in the current > documentation on the new Bio.Motif: > > > > http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html#htoc190 > > > > Best, > > -Michiel > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > > > > -- > Bartek Wilczynski > From w.arindrarto at gmail.com Sun Jan 27 05:52:15 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sun, 27 Jan 2013 11:52:15 +0100 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359261946.16561.YahooMailClassic@web164001.mail.gq1.yahoo.com> References: <1359261946.16561.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: Hi Michiel, everyone, >> That's why Bio._utils is a private module - we can >> drop/change/etc this without worrying about breaking >> other people's code. The issue with Bio.ParserSupport >> is it was a public API. 
> > Its API being public was not the problem -- we have deprecated and removed lots of public modules over the years. > > The problem with Bio.ParserSupport was twofold. First, it ended up making parsers more complex and difficult to understand for people not familiar with Bio.ParserSupport, in particular for newcomers and users trying to fix a bug. So Bio.ParserSupport never made us really happy. As a case in point, Bio._utils was created rather than reusing the code in Bio.ParserSupport. > > The second problem was that many modules were using bits and pieces of Bio.ParserSupport, so we could not drop or change Bio.ParserSupport easily. Bio.ParserSupport has been officially obsolete but not deprecated for years. > >> That's why Bio._utils is a private module - we can >> drop/change/etc this without worrying about breaking >> other people's code. > > Let's drop it. My initial intention of refactoring and adding some new code to Bio._utils was to reduce code repetition. I intended it (and perhaps we should make it explicit in its docstrings) to be a collection of small, useful functions that may be used in various cases. Some examples inside include several string-formatting functions, each of them independent of the other. There's also a general function for running doctests (https://github.com/biopython/biopython/blob/master/Bio/_utils.py#L100), which was written because there was a lot of repetitive code in different submodules basically doing the same thing (looking up the test directory, running the test). I feel quite strongly that this doctest function is required by many current (and future modules) across Biopython, so it makes sense to refactor them out into a root namespace. All of this seems different from Bio.ParserSupport, which attempts to be a one-single solution for writing new parsers (only parsers). 
Given the wildly incoherent nature of different file output formats, it's not surprising that Bio.ParserSupport's code base has to be quite complicated to accommodate all of them. Naturally it has many related parts and functions, and understanding them all is much harder than understanding the small functions in Bio._utils (in my experience). So for now, I think it is still ok if we use Bio._utils. Perhaps, in light of this discussion, we should make it explicitly clear that it's only for containing general, small, utility functions instead of containing one 'support framework' (e.g. ParserSupport) to avoid future unhappiness. Cheers, Bow From eric.talevich at gmail.com Mon Jan 28 00:59:14 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 28 Jan 2013 00:59:14 -0500 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: References: <1359261946.16561.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Sun, Jan 27, 2013 at 5:52 AM, Wibowo Arindrarto wrote: > Hi Michiel, everyone, > > >> That's why Bio._utils is a private module - we can > >> drop/change/etc this without worrying about breaking > >> other people's code. The issue with Bio.ParserSupport > >> is it was a public API. > > > > Its API being public was not the problem -- we have deprecated and > removed lots of public modules over the years. > > > > The problem with Bio.ParserSupport was twofold. First, it ended up > making parsers more complex and difficult to understand for people not > familiar with Bio.ParserSupport, in particular for newcomers and users > trying to fix a bug. So Bio.ParserSupport never made us really happy. As a > case in point, Bio._utils was created rather than reusing the code in > Bio.ParserSupport. > > > > The second problem was that many modules were using bits and pieces of > Bio.ParserSupport, so we could not drop or change Bio.ParserSupport easily.
> Bio.ParserSupport has been officially obsolete but not deprecated for years. > > > >> That's why Bio._utils is a private module - we can > >> drop/change/etc this without worrying about breaking > >> other people's code. > > > > Let's drop it. > > My initial intention of refactoring and adding some new code to > Bio._utils was to reduce code repetition. I intended it (and perhaps > we should make it explicit in its docstrings) to be a collection of > small, useful functions that may be used in various cases. > > Some examples inside include several string-formatting functions, each > of them independent of the other. There's also a general function for > running doctests > (https://github.com/biopython/biopython/blob/master/Bio/_utils.py#L100), > which was written because there was a lot of repetitive code in > different submodules basically doing the same thing (looking up the > test directory, running the test). I feel quite strongly that this > doctest function is required by many current (and future modules) > across Biopython, so it makes sense to refactor them out into a root > namespace. > Interesting discussion. It's worth considering why some functions are being used in multiple parts of the code base. In some cases there are essentially shortcomings in the Python standard library or issues with cross-platform/cross-implementation/backward compatibility that would require us to use *exactly* the same code each time a certain recurring problem is encountered. The Bio._py3k and Bio.File modules make sense for this reason, I think, and before we deprecated Py2.4 it would have been helpful to have shared code for importing ElementTree (both the uniprot-xml and phyloXML parsers used the same half-page tangle of attempted imports). So, maybe the doctest helpers should go in a new module specific to that topic.
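To make the doctest helpers under discussion concrete, here is a rough sketch of the two steps described in this thread (look up the test directory, then run the doctest). It uses only the standard library. The names find_tests_dir and run_module_doctests are hypothetical and chosen for illustration; this is an assumption about the general shape of such helpers, not the actual code in Bio/_utils.py.

```python
import doctest
import os
import sys


def find_tests_dir(start_dir=None, name="Tests"):
    """Walk upwards from start_dir until a directory called name is found.

    Hypothetical helper; the directory name "Tests" is an assumption.
    """
    current = os.path.abspath(start_dir or os.getcwd())
    while True:
        candidate = os.path.join(current, name)
        if os.path.isdir(candidate):
            return candidate
        parent = os.path.dirname(current)
        if parent == current:
            # Reached the filesystem root without finding the directory.
            raise ValueError("No %r directory found above %r" % (name, start_dir))
        current = parent


def run_module_doctests(module=None, tests_dir=None, **kwargs):
    """Run a module's doctests from inside the tests directory.

    Defaults to the module being executed as a script, and restores the
    original working directory afterwards.
    """
    if module is None:
        module = sys.modules["__main__"]
    original_dir = os.getcwd()
    os.chdir(tests_dir or find_tests_dir())
    try:
        return doctest.testmod(module, **kwargs)
    finally:
        os.chdir(original_dir)
```

A module would then end with a two-line block under if __name__ == "__main__": that calls run_module_doctests(), which is the boilerplate being shared here. Whether that saving justifies a cross-module dependency is the trade-off being debated in this thread.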
In other cases there's a recurring need in separate modules, but (a) it's short and simple enough to write the solution from scratch each time where it's needed, and so isn't enough of a maintenance concern to offset the convenience of having all the relevant code in one place; and/or (b) the needs of different modules aren't exactly the same, merely similar, leading to a proliferation of options in the shared function and the situation that a simpler implementation would have worked for any given module. The point is that just as there's a maintenance cost to having duplicated code in multiple places, there's a maintenance cost to having dependencies between multiple modules even within the same project, and the value of a new module ought to be greater than the cost it imposes. Best, Eric From mjldehoon at yahoo.com Mon Jan 28 09:58:58 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 28 Jan 2013 06:58:58 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: Message-ID: <1359385138.84799.YahooMailClassic@web164002.mail.gq1.yahoo.com> Hi Bow, --- On Sun, 1/27/13, Wibowo Arindrarto wrote: > All of this seems different from Bio.ParserSupport, which > attempts to be a one-single solution for writing new parsers > (only parsers). Given the wildly incoherent nature of different > file output formats, it's not surprising that Bio.ParserSupport's > code base has to be quite complicated to accommodate all of them. > Naturally it has many related parts and functions, and understanding > them all is much harder than to understand the small functions in > Bio._utils (in my experience). It's not just Bio.ParserSupport; previously we also had Bio/listfns.py; Bio/mathfns.py; Bio/stringfns.py; their C versions; and Bio/csupport.c. These all contained small utility functions. But in the end we dropped them. Btw, was Bio._utils ever discussed on the mailing list? 
If yes, I apologize for missing this discussion and raising these issues now. Best, -Michiel. From p.j.a.cock at googlemail.com Mon Jan 28 10:10:29 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 28 Jan 2013 15:10:29 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359385138.84799.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1359385138.84799.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: On Mon, Jan 28, 2013 at 2:58 PM, Michiel de Hoon wrote: > > Btw, was Bio._utils ever discussed on the mailing list? If yes, I > apologize for missing this discussion and raising these issues now. I think only on the pull request - I'll have a look at the GitHub settings as ideally at the minimum new pull requests should perhaps be CC'd to the dev list? Peter From p.j.a.cock at googlemail.com Mon Jan 28 10:17:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 28 Jan 2013 15:17:19 +0000 Subject: [Biopython-dev] Sending pull requests to the mailing list Message-ID: Retitling thread, On Mon, Jan 28, 2013 at 3:10 PM, Peter Cock wrote: > On Mon, Jan 28, 2013 at 2:58 PM, Michiel de Hoon wrote: >> >> Btw, was Bio._utils ever discussed on the mailing list? If yes, I >> apologize for missing this discussion and raising these issues now. > > I think only on the pull request - I'll have a look at the GitHub > settings as ideally at the minimum new pull requests should > perhaps be CC'd to the dev list? According to https://help.github.com/articles/using-pull-requests "Everyone that can push to the base repository will receive an email notification and see the new pull request in their dashboard the next time they log in." I think you can also choose to get emails under your own profile settings. There doesn't seem to be any email notification settings under the Biopython organisation account on GitHub. 
If there is an easy way to have GitHub email new pull requests to the biopython-dev mailing I've overlooked it. There might be an API based solution... or a simple email client forwarding rule? Peter From w.arindrarto at gmail.com Mon Jan 28 12:19:51 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Mon, 28 Jan 2013 18:19:51 +0100 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359385138.84799.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1359385138.84799.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: Hi everyone, > --- On Sun, 1/27/13, Wibowo Arindrarto wrote: >> All of this seems different from Bio.ParserSupport, which >> attempts to be a one-single solution for writing new parsers >> (only parsers). Given the wildly incoherent nature of different >> file output formats, it's not surprising that Bio.ParserSupport's >> code base has to be quite complicated to accommodate all of them. >> Naturally it has many related parts and functions, and understanding >> them all is much harder than to understand the small functions in >> Bio._utils (in my experience). > > It's not just Bio.ParserSupport; previously we also had Bio/listfns.py; Bio/mathfns.py; Bio/stringfns.py; their C versions; and Bio/csupport.c. These all contained small utility functions. But in the end we dropped them. Hm..in this case (and in light of Eric's points as well), it may be ok to drop the string formatting functions in Bio._utils. They are used in Bio.Phylo and Bio.SearchIO for now. In Bio.SearchIO they are used in multiple submodules, however, so I am still leaning on putting them at least on Bio.SearchIO's main directory. They were originally in Bio.SearchIO._utils, after all. As for the doctest-related functions, do you propose to move them to a specific doctest-related module as well? >> Btw, was Bio._utils ever discussed on the mailing list? 
If yes, I >> apologize for missing this discussion and raising these issues now. > > I think only on the pull request - I'll have a look at the GitHub > settings as ideally at the minimum new pull requests should > perhaps be CC'd to the dev list? Indeed, I did submit a pull request, but it was not forwarded to or discussed on the mailing list. This is the pull request, for reference: https://github.com/biopython/biopython/pull/140. For the dev-mailing list notification, I personally agree, given that the number of pull requests received still seems manageable. Is it possible to just receive the initial email notifying the pull, though? So far, I've been 'watching' the repository and getting emails from there ~ perhaps the organization needs to 'watch' the repo to get notifications as well? Best, Bow From redmine at redmine.open-bio.org Mon Jan 28 17:20:54 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 28 Jan 2013 22:20:54 +0000 Subject: [Biopython-dev] [Biopython - Bug #2776] Bio.pairwise2 returns non-optimal alignment in at least some cases References: Message-ID: Issue #2776 has been updated by Peter Cock. In the opinion of Bryan Lunt, commenting on another issue on GitHub: https://github.com/biopython/biopython/pull/149 "Bug" 2776 is not a bug, it is a feature. I hand-edited a datafile for EMBOSS programs and tried the EMBOSS "needle" program with (a homomorphism of) the same sequences. It behaves the same as pairwise2. The point is that for there to be gaps they have to be flanked by matches, except on the ends, so what the original bug report asks for is not something these algorithms will ever produce anyway.
---------------------------------------- Bug #2776: Bio.pairwise2 returns non-optimal alignment in at least some cases https://redmine.open-bio.org/issues/2776 Author: Klaus Kopec Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.49 URL: At least in some cases, Bio.pairwise2 returns an alignment that is not the one with the highest score for the input parameters. This occurs in localXX and globalXX. Yet, I only encountered the problem with large mismatch values (which I use as I need mismatch-free alignments). Simple example (the bug also occurred for longer sequences):
>>> sequence1 = 'GKG'
>>> sequence2 = 'GWG'
>>> A = pairwise2.align.globalms(sequence1, sequence2, 5, -100, -5, -5)[0]
>>> A[0]
'GKG--'
>>> A[1]
'--GWG'
>>> A[2]
-15.0
whereas
'GK-G'
'G-WG'
would get a score of 0 System: Kubuntu 8.10 64Bit, Python 2.6.1, Biopython 1.49 (my pairwise2.py is identical to the current CVS version of it) -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From mjldehoon at yahoo.com Tue Jan 29 04:43:59 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 29 Jan 2013 01:43:59 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: Message-ID: <1359452639.95165.YahooMailClassic@web164005.mail.gq1.yahoo.com> I'd prefer if developers first write to the dev mailing list if they want to make any major changes, or changes that affect Biopython overall. It can be hard to understand the implications just from looking at a pull request, and there may be so many pull requests that the important ones may be missed anyway. Best, -Michiel.
--- On Mon, 1/28/13, Peter Cock wrote: > From: Peter Cock > Subject: Re: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone > To: "Michiel de Hoon" > Cc: "Wibowo Arindrarto" , "BioPython-Dev Mailing List" > Date: Monday, January 28, 2013, 10:10 AM > On Mon, Jan 28, 2013 at 2:58 PM, > Michiel de Hoon > wrote: > > > > Btw, was Bio._utils ever discussed on the mailing list? > If yes, I > > apologize for missing this discussion and raising these > issues now. > > I think only on the pull request - I'll have a look at the > GitHub > settings as ideally at the minimum new pull requests should > perhaps be CC'd to the dev list? > > Peter > From mjldehoon at yahoo.com Tue Jan 29 04:54:01 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 29 Jan 2013 01:54:01 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: Message-ID: <1359453241.43038.YahooMailClassic@web164004.mail.gq1.yahoo.com> --- On Mon, 1/28/13, Wibowo Arindrarto wrote: > Hm..in this case (and in light of Eric's points as well), it > may be ok to drop the string formatting functions in Bio._utils. > They are used in Bio.Phylo and Bio.SearchIO for now. In Bio.SearchIO > they are used in multiple submodules, however, so I am still leaning > on putting them at least on Bio.SearchIO's main directory. They were > originally in Bio.SearchIO._utils, after all. I think it's OK to have a _utils submodule inside Bio.SearchIO. Since you are developing and maintaining that module, to a large degree it's up to you how you want to organize your code. For the same reason, for Bio.Phylo it's better to discuss with Eric Talevich first to see what he thinks. > As for the doctest-related functions, do you propose to move > them to a specific doctest-related module as well? For the doctest-related functions, we first need to understand what the purpose is, before deciding how to implement it (and in what module the code should be). Best, -Michiel. 
From p.j.a.cock at googlemail.com Tue Jan 29 05:23:43 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 29 Jan 2013 10:23:43 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359452639.95165.YahooMailClassic@web164005.mail.gq1.yahoo.com> References: <1359452639.95165.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: On Tue, Jan 29, 2013 at 9:43 AM, Michiel de Hoon wrote: > I'd prefer if developers first write to the dev mailing list if they want to make > any major changes, or changes that affect Biopython overall. It can be hard > to understand the implications just from looking at a pull request, and there > may be so many pull requests that the important ones may be missed anyway. Certainly a good policy, which I have tried to follow. In this case since it was just moving a small private API code, I didn't consider it major. Peter From p.j.a.cock at googlemail.com Tue Jan 29 05:29:30 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 29 Jan 2013 10:29:30 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359453241.43038.YahooMailClassic@web164004.mail.gq1.yahoo.com> References: <1359453241.43038.YahooMailClassic@web164004.mail.gq1.yahoo.com> Message-ID: On Tue, Jan 29, 2013 at 9:54 AM, Michiel de Hoon wrote: > >> As for the doctest-related functions, do you propose to move >> them to a specific doctest-related module as well? > > For the doctest-related functions, we first need to understand > what the purpose is, before deciding how to implement it (and > in what module the code should be). When editing doctests, it is convenient to be able to run them on the current file, e.g. 
~/biopython $ emacs Bio/SeqRecord.py
~/biopython $ python Bio/SeqRecord.py
Or,
~/biopython/Bio $ emacs SeqRecord.py
~/biopython/Bio $ python SeqRecord.py
To do that, many of our modules had a repeated bit of code at the bottom, now moved to a shared function in Bio/_utils.py resulting in a lot less boilerplate code, e.g. https://github.com/biopython/biopython/commit/8b59d89bb4e282192ddee751e24ceef4afa63528 Bow had initially done this for the doctests in Bio.SearchIO, but I agreed it makes sense to do this elsewhere. Peter From w.arindrarto at gmail.com Tue Jan 29 06:05:19 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 29 Jan 2013 12:05:19 +0100 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359453241.43038.YahooMailClassic@web164004.mail.gq1.yahoo.com> References: <1359453241.43038.YahooMailClassic@web164004.mail.gq1.yahoo.com> Message-ID: Hi Michiel, everyone, >>> I'd prefer if developers first write to the dev mailing list if they want to make any major changes, or changes that affect Biopython overall. It can be hard to understand the implications just from looking at a pull request, and there may be so many pull requests that the important ones may be missed anyway. >>> I think it's OK to have a _utils submodule inside Bio.SearchIO. Since you are developing and maintaining that module, to a large degree it's up to you how you want to organize your code. For the same reason, for Bio.Phylo it's better to discuss with Eric Talevich first to see what he thinks. Noted. I'm sorry that this is causing more headaches than it solves. I'll be sure to notify the dev-mailing list for other similar changes. >>> As for the doctest-related functions, do you propose to move >>> them to a specific doctest-related module as well? >> >> For the doctest-related functions, we first need to understand what the purpose is, before deciding how to implement it (and in what module the code should be).
> > When editing doctests, it is convenient to be able to run them on > the current file, e.g. > > ~/biopython $ emacs Bio/SeqRecord.py > ~/biopython $ python Bio/SeqRecord.py > > Or, > > ~/biopython/Bio $ emacs SeqRecord.py > ~/biopython/Bio $ python SeqRecord.py > > To do that, many of our modules had a repeated bit of code at > the bottom, now moved to a shared function in Bio/_utils.py > resulting in a lot less boiler plate code, e.g. > > https://github.com/biopython/biopython/commit/8b59d89bb4e282192ddee751e24ceef4afa63528 > > Bow had initially done this for the doctests in Bio.SearchIO, > but I agreed it make sense to do this elsewhere. Indeed, the doctests functions are two simple small functions to make it easier to run doctests. The first one looks up the test directory (our Tests directory) and the second one simply executes the doctest. Best, Bow From p.j.a.cock at googlemail.com Tue Jan 29 10:46:25 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 29 Jan 2013 15:46:25 +0000 Subject: [Biopython-dev] Bio.Motif update In-Reply-To: <1359262007.25151.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1359262007.25151.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: On Sun, Jan 27, 2013 at 4:46 AM, Michiel de Hoon wrote: > OK, thanks! I separated Bio.Motif into Bio.Motif (essentially the same > as in Biopython release 1.60) and Bio.motifs (the new code). We need to say something about this in the NEWS file too. I think it would make sense to add a PendingDeprecationWarning to Bio.Motif now. Also, if you feel the new Bio.motifs API isn't quite settled yet, adding the new BiopythonExperimentalWarning to that makes sense. What do you think? (And once this is settled, I think we can schedule the release) Regards, Peter From p.j.a.cock at googlemail.com Tue Jan 29 12:10:50 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 29 Jan 2013 17:10:50 +0000 Subject: [Biopython-dev] Namespace for online resources? 
Message-ID: Hello all, We used to have Bio.WWW for assorted online tools, but that was deprecated some time back. Is there a case for bringing it back, or something similar like Bio.WebTools as suggested by Kevin Murray on this pull request?: https://github.com/biopython/biopython/pull/132 In this case, since this is to fetch Arabidopsis sequence via an accession number, perhaps Bio.SeqUtils might be better? (As an aside, recall we've talked about merging Bio.Seq* at some point). Thoughts? Peter From w.arindrarto at gmail.com Tue Jan 29 14:52:42 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 29 Jan 2013 20:52:42 +0100 Subject: [Biopython-dev] Namespace for online resources? In-Reply-To: References: Message-ID: Hi everyone, > We used to have Bio.WWW for assorted online tools, but that > was deprecated some time back. Is there a case for bringing it > back, or something similar like Bio.WebTools as suggested by > Kevin Murray on this pull request?: > > https://github.com/biopython/biopython/pull/132 > > In this case, since this is to fetch Arabidopsis sequence via > an accession number, perhaps Bio.SeqUtils might be better? > (As an aside, recall we've talked about merging Bio.Seq* at > some point). Why was Bio.WWW deprecated in the first place? Personally, I would prefer to have all online database access centralized in one place, if possible. It makes for a less-cluttered root namespace and may be more intuitive in most cases. I do notice that for cases like Bio.Entrez, sometimes we need to only parse the data locally since it has been downloaded previously (hence no online access). To do this task, Bio.www (basically the centralized online module) may not be the most intuitive place to look in, for most people, although an argument can be made that we are still parsing data whose format is specific for an online resource. 
However, looking at the way we are doing this now (with the current codebase placing Entrez access and parsing in Bio.Entrez; similarly for Bio.ExPASy), locating the module in Bio.TAIR (or Bio.tair? PEP-8 compliance?) looks more consistent. If we are to create a new module for online access (e.g. Bio.webtools, Bio.www) for Bio.TAIR, for consistency we may have to juggle Entrez and ExPASy around as well, right? Putting Bio.TAIR in Bio.SeqUtils doesn't seem right to me. My impression is that SeqUtils is supposed to be for functions acting on sequence strings (or Seq objects) and nothing else. After all, we can also retrieve GenBank sequences with Biopython, but that functionality lives in its own module, Bio.Entrez, not in Bio.SeqUtils. Just my two cents :), Bow From arklenna at gmail.com Tue Jan 29 15:05:15 2013 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 29 Jan 2013 15:05:15 -0500 Subject: [Biopython-dev] Namespace for online resources? In-Reply-To: References: Message-ID: I agree with Bow that centralizing all online database access makes sense. It would also simplify the testing process (i.e. anything that requires a network connection goes into the web namespace and can be skipped when testing offline). In situations like Entrez, the network access portion could be separated out and put into the web namespace under the same name:
    import Bio.www.Entrez  # for downloading the data
    import Bio.Entrez  # for parsing/using the downloaded data
Cheers, Lenna From p.j.a.cock at googlemail.com Tue Jan 29 16:03:59 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 29 Jan 2013 21:03:59 +0000 Subject: [Biopython-dev] Namespace for online resources?
In-Reply-To: References: Message-ID: On Tue, Jan 29, 2013 at 7:52 PM, Wibowo Arindrarto wrote: > Hi everyone, > > Why was Bio.WWW deprecated in the first place? > The flippant answer is everything under Bio.WWW was moved or deprecated: http://lists.open-bio.org/pipermail/biopython-dev/2008-July/004059.html I'm trying to identify the discussions prior to that covering the moves: Bio.WWW.ExPASy -> Bio.ExPASy Bio.WWW.InterPro -> Bio.InterPro Bio.WWW.NCBI -> Bio.Entrez Bio.WWW.SCOP -> Bio.SCOP Peter From p.j.a.cock at googlemail.com Tue Jan 29 16:11:29 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 29 Jan 2013 21:11:29 +0000 Subject: [Biopython-dev] Namespace for online resources? In-Reply-To: References: Message-ID: On Tue, Jan 29, 2013 at 9:03 PM, Peter Cock wrote: > On Tue, Jan 29, 2013 at 7:52 PM, Wibowo Arindrarto > wrote: >> Hi everyone, >> >> Why was Bio.WWW deprecated in the first place? >> > > The flippant answer is everything under Bio.WWW was moved > or deprecated: > http://lists.open-bio.org/pipermail/biopython-dev/2008-July/004059.html > > I'm trying to identify the discussions prior to that covering the moves: > > Bio.WWW.ExPASy -> Bio.ExPASy > Bio.WWW.InterPro -> Bio.InterPro > Bio.WWW.NCBI -> Bio.Entrez > Bio.WWW.SCOP -> Bio.SCOP Probably this thread, http://lists.open-bio.org/pipermail/biopython-dev/2007-November/003241.html Also a bit more background on the NCBI Entrez side: http://lists.open-bio.org/pipermail/biopython-dev/2008-February/003423.html Peter From natemsutton at yahoo.com Tue Jan 29 16:22:57 2013 From: natemsutton at yahoo.com (Nate Sutton) Date: Tue, 29 Jan 2013 13:22:57 -0800 (PST) Subject: [Biopython-dev] New BioPython member Message-ID: <1359494577.29159.YahooMailNeo@web122606.mail.ne1.yahoo.com> Dear all, I just recently joined the BioPython developers group and am looking forward to contributing to BioPython! I have worked for a while in programming, genetics, and biology and have an M.S. in Biomedical Informatics.
After talking with some fellow contributors I have decided to try working on https://redmine.open-bio.org/issues/3360 but I will also work on writing some documentation on examples from the cookbook, especially if I am stuck on the bug. If anyone wants to work on the same things, I'd be glad to hear that, I may be slow on the work because I am still learning Python after coming from other languages. -Nate From mjldehoon at yahoo.com Tue Jan 29 21:00:32 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 29 Jan 2013 18:00:32 -0800 (PST) Subject: [Biopython-dev] Namespace for online resources? In-Reply-To: Message-ID: <1359511232.14591.YahooMailClassic@web164002.mail.gq1.yahoo.com> Bio.WWW was one of those modules that seem a good idea at first, but then failed to gain general acceptance. There are three problems with Bio.WWW:
1) From the module name, it's not clear what you would find in it. For example, if you want to access the Entrez database, would you first look in Bio.Entrez or in Bio.WWW? Similarly for TAIR: Would you look for it in Bio.TAIR, or in Bio.WWW?
2) The modules in Bio.WWW don't have much to do with each other, except that they access the internet. But any given user probably is mainly interested in Entrez, or ExPASy, or some other database, not in all of them at the same time.
3) The flip side of this is that a user accessing e.g. ExPASy would have to import both Bio.WWW and Bio.ExPASy to be able to use ExPASy. Doctests get more complicated also, as they would span more than one module. Here is an example from Bio.Entrez that accesses the database, and then parses the results:
>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here@example.org"
>>> handle = Entrez.einfo() # or esearch, efetch, ...
>>> record = Entrez.read(handle)
>>> handle.close()
The ultimate question is whether we organize the code in Biopython by their functionality from a user perspective, or by the kind of things they do?
Almost all of Biopython is organized according to the former. For example, we don't have a Bio.Parsers module for all the parsers; similarly, we don't have Bio.WWW for internet access. Best, -Michiel. From kjwu at ucsd.edu Tue Jan 29 21:09:42 2013 From: kjwu at ucsd.edu (Kevin Wu) Date: Tue, 29 Jan 2013 18:09:42 -0800 Subject: [Biopython-dev] Trie with_prefix doesn't work as expected Message-ID: Hi All, I'm attempting to use the trie implementation in biopython to develop a suffix trie. I'm using the with_prefix function to find all keys which start with a sequence, however, the function doesn't return values that I expect. I tested it with the canonical example "banana" and am a bit confused.
from Bio.trie import trie
t = trie()
s = "BANANA"
for i in range(len(s)): # insert all suffixes into trie
    t[s[i:]] = i

t.with_prefix("NA") # this works as expected
>> ['NA', 'NANA']

t.with_prefix("AN")
>> ['AN', 'ANNA'] # this doesn't work as expected
# expected output: ["ANANA", "ANA"]

Can anyone clarify my confusion or confirm this bug? I'm on Biopython 1.60, Linux Mint 64-bit. Thanks! Kevin From mjldehoon at yahoo.com Tue Jan 29 21:29:09 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 29 Jan 2013 18:29:09 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: Message-ID: <1359512949.16659.YahooMailClassic@web164002.mail.gq1.yahoo.com> Hi Bow, Thanks for the explanation. > Indeed, the doctests functions are two simple small > functions to make it easier to run doctests. The first > one looks up the test directory (our Tests directory) and > the second one simply executes the doctest. The point of looking up the test directory is to find the example input files, right? Have a look at Bio/Align/Applications/_Mafft.py. Its doctest uses the complete path to the example input file: https://github.com/biopython/biopython/commit/32a6beb1e039fa614398a7dee1c031466e8e42ed#Bio/Align/Applications/_Mafft.py I like this solution better, since it's more straightforward, it doesn't need a new module, and also allows the user to run the example without having to figure out where the input file is located. Best, -Michiel. From k.d.murray.91 at gmail.com Tue Jan 29 22:37:46 2013 From: k.d.murray.91 at gmail.com (Kevin Murray) Date: Wed, 30 Jan 2013 14:37:46 +1100 Subject: [Biopython-dev] Namespace for online resources? In-Reply-To: <1359511232.14591.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1359511232.14591.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: Hi all, Essentially, I agree with everything Bow and Lenna have said.
If all web-based tools are in a single root-level package, then with appropriate documentation I think users should know where to find any function. People are at least going to know if their required module interfaces with some website. I guess the problem is that moving all the web stuff into one package will break a lot of code, which leads me back to my original idea of just copying where stuff like TogoWS and ExPASy is located, i.e. sticking TAIR in the root level directory. Peter and Michiel, do you think that Lenna's suggestion is workable? Would it make sense to go all in and simultaneously refactor parsers into Bio.parse, Bio.*IO into Bio.io.*, etc.? Perhaps this could be delayed until the next major release (or form the beginnings of a biopython2 branch?). Cheers, Kevin Murray From p.j.a.cock at googlemail.com Wed Jan 30 03:52:24 2013 From:
p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 30 Jan 2013 08:52:24 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359512949.16659.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1359512949.16659.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: On Wed, Jan 30, 2013 at 2:29 AM, Michiel de Hoon wrote: > Hi Bow, > > Thanks for the explanation. > >> Indeed, the doctests functions are two simple small >> functions to make it easier to run doctests. The first >> one looks up the test directory (our Tests directory) and >> the second one simply executes the doctest. > > The point of looking up the test directory is to find the > example input files, right? Yes. Most of the code is working out where our Tests directory is; without that it is just two lines:
    import doctest
    doctest.testmod()
> Have a look at Bio/Align/Applications/_Mafft.py. > Its doctest uses the complete path to the example input file: > > https://github.com/biopython/biopython/commit/32a6beb1e039fa614398a7dee1c031466e8e42ed#Bio/Align/Applications/_Mafft.py > > I like this solution better, since it's more straightforward, it doesn't > need a new module, and also allows the user to run the example > without having to figure out where the input file is located. That's a special case - the file being referred to isn't used other than to print out a command line string. So it is fine. The doctests we're talking about typically are for parsing, and they need to find the file. In order to run via the main test suite (run_tests.py) we can assume we are in the Biopython Tests folder and therefore use relative paths. Those relative paths won't work if trying to run the doctests via the __name__ trick, thus the path magic which seemed sensible to put in one place only. We can of course remove these __name__ trick conveniences; they are only intended to make life easier for us developers when editing the doctests of a module.
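A minimal sketch of the kind of shared helper being discussed here, assuming a checkout where the Tests directory sits next to the Bio package. The names find_tests_dir and run_doctest_from_tests_dir are hypothetical and chosen for illustration; this is not the code in Bio/_utils.py:

```python
import doctest
import os
import sys


def find_tests_dir(start_dir, name="Tests"):
    """Walk upwards from start_dir until a directory called name is found."""
    current = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(current, name)
        if os.path.isdir(candidate):
            return candidate
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root without a match
            raise ValueError("No %s directory found above %s" % (name, start_dir))
        current = parent


def run_doctest_from_tests_dir(module_file, module=None, **kwargs):
    """Run a module's doctests with the Tests directory as working directory.

    module_file is the path of the module being tested (typically __file__);
    module defaults to __main__, matching the usual __name__ trick.
    """
    original_dir = os.getcwd()
    tests_dir = find_tests_dir(os.path.dirname(os.path.abspath(module_file)))
    os.chdir(tests_dir)
    try:
        if module is None:
            module = sys.modules["__main__"]
        return doctest.testmod(module, **kwargs)
    finally:
        os.chdir(original_dir)  # always restore the caller's directory
```

With something along these lines, the block at the bottom of a module could shrink to `if __name__ == "__main__": run_doctest_from_tests_dir(__file__)`, and the doctests could keep using paths relative to Tests regardless of the directory the module was launched from.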
But I think it is worth having as a private function somewhere in the code base. Regards, Peter From p.j.a.cock at googlemail.com Wed Jan 30 04:31:31 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 30 Jan 2013 09:31:31 +0000 Subject: [Biopython-dev] New BioPython member In-Reply-To: <1359494577.29159.YahooMailNeo@web122606.mail.ne1.yahoo.com> References: <1359494577.29159.YahooMailNeo@web122606.mail.ne1.yahoo.com> Message-ID: On Tue, Jan 29, 2013 at 9:22 PM, Nate Sutton wrote: > Dear all, > > I just recently joined the BioPython developers group and am > looking forward to contributing to BioPython! I have worked for a while > in programming, genetics, and biology and have > a m.s. in Biomedical Informatics. After > talking with some fellow contributors I have decided to try working on > https://redmine.open-bio.org/issues/3360 but I will also work on writing > some documentation on examples from the > cookbook, especially if I am stuck on the bug. If anyone wants to work on > the same things, I'd be glad to hear that, I > may be slow on the work because I am still learning Python after coming > from > other languages. > > -Nate Hi Nate, and welcome. Eric is in charge of the Bio.Phylo module, but within that the command line application wrappers under Bio.Phylo.Applications follow a pattern used elsewhere in Biopython. To add a wrapper for fasttree http://www.microbesonline.org/fasttree/ have a look at the existing wrappers for PHYML and RAXML, defined in Bio/Phylo/Applications/_Phyml.py and Bio/Phylo/Applications/_Raxml.py (leading underscores mean private modules in Python), which are exposed to the user via Bio/Phylo/Applications/__init__.py. In this case, I'd suggest putting the new wrapper in a new file, Bio/Phylo/Applications/_fasttree.py. Other similar wrappers exist under Bio.Emboss, Bio.Align, etc. Don't be shy about asking for guidance on this, or git and github.
Ultimately what I'm hoping you'll be able to do is take a fork (a personal copy of the repository) on GitHub, create a new fasttree branch, commit your enhancements, and make a pull request. If that's all too much for now, simply writing the new file and letting us do the git side would be fine. Regards, Peter From p.j.a.cock at googlemail.com Wed Jan 30 04:42:23 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 30 Jan 2013 09:42:23 +0000 Subject: [Biopython-dev] Trie with_prefix doesn't work as expected In-Reply-To: References: Message-ID: On Wed, Jan 30, 2013 at 2:09 AM, Kevin Wu wrote: > Hi All, > > I'm attempting to use the trie implementation in biopython to develop a > suffix trie. I'm using the with_prefix function to find all keys which > start with a sequence, however, the function doesn't return values that I > expect. I tested it with the canonical example "banana" and am a bit > confused.
>
> from Bio.trie import trie
> t = trie()
> s = "BANANA"
> for i in range(len(s)): # insert all suffixes into trie
>     t[s[i:]] = i
>
> t.with_prefix("NA") # this works as expected
>>> ['NA', 'NANA']
>
> t.with_prefix("AN")
>>> ['AN', 'ANNA'] # this doesn't work as expected
> # expected output: ["ANANA", "ANA"]
>
> Can anyone clarify my confusion or confirm this bug? I'm on Biopython 1.60, > Linux Mint 64-bit.
There is certainly something odd happening. I'm testing with the current code in git (pre-Biopython 1.61) under Mac OS X.
>>> from Bio.trie import trie
>>> t = trie()
>>> s = "BANANA"
>>> for i in range(len(s)): # insert all suffixes into trie
...     t[s[i:]] = i
...     print "%s -> %i" % (s[i:], i)
...     assert t[s[i:]] == i
...
BANANA -> 0
ANANA -> 1
NANA -> 2
ANA -> 3
NA -> 4
A -> 5
>>> t.values()
[5, 3, 1, 0, 4, 2]
>>> t.keys()
['A', 'ANA', 'ANANA', 'BANANA', 'NA', 'NANA']
These look fine:
>>> t.with_prefix("NA")
['NA', 'NANA']
>>> t.with_prefix("A")
['A', 'ANA', 'ANANA']
>>> t.with_prefix("ANA")
['ANA', 'ANANA']
As you point out, this example seems wrong:
>>> t.with_prefix("AN")
['AN', 'ANNA']
The value 'ANNA' shouldn't be in the trie. Peter From mjldehoon at yahoo.com Wed Jan 30 05:20:53 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 30 Jan 2013 02:20:53 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: Message-ID: <1359541253.85968.YahooMailClassic@web164003.mail.gq1.yahoo.com> Hi Peter, --- On Wed, 1/30/13, Peter Cock wrote: > Those relative paths won't work if trying to run the > doctests via the __name__ trick, thus the path magic which > seemed sensible to put in one place only. In which case won't they work?
I tried this on SeqRecord.py, > and as far as I can tell, the relative paths work fine also when > running the doctests from the __name__=="__main__" block, > both on Unix and Windows. Yes, it works with no path magic IF you are in the Tests folder, e.g.
~/biopython/Tests $ emacs ../Bio/SeqRecord.py
~/biopython/Tests $ python ../Bio/SeqRecord.py
However for anything like the following convenient alternatives to work and run the doctests, you need some path magic:
~/biopython $ emacs Bio/SeqRecord.py
~/biopython $ python Bio/SeqRecord.py
Or,
~/biopython/Bio $ emacs SeqRecord.py
~/biopython/Bio $ python SeqRecord.py
I felt having a central convenience function to make that work was worthwhile in order to make working on doctests easier without code duplication. I would accept that this alone does not justify a whole module or file like Bio/_utils.py. If you feel strongly about this, we can remove the function run_doctest from Bio/_utils.py (it does after all serve no real purpose in the installed library code), and just require the current directory be the test folder. Would you like me to make that change? Regards, Peter From mjldehoon at yahoo.com Wed Jan 30 07:10:17 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 30 Jan 2013 04:10:17 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: Message-ID: <1359547817.36972.YahooMailClassic@web164001.mail.gq1.yahoo.com> Hi Peter, --- On Wed, 1/30/13, Peter Cock wrote: > However for anything like the following convenient > alternatives to work and run the doctests, you need > some path magic: > ~/biopython $ emacs Bio/SeqRecord.py > ~/biopython $ python Bio/SeqRecord.py Here I agree. > Or, > > ~/biopython/Bio $ emacs SeqRecord.py > ~/biopython/Bio $ python SeqRecord.py > Well I was thinking that the doctests in SeqRecord.py could use a relative path to the Tests directory, e.g. ../Tests/Quality/solexa_faked.fastq.
But I agree that this will fail again for any script in submodules. Still I would think that there is a better way to do this, and I doubt that we are the first ones who want to access test files with doctests. I can write a short message to comp.lang.python to see if anybody has any suggestions. Best, -Michiel. From arklenna at gmail.com Wed Jan 30 12:10:40 2013 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 30 Jan 2013 12:10:40 -0500 Subject: [Biopython-dev] Namespace for online resources? In-Reply-To: <1359511232.14591.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1359511232.14591.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: Michiel, You raise an excellent point that separating the modules in this way will complicate doctests. Regarding point (2), is your primary concern namespace clutter or importing efficiency? I still maintain that the category of internet access is more fundamental than the category of parsers. For point (1), if every database is accessed using a WWW submodule, a user will know to look there. Obviously moving everything would be a lot of work... Cheers, Lenna From w.arindrarto at gmail.com Wed Jan 30 12:20:39 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 30 Jan 2013 18:20:39 +0100 Subject: [Biopython-dev] Namespace for online resources? In-Reply-To: <1359511232.14591.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1359511232.14591.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: Hi everyone, Peter, thanks for the links to the archives, I'm starting to get a grip on why Bio.WWW was deprecated in the first place. Michiel, thanks for the explanation. My responses are below. My reply is a bit long, so in the interest of brevity, I'll say first that I'm in favor of putting TAIR in Bio.TAIR now, for practical reasons and consistency with similar modules. But I do still have some slight objections to this approach. > Bio.WWW was one of those modules that seem a good idea at first, but then failed to gain general acceptance.
There are three problems with Bio.WWW: > > 1) From the module name, it's not clear what you would find in it. For example, if you want to access the Entrez database, would you first look in Bio.Entrez or in Bio.WWW? Similarly for TAIR: Would you look for it in Bio.TAIR, or in Bio.WWW? This seems to be a naming issue, but it does not invalidate the idea of having one central place for online access. I'll continue to refer to this module as Bio.WWW here, but there may be other more suitable names, such as Bio.remotedb, Bio.remote.db, Bio.www.db (or something else), that would make the module a more intuitive place to look, right? > 2) The modules in Bio.WWW don't have much to do with each other, except that they access the internet. But any given user probably is mainly interested in Entrez, or ExPASy, or some other database, not in all of them at the same time. We could put a note in the documentation about this, right? If we are worried about loading unnecessary modules, we can keep the __init__.py in Bio.WWW empty, and have Entrez, ExPASy, and the others inside Bio.WWW. > 3) The flip side of this is that a user accessing e.g. ExPASy would have to import both Bio.WWW and Bio.ExPASy to be able to use ExPASy. Doctests get more complicated also, as they would span more than one module. Here is an example from Bio.Entrez that accesses the database, and then parses the results: >>>> from Bio import Entrez >>>> Entrez.email = "Your.Name.Here at example.org" >>>> handle = Entrez.einfo() # or esearch, efetch, ... >>>> record = Entrez.read(handle) >>>> handle.close() Since ExPASy's formats may be specific to them, I was thinking their parsers should also go in Bio.WWW (in this case, Bio.WWW.ExPASy). Note that at the moment we also have cases where the database entry retriever and parser lie in different submodules of the code (e.g. importing Fasta from Bio.Entrez and parsing it with Bio.SeqIO).
This is OK in my opinion, however, as Fasta is a widely used format not exclusive to Entrez. But for exclusive formats like ExPASy's or Entrez's, it makes sense to keep them in the same module as their database entry retriever. > The ultimate question is whether we organize the code in Biopython by their functionality from a user perspective, or by the kind of things they do? Almost all of Biopython is organized according to the former. For example, we don't have a Bio.Parsers module for all the parsers; similarly, we don't have Bio.WWW for internet access. Hmm, those two points are not necessarily mutually exclusive, right? I think having a centralized module for online access still makes for a functional grouping based on a user's perspective. In the parsers' case, it makes sense to organize them the way we do now as there are so many parsers. But for online access, I think it's still manageable to put them in one directory. Just to throw the idea around, we may also have subdirectories for different kinds of online access (e.g. Bio.www.db for online database access, Bio.www.app for online tools access like NCBI BLAST or HMMER). This is not something urgent, but maybe worth thinking about and discussing :). Cheers, Bow From mjldehoon at yahoo.com Thu Jan 31 06:03:12 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 31 Jan 2013 03:03:12 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: Message-ID: <1359630192.62870.YahooMailClassic@web164001.mail.gq1.yahoo.com> Dear all, [Michiel wrote:] > Still I would think that there is a better way to do this, > and I doubt that we are the first ones who want to access > test files with doctests. I can write a short message to > comp.lang.python to see have anybody has any suggestions.
So I started writing a message to comp.lang.python, and while reading the doctest documentation to make my message understandable I realized that we can solve our problem by using the setUp and tearDown arguments to doctest.DocTestSuite. Then we put the test files in the same directory as the module we want to test, and use setUp/tearDown to let the unittest switch to this directory when needed. This has the added benefit that the example files are easier to find for users who want to try out a doctest example. Perhaps we'll still run into some issues if we try to implement this, but it seems a step in the right direction. Best, -Michiel. From p.j.a.cock at googlemail.com Thu Jan 31 06:38:43 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 31 Jan 2013 11:38:43 +0000 Subject: [Biopython-dev] Trie with_prefix doesn't work as expected In-Reply-To: References: Message-ID: On Wed, Jan 30, 2013 at 9:42 AM, Peter Cock wrote: > On Wed, Jan 30, 2013 at 2:09 AM, Kevin Wu wrote: >> Hi All, >> >> I'm attempting to use the trie implementation in biopython to develop a >> suffix trie. I'm using the with_prefix function to find all keys which >> start with a sequence, however, the function doesn't return values that I >> expect. I tested it with the canonical example "banana" and am a bit >> confused. >> >> from Bio.trie import trie >> t = trie() >> s = "BANANA" >> for i in range(len(s)): # insert all suffixes into trie >> t[s[i:]] = i >> >> t.with_prefix("NA") # this works as expected >>>> ['NA', 'NANA'] >> >> t.with_prefix("AN") >>>> ['AN', 'ANNA'] # this doesn't work as expected >> # expected output: ["ANANA", "ANA"] >> >> Can anyone clarify my confusion or confirm this bug? I'm on Biopython 1.60, >> Linux Mint 64-bit. > > There is certainly something odd happening. I'm testing with the > current code in git (pre-Biopython 1.61) under Mac OS X. 
> >>>> from Bio.trie import trie >>>> t = trie() >>>> s = "BANANA" >>>> for i in range(len(s)): # insert all suffixes into trie > ... t[s[i:]] = i > ... print "%s -> %i" % (s[i:], i) > ... assert t[s[i:]] == i > ... > BANANA -> 0 > ANANA -> 1 > NANA -> 2 > ANA -> 3 > NA -> 4 > A -> 5 >>>> t.values() > [5, 3, 1, 0, 4, 2] >>>> t.keys() > ['A', 'ANA', 'ANANA', 'BANANA', 'NA', 'NANA'] > > These look fine: > >>>> t.with_prefix("NA") > ['NA', 'NANA'] >>>> t.with_prefix("A") > ['A', 'ANA', 'ANANA'] >>>> t.with_prefix("ANA") > ['ANA', 'ANANA'] > > As you point out, this example seems wrong: > >>>> t.with_prefix("AN") > ['AN', 'ANNA'] > > The value 'ANNA' shouldn't be in the trie. > > Peter Thanks to Jeff Chang for a very speedy fix (sent as an attachment off list), which I have applied to the repository: https://github.com/biopython/biopython/commit/cd7cc7174fd4b0607381e9c58f6ae0d17cca8f74 I've also added a unit test based on Kevin's example: https://github.com/biopython/biopython/commit/efc289c8fe2e78ad12481973e42554fa40f2ea0a Thank you for reporting this Kevin. Peter P.S. Nice to hear from you again Jeff :) I think your last commit was before we moved from CVS to git, please let us know if you want commit access on github. From p.j.a.cock at googlemail.com Thu Jan 31 06:43:44 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 31 Jan 2013 11:43:44 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359630192.62870.YahooMailClassic@web164001.mail.gq1.yahoo.com> References: <1359630192.62870.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Thu, Jan 31, 2013 at 11:03 AM, Michiel de Hoon wrote: > Dear all, > > [Michiel wrote:] >> Still I would think that there is a better way to do this, >> and I doubt that we are the first ones who want to access >> test files with doctests. I can write a short message to >> comp.lang.python to see have anybody has any suggestions. 
> > So I started writing a message to comp.lang.python, and while reading > the doctest documentation to make my message understandable I > realized that we can solve our problem by using the setUp and tearDown > arguments to doctest.DocTestSuite. Then we put the test files in the same > directory as the module we want to test, and use setUp/tearDown to let > the unittest switch to this directory when needed. > > This has the added benefit that the example files are easier to find > for users who want to try out a doctest example. > > Perhaps we'll still run into some issues if we try to implement this, but > it seems a step in the right direction. I don't follow what you are suggesting here. Are you suggesting putting test files under Bio/* as well/instead or under Tests/* ? Peter From mjldehoon at yahoo.com Thu Jan 31 08:46:47 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 31 Jan 2013 05:46:47 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: Message-ID: <1359640007.58576.YahooMailClassic@web164005.mail.gq1.yahoo.com> > I don't follow what you are suggesting here. Are you > suggesting putting test files under Bio/* as well/instead > or under Tests/* ? Well the key point is that if we run the doctests from the Tests directory (with run_tests.py), we can change directory to the directory containing the module whose doctests we want to test. Then, if "python somemodule.py" can find the test files, then so can run_tests.py. We'd just need to make sure that the relative paths in somemodule.py are correct with respect to the directory in which somemodule.py resides. But keep in mind that the unit tests in Tests and the doctests in the modules have different functions. The purpose of the unit tests is to test the Biopython code; the purpose of the doctests is to make sure the docstring examples work. 
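The directory-switching idea discussed in this thread can be sketched with the setUp and tearDown arguments of doctest.DocTestSuite. This is only a minimal illustration, not the code in run_tests.py: the helper name make_doctest_suite, the throwaway module demo and the file example.txt are all made up for the example.

```python
import doctest
import os
import tempfile
import types
import unittest


def make_doctest_suite(module, data_dir):
    """Return a doctest suite that runs with data_dir as the working directory."""
    saved = {}

    def set_up(test):
        # Remember the current directory, then move next to the example files.
        saved["cwd"] = os.getcwd()
        os.chdir(data_dir)

    def tear_down(test):
        # Restore the original directory once the doctest has run.
        os.chdir(saved["cwd"])

    return doctest.DocTestSuite(module, setUp=set_up, tearDown=tear_down)


# Demonstration with a throwaway module whose doctest uses a relative path.
# (Hypothetical names; a real run would pass an imported Bio module instead.)
demo = types.ModuleType("demo")
demo.__doc__ = """
>>> import os
>>> os.path.exists("example.txt")
True
"""

start_dir = os.getcwd()
with tempfile.TemporaryDirectory() as data_dir:
    open(os.path.join(data_dir, "example.txt"), "w").close()
    result = unittest.TestResult()
    make_doctest_suite(demo, data_dir).run(result)
```

Because the directory change lives in the suite rather than in the docstring, the doctest itself only needs a relative path, whichever directory the example files end up in.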
So one could argue that the heavy test files should go under Tests, while simple test files just for the docstring examples should go under Bio/SomeModule. Best, -Michiel. --- On Thu, 1/31/13, Peter Cock wrote: > From: Peter Cock > Subject: Re: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone > To: "Michiel de Hoon" > Cc: "Wibowo Arindrarto" , "BioPython-Dev Mailing List" > Date: Thursday, January 31, 2013, 6:43 AM > On Thu, Jan 31, 2013 at 11:03 AM, > Michiel de Hoon > wrote: > > Dear all, > > > > [Michiel wrote:] > >> Still I would think that there is a better way to > do this, > >> and I doubt that we are the first ones who want to > access > >> test files with doctests. I can write a short > message to > >> comp.lang.python to see have anybody has any > suggestions. > > > > So I started writing a message to comp.lang.python, and > while reading > > the doctest documentation to make my message > understandable I > > realized that we can solve our problem by using the > setUp and tearDown > > arguments to doctest.DocTestSuite. Then we put the test > files in the same > > directory as the module we want to test, and use > setUp/tearDown to let > > the unittest switch to this directory when needed. > > > > This has the added benefit that the example files are > easier to find > > for users who want to try out a doctest example. > > > > Perhaps we'll still run into some issues if we try to > implement this, but > > it seems a step in the right direction. > > I don't follow what you are suggesting here. Are you > suggesting putting > test files under Bio/* as well/instead or under Tests/* ? 
> > Peter > From p.j.a.cock at googlemail.com Thu Jan 31 09:26:50 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 31 Jan 2013 14:26:50 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359640007.58576.YahooMailClassic@web164005.mail.gq1.yahoo.com> References: <1359640007.58576.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: On Thu, Jan 31, 2013 at 1:46 PM, Michiel de Hoon wrote: >> I don't follow what you are suggesting here. Are you >> suggesting putting test files under Bio/* as well/instead >> or under Tests/* ? > > Well the key point is that if we run the doctests from the Tests directory > (with run_tests.py), we can change directory to the directory containing > the module whose doctests we want to test. Then, if "python somemodule.py" > can find the test files, then so can run_tests.py. We'd just need to make > sure that the relative paths in somemodule.py are correct with respect to > the directory in which somemodule.py resides. I can see how that would work - put all the path changing magic into run_tests.py (before running the doctest for Bio/x/y/z.py change to the directory Bio/x/y and so on), and have the Bio/x/y/z.py doctests assume they will be run from Bio/x/y only. > But keep in mind that the unit tests in Tests and the doctests in the modules > have different functions. The purpose of the unit tests is to test the Biopython > code; the purpose of the doctests is to make sure the docstring examples work. Of course. > So one could argue that the heavy test files should go under Tests, while > simple test files just for the docstring examples should go under Bio/SomeModule. Many of the unittests and doctests currently use the same example files. However, my main objection is that I don't like the idea of putting test files under Bio/* - I feel it should be the source code only (bar some special cases like data files). 
There are probably packaging guidelines about this somewhere... but I can't find anything immediately. Regards, Peter From mjldehoon at yahoo.com Thu Jan 31 10:33:35 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 31 Jan 2013 07:33:35 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone Message-ID: <1359646415.80564.YahooMailClassic@web164006.mail.gq1.yahoo.com> --- On Thu, 1/31/13, Peter Cock wrote: > However, my main objection is that I don't like the idea of > putting test files under Bio/* I'm OK with using the setUp and tearDown arguments to doctest.DocTestSuite to do the directory magic, but keeping the test files under Tests/. Best, -Michiel. From p.j.a.cock at googlemail.com Thu Jan 31 10:47:18 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 31 Jan 2013 15:47:18 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359646415.80564.YahooMailClassic@web164006.mail.gq1.yahoo.com> References: <1359646415.80564.YahooMailClassic@web164006.mail.gq1.yahoo.com> Message-ID: On Thu, Jan 31, 2013 at 3:33 PM, Michiel de Hoon wrote: > --- On Thu, 1/31/13, Peter Cock wrote: >> However, my main objection is that I don't like the idea of >> putting test files under Bio/* > > I'm OK with using the setUp and tearDown arguments to > doctest.DocTestSuite to do the directory magic, but keeping the test files > under Tests/. As a more elegant version of the Bio._utils.run_doctest() function? Peter From mjldehoon at yahoo.com Tue Jan 8 11:11:46 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 8 Jan 2013 03:11:46 -0800 (PST) Subject: [Biopython-dev] More relaxed parsing of wonky GenBank files In-Reply-To: <50EBF4CF.9080901@biotech.uni-tuebingen.de> Message-ID: <1357643506.32308.YahooMailClassic@web164005.mail.gq1.yahoo.com> Entrez.parse has a "validate" argument to allow parsing of XML files that contain tags that are not represented in the corresponding DTD. If validate==True, the parser raises an Exception if any tags are missing. If False, then the parser will ignore missing tags. Maybe SeqIO.parse could have a similar "validate" argument? Best, -Michiel. --- On Tue, 1/8/13, Kai Blin wrote: > From: Kai Blin > Subject: [Biopython-dev] More relaxed parsing of wonky GenBank files > To: "Biopython-Dev Mailing List" > Date: Tuesday, January 8, 2013, 5:28 AM > Hi folks, > > I've recently pushed into production use a new version of my > software > that uses BioPython parsers instead of our own hand-written > parsers. > > One big thing we noticed is that BioPython is waaay more > picky as to > what a proper GenBank file is supposed to look like. Sadly, > many of > our users seem to be creating their GenBank files with > programs that > only have a rough understanding what the file format is > supposed to > look like.
Most of the invalid input can safely be ignored, > and I > would propose to extend the GenBank parser to cope with the > most > common errors I'm seeing in day to day use. > > I'm happy to provide the patches, but before starting this > work I'd > like to make sure that they would be acceptable in > principle. So, any > reason to rather blow up in our user's face than to try and > cope with > invalid input? > > Cheers, > Kai > > -- > Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de > Institute for Microbiology and Infection Medicine > Division of Microbiology/Biotechnology > Eberhard-Karls-Universität Tübingen > Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 > D-72076 Tübingen Fax : ++49 7071 29-5979 > Germany > Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From p.j.a.cock at googlemail.com Tue Jan 8 13:27:20 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 8 Jan 2013 13:27:20 +0000 Subject: [Biopython-dev] More relaxed parsing of wonky GenBank files In-Reply-To: <50EBF4CF.9080901@biotech.uni-tuebingen.de> References: <50EBF4CF.9080901@biotech.uni-tuebingen.de> Message-ID: On Tuesday, January 8, 2013, Kai Blin wrote: > Hi folks, > > I've recently pushed into production use a new version of my software > that uses BioPython parsers instead of our own hand-written parsers. > > One big thing we noticed is that BioPython is waaay more picky as to > what a proper GenBank file is supposed to look like. Sadly, many of > our users seem to be creating their GenBank files with programs that > only have a rough understanding what the file format is supposed to > look like.
Most of the invalid input can safely be ignored, and I > would propose to extend the GenBank parser to cope with the most > common errors I'm seeing in day to day use. > > I'm happy to provide the patches, but before starting this work I'd > like to make sure that they would be acceptable in principle. So, any > reason to rather blow up in our user's face than to try and cope with > invalid input? > > Cheers, > Kai > We already try to be tolerant, and issue warnings where it seems safe to take a broken file (e.g. an unrecognised first line, or a mismatch between the length given in the first line and the actual sequence), but in these cases not all the mis-formed data will or can be parsed. Sometimes a file is broken to the point it is unwise to attempt to parse it any further and an exception is the best course of action. Clearly you've found a whole load more dodgy files. If you can work out which buggy tools are producing them, please do try and report the issues to the tool authors. I know that BioEdit is one source, but maintenance of that popular free Windows tool stopped many years ago. If you can prepare some (small) example files illustrating the rule-breaking files (for testing), and with patches too if you like, I will certainly review them for inclusion. Note that if the user wants an exception, they can use the warnings module to catch and upgrade our parser warnings. As Michiel pointed out, other bits of Biopython have an explicit validation or strict mode like the Entrez and PDB parsers. In the case of the PDB parser this just toggles between issuing warnings and raising exceptions. I'm not sure the GenBank parser (or any other SeqIO parser) needs a validate/permissive option given this can already be achieved with the warnings module. After all, broken GenBank files should be in the minority. (My understanding of the Entrez setting is also about dealing with missing DTD files and cases where the NCBI has a bug and their XML and DTD disagree.)
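The point above about upgrading parser warnings to exceptions can be sketched with the standard warnings module alone. This is a minimal illustration under assumptions: ParserWarning and parse_record are hypothetical stand-ins, not the real GenBank parser or its actual warning category.

```python
import warnings


class ParserWarning(UserWarning):
    """Hypothetical stand-in for a parser's warning category."""


def parse_record(text):
    """Toy parser: tolerate a missing '//' terminator but warn about it."""
    lines = text.splitlines()
    if not lines or lines[-1].strip() != "//":
        warnings.warn("Missing '//' end-of-record marker", ParserWarning)
    return [line for line in lines if line.strip() != "//"]


def parse_strictly(text):
    """Run the tolerant parser with its warnings upgraded to exceptions."""
    with warnings.catch_warnings():
        # Inside this block a ParserWarning is raised instead of reported.
        warnings.simplefilter("error", ParserWarning)
        return parse_record(text)


broken = "LOCUS       demo\nORIGIN\n        1 acgt"

# Tolerant mode: the record is returned and the warning is only recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    tolerant_result = parse_record(broken)

# Strict mode: the same input now raises.
try:
    parse_strictly(broken)
    strict_failed = False
except ParserWarning:
    strict_failed = True
```

The caller chooses the behaviour here, so the parser itself needs no extra validate-style argument.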
Peter From kai.blin at biotech.uni-tuebingen.de Tue Jan 8 13:55:42 2013 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Tue, 08 Jan 2013 14:55:42 +0100 Subject: [Biopython-dev] More relaxed parsing of wonky GenBank files In-Reply-To: References: <50EBF4CF.9080901@biotech.uni-tuebingen.de> Message-ID: <50EC255E.5040904@biotech.uni-tuebingen.de> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2013-01-08 14:27, Peter Cock wrote: > We already try to be tolerant, and issue warnings where it seems > safe to take a broken file (e.g. Unrecognised first line, mismatch > between length given in first line and actual sequence), but in > these cases not all the mis-formed data will or can be parsed. > Sometimes a file is broken to the point it is unwise to attempt to > parse it any further and an exception is the best course of > action. Yeah, I started looking into the code and realized that it already tries to handle a lot of special cases. > Clearly you're found a whole load more dodgy files. If you can work > out which buggy tools are producing them, please do try and report > the issues to the tool authors. I know that BioEdit is one source, > but maintainence of that popular free Windows tool stopped many > years ago. Unfortunately I often have no way to contact the uploaders of the broken sequence files, unless they chose to provide an email address. > If you can prepare some (small) example files illustrating the > rule-breaking files (for testing), and with patches too if you > like, I will certainly review them for inclusion. The two most common things I saw in the last week are single record files without the '//' end-of-record marker, and files where the sequence lines are indented by one space more than expected (my favourite). I've added two sample files for these issues, I'm currently working on patches that make them pass the tests. Thanks for the comments. I'll push to my github fork once I've got something. Cheers, Kai - -- Dipl.-Inform. 
Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-Universität Tübingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 Tübingen Fax : ++49 7071 29-5979 Germany Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with undefined - http://www.enigmail.net/ iQEcBAEBAgAGBQJQ7CVeAAoJEKM5lwBiwTTPGCYIANAkOxKtNPkclw66aCBWCaAH Uz6zyCk8DTomGOy1fnBoPKI3R+tn73+8XNe6RknFDb6NL/uMD1bR4mTHi1yuHT24 7XSJp+j1JeIamMSs6hLAf4s/HIE2YoEriOe8I6lUAa2I//rxsKf2PcS7y/4Ax6XP K/PUPODVanTCKFrpOIh2DS92lXvMJqI+cpZQ7k1ioaL+6iM9uqi9iRiV9H69Dci5 9bubA98+XvG1cnBISoQTHXpU1p1uiKU1CLxyWdl+9GTq4dCxTkeKDQvxoOd8JH/P ksJPXyYY5u41KrDFpIMNJZpvr0PawLHcUGePKXDEvAt7wvmfDxN92xcVYsUP9w4= =9u/w -----END PGP SIGNATURE----- From kai.blin at biotech.uni-tuebingen.de Tue Jan 8 14:36:03 2013 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Tue, 08 Jan 2013 15:36:03 +0100 Subject: [Biopython-dev] More relaxed parsing of wonky GenBank files In-Reply-To: <50EC255E.5040904@biotech.uni-tuebingen.de> References: <50EBF4CF.9080901@biotech.uni-tuebingen.de> <50EC255E.5040904@biotech.uni-tuebingen.de> Message-ID: <50EC2ED3.8000401@biotech.uni-tuebingen.de> On 2013-01-08 14:55, Kai Blin wrote: > Thanks for the comments. I'll push to my github fork once I've got > something. Pull request is at https://github.com/biopython/biopython/pull/145 Cheers, Kai -- Dipl.-Inform.
Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-Universität Tübingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 Tübingen Fax : ++49 7071 29-5979 Germany Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From redmine at redmine.open-bio.org Wed Jan 9 22:58:25 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 9 Jan 2013 22:58:25 +0000 Subject: [Biopython-dev] [Biopython - Bug #3403] (New) PDBList fails to download large PDB structures Message-ID: Issue #3403 has been reported by David Cain. ---------------------------------------- Bug #3403: PDBList fails to download large PDB structures https://redmine.open-bio.org/issues/3403 Author: David Cain Status: New Priority: High Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: https://github.com/DavidCain/biopython/tree/fix_pdb_dl The current @PDBList@ module will often fail to download large PDB files.
>>> from Bio.PDB import PDBList
>>> pdbl = PDBList()
>>> pdbl.retrieve_pdb_file("1hgg")
Downloading PDB structure '1hgg'...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.7/Bio/PDB/PDBList.py", line 247, in retrieve_pdb_file
    out.writelines(gz.read())
  File "/usr/lib/python2.7/gzip.py", line 249, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 303, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0x21d7a5f7 != 0x4b5eabb6L
>>>
The source of this problem is that the entire gzipped file must be read into memory before it's written to disk locally. With large archives, the local file can be truncated prematurely, which causes gzip to crash on extraction. I fixed this issue on my "GitHub branch":https://github.com/DavidCain/biopython/tree/fix_pdb_dl, which I've made a pull request for. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Wed Jan 9 23:08:28 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 9 Jan 2013 23:08:28 +0000 Subject: [Biopython-dev] [Biopython - Bug #3403] PDBList fails to download large PDB structures References: Message-ID: Issue #3403 has been updated by David Cain. (Pull request "here":https://github.com/biopython/biopython/pull/146) ---------------------------------------- Bug #3403: PDBList fails to download large PDB structures https://redmine.open-bio.org/issues/3403 Author: David Cain Status: New Priority: High Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: https://github.com/DavidCain/biopython/tree/fix_pdb_dl The current @PDBList@ module will often fail to download large PDB files.
>>> from Bio.PDB import PDBList
>>> pdbl = PDBList()
>>> pdbl.retrieve_pdb_file("1hgg")
Downloading PDB structure '1hgg'...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.7/Bio/PDB/PDBList.py", line 247, in retrieve_pdb_file
    out.writelines(gz.read())
  File "/usr/lib/python2.7/gzip.py", line 249, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 303, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0x21d7a5f7 != 0x4b5eabb6L
>>>
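The traceback above fails inside gz.read(), which pulls the whole decompressed archive into memory in one call. A minimal sketch of a chunked alternative follows, assuming the archive has already been downloaded completely to a local path. It is not the code from the pull request, and the function name gunzip_to_file is made up for the example.

```python
import gzip
import shutil


def gunzip_to_file(gz_path, out_path, chunk_size=64 * 1024):
    """Decompress gz_path into out_path one chunk at a time."""
    with gzip.open(gz_path, "rb") as compressed, open(out_path, "wb") as out:
        # copyfileobj reads and writes fixed-size blocks, so memory use
        # stays bounded regardless of the archive size.
        shutil.copyfileobj(compressed, out, chunk_size)
```

A truncated archive would still fail gzip's integrity check; streaming only bounds memory use, so the download itself has to finish before decompression starts.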
The source of this problem is that the entire gzipped file must be read into memory before it's written to disk locally. With large archives, the local file can be truncated prematurely, which causes gzip to crash on extraction. I fixed this issue on my "GitHub branch":https://github.com/DavidCain/biopython/tree/fix_pdb_dl, which I've made a pull request for. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Wed Jan 9 23:55:13 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 9 Jan 2013 23:55:13 +0000 Subject: [Biopython-dev] Fwd: [biopython] Fix broken downloading of large PDB structures (#146) In-Reply-To: References: Message-ID: FYI ---------- Forwarded message ---------- From: David Cain Date: Wed, Jan 9, 2013 at 10:59 PM Subject: [biopython] Fix broken downloading of large PDB structures (#146) To: biopython/biopython Summary of changes - Fix failure to download large PDB files - Use with statements for safer file I/O - Remove obsolete parameters - PEP 8 changes, update documentation Failure to download large PDB files (See: Redmine bug #3403) The current PDBList module will often fail to download large PDB files. >>> from Bio.PDB import PDBList >>> pdbl = PDBList() >>> pdbl.retrieve_pdb_file("1hgg") ... IOError: CRC check failed 0x21d7a5f7 != 0x4b5eabb6L >>> The source of this problem is that the entire gzipped file must be read into memory before it's written to disk locally. Instead of this memory-intensive approach, I changed the downloading to use urllib.urlretrieve, which is more readable and far more efficient. Obsolete parameters The long-obsolete parameters to retrieve_pdb_file() have been removed. Formerly, the function allowed the user to specify compression and/or a system utility to perform decompression.
But all archives are now gzipped, and PDBList uses Python's gzip module to decompress archives. These parameters have been obsolete for over a year (they were marked deprecated with commit 7ebf6e9 ). ------------------------------ You can merge this Pull Request by running git pull https://github.com/DavidCain/biopython fix_pdb_dl Or view, comment on, or merge it at: https://github.com/biopython/biopython/pull/146 Commit Summary - Use urlretrieve to smartly download PDB archives - Use 'with' statement for safer file I/O - Collapse unwieldy if-else structure - PEP8 fixes within retrieve_pdb_file - Remove deprecated parameters - Update with clarifying comments - PEP8 fixes, updated comments for file - Use urlretrieve in other instance of save to disk File Changes - *M* Bio/PDB/PDBList.py (217) Patch Links: - https://github.com/biopython/biopython/pull/146.patch - https://github.com/biopython/biopython/pull/146.diff From mjldehoon at yahoo.com Thu Jan 10 09:21:34 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 10 Jan 2013 01:21:34 -0800 (PST) Subject: [Biopython-dev] Bio._utils iterlen not needed Message-ID: <1357809694.20781.YahooMailClassic@web164003.mail.gq1.yahoo.com> Dear all, As far as I can tell the iterlen function in Bio._utils is not needed. Simply calling len(items) does exactly what iterlen does, and is much faster too. For the other functions, are they important enough to warrant a separate module? From our previous experience in Biopython, these kinds of utility modules tend to be underused. This is because the functions are simple and therefore easy to replicate, and often they do not do exactly what is needed in a particular module. Similar utility modules in Biopython in the past were forgotten after a while, and then deprecated and removed. Best, -Michiel. 
From p.j.a.cock at googlemail.com Thu Jan 10 13:03:50 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Jan 2013 13:03:50 +0000
Subject: [Biopython-dev] Bio._utils iterlen not needed
In-Reply-To: <1357809694.20781.YahooMailClassic@web164003.mail.gq1.yahoo.com>
References: <1357809694.20781.YahooMailClassic@web164003.mail.gq1.yahoo.com>
Message-ID:

On Thu, Jan 10, 2013 at 9:21 AM, Michiel de Hoon wrote:
>
> Dear all,
>
> As far as I can tell the iterlen function in Bio._utils is not needed.
> Simply calling len(items) does exactly what iterlen does, and is much faster too.

No, the raison d'être for iterlen is that you can't use len on an iterator, e.g.

>>> len(iter("abcde"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'iterator' has no len()
>>> from Bio._utils import iterlen
>>> iterlen(iter("abcde"))
5

Perhaps the function needs a little more documentation...

> For the other functions, are they important enough to warrant
> a separate module? From our previous experience in Biopython,
> these kinds of utility modules tend to be underused. This is
> because the functions are simple and therefore easy to
> replicate, and often they do not do exactly what is needed
> in a particular module. Similar utility modules in Biopython
> in the past were forgotten after a while, and then deprecated
> and removed.

Note that Bio._utils has a leading underscore - its contents are therefore a 'private' API which we don't have to worry about maintaining, deprecating, etc. in the same way as a public API. We don't expect end users to use this module ;)

The functions here were originally helper functions used in Bio.Phylo which are now also used in Bio.SearchIO - I think a shared private module like this is a good compromise between code duplication and top level modules.
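
For readers unfamiliar with the distinction drawn above, a helper of this kind might look roughly like the following. This is only a minimal sketch: the name count_items is hypothetical, and the code is not claimed to be the actual Bio._utils.iterlen implementation.

```python
def count_items(items):
    """Count the elements of any iterable by consuming it.

    Hypothetical stand-in for a helper like iterlen; note that a
    one-shot iterator is exhausted after this call.
    """
    count = 0
    for _ in items:
        count += 1
    return count


# Works where len() would raise TypeError:
print(count_items(iter("abcde")))
```

The trade-off is that counting this way is linear in the number of items and uses up the iterator, so it only makes sense when no __len__ is available.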
Peter From mjldehoon at yahoo.com Thu Jan 10 17:24:14 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 10 Jan 2013 09:24:14 -0800 (PST) Subject: [Biopython-dev] Bio._utils iterlen not needed In-Reply-To: Message-ID: <1357838654.1021.YahooMailClassic@web164001.mail.gq1.yahoo.com> --- On Thu, 1/10/13, Peter Cock wrote: > > Simply calling len(items) does exactly what iterlen > does, and is much faster too. > > No, the reason d'?tre for iterlen is that you can't use len > on an iterator, e.g. > > >>> len(iter("abcde")) > Traceback (most recent call last): > ? File "", line 1, in > TypeError: object of type 'iterator' has no len() > You're right. Actually it depends on the iterator. For example, len(xrange(100)) works (xrange also returns an iterator). I guess in general an iterator can't have a len() function because it's not clear that the iterator will ever end. That said, currently the iterlen function is used in only one place, in Bio/Phylo/BaseTree.py as follows: def count_terminals(self): return _utils.iterlen(self.find_clades(terminal=True)) But here you could simply have def count_terminals(self): clades = self.find_clades(terminal=True) count = 0 for clade in clades: count+=1 return count I don't see why we need a function iterlen for this, and if we do have such a function, why it should be in Bio._utils. Best, -Michiel. From p.j.a.cock at googlemail.com Thu Jan 10 21:16:12 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 10 Jan 2013 21:16:12 +0000 Subject: [Biopython-dev] Bio._utils iterlen not needed In-Reply-To: <1357838654.1021.YahooMailClassic@web164001.mail.gq1.yahoo.com> References: <1357838654.1021.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Thu, Jan 10, 2013 at 5:24 PM, Michiel de Hoon wrote: > --- On Thu, 1/10/13, Peter Cock wrote: >> > Simply calling len(items) does exactly what iterlen >> > does, and is much faster too. 
>> >> No, the reason d'?tre for iterlen is that you can't use len >> on an iterator, e.g. >> >> >>> len(iter("abcde")) >> Traceback (most recent call last): >> File "", line 1, in >> TypeError: object of type 'iterator' has no len() > > You're right. Actually it depends on the iterator. For example, > len(xrange(100)) works (xrange also returns an iterator). I guess > in general an iterator can't have a len() function because it's not > clear that the iterator will ever end. Good point - I didn't know xrange defined __len__, and you are right in general - other iterator object could also do that: https://github.com/biopython/biopython/commit/57ae89cdedbc1e18495ffb615a3a1d2c9feb0296 > That said, currently the iterlen function is used in only one place, > in Bio/Phylo/BaseTree.py as follows: True. I hadn't checked that - I assumed it was used more than once. If there are no other natural placed where it would make sense then yes, it might as well be done in line once, and Bio._utils.iterlen could be removed. When written, iterlen was in private module Bio.Phylo._sugar (CC'ing Eric) which Bow moved to Bio._utils as he wanted to use some of it in SearchIO. Peter From eric.talevich at gmail.com Thu Jan 10 21:50:45 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 10 Jan 2013 16:50:45 -0500 Subject: [Biopython-dev] Bio._utils iterlen not needed In-Reply-To: References: <1357838654.1021.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Thu, Jan 10, 2013 at 4:16 PM, Peter Cock wrote: > On Thu, Jan 10, 2013 at 5:24 PM, Michiel de Hoon > wrote: > > That said, currently the iterlen function is used in only one place, > > in Bio/Phylo/BaseTree.py as follows: > > True. I hadn't checked that - I assumed it was used more > than once. If there are no other natural placed where it would > make sense then yes, it might as well be done in line once, > and Bio._utils.iterlen could be removed. 
> > When written, iterlen was in private module Bio.Phylo._sugar > (CC'ing Eric) which Bow moved to Bio._utils as he wanted to > use some of it in SearchIO. > That's all true. I created _sugar.py during GSoC 2009 for utility code that Bio.Phylo needed, but wasn't related to trees in any way -- similar to Bow's thinking. I probably meant to get rid of the module entirely after the grand merge (hence the note at the top of _sugar.py to keep the file as small as possible). IIRC, I made it a separate function while testing whether "enumerate" or "cnt += 1" would be faster. I have no objections to getting rid of the function now. -E From mjldehoon at yahoo.com Fri Jan 11 12:36:15 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 11 Jan 2013 04:36:15 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone Message-ID: <1357907775.13851.YahooMailClassic@web164001.mail.gq1.yahoo.com> Hi everybody, Bio.ParserSupport has had a PendingDeprecationWarning since Biopython 1.59, so we may consider upgrading this to a BiopythonDeprecationWarning in Biopython 1.61 before removing Bio.ParserSupport. The only tricky point is that then we would also have to upgrade the PendingDeprecationWarning in Bio/Blast/NCBIStandalone.py to a BiopythonDeprecationWarning, as that code relies on Bio.ParserSupport. Bio.Blast.NCBIStandalone has had this PendingDeprecationWarning since Biopython release 1.56. Any objections? This may help giving Bow's Bio.SearchIO module some more prominence. On a related point, the fact that we are deprecating Bio.ParserSupport (which was a painful process) suggests that having a new module Bio._utils with a set of generic utility functions is not a good idea. Best, -Michiel. 
From p.j.a.cock at googlemail.com Fri Jan 11 15:33:05 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 11 Jan 2013 15:33:05 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1357907775.13851.YahooMailClassic@web164001.mail.gq1.yahoo.com> References: <1357907775.13851.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Fri, Jan 11, 2013 at 12:36 PM, Michiel de Hoon wrote: > Hi everybody, > > Bio.ParserSupport has had a PendingDeprecationWarning since Biopython > 1.59, so we may consider upgrading this to a BiopythonDeprecationWarning in > Biopython 1.61 before removing Bio.ParserSupport. The only tricky point is > that then we would also have to upgrade the PendingDeprecationWarning in > Bio/Blast/NCBIStandalone.py to a BiopythonDeprecationWarning, as that code > relies on Bio.ParserSupport. Bio.Blast.NCBIStandalone has had this > PendingDeprecationWarning since Biopython release 1.56. > > Any objections? This may help giving Bow's Bio.SearchIO module some more > prominence. Bow's SearchIO is using Bio.Blast.NCBIStandalone to handle plain text, https://github.com/biopython/biopython/blob/master/Bio/SearchIO/BlastIO/blast_text.py We'd discussed a new parser targeting just the plain text from BLAST+ (and if not too different maybe the final legacy BLAST release), which should be less diverse that the current range of BLAST quirks built up over the years. > On a related point, the fact that we are deprecating Bio.ParserSupport > (which was a painful process) suggests that having a new module Bio._utils > with a set of generic utility functions is not a good idea. That's why Bio._utils is a private module - we can drop/change/etc this without worrying about breaking other people's code. The issue with Bio.ParserSupport is it was a public API. 
Regards, Peter From w.arindrarto at gmail.com Sun Jan 13 15:22:13 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sun, 13 Jan 2013 16:22:13 +0100 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: References: <1357907775.13851.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: Hi everyone, >> Bio.ParserSupport has had a PendingDeprecationWarning since Biopython >> 1.59, so we may consider upgrading this to a BiopythonDeprecationWarning in >> Biopython 1.61 before removing Bio.ParserSupport. The only tricky point is >> that then we would also have to upgrade the PendingDeprecationWarning in >> Bio/Blast/NCBIStandalone.py to a BiopythonDeprecationWarning, as that code >> relies on Bio.ParserSupport. Bio.Blast.NCBIStandalone has had this >> PendingDeprecationWarning since Biopython release 1.56. >> >> Any objections? This may help giving Bow's Bio.SearchIO module some more >> prominence. > > Bow's SearchIO is using Bio.Blast.NCBIStandalone to handle plain text, > https://github.com/biopython/biopython/blob/master/Bio/SearchIO/BlastIO/blast_text.py > > We'd discussed a new parser targeting just the plain text from BLAST+ > (and if not too different maybe the final legacy BLAST release), which > should be less diverse that the current range of BLAST quirks built up > over the years. Yes. Until such a parser is ready, Bio.ParserSupport is still needed. We may still deprecate it from the visible / public namespace and move it into a private module, though. If we are also deprecating Bio.BLAST, then moving Bio.BLAST.NCBIStandalone into a private module as well seems like an ok fix for the time being. regards, Bow From p.j.a.cock at googlemail.com Tue Jan 15 15:28:07 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 15 Jan 2013 15:28:07 +0000 Subject: [Biopython-dev] buildbot issue on Python 3.1 - stdout? 
In-Reply-To: References: Message-ID: On Fri, Dec 14, 2012 at 12:48 PM, Wibowo Arindrarto wrote: > Hi everyone, > >>> It's reproducible in my machine: Arch Linux 64 bit running >>> Python3.1.5. Haven't figured out a fix yet, but trying to see if I >>> can. >> >> Great. We haven't really proved this is down to a change in >> either Python 3.1.4 or 3.1.5 but it does look likely. > > It's reproduced in my local 3.1.4 installation. Seems like an unfixed > bug that went through to 3.1.5. Regarding this issue with test_Emboss.py, AttributeError: '_io.FileIO' object has no attribute 'read1' http://lists.open-bio.org/pipermail/biopython-dev/2012-December/010156.html I've now tried downgrading Python 3.1 on this machine, and it does seem to be a problem under Python 3.1.4 and 3.1.5 but not 3.1.3. For now I have simply left this buildslave running 3.1.3 instead. I will also downgrade Python 3.1 on the second 64 bit Linux server. That should take care of the annoying buildbot failures (and the daily email I've been getting). This thread may help someone else with a similar issue, but I don't feel inclined to try and explore in any more depth what exactly is going wrong under Python 3.1.4 and 3.1.5, and if there is a Python bug we should report. Regards, Peter From kai.blin at biotech.uni-tuebingen.de Tue Jan 15 15:54:45 2013 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Tue, 15 Jan 2013 16:54:45 +0100 Subject: [Biopython-dev] More 'fun' with GenBank Message-ID: <50F57BC5.7020607@biotech.uni-tuebingen.de> Hi folks, as people are hitting my web service with all sorts of wonky GenBank files, I've stumbled over another one that throws the GenBank parser off track. 
The culprit is a SeqFeature with a location line like:

     CDS             join(complement(4093..4338),complement(3876..4011),
                     complement(3655..3809),complement(3284..3585),
                     complement(2421..2813),complement(2057..2303))

Now, the way I read the GenBank spec, this is not a valid location line, but should instead be a complement() of a join(). Unfortunately, the NCBI seems to disagree with its own specs, and put the record into their Nucleotide database as CABT02000004, which means that for all practical purposes, it _is_ a valid GenBank file and the parser should cope.

The parser looks at this location and creates a feature on the -1 strand, from 4092:2303. This is caused by the feature location calculation on
https://github.com/biopython/biopython/blob/master/Bio/GenBank/__init__.py#L1049
and the lines after.

In short, we do

    s = cur_feature.sub_features[0].location.start
    e = cur_feature.sub_features[-1].location.end
    cur_feature.location = SeqFeature.FeatureLocation(s, e, strand)

And when the join() looks like the record I'm dealing with, this is clearly the wrong way around.

I decided to fix this by sorting the subfeatures by start, end coordinates, and that fixes this issue for me. Unfortunately, this also breaks an existing test, the extra_keywords.gb test.
https://github.com/biopython/biopython/blob/master/Tests/GenBank/extra_keywords.gb#L647
has a feature that has a location of

     CDS             join(153490..154269,AL121804.2:41..610,
                     AL121804.2:672..1487)

Here, we probably do want the feature from 153489:1487, even though I'm not sure how useful such a location really is. So I decided to fix this by sorting the subfeatures first on their ref, and then on start, end.

This again breaks a test, this time in one_of.gb
https://github.com/biopython/biopython/blob/master/Tests/GenBank/one_of.gb#L39
where the location line is

     CDS             join(2201..2479,U18267.1:120..246,U18268.1:130..288,
                     U18270.1:4691..4788,U18269.1:82..>128)

Here, the U18270.1 record seems to come before the U18269.1 record.
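
The sorting idea described above can be illustrated outside the parser. The sketch below is an assumption-laden toy: SubLocation and overall_span are hypothetical names, plain tuples stand in for Biopython's location objects, and it is not the attached patch.

```python
from collections import namedtuple

# Hypothetical stand-in for a sub-feature location: `ref` is None for
# the local record, `start`/`end` are 0-based, end-exclusive as in Python.
SubLocation = namedtuple("SubLocation", ["ref", "start", "end"])


def overall_span(parts):
    """Return (start, end) after sorting parts by ref, then start, end.

    Local parts (ref None) are sorted ahead of parts on other records.
    """
    ordered = sorted(parts, key=lambda p: (p.ref or "", p.start, p.end))
    return ordered[0].start, ordered[-1].end


# join(complement(4093..4338), ..., complement(2057..2303)) from CABT02000004
cabt = [SubLocation(None, 4092, 4338), SubLocation(None, 3875, 4011),
        SubLocation(None, 3654, 3809), SubLocation(None, 3283, 3585),
        SubLocation(None, 2420, 2813), SubLocation(None, 2056, 2303)]
print(overall_span(cabt))

# join(153490..154269,AL121804.2:41..610,AL121804.2:672..1487)
extra = [SubLocation(None, 153489, 154269),
         SubLocation("AL121804.2", 40, 610),
         SubLocation("AL121804.2", 671, 1487)]
print(overall_span(extra))
```

With the unsorted approach the first example would yield start 4092 and end 2303; sorting first gives a span whose start is not after its end.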
Now, we're again spanning a feature into multiple contigs, none of which are accessible to the extract() function as far as I'm aware. Sorting the locations by start, end (and maybe ref first) at least fixes the case CABT02000004 is broken on where we have the chance of getting extract() to work. The attached patch is my proposed change, but I wanted to get some feedback first before opening a bug and/or submitting a pull request. Cheers, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-Universit?t T?bingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 T?bingen Fax : ++49 7071 29-5979 Germany Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-GenBank-Sort-subfeatures-by-ref-and-start-end-positi.patch Type: text/x-patch Size: 9059 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Tue Jan 15 16:41:32 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 15 Jan 2013 16:41:32 +0000 Subject: [Biopython-dev] More 'fun' with GenBank In-Reply-To: <50F57BC5.7020607@biotech.uni-tuebingen.de> References: <50F57BC5.7020607@biotech.uni-tuebingen.de> Message-ID: On Tue, Jan 15, 2013 at 3:54 PM, Kai Blin wrote: > Hi folks, > > as people are hitting my web service with all sorts of wonky GenBank > files, I've stumbled over another one that throws the GenBank parser off > track. > > The culprit is a SeqFeature with a location line like: > > CDS join(complement(4093..4338),complement(3876..4011), > complement(3655..3809),complement(3284..3585), > complement(2421..2813),complement(2057..2303)) > > Now, the way I read the GenBank spec, this is not a valid location line, > but should instead be a complement() of joins(). 
Unfortunately, the NCBI > seems to disagree with its own specs, and put the record into their > Nucleotide database as CABT02000004, which means that by all practical > purposes, it _is_ a valid GenBank file and the parser should cope. That should work - for a while GenBank and EMBL didn't agree about joins on the complement strand, one did complement(join(a..b,c..d)) and the other join(complement(c..d),complement(a..b)), notice the order of the sub-regions flips. > The parser looks at this location and creates a feature on the -1 > strand, from 4092:2303. This is caused by by the feature location > calculation on > https://github.com/biopython/biopython/blob/master/Bio/GenBank/__init__.py#L1049 > and the lines after. > > In short, we do > s = cur_feature.sub_features[0].location.start > e = cur_feature.sub_features[-1].location.end > cur_feature.location = SeqFeature.FeatureLocation(s, e, strand) For join feature locations, the sub-feature locations should be fine but the overall feature location is a bit weird/broken for negative and mixed strands. This was one of the things the re-factoring on this branch aimed to fix, https://github.com/peterjc/biopython/tree/f_loc4/ http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html I was intending to bring this up again after the next release (which could be later this month or February 2012), but perhaps it would be worth doing now? Peter From arklenna at gmail.com Tue Jan 15 17:19:48 2013 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 15 Jan 2013 12:19:48 -0500 Subject: [Biopython-dev] More 'fun' with GenBank In-Reply-To: References: <50F57BC5.7020607@biotech.uni-tuebingen.de> Message-ID: +1 for f_loc4. The FeatureLocation/CompoundLocation classes will hopefully make handling joins and other GenBank operators a little more logical. Not to mention my CoordinateMapper is based on this branch! 
Lenna On Tue, Jan 15, 2013 at 11:41 AM, Peter Cock wrote: > On Tue, Jan 15, 2013 at 3:54 PM, Kai Blin > wrote: > > Hi folks, > > > > as people are hitting my web service with all sorts of wonky GenBank > > files, I've stumbled over another one that throws the GenBank parser off > > track. > > > > The culprit is a SeqFeature with a location line like: > > > > CDS join(complement(4093..4338),complement(3876..4011), > > complement(3655..3809),complement(3284..3585), > > complement(2421..2813),complement(2057..2303)) > > > > Now, the way I read the GenBank spec, this is not a valid location line, > > but should instead be a complement() of joins(). Unfortunately, the NCBI > > seems to disagree with its own specs, and put the record into their > > Nucleotide database as CABT02000004, which means that by all practical > > purposes, it _is_ a valid GenBank file and the parser should cope. > > That should work - for a while GenBank and EMBL didn't agree about > joins on the complement strand, one did complement(join(a..b,c..d)) > and the other join(complement(c..d),complement(a..b)), notice the > order of the sub-regions flips. > > > The parser looks at this location and creates a feature on the -1 > > strand, from 4092:2303. This is caused by by the feature location > > calculation on > > > https://github.com/biopython/biopython/blob/master/Bio/GenBank/__init__.py#L1049 > > and the lines after. > > > > In short, we do > > s = cur_feature.sub_features[0].location.start > > e = cur_feature.sub_features[-1].location.end > > cur_feature.location = SeqFeature.FeatureLocation(s, e, > strand) > > For join feature locations, the sub-feature locations should be fine > but the overall feature location is a bit weird/broken for negative > and mixed strands. 
> > This was one of the things the re-factoring on this branch aimed to > fix, https://github.com/peterjc/biopython/tree/f_loc4/ > http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html > > I was intending to bring this up again after the next release (which > could be later this month or February 2012), but perhaps it would > be worth doing now? > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Tue Jan 15 19:03:51 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 15 Jan 2013 19:03:51 +0000 Subject: [Biopython-dev] More 'fun' with GenBank In-Reply-To: References: <50F57BC5.7020607@biotech.uni-tuebingen.de> Message-ID: On Tue, Jan 15, 2013 at 5:19 PM, Lenna Peterson wrote: > +1 for f_loc4. The FeatureLocation/CompoundLocation classes will hopefully > make handling joins and other GenBank operators a little more logical. Not > to mention my CoordinateMapper is based on this branch! > > Lenna It will need a bit of work to rebase (some of the PEP8 changes have touched the same lines of code), but I will try and do that this week. Peter From antony.lee at berkeley.edu Tue Jan 15 21:45:19 2013 From: antony.lee at berkeley.edu (Antony Lee) Date: Tue, 15 Jan 2013 13:45:19 -0800 Subject: [Biopython-dev] Circular sequences Message-ID: <20130115214519.GC8511@gmail.com> Hi all, While working on a (more sane?) rewrite of the Restriction library (https://github.com/biopython/biopython/pull/148), I found the need to add a circular/linear attribute to sequence objects (just as the currently existing Restriction library does). So I quickly added such a class, independently of whatever Biopython currently provides. But it seems like the module would be better integrated in the rest of Biopython if it used Bio.Seq.Seq instead. 
I saw that CircularSeqs have already been discussed on the mailing list, and the main issue was with indexing and slicing. So here are my thoughts about how such an object should behave. Assume a circular seq s of length 10. Simple indexing works modulo 10 (and negative indices work identically). Methods that return one or more indices return the indices modulo 10. Slicing with both ends defined (i.e. s[x:y(:z)]) wrap as many times as needed around the sequence if y >= x, and make at most one complete cycle if y < x (i.e. add len(s) as many times as needed to y to make it bigger than x, and stop there). Slicing with one or both ends undefined (ie. s[:], s[x:], s[:y]) raises an IndexError (because, well, I read s[x:] as "return the elements of s starting from the x'th until the end"... but there is no such end.). (A second option would be to return an infinite iterable for s[x:], but that doesn't take care of s[:y] anyways, not to mention the bugs that may appear from that.) A few other issues were addressed in the previous thread. I think that adding CircularSeqs does not make sense at all (so __add__ raises a ValueError), and translation can either check for the presence of a stop codon and raise ValueError otherwise, or return an infinite iterator. Another thing that may be useful for a restriction analysis library is a good way to represent a dsDNA sequence with some overhangs. Any thoughts? Antony From kai.blin at biotech.uni-tuebingen.de Wed Jan 16 08:28:06 2013 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Wed, 16 Jan 2013 09:28:06 +0100 Subject: [Biopython-dev] More 'fun' with GenBank In-Reply-To: References: <50F57BC5.7020607@biotech.uni-tuebingen.de> Message-ID: <50F66496.8000109@biotech.uni-tuebingen.de> On 2013-01-15 20:03, Peter Cock wrote: Hi Peter, > It will need a bit of work to rebase (some of the PEP8 changes have > touched the same lines of code), but I will try and do that this week. 
Your f_loc4 branch certainly fixes the problem I'm seeing. Is there anything I can do to help with getting it merged? I'm happy to give a closer look at the rebase conflicts coming up during the merge if you don't mind me asking the occasional question if I can't work out reasons for a code change from the commit messages. Cheers, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-University of T?bingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 T?bingen Fax : ++49 7071 29-5979 Deutschland Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From Markus.Piotrowski at ruhr-uni-bochum.de Wed Jan 16 09:42:54 2013 From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski) Date: 16 Jan 2013 10:42:54 +0100 Subject: [Biopython-dev] Circular sequences In-Reply-To: <20130115214519.GC8511@gmail.com> References: <20130115214519.GC8511@gmail.com> Message-ID: <50F6761E.9000606@ruhr-uni-bochum.de> Am 15.01.2013 22:45, schrieb Antony Lee: > needed to y to make it bigger than x, and stop there). Slicing with one > or both ends undefined (ie. s[:], s[x:], s[:y]) raises an IndexError > (because, well, I read s[x:] as "return the elements of s starting from > the x'th until the end"... but there is no such end.). (A second option > would be to return an infinite iterable for s[x:], but that doesn't take > care of s[:y] anyways, not to mention the bugs that may appear from > that.) Another possibility, which makes some biological sense (thinking on restriction), would be that s[x:] (or s[:y]) returns a linear sequence starting at x and ending with x-1 (or ending with y and starting at y+1). Thus, s[x:] would mean 'cut my circle at x and return the linear sequence starting at x'. 
Markus From p.j.a.cock at googlemail.com Wed Jan 16 10:24:13 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 16 Jan 2013 10:24:13 +0000 Subject: [Biopython-dev] Circular sequences In-Reply-To: <50F6761E.9000606@ruhr-uni-bochum.de> References: <20130115214519.GC8511@gmail.com> <50F6761E.9000606@ruhr-uni-bochum.de> Message-ID: For those that missed it last time, I think the most recent in depth discussion about circular sequences and slicing was here: http://lists.open-bio.org/pipermail/biopython/2011-March/007075.html ... http://lists.open-bio.org/pipermail/biopython/2011-March/007085.html On Wed, Jan 16, 2013 at 9:42 AM, Markus Piotrowski wrote: > Am 15.01.2013 22:45, schrieb Antony Lee: > >> needed to y to make it bigger than x, and stop there). Slicing with one >> or both ends undefined (ie. s[:], s[x:], s[:y]) raises an IndexError >> (because, well, I read s[x:] as "return the elements of s starting from >> the x'th until the end"... but there is no such end.). (A second option >> would be to return an infinite iterable for s[x:], but that doesn't take >> care of s[:y] anyways, not to mention the bugs that may appear from >> that.) > > > Another possibility, which makes some biological sense (thinking on > restriction), would be that > s[x:] (or s[:y]) returns a linear sequence starting at x and ending with x-1 > (or ending with y and starting at y+1). Thus, s[x:] would mean 'cut my > circle at x and return the linear sequence starting at x'. That's exactly the kind of behaviour which would make me nervous given in general the Biopython sequence objects mimic Python strings. There are many examples where that 'extra' sequence would be unexpected. For instance, writing out line wrapped sequence data. I would prefer an explicit method like 'cut' on a circular sequence object returning a full length linear sequence. Similarly a 'roll' or 'rotate' method could shift the origin to a new coordinate. 
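
To make the explicit-method idea concrete, here is a minimal sketch on plain strings. CircularString is a hypothetical toy class, not an existing or proposed Biopython API, and the method names simply follow the 'cut' and 'roll' wording used above.

```python
class CircularString(object):
    """Toy circular sequence backed by a plain str (hypothetical)."""

    def __init__(self, data):
        self.data = str(data)

    def _offset(self, position):
        # Wrap any integer position onto the circle; guard empty data.
        return position % (len(self.data) or 1)

    def roll(self, origin):
        """Return a new circular string with its origin moved to `origin`."""
        k = self._offset(origin)
        return CircularString(self.data[k:] + self.data[:k])

    def cut(self, position=0):
        """Linearise: return a full-length plain str starting at `position`."""
        k = self._offset(position)
        return self.data[k:] + self.data[:k]


plasmid = CircularString("ABCDEFGHIJ")
print(plasmid.cut(3))
print(plasmid.roll(3).data)
```

Keeping these as named methods leaves ordinary slicing free to behave exactly like Python strings, which is the concern raised above.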
One simple solution to the complexities of the slice behaviour is the practical one: They act like Python strings, basically all we would be adding would an 'is circular' flag and some logic about how to propagate that flag in operations like addition and slicing. If we went that route it might still be possible to make the find and 'in' functionality origin aware... but that may just cause trouble. This would solve where to store if a sequence is circular (e.g. when reading GenBank and EMBL files - or for handling restriction enzyme digests), but other than that not add much utility. Thoughts? Peter From antony.lee at berkeley.edu Wed Jan 16 19:09:32 2013 From: antony.lee at berkeley.edu (Antony Lee) Date: Wed, 16 Jan 2013 11:09:32 -0800 Subject: [Biopython-dev] Circular sequences In-Reply-To: References: <20130115214519.GC8511@gmail.com> <50F6761E.9000606@ruhr-uni-bochum.de> Message-ID: <20130116190932.GA1962@gmail.com> I think the proposed behaviour makes biological sense (now s[x:] and s[:y] mean "cut the sequence before x (or before y) and keep the downstream (or upstream) sequence, whatever it is"). But I understand Peter's concerns as well. A quick grep showed me around 400 instances of "[:" showing up in the current code base, and as many ":]", and most of them seem to be related to string (as opposed to sequence) processing so checking these may not be impossible (though not very fun of course), but this won't protect against future mis-uses of sequence indexing. So I think methods such as cut and roll are fine too (and go back to raising ValueError when either or both ends of the slice are None). Now it would be the responsibility of sequence-consuming functions to start by .cut()ting the sequence before slicing it. 
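
On the question of an origin-aware 'in' test raised in this thread, one common approach is to search the linear sequence extended by its own first len(pattern) - 1 letters. The following is a minimal sketch with plain strings; circular_contains is a hypothetical name, and the handling of empty or over-long patterns is an assumption rather than anything agreed on the list.

```python
def circular_contains(circular, pattern):
    """Return True if `pattern` occurs in `circular`, allowing matches
    that wrap around the origin (at most one lap)."""
    if not pattern:
        return True  # assumption: mirror str behaviour for empty patterns
    if len(pattern) > len(circular):
        return False  # assumption: no multi-lap matches
    # Append just enough of the start to expose origin-spanning matches.
    extended = circular + circular[:len(pattern) - 1]
    return pattern in extended


print(circular_contains("ABCDE", "EAB"))
print(circular_contains("ABCDE", "EB"))
```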
find and __contains__ can be implemented easily (though perhaps inelegantly) by changing "foo in circular(bar)" into "foo in linear(bar) + linear(bar)[:len(foo)-1]" (which is essentially what is done in both Restriction libraries, the old and the new one). Finally let me say that right now I don't use the most of the rest of Biopython (and don't really think I'll use most of it in the near future) so I care little about whether this specific feature gets integrated or not; however I do think it is needed in a proper restriction analysis library. Indeed, one could say that we just have to add a "circular=True|False" keyword argument to methods such as search and catalyze, but that is not enough to distinguish e.g. if a circular plasmid is digested once or not at all (of course, one can check separately but what I mean there is that circularity is a natural "output" of the functions, not just input). Antony On Wed, Jan 16, 2013 at 10:24:13AM +0000, Peter Cock wrote: > For those that missed it last time, I think the most recent in depth > discussion about circular sequences and slicing was here: > > http://lists.open-bio.org/pipermail/biopython/2011-March/007075.html > ... > http://lists.open-bio.org/pipermail/biopython/2011-March/007085.html > > On Wed, Jan 16, 2013 at 9:42 AM, Markus Piotrowski > wrote: > > Am 15.01.2013 22:45, schrieb Antony Lee: > > > >> needed to y to make it bigger than x, and stop there). Slicing with one > >> or both ends undefined (ie. s[:], s[x:], s[:y]) raises an IndexError > >> (because, well, I read s[x:] as "return the elements of s starting from > >> the x'th until the end"... but there is no such end.). (A second option > >> would be to return an infinite iterable for s[x:], but that doesn't take > >> care of s[:y] anyways, not to mention the bugs that may appear from > >> that.) 
> > > > > > Another possibility, which makes some biological sense (thinking on > > restriction), would be that > > s[x:] (or s[:y]) returns a linear sequence starting at x and ending with x-1 > > (or ending with y and starting at y+1). Thus, s[x:] would mean 'cut my > > circle at x and return the linear sequence starting at x'. > > That's exactly the kind of behaviour which would make me nervous > given in general the Biopython sequence objects mimic Python strings. > There are many examples where that 'extra' sequence would be > unexpected. For instance, writing out line wrapped sequence data. > > I would prefer an explicit method like 'cut' on a circular sequence > object returning a full length linear sequence. Similarly a 'roll' or > 'rotate' method could shift the origin to a new coordinate. > > One simple solution to the complexities of the slice behaviour is > the practical one: They act like Python strings, basically all we > would be adding would an 'is circular' flag and some logic about > how to propagate that flag in operations like addition and slicing. > If we went that route it might still be possible to make the find and > 'in' functionality origin aware... but that may just cause trouble. > > This would solve where to store if a sequence is circular (e.g. when > reading GenBank and EMBL files - or for handling restriction > enzyme digests), but other than that not add much utility. > > Thoughts? > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From redmine at redmine.open-bio.org Fri Jan 18 09:43:26 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 18 Jan 2013 09:43:26 +0000 Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets References: Message-ID: Issue #3395 has been updated by Michiel de Hoon. 
Michał, can you confirm that the fixed Bio.trie works for you? Then we can close this bug report.

----------------------------------------
Bug #3395: Biopython trie implementation can't load large data sets
https://redmine.open-bio.org/issues/3395

Author: Michał Nowotka
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version:
URL:

Imagine I have Biopython trie:

    from Bio import trie
    import gzip

    f = gzip.open('/tmp/trie.dat.gz', 'w')
    tr = trie.trie()
    # fill in the trie
    trie.save(f, tr)

Now /tmp/trie.dat.gz is about 50MB. Let's try to read it:

    from Bio import trie
    import gzip

    f = gzip.open('/tmp/trie.dat.gz', 'r')
    tr = trie.load(f)

Unfortunately I'm getting meaningless error saying:

"loading failed for some reason"

Any hints?

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Fri Jan 18 15:17:43 2013
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 18 Jan 2013 15:17:43 +0000
Subject: [Biopython-dev] [Biopython - Bug #3395] Biopython trie implementation can't load large data sets
References:
Message-ID:

Issue #3395 has been updated by Michał Nowotka.

Can you just give me two more weeks? I need some time to evaluate it.

----------------------------------------
Bug #3395: Biopython trie implementation can't load large data sets
https://redmine.open-bio.org/issues/3395

Author: Michał Nowotka
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version:
URL:

Imagine I have Biopython trie:

    from Bio import trie
    import gzip

    f = gzip.open('/tmp/trie.dat.gz', 'w')
    tr = trie.trie()
    # fill in the trie
    trie.save(f, tr)

Now /tmp/trie.dat.gz is about 50MB.
Let's try to read it: from Bio import trie import gzip f = gzip.open('/tmp/trie.dat.gz', 'r') tr = trie.load(f) Unfortunately I'm getting meaningless error saying: "loading failed for some reason" Any hints? -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From eric.talevich at gmail.com Sat Jan 19 01:20:11 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 18 Jan 2013 20:20:11 -0500 Subject: [Biopython-dev] Support for NeXML and RDF trees in Bio.Phylo In-Reply-To: References: Message-ID: On Fri, Dec 28, 2012 at 10:50 AM, Ben Morris wrote: > On Tue, Dec 25, 2012 at 2:18 AM, Eric Talevich > wrote: > > > > On Mon, Dec 24, 2012 at 8:58 AM, Ben Morris wrote: > >> > >> Hi all, > >> > >> I've implemented support for two new phylogenetic tree formats: NeXML > and > >> RDF (conforming to the Comparative Data Analysis Ontology). > >> > >> I noticed that NeXML support was planned, but I didn't see anyone > working > >> on it on GitHub and the feature request hadn't been updated in about a > >> year, so I went ahead and implemented a simple version. At first I tried > >> the generateDS.py approach, but the generated writer doesn't give very > much > >> control over the output, so I ended up writing my own parser/writer > using > >> ElementTree. > >> > >> As for the RDF/CDAO format, AFAIK this is not a format that's supported > by > >> any other phylogenetic libraries, so I'm not sure how useful this is to > >> everyone else. It provides a simple, standards-compliant format that > can be > >> imported to a triple store and supports annotation. We'll be using it at > >> NESCent so I wanted to make it available to everyone else as well. The > >> parser and writer require the Redlands Python bindings. 
> >> > >> The code is available in my fork of Biopython, > >> > >> https://github.com/bendmorris/biopython > >> > >> under branches "cdao" and "nexml." I'd love to get everyone's thoughts > and > >> see if these contributions would be a good fit for the Biopython > project. > > > > > > > > Thanks for letting us know! I'll try it out soonish. Looking at the code > on your nexml branch, I have a few comments: > > > > - The parser uses ElementTree.parse rather than iterparse, so in its > current state it would not be able to parse massive files (those larger > than available RAM). Worth fixing eventually? > > Great point. I rewrote it to use iterparse instead. > > > - The parser creates Newick.Tree and Newick.Clade objects, which is > nearly correct in my opinion. I would suggest subclassing BaseTree.Tree and > BaseTree.Clade to create NeXML-specific Tree and Clade classes, even if you > don't have any additional attributes to attach to those classes at the > moment. (These would go in a new file NeXML.py, similar to PhyloXML.py and > PhyloXMLIO.py.) > > Went ahead and did this as well. > Thanks! Sorry for the pace of this, I'm in the midst of a dissertation. > - The 'confidence' or 'confidences' attribute isn't used (for e.g. > bootstrap support values). Does NeXML define it? > > Not that I'm aware of, but I'm not sure. I searched > http://nexml.org/nexml/html/doc/schema-1/ and didn't find anything. > I'm going to ask some people who know more about this than I do. > I would like for Bio.Phylo's I/O modules to be able to successfully round-trip a file from Newick to phyloXML to NeXML and back to Newick without losing support values. 
I found these two examples of how to add this data to a NeXML document by referencing CDAO: https://www.nescent.org/wg_evoinfo/NeXML_Test_Files#Bootstraps_represented_using_the_.22meta.22_tag https://www.nescent.org/wg_evoinfo/NeXML_Test_Files#Bootstraps_represented_without_new_tags_or_elements That's the standard way to store bootstrap supports in NeXML (Hilmar confirms). How do your NeXML and CDAO modules interact, if at all? Would the CDAO modules be useful to properly support NeXML metadata like support/confidence values, or would it be simpler to just hard-code the few tags we're specifically interested in? Relatedly, those look like good test files. I see you've started writing NeXML unit tests already; if you would like help with any of this, just let me know. -Eric From mjldehoon at yahoo.com Sun Jan 20 07:30:24 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 19 Jan 2013 23:30:24 -0800 (PST) Subject: [Biopython-dev] Bio.Motif update Message-ID: <1358667024.24762.YahooMailClassic@web164004.mail.gq1.yahoo.com> Dear all, As we discussed previously, I've been going over Bio.Motif to update it and make its usage more explicit. I'm pretty much done. While I have been uploading my changes to the main biopython github repository, this does not mean that these changes are final; comments and suggestions for changes are welcome. In many cases, there is a difference in the syntax between the old Bio.Motif and the new Bio.Motif. For example, motif.consensus is a method in the old Bio.Motif, but a property in the new Bio.Motif. While I tried to put PendingDeprecationWarnings on all changes consistently, there may be some corner cases that I missed. 
For this reason, and also to make the documentation more understandable, it may be better to put the new Bio.Motif code in a module Bio.motifs, to put the old Bio.Motif code back into Bio.Motif (so that Bio.Motif in release 1.61 will be identical to the Bio.Motif in release 1.60), and (assuming that we are happy with the new Bio.motifs modules) put a PendingDeprecationWarning on Bio.Motif as a whole. Then in the documentation we'll have one chapter on Bio.Motif and one chapter on Bio.motifs. Also we'll have one set of tests for Bio.Motif, and one set of tests for Bio.motifs. Any objections to creating a separate Bio.motifs module? Here you can find the relevant chapter in the current documentation on the new Bio.Motif: http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html#htoc190 Best, -Michiel From p.j.a.cock at googlemail.com Sun Jan 20 19:03:45 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 20 Jan 2013 19:03:45 +0000 Subject: [Biopython-dev] More 'fun' with GenBank In-Reply-To: <50F66496.8000109@biotech.uni-tuebingen.de> References: <50F57BC5.7020607@biotech.uni-tuebingen.de> <50F66496.8000109@biotech.uni-tuebingen.de> Message-ID: On Wed, Jan 16, 2013 at 8:28 AM, Kai Blin wrote: > On 2013-01-15 20:03, Peter Cock wrote: > > Hi Peter, > >> It will need a bit of work to rebase (some of the PEP8 changes have >> touched the same lines of code), but I will try and do that this week. > > Your f_loc4 branch certainly fixes the problem I'm seeing. Is there > anything I can do to help with getting it merged? I'm happy to give a > closer look at the rebase conflicts coming up during the merge if you > don't mind me asking the occasional question if I can't work out reasons > for a code change from the commit messages. 
> > Cheers, > Kai I've done the rebase - all the tests still pass so if I missed anything it should just be minor: https://github.com/peterjc/biopython/commits/f_loc4 (old) https://github.com/peterjc/biopython/commits/f_loc5 (rebased) Kai - would you mind retesting with f_loc5 (the rebased branch)? Everyone - does it seem sensible to include this now, ready for the upcoming release (*)? Or perhaps just after the release? Peter (*) See other thread about Bio.Motif, which I think is all we need to address before doing the release: http://lists.open-bio.org/pipermail/biopython-dev/2013-January/010235.html From bartek at rezolwenta.eu.org Sun Jan 20 22:34:42 2013 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Sun, 20 Jan 2013 23:34:42 +0100 Subject: [Biopython-dev] Bio.Motif update In-Reply-To: <1358667024.24762.YahooMailClassic@web164004.mail.gq1.yahoo.com> References: <1358667024.24762.YahooMailClassic@web164004.mail.gq1.yahoo.com> Message-ID: Hi, great job Michiel! It looks very nice overall. As the code that will be using the new library needs to be changed, I would vote for the change in the namespace, but given that the userbase of the Bio.Motif was quite limited, I think it wouldn't cause major problems to keep the name as is. best Bartek On Sun, Jan 20, 2013 at 8:30 AM, Michiel de Hoon wrote: > Dear all, > > As we discussed previously, I've been going over Bio.Motif to update it and make its usage more explicit. I'm pretty much done. While I have been uploading my changes to the main biopython github repository, this does not mean that these changes are final; comments and suggestions for changes are welcome. > > In many cases, there is a difference in the syntax between the old Bio.Motif and the new Bio.Motif. For example, motif.consensus is a method in the old Bio.Motif, but a property in the new Bio.Motif. > While I tried to put PendingDeprecationWarnings on all changes consistently, there may be some corner cases that I missed. 
> > For this reason, and also to make the documentation more understandable, it may be better to put the new Bio.Motif code in a module Bio.motifs, to put the old Bio.Motif code back into Bio.Motif (so that Bio.Motif in release 1.61 will be identical to the Bio.Motif in release 1.60), and (assuming that we are happy with the new Bio.motifs modules) put a PendingDeprecationWarning on Bio.Motif as a whole. Then in the documentation we'll have one chapter on Bio.Motif and one chapter on Bio.motifs. Also we'll have one set of tests for Bio.Motif, and one set of tests for Bio.motifs. > > Any objections to creating a separate Bio.motifs module? > > Here you can find the relevant chapter in the current documentation on the new Bio.Motif: > > http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html#htoc190 > > Best, > -Michiel > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Bartek Wilczynski From kai.blin at biotech.uni-tuebingen.de Mon Jan 21 09:49:31 2013 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Mon, 21 Jan 2013 10:49:31 +0100 Subject: [Biopython-dev] More 'fun' with GenBank In-Reply-To: References: <50F57BC5.7020607@biotech.uni-tuebingen.de> <50F66496.8000109@biotech.uni-tuebingen.de> Message-ID: <50FD0F2B.1080606@biotech.uni-tuebingen.de> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2013-01-20 20:03, Peter Cock wrote: > Kai - would you mind retesting with f_loc5 (the rebased branch)? The location of the feature that caused trouble for me still looks correct. I'm currently running some more sequences, but I'm pretty confident that the code will work just fine. The tests I added to the genbank parser code for all the problem cases I had pass, after all. :) > Everyone - does it seem sensible to include this now, ready for the > upcoming release (*)? Or perhaps just after the release? 
I'd prefer having this in the next release if possible, but of course if the release after that is coming up within a reasonable time frame, that would work as well.

Cheers,
Kai

- --
Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de
Institute for Microbiology and Infection Medicine
Division of Microbiology/Biotechnology
Eberhard-Karls-Universität Tübingen
Auf der Morgenstelle 28 Phone : ++49 7071 29-78841
D-72076 Tübingen Fax : ++49 7071 29-5979
Germany
Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iQEcBAEBAgAGBQJQ/Q8rAAoJEKM5lwBiwTTP9oEIAIoa543zGerNtxNg67ybV4uE
jzOkyBzJIxkGAjIxcuNnYTo+OgYHkMQekeo7wkGgPKN558+LE8zKza3JdWbVqV/M
bEd6mYo5LsfveK3Vn397GJcPCOaQtb5MvNUOPJWstzReRVIM6lN3WXm3HxicuTji
2aFZG5dtaMXjZhxxMo4IRz2Jtrr01nZu1OVP02mco4LDoEkRInunDcWJcz/DOsJd
h4vJzVa4veMKFfJV4U9PGZnuatcwKgMLVQ1heKh4/efEOQ4dIjdlYG29FjHsZvy6
RjwL4ZZpGZfZwgBJPGiYqn5ZsgzVqgS5aWdw8/9jN5dpETP24DnzVi6vlIRTWqg=
=uUeG
-----END PGP SIGNATURE-----

From redmine at redmine.open-bio.org Wed Jan 23 02:30:31 2013
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Wed, 23 Jan 2013 02:30:31 +0000
Subject: [Biopython-dev] [Biopython - Bug #3403] (Closed) PDBList fails to download large PDB structures
References:
Message-ID:

Issue #3403 has been updated by Eric Talevich.

Status changed from New to Closed
% Done changed from 0 to 100

Fixed by David Cain. Thanks!
https://github.com/biopython/biopython/pull/146 First commit in the series here: https://github.com/biopython/biopython/commit/7282e80ed6a65a10c5c624b2a7ec787656437a15 ---------------------------------------- Bug #3403: PDBList fails to download large PDB structures https://redmine.open-bio.org/issues/3403 Author: David Cain Status: Closed Priority: High Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: https://github.com/DavidCain/biopython/tree/fix_pdb_dl The current @PDBList@ module will often fail to download large PDB files.
>>> from Bio.PDB import PDBList
>>> pdbl = PDBList()
>>> pdbl.retrieve_pdb_file("1hgg")
Downloading PDB structure '1hgg'...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.7/Bio/PDB/PDBList.py", line 247, in retrieve_pdb_file
    out.writelines(gz.read())
  File "/usr/lib/python2.7/gzip.py", line 249, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 303, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0x21d7a5f7 != 0x4b5eabb6L
>>>
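The traceback above shows the whole archive passing through a single gz.read() call. A minimal sketch of a chunked alternative follows, assuming the compressed file has already been downloaded completely to local disk. The helper name gunzip_to_file and its parameters are hypothetical, and this is not necessarily how the fix in the pull request works.

```python
import gzip
import shutil


def gunzip_to_file(gz_path, out_path, chunk_size=64 * 1024):
    """Decompress gz_path into out_path, copying chunk_size bytes at a time.

    Hypothetical helper for illustration; not part of Bio.PDB.
    """
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # copyfileobj reads and writes in fixed-size blocks, so the
        # decompressed data is never held in memory all at once.
        shutil.copyfileobj(src, dst, chunk_size)
```

Because the download and the decompression are separate steps here, a truncated download would surface as an error from gzip on a file that can be inspected or re-fetched, rather than as a partially written structure file.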
The source of this problem is that the entire gzipped file must be read into memory before it's written to disk locally. With large archives, the local file can be truncated prematurely, which causes gzip to crash on extraction. I fixed this issue on my "GitHub branch":https://github.com/DavidCain/biopython/tree/fix_pdb_dl, which I've made a pull request for. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From mjldehoon at yahoo.com Sun Jan 27 04:45:46 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 26 Jan 2013 20:45:46 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone Message-ID: <1359261946.16561.YahooMailClassic@web164001.mail.gq1.yahoo.com> [This message previously got lost in cyberspace. Sending it again.] --- On Fri, 1/11/13, Peter Cock wrote: > Bow's SearchIO is using Bio.Blast.NCBIStandalone to handle > plain text, > https://github.com/biopython/biopython/blob/master/Bio/SearchIO/BlastIO/blast_text.py OK then let's keep Bio.ParserSupport as is for now. > That's why Bio._utils is a private module - we can > drop/change/etc this without worrying about breaking > other people's code. The issue with Bio.ParserSupport > is it was a public API. Its API being public was not the problem -- we have deprecated and removed lots of public modules over the years. The problem with Bio.ParserSupport was twofold. First, it ended up making parsers more complex and difficult to understand for people not familiar with Bio.ParserSupport, in particular for newcomers and users trying to fix a bug. So Bio.ParserSupport never made us really happy. As a case in point, Bio._utils was created rather than reusing the code in Bio.ParserSupport. 
The second problem was that many modules were using bits and pieces of Bio.ParserSupport, so we could not drop or change Bio.ParserSupport easily. Bio.ParserSupport has been officially obsolete but not deprecated for years. > That's why Bio._utils is a private module - we can > drop/change/etc this without worrying about breaking > other people's code. Let's drop it. Just it being a private module doesn't make it "free". It clutters up the code base. This is particularly true for top-level modules. Best, -Michiel. From mjldehoon at yahoo.com Sun Jan 27 04:46:47 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 26 Jan 2013 20:46:47 -0800 (PST) Subject: [Biopython-dev] Bio.Motif update In-Reply-To: Message-ID: <1359262007.25151.YahooMailClassic@web164002.mail.gq1.yahoo.com> OK, thanks! I separated Bio.Motif into Bio.Motif (essentially the same as in Biopython release 1.60) and Bio.motifs (the new code). Best, -Michiel. --- On Sun, 1/20/13, Bartek Wilczynski wrote: > From: Bartek Wilczynski > Subject: Re: [Biopython-dev] Bio.Motif update > To: "Michiel de Hoon" > Cc: "BioPython-Dev" > Date: Sunday, January 20, 2013, 5:34 PM > Hi, > > great job Michiel! It looks very nice overall. As the code > that will > be using the new library needs to be changed, I would vote > for the > change in the namespace, but given that the userbase of the > Bio.Motif > was quite limited, I think it wouldn't cause major problems > to keep > the name as is. > > best > Bartek > > On Sun, Jan 20, 2013 at 8:30 AM, Michiel de Hoon > wrote: > > Dear all, > > > > As we discussed previously, I've been going over > Bio.Motif to update it and make its usage more explicit. I'm > pretty much done. While I have been uploading my changes to > the main biopython github repository, this does not mean > that these changes are final; comments and suggestions for > changes are welcome. > > > > In many cases, there is a difference in the syntax > between the old Bio.Motif and the new Bio.Motif. 
For > example, motif.consensus is a method in the old Bio.Motif, > but a property in the new Bio.Motif. > > While I tried to put PendingDeprecationWarnings on all > changes consistently, there may be some corner cases that I > missed. > > > > For this reason, and also to make the documentation > more understandable, it may be better to put the new > Bio.Motif code in a module Bio.motifs, to put the old > Bio.Motif code back into Bio.Motif (so that Bio.Motif in > release 1.61 will be identical to the Bio.Motif in release > 1.60), and (assuming that we are happy with the new > Bio.motifs modules) put a PendingDeprecationWarning on > Bio.Motif as a whole. Then in the documentation we'll have > one chapter on Bio.Motif and one chapter on Bio.motifs. Also > we'll have one set of tests for Bio.Motif, and one set of > tests for Bio.motifs. > > > > Any objections to creating a separate Bio.motifs > module? > > > > Here you can find the relevant chapter in the current > documentation on the new Bio.Motif: > > > > http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html#htoc190 > > > > Best, > > -Michiel > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > > > > -- > Bartek Wilczynski > From w.arindrarto at gmail.com Sun Jan 27 10:52:15 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Sun, 27 Jan 2013 11:52:15 +0100 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359261946.16561.YahooMailClassic@web164001.mail.gq1.yahoo.com> References: <1359261946.16561.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: Hi Michiel, everyone, >> That's why Bio._utils is a private module - we can >> drop/change/etc this without worrying about breaking >> other people's code. The issue with Bio.ParserSupport >> is it was a public API. 
> > Its API being public was not the problem -- we have deprecated and removed lots of public modules over the years. > > The problem with Bio.ParserSupport was twofold. First, it ended up making parsers more complex and difficult to understand for people not familiar with Bio.ParserSupport, in particular for newcomers and users trying to fix a bug. So Bio.ParserSupport never made us really happy. As a case in point, Bio._utils was created rather than reusing the code in Bio.ParserSupport. > > The second problem was that many modules were using bits and pieces of Bio.ParserSupport, so we could not drop or change Bio.ParserSupport easily. Bio.ParserSupport has been officially obsolete but not deprecated for years. > >> That's why Bio._utils is a private module - we can >> drop/change/etc this without worrying about breaking >> other people's code. > > Let's drop it. My initial intention of refactoring and adding some new code to Bio._utils was to reduce code repetition. I intended it (and perhaps we should make it explicit in its docstrings) to be a collection of small, useful functions that may be used in various cases. Some examples inside include several string-formatting functions, each of them independent of the other. There's also a general function for running doctests (https://github.com/biopython/biopython/blob/master/Bio/_utils.py#L100), which was written because there was a lot of repetitive code in different submodules basically doing the same thing (looking up the test directory, running the test). I feel quite strongly that this doctest function is required by many current (and future modules) across Biopython, so it makes sense to refactor them out into a root namespace. All of this seems different from Bio.ParserSupport, which attempts to be a one-single solution for writing new parsers (only parsers). 
Given the wildly incoherent nature of different file output formats, it's not surprising that Bio.ParserSupport's code base has to be quite complicated to accommodate all of them. Naturally it has many related parts and functions, and understanding them all is much harder than to understand the small functions in Bio._utils (in my experience).

So for now, I think it is still ok if we use Bio._utils. Perhaps, in light of this discussion, we should make it explicitly clear that it's only for containing general, small, utility functions instead of containing one 'support framework' (e.g. ParserSupport) to avoid future unhappiness.

Cheers,
Bow

From eric.talevich at gmail.com Mon Jan 28 05:59:14 2013
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 28 Jan 2013 00:59:14 -0500
Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone
In-Reply-To:
References: <1359261946.16561.YahooMailClassic@web164001.mail.gq1.yahoo.com>
Message-ID:

On Sun, Jan 27, 2013 at 5:52 AM, Wibowo Arindrarto wrote:
> Hi Michiel, everyone,
>
> >> That's why Bio._utils is a private module - we can
> >> drop/change/etc this without worrying about breaking
> >> other people's code. The issue with Bio.ParserSupport
> >> is it was a public API.
> >
> > Its API being public was not the problem -- we have deprecated and
> removed lots of public modules over the years.
> >
> > The problem with Bio.ParserSupport was twofold. First, it ended up
> making parsers more complex and difficult to understand for people not
> familiar with Bio.ParserSupport, in particular for newcomers and users
> trying to fix a bug. So Bio.ParserSupport never made us really happy. As a
> case in point, Bio._utils was created rather than reusing the code in
> Bio.ParserSupport.
> >
> > The second problem was that many modules were using bits and pieces of
> Bio.ParserSupport, so we could not drop or change Bio.ParserSupport easily.
> Bio.ParserSupport has been officially obsolete but not deprecated for years. > > > >> That's why Bio._utils is a private module - we can > >> drop/change/etc this without worrying about breaking > >> other people's code. > > > > Let's drop it. > > My initial intention of refactoring and adding some new code to > Bio._utils was to reduce code repetition. I intended it (and perhaps > we should make it explicit in its docstrings) to be a collection of > small, useful functions that may be used in various cases. > > Some examples inside include several string-formatting functions, each > of them independent of the other. There's also a general function for > running doctests > (https://github.com/biopython/biopython/blob/master/Bio/_utils.py#L100), > which was written because there was a lot of repetitive code in > different submodules basically doing the same thing (looking up the > test directory, running the test). I feel quite strongly that this > doctest function is required by many current (and future modules) > across Biopython, so it makes sense to refactor them out into a root > namespace. > Interesting discussion. It's worth considering why some functions are being used in multiple parts of the code base. In some cases there are essentially shortcomings in the Python standard library or issues with cross-platform/cross-implementation/backward compatibility that would require us to use *exactly* the same code each time a certain recurring problem is encountered. The Bio._py3k and Bio.File modules makes sense for this reason, I think, and before we deprecated Py2.4 it would have been helpful to have shared code for importing ElementTree (both the uniprot-xml and phyloXML parsers used the same half-page tangle of attempted imports). So, maybe the doctest helpers should go in a new module specific to that topic. 
In other cases there's a recurring need in separate modules, but (a) it's short and simple enough to write the solution from scratch each time where it's needed, and so isn't enough of a maintenance concern to offset the convenience of having all the relevant code in one place; and/or (b) the needs of different modules aren't exactly the same, merely similar, leading to a proliferation of options in the shared function and the situation that a simpler implementation would have worked for any given module. The point is that just as there's a maintenance cost to having duplicated code in multiple places, there's a maintenance cost to having dependencies between multiple modules even within the same project, and the value of a new module ought to be greater than the cost it imposes. Best, Eric From mjldehoon at yahoo.com Mon Jan 28 14:58:58 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 28 Jan 2013 06:58:58 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: Message-ID: <1359385138.84799.YahooMailClassic@web164002.mail.gq1.yahoo.com> Hi Bow, --- On Sun, 1/27/13, Wibowo Arindrarto wrote: > All of this seems different from Bio.ParserSupport, which > attempts to be a one-single solution for writing new parsers > (only parsers). Given the wildly incoherent nature of different > file output formats, it's not surprising that Bio.ParserSupport's > code base has to be quite complicated to accommodate all of them. > Naturally it has many related parts and functions, and understanding > them all is much harder than to understand the small functions in > Bio._utils (in my experience). It's not just Bio.ParserSupport; previously we also had Bio/listfns.py; Bio/mathfns.py; Bio/stringfns.py; their C versions; and Bio/csupport.c. These all contained small utility functions. But in the end we dropped them. Btw, was Bio._utils ever discussed on the mailing list? 
If yes, I apologize for missing this discussion and raising these issues now. Best, -Michiel. From p.j.a.cock at googlemail.com Mon Jan 28 15:10:29 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 28 Jan 2013 15:10:29 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359385138.84799.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1359385138.84799.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: On Mon, Jan 28, 2013 at 2:58 PM, Michiel de Hoon wrote: > > Btw, was Bio._utils ever discussed on the mailing list? If yes, I > apologize for missing this discussion and raising these issues now. I think only on the pull request - I'll have a look at the GitHub settings as ideally at the minimum new pull requests should perhaps be CC'd to the dev list? Peter From p.j.a.cock at googlemail.com Mon Jan 28 15:17:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 28 Jan 2013 15:17:19 +0000 Subject: [Biopython-dev] Sending pull requests to the mailing list Message-ID: Retitling thread, On Mon, Jan 28, 2013 at 3:10 PM, Peter Cock wrote: > On Mon, Jan 28, 2013 at 2:58 PM, Michiel de Hoon wrote: >> >> Btw, was Bio._utils ever discussed on the mailing list? If yes, I >> apologize for missing this discussion and raising these issues now. > > I think only on the pull request - I'll have a look at the GitHub > settings as ideally at the minimum new pull requests should > perhaps be CC'd to the dev list? According to https://help.github.com/articles/using-pull-requests "Everyone that can push to the base repository will receive an email notification and see the new pull request in their dashboard the next time they log in." I think you can also choose to get emails under your own profile settings. There doesn't seem to be any email notification settings under the Biopython organisation account on GitHub. 
If there is an easy way to have GitHub email new pull requests to the biopython-dev mailing I've overlooked it. There might be an API based solution... or a simple email client forwarding rule? Peter From w.arindrarto at gmail.com Mon Jan 28 17:19:51 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Mon, 28 Jan 2013 18:19:51 +0100 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359385138.84799.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1359385138.84799.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: Hi everyone, > --- On Sun, 1/27/13, Wibowo Arindrarto wrote: >> All of this seems different from Bio.ParserSupport, which >> attempts to be a one-single solution for writing new parsers >> (only parsers). Given the wildly incoherent nature of different >> file output formats, it's not surprising that Bio.ParserSupport's >> code base has to be quite complicated to accommodate all of them. >> Naturally it has many related parts and functions, and understanding >> them all is much harder than to understand the small functions in >> Bio._utils (in my experience). > > It's not just Bio.ParserSupport; previously we also had Bio/listfns.py; Bio/mathfns.py; Bio/stringfns.py; their C versions; and Bio/csupport.c. These all contained small utility functions. But in the end we dropped them. Hm..in this case (and in light of Eric's points as well), it may be ok to drop the string formatting functions in Bio._utils. They are used in Bio.Phylo and Bio.SearchIO for now. In Bio.SearchIO they are used in multiple submodules, however, so I am still leaning on putting them at least on Bio.SearchIO's main directory. They were originally in Bio.SearchIO._utils, after all. As for the doctest-related functions, do you propose to move them to a specific doctest-related module as well? >> Btw, was Bio._utils ever discussed on the mailing list? 
If yes, I
>> apologize for missing this discussion and raising these issues now.
>
> I think only on the pull request - I'll have a look at the GitHub
> settings as ideally at the minimum new pull requests should
> perhaps be CC'd to the dev list?

Indeed, I did submit a pull request, but it was not forwarded to or discussed on the mailing list. This is the pull request, for reference: https://github.com/biopython/biopython/pull/140.

For the dev-mailing list notification, I personally agree, given that the number of pull requests received still seems manageable. Is it possible to just receive the initial email notifying the pull, though? So far, I've been 'watching' the repository and getting emails from there ~ perhaps the organization needs to 'watch' the repo to get notifications as well?

Best,
Bow

From redmine at redmine.open-bio.org Mon Jan 28 22:20:54 2013
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 28 Jan 2013 22:20:54 +0000
Subject: [Biopython-dev] [Biopython - Bug #2776] Bio.pairwise2 returns non-optimal alignment in at least some cases
References:
Message-ID:

Issue #2776 has been updated by Peter Cock.

In the opinion of Bryan Lunt, commenting on another issue on GitHub:
https://github.com/biopython/biopython/pull/149

"Bug" 2776 is not a bug, it is a feature. I hand-edited a datafile for EMBOSS programs and tried the EMBOSS "needle" program with (a homomorphism of) the same sequences. It behaves the same as pairwise2. The point is that for there to be gaps they have to be flanked by matches, except on the ends, so what the original bug report asks for is not something these algorithms will ever produce anyway.
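
For anyone wanting to check the arithmetic behind the two alignments in this report by hand, here is a minimal sketch in plain Python. It does not call Bio.pairwise2 and says nothing about how pairwise2 builds its alignments; the function name score_alignment is made up for illustration, and it assumes that a run of k gap characters in one sequence costs open + (k - 1) * extend, which is the convention that reproduces the scores quoted in the report.

```python
def score_alignment(top, bottom, match, mismatch, gap_open, gap_extend):
    """Score an already-gapped pairwise alignment, column by column.

    Assumption (illustration only): a run of k consecutive gaps in one
    sequence costs gap_open + (k - 1) * gap_extend.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    prev_gap_row = None  # row holding the gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row extends it; otherwise open a new one.
            total += gap_extend if row == prev_gap_row else gap_open
            prev_gap_row = row
        else:
            total += match if a == b else mismatch
            prev_gap_row = None
    return total


# Parameters from the report: match 5, mismatch -100, open -5, extend -5.
reported = score_alignment("GKG--", "--GWG", 5, -100, -5, -5)   # -15
requested = score_alignment("GK-G", "G-WG", 5, -100, -5, -5)    # 0
ungapped = score_alignment("GKG", "GWG", 5, -100, -5, -5)       # -90
print(reported, requested, ungapped)
```

Under that assumed gap convention, the alignment the reporter asks for does score higher on paper (0 versus -15); the disagreement in the comment above is about whether these algorithms ever consider an alignment in which a gap in one sequence sits directly next to a gap in the other.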
----------------------------------------
Bug #2776: Bio.pairwise2 returns non-optimal alignment in at least some cases
https://redmine.open-bio.org/issues/2776

Author: Klaus Kopec
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.49
URL:

At least in some cases, Bio.pairwise2 returns an alignment that is not the one with the highest score for the input parameters. This occurs in localXX and globalXX. Yet, I only encountered the problem with large mismatch values (which I use as I need mismatch-free alignments).

Simple example (the bug also occurred for longer sequences):

>>> from Bio import pairwise2
>>> sequence1 = 'GKG'
>>> sequence2 = 'GWG'
>>> A = pairwise2.align.globalms(sequence1, sequence2, 5, -100, -5, -5)[0]
>>> A[0]
'GKG--'
>>> A[1]
'--GWG'
>>> A[2]
-15.0

whereas
'GK-G'
'G-WG'
would get a score of 0

System: Kubuntu 8.10 64Bit, Python 2.6.1, Biopython 1.49 (my pairwise2.py is identical to the current CVS version of it)

--
You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org

From mjldehoon at yahoo.com Tue Jan 29 09:43:59 2013
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 29 Jan 2013 01:43:59 -0800 (PST)
Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone
In-Reply-To:
Message-ID: <1359452639.95165.YahooMailClassic@web164005.mail.gq1.yahoo.com>

I'd prefer if developers first write to the dev mailing list if they want to make any major changes, or changes that affect Biopython overall. It can be hard to understand the implications just from looking at a pull request, and there may be so many pull requests that the important ones may be missed anyway.

Best,
-Michiel.
--- On Mon, 1/28/13, Peter Cock wrote: > From: Peter Cock > Subject: Re: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone > To: "Michiel de Hoon" > Cc: "Wibowo Arindrarto" , "BioPython-Dev Mailing List" > Date: Monday, January 28, 2013, 10:10 AM > On Mon, Jan 28, 2013 at 2:58 PM, > Michiel de Hoon > wrote: > > > > Btw, was Bio._utils ever discussed on the mailing list? > If yes, I > > apologize for missing this discussion and raising these > issues now. > > I think only on the pull request - I'll have a look at the > GitHub > settings as ideally at the minimum new pull requests should > perhaps be CC'd to the dev list? > > Peter > From mjldehoon at yahoo.com Tue Jan 29 09:54:01 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 29 Jan 2013 01:54:01 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: Message-ID: <1359453241.43038.YahooMailClassic@web164004.mail.gq1.yahoo.com> --- On Mon, 1/28/13, Wibowo Arindrarto wrote: > Hm..in this case (and in light of Eric's points as well), it > may be ok to drop the string formatting functions in Bio._utils. > They are used in Bio.Phylo and Bio.SearchIO for now. In Bio.SearchIO > they are used in multiple submodules, however, so I am still leaning > on putting them at least on Bio.SearchIO's main directory. They were > originally in Bio.SearchIO._utils, after all. I think it's OK to have a _utils submodule inside Bio.SearchIO. Since you are developing and maintaining that module, to a large degree it's up to you how you want to organize your code. For the same reason, for Bio.Phylo it's better to discuss with Eric Talevich first to see what he thinks. > As for the doctest-related functions, do you propose to move > them to a specific doctest-related module as well? For the doctest-related functions, we first need to understand what the purpose is, before deciding how to implement it (and in what module the code should be). Best, -Michiel. 
From p.j.a.cock at googlemail.com Tue Jan 29 10:23:43 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 29 Jan 2013 10:23:43 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359452639.95165.YahooMailClassic@web164005.mail.gq1.yahoo.com> References: <1359452639.95165.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: On Tue, Jan 29, 2013 at 9:43 AM, Michiel de Hoon wrote: > I'd prefer if developers first write to the dev mailing list if they want to make > any major changes, or changes that affect Biopython overall. It can be hard > to understand the implications just from looking at a pull request, and there > may be so many pull requests that the important ones may be missed anyway. Certainly a good policy, which I have tried to follow. In this case since it was just moving a small private API code, I didn't consider it major. Peter From p.j.a.cock at googlemail.com Tue Jan 29 10:29:30 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 29 Jan 2013 10:29:30 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359453241.43038.YahooMailClassic@web164004.mail.gq1.yahoo.com> References: <1359453241.43038.YahooMailClassic@web164004.mail.gq1.yahoo.com> Message-ID: On Tue, Jan 29, 2013 at 9:54 AM, Michiel de Hoon wrote: > >> As for the doctest-related functions, do you propose to move >> them to a specific doctest-related module as well? > > For the doctest-related functions, we first need to understand > what the purpose is, before deciding how to implement it (and > in what module the code should be). When editing doctests, it is convenient to be able to run them on the current file, e.g. 

~/biopython $ emacs Bio/SeqRecord.py
~/biopython $ python Bio/SeqRecord.py

Or,

~/biopython/Bio $ emacs SeqRecord.py
~/biopython/Bio $ python SeqRecord.py

To do that, many of our modules had a repeated bit of code at the bottom, now moved to a shared function in Bio/_utils.py resulting in a lot less boilerplate code, e.g.

https://github.com/biopython/biopython/commit/8b59d89bb4e282192ddee751e24ceef4afa63528

Bow had initially done this for the doctests in Bio.SearchIO, but I agreed it makes sense to do this elsewhere.

Peter

From w.arindrarto at gmail.com Tue Jan 29 11:05:19 2013
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 29 Jan 2013 12:05:19 +0100
Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone
In-Reply-To: <1359453241.43038.YahooMailClassic@web164004.mail.gq1.yahoo.com>
References: <1359453241.43038.YahooMailClassic@web164004.mail.gq1.yahoo.com>
Message-ID:

Hi Michiel, everyone,

>>> I'd prefer if developers first write to the dev mailing list if they want to make any major changes, or changes that affect Biopython overall. It can be hard to understand the implications just from looking at a pull request, and there may be so many pull requests that the important ones may be missed anyway.

>>> I think it's OK to have a _utils submodule inside Bio.SearchIO. Since you are developing and maintaining that module, to a large degree it's up to you how you want to organize your code. For the same reason, for Bio.Phylo it's better to discuss with Eric Talevich first to see what he thinks.

Noted. I'm sorry that this is causing more headaches than it solves. I'll be sure to notify the dev-mailing list for other similar changes.

>>> As for the doctest-related functions, do you propose to move
>>> them to a specific doctest-related module as well?
>>
>> For the doctest-related functions, we first need to understand what the purpose is, before deciding how to implement it (and in what module the code should be).
> > When editing doctests, it is convenient to be able to run them on > the current file, e.g. > > ~/biopython $ emacs Bio/SeqRecord.py > ~/biopython $ python Bio/SeqRecord.py > > Or, > > ~/biopython/Bio $ emacs SeqRecord.py > ~/biopython/Bio $ python SeqRecord.py > > To do that, many of our modules had a repeated bit of code at > the bottom, now moved to a shared function in Bio/_utils.py > resulting in a lot less boiler plate code, e.g. > > https://github.com/biopython/biopython/commit/8b59d89bb4e282192ddee751e24ceef4afa63528 > > Bow had initially done this for the doctests in Bio.SearchIO, > but I agreed it make sense to do this elsewhere. Indeed, the doctests functions are two simple small functions to make it easier to run doctests. The first one looks up the test directory (our Tests directory) and the second one simply executes the doctest. Best, Bow From p.j.a.cock at googlemail.com Tue Jan 29 15:46:25 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 29 Jan 2013 15:46:25 +0000 Subject: [Biopython-dev] Bio.Motif update In-Reply-To: <1359262007.25151.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1359262007.25151.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: On Sun, Jan 27, 2013 at 4:46 AM, Michiel de Hoon wrote: > OK, thanks! I separated Bio.Motif into Bio.Motif (essentially the same > as in Biopython release 1.60) and Bio.motifs (the new code). We need to say something about this in the NEWS file too. I think it would make sense to add a PendingDeprecationWarning to Bio.Motif now. Also, if you feel the new Bio.motifs API isn't quite settled yet, adding the new BiopythonExperimentalWarning to that makes sense. What do you think? (And once this is settled, I think we can schedule the release) Regards, Peter From p.j.a.cock at googlemail.com Tue Jan 29 17:10:50 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 29 Jan 2013 17:10:50 +0000 Subject: [Biopython-dev] Namespace for online resources? 
Message-ID: Hello all, We used to have Bio.WWW for assorted online tools, but that was deprecated some time back. Is there a case for bringing it back, or something similar like Bio.WebTools as suggested by Kevin Murray on this pull request?: https://github.com/biopython/biopython/pull/132 In this case, since this is to fetch Arabidopsis sequence via an accession number, perhaps Bio.SeqUtils might be better? (As an aside, recall we've talked about merging Bio.Seq* at some point). Thoughts? Peter From w.arindrarto at gmail.com Tue Jan 29 19:52:42 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 29 Jan 2013 20:52:42 +0100 Subject: [Biopython-dev] Namespace for online resources? In-Reply-To: References: Message-ID: Hi everyone, > We used to have Bio.WWW for assorted online tools, but that > was deprecated some time back. Is there a case for bringing it > back, or something similar like Bio.WebTools as suggested by > Kevin Murray on this pull request?: > > https://github.com/biopython/biopython/pull/132 > > In this case, since this is to fetch Arabidopsis sequence via > an accession number, perhaps Bio.SeqUtils might be better? > (As an aside, recall we've talked about merging Bio.Seq* at > some point). Why was Bio.WWW deprecated in the first place? Personally, I would prefer to have all online database access centralized in one place, if possible. It makes for a less-cluttered root namespace and may be more intuitive in most cases. I do notice that for cases like Bio.Entrez, sometimes we need to only parse the data locally since it has been downloaded previously (hence no online access). To do this task, Bio.www (basically the centralized online module) may not be the most intuitive place to look in, for most people, although an argument can be made that we are still parsing data whose format is specific for an online resource. 
However, looking at the way we are doing this now (with the current codebase placing Entrez access and parsing in Bio.Entrez; similarly for Bio.ExPASy) locating the module in Bio.TAIR (or Bio.tair? PEP-8 compliance?) looks more consistent. If we are to create a new module for online access (e.g. Bio.webtools. Bio.www) for Bio.TAIR, for consistency we may have to juggle Entrez and ExPASy around as well, right? Putting Bio.TAIR in Bio.SeqUtils doesn't seem..right to me. My impression is that SeqUtils is supposed to be for functions acting on sequence strings (or Seq objects) and nothing else. After all, we can also retrieve GenBank sequences from Biopython but that functionality is separated on its own Bio.Entrez not Bio.SeqUtils. . Just my two cents :), Bow From arklenna at gmail.com Tue Jan 29 20:05:15 2013 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 29 Jan 2013 15:05:15 -0500 Subject: [Biopython-dev] Namespace for online resources? In-Reply-To: References: Message-ID: I agree with Bow that centralizing all online database access makes sense. It would also simplify the testing process (i.e. anything that requires a network connection goes into the web namespace and can be skipped when testing offline). In situations like Entrez, the network access portion could be separated out and put into the web namespace under the same name: import Bio.www.Entrez # for downloading the data import Bio.Entrez # for parsing/using the downloaded data Cheers, Lenna On Tue, Jan 29, 2013 at 2:52 PM, Wibowo Arindrarto wrote: > Hi everyone, > > > We used to have Bio.WWW for assorted online tools, but that > > was deprecated some time back. Is there a case for bringing it > > back, or something similar like Bio.WebTools as suggested by > > Kevin Murray on this pull request?: > > > > https://github.com/biopython/biopython/pull/132 > > > > In this case, since this is to fetch Arabidopsis sequence via > > an accession number, perhaps Bio.SeqUtils might be better? 
> > (As an aside, recall we've talked about merging Bio.Seq* at > > some point). > > Why was Bio.WWW deprecated in the first place? > > Personally, I would prefer to have all online database access > centralized in one place, if possible. It makes for a less-cluttered > root namespace and may be more intuitive in most cases. I do notice > that for cases like Bio.Entrez, sometimes we need to only parse the > data locally since it has been downloaded previously (hence no online > access). To do this task, Bio.www (basically the centralized online > module) may not be the most intuitive place to look in, for most > people, although an argument can be made that we are still parsing > data whose format is specific for an online resource. > > However, looking at the way we are doing this now (with the current > codebase placing Entrez access and parsing in Bio.Entrez; similarly > for Bio.ExPASy) locating the module in Bio.TAIR (or Bio.tair? PEP-8 > compliance?) looks more consistent. If we are to create a new module > for online access (e.g. Bio.webtools. Bio.www) for Bio.TAIR, for > consistency we may have to juggle Entrez and ExPASy around as well, > right? > > Putting Bio.TAIR in Bio.SeqUtils doesn't seem..right to me. My > impression is that SeqUtils is supposed to be for functions acting on > sequence strings (or Seq objects) and nothing else. After all, we can > also retrieve GenBank sequences from Biopython but that functionality > is separated on its own Bio.Entrez not Bio.SeqUtils. > . > Just my two cents :), > Bow > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Tue Jan 29 21:03:59 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 29 Jan 2013 21:03:59 +0000 Subject: [Biopython-dev] Namespace for online resources? 
In-Reply-To:
References:
Message-ID:

On Tue, Jan 29, 2013 at 7:52 PM, Wibowo Arindrarto wrote:
> Hi everyone,
>
> Why was Bio.WWW deprecated in the first place?
>

The flippant answer is everything under Bio.WWW was moved or deprecated:
http://lists.open-bio.org/pipermail/biopython-dev/2008-July/004059.html

I'm trying to identify the discussions prior to that covering the moves:

Bio.WWW.ExPASy -> Bio.ExPASy
Bio.WWW.InterPro -> Bio.InterPro
Bio.WWW.NCBI -> Bio.Entrez
Bio.WWW.SCOP -> Bio.SCOP

Peter

From p.j.a.cock at googlemail.com Tue Jan 29 21:11:29 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 29 Jan 2013 21:11:29 +0000
Subject: [Biopython-dev] Namespace for online resources?
In-Reply-To:
References:
Message-ID:

On Tue, Jan 29, 2013 at 9:03 PM, Peter Cock wrote:
> On Tue, Jan 29, 2013 at 7:52 PM, Wibowo Arindrarto
> wrote:
>> Hi everyone,
>>
>> Why was Bio.WWW deprecated in the first place?
>>
>
> The flippant answer is everything under Bio.WWW was moved
> or deprecated:
> http://lists.open-bio.org/pipermail/biopython-dev/2008-July/004059.html
>
> I'm trying to identify the discussions prior to that covering the moves:
>
> Bio.WWW.ExPASy -> Bio.ExPASy
> Bio.WWW.InterPro -> Bio.InterPro
> Bio.WWW.NCBI -> Bio.Entrez
> Bio.WWW.SCOP -> Bio.SCOP

Probably this thread,
http://lists.open-bio.org/pipermail/biopython-dev/2007-November/003241.html

Also a bit more background on the NCBI Entrez side:
http://lists.open-bio.org/pipermail/biopython-dev/2008-February/003423.html

Peter

From natemsutton at yahoo.com Tue Jan 29 21:22:57 2013
From: natemsutton at yahoo.com (Nate Sutton)
Date: Tue, 29 Jan 2013 13:22:57 -0800 (PST)
Subject: [Biopython-dev] New BioPython member
Message-ID: <1359494577.29159.YahooMailNeo@web122606.mail.ne1.yahoo.com>

Dear all,

I just recently joined the BioPython developers group and am looking forward to contributing to BioPython! I have worked for a while in programming, genetics, and biology and have a m.s. in Biomedical Informatics.
After talking with some fellow contributors I have decided to try working on https://redmine.open-bio.org/issues/3360 but I will also work on writing some documentation on examples from the cookbook, especially if I am stuck on the bug. If anyone wants to work on the same things, I'd be glad to hear that; I may be slow on the work because I am still learning Python after coming from other languages.

-Nate

From mjldehoon at yahoo.com Wed Jan 30 02:00:32 2013
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 29 Jan 2013 18:00:32 -0800 (PST)
Subject: [Biopython-dev] Namespace for online resources?
In-Reply-To:
Message-ID: <1359511232.14591.YahooMailClassic@web164002.mail.gq1.yahoo.com>

Bio.WWW was one of those modules that seemed a good idea at first, but then failed to gain general acceptance. There are three problems with Bio.WWW:

1) From the module name, it's not clear what you would find in it. For example, if you want to access the Entrez database, would you first look in Bio.Entrez or in Bio.WWW? Similarly for TAIR: Would you look for it in Bio.TAIR, or in Bio.WWW?

2) The modules in Bio.WWW don't have much to do with each other, except that they access the internet. But any given user probably is mainly interested in Entrez, or ExPASy, or some other database, not in all of them at the same time.

3) The flip side of this is that a user accessing e.g. ExPASy would have to import both Bio.WWW and Bio.ExPASy to be able to use ExPASy. Doctests get more complicated also, as they would span more than one module. Here is an example from Bio.Entrez that accesses the database, and then parses the results:

>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here@example.org"
>>> handle = Entrez.einfo() # or esearch, efetch, ...
>>> record = Entrez.read(handle)
>>> handle.close()

The ultimate question is whether we organize the code in Biopython by its functionality from a user perspective, or by the kind of things it does?
Almost all of Biopython is organized according to the former. For example, we don't have a Bio.Parsers module for all the parsers; similarly, we don't have Bio.WWW for internet access. Best, -Michiel. --- On Tue, 1/29/13, Peter Cock wrote: > From: Peter Cock > Subject: Re: [Biopython-dev] Namespace for online resources? > To: "Wibowo Arindrarto" > Cc: "Biopython-Dev Mailing List" > Date: Tuesday, January 29, 2013, 4:11 PM > On Tue, Jan 29, 2013 at 9:03 PM, > Peter Cock > wrote: > > On Tue, Jan 29, 2013 at 7:52 PM, Wibowo Arindrarto > > > wrote: > >> Hi everyone, > >> > >> Why was Bio.WWW deprecated in the first place? > >> > > > > The flippant answer is everything under Bio.WWW was > moved > > or deprecated: > > http://lists.open-bio.org/pipermail/biopython-dev/2008-July/004059.html > > > > I'm trying to identify the discussions prior to that > covering the moves: > > > > Bio.WWW.ExPASy -> Bio.ExPASy > > Bio.WWW.InterPro -> Bio.InterPro > > Bio.WWW.NCBI -> Bio.Entrez > > Bio.WWW.SCOP -> Bio.SCOP > > Probably this thread, > http://lists.open-bio.org/pipermail/biopython-dev/2007-November/003241.html > > Also a bit more background on the NCBI Entrez side: > http://lists.open-bio.org/pipermail/biopython-dev/2008-February/003423.html > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From kjwu at ucsd.edu Wed Jan 30 02:09:42 2013 From: kjwu at ucsd.edu (Kevin Wu) Date: Tue, 29 Jan 2013 18:09:42 -0800 Subject: [Biopython-dev] Trie with_prefix doesn't work as expected Message-ID: Hi All, I'm attempting to use the trie implementation in biopython to develop a suffix trie. I'm using the with_prefix function to find all keys which start with a sequence, however, the function doesn't return values that I expect. I tested it with the canonical example "banana" and am a bit confused. 

from Bio.trie import trie
t = trie()
s = "BANANA"
for i in range(len(s)): # insert all suffixes into trie
    t[s[i:]] = i

t.with_prefix("NA") # this works as expected
>> ['NA', 'NANA']

t.with_prefix("AN")
>> ['AN', 'ANNA'] # this doesn't work as expected
# expected output: ["ANANA", "ANA"]

Can anyone clarify my confusion or confirm this bug? I'm on Biopython 1.60, Linux Mint 64-bit.

Thanks!
Kevin

From mjldehoon at yahoo.com Wed Jan 30 02:29:09 2013
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 29 Jan 2013 18:29:09 -0800 (PST)
Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone
In-Reply-To:
Message-ID: <1359512949.16659.YahooMailClassic@web164002.mail.gq1.yahoo.com>

Hi Bow,

Thanks for the explanation.

> Indeed, the doctests functions are two simple small
> functions to make it easier to run doctests. The first
> one looks up the test directory (our Tests directory) and
> the second one simply executes the doctest.

The point of looking up the test directory is to find the example input files, right? Have a look at Bio/Align/Applications/_Mafft.py. Its doctest uses the complete path to the example input file:

https://github.com/biopython/biopython/commit/32a6beb1e039fa614398a7dee1c031466e8e42ed#Bio/Align/Applications/_Mafft.py

I like this solution better, since it's more straightforward, it doesn't need a new module, and also allows the user to run the example without having to figure out where the input file is located.

Best,
-Michiel.

From k.d.murray.91 at gmail.com Wed Jan 30 03:37:46 2013
From: k.d.murray.91 at gmail.com (Kevin Murray)
Date: Wed, 30 Jan 2013 14:37:46 +1100
Subject: [Biopython-dev] Namespace for online resources?
In-Reply-To: <1359511232.14591.YahooMailClassic@web164002.mail.gq1.yahoo.com>
References: <1359511232.14591.YahooMailClassic@web164002.mail.gq1.yahoo.com>
Message-ID:

Hi all,

Essentially, I agree with everything Bow and Lenna have said.
If all web-based tools are in a single root-level package, then with appropriate documentation I think users should know where to find any function. People are at least going to know if their required module interfaces with some website.

I guess the problem is that moving all the web stuff into one package will break a lot of code, which leads me back to my original idea of just copying where stuff like TogoWS and ExPASy is located, i.e. sticking TAIR in the root level directory.

Peter and Michiel, do you think that Lenna's suggestion is workable? Would it make sense to go all in and simultaneously refactor parsers into Bio.parse, Bio.*IO into Bio.io.*, etc.? Perhaps this could be delayed until the next major release (or form the beginnings of a biopython2 branch?).

Cheers,
Kevin Murray

On 30 January 2013 13:00, Michiel de Hoon wrote:
> Bio.WWW was one of those modules that seem a good idea at first, but then
> failed to gain general acceptance. There are three problems with Bio.WWW:
>
> 1) From the module name, it's not clear what you would find in it. For
> example, if you want to access the Entrez database, would you first look in
> Bio.Entrez or in Bio.WWW? Similarly for TAIR: Would you look for it in
> Bio.TAIR, or in Bio.WWW?
>
> 2) The modules in Bio.WWW don't have much to do with each other, except
> that they access the internet. But any given user probably is mainly
> interested in Entrez, or ExPASy, or some other database, not in all of them
> at the same time.
>
> 3) The flip side of this is that a user accessing e.g. ExPASy would have
> to import both Bio.WWW and Bio.ExPASy to be able to use ExPASy. Doctests
> get more complicated also, as they would span more than one module. Here is
> an example from Bio.Entrez that accesses the database, and then parses the
> results:
> >>> from Bio import Entrez
> >>> Entrez.email = "Your.Name.Here at example.org"
> >>> handle = Entrez.einfo() # or esearch, efetch, ...
> >>> record = Entrez.read(handle) > >>> handle.close() > > The ultimate question is whether we organize the code in Biopython by > their functionality from a user perspective, or by the kind of things they > do? Almost all of Biopython is organized according to the former. For > example, we don't have a Bio.Parsers module for all the parsers; similarly, > we don't have Bio.WWW for internet access. > > Best, > -Michiel. > > > --- On Tue, 1/29/13, Peter Cock wrote: > > > From: Peter Cock > > Subject: Re: [Biopython-dev] Namespace for online resources? > > To: "Wibowo Arindrarto" > > Cc: "Biopython-Dev Mailing List" > > Date: Tuesday, January 29, 2013, 4:11 PM > > On Tue, Jan 29, 2013 at 9:03 PM, > > Peter Cock > > wrote: > > > On Tue, Jan 29, 2013 at 7:52 PM, Wibowo Arindrarto > > > > > wrote: > > >> Hi everyone, > > >> > > >> Why was Bio.WWW deprecated in the first place? > > >> > > > > > > The flippant answer is everything under Bio.WWW was > > moved > > > or deprecated: > > > > http://lists.open-bio.org/pipermail/biopython-dev/2008-July/004059.html > > > > > > I'm trying to identify the discussions prior to that > > covering the moves: > > > > > > Bio.WWW.ExPASy -> Bio.ExPASy > > > Bio.WWW.InterPro -> Bio.InterPro > > > Bio.WWW.NCBI -> Bio.Entrez > > > Bio.WWW.SCOP -> Bio.SCOP > > > > Probably this thread, > > > http://lists.open-bio.org/pipermail/biopython-dev/2007-November/003241.html > > > > Also a bit more background on the NCBI Entrez side: > > > http://lists.open-bio.org/pipermail/biopython-dev/2008-February/003423.html > > > > Peter > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Wed Jan 30 08:52:24 2013 From: 
p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 30 Jan 2013 08:52:24 +0000
Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone
In-Reply-To: <1359512949.16659.YahooMailClassic@web164002.mail.gq1.yahoo.com>
References: <1359512949.16659.YahooMailClassic@web164002.mail.gq1.yahoo.com>
Message-ID:

On Wed, Jan 30, 2013 at 2:29 AM, Michiel de Hoon wrote:
> Hi Bow,
>
> Thanks for the explanation.
>
>> Indeed, the doctests functions are two simple small
>> functions to make it easier to run doctests. The first
>> one looks up the test directory (our Tests directory) and
>> the second one simply executes the doctest.
>
> The point of looking up the test directory is to find the
> example input files, right?

Yes. Most of the code is working out where our Tests directory is; without that it is just two lines:

import doctest
doctest.testmod()

> Have a look at Bio/Align/Applications/_Mafft.py.
> Its doctest uses the complete path to the example input file:
>
> https://github.com/biopython/biopython/commit/32a6beb1e039fa614398a7dee1c031466e8e42ed#Bio/Align/Applications/_Mafft.py
>
> I like this solution better, since it's more straightforward, it doesn't
> need a new module, and also allows the user to run the example
> without having to figure out where the input file is located.

That's a special case - the file being referred to isn't used other than to print out a command line string. So it is fine.

The doctests we're talking about typically are for parsing, and they need to find the file. In order to run via the main test suite (run_tests.py) we can assume we are in the Biopython Tests folder and therefore use relative paths. Those relative paths won't work if trying to run the doctests via the __name__ trick, thus the path magic which seemed sensible to put in one place only.

We can of course remove these __name__ trick conveniences; they are only intended to make life easier for us developers when editing the doctests of a module.
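
For readers who have not met the __name__ trick, a minimal sketch of the general pattern looks something like the following. This is not the code in Bio/_utils.py; the function gc_fraction is a made-up example, and the sketch leaves out the part that locates the Tests directory.

```python
import doctest


def gc_fraction(seq):
    """Return the fraction of G and C letters in a sequence string.

    (Hypothetical helper, used here only to give doctest something to run.)

    >>> gc_fraction("GGCC")
    1.0
    >>> gc_fraction("ATGC")
    0.5
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / float(len(seq))


if __name__ == "__main__":
    # Only runs when the file is executed directly, e.g. "python thisfile.py",
    # not when the module is imported by other code or by the test suite.
    doctest.testmod()
```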
But I think it is worth having as a private function somewhere in the code base. Regards, Peter From p.j.a.cock at googlemail.com Wed Jan 30 09:31:31 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 30 Jan 2013 09:31:31 +0000 Subject: [Biopython-dev] New BioPython member In-Reply-To: <1359494577.29159.YahooMailNeo@web122606.mail.ne1.yahoo.com> References: <1359494577.29159.YahooMailNeo@web122606.mail.ne1.yahoo.com> Message-ID: On Tue, Jan 29, 2013 at 9:22 PM, Nate Sutton wrote: > Dear all, > > I just recently joined the BioPython developers group and am > looking forward to contributing to BioPython! I have worked for a while > in programming, genetics, and biology and have > an M.S. in Biomedical Informatics. After > talking with some fellow contributors I have decided to try working on > https://redmine.open-bio.org/issues/3360 but I will also work on writing > some documentation on examples from the > cookbook, especially if I am stuck on the bug. If anyone wants to work on > the same things, I'd be glad to hear that, I > may be slow on the work because I am still learning Python after coming > from > other languages. > > -Nate Hi Nate, and welcome. Eric is in charge of the Bio.Phylo module, but within that the command line application wrappers under Bio.Phylo.Applications follow a pattern used elsewhere in Biopython. To add a wrapper for fasttree http://www.microbesonline.org/fasttree/ have a look at the existing wrappers for PHYML and RAXML, defined in Bio/Phylo/Applications/_Phyml.py and Bio/Phylo/Applications/_Raxml.py (leading underscores mean private modules in Python), which are exposed to the user via Bio/Phylo/Applications/__init__.py In this case, I'd suggest putting the new wrapper in a new file, Bio/Phylo/Applications/_Fasttree.py Other similar wrappers exist under Bio.Emboss, Bio.Align, etc. Don't be shy about asking for guidance on this, or git and github.
Ultimately what I'm hoping you'll be able to do is take a fork (a personal copy of the repository) on GitHub, create a new fasttree branch, commit your enhancements, and make a pull request. If that's all too much for now, simply writing the new file and letting us do the git side would be fine. Regards, Peter From p.j.a.cock at googlemail.com Wed Jan 30 09:42:23 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 30 Jan 2013 09:42:23 +0000 Subject: [Biopython-dev] Trie with_prefix doesn't work as expected In-Reply-To: References: Message-ID: On Wed, Jan 30, 2013 at 2:09 AM, Kevin Wu wrote: > Hi All, > > I'm attempting to use the trie implementation in biopython to develop a > suffix trie. I'm using the with_prefix function to find all keys which > start with a sequence, however, the function doesn't return values that I > expect. I tested it with the canonical example "banana" and am a bit > confused. > > from Bio.trie import trie > t = trie() > s = "BANANA" > for i in range(len(s)): # insert all suffixes into trie > t[s[i:]] = i > > t.with_prefix("NA") # this works as expected >>> ['NA', 'NANA'] > > t.with_prefix("AN") >>> ['AN', 'ANNA'] # this doesn't work as expected > # expected output: ["ANANA", "ANA"] > > Can anyone clarify my confusion or confirm this bug? I'm on Biopython 1.60, > Linux Mint 64-bit. There is certainly something odd happening. I'm testing with the current code in git (pre-Biopython 1.61) under Mac OS X. >>> from Bio.trie import trie >>> t = trie() >>> s = "BANANA" >>> for i in range(len(s)): # insert all suffixes into trie ... t[s[i:]] = i ... print "%s -> %i" % (s[i:], i) ... assert t[s[i:]] == i ...
BANANA -> 0 ANANA -> 1 NANA -> 2 ANA -> 3 NA -> 4 A -> 5 >>> t.values() [5, 3, 1, 0, 4, 2] >>> t.keys() ['A', 'ANA', 'ANANA', 'BANANA', 'NA', 'NANA'] These look fine: >>> t.with_prefix("NA") ['NA', 'NANA'] >>> t.with_prefix("A") ['A', 'ANA', 'ANANA'] >>> t.with_prefix("ANA") ['ANA', 'ANANA'] As you point out, this example seems wrong: >>> t.with_prefix("AN") ['AN', 'ANNA'] The value 'ANNA' shouldn't be in the trie. Peter From mjldehoon at yahoo.com Wed Jan 30 10:20:53 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 30 Jan 2013 02:20:53 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: Message-ID: <1359541253.85968.YahooMailClassic@web164003.mail.gq1.yahoo.com> Hi Peter, --- On Wed, 1/30/13, Peter Cock wrote: > Those relative paths won't work if trying to run the > doctests via the __name__ trick, thus the path magic which > seemed sensible to put in one place only. In which case won't they work? I tried this on SeqRecord.py, and as far as I can tell, the relative paths work fine also when running the doctests from the __name__=="__main__" block, both on Unix and Windows. Best, -Michiel From p.j.a.cock at googlemail.com Wed Jan 30 11:42:21 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 30 Jan 2013 11:42:21 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359541253.85968.YahooMailClassic@web164003.mail.gq1.yahoo.com> References: <1359541253.85968.YahooMailClassic@web164003.mail.gq1.yahoo.com> Message-ID: On Wed, Jan 30, 2013 at 10:20 AM, Michiel de Hoon wrote: > Hi Peter, > > --- On Wed, 1/30/13, Peter Cock wrote: >> Those relative paths won't work if trying to run the >> doctests via the __name__ trick, thus the path magic which >> seemed sensible to put in one place only. > > In which case won't they work? 
I tried this on SeqRecord.py, > and as far as I can tell, the relative paths work fine also when > running the doctests from the __name__=="__main__" block, > both on Unix and Windows. Yes, no path magic works IF you are in the Tests folder, e.g. ~/biopython/Tests $ emacs ../Bio/SeqRecord.py ~/biopython/Tests $ python ../Bio/SeqRecord.py However for anything like the following convenient alternatives to work and run the doctests, you need some path magic: ~/biopython $ emacs Bio/SeqRecord.py ~/biopython $ python Bio/SeqRecord.py Or, ~/biopython/Bio $ emacs SeqRecord.py ~/biopython/Bio $ python SeqRecord.py I felt having a central convenience function to make that work was worthwhile in order to make working on doctests easier without code duplication. I would accept that this alone does not justify a whole module or file like Bio/_utils.py If you feel strongly about this, we can remove the function run_doctest from Bio/_utils.py (it does after all serve no real purpose in the installed library code), and just require the current directory be the test folder. Would you like me to make that change? Regards, Peter From mjldehoon at yahoo.com Wed Jan 30 12:10:17 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 30 Jan 2013 04:10:17 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: Message-ID: <1359547817.36972.YahooMailClassic@web164001.mail.gq1.yahoo.com> Hi Peter, --- On Wed, 1/30/13, Peter Cock wrote: > However for anything like the following convenient > alternatives to work and run the doctests, you need > some path magic: > ~/biopython $ emacs Bio/SeqRecord.py > ~/biopython $ python Bio/SeqRecord.py Here I agree. > Or, > > ~/biopython/Bio $ emacs SeqRecord.py > ~/biopython/Bio $ python SeqRecord.py > Well I was thinking that the doctests in SeqRecord.py could use a relative path to the Tests directory, e.g. ../Tests/Quality/solexa_faked.fastq. 
But I agree that this will fail again for any script in submodules. Still I would think that there is a better way to do this, and I doubt that we are the first ones who want to access test files with doctests. I can write a short message to comp.lang.python to see have anybody has any suggestions. Best, -Michiel. From arklenna at gmail.com Wed Jan 30 17:10:40 2013 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 30 Jan 2013 12:10:40 -0500 Subject: [Biopython-dev] Namespace for online resources? In-Reply-To: <1359511232.14591.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1359511232.14591.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: Michiel, You raise an excellent point that separating the modules in this way will complicate doctests. Regarding point (2), is your primary concern namespace clutter or importing efficiency? I still maintain that the category of internet access is more fundamental than the category of parsers. For point (1), if every database is accessed using a WWW submodule, a user will know to look there. Obviously moving everything would be a lot of work... Cheers, Lenna On Tue, Jan 29, 2013 at 9:00 PM, Michiel de Hoon wrote: > Bio.WWW was one of those modules that seem a good idea at first, but then > failed to gain general acceptance. There are three problems with Bio.WWW: > > 1) From the module name, it's not clear what you would find in it. For > example, if you want to access the Entrez database, would you first look in > Bio.Entrez or in Bio.WWW? Similarly for TAIR: Would you look for it in > Bio.TAIR, or in Bio.WWW? > > 2) The modules in Bio.WWW don't have much to do with each other, except > that they access the internet. But any given user probably is mainly > interested in Entrez, or ExPASy, or some other database, not in all of them > at the same time. > > 3) The flip side of this is that a user accessing e.g. ExPASy would have > to import both Bio.WWW and Bio.ExPASy to be able to use ExPASy. 
Doctests > get more complicated also, as they would span more than one module. Here is > an example from Bio.Entrez that accesses the database, and then parses the > results: > >>> from Bio import Entrez > >>> Entrez.email = "Your.Name.Here at example.org" > >>> handle = Entrez.einfo() # or esearch, efetch, ... > >>> record = Entrez.read(handle) > >>> handle.close() > > The ultimate question is whether we organize the code in Biopython by > their functionality from a user perspective, or by the kind of things they > do? Almost all of Biopython is organized according to the former. For > example, we don't have a Bio.Parsers module for all the parsers; similarly, > we don't have Bio.WWW for internet access. > > Best, > -Michiel. > > > --- On Tue, 1/29/13, Peter Cock wrote: > > > From: Peter Cock > > Subject: Re: [Biopython-dev] Namespace for online resources? > > To: "Wibowo Arindrarto" > > Cc: "Biopython-Dev Mailing List" > > Date: Tuesday, January 29, 2013, 4:11 PM > > On Tue, Jan 29, 2013 at 9:03 PM, > > Peter Cock > > wrote: > > > On Tue, Jan 29, 2013 at 7:52 PM, Wibowo Arindrarto > > > > > wrote: > > >> Hi everyone, > > >> > > >> Why was Bio.WWW deprecated in the first place? 
> > >> > > > > > > The flippant answer is everything under Bio.WWW was > > moved > > > or deprecated: > > > > http://lists.open-bio.org/pipermail/biopython-dev/2008-July/004059.html > > > > > > I'm trying to identify the discussions prior to that > > covering the moves: > > > > > > Bio.WWW.ExPASy -> Bio.ExPASy > > > Bio.WWW.InterPro -> Bio.InterPro > > > Bio.WWW.NCBI -> Bio.Entrez > > > Bio.WWW.SCOP -> Bio.SCOP > > > > Probably this thread, > > > http://lists.open-bio.org/pipermail/biopython-dev/2007-November/003241.html > > > > Also a bit more background on the NCBI Entrez side: > > > http://lists.open-bio.org/pipermail/biopython-dev/2008-February/003423.html > > > > Peter > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From w.arindrarto at gmail.com Wed Jan 30 17:20:39 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 30 Jan 2013 18:20:39 +0100 Subject: [Biopython-dev] Namespace for online resources? In-Reply-To: <1359511232.14591.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1359511232.14591.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: Hi everyone, Peter, thanks for the links to the archives, I'm starting to get a grip on why Bio.WWW was deprecated in the first place. Michiel, thanks for the explanation. My responses are below. My reply is a bit long, so in the interest of brevity, I'll say first that I'm in favor of putting TAIR in Bio.TAIR now, for practical reasons and consistency with similar modules. But I do still have some slight objections to this approach. > Bio.WWW was one of those modules that seem a good idea at first, but then failed to gain general acceptance. 
There are three problems with Bio.WWW: > > 1) From the module name, it's not clear what you would find in it. For example, if you want to access the Entrez database, would you first look in Bio.Entrez or in Bio.WWW? Similarly for TAIR: Would you look for it in Bio.TAIR, or in Bio.WWW? This seems to be a naming issue, but it does not invalidate the idea of having one central place for online access. I'll continue to refer to this module as Bio.WWW here, but there may be other more suitable names, such as Bio.remotedb, Bio.remote.db, Bio.www.db (or something else), which would make the module a more intuitive place to look in, right? > 2) The modules in Bio.WWW don't have much to do with each other, except that they access the internet. But any given user probably is mainly interested in Entrez, or ExPASy, or some other database, not in all of them at the same time. We could add a note to the documentation about this, right? If we are worried about loading unnecessary modules, we can keep the __init__.py in Bio.WWW empty, and have Entrez, ExPASy, and the others inside Bio.WWW. > 3) The flip side of this is that a user accessing e.g. ExPASy would have to import both Bio.WWW and Bio.ExPASy to be able to use ExPASy. Doctests get more complicated also, as they would span more than one module. Here is an example from Bio.Entrez that accesses the database, and then parses the results: >>>> from Bio import Entrez >>>> Entrez.email = "Your.Name.Here@example.org" >>>> handle = Entrez.einfo() # or esearch, efetch, ... >>>> record = Entrez.read(handle) >>>> handle.close() Since ExPASy's formats may be specific to them, I was thinking their parsers should also go in Bio.WWW (in this case, Bio.WWW.ExPASy). Note that at the moment we also have cases where the database entry retriever and parser lie in different submodules of the code (e.g. importing Fasta from Bio.Entrez and parsing it with Bio.SeqIO).
This is OK in my opinion, however, as Fasta is a widely used format not exclusive to Entrez. But for exclusive formats like ExPASy's or Entrez's, it makes sense to keep them in the same module as their database entry retriever. > The ultimate question is whether we organize the code in Biopython by their functionality from a user perspective, or by the kind of things they do? Almost all of Biopython is organized according to the former. For example, we don't have a Bio.Parsers module for all the parsers; similarly, we don't have Bio.WWW for internet access. Hmm... those two points are not necessarily mutually exclusive, right? I think having a centralized module for online access still makes for a functional grouping based on a user's perspective. In the parsers' case, it makes sense to organize them the way we do now as there are so many parsers. But for online access, I think it's still manageable to put them in one directory. Just to throw the idea around, we may also have subdirectories for different kinds of online access (e.g. Bio.www.db for online database access, Bio.www.app for online tools access like NCBI BLAST or HMMER). This is not something urgent, but maybe worth thinking about and discussing :). Cheers, Bow From mjldehoon at yahoo.com Thu Jan 31 11:03:12 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 31 Jan 2013 03:03:12 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: Message-ID: <1359630192.62870.YahooMailClassic@web164001.mail.gq1.yahoo.com> Dear all, [Michiel wrote:] > Still I would think that there is a better way to do this, > and I doubt that we are the first ones who want to access > test files with doctests. I can write a short message to > comp.lang.python to see have anybody has any suggestions.
So I started writing a message to comp.lang.python, and while reading the doctest documentation to make my message understandable I realized that we can solve our problem by using the setUp and tearDown arguments to doctest.DocTestSuite. Then we put the test files in the same directory as the module we want to test, and use setUp/tearDown to let the unittest switch to this directory when needed. This has the added benefit that the example files are easier to find for users who want to try out a doctest example. Perhaps we'll still run into some issues if we try to implement this, but it seems a step in the right direction. Best, -Michiel. From p.j.a.cock at googlemail.com Thu Jan 31 11:38:43 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 31 Jan 2013 11:38:43 +0000 Subject: [Biopython-dev] Trie with_prefix doesn't work as expected In-Reply-To: References: Message-ID: On Wed, Jan 30, 2013 at 9:42 AM, Peter Cock wrote: > On Wed, Jan 30, 2013 at 2:09 AM, Kevin Wu wrote: >> Hi All, >> >> I'm attempting to use the trie implementation in biopython to develop a >> suffix trie. I'm using the with_prefix function to find all keys which >> start with a sequence, however, the function doesn't return values that I >> expect. I tested it with the canonical example "banana" and am a bit >> confused. >> >> from Bio.trie import trie >> t = trie() >> s = "BANANA" >> for i in range(len(s)): # insert all suffixes into trie >> t[s[i:]] = i >> >> t.with_prefix("NA") # this works as expected >>>> ['NA', 'NANA'] >> >> t.with_prefix("AN") >>>> ['AN', 'ANNA'] # this doesn't work as expected >> # expected output: ["ANANA", "ANA"] >> >> Can anyone clarify my confusion or confirm this bug? I'm on Biopython 1.60, >> Linux Mint 64-bit. > > There is certainly something odd happening. I'm testing with the > current code in git (pre-Biopython 1.61) under Mac OS X. 
> >>>> from Bio.trie import trie >>>> t = trie() >>>> s = "BANANA" >>>> for i in range(len(s)): # insert all suffixes into trie > ... t[s[i:]] = i > ... print "%s -> %i" % (s[i:], i) > ... assert t[s[i:]] == i > ... > BANANA -> 0 > ANANA -> 1 > NANA -> 2 > ANA -> 3 > NA -> 4 > A -> 5 >>>> t.values() > [5, 3, 1, 0, 4, 2] >>>> t.keys() > ['A', 'ANA', 'ANANA', 'BANANA', 'NA', 'NANA'] > > These look fine: > >>>> t.with_prefix("NA") > ['NA', 'NANA'] >>>> t.with_prefix("A") > ['A', 'ANA', 'ANANA'] >>>> t.with_prefix("ANA") > ['ANA', 'ANANA'] > > As you point out, this example seems wrong: > >>>> t.with_prefix("AN") > ['AN', 'ANNA'] > > The value 'ANNA' shouldn't be in the trie. > > Peter Thanks to Jeff Chang for a very speedy fix (sent as an attachment off list), which I have applied to the repository: https://github.com/biopython/biopython/commit/cd7cc7174fd4b0607381e9c58f6ae0d17cca8f74 I've also added a unit test based on Kevin's example: https://github.com/biopython/biopython/commit/efc289c8fe2e78ad12481973e42554fa40f2ea0a Thank you for reporting this Kevin. Peter P.S. Nice to hear from you again Jeff :) I think your last commit was before we moved from CVS to git, please let us know if you want commit access on github. From p.j.a.cock at googlemail.com Thu Jan 31 11:43:44 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 31 Jan 2013 11:43:44 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359630192.62870.YahooMailClassic@web164001.mail.gq1.yahoo.com> References: <1359630192.62870.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Thu, Jan 31, 2013 at 11:03 AM, Michiel de Hoon wrote: > Dear all, > > [Michiel wrote:] >> Still I would think that there is a better way to do this, >> and I doubt that we are the first ones who want to access >> test files with doctests. I can write a short message to >> comp.lang.python to see have anybody has any suggestions. 
> > So I started writing a message to comp.lang.python, and while reading > the doctest documentation to make my message understandable I > realized that we can solve our problem by using the setUp and tearDown > arguments to doctest.DocTestSuite. Then we put the test files in the same > directory as the module we want to test, and use setUp/tearDown to let > the unittest switch to this directory when needed. > > This has the added benefit that the example files are easier to find > for users who want to try out a doctest example. > > Perhaps we'll still run into some issues if we try to implement this, but > it seems a step in the right direction. I don't follow what you are suggesting here. Are you suggesting putting test files under Bio/* as well/instead or under Tests/* ? Peter From mjldehoon at yahoo.com Thu Jan 31 13:46:47 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 31 Jan 2013 05:46:47 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: Message-ID: <1359640007.58576.YahooMailClassic@web164005.mail.gq1.yahoo.com> > I don't follow what you are suggesting here. Are you > suggesting putting test files under Bio/* as well/instead > or under Tests/* ? Well the key point is that if we run the doctests from the Tests directory (with run_tests.py), we can change directory to the directory containing the module whose doctests we want to test. Then, if "python somemodule.py" can find the test files, then so can run_tests.py. We'd just need to make sure that the relative paths in somemodule.py are correct with respect to the directory in which somemodule.py resides. But keep in mind that the unit tests in Tests and the doctests in the modules have different functions. The purpose of the unit tests is to test the Biopython code; the purpose of the doctests is to make sure the docstring examples work. 
So one could argue that the heavy test files should go under Tests, while simple test files just for the docstring examples should go under Bio/SomeModule. Best, -Michiel. --- On Thu, 1/31/13, Peter Cock wrote: > From: Peter Cock > Subject: Re: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone > To: "Michiel de Hoon" > Cc: "Wibowo Arindrarto" , "BioPython-Dev Mailing List" > Date: Thursday, January 31, 2013, 6:43 AM > On Thu, Jan 31, 2013 at 11:03 AM, > Michiel de Hoon > wrote: > > Dear all, > > > > [Michiel wrote:] > >> Still I would think that there is a better way to > do this, > >> and I doubt that we are the first ones who want to > access > >> test files with doctests. I can write a short > message to > >> comp.lang.python to see have anybody has any > suggestions. > > > > So I started writing a message to comp.lang.python, and > while reading > > the doctest documentation to make my message > understandable I > > realized that we can solve our problem by using the > setUp and tearDown > > arguments to doctest.DocTestSuite. Then we put the test > files in the same > > directory as the module we want to test, and use > setUp/tearDown to let > > the unittest switch to this directory when needed. > > > > This has the added benefit that the example files are > easier to find > > for users who want to try out a doctest example. > > > > Perhaps we'll still run into some issues if we try to > implement this, but > > it seems a step in the right direction. > > I don't follow what you are suggesting here. Are you > suggesting putting > test files under Bio/* as well/instead or under Tests/* ? 
> > Peter > From p.j.a.cock at googlemail.com Thu Jan 31 14:26:50 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 31 Jan 2013 14:26:50 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359640007.58576.YahooMailClassic@web164005.mail.gq1.yahoo.com> References: <1359640007.58576.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: On Thu, Jan 31, 2013 at 1:46 PM, Michiel de Hoon wrote: >> I don't follow what you are suggesting here. Are you >> suggesting putting test files under Bio/* as well/instead >> or under Tests/* ? > > Well the key point is that if we run the doctests from the Tests directory > (with run_tests.py), we can change directory to the directory containing > the module whose doctests we want to test. Then, if "python somemodule.py" > can find the test files, then so can run_tests.py. We'd just need to make > sure that the relative paths in somemodule.py are correct with respect to > the directory in which somemodule.py resides. I can see how that would work - put all the path changing magic into run_tests.py (before running the doctest for Bio/x/y/z.py change to the directory Bio/x/y and so on), and have the Bio/x/y/z.py doctests assume they will be run from Bio/x/y only. > But keep in mind that the unit tests in Tests and the doctests in the modules > have different functions. The purpose of the unit tests is to test the Biopython > code; the purpose of the doctests is to make sure the docstring examples work. Of course. > So one could argue that the heavy test files should go under Tests, while > simple test files just for the docstring examples should go under Bio/SomeModule. Many of the unittests and doctests currently use the same example files. However, my main objection is that I don't like the idea of putting test files under Bio/* - I feel it should be the source code only (bar some special cases like data files). 
There are probably packaging guidelines about this somewhere... but I can't find anything immediately. Regards, Peter From mjldehoon at yahoo.com Thu Jan 31 15:33:35 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 31 Jan 2013 07:33:35 -0800 (PST) Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone Message-ID: <1359646415.80564.YahooMailClassic@web164006.mail.gq1.yahoo.com> --- On Thu, 1/31/13, Peter Cock wrote: > However, my main objection is that I don't like the idea of > putting test files under Bio/* I'm OK with using the setUp and tearDown arguments to doctest.DocTestSuite to do the directory magic, but keeping the test files under Tests/. Best, -Michiel. From p.j.a.cock at googlemail.com Thu Jan 31 15:47:18 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 31 Jan 2013 15:47:18 +0000 Subject: [Biopython-dev] Deprecating Bio.ParserSupport, Bio.Blast.NCBIStandalone In-Reply-To: <1359646415.80564.YahooMailClassic@web164006.mail.gq1.yahoo.com> References: <1359646415.80564.YahooMailClassic@web164006.mail.gq1.yahoo.com> Message-ID: On Thu, Jan 31, 2013 at 3:33 PM, Michiel de Hoon wrote: > --- On Thu, 1/31/13, Peter Cock wrote: >> However, my main objection is that I don't like the idea of >> putting test files under Bio/* > > I'm OK with using the setUp and tearDown arguments to > doctest.DocTestSuite to do the directory magic, but keeping the test files > under Tests/. As a more elegant version of the Bio._utils.run_doctest() function? Peter
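
For illustration, the setUp/tearDown idea discussed in this thread might look roughly like the sketch below. It is only a sketch under stated assumptions, not code from Biopython: the helper names make_chdir_hooks, doctest_suite_in_dir and demo, the throwaway module, and the file example.txt are all invented here. The only library feature relied on is that doctest.DocTestSuite accepts setUp and tearDown callables, each of which is passed the DocTest object.

```python
import doctest
import os
import shutil
import tempfile
import types
import unittest


def make_chdir_hooks(directory):
    """Return (setUp, tearDown) callables for doctest.DocTestSuite.

    setUp remembers the current directory and changes into `directory`;
    tearDown changes back. Both names are hypothetical helpers.
    """
    state = {}

    def set_up(test):
        state["cwd"] = os.getcwd()
        os.chdir(directory)

    def tear_down(test):
        os.chdir(state["cwd"])

    return set_up, tear_down


def doctest_suite_in_dir(module, directory):
    """Build a doctest suite whose examples run with `directory` as cwd."""
    set_up, tear_down = make_chdir_hooks(directory)
    return doctest.DocTestSuite(module, setUp=set_up, tearDown=tear_down)


def demo():
    """Run a throwaway doctest that opens a file via a relative path."""
    data_dir = tempfile.mkdtemp()
    try:
        with open(os.path.join(data_dir, "example.txt"), "w") as handle:
            handle.write("ACGT\n")
        # Stand-in for a real module whose docstring uses relative paths.
        module = types.ModuleType("demo_module")
        module.__doc__ = """
        >>> with open("example.txt") as handle:
        ...     print(handle.read().strip())
        ACGT
        """
        suite = doctest_suite_in_dir(module, data_dir)
        result = unittest.TestResult()
        suite.run(result)
        return result
    finally:
        shutil.rmtree(data_dir)
```

With something along these lines, the directory handling would live in the test runner rather than in each module, and the docstring examples themselves would only ever contain plain relative paths. Which directory to change into (Tests/ or the module's own directory) is exactly the open question in the messages above.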