From redmine at redmine.open-bio.org  Sun Sep  2 15:20:01 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sun, 2 Sep 2012 19:20:01 +0000
Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock

Issue #3379 has been updated by João Rodrigues.

I contacted the developers of PatchDock and they updated their code. Their PDBs no longer have the double END statement, but they might still have conflicting chains: the parser will likely break if by chance both chains have id A and overlapping residue numbers. Still, a slight improvement.

----------------------------------------
Bug #3379: PDBParser fails to parse PDBs produced by PatchDock
https://redmine.open-bio.org/issues/3379

Author: David Cain
Status: New
Priority: Low
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version: 1.57

I apologize in advance if this technically doesn't count as a bug, as the problem arises from improperly formatted PDBs.

h3. Background

Protein docking utilities generally create a complex PDB from two input files. Depending on the rotation algorithm, at least one of the PDB files is rotated (its ATOM coordinates modified in-place), then the two files are concatenated to create a protein complex file.

h3. Why PDBParser fails

Utilities like ZDOCK strip a lot of data from the input files, creating a poorly-formed PDB file that raises PDBConstructionWarnings, but that PDBParser can ultimately parse. PatchDock, however, preserves the input PDB files as they were; the only thing that changes is the ATOM coordinates. This is problematic when the receptor PDB has an @END@ record or @CONECT@ records: PDBParser's current behavior is to consider anything after an @END@ or @CONECT@ to be trailer data, and to cease parsing when they're encountered. This means that many complexes parse cleanly, but completely exclude the ligand.

h3. How to fix the problem

In an ideal world, the responsibility would be on the creators of the docking utilities to create well-formed complex PDB files. However, this quick concatenation seems to be pretty common (complexes are often created by very short, hackish Perl scripts). Should PDBParser be able to parse these badly formed PDB files?

h3. Potential change to @PDBParser._parse_coordinates@?

If a modification to PDBParser is on the table, my thought would be to still consider anything after @END@ or @CONECT@ to be part of the trailer, but make an attempt to parse extra coordinate data from this trailer before returning (probably through a recursive call). If records are found in the trailer, a PDBConstructionWarning is raised, but they're added to the structure.

If this approach is reasonable, let me know and I'd be happy to mock something up and push it to my branch on GitHub. Otherwise, I'll just write scripts to clean ugly complexes for parsing. My only thought is that most users of docking software are probably not able or willing to write such a script, and thus can't use Biopython to parse the PDB output.

-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org  Sun Sep  2 21:05:19 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Mon, 3 Sep 2012 01:05:19 +0000
Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock

Issue #3379 has been updated by David Cain.

That's awesome! Thanks for doing that.

Well, chain renumbering is definitely a problem, but I don't see any easy fix for that.
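As a rough illustration of the clean-up script mentioned in the issue description, the sketch below drops every END and CONECT record from a concatenated complex and appends a single END. The function name `strip_trailer_records` is hypothetical and nothing here is part of Biopython; discarding CONECT records loses connectivity data, and clashing chain ids are left untouched.

```python
def strip_trailer_records(lines):
    """Return PDB lines without END/CONECT records, plus one final END.

    Assumption: the input is a receptor file and a ligand file concatenated
    verbatim, so END or CONECT records may sit in the middle of the data.
    """
    cleaned = []
    for line in lines:
        # The record name occupies the first six columns of a PDB line.
        record = line[:6].strip()
        if record in ("END", "CONECT"):
            # Skip records that would make a parser stop reading early.
            continue
        cleaned.append(line.rstrip("\n"))
    # Terminate the merged file exactly once.
    cleaned.append("END")
    return cleaned
```

The cleaned lines could then be written to a temporary file and handed to PDBParser; that step is not shown here.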
I still think the "pull request":https://github.com/biopython/biopython/pull/60 is relevant for detecting otherwise malformed PDB files (additionally, parsing will still stop after the first file if @CONECT@ records are present).

From w.arindrarto at gmail.com  Mon Sep  3 06:14:59 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Mon, 3 Sep 2012 12:14:59 +0200
Subject: [Biopython-dev] GSoC SearchIO project
References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com> <87lim4h07o.fsf@fastmail.fm>

Hello everyone,

I'd like to update everyone on my latest SearchIO(?) developments. There has been some progress and bug fixes since GSoC officially ended two weeks ago. Some of them I'd like to share here:

1. I've written a draft tutorial chapter for the submodule. It's been pushed to my development repo (https://github.com/bow/biopython/tree/searchio) and I'm hosting the HTML temporarily on my site (http://bow.web.id/biopython/Tutorial.html). Comments and critiques are welcomed :).

2. Back on the naming issue, I'm still using SearchIO for now. I've experimented with other names (Bio.Search and Bio.SeqSearch), and my impression is I like Bio.SeqSearch the most, followed by Bio.Search, and Bio.SearchIO.
It does feel confusing initially (we have SeqUtils, SeqFeature, etc.), but after a while it's the one that feels most natural.

3. And finally, Peter and I discussed this briefly previously: what if we merge the existing BLAST wrappers and NCBI qblast into Bio.(SeqSearch / Search / SearchIO)? I felt there was a lot of overlap between this submodule and Bio.BLAST when writing the tutorial, so merging surfaced in my thoughts again. We could put the BLAST wrappers under Bio.SeqSearch.Applications (for example), along with other wrappers (I have a yet-untested Bio.HMMER3 wrapper and possibly a Bio.BLAT wrapper that could be put here as well). As for qblast (and other remote searches, like the one provided by HMMER at the moment), we could put them in Bio.SeqSearch.Remote, perhaps. I think this would make it easier for anyone who works with BLAST / other sequence search tools as all Biopython-related functionalities are grouped in one place.

This is just a thought for now, but I'd love to hear your thoughts on the merge (and the naming ;) ).

cheers,
Bow

On Tue, Aug 21, 2012 at 6:01 PM, Wibowo Arindrarto wrote:
> On Tue, Aug 14, 2012 at 9:49 PM, Peter Cock wrote:
> > On Tue, Apr 10, 2012 at 1:58 AM, Brad Chapman wrote:
> >> Michiel;
> >>> Hi Eric, Peter,
> >>>
> >>> > How about Bio.Search, for now?
> >>>
> >>> I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells
> >>> users something about what the module is for. Bio.Search could be
> >>> anything (search PubMed? search the Entrez databases? search Google?
> >>> anyway Bio.Search does not suggest that this module is about pairwise
> >>> alignments). But Peter previously mentioned that he doesn't like
> >>> Bio.Pairwise; can we convince you?
> >>
> >> I agree with Peter on this one. The module is primarily about searching
> >> a sequence database with an input via multiple methods, not about
> >> pairwise alignment of two sequences, which is what Bio.Align.Pairwise
> >> suggests to me.
> >>
> >> Brad
> >
> > One potential problem with Bio.Search (on top of concerns raised
> > here about vagueness) Bow and I were just talking about during
> > our weekly GSoC video call was the existence of Bio/Search.py
> > which is obsolete and long overdue for removal. I have just
> > deprecated it (something I forgot to do before the last release):
> >
> > https://github.com/biopython/biopython/commit/5a275ccd1df3def40df1eef517af755d373dadd8
> >
> > We'd earlier talked about using Bio.Search as the namespace. I was
> > worried about the potential existence on a user's machine of both
> > Bio/Search.py (the old obsolete code) and Bio/Search/__init__.py
> > (aka SearchIO, the new module) and which would take precedence
> > when doing: from Bio import Search
> >
> > Given how Python module installations work, that seems highly
> > likely to occur. The good news is that the package would take
> > priority - see http://www.python.org/doc/essays/packages.html
> >
> >>>>> What If I Have a Module and a Package With The Same Name?
> >>>>>
> >>>>> You may have a directory (on sys.path) which has both a module
> >>>>> spam.py and a subdirectory spam that contains an __init__.py
> >>>>> (without the __init__.py, a directory is not recognized as a package).
> >>>>> In this case, the subdirectory has precedence, and importing spam
> >>>>> will ignore the spam.py file, loading the package spam instead. If
> >>>>> you want the module spam.py to have precedence, it must be
> >>>>> placed in a directory that comes earlier in sys.path.
> >
> > So there is no technical reason to avoid Bio.Search as an
> > option for the Bio.SearchIO namespace. We could then
> > have Bio.Search.Applications for command line wrappers,
> > consistent with Bio.Phylo.Applications, Bio.Motif.Applications
> > and Bio.Align.Applications.
> >
> > Of course, Bio.Search is still perhaps too broad a name... but
> > on balance perhaps it is still better than Bio.SearchIO?
> >
> > Regards,
> >
> > Peter
>
> Hi everyone,
>
> If I may add my two cents, for now I am in favor of putting the module
> under Bio.Search. It is not the best name out there (it does sound a
> bit vague), but it's the one that seems to be the most intuitive (until
> a better alternative comes out). There were some other alternatives
> that Peter and I have discussed, but they seem less appealing to us.
> You're free to add your thoughts on these of course :) :
>
> - Bio.SeqSearch. This sounds ok, but when you consider we have
> Bio.Seq, Bio.SeqRecord, Bio.SeqFeature, and Bio.SeqUtils, it becomes
> quite confusing quickly.
>
> - Bio.PSearch ('p' for pairwise). This one seemed the least intuitive
> among the three options, so I'm not so big on this.
>
> For now, I'm still writing everything (code, docstrings, tutorial)
> using SearchIO. I suppose it's better if we could agree on a more
> suitable name, though.
>
> On another note, I'm also in favor of using the Bio.Phylo module
> skeleton for Bio.SearchIO / Bio.Search. We may then group all sequence
> search-related application wrappers under Applications (I actually
> prefer 'app' for better PEP8 compliance, but that's another
> discussion) and perhaps even refactor our remote search calls (e.g.
> the 'qblast' module) under Bio.Search as well.
>
> cheers,
> Bow

From p.j.a.cock at googlemail.com  Mon Sep  3 08:28:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 3 Sep 2012 13:28:30 +0100
Subject: [Biopython-dev] GSoC SearchIO project
References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com> <87lim4h07o.fsf@fastmail.fm>

On Mon, Sep 3, 2012 at 11:14 AM, Wibowo Arindrarto wrote:
> Hello everyone,
>
> I'd like to update everyone on my latest SearchIO(?) developments. There
> has been some progress and bug fixes since GSoC officially ended two weeks
> ago. Some of them I'd like to share here:
>
> 1. I've written a draft tutorial chapter for the submodule. It's been pushed
> to my development repo (https://github.com/bow/biopython/tree/searchio) and
> I'm hosting the HTML temporarily on my site
> (http://bow.web.id/biopython/Tutorial.html). Comments and critiques are
> welcomed :).

Oh - excellent - I'll read that in the next few days :)

> 2. Back on the naming issue, I'm still using SearchIO for now. I've
> experimented with other names (Bio.Search and Bio.SeqSearch), and my
> impression is I like Bio.SeqSearch the most, followed by Bio.Search, and
> Bio.SearchIO. It does feel confusing initially (we have SeqUtils,
> SeqFeature, etc.), but after a while it's the one that feels most natural.

Initially Bio.SeqSearch sounds a bit long... but maybe it will grow on me...

> 3. And finally, Peter and I discussed this briefly previously: what
> if we merge the existing BLAST wrappers and NCBI qblast into Bio.(SeqSearch
> / Search / SearchIO)? I felt there was a lot of overlap between this
> submodule and Bio.BLAST when writing the tutorial, so merging surfaced in
> my thoughts again. We could put the BLAST wrappers under
> Bio.SeqSearch.Applications (for example), along with other wrappers (I have
> a yet-untested Bio.HMMER3 wrapper and possibly a Bio.BLAT wrapper that could
> be put here as well). As for qblast (and other remote searches, like the one
> provided by HMMER at the moment), we could put them in
> Bio.SeqSearch.Remote, perhaps. I think this would make it easier for anyone
> who works with BLAST / other sequence search tools as all Biopython-related
> functionalities are grouped in one place.

As per my discussion with Bow, I'm OK with aiming to deprecate the Bio.BLAST namespace as part of introducing Bio.SeqSearch/Search/.., although I don't have a strong preference on a naming convention for any online functionality. Possibly www is shorter than remote and also clear?

> This is just a thought for now, but I'd love to hear your thoughts on the
> merge (and the naming ;) ).
>
> cheers,
> Bow

Thanks Bow :)

Peter

From p.j.a.cock at googlemail.com  Mon Sep  3 08:55:07 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 3 Sep 2012 13:55:07 +0100
Subject: [Biopython-dev] Beta code in the official releases?
In-Reply-To: <1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org>
References: <877gsq8mn2.fsf@fastmail.fm> <1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org>

On Wed, Aug 29, 2012 at 6:54 PM, Sczesnak, Andrew wrote:
> +1
>
> It's been over a year since I first submitted my MAF code!

Already? Ouch, my apologies. I'm at a hackathon this week with the OBF GSoC mentors who looked at MAF for BioRuby - looking at this for inclusion in the next Biopython release (perhaps with a beta tag) is on my agenda.

Peter

From anaryin at gmail.com  Mon Sep  3 18:07:39 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 4 Sep 2012 01:07:39 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends

Hi all,

A quick update on some latest work. I found some time to finally work a bit on the PDB parser and Bio.PDB in general.

I started by optimizing the current code. I ran cProfile on a script that parsed a set of structures without header and without element columns. I did this because one of the optimizations rendered the current header parser useless (I replaced the PDB file handle by an iterator instead of using the readlines method). I still need to work a bit on the memory leak, but for now it seems pretty ok (parsed 400-ish large structures without a glitch).

I am attaching two pictures of cProfile and the two output files. There is a nice improvement of about 25%, but this can still be improved for sure. I just replaced some methods here and there, pre-initialized the numpy arrays, etc. I pushed this version to my github pdb_enhancements branch.

One big change I would propose is to eliminate the duality child_list/child_dict.
I think that keeping child_dict and generating child_list from sorted dict keys would be good enough. OrderedDict also looks appropriate, but it's Py2.7+. I still need to look into this, but all those "append" methods in the profiling hint at a nice speed up, and also at much cleaner code.

Let me know your opinion if you have some time,

Cheers,

João

PS. Attached complex_1.pdb as an example of the structures in the dataset used for this particular test.

[Non-text attachments scrubbed by the archive: BioPDB-master-TBEV.png (image/png, 166144 bytes), BioPDB-master-TBEV.profile (application/octet-stream, 252112 bytes), BioPDB-optimized-TBEV.png (image/png, 148137 bytes), BioPDB-optimized-TBEV.profile (application/octet-stream, 273487 bytes), complex_1w.pdb (chemical/x-pdb, 649559 bytes)]

From p.j.a.cock at googlemail.com  Tue Sep  4 01:56:55 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Sep 2012 06:56:55 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends

On Mon, Sep 3, 2012 at 11:07 PM, João Rodrigues wrote:
> One big change I would propose is to eliminate the duality
> child_list/child_dict. I think that keeping child_dict and generating
> child_list from sorted dict keys would be good enough. OrderedDict also
> looks appropriate, but it's Py2.7+. Still need to look into this, but by
> looking at all those "append" methods in the profiling it hints at a nice
> speed up, and also at much cleaner code.

Where there are back-ports of the OrderedDict and other useful classes like NamedTuple, we could probably include these as part of our Python 2/3 compatibility code. i.e. in Bio.PDB use:

    from Bio._py3k import OrderedDict

(Until we drop older versions of Python which don't come with this.)

In Bio._py3k we would have something like this:

    #Use in preference system OrderedDict (Python 2.7 and 3.x),
    #the backport from PyPI, or our own bundled implementation
    try:
        from collections import OrderedDict
    except ImportError:
        try:
            #Whatever http://pypi.python.org/pypi/ordereddict uses:
            from xxx import OrderedDict
        except ImportError:
            #Import local bundled implementation, e.g.
            from _ordereddict import OrderedDict

See http://code.activestate.com/recipes/576693-ordered-dictionary-for-py24/

Are there any objections to this plan?

Regards,

Peter

From anaryin at gmail.com  Tue Sep  4 01:59:36 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 4 Sep 2012 08:59:36 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends

Sounds great, I saw the ActiveState link before but I never thought of including it. Thanks!

From w.arindrarto at gmail.com  Tue Sep  4 02:11:05 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 4 Sep 2012 08:11:05 +0200
Subject: [Biopython-dev] Optimization of PDBParser and friends

Hi Peter, João,

Just a little FYI. I ran into the OrderedDict issue when I started writing SearchIO a few months ago as well, so I added an OrderedDict implementation in Bio._py3k (https://github.com/bow/biopython/commit/34f873b2d21136487a0b925be898c52daa0cc61c).

The code is from the ordereddict module from PyPI at that time. I haven't checked if it's the same as the one shown in the link (there may have been some updates), but it seems to work fine up to now.

Hope this is useful :),
Bow

From p.j.a.cock at googlemail.com  Tue Sep  4 02:30:51 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Sep 2012 07:30:51 +0100
Subject: [Biopython-dev] PEP8 lower case module names?

Hello all,

Over on one of Bow's pull requests Michiel made a suggestion about consolidating the Bio.Seq* namespace under Bio.Seq.* which we can do by replacing Bio/Seq.py with Bio/Seq/__init__.py

See: https://github.com/biopython/biopython/pull/63#issuecomment-8252340

I agree that Bio.Seq, Bio.SeqUtils, Bio.SeqIO, Bio.SeqRecord, and Bio.SeqFeature isn't ideal. However, changing this would be a big disruption - so perhaps any large change like this should also address the mixed case module names which are not PEP8 conformant (Modules should have short, all-lowercase names).

http://www.python.org/dev/peps/pep-0008/#package-and-module-names

One idea I was pondering is a new parallel namespace, ideally bio.* but we can't use that due to case insensitive file systems like Windows and (by default) Mac OS X. So perhaps biopy, or bp? [I've not checked for clashes with other libraries yet.]

We could gradually move code over to the new namespace, using imports to preserve back compatibility - but support both namespaces during a (long) transition period.

What I like about this is it allows people to make a gradual conversion - and we don't have the burden of two main branches, as we would if we attempted a single jump to a Biopython v2.

Does this seem worth considering?
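To make the "using imports to preserve back compatibility" idea concrete, here is a minimal sketch in which a new lower-case module simply re-exports objects from a legacy mixed-case one. The module names `OldName` and `newname` are made up for illustration (they stand in for something like Bio.Seq and a lower-case equivalent), and the modules are built in memory so the snippet runs without Biopython installed; a real shim would be an ordinary file containing the re-exporting import.

```python
import sys
import types

# Stand-in for an existing mixed-case module (hypothetical name).
old_module = types.ModuleType("OldName")


class Seq(object):
    """Placeholder class that lives in the legacy namespace."""


old_module.Seq = Seq
sys.modules["OldName"] = old_module

# Hypothetical lower-case shim: it defines nothing itself and only
# re-exports names from the legacy module.
new_module = types.ModuleType("newname")
new_module.Seq = old_module.Seq
sys.modules["newname"] = new_module

# Both spellings resolve to the very same class object, so user code
# can migrate one import at a time.
from OldName import Seq as OldSeq
from newname import Seq as NewSeq
```

Because both names point at one object, instances created through either import pass `isinstance` checks against the other, which is what a gradual transition would rely on.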
Regards,

Peter

From mjldehoon at yahoo.com  Tue Sep  4 06:27:57 2012
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 4 Sep 2012 03:27:57 -0700 (PDT)
Subject: [Biopython-dev] PEP8 lower case module names?
Message-ID: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>

Hi Peter,

--- On Tue, 9/4/12, Peter Cock wrote:
> One idea I was pondering is a new parallel namespace,
> ideally bio.* but we can't use that due to case
> insensitive file systems like Windows and (by default)
> Mac OS X. So perhaps biopy, or bp?

As you say, the ideal namespace is bio.*, so let's use that. We have been using Bio.* for more than 10 years. We should not get stuck with a non-ideal namespace for the next 10+ years because there may be some glitches switching from Bio.* to bio.*. Frankly I doubt that this will cause huge problems in practice.

> We could gradually move code over to the new namespace,
> using imports to preserve back compatibility - but support
> both namespaces during a (long) transition period.

Why do we need a transition period? It's just a matter of replacing upper case with lower case in the imports.

> What I like about this is it allows people to make a gradual
> conversion - and we don't have the burden of two main
> branches if we attempted a single jump to a Biopython v2.
>
> Does this seem worth considering?

Yes but by all means, let's keep this simple. In the past, changes to Biopython have very rarely caused any serious problems for users.

Best,
-Michiel.

From p.j.a.cock at googlemail.com  Tue Sep  4 06:59:00 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Sep 2012 11:59:00 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>
References: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>

On Tue, Sep 4, 2012 at 11:27 AM, Michiel de Hoon wrote:
> Hi Peter,
>
> --- On Tue, 9/4/12, Peter Cock wrote:
>> One idea I was pondering is a new parallel namespace,
>> ideally bio.* but we can't use that due to case
>> insensitive file systems like Windows and (by default)
>> Mac OS X. So perhaps biopy, or bp?
>
> As you say, the ideal namespace is bio.*, so let's use
> that. We have been using Bio.* for more than 10 years.
> We should not get stuck with a non-ideal namespace for
> the next 10+ years because there may be some glitches
> switching from Bio.* to bio.*. Frankly I doubt that this
> will cause huge problems in practice.

So you'd advocate a simple switch where from one release to the next we change all the module names (making them lower case, perhaps with consolidation under bio.seq too)?

This may cause some difficulties for upgrades - it may require manual intervention to remove the old Bio folder in order to allow creation of the new bio folder.

>> We could gradually move code over to the new namespace,
>> using imports to preserve back compatibility - but support
>> both namespaces during a (long) transition period.
>
> Why do we need a transition period? It's just a matter
> of replacing upper case with lower case in the imports.

That forces people to update all their scripts at once. Of course, we can document how to do this so a script would work before and after the case change, e.g.

    try:
        from bio.seq import Seq
    except ImportError:
        from Bio.Seq import Seq

From p.j.a.cock at googlemail.com  Tue Sep  4 08:16:26 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Sep 2012 13:16:26 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends

On Tue, Sep 4, 2012 at 7:11 AM, Wibowo Arindrarto wrote:
> Hi Peter, João,
>
> Just a little FYI. I ran into the OrderedDict issue when I started writing
> SearchIO a few months ago as well, so I added an OrderedDict implementation
> in Bio._py3k
> (https://github.com/bow/biopython/commit/34f873b2d21136487a0b925be898c52daa0cc61c).
>
> The code is from the ordereddict module from PyPI at that time. I haven't
> checked if it's the same as the one shown in the link (there may have been
> some updates), but it seems to work fine up to now.
>
> Hope this is useful :),
> Bow

Given the OrderedDict will be useful for Bio.PDB and Bow's SearchIO, that seems quite a good case for including it. How does this look (on the 'od' branch in my repository)?

https://github.com/peterjc/biopython/commit/52b011aa8ddce06e636de776d8cea8e62845853f

This differs from Bow's version in that I put the module in as a separate file (Bio/_ordereddict.py), and that it will prefer the ordereddict package if already installed (e.g. from PyPI).

Peter

From w.arindrarto at gmail.com  Tue Sep  4 08:36:55 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 4 Sep 2012 14:36:55 +0200
Subject: [Biopython-dev] Optimization of PDBParser and friends

On Tue, Sep 4, 2012 at 2:16 PM, Peter Cock wrote:
> Given the OrderedDict will be useful for Bio.PDB and Bow's SearchIO,
> that seems quite a good case for including it. How does this look
> (on the 'od' branch in my repository)?
>
> https://github.com/peterjc/biopython/commit/52b011aa8ddce06e636de776d8cea8e62845853f
>
> This differs from Bow's version in that I put the module in as a separate
> file (Bio/_ordereddict.py), and that it will prefer the ordereddict
> package if already installed (e.g. from PyPI).
>
> Peter

Hi Peter,

This looks good. I like the 'ordereddict' module import check prior to using our bundled version.

One more thing I would suggest is about the namespace. I feel that in the future, we may run into similar issues (non-Python3 compatibility issues) since Python2.7 deprecation is still a long way off. Perhaps create a new subpackage in the root folder (maybe Bio._compat, but I don't have a strong preference), to keep code like this in one place? Or we could even put Bio._py3k under this subpackage and have one central place for compatibility-related code? This would prevent further root namespace clutter.

regards,
Bow

From k.d.murray.91 at gmail.com  Tue Sep  4 08:57:22 2012
From: k.d.murray.91 at gmail.com (Kevin Murray)
Date: Tue, 4 Sep 2012 22:57:22 +1000
Subject: [Biopython-dev] TAIR/AGI support

Hi All,

What's the status of TAIR AGIs in Biopython (I can see no mention of them, or support for them)? I've written a brief module which allows a user to query NCBI with a TAIR AGI, returning a Seq object (via Efetch). Is there any interest in including such functionality in Biopython?

More generally, are there any particular areas of Biopython development which could use an extra pair of hands?

Regards
Kevin Murray

From anaryin at gmail.com  Tue Sep  4 10:19:11 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 4 Sep 2012 17:19:11 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends

Guys,

Looks great, I will try to 'cherry pick' that branch and merge it with mine.

I have to solve some issues with the tests, but it seems to be a straightforward change.

Cheers,

João

From p.j.a.cock at googlemail.com  Tue Sep  4 10:42:35 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 4 Sep 2012 15:42:35 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends

On Tue, Sep 4, 2012 at 3:19 PM, João Rodrigues wrote:
> Guys,
>
> Looks great, I will try to 'cherry pick' that branch and merge it with mine.

I've applied it to the master now, which might make it easier. I think Bow might have a point about namespaces - although the underscore modules are 'private', they still show up in dir(Bio) so having a single folder for our inter-Python version compatibility code seems sensible if we add any more (e.g. NamedTuples).

> I have to solve some issues with the tests, but it seems to be a
> straightforward change.

Great.
Peter From anaryin at gmail.com Tue Sep 4 12:02:42 2012 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 4 Sep 2012 19:02:42 +0300 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: I agree, we could move them to a folder then? No dia 4 de Set de 2012 17:42, "Peter Cock" escreveu: > On Tue, Sep 4, 2012 at 3:19 PM, Jo?o Rodrigues wrote: > > Guys, > > > > Looks great, I will try to 'cherry pick' that branch and merge it with > mine. > > I've applied it to the master now, which might make it easier. > I think Bow might have a point about namespaces - although the > underscore modules are 'private', they still show up in dir(Bio) > so having a single folder for our inter-Python version compatibility > code seems sensible if we add any more (e.g. NamedTuples). > > > I have to solve some issues with the tests, but it seems to be a > > straightforward change. > > Great. > > Peter > From p.j.a.cock at googlemail.com Tue Sep 4 19:54:56 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Sep 2012 00:54:56 +0100 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: On Tue, Sep 4, 2012 at 5:02 PM, Jo?o Rodrigues wrote: > I agree, we could move them to a folder then? > OK - I moved Bio/_py3k.py to Bio/_py3k/__init__.py and also the new file Bio/_ordereddict.py to Bio/_py3k/ordereddict.py - this avoids having to change any of our import statements: https://github.com/biopython/biopython/commit/1a9bd6eeab0de3283bd1e6cc28c7754fbffefe2d Peter From redmine at redmine.open-bio.org Tue Sep 4 23:19:53 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 5 Sep 2012 03:19:53 +0000 Subject: [Biopython-dev] [Biopython - Bug #3382] (New) Bio.PDB.PDBList.retrieve_pdb_file() fails for Python3 Message-ID: Issue #3382 has been reported by Alexander Campbell. 
---------------------------------------- Bug #3382: Bio.PDB.PDBList.retrieve_pdb_file() fails for Python3 https://redmine.open-bio.org/issues/3382 Author: Alexander Campbell Status: New Priority: Normal Assignee: Category: Target version: URL: At present, calling @Bio.PDB.PDBList.retrieve_pdb_file()@ on any PDB ID will fail, giving the following traceback:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in ()
----> 1 pdbl.retrieve_pdb_file('1FAT')

/usr/lib64/python3.2/site-packages/Bio/PDB/PDBList.py in retrieve_pdb_file(self, pdb_code, obsolete, compression, uncompress, pdir)
    245         gz = gzip.open(filename, 'rb')
    246         out = open(final_file, 'wb')
--> 247         out.writelines(gz.read())
    248         gz.close()
    249         out.close()

TypeError: 'int' does not support the buffer interface
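The failure in the traceback above can be reproduced without PDBList or network access. The snippet below is only a minimal sketch: the payload is made up, and in-memory buffers stand in for the files opened on lines 245-246. It shows that on Python 3 iterating over a bytes object yields ints, that writelines() therefore raises TypeError, and that a single write() call copies the data intact.

```python
import gzip
import io

payload = b"HEADER    EXAMPLE\nEND\n"  # made-up stand-in for a PDB entry

# Compress in memory so the sketch needs no files or network access.
compressed = io.BytesIO()
with gzip.GzipFile(fileobj=compressed, mode="wb") as gz:
    gz.write(payload)
compressed.seek(0)

with gzip.GzipFile(fileobj=compressed, mode="rb") as gz:
    data = gz.read()  # a single bytes object on Python 3

# Iterating over bytes yields ints on Python 3, so writelines() ends up
# passing an int to write().
first_item = next(iter(data))

out = io.BytesIO()
try:
    out.writelines(data)
    writelines_failed = False
except TypeError:
    writelines_failed = True

# Writing the whole buffer in one call avoids the per-item iteration.
out = io.BytesIO()
out.write(data)
result = out.getvalue()
```

Here first_item is an int rather than a one-character string, which is the root of the error message in the traceback.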
This occurs because in Python3 a file opened in binary mode will return type @bytes@ for @read()@, or a list of type @bytes@ objects for @readlines()@. The @writelines()@ method expects an iterable where each element is of type @str@. This worked in Python2 as a @str@ can be viewed as a sequence of @str@ objects, and so line 247 effectively wrote one character at a time for the single @str@ yielded by @read()@. In Python3 iterating over a @bytes@ yields @int@ objects, leading to the TypeError. This issue can be fixed by changing line 247's call to @writelines()@ to just @write()@. This does not break functionality in Python2, according to my testing with Python 3.2.3 and 2.7.3 on Fedora 17. There are 4 more instances of @writelines()@ calls in the codebase, but in each of those cases the argument is a list or generator of @str@ or @bytes@ objects, so I don't think they will raise an error. I haven't tested them though. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From w.arindrarto at gmail.com Wed Sep 5 05:53:36 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 5 Sep 2012 11:53:36 +0200 Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: References: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: Hi guys, If I may add my two cents on this issue, I think it's also a chance to rectify all other namespace issues that we may have (not just PEP8-related). For instance: * In the root namespace, we now have Bio.Align and Bio.AlignIO.
Since we might be merging Bio.Seq and Bio.SeqIO into bio[py].seq (per the Github discussion[1]), I suppose we should do the same with Bio.Align as well (perhaps into bio[py].seq.align or bio[py].align). * With the change above, we might also want to change some of the submodule names completely. For example, if we merge Bio.Align into bio[py].align we'll have bio[py].align.applications, which I personally think could be shortened into bio[py].align.app. * As per the Github disscussion[1] as well, perhaps Bio.SeqUtils should also be merged as Seq object methods. There may be other changes as well, but the bottom line is all these changes will be quite considerable. As such, I think we could go all the way and be explicit in stating that the changes will be incompatible with previous Biopython versions (i.e. old scripts will break). As for bio.* and biopy.*, if we do decide to go all the way, bio.* seems like a better choice since there will be other incompatible changes anyway. But if we eventually decide to only fix PEP8-related issues while keeping compatibility with older versions, I'm leaning more towards biopy.*. regards, Bow [1] https://github.com/biopython/biopython/pull/63#issuecomment-8252340 On Tue, Sep 4, 2012 at 12:59 PM, Peter Cock wrote: > On Tue, Sep 4, 2012 at 11:27 AM, Michiel de Hoon wrote: >> Hi Peter, >> >> --- On Tue, 9/4/12, Peter Cock wrote: >>> One idea I was pondering is a new parallel namespace, >>> ideally bio.* but we can't use that due to case >>> insensitive file systems like Windows and (by default) >>> Mac OS X. So perhaps biopy, or bp? >> >> As you say, the ideal namespace is bio.*, so let's use >> that. We have been using Bio.* for more than 10 years. >> We should not get stuck with a non-ideal namespace for >> the next 10+ years because there may be some glitches >> switching from Bio.* to bio.*. Frankly I doubt that this >> will cause huge problems in practice. 
> > So you'd advocate a simple switch where from one > release to the next we change all the module names > (making them lower case, perhaps from consolidation > under bio.seq too)? > > This may cause some difficulties for upgrades - it may > require manual intervention to remove the old Bio folder > in order to allow creation of the new bio folder. > >>> We could gradually move code over to the new namespace, >>> using imports to preserve back compatibility - but support >>> both namespaces during a (long) transition period. >> >> Why do we need a transition period? It's just a matter >> of replacing upper case with lower case in the imports. > > That forces people to update all their scripts at once. > Of course, we can document how to do this so a script > would work before and after the case change, e.g. > > try: > from bio.seq import Seq > except ImportError: > > from Bio.Seq import Seq > >>> What I like about this is it allows people to make a >>> gradual >>> conversion - and we don't have to burden of two main >>> branches if we attempted a single jump to a Biopython v2. >>> >>> Does this seem worth considering? >> >> Yes but by all means, let's keep this simple. In the past, changes to Biopython have very rarely caused any serious problems for users. >> >> Best, >> -Michiel. >> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From anaryin at gmail.com Wed Sep 5 16:24:23 2012 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 5 Sep 2012 23:24:23 +0300 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: Hello all, Some news. A. The OrderedDict implementation is quite slow. It essentially slows down the parser by 30%, rendering all the improvements I had done moot. 
Therefore, although it's a great idea, a major reason for these updates is speed so I think it might not be worth it.

B. As an alternative to this, I implemented the following. Entity now has only child_dict, which is a plain dictionary. However, each Object (Model, Chain, Residue, Atom) gets its own __cmp__ method overloaded with the information in the "_sort" methods that already existed. In this way, a simple sorting of the values of the dictionary returns an ordered list. I tweaked the Atom.__cmp__ to first sort N CA C O atoms and then alphabetically. I also added that inorganic atoms such as Calcium come at the end. This will make things a bit nicer when Calcium is involved for example. Finally, the only downside to this seems to be that we lose the order in which residues are inserted. I.e. if residue 151 is the first of the PDB file and all others range from 1-150, then this first 151 is going to be placed at the end when you iterate. However, from my experience and in my opinion, not only is this logical, but it also rarely happens in real PDB files.

C. I am strongly in favour of removing most (if not all) set/get methods and replacing them with direct attribute access. For instance, "atom.get_parent() --> atom.parent". Saves some space in the code and makes things more transparent.

D. I edited the PDBParser to tweak a few things, nothing major. The file handle is now treated as an iterator throughout the parsing and it should be more memory-friendly. The line counter is still preserved. I also added a test to make the get_header argument actually work.

E. General things here and there that I can't remember right now.

F. Unittests are breaking everywhere. Checking why, but it all seems related to this sorting issue.
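The atom ordering described in point B (backbone N, CA, C, O first, then the rest alphabetically, with inorganic atoms last) could be sketched roughly as below. This is not the code from the branch: it uses a sort key rather than __cmp__, represents atoms as hypothetical (name, element) tuples, and the set of "organic" elements is an assumption made purely for illustration.

```python
# Rank of the backbone atoms that should come first within a residue.
BACKBONE_ORDER = {"N": 0, "CA": 1, "C": 2, "O": 3}

# Assumption for this sketch: anything outside this set counts as inorganic.
ORGANIC_ELEMENTS = {"C", "N", "O", "S", "H", "P", "SE"}


def atom_sort_key(atom):
    """Return a tuple that sorts backbone atoms first and inorganic atoms last."""
    name, element = atom  # hypothetical (atom name, element symbol) pair
    if element not in ORGANIC_ELEMENTS:
        return (2, 0, name)  # e.g. a calcium ion goes to the end
    if name in BACKBONE_ORDER:
        return (0, BACKBONE_ORDER[name], name)
    return (1, 0, name)  # remaining atoms, alphabetical by name


atoms = [
    ("CB", "C"),
    ("O", "O"),
    ("CA", "CA"),  # calcium ion: same name as the alpha carbon, different element
    ("CA", "C"),   # alpha carbon
    ("N", "N"),
    ("C", "C"),
    ("CG", "C"),
]
ordered = sorted(atoms, key=atom_sort_key)
```

Keying on the element as well as the name keeps the alpha carbon and a calcium ion apart even though both are named CA.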
Cheers, Jo?o From p.j.a.cock at googlemail.com Wed Sep 5 19:31:42 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Sep 2012 00:31:42 +0100 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: On Wed, Sep 5, 2012 at 9:24 PM, Jo?o Rodrigues wrote: > Hello all, > > Some news. > > A. The OrderedDict implementation is quite slow. It essentially slows down > the parser by 30%, rendering all the improvements I had done moot. > Therefore, although it's a great idea, a major reason for these updates is > speed so I think it might not be worth it. Which Python was that? i.e. The OrderedDict from the standard lib (which I hope is optimised), or the back port (which might be slower). > B. As an alternative to this, I implemented the following. Entity has now > only child_dict, and is a general dictionary. However, each Object (Model, > Chain, Residue, Atom) gets their own __cmp__ method overloaded with the > information in the "_sort" methods that already existed. In this way, a > simple sorting of the values of the dictionary returns an ordered list. I > tweaked the Atom.__cmp__ to first sort N CA C O atoms and then > alphabetically. I also added that inorganic atoms such as Calcium come at > the end. This will make things a bit nicer when Calcium is involved for > example. Finally, the only downside to this seems to be that we lose the > order in which residues are inserted. Ie. if residue 151 is the first of the > PDB file and all others range from 1-150, then this first 151 is going to be > placed at the end when you iterate. However, from my experience and in my > opinion, not only this is logical, but it also rarely happens in real PDB > files. That seems risky - but see if you can sort out what is happening with the unit tests (below). I'm not sure about your atomic sorting... it seems a bit magic. Would sorting on atomic number be nicer (and simple)? > C. 
I am strongly in favour of removing most (if not all) set/get methods and > replace them by direct attribute access. For instance, "atom.get_parent() > --> atom.parent". Saves some space in the code and makes things more > transparent. It would also look less like Java code ;) I like this plan - but initially define and document the new properties, and deprecate the old get/set properties. Without that you'll break almost every PDB using script out there. > D. I edited the PDBParser to tweaks a few things, nothing major. The file > handle is now treated as an iterator throughout the parsing and it should be > more memory-friendly. The line counter is still preserved. I also added a > test to make the get_header argument actually work. > > E. General things here and there that I can't just remember.. > > F. Unittests are breaking everywhere. Checking why, but it all seems related > to this sorting issue. > > Cheers, > > Jo?o Regards, Peter From p.j.a.cock at googlemail.com Wed Sep 5 20:10:57 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Sep 2012 01:10:57 +0100 Subject: [Biopython-dev] Beta code in the official releases? In-Reply-To: <1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org> References: <877gsq8mn2.fsf@fastmail.fm> <1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org> <1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org> Message-ID: On Wed, Sep 5, 2012 at 8:19 PM, Sczesnak, Andrew wrote: > Yeah, it would be great if this module could finally be included. > I've e-mailed the list numerous times asking what would be > necessary to include it and have done all you and Brad have > asked. I've watched you include bits and pieces of code from > other contributors quickly and without much scrutiny, so I > can't help but feel singled out. What is the logic in delaying > this? We've heard from people who are already using the > code and have asked when it will be pulled. 
Is it serving the > community to not even include the basic reader/writer? Am > I wasting my time? Is it your goal to actively discourage > contributions? In my mind, the main technical issue regarding MAF and AlignIO and the common alignment object is the lack of a common way of handling the idea of start/end (and sometimes strand) for each sequence (in a consistent co-ordinate system using Python counting). Evidently I haven't manage to adequately convey my interpretation/concern. Some file formats like EMBOSS' have these number explicitly but we're not parsing them: http://lists.open-bio.org/pipermail/biopython/2012-September/008142.html In the case of "fasta-m10" the numbers are stored in private properties as a 'short term' hack: http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009744.html Others like Stockholm have identifier/start-end as a combined names (but this is not mandatory). Here the start and end are being stored in the annotations dictionary (as unparsed strings, still using 1-based co-ordinates). In MAF the start/end are explicit and much more important. It would be near pointless to parse the the file ignoring these. Maybe your approach is good enough for MAF, and we should have adopted it as is, and delayed better integration with the other AlignIO formats? i.e. This is a general limitation in AlignIO and the object model, somewhat annoying in the formats already supported, but information critical to the MAF format. I was expecting a convention for this to fall out of Bow's GSoC work for 'pairwise alignments' in SearchIO - but the object model he came up with was not SeqRecord based (many of the file formats he was using didn't include sequences). Right now my inclination is still to add a location property to the SeqRecord, usually a FeatureLocation, but it could also be the proposed CompoundLocation for more complex cases. The question then is if/when this would be propagated, e.g. SeqRecord slicing/addition. 
http://lists.open-bio.org/pipermail/biopython-dev/2012-May/009646.html http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html So the wheels are turning, but slowly. I have not had as much time to dedicate to this as I would like - but other smaller or less inter-connected things are much easer to review and merge. Peter From p.j.a.cock at googlemail.com Wed Sep 5 20:34:19 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Sep 2012 01:34:19 +0100 Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations In-Reply-To: References: Message-ID: On Tue, Jul 24, 2012 at 10:38 PM, Peter Cock wrote: > On Tue, Jul 24, 2012 at 10:08 PM, Lenna Peterson wrote: >> I agree that an "upgraded" FeatureLocation could be more >> elegant. > > It could turn out to be simpler having just one location object... > certainly worth trying out before committing this branch as is. Such a new "upgraded" FeatureLocation would need to hold a list/tuple of its parts (rather like the proposed CompoundLocation), and those could be simply as tuples of start, end, strand, db_ref etc (essentially everything currently held in a FeatureLocation). I'm not sure that that is any better than the new class CompoundLocation holding a list of existing FeatureLocation objects. On the bright side, the branch still works nicely with the extra BioSQL tests I added. One of the issues worth a bit more discussion is the start and end values of the CompoundLocation - which I am considering making act as the left/minimum and right/ maximum boundary of the region spanned by the parts. For normal forward strand features this does give the biological start and end, likewise for reverse strand features but inverted (location's start gives the biological end). i.e. for *most* features this means no change to the current behaviour. My proposal would mean that for a feature spanning the origin on a circular genome of length N, the start would be 0 and the end N. 
Similarly for weird cases from trans-splicing, the start/end coordinates would give the total region spanned. As shown below, sometimes that happens to match the current behaviour, but in other cases the current behaviour isn't useful anyway. Adopting start/end as the spanned region makes a lot of sense for things like drawing features in a region of interest, or other more abstract tasks doing feature/region intersection. Here knowing the min/max boundaries of the region spanned is more useful than any attempt to capture the biological start/end of the feature. Note that already for the simple FeatureLocation for reverse strand features we have start < end, i.e. the start coordinate property does NOT represent the biological starting point. Under the proposed CompoundLocation behaviour, the desirable property of the FeatureLocation that start < end would also hold for compound locations. Pathological examples at the end, Regards, Peter P.S. One of the advantages of the CompoundLocation is when constructing the location you don't give the overall start/end - there are inferred from the list of parts automatically. Currently the GenBank/EMBL parser is having to do this. P.P.S. I've also confirmed Lenna's testing that sum of feature locations works if we define integer addition with locations (so that sum can include zero and several locations), see: https://github.com/peterjc/biopython/commit/dc6bc658141cc42e7e6802bbe8baf6c87a6874c0 ----------------------------------------------------------------- Trans-splicing: Mixed Strands An example where the range/span idea is simpler is mixed strand features like this trans-spliced example from NC_000932 (in our unit tests), join(complement(69611..69724),139856..140650) What would you expect as the start/end here? The biological start is base 69724 (one based) and the last base is 140650. 
Currently:

>>> from Bio import SeqIO
>>> f = SeqIO.read("NC_000932.gb", "gb").features[135]
>>> print f.location
[69610:140650]
>>> f.location.start
ExactPosition(69610)
>>> f.location.end
ExactPosition(140650)
>>> for sub in f.sub_features: print sub.location
...
[69610:69724](-)
[139855:140650](+)

Here the end value does match the last base in the feature following the biological order - the start value is actually a base in the middle of the combined sequence. In fact, for this example the start/end are already acting like the range/span idea.

-----------------------------------------------------------------

Trans-splicing: Reverse strand

The example above is a real corner case, and so is this single strand trans-splicing example, also in NC_000932, which is a bit like a circular genome origin spanning annotation:

complement(join(97999..98793,69611..69724))

With the current master branch:

>>> from Bio import SeqIO
>>> f = SeqIO.read("NC_000932.gb", "genbank").features[1]
>>> print f.location
[97998:69724](-)
>>> f.location.start
ExactPosition(97998)
>>> f.location.end
ExactPosition(69724)
>>> for sub in f.sub_features: print sub.location
...
[97998:98793](-)
[69610:69724](-)

Notice that we do not have start < end as you might expect. However the start and end DO capture the biological end and start (order inverted - this is on the reverse strand). To verify this I find it helps to transform the GenBank style location:

complement(join(97999..98793,69611..69724))

into the old EMBL equivalent:

join(complement(69611..69724),complement(97999..98793))

i.e. The first base is 69724 (one based counting), and the last base is 97999 (one based counting). So if you wanted to look at the upstream or downstream (assuming that makes sense for a trans-spliced gene), the current start/end values are useful (but you have to choose start vs end dependent on the strand).

On the other hand, the range of co-ordinate values is 69611 to 98793 (one based, inclusive).
Therefore one might expect start 69610 and end 98793 (Python counting), giving the spanned region. From chapmanb at 50mail.com Wed Sep 5 20:37:57 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 05 Sep 2012 20:37:57 -0400 Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: References: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: <87wr08x9y2.fsf@fastmail.fm> Hi all; I don't know if there's going to be a clean way around mucking up the API for older scripts if we make this change. If we want to do this my thoughts would be: - Use the 'bio' module since that's the cleanest. - Hack together something that will remove old 'Bio' modules on install of the new version. - Write a Biopython1to2 script that will fix the imports on older scripts to the new module structure. However, my vote would be to stick with everything as is. I know we aren't PEP8 compliant but things aren't that awful that we need an upheaval. I wish Python library installs weren't so messy that we could do this more cleanly, Brad > Hi guys, > > If I may add my two cents on this issue, I think it's also a chance > to rectify all other namespace issues that we may have (not just > PEP8-related). > > For instance: > > * In the root namespace, we now have Bio.Align and Bio.AlignIO. Since > we might be merging Bio.Seq and Bio.SeqIO into bio[py].seq (per the > Github discussion[1]), I suppose we should do the same with Bio.Align > as well (perhaps into bio[py].seq.align or bio[py].align). > > * With the change above, we might also want to change some of the > submodule names completely. For example, if we merge Bio.Align into > bio[py].align we'll have bio[py].align.applications, which I > personally think could be shortened into bio[py].align.app. > > * As per the Github disscussion[1] as well, perhaps Bio.SeqUtils > should also be merged as Seq object methods. 
> > There may be other changes as well, but the bottom line is all these > changes will be quite considerable. As such, I think we could go all > the way and be explicit in stating that the changes will be > incompatible with previous Biopython versions (i.e. old scripts will > break). > > As for bio.* and biopy.*, if we do decide to go all the way, bio.* > seems like a better choice since there will be other incompatible > changes anyway. But if we eventually decide to only fix PEP8-related > issues while keeping compatibility with older versions, I'm leaning > more towards biopy.*. > > regards, > Bow > > [1] https://github.com/biopython/biopython/pull/63#issuecomment-8252340 > > On Tue, Sep 4, 2012 at 12:59 PM, Peter Cock wrote: >> On Tue, Sep 4, 2012 at 11:27 AM, Michiel de Hoon wrote: >>> Hi Peter, >>> >>> --- On Tue, 9/4/12, Peter Cock wrote: >>>> One idea I was pondering is a new parallel namespace, >>>> ideally bio.* but we can't use that due to case >>>> insensitive file systems like Windows and (by default) >>>> Mac OS X. So perhaps biopy, or bp? >>> >>> As you say, the ideal namespace is bio.*, so let's use >>> that. We have been using Bio.* for more than 10 years. >>> We should not get stuck with a non-ideal namespace for >>> the next 10+ years because there may be some glitches >>> switching from Bio.* to bio.*. Frankly I doubt that this >>> will cause huge problems in practice. >> >> So you'd advocate a simple switch where from one >> release to the next we change all the module names >> (making them lower case, perhaps from consolidation >> under bio.seq too)? >> >> This may cause some difficulties for upgrades - it may >> require manual intervention to remove the old Bio folder >> in order to allow creation of the new bio folder. >> >>>> We could gradually move code over to the new namespace, >>>> using imports to preserve back compatibility - but support >>>> both namespaces during a (long) transition period. 
>>> >>> Why do we need a transition period? It's just a matter >>> of replacing upper case with lower case in the imports. >> >> That forces people to update all their scripts at once. >> Of course, we can document how to do this so a script >> would work before and after the case change, e.g. >> >> try: >> from bio.seq import Seq >> except ImportError: >> >> from Bio.Seq import Seq >> >>>> What I like about this is it allows people to make a >>>> gradual >>>> conversion - and we don't have to burden of two main >>>> branches if we attempted a single jump to a Biopython v2. >>>> >>>> Does this seem worth considering? >>> >>> Yes but by all means, let's keep this simple. In the past, changes to Biopython have very rarely caused any serious problems for users. >>> >>> Best, >>> -Michiel. >>> >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From chapmanb at 50mail.com Wed Sep 5 20:31:58 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 05 Sep 2012 20:31:58 -0400 Subject: [Biopython-dev] Beta code in the official releases? In-Reply-To: <1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org> References: <877gsq8mn2.fsf@fastmail.fm> <1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org> <1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org> Message-ID: <87zk54xa81.fsf@fastmail.fm> Andrew; > Yeah, it would be great if this module could finally be included. I've > e-mailed the list numerous times asking what would be necessary to > include it and have done all you and Brad have asked. 
I've watched you > include bits and pieces of code from other contributors quickly and > without much scrutiny, so I can't help but feel singled out. What is > the logic in delaying this? We've heard from people who are already > using the code and have asked when it will be pulled. Is it serving > the community to not even include the basic reader/writer? Am I > wasting my time? Is it your goal to actively discourage contributions? In addition to Peter's technical comments, from a personal side I hope you don't take offense. We definitely value contributions and your work. Some changes can end up being tricky because of the need to work with or fix previous non-optimal design decisions. When they require extra attention and decisions this can make it hard to allocate time for folks that volunteer on the project. This is definitely nothing personal and I hope you don't feel that way. My GFF parser has languished for even longer for similar reasons. I think the long term solution for this is incorporating beta code so we can get these in, recognize the contributions, make them available, and still giving wiggle room to improve the design before locking into an API that we need to support long term. Thanks again for all the work. We do appreciate it, Brad From chapmanb at 50mail.com Wed Sep 5 20:45:19 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 05 Sep 2012 20:45:19 -0400 Subject: [Biopython-dev] TAIR/AGI support In-Reply-To: References: Message-ID: <87txvcx9ls.fsf@fastmail.fm> Kevin; Thanks for the e-mail and offers of code. Always happy to have other folks involved with the project. > What's the status of TAIR AGIs in BioPython (I can see no mention of them, > or support for them)? I've written a brief module which allows a user to > query NCBI with a TAIR AGI, returning a Seq object (via Efetch). Is there > any interest in including such functionality in BioPython? 
Is the code available on GitHub to get a better sense of all the functionality it supports? Do you have an idea where it would fit best? As a tair submodule inside of Bio.Entrez, or somewhere else? > More generally, are there any particular areas of BioPython development > which could use an extra pair of hands? Following the mailing list for discussions on current projects is the best way to get a sense of what different folks are working on. The issue tracker also has open issues and features that could use attention if anything there strikes your fancy: https://redmine.open-bio.org/projects/biopython Hope this helps, Brad From p.j.a.cock at googlemail.com Wed Sep 5 20:57:19 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Sep 2012 01:57:19 +0100 Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: <87wr08x9y2.fsf@fastmail.fm> References: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com> <87wr08x9y2.fsf@fastmail.fm> Message-ID: On Thu, Sep 6, 2012 at 1:37 AM, Brad Chapman wrote: > > Hi all; > I don't know if there's going to be a clean way around mucking up the > API for older scripts if we make this change. > > If we want to do this my thoughts would be: > > - Use the 'bio' module since that's the cleanest. > - Hack together something that will remove old 'Bio' modules on install > of the new version. > - Write a Biopython1to2 script that will fix the imports on older > scripts to the new module structure. I really don't like using "bio" since (due to Python's use of folders for package names) you couldn't in general also have the old code available under "Bio". i.e. This forces a hard switch on our users which is a very bad idea I think. Thus my suggestion of something else like "biopy" (although the Mac's autocorrection keeps turning it into biopsy which would be annoying - grin), or if not already taken "bp". 
To expand on my earlier email, the transition structure I had in mind was that we'd have something like this: biopy/seq/__init__.py - real code for Seq object etc Bio/Seq/__init__.py - just "from biopy.seq import Seq" and a deprecation warning. > However, my vote would be to stick with everything as is. I know we > aren't PEP8 compliant but things aren't that awful that we need an > upheaval. I wish Python library installs weren't so messy that we could > do this more cleanly, > Brad That does seem safer, and we can still do the less invasive restructuring discussed, e.g. Bio/Seq.py -> Bio/Seq/__init__.py allowing us to (gradually) move Bio.Seq* things under Bio.Seq, while preserving the legacy imports under a deprecation warning. Also if we're considering moving Bio.SeqIO to Bio.Seq, as Bow points out, we'd want to do Bio/AlignIO.py -> Bio.Align (perhaps pushing the core objects into Bio/Align/_objects.py or similar but exposing them in the current namespace location). Regards, Peter From p.j.a.cock at googlemail.com Wed Sep 5 21:34:50 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Sep 2012 02:34:50 +0100 Subject: [Biopython-dev] SeqRecord locations; was: Beta code in the official releases? Message-ID: On Thu, Sep 6, 2012 at 1:10 AM, Peter Cock wrote: > > In my mind, the main technical issue regarding MAF and AlignIO > and the common alignment object is the lack of a common way > of handling the idea of start/end (and sometimes strand) for > each sequence (in a consistent co-ordinate system using Python > counting). Evidently I haven't manage to adequately convey my > interpretation/concern. 
> > Some file formats like EMBOSS' have these numbers explicitly > but we're not parsing them: > http://lists.open-bio.org/pipermail/biopython/2012-September/008142.html > > In the case of "fasta-m10" the numbers are stored in private > properties as a 'short term' hack: > http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009744.html > > Others like Stockholm have identifier/start-end as a combined > name (but this is not mandatory). Here the start and end are > being stored in the annotations dictionary (as unparsed strings, > still using 1-based co-ordinates). > > In MAF the start/end are explicit and much more important. > It would be near pointless to parse the file ignoring these. > Maybe your approach is good enough for MAF, and we > should have adopted it as is, and delayed better integration > with the other AlignIO formats? > > i.e. This is a general limitation in AlignIO and the object > model, somewhat annoying in the formats already supported, > but information critical to the MAF format. > > I was expecting a convention for this to fall out of Bow's GSoC > work for 'pairwise alignments' in SearchIO - but the object > model he came up with was not SeqRecord based (many > of the file formats he was using didn't include sequences). > > Right now my inclination is still to add a location property to > the SeqRecord, usually a FeatureLocation, but it could also > be the proposed CompoundLocation for more complex cases. > The question then is if/when this would be propagated, e.g. > SeqRecord slicing/addition. > http://lists.open-bio.org/pipermail/biopython-dev/2012-May/009646.html > http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html > > So the wheels are turning, but slowly. I have not had as > much time to dedicate to this as I would like - but other > smaller or less inter-connected things are much easier to > review and merge.
To expand on the SeqRecord.location property idea, I am thinking about (in the typical use cases) using a normal FeatureLocation object (from Bio.SeqFeature) where the start, end or strand are in the same co-ordinate system as the sequence of the SeqRecord. i.e. For a protein fragment, they would be in amino acids. For a nucleotide fragment, they would be in base pairs. Note that you might want to describe the CDS region for a protein sequence (which would be possible even for a join using the proposed CompoundLocation), so maybe 'location' is the wrong name here, perhaps 'fragment' or 'subregion', or something is clearer? When I talked about adding SeqRecords, and what would the combined SeqRecord's location be, we could use FeatureLocation addition (as defined on the branch for CompoundLocation objects). For slicing a SeqRecord, provided len(record.location) == len(record), this is well defined. However, I expect that quite often if used for alignments, what we will have instead is len(record.location) = len(record.seq.ungapped()) so we might be able to update the sub-record's location if we count the gap characters and factor them in. This equality could be verified in the SeqRecord __init__ (which would require the gap character, but the AlignIO parsers should all set that). I would like slicing to update the start/end because slicing alignment objects seems to be a quite common operation - so if you started from an alignment file using start/end (like Stockholm or MAF) it would be good to update these fields for the sub-alignment. This feels like it would work, but would it be useful or just over engineering? Would a simple static location property which is not automatically propagated in SeqRecord manipulations be enough (at least initially)? 
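As a rough illustration of the gap-counting idea above, here is a minimal sketch in plain Python. It assumes a forward-strand record whose ungapped letters cover a known half-open parent range, and it works on a plain string rather than a real SeqRecord. The function name and signature are hypothetical and not part of Biopython.

```python
def slice_gapped_location(gapped_seq, start, lo=None, hi=None, gap="-"):
    """Return (new_start, new_end) for the slice gapped_seq[lo:hi].

    Assumption: the ungapped letters of gapped_seq cover the half-open
    parent range [start, start + number_of_letters) on the forward strand.
    """
    # Normalise None or negative bounds exactly as Python slicing would.
    lo, hi, _ = slice(lo, hi).indices(len(gapped_seq))
    prefix = gapped_seq[:lo]
    chunk = gapped_seq[lo:hi]
    # Only real letters advance the parent coordinate; gaps do not.
    letters_before = len(prefix) - prefix.count(gap)
    letters_inside = len(chunk) - chunk.count(gap)
    new_start = start + letters_before
    return new_start, new_start + letters_inside


# "AC--GT-A" holds five letters, so with start=10 it covers 10..15.
print(slice_gapped_location("AC--GT-A", 10, 2, 6))  # (12, 14)
```

A slice that contains only gap characters comes back as a zero-length range, which a real implementation might prefer to treat as "no location".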
If a static property would be enough, is Brad's suggestion to just use special values in the annotations dictionary a simpler way forward (where we already have policies in place for handling generic annotation during SeqRecord manipulation - in general dropping it)? If so, would this be keys 'start', 'end', 'strand' for integer start and end using Python counting, and a strand value of +1 or -1 for forward and reverse? [We could use strand None for unavailable as in the SeqFeature location object, but I think no entry in the dictionary is nicer here]. Peter From anaryin at gmail.com Thu Sep 6 01:52:34 2012 From: anaryin at gmail.com (João Rodrigues) Date: Thu, 6 Sep 2012 08:52:34 +0300 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: Hey, > Which Python was that? i.e. The OrderedDict from the standard lib > (which I hope is optimised), or the back port (which might be slower). Both. I also found it strange and googled it. Apparently OrderedDict is pure Python, not C like dict, thus the difference. > That seems risky - but see if you can sort out what is happening > with the unit tests (below). What Bio.PDB does right now is rely on the list to iterate over things. Thus, you get the order in which you read the PDB file. However, if you sort it using each object's sort method you will get the following rules: Atom.py - N CA C O first, then alphabetically. Residue.py - First amino acids and nucleic acids, then heteroatoms. Chain.py - Empty chains last. These are already in place somewhere in the code. I just used them to overload the __cmp__ method, with a couple of additions because I personally disagree with the following: Atom.py - Inorganic atoms should come out last. For simplicity. Residue.py - If the PDB order is 151 MSE, 152 VAL, 153 CYS, you should get in return when you iterate: 151, 152, 153. Right now you get 152, 153, 151.
PDB files already have weird large numbers for water and ions for example, so these come out last anyway. Pushing all HETATMs to the end will sometimes disrupt the "natural" order of things, for instance modified residues. Magic perhaps :) I sorted out all relevant issues with the unit tests. I had a small problem with build_peptides because of this HETATM-last rule, so I took it away and now it works. All tests pass except 4: 2 because of the header, which is not read decently right now, and 2 because of the ordering which is explicit in the assert statement of the test. So it's a matter of changing these assertions and they will work. > It would also look less like Java code ;) > > I like this plan - but initially define and document the new properties, > and deprecate the old get/set properties. Without that you'll break > almost every PDB using script out there. How do I deprecate the old ones? Is there a DeprecationWarning or so? Just a reminder, if you want to test/check the code, it's on my GitHub. Cheers, João From w.arindrarto at gmail.com Thu Sep 6 01:57:04 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 6 Sep 2012 07:57:04 +0200 Subject: [Biopython-dev] SeqRecord locations; was: Beta code in the official releases? In-Reply-To: References: Message-ID: Hi guys, To add my two cents, I am in favor of creating a dynamic SeqRecord coordinate system using SeqFeature. However, I think it would also be good if we set some limitations as there are so many ways that slicing and addition could be used to create new SeqRecords, and anticipating all these scenarios may create an over-engineered (and probably slower) SeqRecord. Some scenarios that I can think of now: 1. Slicing SeqRecord objects using step values > 1 (e.g. new_seq = seq[1:120:3]) 2. Adding two or more SeqRecord objects with noncontiguous coordinate (i.e.
end coordinate of the first sequence is not directly followed by the second sequence's start coordinate), and then slice the resulting object So maybe some limitations that we could set are: 1. Only update the coordinates if slicing step is 1 (or -1), otherwise discard it. 2. Only update the coordinates if addition is between contiguous coordinates, otherwise discard it. Personally, I think this would cover most use cases for slicing while allowing us to keep it simple. As for the name, 'region' sounds better than 'location'. Maybe 'coverage'? I don't have any strong preference between these, but 'subregion' doesn't feel that nice. Finally, for the coordinate system, I imagine it will use Python's coordinate system, too? (zero-based, half-open, and the parsers / writers should do the conversion). Should we also reverse the coordinates if the objects are sliced in reverse (e.g. seqrecord[::-1]) or simply inverse the strand value but keep the coordinates unchanged? regards, Bow
From mjldehoon at yahoo.com Thu Sep 6 02:31:57 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 5 Sep 2012 23:31:57 -0700 (PDT) Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: Message-ID: <1346913117.35905.YahooMailClassic@web164006.mail.gq1.yahoo.com> [Brad] > Hack together something that will remove old 'Bio' modules > on install of the new version. We could check in setup.py if we can import Bio, and ask the user to remove the old Biopython installation before proceeding. Since we can tell the user exactly which directory to remove, this would be straightforward. I would prefer this to removing the directory automatically. [Peter] > I really don't like using "bio" since (due to Python's use > of folders for package names) you couldn't in general also > have the old code available under "Bio". i.e. This forces > a hard switch on our users which is a very bad idea I think. I don't see why a user would like to have both an old Biopython under Bio and a new Biopython under bio. Unless he wants to run some scripts with the old Biopython and other scripts with the new Biopython, but I don't see the point of that. [Peter] > Thus my suggestion of something else like "biopy" [...] > , or if not already taken "bp". [Brad] > However, my vote would be to stick with everything as is. If the choice is between "bp", "biopy", or "Bio", then I agree with Brad; I prefer keeping a nice but PEP8-noncompliant module name "Bio" rather than switching to a PEP8-compliant but less attractive name like "biopy" or "bp". Best, -Michiel.
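The setup.py check suggested above (detect an importable Bio and tell the user which directory to remove) might look roughly like the following. This is only a sketch: the helper name is hypothetical, it uses the modern standard-library importlib rather than anything available in 2012, and it reports the directory without deleting anything.

```python
import importlib.util
import os


def find_existing_package(name):
    """Return the directory of an importable top-level package, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # nothing importable under this name
    if spec.submodule_search_locations:
        # Regular package: the first search location is its folder.
        return os.path.abspath(list(spec.submodule_search_locations)[0])
    if spec.origin and spec.origin not in ("built-in", "frozen"):
        # Single-file module: report the folder that holds it.
        return os.path.dirname(os.path.abspath(spec.origin))
    return None


old_install = find_existing_package("Bio")
if old_install is not None:
    print("Existing Biopython found in %s" % old_install)
    print("Please remove it before installing the new version.")
```

Leaving the removal to the user matches the preference stated above for not deleting the directory automatically.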
From p.j.a.cock at googlemail.com Thu Sep 6 03:06:07 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Sep 2012 08:06:07 +0100 Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: <1346913117.35905.YahooMailClassic@web164006.mail.gq1.yahoo.com> References: <1346913117.35905.YahooMailClassic@web164006.mail.gq1.yahoo.com> Message-ID: On Thu, Sep 6, 2012 at 7:31 AM, Michiel de Hoon wrote: > [Brad] >> Hack together something that will remove old 'Bio' modules >> on install of the new version. > > We could check in setup.py if we can import Bio, and ask > the user to remove the old Biopython installation before > proceeding. Since we can tell the user exactly which directory > to remove, this would be straightforward. I would prefer this > to removing the directory automatically. I agree automatically removing the old install is risky. For single user machines, where the single user has only a small collection of scripts this isn't such an issue. For any shared server, or user with lots of Biopython scripts (some of which may have been written by different people), you would be forced into a mass change at one go. You would also have considerable hassle later on with any attempt to re-run old scripts. > [Peter] >> I really don't like using "bio" since (due to Python's use >> of folders for package names) you couldn't in general also >> have the old code available under "Bio". i.e. This forces >> a hard switch on our users which is a very bad idea I think. > > I don't see why a user would like to have both an old > Biopython under Bio and a new Biopython under bio. > Unless he wants to run some scripts with the old Biopython > and other scripts with the new Biopython, but I don't see > the point of that. Really? That is exactly what I am concerned about (both for single user machines like my desktop, and shared machines like our servers). How about the common situation of wanting to re-run old scripts from old projects on new data? 
If we were just changing the case, this might not be too complex (it would still be a frustrating transition period), but if we're also moving things around at the same time it is too much I feel. > [Peter] >> Thus my suggestion of something else like "biopy" [...] >> , or if not already taken "bp". > > [Brad] >> However, my vote would be to stick with everything as is. > > If the choice is between "bp", "biopy", or "Bio", then > I agree with Brad; I prefer keeping a nice but > PEP8-noncompliant module name "Bio" rather than > switching to a PEP8-compliant but less attractive > name like "biopy" or "bp". There is 'biopython' but it is rather long? No other ideas from anyone else? How about over the next year we gradually consolidate modules under the existing mixed case names? e.g. move Bio.AlignIO functionality and Bio.Align, and Bio.Seq* under Bio.Seq (leaving backwards compatible imports supported but deprecated). Here's a further (and slightly more radical) idea: We stick with using 'Bio' and the current mixed case names on Python 2, but adopt 'bio' and other PEP8 compatible names for Python 3 (as a uniform strict automatic rule: mixed case -> lower case)? i.e. Do this as part of our 2to3 process. Some nasty downside might occur to me later but right now it seems like a neat idea... other than not being quite in line with the expectation that Python 3 should not be used as an excuse to make API changes. Too radical? Regards, Peter From p.j.a.cock at googlemail.com Thu Sep 6 03:16:41 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Sep 2012 08:16:41 +0100 Subject: [Biopython-dev] SeqRecord locations; was: Beta code in the official releases? In-Reply-To: References: Message-ID: On Thu, Sep 6, 2012 at 6:57 AM, Wibowo Arindrarto wrote: > Hi guys, > > To add my two cents, I am in favor of creating a dynamic SeqRecord > coordinate system using SeqFeature.
However, I think it would also be > good if we set some limitations as there are so many ways that slicing > and addition could be used to create new SeqRecords, and anticipating > all these scenarios may create an over-engineered (and probably > slower) SeqRecord. > > Some scenarios that I can think of now: > > 1. Slicing SeqRecord objects using step values > 1 > (e.g. new_seq = seq[1:120:3]) Absolutely - here I would expect to lose the location information. We already have similar restrictions in the SeqRecord slicing for how SeqFeatures are handled. > 2. Adding two or more SeqRecord objects with noncontiguous coordinate > (i.e. end coordinate of the first sequence is not directly followed by > the second sequence's start coordinate), and then slice the resulting > object Adding *could* be done via the CompoundLocation, although that in itself might want to consider if nicely-abutting locations should be merged, e.g. in GenBank notation 100..201 and 202..300 could be 100..300 rather than join(100..201,202..300) which is what my CompoundLocation code currently does. > So maybe some limitations that we could set are: > > 1. Only update the coordinates if slicing step is 1 (or -1), otherwise > discard it. Yep. > 2. Only update the coordinates if addition is between contiguous > coordinates, otherwise discard it. That does seem simple - especially as the primary driver for this is multiple sequence alignments and those only support simple continuous locations with a start and end. > Personally, I think this would cover most use cases for slicing while > allowing us to keep it simple. That is perhaps a good balance (and as a bonus means we don't have to link this to the CompoundLocation unless we want to). > As for the name, 'region' sounds better than 'location'. Maybe > 'coverage'? I don't have any strong preference between these, but > 'subregion' doesn't feel that nice. Region seems fine.
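The two limits agreed above (keep the coordinates only for a slicing step of 1 or -1, and only for addition of abutting regions) can be sketched with a small stand-in type. Everything here is hypothetical: Region is a plain namedtuple using Python counting rather than a FeatureLocation, the record is assumed to be ungapped, and flipping the strand on a step of -1 is one possible reading of the reverse-slicing question, not a settled design.

```python
from collections import namedtuple

# Python counting: zero-based, half-open; strand is +1, -1 or None.
Region = namedtuple("Region", "start end strand")


def slice_region(region, length, index):
    """Region for record[index], or None if it should be discarded."""
    if region is None or region.end - region.start != length:
        return None  # gapped or inconsistent records are out of scope
    picked = range(*index.indices(length))
    if index.step not in (None, 1, -1) or len(picked) == 0:
        return None  # rule 1: any other step loses the coordinates
    lo = min(picked[0], picked[-1])
    hi = max(picked[0], picked[-1]) + 1
    if region.strand == -1:
        # Record index 0 sits at the parent's end on the reverse strand.
        start, end = region.end - hi, region.end - lo
    else:
        start, end = region.start + lo, region.start + hi
    strand = region.strand
    if index.step == -1 and strand is not None:
        strand = -strand
    return Region(start, end, strand)


def add_regions(left, right):
    """Merged region for left + right, or None unless they abut."""
    if left is None or right is None or left.strand != right.strand:
        return None
    if left.strand == -1:
        if right.end != left.start:
            return None  # rule 2: not contiguous
        return Region(right.start, left.end, -1)
    if left.end != right.start:
        return None  # rule 2: not contiguous
    return Region(left.start, right.end, left.strand)


print(slice_region(Region(100, 110, 1), 10, slice(2, 5)))
print(add_regions(Region(100, 201, 1), Region(201, 300, 1)))
```

Returning None for the discarded cases keeps the rule easy to state: either the new coordinates are exact, or the record carries none.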
> Finally, for the coordinate system, I imagine it will use Python's > coordinate system, too? (zero-based, half-open, and the parsers / > writers should do the conversion). Yes. I'm suggesting using the FeatureLocation object (from Bio.SeqFeature), which does this. > Should we also reverse the > coordinates if the objects are sliced in reverse (e.g. > seqrecord[::-1]) or simply inverse the strand value but keep the > coordinates unchanged? The strand changes, and the start/end must also be recalculated from the length of the parent sequence. The FeatureLocation has a (private) _flip method to do this. In some cases we won't have the parent sequence length, so would have to drop the location. I'll have a go at implementing this on a branch in the next few hours (unless something more pressing comes up at the BioHackathon). As it happens this overlaps nicely with some of the group discussion about how to represent feature locations in RDF. Regards, Peter From p.j.a.cock at googlemail.com Thu Sep 6 03:21:16 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Sep 2012 08:21:16 +0100 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: On Thu, Sep 6, 2012 at 6:52 AM, João Rodrigues wrote: > >> It would also look less like Java code ;) >> >> I like this plan - but initially define and document the new properties, >> and deprecate the old get/set properties. Without that you'll break >> almost every PDB using script out there. > > How do I deprecate the old ones? Is there a DeprecationWarning or so? Yes, we use Bio.BiopythonDeprecationWarning rather than the default DeprecationWarning because the latter is now silent by default.
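For the property plan discussed here, keeping an old getter alive while flagging it could look roughly like this. The class is a hypothetical, much-reduced stand-in and not the real Bio.PDB Atom, and the warning class is a local substitute for Bio.BiopythonDeprecationWarning so that the sketch runs without Biopython installed.

```python
import warnings


class ExampleDeprecationWarning(Warning):
    """Local stand-in for Bio.BiopythonDeprecationWarning (assumption)."""


class ExampleAtom:
    """Hypothetical reduced atom class used only for illustration."""

    def __init__(self, name):
        self._name = name

    @property
    def name(self):
        """New-style read access."""
        return self._name

    def get_name(self):
        """Old-style getter, still working but flagged as deprecated."""
        warnings.warn(
            "get_name() is deprecated; use the name property instead.",
            ExampleDeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        return self.name


atom = ExampleAtom("CA")
print(atom.name)
```

Because the warning is raised inside the old method, scripts that only use the new property never see it.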
Grep the code for example usage, see also: http://biopython.org/wiki/Deprecation_policy Regards, Peter From p.j.a.cock at googlemail.com Thu Sep 6 05:36:41 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Sep 2012 10:36:41 +0100 Subject: [Biopython-dev] SeqRecord locations; was: Beta code in the official releases? In-Reply-To: References: Message-ID: On Thu, Sep 6, 2012 at 8:16 AM, Peter Cock wrote: > > I'll have a go at implementing this on a branch in the next > few hours (unless something more pressing comes up at > the BioHackathon). As it happens this overlaps nicely with > some of the group discussion about how to represent feature > locations in RDF. > I've made a start, will do more later: https://github.com/peterjc/biopython/tree/sr_loc Peter From mjldehoon at yahoo.com Thu Sep 6 06:13:38 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 6 Sep 2012 03:13:38 -0700 (PDT) Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: Message-ID: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com> --- On Thu, 9/6/12, Peter Cock wrote: > For any shared server, [...] you > would be forced into a mass change at one go. OK, for multiple users on a shared server I see your point. > Here's a further (and slightly more radical) idea: We > stick with using 'Bio' and the current mixed case > names on Python 2, but adopt 'bio' and other PEP8 > compatible names for Python 3 (as a uniform > strict automatic rule: mixed case -> lower case)? > i.e. Do this as part of our 2to3 process. The Python developers argue against combining a switch to Python 3 with other major changes, since then if bugs arise it is unclear if it is due to the switch to Python 3 or due to the other changes. But perhaps it's OK if we have one Bio.* version for Python 2 and one bio.* version for Python 3 that are otherwise completely identical to each other. 
> How about over the next year we gradually consolidate > modules under the existing mixed case names? e.g. > move Bio.AlignIO functionality and Bio.Align, I guess you meant "merge Bio.AlignIO functionality into Bio.Align". > and Bio.Seq* under Bio.Seq (leaving backwards compatible > imports supported but deprecated). Sounds good to me. AFAIAC, we don't need to do this gradually over the next year. May as well do it for the next release. -Michiel. From anaryin at gmail.com Thu Sep 6 09:48:51 2012 From: anaryin at gmail.com (João Rodrigues) Date: Thu, 6 Sep 2012 16:48:51 +0300 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: Ok, thanks. The modules are littered with set/get methods and adding DeprecationWarning to all of them might be a bit too much.. Instead, should we add one single warning at the top of the PDBParser, since this is the only obligatory module for Bio.PDB so that everyone gets the warning message once and once only? Otherwise I can imagine several warnings popping up everywhere.. Cheers, João From eric.talevich at gmail.com Thu Sep 6 10:17:03 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 6 Sep 2012 10:17:03 -0400 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: On Thu, Sep 6, 2012 at 1:52 AM, João Rodrigues wrote: > > What Bio.PDB does right now is rely on the list to iterate over things. > Thus, you get the order in which you read the PDB file. However, if you > sort it using the several Objects sort method you will get the following > rules: > > Atom.py - N CA C O first, then alphabetically > Residue.py - First aminoacids and nucleic acids, then heteroatoms. > Chain.py - Empty chains last. > > These are already in place somewhere in the code.
I just used them to > overload the __cmp__ method, with a couple of additions because I > personally disagree with the following: > > Atom.py - Inorganic atoms should come out last. For simplicity. > Residue.py - If the PDB order is 151 MSE, 152 VAL, 153 CYS, you should get > in return when you iterate: 151, 152, 153. Right now you get 152, 153, 151. > PDB files already have weird large numbers for water and ions for example, > so these come out last anyway. Pushing all HETATMs to the end will > sometimes disrupt the "natural" order of things, for instance modified > residues. Magic perhaps :) > > Here's another edge case to think about: 3BEG. The enzyme is chain A, starting from residue number 69; the substrate peptide is chain B; and then after listing the atoms for chain B they jump back to chain A and add the three ligands as individual residues, with residue numbers 1, 2 and 3, on HETATM lines. The current PDBParser complains about this structure but parses it so that the extra HETATM residues are at the end of chain A's child_list. If I were to try to generate a polypeptide sequence from each of the chains in this structure, I think I'd want to just ignore the three extra residues, rather than list them as the first three residues of the peptide as "SAX". How do you think this should be handled? Maybe treat in-sequence modified residues differently from out-of-sequence HETATMs? -E From eric.talevich at gmail.com Thu Sep 6 10:40:13 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 6 Sep 2012 10:40:13 -0400 Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com> References: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com> Message-ID: On Thu, Sep 6, 2012 at 6:13 AM, Michiel de Hoon wrote: > --- On Thu, 9/6/12, Peter Cock wrote: > > For any shared server, [...] you > > would be forced into a mass change at one go. 
> > OK, for multiple users on a shared server I see your point. True, and old scripts/pipelines have a way of sticking around, especially once they've been shared with others in the lab. > Here's a further (and slightly more radical) idea: We > > stick with using 'Bio' and the current mixed case > > names on Python 2, but adopt 'bio' and other PEP8 > > compatible names for Python 3 (as a uniform > > strict automatic rule: mixed case -> lower case)? > > i.e. Do this as part of our 2to3 process. > > The Python developers argue against combining a switch to Python 3 with > other major changes, since then if bugs arise it is unclear if it is due to > the switch to Python 3 or due to the other changes. But perhaps it's OK if > we have one Bio.* version for Python 2 and one bio.* version for Python 3 > that are otherwise completely identical to each other. > Agreed, since the bio.* version is generated by the 2to3 script it should still be easy enough to distinguish "this is a bug in the library" from "this is a problem with Py3, 2to3 or your environment". The extra separation on the filesystem provided by Py2/Py3 should also prevent some problems with case-insensitivity and the environment. > > How about over the next year we gradually consolidate > > modules under the existing mixed case names? e.g. > > move Bio.AlignIO functionality and Bio.Align, > > I guess you meant "merge Bio.AlignIO functionality into Bio.Align". > > > and Bio.Seq* under Bio.Seq (leaving backwards compatible > > imports supported but deprecated). > > Sounds good to me. AFAIAC, we don't need to do this gradually over the > next year. May as well do it for the next release. > > Doing this in a single release might be better, so we can document/remember the release number when the Grand Reshuffling took place and troubleshoot users' resulting problems more easily. Should we call that Biopython 2.0.0 and switch to semantic version numbers? 
From anaryin at gmail.com Thu Sep 6 10:51:11 2012 From: anaryin at gmail.com (João Rodrigues) Date: Thu, 6 Sep 2012 17:51:11 +0300 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: Well... :) If this is what the authors put in.. well, that's just it. The parser should not be an interpreter. However, when building peptides, you should get two peptides: the ALA-SEP, and the protein chain A. And I think this is what you will get. Also, the fact that they are heteroatoms is already a good filter if you want them out of the equation. From p.j.a.cock at googlemail.com Thu Sep 6 21:01:04 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 7 Sep 2012 02:01:04 +0100 Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: References: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com> Message-ID: On Thu, Sep 6, 2012 at 3:40 PM, Eric Talevich wrote: > On Thu, Sep 6, 2012 at 6:13 AM, Michiel de Hoon wrote: >> --- On Thu, 9/6/12, Peter Cock wrote: >> > Here's a further (and slightly more radical) idea: We >> > stick with using 'Bio' and the current mixed case >> > names on Python 2, but adopt 'bio' and other PEP8 >> > compatible names for Python 3 (as a uniform >> > strict automatic rule: mixed case -> lower case)? >> > i.e. Do this as part of our 2to3 process. >> >> The Python developers argue against combining a switch to Python 3 with >> other major changes, since then if bugs arise it is unclear if it is due to >> the switch to Python 3 or due to the other changes. But perhaps it's OK if >> we have one Bio.* version for Python 2 and one bio.* version for Python 3 >> that are otherwise completely identical to each other. > > > Agreed, since the bio.* version is generated by the 2to3 script it should > still be easy enough to distinguish "this is a bug in the library" from > "this is a problem with Py3, 2to3 or your environment".
The extra separation > on the filesystem provided by Py2/Py3 should also prevent some problems with > case-insensitivity and the environment. Yes - they would be in different site-packages folders, and since we have a tiny Python 3 install base, moving them from Bio to bio seems low impact. I guess we need to have a little hack with the 2to3 library and try defining our own custom fixer for the imports... Note this case difference will slightly complicate our documentation - but that is always going to be an issue for the Python 2 to 3 move. >> >> > How about over the next year we gradually consolidate >> > modules under the existing mixed case names? e.g. >> > move Bio.AlignIO functionality and Bio.Align, >> >> I guess you meant "merge Bio.AlignIO functionality into Bio.Align". Yes, sorry. >> > and Bio.Seq* under Bio.Seq (leaving backwards compatible >> > imports supported but deprecated). >> >> Sounds good to me. AFAIAC, we don't need to do this gradually >> over the next year. May as well do it for the next release. > > Doing this in a single release might be better, so we can document/remember > the release number when the Grand Reshuffling took place and troubleshoot > users' resulting problems more easily. Doing it in one release makes sense - but we can do it gradually in a series of self-contained commits - and feel our way. Michiel - do you want to start with the Bio/Seq.py to Bio/Seq/__init__.py change? We'll need to do that before any consolidation steps. > Should we call that Biopython 2.0.0 and switch to semantic version numbers? > Maybe... at some point a Biopython 2 would be a good excuse for some publicity and another application note. The eventual move from developing under Python 2 (and using 2to3 for Python 3) to natively developing under Python 3 would be an excuse for a major version bump. 
Peter From p.j.a.cock at googlemail.com Thu Sep 6 21:03:22 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 7 Sep 2012 02:03:22 +0100 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: On Thu, Sep 6, 2012 at 2:48 PM, João Rodrigues wrote: > Ok, thanks. > > The modules are littered with set/get methods and adding DeprecationWarning > to all of them might be a bit too much.. Instead, should we add one single > warning at the top of the PDBParser, since this is the only obligatory > module for Bio.PDB so that everyone gets the warning message once and once > only? Otherwise I can imagine several warnings popping up everywhere.. If you use the exact same message, then I think you'll only see the warning once. Try it with a couple of the get/set methods to confirm. Having the warning happen even if you don't use the get/set seems wrong. Peter From anaryin at gmail.com Fri Sep 7 03:21:56 2012 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 7 Sep 2012 10:21:56 +0300 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: Likely true. I'm writing a txt file with the changes. I don't think they can be merged easily without breaking a lot of stuff, in particular the removal of child_list. Therefore, I suggest we write a few deprecation warnings here and there where affected by the consensual changes we agree on and give a few releases before we actually merge them. Also, once I'm happy with the changes, I'll make a new branch to allow 'beta testing' by anyone who wants and write a wiki page on it. Cheers, João On 7 Sep 2012 04:03, "Peter Cock" wrote: > On Thu, Sep 6, 2012 at 2:48 PM, João Rodrigues wrote: > > Ok, thanks. > > > > The modules are littered with set/get methods and adding > DeprecationWarning > > to all of them might be a bit too much.. 
Instead, should we add one > single > > warning at the top of the PDBParser, since this is the only obligatory > > module for Bio.PDB so that everyone gets the warning message once and > once > > only? Otherwise I can imagine several warnings popping up everywhere.. > > If you use the exact same message, then I think you'll only see the > warning once. Try it with a couple of the get/set methods to confirm. > > Having the warning happen even if you don't use the get/set seems > wrong. > > Peter > From mjldehoon at yahoo.com Sun Sep 9 03:31:05 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 9 Sep 2012 00:31:05 -0700 (PDT) Subject: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif Message-ID: <1347175865.35152.YahooMailClassic@web164003.mail.gq1.yahoo.com> Returning to a previous discussion... [Michiel:] > ..., currently Bio.Motif._Motif.Motif objects also perform > functions that are more appropriate for a separate PWM > (position-weight matrix) class within Bio.Motif. It may be > a good idea to have a separate PWM class for this functionality. [Bartek:] > I'm not sure. I think it is valuable to be able to load > instances from a file and then convert them to a PWM. > It could be done with separate classes, > but I'm not sure it would be easier then... I think there is one confusing issue here. The current .pwm() method of a Motif object doesn't calculate a position-weight matrix but only normalizes the counts matrix to create a probability matrix. To calculate a PWM, we would have to calculate the logarithm of these probabilities divided by the corresponding background probabilities (for which in Bio.Motif we are currently using the log_odds method). So I was mainly thinking of a PWM class to represent what is currently being returned by the log_odds method. This allows users to create a PWM from the log-odds scores directly instead of from an alignment (for example, if the PWM is available from some publication but not the actual alignments). 
Also this avoids some confusion with regard to which methods operate on which object. For example, currently we have motif.scanPWM and motif.score_hit that actually operate on the log-odds matrix; motif.anticonsensus, motif.consensus and motif[:] use the probability matrix; and motif.max_score and motif.min_score use the log-odds matrix to evaluate the score of motif.consensus and motif.anticonsensus, which were calculated using the probability matrix (and therefore don't necessarily return the maximum and minimum score).

So I would suggest to keep the various types of matrices explicit; something along these lines:

>>> motif = Motif.read(...)
>>> counts = motif.counts
# .counts is a property of motif
# counts is an instance of the Motif.FrequencyMatrix class
# you can also make a FrequencyMatrix object directly from
# the frequencies, as in
>>> counts = Motif.FrequencyMatrix(my_frequency_matrix)
>>> counts[2,:]
array([1.0, 4.0, 3.0, 2.0])
# indices refer explicitly to the counts matrix
>>> counts[2,'G']
3.0

>>> my_consensus_sequence = counts.consensus
# .consensus is a property of counts
>>> my_anticonsensus_sequence = counts.anticonsensus
# .anticonsensus is a property of counts

>>> my_probability_matrix = counts.normalize()
# this can be a numpy array, or a Motif.ProbabilityMatrix
# class that inherits from a numpy array
>>> my_probability_matrix[2,:]
array([0.1, 0.4, 0.3, 0.2])
# indices refer explicitly to the probability matrix

>>> pwm = counts.make_pwm(...)
# or pwm = motif.PositionWeightMatrix(my_matrix)
>>> pwm[0,:]
array([ -2.3, 0.1, 1.2, 1.8])
>>> pwm[0,2]
1.2
>>> pwm[0,'C']
0.1
# indices explicitly refer to the pwm

>>> scores = pwm.scan(sequence)
>>> score = pwm.score(sequence)

Does that sound reasonable? Any comments, suggestions?

Best,
-Michiel.
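The distinction drawn above between the counts, probability and log-odds matrices can be illustrated with a small standalone sketch. This is not the Bio.Motif API: the function names (normalize, log_odds, best_sequence), the dictionary-per-column layout and the skewed background are all assumptions made up for the illustration. It shows why a consensus taken from the probability matrix need not be the sequence with the maximum log-odds score once the background is non-uniform.

```python
import math

ALPHABET = "ACGT"


def normalize(counts):
    """Turn per-position counts into per-position probabilities."""
    probabilities = []
    for column in counts:
        total = sum(column[letter] for letter in ALPHABET)
        probabilities.append(
            dict((letter, column[letter] / total) for letter in ALPHABET)
        )
    return probabilities


def log_odds(probabilities, background):
    """Log2 of probability over background, one dict per position.

    Assumes every probability is non-zero (no pseudocounts are applied).
    """
    return [
        dict(
            (letter, math.log(column[letter] / background[letter], 2))
            for letter in ALPHABET
        )
        for column in probabilities
    ]


def best_sequence(matrix):
    """Pick the highest-valued letter at each position of any matrix."""
    return "".join(
        max(ALPHABET, key=lambda letter: column[letter]) for column in matrix
    )


# Hypothetical one-column motif and a deliberately skewed background.
counts = [{"A": 4.0, "C": 3.0, "G": 2.0, "T": 1.0}]
background = {"A": 0.5, "C": 0.1, "G": 0.2, "T": 0.2}

probabilities = normalize(counts)
pwm = log_odds(probabilities, background)

consensus = best_sequence(probabilities)  # based on the probability matrix
top_scoring = best_sequence(pwm)          # based on the log-odds matrix
print(consensus, top_scoring)
```

Here the most frequent letter is A, but A is also common in the background, so its log-odds score is negative and C scores highest. Keeping the matrices as separate objects makes it explicit which of the two answers a given method returns.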
From bartek at rezolwenta.eu.org Mon Sep 10 03:12:59 2012 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Mon, 10 Sep 2012 09:12:59 +0200 Subject: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif In-Reply-To: <1347175865.35152.YahooMailClassic@web164003.mail.gq1.yahoo.com> References: <1347175865.35152.YahooMailClassic@web164003.mail.gq1.yahoo.com> Message-ID: Hi, I think it is an idea worth discussing a little bit more. Thanks for bringing it up Michiel. It captures at least some of the issues caused by the fact that different motifs might be internally represented differently. I'm not sure I'm all excited about having to deal with explicit extra classes for PWMs and aligned instances, but maybe this is the price for having a clear separation of where certain things are calculated. The issue I think still needs discussion is where is the searching done? If I want to search for instances, do I do it from the PWM object?, This seems to be the natural idea, but then can we find a nice interface for people who don't want to be bothered with too complicated interfaces? I'll try to come up with a more thought through and longer response later in the week... best Bartek On Sun, Sep 9, 2012 at 9:31 AM, Michiel de Hoon wrote: > Returning to a previous discussion... > > [Michiel:] >> ..., currently Bio.Motif._Motif.Motif objects also perform >> functions that are more appropriate for a separate PWM >> (position-weight matrix) class within Bio.Motif. It may be >> a good idea to have a separate PWM class for this functionality. > > [Bartek:] >> I'm not sure. I think it is valuable to be able to load >> instances from a file and then convert them to a PWM. >> It could be done with separate classes, >> but I'm not sure it would be easier then... > > I think there is one confusing issue here. > The current .pwm() method of a Motif object doesn't calculate a position-weight matrix but only normalizes the counts matrix to create a probability matrix. 
To calculate a PWM, we would have to calculate the logarithm of these probabilities divided by the corresponding background probabilities (for which in Bio.Motif we are currently using the log_odds method). > > So I was mainly thinking of a PWM class to represent what is currently being returned by the log_odds method. This allows users to create a PWM from the log-odds scores directly instead of from an alignment (for example, if the PWM is available from some publication but not the actual alignments). > Also this avoids some confusion with regard to which methods operate on which object. For example, currently we have motif.scanPWM and motif.score_hit that actually operate on the log-odds matrix, > motif.anticonsensus, motif.consensus, motif[:] uses the probability matrix, and motif.max_score and motif.min_score use the log-odds matrix to evaluate the score of motif.consensus, motif.anticonsensus which were calculated using the probablity matrix (and therefore don't necessarily return the maximum and minimum score). > > So I would suggest to keep the various types of matrices explicit; something along these lines: > >>>> motif = Motif.read(...) 
>>>> counts = motif.counts > # .counts is a property of motif > # counts is an instance of the Motif.FrequencyMatrix class > # you can also make a FrequencyMatrix object directly from > # the frequencies, as in >>>> counts = Motif.FrequencyMatrix(my_frequency_matrix) >>>> counts[2,:] > array([1.0, 4.0, 3.0, 2.0]) > # indices refer explicitly to the counts matrix >>>> counts[2,'G'] > 3.0 > >>>> my_consensus_sequence = counts.consensus > # .consensus is a property of counts >>>> my_anticonsensus_sequence = counts.anticonsensus > # .anticonsensus is a property of counts > >>>> my_probability_matrix = counts.normalize() > # this can be a numpy array, or a Motif.ProbabilityMatrix > # class that inherits from a numpy array >>>> my_probability_matrix[2,:] > array([0.1, 0.4, 0.3, 0.2]) > # indices refer explicitly to the probability matrix > >>>> pwm = counts.make_pwm(...) > # or pwm = motif.PositionWeightMatrix(my_matrix) >>>> pwm[0,:] > array([ -2.3, 0.1, 1.2, 1.8]) >>>> pwm[0,2] > 1.2 >>>> pwm[0,'C'] > 0.1 > # indices explicitly refer to the pwm > >>>> scores = pwm.scan(sequence) >>>> score = pwm.score(sequence) > > > Does that sound reasonable? Any comments, suggestions? > > Best, > -Michiel. > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Bartek Wilczynski From p.j.a.cock at googlemail.com Mon Sep 10 04:39:30 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 10 Sep 2012 09:39:30 +0100 Subject: [Biopython-dev] Most buildbot slaves down Message-ID: Hi all, For those of you actively monitoring the nightly BuildBot for Biopython and/or BioRuby, all the buildslaves at my institute are currently effectively offline. A new stricter firewall policy was introduced last week while I was away. I hope we'll have the necessary outgoing ports opened again soon. 
In the meantime, additional buildslaves hosted elsewhere would be very useful. The machines need to be online and are typically only used once every 24 hours for the scheduled builds. Non-Linux machines are particularly important for cross-platform testing (while for Linux the TravisCI testing seems to be working nicely overall). Any volunteers? Thanks, Peter From tiagoantao at gmail.com Mon Sep 10 04:50:41 2012 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 10 Sep 2012 09:50:41 +0100 Subject: [Biopython-dev] [BioRuby] Most buildbot slaves down In-Reply-To: References: Message-ID: Hi, Not much helpful in the non-linux front, but I noticed that my machine was down for some reason, restarted it and it is doing at least a few of the builds. Tiago On Mon, Sep 10, 2012 at 9:39 AM, Peter Cock wrote: > Hi all, > > For those of you actively monitoring the nightly BuildBot > for Biopython and/or BioRuby, all the buildslaves at my > institute are currently effectively offline. A new stricter > firewall policy was introduced last week while I was away. > I hope we'll have the necessary outgoing ports opened > again soon. > > In the meantime, additional buildslaves hosted elsewhere > would be very useful. The machines need to be online > and are typically only used once every 24 hours for the > scheduled builds. Non-Linux machines are particularly > important for cross-platform testing (while for Linux > the TravisCI testing seems to be working nicely overall). > > Any volunteers? 
> > Thanks, > > Peter > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby -- "Liberty for wolves is death to the lambs" - Isaiah Berlin From redmine at redmine.open-bio.org Thu Sep 13 22:23:53 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 14 Sep 2012 02:23:53 +0000 Subject: [Biopython-dev] [Biopython - Bug #3384] (New) Installation fails with pip-3.2 Message-ID: Issue #3384 has been reported by Roy Crihfield. ---------------------------------------- Bug #3384: Installation fails with pip-3.2 https://redmine.open-bio.org/issues/3384 Author: Roy Crihfield Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Linux 3.5.3-1-ARCH x86_64 GNU/Linux Python 3.2.3 Bio.__version__ == '1.60' Installation fails with with pip 1.2: $ sudo pip-3.2 install biopython : : Converting build/py3.2/Doc/examples/fasta_dictionary.py Converting build/py3.2/Doc/examples/nmr/simplepredict.py Python 2to3 processing done. 
running egg_info error: error in 'egg_base' option: 'pip-egg-info' does not exist or is not a directory ---------------------------------------- Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython Exception information: Traceback (most recent call last): File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/basecommand.py", line 106, in main status = self.run(options, args) File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/commands/install.py", line 256, in run requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle) File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 1042, in prepare_files req_to_install.run_egg_info() File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 236, in run_egg_info command_desc='python setup.py egg_info') File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/util.py", line 612, in call_subprocess % (command_desc, proc.returncode, cwd)) pip.exceptions.InstallationError: Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. 
To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Fri Sep 14 04:46:08 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 14 Sep 2012 08:46:08 +0000 Subject: [Biopython-dev] [Biopython - Bug #3384] Installation fails with pip-3.2 References: Message-ID: Issue #3384 has been updated by Peter Cock. Does the standard install mechanism work on your machine? i.e. python3.2 setup.py build python3.2 setup.py test sudo python3.2 setup.py install If you want to investigate the pip error, there is a possible workaround developed by NumPy (who also use 2to3 in a similar way to us), see http://projects.scipy.org/numpy/ticket/1857 Thanks ---------------------------------------- Bug #3384: Installation fails with pip-3.2 https://redmine.open-bio.org/issues/3384 Author: Roy Crihfield Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Linux 3.5.3-1-ARCH x86_64 GNU/Linux Python 3.2.3 Bio.__version__ == '1.60' Installation fails with with pip 1.2: $ sudo pip-3.2 install biopython : : Converting build/py3.2/Doc/examples/fasta_dictionary.py Converting build/py3.2/Doc/examples/nmr/simplepredict.py Python 2to3 processing done. 
running egg_info error: error in 'egg_base' option: 'pip-egg-info' does not exist or is not a directory ---------------------------------------- Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython Exception information: Traceback (most recent call last): File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/basecommand.py", line 106, in main status = self.run(options, args) File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/commands/install.py", line 256, in run requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle) File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 1042, in prepare_files req_to_install.run_egg_info() File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 236, in run_egg_info command_desc='python setup.py egg_info') File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/util.py", line 612, in call_subprocess % (command_desc, proc.returncode, cwd)) pip.exceptions.InstallationError: Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Fri Sep 14 21:57:53 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sat, 15 Sep 2012 01:57:53 +0000 Subject: [Biopython-dev] [Biopython - Bug #3384] Installation fails with pip-3.2 References: Message-ID: Issue #3384 has been updated by Roy Crihfield. Yes, installing manually works. I found that hack but was hoping there would be a better solution, or support for pip planned for the future. 
---------------------------------------- Bug #3384: Installation fails with pip-3.2 https://redmine.open-bio.org/issues/3384 Author: Roy Crihfield Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Linux 3.5.3-1-ARCH x86_64 GNU/Linux Python 3.2.3 Bio.__version__ == '1.60' Installation fails with with pip 1.2: $ sudo pip-3.2 install biopython : : Converting build/py3.2/Doc/examples/fasta_dictionary.py Converting build/py3.2/Doc/examples/nmr/simplepredict.py Python 2to3 processing done. running egg_info error: error in 'egg_base' option: 'pip-egg-info' does not exist or is not a directory ---------------------------------------- Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython Exception information: Traceback (most recent call last): File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/basecommand.py", line 106, in main status = self.run(options, args) File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/commands/install.py", line 256, in run requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle) File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 1042, in prepare_files req_to_install.run_egg_info() File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 236, in run_egg_info command_desc='python setup.py egg_info') File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/util.py", line 612, in call_subprocess % (command_desc, proc.returncode, cwd)) pip.exceptions.InstallationError: Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython -- You have received this notification because you have either subscribed to it, or are involved in it. 
To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sat Sep 15 17:29:29 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sat, 15 Sep 2012 21:29:29 +0000 Subject: [Biopython-dev] [Biopython - Bug #3340] Example using Bio.Clustalw in Tutorial References: Message-ID: Issue #3340 has been updated by Grace Yeo. I've submitted a pull request for this here: https://github.com/biopython/biopython/pull/71 ---------------------------------------- Bug #3340: Example using Bio.Clustalw in Tutorial https://redmine.open-bio.org/issues/3340 Author: Peter Cock Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The module Bio.Clustalw was deprecated and removed, yet is still used in the Tutorial's 'Creating your own substitution matrix from an alignment' example. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Sun Sep 16 08:34:31 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 16 Sep 2012 13:34:31 +0100 Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: References: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com> Message-ID: On Fri, Sep 7, 2012 at 2:01 AM, Peter Cock wrote: > On Thu, Sep 6, 2012 at 3:40 PM, Eric Talevich wrote: >> On Thu, Sep 6, 2012 at 6:13 AM, Michiel de Hoon wrote: >>> --- On Thu, 9/6/12, Peter Cock wrote: >>> > Here's a further (and slightly more radical) idea: We >>> > stick with using 'Bio' and the current mixed case >>> > names on Python 2, but adopt 'bio' and other PEP8 >>> > compatible names for Python 3 (as a uniform >>> > strict automatic rule: mixed case -> lower case)? >>> > i.e. Do this as part of our 2to3 process. 
>>> >>> The Python developers argue against combining a switch to Python 3 with >>> other major changes, since then if bugs arise it is unclear if it is due to >>> the switch to Python 3 or due to the other changes. But perhaps it's OK if >>> we have one Bio.* version for Python 2 and one bio.* version for Python 3 >>> that are otherwise completely identical to each other. >> >> >> Agreed, since the bio.* version is generated by the 2to3 script it should >> still be easy enough to distinguish "this is a bug in the library" from >> "this is a problem with Py3, 2to3 or your environment". The extra separation >> on the filesystem provided by Py2/Py3 should also prevent some problems with >> case-insensitivity and the environment. > > Yes - they would be in different site-packages folders, and since > we have a tiny Python 3 install base, moving them from Bio to > bio seems low impact. > > I guess we need to have a little hack with the 2to3 library and > try defining our own custom fixer for the imports... > > Note this case difference will slightly complicate our documentation - > but that is always going to be an issue for the Python 2 to 3 move. > I've made a start at this - the easy part seems to work :) https://github.com/peterjc/biopython/commits/py3lower The hard bit will be fixing all the import lines... ;) Peter From k.d.murray.91 at gmail.com Thu Sep 20 00:28:08 2012 From: k.d.murray.91 at gmail.com (Kevin Murray) Date: Thu, 20 Sep 2012 14:28:08 +1000 Subject: [Biopython-dev] TAIR/AGI support In-Reply-To: <87txvcx9ls.fsf@fastmail.fm> References: <87txvcx9ls.fsf@fastmail.fm> Message-ID: Hi Brad, My TAIR/AGI script is on github here: https://github.com/kdmurray91/biopython/blob/master/Bio/TAIR/__init__.py I got it to work directly from TAIR's website, however it has not been rigorously tested. 
I plan on implementing the process as I described in my previous email, whereby it fetches the GenBank record from TogoWS or via NCBI's Efetch (using Biopython's interfaces of course). I will keep you all posted. To the list in general, I'm open to suggestions on what to work on next. Regards Kevin Murray On 6 September 2012 10:45, Brad Chapman wrote: > > Kevin; > Thanks for the e-mail and offers of code. Always happy to have other > folks involved with the project. > > > What's the status of TAIR AGIs in BioPython (I can see no mention of > them, > > or support for them)? I've written a brief module which allows a user to > > query NCBI with a TAIR AGI, returning a Seq object (via Efetch). Is there > > any interest in including such functionality in BioPython? > > Is the code available on GitHub to get a better sense of all the > functionality it supports? Do you have an idea where it would fit best? > As a tair submodule inside of Bio.Entrez, or somewhere else? > > > More generally, are there any particular areas of BioPython development > > which could use an extra pair of hands? > > Following the mailing list for discussions on current projects is the > best way to get a sense of what different folks are working on. The > issue tracker also has open issues and features that could use attention > if anything there strikes your fancy: > > https://redmine.open-bio.org/projects/biopython > > Hope this helps, > Brad > > From p.j.a.cock at googlemail.com Thu Sep 20 05:08:58 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 20 Sep 2012 10:08:58 +0100 Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: References: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com> Message-ID: On Sun, Sep 16, 2012 at 1:34 PM, Peter Cock wrote: >> >> I guess we need to have a little hack with the 2to3 library and >> try defining our own custom fixer for the imports... 
>> >> Note this case difference will slightly complicate our documentation - >> but that is always going to be an issue for the Python 2 to 3 move. >> > > I've made a start at this - the easy part seems to work :) > > https://github.com/peterjc/biopython/commits/py3lower > > The hard bit will be fixing all the import lines... ;) > > Peter Progress - but slow. I think this will work with a bit more time spent on it. With hindsight I'd have made more effort to try and reuse lib2to3, but the documentation is sketchy and they do warn it is liable to change between releases. What I've got instead is a pattern matching script which line-by-line spots imports & updates them, and also notes what knock on changes must be made later in the file. It is also aware of and updates doctest examples. e.g. from Bio import SeqIO record = SeqIO.read("my_chr.gbk", "genbank") becomes: from bio import seqIO record = seqIO.read("my_chr.gbk", "genbank") In the process I've spotted some minor style issues and some quote mistakes in the code base which I have fixed on the main branch as well, e.g. https://github.com/biopython/biopython/commit/b396844401da8b5c5ed1f7f13d69622a6ad0c0cd https://github.com/biopython/biopython/commit/165e2b8da445250f070c3860c9082ff6a0c919e0 I also reformatted a few import lines to make processing them easier - and arguably easier to read too: https://github.com/biopython/biopython/commit/f6940e8a4fcf056fa725225ede5e848c5d6f4fd6 One slightly more complicated issue with lower case module names is we get clashes in some code with existing variable or argument names. This seems particularly common with seq, alphabet and motif. Most of these fixes for this are on the experimental branch. In some cases I've opted to change the import, e.g. from Bio import Alphabet to: from Bio import Alphabet as _alphabet This seemed simplest to avoid changing argument names in functions/methods. 
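The line-by-line approach described above (spot an import, rewrite it, then apply the knock-on renames later in the file, including in doctest lines) might look roughly like the following sketch. It is not the script on the py3lower branch: the regular expression, the function name lower_case_imports and the plain str.lower() mapping (so SeqIO becomes seqio here) are assumptions for illustration, and it only handles the simple "from Bio... import module" form, treating every imported name as a module.

```python
import re

# Hypothetical pattern: an optional doctest prompt, then
# "from Bio[.Sub] import Name" with a single imported name.
FROM_IMPORT = re.compile(
    r"^(\s*(?:>>> )?)from (Bio(?:\.\w+)*) import (\w+)\s*$"
)


def lower_case_imports(source):
    """Lower-case Bio imports and later uses of the imported names."""
    renamed = {}  # old module name -> new lower case name
    output = []
    for line in source.splitlines():
        match = FROM_IMPORT.match(line)
        if match:
            prefix, package, name = match.groups()
            renamed[name] = name.lower()
            line = "%sfrom %s import %s" % (
                prefix, package.lower(), name.lower()
            )
        else:
            # Knock-on change: rewrite attribute access on renamed modules.
            for old, new in renamed.items():
                line = re.sub(r"\b%s\." % re.escape(old), new + ".", line)
        output.append(line)
    return "\n".join(output)


example = (
    "from Bio import SeqIO\n"
    'record = SeqIO.read("my_chr.gbk", "genbank")\n'
    ">>> from Bio import AlignIO\n"
    '>>> align = AlignIO.read("my.aln", "clustal")'
)
converted = lower_case_imports(example)
print(converted)
```

A real converter would also need to cope with "import Bio.X", multi-name imports, "as" aliases and imported classes (which should keep their case), which is presumably where most of the effort goes.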
I'll continue to work on this as time allows - right now the code is due for a refactoring (e.g. avoid code duplication where I handle doctests), and would benefit from some self-tests. But the message remains: This should work :) Peter From yhtgrace at gmail.com Fri Sep 21 12:57:19 2012 From: yhtgrace at gmail.com (Hui Ting Grace Yeo) Date: Fri, 21 Sep 2012 12:57:19 -0400 Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices Message-ID: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com> Hey everyone, I'm working on this bug here https://redmine.open-bio.org/issues/3340 and I've updated the example in the tutorial (on substitution matrices, 17.4.2) using Bio.AlignIO on github here https://github.com/yhtgrace/biopython/tree/clustalw-alignIO-replace. I'm able to reproduce the dictionary replace_info, but when I go on to finish the example, I get the following log odds matrix:

D   2
E  -1   1
H  -5  -4   3
K -10  -5  -4   1
R  -4  -8  -4  -2   2
    D   E   H   K   R

which is different from the one given in the tutorial. I'm wondering if I've missed something. Thanks! 
Grace Yeo From p.j.a.cock at googlemail.com Mon Sep 24 04:53:07 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 24 Sep 2012 09:53:07 +0100 Subject: [Biopython-dev] ColorSpiral for Bio.Graphics Message-ID: Hello all, Last week Leighton was doing some work with Biopython and GenomeDiagram using the cross-links functionality we worked on for Biopython 1.59, which I described here: http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/ As you may have noticed via Twitter or his blog, Leighton has generated an enormous (5m by 1m) PDF poster printout comparing 29 bacterial genomes: http://armchairbiology.blogspot.co.uk/2012/09/the-colours-man-colours.html As he describes on his blog post, this required generating arbitrary color sets, with the option of adding some noise (or jitter as he called it) to make neighbouring colours visually distinct (rather than the more typical requirement of a smooth value to color mapping). His code to do that is now on this branch (with a minor bug fix and a few more docstrings added), ready for possible merging into Biopython: https://github.com/peterjc/biopython/tree/colorspiral Does this seem like a sensible addition to Bio.Graphics? Does anyone have any thoughts on the namespace Bio.Graphics.ColorSpiral given it defines an object ColorSpiral? Might a Bio.Graphics.Colors be useful? (If as discussed on the other thread we move to lower case module names for Python 3, this namespace clash also present in many other Biopython modules goes away): http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009934.html Regards, Peter From p.j.a.cock at googlemail.com Tue Sep 25 12:00:45 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Sep 2012 17:00:45 +0100 Subject: [Biopython-dev] [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? 
In-Reply-To: <5061C20F.7040209@stats.ox.ac.uk>
References: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com>
 <506194F6.9000103@fold.natur.cuni.cz> <5061C20F.7040209@stats.ox.ac.uk>
Message-ID:

On Tue, Sep 25, 2012 at 3:39 PM, Tanya Golubchik wrote:
> Hello,
>
> Apologies for not having followed the entire discussion, but just wanted
> to say that we're also using NCBIXML here and are likely to be
> incorporating it in a new piece of software soon, so it would be really
> unfortunate if some tags disappeared, were renamed or (even worse)
> changed meaning in future releases.
>
> I'm a bit late coming in here so maybe this has been answered, but is
> there a better parser that should be used at the moment? I was under the
> impression that NCBIXML is the only one.
>
> Thanks,
> Tanya

Hi Tanya,

I hope I can reassure you there is nothing to worry about :)

Right now there is only the NCBIXML parser, and we're not
going to change it (except possibly to make it a little faster
if people want to work on that).

We're planning to add a new module based on Bow's GSoC
code, under the working name SearchIO, which would cover
BLAST, BLAT, HMMER, etc. This would have a different API
and in the long term would probably replace all of Bio.Blast.
http://biopython.org/wiki/SearchIO

The discussion about possible changes has been (I think)
only about this new code (and would have been better off
on the development mailing list but this thread went off on
a slight tangent).

Once 'SearchIO' is released, we'd want to encourage people
to use that instead of NCBIXML, with a view to deprecating
and eventually removing NCBIXML.
See: http://biopython.org/wiki/Deprecation_policy Regards, Peter From p.j.a.cock at googlemail.com Thu Sep 27 09:01:44 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Sep 2012 14:01:44 +0100 Subject: [Biopython-dev] ColorSpiral for Bio.Graphics In-Reply-To: References: Message-ID: On Mon, Sep 24, 2012 at 9:53 AM, Peter Cock wrote: > Hello all, > > Last week Leighton was doing some work with Biopython > and GenomeDiagram using the cross-links functionality > we worked on for Biopython 1.59, which I described here: > http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/ > > As you may have noticed via Twitter or his blog, Leighton has > generated an enormous (5m by 1m) PDF poster printout > comparing 29 bacterial genomes: > http://armchairbiology.blogspot.co.uk/2012/09/the-colours-man-colours.html > > As he describes on his blog post, this required generating > arbitrary color sets, with the option of adding some noise > (or jitter as he called it) to make neighbouring colours > visually distinct (rather than the more typical requirement > of a smooth value to color mapping). > > His code to do that is now on this branch (with a minor > bug fix and a few more docstrings added), ready for > possible merging into Biopython: > https://github.com/peterjc/biopython/tree/colorspiral > > Does this seem like a sensible addition to Bio.Graphics? > > Does anyone have any thoughts on the namespace > Bio.Graphics.ColorSpiral given it defines an object > ColorSpiral? Might a Bio.Graphics.Colors be useful? > > (If as discussed on the other thread we move to lower > case module names for Python 3, this namespace > clash also present in many other Biopython modules > goes away): > http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009934.html > > Regards, > > Peter I've committed it - we can still move/rename/etc until the next release if anyone has suggestions for improvement. 
https://github.com/biopython/biopython/commit/35a484026b68dd1b530d3446640b2f4d4b73eda7

Peter

From p.j.a.cock at googlemail.com Thu Sep 27 09:55:21 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 27 Sep 2012 14:55:21 +0100
Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices
In-Reply-To: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com>
References: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com>
Message-ID:

On Fri, Sep 21, 2012 at 5:57 PM, Hui Ting Grace Yeo wrote:
> Hey everyone,
>
> I'm working on this bug here https://redmine.open-bio.org/issues/3340
> and I've updated the example in the tutorial (on substitution matrices,
> 17.4.2) using Bio.AlignIO on github here
> https://github.com/yhtgrace/biopython/tree/clustalw-alignIO-replace.
> I'm able to reproduce the dictionary replace_info, but when I go on to
> finish the example, I get the following log odds matrix:
>
> D    2
> E   -1   1
> H   -5  -4   3
> K  -10  -5  -4   1
> R   -4  -8  -4  -2   2
>      D   E   H   K   R
>
> which is different from the one given in the tutorial. I'm wondering if I've
> missed something.

Hi Grace,

Using the current code and the example as it is, I also observe
the same result as you.

According to github's "blame" feature the current text dates back
4 years,
https://github.com/biopython/biopython/commit/bed3ab39d8a635f1e74be99e6730a48d2460f8b7

However, that was just a reformatting of an older example which
Brad wrote 11 years ago while converting the example from DNA
to protein:
https://github.com/biopython/biopython/commit/21df476c66b279824c51e6abd3f4ae549d003813

The example file itself protein.aln has not changed, committed:
https://github.com/biopython/biopython/commit/ccbe2d72014eafb064994bc3782ca5529d0b0448

See also Doc/examples/make_subsmat.py

So, since the example hasn't been changed in 11 years, this
suggests either Brad committed the wrong output (and no-one
noticed), or something changed in the calculation during that
time.
(Nowadays we try to use doctests for the examples in the API and in the Tutorial where possible, so that code changes which affect our examples are detected automatically.) The most likely candidates would be something in the file Bio/SubsMat/__init__.py https://github.com/biopython/biopython/commits/master/Bio/SubsMat/__init__.py A little detective work might be needed to explain this... sadly trying to use Biopython from back then is complicated by the reliance on the Martel/mxTextTools dependency. Maybe Brad or Michiel has some insight? -- In the meantime, I have applied your changes to the example to use AlignIO, https://github.com/biopython/biopython/commit/19f9317fe0e346f6c3f197d027076d9a1265def7 https://github.com/biopython/biopython/commit/5949f54dadb6d4ac8400e11d2afa33db549afba5 This will now get tested via test_Tutorial.py automatically (except for the final line about printing the odds matrix): https://github.com/biopython/biopython/commit/15dd6ba17eb092d0d7df674ac45617d99256d098 Thank you, Peter From redmine at redmine.open-bio.org Thu Sep 27 09:57:38 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 27 Sep 2012 13:57:38 +0000 Subject: [Biopython-dev] [Biopython - Bug #3340] (Resolved) Example using Bio.Clustalw in Tutorial References: Message-ID: Issue #3340 has been updated by Peter Cock. 
Status changed from New to Resolved % Done changed from 0 to 100 Fixed with Grace's commits, although she has also spotted a separate issue with the log odds matrix output later in the example: http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009958.html http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009962.html ---------------------------------------- Bug #3340: Example using Bio.Clustalw in Tutorial https://redmine.open-bio.org/issues/3340 Author: Peter Cock Status: Resolved Priority: Normal Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The module Bio.Clustalw was deprecated and removed, yet is still used in the Tutorial's 'Creating your own substitution matrix from an alignment' example. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Fri Sep 28 06:50:52 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Sep 2012 11:50:52 +0100 Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: References: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com> Message-ID: On Thu, Sep 20, 2012 at 10:08 AM, Peter Cock wrote: > On Sun, Sep 16, 2012 at 1:34 PM, Peter Cock wrote: >>> >>> I guess we need to have a little hack with the 2to3 library and >>> try defining our own custom fixer for the imports... >> >> I've made a start at this - the easy part seems to work :) >> >> https://github.com/peterjc/biopython/commits/py3lower >> >> ... The code to do this lower case name mangling remains a quite spaghetti like mess in do2to3.py but it now works enough to pass the test suite (with some but not all 3rd party dependencies installed) under Linux and my Mac OS X machine (where like Windows I have a case insensitive file system). 
Here's a clean run on TravisCI (Linux with a case sensitive file system): https://travis-ci.org/#!/peterjc/biopython/jobs/2584146 I've not tried Windows itself yet. Also only Python 3.2 Note if you want to try this, after switching to (and after switching from) the py3lower branch you should delete the build/py3.* folder where the 2to3 converted code is cached. The good news is that only a handful of bits of code needed special case code (e.g. finding the Entrez DTD files), with most tweaks just to import lines (as mentioned earlier) or renaming of internal variables. So this idea to adopt PEP8 lower case module names as part of supporting Python 3 appears to be technically viable. Peter From p.j.a.cock at googlemail.com Fri Sep 28 05:35:42 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Sep 2012 10:35:42 +0100 Subject: [Biopython-dev] ColorSpiral for Bio.Graphics In-Reply-To: References: Message-ID: On Thu, Sep 27, 2012 at 2:01 PM, Peter Cock wrote: > On Mon, Sep 24, 2012 at 9:53 AM, Peter Cock wrote: >> As he describes on his blog post, this required generating >> arbitrary color sets, with the option of adding some noise >> (or jitter as he called it) to make neighbouring colours >> visually distinct (rather than the more typical requirement >> of a smooth value to color mapping). >> >> ... > > I've committed it - we can still move/rename/etc until the > next release if anyone has suggestions for improvement. > https://github.com/biopython/biopython/commit/35a484026b68dd1b530d3446640b2f4d4b73eda7 The buildbot run last night spotted a problem under Python 2.5 (no cmath.rect function) which I've now fixed. https://github.com/biopython/biopython/commit/ee933c3f5c4b98ab232c5180492dc11a46b89f0d We do test under Python 2.5 with TravisCI as well, but at the moment we don't install the ReportLab dependency. 
There is a balance between installing more dependencies (to get more of our code tested) and the extra runtime required (meaning the job is more likely to be killed, or fail due to a network issue) giving false test failures. Peter From p.j.a.cock at googlemail.com Fri Sep 28 06:06:10 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Sep 2012 11:06:10 +0100 Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices In-Reply-To: <87ipaywk47.fsf@fastmail.fm> References: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com> <87ipaywk47.fsf@fastmail.fm> Message-ID: On Fri, Sep 28, 2012 at 10:51 AM, Brad Chapman wrote: >> So, since the example hasn't been changed in 11 years, this >> suggests either Brad committed the wrong output (and no-one >> noticed), or something changed in the calculation during that >> time. > > Seriously, I could have easily copy/pasted something wrong when writing > this, so if there is no obvious code change I'd go with that assumption > and fix the docs to be correct. OK - I've done that: https://github.com/biopython/biopython/commit/b57707f9f3afc0980a3dbf936f6642a4d9cc8a69 Thanks Brad & Grace, Peter P.S. I've included Grace as a contributor in the upcoming release notes (please let me know if you'd prefer this as Hui Ting Grace Yeo instead): https://github.com/biopython/biopython/commit/5af03e78f37cbce82ce167c762d892cce9cb062e From bjoern at gruenings.eu Fri Sep 28 09:03:22 2012 From: bjoern at gruenings.eu (=?ISO-8859-1?Q?Bj=F6rn_Gr=FCning?=) Date: Fri, 28 Sep 2012 15:03:22 +0200 Subject: [Biopython-dev] [Patch] Genbank Parser Message-ID: <1348837402.21455.1.camel@threonin> Hi, the tbl2asn tool from the ncbi creates genbank files that did not have a version number. Unfortunately that version number is used to fill consumer.data.id. I implemented the following fall-back: If there is no version information available than it takes the consumer.data.name for the consumer.data.id. Does that makes sense? Thanks! 
Bjoern -------------- next part -------------- A non-text attachment was scrubbed... Name: biopython_genbank_id-fallback.diff Type: text/x-patch Size: 1016 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Fri Sep 28 09:38:11 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Sep 2012 14:38:11 +0100 Subject: [Biopython-dev] [Patch] Genbank Parser In-Reply-To: <1348837402.21455.1.camel@threonin> References: <1348837402.21455.1.camel@threonin> Message-ID: On Fri, Sep 28, 2012 at 2:03 PM, Bj?rn Gr?ning wrote: > Hi, > > the tbl2asn tool from the ncbi creates genbank files that did not have a > version number. Unfortunately that version number is used to fill > consumer.data.id. > I implemented the following fall-back: > If there is no version information available than it takes the > consumer.data.name for the consumer.data.id. Does that makes sense? > > Thanks! > Bjoern Can you share some example output from tbl2asn that shows this problem? Ideally something small we could include as a unit test. Thanks, Peter From chapmanb at 50mail.com Fri Sep 28 05:51:36 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 28 Sep 2012 05:51:36 -0400 Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices In-Reply-To: References: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com> Message-ID: <87ipaywk47.fsf@fastmail.fm> Grace and Peter; [Different log odds matrix in documentation] > However, that was just a reformatting of an older example which > Brad wrote 11 years ago while converting the example from DNA > to protein: Gee, thanks for making me feel old. > So, since the example hasn't been changed in 11 years, this > suggests either Brad committed the wrong output (and no-one > noticed), or something changed in the calculation during that > time. Seriously, I could have easily copy/pasted something wrong when writing this, so if there is no obvious code change I'd go with that assumption and fix the docs to be correct. 
Thanks for spotting this,
Brad

From bjoern at gruenings.eu Thu Sep 27 18:11:05 2012
From: bjoern at gruenings.eu (bjoern at gruenings.eu)
Date: Fri, 28 Sep 2012 00:11:05 +0200 (CEST)
Subject: [Biopython-dev] [Patch] Genbank Parser fall-back data.id
Message-ID: <59367.132.230.56.143.1348783865.squirrel@mail.gruenings.eu>

Hi,

the tbl2asn tool from the NCBI creates GenBank files that do not have a
version number. Unfortunately that version number is used to fill
consumer.data.id.
I implemented the following fall-back:
If there is no version information available then it takes the
consumer.data.name for the consumer.data.id. Does that make sense?

Thanks!
Bjoern
-------------- next part --------------
A non-text attachment was scrubbed...
Name: biopython_genbank.diff
Type: text/x-patch
Size: 1015 bytes
Desc: not available
URL:

From p.j.a.cock at googlemail.com Sat Sep 29 08:10:24 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 29 Sep 2012 13:10:24 +0100
Subject: [Biopython-dev] Fwd: [Utilities-announce] PubMed E-Utility 2013 DTD updates
In-Reply-To:
References:
Message-ID:

I've added the two new DTD files mentioned below:
https://github.com/biopython/biopython/commit/2a09b03ab4d861e91eb543bd6df717ecb4fdf097

Peter

---------- Forwarded message ----------
From:
Date: Friday, September 28, 2012
Subject: [Utilities-announce] PubMed E-Utility 2013 DTD updates
To: NLM/NCBI List utilities-announce

NCBI PubMed E-Utility Users,

We anticipate updating the PubMed E-Utility DTDs for 2012 in
mid-December, approximately on December 10 or 11, 2012.

The forthcoming DTDs are available from:

http://www.ncbi.nlm.nih.gov/entrez/query/DTD/nlmmedlinecitationset_130101.dtd
http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_130101.dtd

Changes to NLMMedlineCitationSet DTD AND MEDLINE/PubMed XML:

- Indicating abstracts not in MEDLINE/PubMed but available from publishers

English-language abstracts are taken
directly from the published article and included in the and elements. If
the article does not have a published abstract, the record lacks the and
elements. However, publishers may create English-language abstracts that
are not published with the article, as well as, non-English-language
abstracts that may or may not be published with the article.

These other abstracts will be indicated in the element. A new "Language"
attribute is added to the element. The element will carry the standard
phrase: "Abstract available from the publisher."

DTD:

Sample XML:
Abstract available from the publisher.

- Rename NameID to Identifier

The NameID element was created in 2010 and modified in 2011 but has not
yet been used. NameID is renamed to Identifier. Identifier is an optional,
possibly multiply-occurring element permissible within the Author (personal
and collective) and Investigator elements. The value in the Identifier
attribute Source designates the organizational authority that established
the unique identifier.

DTD:

Sample XML:
Smith
John
A
55555555555555

Thank you.

From p.j.a.cock at googlemail.com Sat Sep 29 16:25:14 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 29 Sep 2012 21:25:14 +0100
Subject: [Biopython-dev] Nexus __slots__ and Python 3.3
Message-ID:

Hello all,

I've started testing under the newly released Python 3.3,
and there is a new problem which I don't recall running
into when I tried one of the Python 3.3 alpha releases:

$ python3 test_Nexus.py
Traceback (most recent call last):
  File "test_Nexus.py", line 7, in <module>
    from Bio.Nexus import Nexus, Trees
  File "/Users/peterjc/lib/python3.3/site-packages/Bio/Nexus/Nexus.py", line 513, in <module>
    class Nexus(object):
ValueError: 'original_taxon_order' in __slots__ conflicts with class variable

I can fix this with the following change, which appears to
have no side effects under Python 2 (the unit tests still
all pass):

$ git diff
diff --git a/Bio/Nexus/Nexus.py b/Bio/Nexus/Nexus.py
index 1d6abd2..8c7fbcc 100644
--- a/Bio/Nexus/Nexus.py
+++ b/Bio/Nexus/Nexus.py
@@ -511,8 +511,6 @@ class Block(object):


 class Nexus(object):
-    __slots__=['original_taxon_order','__dict__']
-
     def __init__(self, input=None):
         self.ntax=0 # number of taxa
         self.nchar=0 # number of characters

I have committed this:
https://github.com/biopython/biopython/commit/e90db11f4a1d983bc2bfe12bec30edbdbb200634

However, I'm not really sure what the intention of this
line was in the first place. It is (assuming I didn't miss
anything with grep), or now was, the only use of __slots__
in the whole of Biopython.

Regards,

Peter

From p.j.a.cock at googlemail.com Sat Sep 29 16:34:27 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 29 Sep 2012 21:34:27 +0100
Subject: [Biopython-dev] PAML test problems under Python 3.3.0
Message-ID:

Hi Brandon (et al),

Could you have a look at the PAML unit tests under Python 3.3 please?
I see a mix of failures and 'blocking' under a self-compiled Python 3.3.0
on Mac OS X 10.8 (Mountain Lion):

$ python3 test_PAML_yn00.py
testAlignmentExists (__main__.ModTest) ... ok
testAlignmentFileIsValid (__main__.ModTest) ... FAIL
testAlignmentSpecified (__main__.ModTest) ... ok
testCtlFileExistsOnRead (__main__.ModTest) ... ok
testCtlFileExistsOnRun (__main__.ModTest) ... ok
testCtlFileValidOnRead (__main__.ModTest) ... ERROR
testCtlFileValidOnRun (__main__.ModTest) ... ok
testOptionExists (__main__.ModTest) ... ok
testOutputFileSpecified (__main__.ModTest) ... ok
testOutputFileValid (__main__.ModTest) ... ok
testParseAllVersions (__main__.ModTest) ... ok
testResultsExist (__main__.ModTest) ... ok
testResultsParsable (__main__.ModTest) ... ok
testResultsValid (__main__.ModTest) ... ^C

$ python3 test_PAML_codeml.py
testAlignmentExists (__main__.ModTest) ... ok
testAlignmentFileIsValid (__main__.ModTest) ... FAIL
testAlignmentSpecified (__main__.ModTest) ... ok
testCtlFileExistsOnRead (__main__.ModTest) ... ok
testCtlFileExistsOnRun (__main__.ModTest) ... ok
testCtlFileValidOnRead (__main__.ModTest) ... ERROR
testCtlFileValidOnRun (__main__.ModTest) ... ok
testOptionExists (__main__.ModTest) ... ok
testOutputFileSpecified (__main__.ModTest) ... ok
testOutputFileValid (__main__.ModTest) ... ok
testPamlErrorsCaught (__main__.ModTest) ... ok
testParseAA (__main__.ModTest) ... ok
testParseAAPairwise (__main__.ModTest) ... ok
testParseAllNSsites (__main__.ModTest) ... ok
testParseBranchSiteA (__main__.ModTest) ... ok
testParseCladeModelC (__main__.ModTest) ... ok
testParseFreeRatio (__main__.ModTest) ... ok
testParseNSsite3 (__main__.ModTest) ... ok
testParseNgene2Mgene02 (__main__.ModTest) ... ok
testParseNgene2Mgene1 (__main__.ModTest) ... ok
testParseNgene2Mgene34 (__main__.ModTest) ... ok
testParsePairwise (__main__.ModTest) ... ok
testParseSEs (__main__.ModTest) ... ok
testResultsExist (__main__.ModTest) ... ok
testResultsParsable (__main__.ModTest) ... ok
testResultsValid (__main__.ModTest) ... ^C

$ python3 test_PAML_baseml.py
testAlignmentExists (__main__.ModTest) ... ok
testAlignmentFileIsValid (__main__.ModTest) ... FAIL
testAlignmentSpecified (__main__.ModTest) ... ok
testCtlFileExistsOnRead (__main__.ModTest) ... ok
testCtlFileExistsOnRun (__main__.ModTest) ... ok
testCtlFileValidOnRead (__main__.ModTest) ... ERROR
testCtlFileValidOnRun (__main__.ModTest) ... ok
testOptionExists (__main__.ModTest) ... ok
testOutputFileSpecified (__main__.ModTest) ... ok
testOutputFileValid (__main__.ModTest) ... ok
testPamlErrorsCaught (__main__.ModTest) ... ok
testParseAllVersions (__main__.ModTest) ... ok
testParseAlpha1Rho1 (__main__.ModTest) ... ok
testParseModel (__main__.ModTest) ... ok
testParseNhomo (__main__.ModTest) ... ok
testParseSEs (__main__.ModTest) ... ok
testResultsExist (__main__.ModTest) ... ok
testResultsParsable (__main__.ModTest) ... ok
testResultsValid (__main__.ModTest) ... ^C

If you've not tried this before, the procedure I'm using is:

$ python3 setup.py build
$ cd build/py3.3/Tests
$ python3 test_PAML_baseml.py
etc

The key point is that to run the tests directly (rather than just via
'python3 setup.py test') you must change directory to the 2to3
converted folder under the build folder.

By commenting out the test methods which seem to be blocking, it
seems some of the failures are to do with exception handling.
I've not dug any further into this.

Thanks,

Peter

From redmine at redmine.open-bio.org Sun Sep 2 19:20:01 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Sun, 2 Sep 2012 19:20:01 +0000
Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock
References:
Message-ID:

Issue #3379 has been updated by João Rodrigues.

I contacted the developers of PatchDock and they updated their code.
Their PDBs no longer have the double END statement, but they might have conflicting chains though: the parser will likely break if by chance both chains have id A and overlapping residue numbers. Still, a slight improvement. ---------------------------------------- Bug #3379: PDBParser fails to parse PDBs produced by PatchDock https://redmine.open-bio.org/issues/3379 Author: David Cain Status: New Priority: Low Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.57 URL: I apoligize in advance if this technically doesn't count as a bug, as the problem is arising out of improperly formatted PDBs. h3. Background Protein docking utilities can generally create a complex PDB from two input files. Depending on the rotation algorithm, at least one of the PDB files is rotated (its ATOM coordinates modified in-place), then the two files are concatenated to create a protein complex file. h3. Why PDBParser fails Utilities like ZDOCK strip a lot of data from the input files, creating a poorly-formed PDB file that raises PDBConstructionWarnings, but PDBParser can ultimately parse. PatchDock, however, preserves the input PDB files as they were- the only thing that changes is ATOM coordinates. This is problematic when the receptor PDB has an @END@ record or @CONECT@ records: PDBParser's current behavior is to consider anything after an @END@ or @CONECT@ to be trailer data, and cease parsing when they're encountered. This means that many complexes parse cleanly, but completely exclude the ligand. h3. How to fix the problem Now, in an ideal world- the responsibility would be on the creators of the docking utilities to create well-formed complex PDB files. However, this quick concatenation seems to be pretty common (complexes are often created by very short, hackish Perl scripts). Should PDBParser be able to parse these badly formed PDB files? h3. Potential change to @PDBParser._parse_coordinates@? 
If a modification to PDBParser is on the table, my thought would be to still consider anything after @END@ or @CONECT@ to be part of the trailer, but make an attempt to parse extra coordinate data from this trailer before returning (probably through a recursive call). If records are found in the trailer, a PDBConstructionWarning is raised, but they're added to the structure. If this approach is reasonable, let me know and I'd be happy to mock something up and push it to my branch on GitHub. Otherwise, I'll just write scripts to clean ugly complexes for parsing. My only thought is that most users of docking software are probably not able or willing to write such a script, and thus can't use BioPython to parse the PDB output. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Mon Sep 3 01:05:19 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Mon, 3 Sep 2012 01:05:19 +0000 Subject: [Biopython-dev] [Biopython - Bug #3379] PDBParser fails to parse PDBs produced by PatchDock References: Message-ID: Issue #3379 has been updated by David Cain. That's awesome! Thanks for doing that. Well, chain renumbering is definitely a problem, but I don't see any easy fix for that. I still think the "pull request":https://github.com/biopython/biopython/pull/60 is relevant for detecting otherwise malformed PDB files (additionally, parsing will still stop after the first file if @CONECT@ files are relevant). 
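
As an illustration of the "script to clean ugly complexes" route mentioned in
the issue description, here is a minimal sketch. It is not part of Biopython
and not the code in the pull request; the function name and the rule it
applies (drop any END or CONECT record that is followed by further ATOM or
HETATM records) are assumptions made for the example. It does nothing about
clashing chain ids or residue numbers.

```python
# Record names that carry coordinates (PDB record names are padded to six
# characters, hence the trailing spaces in "ATOM  ").
COORD_RECORDS = ("ATOM  ", "HETATM")
# Records that make the parser treat the remainder of the file as trailer.
TRAILER_RECORDS = ("END", "CONECT")


def strip_inner_trailers(lines):
    """Return the lines with END/CONECT records removed when more
    coordinate records follow them (hypothetical helper)."""
    lines = list(lines)

    # Position of the last coordinate record in the whole file.
    last_coord = -1
    for index, line in enumerate(lines):
        if line.startswith(COORD_RECORDS):
            last_coord = index

    cleaned = []
    for index, line in enumerate(lines):
        record = line[:6].strip()
        if index < last_coord and record in TRAILER_RECORDS:
            # An "inner" trailer record left over from concatenation.
            continue
        cleaned.append(line)
    return cleaned


if __name__ == "__main__":
    import sys
    # Usage (file names are placeholders): python clean_pdb.py in.pdb > out.pdb
    with open(sys.argv[1]) as handle:
        sys.stdout.writelines(strip_inner_trailers(handle))
```

The cleaned file can then be handed to PDBParser as usual. Note that dropping
the receptor's CONECT records throws away its connectivity data, which may or
may not matter for a given analysis.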
---------------------------------------- Bug #3379: PDBParser fails to parse PDBs produced by PatchDock https://redmine.open-bio.org/issues/3379 Author: David Cain Status: New Priority: Low Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: 1.57 URL: I apoligize in advance if this technically doesn't count as a bug, as the problem is arising out of improperly formatted PDBs. h3. Background Protein docking utilities can generally create a complex PDB from two input files. Depending on the rotation algorithm, at least one of the PDB files is rotated (its ATOM coordinates modified in-place), then the two files are concatenated to create a protein complex file. h3. Why PDBParser fails Utilities like ZDOCK strip a lot of data from the input files, creating a poorly-formed PDB file that raises PDBConstructionWarnings, but PDBParser can ultimately parse. PatchDock, however, preserves the input PDB files as they were- the only thing that changes is ATOM coordinates. This is problematic when the receptor PDB has an @END@ record or @CONECT@ records: PDBParser's current behavior is to consider anything after an @END@ or @CONECT@ to be trailer data, and cease parsing when they're encountered. This means that many complexes parse cleanly, but completely exclude the ligand. h3. How to fix the problem Now, in an ideal world- the responsibility would be on the creators of the docking utilities to create well-formed complex PDB files. However, this quick concatenation seems to be pretty common (complexes are often created by very short, hackish Perl scripts). Should PDBParser be able to parse these badly formed PDB files? h3. Potential change to @PDBParser._parse_coordinates@? If a modification to PDBParser is on the table, my thought would be to still consider anything after @END@ or @CONECT@ to be part of the trailer, but make an attempt to parse extra coordinate data from this trailer before returning (probably through a recursive call). 
If records are found in the trailer, a PDBConstructionWarning is raised, but they're added to the structure. If this approach is reasonable, let me know and I'd be happy to mock something up and push it to my branch on GitHub. Otherwise, I'll just write scripts to clean ugly complexes for parsing. My only thought is that most users of docking software are probably not able or willing to write such a script, and thus can't use BioPython to parse the PDB output. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From w.arindrarto at gmail.com Mon Sep 3 10:14:59 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Mon, 3 Sep 2012 12:14:59 +0200 Subject: [Biopython-dev] GSoC SearchIO project In-Reply-To: References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com> <87lim4h07o.fsf@fastmail.fm> Message-ID: Hello everyone, I'd like to update everyone on my latest SearchIO(?) developments. There has been some progress and bug fixes since GSoC officially ended two weeks ago. Some of them I'd like to share here: 1. I've written a draft tutorial chapter for the submodule. It' been pushed to my development repo (https://github.com/bow/biopython/tree/searchio) and I'm hosting the HTML temporarily on my site ( http://bow.web.id/biopython/Tutorial.html). Comments and critiques are welcomed :). 2. Back on the naming issue, I'm still using SearchIO for now. I've experimented with other names (Bio.Search and Bio.SeqSearch), and my impression is I like Bio.SeqSearch the most, followed by Bio.Search, and Bio.SearchIO. It does feel confusing initially (we have SeqUtils, SeqFeature, etc.), but after a while it's the one that feels most natural. 3. And finally, Peter and I discussed this briefly previously: what about if we merge the existing BLAST wrappers and NCBI qblast into Bio.(SeqSearch / Search / SearchIO)? 
I felt there were a lot of overlap between this submodule and Bio.BLAST when writing the tutorial, so merging surfaced in my thoughts again. We could put the BLAST wrappers under Bio.SeqSearch.Applications (for example), along with other wrappers (I have a yet-untested Bio.HMMER3 wrapper and possibly Bio.BLAT wrapper that put here as well). As for qblast (and other remote searches, like the one provided by HMMER at the moment), we could put them in Bio.SeqSearch.Remote, perhaps. I think this would make it easier for anyone who works with BLAST / other sequence search tools as all Biopython-related functionalities are grouped in one place. This is just a thought for now, but I'd love to hear your thoughts on the merge (and the naming ;) ). cheers, Bow On Tue, Aug 21, 2012 at 6:01 PM, Wibowo Arindrarto wrote: > On Tue, Aug 14, 2012 at 9:49 PM, Peter Cock > wrote: > > On Tue, Apr 10, 2012 at 1:58 AM, Brad Chapman wrote: > >> Michiel; > >>> Hi Eric, Peter, > >>> > >>> > How about Bio.Search, for now? > >>> > >>> I would prefer Bio.Pairwise or Bio.Align.Pairwise, since that tells > >>> users something about what the module is for. Bio.Search could be > >>> anything (search PubMed? search the Entrez databases? search Google? > >>> anyway Bio.Search does not suggest that this module is about pairwise > >>> alignments). But Peter previously mentioned that he doesn't like > >>> Bio.Pairwise; can we convince you? > >> > >> I agree with Peter on this one. The module is primarily about searching > >> a sequence database with an input via multiple methods, not about > >> pairwise alignment of two sequences with is what Bio.Align.Pairwise > >> suggests to me. > >> > >> Brad > > > > On potential problem with Bio.Search (on top of concerns raised > > here about vagueness) Bow and I were just talking about during > > our weekly GSoC video call was the existence of Bio/Search.py > > which is obsolete and long overdue for removal. 
I have just > > deprecated it (something I forgot to do before the last release): > > > https://github.com/biopython/biopython/commit/5a275ccd1df3def40df1eef517af755d373dadd8 > > > > We'd earlier talked about using Bio.Search as the namespace. I was > > worried about the potential existence on a user's machine of both > > Bio/Search.py (the old obsolete code) and Bio/Search/__init__.py > > (aka SearchIO, the new module) and which would take precedence > > when doing: from Bio import Search > > > > Given how Python module installations work, that seems highly > > likely to occur. The good news is that the package would take > > priority - see http://www.python.org/doc/essays/packages.html > > > >>>>> What If I Have a Module and a Package With The Same Name? > >>>>> > >>>>> You may have a directory (on sys.path) which has both a module > >>>>> spam.py and a subdirectory spam that contains an __init__.py > >>>>> (without the __init__.py, a directory is not recognized as a > package). > >>>>> In this case, the subdirectory has precedence, and importing spam > >>>>> will ignore the spam.py file, loading the package spam instead. If > >>>>> you want the module spam.py to have precedence, it must be > >>>>> placed in a directory that comes earlier in sys.path. > > > > So there is no technical reason to avoid Bio.Search as an > > option for the Bio.SearchIO namespace. We could then > > have Bio.Search.Applications for command line wrappers, > > consistent with Bio.Phylo.Applications, Bio.Motif.Applications > > and Bio.Align.Applications. > > > > Of course, Bio.Search is still perhaps too broad a name... but > > on balance perhaps it is still better than Bio.SearchIO? > > > > Regards, > > > > Peter > > Hi everyone, > > If I may add my two cents, for now I am in favor of putting the module > under Bio.Search. It is not the best name out there (it does sound a > bit vague), but it's the one that seem to be the most intuitive (until > a better alternative comes out). 
There were some other alternatives > that I and Peter have discussed, but they seem less appealing for us. > You're free to add your thoughts on these of course :) : > > - Bio.SeqSearch. This sounds ok, but when you consider we have > Bio.Seq, Bio.SeqRecord, Bio.SeqFeature, and Bio.SeqUtils, it becomes > quite confusing quickly. > > - Bio.PSearch ('p' for pairwise). This one seemed the less intuitive > among the three options, so I'm not so big on this. > > For now, I'm still writing everything (code, docstrings, tutorial) > using SearchIO. I suppose it's better if we could agree on a more > suitable name, though. > > On another note, I'm also in favor of using the Bio.Phylo module > skeleton for Bio.SearchIO / Bio.Search. We may then group all sequence > search-related application wrappers under Applications (I actually > prefers 'app' for better PEP8 compliance, but that's another > discussion) and perhaps even refactor our remote search calls (e.g. > the 'qblast' module) under Bio.Search as well. > > cheers, > Bow > From p.j.a.cock at googlemail.com Mon Sep 3 12:28:30 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Sep 2012 13:28:30 +0100 Subject: [Biopython-dev] GSoC SearchIO project In-Reply-To: References: <1334014051.14489.YahooMailClassic@web161204.mail.bf1.yahoo.com> <87lim4h07o.fsf@fastmail.fm> Message-ID: On Mon, Sep 3, 2012 at 11:14 AM, Wibowo Arindrarto wrote: > Hello everyone, > > I'd like to update everyone on my latest SearchIO(?) developments. There > has been some progress and bug fixes since GSoC officially ended two weeks > ago. Some of them I'd like to share here: > > 1. I've written a draft tutorial chapter for the submodule. It' been pushed > to my development repo (https://github.com/bow/biopython/tree/searchio) and > I'm hosting the HTML temporarily on my site ( > http://bow.web.id/biopython/Tutorial.html). Comments and critiques are > welcomed :). Oh - excellent - I'll read that in the next few days :) > 2. 
Back on the naming issue, I'm still using SearchIO for now. I've > experimented with other names (Bio.Search and Bio.SeqSearch), and my > impression is I like Bio.SeqSearch the most, followed by Bio.Search, and > Bio.SearchIO. It does feel confusing initially (we have SeqUtils, > SeqFeature, etc.), but after a while it's the one that feels most natural. Initially Bio.SeqSearch sounds a bit long... but maybe it will grow on me... > 3. And finally, Peter and I discussed this briefly previously: what about > if we merge the existing BLAST wrappers and NCBI qblast into Bio.(SeqSearch > / Search / SearchIO)? I felt there were a lot of overlap between this > submodule and Bio.BLAST when writing the tutorial, so merging surfaced in > my thoughts again. We could put the BLAST wrappers under > Bio.SeqSearch.Applications (for example), along with other wrappers (I have > a yet-untested Bio.HMMER3 wrapper and possibly Bio.BLAT wrapper that put > here as well). As for qblast (and other remote searches, like the one > provided by HMMER at the moment), we could put them in > Bio.SeqSearch.Remote, perhaps. I think this would make it easier for anyone > who works with BLAST / other sequence search tools as all Biopython-related > functionalities are grouped in one place. As per my discussion with Bow, I'm OK with aiming to deprecate the Bio.BLAST namespace as part of introducing Bio.SeqSearch/Search/.., although I hadn't a strong preference on a naming convention for any online functionality. Possibly www is shorter than remote and also clear? > This is just a thought for now, but I'd love to hear your thoughts on the > merge (and the naming ;) ). > > cheers, > Bow Thanks Bow :) Peter From p.j.a.cock at googlemail.com Mon Sep 3 12:55:07 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Sep 2012 13:55:07 +0100 Subject: [Biopython-dev] Beta code in the official releases? 
In-Reply-To: <1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org> References: <877gsq8mn2.fsf@fastmail.fm> <1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org> Message-ID: On Wed, Aug 29, 2012 at 6:54 PM, Sczesnak, Andrew wrote: > +1 > > It's been over a year since I first submit my MAF code! Already? Ouch, my apologies. I'm at a hackathon this week with the OBF GSoC mentors who looked at MAF for BioRuby - looking at this for inclusion in the next Biopython release (perhaps with a beta tag) is on my agenda. Peter From anaryin at gmail.com Mon Sep 3 22:07:39 2012 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 4 Sep 2012 01:07:39 +0300 Subject: [Biopython-dev] Optimization of PDBParser and friends Message-ID: Hi all, A quick update on some latest work. I found some time to finally work a bit on the PDB parser and Bio.PDB in general. I started by optimizing the current code. I ran cProfile on script that parsed a set of structures without header and without element columns. I did this because one of the optimizations rendered the current header parser useless.. (replaced the PDB file handle by an iterator instead of using the readlines method). I still need to work a bit on the memory leak, but for now it seems pretty ok (parsed 400-ish large structures without a glitch). I am attaching two pictures of cProfile and the two output files. There is a nice improvement of about 25%, but this can still be improved for sure. I just replaced some methods here and there, pre-initialized the numpy arrays, etc.. I pushed this version to my github pdb_enhancements branch . One big change I would propose is to eliminate the duality child_list/child_dict. I think that keeping child_dict and generating child_list from sorted dict keys would be good enough. OrderedDict also looks appropriate, but it's Py2.7+.. 
Still need to look into this, but by looking at all those "append" methods in the profiling it hints at a nice speed up, and also at much cleaner code. Let me know of your opinion if you have some time, Cheers, Jo?o PS. Attached complex_1.pdb as an example of the structures in the dataset used for this particular test. -------------- next part -------------- A non-text attachment was scrubbed... Name: BioPDB-master-TBEV.png Type: image/png Size: 166144 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: BioPDB-master-TBEV.profile Type: application/octet-stream Size: 252112 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: BioPDB-optimized-TBEV.png Type: image/png Size: 148137 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: BioPDB-optimized-TBEV.profile Type: application/octet-stream Size: 273487 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: complex_1w.pdb Type: chemical/x-pdb Size: 649559 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Tue Sep 4 05:56:55 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 4 Sep 2012 06:56:55 +0100 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: On Mon, Sep 3, 2012 at 11:07 PM, Jo?o Rodrigues wrote: > One big change I would propose is to eliminate the duality > child_list/child_dict. I think that keeping child_dict and generating > child_list from sorted dict keys would be good enough. OrderedDict also > looks appropriate, but it's Py2.7+.. Still need to look into this, but by > looking at all those "append" methods in the profiling it hints at a nice > speed up, and also at much cleaner code. 
>

Where there are back-ports of the OrderedDict and other useful classes like NamedTuple, we could probably include these as part of our Python 2/3 compatibility code. i.e. In Bio.PDB use:

    from Bio._py3k import OrderedDict

(Until we drop older versions of Python which don't come with this).

In Bio._py3k we would have something like this:

    #Use in preference system OrderedDict (Python 2.7 and 3.x),
    #the backport from PyPI, or our own bundled implementation
    try:
        from collections import OrderedDict
    except ImportError:
        try:
            #Whatever http://pypi.python.org/pypi/ordereddict uses:
            from xxx import OrderedDict
        except ImportError:
            #Import local bundled implementation, e.g.
            from _ordereddict import OrderedDict

See http://code.activestate.com/recipes/576693-ordered-dictionary-for-py24/

Are there any objections to this plan?

Regards,

Peter

From anaryin at gmail.com Tue Sep 4 05:59:36 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 4 Sep 2012 08:59:36 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To:
References:
Message-ID:

Sounds great, I saw the active state link before but I never thought of including it. Thanks!

From w.arindrarto at gmail.com Tue Sep 4 06:11:05 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 4 Sep 2012 08:11:05 +0200
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To:
References:
Message-ID:

Hi Peter, João,

Just a little FYI. I ran into the OrderedDict issue when I started writing SearchIO a few months ago as well, so I added an OrderedDict implementation in Bio._py3k ( https://github.com/bow/biopython/commit/34f873b2d21136487a0b925be898c52daa0cc61c ).

The code is from the ordereddict module from PyPI at that time. I haven't checked if it's the same as the one shown in the link (there may have been some updates), but it seems to work fine up to now.
Hope this is useful :), Bow On Tue, Sep 4, 2012 at 7:59 AM, Jo?o Rodrigues wrote: > Sounds great, I saw the active state link before but I never thought of > including it. Thanks! > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Tue Sep 4 06:30:51 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 4 Sep 2012 07:30:51 +0100 Subject: [Biopython-dev] PEP8 lower case module names? Message-ID: Hello all, Over on one of Bow's pull requests Michiel made a suggestion about consolidating the Bio.Seq* namespace under Bio.Seq.* which we can do by replacing Bio/Seq.py with Bio/Seq/__init__.py See: https://github.com/biopython/biopython/pull/63#issuecomment-8252340 I agree that Bio.Seq, Bio.SeqUtils, Bio.SeqIO, Bio.SeqRecord, and Bio.SeqFeature isn't ideal. However, changing this would be a big disruption - so perhaps any large change like this should also address the mixed case module names which are not PEP8 conformant (Modules should have short, all-lowercase names). http://www.python.org/dev/peps/pep-0008/#package-and-module-names One idea I was pondering is a new parallel namespace, ideally bio.* but we can't use that due to case insensitive file systems like Windows and (by default) Mac OS X. So perhaps biopy, or bp? [I've not checked for clashes with other libraries yet.] We could gradually move code over to the new namespace, using imports to preserve back compatibility - but support both namespaces during a (long) transition period. What I like about this is it allows people to make a gradual conversion - and we don't have to burden of two main branches if we attempted a single jump to a Biopython v2. Does this seem worth considering? 
Regards, Peter From mjldehoon at yahoo.com Tue Sep 4 10:27:57 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 4 Sep 2012 03:27:57 -0700 (PDT) Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: Message-ID: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com> Hi Peter, --- On Tue, 9/4/12, Peter Cock wrote: > One idea I was pondering is a new parallel namespace, > ideally bio.* but we can't use that due to case > insensitive file systems like Windows and (by default) > Mac OS X. So perhaps biopy, or bp? As you say, the ideal namespace is bio.*, so let's use that. We have been using Bio.* for more than 10 years. We should not get stuck with a non-ideal namespace for the next 10+ years because there may be some glitches switching from Bio.* to bio.*. Frankly I doubt that this will cause huge problems in practice. > We could gradually move code over to the new namespace, > using imports to preserve back compatibility - but support > both namespaces during a (long) transition period. Why do we need a transition period? It's just a matter of replacing upper case with lower case in the imports. > What I like about this is it allows people to make a > gradual > conversion - and we don't have to burden of two main > branches if we attempted a single jump to a Biopython v2. > > Does this seem worth considering? Yes but by all means, let's keep this simple. In the past, changes to Biopython have very rarely caused any serious problems for users. Best, -Michiel. From p.j.a.cock at googlemail.com Tue Sep 4 10:59:00 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 4 Sep 2012 11:59:00 +0100 Subject: [Biopython-dev] PEP8 lower case module names? 
In-Reply-To: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>
References: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>
Message-ID:

On Tue, Sep 4, 2012 at 11:27 AM, Michiel de Hoon wrote:
> Hi Peter,
>
> --- On Tue, 9/4/12, Peter Cock wrote:
>> One idea I was pondering is a new parallel namespace,
>> ideally bio.* but we can't use that due to case
>> insensitive file systems like Windows and (by default)
>> Mac OS X. So perhaps biopy, or bp?
>
> As you say, the ideal namespace is bio.*, so let's use
> that. We have been using Bio.* for more than 10 years.
> We should not get stuck with a non-ideal namespace for
> the next 10+ years because there may be some glitches
> switching from Bio.* to bio.*. Frankly I doubt that this
> will cause huge problems in practice.

So you'd advocate a simple switch where from one release to the next we change all the module names (making them lower case, perhaps from consolidation under bio.seq too)?

This may cause some difficulties for upgrades - it may require manual intervention to remove the old Bio folder in order to allow creation of the new bio folder.

>> We could gradually move code over to the new namespace,
>> using imports to preserve back compatibility - but support
>> both namespaces during a (long) transition period.
>
> Why do we need a transition period? It's just a matter
> of replacing upper case with lower case in the imports.

That forces people to update all their scripts at once. Of course, we can document how to do this so a script would work before and after the case change, e.g.

    try:
        from bio.seq import Seq
    except ImportError:
        from Bio.Seq import Seq

>> What I like about this is it allows people to make a
>> gradual
>> conversion - and we don't have to burden of two main
>> branches if we attempted a single jump to a Biopython v2.
>>
>> Does this seem worth considering?
>
> Yes but by all means, let's keep this simple.
In the past, changes to Biopython have very rarely caused any serious problems for users. > > Best, > -Michiel. > From p.j.a.cock at googlemail.com Tue Sep 4 12:16:26 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 4 Sep 2012 13:16:26 +0100 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: On Tue, Sep 4, 2012 at 7:11 AM, Wibowo Arindrarto wrote: > Hi Peter, Jo?o, > > Just a little FYI. I ran into the OrderedDict issue when I started writing > SearchIO a few months ago as well, so I added an OrderedDict implementation > in Bio._py3k > (https://github.com/bow/biopython/commit/34f873b2d21136487a0b925be898c52daa0cc61c). > > The code is from the ordereddict module from PyPI at that time. I haven't > checked if it's the same as the one shown in the link (there may have been > some updates), but it seems to work fine up to now. > > Hope this is useful :), > Bow Given the OrderedDict will be useful for Bio.PDB and Bow's SearchIO, that seems quite a good case for including it. How does this look (on the 'od' branch in my repository)? https://github.com/peterjc/biopython/commit/52b011aa8ddce06e636de776d8cea8e62845853f This differs from Bow's version in that I put the module in as a separate file (Bio/_ordereddict.py), and that it will prefer the ordereddict package if already installed (e.g. from PyPI). Peter From w.arindrarto at gmail.com Tue Sep 4 12:36:55 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 4 Sep 2012 14:36:55 +0200 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: On Tue, Sep 4, 2012 at 2:16 PM, Peter Cock wrote: > > On Tue, Sep 4, 2012 at 7:11 AM, Wibowo Arindrarto > wrote: > > Hi Peter, Jo?o, > > > > Just a little FYI. 
I ran into the OrderedDict issue when I started > > writing > > SearchIO a few months ago as well, so I added an OrderedDict > > implementation > > in Bio._py3k > > > > (https://github.com/bow/biopython/commit/34f873b2d21136487a0b925be898c52daa0cc61c). > > > > The code is from the ordereddict module from PyPI at that time. I > > haven't > > checked if it's the same as the one shown in the link (there may have > > been > > some updates), but it seems to work fine up to now. > > > > Hope this is useful :), > > Bow > > Given the OrderedDict will be useful for Bio.PDB and Bow's SearchIO, > that seems quite a good case for including it. How does this look > (on the 'od' branch in my repository)? > > > https://github.com/peterjc/biopython/commit/52b011aa8ddce06e636de776d8cea8e62845853f > > This differs from Bow's version in that I put the module in as a separate > file (Bio/_ordereddict.py), and that it will prefer the ordereddict > package > if already installed (e.g. from PyPI). > > Peter Hi Peter, This looks good. I like the 'ordereddict' module import check prior to using our bundled version. One more thing I would suggest is about the namespace. I feel that in the future, we may run into similar issues (non-Python3 compatibility issues) since Python2.7 deprecation is still a long way. Perhaps create a new subpackage in the root folder (maybe Bio._compat, but I don't have a strong preference), to keep code like this in one place? Or we could even put Bio._py3k under this subpackage and have one central place for compatibility-related code? This would prevent further root namespace clutter. regards, Bow From k.d.murray.91 at gmail.com Tue Sep 4 12:57:22 2012 From: k.d.murray.91 at gmail.com (Kevin Murray) Date: Tue, 4 Sep 2012 22:57:22 +1000 Subject: [Biopython-dev] TAIR/AGI support Message-ID: Hi All, What's the status of TAIR AGIs in BioPython (I can see no mention of them, or support for them)? 
I've written a brief module which allows a user to query NCBI with a TAIR AGI, returning a Seq object (via Efetch). Is there any interest in including such functionality in BioPython? More generally, are there any particular areas of BioPython development which could use an extra pair of hands? Regards Kevin Murray From anaryin at gmail.com Tue Sep 4 14:19:11 2012 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 4 Sep 2012 17:19:11 +0300 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: Guys, Looks great, I will try to 'cherry pick' that branch and merge it with mine. I have to solve some issues with the tests, but it seems to be a straightforward change. Cheers, Jo?o No dia 4 de Set de 2012 15:37, "Wibowo Arindrarto" escreveu: > On Tue, Sep 4, 2012 at 2:16 PM, Peter Cock > wrote: > > > > On Tue, Sep 4, 2012 at 7:11 AM, Wibowo Arindrarto > > wrote: > > > Hi Peter, Jo?o, > > > > > > Just a little FYI. I ran into the OrderedDict issue when I started > > > writing > > > SearchIO a few months ago as well, so I added an OrderedDict > > > implementation > > > in Bio._py3k > > > > > > ( > https://github.com/bow/biopython/commit/34f873b2d21136487a0b925be898c52daa0cc61c > ). > > > > > > The code is from the ordereddict module from PyPI at that time. I > > > haven't > > > checked if it's the same as the one shown in the link (there may have > > > been > > > some updates), but it seems to work fine up to now. > > > > > > Hope this is useful :), > > > Bow > > > > Given the OrderedDict will be useful for Bio.PDB and Bow's SearchIO, > > that seems quite a good case for including it. How does this look > > (on the 'od' branch in my repository)? 
> > > > > > > https://github.com/peterjc/biopython/commit/52b011aa8ddce06e636de776d8cea8e62845853f > > > > This differs from Bow's version in that I put the module in as a separate > > file (Bio/_ordereddict.py), and that it will prefer the ordereddict > > package > > if already installed (e.g. from PyPI). > > > > Peter > > Hi Peter, > > This looks good. I like the 'ordereddict' module import check prior to > using our bundled version. > > One more thing I would suggest is about the namespace. I feel that in > the future, we may run into similar issues (non-Python3 compatibility > issues) since Python2.7 deprecation is still a long way. Perhaps > create a new subpackage in the root folder (maybe Bio._compat, but I > don't have a strong preference), to keep code like this in one place? > Or we could even put Bio._py3k under this subpackage and have one > central place for compatibility-related code? This would prevent > further root namespace clutter. > > regards, > Bow > From p.j.a.cock at googlemail.com Tue Sep 4 14:42:35 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 4 Sep 2012 15:42:35 +0100 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: On Tue, Sep 4, 2012 at 3:19 PM, Jo?o Rodrigues wrote: > Guys, > > Looks great, I will try to 'cherry pick' that branch and merge it with mine. I've applied it to the master now, which might make it easier. I think Bow might have a point about namespaces - although the underscore modules are 'private', they still show up in dir(Bio) so having a single folder for our inter-Python version compatibility code seems sensible if we add any more (e.g. NamedTuples). > I have to solve some issues with the tests, but it seems to be a > straightforward change. Great. 
Peter From anaryin at gmail.com Tue Sep 4 16:02:42 2012 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 4 Sep 2012 19:02:42 +0300 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: I agree, we could move them to a folder then? No dia 4 de Set de 2012 17:42, "Peter Cock" escreveu: > On Tue, Sep 4, 2012 at 3:19 PM, Jo?o Rodrigues wrote: > > Guys, > > > > Looks great, I will try to 'cherry pick' that branch and merge it with > mine. > > I've applied it to the master now, which might make it easier. > I think Bow might have a point about namespaces - although the > underscore modules are 'private', they still show up in dir(Bio) > so having a single folder for our inter-Python version compatibility > code seems sensible if we add any more (e.g. NamedTuples). > > > I have to solve some issues with the tests, but it seems to be a > > straightforward change. > > Great. > > Peter > From p.j.a.cock at googlemail.com Tue Sep 4 23:54:56 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Sep 2012 00:54:56 +0100 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: On Tue, Sep 4, 2012 at 5:02 PM, Jo?o Rodrigues wrote: > I agree, we could move them to a folder then? > OK - I moved Bio/_py3k.py to Bio/_py3k/__init__.py and also the new file Bio/_ordereddict.py to Bio/_py3k/ordereddict.py - this avoids having to change any of our import statements: https://github.com/biopython/biopython/commit/1a9bd6eeab0de3283bd1e6cc28c7754fbffefe2d Peter From redmine at redmine.open-bio.org Wed Sep 5 03:19:53 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Wed, 5 Sep 2012 03:19:53 +0000 Subject: [Biopython-dev] [Biopython - Bug #3382] (New) Bio.PDB.PDBList.retrieve_pdb_file() fails for Python3 Message-ID: Issue #3382 has been reported by Alexander Campbell. 
---------------------------------------- Bug #3382: Bio.PDB.PDBList.retrieve_pdb_file() fails for Python3 https://redmine.open-bio.org/issues/3382 Author: Alexander Campbell Status: New Priority: Normal Assignee: Category: Target version: URL: At present, calling @Bio.PDB.PDBList.retrieve_pdb_file()@ on any PDB ID will fail, giving the following traceback:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in ()
----> 1 pdbl.retrieve_pdb_file('1FAT')

/usr/lib64/python3.2/site-packages/Bio/PDB/PDBList.py in retrieve_pdb_file(self, pdb_code, obsolete, compression, uncompress, pdir)
    245         gz = gzip.open(filename, 'rb')
    246         out = open(final_file, 'wb')
--> 247         out.writelines(gz.read())
    248         gz.close()
    249         out.close()

TypeError: 'int' does not support the buffer interface
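The failing call at line 247 of the traceback above can be exercised outside Biopython. What follows is a minimal sketch, assuming Python 3; the helper name `gunzip` and the file names are hypothetical and are not part of Bio.PDB. It decompresses a small gzip file with a single `write()` call and shows that iterating over a `bytes` object yields `int` values, which is what `writelines()` receives when handed the result of `read()`.

```python
import gzip
import os
import tempfile


def gunzip(src_path, dst_path):
    """Decompress src_path into dst_path with one write() call.

    Hypothetical helper for illustration only; not the Bio.PDB code.
    """
    with gzip.open(src_path, "rb") as gz, open(dst_path, "wb") as out:
        # read() returns a single bytes object, so write() is the matching call
        out.write(gz.read())


# Build a tiny compressed file to work on (contents are made up)
payload = b"HEADER    TEST STRUCTURE\nEND\n"
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "example.ent.gz")
out_path = os.path.join(workdir, "example.ent")
with gzip.open(gz_path, "wb") as handle:
    handle.write(payload)

gunzip(gz_path, out_path)
with open(out_path, "rb") as handle:
    restored = handle.read()

# On Python 3, iterating over bytes gives ints rather than 1-byte strings
first_item = next(iter(payload))
```

Passing `gz.read()` to `writelines()` instead would hand those `int` items to the binary file one at a time, which matches the TypeError shown in the traceback.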
This occurs because in Python3 a file opened in binary mode will return type @bytes@ for @read()@, or a list of type @bytes@ objects for @readlines()@. The @writelines()@ method expects an iterable where each element is of type @str@. This worked in Python2 as a @str@ can be viewed as a sequence of @str@ objects, and so line 247 effectively wrote one character at a time for the single @str@ yielded by @read()@. In Python3 iterating over a @bytes@ yields @int@ objects, leading to the TypeError.

This issue can be fixed by changing line 247's call to @writelines()@ to just @write()@. This does not break functionality in Python2, according to my testing with Python 3.2.3 and 2.7.3 on Fedora 17.

There are 4 more instances of @writelines()@ calls in the codebase, but in each of those cases the argument is a list or generator of @str@ or @bytes@ objects, so I don't think they will raise an error. I haven't tested them though.

----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

From w.arindrarto at gmail.com Wed Sep 5 09:53:36 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 5 Sep 2012 11:53:36 +0200
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To:
References: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>
Message-ID:

Hi guys,

If I may add my two cents on this issue, I think it's also a chance to rectify all other namespace issues that we may have (not just PEP8-related). For instance:

* In the root namespace, we now have Bio.Align and Bio.AlignIO.
Since we might be merging Bio.Seq and Bio.SeqIO into bio[py].seq (per the Github discussion[1]), I suppose we should do the same with Bio.Align as well (perhaps into bio[py].seq.align or bio[py].align). * With the change above, we might also want to change some of the submodule names completely. For example, if we merge Bio.Align into bio[py].align we'll have bio[py].align.applications, which I personally think could be shortened into bio[py].align.app. * As per the Github disscussion[1] as well, perhaps Bio.SeqUtils should also be merged as Seq object methods. There may be other changes as well, but the bottom line is all these changes will be quite considerable. As such, I think we could go all the way and be explicit in stating that the changes will be incompatible with previous Biopython versions (i.e. old scripts will break). As for bio.* and biopy.*, if we do decide to go all the way, bio.* seems like a better choice since there will be other incompatible changes anyway. But if we eventually decide to only fix PEP8-related issues while keeping compatibility with older versions, I'm leaning more towards biopy.*. regards, Bow [1] https://github.com/biopython/biopython/pull/63#issuecomment-8252340 On Tue, Sep 4, 2012 at 12:59 PM, Peter Cock wrote: > On Tue, Sep 4, 2012 at 11:27 AM, Michiel de Hoon wrote: >> Hi Peter, >> >> --- On Tue, 9/4/12, Peter Cock wrote: >>> One idea I was pondering is a new parallel namespace, >>> ideally bio.* but we can't use that due to case >>> insensitive file systems like Windows and (by default) >>> Mac OS X. So perhaps biopy, or bp? >> >> As you say, the ideal namespace is bio.*, so let's use >> that. We have been using Bio.* for more than 10 years. >> We should not get stuck with a non-ideal namespace for >> the next 10+ years because there may be some glitches >> switching from Bio.* to bio.*. Frankly I doubt that this >> will cause huge problems in practice. 
>
> So you'd advocate a simple switch where from one
> release to the next we change all the module names
> (making them lower case, perhaps from consolidation
> under bio.seq too)?
>
> This may cause some difficulties for upgrades - it may
> require manual intervention to remove the old Bio folder
> in order to allow creation of the new bio folder.
>
>>> We could gradually move code over to the new namespace,
>>> using imports to preserve back compatibility - but support
>>> both namespaces during a (long) transition period.
>>
>> Why do we need a transition period? It's just a matter
>> of replacing upper case with lower case in the imports.
>
> That forces people to update all their scripts at once.
> Of course, we can document how to do this so a script
> would work before and after the case change, e.g.
>
> try:
>     from bio.seq import Seq
> except ImportError:
>     from Bio.Seq import Seq
>
>>> What I like about this is it allows people to make a
>>> gradual conversion - and we don't have the burden of two
>>> main branches if we attempted a single jump to a Biopython v2.
>>>
>>> Does this seem worth considering?
>>
>> Yes but by all means, let's keep this simple. In the past,
>> changes to Biopython have very rarely caused any serious
>> problems for users.
>>
>> Best,
>> -Michiel.
>>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From anaryin at gmail.com Wed Sep 5 20:24:23 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 5 Sep 2012 23:24:23 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To:
References:
Message-ID:

Hello all,

Some news.

A. The OrderedDict implementation is quite slow. It essentially slows down the parser by 30%, rendering all the improvements I had done moot.
Therefore, although it's a great idea, a major reason for these updates is speed so I think it might not be worth it.

B. As an alternative to this, I implemented the following. Entity now has only child_dict, which is a general dictionary. However, each object (Model, Chain, Residue, Atom) gets its own __cmp__ method overloaded with the information in the "_sort" methods that already existed. In this way, a simple sorting of the values of the dictionary returns an ordered list. I tweaked Atom.__cmp__ to first sort N CA C O atoms and then alphabetically. I also added that inorganic atoms such as Calcium come at the end. This will make things a bit nicer when Calcium is involved for example. Finally, the only downside to this seems to be that we lose the order in which residues are inserted. I.e. if residue 151 is the first of the PDB file and all others range from 1-150, then this first 151 is going to be placed at the end when you iterate. However, from my experience and in my opinion, not only is this logical, but it also rarely happens in real PDB files.

C. I am strongly in favour of removing most (if not all) set/get methods and replacing them with direct attribute access. For instance, "atom.get_parent() --> atom.parent". Saves some space in the code and makes things more transparent.

D. I edited the PDBParser to tweak a few things, nothing major. The file handle is now treated as an iterator throughout the parsing and it should be more memory-friendly. The line counter is still preserved. I also added a test to make the get_header argument actually work.

E. General things here and there that I can't remember right now.

F. Unit tests are breaking everywhere. Checking why, but it all seems related to this sorting issue.
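
As a rough sketch of the ordering described in item B (backbone N, CA, C, O first, the remaining atoms alphabetically, inorganic atoms last), a sort key along these lines would do it. The function name, the (name, element) tuple layout and the set of inorganic elements are assumptions made for illustration, not the code on the branch; a key function is used instead of __cmp__ so the sketch also runs on Python 3.

```python
# Hypothetical sketch of the atom ordering from item B; not Bio.PDB code.

# Backbone atoms come first, in this fixed order.
BACKBONE_ORDER = {"N": 0, "CA": 1, "C": 2, "O": 3}

# Assumed, illustrative set of inorganic elements to push to the end.
INORGANIC_ELEMENTS = {"CA", "MG", "ZN", "FE", "NA", "K"}


def atom_sort_key(name, element):
    """Return a tuple that sorts atoms as: backbone, others, inorganic."""
    if element.upper() in INORGANIC_ELEMENTS:
        # The element is needed here: the name "CA" alone cannot tell
        # an alpha carbon from a calcium ion.
        return (2, 0, name)
    if name in BACKBONE_ORDER:
        return (0, BACKBONE_ORDER[name], name)
    # Everything else is ordered alphabetically by atom name.
    return (1, 0, name)


# (atom name, element) pairs in arbitrary input order.
atoms = [("CB", "C"), ("O", "O"), ("CA", "CA"), ("CA", "C"),
         ("N", "N"), ("C", "C"), ("CG", "C")]
ordered = sorted(atoms, key=lambda atom: atom_sort_key(*atom))
print(ordered)
```

Sorting the values of a plain dictionary with such a key gives a deterministic order without keeping a separate child list, at the cost of losing the file order, as noted in item B.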
Cheers,

João

From p.j.a.cock at googlemail.com Wed Sep 5 23:31:42 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 00:31:42 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To:
References:
Message-ID:

On Wed, Sep 5, 2012 at 9:24 PM, João Rodrigues wrote:
> Hello all,
>
> Some news.
>
> A. The OrderedDict implementation is quite slow. It essentially slows down
> the parser by 30%, rendering all the improvements I had done moot.
> Therefore, although it's a great idea, a major reason for these updates is
> speed so I think it might not be worth it.

Which Python was that? i.e. The OrderedDict from the standard lib (which I hope is optimised), or the back port (which might be slower).

> B. As an alternative to this, I implemented the following. Entity now has
> only child_dict, which is a general dictionary. However, each object (Model,
> Chain, Residue, Atom) gets its own __cmp__ method overloaded with the
> information in the "_sort" methods that already existed. In this way, a
> simple sorting of the values of the dictionary returns an ordered list. I
> tweaked Atom.__cmp__ to first sort N CA C O atoms and then
> alphabetically. I also added that inorganic atoms such as Calcium come at
> the end. This will make things a bit nicer when Calcium is involved for
> example. Finally, the only downside to this seems to be that we lose the
> order in which residues are inserted. I.e. if residue 151 is the first of the
> PDB file and all others range from 1-150, then this first 151 is going to be
> placed at the end when you iterate. However, from my experience and in my
> opinion, not only is this logical, but it also rarely happens in real PDB
> files.

That seems risky - but see if you can sort out what is happening with the unit tests (below).

I'm not sure about your atomic sorting... it seems a bit magic. Would sorting on atomic number be nicer (and simple)?

> C.
> I am strongly in favour of removing most (if not all) set/get methods and
> replacing them with direct attribute access. For instance, "atom.get_parent()
> --> atom.parent". Saves some space in the code and makes things more
> transparent.

It would also look less like Java code ;)

I like this plan - but initially define and document the new properties, and deprecate the old get/set methods. Without that you'll break almost every PDB-using script out there.

> D. I edited the PDBParser to tweak a few things, nothing major. The file
> handle is now treated as an iterator throughout the parsing and it should be
> more memory-friendly. The line counter is still preserved. I also added a
> test to make the get_header argument actually work.
>
> E. General things here and there that I can't remember right now.
>
> F. Unit tests are breaking everywhere. Checking why, but it all seems related
> to this sorting issue.
>
> Cheers,
>
> João

Regards,

Peter

From p.j.a.cock at googlemail.com Thu Sep 6 00:10:57 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 01:10:57 +0100
Subject: [Biopython-dev] Beta code in the official releases?
In-Reply-To: <1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org>
References: <877gsq8mn2.fsf@fastmail.fm> <1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org> <1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org>
Message-ID:

On Wed, Sep 5, 2012 at 8:19 PM, Sczesnak, Andrew wrote:
> Yeah, it would be great if this module could finally be included.
> I've e-mailed the list numerous times asking what would be
> necessary to include it and have done all you and Brad have
> asked. I've watched you include bits and pieces of code from
> other contributors quickly and without much scrutiny, so I
> can't help but feel singled out. What is the logic in delaying
> this? We've heard from people who are already using the
> code and have asked when it will be pulled.
> Is it serving the
> community to not even include the basic reader/writer? Am
> I wasting my time? Is it your goal to actively discourage
> contributions?

In my mind, the main technical issue regarding MAF and AlignIO and the common alignment object is the lack of a common way of handling the idea of start/end (and sometimes strand) for each sequence (in a consistent co-ordinate system using Python counting). Evidently I haven't managed to adequately convey my interpretation/concern.

Some file formats like EMBOSS' have these numbers explicitly but we're not parsing them:
http://lists.open-bio.org/pipermail/biopython/2012-September/008142.html

In the case of "fasta-m10" the numbers are stored in private properties as a 'short term' hack:
http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009744.html

Others like Stockholm have identifier/start-end as a combined name (but this is not mandatory). Here the start and end are being stored in the annotations dictionary (as unparsed strings, still using 1-based co-ordinates).

In MAF the start/end are explicit and much more important. It would be near pointless to parse the file ignoring these. Maybe your approach is good enough for MAF, and we should have adopted it as is, and delayed better integration with the other AlignIO formats?

i.e. This is a general limitation in AlignIO and the object model, somewhat annoying in the formats already supported, but information critical to the MAF format.

I was expecting a convention for this to fall out of Bow's GSoC work for 'pairwise alignments' in SearchIO - but the object model he came up with was not SeqRecord based (many of the file formats he was using didn't include sequences).

Right now my inclination is still to add a location property to the SeqRecord, usually a FeatureLocation, but it could also be the proposed CompoundLocation for more complex cases. The question then is if/when this would be propagated, e.g. SeqRecord slicing/addition.
http://lists.open-bio.org/pipermail/biopython-dev/2012-May/009646.html
http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html

So the wheels are turning, but slowly. I have not had as much time to dedicate to this as I would like - but other smaller or less inter-connected things are much easier to review and merge.

Peter

From p.j.a.cock at googlemail.com Thu Sep 6 00:34:19 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 01:34:19 +0100
Subject: [Biopython-dev] Replacing SeqFeature sub_features with compound feature locations
In-Reply-To:
References:
Message-ID:

On Tue, Jul 24, 2012 at 10:38 PM, Peter Cock wrote:
> On Tue, Jul 24, 2012 at 10:08 PM, Lenna Peterson wrote:
>> I agree that an "upgraded" FeatureLocation could be more
>> elegant.
>
> It could turn out to be simpler having just one location object...
> certainly worth trying out before committing this branch as is.

Such a new "upgraded" FeatureLocation would need to hold a list/tuple of its parts (rather like the proposed CompoundLocation), and those could simply be tuples of start, end, strand, db_ref etc (essentially everything currently held in a FeatureLocation). I'm not sure that that is any better than the new class CompoundLocation holding a list of existing FeatureLocation objects.

On the bright side, the branch still works nicely with the extra BioSQL tests I added.

One of the issues worth a bit more discussion is the start and end values of the CompoundLocation - which I am considering making act as the left/minimum and right/maximum boundary of the region spanned by the parts.

For normal forward strand features this does give the biological start and end, likewise for reverse strand features but inverted (location's start gives the biological end). i.e. for *most* features this means no change to the current behaviour.

My proposal would mean that for a feature spanning the origin on a circular genome of length N, the start would be 0 and the end N.
Similarly for weird cases from trans-splicing, the start/end coordinates would give the total region spanned. As shown below, sometimes that happens to match the current behaviour, but in other cases the current behaviour isn't useful anyway.

Adopting start/end as the spanned region makes a lot of sense for things like drawing features in a region of interest, or other more abstract tasks doing feature/region intersection. Here knowing the min/max boundaries of the region spanned is more useful than any attempt to capture the biological start/end of the feature.

Note that already for the simple FeatureLocation for reverse strand features we have start < end, i.e. the start coordinate property does NOT represent the biological starting point. Under the proposed CompoundLocation behaviour, the desirable property of the FeatureLocation that start < end would also hold for compound locations.

Pathological examples at the end,

Regards,

Peter

P.S. One of the advantages of the CompoundLocation is when constructing the location you don't give the overall start/end - these are inferred from the list of parts automatically. Currently the GenBank/EMBL parser is having to do this.

P.P.S. I've also confirmed Lenna's testing that sum of feature locations works if we define integer addition with locations (so that sum can include zero and several locations), see:
https://github.com/peterjc/biopython/commit/dc6bc658141cc42e7e6802bbe8baf6c87a6874c0

-----------------------------------------------------------------

Trans-splicing: Mixed Strands

An example where the range/span idea is simpler is mixed strand features like this trans-spliced example from NC_000932 (in our unit tests),

join(complement(69611..69724),139856..140650)

What would you expect as the start/end here? The biological start is base 69724 (one based) and the last base is 140650.
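
Under the spanned-region proposal the answer falls out of a min/max over the parts. The following is a minimal sketch only: plain (start, end, strand) tuples in Python counting stand in for the parts, the function name is made up for illustration, and the origin-spanning feature on a genome of length 1000 is an invented example. It is not the CompoundLocation code from the branch.

```python
# Hypothetical illustration of "start/end as the spanned region".
# Plain tuples stand in for location parts; this is not Bio.SeqFeature code.


def spanned_region(parts):
    """Return (start, end) covering all parts, using Python counting."""
    starts = [start for start, end, strand in parts]
    ends = [end for start, end, strand in parts]
    return min(starts), max(ends)


# join(complement(69611..69724),139856..140650) converted to Python counting.
mixed_strand = [(69610, 69724, -1), (139855, 140650, +1)]

# Invented origin-spanning feature on a circular genome of length N = 1000.
origin_spanning = [(950, 1000, +1), (0, 30, +1)]

print(spanned_region(mixed_strand))     # (69610, 140650)
print(spanned_region(origin_spanning))  # (0, 1000), i.e. start 0 and end N
```

The strand is carried along but deliberately ignored, since the span is meant for tasks like drawing or region intersection rather than for finding the biological start.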
Currently:

>>> from Bio import SeqIO
>>> f = SeqIO.read("NC_000932.gb", "gb").features[135]
>>> print f.location
[69610:140650]
>>> f.location.start
ExactPosition(69610)
>>> f.location.end
ExactPosition(140650)
>>> for sub in f.sub_features: print sub.location
...
[69610:69724](-)
[139855:140650](+)

Here the end value does match the last base in the feature following the biological order - the start value is actually a base in the middle of the combined sequence. In fact, for this example the start/end are already acting like the range/span idea.

-----------------------------------------------------------------

Trans-splicing: Reverse strand

The example above is a real corner case, and so is this single strand trans-splicing example, also in NC_000932, which is a bit like a circular genome origin spanning annotation:

complement(join(97999..98793,69611..69724))

With the current master branch:

>>> from Bio import SeqIO
>>> f = SeqIO.read("NC_000932.gb", "genbank").features[1]
>>> print f.location
[97998:69724](-)
>>> f.location.start
ExactPosition(97998)
>>> f.location.end
ExactPosition(69724)
>>> for sub in f.sub_features: print sub.location
...
[97998:98793](-)
[69610:69724](-)

Notice that we do not have start < end as you might expect. However the start and end DO capture the biological end and start (order inverted - this is on the reverse strand). To verify this I find it helps to transform the GenBank style location:

complement(join(97999..98793,69611..69724))

into the old EMBL equivalent:

join(complement(69611..69724),complement(97999..98793))

i.e. The first base is 69724 (one based counting), and the last base is 97999 (one based counting). So if you wanted to look at the upstream or downstream (assuming that makes sense for a trans-spliced gene), the current start/end values are useful (but you have to choose start vs end dependent on the strand).

On the other hand, the range of co-ordinate values is 69611 to 98793 (one based, inclusive).
Therefore one might expect start 69610 and end 98793 (Python counting), giving the spanned region.

From chapmanb at 50mail.com Thu Sep 6 00:37:57 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 05 Sep 2012 20:37:57 -0400
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To:
References: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com>
Message-ID: <87wr08x9y2.fsf@fastmail.fm>

Hi all;
I don't know if there's going to be a clean way around mucking up the API for older scripts if we make this change.

If we want to do this my thoughts would be:

- Use the 'bio' module since that's the cleanest.
- Hack together something that will remove old 'Bio' modules on install of the new version.
- Write a Biopython1to2 script that will fix the imports on older scripts to the new module structure.

However, my vote would be to stick with everything as is. I know we aren't PEP8 compliant but things aren't that awful that we need an upheaval. I wish Python library installs weren't so messy, so that we could do this more cleanly,
Brad

From chapmanb at 50mail.com Thu Sep 6 00:31:58 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 05 Sep 2012 20:31:58 -0400
Subject: [Biopython-dev] Beta code in the official releases?
In-Reply-To: <1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org>
References: <877gsq8mn2.fsf@fastmail.fm> <1F36894B170C114F9C902C20BC5129981AD23835@MSGWCDCPMB25.nyumc.org> <1F36894B170C114F9C902C20BC5129981AD239DA@MSGWCDCPMB25.nyumc.org>
Message-ID: <87zk54xa81.fsf@fastmail.fm>

Andrew;

> Yeah, it would be great if this module could finally be included. I've
> e-mailed the list numerous times asking what would be necessary to
> include it and have done all you and Brad have asked.
> I've watched you
> include bits and pieces of code from other contributors quickly and
> without much scrutiny, so I can't help but feel singled out. What is
> the logic in delaying this? We've heard from people who are already
> using the code and have asked when it will be pulled. Is it serving
> the community to not even include the basic reader/writer? Am I
> wasting my time? Is it your goal to actively discourage contributions?

In addition to Peter's technical comments, from a personal side I hope you don't take offense. We definitely value contributions and your work. Some changes can end up being tricky because of the need to work with or fix previous non-optimal design decisions. When they require extra attention and decisions this can make it hard to allocate time for folks that volunteer on the project.

This is definitely nothing personal and I hope you don't feel that way. My GFF parser has languished for even longer for similar reasons. I think the long term solution for this is incorporating beta code so we can get these in, recognize the contributions, make them available, and still give wiggle room to improve the design before locking into an API that we need to support long term.

Thanks again for all the work. We do appreciate it,
Brad

From chapmanb at 50mail.com Thu Sep 6 00:45:19 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 05 Sep 2012 20:45:19 -0400
Subject: [Biopython-dev] TAIR/AGI support
In-Reply-To:
References:
Message-ID: <87txvcx9ls.fsf@fastmail.fm>

Kevin;
Thanks for the e-mail and offers of code. Always happy to have other folks involved with the project.

> What's the status of TAIR AGIs in BioPython (I can see no mention of them,
> or support for them)? I've written a brief module which allows a user to
> query NCBI with a TAIR AGI, returning a Seq object (via Efetch). Is there
> any interest in including such functionality in BioPython?
Is the code available on GitHub to get a better sense of all the functionality it supports? Do you have an idea where it would fit best? As a tair submodule inside of Bio.Entrez, or somewhere else?

> More generally, are there any particular areas of BioPython development
> which could use an extra pair of hands?

Following the mailing list for discussions on current projects is the best way to get a sense of what different folks are working on. The issue tracker also has open issues and features that could use attention if anything there strikes your fancy:

https://redmine.open-bio.org/projects/biopython

Hope this helps,
Brad

From p.j.a.cock at googlemail.com Thu Sep 6 00:57:19 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 01:57:19 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To: <87wr08x9y2.fsf@fastmail.fm>
References: <1346754477.76801.YahooMailClassic@web164001.mail.gq1.yahoo.com> <87wr08x9y2.fsf@fastmail.fm>
Message-ID:

On Thu, Sep 6, 2012 at 1:37 AM, Brad Chapman wrote:
>
> Hi all;
> I don't know if there's going to be a clean way around mucking up the
> API for older scripts if we make this change.
>
> If we want to do this my thoughts would be:
>
> - Use the 'bio' module since that's the cleanest.
> - Hack together something that will remove old 'Bio' modules on install
>   of the new version.
> - Write a Biopython1to2 script that will fix the imports on older
>   scripts to the new module structure.

I really don't like using "bio" since (due to Python's use of folders for package names) you couldn't in general also have the old code available under "Bio". i.e. This forces a hard switch on our users which is a very bad idea I think.

Thus my suggestion of something else like "biopy" (although the Mac's autocorrection keeps turning it into biopsy which would be annoying - grin), or if not already taken "bp".
To expand on my earlier email, the transition structure I had in mind was that we'd have something like this:

biopy/seq/__init__.py - real code for Seq object etc
Bio/Seq/__init__.py - just "from biopy.seq import Seq" and a deprecation warning.

> However, my vote would be to stick with everything as is. I know we
> aren't PEP8 compliant but things aren't that awful that we need an
> upheaval. I wish Python library installs weren't so messy that we could
> do this more cleanly,
> Brad

That does seem safer, and we can still do the less invasive restructuring discussed, e.g. Bio/Seq.py -> Bio/Seq/__init__.py allowing us to (gradually) move Bio.Seq* things under Bio.Seq, while preserving the legacy imports under a deprecation warning.

Also if we're considering moving Bio.SeqIO to Bio.Seq, as Bow points out, we'd want to do Bio/AlignIO.py -> Bio.Align (perhaps pushing the core objects into Bio/Align/_objects.py or similar but exposing them in the current namespace location).

Regards,

Peter

From p.j.a.cock at googlemail.com Thu Sep 6 01:34:50 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 02:34:50 +0100
Subject: [Biopython-dev] SeqRecord locations; was: Beta code in the official releases?
Message-ID:

On Thu, Sep 6, 2012 at 1:10 AM, Peter Cock wrote:
>
> In my mind, the main technical issue regarding MAF and AlignIO
> and the common alignment object is the lack of a common way
> of handling the idea of start/end (and sometimes strand) for
> each sequence (in a consistent co-ordinate system using Python
> counting). Evidently I haven't managed to adequately convey my
> interpretation/concern.
>
> Some file formats like EMBOSS' have these numbers explicitly
> but we're not parsing them:
> http://lists.open-bio.org/pipermail/biopython/2012-September/008142.html
>
> In the case of "fasta-m10" the numbers are stored in private
> properties as a 'short term' hack:
> http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009744.html
>
> Others like Stockholm have identifier/start-end as a combined
> name (but this is not mandatory). Here the start and end are
> being stored in the annotations dictionary (as unparsed strings,
> still using 1-based co-ordinates).
>
> In MAF the start/end are explicit and much more important.
> It would be near pointless to parse the file ignoring these.
> Maybe your approach is good enough for MAF, and we
> should have adopted it as is, and delayed better integration
> with the other AlignIO formats?
>
> i.e. This is a general limitation in AlignIO and the object
> model, somewhat annoying in the formats already supported,
> but information critical to the MAF format.
>
> I was expecting a convention for this to fall out of Bow's GSoC
> work for 'pairwise alignments' in SearchIO - but the object
> model he came up with was not SeqRecord based (many
> of the file formats he was using didn't include sequences).
>
> Right now my inclination is still to add a location property to
> the SeqRecord, usually a FeatureLocation, but it could also
> be the proposed CompoundLocation for more complex cases.
> The question then is if/when this would be propagated, e.g.
> SeqRecord slicing/addition.
> http://lists.open-bio.org/pipermail/biopython-dev/2012-May/009646.html
> http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html
>
> So the wheels are turning, but slowly. I have not had as
> much time to dedicate to this as I would like - but other
> smaller or less inter-connected things are much easier to
> review and merge.
To expand on the SeqRecord.location property idea, I am thinking about (in the typical use cases) using a normal FeatureLocation object (from Bio.SeqFeature) where the start, end or strand are in the same co-ordinate system as the sequence of the SeqRecord. i.e. For a protein fragment, they would be in amino acids. For a nucleotide fragment, they would be in base pairs.

Note that you might want to describe the CDS region for a protein sequence (which would be possible even for a join using the proposed CompoundLocation), so maybe 'location' is the wrong name here, perhaps 'fragment' or 'subregion', or something is clearer?

When I talked about adding SeqRecords, and what would the combined SeqRecord's location be, we could use FeatureLocation addition (as defined on the branch for CompoundLocation objects).

For slicing a SeqRecord, provided len(record.location) == len(record), this is well defined. However, I expect that quite often if used for alignments, what we will have instead is len(record.location) == len(record.seq.ungapped()) so we might be able to update the sub-record's location if we count the gap characters and factor them in. This equality could be verified in the SeqRecord __init__ (which would require the gap character, but the AlignIO parsers should all set that).

I would like slicing to update the start/end because slicing alignment objects seems to be a quite common operation - so if you started from an alignment file using start/end (like Stockholm or MAF) it would be good to update these fields for the sub-alignment.

This feels like it would work, but would it be useful or just over-engineering? Would a simple static location property which is not automatically propagated in SeqRecord manipulations be enough (at least initially)?
If so, is Brad's suggestion to just use special values in the annotations dictionary a simpler way forward (where we already have policies in place for handling generic annotation during SeqRecord annotation - in general dropping it)? If so, would this be keys 'start', 'end', 'strand' for integer start and end using Python counting, and a strand value of +1 or -1 for forward and reverse? [We could use strand None for unavailable as in the SeqFeature location object, but I think no entry in the dictionary is nicer here].

Peter

From anaryin at gmail.com Thu Sep 6 05:52:34 2012
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 6 Sep 2012 08:52:34 +0300
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To:
References:
Message-ID:

Hey,

> Which Python was that? i.e. The OrderedDict from the standard lib
> (which I hope is optimised), or the back port (which might be slower).

Both. I also found it strange and googled it. Apparently OrderedDict is pure Python, not C like dict, thus the difference.

> That seems risky - but see if you can sort out what is happening
> with the unit tests (below).

What Bio.PDB does right now is rely on the list to iterate over things. Thus, you get the order in which you read the PDB file. However, if you sort it using the sort methods of the various objects you will get the following rules:

Atom.py - N CA C O first, then alphabetically
Residue.py - First amino acids and nucleic acids, then heteroatoms.
Chain.py - Empty chains last.

These are already in place somewhere in the code. I just used them to overload the __cmp__ method, with a couple of additions because I personally disagree with the following:

Atom.py - Inorganic atoms should come out last. For simplicity.
Residue.py - If the PDB order is 151 MSE, 152 VAL, 153 CYS, you should get in return when you iterate: 151, 152, 153. Right now you get 152, 153, 151.
PDB files already have weird large numbers for water and ions for example, so these come out last anyway. Pushing all HETATMs to the end will sometimes disrupt the "natural" order of things, for instance modified residues. Magic perhaps :) I sorted out all relevant issues with the unittests. I had a small problem with build_peptides because of this HETATM last rule, so I took it away and now it works. All tests pass except 4: 2 because of the header, which is not read decently right now, and 2 because of the ordering which is explicit in the assert statement of the test. So it's a matter of changing these assertions and they will work. It would also look less like Java code ;) > > I like this plan - but initially define and document the new properties, > and deprecate the old get/set properties. Without that you'll break > almost every PDB using script out there. > How do I deprecate the old ones? Is there a DeprecationWarning or so? Just a reminder, if you want to test/check the code, it's on my github . Cheers, Jo?o From w.arindrarto at gmail.com Thu Sep 6 05:57:04 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 6 Sep 2012 07:57:04 +0200 Subject: [Biopython-dev] SeqRecord locations; was: Beta code in the official releases? In-Reply-To: References: Message-ID: Hi guys, To add my two cents, I am in favor of creating a dynamic SeqRecord coordinate system using SeqFeature. However, I think it would also be good if we set some limitations as there are so many ways that slicing and addition could be used to create new SeqRecords, and anticipating all these scenarios may create an over-engineered (and probably slower) SeqRecord. Some scenarios that I can think now: 1. Slicing SeqRecord objects using step values > 1 (e.g. new_seq = seq[1:120:3]) 2. Adding two or more SeqRecord objects with noncontiguous coordinate (i.e. 
end coordinate of the first sequence is not directly followed by the second sequence's start coordinate), and then slice the resulting object So maybe some limitations that we could set are: 1. Only update the coordinates if slicing step is 1 (or -1), otherwise discard it. 2. Only update the coordinates if addition is between contiguous coordinates, otherwise discard it. Personally, I think this would cover most use cases for slicing while allowing us to keep it simple. As for the name, 'region' sounds better than 'location'. Maybe 'coverage'? I don't have any strong preference between these, but 'subregion' doesn't feel that nice. Finally, for the coordinate system, I imagine it will use Python's coordinate system, too? (zero-based, half-open, and the parsers / writers should do the conversion). Should we also reverse the coordinates if the objects are sliced in reverse (e.g. seqrecord[::-1]) or simply inverse the strand value but keep the coordinates unchanged? regards, Bow On Thu, Sep 6, 2012 at 3:34 AM, Peter Cock wrote: > On Thu, Sep 6, 2012 at 1:10 AM, Peter Cock wrote: >> >> In my mind, the main technical issue regarding MAF and AlignIO >> and the common alignment object is the lack of a common way >> of handling the idea of start/end (and sometimes strand) for >> each sequence (in a consistent co-ordinate system using Python >> counting). Evidently I haven't manage to adequately convey my >> interpretation/concern. >> >> Some file formats like EMBOSS' have these number explicitly >> but we're not parsing them: >> http://lists.open-bio.org/pipermail/biopython/2012-September/008142.html >> >> In the case of "fasta-m10" the numbers are stored in private >> properties as a 'short term' hack: >> http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009744.html >> >> Others like Stockholm have identifier/start-end as a combined >> names (but this is not mandatory). 
Here the start and end are >> being stored in the annotations dictionary (as unparsed strings, >> still using 1-based co-ordinates). >> >> In MAF the start/end are explicit and much more important. >> It would be near pointless to parse the the file ignoring these. >> Maybe your approach is good enough for MAF, and we >> should have adopted it as is, and delayed better integration >> with the other AlignIO formats? >> >> i.e. This is a general limitation in AlignIO and the object >> model, somewhat annoying in the formats already supported, >> but information critical to the MAF format. >> >> I was expecting a convention for this to fall out of Bow's GSoC >> work for 'pairwise alignments' in SearchIO - but the object >> model he came up with was not SeqRecord based (many >> of the file formats he was using didn't include sequences). >> >> Right now my inclination is still to add a location property to >> the SeqRecord, usually a FeatureLocation, but it could also >> be the proposed CompoundLocation for more complex cases. >> The question then is if/when this would be propagated, e.g. >> SeqRecord slicing/addition. >> http://lists.open-bio.org/pipermail/biopython-dev/2012-May/009646.html >> http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html >> >> So the wheels are turning, but slowly. I have not had as >> much time to dedicate to this as I would like - but other >> smaller or less inter-connected things are much easer to >> review and merge. > > To expand on the SeqRecord.location property idea, I am > thinking about (in the typical use cases) using a normal > FeatureLocation object (from Bio.SeqFeature) where the > start, end or strand are in the same co-ordinate system > as the sequence of the SeqRecord. > > i.e. For a protein fragment, they would be in amino acids. > For a nucleotide fragment, they would be in base pairs. 
> > Note that you might want to describe the CDS region > for a protein sequence (which would be possible even > for a join using the proposed CompoundLocation), so > maybe 'location' is the wrong name here, perhaps > 'fragment' or 'subregion', or something is clearer? > > When I talked about adding SeqRecords, and what would > the combined SeqRecord's location be, we could use > FeatureLocation addition (as defined on the branch for > CompoundLocation objects). > > For slicing a SeqRecord, provided len(record.location) > == len(record), this is well defined. However, I expect > that quite often if used for alignments, what we will have > instead is len(record.location) = len(record.seq.ungapped()) > so we might be able to update the sub-record's location > if we count the gap characters and factor them in. This > equality could be verified in the SeqRecord __init__ > (which would require the gap character, but the AlignIO > parsers should all set that). > > I would like slicing to update the start/end because > slicing alignment objects seems to be a quite common > operation - so if you started from an alignment file > using start/end (like Stockholm or MAF) it would be > good to update these fields for the sub-alignment. > > This feels like it would work, but would it be useful or > just over engineering? Would a simple static location > property which is not automatically propagated in > SeqRecord manipulations be enough (at least initially)? > > If so, is Brad's suggestion to just use special values in > the annotations dictionary a simpler way forward (where > we already have policies in place for handling generic > annotation during SeqRecord annotation - in general > dropping it)? > > If so, would this be keys 'start', 'end', 'strand' for > integer start and end using Python counting, and > a strand value of +1 or -1 for forward and reverse? 
> [We could use strand None for unavailable as in > the SeqFeature location object, but I think no entry > in the dictionary is nicer here]. > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From mjldehoon at yahoo.com Thu Sep 6 06:31:57 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 5 Sep 2012 23:31:57 -0700 (PDT) Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: Message-ID: <1346913117.35905.YahooMailClassic@web164006.mail.gq1.yahoo.com> [Brad] > Hack together something that will remove old 'Bio' modules > on install of the new version. We could check in setup.py if we can import Bio, and ask the user to remove the old Biopython installation before proceeding. Since we can tell the user exactly which directory to remove, this would be straightforward. I would prefer this to removing the directory automatically. [Peter] > I really don't like using "bio" since (due to Python's use > of folders for package names) you couldn't in general also > have the old code available under "Bio". i.e. This forces > a hard switch on our users which is a very bad idea I think. I don't see why a user would like to have both an old Biopython under Bio and a new Biopython under bio. Unless he wants to run some scripts with the old Biopython and other scripts with the new Biopython, but I don't see the point of that. [Peter] > Thus my suggestion of something else like "biopy" [...] > , or if not already taken "bp". [Brad] > However, my vote would be to stick with everything as is. If the choice is between "bp", "biopy", or "Bio", then I agree with Brad; I prefer keeping a nice but PEP8-noncompliant module name "Bio" rather than switching to a PEP8-compliant but less attractive name like "biopy" or "bp". Best, -Michiel. 
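A minimal sketch of the setup.py check described above, assuming a modern Python where importlib.util.find_spec is available (the 2012 code base would have needed a plain "import Bio" instead). The helper names find_existing_package and old_install_message are hypothetical and not part of Biopython's setup.py.

```python
import importlib.util
import os


def find_existing_package(name):
    """Return the directory of an already-installed package, or None."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None or not spec.origin:
        return None
    return os.path.dirname(os.path.abspath(spec.origin))


def old_install_message(name="Bio"):
    """Build the text asking the user to remove an old installation.

    Nothing is deleted here: the user is only told which directory to remove.
    """
    location = find_existing_package(name)
    if location is None:
        return None
    return ("An existing '%s' package was found in %s. "
            "Please remove that directory before installing." % (name, location))
```

One caveat with this approach: when setup.py is run from inside the source tree, the lookup may find the local source directory rather than a previously installed copy, so a real check would probably need to exclude the current directory first.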
From p.j.a.cock at googlemail.com Thu Sep 6 07:06:07 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Sep 2012 08:06:07 +0100 Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: <1346913117.35905.YahooMailClassic@web164006.mail.gq1.yahoo.com> References: <1346913117.35905.YahooMailClassic@web164006.mail.gq1.yahoo.com> Message-ID: On Thu, Sep 6, 2012 at 7:31 AM, Michiel de Hoon wrote: > [Brad] >> Hack together something that will remove old 'Bio' modules >> on install of the new version. > > We could check in setup.py if we can import Bio, and ask > the user to remove the old Biopython installation before > proceeding. Since we can tell the user exactly which directory > to remove, this would be straightforward. I would prefer this > to removing the directory automatically. I agree automatically removing the old install is risky. For single user machines, where the single user has only a small collection of scripts this isn't such an issue. For any shared server, or user with lots of Biopython scripts (some of which may have been written by different people), you would be forced into a mass change at one go. You would also have considerable hassle later on with any attempt to re-run old scripts. > [Peter] >> I really don't like using "bio" since (due to Python's use >> of folders for package names) you couldn't in general also >> have the old code available under "Bio". i.e. This forces >> a hard switch on our users which is a very bad idea I think. > > I don't see why a user would like to have both an old > Biopython under Bio and a new Biopython under bio. > Unless he wants to run some scripts with the old Biopython > and other scripts with the new Biopython, but I don't see > the point of that. Really? That is exactly what I am concerned about (both for single user machines like my desktop, and shared machines like our servers). How about the common situation of wanting to re-run old scripts from old projects on new data? 
If we were just changing the case, this might not be too complex (it would still be a frustrating transition period), but if we're also moving things around at the same time it is too much I feel.

> [Peter]
>> Thus my suggestion of something else like "biopy" [...]
>> , or if not already taken "bp".
>
> [Brad]
>> However, my vote would be to stick with everything as is.
>
> If the choice is between "bp", "biopy", or "Bio", then
> I agree with Brad; I prefer keeping a nice but
> PEP8-noncompliant module name "Bio" rather than
> switching to a PEP8-compliant but less attractive
> name like "biopy" or "bp".

There is 'biopython' but it is rather long? No other ideas from anyone else?

How about over the next year we gradually consolidate modules under the existing mixed case names? e.g. move Bio.AlignIO functionality and Bio.Align, and Bio.Seq* under Bio.Seq (leaving backwards compatible imports supported but deprecated).

Here's a further (and slightly more radical) idea: We stick with using 'Bio' and the current mixed case names on Python 2, but adopt 'bio' and other PEP8 compatible names for Python 3 (as a uniform strict automatic rule: mixed case -> lower case)? i.e. Do this as part of our 2to3 process.

Some nasty downside might occur to me later but right now it seems like a neat idea... other than not being quite in line with the expectation that Python 3 should not be used as an excuse to make API changes. Too radical?

Regards,

Peter

From p.j.a.cock at googlemail.com Thu Sep 6 07:16:41 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 08:16:41 +0100
Subject: [Biopython-dev] SeqRecord locations; was: Beta code in the official releases?
In-Reply-To:
References:
Message-ID:

On Thu, Sep 6, 2012 at 6:57 AM, Wibowo Arindrarto wrote:
> Hi guys,
>
> To add my two cents, I am in favor of creating a dynamic SeqRecord
> coordinate system using SeqFeature.
However, I think it would also be
> good if we set some limitations as there are so many ways that slicing
> and addition could be used to create new SeqRecords, and anticipating
> all these scenarios may create an over-engineered (and probably
> slower) SeqRecord.
>
> Some scenarios that I can think now:
>
> 1. Slicing SeqRecord objects using step values > 1
> (e.g. new_seq = seq[1:120:3])

Absolutely - here I would expect to lose the location information. We already have similar restrictions in the SeqRecord slicing for how SeqFeatures are handled.

> 2. Adding two or more SeqRecord objects with noncontiguous coordinate
> (i.e. end coordinate of the first sequence is not directly followed by
> the second sequence's start coordinate), and then slice the resulting
> object

Adding *could* be done via the CompoundLocation, although that in itself might want to consider if nicely-abutting locations should be merged, e.g. in GenBank notation 100..201 and 202..300 could be 100..300 rather than join(100..201,202..300) which is what my CompoundLocation code currently does.

> So maybe some limitations that we could set are:
>
> 1. Only update the coordinates if slicing step is 1 (or -1), otherwise
> discard it.

Yep.

> 2. Only update the coordinates if addition is between contiguous
> coordinates, otherwise discard it.

That does seem simple - especially as the primary driver for this is multiple sequence alignments and those only support simple continuous locations with a start and end.

> Personally, I think this would cover most use cases for slicing while
> allowing us to keep it simple.

That is perhaps a good balance (and as a bonus means we don't have to link this to the CompoundLocation unless we want to).

> As for the name, 'region' sounds better than 'location'. Maybe
> 'coverage'? I don't have any strong preference between these, but
> 'subregion' doesn't feel that nice.

Region seems fine.
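The two limitations discussed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code on any Biopython branch: regions are plain (start, end) tuples in Python counting (zero-based, half-open), only the forward strand is handled, and the function names add_regions and slice_region are hypothetical.

```python
def add_regions(first, second):
    """Combine two (start, end) regions, or return None if not contiguous.

    With half-open coordinates, two regions abut exactly when the end of
    the first equals the start of the second.
    """
    start1, end1 = first
    start2, end2 = second
    if end1 != start2:
        return None  # gap or overlap: discard the region
    return (start1, end2)


def slice_region(region, length, index):
    """Region for record[index], or None when it cannot be updated simply.

    length is len(record); the region must describe the whole record.
    """
    start, end = region
    if end - start != length:
        return None
    begin, stop, step = index.indices(length)
    if step != 1:
        return None  # e.g. seq[1:120:3] loses the region
    stop = max(begin, stop)  # an empty slice gives an empty region
    return (start + begin, start + stop)
```

In GenBank notation 100..201 and 202..300 become (99, 201) and (201, 300) here, so add_regions gives (99, 300), i.e. 100..300. Slicing with step -1 and reverse-strand regions are deliberately left out of this sketch.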
> Finally, for the coordinate system, I imagine it will use Python's
> coordinate system, too? (zero-based, half-open, and the parsers /
> writers should do the conversion).

Yes. I'm suggesting using the FeatureLocation object (from Bio.SeqFeature), which does this.

> Should we also reverse the
> coordinates if the objects are sliced in reverse (e.g.
> seqrecord[::-1]) or simply inverse the strand value but keep the
> coordinates unchanged?

The strand changes, and the start/end must also be recalculated from the length of the parent sequence. The FeatureLocation has a (private) _flip method to do this. In some cases we won't have the parent sequence length, so would have to drop the location.

I'll have a go at implementing this on a branch in the next few hours (unless something more pressing comes up at the BioHackathon). As it happens this overlaps nicely with some of the group discussion about how to represent feature locations in RDF.

Regards,

Peter

From p.j.a.cock at googlemail.com Thu Sep 6 07:21:16 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 6 Sep 2012 08:21:16 +0100
Subject: [Biopython-dev] Optimization of PDBParser and friends
In-Reply-To:
References:
Message-ID:

On Thu, Sep 6, 2012 at 6:52 AM, João Rodrigues wrote:
>
>> It would also look less like Java code ;)
>>
>> I like this plan - but initially define and document the new properties,
>> and deprecate the old get/set properties. Without that you'll break
>> almost every PDB using script out there.
>
> How do I deprecate the old ones? Is there a DeprecationWarning or so?
>

Yes, we use Bio.BiopythonDeprecationWarning rather than the default DeprecationWarning because the latter is now silent by default.
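As a rough sketch of that pattern (not code taken from Bio.PDB): a new property is defined and the old getter stays in place but warns. The warning class below is a local stand-in so the example runs without Biopython installed; in Biopython itself it would be imported from Bio. The toy Atom class and its name/get_name pair are assumptions made for illustration only.

```python
import warnings


class BiopythonDeprecationWarning(Warning):
    """Stand-in for Bio.BiopythonDeprecationWarning (assumption for this sketch)."""


class Atom(object):
    """Toy class for illustration, not the real Bio.PDB.Atom."""

    def __init__(self, name):
        self._name = name

    @property
    def name(self):
        """New-style attribute access."""
        return self._name

    def get_name(self):
        """Old-style getter, kept for backwards compatibility."""
        warnings.warn(
            "get_name() is deprecated; use the .name property instead.",
            BiopythonDeprecationWarning,
            stacklevel=2,  # report the caller's line, not this method
        )
        return self._name
```

Scripts that only use the property never trigger the warning; only calls to the old getter do.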
Grep the code for example usage, see also: http://biopython.org/wiki/Deprecation_policy Regards, Peter From p.j.a.cock at googlemail.com Thu Sep 6 09:36:41 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 6 Sep 2012 10:36:41 +0100 Subject: [Biopython-dev] SeqRecord locations; was: Beta code in the official releases? In-Reply-To: References: Message-ID: On Thu, Sep 6, 2012 at 8:16 AM, Peter Cock wrote: > > I'll have a go at implementing this on a branch in the next > few hours (unless something more pressing comes up at > the BioHackathon). As it happens this overlaps nicely with > some of the group discussion about how to represent feature > locations in RDF. > I've made a start, will do more later: https://github.com/peterjc/biopython/tree/sr_loc Peter From mjldehoon at yahoo.com Thu Sep 6 10:13:38 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 6 Sep 2012 03:13:38 -0700 (PDT) Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: Message-ID: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com> --- On Thu, 9/6/12, Peter Cock wrote: > For any shared server, [...] you > would be forced into a mass change at one go. OK, for multiple users on a shared server I see your point. > Here's a further (and slightly more radical) idea: We > stick with using 'Bio' and the current mixed case > names on Python 2, but adopt 'bio' and other PEP8 > compatible names for Python 3 (as a uniform > strict automatic rule: mixed case -> lower case)? > i.e. Do this as part of our 2to3 process. The Python developers argue against combining a switch to Python 3 with other major changes, since then if bugs arise it is unclear if it is due to the switch to Python 3 or due to the other changes. But perhaps it's OK if we have one Bio.* version for Python 2 and one bio.* version for Python 3 that are otherwise completely identical to each other. 
> How about over the next year we gradually consolidate > modules under the existing mixed case names? e.g. > move Bio.AlignIO functionality and Bio.Align, I guess you meant "merge Bio.AlignIO functionality into Bio.Align". > and Bio.Seq* under Bio.Seq (leaving backwards compatible > imports supported but deprecated). Sounds good to me. AFAIAC, we don't need to do this gradually over the next year. May as well do it for the next release. -Michiel. From anaryin at gmail.com Thu Sep 6 13:48:51 2012 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 6 Sep 2012 16:48:51 +0300 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: Ok, thanks. The modules are littered with set/get methods and adding DeprecationWarning to all of them might be a bit too much.. Instead, should we add one single warning at the top of the PDBParser, since this is the only obligatory module for Bio.PDB so that everyone gets the warning message once and once only? Otherwise I can imagine several warnings popping up everywhere.. Cheers, Jo?o From eric.talevich at gmail.com Thu Sep 6 14:17:03 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 6 Sep 2012 10:17:03 -0400 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: On Thu, Sep 6, 2012 at 1:52 AM, Jo?o Rodrigues wrote: > > What Bio.PDB does right now is rely on the list to iterate over things. > Thus, you get the order in which you read the PDB file. However, if you > sort it using the several Objects sort method you will get the following > rules: > > Atom.py - N CA C O first, then alphabetically > Residue.py - First aminoacids and nucleic acids, then heteroatoms. > Chain.py - Empty chains last. > > These are already in place somewhere in the code. 
I just used them to > overload the __cmp__ method, with a couple of additions because I > personally disagree with the following: > > Atom.py - Inorganic atoms should come out last. For simplicity. > Residue.py - If the PDB order is 151 MSE, 152 VAL, 153 CYS, you should get > in return when you iterate: 151, 152, 153. Right now you get 152, 153, 151. > PDB files already have weird large numbers for water and ions for example, > so these come out last anyway. Pushing all HETATMs to the end will > sometimes disrupt the "natural" order of things, for instance modified > residues. Magic perhaps :) > > Here's another edge case to think about: 3BEG. The enzyme is chain A, starting from residue number 69; the substrate peptide is chain B; and then after listing the atoms for chain B they jump back to chain A and add the three ligands as individual residues, with residue numbers 1, 2 and 3, on HETATM lines. The current PDBParser complains about this structure but parses it so that the extra HETATM residues are at the end of chain A's child_list. If I were to try to generate a polypeptide sequence from each of the chains in this structure, I think I'd want to just ignore the three extra residues, rather than list them as the first three residues of the peptide as "SAX". How do you think this should be handled? Maybe treat in-sequence modified residues differently from out-of-sequence HETATMs? -E From eric.talevich at gmail.com Thu Sep 6 14:40:13 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 6 Sep 2012 10:40:13 -0400 Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com> References: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com> Message-ID: On Thu, Sep 6, 2012 at 6:13 AM, Michiel de Hoon wrote: > --- On Thu, 9/6/12, Peter Cock wrote: > > For any shared server, [...] you > > would be forced into a mass change at one go. 
> > OK, for multiple users on a shared server I see your point. True, and old scripts/pipelines have a way of sticking around, especially once they've been shared with others in the lab. > Here's a further (and slightly more radical) idea: We > > stick with using 'Bio' and the current mixed case > > names on Python 2, but adopt 'bio' and other PEP8 > > compatible names for Python 3 (as a uniform > > strict automatic rule: mixed case -> lower case)? > > i.e. Do this as part of our 2to3 process. > > The Python developers argue against combining a switch to Python 3 with > other major changes, since then if bugs arise it is unclear if it is due to > the switch to Python 3 or due to the other changes. But perhaps it's OK if > we have one Bio.* version for Python 2 and one bio.* version for Python 3 > that are otherwise completely identical to each other. > Agreed, since the bio.* version is generated by the 2to3 script it should still be easy enough to distinguish "this is a bug in the library" from "this is a problem with Py3, 2to3 or your environment". The extra separation on the filesystem provided by Py2/Py3 should also prevent some problems with case-insensitivity and the environment. > > How about over the next year we gradually consolidate > > modules under the existing mixed case names? e.g. > > move Bio.AlignIO functionality and Bio.Align, > > I guess you meant "merge Bio.AlignIO functionality into Bio.Align". > > > and Bio.Seq* under Bio.Seq (leaving backwards compatible > > imports supported but deprecated). > > Sounds good to me. AFAIAC, we don't need to do this gradually over the > next year. May as well do it for the next release. > > Doing this in a single release might be better, so we can document/remember the release number when the Grand Reshuffling took place and troubleshoot users' resulting problems more easily. Should we call that Biopython 2.0.0 and switch to semantic version numbers? 
From anaryin at gmail.com Thu Sep 6 14:51:11 2012 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 6 Sep 2012 17:51:11 +0300 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: Well... :) If this is what the authors put in.. well, that's just it. The parser should not be an interpreter. However, when building peptides, you should get two peptides: the ALA-SEP, and the protein chain A. And I think this is what you will get. Also, the fact that they are heteroatoms is already a good filter if you want them out of the equation. From p.j.a.cock at googlemail.com Fri Sep 7 01:01:04 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 7 Sep 2012 02:01:04 +0100 Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: References: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com> Message-ID: On Thu, Sep 6, 2012 at 3:40 PM, Eric Talevich wrote: > On Thu, Sep 6, 2012 at 6:13 AM, Michiel de Hoon wrote: >> --- On Thu, 9/6/12, Peter Cock wrote: >> > Here's a further (and slightly more radical) idea: We >> > stick with using 'Bio' and the current mixed case >> > names on Python 2, but adopt 'bio' and other PEP8 >> > compatible names for Python 3 (as a uniform >> > strict automatic rule: mixed case -> lower case)? >> > i.e. Do this as part of our 2to3 process. >> >> The Python developers argue against combining a switch to Python 3 with >> other major changes, since then if bugs arise it is unclear if it is due to >> the switch to Python 3 or due to the other changes. But perhaps it's OK if >> we have one Bio.* version for Python 2 and one bio.* version for Python 3 >> that are otherwise completely identical to each other. > > > Agreed, since the bio.* version is generated by the 2to3 script it should > still be easy enough to distinguish "this is a bug in the library" from > "this is a problem with Py3, 2to3 or your environment". 
The extra separation > on the filesystem provided by Py2/Py3 should also prevent some problems with > case-insensitivity and the environment. Yes - they would be in different site-packages folders, and since we have a tiny Python 3 install base, moving them from Bio to bio seems low impact. I guess we need to have a little hack with the 2to3 library and try defining our own custom fixer for the imports... Note this case difference will slightly complicate our documentation - but that is always going to be an issue for the Python 2 to 3 move. >> >> > How about over the next year we gradually consolidate >> > modules under the existing mixed case names? e.g. >> > move Bio.AlignIO functionality and Bio.Align, >> >> I guess you meant "merge Bio.AlignIO functionality into Bio.Align". Yes, sorry. >> > and Bio.Seq* under Bio.Seq (leaving backwards compatible >> > imports supported but deprecated). >> >> Sounds good to me. AFAIAC, we don't need to do this gradually >> over the next year. May as well do it for the next release. > > Doing this in a single release might be better, so we can document/remember > the release number when the Grand Reshuffling took place and troubleshoot > users' resulting problems more easily. Doing it one release makes sense - but we can do it gradually in a series of self contained commits - and feel our way. Michiel - do you want to start with the Bio/Seq.py to Bio/Seq/__init__.py change? We'll need to do that before any consolidation steps. > Should we call that Biopython 2.0.0 and switch to semantic version numbers? > Maybe... at some point a Biopython 2 would be a good excuse for some publicity and another application note. The eventual move from developing under Python 2 (and using 2to3 for Python 3) to natively developing under Python 3 would be an excuse for a major version bump. 
Peter From p.j.a.cock at googlemail.com Fri Sep 7 01:03:22 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 7 Sep 2012 02:03:22 +0100 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: On Thu, Sep 6, 2012 at 2:48 PM, Jo?o Rodrigues wrote: > Ok, thanks. > > The modules are littered with set/get methods and adding DeprecationWarning > to all of them might be a bit too much.. Instead, should we add one single > warning at the top of the PDBParser, since this is the only obligatory > module for Bio.PDB so that everyone gets the warning message once and once > only? Otherwise I can imagine several warnings popping up everywhere.. If you use the exact same message, then I think you'll only see the warning once. Try it with a couple of the get/set methods to confirm. Having the warning happen even if you don't use the get/set seems wrong. Peter From anaryin at gmail.com Fri Sep 7 07:21:56 2012 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 7 Sep 2012 10:21:56 +0300 Subject: [Biopython-dev] Optimization of PDBParser and friends In-Reply-To: References: Message-ID: Likely true. I'm writing a txt file with the changes. I don't think they can be merged easily without breaking a lot of stuff, in particular the removal of child_list. Therefore, I suggest we write a few deprecation warnings here and there where affected by the consensual changes we agree on and give a few releases before we actually merge them. Also, once I'm happy with the changes, I'll make a new branch to allow 'beta testing' by anyone who wants and write a wiki page on it. Cheers, Jo?o No dia 7 de Set de 2012 04:03, "Peter Cock" escreveu: > On Thu, Sep 6, 2012 at 2:48 PM, Jo?o Rodrigues wrote: > > Ok, thanks. > > > > The modules are littered with set/get methods and adding > DeprecationWarning > > to all of them might be a bit too much.. 
Instead, should we add one > single > > warning at the top of the PDBParser, since this is the only obligatory > > module for Bio.PDB so that everyone gets the warning message once and > once > > only? Otherwise I can imagine several warnings popping up everywhere.. > > If you use the exact same message, then I think you'll only see the > warning once. Try it with a couple of the get/set methods to confirm. > > Having the warning happen even if you don't use the get/set seems > wrong. > > Peter > From mjldehoon at yahoo.com Sun Sep 9 07:31:05 2012 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 9 Sep 2012 00:31:05 -0700 (PDT) Subject: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif Message-ID: <1347175865.35152.YahooMailClassic@web164003.mail.gq1.yahoo.com> Returning to a previous discussion... [Michiel:] > ..., currently Bio.Motif._Motif.Motif objects also perform > functions that are more appropriate for a separate PWM > (position-weight matrix) class within Bio.Motif. It may be > a good idea to have a separate PWM class for this functionality. [Bartek:] > I'm not sure. I think it is valuable to be able to load > instances from a file and then convert them to a PWM. > It could be done with separate classes, > but I'm not sure it would be easier then... I think there is one confusing issue here. The current .pwm() method of a Motif object doesn't calculate a position-weight matrix but only normalizes the counts matrix to create a probability matrix. To calculate a PWM, we would have to calculate the logarithm of these probabilities divided by the corresponding background probabilities (for which in Bio.Motif we are currently using the log_odds method). So I was mainly thinking of a PWM class to represent what is currently being returned by the log_odds method. This allows users to create a PWM from the log-odds scores directly instead of from an alignment (for example, if the PWM is available from some publication but not the actual alignments). 
Also this avoids some confusion with regard to which methods operate on which object. For example, currently we have motif.scanPWM and motif.score_hit that actually operate on the log-odds matrix; motif.anticonsensus, motif.consensus, and motif[:] use the probability matrix; and motif.max_score and motif.min_score use the log-odds matrix to evaluate the score of motif.consensus and motif.anticonsensus, which were calculated using the probability matrix (and therefore don't necessarily return the maximum and minimum score).

So I would suggest keeping the various types of matrices explicit; something along these lines:

>>> motif = Motif.read(...)
>>> counts = motif.counts
# .counts is a property of motif
# counts is an instance of the Motif.FrequencyMatrix class
# you can also make a FrequencyMatrix object directly from
# the frequencies, as in
>>> counts = Motif.FrequencyMatrix(my_frequency_matrix)
>>> counts[2,:]
array([1.0, 4.0, 3.0, 2.0])
# indices refer explicitly to the counts matrix
>>> counts[2,'G']
3.0

>>> my_consensus_sequence = counts.consensus
# .consensus is a property of counts
>>> my_anticonsensus_sequence = counts.anticonsensus
# .anticonsensus is a property of counts

>>> my_probability_matrix = counts.normalize()
# this can be a numpy array, or a Motif.ProbabilityMatrix
# class that inherits from a numpy array
>>> my_probability_matrix[2,:]
array([0.1, 0.4, 0.3, 0.2])
# indices refer explicitly to the probability matrix

>>> pwm = counts.make_pwm(...)
# or pwm = motif.PositionWeightMatrix(my_matrix)
>>> pwm[0,:]
array([ -2.3, 0.1, 1.2, 1.8])
>>> pwm[0,2]
1.2
>>> pwm[0,'C']
0.1
# indices explicitly refer to the pwm

>>> scores = pwm.scan(sequence)
>>> score = pwm.score(sequence)

Does that sound reasonable? Any comments, suggestions?

Best,
-Michiel.
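The distinction drawn above between a counts matrix, a probability matrix and a log-odds matrix can be made concrete with a small standalone sketch. It does not use Bio.Motif at all: the function names normalize, log_odds and score, the base-2 logarithm, the optional pseudocount and the uniform background are assumptions chosen purely for illustration, not the proposed API.

```python
import math

ALPHABET = "ACGT"  # assumed column order for every matrix row


def normalize(counts, pseudocount=0.0):
    """Convert per-position count rows into probability rows."""
    rows = []
    for row in counts:
        total = sum(row) + pseudocount * len(row)
        rows.append([(c + pseudocount) / total for c in row])
    return rows


def log_odds(probabilities, background):
    """Return log2(p / background) per cell; a zero probability maps to -inf."""
    return [
        [math.log2(p / b) if p > 0 else float("-inf")
         for p, b in zip(row, background)]
        for row in probabilities
    ]


def score(pwm, sequence):
    """Sum the log-odds entries selected by each letter of the sequence."""
    return sum(pwm[i][ALPHABET.index(letter)]
               for i, letter in enumerate(sequence))


counts = [[1, 4, 3, 2], [5, 5, 0, 0]]      # hypothetical counts matrix
probabilities = normalize(counts)           # normalized counts only
pwm = log_odds(probabilities, [0.25] * 4)   # log-odds against the background
print(probabilities[0])
print(score(pwm, "CA"))
```

Keeping the three steps as separate functions mirrors the suggestion that each matrix type be explicit: a score can only be computed from the log-odds rows, never from the raw counts or probabilities by accident.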
From bartek at rezolwenta.eu.org Mon Sep 10 07:12:59 2012 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Mon, 10 Sep 2012 09:12:59 +0200 Subject: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif In-Reply-To: <1347175865.35152.YahooMailClassic@web164003.mail.gq1.yahoo.com> References: <1347175865.35152.YahooMailClassic@web164003.mail.gq1.yahoo.com> Message-ID: Hi, I think it is an idea worth discussing a little bit more. Thanks for bringing it up Michiel. It captures at least some of the issues caused by the fact that different motifs might be internally represented differently. I'm not sure I'm all excited about having to deal with explicit extra classes for PWMs and aligned instances, but maybe this is the price for having a clear separation of where certain things are calculated. The issue I think still needs discussion is where is the searching done? If I want to search for instances, do I do it from the PWM object?, This seems to be the natural idea, but then can we find a nice interface for people who don't want to be bothered with too complicated interfaces? I'll try to come up with a more thought through and longer response later in the week... best Bartek On Sun, Sep 9, 2012 at 9:31 AM, Michiel de Hoon wrote: > Returning to a previous discussion... > > [Michiel:] >> ..., currently Bio.Motif._Motif.Motif objects also perform >> functions that are more appropriate for a separate PWM >> (position-weight matrix) class within Bio.Motif. It may be >> a good idea to have a separate PWM class for this functionality. > > [Bartek:] >> I'm not sure. I think it is valuable to be able to load >> instances from a file and then convert them to a PWM. >> It could be done with separate classes, >> but I'm not sure it would be easier then... > > I think there is one confusing issue here. > The current .pwm() method of a Motif object doesn't calculate a position-weight matrix but only normalizes the counts matrix to create a probability matrix. 
To calculate a PWM, we would have to calculate the logarithm of these probabilities divided by the corresponding background probabilities (for which in Bio.Motif we are currently using the log_odds method). > > So I was mainly thinking of a PWM class to represent what is currently being returned by the log_odds method. This allows users to create a PWM from the log-odds scores directly instead of from an alignment (for example, if the PWM is available from some publication but not the actual alignments). > Also this avoids some confusion with regard to which methods operate on which object. For example, currently we have motif.scanPWM and motif.score_hit that actually operate on the log-odds matrix, > motif.anticonsensus, motif.consensus, motif[:] uses the probability matrix, and motif.max_score and motif.min_score use the log-odds matrix to evaluate the score of motif.consensus, motif.anticonsensus which were calculated using the probablity matrix (and therefore don't necessarily return the maximum and minimum score). > > So I would suggest to keep the various types of matrices explicit; something along these lines: > >>>> motif = Motif.read(...) 
>>>> counts = motif.counts > # .counts is a property of motif > # counts is an instance of the Motif.FrequencyMatrix class > # you can also make a FrequencyMatrix object directly from > # the frequencies, as in >>>> counts = Motif.FrequencyMatrix(my_frequency_matrix) >>>> counts[2,:] > array([1.0, 4.0, 3.0, 2.0]) > # indices refer explicitly to the counts matrix >>>> counts[2,'G'] > 3.0 > >>>> my_consensus_sequence = counts.consensus > # .consensus is a property of counts >>>> my_anticonsensus_sequence = counts.anticonsensus > # .anticonsensus is a property of counts > >>>> my_probability_matrix = counts.normalize() > # this can be a numpy array, or a Motif.ProbabilityMatrix > # class that inherits from a numpy array >>>> my_probability_matrix[2,:] > array([0.1, 0.4, 0.3, 0.2]) > # indices refer explicitly to the probability matrix > >>>> pwm = counts.make_pwm(...) > # or pwm = motif.PositionWeightMatrix(my_matrix) >>>> pwm[0,:] > array([ -2.3, 0.1, 1.2, 1.8]) >>>> pwm[0,2] > 1.2 >>>> pwm[0,'C'] > 0.1 > # indices explicitly refer to the pwm > >>>> scores = pwm.scan(sequence) >>>> score = pwm.score(sequence) > > > Does that sound reasonable? Any comments, suggestions? > > Best, > -Michiel. > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > -- Bartek Wilczynski From p.j.a.cock at googlemail.com Mon Sep 10 08:39:30 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 10 Sep 2012 09:39:30 +0100 Subject: [Biopython-dev] Most buildbot slaves down Message-ID: Hi all, For those of you actively monitoring the nightly BuildBot for Biopython and/or BioRuby, all the buildslaves at my institute are currently effectively offline. A new stricter firewall policy was introduced last week while I was away. I hope we'll have the necessary outgoing ports opened again soon. 
In the meantime, additional buildslaves hosted elsewhere would be very useful. The machines need to be online and are typically only used once every 24 hours for the scheduled builds. Non-Linux machines are particularly important for cross-platform testing (while for Linux the TravisCI testing seems to be working nicely overall). Any volunteers? Thanks, Peter From tiagoantao at gmail.com Mon Sep 10 08:50:41 2012 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Mon, 10 Sep 2012 09:50:41 +0100 Subject: [Biopython-dev] [BioRuby] Most buildbot slaves down In-Reply-To: References: Message-ID: Hi, Not much helpful in the non-linux front, but I noticed that my machine was down for some reason, restarted it and it is doing at least a few of the builds. Tiago On Mon, Sep 10, 2012 at 9:39 AM, Peter Cock wrote: > Hi all, > > For those of you actively monitoring the nightly BuildBot > for Biopython and/or BioRuby, all the buildslaves at my > institute are currently effectively offline. A new stricter > firewall policy was introduced last week while I was away. > I hope we'll have the necessary outgoing ports opened > again soon. > > In the meantime, additional buildslaves hosted elsewhere > would be very useful. The machines need to be online > and are typically only used once every 24 hours for the > scheduled builds. Non-Linux machines are particularly > important for cross-platform testing (while for Linux > the TravisCI testing seems to be working nicely overall). > > Any volunteers? 
> > Thanks, > > Peter > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby -- "Liberty for wolves is death to the lambs" - Isaiah Berlin From redmine at redmine.open-bio.org Fri Sep 14 02:23:53 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 14 Sep 2012 02:23:53 +0000 Subject: [Biopython-dev] [Biopython - Bug #3384] (New) Installation fails with pip-3.2 Message-ID: Issue #3384 has been reported by Roy Crihfield. ---------------------------------------- Bug #3384: Installation fails with pip-3.2 https://redmine.open-bio.org/issues/3384 Author: Roy Crihfield Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Linux 3.5.3-1-ARCH x86_64 GNU/Linux Python 3.2.3 Bio.__version__ == '1.60' Installation fails with with pip 1.2: $ sudo pip-3.2 install biopython : : Converting build/py3.2/Doc/examples/fasta_dictionary.py Converting build/py3.2/Doc/examples/nmr/simplepredict.py Python 2to3 processing done. 
running egg_info error: error in 'egg_base' option: 'pip-egg-info' does not exist or is not a directory ---------------------------------------- Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython Exception information: Traceback (most recent call last): File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/basecommand.py", line 106, in main status = self.run(options, args) File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/commands/install.py", line 256, in run requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle) File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 1042, in prepare_files req_to_install.run_egg_info() File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 236, in run_egg_info command_desc='python setup.py egg_info') File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/util.py", line 612, in call_subprocess % (command_desc, proc.returncode, cwd)) pip.exceptions.InstallationError: Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org

From redmine at redmine.open-bio.org Fri Sep 14 08:46:08 2012
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 14 Sep 2012 08:46:08 +0000
Subject: [Biopython-dev] [Biopython - Bug #3384] Installation fails with pip-3.2
References:
Message-ID:

Issue #3384 has been updated by Peter Cock.

Does the standard install mechanism work on your machine? i.e.

python3.2 setup.py build
python3.2 setup.py test
sudo python3.2 setup.py install

If you want to investigate the pip error, there is a possible workaround developed by NumPy (who also use 2to3 in a similar way to us), see http://projects.scipy.org/numpy/ticket/1857

Thanks

----------------------------------------
Bug #3384: Installation fails with pip-3.2
https://redmine.open-bio.org/issues/3384

Author: Roy Crihfield
Status: New
Priority: Normal
Assignee: Biopython Dev Mailing List
Category: Main Distribution
Target version:
URL:

Linux 3.5.3-1-ARCH x86_64 GNU/Linux
Python 3.2.3
Bio.__version__ == '1.60'

Installation fails with pip 1.2:

$ sudo pip-3.2 install biopython
:
:
Converting build/py3.2/Doc/examples/fasta_dictionary.py
Converting build/py3.2/Doc/examples/nmr/simplepredict.py
Python 2to3 processing done.
running egg_info error: error in 'egg_base' option: 'pip-egg-info' does not exist or is not a directory ---------------------------------------- Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython Exception information: Traceback (most recent call last): File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/basecommand.py", line 106, in main status = self.run(options, args) File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/commands/install.py", line 256, in run requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle) File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 1042, in prepare_files req_to_install.run_egg_info() File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 236, in run_egg_info command_desc='python setup.py egg_info') File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/util.py", line 612, in call_subprocess % (command_desc, proc.returncode, cwd)) pip.exceptions.InstallationError: Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sat Sep 15 01:57:53 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sat, 15 Sep 2012 01:57:53 +0000 Subject: [Biopython-dev] [Biopython - Bug #3384] Installation fails with pip-3.2 References: Message-ID: Issue #3384 has been updated by Roy Crihfield. Yes, installing manually works. I found that hack but was hoping there would be a better solution, or support for pip planned for the future. 
---------------------------------------- Bug #3384: Installation fails with pip-3.2 https://redmine.open-bio.org/issues/3384 Author: Roy Crihfield Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: Linux 3.5.3-1-ARCH x86_64 GNU/Linux Python 3.2.3 Bio.__version__ == '1.60' Installation fails with with pip 1.2: $ sudo pip-3.2 install biopython : : Converting build/py3.2/Doc/examples/fasta_dictionary.py Converting build/py3.2/Doc/examples/nmr/simplepredict.py Python 2to3 processing done. running egg_info error: error in 'egg_base' option: 'pip-egg-info' does not exist or is not a directory ---------------------------------------- Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython Exception information: Traceback (most recent call last): File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/basecommand.py", line 106, in main status = self.run(options, args) File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/commands/install.py", line 256, in run requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle) File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 1042, in prepare_files req_to_install.run_egg_info() File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/req.py", line 236, in run_egg_info command_desc='python setup.py egg_info') File "/usr/lib/python3.2/site-packages/pip-1.2-py3.2.egg/pip/util.py", line 612, in call_subprocess % (command_desc, proc.returncode, cwd)) pip.exceptions.InstallationError: Command python setup.py egg_info failed with error code 1 in /tmp/pip-build/biopython -- You have received this notification because you have either subscribed to it, or are involved in it. 
To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Sat Sep 15 21:29:29 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Sat, 15 Sep 2012 21:29:29 +0000 Subject: [Biopython-dev] [Biopython - Bug #3340] Example using Bio.Clustalw in Tutorial References: Message-ID: Issue #3340 has been updated by Grace Yeo. I've submitted a pull request for this here: https://github.com/biopython/biopython/pull/71 ---------------------------------------- Bug #3340: Example using Bio.Clustalw in Tutorial https://redmine.open-bio.org/issues/3340 Author: Peter Cock Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The module Bio.Clustalw was deprecated and removed, yet is still used in the Tutorial's 'Creating your own substitution matrix from an alignment' example. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Sun Sep 16 12:34:31 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 16 Sep 2012 13:34:31 +0100 Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: References: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com> Message-ID: On Fri, Sep 7, 2012 at 2:01 AM, Peter Cock wrote: > On Thu, Sep 6, 2012 at 3:40 PM, Eric Talevich wrote: >> On Thu, Sep 6, 2012 at 6:13 AM, Michiel de Hoon wrote: >>> --- On Thu, 9/6/12, Peter Cock wrote: >>> > Here's a further (and slightly more radical) idea: We >>> > stick with using 'Bio' and the current mixed case >>> > names on Python 2, but adopt 'bio' and other PEP8 >>> > compatible names for Python 3 (as a uniform >>> > strict automatic rule: mixed case -> lower case)? >>> > i.e. Do this as part of our 2to3 process. 
>>> >>> The Python developers argue against combining a switch to Python 3 with >>> other major changes, since then if bugs arise it is unclear if it is due to >>> the switch to Python 3 or due to the other changes. But perhaps it's OK if >>> we have one Bio.* version for Python 2 and one bio.* version for Python 3 >>> that are otherwise completely identical to each other. >> >> >> Agreed, since the bio.* version is generated by the 2to3 script it should >> still be easy enough to distinguish "this is a bug in the library" from >> "this is a problem with Py3, 2to3 or your environment". The extra separation >> on the filesystem provided by Py2/Py3 should also prevent some problems with >> case-insensitivity and the environment. > > Yes - they would be in different site-packages folders, and since > we have a tiny Python 3 install base, moving them from Bio to > bio seems low impact. > > I guess we need to have a little hack with the 2to3 library and > try defining our own custom fixer for the imports... > > Note this case difference will slightly complicate our documentation - > but that is always going to be an issue for the Python 2 to 3 move. > I've made a start at this - the easy part seems to work :) https://github.com/peterjc/biopython/commits/py3lower The hard bit will be fixing all the import lines... ;) Peter From k.d.murray.91 at gmail.com Thu Sep 20 04:28:08 2012 From: k.d.murray.91 at gmail.com (Kevin Murray) Date: Thu, 20 Sep 2012 14:28:08 +1000 Subject: [Biopython-dev] TAIR/AGI support In-Reply-To: <87txvcx9ls.fsf@fastmail.fm> References: <87txvcx9ls.fsf@fastmail.fm> Message-ID: Hi Brad, My TAIR/AGI script is on github here: https://github.com/kdmurray91/biopython/blob/master/Bio/TAIR/__init__.py I got it to work directly from TAIR's website, however it has not been rigorously tested. 
I plan on implementing the process as I described in my previous email, whereby it fetches the Genbank record from TOGOws or via NCBI's Efetch (using biopython's interfaces of course). I will keep you all posted.

To the list in general: I'm open to suggestions on what to work on next.

Regards
Kevin Murray

On 6 September 2012 10:45, Brad Chapman wrote:

>
> Kevin;
> Thanks for the e-mail and offers of code. Always happy to have other
> folks involved with the project.
>
> > What's the status of TAIR AGIs in BioPython (I can see no mention of them,
> > or support for them)? I've written a brief module which allows a user to
> > query NCBI with a TAIR AGI, returning a Seq object (via Efetch). Is there
> > any interest in including such functionality in BioPython?
>
> Is the code available on GitHub to get a better sense of all the
> functionality it supports? Do you have an idea where it would fit best?
> As a tair submodule inside of Bio.Entrez, or somewhere else?
>
> > More generally, are there any particular areas of BioPython development
> > which could use an extra pair of hands?
>
> Following the mailing list for discussions on current projects is the
> best way to get a sense of what different folks are working on. The
> issue tracker also has open issues and features that could use attention
> if anything there strikes your fancy:
>
> https://redmine.open-bio.org/projects/biopython
>
> Hope this helps,
> Brad
>
>

From p.j.a.cock at googlemail.com Thu Sep 20 09:08:58 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 20 Sep 2012 10:08:58 +0100
Subject: [Biopython-dev] PEP8 lower case module names?
In-Reply-To:
References: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com>
Message-ID:

On Sun, Sep 16, 2012 at 1:34 PM, Peter Cock wrote:
>>
>> I guess we need to have a little hack with the 2to3 library and
>> try defining our own custom fixer for the imports...
>>
>> Note this case difference will slightly complicate our documentation -
>> but that is always going to be an issue for the Python 2 to 3 move.
>>
>
> I've made a start at this - the easy part seems to work :)
>
> https://github.com/peterjc/biopython/commits/py3lower
>
> The hard bit will be fixing all the import lines... ;)
>
> Peter

Progress - but slow. I think this will work with a bit more time spent on it. With hindsight I'd have made more effort to try and reuse lib2to3, but the documentation is sketchy and they do warn it is liable to change between releases.

What I've got instead is a pattern matching script which line-by-line spots imports & updates them, and also notes what knock-on changes must be made later in the file. It is also aware of and updates doctest examples.

e.g.

from Bio import SeqIO
record = SeqIO.read("my_chr.gbk", "genbank")

becomes:

from bio import seqIO
record = seqIO.read("my_chr.gbk", "genbank")

In the process I've spotted some minor style issues and some quote mistakes in the code base which I have fixed on the main branch as well, e.g.
https://github.com/biopython/biopython/commit/b396844401da8b5c5ed1f7f13d69622a6ad0c0cd
https://github.com/biopython/biopython/commit/165e2b8da445250f070c3860c9082ff6a0c919e0

I also reformatted a few import lines to make processing them easier - and arguably easier to read too:
https://github.com/biopython/biopython/commit/f6940e8a4fcf056fa725225ede5e848c5d6f4fd6

One slightly more complicated issue with lower case module names is we get clashes in some code with existing variable or argument names. This seems particularly common with seq, alphabet and motif. Most of the fixes for this are on the experimental branch. In some cases I've opted to change the import, e.g.

from Bio import Alphabet

to:

from Bio import Alphabet as _alphabet

This seemed simplest to avoid changing argument names in functions/methods.
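As a rough illustration of the line-by-line approach described above, here is a hypothetical sketch. It is not the script on the py3lower branch: the function name lower_case_imports, the regular expression, and the choice to handle only simple "from Bio import X" lines are all assumptions made for this example.

```python
import re

# Matches simple "from Bio import Name" lines, optionally with an alias and
# optionally behind a doctest ">>> " prompt. Anything more complex is ignored.
_FROM_BIO = re.compile(r"^(\s*(?:>>> )?)from Bio import (\w+)( as \w+)?\s*$")


def lower_case_imports(lines):
    """Rewrite 'from Bio import X' lines and later 'X.' uses to lower case.

    Expects lines without trailing newlines; returns a new list of lines.
    """
    renames = {}  # old module name -> lower case name, for knock-on changes
    output = []
    for line in lines:
        match = _FROM_BIO.match(line)
        if match:
            prefix, name, alias = match.groups()
            if alias is None:
                # Without an alias, later "Name." references must change too.
                renames[name] = name.lower()
            line = "%sfrom bio import %s%s" % (prefix, name.lower(), alias or "")
        else:
            for old, new in renames.items():
                line = re.sub(r"\b%s\." % re.escape(old), new + ".", line)
        output.append(line)
    return output


source = [
    "from Bio import Alphabet",
    "x = Alphabet.generic_dna",
    ">>> from Bio import Seq",
    ">>> s = Seq.Seq('ACGT')",
]
for line in lower_case_imports(source):
    print(line)
```

Because an aliased import such as "from Bio import Alphabet as _alphabet" registers no rename, the body of the file is left untouched in that case, which is the property that makes the alias workaround attractive when a lower case name would clash with an existing argument name.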
I'll continue to work on this as time allows - right now the code is due for a refactoring (e.g. avoid code duplication where I handle doctests), and would benefit from some self-tests. But the message remains: This should work :)

Peter

From yhtgrace at gmail.com Fri Sep 21 16:57:19 2012
From: yhtgrace at gmail.com (Hui Ting Grace Yeo)
Date: Fri, 21 Sep 2012 12:57:19 -0400
Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices
Message-ID: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com>

Hey everyone,

I'm working on this bug here https://redmine.open-bio.org/issues/3340 and I've updated the example in the tutorial (on substitution matrices, 17.4.2) using Bio.AlignIO on github here https://github.com/yhtgrace/biopython/tree/clustalw-alignIO-replace. I'm able to reproduce the dictionary replace_info, but when I go on to finish the example, I get the following log odds matrix:

D   2
E  -1   1
H  -5  -4   3
K -10  -5  -4   1
R  -4  -8  -4  -2   2
    D   E   H   K   R

which is different from the one given in the tutorial. I'm wondering if I've missed something.

Thanks!
Grace Yeo From p.j.a.cock at googlemail.com Mon Sep 24 08:53:07 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 24 Sep 2012 09:53:07 +0100 Subject: [Biopython-dev] ColorSpiral for Bio.Graphics Message-ID: Hello all, Last week Leighton was doing some work with Biopython and GenomeDiagram using the cross-links functionality we worked on for Biopython 1.59, which I described here: http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/ As you may have noticed via Twitter or his blog, Leighton has generated an enormous (5m by 1m) PDF poster printout comparing 29 bacterial genomes: http://armchairbiology.blogspot.co.uk/2012/09/the-colours-man-colours.html As he describes on his blog post, this required generating arbitrary color sets, with the option of adding some noise (or jitter as he called it) to make neighbouring colours visually distinct (rather than the more typical requirement of a smooth value to color mapping). His code to do that is now on this branch (with a minor bug fix and a few more docstrings added), ready for possible merging into Biopython: https://github.com/peterjc/biopython/tree/colorspiral Does this seem like a sensible addition to Bio.Graphics? Does anyone have any thoughts on the namespace Bio.Graphics.ColorSpiral given it defines an object ColorSpiral? Might a Bio.Graphics.Colors be useful? (If as discussed on the other thread we move to lower case module names for Python 3, this namespace clash also present in many other Biopython modules goes away): http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009934.html Regards, Peter From p.j.a.cock at googlemail.com Tue Sep 25 16:00:45 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Sep 2012 17:00:45 +0100 Subject: [Biopython-dev] [Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used? 
In-Reply-To: <5061C20F.7040209@stats.ox.ac.uk>
References: <1348569292.2538.YahooMailClassic@web164006.mail.gq1.yahoo.com> <506194F6.9000103@fold.natur.cuni.cz> <5061C20F.7040209@stats.ox.ac.uk>
Message-ID:

On Tue, Sep 25, 2012 at 3:39 PM, Tanya Golubchik wrote:
> Hello,
>
> Apologies for not having followed the entire discussion, but just wanted
> to say that we're also using NCBIXML here and are likely to be
> incorporating it in a new piece of software soon, so it would be really
> unfortunate if some tags disappeared, were renamed or (even worse)
> changed meaning in future releases.
>
> I'm a bit late coming in here so maybe this has been answered, but is
> there a better parser that should be used at the moment? I was under the
> impression that NCBIXML is the only one.
>
> Thanks,
> Tanya

Hi Tanya,

I hope I can reassure you there is nothing to worry about :)

Right now there is only the NCBIXML parser, and we're not going to change it (except possibly to make it a little faster if people want to work on that).

We're planning to add a new module based on Bow's GSoC code, under the working name SearchIO, which would cover BLAST, BLAT, HMMER, etc. This would have a different API and in the long term would probably replace all of Bio.Blast.
http://biopython.org/wiki/SearchIO

The discussion about possible changes has been (I think) only about this new code (and would have been better off on the development mailing list but this thread went off on a slight tangent).

Once 'SearchIO' is released, we'd want to encourage people to use that instead of NCBIXML, with a view to deprecating and eventually removing NCBIXML.
See: http://biopython.org/wiki/Deprecation_policy Regards, Peter From p.j.a.cock at googlemail.com Thu Sep 27 13:01:44 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Sep 2012 14:01:44 +0100 Subject: [Biopython-dev] ColorSpiral for Bio.Graphics In-Reply-To: References: Message-ID: On Mon, Sep 24, 2012 at 9:53 AM, Peter Cock wrote: > Hello all, > > Last week Leighton was doing some work with Biopython > and GenomeDiagram using the cross-links functionality > we worked on for Biopython 1.59, which I described here: > http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/ > > As you may have noticed via Twitter or his blog, Leighton has > generated an enormous (5m by 1m) PDF poster printout > comparing 29 bacterial genomes: > http://armchairbiology.blogspot.co.uk/2012/09/the-colours-man-colours.html > > As he describes on his blog post, this required generating > arbitrary color sets, with the option of adding some noise > (or jitter as he called it) to make neighbouring colours > visually distinct (rather than the more typical requirement > of a smooth value to color mapping). > > His code to do that is now on this branch (with a minor > bug fix and a few more docstrings added), ready for > possible merging into Biopython: > https://github.com/peterjc/biopython/tree/colorspiral > > Does this seem like a sensible addition to Bio.Graphics? > > Does anyone have any thoughts on the namespace > Bio.Graphics.ColorSpiral given it defines an object > ColorSpiral? Might a Bio.Graphics.Colors be useful? > > (If as discussed on the other thread we move to lower > case module names for Python 3, this namespace > clash also present in many other Biopython modules > goes away): > http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009934.html > > Regards, > > Peter I've committed it - we can still move/rename/etc until the next release if anyone has suggestions for improvement. 
https://github.com/biopython/biopython/commit/35a484026b68dd1b530d3446640b2f4d4b73eda7 Peter From p.j.a.cock at googlemail.com Thu Sep 27 13:55:21 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Sep 2012 14:55:21 +0100 Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices In-Reply-To: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com> References: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com> Message-ID: On Fri, Sep 21, 2012 at 5:57 PM, Hui Ting Grace Yeo wrote: > Hey everyone, > > I'm working on this bug here https://redmine.open-bio.org/issues/3340 > and I've updated the example in the tutorial (on substitution matrices, > 17.4.2) using Bio.AlignIO on github here > https://github.com/yhtgrace/biopython/tree/clustalw-alignIO-replace. > I'm able to reproduce the dictionary replace_info, but when I go on to > finish the example, I get the following log odds matrix: > > D 2 > E -1 1 > H -5 -4 3 > K -10 -5 -4 1 > R -4 -8 -4 -2 2 > D E H K R > > which is different from the one given in the tutorial. I'm wondering if I've > missed something. Hi Grace, Using the current code and the example as it is, I also observe the same result as you. According to github's "blame" feature the current text dates back 4 years, https://github.com/biopython/biopython/commit/bed3ab39d8a635f1e74be99e6730a48d2460f8b7 However, that was just a reformatting of an older example which Brad wrote 11 years ago while converting the example from DNA to protein: https://github.com/biopython/biopython/commit/21df476c66b279824c51e6abd3f4ae549d003813 The example file itself protein.aln has not changed, committed: https://github.com/biopython/biopython/commit/ccbe2d72014eafb064994bc3782ca5529d0b0448 See also Doc/examples/make_subsmat.py So, since the example hasn't been changed in 11 years, this suggests either Brad committed the wrong output (and no-one noticed), or something changed in the calculation during that time. 
(Nowadays we try to use doctests for the examples in the API and in the Tutorial where possible, so that code changes which affect our examples are detected automatically.) The most likely candidates would be something in the file Bio/SubsMat/__init__.py https://github.com/biopython/biopython/commits/master/Bio/SubsMat/__init__.py A little detective work might be needed to explain this... sadly trying to use Biopython from back then is complicated by the reliance on the Martel/mxTextTools dependency. Maybe Brad or Michiel has some insight? -- In the meantime, I have applied your changes to the example to use AlignIO, https://github.com/biopython/biopython/commit/19f9317fe0e346f6c3f197d027076d9a1265def7 https://github.com/biopython/biopython/commit/5949f54dadb6d4ac8400e11d2afa33db549afba5 This will now get tested via test_Tutorial.py automatically (except for the final line about printing the odds matrix): https://github.com/biopython/biopython/commit/15dd6ba17eb092d0d7df674ac45617d99256d098 Thank you, Peter From redmine at redmine.open-bio.org Thu Sep 27 13:57:38 2012 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 27 Sep 2012 13:57:38 +0000 Subject: [Biopython-dev] [Biopython - Bug #3340] (Resolved) Example using Bio.Clustalw in Tutorial References: Message-ID: Issue #3340 has been updated by Peter Cock. 
Status changed from New to Resolved % Done changed from 0 to 100 Fixed with Grace's commits, although she has also spotted a separate issue with the log odds matrix output later in the example: http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009958.html http://lists.open-bio.org/pipermail/biopython-dev/2012-September/009962.html ---------------------------------------- Bug #3340: Example using Bio.Clustalw in Tutorial https://redmine.open-bio.org/issues/3340 Author: Peter Cock Status: Resolved Priority: Normal Assignee: Biopython Dev Mailing List Category: Documentation Target version: URL: The module Bio.Clustalw was deprecated and removed, yet is still used in the Tutorial's 'Creating your own substitution matrix from an alignment' example. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Fri Sep 28 10:50:52 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Sep 2012 11:50:52 +0100 Subject: [Biopython-dev] PEP8 lower case module names? In-Reply-To: References: <1346926418.97489.YahooMailClassic@web164004.mail.gq1.yahoo.com> Message-ID: On Thu, Sep 20, 2012 at 10:08 AM, Peter Cock wrote: > On Sun, Sep 16, 2012 at 1:34 PM, Peter Cock wrote: >>> >>> I guess we need to have a little hack with the 2to3 library and >>> try defining our own custom fixer for the imports... >> >> I've made a start at this - the easy part seems to work :) >> >> https://github.com/peterjc/biopython/commits/py3lower >> >> ... The code to do this lower case name mangling remains a quite spaghetti like mess in do2to3.py but it now works enough to pass the test suite (with some but not all 3rd party dependencies installed) under Linux and my Mac OS X machine (where like Windows I have a case insensitive file system). 
Here's a clean run on TravisCI (Linux with a case sensitive file system): https://travis-ci.org/#!/peterjc/biopython/jobs/2584146 I've not tried Windows itself yet. Also only Python 3.2 Note if you want to try this, after switching to (and after switching from) the py3lower branch you should delete the build/py3.* folder where the 2to3 converted code is cached. The good news is that only a handful of bits of code needed special case code (e.g. finding the Entrez DTD files), with most tweaks just to import lines (as mentioned earlier) or renaming of internal variables. So this idea to adopt PEP8 lower case module names as part of supporting Python 3 appears to be technically viable. Peter From p.j.a.cock at googlemail.com Fri Sep 28 09:35:42 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Sep 2012 10:35:42 +0100 Subject: [Biopython-dev] ColorSpiral for Bio.Graphics In-Reply-To: References: Message-ID: On Thu, Sep 27, 2012 at 2:01 PM, Peter Cock wrote: > On Mon, Sep 24, 2012 at 9:53 AM, Peter Cock wrote: >> As he describes on his blog post, this required generating >> arbitrary color sets, with the option of adding some noise >> (or jitter as he called it) to make neighbouring colours >> visually distinct (rather than the more typical requirement >> of a smooth value to color mapping). >> >> ... > > I've committed it - we can still move/rename/etc until the > next release if anyone has suggestions for improvement. > https://github.com/biopython/biopython/commit/35a484026b68dd1b530d3446640b2f4d4b73eda7 The buildbot run last night spotted a problem under Python 2.5 (no cmath.rect function) which I've now fixed. https://github.com/biopython/biopython/commit/ee933c3f5c4b98ab232c5180492dc11a46b89f0d We do test under Python 2.5 with TravisCI as well, but at the moment we don't install the ReportLab dependency. 
There is a balance between installing more dependencies (to get more of our code tested) and the extra runtime required (meaning the job is more likely to be killed, or fail due to a network issue) giving false test failures. Peter From p.j.a.cock at googlemail.com Fri Sep 28 10:06:10 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Sep 2012 11:06:10 +0100 Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices In-Reply-To: <87ipaywk47.fsf@fastmail.fm> References: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com> <87ipaywk47.fsf@fastmail.fm> Message-ID: On Fri, Sep 28, 2012 at 10:51 AM, Brad Chapman wrote: >> So, since the example hasn't been changed in 11 years, this >> suggests either Brad committed the wrong output (and no-one >> noticed), or something changed in the calculation during that >> time. > > Seriously, I could have easily copy/pasted something wrong when writing > this, so if there is no obvious code change I'd go with that assumption > and fix the docs to be correct. OK - I've done that: https://github.com/biopython/biopython/commit/b57707f9f3afc0980a3dbf936f6642a4d9cc8a69 Thanks Brad & Grace, Peter P.S. I've included Grace as a contributor in the upcoming release notes (please let me know if you'd prefer this as Hui Ting Grace Yeo instead): https://github.com/biopython/biopython/commit/5af03e78f37cbce82ce167c762d892cce9cb062e From bjoern at gruenings.eu Fri Sep 28 13:03:22 2012 From: bjoern at gruenings.eu (Björn Grüning) Date: Fri, 28 Sep 2012 15:03:22 +0200 Subject: [Biopython-dev] [Patch] Genbank Parser Message-ID: <1348837402.21455.1.camel@threonin> Hi, the tbl2asn tool from the NCBI creates GenBank files that do not have a version number. Unfortunately that version number is used to fill consumer.data.id. I implemented the following fall-back: if there is no version information available, then it takes the consumer.data.name for the consumer.data.id. Does that make sense? Thanks!
Bjoern -------------- next part -------------- A non-text attachment was scrubbed... Name: biopython_genbank_id-fallback.diff Type: text/x-patch Size: 1016 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Fri Sep 28 13:38:11 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 28 Sep 2012 14:38:11 +0100 Subject: [Biopython-dev] [Patch] Genbank Parser In-Reply-To: <1348837402.21455.1.camel@threonin> References: <1348837402.21455.1.camel@threonin> Message-ID: On Fri, Sep 28, 2012 at 2:03 PM, Björn Grüning wrote: > Hi, > > the tbl2asn tool from the NCBI creates GenBank files that do not have a > version number. Unfortunately that version number is used to fill > consumer.data.id. > I implemented the following fall-back: > if there is no version information available, then it takes the > consumer.data.name for the consumer.data.id. Does that make sense? > > Thanks! > Bjoern Can you share some example output from tbl2asn that shows this problem? Ideally something small we could include as a unit test. Thanks, Peter From chapmanb at 50mail.com Fri Sep 28 09:51:36 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 28 Sep 2012 05:51:36 -0400 Subject: [Biopython-dev] Biopython tutorial: Substitution Matrices In-Reply-To: References: <9EF9244A-87B8-4874-A46A-AE2C393237C9@gmail.com> Message-ID: <87ipaywk47.fsf@fastmail.fm> Grace and Peter; [Different log odds matrix in documentation] > However, that was just a reformatting of an older example which > Brad wrote 11 years ago while converting the example from DNA > to protein: Gee, thanks for making me feel old. > So, since the example hasn't been changed in 11 years, this > suggests either Brad committed the wrong output (and no-one > noticed), or something changed in the calculation during that > time. Seriously, I could have easily copy/pasted something wrong when writing this, so if there is no obvious code change I'd go with that assumption and fix the docs to be correct.
Thanks for spotting this, Brad From bjoern at gruenings.eu Thu Sep 27 22:11:05 2012 From: bjoern at gruenings.eu (bjoern at gruenings.eu) Date: Fri, 28 Sep 2012 00:11:05 +0200 (CEST) Subject: [Biopython-dev] [Patch] Genbank Parser fall-back data.id Message-ID: <59367.132.230.56.143.1348783865.squirrel@mail.gruenings.eu> Hi, the tbl2asn tool from the NCBI creates GenBank files that do not have a version number. Unfortunately that version number is used to fill consumer.data.id. I implemented the following fall-back: if there is no version information available, then it takes the consumer.data.name for the consumer.data.id. Does that make sense? Thanks! Bjoern -------------- next part -------------- A non-text attachment was scrubbed... Name: biopython_genbank.diff Type: text/x-patch Size: 1015 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Sat Sep 29 12:10:24 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 29 Sep 2012 13:10:24 +0100 Subject: [Biopython-dev] Fwd: [Utilities-announce] PubMed E-Utility 2013 DTD updates In-Reply-To: References: Message-ID: I've added the two new DTD files mentioned below: https://github.com/biopython/biopython/commit/2a09b03ab4d861e91eb543bd6df717ecb4fdf097 Peter ---------- Forwarded message ---------- From: Date: Friday, September 28, 2012 Subject: [Utilities-announce] PubMed E-Utility 2013 DTD updates To: NLM/NCBI List utilities-announce NCBI PubMed E-Utility Users, We anticipate updating the PubMed E-Utility DTDs for 2012 in mid-December, approximately on December 10 or 11, 2012. The forthcoming DTDs are available from: http://www.ncbi.nlm.nih.gov/entrez/query/DTD/nlmmedlinecitationset_130101.dtd http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_130101.dtd Changes to NLMMedlineCitationSet DTD AND MEDLINE/PubMed XML: - Indicating abstracts not in MEDLINE/PubMed but available from publishers English-language abstracts are taken
directly from the published article and included in the and elements. If the article does not have a published abstract, the record lacks the and elements. However, publishers may create English-language abstracts that are not published with the article, as well as, non-English-language abstracts that may or may not be published with the article. These other abstracts will be indicated in the element. A new "Language" attribute is added to the element. The element will carry the standard phrase: "Abstract available from the publisher." DTD: Sample XML: Abstract available from the publisher. - Rename NameID to Identifier The NameID element was created in 2010 and modified in 2011 but has not yet been used. NameID is renamed to Identifier. Identifier is an optional, possibly multiply-occurring element permissible within the Author (personal and collective) and Investigator elements. The value in the Identifier attribute Source designates the organizational authority that established the unique identifier.
DTD: Sample XML: Smith John A 55555555555555 Thank you.
From p.j.a.cock at googlemail.com Sat Sep 29 20:25:14 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 29 Sep 2012 21:25:14 +0100 Subject: [Biopython-dev] Nexus __slots__ and Python 3.3 Message-ID:
Hello all, I've started testing under the newly released Python 3.3, and there is a new problem which I don't recall running into when I tried one of the Python 3.3 alpha releases:
$ python3 test_Nexus.py
Traceback (most recent call last):
  File "test_Nexus.py", line 7, in <module>
    from Bio.Nexus import Nexus, Trees
  File "/Users/peterjc/lib/python3.3/site-packages/Bio/Nexus/Nexus.py", line 513, in <module>
    class Nexus(object):
ValueError: 'original_taxon_order' in __slots__ conflicts with class variable
I can fix this with the following change, which appears to have no side effects under Python 2 (the unit tests still all pass):
$ git diff
diff --git a/Bio/Nexus/Nexus.py b/Bio/Nexus/Nexus.py
index 1d6abd2..8c7fbcc 100644
--- a/Bio/Nexus/Nexus.py
+++ b/Bio/Nexus/Nexus.py
@@ -511,8 +511,6 @@ class Block(object):
 class Nexus(object):
-    __slots__=['original_taxon_order','__dict__']
-
     def __init__(self, input=None):
         self.ntax=0 # number of taxa
         self.nchar=0 # number of characters
I have committed this: https://github.com/biopython/biopython/commit/e90db11f4a1d983bc2bfe12bec30edbdbb200634
However, I'm not really sure what the intention of this line was in the first place. It is (assuming I didn't miss anything with grep), or now was, the only use of __slots__ in the whole of Biopython.
Regards, Peter
From p.j.a.cock at googlemail.com Sat Sep 29 20:34:27 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 29 Sep 2012 21:34:27 +0100 Subject: [Biopython-dev] PAML test problems under Python 3.3.0 Message-ID:
Hi Brandon (et al), Could you have a look at the PAML unit tests under Python 3.3 please?
I see a mix of failures and 'blocking' under a self-compiled Python 3.3.0 on Mac OS X 10.8 (Mountain Lion):
$ python3 test_PAML_yn00.py
testAlignmentExists (__main__.ModTest) ... ok
testAlignmentFileIsValid (__main__.ModTest) ... FAIL
testAlignmentSpecified (__main__.ModTest) ... ok
testCtlFileExistsOnRead (__main__.ModTest) ... ok
testCtlFileExistsOnRun (__main__.ModTest) ... ok
testCtlFileValidOnRead (__main__.ModTest) ... ERROR
testCtlFileValidOnRun (__main__.ModTest) ... ok
testOptionExists (__main__.ModTest) ... ok
testOutputFileSpecified (__main__.ModTest) ... ok
testOutputFileValid (__main__.ModTest) ... ok
testParseAllVersions (__main__.ModTest) ... ok
testResultsExist (__main__.ModTest) ... ok
testResultsParsable (__main__.ModTest) ... ok
testResultsValid (__main__.ModTest) ... ^C
$ python3 test_PAML_codeml.py
testAlignmentExists (__main__.ModTest) ... ok
testAlignmentFileIsValid (__main__.ModTest) ... FAIL
testAlignmentSpecified (__main__.ModTest) ... ok
testCtlFileExistsOnRead (__main__.ModTest) ... ok
testCtlFileExistsOnRun (__main__.ModTest) ... ok
testCtlFileValidOnRead (__main__.ModTest) ... ERROR
testCtlFileValidOnRun (__main__.ModTest) ... ok
testOptionExists (__main__.ModTest) ... ok
testOutputFileSpecified (__main__.ModTest) ... ok
testOutputFileValid (__main__.ModTest) ... ok
testPamlErrorsCaught (__main__.ModTest) ... ok
testParseAA (__main__.ModTest) ... ok
testParseAAPairwise (__main__.ModTest) ... ok
testParseAllNSsites (__main__.ModTest) ... ok
testParseBranchSiteA (__main__.ModTest) ... ok
testParseCladeModelC (__main__.ModTest) ... ok
testParseFreeRatio (__main__.ModTest) ... ok
testParseNSsite3 (__main__.ModTest) ... ok
testParseNgene2Mgene02 (__main__.ModTest) ... ok
testParseNgene2Mgene1 (__main__.ModTest) ... ok
testParseNgene2Mgene34 (__main__.ModTest) ... ok
testParsePairwise (__main__.ModTest) ... ok
testParseSEs (__main__.ModTest) ... ok
testResultsExist (__main__.ModTest) ... ok
testResultsParsable (__main__.ModTest) ... ok
testResultsValid (__main__.ModTest) ... ^C
$ python3 test_PAML_baseml.py
testAlignmentExists (__main__.ModTest) ... ok
testAlignmentFileIsValid (__main__.ModTest) ... FAIL
testAlignmentSpecified (__main__.ModTest) ... ok
testCtlFileExistsOnRead (__main__.ModTest) ... ok
testCtlFileExistsOnRun (__main__.ModTest) ... ok
testCtlFileValidOnRead (__main__.ModTest) ... ERROR
testCtlFileValidOnRun (__main__.ModTest) ... ok
testOptionExists (__main__.ModTest) ... ok
testOutputFileSpecified (__main__.ModTest) ... ok
testOutputFileValid (__main__.ModTest) ... ok
testPamlErrorsCaught (__main__.ModTest) ... ok
testParseAllVersions (__main__.ModTest) ... ok
testParseAlpha1Rho1 (__main__.ModTest) ... ok
testParseModel (__main__.ModTest) ... ok
testParseNhomo (__main__.ModTest) ... ok
testParseSEs (__main__.ModTest) ... ok
testResultsExist (__main__.ModTest) ... ok
testResultsParsable (__main__.ModTest) ... ok
testResultsValid (__main__.ModTest) ... ^C
If you've not tried this before, the procedure I'm using is:
$ python3 setup.py build
$ cd build/py3.3/Tests
$ python3 test_PAML_baseml.py
etc
The key point is that to run the tests directly (rather than just via 'python3 setup.py test') you must change directory to the 2to3 converted folder under the build folder. By commenting out the test methods which seem to be blocking, it seems some of the failures are to do with exception handling. I've not dug any further into this.
Thanks, Peter
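For readers following the Nexus __slots__ thread above, the ValueError can be reproduced in isolation with a few lines. This is only a minimal sketch: NexusLike is a hypothetical stand-in, not the real Bio.Nexus.Nexus.Nexus class, and it assumes (the email does not say so) that the real class also defined a class-level attribute named original_taxon_order, here modelled as a property. Python 3.3 and later reject a name that appears both in __slots__ and in the class body, which matches the error message in the traceback.

```python
def build_conflicting_class():
    """Define a class whose __slots__ entry clashes with a class attribute."""
    class NexusLike(object):  # hypothetical stand-in, not Bio.Nexus.Nexus.Nexus
        __slots__ = ["original_taxon_order", "__dict__"]
        # Assumption: a class-level attribute shares its name with the slot.
        original_taxon_order = property(
            lambda self: self.__dict__.get("_order"))
    return NexusLike


def build_fixed_class():
    """Same class with the __slots__ line removed, mirroring the committed fix."""
    class NexusLike(object):
        original_taxon_order = property(
            lambda self: self.__dict__.get("_order"))
    return NexusLike


def conflict_message():
    """Return the ValueError text raised at class creation, or None if none."""
    try:
        build_conflicting_class()
    except ValueError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(conflict_message())
    print(build_fixed_class()().original_taxon_order)
```

Because the slot list already contained '__dict__', instances kept an ordinary attribute dictionary either way, so dropping the line should not change how attributes are stored; that reading is an inference from the diff, not something the thread confirms.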