From Markus.Piotrowski at ruhr-uni-bochum.de Mon Sep 2 16:49:20 2013
From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski)
Date: 2 Sep 2013 22:49:20 +0200
Subject: [Biopython-dev] Fwd: [biopython] Potential error in mass calculations for RNA/DNA? (#229)

Hi,

I prepared a bugfix for that:
https://github.com/MarkusPiotrowski/biopython/commit/fd8914f14d48c984a69b6e8227c679e3c67bd1eb

Summary: Bugfix for DNA/RNA masses

In Bio.Data.IUPACData:
- corrected masses for monophosphate nucleotides in
  unambiguous_dna_weights and unambiguous_rna_weights (most values were
  too high by a mass of 16 Da)
- added two dictionaries with monoisotopic masses for monophosphate
  nucleotides (monoisotopic_unambiguous_dna_weights and
  monoisotopic_unambiguous_rna_weights)
- added average and monoisotopic masses for selenocysteine and
  pyrrolysine in protein_weights and monoisotopic_protein_weights

In Bio.SeqUtils.__init__:
Rewrote method molecular_weight to
- correct the calculation (sum masses of sequence elements and subtract
  18 Da for each formed bond)
- allow mass calculation for RNA and protein sequences
- allow mass calculation for double stranded nucleic acids

On 2013-08-30 17:46, Peter Cock wrote:
> Who are our sequence mass experts?
> https://github.com/biopython/biopython/issues/229
>
> ---------- Forwarded message ----------
> From: nruggero
> Date: Thu, Aug 29, 2013 at 11:03 PM
> Subject: [biopython] Potential error in mass calculations for RNA/DNA?
> (#229)
> To: biopython/biopython
>
>
> In Bio/Data/IUPACData.py the molecular weights of unambiguous DNA are
> listed as:
>
> unambiguous_dna_weights = {
>     "A": 347.,
>     "C": 323.,
>     "G": 363.,
>     "T": 322.,
> }
>
> As far as I can tell these are the molecular weights for the non-deoxy
> bases instead of the deoxy bases. For example, AMP (347.22) instead of
> dAMP (331.22) is listed.
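The corrected calculation described in the summary above (sum the monomer masses, then subtract about 18 Da for each bond formed) can be sketched as follows. This is only an illustration of the arithmetic, not the code from the linked commit: the function name and the rounded mass table are assumptions made for this example, with dAMP = 331.22 taken from the issue text and the other three values being standard rounded figures for the deoxynucleoside monophosphates.

```python
# Rounded average masses (Da) of the deoxynucleoside monophosphates.
# Illustrative table only; it is not copied from Bio.Data.IUPACData.
DNA_MONOPHOSPHATE_WEIGHTS = {"A": 331.22, "C": 307.20, "G": 347.22, "T": 322.21}
WATER = 18.02  # mass lost for every bond formed between two monomers


def single_strand_dna_weight(seq):
    """Sum the monomer masses and subtract one water per formed bond.

    Hypothetical helper: a strand of n nucleotides contains n - 1 bonds.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    total = sum(DNA_MONOPHOSPHATE_WEIGHTS[base] for base in seq)
    return total - (len(seq) - 1) * WATER


print(round(single_strand_dna_weight("AGC"), 2))
```

A plain sum over the old table would count the extra oxygen of each ribonucleotide and never remove the water lost on bond formation, which is the discrepancy raised in the issue.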
>
> I've looked at the original BioPerl code that these numbers were taken
> from and I think they were just copied incorrectly. I have also looked
> at the code which uses this dict in Bio/SeqUtils/__init__.py called
> molecular_weight() and it just takes the sum of these values over the
> sequence (no correction made).
>
> So, is this an error or am I missing something basic?
> Thanks
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From zruan1991 at gmail.com Mon Sep 2 18:20:17 2013
From: zruan1991 at gmail.com (Zheng Ruan)
Date: Mon, 2 Sep 2013 18:20:17 -0400
Subject: [Biopython-dev] Codon Alignment GSoC Weekly Update

Hi all,

An update of the Codon Alignment GSoC project can be found at
http://zruanweb.com/. Thanks for your comments and suggestions.

Best,
Zheng Ruan

From yeyanbo289 at gmail.com Mon Sep 2 21:32:16 2013
From: yeyanbo289 at gmail.com (Yanbo Ye)
Date: Tue, 3 Sep 2013 09:32:16 +0800
Subject: [Biopython-dev] GSOC weekly update 12

Hi all,

The last weekly update for the Biopython.Phylo project can be found here:
http://blog.yeyanbo.com/posts/google-summer-of-code-12.html

Thanks,
Yanbo

--
Yanbo Ye
Guangzhou Institutes of Biomedicine and Health,
Chinese Academy of Sciences
190 Kaiyuan Avenue, Science Park, Guangzhou, China

Email: ye_yanbo at gibh.ac.cn
Web: http://www.yeyanbo.com
Phone: (86)-020-32093810

From jurajbergman at hotmail.com Thu Sep 5 10:33:55 2013
From: jurajbergman at hotmail.com (Juraj Bergman)
Date: Thu, 5 Sep 2013 16:33:55 +0200
Subject: [Biopython-dev] Python_MKT

Dear all,

I'm resending my implementation of the McDonald-Kreitman test.
Link to the description of the module:
https://www.dropbox.com/s/zgnz8xwlcsispzf/Python_MKT.pdf
Link to the code:
https://www.dropbox.com/s/1z3opj4rbb0ms14/Python_MKT.py

I apologise for the initial mistake of sending attachments instead of
links.

Kind regards,
Juraj Bergman

P.S. Regarding the multi_short_path() function - I realize that it is
very, very repetitive but I have not (yet) managed to find a suitable loop
construction that would replace the current code. The multi_short_path()
function is by far the most complex function of the module because its
purpose is to find the codon network with the least amount of overall
nucleotide substitutions and the least amount of non-synonymous nucleotide
substitutions (given any combination of codons). Each codon is being
represented as multiple lists of two integers (depending on the overall
amount of codons being processed). The first integer specifies the amount
of synonymous and the second specifies the amount of non-synonymous
substitutions. For example, if 10 codons are being fitted in a network,
then there are 10x10 = 100 combinations of codon-codon pathways, each
represented with a two-integer list, and out of these 100 lists, the
'best' 10 have to be chosen to get the most optimal codon network (and the
repetitiveness of the function mainly arises because of this process).
This is, in short, a description of the function and I would appreciate
any pointers that would help to make the code more succinct :)

From zruan1991 at gmail.com Fri Sep 6 00:00:06 2013
From: zruan1991 at gmail.com (Zheng Ruan)
Date: Fri, 6 Sep 2013 00:00:06 -0400
Subject: [Biopython-dev] Fwd: Python_MKT

Hi Juraj,

I am also planning to implement the MK test in my GSoC framework. I just
went through your code and it is really independent. Will you be able to
modify it to utilize the MultipleSeqAlignment, Alphabet and CodonTable
modules of Biopython so that it is more extendable?

As to the multi_short_path() function, you really confused me. Is your
implementation guaranteed to find the shortest path? This problem can be
abstracted as finding the minimum spanning tree in graph theory, for which
good algorithms are known (Prim's algorithm or Kruskal's algorithm). My
idea for linking multiple codons is to first generate a codon-by-codon
matrix giving the synonymous and nonsynonymous substitutions needed to
change each codon into each of the others. Then find the minimum spanning
tree that connects all the nodes in the matrix with minimum length (least
synonymous substitutions). I plan to implement this and you may have more
insight about my suggestions. Thanks!

Best,
Zheng Ruan

On Thu, Sep 5, 2013 at 10:33 AM, Juraj Bergman wrote:
> Dear all,
> I'm resending my implementation of the McDonald-Kreitman test.
> Link to the description of the module:
> https://www.dropbox.com/s/zgnz8xwlcsispzf/Python_MKT.pdf
> Link to the code:
> https://www.dropbox.com/s/1z3opj4rbb0ms14/Python_MKT.py
> I apologise for the initial mistake of sending attachments instead of
> links.
> Kind regards,
> Juraj Bergman
> P.S. Regarding the multi_short_path() function - I realize that it is
> very, very repetitive but I have not (yet) managed to find a suitable
> loop construction that would replace the current code. The
> multi_short_path() function is by far the most complex function of the
> module because its purpose is to find the codon network with the least
> amount of overall nucleotide substitutions and the least amount of
> non-synonymous nucleotide substitutions (given any combination of
> codons). Each codon is being represented as multiple lists of two
> integers (depending on the overall amount of codons being processed).
> The first integer specifies the amount of synonymous and the second
> specifies the amount of non-synonymous substitutions. For example, if
> 10 codons are being fitted in a network, then there are 10x10 = 100
> combinations of codon-codon pathways, each represented with a
> two-integer list, and out of these 100 lists, the 'best' 10 have to be
> chosen to get the most optimal codon network (and the repetitiveness of
> the function mainly arises because of this process). This is, in short,
> a description of the function and I would appreciate any pointers that
> would help to make the code more succinct :)

From jurajbergman at hotmail.com Fri Sep 6 02:38:34 2013
From: jurajbergman at hotmail.com (Juraj Bergman)
Date: Fri, 6 Sep 2013 08:38:34 +0200
Subject: [Biopython-dev] Python_MKT

Hi Zheng,

I think that the utilization of MultipleSeqAlignment and other modules
already implemented in the Biopython framework is the next step in
developing my module. The code was made independent because it says on the
Biopython wiki that, when submitting code, it should be generalized so I
didn't use any existing Biopython modules...

As for the multi_short_path() function - it is guaranteed to find the
shortest path (as far as I've tested it and I've tested it quite a bit)
but I agree that it is very confusing (even for me), but it works... But
still, my next goal is to try and rewrite it (so thank you for the
suggestions :). The codon-codon matrix principle you described is also the
principle behind the multi_short_path() function and, I think, it is a
good way of tackling the problem...
But in the end the goal of multi_short_path() is to find a tree with the
least amount of overall substitutions (synonymous + non-synonymous) and
with the number of non-synonymous substitutions being minimized. If you
try to connect the nodes based solely on the minimum amount of synonymous
substitutions you may not always get a minimum length tree (for example:
if considering only the synonymous substitutions, then, theoretically, a
codon_a -> codon_b exchange which requires two synonymous changes has
priority over a codon_a -> codon_c exchange which requires only one
non-synonymous change, and that in turn can affect the length of the whole
tree) - I hope this makes some sense to you... Also, when connecting
nodes, I took the approach of first making a root of the tree and then
building the tree from that root, otherwise you could end up with multiple
unconnected branches... I hope this helps with your implementation... If I
come up with a better alternative to multi_short_path() I'll be sure to
post a link!

Again, thanks for taking the time to go through my code, all the best,

Juraj

Date: Fri, 6 Sep 2013 00:00:06 -0400
Subject: Fwd: [Biopython-dev] Python_MKT
From: zruan1991 at gmail.com
To: biopython-dev at biopython.org; jurajbergman at hotmail.com

Hi Juraj,

I am also planning to implement the MK test in my GSoC framework. I just
went through your code and it is really independent. Will you be able to
modify it to utilize the MultipleSeqAlignment, Alphabet and CodonTable
modules of Biopython so that it is more extendable?

As to the multi_short_path() function, you really confused me. Is your
implementation guaranteed to find the shortest path? This problem can be
abstracted as finding the minimum spanning tree in graph theory, for which
good algorithms are known (Prim's algorithm or Kruskal's algorithm).
My idea for linking multiple codons is to first generate a codon-by-codon
matrix giving the synonymous and nonsynonymous substitutions needed to
change each codon into each of the others. Then find the minimum spanning
tree that connects all the nodes in the matrix with minimum length (least
synonymous substitutions). I plan to implement this and you may have more
insight about my suggestions. Thanks!

Best,
Zheng Ruan

On Thu, Sep 5, 2013 at 10:33 AM, Juraj Bergman wrote:

Dear all,
I'm resending my implementation of the McDonald-Kreitman test.
Link to the description of the module:
https://www.dropbox.com/s/zgnz8xwlcsispzf/Python_MKT.pdf
Link to the code:
https://www.dropbox.com/s/1z3opj4rbb0ms14/Python_MKT.py
I apologise for the initial mistake of sending attachments instead of
links.
Kind regards,
Juraj Bergman
P.S. Regarding the multi_short_path() function - I realize that it is
very, very repetitive but I have not (yet) managed to find a suitable loop
construction that would replace the current code. The multi_short_path()
function is by far the most complex function of the module because its
purpose is to find the codon network with the least amount of overall
nucleotide substitutions and the least amount of non-synonymous nucleotide
substitutions (given any combination of codons). Each codon is being
represented as multiple lists of two integers (depending on the overall
amount of codons being processed). The first integer specifies the amount
of synonymous and the second specifies the amount of non-synonymous
substitutions. For example, if 10 codons are being fitted in a network,
then there are 10x10 = 100 combinations of codon-codon pathways, each
represented with a two-integer list, and out of these 100 lists, the
'best' 10 have to be chosen to get the most optimal codon network (and the
repetitiveness of the function mainly arises because of this process).
This is, in short, a description of the function and I would appreciate
any pointers that would help to make the code more succinct :)

From zruan1991 at gmail.com Fri Sep 6 11:07:01 2013
From: zruan1991 at gmail.com (Zheng Ruan)
Date: Fri, 6 Sep 2013 11:07:01 -0400
Subject: [Biopython-dev] Python_MKT

Hi Juraj,

It's good to hear that you plan to do that. A big advantage of using
Biopython modules is that your MK test becomes more integrated with
existing functions, which helps when designing pipelines within Biopython.
What I would also try is to use Bio.Data.CodonTable so that users can
specify the genetic code of their gene of interest.

I think there are situations where you are not able to minimize the
overall (synonymous and non-synonymous) substitutions and the
non-synonymous substitutions at the same time. If I understand your point
correctly, the multi_short_path() function tries to find the path with the
fewest overall substitutions among a set of paths that all have the
minimum number of non-synonymous substitutions, right? In this case, for
example when you have 10 different codons at hand, you can first start
from each codon and build a minimum spanning tree. You then expect at most
10 minimum spanning trees, all with an equal (minimum) number of
non-synonymous substitutions. Finally, you can pick the tree with the
least overall substitutions (non-synonymous and synonymous) from the set
of trees. I don't expect the algorithm to cost more than 2000 lines. Maybe
we can discuss this more after I finish coding this weekend. Thanks!

Best,
Zheng Ruan

On Fri, Sep 6, 2013 at 2:38 AM, Juraj Bergman wrote:
> Hi Zheng,
>
> I think that the utilization of MultipleSeqAlignment and other modules
> already implemented in the Biopython framework is the next step in
> developing my module.
> The code was made independent because it says on the Biopython wiki
> that, when submitting code, it should be generalized so I didn't use
> any existing Biopython modules...
>
> As for the multi_short_path() function - it is guaranteed to find the
> shortest path (as far as I've tested it and I've tested it quite a bit)
> but I agree that it is very confusing (even for me), but it works...
> But still, my next goal is to try and rewrite it (so thank you for the
> suggestions :). The codon-codon matrix principle you described is also
> the principle behind the multi_short_path() function and, I think, it
> is a good way of tackling the problem... But in the end the goal of
> multi_short_path() is to find a tree with the least amount of overall
> substitutions (synonymous + non-synonymous) and with the number of
> non-synonymous substitutions being minimized. If you try to connect the
> nodes based solely on the minimum amount of synonymous substitutions
> you may not always get a minimum length tree (for example: if
> considering only the synonymous substitutions, then, theoretically, a
> codon_a -> codon_b exchange which requires two synonymous changes has
> priority over a codon_a -> codon_c exchange which requires only one
> non-synonymous change, and that in turn can affect the length of the
> whole tree) - I hope this makes some sense to you... Also, when
> connecting nodes, I took the approach of first making a root of the
> tree and then building the tree from that root, otherwise you could end
> up with multiple unconnected branches... I hope this helps with your
> implementation... If I come up with a better alternative to
> multi_short_path() I'll be sure to post a link!
>
> Again, thanks for taking the time to go through my code, all the best,
>
> Juraj
>
> ------------------------------
> Date: Fri, 6 Sep 2013 00:00:06 -0400
> Subject: Fwd: [Biopython-dev] Python_MKT
> From: zruan1991 at gmail.com
> To: biopython-dev at biopython.org; jurajbergman at hotmail.com
>
>
> Hi Juraj,
>
> I am also planning to implement the MK test in my GSoC framework. I
> just went through your code and it is really independent. Will you be
> able to modify it to utilize the MultipleSeqAlignment, Alphabet and
> CodonTable modules of Biopython so that it is more extendable?
>
> As to the multi_short_path() function, you really confused me. Is your
> implementation guaranteed to find the shortest path? This problem can
> be abstracted as finding the minimum spanning tree in graph theory, for
> which good algorithms are known (Prim's algorithm or Kruskal's
> algorithm). My idea for linking multiple codons is to first generate a
> codon-by-codon matrix giving the synonymous and nonsynonymous
> substitutions needed to change each codon into each of the others. Then
> find the minimum spanning tree that connects all the nodes in the
> matrix with minimum length (least synonymous substitutions). I plan to
> implement this and you may have more insight about my suggestions.
> Thanks!
>
> Best,
> Zheng Ruan
>
>
> On Thu, Sep 5, 2013 at 10:33 AM, Juraj Bergman wrote:
>
> Dear all,
> I'm resending my implementation of the McDonald-Kreitman test.
> Link to the description of the module:
> https://www.dropbox.com/s/zgnz8xwlcsispzf/Python_MKT.pdf
> Link to the code:
> https://www.dropbox.com/s/1z3opj4rbb0ms14/Python_MKT.py
> I apologise for the initial mistake of sending attachments instead of
> links.
> Kind regards,
> Juraj Bergman
> P.S. Regarding the multi_short_path() function - I realize that it is
> very, very repetitive but I have not (yet) managed to find a suitable
> loop construction that would replace the current code.
> The multi_short_path() function is by far the most complex function of
> the module because its purpose is to find the codon network with the
> least amount of overall nucleotide substitutions and the least amount
> of non-synonymous nucleotide substitutions (given any combination of
> codons). Each codon is being represented as multiple lists of two
> integers (depending on the overall amount of codons being processed).
> The first integer specifies the amount of synonymous and the second
> specifies the amount of non-synonymous substitutions. For example, if
> 10 codons are being fitted in a network, then there are 10x10 = 100
> combinations of codon-codon pathways, each represented with a
> two-integer list, and out of these 100 lists, the 'best' 10 have to be
> chosen to get the most optimal codon network (and the repetitiveness of
> the function mainly arises because of this process). This is, in short,
> a description of the function and I would appreciate any pointers that
> would help to make the code more succinct :)

From p.j.a.cock at googlemail.com Fri Sep 6 11:44:44 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 6 Sep 2013 16:44:44 +0100
Subject: [Biopython-dev] Python 2 and 3 migration thoughts

On Thu, May 30, 2013 at 2:33 PM, Peter Cock wrote:
> Splitting off from this thread:
> http://lists.open-bio.org/pipermail/biopython/2013-May/008601.html
>
> On Thu, May 30, 2013 at 2:13 PM, Peter Cock wrote:
>> Thank you for all the comments so far, don't stop yet :)
>>
>> On Thu, May 30, 2013 at 1:51 PM, Wibowo Arindrarto
>> wrote:
>>> Hi everyone,
>>>
>>> I'm leaning towards insisting on Python >=3.3 support (I'm running
>>> 3.3.2).
>>> I suppose that even if Python3.3 is not available on a machine
>>> or through the default package manager, it's always installable on its
>>> own. If that's not the case, I imagine Python2.x is most likely
>>> present in these machines (so Biopython can still be used).
>>
>> True.
>>
>> So far everyone who has replied (including some off list) have said
>> they are using Python 3.3 which is encouraging. Thank you for
>> the comments so far.
>>
>> It looks like we can forget about Python 3.1, and just need to
>> decide if it is worth including Python 3.2.5 in the short term.
>>
>>> On a related note, do we have a defined timeline on when we
>>> would drop support for Python2.x? Are there any plans to have
>>> our codebase written in Python3.x instead of Python2.x?
>>
>> Nothing concrete planned, no. I'll reply in more detail on the
>> biopython-dev list as I do have some thoughts about this.
>
> Good question Bow,
>
> I think people will still be using Python 2 a year or two from
> now, so we must support both for some time.
>
> Biopython 1.62 (next week perhaps?)
> - Final release with Python 2.5 support
> - Official support for Python 2.5, 2.6, 2.7 and 3.3
> - Possibly official support for Python 3.2.5+ as well?
>
> (Exactly which versions of Python 3 we'll include to be
> decided, see the other thread for that discussion.)
>
> Short term we will continue with developing using Python 2
> syntax and running 2to3 for Python 3. As far as I know,
> the reverse process with 3to2 is not well established. If
> anyone wants to investigate that would be useful as
> another option. However, dropping Python 2.5 support
> makes things more flexible...
>
> Medium term I believe it would be possible to have a single
> code base which is both valid Python 2 and 3 at the same
> time. This may require us to target 2.7 and 3.3+ only - we'll
> have to try it and see if Python 2.6 will hold us back.
>
> I've actually done this with backports.lzma, a small but
> non-trivial module with Python and C code:
>
> https://pypi.python.org/pypi/backports.lzma/
> https://github.com/peterjc/backports.lzma
>
> Python 3.3 reintroduces some features designed to make
> this more straightforward, like unicode literals (missing in
> the early versions of Python 3). This is why I'd like to drop
> Python 3.2 as soon as possible.
>
> What I was thinking is we can start migrating modules on a
> case by case basis from "Python 2 syntax" to "Dual syntax"
> one by one, with a white-list in the do2to3.py script. That
> way over time fewer and fewer modules need to be converted
> via 2to3, and "python3 setup.py install" will get faster,
> until eventually we can stop using 2to3 at all.
>
> This conversion could consider the code and doctests
> separately. However, using print(example) we can
> hopefully get most of the doctests and Tutorial examples
> to work under both Python 2 and 3 at the same time.
>
> That's my current thinking anyway - and I think the fact
> that it would be a gradual migration from writing Python 2
> specific code to writing dual 2/3 code makes it low risk
> (as long as we're continuing to run regular testing).
>
> Regards,
>
> Peter

This branch is trying out marking individual Python files as dual coding
(Python 2 and Python 3) or as Python 2 only requiring conversion via 2to3
for use on Python 3:

https://github.com/peterjc/biopython/tree/tag2to3

Currently the tags are two special hash comment lines expected near the
start of the file itself (rather than a list within the do2to3.py
script). The actual text of the marker isn't critical - perhaps these
need full stops?

# This file targets both Python 2 and Python 3 at the same time
# TODO - Targets Python 2 only (use 2to3 to run under Python 3)

The first main issue thus far has been print statements, where we will
either need to use the __future__ import or restrict ourselves to simple
single argument calls - I have been using the latter. This should not be
a big problem in the main code, and we ought to update the
print-and-compare unit tests anyway.

The next common issue is import statements, for example StringIO (another
bytes versus unicode issue). That can be handled via Bio._py3k in some
cases.

A third major class of issues in the unit tests so far is iterators versus
lists, for example dictionary methods and the map function's return value.
These can be tackled on a case by case basis I think - often adding the
occasional list(...) or using sorted(x) instead of trying x.sort() is
enough.

There are also quite a few instances of 'basestring' which might be
handled via _py3k?

As of right now, on this branch there are only 8 files under Tests which
require conversion via 2to3:

Tests/common_BioSQL.py
Tests/seq_tests_common.py
Tests/test_NCBI_qblast.py
Tests/test_SCOP_Cla.py
Tests/test_seq.py
Tests/test_SeqIO.py
Tests/test_SeqIO_index.py
Tests/test_Uniprot.py

Having I hope demonstrated this will work, I'd like some feedback before
applying this (or a modified version of it) to the master branch.

Any thoughts?

Thanks,

Peter

From p.j.a.cock at googlemail.com Sat Sep 7 07:30:50 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 7 Sep 2013 12:30:50 +0100
Subject: [Biopython-dev] Python 2 and 3 migration thoughts

On Fri, Sep 6, 2013 at 4:44 PM, Peter Cock wrote:
> On Thu, May 30, 2013 at 2:33 PM, Peter Cock wrote:
>>
>> Short term we will continue with developing using Python 2
>> syntax and running 2to3 for Python 3. As far as I know,
>> the reverse process with 3to2 is not well established.
>> If anyone wants to investigate that would be useful as
>> another option. However, dropping Python 2.5 support
>> makes things more flexible...
>>
>> Medium term I believe it would be possible to have a single
>> code base which is both valid Python 2 and 3 at the same
>> time. This may require us to target 2.7 and 3.3+ only - we'll
>> have to try it and see if Python 2.6 will hold us back.
>>
>> I've actually done this with backports.lzma, a small but
>> non-trivial module with Python and C code:
>>
>> https://pypi.python.org/pypi/backports.lzma/
>> https://github.com/peterjc/backports.lzma
>>
>> Python 3.3 reintroduces some features designed to make
>> this more straightforward, like unicode literals (missing in
>> the early versions of Python 3). This is why I'd like to drop
>> Python 3.2 as soon as possible.
>>
>> What I was thinking is we can start migrating modules on a
>> case by case basis from "Python 2 syntax" to "Dual syntax"
>> one by one, with a white-list in the do2to3.py script. That
>> way over time fewer and fewer modules need to be converted
>> via 2to3, and "python3 setup.py install" will get faster,
>> until eventually we can stop using 2to3 at all.
>>
>> This conversion could consider the code and doctests
>> separately. However, using print(example) we can
>> hopefully get most of the doctests and Tutorial examples
>> to work under both Python 2 and 3 at the same time.
>>
>> That's my current thinking anyway - and I think the fact
>> that it would be a gradual migration from writing Python 2
>> specific code to writing dual 2/3 code makes it low risk
>> (as long as we're continuing to run regular testing).
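For readers following the "dual syntax" idea, here is a minimal sketch of the kind of code that runs unchanged on Python 2 and Python 3, covering two of the issue classes raised in this thread (moved imports such as StringIO, and iterators versus lists). It is a generic illustration with made-up variable names; it does not use Bio._py3k and is not taken from the tag2to3 branch.

```python
# Imports that moved between Python 2 and 3 can be tried in turn.
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

counts = {"G": 3, "A": 1, "C": 2}

# dict.keys() is a list on Python 2 but a view on Python 3, so build a
# sorted list explicitly rather than calling .sort() on the result.
letters = sorted(counts)

# map() returns a list on Python 2 but an iterator on Python 3; wrapping
# it in list() gives a list on both.
doubled = list(map(lambda letter: counts[letter] * 2, letters))

handle = StringIO()
handle.write("%s %s\n" % (letters, doubled))
print(handle.getvalue().strip())
```

Each idiom costs a little on one interpreter (an extra list copy, a failed import attempt), which seems an acceptable price while both versions are supported.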
>>
>> Regards,
>>
>> Peter
>
> This branch is trying out marking individual Python files
> as dual coding (Python 2 and Python 3) or as Python 2
> only requiring conversion via 2to3 for use on Python 3:
>
> https://github.com/peterjc/biopython/tree/tag2to3
>
> Currently the tags are two special hash comment lines
> expected near the start of the file itself (rather than a
> list within the do2to3.py script). The actual text of the
> marker isn't critical - perhaps these need full stops?
>
> # This file targets both Python 2 and Python 3 at the same time
> # TODO - Targets Python 2 only (use 2to3 to run under Python 3)
>
> The first main issue thus far has been print statements,
> where we will either need to use the __future__ import or
> restrict ourselves to simple single argument calls - I have
> been using the latter. This should not be a big problem in the
> main code, and we ought to update the print-and-compare
> unit tests anyway,

e.g.
https://github.com/biopython/biopython/commit/6fa766e2348eae4e083503885f4ea5b66f531d7a

> The next common issue is import statements, for
> example StringIO (another bytes versus unicode issue).
> That can be handled via Bio._py3k in some cases.

For StringIO,
https://github.com/biopython/biopython/commit/b09ebbf6f8c4032f874d89a91d199d8697c2d381

For commands.getoutput used in many tests,
https://github.com/biopython/biopython/commit/11a1eca60e7a1491dbe54204ad3103e013bfebc5

> A third major class of issues in the unit tests so
> far is iterators versus lists, for example dictionary
> methods and the map function's return value. These
> can be tackled on a case by case basis I think - often
> adding the occasional list(...) or using sorted(x) instead
> of trying x.sort() is enough.

e.g. for sorting dictionary keys,
https://github.com/biopython/biopython/commit/b27f30012af6e66f6f143ecde719bf72609af8f2

e.g.
for avoiding iterators from map function,
https://github.com/biopython/biopython/commit/730850e3f4e88a70860e56abafbb579b25414f06

> There are also quite a few instances of 'basestring'
> which might be handled via _py3k?
>
> As of right now, on this branch there are only 8 files under
> Tests which require conversion via 2to3 :

Down to six files under Tests now if I rebase the branch to include the
recent fixes on the master.

> Having I hope demonstrated this will work, I'd like some
> feedback before applying this (or a modified version of
> it) to the master branch.

I've started applying individual code fixes to the master to improve
Python 2 and 3 compatibility already.

I'm specifically looking for thoughts on how to handle the transition
period when some of our code will still need 2to3, while other code will
not. Does the special comment line seem like a good solution? On the plus
side, it tracks any changes with the file being updated (which wouldn't
happen with a list in the do2to3.py file).

Peter

From p.j.a.cock at googlemail.com Sat Sep 7 09:44:55 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 7 Sep 2013 14:44:55 +0100
Subject: [Biopython-dev] SearchIO wiki page & documentation

Hi Bow,

You've done a great job with the wiki page for SearchIO,
http://biopython.org/wiki/SearchIO - thank you!

One thing I wondered about on reading this is if for the BLAST XML output
the optional indent and increment arguments could be combined into one -
an indent string defaulting to two spaces?

Also for frames, is there an existing Biopython precedent for this (-3 to
3)?

Regards,

Peter

From bow at bow.web.id Sat Sep 7 10:04:12 2013
From: bow at bow.web.id (Wibowo Arindrarto)
Date: Sat, 7 Sep 2013 16:04:12 +0200
Subject: [Biopython-dev] SearchIO wiki page & documentation

Hi Peter,

Thanks. Comments are always welcomed :).

For the indent and increment argument, I actually prefer to keep them
separate.
The reason is that having them in separate variables makes it easier for
the writer to navigate into or out of the levels. The writer keeps track
of which XML child element it is writing, and it either increases or
decreases the level (so it can print the proper indentation). This is
required since BLAST's XML tree does not really map onto the object model
we are using. It is similar, but not the same (e.g. the statistics tags
are all children of a single element that is not the query element, while
in the object they are all flat attributes of the query object).

When it increases the element level, I can understand that having indent
and increment as one argument makes it simpler. However, when the writer
wants to go up a level (go back to the parent level), it gets difficult
with a combined indent & increment variable, since Python strings do not
work with the minus operator (though they do work with the plus operator).

As for the frames, I tried to make it consistent with the way SeqFeature
stores its strands (-3 to 3, and None).

Best,
Bow

On Sat, Sep 7, 2013 at 3:44 PM, Peter Cock wrote:
> Hi Bow,
>
> You've done a great job with the wiki page for SearchIO,
> http://biopython.org/wiki/SearchIO - thank you!
>
> One thing I wondered about on reading this is if for the
> BLAST XML output the optional indent and increment
> arguments could be combined into one - an indent
> string defaulting to two spaces?
>
> Also for frames, is there an existing Biopython precedent
> for this (-3 to 3)?
>
> Regards,
>
> Peter

From jurajbergman at hotmail.com Sat Sep 7 10:14:59 2013
From: jurajbergman at hotmail.com (Juraj Bergman)
Date: Sat, 7 Sep 2013 16:14:59 +0200
Subject: [Biopython-dev] Python_MKT

Hi,

I've made some improvements in my MKT module - mainly using Kruskal's
algorithm to rewrite the multi_short_path() function (thanks for the
suggestion Zheng!) and I added some new functions as well (pathway_a(),
pathways_n()).
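For readers following the thread, here is a minimal sketch of how Kruskal's algorithm could be applied to the codon network problem discussed above. It is not the code from Python_MKT.py: the function name, the input format (a dict of precomputed pairwise costs) and the example values are assumptions made for illustration. Edges are ranked by total substitutions first and by non-synonymous substitutions second, so the resulting tree minimizes the overall count and, among trees with that count, the non-synonymous count.

```python
def kruskal_codon_tree(codons, edge_costs):
    """Return (edges, (synonymous, nonsynonymous)) for a spanning tree.

    Hypothetical helper: edge_costs maps frozenset({codon_a, codon_b}) to
    a (synonymous, nonsynonymous) pair of substitution counts.
    """
    parent = {codon: codon for codon in codons}

    def find(codon):
        # Union-find lookup with path halving.
        while parent[codon] != codon:
            parent[codon] = parent[parent[codon]]
            codon = parent[codon]
        return codon

    def rank(item):
        pair, (syn, nonsyn) = item
        # Total substitutions first, then non-synonymous, then a stable
        # tie-break on the codon names.
        return (syn + nonsyn, nonsyn, sorted(pair))

    tree = []
    syn_total = nonsyn_total = 0
    for pair, (syn, nonsyn) in sorted(edge_costs.items(), key=rank):
        a, b = sorted(pair)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two separate components
            parent[root_a] = root_b
            tree.append((a, b))
            syn_total += syn
            nonsyn_total += nonsyn
    return tree, (syn_total, nonsyn_total)


# Example costs for two lysine codons (AAA, AAG) and two arginine codons
# (AGA, AGG); the values were worked out by hand for this sketch.
costs = {
    frozenset(["AAA", "AAG"]): (1, 0),
    frozenset(["AGA", "AGG"]): (1, 0),
    frozenset(["AAA", "AGA"]): (0, 1),
    frozenset(["AAG", "AGG"]): (0, 1),
    frozenset(["AAA", "AGG"]): (1, 1),
    frozenset(["AAG", "AGA"]): (1, 1),
}
tree, totals = kruskal_codon_tree(["AAA", "AAG", "AGA", "AGG"], costs)
print(tree, totals)
```

Because the union-find structure only merges separate components, the result is always a single connected tree, which sidesteps the "multiple unconnected branches" concern mentioned earlier in the thread.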
Links: https://www.dropbox.com/s/zgnz8xwlcsispzf/Python_MKT.pdf https://www.dropbox.com/s/1z3opj4rbb0ms14/Python_MKT.py Regards, Juraj Date: Fri, 6 Sep 2013 00:00:06 -0400 Subject: Fwd: [Biopython-dev] Python_MKT From: zruan1991 at gmail.com To: biopython-dev at biopython.org; jurajbergman at hotmail.com Hi Juraj, I am also planning to implement the MK test in my GSoC framework. I just went through your code and it is really independent. Would you be able to modify it to use the MultipleSeqAlignment, Alphabet and CodonTable modules of Biopython so that it is more extensible? As to the multi_short_path() function, it really confused me. Is your implementation guaranteed to find the shortest path? This problem can be abstracted as finding the minimum spanning tree in graph theory, for which good algorithms are known (Prim's algorithm or Kruskal's algorithm). My idea for linking multiple codons is to first generate a codon-by-codon matrix representing the synonymous and nonsynonymous substitutions needed to change each codon into each of the others, and then to find the minimum spanning tree that connects all the nodes in the matrix with minimum length (least synonymous substitutions). I plan to implement this and you may have more insight about my suggestions. Thanks! Best, Zheng Ruan On Thu, Sep 5, 2013 at 10:33 AM, Juraj Bergman wrote: Dear all, I'm resending my implementation of the McDonald-Kreitman test. Link to the description of the module: https://www.dropbox.com/s/zgnz8xwlcsispzf/Python_MKT.pdf Link to the code: https://www.dropbox.com/s/1z3opj4rbb0ms14/Python_MKT.py I apologise for the initial mistake of sending attachments instead of links. Kind regards, Juraj Bergman P.S. Regarding the multi_short_path() function - I realize that it is very, very repetitive but I have not (yet) managed to find a suitable loop construction that would replace the current code.
The multi_short_path() function is by far the most complex function of the module because its purpose is to find the codon network with the least amount of overall nucleotide substitutions and the least amount of non-synonymous nucleotide substitutions (given any combination of codons). Each codon is represented as multiple lists of two integers (depending on the overall number of codons being processed). The first integer specifies the number of synonymous and the second specifies the number of non-synonymous substitutions. For example, if 10 codons are being fitted in a network, then there are 10x10 = 100 combinations of codon-codon pathways, each represented with a two-integer list, and out of these 100 lists, the 'best' 10 have to be chosen to get the optimal codon network (and the repetitiveness of the function mainly arises because of this process). This is, in short, a description of the function and I would appreciate any pointers that would help to make the code more succinct :) _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Sat Sep 7 14:12:37 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 7 Sep 2013 19:12:37 +0100 Subject: [Biopython-dev] Print statements vs functions (Python 2 vs 3) In-Reply-To: References: Message-ID: On Sat, Sep 7, 2013 at 2:52 PM, Peter Cock wrote: > On Sat, Sep 7, 2013 at 12:41 PM, Peter Cock wrote: >> Dear Biopythoneers, >> >> As you will be aware, with our recent release of Biopython 1.62 >> we now officially support Python 3 for the first time (specifically >> Python 3.3 - we don't recommend 3.0, 3.1 or 3.2 here), while >> continuing to support Python 2 as well. >> >> Currently all our documentation is written assuming Python 2, >> but with some small changes most things can be written to >> work under both variants.
The most visible change is how to >> print things, and that happens a lot in our examples. >> >> I would like us to switch to using the Python 3 style print >> function in our documentation (including the Tutorial and >> the docstrings embedded in the code as help text). >> >> ... >> >> Would anyone object to us using the print function style >> in the Biopython documentation? >> >> I'm particularly keen to hear from beginners - as this >> is potentially confusing. >> >> Thanks, >> >> Peter. > > I tweeted this email, > > Biopython Project (@Biopython): Would anyone object to us using > #Python3 print function style in the #Biopython documentation? > http://lists.open-bio.org/pipermail/biopython/2013-September/008751.html > https://twitter.com/Biopython/status/376309705972654080 > > Two replies already: > > Raphael Mattos (@rsmattos): @Biopython I think it's time.... > https://twitter.com/rsmattos/status/376321218456338432 > > Alec Munro (@alecmunro): @Biopython do it! > https://twitter.com/alecmunro/status/376341224544038912 > > Peter On Sat, Sep 7, 2013 at 5:25 PM, Peter Cock wrote: > On Sat, Sep 7, 2013 at 2:50 PM, Dan Tomso wrote: >> Hi, Peter. >> >> This sounds OK to me. > > Thanks Dan. And another voice of approval on Twitter: > > Karin Lagesen (@karinlag): @Biopython @pjacock Go for it! 
> https://twitter.com/karinlag/status/376356704080105472 And another positive voice: Dave Lunt (@davelunt): @Biopython the docs change sounds good, that very clear explanation you link to should also be somewhere obvious https://twitter.com/davelunt/status/376405338511384576 Since there has only been positive reaction, I've made a start at converting the examples in the Tutorial to use the Python 3 style print function (maintaining full Python 2 compatibility under Python 2.6 and 2.7 via the future import): https://github.com/biopython/biopython/commit/34d155a02cbcf7c953fb8238a5412f8c7c0e1cc5 https://github.com/biopython/biopython/commit/74a8b8349b58ae9aa7a727d6e1ab774a4c9008a3 For those curious to see how it looks (but not already familiar with LaTeX, pdflatex and hevea), you can see a sneak preview here: http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf (Hopefully those links will once again auto-update every night, something that was working nicely prior to the server move) If you spot any typos, please let us know. Thanks! Peter From eric.talevich at gmail.com Sat Sep 7 15:17:08 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 7 Sep 2013 12:17:08 -0700 Subject: [Biopython-dev] Python 2 and 3 migration thoughts In-Reply-To: References: Message-ID: On Sat, Sep 7, 2013 at 4:30 AM, Peter Cock wrote: > On Fri, Sep 6, 2013 at 4:44 PM, Peter Cock > wrote: > > > > This branch is trying out marking individual Python files > > as dual coding (Python 2 and Python 3) or as Python 2 > > only requiring conversion via 2to3 for use on Python 3: > > > > https://github.com/peterjc/biopython/tree/tag2to3 > > > > Currently the tags are two special hash comment lines > > expected near the start of the file itself (rather than a > > list within the do2to3.py script). The actual text of the > > marker isn't critical - perhaps these need full stops? 
> > > > # This file targets both Python 2 and Python 3 at the same time > > # TODO - Targets Python 2 only (use 2to3 to run under Python 3) > > > [...] > > As of right now, on this branch there are only 8 files under > > Tests which require conversion via 2to3 : > > Down to six files under Tests now if I rebase the branch > to include the recent fixes on the master. > > > Having I hope demonstrated this will work, I'd like some > > feedback before applying this (or a modified version of > > it) to the master branch. > > I've started applying individual code fixes to the master > to improve Python 2 and 3 compatibility already. > > I'm specifically looking for thoughts on how to handle > the transition period when some of our code will still > need 2to3, while other code will not. > > Does the special comment line seem like a good solution? > On the plus side, it tracks any changes with the file being > updated (which wouldn't happen with a list in the do2to3.py > file). > > Peter > > Hi Peter, This looks like a good way to move forward overall. Regarding the special comment lines -- since these are only used in do2to3.py, would it be cleaner to keep a hard-coded list of filenames in do2to3.py and leave the modules and scripts alone? Are there any characteristics that would make it difficult to determine whether a given module or script is Py3-compliant? -Eric From p.j.a.cock at googlemail.com Sun Sep 8 16:52:40 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 8 Sep 2013 21:52:40 +0100 Subject: [Biopython-dev] Python 2 and 3 migration thoughts In-Reply-To: References: Message-ID: On Sat, Sep 7, 2013 at 8:17 PM, Eric Talevich wrote: >> > >> > # This file targets both Python 2 and Python 3 at the same time >> > # TODO - Targets Python 2 only (use 2to3 to run under Python 3) >> > >> >> Does the special comment line seem like a good solution? 
>> On the plus side, it tracks any changes with the file being >> updated (which wouldn't happen with a list in the do2to3.py >> file). > > Hi Peter, > > This looks like a good way to move forward overall. Regarding the special > comment lines -- since these are only used in do2to3.py, would it be > cleaner to keep a hard-coded list of filenames in do2to3.py and leave the > modules and scripts alone? Are there any characteristics that would make it > difficult to determine whether a given module or script is Py3-compliant? Hi Eric, There are import time problems which are easy to spot - in particular SyntaxError is a good clue. However, many of the issues are only really found at run time (e.g. different method names). This means that the tests (which I started with) are actually the easiest to check. Right now I don't have a feel for what fraction of the main Bio/* and BioSQL/* files can be made dual-coding, and that would have an influence on how best to tag things needing 2to3 or not. I'm happy to continue this on branches for a while longer and find out. I do like the idea of a special #TODO comment line where 2to3 is still needed - it is symbolic of where I want the code base to go ;) Regards, Peter From nigel.delaney at outlook.com Mon Sep 9 12:03:49 2013 From: nigel.delaney at outlook.com (Nigel Delaney) Date: Mon, 9 Sep 2013 12:03:49 -0400 Subject: [Biopython-dev] VCF Parsers Message-ID: Hi Biopython, Just wanted to ask quickly if anyone on the biopython team has implemented or is implementing vcf parsers. I have seen a few python written ones but they seem to be quite slow, and so am curious if anyone has wrapped a C library of some sort. Thanks for any help, Nigel From arklenna at gmail.com Mon Sep 9 12:18:17 2013 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 9 Sep 2013 17:18:17 +0100 Subject: [Biopython-dev] VCF Parsers In-Reply-To: References: Message-ID: I believe PyVCF [1] has a Cython implementation. 
Cheers, Lenna 1: https://github.com/jamescasbon/PyVCF On Mon, Sep 9, 2013 at 5:03 PM, Nigel Delaney wrote: > Hi Biopython, > > > > Just wanted to ask quickly if anyone on the biopython team has implemented > or is implementing vcf parsers. I have seen a few python written ones but > they seem to be quite slow, and so am curious if anyone has wrapped a C > library of some sort. > > > > Thanks for any help, > > Nigel > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From chapmanb at 50mail.com Mon Sep 9 12:56:08 2013 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 09 Sep 2013 12:56:08 -0400 Subject: [Biopython-dev] VCF Parsers In-Reply-To: References: Message-ID: <86sixe7x3r.fsf@fastmail.fm> Nigel and Lenna; There is also a fully Cython implementation called cyvcf and has the same interface as PyVCF with some additional speed improvements: https://github.com/arq5x/cyvcf https://pypi.python.org/pypi/cyvcf Brad > I believe PyVCF [1] has a Cython implementation. > > Cheers, > > Lenna > > 1: https://github.com/jamescasbon/PyVCF > > > On Mon, Sep 9, 2013 at 5:03 PM, Nigel Delaney wrote: > >> Hi Biopython, >> >> >> >> Just wanted to ask quickly if anyone on the biopython team has implemented >> or is implementing vcf parsers. I have seen a few python written ones but >> they seem to be quite slow, and so am curious if anyone has wrapped a C >> library of some sort. 
>> >> >> >> Thanks for any help, >> >> Nigel >> >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Mon Sep 9 16:29:35 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 9 Sep 2013 21:29:35 +0100 Subject: [Biopython-dev] Print statements vs functions (Python 2 vs 3) In-Reply-To: References: Message-ID: On Sat, Sep 7, 2013 at 7:12 PM, Peter Cock wrote: > On Sat, Sep 7, 2013 at 2:52 PM, Peter Cock wrote: >> On Sat, Sep 7, 2013 at 12:41 PM, Peter Cock wrote: >>> ... >>> >>> Would anyone object to us using the print function style >>> in the Biopython documentation? > > ... > > Since there has only been positive reaction, I've made a > start at converting the examples in the Tutorial to use the > Python 3 style print function (maintaining full Python 2 > compatibility under Python 2.6 and 2.7 via the future > import): > > https://github.com/biopython/biopython/commit/34d155a02cbcf7c953fb8238a5412f8c7c0e1cc5 > https://github.com/biopython/biopython/commit/74a8b8349b58ae9aa7a727d6e1ab774a4c9008a3 > It turned out to be slightly more than a weekend project, but I've now done this for the main code including the doctests :) All new code changes should be written using the print function style and will then work on both Python 2 and 3 without change, e.g. 
print(variable) Any accidental usage of an old-style print statement will be caught in two ways. Under Python 2 it is caught via the future import (if it is in the file you are editing): https://github.com/biopython/biopython/commit/de12c5e08fc44d9c158954bb4b1d5f98cfb84c69 And I have also disabled the print fixer during 2to3, so that old-style print statements cause an error when testing under Python 3: https://github.com/biopython/biopython/commit/00ab061dba42082ff0e20383847ebffaf6dd8eef If you are using a print function, and the file doesn't have it already, please add the future import: from __future__ import print_function If there are any stray print statements still there (e.g. hiding in example scripts I missed), please fix them or report them. Regards, Peter From zruan1991 at gmail.com Mon Sep 9 22:36:21 2013 From: zruan1991 at gmail.com (Zheng Ruan) Date: Mon, 9 Sep 2013 22:36:21 -0400 Subject: [Biopython-dev] Codon Alignment GSoC Weekly Update Message-ID: Hi all, The update for Codon Alignment GSoC project can be found at http://zruanweb.com/. Thanks for your comments and suggestions.
Best, Zheng Ruan From yeyanbo289 at gmail.com Tue Sep 10 03:29:06 2013 From: yeyanbo289 at gmail.com (Yanbo Ye) Date: Tue, 10 Sep 2013 15:29:06 +0800 Subject: [Biopython-dev] GSOC weekly update 13 Message-ID: Hi all, I posted the update of Biopython.Phylo project here: http://blog.yeyanbo.com/posts/google-summer-of-code-13.html Thanks, Yanbo -- Yanbo Ye Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences 190 Kaiyuan Avenue, Science Park, Guangzhou, China Email: ye_yanbo at gibh.ac.cn Web: http://www.yeyanbo.com Phone: (86)-020-32093810 From p.j.a.cock at googlemail.com Fri Sep 13 04:54:14 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 13 Sep 2013 09:54:14 +0100 Subject: [Biopython-dev] Galaxy Tool Shed packages for Biopython Message-ID: Hi all, I've sent this to both the Galaxy and Biopython developers lists, and hope this will make sense to both groups. If you've not heard of Galaxy, start here: http://galaxyproject.org - while the easy to guess Biopython website is at http://biopython.org Brad Chapman and I are both Biopython core developers, and are also both on the "IUC" Galaxy Tool Shed committee because we've been quite involved in wrapping and writing tools for use on Galaxy. Fellow committee member Björn Grüning has done a lot of the hands-on work defining package definitions for dependencies within the Galaxy Tool Shed ecosystem - including defining them for Biopython, NumPy, SciPy, MatPlotLib, etc.
We're very grateful for his hard work - most of which is now available under the IUC group account: http://toolshed.g2.bx.psu.edu/view/iuc/ http://testtoolshed.g2.bx.psu.edu/view/iuc/ The Biopython packages, however, are under a dedicated "biopython" account on the Galaxy Tool Shed to which currently Bjoern, Brad and I have access to: http://toolshed.g2.bx.psu.edu/view/biopython/ http://testtoolshed.g2.bx.psu.edu/view/biopython/ This packaging work was initially tracked in Bjoern's own GitHub repository, https://github.com/bgruening/galaxytools/ We (me, Brad and Bjoern) agreed that a Biopython owned repository would be more sensible in the long term, so I have created this and ported Bjoern's commits to it: https://github.com/biopython/galaxy_packages Currently the "Galaxy packagers" team on GitHub which has read and write access to this new repository is just me, Brad and Bjoern. Regards, Peter From eric.talevich at gmail.com Fri Sep 13 16:08:12 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 13 Sep 2013 13:08:12 -0700 Subject: [Biopython-dev] Fwd: New Biopython (sub)module? In-Reply-To: References: <521354A9.6020701@brueffer.de> Message-ID: On Thu, Aug 22, 2013 at 6:01 AM, Peter Cock wrote: > On Wed, Aug 21, 2013 at 11:00 PM, Cyrus Maher > wrote: > > > > That said, I was also hoping to get your thoughts on whether this seemed > > like the type of project that would fit in with Biopython. Peter said > that > > Eric might have some good comments on this matter? > > Right - I was thinking Eric and this year's phylogenetic focused GSoC > students should have some good comments, e.g. about adding > something like pal2nal into Biopython. > > Peter > Hi Cyrus, MOSAIC looks cool, it's always good to see progress in ortholog detection. Since the core of the program is a single Python module, it shouldn't be too hard to plug this into Biopython. 
Keep in mind, though, that once MOSAIC is in the Biopython source tree it could become less convenient for you to make major updates and changes to the program, whereas if you control the packaging yourself you're free to change the API, add dependencies, etc. however you like. So, for the manuscript/publication at least, you might find it safer to only state that distributing MOSAIC with Biopython is planned, rather than committing to a release version number. Thoughts on the code: - Zheng Ruan has written a nice codon alignment module as part of his GSoC project. Once that's merged, you'll be able to drop the pal2nal dependency. - We haven't merged Chris's MSAprobs wrapper yet (to my knowledge), though at a glance it looks like it should be straightforward. For Bio.mosaic (I guess?), we would probably wait until the wrapper is merged and then remove the conditional in mosaic. - Does EMBOSS stretcher do anything that couldn't be done with Bio.pairwise2? If not, you could use pairwise2 instead and avoid another dependency. - The use of pandas looks fairly basic and therefore also avoidable. It looks like with a few more lines of code you could use Python's built-in csv module to parse a table and store it in a numpy matrix instead. - MOSAIC does some logging to the console, which is sensible for the program but isn't done as much in Biopython. Some of these print statements could be changed to warnings (see the warnings module). The progress indicators could maybe be toggled at the function level with a keyword argument, e.g. verbose=True/False. 
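A rough sketch of the last suggestion, for illustration only: the function align_all and its behaviour are hypothetical stand-ins, not MOSAIC's actual code. It shows a problem report routed through the standard warnings module, and a progress message that is only written when the caller passes verbose=True.

```python
import sys
import warnings


def align_all(sequences, verbose=False):
    """Hypothetical stand-in for a long-running step.

    Problems are reported with warnings.warn() so callers can filter
    or capture them; progress goes to stderr only when verbose is True.
    """
    results = []
    for index, seq in enumerate(sequences):
        if not seq:
            # A warning rather than a print: library users can silence it.
            warnings.warn("Skipping empty sequence at index %d" % index)
            continue
        if verbose:
            sys.stderr.write("Processing %d of %d\n" % (index + 1, len(sequences)))
        results.append(seq.upper())
    return results


# Capture warnings explicitly so the behaviour can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    output = align_all(["acgt", "", "ttga"])
```

Defaulting verbose to False keeps library calls quiet, while a command-line front end could still pass verbose=True.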
Cheers, Eric From eric.talevich at gmail.com Fri Sep 13 17:05:16 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 13 Sep 2013 14:05:16 -0700 Subject: [Biopython-dev] GSOC weekly update 13 In-Reply-To: References: Message-ID: On Tue, Sep 10, 2013 at 12:29 AM, Yanbo Ye wrote: > > Hi all, > > I posted the update of Biopython.Phylo project here: > http://blog.yeyanbo.com/posts/google-summer-of-code-13.html > > Thanks, > Yanbo > Hi Yanbo, Looks like you finished your project right on schedule. :) For the next week, how are you planning to document your new modules? It looks like you've put the essential information in the docstrings, which is good to see. If you write more detailed explanations or examples of how to use the new features on the Biopython wiki next week, I can help roll them into the main tutorial. Or you could make a patch to Tutorial.tex directly, if you'd like. The unit tests look pretty good already. Thanks for all your hard work! Cheers, Eric From eric.talevich at gmail.com Fri Sep 13 17:56:52 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 13 Sep 2013 14:56:52 -0700 Subject: [Biopython-dev] Codon Alignment GSoC Weekly Update In-Reply-To: References: Message-ID: Hi Zheng, I just went through your code and left some comments. Impressive work! So, next week is the "soft pencils-down" deadline, and this would be a good time to put together the canonical documentation for your project. One way to go about this would be to copy the relevant text, code examples and figures from your blog and either put them on a CodonAlignment page on the Biopython wiki, or consolidate them into a new chapter in Tutorial.tex. Or did you have something else in mind? Cheers, Eric On Mon, Sep 9, 2013 at 7:36 PM, Zheng Ruan wrote: > Hi all, > > The update for Codon Alignment GSoC project can be found at > http://zruanweb.com/. Thanks for your comments and suggestions. 
> > Best, > Zheng Ruan > From zruan1991 at gmail.com Sat Sep 14 12:49:33 2013 From: zruan1991 at gmail.com (Zheng Ruan) Date: Sat, 14 Sep 2013 12:49:33 -0400 Subject: [Biopython-dev] Chi2 test in Bio.Phylo.PAML.chi2 Message-ID: Hi all, I am trying to use the chi2 test within Biopython to reduce my dependency on scipy. However, the chi2 test is very slow for some values of the statistic when the degrees of freedom is 1 (the MK test has a df of 1). Here is a small example: >>> from Bio.Phylo.PAML import chi2 >>> chi2.cdf_chi2(1, 1) 0.3173105078923443 >>> chi2.cdf_chi2(1, 2) 0.1572992072733692 >>> chi2.cdf_chi2(1, 3) 0.08326451704454607 >>> chi2.cdf_chi2(1, 4) 0.04550026405390195 >>> chi2.cdf_chi2(1, 5) ^CTraceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line 20, in cdf_chi2 prob = 1 - _incomplete_gamma(x, alpha) File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line 116, in _incomplete_gamma pn[i] /= overflow KeyboardInterrupt >>> chi2.cdf_chi2(1, 6) 0.014305878510978087 >>> chi2.cdf_chi2(1, 7) 0.00815097160412992 >>> chi2.cdf_chi2(1, 8) 0.004677734999637195 >>> chi2.cdf_chi2(1, 9) ^CTraceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line 20, in cdf_chi2 prob = 1 - _incomplete_gamma(x, alpha) File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line 112, in _incomplete_gamma pn[i] = pn[i+2] KeyboardInterrupt The behavior of chi2.cdf_chi2 is quite weird. Could someone clarify this? Thanks! Best, Zheng Ruan From eric.talevich at gmail.com Sat Sep 14 13:24:17 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 14 Sep 2013 10:24:17 -0700 Subject: [Biopython-dev] Chi2 test in Bio.Phylo.PAML.chi2 In-Reply-To: References: Message-ID: On Sat, Sep 14, 2013 at 9:49 AM, Zheng Ruan wrote: > Hi all, > > I am trying to use chi2 test within Biopython to reduce my dependency of > scipy.
However, the chi2 test is very slow in some case of stat value when > degree of freedom is 1 (MK test has a df of 1). Here is a small example: > > >>> from Bio.Phylo.PAML import chi2 > >>> chi2.cdf_chi2(1, 1) > 0.3173105078923443 > >>> chi2.cdf_chi2(1, 2) > 0.1572992072733692 > >>> chi2.cdf_chi2(1, 3) > 0.08326451704454607 > >>> chi2.cdf_chi2(1, 4) > 0.04550026405390195 > >>> chi2.cdf_chi2(1, 5) > ^CTraceback (most recent call last): > File "", line 1, in > File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line > 20, in cdf_chi2 > prob = 1 - _incomplete_gamma(x, alpha) > File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line > 116, in _incomplete_gamma > pn[i] /= overflow > KeyboardInterrupt > >>> chi2.cdf_chi2(1, 6) > 0.014305878510978087 > >>> chi2.cdf_chi2(1, 7) > 0.00815097160412992 > >>> chi2.cdf_chi2(1, 8) > 0.004677734999637195 > >>> chi2.cdf_chi2(1, 9) > ^CTraceback (most recent call last): > File "", line 1, in > File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line > 20, in cdf_chi2 > prob = 1 - _incomplete_gamma(x, alpha) > File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line > 112, in _incomplete_gamma > pn[i] = pn[i+2] > KeyboardInterrupt > > > The behavior of chi2.cdf_chi2 is quite wiered. Could someone clarify this? > Thanks! > > Best, > Zheng Ruan > It looks like that implementation of chi2 (based on PAML's C implementation) has trouble with convergence at df=1. I wrote another Python implementation of chi2 based on the SciPy source code (to avoid a hard SciPy dependency in CladeCompare, which also uses a G-test), which you can use if you find it works better: https://github.com/etal/biofrills/blob/master/biofrills/stats/chisq.py It imports the original scipy version at the end in case the user does have scipy installed, since that compiled version will be much faster. 
This hasn't been tested as much as Bio.Phylo.PAML.chi2, though, and I haven't benchmarked the two Python implementations against each other. Also note that it uses math.lgamma, which was only added in Python 2.7, so for 2.6 compatibility you'll need to copy in the pure-Python log-gamma implementation from Bio.Phylo.PAML.chi2. (We could add this conditional import of math.lgamma to Bio.Phylo.PAML.chi2, too.) Or, you could try increasing the tolerance used for testing convergence in Bio.Phylo.PAML.chi2. Best, Eric From yeyanbo289 at gmail.com Sun Sep 15 23:42:12 2013 From: yeyanbo289 at gmail.com (Yanbo Ye) Date: Mon, 16 Sep 2013 11:42:12 +0800 Subject: [Biopython-dev] GSOC weekly update 13 In-Reply-To: References: Message-ID: Hi Eric, I noticed there are some relevant TODOs on the phylo cookbook page, so I'd like to edit them add some examples onto the Biopython wiki this week. Cheers, Yanbo On Sat, Sep 14, 2013 at 5:05 AM, Eric Talevich wrote: > On Tue, Sep 10, 2013 at 12:29 AM, Yanbo Ye wrote: > >> >> Hi all, >> >> I posted the update of Biopython.Phylo project here: >> http://blog.yeyanbo.com/posts/google-summer-of-code-13.html >> >> Thanks, >> Yanbo >> > > Hi Yanbo, > > Looks like you finished your project right on schedule. :) > > For the next week, how are you planning to document your new modules? It > looks like you've put the essential information in the docstrings, which is > good to see. If you write more detailed explanations or examples of how to > use the new features on the Biopython wiki next week, I can help roll them > into the main tutorial. Or you could make a patch to Tutorial.tex directly, > if you'd like. > > The unit tests look pretty good already. Thanks for all your hard work! 
> > Cheers, > Eric > -- *Yanbo Ye* *Guangzhou Institutes of Biomedicine and Health, * *Chinese Academy of Sciences* *190 Kaiyuan Avenue, Science Park, Guangzhou, China** * * * *Email: ye_yanbo at gibh.ac.cn* *Web: http://www.yeyanbo.com* *Phone: (86)-020-32093810* From yeyanbo289 at gmail.com Mon Sep 16 00:33:06 2013 From: yeyanbo289 at gmail.com (Yanbo Ye) Date: Mon, 16 Sep 2013 12:33:06 +0800 Subject: [Biopython-dev] GSOC weekly update 14 Message-ID: Hi all, My update of Biopython.Phylo project is here. http://blog.yeyanbo.com/posts/google-summer-of-code-14.html This week I will add document and examples of new features to the cookbook on biopython wiki. Thanks, Yanbo -- *Yanbo Ye* *Guangzhou Institutes of Biomedicine and Health, * *Chinese Academy of Sciences* *190 Kaiyuan Avenue, Science Park, Guangzhou, China** * * * *Email: ye_yanbo at gibh.ac.cn* *Web: http://www.yeyanbo.com* *Phone: (86)-020-32093810* From zruan1991 at gmail.com Mon Sep 16 23:35:10 2013 From: zruan1991 at gmail.com (Zheng Ruan) Date: Mon, 16 Sep 2013 23:35:10 -0400 Subject: [Biopython-dev] Codon Alignment GSoC Project Last Update Message-ID: Hi all, The last code update for Codon Alignment GSoC project can be found at http://zruanweb.com/. This week I'll be converting my blog examples into an independent chapter of biopython tutorial. Some tests for the CodonAlign module will also be added shortly. Thanks! Best, Ruan From michael.maher at ucsf.edu Tue Sep 17 15:20:46 2013 From: michael.maher at ucsf.edu (Cyrus Maher) Date: Tue, 17 Sep 2013 12:20:46 -0700 Subject: [Biopython-dev] Fwd: New Biopython (sub)module? In-Reply-To: References: <521354A9.6020701@brueffer.de> Message-ID: Hi Eric, We're glad you like MOSAIC! It's exciting to start getting it out there. Just as a quick update, the latest version of the paper is available on arxiv . In addition, updated documentation, relevant files, etc. can be found here . 
The module has also been uploaded to PyPI, so it can now be installed with easy_install bio-mosaic. Given the importance of ortholog detection to a broad range of computational biology tasks, we definitely think it's worth putting in a little extra work and making a few sacrifices to make this tool more broadly and conveniently available to the community. So if you're game, we would love to start thinking about timelines for making any necessary changes. We really appreciate your comments so far. Below are some initial thoughts/replies: ============ *- Zheng Ruan has written a nice codon alignment module as part of his GSoC project. Once that's merged, you'll be able to drop the pal2nal dependency. * * * This is a great idea and we'd be happy to incorporate it. *- We haven't merged Chris's MSAprobs wrapper yet (to my knowledge), though at a glance it looks like it should be straightforward. For Bio.mosaic (I guess?), we would probably wait until the wrapper is merged and then remove the conditional in mosaic. * * * Sounds good! * * *- Does EMBOSS stretcher do anything that couldn't be done with Bio.pairwise2? If not, you could use pairwise2 instead and avoid another dependency. * Pairwise alignment constitutes a significant portion of MOSAIC's run time. stretcher was chosen because of its speed. How about this: we could test if stretcher is installed, and if it's not, we can 1.) fall back to Bio.pairwise2 and 2.) provide a helpful warning about slowdown with a direct link to the latest EMBOSS toolkit. What do you think? * * * - The use of pandas looks fairly basic and therefore also avoidable. It looks like with a few more lines of code you could use Python's built-in csv module to parse a table and store it in a numpy matrix instead. * You're totally right. We can do that. * * *- MOSAIC does some logging to the console, which is sensible for the program but isn't done as much in Biopython. 
Some of these print statements could be changed to warnings (see the warnings module). The progress indicators could maybe be toggled at the function level with a keyword argument, e.g. verbose=True/False.* Consider it done! ============ Thanks again for your feedback! Looking forward to hearing further comments/next steps, etc... Cheers, -Cyrus On Fri, Sep 13, 2013 at 1:08 PM, Eric Talevich wrote: > On Thu, Aug 22, 2013 at 6:01 AM, Peter Cock wrote: > >> On Wed, Aug 21, 2013 at 11:00 PM, Cyrus Maher >> wrote: >> > >> > That said, I was also hoping to get your thoughts on whether this seemed >> > like the type of project that would fit in with Biopython. Peter said >> that >> > Eric might have some good comments on this matter? >> >> Right - I was thinking Eric and this year's phylogenetic focused GSoC >> students should have some good comments, e.g. about adding >> something like pal2nal into Biopython. >> >> Peter >> > > Hi Cyrus, > > MOSAIC looks cool, it's always good to see progress in ortholog detection. > Since the core of the program is a single Python module, it shouldn't be > too hard to plug this into Biopython. Keep in mind, though, that once > MOSAIC is in the Biopython source tree it could become less convenient for > you to make major updates and changes to the program, whereas if you > control the packaging yourself you're free to change the API, add > dependencies, etc. however you like. So, for the manuscript/publication at > least, you might find it safer to only state that distributing MOSAIC with > Biopython is planned, rather than committing to a release version number. > > Thoughts on the code: > > - Zheng Ruan has written a nice codon alignment module as part of his GSoC > project. Once that's merged, you'll be able to drop the pal2nal dependency. > > - We haven't merged Chris's MSAprobs wrapper yet (to my knowledge), though > at a glance it looks like it should be straightforward. 
For Bio.mosaic (I > guess?), we would probably wait until the wrapper is merged and then remove > the conditional in mosaic. > > - Does EMBOSS stretcher do anything that couldn't be done with > Bio.pairwise2? If not, you could use pairwise2 instead and avoid another > dependency. > > - The use of pandas looks fairly basic and therefore also avoidable. It > looks like with a few more lines of code you could use Python's built-in > csv module to parse a table and store it in a numpy matrix instead. > > - MOSAIC does some logging to the console, which is sensible for the > program but isn't done as much in Biopython. Some of these print statements > could be changed to warnings (see the warnings module). The progress > indicators could maybe be toggled at the function level with a keyword > argument, e.g. verbose=True/False. > > > Cheers, > Eric > From zruan1991 at gmail.com Sat Sep 21 19:29:17 2013 From: zruan1991 at gmail.com (Zheng Ruan) Date: Sat, 21 Sep 2013 19:29:17 -0400 Subject: [Biopython-dev] Codon Alignment GSoC Documentation Update Message-ID: Hi all, The documentation for Codon Alignment GSoC project is now available in reStructuredText, LaTeX (pdf) and HTML format at http://zruanweb.com/. To this point, I finished all the tasks of my project. I really enjoy the coding experience in the past two months. Thanks for all your help and valuable feedback! Best, Zheng Ruan From yeyanbo289 at gmail.com Sun Sep 22 01:22:46 2013 From: yeyanbo289 at gmail.com (Yanbo Ye) Date: Sun, 22 Sep 2013 13:22:46 +0800 Subject: [Biopython-dev] GSOC weekly update 14 Message-ID: Hi all, My tutorial about the new features of Bio.Phylo that completed during this GSoC is in this file: https://github.com/lijax/gsoc/blob/master/phylo_wiki.md . I also updated the phylo page on the biopython wiki. http://biopython.org/wiki/Phylo Thanks for all your help and suggestions during last three months. 
I really appreciate this coding experience and would like to continue contributing to the Biopython community. Any comments and suggestions about the code or documentation would be welcome.

Cheers,
Yanbo

--
Yanbo Ye
Guangzhou Institutes of Biomedicine and Health,
Chinese Academy of Sciences
190 Kaiyuan Avenue, Science Park, Guangzhou, China
Email: ye_yanbo at gibh.ac.cn
Web: http://www.yeyanbo.com
Phone: (86)-020-32093810

From p.j.a.cock at googlemail.com Mon Sep 23 16:58:25 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 23 Sep 2013 21:58:25 +0100
Subject: [Biopython-dev] NumPy 1.7 and NPY_NO_DEPRECATED_API warnings
Message-ID:

Hi all,

I'm seeing the following warning from NumPy 1.7 with Python 3.3 on Mac OS X, and on Linux too. I believe the NumPy version is the critical factor:

building 'Bio.Cluster.cluster' extension
building 'Bio.KDTree._CKDTree' extension
building 'Bio.Motif._pwm' extension
building 'Bio.motifs._pwm' extension

all give:

/Users/peterjc/lib/python3.3/site-packages/numpy/core/include/numpy/npy_deprecated_api.h:11:2: warning: "Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]

According to this page,
http://docs.scipy.org/doc/numpy-dev/reference/c-api.deprecations.html
if we add this line it should confirm our code is clean for NumPy 1.7 (and implies no side effects on older NumPy):

#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION

Unfortunately it seems all four modules have problems when that is done, presumably because of planned NumPy C API changes that we need to handle via a version conditional #ifdef?

Peter

From anaryin at gmail.com Tue Sep 24 02:50:28 2013
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 24 Sep 2013 08:50:28 +0200
Subject: [Biopython-dev] Building Biopython with ICC instead of GCC (intel compiler)
Message-ID:

Hi all,

This is more of a curiosity rather than a necessity.
I'm setting up a new cluster and we are preferring ICC (intel compiler) over the usual GCC. When I run "python setup.py build" the output shows ICC being used quite a lot but some lines still use GCC. Example:

gcc -pthread -shared build/temp.linux-x86_64-2.6/Bio/KDTree/KDTree.o build/temp.linux-x86_64-2.6/Bio/KDTree/KDTreemodule.o -L/usr/lib64 -lpython2.6 -o build/lib.linux-x86_64-2.6/Bio/KDTree/_CKDTree.so
building 'Bio.Motif._pwm' extension
creating build/temp.linux-x86_64-2.6/Bio/Motif
icc -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/home/software/python-libs/lib64/python2.6/site-packages/numpy/core/include -I/usr/include/python2.6 -c Bio/Motif/_pwm.c -o build/temp.linux-x86_64-2.6/Bio/Motif/_pwm.o
icc: command line warning #10006: ignoring unknown option '-fwrapv'
icc: command line warning #10006: ignoring unknown option '-fwrapv'
/home/software/python-libs/lib64/python2.6/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h(15): warning #1224: #warning directive: "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION"
  #warning "Using deprecated NumPy API, disable it by " \
   ^

I guess this has to do with distutils? Any idea on how to force it to use only the intel compilers?

Cheers,

João

From mjldehoon at yahoo.com Tue Sep 24 20:55:25 2013
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 24 Sep 2013 17:55:25 -0700 (PDT)
Subject: [Biopython-dev] Building Biopython with ICC instead of GCC (intel compiler)
In-Reply-To:
References:
Message-ID: <1380070525.80725.YahooMailNeo@web164006.mail.gq1.yahoo.com>

How was Python itself compiled?
I believe distutils is supposed to select the same compiler as was used for Python itself.

Best,
-Michiel.

________________________________
I guess this has to do with distutils? Any idea on how to force it to use only the intel compilers?

_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev

From p.j.a.cock at googlemail.com Fri Sep 27 11:47:11 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 27 Sep 2013 16:47:11 +0100
Subject: [Biopython-dev] Problem with SeqIO uniprot-xml on older XML files?
Message-ID:

Hi all,

There seems to be a problem parsing older UniProt XML files, see
http://seqanswers.com/forums/showthread.php?t=33921

Could anyone have a look at this? Somehow the start/end of each record does not seem to be recognised here,

>>> from Bio import SeqIO
>>> r = next(SeqIO.parse("uniref90.xml", "uniprot-xml"))
(takes ages, presumably scanning whole file)

Note the indexing code also breaks:

>>> from Bio import SeqIO
>>> d = SeqIO.index("uniref90.xml", "uniprot-xml")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pc40583/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 808, in index
    key_function, repr, "SeqRecord")
  File "/home/pc40583/lib/python2.7/site-packages/Bio/File.py", line 250, in __init__
    for key, offset, length in offset_iter:
  File "/home/pc40583/lib/python2.7/site-packages/Bio/SeqIO/_index.py", line 401, in __iter__
    % (start_offset, end_offset))
ValueError: Did not find line in bytes 283 to 38649

Thanks,

Peter

From p.j.a.cock at googlemail.com Sat Sep 28 06:14:30 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 28 Sep 2013 11:14:30 +0100
Subject: [Biopython-dev] Adjusting the xxMotif wrapper / Bio.Application plans
In-Reply-To:
References: <520374DF.9070301@brueffer.de>
Message-ID:

On Thu, Aug 8, 2013 at 12:00 PM, Peter Cock wrote:
> On Thu, Aug 8, 2013 at 11:37 AM, Christian
Brueffer > wrote: >>> >>> Was there a special reason for all these case variants >>> in the XXmotif options?? >> >> I basically followed the example set by >> Bio/Align/Applications/_Clustalw.py. > > Ah. Without checking I think maybe the ClustalW documentation > used both cases - but the order was deliberately with the lower > case one last as that was used in the Python object as the > property name and keyword. > >> The "rationale" was to allow for people to use their favourite >> spelling variety. >> >> I guess it was bad luck this happened to serve as an example, as it >> was the first piece of code I ever touched in BioPython. >> >> It would be nice to streamline all application wrappers in this regard >> sometime... > > Yeah, perhaps we can formally deprecate set_parameter in > the next release which means all the aliases 'go away' and > that leaves us with just the final entry exposed as the usable > property name and keyword. > > Peter I have updated the application wrapper code to spot hyphens in what should be property names/arguments: https://github.com/biopython/biopython/commit/ba1a43475a3d4450b3ac8409adaf0e59a25b0e47 This forced me to update the XXmotif wrapper and I opted to switch it to using lower case property names: https://github.com/biopython/biopython/commit/f4b4006a64d5166b5c0934d2ad1f8dc3bab30067 I was looking at this as part of applying Christian's MSAProb wrapper: https://github.com/biopython/biopython/pull/225 Peter From saketkc at gmail.com Sat Sep 28 06:22:52 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Sat, 28 Sep 2013 15:52:52 +0530 Subject: [Biopython-dev] Adjusting the xxMotif wrapper / Bio.Application plans In-Reply-To: References: <520374DF.9070301@brueffer.de> Message-ID: On 28 September 2013 15:44, Peter Cock wrote: > On Thu, Aug 8, 2013 at 12:00 PM, Peter Cock wrote: >> On Thu, Aug 8, 2013 at 11:37 AM, Christian Brueffer >> wrote: >>>> >>>> Was there a special reason for all these case variants >>>> in the XXmotif 
options?? >>> >>> I basically followed the example set by >>> Bio/Align/Applications/_Clustalw.py. >> >> Ah. Without checking I think maybe the ClustalW documentation >> used both cases - but the order was deliberately with the lower >> case one last as that was used in the Python object as the >> property name and keyword. >> >>> The "rationale" was to allow for people to use their favourite >>> spelling variety. >>> >>> I guess it was bad luck this happened to serve as an example, as it >>> was the first piece of code I ever touched in BioPython. >>> >>> It would be nice to streamline all application wrappers in this regard >>> sometime... >> >> Yeah, perhaps we can formally deprecate set_parameter in >> the next release which means all the aliases 'go away' and >> that leaves us with just the final entry exposed as the usable >> property name and keyword. >> >> Peter > > I have updated the application wrapper code to spot hyphens > in what should be property names/arguments: > https://github.com/biopython/biopython/commit/ba1a43475a3d4450b3ac8409adaf0e59a25b0e47 > > This forced me to update the XXmotif wrapper and I opted > to switch it to using lower case property names: > https://github.com/biopython/biopython/commit/f4b4006a64d5166b5c0934d2ad1f8dc3bab30067 > > I was looking at this as part of applying Christian's MSAProb wrapper: > https://github.com/biopython/biopython/pull/225 > Great! 
I made a similar mistake while writing the samtools wrapper (which I am yet to wrap up):
https://github.com/saketkc/biopython/commit/30b3d9878281e00afed9e7b6d0bbfb2bdacbce91

> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From anaryin at gmail.com Sat Sep 28 06:57:32 2013
From: anaryin at gmail.com (João Rodrigues)
Date: Sat, 28 Sep 2013 12:57:32 +0200
Subject: [Biopython-dev] Building Biopython with ICC instead of GCC (intel compiler)
In-Reply-To: <1380070525.80725.YahooMailNeo@web164006.mail.gq1.yahoo.com>
References: <1380070525.80725.YahooMailNeo@web164006.mail.gq1.yahoo.com>
Message-ID:

Hi Michiel,

It was compiled with GCC. That might indeed be the issue, but I assumed there were some variables you could set to have the compilers changed (setting CC, CXX, F77 for example).

As I said, there is no problem, GCC is perfectly good enough. It was just a curiosity.

Cheers,

João

2013/9/25 Michiel de Hoon

> How was Python itself compiled? I believe distutils is supposed to select
> the same compiler as was used for Python itself.
>
> Best,
> -Michiel.
>
> ------------------------------
> I guess this has to do with distutils? Any idea on how to force it to
> use only the intel compilers?
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From nigel.delaney at outlook.com Sat Sep 28 11:10:25 2013
From: nigel.delaney at outlook.com (Nigel Delaney)
Date: Sat, 28 Sep 2013 11:10:25 -0400
Subject: [Biopython-dev] Newick Parser
Message-ID:

I had a couple questions on the newick parser I was hoping someone might know the answer to.

First, it fails when there are BOMs in the file, though in general it seems that UTF encoding with BOMs should be allowed.
Is there a standard way that BOMs in files are handled in biopython?

Second, does anyone know what the consensus is on newick files that have placements for data but no data? For example:

((A,B):Name:.0235)C)

defines a name and length for the A,B node. However,

((A,B)::)C)

has positions for name and length but no length or name data, which seems like it should be an error, though currently it is just skipped.

From p.j.a.cock at googlemail.com Sat Sep 28 11:23:43 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 28 Sep 2013 16:23:43 +0100
Subject: [Biopython-dev] Newick Parser
In-Reply-To:
References:
Message-ID:

Hi Nigel,

On Saturday, September 28, 2013, Nigel Delaney wrote:
> I had a couple questions on the newick parser I was hoping someone might
> know the answer to.
>
> First, it fails when there are BOMs in the file, though in general it seems
> that UTF encoding with BOMs should be allowed. Is there a standard way that
> BOM in files are handled in biopython?
>

You mean a Unicode byte order mark (BOM)?

Does it even make sense to allow non-ASCII in Newick format?

Peter

From nigel.delaney at outlook.com Sat Sep 28 11:55:14 2013
From: nigel.delaney at outlook.com (Nigel Delaney)
Date: Sat, 28 Sep 2013 11:55:14 -0400
Subject: [Biopython-dev] Newick Parser
In-Reply-To:
References:
Message-ID:

> You mean a Unicode byte order mark (BOM)?

Yep.

> Does it even make sense to allow non-ASCII in Newick format?

I think that's a matter of opinion. The specs I found discussed how to parse the string, but not how to encode the string.

The advantages I can see are allowing people to use the extended characters for node/tip label names, and being robust if different text-editors/programs muck with the files (which I would suspect are usually ASCII).

The disadvantage is that it's another case to handle in code, so it could just be ignored or throw an exception.

Not sure what the preferred choice for biopython would be.
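
A minimal sketch of one way the BOM question above could be handled, assuming the file is read as bytes before parsing. The helper names detect_bom and decode_newick_bytes are hypothetical and are not part of Bio.Phylo; only the codecs constants and codec names come from the Python standard library. Data without a BOM is assumed here to be plain ASCII.

```python
import codecs

# Byte order marks to look for. The UTF-32 entries come first because the
# UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_bom(raw):
    """Return the codec name suggested by a leading BOM, or None."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return encoding
    return None


def decode_newick_bytes(raw):
    """Decode raw Newick bytes, dropping a leading BOM if there is one.

    Hypothetical helper: without a BOM the data is assumed to be ASCII.
    """
    encoding = detect_bom(raw)
    if encoding is None:
        return raw.decode("ascii")
    # These codecs consume the BOM themselves while decoding.
    return raw.decode(encoding)
```

A parser could equally raise an error when detect_bom returns something other than None, which is the simpler option discussed later in the thread.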
From p.j.a.cock at googlemail.com Sat Sep 28 13:28:24 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 28 Sep 2013 18:28:24 +0100
Subject: [Biopython-dev] Newick Parser
In-Reply-To:
References:
Message-ID:

On Sat, Sep 28, 2013 at 4:55 PM, Nigel Delaney wrote:
>
>>
>> Does it even make sense to allow non-ASCII in Newick format?
>>
>
> I think that's a matter of opinion. The specs I found
> discussed how to parse the string, but not how to
> encode the string.

Right, and they probably all pre-date unicode and are implicitly ASCII only.

> The advantages I can see are allowing people to use the
> extended characters for node/tip label names, and being
> robust if different text-editors/programs muck with the files
> (which I would suspect are usually ASCII).

Yep.

> The disadvantage is that it's another case to handle in code, so could just
> be ignored or throw an exception.
>
> Not sure what the preferred choice for biopython would be.

If you'd like to work on this it sounds useful - but you'll have to be extra careful about testing under both Python 2 and Python 3 due to the joys of unicode.

Peter

From nigel.delaney at outlook.com Sat Sep 28 13:52:03 2013
From: nigel.delaney at outlook.com (Nigel Delaney)
Date: Sat, 28 Sep 2013 13:52:03 -0400
Subject: [Biopython-dev] Newick Parser
In-Reply-To:
References:
Message-ID:

Hi Peter,

I think handling the joys of Unicode might be a bit more trouble than it's worth given how few of the files are probably Unicode, and I think most bioinformatics is still done in standard ASCII English anyway.

I just submitted pull request 241. It throws an error when BOMs are detected (right now it says the number of "(" does not equal the number of ")" which is super confusing). This way the user can just convert the file on their end.
All the best,
N

-----Original Message-----
From: Peter Cock [mailto:p.j.a.cock at googlemail.com]
Sent: Saturday, September 28, 2013 1:28 PM
To: Nigel Delaney
Cc: Biopython-Dev Mailing List
Subject: Re: [Biopython-dev] Newick Parser

On Sat, Sep 28, 2013 at 4:55 PM, Nigel Delaney wrote:
>
>>
>> Does it even make sense to allow non-ASCII in Newick format?
>>
>
> I think that's a matter of opinion. The specs I found discussed how
> to parse the string, but not how to encode the string.

Right, and they probably all pre-date unicode and are implicitly ASCII only.

> The advantages I can see are allowing people to use the extended
> characters for node/tip label names, and being robust if different
> text-editors/programs muck with the files (which I would suspect are
> usually ASCII).

Yep.

> The disadvantage is that it's another case to handle in code, so could
> just be ignored or throw an exception.
>
> Not sure what the preferred choice for biopython would be.

If you'd like to work on this it sounds useful - but you'll have to be extra careful about testing under both Python 2 and Python 3 due to the joys of unicode.

Peter

From p.j.a.cock at googlemail.com Sat Sep 28 14:18:57 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 28 Sep 2013 19:18:57 +0100
Subject: [Biopython-dev] Newick Parser
In-Reply-To:
References:
Message-ID:

On Sat, Sep 28, 2013 at 6:52 PM, Nigel Delaney wrote:
> Hi Peter,
>
> I think handling the joys of Unicode might be a bit more trouble than it's
> worth given how few of the files are probably Unicode, and I think most
> bioinformatics is still done in standard ASCII English anyway.
>
> I just submitted pull request 241. It throws an error when BOMs are
> detected (right now it says the number of "(" does not equal the number of
> ")" which is super confusing). This way the user can just convert the file
> on their end.

Thanks - I've replied with what is intended as constructive feedback:
https://github.com/biopython/biopython/pull/241

The startswith method is more powerful than many people realise ;)

Peter

From p.j.a.cock at googlemail.com Sun Sep 29 08:30:10 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 29 Sep 2013 13:30:10 +0100
Subject: [Biopython-dev] Bio.SVDSuperimposer cleanup to remove nested module name
Message-ID:

Hi all,

Could someone have a look at this proposed change to the Bio.SVDSuperimposer module (used in Bio.PDB) please:
https://github.com/biopython/biopython/pull/242

Thanks,

Peter

From p.j.a.cock at googlemail.com Sun Sep 29 08:39:24 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 29 Sep 2013 13:39:24 +0100
Subject: [Biopython-dev] No longer testing under Python 3.1
Message-ID:

Hi all,

In line with past discussion, we're not officially supporting Python 3.0, 3.1 or 3.2 - just 3.3 onwards.

Until recently the Buildbot has been covering Python 3.1 and 3.2, but as of this commit I have dropped Python 3.1 from the test matrix:
https://github.com/biopython/biopython/commit/de71aadb8c603a6cd30b563fe7bc44d56b98d506
http://testing.open-bio.org/biopython/tgrid

For now everything seems to work under Python 3.2 (testing under TravisCI and the buildbot) which may be useful as PyPy3 currently targets Python 3.2 rather than 3.3.

Peter

From p.j.a.cock at googlemail.com Sun Sep 29 19:22:52 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 30 Sep 2013 00:22:52 +0100
Subject: [Biopython-dev] Python 2 and 3 migration thoughts
In-Reply-To:
References:
Message-ID:

On Sun, Sep 8, 2013 at 9:52 PM, Peter Cock wrote:
> On Sat, Sep 7, 2013 at 8:17 PM, Eric Talevich wrote:
>>> >
>>> > # This file targets both Python 2 and Python 3 at the same time
>>> > # TODO - Targets Python 2 only (use 2to3 to run under Python 3)
>>> >
>>>
>>> Does the special comment line seem like a good solution?
>>> On the plus side, it tracks any changes with the file being
>>> updated (which wouldn't happen with a list in the do2to3.py
>>> file).
>>
>> Hi Peter,
>>
>> This looks like a good way to move forward overall. Regarding the special
>> comment lines -- since these are only used in do2to3.py, would it be
>> cleaner to keep a hard-coded list of filenames in do2to3.py and leave the
>> modules and scripts alone? Are there any characteristics that would make it
>> difficult to determine whether a given module or script is Py3-compliant?
>
> Hi Eric,
>
> There are import time problems which are easy to spot - in particular
> SyntaxError is a good clue. However, many of the issues are only
> really found at run time (e.g. different method names). This means
> that the tests (which I started with) are actually the easiest to check.
>
> Right now I don't have a feel for what fraction of the main Bio/* and
> BioSQL/* files can be made dual-coding, and that would have an
> influence on how best to tag things needing 2to3 or not. I'm happy
> to continue this on branches for a while longer and find out.

Assuming my methodology isn't flawed, we're about half way in terms of getting every file in Biopython to be dual Python 2 and Python 3 code:

262 no change, 290 need fixers
Troublesome ones at 52.5%

This is based on there being a difference between the pre- and post-2to3 conversion (discounting removing future imports). This is an overestimate as often the 2to3 script makes unnecessary changes.

This is after applying a *lot* of little changes to our codebase, things like removing unneeded use of my_dict.keys() which the 2to3 fixers are over cautious in wrapping as list(my_dict.keys()) - I would like to do a beta before the next release.

> I do like the idea of a special #TODO comment line where 2to3
> is still needed - it is symbolic of where I want the code base to go ;)

That's what is going on in this revised branch - if the special #TODO comment is there, then 2to3 is used, otherwise we assume the file is already OK to use under Python 3:
https://github.com/peterjc/biopython/tree/mark2to3

This is now quicker to install under Python 3, but there is plenty of scope for speed optimisation (e.g. requiring that the magic marker is in the first (say) 20 lines of the file, and expanding the magic marker to list the specific 2to3 fixers required and running just those).

Regards,

Peter

From p.j.a.cock at googlemail.com Mon Sep 30 12:18:21 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 30 Sep 2013 17:18:21 +0100
Subject: [Biopython-dev] Python 2 and 3 migration thoughts
In-Reply-To:
References:
Message-ID:

On Mon, Sep 30, 2013 at 12:22 AM, Peter Cock wrote:
> Assuming my methodology isn't flawed, we're about half way
> in terms of getting every file in Biopython do be dual Python 2
> and Python 3 code:
>
> 262 no change, 290 need fixers
> Troublesome ones at 52.5%

New numbers with Bio._py3k.urllib changes which should have dropped the number of troublesome files by at most 13 files:

374 no change, 177 need fixers
Troublesome ones 32.1%

I think my markup script is a bit fragile in terms of the exact sequence of steps with do2to3.py etc. But much better numbers than Sunday night :)

Revised branch here:
https://github.com/peterjc/biopython/tree/mark2to3a
https://github.com/peterjc/biopython/commit/14f9ff121532ff92ec7bacc1867bdd058a6e8f74

Build and test times on the master vs this branch are looking a lot better for Python 3 (although the numbers for different TravisCI runs are not directly comparable), and there is still a lot of room for improvement:

master: https://travis-ci.org/biopython/biopython/builds/11965000
branch: https://travis-ci.org/peterjc/biopython/builds/11968132

So that's good.
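
The marker lookup mentioned earlier in this thread (a special #TODO comment, possibly required within the first 20 or so lines) could be sketched roughly as below. This is only an illustration under stated assumptions: the marker wording is the one quoted in the thread, while the function names needs_2to3 and file_needs_2to3 are hypothetical and not taken from do2to3.py on the branch.

```python
import io

# Marker text as quoted in the thread; whether do2to3.py matches exactly
# this wording is an assumption.
MARKER = "# TODO - Targets Python 2 only (use 2to3 to run under Python 3)"


def needs_2to3(lines, max_lines=20):
    """Return True if the marker appears within the first max_lines lines."""
    for number, line in enumerate(lines):
        if number >= max_lines:
            break
        if line.strip() == MARKER:
            return True
    return False


def file_needs_2to3(filename, max_lines=20):
    """Apply needs_2to3 to a file on disk (hypothetical convenience wrapper)."""
    with io.open(filename, encoding="utf-8") as handle:
        return needs_2to3(handle, max_lines)
```

Stopping after max_lines means large modules are not read in full, which is the speed-up suggested in the message.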

However, are these urllib import fixes an acceptable way forwards? Included in the above branch and here:
https://github.com/peterjc/biopython/tree/urllib
https://github.com/peterjc/biopython/commit/1305387a5d98a5f3c7b83ca3db580b9e63dba851

Thanks,

Peter

From Markus.Piotrowski at ruhr-uni-bochum.de Mon Sep 2 20:49:20 2013
From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski)
Date: 2 Sep 2013 22:49:20 +0200
Subject: [Biopython-dev] Fwd: [biopython] Potential error in mass calculations for RNA/DNA? (#229)
In-Reply-To:
References:
Message-ID:

Hi,

I prepared a bugfix for that:
https://github.com/MarkusPiotrowski/biopython/commit/fd8914f14d48c984a69b6e8227c679e3c67bd1eb

Summary: Bugfix for DNA/RNA masses

In Bio.Data.IUPACData:
- corrected masses for monophosphate nucleotides in unambiguous_dna_weights and unambiguous_rna_weights (most values were too high by a mass of 16 Da)
- added two dictionaries with monoisotopic masses for monophosphate nucleotides (monoisotopic_unambiguous_dna_weights and monoisotopic_unambiguous_rna_weights)
- added average and monoisotopic masses for selenocysteine and pyrrolysine in protein_weights and monoisotopic_protein_weights

In Bio.SeqUtils.__init__:
Rewrote method molecular_weight to
- correct the calculation (sum masses of sequence elements and subtract 18 Da for each formed bond)
- allow mass calculation for RNA and protein sequences
- allow mass calculation for double stranded nucleic acids

On 2013-08-30 17:46, Peter Cock wrote:
> Who are our sequence mass experts?
> https://github.com/biopython/biopython/issues/229
>
> ---------- Forwarded message ----------
> From: nruggero
> Date: Thu, Aug 29, 2013 at 11:03 PM
> Subject: [biopython] Potential error in mass calculations for
> RNA/DNA?
> (#229) > To: biopython/biopython > > > In Bio/Data/IUPACData.py the molecular weights of unambiguous DNA are > listed as: > > unambiguous_dna_weights = { > "A": 347., > "C": 323., > "G": 363., > "T": 322., > } > > As far as I can tell these are the molecular weights for the > non-deoxy > bases instead of the deoxy bases. For example, AMP (347.22) instead > of dAMP > (331.22) is listed. > > I've looked at the original BioPearl code that these numbers were > taken > from and I think they were just copied incorrectly. I have also > looked at > the code which uses this dict in Bio/SeqUtils/__init__.py called > molecular_weight() and it just takes the sum of these values over the > sequence (no correction made). > > So, is this an error or am I missing something basic? > Thanks > > ? > Reply to this email directly or view it on > GitHub > . > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From zruan1991 at gmail.com Mon Sep 2 22:20:17 2013 From: zruan1991 at gmail.com (Zheng Ruan) Date: Mon, 2 Sep 2013 18:20:17 -0400 Subject: [Biopython-dev] Codon Alignment GSoC Weekly Update Message-ID: Hi all, An update of Codon Alignment GSoC project can be found at http://zruanweb.com/. Thanks for your comments and suggestions. 

Best,
Zheng Ruan

From yeyanbo289 at gmail.com Tue Sep 3 01:32:16 2013
From: yeyanbo289 at gmail.com (Yanbo Ye)
Date: Tue, 3 Sep 2013 09:32:16 +0800
Subject: [Biopython-dev] GSOC weekly update 12
Message-ID:

Hi all,

The last week update for Biopython.Phylo project can be found here:
http://blog.yeyanbo.com/posts/google-summer-of-code-12.html

Thanks,
Yanbo

--
Yanbo Ye
Guangzhou Institutes of Biomedicine and Health,
Chinese Academy of Sciences
190 Kaiyuan Avenue, Science Park, Guangzhou, China
Email: ye_yanbo at gibh.ac.cn
Web: http://www.yeyanbo.com
Phone: (86)-020-32093810

From jurajbergman at hotmail.com Thu Sep 5 14:33:55 2013
From: jurajbergman at hotmail.com (Juraj Bergman)
Date: Thu, 5 Sep 2013 16:33:55 +0200
Subject: [Biopython-dev] Python_MKT
Message-ID:

Dear all,

I'm resending my implementation of the McDonald-Kreitman test.

Link to the description of the module: https://www.dropbox.com/s/zgnz8xwlcsispzf/Python_MKT.pdf
Link to the code: https://www.dropbox.com/s/1z3opj4rbb0ms14/Python_MKT.py

I apologise for the initial mistake of sending attachments instead of links.

Kind regards,
Juraj Bergman

P.S. Regarding the multi_short_path() function - I realize that it is very, very repetitive but I have not (yet) managed to find a suitable loop construction that would replace the current code. The multi_short_path() function is by far the most complex function of the module because its purpose is to find the codon network with the least amount of overall nucleotide substitutions and the least amount of non-synonymous nucleotide substitutions (given any combination of codons). Each codon is being represented as multiple lists of two integers (depending on the overall amount of codons being processed).
The first integer specifies the amount of synonymous and the second specifies the amount of non-synonymous substitutions. For example, if 10 codons are being fitted in a network, then there are 10x10 = 100 combinations of codon-codon pathways, each represented with a two-integer list, and out of these 100 lists, the 'best' 10 have to be chosen to get the most optimal codon network (and the repetitiveness of the function mainly arises because of this process). This is, in short, a description of the function and I would appreciate any pointers that would help to make the code more succinct :)

From zruan1991 at gmail.com Fri Sep 6 04:00:06 2013
From: zruan1991 at gmail.com (Zheng Ruan)
Date: Fri, 6 Sep 2013 00:00:06 -0400
Subject: [Biopython-dev] Fwd: Python_MKT
In-Reply-To:
References:
Message-ID:

Hi Juraj,

I am also planning to implement the MK test in my GSoC framework. I just went through your code and it is really independent. Will you be able to modify it to utilize the MultipleSeqAlignment, Alphabet and CodonTable modules of Biopython so that it is more extendable?

As to the multi_short_path() function, you really confused me. Is your implementation guaranteed to find the shortest path? This problem can be abstracted as finding the minimum spanning tree in graph theory and a good algorithm is known (Prim's algorithm or Kruskal's algorithm). My idea for linking multiple codons is to first generate a codon by codon matrix representing the synonymous and nonsynonymous substitutions each codon needs to change to the other in advance, and then to find the minimum spanning tree that connects all the nodes in the matrix with minimum length (least synonymous substitutions). I plan to implement this and you may have more insight about my suggestions. Thanks!

Best,
Zheng Ruan

On Thu, Sep 5, 2013 at 10:33 AM, Juraj Bergman wrote:
>
> Dear all,
> I'm resending my implementation of the McDonald-Kreitman test.
> Link to the description of the module:
> https://www.dropbox.com/s/zgnz8xwlcsispzf/Python_MKT.pdf
> Link to the code: https://www.dropbox.com/s/1z3opj4rbb0ms14/Python_MKT.py
> I apologise for the initial mistake of sending attachments instead of
> links.
> Kind regards,
> Juraj Bergman
> P.S. Regarding the multi_short_path() function - I realize that it is
> very, very repetitive but I have not (yet) managed to find a suitable loop
> construction that would replace the current code. The multi_short_path()
> function is by far the most complex function of the module because its
> purpose is to find the codon network with the least amount of overall
> nucleotide substitutions and the least amount of non-synonymous nucleotide
> substitutions (given any combination of codons). Each codon is being
> represented as multiple lists of two integers (depending on the overall
> amount of codons being processed). The first integer specifies the amount
> of synonymous and the second specifies the amount of non-synonymous
> substitutions. For example, if 10 codons are being fitted in a network, then
> there are 10x10 = 100 combinations of codon-codon pathways, each
> represented with a two-integer list, and out of these 100 lists, the 'best'
> 10 have to be chosen to get the most optimal codon network
> (and the repetitiveness of the function mainly arises because of this
> process).
This is, in short, a description of the function and I would > appreciate any pointers that would help to make the code more succinct :) > > > > > > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From jurajbergman at hotmail.com Fri Sep 6 06:38:34 2013 From: jurajbergman at hotmail.com (Juraj Bergman) Date: Fri, 6 Sep 2013 08:38:34 +0200 Subject: [Biopython-dev] Python_MKT In-Reply-To: References: , , Message-ID: Hi Zheng, I think that the utilization of MultipleSeqAlignment and other modules already implemented in the Biopython framework is the next step in developing my module. The code was made independent because it says on the Biopython wiki that, whensubmitting code, it should be generalized so I didn't use any existing Biopython modules... As for the multi_short_path() function - it is guaranteed to find the shortest path (as far as I've tested it and I've tested it quite a bit) but I agree that it is very confusing (even for me), but it works... But still, my next goal is to try and rewrite it (so thank you for the suggestions :). The codon-codon matrix principle you described is also the principle behind the multi_short_path() function and, I think, it is a good way of tackling the problem... But in the end the result of the multi _short_path() is to find a tree with the least amount of overall substitutions (synonymous + non-synonymous) and with the number of non-synonymous substitutions being minimized. 
If you try to connect the nodes based solely on the minimum number of synonymous substitutions you may not always get a minimum length tree (for example: if considering only the synonymous substitutions, then, theoretically, a codon_a -> codon_b exchange which requires two synonymous changes has priority over a codon_a -> codon_c exchange which requires only one non-synonymous change, and that in turn can affect the length of the whole tree) - I hope this makes some sense to you... Also, when connecting nodes, I took the approach of first making a root of the tree and then building the tree from that root, otherwise you could end up with multiple unconnected branches... I hope this helps with your implementation... If I come up with a better alternative to multi_short_path() I'll be sure to post a link!

Again, thanks for taking the time to go through my code, all the best,

Juraj

From zruan1991 at gmail.com Fri Sep 6 15:07:01 2013
From: zruan1991 at gmail.com (Zheng Ruan)
Date: Fri, 6 Sep 2013 11:07:01 -0400
Subject: [Biopython-dev] Python_MKT
In-Reply-To:
References:
Message-ID:

Hi Juraj,

It's good to hear that you plan to do that. A big advantage of using Biopython modules is that your MK test becomes better integrated with existing functions. This can be helpful for designing pipelines within Biopython. What I would also try is to use Bio.Data.CodonTable so that users can specify the genetic code of their gene of interest.

I think there are situations where you are not able to minimize the overall (synonymous plus non-synonymous) substitutions and the non-synonymous substitutions at the same time. If I understand your point correctly, the multi_short_path() function tries to find the path with the least overall substitutions from the set of paths that all have the minimum number of non-synonymous substitutions, right? In this case, for example when you have 10 different codons at hand, you can first start from each codon and build a minimum spanning tree. You then expect at most 10 minimum spanning trees, all with an equal, minimum number of non-synonymous substitutions. Finally, you can pick the tree with the least overall substitutions (non-synonymous and synonymous) from the set of trees. I don't expect the algorithm to cost more than 2000 lines. Maybe we can discuss this more after I finish coding this weekend. Thanks!

Best,
Zheng Ruan

From p.j.a.cock at googlemail.com Fri Sep 6 15:44:44 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 6 Sep 2013 16:44:44 +0100
Subject: [Biopython-dev] Python 2 and 3 migration thoughts
In-Reply-To:
References:
Message-ID:

On Thu, May 30, 2013 at 2:33 PM, Peter Cock wrote:
> Splitting off from this thread:
> http://lists.open-bio.org/pipermail/biopython/2013-May/008601.html
>
> On Thu, May 30, 2013 at 2:13 PM, Peter Cock wrote:
>> Thank you for all the comments so far, don't stop yet :)
>>
>> On Thu, May 30, 2013 at 1:51 PM, Wibowo Arindrarto
>> wrote:
>>> Hi everyone,
>>>
>>> I'm leaning towards insisting on Python >=3.3 support (I'm running
>>> 3.3.2).
>>> I suppose that even if Python 3.3 is not available on a machine
>>> or through the default package manager, it's always installable on its
>>> own. If that's not the case, I imagine Python 2.x is most likely
>>> present in these machines (so Biopython can still be used).
>>
>> True.
>>
>> So far everyone who has replied (including some off list) has said
>> they are using Python 3.3 which is encouraging. Thank you for
>> the comments so far.
>>
>> It looks like we can forget about Python 3.1, and just need to
>> decide if it is worth including Python 3.2.5 in the short term.
>>
>>> On a related note, do we have a defined timeline on when we
>>> would drop support for Python 2.x? Are there any plans to have
>>> our codebase written in Python 3.x instead of Python 2.x?
>>
>> Nothing concrete planned, no. I'll reply in more detail on the
>> biopython-dev list as I do have some thoughts about this.
>
> Good question Bow,
>
> I think people will still be using Python 2 a year or two from
> now, so we must support both for some time.
>
> Biopython 1.62 (next week perhaps?)
> - Final release with Python 2.5 support
> - Official support for Python 2.5, 2.6, 2.7 and 3.3
> - Possibly official support for Python 3.2.5+ as well?
>
> (Exactly which versions of Python 3 we'll include to be
> decided, see the other thread for that discussion.)
>
> Short term we will continue with developing using Python 2
> syntax and running 2to3 for Python 3. As far as I know,
> the reverse process with 3to2 is not well established. If
> anyone wants to investigate that would be useful as
> another option. However, dropping Python 2.5 support
> makes things more flexible...
>
> Medium term I believe it would be possible to have a single
> code base which is both valid Python 2 and 3 at the same
> time. This may require us to target 2.7 and 3.3+ only - we'll
> have to try it and see if Python 2.6 will hold us back.
>
> I've actually done this with backports.lzma, a small but
> non-trivial module with Python and C code:
>
> https://pypi.python.org/pypi/backports.lzma/
> https://github.com/peterjc/backports.lzma
>
> Python 3.3 reintroduces some features designed to make
> this more straightforward, like unicode literals (missing in
> the early versions of Python 3). This is why I'd like to drop
> Python 3.2 as soon as possible.
>
> What I was thinking is we can start migrating modules on a
> case by case basis from "Python 2 syntax" to "Dual syntax"
> one by one, with a white-list in the do2to3.py script. That
> way over time fewer and fewer modules need to be converted
> via 2to3, and "python3 setup.py install" will get faster,
> until eventually we can stop using 2to3 at all.
>
> This conversion could consider the code and doctests
> separately. However, using print(example) we can
> hopefully get most of the doctests and Tutorial examples
> to work under both Python 2 and 3 at the same time.
>
> That's my current thinking anyway - and I think the fact
> that it would be a gradual migration from writing Python 2
> specific code to writing dual 2/3 code makes it low risk
> (as long as we're continuing to run regular testing).
>
> Regards,
>
> Peter

This branch is trying out marking individual Python files as dual coding (Python 2 and Python 3) or as Python 2 only requiring conversion via 2to3 for use on Python 3:

https://github.com/peterjc/biopython/tree/tag2to3

Currently the tags are two special hash comment lines expected near the start of the file itself (rather than a list within the do2to3.py script). The actual text of the marker isn't critical - perhaps these need full stops?

# This file targets both Python 2 and Python 3 at the same time
# TODO - Targets Python 2 only (use 2to3 to run under Python 3)

The first main issues thus far have been print statements, where we will either need to use the __future__ import or restrict ourselves to simple single argument calls - I have been using the latter. This should not be a big problem on the main code, and we ought to update the print-and-compare unit tests anyway.

The next common issue is import statements, for example StringIO (another bytes versus unicode issue). That can be handled via Bio._py3k in some cases.

A third major class of issues in the unit tests so far is iterators versus lists, for example dictionary methods and the map function's return value. These can be tackled on a case by case basis I think - often adding the occasional list(...) or using sorted(x) instead of trying x.sort() is enough.

There are also quite a few instances of 'basestring' which might be handled via _py3k?

As of right now, on this branch there are only 8 files under Tests which require conversion via 2to3:

Tests/common_BioSQL.py
Tests/seq_tests_common.py
Tests/test_NCBI_qblast.py
Tests/test_SCOP_Cla.py
Tests/test_seq.py
Tests/test_SeqIO.py
Tests/test_SeqIO_index.py
Tests/test_Uniprot.py

Having I hope demonstrated this will work, I'd like some feedback before applying this (or a modified version of it) to the master branch. Any thoughts?

Thanks,

Peter

From p.j.a.cock at googlemail.com Sat Sep 7 11:30:50 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 7 Sep 2013 12:30:50 +0100
Subject: [Biopython-dev] Python 2 and 3 migration thoughts
In-Reply-To:
References:
Message-ID:

On Fri, Sep 6, 2013 at 4:44 PM, Peter Cock wrote:
> On Thu, May 30, 2013 at 2:33 PM, Peter Cock wrote:
>>
>> Short term we will continue with developing using Python 2
>> syntax and running 2to3 for Python 3. As far as I know,
>> the reverse process with 3to2 is not well established.
>> If anyone wants to investigate that would be useful as
>> another option. However, dropping Python 2.5 support
>> makes things more flexible...
>>
>> Medium term I believe it would be possible to have a single
>> code base which is both valid Python 2 and 3 at the same
>> time. This may require us to target 2.7 and 3.3+ only - we'll
>> have to try it and see if Python 2.6 will hold us back.
>>
>> I've actually done this with backports.lzma, a small but
>> non-trivial module with Python and C code:
>>
>> https://pypi.python.org/pypi/backports.lzma/
>> https://github.com/peterjc/backports.lzma
>>
>> Python 3.3 reintroduces some features designed to make
>> this more straightforward, like unicode literals (missing in
>> the early versions of Python 3). This is why I'd like to drop
>> Python 3.2 as soon as possible.
>>
>> What I was thinking is we can start migrating modules on a
>> case by case basis from "Python 2 syntax" to "Dual syntax"
>> one by one, with a white-list in the do2to3.py script. That
>> way over time fewer and fewer modules need to be converted
>> via 2to3, and "python3 setup.py install" will get faster,
>> until eventually we can stop using 2to3 at all.
>>
>> This conversion could consider the code and doctests
>> separately. However, using print(example) we can
>> hopefully get most of the doctests and Tutorial examples
>> to work under both Python 2 and 3 at the same time.
>>
>> That's my current thinking anyway - and I think the fact
>> that it would be a gradual migration from writing Python 2
>> specific code to writing dual 2/3 code makes it low risk
>> (as long as we're continuing to run regular testing).
>>
>> Regards,
>>
>> Peter
>
> This branch is trying out marking individual Python files
> as dual coding (Python 2 and Python 3) or as Python 2
> only requiring conversion via 2to3 for use on Python 3:
>
> https://github.com/peterjc/biopython/tree/tag2to3
>
> Currently the tags are two special hash comment lines
> expected near the start of the file itself (rather than a
> list within the do2to3.py script). The actual text of the
> marker isn't critical - perhaps these need full stops?
>
> # This file targets both Python 2 and Python 3 at the same time
> # TODO - Targets Python 2 only (use 2to3 to run under Python 3)
>
> The first main issues thus far have been print statements,
> where we will either need to use the __future__ import or
> restrict ourselves to simple single argument calls - I have
> been using the latter. This should not be a big problem on the
> main code, and we ought to update the print-and-compare
> unit tests anyway,

e.g. https://github.com/biopython/biopython/commit/6fa766e2348eae4e083503885f4ea5b66f531d7a

> The next common issue is import statements, for
> example StringIO (another bytes versus unicode issue).
> That can be handled via Bio._py3k in some cases.

For StringIO,
https://github.com/biopython/biopython/commit/b09ebbf6f8c4032f874d89a91d199d8697c2d381

For commands.getoutput used in many tests,
https://github.com/biopython/biopython/commit/11a1eca60e7a1491dbe54204ad3103e013bfebc5

> A third major class of issues in the unit tests so
> far is iterators versus lists, for example dictionary
> methods and the map function's return value. These
> can be tackled on a case by case basis I think - often
> adding the occasional list(...) or using sorted(x) instead
> of trying x.sort() is enough.

e.g. for sorting dictionary keys,
https://github.com/biopython/biopython/commit/b27f30012af6e66f6f143ecde719bf72609af8f2

e.g. for avoiding iterators from map function,
https://github.com/biopython/biopython/commit/730850e3f4e88a70860e56abafbb579b25414f06

> There are also quite a few instances of 'basestring'
> which might be handled via _py3k?
>
> As of right now, on this branch there are only 8 files under
> Tests which require conversion via 2to3 :

Down to six files under Tests now if I rebase the branch to include the recent fixes on the master.

> Having I hope demonstrated this will work, I'd like some
> feedback before applying this (or a modified version of
> it) to the master branch.

I've started applying individual code fixes to the master to improve Python 2 and 3 compatibility already.

I'm specifically looking for thoughts on how to handle the transition period when some of our code will still need 2to3, while other code will not.

Does the special comment line seem like a good solution? On the plus side, it tracks any changes with the file being updated (which wouldn't happen with a list in the do2to3.py file).

Peter

From p.j.a.cock at googlemail.com Sat Sep 7 13:44:55 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 7 Sep 2013 14:44:55 +0100
Subject: [Biopython-dev] SearchIO wiki page & documentation
Message-ID:

Hi Bow,

You've done a great job with the wiki page for SearchIO, http://biopython.org/wiki/SearchIO - thank you!

One thing I wondered about on reading this is whether, for the BLAST XML output, the optional indent and increment arguments could be combined into one - an indent string defaulting to two spaces?

Also for frames, is there an existing Biopython precedent for this (-3 to 3)?

Regards,

Peter

From bow at bow.web.id Sat Sep 7 14:04:12 2013
From: bow at bow.web.id (Wibowo Arindrarto)
Date: Sat, 7 Sep 2013 16:04:12 +0200
Subject: [Biopython-dev] SearchIO wiki page & documentation
In-Reply-To:
References:
Message-ID:

Hi Peter,

Thanks. Comments are always welcome :).

For the indent and increment arguments, I actually prefer to keep them separate.
The reason is that having them in separate variables makes it easier for the writer to navigate into or out of the levels. The writer keeps track of which XML child element it is writing, and it either increases or decreases the level (so it can print the proper indentation). This is required since BLAST's XML tree does not really map onto the object model we are using. It is similar, but not the same (e.g. the statistics tags are all children of a single element that is not the query element, while in the object they are all flat attributes of the query object).

When it increases the element level, I can understand that having indent and increment as one argument makes it simpler. However, when the writer wants to go up a level (go back to the parent level), it gets difficult with a combined indent & increment variable, since Python strings do not work with the minus operator (though they do work with the plus operator).

As for the frames, I tried to make it consistent with the way SeqFeature stores its strands (-3 to 3, and None).

Best,
Bow

From jurajbergman at hotmail.com Sat Sep 7 14:14:59 2013
From: jurajbergman at hotmail.com (Juraj Bergman)
Date: Sat, 7 Sep 2013 16:14:59 +0200
Subject: [Biopython-dev] Python_MKT
In-Reply-To:
References:
Message-ID:

Hi,

I've made some improvements to my MKT module - mainly using Kruskal's algorithm to rewrite the multi_short_path() function (thanks for the suggestion Zheng!) and I added some new functions as well (pathway_a(), pathways_n()).
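
For readers following the thread, here is a minimal sketch of how Kruskal's algorithm could be applied to a codon network. It is not taken from Python_MKT.py: the function name kruskal_codon_tree, the cost dictionary layout and the three-codon example are all assumptions made for illustration. Each edge is ranked by (synonymous + non-synonymous, non-synonymous), so the overall tree length is minimized first and non-synonymous changes are used as the tie-breaker, which is one reading of the goal described earlier in this thread.

```python
def kruskal_codon_tree(codons, cost):
    """Return (edges, (syn_total, nonsyn_total)) for a minimum spanning tree.

    cost maps frozenset({a, b}) -> (synonymous, non_synonymous) counts and is
    assumed to hold an entry for every pair of codons (hypothetical layout).
    """
    parent = {c: c for c in codons}

    def find(c):
        # Walk up to the representative of c's component, halving the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    def rank(edge):
        syn, nonsyn = cost[edge]
        # Overall length first, non-synonymous count second, names last
        # (the names only make the ordering deterministic).
        return (syn + nonsyn, nonsyn, sorted(edge))

    edges = []
    total_syn = total_nonsyn = 0
    for edge in sorted(cost, key=rank):
        a, b = sorted(edge)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # different components, so no cycle is created
            parent[root_a] = root_b
            edges.append((a, b))
            syn, nonsyn = cost[edge]
            total_syn += syn
            total_nonsyn += nonsyn
    return edges, (total_syn, total_nonsyn)


# Illustrative substitution counts for three codons (not output of the module).
example_cost = {
    frozenset(["AAA", "AAG"]): (1, 0),
    frozenset(["AAA", "AGA"]): (0, 1),
    frozenset(["AAG", "AGA"]): (1, 1),
}
tree, totals = kruskal_codon_tree(["AAA", "AAG", "AGA"], example_cost)
print(tree, totals)
```

Because the edges are sorted once and a union-find structure rejects cycle-forming edges, no explicit root is needed and the result is always a single connected tree when the cost table covers every codon pair.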
Links:
https://www.dropbox.com/s/zgnz8xwlcsispzf/Python_MKT.pdf
https://www.dropbox.com/s/1z3opj4rbb0ms14/Python_MKT.py

Regards,
Juraj

From p.j.a.cock at googlemail.com Sat Sep 7 18:12:37 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 7 Sep 2013 19:12:37 +0100
Subject: [Biopython-dev] Print statements vs functions (Python 2 vs 3)
In-Reply-To:
References:
Message-ID:

On Sat, Sep 7, 2013 at 2:52 PM, Peter Cock wrote:
> On Sat, Sep 7, 2013 at 12:41 PM, Peter Cock wrote:
>> Dear Biopythoneers,
>>
>> As you will be aware, with our recent release of Biopython 1.62
>> we now officially support Python 3 for the first time (specifically
>> Python 3.3 - we don't recommend 3.0, 3.1 or 3.2 here), while
>> continuing to support Python 2 as well.
>>
>> Currently all our documentation is written assuming Python 2,
>> but with some small changes most things can be written to
>> work under both variants.
>> The most visible change is how to
>> print things, and that happens a lot in our examples.
>>
>> I would like us to switch to using the Python 3 style print
>> function in our documentation (including the Tutorial and
>> the docstrings embedded in the code as help text).
>>
>> ...
>>
>> Would anyone object to us using the print function style
>> in the Biopython documentation?
>>
>> I'm particularly keen to hear from beginners - as this
>> is potentially confusing.
>>
>> Thanks,
>>
>> Peter.
>
> I tweeted this email,
>
> Biopython Project (@Biopython): Would anyone object to us using
> #Python3 print function style in the #Biopython documentation?
> http://lists.open-bio.org/pipermail/biopython/2013-September/008751.html
> https://twitter.com/Biopython/status/376309705972654080
>
> Two replies already:
>
> Raphael Mattos (@rsmattos): @Biopython I think it's time....
> https://twitter.com/rsmattos/status/376321218456338432
>
> Alec Munro (@alecmunro): @Biopython do it!
> https://twitter.com/alecmunro/status/376341224544038912
>
> Peter

On Sat, Sep 7, 2013 at 5:25 PM, Peter Cock wrote:
> On Sat, Sep 7, 2013 at 2:50 PM, Dan Tomso wrote:
>> Hi, Peter.
>>
>> This sounds OK to me.
>
> Thanks Dan. And another voice of approval on Twitter:
>
> Karin Lagesen (@karinlag): @Biopython @pjacock Go for it!
> https://twitter.com/karinlag/status/376356704080105472

And another positive voice:

Dave Lunt (@davelunt): @Biopython the docs change sounds good, that very clear explanation you link to should also be somewhere obvious
https://twitter.com/davelunt/status/376405338511384576

Since there has only been positive reaction, I've made a start at converting the examples in the Tutorial to use the Python 3 style print function (maintaining full Python 2 compatibility under Python 2.6 and 2.7 via the future import):

https://github.com/biopython/biopython/commit/34d155a02cbcf7c953fb8238a5412f8c7c0e1cc5
https://github.com/biopython/biopython/commit/74a8b8349b58ae9aa7a727d6e1ab774a4c9008a3

For those curious to see how it looks (but not already familiar with LaTeX, pdflatex and hevea), you can see a sneak preview here:

http://biopython.org/DIST/docs/tutorial/Tutorial-dev.html
http://biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf

(Hopefully those links will once again auto-update every night, something that was working nicely prior to the server move.)

If you spot any typos, please let us know. Thanks!

Peter

From eric.talevich at gmail.com Sat Sep 7 19:17:08 2013
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 7 Sep 2013 12:17:08 -0700
Subject: [Biopython-dev] Python 2 and 3 migration thoughts
In-Reply-To:
References:
Message-ID:

On Sat, Sep 7, 2013 at 4:30 AM, Peter Cock wrote:
> On Fri, Sep 6, 2013 at 4:44 PM, Peter Cock
> wrote:
> >
> > This branch is trying out marking individual Python files
> > as dual coding (Python 2 and Python 3) or as Python 2
> > only requiring conversion via 2to3 for use on Python 3:
> >
> > https://github.com/peterjc/biopython/tree/tag2to3
> >
> > Currently the tags are two special hash comment lines
> > expected near the start of the file itself (rather than a
> > list within the do2to3.py script). The actual text of the
> > marker isn't critical - perhaps these need full stops?
> >
> > # This file targets both Python 2 and Python 3 at the same time
> > # TODO - Targets Python 2 only (use 2to3 to run under Python 3)
> >
> [...]
> > As of right now, on this branch there are only 8 files under
> > Tests which require conversion via 2to3 :
>
> Down to six files under Tests now if I rebase the branch
> to include the recent fixes on the master.
>
> > Having I hope demonstrated this will work, I'd like some
> > feedback before applying this (or a modified version of
> > it) to the master branch.
>
> I've started applying individual code fixes to the master
> to improve Python 2 and 3 compatibility already.
>
> I'm specifically looking for thoughts on how to handle
> the transition period when some of our code will still
> need 2to3, while other code will not.
>
> Does the special comment line seem like a good solution?
> On the plus side, it tracks any changes with the file being
> updated (which wouldn't happen with a list in the do2to3.py
> file).
>
> Peter
>

Hi Peter,

This looks like a good way to move forward overall. Regarding the special comment lines -- since these are only used in do2to3.py, would it be cleaner to keep a hard-coded list of filenames in do2to3.py and leave the modules and scripts alone? Are there any characteristics that would make it difficult to determine whether a given module or script is Py3-compliant?

-Eric

From p.j.a.cock at googlemail.com Sun Sep 8 20:52:40 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 8 Sep 2013 21:52:40 +0100
Subject: [Biopython-dev] Python 2 and 3 migration thoughts
In-Reply-To:
References:
Message-ID:

On Sat, Sep 7, 2013 at 8:17 PM, Eric Talevich wrote:
>> >
>> > # This file targets both Python 2 and Python 3 at the same time
>> > # TODO - Targets Python 2 only (use 2to3 to run under Python 3)
>> >
>>
>> Does the special comment line seem like a good solution?
>> On the plus side, it tracks any changes with the file being
>> updated (which wouldn't happen with a list in the do2to3.py
>> file).
>
> Hi Peter,
>
> This looks like a good way to move forward overall. Regarding the special
> comment lines -- since these are only used in do2to3.py, would it be
> cleaner to keep a hard-coded list of filenames in do2to3.py and leave the
> modules and scripts alone? Are there any characteristics that would make it
> difficult to determine whether a given module or script is Py3-compliant?

Hi Eric,

There are import time problems which are easy to spot - in particular SyntaxError is a good clue. However, many of the issues are only really found at run time (e.g. different method names). This means that the tests (which I started with) are actually the easiest to check.

Right now I don't have a feel for what fraction of the main Bio/* and BioSQL/* files can be made dual-coding, and that would have an influence on how best to tag things needing 2to3 or not. I'm happy to continue this on branches for a while longer and find out.

I do like the idea of a special #TODO comment line where 2to3 is still needed - it is symbolic of where I want the code base to go ;)

Regards,

Peter

From nigel.delaney at outlook.com Mon Sep 9 16:03:49 2013
From: nigel.delaney at outlook.com (Nigel Delaney)
Date: Mon, 9 Sep 2013 12:03:49 -0400
Subject: [Biopython-dev] VCF Parsers
Message-ID:

Hi Biopython,

Just wanted to ask quickly if anyone on the Biopython team has implemented or is implementing VCF parsers. I have seen a few written in Python but they seem to be quite slow, so I am curious whether anyone has wrapped a C library of some sort.

Thanks for any help,
Nigel

From arklenna at gmail.com Mon Sep 9 16:18:17 2013
From: arklenna at gmail.com (Lenna Peterson)
Date: Mon, 9 Sep 2013 17:18:17 +0100
Subject: [Biopython-dev] VCF Parsers
In-Reply-To:
References:
Message-ID:

I believe PyVCF [1] has a Cython implementation.
Cheers, Lenna 1: https://github.com/jamescasbon/PyVCF On Mon, Sep 9, 2013 at 5:03 PM, Nigel Delaney wrote: > Hi Biopython, > > > > Just wanted to ask quickly if anyone on the biopython team has implemented > or is implementing vcf parsers. I have seen a few python written ones but > they seem to be quite slow, and so am curious if anyone has wrapped a C > library of some sort. > > > > Thanks for any help, > > Nigel > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From chapmanb at 50mail.com Mon Sep 9 16:56:08 2013 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 09 Sep 2013 12:56:08 -0400 Subject: [Biopython-dev] VCF Parsers In-Reply-To: References: Message-ID: <86sixe7x3r.fsf@fastmail.fm> Nigel and Lenna; There is also a fully Cython implementation called cyvcf and has the same interface as PyVCF with some additional speed improvements: https://github.com/arq5x/cyvcf https://pypi.python.org/pypi/cyvcf Brad > I believe PyVCF [1] has a Cython implementation. > > Cheers, > > Lenna > > 1: https://github.com/jamescasbon/PyVCF > > > On Mon, Sep 9, 2013 at 5:03 PM, Nigel Delaney wrote: > >> Hi Biopython, >> >> >> >> Just wanted to ask quickly if anyone on the biopython team has implemented >> or is implementing vcf parsers. I have seen a few python written ones but >> they seem to be quite slow, and so am curious if anyone has wrapped a C >> library of some sort. 
>> >> >> >> Thanks for any help, >> >> Nigel >> >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Mon Sep 9 20:29:35 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 9 Sep 2013 21:29:35 +0100 Subject: [Biopython-dev] Print statements vs functions (Python 2 vs 3) In-Reply-To: References: Message-ID: On Sat, Sep 7, 2013 at 7:12 PM, Peter Cock wrote: > On Sat, Sep 7, 2013 at 2:52 PM, Peter Cock wrote: >> On Sat, Sep 7, 2013 at 12:41 PM, Peter Cock wrote: >>> ... >>> >>> Would anyone object to us using the print function style >>> in the Biopython documentation? > > ... > > Since there has only been positive reaction, I've made a > start at converting the examples in the Tutorial to use the > Python 3 style print function (maintaining full Python 2 > compatibility under Python 2.6 and 2.7 via the future > import): > > https://github.com/biopython/biopython/commit/34d155a02cbcf7c953fb8238a5412f8c7c0e1cc5 > https://github.com/biopython/biopython/commit/74a8b8349b58ae9aa7a727d6e1ab774a4c9008a3 > It turned out to be slightly more than a weekend project, but I've now done this for the main code including the doctests :) All new code changes should be written using the print function style and will then work on both Python 2 and 3 without change, e.g. 
print(variable) Any accidental usage of an old-style print statement will be caught in two ways, under Python 2 via the future import (if it is in the file you are editing): https://github.com/biopython/biopython/commit/de12c5e08fc44d9c158954bb4b1d5f98cfb84c69 And I have also disabled the print fixer during 2to3, which would result in old-style print statements causing an error when testing under Python 3: https://github.com/biopython/biopython/commit/00ab061dba42082ff0e20383847ebffaf6dd8eef If you are using a print function, and the file doesn't have it already, please add the future import: from __future__ import print_function If there are any stray print statements still there (e.g. hiding in example scripts I missed), please fix them or report them. Regards, Peter From zruan1991 at gmail.com Tue Sep 10 02:36:21 2013 From: zruan1991 at gmail.com (Zheng Ruan) Date: Mon, 9 Sep 2013 22:36:21 -0400 Subject: [Biopython-dev] Codon Alignment GSoC Weekly Update Message-ID: Hi all, The update for Codon Alignment GSoC project can be found at http://zruanweb.com/. Thanks for your comments and suggestions. 
Best, Zheng Ruan From yeyanbo289 at gmail.com Tue Sep 10 07:29:06 2013 From: yeyanbo289 at gmail.com (Yanbo Ye) Date: Tue, 10 Sep 2013 15:29:06 +0800 Subject: [Biopython-dev] GSOC weekly update 13 Message-ID: Hi all, I posted the update of Biopython.Phylo project here: http://blog.yeyanbo.com/posts/google-summer-of-code-13.html Thanks, Yanbo -- *Yanbo Ye* *Guangzhou Institutes of Biomedicine and Health, * *Chinese Academy of Sciences* *190 Kaiyuan Avenue, Science Park, Guangzhou, China** * * * *Email: ye_yanbo at gibh.ac.cn* *Web: http://www.yeyanbo.com* *Phone: (86)-020-32093810* From p.j.a.cock at googlemail.com Fri Sep 13 08:54:14 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 13 Sep 2013 09:54:14 +0100 Subject: [Biopython-dev] Galaxy Tool Shed packages for Biopython Message-ID: Hi all, I've sent this to both the Galaxy and Biopython developers lists, and hope this will make sense to both groups. If you've not heard of Galaxy, start here: http://galaxyproject.org - while the easy to guess Biopython website is at http://biopython.org Brad Chapman and I are both Biopython core developers, and are also both on the "IUC" Galaxy Tool Shed committee because we've been quite involved in wrapping and writing tools for use on Galaxy. Fellow committee member Björn Grüning has done a lot of the hands-on work defining package definitions for dependencies within the Galaxy Tool Shed ecosystem - including defining them for Biopython, NumPy, SciPy, MatPlotLib, etc. 
We're very grateful for his hard work - most of which is now available under the IUC group account: http://toolshed.g2.bx.psu.edu/view/iuc/ http://testtoolshed.g2.bx.psu.edu/view/iuc/ The Biopython packages, however, are under a dedicated "biopython" account on the Galaxy Tool Shed to which currently Bjoern, Brad and I have access to: http://toolshed.g2.bx.psu.edu/view/biopython/ http://testtoolshed.g2.bx.psu.edu/view/biopython/ This packaging work was initially tracked in Bjoern's own GitHub repository, https://github.com/bgruening/galaxytools/ We (me, Brad and Bjoern) agreed that a Biopython owned repository would be more sensible in the long term, so I have created this and ported Bjoern's commits to it: https://github.com/biopython/galaxy_packages Currently the "Galaxy packagers" team on GitHub which has read and write access to this new repository is just me, Brad and Bjoern. Regards, Peter From eric.talevich at gmail.com Fri Sep 13 20:08:12 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 13 Sep 2013 13:08:12 -0700 Subject: [Biopython-dev] Fwd: New Biopython (sub)module? In-Reply-To: References: <521354A9.6020701@brueffer.de> Message-ID: On Thu, Aug 22, 2013 at 6:01 AM, Peter Cock wrote: > On Wed, Aug 21, 2013 at 11:00 PM, Cyrus Maher > wrote: > > > > That said, I was also hoping to get your thoughts on whether this seemed > > like the type of project that would fit in with Biopython. Peter said > that > > Eric might have some good comments on this matter? > > Right - I was thinking Eric and this year's phylogenetic focused GSoC > students should have some good comments, e.g. about adding > something like pal2nal into Biopython. > > Peter > Hi Cyrus, MOSAIC looks cool, it's always good to see progress in ortholog detection. Since the core of the program is a single Python module, it shouldn't be too hard to plug this into Biopython. 
Keep in mind, though, that once MOSAIC is in the Biopython source tree it could become less convenient for you to make major updates and changes to the program, whereas if you control the packaging yourself you're free to change the API, add dependencies, etc. however you like. So, for the manuscript/publication at least, you might find it safer to only state that distributing MOSAIC with Biopython is planned, rather than committing to a release version number. Thoughts on the code: - Zheng Ruan has written a nice codon alignment module as part of his GSoC project. Once that's merged, you'll be able to drop the pal2nal dependency. - We haven't merged Chris's MSAprobs wrapper yet (to my knowledge), though at a glance it looks like it should be straightforward. For Bio.mosaic (I guess?), we would probably wait until the wrapper is merged and then remove the conditional in mosaic. - Does EMBOSS stretcher do anything that couldn't be done with Bio.pairwise2? If not, you could use pairwise2 instead and avoid another dependency. - The use of pandas looks fairly basic and therefore also avoidable. It looks like with a few more lines of code you could use Python's built-in csv module to parse a table and store it in a numpy matrix instead. - MOSAIC does some logging to the console, which is sensible for the program but isn't done as much in Biopython. Some of these print statements could be changed to warnings (see the warnings module). The progress indicators could maybe be toggled at the function level with a keyword argument, e.g. verbose=True/False. 
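To make that last point concrete, here is a minimal sketch of the pattern, using only the standard library. The function name summarise_lengths and its arguments are made up for illustration and are not part of MOSAIC or Biopython; the idea is just that recoverable problems go through the warnings module, while progress output is opt-in via a verbose keyword:

```python
import sys
import warnings


def summarise_lengths(pairs, verbose=False):
    """Hypothetical stand-in for a long-running step (not MOSAIC's API)."""
    results = []
    for i, (a, b) in enumerate(pairs, 1):
        if len(a) != len(b):
            # Recoverable problem: issue a warning rather than printing,
            # so callers can filter it or turn it into an error.
            warnings.warn("pair %d: sequences differ in length" % i)
        if verbose:
            # Progress output only when asked for, and on stderr.
            sys.stderr.write("processed %d of %d\n" % (i, len(pairs)))
        results.append(min(len(a), len(b)))
    return results
```

Library users then get silence by default, and a command line script can simply pass verbose=True.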
Cheers, Eric From eric.talevich at gmail.com Fri Sep 13 21:05:16 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 13 Sep 2013 14:05:16 -0700 Subject: [Biopython-dev] GSOC weekly update 13 In-Reply-To: References: Message-ID: On Tue, Sep 10, 2013 at 12:29 AM, Yanbo Ye wrote: > > Hi all, > > I posted the update of Biopython.Phylo project here: > http://blog.yeyanbo.com/posts/google-summer-of-code-13.html > > Thanks, > Yanbo > Hi Yanbo, Looks like you finished your project right on schedule. :) For the next week, how are you planning to document your new modules? It looks like you've put the essential information in the docstrings, which is good to see. If you write more detailed explanations or examples of how to use the new features on the Biopython wiki next week, I can help roll them into the main tutorial. Or you could make a patch to Tutorial.tex directly, if you'd like. The unit tests look pretty good already. Thanks for all your hard work! Cheers, Eric From eric.talevich at gmail.com Fri Sep 13 21:56:52 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Fri, 13 Sep 2013 14:56:52 -0700 Subject: [Biopython-dev] Codon Alignment GSoC Weekly Update In-Reply-To: References: Message-ID: Hi Zheng, I just went through your code and left some comments. Impressive work! So, next week is the "soft pencils-down" deadline, and this would be a good time to put together the canonical documentation for your project. One way to go about this would be to copy the relevant text, code examples and figures from your blog and either put them on a CodonAlignment page on the Biopython wiki, or consolidate them into a new chapter in Tutorial.tex. Or did you have something else in mind? Cheers, Eric On Mon, Sep 9, 2013 at 7:36 PM, Zheng Ruan wrote: > Hi all, > > The update for Codon Alignment GSoC project can be found at > http://zruanweb.com/. Thanks for your comments and suggestions. 
> > Best, > Zheng Ruan > From zruan1991 at gmail.com Sat Sep 14 16:49:33 2013 From: zruan1991 at gmail.com (Zheng Ruan) Date: Sat, 14 Sep 2013 12:49:33 -0400 Subject: [Biopython-dev] Chi2 test in Bio.Phylo.PAML.chi2 Message-ID: Hi all, I am trying to use chi2 test within Biopython to reduce my dependency of scipy. However, the chi2 test is very slow in some case of stat value when degree of freedom is 1 (MK test has a df of 1). Here is a small example: >>> from Bio.Phylo.PAML import chi2 >>> chi2.cdf_chi2(1, 1) 0.3173105078923443 >>> chi2.cdf_chi2(1, 2) 0.1572992072733692 >>> chi2.cdf_chi2(1, 3) 0.08326451704454607 >>> chi2.cdf_chi2(1, 4) 0.04550026405390195 >>> chi2.cdf_chi2(1, 5) ^CTraceback (most recent call last): File "", line 1, in File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line 20, in cdf_chi2 prob = 1 - _incomplete_gamma(x, alpha) File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line 116, in _incomplete_gamma pn[i] /= overflow KeyboardInterrupt >>> chi2.cdf_chi2(1, 6) 0.014305878510978087 >>> chi2.cdf_chi2(1, 7) 0.00815097160412992 >>> chi2.cdf_chi2(1, 8) 0.004677734999637195 >>> chi2.cdf_chi2(1, 9) ^CTraceback (most recent call last): File "", line 1, in File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line 20, in cdf_chi2 prob = 1 - _incomplete_gamma(x, alpha) File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line 112, in _incomplete_gamma pn[i] = pn[i+2] KeyboardInterrupt The behavior of chi2.cdf_chi2 is quite wiered. Could someone clarify this? Thanks! Best, Zheng Ruan From eric.talevich at gmail.com Sat Sep 14 17:24:17 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Sat, 14 Sep 2013 10:24:17 -0700 Subject: [Biopython-dev] Chi2 test in Bio.Phylo.PAML.chi2 In-Reply-To: References: Message-ID: On Sat, Sep 14, 2013 at 9:49 AM, Zheng Ruan wrote: > Hi all, > > I am trying to use chi2 test within Biopython to reduce my dependency of > scipy. 
However, the chi2 test is very slow in some case of stat value when > degree of freedom is 1 (MK test has a df of 1). Here is a small example: > > >>> from Bio.Phylo.PAML import chi2 > >>> chi2.cdf_chi2(1, 1) > 0.3173105078923443 > >>> chi2.cdf_chi2(1, 2) > 0.1572992072733692 > >>> chi2.cdf_chi2(1, 3) > 0.08326451704454607 > >>> chi2.cdf_chi2(1, 4) > 0.04550026405390195 > >>> chi2.cdf_chi2(1, 5) > ^CTraceback (most recent call last): > File "", line 1, in > File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line > 20, in cdf_chi2 > prob = 1 - _incomplete_gamma(x, alpha) > File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line > 116, in _incomplete_gamma > pn[i] /= overflow > KeyboardInterrupt > >>> chi2.cdf_chi2(1, 6) > 0.014305878510978087 > >>> chi2.cdf_chi2(1, 7) > 0.00815097160412992 > >>> chi2.cdf_chi2(1, 8) > 0.004677734999637195 > >>> chi2.cdf_chi2(1, 9) > ^CTraceback (most recent call last): > File "", line 1, in > File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line > 20, in cdf_chi2 > prob = 1 - _incomplete_gamma(x, alpha) > File "/home/rz/Working.Dir/GSoC/biopython/Bio/Phylo/PAML/chi2.py", line > 112, in _incomplete_gamma > pn[i] = pn[i+2] > KeyboardInterrupt > > > The behavior of chi2.cdf_chi2 is quite wiered. Could someone clarify this? > Thanks! > > Best, > Zheng Ruan > It looks like that implementation of chi2 (based on PAML's C implementation) has trouble with convergence at df=1. I wrote another Python implementation of chi2 based on the SciPy source code (to avoid a hard SciPy dependency in CladeCompare, which also uses a G-test), which you can use if you find it works better: https://github.com/etal/biofrills/blob/master/biofrills/stats/chisq.py It imports the original scipy version at the end in case the user does have scipy installed, since that compiled version will be much faster. 
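As an aside, for the df=1 case specifically (as in the MK test) there is a closed form that needs no iteration at all, since a chi-square variable with one degree of freedom is the square of a standard normal. A minimal sketch, not taken from either implementation; the function name is made up, and math.erfc needs Python 2.7 or later:

```python
import math


def chi2_sf_df1(stat):
    """Upper-tail probability P(X > stat) for chi-square with df=1."""
    if stat < 0:
        raise ValueError("chi-square statistic must be non-negative")
    # For df=1 the tail probability reduces to erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))
```

For a statistic of 1 or 2 this agrees with the 0.3173... and 0.1572... values in the session quoted above. For other degrees of freedom the biofrills module above still applies.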
This hasn't been tested as much as Bio.Phylo.PAML.chi2, though, and I haven't benchmarked the two Python implementations against each other. Also note that it uses math.lgamma, which was only added in Python 2.7, so for 2.6 compatibility you'll need to copy in the pure-Python log-gamma implementation from Bio.Phylo.PAML.chi2. (We could add this conditional import of math.lgamma to Bio.Phylo.PAML.chi2, too.) Or, you could try increasing the tolerance used for testing convergence in Bio.Phylo.PAML.chi2. Best, Eric From yeyanbo289 at gmail.com Mon Sep 16 03:42:12 2013 From: yeyanbo289 at gmail.com (Yanbo Ye) Date: Mon, 16 Sep 2013 11:42:12 +0800 Subject: [Biopython-dev] GSOC weekly update 13 In-Reply-To: References: Message-ID: Hi Eric, I noticed there are some relevant TODOs on the phylo cookbook page, so I'd like to edit them add some examples onto the Biopython wiki this week. Cheers, Yanbo On Sat, Sep 14, 2013 at 5:05 AM, Eric Talevich wrote: > On Tue, Sep 10, 2013 at 12:29 AM, Yanbo Ye wrote: > >> >> Hi all, >> >> I posted the update of Biopython.Phylo project here: >> http://blog.yeyanbo.com/posts/google-summer-of-code-13.html >> >> Thanks, >> Yanbo >> > > Hi Yanbo, > > Looks like you finished your project right on schedule. :) > > For the next week, how are you planning to document your new modules? It > looks like you've put the essential information in the docstrings, which is > good to see. If you write more detailed explanations or examples of how to > use the new features on the Biopython wiki next week, I can help roll them > into the main tutorial. Or you could make a patch to Tutorial.tex directly, > if you'd like. > > The unit tests look pretty good already. Thanks for all your hard work! 
> > Cheers, > Eric > -- *Yanbo Ye* *Guangzhou Institutes of Biomedicine and Health, * *Chinese Academy of Sciences* *190 Kaiyuan Avenue, Science Park, Guangzhou, China** * * * *Email: ye_yanbo at gibh.ac.cn* *Web: http://www.yeyanbo.com* *Phone: (86)-020-32093810* From yeyanbo289 at gmail.com Mon Sep 16 04:33:06 2013 From: yeyanbo289 at gmail.com (Yanbo Ye) Date: Mon, 16 Sep 2013 12:33:06 +0800 Subject: [Biopython-dev] GSOC weekly update 14 Message-ID: Hi all, My update of Biopython.Phylo project is here. http://blog.yeyanbo.com/posts/google-summer-of-code-14.html This week I will add document and examples of new features to the cookbook on biopython wiki. Thanks, Yanbo -- *Yanbo Ye* *Guangzhou Institutes of Biomedicine and Health, * *Chinese Academy of Sciences* *190 Kaiyuan Avenue, Science Park, Guangzhou, China** * * * *Email: ye_yanbo at gibh.ac.cn* *Web: http://www.yeyanbo.com* *Phone: (86)-020-32093810* From zruan1991 at gmail.com Tue Sep 17 03:35:10 2013 From: zruan1991 at gmail.com (Zheng Ruan) Date: Mon, 16 Sep 2013 23:35:10 -0400 Subject: [Biopython-dev] Codon Alignment GSoC Project Last Update Message-ID: Hi all, The last code update for Codon Alignment GSoC project can be found at http://zruanweb.com/. This week I'll be converting my blog examples into an independent chapter of biopython tutorial. Some tests for the CodonAlign module will also be added shortly. Thanks! Best, Ruan From michael.maher at ucsf.edu Tue Sep 17 19:20:46 2013 From: michael.maher at ucsf.edu (Cyrus Maher) Date: Tue, 17 Sep 2013 12:20:46 -0700 Subject: [Biopython-dev] Fwd: New Biopython (sub)module? In-Reply-To: References: <521354A9.6020701@brueffer.de> Message-ID: Hi Eric, We're glad you like MOSAIC! It's exciting to start getting it out there. Just as a quick update, the latest version of the paper is available on arxiv . In addition, updated documentation, relevant files, etc. can be found here . 
The module has also been uploaded to PyPI, so it can now be installed with easy_install bio-mosaic. Given the importance of ortholog detection to a broad range of computational biology tasks, we definitely think it's worth putting in a little extra work and making a few sacrifices to make this tool more broadly and conveniently available to the community. So if you're game, we would love to start thinking about timelines for making any necessary changes. We really appreciate your comments so far. Below are some initial thoughts/replies: ============ *- Zheng Ruan has written a nice codon alignment module as part of his GSoC project. Once that's merged, you'll be able to drop the pal2nal dependency. * * * This is a great idea and we'd be happy to incorporate it. *- We haven't merged Chris's MSAprobs wrapper yet (to my knowledge), though at a glance it looks like it should be straightforward. For Bio.mosaic (I guess?), we would probably wait until the wrapper is merged and then remove the conditional in mosaic. * * * Sounds good! * * *- Does EMBOSS stretcher do anything that couldn't be done with Bio.pairwise2? If not, you could use pairwise2 instead and avoid another dependency. * Pairwise alignment constitutes a significant portion of MOSAIC's run time. stretcher was chosen because of its speed. How about this: we could test if stretcher is installed, and if it's not, we can 1.) fall back to Bio.pairwise2 and 2.) provide a helpful warning about slowdown with a direct link to the latest EMBOSS toolkit. What do you think? * * * - The use of pandas looks fairly basic and therefore also avoidable. It looks like with a few more lines of code you could use Python's built-in csv module to parse a table and store it in a numpy matrix instead. * You're totally right. We can do that. * * *- MOSAIC does some logging to the console, which is sensible for the program but isn't done as much in Biopython. 
Some of these print statements could be changed to warnings (see the warnings module). The progress indicators could maybe be toggled at the function level with a keyword argument, e.g. verbose=True/False.* Consider it done! ============ Thanks again for your feedback! Looking forward to hearing further comments/next steps, etc... Cheers, -Cyrus On Fri, Sep 13, 2013 at 1:08 PM, Eric Talevich wrote: > On Thu, Aug 22, 2013 at 6:01 AM, Peter Cock wrote: > >> On Wed, Aug 21, 2013 at 11:00 PM, Cyrus Maher >> wrote: >> > >> > That said, I was also hoping to get your thoughts on whether this seemed >> > like the type of project that would fit in with Biopython. Peter said >> that >> > Eric might have some good comments on this matter? >> >> Right - I was thinking Eric and this year's phylogenetic focused GSoC >> students should have some good comments, e.g. about adding >> something like pal2nal into Biopython. >> >> Peter >> > > Hi Cyrus, > > MOSAIC looks cool, it's always good to see progress in ortholog detection. > Since the core of the program is a single Python module, it shouldn't be > too hard to plug this into Biopython. Keep in mind, though, that once > MOSAIC is in the Biopython source tree it could become less convenient for > you to make major updates and changes to the program, whereas if you > control the packaging yourself you're free to change the API, add > dependencies, etc. however you like. So, for the manuscript/publication at > least, you might find it safer to only state that distributing MOSAIC with > Biopython is planned, rather than committing to a release version number. > > Thoughts on the code: > > - Zheng Ruan has written a nice codon alignment module as part of his GSoC > project. Once that's merged, you'll be able to drop the pal2nal dependency. > > - We haven't merged Chris's MSAprobs wrapper yet (to my knowledge), though > at a glance it looks like it should be straightforward. 
For Bio.mosaic (I > guess?), we would probably wait until the wrapper is merged and then remove > the conditional in mosaic. > > - Does EMBOSS stretcher do anything that couldn't be done with > Bio.pairwise2? If not, you could use pairwise2 instead and avoid another > dependency. > > - The use of pandas looks fairly basic and therefore also avoidable. It > looks like with a few more lines of code you could use Python's built-in > csv module to parse a table and store it in a numpy matrix instead. > > - MOSAIC does some logging to the console, which is sensible for the > program but isn't done as much in Biopython. Some of these print statements > could be changed to warnings (see the warnings module). The progress > indicators could maybe be toggled at the function level with a keyword > argument, e.g. verbose=True/False. > > > Cheers, > Eric > From zruan1991 at gmail.com Sat Sep 21 23:29:17 2013 From: zruan1991 at gmail.com (Zheng Ruan) Date: Sat, 21 Sep 2013 19:29:17 -0400 Subject: [Biopython-dev] Codon Alignment GSoC Documentation Update Message-ID: Hi all, The documentation for Codon Alignment GSoC project is now available in reStructuredText, LaTeX (pdf) and HTML format at http://zruanweb.com/. To this point, I finished all the tasks of my project. I really enjoy the coding experience in the past two months. Thanks for all your help and valuable feedback! Best, Zheng Ruan From yeyanbo289 at gmail.com Sun Sep 22 05:22:46 2013 From: yeyanbo289 at gmail.com (Yanbo Ye) Date: Sun, 22 Sep 2013 13:22:46 +0800 Subject: [Biopython-dev] GSOC weekly update 14 Message-ID: Hi all, My tutorial about the new features of Bio.Phylo that completed during this GSoC is in this file: https://github.com/lijax/gsoc/blob/master/phylo_wiki.md . I also updated the phylo page on the biopython wiki. http://biopython.org/wiki/Phylo Thanks for all your help and suggestions during last three months. 
I really appreciate this coding experience and would like to continue contributing to the Biopython community. Any comments and suggestions about the code or documentation would be welcome. cheers, Yanbo -- *Yanbo Ye* *Guangzhou Institutes of Biomedicine and Health, * *Chinese Academy of Sciences* *190 Kaiyuan Avenue, Science Park, Guangzhou, China** * * * *Email: ye_yanbo at gibh.ac.cn* *Web: http://www.yeyanbo.com* *Phone: (86)-020-32093810* From p.j.a.cock at googlemail.com Mon Sep 23 20:58:25 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 23 Sep 2013 21:58:25 +0100 Subject: [Biopython-dev] NumPy 1.7 and NPY_NO_DEPRECATED_API warnings Message-ID: Hi all, I'm seeing the following warning from NumPy 1.7 with Python 3.3 on Mac OS X, and on Linux too. I believe the NumPy version is the critical factor: building 'Bio.Cluster.cluster' extension building 'Bio.KDTree._CKDTree' extension building 'Bio.Motif._pwm' extension building 'Bio.motifs._pwm' extension all give: /Users/peterjc/lib/python3.3/site-packages/numpy/core/include/numpy/npy_deprecated_api.h:11:2: warning: "Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings] According to this page, http://docs.scipy.org/doc/numpy-dev/reference/c-api.deprecations.html If we add this line it should confirm our code is clean for NumPy 1.7 (and implies no side effects on older NumPy): #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION Unfortunately it seems all four modules have problems doing that, presumably planned NumPy C API changes we need to handle via a version conditional #ifdef? Peter From anaryin at gmail.com Tue Sep 24 06:50:28 2013 From: anaryin at gmail.com (João Rodrigues) Date: Tue, 24 Sep 2013 08:50:28 +0200 Subject: [Biopython-dev] Building Biopython with ICC instead of GCC (intel compiler) Message-ID: Hi all, This is more of a curiosity than a necessity. 
I'm setting up a new cluster and we prefer ICC (Intel compiler) over the usual GCC. When I run "python setup.py build" the output shows ICC being used quite a lot but some lines still use GCC. Example: *gcc -pthread -shared build/temp.linux-x86_64-2.6/Bio/KDTree/KDTree.o build/temp.linux-x86_64-2.6/Bio/KDTree/KDTreemodule.o -L/usr/lib64 -lpython2.6 -o build/lib.linux-x86_64-2.6/Bio/KDTree/_CKDTree.so* building 'Bio.Motif._pwm' extension creating build/temp.linux-x86_64-2.6/Bio/Motif icc -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/home/software/python-libs/lib64/python2.6/site-packages/numpy/core/include -I/usr/include/python2.6 -c Bio/Motif/_pwm.c -o build/temp.linux-x86_64-2.6/Bio/Motif/_pwm.o icc: command line warning #10006: ignoring unknown option '-fwrapv' icc: command line warning #10006: ignoring unknown option '-fwrapv' /home/software/python-libs/lib64/python2.6/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h(15): warning #1224: #warning directive: "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" #warning "Using deprecated NumPy API, disable it by " \ ^ I guess this has to do with distutils? Any idea on how to force it to use only the Intel compilers? Cheers, João From mjldehoon at yahoo.com Wed Sep 25 00:55:25 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 24 Sep 2013 17:55:25 -0700 (PDT) Subject: [Biopython-dev] Building Biopython with ICC instead of GCC (intel compiler) In-Reply-To: References: Message-ID: <1380070525.80725.YahooMailNeo@web164006.mail.gq1.yahoo.com> How was Python itself compiled? 
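One way to check is to ask the interpreter for the settings recorded at build time. A minimal sketch (the helper name build_compilers is made up; the top-level sysconfig module needs Python 2.7 or later, and on 2.6 distutils.sysconfig.get_config_var should give the same information):

```python
import sysconfig


def build_compilers():
    """Return the compiler and shared-library link commands Python was built with.

    Values may be None on platforms without a Makefile-based build
    (for example Windows).
    """
    return dict((name, sysconfig.get_config_var(name))
                for name in ("CC", "LDSHARED"))


if __name__ == "__main__":
    for name, value in sorted(build_compilers().items()):
        print("%s = %s" % (name, value))
```

The gcc line in the example above is a link step, so LDSHARED is probably the setting to look at. As far as I know, distutils on Unix also honours CC and LDSHARED from the environment, so setting both before running setup.py may be worth a try; that part is untested here.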
I believe distutils is supposed to select the same compiler as was used for Python itself. Best, -Michiel. ________________________________ I guess this has to do with distutils? Any idea on how to force it to use only the intel compilers? _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Fri Sep 27 15:47:11 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 27 Sep 2013 16:47:11 +0100 Subject: [Biopython-dev] Problem with SeqIO uniprot-xml on older XML files? Message-ID: Hi all, There seems to be a problem parsing older UniProt XML files, see http://seqanswers.com/forums/showthread.php?t=33921 Could anyone have a look at this? Somehow the start/end of each record does not seem to be recognised here, >>> from Bio import SeqIO >>> r = next(SeqIO.parse("uniref90.xml", "uniprot-xml")) (takes ages, presumably scanning whole file) Note the indexing code also breaks: >>> from Bio import SeqIO >>> d = SeqIO.index("uniref90.xml", "uniprot-xml") Traceback (most recent call last): File "", line 1, in File "/home/pc40583/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 808, in index key_function, repr, "SeqRecord") File "/home/pc40583/lib/python2.7/site-packages/Bio/File.py", line 250, in __init__ for key, offset, length in offset_iter: File "/home/pc40583/lib/python2.7/site-packages/Bio/SeqIO/_index.py", line 401, in __iter__ % (start_offset, end_offset)) ValueError: Did not find line in bytes 283 to 38649 Thanks, Peter From p.j.a.cock at googlemail.com Sat Sep 28 10:14:30 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 28 Sep 2013 11:14:30 +0100 Subject: [Biopython-dev] Adjusting the xxMotif wrapper / Bio.Application plans In-Reply-To: References: <520374DF.9070301@brueffer.de> Message-ID: On Thu, Aug 8, 2013 at 12:00 PM, Peter Cock wrote: > On Thu, Aug 8, 2013 at 11:37 AM, Christian
Brueffer > wrote: >>> >>> Was there a special reason for all these case variants >>> in the XXmotif options?? >> >> I basically followed the example set by >> Bio/Align/Applications/_Clustalw.py. > > Ah. Without checking I think maybe the ClustalW documentation > used both cases - but the order was deliberately with the lower > case one last as that was used in the Python object as the > property name and keyword. > >> The "rationale" was to allow for people to use their favourite >> spelling variety. >> >> I guess it was bad luck this happened to serve as an example, as it >> was the first piece of code I ever touched in BioPython. >> >> It would be nice to streamline all application wrappers in this regard >> sometime... > > Yeah, perhaps we can formally deprecate set_parameter in > the next release which means all the aliases 'go away' and > that leaves us with just the final entry exposed as the usable > property name and keyword. > > Peter I have updated the application wrapper code to spot hyphens in what should be property names/arguments: https://github.com/biopython/biopython/commit/ba1a43475a3d4450b3ac8409adaf0e59a25b0e47 This forced me to update the XXmotif wrapper and I opted to switch it to using lower case property names: https://github.com/biopython/biopython/commit/f4b4006a64d5166b5c0934d2ad1f8dc3bab30067 I was looking at this as part of applying Christian's MSAProb wrapper: https://github.com/biopython/biopython/pull/225 Peter From saketkc at gmail.com Sat Sep 28 10:22:52 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Sat, 28 Sep 2013 15:52:52 +0530 Subject: [Biopython-dev] Adjusting the xxMotif wrapper / Bio.Application plans In-Reply-To: References: <520374DF.9070301@brueffer.de> Message-ID: On 28 September 2013 15:44, Peter Cock wrote: > On Thu, Aug 8, 2013 at 12:00 PM, Peter Cock wrote: >> On Thu, Aug 8, 2013 at 11:37 AM, Christian Brueffer >> wrote: >>>> >>>> Was there a special reason for all these case variants >>>> in the XXmotif 
options?? >>> >>> I basically followed the example set by >>> Bio/Align/Applications/_Clustalw.py. >> >> Ah. Without checking I think maybe the ClustalW documentation >> used both cases - but the order was deliberately with the lower >> case one last as that was used in the Python object as the >> property name and keyword. >> >>> The "rationale" was to allow for people to use their favourite >>> spelling variety. >>> >>> I guess it was bad luck this happened to serve as an example, as it >>> was the first piece of code I ever touched in BioPython. >>> >>> It would be nice to streamline all application wrappers in this regard >>> sometime... >> >> Yeah, perhaps we can formally deprecate set_parameter in >> the next release which means all the aliases 'go away' and >> that leaves us with just the final entry exposed as the usable >> property name and keyword. >> >> Peter > > I have updated the application wrapper code to spot hyphens > in what should be property names/arguments: > https://github.com/biopython/biopython/commit/ba1a43475a3d4450b3ac8409adaf0e59a25b0e47 > > This forced me to update the XXmotif wrapper and I opted > to switch it to using lower case property names: > https://github.com/biopython/biopython/commit/f4b4006a64d5166b5c0934d2ad1f8dc3bab30067 > > I was looking at this as part of applying Christian's MSAProb wrapper: > https://github.com/biopython/biopython/pull/225 > Great! 
I had made a similar mistake while writing the samtools wrapper (which I am yet to wrap up): https://github.com/saketkc/biopython/commit/30b3d9878281e00afed9e7b6d0bbfb2bdacbce91 > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From anaryin at gmail.com Sat Sep 28 10:57:32 2013 From: anaryin at gmail.com (João Rodrigues) Date: Sat, 28 Sep 2013 12:57:32 +0200 Subject: [Biopython-dev] Building Biopython with ICC instead of GCC (intel compiler) In-Reply-To: <1380070525.80725.YahooMailNeo@web164006.mail.gq1.yahoo.com> References: <1380070525.80725.YahooMailNeo@web164006.mail.gq1.yahoo.com> Message-ID: Hi Michiel, It was compiled with GCC. That might be the issue indeed but I assumed there were some variables you could set to have the compilers changed (setting CC, CXX, F77 for example). As I said, there is no problem, GCC is perfectly good enough. It was just a curiosity. Cheers, João 2013/9/25 Michiel de Hoon > How was Python itself compiled? I believe distutils is supposed to select > the same compiler as was used for Python itself. > > Best, > -Michiel. > > ------------------------------ > I guess this has to do with distutils? Any idea on how to force it to > use only the intel compilers? > > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > From nigel.delaney at outlook.com Sat Sep 28 15:10:25 2013 From: nigel.delaney at outlook.com (Nigel Delaney) Date: Sat, 28 Sep 2013 11:10:25 -0400 Subject: [Biopython-dev] Newick Parser Message-ID: I had a couple questions on the newick parser I was hoping someone might know the answer to. First, it fails when there are BOMs in the file, though in general it seems that UTF encoding with BOMs should be allowed.
Is there a standard way that BOMs in files are handled in biopython? Second, does anyone know what the consensus is on newick files that have placements for data but no data? For example: ((A,B):Name:.0235)C) Defines a name and length for the A,B node. However, ((A,B)::)C) Has positions for name and length but no length or name data, which seems like it should be an error, though currently it is just skipped. From p.j.a.cock at googlemail.com Sat Sep 28 15:23:43 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 28 Sep 2013 16:23:43 +0100 Subject: [Biopython-dev] Newick Parser In-Reply-To: References: Message-ID: Hi Nigel, On Saturday, September 28, 2013, Nigel Delaney wrote: > I had a couple questions on the newick parser I was hoping someone might > know the answer to. > > First, it fails when there are BOMs in the file, though in general it seems > that UTF encoding with BOMs should be allowed. Is there a standard way > that > BOM in files are handled in biopython? > You mean a Unicode byte order mark (BOM)? Does it even make sense to allow non-ASCII in Newick format? Peter From nigel.delaney at outlook.com Sat Sep 28 15:55:14 2013 From: nigel.delaney at outlook.com (Nigel Delaney) Date: Sat, 28 Sep 2013 11:55:14 -0400 Subject: [Biopython-dev] Newick Parser In-Reply-To: References: Message-ID: > You mean a Unicode byte order mark (BOM)? Yep. > Does it even make sense to allow non-ASCII in Newick format? I think that's a matter of opinion. The specs I found discussed how to parse the string, but not how to encode the string. The advantages I can see are allowing people to use the extended characters for node/tip label names, and being robust if different text-editors/programs muck with the files (which I would suspect are usually ASCII). The disadvantage is that it's another case to handle in code, so could just be ignored or throw an exception. Not sure what the preferred choice for biopython would be.
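[Editorial sketch, not part of the original thread.] The "ignore or throw an exception" choice discussed above depends on first spotting a byte order mark. A minimal way to do that on the raw bytes of a file is sketched below. The helper name starts_with_bom and the tuple BOMS are hypothetical and are not part of Bio.Phylo; the BOM constants come from the standard library codecs module.

```python
import codecs

# Hypothetical helper for illustration only; not Biopython's API.
# BOM constants are provided by the standard library codecs module.
BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
)


def starts_with_bom(data):
    """Return True if the given bytes begin with a known Unicode BOM."""
    # startswith accepts a tuple of prefixes, so one call covers every BOM.
    return data.startswith(BOMS)


# Usage: a plain ASCII tree versus the same tree saved with a UTF-8 BOM.
plain = b"((A,B),C);"
with_bom = codecs.BOM_UTF8 + plain
print(starts_with_bom(plain), starts_with_bom(with_bom))
```

A parser could use such a check either to skip the BOM or to raise a clear error; which of the two is preferable is exactly the open question in the thread.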
From p.j.a.cock at googlemail.com Sat Sep 28 17:28:24 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 28 Sep 2013 18:28:24 +0100 Subject: [Biopython-dev] Newick Parser In-Reply-To: References: Message-ID: On Sat, Sep 28, 2013 at 4:55 PM, Nigel Delaney wrote: > >> >> Does it even make sense to allow non-ASCII in Newick format? >> > > I think that's a matter of opinion. The specs I found > discussed how to parse the string, but not how to > encode the string. Right, and they probably all pre-date unicode and are implicitly ASCII only. > The advantages I can see are allowing people to use the > extended characters for node/tip label names, and being > robust if different text-editors/programs muck with the files > (which I would suspect are usually ASCII). Yep. > The disadvantage is that it's another case to handle in code, so could just > be ignored or throw an exception. > > Not sure what the preferred choice for biopython would be. If you'd like to work on this it sounds useful - but you'll have to be extra careful about testing under both Python 2 and Python 3 due to the joys of unicode. Peter From nigel.delaney at outlook.com Sat Sep 28 17:52:03 2013 From: nigel.delaney at outlook.com (Nigel Delaney) Date: Sat, 28 Sep 2013 13:52:03 -0400 Subject: [Biopython-dev] Newick Parser In-Reply-To: References: Message-ID: Hi Peter, I think handling the joys of Unicode might be a bit more trouble than it's worth given how few of the files are probably Unicode, and I think most bioinformatics is still done in standard ASCII English anyway. I just submitted pull request 241. It throws an error when BOMs are detected (right now it says the number of "(" does not equal the number of ")" which is super confusing). This way the user can just convert the file on their end.
All the best, N -----Original Message----- From: Peter Cock [mailto:p.j.a.cock at googlemail.com] Sent: Saturday, September 28, 2013 1:28 PM To: Nigel Delaney Cc: Biopython-Dev Mailing List Subject: Re: [Biopython-dev] Newick Parser On Sat, Sep 28, 2013 at 4:55 PM, Nigel Delaney wrote: > >> >> Does it even make sense to allow non-ASCII in Newick format? >> > > I think that's a matter of opinion. The specs I found discussed how > to parse the string, but not how to encode the string. Right, and they probably all pre-date unicode and are implicitly ASCII only. > The advantages I can see are allowing people to use the extended > characters for node/tip label names, and being robust if different > text-editors/programs muck with the files (which I would suspect are > usually ASCII). Yep. > The disadvantage is that it's another case to handle in code, so could > just be ignored or throw an exception. > > Not sure what the preferred choice for biopython would be. If you'd like to work on this it sounds useful - but you'll have to be extra careful about testing under both Python 2 and Python 3 due to the joys of unicode. Peter From p.j.a.cock at googlemail.com Sat Sep 28 18:18:57 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 28 Sep 2013 19:18:57 +0100 Subject: [Biopython-dev] Newick Parser In-Reply-To: References: Message-ID: On Sat, Sep 28, 2013 at 6:52 PM, Nigel Delaney wrote: > Hi Peter, > > I think handling the joys of Unicode might be a bit more trouble than it's > worth given how few of the files are probably Unicode, and I think most > bioinformatics is still done in standard ASCII English anyway. > > I just submitted pull request 241. It throws an error when BOMs are > detected (right now it says the number of "(" does not equal the number of > ")" which is super confusing). This way the user can just convert the file > on their end.
Thanks - I've replied with what is intended as constructive feedback: https://github.com/biopython/biopython/pull/241 The startswith method is more powerful than many people realise ;) Peter From p.j.a.cock at googlemail.com Sun Sep 29 12:30:10 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 29 Sep 2013 13:30:10 +0100 Subject: [Biopython-dev] Bio.SVDSuperimposer cleanup to remove nested module name Message-ID: Hi all, Could someone have a look at this proposed change to the Bio.SVDSuperimposer module (used in Bio.PDB) please: https://github.com/biopython/biopython/pull/242 Thanks, Peter From p.j.a.cock at googlemail.com Sun Sep 29 12:39:24 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 29 Sep 2013 13:39:24 +0100 Subject: [Biopython-dev] No longer testing under Python 3.1 Message-ID: Hi all, In line with past discussion, we're not officially supporting Python 3.0, 3.1 or 3.2 - just 3.3 onwards. Until recently the Buildbot has been covering Python 3.1 and 3.2, but as of this commit I have dropped Python 3.1 from the test matrix: https://github.com/biopython/biopython/commit/de71aadb8c603a6cd30b563fe7bc44d56b98d506 http://testing.open-bio.org/biopython/tgrid For now everything seems to work under Python 3.2 (testing under TravisCI and the buildbot) which may be useful as PyPy3 currently targets Python 3.2 rather than 3.3. Peter From p.j.a.cock at googlemail.com Sun Sep 29 23:22:52 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 30 Sep 2013 00:22:52 +0100 Subject: [Biopython-dev] Python 2 and 3 migration thoughts In-Reply-To: References: Message-ID: On Sun, Sep 8, 2013 at 9:52 PM, Peter Cock wrote: > On Sat, Sep 7, 2013 at 8:17 PM, Eric Talevich wrote: >>> > >>> > # This file targets both Python 2 and Python 3 at the same time >>> > # TODO - Targets Python 2 only (use 2to3 to run under Python 3) >>> > >>> >>> Does the special comment line seem like a good solution?
>>> On the plus side, it tracks any changes with the file being >>> updated (which wouldn't happen with a list in the do2to3.py >>> file). >> >> Hi Peter, >> >> This looks like a good way to move forward overall. Regarding the special >> comment lines -- since these are only used in do2to3.py, would it be >> cleaner to keep a hard-coded list of filenames in do2to3.py and leave the >> modules and scripts alone? Are there any characteristics that would make it >> difficult to determine whether a given module or script is Py3-compliant? > > Hi Eric, > > There are import time problems which are easy to spot - in particular > SyntaxError is a good clue. However, many of the issues are only > really found at run time (e.g. different method names). This means > that the tests (which I started with) are actually the easiest to check. > > Right now I don't have a feel for what fraction of the main Bio/* and > BioSQL/* files can be made dual-coding, and that would have an > influence on how best to tag things needing 2to3 or not. I'm happy > to continue this on branches for a while longer and find out. Assuming my methodology isn't flawed, we're about half way in terms of getting every file in Biopython to be dual Python 2 and Python 3 code: 262 no change, 290 need fixers Troublesome ones at 52.5% This is based on there being a difference between the pre- and post-2to3 conversion (discounting removal of future imports). This is an overestimate as the 2to3 script often makes unnecessary changes. This is after applying a *lot* of little changes to our codebase, things like removing unneeded use of my_dict.keys() which the 2to3 fixers are overcautious in wrapping as list(my_dict.keys()) - I would like to do a beta before the next release.
> I do like the idea of a special #TODO comment line where 2to3 > is still needed - it is symbolic of where I want the code base to go ;) That's what is going on in this revised branch - if the special #TODO comment is there, then 2to3 is used, otherwise we assume the file is already OK to use under Python 3: https://github.com/peterjc/biopython/tree/mark2to3 This is now quicker to install under Python 3, but there is plenty of scope for speed optimisation (e.g. requiring the magic marker to be in the first (say) 20 lines of the file, and expanding the magic marker to list the specific 2to3 fixers required and running just those). Regards, Peter From p.j.a.cock at googlemail.com Mon Sep 30 16:18:21 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 30 Sep 2013 17:18:21 +0100 Subject: [Biopython-dev] Python 2 and 3 migration thoughts In-Reply-To: References: Message-ID: On Mon, Sep 30, 2013 at 12:22 AM, Peter Cock wrote: > Assuming my methodology isn't flawed, we're about half way > in terms of getting every file in Biopython to be dual Python 2 > and Python 3 code: > > 262 no change, 290 need fixers > Troublesome ones at 52.5% New numbers with Bio._py3k.urllib changes which should have dropped the number of troublesome files by at most 13 files: 374 no change, 177 need fixers Troublesome ones 32.1% I think my markup script is a bit fragile in terms of the exact sequence of steps with do2to3.py etc. But much better numbers than Sunday night :) Revised branch here: https://github.com/peterjc/biopython/tree/mark2to3a https://github.com/peterjc/biopython/commit/14f9ff121532ff92ec7bacc1867bdd058a6e8f74 Build and test times on the master vs this branch are looking a lot better for Python 3 (although the numbers for different TravisCI runs are not directly comparable), and there is still a lot of room for improvement: master: https://travis-ci.org/biopython/biopython/builds/11965000 branch: https://travis-ci.org/peterjc/biopython/builds/11968132 So that's good.
However, are these urllib import fixes an acceptable way forwards? Included in the above branch and here: https://github.com/peterjc/biopython/tree/urllib https://github.com/peterjc/biopython/commit/1305387a5d98a5f3c7b83ca3db580b9e63dba851 Thanks, Peter
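[Editorial sketch, not part of the original thread.] For readers unfamiliar with the kind of "urllib import fix" being discussed, the general pattern is a small compatibility layer that imports the same names from wherever they live on the running Python version, so that calling code needs no 2to3 fixer. The sketch below shows that generic pattern only; it is an assumption about the approach and is not the actual content of Bio._py3k in the linked branch.

```python
# Generic Python 2/3 compatibility pattern, shown for illustration only.
# This is NOT the actual code from the branch linked in the message above.
try:
    # Python 3: the old urllib/urllib2 functions were reorganised into
    # the urllib.request and urllib.parse submodules.
    from urllib.request import urlopen
    from urllib.parse import urlencode
except ImportError:
    # Python 2: the same functions live in urllib2 and urllib.
    from urllib2 import urlopen
    from urllib import urlencode

# Calling code then uses the names without caring about the Python version.
query = urlencode({"db": "pubmed"})
print(query)
```

Centralising these imports in one module means each file that fetches a URL only needs a single import line that works unchanged on both Python versions, which is consistent with the drop in files needing fixers reported above.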