From saketkc at gmail.com Sun Feb 2 23:22:42 2014 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 3 Feb 2014 09:52:42 +0530 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References:

Message-ID: On 31 January 2014 16:25, Peter Cock wrote: > On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich wrote: >> Hi folks, >> >> Google Summer of Code is on again for 2014, and the Open Bioinformatics >> Foundation (OBF) is once again applying as a mentoring organization. >> Participating in GSoC as an organization is very competitive, and we will >> need your help in gathering a good set of ideas and potential mentors for >> Biopython's role in GSoC this year. >> >> If you have an idea for a Summer of Code project, please post your idea >> here on the Biopython mailing list for discussion and start an outline on >> this wiki page: >> http://biopython.org/wiki/Google_Summer_of_Code >> >> We also welcome ideas that fit with OBF's mission but are not part of a >> single Bio* project, or span multiple projects -- these ideas can be posted >> on the OBF wiki and discussed on the OBF mailing list: >> http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas >> http://lists.open-bio.org/mailman/listinfo/open-bio-l >> >> Here's to another fun and productive Summer of Code! >> >> Cheers, >> Eric & Raoul > > Thanks Eric & Raoul, > > Remember that the ideas don't have to come from potential mentors - > if as a student there is something you'd particularly like to work on > please ask, and perhaps we can find a suitable (Biopython) mentor. > > Regards, > > Peter I would like to propose a QC module for NGS & Microarray data. Essentially a fastQC[1] and limma[2], respectively ported to Biopython. 
[1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [2] http://bioconductor.org/packages/devel/bioc/html/limma.html Saket > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Mon Feb 3 07:31:37 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 12:31:37 +0000 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References:

Message-ID: On Mon, Feb 3, 2014 at 4:22 AM, Saket Choudhary wrote: > > I would like to propose a QC module for NGS & Microarray data. > Essentially a fastQC[1] and limma[2], respectively ported to > Biopython. > > [1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ > [2] http://bioconductor.org/packages/devel/bioc/html/limma.html Hi Saket, What did you have in mind for 'porting' fastQC? Recreating it in Python alone doesn't seem like a sensible use of time & effort. Are there particular functions etc you think make sense to have available as a library of code? For limma, the linear model side would fall nicely under SciPy, eg http://scikit-learn.org/stable/modules/linear_model.html However, Biopython's existing microarray support could do with some love. Peter From saketkc at gmail.com Mon Feb 3 12:37:41 2014 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 3 Feb 2014 17:37:41 +0000 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References:

Message-ID: Hi Peter, My idea was to have a QC/preprocessing module inside Biopython, which could then be integrated with the rest of the NGS tools wrappers. Though you are right, these functionalities as such are already part of fastQC and replicating might not be a good idea. As for limma, I had these things in mind: 1. Correct me if I am wrong, but Biopython only supports Affymetrix data, right? My idea was to build parsers for Genepix, Agilent etc 2. Add other methods for in/between array normalisation, MA, volcano plots Yes, it is like reinventing the wheel, but I have been thinking of porting this to python myself, this might not be good from the point of view of a GSoC project however. Saket On 3 February 2014 12:31, Peter Cock wrote: > On Mon, Feb 3, 2014 at 4:22 AM, Saket Choudhary wrote: >> >> I would like to propose a QC module for NGS & Microarray data. >> Essentially a fastQC[1] and limma[2], respectively ported to >> Biopython. >> >> [1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ >> [2] http://bioconductor.org/packages/devel/bioc/html/limma.html > > Hi Saket, > > What did you have in mind for 'porting' fastQC? Recreating it in > Python alone doesn't seem like a sensible use of time & effort. > Are there particular functions etc you think make sense to have > available as a library of code? > > For limma, the linear model side would fall nicely under SciPy, > eg http://scikit-learn.org/stable/modules/linear_model.html > However, Biopython's existing microarray support could do > with some love. > > Peter From p.j.a.cock at googlemail.com Mon Feb 3 12:49:25 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 17:49:25 +0000 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References:

Message-ID: On Mon, Feb 3, 2014 at 5:37 PM, Saket Choudhary wrote: > > As for limma, I had these things in mind: > 1. Correct me if I am wrong, but Biopython only supports Affymetrix > data, right? My idea was to build parsers for Genepix, Agilent etc And GEO data somewhat (which has been processed by the NCBI into yet another format). > 2. Add other methods for in/between array normalisation, MA, > volcano plots Much of the core statistics and plotting can probably build on scipy and something like matplotlib? > Yes, it is like reinventing the wheel, but I have been thinking of > porting this to python myself, this might not be good from the point > of view of a GSoC project however. This sounds like it *could* be the basis of a possible GSoC (provided suitable mentor(s) are available). Are you still eligible as a student? Regards, Peter From saketkc at gmail.com Mon Feb 3 12:52:43 2014 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 3 Feb 2014 17:52:43 +0000 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References:

Message-ID: On 3 February 2014 17:49, Peter Cock wrote: > On Mon, Feb 3, 2014 at 5:37 PM, Saket Choudhary wrote: >> >> As for limma, I had these things in mind: >> 1. Correct me if I am wrong, but Biopython only supports Affymetrix >> data, right? My idea was to build parsers for Genepix, Agilent etc > > And GEO data somewhat (which has been processed by the NCBI > into yet another format). > >> 2. Add other methods for in/between array normalisation, MA, >> volcano plots > > Much of the core statistics and plotting can probably build on > scipy and something like matplotlib? > Yes, that's what I was heading to. >> Yes, it is like reinventing the wheel, but I have been thinking of >> porting this to python myself, this might not be good from the point >> of view of a GSoC project however. > > This sounds like it *could* be the basis of a possible GSoC (provided > suitable mentor(s) are available). Are you still eligible as a student? > Yes, this is my final year of integrated bachelors+masters program :( || :) Saket > Regards, > > Peter From bylin at ucsc.edu Tue Feb 4 12:44:47 2014 From: bylin at ucsc.edu (Brian Lin) Date: Tue, 4 Feb 2014 09:44:47 -0800 Subject: [Biopython-dev] SeqRecord comparison suggestion Message-ID: Hi everyone, In the past I have spent hours debugging my code because I expected different SeqRecord objects to evaluate as equal if all their attributes (id, seq, etc) were the same. Unfortunately the "==" operator only compares the address in memory. Is there a reason (e.g. some tradeoff, design elements, etc) that this behavior is tolerated? 
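A minimal sketch of the behaviour described above, using a hypothetical stand-in class rather than Bio.SeqRecord itself: a class that defines no __eq__ falls back to Python's default identity comparison, so two records with identical attributes still compare unequal. The Record class, the records_equal helper and its attrs parameter are assumptions made for illustration and are not part of Biopython.

```python
class Record(object):
    """Hypothetical stand-in for a SeqRecord-like object (not Bio.SeqRecord)."""

    def __init__(self, id, seq, description=""):
        self.id = id
        self.seq = seq
        self.description = description


def records_equal(a, b, attrs=("id", "seq", "description")):
    """Explicitly compare two records on a chosen set of attributes."""
    return all(getattr(a, name) == getattr(b, name) for name in attrs)


r1 = Record("seq1", "ACGT")
r2 = Record("seq1", "ACGT")

# Default object equality: are these the same object in memory?
identity_equal = (r1 == r2)

# Explicit comparison on all listed attributes.
explicit_equal = records_equal(r1, r2)

# The caller decides what "equal" means, here: same id only.
id_only_equal = records_equal(r1, Record("seq1", "TTTT"), attrs=("id",))
```

Making the comparison an explicit function keeps the question "equal in which attributes?" visible at the call site instead of hiding it behind ==.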
If not, I'd like to submit a pull request - I overloaded the __eq__ and __ne__ operators and wrote a quick test for it, then pushed it to: https://github.com/bylin/biopython Best, Brian Brian Lin | bylin at ucsc.edu | Brian's LinkedIn B.S., Genetics and Computer Science, University of California at Davis Ph.D Program in Bioinformatics, University of California at Santa Cruz From idoerg at gmail.com Tue Feb 4 12:57:49 2014 From: idoerg at gmail.com (Iddo Friedberg) Date: Tue, 4 Feb 2014 12:57:49 -0500 Subject: [Biopython-dev] SeqRecord comparison suggestion In-Reply-To: References: Message-ID: Thanks! My initial thoughts are that seqrecord instances should not have an __eq__ operator. The equality operation here is somewhat meaningless when you consider the number of parameters that can constitute a seqrecord, especially when dealing with a genomic record or a contig. This can lead to unexpected behavior. That being said, it may be a good idea to allow for a function that performs a comprehensive comparison using all attributes. Specifically to answer your question: I don't think the address comparison is by design. It's a Python feature. My $0.02 Iddo Friedberg http://iddo-friedberg.net/contact.html Sent from a device that promotes typos On Feb 4, 2014 12:46 PM, "Brian Lin" wrote: > Hi everyone, > > In the past I have spent hours debugging my code because I expected > different SeqRecord objects to evaluate as equal if all their attributes > (id, seq, etc) were the same. Unfortunately the "==" operator only compares > the address in memory. > > Is there a reason (e.g. some tradeoff, design elements, etc) that this > behavior is tolerated? 
If not, I'd like to submit a pull request - I > overloaded the __eq__ and __ne__ operators and wrote a quick test for it, > then pushed it to: > > https://github.com/bylin/biopython > > Best, > Brian > > Brian Lin | bylin at ucsc.edu | Brian's > LinkedIn > B.S., Genetics and Computer Science, University of California at Davis > Ph.D Program in Bioinformatics, University of California at Santa Cruz > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From arklenna at gmail.com Tue Feb 4 13:54:34 2014 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 4 Feb 2014 13:54:34 -0500 Subject: [Biopython-dev] SeqRecord comparison suggestion In-Reply-To: References:

Message-ID: On Tue, Feb 4, 2014 at 12:57 PM, Iddo Friedberg wrote: > Thanks! > > My initial thoughts are that seqrecord instances should not have an __eq__ > operator. The equality operation here is somewhat meaningless when you > consider the number of parameters that can constitute a seqrecord, > especially when dealing with a genomic record or a contig. This can lead > to unexpected behavior. > That being said, it may be a good idea to allow for a function that > performs a comprehensive comparison using all attributes. > I agree that an explicit comparison method would be less error-prone than ==. This method could even allow the user to specify which attributes must be identical for equality. > Specifically to answer your question: I don't think the address comparison > is by design. It's a Python feature. > __eq__ and __ne__ could instead be defined to raise NotImplementedError to prevent future users from experiencing the same problems and direct them to the explicit comparison method. Cheers, Lenna From p.j.a.cock at googlemail.com Tue Feb 4 15:56:12 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 4 Feb 2014 20:56:12 +0000 Subject: [Biopython-dev] SeqRecord comparison suggestion In-Reply-To: References:

Message-ID: On Tuesday, February 4, 2014, Lenna Peterson wrote: > On Tue, Feb 4, 2014 at 12:57 PM, Iddo Friedberg > > wrote: > > > Thanks! > > > > My initial thoughts are that seqrecord instances should not have an > __eq__ > > operator. The equality operation here is somewhat meaningless when you > > consider the number of parameters that can constitute a seqrecord, > > especially when dealing with a genomic record or a contig. This can lead > > to unexpected behavior. > > > > That being said, it may be a good idea to allow for a function that > > performs a comprehensive comparison using all attributes. > > > > I agree that an explicit comparison method would be less error-prone than > ==. This method could even allow the user to specify which attributes must > be identical for equality. > > > > Specifically to answer your question: I don't think the address > comparison > > is by design. It's a Python feature. > > > > __eq__ and __ne__ could instead be defined to raise NotImplementedError to > prevent future users from experiencing the same problems and direct them to > the explicit comparison method. > > Cheers, > > Lenna > We should probably switch the current FutureWarning to a noisy BiopythonWarning ... because the current warning is (almost) silent. I think this was silenced in a recentish Python release, from memory it used to give a clear warning to the user :( Peter From p.j.a.cock at googlemail.com Wed Feb 5 13:15:08 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Feb 2014 18:15:08 +0000 Subject: [Biopython-dev] SeqRecord comparison suggestion In-Reply-To: References:

Message-ID:

On Tue, Feb 4, 2014 at 8:56 PM, Peter Cock wrote:
>
> We should probably switch the current FutureWarning to a noisy
> BiopythonWarning ... because the current warning is (almost) silent. I think
> this was silenced in a recentish Python release, from memory it used to give
> a clear warning to the user :(

Correction - I should have double checked this before writing the email,
the FutureWarning from Seq comparison works fine:

>>> from Bio.Seq import Seq
>>> Seq("ACGT") == "ACGT"
False
>>> Seq("ACGT") == Seq("ACGT")
/Library/Python/2.7/site-packages/Bio/Seq.py:179: FutureWarning: In future comparing Seq objects will use string comparison (not object comparison). Incompatible alphabets will trigger a warning (not an exception). In the interim please use id(seq1)==id(seq2) or str(seq1)==str(seq2) to make your code explicit and to avoid this warning.

(It may be time to actually try switching Seq equality to be string like...)

On Tue, Feb 4, 2014 at 12:57 PM, Iddo Friedberg wrote:
> Thanks!
>
> My initial thoughts are that seqrecord instances should not have an __eq__
> operator. The equality operation here is somewhat meaningless when you
> consider the number of parameters that can constitute a seqrecord,
> especially when dealing with a genomic record or a contig. This can lead
> to unexpected behaviour.

Indeed, which is one reason why we never defined __eq__ etc for the
SeqRecord (how equal is equal? Same ID? Same sequence? Same annotations?).
Therefore the SeqRecord gets the default Python object equality, which is:
are they the same object in memory?

Peter

From p.j.a.cock at googlemail.com Sat Feb 8 08:01:52 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 8 Feb 2014 13:01:52 +0000
Subject: [Biopython-dev] Fwd: [biopython] SubsMat.MatrixInfo update (#282)
In-Reply-To:
References:
Message-ID:

Who'd like to take a look at this offer of code for substitution matrices?
---------- Forwarded message ---------- From: biologyguy Date: Fri, Feb 7, 2014 at 10:00 PM Subject: [biopython] SubsMat.MatrixInfo update (#282) To: biopython/biopython I've updated Bio.SubsMat.MatrixInfo.py with a new substitution matrix (PHAT) and written a little function to output the matrices in a more useable format. I've been using the new version for months, and figured I should polish it up and submit it to be included in the official package. Anyone interested in taking a look? -Steve -- Reply to this email directly or view it on GitHub . From anubhavmaity7 at gmail.com Sun Feb 9 10:05:23 2014 From: anubhavmaity7 at gmail.com (Anubhav Maity) Date: Sun, 9 Feb 2014 20:35:23 +0530 Subject: [Biopython-dev] Fwd: [GSoC] Want to contribute to open-bio for GSOC 2014 In-Reply-To: References:

Message-ID:

Hi,

Thank you, Peter, for your reply.
I have set up my GitHub account and have forked the source code.
I have built and installed Biopython after reading the README file in the
GitHub repository.
I want to contribute code to Biopython. I would like some suggestions on
where to start.
Waiting for your reply.

Thanks and Regards,
Anubhav

---------- Forwarded message ----------
From: Peter Cock
Date: Sat, Feb 8, 2014 at 6:28 PM
Subject: Re: [GSoC] Want to contribute to open-bio for GSOC 2014
To: Anubhav Maity
Cc: OBF GSoC

On Fri, Feb 7, 2014 at 10:33 PM, Anubhav Maity wrote:
> Hi,
>
> I am a BTech student from an Indian university and want to contribute code
> for open-bio for GSOC 2014.
> I love to code and can code in python. I have studied biology in high
> school and have taken biotechnology during my college study.
> I have looked on the projects of biopython i.e Codon alignment and
> analysis, Bio.Phylo: filling in the gaps and Indexing & Lazy-loading
> Sequence Parsers. All the projects are very interesting. I want to
> contribute in one of these projects, please help me in getting started.
> Waiting for your positive reply.
>
> Thanks and Regards,
> Anubhav

Hi Anubhav,

Please sign up to the biopython and biopython-dev mailing lists and
introduce yourself there too. You will also need a GitHub account to
contribute to Biopython development - so you might want to set that up
now as well:

http://lists.open-bio.org/mailman/listinfo/biopython
http://lists.open-bio.org/mailman/listinfo/biopython-dev
https://github.com/biopython/biopython

Regards,

Peter

From saketkc at gmail.com Sun Feb 9 18:51:25 2014
From: saketkc at gmail.com (Saket Choudhary)
Date: Sun, 9 Feb 2014 23:51:25 +0000
Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas
In-Reply-To:
References:

Message-ID: On 3 February 2014 17:52, Saket Choudhary wrote: > On 3 February 2014 17:49, Peter Cock wrote: >> On Mon, Feb 3, 2014 at 5:37 PM, Saket Choudhary wrote: >>> >>> As for limma, I had these things in mind: >>> 1. Correct me if I am wrong, but Biopython only supports Affymetrix >>> data, right? My idea was to build parsers for Genepix, Agilent etc >> >> And GEO data somewhat (which has been processed by the NCBI >> into yet another format). >> >>> 2. Add other methods for in/between array normalisation, MA, >>> volcano plots >> >> Much of the core statistics and plotting can probably build on >> scipy and something like matplotlib? >> > Yes, that's what I was heading to. > >>> Yes, it is like reinventing the wheel, but I have been thinking of >>> porting this to python myself, this might not be good from the point >>> of view of a GSoC project however. >> >> This sounds like it *could* be the basis of a possible GSoC (provided >> suitable mentor(s) are available). Are you still eligible as a student? >> > > > Yes, this is my final year of integrated bachelors+masters program :( || :) > > Saket >> Regards, >> >> Peter Would anyone be interested in mentoring this? Saket From p.j.a.cock at googlemail.com Fri Feb 14 04:43:42 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Feb 2014 09:43:42 +0000 Subject: [Biopython-dev] Fwd: [Open-bio-l] OBF GSoC 2014: Last call for project ideas and mentors In-Reply-To: References: Message-ID: Potential GSoC mentors, please think about this urgently! (This isn't the deadline for proposing project ideas, if we do get to take part in GSoC - but having more solid ideas on the webpage now will help with getting accepted as a GSoC organisation). 
Thanks,

Peter

---------- Forwarded message ----------
From: Eric Talevich
Date: Thu, Feb 13, 2014 at 9:21 PM
Subject: [Open-bio-l] OBF GSoC 2014: Last call for project ideas and mentors
To: open-bio-l at lists.open-bio.org

Folks,

The Google Summer of Code organization applications are due tomorrow. The
core of our application to Google is our list of project ideas, listed here:
http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas

The more ideas and designated mentors we have, the better our chance to get
funded students to work on these projects. So: If you have an idea, please
do post it to the project wiki today. Or, if you are willing to serve as a
mentor but do not have a specific project idea in mind, let us know.

Thanks!
Eric & Raoul
OBF GSoC 2014 admins
_______________________________________________
Open-Bio-l mailing list
Open-Bio-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/open-bio-l

From mok at bioxray.dk Mon Feb 17 20:02:44 2014
From: mok at bioxray.dk (Morten Kjeldgaard)
Date: Tue, 18 Feb 2014 02:02:44 +0100
Subject: [Biopython-dev] Bug in Bio.PDB?
Message-ID: <96CEC8D5-7D42-42CC-A9B8-32699C386AB0@bioxray.dk>

Hi,

I need to remove HETATMS and water molecules from a PDB file, so I was
playing around with the following code snippet (from [1]):

    for model in structure:
        for chain in model:
            for residue in chain:
                id = residue.id
                if id[0] != ' ':
                    chain.detach_child(id)

Unfortunately, it does not work correctly. If 2 HETATM residues are found
after each other, the second one is skipped. I assume the reason is that
model, chain, etc. really are generators, and they get screwed up if you
mess around with the datastructure while the loop is running.
Here is a bit of debugging output from a run with print statements
scattered appropriately in the above code:

*** D *** (' ', 13, ' ')
*** D *** (' ', 14, ' ')
*** D *** (' ', 15, ' ')
*** D *** ('H_H2U', 16, ' ')
removing ('H_H2U', 16, ' ') from chain D
*** D *** (' ', 18, ' ')
*** D *** (' ', 19, ' ')
*** D *** (' ', 20, ' ')

As you see, residue 16 is correctly identified as a HETATM residue,
however, the following residue 17 is skipped (it is also a H2U residue)
and so it is NOT removed from the structure.

The way to make the loop work is to squirrel away a list of HETATM
residues and detach them from the chain when the loop is finished.
(Another way is to keep running the snippet until no HETATMs are left.)

I am not sure whether to characterize this as a bug or a "feature", but it
is confusing and defeats the intuitive understanding of how the SMCRA
hierarchy objects ought to work.

(I am using Biopython version 1.59)

Cheers,
Morten

[1] http://pelican.rsvs.ulaval.ca/mediawiki/index.php/Manipulating_PDB_files_using_BioPython

--
Morten Kjeldgaard, asc. professor, MSc, PhD
Dept. of Molecular Biology and Genetics, Aarhus University
Gustav Wieds Vej 10C, Building 3135, DK-8000 Aarhus C, Denmark.

From anaryin at gmail.com Mon Feb 17 21:02:04 2014
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 18 Feb 2014 03:02:04 +0100
Subject: [Biopython-dev] Bug in Bio.PDB?
In-Reply-To: <96CEC8D5-7D42-42CC-A9B8-32699C386AB0@bioxray.dk>
References: <96CEC8D5-7D42-42CC-A9B8-32699C386AB0@bioxray.dk>
Message-ID:

Hi Morten,

This is not a "bug" per se, but a matter of changing the chain object
while you are iterating over it. You get the same effect with this loop:

>>> nums = [1,2,3,4,5,6,7,8,9,11,13,15]
>>> for i in nums:
...     if i % 2 == 0:
...         nums.remove(i)
...     print i,
...
1 2 4 6 8 11 13 15
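
The skipping described in this thread can be reproduced without Biopython. The sketch below uses a hypothetical Chain stand-in (a plain list of residue ids plus a detach_child method, assumed here to mimic a container that iterates directly over its own child list); it is not Bio.PDB's Chain class. It contrasts detaching during iteration with iterating over a copy of the children.

```python
class Chain(object):
    """Hypothetical stand-in for a chain: a child list plus detach_child()."""

    def __init__(self, residue_ids):
        self.child_list = list(residue_ids)

    def __iter__(self):
        # Iterates over the live list, so removals shift later items.
        return iter(self.child_list)

    def detach_child(self, res_id):
        self.child_list.remove(res_id)


def make_chain():
    # Residue ids as (hetero flag, sequence number, insertion code) tuples.
    return Chain([
        (" ", 15, " "),
        ("H_H2U", 16, " "),
        ("H_H2U", 17, " "),
        (" ", 18, " "),
    ])


# Detaching while iterating: the second consecutive hetero residue is skipped.
buggy = make_chain()
for res_id in buggy:
    if res_id[0] != " ":
        buggy.detach_child(res_id)

# Iterating over a copy: every hetero residue is visited and detached.
fixed = make_chain()
for res_id in list(fixed):
    if res_id[0] != " ":
        fixed.detach_child(res_id)

buggy_left = [r[1] for r in buggy]
fixed_left = [r[1] for r in fixed]
```

Collecting the ids first and detaching them after the loop has the same effect as iterating over the copy.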
The only options you have are to use chain.child_list (which does create a
copy of the list and it's safe to iterate on) or just save the ids to
remove and remove them a posteriori.

Cheers,

João

From anaryin at gmail.com Mon Feb 17 21:02:57 2014
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 18 Feb 2014 03:02:57 +0100
Subject: [Biopython-dev] Bug in Bio.PDB?
In-Reply-To:
References: <96CEC8D5-7D42-42CC-A9B8-32699C386AB0@bioxray.dk>
Message-ID:

Also, in addition, if you just want to save the PDB without these HETATMs,
use the Select class when saving with PDBIO.save.

2014-02-18 3:02 GMT+01:00 João Rodrigues :
> Hi Morten,
>
> This is not a "bug" per se, but a matter of changing the chain object
> while you are iterating over it. You get the same effect with this loop:
>
> >>> nums = [1,2,3,4,5,6,7,8,9,11,13,15]
> >>> for i in nums:
> ...     if i % 2 == 0:
> ...         nums.remove(i)
> ...     print i,
> ...
> 1 2 4 6 8 11 13 15
>
> The only options you have is to use chain.child_list (which does create a
> copy of the list and it's safe to iterate on) or just save the ids to
> remove and remove them a posteriori.
>
> Cheers,
>
> João
>

From p.j.a.cock at googlemail.com Tue Feb 18 09:17:48 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 18 Feb 2014 14:17:48 +0000
Subject: [Biopython-dev] Galaxy Tool Shed packages for Biopython
In-Reply-To:
References:
Message-ID:

On Fri, Sep 13, 2013 at 9:54 AM, Peter Cock wrote:
> Hi all,
>
> I've sent this to both the Galaxy and Biopython developers lists,
> and hope this will make sense to both groups. If you've not heard
> of Galaxy, start here: http://galaxyproject.org - while the easy to
> guess Biopython website is at http://biopython.org
>
> Brad Chapman and I are both Biopython core developers, and
> are also both on the "IUC" Galaxy Tool Shed committee because
> we've been quite involved in wrapping and writing tools for use
> on Galaxy.
>
> Fellow committee member Björn Grüning has done a
> lot of the hands on work defining package definitions for
> dependencies within the Galaxy Tool Shed ecosystem -
> including defining them for Biopython, NumPy, SciPy,
> MatPlotLib, etc. We're very grateful for his hard work -
> most of which is now available under the IUC group
> account:
>
> http://toolshed.g2.bx.psu.edu/view/iuc/
> http://testtoolshed.g2.bx.psu.edu/view/iuc/
>
> The Biopython packages, however, are under a dedicated
> "biopython" account on the Galaxy Tool Shed to which
> currently Bjoern, Brad and I have access to:
>
> http://toolshed.g2.bx.psu.edu/view/biopython/
> http://testtoolshed.g2.bx.psu.edu/view/biopython/
>
> This packaging work was initially tracked in Bjoern's own GitHub
> repository, https://github.com/bgruening/galaxytools/
>
> We (me, Brad and Bjoern) agreed that a Biopython owned
> repository would be more sensible in the long term, so I have
> created this and ported Bjoern's commits to it:
> https://github.com/biopython/galaxy_packages
>
> Currently the "Galaxy packagers" team on GitHub which
> has read and write access to this new repository is just
> me, Brad and Bjoern.
>
> Regards,
>
> Peter

Earlier today I belatedly setup Galaxy packages for Biopython 1.63,
for both the main and test Galaxy ToolSheds:

http://toolshed.g2.bx.psu.edu/view/biopython/
http://testtoolshed.g2.bx.psu.edu/view/biopython/

This could perhaps become part of the Biopython release process in future?

Peter

From p.j.a.cock at googlemail.com Tue Feb 18 10:01:54 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 18 Feb 2014 15:01:54 +0000
Subject: [Biopython-dev] [galaxy-dev] Galaxy Tool Shed packages for Biopython
In-Reply-To: <53036D4E.5000208@gmail.com>
References: <53036D4E.5000208@gmail.com>
Message-ID:

On Tue, Feb 18, 2014 at 2:25 PM, Björn Grüning wrote:
> Thanks Peter!
>
> +1 for including that into the Biopython release process
>
> Cheers,
> Bjoern

How about we add step N+1 to the instructions,
http://biopython.org/wiki/Building_a_release

"Ask Peter, Brad, or Bjoern to prepare a new Galaxy package on
https://github.com/biopython/galaxy_packages and upload it to
the main and test Galaxy ToolShed."

Regards,

Peter

From anaryin at gmail.com Wed Feb 19 09:54:13 2014
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 19 Feb 2014 15:54:13 +0100
Subject: [Biopython-dev] Future of Bio.PDB
Message-ID:

From another thread:

> As for what I suggested. Since my GSOC period, already 4 years ago.., I
> noticed that the PDB module is a bit messy in terms of organization. The
> module itself is named after the databank, which can be confused with the
> format name, the mmcif parser is defined inside a subfolder and there
> are application wrappers there too (DSSP, NACCESS). Besides this issue,
> which is not an issue at all and just my own pet peeve, there is a lot that
> the entire module could gain from a thorough revision. I've been using it
> very often and some normal manipulations of structures are not
> straightforward to carry out (calculating a center of mass for example,
> removing double occupancies) due to the parser being slow and quite memory
> hungry. In fact, trying to run the parser on a very large collection of
> structures often results in a random crash due to memory issues.
> I've been toying with a lot of changes, performance improvements, etc, but
> I'm not satisfied at all with them.. some things that I've been trying are to
> have the structure coordinates defined as a full numpy array instead of N
> arrays per structure (one per atom) or the usage of __slots__ to mitigate
> memory usage (managed to get it down 33% this way).
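
For readers unfamiliar with the __slots__ approach mentioned above, here is a minimal sketch of the mechanism. AtomDict and AtomSlots are hypothetical classes written for illustration, not Bio.PDB's Atom, and no particular memory saving is claimed for them; the sketch only shows that a slotted class carries no per-instance __dict__ and rejects attributes that were not declared.

```python
class AtomDict(object):
    """Hypothetical atom-like class with the usual per-instance __dict__."""

    def __init__(self, name, x, y, z):
        self.name = name
        self.x = x
        self.y = y
        self.z = z


class AtomSlots(object):
    """Same fields, declared in __slots__ so no per-instance __dict__ exists."""

    __slots__ = ("name", "x", "y", "z")

    def __init__(self, name, x, y, z):
        self.name = name
        self.x = x
        self.y = y
        self.z = z


plain = AtomDict("CA", 1.0, 2.0, 3.0)
slotted = AtomSlots("CA", 1.0, 2.0, 3.0)

plain_has_dict = hasattr(plain, "__dict__")
slotted_has_dict = hasattr(slotted, "__dict__")

# A slotted instance cannot grow new attributes at run time.
try:
    slotted.occupancy = 1.0
    extra_attribute_allowed = True
except AttributeError:
    extra_attribute_allowed = False
```

The trade-off is flexibility: code that attaches ad hoc attributes to atoms would need those names added to __slots__ first.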
This would also go in > line with a suggestion from Eric a long time ago to make a Bio.Struct > module which would be the perfect "playground" to implement and test these > changes. Other developments that I think are worth looking into are for > example making a nice library to link a parsed structure to the PDB > database and fetch information on it using the REST services they provide. > I'd like to hear your opinion (as in, everybody, developers and users) on > this and if it makes sense to indeed give a bit of TLC to the Bio.PDB > module. Also, on what changes you think should be carried out to improve > the module, like which features are missing, which applications are worth > wrapping. > Just to kick off some discussion. Maybe a new thread should be opened for > this later on. > Cheers, > Jo?o As for the name of the module, yes, Bio.Struct is just the "legacy" name I remember.. Bio.structure would probably be better and more clear. From eric.talevich at gmail.com Wed Feb 19 11:22:54 2014 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 19 Feb 2014 08:22:54 -0800 Subject: [Biopython-dev] Future of Bio.PDB In-Reply-To: References: Message-ID: On Wed, Feb 19, 2014 at 6:54 AM, Jo?o Rodrigues wrote: > From another thread: > > As for what I suggested. Since my GSOC period, already 4 years ago.., I > > noticed that the PDB module is a bit messy in terms of organization. The > > module itself if named after the databank, which can be confused with the > > format name, the mmcif parser is defined inside in a subfolder and there > > are application wrappers there too (DSSP, NACCESS). Besides this issue, > > which is not an issue at all and just my own pet peeve, there is a lot > that > > the entire module could gain from a thorough revision. 
I've been using it > > very often and some normal manipulations of structures are not > > straightforward to carry out (calculating a center of mass for example, > > removing double occupancies) due to the parser being slow and quite > memory > > hungry. In fact, trying to run the parser on a very large collection of > > structures often results in a random crash due to memory issues. > > I've been toying with a lot of changes, performance improvements, etc, > but > > I'm not satisfied at all with them.. somethings that i've been trying is > to > > have the structure coordinates defined as a full numpy array instead of N > > arrays per structure (one per atom) or the usage of __slots__ to mitigate > > memory usage (managed to get it down 33% this way). This would also go in > > line with a suggestion from Eric a long time ago to make a Bio.Struct > > module which would be the perfect "playground" to implement and test > these > > changes. Other developments that I think are worth looking into are for > > example making a nice library to link a parsed structure to the PDB > > database and fetch information on it using the REST services they > provide. > > I'd like to hear your opinion (as in, everybody, developers and users) on > > this and if it makes sense to indeed give a bit of TLC to the Bio.PDB > > module. Also, on what changes you think should be carried out to improve > > the module, like which features are missing, which applications are worth > > wrapping. > > Just to kick off some discussion. Maybe a new thread should be opened for > > this later on. > > Cheers, > > Jo?o > > > As for the name of the module, yes, Bio.Struct is just the "legacy" name I > remember.. Bio.structure would probably be better and more clear. 
The p3d folks once offered to incorporate their work into Biopython: http://www.biomedcentral.com/1471-2105/10/258 We had concerns about having p3d and Bio.PDB coexisting within Biopython, but if someone wanted to emulate the Bio.PDB API on top of p3d, or otherwise slip p3d's secret sauce into the Bio.PDB internals, that would do the trick. (I have not thought about the details of how this would work at all.) I think it should also be possible to replace p3d's custom query language with the sort of tricks Bio.Phylo, pandas and SqlAlchemy do with keyword arguments and generators to get the same results with Python syntax. Alternatively, there is the option of sticking with the Bio.PDB namespace and adding only "read", "write" and "convert" functions to Bio/PDB/__init__.py to make the basic usage of the module more similar to the other Biopython sub-packages. The Model class could store one or several NumPy arrays that cover all atom coordinates, and the Chain, Residue, Atom and Interface classes would probably just store references to that array, e.g. a shorter 1D array of integer row indexes. Would either of these internal changes make it easier to apply the GSoC work that's been done on Bio.PDB? -Eric
From davidjosephcain at gmail.com Wed Feb 19 11:35:56 2014 From: davidjosephcain at gmail.com (David Cain) Date: Wed, 19 Feb 2014 11:35:56 -0500 Subject: [Biopython-dev] Future of Bio.PDB In-Reply-To: References: Message-ID: I frequently make use of Bio.PDB, and agree wholeheartedly that certain aspects of it are very dated, or haphazardly organized. The module as a whole would benefit greatly from some extra attention. I'm happy to lend a hand in whatever revamp takes place. David Cain +1 (339) 222 4452
_______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev
From p.j.a.cock at googlemail.com Wed Feb 19 11:41:28 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 19 Feb 2014 16:41:28 +0000 Subject: [Biopython-dev] Future of Bio.PDB In-Reply-To: References: Message-ID: On Wed, Feb 19, 2014 at 4:35 PM, David Cain wrote: > I frequently make use of Bio.PDB, and agree wholeheartedly that certain aspects of it are very dated, or haphazardly organized. > > The module as a whole would benefit greatly from some extra attention. I'm happy to lend a hand in whatever revamp takes place. > > David Cain > +1 (339) 222 4452 Very true, and thanks for the offer. If we go with the parallel namespace option (Bio.Struct, Bio.structure or similar) then we can stick an experimental warning on it and include it as a 'beta' module within the next release (while continuing to keep Bio.PDB fully backward compatible for now, with the likely goal of a formal deprecation in future). Peter
From anaryin at gmail.com Wed Feb 19 11:50:56 2014 From: anaryin at gmail.com (João Rodrigues) Date: Wed, 19 Feb 2014 17:50:56 +0100 Subject: [Biopython-dev] Future of Bio.PDB In-Reply-To: References:

Message-ID: @Eric: p3d, csb, biskit, prody, there are plenty of libraries that "compete" with Bio.PDB, each focused on a different area or offering specific features. We should go over them and see what we could implement, how to implement the parsing, etc. p3d offers a really nice selection API indeed, I guess more geared towards interactive usage, but maybe a bit cumbersome for high-throughput applications? (disclaimer, never used it.. ). We also have the "support" of EBI, their parsers are probably the most robust so worth trying to ask how their implementation differs from ours (I know they have one of their own). @David: Thanks a lot for the offer, could you make a list of the things you see outdated or otherwise worth working on? (Maybe we could open a wiki page / gist for a list of possible areas of improvement?) @Peter: I thought of that too. Keeping Bio.PDB would be crucial to maintain backwards compatibility while we could work and offer users the possibility to test/work with the new module under an "experimental" tag.
From Tom.Brown at enmu.edu Wed Feb 19 15:42:19 2014 From: Tom.Brown at enmu.edu (Brown, Tom) Date: Wed, 19 Feb 2014 20:42:19 +0000 Subject: [Biopython-dev] GI number for NcbiblastpCommandline Message-ID: <7D3EF56670A2CC448808CB89D32F7372A94DE7AA@ITSNV498.ad.enet.enmu.edu> Currently using result_handle = NCBIWWW.qblast("blastp", "nr", blastGI) where blastGI = 113000 in the Biopython program and would like to convert it to a local blastp. Is there a way to specify the blastGI within NcbiblastpCommandline instead of having to provide a fasta file for blast? What are my options?
from Bio.Blast.Applications import NcbiblastpCommandline
blastp_cline = NcbiblastpCommandline(query="sh3.fasta", db="nr", evalue=0.001, outfmt=5, out="sh3.xml")
stdout, stderr = blastp_cline()
Thanks Tom
From mgymrek at mit.edu Wed Feb 19 18:05:07 2014 From: mgymrek at mit.edu (Melissa Gymrek) Date: Wed, 19 Feb 2014 18:05:07 -0500 Subject: [Biopython-dev] fastsimcoal Message-ID: Hello, I am new to the list and don't know the status of the PopGen tools, so forgive me if this has already been discussed. I added a controller for fastsimcoal to the SimCoal module (the commit from my forked repository, with the new controller plus a test case is here).
Fastsimcoal is basically a faster and more flexible version of simcoal2 and shares the same input format generated by Bio.PopGen.SimCoal.Template, so it made sense to add it there. Would there be any interest in adding this? Best, ~Meliss
From mok at bioxray.dk Wed Feb 19 18:05:31 2014 From: mok at bioxray.dk (Morten Kjeldgaard) Date: Thu, 20 Feb 2014 00:05:31 +0100 Subject: [Biopython-dev] Future of Bio.PDB In-Reply-To: References: Message-ID: On 19/02/2014, at 17:35, David Cain wrote: > I frequently make use of Bio.PDB, and agree wholeheartedly that certain aspects of it are very dated, or haphazardly organized. > > The module as a whole would benefit greatly from some extra attention. I'm happy to lend a hand in whatever revamp takes place. I second that. I am also willing to participate in this project! Cheers, Morten
From p.j.a.cock at googlemail.com Thu Feb 20 03:40:39 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 20 Feb 2014 08:40:39 +0000 Subject: [Biopython-dev] fastsimcoal In-Reply-To: References: Message-ID: Hi Meliss, That looks very useful - hopefully Tiago (the module owner) will be able to take a look soon. My only suggestion right now is to add some links to the docstrings for where to get the programs, and what version the wrapper was written and tested for (which can be very helpful if the tool changes in future). Thanks, Peter
From tiagoantao at gmail.com Thu Feb 20 04:40:55 2014 From: tiagoantao at gmail.com (Tiago Antão) Date: Thu, 20 Feb 2014 09:40:55 +0000 Subject: [Biopython-dev] fastsimcoal In-Reply-To: References: Message-ID: Hi, On 19 February 2014 23:05, Melissa Gymrek wrote: > it made sense to add it there. Would there be any interest in adding this? I think this is great. I was actually thinking of deprecating simcoal support soon. So, my suggestion is:
1. Add this
2. Deprecate simcoal (I would take care of that). I think we need to go to the main biopython list and ask if someone is using this...
But irrespective of the original code, I think this should be added. Tiago PS - You forgot to add your name to the copyright message at the top of Bio/PopGen/SimCoal/Controller.py
From p.j.a.cock at googlemail.com Thu Feb 20 04:46:45 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 20 Feb 2014 09:46:45 +0000 Subject: [Biopython-dev] fastsimcoal In-Reply-To: References:

Message-ID: On Thu, Feb 20, 2014 at 9:40 AM, Tiago Antão wrote: > Hi, > > On 19 February 2014 23:05, Melissa Gymrek wrote: >> >> it made sense to add it there. Would there be any interest in adding this? >> > > I think this is great. I was actually thinking in deprecating simcoal > support soon. So, my suggestion is: > > 1. Add this > 2. Deprecate simcoal (I would take care of that). I think we need to go to > the main biopython list and ask if someone is using this... > > But irrespective of the original code, I think this should be added. > > Tiago > PS - You forgot to add your name to the copyright message at the top > of Bio/PopGen/SimCoal/Controller.py Sorry - one more thing, tests ;) I would suggest making a copy of Tests/test_PopGen_SimCoal.py for live testing with the new binary, and adding/copying static tests in Tests/test_PopGen_SimCoal_nodepend.py if relevant. Peter
From tiagoantao at gmail.com Thu Feb 20 05:05:15 2014 From: tiagoantao at gmail.com (Tiago Antão) Date: Thu, 20 Feb 2014 10:05:15 +0000 Subject: [Biopython-dev] fastsimcoal In-Reply-To: References:

Message-ID: Hi, On 20 February 2014 09:46, Peter Cock wrote: > I would suggest making a copy of Tests/test_PopGen_SimCoal.py > +1 There should be a test_PopGen_Fastsimcoal.py, I agree. Especially because I can envision the simcoal case disappearing medium-term. > for live testing with the new binary, and adding/copying static tests > in Tests/test_PopGen_SimCoal_nodepend.py if relevant. > > The input format should be the same. The binary non-dependent test cases are probably equal, me thinks. Tiago -- "The truth may be out there, but the lies are already in your head" - Terry Pratchett From p.j.a.cock at googlemail.com Thu Feb 20 06:22:26 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 20 Feb 2014 11:22:26 +0000 Subject: [Biopython-dev] GI number for NcbiblastpCommandline In-Reply-To: <7D3EF56670A2CC448808CB89D32F7372A94DE7AA@ITSNV498.ad.enet.enmu.edu> References: <7D3EF56670A2CC448808CB89D32F7372A94DE7AA@ITSNV498.ad.enet.enmu.edu> Message-ID: On Wed, Feb 19, 2014 at 8:42 PM, Brown, Tom wrote: > Currently using result_handle = NCBIWWW.qblast("blastp", "nr", blastGI) > where blastGI = 113000 in the Biopython program and would like to convert > it to a local blastp. Is there a way to specify the blastGi within > NcbiblastpCommandline instead of having to provide a fasta file for blast? > What are my options. > > from Bio.Blast.Applications import NcbiblastpCommandline > blastp_cline = NcbiblastpCommandline(query="sh3.fasta", db="nr", evalue=0.001, outfmt=5, out="sh3.xml") > stdout, stderr = blastp_cline() > > Thanks > > Tom Hi Tom, Sorry but no: somehow you will need to download/fetch the actual protein sequence for GI:113000 in order to give it to the standalone blastp tool. e.g. http://www.ncbi.nlm.nih.gov/protein/113000 One way would be using Entrez Fetch (efetch) via Bio.Entrez. 
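A minimal sketch of that route, assuming Biopython and the BLAST+ blastp binary are installed and a local copy of nr is configured. Entrez.efetch with db="protein" and rettype="fasta" is the Bio.Entrez call; the helper names, file names and e-mail address are placeholders invented here, and whether NCBI still resolves a bare GI number should be checked before relying on it.

```python
import subprocess


def fetch_protein_fasta(gi, fasta_path, email):
    """Download one protein record as FASTA via NCBI efetch (needs network access)."""
    # Imported lazily so blastp_command() below can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.efetch(db="protein", id=str(gi), rettype="fasta", retmode="text")
    try:
        with open(fasta_path, "w") as out:
            out.write(handle.read())
    finally:
        handle.close()


def blastp_command(query, db="nr", evalue=0.001, out="result.xml"):
    """Build the argument list for a standalone blastp run with XML output."""
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-evalue", str(evalue),
        "-outfmt", "5",  # 5 = BLAST XML, as in the outfmt=5 of the question
        "-out", out,
    ]


def run_local_blastp(gi, email):
    """Hypothetical end-to-end helper: fetch the sequence, then run blastp on it."""
    fasta_path = "gi_%s.fasta" % gi
    fetch_protein_fasta(gi, fasta_path, email)
    subprocess.run(blastp_command(fasta_path, out="gi_%s.xml" % gi), check=True)
```

Calling run_local_blastp(113000, "you@example.org") would then mirror the query/db/evalue/outfmt settings of the NcbiblastpCommandline call in the question, with the FASTA file created on the fly.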
Depending on your protein set, it might be simpler to download the sequences via FTP. Your example is a yeast protein that is also in UniProtKB, so it is also possible to fetch sequences via their UniProt ID. Peter
From mgymrek at mit.edu Thu Feb 20 06:26:17 2014 From: mgymrek at mit.edu (Melissa Gymrek) Date: Thu, 20 Feb 2014 06:26:17 -0500 Subject: [Biopython-dev] fastsimcoal In-Reply-To: References:

Message-ID: Hi, Great, I will implement what was discussed above. Regarding test cases, I second what Tiago said about test_PopGen_SimCoal_nodepend.py: those cases should stay the same. If/when simcoal2 is deprecated, should this be renamed to test_PopGen_Fastsimcoal_nodepend.py? ~M
From tra at popgen.net Thu Feb 20 11:36:36 2014 From: tra at popgen.net (Tiago Antao) Date: Thu, 20 Feb 2014 16:36:36 +0000 Subject: [Biopython-dev] [galaxy-dev] Galaxy Tool Shed packages for Biopython In-Reply-To: References:

<53036D4E.5000208@gmail.com> Message-ID: <6538199077f92c4c0953477d0c4feca9@webmail.webfaction.com> On 2014-02-18 15:01, Peter Cock wrote: > How about we add step N+1 to the instructions, > http://biopython.org/wiki/Building_a_release > > "Ask Peter, Brad, or Bjoern to prepare a new Galaxy package > on https://github.com/biopython/galaxy_packages and upload > it to the main and test Galaxy ToolShed." Makes sense, so that it is not forgotten by any release manager who is less conversant with Galaxy. Tiago
From clements at galaxyproject.org Sat Feb 22 18:21:33 2014 From: clements at galaxyproject.org (Dave Clements) Date: Sat, 22 Feb 2014 15:21:33 -0800 Subject: [Biopython-dev] 2014 Galaxy Community Conference (GCC2014), Baltimore, June 30-July 2 Message-ID:
*2014 Galaxy Community Conference (GCC2014)*
http://galaxyproject.org/GCC2014
June 30 - July 2, 2014
Homewood Campus, Johns Hopkins University
Baltimore, Maryland, United States
------
The *2014 Galaxy Community Conference* (*GCC2014*, http://galaxyproject.org/GCC2014) features two full days of presentations, discussions, poster sessions, lightning talks and birds-of-a-feather, *all about data-intensive biology and the tools that support it*. GCC2014 also includes a Training Day with five concurrent tracks and in-depth coverage of thirteen different topics. GCC2014 will be held at the Homewood Campus of Johns Hopkins University, in Baltimore, Maryland, United States, from June 30 through July 2, 2014.
Galaxy is an easily extensible data integration and analysis platform for life sciences research that supports hundreds of bioinformatics analysis tools. Galaxy is open-source and can be locally installed or run on the cloud. There are hundreds of local installs, and over 50 publicly accessible servers around the world.
*Early registration* is now open. Early combined registration (Training Day + main meeting) starts at $140 for post-docs and students. Registration is capped this year at 250 participants, *and we expect to hit that limit*. Registering early assures you a place at the conference and also a spot in the Training Day workshops you want to attend. You can also book affordable conference housing at the same time you register. See the conference Logistics page for details on this and other housing options.
*Abstract submission* for both oral presentations and posters is also open. Abstract submission for oral presentations closes April 4, and poster submission closes April 25.
The *GigaScience* "Galaxy: Data Intensive and Reproducible Research" series (announced for GCC2013) *is continuing to take submissions for this year's meeting and beyond.* BGI is also continuing to cover the article processing charges until the end of the year, and for more information see their latest update.
Thanks, and hope to see you in Baltimore! The GCC2014 Organizing Committee -- http://galaxyproject.org/ http://getgalaxy.org/ http://usegalaxy.org/ http://wiki.galaxyproject.org/
From harsh.beria93 at gmail.com Wed Feb 26 11:14:24 2014 From: harsh.beria93 at gmail.com (Harsh Beria) Date: Wed, 26 Feb 2014 21:44:24 +0530 Subject: [Biopython-dev] Gsoc 2014 aspirant Message-ID: Hi, I am Harsh Beria, a third-year UG student at the Indian Institute of Technology, Kharagpur. I have started working in Computational Biophysics recently, having written code for a PDB-to-FASTA parser, sequence alignment using Needleman-Wunsch and Smith-Waterman, secondary structure prediction and Henikoff's weight, and am currently working on Monte Carlo simulation. Overall, I have started to like this field and want to carry my interest forward by pursuing a relevant project for GSOC 2014. I mainly code in C and Python and would like to start contributing to the Biopython library. I started going through the official contribution wiki page (http://biopython.org/wiki/Contributing). I also went through the wiki page for Bio.SeqIO.
I seriously want to contribute to the Biopython library through GSOC. What do I do next ? Thanks -- Harsh Beria, Indian Institute of Technology,Kharagpur E-mail: harsh.beria93 at gmail.com From p.j.a.cock at googlemail.com Thu Feb 27 08:49:22 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Feb 2014 13:49:22 +0000 Subject: [Biopython-dev] Introductory Biopython material Message-ID: Hello all, This is just to let you know that I've written some introductory Biopython material targeting Python Novices, focused on some practical sequence manipulation examples, freely available under the CC-BY licence here: https://github.com/peterjc/biopython_workshop I've run this as a workshop twice, but it should be fine for self study as well. I'm open to moving this under the Biopython project's GitHub account, if people think that would be better? I've added a few links to this from the website - these can be moved/edited/removed if people think there's a better place to put them: http://biopython.org/wiki/SeqIO and http://biopython.org/wiki/Category:Wiki_Documentation Regards, Peter From tra at popgen.net Thu Feb 27 09:53:48 2014 From: tra at popgen.net (Tiago Antao) Date: Thu, 27 Feb 2014 14:53:48 +0000 Subject: [Biopython-dev] Bio.PopGen.SimCoal partial deprecation Message-ID: <20140227145348.44cbe923@lnx> Dear all, With the availability of the new fastsimcoal interface by Melissa Gymrek, I was planning on deprecating the code to deal with old version (SimCoal 2.0). This would mean deprecating class SimCoalController (Bio.PopGen.SimCoal.Controller.py), along with the relevant test code (and SimCoal2 dependency). All the other code would be maintained (e.g. templating). And Melissa's new fastsimcoal class (FastSimCoalController) would of course be added. If somebody has strong feelings against this deprecation, please do voice your concerns. 
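As a rough idea of what the deprecation could look like to a user, here is a sketch using only the standard library. The class below is an empty stand-in, not the real Bio.PopGen.SimCoal.Controller code, its constructor argument is assumed, and Biopython would more likely raise its own Bio.BiopythonDeprecationWarning than the plain DeprecationWarning used here.

```python
import warnings


class SimCoalController:
    """Empty stand-in for the old SimCoal 2.0 wrapper (hypothetical body)."""

    def __init__(self, simcoal_dir):
        # Warn at construction time so existing scripts keep running for now.
        warnings.warn(
            "SimCoalController is deprecated; use FastSimCoalController instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this line
        )
        self.simcoal_dir = simcoal_dir


# Capture the warning to show what a caller would see.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    SimCoalController("/path/to/simcoal")  # placeholder path

print(caught[0].category.__name__, "-", caught[0].message)
```

Old scripts would keep working during the deprecation period and only see the warning, which leaves time for anyone still depending on SimCoal 2.0 to speak up.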
Best, Tiago From p.j.a.cock at googlemail.com Thu Feb 27 11:12:31 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Feb 2014 16:12:31 +0000 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On Thu, Feb 27, 2014 at 3:50 PM, Leighton Pritchard wrote: > I would like to propose further development of the GenomeDiagram > module (and maybe the KGML module, if it's incorporated into Biopython) > to enable browser-based interactive visualisation, along the lines of Bokeh[1] > > [1] http://bokeh.pydata.org/ I presume you're offering to mentor this - which would be great :) Peter P.S. The KGML module Leighton's talking about is here: https://github.com/biopython/biopython/pull/173 Leighton's blog posts about this work: http://armchairbiology.blogspot.co.uk/2013/01/keggwatch-part-i.html http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-ii.html http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-iii.html From tra at popgen.net Thu Feb 27 11:19:44 2014 From: tra at popgen.net (Tiago Antao) Date: Thu, 27 Feb 2014 16:19:44 +0000 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References:

Message-ID: <20140227161944.05640d0d@lnx> Hi, On Thu, 27 Feb 2014 16:12:31 +0000 Peter Cock wrote: > P.S. The KGML module Leighton's talking about is here: > https://github.com/biopython/biopython/pull/173 Would this add a new library dependency to Biopython (PIL)? I am all in favour of that (as independent modules could have their dependencies without causing problems - as you only need the dependency if you actually use the module). But that would require the revision of the module dependency policy, right? Which until now has been a bit on the conservative side... I am thinking here matplotlib and scipy, for instance... Tiago From p.j.a.cock at googlemail.com Thu Feb 27 11:28:33 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Feb 2014 16:28:33 +0000 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: <20140227161944.05640d0d@lnx> References:

<20140227161944.05640d0d@lnx> Message-ID: On Thu, Feb 27, 2014 at 4:19 PM, Tiago Antao wrote: > Hi, > > On Thu, 27 Feb 2014 16:12:31 +0000 > Peter Cock wrote: > >> P.S. The KGML module Leighton's talking about is here: >> https://github.com/biopython/biopython/pull/173 > > > Would this add a new library dependency to Biopython (PIL)? I am all in > favour of that (as independent modules could have their dependencies > without causing problems - as you only need the dependency if you > actually use the module). > > But that would require the revision of the module dependency policy, > right? Which until now has been a bit on the conservative side... We've been conservative/hard-line on build time dependencies (i.e. only NumPy), but there are now quite a few 'soft' run time Python dependencies (e.g. NetworkX, MySQLdb, ...). > I am thinking here matplotlib and scipy, for instance... > > Tiago IIRC, ReportLab already requires PIL for any bitmap output, so this isn't a new 'soft dependency'. Peter From p.j.a.cock at googlemail.com Thu Feb 27 11:31:11 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Feb 2014 16:31:11 +0000 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References:

Message-ID: On Thu, Feb 27, 2014 at 4:25 PM, Fields, Christopher J wrote: > On Feb 27, 2014, at 10:12 AM, Peter Cock wrote: > >> On Thu, Feb 27, 2014 at 3:50 PM, Leighton Pritchard >> wrote: >>> I would like to propose further development of the GenomeDiagram >>> module (and maybe the KGML module, if it's incorporated into Biopython) >>> to enable browser-based interactive visualisation, along the lines of Bokeh[1] >>> >>> [1] http://bokeh.pydata.org/ >> >> I presume you're offering to mentor this - which would be great :) >> >> Peter > > I would add that to the wiki, and indicate whether you can mentor it. > Seems like a cool idea! > > chris Leighton left out the link, but had added this to the Biopython wiki: http://biopython.org/wiki/GSOC#Interactive_GenomeDiagram_Module Peter From tra at popgen.net Thu Feb 27 11:32:03 2014 From: tra at popgen.net (Tiago Antao) Date: Thu, 27 Feb 2014 16:32:03 +0000 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References:

<20140227161944.05640d0d@lnx> Message-ID: <20140227163203.191432da@lnx> On Thu, 27 Feb 2014 16:28:33 +0000 Peter Cock wrote: > We've been conservative/hard-line on build time dependencies > (i.e. only NumPy), but there are now quite a few 'soft' run time > Python dependencies (e.g. NetworkX, MySQLdb, ...). That makes sense. So, any soft dependencies (non-build time) of widely use Python libraries would be OK? Tiago From p.j.a.cock at googlemail.com Thu Feb 27 11:38:28 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Feb 2014 16:38:28 +0000 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: <20140227163203.191432da@lnx> References:

<20140227161944.05640d0d@lnx> <20140227163203.191432da@lnx> Message-ID: On Thu, Feb 27, 2014 at 4:32 PM, Tiago Antao wrote: > On Thu, 27 Feb 2014 16:28:33 +0000 > Peter Cock wrote: >> We've been conservative/hard-line on build time dependencies >> (i.e. only NumPy), but there are now quite a few 'soft' run time >> Python dependencies (e.g. NetworkX, MySQLdb, ...). > > That makes sense. So, any soft dependencies (non-build time) of > widely use Python libraries would be OK? That seems to be the way things have gone - the key points being non-build time, and widely used (which I would hope includes readily available cross platform). But we digress - any more potential GSoC mentors? Project ideas? Peter From cjfields at illinois.edu Thu Feb 27 11:25:18 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 27 Feb 2014 16:25:18 +0000 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References:

Message-ID: On Feb 27, 2014, at 10:12 AM, Peter Cock wrote: > On Thu, Feb 27, 2014 at 3:50 PM, Leighton Pritchard > wrote: >> I would like to propose further development of the GenomeDiagram >> module (and maybe the KGML module, if it's incorporated into Biopython) >> to enable browser-based interactive visualisation, along the lines of Bokeh[1] >> >> [1] http://bokeh.pydata.org/ > > I presume you're offering to mentor this - which would be great :) > > Peter > > P.S. The KGML module Leighton's talking about is here: > https://github.com/biopython/biopython/pull/173 > > Leighton's blog posts about this work: > http://armchairbiology.blogspot.co.uk/2013/01/keggwatch-part-i.html > http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-ii.html > http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-iii.html I would add that to the wiki, and indicate whether you can mentor it. Seems like a cool idea! chris From mjldehoon at yahoo.com Fri Feb 28 05:07:49 2014 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 28 Feb 2014 02:07:49 -0800 (PST) Subject: [Biopython-dev] Gsoc 2014 aspirant In-Reply-To: Message-ID: <1393582069.6863.YahooMailBasic@web164006.mail.gq1.yahoo.com> Hi Harsh Beria, One option is to work on pairwise sequence alignments. Currently there is some code for that in Biopython (in Bio/pairwise2.py), but it is not general and is not being maintained. This may need to be rebuilt from the ground up. Best, -Michiel. -------------------------------------------- On Wed, 2/26/14, Harsh Beria wrote: Subject: [Biopython-dev] Gsoc 2014 aspirant To: biopython at lists.open-bio.org, biopython-dev at lists.open-bio.org, gsoc at lists.open-bio.org Date: Wednesday, February 26, 2014, 11:14 AM Hi, I am a Harsh Beria, third year UG student at Indian Institute of Technology, Kharagpur. 
I have started working in Computational Biophysics recently, having written code for a pdb to fasta parser, sequence alignment using Needleman-Wunsch and Smith-Waterman, secondary structure prediction, Henikoff's weight and am currently working on Monte Carlo simulation. Overall, I have started to like this field and want to carry my interest forward by pursuing a relevant project for GSOC 2014. I mainly code in C and python and would like to start contributing to the Biopython library. I started going through the official contribution wiki page (http://biopython.org/wiki/Contributing). I also went through the wiki page of Bio.SeqIO. I seriously want to contribute to the Biopython library through GSOC. What do I do next? Thanks -- Harsh Beria, Indian Institute of Technology,Kharagpur E-mail: harsh.beria93 at gmail.com _______________________________________________ Biopython-dev mailing list Biopython-dev at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev From cjfields at illinois.edu Fri Feb 28 12:45:32 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 28 Feb 2014 17:45:32 +0000 Subject: [Biopython-dev] Gsoc 2014 aspirant In-Reply-To: <1393582069.6863.YahooMailBasic@web164006.mail.gq1.yahoo.com> References: <1393582069.6863.YahooMailBasic@web164006.mail.gq1.yahoo.com> Message-ID: I'm wondering, with something that is as broadly applicable as pairwise alignment, would it be better to implement only in Python (or implement in Python wedded to a C backend)? Or maybe set up something in python that taps into an already well-defined C/C++ library that does this? The reason I mention this: with bioperl we went down this route with bioperl-ext a long time ago (these are generally C-based backend tools with a perl front-end), that bit-rotted simply b/c there were other more maintainable options.
IIUC from this post, similar issues re: maintainability held for Bio/pairwise2.py (unless I'm mistaken, which is entirely possible). However, tools like pysam and Bio::DB::Samtools (on the perl end) seem to have been maintained much more readily since they tap into a common library. For instance, my suggestion would be to implement a Biopython tool that does pairwise alignment using library X (SeqAn, EMBOSS, etc). Or maybe a generic python front-end that allows users to pick the tool/method for the alignment, with maybe a library binding as an initial implementation. chris On Feb 28, 2014, at 4:07 AM, Michiel de Hoon wrote: > Hi Harsh Beria, > > One option is to work on pairwise sequence alignments. Currently there is some code for that in Biopython (in Bio/pairwise2.py), but it is not general and is not being maintained. This may need to be rebuilt from the ground up. > > Best, > -Michiel. > > -------------------------------------------- > On Wed, 2/26/14, Harsh Beria wrote: > > Subject: [Biopython-dev] Gsoc 2014 aspirant > To: biopython at lists.open-bio.org, biopython-dev at lists.open-bio.org, gsoc at lists.open-bio.org > Date: Wednesday, February 26, 2014, 11:14 AM > > Hi, > > I am a Harsh Beria, third year UG student at Indian > Institute of > Technology, Kharagpur. I have started working in > Computational Biophysics > recently, having written code for pdb to fasta parser, > sequence alignment > using Needleman Wunch and Smith Waterman, Secondary > Structure prediction, > Henikoff's weight and am currently working on Monte Carlo > simulation. > Overall, I have started to like this field and want to carry > my interest > forward by pursuing a relevant project for GSOC 2014. I > mainly code in C > and python and would like to start contributing to the > Biopython library. I > started going through the official contribution wiki page ( > http://biopython.org/wiki/Contributing) > > I also went through the wiki page of Bio.SeqlO's.
I > seriously want to > contribute to the Biopython library through GSOC. What do I > do next ? > > Thanks > -- > > Harsh Beria, > Indian Institute of Technology,Kharagpur > E-mail: harsh.beria93 at gmail.com > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From harsh.beria93 at gmail.com Fri Feb 28 14:10:41 2014 From: harsh.beria93 at gmail.com (Harsh Beria) Date: Sat, 1 Mar 2014 00:40:41 +0530 Subject: [Biopython-dev] Gsoc 2014 aspirant In-Reply-To: References: <1393582069.6863.YahooMailBasic@web164006.mail.gq1.yahoo.com> Message-ID: I can work on pairwise sequence alignment. Actually, I have previously worked on this using Dynamic programming. But I doubt whether this can be a GSOC project because the work load will not be too much. If we use different methods to predict sequence alignment and make a front-end which allows the user to input the sequence or even a pdb file and method of alignment and predict the alignment, the work can be substantial enough. Also, as suggested by Christopher, sequence alignment is pretty basic and we can use C backend, which can significantly improve the runtime. So, we can discuss it and I can start working on it. On Fri, Feb 28, 2014 at 11:15 PM, Fields, Christopher J < cjfields at illinois.edu> wrote: > I'm wondering, with something that is as broadly applicable as pairwise > alignment, would it be better to implement only in Python (or implement in > Python wedded to a C backend)? Or maybe set up something in python that > taps into an already well-defined C/C++ library that does this? 
> > The reason I mention this: with bioperl we went down this route with > bioperl-ext a long time ago (these are generally C-based backend tools with > a perl front-end), that bit-rotted simply b/c there were other more > maintainable options. IIUC from this post, similar issues re: > maintainability held for Bio/pairwise2.py (unless I'm mistaken, which is > entirely possible). However, tools like pysam and Bio::DB::Samtools (on > the perl end) seem to have been maintained much more readily since they tap > into a common library. > > For instance, my suggestion would be to implement a Biopython tool that > does pairwise alignment using library X (SeqAn, EMBOSS, etc). Or maybe a > generic python front-end that allows users to pick the tool/method for the > alignment, with maybe a library binding as an initial implementation. > > chris > > On Feb 28, 2014, at 4:07 AM, Michiel de Hoon wrote: > > > Hi Harsh Beria, > > > > One option is to work on pairwise sequence alignments. Currently there > is some code for that in Biopython (in Bio/pairwise2.py), but it is not > general and is not being maintained. This may need to be rebuilt from the > ground up. > > > > Best, > > -Michiel. > > > > -------------------------------------------- > > On Wed, 2/26/14, Harsh Beria wrote: > > > > Subject: [Biopython-dev] Gsoc 2014 aspirant > > To: biopython at lists.open-bio.org, biopython-dev at lists.open-bio.org, > gsoc at lists.open-bio.org > > Date: Wednesday, February 26, 2014, 11:14 AM > > > > Hi, > > > > I am a Harsh Beria, third year UG student at Indian > > Institute of > > Technology, Kharagpur. I have started working in > > Computational Biophysics > > recently, having written code for pdb to fasta parser, > > sequence alignment > > using Needleman Wunch and Smith Waterman, Secondary > > Structure prediction, > > Henikoff's weight and am currently working on Monte Carlo > > simulation. 
> > Overall, I have started to like this field and want to carry > > my interest > > forward by pursuing a relevant project for GSOC 2014. I > > mainly code in C > > and python and would like to start contributing to the > > Biopython library. I > > started going through the official contribution wiki page ( > > http://biopython.org/wiki/Contributing) > > > > I also went through the wiki page of Bio.SeqlO's. I > > seriously want to > > contribute to the Biopython library through GSOC. What do I > > do next ? > > > > Thanks > > -- > > > > Harsh Beria, > > Indian Institute of Technology,Kharagpur > > E-mail: harsh.beria93 at gmail.com > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > -- Harsh Beria, Indian Institute of Technology,Kharagpur E-mail: harsh.beria93 at gmail.com Ph: +919332157616 From saketkc at gmail.com Mon Feb 3 04:22:42 2014 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 3 Feb 2014 09:52:42 +0530 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References:

Message-ID: On 31 January 2014 16:25, Peter Cock wrote: > On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich wrote: >> Hi folks, >> >> Google Summer of Code is on again for 2014, and the Open Bioinformatics >> Foundation (OBF) is once again applying as a mentoring organization. >> Participating in GSoC as an organization is very competitive, and we will >> need your help in gathering a good set of ideas and potential mentors for >> Biopython's role in GSoC this year. >> >> If you have an idea for a Summer of Code project, please post your idea >> here on the Biopython mailing list for discussion and start an outline on >> this wiki page: >> http://biopython.org/wiki/Google_Summer_of_Code >> >> We also welcome ideas that fit with OBF's mission but are not part of a >> single Bio* project, or span multiple projects -- these ideas can be posted >> on the OBF wiki and discussed on the OBF mailing list: >> http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas >> http://lists.open-bio.org/mailman/listinfo/open-bio-l >> >> Here's to another fun and productive Summer of Code! >> >> Cheers, >> Eric & Raoul > > Thanks Eric & Raoul, > > Remember that the ideas don't have to come from potential mentors - > if as a student there is something you'd particularly like to work on > please ask, and perhaps we can find a suitable (Biopython) mentor. > > Regards, > > Peter I would like to propose a QC module for NGS & Microarray data. Essentially a fastQC[1] and limma[2], respectively ported to Biopython. 
[1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [2] http://bioconductor.org/packages/devel/bioc/html/limma.html Saket > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Mon Feb 3 12:31:37 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 12:31:37 +0000 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References:

Message-ID: On Mon, Feb 3, 2014 at 4:22 AM, Saket Choudhary wrote: > > I would like to propose a QC module for NGS & Microarray data. > Essentially a fastQC[1] and limma[2], respectively ported to > Biopython. > > [1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ > [2] http://bioconductor.org/packages/devel/bioc/html/limma.html Hi Saket, What did you have in mind for 'porting' fastQC? Recreating it in Python alone doesn't seem like a sensible use of time & effort. Are there particular functions etc you think make sense to have available as a library of code? For limma, the linear model side would fall nicely under SciPy, eg http://scikit-learn.org/stable/modules/linear_model.html However, Biopython's existing microarray support could do with some love. Peter From saketkc at gmail.com Mon Feb 3 17:37:41 2014 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 3 Feb 2014 17:37:41 +0000 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References:

Message-ID: Hi Peter, My idea was to have a QC/preprocessing module inside Biopython, which could then be integrated with the rest of the NGS tools wrappers. Though you are right, these functionalities as such are already part of fastQC and replicating might not be a good idea. As for limma, I had these things in mind: 1. Correct me if I am wrong, but Biopython only supports Affymetrix data, right? My idea was to build parsers for Genepix, Agilent etc 2. Add other methods for in/between array normalisation, MA, volcano plots Yes, it is like reinventing the wheel, but I have been thinking of porting this to python myself, this might not be good from the point of view of a GSoC project however. Saket On 3 February 2014 12:31, Peter Cock wrote: > On Mon, Feb 3, 2014 at 4:22 AM, Saket Choudhary wrote: >> >> I would like to propose a QC module for NGS & Microarray data. >> Essentially a fastQC[1] and limma[2], respectively ported to >> Biopython. >> >> [1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ >> [2] http://bioconductor.org/packages/devel/bioc/html/limma.html > > Hi Saket, > > What did you have in mind for 'porting' fastQC? Recreating it in > Python alone doesn't seem like a sensible use of time & effort. > Are there particular functions etc you think make sense to have > available as a library of code? > > For limma, the linear model side would fall nicely under SciPy, > eg http://scikit-learn.org/stable/modules/linear_model.html > However, Biopython's existing microarray support could do > with some love. > > Peter From p.j.a.cock at googlemail.com Mon Feb 3 17:49:25 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 17:49:25 +0000 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References:

Message-ID: On Mon, Feb 3, 2014 at 5:37 PM, Saket Choudhary wrote: > > As for limma, I had these things in mind: > 1. Correct me if I am wrong, but Biopython only supports Affymetrix > data, right? My idea was to build parsers for Genepix, Agilent etc And GEO data somewhat (which has been processed by the NCBI into yet another format). > 2. Add other methods for in/between array normalisation, MA, > volcano plots Much of the core statistics and plotting can probably build on scipy and something like matplotlib? > Yes, it is like reinventing the wheel, but I have been thinking of > porting this to python myself, this might not be good from the point > of view of a GSoC project however. This sounds like it *could* be the basis of a possible GSoC (provided suitable mentor(s) are available). Are you still eligible as a student? Regards, Peter From saketkc at gmail.com Mon Feb 3 17:52:43 2014 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 3 Feb 2014 17:52:43 +0000 Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References:

Message-ID: On 3 February 2014 17:49, Peter Cock wrote: > On Mon, Feb 3, 2014 at 5:37 PM, Saket Choudhary wrote: >> >> As for limma, I had these things in mind: >> 1. Correct me if I am wrong, but Biopython only supports Affymetrix >> data, right? My idea was to build parsers for Genepix, Agilent etc > > And GEO data somewhat (which has been processed by the NCBI > into yet another format). > >> 2. Add other methods for in/between array normalisation, MA, >> volcano plots > > Much of the core statistics and plotting can probably build on > scipy and something like matplotlib? > Yes, that's what I was heading to. >> Yes, it is like reinventing the wheel, but I have been thinking of >> porting this to python myself, this might not be good from the point >> of view of a GSoC project however. > > This sounds like it *could* be the basis of a possible GSoC (provided > suitable mentor(s) are available). Are you still eligible as a student? > Yes, this is my final year of integrated bachelors+masters program :( || :) Saket > Regards, > > Peter From bylin at ucsc.edu Tue Feb 4 17:44:47 2014 From: bylin at ucsc.edu (Brian Lin) Date: Tue, 4 Feb 2014 09:44:47 -0800 Subject: [Biopython-dev] SeqRecord comparison suggestion Message-ID: Hi everyone, In the past I have spent hours debugging my code because I expected different SeqRecord objects to evaluate as equal if all their attributes (id, seq, etc) were the same. Unfortunately the "==" operator only compares the address in memory. Is there a reason (e.g. some tradeoff, design elements, etc) that this behavior is tolerated? 
If not, I'd like to submit a pull request - I overloaded the __eq__ and __ne__ operators and wrote a quick test for it, then pushed it to: https://github.com/bylin/biopython Best, Brian Brian Lin | bylin at ucsc.edu | Brian's LinkedIn B.S., Genetics and Computer Science, University of California at Davis Ph.D Program in Bioinformatics, University of California at Santa Cruz From idoerg at gmail.com Tue Feb 4 17:57:49 2014 From: idoerg at gmail.com (Iddo Friedberg) Date: Tue, 4 Feb 2014 12:57:49 -0500 Subject: [Biopython-dev] SeqRecord comparison suggestion In-Reply-To: References: Message-ID: Thanks! My initial thoughts are that seqrecord instances should not have an __eq__ operator. The equality operation here is somewhat meaningless when you consider the number of parameters that can constitute a seqrecord, especially when dealing with a genomic record or a contig. This can lead to unexpected behavior. That being said, it may be a good idea to allow for a function that performs a comprehensive comparison using all attributes. Specifically to answer your question: I don't think the address comparison is by design. It's a Python feature. My $0.02 Iddo Friedberg http://iddo-friedberg.net/contact.html Sent from a device that promotes typos On Feb 4, 2014 12:46 PM, "Brian Lin" wrote: > Hi everyone, > > In the past I have spent hours debugging my code because I expected > different SeqRecord objects to evaluate as equal if all their attributes > (id, seq, etc) were the same. Unfortunately the "==" operator only compares > the address in memory. > > Is there a reason (e.g. some tradeoff, design elements, etc) that this > behavior is tolerated? 
If not, I'd like to submit a pull request - I > overloaded the __eq__ and __ne__ operators and wrote a quick test for it, > then pushed it to: > > https://github.com/bylin/biopython > > Best, > Brian > > Brian Lin | bylin at ucsc.edu | Brian's > LinkedIn > B.S., Genetics and Computer Science, University of California at Davis > Ph.D Program in Bioinformatics, University of California at Santa Cruz > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From arklenna at gmail.com Tue Feb 4 18:54:34 2014 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 4 Feb 2014 13:54:34 -0500 Subject: [Biopython-dev] SeqRecord comparison suggestion In-Reply-To: References:

Message-ID: On Tue, Feb 4, 2014 at 12:57 PM, Iddo Friedberg wrote: > Thanks! > > My initial thoughts are that seqrecord instances should not have an __eq__ > operator. The equality operation here is somewhat meaningless when you > consider the number of parameters that can constitute a seqrecord, > especially when dealing with a genomic record or a contig. This can lead > to unexpected behavior. > That being said, it may be a good idea to allow for a function that > performs a comprehensive comparison using all attributes. > I agree that an explicit comparison method would be less error-prone than ==. This method could even allow the user to specify which attributes must be identical for equality. > Specifically to answer your question: I don't think the address comparison > is by design. It's a Python feature. > __eq__ and __ne__ could instead be defined to raise NotImplementedError to prevent future users from experiencing the same problems and direct them to the explicit comparison method. Cheers, Lenna From p.j.a.cock at googlemail.com Tue Feb 4 20:56:12 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 4 Feb 2014 20:56:12 +0000 Subject: [Biopython-dev] SeqRecord comparison suggestion In-Reply-To: References:

Message-ID: On Tuesday, February 4, 2014, Lenna Peterson wrote: > On Tue, Feb 4, 2014 at 12:57 PM, Iddo Friedberg > > wrote: > > > Thanks! > > > > My initial thoughts are that seqrecord instances should not have an > __eq__ > > operator. The equality operation here is somewhat meaningless when you > > consider the number of parameters that can constitute a seqrecord, > > especially when dealing with a genomic record or a contig. This can lead > > to unexpected behavior. > > > > That being said, it may be a good idea to allow for a function that > > performs a comprehensive comparison using all attributes. > > > > I agree that an explicit comparison method would be less error-prone than > ==. This method could even allow the user to specify which attributes must > be identical for equality. > > > > Specifically to answer your question: I don't think the address > comparison > > is by design. It's a Python feature. > > > > __eq__ and __ne__ could instead be defined to raise NotImplementedError to > prevent future users from experiencing the same problems and direct them to > the explicit comparison method. > > Cheers, > > Lenna > We should probably switch the current FutureWarning to a noisy BiopythonWarning ... because the current warning is (almost) silent. I think this was silenced in a recentish Python release, from memory it used to give a clear warning to the user :( Peter From p.j.a.cock at googlemail.com Wed Feb 5 18:15:08 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Feb 2014 18:15:08 +0000 Subject: [Biopython-dev] SeqRecord comparison suggestion In-Reply-To: References:

Message-ID:

On Tue, Feb 4, 2014 at 8:56 PM, Peter Cock wrote:
>
> We should probably switch the current FutureWarning to a noisy
> BiopythonWarning ... because the current warning is (almost) silent. I think
> this was silenced in a recentish Python release, from memory it used to give
> a clear warning to the user :(

Correction - I should have double checked this before writing
the email, the FutureWarning from Seq comparison works fine:

>>> from Bio.Seq import Seq
>>> Seq("ACGT") == "ACGT"
False
>>> Seq("ACGT") == Seq("ACGT")
/Library/Python/2.7/site-packages/Bio/Seq.py:179: FutureWarning: In
future comparing Seq objects will use string comparison (not object
comparison). Incompatible alphabets will trigger a warning (not an
exception). In the interim please use id(seq1)==id(seq2) or
str(seq1)==str(seq2) to make your code explicit and to avoid this
warning.

(It may be time to actually try switching Seq equality to be
string like...)

On Tue, Feb 4, 2014 at 12:57 PM, Iddo Friedberg wrote:
> Thanks!
>
> My initial thoughts are that seqrecord instances should not have an __eq__
> operator. The equality operation here is somewhat meaningless when you
> consider the number of parameters that can constitute a seqrecord,
> especially when dealing with a genomic record or a contig. This can lead
> to unexpected behaviour.

Indeed, which is one reason why we never defined __eq__ etc
for the SeqRecord (how equal is equal? Same ID? Same sequence?
Same annotations?). Therefore the SeqRecord gets the default Python
object equality, which asks: are they the same object in memory?

Peter

From p.j.a.cock at googlemail.com Sat Feb 8 13:01:52 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 8 Feb 2014 13:01:52 +0000
Subject: [Biopython-dev] Fwd: [biopython] SubsMat.MatrixInfo update (#282)
In-Reply-To:
References:
Message-ID:

Who'd like to take a look at this offer of code for substitution matrices?
---------- Forwarded message ---------- From: biologyguy Date: Fri, Feb 7, 2014 at 10:00 PM Subject: [biopython] SubsMat.MatrixInfo update (#282) To: biopython/biopython I've updated Bio.SubsMat.MatrixInfo.py with a new substitution matrix (PHAT) and written a little function to output the matrices in a more useable format. I've been using the new version for months, and figured I should polish it up and submit it to be included in the official package. Anyone interested in taking a look? -Steve -- Reply to this email directly or view it on GitHub . From anubhavmaity7 at gmail.com Sun Feb 9 15:05:23 2014 From: anubhavmaity7 at gmail.com (Anubhav Maity) Date: Sun, 9 Feb 2014 20:35:23 +0530 Subject: [Biopython-dev] Fwd: [GSoC] Want to contribute to open-bio for GSOC 2014 In-Reply-To: References:

Message-ID:

Hi,

Thank you, Peter, for your reply.
I have set up my GitHub account and have forked the source code.
I have built and installed Biopython after reading the README file in the
GitHub repository.
I want to contribute code to Biopython. I would like some suggestions on
where to start.
Waiting for your reply.

Thanks and Regards,
Anubhav

---------- Forwarded message ----------
From: Peter Cock
Date: Sat, Feb 8, 2014 at 6:28 PM
Subject: Re: [GSoC] Want to contribute to open-bio for GSOC 2014
To: Anubhav Maity
Cc: OBF GSoC

On Fri, Feb 7, 2014 at 10:33 PM, Anubhav Maity wrote:
> Hi,
>
> I am a BTech student from an Indian university and want to contribute code
> for open-bio for GSOC 2014.
> I love to code and can code in python. I have studied biology in high
> school and have taken biotechnology during my college study.
> I have looked on the projects of biopython i.e Codon alignment and
> analysis, Bio.Phylo: filling in the gaps and Indexing & Lazy-loading
> Sequence Parsers. All the projects are very interesting. I want to
> contribute in one of these projects, please help me in getting started.
> Waiting for your positive reply.
>
> Thanks and Regards,
> Anubhav

Hi Anubhav,

Please sign up to the biopython and biopython-dev mailing lists and
introduce yourself there too. You will also need a GitHub account to
contribute to Biopython development - so you might want to set that
up now as well:

http://lists.open-bio.org/mailman/listinfo/biopython
http://lists.open-bio.org/mailman/listinfo/biopython-dev
https://github.com/biopython/biopython

Regards,

Peter

From saketkc at gmail.com Sun Feb 9 23:51:25 2014
From: saketkc at gmail.com (Saket Choudhary)
Date: Sun, 9 Feb 2014 23:51:25 +0000
Subject: [Biopython-dev] [Biopython] Google Summer of Code 2014 - Call for project ideas
In-Reply-To:
References:

Message-ID: On 3 February 2014 17:52, Saket Choudhary wrote: > On 3 February 2014 17:49, Peter Cock wrote: >> On Mon, Feb 3, 2014 at 5:37 PM, Saket Choudhary wrote: >>> >>> As for limma, I had these things in mind: >>> 1. Correct me if I am wrong, but Biopython only supports Affymetrix >>> data, right? My idea was to build parsers for Genepix, Agilent etc >> >> And GEO data somewhat (which has been processed by the NCBI >> into yet another format). >> >>> 2. Add other methods for in/between array normalisation, MA, >>> volcano plots >> >> Much of the core statistics and plotting can probably build on >> scipy and something like matplotlib? >> > Yes, that's what I was heading to. > >>> Yes, it is like reinventing the wheel, but I have been thinking of >>> porting this to python myself, this might not be good from the point >>> of view of a GSoC project however. >> >> This sounds like it *could* be the basis of a possible GSoC (provided >> suitable mentor(s) are available). Are you still eligible as a student? >> > > > Yes, this is my final year of integrated bachelors+masters program :( || :) > > Saket >> Regards, >> >> Peter Would anyone be interested in mentoring this? Saket From p.j.a.cock at googlemail.com Fri Feb 14 09:43:42 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 14 Feb 2014 09:43:42 +0000 Subject: [Biopython-dev] Fwd: [Open-bio-l] OBF GSoC 2014: Last call for project ideas and mentors In-Reply-To: References: Message-ID: Potential GSoC mentors, please think about this urgently! (This isn't the deadline for proposing project ideas, if we do get to take part in GSoC - but having more solid ideas on the webpage now will help with getting accepted as a GSoC organisation). 
Thanks,

Peter

---------- Forwarded message ----------
From: Eric Talevich
Date: Thu, Feb 13, 2014 at 9:21 PM
Subject: [Open-bio-l] OBF GSoC 2014: Last call for project ideas and mentors
To: open-bio-l at lists.open-bio.org

Folks,

The Google Summer of Code organization applications are due tomorrow. The
core of our application to Google is our list of project ideas, listed here:
http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas

The more ideas and designated mentors we have, the better our chance to get
funded students to work on these projects. So: If you have an idea, please
do post it to the project wiki today. Or, if you are willing to serve as a
mentor but do not have a specific project idea in mind, let us know.

Thanks!
Eric & Raoul
OBF GSoC 2014 admins
_______________________________________________
Open-Bio-l mailing list
Open-Bio-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/open-bio-l

From mok at bioxray.dk Tue Feb 18 01:02:44 2014
From: mok at bioxray.dk (Morten Kjeldgaard)
Date: Tue, 18 Feb 2014 02:02:44 +0100
Subject: [Biopython-dev] Bug in Bio.PDB?
Message-ID: <96CEC8D5-7D42-42CC-A9B8-32699C386AB0@bioxray.dk>

Hi,

I need to remove HETATMS and water molecules from a PDB file, so I was
playing around with the following code snippet (from [1]):

    for model in structure:
        for chain in model:
            for residue in chain:
                id = residue.id
                if id[0] != ' ':
                    chain.detach_child(id)

Unfortunately, it does not work correctly. If 2 HETATM residues are found
after each other, the second one is skipped. I assume the reason is that
model, chain, etc. really are generators, and they get screwed up if you
mess around with the datastructure while the loop is running.
Here is a bit of debugging output from a run with print statements
scattered appropriately in the above code:

    *** D *** (' ', 13, ' ')
    *** D *** (' ', 14, ' ')
    *** D *** (' ', 15, ' ')
    *** D *** ('H_H2U', 16, ' ')
    removing ('H_H2U', 16, ' ') from chain D
    *** D *** (' ', 18, ' ')
    *** D *** (' ', 19, ' ')
    *** D *** (' ', 20, ' ')

As you see, residue 16 is correctly identified as a HETATM residue,
however, the following residue 17 is skipped (it is also a H2U residue)
and so it is NOT removed from the structure.

The way to make the loop work is to squirrel away a list of HETATM
residues and detach them from the chain when the loop is finished.
(Another way is to keep running the snippet until no HETATMs are left.)

I am not sure whether to characterize this as a bug or a "feature", but it
is confusing and defeats the intuitive understanding of how the SMCRA
hierarchy objects ought to work.

(I am using Biopython version 1.59)

Cheers,
Morten

[1] http://pelican.rsvs.ulaval.ca/mediawiki/index.php/Manipulating_PDB_files_using_BioPython

--
Morten Kjeldgaard, asc. professor, MSc, PhD
Dept. of Molecular Biology and Genetics, Aarhus University
Gustav Wieds Vej 10C, Building 3135, DK-8000 Aarhus C, Denmark.

From anaryin at gmail.com Tue Feb 18 02:02:04 2014
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 18 Feb 2014 03:02:04 +0100
Subject: [Biopython-dev] Bug in Bio.PDB?
In-Reply-To: <96CEC8D5-7D42-42CC-A9B8-32699C386AB0@bioxray.dk>
References: <96CEC8D5-7D42-42CC-A9B8-32699C386AB0@bioxray.dk>
Message-ID:

Hi Morten,

This is not a "bug" per se, but a matter of changing the chain object while
you are iterating over it. You get the same effect with this loop:

>>> nums = [1,2,3,4,5,6,7,8,9,11,13,15]
>>> for i in nums:
...     if i % 2 == 0:
...         nums.remove(i)
...     print i,
...
1 2 4 6 8 11 13 15
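
For residue ids shaped like the ones in the debugging output (hetero flag, sequence number, insertion code), the same effect can be sketched as below. This is only a minimal illustration: the sample ids and both helper names are made up, and a plain list stands in for a chain so the snippet runs without Biopython installed. Looping over a snapshot taken with list(...) while removing from the original avoids the skipped element.

```python
# Hypothetical residue ids in the (hetero flag, sequence number, insertion code)
# layout seen in the debugging output; a plain list stands in for a chain.
RESIDUE_IDS = [
    (" ", 15, " "),
    ("H_H2U", 16, " "),
    ("H_H2U", 17, " "),
    (" ", 18, " "),
]


def strip_hetero_while_iterating(ids):
    """Remove hetero residues from the very list being looped over (the pitfall)."""
    ids = list(ids)  # private copy so the caller's list is left alone
    for rid in ids:
        if rid[0] != " ":
            # Removing shifts later items left, so the loop skips the next one.
            ids.remove(rid)
    return ids


def strip_hetero_from_snapshot(ids):
    """Loop over a snapshot and remove from the original list."""
    ids = list(ids)
    for rid in list(ids):  # list(...) takes the snapshot
        if rid[0] != " ":
            ids.remove(rid)
    return ids


print(strip_hetero_while_iterating(RESIDUE_IDS))  # residue 17 survives
print(strip_hetero_from_snapshot(RESIDUE_IDS))    # only blank-flag residues remain
```

With a real structure the equivalent would presumably be to loop over list(chain) and call chain.detach_child(residue.id) inside the loop; that variant is not tested here.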

The only options you have are to use chain.child_list (which does create a
copy of the list and it's safe to iterate on) or just save the ids to
remove and remove them a posteriori.

Cheers,

João

From anaryin at gmail.com Tue Feb 18 02:02:57 2014
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 18 Feb 2014 03:02:57 +0100
Subject: [Biopython-dev] Bug in Bio.PDB?
In-Reply-To:
References: <96CEC8D5-7D42-42CC-A9B8-32699C386AB0@bioxray.dk>
Message-ID:

Also, in addition, if you just want to save the PDB without these HETATMs,
use the Select class when saving with PDBIO.save.

2014-02-18 3:02 GMT+01:00 João Rodrigues:

> Hi Morten,
>
> This is not a "bug" per se, but a matter of changing the chain object
> while you are iterating over it. You get the same effect with this loop:
>
> >>> nums = [1,2,3,4,5,6,7,8,9,11,13,15]
> >>> for i in nums:
> ...     if i % 2 == 0:
> ...         nums.remove(i)
> ...     print i,
> ...
> 1 2 4 6 8 11 13 15
>
> The only options you have is to use chain.child_list (which does create a
> copy of the list and it's safe to iterate on) or just save the ids to
> remove and remove them a posteriori.
>
> Cheers,
>
> João
>

From p.j.a.cock at googlemail.com Tue Feb 18 14:17:48 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 18 Feb 2014 14:17:48 +0000
Subject: [Biopython-dev] Galaxy Tool Shed packages for Biopython
In-Reply-To:
References:
Message-ID:

On Fri, Sep 13, 2013 at 9:54 AM, Peter Cock wrote:
> Hi all,
>
> I've sent this to both the Galaxy and Biopython developers lists,
> and hope this will make sense to both groups. If you've not heard
> of Galaxy, start here: http://galaxyproject.org - while the easy to
> guess Biopython website is at http://biopython.org
>
> Brad Chapman and I are both Biopython core developers, and
> are also both on the "IUC" Galaxy Tool Shed committee because
> we've been quite involved in wrapping and writing tools for use
> on Galaxy.
>
> Fellow committee member Björn Grüning has done a
> lot of the hands-on work defining package definitions for
> dependencies within the Galaxy Tool Shed ecosystem -
> including defining them for Biopython, NumPy, SciPy,
> MatPlotLib, etc. We're very grateful for his hard work -
> most of which is now available under the IUC group
> account:
>
> http://toolshed.g2.bx.psu.edu/view/iuc/
> http://testtoolshed.g2.bx.psu.edu/view/iuc/
>
> The Biopython packages, however, are under a dedicated
> "biopython" account on the Galaxy Tool Shed to which
> currently Bjoern, Brad and I have access:
>
> http://toolshed.g2.bx.psu.edu/view/biopython/
> http://testtoolshed.g2.bx.psu.edu/view/biopython/
>
> This packaging work was initially tracked in Bjoern's own GitHub
> repository, https://github.com/bgruening/galaxytools/
>
> We (me, Brad and Bjoern) agreed that a Biopython owned
> repository would be more sensible in the long term, so I have
> created this and ported Bjoern's commits to it:
> https://github.com/biopython/galaxy_packages
>
> Currently the "Galaxy packagers" team on GitHub which
> has read and write access to this new repository is just
> me, Brad and Bjoern.
>
> Regards,
>
> Peter

Earlier today I belatedly set up Galaxy packages for Biopython 1.63, for both the main and test Galaxy ToolSheds:

http://toolshed.g2.bx.psu.edu/view/biopython/
http://testtoolshed.g2.bx.psu.edu/view/biopython/

This could perhaps become part of the Biopython release process in future?

Peter

From p.j.a.cock at googlemail.com Tue Feb 18 15:01:54 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 18 Feb 2014 15:01:54 +0000
Subject: [Biopython-dev] [galaxy-dev] Galaxy Tool Shed packages for Biopython
In-Reply-To: <53036D4E.5000208@gmail.com>
References: <53036D4E.5000208@gmail.com>
Message-ID:

On Tue, Feb 18, 2014 at 2:25 PM, Björn Grüning wrote:
> Thanks Peter!
>
> +1 for including that into the Biopython release process
>
> Cheers,
> Bjoern

How about we add step N+1 to the instructions, http://biopython.org/wiki/Building_a_release

"Ask Peter, Brad, or Bjoern to prepare a new Galaxy package on https://github.com/biopython/galaxy_packages and upload it to the main and test Galaxy ToolShed."

Regards,

Peter

From anaryin at gmail.com Wed Feb 19 14:54:13 2014
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 19 Feb 2014 15:54:13 +0100
Subject: [Biopython-dev] Future of Bio.PDB
Message-ID:

From another thread:

> As for what I suggested. Since my GSOC period, already 4 years ago.., I
> noticed that the PDB module is a bit messy in terms of organization. The
> module itself is named after the databank, which can be confused with the
> format name, the mmcif parser is defined inside a subfolder and there
> are application wrappers there too (DSSP, NACCESS). Besides this issue,
> which is not an issue at all and just my own pet peeve, there is a lot that
> the entire module could gain from a thorough revision. I've been using it
> very often and some normal manipulations of structures are not
> straightforward to carry out (calculating a center of mass for example,
> removing double occupancies) due to the parser being slow and quite memory
> hungry. In fact, trying to run the parser on a very large collection of
> structures often results in a random crash due to memory issues.
> I've been toying with a lot of changes, performance improvements, etc, but
> I'm not satisfied at all with them.. some things that I've been trying are to
> have the structure coordinates defined as a full numpy array instead of N
> arrays per structure (one per atom) or the usage of __slots__ to mitigate
> memory usage (managed to get it down 33% this way).
> This would also go in line with a suggestion from Eric a long time ago
> to make a Bio.Struct module which would be the perfect "playground" to
> implement and test these changes. Other developments that I think are
> worth looking into are for example making a nice library to link a
> parsed structure to the PDB database and fetch information on it using
> the REST services they provide.
> I'd like to hear your opinion (as in, everybody, developers and users) on
> this and if it makes sense to indeed give a bit of TLC to the Bio.PDB
> module. Also, on what changes you think should be carried out to improve
> the module, like which features are missing, which applications are worth
> wrapping.
> Just to kick off some discussion. Maybe a new thread should be opened for
> this later on.
> Cheers,
> João

As for the name of the module, yes, Bio.Struct is just the "legacy" name I remember.. Bio.structure would probably be better and more clear.

From eric.talevich at gmail.com Wed Feb 19 16:22:54 2014
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 19 Feb 2014 08:22:54 -0800
Subject: [Biopython-dev] Future of Bio.PDB
In-Reply-To:
References:
Message-ID:

On Wed, Feb 19, 2014 at 6:54 AM, João Rodrigues wrote:
> From another thread:
>
> > As for what I suggested. Since my GSOC period, already 4 years ago.., I
> > noticed that the PDB module is a bit messy in terms of organization. The
> > module itself is named after the databank, which can be confused with the
> > format name, the mmcif parser is defined inside a subfolder and there
> > are application wrappers there too (DSSP, NACCESS). Besides this issue,
> > which is not an issue at all and just my own pet peeve, there is a lot that
> > the entire module could gain from a thorough revision. I've been using it
> > very often and some normal manipulations of structures are not
> > straightforward to carry out (calculating a center of mass for example,
> > removing double occupancies) due to the parser being slow and quite memory
> > hungry. In fact, trying to run the parser on a very large collection of
> > structures often results in a random crash due to memory issues.
> > I've been toying with a lot of changes, performance improvements, etc, but
> > I'm not satisfied at all with them.. some things that I've been trying are to
> > have the structure coordinates defined as a full numpy array instead of N
> > arrays per structure (one per atom) or the usage of __slots__ to mitigate
> > memory usage (managed to get it down 33% this way). This would also go in
> > line with a suggestion from Eric a long time ago to make a Bio.Struct
> > module which would be the perfect "playground" to implement and test these
> > changes. Other developments that I think are worth looking into are for
> > example making a nice library to link a parsed structure to the PDB
> > database and fetch information on it using the REST services they provide.
> > I'd like to hear your opinion (as in, everybody, developers and users) on
> > this and if it makes sense to indeed give a bit of TLC to the Bio.PDB
> > module. Also, on what changes you think should be carried out to improve
> > the module, like which features are missing, which applications are worth
> > wrapping.
> > Just to kick off some discussion. Maybe a new thread should be opened for
> > this later on.
> > Cheers,
> > João
>
> As for the name of the module, yes, Bio.Struct is just the "legacy" name I
> remember.. Bio.structure would probably be better and more clear.
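As a rough illustration of the two memory ideas in the quoted message (coordinates kept in one shared table instead of one array per atom, and `__slots__` on the per-atom objects), here is a minimal sketch. It is not Bio.PDB code: CoordinateTable, Atom and center_of_mass are hypothetical names, the masses are made up for the example, and the standard-library array module stands in for a NumPy array so that the snippet runs without extra dependencies.

```python
from array import array


class CoordinateTable:
    """Flat x, y, z storage shared by all atoms (stand-in for one N x 3 array)."""

    __slots__ = ("_data",)

    def __init__(self):
        self._data = array("d")

    def add(self, x, y, z):
        """Append one coordinate triple and return its row index."""
        self._data.extend((x, y, z))
        return len(self._data) // 3 - 1

    def row(self, index):
        """Return the (x, y, z) tuple stored at the given row index."""
        start = 3 * index
        return tuple(self._data[start:start + 3])


class Atom:
    """Atom that keeps a row index into the shared table, not its own array."""

    # __slots__ means instances carry no per-object __dict__.
    __slots__ = ("name", "mass", "_table", "_row")

    def __init__(self, name, mass, table, coord):
        self.name = name
        self.mass = mass
        self._table = table
        self._row = table.add(*coord)

    @property
    def coord(self):
        return self._table.row(self._row)


def center_of_mass(atoms):
    """Mass-weighted mean position of a list of atoms."""
    total = sum(atom.mass for atom in atoms)
    return tuple(
        sum(atom.mass * atom.coord[axis] for atom in atoms) / total
        for axis in range(3)
    )


table = CoordinateTable()
atoms = [
    Atom("CA", 12.0, table, (0.0, 0.0, 0.0)),   # illustrative masses only
    Atom("CB", 12.0, table, (2.0, 4.0, 6.0)),
]
com = center_of_mass(atoms)
```

With NumPy the table would presumably be a single N x 3 array and `coord` a view of one row; the 33% saving mentioned in the quoted message is João's own measurement and is not reproduced by this sketch.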

The p3d folks once offered to incorporate their work into Biopython:
http://www.biomedcentral.com/1471-2105/10/258

We had concerns about having p3d and Bio.PDB coexisting within Biopython, but if someone wanted to emulate the Bio.PDB API on top of p3d, or otherwise slip p3d's secret sauce into the Bio.PDB internals, that would do the trick. (I have not thought about the details of how this would work at all.) I think it should also be possible to replace p3d's custom query language with the sort of tricks Bio.Phylo, pandas and SQLAlchemy do with keyword arguments and generators to get the same results with Python syntax.

Alternatively, there is the option of sticking with the Bio.PDB namespace and adding only "read", "write" and "convert" functions to Bio/PDB/__init__.py to make the basic usage of the module more similar to the other Biopython sub-packages. The Model class could store one or several NumPy arrays that cover all atom coordinates, and the Chain, Residue, Atom and Interface classes would probably just store references to that array, e.g. a shorter 1D array of integer row indexes.

Would either of these internal changes make it easier to apply the GSoC work that's been done on Bio.PDB?

-Eric

From davidjosephcain at gmail.com Wed Feb 19 16:35:56 2014
From: davidjosephcain at gmail.com (David Cain)
Date: Wed, 19 Feb 2014 11:35:56 -0500
Subject: [Biopython-dev] Future of Bio.PDB
In-Reply-To:
References:
Message-ID:

I frequently make use of Bio.PDB, and agree wholeheartedly that certain aspects of it are very dated, or haphazardly organized.

The module as a whole would benefit greatly from some extra attention. I'm happy to lend a hand in whatever revamp takes place.

David Cain
+1 (339) 222 4452

From p.j.a.cock at googlemail.com Wed Feb 19 16:41:28 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 19 Feb 2014 16:41:28 +0000
Subject: [Biopython-dev] Future of Bio.PDB
In-Reply-To:
References:
Message-ID:

On Wed, Feb 19, 2014 at 4:35 PM, David Cain wrote:
> I frequently make use of Bio.PDB, and agree wholeheartedly that certain
> aspects of it are very dated, or haphazardly organized.
>
> The module as a whole would benefit greatly from some extra attention. I'm
> happy to lend a hand in whatever revamp takes place.
>
> David Cain
> +1 (339) 222 4452

Very true, and thanks for the offer.

If we go with the parallel namespace option (Bio.Struct, Bio.structure or similar) then we can stick an experimental warning on it and include it as a 'beta' module within the next release (while continuing to keep Bio.PDB fully backward compatible for now, with the likely goal of a formal deprecation in future).

Peter

From anaryin at gmail.com Wed Feb 19 16:50:56 2014
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 19 Feb 2014 17:50:56 +0100
Subject: [Biopython-dev] Future of Bio.PDB
In-Reply-To:
References: