From 88whacko at gmail.com Sat Mar 2 11:49:48 2013 From: 88whacko at gmail.com (Andrea Rizzi) Date: Sat, 2 Mar 2013 17:49:48 +0100 Subject: [Biopython-dev] New contributor Message-ID: Hello! My name is Andrea Rizzi and I'm a master's student in computer science and computational biology. I would be glad to help you developing biopython. I've used the library quite extensively but I'm mostly familiar with handling sequences, MSAs and PDB files. I've read through the small contributing guide on the wiki and on the tutorial and I thought I could start with something relatively straightforward like writing/completing some unit tests (if I understood correctly there's a fairly strong need for them). I've good knowledge of both git and unittest. Anyway any task is actually fine to me :) . If you agree I'll try to look for a module that needs some more testing (or maybe you have one to suggest me), otherwise I could just go to the bug tracker and try to help out fixing some bugs. -- -- Andrea From p.j.a.cock at googlemail.com Sun Mar 3 07:00:25 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 3 Mar 2013 12:00:25 +0000 Subject: [Biopython-dev] Fwd: GSoC 2013 is ON In-Reply-To: <20130303112326.GA5638@thebird.nl> References: <20130303112326.GA5638@thebird.nl> Message-ID: Time to start preparations for Google Summer of Code 2013 :) ---------- Forwarded message ---------- From: *Pjotr Prins* Date: Sunday, March 3, 2013 Subject: GSoC 2013 is ON Game on! GSoC 2013 is ON. I am running with the OBF project administration this year for the Google Summer of code (GSoC). First and foremost I want to thank Robert Buels and others for making OBF/GSoC a success in the previous three years! This year, Robert, Chris Fields and Hilmar Lapp will act as backup administrators. The deadline for the OBF application for GSoC2013 as a mentoring organisation is Friday March 29! 
See http://www.google-melange.com/gsoc/events/google/gsoc2013

Similar to previous years, each Bio* project needs to update and add project ideas on the project's individual OBF wiki page and create links from the main OBF page at http://www.open-bio.org/wiki/Google_Summer_of_Code (we will update the main information on that page soon).

So, for each of the OBF projects that wants to do GSoC again this year:

1. Update the list of project ideas on your project's GSoC page (BioPython, BioPerl, BioRuby, etc). Add new ones, remove ones that have already been done or no longer relevant, etc. For an example see http://bioruby.open-bio.org/wiki/Google_Summer_of_Code
2. Update the final list of project ideas on the main OBF GSoC page to match. http://www.open-bio.org/wiki/Google_Summer_of_Code
3. Register with gsoc at lists.open-bio.org
4. Announce it on that list when you are ready :)

Anyone can submit a project idea! Former GSoC students are especially encouraged to contribute ideas to the mailing lists.

Please have the updates done by Friday March 22nd. The number and quality of the project ideas are part of the evaluation process for whether OBF is accepted as a Summer of Code organisation again this year, so let's come up with some good ones!

Pj. (Pjotr Prins)

Important dates:

* March 22nd: Finalise project ideas
* March 29th: Deadline OBF mentoring organisation submission to Google

http://www.open-bio.org/wiki/Google_Summer_of_Code

From saketkc at gmail.com Mon Mar 4 05:59:26 2013
From: saketkc at gmail.com (Saket Choudhary)
Date: Mon, 4 Mar 2013 16:29:26 +0530
Subject: [Biopython-dev] BWA Wrapper
In-Reply-To:
References:
Message-ID:

Hi,

I have updated the code here: https://github.com/saketkc/biopython/tree/bwa_wrapper

I have added unittests for the wrapper. And yes, this did help me in fixing a lot of minor bugs in my original wrapper.

@Peter : Is this 'pull request' ready?
Thanks Saket On 19 February 2013 19:55, Peter Cock wrote: > On Tue, Feb 19, 2013 at 1:15 PM, Saket Choudhary wrote: >> >> Thanks Peter. >> >> I will add that. Any pointers to what would be a good reference test_aba.py >> file in Tests/ directory for writing unit tests for this ? >> >> I have worked on BDD before but Unit Tests are new for me, so it may take >> some time.I plan to finish it the coming week once my university >> examinations are done >> >> Thanks >> >> Saket > > There's a chapter in the Tutorial about our test framework. In this > case existing command line tool wrappers are the best reference, > e.g. test_Emboss.py or test_Muscle.py > > Also if you want to use doctests and have them included in the > test suite, add the module to the list in Tests/run_tests.py - however > this does not handle optional dependencies (other than NumPy). > Therefore all the application wrapper doctests to date have carefully > avoided actually invoking the command line - and instead most > print the string representation instead. This allows us to check > the example use cases should run (and catches silly errors in > the examples like a typo in an argument name). > > Thanks, > > Peter From saketkc at gmail.com Tue Mar 5 12:26:57 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Tue, 5 Mar 2013 22:56:57 +0530 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: I had this idea of an online biopython shell on the lines of bioruby shell : http://bioruby.open-bio.org/wiki/BioRubyOnRails On 13 February 2013 07:38, Michiel de Hoon wrote: > It would be great to have better support for microarray analysis in Biopython. Something like lumi/limma in R. Perhaps this is an option for the GSoC? > > Best, > -Michiel. 
> > --- On Tue, 2/12/13, Peter Cock wrote: > >> From: Peter Cock >> Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) >> To: "Biopython-Dev Mailing List" >> Date: Tuesday, February 12, 2013, 12:51 PM >> Hello all, >> >> Google recently confirmed they will be running Google Summer >> of Code 2013, >> and we (Biopython and the other Bio* projects) would hope to >> be accepted again >> under the Open Bioinformatics Foundation as in previous >> years: >> http://lists.open-bio.org/pipermail/gsoc/2013/000196.html >> >> It would be great to start coming up with potential project >> ideas, both larger >> pieces of work suitable for GSoC but also smaller tasks for >> other project >> students, or 'low hanging fruit' for potential contributors >> to cut >> their teeth on. >> >> See also http://biopython.org/wiki/Active_projects >> and the ideas list there. >> >> Regards, >> >> Peter >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Fri Mar 8 11:08:46 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 8 Mar 2013 16:08:46 +0000 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Tue, Mar 5, 2013 at 5:26 PM, Saket Choudhary wrote: > I had this idea of an online biopython shell on the lines of bioruby shell : > http://bioruby.open-bio.org/wiki/BioRubyOnRails > That screenshot makes me think of http://ipython.org/ - is that similar? 
Peter

From redmine at redmine.open-bio.org Fri Mar 8 11:49:48 2013
From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org)
Date: Fri, 8 Mar 2013 16:49:48 +0000
Subject: [Biopython-dev] [Biopython - Bug #3422] (New) Missing
Message-ID:

Issue #3422 has been reported by Jared Sampson.

----------------------------------------
Bug #3422: Missing
https://redmine.open-bio.org/issues/3422

Author: Jared Sampson
Status: New
Priority: Normal
Assignee:
Category:
Target version:
URL: http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/bookdoc_130101.dtd

When using Entrez.efetch to retrieve an XML file, I get the following warning about a missing DTD: bookdoc_130101.dtd

===
/path/to/my/virtualenv/lib/python2.7/site-packages/Bio/Entrez/Parser.py:522: UserWarning: Unable to load DTD file bookdoc_130101.dtd.

Bio.Entrez uses NCBI's DTD files to parse XML files returned by NCBI Entrez.
Though most of NCBI's DTD files are included in the Biopython distribution,
sometimes you may find that a particular DTD file is missing. While we can
access the DTD file through the internet, the parser is much faster if the
required DTD files are available locally.

For this purpose, please download bookdoc_130101.dtd from

http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/bookdoc_130101.dtd

and save it either in directory

/path/to/my/virtualenv/lib/python2.7/site-packages/Bio/Entrez/DTDs

or in directory

/Users/me/.biopython/Bio/Entrez/DTDs

in order for Bio.Entrez to find it.

Alternatively, you can save bookdoc_130101.dtd in the directory
Bio/Entrez/DTDs in the Biopython distribution, and reinstall Biopython.

Please also inform the Biopython developers about this missing DTD, by
reporting a bug on http://bugzilla.open-bio.org/ or sign up to our mailing
list and emailing us, so that we can include it with the next release of
Biopython.

Proceeding to access the DTD file through the internet...

  warnings.warn(message)
===

Also, the bugzilla.open-bio.org URL mentioned comes up empty.
Thanks, Jared Sampson ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From saketkc at gmail.com Fri Mar 8 13:30:03 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Sat, 9 Mar 2013 00:00:03 +0530 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: It is essentially an online RoR based application that allows you to try bioruby through your browser without the need of a bioruby native install . I was thinking of a django/flask application that would essentially be a playground for trying out biopython Saket On 08/03/2013, Peter Cock wrote: > On Tue, Mar 5, 2013 at 5:26 PM, Saket Choudhary wrote: >> I had this idea of an online biopython shell on the lines of bioruby >> shell : >> http://bioruby.open-bio.org/wiki/BioRubyOnRails >> > > That screenshot makes me think of http://ipython.org/ - is that similar? > > Peter > From chapmanb at 50mail.com Sat Mar 9 11:06:34 2013 From: chapmanb at 50mail.com (Brad Chapman) Date: Sat, 09 Mar 2013 11:06:34 -0500 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: <87wqtgeewl.fsf@fastmail.fm> Saket and Peter; What you're describing is what Ipython provides, a web-based way to edit and interact with Python code. 
There are some projects that build on top of it to provide more of a playground environment like you're describing:

http://continuum.io/wakari.html
https://github.com/Exhibitionist/Exhibitionist

Hope this helps,
Brad

> It is essentially an online RoR based application that allows you to
> try bioruby through your browser without the need of a bioruby native
> install . I was thinking of a django/flask application that would
> essentially be a playground for trying out biopython
>
>
> Saket
>
> On 08/03/2013, Peter Cock wrote:
>> On Tue, Mar 5, 2013 at 5:26 PM, Saket Choudhary wrote:
>>> I had this idea of an online biopython shell on the lines of bioruby
>>> shell :
>>> http://bioruby.open-bio.org/wiki/BioRubyOnRails
>>>
>>
>> That screenshot makes me think of http://ipython.org/ - is that similar?
>>
>> Peter
>>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev

From xbello at gmail.com Tue Mar 12 05:36:35 2013
From: xbello at gmail.com (Xabier Bello)
Date: Tue, 12 Mar 2013 10:36:35 +0100
Subject: [Biopython-dev] Consumer of "KW" in embl format
Message-ID:

Hi:

I don't know if this is the right way to do this. The code:

    records = SeqIO.parse(open("MyFile.embl", "r"), "embl")
    for record in records:
        print record.annotations["keywords"]

Doesn't work

I've added to Bio/GenBank/Scanner.py, in _feed_header_lines():

    elif line_type == 'KW':
        consumer.keywords(data.rstrip(";"))

And now it seems to parse the keyword lines.

Regards.

From p.j.a.cock at googlemail.com Tue Mar 12 05:54:51 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 12 Mar 2013 09:54:51 +0000
Subject: [Biopython-dev] Consumer of "KW" in embl format
In-Reply-To:
References:
Message-ID:

On Tue, Mar 12, 2013 at 9:36 AM, Xabier Bello wrote:
> Hi:
>
> I don't know if this is the right way to do this.
The code: > > records = SeqIO.parse(open("MyFile.embl", "r"), "embl") > for record in records: > print record.annotations["keywords"] > > Doesn't work > > I've added to Bio/GenBank/Scanner.py, in _feed_header_lines(): > > elif line_type == 'KW': > consumer.keywords(data.rstrip(";")) > > And now it seems to parse the keyword lines. > > Regards. Good idea, although it needs a little more generalisation for handling multiple keywords - a list of strings seems sensible here. Quoting ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt 3.4.6 The KW Line The KW (KeyWord) lines provide information which can be used to generate cross-reference indexes of the sequence entries based on functional, structural, or other categories deemed important. The format for a KW line is: KW keyword[; keyword ...]. More than one keyword may be listed on each KW line; the keywords are separated by semicolons, and the last keyword is followed by a full stop. Keywords may consist of more than one word, and they may contain embedded blanks and stops. A keyword is never split between lines. An example of a keyword line is: KW beta-glucosidase. The keywords are ordered alphabetically; the ordering implies no hierarchy of importance or function. If an entry has no keywords assigned to it, it will contain a single KW line like this: KW . Likewise the GenBank parser should support the KEYWORDS line too - and then writing the keywords out again too. Is this something you'd like to work on, or should I do it? (If you are interested in getting involved in Biopython development this seems like a nice project to start with - not too complicated, but large enough to make creating a fork on GitHub and your own enhancement branch a good idea.) 
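A minimal sketch of the multi-keyword handling described above, assuming only the KW line rules quoted from the EMBL user manual (semicolon separators, a full stop after the last keyword, "KW ." for no keywords). The helper name parse_kw_lines is hypothetical and exists only for illustration; it is not the code in Bio/GenBank/Scanner.py and not necessarily how the fix was committed.

```python
def parse_kw_lines(lines):
    """Collect keywords from EMBL 'KW' lines into a list of strings.

    Hypothetical helper for illustration only; not part of Biopython.
    """
    keywords = []
    for line in lines:
        if not line.startswith("KW"):
            continue  # ignore every other line type
        # Drop the two-letter line code and the padding after it.
        data = line[2:].strip()
        # The last keyword of an entry is followed by a full stop.
        # Continuation lines end with ';' instead, so only one
        # trailing stop is ever removed.
        if data.endswith("."):
            data = data[:-1]
        # Keywords may contain blanks and stops, so split on ';' only.
        for keyword in data.split(";"):
            keyword = keyword.strip()
            if keyword:
                keywords.append(keyword)
    return keywords


example = [
    "KW   antifreeze protein homology; cold-regulated gene;",
    "KW   cor6.6 gene; KIN1 homology.",
]
keywords = parse_kw_lines(example)
```

Splitting on semicolons rather than whitespace keeps multi-word keywords such as 'KIN1 homology' intact, and an entry with no keywords yields an empty list.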
Thanks, Peter From p.j.a.cock at googlemail.com Tue Mar 12 06:02:15 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 12 Mar 2013 10:02:15 +0000 Subject: [Biopython-dev] Consumer of "KW" in embl format In-Reply-To: References: Message-ID: On Tue, Mar 12, 2013 at 9:54 AM, Peter Cock wrote: > On Tue, Mar 12, 2013 at 9:36 AM, Xabier Bello wrote: >> Hi: >> >> I don't know if this is the right way to do this. The code: >> >> records = SeqIO.parse(open("MyFile.embl", "r"), "embl") >> for record in records: >> print record.annotations["keywords"] >> >> Doesn't work >> >> I've added to Bio/GenBank/Scanner.py, in _feed_header_lines(): >> >> elif line_type == 'KW': >> consumer.keywords(data.rstrip(";")) >> >> And now it seems to parse the keyword lines. >> >> Regards. > > Good idea, although it needs a little more generalisation for handling > multiple keywords - a list of strings seems sensible here. Quoting > ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt > > > 3.4.6 The KW Line > The KW (KeyWord) lines provide information which can be used to generate > cross-reference indexes of the sequence entries based on functional, > structural, or other categories deemed important. > The format for a KW line is: > KW keyword[; keyword ...]. > More than one keyword may be listed on each KW line; the keywords are > separated by semicolons, and the last keyword is followed by a full > stop. Keywords may consist of more than one word, and they may contain > embedded blanks and stops. A keyword is never split between lines. > An example of a keyword line is: > KW beta-glucosidase. > The keywords are ordered alphabetically; the ordering implies no hierarchy > of importance or function. If an entry has no keywords assigned to it, > it will contain a single KW line like this: > KW . > > > Likewise the GenBank parser should support the KEYWORDS line > too - and then writing the keywords out again too. > > Is this something you'd like to work on, or should I do it? 
To clarify - Biopython should already be reading and writing any KEYWORDS line in GenBank files - the same data structure should be used for EMBL files (your suggestion looks good, but an explicit unit test covering single and multiple keywords would be ideal), and then the EMBL writer updated to write this. i.e. code added in Bio/SeqIO/InsdcIO.py Peter From p.j.a.cock at googlemail.com Tue Mar 12 06:58:39 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 12 Mar 2013 10:58:39 +0000 Subject: [Biopython-dev] Consumer of "KW" in embl format In-Reply-To: References: Message-ID: On Tue, Mar 12, 2013 at 10:12 AM, Xabier Bello wrote: > I think I'm not that close to the Biopython code. > > I found a problem (I needed to read the Keywords), and solved it quick and > dirty. In fact, it doesn't read multiline KW. I'm not sure I could implement > that in a fair amount of time. > > Regards. No problem - I've committed your fix, a basic test, and extended this for multiple KW lines. As discussed I've thanked you in the NEWS file too. https://github.com/biopython/biopython/commit/fc036dcdac22252a366647823a0c7c317c303313 https://github.com/biopython/biopython/commit/606ea9360d262d21c3e01eda66c4cf9118880d46 Updating the EMBL writer in Bio/SeqIO/InsdcIO.py should be a nice small task for any volunteer wanting to make a first contribution... (Potential Google Summer of Code students - Hint hint ;) ) Thank you Xabier, Peter From p.j.a.cock at googlemail.com Tue Mar 12 10:40:16 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 12 Mar 2013 14:40:16 +0000 Subject: [Biopython-dev] Consumer of "KW" in embl format In-Reply-To: References: Message-ID: On Tue, Mar 12, 2013 at 11:35 AM, Xabier Bello wrote: > Lets try it: > > Line 997: > def _write_keywords(self, record): > #Put the keywords right after DE line. > self._write_multi_line("KW", "%s." 
% "; ".join( > record.annotations["keywords"])) > self.handle.write("XX\n") Looks good - although there is a potential problem here with long keywords where this does not avoid splitting a single keyword over multiple KW lines (as specified in the EMBL specification). This is a corner case though... > Line 1070: > if "keywords" in record.annotations: > self._write_keywords(record) > > Note to self: learn to make diff patches and forks on github. Good plan :) Meanwhile, I committed that change: https://github.com/biopython/biopython/commit/41470eac55a665d1cb1c7e73ebfd3c1df98af5ad I added a little more testing, from which I think we may need to do some work with some of the other EMBL fields like dbxrefs: https://github.com/biopython/biopython/commit/07639dde32083f4f024616292a5c736e85770a4e Thanks, Peter From p.j.a.cock at googlemail.com Tue Mar 12 11:13:23 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 12 Mar 2013 15:13:23 +0000 Subject: [Biopython-dev] Consumer of "KW" in embl format In-Reply-To: References: Message-ID: On Tue, Mar 12, 2013 at 2:40 PM, Peter Cock wrote: > On Tue, Mar 12, 2013 at 11:35 AM, Xabier Bello wrote: >> Lets try it: >> >> Line 997: >> def _write_keywords(self, record): >> #Put the keywords right after DE line. >> self._write_multi_line("KW", "%s." % "; ".join( >> record.annotations["keywords"])) >> self.handle.write("XX\n") > > Looks good - although there is a potential problem here with long keywords > where this does not avoid splitting a single keyword over multiple KW lines > (as specified in the EMBL specification). This is a corner case though... OK, not such a rare case: $ python test_SeqIO_features.py ... 
======================================================================
ERROR: test_cor6 (__main__.TestWriteRead)
Write and read back cor6_6.gb
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_SeqIO_features.py", line 1105, in test_cor6
    write_read(os.path.join("GenBank", "cor6_6.gb"), "gb")
  File "test_SeqIO_features.py", line 35, in write_read
    compare_records(gb_records, gb_records2)
  File "test_SeqIO_features.py", line 110, in compare_records
    if not compare_record(old,new,expect_minor_diffs):
  File "test_SeqIO_features.py", line 101, in compare_record
    % (key, old.annotations[key], new.annotations[key]))
ValueError: Annotation mis-match for keywords:
['antifreeze protein homology', 'cold-regulated gene', 'cor6.6 gene', 'KIN1 homology']
['antifreeze protein homology', 'cold-regulated gene', 'cor6.6 gene', 'KIN1', 'homology']
----------------------------------------------------------------------

I'll fix this later today...

Peter

From clements at galaxyproject.org Tue Mar 12 18:01:49 2013
From: clements at galaxyproject.org (Dave Clements)
Date: Tue, 12 Mar 2013 15:01:49 -0700
Subject: [Biopython-dev] 2013 Galaxy Community Conference (GCC2013), 30 June - 2 July, Oslo
Message-ID:

Hello all,

We are pleased to announce that early registration and paper and poster abstract submission are now open for the 2013 Galaxy Community Conference (GCC2013). GCC2013 will be held 30 June through July 2 in Oslo, Norway, at the University of Oslo.

GCC2013 is an opportunity to participate in two full days of presentations, discussions, poster sessions, keynotes, lightning talks and breakouts, *all about high-throughput biology and the tools that support it*. The conference also includes a Training Day for the second year in a row, this year with more in-depth topic coverage, more concurrent sessions, and more topics.
If you are a biologist or bioinformatician performing or enabling high-throughput biological research, then please consider attending. GCC2013 is aimed at:

- Bioinformatics tool developers and data providers
- Workflow developers and power bioinformatics users
- Sequencing and Bioinformatics core staff
- Data archival and analysis reproducibility specialists

*Early registration* *saves up to 75% off regular registration costs,* and is very affordable, with combined registration (Training Day + main meeting) starting at ~ ?95 for post-docs and students. Registering early also assures you a spot in the Training Day workshops you want to attend. Once a Training Day session becomes full, it will be closed to new registrations. Early registration closes 24 May.

*Abstract submission* for oral presentations closes 12 April, and for posters on 3 May. Please consider presenting your work. If you are working with big biological data, then the people at this meeting want to hear about your work.

Thanks, and hope to see you in Oslo!

The GCC2013 Organizing Committee

PS: And please help get the word out!

--
http://galaxyproject.org/GCC2013
http://galaxyproject.org/
http://getgalaxy.org/
http://usegalaxy.org/
http://wiki.galaxyproject.org/

From anaryin at gmail.com Wed Mar 13 07:09:29 2013
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 13 Mar 2013 12:09:29 +0100
Subject: [Biopython-dev] [Biopython] Updating GSOC page?
In-Reply-To:
References:
Message-ID:

Hello all,

I updated the GSOC page on the wiki to be more organized:
http://biopython.org/wiki/GSOC

If no one opposes, I'll replace the current page (here) with it, just in time for GSOC 2013.

Best,

João

PS. sorry for the spamming but I posted this 5 days ago in the non dev list and got no answers so..

2013/3/8 João Rodrigues

> Small update: http://biopython.org/wiki/GSOC
>
> If ok, We can just link the normal one for this one. I kept it separate
> just in case.
> > > 2013/3/4 Peter Cock > >> On Sun, Mar 3, 2013 at 11:07 PM, Jo?o Rodrigues >> wrote: >> > Hello all, >> > >> > Does any oppose to a refreshment of our GSOC >> > pagebased on the >> > BioRuby >> > page ? It >> could use >> > a facelift before the new round of projects/students come in. >> > >> > Best, >> > >> > Jo?o >> >> A good idea - see also the GSoC discussions on the biopython-dev >> list about potential project ideas. >> >> Thanks, >> >> Peter >> > > From mikael.trellet at gmail.com Wed Mar 13 07:17:17 2013 From: mikael.trellet at gmail.com (Mikael Trellet) Date: Wed, 13 Mar 2013 12:17:17 +0100 Subject: [Biopython-dev] [Biopython] Updating GSOC page? In-Reply-To: References: Message-ID: It's well-formated and looks nice for me, the improvement from the former one is signifcant so I would agree to update the page. Good work ;) Mikael On Wed, Mar 13, 2013 at 12:09 PM, Jo?o Rodrigues wrote: > Hello all, > > I updated the GSOC page on the wiki to be more organized: > http://biopython.org/wiki/GSOC > > If no one opposes, I'll replace the current page > (here) > with it, just in time for GSOC 2013. > > Best, > > Jo?o > > PS. sorry for the spamming but I posted this 5 days ago in the non dev list > and got no answers so.. > > > 2013/3/8 Jo?o Rodrigues > > > Small update: http://biopython.org/wiki/GSOC > > > > If ok, We can just link the normal one for this one. I kept it separate > > just in case. > > > > > > 2013/3/4 Peter Cock > > > >> On Sun, Mar 3, 2013 at 11:07 PM, Jo?o Rodrigues > >> wrote: > >> > Hello all, > >> > > >> > Does any oppose to a refreshment of our GSOC > >> > pagebased on the > >> > BioRuby > >> > page ? It > >> could use > >> > a facelift before the new round of projects/students come in. > >> > > >> > Best, > >> > > >> > Jo?o > >> > >> A good idea - see also the GSoC discussions on the biopython-dev > >> list about potential project ideas. 
> >> > >> Thanks, > >> > >> Peter > >> > > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- -------------------------------------------- Mikael TRELLET, - Groupe VENISE, CNRS LIMSI 91403 Orsay CEDEX - LBT/IBPC, 75005 Paris France +33650607172 From p.j.a.cock at googlemail.com Wed Mar 13 08:04:28 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 13 Mar 2013 12:04:28 +0000 Subject: [Biopython-dev] [Biopython] Updating GSOC page? In-Reply-To: References: Message-ID: On Wed, Mar 13, 2013 at 11:09 AM, Jo?o Rodrigues wrote: > Hello all, > > I updated the GSOC page on the wiki to be more organized: > http://biopython.org/wiki/GSOC > > If no one opposes, I'll replace the current page > (here) > with it, just in time for GSOC 2013. > > Best, > > Jo?o Sounds sensible, and you can set a direct on GSOC to Google_Summer_of_Code by replacing the content with: #REDIRECT [[link]] Peter From anaryin at gmail.com Wed Mar 13 09:22:23 2013 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 13 Mar 2013 14:22:23 +0100 Subject: [Biopython-dev] [Biopython] Updating GSOC page? In-Reply-To: References: Message-ID: Done, thanks. http://biopython.org/wiki/Google_Summer_of_Code http://biopython.org/wiki/GSOC 2013/3/13 Peter Cock > On Wed, Mar 13, 2013 at 11:09 AM, Jo?o Rodrigues > wrote: > > Hello all, > > > > I updated the GSOC page on the wiki to be more organized: > > http://biopython.org/wiki/GSOC > > > > If no one opposes, I'll replace the current page > > (here) > > with it, just in time for GSOC 2013. 
> > > > Best, > > > > Jo?o > > Sounds sensible, and you can set a direct on GSOC to > Google_Summer_of_Code by replacing the content with: > > #REDIRECT [[link]] > > Peter > From mjldehoon at yahoo.com Wed Mar 13 10:44:55 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 13 Mar 2013 07:44:55 -0700 (PDT) Subject: [Biopython-dev] New contributor In-Reply-To: Message-ID: <1363185895.3324.YahooMailClassic@web164005.mail.gq1.yahoo.com> Hi Andrea, Welcome to Biopython! It's great that you want to contribute. Writing & finishing some unit tests sounds like a good idea, and of course bug fixing is always welcome. Other options are to look at orphan modules in Biopython (modules without active maintainers, without documentation, or without unit tests). Once you decide what specifically you want to work on, it's good to let us know on the mailing list, to see if anybody else is working on the same. Good luck! Best, -Michiel. --- On Sat, 3/2/13, Andrea Rizzi <88whacko at gmail.com> wrote: > From: Andrea Rizzi <88whacko at gmail.com> > Subject: [Biopython-dev] New contributor > To: biopython-dev at biopython.org > Date: Saturday, March 2, 2013, 11:49 AM > Hello! > My name is Andrea Rizzi and I'm a master's student in > computer science and > computational biology. I would be glad to help you > developing biopython. > I've used the library quite extensively but I'm mostly > familiar with > handling sequences, MSAs and PDB files. > > I've read through the small contributing guide on the wiki > and on the > tutorial and I thought I could start with something > relatively > straightforward like writing/completing some unit tests (if > I understood > correctly there's a fairly strong need for them). I've good > knowledge of > both git and unittest. Anyway any task is actually fine to > me :) . 
> > If you agree I'll try to look for a module that needs some > more testing (or > maybe you have one to suggest me), otherwise I could just go > to the bug > tracker and try to help out fixing some bugs. > > -- > -- Andrea > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From eric.talevich at gmail.com Wed Mar 13 14:32:25 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 13 Mar 2013 14:32:25 -0400 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Tue, Feb 12, 2013 at 9:08 PM, Michiel de Hoon wrote: > It would be great to have better support for microarray analysis in > Biopython. Something like lumi/limma in R. Perhaps this is an option for > the GSoC? > > Best, > -Michiel. > I like Michiel's idea, and I'll suggest two more: 1. Codon alignment & analysis: - PAL2NAL-style conversion of unaligned nucleic acid sequences and a protein sequence alignment to a codon alignment. (Previously discussed) - dN/dS and the related functions needed to calculate it. - Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage of codon alignments, including validation (testing for frame shifts etc.) 2. Phylo enhancements: 2a. Tree drawing: - A proper draw_unrooted function to perform radial layout, with an optional "iterations" argument to use Felsenstein's Equal Daylight algorithm -- I feel this layout approach is neglected in most libraries. - Better matplotlib/pylab integration, so the plot components can be tweaked using matplotlib functions. - Other common layout approaches, e.g. circular. 2b. A "Phylo.consensus" module: - strict consensus, like Bio.Nexus already implements. - other consensus methods, time permitting. 2c. 
A "Phylo.distance" module: - Robinson-Foulds distance -- though others might be working on this already. 2d. Simple tree inference: - Straightforward algorithms exist for neighbor-joining and parsimony tree estimation. For small alignments (and perhaps medium-sized ones with PyPy), it would be nice to run these without an external program, e.g. to construct a guide tree for another algorithm or quickly view a phylogenetic clustering of sequences. Any interest in either of these? Shall I add them to the wiki? -Eric --- On Tue, 2/12/13, Peter Cock wrote: > > > From: Peter Cock > > Subject: [Biopython-dev] Project ideas for GSoC (or other student > projects) > > To: "Biopython-Dev Mailing List" > > Date: Tuesday, February 12, 2013, 12:51 PM > > Hello all, > > > > Google recently confirmed they will be running Google Summer > > of Code 2013, > > and we (Biopython and the other Bio* projects) would hope to > > be accepted again > > under the Open Bioinformatics Foundation as in previous > > years: > > http://lists.open-bio.org/pipermail/gsoc/2013/000196.html > > > > It would be great to start coming up with potential project > > ideas, both larger > > pieces of work suitable for GSoC but also smaller tasks for > > other project > > students, or 'low hanging fruit' for potential contributors > > to cut > > their teeth on. > > > > See also http://biopython.org/wiki/Active_projects > > and the ideas list there. > > > > Regards, > > > > Peter > From p.j.a.cock at googlemail.com Wed Mar 13 17:16:27 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 13 Mar 2013 21:16:27 +0000 Subject: [Biopython-dev] BWA Wrapper In-Reply-To: References: Message-ID: On Monday, March 4, 2013, Saket Choudhary wrote: > Hi, > > I have updated the code here : > https://github.com/saketkc/biopython/tree/bwa_wrapper > > I have added unittests for the wrapper. And yes, this did help me in > fixing a lot of minor bugs in my original wrapper. > > @Peter : Is this 'pull request' ready ? 
> > Thanks > > Saket > > Sorry I've not had time to test this yet - and have been off ill today as well. The basic approach you've taken seems sound, and a good basis for other samtools style tools. Peter From p.j.a.cock at googlemail.com Thu Mar 14 07:25:41 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 14 Mar 2013 11:25:41 +0000 Subject: [Biopython-dev] Fwd: [biopython] Add the ability to parse CEL version 4 files from Affy (#168) In-Reply-To: References: Message-ID: Who would be the best person to review this? Michael? Peter ---------- Forwarded message ---------- From: *Jeff Hammerbacher* Date: Thursday, March 14, 2013 Subject: [biopython] Add the ability to parse CEL version 4 files from Affy (#168) To: biopython/biopython Hey, I noticed that Biopython was missing the ability to parse binary CEL files (version 4), so I've added a rough implementation. I've kept TODOs in the code and a main method to demonstrate example use. I realize these are not best practices for a mature library, but this corner of Biopython (the Affy module) seems quite immature, so I figured I'd leave the code in this state to indicate to others that there is much room for improvement. I have not contributed to this project before, so please let me know how to get this pull request in shape for a commit. Thanks, Jeff ------------------------------ You can merge this Pull Request by running git pull https://github.com/hammer/biopython master Or view, comment on, or merge it at: https://github.com/biopython/biopython/pull/168 Commit Summary - Add the ability to parse CEL version 4 files from Affy. 
File Changes - *A* Bio/Affy/CelFileV4.py(186) Patch Links: - https://github.com/biopython/biopython/pull/168.patch - https://github.com/biopython/biopython/pull/168.diff From mjldehoon at yahoo.com Fri Mar 15 09:09:18 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 15 Mar 2013 06:09:18 -0700 (PDT) Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: Message-ID: <1363352958.71694.YahooMailClassic@web164004.mail.gq1.yahoo.com> Hi everybody, I looked at the mmCIF parser again, and it turned out that the Python standard library contains a shlex lexical analyzer module that makes mmCIF parsing straightforward without relying on flex or PLY. I uploaded a modified version of MMCIF2Dict.py to the git repository. This parser does the exact same thing as the flex-based parser, but is in pure Python. If you're interested, have a look at MMCIF2Dict.py in the git repository; comments and suggestions are welcome. If there are no objections, I think we can remove everything in Bio.PDB.mmCIF. Also I'm a bit unhappy with how the information in an mmCIF file is represented in Biopython. I think there are more Pythonic ways to store the contents of an mmCIF file in a Python object. Best, -Michiel. --- On Sat, 2/16/13, Peter Cock wrote: > From: Peter Cock > Subject: Re: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) > To: "Michiel de Hoon" > Cc: "BioPython-Dev Mailing List" , "Lenna Peterson" > Date: Saturday, February 16, 2013, 5:42 AM > On Sat, Feb 16, 2013 at 2:46 AM, > Michiel de Hoon > wrote: > > Hi Lenna, > > > > Maybe we are confusing each other.. > > I am looking for a solution that (a) doesn't introduce > new dependencies, > > +1 > > > (b) is pure-Python so it can run on Jython, > > +1 And on PyPy (which to me is more interesting that Jython) > etc. > > > and (c) if that is not possible and we do need to use > C, then that C code > > should be understandable so that it can be debugged if > necessary. 
> >
> > I was suggesting to clean up lex.yy.c so that we can at least achieve (c).
>
> This does mean we essentially give up on ever regenerating the lex.yy.c
> file again - could that be a problem if Flex itself changes much?
>
> > The alternative is to start from the PLY-based parser and remove the
> > dependency on PLY.
> >
> > Best,
> > -Michiel.
>
> Peter
>

From anaryin at gmail.com Fri Mar 15 09:20:16 2013
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 15 Mar 2013 14:20:16 +0100
Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619)
In-Reply-To: <1363352958.71694.YahooMailClassic@web164004.mail.gq1.yahoo.com>
References: <1363352958.71694.YahooMailClassic@web164004.mail.gq1.yahoo.com>
Message-ID:

Hi Michiel,

Speaking without really checking the code: what we should perhaps have is the parsers, whatever they are, populating the same type of object in the end (PDBParser and mmCIFParser).

Is this the current status of the mmCIF?

Best,

João

2013/3/15 Michiel de Hoon

> Hi everybody,
>
> I looked at the mmCIF parser again, and it turned out that the Python
> standard library contains a shlex lexical analyzer module that makes mmCIF
> parsing straightforward without relying on flex or PLY. I uploaded a
> modified version of MMCIF2Dict.py to the git repository. This parser does
> the exact same thing as the flex-based parser, but is in pure Python. If
> you're interested, have a look at MMCIF2Dict.py in the git repository;
> comments and suggestions are welcome.
>
> If there are no objections, I think we can remove everything in
> Bio.PDB.mmCIF. Also I'm a bit unhappy with how the information in an mmCIF
> file is represented in Biopython. I think there are more Pythonic ways to
> store the contents of an mmCIF file in a Python object.
>
> Best,
> -Michiel.
From p.j.a.cock at googlemail.com Fri Mar 15 09:21:50 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 15 Mar 2013 13:21:50 +0000
Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619)
In-Reply-To: <1363352958.71694.YahooMailClassic@web164004.mail.gq1.yahoo.com>
References: <1363352958.71694.YahooMailClassic@web164004.mail.gq1.yahoo.com>
Message-ID:

On Fri, Mar 15, 2013 at 1:09 PM, Michiel de Hoon wrote:
> Hi everybody,
>
> I looked at the mmCIF parser again, and it turned out that the Python
> standard library contains a shlex lexical analyzer module that makes mmCIF
> parsing straightforward without relying on flex or PLY. I uploaded a
> modified version of MMCIF2Dict.py to the git repository. This parser does
> the exact same thing as the flex-based parser, but is in pure Python. If
> you're interested, have a look at MMCIF2Dict.py in the git repository;
> comments and suggestions are welcome.

That makes MMCIF2Dict look a lot shorter :)

https://github.com/biopython/biopython/commit/b2bafdfcd67c738f91722495bb732297b7936828

> If there are no objections, I think we can remove everything in
> Bio.PDB.mmCIF. Also I'm a bit unhappy with how the information in an mmCIF
> file is represented in Biopython. I think there are more Pythonic ways to
> store the contents of an mmCIF file in a Python object.
>
> Best,
> -Michiel.

Do you think we need a deprecation cycle for Bio.PDB.mmCIF? It has been available by default on Debian etc where the dependency was taken care of by the packagers.

I've never used this code so perhaps Eric or João's perspective would be more helpful than mine.
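
[Editorial illustration, not part of the original message.] For readers who have not seen the shlex approach discussed in this thread, the following is a minimal sketch of how the standard-library shlex module can tokenize mmCIF-style text. It is not the MMCIF2Dict.py code from the commit linked above: the function names tokenize and parse, the "data_" dictionary key, and the sample entry 1ABC are all made up for this example, and real mmCIF files have more corner cases than are handled here.

```python
import shlex


def tokenize(lines):
    """Yield mmCIF-style tokens from an iterable of text lines."""
    lines = iter(lines)
    for line in lines:
        if line.startswith("#"):
            continue  # comment line
        if line.startswith(";"):
            # Multi-line text field: runs until a line holding only ';'.
            parts = [line[1:].strip()]
            for line in lines:
                if line.strip() == ";":
                    break
                parts.append(line.strip())
            yield " ".join(part for part in parts if part)
        else:
            # shlex.split separates on whitespace and strips matching quotes.
            # Caveat: an apostrophe inside a bare token (e.g. an atom name
            # such as O5') is read as an opening quote and raises ValueError.
            for token in shlex.split(line):
                yield token


def parse(lines):
    """Collect tokens into a dict: single items as str, loop_ columns as lists."""
    data = {}
    key = None         # item name waiting for its value
    loop_keys = None   # column names while inside a loop_ block
    n_values = 0       # loop values consumed so far
    for token in tokenize(lines):
        if token == "loop_":
            loop_keys, n_values = [], 0
        elif token.startswith("data_") and key is None:
            data["data_"] = token[5:]
        elif loop_keys is not None:
            if token.startswith("_") and n_values == 0:
                loop_keys.append(token)      # still reading column names
                data[token] = []
            elif token.startswith("_"):
                loop_keys, key = None, token  # loop ended, new single item
            else:
                column = loop_keys[n_values % len(loop_keys)]
                data[column].append(token)
                n_values += 1
        elif key is None:
            key = token
        else:
            data[key] = token
            key = None
    return data


# Hypothetical, hand-written sample; not taken from a real PDB entry.
SAMPLE = """\
data_1ABC
#
_entry.id 1ABC
_struct.title 'An example structure'
_struct.details
;First line
second line
;
loop_
_atom_site.id
_atom_site.label_atom_id
1 N
2 CA
#
_cell.length_a 10.5
"""

result = parse(SAMPLE.splitlines())
```

The sketch stores every value as a string and leaves interpretation to the caller, which is one possible reading of the "extract everything, let the user decide" position taken later in the thread.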
Peter From mjldehoon at yahoo.com Fri Mar 15 11:08:43 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 15 Mar 2013 08:08:43 -0700 (PDT) Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: Message-ID: <1363360123.26690.YahooMailClassic@web164004.mail.gq1.yahoo.com> Hi all, --- On Fri, 3/15/13, Peter Cock wrote: > Do you think we need a deprecation cycle for Bio.PDB.mmCIF? > It has been available by default on Debian etc where the > dependency was taken care of by the packagers. Probably not. The Bio.PDB.mmCIF module was essentially a private module used by Bio.PDB.MMCIF2Dict, whose usage is unchanged. Also, AFAICT the Bio.PDB.mmCIF module is not documented anywhere. And finally, all this module does is tokenize the mmCIF file, so probably not something an end user would be interested in. I am not a heavy user of Bio.PDB myself, so feel free to correct me if I am wrong. Best, -Michiel. From p.j.a.cock at googlemail.com Fri Mar 15 11:28:48 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 15 Mar 2013 15:28:48 +0000 Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: <1363360950.87852.YahooMailClassic@web164003.mail.gq1.yahoo.com> References: <1363360950.87852.YahooMailClassic@web164003.mail.gq1.yahoo.com> Message-ID: On Fri, Mar 15, 2013 at 3:22 PM, Michiel de Hoon wrote: > > Speaking of which, we have a Biopython Structural Bioinformatics FAQ (i.e. > how to use the Bio.PDB module) on the Biopython website with additional > information on Bio.PDB, including some information on things that are not in > the main Biopython Tutorial. Perhaps this is a good time to integrate this > FAQ into the main documentation? > Both are LaTeX documents so this shouldn't be too hard to do. 
Peter

From mjldehoon at yahoo.com Fri Mar 15 11:22:30 2013
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 15 Mar 2013 08:22:30 -0700 (PDT)
Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619)
In-Reply-To:
Message-ID: <1363360950.87852.YahooMailClassic@web164003.mail.gq1.yahoo.com>

Hi João,

--- On Fri, 3/15/13, João Rodrigues wrote:

> What we should perhaps have is the parsers, whatever they are, populating the same type of object in the end (PDBParser and mmCIFParser).

I think that there are two options:

1) PDBParser and mmCIFParser both produce Structure objects, with any additional information found in mmCIF files stored as additional attributes of Structure objects (and the same thing for PDB files);

2) We make a module mmCIF with a function mmCIF.read that reads an mmCIF file and stores the information in a mmCIF.Record object that is optimized for storing mmCIF information. The mmCIFParser uses mmCIF.read, and pulls out the necessary information from the mmCIF.Record object to create a Structure object (which is free of mmCIF-specific stuff). Users can make Structure objects if that is all they need, or use mmCIF.read if they want to have all information in an mmCIF file.

Currently the situation is closer to (2), with MMCIF2Dict playing the role of mmCIF.read, but I don't much like the way MMCIF2Dict stores information. Since I am not a power user of Bio.PDB, other people may have more insight into whether (1) or (2) (or something completely different) is best.

> Is this the current status of the mmCIF?

I just replaced the flex-dependent part of mmCIF by pure Python code, but I didn't change the functionality or usage of the mmCIF code. So the current status is still the same as described in the documentation.

Speaking of which, we have a Biopython Structural Bioinformatics FAQ (i.e.
how to use the Bio.PDB module) on the Biopython website with additional information on Bio.PDB, including some information on things that are not in the main Biopython Tutorial. Perhaps this is a good time to integrate this FAQ into the main documentation?

Best,
-Michiel

From jacobs at bioinformed.com Fri Mar 15 11:40:38 2013
From: jacobs at bioinformed.com (Kevin Jacobs)
Date: Fri, 15 Mar 2013 08:40:38 -0700
Subject: [Biopython-dev] BWA Wrapper
In-Reply-To:
References:
Message-ID:

FYI, I am working on a direct Cython wrapper around the new BWA-MEM aligner, which will allow API-level access to Heng Li's extremely impressive new algorithm. It is still in early development and is missing many bells and whistles, but will be shaping up in the next few weeks.

Test program:

    import bwamem

    mem = bwamem.MEMAligner('ref/human_g1k_v37.fasta')
    a = mem.align('TCACGACGCTCTTCCGATCTGTT...GTGCATTCTCTGGTCAGACAGCCAAGG')
    a = a[0]

    print 'ref id =',a.rid
    print 'pos =',a.pos
    print 'CIGAR =',a.cigar.to_string()

Output (correct):

    ref id = 0
    pos = 115250385
    CIGAR = 17N134M

On Wed, Mar 13, 2013 at 2:16 PM, Peter Cock wrote:

> On Monday, March 4, 2013, Saket Choudhary wrote:
>
> > Hi,
> >
> > I have updated the code here :
> > https://github.com/saketkc/biopython/tree/bwa_wrapper
> >
> > I have added unittests for the wrapper. And yes, this did help me in
> > fixing a lot of minor bugs in my original wrapper.
> >
> > @Peter : Is this 'pull request' ready ?
> >
> > Thanks
> >
> > Saket
> >
> >
> Sorry I've not had time to test this yet - and have
> been off ill today as well. The basic approach you've
> taken seems sound, and a good basis for other
> samtools style tools.
>
> Peter

From anaryin at gmail.com Fri Mar 15 11:53:41 2013
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 15 Mar 2013 16:53:41 +0100
Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619)
In-Reply-To: <1363360950.87852.YahooMailClassic@web164003.mail.gq1.yahoo.com>
References: <1363360950.87852.YahooMailClassic@web164003.mail.gq1.yahoo.com>
Message-ID:

Hi Michiel,

> 1) PDBParser and mmCIFParser both produce Structure objects, with any
> additional information found in mmCIF files stored as additional attributes
> of Structure objects (and the same thing for PDB files);
>

This approach has a few advantages. First and most obvious, it makes converting one file format to another seamless. Second, it reduces the code to something easier to maintain and to extend.

The disadvantage is that the Structure objects might become a bit too bloated. On the other hand, we can make them lighter and take advantage of Python's dynamic attributes (if I need a b-factor, I just add atom.bfactor). This would also help a lot with the current parser, which is quite "sluggish" for some purposes, and bring a lot more flexibility (parsing pqr files, mol2 files, etc). All we'd need would be a parser for each file format and a generic container to hold the backbone of the structure and extend it as we need. A simple flag for the parser type would also make it easier to check whether function X can be used on a particular structure.

>
> 2) We make a module mmCIF with a function mmCIF.read that reads an mmCIF
> file and stores the information in a mmCIF.Record object that is optimized
> for storing mmCIF information. The mmCIFParser uses mmCIF.read, and pulls
> out the necessary information from the mmCIF.Record object to create a
> Structure object (which is free of mmCIF-specific stuff).
Users can make
> Structure objects if that is all they need, or use mmCIF.read if they want
> to have all information in an mmCIF file.
>

I'm completely unfamiliar with mmCIF files. How much more information do they have than a PDB file? And what kind of information is useful to extract from them?

> Speaking of which, we have a Biopython Structural Bioinformatics FAQ (i.e.
> how to use the Bio.PDB module) on the Biopython website with additional
> information on Bio.PDB, including some information on things that are not
> in the main Biopython Tutorial. Perhaps this is a good time to integrate
> this FAQ into the main documentation?

We could also update it a bit, because it's been a while and there are some different things here and there. And additions too.

Best,

João

From bartek at rezolwenta.eu.org Fri Mar 15 19:06:57 2013
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Sat, 16 Mar 2013 00:06:57 +0100
Subject: [Biopython-dev] Project ideas for GSoC (or other student projects)
In-Reply-To:
References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com>
Message-ID:

Hi All,

I would add one more (old) idea for the GSoC pool, i.e. adding support for different biological ontologies to biopython.

This was already discussed some time ago (http://www.biopython.org/w/index.php?title=Gene_Ontology&redirect=no) mostly in the context of gene ontology, and to some extent this is addressed by the development of GOAtools (https://github.com/tanghaibao/goatools), but I think it would be worth having decent support in biopython for OBO-file-based ontologies (not only gene ontology; I'm also interested in anatomical ontologies myself, and others are available at obofoundry.org).

I think it would need to include support for IO operations on both OBO and annotation files, as well as statistical enrichment measures and potentially some visualisation.

Would anyone be interested in co-mentoring this project?
There is one student in my department who would be interested in applying to GSoC for this project, but I think it would be great if other people joined the discussion on the functionality and having more people involved is always better... best Bartek Wilczynski On Wed, Mar 13, 2013 at 7:32 PM, Eric Talevich wrote: > On Tue, Feb 12, 2013 at 9:08 PM, Michiel de Hoon wrote: > >> It would be great to have better support for microarray analysis in >> Biopython. Something like lumi/limma in R. Perhaps this is an option for >> the GSoC? >> >> Best, >> -Michiel. >> > > I like Michiel's idea, and I'll suggest two more: > > 1. Codon alignment & analysis: > - PAL2NAL-style conversion of unaligned nucleic acid sequences and a > protein sequence alignment to a codon alignment. (Previously discussed) > - dN/dS and the related functions needed to calculate it. > - Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage of > codon alignments, including validation (testing for frame shifts etc.) > > 2. Phylo enhancements: > 2a. Tree drawing: > - A proper draw_unrooted function to perform radial layout, with an > optional "iterations" argument to use Felsenstein's Equal Daylight > algorithm -- I feel this layout approach is neglected in most libraries. > - Better matplotlib/pylab integration, so the plot components can be > tweaked using matplotlib functions. > - Other common layout approaches, e.g. circular. > 2b. A "Phylo.consensus" module: > - strict consensus, like Bio.Nexus already implements. > - other consensus methods, time permitting. > 2c. A "Phylo.distance" module: > - Robinson-Foulds distance -- though others might be working on this > already. > 2d. Simple tree inference: > - Straightforward algorithms exist for neighbor-joining and parsimony tree > estimation. For small alignments (and perhaps medium-sized ones with PyPy), > it would be nice to run these without an external program, e.g. 
to
> construct a guide tree for another algorithm or quickly view a phylogenetic
> clustering of sequences.
>
> Any interest in either of these? Shall I add them to the wiki?
>
> -Eric

--
Bartek Wilczynski

From mjldehoon at yahoo.com Fri Mar 15 22:38:48 2013
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 15 Mar 2013 19:38:48 -0700 (PDT)
Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619)
In-Reply-To:
Message-ID: <1363401528.82829.YahooMailClassic@web164001.mail.gq1.yahoo.com>

--- On Fri, 3/15/13, João Rodrigues wrote:

> I'm completely unfamiliar with mmCIF files.. how much more information do they have than a PDB file?
These are two examples from the Biopython tests: https://github.com/biopython/biopython/blob/master/Tests/PDB/1A8O.cif https://github.com/biopython/biopython/blob/master/Tests/PDB/1LCD.cif And what kind of information is useful to extract from them? I think we should extract all information from these files, and let the user decide which parts are useful. Best, -Michiel. From p.j.a.cock at googlemail.com Sat Mar 16 10:38:22 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 16 Mar 2013 14:38:22 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5E48@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Wed, Dec 5, 2012 at 6:41 PM, Peter Cock wrote: > Hi David, > > I've been experimenting with your pull request, thank you: > https://github.com/biopython/biopython/pull/116 > Hi again David, I've not used your code as is, but have started by pulling out and generalising what I felt was the least contentious part: https://github.com/biopython/biopython/commit/087712510421ec7f655a7981926a757aa93e9177 This means that label_position = start, middle, end (and some historic aliases defined in the linear drawer code) now work on circular GenomeDiagrams. I have made the default None, which gives the current behaviour (as 'start' on linear, the more complicated to explain vertical bottom on circular). Support for allowing the default label orientation to be radially consistent all round the circle (rather than the current flipping for the left/right halves which assumes the output is kept vertical) would be nice, but the thing I am most keen on is the inside/outside of the track label placement. Hopefully I'll have time to finish that this weekend... 
Peter

From p.j.a.cock at googlemail.com Sat Mar 16 16:37:12 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 16 Mar 2013 20:37:12 +0000
Subject: [Biopython-dev] Modifications to CircularDrawer
In-Reply-To:
References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5E48@AMSPRD0410MB351.eurprd04.prod.outlook.com>
Message-ID:

On Sat, Mar 16, 2013 at 2:38 PM, Peter Cock wrote:
> On Wed, Dec 5, 2012 at 6:41 PM, Peter Cock wrote:
>> Hi David,
>>
>> I've been experimenting with your pull request, thank you:
>> https://github.com/biopython/biopython/pull/116
>>
>
> Hi again David,
>
> I've not used your code as is, but have started by pulling out
> and generalising what I felt was the least contentious part:
>
> https://github.com/biopython/biopython/commit/087712510421ec7f655a7981926a757aa93e9177
>
> This means that label_position = start, middle, end (and some
> historic aliases defined in the linear drawer code) now work
> on circular GenomeDiagrams. I have made the default None,
> which gives the current behaviour (as 'start' on linear, the
> more complicated to explain vertical bottom on circular).
>
> Support for allowing the default label orientation to be radially
> consistent all round the circle (rather than the current flipping
> for the left/right halves which assumes the output is kept
> vertical) would be nice, but the thing I am most keen on is the
> inside/outside of the track label placement. Hopefully I'll have
> time to finish that this weekend...

Here's a version on a branch which addresses the label placement by adding a label_strand argument, where +1 means the label is on the forward strand side of the track (above or outside), while -1 means the reverse strand side of the track (below or inside), and the default is to follow the strand of the feature being drawn.
This seemed to me quite an intuitive arrangement:

https://github.com/peterjc/biopython/tree/label_strand

This branch also (without making it optional) switches circular diagram feature labels to be "outside" the sigil like the linear diagram, rather than "inside" the sigil. This does tend to take up more space (which would explain the original motivation), but rarely gives a very legible result except with a box sigil and a very small/short label which falls completely within the sigil. This could be made a user option if there is demand... my inclination is not to (the API is already quite complex).

David, I will email you an updated version of your example script using this branch for you to look at. It allows me to recreate the same effect as your code (bar the orientation changes which I have not at this point incorporated).

David & Leighton, what do you think of this label idea?

Peter

From p.j.a.cock at googlemail.com Mon Mar 18 07:58:49 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 18 Mar 2013 11:58:49 +0000
Subject: [Biopython-dev] Modifications to CircularDrawer
In-Reply-To:
References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5E48@AMSPRD0410MB351.eurprd04.prod.outlook.com>
Message-ID:

On Sat, Mar 16, 2013 at 8:37 PM, Peter Cock wrote:
>
> David & Leighton, what do you think of this label idea?
>
> Peter

From discussion off list, my branch seems positively accepted by both, and so I've applied that to the master. I probably will need to update some images in the Tutorial...
We appear to agree that label orientation is an aesthetic judgement, and therefore a user option to control this on circular diagrams would be nice - but I've not done this (yet) and remain cautious about further complicating this bit of the code while trying to have a consistent API between the linear and circular drawers.

See also: https://github.com/biopython/biopython/pull/116

Peter

From chapmanb at 50mail.com Mon Mar 18 12:49:33 2013
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 18 Mar 2013 12:49:33 -0400
Subject: [Biopython-dev] SciPy Bioinformatics symposium: abstracts due Wednesday Mar 20th
Message-ID: <87y5dkvejm.fsf@fastmail.fm>

Hi all;
I'm helping organize a bioinformatics mini-symposium as part of SciPy 2013:

Bioinformatics mini-symposia: http://j.mp/Z4xxXB
SciPy info: http://conference.scipy.org/scipy2013/about.php

This is a great chance for the Python bioinformatics community to connect with the wider Python scientific computing world. SciPy will feature programmers working on IPython reproducible research, scikit-learn machine learning approaches, large scale computing problems with NumPy and lots more relevant to bioinformatics work.

This year there will be a special symposium track dedicated to bioinformatics and I'd like to encourage everyone to submit abstracts. The deadline is this Wednesday, March 20th:

http://conference.scipy.org/scipy2013/speaking_overview.php
http://conference.scipy.org/scipy2013/speaking_submission.php

SciPy takes place June 24-29th in Austin, TX. I'm looking forward to seeing lots of bioinformatics people there.
Please feel free to write me if you have any questions,
Brad

From 88whacko at gmail.com Wed Mar 20 14:10:14 2013
From: 88whacko at gmail.com (Andrea Rizzi)
Date: Wed, 20 Mar 2013 19:10:14 +0100
Subject: [Biopython-dev] New contributor
In-Reply-To: <1363185895.3324.YahooMailClassic@web164005.mail.gq1.yahoo.com>
References: <1363185895.3324.YahooMailClassic@web164005.mail.gq1.yahoo.com>
Message-ID:

Thank you for your welcome Michiel!

I will be looking for a good project to work on in the next few days and I will let you know soon. Meanwhile I've started to read some code to become familiar with the modules and I bumped into a few small bugs concerning the Seq objects. In particular I found:

1) a duplicated test method name (one test in test_Seq_objs.py wasn't performed);
2) an error in Alphabet._case_less().

I've also expanded the documentation a little and I've replaced the tostring() method with the suggested str() in a function of MutableSeq. The branch is located here:

https://github.com/andrrizzi/biopython/tree/seq-branch

I'm not sure whether it is more convenient for you to merge this kind of commit from a git branch or whether it is better to open a ticket and create a patch. Anyway, if you think these small commits may be useful, feel free to use them.

Best,
Andrea

2013/3/13 Michiel de Hoon

> Hi Andrea,
>
> Welcome to Biopython!
> It's great that you want to contribute.
> Writing & finishing some unit tests sounds like a good idea, and of course
> bug fixing is always welcome.
> Other options are to look at orphan modules in Biopython (modules without
> active maintainers, without documentation, or without unit tests).
> Once you decide what specifically you want to work on, it's good to let us
> know on the mailing list, to see if anybody else is working on the same.
>
> Good luck!
>
> Best,
> -Michiel.
> > > > --- On Sat, 3/2/13, Andrea Rizzi <88whacko at gmail.com> wrote: > > > From: Andrea Rizzi <88whacko at gmail.com> > > Subject: [Biopython-dev] New contributor > > To: biopython-dev at biopython.org > > Date: Saturday, March 2, 2013, 11:49 AM > > Hello! > > My name is Andrea Rizzi and I'm a master's student in > > computer science and > > computational biology. I would be glad to help you > > developing biopython. > > I've used the library quite extensively but I'm mostly > > familiar with > > handling sequences, MSAs and PDB files. > > > > I've read through the small contributing guide on the wiki > > and on the > > tutorial and I thought I could start with something > > relatively > > straightforward like writing/completing some unit tests (if > > I understood > > correctly there's a fairly strong need for them). I've good > > knowledge of > > both git and unittest. Anyway any task is actually fine to > > me :) . > > > > If you agree I'll try to look for a module that needs some > > more testing (or > > maybe you have one to suggest me), otherwise I could just go > > to the bug > > tracker and try to help out fixing some bugs. > > > > -- > > -- Andrea > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > -- -- Andrea From p.j.a.cock at googlemail.com Thu Mar 21 08:17:51 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Mar 2013 12:17:51 +0000 Subject: [Biopython-dev] New contributor In-Reply-To: References: <1363185895.3324.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: On Wed, Mar 20, 2013 at 6:10 PM, Andrea Rizzi <88whacko at gmail.com> wrote: > Thank you for your welcome Michiel! > > I will looking for a good project to work on in the next few days and I > will let you know soon. 
Meanwhile I've started to read some code to become > familiar with the modules and I bumped into few small bugs concerning the > Seq objects, in particular I found: > > 1) a duplicated test method name (one test in test_Seq_objs.py wasn't > performed); > 2) an error in Alphabet._case_less(). Well spotted - changes applied to the master, thanks. > I've also expanded a little bit the documentation and I've substituted > tostring() method with the suggested str() method in a function of > MutableSeq. The branch is located here > > https://github.com/andrrizzi/biopython/tree/seq-branch > > I'm not sure if it is more comfortable for you to merge this kind of > commits from a git branch or it is more advisable to open a ticket and > create a patch. Anyway if you think this small commits may be useful, feel > free to use them. If you're happy on GitHub, a pull request is simplest. I've looked at these changes one by one and applied and/or commented on them. (We're debating moving our issue tracker from RedMine to GitHub, which would make things a little easier in future). Thank you! Peter From p.j.a.cock at googlemail.com Thu Mar 21 12:11:44 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Mar 2013 16:11:44 +0000 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Fri, Mar 15, 2013 at 11:06 PM, Bartek Wilczynski wrote: > Hi All, > I would add one more (old) idea for a GSoC pool, i.e. adding support > for different biological ontologies to biopython. 
> > This was already discussed some time ago > (http://www.biopython.org/w/index.php?title=Gene_Ontology&redirect=no) > mostly in the context of gene ontology, and to some extent this is > addressed by the development of GOAtools > (https://github.com/tanghaibao/goatools), but I think it would be > worth to have a decent support for OBO-file-based ontologies (not only > gene ontology, I'm also interested myself in anatomical ontologies, > there are also other available at obofoundry.org) in biopython. > > I think it would need to include support for IO operations on both OBO > and annotation files, as well as statistical enrichment measures and > potentially some visualisation. > > Would anyone be interested in co-mentoring this project? There is one > student in my department who would be interested in applying to GSoC > for this project, but I think it would be great if other people joined > the discussion on the functionality and having more people involved is > always better... > > best > Bartek Wilczynski That's a good idea - I would have used this recently with some GO stuff (e.g. given a GO term, is it a molecular function, biological process, or cellular compartment - can solve this easily by traversing up any branch of the DAG). Right now we need to put this list of ideas on the wiki page (ready for combining into the OBF page which will be shown to Google to make our case for taking part in the GSoC 2013 program). http://biopython.org/wiki/Google_Summer_of_Code If any of you as a potential mentor want to put up an outline proposal, even better. 
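A minimal sketch of the GO lookup mentioned above (given a term, walk up any one branch of the DAG until a root is reached). The parent table is a tiny hand-written, simplified stand-in for what an OBO parser would provide, and the names GO_ROOTS, PARENTS and go_namespace are hypothetical, not part of Biopython or GOAtools.

```python
# Root terms of the three GO sub-ontologies.
GO_ROOTS = {
    "GO:0003674": "molecular_function",
    "GO:0008150": "biological_process",
    "GO:0005575": "cellular_component",
}

# Toy child -> parents table (illustrative only; a real one would be
# built from the is_a lines of an OBO file).
PARENTS = {
    "GO:0016787": ["GO:0003824"],  # hydrolase activity -> catalytic activity
    "GO:0003824": ["GO:0003674"],  # catalytic activity -> molecular_function
}


def go_namespace(term, parents=PARENTS, roots=GO_ROOTS):
    """Follow the first parent repeatedly until a root term is reached."""
    seen = set()
    while term not in roots:
        if term in seen:
            raise ValueError("cycle detected at %s" % term)
        seen.add(term)
        if not parents.get(term):
            raise KeyError("no parents recorded for %s" % term)
        # Any branch will do, since every ancestor path ends at the same root.
        term = parents[term][0]
    return roots[term]
```

Because all ancestors of a term share one root, following only the first parent is enough here; enrichment statistics would of course need the full ancestor set.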
Peter From p.j.a.cock at googlemail.com Thu Mar 21 12:29:29 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Mar 2013 16:29:29 +0000 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Thu, Mar 21, 2013 at 4:11 PM, Peter Cock wrote: > > Right now we need to put this list of ideas on the wiki page (ready > for combining into the OBF page which will be shown to Google > to make our case for taking part in the GSoC 2013 program). > http://biopython.org/wiki/Google_Summer_of_Code > > If any of you as a potential mentor want to put up an outline > proposal, even better. > I've been wondering about potential GSoC projects which I'd be interested in mentoring (or co-mentoring), and thus far I've only got one outline idea. I'm interested in taking the Bio.SeqIO.index(...) / index_db(...) functionality (which does whole record parsing on demand) and extending this with lazy-loading or lazy-parsing (which has precedent in our BioSQL wrappers). For example, with whole genome FASTA files you may never need to load the entire sequence, but using an index system like tabix (or even actually using a tabix index) Biopython could provide a lazy-loading Seq object which extracts only the sequence region of interest on demand. The same idea applies to richer file formats too, like EMBL and GenBank. Here lazy loading the sequence is actually easier (the number of bases per line is strictly defined), but you can apply the same ideas to lazy loading features too. This means indexing both the sequence and the feature table. Likewise, this makes sense for GTF/GFF/GFF3 where you would index the features, and also if present index the embedded FASTA sequence at the end of the file. Clearly handling this would ideally build on Lenna and Brad's work with the underlying parser. With what I have in mind, there are two technical sides to this. 
First, the index format (binning strategies etc) for which we should review tabix and BAM's indexing and its planned replacement CSI (able to handle longer references).

Second, to avoid code duplication, this would mean some re-factoring of the existing parser code to ensure that if a record is loaded in full via the traditional API, it would go through the same code as if it were loaded via the new lazy loading approach.

Potentially the existing parsers could optionally also become lazy loaders (contingent on this requiring ownership of the file handle as it will use seek and tell to move the file pointer). That in theory could make our parsers much faster (depending on the overheads) for tasks where only a minority of the data is ever used. I've had some fun chats with Pjotr Prins from BioRuby about this at a CodeFest/BOSC meeting.

Brad and Lenna, I've CC'd you explicitly as I'm guessing from the GFF work you are most likely to have considered some of these issues.

Does this sound like something worth exploring further, and worth proposing as an outline GSoC project? I think it would be quite a challenging project - but like last year, it is something I would like to try myself if I had the time.

Regards,

Peter

From p.j.a.cock at googlemail.com Thu Mar 21 13:01:51 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 21 Mar 2013 17:01:51 +0000
Subject: [Biopython-dev] Project ideas for GSoC (or other student projects)
In-Reply-To:
References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com>
Message-ID:

On Wed, Mar 13, 2013 at 6:32 PM, Eric Talevich wrote:
>
> I like Michiel's idea, and I'll suggest two more:
>
> 1. Codon alignment & analysis:

Already up on the wiki :)

>
> 2. Phylo enhancements:
> 2a. Tree drawing:
> - A proper draw_unrooted function to perform radial layout, with an optional
> "iterations" argument to use Felsenstein's Equal Daylight algorithm -- I
> feel this layout approach is neglected in most libraries.
> - Better matplotlib/pylab integration, so the plot components can be tweaked
> using matplotlib functions.
> - Other common layout approaches, e.g. circular.
> 2b. A "Phylo.consensus" module:
> - strict consensus, like Bio.Nexus already implements.
> - other consensus methods, time permitting.
> 2c. A "Phylo.distance" module:
> - Robinson-Foulds distance -- though others might be working on this
> already.
> 2d. Simple tree inference:
> - Straightforward algorithms exist for neighbor-joining and parsimony tree
> estimation. For small alignments (and perhaps medium-sized ones with PyPy),
> it would be nice to run these without an external program, e.g. to construct
> a guide tree for another algorithm or quickly view a phylogenetic clustering
> of sequences.

One more idea for a sub-task?

2e. Using multiple trees for bootstrapping a master tree.

Take the master tree and for each edge you have a partition of the leaves, which can be used as a dictionary key (e.g. as a binary representation). Then for each of the bootstrap runs, look at each edge, compute the key for that split of the leaves, and increment the count. Then at the end, you have a dictionary of counts which are the branch bootstrap supports. I wrote that once in Python some time back, and used it to take a set of bootstrap trees generated on a cluster and give the support values to the master tree.

>
> Any interest in either of these? Shall I add them to the wiki?
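The counting scheme described in 2e above could be sketched roughly as follows. This is only an illustration under stated assumptions: trees are nested tuples of leaf-name strings rather than Bio.Phylo objects, each split is keyed by a frozenset of leaf names instead of a binary representation, and the function names are made up for the example.

```python
from collections import Counter


def clade_leaf_sets(tree):
    """Return (all leaves under tree, leaf sets of its internal clades)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = clade_leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
        if len(child_leaves) > 1:
            clades.append(child_leaves)
    return leaves, clades


def split_keys(tree):
    """Return one canonical key per non-trivial split (internal edge)."""
    all_leaves, clades = clade_leaf_sets(tree)
    anchor = min(all_leaves)
    keys = set()
    for clade in clades:
        # Always keep the side without the anchor leaf, so the key does
        # not depend on where the tree happens to be rooted.
        side = all_leaves - clade if anchor in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:
            keys.add(side)
    return keys


def bootstrap_support(master, replicates):
    """Count how many replicate trees contain each split of the master."""
    counts = Counter()
    for replicate in replicates:
        counts.update(split_keys(replicate))
    return {key: counts[key] for key in split_keys(master)}
```

Dividing each count by the number of replicates gives the usual support fraction for the corresponding branch of the master tree.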
>

They both seem worth posting on the wiki, although we may not have enough mentors for both to go ahead :(

Peter

From p.j.a.cock at googlemail.com Thu Mar 21 12:55:30 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 21 Mar 2013 16:55:30 +0000
Subject: [Biopython-dev] Project ideas for GSoC (or other student projects)
In-Reply-To:
References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com>
Message-ID:

On Wed, Mar 13, 2013 at 6:32 PM, Eric Talevich wrote:
> I like Michiel's idea, and I'll suggest two more:
>
> 1. Codon alignment & analysis:
> - PAL2NAL-style conversion of unaligned nucleic acid sequences and a protein
> sequence alignment to a codon alignment. (Previously discussed)

e.g. https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py

> - dN/dS and the related functions needed to calculate it.
> - Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage of
> codon alignments, including validation (testing for frame shifts etc.)

http://biopython.org/wiki/Google_Summer_of_Code#Codon_alignment_and_analysis

I see you've started fleshing this idea out on the wiki, which is great. Right now it seems a little on the lightweight side - or is that deliberate (to see if a student can take this idea and come up with a solid project proposal in this area)? Things like model selection might be a fun extension - I can think of a local expert who would be great to get involved on the science side if he's interested.

Alternatively this could include doing some more general work on the alignment object - for instance per-column-annotation for things like a consensus sequence - or an array-of-char implementation as an alternative to the list-of-SeqRecords we have now (with its poor column access speed).
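For readers unfamiliar with the PAL2NAL-style conversion listed above, here is a minimal sketch of the core step: threading an unaligned nucleotide sequence onto one row of a gapped protein alignment. It works on plain strings, the function name is hypothetical, and it deliberately does not check that each codon actually translates to the aligned residue, which a real implementation would want to validate.

```python
def thread_codons(aligned_protein, nucleotides, gap="-"):
    """Insert three gap characters for every gap in the protein row."""
    residues = len(aligned_protein) - aligned_protein.count(gap)
    if len(nucleotides) != 3 * residues:
        raise ValueError(
            "expected %i nucleotides for %i residues, got %i"
            % (3 * residues, residues, len(nucleotides))
        )
    codons = []
    position = 0
    for letter in aligned_protein:
        if letter == gap:
            codons.append(gap * 3)
        else:
            # Consume the next codon from the unaligned nucleotides.
            codons.append(nucleotides[position:position + 3])
            position += 3
    return "".join(codons)
```

The length check is the simplest form of the frame-shift validation mentioned in the list; stop codons and partial codons would need extra handling.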
Peter From p.j.a.cock at googlemail.com Thu Mar 21 13:29:44 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Mar 2013 17:29:44 +0000 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: Message-ID: On Tue, Feb 12, 2013 at 6:29 PM, Wibowo Arindrarto wrote: > Hi everyone, > > It's more or less a 'low hanging fruit', but I've been thinking > perhaps it may be useful if we have our own interface to the HMMER3 > online service? The corresponding SearchIO parsers may be written for > this as well (they return different formats for which we haven't any > parsers currently). Worth adding to the projects list here (or filing an enhancement bug) http://biopython.org/wiki/Active_projects#Project_ideas - but not enough to base a whole GSoC project around. > And I think there are more things being worked on, not yet mentioned > in the wiki: > > 1. Porting our docs to Sphinx[1] > 2. Converting some/all of the print and compare tests to unit tests. > For example, our Bio.Seq's tests are still print and compare tests. > > regards, > Bow > > [1] See the original feature request here: > https://redmine.open-bio.org/issues/3221 > https://redmine.open-bio.org/issues/3220 > https://redmine.open-bio.org/issues/3219 I don't think a purely documentation focused project is eligible for GSoC. But both ideas make sense separately from GSoC. 
Regards, Peter From p.j.a.cock at googlemail.com Thu Mar 21 13:36:24 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Mar 2013 17:36:24 +0000 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Thu, Mar 21, 2013 at 4:29 PM, Peter Cock wrote: > On Thu, Mar 21, 2013 at 4:11 PM, Peter Cock wrote: >> >> Right now we need to put this list of ideas on the wiki page (ready >> for combining into the OBF page which will be shown to Google >> to make our case for taking part in the GSoC 2013 program). >> http://biopython.org/wiki/Google_Summer_of_Code >> >> If any of you as a potential mentor want to put up an outline >> proposal, even better. >> > > I've been wondering about potential GSoC projects which I'd > be interested in mentoring (or co-mentoring), and thus far I've > only got one outline idea. > > I'm interested in taking the Bio.SeqIO.index(...) / index_db(...) > functionality (which does whole record parsing on demand) > and extending this with lazy-loading or lazy-parsing (which > has precedent in our BioSQL wrappers). For example, with > whole genome FASTA files you may never need to load the > entire sequence, but using an index system like tabix (or > even actually using a tabix index) Biopython could provide > a lazy-loading Seq object which extracts only the sequence > region of interest on demand. > > The same idea applies to richer file formats too, like EMBL > and GenBank. ... > > Likewise, this makes sense for GTF/GFF/GFF3 ... P.S. An example use case, http://www.biostars.org/p/64363/ Part of this work could include enhancements to the SeqRecord handling of SeqFeatures - offering more than just the current simple list - for example lookup by ID, dbxref, or position. That would be nice to have now with the current in-memory parsers. 
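As a rough illustration of the lookups suggested above (by ID, dbxref, or position), the sketch below indexes a list of features once and then answers queries from the index. Feature is a plain namedtuple standing in for SeqFeature, and FeatureIndex is a hypothetical name, so nothing here is existing Biopython API.

```python
from collections import defaultdict, namedtuple

# Stand-in for a SeqFeature: start/end coordinates plus a qualifiers
# dictionary mapping keys to lists of values (as in GenBank files).
Feature = namedtuple("Feature", ["start", "end", "qualifiers"])


class FeatureIndex(object):
    """Index features by selected qualifier values and by position."""

    def __init__(self, features, keys=("locus_tag", "db_xref")):
        self.features = sorted(features, key=lambda f: (f.start, f.end))
        self.by_qualifier = defaultdict(list)
        for feature in self.features:
            for key in keys:
                for value in feature.qualifiers.get(key, []):
                    self.by_qualifier[(key, value)].append(feature)

    def lookup(self, key, value):
        """Return the features carrying this qualifier value."""
        return list(self.by_qualifier.get((key, value), []))

    def overlapping(self, start, end):
        """Return features overlapping the half-open range [start, end)."""
        # Linear scan for clarity; a binning scheme or interval tree
        # would be the obvious replacement for large genomes.
        return [f for f in self.features if f.start < end and start < f.end]
```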
An old but still relevant example usecase: http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features Regards, Peter From eric.talevich at gmail.com Thu Mar 21 13:42:19 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 21 Mar 2013 13:42:19 -0400 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Thu, Mar 21, 2013 at 12:55 PM, Peter Cock wrote: > On Wed, Mar 13, 2013 at 6:32 PM, Eric Talevich > wrote: > > I like Michiel's idea, and I'll suggest two more: > > > > 1. Codon alignment & analysis: > > - PAL2NAL-style conversion of unaligned nucleic acid sequences and a > protein > > sequence alignment to a codon alignment. (Previously discussed) > > e.g. > https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py Well, check you out. Would you be interested in mentoring this project? > > - dN/dS and the related functions needed to calculate it. > > - Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage > of > > codon alignments, including validation (testing for frame shifts etc.) > > > http://biopython.org/wiki/Google_Summer_of_Code#Codon_alignment_and_analysis > > I see you've started fleshing this idea out on the wiki, which is great. > Right now it seems a little on the light weight side - or is that > deliberate > (to see if a student can take this idea and come up with a solid > project proposal in this area)? Things like model selection might > be a fun extension - I can think of a local expert who would be > great to get involved on the science side if he's interested. > I put up a quick sketch to avoid locking the wiki page for too long, but also deliberately left it vague to see where the applicants take it. Model selection would be cool, I added it. Local expert, also great. 
> Alternatively this could include doing some more general work > on the alignment object - for instance per-column-annotation > for things like a consensus sequence - or an array-of-char > implementation as an alternative to the list-of-SeqRecords > we have now (with its poor column access speed). > > Peter > I wonder if that's something we could just do incrementally -- change the MultipleSeqAlignment class to store a list-of-lists-of chars (or list-of-strings), a list of SeqRecord-like husks (all the annotations, but without the Seq itself) for each row, a list of column annotations, and a single alphabet for the whole alignment. How do you suppose the speed of that would compare to the current list-of-SeqRecords, and also to that of a wrapped NumPy matrix? Would it be a significant enough speed improvement to justify both replacing the current implementation, and to make the NumPy approach less tempting (given PyPy's progress toward including a compliant implementation)? Alternatively, we could post a GSoC project for creating a separate TurboAlignment class/module based on NumPy which would be mostly interchangeable and interconvertible with the pure-Python version in the Biopython core. Speaking of which, should we also post the idea of storing sequences as an efficient byte array, BioJava-style? -Eric From p.j.a.cock at googlemail.com Thu Mar 21 13:59:10 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Mar 2013 17:59:10 +0000 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Thu, Mar 21, 2013 at 5:42 PM, Eric Talevich wrote: > On Thu, Mar 21, 2013 at 12:55 PM, Peter Cock > wrote: >> >> On Wed, Mar 13, 2013 at 6:32 PM, Eric Talevich >> wrote: >> > I like Michiel's idea, and I'll suggest two more: >> > >> > 1. 
Codon alignment & analysis: >> > - PAL2NAL-style conversion of unaligned nucleic acid sequences and a >> > protein >> > sequence alignment to a codon alignment. (Previously discussed) >> >> e.g. >> https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py > > Well, check you out. Would you be interested in mentoring this project? > If I'm not primary mentor on another project, I'd be open to co-mentoring something on the alignment side. >> > - dN/dS and the related functions needed to calculate it. >> > - Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage >> > of >> > codon alignments, including validation (testing for frame shifts etc.) >> >> >> http://biopython.org/wiki/Google_Summer_of_Code#Codon_alignment_and_analysis >> >> I see you've started fleshing this idea out on the wiki, which is great. >> Right now it seems a little on the light weight side - or is that >> deliberate >> (to see if a student can take this idea and come up with a solid >> project proposal in this area)? Things like model selection might >> be a fun extension - I can think of a local expert who would be >> great to get involved on the science side if he's interested. > > > I put up a quick sketch to avoid locking the wiki page for too long, but > also deliberately left it vague to see where the applicants take it. Model > selection would be cool, I added it. Local expert, also great. If he's available and willing, yes. I've not mentioned this to him yet so no promises - the idea only occurred to me while writing that email ;) >> >> Alternatively this could include doing some more general work >> on the alignment object - for instance per-column-annotation >> for things like a consensus sequence - or an array-of-char >> implementation as an alternative to the list-of-SeqRecords >> we have now (with its poor column access speed). 
>>
>> Peter
>
>
> I wonder if that's something we could just do incrementally -- change the
> MultipleSeqAlignment class to store a list-of-lists-of chars (or
> list-of-strings), a list of SeqRecord-like husks (all the annotations, but
> without the Seq itself) for each row, a list of column annotations, and a
> single alphabet for the whole alignment.
>
> How do you suppose the speed of that would compare to the current
> list-of-SeqRecords, and also to that of a wrapped NumPy matrix? Would it be
> a significant enough speed improvement to justify both replacing the current
> implementation, and to make the NumPy approach less tempting (given PyPy's
> progress toward including a compliant implementation)? Alternatively, we
> could post a GSoC project for creating a separate TurboAlignment
> class/module based on NumPy which would be mostly interchangeable and
> interconvertible with the pure-Python version in the Biopython core.

When I said array-of-char I did have NumPy in mind, and PyPy does now cope with two or more dimensional arrays in NumPyPy. Note that NumPy handles both row and column orientated arrays with a simple class init option, so this can easily be set up to favour row or column access.

Last time I did anything with the alignment object where column access was a bottleneck (calculating mutual information between columns), I just loaded all the columns into memory as a list of strings, and computed on that. It worked very nicely.

> Speaking of which, should we also post the idea of storing sequences as an
> efficient byte array, BioJava-style?

I'd wondered about that (in combination with the discussion about strict alphabet checking), but is there enough for a whole GSoC project? Related to this one could look at something with k-mer hashes...
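The "load all the columns as a list of strings" approach mentioned above can be sketched in a few lines of plain Python. The row data and function names are invented for the example, and the mutual information is the plain plug-in estimate in bits with no correction for small samples or gaps.

```python
from collections import Counter
from math import log


def alignment_columns(rows):
    """Transpose equal-length row strings into a list of column strings."""
    return ["".join(column) for column in zip(*rows)]


def mutual_information(column_x, column_y):
    """Plug-in estimate of the mutual information (bits) of two columns."""
    n = len(column_x)
    if n == 0 or n != len(column_y):
        raise ValueError("columns must be non-empty and of equal length")
    count_x = Counter(column_x)
    count_y = Counter(column_y)
    count_xy = Counter(zip(column_x, column_y))
    total = 0.0
    for (x, y), joint in count_xy.items():
        # p(x,y) / (p(x) p(y)) simplifies to joint * n / (count_x * count_y)
        ratio = joint * n / float(count_x[x] * count_y[y])
        total += (joint / float(n)) * log(ratio, 2)
    return total
```

Once the columns exist as strings, every per-column statistic becomes a simple loop over that list, which is why the one-off transposition pays for itself.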
(It's good to see lots of possible project ideas bouncing around)

Peter

From chapmanb at 50mail.com Fri Mar 22 08:48:34 2013
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 22 Mar 2013 08:48:34 -0400
Subject: [Biopython-dev] Project ideas for GSoC (or other student projects)
In-Reply-To:
References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com>
Message-ID: <87zjxvsiql.fsf@fastmail.fm>

Peter;

> I've been wondering about potential GSoC projects which I'd
> be interested in mentoring (or co-mentoring), and thus far I've
> only got one outline idea.
>
> I'm interested in taking the Bio.SeqIO.index(...) / index_db(...)
> functionality (which does whole record parsing on demand)
> and extending this with lazy-loading or lazy-parsing (which
> has precedent in our BioSQL wrappers). For example, with
> whole genome FASTA files you may never need to load the
> entire sequence, but using an index system like tabix (or
> even actually using a tabix index) Biopython could provide
> a lazy-loading Seq object which extracts only the sequence
> region of interest on demand.

This sounds incredibly useful. It's definitely worthwhile writing up if you'll have time this summer to mentor it.

> Likewise, this makes sense for GTF/GFF/GFF3 where you
> would index the features, and also if present index the
> embedded FASTA sequence at the end of the file.

I'm cc'ing Ryan, who has been thinking about similar work as part of gffutils. We're planning now on an approach that takes the BCBio.GFF parsing and rolls it into gffutils so we can parse, index in a SQLite database and expose as Biopython objects.
Here is some initial discussion and planning: https://github.com/daler/gffutils/issues/2 https://docs.google.com/document/d/15l_yZ_pge22ETw-pz2g4NWRAUAccmr1MYPmqXbj1Jl8/edit?usp=sharing Brad From dalerr at niddk.nih.gov Fri Mar 22 12:20:45 2013 From: dalerr at niddk.nih.gov (Ryan Dale) Date: Fri, 22 Mar 2013 12:20:45 -0400 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: <87zjxvsiql.fsf@fastmail.fm> References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> <87zjxvsiql.fsf@fastmail.fm> Message-ID: <514C84DD.9070306@niddk.nih.gov> Hi Brad & Peter - On 03/22/2013 08:48 AM, Brad Chapman wrote: > Peter; > >> I've been wondering about potential GSoC projects which I'd >> be interested in mentoring (or co-mentoring), and thus far I've >> only got one outline idea. >> >> I'm interested in taking the Bio.SeqIO.index(...) / index_db(...) >> functionality (which does whole record parsing on demand) >> and extending this with lazy-loading or lazy-parsing (which >> has precedent in our BioSQL wrappers). For example, with >> whole genome FASTA files you may never need to load the >> entire sequence, but using an index system like tabix (or >> even actually using a tabix index) Biopython could provide >> a lazy-loading Seq object which extracts only the sequence >> region of interest on demand. > This sounds incredibly useful. It's definitely worthwhile writing up if > you'll have time this summer to mentor it. Agreed - a general, lazy-loading/lazy-parsing, indexed mechanism for accessing data annotation-like file formats would be fantastic. >> Likewise, this makes sense for GTF/GFF/GFF3 where you >> would index the features, and also if present index the >> embedded FASTA sequence at the end of the file. > I'm cc'ing Ryan, who has been thinking about similar work as part of > gffutils. 
We're planning now on an approach that takes the BCBio.GFF
> parsing and rolls it into gffutils so we can parse, index in a SQLite
> database and expose as Biopython objects. Here is some initial
> discussion and planning:
>
> https://github.com/daler/gffutils/issues/2
> https://docs.google.com/document/d/15l_yZ_pge22ETw-pz2g4NWRAUAccmr1MYPmqXbj1Jl8/edit?usp=sharing

As Peter pointed out on the GitHub issues page, what he has in mind is more general than just GFF/GTF, and I see gffutils as extending upon a specific subset of the functionality he proposes.

For example, there are common use-cases that I think make sense for a GFF/GTF-only library (say, adding new annotations for introns, as inferred from the isoform + exon annotations) that might not be readily generalizable to all annotation-like file formats. But if this general indexing approach were already available, then gffutils could just be a wrapper around that, adding the specific GFF/GTF functionality as another layer.

Then again . . . currently gffutils imports GFF data into a sqlite3 database, so data are persistent and both read/write. For the intron-inferring example, we simply add new records to the db, but with an indexing approach, the file would presumably have to be re-indexed before reading again. So how you'd like to use your GFF files (read-only vs read/write) would influence which strategy you'd choose.

So I think there's actually smaller-than-expected overlap between gffutils and Peter's general indexing idea, and in the context of GSoC, I'm not sure you'd have to take gffutils into account. But gffutils would certainly benefit from general indexing, especially when retrieving sequences for features!
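To make the "retrieve only the region of interest" idea from this thread concrete, here is a small sketch of offset-based FASTA access using seek on a binary file handle. It is loosely modelled on faidx-style indexes rather than tabix, assumes every sequence line of a record (except the last) has the same length, and uses made-up function names; it is not how Bio.SeqIO.index works today.

```python
def index_fasta(handle):
    """Scan a binary FASTA handle once, recording where each record lives."""
    index = {}
    name = None
    while True:
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            name = line[1:].split()[0].decode()
            # offset: file position of the first base of this record
            index[name] = {"offset": handle.tell(), "length": 0,
                           "bases": None, "width": None}
        elif name is not None:
            entry = index[name]
            stripped = line.rstrip(b"\r\n")
            if entry["bases"] is None:
                # bases per line, and bytes per line including the newline
                entry["bases"], entry["width"] = len(stripped), len(line)
            entry["length"] += len(stripped)
    return index


def fetch(handle, index, name, start, end):
    """Return sequence[start:end] for one record without reading it all."""
    entry = index[name]
    end = min(end, entry["length"])
    if start < 0 or start >= end:
        return ""
    bases, width = entry["bases"], entry["width"]
    first = entry["offset"] + (start // bases) * width + start % bases
    last = entry["offset"] + ((end - 1) // bases) * width + (end - 1) % bases
    handle.seek(first)
    raw = handle.read(last - first + 1)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()
```

A lazy-loading Seq object could wrap fetch behind its slicing method, so that a feature's coordinates translate directly into one seek and one short read.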
-ryan From mjldehoon at yahoo.com Tue Mar 26 09:21:35 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 26 Mar 2013 06:21:35 -0700 (PDT) Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: Message-ID: <1364304095.69042.YahooMailClassic@web164002.mail.gq1.yahoo.com> Hi all, Speaking of which, we have a Biopython Structural Bioinformatics FAQ (i.e. how to use the Bio.PDB module) on the Biopython website with additional information on Bio.PDB, including some information on things that are not in the main Biopython Tutorial. Perhaps this is a good time to integrate this FAQ into the main documentation? We could also update it a bit because it's been a while and there are some different things here and there. And additions too. I went over the Biopython Structural Bioinformatics FAQ and integrated it into the main Biopython tutorial; see biopython.org/DIST/docs/tutorial/Tutorial-dev.html or biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf Though I think everything is there, it may be good if somebody more experienced with Bio.PDB were to look it over to see if it still makes sense. In addition, I converted the Biopython Structural Bioinformatics FAQ to our wiki format and added it to our wiki documentation; see http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ This wiki now contains the exact same information (except for some minor updates/fixes) as the PDF with the Biopython Structural Bioinformatics FAQ that we have on the Biopython website. I guess with this we can remove the lyx/tex source code of the Biopython Structural Bioinformatics FAQ from the git repository, as well as the PDF from the Biopython website. Any objections? Best, -Michiel. 
From p.j.a.cock at googlemail.com Tue Mar 26 09:53:52 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 26 Mar 2013 13:53:52 +0000 Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: <1364304095.69042.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1364304095.69042.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: On Tue, Mar 26, 2013 at 1:21 PM, Michiel de Hoon wrote: > > I guess with this we can remove the lyx/tex source code of the Biopython Structural Bioinformatics FAQ from the git repository, as well as the PDF from the Biopython website. Any objections? > Good work Michiel :) I would suggest making a final revision to the Biopython Structural Bioinformatics FAQ to explain this document is now obsolete, and where the information has moved to. Commit that to git, and put the final PDF online replacing the current version. That way anyone looking at the PDF online (or the git history) will have a clear route to finding the current information. Thanks, Peter From anaryin at gmail.com Tue Mar 26 09:54:55 2013 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 26 Mar 2013 14:54:55 +0100 Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: References: <1364304095.69042.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: Great work! I'll go over it in the next few days. 2013/3/26 Peter Cock > On Tue, Mar 26, 2013 at 1:21 PM, Michiel de Hoon > wrote: > > > > I guess with this we can remove the lyx/tex source code of the Biopython > Structural Bioinformatics FAQ from the git repository, as well as the PDF > from the Biopython website. Any objections? > > > > Good work Michiel :) > > I would suggest making a final revision to the Biopython Structural > Bioinformatics > FAQ to explain this document is now obsolete, and where the information has > moved to. Commit that to git, and put the final PDF online replacing the > current > version. 
That way anyone looking at the PDF online (or the git
> history) will have
> a clear route to finding the current information.
>
> Thanks,
>
> Peter
>

From lara.vignotto at gmail.com Wed Mar 27 10:09:50 2013
From: lara.vignotto at gmail.com (Lara Vignotto)
Date: Wed, 27 Mar 2013 15:09:50 +0100
Subject: [Biopython-dev] [GSoC] Further info about Codon alignment idea
Message-ID:

Hello,
I'm a student from Italy. I'm in the first year of Biotechnology at the University of Udine, and I'm interested in the Codon alignment and analysis project proposed for the Google Summer of Code 2013.
Since I would like to know whether I have the skills required to contribute, can you tell me more about the project?

Regards,
Lara Vignotto

From 88whacko at gmail.com Thu Mar 28 06:39:07 2013
From: 88whacko at gmail.com (Andrea Rizzi)
Date: Thu, 28 Mar 2013 11:39:07 +0100
Subject: [Biopython-dev] New contributor
In-Reply-To:
References: <1363185895.3324.YahooMailClassic@web164005.mail.gq1.yahoo.com>
Message-ID:

Thank you for the great feedback Peter. I'll write a test case for Bio.Alphabet then, since I couldn't find any. When it's ready I'll open a pull request.

Thank you again!
Andrea

2013/3/21 Peter Cock

> On Wed, Mar 20, 2013 at 6:10 PM, Andrea Rizzi <88whacko at gmail.com> wrote:
> > Thank you for your welcome Michiel!
> >
> > I will looking for a good project to work on in the next few days and I
> > will let you know soon. Meanwhile I've started to read some code to
> become
> > familiar with the modules and I bumped into few small bugs concerning the
> > Seq objects, in particular I found:
> >
> > 1) a duplicated test method name (one test in test_Seq_objs.py wasn't
> > performed);
> > 2) an error in Alphabet._case_less().
>
> Well spotted - changes applied to the master, thanks.
>
> > I've also expanded a little bit the documentation and I've substituted
> > tostring() method with the suggested str() method in a function of
> > MutableSeq.
The branch is located here > > > > https://github.com/andrrizzi/biopython/tree/seq-branch > > > > I'm not sure if it is more comfortable for you to merge these kinds of > > commits from a git branch or it is more advisable to open a ticket and > > create a patch. Anyway if you think these small commits may be useful, > feel > > free to use them. > > If you're happy on GitHub, a pull request is simplest. I've looked > at these changes one by one and applied and/or commented > on them. > > (We're debating moving our issue tracker from RedMine to > GitHub, which would make things a little easier in future). > > Thank you! > > Peter > From p.j.a.cock at googlemail.com Thu Mar 28 09:39:57 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 28 Mar 2013 13:39:57 +0000 Subject: [Biopython-dev] [GSoC] Further info about Codon alignment idea In-Reply-To: References: Message-ID: On Wed, Mar 27, 2013 at 2:09 PM, Lara Vignotto wrote: > Hello, > I'm a student from Italy. I'm attending the first year of Biotechnology at > the University of Udine, and I'm interested in the Codon alignment and > analysis project proposed for the Google Summer of Code 2013. > Since I would like to know if I have got the skills required to contribute, > can you tell me more about the project? > > Regards, > Lara Vignotto Hi Lara, Welcome and thank you for your interest in taking part in GSoC 2013. The background discussion to the outline idea on the wiki was here: http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010449.html http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010471.html http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010474.html http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010475.html (I think that was all the posts - check the archive to be sure).
The text of the wiki is hopefully enough to spark your interest - what we'd really like to see is a student intrigued by the idea and driven to expand the topic into a full project proposal. If, for example, your current course work included some phylogenetics, that might help give you perspective about what is useful and worth adding to Biopython. You should probably also have a look at the NESCent GSoC project ideas if it is the phylogenetic side that really interests you - in previous years Biopython has mentored GSoC students with NESCent: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2013 You would also need to be competent with Python - although if you also know and love Perl or Ruby (etc) there might be a mentor willing to supervise a related project with BioPerl or BioRuby - that's good too from the wider OBF and Bio* perspective. For tree traversal some background reading on things like breadth first search and other algorithms for 'walking' the tree would be a good idea (see also the Python os.path module for 'walking' a file system tree). I'm sure there will be other technical things to learn about and use, depending on where a GSoC project based on this idea went. Did that help? Is there something more specific I can try to answer? Regards, Peter From p.j.a.cock at googlemail.com Thu Mar 28 11:44:11 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 28 Mar 2013 15:44:11 +0000 Subject: [Biopython-dev] Fwd: [biopython] Custom GenBank locus length (#171) In-Reply-To: References: Message-ID: For those not getting the pull request emails from GitHub, ---------- Forwarded message ---------- From: Marco Galardini Date: Thu, Mar 28, 2013 at 3:19 PM Subject: [biopython] Custom GenBank locus length (#171) To: biopython/biopython Instead of an exception, raise a warning, so the file is saved and the user can decide to correct the error.
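The warn-instead-of-fail behaviour proposed in this pull request could be sketched roughly as follows with the standard warnings module. The function name check_locus_name and its max_len parameter are hypothetical and only for illustration; this is not the actual code in Bio/SeqIO/InsdcIO.py.

```python
import warnings


def check_locus_name(locus, max_len=16):
    """Warn (rather than raise) when a locus name exceeds max_len characters.

    Hypothetical helper for illustration, not Biopython's implementation.
    """
    if len(locus) > max_len:
        # A warning lets the caller carry on writing the file and fix
        # the identifier later; an exception would abort the write.
        warnings.warn(
            "Locus identifier %r is longer than %i characters"
            % (locus, max_len),
            UserWarning,
        )
    return locus
```

A caller who prefers the strict behaviour can still turn the warning back into an error with warnings.simplefilter("error").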
I don't know if this is a good practice, but I have some GenBank files provided by the JGI/DOE with locus names longer than 16 chars, so I guess that providing a warning to the user instead of a complete failure could be better. ________________________________ You can merge this Pull Request by running git pull https://github.com/mgalardini/biopython patch-1 Or view, comment on, or merge it at: https://github.com/biopython/biopython/pull/171 Commit Summary Custom GenBank locus length File Changes M Bio/SeqIO/InsdcIO.py (4) Patch Links: https://github.com/biopython/biopython/pull/171.patch https://github.com/biopython/biopython/pull/171.diff From marco.galardini at unifi.it Thu Mar 28 11:54:38 2013 From: marco.galardini at unifi.it (Marco Galardini) Date: Thu, 28 Mar 2013 16:54:38 +0100 Subject: [Biopython-dev] Fwd: [biopython] Custom GenBank locus length (#171) In-Reply-To: References: Message-ID: <515467BE.7090105@unifi.it> Good afternoon everyone, Actually, I have been testing a bit more and some other changes may be needed (sorry about that, this is my first change to the Biopython code). The assertions on the line lengths still fail, so my guess is that it's probably not a good idea to try to write out a GenBank file with unusual identifiers (even if they are from JGI!). Marco On 03/28/2013 04:44 PM, Peter Cock wrote: > For those not getting the pull request emails from GitHub, > > ---------- Forwarded message ---------- > From: Marco Galardini > Date: Thu, Mar 28, 2013 at 3:19 PM > Subject: [biopython] Custom GenBank locus length (#171) > To: biopython/biopython > > > Instead of an exception, raise a warning, so the file is saved and the > user can decide to correct the error. > > I don't know if this is a good practice, but I have some GenBank files > provided by the JGI/DOE with locus names longer than 16 chars, so I > guess that providing a warning to the user instead of a complete > failure could be better.
> > ________________________________ > > You can merge this Pull Request by running > > git pull https://github.com/mgalardini/biopython patch-1 > > Or view, comment on, or merge it at: > > https://github.com/biopython/biopython/pull/171 > > Commit Summary > > Custom GenBank locus length > > File Changes > > M Bio/SeqIO/InsdcIO.py (4) > > Patch Links: > > https://github.com/biopython/biopython/pull/171.patch > https://github.com/biopython/biopython/pull/171.diff > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev -- ------------------------------------------------- Marco Galardini, PhD Dipartimento di Biologia Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI) e-mail: marco.galardini at unifi.it www: http://www.unifi.it/dblage/CMpro-v-p-51.html phone: +39 055 4574737 mobile: +39 340 2808041 ------------------------------------------------- From p.j.a.cock at googlemail.com Thu Mar 28 14:00:38 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 28 Mar 2013 18:00:38 +0000 Subject: [Biopython-dev] stdout/stderr handling oddity Message-ID: Hi all, While looking at the BWA wrapper from Saket Choudhary https://github.com/biopython/biopython/pull/167 and the associated enhancement to the __call__ functionality of the command line wrapper base class, I wrote a couple of unit tests - which have left me a little puzzled: https://github.com/biopython/biopython/commit/3f5d4c442424a7ca33ae0bafa60c840e80ae2fda Could a few of you try running this test_Application.py file and confirm it works as is, and try uncommenting the two problem test cases? (I'm curious if the echo test works as intended on a plain Windows machine without cygwin installed - I hope so). 
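For anyone wanting to reproduce the behaviour outside the wrapper classes, here is a minimal sketch using subprocess directly. The helper run_command and its arguments are assumptions made for illustration, not the Bio.Application code; a stream that is not captured is sent to os.devnull.

```python
import os
import subprocess
import sys


def run_command(cmd, capture_stdout=True, capture_stderr=True):
    """Run cmd (a list of arguments) and return (stdout, stderr) as text.

    A stream that is not captured goes to os.devnull and comes back as None.
    Hypothetical helper for illustration, not the Bio.Application code.
    """
    # Opened for writing, because the child process writes to this handle.
    devnull = open(os.devnull, "w")
    try:
        child = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE if capture_stdout else devnull,
            stderr=subprocess.PIPE if capture_stderr else devnull,
            universal_newlines=True,
        )
        out, err = child.communicate()
    finally:
        devnull.close()
    return out, err


# A tiny child program that writes one line to each stream.
child_code = (
    "import sys; "
    "sys.stdout.write('to stdout\\n'); "
    "sys.stderr.write('to stderr\\n')"
)
out, err = run_command([sys.executable, "-c", child_code])
```

Using the running interpreter (sys.executable) as the child avoids depending on echo or any other external tool, which should keep the check portable to a plain Windows machine.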
Unless anyone else can explain this, I think the next step is a simple test program which produces predictable output to both stdout and stderr, just in case this is due to there being no stderr output in these tests. e.g. Print integers 1, 2, 3, 4, ..., to some sensible limit, like 20, where non-primes are on stdout while primes on stderr. Peter From arklenna at gmail.com Thu Mar 28 16:54:11 2013 From: arklenna at gmail.com (Lenna Peterson) Date: Thu, 28 Mar 2013 16:54:11 -0400 Subject: [Biopython-dev] stdout/stderr handling oddity In-Reply-To: References: Message-ID: Hi Peter, On Mac OS X, opening os.devnull with mode 'w' on lines 418 and 422 of the Application __init__.py causes the tests to pass for me. Lenna From saketkc at gmail.com Thu Mar 28 16:57:54 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Fri, 29 Mar 2013 02:27:54 +0530 Subject: [Biopython-dev] stdout/stderr handling oddity In-Reply-To: References: Message-ID: Yes. And the reason is this :http://stackoverflow.com/questions/2368967/bad-file-descriptor-error On 29 March 2013 02:24, Lenna Peterson wrote: > Hi Peter, > > On Mac OS X, opening os.devnull with mode 'w' on lines 418 and 422 of the > Application __init__.py causes the tests to pass for me. > > Lenna > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From saketkc at gmail.com Thu Mar 28 17:00:00 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Fri, 29 Mar 2013 02:30:00 +0530 Subject: [Biopython-dev] stdout/stderr handling oddity In-Reply-To: References: Message-ID: Forgot to add : Tested on Ubuntu 12.04 On 29 March 2013 02:27, Saket Choudhary wrote: > Yes. 
> And the reason is this > :http://stackoverflow.com/questions/2368967/bad-file-descriptor-error > > On 29 March 2013 02:24, Lenna Peterson wrote: >> Hi Peter, >> >> On Mac OS X, opening os.devnull with mode 'w' on lines 418 and 422 of the >> Application __init__.py causes the tests to pass for me. >> >> Lenna From p.j.a.cock at googlemail.com Thu Mar 28 18:11:11 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 28 Mar 2013 22:11:11 +0000 Subject: [Biopython-dev] stdout/stderr handling oddity In-Reply-To: References: Message-ID: > On 29 March 2013 02:24, Lenna Peterson wrote: >> Hi Peter, >> >> On Mac OS X, opening os.devnull with mode 'w' on lines 418 and 422 of the >> Application __init__.py causes the tests to pass for me. >> >> Lenna On Thu, Mar 28, 2013 at 8:57 PM, Saket Choudhary wrote: > Yes. > And the reason is this > :http://stackoverflow.com/questions/2368967/bad-file-descriptor-error > Thank you both - I am kicking myself now - maybe I should have taken another sick day this week instead of returning to work? ;) Fixed: https://github.com/biopython/biopython/commit/bba2acbf3d690ad7b99e94ac8ead6763b1d05ab8 I guess no one had bothered to use this option to send stderr to /dev/null - or if they had, they never reported this error. The only thing which puzzles me is why this worked for stdout. Odd. Cheers, Peter From p.j.a.cock at googlemail.com Fri Mar 29 07:54:33 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 29 Mar 2013 11:54:33 +0000 Subject: [Biopython-dev] Fwd: [biopython] Fix Biopython installation with pip (#172) In-Reply-To: References: Message-ID: Hi Brad, This sounds sensible in principle - it just needs some hands-on testing on various systems - any volunteers who use pip and virtual envs?
Thanks, Peter ---------- Forwarded message ---------- From: Brad Chapman Date: Fri, Mar 29, 2013 at 11:47 AM Subject: [biopython] Fix Biopython installation with pip (#172) To: biopython/biopython Hi all; This is yet another take on making Biopython install nicely with pip in virtual environments. This avoids adding numpy as an explicit dependency and instead uses it if present or skips it if not. The problem with the previous install_requires approach is that pip doesn't build and install all requirements before setting up Biopython, so Biopython will fail with a numpy missing error. Additionally, our old approach drags in numpy so creates a heavyweight dependency for isolated environments. The new approach requires users to explicitly install numpy if needed but doesn't penalize them if it's not present. I submitted as a pull request for documentation and feedback from anyone. If y'all agree, merge away. Thanks, Brad ________________________________ You can merge this Pull Request by running git pull https://github.com/chapmanb/biopython master Or view, comment on, or merge it at: https://github.com/biopython/biopython/pull/172 Commit Summary Improve Biopython installation with pip: avoid including numpy as dependency when automated. 
Instead explicitly avoid needing numpy installed to continue Add helpful comment on pip dependency management File Changes M setup.py (38) Patch Links: https://github.com/biopython/biopython/pull/172.patch https://github.com/biopython/biopython/pull/172.diff From chapmanb at 50mail.com Fri Mar 1 02:25:42 2013 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 28 Feb 2013 21:25:42 -0500 Subject: [Biopython-dev] Coming soon: BOSC/Broad Hackathon, SciPy Bioinformatics, BOSC Codefest Message-ID: <87lia7ua8p.fsf@fastmail.fm> Hi all; There are some upcoming coding events and conferences of interest to open source biology programmers: - BOSC/Broad Interoperability Hackathon -- This is a two day coding session at the Broad Institute in Cambridge, MA on April 7-8 focused on improving tool interoperability. Sign up and details: http://j.mp/XJT6ew - SciPy 2013 -- The Scientific Python conference is June 26-27 in Austin and has a Bioinformatics mini-symposia this year. They're doing some great work like IPython, NumPy, SciPy and scikit-learn; and this is a nice opportunity to reach a new set of like minded programmers and expand the open source bioinformatics community. Bioinformatics mini-symposia: http://j.mp/Z4xxXB Abstract details: http://conference.scipy.org/scipy2013/about.php - Codefest at the Bioinformatics Open Source Conference -- This year BOSC is taking place in Berlin from July 19-20 and we'll have a two day coding session before the conference. This is the 4th year of Codefests and they've proven to be a productive and fun time to work collectively on open source projects. 
Sign up and details: http://www.open-bio.org/wiki/Codefest_2013 BOSC conference: http://www.open-bio.org/wiki/BOSC_2013

Here are the key dates for the events and abstracts:

March 20, 2013: SciPy abstracts due
April 7-8, 2013: BOSC/Broad Interoperability Hackathon, Cambridge, MA
April 12, 2013: BOSC abstracts due
June 24-29, 2013: SciPy in Austin, TX
July 17-18, 2013: Codefest 2013, Berlin
July 19-20, 2013: BOSC 2013, Berlin

Looking forward to seeing everyone this spring and summer for plenty of fun science and code, Brad From chapmanb at 50mail.com Fri Mar 1 02:36:34 2013 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 28 Feb 2013 21:36:34 -0500 Subject: [Biopython-dev] [ANN] SciPy2013: Call for abstracts In-Reply-To: References: Message-ID: <87ppzjsv65.fsf@fastmail.fm> Peter; Thanks for sending this out. I'm helping with the organization of the SciPy bioinformatics session thanks to Peter's recommendation and wrote up a little bit about the types of abstracts that would fit well with the overall theme of SciPy: http://j.mp/Z4xxXB This is a great chance to connect with another open source scientific community so definitely send in an abstract if this is of interest; the deadline is coming up next month: March 20th. Austin also has awesome music and barbecue in addition to science and hacking so lots of reasons to attend, Brad > The new bioinformatics mini-symposium this year makes SciPy 2013 > especially interesting.
> > Peter > > ---------- Forwarded message ---------- > From: *Jonathan Rocher* > Date: Wednesday, February 27, 2013 > Subject: [Numpy-discussion] [ANN] SciPy2013: Call for abstracts > To: SciPy Users List , numfocus at googlegroups.com, > Discussion of Numerical Python > > > [Apologies for cross-posts] > > Dear all, > > The annual SciPy Conference (Scientific Computing with > Python) allows > participants from academic, commercial, and governmental organizations to > showcase their latest projects, learn from skilled users and developers, > and collaborate on code development. *The deadline for abstract submissions > is March 20th, 2013.* > > Submissions are welcome that address general Scientific Computing with > Python, one of the two special themes for this year's conference (machine > learning & reproducible science), or the domain-specific > mini-symposia held > during the conference (Meteorology, climatology, and atmospheric and > oceanic science, Astronomy and astrophysics, Medical imaging, > Bio-informatics). > > Please submit your abstract at the SciPy 2013 website abstract submission > form. > Abstracts will be accepted for posters or presentations. Optional papers to > be published in the conference proceedings will be requested following > abstract submission. This year the proceedings will be made available prior > to the conference to help attendees navigate the conference. > > We look forward to an exciting and interesting set of talks, posters, and > discussions and hope to see you at the conference. > The SciPy 2013 Program Committee Chairs > > Matt McCormick, Kitware, Inc.
> Katy Huff, University of Wisconsin-Madison and Argonne National Laboratory > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From 88whacko at gmail.com Sat Mar 2 16:49:48 2013 From: 88whacko at gmail.com (Andrea Rizzi) Date: Sat, 2 Mar 2013 17:49:48 +0100 Subject: [Biopython-dev] New contributor Message-ID: Hello! My name is Andrea Rizzi and I'm a master's student in computer science and computational biology. I would be glad to help you developing biopython. I've used the library quite extensively but I'm mostly familiar with handling sequences, MSAs and PDB files. I've read through the small contributing guide on the wiki and on the tutorial and I thought I could start with something relatively straightforward like writing/completing some unit tests (if I understood correctly there's a fairly strong need for them). I've good knowledge of both git and unittest. Anyway any task is actually fine to me :) . If you agree I'll try to look for a module that needs some more testing (or maybe you have one to suggest me), otherwise I could just go to the bug tracker and try to help out fixing some bugs. -- -- Andrea From p.j.a.cock at googlemail.com Sun Mar 3 12:00:25 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 3 Mar 2013 12:00:25 +0000 Subject: [Biopython-dev] Fwd: GSoC 2013 is ON In-Reply-To: <20130303112326.GA5638@thebird.nl> References: <20130303112326.GA5638@thebird.nl> Message-ID: Time to start preparations for Google Summer of Code 2013 :) ---------- Forwarded message ---------- From: *Pjotr Prins* Date: Sunday, March 3, 2013 Subject: GSoC 2013 is ON Game on! GSoC 2013 is ON. I am running with the OBF project administration this year for the Google Summer of code (GSoC). First and foremost I want to thank Robert Buels and others for making OBF/GSoC a success in the previous three years! 
This year, Robert, Chris Fields and Hilmar Lapp will act as backup administrators. The deadline for the OBF application for GSoC2013 as a mentoring organisation is Friday March 29! See http://www.google-melange.com/gsoc/events/google/gsoc2013 Similar to previous years, each Bio* project needs to update and add project ideas on the project's individual OBF wiki page and create links from the main OBF page at http://www.open-bio.org/wiki/Google_Summer_of_Code (we will update the main information on that page soon). So, for each of the OBF projects that wants to do GSoC again this year: 1. Update the list of project ideas on your project's GSoC page (BioPython, BioPerl, BioRuby, etc). Add new ones, remove ones that have already been done or no longer relevant, etc. For an example see http://bioruby.open-bio.org/wiki/Google_Summer_of_Code 2. Update the final list of project ideas on the main OBF GSoC page to match. http://www.open-bio.org/wiki/Google_Summer_of_Code 3. Register with gsoc at lists.open-bio.org 4. Announce it on that list when you are ready :) Anyone can submit a project idea! Former GSoC students are especially encouraged to contribute ideas to the mailing lists. Please have the updates done by Friday March 22nd. The number and quality of the project ideas are part of the evaluation process for whether OBF is accepted as a Summer of Code organisation again this year, so let's come up with some good ones! Pj. (Pjotr Prins) Important dates: * March 22nd: Finalise project ideas * March 29th: Deadline OBF mentoring organisation submission to Google http://www.open-bio.org/wiki/Google_Summer_of_Code From saketkc at gmail.com Mon Mar 4 10:59:26 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 4 Mar 2013 16:29:26 +0530 Subject: [Biopython-dev] BWA Wrapper In-Reply-To: References: Message-ID: Hi, I have updated the code here : https://github.com/saketkc/biopython/tree/bwa_wrapper I have added unittests for the wrapper. 
And yes, this did help me in fixing a lot of minor bugs in my original wrapper. @Peter : Is this 'pull request' ready ? Thanks Saket On 19 February 2013 19:55, Peter Cock wrote: > On Tue, Feb 19, 2013 at 1:15 PM, Saket Choudhary wrote: >> >> Thanks Peter. >> >> I will add that. Any pointers to what would be a good reference test_aba.py >> file in Tests/ directory for writing unit tests for this ? >> >> I have worked on BDD before but Unit Tests are new for me, so it may take >> some time.I plan to finish it the coming week once my university >> examinations are done >> >> Thanks >> >> Saket > > There's a chapter in the Tutorial about our test framework. In this > case existing command line tool wrappers are the best reference, > e.g. test_Emboss.py or test_Muscle.py > > Also if you want to use doctests and have them included in the > test suite, add the module to the list in Tests/run_tests.py - however > this does not handle optional dependencies (other than NumPy). > Therefore all the application wrapper doctests to date have carefully > avoided actually invoking the command line - and instead most > print the string representation instead. This allows us to check > the example use cases should run (and catches silly errors in > the examples like a typo in an argument name). > > Thanks, > > Peter From saketkc at gmail.com Tue Mar 5 17:26:57 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Tue, 5 Mar 2013 22:56:57 +0530 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: I had this idea of an online biopython shell on the lines of bioruby shell : http://bioruby.open-bio.org/wiki/BioRubyOnRails On 13 February 2013 07:38, Michiel de Hoon wrote: > It would be great to have better support for microarray analysis in Biopython. Something like lumi/limma in R. 
Perhaps this is an option for the GSoC? > > Best, > -Michiel. > > --- On Tue, 2/12/13, Peter Cock wrote: > >> From: Peter Cock >> Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) >> To: "Biopython-Dev Mailing List" >> Date: Tuesday, February 12, 2013, 12:51 PM >> Hello all, >> >> Google recently confirmed they will be running Google Summer >> of Code 2013, >> and we (Biopython and the other Bio* projects) would hope to >> be accepted again >> under the Open Bioinformatics Foundation as in previous >> years: >> http://lists.open-bio.org/pipermail/gsoc/2013/000196.html >> >> It would be great to start coming up with potential project >> ideas, both larger >> pieces of work suitable for GSoC but also smaller tasks for >> other project >> students, or 'low hanging fruit' for potential contributors >> to cut >> their teeth on. >> >> See also http://biopython.org/wiki/Active_projects >> and the ideas list there. >> >> Regards, >> >> Peter >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Fri Mar 8 16:08:46 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 8 Mar 2013 16:08:46 +0000 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Tue, Mar 5, 2013 at 5:26 PM, Saket Choudhary wrote: > I had this idea of an online biopython shell on the lines of bioruby shell : > http://bioruby.open-bio.org/wiki/BioRubyOnRails > That screenshot makes me think of http://ipython.org/ - is that similar? 
Peter From redmine at redmine.open-bio.org Fri Mar 8 16:49:48 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Fri, 8 Mar 2013 16:49:48 +0000 Subject: [Biopython-dev] [Biopython - Bug #3422] (New) Missing Message-ID: Issue #3422 has been reported by Jared Sampson. ---------------------------------------- Bug #3422: Missing https://redmine.open-bio.org/issues/3422 Author: Jared Sampson Status: New Priority: Normal Assignee: Category: Target version: URL: http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/bookdoc_130101.dtd When using Entrez.efetch to retrieve an XML file, I get the following warning about a missing DTD: bookdoc_130101.dtd === /path/to/my/virtualenv/lib/python2.7/site-packages/Bio/Entrez/Parser.py:522: UserWarning: Unable to load DTD file bookdoc_130101.dtd. Bio.Entrez uses NCBI's DTD files to parse XML files returned by NCBI Entrez. Though most of NCBI's DTD files are included in the Biopython distribution, sometimes you may find that a particular DTD file is missing. While we can access the DTD file through the internet, the parser is much faster if the required DTD files are available locally. For this purpose, please download bookdoc_130101.dtd from http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/bookdoc_130101.dtd and save it either in directory /path/to/my/virtualenv/lib/python2.7/site-packages/Bio/Entrez/DTDs or in directory /Users/me/.biopython/Bio/Entrez/DTDs in order for Bio.Entrez to find it. Alternatively, you can save bookdoc_130101.dtd in the directory Bio/Entrez/DTDs in the Biopython distribution, and reinstall Biopython. Please also inform the Biopython developers about this missing DTD, by reporting a bug on http://bugzilla.open-bio.org/ or sign up to our mailing list and emailing us, so that we can include it with the next release of Biopython. Proceeding to access the DTD file through the internet... warnings.warn(message) === Also, the bugzilla.open-bio.org URL mentioned comes up empty. 
Thanks, Jared Sampson ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From saketkc at gmail.com Fri Mar 8 18:30:03 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Sat, 9 Mar 2013 00:00:03 +0530 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: It is essentially an online RoR based application that allows you to try bioruby through your browser without the need of a bioruby native install . I was thinking of a django/flask application that would essentially be a playground for trying out biopython Saket On 08/03/2013, Peter Cock wrote: > On Tue, Mar 5, 2013 at 5:26 PM, Saket Choudhary wrote: >> I had this idea of an online biopython shell on the lines of bioruby >> shell : >> http://bioruby.open-bio.org/wiki/BioRubyOnRails >> > > That screenshot makes me think of http://ipython.org/ - is that similar? > > Peter > From chapmanb at 50mail.com Sat Mar 9 16:06:34 2013 From: chapmanb at 50mail.com (Brad Chapman) Date: Sat, 09 Mar 2013 11:06:34 -0500 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: <87wqtgeewl.fsf@fastmail.fm> Saket and Peter; What you're describing is what Ipython provides, a web-based way to edit and interact with Python code. 
There are some projects that build on top of it to provide more of a playground environment like you're describing: http://continuum.io/wakari.html https://github.com/Exhibitionist/Exhibitionist Hope this helps, Brad > It is essentially an online RoR based application that allows you to > try bioruby through your browser without the need of a bioruby native > install . I was thinking of a django/flask application that would > essentially be a playground for trying out biopython > > > Saket > > On 08/03/2013, Peter Cock wrote: >> On Tue, Mar 5, 2013 at 5:26 PM, Saket Choudhary wrote: >>> I had this idea of an online biopython shell on the lines of bioruby >>> shell : >>> http://bioruby.open-bio.org/wiki/BioRubyOnRails >>> >> >> That screenshot makes me think of http://ipython.org/ - is that similar? >> >> Peter >> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From xbello at gmail.com Tue Mar 12 09:36:35 2013 From: xbello at gmail.com (Xabier Bello) Date: Tue, 12 Mar 2013 10:36:35 +0100 Subject: [Biopython-dev] Consumer of "KW" in embl format Message-ID: Hi: I don't know if this is the right way to do this. The code: records = SeqIO.parse(open("MyFile.embl", "r"), "embl") for record in records: print record.annotations["keywords"] Doesn't work I've added to Bio/GenBank/Scanner.py, in _feed_header_lines(): elif line_type == 'KW': consumer.keywords(data.rstrip(";")) And now it seems to parse the keyword lines. Regards. From p.j.a.cock at googlemail.com Tue Mar 12 09:54:51 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 12 Mar 2013 09:54:51 +0000 Subject: [Biopython-dev] Consumer of "KW" in embl format In-Reply-To: References: Message-ID: On Tue, Mar 12, 2013 at 9:36 AM, Xabier Bello wrote: > Hi: > > I don't know if this is the right way to do this. 
The code: > > records = SeqIO.parse(open("MyFile.embl", "r"), "embl") > for record in records: > print record.annotations["keywords"] > > Doesn't work > > I've added to Bio/GenBank/Scanner.py, in _feed_header_lines(): > > elif line_type == 'KW': > consumer.keywords(data.rstrip(";")) > > And now it seems to parse the keyword lines. > > Regards. Good idea, although it needs a little more generalisation for handling multiple keywords - a list of strings seems sensible here. Quoting ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt 3.4.6 The KW Line The KW (KeyWord) lines provide information which can be used to generate cross-reference indexes of the sequence entries based on functional, structural, or other categories deemed important. The format for a KW line is: KW keyword[; keyword ...]. More than one keyword may be listed on each KW line; the keywords are separated by semicolons, and the last keyword is followed by a full stop. Keywords may consist of more than one word, and they may contain embedded blanks and stops. A keyword is never split between lines. An example of a keyword line is: KW beta-glucosidase. The keywords are ordered alphabetically; the ordering implies no hierarchy of importance or function. If an entry has no keywords assigned to it, it will contain a single KW line like this: KW . Likewise the GenBank parser should support the KEYWORDS line too - and then writing the keywords out again too. Is this something you'd like to work on, or should I do it? (If you are interested in getting involved in Biopython development this seems like a nice project to start with - not too complicated, but large enough to make creating a fork on GitHub and your own enhancement branch a good idea.) 
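The generalisation to a list of strings might look something like the sketch below, which follows the KW rules quoted from usrman.txt (keywords separated by semicolons, the last one followed by a full stop, and a lone "KW ." meaning no keywords). The function parse_kw_lines is a hypothetical stand-alone helper for illustration, not the code in Bio/GenBank/Scanner.py.

```python
def parse_kw_lines(lines):
    """Collect the keywords from EMBL 'KW' lines into a list of strings.

    Hypothetical helper for illustration only, not Biopython's parser.
    """
    # Join the payload of every KW line; a keyword is never split
    # between lines, so a plain space is a safe separator.
    data = " ".join(
        line[2:].strip() for line in lines if line.startswith("KW")
    )
    # Only the final keyword is followed by a full stop; stops inside
    # keywords are left alone.
    if data.endswith("."):
        data = data[:-1]
    return [word.strip() for word in data.split(";") if word.strip()]


single = parse_kw_lines(["KW   beta-glucosidase."])
empty = parse_kw_lines(["KW   ."])
multi = parse_kw_lines(["KW   alpha; beta;", "KW   gamma delta."])
```

Here single holds one keyword, empty is an empty list, and multi holds three keywords with "gamma delta" kept as a single multi-word entry.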
Thanks, Peter From p.j.a.cock at googlemail.com Tue Mar 12 10:02:15 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 12 Mar 2013 10:02:15 +0000 Subject: [Biopython-dev] Consumer of "KW" in embl format In-Reply-To: References: Message-ID: On Tue, Mar 12, 2013 at 9:54 AM, Peter Cock wrote: > On Tue, Mar 12, 2013 at 9:36 AM, Xabier Bello wrote: >> Hi: >> >> I don't know if this is the right way to do this. The code: >> >> records = SeqIO.parse(open("MyFile.embl", "r"), "embl") >> for record in records: >> print record.annotations["keywords"] >> >> Doesn't work >> >> I've added to Bio/GenBank/Scanner.py, in _feed_header_lines(): >> >> elif line_type == 'KW': >> consumer.keywords(data.rstrip(";")) >> >> And now it seems to parse the keyword lines. >> >> Regards. > > Good idea, although it needs a little more generalisation for handling > multiple keywords - a list of strings seems sensible here. Quoting > ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt > > > 3.4.6 The KW Line > The KW (KeyWord) lines provide information which can be used to generate > cross-reference indexes of the sequence entries based on functional, > structural, or other categories deemed important. > The format for a KW line is: > KW keyword[; keyword ...]. > More than one keyword may be listed on each KW line; the keywords are > separated by semicolons, and the last keyword is followed by a full > stop. Keywords may consist of more than one word, and they may contain > embedded blanks and stops. A keyword is never split between lines. > An example of a keyword line is: > KW beta-glucosidase. > The keywords are ordered alphabetically; the ordering implies no hierarchy > of importance or function. If an entry has no keywords assigned to it, > it will contain a single KW line like this: > KW . > > > Likewise the GenBank parser should support the KEYWORDS line > too - and then writing the keywords out again too. > > Is this something you'd like to work on, or should I do it? 
To clarify - Biopython should already be reading and writing any
KEYWORDS line in GenBank files - the same data structure should
be used for EMBL files (your suggestion looks good, but an explicit
unit test covering single and multiple keywords would be ideal),
and then the EMBL writer updated to write this. i.e. code added in
Bio/SeqIO/InsdcIO.py

Peter

From p.j.a.cock at googlemail.com  Tue Mar 12 10:58:39 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 12 Mar 2013 10:58:39 +0000
Subject: [Biopython-dev] Consumer of "KW" in embl format
In-Reply-To:
References:
Message-ID:

On Tue, Mar 12, 2013 at 10:12 AM, Xabier Bello wrote:
> I think I'm not that close to the Biopython code.
>
> I found a problem (I needed to read the Keywords), and solved it quick and
> dirty. In fact, it doesn't read multiline KW. I'm not sure I could implement
> that in a fair amount of time.
>
> Regards.

No problem - I've committed your fix, a basic test, and extended
this for multiple KW lines. As discussed I've thanked you in the
NEWS file too.

https://github.com/biopython/biopython/commit/fc036dcdac22252a366647823a0c7c317c303313
https://github.com/biopython/biopython/commit/606ea9360d262d21c3e01eda66c4cf9118880d46

Updating the EMBL writer in Bio/SeqIO/InsdcIO.py should be a
nice small task for any volunteer wanting to make a first
contribution...

(Potential Google Summer of Code students - Hint hint ;) )

Thank you Xabier,

Peter

From p.j.a.cock at googlemail.com  Tue Mar 12 14:40:16 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 12 Mar 2013 14:40:16 +0000
Subject: [Biopython-dev] Consumer of "KW" in embl format
In-Reply-To:
References:
Message-ID:

On Tue, Mar 12, 2013 at 11:35 AM, Xabier Bello wrote:
> Lets try it:
>
> Line 997:
>     def _write_keywords(self, record):
>         #Put the keywords right after DE line.
>         self._write_multi_line("KW", "%s." % "; ".join(
>             record.annotations["keywords"]))
>         self.handle.write("XX\n")

Looks good - although there is a potential problem here with long keywords
where this does not avoid splitting a single keyword over multiple KW lines
(as specified in the EMBL specification). This is a corner case though...

> Line 1070:
>     if "keywords" in record.annotations:
>         self._write_keywords(record)
>
> Note to self: learn to make diff patches and forks on github.

Good plan :)

Meanwhile, I committed that change:
https://github.com/biopython/biopython/commit/41470eac55a665d1cb1c7e73ebfd3c1df98af5ad

I added a little more testing, from which I think we may need to do
some work with some of the other EMBL fields like dbxrefs:
https://github.com/biopython/biopython/commit/07639dde32083f4f024616292a5c736e85770a4e

Thanks,

Peter

From p.j.a.cock at googlemail.com  Tue Mar 12 15:13:23 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 12 Mar 2013 15:13:23 +0000
Subject: [Biopython-dev] Consumer of "KW" in embl format
In-Reply-To:
References:
Message-ID:

On Tue, Mar 12, 2013 at 2:40 PM, Peter Cock wrote:
> On Tue, Mar 12, 2013 at 11:35 AM, Xabier Bello wrote:
>> Lets try it:
>>
>> Line 997:
>>     def _write_keywords(self, record):
>>         #Put the keywords right after DE line.
>>         self._write_multi_line("KW", "%s." % "; ".join(
>>             record.annotations["keywords"]))
>>         self.handle.write("XX\n")
>
> Looks good - although there is a potential problem here with long keywords
> where this does not avoid splitting a single keyword over multiple KW lines
> (as specified in the EMBL specification). This is a corner case though...

OK, not such a rare case:

$ python test_SeqIO_features.py
...
======================================================================
ERROR: test_cor6 (__main__.TestWriteRead)
Write and read back cor6_6.gb
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_SeqIO_features.py", line 1105, in test_cor6
    write_read(os.path.join("GenBank", "cor6_6.gb"), "gb")
  File "test_SeqIO_features.py", line 35, in write_read
    compare_records(gb_records, gb_records2)
  File "test_SeqIO_features.py", line 110, in compare_records
    if not compare_record(old,new,expect_minor_diffs):
  File "test_SeqIO_features.py", line 101, in compare_record
    % (key, old.annotations[key], new.annotations[key]))
ValueError: Annotation mis-match for keywords:
['antifreeze protein homology', 'cold-regulated gene', 'cor6.6 gene', 'KIN1 homology']
['antifreeze protein homology', 'cold-regulated gene', 'cor6.6 gene', 'KIN1', 'homology']

----------------------------------------------------------------------

I'll fix this later today...

Peter

From clements at galaxyproject.org  Tue Mar 12 22:01:49 2013
From: clements at galaxyproject.org (Dave Clements)
Date: Tue, 12 Mar 2013 15:01:49 -0700
Subject: [Biopython-dev] 2013 Galaxy Community Conference (GCC2013), 30 June - 2 July, Oslo
Message-ID:

Hello all,

We are pleased to announce that early registration and paper and poster
abstract submission are now open for the 2013 Galaxy Community Conference
(GCC2013). GCC2013 will be held 30 June through July 2 in Oslo Norway,
at the University of Oslo.

GCC2013 is an opportunity to participate in two full days of presentations,
discussions, poster sessions, keynotes, lightning talks and breakouts,
*all about high-throughput biology and the tools that support it*. The
conference also includes a Training Day for the second year in a row,
this year with more in-depth topic coverage, more concurrent sessions,
and more topics.
If you are a biologist or bioinformatician performing or enabling high-throughput biological research, then please consider attending. GCC2013 is aimed at: - Bioinformatics tool developers and data providers - Workflow developers and power bioinformatics users - Sequencing and Bioinformatics core staff - Data archival and analysis reproducibility specialists *Early registration * *saves up to 75% off regular registration costs,* and is very affordable, with combined registration (Training Day + main meeting) starting at ~ ?95 for post-docs and students. Registering early also assures you a spot in the Training Day workshops you want to attend. Once a Training Day session becomes full, it will be closed to new registrations. Early registration closes 24 May. *Abstract submission * for oral presentations closes 12 April, and for posters on 3 May. Please consider presenting your work. If you are working with big biological data, then the people at this meeting want to hear about your work. Thanks, and hope to see you in Oslo! The GCC2013 Organizing Committee PS: And please help get the word out ! -- http://galaxyproject.org/GCC2013 http://galaxyproject.org/ http://getgalaxy.org/ http://usegalaxy.org/ http://wiki.galaxyproject.org/ From anaryin at gmail.com Wed Mar 13 11:09:29 2013 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 13 Mar 2013 12:09:29 +0100 Subject: [Biopython-dev] [Biopython] Updating GSOC page? In-Reply-To: References: Message-ID: Hello all, I updated the GSOC page on the wiki to be more organized: http://biopython.org/wiki/GSOC If no one opposes, I'll replace the current page (here) with it, just in time for GSOC 2013. Best, Jo?o PS. sorry for the spamming but I posted this 5 days ago in the non dev list and got no answers so.. 2013/3/8 Jo?o Rodrigues > Small update: http://biopython.org/wiki/GSOC > > If ok, We can just link the normal one for this one. I kept it separate > just in case. 
>
>
> 2013/3/4 Peter Cock
>
>> On Sun, Mar 3, 2013 at 11:07 PM, João Rodrigues
>> wrote:
>> > Hello all,
>> >
>> > Does anyone oppose a refresh of our GSOC page, based on the
>> > BioRuby page? It could use
>> > a facelift before the new round of projects/students come in.
>> >
>> > Best,
>> >
>> > João
>>
>> A good idea - see also the GSoC discussions on the biopython-dev
>> list about potential project ideas.
>>
>> Thanks,
>>
>> Peter
>>
>
>

From mikael.trellet at gmail.com  Wed Mar 13 11:17:17 2013
From: mikael.trellet at gmail.com (Mikael Trellet)
Date: Wed, 13 Mar 2013 12:17:17 +0100
Subject: [Biopython-dev] [Biopython] Updating GSOC page?
In-Reply-To:
References:
Message-ID:

It's well formatted and looks nice to me; the improvement over the former
one is significant, so I would agree to update the page.

Good work ;)

Mikael

On Wed, Mar 13, 2013 at 12:09 PM, João Rodrigues wrote:

> Hello all,
>
> I updated the GSOC page on the wiki to be more organized:
> http://biopython.org/wiki/GSOC
>
> If no one opposes, I'll replace the current page
> (here)
> with it, just in time for GSOC 2013.
>
> Best,
>
> João
>
> PS. sorry for the spamming but I posted this 5 days ago in the non dev list
> and got no answers so..
>
>
> 2013/3/8 João Rodrigues
>
> > Small update: http://biopython.org/wiki/GSOC
> >
> > If ok, We can just link the normal one for this one. I kept it separate
> > just in case.
> >
> >
> > 2013/3/4 Peter Cock
> >
> >> On Sun, Mar 3, 2013 at 11:07 PM, João Rodrigues
> >> wrote:
> >> > Hello all,
> >> >
> >> > Does anyone oppose a refresh of our GSOC page, based on the
> >> > BioRuby page? It could use
> >> > a facelift before the new round of projects/students come in.
> >> >
> >> > Best,
> >> >
> >> > João
> >>
> >> A good idea - see also the GSoC discussions on the biopython-dev
> >> list about potential project ideas.
> >> > >> Thanks, > >> > >> Peter > >> > > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- -------------------------------------------- Mikael TRELLET, - Groupe VENISE, CNRS LIMSI 91403 Orsay CEDEX - LBT/IBPC, 75005 Paris France +33650607172 From p.j.a.cock at googlemail.com Wed Mar 13 12:04:28 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 13 Mar 2013 12:04:28 +0000 Subject: [Biopython-dev] [Biopython] Updating GSOC page? In-Reply-To: References: Message-ID: On Wed, Mar 13, 2013 at 11:09 AM, Jo?o Rodrigues wrote: > Hello all, > > I updated the GSOC page on the wiki to be more organized: > http://biopython.org/wiki/GSOC > > If no one opposes, I'll replace the current page > (here) > with it, just in time for GSOC 2013. > > Best, > > Jo?o Sounds sensible, and you can set a direct on GSOC to Google_Summer_of_Code by replacing the content with: #REDIRECT [[link]] Peter From anaryin at gmail.com Wed Mar 13 13:22:23 2013 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 13 Mar 2013 14:22:23 +0100 Subject: [Biopython-dev] [Biopython] Updating GSOC page? In-Reply-To: References: Message-ID: Done, thanks. http://biopython.org/wiki/Google_Summer_of_Code http://biopython.org/wiki/GSOC 2013/3/13 Peter Cock > On Wed, Mar 13, 2013 at 11:09 AM, Jo?o Rodrigues > wrote: > > Hello all, > > > > I updated the GSOC page on the wiki to be more organized: > > http://biopython.org/wiki/GSOC > > > > If no one opposes, I'll replace the current page > > (here) > > with it, just in time for GSOC 2013. 
> > > > Best, > > > > Jo?o > > Sounds sensible, and you can set a direct on GSOC to > Google_Summer_of_Code by replacing the content with: > > #REDIRECT [[link]] > > Peter > From mjldehoon at yahoo.com Wed Mar 13 14:44:55 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 13 Mar 2013 07:44:55 -0700 (PDT) Subject: [Biopython-dev] New contributor In-Reply-To: Message-ID: <1363185895.3324.YahooMailClassic@web164005.mail.gq1.yahoo.com> Hi Andrea, Welcome to Biopython! It's great that you want to contribute. Writing & finishing some unit tests sounds like a good idea, and of course bug fixing is always welcome. Other options are to look at orphan modules in Biopython (modules without active maintainers, without documentation, or without unit tests). Once you decide what specifically you want to work on, it's good to let us know on the mailing list, to see if anybody else is working on the same. Good luck! Best, -Michiel. --- On Sat, 3/2/13, Andrea Rizzi <88whacko at gmail.com> wrote: > From: Andrea Rizzi <88whacko at gmail.com> > Subject: [Biopython-dev] New contributor > To: biopython-dev at biopython.org > Date: Saturday, March 2, 2013, 11:49 AM > Hello! > My name is Andrea Rizzi and I'm a master's student in > computer science and > computational biology. I would be glad to help you > developing biopython. > I've used the library quite extensively but I'm mostly > familiar with > handling sequences, MSAs and PDB files. > > I've read through the small contributing guide on the wiki > and on the > tutorial and I thought I could start with something > relatively > straightforward like writing/completing some unit tests (if > I understood > correctly there's a fairly strong need for them). I've good > knowledge of > both git and unittest. Anyway any task is actually fine to > me :) . 
>
> If you agree I'll try to look for a module that needs some
> more testing (or
> maybe you have one to suggest me), otherwise I could just go
> to the bug
> tracker and try to help out fixing some bugs.
>
> --
> -- Andrea
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>

From eric.talevich at gmail.com  Wed Mar 13 18:32:25 2013
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 13 Mar 2013 14:32:25 -0400
Subject: [Biopython-dev] Project ideas for GSoC (or other student projects)
In-Reply-To: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com>
References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com>
Message-ID:

On Tue, Feb 12, 2013 at 9:08 PM, Michiel de Hoon wrote:

> It would be great to have better support for microarray analysis in
> Biopython. Something like lumi/limma in R. Perhaps this is an option for
> the GSoC?
>
> Best,
> -Michiel.
>

I like Michiel's idea, and I'll suggest two more:

1. Codon alignment & analysis:
- PAL2NAL-style conversion of unaligned nucleic acid sequences and a
  protein sequence alignment to a codon alignment. (Previously discussed)
- dN/dS and the related functions needed to calculate it.
- Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage
  of codon alignments, including validation (testing for frame shifts etc.)

2. Phylo enhancements:

2a. Tree drawing:
- A proper draw_unrooted function to perform radial layout, with an
  optional "iterations" argument to use Felsenstein's Equal Daylight
  algorithm -- I feel this layout approach is neglected in most libraries.
- Better matplotlib/pylab integration, so the plot components can be
  tweaked using matplotlib functions.
- Other common layout approaches, e.g. circular.

2b. A "Phylo.consensus" module:
- strict consensus, like Bio.Nexus already implements.
- other consensus methods, time permitting.

2c. A "Phylo.distance" module:
- Robinson-Foulds distance -- though others might be working on this
  already.

2d. Simple tree inference:
- Straightforward algorithms exist for neighbor-joining and parsimony
  tree estimation. For small alignments (and perhaps medium-sized ones
  with PyPy), it would be nice to run these without an external program,
  e.g. to construct a guide tree for another algorithm or quickly view
  a phylogenetic clustering of sequences.

Any interest in either of these? Shall I add them to the wiki?

-Eric

--- On Tue, 2/12/13, Peter Cock wrote:
>
> > From: Peter Cock
> > Subject: [Biopython-dev] Project ideas for GSoC (or other student
> projects)
> > To: "Biopython-Dev Mailing List"
> > Date: Tuesday, February 12, 2013, 12:51 PM
> > Hello all,
> >
> > Google recently confirmed they will be running Google Summer
> > of Code 2013,
> > and we (Biopython and the other Bio* projects) would hope to
> > be accepted again
> > under the Open Bioinformatics Foundation as in previous
> > years:
> > http://lists.open-bio.org/pipermail/gsoc/2013/000196.html
> >
> > It would be great to start coming up with potential project
> > ideas, both larger
> > pieces of work suitable for GSoC but also smaller tasks for
> > other project
> > students, or 'low hanging fruit' for potential contributors
> > to cut
> > their teeth on.
> >
> > See also http://biopython.org/wiki/Active_projects
> > and the ideas list there.
> >
> > Regards,
> >
> > Peter
>

From p.j.a.cock at googlemail.com  Wed Mar 13 21:16:27 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 13 Mar 2013 21:16:27 +0000
Subject: [Biopython-dev] BWA Wrapper
In-Reply-To:
References:
Message-ID:

On Monday, March 4, 2013, Saket Choudhary wrote:

> Hi,
>
> I have updated the code here :
> https://github.com/saketkc/biopython/tree/bwa_wrapper
>
> I have added unittests for the wrapper. And yes, this did help me in
> fixing a lot of minor bugs in my original wrapper.
>
> @Peter : Is this 'pull request' ready ?
> > Thanks > > Saket > > Sorry I've not had time to test this yet - and have been off ill today as well. The basic approach you've taken seems sound, and a good basis for other samtools style tools. Peter From p.j.a.cock at googlemail.com Thu Mar 14 11:25:41 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 14 Mar 2013 11:25:41 +0000 Subject: [Biopython-dev] Fwd: [biopython] Add the ability to parse CEL version 4 files from Affy (#168) In-Reply-To: References: Message-ID: Who would be the best person to review this? Michael? Peter ---------- Forwarded message ---------- From: *Jeff Hammerbacher* Date: Thursday, March 14, 2013 Subject: [biopython] Add the ability to parse CEL version 4 files from Affy (#168) To: biopython/biopython Hey, I noticed that Biopython was missing the ability to parse binary CEL files (version 4), so I've added a rough implementation. I've kept TODOs in the code and a main method to demonstrate example use. I realize these are not best practices for a mature library, but this corner of Biopython (the Affy module) seems quite immature, so I figured I'd leave the code in this state to indicate to others that there is much room for improvement. I have not contributed to this project before, so please let me know how to get this pull request in shape for a commit. Thanks, Jeff ------------------------------ You can merge this Pull Request by running git pull https://github.com/hammer/biopython master Or view, comment on, or merge it at: https://github.com/biopython/biopython/pull/168 Commit Summary - Add the ability to parse CEL version 4 files from Affy. 
File Changes - *A* Bio/Affy/CelFileV4.py(186) Patch Links: - https://github.com/biopython/biopython/pull/168.patch - https://github.com/biopython/biopython/pull/168.diff From mjldehoon at yahoo.com Fri Mar 15 13:09:18 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 15 Mar 2013 06:09:18 -0700 (PDT) Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: Message-ID: <1363352958.71694.YahooMailClassic@web164004.mail.gq1.yahoo.com> Hi everybody, I looked at the mmCIF parser again, and it turned out that the Python standard library contains a shlex lexical analyzer module that makes mmCIF parsing straightforward without relying on flex or PLY. I uploaded a modified version of MMCIF2Dict.py to the git repository. This parser does the exact same thing as the flex-based parser, but is in pure Python. If you're interested, have a look at MMCIF2Dict.py in the git repository; comments and suggestions are welcome. If there are no objections, I think we can remove everything in Bio.PDB.mmCIF. Also I'm a bit unhappy with how the information in an mmCIF file is represented in Biopython. I think there are more Pythonic ways to store the contents of an mmCIF file in a Python object. Best, -Michiel. --- On Sat, 2/16/13, Peter Cock wrote: > From: Peter Cock > Subject: Re: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) > To: "Michiel de Hoon" > Cc: "BioPython-Dev Mailing List" , "Lenna Peterson" > Date: Saturday, February 16, 2013, 5:42 AM > On Sat, Feb 16, 2013 at 2:46 AM, > Michiel de Hoon > wrote: > > Hi Lenna, > > > > Maybe we are confusing each other.. > > I am looking for a solution that (a) doesn't introduce > new dependencies, > > +1 > > > (b) is pure-Python so it can run on Jython, > > +1 And on PyPy (which to me is more interesting that Jython) > etc. > > > and (c) if that is not possible and we do need to use > C, then that C code > > should be understandable so that it can be debugged if > necessary. 
> > > > I was suggesting to clean up lex.yy.c so that we can at > least achieve (c). > > This does mean we essentially give up on ever regenerating > the lex.yy.c > file every again - could that be a problem if Flex itself > changes much? > > > The alternative is to start from the PLY-based parser > and remove the > > dependency on PLY. > > > > Best, > > -Michiel. > > Peter > From anaryin at gmail.com Fri Mar 15 13:20:16 2013 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 15 Mar 2013 14:20:16 +0100 Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: <1363352958.71694.YahooMailClassic@web164004.mail.gq1.yahoo.com> References: <1363352958.71694.YahooMailClassic@web164004.mail.gq1.yahoo.com> Message-ID: Hi Michiel, Speaking without really checking the code.. What we perhaps should have it the parsers, whatever they are, populating the same type of object in the end (PDBParser and mmCIFParser). Is this the current status of the mmCIF? Best, Jo?o 2013/3/15 Michiel de Hoon > Hi everybody, > > I looked at the mmCIF parser again, and it turned out that the Python > standard library contains a shlex lexical analyzer module that makes mmCIF > parsing straightforward without relying on flex or PLY. I uploaded a > modified version of MMCIF2Dict.py to the git repository. This parser does > the exact same thing as the flex-based parser, but is in pure Python. If > you're interested, have a look at MMCIF2Dict.py in the git repository; > comments and suggestions are welcome. > > If there are no objections, I think we can remove everything in > Bio.PDB.mmCIF. Also I'm a bit unhappy with how the information in an mmCIF > file is represented in Biopython. I think there are more Pythonic ways to > store the contents of an mmCIF file in a Python object. > > Best, > -Michiel. 
> > --- On Sat, 2/16/13, Peter Cock wrote: > > > From: Peter Cock > > Subject: Re: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) > > To: "Michiel de Hoon" > > Cc: "BioPython-Dev Mailing List" , "Lenna > Peterson" > > Date: Saturday, February 16, 2013, 5:42 AM > > On Sat, Feb 16, 2013 at 2:46 AM, > > Michiel de Hoon > > wrote: > > > Hi Lenna, > > > > > > Maybe we are confusing each other.. > > > I am looking for a solution that (a) doesn't introduce > > new dependencies, > > > > +1 > > > > > (b) is pure-Python so it can run on Jython, > > > > +1 And on PyPy (which to me is more interesting that Jython) > > etc. > > > > > and (c) if that is not possible and we do need to use > > C, then that C code > > > should be understandable so that it can be debugged if > > necessary. > > > > > > I was suggesting to clean up lex.yy.c so that we can at > > least achieve (c). > > > > This does mean we essentially give up on ever regenerating > > the lex.yy.c > > file every again - could that be a problem if Flex itself > > changes much? > > > > > The alternative is to start from the PLY-based parser > > and remove the > > > dependency on PLY. > > > > > > Best, > > > -Michiel. 
> > > > Peter > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From p.j.a.cock at googlemail.com Fri Mar 15 13:21:50 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 15 Mar 2013 13:21:50 +0000 Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: <1363352958.71694.YahooMailClassic@web164004.mail.gq1.yahoo.com> References: <1363352958.71694.YahooMailClassic@web164004.mail.gq1.yahoo.com> Message-ID: On Fri, Mar 15, 2013 at 1:09 PM, Michiel de Hoon wrote: > Hi everybody, > > I looked at the mmCIF parser again, and it turned out that the Python > standard library contains a shlex lexical analyzer module that makes mmCIF > parsing straightforward without relying on flex or PLY. I uploaded a > modified version of MMCIF2Dict.py to the git repository. This parser does > the exact same thing as the flex-based parser, but is in pure Python. If > you're interested, have a look at MMCIF2Dict.py in the git repository; > comments and suggestions are welcome. That makes MMCIF2Dict look a lot shorter :) https://github.com/biopython/biopython/commit/b2bafdfcd67c738f91722495bb732297b7936828 > If there are no objections, I think we can remove everything in > Bio.PDB.mmCIF. Also I'm a bit unhappy with how the information in an mmCIF > file is represented in Biopython. I think there are more Pythonic ways to > store the contents of an mmCIF file in a Python object. > > Best, > -Michiel. Do you think we need a deprecation cycle for Bio.PDB.mmCIF? It has been available by default on Debian etc where the dependency was taken care of by the packagers. I've never used this code so perhaps Eric or Jo?o's perspective would be more helpful than mine. 
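
To illustrate why the standard library shlex module helps here, the following is a minimal sketch of tokenizing simple one-line mmCIF items. It is not the code in MMCIF2Dict.py; the function name simple_cif_to_dict and the sample input are hypothetical, and the sketch deliberately ignores loop_ blocks, multi-line semicolon-delimited values and values containing unbalanced quote characters.

```python
import shlex


def simple_cif_to_dict(text):
    """Collect single-line '_key value' items from mmCIF-like text.

    Hypothetical helper for illustration only: loop_ blocks and
    multi-line (semicolon-delimited) values are skipped entirely.
    """
    result = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # shlex.split keeps a quoted value together as one token.
        tokens = shlex.split(line)
        if len(tokens) == 2 and tokens[0].startswith("_"):
            result[tokens[0]] = tokens[1]
    return result


if __name__ == "__main__":
    sample = "data_EXAMPLE\n_entry.id EXAMPLE\n_struct.title 'An example title'\n"
    print(simple_cif_to_dict(sample))
```

The point of the sketch is only that quoted values with embedded blanks come back as single tokens without a generated lexer; a real parser needs considerably more state than this.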
Peter From mjldehoon at yahoo.com Fri Mar 15 15:08:43 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 15 Mar 2013 08:08:43 -0700 (PDT) Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: Message-ID: <1363360123.26690.YahooMailClassic@web164004.mail.gq1.yahoo.com> Hi all, --- On Fri, 3/15/13, Peter Cock wrote: > Do you think we need a deprecation cycle for Bio.PDB.mmCIF? > It has been available by default on Debian etc where the > dependency was taken care of by the packagers. Probably not. The Bio.PDB.mmCIF module was essentially a private module used by Bio.PDB.MMCIF2Dict, whose usage is unchanged. Also, AFAICT the Bio.PDB.mmCIF module is not documented anywhere. And finally, all this module does is tokenize the mmCIF file, so probably not something an end user would be interested in. I am not a heavy user of Bio.PDB myself, so feel free to correct me if I am wrong. Best, -Michiel. From p.j.a.cock at googlemail.com Fri Mar 15 15:28:48 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 15 Mar 2013 15:28:48 +0000 Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: <1363360950.87852.YahooMailClassic@web164003.mail.gq1.yahoo.com> References: <1363360950.87852.YahooMailClassic@web164003.mail.gq1.yahoo.com> Message-ID: On Fri, Mar 15, 2013 at 3:22 PM, Michiel de Hoon wrote: > > Speaking of which, we have a Biopython Structural Bioinformatics FAQ (i.e. > how to use the Bio.PDB module) on the Biopython website with additional > information on Bio.PDB, including some information on things that are not in > the main Biopython Tutorial. Perhaps this is a good time to integrate this > FAQ into the main documentation? > Both are LaTeX documents so this shouldn't be too hard to do. 
Peter From mjldehoon at yahoo.com Fri Mar 15 15:22:30 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 15 Mar 2013 08:22:30 -0700 (PDT) Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: Message-ID: <1363360950.87852.YahooMailClassic@web164003.mail.gq1.yahoo.com> Hi Jo?o, --- On Fri, 3/15/13, Jo?o Rodrigues wrote: What we perhaps should have it the parsers, whatever they are, populating the same type of object in the end (PDBParser and mmCIFParser). I think that there are two options: 1) PDBParser and mmCIFParser both produce Structure objects, with any additional information found in mmCIF files stored as additional attributes of Structure objects (and the same thing for PDB files); 2) We make a module mmCIF with a function mmCIF.read that reads an mmCIF file and stores the information in a mmCIF.Record object that is optimized for storing mmCIF information. The mmCIFParser uses mmCIF.read, and pulls out the necessary information from the mmCIF.Record object to create a Structure object (which is free of mmCIF-specific stuff). Users can make Structure objects if that is all they need, or use mmCIF.read if they want to have all information in an mmCIF file. Currently the situation is closer to (2), with MMCIF2Dict playing the role of mmCIF.read, but I don't like much the way MMCIF2Dict stores information. Since I am not a power user of Bio.PDB, other people may have more insight in whether (1) or (2) (or something completely different) is best. Is this the current status of the mmCIF? I just replaced the flex-dependent part of mmCIF by pure Python code, but I didn't change the functionality or usage of the mmCIF code. So the current status is still the same as described in the documentation. Speaking of which, we have a Biopython Structural Bioinformatics FAQ (i.e. 
how to use the Bio.PDB module) on the Biopython website with additional
information on Bio.PDB, including some information on things that are not
in the main Biopython Tutorial. Perhaps this is a good time to integrate
this FAQ into the main documentation?

Best,
-Michiel

From jacobs at bioinformed.com  Fri Mar 15 15:40:38 2013
From: jacobs at bioinformed.com (Kevin Jacobs)
Date: Fri, 15 Mar 2013 08:40:38 -0700
Subject: [Biopython-dev] BWA Wrapper
In-Reply-To:
References:
Message-ID:

FYI, I am working on a direct Cython wrapper around the new BWA-MEM
aligner, which will allow API-level access to Heng Li's extremely
impressive new algorithm. It is still in early development and is missing
many bells and whistles, but will be shaping up in the next few weeks.

Test program:

import bwamem
mem = bwamem.MEMAligner('ref/human_g1k_v37.fasta')
a = mem.align('TCACGACGCTCTTCCGATCTGTT...GTGCATTCTCTGGTCAGACAGCCAAGG')
a = a[0]
print 'ref id =',a.rid
print 'pos =',a.pos
print 'CIGAR =',a.cigar.to_string()

Output (correct):

ref id = 0
pos = 115250385
CIGAR = 17N134M

On Wed, Mar 13, 2013 at 2:16 PM, Peter Cock wrote:

> On Monday, March 4, 2013, Saket Choudhary wrote:
>
> > Hi,
> >
> > I have updated the code here :
> > https://github.com/saketkc/biopython/tree/bwa_wrapper
> >
> > I have added unittests for the wrapper. And yes, this did help me in
> > fixing a lot of minor bugs in my original wrapper.
> >
> > @Peter : Is this 'pull request' ready ?
> >
> > Thanks
> >
> > Saket
> >
> >
> Sorry I've not had time to test this yet - and have
> been off ill today as well. The basic approach you've
> taken seems sound, and a good basis for other
> samtools style tools.
> > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From anaryin at gmail.com Fri Mar 15 15:53:41 2013 From: anaryin at gmail.com (João Rodrigues) Date: Fri, 15 Mar 2013 16:53:41 +0100 Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: <1363360950.87852.YahooMailClassic@web164003.mail.gq1.yahoo.com> References: <1363360950.87852.YahooMailClassic@web164003.mail.gq1.yahoo.com> Message-ID: Hi Michiel, > 1) PDBParser and mmCIFParser both produce Structure objects, with any > additional information found in mmCIF files stored as additional attributes > of Structure objects (and the same thing for PDB files); > This approach has a few advantages. First and most obvious, converting one file format to another seamlessly. Second, reducing the code to something easier to maintain and to extend too. The disadvantage is that the Structure objects might become a bit too bloated. On the other hand, we can make them lighter and take advantage of Python's dynamic attributes (if I need a b-factor, I just add atom.bfactor). This would also help a lot with the current parser which is quite "sluggish" for some purposes and bring a lot more flexibility (parsing pqr files, mol2 files, etc). All we'd need would be a parser for each file format and a generic container to have the backbone of the structure and extend it as we need. A simple flag for the parser type would make checking if function X can be used on this particular structure easier too. > > 2) We make a module mmCIF with a function mmCIF.read that reads an mmCIF > file and stores the information in a mmCIF.Record object that is optimized > for storing mmCIF information. The mmCIFParser uses mmCIF.read, and pulls > out the necessary information from the mmCIF.Record object to create a > Structure object (which is free of mmCIF-specific stuff).
Users can make > Structure objects if that is all they need, or use mmCIF.read if they want > to have all information in an mmCIF file. > I'm completely unfamiliar with mmCIF files. How much more information do they have than a PDB file? And what kind of information is useful to extract from them? > Speaking of which, we have a Biopython Structural Bioinformatics FAQ (i.e. > how to use the Bio.PDB module) on the Biopython website with additional > information on Bio.PDB, including some information on things that are not > in the main Biopython Tutorial. Perhaps this is a good time to integrate > this FAQ into the main documentation? We could also update it a bit because it's been a while and there are some different things here and there. And additions too. Best, João From bartek at rezolwenta.eu.org Fri Mar 15 23:06:57 2013 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Sat, 16 Mar 2013 00:06:57 +0100 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: Hi All, I would add one more (old) idea for a GSoC pool, i.e. adding support for different biological ontologies to biopython. This was already discussed some time ago (http://www.biopython.org/w/index.php?title=Gene_Ontology&redirect=no) mostly in the context of gene ontology, and to some extent this is addressed by the development of GOAtools (https://github.com/tanghaibao/goatools), but I think it would be worth having decent support for OBO-file-based ontologies (not only gene ontology, I'm also interested myself in anatomical ontologies, and there are others available at obofoundry.org) in biopython. I think it would need to include support for IO operations on both OBO and annotation files, as well as statistical enrichment measures and potentially some visualisation. Would anyone be interested in co-mentoring this project?
There is one student in my department who would be interested in applying to GSoC for this project, but I think it would be great if other people joined the discussion on the functionality and having more people involved is always better... best Bartek Wilczynski On Wed, Mar 13, 2013 at 7:32 PM, Eric Talevich wrote: > On Tue, Feb 12, 2013 at 9:08 PM, Michiel de Hoon wrote: > >> It would be great to have better support for microarray analysis in >> Biopython. Something like lumi/limma in R. Perhaps this is an option for >> the GSoC? >> >> Best, >> -Michiel. >> > > I like Michiel's idea, and I'll suggest two more: > > 1. Codon alignment & analysis: > - PAL2NAL-style conversion of unaligned nucleic acid sequences and a > protein sequence alignment to a codon alignment. (Previously discussed) > - dN/dS and the related functions needed to calculate it. > - Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage of > codon alignments, including validation (testing for frame shifts etc.) > > 2. Phylo enhancements: > 2a. Tree drawing: > - A proper draw_unrooted function to perform radial layout, with an > optional "iterations" argument to use Felsenstein's Equal Daylight > algorithm -- I feel this layout approach is neglected in most libraries. > - Better matplotlib/pylab integration, so the plot components can be > tweaked using matplotlib functions. > - Other common layout approaches, e.g. circular. > 2b. A "Phylo.consensus" module: > - strict consensus, like Bio.Nexus already implements. > - other consensus methods, time permitting. > 2c. A "Phylo.distance" module: > - Robinson-Foulds distance -- though others might be working on this > already. > 2d. Simple tree inference: > - Straightforward algorithms exist for neighbor-joining and parsimony tree > estimation. For small alignments (and perhaps medium-sized ones with PyPy), > it would be nice to run these without an external program, e.g. 
to > construct a guide tree for another algorithm or quickly view a phylogenetic > clustering of sequences. > > Any interest in either of these? Shall I add them to the wiki? > > -Eric > > > --- On Tue, 2/12/13, Peter Cock wrote: >> >> > From: Peter Cock >> > Subject: [Biopython-dev] Project ideas for GSoC (or other student >> projects) >> > To: "Biopython-Dev Mailing List" >> > Date: Tuesday, February 12, 2013, 12:51 PM >> > Hello all, >> > >> > Google recently confirmed they will be running Google Summer >> > of Code 2013, >> > and we (Biopython and the other Bio* projects) would hope to >> > be accepted again >> > under the Open Bioinformatics Foundation as in previous >> > years: >> > http://lists.open-bio.org/pipermail/gsoc/2013/000196.html >> > >> > It would be great to start coming up with potential project >> > ideas, both larger >> > pieces of work suitable for GSoC but also smaller tasks for >> > other project >> > students, or 'low hanging fruit' for potential contributors >> > to cut >> > their teeth on. >> > >> > See also http://biopython.org/wiki/Active_projects >> > and the ideas list there. >> > >> > Regards, >> > >> > Peter >> > -- Bartek Wilczynski From mjldehoon at yahoo.com Sat Mar 16 02:38:48 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 15 Mar 2013 19:38:48 -0700 (PDT) Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: Message-ID: <1363401528.82829.YahooMailClassic@web164001.mail.gq1.yahoo.com> --- On Fri, 3/15/13, João Rodrigues wrote: I'm completely unfamiliar with mmCIF files.. how much more information do they have than a PDB file?
These are two examples from the Biopython tests: https://github.com/biopython/biopython/blob/master/Tests/PDB/1A8O.cif https://github.com/biopython/biopython/blob/master/Tests/PDB/1LCD.cif And what kind of information is useful to extract from them? I think we should extract all information from these files, and let the user decide which parts are useful. Best, -Michiel. From p.j.a.cock at googlemail.com Sat Mar 16 14:38:22 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 16 Mar 2013 14:38:22 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5E48@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Wed, Dec 5, 2012 at 6:41 PM, Peter Cock wrote: > Hi David, > > I've been experimenting with your pull request, thank you: > https://github.com/biopython/biopython/pull/116 > Hi again David, I've not used your code as is, but have started by pulling out and generalising what I felt was the least contentious part: https://github.com/biopython/biopython/commit/087712510421ec7f655a7981926a757aa93e9177 This means that label_position = start, middle, end (and some historic aliases defined in the linear drawer code) now work on circular GenomeDiagrams. I have made the default None, which gives the current behaviour (as 'start' on linear, the more complicated to explain vertical bottom on circular). Support for allowing the default label orientation to be radially consistent all round the circle (rather than the current flipping for the left/right halves which assumes the output is kept vertical) would be nice, but the thing I am most keen on is the inside/outside of the track label placement. Hopefully I'll have time to finish that this weekend... 
Peter From p.j.a.cock at googlemail.com Sat Mar 16 20:37:12 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 16 Mar 2013 20:37:12 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5E48@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Sat, Mar 16, 2013 at 2:38 PM, Peter Cock wrote: > On Wed, Dec 5, 2012 at 6:41 PM, Peter Cock wrote: >> Hi David, >> >> I've been experimenting with your pull request, thank you: >> https://github.com/biopython/biopython/pull/116 >> > > Hi again David, > > I've not used your code as is, but have started by pulling out > and generalising what I felt was the least contentious part: > > https://github.com/biopython/biopython/commit/087712510421ec7f655a7981926a757aa93e9177 > > This means that label_position = start, middle, end (and some > historic aliases defined in the linear drawer code) now work > on circular GenomeDiagrams. I have made the default None, > which gives the current behaviour (as 'start' on linear, the > more complicated to explain vertical bottom on circular). > > Support for allowing the default label orientation to be radially > consistent all round the circle (rather than the current flipping > for the left/right halves which assumes the output is kept > vertical) would be nice, but the thing I am most keen on is the > inside/outside of the track label placement. Hopefully I'll have > time to finish that this weekend... Here's a version on a branch which addresses the label placement by adding a label_strand argument, where +1 means the label is on the forward strand side of the track (above or outside), while -1 means the reverse strand side of the track (below or inside), and the default is to follow the strand of the feature being drawn.
This seemed to me quite an intuitive arrangement: https://github.com/peterjc/biopython/tree/label_strand This branch also (without making it optional) switches circular diagram feature labels to be "outside" the sigil like the linear diagram, rather than "inside" the sigil. Placing the labels outside does tend to take up more space (which would explain the original motivation for placing them inside), but inside placement rarely gives a very legible result except with a box sigil and a very small/short label which falls completely within the sigil. This could be made a user option if there is demand... my inclination is not to (the API is already quite complex). David, I will email you an updated version of your example script using this branch for you to look at. It allows me to recreate the same effect as your code (bar the orientation changes which I have not at this point incorporated). David & Leighton, what do you think of this label idea? Peter From p.j.a.cock at googlemail.com Mon Mar 18 11:58:49 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 18 Mar 2013 11:58:49 +0000 Subject: [Biopython-dev] Modifications to CircularDrawer In-Reply-To: References: <959CFF5060375249824CC633DDDF896F1C0C58B1@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5C9B@AMSPRD0410MB351.eurprd04.prod.outlook.com> <959CFF5060375249824CC633DDDF896F1C0C5E48@AMSPRD0410MB351.eurprd04.prod.outlook.com> Message-ID: On Sat, Mar 16, 2013 at 8:37 PM, Peter Cock wrote: > > David & Leighton, what do you think of this label idea? > > Peter From discussion off list, my branch seems positively accepted by both, and so I've applied that to the master. I probably will need to update some images in the Tutorial...
We appear to agree that label orientation is an aesthetic judgement, and therefore a user option to control this on circular diagrams would be nice - but I've not done this (yet) and remain cautious about further complicating this bit of the code while trying to keep a consistent API between the linear and circular drawers. See also: https://github.com/biopython/biopython/pull/116 Peter From chapmanb at 50mail.com Mon Mar 18 16:49:33 2013 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 18 Mar 2013 12:49:33 -0400 Subject: [Biopython-dev] SciPy Bioinformatics symposium: abstracts due Wednesday Mar 20th Message-ID: <87y5dkvejm.fsf@fastmail.fm> Hi all; I'm helping organize a bioinformatics mini-symposium as part of SciPy 2013: Bioinformatics mini-symposia: http://j.mp/Z4xxXB SciPy info: http://conference.scipy.org/scipy2013/about.php This is a great chance for the Python bioinformatics community to connect with the wider Python scientific computing world. SciPy will feature programmers working on IPython reproducible research, scikit-learn machine learning approaches, large scale computing problems with NumPy and lots more relevant to bioinformatics work. This year there will be a special symposium track dedicated to bioinformatics and I'd like to encourage everyone to submit abstracts. The deadline is this Wednesday, March 20th: http://conference.scipy.org/scipy2013/speaking_overview.php http://conference.scipy.org/scipy2013/speaking_submission.php SciPy takes place June 24-29th in Austin, TX. I'm looking forward to seeing lots of bioinformatics people there.
Please feel free to write me if you have any questions, Brad From 88whacko at gmail.com Wed Mar 20 18:10:14 2013 From: 88whacko at gmail.com (Andrea Rizzi) Date: Wed, 20 Mar 2013 19:10:14 +0100 Subject: [Biopython-dev] New contributor In-Reply-To: <1363185895.3324.YahooMailClassic@web164005.mail.gq1.yahoo.com> References: <1363185895.3324.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: Thank you for your welcome Michiel! I will be looking for a good project to work on in the next few days and I will let you know soon. Meanwhile I've started to read some code to become familiar with the modules and I bumped into a few small bugs concerning the Seq objects, in particular I found: 1) a duplicated test method name (one test in test_Seq_objs.py wasn't performed); 2) an error in Alphabet._case_less(). I've also expanded the documentation a little and I've replaced the tostring() method with the suggested str() call in a function of MutableSeq. The branch is located here https://github.com/andrrizzi/biopython/tree/seq-branch I'm not sure if it is more convenient for you to merge this kind of commit from a git branch or whether it is more advisable to open a ticket and create a patch. Anyway if you think these small commits may be useful, feel free to use them. Best, Andrea 2013/3/13 Michiel de Hoon > Hi Andrea, > > Welcome to Biopython! > It's great that you want to contribute. > Writing & finishing some unit tests sounds like a good idea, and of course > bug fixing is always welcome. > Other options are to look at orphan modules in Biopython (modules without > active maintainers, without documentation, or without unit tests). > Once you decide what specifically you want to work on, it's good to let us > know on the mailing list, to see if anybody else is working on the same. > > Good luck! > > Best, > -Michiel.
> > > > --- On Sat, 3/2/13, Andrea Rizzi <88whacko at gmail.com> wrote: > > > From: Andrea Rizzi <88whacko at gmail.com> > > Subject: [Biopython-dev] New contributor > > To: biopython-dev at biopython.org > > Date: Saturday, March 2, 2013, 11:49 AM > > Hello! > > My name is Andrea Rizzi and I'm a master's student in > > computer science and > > computational biology. I would be glad to help you > > developing biopython. > > I've used the library quite extensively but I'm mostly > > familiar with > > handling sequences, MSAs and PDB files. > > > > I've read through the small contributing guide on the wiki > > and on the > > tutorial and I thought I could start with something > > relatively > > straightforward like writing/completing some unit tests (if > > I understood > > correctly there's a fairly strong need for them). I've good > > knowledge of > > both git and unittest. Anyway any task is actually fine to > > me :) . > > > > If you agree I'll try to look for a module that needs some > > more testing (or > > maybe you have one to suggest me), otherwise I could just go > > to the bug > > tracker and try to help out fixing some bugs. > > > > -- > > -- Andrea > > _______________________________________________ > > Biopython-dev mailing list > > Biopython-dev at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython-dev > > > -- -- Andrea From p.j.a.cock at googlemail.com Thu Mar 21 12:17:51 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Mar 2013 12:17:51 +0000 Subject: [Biopython-dev] New contributor In-Reply-To: References: <1363185895.3324.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: On Wed, Mar 20, 2013 at 6:10 PM, Andrea Rizzi <88whacko at gmail.com> wrote: > Thank you for your welcome Michiel! > > I will looking for a good project to work on in the next few days and I > will let you know soon. 
Meanwhile I've started to read some code to become > familiar with the modules and I bumped into few small bugs concerning the > Seq objects, in particular I found: > > 1) a duplicated test method name (one test in test_Seq_objs.py wasn't > performed); > 2) an error in Alphabet._case_less(). Well spotted - changes applied to the master, thanks. > I've also expanded a little bit the documentation and I've substituted > tostring() method with the suggested str() method in a function of > MutableSeq. The branch is located here > > https://github.com/andrrizzi/biopython/tree/seq-branch > > I'm not sure if it is more comfortable for you to merge this kind of > commits from a git branch or it is more advisable to open a ticket and > create a patch. Anyway if you think this small commits may be useful, feel > free to use them. If you're happy on GitHub, a pull request is simplest. I've looked at these changes one by one and applied and/or commented on them. (We're debating moving our issue tracker from RedMine to GitHub, which would make things a little easier in future). Thank you! Peter From p.j.a.cock at googlemail.com Thu Mar 21 16:11:44 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Mar 2013 16:11:44 +0000 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Fri, Mar 15, 2013 at 11:06 PM, Bartek Wilczynski wrote: > Hi All, > I would add one more (old) idea for a GSoC pool, i.e. adding support > for different biological ontologies to biopython. 
> > This was already discussed some time ago > (http://www.biopython.org/w/index.php?title=Gene_Ontology&redirect=no) > mostly in the context of gene ontology, and to some extent this is > addressed by the development of GOAtools > (https://github.com/tanghaibao/goatools), but I think it would be > worth to have a decent support for OBO-file-based ontologies (not only > gene ontology, I'm also interested myself in anatomical ontologies, > there are also other available at obofoundry.org) in biopython. > > I think it would need to include support for IO operations on both OBO > and annotation files, as well as statistical enrichment measures and > potentially some visualisation. > > Would anyone be interested in co-mentoring this project? There is one > student in my department who would be interested in applying to GSoC > for this project, but I think it would be great if other people joined > the discussion on the functionality and having more people involved is > always better... > > best > Bartek Wilczynski That's a good idea - I would have used this recently with some GO stuff (e.g. given a GO term, is it a molecular function, biological process, or cellular compartment - can solve this easily by traversing up any branch of the DAG). Right now we need to put this list of ideas on the wiki page (ready for combining into the OBF page which will be shown to Google to make our case for taking part in the GSoC 2013 program). http://biopython.org/wiki/Google_Summer_of_Code If any of you as a potential mentor want to put up an outline proposal, even better. 
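The namespace lookup mentioned above (walk up any branch of the DAG until a root is reached) might look roughly like this. It is only a sketch on a hand-made parent map, not GOAtools or any Biopython API: the three root IDs are the real GO roots, but the EX: terms and the names PARENTS and namespace_of are made up for illustration, and it assumes every parent link stays within one namespace (as is_a links should).

```python
# The three GO root terms and the namespace each one defines.
GO_ROOTS = {
    "GO:0003674": "molecular_function",
    "GO:0008150": "biological_process",
    "GO:0005575": "cellular_component",
}

# Hypothetical toy ontology: child term -> list of parent terms.
# The EX: identifiers are invented for this example only.
PARENTS = {
    "EX:0000001": ["EX:0000002"],
    "EX:0000002": ["GO:0008150"],
    "EX:0000003": ["EX:0000004", "EX:0000005"],
    "EX:0000004": ["GO:0003674"],
    "EX:0000005": ["GO:0003674"],
}


def namespace_of(term, parents=PARENTS, roots=GO_ROOTS):
    """Follow one parent link at a time until a root term is reached."""
    seen = set()
    while term not in roots:
        if term in seen:
            raise ValueError("cycle detected at %s" % term)
        seen.add(term)
        try:
            term = parents[term][0]  # any branch will do
        except (KeyError, IndexError):
            raise ValueError("%s has no path to a known root" % term)
    return roots[term]
```

For example, namespace_of("EX:0000003") follows the first of its two parents and ends at the molecular_function root.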
Peter From p.j.a.cock at googlemail.com Thu Mar 21 16:29:29 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Mar 2013 16:29:29 +0000 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Thu, Mar 21, 2013 at 4:11 PM, Peter Cock wrote: > > Right now we need to put this list of ideas on the wiki page (ready > for combining into the OBF page which will be shown to Google > to make our case for taking part in the GSoC 2013 program). > http://biopython.org/wiki/Google_Summer_of_Code > > If any of you as a potential mentor want to put up an outline > proposal, even better. > I've been wondering about potential GSoC projects which I'd be interested in mentoring (or co-mentoring), and thus far I've only got one outline idea. I'm interested in taking the Bio.SeqIO.index(...) / index_db(...) functionality (which does whole record parsing on demand) and extending this with lazy-loading or lazy-parsing (which has precedent in our BioSQL wrappers). For example, with whole genome FASTA files you may never need to load the entire sequence, but using an index system like tabix (or even actually using a tabix index) Biopython could provide a lazy-loading Seq object which extracts only the sequence region of interest on demand. The same idea applies to richer file formats too, like EMBL and GenBank. Here lazy loading the sequence is actually easier (the number of bases per line is strictly defined), but you can apply the same ideas to lazy loading features too. This means indexing both the sequence and the feature table. Likewise, this makes sense for GTF/GFF/GFF3 where you would index the features, and also if present index the embedded FASTA sequence at the end of the file. Clearly handling this would ideally build on Lenna and Brad's work with the underlying parser. With what I have in mind, there are two technical sides to this. 
First, the index format (binning strategies etc) for which we should review tabix and BAM's indexing and its planned replacement CSI (able to handle longer references). Second, to avoid code duplication, this would mean some re-factoring of the existing parser code to ensure that if a record is loaded in full via the traditional API, it would go through the same code as if it were loaded via the new lazy loading approach. Potentially the existing parsers could optionally also become lazy loaders (contingent on this requiring ownership of the file handle as it will use seek and tell to move the file pointer). That in theory could make our parsers much faster (depending on the overheads) for tasks where only a minority of the data is ever used. I've had some fun chats with Pjotr Prins from BioRuby about this at a CodeFest/BOSC meeting. Brad and Lenna, I've CC'd you explicitly as I'm guessing from the GFF work you are most likely to have considered some of these issues. Does this sound like something worth exploring further, and worth proposing as an outline GSoC project? I think it would be quite a challenging project - but like last year, it is something I would like to try myself if I had the time. Regards, Peter From p.j.a.cock at googlemail.com Thu Mar 21 17:01:51 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Mar 2013 17:01:51 +0000 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Wed, Mar 13, 2013 at 6:32 PM, Eric Talevich wrote: > > I like Michiel's idea, and I'll suggest two more: > > 1. Codon alignment & analysis: Already up on the wiki :) > > 2. Phylo enhancements: > 2a. Tree drawing: > - A proper draw_unrooted function to perform radial layout, with an optional > "iterations" argument to use Felsenstein's Equal Daylight algorithm -- I > feel this layout approach is neglected in most libraries.
> - Better matplotlib/pylab integration, so the plot components can be tweaked > using matplotlib functions. > - Other common layout approaches, e.g. circular. > 2b. A "Phylo.consensus" module: > - strict consensus, like Bio.Nexus already implements. > - other consensus methods, time permitting. > 2c. A "Phylo.distance" module: > - Robinson-Foulds distance -- though others might be working on this > already. > 2d. Simple tree inference: > - Straightforward algorithms exist for neighbor-joining and parsimony tree > estimation. For small alignments (and perhaps medium-sized ones with PyPy), > it would be nice to run these without an external program, e.g. to construct > a guide tree for another algorithm or quickly view a phylogenetic clustering > of sequences. One more idea for a sub-task? 2e. Using multiple trees for bootstrapping a master tree. Take the master tree and for each edge you have a partition of the leaves, which can be used as a dictionary hash (e.g. as a binary representation). Then for each of the bootstrap runs, look at each edge, compute the hash for that split of the leaves, and increment the count. Then at the end, you have a dictionary of counts which are the branch bootstrap supports. I wrote that once in Python some time back, and used it to take a set of bootstrap trees generated on a cluster and give the support values to the master tree. > > Any interest in either of these? Shall I add them to the wiki?
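The edge-counting scheme in 2e can be sketched as follows. This is not the script mentioned above, just a minimal illustration under some assumptions: trees are plain nested tuples with string leaf names (not Bio.Phylo objects), each split is keyed by a frozenset of leaf names rather than a binary representation, and the function names are made up for the example.

```python
from collections import Counter


def leaves(tree):
    """Return the set of leaf names below a node (a str or a nested tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return one hashable key per non-trivial internal edge of the tree."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # fixed reference leaf, so each split has one key
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        # Always record the side of the split that lacks the anchor leaf.
        side = all_leaves - below if anchor in below else below
        # Skip trivial splits (single leaves and the root itself).
        if len(side) > 1 and len(all_leaves - side) > 1:
            found.add(side)
        return below

    walk(tree)
    return found


def bootstrap_support(master, replicates):
    """Count, for each split of the master tree, the replicates containing it."""
    counts = Counter()
    for replicate in replicates:
        counts.update(splits(replicate))
    return {split: counts[split] for split in splits(master)}


master = ((("A", "B"), "C"), ("D", "E"))
replicates = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
    (("A", "B"), ("C", ("D", "E"))),
]
support = bootstrap_support(master, replicates)
```

Keying on the side without a fixed reference leaf means the two edges next to the root collapse into a single split, so the rooting of each replicate does not affect the counts.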
> They both seem worth posting on the wiki, although we may not have enough mentors for both to go ahead :( Peter From p.j.a.cock at googlemail.com Thu Mar 21 16:55:30 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Mar 2013 16:55:30 +0000 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Wed, Mar 13, 2013 at 6:32 PM, Eric Talevich wrote: > I like Michiel's idea, and I'll suggest two more: > > 1. Codon alignment & analysis: > - PAL2NAL-style conversion of unaligned nucleic acid sequences and a protein > sequence alignment to a codon alignment. (Previously discussed) e.g. https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py > - dN/dS and the related functions needed to calculate it. > - Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage of > codon alignments, including validation (testing for frame shifts etc.) http://biopython.org/wiki/Google_Summer_of_Code#Codon_alignment_and_analysis I see you've started fleshing this idea out on the wiki, which is great. Right now it seems a little on the light weight side - or is that deliberate (to see if a student can take this idea and come up with a solid project proposal in this area)? Things like model selection might be a fun extension - I can think of a local expert who would be great to get involved on the science side if he's interested. Alternatively this could include doing some more general work on the alignment object - for instance per-column-annotation for things like a consensus sequence - or an array-of-char implementation as an alternative to the list-of-SeqRecords we have now (with its poor column access speed). 
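To make the column access point concrete, here is a rough sketch of what an array-of-char layout could look like. CharAlignment is a hypothetical name and not a proposed Biopython class; it assumes ASCII sequences of equal length and ignores all annotation.

```python
class CharAlignment:
    """Hypothetical array-of-char alignment: one flat bytearray, row-major."""

    def __init__(self, rows):
        rows = list(rows)
        self.nrows = len(rows)
        self.ncols = len(rows[0]) if rows else 0
        if any(len(row) != self.ncols for row in rows):
            raise ValueError("all rows must have the same length")
        self._data = bytearray("".join(rows), "ascii")

    def row(self, i):
        """Return row i as a string (a contiguous slice of the buffer)."""
        if not 0 <= i < self.nrows:
            raise IndexError("row index out of range")
        start = i * self.ncols
        return self._data[start:start + self.ncols].decode("ascii")

    def column(self, j):
        """Return column j as a string without touching per-row objects."""
        if not 0 <= j < self.ncols:
            raise IndexError("column index out of range")
        # An extended slice picks every ncols-th byte, i.e. one column.
        return self._data[j::self.ncols].decode("ascii")


aln = CharAlignment(["ACGT", "A-GT", "ACGA"])
```

With a list of SeqRecords a column has to be assembled by indexing into every record in turn, whereas here it is a single strided slice of one buffer.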
Peter From p.j.a.cock at googlemail.com Thu Mar 21 17:29:44 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Mar 2013 17:29:44 +0000 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: Message-ID: On Tue, Feb 12, 2013 at 6:29 PM, Wibowo Arindrarto wrote: > Hi everyone, > > It's more or less a 'low hanging fruit', but I've been thinking > perhaps it may be useful if we have our own interface to the HMMER3 > online service? The corresponding SearchIO parsers may be written for > this as well (they return different formats for which we haven't any > parsers currently). Worth adding to the projects list here (or filing an enhancement bug) http://biopython.org/wiki/Active_projects#Project_ideas - but not enough to base a whole GSoC project around. > And I think there are more things being worked on, not yet mentioned > in the wiki: > > 1. Porting our docs to Sphinx[1] > 2. Converting some/all of the print and compare tests to unit tests. > For example, our Bio.Seq's tests are still print and compare tests. > > regards, > Bow > > [1] See the original feature request here: > https://redmine.open-bio.org/issues/3221 > https://redmine.open-bio.org/issues/3220 > https://redmine.open-bio.org/issues/3219 I don't think a purely documentation focused project is eligible for GSoC. But both ideas make sense separately from GSoC. 
Regards, Peter From p.j.a.cock at googlemail.com Thu Mar 21 17:36:24 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Mar 2013 17:36:24 +0000 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Thu, Mar 21, 2013 at 4:29 PM, Peter Cock wrote: > On Thu, Mar 21, 2013 at 4:11 PM, Peter Cock wrote: >> >> Right now we need to put this list of ideas on the wiki page (ready >> for combining into the OBF page which will be shown to Google >> to make our case for taking part in the GSoC 2013 program). >> http://biopython.org/wiki/Google_Summer_of_Code >> >> If any of you as a potential mentor want to put up an outline >> proposal, even better. >> > > I've been wondering about potential GSoC projects which I'd > be interested in mentoring (or co-mentoring), and thus far I've > only got one outline idea. > > I'm interested in taking the Bio.SeqIO.index(...) / index_db(...) > functionality (which does whole record parsing on demand) > and extending this with lazy-loading or lazy-parsing (which > has precedent in our BioSQL wrappers). For example, with > whole genome FASTA files you may never need to load the > entire sequence, but using an index system like tabix (or > even actually using a tabix index) Biopython could provide > a lazy-loading Seq object which extracts only the sequence > region of interest on demand. > > The same idea applies to richer file formats too, like EMBL > and GenBank. ... > > Likewise, this makes sense for GTF/GFF/GFF3 ... P.S. An example use case, http://www.biostars.org/p/64363/ Part of this work could include enhancements to the SeqRecord handling of SeqFeatures - offering more than just the current simple list - for example lookup by ID, dbxref, or position. That would be nice to have now with the current in-memory parsers. 
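As a very rough sketch of the kind of lookup meant here: Feature and FeatureIndex below are hypothetical stand-ins, not SeqFeature or any existing Biopython class, and coordinates are assumed to be 0-based and half-open.

```python
from collections import namedtuple

# Hypothetical stand-in for a feature; coordinates are 0-based, half-open.
Feature = namedtuple("Feature", "id start end")


class FeatureIndex:
    """Hypothetical container offering lookup by ID and by position."""

    def __init__(self, features):
        self._features = sorted(features, key=lambda f: f.start)
        self._by_id = {f.id: f for f in self._features}

    def by_id(self, feature_id):
        """Return the feature with this ID (raises KeyError if missing)."""
        return self._by_id[feature_id]

    def overlapping(self, start, end):
        """Return the features overlapping the region [start, end)."""
        hits = []
        for feature in self._features:
            if feature.start >= end:
                break  # sorted by start, so nothing later can overlap
            if feature.end > start:
                hits.append(feature)
        return hits


index = FeatureIndex([
    Feature("geneB", 500, 900),
    Feature("geneA", 100, 400),
    Feature("geneC", 850, 1200),
])
```

The position query here is a linear scan over features sorted by start; a binning scheme or interval tree would be the obvious next step for large feature tables.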
An old but still relevant example usecase: http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features Regards, Peter From eric.talevich at gmail.com Thu Mar 21 17:42:19 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 21 Mar 2013 13:42:19 -0400 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Thu, Mar 21, 2013 at 12:55 PM, Peter Cock wrote: > On Wed, Mar 13, 2013 at 6:32 PM, Eric Talevich > wrote: > > I like Michiel's idea, and I'll suggest two more: > > > > 1. Codon alignment & analysis: > > - PAL2NAL-style conversion of unaligned nucleic acid sequences and a > protein > > sequence alignment to a codon alignment. (Previously discussed) > > e.g. > https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py Well, check you out. Would you be interested in mentoring this project? > > - dN/dS and the related functions needed to calculate it. > > - Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage > of > > codon alignments, including validation (testing for frame shifts etc.) > > > http://biopython.org/wiki/Google_Summer_of_Code#Codon_alignment_and_analysis > > I see you've started fleshing this idea out on the wiki, which is great. > Right now it seems a little on the light weight side - or is that > deliberate > (to see if a student can take this idea and come up with a solid > project proposal in this area)? Things like model selection might > be a fun extension - I can think of a local expert who would be > great to get involved on the science side if he's interested. > I put up a quick sketch to avoid locking the wiki page for too long, but also deliberately left it vague to see where the applicants take it. Model selection would be cool, I added it. Local expert, also great. 
> Alternatively this could include doing some more general work > on the alignment object - for instance per-column-annotation > for things like a consensus sequence - or an array-of-char > implementation as an alternative to the list-of-SeqRecords > we have now (with its poor column access speed). > > Peter > I wonder if that's something we could just do incrementally -- change the MultipleSeqAlignment class to store a list-of-lists-of chars (or list-of-strings), a list of SeqRecord-like husks (all the annotations, but without the Seq itself) for each row, a list of column annotations, and a single alphabet for the whole alignment. How do you suppose the speed of that would compare to the current list-of-SeqRecords, and also to that of a wrapped NumPy matrix? Would it be a significant enough speed improvement to justify both replacing the current implementation, and to make the NumPy approach less tempting (given PyPy's progress toward including a compliant implementation)? Alternatively, we could post a GSoC project for creating a separate TurboAlignment class/module based on NumPy which would be mostly interchangeable and interconvertible with the pure-Python version in the Biopython core. Speaking of which, should we also post the idea of storing sequences as an efficient byte array, BioJava-style? -Eric From p.j.a.cock at googlemail.com Thu Mar 21 17:59:10 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Mar 2013 17:59:10 +0000 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: On Thu, Mar 21, 2013 at 5:42 PM, Eric Talevich wrote: > On Thu, Mar 21, 2013 at 12:55 PM, Peter Cock > wrote: >> >> On Wed, Mar 13, 2013 at 6:32 PM, Eric Talevich >> wrote: >> > I like Michiel's idea, and I'll suggest two more: >> > >> > 1. 
Codon alignment & analysis: >> > - PAL2NAL-style conversion of unaligned nucleic acid sequences and a >> > protein >> > sequence alignment to a codon alignment. (Previously discussed) >> >> e.g. >> https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py > > Well, check you out. Would you be interested in mentoring this project? > If I'm not primary mentor on another project, I'd be open to co-mentoring something on the alignment side. >> > - dN/dS and the related functions needed to calculate it. >> > - Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage >> > of >> > codon alignments, including validation (testing for frame shifts etc.) >> >> >> http://biopython.org/wiki/Google_Summer_of_Code#Codon_alignment_and_analysis >> >> I see you've started fleshing this idea out on the wiki, which is great. >> Right now it seems a little on the light weight side - or is that >> deliberate >> (to see if a student can take this idea and come up with a solid >> project proposal in this area)? Things like model selection might >> be a fun extension - I can think of a local expert who would be >> great to get involved on the science side if he's interested. > > > I put up a quick sketch to avoid locking the wiki page for too long, but > also deliberately left it vague to see where the applicants take it. Model > selection would be cool, I added it. Local expert, also great. If he's available and willing, yes. I've not mentioned this to him yet so no promises - the idea only occurred to me while writing that email ;) >> >> Alternatively this could include doing some more general work >> on the alignment object - for instance per-column-annotation >> for things like a consensus sequence - or an array-of-char >> implementation as an alternative to the list-of-SeqRecords >> we have now (with its poor column access speed). 
>> >> Peter > > > I wonder if that's something we could just do incrementally -- change the > MultipleSeqAlignment class to store a list-of-lists-of chars (or > list-of-strings), a list of SeqRecord-like husks (all the annotations, but > without the Seq itself) for each row, a list of column annotations, and a > single alphabet for the whole alignment. > > How do you suppose the speed of that would compare to the current > list-of-SeqRecords, and also to that of a wrapped NumPy matrix? Would it be > a significant enough speed improvement to justify both replacing the current > implementation, and to make the NumPy approach less tempting (given PyPy's > progress toward including a compliant implementation)? Alternatively, we > could post a GSoC project for creating a separate TurboAlignment > class/module based on NumPy which would be mostly interchangeable and > interconvertible with the pure-Python version in the Biopython core. When I said array-of-char I did have NumPy in mind, and PyPy does now cope with two or more dimensional arrays in NumPyPy. Note that NumPy handles both row and column orientated arrays with a simple class init option, so this can easily be setup to favour row or column access. Last time I did anything with the alignment object where column access was a bottleneck (calculating mutual information between columns), I just loaded all the columns into memory as a list of strings, and computed on that. It worked very nicely. > Speaking of which, should we also post the idea of storing sequences as an > efficient byte array, BioJava-style? I'd wondered about that (in combination with the discussion about strict alphabet checking), but is there enough for a whole GSoC project? Related to this one could look at something with k-mer hashes... 
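As an illustration of the "load all the columns as a list of strings" approach mentioned above, here is a minimal pure-Python sketch. The function names are made up for this example, the rows are assumed to be equal-length aligned strings, and the mutual information calculation is a plain textbook version rather than the code used at the time.

```python
import math
from collections import Counter


def alignment_columns(rows):
    """Transpose equal-length aligned rows into a list of column strings."""
    return ["".join(column) for column in zip(*rows)]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    total = 0.0
    for (a, b), joint in count_ab.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written with raw counts.
        total += (joint / n) * math.log2(joint * n / (count_a[a] * count_b[b]))
    return total


rows = ["AG", "AG", "CT", "CT"]
columns = alignment_columns(rows)
mi = mutual_information(columns[0], columns[1])
```

Transposing once up front costs one pass over the alignment, after which every column is a single string lookup rather than a walk over every row.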
(It's good to see lots of possible project ideas bouncing around) Peter From chapmanb at 50mail.com Fri Mar 22 12:48:34 2013 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 22 Mar 2013 08:48:34 -0400 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> Message-ID: <87zjxvsiql.fsf@fastmail.fm> Peter; > I've been wondering about potential GSoC projects which I'd > be interested in mentoring (or co-mentoring), and thus far I've > only got one outline idea. > > I'm interested in taking the Bio.SeqIO.index(...) / index_db(...) > functionality (which does whole record parsing on demand) > and extending this with lazy-loading or lazy-parsing (which > has precedent in our BioSQL wrappers). For example, with > whole genome FASTA files you may never need to load the > entire sequence, but using an index system like tabix (or > even actually using a tabix index) Biopython could provide > a lazy-loading Seq object which extracts only the sequence > region of interest on demand. This sounds incredibly useful. It's definitely worthwhile writing up if you'll have time this summer to mentor it. > Likewise, this makes sense for GTF/GFF/GFF3 where you > would index the features, and also if present index the > embedded FASTA sequence at the end of the file. I'm cc'ing Ryan, who has been thinking about similar work as part of gffutils. We're planning now on an approach that takes the BCBio.GFF parsing and rolls it into gffutils so we can parse, index in a SQLite database and expose as Biopython objects.
Here is some initial discussion and planning: https://github.com/daler/gffutils/issues/2 https://docs.google.com/document/d/15l_yZ_pge22ETw-pz2g4NWRAUAccmr1MYPmqXbj1Jl8/edit?usp=sharing Brad From dalerr at niddk.nih.gov Fri Mar 22 16:20:45 2013 From: dalerr at niddk.nih.gov (Ryan Dale) Date: Fri, 22 Mar 2013 12:20:45 -0400 Subject: [Biopython-dev] Project ideas for GSoC (or other student projects) In-Reply-To: <87zjxvsiql.fsf@fastmail.fm> References: <1360721306.47860.YahooMailClassic@web164001.mail.gq1.yahoo.com> <87zjxvsiql.fsf@fastmail.fm> Message-ID: <514C84DD.9070306@niddk.nih.gov> Hi Brad & Peter - On 03/22/2013 08:48 AM, Brad Chapman wrote: > Peter; > >> I've been wondering about potential GSoC projects which I'd >> be interested in mentoring (or co-mentoring), and thus far I've >> only got one outline idea. >> >> I'm interested in taking the Bio.SeqIO.index(...) / index_db(...) >> functionality (which does whole record parsing on demand) >> and extending this with lazy-loading or lazy-parsing (which >> has precedent in our BioSQL wrappers). For example, with >> whole genome FASTA files you may never need to load the >> entire sequence, but using an index system like tabix (or >> even actually using a tabix index) Biopython could provide >> a lazy-loading Seq object which extracts only the sequence >> region of interest on demand. > This sounds incredibly useful. It's definitely worthwhile writing up if > you'll have time this summer to mentor it. Agreed - a general, lazy-loading/lazy-parsing, indexed mechanism for accessing data annotation-like file formats would be fantastic. >> Likewise, this makes sense for GTF/GFF/GFF3 where you >> would index the features, and also if present index the >> embedded FASTA sequence at the end of the file. > I'm cc'ing Ryan, who has been thinking about similar work as part of > gffutils. 
We're planning now on an approach that takes the BCBio.GFF > parsing and rolls it into gffutils so we can parse, index in a SQLite > database and expose as Biopython objects. Here is some initial > discussion and planning: > > https://github.com/daler/gffutils/issues/2 > https://docs.google.com/document/d/15l_yZ_pge22ETw-pz2g4NWRAUAccmr1MYPmqXbj1Jl8/edit?usp=sharing As Peter pointed out on the GitHub issues page, what he has in mind is more general than just GFF/GTF, and I see gffutils as extending upon a specific subset of the functionality he proposes. For example, there are common use-cases that I think make sense for a GFF/GTF-only library (say, adding new annotations for introns, as inferred from the isoform + exon annotations) that might not be readily generalizable to all annotation-like file formats. But if this general indexing approach were already available, then gffutils could just be a wrapper around that, adding the specific GFF/GTF functionality as another layer. Then again . . . currently gffutils imports GFF data into a sqlite3 database, so data are persistent and both read/write. For the intron-inferring example, we simply add new records to the db, but with an indexing approach, the file would presumably have to be re-indexed before reading again. So how you'd like to use your GFF files (read-only vs read/write) would influence which strategy you'd choose. So I think there's actually smaller-than-expected overlap between gffutils and Peter's general indexing idea, and in the context of GSoC, I'm not sure you'd have to take gffutils into account. But gffutils would certainly benefit from general indexing, especially when retrieving sequences for features!
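For readers wondering what "retrieving sequences for features" via an index could look like, here is a rough sketch of the offset-based approach in plain Python. It is an assumption-laden illustration, not the proposed Biopython design and not the tabix or gffutils API: `build_index` and `fetch` are hypothetical helpers, each record is assumed to wrap its sequence at a fixed line width, coordinates are 0-based and half-open, and there is no bounds checking.

```python
import os
import tempfile


def build_index(path):
    """Map record ID -> (offset of first base, bases per line, bytes per line).

    Assumes all sequence lines of a record, except possibly the last one,
    have the same length.
    """
    index = {}
    pending = None  # ID of a header whose first sequence line is not yet seen
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                pending = line[1:].split()[0].decode("ascii")
            elif pending is not None:
                index[pending] = (offset, len(line.rstrip(b"\r\n")), len(line))
                pending = None
    return index


def fetch(path, index, name, start, end):
    """Return bases [start, end) of record `name`, reading only that region."""
    offset, bases_per_line, bytes_per_line = index[name]
    first = offset + (start // bases_per_line) * bytes_per_line + start % bases_per_line
    last = offset + (end // bases_per_line) * bytes_per_line + end % bases_per_line
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first)
    # Drop the line breaks that fall inside the requested region.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


# Small demonstration on a temporary two-record FASTA file.
with tempfile.NamedTemporaryFile("wb", suffix=".fasta", delete=False) as tmp:
    tmp.write(b">chr1 test\nACGTACGTAC\nGGGGCCCCTT\nAA\n>chr2\nTTTT\n")
    fasta_path = tmp.name
try:
    fasta_index = build_index(fasta_path)
    region = fetch(fasta_path, fasta_index, "chr1", 8, 14)
    tail = fetch(fasta_path, fasta_index, "chr1", 20, 22)
    other = fetch(fasta_path, fasta_index, "chr2", 1, 3)
finally:
    os.remove(fasta_path)
```

Only the bytes covering the requested region are read, which is the property that matters for whole-genome FASTA files. As noted above, any edit to the file would invalidate the stored offsets and require re-indexing.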
-ryan From mjldehoon at yahoo.com Tue Mar 26 13:21:35 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 26 Mar 2013 06:21:35 -0700 (PDT) Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: Message-ID: <1364304095.69042.YahooMailClassic@web164002.mail.gq1.yahoo.com> Hi all, Speaking of which, we have a Biopython Structural Bioinformatics FAQ (i.e. how to use the Bio.PDB module) on the Biopython website with additional information on Bio.PDB, including some information on things that are not in the main Biopython Tutorial. Perhaps this is a good time to integrate this FAQ into the main documentation? We could also update it a bit because it's been a while and there are some different things here and there. And additions too. I went over the Biopython Structural Bioinformatics FAQ and integrated it into the main Biopython tutorial; see biopython.org/DIST/docs/tutorial/Tutorial-dev.html or biopython.org/DIST/docs/tutorial/Tutorial-dev.pdf Though I think everything is there, it may be good if somebody more experienced with Bio.PDB were to look it over to see if it still makes sense. In addition, I converted the Biopython Structural Bioinformatics FAQ to our wiki format and added it to our wiki documentation; see http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ This wiki now contains the exact same information (except for some minor updates/fixes) as the PDF with the Biopython Structural Bioinformatics FAQ that we have on the Biopython website. I guess with this we can remove the lyx/tex source code of the Biopython Structural Bioinformatics FAQ from the git repository, as well as the PDF from the Biopython website. Any objections? Best, -Michiel. 
From p.j.a.cock at googlemail.com Tue Mar 26 13:53:52 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 26 Mar 2013 13:53:52 +0000 Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: <1364304095.69042.YahooMailClassic@web164002.mail.gq1.yahoo.com> References: <1364304095.69042.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: On Tue, Mar 26, 2013 at 1:21 PM, Michiel de Hoon wrote: > > I guess with this we can remove the lyx/tex source code of the Biopython Structural Bioinformatics FAQ from the git repository, as well as the PDF from the Biopython website. Any objections? > Good work Michiel :) I would suggest making a final revision to the Biopython Structural Bioinformatics FAQ to explain this document is now obsolete, and where the information has moved to. Commit that to git, and put the final PDF online replacing the current version. That way anyone looking at the PDF online (or the git history) will have a clear route to finding the current information. Thanks, Peter From anaryin at gmail.com Tue Mar 26 13:54:55 2013 From: anaryin at gmail.com (João Rodrigues) Date: Tue, 26 Mar 2013 14:54:55 +0100 Subject: [Biopython-dev] flex, setup.py and Bio.PDB.mmCIF (Bug 2619) In-Reply-To: References: <1364304095.69042.YahooMailClassic@web164002.mail.gq1.yahoo.com> Message-ID: Great work! I'll go over it in the next few days. 2013/3/26 Peter Cock > On Tue, Mar 26, 2013 at 1:21 PM, Michiel de Hoon > wrote: > > > > I guess with this we can remove the lyx/tex source code of the Biopython > Structural Bioinformatics FAQ from the git repository, as well as the PDF > from the Biopython website. Any objections? > > > > Good work Michiel :) > > I would suggest making a final revision to the Biopython Structural > Bioinformatics > FAQ to explain this document is now obsolete, and where the information has > moved to. Commit that to git, and put the final PDF online replacing the > current > version.
That way anyone looking at the PDF online (or the git > history) will have > a clear route to finding the current information. > > Thanks, > > Peter > From lara.vignotto at gmail.com Wed Mar 27 14:09:50 2013 From: lara.vignotto at gmail.com (Lara Vignotto) Date: Wed, 27 Mar 2013 15:09:50 +0100 Subject: [Biopython-dev] [GSoC] Further info about Codon alignment idea Message-ID: Hello, I'm a student from Italy. I'm attending the first year of Biotechnology at the University of Udine, and I'm interested in the Codon alignment and analysis project proposed for the Google Summer of Code 2013. Since I would like to know if I have got the skills required to contribute, can you tell me more about the project? Regards, Lara Vignotto From 88whacko at gmail.com Thu Mar 28 10:39:07 2013 From: 88whacko at gmail.com (Andrea Rizzi) Date: Thu, 28 Mar 2013 11:39:07 +0100 Subject: [Biopython-dev] New contributor In-Reply-To: References: <1363185895.3324.YahooMailClassic@web164005.mail.gq1.yahoo.com> Message-ID: Thank you for the great feedback Peter. I'll write a test case for Bio.Alphabet then since I couldn't find any. When it's ready I'll request a pull. Thank you again! Andrea 2013/3/21 Peter Cock > On Wed, Mar 20, 2013 at 6:10 PM, Andrea Rizzi <88whacko at gmail.com> wrote: > > Thank you for your welcome Michiel! > > > > I will be looking for a good project to work on in the next few days and I > > will let you know soon. Meanwhile I've started to read some code to > become > > familiar with the modules and I bumped into a few small bugs concerning the > > Seq objects, in particular I found: > > > > 1) a duplicated test method name (one test in test_Seq_objs.py wasn't > > performed); > > 2) an error in Alphabet._case_less(). > > Well spotted - changes applied to the master, thanks. > > > I've also expanded the documentation a little bit and I've substituted > > tostring() method with the suggested str() method in a function of > > MutableSeq.
The branch is located here > > > > https://github.com/andrrizzi/biopython/tree/seq-branch > > > > I'm not sure if it is more comfortable for you to merge this kind of > > commits from a git branch or it is more advisable to open a ticket and > > create a patch. Anyway if you think this small commits may be useful, > feel > > free to use them. > > If you're happy on GitHub, a pull request is simplest. I've looked > at these changes one by one and applied and/or commented > on them. > > (We're debating moving our issue tracker from RedMine to > GitHub, which would make things a little easier in future). > > Thank you! > > Peter > From p.j.a.cock at googlemail.com Thu Mar 28 13:39:57 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 28 Mar 2013 13:39:57 +0000 Subject: [Biopython-dev] [GSoC] Further info about Codon alignment idea In-Reply-To: References: Message-ID: On Wed, Mar 27, 2013 at 2:09 PM, Lara Vignotto wrote: > Hello, > I'm a student from Italy. I'm attending the first year of Biotechnology at > the University of Udine, and I'm interested about the Codon alignment and > analysis project proposed fot the Google Summer of Code 2013. > Since I would like to know if I have got the skills required to contribute, > can you tell me more about the project? > > Regards, > Lara Vignotto Hi Lara, Welcome and thank you for your interest in taking part in GSoC 2013. The background discussion to the outline idea on the wiki was here: http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010449.html http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010471.html http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010474.html http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010475.html (I think that was all the posts - check the archive to be sure). 
The text of the wiki is hopefully enough to spark your interest - what we'd really like to see is a student intrigued by the idea and driven to expand the topic into a full project proposal. If for example your current course work included some phylogenetics that might help give you perspective about what is useful and worth adding to Biopython. You should probably also have a look at the NESCent GSoC project ideas if it is the phylogenetic side that really interests you - in previous years Biopython has mentored GSoC students with NESCent: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2013 You would also need to be competent with Python - although if you also know and love Perl or Ruby (etc) there might be a mentor willing to supervise a related project with BioPerl or BioRuby - that's good too from the wider OBF and Bio* perspective. For tree traversal some background reading on things like breadth first search and other algorithms for 'walking' the tree would be a good idea (see also the Python os.path module for 'walking' a file system tree). I'm sure there will be other technical things to learn about and use, depending on where a GSoC project based on this idea went. Did that help? Is there something more specific I can try to answer? Regards, Peter From p.j.a.cock at googlemail.com Thu Mar 28 15:44:11 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 28 Mar 2013 15:44:11 +0000 Subject: [Biopython-dev] Fwd: [biopython] Custom GenBank locus length (#171) In-Reply-To: References: Message-ID: For those not getting the pull request emails from GitHub, ---------- Forwarded message ---------- From: Marco Galardini Date: Thu, Mar 28, 2013 at 3:19 PM Subject: [biopython] Custom GenBank locus length (#171) To: biopython/biopython Instead of an exception, raise a warning, so the file is saved and the user can decide to correct the error.
I don't know if this is a good practice, but I have some GenBank files provided by the JGI/DOE with locus names longer than 16 chars, so I guess that providing a warning to the user instead of a complete failure could be better. ________________________________ You can merge this Pull Request by running git pull https://github.com/mgalardini/biopython patch-1 Or view, comment on, or merge it at: https://github.com/biopython/biopython/pull/171 Commit Summary Custom GenBank locus length File Changes M Bio/SeqIO/InsdcIO.py (4) Patch Links: https://github.com/biopython/biopython/pull/171.patch https://github.com/biopython/biopython/pull/171.diff From marco.galardini at unifi.it Thu Mar 28 15:54:38 2013 From: marco.galardini at unifi.it (Marco Galardini) Date: Thu, 28 Mar 2013 16:54:38 +0100 Subject: [Biopython-dev] Fwd: [biopython] Custom GenBank locus length (#171) In-Reply-To: References: Message-ID: <515467BE.7090105@unifi.it> Good afternoon everyone, Actually, I have been testing a bit more and some other changes may be needed (sorry about that, this is my first change to the biopython code). The assertions on the line lengths still fail, so my guess is that probably it's not a good idea to try to write down a genbank with unusual identifiers (even if they are from JGI!). Marco On 03/28/2013 04:44 PM, Peter Cock wrote: > For those not getting the pull request emails from GitHub, > > ---------- Forwarded message ---------- > From: Marco Galardini > Date: Thu, Mar 28, 2013 at 3:19 PM > Subject: [biopython] Custom GenBank locus length (#171) > To: biopython/biopython > > > Instead of an exception, raise a warning, so the file is saved and the > user can decide to correct the error. > > I don't know if this is a good practice, but I have some GenBank files > provided by the JGI/DOE with locus names longer than 16 chars, so I > guess that providing a warning to the user instead of a complete > failure could be better.
> > ________________________________ > > You can merge this Pull Request by running > > git pull https://github.com/mgalardini/biopython patch-1 > > Or view, comment on, or merge it at: > > https://github.com/biopython/biopython/pull/171 > > Commit Summary > > Custom GenBank locus length > > File Changes > > M Bio/SeqIO/InsdcIO.py (4) > > Patch Links: > > https://github.com/biopython/biopython/pull/171.patch > https://github.com/biopython/biopython/pull/171.diff > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev -- ------------------------------------------------- Marco Galardini, PhD Dipartimento di Biologia Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI) e-mail: marco.galardini at unifi.it www: http://www.unifi.it/dblage/CMpro-v-p-51.html phone: +39 055 4574737 mobile: +39 340 2808041 ------------------------------------------------- From p.j.a.cock at googlemail.com Thu Mar 28 18:00:38 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 28 Mar 2013 18:00:38 +0000 Subject: [Biopython-dev] stdout/stderr handling oddity Message-ID: Hi all, While looking at the BWA wrapper from Saket Choudhary https://github.com/biopython/biopython/pull/167 and the associated enhancement to the __call__ functionality of the command line wrapper base class, I wrote a couple of unit tests - which have left me a little puzzled: https://github.com/biopython/biopython/commit/3f5d4c442424a7ca33ae0bafa60c840e80ae2fda Could a few of you try running this test_Application.py file and confirm it works as is, and try uncommenting the two problem test cases? (I'm curious if the echo test works as intended on a plain Windows machine without cygwin installed - I hope so). 
Unless anyone else can explain this, I think the next step is a simple test program which produces predictable output to both stdout and stderr, just in case this is due to there being no stderr output in these tests. e.g. Print integers 1, 2, 3, 4, ..., to some sensible limit, like 20, where non-primes are on stdout while primes on stderr. Peter From arklenna at gmail.com Thu Mar 28 20:54:11 2013 From: arklenna at gmail.com (Lenna Peterson) Date: Thu, 28 Mar 2013 16:54:11 -0400 Subject: [Biopython-dev] stdout/stderr handling oddity In-Reply-To: References: Message-ID: Hi Peter, On Mac OS X, opening os.devnull with mode 'w' on lines 418 and 422 of the Application __init__.py causes the tests to pass for me. Lenna From saketkc at gmail.com Thu Mar 28 20:57:54 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Fri, 29 Mar 2013 02:27:54 +0530 Subject: [Biopython-dev] stdout/stderr handling oddity In-Reply-To: References: Message-ID: Yes. And the reason is this :http://stackoverflow.com/questions/2368967/bad-file-descriptor-error On 29 March 2013 02:24, Lenna Peterson wrote: > Hi Peter, > > On Mac OS X, opening os.devnull with mode 'w' on lines 418 and 422 of the > Application __init__.py causes the tests to pass for me. > > Lenna > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From saketkc at gmail.com Thu Mar 28 21:00:00 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Fri, 29 Mar 2013 02:30:00 +0530 Subject: [Biopython-dev] stdout/stderr handling oddity In-Reply-To: References: Message-ID: Forgot to add : Tested on Ubuntu 12.04 On 29 March 2013 02:27, Saket Choudhary wrote: > Yes. 
> And the reason is this > :http://stackoverflow.com/questions/2368967/bad-file-descriptor-error > > On 29 March 2013 02:24, Lenna Peterson wrote: >> Hi Peter, >> >> On Mac OS X, opening os.devnull with mode 'w' on lines 418 and 422 of the >> Application __init__.py causes the tests to pass for me. >> >> Lenna From p.j.a.cock at googlemail.com Thu Mar 28 22:11:11 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 28 Mar 2013 22:11:11 +0000 Subject: [Biopython-dev] stdout/stderr handling oddity In-Reply-To: References: Message-ID: > On 29 March 2013 02:24, Lenna Peterson wrote: >> Hi Peter, >> >> On Mac OS X, opening os.devnull with mode 'w' on lines 418 and 422 of the >> Application __init__.py causes the tests to pass for me. >> >> Lenna On Thu, Mar 28, 2013 at 8:57 PM, Saket Choudhary wrote: > Yes. > And the reason is this > :http://stackoverflow.com/questions/2368967/bad-file-descriptor-error > Thank you both - I am kicking myself now - maybe I should have taken another sick day this week instead of returning to work? ;) Fixed: https://github.com/biopython/biopython/commit/bba2acbf3d690ad7b99e94ac8ead6763b1d05ab8 I guess no one had bothered using this option to send stderr to /dev/null - or if they had, they never reported this error. The only thing which puzzles me is why this worked for stdout. Odd. Cheers, Peter From p.j.a.cock at googlemail.com Fri Mar 29 11:54:33 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 29 Mar 2013 11:54:33 +0000 Subject: [Biopython-dev] Fwd: [biopython] Fix Biopython installation with pip (#172) In-Reply-To: References: Message-ID: Hi Brad, This sounds sensible in principle - it just needs some hands on testing on various systems - any volunteers who use PIP and virtual envs?
Thanks, Peter ---------- Forwarded message ---------- From: Brad Chapman Date: Fri, Mar 29, 2013 at 11:47 AM Subject: [biopython] Fix Biopython installation with pip (#172) To: biopython/biopython Hi all; This is yet another take on making Biopython install nicely with pip in virtual environments. This avoids adding numpy as an explicit dependency and instead uses it if present or skips it if not. The problem with the previous install_requires approach is that pip doesn't build and install all requirements before setting up Biopython, so Biopython will fail with a numpy missing error. Additionally, our old approach drags in numpy so creates a heavyweight dependency for isolated environments. The new approach requires users to explicitly install numpy if needed but doesn't penalize them if it's not present. I submitted as a pull request for documentation and feedback from anyone. If y'all agree, merge away. Thanks, Brad ________________________________ You can merge this Pull Request by running git pull https://github.com/chapmanb/biopython master Or view, comment on, or merge it at: https://github.com/biopython/biopython/pull/172 Commit Summary Improve Biopython installation with pip: avoid including numpy as dependency when automated. Instead explicitly avoid needing numpy installed to continue Add helpful comment on pip dependency management File Changes M setup.py (38) Patch Links: https://github.com/biopython/biopython/pull/172.patch https://github.com/biopython/biopython/pull/172.diff
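The "use numpy if present, skip it if not" idea from the pull request description can be sketched roughly as follows. This is not the actual change to setup.py (see the patch links above for that); `is_numpy_installed`, `select_extensions` and the extension names are hypothetical placeholders chosen for illustration.

```python
def is_numpy_installed():
    """Return True if numpy can be imported in the current environment."""
    try:
        import numpy  # noqa: F401  (imported only to test availability)
    except ImportError:
        return False
    return True


def select_extensions(core, numpy_dependent, have_numpy):
    """Return the extensions to build, skipping numpy-dependent ones if needed.

    `core` and `numpy_dependent` are lists of (hypothetical) extension names.
    """
    selected = list(core)
    if have_numpy:
        selected.extend(numpy_dependent)
    return selected


# Placeholder names, not real Biopython extension modules.
to_build = select_extensions(
    core=["pkg._core"],
    numpy_dependent=["pkg._numeric"],
    have_numpy=is_numpy_installed(),
)
```

With this shape, numpy never appears as an install requirement, so pip has nothing to order incorrectly. Users who want the numpy-based modules install numpy first and then reinstall.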