From p.j.a.cock at googlemail.com Fri Mar 16 17:40:56 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 16 Mar 2012 21:40:56 +0000 Subject: [GSoC] [Open-bio-l] Google Summer of Code is *ON* for OBF projects! In-Reply-To: <4F6398E8.4010806@gmail.com> References: <4F6398E8.4010806@gmail.com> Message-ID: On Fri, Mar 16, 2012 at 7:47 PM, Robert Buels wrote: > Hi all, > > Great news: Google announced today that the Open Bioinformatics > Foundation has been accepted as a mentoring organization for this > summer's Google Summer of Code! > > GSoC is a Google-sponsored student internship program for open-source > projects, open to students from around the world (not just US > residents). Students are paid a $5000 USD stipend to work as a > developer on an open-source project for the summer. For more on GSoC, > see GSoC 2012 FAQ at http://goo.gl/kNv48 > > Student applications are due April 6, 2012 at 19:00 UTC. Students who > are interested in participating should look at the OBF's GSoC page at > http://open-bio.org/wiki/Google_Summer_of_Code, which lists project > ideas, and whom to contact about applying. > > For current developers on OBF projects, please consider volunteering to > be a mentor if you have not already, and contribute project ideas. Just > list your name and project ideas on OBF wiki and on the relevant > project's GSoC wiki page. > > Thanks to all who helped make OBF's application to GSoC a success, and > let's have a great, productive summer of code! > > Rob Buels > OBF GSoC 2012 Administrator Excellent news - well done Rob et al. Would you like me to post this to the news blog, or can you? http://news.open-bio.org/news/ Thanks, Peter From cjfields at illinois.edu Fri Mar 16 17:49:32 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 16 Mar 2012 21:49:32 +0000 Subject: [GSoC] [Open-bio-l] Google Summer of Code is *ON* for OBF projects!
In-Reply-To: References: <4F6398E8.4010806@gmail.com> Message-ID: On Mar 16, 2012, at 4:40 PM, Peter Cock wrote: > On Fri, Mar 16, 2012 at 7:47 PM, Robert Buels wrote: >> Hi all, >> >> Great news: Google announced today that the Open Bioinformatics >> Foundation has been accepted as a mentoring organization for this >> summer's Google Summer of Code! >> >> GSoC is a Google-sponsored student internship program for open-source >> projects, open to students from around the world (not just US >> residents). Students are paid a $5000 USD stipend to work as a >> developer on an open-source project for the summer. For more on GSoC, >> see GSoC 2012 FAQ at http://goo.gl/kNv48 >> >> Student applications are due April 6, 2012 at 19:00 UTC. Students who >> are interested in participating should look at the OBF's GSoC page at >> http://open-bio.org/wiki/Google_Summer_of_Code, which lists project >> ideas, and whom to contact about applying. >> >> For current developers on OBF projects, please consider volunteering to >> be a mentor if you have not already, and contribute project ideas. Just >> list your name and project ideas on OBF wiki and on the relevant >> project's GSoC wiki page. >> >> Thanks to all who helped make OBF's application to GSoC a success, and >> let's have a great, productive summer of code! >> >> Rob Buels >> OBF GSoC 2012 Administrator > > Excellent news - well done Rob et al. > > Would you like me to post this to the news blog, or can you? > http://news.open-bio.org/news/ > > Thanks, > > Peter I think post away. I've already tweeted this. chris From p.j.a.cock at googlemail.com Fri Mar 16 17:54:46 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 16 Mar 2012 21:54:46 +0000 Subject: [GSoC] [Open-bio-l] Google Summer of Code is *ON* for OBF projects!
In-Reply-To: References: <4F6398E8.4010806@gmail.com> Message-ID: On Fri, Mar 16, 2012 at 9:49 PM, Fields, Christopher J wrote: > On Mar 16, 2012, at 4:40 PM, Peter Cock wrote: >> >> Excellent news - well done Rob et al. >> >> Would you like me to post this to the news blog, or can you? >> http://news.open-bio.org/news/ >> >> Thanks, >> >> Peter > > I think post away. I've already tweeted this. > > chris > Done, http://news.open-bio.org/news/2012/03/obf-accepted-for-gsoc-2012/ This was posted to the @obf_news twitter account here https://twitter.com/obf_news/status/180773706715504640 Peter From ayushgoel111 at gmail.com Thu Mar 22 14:03:49 2012 From: ayushgoel111 at gmail.com (Ayush Goel) Date: Thu, 22 Mar 2012 23:33:49 +0530 Subject: [GSoC] Interested in working on SearchIO Message-ID: Hello, I am a student at Delhi College of Engineering. I have prior experience with Python from two other internships. I was hoping to find myself a more challenging project this time, with Python as the default language. The description of the SearchIO project seems to be a very good one. Still, I am pretty new to Biopython's code. If possible, I would like to have some more information regarding what is expected from the deliverable. Also, if some reference material on the background of the data formats required (BLAST etc.) could be provided, it would be very helpful. -- Regards, Ayush Goel From p.j.a.cock at googlemail.com Fri Mar 23 05:30:10 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 23 Mar 2012 09:30:10 +0000 Subject: [GSoC] Interested in working on SearchIO In-Reply-To: References: Message-ID: On Thu, Mar 22, 2012 at 6:03 PM, Ayush Goel wrote: > Hello, > > I am a student at Delhi College of Engineering. I have prior > experience with Python from two other internships. I was hoping to find myself > a more challenging project this time, with Python as the default > language. The description of the SearchIO project seems to be a very > good one.
> > Still, I am pretty new to Biopython's code. If possible, I would > like to have some more information regarding what is expected from the > deliverable. Also, if some reference material on the background of the > data formats required (BLAST etc.) could be provided, it would be > very helpful. Hello Ayush, Are you doing any biology or bioinformatics courses? That would help with background knowledge. The SearchIO project does require a reasonably broad knowledge of important tools and concepts in pairwise sequence alignment - if you are not familiar with BLAST etc. that will be a big handicap. You don't need to know the algorithm details - just the overall idea, how to run the tools, and what kind of analysis people might want to do with them. Some possible background reading (an introductory Bioinformatics course or book might be good too): http://www.ncbi.nlm.nih.gov/BLAST/ http://en.wikipedia.org/wiki/BLAST http://emboss.open-bio.org/wiki/Appdoc:Needle http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm http://emboss.open-bio.org/wiki/Appdoc:Water http://en.wikipedia.org/wiki/Smith-Waterman_algorithm In terms of possible deliverables, I went into more detail here: http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html However, if you have a lot of experience with Python and parsing text and XML files, that would be a big plus. Perhaps there is another topic that might suit you better. Is there a particular reason why you are interested in Biopython? Regards, Peter From saketkc at gmail.com Mon Mar 26 04:40:51 2012 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 26 Mar 2012 14:10:51 +0530 Subject: [GSoC] [GSoC2012] BioRuby Message-ID: Hi! I am Saket Choudhary, a third year undergraduate student at IIT Bombay, India. Ruby has been my first love for programming. My last project [internship] was at SlideShare, one of the biggest Ruby on Rails websites in the world.
I had developed the Admin interface for SlideShare, which enabled suspending users, reconversion/deletion of SlideShows, and user deletion/suspension. My hack at Yahoo! Open Hack India 2011 qualified among the Top 50 hacks: I built a Sinatra app for fetching a defined file from a Dropbox account and sending it to a specified email address, just by sending an SMS [https://github.com/saketkc/dropbox_on_sms]. I went through the GSoC idea page, and "Adding social networking functionality to BioRuby.org" is of special interest to me. I have experience working on the Rails platform. My plans for making the website more "Social" are as follows: 1. Provide an online 'Scratchpad' for Ruby/BioRuby enthusiasts that not only allows them to run their code online but also lets them store it online in the form of an archive so that they can access it later. 2. Include a sharing facility on the "Scratchpad" so that one user can share his code online with other users/the community and get feedback/comments, along the lines of "Sage" notebooks. 3. Develop an online board along the lines of "Quora" Boards so that users can pin certain code/algorithms to their own boards for reference. This would reduce the overhead of searching for a particular algorithm again and again. Let me know your views on these ideas. I can send a mockup if these ideas seem feasible to you. Thanks Saket Choudhary IIT Bombay From saketkc at gmail.com Mon Mar 26 11:32:40 2012 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 26 Mar 2012 21:02:40 +0530 Subject: [GSoC] [GSoC2012] BioRuby In-Reply-To: <20120326133700.GA22488@thebird.nl> References: <20120326133700.GA22488@thebird.nl> Message-ID: Hi Pjotr, Thanks! I have been using BioRuby and BioPython for 6 months now to solve the "Protein Loop Closure" problem. I use them mainly to manipulate the atom positions in a given PDB file, thus perturbing their positions.
I went through the discussion that happened on the mailing list last month. Here are my notes on it: 1. http://lists.open-bio.org/pipermail/bioruby/2012-February/002087.html I am also a big fan of the JRuby homepage. Here are my suggestions: 1. Ruby is simple, clear, and intuitive, and this makes BioRuby intuitive to everyone. This needs to be emphasised the very first time a user comes to the webpage. Instead of giving them examples on the wiki [http://bioruby.open-bio.org/wiki/SampleCodes] to "read" through, a tutorial in the form of a short writeup/description about Ruby/BioRuby followed by a challenge would be more appealing and intuitive to users, even those being exposed to Ruby/BioRuby for the first time. Say, something along the lines of http://www.codecademy.com/: a short tutorial followed by your Scratchpad! 2. Calendar/Tweet/Conference widget: something again along the lines of the JRuby website. 3. Favicon missing? It is a very trivial issue, but I just wanted to know why there isn't a favicon for bioruby.org. This is what I have gathered so far. I am still digging through the old threads and will post here if something relevant comes up! Saket Choudhary IIT Bombay github.com/saketkc On 26 March 2012 19:07, Pjotr Prins wrote: > Hi Saket, > > Welcome! > > It would be good if you also introduce yourself to the BioRuby ML, and > post your ideas. We are working on the website (should I say > web 'experience'), and I like what you propose. Also check out the ML > archive of the last months, you'll find a lot of information. > > Pj. > > On Mon, Mar 26, 2012 at 02:10:51PM +0530, Saket Choudhary wrote: > > Hi! > > > > I am Saket Choudhary, a third year undergraduate student at IIT Bombay, > > India. > > > > Ruby has been my first love for programming. My last project [internship] > > was at SlideShare, one of the biggest Ruby on > > Rails websites in the world.
I had developed the Admin interface for > > SlideShare which could enable suspending users, reconversion/deletion of > > SlideShows , and user deletion/suspension. > > > > My hack at Yahoo! Open Hack India -2011 qualified among the Top 50 hacks > , > > I built a Sinatra app for fetching a defined file from a Dropbox account > > and sending it to a specified email address , just on a SMS.[ > > https://github.com/saketkc/dropbox_on_sms] > > > > I went through the GSoC idea page and "Adding social networking > > functionality to BioRuby.org", is of my special interest. I have an > > experience working on the Rails platform. My plans for making the website > > more "Social" is as follows: > > > > 1. Provide an online 'Scratchpad' for Ruby/BioRuby enthusiasts that not > > only allows them to run their codes online but also provides them a > > facility to store it online in the form of an archive so that they an > acess > > it later. > > > > 2. Include sharing facility on "Scratchpad" so that one user can share > his > > code online with other users/community and get feedback/comments. On the > > lines of "Sage " notebooks. > > > > 3. Develop an Online Board on the lines of "Quora >" > > Boards so that users can pin certain codes/algorithms on to their own > > boards for their reference this would reduce the overhead of searching > for > > a particular algorithm again and again . > > > > Let me know you views on these ideas. I would send a mockup of the same > > incase these ideas seem feasible to you. > > > > Thanks > > > > Saket Choudhary > > IIT Bombay > > _______________________________________________ > > GSoC mailing list > > GSoC at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/gsoc > From pjotr2012 at thebird.nl Mon Mar 26 09:37:00 2012 From: pjotr2012 at thebird.nl (Pjotr Prins) Date: Mon, 26 Mar 2012 15:37:00 +0200 Subject: [GSoC] [GSoC2012] BioRuby In-Reply-To: References: Message-ID: <20120326133700.GA22488@thebird.nl> Hi Saket, Welcome! 
It would be good if you also introduce yourself to the BioRuby ML, and post your ideas. We are working on the website (should I say web 'experience'), and I like what you propose. Also check out the ML archive of the last months, you'll find a lot of information. Pj. On Mon, Mar 26, 2012 at 02:10:51PM +0530, Saket Choudhary wrote: > Hi ! > > I am Saket Choudhary, a third year undergraduate student at IIT Bombay, > India. > > Ruby has been my first love for programming. My last project[internship] > was at SlideShare , one of the biggest Ruby on > Rails website in the world. I had developed the Admin interface for > SlideShare which could enable suspending users, reconversion/deletion of > SlideShows , and user deletion/suspension. > > My hack at Yahoo! Open Hack India -2011 qualified among the Top 50 hacks , > I built a Sinatra app for fetching a defined file from a Dropbox account > and sending it to a specified email address , just on a SMS.[ > https://github.com/saketkc/dropbox_on_sms] > > I went through the GSoC idea page and "Adding social networking > functionality to BioRuby.org", is of my special interest. I have an > experience working on the Rails platform. My plans for making the website > more "Social" is as follows: > > 1. Provide an online 'Scratchpad' for Ruby/BioRuby enthusiasts that not > only allows them to run their codes online but also provides them a > facility to store it online in the form of an archive so that they an acess > it later. > > 2. Include sharing facility on "Scratchpad" so that one user can share his > code online with other users/community and get feedback/comments. On the > lines of "Sage " notebooks. > > 3. Develop an Online Board on the lines of "Quora " > Boards so that users can pin certain codes/algorithms on to their own > boards for their reference this would reduce the overhead of searching for > a particular algorithm again and again . > > Let me know you views on these ideas. 
I would send a mockup of the same > in case these ideas seem feasible to you. > > Thanks > > Saket Choudhary > IIT Bombay > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From shahruchin711 at gmail.com Sun Apr 1 07:23:00 2012 From: shahruchin711 at gmail.com (Ruchin Shah) Date: Sun, 1 Apr 2012 16:53:00 +0530 Subject: [GSoC] BioJava- Porting the BLAST,HMMER algorithms Message-ID: Hi, I am Ruchin Shah, a 3rd year undergraduate student from DA-IICT, India. I would like to work on some challenging projects in the field of bioinformatics. I have already worked on a project called BioSpectroGram (written in Java) under the mentorship of Prof. Manish K. Gupta (http://www.guptalab.org/mankg/public_html/), which aims at analyzing DNA and protein sequences using various kinds of transformations (FFT, DCT, etc.). I came to know about the idea of implementing the two algorithms, BLAST and HMMER, and I find it very fascinating. I have good coding experience (http://www.spoj.pl/users/ruchinshah/). I am also familiar with the FASTA and GenBank formats. I read about the BLAST algorithm at http://en.wikipedia.org/wiki/BLAST#BLAST but, if possible, I would like to know more about these two algorithms, exactly what is expected from the project, and also some more references. If I am not wrong, you are expecting to use some C-to-Java conversion tool or JNI to exploit the already available BLAST+ tool and not implement the algorithms from scratch. From andreas at sdsc.edu Sun Apr 1 14:02:16 2012 From: andreas at sdsc.edu (Andreas Prlic) Date: Sun, 1 Apr 2012 11:02:16 -0700 Subject: [GSoC] BioJava- Porting the BLAST,HMMER algorithms In-Reply-To: References: Message-ID: Hi Ruchin, Are you also on the biojava-l mailing list?
We had quite a number of discussions about this project there already, and if you are not on the list it might be a good start to catch up with what was already discussed there. http://lists.open-bio.org/pipermail/biojava-l/ The idea in short is to come up with an all-Java version of some of the frequently used algorithms. We are quite flexible regarding the projects and what we are really looking for are sound projects and motivated students. What is expected is a realistic project proposal, which in turn depends on your background and how you propose to conduct the project. Andreas On Sun, Apr 1, 2012 at 4:23 AM, Ruchin Shah wrote: > Hi, > > I am Ruchin Shah, 3rd year undergraduate student from DA-IICT,India. > > I would like to work on some challenging projects in the field of > bioinformatics. I have already worked on a project called > BioSpectroGram(written in Java)under the mentorship of Prof. Manish K. > Gupta(http://www.guptalab.org/mankg/public_html/) which aims at analyzing > DNA and protein sequences using various kinds of > transformations(FFT,DCT,etc.). I came to know about the idea of implementing > the two algorithms-BLAST and HMMER, and I find it very fascinating. I have > a good coding experience(http://www.spoj.pl/users/ruchinshah/). I am also > familiar with the FASTA and GenBank formats. > > I read about the BLAST algorithm at > http://en.wikipedia.org/wiki/BLAST#BLAST but if possible I would like to > know more about these two algorithms and exactly what is expected from the > project > and also some more references.If I am not wrong then you are expecting to > use some C-to-Java conversion tool or JNI to exploit the already available > BLAST+ tool and not implement the algorithms from scratch .
> _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From p.j.a.cock at googlemail.com Tue Apr 24 07:21:56 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Apr 2012 12:21:56 +0100 Subject: [GSoC] Fwd: Announcing OBF Google Summer of Code Accepted Students In-Reply-To: <4F95EA76.4030004@gmail.com> References: <4F95EA76.4030004@gmail.com> Message-ID: The announcement is also on the OBF news blog now: http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ ---------- Forwarded message ---------- From: Robert Buels Date: Tue, Apr 24, 2012 at 12:49 AM Subject: [Bioperl-l] Announcing OBF Google Summer of Code Accepted Students To: BioPerl List, BioJava List, BioRuby List, BioPython List, BioDAS List, BioLib List, BioSQL List Hello all, I'm very pleased and excited to announce that the Open Bioinformatics Foundation has selected 5 very capable students to work on OBF projects this summer as part of the Google Summer of Code program. The accepted students, their projects, and their mentors (in alphabetical order):

Wibowo Arindrarto
    SearchIO Implementation in Biopython
    mentored by Peter Cock

Lenna Peterson
    Diff My DNA: Development of a Genomic Variant Toolkit for Biopython
    mentored by Brad Chapman

Marjan Povolni
    The worlds fastest parallelized GFF3/GTF parser in D, and an
    interfacing biogem plugin for Ruby
    mentored by Pjotr Prins, Francesco Strozzi, Raoul Bonnal

Artem Tarasov
    Fast parallelized GFF3/GTF parser in C++, with Ruby FFI bindings
    mentored by Pjotr Prins, Francesco Strozzi, Raoul Bonnal

Clayton Wheeler
    Multiple Alignment Format parser for BioRuby
    mentored by Francesco Strozzi and Raoul Bonnal

As in every year, we received many great applications and ideas. However, funding and mentor resources are limited, and we were not able to accept as many as we would have liked.
Our deepest thanks to all the students who applied: we sincerely appreciate the time and effort you put into your applications, and hope you will still consider being a part of the OBF's open source projects, even without Google funding. I speak for myself and all of the mentors who read and scored applications when I say that we were truly honored by the number and quality of the applications we received. For the accepted students: congratulations! You have risen to the top of a very competitive application process. Now it's time to "put your money where your mouth is", as the saying goes. Let's get out there and write some great code this summer! Best regards, Rob ---- Robert Buels OBF GSoC 2012 Administrator _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From p.j.a.cock at googlemail.com Tue Apr 24 07:24:20 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Apr 2012 12:24:20 +0100 Subject: [GSoC] OBF GSoC students weekly progress reports Message-ID: Hello all, First, to echo Rob, congratulations to our selected students: http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ http://lists.open-bio.org/pipermail/gsoc/2012/000049.html Weekly Progress Reports: To encourage community bonding and awareness of what the GSoC 2012 students are doing, this year the OBF is being much clearer about our progress report expectations. We would like every student to set up a blog for the GSoC project (or a category/tag on your existing blog) which you will use to summarize your progress every week, as well as longer posts at the half way evaluation, and at the end of the summer.
In addition, after publishing each blog post, we expect you to email the URL and the text of the blog (or if important images or formatting would be lost, at least a short summary) to the host project's mailing list(s) (check with your mentors if the project has more than one) AND the gsoc at open-bio.org mailing list. You will be writing under your own name, but with a clear association with your mentors, the OBF and its projects, so please take this seriously and be professional. Remember this will become part of your online presence, and potentially looked at by future employers and colleagues. Please talk to your mentors about this during the "community bonding" stage of the GSoC code (i.e. the next few weeks before you actually start). Thank you, Peter (On behalf of the OBF GSoC mentors and projects) Note: As per Rob's earlier email, could both students and mentors please ensure you have subscribed to the public OBF GSoC email list at http://lists.open-bio.org/mailman/listinfo/gsoc (I have BCC'd you on this email just in case you haven't done this yet). Thanks! From arklenna at gmail.com Tue Apr 24 13:21:46 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 24 Apr 2012 13:21:46 -0400 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: Hi all, I'm very excited to be participating in GSoC '12 with Biopython! My development blog is on tumblr, which I chose primarily because it supports markdown syntax, which I'm used to from GitHub. Here's my gsoc12 tag: http://arklenna.tumblr.com/tagged/gsoc2012 However, Tumblr doesn't allow post comments. Will I need to switch to a blog platform that allows comments? 
Cheers, Lenna From p.j.a.cock at googlemail.com Tue Apr 24 13:55:51 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Apr 2012 18:55:51 +0100 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: On Tue, Apr 24, 2012 at 6:21 PM, Lenna Peterson wrote: > Hi all, > > > I'm very excited to be participating in GSoC '12 with Biopython! > > My development blog is on tumblr, which I chose primarily because it > supports markdown syntax, which I'm used to from GitHub. > > Here's my gsoc12 tag: http://arklenna.tumblr.com/tagged/gsoc2012 > > However, Tumblr doesn't allow post comments. Will I need to switch to a > blog platform that allows comments? > > Cheers, > > Lenna Hi Lenna, Great - you've got a blog already you're also the first student to reply :) Blog comments could be nice, but personally in your shoes I'd direct any discussion to the biopython(-dev) mailing list. e.g. 1. Post weekly update blog, get blog post URL 2. Send email with summary, including blog post URL 3. Goto mailing list archive, get archived email URL 4. Update blog post to link to email (and thus any thread from it, at least for that month). A little cumbersome, but it would save you moving your blog? I'd actually be happier with most discussion on the biopython-dev list rather than blog comments, or even github (which will still be useful for things like code reviews). This may be different for the other projects - I know BioRuby uses IRC much more for example, but even there they've tried to post archives of important IRC discussions to their mailing list too. Thank you! 
Peter From arklenna at gmail.com Tue Apr 24 14:41:25 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 24 Apr 2012 14:41:25 -0400 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: On Tue, Apr 24, 2012 at 1:55 PM, Peter Cock wrote: > > Hi Lenna, > > Great - you've got a blog already you're also the first student to reply :) > > Blog comments could be nice, but personally in your shoes I'd > direct any discussion to the biopython(-dev) mailing list. e.g. > > 1. Post weekly update blog, get blog post URL > 2. Send email with summary, including blog post URL > 3. Goto mailing list archive, get archived email URL > 4. Update blog post to link to email (and thus any thread from it, > at least for that month). > > A little cumbersome, but it would save you moving your blog? > > I'd actually be happier with most discussion on the biopython-dev > list rather than blog comments, or even github (which will still be > useful for things like code reviews). > > This may be different for the other projects - I know BioRuby > uses IRC much more for example, but even there they've tried > to post archives of important IRC discussions to their mailing > list too. > > Thank you! > > Peter Peter, If I get ambitious, I could write a Python script to retrieve the mailing list url and put it into my blog post! To clarify - for biopython, should the update emails go out to both the biopython and biopython-dev mailing lists, or just the latter? Lenna From w.arindrarto at gmail.com Tue Apr 24 15:01:23 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 24 Apr 2012 21:01:23 +0200 Subject: [GSoC] [Biopython] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: On Tue, Apr 24, 2012 at 19:55, Peter Cock wrote: > On Tue, Apr 24, 2012 at 6:21 PM, Lenna Peterson wrote: >> Hi all, >> >> >> I'm very excited to be participating in GSoC '12 with Biopython! 
>> >> My development blog is on tumblr, which I chose primarily because it >> supports markdown syntax, which I'm used to from GitHub. >> >> Here's my gsoc12 tag: http://arklenna.tumblr.com/tagged/gsoc2012 >> >> However, Tumblr doesn't allow post comments. Will I need to switch to a >> blog platform that allows comments? >> >> Cheers, >> >> Lenna > > Hi Lenna, > > Great - you've got a blog already you're also the first student to reply :) > > Blog comments could be nice, but personally in your shoes I'd > direct any discussion to the biopython(-dev) mailing list. e.g. > > 1. Post weekly update blog, get blog post URL > 2. Send email with summary, including blog post URL > 3. Goto mailing list archive, get archived email URL > 4. Update blog post to link to email (and thus any thread from it, > at least for that month). > > A little cumbersome, but it would save you moving your blog? > > I'd actually be happier with most discussion on the biopython-dev > list rather than blog comments, or even github (which will still be > useful for things like code reviews). > > This may be different for the other projects - I know BioRuby > uses IRC much more for example, but even there they've tried > to post archives of important IRC discussions to their mailing > list too. > > Thank you! > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Hi everyone, Wibowo Arindrarto here, but you can just call me Bow for short :). I'm very excited to be accepted into GSoC with OBF as well! I will be blogging on my site: http://bow.web.id/blog, and I've actually made my inaugural GSoC post just a few hours after I heard the news, here: http://bow.web.id/blog/2012/04/google-summer-of-code-is-on/. I'll be posting all GSoC-related posts under the `gsoc` tag, accessible through this URL: http://bow.web.id/blog/tag/gsoc/.
To follow Peter's suggestion, I'll post my weekly progress in this mailing list for everyone to see, too. cheers, Bow From rbuels at gmail.com Tue Apr 24 15:13:48 2012 From: rbuels at gmail.com (Robert Buels) Date: Tue, 24 Apr 2012 15:13:48 -0400 Subject: [GSoC] [Biopython] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: <4F96FB6C.3010805@gmail.com> Bow, make sure you subscribe to the OBF GSoC mailing list. http://lists.open-bio.org/mailman/listinfo/gsoc Rob On 04/24/2012 03:01 PM, Wibowo Arindrarto wrote: > On Tue, Apr 24, 2012 at 19:55, Peter Cock wrote: >> On Tue, Apr 24, 2012 at 6:21 PM, Lenna Peterson wrote: >>> Hi all, >>> >>> >>> I'm very excited to be participating in GSoC '12 with Biopython! >>> >>> My development blog is on tumblr, which I chose primarily because it >>> supports markdown syntax, which I'm used to from GitHub. >>> >>> Here's my gsoc12 tag: http://arklenna.tumblr.com/tagged/gsoc2012 >>> >>> However, Tumblr doesn't allow post comments. Will I need to switch to a >>> blog platform that allows comments? >>> >>> Cheers, >>> >>> Lenna >> >> Hi Lenna, >> >> Great - you've got a blog already you're also the first student to reply :) >> >> Blog comments could be nice, but personally in your shoes I'd >> direct any discussion to the biopython(-dev) mailing list. e.g. >> >> 1. Post weekly update blog, get blog post URL >> 2. Send email with summary, including blog post URL >> 3. Goto mailing list archive, get archived email URL >> 4. Update blog post to link to email (and thus any thread from it, >> at least for that month). >> >> A little cumbersome, but it would save you moving your blog? >> >> I'd actually be happier with most discussion on the biopython-dev >> list rather than blog comments, or even github (which will still be >> useful for things like code reviews). 
>> >> This may be different for the other projects - I know BioRuby >> uses IRC much more for example, but even there they've tried >> to post archives of important IRC discussions to their mailing >> list too. >> >> Thank you! >> >> Peter >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > Hi everyone, > > Wibowo Arindrarto here, but you can just call me Bow for short :). I'm > very excited to be accepted into GSoC with OBF as well! > > I will be blogging on my site: http://bow.web.id/blog, and I've > actually made my inaugural GSoC post just a few hours after I heard > the news, here: > http://bow.web.id/blog/2012/04/google-summer-of-code-is-on/. I'll be > posting all GSoC related post under the `gsoc` tag, accessible through > this URL: http://bow.web.id/blog/tag/gsoc/. To follow Peter's > suggestion, I'll post my weekly progress in this mailing list for > everyone to see, too. > > cheers, > Bow > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From marian.povolny at gmail.com Wed Apr 25 13:17:01 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Wed, 25 Apr 2012 19:17:01 +0200 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: Hi Peter, Another excited GSoC student here :) I think the idea with a blog for status updates a great idea, I would have done it probably even if it wasn't a requirement. I didn't have a blog before, so I created one at tumblr, and it should be possible for the visitors to leave comments too. But I do agree with you that the ML is a better place for discussions about our GSoC projects. Here is a link to my new blog: http://blog.mpthecoder.com/ GSoC related posts will be tagged with #gsoc ( http://blog.mpthecoder.com/tagged/gsoc). 
@Lenna Tumblr lets you use your Disqus account if you want to enable comments on your tumblr blog. However, not all themes support it. See the first q&a here for more info: http://www.tumblr.com/help It took me about 2 minutes to create an account on Disqus and link it to my blog. -- Marjan On Tue, Apr 24, 2012 at 1:24 PM, Peter Cock wrote: > Hello all, > > First, to echo Rob, congratulations to our selected students: > http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ > http://lists.open-bio.org/pipermail/gsoc/2012/000049.html > > Weekly Progress Reports: > > To encourage community bonding and awareness of what the > GSoC 2012 students are doing, this year the OBF is being much > clearer about our progress report expectations. > > We would like every student to setup a blog for the GSoC project > (or a category/tag on your existing blog) which you will use to > summarize your progress every week, as well as longer posts > at the half way evaluation, and at the end of the summer. > > In addition, after publishing each blog post, we expect you to > email the URL and the text of the blog (or if important images > or formatting would be lost, at least a short summary) to the > host project's mailing list(s) (check with your mentors if the > project has more than one) AND the gsoc at open-bio.org > mailing list. > > You will be writing under your own name, but with a clear > association with your mentors, the OBF and its projects, so > please take this seriously and be professional. Remember > this will become part of your online presence, and potentially > looked at by future employers and colleagues. > > Please talk to your mentors about this during the "community > bonding" stage of the GSoC code (i.e. the next few weeks > before you actually start). 
> > Thank you, > > Peter > > (On behalf of the OBF GSoC mentors and projects) > > Note: As per Rob's earlier email, could both students and mentors > please ensure you have subscribed to the public OBF GSoC email > list at http://lists.open-bio.org/mailman/listinfo/gsoc (I have BCC'd > you on this email just in case you haven't done this yet). Thanks! > From arklenna at gmail.com Wed Apr 25 20:16:11 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 25 Apr 2012 20:16:11 -0400 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: On Wed, Apr 25, 2012 at 1:17 PM, Marjan Povolni wrote: > > @Lenna > Tumblr lets you use your Disqus account if you want to enable comments on > your tumblr blog. However, not all themes support it. See the first q&a > here for more info: > > http://www.tumblr.com/help > > It took me about 2 minutes to create an account on Disqus and link it to my > blog. > > -- > Marjan > > Marjan - Thanks for the tip! I have disqus set up on my tumblr now. I also filed my enrollment and tax forms with Google. Now I'm busy in the thinking phase ;) Lenna From p.j.a.cock at googlemail.com Thu Apr 26 05:49:26 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 26 Apr 2012 10:49:26 +0100 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: On Wed, Apr 25, 2012 at 6:17 PM, Marjan Povolni wrote: > Hi Peter, > > Another excited GSoC student here :) > > I think the idea with a blog for status updates a great idea, I would have > done it probably even if it wasn't a requirement. I didn't have a blog > before, so I created one at tumblr, and it should be possible for the > visitors to leave comments too. But I do agree with you that the ML is a > better place for discussions about our GSoC projects. Here is a link to my > new blog: > > http://blog.mpthecoder.com/ > > GSoC related posts will be tagged with #gsoc > (http://blog.mpthecoder.com/tagged/gsoc). 
Excellent, I've added the three Blog links so far to this post: http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ I'll do another full post highlighting your blogs once all five are ready. Thanks, Peter From lomereiter at googlemail.com Thu Apr 26 07:43:07 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Thu, 26 Apr 2012 15:43:07 +0400 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: Hi all, I'm also very excited about being accepted :) > Excellent, I've added the three Blog links so far to this post: > http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ > > I'll do another full post highlighting your blogs once all five > are ready. > My blog posts will be at http://lomereiter.wordpress.com/tag/gsoc, I'll update it at least every week during the coding period. From rbuels at gmail.com Thu Apr 26 11:05:01 2012 From: rbuels at gmail.com (Robert Buels) Date: Thu, 26 Apr 2012 11:05:01 -0400 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: <4F99641D.3070908@gmail.com> Thanks for handling the blog links Peter! The wiki page has them now too. http://www.open-bio.org/wiki/Google_Summer_of_Code#About_Google_Summer_of_Code Artem and Clayton: please update that wiki page to link to your progress blogs and notify Peter so he can put the link on the OBF blog. Rob On 04/26/2012 05:49 AM, Peter Cock wrote: > On Wed, Apr 25, 2012 at 6:17 PM, Marjan Povolni > wrote: >> Hi Peter, >> >> Another excited GSoC student here :) >> >> I think the idea with a blog for status updates a great idea, I would have >> done it probably even if it wasn't a requirement. I didn't have a blog >> before, so I created one at tumblr, and it should be possible for the >> visitors to leave comments too. But I do agree with you that the ML is a >> better place for discussions about our GSoC projects. 
Here is a link to my >> new blog: >> >> http://blog.mpthecoder.com/ >> >> GSoC related posts will be tagged with #gsoc >> (http://blog.mpthecoder.com/tagged/gsoc). > > Excellent, I've added the three Blog links so far to this post: > http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ > > I'll do another full post highlighting your blogs once all five > are ready. > > Thanks, > > Peter > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc > From marian.povolny at gmail.com Sat May 5 09:07:30 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Sat, 5 May 2012 15:07:30 +0200 Subject: [GSoC] GSoC weekly status report No.1 Message-ID: Hello all, It might be a little early, but there has been so much going on in the last 10 days since the results of GSoC were published... http://blog.mpthecoder.com/post/22380853664/gsoc-weekly-status-report-no-1 A short summary: It has been 10 days since the GSoC results were published, and a lot has happened since then. I got to know the other students and mentors in a longish meeting on Google hangout, I got into a discussion with my mentor on IRC in which we didn't agree about the parallelization strategy for the parser (experiments will show who's right) and my inbox is full with mails from my mentor and other students, in which we exchanged loads of interesting ideas. Also, I solved a bug in biogems.info website, which was stopping Pjotr from updating the website with new information about biogems. There is now a GitHub repository for my project: https://github.com/mamarjan/bioruby-hpc-gff3 The work for the first week of coding is halfway done too. There seems to be huge interest for a GFF3 parser with more features, like indexing, random access and writing output, and also support for linking into trees of features that are not located close to each other in the file.
A fast sequential parser could be used to generate indexes, and the lower-level parts can be used to reorder the file for faster future usage. Based on that, I think this project is a good start. *I would like to ask you, if you're using the GFF3/GTF file formats in your research, to send me example files and descriptions of how your applications are using the data. This way I'll be able to test the parser against your files and optimize it for your applications. Currently I have GFF files from Ensembl and Wormbase, and Pjotr pointed me to the genome browser web application at wormbase.org.* -- Marjan From rbuels at gmail.com Sun May 6 11:00:07 2012 From: rbuels at gmail.com (Robert Buels) Date: Sun, 06 May 2012 11:00:07 -0400 Subject: [GSoC] GSoC weekly status report No.1 In-Reply-To: References: Message-ID: <4FA691F7.9030905@gmail.com> Hi Marjan, You should probably incorporate into your test suite all of the test gff3 files in the test data directory of the Perl Bio::GFF3::LowLevel::Parser. It has coverage for some corner cases that are a little bit tricky. https://github.com/solgenomics/bio-gff3/tree/master/t/data Rob On 05/05/2012 09:07 AM, Marjan Povolni wrote: > Hello all, > > It might be a little early, but there has been so much going on in the last > 10 days since the results of GSoC were published... > > http://blog.mpthecoder.com/post/22380853664/gsoc-weekly-status-report-no-1 > > A short summary: > > It has been 10 days since the GSoC results were published, and a lot has > happened since then. I got to know the other students and mentors in a > longish meeting on Google hangout, I got into a discussion with my mentor > on IRC in which we didn't agree about the parallelization strategy for the > parser (experiments will show who's right) and my inbox is full with mails > from my mentor and other students, in which we exchanged loads of > interesting ideas.
Also, I solved a bug in biogems.info website, which was > stopping Pjotr from updating the website with new information about biogems. > > There is now a GitHub repository for my project: > > https://github.com/mamarjan/bioruby-hpc-gff3 > > The work for the first week of coding is halfway done too. > > There seems to be huge interest for a GFF3 parser with more features, like > indexing, random access and writing output, and also support for linking > into trees of features that are not located close to each other in the > file. A fast sequential parser could be used to generate indexes, and the > lower-level parts can be used to reorder the file for faster future usage. > Based on that, I think this project is a good start. > > *I would like to ask you if you're using the GFF3/GTF file formats in your > research, to send me example files and descriptions of how are your > applications using the data. This way I'll be able to test the parser > against your files and optimize it for your applications. Currently I have > GFF files from Ensembl and Wormbase, and Pjotr pointed me to the genome > browser web application at wormbase.org.* > > -- > Marjan > > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc > From lomereiter at googlemail.com Sun May 6 15:56:50 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Sun, 6 May 2012 23:56:50 +0400 Subject: [GSoC] [BAM] Weekly report No. 0 Message-ID: Hi all, I wrote a few words about what I've done last week: http://lomereiter.wordpress.com/2012/05/06/gsoc-weekly-report-0/ Summary: The code is available on GitHub: https://github.com/lomereiter/BAMread/ I have already started to write the code planned for the first week so as to have more time in June for exam preparation. Opening a BAM file and parsing the SAM header both work and are available from Ruby; now I need to write some tests and documentation.
Also, I described some compile-time metaprogramming tricks in D which I use to reduce duplication in the code. I'd be grateful for some small BAM files, 1-50 kilobytes in size, with non-empty headers, for testing purposes. -- Artem From marian.povolny at gmail.com Sun May 6 16:22:01 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Sun, 6 May 2012 22:22:01 +0200 Subject: [GSoC] GSoC weekly status report No.1 In-Reply-To: <4FA691F7.9030905@gmail.com> References: <4FA691F7.9030905@gmail.com> Message-ID: Thanks for the tip, that's a great idea! -- Marjan On Sun, May 6, 2012 at 5:00 PM, Robert Buels wrote: > Hi Marjan, > > You should probably incorporate into your test suite all of the test gff3 > files in the test data directory of the Perl Bio::GFF3::LowLevel::Parser. > It has coverage for some corner cases that are a little bit tricky. > > https://github.com/solgenomics/bio-gff3/tree/master/t/data > > Rob > > > On 05/05/2012 09:07 AM, Marjan Povolni wrote: > >> Hello all, >> >> It might be a little early, but there has been so much going on in the >> last 10 days since the results of GSoC were published... >> >> http://blog.mpthecoder.com/post/22380853664/gsoc-weekly-status-report-no-1 >> >> A short summary: >> >> It has been 10 days since the GSoC results were published, and a lot has >> happened since then. I got to know the other students and mentors in a >> longish meeting on Google hangout, I got into a discussion with my mentor >> on IRC in which we didn't agree about the parallelization strategy for the >> parser (experiments will show who's right) and my inbox is full with mails >> from my mentor and other students, in which we exchanged loads of >> interesting ideas. Also, I solved a bug in biogems.info website, which was >> stopping Pjotr from updating the website with new information about >> biogems.
>> >> There is now a GitHub repository for my project: >> >> https://github.com/mamarjan/bioruby-hpc-gff3 >> >> The work for the first week of coding is halfway done too. >> >> There seems to be huge interest for a GFF3 parser with more features, like >> indexing, random access and writing output, and also support for linking >> into trees of features that are not located close to each other in the >> file. A fast sequential parser could be used to generate indexes, and the >> lower-level parts can be used to reorder the file for faster future usage. >> Based on that, I think this project is a good start. >> >> *I would like to ask you if you're using the GFF3/GTF file formats in your >> research, to send me example files and descriptions of how are your >> applications using the data. This way I'll be able to test the parser >> against your files and optimize it for your applications. Currently I have >> GFF files from Ensembl and Wormbase, and Pjotr pointed me to the genome >> browser web application at wormbase.org.* >> >> -- >> Marjan >> >> _______________________________________________ >> GSoC mailing list >> GSoC at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/gsoc >> >> From arklenna at gmail.com Sun May 6 17:26:30 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Sun, 6 May 2012 17:26:30 -0400 Subject: [GSoC] GSoC python variant update Message-ID: Hi all, I've written a few new posts on my blog; here's the latest: http://arklenna.tumblr.com/post/22542372076/spot-isa-dog I will attach a UML diagram and include the part of the post addressing the diagram. Click through to the full post for a bonus Einstein quote! ------- My main goals include, but are not limited to: * Make the structure parser- and file-format-agnostic: an abstracted OO design should allow anything to be slotted in (for example, Marjan's C GFF parser?)
* Maintain encapsulation: limit how much each object can see of objects above and below it * Allow extension at multiple levels: some existing parsers may process data in different ways; this structure should allow handling both raw data and data in various formats. The `Variant` object's constructor allows an end user to change the default parsers. Practical implementation details of `parse()` and `write()` will need to be finessed - for example, ways to help the user sift through immense quantities of data. I'm still in the process of comparing the data contained in VCF/GVF files as well as the APIs of PyVCF and BCBio.GFF. `Parser` and `Writer` are both abstract classes that will define all methods found in known parsers/writers with `NotImplementedError`s. I'm speculating on whether a Variant-specific exception would be useful, but a custom message should suffice. Continuing down the diagram, `PyVCFWrapper` and `BCBioGFFWrapper` would each inherit from both `Parser` and `Writer`. As the name implies, they would serve as the adapter between the generic `Variant` and the specific parser. I anticipate that this structure could easily be extended to allow intermediate storage in DBs as well as innumerable sorting/comparing/filtering methods inside `Variant`. ------- I would appreciate any and all feedback about the overall structure. Namespace is definitely flexible. I'd also appreciate any specific genomic variant workflows, and if somebody can point me to smallish sample files of the same data in both VCF and GVF, I'd be eternally grateful. Regards, Lenna -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Variant_UML.png Type: image/png Size: 23313 bytes Desc: not available URL: From chapmanb at 50mail.com Mon May 7 20:24:39 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 07 May 2012 20:24:39 -0400 Subject: [GSoC] GSoC python variant update In-Reply-To: References: Message-ID: <87mx5jfrjs.fsf@fastmail.fm> Lenna; This all looks great for a top level overview of the classes. This should give you sufficient flexibility to work on the different file types. Another approach is to avoid some of the inheritance and have parse/write dispatch to VCF or GFF specific classes based on the filetype:

    if filetype == "vcf":
        variant_handler = PyVCFVariants()
    elif filetype == "gvf":
        variant_handler = GVFVariants()
    variant_handler.parse(*args)

Avoiding layers can be nice to simplify the architecture, as long as it gives you the flexibility you need. My suggestion for digging more in the API design would be to start playing with some VCF files and getting comfortable with the data they have and where it would go in Biopython objects. VCF is much more widely used than GVF so it's a good practical place to start. Thanks for all this work and best of luck on finals, Brad > Hi all, > > I've written a few new posts on my blog; here's the latest: > > http://arklenna.tumblr.com/post/22542372076/spot-isa-dog > > I will attach a UML diagram and include the part of the post > addressing the diagram. Click through to the full post for a bonus > Einstein quote! > > ------- > > My main goals are not limited to: > > * Make the structure parser and file-format agnostic: an abstracted > OO design should allow anything to be slotted in (for example, > Marjan's C GFF parser?) > * Maintain encapsulation: limit how much each object can see of > objects above and below it > * Allow extension at multiple levels: some existing parsers may > process data in different ways; this structure should allow handling > both raw data and data in various formats.
> > The `Variant` object's constructor allows an end user to change the > default parsers. Practical implementation details of `parse()` and > `write()` will need to be finessed - for example, ways to help the > user sift through immense quantities of data. I'm still in the process > of comparing the data contained in VCF/GVF files as well as the APIs > of PyVCF and BCBio.GFF. > > `Parser` and `Writer` are both abstract classes that will define all > methods found in known parsers/writers with `NotImplementedError`s. > I'm speculating on whether a Variant-specific exception would be > useful, but a custom message should suffice. > > Continuing down the diagram, `PyVCFWrapper` and `BCBioGFFWrapper` > would each inherit from both `Parser` and `Writer`. As the name > implies, they would serve as the adapter between the generic `Variant` > and the specific parser. > > I anticipate that this structure could easily be extended to allow > intermediate storage in DBs as well as innumerable > sorting/comparing/filtering methods inside `Variant`. > > ------- > > I would appreciate any and all feedback about the overall structure. > Namespace is definitely flexible. I'd also appreciate any specific > genomic variant workflows, and if somebody can point me to smallish > sample files of the same data in both VCF and GVF, I'd be eternally > grateful. > > Regards, > > Lenna Attachment: Variant_UML.png (image/png) > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From casbon at gmail.com Tue May 8 04:57:57 2012 From: casbon at gmail.com (James Casbon) Date: Tue, 8 May 2012 09:57:57 +0100 Subject: [GSoC] GSoC python variant update In-Reply-To: <87mx5jfrjs.fsf@fastmail.fm> References: <87mx5jfrjs.fsf@fastmail.fm> Message-ID: On 8 May 2012 01:24, Brad Chapman wrote: > > Lenna; > This all looks great for a top level overview of the classes. 
This > should give you sufficient flexibility to work on the different file > types. Another approach is to avoid some of the inheritance and have > parse/write dispatch to VCF or GFF specific classes based on the > filetype: > > if filetype == "vcf": >     variant_handler = PyVCFVariants() > elif filetype == "gvf": >     variant_handler = GVFVariants() > variant_handler.parse(*args) > > Avoiding layers can be nice to simplify the architecture, as long as it > gives you the flexibility you need. Hi Lenna, This looks like a good start, but I would agree with Brad that layers of inheritance aren't always the best way to proceed with Python. Specific feedback: why does the Variant have parse/write methods when you state that you will use adaptation from the general variation class to the actual parser? I'm also slightly worried this could be pretty slow when dealing with the volume of data you get from a VCF file. As for the points in your blog post... I have plenty of data; do we know any SNP callers capable of creating GVF files? If so, I can give you both formats. The simplest variant workflows would be to filter and then score on some metric. The filter step removes noise, so a quality threshold is the simplest one. The metric used depends on the experimental setup. For case/control, a Fisher's exact test is quite easy, or for a single population a Hardy-Weinberg equilibrium (HWE) test is fairly simple. Hope this helps, -- James http://casbon.me/ From pjotr2010 at thebird.nl Tue May 8 07:40:43 2012 From: pjotr2010 at thebird.nl (Pjotr Prins) Date: Tue, 8 May 2012 13:40:43 +0200 Subject: [GSoC] GSoC python variant update In-Reply-To: References: <87mx5jfrjs.fsf@fastmail.fm> Message-ID: <20120508114043.GC14359@thebird.nl> On Tue, May 08, 2012 at 09:57:57AM +0100, James Casbon wrote: > > Avoiding layers can be nice to simplify the architecture, as long as it > > gives you the flexibility you need. This is actually a pattern. See 'Using Mixin Technology to Improve Modularity', for example.
http://www.cs.utexas.edu/~lin/papers/aop03.pdf Pj. From w.arindrarto at gmail.com Wed May 9 12:24:43 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 9 May 2012 18:24:43 +0200 Subject: [GSoC] GSoC Project Update -- 1 Message-ID: Hi everyone, I just posted my latest blog update here: http://bow.web.id/blog/2012/05/warming-up-for-the-coding-period/ To summarize, I've spent most of my time getting to know the programs I will support better. This has been done by: 1. Playing around with the programs to see how many different outputs I can generate. 2. Writing scripts to automate test case generation for each of the programs. 3. Writing wrappers (for programs not yet wrapped by Biopython: FASTA, HMMER, and BLAT) to ease writing the test case generators. 4. Continuing to complete my proposed SearchIO object naming scheme (http://bit.ly/searchio-terms) The test cases, their generators, and the wrappers I've written are available in my non-Biopython gsoc repo here: http://github.com/bow/gsoc/. Additionally, I've used the generated test cases to improve a recent bug report and submitted a fix for the next release. For the coming weeks prior to the start of coding, I'm planning to play around more with XML and SQLite as I will use them in the code. I might start to add more skeleton code to my current development branch as well (https://github.com/bow/biopython). cheers, Bow From arklenna at gmail.com Wed May 9 20:16:18 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 9 May 2012 20:16:18 -0400 Subject: [GSoC] GSoC python variant update In-Reply-To: <20120508114043.GC14359@thebird.nl> References: <87mx5jfrjs.fsf@fastmail.fm> <20120508114043.GC14359@thebird.nl> Message-ID: I think my UML diagram may need a legend, or perhaps it should just be abandoned. I've written some skeleton code to try to avoid confusion about the pesky OO terms that have slightly different meanings for every language.
https://gist.github.com/2649676 Regarding concerns about inheritance: I think the UML diagram implies 3 levels of inheritance. The only inheritance I intended was from abstract interfaces like Parser or Writer, that only contain non-implemented methods. Because I can't guarantee that all future parsers will have common attribute and method names, the only solution I can see is to write an interface and inherit from that to make wrappers for each parser. Thank you to Eric for this link: (https://en.wikipedia.org/wiki/Fragile_base_class). The page states that the best way to avoid problems is to use an interface. Also thank you to Pjotr for the article about mixins (http://www.cs.utexas.edu/~lin/papers/aop03.pdf). I believe I'm using inheritance in a safe and helpful manner. James, I hope my clarification and skeleton code answer any questions you have about the implementation. Brad, I am using if statements to determine which parser to use, but I am still calling wrappers that inherit from an interface. Eric, I looked at the structure of PDBParser. Is the idea that a user might pass in an instance of StructureBuilder that already contained some structure and add to it? Or is there another purpose that isn't jumping out at me? In my skeleton code, I used the example of StructureBuilder, but I'm not sure if there's an advantage to passing the object rather than the object's name. And finally, Brad and James, I will do my best to get more conversant with VCF etc. If I'm not a user, I can't be a capable developer. Looking forward to any more structural feedback! Cheers, Lenna From eric.talevich at gmail.com Thu May 10 09:36:49 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 10 May 2012 09:36:49 -0400 Subject: [GSoC] GSoC python variant update In-Reply-To: References: <87mx5jfrjs.fsf@fastmail.fm> <20120508114043.GC14359@thebird.nl> Message-ID: On Wed, May 9, 2012 at 8:16 PM, Lenna Peterson wrote: > I looked at the structure of PDBParser. 
Is the idea that a user might pass > in an instance of StructureBuilder that already contained some structure > and add to it? Or is there another purpose that isn't jumping out at me? In > my skeleton code, I used the example of StructureBuilder, but I'm not sure > if there's an advantage to passing the object rather than the object's name. > > My understanding of the producer/consumer design in Bio.PDB (I didn't write it) is that the logic for parsing the given file format is contained in the *Parser class, and the logic for building the target object is in the *Builder class. This is useful if the target object is somewhat complex to build, as is the case with PDB's Structure/Model/Chain/Residue/Atom hierarchy -- the parser just passes raw values along to the appropriate method on the StructureBuilder class. (The Internet also points out that this design is super useful if "producing" and "consuming" are asynchronous, which is not the case here... yet?) Regarding the shared interface, I think we've generally achieved this throughout most of Biopython by just remembering to implement the required methods on each parser and writer class -- just "parse" and "write", usually. Essentially, it's your design minus the common base class that enforces the interface; an error in the implementation would result in an AttributeError rather than a NotImplementedError. This works because (1) Python uses duck typing, unlike C++ and Java; (2) in Biopython, each file format is usually implemented by one dedicated person who can keep it all in their head, and we don't add new file formats very rapidly; (3) we maintain pretty good coverage with our unit tests, and certainly add unit tests for new parsers. Given all that, I think your design is superior, and it's quite clear how it all works from the way you've written it. 
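To make the contrast concrete, here is a minimal sketch of the two approaches. All class and function names in it (`Parser`, `VCFWrapper`, `GVFWrapper`, `DuckGVF`, `failure_mode`) are hypothetical and chosen for illustration; they are not taken from Lenna's gist, PyVCF, or Biopython, and the tab-splitting "parser" is only a stand-in for real parsing logic.

```python
class Parser:
    """Interface-style base class: methods raise until a wrapper overrides them."""

    def parse(self, lines):
        raise NotImplementedError(
            "%s does not implement parse()" % type(self).__name__)


class VCFWrapper(Parser):
    """Hypothetical wrapper that fulfils the interface."""

    def parse(self, lines):
        # Stand-in logic: skip header lines, split data lines on tabs.
        return [line.rstrip("\n").split("\t")
                for line in lines if not line.startswith("#")]


class GVFWrapper(Parser):
    """Hypothetical wrapper whose author forgot to write parse()."""


class DuckGVF:
    """Duck-typed class with no base class; it also lacks parse()."""


def failure_mode(parser, lines):
    """Return the name of the exception a parser raises, or None on success."""
    try:
        parser.parse(lines)
    except (NotImplementedError, AttributeError) as err:
        return type(err).__name__
    return None


lines = ["##fileformat=VCFv4.1\n", "chr1\t100\t.\tA\tG\n"]
records = VCFWrapper().parse(lines)
```

With the base class, the forgotten method surfaces as a `NotImplementedError` carrying a message that names the offending wrapper; with plain duck typing, the same mistake shows up as an `AttributeError` at the call site. Either way the error only appears when `parse()` is actually called, which is why unit test coverage matters in both designs.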
As for the difference between passing an instance of the *Builder object versus a reference to the *Builder class (did I get that right?), it requires slightly less code from the user to pass a reference to the class. Also, if you set an instance (rather than the class) as a default argument, remember that objects are mutable, so you risk hitting one of Python's most infamous gotchas: default arguments are evaluated only once, when the function is defined, so the second time you use the parser, you'll be adding to the original object instead of starting with a fresh copy. Cheers, Eric From marian.povolny at gmail.com Sat May 12 15:46:46 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Sat, 12 May 2012 21:46:46 +0200 Subject: [GSoC] GSoC weekly status report No.1.1 Message-ID: Hi all, Here is my status report for this week: This year we the GSoC students sure are a very creative group, just look at our numbering schemes for our status reports for the pre-coding period - everyone has his own thing going :) And now back to the GFF3 project. I found a few more sites with big GFF3 files; those will be great for performance testing. And Robert Buels suggested that I should reuse the test suite from Perl's Bio::GFF3::LowLevel::Parser, and I think that's a great idea. I should definitely use that for completeness testing and I will check the test suites of other GFF3 parsers. I have also finished the work for the first week. That means basically I'm already more than two weeks ahead of schedule. The parser is now reading data on the D side and forwarding that to Ruby line by line. That won't be faster than reading the file from Ruby, but that's a nice basic case to get data flowing from D to Ruby. The rake tasks have been improved too. There are now two tasks for building the D library, "compile" and "compiledebug", and there is the "spec" task for running rspec tests and the "features" task for running cucumber tests. The "clean" task now deletes object and library files.
There is also a problem with the D library and garbage collector. It seems this is the problem Iain Buclaw (one of the GDC developers) has warned us about. When using a D shared library, when the GC kicks in for the first time, it looks as if it collects all the static data, for example the per-module variables. And pretty much everything else too: even when we register with the GC a chunk of memory allocated with malloc, it still gets collected. Or at least that's what it looks like. However, Iain also assured us that this will be solved by the end of this month/beginning of the next. My cucumber and rspec tests still work because they don't require enough memory for the GC to run, but to be sure that this issue doesn't interfere with development at this point, I manually disabled the GC on library initialization. I haven't tried it yet, but from what has been discussed in the forums, both 32 and 64-bit DLLs on Windows built using DMD work fine. I also helped Pjotr with getting our blog posts included in the RSS feed on biogems.info. That's all for now, you can find this report on my blog too: http://blog.mpthecoder.com/post/22919943701/gsoc-weekly-status-report-no-1-1 -- Best regards, Marjan From lomereiter at googlemail.com Sun May 13 16:10:45 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 14 May 2012 00:10:45 +0400 Subject: [GSoC] Weekly report No 0.5 Message-ID: Hi all, this is yet another GSoC report. During the last week, I mainly concentrated on the D part of the project, adding functionality to it. I implemented parsing of the whole BAM file :) Today I wrote a simple utility in D, which uses my library to convert BAM to SAM. It doesn't work with array tags yet, and is not as fast as samtools, but nevertheless... On a couple of BAM files from the test/data directory (namely, bins.bam and ex1_header.bam) the output is identical to that of samtools view (I checked with diff), and that kinda proves that everything works fine.
Speed issues are mainly due to using std.variant module for storing tags. It uses runtime reflection which is quite slow. Maybe, there're some other reasons. Anyway, I'm going to write my own tagged union type next week, it should improve the performance quite a bit, and also fix design flaws. For testing tag parsing, I used file tags.bam provided to me by Peter Cock. It contains tests for all types of tags, and my library successfully passes them. Later I'll experiment with possible speed improvements, and having unit tests covering full range of possible tag types is a must. Also, I downloaded and compiled gdc from trunk. It provides decent performance, not worse than dmd, at least. We expect gdc to gain shared library support in the next two months. Before that happens, we have to use dmd, although there're some issues with its garbage collector, causing segfaults. We discussed that with Marjan and Pjotr and decided that the best option in such circumstances would be to disable GC during development - testing library on small files won't consume much memory anyway. Another thing I downloaded and compiled, is Rubinius. I'm going to investigate why it hangs on BioRuby unittests in 1.9 mode. Another mode, 1.8, seems to work fine except maybe some very minor bugs. During next week, I'm going to learn how to use Cucumber and Rspec, improve D library performance a little, and start to write Ruby bindings. So it will be mostly "Ruby week" ;) -- Artem From cswh at umich.edu Mon May 14 23:36:17 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Mon, 14 May 2012 23:36:17 -0400 Subject: [GSoC] GSoC week 1 status report Message-ID: <2D9F6030-8A11-4443-B610-58464F506EE5@umich.edu> Hi all, I've put my first GSoC status report on my project blog: http://csw.github.com/bioruby-maf/blog/2012/05/13/progress/ (The web version of this has 100% more hyperlinks, but here's a plain text version, too.)
This has been my first half-week of work on my Google Summer of Code project, and it's off to an exciting start. The first order of business has been to get my development environment together; since I've been a microbiology student instead of a programmer for the last year, it's taken some work. In that process, I've ended up making a few open source contributions just to get my tools working the way I want. I'm running GNU Emacs 24 and trying to take more advantage of it than I have in the past. I'll have much more to say about this in a future post. I've also started working on the BioRuby unit test failures under JRuby, as a way of familiarizing myself with the BioRuby code base as well as the community and its development processes. Right now, JRuby in 1.8 mode is showing 6 failures and 126 errors, which is hardly confidence-inspiring for people considering using JRuby with BioRuby. This is too bad, since JRuby has some definite advantages as a Ruby implementation. After looking into these failures, I've broken them down into a few categories: * temporary file permissions problems, likely due to some sort of Travis-CI environment issue * a bug in JRuby's implementation of Open3.popen3 which I'm working up a bug report for * an odd autoload problem I've filed JRUBY-6658 for and sent an accompanying RubySpec patch for * a problem with libxml-jruby, which appears unmaintained, for which I've submitted a BioRuby patch plus JRUBY-6662 * and a small test case bug relating to floating point handling, which I've submitted a patch for. Once these are resolved, JRuby should be passing the BioRuby unit tests in 1.8 mode, and closer to passing in 1.9 mode. (There are a few extra failures under 1.9 that I haven't sorted through yet.) I've also gotten a start on my project itself, creating the bioruby-maf Github repository with a project skeleton and writing my first Cucumber feature for it. This is, in fact, my first Cucumber feature ever.
However, I did spend a few cross-country flights reading the RSpec and Cucumber books last week; between that and cribbing from Pjotr's code I feel like I have some idea what I'm doing. Just assembling that feature has been useful, too, since I've had to get several of the existing MAF tools running on my machine. In fact, my test MAF data and the FASTA version of it are courtesy of bx-python, which will be my reference implementation in many respects. Clayton Wheeler cswh at umich.edu From w.arindrarto at gmail.com Wed May 16 15:36:28 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 16 May 2012 21:36:28 +0200 Subject: [GSoC] GSoC Project Update -- 2 Message-ID: Hi everyone, I just posted my latest GSoC blog update here: http://bow.web.id/blog/2012/05/the-final-preparations/ To summarize, I spent the last week playing with XML and SQLite, and in extension SeqIO's index and index_db. I didn't write as much as real code the week before (mostly on online tutorials). Additionally, I started writing some of the SearchIO main methods, improved the test case generation time, and added more entries to the SearchIO terms table (http://bit.ly/searchio-terms). Finally, from this day onwards, I'm starting coding for the actual SearchIO implementation. The weekly plan will follow my proposed timeline (http://bit.ly/searchio-proposal) and I'll be writing mostly on my main SearchIO branch (https://github.com/bow/biopython/tree/searchio/Bio/SearchIO). cheers, Bow P.S. I also updated my blog last week so that the GSoC entries can be tracked through its own feed.
The feed is available here: http://bow.web.id/feed/atom-gsoc.xml From arklenna at gmail.com Wed May 16 16:01:30 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 16 May 2012 16:01:30 -0400 Subject: [GSoC] GSoC python variant update 2 Message-ID: Hi all, Latest blog post here: http://arklenna.tumblr.com/post/23178684555/week-2 Brief summary of this post: I don't think `SeqFeature` or an extension thereof would be appropriate for storing Variant data; therefore, I intend to make a new structure based on `_Record` and `_Call` in PyVCF. I'm not sure if this structure should be associated with `Seq`, i.e. by naming it `SeqVariant`, and would like feedback on this question. It could be very difficult to make PyVCF compatible with Python 2.5. Therefore, I am planning to write my project to be compatible with Python 2.6 and delaying its inclusion in the main Biopython branch until a future 2.6+ Biopython release. Alternate suggestions are welcome. This week I will solidify the structure so I am ready for the end of the community bonding period and the start of coding on May 21. Regards, Lenna From chapmanb at 50mail.com Wed May 16 20:19:01 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 16 May 2012 20:19:01 -0400 Subject: [GSoC] [Biopython-dev] GSoC python variant update 2 In-Reply-To: References: Message-ID: <871umju0ay.fsf@fastmail.fm> Lenna; Thanks for the update on your thinking. Sounds like you are right on track. > I don't think `SeqFeature` or an extension thereof would be > appropriate for storing Variant data; therefore, I intend to make a > new structure based on `_Record` and `_Call` in PyVCF. I'm not sure if > this structure should be associated with `Seq`, i.e. by naming it > `SeqVariant`, and would like feedback on this question. I'm agreed about SeqFeature. Would you consider using _Record/_Call directly? Then you could provide functionality to convert this to/from basic SeqFeatures if needed. 
An advantage of using these structures explicitly is that you could plug in compatible APIs, like Aaron Quinlan's CyVCF: https://github.com/arq5x/cyvcf I don't think we should add a new representation class unless we explicitly need to store additional information. > It could be very difficult to make PyVCF compatible with Python > 2.5. Therefore, I am planning to write my project to be compatible > with Python 2.6 and delaying its inclusion in the main Biopython > branch until a future 2.6+ Biopython release. Alternate suggestions > are welcome. I'm agreed with this. I don't think 2.5 is as entrenched as 2.4 was, so I think we could move on a deprecation path for it. It's more important to be forward compatible with 3.x and 2.6+ should make that easier. Thanks again for sharing all your thoughts and digging into this, Brad From rbuels at gmail.com Thu May 17 08:52:54 2012 From: rbuels at gmail.com (Robert Buels) Date: Thu, 17 May 2012 08:52:54 -0400 Subject: [GSoC] students: upcoming dates Message-ID: <4FB4F4A6.8050802@gmail.com> Hi students, There are a couple of important dates coming up soon, don't forget! May 18 (TOMORROW!): deadline to submit tax forms and proof of enrollment. Do you want to get paid? May 21: start of the formal coding period Keep up the good work, I'm very happy to have you working with us. :-) Rob -- Robert Buels OBF GSoC 2012 Org. Administrator From marian.povolny at gmail.com Mon May 21 05:36:01 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Mon, 21 May 2012 11:36:01 +0200 Subject: [GSoC] GSoC weekly status report No.1.2 Message-ID: http://blog.mpthecoder.com/post/23473020471/gsoc-weekly-status-report-no-1-2 It's been three months since my first introduction on the BioRuby ML and it's been great. As it is the end of the GSoC community bonding period, I would like to thank Pjotr most and then all the other community members for their help and support.
It's a great feeling to become a member of a small but growing community of enthusiasts that work together for the better of all of us and for fun. As Pjotr already did, I would like to encourage you to write blog posts about using Ruby in Bioinformatics and let us include them in our RSS and news feeds on the biogems.info website. The site supports both RSS and Atom feeds now, and a similar functionality will be part of the new website for BioRuby once it's finished. The code also supports adding only posts for one category/tag, so you can tag your posts with BioRuby or similar, and only those posts will be included in the RSS feed on biogems.info. The GSoC coding period starts today. It's time for me to roll my sleeves up and start working on the GFF3 parser full-time. -- Marjan From lomereiter at googlemail.com Mon May 21 07:58:46 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 21 May 2012 15:58:46 +0400 Subject: [GSoC] Weekly report #1 Message-ID: Hi all, here's my report about the past week: http://lomereiter.wordpress.com/2012/05/21/gsoc-weekly-report-1/ Brief summary: 1) BioRuby unit tests and Rubinius bugs - I posted 2 issues in Rubinius bugtracker, and one of them is already solved. Rubinius in 1.8 mode should now pass all tests. The situation with 1.9 mode is not that great, but I'm working on it. 2) I started to collect D optimization tricks on github wiki page. Currently, it contains just 6 tips, but this number is going to grow. Probably, another page will be created soon to keep best practices of connecting Ruby and D. Since my project and Marjan's one have a lot in common, I think it's important for us to not waste time on something that has already been investigated. 3) During the week, I learned a bit about BDD and Cucumber, enjoyed it, and wrote my first two features. 4) Measurements of object instantiation time in Ruby suggest that exposing low-level D functions via FFI makes little sense.
I'm going to discuss with mentors which high-level functions should be available, and make that into Cucumber features. -- Artem From cswh at umich.edu Mon May 21 11:50:18 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Mon, 21 May 2012 11:50:18 -0400 Subject: [GSoC] GSoC week 2 status report Message-ID: <0D2AC678-1DD1-40B9-B100-EDA3429B3D87@umich.edu> Hi all, Here's my report on last week's work: http://csw.github.com/bioruby-maf/blog/2012/05/21/week_2_progress/ This was my second week of work on my GSoC project, and the last week of the "community bonding" period before the official start of coding. A major focus of mine was BioRuby's phyloXML support; it uses libxml, which has been causing unit test failures under JRuby. In the end, the best course of action seemed to separate the phyloXML support as a separate plugin, which I have done as the bio-phyloxml gem. This will remove BioRuby's dependency on XML libraries entirely and that JRuby issue along with it. At the same time, users of the phyloXML code should be able to continue using it with no substantive changes. Separately, I began porting this phyloXML code to use Nokogiri instead of libxml-ruby, but ran into difficulties with this effort. While it is possible, and the library APIs are very similar, the code uses relatively low-level XML processing APIs in ways that seem to be sensitive to subtle differences in text node and namespace semantics between the two libraries. Substantial restructuring of the code and the addition of quite a few unit tests might be necessary to carry out such a port with confidence that the resulting code would work well. Also, someone else submitted a JRuby patch for JRUBY-6658, one of the major causes of BioRuby's unit test failures with JRuby; once a fix is integrated, we'll be close to having all the tests passing under JRuby. I identified another JRuby bug, JRUBY-6666, causing several unit test failures.
This one affects BioRuby's code for running external commands, so it would be likely to be encountered in production use. For this one, I also worked up a patch. I also spent some time preparing a performance testing environment, for evaluating existing MAF implementations as well as my own. This will be important, since I will be considering the use of an existing C parser. I will also want to ensure that the performance of my code is competitive with the alternatives. Lacking any hardware more powerful than a MacBook Air, I am setting this up with Amazon EC2. To simplify environment setup, I'll be using Chef. I've already set up a Chef repository with configuration logic, and some rudimentary code to streamline launching Ubuntu machines on EC2 and bootstrapping a Chef environment. To save money, I plan to make use of EC2 Spot Instances, which are perfect for instances that only need to run for a few hours for batch tasks. Clayton Wheeler cswh at umich.edu From bonnal at ingm.org Tue May 22 05:21:42 2012 From: bonnal at ingm.org (Raoul Bonnal) Date: Tue, 22 May 2012 11:21:42 +0200 Subject: [GSoC] [BioRuby] GSoC week 2 status report In-Reply-To: <0D2AC678-1DD1-40B9-B100-EDA3429B3D87@umich.edu> Message-ID: Hi Clayton, Well done and thanks for your contributions to the bioruby and jruby community. For your computing issue I have two solutions: 1) I can create a VM and give you the access, I need to contact my IT dep. 2) Could Amazon provide some VM for our students? On 21/05/12 17.50, "Clayton Wheeler" wrote: > Hi all, > > Here's my report on last week's work: > > http://csw.github.com/bioruby-maf/blog/2012/05/21/week_2_progress/ > > This was my second week of work on my GSoC project, and the last week of the > "community bonding" period before the official start of coding. A major focus > of mine was BioRuby's phyloXML support; it uses libxml, which has been causing > unit test failures under JRuby.
In the end, the best course of action seemed > to separate the phyloXML support as a separate plugin, which I have done as > the bio-phyloxml gem. This will remove BioRuby's dependency on XML libraries > entirely and that JRuby issue along with it. At the same time, users of the > phyloXML code should be able to continue using it with no substantive changes. > > Separately, I began porting this phyloXML code to use Nokogiri instead of > libxml-ruby, but ran into difficulties with this effort. While it is possible, > and the library APIs are very similar, the code uses relatively low-level XML > processing APIs in ways that seem to be sensitive to subtle differences in > text node and namespace semantics between the two libraries. Substantial > restructuring of the code and the addition of quite a few unit tests might be > necessary to carry out such a port with confidence that the resulting code > would work well. > > Also, someone else submitted a JRuby patch for JRUBY-6658, one of the major > causes of BioRuby's unit test failures with JRuby; once a fix is integrated, > we'll be close to having all the tests passing under JRuby. > > I identified another JRuby bug, JRUBY-6666, causing several unit test > failures. This one affects BioRuby's code for running external commands, so it > would be likely to be encountered in production use. For this one, I also > worked up a patch. > > I also spent some time preparing a performance testing environment, for > evaluating existing MAF implementations as well as my own. This will be > important, since I will be considering the use of an existing C parser. I will > also want to ensure that the performance of my code is competitive with the > alternatives. Lacking any hardware more powerful than a MacBook Air, I am > setting this up with Amazon EC2. To simplify environment setup, I'll be using > Chef.
I've already set up a Chef repository with configuration logic, and some > rudimentary code to streamline launching Ubuntu machines on EC2 and > bootstrapping a Chef environment. To save money, I plan to make use of EC2 > Spot Instances, which are perfect for instances that only need to run for a > few hours for batch tasks. > > Clayton Wheeler > cswh at umich.edu > > > > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From w.arindrarto at gmail.com Tue May 22 06:21:25 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 22 May 2012 12:21:25 +0200 Subject: [GSoC] GSoC Project Update -- 3 Message-ID: Hi everyone, I just posted my latest GSoC update here: http://bow.web.id/blog/2012/05/from-bio-import-searchio/ To summarize the post and what I've done the last week: * I finished writing all base SearchIO objects and tested them as well. These objects are the QueryResult object (previously called Result), representing search results from a single query; the Hit object, representing pairwise alignments from a single database hit; and the HSP object, representing a single alignment. I've also written the docstrings for these objects, so you can run help() on them in an interpreter session. The post also includes a very brief outline of the base objects' features, if you are curious. * Using this, I was able to write a working prototype for SearchIO BLAST XML parsing. This prototype has also been tested, using the test cases I've generated previously. For now, it's implemented using our NCBIXML parser, just so that people can have a taste of what SearchIO will feel like. If you want to play around with the prototype, it's available here: https://github.com/bow/biopython/tree/searchio-blastxml. As always, feel free to notify me of suggestions, critiques, and/or feature requests :).
regards, Bow From rbuels at gmail.com Tue May 22 16:15:15 2012 From: rbuels at gmail.com (Robert Buels) Date: Tue, 22 May 2012 16:15:15 -0400 Subject: [GSoC] [BioRuby] GSoC week 2 status report In-Reply-To: References: Message-ID: <4FBBF3D3.4040003@gmail.com> On 05/22/2012 05:21 AM, Raoul Bonnal wrote: > 2) Could Amazon provide some VM for our students? AWS allows quite a bit of free usage at no charge: http://aws.amazon.com/free/ If you need more, you could apply for a grant from them. http://aws.amazon.com/education/ Rob From saketkc at gmail.com Tue May 22 16:17:01 2012 From: saketkc at gmail.com (Saket Choudhary) Date: Tue, 22 May 2012 21:17:01 +0100 Subject: [GSoC] [BioRuby] GSoC week 2 status report In-Reply-To: <4FBBF3D3.4040003@gmail.com> References: <4FBBF3D3.4040003@gmail.com> Message-ID: I have a free 50$ credit on AWS. I would want to give it to BioRuby, if possible. On 22 May 2012 21:15, Robert Buels wrote: > On 05/22/2012 05:21 AM, Raoul Bonnal wrote: > >> 2) Could Amazon provide some VM for our students? >> > > AWS allows quite a bit of free usage at no charge: > http://aws.amazon.com/free/ > If you need more, you could apply for a grant from them. > http://aws.amazon.com/education/ > > Rob > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc > From arklenna at gmail.com Wed May 23 17:56:03 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 23 May 2012 17:56:03 -0400 Subject: [GSoC] GSoC python variant update 3 Message-ID: Hi all, Latest blog post here: http://arklenna.tumblr.com/post/23630012065/week-1 Brief summary: I have reversed my prior conclusion that `SeqRecord` is inadequate for holding variant data. It is still not ideal, but the advantages of using an existing native object are substantial, and the disadvantages can be reduced by creating an accessor for the variant-specific data within a `SeqRecord`.
I've made an outline of how I would store the information returned by PyVCF within `SeqRecord` and `SeqFeature` objects. It includes a few questions about the most logical way to store certain variant information. As the coding period has now started, I'll be pushing some prototypes to GitHub in the near future. Lenna From cjfields at illinois.edu Thu May 24 01:14:20 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 24 May 2012 05:14:20 +0000 Subject: [GSoC] [BioRuby] Weekly report #1 In-Reply-To: References: Message-ID: I think the mentioned D wrappers on the SWIG page are ANSI C/C++ libraries wrapped for D, not D code/libs/etc wrapped for Ruby, unless I'm mistaken... chris On May 23, 2012, at 11:30 PM, Mic wrote: > D to Ruby: http://www.swig.org/compare.html > > On Mon, May 21, 2012 at 9:58 PM, Artem Tarasov wrote: > >> Hi all, >> >> here's my report about the past week: >> http://lomereiter.wordpress.com/2012/05/21/gsoc-weekly-report-1/ >> >> Brief summary: >> >> 1) BioRuby unit tests and Rubinius bugs - I posted 2 issues in Rubinius >> bugtracker, and one of them is already solved. Rubinius in 1.8 mode should >> now pass all tests. The situation with 1.9 mode is not that great, but I'm >> working on it. >> >> 2) I started to collect D optimization tricks on github wiki page. >> Currently, it contains just 6 tips, but this number is going to grow. >> Probably, another page will be created soon to keep best practices of >> connecting Ruby and D. Since my project and Marjan's one have a lot in >> common, I think it's important for us to not waste time on something that >> already have been investigated. >> >> 3) During the week, I learned a bit about BDD and Cucumber, enjoyed it, and >> wrote my first two features. >> >> 4) Measurements of object instantiation time in Ruby suggest that exposing >> low-level D functions via FFI makes little sense.
I'm going to discuss with >> mentors which high-level functions should be available, and make that into >> Cucumber features. >> >> >> >> >> -- >> Artem >> >> _______________________________________________ >> BioRuby Project - http://www.bioruby.org/ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby >> > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From cswh at umich.edu Thu May 24 01:33:40 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Thu, 24 May 2012 01:33:40 -0400 Subject: [GSoC] [BioRuby] GSoC week 2 status report In-Reply-To: References: Message-ID: <9DBCD042-7086-4F4B-ABB9-1A7F63C089B8@umich.edu> Thanks for the offers of help, everybody. Raoul, if it's convenient for you to set up a test VM in house, that would probably make the most sense. I don't think it's a pressing need at this point, but let's look into that. If we run into issues, we can revisit the EC2 options. (I've had an AWS account too long to qualify for the free usage tier, unfortunately.) An Amazon grant might be worth looking at, especially if we can use it to publicly host, say, BGZF-compressed pre-indexed MAF data sets also. On the other hand, that might be overkill just for my needs; using spot-priced instances, I expect I could do all the testing I need for under $50. Clayton Wheeler cswh at umich.edu From lomereiter at googlemail.com Thu May 24 01:40:54 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Thu, 24 May 2012 09:40:54 +0400 Subject: [GSoC] [BioRuby] Weekly report #1 In-Reply-To: References: Message-ID: Chris is right. Currently, it's easier to write everything manually. Once I develop some 'best practices' I may put them into compile-time algorithms and generate bindings from D.
(The language has compile-time introspection but doesn't have run-time one, probably because that would hurt the performance.) On Thu, May 24, 2012 at 9:14 AM, Fields, Christopher J < cjfields at illinois.edu> wrote: > I think the mentioned D wrappers on the SWIG page are ANSI C/C++ libraries > wrapped for D, not D code/libs/etc wrapped for Ruby, unless I'm mistaken... > > chris > > On May 23, 2012, at 11:30 PM, Mic wrote: > > > D to Ruby: http://www.swig.org/compare.html > > > > On Mon, May 21, 2012 at 9:58 PM, Artem Tarasov < > lomereiter at googlemail.com>wrote: > > > >> Hi all, > >> > >> here's my report about the past week: > >> http://lomereiter.wordpress.com/2012/05/21/gsoc-weekly-report-1/ > >> > >> Brief summary: > >> > >> 1) BioRuby unit tests and Rubinius bugs - I posted 2 issues in Rubinius > >> bugtracker, and one of them is already solved. Rubinius in 1.8 mode > should > >> now pass all tests. The situation with 1.9 mode is not that great, but > I'm > >> working on it. > >> > >> 2) I started to collect D optimization tricks on github wiki page. > >> Currently, it contains just 6 tips, but this number is going to grow. > >> Probably, another page will be created soon to keep best practices of > >> connecting Ruby and D. Since my project and Marjan's one have a lot in > >> common, I think it's important for us to not waste time on something > that > >> already have been investigated. > >> > >> 3) During the week, I learned a bit about BDD and Cucumber, enjoyed it, > and > >> wrote my first two features. > >> > >> 4) Measurements of object instantiation time in Ruby suggest that > exposing > >> low-level D functions via FFI makes little sense. I'm going to discuss > with > >> mentors which high-level functions should be available, and make that > into > >> Cucumber features.
> >> > >> > >> > >> -- > >> Artem > >> > >> _______________________________________________ > >> BioRuby Project - http://www.bioruby.org/ > >> BioRuby mailing list > >> BioRuby at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioruby > >> > > > > _______________________________________________ > > BioRuby Project - http://www.bioruby.org/ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > From mictadlo at gmail.com Thu May 24 00:30:22 2012 From: mictadlo at gmail.com (Mic) Date: Thu, 24 May 2012 14:30:22 +1000 Subject: [GSoC] [BioRuby] Weekly report #1 In-Reply-To: References: Message-ID: D to Ruby: http://www.swig.org/compare.html On Mon, May 21, 2012 at 9:58 PM, Artem Tarasov wrote: > Hi all, > > here's my report about the past week: > http://lomereiter.wordpress.com/2012/05/21/gsoc-weekly-report-1/ > > Brief summary: > > 1) BioRuby unit tests and Rubinius bugs - I posted 2 issues in Rubinius > bugtracker, and one of them is already solved. Rubinius in 1.8 mode should > now pass all tests. The situation with 1.9 mode is not that great, but I'm > working on it. > > 2) I started to collect D optimization tricks on github wiki page. > Currently, it contains just 6 tips, but this number is going to grow. > Probably, another page will be created soon to keep best practices of > connecting Ruby and D. Since my project and Marjan's one have a lot in > common, I think it's important for us to not waste time on something that > already have been investigated. > > 3) During the week, I learned a bit about BDD and Cucumber, enjoyed it, and > wrote my first two features. > > 4) Measurements of object instantiation time in Ruby suggest that exposing > low-level D functions via FFI makes little sense. I'm going to discuss with > mentors which high-level functions should be available, and make that into > Cucumber features.
> > > > > -- > Artem > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From cswh at umich.edu Fri May 25 16:42:13 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Fri, 25 May 2012 16:42:13 -0400 Subject: [GSoC] New blog post on this week's work Message-ID: <329E20F7-BF3F-4201-ADD0-ABCDFC5ECDE4@umich.edu> Hi all, I've written a new blog post on the work I did on my MAF parser this week: http://csw.github.com/bioruby-maf/blog/2012/05/25/first_milestone/ It covers parser implementation and performance issues, BDD, and tools. Clayton Wheeler cswh at umich.edu From john.woods at marcottelab.org Thu May 24 10:01:08 2012 From: john.woods at marcottelab.org (John Woods) Date: Thu, 24 May 2012 09:01:08 -0500 Subject: [GSoC] [BioRuby] GSoC week 2 status report In-Reply-To: References: <0D2AC678-1DD1-40B9-B100-EDA3429B3D87@umich.edu> Message-ID: If I can just suggest, there's a startup pitch out there which was formerly known as Happy Science Coding, now Appsoma, which lets you run Ruby code on Rackspace instances. It may or may not be appropriate for what you want to do. It's not EC2, but it is a VM (right?). http://appsoma.com/ It's still a bit buggy with Ruby. If you have trouble, email Zack (see the "About us" page). He's fairly responsive. John SciRuby On Tue, May 22, 2012 at 4:21 AM, Raoul Bonnal wrote: > Hi Clayton, > Well done and thanks for your contributes to bioruby and jruby community. > > For you computing issue I have two solutions: > 1) I can create a VM and give you the access, I need to contact my IT dep. > 2) Could Amazon provide some VM for our students? 
> > > > On 21/05/12 17.50, "Clayton Wheeler" wrote: > > > Hi all, > > > > Here's my report on last week's work: > > > > http://csw.github.com/bioruby-maf/blog/2012/05/21/week_2_progress/ > > > > This was my second week of work on my GSoC project, and the last week of > the > > "community bonding" period before the official start of coding. A major > focus > > of mine was BioRuby's phyloXML support; it uses libxml, which has been > causing > > unit test failures under JRuby. In the end, the best course of action > seemed > > to separate the phyloXML support as a separate plugin, which I have done > as > > the bio-phyloxml gem. This will remove BioRuby's dependency on XML > libraries > > entirely and that JRuby issue along with it. At the same time, users of > the > > phyloXML code should be able to continue using it with no substantive > changes. > > > > Separately, I began porting this phyloXML code to use Nokogiri instead of > > libxml-ruby, but ran into difficulties with this effort. While it is > possible, > > and the library APIs are very similar, the code uses relatively > low-level XML > > processing APIs in ways that seem to be sensitive to subtle differences > in > > text node and namespace semantics between the two libraries. Substantial > > restructuring of the code and the addition of quite a few unit tests > might be > > necessary to carry out such a port with confidence that the resulting > code > > would work well. > > > > Also, someone else submitted a JRuby patch for JRUBY-6658, one of the > major > > causes of BioRuby's unit test failures with JRuby; once a fix is > integrated, > > we'll be close to having all the tests passing under JRuby. > > > > I identified another JRuby bug, JRUBY-6666, causing several unit test > > failures. This one affects BioRuby's code for running external commands, > so it > > would be likely to be encountered in production use. For this one, I also > > worked up a patch.
> > > > I also spent some time preparing a performance testing environment, for > > evaluating existing MAF implementations as well as my own. This will be > > important, since I will be considering the use of an existing C parser. > I will > > also want to ensure that the performance of my code is competitive with > the > > alternatives. Lacking any hardware more powerful than a MacBook Air, I am > > setting this up with Amazon EC2. To simplify environment setup, I'll be > using > > Chef. I've already set up a Chef repository with configuration logic, > and some > > rudimentary code to streamline launching Ubuntu machines on EC2 and > > bootstrapping a Chef environment. To save money, I plan to make use of > EC2 > > Spot Instances, which are perfect for instances that only need to run > for a > > few hours for batch tasks. > > > > Clayton Wheeler > > cswh at umich.edu > > > > > > > > > > _______________________________________________ > > BioRuby Project - http://www.bioruby.org/ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From lomereiter at googlemail.com Sun May 27 14:27:43 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Sun, 27 May 2012 22:27:43 +0400 Subject: [GSoC] weekly report #2 Message-ID: Hi all, I wrote a blog post about the past week: http://lomereiter.wordpress.com/2012/05/27/gsoc-weekly-report-2/ Topics are: 1) I have quite good validation module for BAM now. More kinds of checks can be added, just request them :) 2) Also I started to implement random access via BAI file, just because I mostly finished what I planned for the first two weeks, and random access seems to be one of the most important things.
Also it's not mentioned in the blog, but I started to work on BGZF gem, as Pjotr suggested to me. I'll try to document it and publish the first version next week. Currently I write it in pure Ruby. From marian.povolny at gmail.com Sun May 27 15:21:48 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Sun, 27 May 2012 21:21:48 +0200 Subject: [GSoC] GSoC weekly status report No.1.9 Message-ID: http://blog.mpthecoder.com/post/23877896288/gsoc-weekly-status-report-no-1-9 This is the final post in 1.x series, I promise. The last week was spent adding support for parsing lines into records. It was a lot of work, and when I read the comments from my mentor, I wasn't happy. But I agree with him, I did make it more complicated than it had to be (the C API, for example), I should spend some time polishing and refactoring the D side, and my cucumber features should be split into more features. So that's the rough plan for the next week. -- Marjan From bonnal at ingm.org Mon May 28 04:50:19 2012 From: bonnal at ingm.org (Raoul Bonnal) Date: Mon, 28 May 2012 10:50:19 +0200 Subject: [GSoC] DevTools In-Reply-To: <329E20F7-BF3F-4201-ADD0-ABCDFC5ECDE4@umich.edu> Message-ID: In case you want to use RedMine I can give you the license for free, any bioruby developer can request it. From p.j.a.cock at googlemail.com Mon May 28 05:00:30 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 28 May 2012 10:00:30 +0100 Subject: [GSoC] [BioRuby] DevTools In-Reply-To: References: <329E20F7-BF3F-4201-ADD0-ABCDFC5ECDE4@umich.edu> Message-ID: On Mon, May 28, 2012 at 9:50 AM, Raoul Bonnal wrote: > In case you want to use RedMine I can give you the license for free, any > bioruby developer can request it. > ??? Redmine is licensed under the GPL. Did you mean admin rights on the OBF RedMine instance, for example to close bug reports?
https://redmine.open-bio.org/projects/bioruby Peter From p.j.a.cock at googlemail.com Mon May 28 05:07:39 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 28 May 2012 10:07:39 +0100 Subject: [GSoC] weekly report #2 In-Reply-To: References: Message-ID: On Sun, May 27, 2012 at 7:27 PM, Artem Tarasov wrote: > Hi all, > > I wrote a blog post about the past week: > http://lomereiter.wordpress.com/2012/05/27/gsoc-weekly-report-2/ > > Topics are: > 1) I have quite good validation module for BAM now. More kinds of checks > can be added, just request them :) > The blog mentions you think you found some issues with tags.bam file - could you elaborate (direct email is fine), and tell me about any future issues please? > 2) Also I started to implement random access via BAI file, just because I > mostly finished what I planned for the first two weeks, and random access > seems to be one of the most important things. > > Also it's not mentioned in the blog, but I started to work on BGZF gem, as > Pjotr suggested to me. I'll try to document it and publish the first > version next week. Currently I write it in pure Ruby. > I guess, given my suggestion that Clayton might be able to use your BGZF support code for compressed MAF files, it does make sense to package the BGZF support as a Bio Gem. Good point Pjotr. http://lists.open-bio.org/pipermail/bioruby/2012-May/002301.html Peter From bonnal at ingm.org Mon May 28 05:03:01 2012 From: bonnal at ingm.org (Raoul Bonnal) Date: Mon, 28 May 2012 11:03:01 +0200 Subject: [GSoC] [BioRuby] DevTools In-Reply-To: Message-ID: Ahhhhhhhhhhh I mean RubyMine http://www.jetbrains.com/ruby/ sorry On 28/05/12 11.00, "Peter Cock" wrote: > > > On Mon, May 28, 2012 at 9:50 AM, Raoul Bonnal wrote: >> In case you want to use RedMine I can give you the license for free, any >> bioruby developer can request it. > > ??? Redmine is licensed under the GPL. > > Did you mean admin rights on the OBF RedMine instance, for > example to close bug reports?
> https://redmine.open-bio.org/projects/bioruby > > Peter > > From lomereiter at googlemail.com Mon May 28 05:29:24 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 28 May 2012 13:29:24 +0400 Subject: [GSoC] weekly report #2 In-Reply-To: References: Message-ID: > > The blog mentions you think you found some issues with tags.bam > file - could you elaborate (direct email is fine), and tell me about any > future issues please? > They are very minor. Specification says (1.4) that 'QNAME' should be [!-?A-~], that doesn't include space and '@' sign, and that (1.5) printable characters in tags with 'A' type are [!-~], i.e. only space is not allowed. BTW, I looked at your code which generated the file, it uses range(32, 127) both for 'Z' and 'A' types of tags, even though it's explicitly written in comments right above these lines where space should be included, and where it shouldn't :) From p.j.a.cock at googlemail.com Mon May 28 05:48:21 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 28 May 2012 10:48:21 +0100 Subject: [GSoC] weekly report #2 In-Reply-To: References: Message-ID: On Mon, May 28, 2012 at 10:29 AM, Artem Tarasov wrote: > The blog mentions you think you found some issues with tags.bam >> file - could you elaborate (direct email is fine), and tell me about any >> future issues please? >> > > They are very minor. Specification says (1.4) that 'QNAME' should be > [!-?A-~], that doesn't include space and '@' sign, > Fair point. I should fix that. The '@' was presumably excluded in the v1.3 spec to avoid confusion with FASTQ files. > and that (1.5) > printable characters in tags with 'A' type are [!-~], i.e. only space > is not allowed.
> > BTW, I looked at your code which generated the file, it uses > range(32, 127) both for 'Z' and 'A' types of tags, even though > it's explicitly written in comments right above these lines where > space should be included, and where it shouldn't :) > Good point, that is a change in the specification I hadn't noticed. Back in v1.2, both A and Z were just "printable character" and "printable string", which to me includes the space. It was only in v1.3 that this was made explicit with a regex, and space ceased to be allowed in the A tag. I wonder if that was an accident or deliberate? You'll notice that samtools doesn't complain about these deviations from the specification but it doesn't attempt any validation. I'm not sure if Picard checks this. Thanks, Peter From w.arindrarto at gmail.com Wed May 30 17:44:04 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 30 May 2012 23:44:04 +0200 Subject: [GSoC] GSoC Project Update -- 4 Message-ID: Hi everyone, I just posted my latest GSoC update here: http://bow.web.id/blog/2012/05/assembling-the-parsers/ To summarize: I've been working on more SearchIO parsers last week, adding more formats to support. We now have a SearchIO-specific BLAST+ XML parser (it was first implemented on top of NCBIXML). It uses ElementTree as the base XML parser, with promising performance gains. I've also completed SearchIO's blast tabular parser, which takes in the BLAST+ tabular output files with or without headers. If the tabular file has headers, it can parse any number of columns in any order as long as the columns with hit and query IDs are present. Finally, I've finished writing the HMMER plain text parser. For now, the parser can handle outputs from hmmscan and hmmsearch, single and multiple queries. All these parsers have been tested using the test cases I've generated previously.
Additionally, I also had a public discussion with Peter on Github regarding SearchIO objects here: https://github.com/bow/biopython/commit/69a0ab64dfa7718f7455ca4c3961e95277fb4dbc#-P0, if anyone is interested. It started as a discussion on some behaviors of the HSP object, but also relates to other issues raised earlier (the dynamic SeqRecord coordinates Peter brought up earlier and Biopython's platform support). That's it for this week :). cheers, Bow From marian.povolny at gmail.com Sun Jun 3 17:07:18 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Sun, 3 Jun 2012 23:07:18 +0200 Subject: [GSoC] GSoC weekly status report No.2 Message-ID: http://blog.mpthecoder.com/post/24355573626/gsoc-weekly-status-report-no-2 It's the end of the second week of GSoC and time for a new report. I spent the last week mostly doing work based on criticism from my mentor. The D parser which parses lines into records is now in a pretty good shape, and tested. Today I received a list of new issues that need to be resolved before going further, but they're not that much work and I can plan some new developments. A utility for validation is in planning for next week, which could be also used for performance measurement. And after that I will turn to making the current parser parallel. Also, tomorrow I'll be defending my Masters Thesis, after which I should be able to concentrate more on the GFF3 parser. From arklenna at gmail.com Sun Jun 3 22:39:47 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Sun, 3 Jun 2012 22:39:47 -0400 Subject: [GSoC] GSoC python variant update 4 Message-ID: Blog post (entirely reproduced in this email): http://arklenna.tumblr.com/post/24378549953/ I started implementing storage of VCF data in `SeqRecord` and `SeqFeature`. I digressed, spending a few days experimenting with overloading `__getattr__()` in lieu of manually writing properties.
Then it occurred to me that if, as Reece pointed out, a variant doesn't contain the actual sequence but a reference to the sequence, the advantages to using `SeqRecord` are minimal or possibly negative. In my experience, the highest performance for filtering large amounts of data is SQL. SQL has the advantage of scalability: SQLite now ships with Python, users can choose to run their own MySQL/PGSQL server, and I've read about a few approaches to GPU accelerated SQL. My initial glances at BioSQL, GMOD, etc. didn't show anything specifically designed for variants (again, a focus on storage of the sequence itself) so I implemented my own interface. Currently, the `parse_all()` method is very slow (approximately 260 seconds for a file with 240,000 variants when the parsing takes 5-10 seconds) and I am investigating why. My first step will be to reduce commit frequency. With a SQL backend, it seems superfluous to have a dedicated variant representation within Python. The SQL result object should allow for straightforward retrieval of data by name. I'm storing "misc" data in a SQL text field using JSON, which is also easy to access. 
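A minimal sketch of the "reduce commit frequency" idea, assuming Python's built-in sqlite3 module. The table layout, the column names and the store_variants() helper are hypothetical and are not taken from the code described in this email. It inserts every row inside one transaction, keeps the "misc" data as JSON text, and reads a result column back by name:

```python
import json
import sqlite3


def store_variants(conn, variants):
    """Insert all variants inside one transaction (a single commit).

    Hypothetical helper: the schema below is made up for illustration.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variant ("
        "chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, misc TEXT)"
    )
    rows = (
        (v["chrom"], v["pos"], v["ref"], v["alt"], json.dumps(v.get("misc", {})))
        for v in variants
    )
    # "with conn" commits once on success and rolls back on an exception,
    # instead of committing after every row.
    with conn:
        conn.executemany("INSERT INTO variant VALUES (?, ?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets result columns be read by name
store_variants(conn, [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "misc": {"DP": 14}},
    {"chrom": "1", "pos": 250, "ref": "T", "alt": "C"},
])
row = conn.execute("SELECT * FROM variant WHERE pos = 100").fetchone()
depth = json.loads(row["misc"])["DP"]  # JSON text decoded back into a dict
```

Whether a single commit explains the whole gap between parsing time and loading time is not something this sketch can show; it only illustrates the batching pattern.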
Next: * Looking at BioSQL/GMOD etc to see if there is an existing standard I should be using/following * Deciding the extent of the convenience functions I wish to implement * Thinking about the most efficient way to filter records on the way into the SQL database From arklenna at gmail.com Mon Jun 4 09:30:15 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 4 Jun 2012 09:30:15 -0400 Subject: [GSoC] [Biopython-dev] GSoC python variant update 4 In-Reply-To: References: Message-ID: <2D58B8E1-5056-445F-B623-56B7136048BC@gmail.com> On Jun 4, 2012, at 1:11 AM, Mic wrote: > Hi Lenna, > Big companies are using http://en.wikipedia.org/wiki/NoSQL > > What kind of ORM do you want to use ( http://en.wikipedia.org/wiki/SQLAlchemy or http://en.wikipedia.org/wiki/Storm_%28software%29 ) > > Cheers, > Mic > > Hey Mic, Looks like there has been some talk about SQLAlchemy in Biopython: http://biopython.org/pipermail/biopython/2009-August/005455.html Lenna From mictadlo at gmail.com Mon Jun 4 01:11:56 2012 From: mictadlo at gmail.com (Mic) Date: Mon, 4 Jun 2012 15:11:56 +1000 Subject: [GSoC] [Biopython-dev] GSoC python variant update 4 In-Reply-To: References: Message-ID: Hi Lenna, Big companies are using http://en.wikipedia.org/wiki/NoSQL What kind of ORM do you want to use ( http://en.wikipedia.org/wiki/SQLAlchemy or http://en.wikipedia.org/wiki/Storm_%28software%29 ) Cheers, Mic On Mon, Jun 4, 2012 at 12:39 PM, Lenna Peterson wrote: > Blog post (entirely reproduced in this email): > http://arklenna.tumblr.com/post/24378549953/ > > I started implementing storage of VCF data in `SeqRecord` and > `SeqFeature`. I digressed, spending a few days experimenting with > overloading `__getattr__()` in lieu of manually writing properties. > Then it occurred to me that if, as Reece pointed out, a variant > doesn't contain the actual sequence but a reference to the sequence, > the advantages to using `SeqRecord` are minimal or possibly negative.
> > In my experience, the highest performance for filtering large amounts > of data is SQL. SQL has the advantage of scalability: SQLite now ships > with Python, users can choose to run their own MySQL/PGSQL server, and > I've read about a few approaches to GPU accelerated SQL. > > My initial glances at BioSQL, GMOD, etc. didn't show anything > specifically designed for variants (again, a focus on storage of the > sequence itself) so I implemented my own interface. Currently, the > `parse_all()` method is very slow (approximately 260 seconds for a > file with 240,000 variants when the parsing takes 5-10 seconds) and I > am investigating why. My first step will be to reduce commit > frequency. > > With a SQL backend, it seems superfluous to have a dedicated variant > representation within Python. The SQL result object should allow for > straightforward retrieval of data by name. I'm storing "misc" data in > a SQL text field using JSON, which is also easy to access. > > Next: > > * Looking at BioSQL/GMOD etc to see if there is an existing standard I > should be using/following > * Deciding the extent of the convenience functions I wish to implement > * Thinking about the most efficient way to filter records on the way > into the SQL database > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From chapmanb at 50mail.com Mon Jun 4 12:04:15 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 04 Jun 2012 12:04:15 -0400 Subject: [GSoC] GSoC python variant update 4 In-Reply-To: References: Message-ID: <87haurjc74.fsf@fastmail.fm> Lenna; Thanks for the summary. A couple of thoughts on the directions: - For property access, I think the best approach would be to store all of the arbitrary key/value pairs from INFO in SeqRecord annotations, then only use hand coded @properties to expose the most useful. 
That gives people access to the most useful ones (as determined by you) with attributes but lets anyone dig in and get custom ones. - If you'd like to explore an SQL backend, you should have a look at Gemini: https://github.com/arq5x/gemini which stores variants in a SQLite database along with associated annotations. It's a flat structure based on adding and exposing useful annotations on variants: https://github.com/arq5x/gemini/blob/master/gemini/database.py Reinventing a new SQL store representation is a lot of work so it might be good to work off what other folks are currently doing and try to provide a Biopython friendly front end, much as you're exploring with PyVCF. Hope these are useful. Let me know if you have any questions at all, Brad > Blog post (entirely reproduced in this email): > http://arklenna.tumblr.com/post/24378549953/ > > I started implementing storage of VCF data in `SeqRecord` and > `SeqFeature`. I digressed, spending a few days experimenting with > overloading `__getattr__()` in lieu of manually writing properties. > Then it occurred to me that if, as Reece pointed out, a variant > doesn't contain the actual sequence but a reference to the sequence, > the advantages to using `SeqRecord` are minimal or possibly negative. > > In my experience, the highest performance for filtering large amounts > of data is SQL. SQL has the advantage of scalability: SQLite now ships > with Python, users can choose to run their own MySQL/PGSQL server, and > I've read about a few approaches to GPU accelerated SQL. > > My initial glances at BioSQL, GMOD, etc. didn't show anything > specifically designed for variants (again, a focus on storage of the > sequence itself) so I implemented my own interface. Currently, the > `parse_all()` method is very slow (approximately 260 seconds for a > file with 240,000 variants when the parsing takes 5-10 seconds) and I > am investigating why. My first step will be to reduce commit > frequency.
> > With a SQL backend, it seems superfluous to have a dedicated variant > representation within Python. The SQL result object should allow for > straightforward retrieval of data by name. I'm storing "misc" data in > a SQL text field using JSON, which is also easy to access. > > Next: > > * Looking at BioSQL/GMOD etc to see if there is an existing standard I > should be using/following > * Deciding the extent of the convenience functions I wish to implement > * Thinking about the most efficient way to filter records on the way > into the SQL database > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From lomereiter at googlemail.com Mon Jun 4 14:02:58 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 4 Jun 2012 22:02:58 +0400 Subject: [GSoC] Weekly report #3 Message-ID: Hello all, the post is here: http://lomereiter.wordpress.com/2012/06/04/gsoc-weekly-report-3/ I've implemented random access to BAM file, using index file. Also I created a generic function for memoization which stores decompressed blocks in cache, following some desired cache strategy. Currently, I use simple FIFO cache. Also I studied how to make SAM output faster. I came to the conclusion that not only D standard library functions, but even ones of *printf family are too slow for this purpose, because they have to parse format string. Instead, I need to use specialized functions for printing integers and floats. Currently, output is about 4x slower than in samtools. So I have to take back some of my harsh words about its code and say that there is something to learn from there. It indeed uses its own functions for integer output, and also uses string buffer to do less calls (system functions can't be inlined). I'll use this approach, too, so very soon my library will be usable in pipelines, but only for output. 
Then I'm going to move on to allow alignments to be modified and outputted to BAM. After that, SAM parser needs to be implemented, and I'm going to use Ragel (finite-state machine compiler) for that purpose. So by the beginning of July I want to have SAM<->BAM conversion working, with a good speed. Add to that first release of biogem, and those are my plans for this month. From p.j.a.cock at googlemail.com Mon Jun 4 15:36:25 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 4 Jun 2012 20:36:25 +0100 Subject: [GSoC] Weekly report #3 In-Reply-To: References: Message-ID: On Mon, Jun 4, 2012 at 7:02 PM, Artem Tarasov wrote: > Hello all, > > the post is here: > http://lomereiter.wordpress.com/2012/06/04/gsoc-weekly-report-3/ > > I've implemented random access to BAM file, using index file. Also I > created a generic function for memoization which stores decompressed > blocks in cache, following some desired cache strategy. Currently, I > use simple FIFO cache. That sounds good. We've talked a little bit about the block caching strategy for Biopython's BGZF support - dropping the least recently used block would be good (LRU) but requires the overhead of storing and recording timestamps on each access. Currently my Biopython BGZF code just drops a cached block 'at random' (actually based on the dictionary hashing algorithm), and switching to FIFO was something I planned to try next (easily done with Python's OrderedDict class). FIFO seems like a good solution as the overheads are much lower than LRU. Have you got any good random access benchmarks to try this out with? i.e. something non-random, such as pulling mates of paired end reads. How many BGZF blocks are you keeping in the cache, and why? Are you thinking about BGZF output yet (which will be required in order to write BAM files)? 
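For reference, the FIFO strategy described above could look roughly like this with Python's OrderedDict. This is only a sketch under assumptions: the FifoBlockCache class, its get() method and the load_block callback are invented for illustration and are not the Biopython BGZF code or the D library's cache; the default capacity is arbitrary.

```python
from collections import OrderedDict


class FifoBlockCache:
    """Keep at most `capacity` decompressed blocks, evicting the oldest insert.

    Hypothetical class for illustration only.
    """

    def __init__(self, load_block, capacity=100):
        self._load_block = load_block  # callable: block start offset -> bytes
        self._capacity = capacity
        self._blocks = OrderedDict()

    def get(self, offset):
        try:
            # A hit does not reorder anything; that is what makes this FIFO
            # rather than LRU (no bookkeeping on each access).
            return self._blocks[offset]
        except KeyError:
            pass
        data = self._load_block(offset)
        if len(self._blocks) >= self._capacity:
            self._blocks.popitem(last=False)  # drop the first-inserted block
        self._blocks[offset] = data
        return data


loads = []  # records which offsets actually had to be loaded


def fake_loader(offset):
    """Stand-in for reading and decompressing one block."""
    loads.append(offset)
    return b"block@%d" % offset


cache = FifoBlockCache(fake_loader, capacity=2)
cache.get(0)
cache.get(10)
cache.get(0)   # hit, nothing is loaded
cache.get(20)  # evicts offset 0 (oldest insert) even though it was just used
cache.get(0)   # miss again, so offset 0 is reloaded
```

An LRU variant would additionally call move_to_end(offset) on every hit, which is the per-access overhead mentioned above.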
Regards, Peter From lomereiter at googlemail.com Mon Jun 4 16:07:03 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Tue, 5 Jun 2012 00:07:03 +0400 Subject: [GSoC] Weekly report #3 In-Reply-To: References: Message-ID: > Have you got any good random access benchmarks to try this out > with? i.e. something non-random, such as pulling mates of paired > end reads. > Currently, no. Please suggest your ideas about benchmarks because I suspect that you have much more experience with BAM files and better knowledge of use patterns. How many BGZF blocks are you keeping in the cache, and why? > Currently, 512. I don't know why, seems like a reasonable number (about 30MB of RAM). Maybe it should be a runtime parameter but I doubt that end users will bother with tweaking cache size. > Are you thinking about BGZF output yet (which will be required > in order to write BAM files)? > It's not hard at all. I already wrote packing string to BGZF in Ruby: https://github.com/lomereiter/bioruby-bgzf/blob/master/lib/bio-bgzf/pack.rb Parallelizing should also be easy, it's very similar to reading blocks from file. Determine how many alignments to pack in one block (it's 65Kb max), send compression task to taskpool, then go create next chunk of alignments, and so on. From cswh at umich.edu Mon Jun 4 23:04:06 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Mon, 4 Jun 2012 23:04:06 -0400 Subject: [GSoC] Weekly report: Indexed MAF access, Kyoto Cabinet, SQLite, and more Message-ID: <2B6E16E9-3DBC-4F54-88F8-C42E03124A1E@umich.edu> Hi all, My latest blog post on (mostly) last week's work is here: http://csw.github.com/bioruby-maf/blog/2012/06/04/indexed_maf_access/ Highlights include SQLite vs. Kyoto Cabinet, the path to BGZF support, and the challenges of supporting multiple Ruby implementations. 
Clayton Wheeler cswh at umich.edu From casbon at gmail.com Wed Jun 6 05:39:12 2012 From: casbon at gmail.com (James Casbon) Date: Wed, 6 Jun 2012 10:39:12 +0100 Subject: [GSoC] [Biopython-dev] GSoC python variant update 4 In-Reply-To: <87haurjc74.fsf@fastmail.fm> References: <87haurjc74.fsf@fastmail.fm> Message-ID: I'd be cautious about going for SQL for VCF backends. At least the following two problems arise: 1. VCF isn't a format, it's a meta-format so there isn't really a single data representation, but many. You are going to need a very flexible schema to allow variable records with complex entries like lists. (An entry is dynamically defined by the FORMAT field in each row, right?). Having a JSON misc entry means you lose all query abilities on these data anyway. 2. If you move your data away from VCF, you cannot use tools from outside your universe. i.e. lets say you want to use a GATK variant annotator, you need to do the roundtrip from SQL->VCF->SQL. I speak having developed this approach already and largely abandoned it due to the problems above. You are right that SQL would be a better solution for data index and access (no serialization issues, multiple tuned indexes), but be careful that you may spend a lot of time and not have a lot to show. I would really like it if biology used existing binary formats (HDF5 anyone?), but we don't. More practical use right now would be bcf support. From w.arindrarto at gmail.com Wed Jun 6 14:22:26 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 6 Jun 2012 20:22:26 +0200 Subject: [GSoC] GSoC Project Update -- 5 Message-ID: Hi everyone, I just posted another update on my GSoC project here: http://bow.web.id/blog/2012/06/hello-indexers/ A brief summary: * I added the SearchIO indexing functions, with the same interface as SeqIO's indexing functions. It currently supports all the available SearchIO parsers (blast-tab, blast-xml, and hmmer-text). 
* (not mentioned in the post) I did some refactoring to the SearchIO code base. It was starting to get a bit messy, but now it's cleaner. All the parsers are now implemented as classes. For some of them, users can use it directly to tweak its behavior (e.g. the blast-tab parser can be used to parse plain blast-tab files with custom column ordering. This is not possible if users use SearchIO.parse or SearchIO.read instead). Additionally, I should also mention that my schedule has been changed slightly. The original plan for next week was to focus on hmmer-text indexing. However, since it has been done (except for the testing, which should not take a week), I will be focusing on writing the SearchIO converters. So expect to see that instead. That's all for now :). regards, Bow From lomereiter at googlemail.com Mon Jun 11 13:25:48 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 11 Jun 2012 21:25:48 +0400 Subject: [GSoC] weekly report #4 Message-ID: Hello everybody, here's my weekly report: http://lomereiter.wordpress.com/2012/06/11/gsoc-weekly-report-4/ I've added BAM output support (not parallelized yet) and alignment creation/modification - changing fields, adding tags, and replacing existing ones. Thus, the library has a lot of features at the moment, and I started documenting them on github wiki. Also I found out that there's a great tool in DMD distribution, called rdmd, which allows to execute D files as scripts, by just adding "#!/usr/bin/rdmd" at the top. It will automatically compile all needed files and run executable. That dramatically simplifies library usage, no need to write cumbersome makefiles. The examples are at https://github.com/lomereiter/BAMread/wiki/Getting-started You can try to write your own script if you wish, follow the instructions in the wiki. Also, as my library now is able to write BAM, the current project title is quite misleading. 
So I'd like to hear suggestions on renaming :) -- Artem From p.j.a.cock at googlemail.com Mon Jun 11 13:41:39 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 11 Jun 2012 18:41:39 +0100 Subject: [GSoC] weekly report #4 In-Reply-To: References: Message-ID: On Mon, Jun 11, 2012 at 6:25 PM, Artem Tarasov wrote: > Hello everybody, > > here's my weekly report: > http://lomereiter.wordpress.com/2012/06/11/gsoc-weekly-report-4/ > > ... > > Also, as my library now is able to write BAM, the current project title is > quite misleading. > So I'd like to hear suggestions on renaming :) As to the name, how about damtools (D alignment/map tools), "for dealing with the flood of sequence data" (dam as in reservoir). Peter From cjfields at illinois.edu Mon Jun 11 13:46:43 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 11 Jun 2012 17:46:43 +0000 Subject: [GSoC] weekly report #4 In-Reply-To: References: Message-ID: <67FF495D-E8AD-4920-9EA8-6464E1310FBB@illinois.edu> On Jun 11, 2012, at 12:41 PM, Peter Cock wrote: > On Mon, Jun 11, 2012 at 6:25 PM, Artem Tarasov > wrote: >> Hello everybody, >> >> here's my weekly report: >> http://lomereiter.wordpress.com/2012/06/11/gsoc-weekly-report-4/ >> >> ... >> >> Also, as my library now is able to write BAM, the current project title is >> quite misleading. >> So I'd like to hear suggestions on renaming :) > > As to the name, how about damtools (D alignment/map tools), > "for dealing with the flood of sequence data" (dam as in reservoir). > > Peter Or 'damn, look how much work we have to do' chris From lomereiter at googlemail.com Mon Jun 11 14:47:48 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 11 Jun 2012 22:47:48 +0400 Subject: [GSoC] weekly report #4 In-Reply-To: References: Message-ID: No, thanks... I'll call it libsambamba. In suahili, sambamba means 'parallel' ( http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? 
:) On Mon, Jun 11, 2012 at 9:41 PM, Peter Cock wrote: > > As to the name, how about damtools (D alignment/map tools), > "for dealing with the flood of sequence data" (dam as in reservoir). > > Peter > From p.j.a.cock at googlemail.com Mon Jun 11 14:59:38 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 11 Jun 2012 19:59:38 +0100 Subject: [GSoC] weekly report #4 In-Reply-To: <20120611185718.GA12417@thebird.nl> References: <20120611185718.GA12417@thebird.nl> Message-ID: On Monday, June 11, 2012, Pjotr Prins wrote: > On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: > > No, thanks... > > > > I'll call it libsambamba. In suahili, sambamba means 'parallel' ( > > http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) > > I like it mbwana. > > Pj. > As the mentor, you'd be the mbwana or the bwana (boss), not Artem. But I do like lib-sambamba as a name - very clever. Peter From cjfields at illinois.edu Mon Jun 11 15:19:18 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 11 Jun 2012 19:19:18 +0000 Subject: [GSoC] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: > On Monday, June 11, 2012, Pjotr Prins wrote: > >> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>> No, thanks... >>> >>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >> >> I like it mbwana. >> >> Pj. >> > > As the mentor, you'd be the mbwana or the bwana (boss), not Artem. > > But I do like lib-sambamba as a name - very clever. > > Peter Agreed, fits very well. 
chris From pjotr.public14 at thebird.nl Mon Jun 11 14:57:18 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 11 Jun 2012 20:57:18 +0200 Subject: [GSoC] [BioRuby] weekly report #4 In-Reply-To: References: Message-ID: <20120611185718.GA12417@thebird.nl> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: > No, thanks... > > I'll call it libsambamba. In suahili, sambamba means 'parallel' ( > http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) I like it mbwana. Pj. From georgkam at gmail.com Mon Jun 11 15:28:42 2012 From: georgkam at gmail.com (George Githinji) Date: Mon, 11 Jun 2012 22:28:42 +0300 Subject: [GSoC] [BioRuby] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: Good tribute to swahili! ahsante sana bwana Artem! (Thank you very much for the suggestion) Sambamba could also mean correct way or the right thing in everyday speak.. (bwana is a term of respect or honour, though it also refers to a boss .. mostly we use 'mkubwa' to mean boss) George On Mon, Jun 11, 2012 at 10:19 PM, Fields, Christopher J wrote: > On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: > >> On Monday, June 11, 2012, Pjotr Prins wrote: >> >>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>>> No, thanks... >>>> >>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >>> >>> I like it mbwana. >>> >>> Pj. >>> >> >> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. >> >> But I do like lib-sambamba as a name - very clever. >> >> Peter > > Agreed, fits very well. 
> > chris > > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby -- --------------- Sincerely George Skype: george_g2 Blog: http://biorelated.wordpress.com/ Twitter: http://twitter.com/#!/george_l From cjfields at illinois.edu Mon Jun 11 15:36:44 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 11 Jun 2012 19:36:44 +0000 Subject: [GSoC] [BioRuby] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: <114DEA27-A766-4F0F-8144-098FF0905E1D@illinois.edu> heh, which makes me think you don't respect your bosses :) chris On Jun 11, 2012, at 2:28 PM, George Githinji wrote: > ...(bwana is a term of respect or honour, though it also refers to a boss > .. mostly we use 'mkubwa' to mean boss) > > George > > > On Mon, Jun 11, 2012 at 10:19 PM, Fields, Christopher J > wrote: >> On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: >> >>> On Monday, June 11, 2012, Pjotr Prins wrote: >>> >>>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>>>> No, thanks... >>>>> >>>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >>>> >>>> I like it mbwana. >>>> >>>> Pj. >>>> >>> >>> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. >>> >>> But I do like lib-sambamba as a name - very clever. >>> >>> Peter >> >> Agreed, fits very well. 
>> >> chris >> >> >> _______________________________________________ >> BioRuby Project - http://www.bioruby.org/ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby > > > > -- > --------------- > Sincerely > George > Skype: george_g2 > Blog: http://biorelated.wordpress.com/ > Twitter: http://twitter.com/#!/george_l From to.petr at gmail.com Mon Jun 11 16:35:59 2012 From: to.petr at gmail.com (P. Troshin) Date: Mon, 11 Jun 2012 21:35:59 +0100 Subject: [GSoC] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: None of my business but it's a bit unwieldy. It may be clever, but 99% people who come across it would not know. mbwana is simpler in that respect. Sorry for spoiling the consensus :-( P. On 11 June 2012 20:19, Fields, Christopher J wrote: > On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: > >> On Monday, June 11, 2012, Pjotr Prins wrote: >> >>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>>> No, thanks... >>>> >>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >>> >>> I like it mbwana. >>> >>> Pj. >>> >> >> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. >> >> But I do like lib-sambamba as a name - very clever. >> >> Peter > > Agreed, fits very well. 
> > chris > > > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From marian.povolny at gmail.com Mon Jun 11 16:52:05 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Mon, 11 Jun 2012 22:52:05 +0200 Subject: [GSoC] GSoC weekly status report No.3 Message-ID: http://blog.mpthecoder.com/post/24904798973/gsoc-weekly-status-report-no-3 My first report as a Master of Computer Engineering and Communications :) Here is a list with what I've been working on the last week: more cleanup and refactoring validation code, README etc, made a validation utility in D, which simply reports problems found to stderr, made a benchmark tool with -v option for measuring parser speed with and without validation, after having a basic benchmark tool, found a few places which were very bad for performance. After fixing that code, parsing a 233MB GFF3 file on a five year old PC took 6 seconds, but without validation, and with only a single thread, and replacing escaped characters turned off, made replacing escaped characters optional, because the current implementation requires creation of additional string objects to do that, which has a big impact on performance. There is a plan for making it faster, but is scheduled for later, added minimal parallelisation, by reading the file in a separate thread. Two additional days were spent on a segmentation fault in the D garbage collector which occurred when parsing a big file with a lot of errors. That should never happen, as I'm using the safe part of the D language, that is no pointers or anything similar. The worst that should happen is an exception. But a segmentation fault points to an error in either the compiler, the runtime or support library. The minimum reproducible example is still 42 lines long: https://gist.github.com/2911818 but changing anything in it makes the segmentation fault go away.
More info on this topic can be found in the discussion here: https://github.com/mamarjan/bioruby-hpc-gff3/issues/31 I'll be probably posting a bug report on the Dlang webpage tomorrow. For the coming week I would like to add more parallelisation, change the validation code so that exceptions almost never happen (and the seg fault also) and add support for merging records into features. -- Marjan From cswh at umich.edu Mon Jun 11 16:56:02 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Mon, 11 Jun 2012 16:56:02 -0400 Subject: [GSoC] GSoC weekly status report: MAF filtering Message-ID: Hi all, Here's my status report on last week's work: http://csw.github.com/bioruby-maf/blog/2012/06/09/filtering-work/ Highlights: mainly MAF alignment block filtering and performance challenges with binary data in Ruby. Clayton Wheeler cswh at umich.edu From pjotr.public14 at thebird.nl Tue Jun 12 03:05:18 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 12 Jun 2012 09:05:18 +0200 Subject: [GSoC] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: <20120612070518.GA14848@thebird.nl> sam-bam-baah has the comment of sheep in it. May explain consensus :) How about sambamba tools. Less unwieldy. On Mon, Jun 11, 2012 at 09:35:59PM +0100, P. Troshin wrote: > None of my business but it's a bit unwieldy. It may be clever, but 99% > people who come across it would not know. mbwana is simpler in that > respect. Sorry for spoiling the consensus :-( > > P. > > > > On 11 June 2012 20:19, Fields, Christopher J wrote: > > On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: > > > >> On Monday, June 11, 2012, Pjotr Prins wrote: > >> > >>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: > >>>> No, thanks... > >>>> > >>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( > >>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) > >>> > >>> I like it mbwana. > >>> > >>> Pj.
> >>> > >> > >> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. > >> > >> But I do like lib-sambamba as a name - very clever. > >> > >> Peter > > > > Agreed, fits very well. > > > > chris > > > > > > _______________________________________________ > > GSoC mailing list > > GSoC at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/gsoc > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From to.petr at gmail.com Tue Jun 12 12:48:40 2012 From: to.petr at gmail.com (P. Troshin) Date: Tue, 12 Jun 2012 17:48:40 +0100 Subject: [GSoC] weekly report #4 In-Reply-To: <20120612070518.GA14848@thebird.nl> References: <20120611185718.GA12417@thebird.nl> <20120612070518.GA14848@thebird.nl> Message-ID: > How about sambamba tools. Less unwieldy. I think its better, but it's not particularly catchy but please ignore me, it is so much easier to critique, when to come up with a really good name. P. On 12 June 2012 08:05, Pjotr Prins wrote: > sam-bam-baah has the comment of sheep in it. May explain consensus :) > > How about sambamba tools. Less unwieldy. > > On Mon, Jun 11, 2012 at 09:35:59PM +0100, P. Troshin wrote: >> None of my business but it's a bit unwieldy. It may be clever, but 99% >> people who come across it would not know. mbwana is simpler in that >> respect. Sorry for spoiling the consensus :-( >> >> P. >> >> >> >> On 11 June 2012 20:19, Fields, Christopher J wrote: >> > On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: >> > >> >> On Monday, June 11, 2012, Pjotr Prins wrote: >> >> >> >>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >> >>>> No, thanks... >> >>>> >> >>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >> >>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >> >>> >> >>> I like it mbwana. >> >>> >> >>> Pj. 
>> >>> >> >> >> >> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. >> >> >> >> But I do like lib-sambamba as a name - very clever. >> >> >> >> Peter >> > >> > Agreed, fits very well. >> > >> > chris >> > >> > >> > _______________________________________________ >> > GSoC mailing list >> > GSoC at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/gsoc >> _______________________________________________ >> GSoC mailing list >> GSoC at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/gsoc From to.petr at gmail.com Tue Jun 12 13:13:11 2012 From: to.petr at gmail.com (P. Troshin) Date: Tue, 12 Jun 2012 18:13:11 +0100 Subject: [GSoC] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> <20120612070518.GA14848@thebird.nl> Message-ID: Also I cannot help it but the Samba file server is the first thing that comes to my mind when I hear sambamba. I suspect it may be just my techy background... I think that my favorite so far is damtools, because its terribly close to samtools, which is where you want to be I guess. However, I can see why its not everybody's favorite and unfortunately there are already at least a few other damtools around http://code.google.com/p/dam-tools/, http://home.earthlink.net/~matthewjheaney/damtools/index.html P. On 12 June 2012 17:48, P. Troshin wrote: >> How about sambamba tools. Less unwieldy. > > I think its better, but it's not particularly catchy but please ignore > me, it is so much easier to critique, when to come up with a really > good name. > > P. > > > On 12 June 2012 08:05, Pjotr Prins wrote: >> sam-bam-baah has the comment of sheep in it. May explain consensus :) >> >> How about sambamba tools. Less unwieldy. >> >> On Mon, Jun 11, 2012 at 09:35:59PM +0100, P. Troshin wrote: >>> None of my business but it's a bit unwieldy. It may be clever, but 99% >>> people who come across it would not know. mbwana is simpler in that >>> respect. 
Sorry for spoiling the consensus :-( >>> >>> P. >>> >>> >>> >>> On 11 June 2012 20:19, Fields, Christopher J wrote: >>> > On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: >>> > >>> >> On Monday, June 11, 2012, Pjotr Prins wrote: >>> >> >>> >>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>> >>>> No, thanks... >>> >>>> >>> >>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>> >>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >>> >>> >>> >>> I like it mbwana. >>> >>> >>> >>> Pj. >>> >>> >>> >> >>> >> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. >>> >> >>> >> But I do like lib-sambamba as a name - very clever. >>> >> >>> >> Peter >>> > >>> > Agreed, fits very well. >>> > >>> > chris >>> > >>> > >>> > _______________________________________________ >>> > GSoC mailing list >>> > GSoC at lists.open-bio.org >>> > http://lists.open-bio.org/mailman/listinfo/gsoc >>> _______________________________________________ >>> GSoC mailing list >>> GSoC at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/gsoc From w.arindrarto at gmail.com Wed Jun 13 19:33:52 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 14 Jun 2012 01:33:52 +0200 Subject: [GSoC] GSoC Project Update -- 6 Message-ID: Hi everyone, It's a bit late than usual, but I've finally finished my update for the past week: http://bow.web.id/blog/2012/06/round-trip-with-searchio/ As a summary: 1. SearchIO now has a write and convert function that outputs to BLAST XML and tabular files. 2. The two main container objects QueryResult and Hit now has their own filter() and map() functions similar to Python's built-in filter and map. For QueryResult objects, there are hit_filter, hsp_filter, hit_map, and hsp_map functions and for Hit objects we have filter and map. Filter functions accept a boolean function with either Hit or HSP as its argument, while map accepts a function that must return either Hit or HSP objects. 
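A rough sketch of the filter/map idea described above, for readers who want something concrete. The Hit and QueryResult classes below are stand-ins written for this illustration, and the evalue attribute is an assumption; this is not the actual SearchIO code or its API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Hit:
    # Stand-in for a search hit; 'evalue' is an illustrative attribute only.
    id: str
    evalue: float


@dataclass
class QueryResult:
    # Stand-in container holding the hits of one query.
    id: str
    hits: List[Hit] = field(default_factory=list)

    def hit_filter(self, func: Callable[[Hit], bool]) -> "QueryResult":
        # Keep only the hits for which func returns True.
        return QueryResult(self.id, [h for h in self.hits if func(h)])

    def hit_map(self, func: Callable[[Hit], Hit]) -> "QueryResult":
        # Apply func to every hit; func is expected to return a Hit.
        return QueryResult(self.id, [func(h) for h in self.hits])


qresult = QueryResult("query1", [Hit("a", 1e-30), Hit("b", 0.5), Hit("c", 1e-5)])

# Boolean function in, filtered container out (the original is left untouched).
significant = qresult.hit_filter(lambda hit: hit.evalue < 1e-3)

# Hit-returning function in, transformed container out.
renamed = significant.hit_map(lambda hit: Hit("gi|" + hit.id, hit.evalue))
```

Both methods return a new container here, which mirrors how Python's built-in filter and map leave their input alone; whether the real objects behave this way is not stated above.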
I wrote a short demo on my post to make this a bit clearer and show what it can help users do. 3. (not mentioned in the post) is more tweaks and tests to the existing functionalities, especially indexing. That's all for now :). regards, Bow From arklenna at gmail.com Mon Jun 18 00:21:42 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 18 Jun 2012 00:21:42 -0400 Subject: [GSoC] GSoC python variant update 5 Message-ID: Latest post: http://arklenna.tumblr.com/post/25343434817/ James raised some [concerns](http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009688.html) about the difficulty of representing the VCF "metaformat" in SQL. I've taken these into consideration and am forging ahead. So far, some of the types of data fit more neatly into SQL than into a VCF row. I have redesigned my SQL schema with a two-pronged approach to tackle the flexibility of VCF: 1. For the site, alt, and genotype tables, there are columns for the reserved info/format keywords in the VCF spec (so far only for non-SV). 2. For new info and format keywords (both in the header and in the body), I am storing the values in a "narrow table." This table stores a foreign key to the key's row and the key-value pair. The narrow table is also good for storing reserved keys that are lists (but not per-allele or per-genotype). Note: this diagram only has the FKs listed for simplicity. (SQL diagram) Interestingly, despite the increase in the number of tables and thus insert statements, the current script is considerably faster than the previous version. Evidently JSON serialization is slow. There are a few things I haven't figured out: 1. Can an info field be per-genotype? The spec implies that wouldn't make sense, but doesn't forbid it. 2. Is there a safe way to find out if a VCF 4.0 field is per-allele or per-genotype? 3. Will my SQL representation be able to handle SV? ======= I'll be out of town for the next week but I will have plenty of time for Python. 
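To make the "narrow table" idea from the update above a little more concrete, here is a minimal sketch using Python's sqlite3 module. The table and column names (site, info_key, site_info) and the example key MYKEY are invented for this illustration and are not taken from the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (
    id    INTEGER PRIMARY KEY,
    chrom TEXT NOT NULL,
    pos   INTEGER NOT NULL,
    dp    INTEGER            -- a reserved INFO keyword gets its own column
);
CREATE TABLE info_key (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- Narrow table: one row per (site, key, value), so new keys need no new columns.
CREATE TABLE site_info (
    site_id INTEGER NOT NULL REFERENCES site(id),
    key_id  INTEGER NOT NULL REFERENCES info_key(id),
    value   TEXT NOT NULL
);
""")

conn.execute("INSERT INTO site (id, chrom, pos, dp) VALUES (1, '20', 14370, 14)")
conn.execute("INSERT INTO info_key (id, name) VALUES (1, 'MYKEY')")

# A list-valued key is stored as several rows sharing the same key.
conn.executemany(
    "INSERT INTO site_info (site_id, key_id, value) VALUES (?, ?, ?)",
    [(1, 1, "x"), (1, 1, "y")],
)

rows = conn.execute(
    """
    SELECT k.name, i.value
    FROM site_info AS i
    JOIN info_key AS k ON k.id = i.key_id
    WHERE i.site_id = ?
    ORDER BY i.rowid
    """,
    (1,),
).fetchall()
```

The trade-off in a layout like this is that every value in the narrow table is stored as text, so typed comparisons need a cast or a per-key type recorded alongside the key name.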
From marian.povolny at gmail.com Mon Jun 18 14:28:12 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Mon, 18 Jun 2012 20:28:12 +0200 Subject: [GSoC] GSoC weekly status report No.4 Message-ID: http://blog.mpthecoder.com/post/25375170121/gsoc-weekly-status-report-no-4 During the last week combining records into features has been added, and also connecting the features into parent-child relationships. Validation messages have been enhanced with file names and line numbers, and now look like errors reported by a compiler. Feels most natural to me. Combining the features into records works by keeping a forward cache of a number of features (1000 by default, configurable). That means that the parsing results will be correct only if records which are part of the same feature are at most 1000 features from each other, or the amount of features set. The first implementation which was comparing the IDs of records required 10min for a 233MB file. After switching to first comparing hash values of IDs instead, and only if they match comparing the IDs, the parsing time was down to 45s. After fixing a bug, the time is now 10 seconds for the 233MB m_hapla file :) Linking the features into parent-child relationships works similarly, by using 32-bit hashes most of the time instead of comparing strings. With this functionality turned on, the same file is parsed in 13 seconds. All the measurements have been done using the benchmark utility, which has a few more options for setting what should be run. Otherwise I did more refactoring, moved all the gff3_* files into a gff3 directory, so the D modules are now bio.gff3.*, parsing functions are now static methods of GFF3File and GFF3Data classes, etc. For the new week, I would like to add filtering to the D library, which I can then use to implement iteration over genes, mRNAs, CDS features, etc. 
After that the library should be pretty much complete feature-wise, at least per what was promised in the project proposal, so I'll continue by defining the C API and developing the Ruby gem. -- Marjan From chapmanb at 50mail.com Mon Jun 18 20:28:11 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 18 Jun 2012 20:28:11 -0400 Subject: [GSoC] [Biopython-dev] GSoC python variant update 5 In-Reply-To: References: Message-ID: <87k3z4xi04.fsf@fastmail.fm> Lenna; Thanks for the update. I've been following the commits on GitHub and looks like you're getting some traction with the SQL representation. I do worry about it for some of the same reasons as James but happy to have you take a look if it helps with your understanding of VCF. I think it might also be worth thinking of some use cases that are not well covered with the current PyVCF parser and seeing if your representation tackles them better. One current one that is tough is slicing a VCF file by sample. Row based slicing is well supported but column based is not as easy. If I had a, say, 50 sample file: how well does it allow pulling out the genotypes and records from a single sample and re-writing as VCF. Can you code up this type of workflow with your current representation? For your specific questions: > 1. Can an info field be per-genotype? The spec implies that wouldn't > make sense, but doesn't forbid it. The INFO key/values are per-variant. There are also arbitrary per-genotype key/values allowed, specified in the FORMAT field. > 2. Is there a safe way to find out if a VCF 4.0 field is per-allele or > per-genotype? This should be the INFO/FORMAT distinction I described above. > 3. Will my SQL representation be able to handle SV? VCF encodes structural variation information into the INFO metadata, so as long as you support the structural variant specified ALT fields it should fit.
The longer term question is if you want to support more explicit linking between distant breakends, which would require special support. I think that's probably more of an end-of-the-summer goal, however, since most people aren't yet doing tons of VCF structural variation work. Brad From lomereiter at googlemail.com Tue Jun 19 04:25:07 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Tue, 19 Jun 2012 12:25:07 +0400 Subject: [GSoC] weekly report #5 Message-ID: Hi all, I wrote a few words about improvements in my project during the past week: http://lomereiter.wordpress.com/2012/06/19/gsoc-weekly-report-5/ - More wiki content on Github, with examples of how to use the library for common cases. - Faster conversion to SAM, now it's not worse than samtools in this respect - Parallelized BGZF compression, though it was relatively easy to add - Reconsidering interaction with dynamic languages due to shared library issues in D. Now I'm thinking of an approach of making command-line tools outputting JSON and wrapping them. At least in BioRuby we have Bio::Command to make this process easy. - Progress in SAM parsing - valid records are now fully parsed, and it takes just 300 lines of D/Ragel mix, together with some unittests. Also, Ragel provides some convenient methods to handle errors, but I haven't investigated them yet. Once error handling is added, the branch will be ready to be merged, and then I'll add SAM reading. -- Artem From reece at harts.net Tue Jun 19 08:51:26 2012 From: reece at harts.net (Reece Hart) Date: Tue, 19 Jun 2012 05:51:26 -0700 Subject: [GSoC] [Biopython-dev] GSoC python variant update 5 In-Reply-To: References: Message-ID: On Sun, Jun 17, 2012 at 9:21 PM, Lenna Peterson wrote: > Latest post: http://arklenna.tumblr.com/post/25343434817/ > Hi Lenna- Thanks for making the time to update your blog. As with James and Brad, I doubt the suitability of SQL for this project. 
However, I learn things when I'm wrong, so this should work out either way! I don't understand your "SQL diagram" (more properly, an "entity-relationship diagram"). It would help me -- and perhaps you too -- to provide more detail in the ERD and then to parse a few lines from a VCF file into your schema by hand (e.g., as a set of tsv files or Google doc spreadsheets). It's also worthwhile to look at other people's schemas for similar data. http://www.ensembl.org/info/docs/variation/variation-database-schema.pdf is a good place to start. In any case, VCF parsing is merely a specialized embodiment of general variant representation, which is the primary goal for this project. Therefore, it would be worthwhile now to test whatever scheme you propose against other formats (GFF and HGVS have been discussed). I don't mean that you should implement now, but rather just make sure that you're heading in a direction that's compatible with other planned uses. -Reece From reece at harts.net Tue Jun 19 08:52:33 2012 From: reece at harts.net (Reece Hart) Date: Tue, 19 Jun 2012 05:52:33 -0700 Subject: [GSoC] [Biopython-dev] GSoC python variant update 5 In-Reply-To: References: Message-ID: On Tue, Jun 19, 2012 at 5:51 AM, Reece Hart wrote: > (GFF and HGVS have been discussed Ooops. I meant GVF, but the point is the same. From chris.mit7 at gmail.com Tue Jun 19 12:23:01 2012 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Tue, 19 Jun 2012 12:23:01 -0400 Subject: [GSoC] [Biopython-dev] GSoC python variant update 5 In-Reply-To: References: Message-ID: Lenna, One concern I had, which may be avoided by your schema of using narrow-tables, is how well your current structure can support the inevitable updates to the VCF format. It may show my inexperience with SQL, but is a SQL backend flexible enough to adopt new conventions while also maintaining backwards compatibility? Also, from a usage standpoint -- I wouldn't want to have a vcf file and a database file on my drive. 
It would be redundant for me. It may just be my style, but I usually sieve the useful information out of a vcf file into several smaller specific vcf files. Really what a vcf parser does is make your output more concise. I wouldn't want then another .db file for each time I wanted to parse my vcf file into a smaller chunk. Additionally, any time you gained in filtering by using a SQL backend may be negligible when the user gets to this stage. The file sizes will be substantially smaller. In short, I think you might be over-engineering this. Keeping a SQL backend is going to require indexing after updates (how long will this take, and is the time comparable to using pure python?, you also have the issue where SQL decides to ignore your index...), and writing queries that may be optimal for some usage cases and poor in others. You may have thought about these concerns, and I don't mean to deter your efforts, you may be a SQL guru for all I know (I also just may be biased from how I operate). Chris On Tue, Jun 19, 2012 at 8:52 AM, Reece Hart wrote: > On Tue, Jun 19, 2012 at 5:51 AM, Reece Hart wrote: > > > (GFF and HGVS have been discussed > > > Ooops. I meant GVF, but the point is the same. > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From w.arindrarto at gmail.com Thu Jun 21 03:09:40 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 21 Jun 2012 09:09:40 +0200 Subject: [GSoC] GSoC Project Update -- 7 Message-ID: Hi everyone, I've just posted another update for the past week here: http://bow.web.id/blog/2012/06/new-parsers-for-new-week/ In short, here is what I did: * Added parsing, indexing, and writing support for two new formats (along with their tests): the HMMER table output (hmmer-tab) and the HMMER domain table output (hmmscan-domtab, hmmsearch-domtab, or phmmer-domtab).
There is a small issue which prevents the HMMER domain table format to be simply named hmmer-domtab, which I discuss in the post. * (not mentioned in the post) Added more tests for writing and indexing, and refactored some of the existing code. As mentioned in the post, since the core SearchIO API functions are now complete, for the coming weeks I will focus on adding more formats to support, improving the code, and of course add more tests and documentation. regards, Bow From cswh at umich.edu Fri Jun 22 00:25:40 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Fri, 22 Jun 2012 00:25:40 -0400 Subject: [GSoC] GSoC weekly status report: parallel I/O and JRuby Message-ID: <656B7BDD-DD0E-40FE-91AF-DC23113427D5@umich.edu> Hi all, This week's status report is a double feature: http://csw.github.com/bioruby-maf/blog/2012/06/13/jruby_support_and_performance_work/ http://csw.github.com/bioruby-maf/blog/2012/06/21/parallel_io/ In short, I now have JRuby fully supported by my MAF code, including the Kyoto Cabinet components. Using JRuby, I've been able to deliver very solid performance for index-driven random access parsing as well as for sequential whole-file parsing. Clayton Wheeler cswh at umich.edu From marian.povolny at gmail.com Mon Jun 25 16:38:10 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Mon, 25 Jun 2012 22:38:10 +0200 Subject: [GSoC] GSoC weekly status report No.5 Message-ID: http://blog.mpthecoder.com/post/25870737554/gsoc-weekly-status-report-no-5 *Summary of the last week* During the last week a few improvements have been made: - the validation messages have been improved with file names and line number, in the compiler error style, - filtering has been added, - replacing escaped characters has been re-implemented to get a huge performance improvement. 
The 1GB file that required 10min for parsing because of 6.5 million escaped characters, is now parsed in 22.5 seconds, only 0.5 more compared with when replacing them is turned off, - added a tool for correctly counting features in a GFF3 file. This will be useful because the user can then find a good value for the feature cache size by using this tool to get the correct count and the benchmark tool to get the count for a particular cache size. The tool is still slow for some files, so I'm thinking about how to improve that, - other small fixes, comments and similar... *More on filtering* The filtering was first implemented using classes, but later refactored using delegates instead. The result was 50 lines less code. The user can now specify a filter before parsing a file like this: GFF3File.parse_by_records("file.gff3", NO_VALIDATION, false, NO_BEFORE_FILTER, OR(ATTRIBUTE("ID", EQUALS("1")), ATTRIBUTE("ID", CONTAINS("2")))); The first filter which is set to none in this example is the filter before the line is parsed, that means that the filter doesn't support ATTRIBUTE and FIELD predicates. The following predicates are implemented: FIELD, ATTRIBUTE, EQUALS, CONTAINS, STARTS_WITH, AND, OR, NOT. In case they're used in a way which is not allowed, there will be a compiler error. Otherwise the allowed combinations should be logical enough to guess (but I'll document them too). I altered the benchmark tool a few times to test the performance, and what I found was very positive, the performance impact in the few tests I did was very small. I'll have more data once the next tool is finished. *New week* Release early and often - it's a mantra I heard quite a few times before. So as the group of mentors and students has agreed, every student will be releasing a gem at the end of this week. I'm still not sure what will be in it, because the support for shared libraries in D compilers for Linux has not been implemented yet.
So it will probably be a combination of a command-line utility and a Ruby module which uses that utility. What I have currently in mind is re-implementing the gff3-fetch utility developed by Pjotr in Ruby, to make it faster using D. But first I'll implement filtering functionality for it, so the users can reduce a file to records which are interesting to them and then parse that using a parser in Ruby, for example. A Ruby module that would make using this utility easier for Ruby developers seems like a good idea for the first release. Part of this utility will be to support GFF3 output, so that will be implemented too (and has already been done today to some extent). From lomereiter at googlemail.com Tue Jun 26 11:45:21 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Tue, 26 Jun 2012 19:45:21 +0400 Subject: [GSoC] weekly report #6 Message-ID: Hello all, here's my weekly report: http://lomereiter.wordpress.com/2012/06/26/gsoc-weekly-report-6/ Summary: Ruby bindings moved to parsing JSON from command-line tool output, everything works fine. That also means you can use JSON output from other languages. SAM input was added. Not optimized at all, parser currently does a lot of unnecessary memory allocations. Now it's about 3x as slow as samtools one, but it should be easy to improve the speed (at least doubling is possible according to profiling results). Also there's now a command line tool called Sambamba, which is used for creating JSON output. But it also outputs SAM and accepts both SAM and BAM formats as an input. Options are mostly the same as for the samtools view command, including fetching regions with the same syntax, and some filtering (e.g. on quality).
From w.arindrarto at gmail.com Thu Jun 28 09:26:40 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 28 Jun 2012 15:26:40 +0200 Subject: [GSoC] GSoC Project Update -- 8 Message-ID: Hello everyone, Here is my update for the last week: http://bow.web.id/blog/2012/06/fasta-comes-to-searchio/ To summarize it here: * SearchIO now supports Fasta indexing and parsing. The code integrates some part of the FastaIO module in AlignIO, but with more new addition to enable parsing into SearchIO objects hierarchy. * Improved the text output of common SearchIO objects. The text outputs (using str() or print) are now easier to interpret. * (not mentioned in the blog post) Tests for Fasta parsing and indexing, along with more tests for the common objects. That's all I have for this week :). Next week, I will be adding more formats to support into the submodule. regards, Bow From arklenna at gmail.com Sat Jun 30 02:15:05 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Sat, 30 Jun 2012 02:15:05 -0400 Subject: [GSoC] GSoC python variant update 6 Message-ID: Post: http://arklenna.tumblr.com/post/26196310200/ I've made a new branch (variant2: https://github.com/lennax/biopython/tree/variant2 ) which has a very skeletal outline of a set of Python objects designed to store variants. One might note many similarities to the organization of PyVCF. One thing SQL did neatly was store per-allele data with the allele, rather than with the site, and I'm envisioning doing this in Python, as well. For a Python variant object, are there any organizational choices that would make it easier for future conversion of a variant to HGVS syntax? (this is primarily directed at Reece but I'm open to all suggestions) Another question that may reveal my complete ignorance of haplotypes and such: could a polyploid site ever be partially phased? e.g. a triploid genotype of 0/1|0? Looking forward to any and all questions, comments, concerns, etc. 
Lenna From chapmanb at 50mail.com Mon Jul 2 06:36:39 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 02 Jul 2012 06:36:39 -0400 Subject: [GSoC] GSoC python variant update 6 In-Reply-To: References: Message-ID: <874npqo3ew.fsf@fastmail.fm> Lenna; Thanks for the updates and thoughts. I like the direction you're moving after taking everything you've learned from the SQL experiments. My general suggestions would be: - Leverage PyVCF for all of the backend parsing. We want to remain compatible with this since merging/interfacing with the work James and everyone is doing is a primary goal. Keeping a similar code structure is a great way to facilitate this. - For HGVS the general idea is to not be too tied to the VCF format, so I wouldn't worry about strict compatibility but rather use it to inform choices where you feel that things are mirroring VCF structure rather than more general variant representation. > Another question that may reveal my complete ignorance of haplotypes > and such: could a polyploid site ever be partially phased? e.g. a > triploid genotype of 0/1|0? It's possible but this is kind of a fringe case right now so I wouldn't especially worry about it. Thanks again, Brad From lomereiter at gmail.com Tue Jul 3 12:40:40 2012 From: lomereiter at gmail.com (Artem Tarasov) Date: Tue, 3 Jul 2012 20:40:40 +0400 Subject: [GSoC] weekly report #7 Message-ID: Hi all, I wrote a blog post about the previous week: http://lomereiter.wordpress.com/2012/07/03/gsoc-weekly-report-7/ Highlights: First version of bioruby-sambamba gem is released on rubygems.org, but the installation process can be made much more convenient. Producing binaries for all common platforms and distributing them with platform-specific gems seems to be the best way to go. Also, I've done a lot of refactoring (however, a bit more is needed), and significantly improved speed of validation and SAM parsing. 
In July, I'm planning to implement indexing, sorting and merging BAM files, and also add filtering functionality to Ruby bindings. For the latter, I'm going to introduce a tiny query language so that command-line tools will be able to parse it, and bindings will have some filter classes with a method to generate a query string like. From w.arindrarto at gmail.com Wed Jul 4 09:03:01 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 4 Jul 2012 15:03:01 +0200 Subject: [GSoC] GSoC Project Update -- 9 Message-ID: Hello everyone, The past week I have been working to add PSL parsing support and I've just posted my update here: http://bow.web.id/blog/2012/07/initial-blat-support/ Currently, we have parsing, indexing, and writing support. But this could change (writing might not be supported) due to a possible change in the current object model. I've explained a bit on why this is the case in the post, but to summarize it here, it's because we haven't got a way to properly model segmented HSP sequences. Peter and I have discussed this a bit, but we haven't figured out an elegant way to solve it for now. Aside from working on PSL, I also added more tests and started refactoring the code as it's starting to get messy. That's all my update for the past week. For this week, I'll try to look into other formats and try to come up with possible solutions to the segmented HSP problem. regards, Bow From marian.povolny at gmail.com Wed Jul 4 14:56:54 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Wed, 4 Jul 2012 20:56:54 +0200 Subject: [GSoC] GSoC weekly status report No.6 and v0.1.0 Message-ID: http://blog.mpthecoder.com/post/26505431193/gsoc-weekly-status-report-no-6-and-v0-1-0 This post is a little bit late, but I wanted it to be the announcement of the first release, the v0.1.0 of gff3-pltools!
I've created a minimal website for this project, which can be found here: http://mamarjan.github.com/gff3-pltools/ There are links to binary gems for 32 and 64-bit Linux, a source package for other platforms, binary packages with the D tools only, and a link to the API docs for the Ruby library. Please read the blog post for more information, and the README for even more information. Best regards, Marjan From reece at harts.net Thu Jul 5 15:40:02 2012 From: reece at harts.net (Reece Hart) Date: Thu, 5 Jul 2012 12:40:02 -0700 Subject: [GSoC] GSoC python variant update 6 In-Reply-To: References: Message-ID: On Fri, Jun 29, 2012 at 11:15 PM, Lenna Peterson wrote: > For a Python variant object, are there any organizational choices that > would make it easier for future conversion of a variant to HGVS > syntax? (this is primarily directed at Reece but I'm open to all > suggestions) > Oh, no, things directed at me! That's a broad question. I'll try to answer without being long winded. The essential elements of a sequence variant are a reference to a sequence, the location, and specifics about the operation. The name, allelic depth, etc. are all distinct from these elements and I would store them separately in a format-specific record or as a subclass. I don't have much experience with FeatureLocations, but that might be appropriate. Depending on how far you plan to go with VCF, you'll have to deal with Locations for breakpoints. For the Occam's Razor version of a model for variation, I'd float this in the community: variation := And I'd test this against representing: - a single SNP in VCF - a compound het from VCF - a variant in RNA - a variant in CDS coords - a variant in a protein sequence - a trinucleotide repeat (Which the simple model above fails, BTW.)
What makes the uber variant problem hard, I think, is several competing design axes: 1) sequence type (DNA, RNA, protein), 2) coordinate systems (really, CDS in a transcript record), 3) diversity of variant types (SNV, indel, repeat, etc), 4) diversity of auxiliary data (e.g., genotype info from VCF). HGVS makes us think outside merely VCF data: in particular, it adds the nuance of coordinate systems and multiple sequence types. I suspect you should be considering mixins and/or subclassing for some of these needs. I don't know how to solve any of this complexity. What I do know is that 1) it's too much just for your project, 2) it would be nice to have a design that can be easily extended beyond your project, and 3) therefore, part of your project should be to pave the way for extensions without tackling them. It's also a good time to put stakes in the ground around internal conventions, such as variants are always represented using interbase coordinates (= 0-based, right-open). And, if you end up handling just VCF variants, that's cool too. -Reece From arklenna at gmail.com Mon Jul 9 00:33:57 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 9 Jul 2012 00:33:57 -0400 Subject: [GSoC] GSoC python variant update 7 Message-ID: Post: http://arklenna.tumblr.com/post/26812132902/ Synopsis: This week, I wrote a script for PyVCF that can filter a file by sample as it's being parsed. It's currently named `vcf_sample_filter.py`. It's designed to be functional from the command line, the Python interpreter, or as a module. Next up: come up with a generic-via-extensibility representation of a variant. I'm working through some examples and should have a basic outline soon. 
Lenna From cswh at umich.edu Mon Jul 9 23:21:40 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Mon, 9 Jul 2012 23:21:40 -0400 Subject: [GSoC] bio-maf 0.2.0 (and Kyoto Cabinet gem for JRuby) Message-ID: Hi all, I've released version 0.2.0 of bio-maf for BioRuby: http://csw.github.com/bioruby-maf/blog/2012/07/09/bio-maf_0.2.0/ Notably, this release includes removal of gaps remaining after filtering out sequences, and 'tiling' multiple alignment blocks together along with reference sequence data. Also, last week I released my Kyoto Cabinet support for JRuby as a separate gem. It's now approaching parity with the standard Ruby library for Kyoto Cabinet. http://csw.github.com/bioruby-maf/blog/2012/07/02/kyoto_cabinet_support_for_jruby/ Clayton Wheeler cswh at umich.edu From lomereiter at gmail.com Tue Jul 10 06:26:35 2012 From: lomereiter at gmail.com (Artem Tarasov) Date: Tue, 10 Jul 2012 14:26:35 +0400 Subject: [GSoC] weekly report #8 Message-ID: Hello all, here's the link to the report: http://lomereiter.wordpress.com/2012/07/10/gsoc-weekly-report-8/ Last week I implemented producing BAI files, and my tool sambamba-index exploits parallelism and thus is faster than samtools on multicore machines. Now I'm working on sorting; a basic version already works, but memory consumption should be improved. In fact, at least for HDDs, the time of indexing and sorting is bounded by I/O speed, not the number of CPUs. So for sorting I need to tweak the sizes of read/write buffers in order to get maximum performance. By the end of this week, I'm also going to make a utility for merging several sorted BAM files into one.
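Merging several sorted files into one is a k-way merge. The following is a minimal Python sketch of that idea, not the sambamba implementation. It assumes each input is an iterable of hypothetical `(ref_id, position, read_name)` tuples already sorted by reference and position, and it leaves out real BAM decoding and header merging.

```python
import heapq


def merge_sorted(*streams):
    """Lazily merge already-sorted record streams into one sorted stream.

    Each record is an illustrative (ref_id, position, read_name) tuple.
    Only one pending record per input is held in memory at a time.
    """
    return heapq.merge(*streams, key=lambda rec: (rec[0], rec[1]))


file_a = [(0, 5, "r1"), (0, 50, "r2"), (1, 3, "r3")]
file_b = [(0, 7, "r4"), (1, 1, "r5")]
merged = list(merge_sorted(file_a, file_b))
```

Because the merge is lazy, memory use depends on the number of inputs rather than their size, which fits the observation above that the work is bounded by I/O rather than CPU.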
From marian.povolny at gmail.com Tue Jul 10 18:01:07 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Wed, 11 Jul 2012 00:01:07 +0200 Subject: [GSoC] GSoC weekly status report No.7 Message-ID: http://blog.mpthecoder.com/post/26930939671/gsoc-weekly-status-report-no-7 I was hoping to get more done over the weekend, but the internet connection was down, so I had to take the weekend off :) Otherwise I'm working toward the 0.2 version. The deadline is set for Saturday evening. What will be in it keeps changing, but for now there are new toString() and recursiveToString() methods in Feature class, and append_to(...) methods which accept an Appender object, for more efficient output. The utility for correctly counting features is now notably faster, and gff3-ffetch has a new option for passing FASTA data to output. Currently in planning are: support for new types of records (pragmas and comments), GDC support and Ruby interface for the validation utility. More could be added to this list, but I also have to make a plan for the second half of the summer, and that will take some time too. I was hoping to use the GDC which comes with Ubuntu 12.04, but I gave up on that because of some confusing errors I was receiving in the D stdlib. I will try to build the GDC directly from its GitHub repository and get my library to compile with it. Making man pages for binaries in gems is also a problem which currently has no elegant solution. I don't want to force my users to type "gem man command", so I'm planning to split the current repository into two: gff3-pltools in D and then the second repository for the Ruby library. The gff3-pltools would then receive a more traditional installation procedure and receive proper man pages.
-- Marjan From cswh at umich.edu Tue Jul 10 19:45:33 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Tue, 10 Jul 2012 19:45:33 -0400 Subject: [GSoC] Questions on next steps for MAF parsing for bio-maf Message-ID: <9343DE6C-EC59-480E-B746-396F08F36395@umich.edu> Hi all, In the course of working out my plan for the rest of my bio-maf project, I have come up with a few questions I'm not able to answer: https://github.com/csw/bioruby-maf/wiki/Questions * Is it useful to build indexes on other sequences besides the reference sequence? * Should the score field of an alignment block be zeroed or removed whenever the block is modified? * How, precisely, should selection based on features in GTF/GFF3 files work? * When converting a MAF Block/Sequence to bio-alignment representation, how should we handle quality metadata (from 'q' lines), which is tied to the actual sequence data and would need to be maintained in parallel if a column were deleted? * Is supporting the bx-python index format still desirable? Performance with Kyoto Cabinet indexes seems competitive, and the indexes are neither very large nor very expensive to build. * Blankenberg et al. mention this filtering mode: "removing blocks which have aligned species occurring between non-syntenic chromosomes or strands" which is unfortunately a bit cryptic. * Are coverage statistics useful or appropriate to provide? Any insight that you might be able to offer would be helpful. Thanks, Clayton Wheeler cswh at umich.edu From pjotr2012 at thebird.nl Wed Jul 11 05:25:17 2012 From: pjotr2012 at thebird.nl (Pjotr Prins) Date: Wed, 11 Jul 2012 11:25:17 +0200 Subject: [GSoC] Questions on next steps for MAF parsing for bio-maf In-Reply-To: <9343DE6C-EC59-480E-B746-396F08F36395@umich.edu> References: <9343DE6C-EC59-480E-B746-396F08F36395@umich.edu> Message-ID: <20120711092517.GA2827@thebird.nl> Hi Clayton and mentors, I think it would be extremely useful to get someone in who uses MAF in a pipeline. 
I know Raoul does, but we need more users. Anyone you know using MAF daily? Otherwise we should post on the Bio* lists. Same for GFF3 and Marjan. Anyone you know out there? Pj. From marian.povolny at gmail.com Mon Jul 16 13:16:12 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Mon, 16 Jul 2012 19:16:12 +0200 Subject: [GSoC] GSoC weekly status report No.8 Message-ID: http://blog.mpthecoder.com/post/27339349340/gsoc-weekly-status-report-no-8 Summary: The 0.2 version of gff3-pltools has been released, together with a Ruby gem bio-gff3-pltools. Binary and source packages can be downloaded from the following location: http://mamarjan.github.com/gff3-pltools/ On Wednesday I'll be traveling to Lodi for the EU-codefest, there I'll be presenting about the project and current GFF3 parser and tools performance. For the next release I would like to add parallelism to the parser. I'm also thinking about adding a new option to gff3-ffetch, which would let the user specify which fields and attributes to output in tab-separated columns. Best regards, Marjan From cjfields at illinois.edu Mon Jul 16 13:20:06 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 16 Jul 2012 17:20:06 +0000 Subject: [GSoC] GSoC weekly status report No.8 In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF2B63D4B5@CHIMBX5.ad.uillinois.edu> I'll try to be on IRC (#bioruby and #obf-soc) those days, I may have a few questions. chris From pjotr.public14 at thebird.nl Mon Jul 16 13:29:06 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 16 Jul 2012 19:29:06 +0200 Subject: [GSoC] GSoC weekly status report No.8 In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF2B63D4B5@CHIMBX5.ad.uillinois.edu> References: <118F034CF4C3EF48A96F86CE585B94BF2B63D4B5@CHIMBX5.ad.uillinois.edu> Message-ID: <20120716172906.GA20140@thebird.nl> On Mon, Jul 16, 2012 at 05:20:06PM +0000, Fields, Christopher J wrote: > I'll try to be on IRC (#bioruby and #obf-soc) those days, I may have a few questions. Cool :) We will also join gbrowse IRC. From lomereiter at gmail.com Tue Jul 17 02:47:49 2012 From: lomereiter at gmail.com (Artem Tarasov) Date: Tue, 17 Jul 2012 10:47:49 +0400 Subject: [GSoC] weekly report #9 Message-ID: Hello everybody, My progress report for the past week is available at http://lomereiter.wordpress.com/2012/07/17/gsoc-weekly-report-9/ I've implemented sorting and merging, both parallelized and quite fast. Also my merging tool improves on ideas taken from Picard source code and merges SAM headers as well as sorted alignment records.
For those who use Debian, packages for amd64 and i386 are now available: https://github.com/lomereiter/sambamba/downloads At the moment, alternatives to the following samtools commands are developed: view, index, sort, merge, flagstat. The current limitation is that most tools don't work with stdin/stdout and work with BAM files only (does anybody still use SAM?). Nevertheless, they wisely use multi-core processors and usually give a better speed. From pjotr.public14 at thebird.nl Tue Jul 17 03:59:38 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 17 Jul 2012 09:59:38 +0200 Subject: [GSoC] [BioRuby] weekly report #9 In-Reply-To: References: Message-ID: <20120717075938.GA30198@thebird.nl> Are you going to support STDIN/STDOUT? Another killer feature! From lomereiter at gmail.com Tue Jul 17 04:38:22 2012 From: lomereiter at gmail.com (Artem Tarasov) Date: Tue, 17 Jul 2012 12:38:22 +0400 Subject: [GSoC] [BioRuby] weekly report #9 In-Reply-To: <20120717075938.GA30198@thebird.nl> References: <20120717075938.GA30198@thebird.nl> Message-ID: Firstly, I wouldn't call that a killer feature. On Un*x you should be able to use /dev/stdin and /dev/stdout (or a named pipe) as input/output filenames, that's the way people pipe Picard tools. Many Un*x tools (including samtools) facilitate that by using dash as a shortcut for stdin/stdout, but this is not a requirement. Clearly, STDIN can't be used for random access, and some parts of my code currently rely on assumption that input stream is seekable. I should make that optional, and then named pipes can be used as input. From arklenna at gmail.com Tue Jul 17 13:48:33 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 17 Jul 2012 13:48:33 -0400 Subject: [GSoC] GSoC python variant update 7 Message-ID: Hi all, New blog post: http://arklenna.tumblr.com/post/27418058203/ Last week, Reece suggested trying to represent a variety of variants with just five identifiers: accession, start, stop, pre_seq, and post_seq. I've started a very minimal Variant object (in https://github.com/lennax/biopython/blob/variant2/Bio/Variant/variant.py), using `FeatureLocation` for its location. This uses zero-based, right-open coordinates, similar to array counting in Python. In contrast, HGVS and VCF both count from 1. I've created a list of variant types each represented in HGVS, VCF (if possible), and my new Python representation. It can be found on the blog post. Please let me know if there are any errors in my interpretation of these variant types. Thanks, Lenna From cswh at umich.edu Wed Jul 18 15:44:58 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Wed, 18 Jul 2012 15:44:58 -0400 Subject: [GSoC] bio-maf release 0.3.0 Message-ID: Hi all, I've released bio-maf version 0.3.0: http://csw.github.com/bioruby-maf/blog/2012/07/18/bio-maf_0.3.0/ This version adds features including joining adjacent MAF blocks when sequences that caused them to be split have been filtered out; returning bio-alignment objects; and truncating (or "slicing") alignment blocks to only cover a given genomic interval.
For developers, this also adds a higher-level Bio::MAF::Access API for working with directories containing indexed MAF files (or, alternatively, single files), providing all relevant functionality for indexed access in a simpler way than using the KyotoIndex and Parser classes directly. The maf_tile(1) utility has been updated to use this functionality; a directory of indexed MAF files can now be specified, and the correct file will now be parsed as appropriate. Usage of Enumerators and blocks has also been substantially improved; all access methods for multiple blocks such as Access#find, Access#slice, and Parser#each_block now accept a block parameter, which will be called for each block in turn. If no block parameter is given, they will all return an Enumerator for the resulting blocks. This is how most of the Ruby standard library, e.g. Array#each, works. -- Clayton Wheeler cswh at umich.edu From w.arindrarto at gmail.com Wed Jul 18 15:49:37 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 18 Jul 2012 21:49:37 +0200 Subject: [GSoC] GSoC Project Update -- 10 Message-ID: Hi everyone, I've just posted two new updates for my GSoC project, here: http://bow.web.id/blog/2012/07/parsing-blast-plain-text-files-in-searchio/ and here: http://bow.web.id/blog/2012/07/exonerate-in-searchio/ The first one is about a somewhat unofficial new format to be supported by SearchIO: the BLAST plain text output. I know that the current Biopython text parser is obsolete, but I figure it still could be useful for some to have a similar model in SearchIO. It is unofficial since it's basically a wrapper around the current parser, and after discussing things with Peter, it doesn't seem wise to say that we officially support parsing the format. Especially when NCBI itself does not guarantee a stable style between BLAST releases.
I should note that I've also made a small change to the current NCBIStandalone code as there were some problems when I tried to parse BLAST 2.2.26+ text output with multiple queries. The second one is about the program I've been spending most of my time on: Exonerate. We now have three Exonerate formats that SearchIO can parse and index: `exonerate-text`, for human-readable alignments, `exonerate-vulgar`, for vulgar lines, and `exonerate-cigar`, for cigar lines. It's one of the more interesting formats I've been working on so far :), since it has so much information in it. I've tried to capture it as sensibly as possible, and I made a small demonstration using it in my post. In addition to writing these two formats, I've also written their tests. Now, having finished almost all of the parsers, I'm planning to devote more time to start writing the documentation during the coming weeks. regards, Bow From lomereiter at gmail.com Tue Jul 24 10:46:09 2012 From: lomereiter at gmail.com (Artem Tarasov) Date: Tue, 24 Jul 2012 18:46:09 +0400 Subject: [GSoC] weekly report #10 Message-ID: Hi all, During the past week I've added filtering functionality to the sambamba-view utility. Now the tool parses expressions like "mapping_quality >= 50 and [MQ] >=50 and not ([RG] =~ /abcd/i or [RG] == null)", superseding the functionality given by samtools flags -f, -F, -q, -l, -r. Also I'm now introducing wget-like text progress bars to my tools; as of now this is present in sambamba-index only. More on that is at http://lomereiter.wordpress.com/2012/07/24/gsoc-weekly-report-10/ From arklenna at gmail.com Fri Jul 27 17:23:50 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Fri, 27 Jul 2012 17:23:50 -0400 Subject: [GSoC] GSoC python variant update 8 Message-ID: It appears that this email didn't make it to the list due to the catastrophe yesterday. I apologize if anyone receives two copies!
Link: http://arklenna.tumblr.com/post/28082157403/ Post: I previously proposed the implementation of a method for PyVCF that would quickly scan the entire file and provide useful summary statistics. The idea is shamelessly copied from Brad's GFF parser (see https://github.com/chapmanb/bcbb/tree/master/gff ); for GFF, this method is helpful because the annotations on a sequence can vary widely. However, I no longer think this would be useful for VCF: 1. Most importantly, the VCF headers generally contain a complete listing of all of the types of information contained in the file. It's technically optional, but I hope that the most commonly used variant callers produce accurate headers. However, if there is a prevalence of files with a mismatch between headers and actual INFO/FORMAT fields, please let me know. 2. Next, any listing of ranges of data such as POS or QUAL might as well be coupled with actual filtering. This would be different if a presentation of the distribution of quality scores would be necessary to set an appropriate threshold. It would also depend on the ratio of speed between the range scan and the filtering (i.e. whether a possible second filter would be unacceptably time consuming). 3. Finally, and perhaps most importantly, many files are so large that scanning an entire file would take too long. Setting a limit and displaying updated information in real time (i.e. writing to `sys.stdout` with '\r', https://gist.github.com/3161269 ) could overcome this issue. If any VCF users can think of a great reason to scan a VCF file before filtering it, please get in touch. ------- I added the method `as_SeqFeature()` to my basic variant class, but it's still incomplete. Some of this is in flux due to forthcoming changes to FeatureLocation. I'm currently working on expanding the coordinate mapper Reece posted to the dev list a couple years ago (see http://biopython.org/pipermail/biopython/2010-June/006598.html ). Expect an update on that very soon. 
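As a rough illustration of what a coordinate mapper has to do, here is a minimal sketch that maps a CDS position to a genomic position. It is not the mapper Reece posted and not a Biopython API. The function name `cds_to_genomic` and the exon list are hypothetical, and the sketch assumes forward-strand exons given as 0-based, right-open `(start, end)` intervals sorted by start.

```python
def cds_to_genomic(exons, cds_pos):
    """Map a 0-based CDS position onto a 0-based genomic position.

    exons: list of (start, end) genomic intervals, 0-based and right-open,
    sorted by start, all on the forward strand (an assumption of this sketch).
    """
    if cds_pos < 0:
        raise ValueError("CDS position must be non-negative")
    remaining = cds_pos
    for start, end in exons:
        length = end - start
        if remaining < length:
            # The position falls inside this exon.
            return start + remaining
        # Skip this exon and carry the remainder to the next one.
        remaining -= length
    raise IndexError("CDS position lies beyond the last exon")


# Two hypothetical exons covering 10 and 20 bases.
exons = [(100, 110), (200, 220)]
```

Reverse-strand features, intronic offsets, and 1-based display conventions would all need handling on top of this.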
Best, Lenna From chris.mit7 at gmail.com Fri Jul 27 19:17:13 2012 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Fri, 27 Jul 2012 19:17:13 -0400 Subject: [GSoC] [Biopython-dev] GSoC python variant update 8 In-Reply-To: References: Message-ID: Sorry for my brevity, but one great reason to scan a VCF file is to know where your variants are for downstream analysis. For instance, when analyzing RNA-Seq data for features such as Allele Specific Expression, having quick access to where variants are located is essential. From arklenna at gmail.com Thu Jul 26 18:30:35 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Thu, 26 Jul 2012 18:30:35 -0400 Subject: [GSoC] GSoC python variant update 8 Message-ID: Link: http://arklenna.tumblr.com/post/28082157403/ Post: I previously proposed the implementation of a method for PyVCF that would quickly scan the entire file and provide useful summary statistics. The idea is shamelessly copied from Brad's GFF parser (see https://github.com/chapmanb/bcbb/tree/master/gff ); for GFF, this method is helpful because the annotations on a sequence can vary widely. However, I no longer think this would be useful for VCF: 1. Most importantly, the VCF headers generally contain a complete listing of all of the types of information contained in the file. It's technically optional, but I hope that the most commonly used variant callers produce accurate headers. However, if there is a prevalence of files with a mismatch between headers and actual INFO/FORMAT fields, please let me know. 2.
Next, any listing of ranges of data such as POS or QUAL might as well be coupled with actual filtering. This would be different if a presentation of the distribution of quality scores would be necessary to set an appropriate threshold. It would also depend on the ratio of speed between the range scan and the filtering (i.e. whether a possible second filter would be unacceptably time consuming). 3. Finally, and perhaps most importantly, many files are so large that scanning an entire file would take too long. Setting a limit and displaying updated information in real time (i.e. writing to `sys.stdout` with '\r', https://gist.github.com/3161269 ) could overcome this issue. If any VCF users can think of a great reason to scan a VCF file before filtering it, please get in touch. ------- I added the method `as_SeqFeature()` to my basic variant class, but it's still incomplete. Some of this is in flux due to forthcoming changes to FeatureLocation. I'm currently working on expanding the coordinate mapper Reece posted to the dev list a couple years ago (see http://biopython.org/pipermail/biopython/2010-June/006598.html ). Expect an update on that very soon. Best, Lenna From marian.povolny at gmail.com Wed Aug 1 08:46:44 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Wed, 1 Aug 2012 14:46:44 +0200 Subject: [GSoC] GSoC weekly status report No.9 Message-ID: http://blog.mpthecoder.com/post/28481171942/gsoc-weekly-status-report-no-9 The trip to Lodi was very fruitful. It was great to meet both my mentor and other community members. 
Based on the input received at the codefest, I created a new plan for the second part of the summer: https://github.com/mamarjan/gff3-pltools/wiki/Part2 Since then I have done the following: - improved validation speed, - added GTF support for input and output, - table output with an option to select which fields and attributes should be in the table, - tools for conversion to GTF and JSON, - JSON output support, which needs some more polish. -- Marjan From cswh at umich.edu Sun Aug 5 19:24:58 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Sun, 5 Aug 2012 19:24:58 -0400 Subject: [GSoC] bio-maf update: BGZF and testing Message-ID: Hi all, I've posted an update on my most recent work on bio-maf. Highlights include BGZF compression support, a new maf_extract command-line tool for random access, and my discoveries from testing on the full UCSC multiz46way dataset. http://csw.github.com/bioruby-maf/blog/2012/08/05/bgzf_and_testing/ -- Clayton Wheeler cswh at umich.edu From arklenna at gmail.com Tue Aug 7 01:11:04 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 7 Aug 2012 01:11:04 -0400 Subject: [GSoC] GSoC python variant update Message-ID: Full post: http://arklenna.tumblr.com/post/28890255191/ Summary: * I'm working on the coordinate mapper Reece contributed: http://biopython.org/pipermail/biopython/2010-June/006598.html * I'm representing intron locations relative to CDS coords using the HGVS standards: http://www.hgvs.org/mutnomen/refseq_figure.html I'd like to know if there are other common ways of representing such positions. * In order to customize the display of positions (e.g. 0-based or 1-based), I'm using a class as a configuration container. I've read on StackOverflow that attempts to use globals or a singleton class are discouraged in Python, but I have not found practical suggestions for how to implement module-wide configurations. Suggestions are welcome. * Any advice about circular genomes or strandedness is also welcome. 
* This mapper will work for SeqRecords, SeqFeatures, FeatureLocations, etc. Are there other Biopython objects that store sequence coordinates and thus should be mappable? Regards, Lenna From lomereiter at gmail.com Tue Aug 7 01:14:02 2012 From: lomereiter at gmail.com (Artem Tarasov) Date: Tue, 7 Aug 2012 05:14:02 +0000 Subject: [GSoC] weekly report #11 Message-ID: Hello all, here's my latest report about work on sambamba: http://lomereiter.wordpress.com/2012/08/06/gsoc-weekly-report-11/ * simple internal DSL for filtering was introduced in bindings (description: https://github.com/lomereiter/bioruby-sambamba/blob/master/features/filtering.feature ) * BAM output support was added to filtering tool (sambamba view) * man pages for all tools were created * a script was written for building Debian packages and uploading them to Github * Debian packages for v0.2.1 are added to Github downloads (for both i386 and amd64) Now the goal is to make those tools accessible at Galaxy Tool Shed, and also this week I plan to optimize memory usage -- buffer sizes are quite large now, and the task is to find out how much they can be reduced without significant impact on performance. From marian.povolny at gmail.com Tue Aug 7 09:36:04 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Tue, 7 Aug 2012 15:36:04 +0200 Subject: [GSoC] GSoC weekly status report No.10 Message-ID: http://blog.mpthecoder.com/post/28907136767/gsoc-weekly-status-report-no-10 The 0.3 release is available on the website: http://mamarjan.github.com/gff3-pltools/ In addition to what was described in the last weekly report, a GFF3 sorting tool has been added, grouping records which belong to the same feature, and Ruby bindings have been updated to support GTF and new options in gff3-ffetch. 
-- Marjan From w.arindrarto at gmail.com Tue Aug 7 13:56:26 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 7 Aug 2012 19:56:26 +0200 Subject: [GSoC] GSoC Project Update -- 11 Message-ID: Hello everyone, I have just posted my latest update on my project here: http://bow.web.id/blog/2012/08/back-on-the-main-branch/ It's been taking quite a while since I posted my last update since there has been a considerable change to the SearchIO object model I'm using. The details are in my blog post, but to keep it short, it was because the previous model (QueryResult, Hit, and HSP) was inadequate in handling files that have multiple sequences in their HSP (so far seen in files output by BLAT and Exonerate). In my previous updates, I've been using simple Python lists to store attributes related to these multiple sequences, but that turned out to be problematic as it may make the object have inconsistent attributes. After trying out several different implementations and discussing them with Peter, we've finally settled on a new model. The new model changes the HSP object into a container that stores a new object: HSPFragment. HSPFragment represents a single, contiguous alignment of the hit and query sequence. It only stores the sequence, coordinates, frames, and strands. Other attributes made by the search program (such as evalues or scores) are stored in the HSP object. This change required some modifications on all of the current parsers, but from a user's perspective working with file formats other than BLAT or Exonerate, the changes should be minimum. Aside from this, there's also a small update on the main API which lets it accept keyword arguments. The arguments modify behaviors of the parser, and they are different for each parser. Currently, this is only used by the BLAST tabular parser, but I imagine more parsers will use this in the future. 
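The container model described above (an HSP holding one or more contiguous fragments, with search statistics kept on the HSP itself) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class names `FragmentSketch` and `HSPSketch`, their fields, and the 0-based half-open coordinates are hypothetical and are not the SearchIO API; frames are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class FragmentSketch:
    """One contiguous aligned block: sequences, coordinates and strands only."""
    query_seq: str
    hit_seq: str
    query_start: int  # assumed 0-based, half-open coordinates
    query_end: int
    hit_start: int
    hit_end: int
    query_strand: int = 1
    hit_strand: int = 1


@dataclass
class HSPSketch:
    """Container of fragments; statistics reported by the search program live here."""
    fragments: List[FragmentSketch] = field(default_factory=list)
    evalue: Optional[float] = None
    bitscore: Optional[float] = None

    def __len__(self) -> int:
        return len(self.fragments)

    def __iter__(self) -> Iterator[FragmentSketch]:
        return iter(self.fragments)

    @property
    def query_span(self) -> Tuple[int, int]:
        """Smallest start and largest end on the query (needs >= 1 fragment)."""
        return (min(f.query_start for f in self.fragments),
                max(f.query_end for f in self.fragments))

    @property
    def hit_span(self) -> Tuple[int, int]:
        """Smallest start and largest end on the hit (needs >= 1 fragment)."""
        return (min(f.hit_start for f in self.fragments),
                max(f.hit_end for f in self.fragments))


# A hit split into two blocks shares one e-value but keeps per-block coordinates.
hsp = HSPSketch(
    fragments=[FragmentSketch("ACGT", "ACGT", 0, 4, 10, 14),
               FragmentSketch("TTGA", "TTGA", 20, 24, 40, 44)],
    evalue=1e-5,
)
print(len(hsp), hsp.query_span, hsp.hit_span)
```

Keeping per-fragment data in separate objects avoids the parallel-list problem mentioned above, since each fragment's sequences and coordinates cannot drift out of step with one another.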
Finally, having settled on a firmer object model, I'll be spending the rest of my time to focus on the documentation. There may still be small fixes to the code, but I expect nothing as major as this one. regards, Bow From chapmanb at 50mail.com Wed Aug 8 09:55:36 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 08 Aug 2012 09:55:36 -0400 Subject: [GSoC] [Biopython-dev] GSoC python variant update In-Reply-To: References: Message-ID: <874nodh4iv.fsf@fastmail.fm> Lenna; This all sounds great and will be a nice practical addition to Biopython. Thanks for taking it on. Some specific thoughts on your questions: > * I'm representing intron locations relative to CDS coords using the > HGVS standards: http://www.hgvs.org/mutnomen/refseq_figure.html > I'd like to know if there are other common ways of representing such > positions. I don't know of one myself, so it's great to be following a standard rather than reinventing something. Nice work. > * In order to customize the display of positions (e.g. 0-based or > 1-based), I'm using a class as a configuration container. I've read on > StackOverflow that attempts to use globals or a singleton class are > discouraged in Python, but I have not found practical suggestions for > how to implement module-wide configurations. Suggestions are welcome. With configuration items like this, you have two choices: - A global variable. - Pass the configuration to every function that needs it. There are tradeoffs with both approaches, but for this case I agree with your decision to use globals. Most people will want 0-based/Biopython style but it gives those who don't a knob to switch over. > * Any advice about circular genomes or strandedness is also welcome. Circular handling is an unresolved issue in Biopython: https://redmine.open-bio.org/issues/2578 It's a bit tricky, especially with features that span the origin. I'd prioritize handling strandedness since you're going to have plenty of reverse strand coding sequences. 
You're mapping not only within the coding region but also back to the original sequence on the reverse strand. So in your g2c mapping, the original gene goes from e1 -> s1 -> e0 -> s0 as you read 5' to 3' across the sequence. The best place to get started is to pick a reverse strand gene and then work through the mappings, thinking through the orientations. I find drawing it out to be the easiest way. > * This mapper will work for SeqRecords, SeqFeatures, FeatureLocations, > etc. Are there other Biopython objects that store sequence coordinates > and thus should be mappable? That sounds like a great start. Thanks again for this, Brad From p.j.a.cock at googlemail.com Wed Aug 8 10:33:05 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 8 Aug 2012 15:33:05 +0100 Subject: [GSoC] [Biopython-dev] GSoC python variant update In-Reply-To: <874nodh4iv.fsf@fastmail.fm> References: <874nodh4iv.fsf@fastmail.fm> Message-ID: On Wed, Aug 8, 2012 at 2:55 PM, Brad Chapman wrote: >Lenna wrote: >> * Any advice about circular genomes or strandedness is also welcome. > > Circular handling is an unresolved issue in Biopython: > > https://redmine.open-bio.org/issues/2578 > > It's a bit tricky, especially with features that span the origin. > > I'd prioritize handling strandedness since you're going to have plenty > of reverse strand coding sequences. You're mapping not only within the > coding region but also back to the original sequence on the reverse > strand. So in your g2c mapping, the original gene goes from > e1 -> s1 -> e0 -> s0 as you read 5' to 3' across the sequence. The best > place to get started is to pick a reverse strand gene and then work > through the mappings, thinking through the orientations. I find drawing > it out to be the easiest way. And then think about mixed strand genes, e.g. transpliced tRNA is a good example - there is a GenBank example in our unit tests. 
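To make the reverse-strand and mixed-strand bookkeeping above concrete, here is a minimal sketch; it is not the mapper's actual code. It assumes a feature is given as a list of (start, end, strand) parts with 0-based, half-open coordinates, listed in the order they are transcribed. The function name `c2g_sketch` and the example coordinates are made up for illustration.

```python
def c2g_sketch(cds_index, parts):
    """Map a 0-based CDS index to a (genome_index, strand) pair.

    parts: list of (start, end, strand) tuples, 0-based and half-open,
    given in transcription order (an assumption of this sketch).
    """
    if cds_index < 0:
        raise IndexError("CDS index must be non-negative")
    remaining = cds_index
    for start, end, strand in parts:
        length = end - start
        if remaining < length:
            if strand >= 0:
                # Plus strand: count upwards from the part's start.
                return start + remaining, strand
            # Minus strand: count downwards from the part's last base.
            return end - 1 - remaining, strand
        remaining -= length
    raise IndexError("CDS index lies beyond the end of the feature")


# Hypothetical feature: a minus-strand part followed by a plus-strand part.
parts = [(24, 30, -1), (35, 40, +1)]
for i in (0, 5, 6, 10):
    print(i, c2g_sketch(i, parts))
```

For a gene lying entirely on the reverse strand, listing the parts in transcription order gives the e1 -> s1 -> e0 -> s0 walk described above.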
Peter

From arklenna at gmail.com Mon Aug 13 01:00:41 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Mon, 13 Aug 2012 01:00:41 -0400
Subject: [GSoC] GSoC python variant update 10
Message-ID:

Link: http://arklenna.tumblr.com/post/29317968106/

Post:

Following extensive
[discussion](http://biopython.org/pipermail/biopython-dev/2012-August/009849.html)
on the dev list of the pros and cons of configuration classes/modules,
I have refactored my [coordinate mapper](https://gist.github.com/3172753)
to keep configuration as isolated as possible.

All mapping functions use base 0 internally. Transformation to and from
1-based coords is allowed by custom MapPosition objects (they are
currently separate from the Seq* positions but could probably subclass
ExactPosition). The MapPosition objects have to_dialect and from_dialect
methods that automatically handle conversion between bases and other
formatting details.

There are two different ways a user can convert a coordinate from HGVS:

    # ... assuming cm is an instance of CoordinateMapper

    # Manually construct position from HGVS
    CDS_coord = CDSPosition.from_hgvs("6+1")
    genomic_coord = cm.c2g(CDS_coord)
    print genomic_coord.to_hgvs()

    # Pass dialect argument to mapping function
    genomic_coord = cm.c2g("6+1", dialect="HGVS")
    print genomic_coord.to_hgvs()

Furthermore, the inheritance hierarchy is designed to allow a user to
set a default string representation:

    # Set MapPositions to print as HGVS by default
    def use_hgvs(self):
        return str(self.to_hgvs())

    MapPosition.__str__ = use_hgvs

The [version](https://gist.github.com/3172753/577b7c383e057b78cdcee64be33f18117a46faaf)
as of this writing is passing tests using base 0. I have not yet
implemented tests for `from_hgvs` or `to_hgvs`, but that's next on my
list. I'm hoping to have time for strand and mixed strand, too.
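As a rough illustration of the base conversion described above, the sketch below stores a CDS position 0-based internally and converts to and from a 1-based HGVS-style string such as "6+1". It is written for Python 3 and is not the code from the gist: the class name `CDSPositionSketch` and its fields are hypothetical, and it only handles positive coding positions with an optional intron offset (no 5' or 3' UTR notation).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CDSPositionSketch:
    """Hypothetical position: 0-based coding index plus signed intron offset."""
    pos: int         # 0-based index of the nearest coding base
    offset: int = 0  # bases into the adjacent intron; 0 means exonic

    @classmethod
    def from_hgvs(cls, text):
        """Parse strings like '6', '6+1' or '7-2' (1-based coding position)."""
        match = re.fullmatch(r"(\d+)([+-]\d+)?", text)
        if match is None:
            raise ValueError("unsupported HGVS-style position: %r" % text)
        base = int(match.group(1))
        if base < 1:
            raise ValueError("HGVS coding positions start at 1")
        # Shift to 0-based storage; the intron offset is kept as written.
        return cls(base - 1, int(match.group(2) or 0))

    def to_hgvs(self):
        """Format as a 1-based string, appending the signed offset if any."""
        text = str(self.pos + 1)
        if self.offset:
            text += "{:+d}".format(self.offset)
        return text


position = CDSPositionSketch.from_hgvs("6+1")
print(position.pos, position.offset, position.to_hgvs())
```

Doing the shift only inside `from_hgvs` and `to_hgvs` keeps every other calculation in base 0, which is the separation the post describes.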
Cheers,

Lenna

From arklenna at gmail.com Thu Aug 16 21:58:46 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Thu, 16 Aug 2012 21:58:46 -0400
Subject: [GSoC] GSoC Python variant (penultimate) update
Message-ID:

Post: http://arklenna.tumblr.com/post/29592108099/

I have been considering how to handle gene strandedness. As long as I'm
correctly interpreting the following position, my coordinate mapper
should produce the correct coordinates with negative strand or mixed
strand features.

GenBank: join(complement(25..30), 36..40)
Biopython: FeatureLocation(24, 30, -1) + FeatureLocation(35, 40)

(please click through to post for monospaced font)

23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
   <----------------                ------------->
    5  4  3  2  1  0                 6  7  8  9 10

I have to admit that it wasn't until I read a BioStar
[post](http://biostars.org/post/show/3423/forward-and-reverse-strand-conventions/)
earlier this week that I fully understood the relationship between
plus/minus forward/reverse sense/antisense coding/template strands. So
please let me know as soon as possible if I've made a mistake in the
above code.

`c2g` yields the correct genome position, but not the strand. I still
need to integrate strand information into my `GenomePosition` object
and/or partially merge it with `ExactLocation`.

This weekend I intend to expand documentation and write a brief
cookbook entry.

Cheers,

Lenna

From p.j.a.cock at googlemail.com Fri Aug 17 04:21:01 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 17 Aug 2012 09:21:01 +0100
Subject: [GSoC] GSoC Python variant (penultimate) update
In-Reply-To:
References:
Message-ID:

On Fri, Aug 17, 2012 at 2:58 AM, Lenna Peterson wrote:
>
> I have to admit that it wasn't until I read a BioStar
> [post](http://biostars.org/post/show/3423/forward-and-reverse-strand-conventions/)
> earlier this week that I fully understood the relationship between
> plus/minus forward/reverse sense/antisense coding/template strands.
So > please let me know as soon as possible if I've made a mistake in the > above code. Given this is nice and fresh in your mind, can you suggest any clarifications to the Biopython Tutorial section talking about this issue? The section on transcription & translation starting: "Before talking about transcription, I want to try and clarify the strand issue. Consider the following (made up) stretch of double stranded DNA which encodes a short peptide: ..." Hmm. That should probably say "I want to try to clarify...". Peter From arklenna at gmail.com Mon Aug 20 00:22:36 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 20 Aug 2012 00:22:36 -0400 Subject: [GSoC] GSoC python variant final update Message-ID: Post: http://arklenna.tumblr.com/post/29808300789/ The coordinate mapper, with updated documentation, is now located on this branch: https://github.com/lennax/biopython/tree/f_loc4 It awaits the merging of Peter's f_loc4 branch. I've written an entry on coordinate mapping for the Cookbook: http://biopython.org/wiki/Coordinate_mapping Additionally, at Peter's suggestion, I've written a clarification of strand as it relates to transcription and translation. It's available here: https://docs.google.com/document/d/11R7EOJXn90lN5_SmaPOyN5rFfPQybbCbUBo6EY0R0pA/edit It's been a great experience working with this project this summer. Thank you to everyone involved. Cheers, Lenna From lomereiter at gmail.com Mon Aug 20 05:47:18 2012 From: lomereiter at gmail.com (Artem Tarasov) Date: Mon, 20 Aug 2012 13:47:18 +0400 Subject: [GSoC] GSoC final report Message-ID: Hi all, here's a wrap-up of what was added/fixed in sambamba and Ruby bindings during the last couple of weeks: http://lomereiter.wordpress.com/2012/08/20/gsoc-weekly-report-12/ * Tools don't require input files to be seekable anymore, allowing to work with e.g. 
/dev/stdin and /dev/stdout * I've added an option of MessagePack output, that drastically improved speed of bindings (2-4x speedup depending on configuration) * The gem is on Travis CI now, passing tests on MRI 1.9.2/1.9.3. (JRuby also works, but on Travis there're problems with popen, same as with BioRuby) * The tool 'sambamba_filter' is now available on Galaxy Tool Shed. It's been a great summer. Thank you OBF/BioRuby folks :) -- Artem From pjotr.public14 at thebird.nl Mon Aug 20 05:55:59 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 20 Aug 2012 11:55:59 +0200 Subject: [GSoC] [BioRuby] GSoC final report In-Reply-To: References: Message-ID: <20120820095559.GB2453@thebird.nl> Thank you Artem, Great job. Pj. On Mon, Aug 20, 2012 at 01:47:18PM +0400, Artem Tarasov wrote: > Hi all, > > here's a wrap-up of what was added/fixed in sambamba and Ruby bindings > during the last couple of weeks: > http://lomereiter.wordpress.com/2012/08/20/gsoc-weekly-report-12/ > > * Tools don't require input files to be seekable anymore, allowing to work > with e.g. /dev/stdin and /dev/stdout > * I've added an option of MessagePack output, that drastically improved > speed of bindings (2-4x speedup depending on configuration) > * The gem is on Travis CI now, passing tests on MRI 1.9.2/1.9.3. (JRuby > also works, but on Travis there're problems with popen, same as with > BioRuby) > * The tool 'sambamba_filter' is now available on Galaxy Tool Shed. > > > It's been a great summer. 
Thank you OBF/BioRuby folks :) > > > -- > Artem > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From chapmanb at 50mail.com Mon Aug 20 08:45:49 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 20 Aug 2012 08:45:49 -0400 Subject: [GSoC] GSoC python variant final update In-Reply-To: References: Message-ID: <87harxzq82.fsf@fastmail.fm> Lenna; Thanks for the documentation and getting that all code moved into a branch. This looks great and looking forward to having it merged when Peter's work goes in. Thanks also for all the great work this summer and good luck on the first day of PhD school, Brad > Post: http://arklenna.tumblr.com/post/29808300789/ > > The coordinate mapper, with updated documentation, is now located on > this branch: https://github.com/lennax/biopython/tree/f_loc4 > It awaits the merging of Peter's f_loc4 branch. > > I've written an entry on coordinate mapping for the Cookbook: > http://biopython.org/wiki/Coordinate_mapping > > Additionally, at Peter's suggestion, I've written a clarification of > strand as it relates to transcription and translation. It's available > here: https://docs.google.com/document/d/11R7EOJXn90lN5_SmaPOyN5rFfPQybbCbUBo6EY0R0pA/edit > > It's been a great experience working with this project this summer. > Thank you to everyone involved. 
>
> Cheers,
>
> Lenna
> _______________________________________________
> GSoC mailing list
> GSoC at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/gsoc

From w.arindrarto at gmail.com Tue Aug 21 12:09:07 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 21 Aug 2012 18:09:07 +0200
Subject: [GSoC] GSoC Project Update -- 10
In-Reply-To:
References:
Message-ID:

Hi everyone,

I've just posted my last entry for my Google Summer of Code project
this year: http://bow.web.id/blog/2012/08/summers-over/

I want to say thank you to the Biopython community, especially Peter
for mentoring me this summer :), to OBF for accepting my proposal, and
to anyone who has helped and given me valuable inputs for me
throughout the project :).

It's been a priceless learning experience, and I only hope that my
code will be useful in return.

There are still some things to do before the code is merge-ready and
even more when the code is included in an official release, so I'll
still be around.

cheers,
Bow

From marian.povolny at gmail.com Tue Aug 21 15:11:01 2012
From: marian.povolny at gmail.com (Marjan Povolni)
Date: Tue, 21 Aug 2012 21:11:01 +0200
Subject: [GSoC] Final GSoC report
Message-ID:

http://blog.mpthecoder.com/post/29910330225/final-gsoc-report

*Summary*

Yesterday I tagged the 0.4 release of gff3-pltools, and that marks the end
of the summer. At least in GSoC terms. Should I say end of the project? I
don't think so. The tools can still be improved, and the Ruby bindings
should follow.

The major changes since the last release include the following:

- filtering functionality has been moved to a separate utility:
  gff3-filter, along with a new language for specifying filtering
  expressions,
- conversion to table format of selected fields has been moved to a
  separate utility: gff3-select. However, the --select option is still
  part of gff3-filter,
- gff3-ffetch is now fetching FASTA sequences from GFF3 and FASTA files
  for CDS and mRNA records and features,
- man pages for utilities.

The original idea was to create a GFF3/GTF parser in D and Ruby bindings.
The Ruby bindings part didn't work out because there is still no support
for D shared libraries in Linux, but instead there are now a few useful
command-line tools for processing GFF3 which can be used without
programming knowledge.

To me, the summer was fun, challenging, and a great experience. I even got
to meet my mentor in person, and other community members too, and to make
my first steps in bioinformatics. I even gave a small presentation at the
EU-codefest. What a summer it was!

Thanks to everybody who made it possible: Google, Open Bioinformatics
Foundation and my mentor Pjotr Prins.

--
Marjan

From cswh at umich.edu Tue Aug 21 15:35:21 2012
From: cswh at umich.edu (Clayton Wheeler)
Date: Tue, 21 Aug 2012 12:35:21 -0700
Subject: [GSoC] Bio-MAF 1.0.1
Message-ID:

Hi all,

I've released bio-maf 1.0.1 and written a final GSoC blog post about it:

http://csw.github.com/bioruby-maf/blog/2012/08/21/bio-maf_1.0.1/

This release should be substantially more robust, with solid and
reasonably-performing BGZF support, better CLI tools, and various
robustness, compatibility, and memory-footprint improvements.

(I've also developed a Galaxy integration for the maf_tile tool;
unlike the existing Galaxy MAF tools, this is capable of filling in
gaps with a FASTA reference sequence, and concatenating the alignment
output from several exons specified in a BED file. It's not quite all
packaged up with the toolshed facility yet, but I should be able to
wrap that up shortly. Sneak preview: https://gist.github.com/3418576)

It's been a pleasure working with all of you, and I'm glad I've been
able to deliver something useful. Pjotr, Raoul, Francesco, thanks for
your help and advice this summer!
Marjan, Artem, you guys did excellent work and gave me some great suggestions in the code reviews. And, of course, thanks to Google for organizing and funding this! -- Clayton Wheeler cswh at umich.edu From pjotr.public14 at thebird.nl Tue Aug 21 17:47:53 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 21 Aug 2012 23:47:53 +0200 Subject: [GSoC] [BioRuby] Bio-MAF 1.0.1 In-Reply-To: References: Message-ID: <20120821214753.GB10348@thebird.nl> Thank you Marjan and Clayton. It was our pleasure. Pj. On Tue, Aug 21, 2012 at 12:35:21PM -0700, Clayton Wheeler wrote: > Hi all, > > I've released bio-maf 1.0.1 and written a final GSoC blog post about it: > > http://csw.github.com/bioruby-maf/blog/2012/08/21/bio-maf_1.0.1/ > > This release should be substantially more robust, with solid and > reasonably-performing BGZF support, better CLI tools, and various > robustness, compatibility, and memory-footprint improvements. > > (I've also developed a Galaxy integration for the maf_tile tool; > unlike the existing Galaxy MAF tools, this is capable of filling in > gaps with a FASTA reference sequence, and concatenating the alignment > output from several exons specified in a BED file. It's not quite all > packaged up with the toolshed facility yet, but I should be able to > wrap that up shortly. Sneak preview: https://gist.github.com/3418576) > > It's been a pleasure working with all of you, and I'm glad I've been > able to deliver something useful. Pjotr, Raoul, Francesco, thanks for > your help and advice this summer! Marjan, Artem, you guys did > excellent work and gave me some great suggestions in the code reviews. > And, of course, thanks to Google for organizing and funding this! 
> > -- > Clayton Wheeler > cswh at umich.edu > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From mictadlo at gmail.com Tue Aug 21 20:55:30 2012 From: mictadlo at gmail.com (Mic) Date: Wed, 22 Aug 2012 10:55:30 +1000 Subject: [GSoC] [BioRuby] Final GSoC report In-Reply-To: References: Message-ID: Hi, Python is able to connect to D with help of http://pyd.dsource.org/ . Maybe it would be something for Biopython Cheers, Mic On Wed, Aug 22, 2012 at 5:11 AM, Marjan Povolni wrote: > http://blog.mpthecoder.com/post/29910330225/final-gsoc-report > > *Summary* > > Yesterday I tagged the 0.4 release of gff3-pltools, and that marks the end > of the summer. At least in GSoC terms. Should I say end of the project? I > don't think so. The tools can still be improved, and the Ruby bindings > should follow. > > The major changes since the last release include the following: > > - filtering functionality has been moved to a separate utility: > gff3-filter, along with a new language for specifying filtering > expressions, > - conversion to table format of selected fields has been moved to a > separate utility: gff3-select. However, the --select option is still > part of > gff3-filter, > - gff3-ffetch is now fetching FASTA sequences from GFF3 and FASTA files > for CDS and mRNA records and features, > - man pages for utilities. > > ** > The original idea was to create a GFF3/GTF parser in D and Ruby bindings. > The Ruby bindings part didn't work out because there is still no support > for D shared libraries in Linux, but instead there are now a few useful > command-line tools for processing GFF3 which can be used without > programming knowledge. > > To me, the summer was fun, challenging, and a great experience.
I even got > to meet my mentor in person, and other community members too, and to make > my first steps in bioinformatics. I even gave a small presentation at the > EU-codefest. What a summer it was! > > Thanks to everybody who made it possible: Google, Open Bioinformatics > Foundation and my mentor Pjotr Prins. > > -- > Marjan > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From cjfields at illinois.edu Wed Aug 22 00:16:13 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 22 Aug 2012 04:16:13 +0000 Subject: [GSoC] [BioRuby] GSoC final report In-Reply-To: <20120820095559.GB2453@thebird.nl> References: <20120820095559.GB2453@thebird.nl> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF33B73325@CHIMBX5.ad.uillinois.edu> Wholeheartedly agree. Congrats Artem on a job well done! chris On Aug 20, 2012, at 4:55 AM, Pjotr Prins wrote: > Thank you Artem, > > Great job. > > Pj. > > On Mon, Aug 20, 2012 at 01:47:18PM +0400, Artem Tarasov wrote: >> Hi all, >> >> here's a wrap-up of what was added/fixed in sambamba and Ruby bindings >> during the last couple of weeks: >> http://lomereiter.wordpress.com/2012/08/20/gsoc-weekly-report-12/ >> >> * Tools don't require input files to be seekable anymore, allowing to work >> with e.g. /dev/stdin and /dev/stdout >> * I've added an option of MessagePack output, that drastically improved >> speed of bindings (2-4x speedup depending on configuration) >> * The gem is on Travis CI now, passing tests on MRI 1.9.2/1.9.3. (JRuby >> also works, but on Travis there're problems with popen, same as with >> BioRuby) >> * The tool 'sambamba_filter' is now available on Galaxy Tool Shed. >> >> >> It's been a great summer. 
Thank you OBF/BioRuby folks :) >> >> >> -- >> Artem >> _______________________________________________ >> BioRuby Project - http://www.bioruby.org/ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby >> > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From lomereiter at gmail.com Wed Aug 22 02:42:12 2012 From: lomereiter at gmail.com (Artem Tarasov) Date: Wed, 22 Aug 2012 10:42:12 +0400 Subject: [GSoC] [BioRuby] Final GSoC report In-Reply-To: References: Message-ID: Hi, Unfortunately, the problem is on the side of D. PyD wiki (https://bitbucket.org/ariovistus/pyd/wiki/Home) says that "extension libraries are nominally working with LDC (FE 2.060 or later); however, druntime currently limits what can be done here". However, this issue has become quite popular in last months, see e.g. this thread: http://forum.dlang.org/thread/mailman.1330.1345434177.31962.digitalmars-d at puremagic.com - so maybe this'll get fixed soon. -- Artem On Wed, Aug 22, 2012 at 4:55 AM, Mic wrote: > Hi, > Python is able to connect to D with help of http://pyd.dsource.org/ . > > Maybe it would be something for Biopython > > Cheers, > Mic > > On Wed, Aug 22, 2012 at 5:11 AM, Marjan Povolni >wrote: > > > http://blog.mpthecoder.com/post/29910330225/final-gsoc-report > > > > *Summary* > > > > Yesterday I tagged the 0.4 release of gff3-pltools, and that marks the > end > > of the summer. At least in GSoC terms. Should I say end of the project? I > > don't think so. The tools can still be improved, and the Ruby bindings > > should follow.
> > > > The major changes since the last release include the following: > > > > - filtering functionality has been moved to a separate utility: > > gff3-filter, along with a new language for specifying filtering > > expressions, > > - conversion to table format of selected fields has been moved to a > > separate utility: gff3-select. However, the --select option is still > > part of > > gff3-filter, > > - gff3-ffetch is now fetching FASTA sequences from GFF3 and FASTA > files > > for CDS and mRNA records and features, > > - man pages for utilities. > > > > ** > > The original idea was to create a GFF3/GTF parser in D and Ruby bindings. > > The Ruby bindings part didn't work out because there is still no support > > for D shared libraries in Linux, but instead there are now a few useful > > command-line tools for processing GFF3 which can be used without > > programming knowledge. > > > > To me, the summer was fun, challenging, and a great experience. I even > got > > to meet my mentor in person, and other community members too, and to make > > my first steps in bioinformatics. I even gave a small presentation at the > > EU-codefest. What a summer it was! > > > > Thanks to everybody who made it possible: Google, Open Bioinformatics > > Foundation and my mentor Pjotr Prins.
> > > > -- > > Marjan > > > > _______________________________________________ > > BioRuby Project - http://www.bioruby.org/ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > > > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc > From p.j.a.cock at googlemail.com Wed Aug 22 04:42:03 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 Aug 2012 09:42:03 +0100 Subject: [GSoC] GSoC Project Update -- 10 In-Reply-To: References: Message-ID: On Tue, Aug 21, 2012 at 5:09 PM, Wibowo Arindrarto wrote: > Hi everyone, > > I've just posted my last entry for my Google Summer of Code project > this year: http://bow.web.id/blog/2012/08/summers-over/ > > I want to say thank you to the Biopython community, especially Peter > for mentoring me this summer :), to OBF for accepting my proposal, and > to anyone who has helped and given me valuable inputs for me > throughout the project :). > > It's been a priceless learning experience, and I only hope that my > code will be useful in return. > > There are still some things to do before the code is merge-ready and > even more when the code is included in an official release, so I'll > still be around. > > cheers, > Bow Thank you Bow, It has been a pleasure to mentor you, and I'm excited about getting this (and Lenna's and other branches) into Biopython. Now, back to the module naming discussion... 
;) http://lists.open-bio.org/pipermail/biopython-dev/2012-August/009868.html http://lists.open-bio.org/pipermail/biopython-dev/2012-August/009888.html Peter From pjotr.public14 at thebird.nl Wed Aug 22 06:43:52 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Wed, 22 Aug 2012 12:43:52 +0200 Subject: [GSoC] [BioRuby] Final GSoC report In-Reply-To: References: Message-ID: <20120822104352.GA11847@thebird.nl> Yes, linking to D from an interpreted language is not hard, basically it is the same calling convention as that of C. So a D shared library looks the same as a C shared library to the calling code - all existing foreign function interfaces (FFI) work. That is the good news. The bad news, as Artem points out, is that there is a problem in the D garbage collector. Items get collected, which should not. This will be fixed sooner or later. The commitment is there, and it is moving up the priority list. For us it did not matter, as the parsers and tools happily run on their own using a command line interface, without much overhead. One advantage, from my perspective, is that we are not tied to Ruby, at this point, and the tools can be hosted in Galaxy. Another advantage, perhaps, is that we have not been side-tracked in providing rich library interfaces. That appeals to my purist side. Writing FFI bindings later is not a problem. Pj. On Wed, Aug 22, 2012 at 10:42:12AM +0400, Artem Tarasov wrote: > Hi, > > Unfortunately, the problem is on the side of D. PyD wiki ( > https://bitbucket.org/ariovistus/pyd/wiki/Home) says that "extension > libraries are nominally working with LDC (FE 2.060 or later); however, > druntime currently limits what can be done here". > > However, this issue has become quite popular in last months, see e.g. this > thread: > http://forum.dlang.org/thread/mailman.1330.1345434177.31962.digitalmars-d at puremagic.com > - > so maybe this'll get fixed soon.
> > -- > Artem > > On Wed, Aug 22, 2012 at 4:55 AM, Mic wrote: > > > Hi, > > Python is able to connect to D with help of http://pyd.dsource.org/ . > > > > Maybe it would be something for Biopython > > > > Cheers, > > Mic > > > > On Wed, Aug 22, 2012 at 5:11 AM, Marjan Povolni > >wrote: > > > > > http://blog.mpthecoder.com/post/29910330225/final-gsoc-report > > > > > > *Summary* > > > > > > Yesterday I tagged the 0.4 release of gff3-pltools, and that marks the > > end > > > of the summer. At least in GSoC terms. Should I say end of the project? I > > > don?t think so. The tools can still be improved, and the Ruby bindings > > > should follow. > > > > > > The major changes since the last release include the following: > > > > > > - filtering functionality has been moved to a separate utility: > > > gff3-filter, along with a new language for specifying filtering > > > expressions, > > > - conversion to table format of selected fields has been moved to a > > > separate utility: gff3-select. However, the ?select option is still > > > part of > > > gff3-filter, > > > - gff3-ffetch is now fetching FASTA sequences from GFF3 and FASTA > > files > > > for CDS and mRNA records and features, > > > - man pages for utilities. > > > > > > ** > > > The original idea was to create a GFF3/GTF parser in D and Ruby bindings. > > > The Ruby bindings part didn?t work out because there is still no support > > > for D shared libraries in Linux, but instead there are now a few useful > > > command-line tools for processing GFF3 which can be used without > > > programming knowledge. > > > > > > To me, the summer was fun, challenging, and a great experience. I even > > got > > > to meet my mentor in person, and other community members too, and to make > > > my first steps in bioinformatics. I even gave a small presentation at the > > > EU-codefest. What a summer it was! 
> > > > > > Thanks to everybody who made it possible: Google, Open Bioinformatics > > > Foundation and my mentor Pjotr Prins. > > > > > > -- > > > Marjan > > > > > > _______________________________________________ > > > BioRuby Project - http://www.bioruby.org/ > > > BioRuby mailing list > > > BioRuby at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/bioruby > > > > > > > _______________________________________________ > > GSoC mailing list > > GSoC at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/gsoc > > > > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From p.j.a.cock at googlemail.com Wed Aug 22 07:10:56 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 Aug 2012 12:10:56 +0100 Subject: [GSoC] [BioRuby] Final GSoC report In-Reply-To: <20120822104352.GA11847@thebird.nl> References: <20120822104352.GA11847@thebird.nl> Message-ID: On Wed, Aug 22, 2012 at 11:43 AM, Pjotr Prins wrote: > Yes, linking to D from an interpreted language is not hard, basically > it is the same calling convention as that of C. So a D shared library > looks the same as a C shared library to the calling code - all > existing foreign function interfaces (FFI) work. That is the good > news. How do things stand from a cross-platform perspective? i.e. When might this be doable on Linux, Mac OS X, and Windows? (and other Unix like platforms of potential interest) > The bad news, as Artem points out, is that there is a problem in the > D garbage collector. Items get collected, which should not. This will > be fixed sooner or later. The commitment is there, and it is moving > up the priority list. Is there a D issue/bug tracker for this? 
Thanks, Peter From hlapp at drycafe.net Sun Aug 26 22:14:35 2012 From: hlapp at drycafe.net (Hilmar Lapp) Date: Sun, 26 Aug 2012 22:14:35 -0400 Subject: [GSoC] [BioRuby] GSoC final report In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF33B73325@CHIMBX5.ad.uillinois.edu> References: <20120820095559.GB2453@thebird.nl> <118F034CF4C3EF48A96F86CE585B94BF33B73325@CHIMBX5.ad.uillinois.edu> Message-ID: <56EAF480-2694-4825-999E-14763A6A3ACB@drycafe.net> Indeed, congratulations to all of OBF's 2012 GSoC students and mentors - great job! It'd be great to have a summary blog post on the OBF news blog - anyone up for composing that? -hilmar On Aug 22, 2012, at 12:16 AM, Fields, Christopher J wrote: > Wholeheartedly agree. Congrats Artem on a job well done! > > chris > > On Aug 20, 2012, at 4:55 AM, Pjotr Prins wrote: > >> Thank you Artem, >> >> Great job. >> >> Pj. >> >> On Mon, Aug 20, 2012 at 01:47:18PM +0400, Artem Tarasov wrote: >>> Hi all, >>> >>> here's a wrap-up of what was added/fixed in sambamba and Ruby bindings >>> during the last couple of weeks: >>> http://lomereiter.wordpress.com/2012/08/20/gsoc-weekly-report-12/ >>> >>> * Tools don't require input files to be seekable anymore, allowing to work >>> with e.g. /dev/stdin and /dev/stdout >>> * I've added an option of MessagePack output, that drastically improved >>> speed of bindings (2-4x speedup depending on configuration) >>> * The gem is on Travis CI now, passing tests on MRI 1.9.2/1.9.3. (JRuby >>> also works, but on Travis there're problems with popen, same as with >>> BioRuby) >>> * The tool 'sambamba_filter' is now available on Galaxy Tool Shed. >>> >>> >>> It's been a great summer. 
Thank you OBF/BioRuby folks :) >>> >>> >>> -- >>> Artem >>> _______________________________________________ >>> BioRuby Project - http://www.bioruby.org/ >>> BioRuby mailing list >>> BioRuby at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioruby >>> >> _______________________________________________ >> BioRuby Project - http://www.bioruby.org/ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby > > > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : =========================================================== From p.j.a.cock at googlemail.com Mon Sep 3 09:08:51 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Sep 2012 14:08:51 +0100 Subject: [GSoC] [BioRuby] GSoC final report In-Reply-To: <56EAF480-2694-4825-999E-14763A6A3ACB@drycafe.net> References: <20120820095559.GB2453@thebird.nl> <118F034CF4C3EF48A96F86CE585B94BF33B73325@CHIMBX5.ad.uillinois.edu> <56EAF480-2694-4825-999E-14763A6A3ACB@drycafe.net> Message-ID: On Mon, Aug 27, 2012 at 3:14 AM, Hilmar Lapp wrote: > Indeed, congratulations to all of OBF's 2012 GSoC students > and mentors - great job! > > It'd be great to have a summary blog post on the OBF news > blog - anyone up for composing that? > > -hilmar I agree it is a good idea. I'm in Japan for the 2012 BioHackathon, and have spoken with Pjotr, Raul and Francesco - I think we can work on a blog post together this week (I have editing rights). Brad - would you like to contribute/preview the text? Shall we ask your co-mentors too? 
Regards, Peter From chapmanb at 50mail.com Tue Sep 4 05:30:23 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 04 Sep 2012 05:30:23 -0400 Subject: [GSoC] [BioRuby] GSoC final report In-Reply-To: References: <20120820095559.GB2453@thebird.nl> <118F034CF4C3EF48A96F86CE585B94BF33B73325@CHIMBX5.ad.uillinois.edu> <56EAF480-2694-4825-999E-14763A6A3ACB@drycafe.net> Message-ID: <87txvei18w.fsf@fastmail.fm> Peter; Thanks for doing this. Happy to help from my end. For Lenna's project here is some text to use: Lenna Peterson worked on improving support for variation analysis in Biopython. Her summer work produced tools for manipulating [VCF (variant call format)][0] in Python, a thorough investigation of incorporating [PyVCF][1] into Biopython, and tools to handle clean coordinate mapping between transcripts and genomic coordinates. [Her blog][2] has detailed discussions of progress during the summer and the code is in a [branch on GitHub][3] awaiting inclusion into Biopython. [0]: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 [1]: https://github.com/jamescasbon/PyVCF [2]: http://arklenna.tumblr.com/ [3]: https://github.com/lennax Let me know if anything else would be useful, Brad > On Mon, Aug 27, 2012 at 3:14 AM, Hilmar Lapp wrote: >> Indeed, congratulations to all of OBF's 2012 GSoC students >> and mentors - great job! >> >> It'd be great to have a summary blog post on the OBF news >> blog - anyone up for composing that? >> >> -hilmar > > I agree it is a good idea. > > I'm in Japan for the 2012 BioHackathon, and have spoken with > Pjotr, Raul and Francesco - I think we can work on a blog post > together this week (I have editing rights). > > Brad - would you like to contribute/preview the text? Shall we > ask your co-mentors too? 
> > Regards, > > Peter From pjotr.public14 at thebird.nl Tue Sep 25 02:08:50 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 25 Sep 2012 08:08:50 +0200 Subject: [GSoC] [BioRuby] GSoC week 2 status report Message-ID: <20120925060850.GA1143@thebird.nl> Hi John, Congrats from the BioRuby panel and community on winning the Ruby Association Grant! http://sciruby.com/blog/2012/09/24/sciruby-receives-ruby-association-grant--fellowships-available/ Pj. From p.j.a.cock at googlemail.com Mon Nov 26 12:22:00 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 26 Nov 2012 17:22:00 +0000 Subject: [GSoC] GSoC python variant final update In-Reply-To: References: Message-ID: On Mon, Aug 20, 2012 at 5:22 AM, Lenna Peterson wrote: > Post: http://arklenna.tumblr.com/post/29808300789/ > > The coordinate mapper, with updated documentation, is now located on > this branch: https://github.com/lennax/biopython/tree/f_loc4 > It awaits the merging of Peter's f_loc4 branch. > > I've written an entry on coordinate mapping for the Cookbook: > http://biopython.org/wiki/Coordinate_mapping Hi Lenna, Do you need my f_loc4 branch for the main GSoC variants work, or just the coordinate mapper? Thanks, Peter From p.j.a.cock at googlemail.com Fri Mar 16 21:40:56 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 16 Mar 2012 21:40:56 +0000 Subject: [GSoC] [Open-bio-l] Google Summer of Code is *ON* for OBF projects! In-Reply-To: <4F6398E8.4010806@gmail.com> References: <4F6398E8.4010806@gmail.com> Message-ID: On Fri, Mar 16, 2012 at 7:47 PM, Robert Buels wrote: > Hi all, > > Great news: Google announced today that the Open Bioinformatics > Foundation has been accepted as a mentoring organization for this > summer's Google Summer of Code! > > GSoC is a Google-sponsored student internship program for open-source > projects, open to students from around the world (not just US > residents). 
Students are paid a $5000 USD stipend to work as a > developer on an open-source project for the summer. For more on GSoC, > see GSoC 2012 FAQ at http://goo.gl/kNv48 > > Student applications are due April 6, 2012 at 19:00 UTC. Students who > are interested in participating should look at the OBF's GSoC page at > http://open-bio.org/wiki/Google_Summer_of_Code, which lists project > ideas, and whom to contact about applying. > > For current developers on OBF projects, please consider volunteering to > be a mentor if you have not already, and contribute project ideas. Just > list your name and project ideas on OBF wiki and on the relevant > project's GSoC wiki page. > > Thanks to all who helped make OBF's application to GSoC a success, and > let's have a great, productive summer of code! > > Rob Buels > OBF GSoC 2012 Administrator Excellent news - well done Rob et al. Would you like me to post this to the news blog, or can you? http://news.open-bio.org/news/ Thanks, Peter From cjfields at illinois.edu Fri Mar 16 21:49:32 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 16 Mar 2012 21:49:32 +0000 Subject: [GSoC] [Open-bio-l] Google Summer of Code is *ON* for OBF projects! In-Reply-To: References: <4F6398E8.4010806@gmail.com> Message-ID: On Mar 16, 2012, at 4:40 PM, Peter Cock wrote: > On Fri, Mar 16, 2012 at 7:47 PM, Robert Buels wrote: >> Hi all, >> >> Great news: Google announced today that the Open Bioinformatics >> Foundation has been accepted as a mentoring organization for this >> summer's Google Summer of Code! >> >> GSoC is a Google-sponsored student internship program for open-source >> projects, open to students from around the world (not just US >> residents). Students are paid a $5000 USD stipend to work as a >> developer on an open-source project for the summer. For more on GSoC, >> see GSoC 2012 FAQ at http://goo.gl/kNv48 >> >> Student applications are due April 6, 2012 at 19:00 UTC. 
Students who >> are interested in participating should look at the OBF's GSoC page at >> http://open-bio.org/wiki/Google_Summer_of_Code, which lists project >> ideas, and whom to contact about applying. >> >> For current developers on OBF projects, please consider volunteering to >> be a mentor if you have not already, and contribute project ideas. Just >> list your name and project ideas on OBF wiki and on the relevant >> project's GSoC wiki page. >> >> Thanks to all who helped make OBF's application to GSoC a success, and >> let's have a great, productive summer of code! >> >> Rob Buels >> OBF GSoC 2012 Administrator > > Excellent news - well done Rob et al. > > Would you like me to post this to the news blog, or can you? > http://news.open-bio.org/news/ > > Thanks, > > Peter I think post away. I've already tweeted this. chris From p.j.a.cock at googlemail.com Fri Mar 16 21:54:46 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 16 Mar 2012 21:54:46 +0000 Subject: [GSoC] [Open-bio-l] Google Summer of Code is *ON* for OBF projects! In-Reply-To: References: <4F6398E8.4010806@gmail.com> Message-ID: On Fri, Mar 16, 2012 at 9:49 PM, Fields, Christopher J wrote: > On Mar 16, 2012, at 4:40 PM, Peter Cock wrote: >> >> Excellent news - well done Rob et al. >> >> Would you like me to post this to the news blog, or can you? >> http://news.open-bio.org/news/ >> >> Thanks, >> >> Peter > > I think post away. I've already tweeted this. > > chris > Done, http://news.open-bio.org/news/2012/03/obf-accepted-for-gsoc-2012/ This was posted to the @obf_news twitter account here https://twitter.com/obf_news/status/180773706715504640 Peter From ayushgoel111 at gmail.com Thu Mar 22 18:03:49 2012 From: ayushgoel111 at gmail.com (Ayush Goel) Date: Thu, 22 Mar 2012 23:33:49 +0530 Subject: [GSoC] Interested in working on SearchIO Message-ID: Hello, I am a student at Delhi College of Engineering. I have prior experience with Python from two other internships. 
I was hoping to find myself a more challenging project this time with Python as the default language. The description of the SearchIO project seems to be a very good one. Still, I am pretty new to Biopython's code. If possible, I would like to have some more information regarding what is expected from the deliverable. Also if some reference material on the background of the data formats required (BLAST etc) could be provided, then it would be very helpful. -- Regards, Ayush Goel From p.j.a.cock at googlemail.com Fri Mar 23 09:30:10 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 23 Mar 2012 09:30:10 +0000 Subject: [GSoC] Interested in working on SearchIO In-Reply-To: References: Message-ID: On Thu, Mar 22, 2012 at 6:03 PM, Ayush Goel wrote: > Hello, > > I am a student at Delhi College of Engineering. I have prior > experience with Python from two other internships. I was hoping to find myself > a more challenging project this time with Python as the default > language. The description of the SearchIO project seems to be a very > good one. > > Still, I am pretty new to Biopython's code. If possible, I would > like to have some more information regarding what is expected from the > deliverable. Also if some reference material on the background of the > data formats required (BLAST etc) could be provided, then it would be > very helpful. Hello Ayush, Are you doing any biology or bioinformatics courses? That would help with background knowledge. The SearchIO project does require a reasonably broad knowledge of important tools and concepts in pairwise sequence alignment - if you are not familiar with BLAST etc that will be a big handicap. You don't need to know the algorithm details - just the overall idea, and how to run the tools and what kind of analysis people might want to do with it. 
Some possible background reading (an introductory Bioinformatics course or book might be good too): http://www.ncbi.nlm.nih.gov/BLAST/ http://en.wikipedia.org/wiki/BLAST http://emboss.open-bio.org/wiki/Appdoc:Needle http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm http://emboss.open-bio.org/wiki/Appdoc:Water http://en.wikipedia.org/wiki/Smith-Waterman_algorithm In terms of possible deliverables, I went into more detail here: http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html However, if you have a lot of experience with Python and parsing text and XML files, that would be a big plus. Perhaps there is another topic that might suit you better. Is there a particular reason why you are interested in Biopython? Regards, Peter From saketkc at gmail.com Mon Mar 26 08:40:51 2012 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 26 Mar 2012 14:10:51 +0530 Subject: [GSoC] [GSoC2012] BioRuby Message-ID: Hi ! I am Saket Choudhary, a third year undergraduate student at IIT Bombay, India. Ruby has been my first love for programming. My last project[internship] was at SlideShare , one of the biggest Ruby on Rails website in the world. I had developed the Admin interface for SlideShare which could enable suspending users, reconversion/deletion of SlideShows , and user deletion/suspension. My hack at Yahoo! Open Hack India -2011 qualified among the Top 50 hacks , I built a Sinatra app for fetching a defined file from a Dropbox account and sending it to a specified email address , just on a SMS.[ https://github.com/saketkc/dropbox_on_sms] I went through the GSoC idea page and "Adding social networking functionality to BioRuby.org", is of my special interest. I have an experience working on the Rails platform. My plans for making the website more "Social" is as follows: 1. 
Provide an online 'Scratchpad' for Ruby/BioRuby enthusiasts that not only allows them to run their codes online but also provides them a facility to store it online in the form of an archive so that they an acess it later. 2. Include sharing facility on "Scratchpad" so that one user can share his code online with other users/community and get feedback/comments. On the lines of "Sage " notebooks. 3. Develop an Online Board on the lines of "Quora " Boards so that users can pin certain codes/algorithms on to their own boards for their reference this would reduce the overhead of searching for a particular algorithm again and again . Let me know you views on these ideas. I would send a mockup of the same incase these ideas seem feasible to you. Thanks Saket Choudhary IIT Bombay From saketkc at gmail.com Mon Mar 26 15:32:40 2012 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 26 Mar 2012 21:02:40 +0530 Subject: [GSoC] [GSoC2012] BioRuby In-Reply-To: <20120326133700.GA22488@thebird.nl> References: <20120326133700.GA22488@thebird.nl> Message-ID: Hi Pjotr, Thanks ! I have been using BioRuby and BioPython for 6 months now to solve the "Protein Loop Closure" problem. I use it mainly to manipulate the atom positions in a given PDB file, thus perturbing their positions. I went through the discussion that happened on the mailing list last month: Here are my notes about the same 1.http://lists.open-bio.org/pipermail/bioruby/2012-February/002087.html Even I am a big fan of Jruby homepage. Here are my suggestiosn : 1. Ruby is simple, clear , intutitve and this makes Biouby intutive to everyone. 
This needs to be emphasised the very first time a user comes to the webpage : Instead of giving them examples on Wiki[ http://bioruby.open-bio.org/wiki/SampleCodes] to "read" through a tutorial in the form of a short writeup/description about Ruby/BioRuby foolwed by a challenge would be more appealing and intuitive to the user even though he is being exposed to Ruby/BioRuby for the first time. Say some tutorial on the lines of http://www.codecademy.com/ , a short tutorial followed by your Scratchpad. ! 2. Calendar/Tweet/Conference widget: Something again on the lines of Jruby website. 3. Favicon missing ? Though a very trivial issue , but just wanted to know why isn't there a favicon for bioruby.org ? These are the stuff I gathered , I am still digging the old threads, will post here if something relevant comes up ! Saket Choudhary IIT Bombay github.com/saketkc On 26 March 2012 19:07, Pjotr Prins wrote: > Hi Saket, > > Welcome! > > It would be good if you also introduce yourself to the BioRuby ML, and > post your ideas. We are working on the website (should I say > web 'experience'), and I like what you propose. Also check out the ML > archive of the last months, you'll find a lot of information. > > Pj. > > On Mon, Mar 26, 2012 at 02:10:51PM +0530, Saket Choudhary wrote: > > Hi ! > > > > I am Saket Choudhary, a third year undergraduate student at IIT Bombay, > > India. > > > > Ruby has been my first love for programming. My last project[internship] > > was at SlideShare , one of the biggest Ruby > on > > Rails website in the world. I had developed the Admin interface for > > SlideShare which could enable suspending users, reconversion/deletion of > > SlideShows , and user deletion/suspension. > > > > My hack at Yahoo! 
Open Hack India -2011 qualified among the Top 50 hacks > , > > I built a Sinatra app for fetching a defined file from a Dropbox account > > and sending it to a specified email address , just on a SMS.[ > > https://github.com/saketkc/dropbox_on_sms] > > > > I went through the GSoC idea page and "Adding social networking > > functionality to BioRuby.org", is of my special interest. I have an > > experience working on the Rails platform. My plans for making the website > > more "Social" is as follows: > > > > 1. Provide an online 'Scratchpad' for Ruby/BioRuby enthusiasts that not > > only allows them to run their codes online but also provides them a > > facility to store it online in the form of an archive so that they an > acess > > it later. > > > > 2. Include sharing facility on "Scratchpad" so that one user can share > his > > code online with other users/community and get feedback/comments. On the > > lines of "Sage " notebooks. > > > > 3. Develop an Online Board on the lines of "Quora >" > > Boards so that users can pin certain codes/algorithms on to their own > > boards for their reference this would reduce the overhead of searching > for > > a particular algorithm again and again . > > > > Let me know you views on these ideas. I would send a mockup of the same > > incase these ideas seem feasible to you. > > > > Thanks > > > > Saket Choudhary > > IIT Bombay > > _______________________________________________ > > GSoC mailing list > > GSoC at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/gsoc > From pjotr2012 at thebird.nl Mon Mar 26 13:37:00 2012 From: pjotr2012 at thebird.nl (Pjotr Prins) Date: Mon, 26 Mar 2012 15:37:00 +0200 Subject: [GSoC] [GSoC2012] BioRuby In-Reply-To: References: Message-ID: <20120326133700.GA22488@thebird.nl> Hi Saket, Welcome! It would be good if you also introduce yourself to the BioRuby ML, and post your ideas. We are working on the website (should I say web 'experience'), and I like what you propose. 
Also check out the ML archive of the last months, you'll find a lot of information. Pj. On Mon, Mar 26, 2012 at 02:10:51PM +0530, Saket Choudhary wrote: > Hi ! > > I am Saket Choudhary, a third year undergraduate student at IIT Bombay, > India. > > Ruby has been my first love for programming. My last project[internship] > was at SlideShare , one of the biggest Ruby on > Rails website in the world. I had developed the Admin interface for > SlideShare which could enable suspending users, reconversion/deletion of > SlideShows , and user deletion/suspension. > > My hack at Yahoo! Open Hack India -2011 qualified among the Top 50 hacks , > I built a Sinatra app for fetching a defined file from a Dropbox account > and sending it to a specified email address , just on a SMS.[ > https://github.com/saketkc/dropbox_on_sms] > > I went through the GSoC idea page and "Adding social networking > functionality to BioRuby.org", is of my special interest. I have an > experience working on the Rails platform. My plans for making the website > more "Social" is as follows: > > 1. Provide an online 'Scratchpad' for Ruby/BioRuby enthusiasts that not > only allows them to run their codes online but also provides them a > facility to store it online in the form of an archive so that they an acess > it later. > > 2. Include sharing facility on "Scratchpad" so that one user can share his > code online with other users/community and get feedback/comments. On the > lines of "Sage " notebooks. > > 3. Develop an Online Board on the lines of "Quora " > Boards so that users can pin certain codes/algorithms on to their own > boards for their reference this would reduce the overhead of searching for > a particular algorithm again and again . > > Let me know you views on these ideas. I would send a mockup of the same > incase these ideas seem feasible to you. 
> > Thanks > > Saket Choudhary > IIT Bombay > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From shahruchin711 at gmail.com Sun Apr 1 11:23:00 2012 From: shahruchin711 at gmail.com (Ruchin Shah) Date: Sun, 1 Apr 2012 16:53:00 +0530 Subject: [GSoC] BioJava- Porting the BLAST,HMMER algorithms Message-ID: Hi, I am Ruchin Shah, 3rd year undergraduate student from DA-IICT,India. I would like to work on some challenging projects in the field of bioinformatics. I have already worked on a project called BioSpectroGram(written in Java)under the mentorship of Prof. Manish K. Gupta(http://www.guptalab.org/mankg/public_html/) which aims at analyzing DNA and protein sequences using various kinds of transfromations(FFT,DCT,etc.). I came to know about the idea of implementing the two algorithms-BLAST and HMMER, and i find it very fascinating. I have a good coding experience(http://www.spoj.pl/users/ruchinshah/). I am also familiar with the FASTA and GenBank formats. I read about the BLASTA algorithm at http://en.wikipedia.org/wiki/BLAST#BLAST but if possible I would like to know more about these two algorithms and exactly what is expected from the project and also some more references.If I am not wrong then you are expecting to use some C-to-Java conversion tool or JNI to exploit the already available BLAST+ tool and not implement the algorithms from scratch . From andreas at sdsc.edu Sun Apr 1 18:02:16 2012 From: andreas at sdsc.edu (Andreas Prlic) Date: Sun, 1 Apr 2012 11:02:16 -0700 Subject: [GSoC] BioJava- Porting the BLAST,HMMER algorithms In-Reply-To: References: Message-ID: Hi Ruchin, Are you also on the biojava-l mailing list? We had quite a number of discussions about this project already there and if you are not on the list it might be a good start to catch up with what was already discussed there. 
http://lists.open-bio.org/pipermail/biojava-l/ The idea in short is to come up with an all-Java version of some of the frequently used algorithms. We are quite flexible regarding the projects and what we are really looking for are sound projects and motivated students. What is expected is a realistic project proposal, which in turn depends on your background and how you propose to conduct the project. Andreas On Sun, Apr 1, 2012 at 4:23 AM, Ruchin Shah wrote: > Hi, > > I am Ruchin Shah, 3rd year undergraduate student from DA-IICT,India. > > I would like to work on some challenging projects in the field of > bioinformatics. I have already worked on a project called > BioSpectroGram(written in Java)under the mentorship of Prof. Manish K. > Gupta(http://www.guptalab.org/mankg/public_html/) which aims at analyzing > DNA and protein sequences using various kinds of > transfromations(FFT,DCT,etc.). I came to know about the idea of implementing > the two algorithms-BLAST and HMMER, and i find it very fascinating. I have > a good coding experience(http://www.spoj.pl/users/ruchinshah/). I am also > familiar with the FASTA and GenBank formats. > > I read about the BLASTA algorithm at > http://en.wikipedia.org/wiki/BLAST#BLAST but if possible I would like to > know more about these two algorithms and exactly what is expected from the > project > and also some more references.If I am not wrong then you are expecting to > use some C-to-Java conversion tool or JNI to exploit the already available > BLAST+ tool and not implement the algorithms from scratch . 
> _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From p.j.a.cock at googlemail.com Tue Apr 24 11:21:56 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Apr 2012 12:21:56 +0100 Subject: [GSoC] Fwd: Announcing OBF Google Summer of Code Accepted Students In-Reply-To: <4F95EA76.4030004@gmail.com> References: <4F95EA76.4030004@gmail.com> Message-ID: The announcement is also on the OBF news blog now: http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ ---------- Forwarded message ---------- From: Robert Buels Date: Tue, Apr 24, 2012 at 12:49 AM Subject: [Bioperl-l] Announcing OBF Google Summer of Code Accepted Students To: BioPerl List , BioJava List , BioRuby List , BioPython List , BioDAS List , BioLib List , BioSQL List Hello all, I'm very pleased and excited to announce that the Open Bioinformatics Foundation has selected 5 very capable students to work on OBF projects this summer as part of the Google Summer of Code program. The accepted students, their projects, and their mentors (in alphabetical order):
Wibowo Arindrarto
    SearchIO Implementation in Biopython
    mentored by Peter Cock
Lenna Peterson
    Diff My DNA: Development of a Genomic Variant Toolkit for Biopython
    mentored by Brad Chapman
Marjan Povolni
    The worlds fastest parallelized GFF3/GTF parser in D, and an interfacing biogem plugin for Ruby
    mentored by Pjotr Prins, Francesco Strozzi, Raoul Bonnal
Artem Tarasov
    Fast parallelized GFF3/GTF parser in C++, with Ruby FFI bindings
    mentored by Pjotr Prins, Francesco Strozzi, Raoul Bonnal
Clayton Wheeler
    Multiple Alignment Format parser for BioRuby
    mentored by Francesco Strozzi and Raoul Bonnal
As in every year, we received many great applications and ideas. However, funding and mentor resources are limited, and we were not able to accept as many as we would have liked. 
Our deepest thanks to all the students who applied: we sincerely appreciate the time and effort you put into your applications, and hope you will still consider being a part of the OBF's open source projects, even without Google funding. I speak for myself and all of the mentors who read and scored applications when I say that we were truly honored by the number and quality of the applications we received. For the accepted students: congratulations! You have risen to the top of a very competitive application process. Now it's time to "put your money where your mouth is", as the saying goes. Let's get out there and write some great code this summer! Best regards, Rob ---- Robert Buels OBF GSoC 2012 Administrator _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From p.j.a.cock at googlemail.com Tue Apr 24 11:24:20 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Apr 2012 12:24:20 +0100 Subject: [GSoC] OBF GSoC students weekly progress reports Message-ID: Hello all, First, to echo Rob, congratulations to our selected students: http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ http://lists.open-bio.org/pipermail/gsoc/2012/000049.html Weekly Progress Reports: To encourage community bonding and awareness of what the GSoC 2012 students are doing, this year the OBF is being much clearer about our progress report expectations. We would like every student to set up a blog for the GSoC project (or a category/tag on your existing blog) which you will use to summarize your progress every week, as well as longer posts at the halfway evaluation, and at the end of the summer. 
In addition, after publishing each blog post, we expect you to email the URL and the text of the blog (or if important images or formatting would be lost, at least a short summary) to the host project's mailing list(s) (check with your mentors if the project has more than one) AND the gsoc at open-bio.org mailing list.

You will be writing under your own name, but with a clear association with your mentors, the OBF and its projects, so please take this seriously and be professional. Remember this will become part of your online presence, and potentially looked at by future employers and colleagues.

Please talk to your mentors about this during the "community bonding" stage of the GSoC code (i.e. the next few weeks before you actually start).

Thank you,

Peter

(On behalf of the OBF GSoC mentors and projects)

Note: As per Rob's earlier email, could both students and mentors please ensure you have subscribed to the public OBF GSoC email list at http://lists.open-bio.org/mailman/listinfo/gsoc (I have BCC'd you on this email just in case you haven't done this yet). Thanks!

From arklenna at gmail.com Tue Apr 24 17:21:46 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Tue, 24 Apr 2012 13:21:46 -0400
Subject: [GSoC] OBF GSoC students weekly progress reports
In-Reply-To:
References:
Message-ID:

Hi all,

I'm very excited to be participating in GSoC '12 with Biopython!

My development blog is on tumblr, which I chose primarily because it supports markdown syntax, which I'm used to from GitHub.

Here's my gsoc12 tag: http://arklenna.tumblr.com/tagged/gsoc2012

However, Tumblr doesn't allow post comments. Will I need to switch to a blog platform that allows comments?
Cheers,

Lenna

From p.j.a.cock at googlemail.com Tue Apr 24 17:55:51 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 24 Apr 2012 18:55:51 +0100
Subject: [GSoC] OBF GSoC students weekly progress reports
In-Reply-To:
References:
Message-ID:

Hi Lenna,

Great - you've got a blog already, and you're also the first student to reply :)

Blog comments could be nice, but personally in your shoes I'd direct any discussion to the biopython(-dev) mailing list, e.g.:

1. Post weekly update blog, get blog post URL
2. Send email with summary, including blog post URL
3. Go to mailing list archive, get archived email URL
4. Update blog post to link to email (and thus any thread from it, at least for that month).

A little cumbersome, but it would save you moving your blog?

I'd actually be happier with most discussion on the biopython-dev list rather than blog comments, or even github (which will still be useful for things like code reviews).

This may be different for the other projects - I know BioRuby uses IRC much more for example, but even there they've tried to post archives of important IRC discussions to their mailing list too.

Thank you!
Peter

From arklenna at gmail.com Tue Apr 24 18:41:25 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Tue, 24 Apr 2012 14:41:25 -0400
Subject: [GSoC] OBF GSoC students weekly progress reports
In-Reply-To:
References:
Message-ID:

Peter,

If I get ambitious, I could write a Python script to retrieve the mailing list URL and put it into my blog post!

To clarify - for biopython, should the update emails go out to both the biopython and biopython-dev mailing lists, or just the latter?

Lenna

From w.arindrarto at gmail.com Tue Apr 24 19:01:23 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 24 Apr 2012 21:01:23 +0200
Subject: [GSoC] [Biopython] OBF GSoC students weekly progress reports
In-Reply-To:
References:
Message-ID:

Hi everyone,

Wibowo Arindrarto here, but you can just call me Bow for short :). I'm very excited to be accepted into GSoC with OBF as well!

I will be blogging on my site: http://bow.web.id/blog, and I've actually made my inaugural GSoC post just a few hours after I heard the news, here: http://bow.web.id/blog/2012/04/google-summer-of-code-is-on/. I'll be posting all GSoC related posts under the `gsoc` tag, accessible through this URL: http://bow.web.id/blog/tag/gsoc/.
To follow Peter's suggestion, I'll post my weekly progress in this mailing list for everyone to see, too.

cheers,
Bow

From rbuels at gmail.com Tue Apr 24 19:13:48 2012
From: rbuels at gmail.com (Robert Buels)
Date: Tue, 24 Apr 2012 15:13:48 -0400
Subject: [GSoC] [Biopython] OBF GSoC students weekly progress reports
In-Reply-To:
References:
Message-ID: <4F96FB6C.3010805@gmail.com>

Bow, make sure you subscribe to the OBF GSoC mailing list.

http://lists.open-bio.org/mailman/listinfo/gsoc

Rob

From marian.povolny at gmail.com Wed Apr 25 17:17:01 2012
From: marian.povolny at gmail.com (Marjan Povolni)
Date: Wed, 25 Apr 2012 19:17:01 +0200
Subject: [GSoC] OBF GSoC students weekly progress reports
In-Reply-To:
References:
Message-ID:

Hi Peter,

Another excited GSoC student here :)

I think the idea with a blog for status updates is a great idea; I would probably have done it even if it wasn't a requirement. I didn't have a blog before, so I created one at tumblr, and it should be possible for the visitors to leave comments too. But I do agree with you that the ML is a better place for discussions about our GSoC projects. Here is a link to my new blog:

http://blog.mpthecoder.com/

GSoC related posts will be tagged with #gsoc (http://blog.mpthecoder.com/tagged/gsoc).
@Lenna
Tumblr lets you use your Disqus account if you want to enable comments on your tumblr blog. However, not all themes support it. See the first q&a here for more info:

http://www.tumblr.com/help

It took me about 2 minutes to create an account on Disqus and link it to my blog.

--
Marjan

From arklenna at gmail.com Thu Apr 26 00:16:11 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Wed, 25 Apr 2012 20:16:11 -0400
Subject: [GSoC] OBF GSoC students weekly progress reports
In-Reply-To:
References:
Message-ID:

On Wed, Apr 25, 2012 at 1:17 PM, Marjan Povolni wrote:
>
> @Lenna
> Tumblr lets you use your Disqus account if you want to enable comments on
> your tumblr blog. However, not all themes support it. See the first q&a
> here for more info:
>
> http://www.tumblr.com/help
>
> It took me about 2 minutes to create an account on Disqus and link it to my
> blog.
>
> --
> Marjan

Marjan -

Thanks for the tip! I have disqus set up on my tumblr now.

I also filed my enrollment and tax forms with Google. Now I'm busy in the thinking phase ;)

Lenna

From p.j.a.cock at googlemail.com Thu Apr 26 09:49:26 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 26 Apr 2012 10:49:26 +0100
Subject: [GSoC] OBF GSoC students weekly progress reports
In-Reply-To:
References:
Message-ID:

On Wed, Apr 25, 2012 at 6:17 PM, Marjan Povolni wrote:
> Hi Peter,
>
> Another excited GSoC student here :)
>
> I think the idea with a blog for status updates is a great idea, I would have
> done it probably even if it wasn't a requirement. I didn't have a blog
> before, so I created one at tumblr, and it should be possible for the
> visitors to leave comments too. But I do agree with you that the ML is a
> better place for discussions about our GSoC projects. Here is a link to my
> new blog:
>
> http://blog.mpthecoder.com/
>
> GSoC related posts will be tagged with #gsoc
> (http://blog.mpthecoder.com/tagged/gsoc).
Excellent, I've added the three Blog links so far to this post:
http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/

I'll do another full post highlighting your blogs once all five are ready.

Thanks,

Peter

From lomereiter at googlemail.com Thu Apr 26 11:43:07 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Thu, 26 Apr 2012 15:43:07 +0400
Subject: [GSoC] OBF GSoC students weekly progress reports
In-Reply-To:
References:
Message-ID:

Hi all,

I'm also very excited about being accepted :)

> Excellent, I've added the three Blog links so far to this post:
> http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/
>
> I'll do another full post highlighting your blogs once all five
> are ready.

My blog posts will be at http://lomereiter.wordpress.com/tag/gsoc; I'll update it at least every week during the coding period.

From rbuels at gmail.com Thu Apr 26 15:05:01 2012
From: rbuels at gmail.com (Robert Buels)
Date: Thu, 26 Apr 2012 11:05:01 -0400
Subject: [GSoC] OBF GSoC students weekly progress reports
In-Reply-To:
References:
Message-ID: <4F99641D.3070908@gmail.com>

Thanks for handling the blog links Peter! The wiki page has them now too.

http://www.open-bio.org/wiki/Google_Summer_of_Code#About_Google_Summer_of_Code

Artem and Clayton: please update that wiki page to link to your progress blogs and notify Peter so he can put the link on the OBF blog.

Rob

From marian.povolny at gmail.com Sat May 5 13:07:30 2012
From: marian.povolny at gmail.com (Marjan Povolni)
Date: Sat, 5 May 2012 15:07:30 +0200
Subject: [GSoC] GSoC weekly status report No.1
Message-ID:

Hello all,

It might be a little early, but there has been so much going on in the last 10 days since the results of GSoC were published...

http://blog.mpthecoder.com/post/22380853664/gsoc-weekly-status-report-no-1

A short summary:

It has been 10 days since the GSoC results were published, and a lot has happened since then. I got to know the other students and mentors in a longish meeting on Google hangout, I got into a discussion with my mentor on IRC in which we didn't agree about the parallelization strategy for the parser (experiments will show who's right) and my inbox is full of mails from my mentor and other students, in which we exchanged loads of interesting ideas. Also, I fixed a bug in the biogems.info website, which was stopping Pjotr from updating the website with new information about biogems.

There is now a GitHub repository for my project:

https://github.com/mamarjan/bioruby-hpc-gff3

The work for the first week of coding is halfway done too.

There seems to be huge interest in a GFF3 parser with more features, like indexing, random access and writing output, and also support for linking into trees of features that are not located close to each other in the file.
A fast sequential parser could be used to generate indexes, and the lower-level parts can be used to reorder the file for faster future usage. Based on that, I think this project is a good start.

*If you're using the GFF3/GTF file formats in your research, I would like to ask you to send me example files and descriptions of how your applications use the data. This way I'll be able to test the parser against your files and optimize it for your applications. Currently I have GFF files from Ensembl and Wormbase, and Pjotr pointed me to the genome browser web application at wormbase.org.*

--
Marjan

From rbuels at gmail.com Sun May 6 15:00:07 2012
From: rbuels at gmail.com (Robert Buels)
Date: Sun, 06 May 2012 11:00:07 -0400
Subject: [GSoC] GSoC weekly status report No.1
In-Reply-To:
References:
Message-ID: <4FA691F7.9030905@gmail.com>

Hi Marjan,

You should probably incorporate into your test suite all of the test gff3 files in the test data directory of the Perl Bio::GFF3::LowLevel::Parser. It has coverage for some corner cases that are a little bit tricky.

https://github.com/solgenomics/bio-gff3/tree/master/t/data

Rob

From lomereiter at googlemail.com Sun May 6 19:56:50 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Sun, 6 May 2012 23:56:50 +0400
Subject: [GSoC] [BAM] Weekly report No. 0
Message-ID:

Hi all,

I wrote a few words about what I've done last week:
http://lomereiter.wordpress.com/2012/05/06/gsoc-weekly-report-0/

Summary:

The code is available at github: https://github.com/lomereiter/BAMread/

I already started to write code planned for the first week so as to have more time in June for exam preparation. Opening BAM and parsing SAM header works, and is available from Ruby, and now I need to write some tests and documentation.
Also, I described some compile-time metaprogramming tricks in D which I use to reduce duplication in the code.

I'd be grateful for some small BAM files, 1-50 kilobytes in size, with non-empty headers, for testing purposes.

--
Artem

From marian.povolny at gmail.com Sun May 6 20:22:01 2012
From: marian.povolny at gmail.com (Marjan Povolni)
Date: Sun, 6 May 2012 22:22:01 +0200
Subject: [GSoC] GSoC weekly status report No.1
In-Reply-To: <4FA691F7.9030905@gmail.com>
References: <4FA691F7.9030905@gmail.com>
Message-ID:

Thanks for the tip, that's a great idea!

--
Marjan

From arklenna at gmail.com Sun May 6 21:26:30 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Sun, 6 May 2012 17:26:30 -0400
Subject: [GSoC] GSoC python variant update
Message-ID:

Hi all,

I've written a few new posts on my blog; here's the latest:

http://arklenna.tumblr.com/post/22542372076/spot-isa-dog

I will attach a UML diagram and include the part of the post addressing the diagram. Click through to the full post for a bonus Einstein quote!

-------

My main goals include, but are not limited to:

* Make the structure parser and file-format agnostic: an abstracted OO design should allow anything to be slotted in (for example, Marjan's C GFF parser?)
* Maintain encapsulation: limit how much each object can see of objects above and below it
* Allow extension at multiple levels: some existing parsers may process data in different ways; this structure should allow handling both raw data and data in various formats.

The `Variant` object's constructor allows an end user to change the default parsers. Practical implementation details of `parse()` and `write()` will need to be finessed - for example, ways to help the user sift through immense quantities of data. I'm still in the process of comparing the data contained in VCF/GVF files as well as the APIs of PyVCF and BCBio.GFF.

`Parser` and `Writer` are both abstract classes that will define all methods found in known parsers/writers with `NotImplementedError`s. I'm speculating on whether a Variant-specific exception would be useful, but a custom message should suffice.

Continuing down the diagram, `PyVCFWrapper` and `BCBioGFFWrapper` would each inherit from both `Parser` and `Writer`. As the name implies, they would serve as the adapter between the generic `Variant` and the specific parser.

I anticipate that this structure could easily be extended to allow intermediate storage in DBs as well as innumerable sorting/comparing/filtering methods inside `Variant`.

-------

I would appreciate any and all feedback about the overall structure. Namespace is definitely flexible. I'd also appreciate any specific genomic variant workflows, and if somebody can point me to smallish sample files of the same data in both VCF and GVF, I'd be eternally grateful.

Regards,

Lenna

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Variant_UML.png
Type: image/png
Size: 23313 bytes
Desc: not available
URL:

From chapmanb at 50mail.com Tue May 8 00:24:39 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 07 May 2012 20:24:39 -0400
Subject: [GSoC] GSoC python variant update
In-Reply-To:
References:
Message-ID: <87mx5jfrjs.fsf@fastmail.fm>

Lenna;
This all looks great for a top level overview of the classes. This should give you sufficient flexibility to work on the different file types. Another approach is to avoid some of the inheritance and have parse/write dispatch to VCF or GFF specific classes based on the filetype:

    if filetype == "vcf":
        variant_handler = PyVCFVariants()
    elif filetype == "gvf":
        variant_handler = GVFVariants()
    variant_handler.parse(*args)

Avoiding layers can be nice to simplify the architecture, as long as it gives you the flexibility you need.

My suggestion for digging more in the API design would be to start playing with some VCF files and getting comfortable with the data they have and where it would go in Biopython objects. VCF is much more widely used than GVF so it's a good practical place to start.

Thanks for all this work and best of luck on finals,
Brad

From casbon at gmail.com Tue May 8 08:57:57 2012
From: casbon at gmail.com (James Casbon)
Date: Tue, 8 May 2012 09:57:57 +0100
Subject: [GSoC] GSoC python variant update
In-Reply-To: <87mx5jfrjs.fsf@fastmail.fm>
References: <87mx5jfrjs.fsf@fastmail.fm>
Message-ID:

On 8 May 2012 01:24, Brad Chapman wrote:
>
> Lenna;
> This all looks great for a top level overview of the classes.
> This should give you sufficient flexibility to work on the different file
> types. Another approach is to avoid some of the inheritance and have
> parse/write dispatch to VCF or GFF specific classes based on the
> filetype:
>
> if filetype == "vcf":
>     variant_handler = PyVCFVariants()
> elif filetype == "gvf":
>     variant_handler = GVFVariants()
> variant_handler.parse(*args)
>
> Avoiding layers can be nice to simplify the architecture, as long as it
> gives you the flexibility you need.

Hi Lenna,

This looks a good start, but I would agree with Brad that layers of inheritance aren't always the best way to proceed with python.

Specific feedback: why does the Variant have parse/write methods when you state that you will use adaptation from the general variation class to the actual parser? I'm also slightly worried this could be pretty slow when dealing with the volume of data you get from a VCF file.

As for the points in your blog post... I have plenty of data; do we know any SNP callers capable of creating GVF files? If so, I can give you both formats.

The simplest variant workflows would be to filter and then score on some metric. Filter would be to remove noise, so a quality threshold is the simplest one. The metric used depends on the experimental setup. For case/control, a Fisher's exact test is quite easy, or for a single population a Hardy-Weinberg equilibrium (HWE) test is fairly simple.

Hope this helps,

--
James
http://casbon.me/

From pjotr2010 at thebird.nl Tue May 8 11:40:43 2012
From: pjotr2010 at thebird.nl (Pjotr Prins)
Date: Tue, 8 May 2012 13:40:43 +0200
Subject: [GSoC] GSoC python variant update
In-Reply-To:
References: <87mx5jfrjs.fsf@fastmail.fm>
Message-ID: <20120508114043.GC14359@thebird.nl>

On Tue, May 08, 2012 at 09:57:57AM +0100, James Casbon wrote:
> > Avoiding layers can be nice to simplify the architecture, as long as it
> > gives you the flexibility you need.

This is actually a pattern. See 'Using Mixin Technology to Improve Modularity', for example.
http://www.cs.utexas.edu/~lin/papers/aop03.pdf Pj. From w.arindrarto at gmail.com Wed May 9 16:24:43 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 9 May 2012 18:24:43 +0200 Subject: [GSoC] GSoC Project Update -- 1 Message-ID: Hi everyone, I just posted my latest blog update here: http://bow.web.id/blog/2012/05/warming-up-for-the-coding-period/ To summarize, I've spent most of my time getting better acquainted with the programs I will support. This has been done by:
1. Playing around with the programs to see how many different outputs I can generate.
2. Writing scripts to automate test case generation for each of the programs.
3. Writing wrappers (for programs not yet wrapped by Biopython: FASTA, HMMER, and BLAT) to ease writing the test case generators.
4. Continuing to complete my proposed SearchIO object naming scheme (http://bit.ly/searchio-terms)
The test cases, their generators, and the wrappers I've written are available in my non-Biopython gsoc repo here: http://github.com/bow/gsoc/. Additionally, I've used the generated test case to improve a recent bug report and submitted a fix for the next release. For the coming weeks before coding starts, I'm planning to play around more with XML and SQLite as I will use them in the code. I might start to add more skeleton code to my current development branch as well (https://github.com/bow/biopython). cheers, Bow From arklenna at gmail.com Thu May 10 00:16:18 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 9 May 2012 20:16:18 -0400 Subject: [GSoC] GSoC python variant update In-Reply-To: <20120508114043.GC14359@thebird.nl> References: <87mx5jfrjs.fsf@fastmail.fm> <20120508114043.GC14359@thebird.nl> Message-ID: I think my UML diagram may need a legend, or perhaps it should just be abandoned. I've written some skeleton code to try to avoid confusion about the pesky OO terms that have slightly different meanings for every language.
https://gist.github.com/2649676 Regarding concerns about inheritance: I think the UML diagram implies 3 levels of inheritance. The only inheritance I intended was from abstract interfaces like Parser or Writer, that only contain non-implemented methods. Because I can't guarantee that all future parsers will have common attribute and method names, the only solution I can see is to write an interface and inherit from that to make wrappers for each parser. Thank you to Eric for this link: (https://en.wikipedia.org/wiki/Fragile_base_class). The page states that the best way to avoid problems is to use an interface. Also thank you to Pjotr for the article about mixins (http://www.cs.utexas.edu/~lin/papers/aop03.pdf). I believe I'm using inheritance in a safe and helpful manner. James, I hope my clarification and skeleton code answer any questions you have about the implementation. Brad, I am using if statements to determine which parser to use, but I am still calling wrappers that inherit from an interface. Eric, I looked at the structure of PDBParser. Is the idea that a user might pass in an instance of StructureBuilder that already contained some structure and add to it? Or is there another purpose that isn't jumping out at me? In my skeleton code, I used the example of StructureBuilder, but I'm not sure if there's an advantage to passing the object rather than the object's name. And finally, Brad and James, I will do my best to get more conversant with VCF etc. If I'm not a user, I can't be a capable developer. Looking forward to any more structural feedback! Cheers, Lenna From eric.talevich at gmail.com Thu May 10 13:36:49 2012 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 10 May 2012 09:36:49 -0400 Subject: [GSoC] GSoC python variant update In-Reply-To: References: <87mx5jfrjs.fsf@fastmail.fm> <20120508114043.GC14359@thebird.nl> Message-ID: On Wed, May 9, 2012 at 8:16 PM, Lenna Peterson wrote: > I looked at the structure of PDBParser. 
Is the idea that a user might pass > in an instance of StructureBuilder that already contained some structure > and add to it? Or is there another purpose that isn't jumping out at me? In > my skeleton code, I used the example of StructureBuilder, but I'm not sure > if there's an advantage to passing the object rather than the object's name. > > My understanding of the producer/consumer design in Bio.PDB (I didn't write it) is that the logic for parsing the given file format is contained in the *Parser class, and the logic for building the target object is in the *Builder class. This is useful if the target object is somewhat complex to build, as is the case with PDB's Structure/Model/Chain/Residue/Atom hierarchy -- the parser just passes raw values along to the appropriate method on the StructureBuilder class. (The Internet also points out that this design is super useful if "producing" and "consuming" are asynchronous, which is not the case here... yet?) Regarding the shared interface, I think we've generally achieved this throughout most of Biopython by just remembering to implement the required methods on each parser and writer class -- just "parse" and "write", usually. Essentially, it's your design minus the common base class that enforces the interface; an error in the implementation would result in an AttributeError rather than a NotImplementedError. This works because (1) Python uses duck typing, unlike C++ and Java; (2) in Biopython, each file format is usually implemented by one dedicated person who can keep it all in their head, and we don't add new file formats very rapidly; (3) we maintain pretty good coverage with our unit tests, and certainly add unit tests for new parsers. Given all that, I think your design is superior, and it's quite clear how it all works from the way you've written it. 
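A minimal sketch of the trade-off described above, not taken from Biopython or from Lenna's gist: `Parser`, `WrapperWithBase`, `DuckTypedWrapper` and `try_parse` are hypothetical names invented for illustration. It only shows which exception a forgotten `parse` method produces with and without a common base class.

```python
class Parser(object):
    """Hypothetical abstract interface: methods raise until overridden."""

    def parse(self, handle):
        raise NotImplementedError("parse() must be implemented by a subclass")


class WrapperWithBase(Parser):
    """Inherits the interface but forgets to implement parse()."""


class DuckTypedWrapper(object):
    """No shared base class; also forgets to implement parse()."""


def try_parse(parser):
    """Call parse() and report which exception the omission produces."""
    try:
        parser.parse(None)
    except NotImplementedError:
        return "NotImplementedError"
    except AttributeError:
        return "AttributeError"
    return "ok"


print(try_parse(WrapperWithBase()))
print(try_parse(DuckTypedWrapper()))
```

Either way the mistake surfaces at call time; the base class mainly buys a more descriptive error and a single place where the expected methods are listed.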
As for the difference between passing an instance of the *Builder object versus a reference to the *Builder class (did I get that right?), it requires slightly less code from the user to pass a reference to the class. Also, if you set the object-or-class as a default argument, remember that objects are mutable, so you risk hitting one of Python's most infamous gotchas (default arguments are evaluated only once, when the function is defined, so the second time you use the parser, you'll be adding to the original object instead of starting with a fresh copy). Cheers, Eric From marian.povolny at gmail.com Sat May 12 19:46:46 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Sat, 12 May 2012 21:46:46 +0200 Subject: [GSoC] GSoC weekly status report No.1.1 Message-ID: Hi all, Here is my status report for this week: This year we the GSoC students sure are a very creative group, just look at our numbering schemes for our status reports for the pre-coding period - everyone has his own thing going :) And now back to the GFF3 project. I found a few more sites with big GFF3 files; those will be great for performance testing. And Robert Buels suggested that I should reuse the test suite from Perl's Bio::GFF3::LowLevel::Parser, and I think that's a great idea. I should definitely use that for completeness testing and I will check the test suites of other GFF3 parsers. I have also finished the work for the first week. That means basically I'm already more than two weeks ahead of schedule. The parser is now reading data on the D side and forwarding that to Ruby line by line. That won't be faster than reading the file from Ruby, but that's a nice basic case to get data flowing from D to Ruby. The rake tasks have been improved too. There are now two tasks for building the D library, "compile" and "compiledebug", and there is the "spec" task for running rspec tests and "features" task for running cucumber tests. The "clean" task now deletes object and library files.
There is also a problem with the D library and garbage collector. It seems this is the problem Iain Buclaw (one of the GDC developers) has warned us about. When using a D shared library, when the GC kicks in for the first time, it looks as if it collects all the static data, for example the per-module variables. And pretty much everything: even when we register with the GC a chunk of memory allocated with malloc, it still gets collected. Or at least that's what it looks like. However, Iain also assured us that this will be solved by the end of this month/beginning of the next. My cucumber and rspec tests still work because they don't require enough memory for the GC to run, but to be sure that this issue doesn't interfere with development at this point, I manually disabled the GC on library initialization. I haven't tried yet, but from what has been discussed in the forums, both 32 and 64-bit DLLs on Windows built using DMD work fine. I also helped Pjotr with getting our blog posts included in the RSS feed on biogems.info. That's all for now, you can find this report on my blog too: http://blog.mpthecoder.com/post/22919943701/gsoc-weekly-status-report-no-1-1 -- Best regards, Marjan From lomereiter at googlemail.com Sun May 13 20:10:45 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 14 May 2012 00:10:45 +0400 Subject: [GSoC] Weekly report No 0.5 Message-ID: Hi all, this is yet another GSoC report. During last week, I mainly concentrated on the D part of the project, adding functionality to it. I implemented parsing of the whole BAM file :) Today I wrote a simple utility in D, which uses my library to convert BAM to SAM. It doesn't work with array tags yet, and is not as fast as samtools, but nevertheless... On a couple of BAM files from the test/data directory (namely, bins.bam and ex1_header.bam) the output is identical to that of samtools view (I checked with diff), and that kinda proves that everything works fine.
Speed issues are mainly due to using the std.variant module for storing tags. It uses runtime reflection, which is quite slow. Maybe there are some other reasons. Anyway, I'm going to write my own tagged union type next week; it should improve the performance quite a bit, and also fix design flaws. For testing tag parsing, I used the file tags.bam provided to me by Peter Cock. It contains tests for all types of tags, and my library successfully passes them. Later I'll experiment with possible speed improvements, and having unit tests covering the full range of possible tag types is a must. Also, I downloaded and compiled gdc from trunk. It provides decent performance, not worse than dmd, at least. We expect gdc to gain shared library support in the next two months. Before that happens, we have to use dmd, although there are some issues with its garbage collector, causing segfaults. We discussed that with Marjan and Pjotr and decided that the best option in such circumstances would be to disable GC during development - testing the library on small files won't consume much memory anyway. Another thing I downloaded and compiled is Rubinius. I'm going to investigate why it hangs on BioRuby unit tests in 1.9 mode. The other mode, 1.8, seems to work fine except maybe some very minor bugs. During next week, I'm going to learn how to use Cucumber and Rspec, improve D library performance a little, and start to write Ruby bindings. So it will be mostly "Ruby week" ;) -- Artem From cswh at umich.edu Tue May 15 03:36:17 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Mon, 14 May 2012 23:36:17 -0400 Subject: [GSoC] GSoC week 1 status report Message-ID: <2D9F6030-8A11-4443-B610-58464F506EE5@umich.edu> Hi all, I've put my first GSoC status report on my project blog: http://csw.github.com/bioruby-maf/blog/2012/05/13/progress/ (The web version of this has 100% more hyperlinks, but here's a plain text version, too.)
This has been my first half-week of work on my Google Summer of Code project, and it's off to an exciting start. The first order of business has been to get my development environment together; since I've been a microbiology student instead of a programmer for the last year, it's taken some work. In that process, I've ended up making a few open source contributions just to get my tools working the way I want. I'm running GNU Emacs 24 and trying to take more advantage of it than I have in the past. I'll have much more to say about this in a future post. I've also started working on the BioRuby unit test failures under JRuby, as a way of familiarizing myself with the BioRuby code base as well as the community and its development processes. Right now, JRuby in 1.8 mode is showing 6 failures and 126 errors, which is hardly confidence-inspiring for people considering using JRuby with BioRuby. This is too bad, since JRuby has some definite advantages as a Ruby implementation. After looking into these failures, I've broken them down into a few categories:
* temporary file permissions problems, likely due to some sort of Travis-CI environment issue
* a bug in JRuby's implementation of Open3.popen3 which I'm working up a bug report for
* an odd autoload problem I've filed JRUBY-6658 for and sent an accompanying RubySpec patch for
* a problem with libxml-jruby, which appears unmaintained, for which I've submitted a BioRuby patch plus JRUBY-6662
* and a small test case bug relating to floating point handling, which I've submitted a patch for.
Once these are resolved, JRuby should be passing the BioRuby unit tests in 1.8 mode, and closer to passing in 1.9 mode. (There are a few extra failures under 1.9 that I haven't sorted through yet.) I've also gotten a start on my project itself, creating the bioruby-maf Github repository with a project skeleton and writing my first Cucumber feature for it. This is, in fact, my first Cucumber feature ever.
However, I did spend a few cross-country flights reading the RSpec and Cucumber books last week; between that and cribbing from Pjotr's code I feel like I have some idea what I'm doing. Just assembling that feature has been useful, too, since I've had to get several of the existing MAF tools running on my machine. In fact, my test MAF data and the FASTA version of it are courtesy of bx-python, which will be my reference implementation in many respects. Clayton Wheeler cswh at umich.edu From w.arindrarto at gmail.com Wed May 16 19:36:28 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 16 May 2012 21:36:28 +0200 Subject: [GSoC] GSoC Project Update -- 2 Message-ID: Hi everyone, I just posted my latest GSoC blog update here: http://bow.web.id/blog/2012/05/the-final-preparations/ To summarize, I spent the last week playing with XML and SQLite, and by extension SeqIO's index and index_db. I didn't write as much real code as the week before (I was mostly working through online tutorials). Additionally, I started writing some of the SearchIO main methods, improved the test case generation time, and added more entries to the SearchIO terms table (http://bit.ly/searchio-terms). Finally, from this day onwards, I'm starting coding for the actual SearchIO implementation. The weekly plan will follow my proposed timeline (http://bit.ly/searchio-proposal) and I'll be writing mostly on my main SearchIO branch (https://github.com/bow/biopython/tree/searchio/Bio/SearchIO). cheers, Bow P.S. I also updated my blog last week so that the GSoC entries can be tracked through their own feed.
The feed is available here: http://bow.web.id/feed/atom-gsoc.xml From arklenna at gmail.com Wed May 16 20:01:30 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 16 May 2012 16:01:30 -0400 Subject: [GSoC] GSoC python variant update 2 Message-ID: Hi all, Latest blog post here: http://arklenna.tumblr.com/post/23178684555/week-2 Brief summary of this post: I don't think `SeqFeature` or an extension thereof would be appropriate for storing Variant data; therefore, I intend to make a new structure based on `_Record` and `_Call` in PyVCF. I'm not sure if this structure should be associated with `Seq`, i.e. by naming it `SeqVariant`, and would like feedback on this question. It could be very difficult to make PyVCF compatible with Python 2.5. Therefore, I am planning to write my project to be compatible with Python 2.6 and delaying its inclusion in the main Biopython branch until a future 2.6+ Biopython release. Alternate suggestions are welcome. This week I will solidify the structure so I am ready for the end of the community bonding period and the start of coding on May 21. Regards, Lenna From chapmanb at 50mail.com Thu May 17 00:19:01 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 16 May 2012 20:19:01 -0400 Subject: [GSoC] [Biopython-dev] GSoC python variant update 2 In-Reply-To: References: Message-ID: <871umju0ay.fsf@fastmail.fm> Lenna; Thanks for the update on your thinking. Sounds like you are right on track. > I don't think `SeqFeature` or an extension thereof would be > appropriate for storing Variant data; therefore, I intend to make a > new structure based on `_Record` and `_Call` in PyVCF. I'm not sure if > this structure should be associated with `Seq`, i.e. by naming it > `SeqVariant`, and would like feedback on this question. I'm agreed about SeqFeature. Would you consider using _Record/_Call directly? Then you could provide functionality to convert this to/from basic SeqFeatures if needed. 
An advantage of using these structures explicitly is that you could plug in compatible APIs, like Aaron Quinlan's CyVCF: https://github.com/arq5x/cyvcf I don't think we should add a new representation class unless we explicitly need to store additional information. > It could be very difficult to make PyVCF compatible with Python > 2.5. Therefore, I am planning to write my project to be compatible > with Python 2.6 and delaying its inclusion in the main Biopython > branch until a future 2.6+ Biopython release. Alternate suggestions > are welcome. I'm agreed with this. I don't think 2.5 is as entrenched as 2.4 was, so I think we could move on a deprecation path for it. It's more important to be forward compatible with 3.x and 2.6+ should make that easier. Thanks again for sharing all your thoughts and digging into this, Brad From rbuels at gmail.com Thu May 17 12:52:54 2012 From: rbuels at gmail.com (Robert Buels) Date: Thu, 17 May 2012 08:52:54 -0400 Subject: [GSoC] students: upcoming dates Message-ID: <4FB4F4A6.8050802@gmail.com> Hi students, There are a couple of important dates coming up soon, don't forget! May 18 (TOMORROW!): deadline to submit tax forms and proof of enrollment. Do you want to get paid? May 21: start of the formal coding period Keep up the good work, I'm very happy to have you working with us. :-) Rob -- Robert Buels OBF GSoC 2012 Org. Administrator From marian.povolny at gmail.com Mon May 21 09:36:01 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Mon, 21 May 2012 11:36:01 +0200 Subject: [GSoC] GSoC weekly status report No.1.2 Message-ID: http://blog.mpthecoder.com/post/23473020471/gsoc-weekly-status-report-no-1-2 It's been three months since my first introduction on the BioRuby ML and it's been great. As it is the end of the GSoC community bonding period, I would like to thank Pjotr most and then all the other community members for their help and support.
It's a great feeling to become a member of a small but growing community of enthusiasts that work together for the betterment of all of us and for fun. As Pjotr already did, I would like to encourage you to write blog posts about using Ruby in Bioinformatics and let us include them in our RSS and news feeds on the biogems.info website. The site supports both RSS and Atom feeds now, and similar functionality will be part of the new website for BioRuby once it's finished. The code also supports adding only posts for one category/tag, so you can tag your posts with BioRuby or similar, and only those posts will be included in the RSS feed on biogems.info. The GSoC coding period starts today. It's time for me to roll my sleeves up and start working on the GFF3 parser full-time. -- Marjan From lomereiter at googlemail.com Mon May 21 11:58:46 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 21 May 2012 15:58:46 +0400 Subject: [GSoC] Weekly report #1 Message-ID: Hi all, here's my report about the past week: http://lomereiter.wordpress.com/2012/05/21/gsoc-weekly-report-1/ Brief summary: 1) BioRuby unit tests and Rubinius bugs - I posted 2 issues in the Rubinius bugtracker, and one of them is already solved. Rubinius in 1.8 mode should now pass all tests. The situation with 1.9 mode is not that great, but I'm working on it. 2) I started to collect D optimization tricks on a github wiki page. Currently, it contains just 6 tips, but this number is going to grow. Probably, another page will be created soon to keep best practices of connecting Ruby and D. Since my project and Marjan's have a lot in common, I think it's important for us not to waste time on something that has already been investigated. 3) During the week, I learned a bit about BDD and Cucumber, enjoyed it, and wrote my first two features. 4) Measurements of object instantiation time in Ruby suggest that exposing low-level D functions via FFI makes little sense.
I'm going to discuss with mentors which high-level functions should be available, and turn them into Cucumber features. -- Artem From cswh at umich.edu Mon May 21 15:50:18 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Mon, 21 May 2012 11:50:18 -0400 Subject: [GSoC] GSoC week 2 status report Message-ID: <0D2AC678-1DD1-40B9-B100-EDA3429B3D87@umich.edu> Hi all, Here's my report on last week's work: http://csw.github.com/bioruby-maf/blog/2012/05/21/week_2_progress/ This was my second week of work on my GSoC project, and the last week of the "community bonding" period before the official start of coding. A major focus of mine was BioRuby's phyloXML support; it uses libxml, which has been causing unit test failures under JRuby. In the end, the best course of action seemed to be to split the phyloXML support out into a separate plugin, which I have done as the bio-phyloxml gem. This will remove BioRuby's dependency on XML libraries entirely and that JRuby issue along with it. At the same time, users of the phyloXML code should be able to continue using it with no substantive changes. Separately, I began porting this phyloXML code to use Nokogiri instead of libxml-ruby, but ran into difficulties with this effort. While it is possible, and the library APIs are very similar, the code uses relatively low-level XML processing APIs in ways that seem to be sensitive to subtle differences in text node and namespace semantics between the two libraries. Substantial restructuring of the code and the addition of quite a few unit tests might be necessary to carry out such a port with confidence that the resulting code would work well. Also, someone else submitted a JRuby patch for JRUBY-6658, one of the major causes of BioRuby's unit test failures with JRuby; once a fix is integrated, we'll be close to having all the tests passing under JRuby. I identified another JRuby bug, JRUBY-6666, causing several unit test failures.
This one affects BioRuby's code for running external commands, so it would be likely to be encountered in production use. For this one, I also worked up a patch. I also spent some time preparing a performance testing environment, for evaluating existing MAF implementations as well as my own. This will be important, since I will be considering the use of an existing C parser. I will also want to ensure that the performance of my code is competitive with the alternatives. Lacking any hardware more powerful than a MacBook Air, I am setting this up with Amazon EC2. To simplify environment setup, I'll be using Chef. I've already set up a Chef repository with configuration logic, and some rudimentary code to streamline launching Ubuntu machines on EC2 and bootstrapping a Chef environment. To save money, I plan to make use of EC2 Spot Instances, which are perfect for instances that only need to run for a few hours for batch tasks. Clayton Wheeler cswh at umich.edu From bonnal at ingm.org Tue May 22 09:21:42 2012 From: bonnal at ingm.org (Raoul Bonnal) Date: Tue, 22 May 2012 11:21:42 +0200 Subject: [GSoC] [BioRuby] GSoC week 2 status report In-Reply-To: <0D2AC678-1DD1-40B9-B100-EDA3429B3D87@umich.edu> Message-ID: Hi Clayton, Well done and thanks for your contributions to the BioRuby and JRuby communities. For your computing issue I have two solutions: 1) I can create a VM and give you access; I need to contact my IT dep. 2) Could Amazon provide some VM for our students? On 21/05/12 17.50, "Clayton Wheeler" wrote: > Hi all, > > Here's my report on last week's work: > > http://csw.github.com/bioruby-maf/blog/2012/05/21/week_2_progress/ > > This was my second week of work on my GSoC project, and the last week of the > "community bonding" period before the official start of coding. A major focus > of mine was BioRuby's phyloXML support; it uses libxml, which has been causing > unit test failures under JRuby.
In the end, the best course of action seemed > to separate the phyloXML support as a separate plugin, which I have done as > the bio-phyloxml gem. This will remove BioRuby's dependency on XML libraries > entirely and that JRuby issue along with it. At the same time, users of the > phyloXML code should be able to continue using it with no substantive changes. > > Separately, I began porting this phyloXML code to use Nokogiri instead of > libxml-ruby, but ran into difficulties with this effort. While it is possible, > and the library APIs are very similar, the code uses relatively low-level XML > processing APIs in ways that seem to be sensitive to subtle differences in > text node and namespace semantics between the two libraries. Substantial > restructuring of the code and the addition of quite a few unit tests might be > necessary to carry out such a port with confidence that the resulting code > would work well. > > Also, someone else submitted a JRuby patch for JRUBY-6658, one of the major > causes of BioRuby's unit test failures with JRuby; once a fix is integrated, > we'll be close to having all the tests passing under JRuby. > > I identified another JRuby bug, JRUBY-6666, causing several unit test > failures. This one affects BioRuby's code for running external commands, so it > would be likely to be encountered in production use. For this one, I also > worked up a patch. > > I also spent some time preparing a performance testing environment, for > evaluating existing MAF implementations as well as my own. This will be > important, since I will be considering the use of an existing C parser. I will > also want to ensure that the performance of my code is competitive with the > alternatives. Lacking any hardware more powerful than a MacBook Air, I am > setting this up with Amazon EC2. To simplify environment setup, I'll be using > Chef.
I've already set up a Chef repository with configuration logic, and some > rudimentary code to streamline launching Ubuntu machines on EC2 and > bootstrapping a Chef environment. To save money, I plan to make use of EC2 > Spot Instances, which are perfect for instances that only need to run for a > few hours for batch tasks. > > Clayton Wheeler > cswh at umich.edu > > > > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From w.arindrarto at gmail.com Tue May 22 10:21:25 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 22 May 2012 12:21:25 +0200 Subject: [GSoC] GSoC Project Update -- 3 Message-ID: Hi everyone, I just posted my latest GSoC update here: http://bow.web.id/blog/2012/05/from-bio-import-searchio/ To summarize the post and what I've done the last week: * I finished writing all base SearchIO objects and tested them as well. These objects are the QueryResult object (previously called Result), representing search results from a single query; the Hit object, representing pairwise alignments from a single database hit; and the HSP object, representing a single alignment. I've also written the docstrings for these objects, so you can run help() on them in an interpreter session. The post also includes a very brief outline of the base objects' features, if you are curious. * Using this, I was able to write a working prototype for SearchIO BLAST XML parsing. This prototype has also been tested, using the test cases I've generated previously. For now, it's implemented using our NCBIXML parser, just so that people can have a taste of what SearchIO will feel like. If you want to play around with the prototype, it's available here: https://github.com/bow/biopython/tree/searchio-blastxml. As always, feel free to notify me of suggestions, critiques, and/or feature requests :).
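As a rough illustration of the three-level model described above (this is not the actual SearchIO code; the constructor arguments and attribute names are assumptions made for this sketch, and only the class names come from the summary):

```python
class HSP(object):
    """A single alignment between the query and one database entry."""

    def __init__(self, evalue):
        self.evalue = evalue  # assumed attribute, for illustration only


class Hit(object):
    """All alignments (HSPs) from a single database hit."""

    def __init__(self, hit_id, hsps):
        self.id = hit_id
        self.hsps = list(hsps)

    def __iter__(self):
        return iter(self.hsps)

    def __len__(self):
        return len(self.hsps)


class QueryResult(object):
    """All hits returned by the search for a single query."""

    def __init__(self, query_id, hits):
        self.id = query_id
        self.hits = list(hits)

    def __iter__(self):
        return iter(self.hits)

    def __len__(self):
        return len(self.hits)


# One query, two hits, three alignments in total.
result = QueryResult("query1", [
    Hit("hitA", [HSP(1e-30), HSP(2e-5)]),
    Hit("hitB", [HSP(0.01)]),
])
total_hsps = sum(len(hit) for hit in result)
best_evalue = min(hsp.evalue for hit in result for hsp in hit)
print(len(result), total_hsps, best_evalue)
```

Making each level iterable over the level below it keeps walking from query to hit to alignment down to two nested loops, whichever search program produced the output.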
regards, Bow From rbuels at gmail.com Tue May 22 20:15:15 2012 From: rbuels at gmail.com (Robert Buels) Date: Tue, 22 May 2012 16:15:15 -0400 Subject: [GSoC] [BioRuby] GSoC week 2 status report In-Reply-To: References: Message-ID: <4FBBF3D3.4040003@gmail.com> On 05/22/2012 05:21 AM, Raoul Bonnal wrote: > 2) Could Amazon provide some VM for our students? AWS allows quite a bit of usage at no charge: http://aws.amazon.com/free/ If you need more, you could apply for a grant from them. http://aws.amazon.com/education/ Rob From saketkc at gmail.com Tue May 22 20:17:01 2012 From: saketkc at gmail.com (Saket Choudhary) Date: Tue, 22 May 2012 21:17:01 +0100 Subject: [GSoC] [BioRuby] GSoC week 2 status report In-Reply-To: <4FBBF3D3.4040003@gmail.com> References: <4FBBF3D3.4040003@gmail.com> Message-ID: I have a free $50 credit on AWS. I would like to give it to BioRuby, if possible. On 22 May 2012 21:15, Robert Buels wrote: > On 05/22/2012 05:21 AM, Raoul Bonnal wrote: > >> 2) Could Amazon provide some VM for our students? >> > > AWS allows quite a bit of free usage at no charge: > http://aws.amazon.com/free/ > If you need more, you could apply for a grant from them. > http://aws.amazon.com/education/ > > Rob > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc > From arklenna at gmail.com Wed May 23 21:56:03 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 23 May 2012 17:56:03 -0400 Subject: [GSoC] GSoC python variant update 3 Message-ID: Hi all, Latest blog post here: http://arklenna.tumblr.com/post/23630012065/week-1 Brief summary: I have reversed my prior conclusion that `SeqRecord` is inadequate for holding variant data. It is still not ideal, but the advantages of using an existing native object are substantial, and the disadvantages can be reduced by creating an accessor for the variant-specific data within a `SeqRecord`.
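One way such an accessor could look, as a sketch only: `VariantAccessor`, the `"variant"` annotation key and `StandInRecord` are hypothetical names, and the stand-in merely mimics the `annotations` dictionary that Biopython's `SeqRecord` provides so that the example runs without Biopython installed.

```python
class StandInRecord(object):
    """Stand-in for Bio.SeqRecord.SeqRecord; only mimics .annotations."""

    def __init__(self):
        self.annotations = {}


class VariantAccessor(object):
    """Hypothetical view over variant fields kept in record.annotations."""

    KEY = "variant"  # assumed annotation key, not a Biopython convention

    def __init__(self, record):
        # Create the nested dictionary on first use and keep a reference to it.
        self._data = record.annotations.setdefault(self.KEY, {})

    def set(self, name, value):
        self._data[name] = value

    def get(self, name, default=None):
        return self._data.get(name, default)


record = StandInRecord()
variant = VariantAccessor(record)
variant.set("ref", "A")
variant.set("alt", ["G"])
print(variant.get("ref"), record.annotations["variant"]["alt"])
```

The variant fields still live inside the record itself, so anything that passes the record along carries them too; the accessor only spares callers from knowing where they are stored.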
I've made an outline of how I would store the information returned by PyVCF within `SeqRecord` and `SeqFeature` objects. It includes a few questions about the most logical way to store certain variant information. As the coding period has now started, I'll be pushing some prototypes to GitHub in the near future. Lenna From cjfields at illinois.edu Thu May 24 05:14:20 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 24 May 2012 05:14:20 +0000 Subject: [GSoC] [BioRuby] Weekly report #1 In-Reply-To: References: Message-ID: I think the mentioned D wrappers on the SWIG page are ANSI C/C++ libraries wrapped for D, not D code/libs/etc wrapped for Ruby, unless I'm mistaken... chris On May 23, 2012, at 11:30 PM, Mic wrote: > D to Ruby: http://www.swig.org/compare.html > > On Mon, May 21, 2012 at 9:58 PM, Artem Tarasov wrote: > >> Hi all, >> >> here's my report about the past week: >> http://lomereiter.wordpress.com/2012/05/21/gsoc-weekly-report-1/ >> >> Brief summary: >> >> 1) BioRuby unit tests and Rubinius bugs - I posted 2 issues in Rubinius >> bugtracker, and one of them is already solved. Rubinius in 1.8 mode should >> now pass all tests. The situation with 1.9 mode is not that great, but I'm >> working on it. >> >> 2) I started to collect D optimization tricks on github wiki page. >> Currently, it contains just 6 tips, but this number is going to grow. >> Probably, another page will be created soon to keep best practices of >> connecting Ruby and D. Since my project and Marjan's one have a lot in >> common, I think it's important for us to not waste time on something that >> already have been investigated. >> >> 3) During the week, I learned a bit about BDD and Cucumber, enjoyed it, and >> wrote my first two features. >> >> 4) Measurements of object instantiation time in Ruby suggest that exposing >> low-level D functions via FFI makes little sense.
I'm going to discuss with >> mentors which high-level functions should be available, and make that into >> Cucumber features. >> >> >> >> >> -- >> Artem >> >> _______________________________________________ >> BioRuby Project - http://www.bioruby.org/ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby >> > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From cswh at umich.edu Thu May 24 05:33:40 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Thu, 24 May 2012 01:33:40 -0400 Subject: [GSoC] [BioRuby] GSoC week 2 status report In-Reply-To: References: Message-ID: <9DBCD042-7086-4F4B-ABB9-1A7F63C089B8@umich.edu> Thanks for the offers of help, everybody. Raoul, if it's convenient for you to set up a test VM in house, that would probably make the most sense. I don't think it's a pressing need at this point, but let's look into that. If we run into issues, we can revisit the EC2 options. (I've had an AWS account too long to qualify for the free usage tier, unfortunately.) An Amazon grant might be worth looking at, especially if we can use it to publicly host, say, BGZF-compressed pre-indexed MAF data sets also. On the other hand, that might be overkill just for my needs; using spot-priced instances, I expect I could do all the testing I need for under $50. Clayton Wheeler cswh at umich.edu From lomereiter at googlemail.com Thu May 24 05:40:54 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Thu, 24 May 2012 09:40:54 +0400 Subject: [GSoC] [BioRuby] Weekly report #1 In-Reply-To: References: Message-ID: Chris is right. Currently, it's easier to write everything manually. Once I have developed some 'best practices', I may put them into compile-time algorithms and generate bindings from D.
(The language has compile-time introspection but no run-time introspection, probably because that would hurt performance.) On Thu, May 24, 2012 at 9:14 AM, Fields, Christopher J <cjfields at illinois.edu> wrote: > I think the mentioned D wrappers on the SWIG page are ANSI C/C++ libraries > wrapped for D, not D code/libs/etc wrapped for Ruby, unless I'm mistaken... > > chris > > On May 23, 2012, at 11:30 PM, Mic wrote: > > > D to Ruby: http://www.swig.org/compare.html > > > > On Mon, May 21, 2012 at 9:58 PM, Artem Tarasov <lomereiter at googlemail.com> wrote: > > > >> Hi all, > >> > >> here's my report about the past week: > >> http://lomereiter.wordpress.com/2012/05/21/gsoc-weekly-report-1/ > >> > >> Brief summary: > >> > >> 1) BioRuby unit tests and Rubinius bugs - I posted 2 issues in Rubinius > >> bugtracker, and one of them is already solved. Rubinius in 1.8 mode should > >> now pass all tests. The situation with 1.9 mode is not that great, but I'm > >> working on it. > >> > >> 2) I started to collect D optimization tricks on github wiki page. > >> Currently, it contains just 6 tips, but this number is going to grow. > >> Probably, another page will be created soon to keep best practices of > >> connecting Ruby and D. Since my project and Marjan's one have a lot in > >> common, I think it's important for us to not waste time on something that > >> has already been investigated. > >> > >> 3) During the week, I learned a bit about BDD and Cucumber, enjoyed it, and > >> wrote my first two features. > >> > >> 4) Measurements of object instantiation time in Ruby suggest that exposing > >> low-level D functions via FFI makes little sense. I'm going to discuss with > >> mentors which high-level functions should be available, and make that into > >> Cucumber features.
> >> > >> > >> > >> > >> -- > >> Artem > >> > >> _______________________________________________ > >> BioRuby Project - http://www.bioruby.org/ > >> BioRuby mailing list > >> BioRuby at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioruby > >> > > > > _______________________________________________ > > BioRuby Project - http://www.bioruby.org/ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > From mictadlo at gmail.com Thu May 24 04:30:22 2012 From: mictadlo at gmail.com (Mic) Date: Thu, 24 May 2012 14:30:22 +1000 Subject: [GSoC] [BioRuby] Weekly report #1 In-Reply-To: References: Message-ID: D to Ruby: http://www.swig.org/compare.html On Mon, May 21, 2012 at 9:58 PM, Artem Tarasov wrote: > Hi all, > > here's my report about the past week: > http://lomereiter.wordpress.com/2012/05/21/gsoc-weekly-report-1/ > > Brief summary: > > 1) BioRuby unit tests and Rubinius bugs - I posted 2 issues in Rubinius > bugtracker, and one of them is already solved. Rubinius in 1.8 mode should > now pass all tests. The situation with 1.9 mode is not that great, but I'm > working on it. > > 2) I started to collect D optimization tricks on github wiki page. > Currently, it contains just 6 tips, but this number is going to grow. > Probably, another page will be created soon to keep best practices of > connecting Ruby and D. Since my project and Marjan's one have a lot in > common, I think it's important for us to not waste time on something that > has already been investigated. > > 3) During the week, I learned a bit about BDD and Cucumber, enjoyed it, and > wrote my first two features. > > 4) Measurements of object instantiation time in Ruby suggest that exposing > low-level D functions via FFI makes little sense. I'm going to discuss with > mentors which high-level functions should be available, and make that into > Cucumber features.
> > > > > -- > Artem > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From cswh at umich.edu Fri May 25 20:42:13 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Fri, 25 May 2012 16:42:13 -0400 Subject: [GSoC] New blog post on this week's work Message-ID: <329E20F7-BF3F-4201-ADD0-ABCDFC5ECDE4@umich.edu> Hi all, I've written a new blog post on the work I did on my MAF parser this week: http://csw.github.com/bioruby-maf/blog/2012/05/25/first_milestone/ It covers parser implementation and performance issues, BDD, and tools. Clayton Wheeler cswh at umich.edu From john.woods at marcottelab.org Thu May 24 14:01:08 2012 From: john.woods at marcottelab.org (John Woods) Date: Thu, 24 May 2012 09:01:08 -0500 Subject: [GSoC] [BioRuby] GSoC week 2 status report In-Reply-To: References: <0D2AC678-1DD1-40B9-B100-EDA3429B3D87@umich.edu> Message-ID: If I can just suggest, there's a startup pitch out there which was formerly known as Happy Science Coding, now Appsoma, which lets you run Ruby code on Rackspace instances. It may or may not be appropriate for what you want to do. It's not EC2, but it is a VM (right?). http://appsoma.com/ It's still a bit buggy with Ruby. If you have trouble, email Zack (see the "About us" page). He's fairly responsive. John SciRuby On Tue, May 22, 2012 at 4:21 AM, Raoul Bonnal wrote: > Hi Clayton, > Well done and thanks for your contributions to the BioRuby and JRuby communities. > > For your computing issue I have two solutions: > 1) I can create a VM and give you the access, I need to contact my IT dep. > 2) Could Amazon provide some VM for our students?
> > > > On 21/05/12 17.50, "Clayton Wheeler" wrote: > > > Hi all, > > > > Here's my report on last week's work: > > > > http://csw.github.com/bioruby-maf/blog/2012/05/21/week_2_progress/ > > > > This was my second week of work on my GSoC project, and the last week of the > > "community bonding" period before the official start of coding. A major focus > > of mine was BioRuby's phyloXML support; it uses libxml, which has been causing > > unit test failures under JRuby. In the end, the best course of action seemed > > to be to split out the phyloXML support as a separate plugin, which I have done as > > the bio-phyloxml gem. This will remove BioRuby's dependency on XML libraries > > entirely and that JRuby issue along with it. At the same time, users of the > > phyloXML code should be able to continue using it with no substantive changes. > > > > Separately, I began porting this phyloXML code to use Nokogiri instead of > > libxml-ruby, but ran into difficulties with this effort. While it is possible, > > and the library APIs are very similar, the code uses relatively low-level XML > > processing APIs in ways that seem to be sensitive to subtle differences in > > text node and namespace semantics between the two libraries. Substantial > > restructuring of the code and the addition of quite a few unit tests might be > > necessary to carry out such a port with confidence that the resulting code > > would work well. > > > > Also, someone else submitted a JRuby patch for JRUBY-6658, one of the major > > causes of BioRuby's unit test failures with JRuby; once a fix is integrated, > > we'll be close to having all the tests passing under JRuby. > > > > I identified another JRuby bug, JRUBY-6666, causing several unit test > > failures. This one affects BioRuby's code for running external commands, so it > > would be likely to be encountered in production use. For this one, I also > > worked up a patch.
> > > > I also spent some time preparing a performance testing environment, for > > evaluating existing MAF implementations as well as my own. This will be > > important, since I will be considering the use of an existing C parser. I will > > also want to ensure that the performance of my code is competitive with the > > alternatives. Lacking any hardware more powerful than a MacBook Air, I am > > setting this up with Amazon EC2. To simplify environment setup, I'll be using > > Chef. I've already set up a Chef repository with configuration logic, and some > > rudimentary code to streamline launching Ubuntu machines on EC2 and > > bootstrapping a Chef environment. To save money, I plan to make use of EC2 > > Spot Instances, which are perfect for instances that only need to run for a > > few hours for batch tasks. > > > > Clayton Wheeler > > cswh at umich.edu > > > > > > > > > > _______________________________________________ > > BioRuby Project - http://www.bioruby.org/ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From lomereiter at googlemail.com Sun May 27 18:27:43 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Sun, 27 May 2012 22:27:43 +0400 Subject: [GSoC] weekly report #2 Message-ID: Hi all, I wrote a blog post about the past week: http://lomereiter.wordpress.com/2012/05/27/gsoc-weekly-report-2/ Topics are: 1) I have quite a good validation module for BAM now. More kinds of checks can be added, just request them :) 2) Also I started to implement random access via BAI file, just because I mostly finished what I planned for the first two weeks, and random access seems to be one of the most important things.
Also it's not mentioned in the blog, but I started to work on a BGZF gem, as Pjotr suggested to me. I'll try to document it and publish the first version next week. Currently I am writing it in pure Ruby. From marian.povolny at gmail.com Sun May 27 19:21:48 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Sun, 27 May 2012 21:21:48 +0200 Subject: [GSoC] GSoC weekly status report No.1.9 Message-ID: http://blog.mpthecoder.com/post/23877896288/gsoc-weekly-status-report-no-1-9 This is the final post in the 1.x series, I promise. The last week was spent adding support for parsing lines into records. It was a lot of work, and when I read the comments from my mentor, I wasn't happy. But I agree with him: I did make it more complicated than it had to be (the C API, for example), I should spend some time polishing and refactoring the D side, and my cucumber features should be split into more features. So that's the rough plan for the next week. -- Marjan From bonnal at ingm.org Mon May 28 08:50:19 2012 From: bonnal at ingm.org (Raoul Bonnal) Date: Mon, 28 May 2012 10:50:19 +0200 Subject: [GSoC] DevTools In-Reply-To: <329E20F7-BF3F-4201-ADD0-ABCDFC5ECDE4@umich.edu> Message-ID: In case you want to use RedMine I can give you the license for free, any bioruby developer can request it. From p.j.a.cock at googlemail.com Mon May 28 09:00:30 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 28 May 2012 10:00:30 +0100 Subject: [GSoC] [BioRuby] DevTools In-Reply-To: References: <329E20F7-BF3F-4201-ADD0-ABCDFC5ECDE4@umich.edu> Message-ID: On Mon, May 28, 2012 at 9:50 AM, Raoul Bonnal wrote: > In case you want to use RedMine I can give you the license for free, any > bioruby developer can request it. > ??? Redmine is licensed under the GPL. Did you mean admin rights on the OBF RedMine instance, for example to close bug reports?
https://redmine.open-bio.org/projects/bioruby Peter From p.j.a.cock at googlemail.com Mon May 28 09:07:39 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 28 May 2012 10:07:39 +0100 Subject: [GSoC] weekly report #2 In-Reply-To: References: Message-ID: On Sun, May 27, 2012 at 7:27 PM, Artem Tarasov wrote: > Hi all, > > I wrote a blog post about the past week: > http://lomereiter.wordpress.com/2012/05/27/gsoc-weekly-report-2/ > > Topics are: > 1) I have quite a good validation module for BAM now. More kinds of checks > can be added, just request them :) > The blog mentions you think you found some issues with the tags.bam file - could you elaborate (direct email is fine), and tell me about any future issues please? > 2) Also I started to implement random access via BAI file, just because I > mostly finished what I planned for the first two weeks, and random access > seems to be one of the most important things. > > Also it's not mentioned in the blog, but I started to work on a BGZF gem, as > Pjotr suggested to me. I'll try to document it and publish the first > version next week. Currently I am writing it in pure Ruby. > I guess, given my suggestion that Clayton might be able to use your BGZF support code for compressed MAF files, it does make sense to package the BGZF support as a Bio Gem. Good point Pjotr. http://lists.open-bio.org/pipermail/bioruby/2012-May/002301.html Peter From bonnal at ingm.org Mon May 28 09:03:01 2012 From: bonnal at ingm.org (Raoul Bonnal) Date: Mon, 28 May 2012 11:03:01 +0200 Subject: [GSoC] [BioRuby] DevTools In-Reply-To: Message-ID: Ahhhhhhhhhhh I meant RubyMine http://www.jetbrains.com/ruby/ sorry On 28/05/12 11.00, "Peter Cock" wrote: > > > On Mon, May 28, 2012 at 9:50 AM, Raoul Bonnal wrote: >> In case you want to use RedMine I can give you the license for free, any >> bioruby developer can request it. > > ??? Redmine is licensed under the GPL. > > Did you mean admin rights on the OBF RedMine instance, for > example to close bug reports?
> https://redmine.open-bio.org/projects/bioruby > > Peter > > From lomereiter at googlemail.com Mon May 28 09:29:24 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 28 May 2012 13:29:24 +0400 Subject: [GSoC] weekly report #2 In-Reply-To: References: Message-ID: > > The blog mentions you think you found some issues with the tags.bam > file - could you elaborate (direct email is fine), and tell me about any > future issues please? > They are very minor. The specification says (1.4) that 'QNAME' should be [!-?A-~], which excludes space and the '@' sign, and that (1.5) printable characters in tags with 'A' type are [!-~], i.e. only space is not allowed. BTW, I looked at your code which generated the file, it uses range(32, 127) both for 'Z' and 'A' types of tags, even though it's explicitly written in comments right above these lines where space should be included, and where it shouldn't :) From p.j.a.cock at googlemail.com Mon May 28 09:48:21 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 28 May 2012 10:48:21 +0100 Subject: [GSoC] weekly report #2 In-Reply-To: References: Message-ID: On Mon, May 28, 2012 at 10:29 AM, Artem Tarasov wrote: >> The blog mentions you think you found some issues with the tags.bam >> file - could you elaborate (direct email is fine), and tell me about any >> future issues please? >> > > They are very minor. The specification says (1.4) that 'QNAME' should be > [!-?A-~], which excludes space and the '@' sign, > Fair point. I should fix that. The '@' was presumably excluded in the v1.3 spec to avoid confusion with FASTQ files. > and that (1.5) > printable characters in tags with 'A' type are [!-~], i.e. only space > is not allowed.
> > BTW, I looked at your code which generated the file, it uses > range(32, 127) both for 'Z' and 'A' types of tags, even though > it's explicitly written in comments right above these lines where > space should be included, and where it shouldn't :) > Good point, that is a change in the specification I hadn't noticed. Back in v1.2, both A and Z were just "printable character" and "printable string", which to me includes the space. It was only in v1.3 that this was made explicit with a regex, and space ceased to be allowed in the A tag. I wonder if that was an accident or deliberate? You'll notice that samtools doesn't complain about these deviations from the specification, but then it doesn't attempt any validation. I'm not sure if Picard checks this. Thanks, Peter From w.arindrarto at gmail.com Wed May 30 21:44:04 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 30 May 2012 23:44:04 +0200 Subject: [GSoC] GSoC Project Update -- 4 Message-ID: Hi everyone, I just posted my latest GSoC update here: http://bow.web.id/blog/2012/05/assembling-the-parsers/ To summarize: I've been working on more SearchIO parsers last week, adding more formats to support. We now have a SearchIO-specific BLAST+ XML parser (it was first implemented on top of NCBIXML). It uses ElementTree as the base XML parser, with promising performance gains. I've also completed SearchIO's blast tabular parser, which takes in the BLAST+ tabular output files with or without headers. If the tabular file has headers, it can parse any number of columns in any order as long as the columns with hit and query IDs are present. Finally, I've finished writing the HMMER plain text parser. For now, the parser can handle outputs from hmmscan and hmmsearch, single and multiple queries. All these parsers have been tested using the test cases I've generated previously.
Additionally, I also had a public discussion with Peter on Github regarding SearchIO objects here: https://github.com/bow/biopython/commit/69a0ab64dfa7718f7455ca4c3961e95277fb4dbc#-P0, if anyone is interested. It started as a discussion on some behaviors of the HSP object, but also relates to other issues raised earlier (the dynamic SeqRecord coordinates Peter brought up earlier and Biopython's platform support). That's it for this week :). cheers, Bow From marian.povolny at gmail.com Sun Jun 3 21:07:18 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Sun, 3 Jun 2012 23:07:18 +0200 Subject: [GSoC] GSoC weekly status report No.2 Message-ID: http://blog.mpthecoder.com/post/24355573626/gsoc-weekly-status-report-no-2 It's the end of the second week of GSoC and time for a new report. I spent the last week mostly doing work based on criticism from my mentor. The D parser which parses lines into records is now in pretty good shape, and tested. Today I received a list of new issues that need to be resolved before going further, but they're not that much work and I can plan some new developments. A utility for validation is in planning for next week, which could be also used for performance measurement. And after that I will turn to making the current parser parallel. Also, tomorrow I'll be defending my Masters Thesis, after which I should be able to concentrate more on the GFF3 parser. From arklenna at gmail.com Mon Jun 4 02:39:47 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Sun, 3 Jun 2012 22:39:47 -0400 Subject: [GSoC] GSoC python variant update 4 Message-ID: Blog post (entirely reproduced in this email): http://arklenna.tumblr.com/post/24378549953/ I started implementing storage of VCF data in `SeqRecord` and `SeqFeature`. I digressed, spending a few days experimenting with overloading `__getattr__()` in lieu of manually writing properties.
Then it occurred to me that if, as Reece pointed out, a variant doesn't contain the actual sequence but a reference to the sequence, the advantages to using `SeqRecord` are minimal or possibly negative. In my experience, the highest performance for filtering large amounts of data is SQL. SQL has the advantage of scalability: SQLite now ships with Python, users can choose to run their own MySQL/PGSQL server, and I've read about a few approaches to GPU accelerated SQL. My initial glances at BioSQL, GMOD, etc. didn't show anything specifically designed for variants (again, a focus on storage of the sequence itself) so I implemented my own interface. Currently, the `parse_all()` method is very slow (approximately 260 seconds for a file with 240,000 variants when the parsing takes 5-10 seconds) and I am investigating why. My first step will be to reduce commit frequency. With a SQL backend, it seems superfluous to have a dedicated variant representation within Python. The SQL result object should allow for straightforward retrieval of data by name. I'm storing "misc" data in a SQL text field using JSON, which is also easy to access. 
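As a rough illustration of the two points above (committing less often, and keeping "misc" data as JSON in a text column), here is a minimal sketch using only the standard library. It is not the actual `parse_all()` implementation: the table name `variant`, its columns, and the function names `load_variants` and `variants_on` are all assumptions made up for this example.

```python
import json
import sqlite3


def load_variants(conn, variants, batch_size=10000):
    """Insert (chrom, pos, ref, alt, misc) tuples, committing once per batch.

    The schema is hypothetical; `misc` is any JSON-serialisable dict.
    Returns the number of rows inserted.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variant ("
        "chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, misc TEXT)"
    )
    insert = "INSERT INTO variant VALUES (?, ?, ?, ?, ?)"
    batch = []
    total = 0
    for chrom, pos, ref, alt, misc in variants:
        batch.append((chrom, pos, ref, alt, json.dumps(misc)))
        if len(batch) >= batch_size:
            conn.executemany(insert, batch)
            conn.commit()  # one commit per batch instead of one per row
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany(insert, batch)
        conn.commit()
        total += len(batch)
    return total


def variants_on(conn, chrom):
    """Return the variants on one chromosome as dicts, ordered by position."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    cursor = conn.execute(
        "SELECT chrom, pos, ref, alt, misc FROM variant "
        "WHERE chrom = ? ORDER BY pos",
        (chrom,),
    )
    records = []
    for row in cursor:
        record = {key: row[key] for key in row.keys()}
        record["misc"] = json.loads(row["misc"])  # decode the JSON column
        records.append(record)
    return records
```

With `sqlite3.connect(":memory:")` this can be tried without touching disk. Whether batching alone explains the slowdown reported above is not something this sketch can show; it only illustrates the "reduce commit frequency" step.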
Next: * Looking at BioSQL/GMOD etc to see if there is an existing standard I should be using/following * Deciding the extent of the convenience functions I wish to implement * Thinking about the most efficient way to filter records on the way into the SQL database From arklenna at gmail.com Mon Jun 4 13:30:15 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 4 Jun 2012 09:30:15 -0400 Subject: [GSoC] [Biopython-dev] GSoC python variant update 4 In-Reply-To: References: Message-ID: <2D58B8E1-5056-445F-B623-56B7136048BC@gmail.com> On Jun 4, 2012, at 1:11 AM, Mic wrote: > Hi Lenna, > Big companies are using http://en.wikipedia.org/wiki/NoSQL > > What kind of ORM do you want to use ( http://en.wikipedia.org/wiki/SQLAlchemy or http://en.wikipedia.org/wiki/Storm_%28software%29 ) > > Cheers, > Mic > > Hey Mic, Looks like there has been some talk about SQLAlchemy in Biopython: http://biopython.org/pipermail/biopython/2009-August/005455.html Lenna From mictadlo at gmail.com Mon Jun 4 05:11:56 2012 From: mictadlo at gmail.com (Mic) Date: Mon, 4 Jun 2012 15:11:56 +1000 Subject: [GSoC] [Biopython-dev] GSoC python variant update 4 In-Reply-To: References: Message-ID: Hi Lenna, Big companies are using http://en.wikipedia.org/wiki/NoSQL What kind of ORM do you want to use ( http://en.wikipedia.org/wiki/SQLAlchemy or http://en.wikipedia.org/wiki/Storm_%28software%29 ) Cheers, Mic On Mon, Jun 4, 2012 at 12:39 PM, Lenna Peterson wrote: > Blog post (entirely reproduced in this email): > http://arklenna.tumblr.com/post/24378549953/ > > I started implementing storage of VCF data in `SeqRecord` and > `SeqFeature`. I digressed, spending a few days experimenting with > overloading `__getattr__()` in lieu of manually writing properties. > Then it occurred to me that if, as Reece pointed out, a variant > doesn't contain the actual sequence but a reference to the sequence, > the advantages to using `SeqRecord` are minimal or possibly negative.
> > In my experience, the highest performance for filtering large amounts > of data is SQL. SQL has the advantage of scalability: SQLite now ships > with Python, users can choose to run their own MySQL/PGSQL server, and > I've read about a few approaches to GPU accelerated SQL. > > My initial glances at BioSQL, GMOD, etc. didn't show anything > specifically designed for variants (again, a focus on storage of the > sequence itself) so I implemented my own interface. Currently, the > `parse_all()` method is very slow (approximately 260 seconds for a > file with 240,000 variants when the parsing takes 5-10 seconds) and I > am investigating why. My first step will be to reduce commit > frequency. > > With a SQL backend, it seems superfluous to have a dedicated variant > representation within Python. The SQL result object should allow for > straightforward retrieval of data by name. I'm storing "misc" data in > a SQL text field using JSON, which is also easy to access. > > Next: > > * Looking at BioSQL/GMOD etc to see if there is an existing standard I > should be using/following > * Deciding the extent of the convenience functions I wish to implement > * Thinking about the most efficient way to filter records on the way > into the SQL database > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From chapmanb at 50mail.com Mon Jun 4 16:04:15 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 04 Jun 2012 12:04:15 -0400 Subject: [GSoC] GSoC python variant update 4 In-Reply-To: References: Message-ID: <87haurjc74.fsf@fastmail.fm> Lenna; Thanks for the summary. A couple of thoughts on the directions: - For property access, I think the best approach would be to store all of the arbitrary key/value pairs from INFO in SeqRecord annotations, then only use hand coded @properties to expose the most useful. 
That gives people access to the most useful ones (as determined by you) with attributes but lets anyone dig in and get custom ones. - If you'd like to explore an SQL backend, you should have a look at Gemini: https://github.com/arq5x/gemini which stores variants in a SQLite database along with associated annotations. It's a flat structure based on adding and exposing useful annotations on variants: https://github.com/arq5x/gemini/blob/master/gemini/database.py Reinventing a new SQL store representation is a lot of work so it might be good to work off what other folks are currently doing and try to provide a Biopython friendly front end, much as you're exploring with PyVCF. Hope these are useful. Let me know if you have any questions at all, Brad > Blog post (entirely reproduced in this email): > http://arklenna.tumblr.com/post/24378549953/ > > I started implementing storage of VCF data in `SeqRecord` and > `SeqFeature`. I digressed, spending a few days experimenting with > overloading `__getattr__()` in lieu of manually writing properties. > Then it occurred to me that if, as Reece pointed out, a variant > doesn't contain the actual sequence but a reference to the sequence, > the advantages to using `SeqRecord` are minimal or possibly negative. > > In my experience, the highest performance for filtering large amounts > of data is SQL. SQL has the advantage of scalability: SQLite now ships > with Python, users can choose to run their own MySQL/PGSQL server, and > I've read about a few approaches to GPU accelerated SQL. > > My initial glances at BioSQL, GMOD, etc. didn't show anything > specifically designed for variants (again, a focus on storage of the > sequence itself) so I implemented my own interface. Currently, the > `parse_all()` method is very slow (approximately 260 seconds for a > file with 240,000 variants when the parsing takes 5-10 seconds) and I > am investigating why. My first step will be to reduce commit > frequency.
> > With a SQL backend, it seems superfluous to have a dedicated variant > representation within Python. The SQL result object should allow for > straightforward retrieval of data by name. I'm storing "misc" data in > a SQL text field using JSON, which is also easy to access. > > Next: > > * Looking at BioSQL/GMOD etc to see if there is an existing standard I > should be using/following > * Deciding the extent of the convenience functions I wish to implement > * Thinking about the most efficient way to filter records on the way > into the SQL database > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From lomereiter at googlemail.com Mon Jun 4 18:02:58 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 4 Jun 2012 22:02:58 +0400 Subject: [GSoC] Weekly report #3 Message-ID: Hello all, the post is here: http://lomereiter.wordpress.com/2012/06/04/gsoc-weekly-report-3/ I've implemented random access to BAM file, using index file. Also I created a generic function for memoization which stores decompressed blocks in cache, following some desired cache strategy. Currently, I use simple FIFO cache. Also I studied how to make SAM output faster. I came to the conclusion that not only D standard library functions, but even ones of *printf family are too slow for this purpose, because they have to parse format string. Instead, I need to use specialized functions for printing integers and floats. Currently, output is about 4x slower than in samtools. So I have to take back some of my harsh words about its code and say that there is something to learn from there. It indeed uses its own functions for integer output, and also uses string buffer to do less calls (system functions can't be inlined). I'll use this approach, too, so very soon my library will be usable in pipelines, but only for output. 
Then I'm going to move on to allow alignments to be modified and outputted to BAM. After that, SAM parser needs to be implemented, and I'm going to use Ragel (finite-state machine compiler) for that purpose. So by the beginning of July I want to have SAM<->BAM conversion working, with a good speed. Add to that first release of biogem, and those are my plans for this month. From p.j.a.cock at googlemail.com Mon Jun 4 19:36:25 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 4 Jun 2012 20:36:25 +0100 Subject: [GSoC] Weekly report #3 In-Reply-To: References: Message-ID: On Mon, Jun 4, 2012 at 7:02 PM, Artem Tarasov wrote: > Hello all, > > the post is here: > http://lomereiter.wordpress.com/2012/06/04/gsoc-weekly-report-3/ > > I've implemented random access to BAM file, using index file. Also I > created a generic function for memoization which stores decompressed > blocks in cache, following some desired cache strategy. Currently, I > use simple FIFO cache. That sounds good. We've talked a little bit about the block caching strategy for Biopython's BGZF support - dropping the least recently used block would be good (LRU) but requires the overhead of storing and recording timestamps on each access. Currently my Biopython BGZF code just drops a cached block 'at random' (actually based on the dictionary hashing algorithm), and switching to FIFO was something I planned to try next (easily done with Python's OrderedDict class). FIFO seems like a good solution as the overheads are much lower than LRU. Have you got any good random access benchmarks to try this out with? i.e. something non-random, such as pulling mates of paired end reads. How many BGZF blocks are you keeping in the cache, and why? Are you thinking about BGZF output yet (which will be required in order to write BAM files)? 
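For reference, the FIFO block-cache idea discussed in this message can be sketched with Python's `OrderedDict` roughly as follows. This is only an illustration under stated assumptions, not Biopython's BGZF code: the class name `FifoBlockCache`, the `load_block` callback, and the default of 64 blocks are all hypothetical choices for the example.

```python
from collections import OrderedDict


class FifoBlockCache:
    """Cache decompressed blocks keyed by file offset, evicting oldest first.

    `load_block` is a hypothetical callable that reads and decompresses
    the block starting at the given offset.
    """

    def __init__(self, load_block, max_blocks=64):
        self._load_block = load_block
        self._max_blocks = max_blocks
        self._blocks = OrderedDict()  # remembers insertion order

    def get(self, offset):
        try:
            # A hit does NOT reorder the entry; that is what makes this
            # FIFO rather than LRU, and why no per-access bookkeeping
            # is needed.
            return self._blocks[offset]
        except KeyError:
            pass
        data = self._load_block(offset)
        if len(self._blocks) >= self._max_blocks:
            self._blocks.popitem(last=False)  # drop the oldest insertion
        self._blocks[offset] = data
        return data
```

Turning this into LRU would mean moving an entry to the end of the ordering on every hit, which is the extra per-access cost mentioned above.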
Regards, Peter From lomereiter at googlemail.com Mon Jun 4 20:07:03 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Tue, 5 Jun 2012 00:07:03 +0400 Subject: [GSoC] Weekly report #3 In-Reply-To: References: Message-ID: > Have you got any good random access benchmarks to try this out > with? i.e. something non-random, such as pulling mates of paired > end reads. > Currently, no. Please suggest your ideas about benchmarks because I suspect that you have much more experience with BAM files and better knowledge of use patterns. How many BGZF blocks are you keeping in the cache, and why? > Currently, 512. I don't know why, seems like a reasonable number (about 30MB of RAM). Maybe it should be a runtime parameter but I doubt that end users will bother with tweaking cache size. > Are you thinking about BGZF output yet (which will be required > in order to write BAM files)? > It's not hard at all. I already wrote packing string to BGZF in Ruby: https://github.com/lomereiter/bioruby-bgzf/blob/master/lib/bio-bgzf/pack.rb Parallelizing should also be easy, it's very similar to reading blocks from file. Determine how many alignments to pack in one block (it's 65Kb max), send compression task to taskpool, then go create next chunk of alignments, and so on. From cswh at umich.edu Tue Jun 5 03:04:06 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Mon, 4 Jun 2012 23:04:06 -0400 Subject: [GSoC] Weekly report: Indexed MAF access, Kyoto Cabinet, SQLite, and more Message-ID: <2B6E16E9-3DBC-4F54-88F8-C42E03124A1E@umich.edu> Hi all, My latest blog post on (mostly) last week's work is here: http://csw.github.com/bioruby-maf/blog/2012/06/04/indexed_maf_access/ Highlights include SQLite vs. Kyoto Cabinet, the path to BGZF support, and the challenges of supporting multiple Ruby implementations. 
Clayton Wheeler cswh at umich.edu From casbon at gmail.com Wed Jun 6 09:39:12 2012 From: casbon at gmail.com (James Casbon) Date: Wed, 6 Jun 2012 10:39:12 +0100 Subject: [GSoC] [Biopython-dev] GSoC python variant update 4 In-Reply-To: <87haurjc74.fsf@fastmail.fm> References: <87haurjc74.fsf@fastmail.fm> Message-ID: I'd be cautious about going for SQL for VCF backends. At least the following two problems arise: 1. VCF isn't a format, it's a meta-format so there isn't really a single data representation, but many. You are going to need a very flexible schema to allow variable records with complex entries like lists. (An entry is dynamically defined by the FORMAT field in each row, right?). Having a JSON misc entry means you lose all query abilities on these data anyway. 2. If you move your data away from VCF, you cannot use tools from outside your universe. i.e. lets say you want to use a GATK variant annotator, you need to do the roundtrip from SQL->VCF->SQL. I speak having developed this approach already and largely abandoned it due to the problems above. You are right that SQL would be a better solution for data index and access (no serialization issues, multiple tuned indexes), but be careful that you may spend a lot of time and not have a lot to show. I would really like it if biology used existing binary formats (HDF5 anyone?), but we don't. More practical use right now would be bcf support. From w.arindrarto at gmail.com Wed Jun 6 18:22:26 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 6 Jun 2012 20:22:26 +0200 Subject: [GSoC] GSoC Project Update -- 5 Message-ID: Hi everyone, I just posted another update on my GSoC project here: http://bow.web.id/blog/2012/06/hello-indexers/ A brief summary: * I added the SearchIO indexing functions, with the same interface as SeqIO's indexing functions. It currently supports all the available SearchIO parsers (blast-tab, blast-xml, and hmmer-text). 
* (not mentioned in the post) I did some refactoring to the SearchIO code base. It was starting to get a bit messy, but now it's cleaner. All the parsers are now implemented as classes. For some of them, users can use it directly to tweak its behavior (e.g. the blast-tab parser can be used to parse plain blast-tab files with custom column ordering. This is not possible if users use SearchIO.parse or SearchIO.read instead). Additionally, I should also mention that my schedule has been changed slightly. The original plan for next week was to focus on hmmer-text indexing. However, since it has been done (except for the testing, which should not take a week), I will be focusing on writing the SearchIO converters. So expect to see that instead. That's all for now :). regards, Bow From lomereiter at googlemail.com Mon Jun 11 17:25:48 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 11 Jun 2012 21:25:48 +0400 Subject: [GSoC] weekly report #4 Message-ID: Hello everybody, here's my weekly report: http://lomereiter.wordpress.com/2012/06/11/gsoc-weekly-report-4/ I've added BAM output support (not parallelized yet) and alignment creation/modification - changing fields, adding tags, and replacing existing ones. Thus, the library has a lot of features at the moment, and I started documenting them on github wiki. Also I found out that there's a great tool in DMD distribution, called rdmd, which allows to execute D files as scripts, by just adding "#!/usr/bin/rdmd" at the top. It will automatically compile all needed files and run executable. That dramatically simplifies library usage, no need to write cumbersome makefiles. The examples are at https://github.com/lomereiter/BAMread/wiki/Getting-started You can try to write your own script if you wish, follow the instructions in the wiki. Also, as my library now is able to write BAM, the current project title is quite misleading. 
So I'd like to hear suggestions on renaming :) -- Artem From p.j.a.cock at googlemail.com Mon Jun 11 17:41:39 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 11 Jun 2012 18:41:39 +0100 Subject: [GSoC] weekly report #4 In-Reply-To: References: Message-ID: On Mon, Jun 11, 2012 at 6:25 PM, Artem Tarasov wrote: > Hello everybody, > > here's my weekly report: > http://lomereiter.wordpress.com/2012/06/11/gsoc-weekly-report-4/ > > ... > > Also, as my library now is able to write BAM, the current project title is > quite misleading. > So I'd like to hear suggestions on renaming :) As to the name, how about damtools (D alignment/map tools), "for dealing with the flood of sequence data" (dam as in reservoir). Peter From cjfields at illinois.edu Mon Jun 11 17:46:43 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 11 Jun 2012 17:46:43 +0000 Subject: [GSoC] weekly report #4 In-Reply-To: References: Message-ID: <67FF495D-E8AD-4920-9EA8-6464E1310FBB@illinois.edu> On Jun 11, 2012, at 12:41 PM, Peter Cock wrote: > On Mon, Jun 11, 2012 at 6:25 PM, Artem Tarasov > wrote: >> Hello everybody, >> >> here's my weekly report: >> http://lomereiter.wordpress.com/2012/06/11/gsoc-weekly-report-4/ >> >> ... >> >> Also, as my library now is able to write BAM, the current project title is >> quite misleading. >> So I'd like to hear suggestions on renaming :) > > As to the name, how about damtools (D alignment/map tools), > "for dealing with the flood of sequence data" (dam as in reservoir). > > Peter Or 'damn, look how much work we have to do' chris From lomereiter at googlemail.com Mon Jun 11 18:47:48 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 11 Jun 2012 22:47:48 +0400 Subject: [GSoC] weekly report #4 In-Reply-To: References: Message-ID: No, thanks... I'll call it libsambamba. In suahili, sambamba means 'parallel' ( http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? 
:) On Mon, Jun 11, 2012 at 9:41 PM, Peter Cock wrote: > > As to the name, how about damtools (D alignment/map tools), > "for dealing with the flood of sequence data" (dam as in reservoir). > > Peter > From p.j.a.cock at googlemail.com Mon Jun 11 18:59:38 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 11 Jun 2012 19:59:38 +0100 Subject: [GSoC] weekly report #4 In-Reply-To: <20120611185718.GA12417@thebird.nl> References: <20120611185718.GA12417@thebird.nl> Message-ID: On Monday, June 11, 2012, Pjotr Prins wrote: > On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: > > No, thanks... > > > > I'll call it libsambamba. In suahili, sambamba means 'parallel' ( > > http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) > > I like it mbwana. > > Pj. > As the mentor, you'd be the mbwana or the bwana (boss), not Artem. But I do like lib-sambamba as a name - very clever. Peter From cjfields at illinois.edu Mon Jun 11 19:19:18 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 11 Jun 2012 19:19:18 +0000 Subject: [GSoC] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: > On Monday, June 11, 2012, Pjotr Prins wrote: > >> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>> No, thanks... >>> >>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >> >> I like it mbwana. >> >> Pj. >> > > As the mentor, you'd be the mbwana or the bwana (boss), not Artem. > > But I do like lib-sambamba as a name - very clever. > > Peter Agreed, fits very well. 
chris From pjotr.public14 at thebird.nl Mon Jun 11 18:57:18 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 11 Jun 2012 20:57:18 +0200 Subject: [GSoC] [BioRuby] weekly report #4 In-Reply-To: References: Message-ID: <20120611185718.GA12417@thebird.nl> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: > No, thanks... > > I'll call it libsambamba. In suahili, sambamba means 'parallel' ( > http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) I like it mbwana. Pj. From georgkam at gmail.com Mon Jun 11 19:28:42 2012 From: georgkam at gmail.com (George Githinji) Date: Mon, 11 Jun 2012 22:28:42 +0300 Subject: [GSoC] [BioRuby] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: Good tribute to swahili! ahsante sana bwana Artem! (Thank you very much for the suggestion) Sambamba could also mean correct way or the right thing in everyday speak.. (bwana is a term of respect or honour, though it also refers to a boss .. mostly we use 'mkubwa' to mean boss) George On Mon, Jun 11, 2012 at 10:19 PM, Fields, Christopher J wrote: > On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: > >> On Monday, June 11, 2012, Pjotr Prins wrote: >> >>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>>> No, thanks... >>>> >>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >>> >>> I like it mbwana. >>> >>> Pj. >>> >> >> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. >> >> But I do like lib-sambamba as a name - very clever. >> >> Peter > > Agreed, fits very well. 
> > chris > > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby -- --------------- Sincerely George Skype: george_g2 Blog: http://biorelated.wordpress.com/ Twitter: http://twitter.com/#!/george_l From cjfields at illinois.edu Mon Jun 11 19:36:44 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 11 Jun 2012 19:36:44 +0000 Subject: [GSoC] [BioRuby] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: <114DEA27-A766-4F0F-8144-098FF0905E1D@illinois.edu> heh, which makes me think you don't respect your bosses :) chris On Jun 11, 2012, at 2:28 PM, George Githinji wrote: > ...(bwana is a term of respect or honour, though it also refers to a boss > .. mostly we use 'mkubwa' to mean boss) > > George > > > On Mon, Jun 11, 2012 at 10:19 PM, Fields, Christopher J > wrote: >> On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: >> >>> On Monday, June 11, 2012, Pjotr Prins wrote: >>> >>>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>>>> No, thanks... >>>>> >>>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >>>> >>>> I like it mbwana. >>>> >>>> Pj. >>>> >>> >>> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. >>> >>> But I do like lib-sambamba as a name - very clever. >>> >>> Peter >> >> Agreed, fits very well. 
>> >> chris >> >> >> _______________________________________________ >> BioRuby Project - http://www.bioruby.org/ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby > > > > -- > --------------- > Sincerely > George > Skype: george_g2 > Blog: http://biorelated.wordpress.com/ > Twitter: http://twitter.com/#!/george_l From to.petr at gmail.com Mon Jun 11 20:35:59 2012 From: to.petr at gmail.com (P. Troshin) Date: Mon, 11 Jun 2012 21:35:59 +0100 Subject: [GSoC] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: None of my business but it's a bit unwieldy. It may be clever, but 99% people who come across it would not know. mbwana is simpler in that respect. Sorry for spoiling the consensus :-( P. On 11 June 2012 20:19, Fields, Christopher J wrote: > On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: > >> On Monday, June 11, 2012, Pjotr Prins wrote: >> >>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>>> No, thanks... >>>> >>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >>> >>> I like it mbwana. >>> >>> Pj. >>> >> >> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. >> >> But I do like lib-sambamba as a name - very clever. >> >> Peter > > Agreed, fits very well. 
> > chris > > > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From marian.povolny at gmail.com Mon Jun 11 20:52:05 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Mon, 11 Jun 2012 22:52:05 +0200 Subject: [GSoC] GSoC weekly status report No.3 Message-ID: http://blog.mpthecoder.com/post/24904798973/gsoc-weekly-status-report-no-3 My first report as a Master of Computer Engineering and Communications :) Here is a list of what I've been working on over the last week: - more cleanup and refactoring of the validation code, README etc., - made a validation utility in D, which simply reports the problems found to stderr, - made a benchmark tool with a -v option for measuring parser speed with and without validation, - with a basic benchmark tool in hand, found a few places which were very bad for performance. After fixing that code, parsing a 233MB GFF3 file on a five-year-old PC took 6 seconds, but without validation, with only a single thread, and with replacing escaped characters turned off, - made replacing escaped characters optional, because the current implementation requires creating additional string objects, which has a big impact on performance. There is a plan for making it faster, but it is scheduled for later, - added minimal parallelisation, by reading the file in a separate thread. Two additional days were spent on a segmentation fault in the D garbage collector which occurred when parsing a big file with a lot of errors. That should never happen, as I'm using the safe part of the D language, that is, no pointers or anything similar. The worst that should happen is an exception. But a segmentation fault points to an error in either the compiler, the runtime or the support library. The minimum reproducible example is still 42 lines long: https://gist.github.com/2911818 but changing anything in it makes the segmentation fault go away. 
More info on this topic can be found in the discussion here: https://github.com/mamarjan/bioruby-hpc-gff3/issues/31 I'll probably post a bug report on the Dlang webpage tomorrow. For the coming week I would like to add more parallelisation, change the validation code so that exceptions (and the seg fault) almost never happen, and add support for merging records into features. -- Marjan From cswh at umich.edu Mon Jun 11 20:56:02 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Mon, 11 Jun 2012 16:56:02 -0400 Subject: [GSoC] GSoC weekly status report: MAF filtering Message-ID: Hi all, Here's my status report on last week's work: http://csw.github.com/bioruby-maf/blog/2012/06/09/filtering-work/ Highlights: mainly MAF alignment block filtering and performance challenges with binary data in Ruby. Clayton Wheeler cswh at umich.edu From pjotr.public14 at thebird.nl Tue Jun 12 07:05:18 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 12 Jun 2012 09:05:18 +0200 Subject: [GSoC] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: <20120612070518.GA14848@thebird.nl> sam-bam-baah has the comment of sheep in it. May explain consensus :) How about sambamba tools. Less unwieldy. On Mon, Jun 11, 2012 at 09:35:59PM +0100, P. Troshin wrote: > None of my business but it's a bit unwieldy. It may be clever, but 99% > people who come across it would not know. mbwana is simpler in that > respect. Sorry for spoiling the consensus :-( > > P. > > > > On 11 June 2012 20:19, Fields, Christopher J wrote: > > On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: > > > >> On Monday, June 11, 2012, Pjotr Prins wrote: > >> > >>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: > >>>> No, thanks... > >>>> > >>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( > >>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) > >>> > >>> I like it mbwana. > >>> > >>> Pj. 
> >>> > >> > >> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. > >> > >> But I do like lib-sambamba as a name - very clever. > >> > >> Peter > > > > Agreed, fits very well. > > > > chris > > > > > > _______________________________________________ > > GSoC mailing list > > GSoC at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/gsoc > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From to.petr at gmail.com Tue Jun 12 16:48:40 2012 From: to.petr at gmail.com (P. Troshin) Date: Tue, 12 Jun 2012 17:48:40 +0100 Subject: [GSoC] weekly report #4 In-Reply-To: <20120612070518.GA14848@thebird.nl> References: <20120611185718.GA12417@thebird.nl> <20120612070518.GA14848@thebird.nl> Message-ID: > How about sambamba tools. Less unwieldy. I think its better, but it's not particularly catchy but please ignore me, it is so much easier to critique, when to come up with a really good name. P. On 12 June 2012 08:05, Pjotr Prins wrote: > sam-bam-baah has the comment of sheep in it. May explain consensus :) > > How about sambamba tools. Less unwieldy. > > On Mon, Jun 11, 2012 at 09:35:59PM +0100, P. Troshin wrote: >> None of my business but it's a bit unwieldy. It may be clever, but 99% >> people who come across it would not know. mbwana is simpler in that >> respect. Sorry for spoiling the consensus :-( >> >> P. >> >> >> >> On 11 June 2012 20:19, Fields, Christopher J wrote: >> > On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: >> > >> >> On Monday, June 11, 2012, Pjotr Prins wrote: >> >> >> >>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >> >>>> No, thanks... >> >>>> >> >>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >> >>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >> >>> >> >>> I like it mbwana. >> >>> >> >>> Pj. 
>> >>> >> >> >> >> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. >> >> >> >> But I do like lib-sambamba as a name - very clever. >> >> >> >> Peter >> > >> > Agreed, fits very well. >> > >> > chris >> > >> > >> > _______________________________________________ >> > GSoC mailing list >> > GSoC at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/gsoc >> _______________________________________________ >> GSoC mailing list >> GSoC at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/gsoc From to.petr at gmail.com Tue Jun 12 17:13:11 2012 From: to.petr at gmail.com (P. Troshin) Date: Tue, 12 Jun 2012 18:13:11 +0100 Subject: [GSoC] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> <20120612070518.GA14848@thebird.nl> Message-ID: Also I cannot help it but the Samba file server is the first thing that comes to my mind when I hear sambamba. I suspect it may be just my techy background... I think that my favorite so far is damtools, because its terribly close to samtools, which is where you want to be I guess. However, I can see why its not everybody's favorite and unfortunately there are already at least a few other damtools around http://code.google.com/p/dam-tools/, http://home.earthlink.net/~matthewjheaney/damtools/index.html P. On 12 June 2012 17:48, P. Troshin wrote: >> How about sambamba tools. Less unwieldy. > > I think its better, but it's not particularly catchy but please ignore > me, it is so much easier to critique, when to come up with a really > good name. > > P. > > > On 12 June 2012 08:05, Pjotr Prins wrote: >> sam-bam-baah has the comment of sheep in it. May explain consensus :) >> >> How about sambamba tools. Less unwieldy. >> >> On Mon, Jun 11, 2012 at 09:35:59PM +0100, P. Troshin wrote: >>> None of my business but it's a bit unwieldy. It may be clever, but 99% >>> people who come across it would not know. mbwana is simpler in that >>> respect. 
Sorry for spoiling the consensus :-( >>> >>> P. >>> >>> >>> >>> On 11 June 2012 20:19, Fields, Christopher J wrote: >>> > On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: >>> > >>> >> On Monday, June 11, 2012, Pjotr Prins wrote: >>> >> >>> >>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>> >>>> No, thanks... >>> >>>> >>> >>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>> >>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >>> >>> >>> >>> I like it mbwana. >>> >>> >>> >>> Pj. >>> >>> >>> >> >>> >> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. >>> >> >>> >> But I do like lib-sambamba as a name - very clever. >>> >> >>> >> Peter >>> > >>> > Agreed, fits very well. >>> > >>> > chris >>> > >>> > >>> > _______________________________________________ >>> > GSoC mailing list >>> > GSoC at lists.open-bio.org >>> > http://lists.open-bio.org/mailman/listinfo/gsoc >>> _______________________________________________ >>> GSoC mailing list >>> GSoC at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/gsoc From w.arindrarto at gmail.com Wed Jun 13 23:33:52 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 14 Jun 2012 01:33:52 +0200 Subject: [GSoC] GSoC Project Update -- 6 Message-ID: Hi everyone, It's a bit later than usual, but I've finally finished my update for the past week: http://bow.web.id/blog/2012/06/round-trip-with-searchio/ As a summary: 1. SearchIO now has write and convert functions that output to BLAST XML and tabular files. 2. The two main container objects, QueryResult and Hit, now have their own filter() and map() functions similar to Python's built-in filter and map. For QueryResult objects, there are hit_filter, hsp_filter, hit_map, and hsp_map functions, and for Hit objects we have filter and map. Filter functions accept a boolean function with either Hit or HSP as its argument, while map accepts a function that must return either Hit or HSP objects. 
I wrote a short demo on my post to make this a bit clearer and show what it can help users do. 3. (not mentioned in the post) is more tweaks and tests to the existing functionalities, especially indexing. That's all for now :). regards, Bow From arklenna at gmail.com Mon Jun 18 04:21:42 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 18 Jun 2012 00:21:42 -0400 Subject: [GSoC] GSoC python variant update 5 Message-ID: Latest post: http://arklenna.tumblr.com/post/25343434817/ James raised some [concerns](http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009688.html) about the difficulty of representing the VCF "metaformat" in SQL. I've taken these into consideration and am forging ahead. So far, some of the types of data fit more neatly into SQL than into a VCF row. I have redesigned my SQL schema with a two-pronged approach to tackle the flexibility of VCF: 1. For the site, alt, and genotype tables, there are columns for the reserved info/format keywords in the VCF spec (so far only for non-SV). 2. For new info and format keywords (both in the header and in the body), I am storing the values in a "narrow table." This table stores a foreign key to the key's row and the key-value pair. The narrow table is also good for storing reserved keys that are lists (but not per-allele or per-genotype). Note: this diagram only has the FKs listed for simplicity. (SQL diagram) Interestingly, despite the increase in the number of tables and thus insert statements, the current script is considerably faster than the previous version. Evidently JSON serialization is slow. There are a few things I haven't figured out: 1. Can an info field be per-genotype? The spec implies that wouldn't make sense, but doesn't forbid it. 2. Is there a safe way to find out if a VCF 4.0 field is per-allele or per-genotype? 3. Will my SQL representation be able to handle SV? ======= I'll be out of town for the next week but I will have plenty of time for Python. 
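To make the two-pronged schema described above a little more concrete, here is a minimal sketch using Python's sqlite3 module. It is only an illustration and not the actual schema from the project: the table names (`site`, `site_info`), the column names, and the choice of `DP` as the one reserved key with its own column are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (
    id    INTEGER PRIMARY KEY,
    chrom TEXT    NOT NULL,
    pos   INTEGER NOT NULL,
    ref   TEXT    NOT NULL,
    dp    INTEGER              -- a reserved INFO keyword gets a real column
);
CREATE TABLE site_info (       -- the "narrow table" for all other keywords
    site_id    INTEGER NOT NULL REFERENCES site(id),
    info_key   TEXT    NOT NULL,
    info_value TEXT    NOT NULL
);
""")


def insert_site(conn, chrom, pos, ref, info):
    """Store one site: the reserved key in its column, the rest as key-value rows."""
    info = dict(info)                 # copy, so the caller's dict is untouched
    dp = info.pop("DP", None)
    cur = conn.execute(
        "INSERT INTO site (chrom, pos, ref, dp) VALUES (?, ?, ?, ?)",
        (chrom, pos, ref, dp),
    )
    site_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO site_info (site_id, info_key, info_value) VALUES (?, ?, ?)",
        [(site_id, key, str(value)) for key, value in info.items()],
    )
    return site_id


# MYKEY and OTHER stand in for keywords that are not reserved in the spec.
insert_site(conn, "20", 14370, "G", {"DP": 14, "MYKEY": "abc", "OTHER": 3})

# Unlike a JSON blob, the narrow table can still be queried per key.
rows = conn.execute(
    "SELECT s.chrom, s.pos FROM site AS s "
    "JOIN site_info AS i ON i.site_id = s.id "
    "WHERE i.info_key = ? AND i.info_value = ?",
    ("MYKEY", "abc"),
).fetchall()
print(rows)
```

The trade-off in this sketch is that every narrow-table value is stored as text, so numeric comparisons on non-reserved keys would need a cast or a type column.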
From marian.povolny at gmail.com Mon Jun 18 18:28:12 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Mon, 18 Jun 2012 20:28:12 +0200 Subject: [GSoC] GSoC weekly status report No.4 Message-ID: http://blog.mpthecoder.com/post/25375170121/gsoc-weekly-status-report-no-4 During the last week combining records into features has been added, and also connecting the features into parent-child relationships. Validation messages have been enhanced with file names and line numbers, and now look like errors reported by a compiler. Feels most natural to me. Combining the records into features works by keeping a forward cache of a number of features (1000 by default, configurable). That means that the parsing results will be correct only if records which are part of the same feature are at most 1000 features apart (or whatever cache size has been set). The first implementation, which compared the IDs of records directly, required 10min for a 233MB file. After switching to comparing hash values of the IDs first, and comparing the IDs themselves only when the hashes match, the parsing time was down to 45s. After fixing a bug, the time is now 10 seconds for the 233MB m_hapla file :) Linking the features into parent-child relationships works similarly, by using 32-bit hashes most of the time instead of comparing strings. With this functionality turned on, the same file is parsed in 13 seconds. All the measurements have been done using the benchmark utility, which has a few more options for setting what should be run. Otherwise I did more refactoring: moved all the gff3_* files into a gff3 directory, so the D modules are now bio.gff3.*, parsing functions are now static methods of the GFF3File and GFF3Data classes, etc. For the new week, I would like to add filtering to the D library, which I can then use to implement iteration over genes, mRNAs, CDS features, etc. 
After that the library should be pretty much complete feature-wise, at least per what was promised in the project proposal, so I'll continue by defining the C API and developing the Ruby gem. -- Marjan From chapmanb at 50mail.com Tue Jun 19 00:28:11 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 18 Jun 2012 20:28:11 -0400 Subject: [GSoC] [Biopython-dev] GSoC python variant update 5 In-Reply-To: References: Message-ID: <87k3z4xi04.fsf@fastmail.fm> Lenna; Thanks for the update. I've been following the commits on GitHub and it looks like you're getting some traction with the SQL representation. I do worry about it for some of the same reasons as James but am happy to have you take a look if it helps with your understanding of VCF. I think it might also be worth thinking of some use cases that are not well covered by the current PyVCF parser and seeing if your representation tackles them better. One current one that is tough is slicing a VCF file by sample. Row-based slicing is well supported but column-based is not as easy. If I had, say, a 50-sample file, how well does it allow pulling out the genotypes and records from a single sample and rewriting them as VCF? Can you code up this type of workflow with your current representation? For your specific questions: > 1. Can an info field be per-genotype? The spec implies that wouldn't > make sense, but doesn't forbid it. The INFO key/values are per-variant. There are also arbitrary per-genotype key/values allowed, specified in the FORMAT field. > 2. Is there a safe way to find out if a VCF 4.0 field is per-allele or > per-genotype? This should be the INFO/FORMAT distinction I described above. > 3. Will my SQL representation be able to handle SV? VCF encodes structural variation information into the INFO metadata, so as long as you support the structural variant specified ALT fields it should fit. 
The longer term question is if you want to support more explicit linking between distant breakends, which would require special support. I think that's probably more of an end-of-the-summer goal, however, since most people aren't yet doing tons of VCF structural variation work. Brad From lomereiter at googlemail.com Tue Jun 19 08:25:07 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Tue, 19 Jun 2012 12:25:07 +0400 Subject: [GSoC] weekly report #5 Message-ID: Hi all, I wrote a few words about improvements in my project during the past week: http://lomereiter.wordpress.com/2012/06/19/gsoc-weekly-report-5/ - More wiki content on Github, with examples of how to use the library for common cases. - Faster conversion to SAM, now it's not worse than samtools in this respect - Parallelized BGZF compression, though it was relatively easy to add - Reconsidering interaction with dynamic languages due to shared library issues in D. Now I'm thinking of an approach of making command-line tools outputting JSON and wrapping them. At least in BioRuby we have Bio::Command to make this process easy. - Progress in SAM parsing - valid records are now fully parsed, and it takes just 300 lines of D/Ragel mix, together with some unittests. Also, Ragel provides some convenient methods to handle errors, but I haven't investigated them yet. Once error handling is added, the branch will be ready to be merged, and then I'll add SAM reading. -- Artem From reece at harts.net Tue Jun 19 12:51:26 2012 From: reece at harts.net (Reece Hart) Date: Tue, 19 Jun 2012 05:51:26 -0700 Subject: [GSoC] [Biopython-dev] GSoC python variant update 5 In-Reply-To: References: Message-ID: On Sun, Jun 17, 2012 at 9:21 PM, Lenna Peterson wrote: > Latest post: http://arklenna.tumblr.com/post/25343434817/ > Hi Lenna- Thanks for making the time to update your blog. As with James and Brad, I doubt the suitability of SQL for this project. 
However, I learn things when I'm wrong, so this should work out either way! I don't understand your "SQL diagram" (more properly, an "entity-relationship diagram"). It would help me -- and perhaps you too -- to provide more detail in the ERD and then to parse a few lines from a VCF file into your schema by hand (e.g., as a set of tsv files or Google doc spreadsheets). It's also worthwhile to look at other people's schemas for similar data. http://www.ensembl.org/info/docs/variation/variation-database-schema.pdf is a good place to start. In any case, VCF parsing is merely a specialized embodiment of general variant representation, which is the primary goal for this project. Therefore, it would be worthwhile now to test whatever scheme you propose against other formats (GFF and HGVS have been discussed). I don't mean that you should implement now, but rather just make sure that you're heading in a direction that's compatible with other planned uses. -Reece From reece at harts.net Tue Jun 19 12:52:33 2012 From: reece at harts.net (Reece Hart) Date: Tue, 19 Jun 2012 05:52:33 -0700 Subject: [GSoC] [Biopython-dev] GSoC python variant update 5 In-Reply-To: References: Message-ID: On Tue, Jun 19, 2012 at 5:51 AM, Reece Hart wrote: > (GFF and HGVS have been discussed Ooops. I meant GVF, but the point is the same. From chris.mit7 at gmail.com Tue Jun 19 16:23:01 2012 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Tue, 19 Jun 2012 12:23:01 -0400 Subject: [GSoC] [Biopython-dev] GSoC python variant update 5 In-Reply-To: References: Message-ID: Lenna, One concern I had, which may be avoided by your schema of using narrow-tables, is how well your current structure can support the inevitable updates to the VCF format. It may show my inexperience with SQL, but is a SQL backend flexible enough to adopt new conventions while also maintaining backwards compatibility? Also, from a usage standpoint -- I wouldn't want to have a vcf file and a database file on my drive. 
It would be redundant for me. It may just be my style, but I usually sieve the useful information out of a vcf file into several smaller, specific vcf files. Really what a vcf parser does is make your output more concise. I wouldn't then want another .db file each time I wanted to parse my vcf file into a smaller chunk. Additionally, any time you gained in filtering by using a SQL backend may be negligible when the user gets to this stage. The file sizes will be substantially smaller. In short, I think you might be over-engineering this. Keeping a SQL backend is going to require indexing after updates (how long will this take, and is the time comparable to using pure Python? You also have the issue where SQL decides to ignore your index...), and writing queries that may be optimal for some use cases and poor in others. You may have thought about these concerns, and I don't mean to deter your efforts; you may be a SQL guru for all I know (I may also just be biased by how I operate). Chris On Tue, Jun 19, 2012 at 8:52 AM, Reece Hart wrote: > On Tue, Jun 19, 2012 at 5:51 AM, Reece Hart wrote: > > > (GFF and HGVS have been discussed > > > Ooops. I meant GVF, but the point is the same. > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From w.arindrarto at gmail.com Thu Jun 21 07:09:40 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 21 Jun 2012 09:09:40 +0200 Subject: [GSoC] GSoC Project Update -- 7 Message-ID: Hi everyone, I've just posted another update for the past week here: http://bow.web.id/blog/2012/06/new-parsers-for-new-week/ In short, here is what I did: * Added parsing, indexing, and writing support for two new formats (along with their tests): the HMMER table output (hmmer-tab) and the HMMER domain table output (hmmscan-domtab, hmmsearch-domtab, or phmmer-domtab). 
There is a small issue which prevents the HMMER domain table format from being simply named hmmer-domtab, which I discuss in the post. * (not mentioned in the post) Added more tests for writing and indexing, and refactored some of the existing code. As mentioned in the post, since the core SearchIO API functions are now complete, for the coming weeks I will focus on adding more formats to support, improving the code, and of course adding more tests and documentation. regards, Bow From cswh at umich.edu Fri Jun 22 04:25:40 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Fri, 22 Jun 2012 00:25:40 -0400 Subject: [GSoC] GSoC weekly status report: parallel I/O and JRuby Message-ID: <656B7BDD-DD0E-40FE-91AF-DC23113427D5@umich.edu> Hi all, This week's status report is a double feature: http://csw.github.com/bioruby-maf/blog/2012/06/13/jruby_support_and_performance_work/ http://csw.github.com/bioruby-maf/blog/2012/06/21/parallel_io/ In short, I now have JRuby fully supported by my MAF code, including the Kyoto Cabinet components. Using JRuby, I've been able to deliver very solid performance for index-driven random access parsing as well as for sequential whole-file parsing. Clayton Wheeler cswh at umich.edu From marian.povolny at gmail.com Mon Jun 25 20:38:10 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Mon, 25 Jun 2012 22:38:10 +0200 Subject: [GSoC] GSoC weekly status report No.5 Message-ID: http://blog.mpthecoder.com/post/25870737554/gsoc-weekly-status-report-no-5 *Summary of the last week* During the last week a few improvements have been made: - the validation messages have been improved with file names and line numbers, in the compiler error style, - filtering has been added, - replacing escaped characters has been re-implemented to get a huge performance improvement.
The 1GB file that required 10min for parsing because of 6.5 million escaped characters is now parsed in 22.5 seconds, only 0.5 seconds more than when replacing them is turned off, - added a tool for correctly counting features in a GFF3 file. This will be useful because the user can then find a good value for the feature cache size by using this tool to get the correct count and the benchmark tool to get the count for a particular cache size. The tool is still slow for some files, so I'm thinking about how to improve that, - other small fixes, comments and similar... *More on filtering* The filtering was first implemented using classes, but later refactored using delegates instead. The result was 50 fewer lines of code. The user can now specify a filter before parsing a file like this: GFF3File.parse_by_records("file.gff3", NO_VALIDATION, false, NO_BEFORE_FILTER, OR(ATTRIBUTE("ID", EQUALS("1")), ATTRIBUTE("ID", CONTAINS("2")))); The first filter, which is set to none in this example, is the filter applied before the line is parsed; that means that the filter doesn't support ATTRIBUTE and FIELD predicates. The following predicates are implemented: FIELD, ATTRIBUTE, EQUALS, CONTAINS, STARTS_WITH, AND, OR, NOT. In case they're used in a way which is not allowed, there will be a compiler error. Otherwise the allowed combinations should be logical enough to guess (but I'll document them too). I altered the benchmark tool a few times to test the performance, and what I found was very positive: the performance impact in the few tests I did was very small. I'll have more data once the next tool is finished. *New week* Release early and often - it's a mantra I've heard quite a few times before. So as the group of mentors and students has agreed, every student will be releasing a gem at the end of this week. I'm still not sure what will be in it, because the support for shared libraries in D compilers for Linux has not been implemented yet.
So it will probably be a combination of a command-line utility and a Ruby module which uses that utility. What I currently have in mind is re-implementing in D the gff3-fetch utility that Pjotr developed in Ruby, to make it faster. But first I'll implement filtering functionality for it, so the users can reduce a file to records which are interesting to them and then parse that using a parser in Ruby, for example. A Ruby module that would make using this utility easier for Ruby developers seems like a good idea for the first release. Part of this utility will be to support GFF3 output, so that will be implemented too (and has already been done today to some extent). From lomereiter at googlemail.com Tue Jun 26 15:45:21 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Tue, 26 Jun 2012 19:45:21 +0400 Subject: [GSoC] weekly report #6 Message-ID: Hello all, here's my weekly report: http://lomereiter.wordpress.com/2012/06/26/gsoc-weekly-report-6/ Summary: Ruby bindings moved to parsing JSON from command-line tool output, and everything works fine. That also means you can use JSON output from other languages. SAM input was added. It is not optimized at all; the parser currently does a lot of unnecessary memory allocations. Right now it's about 3x as slow as the samtools one, but it should be easy to improve the speed (at least doubling is possible according to profiling results). Also there's now a command line tool called Sambamba, which is used for creating JSON output. But it also outputs SAM and accepts both SAM and BAM formats as an input. Options are mostly the same as for the samtools view command, including fetching regions with the same syntax, and some filtering (e.g. on quality).
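As an illustration of the point above about using the JSON output from other languages, here is a minimal Python sketch. It assumes the tool writes one JSON object per line, and the field names `read_name` and `mapping_quality` are hypothetical; neither assumption is confirmed by the report, so check the actual output before relying on it.

```python
import json


def read_json_records(lines):
    """Yield one dict per non-empty line, assuming one JSON object per line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample input; the real output schema may differ.
sample = [
    '{"read_name": "r1", "mapping_quality": 60}',
    '',
    '{"read_name": "r2", "mapping_quality": 12}',
]

# Keep the names of reads whose (assumed) mapping quality is at least 50.
good = [r["read_name"] for r in read_json_records(sample)
        if r["mapping_quality"] >= 50]
```

In practice the `lines` argument could be any iterable of text lines, for example the standard output of a subprocess, so records are decoded one at a time rather than loaded all at once.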
From w.arindrarto at gmail.com Thu Jun 28 13:26:40 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 28 Jun 2012 15:26:40 +0200 Subject: [GSoC] GSoC Project Update -- 8 Message-ID: Hello everyone, Here is my update for the last week: http://bow.web.id/blog/2012/06/fasta-comes-to-searchio/ To summarize it here: * SearchIO now supports Fasta indexing and parsing. The code integrates some part of the FastaIO module in AlignIO, but with more new addition to enable parsing into SearchIO objects hierarchy. * Improved the text output of common SearchIO objects. The text outputs (using str() or print) are now easier to interpret. * (not mentioned in the blog post) Tests for Fasta parsing and indexing, along with more tests for the common objects. That's all I have for this week :). Next week, I will be adding more formats to support into the submodule. regards, Bow From arklenna at gmail.com Sat Jun 30 06:15:05 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Sat, 30 Jun 2012 02:15:05 -0400 Subject: [GSoC] GSoC python variant update 6 Message-ID: Post: http://arklenna.tumblr.com/post/26196310200/ I've made a new branch (variant2: https://github.com/lennax/biopython/tree/variant2 ) which has a very skeletal outline of a set of Python objects designed to store variants. One might note many similarities to the organization of PyVCF. One thing SQL did neatly was store per-allele data with the allele, rather than with the site, and I'm envisioning doing this in Python, as well. For a Python variant object, are there any organizational choices that would make it easier for future conversion of a variant to HGVS syntax? (this is primarily directed at Reece but I'm open to all suggestions) Another question that may reveal my complete ignorance of haplotypes and such: could a polyploid site ever be partially phased? e.g. a triploid genotype of 0/1|0? Looking forward to any and all questions, comments, concerns, etc. 
Lenna From chapmanb at 50mail.com Mon Jul 2 10:36:39 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 02 Jul 2012 06:36:39 -0400 Subject: [GSoC] GSoC python variant update 6 In-Reply-To: References: Message-ID: <874npqo3ew.fsf@fastmail.fm> Lenna; Thanks for the updates and thoughts. I like the direction you're moving after taking everything you've learned from the SQL experiments. My general suggestions would be: - Leverage PyVCF for all of the backend parsing. We want to remain compatible with this since merging/interfacing with the work James and everyone is doing is a primary goal. Keeping a similar code structure is a great way to facilitate this. - For HGVS the general idea is to not be too tied to the VCF format, so I wouldn't worry about strict compatibility but rather use it to inform choices where you feel that things are mirroring VCF structure rather than more general variant representation. > Another question that may reveal my complete ignorance of haplotypes > and such: could a polyploid site ever be partially phased? e.g. a > triploid genotype of 0/1|0? It's possible but this is kind of a fringe case right now so I wouldn't especially worry about it. Thanks again, Brad From lomereiter at gmail.com Tue Jul 3 16:40:40 2012 From: lomereiter at gmail.com (Artem Tarasov) Date: Tue, 3 Jul 2012 20:40:40 +0400 Subject: [GSoC] weekly report #7 Message-ID: Hi all, I wrote a blog post about the previous week: http://lomereiter.wordpress.com/2012/07/03/gsoc-weekly-report-7/ Highlights: First version of bioruby-sambamba gem is released on rubygems.org, but the installation process can be made much more convenient. Producing binaries for all common platforms and distributing them with platform-specific gems seems to be the best way to go. Also, I've done a lot of refactoring (however, a bit more is needed), and significantly improved speed of validation and SAM parsing. 
In July, I'm planning to implement indexing, sorting and merging BAM files, and also add filtering functionality to Ruby bindings. For the latter, I'm going to introduce a tiny query language so that command-line tools will be able to parse it, and bindings will have some filter classes with a method to generate a query string like. From w.arindrarto at gmail.com Wed Jul 4 13:03:01 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 4 Jul 2012 15:03:01 +0200 Subject: [GSoC] GSoC Project Update -- 9 Message-ID: Hello everyone, The past week I have been working to add PSL parsing support and I've just posted my update here: http://bow.web.id/blog/2012/07/initial-blat-support/ Currently, we have parsing, indexing, and writing support. But this could change (writing might not be supported) due to a possible change in the current object model. I've explained a bit about why this is the case in the post, but to summarize it here, it's because we haven't got a way to properly model segmented HSP sequences. Peter and I have discussed this a bit, but we haven't figured out an elegant way to solve it for now. Aside from working on PSL, I also added more tests and started refactoring the code as it's starting to get messy. That's all for my update on the past week. For this week, I'll try to look into other formats and try to come up with possible solutions to the segmented HSP problem. regards, Bow From marian.povolny at gmail.com Wed Jul 4 18:56:54 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Wed, 4 Jul 2012 20:56:54 +0200 Subject: [GSoC] GSoC weekly status report No.6 and v0.1.0 Message-ID: http://blog.mpthecoder.com/post/26505431193/gsoc-weekly-status-report-no-6-and-v0-1-0 This post is a little bit late, but I wanted it to be the announcement of the first release, the v0.1.0 of gff3-pltools!
I've created a minimal website for this project, which can be found here: http://mamarjan.github.com/gff3-pltools/ There are links to binary gems for 32- and 64-bit Linux, a source package for other platforms, binary packages with the D tools only, and a link to the API docs for the Ruby library. Please read the blog post for more information, and the README for even more information. Best regards, Marjan From reece at harts.net Thu Jul 5 19:40:02 2012 From: reece at harts.net (Reece Hart) Date: Thu, 5 Jul 2012 12:40:02 -0700 Subject: [GSoC] GSoC python variant update 6 In-Reply-To: References: Message-ID: On Fri, Jun 29, 2012 at 11:15 PM, Lenna Peterson wrote: > For a Python variant object, are there any organizational choices that > would make it easier for future conversion of a variant to HGVS > syntax? (this is primarily directed at Reece but I'm open to all > suggestions) > Oh, no, things directed at me! That's a broad question. I'll try to answer without being long-winded. The essential elements of a sequence variant are a reference to a sequence, the location, and specifics about the operation. The name, allelic depth, etc. are all distinct from these elements and I would store them separately in a format-specific record or as a subclass. I don't have much experience with FeatureLocations, but that might be appropriate. Depending on how far you plan to go with VCF, you'll have to deal with Locations for breakpoints. For the Occam's Razor version of a model for variation, I'd float this in the community: variation := And I'd test this against representing: - a single SNP in VCF - a compound het from VCF - a variant in RNA - a variant in CDS coords - a variant in a protein sequence - a trinucleotide repeat (Which the simple model above fails, BTW.)
What makes the uber variant problem hard, I think, is several competing design axes: 1) sequence type (DNA, RNA, protein), 2) coordinate systems (really, CDS in a transcript record), 3) diversity of variant types (SNV, indel, repeat, etc), 4) diversity of auxiliary data (e.g., genotype info from VCF). HGVS makes us think outside merely VCF data: in particular, it adds the nuance of coordinate systems and multiple sequence types. I suspect you should be considering mixins and/or subclassing for some of these needs. I don't know how to solve any of this complexity. What I do know is that 1) it's too much just for your project, 2) it would be nice to have a design that can be easily extended beyond your project, and 3) therefore, part of your project should be to pave the way for extensions without tackling them. It's also a good time to put stakes in the ground around internal conventions, such as variants are always represented using interbase coordinates (= 0-based, right-open). And, if you end up handling just VCF variants, that's cool too. -Reece From arklenna at gmail.com Mon Jul 9 04:33:57 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 9 Jul 2012 00:33:57 -0400 Subject: [GSoC] GSoC python variant update 7 Message-ID: Post: http://arklenna.tumblr.com/post/26812132902/ Synopsis: This week, I wrote a script for PyVCF that can filter a file by sample as it's being parsed. It's currently named `vcf_sample_filter.py`. It's designed to be functional from the command line, the Python interpreter, or as a module. Next up: come up with a generic-via-extensibility representation of a variant. I'm working through some examples and should have a basic outline soon. 
Lenna From cswh at umich.edu Tue Jul 10 03:21:40 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Mon, 9 Jul 2012 23:21:40 -0400 Subject: [GSoC] bio-maf 0.2.0 (and Kyoto Cabinet gem for JRuby) Message-ID: Hi all, I've released version 0.2.0 of bio-maf for BioRuby: http://csw.github.com/bioruby-maf/blog/2012/07/09/bio-maf_0.2.0/ Notably, this release includes removal of gaps remaining after filtering out sequences, and 'tiling' multiple alignment blocks together along with reference sequence data. Also, last week I released my Kyoto Cabinet support for JRuby as a separate gem. It's now approaching parity with the standard Ruby library for Kyoto Cabinet. http://csw.github.com/bioruby-maf/blog/2012/07/02/kyoto_cabinet_support_for_jruby/ Clayton Wheeler cswh at umich.edu From lomereiter at gmail.com Tue Jul 10 10:26:35 2012 From: lomereiter at gmail.com (Artem Tarasov) Date: Tue, 10 Jul 2012 14:26:35 +0400 Subject: [GSoC] weekly report #8 Message-ID: Hello all, here's the link to the report: http://lomereiter.wordpress.com/2012/07/10/gsoc-weekly-report-8/ last week I implemented producing BAI files, and my tool sambamba-index exploits parallelism and thus is faster than samtools on multicore. Now I'm working on sorting, basic version already works but memory consumption should be improved. In fact, at least for HDDs, time of indexing and sorting is bounded by I/O speed, not the number of CPUs. So for sorting I need to tweak sizes of read/write buffers in order to get maximum performance. By the end of this week, I'm also going to make an utility for merging several sorted BAM files into one. 
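The planned merging step can be pictured as a k-way merge of already-sorted streams. The sketch below is in Python rather than D and is not sambamba's implementation; plain lists of hypothetical (reference index, position, read name) tuples stand in for BAM readers, and the sort key is an assumption.

```python
import heapq


def merge_sorted(streams, key):
    """Lazily merge already-sorted iterables into one sorted iterator."""
    return heapq.merge(*streams, key=key)


# Hypothetical records: (reference index, 0-based position, read name).
# Each input list is already sorted by (reference index, position).
a = [(0, 100, "r1"), (0, 250, "r4"), (1, 10, "r5")]
b = [(0, 120, "r2"), (1, 5, "r3")]

merged = list(merge_sorted([a, b], key=lambda rec: (rec[0], rec[1])))
```

Because `heapq.merge` only ever holds one pending record per input, memory use grows with the number of inputs rather than with their total size, which is the property that matters when the inputs are large sorted files.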
From marian.povolny at gmail.com Tue Jul 10 22:01:07 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Wed, 11 Jul 2012 00:01:07 +0200 Subject: [GSoC] GSoC weekly status report No.7 Message-ID: http://blog.mpthecoder.com/post/26930939671/gsoc-weekly-status-report-no-7 I was hoping to get more done over the weekend, but the internet connection was down, so I had to take the weekend off :) Otherwise I'm working toward the 0.2 version. The deadline is set for Saturday evening. What will be in it keeps changing, but for now there are new toString() and recursiveToString() methods in the Feature class, and append_to(...) methods which accept an Appender object, for more efficient output. The utility for correctly counting features is now notably faster, and gff3-ffetch has a new option for passing FASTA data to output. Currently in planning are: support for new types of records (pragmas and comments), GDC support and a Ruby interface for the validation utility. More could be added to this list, but I also have to make a plan for the second half of the summer, and that will take some time too. I was hoping to use the GDC which comes with Ubuntu 12.04, but I gave up on that because of some confusing errors I was receiving in the D stdlib. I will try to build the GDC directly from its GitHub repository and get my library to compile with it. Making man pages for binaries in gems is also a problem which currently has no elegant solution. I don't want to force my users to type "gem man command", so I'm planning to split the current repository into two: gff3-pltools in D and then the second repository for the Ruby library. The gff3-pltools would then receive a more traditional installation procedure and proper man pages.
-- Marjan From cswh at umich.edu Tue Jul 10 23:45:33 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Tue, 10 Jul 2012 19:45:33 -0400 Subject: [GSoC] Questions on next steps for MAF parsing for bio-maf Message-ID: <9343DE6C-EC59-480E-B746-396F08F36395@umich.edu> Hi all, In the course of working out my plan for the rest of my bio-maf project, I have come up with a few questions I'm not able to answer: https://github.com/csw/bioruby-maf/wiki/Questions * Is it useful to build indexes on other sequences besides the reference sequence? * Should the score field of an alignment block be zeroed or removed whenever the block is modified? * How, precisely, should selection based on features in GTF/GFF3 files work? * When converting a MAF Block/Sequence to bio-alignment representation, how should we handle quality metadata (from 'q' lines), which is tied to the actual sequence data and would need to be maintained in parallel if a column were deleted? * Is supporting the bx-python index format still desirable? Performance with Kyoto Cabinet indexes seems competitive, and the indexes are neither very large nor very expensive to build. * Blankenberg et al. mention this filtering mode: "removing blocks which have aligned species occurring between non-syntenic chromosomes or strands" which is unfortunately a bit cryptic. * Are coverage statistics useful or appropriate to provide? Any insight that you might be able to offer would be helpful. Thanks, Clayton Wheeler cswh at umich.edu From pjotr2012 at thebird.nl Wed Jul 11 09:25:17 2012 From: pjotr2012 at thebird.nl (Pjotr Prins) Date: Wed, 11 Jul 2012 11:25:17 +0200 Subject: [GSoC] Questions on next steps for MAF parsing for bio-maf In-Reply-To: <9343DE6C-EC59-480E-B746-396F08F36395@umich.edu> References: <9343DE6C-EC59-480E-B746-396F08F36395@umich.edu> Message-ID: <20120711092517.GA2827@thebird.nl> Hi Clayton and mentors, I think it would be extremely useful to get someone in who uses MAF in a pipeline. 
I know Raoul does, but we need more users. Anyone you know using MAF daily? Otherwise we should post on the Bio* lists. Same for GFF3 and Marjan. Anyone you know out there? Pj. On Tue, Jul 10, 2012 at 07:45:33PM -0400, Clayton Wheeler wrote: > Hi all, > > In the course of working out my plan for the rest of my bio-maf project, I have come up with a few questions I'm not able to answer: > > https://github.com/csw/bioruby-maf/wiki/Questions > > * Is it useful to build indexes on other sequences besides the reference sequence? > > * Should the score field of an alignment block be zeroed or removed whenever the block is modified? > > * How, precisely, should selection based on features in GTF/GFF3 files work? > > * When converting a MAF Block/Sequence to bio-alignment representation, how should we handle quality metadata (from 'q' lines), which is tied to the actual sequence data and would need to be maintained in parallel if a column were deleted? > > * Is supporting the bx-python index format still desirable? Performance with Kyoto Cabinet indexes seems competitive, and the indexes are neither very large nor very expensive to build. > > * Blankenberg et al. mention this filtering mode: "removing blocks which have aligned species occurring between non-syntenic chromosomes or strands" which is unfortunately a bit cryptic. > > * Are coverage statistics useful or appropriate to provide? > > Any insight that you might be able to offer would be helpful. 
> > Thanks, > > Clayton Wheeler > cswh at umich.edu > > > > > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From marian.povolny at gmail.com Mon Jul 16 17:16:12 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Mon, 16 Jul 2012 19:16:12 +0200 Subject: [GSoC] GSoC weekly status report No.8 Message-ID: http://blog.mpthecoder.com/post/27339349340/gsoc-weekly-status-report-no-8 Summary: The 0.2 version of gff3-pltools has been released, together with a Ruby gem bio-gff3-pltools. Binary and source packages can be downloaded from the following location: http://mamarjan.github.com/gff3-pltools/ On Wednesday I'll be traveling to Lodi for the EU-codefest, there I'll be presenting about the project and current GFF3 parser and tools performance. For the next release I would like to add parallelism to the parser. I'm also thinking about adding a new option to gff3-ffetch, which would let the user specify which fields and attributes to output in tab-separated columns. Best regards, Marjan From cjfields at illinois.edu Mon Jul 16 17:20:06 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 16 Jul 2012 17:20:06 +0000 Subject: [GSoC] GSoC weekly status report No.8 In-Reply-To: References: Message-ID: <118F034CF4C3EF48A96F86CE585B94BF2B63D4B5@CHIMBX5.ad.uillinois.edu> I'll try to be on IRC (#bioruby and #obf-soc) those days, I may have a few questions. chris On Jul 16, 2012, at 12:16 PM, Marjan Povolni wrote: > http://blog.mpthecoder.com/post/27339349340/gsoc-weekly-status-report-no-8 > > Summary: > > The 0.2 version of gff3-pltools has been released, together with a Ruby gem > bio-gff3-pltools.
Binary and source packages can be downloaded from the > following location: > > http://mamarjan.github.com/gff3-pltools/ > > On Wednesday I'll be traveling to Lodi for the EU-codefest, there I'll be > presenting about the project and current GFF3 parser and tools performance. > > For the next release I would like to add parallelism to the parser. I'm > also thinking about adding a new option to gff3-ffetch, which would let the > user specify which fields and attributes to output in tab-separated columns. > > Best regards, > Marjan > > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From pjotr.public14 at thebird.nl Mon Jul 16 17:29:06 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 16 Jul 2012 19:29:06 +0200 Subject: [GSoC] GSoC weekly status report No.8 In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF2B63D4B5@CHIMBX5.ad.uillinois.edu> References: <118F034CF4C3EF48A96F86CE585B94BF2B63D4B5@CHIMBX5.ad.uillinois.edu> Message-ID: <20120716172906.GA20140@thebird.nl> On Mon, Jul 16, 2012 at 05:20:06PM +0000, Fields, Christopher J wrote: > I'll try to be on IRC (#bioruby and #obf-soc) those days, I may have a few questions. Cool :) We will also join gbrowse IRC. From lomereiter at gmail.com Tue Jul 17 06:47:49 2012 From: lomereiter at gmail.com (Artem Tarasov) Date: Tue, 17 Jul 2012 10:47:49 +0400 Subject: [GSoC] weekly report #9 Message-ID: Hello everybody, My progress report for the past week is available at http://lomereiter.wordpress.com/2012/07/17/gsoc-weekly-report-9/ I've implemented sorting and merging, both parallelized and quite fast. Also my merging tool improves on ideas taken from Picard source code and merges SAM headers as well as sorted alignment records.
For those who use Debian, packages for amd64 and i386 are now available: https://github.com/lomereiter/sambamba/downloads At the moment, alternatives to the following samtools commands are developed: view, index, sort, merge, flagstat. The current limitation is that most tools don't work with stdin/stdout and work with BAM files only (does anybody still use SAM?). Nevertheless, they wisely use multi-core processors and usually give a better speed. From pjotr.public14 at thebird.nl Tue Jul 17 07:59:38 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 17 Jul 2012 09:59:38 +0200 Subject: [GSoC] [BioRuby] weekly report #9 In-Reply-To: References: Message-ID: <20120717075938.GA30198@thebird.nl> Are you going to support STDIN/STDOUT? Another killer feature! On Tue, Jul 17, 2012 at 10:47:49AM +0400, Artem Tarasov wrote: > Hello everybody, > > My progress report for the past week is available at > http://lomereiter.wordpress.com/2012/07/17/gsoc-weekly-report-9/ > > I've implemented sorting and merging, both parallelized and quite fast. > Also my merging tool improves on ideas taken from Picard source code and > merges SAM headers as well as sorted alignment records. > > For those who use Debian, packages for amd64 and i386 are now available: > > https://github.com/lomereiter/sambamba/downloads > > At the moment, alternatives to the following samtools commands are > developed: view, index, sort, merge, flagstat. The current limitation is > that most tools don't work with stdin/stdout and work with BAM files only > (does anybody still use SAM?). Nevertheless, they wisely use multi-core > processors and usually give a better speed. 
> _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From lomereiter at gmail.com Tue Jul 17 08:38:22 2012 From: lomereiter at gmail.com (Artem Tarasov) Date: Tue, 17 Jul 2012 12:38:22 +0400 Subject: [GSoC] [BioRuby] weekly report #9 In-Reply-To: <20120717075938.GA30198@thebird.nl> References: <20120717075938.GA30198@thebird.nl> Message-ID: Firstly, I wouldn't call that a killer feature. On Un*x you should be able to use /dev/stdin and /dev/stdout (or a named pipe) as input/output filenames, that's the way people pipe Picard tools. Many Un*x tools (including samtools) facilitate that by using dash as a shortcut for stdin/stdout, but this is not a requirement. Clearly, STDIN can't be used for random access, and some parts of my code currently rely on assumption that input stream is seekable. I should make that optional, and then named pipes can be used as input. On Tue, Jul 17, 2012 at 11:59 AM, Pjotr Prins wrote: > Are you going to support STDIN/STDOUT? Another killer feature! > > On Tue, Jul 17, 2012 at 10:47:49AM +0400, Artem Tarasov wrote: > > Hello everybody, > > > > My progress report for the past week is available at > > http://lomereiter.wordpress.com/2012/07/17/gsoc-weekly-report-9/ > > > > I've implemented sorting and merging, both parallelized and quite fast. > > Also my merging tool improves on ideas taken from Picard source code and > > merges SAM headers as well as sorted alignment records. > > > > For those who use Debian, packages for amd64 and i386 are now available: > > > > https://github.com/lomereiter/sambamba/downloads > > > > At the moment, alternatives to the following samtools commands are > > developed: view, index, sort, merge, flagstat. The current limitation is > > that most tools don't work with stdin/stdout and work with BAM files only > > (does anybody still use SAM?). 
Nevertheless, they wisely use multi-core > > processors and usually give a better speed. > > _______________________________________________ > > BioRuby Project - http://www.bioruby.org/ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > > From arklenna at gmail.com Tue Jul 17 17:48:33 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 17 Jul 2012 13:48:33 -0400 Subject: [GSoC] GSoC python variant update 7 Message-ID: Hi all, New blog post: http://arklenna.tumblr.com/post/27418058203/ Last week, Reece suggested trying to represent a variety of variants with just five identifiers: accession, start, stop, pre_seq, and post_seq. I've started a very minimal Variant object (in https://github.com/lennax/biopython/blob/variant2/Bio/Variant/variant.py), using `FeatureLocation` for its location. This uses zero-based, right-open coordinates, similar to array counting in Python. In contrast, HGVS and VCF both count from 1. I've created a list of variant types each represented in HGVS, VCF (if possible), and my new Python representation. It can be found on the blog post. Please let me know if there are any errors in my interpretation of these variant types. Thanks, Lenna From cswh at umich.edu Wed Jul 18 19:44:58 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Wed, 18 Jul 2012 15:44:58 -0400 Subject: [GSoC] bio-maf release 0.3.0 Message-ID: Hi all, I've released bio-maf version 0.3.0: http://csw.github.com/bioruby-maf/blog/2012/07/18/bio-maf_0.3.0/ This version adds features including joining adjacent MAF blocks when sequences that caused them to be split have been filtered out; returning bio-alignment objects; and truncating (or "slicing") alignment blocks to only cover a given genomic interval.
For developers, this also adds a higher-level Bio::MAF::Access API for working with directories containing indexed MAF files (or, alternatively, single files), providing all relevant functionality for indexed access in a simpler way than using the KyotoIndex and Parser classes directly. The maf_tile(1) utility has been updated to use this functionality; a directory of indexed MAF files can now be specified, and the correct file will now be parsed as appropriate. Usage of Enumerators and blocks has also been substantially improved; all access methods for multiple blocks such as Access#find, Access#slice, and Parser#each_block now accept a block parameter, which will be called for each block in turn. If no block parameter is given, they will all return an Enumerator for the resulting blocks. This is how most of the Ruby standard library, e.g. Array#each, works. -- Clayton Wheeler cswh at umich.edu From w.arindrarto at gmail.com Wed Jul 18 19:49:37 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 18 Jul 2012 21:49:37 +0200 Subject: [GSoC] GSoC Project Update -- 10 Message-ID: Hi everyone, I've just posted two new updates for my GSoC project, here: http://bow.web.id/blog/2012/07/parsing-blast-plain-text-files-in-searchio/ and here: http://bow.web.id/blog/2012/07/exonerate-in-searchio/ The first one is about a somewhat unofficial new format to be supported by SearchIO: the BLAST plain text output. I know that the current Biopython text parser is obsolete, but I figure it still could be useful for some to have a similar model in SearchIO. It is unofficial since it's basically a wrapper around the current parser, and after discussing things with Peter, it doesn't seem wise to say that we officially support parsing the format. Especially when NCBI itself does not guarantee a stable style between BLAST releases.
I should note that I've also made a small change to the current
NCBIStandalone code as there were some problems when I tried to parse
BLAST 2.2.26+ text output with multiple queries.

The second one is about the program I've been spending most of my time
on: Exonerate. We now have three Exonerate formats that SearchIO can
parse and index: `exonerate-text`, for human-readable alignments,
`exonerate-vulgar`, for vulgar lines, and `exonerate-cigar`, for cigar
lines. It's one of the more interesting formats I've been working on so
far :), since it has so much information in it. I've tried to capture
them as sensibly as possible, and I made a small demonstration using it
in my post.

In addition to writing these two formats, I've also written their
tests. Now, having finished almost all of the parsers, I'm planning to
devote more time to start writing the documentation during the coming
weeks.

regards,
Bow

From lomereiter at gmail.com Tue Jul 24 14:46:09 2012
From: lomereiter at gmail.com (Artem Tarasov)
Date: Tue, 24 Jul 2012 18:46:09 +0400
Subject: [GSoC] weekly report #10
Message-ID:

Hi all,

During the past week I've added filtering functionality to the
sambamba-view utility. Now the tool parses expressions like
"mapping_quality >= 50 and [MQ] >=50 and not ([RG] =~ /abcd/i or [RG] == null)",
superseding the functionality given by samtools flags -f, -F, -q, -l,
-r.

I'm also introducing wget-like text progress bars to my tools; so far
this is available in sambamba-index only.

More on that is at
http://lomereiter.wordpress.com/2012/07/24/gsoc-weekly-report-10/

From arklenna at gmail.com Fri Jul 27 21:23:50 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Fri, 27 Jul 2012 17:23:50 -0400
Subject: [GSoC] GSoC python variant update 8
Message-ID:

It appears that this email didn't make it to the list due to the
catastrophe yesterday. I apologize if anyone receives two copies!
Link: http://arklenna.tumblr.com/post/28082157403/ Post: I previously proposed the implementation of a method for PyVCF that would quickly scan the entire file and provide useful summary statistics. The idea is shamelessly copied from Brad's GFF parser (see https://github.com/chapmanb/bcbb/tree/master/gff ); for GFF, this method is helpful because the annotations on a sequence can vary widely. However, I no longer think this would be useful for VCF: 1. Most importantly, the VCF headers generally contain a complete listing of all of the types of information contained in the file. It's technically optional, but I hope that the most commonly used variant callers produce accurate headers. However, if there is a prevalence of files with a mismatch between headers and actual INFO/FORMAT fields, please let me know. 2. Next, any listing of ranges of data such as POS or QUAL might as well be coupled with actual filtering. This would be different if a presentation of the distribution of quality scores would be necessary to set an appropriate threshold. It would also depend on the ratio of speed between the range scan and the filtering (i.e. whether a possible second filter would be unacceptably time consuming). 3. Finally, and perhaps most importantly, many files are so large that scanning an entire file would take too long. Setting a limit and displaying updated information in real time (i.e. writing to `sys.stdout` with '\r', https://gist.github.com/3161269 ) could overcome this issue. If any VCF users can think of a great reason to scan a VCF file before filtering it, please get in touch. ------- I added the method `as_SeqFeature()` to my basic variant class, but it's still incomplete. Some of this is in flux due to forthcoming changes to FeatureLocation. I'm currently working on expanding the coordinate mapper Reece posted to the dev list a couple years ago (see http://biopython.org/pipermail/biopython/2010-June/006598.html ). Expect an update on that very soon. 
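
The bounded scan with in-place progress described in point 3 of the post could be sketched roughly as follows. This is only an illustration under stated assumptions: `scan_vcf`, its parameters, and the returned dictionary are hypothetical names, and the function splits the tab-separated VCF columns itself rather than using PyVCF, whose API is not shown in the post.

```python
import io
import sys


def scan_vcf(handle, limit=None, report_every=1000, out=sys.stdout):
    """Summarize up to `limit` data lines of a VCF-like text handle.

    Hypothetical helper for illustration only; it is not part of PyVCF.
    """
    count = 0
    pos_by_chrom = {}   # CHROM -> (lowest POS, highest POS)
    qual_range = None   # (lowest QUAL, highest QUAL), ignoring "."
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header/meta lines and blank lines
        if limit is not None and count >= limit:
            break  # stop early instead of reading a huge file to the end
        fields = line.rstrip("\n").split("\t")
        chrom, pos, qual = fields[0], int(fields[1]), fields[5]
        low, high = pos_by_chrom.get(chrom, (pos, pos))
        pos_by_chrom[chrom] = (min(low, pos), max(high, pos))
        if qual != ".":
            value = float(qual)
            if qual_range is None:
                qual_range = (value, value)
            else:
                qual_range = (min(qual_range[0], value),
                              max(qual_range[1], value))
        count += 1
        if count % report_every == 0:
            # "\r" returns to the line start so the counter updates in place
            out.write("\rscanned %d records" % count)
            out.flush()
    if count >= report_every:
        out.write("\n")  # finish the progress line once at the end
    return {"records": count, "pos": pos_by_chrom, "qual": qual_range}
```

Calling `scan_vcf(open("calls.vcf"), limit=100000)` (the file name is a placeholder) would then give a quick view of the QUAL range before choosing a filter threshold, which is the one case the post concedes a pre-scan might be useful for.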
Best, Lenna From chris.mit7 at gmail.com Fri Jul 27 23:17:13 2012 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Fri, 27 Jul 2012 19:17:13 -0400 Subject: [GSoC] [Biopython-dev] GSoC python variant update 8 In-Reply-To: References: Message-ID: Sorry for my brevity, but one great reason to scan a VCF file is to know where your variants are for downstream analysis. For instance, when analyzing RNA-Seq data for features such as Allele Specific Expression, having quick access to where variants are located is essential. On Thu, Jul 26, 2012 at 6:30 PM, Lenna Peterson wrote: > Link: http://arklenna.tumblr.com/post/28082157403/ > > Post: > > I previously proposed the implementation of a method for PyVCF that > would quickly scan the entire file and provide useful summary > statistics. The idea is shamelessly copied from Brad's GFF parser (see > https://github.com/chapmanb/bcbb/tree/master/gff ); for GFF, this > method is helpful because the annotations on a sequence can vary > widely. However, I no longer think this would be useful for VCF: > > 1. Most importantly, the VCF headers generally contain a complete > listing of all of the types of information contained in the file. It's > technically optional, but I hope that the most commonly used variant > callers produce accurate headers. However, if there is a prevalence of > files with a mismatch between headers and actual INFO/FORMAT fields, > please let me know. > > 2. Next, any listing of ranges of data such as POS or QUAL might as > well be coupled with actual filtering. This would be different if a > presentation of the distribution of quality scores would be necessary > to set an appropriate threshold. It would also depend on the ratio of > speed between the range scan and the filtering (i.e. whether a > possible second filter would be unacceptably time consuming). > > 3. Finally, and perhaps most importantly, many files are so large that > scanning an entire file would take too long. 
Setting a limit and > displaying updated information in real time (i.e. writing to > `sys.stdout` with '\r', https://gist.github.com/3161269 ) could > overcome this issue. > > If any VCF users can think of a great reason to scan a VCF file before > filtering it, please get in touch. > > ------- > > I added the method `as_SeqFeature()` to my basic variant class, but > it's still incomplete. Some of this is in flux due to forthcoming > changes to FeatureLocation. > > I'm currently working on expanding the coordinate mapper Reece posted > to the dev list a couple years ago (see > http://biopython.org/pipermail/biopython/2010-June/006598.html ). > Expect an update on that very soon. > > Best, > > Lenna > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From arklenna at gmail.com Thu Jul 26 22:30:35 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Thu, 26 Jul 2012 18:30:35 -0400 Subject: [GSoC] GSoC python variant update 8 Message-ID: Link: http://arklenna.tumblr.com/post/28082157403/ Post: I previously proposed the implementation of a method for PyVCF that would quickly scan the entire file and provide useful summary statistics. The idea is shamelessly copied from Brad's GFF parser (see https://github.com/chapmanb/bcbb/tree/master/gff ); for GFF, this method is helpful because the annotations on a sequence can vary widely. However, I no longer think this would be useful for VCF: 1. Most importantly, the VCF headers generally contain a complete listing of all of the types of information contained in the file. It's technically optional, but I hope that the most commonly used variant callers produce accurate headers. However, if there is a prevalence of files with a mismatch between headers and actual INFO/FORMAT fields, please let me know. 2. 
Next, any listing of ranges of data such as POS or QUAL might as well be coupled with actual filtering. This would be different if a presentation of the distribution of quality scores would be necessary to set an appropriate threshold. It would also depend on the ratio of speed between the range scan and the filtering (i.e. whether a possible second filter would be unacceptably time consuming). 3. Finally, and perhaps most importantly, many files are so large that scanning an entire file would take too long. Setting a limit and displaying updated information in real time (i.e. writing to `sys.stdout` with '\r', https://gist.github.com/3161269 ) could overcome this issue. If any VCF users can think of a great reason to scan a VCF file before filtering it, please get in touch. ------- I added the method `as_SeqFeature()` to my basic variant class, but it's still incomplete. Some of this is in flux due to forthcoming changes to FeatureLocation. I'm currently working on expanding the coordinate mapper Reece posted to the dev list a couple years ago (see http://biopython.org/pipermail/biopython/2010-June/006598.html ). Expect an update on that very soon. Best, Lenna From marian.povolny at gmail.com Wed Aug 1 12:46:44 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Wed, 1 Aug 2012 14:46:44 +0200 Subject: [GSoC] GSoC weekly status report No.9 Message-ID: http://blog.mpthecoder.com/post/28481171942/gsoc-weekly-status-report-no-9 The trip to Lodi was very fruitful. It was great to meet both my mentor and other community members. 
Based on the input received at the codefest, I created a new plan for the second part of the summer: https://github.com/mamarjan/gff3-pltools/wiki/Part2 Since then I have done the following: - improved validation speed, - added GTF support for input and output, - table output with an option to select which fields and attributes should be in the table, - tools for conversion to GTF and JSON, - JSON output support, which needs some more polish. -- Marjan From cswh at umich.edu Sun Aug 5 23:24:58 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Sun, 5 Aug 2012 19:24:58 -0400 Subject: [GSoC] bio-maf update: BGZF and testing Message-ID: Hi all, I've posted an update on my most recent work on bio-maf. Highlights include BGZF compression support, a new maf_extract command-line tool for random access, and my discoveries from testing on the full UCSC multiz46way dataset. http://csw.github.com/bioruby-maf/blog/2012/08/05/bgzf_and_testing/ -- Clayton Wheeler cswh at umich.edu From arklenna at gmail.com Tue Aug 7 05:11:04 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 7 Aug 2012 01:11:04 -0400 Subject: [GSoC] GSoC python variant update Message-ID: Full post: http://arklenna.tumblr.com/post/28890255191/ Summary: * I'm working on the coordinate mapper Reece contributed: http://biopython.org/pipermail/biopython/2010-June/006598.html * I'm representing intron locations relative to CDS coords using the HGVS standards: http://www.hgvs.org/mutnomen/refseq_figure.html I'd like to know if there are other common ways of representing such positions. * In order to customize the display of positions (e.g. 0-based or 1-based), I'm using a class as a configuration container. I've read on StackOverflow that attempts to use globals or a singleton class are discouraged in Python, but I have not found practical suggestions for how to implement module-wide configurations. Suggestions are welcome. * Any advice about circular genomes or strandedness is also welcome. 
* This mapper will work for SeqRecords, SeqFeatures, FeatureLocations, etc. Are there other Biopython objects that store sequence coordinates and thus should be mappable? Regards, Lenna From lomereiter at gmail.com Tue Aug 7 05:14:02 2012 From: lomereiter at gmail.com (Artem Tarasov) Date: Tue, 7 Aug 2012 05:14:02 +0000 Subject: [GSoC] weekly report #11 Message-ID: Hello all, here's my latest report about work on sambamba: http://lomereiter.wordpress.com/2012/08/06/gsoc-weekly-report-11/ * simple internal DSL for filtering was introduced in bindings (description: https://github.com/lomereiter/bioruby-sambamba/blob/master/features/filtering.feature ) * BAM output support was added to filtering tool (sambamba view) * man pages for all tools were created * a script was written for building Debian packages and uploading them to Github * Debian packages for v0.2.1 are added to Github downloads (for both i386 and amd64) Now the goal is to make those tools accessible at Galaxy Tool Shed, and also this week I plan to optimize memory usage -- buffer sizes are quite large now, and the task is to find out how much they can be reduced without significant impact on performance. From marian.povolny at gmail.com Tue Aug 7 13:36:04 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Tue, 7 Aug 2012 15:36:04 +0200 Subject: [GSoC] GSoC weekly status report No.10 Message-ID: http://blog.mpthecoder.com/post/28907136767/gsoc-weekly-status-report-no-10 The 0.3 release is available on the website: http://mamarjan.github.com/gff3-pltools/ In addition to what was described in the last weekly report, a GFF3 sorting tool has been added, grouping records which belong to the same feature, and Ruby bindings have been updated to support GTF and new options in gff3-ffetch. 
-- Marjan From w.arindrarto at gmail.com Tue Aug 7 17:56:26 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 7 Aug 2012 19:56:26 +0200 Subject: [GSoC] GSoC Project Update -- 11 Message-ID: Hello everyone, I have just posted my latest update on my project here: http://bow.web.id/blog/2012/08/back-on-the-main-branch/ It's been taking quite a while since I posted my last update since there has been a considerable change to the SearchIO object model I'm using. The details are in my blog post, but to keep it short, it was because the previous model (QueryResult, Hit, and HSP) was inadequate in handling files that have multiple sequences in their HSP (so far seen in files output by BLAT and Exonerate). In my previous updates, I've been using simple Python lists to store attributes related to these multiple sequences, but that turned out to be problematic as it may make the object have inconsistent attributes. After trying out several different implementations and discussing them with Peter, we've finally settled on a new model. The new model changes the HSP object into a container that stores a new object: HSPFragment. HSPFragment represents a single, contiguous alignment of the hit and query sequence. It only stores the sequence, coordinates, frames, and strands. Other attributes made by the search program (such as evalues or scores) are stored in the HSP object. This change required some modifications on all of the current parsers, but from a user's perspective working with file formats other than BLAT or Exonerate, the changes should be minimum. Aside from this, there's also a small update on the main API which lets it accept keyword arguments. The arguments modify behaviors of the parser, and they are different for each parser. Currently, this is only used by the BLAST tabular parser, but I imagine more parsers will use this in the future. 
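
To make the container relationship described above concrete, here is a minimal sketch. It assumes nothing about the real SearchIO code: `HSPFragmentSketch` and `HSPSketch` are hypothetical stand-ins, and the attribute names are chosen only to mirror the description (fragments hold sequences, coordinates, and strands; program-level statistics live on the enclosing HSP).

```python
class HSPFragmentSketch(object):
    """One contiguous query/hit alignment block (hypothetical class)."""

    def __init__(self, query, hit, query_start, query_end,
                 hit_start, hit_end, query_strand=1, hit_strand=1):
        if len(query) != len(hit):
            raise ValueError("aligned query and hit must have equal length")
        self.query = query
        self.hit = hit
        self.query_start, self.query_end = query_start, query_end
        self.hit_start, self.hit_end = hit_start, hit_end
        self.query_strand, self.hit_strand = query_strand, hit_strand


class HSPSketch(object):
    """Container of fragments; search statistics are stored here only."""

    def __init__(self, fragments, evalue=None, bitscore=None):
        self.fragments = list(fragments)
        if not self.fragments:
            raise ValueError("an HSP needs at least one fragment")
        self.evalue = evalue
        self.bitscore = bitscore

    def __len__(self):
        return len(self.fragments)

    def __iter__(self):
        return iter(self.fragments)

    def __getitem__(self, index):
        return self.fragments[index]

    # Overall coordinates are derived from the fragments rather than stored,
    # so they cannot drift out of sync with the fragment list.
    @property
    def query_start(self):
        return min(frag.query_start for frag in self.fragments)

    @property
    def query_end(self):
        return max(frag.query_end for frag in self.fragments)

    @property
    def hit_start(self):
        return min(frag.hit_start for frag in self.fragments)

    @property
    def hit_end(self):
        return max(frag.hit_end for frag in self.fragments)
```

Deriving the spans from the fragments is one way to avoid the inconsistent-attribute problem that the parallel Python lists caused; the actual implementation may well do this differently.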
Finally, having settled on a firmer object model, I'll be spending the rest of my time to focus on the documentation. There may still be small fixes to the code, but I expect nothing as major as this one. regards, Bow From chapmanb at 50mail.com Wed Aug 8 13:55:36 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 08 Aug 2012 09:55:36 -0400 Subject: [GSoC] [Biopython-dev] GSoC python variant update In-Reply-To: References: Message-ID: <874nodh4iv.fsf@fastmail.fm> Lenna; This all sounds great and will be a nice practical addition to Biopython. Thanks for taking it on. Some specific thoughts on your questions: > * I'm representing intron locations relative to CDS coords using the > HGVS standards: http://www.hgvs.org/mutnomen/refseq_figure.html > I'd like to know if there are other common ways of representing such > positions. I don't know of one myself, so it's great to be following a standard rather than reinventing something. Nice work. > * In order to customize the display of positions (e.g. 0-based or > 1-based), I'm using a class as a configuration container. I've read on > StackOverflow that attempts to use globals or a singleton class are > discouraged in Python, but I have not found practical suggestions for > how to implement module-wide configurations. Suggestions are welcome. With configuration items like this, you have two choices: - A global variable. - Pass the configuration to every function that needs it. There are tradeoffs with both approaches, but for this case I agree with your decision to use globals. Most people will want 0-based/Biopython style but it gives those who don't a knob to switch over. > * Any advice about circular genomes or strandedness is also welcome. Circular handling is an unresolved issue in Biopython: https://redmine.open-bio.org/issues/2578 It's a bit tricky, especially with features that span the origin. I'd prioritize handling strandedness since you're going to have plenty of reverse strand coding sequences. 
You're mapping not only within the coding region but also back to the original sequence on the reverse strand. So in your g2c mapping, the original gene goes from e1 -> s1 -> e0 -> s0 as you read 5' to 3' across the sequence. The best place to get started is to pick a reverse strand gene and then work through the mappings, thinking through the orientations. I find drawing it out to be the easiest way. > * This mapper will work for SeqRecords, SeqFeatures, FeatureLocations, > etc. Are there other Biopython objects that store sequence coordinates > and thus should be mappable? That sounds like a great start. Thanks again for this, Brad From p.j.a.cock at googlemail.com Wed Aug 8 14:33:05 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 8 Aug 2012 15:33:05 +0100 Subject: [GSoC] [Biopython-dev] GSoC python variant update In-Reply-To: <874nodh4iv.fsf@fastmail.fm> References: <874nodh4iv.fsf@fastmail.fm> Message-ID: On Wed, Aug 8, 2012 at 2:55 PM, Brad Chapman wrote: >Lenna wrote: >> * Any advice about circular genomes or strandedness is also welcome. > > Circular handling is an unresolved issue in Biopython: > > https://redmine.open-bio.org/issues/2578 > > It's a bit tricky, especially with features that span the origin. > > I'd prioritize handling strandedness since you're going to have plenty > of reverse strand coding sequences. You're mapping not only within the > coding region but also back to the original sequence on the reverse > strand. So in your g2c mapping, the original gene goes from > e1 -> s1 -> e0 -> s0 as you read 5' to 3' across the sequence. The best > place to get started is to pick a reverse strand gene and then work > through the mappings, thinking through the orientations. I find drawing > it out to be the easiest way. And then think about mixed strand genes, e.g. transpliced tRNA is a good example - there is a GenBank example in our unit tests. 
Peter

From arklenna at gmail.com Mon Aug 13 05:00:41 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Mon, 13 Aug 2012 01:00:41 -0400
Subject: [GSoC] GSoC python variant update 10
Message-ID:

Link: http://arklenna.tumblr.com/post/29317968106/

Post:

Following extensive
[discussion](http://biopython.org/pipermail/biopython-dev/2012-August/009849.html)
on the dev list of the pros and cons of configuration classes/modules,
I have refactored my
[coordinate mapper](https://gist.github.com/3172753)
to keep configuration as isolated as possible.

All mapping functions use base 0 internally. Transformation to and from
1-based coords is allowed by custom MapPosition objects. (They are
currently separate from the Seq* positions but could probably subclass
ExactPosition.)

The MapPosition objects have to_dialect and from_dialect methods that
automatically handle conversion between bases and other formatting
details. There are two different ways a user can convert a coordinate
from HGVS:

    # ... assuming cm is an instance of CoordinateMapper

    # Manually construct position from HGVS
    CDS_coord = CDSPosition.from_hgvs("6+1")
    genomic_coord = cm.c2g(CDS_coord)
    print genomic_coord.to_hgvs()

    # Pass dialect argument to mapping function
    genomic_coord = cm.c2g("6+1", dialect="HGVS")
    print genomic_coord.to_hgvs()

Furthermore, the inheritance hierarchy is designed to allow a user to
set a default string representation:

    # Set MapPositions to print as HGVS by default
    def use_hgvs(self):
        return str(self.to_hgvs())
    MapPosition.__str__ = use_hgvs

The
[version](https://gist.github.com/3172753/577b7c383e057b78cdcee64be33f18117a46faaf)
as of this writing is passing tests using base 0. I have not yet
implemented tests for `from_hgvs` or `to_hgvs`, but that's next on my
list. I'm hoping to have time for strand and mixed strand, too.
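
The base-0 storage with HGVS conversion at the edges can be illustrated with a small sketch. This is not the code from the gist: `CDSPositionSketch` is a hypothetical class, and it assumes only the simple HGVS forms used above (a positive CDS position with an optional signed intron offset, such as "6+1"); 5' UTR and 3' UTR forms are deliberately left out.

```python
import re


class CDSPositionSketch(object):
    """0-based CDS index plus intron offset (hypothetical class)."""

    # e.g. "6", "6+1", "7-2": a 1-based CDS position and optional offset
    _HGVS = re.compile(r"^(\d+)([+-]\d+)?$")

    def __init__(self, index, offset=0):
        self.index = index    # 0-based index of the anchoring CDS base
        self.offset = offset  # signed distance into the adjacent intron

    @classmethod
    def from_hgvs(cls, text):
        match = cls._HGVS.match(text)
        if match is None or int(match.group(1)) < 1:
            raise ValueError("unsupported HGVS position: %r" % (text,))
        base, offset = match.groups()
        # HGVS counts from 1; shift to base 0 for internal use
        return cls(int(base) - 1, int(offset) if offset else 0)

    def to_hgvs(self):
        text = str(self.index + 1)  # shift back to 1-based for display
        if self.offset:
            text += "%+d" % self.offset  # always show the sign
        return text
```

Keeping the conversion in two small methods means every mapping function can work on `index` and `offset` alone, which matches the stated goal of using base 0 everywhere internally.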
Cheers,

Lenna

From arklenna at gmail.com Fri Aug 17 01:58:46 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Thu, 16 Aug 2012 21:58:46 -0400
Subject: [GSoC] GSoC Python variant (penultimate) update
Message-ID:

Post: http://arklenna.tumblr.com/post/29592108099/

I have been considering how to handle gene strandedness. As long as I'm
correctly interpreting the following position, my coordinate mapper
should produce the correct coordinates with negative strand or mixed
strand features.

    GenBank:   join(complement(25..30), 36..40)
    Biopython: FeatureLocation(24, 30, -1) + FeatureLocation(35, 40)

(please click through to post for monospaced font)

    23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
          <----------------                ------------->
           5  4  3  2  1  0                 6  7  8  9 10

I have to admit that it wasn't until I read a BioStar
[post](http://biostars.org/post/show/3423/forward-and-reverse-strand-conventions/)
earlier this week that I fully understood the relationship between
plus/minus forward/reverse sense/antisense coding/template strands. So
please let me know as soon as possible if I've made a mistake in the
above code.

`c2g` yields the correct genome position, but not the strand. I still
need to integrate strand information into my `GenomePosition` object
and/or partially merge it with `ExactLocation`.

This weekend I intend to expand documentation and write a brief
cookbook entry.

Cheers,

Lenna

From p.j.a.cock at googlemail.com Fri Aug 17 08:21:01 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 17 Aug 2012 09:21:01 +0100
Subject: [GSoC] GSoC Python variant (penultimate) update
In-Reply-To:
References:
Message-ID:

On Fri, Aug 17, 2012 at 2:58 AM, Lenna Peterson wrote:
>
> I have to admit that it wasn't until I read a BioStar
> [post](http://biostars.org/post/show/3423/forward-and-reverse-strand-conventions/)
> earlier this week that I fully understood the relationship between
> plus/minus forward/reverse sense/antisense coding/template strands.
So > please let me know as soon as possible if I've made a mistake in the > above code. Given this is nice and fresh in your mind, can you suggest any clarifications to the Biopython Tutorial section talking about this issue? The section on transcription & translation starting: "Before talking about transcription, I want to try and clarify the strand issue. Consider the following (made up) stretch of double stranded DNA which encodes a short peptide: ..." Hmm. That should probably say "I want to try to clarify...". Peter From arklenna at gmail.com Mon Aug 20 04:22:36 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 20 Aug 2012 00:22:36 -0400 Subject: [GSoC] GSoC python variant final update Message-ID: Post: http://arklenna.tumblr.com/post/29808300789/ The coordinate mapper, with updated documentation, is now located on this branch: https://github.com/lennax/biopython/tree/f_loc4 It awaits the merging of Peter's f_loc4 branch. I've written an entry on coordinate mapping for the Cookbook: http://biopython.org/wiki/Coordinate_mapping Additionally, at Peter's suggestion, I've written a clarification of strand as it relates to transcription and translation. It's available here: https://docs.google.com/document/d/11R7EOJXn90lN5_SmaPOyN5rFfPQybbCbUBo6EY0R0pA/edit It's been a great experience working with this project this summer. Thank you to everyone involved. Cheers, Lenna From lomereiter at gmail.com Mon Aug 20 09:47:18 2012 From: lomereiter at gmail.com (Artem Tarasov) Date: Mon, 20 Aug 2012 13:47:18 +0400 Subject: [GSoC] GSoC final report Message-ID: Hi all, here's a wrap-up of what was added/fixed in sambamba and Ruby bindings during the last couple of weeks: http://lomereiter.wordpress.com/2012/08/20/gsoc-weekly-report-12/ * Tools don't require input files to be seekable anymore, allowing to work with e.g. 
/dev/stdin and /dev/stdout * I've added an option of MessagePack output, that drastically improved speed of bindings (2-4x speedup depending on configuration) * The gem is on Travis CI now, passing tests on MRI 1.9.2/1.9.3. (JRuby also works, but on Travis there're problems with popen, same as with BioRuby) * The tool 'sambamba_filter' is now available on Galaxy Tool Shed. It's been a great summer. Thank you OBF/BioRuby folks :) -- Artem From pjotr.public14 at thebird.nl Mon Aug 20 09:55:59 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 20 Aug 2012 11:55:59 +0200 Subject: [GSoC] [BioRuby] GSoC final report In-Reply-To: References: Message-ID: <20120820095559.GB2453@thebird.nl> Thank you Artem, Great job. Pj. On Mon, Aug 20, 2012 at 01:47:18PM +0400, Artem Tarasov wrote: > Hi all, > > here's a wrap-up of what was added/fixed in sambamba and Ruby bindings > during the last couple of weeks: > http://lomereiter.wordpress.com/2012/08/20/gsoc-weekly-report-12/ > > * Tools don't require input files to be seekable anymore, allowing to work > with e.g. /dev/stdin and /dev/stdout > * I've added an option of MessagePack output, that drastically improved > speed of bindings (2-4x speedup depending on configuration) > * The gem is on Travis CI now, passing tests on MRI 1.9.2/1.9.3. (JRuby > also works, but on Travis there're problems with popen, same as with > BioRuby) > * The tool 'sambamba_filter' is now available on Galaxy Tool Shed. > > > It's been a great summer. 
Thank you OBF/BioRuby folks :) > > > -- > Artem > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From chapmanb at 50mail.com Mon Aug 20 12:45:49 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 20 Aug 2012 08:45:49 -0400 Subject: [GSoC] GSoC python variant final update In-Reply-To: References: Message-ID: <87harxzq82.fsf@fastmail.fm> Lenna; Thanks for the documentation and getting that all code moved into a branch. This looks great and looking forward to having it merged when Peter's work goes in. Thanks also for all the great work this summer and good luck on the first day of PhD school, Brad > Post: http://arklenna.tumblr.com/post/29808300789/ > > The coordinate mapper, with updated documentation, is now located on > this branch: https://github.com/lennax/biopython/tree/f_loc4 > It awaits the merging of Peter's f_loc4 branch. > > I've written an entry on coordinate mapping for the Cookbook: > http://biopython.org/wiki/Coordinate_mapping > > Additionally, at Peter's suggestion, I've written a clarification of > strand as it relates to transcription and translation. It's available > here: https://docs.google.com/document/d/11R7EOJXn90lN5_SmaPOyN5rFfPQybbCbUBo6EY0R0pA/edit > > It's been a great experience working with this project this summer. > Thank you to everyone involved. 
>
> Cheers,
>
> Lenna
> _______________________________________________
> GSoC mailing list
> GSoC at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/gsoc

From w.arindrarto at gmail.com Tue Aug 21 16:09:07 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 21 Aug 2012 18:09:07 +0200
Subject: [GSoC] GSoC Project Update -- 10
In-Reply-To:
References:
Message-ID:

Hi everyone,

I've just posted my last entry for my Google Summer of Code project
this year: http://bow.web.id/blog/2012/08/summers-over/

I want to say thank you to the Biopython community, especially Peter
for mentoring me this summer :), to OBF for accepting my proposal, and
to anyone who has helped and given me valuable input throughout the
project :). It's been a priceless learning experience, and I only hope
that my code will be useful in return.

There are still some things to do before the code is merge-ready and
even more when the code is included in an official release, so I'll
still be around.

cheers,
Bow

From marian.povolny at gmail.com Tue Aug 21 19:11:01 2012
From: marian.povolny at gmail.com (Marjan Povolni)
Date: Tue, 21 Aug 2012 21:11:01 +0200
Subject: [GSoC] Final GSoC report
Message-ID:

http://blog.mpthecoder.com/post/29910330225/final-gsoc-report

*Summary*

Yesterday I tagged the 0.4 release of gff3-pltools, and that marks the
end of the summer. At least in GSoC terms. Should I say end of the
project? I don't think so. The tools can still be improved, and the
Ruby bindings should follow.

The major changes since the last release include the following:

- filtering functionality has been moved to a separate utility:
  gff3-filter, along with a new language for specifying filtering
  expressions,
- conversion to table format of selected fields has been moved to a
  separate utility: gff3-select.
  However, the --select option is still part of gff3-filter,
- gff3-ffetch is now fetching FASTA sequences from GFF3 and FASTA files
  for CDS and mRNA records and features,
- man pages for utilities.

The original idea was to create a GFF3/GTF parser in D and Ruby
bindings. The Ruby bindings part didn't work out because there is still
no support for D shared libraries in Linux, but instead there are now a
few useful command-line tools for processing GFF3 which can be used
without programming knowledge.

To me, the summer was fun, challenging, and a great experience. I even
got to meet my mentor in person, and other community members too, and
to make my first steps in bioinformatics. I even gave a small
presentation at the EU-codefest. What a summer it was!

Thanks to everybody who made it possible: Google, Open Bioinformatics
Foundation and my mentor Pjotr Prins.

--
Marjan

From cswh at umich.edu Tue Aug 21 19:35:21 2012
From: cswh at umich.edu (Clayton Wheeler)
Date: Tue, 21 Aug 2012 12:35:21 -0700
Subject: [GSoC] Bio-MAF 1.0.1
Message-ID:

Hi all,

I've released bio-maf 1.0.1 and written a final GSoC blog post about it:

http://csw.github.com/bioruby-maf/blog/2012/08/21/bio-maf_1.0.1/

This release should be substantially more robust, with solid and
reasonably-performing BGZF support, better CLI tools, and various
robustness, compatibility, and memory-footprint improvements.

(I've also developed a Galaxy integration for the maf_tile tool;
unlike the existing Galaxy MAF tools, this is capable of filling in
gaps with a FASTA reference sequence, and concatenating the alignment
output from several exons specified in a BED file. It's not quite all
packaged up with the toolshed facility yet, but I should be able to
wrap that up shortly. Sneak preview: https://gist.github.com/3418576)

It's been a pleasure working with all of you, and I'm glad I've been
able to deliver something useful. Pjotr, Raoul, Francesco, thanks for
your help and advice this summer!
Marjan, Artem, you guys did excellent work and gave me some great suggestions in the code reviews. And, of course, thanks to Google for organizing and funding this! -- Clayton Wheeler cswh at umich.edu From pjotr.public14 at thebird.nl Tue Aug 21 21:47:53 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 21 Aug 2012 23:47:53 +0200 Subject: [GSoC] [BioRuby] Bio-MAF 1.0.1 In-Reply-To: References: Message-ID: <20120821214753.GB10348@thebird.nl> Thank you Marjan and Clayton. It was our pleasure. Pj. On Tue, Aug 21, 2012 at 12:35:21PM -0700, Clayton Wheeler wrote: > Hi all, > > I've released bio-maf 1.0.1 and written a final GSoC blog post about it: > > http://csw.github.com/bioruby-maf/blog/2012/08/21/bio-maf_1.0.1/ > > This release should be substantially more robust, with solid and > reasonably-performing BGZF support, better CLI tools, and various > robustness, compatibility, and memory-footprint improvements. > > (I've also developed a Galaxy integration for the maf_tile tool; > unlike the existing Galaxy MAF tools, this is capable of filling in > gaps with a FASTA reference sequence, and concatenating the alignment > output from several exons specified in a BED file. It's not quite all > packaged up with the toolshed facility yet, but I should be able to > wrap that up shortly. Sneak preview: https://gist.github.com/3418576) > > It's been a pleasure working with all of you, and I'm glad I've been > able to deliver something useful. Pjotr, Raoul, Francesco, thanks for > your help and advice this summer! Marjan, Artem, you guys did > excellent work and gave me some great suggestions in the code reviews. > And, of course, thanks to Google for organizing and funding this! 
> > -- > Clayton Wheeler > cswh at umich.edu > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From mictadlo at gmail.com Wed Aug 22 00:55:30 2012 From: mictadlo at gmail.com (Mic) Date: Wed, 22 Aug 2012 10:55:30 +1000 Subject: [GSoC] [BioRuby] Final GSoC report In-Reply-To: References: Message-ID: Hi, Python is able to connect to D with the help of http://pyd.dsource.org/. Maybe it would be something for Biopython. Cheers, Mic On Wed, Aug 22, 2012 at 5:11 AM, Marjan Povolni wrote: > http://blog.mpthecoder.com/post/29910330225/final-gsoc-report > > *Summary* > > Yesterday I tagged the 0.4 release of gff3-pltools, and that marks the end > of the summer. At least in GSoC terms. Should I say end of the project? I > don't think so. The tools can still be improved, and the Ruby bindings > should follow. > > The major changes since the last release include the following: > > - filtering functionality has been moved to a separate utility: > gff3-filter, along with a new language for specifying filtering > expressions, > - conversion to table format of selected fields has been moved to a > separate utility: gff3-select. However, the --select option is still > part of > gff3-filter, > - gff3-ffetch is now fetching FASTA sequences from GFF3 and FASTA files > for CDS and mRNA records and features, > - man pages for utilities. > > The original idea was to create a GFF3/GTF parser in D and Ruby bindings. > The Ruby bindings part didn't work out because there is still no support > for D shared libraries in Linux, but instead there are now a few useful > command-line tools for processing GFF3 which can be used without > programming knowledge. > > To me, the summer was fun, challenging, and a great experience.
I even got > to meet my mentor in person, and other community members too, and to make > my first steps in bioinformatics. I even gave a small presentation at the > EU-codefest. What a summer it was! > > Thanks to everybody who made it possible: Google, Open Bioinformatics > Foundation and my mentor Pjotr Prins. > > -- > Marjan > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From cjfields at illinois.edu Wed Aug 22 04:16:13 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 22 Aug 2012 04:16:13 +0000 Subject: [GSoC] [BioRuby] GSoC final report In-Reply-To: <20120820095559.GB2453@thebird.nl> References: <20120820095559.GB2453@thebird.nl> Message-ID: <118F034CF4C3EF48A96F86CE585B94BF33B73325@CHIMBX5.ad.uillinois.edu> Wholeheartedly agree. Congrats Artem on a job well done! chris On Aug 20, 2012, at 4:55 AM, Pjotr Prins wrote: > Thank you Artem, > > Great job. > > Pj. > > On Mon, Aug 20, 2012 at 01:47:18PM +0400, Artem Tarasov wrote: >> Hi all, >> >> here's a wrap-up of what was added/fixed in sambamba and Ruby bindings >> during the last couple of weeks: >> http://lomereiter.wordpress.com/2012/08/20/gsoc-weekly-report-12/ >> >> * Tools don't require input files to be seekable anymore, allowing to work >> with e.g. /dev/stdin and /dev/stdout >> * I've added an option of MessagePack output, that drastically improved >> speed of bindings (2-4x speedup depending on configuration) >> * The gem is on Travis CI now, passing tests on MRI 1.9.2/1.9.3. (JRuby >> also works, but on Travis there're problems with popen, same as with >> BioRuby) >> * The tool 'sambamba_filter' is now available on Galaxy Tool Shed. >> >> >> It's been a great summer. 
Thank you OBF/BioRuby folks :) >> >> >> -- >> Artem >> _______________________________________________ >> BioRuby Project - http://www.bioruby.org/ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby >> > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From lomereiter at gmail.com Wed Aug 22 06:42:12 2012 From: lomereiter at gmail.com (Artem Tarasov) Date: Wed, 22 Aug 2012 10:42:12 +0400 Subject: [GSoC] [BioRuby] Final GSoC report In-Reply-To: References: Message-ID: Hi, Unfortunately, the problem is on the side of D. The PyD wiki (https://bitbucket.org/ariovistus/pyd/wiki/Home) says that "extension libraries are nominally working with LDC (FE 2.060 or later); however, druntime currently limits what can be done here". However, this issue has received a lot of attention in recent months, so maybe this'll get fixed soon; see e.g. this thread: http://forum.dlang.org/thread/mailman.1330.1345434177.31962.digitalmars-d@puremagic.com -- Artem On Wed, Aug 22, 2012 at 4:55 AM, Mic wrote: > Hi, > Python is able to connect to D with the help of http://pyd.dsource.org/. > > Maybe it would be something for Biopython. > > Cheers, > Mic > > On Wed, Aug 22, 2012 at 5:11 AM, Marjan Povolni >wrote: > > > http://blog.mpthecoder.com/post/29910330225/final-gsoc-report > > > > *Summary* > > > > Yesterday I tagged the 0.4 release of gff3-pltools, and that marks the > end > > of the summer. At least in GSoC terms. Should I say end of the project? I > > don't think so. The tools can still be improved, and the Ruby bindings > > should follow.
> > > > The major changes since the last release include the following: > > > > - filtering functionality has been moved to a separate utility: > > gff3-filter, along with a new language for specifying filtering > > expressions, > > - conversion to table format of selected fields has been moved to a > > separate utility: gff3-select. However, the --select option is still > > part of > > gff3-filter, > > - gff3-ffetch is now fetching FASTA sequences from GFF3 and FASTA > files > > for CDS and mRNA records and features, > > - man pages for utilities. > > > > The original idea was to create a GFF3/GTF parser in D and Ruby bindings. > > The Ruby bindings part didn't work out because there is still no support > > for D shared libraries in Linux, but instead there are now a few useful > > command-line tools for processing GFF3 which can be used without > > programming knowledge. > > > > To me, the summer was fun, challenging, and a great experience. I even > got > > to meet my mentor in person, and other community members too, and to make > > my first steps in bioinformatics. I even gave a small presentation at the > > EU-codefest. What a summer it was! > > > > Thanks to everybody who made it possible: Google, Open Bioinformatics > > Foundation and my mentor Pjotr Prins.
> > > > -- > > Marjan > > > > _______________________________________________ > > BioRuby Project - http://www.bioruby.org/ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > > > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc > From p.j.a.cock at googlemail.com Wed Aug 22 08:42:03 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 Aug 2012 09:42:03 +0100 Subject: [GSoC] GSoC Project Update -- 10 In-Reply-To: References: Message-ID: On Tue, Aug 21, 2012 at 5:09 PM, Wibowo Arindrarto wrote: > Hi everyone, > > I've just posted my last entry for my Google Summer of Code project > this year: http://bow.web.id/blog/2012/08/summers-over/ > > I want to say thank you to the Biopython community, especially Peter > for mentoring me this summer :), to OBF for accepting my proposal, and > to anyone who has helped and given me valuable inputs for me > throughout the project :). > > It's been a priceless learning experience, and I only hope that my > code will be useful in return. > > There are still some things to do before the code is merge-ready and > even more when the code is included in an official release, so I'll > still be around. > > cheers, > Bow Thank you Bow, It has been a pleasure to mentor you, and I'm excited about getting this (and Lenna's and other branches) into Biopython. Now, back to the module naming discussion... 
;) http://lists.open-bio.org/pipermail/biopython-dev/2012-August/009868.html http://lists.open-bio.org/pipermail/biopython-dev/2012-August/009888.html Peter From pjotr.public14 at thebird.nl Wed Aug 22 10:43:52 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Wed, 22 Aug 2012 12:43:52 +0200 Subject: [GSoC] [BioRuby] Final GSoC report In-Reply-To: References: Message-ID: <20120822104352.GA11847@thebird.nl> Yes, linking to D from an interpreted language is not hard; basically it is the same calling convention as that of C. So a D shared library looks the same as a C shared library to the calling code - all existing foreign function interfaces (FFI) work. That is the good news. The bad news, as Artem points out, is that there is a problem in the D garbage collector. Items get collected that should not be. This will be fixed sooner or later. The commitment is there, and it is moving up the priority list. For us it did not matter, as the parsers and tools happily run on their own using a command line interface, without much overhead. One advantage, from my perspective, is that we are not tied to Ruby at this point, and the tools can be hosted in Galaxy. Another advantage, perhaps, is that we have not been side-tracked into providing rich library interfaces. That appeals to my purist side. Writing FFI bindings later is not a problem. Pj. On Wed, Aug 22, 2012 at 10:42:12AM +0400, Artem Tarasov wrote: > Hi, > > Unfortunately, the problem is on the side of D. The PyD wiki ( > https://bitbucket.org/ariovistus/pyd/wiki/Home) says that "extension > libraries are nominally working with LDC (FE 2.060 or later); however, > druntime currently limits what can be done here". > > However, this issue has received a lot of attention in recent months, so maybe > this'll get fixed soon; see e.g. this thread: > http://forum.dlang.org/thread/mailman.1330.1345434177.31962.digitalmars-d@puremagic.com
> > -- > Artem > > On Wed, Aug 22, 2012 at 4:55 AM, Mic wrote: > > > Hi, > > Python is able to connect to D with the help of http://pyd.dsource.org/. > > > > Maybe it would be something for Biopython. > > > > Cheers, > > Mic > > > > On Wed, Aug 22, 2012 at 5:11 AM, Marjan Povolni > >wrote: > > > > > http://blog.mpthecoder.com/post/29910330225/final-gsoc-report > > > > > > *Summary* > > > > > > Yesterday I tagged the 0.4 release of gff3-pltools, and that marks the > > end > > > of the summer. At least in GSoC terms. Should I say end of the project? I > > > don't think so. The tools can still be improved, and the Ruby bindings > > > should follow. > > > > > > The major changes since the last release include the following: > > > > > > - filtering functionality has been moved to a separate utility: > > > gff3-filter, along with a new language for specifying filtering > > > expressions, > > > - conversion to table format of selected fields has been moved to a > > > separate utility: gff3-select. However, the --select option is still > > > part of > > > gff3-filter, > > > - gff3-ffetch is now fetching FASTA sequences from GFF3 and FASTA > > files > > > for CDS and mRNA records and features, > > > - man pages for utilities. > > > > > > The original idea was to create a GFF3/GTF parser in D and Ruby bindings. > > > The Ruby bindings part didn't work out because there is still no support > > > for D shared libraries in Linux, but instead there are now a few useful > > > command-line tools for processing GFF3 which can be used without > > > programming knowledge. > > > > > > To me, the summer was fun, challenging, and a great experience. I even > > got > > > to meet my mentor in person, and other community members too, and to make > > > my first steps in bioinformatics. I even gave a small presentation at the > > > EU-codefest. What a summer it was!
> > > > > > Thanks to everybody who made it possible: Google, Open Bioinformatics > > > Foundation and my mentor Pjotr Prins. > > > > > > -- > > > Marjan > > > > > > _______________________________________________ > > > BioRuby Project - http://www.bioruby.org/ > > > BioRuby mailing list > > > BioRuby at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/bioruby > > > > > > > _______________________________________________ > > GSoC mailing list > > GSoC at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/gsoc > > > > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From p.j.a.cock at googlemail.com Wed Aug 22 11:10:56 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 Aug 2012 12:10:56 +0100 Subject: [GSoC] [BioRuby] Final GSoC report In-Reply-To: <20120822104352.GA11847@thebird.nl> References: <20120822104352.GA11847@thebird.nl> Message-ID: On Wed, Aug 22, 2012 at 11:43 AM, Pjotr Prins wrote: > Yes, linking to D from an interpreted language is not hard; basically > it is the same calling convention as that of C. So a D shared library > looks the same as a C shared library to the calling code - all > existing foreign function interfaces (FFI) work. That is the good > news. How do things stand from a cross-platform perspective? That is, when might this be doable on Linux, Mac OS X, and Windows (and other Unix-like platforms of potential interest)? > The bad news, as Artem points out, is that there is a problem in the > D garbage collector. Items get collected that should not be. This will > be fixed sooner or later. The commitment is there, and it is moving > up the priority list. Is there a D issue/bug tracker for this?
Thanks, Peter From hlapp at drycafe.net Mon Aug 27 02:14:35 2012 From: hlapp at drycafe.net (Hilmar Lapp) Date: Sun, 26 Aug 2012 22:14:35 -0400 Subject: [GSoC] [BioRuby] GSoC final report In-Reply-To: <118F034CF4C3EF48A96F86CE585B94BF33B73325@CHIMBX5.ad.uillinois.edu> References: <20120820095559.GB2453@thebird.nl> <118F034CF4C3EF48A96F86CE585B94BF33B73325@CHIMBX5.ad.uillinois.edu> Message-ID: <56EAF480-2694-4825-999E-14763A6A3ACB@drycafe.net> Indeed, congratulations to all of OBF's 2012 GSoC students and mentors - great job! It'd be great to have a summary blog post on the OBF news blog - anyone up for composing that? -hilmar On Aug 22, 2012, at 12:16 AM, Fields, Christopher J wrote: > Wholeheartedly agree. Congrats Artem on a job well done! > > chris > > On Aug 20, 2012, at 4:55 AM, Pjotr Prins wrote: > >> Thank you Artem, >> >> Great job. >> >> Pj. >> >> On Mon, Aug 20, 2012 at 01:47:18PM +0400, Artem Tarasov wrote: >>> Hi all, >>> >>> here's a wrap-up of what was added/fixed in sambamba and Ruby bindings >>> during the last couple of weeks: >>> http://lomereiter.wordpress.com/2012/08/20/gsoc-weekly-report-12/ >>> >>> * Tools don't require input files to be seekable anymore, allowing to work >>> with e.g. /dev/stdin and /dev/stdout >>> * I've added an option of MessagePack output, that drastically improved >>> speed of bindings (2-4x speedup depending on configuration) >>> * The gem is on Travis CI now, passing tests on MRI 1.9.2/1.9.3. (JRuby >>> also works, but on Travis there're problems with popen, same as with >>> BioRuby) >>> * The tool 'sambamba_filter' is now available on Galaxy Tool Shed. >>> >>> >>> It's been a great summer. 
Thank you OBF/BioRuby folks :) >>> >>> >>> -- >>> Artem >>> _______________________________________________ >>> BioRuby Project - http://www.bioruby.org/ >>> BioRuby mailing list >>> BioRuby at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioruby >>> >> _______________________________________________ >> BioRuby Project - http://www.bioruby.org/ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby > > > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : =========================================================== From p.j.a.cock at googlemail.com Mon Sep 3 13:08:51 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Sep 2012 14:08:51 +0100 Subject: [GSoC] [BioRuby] GSoC final report In-Reply-To: <56EAF480-2694-4825-999E-14763A6A3ACB@drycafe.net> References: <20120820095559.GB2453@thebird.nl> <118F034CF4C3EF48A96F86CE585B94BF33B73325@CHIMBX5.ad.uillinois.edu> <56EAF480-2694-4825-999E-14763A6A3ACB@drycafe.net> Message-ID: On Mon, Aug 27, 2012 at 3:14 AM, Hilmar Lapp wrote: > Indeed, congratulations to all of OBF's 2012 GSoC students > and mentors - great job! > > It'd be great to have a summary blog post on the OBF news > blog - anyone up for composing that? > > -hilmar I agree it is a good idea. I'm in Japan for the 2012 BioHackathon, and have spoken with Pjotr, Raul and Francesco - I think we can work on a blog post together this week (I have editing rights). Brad - would you like to contribute/preview the text? Shall we ask your co-mentors too? 
Regards, Peter From chapmanb at 50mail.com Tue Sep 4 09:30:23 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 04 Sep 2012 05:30:23 -0400 Subject: [GSoC] [BioRuby] GSoC final report In-Reply-To: References: <20120820095559.GB2453@thebird.nl> <118F034CF4C3EF48A96F86CE585B94BF33B73325@CHIMBX5.ad.uillinois.edu> <56EAF480-2694-4825-999E-14763A6A3ACB@drycafe.net> Message-ID: <87txvei18w.fsf@fastmail.fm> Peter; Thanks for doing this. Happy to help from my end. For Lenna's project here is some text to use: Lenna Peterson worked on improving support for variation analysis in Biopython. Her summer work produced tools for manipulating [VCF (variant call format)][0] in Python, a thorough investigation of incorporating [PyVCF][1] into Biopython, and tools to handle clean coordinate mapping between transcripts and genomic coordinates. [Her blog][2] has detailed discussions of progress during the summer and the code is in a [branch on GitHub][3] awaiting inclusion into Biopython. [0]: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 [1]: https://github.com/jamescasbon/PyVCF [2]: http://arklenna.tumblr.com/ [3]: https://github.com/lennax Let me know if anything else would be useful, Brad > On Mon, Aug 27, 2012 at 3:14 AM, Hilmar Lapp wrote: >> Indeed, congratulations to all of OBF's 2012 GSoC students >> and mentors - great job! >> >> It'd be great to have a summary blog post on the OBF news >> blog - anyone up for composing that? >> >> -hilmar > > I agree it is a good idea. > > I'm in Japan for the 2012 BioHackathon, and have spoken with > Pjotr, Raul and Francesco - I think we can work on a blog post > together this week (I have editing rights). > > Brad - would you like to contribute/preview the text? Shall we > ask your co-mentors too? 
> > Regards, > > Peter From pjotr.public14 at thebird.nl Tue Sep 25 06:08:50 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 25 Sep 2012 08:08:50 +0200 Subject: [GSoC] [BioRuby] GSoC week 2 status report Message-ID: <20120925060850.GA1143@thebird.nl> Hi John, Congrats from the BioRuby panel and community on winning the Ruby Association Grant! http://sciruby.com/blog/2012/09/24/sciruby-receives-ruby-association-grant--fellowships-available/ Pj. From p.j.a.cock at googlemail.com Mon Nov 26 17:22:00 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 26 Nov 2012 17:22:00 +0000 Subject: [GSoC] GSoC python variant final update In-Reply-To: References: Message-ID: On Mon, Aug 20, 2012 at 5:22 AM, Lenna Peterson wrote: > Post: http://arklenna.tumblr.com/post/29808300789/ > > The coordinate mapper, with updated documentation, is now located on > this branch: https://github.com/lennax/biopython/tree/f_loc4 > It awaits the merging of Peter's f_loc4 branch. > > I've written an entry on coordinate mapping for the Cookbook: > http://biopython.org/wiki/Coordinate_mapping Hi Lenna, Do you need my f_loc4 branch for the main GSoC variants work, or just the coordinate mapper? Thanks, Peter