From p.j.a.cock at googlemail.com Fri Mar 16 17:40:56 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 16 Mar 2012 21:40:56 +0000 Subject: [GSoC] [Open-bio-l] Google Summer of Code is *ON* for OBF projects! In-Reply-To: <4F6398E8.4010806@gmail.com> References: <4F6398E8.4010806@gmail.com> Message-ID: On Fri, Mar 16, 2012 at 7:47 PM, Robert Buels wrote: > Hi all, > > Great news: Google announced today that the Open Bioinformatics > Foundation has been accepted as a mentoring organization for this > summer's Google Summer of Code! > > GSoC is a Google-sponsored student internship program for open-source > projects, open to students from around the world (not just US > residents). Students are paid a $5000 USD stipend to work as a > developer on an open-source project for the summer. For more on GSoC, > see GSoC 2012 FAQ at http://goo.gl/kNv48 > > Student applications are due April 6, 2012 at 19:00 UTC. Students who > are interested in participating should look at the OBF's GSoC page at > http://open-bio.org/wiki/Google_Summer_of_Code, which lists project > ideas, and whom to contact about applying. > > For current developers on OBF projects, please consider volunteering to > be a mentor if you have not already, and contribute project ideas. Just > list your name and project ideas on OBF wiki and on the relevant > project's GSoC wiki page. > > Thanks to all who helped make OBF's application to GSoC a success, and > let's have a great, productive summer of code! > > Rob Buels > OBF GSoC 2012 Administrator Excellent news - well done Rob et al. Would you like me to post this to the news blog, or can you? http://news.open-bio.org/news/ Thanks, Peter From cjfields at illinois.edu Fri Mar 16 17:49:32 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Fri, 16 Mar 2012 21:49:32 +0000 Subject: [GSoC] [Open-bio-l] Google Summer of Code is *ON* for OBF projects! 
In-Reply-To: References: <4F6398E8.4010806@gmail.com> Message-ID: On Mar 16, 2012, at 4:40 PM, Peter Cock wrote: > On Fri, Mar 16, 2012 at 7:47 PM, Robert Buels wrote: >> Hi all, >> >> Great news: Google announced today that the Open Bioinformatics >> Foundation has been accepted as a mentoring organization for this >> summer's Google Summer of Code! >> >> GSoC is a Google-sponsored student internship program for open-source >> projects, open to students from around the world (not just US >> residents). Students are paid a $5000 USD stipend to work as a >> developer on an open-source project for the summer. For more on GSoC, >> see GSoC 2012 FAQ at http://goo.gl/kNv48 >> >> Student applications are due April 6, 2012 at 19:00 UTC. Students who >> are interested in participating should look at the OBF's GSoC page at >> http://open-bio.org/wiki/Google_Summer_of_Code, which lists project >> ideas, and whom to contact about applying. >> >> For current developers on OBF projects, please consider volunteering to >> be a mentor if you have not already, and contribute project ideas. Just >> list your name and project ideas on OBF wiki and on the relevant >> project's GSoC wiki page. >> >> Thanks to all who helped make OBF's application to GSoC a success, and >> let's have a great, productive summer of code! >> >> Rob Buels >> OBF GSoC 2012 Administrator > > Excellent news - well done Rob et al. > > Would you like me to post this to the news blog, or can you? > http://news.open-bio.org/news/ > > Thanks, > > Peter I think post away. I've already tweeted this. chris From p.j.a.cock at googlemail.com Fri Mar 16 17:54:46 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 16 Mar 2012 21:54:46 +0000 Subject: [GSoC] [Open-bio-l] Google Summer of Code is *ON* for OBF projects! In-Reply-To: References: <4F6398E8.4010806@gmail.com>

Message-ID: On Fri, Mar 16, 2012 at 9:49 PM, Fields, Christopher J wrote: > On Mar 16, 2012, at 4:40 PM, Peter Cock wrote: >> >> Excellent news - well done Rob et al. >> >> Would you like me to post this to the news blog, or can you? >> http://news.open-bio.org/news/ >> >> Thanks, >> >> Peter > > I think post away. I've already tweeted this. > > chris > Done, http://news.open-bio.org/news/2012/03/obf-accepted-for-gsoc-2012/ This was posted to the @obf_news twitter account here https://twitter.com/obf_news/status/180773706715504640 Peter From ayushgoel111 at gmail.com Thu Mar 22 14:03:49 2012 From: ayushgoel111 at gmail.com (Ayush Goel) Date: Thu, 22 Mar 2012 23:33:49 +0530 Subject: [GSoC] Interested in working on SearchIO Message-ID: Hello, I am a student at Delhi College of Engineering. I have prior experience in Python from two other internships. I was hoping to find myself a more challenging project this time with Python as the default language. The description of the SearchIO project seems to be a very good one. Still, I am pretty new to Biopython's code. If possible, I would like to have some more information regarding what is expected from the deliverable. Also if some reference material on the background of the data formats required (BLAST etc) could be provided, then it would be very helpful. -- Regards, Ayush Goel From p.j.a.cock at googlemail.com Fri Mar 23 05:30:10 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 23 Mar 2012 09:30:10 +0000 Subject: [GSoC] Interested in working on SearchIO In-Reply-To: References: Message-ID: On Thu, Mar 22, 2012 at 6:03 PM, Ayush Goel wrote: > Hello, > > I am a student at Delhi College of Engineering. I have prior > experience in Python from two other internships. I was hoping to find myself > a more challenging project this time with Python as the default > language. The description of the SearchIO project seems to be a very > good one. > > Still, I am pretty new to Biopython's code. 
If possible, I would > like to have some more information regarding what is expected from the > deliverable. Also if some reference material on the background of the > data formats required (BLAST etc) could be provided, then it would be > very helpful. Hello Ayush, Are you doing any biology or bioinformatics courses? That would help with background knowledge. The SearchIO project does require a reasonably broad knowledge of important tools and concepts in pairwise sequence alignment - if you are not familiar with BLAST etc. that will be a big handicap. You don't need to know the algorithm details - just the overall idea, how to run the tools, and what kind of analysis people might want to do with them. Some possible background reading (an introductory Bioinformatics course or book might be good too): http://www.ncbi.nlm.nih.gov/BLAST/ http://en.wikipedia.org/wiki/BLAST http://emboss.open-bio.org/wiki/Appdoc:Needle http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm http://emboss.open-bio.org/wiki/Appdoc:Water http://en.wikipedia.org/wiki/Smith-Waterman_algorithm In terms of possible deliverables, I went into more detail here: http://lists.open-bio.org/pipermail/biopython-dev/2012-March/009468.html However, if you have a lot of experience with Python and parsing text and XML files, that would be a big plus. Perhaps there is another topic that might suit you better. Is there a particular reason why you are interested in Biopython? Regards, Peter From saketkc at gmail.com Mon Mar 26 04:40:51 2012 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 26 Mar 2012 14:10:51 +0530 Subject: [GSoC] [GSoC2012] BioRuby Message-ID: Hi! I am Saket Choudhary, a third year undergraduate student at IIT Bombay, India. Ruby has been my first love for programming. My last project (an internship) was at SlideShare, one of the biggest Ruby on Rails websites in the world. 
I developed the Admin interface for SlideShare, which enabled reconversion/deletion of SlideShows and user deletion/suspension. My hack at Yahoo! Open Hack India 2011 qualified among the Top 50 hacks: I built a Sinatra app that fetches a specified file from a Dropbox account and sends it to a specified email address, triggered by just an SMS [https://github.com/saketkc/dropbox_on_sms]. I went through the GSoC ideas page, and "Adding social networking functionality to BioRuby.org" is of special interest to me. I have experience working on the Rails platform. My plans for making the website more "Social" are as follows: 1. Provide an online 'Scratchpad' for Ruby/BioRuby enthusiasts that not only allows them to run their code online but also lets them store it online in the form of an archive so that they can access it later. 2. Include a sharing facility on the "Scratchpad" so that one user can share his code online with other users/the community and get feedback/comments, along the lines of "Sage" notebooks. 3. Develop an online Board along the lines of "Quora" Boards so that users can pin certain code/algorithms to their own boards for reference; this would reduce the overhead of searching for a particular algorithm again and again. Let me know your views on these ideas. I will send a mockup in case these ideas seem feasible to you. Thanks Saket Choudhary IIT Bombay From saketkc at gmail.com Mon Mar 26 11:32:40 2012 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 26 Mar 2012 21:02:40 +0530 Subject: [GSoC] [GSoC2012] BioRuby In-Reply-To: <20120326133700.GA22488@thebird.nl> References: <20120326133700.GA22488@thebird.nl> Message-ID: Hi Pjotr, Thanks! I have been using BioRuby and BioPython for 6 months now to solve the "Protein Loop Closure" problem. I use them mainly to manipulate the atom positions in a given PDB file, thus perturbing their positions. 
I went through the discussion that happened on the mailing list last month. Here are my notes on it: 1. http://lists.open-bio.org/pipermail/bioruby/2012-February/002087.html I am also a big fan of the JRuby homepage. Here are my suggestions: 1. Ruby is simple, clear and intuitive, and this makes BioRuby intuitive to everyone. This needs to be emphasised the very first time a user comes to the webpage: instead of giving them examples on the wiki [http://bioruby.open-bio.org/wiki/SampleCodes] to "read" through, a tutorial in the form of a short writeup/description about Ruby/BioRuby followed by a challenge would be more appealing and intuitive to the user, even if he is being exposed to Ruby/BioRuby for the first time. Say, a tutorial along the lines of http://www.codecademy.com/ : a short tutorial followed by your Scratchpad! 2. Calendar/Tweet/Conference widget: something again along the lines of the JRuby website. 3. Favicon missing? It is a very trivial issue, but I just wanted to know why there isn't a favicon for bioruby.org. This is what I have gathered so far. I am still digging through the old threads and will post here if something relevant comes up! Saket Choudhary IIT Bombay github.com/saketkc On 26 March 2012 19:07, Pjotr Prins wrote: > Hi Saket, > > Welcome! > > It would be good if you also introduce yourself to the BioRuby ML, and > post your ideas. We are working on the website (should I say > web 'experience'), and I like what you propose. Also check out the ML > archive of the last months, you'll find a lot of information. > > Pj. > > On Mon, Mar 26, 2012 at 02:10:51PM +0530, Saket Choudhary wrote: > > Hi ! > > > > I am Saket Choudhary, a third year undergraduate student at IIT Bombay, > > India. > > > > Ruby has been my first love for programming. My last project[internship] > > was at SlideShare , one of the biggest Ruby > on > > Rails website in the world. 
I had developed the Admin interface for > > SlideShare which could enable suspending users, reconversion/deletion of > > SlideShows , and user deletion/suspension. > > > > My hack at Yahoo! Open Hack India -2011 qualified among the Top 50 hacks > , > > I built a Sinatra app for fetching a defined file from a Dropbox account > > and sending it to a specified email address , just on a SMS.[ > > https://github.com/saketkc/dropbox_on_sms] > > > > I went through the GSoC idea page and "Adding social networking > > functionality to BioRuby.org", is of my special interest. I have an > > experience working on the Rails platform. My plans for making the website > > more "Social" is as follows: > > > > 1. Provide an online 'Scratchpad' for Ruby/BioRuby enthusiasts that not > > only allows them to run their codes online but also provides them a > > facility to store it online in the form of an archive so that they an > acess > > it later. > > > > 2. Include sharing facility on "Scratchpad" so that one user can share > his > > code online with other users/community and get feedback/comments. On the > > lines of "Sage " notebooks. > > > > 3. Develop an Online Board on the lines of "Quora >" > > Boards so that users can pin certain codes/algorithms on to their own > > boards for their reference this would reduce the overhead of searching > for > > a particular algorithm again and again . > > > > Let me know you views on these ideas. I would send a mockup of the same > > incase these ideas seem feasible to you. > > > > Thanks > > > > Saket Choudhary > > IIT Bombay > > _______________________________________________ > > GSoC mailing list > > GSoC at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/gsoc > From pjotr2012 at thebird.nl Mon Mar 26 09:37:00 2012 From: pjotr2012 at thebird.nl (Pjotr Prins) Date: Mon, 26 Mar 2012 15:37:00 +0200 Subject: [GSoC] [GSoC2012] BioRuby In-Reply-To: References: Message-ID: <20120326133700.GA22488@thebird.nl> Hi Saket, Welcome! 
It would be good if you also introduce yourself to the BioRuby ML, and post your ideas. We are working on the website (should I say web 'experience'), and I like what you propose. Also check out the ML archive of the last months, you'll find a lot of information. Pj. On Mon, Mar 26, 2012 at 02:10:51PM +0530, Saket Choudhary wrote: > Hi ! > > I am Saket Choudhary, a third year undergraduate student at IIT Bombay, > India. > > Ruby has been my first love for programming. My last project[internship] > was at SlideShare , one of the biggest Ruby on > Rails website in the world. I had developed the Admin interface for > SlideShare which could enable suspending users, reconversion/deletion of > SlideShows , and user deletion/suspension. > > My hack at Yahoo! Open Hack India -2011 qualified among the Top 50 hacks , > I built a Sinatra app for fetching a defined file from a Dropbox account > and sending it to a specified email address , just on a SMS.[ > https://github.com/saketkc/dropbox_on_sms] > > I went through the GSoC idea page and "Adding social networking > functionality to BioRuby.org", is of my special interest. I have an > experience working on the Rails platform. My plans for making the website > more "Social" is as follows: > > 1. Provide an online 'Scratchpad' for Ruby/BioRuby enthusiasts that not > only allows them to run their codes online but also provides them a > facility to store it online in the form of an archive so that they an acess > it later. > > 2. Include sharing facility on "Scratchpad" so that one user can share his > code online with other users/community and get feedback/comments. On the > lines of "Sage " notebooks. > > 3. Develop an Online Board on the lines of "Quora " > Boards so that users can pin certain codes/algorithms on to their own > boards for their reference this would reduce the overhead of searching for > a particular algorithm again and again . > > Let me know you views on these ideas. 
I would send a mockup of the same > incase these ideas seem feasible to you. > > Thanks > > Saket Choudhary > IIT Bombay > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From shahruchin711 at gmail.com Sun Apr 1 07:23:00 2012 From: shahruchin711 at gmail.com (Ruchin Shah) Date: Sun, 1 Apr 2012 16:53:00 +0530 Subject: [GSoC] BioJava- Porting the BLAST,HMMER algorithms Message-ID: Hi, I am Ruchin Shah, a 3rd year undergraduate student from DA-IICT, India. I would like to work on some challenging projects in the field of bioinformatics. I have already worked on a project called BioSpectroGram (written in Java) under the mentorship of Prof. Manish K. Gupta (http://www.guptalab.org/mankg/public_html/), which aims at analyzing DNA and protein sequences using various kinds of transformations (FFT, DCT, etc.). I came to know about the idea of implementing the two algorithms, BLAST and HMMER, and I find it very fascinating. I have good coding experience (http://www.spoj.pl/users/ruchinshah/). I am also familiar with the FASTA and GenBank formats. I read about the BLAST algorithm at http://en.wikipedia.org/wiki/BLAST#BLAST but if possible I would like to know more about these two algorithms, exactly what is expected from the project, and some more references. If I am not wrong, you are expecting to use some C-to-Java conversion tool or JNI to exploit the already available BLAST+ tool, and not to implement the algorithms from scratch. From andreas at sdsc.edu Sun Apr 1 14:02:16 2012 From: andreas at sdsc.edu (Andreas Prlic) Date: Sun, 1 Apr 2012 11:02:16 -0700 Subject: [GSoC] BioJava- Porting the BLAST,HMMER algorithms In-Reply-To: References: Message-ID: Hi Ruchin, Are you also on the biojava-l mailing list? 
We have already had quite a number of discussions about this project there, and if you are not on the list, catching up on what was discussed would be a good start. http://lists.open-bio.org/pipermail/biojava-l/ The idea in short is to come up with an all-Java version of some of the frequently used algorithms. We are quite flexible regarding the projects and what we are really looking for are sound projects and motivated students. What is expected is a realistic project proposal, which in turn depends on your background and how you propose to conduct the project. Andreas On Sun, Apr 1, 2012 at 4:23 AM, Ruchin Shah wrote: > Hi, > > I am Ruchin Shah, a 3rd year undergraduate student from DA-IICT, India. > > I would like to work on some challenging projects in the field of > bioinformatics. I have already worked on a project called > BioSpectroGram (written in Java) under the mentorship of Prof. Manish K. > Gupta (http://www.guptalab.org/mankg/public_html/), which aims at analyzing > DNA and protein sequences using various kinds of > transformations (FFT, DCT, etc.). I came to know about the idea of implementing > the two algorithms, BLAST and HMMER, and I find it very fascinating. I have > good coding experience (http://www.spoj.pl/users/ruchinshah/). I am also > familiar with the FASTA and GenBank formats. > > I read about the BLAST algorithm at > http://en.wikipedia.org/wiki/BLAST#BLAST but if possible I would like to > know more about these two algorithms, exactly what is expected from the > project, > and some more references. If I am not wrong, you are expecting to > use some C-to-Java conversion tool or JNI to exploit the already available > BLAST+ tool, and not to implement the algorithms from scratch. 
> _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From p.j.a.cock at googlemail.com Tue Apr 24 07:21:56 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Apr 2012 12:21:56 +0100 Subject: [GSoC] Fwd: Announcing OBF Google Summer of Code Accepted Students In-Reply-To: <4F95EA76.4030004@gmail.com> References: <4F95EA76.4030004@gmail.com> Message-ID: The announcement is also on the OBF news blog now: http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ ---------- Forwarded message ---------- From: Robert Buels Date: Tue, Apr 24, 2012 at 12:49 AM Subject: [Bioperl-l] Announcing OBF Google Summer of Code Accepted Students To: BioPerl List , BioJava List , BioRuby List , BioPython List , BioDAS List , BioLib List , BioSQL List Hello all, I'm very pleased and excited to announce that the Open Bioinformatics Foundation has selected 5 very capable students to work on OBF projects this summer as part of the Google Summer of Code program. The accepted students, their projects, and their mentors (in alphabetical order):

Wibowo Arindrarto
    SearchIO Implementation in Biopython
    mentored by Peter Cock

Lenna Peterson
    Diff My DNA: Development of a Genomic Variant Toolkit for Biopython
    mentored by Brad Chapman

Marjan Povolni
    The worlds fastest parallelized GFF3/GTF parser in D, and an interfacing biogem plugin for Ruby
    mentored by Pjotr Prins, Francesco Strozzi, Raoul Bonnal

Artem Tarasov
    Fast parallelized GFF3/GTF parser in C++, with Ruby FFI bindings
    mentored by Pjotr Prins, Francesco Strozzi, Raoul Bonnal

Clayton Wheeler
    Multiple Alignment Format parser for BioRuby
    mentored by Francesco Strozzi and Raoul Bonnal

As in every year, we received many great applications and ideas. However, funding and mentor resources are limited, and we were not able to accept as many as we would have liked. 
Our deepest thanks to all the students who applied: we sincerely appreciate the time and effort you put into your applications, and hope you will still consider being a part of the OBF's open source projects, even without Google funding. I speak for myself and all of the mentors who read and scored applications when I say that we were truly honored by the number and quality of the applications we received. For the accepted students: congratulations! You have risen to the top of a very competitive application process. Now it's time to "put your money where your mouth is", as the saying goes. Let's get out there and write some great code this summer! Best regards, Rob ---- Robert Buels OBF GSoC 2012 Administrator _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From p.j.a.cock at googlemail.com Tue Apr 24 07:24:20 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Apr 2012 12:24:20 +0100 Subject: [GSoC] OBF GSoC students weekly progress reports Message-ID: Hello all, First, to echo Rob, congratulations to our selected students: http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ http://lists.open-bio.org/pipermail/gsoc/2012/000049.html Weekly Progress Reports: To encourage community bonding and awareness of what the GSoC 2012 students are doing, this year the OBF is being much clearer about our progress report expectations. We would like every student to setup a blog for the GSoC project (or a category/tag on your existing blog) which you will use to summarize your progress every week, as well as longer posts at the half way evaluation, and at the end of the summer. 
In addition, after publishing each blog post, we expect you to email the URL and the text of the blog (or if important images or formatting would be lost, at least a short summary) to the host project's mailing list(s) (check with your mentors if the project has more than one) AND the gsoc at open-bio.org mailing list. You will be writing under your own name, but with a clear association with your mentors, the OBF and its projects, so please take this seriously and be professional. Remember this will become part of your online presence, and potentially looked at by future employers and colleagues. Please talk to your mentors about this during the "community bonding" stage of the GSoC code (i.e. the next few weeks before you actually start). Thank you, Peter (On behalf of the OBF GSoC mentors and projects) Note: As per Rob's earlier email, could both students and mentors please ensure you have subscribed to the public OBF GSoC email list at http://lists.open-bio.org/mailman/listinfo/gsoc (I have BCC'd you on this email just in case you haven't done this yet). Thanks! From arklenna at gmail.com Tue Apr 24 13:21:46 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 24 Apr 2012 13:21:46 -0400 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: Hi all, I'm very excited to be participating in GSoC '12 with Biopython! My development blog is on tumblr, which I chose primarily because it supports markdown syntax, which I'm used to from GitHub. Here's my gsoc12 tag: http://arklenna.tumblr.com/tagged/gsoc2012 However, Tumblr doesn't allow post comments. Will I need to switch to a blog platform that allows comments? 
Cheers, Lenna From p.j.a.cock at googlemail.com Tue Apr 24 13:55:51 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Apr 2012 18:55:51 +0100 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: On Tue, Apr 24, 2012 at 6:21 PM, Lenna Peterson wrote: > Hi all, > > > I'm very excited to be participating in GSoC '12 with Biopython! > > My development blog is on tumblr, which I chose primarily because it > supports markdown syntax, which I'm used to from GitHub. > > Here's my gsoc12 tag: http://arklenna.tumblr.com/tagged/gsoc2012 > > However, Tumblr doesn't allow post comments. Will I need to switch to a > blog platform that allows comments? > > Cheers, > > Lenna Hi Lenna, Great - you've got a blog already you're also the first student to reply :) Blog comments could be nice, but personally in your shoes I'd direct any discussion to the biopython(-dev) mailing list. e.g. 1. Post weekly update blog, get blog post URL 2. Send email with summary, including blog post URL 3. Goto mailing list archive, get archived email URL 4. Update blog post to link to email (and thus any thread from it, at least for that month). A little cumbersome, but it would save you moving your blog? I'd actually be happier with most discussion on the biopython-dev list rather than blog comments, or even github (which will still be useful for things like code reviews). This may be different for the other projects - I know BioRuby uses IRC much more for example, but even there they've tried to post archives of important IRC discussions to their mailing list too. Thank you! Peter From arklenna at gmail.com Tue Apr 24 14:41:25 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Tue, 24 Apr 2012 14:41:25 -0400 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References:

Message-ID: On Tue, Apr 24, 2012 at 1:55 PM, Peter Cock wrote: > > Hi Lenna, > > Great - you've got a blog already you're also the first student to reply :) > > Blog comments could be nice, but personally in your shoes I'd > direct any discussion to the biopython(-dev) mailing list. e.g. > > 1. Post weekly update blog, get blog post URL > 2. Send email with summary, including blog post URL > 3. Goto mailing list archive, get archived email URL > 4. Update blog post to link to email (and thus any thread from it, > at least for that month). > > A little cumbersome, but it would save you moving your blog? > > I'd actually be happier with most discussion on the biopython-dev > list rather than blog comments, or even github (which will still be > useful for things like code reviews). > > This may be different for the other projects - I know BioRuby > uses IRC much more for example, but even there they've tried > to post archives of important IRC discussions to their mailing > list too. > > Thank you! > > Peter Peter, If I get ambitious, I could write a Python script to retrieve the mailing list url and put it into my blog post! To clarify - for biopython, should the update emails go out to both the biopython and biopython-dev mailing lists, or just the latter? Lenna From w.arindrarto at gmail.com Tue Apr 24 15:01:23 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Tue, 24 Apr 2012 21:01:23 +0200 Subject: [GSoC] [Biopython] OBF GSoC students weekly progress reports In-Reply-To: References:

Message-ID: On Tue, Apr 24, 2012 at 19:55, Peter Cock wrote: > On Tue, Apr 24, 2012 at 6:21 PM, Lenna Peterson wrote: >> Hi all, >> >> >> I'm very excited to be participating in GSoC '12 with Biopython! >> >> My development blog is on tumblr, which I chose primarily because it >> supports markdown syntax, which I'm used to from GitHub. >> >> Here's my gsoc12 tag: http://arklenna.tumblr.com/tagged/gsoc2012 >> >> However, Tumblr doesn't allow post comments. Will I need to switch to a >> blog platform that allows comments? >> >> Cheers, >> >> Lenna > > Hi Lenna, > > Great - you've got a blog already you're also the first student to reply :) > > Blog comments could be nice, but personally in your shoes I'd > direct any discussion to the biopython(-dev) mailing list. e.g. > > 1. Post weekly update blog, get blog post URL > 2. Send email with summary, including blog post URL > 3. Goto mailing list archive, get archived email URL > 4. Update blog post to link to email (and thus any thread from it, > at least for that month). > > A little cumbersome, but it would save you moving your blog? > > I'd actually be happier with most discussion on the biopython-dev > list rather than blog comments, or even github (which will still be > useful for things like code reviews). > > This may be different for the other projects - I know BioRuby > uses IRC much more for example, but even there they've tried > to post archives of important IRC discussions to their mailing > list too. > > Thank you! > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython Hi everyone, Wibowo Arindrarto here, but you can just call me Bow for short :). I'm very excited to be accepted into GSoC with OBF as well! 
I will be blogging on my site: http://bow.web.id/blog, and I've actually made my inaugural GSoC post just a few hours after I heard the news, here: http://bow.web.id/blog/2012/04/google-summer-of-code-is-on/. I'll be posting all GSoC related post under the `gsoc` tag, accessible through this URL: http://bow.web.id/blog/tag/gsoc/. To follow Peter's suggestion, I'll post my weekly progress in this mailing list for everyone to see, too. cheers, Bow From rbuels at gmail.com Tue Apr 24 15:13:48 2012 From: rbuels at gmail.com (Robert Buels) Date: Tue, 24 Apr 2012 15:13:48 -0400 Subject: [GSoC] [Biopython] OBF GSoC students weekly progress reports In-Reply-To: References:

Message-ID: <4F96FB6C.3010805@gmail.com> Bow, make sure you subscribe to the OBF GSoC mailing list. http://lists.open-bio.org/mailman/listinfo/gsoc Rob On 04/24/2012 03:01 PM, Wibowo Arindrarto wrote: > On Tue, Apr 24, 2012 at 19:55, Peter Cock wrote: >> On Tue, Apr 24, 2012 at 6:21 PM, Lenna Peterson wrote: >>> Hi all, >>> >>> >>> I'm very excited to be participating in GSoC '12 with Biopython! >>> >>> My development blog is on tumblr, which I chose primarily because it >>> supports markdown syntax, which I'm used to from GitHub. >>> >>> Here's my gsoc12 tag: http://arklenna.tumblr.com/tagged/gsoc2012 >>> >>> However, Tumblr doesn't allow post comments. Will I need to switch to a >>> blog platform that allows comments? >>> >>> Cheers, >>> >>> Lenna >> >> Hi Lenna, >> >> Great - you've got a blog already you're also the first student to reply :) >> >> Blog comments could be nice, but personally in your shoes I'd >> direct any discussion to the biopython(-dev) mailing list. e.g. >> >> 1. Post weekly update blog, get blog post URL >> 2. Send email with summary, including blog post URL >> 3. Goto mailing list archive, get archived email URL >> 4. Update blog post to link to email (and thus any thread from it, >> at least for that month). >> >> A little cumbersome, but it would save you moving your blog? >> >> I'd actually be happier with most discussion on the biopython-dev >> list rather than blog comments, or even github (which will still be >> useful for things like code reviews). >> >> This may be different for the other projects - I know BioRuby >> uses IRC much more for example, but even there they've tried >> to post archives of important IRC discussions to their mailing >> list too. >> >> Thank you! 
>> >> Peter >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > Hi everyone, > > Wibowo Arindrarto here, but you can just call me Bow for short :). I'm > very excited to be accepted into GSoC with OBF as well! > > I will be blogging on my site: http://bow.web.id/blog, and I've > actually made my inaugural GSoC post just a few hours after I heard > the news, here: > http://bow.web.id/blog/2012/04/google-summer-of-code-is-on/. I'll be > posting all GSoC related post under the `gsoc` tag, accessible through > this URL: http://bow.web.id/blog/tag/gsoc/. To follow Peter's > suggestion, I'll post my weekly progress in this mailing list for > everyone to see, too. > > cheers, > Bow > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From marian.povolny at gmail.com Wed Apr 25 13:17:01 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Wed, 25 Apr 2012 19:17:01 +0200 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References: Message-ID: Hi Peter, Another excited GSoC student here :) I think the idea with a blog for status updates a great idea, I would have done it probably even if it wasn't a requirement. I didn't have a blog before, so I created one at tumblr, and it should be possible for the visitors to leave comments too. But I do agree with you that the ML is a better place for discussions about our GSoC projects. Here is a link to my new blog: http://blog.mpthecoder.com/ GSoC related posts will be tagged with #gsoc ( http://blog.mpthecoder.com/tagged/gsoc). @Lenna Tumblr lets you use your Disqus account if you want to enable comments on your tumblr blog. However, not all themes support it. 
See the first q&a here for more info: http://www.tumblr.com/help It took me about 2 minutes to create an account on Disqus and link it to my blog. -- Marjan On Tue, Apr 24, 2012 at 1:24 PM, Peter Cock wrote: > Hello all, > > First, to echo Rob, congratulations to our selected students: > http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ > http://lists.open-bio.org/pipermail/gsoc/2012/000049.html > > Weekly Progress Reports: > > To encourage community bonding and awareness of what the > GSoC 2012 students are doing, this year the OBF is being much > clearer about our progress report expectations. > > We would like every student to setup a blog for the GSoC project > (or a category/tag on your existing blog) which you will use to > summarize your progress every week, as well as longer posts > at the half way evaluation, and at the end of the summer. > > In addition, after publishing each blog post, we expect you to > email the URL and the text of the blog (or if important images > or formatting would be lost, at least a short summary) to the > host project's mailing list(s) (check with your mentors if the > project has more than one) AND the gsoc at open-bio.org > mailing list. > > You will be writing under your own name, but with a clear > association with your mentors, the OBF and its projects, so > please take this seriously and be professional. Remember > this will become part of your online presence, and potentially > looked at by future employers and colleagues. > > Please talk to your mentors about this during the "community > bonding" stage of the GSoC code (i.e. the next few weeks > before you actually start). 
> > Thank you, > > Peter > > (On behalf of the OBF GSoC mentors and projects) > > Note: As per Rob's earlier email, could both students and mentors > please ensure you have subscribed to the public OBF GSoC email > list at http://lists.open-bio.org/mailman/listinfo/gsoc (I have BCC'd > you on this email just in case you haven't done this yet). Thanks! > From arklenna at gmail.com Wed Apr 25 20:16:11 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Wed, 25 Apr 2012 20:16:11 -0400 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References:

Message-ID: On Wed, Apr 25, 2012 at 1:17 PM, Marjan Povolni wrote: > > @Lenna > Tumblr lets you use your Disqus account if you want to enable comments on > your tumblr blog. However, not all themes support it. See the first q&a > here for more info: > > http://www.tumblr.com/help > > It took me about 2 minutes to create an account on Disqus and link it to my > blog. > > -- > Marjan > > Marjan - Thanks for the tip! I have disqus set up on my tumblr now. I also filed my enrollment and tax forms with Google. Now I'm busy in the thinking phase ;) Lenna From p.j.a.cock at googlemail.com Thu Apr 26 05:49:26 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 26 Apr 2012 10:49:26 +0100 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References:

Message-ID: On Wed, Apr 25, 2012 at 6:17 PM, Marjan Povolni wrote: > Hi Peter, > > Another excited GSoC student here :) > > I think the idea with a blog for status updates a great idea, I would have > done it probably even if it wasn't a requirement. I didn't have a blog > before, so I created one at tumblr, and it should be possible for the > visitors to leave comments too. But I do agree with you that the ML is a > better place for discussions about our GSoC projects. Here is a link to my > new blog: > > http://blog.mpthecoder.com/ > > GSoC related posts will be tagged with #gsoc > (http://blog.mpthecoder.com/tagged/gsoc). Excellent, I've added the three Blog links so far to this post: http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ I'll do another full post highlighting your blogs once all five are ready. Thanks, Peter From lomereiter at googlemail.com Thu Apr 26 07:43:07 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Thu, 26 Apr 2012 15:43:07 +0400 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References:

Message-ID: Hi all, I'm also very excited about being accepted :) > Excellent, I've added the three Blog links so far to this post: > http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ > > I'll do another full post highlighting your blogs once all five > are ready. > My blog posts will be at http://lomereiter.wordpress.com/tag/gsoc, I'll update it at least every week during the coding period. From rbuels at gmail.com Thu Apr 26 11:05:01 2012 From: rbuels at gmail.com (Robert Buels) Date: Thu, 26 Apr 2012 11:05:01 -0400 Subject: [GSoC] OBF GSoC students weekly progress reports In-Reply-To: References:

Message-ID: <4F99641D.3070908@gmail.com> Thanks for handling the blog links Peter! The wiki page has them now too. http://www.open-bio.org/wiki/Google_Summer_of_Code#About_Google_Summer_of_Code Artem and Clayton: please update that wiki page to link to your progress blogs and notify Peter so he can put the link on the OBF blog. Rob On 04/26/2012 05:49 AM, Peter Cock wrote: > On Wed, Apr 25, 2012 at 6:17 PM, Marjan Povolni > wrote: >> Hi Peter, >> >> Another excited GSoC student here :) >> >> I think the idea with a blog for status updates a great idea, I would have >> done it probably even if it wasn't a requirement. I didn't have a blog >> before, so I created one at tumblr, and it should be possible for the >> visitors to leave comments too. But I do agree with you that the ML is a >> better place for discussions about our GSoC projects. Here is a link to my >> new blog: >> >> http://blog.mpthecoder.com/ >> >> GSoC related posts will be tagged with #gsoc >> (http://blog.mpthecoder.com/tagged/gsoc). > > Excellent, I've added the three Blog links so far to this post: > http://news.open-bio.org/news/2012/04/students-selected-for-gsoc/ > > I'll do another full post highlighting your blogs once all five > are ready. > > Thanks, > > Peter > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc > From marian.povolny at gmail.com Sat May 5 09:07:30 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Sat, 5 May 2012 15:07:30 +0200 Subject: [GSoC] GSoC weekly status report No.1 Message-ID: Hello all, It might be a little early, but there has been so much going on in the last 10 days since the results of GSoC were published... http://blog.mpthecoder.com/post/22380853664/gsoc-weekly-status-report-no-1 A short summary: It has been 10 days since the GSoC results were published, and a lot has happened since then. 
I got to know the other students and mentors in a longish meeting on Google hangout, I got into a discussion with my mentor on IRC in which we didn't agree about the parallelization strategy for the parser (experiments will show who's right) and my inbox is full with mails from my mentor and other students, in which we exchanged loads of interesting ideas. Also, I solved a bug in biogems.info website, which was stopping Pjotr from updating the website with new information about biogems.

There is now a GitHub repository for my project:

https://github.com/mamarjan/bioruby-hpc-gff3

The work for the first week of coding is halfway done too.

There seems to be huge interest for a GFF3 parser with more features, like indexing, random access and writing output, and also support for linking into trees of features that are not located close to each other in the file. A fast sequential parser could be used to generate indexes, and the lower-level parts can be used to reorder the file for faster future usage. Based on that, I think this project is a good start.

*I would like to ask you if you're using the GFF3/GTF file formats in your research, to send me example files and descriptions of how are your applications using the data. This way I'll be able to test the parser against your files and optimize it for your applications. Currently I have GFF files from Ensembl and Wormbase, and Pjotr pointed me to the genome browser web application at wormbase.org.*

--
Marjan

From rbuels at gmail.com Sun May 6 11:00:07 2012
From: rbuels at gmail.com (Robert Buels)
Date: Sun, 06 May 2012 11:00:07 -0400
Subject: [GSoC] GSoC weekly status report No.1
In-Reply-To:
References:
Message-ID: <4FA691F7.9030905@gmail.com>

Hi Marjan,

You should probably incorporate into your test suite all of the test gff3 files in the test data directory of the Perl Bio::GFF3::LowLevel::Parser. It has coverage for some corner cases that are a little bit tricky.
https://github.com/solgenomics/bio-gff3/tree/master/t/data

Rob

From lomereiter at googlemail.com Sun May 6 15:56:50 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Sun, 6 May 2012 23:56:50 +0400
Subject: [GSoC] [BAM] Weekly report No. 0
Message-ID:

Hi all,

I wrote a few words about what I've done last week:
http://lomereiter.wordpress.com/2012/05/06/gsoc-weekly-report-0/

Summary:

The code is available at github: https://github.com/lomereiter/BAMread/

I already started to write code planned for the first week so as to have more time in June for exam preparation. Opening BAM and parsing SAM header works, and is available from Ruby, and now I need to write some tests and documentation.

Also, I described some compile-time metaprogramming tricks in D which I use to reduce duplication in the code.

I'd be grateful for some small BAM files, 1-50 kilobytes in size, with non-empty headers, for testing purposes.

--
Artem

From marian.povolny at gmail.com Sun May 6 16:22:01 2012
From: marian.povolny at gmail.com (Marjan Povolni)
Date: Sun, 6 May 2012 22:22:01 +0200
Subject: [GSoC] GSoC weekly status report No.1
In-Reply-To: <4FA691F7.9030905@gmail.com>
References: <4FA691F7.9030905@gmail.com>
Message-ID:

Thanks for the tip, that's a great idea!

--
Marjan

On Sun, May 6, 2012 at 5:00 PM, Robert Buels wrote:
> Hi Marjan,
>
> You should probably incorporate into your test suite all of the test gff3
> files in the test data directory of the Perl Bio::GFF3::LowLevel::Parser.
> It has coverage for some corner cases that are a little bit tricky.
>
> https://github.com/solgenomics/bio-gff3/tree/master/t/data
>
> Rob

From arklenna at gmail.com Sun May 6 17:26:30 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Sun, 6 May 2012 17:26:30 -0400
Subject: [GSoC] GSoC python variant update
Message-ID:

Hi all,

I've written a few new posts on my blog; here's the latest:

http://arklenna.tumblr.com/post/22542372076/spot-isa-dog

I will attach a UML diagram and include the part of the post addressing the diagram. Click through to the full post for a bonus Einstein quote!

-------

My main goals are not limited to:

* Make the structure parser and file-format agnostic: an abstracted OO design should allow anything to be slotted in (for example, Marjan's C GFF parser?)
* Maintain encapsulation: limit how much each object can see of objects above and below it
* Allow extension at multiple levels: some existing parsers may process data in different ways; this structure should allow handling both raw data and data in various formats.

The `Variant` object's constructor allows an end user to change the default parsers. Practical implementation details of `parse()` and `write()` will need to be finessed - for example, ways to help the user sift through immense quantities of data. I'm still in the process of comparing the data contained in VCF/GVF files as well as the APIs of PyVCF and BCBio.GFF.

`Parser` and `Writer` are both abstract classes that will define all methods found in known parsers/writers with `NotImplementedError`s. I'm speculating on whether a Variant-specific exception would be useful, but a custom message should suffice.

Continuing down the diagram, `PyVCFWrapper` and `BCBioGFFWrapper` would each inherit from both `Parser` and `Writer`.
As the name implies, they would serve as the adapter between the generic `Variant` and the specific parser.

I anticipate that this structure could easily be extended to allow intermediate storage in DBs as well as innumerable sorting/comparing/filtering methods inside `Variant`.

-------

I would appreciate any and all feedback about the overall structure. Namespace is definitely flexible. I'd also appreciate any specific genomic variant workflows, and if somebody can point me to smallish sample files of the same data in both VCF and GVF, I'd be eternally grateful.

Regards,

Lenna
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Variant_UML.png
Type: image/png
Size: 23313 bytes
Desc: not available
URL:

From chapmanb at 50mail.com Mon May 7 20:24:39 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 07 May 2012 20:24:39 -0400
Subject: [GSoC] GSoC python variant update
In-Reply-To:
References:
Message-ID: <87mx5jfrjs.fsf@fastmail.fm>

Lenna;
This all looks great for a top level overview of the classes. This should give you sufficient flexibility to work on the different file types. Another approach is to avoid some of the inheritance and have parse/write dispatch to VCF or GFF specific classes based on the filetype:

    if filetype == "vcf":
        variant_handler = PyVCFVariants()
    elif filetype == "gvf":
        variant_handler = GVFVariants()
    variant_handler.parse(*args)

Avoiding layers can be nice to simplify the architecture, as long as it gives you the flexibility you need.

My suggestion for digging more in the API design would be to start playing with some VCF files and getting comfortable with the data they have and where it would go in Biopython objects. VCF is much more widely used than GVF so it's a good practical place to start.
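
The dispatch idea in this message can be sketched in a self-contained form. What follows is only an illustration under stated assumptions, not code from PyVCF, BCBio.GFF or Biopython: the class names PyVCFVariants and GVFVariants come from the message, but their bodies here are placeholders that merely split tab-delimited lines, and HANDLERS, get_variant_handler and _split_records are hypothetical names. A dictionary lookup stands in for the if/elif chain, and no common base class is used.

```python
import io


def _split_records(handle):
    """Placeholder parsing: return the tab-separated fields of each data line."""
    records = []
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and '#' header/pragma lines.
        if not line or line.startswith("#"):
            continue
        records.append(line.split("\t"))
    return records


class PyVCFVariants:
    """Hypothetical VCF handler; a real one might delegate to PyVCF."""

    def parse(self, handle):
        return _split_records(handle)


class GVFVariants:
    """Hypothetical GVF handler; a real one might delegate to BCBio.GFF."""

    def parse(self, handle):
        return _split_records(handle)


# Map each file type to its handler class (no shared base class needed).
HANDLERS = {"vcf": PyVCFVariants, "gvf": GVFVariants}


def get_variant_handler(filetype):
    """Return a new handler instance for the given file type."""
    try:
        handler_class = HANDLERS[filetype.lower()]
    except KeyError:
        raise ValueError("unsupported variant file type: %r" % (filetype,)) from None
    return handler_class()


if __name__ == "__main__":
    vcf_text = "##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\n20\t14370\trs6054257\tG\tA\n"
    handler = get_variant_handler("vcf")
    for record in handler.parse(io.StringIO(vcf_text)):
        print(record)
```

Supporting another format in this sketch means adding one entry to HANDLERS; the handlers only need to agree on having a parse() method.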
Thanks for all this work and best of luck on finals,
Brad

From casbon at gmail.com Tue May 8 04:57:57 2012
From: casbon at gmail.com (James Casbon)
Date: Tue, 8 May 2012 09:57:57 +0100
Subject: [GSoC] GSoC python variant update
In-Reply-To: <87mx5jfrjs.fsf@fastmail.fm>
References: <87mx5jfrjs.fsf@fastmail.fm>
Message-ID:

On 8 May 2012 01:24, Brad Chapman wrote:
>
> Lenna;
> This all looks great for a top level overview of the classes. This
> should give you sufficient flexibility to work on the different file
> types. Another approach is to avoid some of the inheritance and have
> parse/write dispatch to VCF or GFF specific classes based on the
> filetype:
>
> if filetype == "vcf":
>     variant_handler = PyVCFVariants()
> elif filetype == "gvf":
>     variant_handler = GVFVariants()
> variant_handler.parse(*args)
>
> Avoiding layers can be nice to simplify the architecture, as long as it
> gives you the flexibility you need.

Hi Lenna,

This looks a good start, but I would agree with Brad that layers of inheritance aren't always the best way to proceed with python.

Specific feedback: why does the Variant have parse/write methods when you state that you will use adaptation from the general variation class to the actual parser? I'm also slightly worried this could be pretty slow when dealing with the volume of data you get from a VCF file.

As for the points in your blog post... I have plenty of data, do we know any SNP callers capable of creating GVF files? If so, I can give you both formats.

The simplest variant workflows would be to filter and then score on some metric.
Filter would be to remove noise, so a quality threshold is the simplest one. The metric used depends on the experimental setup. For case/control, a Fisher's exact test is quite easy, or for a single population a Hardy-Weinberg equilibrium (HWE) test is fairly simple.

Hope this helps,

--
James
http://casbon.me/

From pjotr2010 at thebird.nl Tue May 8 07:40:43 2012
From: pjotr2010 at thebird.nl (Pjotr Prins)
Date: Tue, 8 May 2012 13:40:43 +0200
Subject: [GSoC] GSoC python variant update
In-Reply-To:
References: <87mx5jfrjs.fsf@fastmail.fm>
Message-ID: <20120508114043.GC14359@thebird.nl>

On Tue, May 08, 2012 at 09:57:57AM +0100, James Casbon wrote:
> > Avoiding layers can be nice to simplify the architecture, as long as it
> > gives you the flexibility you need.

This is actually a pattern. See 'Using Mixin Technology to Improve Modularity', for example.

http://www.cs.utexas.edu/~lin/papers/aop03.pdf

Pj.

From w.arindrarto at gmail.com Wed May 9 12:24:43 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 9 May 2012 18:24:43 +0200
Subject: [GSoC] GSoC Project Update -- 1
Message-ID:

Hi everyone,

I just posted my latest blog update here:
http://bow.web.id/blog/2012/05/warming-up-for-the-coding-period/

To summarize, I've spent most of my time getting to know the programs I will support better. This has been done by:

1. Playing around with the programs to see how many different outputs I can generate.
2. Writing scripts to automate test case generation for each of the programs.
3. Writing wrappers (for programs not yet wrapped by Biopython: FASTA, HMMER, and BLAT) to ease writing the test case generators.
4. Continuing to complete my proposed SearchIO object naming scheme (http://bit.ly/searchio-terms)

The test cases, their generators, and the wrappers I've written are available in my non-Biopython gsoc repo here: http://github.com/bow/gsoc/. Additionally, I've used the generated test case to improve a recent bug report and submitted a fix for the next release.
For the coming weeks prior to coding start, I'm planning to play around more with XML and SQLite as I will use them in the code. I might start to add more skeleton code to my current development branch as well (https://github.com/bow/biopython).

cheers,
Bow

From arklenna at gmail.com Wed May 9 20:16:18 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Wed, 9 May 2012 20:16:18 -0400
Subject: [GSoC] GSoC python variant update
In-Reply-To: <20120508114043.GC14359@thebird.nl>
References: <87mx5jfrjs.fsf@fastmail.fm> <20120508114043.GC14359@thebird.nl>
Message-ID:

I think my UML diagram may need a legend, or perhaps it should just be abandoned. I've written some skeleton code to try to avoid confusion about the pesky OO terms that have slightly different meanings for every language.

https://gist.github.com/2649676

Regarding concerns about inheritance:

I think the UML diagram implies 3 levels of inheritance. The only inheritance I intended was from abstract interfaces like Parser or Writer, that only contain non-implemented methods. Because I can't guarantee that all future parsers will have common attribute and method names, the only solution I can see is to write an interface and inherit from that to make wrappers for each parser.

Thank you to Eric for this link: (https://en.wikipedia.org/wiki/Fragile_base_class). The page states that the best way to avoid problems is to use an interface. Also thank you to Pjotr for the article about mixins (http://www.cs.utexas.edu/~lin/papers/aop03.pdf). I believe I'm using inheritance in a safe and helpful manner.

James, I hope my clarification and skeleton code answer any questions you have about the implementation.

Brad, I am using if statements to determine which parser to use, but I am still calling wrappers that inherit from an interface.

Eric, I looked at the structure of PDBParser. Is the idea that a user might pass in an instance of StructureBuilder that already contained some structure and add to it? Or is there another purpose that isn't jumping out at me? In my skeleton code, I used the example of StructureBuilder, but I'm not sure if there's an advantage to passing the object rather than the object's name.

And finally, Brad and James, I will do my best to get more conversant with VCF etc. If I'm not a user, I can't be a capable developer.

Looking forward to any more structural feedback!

Cheers,

Lenna

From eric.talevich at gmail.com Thu May 10 09:36:49 2012
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 10 May 2012 09:36:49 -0400
Subject: [GSoC] GSoC python variant update
In-Reply-To:
References: <87mx5jfrjs.fsf@fastmail.fm> <20120508114043.GC14359@thebird.nl>
Message-ID:

On Wed, May 9, 2012 at 8:16 PM, Lenna Peterson wrote:
> I looked at the structure of PDBParser. Is the idea that a user might pass
> in an instance of StructureBuilder that already contained some structure
> and add to it? Or is there another purpose that isn't jumping out at me? In
> my skeleton code, I used the example of StructureBuilder, but I'm not sure
> if there's an advantage to passing the object rather than the object's name.
>
>

My understanding of the producer/consumer design in Bio.PDB (I didn't write it) is that the logic for parsing the given file format is contained in the *Parser class, and the logic for building the target object is in the *Builder class. This is useful if the target object is somewhat complex to build, as is the case with PDB's Structure/Model/Chain/Residue/Atom hierarchy -- the parser just passes raw values along to the appropriate method on the StructureBuilder class. (The Internet also points out that this design is super useful if "producing" and "consuming" are asynchronous, which is not the case here... yet?)

Regarding the shared interface, I think we've generally achieved this throughout most of Biopython by just remembering to implement the required methods on each parser and writer class -- just "parse" and "write", usually.
Essentially, it's your design minus the common base class that enforces the interface; an error in the implementation would result in an AttributeError rather than a NotImplementedError. This works because (1) Python uses duck typing, unlike C++ and Java; (2) in Biopython, each file format is usually implemented by one dedicated person who can keep it all in their head, and we don't add new file formats very rapidly; (3) we maintain pretty good coverage with our unit tests, and certainly add unit tests for new parsers. Given all that, I think your design is superior, and it's quite clear how it all works from the way you've written it.

As for the difference between passing an instance of the *Builder object versus a reference to the *Builder class (did I get that right?), it requires slightly less code from the user to pass a reference to the class. Also, if you set the object-or-class as a default argument, remember that objects are mutable, so you risk hitting one of Python's most infamous gotchas (default arguments are only evaluated once, so the second time you use the parser, you'll be adding to the original object instead of starting with a fresh copy).

Cheers,
Eric

From marian.povolny at gmail.com Sat May 12 15:46:46 2012
From: marian.povolny at gmail.com (Marjan Povolni)
Date: Sat, 12 May 2012 21:46:46 +0200
Subject: [GSoC] GSoC weekly status report No.1.1
Message-ID:

Hi all,

Here is my status report for this week:

This year we the GSoC students sure are a very creative group, just look at our numbering schemes for our status reports for the pre-coding period - everyone has his own thing going :)

And now back to the GFF3 project. I found a few more sites with big GFF3 files, those will be great for performance testing. And Robert Buels suggested that I should reuse the test suite from the Perl's Bio::GFF3::LowLevel::Parser, and I think that's a great idea.
I should definitely use that for completeness testing and I will check the test suites of other GFF3 parsers.

I have also finished the work for the first week. That means basically I'm already more than two weeks ahead of schedule. The parser is now reading data on the D side and forwarding that to Ruby line by line. That won't be faster than reading the file from Ruby, but that's a nice basic case to get data flowing from D to Ruby.

The rake tasks have been improved too. There are now two tasks for building the D library, "compile" and "compiledebug", and there is the "spec" task for running rspec tests and "features" task for running cucumber tests. The "clean" task now deletes object and library files.

There is also a problem with the D library and garbage collector. It seems this is the problem Iain Buclaw (one of the GDC developers) has warned us about. When using a D shared library, when the GC kicks in for the first time, it looks as if it collects all the static data, for example the per-module variables. And pretty much everything, even when we register with GC a chunk of memory allocated with malloc, it still gets collected. Or at least that's what it looks like. However, Iain also assured us that this will be solved by the end of this month/beginning of the next. My cucumber and rspec tests still work because they don't require enough memory for the GC to run, but to be sure that this issue doesn't interfere with development at this point, I manually disabled the GC on library initialization. I didn't try yet, but from what has been discussed in the forums, both 32 and 64-bit DLLs on windows built using DMD work fine.

I also helped Pjotr with getting our blog posts included in the RSS feed on biogems.info.
That's all for now, you can find this report on my blog too:

http://blog.mpthecoder.com/post/22919943701/gsoc-weekly-status-report-no-1-1

--
Best regards,
Marjan

From lomereiter at googlemail.com Sun May 13 16:10:45 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Mon, 14 May 2012 00:10:45 +0400
Subject: [GSoC] Weekly report No 0.5
Message-ID:

Hi all, this is yet another GSoC report.

During last week, I was mainly concentrated on D part of the project, adding functionality to it. I implemented parsing of the whole BAM file :)

Today I wrote a simple utility in D, which uses my library to convert BAM to SAM. It doesn't work with array tags yet, and not as fast as samtools, but nevertheless... On a couple of BAM files from test/data directory (namely, bins.bam and ex1_header.bam) the output is identical to that of samtools view - I checked with diff - and that kinda proves that everything works fine.

Speed issues are mainly due to using std.variant module for storing tags. It uses runtime reflection which is quite slow. Maybe, there're some other reasons. Anyway, I'm going to write my own tagged union type next week, it should improve the performance quite a bit, and also fix design flaws.

For testing tag parsing, I used file tags.bam provided to me by Peter Cock. It contains tests for all types of tags, and my library successfully passes them. Later I'll experiment with possible speed improvements, and having unit tests covering full range of possible tag types is a must.

Also, I downloaded and compiled gdc from trunk. It provides decent performance, not worse than dmd, at least. We expect gdc to gain shared library support in the next two months. Before that happens, we have to use dmd, although there're some issues with its garbage collector, causing segfaults. We discussed that with Marjan and Pjotr and decided that the best option in such circumstances would be to disable GC during development -
testing the library on small files won't consume much memory anyway.

Another thing I downloaded and compiled is Rubinius. I'm going to investigate why it hangs on BioRuby unit tests in 1.9 mode. The other mode, 1.8, seems to work fine except maybe for some very minor bugs.

During the next week, I'm going to learn how to use Cucumber and RSpec, improve D library performance a little, and start to write Ruby bindings. So it will be mostly a "Ruby week" ;)

--
Artem

From cswh at umich.edu Mon May 14 23:36:17 2012
From: cswh at umich.edu (Clayton Wheeler)
Date: Mon, 14 May 2012 23:36:17 -0400
Subject: [GSoC] GSoC week 1 status report
Message-ID: <2D9F6030-8A11-4443-B610-58464F506EE5@umich.edu>

Hi all,

I've put my first GSoC status report on my project blog:

http://csw.github.com/bioruby-maf/blog/2012/05/13/progress/

(The web version of this has 100% more hyperlinks, but here's a plain text version, too.)

This has been my first half-week of work on my Google Summer of Code project, and it's off to an exciting start. The first order of business has been to get my development environment together; since I've been a microbiology student instead of a programmer for the last year, it's taken some work. In that process, I've ended up making a few open source contributions just to get my tools working the way I want. I'm running GNU Emacs 24 and trying to take more advantage of it than I have in the past. I'll have much more to say about this in a future post.

I've also started working on the BioRuby unit test failures under JRuby, as a way of familiarizing myself with the BioRuby code base as well as the community and its development processes. Right now, JRuby in 1.8 mode is showing 6 failures and 126 errors, which is hardly confidence-inspiring for people considering using JRuby with BioRuby. This is too bad, since JRuby has some definite advantages as a Ruby implementation. After looking into these failures, I've broken them down into a few categories:
* temporary file permissions problems, likely due to some sort of Travis-CI environment issue
* a bug in JRuby's implementation of Open3.popen3, which I'm working up a bug report for
* an odd autoload problem I've filed JRUBY-6658 for and sent an accompanying RubySpec patch for
* a problem with libxml-jruby, which appears unmaintained, for which I've submitted a BioRuby patch plus JRUBY-6662
* and a small test case bug relating to floating point handling, which I've submitted a patch for.

Once these are resolved, JRuby should be passing the BioRuby unit tests in 1.8 mode, and closer to passing in 1.9 mode. (There are a few extra failures under 1.9 that I haven't sorted through yet.)

I've also gotten a start on my project itself, creating the bioruby-maf GitHub repository with a project skeleton and writing my first Cucumber feature for it. This is, in fact, my first Cucumber feature ever. However, I did spend a few cross-country flights reading the RSpec and Cucumber books last week; between that and cribbing from Pjotr's code I feel like I have some idea what I'm doing. Just assembling that feature has been useful, too, since I've had to get several of the existing MAF tools running on my machine. In fact, my test MAF data and the FASTA version of it are courtesy of bx-python, which will be my reference implementation in many respects.

Clayton Wheeler
cswh at umich.edu

From w.arindrarto at gmail.com Wed May 16 15:36:28 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 16 May 2012 21:36:28 +0200
Subject: [GSoC] GSoC Project Update -- 2
Message-ID:

Hi everyone,

I just posted my latest GSoC blog update here:
http://bow.web.id/blog/2012/05/the-final-preparations/

To summarize, I spent the last week playing with XML and SQLite, and by extension SeqIO's index and index_db. I didn't write as much real code as the week before (I was mostly working through online tutorials).
Additionally, I started writing some of the SearchIO main methods, improved the test case generation time, and added more entries to the SearchIO terms table (http://bit.ly/searchio-terms).

Finally, from this day onwards, I'm starting coding for the actual SearchIO implementation. The weekly plan will follow my proposed timeline (http://bit.ly/searchio-proposal) and I'll be writing mostly on my main SearchIO branch (https://github.com/bow/biopython/tree/searchio/Bio/SearchIO).

cheers,
Bow

P.S. I also updated my blog last week so that the GSoC entries can be tracked through their own feed. The feed is available here: http://bow.web.id/feed/atom-gsoc.xml

From arklenna at gmail.com Wed May 16 16:01:30 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Wed, 16 May 2012 16:01:30 -0400
Subject: [GSoC] GSoC python variant update 2
Message-ID:

Hi all,

Latest blog post here:
http://arklenna.tumblr.com/post/23178684555/week-2

Brief summary of this post:

I don't think `SeqFeature` or an extension thereof would be appropriate for storing Variant data; therefore, I intend to make a new structure based on `_Record` and `_Call` in PyVCF. I'm not sure if this structure should be associated with `Seq`, i.e. by naming it `SeqVariant`, and would like feedback on this question.

It could be very difficult to make PyVCF compatible with Python 2.5. Therefore, I am planning to write my project to be compatible with Python 2.6 and delaying its inclusion in the main Biopython branch until a future 2.6+ Biopython release. Alternate suggestions are welcome.

This week I will solidify the structure so I am ready for the end of the community bonding period and the start of coding on May 21.
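
To make the idea of a record/call structure for variant data a little more concrete, here is a rough sketch only. The class and field names (`VariantRecord`, `SampleCall`, `chrom`, `pos`, `calls`, and so on) are hypothetical and chosen for illustration; they are not PyVCF's or Biopython's actual API. The sketch also uses modern Python (dataclasses) for brevity rather than the Python 2.6 target discussed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SampleCall:
    """One sample's genotype call at a site (hypothetical, for illustration)."""
    sample: str
    genotype: Optional[str] = None  # e.g. "0/1" or "0|1"; None if absent
    data: Dict[str, str] = field(default_factory=dict)  # other per-sample fields

    @property
    def called(self) -> bool:
        # Treated as called unless the genotype is missing or every allele is "."
        if self.genotype is None:
            return False
        alleles = self.genotype.replace("|", "/").split("/")
        return any(allele != "." for allele in alleles)


@dataclass
class VariantRecord:
    """One variant site plus its per-sample calls (hypothetical names)."""
    chrom: str
    pos: int  # 1-based position, as in VCF
    ref: str
    alt: List[str]
    calls: List[SampleCall] = field(default_factory=list)

    @property
    def is_snp(self) -> bool:
        # Single-base reference and single-base alternates only
        return len(self.ref) == 1 and all(len(a) == 1 for a in self.alt)

    def call_for(self, sample: str) -> Optional[SampleCall]:
        # Linear scan is fine for a sketch; returns None for unknown samples
        for call in self.calls:
            if call.sample == sample:
                return call
        return None


record = VariantRecord(
    chrom="20", pos=14370, ref="G", alt=["A"],
    calls=[SampleCall("NA00001", "0|0"), SampleCall("NA00002", "./.")],
)
print(record.is_snp)
print(record.call_for("NA00002").called)
```

Keeping the per-sample calls as plain objects attached to the site record is one way such data could later be converted to or from other containers; whether that is preferable to reusing an existing class is exactly the open question above.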
Regards,
Lenna

From chapmanb at 50mail.com Wed May 16 20:19:01 2012
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 16 May 2012 20:19:01 -0400
Subject: [GSoC] [Biopython-dev] GSoC python variant update 2
In-Reply-To:
References:
Message-ID: <871umju0ay.fsf@fastmail.fm>

Lenna;
Thanks for the update on your thinking. Sounds like you are right on track.

> I don't think `SeqFeature` or an extension thereof would be
> appropriate for storing Variant data; therefore, I intend to make a
> new structure based on `_Record` and `_Call` in PyVCF. I'm not sure if
> this structure should be associated with `Seq`, i.e. by naming it
> `SeqVariant`, and would like feedback on this question.

I'm agreed about SeqFeature. Would you consider using _Record/_Call directly? Then you could provide functionality to convert this to/from basic SeqFeatures if needed. An advantage of using these structures explicitly is that you could plug in compatible APIs, like Aaron Quinlan's CyVCF:

https://github.com/arq5x/cyvcf

I don't think we should add a new representation class unless we explicitly need to store additional information.

> It could be very difficult to make PyVCF compatible with Python
> 2.5. Therefore, I am planning to write my project to be compatible
> with Python 2.6 and delaying its inclusion in the main Biopython
> branch until a future 2.6+ Biopython release. Alternate suggestions
> are welcome.

I'm agreed with this. I don't think 2.5 is as entrenched as 2.4 was, so I think we could move on a deprecation path for it. It's more important to be forward compatible with 3.x, and 2.6+ should make that easier.

Thanks again for sharing all your thoughts and digging into this,
Brad

From rbuels at gmail.com Thu May 17 08:52:54 2012
From: rbuels at gmail.com (Robert Buels)
Date: Thu, 17 May 2012 08:52:54 -0400
Subject: [GSoC] students: upcoming dates
Message-ID: <4FB4F4A6.8050802@gmail.com>

Hi students,

There are a couple of important dates coming up soon, don't forget!
May 18 (TOMORROW!): deadline to submit tax forms and proof of enrollment. Do you want to get paid?

May 21: start of the formal coding period

Keep up the good work, I'm very happy to have you working with us. :-)

Rob

--
Robert Buels
OBF GSoC 2012 Org. Administrator

From marian.povolny at gmail.com Mon May 21 05:36:01 2012
From: marian.povolny at gmail.com (Marjan Povolni)
Date: Mon, 21 May 2012 11:36:01 +0200
Subject: [GSoC] GSoC weekly status report No.1.2
Message-ID:

http://blog.mpthecoder.com/post/23473020471/gsoc-weekly-status-report-no-1-2

It's been three months since my first introduction on the BioRuby ML and it's been great. As it is the end of the GSoC community bonding period, I would like to thank Pjotr most of all, and then all the other community members, for their help and support. It's a great feeling to become a member of a small but growing community of enthusiasts who work together for the good of all of us and for fun.

As Pjotr already did, I would like to encourage you to write blog posts about using Ruby in bioinformatics and let us include them in our RSS and news feeds on the biogems.info website. The site supports both RSS and Atom feeds now, and similar functionality will be part of the new website for BioRuby once it's finished. The code also supports including only posts from one category/tag, so you can tag your posts with BioRuby or similar, and only those posts will be included in the RSS feed on biogems.info.

The GSoC coding period starts today. It's time for me to roll my sleeves up and start working on the GFF3 parser full-time.

--
Marjan

From lomereiter at googlemail.com Mon May 21 07:58:46 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Mon, 21 May 2012 15:58:46 +0400
Subject: [GSoC] Weekly report #1
Message-ID:

Hi all,

here's my report about the past week:
http://lomereiter.wordpress.com/2012/05/21/gsoc-weekly-report-1/

Brief summary:

1) BioRuby unit tests and Rubinius bugs -
I posted 2 issues in the Rubinius bug tracker, and one of them is already solved. Rubinius in 1.8 mode should now pass all tests. The situation with 1.9 mode is not that great, but I'm working on it.

2) I started to collect D optimization tricks on a GitHub wiki page. Currently, it contains just 6 tips, but this number is going to grow. Probably, another page will be created soon to keep best practices for connecting Ruby and D. Since my project and Marjan's have a lot in common, I think it's important for us not to waste time on something that has already been investigated.

3) During the week, I learned a bit about BDD and Cucumber, enjoyed it, and wrote my first two features.

4) Measurements of object instantiation time in Ruby suggest that exposing low-level D functions via FFI makes little sense. I'm going to discuss with mentors which high-level functions should be available, and make that into Cucumber features.

--
Artem

From cswh at umich.edu Mon May 21 11:50:18 2012
From: cswh at umich.edu (Clayton Wheeler)
Date: Mon, 21 May 2012 11:50:18 -0400
Subject: [GSoC] GSoC week 2 status report
Message-ID: <0D2AC678-1DD1-40B9-B100-EDA3429B3D87@umich.edu>

Hi all,

Here's my report on last week's work:

http://csw.github.com/bioruby-maf/blog/2012/05/21/week_2_progress/

This was my second week of work on my GSoC project, and the last week of the "community bonding" period before the official start of coding. A major focus of mine was BioRuby's phyloXML support; it uses libxml, which has been causing unit test failures under JRuby. In the end, the best course of action seemed to be to split the phyloXML support out into a separate plugin, which I have done as the bio-phyloxml gem. This will remove BioRuby's dependency on XML libraries entirely and that JRuby issue along with it. At the same time, users of the phyloXML code should be able to continue using it with no substantive changes.
Separately, I began porting this phyloXML code to use Nokogiri instead of libxml-ruby, but ran into difficulties with this effort. While it is possible, and the library APIs are very similar, the code uses relatively low-level XML processing APIs in ways that seem to be sensitive to subtle differences in text node and namespace semantics between the two libraries. Substantial restructuring of the code and the addition of quite a few unit tests might be necessary to carry out such a port with confidence that the resulting code would work well.

Also, someone else submitted a JRuby patch for JRUBY-6658, one of the major causes of BioRuby's unit test failures with JRuby; once a fix is integrated, we'll be close to having all the tests passing under JRuby.

I identified another JRuby bug, JRUBY-6666, causing several unit test failures. This one affects BioRuby's code for running external commands, so it would be likely to be encountered in production use. For this one, I also worked up a patch.

I also spent some time preparing a performance testing environment, for evaluating existing MAF implementations as well as my own. This will be important, since I will be considering the use of an existing C parser. I will also want to ensure that the performance of my code is competitive with the alternatives. Lacking any hardware more powerful than a MacBook Air, I am setting this up with Amazon EC2. To simplify environment setup, I'll be using Chef. I've already set up a Chef repository with configuration logic, and some rudimentary code to streamline launching Ubuntu machines on EC2 and bootstrapping a Chef environment. To save money, I plan to make use of EC2 Spot Instances, which are perfect for instances that only need to run for a few hours for batch tasks.
Clayton Wheeler
cswh at umich.edu

From bonnal at ingm.org Tue May 22 05:21:42 2012
From: bonnal at ingm.org (Raoul Bonnal)
Date: Tue, 22 May 2012 11:21:42 +0200
Subject: [GSoC] [BioRuby] GSoC week 2 status report
In-Reply-To: <0D2AC678-1DD1-40B9-B100-EDA3429B3D87@umich.edu>
Message-ID:

Hi Clayton,
Well done, and thanks for your contributions to the BioRuby and JRuby communities.

For your computing issue I have two solutions:
1) I can create a VM and give you access; I need to contact my IT dept.
2) Could Amazon provide some VM for our students?

From w.arindrarto at gmail.com Tue May 22 06:21:25 2012
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Tue, 22 May 2012 12:21:25 +0200
Subject: [GSoC] GSoC Project Update -- 3
Message-ID:

Hi everyone,

I just posted my latest GSoC update here:
http://bow.web.id/blog/2012/05/from-bio-import-searchio/

To summarize the post and what I've done the last week:

* I finished writing all base SearchIO objects and tested them as well.
These objects are the QueryResult object (previously called Result), representing search results from a single query; the Hit object, representing pairwise alignments from a single database hit; and the HSP object, representing a single alignment. I've also written the docstrings for these objects, so you can run help() on them in an interpreter session. The post also includes a very brief outline of the base objects' features, if you are curious.

* Using this, I was able to write a working prototype for SearchIO BLAST XML parsing. This prototype has also been tested, using the test cases I've generated previously. For now, it's implemented using our NCBIXML parser, just so that people can have a taste of what SearchIO will feel like.

If you want to play around with the prototype, it's available here: https://github.com/bow/biopython/tree/searchio-blastxml. As always, feel free to notify me of suggestions, critiques, and/or feature requests :).

regards,
Bow

From rbuels at gmail.com Tue May 22 16:15:15 2012
From: rbuels at gmail.com (Robert Buels)
Date: Tue, 22 May 2012 16:15:15 -0400
Subject: [GSoC] [BioRuby] GSoC week 2 status report
In-Reply-To:
References:
Message-ID: <4FBBF3D3.4040003@gmail.com>

On 05/22/2012 05:21 AM, Raoul Bonnal wrote:
> 2) Could Amazon provide some VM for our students?

AWS allows quite a bit of free usage at no charge:
http://aws.amazon.com/free/

If you need more, you could apply for a grant from them.
http://aws.amazon.com/education/

Rob

From saketkc at gmail.com Tue May 22 16:17:01 2012
From: saketkc at gmail.com (Saket Choudhary)
Date: Tue, 22 May 2012 21:17:01 +0100
Subject: [GSoC] [BioRuby] GSoC week 2 status report
In-Reply-To: <4FBBF3D3.4040003@gmail.com>
References: <4FBBF3D3.4040003@gmail.com>
Message-ID:

I have a free $50 credit on AWS. I would like to give it to BioRuby, if possible.

On 22 May 2012 21:15, Robert Buels wrote:

> On 05/22/2012 05:21 AM, Raoul Bonnal wrote:
>
>> 2) Could Amazon provide some VM for our students?
>
> AWS allows quite a bit of free usage at no charge:
> http://aws.amazon.com/free/
> If you need more, you could apply for a grant from them.
> http://aws.amazon.com/education/
>
> Rob
> _______________________________________________
> GSoC mailing list
> GSoC at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/gsoc

From arklenna at gmail.com Wed May 23 17:56:03 2012
From: arklenna at gmail.com (Lenna Peterson)
Date: Wed, 23 May 2012 17:56:03 -0400
Subject: [GSoC] GSoC python variant update 3
Message-ID:

Hi all,

Latest blog post here:
http://arklenna.tumblr.com/post/23630012065/week-1

Brief summary:

I have reversed my prior conclusion that `SeqRecord` is inadequate for holding variant data. It is still not ideal, but the advantages of using an existing native object are substantial, and the disadvantages can be reduced by creating an accessor for the variant-specific data within a `SeqRecord`.

I've made an outline of how I would store the information returned by PyVCF within `SeqRecord` and `SeqFeature` objects. It includes a few questions about the most logical way to store certain variant information.

As the coding period has now started, I'll be pushing some prototypes to GitHub in the near future.

Lenna

From cjfields at illinois.edu Thu May 24 01:14:20 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 24 May 2012 05:14:20 +0000
Subject: [GSoC] [BioRuby] Weekly report #1
In-Reply-To:
References:
Message-ID:

I think the mentioned D wrappers on the SWIG page are ANSI C/C++ libraries wrapped for D, not D code/libs/etc wrapped for Ruby, unless I'm mistaken...

chris

On May 23, 2012, at 11:30 PM, Mic wrote:

> D to Ruby: http://www.swig.org/compare.html
>
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From cswh at umich.edu Thu May 24 01:33:40 2012
From: cswh at umich.edu (Clayton Wheeler)
Date: Thu, 24 May 2012 01:33:40 -0400
Subject: [GSoC] [BioRuby] GSoC week 2 status report
In-Reply-To:
References:
Message-ID: <9DBCD042-7086-4F4B-ABB9-1A7F63C089B8@umich.edu>

Thanks for the offers of help, everybody. Raoul, if it's convenient for you to set up a test VM in house, that would probably make the most sense. I don't think it's a pressing need at this point, but let's look into that. If we run into issues, we can revisit the EC2 options.
(I've had an AWS account too long to qualify for the free usage tier, unfortunately.) An Amazon grant might be worth looking at, especially if we can use it to publicly host, say, BGZF-compressed pre-indexed MAF data sets also. On the other hand, that might be overkill just for my needs; using spot-priced instances, I expect I could do all the testing I need for under $50.

Clayton Wheeler
cswh at umich.edu

From lomereiter at googlemail.com Thu May 24 01:40:54 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Thu, 24 May 2012 09:40:54 +0400
Subject: [GSoC] [BioRuby] Weekly report #1
In-Reply-To:
References:
Message-ID:

Chris is right. Currently, it's easier to write everything manually. Once I've developed some 'best practices', I may put them into compile-time algorithms and generate bindings from D. (The language has compile-time introspection but doesn't have run-time introspection, probably because that would hurt performance.)

On Thu, May 24, 2012 at 9:14 AM, Fields, Christopher J <cjfields at illinois.edu> wrote:

> I think the mentioned D wrappers on the SWIG page are ANSI C/C++ libraries
> wrapped for D, not D code/libs/etc wrapped for Ruby, unless I'm mistaken...
>
> chris

From mictadlo at gmail.com Thu May 24 00:30:22 2012
From: mictadlo at gmail.com (Mic)
Date: Thu, 24 May 2012 14:30:22 +1000
Subject: [GSoC] [BioRuby] Weekly report #1
In-Reply-To:
References:
Message-ID:

D to Ruby: http://www.swig.org/compare.html

From cswh at umich.edu Fri May 25 16:42:13 2012
From: cswh at umich.edu (Clayton Wheeler)
Date: Fri, 25 May 2012 16:42:13 -0400
Subject: [GSoC] New blog post on this week's work
Message-ID: <329E20F7-BF3F-4201-ADD0-ABCDFC5ECDE4@umich.edu>

Hi all,

I've written a new blog post on the work I did on my MAF parser this week:

http://csw.github.com/bioruby-maf/blog/2012/05/25/first_milestone/

It covers parser implementation and performance issues, BDD, and tools.

Clayton Wheeler
cswh at umich.edu

From john.woods at marcottelab.org Thu May 24 10:01:08 2012
From: john.woods at marcottelab.org (John Woods)
Date: Thu, 24 May 2012 09:01:08 -0500
Subject: [GSoC] [BioRuby] GSoC week 2 status report
In-Reply-To:
References: <0D2AC678-1DD1-40B9-B100-EDA3429B3D87@umich.edu>
Message-ID:

If I can just suggest, there's a startup pitch out there which was formerly known as Happy Science Coding, now Appsoma, which lets you run Ruby code on Rackspace instances. It may or may not be appropriate for what you want to do. It's not EC2, but it is a VM (right?).

http://appsoma.com/

It's still a bit buggy with Ruby. If you have trouble, email Zack (see the "About us" page). He's fairly responsive.

John
SciRuby

From lomereiter at googlemail.com Sun May 27 14:27:43 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Sun, 27 May 2012 22:27:43 +0400
Subject: [GSoC] weekly report #2
Message-ID:

Hi all,

I wrote a blog post about the past week:
http://lomereiter.wordpress.com/2012/05/27/gsoc-weekly-report-2/

Topics are:

1) I have quite good validation module for BAM now. More kinds of checks can be added, just request them :)

2) Also I started to implement random access via BAI file, just because I mostly finished what I planned for the first two weeks, and random access seems to be one of the most important things.
Also it's not mentioned in the blog, but I started to work on BGZF gem, as Pjotr suggested to me. I'll try to document it and publish the first version next week. Currently I write it in pure Ruby. From marian.povolny at gmail.com Sun May 27 15:21:48 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Sun, 27 May 2012 21:21:48 +0200 Subject: [GSoC] GSoC weekly status report No.1.9 Message-ID: http://blog.mpthecoder.com/post/23877896288/gsoc-weekly-status-report-no-1-9 This is the final post in 1.x series, I promise. The last week was spent adding support of parsing lines into records. It was a lot of work, and when I read the comments from my mentor, I wasn't happy. But I agree with him, I did make it more complicated than it had to be (the C API, for example), I should spend some time polishing and refactoring the D side, and my cucumber features should be split into more features. So that's the rough plan for the next week. -- Marjan From bonnal at ingm.org Mon May 28 04:50:19 2012 From: bonnal at ingm.org (Raoul Bonnal) Date: Mon, 28 May 2012 10:50:19 +0200 Subject: [GSoC] DevTools In-Reply-To: <329E20F7-BF3F-4201-ADD0-ABCDFC5ECDE4@umich.edu> Message-ID: In case you want to use RedMine I can give you the license for free, any bioruby developer can request it. From p.j.a.cock at googlemail.com Mon May 28 05:00:30 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 28 May 2012 10:00:30 +0100 Subject: [GSoC] [BioRuby] DevTools In-Reply-To: References: <329E20F7-BF3F-4201-ADD0-ABCDFC5ECDE4@umich.edu> Message-ID: On Mon, May 28, 2012 at 9:50 AM, Raoul Bonnal wrote: > In case you want to use RedMine I can give you the license for free, any > bioruby developer can request it. > ??? Redmine is licensed under the GPL. Did you mean admin rights on the OBF RedMine instance, for example to close bug reports?
https://redmine.open-bio.org/projects/bioruby Peter From p.j.a.cock at googlemail.com Mon May 28 05:07:39 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 28 May 2012 10:07:39 +0100 Subject: [GSoC] weekly report #2 In-Reply-To: References: Message-ID: On Sun, May 27, 2012 at 7:27 PM, Artem Tarasov wrote: > Hi all, > > I wrote a blog post about the past week: > http://lomereiter.wordpress.com/2012/05/27/gsoc-weekly-report-2/ > > Topics are: > 1) I have quite good validation module for BAM now. More kinds of checks > can be added, just request them :) > The blog mentions you think you found some issues with tags.bam file - could you elaborate (direct email is fine), and tell me about any future issues please? > 2) Also I started to implement random access via BAI file, just because I > mostly finished what I planned for the first two weeks, and random access > seems to be one of the most important things. > > Also it's not mentioned in the blog, but I started to work on BGZF gem, as > Pjotr suggested to me. I'll try to document it and publish the first > version next week. Currently I write it in pure Ruby. > I guess, given my suggestion that Clayton might be able to use your BGZF support code for compressed MAF files, it does make sense to package the BGZF support as a Bio Gem. Good point Pjotr. http://lists.open-bio.org/pipermail/bioruby/2012-May/002301.html Peter From bonnal at ingm.org Mon May 28 05:03:01 2012 From: bonnal at ingm.org (Raoul Bonnal) Date: Mon, 28 May 2012 11:03:01 +0200 Subject: [GSoC] [BioRuby] DevTools In-Reply-To: Message-ID: Ahhhhhhhhhhh I mean RubyMine http://www.jetbrains.com/ruby/ sorry On 28/05/12 11.00, "Peter Cock" wrote: > > > On Mon, May 28, 2012 at 9:50 AM, Raoul Bonnal wrote: >> In case you want to use RedMine I can give you the license for free, any >> bioruby developer can request it. > > ??? Redmine is licensed under the GPL. > > Did you mean admin rights on the OBF RedMine instance, for > example to close bug reports?
> https://redmine.open-bio.org/projects/bioruby > > Peter > > From lomereiter at googlemail.com Mon May 28 05:29:24 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 28 May 2012 13:29:24 +0400 Subject: [GSoC] weekly report #2 In-Reply-To: References:

Message-ID: > > The blog mentions you think you found some issues with tags.bam > file - could you elaborate (direct email is fine), and tell me about any > future issues please? > They are very minor. Specification says (1.4) that 'QNAME' should be [!-?A-~], that doesn't include space and '@' sign, and that (1.5) printable characters in tags with 'A' type are [!-~], i.e. only space is not allowed. BTW, I looked at your code which generated the file, it uses range(32, 127) both for 'Z' and 'A' types of tags, even though it's explicitly written in comments right above these lines where space should be included, and where it shouldn't :) From p.j.a.cock at googlemail.com Mon May 28 05:48:21 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 28 May 2012 10:48:21 +0100 Subject: [GSoC] weekly report #2 In-Reply-To: References:

Message-ID: On Mon, May 28, 2012 at 10:29 AM, Artem Tarasov wrote: > The blog mentions you think you found some issues with tags.bam >> file - could you elaborate (direct email is fine), and tell me about any >> future issues please? >> > > They are very minor. Specification says (1.4) that 'QNAME' should be > [!-?A-~], that doesn't include space and '@' sign, > Fair point. I should fix that. The '@' was presumably excluded in the v1.3 spec to avoid confusion with FASTQ files. > and that (1.5) > printable characters in tags with 'A' type are [!-~], i.e. only space > is not allowed. > > BTW, I looked at your code which generated the file, it uses > range(32, 127) both for 'Z' and 'A' types of tags, even though > it's explicitly written in comments right above these lines where > space should be included, and where it shouldn't :) > Good point, that is a change in the specification I hadn't noticed. Back in v1.2, both A and Z were just "printable character" and "printable string", which to me includes the space. It was only in v1.3 that this was made explicit with a regex, and space ceased to be allowed in the A tag. I wonder if that was an accident or deliberate? You'll notice that samtools doesn't complain about these deviations from the specification but it doesn't attempt any validation. I'm not sure if Picard checks this. Thanks, Peter From w.arindrarto at gmail.com Wed May 30 17:44:04 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 30 May 2012 23:44:04 +0200 Subject: [GSoC] GSoC Project Update -- 4 Message-ID: Hi everyone, I just posted my latest GSoC update here: http://bow.web.id/blog/2012/05/assembling-the-parsers/ To summarize: I've been working on more SearchIO parsers last week, adding more formats to support. We now have a SearchIO-specific BLAST+ XML parser (it was first implemented on top of NCBIXML). It uses ElementTree as the base XML parser, with promising performance gains.
I've also completed SearchIO's blast tabular parser, which takes in the BLAST+ tabular output files with or without headers. If the tabular file has headers, it can parse any number of columns in any order as long as the columns with hit and query IDs are present. Finally, I've finished writing the HMMER plain text parser. For now, the parser can handle outputs from hmmscan and hmmsearch, single and multiple queries. All these parsers have been tested using the test cases I've generated previously. Additionally, I also had a public discussion with Peter on GitHub regarding SearchIO objects here: https://github.com/bow/biopython/commit/69a0ab64dfa7718f7455ca4c3961e95277fb4dbc#-P0, if anyone is interested. It started as a discussion on some behaviors of the HSP object, but also relates to other issues raised earlier (the dynamic SeqRecord coordinates Peter brought up earlier and Biopython's platform support). That's it for this week :). cheers, Bow From marian.povolny at gmail.com Sun Jun 3 17:07:18 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Sun, 3 Jun 2012 23:07:18 +0200 Subject: [GSoC] GSoC weekly status report No.2 Message-ID: http://blog.mpthecoder.com/post/24355573626/gsoc-weekly-status-report-no-2 It's the end of the second week of GSoC and time for a new report. I spent the last week mostly doing work based on criticism from my mentor. The D parser which parses lines into records is now in a pretty good shape, and tested. Today I received a list of new issues that need to be resolved before going further, but they're not that much work and I can plan some new developments. A utility for validation is in planning for next week, which could be also used for performance measurement. And after that I will turn to making the current parser parallel. Also, tomorrow I'll be defending my Masters Thesis, after which I should be able to concentrate more on the GFF3 parser.
From arklenna at gmail.com Sun Jun 3 22:39:47 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Sun, 3 Jun 2012 22:39:47 -0400 Subject: [GSoC] GSoC python variant update 4 Message-ID: Blog post (entirely reproduced in this email): http://arklenna.tumblr.com/post/24378549953/ I started implementing storage of VCF data in `SeqRecord` and `SeqFeature`. I digressed, spending a few days experimenting with overloading `__getattr__()` in lieu of manually writing properties. Then it occurred to me that if, as Reece pointed out, a variant doesn't contain the actual sequence but a reference to the sequence, the advantages to using `SeqRecord` are minimal or possibly negative. In my experience, the highest performance for filtering large amounts of data is SQL. SQL has the advantage of scalability: SQLite now ships with Python, users can choose to run their own MySQL/PGSQL server, and I've read about a few approaches to GPU accelerated SQL. My initial glances at BioSQL, GMOD, etc. didn't show anything specifically designed for variants (again, a focus on storage of the sequence itself) so I implemented my own interface. Currently, the `parse_all()` method is very slow (approximately 260 seconds for a file with 240,000 variants when the parsing takes 5-10 seconds) and I am investigating why. My first step will be to reduce commit frequency. With a SQL backend, it seems superfluous to have a dedicated variant representation within Python. The SQL result object should allow for straightforward retrieval of data by name. I'm storing "misc" data in a SQL text field using JSON, which is also easy to access. 
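[Editorial sketch, not part of the original message.] The two ideas in the paragraphs above, reducing commit frequency and keeping "misc" data as JSON in a text column, can be illustrated with Python's standard `sqlite3` and `json` modules. The table layout, the `store_variants` name, and the dictionary keys are hypothetical and are not taken from the project's code; the point is only that rows are inserted in batches with one commit per batch rather than one per row.

```python
import json
import sqlite3


def store_variants(conn, variants, batch_size=10000):
    """Insert variant dicts in batches, committing once per batch.

    The schema and field names are illustrative assumptions only.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variant "
        "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, misc TEXT)"
    )
    insert = "INSERT INTO variant VALUES (?, ?, ?, ?, ?)"
    stored = 0
    batch = []
    for v in variants:
        # Anything without its own column is serialised to JSON text.
        misc = json.dumps(v.get("misc", {}))
        batch.append((v["chrom"], v["pos"], v["ref"], v["alt"], misc))
        if len(batch) >= batch_size:
            conn.executemany(insert, batch)
            conn.commit()  # one commit per batch, not per row
            stored += len(batch)
            batch = []
    if batch:
        conn.executemany(insert, batch)
        conn.commit()
        stored += len(batch)
    return stored


def open_db(path=":memory:"):
    """Open a connection whose rows support retrieval of columns by name."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    return conn
```

With `sqlite3.Row` as the row factory, a result can be read as `row["pos"]`, and the JSON column decoded with `json.loads(row["misc"])`. As noted later in the thread, values stored this way cannot be filtered by plain SQL conditions.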
Next: * Looking at BioSQL/GMOD etc to see if there is an existing standard I should be using/following * Deciding the extent of the convenience functions I wish to implement * Thinking about the most efficient way to filter records on the way into the SQL database From arklenna at gmail.com Mon Jun 4 09:30:15 2012 From: arklenna at gmail.com (Lenna Peterson) Date: Mon, 4 Jun 2012 09:30:15 -0400 Subject: [GSoC] [Biopython-dev] GSoC python variant update 4 In-Reply-To: References:

Message-ID: <2D58B8E1-5056-445F-B623-56B7136048BC@gmail.com> On Jun 4, 2012, at 1:11 AM, Mic wrote: > Hi Lenna, > Big companies are using http://en.wikipedia.org/wiki/NoSQL > > What kind of ORM do you want to use ( http://en.wikipedia.org/wiki/SQLAlchemy or http://en.wikipedia.org/wiki/Storm_%28software%29 ) > > Cheers, > Mic > > Hey Mic, Looks like there has been some talk about SQLAlchemy in Biopython: http://biopython.org/pipermail/biopython/2009-August/005455.html Lenna From mictadlo at gmail.com Mon Jun 4 01:11:56 2012 From: mictadlo at gmail.com (Mic) Date: Mon, 4 Jun 2012 15:11:56 +1000 Subject: [GSoC] [Biopython-dev] GSoC python variant update 4 In-Reply-To: References: Message-ID: Hi Lenna, Big companies are using http://en.wikipedia.org/wiki/NoSQL What kind of ORM do you want to use ( http://en.wikipedia.org/wiki/SQLAlchemy or http://en.wikipedia.org/wiki/Storm_%28software%29 ) Cheers, Mic On Mon, Jun 4, 2012 at 12:39 PM, Lenna Peterson wrote: > Blog post (entirely reproduced in this email): > http://arklenna.tumblr.com/post/24378549953/ > > I started implementing storage of VCF data in `SeqRecord` and > `SeqFeature`. I digressed, spending a few days experimenting with > overloading `__getattr__()` in lieu of manually writing properties. > Then it occurred to me that if, as Reece pointed out, a variant > doesn't contain the actual sequence but a reference to the sequence, > the advantages to using `SeqRecord` are minimal or possibly negative. > > In my experience, the highest performance for filtering large amounts > of data is SQL. SQL has the advantage of scalability: SQLite now ships > with Python, users can choose to run their own MySQL/PGSQL server, and > I've read about a few approaches to GPU accelerated SQL. > > My initial glances at BioSQL, GMOD, etc. didn't show anything > specifically designed for variants (again, a focus on storage of the > sequence itself) so I implemented my own interface.
Currently, the > `parse_all()` method is very slow (approximately 260 seconds for a > file with 240,000 variants when the parsing takes 5-10 seconds) and I > am investigating why. My first step will be to reduce commit > frequency. > > With a SQL backend, it seems superfluous to have a dedicated variant > representation within Python. The SQL result object should allow for > straightforward retrieval of data by name. I'm storing "misc" data in > a SQL text field using JSON, which is also easy to access. > > Next: > > * Looking at BioSQL/GMOD etc to see if there is an existing standard I > should be using/following > * Deciding the extent of the convenience functions I wish to implement > * Thinking about the most efficient way to filter records on the way > into the SQL database > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From chapmanb at 50mail.com Mon Jun 4 12:04:15 2012 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 04 Jun 2012 12:04:15 -0400 Subject: [GSoC] GSoC python variant update 4 In-Reply-To: References: Message-ID: <87haurjc74.fsf@fastmail.fm> Lenna; Thanks for the summary. A couple of thoughts on the directions: - For property access, I think the best approach would be to store all of the arbitrary key/value pairs from INFO in SeqRecord annotations, then only use hand coded @properties to expose the most useful. That gives people access to the most useful ones (as determined by you) with attributes but lets anyone dig in and get custom ones. - If you'd like to explore an SQL backend, you should have a look at Gemini: https://github.com/arq5x/gemini which stores variants in a SQLite database along with associated annotations.
It's a flat structure based on adding and exposing useful annotations on variants: https://github.com/arq5x/gemini/blob/master/gemini/database.py Reinventing a new SQL store representation is a lot of work so it might be good to work off what others folks are currently doing and try to provide a Biopython friendly front end, much as you're exploring with PyVCF. Hope these are useful. Let me know if you have any questions at all, Brad > Blog post (entirely reproduced in this email): > http://arklenna.tumblr.com/post/24378549953/ > > I started implementing storage of VCF data in `SeqRecord` and > `SeqFeature`. I digressed, spending a few days experimenting with > overloading `__getattr__()` in lieu of manually writing properties. > Then it occurred to me that if, as Reece pointed out, a variant > doesn't contain the actual sequence but a reference to the sequence, > the advantages to using `SeqRecord` are minimal or possibly negative. > > In my experience, the highest performance for filtering large amounts > of data is SQL. SQL has the advantage of scalability: SQLite now ships > with Python, users can choose to run their own MySQL/PGSQL server, and > I've read about a few approaches to GPU accelerated SQL. > > My initial glances at BioSQL, GMOD, etc. didn't show anything > specifically designed for variants (again, a focus on storage of the > sequence itself) so I implemented my own interface. Currently, the > `parse_all()` method is very slow (approximately 260 seconds for a > file with 240,000 variants when the parsing takes 5-10 seconds) and I > am investigating why. My first step will be to reduce commit > frequency. > > With a SQL backend, it seems superfluous to have a dedicated variant > representation within Python. The SQL result object should allow for > straightforward retrieval of data by name. I'm storing "misc" data in > a SQL text field using JSON, which is also easy to access. 
> > Next: > > * Looking at BioSQL/GMOD etc to see if there is an existing standard I > should be using/following > * Deciding the extent of the convenience functions I wish to implement > * Thinking about the most efficient way to filter records on the way > into the SQL database > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From lomereiter at googlemail.com Mon Jun 4 14:02:58 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 4 Jun 2012 22:02:58 +0400 Subject: [GSoC] Weekly report #3 Message-ID: Hello all, the post is here: http://lomereiter.wordpress.com/2012/06/04/gsoc-weekly-report-3/ I've implemented random access to BAM file, using index file. Also I created a generic function for memoization which stores decompressed blocks in cache, following some desired cache strategy. Currently, I use simple FIFO cache. Also I studied how to make SAM output faster. I came to the conclusion that not only D standard library functions, but even ones of *printf family are too slow for this purpose, because they have to parse format string. Instead, I need to use specialized functions for printing integers and floats. Currently, output is about 4x slower than in samtools. So I have to take back some of my harsh words about its code and say that there is something to learn from there. It indeed uses its own functions for integer output, and also uses string buffer to do less calls (system functions can't be inlined). I'll use this approach, too, so very soon my library will be usable in pipelines, but only for output. Then I'm going to move on to allow alignments to be modified and outputted to BAM. After that, SAM parser needs to be implemented, and I'm going to use Ragel (finite-state machine compiler) for that purpose. So by the beginning of July I want to have SAM<->BAM conversion working, with a good speed. 
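[Editorial sketch, not part of the original message.] The "generic function for memoization" with a FIFO strategy described earlier in this report can be sketched in Python using `collections.OrderedDict`. This is only an illustration of the caching idea, not the D implementation; `fifo_memoize`, its `max_size` default, and the `cache` attribute are hypothetical names chosen for the example.

```python
from collections import OrderedDict


def fifo_memoize(func, max_size=512):
    """Wrap func(key) with a cache that evicts the oldest inserted entry.

    Unlike LRU, a cache hit does not reorder entries, so a lookup costs
    one dictionary access and no bookkeeping.
    """
    cache = OrderedDict()

    def wrapper(key):
        if key in cache:
            return cache[key]
        value = func(key)
        if len(cache) >= max_size:
            cache.popitem(last=False)  # drop the first-inserted entry
        cache[key] = value
        return value

    wrapper.cache = cache  # exposed for inspection only
    return wrapper
```

In the BAM setting, `func` would be the routine that decompresses a BGZF block given its file offset, so that repeated seeks into the same block avoid decompressing it again.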
Add to that first release of biogem, and those are my plans for this month. From p.j.a.cock at googlemail.com Mon Jun 4 15:36:25 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 4 Jun 2012 20:36:25 +0100 Subject: [GSoC] Weekly report #3 In-Reply-To: References: Message-ID: On Mon, Jun 4, 2012 at 7:02 PM, Artem Tarasov wrote: > Hello all, > > the post is here: > http://lomereiter.wordpress.com/2012/06/04/gsoc-weekly-report-3/ > > I've implemented random access to BAM file, using index file. Also I > created a generic function for memoization which stores decompressed > blocks in cache, following some desired cache strategy. Currently, I > use simple FIFO cache. That sounds good. We've talked a little bit about the block caching strategy for Biopython's BGZF support - dropping the least recently used block would be good (LRU) but requires the overhead of storing and recording timestamps on each access. Currently my Biopython BGZF code just drops a cached block 'at random' (actually based on the dictionary hashing algorithm), and switching to FIFO was something I planned to try next (easily done with Python's OrderedDict class). FIFO seems like a good solution as the overheads are much lower than LRU. Have you got any good random access benchmarks to try this out with? i.e. something non-random, such as pulling mates of paired end reads. How many BGZF blocks are you keeping in the cache, and why? Are you thinking about BGZF output yet (which will be required in order to write BAM files)? Regards, Peter From lomereiter at googlemail.com Mon Jun 4 16:07:03 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Tue, 5 Jun 2012 00:07:03 +0400 Subject: [GSoC] Weekly report #3 In-Reply-To: References:

Message-ID: > Have you got any good random access benchmarks to try this out > with? i.e. something non-random, such as pulling mates of paired > end reads. > Currently, no. Please suggest your ideas about benchmarks because I suspect that you have much more experience with BAM files and better knowledge of use patterns. How many BGZF blocks are you keeping in the cache, and why? > Currently, 512. I don't know why, seems like a reasonable number (about 30MB of RAM). Maybe it should be a runtime parameter but I doubt that end users will bother with tweaking cache size. > Are you thinking about BGZF output yet (which will be required > in order to write BAM files)? > It's not hard at all. I already wrote packing string to BGZF in Ruby: https://github.com/lomereiter/bioruby-bgzf/blob/master/lib/bio-bgzf/pack.rb Parallelizing should also be easy, it's very similar to reading blocks from file. Determine how many alignments to pack in one block (it's 65Kb max), send compression task to taskpool, then go create next chunk of alignments, and so on. From cswh at umich.edu Mon Jun 4 23:04:06 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Mon, 4 Jun 2012 23:04:06 -0400 Subject: [GSoC] Weekly report: Indexed MAF access, Kyoto Cabinet, SQLite, and more Message-ID: <2B6E16E9-3DBC-4F54-88F8-C42E03124A1E@umich.edu> Hi all, My latest blog post on (mostly) last week's work is here: http://csw.github.com/bioruby-maf/blog/2012/06/04/indexed_maf_access/ Highlights include SQLite vs. Kyoto Cabinet, the path to BGZF support, and the challenges of supporting multiple Ruby implementations. Clayton Wheeler cswh at umich.edu From casbon at gmail.com Wed Jun 6 05:39:12 2012 From: casbon at gmail.com (James Casbon) Date: Wed, 6 Jun 2012 10:39:12 +0100 Subject: [GSoC] [Biopython-dev] GSoC python variant update 4 In-Reply-To: <87haurjc74.fsf@fastmail.fm> References: <87haurjc74.fsf@fastmail.fm> Message-ID: I'd be cautious about going for SQL for VCF backends. 
At least the following two problems arise: 1. VCF isn't a format, it's a meta-format so there isn't really a single data representation, but many. You are going to need a very flexible schema to allow variable records with complex entries like lists. (An entry is dynamically defined by the FORMAT field in each row, right?). Having a JSON misc entry means you lose all query abilities on these data anyway. 2. If you move your data away from VCF, you cannot use tools from outside your universe. i.e. lets say you want to use a GATK variant annotator, you need to do the roundtrip from SQL->VCF->SQL. I speak having developed this approach already and largely abandoned it due to the problems above. You are right that SQL would be a better solution for data index and access (no serialization issues, multiple tuned indexes), but be careful that you may spend a lot of time and not have a lot to show. I would really like it if biology used existing binary formats (HDF5 anyone?), but we don't. More practical use right now would be bcf support. From w.arindrarto at gmail.com Wed Jun 6 14:22:26 2012 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 6 Jun 2012 20:22:26 +0200 Subject: [GSoC] GSoC Project Update -- 5 Message-ID: Hi everyone, I just posted another update on my GSoC project here: http://bow.web.id/blog/2012/06/hello-indexers/ A brief summary: * I added the SearchIO indexing functions, with the same interface as SeqIO's indexing functions. It currently supports all the available SearchIO parsers (blast-tab, blast-xml, and hmmer-text). * (not mentioned in the post) I did some refactoring to the SearchIO code base. It was starting to get a bit messy, but now it's cleaner. All the parsers are now implemented as classes. For some of them, users can use it directly to tweak its behavior (e.g. the blast-tab parser can be used to parse plain blast-tab files with custom column ordering. This is not possible if users use SearchIO.parse or SearchIO.read instead). 
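[Editorial sketch, not part of the original message.] As a rough illustration of what a dictionary-like index over a search output file involves, the class below scans a tab-separated file once, remembers the byte offset at which each query's rows begin, and re-reads only those rows on lookup. It is a standard-library toy, not the SearchIO implementation; the class name `TabularOffsetIndex` is hypothetical, and it assumes the query ID is the first column and that all rows for a query are contiguous.

```python
class TabularOffsetIndex:
    """Read-only, dict-like index: query ID -> rows, loaded lazily by offset."""

    def __init__(self, path):
        self._path = path
        self._offsets = {}
        with open(path, "rb") as handle:
            while True:
                offset = handle.tell()
                line = handle.readline()
                if not line:
                    break
                if line.startswith(b"#") or not line.strip():
                    continue  # skip comment/header and blank lines
                query_id = line.split(b"\t", 1)[0].decode()
                # Keep only the offset of the first row for each query.
                self._offsets.setdefault(query_id, offset)

    def __len__(self):
        return len(self._offsets)

    def __contains__(self, query_id):
        return query_id in self._offsets

    def keys(self):
        return list(self._offsets)

    def __getitem__(self, query_id):
        rows = []
        with open(self._path, "rb") as handle:
            handle.seek(self._offsets[query_id])  # raises KeyError if absent
            for line in handle:
                fields = line.decode().rstrip("\r\n").split("\t")
                if fields[0] != query_id:
                    break  # reached the next query or a comment line
                rows.append(fields)
        return rows
```

Only the ID-to-offset mapping is held in memory, which is what makes this style of index practical for very large result files.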
Additionally, I should also mention that my schedule has been changed slightly. The original plan for next week was to focus on hmmer-text indexing. However, since it has been done (except for the testing, which should not take a week), I will be focusing on writing the SearchIO converters. So expect to see that instead. That's all for now :). regards, Bow From lomereiter at googlemail.com Mon Jun 11 13:25:48 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 11 Jun 2012 21:25:48 +0400 Subject: [GSoC] weekly report #4 Message-ID: Hello everybody, here's my weekly report: http://lomereiter.wordpress.com/2012/06/11/gsoc-weekly-report-4/ I've added BAM output support (not parallelized yet) and alignment creation/modification - changing fields, adding tags, and replacing existing ones. Thus, the library has a lot of features at the moment, and I started documenting them on github wiki. Also I found out that there's a great tool in DMD distribution, called rdmd, which allows to execute D files as scripts, by just adding "#!/usr/bin/rdmd" at the top. It will automatically compile all needed files and run executable. That dramatically simplifies library usage, no need to write cumbersome makefiles. The examples are at https://github.com/lomereiter/BAMread/wiki/Getting-started You can try to write your own script if you wish, follow the instructions in the wiki. Also, as my library now is able to write BAM, the current project title is quite misleading. So I'd like to hear suggestions on renaming :) -- Artem From p.j.a.cock at googlemail.com Mon Jun 11 13:41:39 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 11 Jun 2012 18:41:39 +0100 Subject: [GSoC] weekly report #4 In-Reply-To: References: Message-ID: On Mon, Jun 11, 2012 at 6:25 PM, Artem Tarasov wrote: > Hello everybody, > > here's my weekly report: > http://lomereiter.wordpress.com/2012/06/11/gsoc-weekly-report-4/ > > ... 
> > Also, as my library now is able to write BAM, the current project title is > quite misleading. > So I'd like to hear suggestions on renaming :) As to the name, how about damtools (D alignment/map tools), "for dealing with the flood of sequence data" (dam as in reservoir). Peter From cjfields at illinois.edu Mon Jun 11 13:46:43 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 11 Jun 2012 17:46:43 +0000 Subject: [GSoC] weekly report #4 In-Reply-To: References:

Message-ID: <67FF495D-E8AD-4920-9EA8-6464E1310FBB@illinois.edu> On Jun 11, 2012, at 12:41 PM, Peter Cock wrote: > On Mon, Jun 11, 2012 at 6:25 PM, Artem Tarasov > wrote: >> Hello everybody, >> >> here's my weekly report: >> http://lomereiter.wordpress.com/2012/06/11/gsoc-weekly-report-4/ >> >> ... >> >> Also, as my library now is able to write BAM, the current project title is >> quite misleading. >> So I'd like to hear suggestions on renaming :) > > As to the name, how about damtools (D alignment/map tools), > "for dealing with the flood of sequence data" (dam as in reservoir). > > Peter Or 'damn, look how much work we have to do' chris From lomereiter at googlemail.com Mon Jun 11 14:47:48 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 11 Jun 2012 22:47:48 +0400 Subject: [GSoC] weekly report #4 In-Reply-To: References:

Message-ID: No, thanks... I'll call it libsambamba. In Swahili, sambamba means 'parallel' ( http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) On Mon, Jun 11, 2012 at 9:41 PM, Peter Cock wrote: > > As to the name, how about damtools (D alignment/map tools), > "for dealing with the flood of sequence data" (dam as in reservoir). > > Peter > From p.j.a.cock at googlemail.com Mon Jun 11 14:59:38 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 11 Jun 2012 19:59:38 +0100 Subject: [GSoC] weekly report #4 In-Reply-To: <20120611185718.GA12417@thebird.nl> References:

<20120611185718.GA12417@thebird.nl> Message-ID: On Monday, June 11, 2012, Pjotr Prins wrote: > On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: > > No, thanks... > > > > I'll call it libsambamba. In suahili, sambamba means 'parallel' ( > > http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) > > I like it mbwana. > > Pj. > As the mentor, you'd be the mbwana or the bwana (boss), not Artem. But I do like lib-sambamba as a name - very clever. Peter From cjfields at illinois.edu Mon Jun 11 15:19:18 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 11 Jun 2012 19:19:18 +0000 Subject: [GSoC] weekly report #4 In-Reply-To: References:

<20120611185718.GA12417@thebird.nl> Message-ID: On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: > On Monday, June 11, 2012, Pjotr Prins wrote: > >> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>> No, thanks... >>> >>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >> >> I like it mbwana. >> >> Pj. >> > > As the mentor, you'd be the mbwana or the bwana (boss), not Artem. > > But I do like lib-sambamba as a name - very clever. > > Peter Agreed, fits very well. chris From pjotr.public14 at thebird.nl Mon Jun 11 14:57:18 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 11 Jun 2012 20:57:18 +0200 Subject: [GSoC] [BioRuby] weekly report #4 In-Reply-To: References:

Message-ID: <20120611185718.GA12417@thebird.nl> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: > No, thanks... > > I'll call it libsambamba. In suahili, sambamba means 'parallel' ( > http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) I like it mbwana. Pj. From georgkam at gmail.com Mon Jun 11 15:28:42 2012 From: georgkam at gmail.com (George Githinji) Date: Mon, 11 Jun 2012 22:28:42 +0300 Subject: [GSoC] [BioRuby] weekly report #4 In-Reply-To: References: