From lomereiter at googlemail.com Sat Jun 2 09:06:12 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Sat, 2 Jun 2012 17:06:12 +0400
Subject: [BioRuby] Parsing line-based formats with Ragel

Hi guys,

I've recently discovered an absolutely cool thing called Ragel
(http://www.complang.org/ragel/). It is a finite state machine compiler;
its applications include parsing Cucumber features in Gherkin, parsing
HTTP requests in Mongrel, and implementing pack/unpack functions in
Rubinius.

It can be used for creating a parser for any regular language, which
includes nearly every line-based format. It generates code for C, C++,
Objective C, D(!), Java, and Go. The speed of the generated code is
incredible.

I wrote a few more words about it in my blog:
http://lomereiter.wordpress.com/2012/06/02/ragel-and-bioinformatics/

Basically, you write a formal grammar, define which snippets of code to
execute on state transitions, and everything just works. As for me, I'm
going to implement a SAM parser with this tool.

It can also be useful for Marjan. I wrote a GFF3 grammar, but it might
be incorrect in some places. Here's a basic example of usage:
https://github.com/lomereiter/bioragel/blob/master/examples/d/gff3.rl

--
Artem

From pjotr.public14 at thebird.nl Sat Jun 2 09:15:22 2012
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sat, 2 Jun 2012 15:15:22 +0200
Subject: [BioRuby] Parsing line-based formats with Ragel
Message-ID: <20120602131522.GA3670@thebird.nl>

Hmm. Maybe we could use this for bio-ngs recipes too.

Pj.

From marian.povolny at gmail.com Sat Jun 2 10:18:40 2012
From: marian.povolny at gmail.com (Marjan Povolni)
Date: Sat, 2 Jun 2012 16:18:40 +0200
Subject: [BioRuby] Parsing line-based formats with Ragel

Cool, definitely something worth checking out for GFF3.

--
Marjan

From pjotr.public14 at thebird.nl Sat Jun 2 12:24:55 2012
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sat, 2 Jun 2012 18:24:55 +0200
Subject: [BioRuby] Parsing line-based formats with Ragel
Message-ID: <20120602162455.GA6483@thebird.nl>

On Sat, Jun 02, 2012 at 04:18:40PM +0200, Marjan Povolni wrote:
> Cool, definitely something worth checking out for GFF3.

One reason the state machine is fast is that it does not create
objects in memory (avoiding so-called death by object creation ;).
Data will be in the CPU cache, rather than main memory. It would be
interesting to see if Artem can run parsers on multi-core.

With GFF3 line parsing, a really simple format, we immediately create
a range of objects. Of course, this can happen on the stack too, so
the speed advantage may not be that important.

Still, I think especially for escape characters and character encodings
this could be interesting for GFF3, because that is the most
complicated part to get right.

For now, we choose to assume GFF3 is plain ASCII. So, I guess we file
this under 'enhancements'. Right?

Pj.
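The escape handling mentioned above can be sketched in a few lines of
Ruby. This is only an illustration, assuming GFF3 escapes are URL-style
percent codes such as %3B for ";" and that input is plain ASCII, as the
thread assumes for now. The name unescape_gff3 is made up for this
example and is not part of BioRuby or any parser discussed here.

```ruby
# Matches one percent-escape such as "%3B" and captures the two hex digits.
ESCAPE = /%([0-9A-Fa-f]{2})/

# Decode percent-escapes in a single field (hypothetical helper).
# Only ASCII-range escapes are considered in this sketch.
def unescape_gff3(field)
  # Fast path: without a '%' there is nothing to decode, so the original
  # string is returned and no new string object is created.
  return field unless field.include?('%')

  field.gsub(ESCAPE) { Regexp.last_match(1).hex.chr }
end
```

The fast path is the point of the sketch: a new string is built only for
fields that actually contain an escape, so unescaped fields cost no
extra allocation.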
From marian.povolny at gmail.com Sat Jun 2 12:58:39 2012
From: marian.povolny at gmail.com (Marjan Povolni)
Date: Sat, 2 Jun 2012 18:58:39 +0200
Subject: [BioRuby] Parsing line-based formats with Ragel
In-Reply-To: <20120602162455.GA6483@thebird.nl>
References: <20120602162455.GA6483@thebird.nl>

Well, not really. Currently my parser only creates a D struct named
Record, which contains only slices of the original string. So in basic
conditions there is only a string for the whole line (which can be a
slice of a bigger string), a Record which can also be located on the
stack, and a dynamic associative array for the attributes, which again
maps string slices to slices.

Additional string objects are only created when an escaped character
has to be replaced with the original. Otherwise all operations simply
handle slices (which are start address + length, and when copied behave
and cost more like an int than an object), and most if not all
operations use the stack. This can be improved upon with an array which
is not immutable, which would make it possible to replace the escaped
characters in place.

Given that lines are not that long, everything is still in CPU cache.
Linking into trees could have problems with cache, but it depends on
how many lines the parser will keep in memory at every moment. 3MB is a
lot of lines.

And the current parser should be handling escaped characters OK now,
except for the ones with values over 0x1F.

I would certainly like to make a version of the parser with Ragel.
There is a range of checks which I do manually with multiple functions,
and Ragel might be more optimized for those checks. And the result
could be much cleaner.

From pjotr.public14 at thebird.nl Sun Jun 3 10:20:48 2012
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sun, 3 Jun 2012 16:20:48 +0200
Subject: [BioRuby] Parsing line-based formats with Ragel
Message-ID: <20120603142048.GA21415@thebird.nl>

Trust a CS student to start on finite state machines. For us mere
mortals, here is a good write-up on Ragel principles for Rubyists

  http://zedshaw.com/essays/ragel_state_charts.html

by the much loved Zed :)

Pj.

From marian.povolny at gmail.com Sun Jun 3 17:07:18 2012
From: marian.povolny at gmail.com (Marjan Povolni)
Date: Sun, 3 Jun 2012 23:07:18 +0200
Subject: [BioRuby] GSoC weekly status report No.2

http://blog.mpthecoder.com/post/24355573626/gsoc-weekly-status-report-no-2

It's the end of the second week of GSoC and time for a new report.

I spent the last week mostly doing work based on criticism from my
mentor. The D parser which parses lines into records is now in pretty
good shape, and tested. Today I received a list of new issues that need
to be resolved before going further, but they're not that much work and
I can plan some new developments.

A utility for validation is planned for next week, which could also be
used for performance measurement. After that I will turn to making the
current parser parallel.

Also, tomorrow I'll be defending my Master's thesis, after which I
should be able to concentrate more on the GFF3 parser.
From donttrustben at gmail.com Sun Jun 3 18:10:22 2012
From: donttrustben at gmail.com (Ben Woodcroft)
Date: Mon, 4 Jun 2012 08:10:22 +1000
Subject: [BioRuby] Parsing line-based formats with Ragel
In-Reply-To: <20120603142048.GA21415@thebird.nl>
References: <20120603142048.GA21415@thebird.nl>

Just wanted to say thanks for pointing this out, Artem - I can
definitely see myself using it in the future. If only you'd been a few
days earlier!

Perhaps idealistically, the state machine might be written once, and
then the last mile be implemented in multiple different Bio* projects.

--
Ben Woodcroft
http://ecogenomic.org/users/ben-woodcroft

From cjfields at illinois.edu Sun Jun 3 20:56:18 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 4 Jun 2012 00:56:18 +0000
Subject: [BioRuby] Parsing line-based formats with Ragel
References: <20120603142048.GA21415@thebird.nl>

Have to agree, and in cases where a Bio* might run into problems with
Ragel (Perl or Python) we can at least look at the grammar and use
something for those languages that is similar in concept (e.g. Marpa
for Perl), or go a little more roundabout and bind to C-generated ones
from Ragel.

chris

From pjotr.public14 at thebird.nl Mon Jun 4 01:17:45 2012
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Mon, 4 Jun 2012 07:17:45 +0200
Subject: [BioRuby] Parsing line-based formats with Ragel
References: <20120603142048.GA21415@thebird.nl>
Message-ID: <20120604051745.GB24131@thebird.nl>

On Mon, Jun 04, 2012 at 12:56:18AM +0000, Fields, Christopher J wrote:
> Have to agree, and in cases where a Bio* might run into problems
> with Ragel (Perl or Python) we can at least look at the grammar and
> use something for those languages that is similar in concept (e.g.
> Marpa for Perl), or go a little more roundabout and bind to
> C-generated ones from Ragel.

Also agree. Parsing is a common theme in Bio*. A state engine would
be a great abstraction, targeting C or D, and even the interpreted
languages. The SAM parser would be a great proof-of-concept. I am
also very interested to see how it will perform against samtools.

The spanner in the works may be that we tend to be very sloppy about
standards. So relaxed parsers may also be needed.

Pj.
From p.j.a.cock at googlemail.com Mon Jun 4 05:27:10 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 4 Jun 2012 10:27:10 +0100
Subject: [BioRuby] Parsing line-based formats with Ragel
In-Reply-To: <20120604051745.GB24131@thebird.nl>
References: <20120603142048.GA21415@thebird.nl> <20120604051745.GB24131@thebird.nl>

On Mon, Jun 4, 2012 at 6:17 AM, Pjotr Prins wrote:
> On Mon, Jun 04, 2012 at 12:56:18AM +0000, Fields, Christopher J wrote:
>> Have to agree, and in cases where a Bio* might run into problems
>> with Ragel (Perl or Python) we can at least look at the grammar and
>> use something for those languages that is similar in concept (e.g.
>> Marpa for Perl), or go a little more roundabout and bind to
>> C-generated ones from Ragel.
>
> Also agree. Parsing is a common theme in Bio*. A state engine would
> be a great abstraction, targeting C or D, and even the interpreted
> languages. The SAM parser would be a great proof-of-concept. I am
> also very interested to see how it will perform against samtools.
>
> The spanner in the works may be that we tend to be very sloppy
> about standards. So relaxed parsers may also be needed.

When I read Artem's post about Ragel and formal grammars for
parsing bioinformatics file formats I was intrigued, but cautious.

Biopython used to have a lot of its parsers written in Martel,
a home-grown "regular expressions on steroids" parsing framework.
One significant downside was that even minor tweaks to the format
description required a good knowledge of regular expressions
and how the Martel grammar worked. This created a significant
barrier to entry, e.g. inserting a new optional line type at a
particular point in a file format was initially quite daunting,
leaving parser maintenance in the hands of a few people.

(The reason we ended up dropping Martel was a combination of poor
scaling with large datasets, problems with a third party library
API change, and lack of time from the original author to work on
it. Most of our parsers are now 'pure Python'.)

It would not surprise me if over half the time spent on writing
a parser goes on dealing with corner cases/borderline invalid
inputs, and that a formal grammar may not be the best way to
deal with 'messy' data. But I would hope SAM/BAM files would be
well enough behaved to make this worth trying.

Regards,

Peter

From cjfields at illinois.edu Mon Jun 4 08:56:39 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 4 Jun 2012 12:56:39 +0000
Subject: [BioRuby] Parsing line-based formats with Ragel
In-Reply-To: <20120604051745.GB24131@thebird.nl>
References: <20120603142048.GA21415@thebird.nl> <20120604051745.GB24131@thebird.nl>

On Jun 4, 2012, at 12:17 AM, Pjotr Prins wrote:

> On Mon, Jun 04, 2012 at 12:56:18AM +0000, Fields, Christopher J wrote:
>> Have to agree, and in cases where a Bio* might run into problems
>> with Ragel (Perl or Python) we can at least look at the grammar and
>> use something for those languages that is similar in concept (e.g.
>> Marpa for Perl), or go a little more roundabout and bind to
>> C-generated ones from Ragel.
>
> Also agree. Parsing is a common theme in Bio*. A state engine would
> be a great abstraction, targeting C or D, and even the interpreted
> languages. The SAM parser would be a great proof-of-concept. I am
> also very interested to see how it will perform against samtools.
>
> The spanner in the works may be that we tend to be very sloppy about
> standards. So relaxed parsers may also be needed.

Either that, or use the grammar as a source of validation (e.g. if the
parse fails, the data is not formatted correctly). That's basically the
tack I plan with Perl 6 grammars.

chris

> Pj.
From lomereiter at googlemail.com Mon Jun 4 10:31:14 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Mon, 4 Jun 2012 18:31:14 +0400
Subject: [BioRuby] Parsing line-based formats with Ragel
References: <20120603142048.GA21415@thebird.nl> <20120604051745.GB24131@thebird.nl>

On Mon, Jun 4, 2012 at 4:56 PM, Fields, Christopher J wrote:

> > Also agree. Parsing is a common theme in Bio*. A state engine would
> > be a great abstraction, targeting C or D, and even the interpreted
> > languages. The SAM parser would be a great proof-of-concept. I am
> > also very interested to see how it will perform against samtools.
> >
> > The spanner in the works may be that we tend to be very sloppy about
> > standards. So relaxed parsers may also be needed.
>
> Either that, or use the grammar as a source of validation (e.g. if the
> parse fails, the data is not formatted correctly). That's basically the
> tack I plan with Perl 6 grammars.
>
> chris

Yes, I think that the problem of invalid data can be addressed by
having additional rules with a less strict grammar. For instance, if
the format uses tab delimiting, we can track the problem down to a
particular field, using fewer restrictions on the character set, like

    invalidsomefield = [^\t]+ %some_error_action;
    somefield = (bunch of rules conformant to spec) | invalidsomefield;

If we want more comprehensible error messages, then instead of [^\t]+
another set of rules for different kinds of invalid input can be used.

The big plus of state machines is that they don't scan the string
multiple times, as usually happens with a hand-written parser that
performs several checks in turn.
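The strict-or-fallback idea above can be illustrated outside Ragel with
a small Ruby sketch. It is not a state machine and it does scan each
field more than once, so it shows only the error-localisation part of
the idea. The column names and patterns in STRICT are invented for the
example and are not taken from the GFF3 or SAM specifications;
check_line is a hypothetical helper.

```ruby
# Strict per-column patterns (illustrative only, not from any spec).
STRICT = {
  seqid:  /\A[A-Za-z0-9._-]+\z/,
  start:  /\A[0-9]+\z/,
  strand: /\A[+\-.]\z/
}.freeze

# Fallback: any non-empty run of non-tab characters, like [^\t]+ above.
LENIENT = /\A[^\t]+\z/

# Return a list of [column, problem] pairs for one tab-delimited line.
def check_line(line)
  fields = line.chomp.split("\t", -1) # -1 keeps trailing empty fields
  errors = []
  STRICT.each_with_index do |(name, strict), i|
    value = fields[i]
    if value.nil? || value !~ LENIENT
      errors << [name, :missing]   # nothing usable in this column
    elsif value !~ strict
      errors << [name, :invalid]   # matched only the relaxed rule
    end
  end
  errors
end
```

A caller could treat an empty result as a valid record and otherwise
report exactly which column fell through to the relaxed rule.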
From lomereiter at googlemail.com Mon Jun 4 14:02:58 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Mon, 4 Jun 2012 22:02:58 +0400
Subject: [BioRuby] [GSoC] Weekly report #3

Hello all,

the post is here:
http://lomereiter.wordpress.com/2012/06/04/gsoc-weekly-report-3/

I've implemented random access to BAM files, using the index file. I
also created a generic function for memoization which stores
decompressed blocks in a cache, following some desired cache strategy.
Currently, I use a simple FIFO cache.

I also studied how to make SAM output faster. I came to the conclusion
that not only D standard library functions, but even those of the
*printf family are too slow for this purpose, because they have to
parse the format string. Instead, I need to use specialized functions
for printing integers and floats.

Currently, output is about 4x slower than in samtools. So I have to
take back some of my harsh words about its code and say that there is
something to learn from it. It indeed uses its own functions for
integer output, and also uses a string buffer to make fewer calls
(system functions can't be inlined). I'll use this approach, too, so
very soon my library will be usable in pipelines, but only for output.

Then I'm going to move on to allowing alignments to be modified and
output to BAM. After that, the SAM parser needs to be implemented, and
I'm going to use Ragel (a finite-state machine compiler) for that
purpose.

So by the beginning of July I want to have SAM<->BAM conversion
working, with good speed. Add to that the first release of the biogem,
and those are my plans for this month.
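For readers who have not met the term, a FIFO cache of the kind
mentioned in the report can be sketched in Ruby as below. This is not
Artem's D code; the FifoCache class is made up for illustration, and it
relies on Ruby 1.9+ hashes remembering insertion order.

```ruby
# FIFO cache: when full, evict whichever entry was inserted first,
# regardless of how recently it was used.
class FifoCache
  def initialize(capacity)
    raise ArgumentError, 'capacity must be positive' unless capacity > 0

    @capacity = capacity
    @store = {} # Ruby hashes keep keys in insertion order
  end

  # Return the cached value for key; on a miss, compute it with the block.
  def fetch(key)
    return @store[key] if @store.key?(key)

    @store.shift if @store.size >= @capacity # drop the oldest insertion
    @store[key] = yield(key)
  end

  def size
    @store.size
  end
end
```

In the BAM setting the key would presumably be a block's file offset
and the block would do the decompression; a hit costs one hash lookup
and nothing is updated on access, which is why the bookkeeping is so
cheap.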
From p.j.a.cock at googlemail.com Mon Jun 4 15:36:25 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 4 Jun 2012 20:36:25 +0100
Subject: [BioRuby] [GSoC] Weekly report #3

On Mon, Jun 4, 2012 at 7:02 PM, Artem Tarasov wrote:
> Hello all,
>
> the post is here:
> http://lomereiter.wordpress.com/2012/06/04/gsoc-weekly-report-3/
>
> I've implemented random access to BAM file, using index file. Also I
> created a generic function for memoization which stores decompressed
> blocks in cache, following some desired cache strategy. Currently, I
> use simple FIFO cache.

That sounds good. We've talked a little bit about the block
caching strategy for Biopython's BGZF support - dropping
the least recently used block (LRU) would be good but
requires the overhead of storing and recording timestamps
on each access.

Currently my Biopython BGZF code just drops a cached
block 'at random' (actually based on the dictionary hashing
algorithm), and switching to FIFO was something I planned
to try next (easily done with Python's OrderedDict class).
FIFO seems like a good solution as the overheads are
much lower than LRU.

Have you got any good random access benchmarks to try this out
with? i.e. something non-random, such as pulling mates of paired
end reads.

How many BGZF blocks are you keeping in the cache, and why?

Are you thinking about BGZF output yet (which will be required
in order to write BAM files)?

Regards,

Peter

From lomereiter at googlemail.com Mon Jun 4 16:07:03 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Tue, 5 Jun 2012 00:07:03 +0400
Subject: [BioRuby] [GSoC] Weekly report #3

> Have you got any good random access benchmarks to try this out
> with? i.e. something non-random, such as pulling mates of paired
> end reads.

Currently, no. Please suggest your ideas about benchmarks because I
suspect that you have much more experience with BAM files and better
knowledge of usage patterns.

> How many BGZF blocks are you keeping in the cache, and why?

Currently, 512. I don't know why, it seems like a reasonable number
(about 30MB of RAM). Maybe it should be a runtime parameter but I doubt
that end users will bother with tweaking the cache size.

> Are you thinking about BGZF output yet (which will be required
> in order to write BAM files)?

It's not hard at all. I already wrote code for packing a string to BGZF
in Ruby:
https://github.com/lomereiter/bioruby-bgzf/blob/master/lib/bio-bgzf/pack.rb

Parallelizing should also be easy, it's very similar to reading blocks
from a file. Determine how many alignments to pack in one block (it's
65Kb max), send the compression task to the taskpool, then go create
the next chunk of alignments, and so on.

From cswh at umich.edu Mon Jun 4 23:04:06 2012
From: cswh at umich.edu (Clayton Wheeler)
Date: Mon, 4 Jun 2012 23:04:06 -0400
Subject: [BioRuby] Weekly report: Indexed MAF access, Kyoto Cabinet, SQLite, and more
Message-ID: <2B6E16E9-3DBC-4F54-88F8-C42E03124A1E@umich.edu>

Hi all,

My latest blog post on (mostly) last week's work is here:

http://csw.github.com/bioruby-maf/blog/2012/06/04/indexed_maf_access/

Highlights include SQLite vs. Kyoto Cabinet, the path to BGZF support,
and the challenges of supporting multiple Ruby implementations.

Clayton Wheeler
cswh at umich.edu

From francesco.strozzi at gmail.com Tue Jun 5 09:49:13 2012
From: francesco.strozzi at gmail.com (Francesco Strozzi)
Date: Tue, 5 Jun 2012 15:49:13 +0200
Subject: [BioRuby] OpenShift

Hi all,

has anyone tried RedHat OpenShift? It's the new PaaS from RH; they
mainly host web applications and the basic account is completely free.
They give you 3 workers + 1.5Gb of total RAM and around 3 Gb of space.
By default they can host Perl, PHP, Java, Python and Ruby (1.8.7, epic
fail here), but they also provide blank containers where you can set up
the environment that you like with the language that you want.

It seems to have a nice CLI with Git integration... I think this could
be something useful for test environments or for free hosting of small
apps.

--
Francesco

From begumisa at gmail.com Wed Jun 6 09:41:55 2012
From: begumisa at gmail.com (Godfrey M Begumisa)
Date: Wed, 6 Jun 2012 16:41:55 +0300
Subject: [BioRuby] (no subject)

I request my email to be removed from this mailing list

--
Begumisa

From p.j.a.cock at googlemail.com Thu Jun 7 07:32:34 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 7 Jun 2012 12:32:34 +0100
Subject: [BioRuby] Spam on wiki

Hi all,

I follow the changes to the Biopython wiki via RSS, and we
(and the BioPerl wiki) had some minor vandalism recently
which I have fixed.

The BioRuby wiki changes RSS feed is:
http://bioruby.org/w/index.php?title=Special:RecentChanges&feed=rss

There is a similar new spam page on the BioRuby wiki from
new user "Helium mint", but I don't think I have admin rights
to do anything about it (block the user and delete the page):
http://bioruby.org/w/index.php?title=Special:RecentChanges&days=30

Regards,

Peter

From sibert at wisc.edu Thu Jun 7 15:24:36 2012
From: sibert at wisc.edu (Bryan Sibert)
Date: Thu, 7 Jun 2012 14:24:36 -0500
Subject: [BioRuby] Error reading ABIF SangerChromatogram data

Hello,

I am using BioRuby 1.4.2 on Ruby 1.9.3 installed on a 64-bit Windows
Vista machine. I am trying to read some .ab1 files using the BioRuby
ABIF class, which inherits from SangerChromatogram. Ruby raises a
NoMethodError when I attempt this.
Below is a simple program based up on the example usage provided in the documentation: require 'bio' chromatogram_ff = Bio::Abif.open("120605M4_05H_9.ab1") chromatogram = chromatogram_ff.next_entry This raises the following error: C:\...\Desktop>test C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:107:in `get_entry_data': undefined method `match' for nil:NilClass (NoMethodError) from C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:85:in `block in get_directory_entries' from C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:77:in `times' from C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:77:in `get_directory_entries' from C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:42:in `initialize' from C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/io/flatfile/splitter.rb:55:in `new' from C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/io/flatfile/splitter.rb:55:in `get_parsed_entry' from C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/io/flatfile.rb:288:in `next_entry' from C:/.../Desktop/test.rb:3:in `
' I am at a loss as to how to fix this. If you have any ideas, please let me know. Thanks, Bryan From ngoto at gen-info.osaka-u.ac.jp Fri Jun 8 09:28:51 2012 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Fri, 8 Jun 2012 22:28:51 +0900 Subject: [BioRuby] Error reading ABIF SangerChromatogram data In-Reply-To: References: Message-ID: <201206081334.q58DYSH4022666@portal.open-bio.org> Hi Bryan, It may be possible that default file open mode on Windows is ASCII and the line feed code conversion from CR+LF to LF brakes read data. Please try the following workaround code: require 'bio' f = File.open("120605M4_05H_9.ab1", "rb") chromatogram_ff = Bio::Abif.open(f) chromatogram = chromatogram_ff.next_entry This will be fixed in the next version: default file open mode of Bio::FlatFile is changed to binary. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Thu, 7 Jun 2012 14:24:36 -0500 Bryan Sibert wrote: > Hello, > > I am using BioRuby 1.4.2 on Ruby 1.9.3 installed on a 64-bit Windows Vista > machine. I am trying to read some .ab1 files using the BioRuby ABIF class > which inherits SangerChromatogram. Ruby raises a NoMethodError when I > attempt this. 
> Below is a simple program based upon the example usage
> provided in the documentation:
>
> require 'bio'
> chromatogram_ff = Bio::Abif.open("120605M4_05H_9.ab1")
> chromatogram = chromatogram_ff.next_entry
>
> This raises the following error:
>
> C:\...\Desktop>test
>
> C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:107:in
> `get_entry_data': undefined method `match' for nil:NilClass (NoMethodError)
> from
> C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:85:in
> `block in get_directory_entries'
> from
> C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:77:in
> `times'
> from
> C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:77:in
> `get_directory_entries'
> from
> C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:42:in
> `initialize'
> from
> C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/io/flatfile/splitter.rb:55:in
> `new'
> from
> C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/io/flatfile/splitter.rb:55:in
> `get_parsed_entry'
> from
> C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/io/flatfile.rb:288:in
> `next_entry'
> from C:/.../Desktop/test.rb:3:in `<main>'
>
>
> I am at a loss as to how to fix this. If you have any ideas, please let me
> know.
>
> Thanks,
>
> Bryan
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From lomereiter at googlemail.com Mon Jun 11 13:25:48 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Mon, 11 Jun 2012 21:25:48 +0400
Subject: [BioRuby] [GSoC] weekly report #4
Message-ID:

Hello everybody,

here's my weekly report:
http://lomereiter.wordpress.com/2012/06/11/gsoc-weekly-report-4/

I've added BAM output support (not parallelized yet) and alignment creation/modification: changing fields, adding tags, and replacing existing ones. Thus, the library has a lot of features at the moment, and I have started documenting them on the GitHub wiki.

Also, I found out that there's a great tool in the DMD distribution called rdmd, which allows executing D files as scripts by just adding "#!/usr/bin/rdmd" at the top. It automatically compiles all needed files and runs the executable. That dramatically simplifies library usage: there is no need to write cumbersome makefiles.

The examples are at https://github.com/lomereiter/BAMread/wiki/Getting-started
You can try writing your own script if you wish; just follow the instructions in the wiki.

Also, as my library is now able to write BAM, the current project title is quite misleading. So I'd like to hear suggestions on renaming :)

--
Artem

From p.j.a.cock at googlemail.com Mon Jun 11 13:41:39 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 11 Jun 2012 18:41:39 +0100
Subject: [BioRuby] [GSoC] weekly report #4
In-Reply-To:
References:
Message-ID:

On Mon, Jun 11, 2012 at 6:25 PM, Artem Tarasov wrote:
> Hello everybody,
>
> here's my weekly report:
> http://lomereiter.wordpress.com/2012/06/11/gsoc-weekly-report-4/
>
> ...
>
> Also, as my library now is able to write BAM, the current project title is
> quite misleading.
> So I'd like to hear suggestions on renaming :)

As to the name, how about damtools (D alignment/map tools),
"for dealing with the flood of sequence data" (dam as in reservoir).

Peter

From cjfields at illinois.edu Mon Jun 11 13:46:43 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 11 Jun 2012 17:46:43 +0000
Subject: [BioRuby] [GSoC] weekly report #4
In-Reply-To:
References:
Message-ID: <67FF495D-E8AD-4920-9EA8-6464E1310FBB@illinois.edu>

On Jun 11, 2012, at 12:41 PM, Peter Cock wrote:

> On Mon, Jun 11, 2012 at 6:25 PM, Artem Tarasov
> wrote:
>> Hello everybody,
>>
>> here's my weekly report:
>> http://lomereiter.wordpress.com/2012/06/11/gsoc-weekly-report-4/
>>
>> ...
>>
>> Also, as my library now is able to write BAM, the current project title is
>> quite misleading.
>> So I'd like to hear suggestions on renaming :)
>
> As to the name, how about damtools (D alignment/map tools),
> "for dealing with the flood of sequence data" (dam as in reservoir).
>
> Peter

Or 'damn, look how much work we have to do'

chris

From lomereiter at googlemail.com Mon Jun 11 14:47:48 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Mon, 11 Jun 2012 22:47:48 +0400
Subject: [BioRuby] [GSoC] weekly report #4
In-Reply-To:
References:
Message-ID:

No, thanks...

I'll call it libsambamba. In Swahili, sambamba means 'parallel' (
http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :)

On Mon, Jun 11, 2012 at 9:41 PM, Peter Cock wrote:

>
> As to the name, how about damtools (D alignment/map tools),
> "for dealing with the flood of sequence data" (dam as in reservoir).
> > Peter > From pjotr.public14 at thebird.nl Mon Jun 11 14:57:18 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 11 Jun 2012 20:57:18 +0200 Subject: [BioRuby] [GSoC] weekly report #4 In-Reply-To: References: Message-ID: <20120611185718.GA12417@thebird.nl> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: > No, thanks... > > I'll call it libsambamba. In suahili, sambamba means 'parallel' ( > http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) I like it mbwana. Pj. From p.j.a.cock at googlemail.com Mon Jun 11 14:59:38 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 11 Jun 2012 19:59:38 +0100 Subject: [BioRuby] [GSoC] weekly report #4 In-Reply-To: <20120611185718.GA12417@thebird.nl> References: <20120611185718.GA12417@thebird.nl> Message-ID: On Monday, June 11, 2012, Pjotr Prins wrote: > On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: > > No, thanks... > > > > I'll call it libsambamba. In suahili, sambamba means 'parallel' ( > > http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) > > I like it mbwana. > > Pj. > As the mentor, you'd be the mbwana or the bwana (boss), not Artem. But I do like lib-sambamba as a name - very clever. Peter From cjfields at illinois.edu Mon Jun 11 15:19:18 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 11 Jun 2012 19:19:18 +0000 Subject: [BioRuby] [GSoC] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: > On Monday, June 11, 2012, Pjotr Prins wrote: > >> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>> No, thanks... >>> >>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >> >> I like it mbwana. >> >> Pj. >> > > As the mentor, you'd be the mbwana or the bwana (boss), not Artem. > > But I do like lib-sambamba as a name - very clever. 
> > Peter Agreed, fits very well. chris From georgkam at gmail.com Mon Jun 11 15:28:42 2012 From: georgkam at gmail.com (George Githinji) Date: Mon, 11 Jun 2012 22:28:42 +0300 Subject: [BioRuby] [GSoC] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: Good tribute to swahili! ahsante sana bwana Artem! (Thank you very much for the suggestion) Sambamba could also mean correct way or the right thing in everyday speak.. (bwana is a term of respect or honour, though it also refers to a boss .. mostly we use 'mkubwa' to mean boss) George On Mon, Jun 11, 2012 at 10:19 PM, Fields, Christopher J wrote: > On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: > >> On Monday, June 11, 2012, Pjotr Prins wrote: >> >>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>>> No, thanks... >>>> >>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >>> >>> I like it mbwana. >>> >>> Pj. >>> >> >> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. >> >> But I do like lib-sambamba as a name - very clever. >> >> Peter > > Agreed, fits very well. 
> > chris > > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby -- --------------- Sincerely George Skype: george_g2 Blog: http://biorelated.wordpress.com/ Twitter: http://twitter.com/#!/george_l From cjfields at illinois.edu Mon Jun 11 15:36:44 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 11 Jun 2012 19:36:44 +0000 Subject: [BioRuby] [GSoC] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: <114DEA27-A766-4F0F-8144-098FF0905E1D@illinois.edu> heh, which makes me think you don't respect your bosses :) chris On Jun 11, 2012, at 2:28 PM, George Githinji wrote: > ...(bwana is a term of respect or honour, though it also refers to a boss > .. mostly we use 'mkubwa' to mean boss) > > George > > > On Mon, Jun 11, 2012 at 10:19 PM, Fields, Christopher J > wrote: >> On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: >> >>> On Monday, June 11, 2012, Pjotr Prins wrote: >>> >>>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>>>> No, thanks... >>>>> >>>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >>>> >>>> I like it mbwana. >>>> >>>> Pj. >>>> >>> >>> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. >>> >>> But I do like lib-sambamba as a name - very clever. >>> >>> Peter >> >> Agreed, fits very well. 
>>
>> chris
>>
>>
>> _______________________________________________
>> BioRuby Project - http://www.bioruby.org/
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
>
>
>
> --
> ---------------
> Sincerely
> George
> Skype: george_g2
> Blog: http://biorelated.wordpress.com/
> Twitter: http://twitter.com/#!/george_l

From marian.povolny at gmail.com Mon Jun 11 16:52:05 2012
From: marian.povolny at gmail.com (Marjan Povolni)
Date: Mon, 11 Jun 2012 22:52:05 +0200
Subject: [BioRuby] GSoC weekly status report No.3
Message-ID:

http://blog.mpthecoder.com/post/24904798973/gsoc-weekly-status-report-no-3

My first report as a Master of Computer Engineering and Communications :)

Here is a list of what I've been working on during the last week:

- more cleanup and refactoring of the validation code, README, etc.,
- made a validation utility in D, which simply reports problems found to stderr,
- made a benchmark tool with a -v option for measuring parser speed with and without validation,
- after having a basic benchmark tool, found a few places which were very bad for performance. After fixing that code, parsing a 233MB GFF3 file on a five-year-old PC took 6 seconds, but without validation, with only a single thread, and with replacing escaped characters turned off,
- made replacing escaped characters optional, because the current implementation requires creation of additional string objects to do that, which has a big impact on performance. There is a plan for making it faster, but it is scheduled for later,
- added minimal parallelisation, by reading the file in a separate thread.

Two additional days were spent on a segmentation fault in the D garbage collector which occurred when parsing a big file with a lot of errors. That should never happen, as I'm using the safe part of the D language, that is, no pointers or anything similar. The worst that should happen is an exception.
But a segmentation fault points to an error in either the compiler, the runtime or the support library. The minimum reproducible example is still 42 lines long:
https://gist.github.com/2911818
but changing anything in it makes the segmentation fault go away. More info on this topic can be found in the discussion here:
https://github.com/mamarjan/bioruby-hpc-gff3/issues/31
I'll probably be posting a bug report on the Dlang webpage tomorrow.

For the coming week I would like to add more parallelisation, change the validation code so that exceptions almost never happen (and the seg fault also) and add support for merging records into features.

--
Marjan

From cswh at umich.edu Mon Jun 11 16:56:02 2012
From: cswh at umich.edu (Clayton Wheeler)
Date: Mon, 11 Jun 2012 16:56:02 -0400
Subject: [BioRuby] GSoC weekly status report: MAF filtering
Message-ID:

Hi all,

Here's my status report on last week's work:
http://csw.github.com/bioruby-maf/blog/2012/06/09/filtering-work/

Highlights: mainly MAF alignment block filtering and performance challenges with binary data in Ruby.

Clayton Wheeler
cswh at umich.edu

From pjotr.public14 at thebird.nl Tue Jun 12 03:05:18 2012
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 12 Jun 2012 09:05:18 +0200
Subject: [BioRuby] [GSoC] weekly report #4
In-Reply-To:
References: <20120611185718.GA12417@thebird.nl>
Message-ID: <20120612070518.GA14848@thebird.nl>

sam-bam-baah has the comment of sheep in it. May explain consensus :)

How about sambamba tools. Less unwieldy.

On Mon, Jun 11, 2012 at 09:35:59PM +0100, P. Troshin wrote:
> None of my business but it's a bit unwieldy. It may be clever, but 99%
> people who come across it would not know. mbwana is simpler in that
> respect. Sorry for spoiling the consensus :-(
>
> P.
> > > > On 11 June 2012 20:19, Fields, Christopher J wrote: > > On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: > > > >> On Monday, June 11, 2012, Pjotr Prins wrote: > >> > >>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: > >>>> No, thanks... > >>>> > >>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( > >>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) > >>> > >>> I like it mbwana. > >>> > >>> Pj. > >>> > >> > >> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. > >> > >> But I do like lib-sambamba as a name - very clever. > >> > >> Peter > > > > Agreed, fits very well. > > > > chris > > > > > > _______________________________________________ > > GSoC mailing list > > GSoC at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/gsoc > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From marian.povolny at gmail.com Mon Jun 18 14:28:12 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Mon, 18 Jun 2012 20:28:12 +0200 Subject: [BioRuby] GSoC weekly status report No.4 Message-ID: http://blog.mpthecoder.com/post/25375170121/gsoc-weekly-status-report-no-4 During the last week combining records into features has been added, and also connecting the features into parent-child relationships. Validation messages have been enhanced with file names and line numbers, and now look like errors reported by a compiler. Feels most natural to me. Combining the features into records works by keeping a forward cache of a number of features (1000 by default, configurable). That means that the parsing results will be correct only if records which are part of the same feature are at most 1000 features from each other, or the amount of features set. The first implementation which was comparing the IDs of records required 10min for a 233MB file. 
After switching to comparing hash values of the IDs first, and comparing the IDs themselves only when the hashes match, the parsing time went down to 45s. After fixing a bug, the time is now 10 seconds for the 233MB m_hapla file :)

Linking the features into parent-child relationships works similarly, by using 32-bit hashes most of the time instead of comparing strings. With this functionality turned on, the same file is parsed in 13 seconds.

All the measurements have been done using the benchmark utility, which has a few more options for setting what should be run.

Otherwise I did more refactoring: moved all the gff3_* files into a gff3 directory, so the D modules are now bio.gff3.*, parsing functions are now static methods of the GFF3File and GFF3Data classes, etc.

For the new week, I would like to add filtering to the D library, which I can then use to implement iteration over genes, mRNAs, CDS features, etc. After that the library should be pretty much complete feature-wise, at least per what was promised in the project proposal, so I'll continue by defining the C API and developing the Ruby gem.

--
Marjan

From lomereiter at googlemail.com Tue Jun 19 04:25:07 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Tue, 19 Jun 2012 12:25:07 +0400
Subject: [BioRuby] [GSoC] weekly report #5
Message-ID:

Hi all,

I wrote a few words about improvements in my project during the past week:
http://lomereiter.wordpress.com/2012/06/19/gsoc-weekly-report-5/

- More wiki content on GitHub, with examples of how to use the library for common cases.
- Faster conversion to SAM; now it's not worse than samtools in this respect.
- Parallelized BGZF compression, though it was relatively easy to add.
- Reconsidering interaction with dynamic languages due to shared library issues in D. Now I'm thinking of making command-line tools that output JSON and wrapping them. At least in BioRuby we have Bio::Command to make this process easy.
- Progress in SAM parsing: valid records are now fully parsed, and it takes just 300 lines of D/Ragel mix, together with some unittests. Also, Ragel provides some convenient methods to handle errors, but I haven't investigated them yet. Once error handling is added, the branch will be ready to be merged, and then I'll add SAM reading.

--
Artem

From p.j.a.cock at googlemail.com Thu Jun 21 05:23:13 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 21 Jun 2012 10:23:13 +0100
Subject: [BioRuby] BioRuby on Travis-ci!
In-Reply-To:
References: <20120509171449.GA29529@thebird.nl> <20120509213158.GB31329@thebird.nl> <201205160739.q4G7dS4G004980@portal.open-bio.org>
Message-ID:

On Wed, May 16, 2012 at 9:15 AM, Anurag Priyam wrote:
> On Wed, May 16, 2012 at 1:00 PM, Naohisa GOTO
> wrote:
>> For Bioruby, I manually set the hook with my (ngoto's) personal
>> Travis account. As far as I can see, organization account in Travis
>> is currently not available.
>
> You are talking about the toggle button on your Travis profile page,
> right? For repos that belong to an organization, you need to enable
> Travis hook from Github (admin/service-hooks), iirc, using the token
> on your Travis profile page.

Thank you to all the BioRuby team that helped out with this.
I guess Travis have made some improvements to handling organization accounts, but this time I was able to get it to work:

http://travis-ci.org/#!/biopython/biopython
http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009742.html

Thanks again,

Peter

From cswh at umich.edu Fri Jun 22 00:25:40 2012
From: cswh at umich.edu (Clayton Wheeler)
Date: Fri, 22 Jun 2012 00:25:40 -0400
Subject: [BioRuby] GSoC weekly status report: parallel I/O and JRuby
Message-ID: <656B7BDD-DD0E-40FE-91AF-DC23113427D5@umich.edu>

Hi all,

This week's status report is a double feature:

http://csw.github.com/bioruby-maf/blog/2012/06/13/jruby_support_and_performance_work/
http://csw.github.com/bioruby-maf/blog/2012/06/21/parallel_io/

In short, I now have JRuby fully supported by my MAF code, including the Kyoto Cabinet components. Using JRuby, I've been able to deliver very solid performance for index-driven random access parsing as well as for sequential whole-file parsing.

Clayton Wheeler
cswh at umich.edu

From mail at michaelbarton.me.uk Fri Jun 22 12:27:12 2012
From: mail at michaelbarton.me.uk (Michael Barton)
Date: Fri, 22 Jun 2012 12:27:12 -0400
Subject: [BioRuby] Remote blast not working?
In-Reply-To: <20110320205651.GE3954@Michael-Bartons-MacBook.local>
References: <20110316211201.GA2640@nku069218.hh.nku.edu> <540CCF1E-2F3B-441A-9C44-2F36BB7D2035@hgc.jp> <20110320205651.GE3954@Michael-Bartons-MacBook.local>
Message-ID:

Hi,

I appear to be having a similar problem again using ruby 1.9.3 and bio 1.4.2.

blast = Bio::Blast.remote('blastp', 'nr-aa', '-e 0.001', 'genomenet')

/Users/mike/.gem/gems/bio-1.4.2/lib/bio/appl/blast/genomenet.rb:251:in `exec_genomenet': cannot understand response (RuntimeError)
        from /Users/mike/.gem/gems/bio-1.4.2/lib/bio/appl/blast.rb:368:in `query'
        from bin/blast:17:in `<main>'

I've tried downloading the latest version of genomenet.rb but this did not appear to solve the problem this time.

Thank you

Michael Barton

On 20 March 2011 16:56, Michael Barton wrote:
> Hi Toshiaki,
>
> Thank you for the patch. That worked perfectly. I sometimes get unusual parsing
> errors which I think are originating from raxml. Has anyone else experienced
> something similar?
>
> Cheers
>
> Mike
>
> On Thu, Mar 17, 2011 at 10:28:21AM +0900, Toshiaki Katayama wrote:
>> As for the GenomeNet part, it is already fixed but not yet released.
>>
>> https://github.com/bioruby/bioruby/blob/master/lib/bio/appl/blast/genomenet.rb#L241
>>
>> You can replace your installation with the above file for quick fix.
>>
>> Thanks,
>> Toshiaki
>>
>> On 2011/03/17, at 6:12, Michael Barton wrote:
>>
>> > Hi,
>> >
>> > I'm struggling to use the remote blast part of bioruby. I get errors using
>> > either the ddbj or genomenet versions. With genomenet I get an error about
>> > a nil response and with genomenet I get an error for 'cannot understand
>> > response'. I've tried different combinations of databases and options but keep
>> > getting the same problems.
>> >
>> > Bio::Blast::Remote.ddbj('blastp', 'nr-aa', '-e 0.001')
>> >
>> > Can anyone offer any suggestions?
>> >
>> > Cheers
>> >
>> > Mike
>> > _______________________________________________
>> > BioRuby Project - http://www.bioruby.org/
>> > BioRuby mailing list
>> > BioRuby at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/bioruby
>>

From ngoto at gen-info.osaka-u.ac.jp Fri Jun 22 16:09:43 2012
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Sat, 23 Jun 2012 05:09:43 +0900
Subject: [BioRuby] Remote blast not working?
In-Reply-To:
References: <20110316211201.GA2640@nku069218.hh.nku.edu> <540CCF1E-2F3B-441A-9C44-2F36BB7D2035@hgc.jp> <20110320205651.GE3954@Michael-Bartons-MacBook.local>
Message-ID: <201206222019.q5MKJTRA029396@portal.open-bio.org>

Hi,

It seems that the Genomenet BLAST site has been modified again by the Genomenet site administrators. Currently, it is not easy to keep up with their changes.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp

On Fri, 22 Jun 2012 12:27:12 -0400
Michael Barton wrote:

> Hi,
>
> I appear to be having a similar problem again using ruby 1.9.3 and bio 1.4.2.
>
> blast = Bio::Blast.remote('blastp', 'nr-aa', '-e 0.001', 'genomenet')
>
> /Users/mike/.gem/gems/bio-1.4.2/lib/bio/appl/blast/genomenet.rb:251:in
> `exec_genomenet': cannot understand response (RuntimeError)
> from /Users/mike/.gem/gems/bio-1.4.2/lib/bio/appl/blast.rb:368:in `query'
> from bin/blast:17:in `<main>'
>
> I've tried downloading the latest version of genomenet.rb but this did
> not appear to solve the problem this time.
>
> Thank you
>
> Michael Barton
>
>
>
> On 20 March 2011 16:56, Michael Barton wrote:
> > Hi Toshiaki,
> >
> > Thank you for the patch. That worked perfectly. I sometimes get unusual parsing
> > errors which I think are originating from raxml. Has anyone else experienced
> > something similar?
> >
> > Cheers
> >
> > Mike
> >
> > On Thu, Mar 17, 2011 at 10:28:21AM +0900, Toshiaki Katayama wrote:
> >> As for the GenomeNet part, it is already fixed but not yet released.
> >>
> >> https://github.com/bioruby/bioruby/blob/master/lib/bio/appl/blast/genomenet.rb#L241
> >>
> >> You can replace your installation with the above file for quick fix.
> >>
> >> Thanks,
> >> Toshiaki
> >>
> >> On 2011/03/17, at 6:12, Michael Barton wrote:
> >>
> >> > Hi,
> >> >
> >> > I'm struggling to use the remote blast part of bioruby. I get errors using
> >> > either the ddbj or genomenet versions. With genomenet I get an error about
> >> > a nil response and with genomenet I get an error for 'cannot understand
> >> > response'. I've tried different combinations of databases and options but keep
> >> > getting the same problems.
> >> >
> >> > Bio::Blast::Remote.ddbj('blastp', 'nr-aa', '-e 0.001')
> >> >
> >> > Can anyone offer any suggestions?
> >> > > >> > Cheers > >> > > >> > Mike > >> > _______________________________________________ > >> > BioRuby Project - http://www.bioruby.org/ > >> > BioRuby mailing list > >> > BioRuby at lists.open-bio.org > >> > http://lists.open-bio.org/mailman/listinfo/bioruby > >> > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From mail at michaelbarton.me.uk Mon Jun 25 16:32:35 2012 From: mail at michaelbarton.me.uk (Michael Barton) Date: Mon, 25 Jun 2012 16:32:35 -0400 Subject: [BioRuby] Remote blast not working? In-Reply-To: <201206222019.q5MKJTRA029396@portal.open-bio.org> References: <20110316211201.GA2640@nku069218.hh.nku.edu> <540CCF1E-2F3B-441A-9C44-2F36BB7D2035@hgc.jp> <20110320205651.GE3954@Michael-Bartons-MacBook.local> <201206222019.q5MKJTRA029396@portal.open-bio.org> Message-ID: <20120625203235.GA15382@bartonh-mbp-01.ConnectAkron> I know the EBI maintains a REST API for blast. They have an example ruby script for accessing it as well. Building a bio-gem around this could solve the problem of keeping up with Genomenet in bio core. A bio-gem could also use Nokogiri for faster XML parsing too. I think this could be a good student project perhaps. On Sat, Jun 23, 2012 at 05:09:43AM +0900, Naohisa GOTO wrote: > Hi, > > It seems that the Genomenet BLAST site is modified again by the > Genomenet site administrators. Currently, it is not easy to catch up > their changes. > > Naohisa Goto ngoto at gen-info.osaka-u.ac.jp > > On Fri, 22 Jun 2012 12:27:12 -0400 Michael Barton > wrote: > > > Hi, > > > > I appear to be having a similar problem again using ruby 1.9.3 and > > bio 1.4.2. 
> >
> > blast = Bio::Blast.remote('blastp', 'nr-aa', '-e 0.001',
> > 'genomenet')
> >
par error: Word too long: /Users/mike/.gem/gems/bio-1.4.2/lib/bio/appl/blast/genomenet.rb:251:in

From marian.povolny at gmail.com Mon Jun 25 16:38:10 2012
From: marian.povolny at gmail.com (Marjan Povolni)
Date: Mon, 25 Jun 2012 22:38:10 +0200
Subject: [BioRuby] GSoC weekly status report No.5
Message-ID:

http://blog.mpthecoder.com/post/25870737554/gsoc-weekly-status-report-no-5

*Summary of the last week*

During the last week a few improvements have been made:

- the validation messages have been improved with file names and line numbers, in the compiler error style,
- filtering has been added,
- replacing escaped characters has been re-implemented to get a huge performance improvement. The 1GB file that required 10min for parsing because of 6.5 million escaped characters is now parsed in 22.5 seconds, only 0.5 seconds more than when replacing them is turned off,
- added a tool for correctly counting features in a GFF3 file. This will be useful because the user can then find a good value for the feature cache size by using this tool to get the correct count and the benchmark tool to get the count for a particular cache size. The tool is still slow for some files, so I'm thinking about how to improve that,
- other small fixes, comments and similar...

*More on filtering*

The filtering was first implemented using classes, but later refactored using delegates instead. The result was 50 lines less code. The user can now specify a filter before parsing a file like this:

    GFF3File.parse_by_records("file.gff3",
                              NO_VALIDATION,
                              false,
                              NO_BEFORE_FILTER,
                              OR(ATTRIBUTE("ID", EQUALS("1")),
                                 ATTRIBUTE("ID", CONTAINS("2"))));

The first filter, which is set to none in this example, is applied before the line is parsed, which means that it doesn't support the ATTRIBUTE and FIELD predicates. The following predicates are implemented: FIELD, ATTRIBUTE, EQUALS, CONTAINS, STARTS_WITH, AND, OR, NOT.
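
For Ruby readers, the idea of building filters out of small composable predicates can be sketched with lambdas, roughly in the way the message describes delegates being used in D. This is only an illustration under stated assumptions: the `Record` struct and the `Filters` module with its method names (`equals`, `contains`, `starts_with`, `field`, `attribute`, `any_of`, `all_of`, `negate`) are hypothetical and are not the API of the D library.

```ruby
# Hypothetical sketch of composable filter predicates built from lambdas.
# None of these names come from the D library discussed in the message.

# Stand-in for a parsed GFF3 record: two Hashes keyed by name.
Record = Struct.new(:fields, :attributes)

module Filters
  module_function

  # String predicates: each returns a lambda that takes a String or nil.
  def equals(expected)
    ->(value) { value == expected }
  end

  def contains(fragment)
    ->(value) { !value.nil? && value.include?(fragment) }
  end

  def starts_with(prefix)
    ->(value) { !value.nil? && value.start_with?(prefix) }
  end

  # Record predicates: apply a string predicate to one field or attribute.
  def field(name, string_pred)
    ->(record) { string_pred.call(record.fields[name]) }
  end

  def attribute(name, string_pred)
    ->(record) { string_pred.call(record.attributes[name]) }
  end

  # Combinators over record predicates (counterparts of OR, AND, NOT).
  def any_of(*preds)
    ->(record) { preds.any? { |pred| pred.call(record) } }
  end

  def all_of(*preds)
    ->(record) { preds.all? { |pred| pred.call(record) } }
  end

  def negate(pred)
    ->(record) { !pred.call(record) }
  end
end

# Returns the ID attribute of every record accepted by the predicate.
def matching_ids(records, pred)
  records.select { |record| pred.call(record) }
         .map { |record| record.attributes["ID"] }
end
```

A filter equivalent in spirit to the OR(...) example would then be `Filters.any_of(Filters.attribute("ID", Filters.equals("1")), Filters.attribute("ID", Filters.contains("2")))`. Unlike the D version, nothing in this sketch is checked at compile time; a misused predicate would only fail when it is called.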
In case the predicates are used in a way which is not allowed, there will be a compiler error. Otherwise the allowed combinations should be logical enough to guess (but I'll document them too).

I altered the benchmark tool a few times to test the performance, and what I found was very positive: the performance impact in the few tests I did was very small. I'll have more data once the next tool is finished.

*New week*

Release early and often - it's a mantra I have heard quite a few times before. So as the group of mentors and students has agreed, every student will be releasing a gem at the end of this week. I'm still not sure what will be in it, because the support for shared libraries in D compilers for Linux has not been implemented yet. So it will probably be a combination of a command-line utility and a Ruby module which uses that utility.

What I have currently in mind is re-implementing the gff3-fetch utility developed by Pjotr in Ruby, to make it faster using D. But first I'll implement filtering functionality for it, so the users can reduce a file to records which are interesting to them and then parse that using a parser in Ruby, for example. A Ruby module that would make using this utility easier for Ruby developers seems like a good idea for the first release.

Part of this utility will be to support GFF3 output, so that will be implemented too (and has already been done today to some extent).

From lomereiter at googlemail.com Tue Jun 26 11:45:21 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Tue, 26 Jun 2012 19:45:21 +0400
Subject: [BioRuby] [GSoC] weekly report #6
Message-ID:

Hello all,

here's my weekly report:
http://lomereiter.wordpress.com/2012/06/26/gsoc-weekly-report-6/

Summary:

Ruby bindings moved to parsing JSON from command-line tool output, and everything works fine. That also means you can use the JSON output from other languages.

SAM input was added. It is not optimized at all; the parser currently does a lot of unnecessary memory allocations.
Now it's about 3x as slow as samtools one, but it should be easy to improve the speed (at least doubling is possible according to profiling results). Also there's now a command line tool called Sambamba, which is used for creating JSON output. But it also outputs SAM and accepts both SAM and BAM formats as an input. Options are mostly the same as for the samtools view command, including fetching regions with the same syntax, and some filtering (e.g. on quality). From lomereiter at googlemail.com Sat Jun 2 13:06:12 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Sat, 2 Jun 2012 17:06:12 +0400 Subject: [BioRuby] Parsing line-based formats with Ragel Message-ID: Hi guys, I've recently discovered absolutely cool thing called Ragel ( http://www.complang.org/ragel/). It is a finite state machine compiler, its applications include parsing Cucumber features in Gherkin, parsing HTTP requests in Mongrel, and implementing pack/unpack functions in Rubinius. It can be used for creating parser for any regular language, that includes nearly every line-based format. It generates code for C, C++, Objective C, D(!), Java, and Go. The speed of generated code is incredible. I wrote a few words more about it in my blog: http://lomereiter.wordpress.com/2012/06/02/ragel-and-bioinformatics/ Basically, you write a formal grammar, define which snippets of code to execute on state transitions, and everything just works. As for me, I'm going to implement SAM parser with this tool. It can also be useful for Marjan. I wrote a GFF3 grammar, but it might be incorrect in some places. Here's a basic example of usage: https://github.com/lomereiter/bioragel/blob/master/examples/d/gff3.rl -- Artem From pjotr.public14 at thebird.nl Sat Jun 2 13:15:22 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Sat, 2 Jun 2012 15:15:22 +0200 Subject: [BioRuby] Parsing line-based formats with Ragel In-Reply-To: References: Message-ID: <20120602131522.GA3670@thebird.nl> Hmm. 
Maybe we could use this for bio-ngs recipes too. Pj. On Sat, Jun 02, 2012 at 05:06:12PM +0400, Artem Tarasov wrote: > Hi guys, > > I've recently discovered absolutely cool thing called Ragel ( > http://www.complang.org/ragel/). It is a finite state machine compiler, its > applications include parsing Cucumber features in Gherkin, parsing HTTP > requests in Mongrel, and implementing pack/unpack functions in Rubinius. > > It can be used for creating parser for any regular language, that includes > nearly every line-based format. It generates code for C, C++, Objective C, > D(!), Java, and Go. The speed of generated code is incredible. > > I wrote a few words more about it in my blog: > http://lomereiter.wordpress.com/2012/06/02/ragel-and-bioinformatics/ > > Basically, you write a formal grammar, define which snippets of code to > execute on state transitions, and everything just works. As for me, I'm > going to implement SAM parser with this tool. > > It can also be useful for Marjan. I wrote a GFF3 grammar, but it might be > incorrect in some places. Here's a basic example of usage: > https://github.com/lomereiter/bioragel/blob/master/examples/d/gff3.rl > > > > -- > Artem > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From marian.povolny at gmail.com Sat Jun 2 14:18:40 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Sat, 2 Jun 2012 16:18:40 +0200 Subject: [BioRuby] Parsing line-based formats with Ragel In-Reply-To: References: Message-ID: Cool, definitely something worth checking out for GFF3. -- Marjan On Sat, Jun 2, 2012 at 3:06 PM, Artem Tarasov wrote: > Hi guys, > > I've recently discovered absolutely cool thing called Ragel ( > http://www.complang.org/ragel/). 
It is a finite state machine compiler, > its > applications include parsing Cucumber features in Gherkin, parsing HTTP > requests in Mongrel, and implementing pack/unpack functions in Rubinius. > > It can be used for creating parser for any regular language, that includes > nearly every line-based format. It generates code for C, C++, Objective C, > D(!), Java, and Go. The speed of generated code is incredible. > > I wrote a few words more about it in my blog: > http://lomereiter.wordpress.com/2012/06/02/ragel-and-bioinformatics/ > > Basically, you write a formal grammar, define which snippets of code to > execute on state transitions, and everything just works. As for me, I'm > going to implement SAM parser with this tool. > > It can also be useful for Marjan. I wrote a GFF3 grammar, but it might be > incorrect in some places. Here's a basic example of usage: > https://github.com/lomereiter/bioragel/blob/master/examples/d/gff3.rl > > > > -- > Artem > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From pjotr.public14 at thebird.nl Sat Jun 2 16:24:55 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Sat, 2 Jun 2012 18:24:55 +0200 Subject: [BioRuby] Parsing line-based formats with Ragel In-Reply-To: References: Message-ID: <20120602162455.GA6483@thebird.nl> On Sat, Jun 02, 2012 at 04:18:40PM +0200, Marjan Povolni wrote: > Cool, definitely something worth checking out for GFF3. One reason the state-machine is fast is because it does not create objects in memory (avoiding so called death by object creation ;). Data will be in the CPU cache, rather than main memory. Be interesting to see if Artem can run parsers on multi-core. With GFF3 line parsing, a really simple format, we immediately create a range of objects. Of course, this can happen on the stack too, so the speed advantage may not be that important. 
Still, I think this could be interesting for GFF3, especially for escape characters and character encodings, because that is the most complicated part to get right.

For now, we choose to assume GFF3 is plain ASCII. So, I guess we file this under 'enhancements'. Right?

Pj.

From marian.povolny at gmail.com Sat Jun 2 16:58:39 2012
From: marian.povolny at gmail.com (Marjan Povolni)
Date: Sat, 2 Jun 2012 18:58:39 +0200
Subject: [BioRuby] Parsing line-based formats with Ragel
In-Reply-To: <20120602162455.GA6483@thebird.nl>
References: <20120602162455.GA6483@thebird.nl>
Message-ID:

Well, not really. Currently my parser only creates a D struct named Record, which contains only slices of the original string. So in basic conditions there is only a string for the whole line (which can be a slice of a bigger string), a Record which can also be located on the stack, and a dynamic associative array for the attributes, which again maps string slices to slices. Additional string objects are only created when an escaped character has to be replaced with the original.

Otherwise all operations are simply handling slices (which are start address+length, and when copied behave and cost more like an int than an object), and most if not all operations are using the stack. And this can be improved upon with an array which is not immutable, which would make it possible to replace the escaped characters in place.

Given lines are not that long, everything is still in CPU cache. Linking into trees could have problems with cache, but it depends on how many lines the parser will keep in memory at every moment. 3MB is a lot of lines.

And the current parser should be handling escaped characters ok now, except for the ones with values over 0x1F.

I would certainly like to make a version of the parser with Ragel. There is a range of checks that I do manually with multiple functions, and Ragel might be more optimized for those checks. And the result could be much cleaner.
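[Editorial sketch, not Marjan's D code.] The "slices instead of objects" idea described above can be approximated in Ruby by keeping only (offset, length) pairs into the original line and doing the unescaping work only when a field actually contains a percent-escape. The names FieldSlice, split_fields and field_text are hypothetical, and Ruby still creates a String whenever a field is read, so this illustrates the bookkeeping rather than the memory behaviour of D slices.

```ruby
# Hypothetical illustration of slice-based parsing of one GFF3 line.
# A slice is just an offset and a length into the original line.
FieldSlice = Struct.new(:offset, :length)

# Split a tab-delimited line into slices in one left-to-right pass.
def split_fields(line)
  slices = []
  start = 0
  while (tab = line.index("\t", start))
    slices << FieldSlice.new(start, tab - start)
    start = tab + 1
  end
  slices << FieldSlice.new(start, line.length - start)
  slices
end

# Read a field; the replacement step runs only when a percent-escape
# (for example %3B for ';') is actually present in that field.
def field_text(line, slice)
  raw = line[slice.offset, slice.length]
  return raw unless raw.include?("%")
  raw.gsub(/%([0-9A-Fa-f]{2})/) { $1.hex.chr }
end

line = "chr1\tsrc\tgene\t100\t200\t.\t+\t.\tID=g1;Note=a%3Bb"
slices = split_fields(line)
puts field_text(line, slices[0])   # the seqid column, no escapes
puts field_text(line, slices[8])   # the attributes column, %3B decoded
```

A real GFF3 parser would split the attributes column on ';' before decoding, since an escaped %3B must not be treated as a separator; the sketch skips that step to stay short.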
On Sat, Jun 2, 2012 at 6:24 PM, Pjotr Prins wrote: > On Sat, Jun 02, 2012 at 04:18:40PM +0200, Marjan Povolni wrote: > > Cool, definitely something worth checking out for GFF3. > > One reason the state-machine is fast is because it does not create > objects in memory (avoiding so called death by object creation ;). > Data will be in the CPU cache, rather than main memory. Be interesting > to see if Artem can run parsers on multi-core. > > With GFF3 line parsing, a really simple format, we immediately create > a range of objects. Of course, this can happen on the stack too, so > the speed advantage may not be that important. > > Still, I think especially for escape characters and character encodings > this could be interesting for GFF3. Because that is the most > complicated to get right. > > For now, we choose to assume GFF3 is plain ASCII. So, I guess we file > this under 'enhancements'. Right? > > Pj. > From pjotr.public14 at thebird.nl Sun Jun 3 14:20:48 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Sun, 3 Jun 2012 16:20:48 +0200 Subject: [BioRuby] Parsing line-based formats with Ragel In-Reply-To: References: Message-ID: <20120603142048.GA21415@thebird.nl> Trust a CS student to start on finite state machines. For us mere mortals, here is a good write-up on Ragel principles for Rubyists http://zedshaw.com/essays/ragel_state_charts.html by the much loved Zed :) Pj. On Sat, Jun 02, 2012 at 05:06:12PM +0400, Artem Tarasov wrote: > Hi guys, > > I've recently discovered absolutely cool thing called Ragel ( > http://www.complang.org/ragel/). It is a finite state machine compiler, its > applications include parsing Cucumber features in Gherkin, parsing HTTP > requests in Mongrel, and implementing pack/unpack functions in Rubinius. > > It can be used for creating parser for any regular language, that includes > nearly every line-based format. It generates code for C, C++, Objective C, > D(!), Java, and Go. The speed of generated code is incredible. 
>
> I wrote a few words more about it in my blog:
> http://lomereiter.wordpress.com/2012/06/02/ragel-and-bioinformatics/
>
> Basically, you write a formal grammar, define which snippets of code to
> execute on state transitions, and everything just works. As for me, I'm
> going to implement SAM parser with this tool.
>
> It can also be useful for Marjan. I wrote a GFF3 grammar, but it might be
> incorrect in some places. Here's a basic example of usage:
> https://github.com/lomereiter/bioragel/blob/master/examples/d/gff3.rl
>
>
>
> --
> Artem
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>

From marian.povolny at gmail.com Sun Jun 3 21:07:18 2012
From: marian.povolny at gmail.com (Marjan Povolni)
Date: Sun, 3 Jun 2012 23:07:18 +0200
Subject: [BioRuby] GSoC weekly status report No.2
Message-ID:

http://blog.mpthecoder.com/post/24355573626/gsoc-weekly-status-report-no-2

It's the end of the second week of GSoC and time for a new report. I spent the last week mostly doing work based on criticism from my mentor. The D parser which parses lines into records is now in pretty good shape, and tested.

Today I received a list of new issues that need to be resolved before going further, but they're not that much work and I can plan some new developments. A utility for validation is in planning for next week, which could also be used for performance measurement. And after that I will turn to making the current parser parallel.

Also, tomorrow I'll be defending my Master's thesis, after which I should be able to concentrate more on the GFF3 parser.
From donttrustben at gmail.com Sun Jun 3 22:10:22 2012 From: donttrustben at gmail.com (Ben Woodcroft) Date: Mon, 4 Jun 2012 08:10:22 +1000 Subject: [BioRuby] Parsing line-based formats with Ragel In-Reply-To: <20120603142048.GA21415@thebird.nl> References: <20120603142048.GA21415@thebird.nl> Message-ID: Just wanted to say thanks for pointing this out Artem - can definitely see myself using it in the future. If only you'd been a few days earlier! Perhaps idealistically, the state machine might be written once, and then the last mile be implemented in multiple different Bio* projects. On 4 June 2012 00:20, Pjotr Prins wrote: > Trust a CS student to start on finite state machines. For us mere > mortals, here is a good write-up on Ragel principles for Rubyists > > http://zedshaw.com/essays/ragel_state_charts.html > > by the much loved Zed :) > > Pj. > > On Sat, Jun 02, 2012 at 05:06:12PM +0400, Artem Tarasov wrote: > > Hi guys, > > > > I've recently discovered absolutely cool thing called Ragel ( > > http://www.complang.org/ragel/). It is a finite state machine compiler, > its > > applications include parsing Cucumber features in Gherkin, parsing HTTP > > requests in Mongrel, and implementing pack/unpack functions in Rubinius. > > > > It can be used for creating parser for any regular language, that > includes > > nearly every line-based format. It generates code for C, C++, Objective > C, > > D(!), Java, and Go. The speed of generated code is incredible. > > > > I wrote a few words more about it in my blog: > > http://lomereiter.wordpress.com/2012/06/02/ragel-and-bioinformatics/ > > > > Basically, you write a formal grammar, define which snippets of code to > > execute on state transitions, and everything just works. As for me, I'm > > going to implement SAM parser with this tool. > > > > It can also be useful for Marjan. I wrote a GFF3 grammar, but it might be > > incorrect in some places. 
Here's a basic example of usage: > > https://github.com/lomereiter/bioragel/blob/master/examples/d/gff3.rl > > > > > > > > -- > > Artem > > _______________________________________________ > > BioRuby Project - http://www.bioruby.org/ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > -- -- Ben Woodcroft http://ecogenomic.org/users/ben-woodcroft From cjfields at illinois.edu Mon Jun 4 00:56:18 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 4 Jun 2012 00:56:18 +0000 Subject: [BioRuby] Parsing line-based formats with Ragel In-Reply-To: References: <20120603142048.GA21415@thebird.nl> Message-ID: Have to agree, and in cases where a Bio* might run into problems with Ragel (Perl or Python) we can at least look at the grammar and use something for those languages that is similar in concept (e.g. Marpa for Perl), or go a little more roundabout and bind to C-generated ones from Ragel. chris On Jun 3, 2012, at 5:10 PM, Ben Woodcroft wrote: > Just wanted to say thanks for pointing this out Artem - can definitely see > myself using it in the future. If only you'd been a few days earlier! > > Perhaps idealistically, the state machine might be written once, and then > the last mile be implemented in multiple different Bio* projects. > > On 4 June 2012 00:20, Pjotr Prins wrote: > >> Trust a CS student to start on finite state machines. For us mere >> mortals, here is a good write-up on Ragel principles for Rubyists >> >> http://zedshaw.com/essays/ragel_state_charts.html >> >> by the much loved Zed :) >> >> Pj. 
>> >> On Sat, Jun 02, 2012 at 05:06:12PM +0400, Artem Tarasov wrote: >>> Hi guys, >>> >>> I've recently discovered absolutely cool thing called Ragel ( >>> http://www.complang.org/ragel/). It is a finite state machine compiler, >> its >>> applications include parsing Cucumber features in Gherkin, parsing HTTP >>> requests in Mongrel, and implementing pack/unpack functions in Rubinius. >>> >>> It can be used for creating parser for any regular language, that >> includes >>> nearly every line-based format. It generates code for C, C++, Objective >> C, >>> D(!), Java, and Go. The speed of generated code is incredible. >>> >>> I wrote a few words more about it in my blog: >>> http://lomereiter.wordpress.com/2012/06/02/ragel-and-bioinformatics/ >>> >>> Basically, you write a formal grammar, define which snippets of code to >>> execute on state transitions, and everything just works. As for me, I'm >>> going to implement SAM parser with this tool. >>> >>> It can also be useful for Marjan. I wrote a GFF3 grammar, but it might be >>> incorrect in some places. 
Here's a basic example of usage: >>> https://github.com/lomereiter/bioragel/blob/master/examples/d/gff3.rl >>> >>> >>> >>> -- >>> Artem >>> _______________________________________________ >>> BioRuby Project - http://www.bioruby.org/ >>> BioRuby mailing list >>> BioRuby at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioruby >>> >> _______________________________________________ >> BioRuby Project - http://www.bioruby.org/ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby >> > > > > -- > -- > Ben Woodcroft > http://ecogenomic.org/users/ben-woodcroft > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From pjotr.public14 at thebird.nl Mon Jun 4 05:17:45 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 4 Jun 2012 07:17:45 +0200 Subject: [BioRuby] Parsing line-based formats with Ragel In-Reply-To: References: <20120603142048.GA21415@thebird.nl> Message-ID: <20120604051745.GB24131@thebird.nl> On Mon, Jun 04, 2012 at 12:56:18AM +0000, Fields, Christopher J wrote: > Have to agree, and in cases where a Bio* might run into problems > with Ragel (Perl or Python) we can at least look at the grammar and > use something for those languages that is similar in concept (e.g. > Marpa for Perl), or go a little more roundabout and bind to > C-generated ones from Ragel. Also agree. Parsing is a common theme in Bio*. A state engine would be a great abstraction, targetting C or D, and even the interpreted languages. The SAM parser would be a great proof-of-concept. I am also very interested to see how it will perform against samtools. The spanner in the works may be that we tend to be very sloppy about standards. So relaxed parsers may also be needed. Pj. 
From p.j.a.cock at googlemail.com Mon Jun 4 09:27:10 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 4 Jun 2012 10:27:10 +0100
Subject: [BioRuby] Parsing line-based formats with Ragel
In-Reply-To: <20120604051745.GB24131@thebird.nl>
References: <20120603142048.GA21415@thebird.nl> <20120604051745.GB24131@thebird.nl>
Message-ID:

On Mon, Jun 4, 2012 at 6:17 AM, Pjotr Prins wrote:
> On Mon, Jun 04, 2012 at 12:56:18AM +0000, Fields, Christopher J wrote:
>> Have to agree, and in cases where a Bio* might run into problems
>> with Ragel (Perl or Python) we can at least look at the grammar and
>> use something for those languages that is similar in concept (e.g.
>> Marpa for Perl), or go a little more roundabout and bind to
>> C-generated ones from Ragel.
>
> Also agree. Parsing is a common theme in Bio*. A state engine would
> be a great abstraction, targetting C or D, and even the interpreted
> languages. The SAM parser would be a great proof-of-concept. I am
> also very interested to see how it will perform against samtools.
>
> The spanner in the works may be that we tend to be very sloppy
> about standards. So relaxed parsers may also be needed.

When I read Artem's post about Ragel and formal grammars for parsing bioinformatics file formats I was intrigued, but cautious.

Biopython used to have a lot of its parsers written in Martel, a home-grown "regular expressions on steroids" parsing framework. One significant downside was that even minor tweaks to the format description required a good knowledge of regular expressions and how the Martel grammar worked. This created a significant barrier to entry, e.g. inserting a new optional line type at a particular point in a file format was initially quite daunting, leaving parser maintenance in the hands of a few people.

(The reason we ended up dropping Martel was a combination of poor scaling with large datasets, problems with a third party library API change, and lack of time from the original author to work on it.
Most of our parsers are now 'pure Python').

It would not surprise me that over half the time spent on writing a parser goes on dealing with corner cases/borderline invalid inputs, and that a formal grammar may not be the best way to deal with 'messy' data. But I would hope SAM/BAM files would be well enough behaved to make this worth trying.

Regards,

Peter

From cjfields at illinois.edu Mon Jun 4 12:56:39 2012
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 4 Jun 2012 12:56:39 +0000
Subject: [BioRuby] Parsing line-based formats with Ragel
In-Reply-To: <20120604051745.GB24131@thebird.nl>
References: <20120603142048.GA21415@thebird.nl> <20120604051745.GB24131@thebird.nl>
Message-ID:

On Jun 4, 2012, at 12:17 AM, Pjotr Prins wrote:
> On Mon, Jun 04, 2012 at 12:56:18AM +0000, Fields, Christopher J wrote:
>> Have to agree, and in cases where a Bio* might run into problems
>> with Ragel (Perl or Python) we can at least look at the grammar and
>> use something for those languages that is similar in concept (e.g.
>> Marpa for Perl), or go a little more roundabout and bind to
>> C-generated ones from Ragel.
>
> Also agree. Parsing is a common theme in Bio*. A state engine would
> be a great abstraction, targetting C or D, and even the interpreted
> languages. The SAM parser would be a great proof-of-concept. I am
> also very interested to see how it will perform against samtools.
>
> The spanner in the works may be that we tend to be very sloppy about
> standards. So relaxed parsers may also be needed.

Either that, or use the grammar as a source of validation (e.g. if the parse fails, the data is not formatted correctly). That's basically the tack I plan with Perl 6 grammars.

chris

> Pj.
From lomereiter at googlemail.com Mon Jun 4 14:31:14 2012
From: lomereiter at googlemail.com (Artem Tarasov)
Date: Mon, 4 Jun 2012 18:31:14 +0400
Subject: [BioRuby] Parsing line-based formats with Ragel
In-Reply-To:
References: <20120603142048.GA21415@thebird.nl> <20120604051745.GB24131@thebird.nl>
Message-ID:

On Mon, Jun 4, 2012 at 4:56 PM, Fields, Christopher J wrote:
> > Also agree. Parsing is a common theme in Bio*. A state engine would
> > be a great abstraction, targetting C or D, and even the interpreted
> > languages. The SAM parser would be a great proof-of-concept. I am
> > also very interested to see how it will perform against samtools.
> >
> > The spanner in the works may be that we tend to be very sloppy about
> > standards. So relaxed parsers may also be needed.
>
> Either that, or use the grammar as a source of validation (e.g. if the
> parse fails, the data is not formatted correctly). That's basically the
> tack I plan with Perl 6 grammars.
>
> chris
>
>

Yes, I think the problem of invalid data can be addressed by adding rules with a less strict grammar. For instance, if the format is tab-delimited, we can track the problem down to a particular field by relaxing the restrictions on the character set, like

    invalidsomefield = [^\t]+ %some_error_action;
    somefield = (bunch of rules conformant to spec) | invalidsomefield;

If we want more comprehensible error messages, another set of rules for different kinds of invalid input can be used instead of [^\t]+.

The big plus of state machines is that they don't scan the string multiple times, as usually happens with a hand-written parser that does several checks in turn.
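[Editorial sketch, not Ragel output.] The two points above, a fallback rule that pins an error to one field and a scan that touches each character only once, can be imitated with a small hand-written state machine in Ruby. FIELD_RULES, DIGIT and invalid_fields are hypothetical names, and the only rule shown (digits in the GFF3 start and end columns) is an assumption chosen for illustration.

```ruby
# Hypothetical single-pass checker for a tab-delimited line.
# FIELD_RULES maps a column index to a predicate on one character;
# columns without a rule accept any non-tab character.
DIGIT = ->(ch) { ch >= "0" && ch <= "9" }
FIELD_RULES = { 3 => DIGIT, 4 => DIGIT } # GFF3 start and end columns

# Returns the indexes of the columns that broke their rule. Each
# character is visited exactly once. A bad field does not abort the
# scan: it is recorded (the "invalid field" alternative) and checking
# resumes at the next tab.
def invalid_fields(line)
  bad = []
  column = 0
  field_ok = true
  line.each_char do |ch|
    if ch == "\t"
      bad << column unless field_ok
      column += 1
      field_ok = true
    elsif field_ok
      rule = FIELD_RULES[column]
      field_ok = false if rule && !rule.call(ch)
    end
  end
  bad << column unless field_ok
  bad
end

p invalid_fields("chr1\tsrc\tgene\t100\t2x0\t.\t+\t.\tID=g1") # prints [4]
```

Note that an empty numeric column passes this check; a stricter rule set would need a per-column minimum length, which a Ragel grammar expresses directly with `+`.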
From lomereiter at googlemail.com Mon Jun 4 18:02:58 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 4 Jun 2012 22:02:58 +0400 Subject: [BioRuby] [GSoC] Weekly report #3 Message-ID: Hello all, the post is here: http://lomereiter.wordpress.com/2012/06/04/gsoc-weekly-report-3/ I've implemented random access to BAM file, using index file. Also I created a generic function for memoization which stores decompressed blocks in cache, following some desired cache strategy. Currently, I use simple FIFO cache. Also I studied how to make SAM output faster. I came to the conclusion that not only D standard library functions, but even ones of *printf family are too slow for this purpose, because they have to parse format string. Instead, I need to use specialized functions for printing integers and floats. Currently, output is about 4x slower than in samtools. So I have to take back some of my harsh words about its code and say that there is something to learn from there. It indeed uses its own functions for integer output, and also uses string buffer to do less calls (system functions can't be inlined). I'll use this approach, too, so very soon my library will be usable in pipelines, but only for output. Then I'm going to move on to allow alignments to be modified and outputted to BAM. After that, SAM parser needs to be implemented, and I'm going to use Ragel (finite-state machine compiler) for that purpose. So by the beginning of July I want to have SAM<->BAM conversion working, with a good speed. Add to that first release of biogem, and those are my plans for this month. 
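[Editorial sketch, not Artem's D implementation.] The memoization with a FIFO cache mentioned above might look roughly like this in Ruby. FifoCache and load_block are hypothetical names, and the string returned by load_block merely stands in for a decompressed BGZF block.

```ruby
# Hypothetical FIFO cache for decompressed blocks, keyed by file offset.
class FifoCache
  def initialize(capacity)
    @capacity = capacity
    @store = {} # Ruby 1.9+ hashes preserve insertion order
  end

  # Return the cached value for key, or compute it with the block.
  # When the cache is full, the entry inserted first is evicted,
  # regardless of how recently it was read.
  def fetch(key)
    return @store[key] if @store.key?(key)
    @store.shift if @store.size >= @capacity
    @store[key] = yield(key)
  end

  def size
    @store.size
  end
end

decompressions = 0
cache = FifoCache.new(2)
load_block = ->(offset) { decompressions += 1; "block@#{offset}" }

# Offsets 0 and 100 fill the cache; the second read of 0 is a hit;
# 200 evicts 0 (the oldest entry), so the final read of 0 misses again.
[0, 100, 0, 200, 0].each { |off| cache.fetch(off, &load_block) }
puts decompressions # prints 4
```

The access pattern is chosen to show the trade-off: a least-recently-used policy would have kept offset 0 and needed only 3 decompressions, at the cost of updating bookkeeping on every read.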
From p.j.a.cock at googlemail.com Mon Jun 4 19:36:25 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 4 Jun 2012 20:36:25 +0100 Subject: [BioRuby] [GSoC] Weekly report #3 In-Reply-To: References: Message-ID: On Mon, Jun 4, 2012 at 7:02 PM, Artem Tarasov wrote: > Hello all, > > the post is here: > http://lomereiter.wordpress.com/2012/06/04/gsoc-weekly-report-3/ > > I've implemented random access to BAM file, using index file. Also I > created a generic function for memoization which stores decompressed > blocks in cache, following some desired cache strategy. Currently, I > use simple FIFO cache. That sounds good. We've talked a little bit about the block caching strategy for Biopython's BGZF support - dropping the least recently used block would be good (LRU) but requires the overhead of storing and recording timestamps on each access. Currently my Biopython BGZF code just drops a cached block 'at random' (actually based on the dictionary hashing algorithm), and switching to FIFO was something I planned to try next (easily done with Python's OrderedDict class). FIFO seems like a good solution as the overheads are much lower than LRU. Have you got any good random access benchmarks to try this out with? i.e. something non-random, such as pulling mates of paired end reads. How many BGZF blocks are you keeping in the cache, and why? Are you thinking about BGZF output yet (which will be required in order to write BAM files)? Regards, Peter From lomereiter at googlemail.com Mon Jun 4 20:07:03 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Tue, 5 Jun 2012 00:07:03 +0400 Subject: [BioRuby] [GSoC] Weekly report #3 In-Reply-To: References: Message-ID: > Have you got any good random access benchmarks to try this out > with? i.e. something non-random, such as pulling mates of paired > end reads. > Currently, no. 
Please suggest your ideas about benchmarks because I suspect that you have much more experience with BAM files and better knowledge of use patterns.

> How many BGZF blocks are you keeping in the cache, and why?
>

Currently, 512. I don't know why, seems like a reasonable number (about 30MB of RAM). Maybe it should be a runtime parameter but I doubt that end users will bother with tweaking cache size.

> Are you thinking about BGZF output yet (which will be required
> in order to write BAM files)?
>

It's not hard at all. I already wrote packing of a string to BGZF in Ruby:
https://github.com/lomereiter/bioruby-bgzf/blob/master/lib/bio-bgzf/pack.rb

Parallelizing should also be easy, it's very similar to reading blocks from file. Determine how many alignments to pack in one block (it's 64 KiB max), send the compression task to the taskpool, then go create the next chunk of alignments, and so on.

From cswh at umich.edu Tue Jun 5 03:04:06 2012
From: cswh at umich.edu (Clayton Wheeler)
Date: Mon, 4 Jun 2012 23:04:06 -0400
Subject: [BioRuby] Weekly report: Indexed MAF access, Kyoto Cabinet, SQLite, and more
Message-ID: <2B6E16E9-3DBC-4F54-88F8-C42E03124A1E@umich.edu>

Hi all,

My latest blog post on (mostly) last week's work is here:
http://csw.github.com/bioruby-maf/blog/2012/06/04/indexed_maf_access/

Highlights include SQLite vs. Kyoto Cabinet, the path to BGZF support, and the challenges of supporting multiple Ruby implementations.

Clayton Wheeler
cswh at umich.edu

From francesco.strozzi at gmail.com Tue Jun 5 13:49:13 2012
From: francesco.strozzi at gmail.com (Francesco Strozzi)
Date: Tue, 5 Jun 2012 15:49:13 +0200
Subject: [BioRuby] OpenShift
Message-ID:

Hi all,

has anyone tried RedHat OpenShift? It's the new PaaS from RH, they can host mainly web applications and the basic account is completely free. They give you 3 workers + 1.5 GB of total RAM and around 3 GB of space.
By default they can host Perl, PHP, Java, Python and Ruby (1.8.7, epic fail here), but they also provide blank containers where you could setup the environment that you like with the language that you want. It seems to have a nice CLI with Git integration....I think this could be something useful for test environments or for free hosting of small apps. -- Francesco From begumisa at gmail.com Wed Jun 6 13:41:55 2012 From: begumisa at gmail.com (Godfrey M Begumisa) Date: Wed, 6 Jun 2012 16:41:55 +0300 Subject: [BioRuby] (no subject) Message-ID: I request my email to be removed from this mailing list -- Begumisa From p.j.a.cock at googlemail.com Thu Jun 7 11:32:34 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 7 Jun 2012 12:32:34 +0100 Subject: [BioRuby] Spam on wiki Message-ID: Hi all, I follow the changes to the Biopython wiki via RSS, and we (and the BioPerl wiki) had some minor vandalism recently which I have fixed. The BioRuby wiki changes RSS feed is: http://bioruby.org/w/index.php?title=Special:RecentChanges&feed=rss There is a similar new spam page on the BioRuby wiki from new user "Helium mint", but I don't think I have admin rights to do anything about it (block the user and delete the page): http://bioruby.org/w/index.php?title=Special:RecentChanges&days=30 Regards, Peter From sibert at wisc.edu Thu Jun 7 19:24:36 2012 From: sibert at wisc.edu (Bryan Sibert) Date: Thu, 7 Jun 2012 14:24:36 -0500 Subject: [BioRuby] Error reading ABIF SangerChromatogram data Message-ID: Hello, I am using BioRuby 1.4.2 on Ruby 1.9.3 installed on a 64-bit Windows Vista machine. I am trying to read some .ab1 files using the BioRuby ABIF class which inherits SangerChromatogram. Ruby raises a NoMethodError when I attempt this. 
Below is a simple program based upon the example usage provided in the documentation:

    require 'bio'
    chromatogram_ff = Bio::Abif.open("120605M4_05H_9.ab1")
    chromatogram = chromatogram_ff.next_entry

This raises the following error:

    C:\...\Desktop>test
    C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:107:in `get_entry_data': undefined method `match' for nil:NilClass (NoMethodError)
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:85:in `block in get_directory_entries'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:77:in `times'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:77:in `get_directory_entries'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:42:in `initialize'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/io/flatfile/splitter.rb:55:in `new'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/io/flatfile/splitter.rb:55:in `get_parsed_entry'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/io/flatfile.rb:288:in `next_entry'
        from C:/.../Desktop/test.rb:3:in `<main>'

I am at a loss as to how to fix this. If you have any ideas, please let me know.

Thanks,

Bryan

From ngoto at gen-info.osaka-u.ac.jp Fri Jun 8 13:28:51 2012
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Fri, 8 Jun 2012 22:28:51 +0900
Subject: [BioRuby] Error reading ABIF SangerChromatogram data
In-Reply-To:
References:
Message-ID: <201206081334.q58DYSH4022666@portal.open-bio.org>

Hi Bryan,

It is possible that the default file open mode on Windows is text (ASCII) mode and that the line-feed conversion from CR+LF to LF breaks the data being read. Please try the following workaround code:

    require 'bio'
    # Open the file explicitly in binary mode ("rb") so that no
    # CR+LF to LF conversion is applied to the ABIF data.
    f = File.open("120605M4_05H_9.ab1", "rb")
    chromatogram_ff = Bio::Abif.open(f)
    chromatogram = chromatogram_ff.next_entry

This will be fixed in the next version: the default file open mode of Bio::FlatFile will be changed to binary.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Thu, 7 Jun 2012 14:24:36 -0500 Bryan Sibert wrote:
> Hello,
>
> I am using BioRuby 1.4.2 on Ruby 1.9.3 installed on a 64-bit Windows Vista
> machine. I am trying to read some .ab1 files using the BioRuby ABIF class
> which inherits SangerChromatogram. Ruby raises a NoMethodError when I
> attempt this.
Below is a simple program based up on the example usage > provided in the documentation: > > require 'bio' > chromatogram_ff = Bio::Abif.open("120605M4_05H_9.ab1") > chromatogram = chromatogram_ff.next_entry > > This raises the following error: > > C:\...\Desktop>test > > C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:107:in > `get_entry_data': undefined method `match' for nil:NilClass (NoMethodError) > from > C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:85:in > `block in get_directory_entries' > from > C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:77:in > `times' > from > C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:77:in > `get_directory_entries' > from > C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/db/sanger_chromatogram/abif.rb:42:in > `initialize' > from > C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/io/flatfile/splitter.rb:55:in > `new' > from > C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/io/flatfile/splitter.rb:55:in > `get_parsed_entry' > from > C:/Ruby193/lib/ruby/gems/1.9.1/gems/bio-1.4.2/lib/bio/io/flatfile.rb:288:in > `next_entry' > from C:/.../Desktop/test.rb:3:in `
' > > > I am at a loss as to how to fix this. If you have any ideas, please let me > know. > > Thanks, > > Bryan > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From lomereiter at googlemail.com Mon Jun 11 17:25:48 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 11 Jun 2012 21:25:48 +0400 Subject: [BioRuby] [GSoC] weekly report #4 Message-ID: Hello everybody, here's my weekly report: http://lomereiter.wordpress.com/2012/06/11/gsoc-weekly-report-4/ I've added BAM output support (not parallelized yet) and alignment creation/modification - changing fields, adding tags, and replacing existing ones. Thus, the library has a lot of features at the moment, and I started documenting them on github wiki. Also I found out that there's a great tool in DMD distribution, called rdmd, which allows to execute D files as scripts, by just adding "#!/usr/bin/rdmd" at the top. It will automatically compile all needed files and run executable. That dramatically simplifies library usage, no need to write cumbersome makefiles. The examples are at https://github.com/lomereiter/BAMread/wiki/Getting-started You can try to write your own script if you wish, follow the instructions in the wiki. Also, as my library now is able to write BAM, the current project title is quite misleading. So I'd like to hear suggestions on renaming :) -- Artem From p.j.a.cock at googlemail.com Mon Jun 11 17:41:39 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 11 Jun 2012 18:41:39 +0100 Subject: [BioRuby] [GSoC] weekly report #4 In-Reply-To: References: Message-ID: On Mon, Jun 11, 2012 at 6:25 PM, Artem Tarasov wrote: > Hello everybody, > > here's my weekly report: > http://lomereiter.wordpress.com/2012/06/11/gsoc-weekly-report-4/ > > ... 
> > Also, as my library now is able to write BAM, the current project title is > quite misleading. > So I'd like to hear suggestions on renaming :) As to the name, how about damtools (D alignment/map tools), "for dealing with the flood of sequence data" (dam as in reservoir). Peter From cjfields at illinois.edu Mon Jun 11 17:46:43 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 11 Jun 2012 17:46:43 +0000 Subject: [BioRuby] [GSoC] weekly report #4 In-Reply-To: References: Message-ID: <67FF495D-E8AD-4920-9EA8-6464E1310FBB@illinois.edu> On Jun 11, 2012, at 12:41 PM, Peter Cock wrote: > On Mon, Jun 11, 2012 at 6:25 PM, Artem Tarasov > wrote: >> Hello everybody, >> >> here's my weekly report: >> http://lomereiter.wordpress.com/2012/06/11/gsoc-weekly-report-4/ >> >> ... >> >> Also, as my library now is able to write BAM, the current project title is >> quite misleading. >> So I'd like to hear suggestions on renaming :) > > As to the name, how about damtools (D alignment/map tools), > "for dealing with the flood of sequence data" (dam as in reservoir). > > Peter Or 'damn, look how much work we have to do' chris From lomereiter at googlemail.com Mon Jun 11 18:47:48 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Mon, 11 Jun 2012 22:47:48 +0400 Subject: [BioRuby] [GSoC] weekly report #4 In-Reply-To: References: Message-ID: No, thanks... I'll call it libsambamba. In suahili, sambamba means 'parallel' ( http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) On Mon, Jun 11, 2012 at 9:41 PM, Peter Cock wrote: > > As to the name, how about damtools (D alignment/map tools), > "for dealing with the flood of sequence data" (dam as in reservoir). 
> > Peter > From pjotr.public14 at thebird.nl Mon Jun 11 18:57:18 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Mon, 11 Jun 2012 20:57:18 +0200 Subject: [BioRuby] [GSoC] weekly report #4 In-Reply-To: References: Message-ID: <20120611185718.GA12417@thebird.nl> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: > No, thanks... > > I'll call it libsambamba. In suahili, sambamba means 'parallel' ( > http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) I like it mbwana. Pj. From p.j.a.cock at googlemail.com Mon Jun 11 18:59:38 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 11 Jun 2012 19:59:38 +0100 Subject: [BioRuby] [GSoC] weekly report #4 In-Reply-To: <20120611185718.GA12417@thebird.nl> References: <20120611185718.GA12417@thebird.nl> Message-ID: On Monday, June 11, 2012, Pjotr Prins wrote: > On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: > > No, thanks... > > > > I'll call it libsambamba. In suahili, sambamba means 'parallel' ( > > http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) > > I like it mbwana. > > Pj. > As the mentor, you'd be the mbwana or the bwana (boss), not Artem. But I do like lib-sambamba as a name - very clever. Peter From cjfields at illinois.edu Mon Jun 11 19:19:18 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 11 Jun 2012 19:19:18 +0000 Subject: [BioRuby] [GSoC] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: > On Monday, June 11, 2012, Pjotr Prins wrote: > >> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>> No, thanks... >>> >>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >> >> I like it mbwana. >> >> Pj. >> > > As the mentor, you'd be the mbwana or the bwana (boss), not Artem. > > But I do like lib-sambamba as a name - very clever. 
> > Peter Agreed, fits very well. chris From georgkam at gmail.com Mon Jun 11 19:28:42 2012 From: georgkam at gmail.com (George Githinji) Date: Mon, 11 Jun 2012 22:28:42 +0300 Subject: [BioRuby] [GSoC] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: Good tribute to swahili! ahsante sana bwana Artem! (Thank you very much for the suggestion) Sambamba could also mean correct way or the right thing in everyday speak.. (bwana is a term of respect or honour, though it also refers to a boss .. mostly we use 'mkubwa' to mean boss) George On Mon, Jun 11, 2012 at 10:19 PM, Fields, Christopher J wrote: > On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: > >> On Monday, June 11, 2012, Pjotr Prins wrote: >> >>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>>> No, thanks... >>>> >>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >>> >>> I like it mbwana. >>> >>> Pj. >>> >> >> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. >> >> But I do like lib-sambamba as a name - very clever. >> >> Peter > > Agreed, fits very well. 
> > chris > > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby -- --------------- Sincerely George Skype: george_g2 Blog: http://biorelated.wordpress.com/ Twitter: http://twitter.com/#!/george_l From cjfields at illinois.edu Mon Jun 11 19:36:44 2012 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 11 Jun 2012 19:36:44 +0000 Subject: [BioRuby] [GSoC] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: <114DEA27-A766-4F0F-8144-098FF0905E1D@illinois.edu> heh, which makes me think you don't respect your bosses :) chris On Jun 11, 2012, at 2:28 PM, George Githinji wrote: > ...(bwana is a term of respect or honour, though it also refers to a boss > .. mostly we use 'mkubwa' to mean boss) > > George > > > On Mon, Jun 11, 2012 at 10:19 PM, Fields, Christopher J > wrote: >> On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: >> >>> On Monday, June 11, 2012, Pjotr Prins wrote: >>> >>>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: >>>>> No, thanks... >>>>> >>>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( >>>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) >>>> >>>> I like it mbwana. >>>> >>>> Pj. >>>> >>> >>> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. >>> >>> But I do like lib-sambamba as a name - very clever. >>> >>> Peter >> >> Agreed, fits very well. 
>> >> chris >> >> >> _______________________________________________ >> BioRuby Project - http://www.bioruby.org/ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby > > > > -- > --------------- > Sincerely > George > Skype: george_g2 > Blog: http://biorelated.wordpress.com/ > Twitter: http://twitter.com/#!/george_l From marian.povolny at gmail.com Mon Jun 11 20:52:05 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Mon, 11 Jun 2012 22:52:05 +0200 Subject: [BioRuby] GSoC weekly status report No.3 Message-ID: http://blog.mpthecoder.com/post/24904798973/gsoc-weekly-status-report-no-3 My first report as a Master of Computer Engineering and Communications :) Here is a list of what I've been working on during the last week: - more cleanup and refactoring of validation code, README etc, - made a validation utility in D, which simply reports problems found to stderr, - made a benchmark tool with a -v option for measuring parser speed with and without validation, - after having a basic benchmark tool, found a few places which were very bad for performance. After fixing that code, parsing a 233MB GFF3 file on a five year old PC took 6 seconds, but without validation, with only a single thread, and with replacing escaped characters turned off, - made replacing escaped characters optional, because the current implementation requires creation of additional string objects to do that, which has a big impact on performance. There is a plan for making it faster, but it is scheduled for later, - added minimal parallelisation, by reading the file in a separate thread. Two additional days were spent on a segmentation fault in the D garbage collector which occurred when parsing a big file with a lot of errors. That should never happen, as I'm using the safe part of the D language, that is no pointers or anything similar. The worst that should happen is an exception. 
But a segmentation fault points to an error in either the compiler, the runtime or support library. The minimum reproducible example is still 42 lines long: https://gist.github.com/2911818 but changing anything in it makes the segmentation fault go away. More info on this topic can be found in the discussion here: https://github.com/mamarjan/bioruby-hpc-gff3/issues/31 I'll probably be posting a bug report on the Dlang webpage tomorrow. For the coming week I would like to add more parallelisation, change the validation code so that exceptions almost never happen (and the seg fault also) and add support for merging records into features. -- Marjan From cswh at umich.edu Mon Jun 11 20:56:02 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Mon, 11 Jun 2012 16:56:02 -0400 Subject: [BioRuby] GSoC weekly status report: MAF filtering Message-ID: Hi all, Here's my status report on last week's work: http://csw.github.com/bioruby-maf/blog/2012/06/09/filtering-work/ Highlights: mainly MAF alignment block filtering and performance challenges with binary data in Ruby. Clayton Wheeler cswh at umich.edu From pjotr.public14 at thebird.nl Tue Jun 12 07:05:18 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 12 Jun 2012 09:05:18 +0200 Subject: [BioRuby] [GSoC] weekly report #4 In-Reply-To: References: <20120611185718.GA12417@thebird.nl> Message-ID: <20120612070518.GA14848@thebird.nl> sam-bam-baah has the comment of sheep in it. May explain consensus :) How about sambamba tools. Less unwieldy. On Mon, Jun 11, 2012 at 09:35:59PM +0100, P. Troshin wrote: > None of my business but it's a bit unwieldy. It may be clever, but 99% > people who come across it would not know. mbwana is simpler in that > respect. Sorry for spoiling the consensus :-( > > P. 
> > > > On 11 June 2012 20:19, Fields, Christopher J wrote: > > On Jun 11, 2012, at 1:59 PM, Peter Cock wrote: > > > >> On Monday, June 11, 2012, Pjotr Prins wrote: > >> > >>> On Mon, Jun 11, 2012 at 10:47:48PM +0400, Artem Tarasov wrote: > >>>> No, thanks... > >>>> > >>>> I'll call it libsambamba. In suahili, sambamba means 'parallel' ( > >>>> http://sn.wikipedia.org/wiki/Sambamba) Nice coincidence, huh? :) > >>> > >>> I like it mbwana. > >>> > >>> Pj. > >>> > >> > >> As the mentor, you'd be the mbwana or the bwana (boss), not Artem. > >> > >> But I do like lib-sambamba as a name - very clever. > >> > >> Peter > > > > Agreed, fits very well. > > > > chris > > > > > > _______________________________________________ > > GSoC mailing list > > GSoC at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/gsoc > _______________________________________________ > GSoC mailing list > GSoC at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/gsoc From marian.povolny at gmail.com Mon Jun 18 18:28:12 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Mon, 18 Jun 2012 20:28:12 +0200 Subject: [BioRuby] GSoC weekly status report No.4 Message-ID: http://blog.mpthecoder.com/post/25375170121/gsoc-weekly-status-report-no-4 During the last week combining records into features has been added, and also connecting the features into parent-child relationships. Validation messages have been enhanced with file names and line numbers, and now look like errors reported by a compiler. Feels most natural to me. Combining the features into records works by keeping a forward cache of a number of features (1000 by default, configurable). That means that the parsing results will be correct only if records which are part of the same feature are at most 1000 features from each other, or the amount of features set. The first implementation which was comparing the IDs of records required 10min for a 233MB file. 
After switching to comparing hash values of the IDs first, and comparing the IDs themselves only when the hashes match, the parsing time dropped to 45s. After fixing a bug, the time is now 10 seconds for the 233MB m_hapla file :) Linking the features into parent-child relationships works similarly, by using 32-bit hashes most of the time instead of comparing strings. With this functionality turned on, the same file is parsed in 13 seconds. All the measurements have been done using the benchmark utility, which has a few more options for setting what should be run. Otherwise I did more refactoring, moved all the gff3_* files into a gff3 directory, so the D modules are now bio.gff3.*, parsing functions are now static methods of GFF3File and GFF3Data classes, etc. For the new week, I would like to add filtering to the D library, which I can then use to implement iteration over genes, mRNAs, CDS features, etc. After that the library should be pretty much complete feature-wise, at least per what was promised in the project proposal, so I'll continue by defining the C API and developing the Ruby gem. -- Marjan From lomereiter at googlemail.com Tue Jun 19 08:25:07 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Tue, 19 Jun 2012 12:25:07 +0400 Subject: [BioRuby] [GSoC] weekly report #5 Message-ID: Hi all, I wrote a few words about improvements in my project during the past week: http://lomereiter.wordpress.com/2012/06/19/gsoc-weekly-report-5/ - More wiki content on Github, with examples of how to use the library for common cases. - Faster conversion to SAM, now it's not worse than samtools in this respect - Parallelized BGZF compression, though it was relatively easy to add - Reconsidering interaction with dynamic languages due to shared library issues in D. Now I'm thinking of an approach of making command-line tools outputting JSON and wrapping them. At least in BioRuby we have Bio::Command to make this process easy. 
- Progress in SAM parsing - valid records are now fully parsed, and it takes just 300 lines of D/Ragel mix, together with some unittests. Also, Ragel provides some convenient methods to handle errors, but I haven't investigated them yet. Once error handling is added, the branch will be ready to be merged, and then I'll add SAM reading. -- Artem From p.j.a.cock at googlemail.com Thu Jun 21 09:23:13 2012 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Jun 2012 10:23:13 +0100 Subject: [BioRuby] BioRuby on Travis-ci! In-Reply-To: References: <20120509171449.GA29529@thebird.nl> <20120509213158.GB31329@thebird.nl> <201205160739.q4G7dS4G004980@portal.open-bio.org> Message-ID: On Wed, May 16, 2012 at 9:15 AM, Anurag Priyam wrote: > On Wed, May 16, 2012 at 1:00 PM, Naohisa GOTO > wrote: >> For Bioruby, I manually set the hook with my (ngoto's) personal >> Travis account. As far as I can see, organization account in Travis >> is currently not available. > > You are talking about the toggle button on your Travis profile page, > right? For repos that belong to an organization, you need to enable > Travis hook from Github (admin/service-hooks), iirc, using the token > on your Travis profile page. Thank you to all the BioRuby team that helped out with this. 
I guess Travis have made some improvements to handling organization accounts, but this time I was able to get it to work: http://travis-ci.org/#!/biopython/biopython http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009742.html Thanks again, Peter From cswh at umich.edu Fri Jun 22 04:25:40 2012 From: cswh at umich.edu (Clayton Wheeler) Date: Fri, 22 Jun 2012 00:25:40 -0400 Subject: [BioRuby] GSoC weekly status report: parallel I/O and JRuby Message-ID: <656B7BDD-DD0E-40FE-91AF-DC23113427D5@umich.edu> Hi all, This week's status report is a double feature: http://csw.github.com/bioruby-maf/blog/2012/06/13/jruby_support_and_performance_work/ http://csw.github.com/bioruby-maf/blog/2012/06/21/parallel_io/ In short, I now have JRuby fully supported by my MAF code, including the Kyoto Cabinet components. Using JRuby, I've been able to deliver very solid performance for index-driven random access parsing as well as for sequential whole-file parsing. Clayton Wheeler cswh at umich.edu From mail at michaelbarton.me.uk Fri Jun 22 16:27:12 2012 From: mail at michaelbarton.me.uk (Michael Barton) Date: Fri, 22 Jun 2012 12:27:12 -0400 Subject: [BioRuby] Remote blast not working? In-Reply-To: <20110320205651.GE3954@Michael-Bartons-MacBook.local> References: <20110316211201.GA2640@nku069218.hh.nku.edu> <540CCF1E-2F3B-441A-9C44-2F36BB7D2035@hgc.jp> <20110320205651.GE3954@Michael-Bartons-MacBook.local> Message-ID: Hi, I appear to be having a similar problem again using ruby 1.9.3 and bio 1.4.2. blast = Bio::Blast.remote('blastp', 'nr-aa', '-e 0.001', 'genomenet') /Users/mike/.gem/gems/bio-1.4.2/lib/bio/appl/blast/genomenet.rb:251:in `exec_genomenet': cannot understand response (RuntimeError) from /Users/mike/.gem/gems/bio-1.4.2/lib/bio/appl/blast.rb:368:in `query' from bin/blast:17:in `<main>' I've tried downloading the latest version of genomenet.rb but this did not appear to solve the problem this time. Thank you Michael Barton On 20 March 2011 16:56, Michael Barton wrote: > Hi Toshiaki, > > Thank you for the patch. That worked perfectly. I sometimes get unusual parsing > errors which I think are originating from raxml. Has anyone else experienced > something similar? > > Cheers > > Mike > > On Thu, Mar 17, 2011 at 10:28:21AM +0900, Toshiaki Katayama wrote: >> As for the GenomeNet part, it is already fixed but not yet released. >> >> https://github.com/bioruby/bioruby/blob/master/lib/bio/appl/blast/genomenet.rb#L241 >> >> You can replace your installation with the above file for quick fix. >> >> Thanks, >> Toshiaki >> >> On 2011/03/17, at 6:12, Michael Barton wrote: >> >> > Hi, >> > >> > I'm struggling to use the remote blast part of bioruby. I get errors using >> > either the ddbj or genomenet versions. With genomenet I get an error about >> > a nil response and with genomenet I get an error for 'cannot understand >> > response'. I've tried different combinations of databases and options but keep >> > getting the same problems. >> > >> > Bio::Blast::Remote.ddbj('blastp', 'nr-aa', '-e 0.001') >> > >> > Can anyone offer any suggestions? >> > >> > Cheers >> > >> > Mike >> > _______________________________________________ >> > BioRuby Project - http://www.bioruby.org/ >> > BioRuby mailing list >> > BioRuby at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/bioruby >> From ngoto at gen-info.osaka-u.ac.jp Fri Jun 22 20:09:43 2012 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Sat, 23 Jun 2012 05:09:43 +0900 Subject: [BioRuby] Remote blast not working? 
In-Reply-To: References: <20110316211201.GA2640@nku069218.hh.nku.edu> <540CCF1E-2F3B-441A-9C44-2F36BB7D2035@hgc.jp> <20110320205651.GE3954@Michael-Bartons-MacBook.local> Message-ID: <201206222019.q5MKJTRA029396@portal.open-bio.org> Hi, It seems that the Genomenet BLAST site has been modified again by the Genomenet site administrators. Currently, it is not easy to catch up with their changes. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp On Fri, 22 Jun 2012 12:27:12 -0400 Michael Barton wrote: > Hi, > > I appear to be having a similar problem again using ruby 1.9.3 and bio 1.4.2. > > blast = Bio::Blast.remote('blastp', 'nr-aa', '-e 0.001', 'genomenet') > > /Users/mike/.gem/gems/bio-1.4.2/lib/bio/appl/blast/genomenet.rb:251:in > `exec_genomenet': cannot understand response (RuntimeError) > from /Users/mike/.gem/gems/bio-1.4.2/lib/bio/appl/blast.rb:368:in `query' > from bin/blast:17:in `<main>' > > I've tried downloading the latest version of genomenet.rb but this did > not appear to solve the problem this time. > > Thank you > > Michael Barton > > > > On 20 March 2011 16:56, Michael Barton wrote: > > Hi Toshiaki, > > > > Thank you for the patch. That worked perfectly. I sometimes get unusual parsing > > errors which I think are originating from raxml. Has anyone else experienced > > something similar? > > > > Cheers > > > > Mike > > > > On Thu, Mar 17, 2011 at 10:28:21AM +0900, Toshiaki Katayama wrote: > >> As for the GenomeNet part, it is already fixed but not yet released. > >> > >> https://github.com/bioruby/bioruby/blob/master/lib/bio/appl/blast/genomenet.rb#L241 > >> > >> You can replace your installation with the above file for quick fix. > >> > >> Thanks, > >> Toshiaki > >> > >> On 2011/03/17, at 6:12, Michael Barton wrote: > >> > >> > Hi, > >> > > >> > I'm struggling to use the remote blast part of bioruby. I get errors using > >> > either the ddbj or genomenet versions. With genomenet I get an error about > >> > a nil response and with genomenet I get an error for 'cannot understand > >> > response'. I've tried different combinations of databases and options but keep > >> > getting the same problems. > >> > > >> > Bio::Blast::Remote.ddbj('blastp', 'nr-aa', '-e 0.001') > >> > > >> > Can anyone offer any suggestions? 
> >> > > >> > Cheers > >> > > >> > Mike > >> > _______________________________________________ > >> > BioRuby Project - http://www.bioruby.org/ > >> > BioRuby mailing list > >> > BioRuby at lists.open-bio.org > >> > http://lists.open-bio.org/mailman/listinfo/bioruby > >> > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From mail at michaelbarton.me.uk Mon Jun 25 20:32:35 2012 From: mail at michaelbarton.me.uk (Michael Barton) Date: Mon, 25 Jun 2012 16:32:35 -0400 Subject: [BioRuby] Remote blast not working? In-Reply-To: <201206222019.q5MKJTRA029396@portal.open-bio.org> References: <20110316211201.GA2640@nku069218.hh.nku.edu> <540CCF1E-2F3B-441A-9C44-2F36BB7D2035@hgc.jp> <20110320205651.GE3954@Michael-Bartons-MacBook.local> <201206222019.q5MKJTRA029396@portal.open-bio.org> Message-ID: <20120625203235.GA15382@bartonh-mbp-01.ConnectAkron> I know the EBI maintains a REST API for blast. They have an example ruby script for accessing it as well. Building a bio-gem around this could solve the problem of keeping up with Genomenet in bio core. A bio-gem could also use Nokogiri for faster XML parsing too. I think this could be a good student project perhaps. On Sat, Jun 23, 2012 at 05:09:43AM +0900, Naohisa GOTO wrote: > Hi, > > It seems that the Genomenet BLAST site is modified again by the > Genomenet site administrators. Currently, it is not easy to catch up > their changes. > > Naohisa Goto ngoto at gen-info.osaka-u.ac.jp > > On Fri, 22 Jun 2012 12:27:12 -0400 Michael Barton > wrote: > > > Hi, > > > > I appear to be having a similar problem again using ruby 1.9.3 and > > bio 1.4.2. 
> > > > blast = Bio::Blast.remote('blastp', 'nr-aa', '-e 0.001', > > 'genomenet') From marian.povolny at gmail.com Mon Jun 25 20:38:10 2012 From: marian.povolny at gmail.com (Marjan Povolni) Date: Mon, 25 Jun 2012 22:38:10 +0200 Subject: [BioRuby] GSoC weekly status report No.5 Message-ID: http://blog.mpthecoder.com/post/25870737554/gsoc-weekly-status-report-no-5 *Summary of the last week* During the last week a few improvements have been made: - the validation messages have been improved with file names and line numbers, in the compiler error style, - filtering has been added, - replacing escaped characters has been re-implemented to get a huge performance improvement. The 1GB file that required 10min for parsing because of 6.5 million escaped characters is now parsed in 22.5 seconds, only 0.5 seconds more than when replacing them is turned off, - added a tool for correctly counting features in a GFF3 file. This will be useful because the user can then find a good value for the feature cache size by using this tool to get the correct count and the benchmark tool to get the count for a particular cache size. The tool is still slow for some files, so I'm thinking about how to improve that, - other small fixes, comments and similar... *More on filtering* The filtering was first implemented using classes, but later refactored using delegates instead. The result was 50 lines less code. The user can now specify a filter before parsing a file like this: GFF3File.parse_by_records("file.gff3", NO_VALIDATION, false, NO_BEFORE_FILTER, OR(ATTRIBUTE("ID", EQUALS("1")), ATTRIBUTE("ID", CONTAINS("2")))); The first filter, which is set to none in this example, is applied before the line is parsed, which means that it doesn't support the ATTRIBUTE and FIELD predicates. The following predicates are implemented: FIELD, ATTRIBUTE, EQUALS, CONTAINS, STARTS_WITH, AND, OR, NOT. 
In case they're used in a way which is not allowed, there will be a compiler error. Otherwise the allowed combinations should be logical enough to guess (but I'll document them too). I altered the benchmark tool a few times to test the performance, and what I found was very positive: the performance impact in the few tests I did was very small. I'll have more data once the next tool is finished. *New week* Release early and often - it's a mantra I've heard quite a few times before. So as the group of mentors and students has agreed, every student will be releasing a gem at the end of this week. I'm still not sure what will be in it, because the support for shared libraries in D compilers for Linux has not been implemented yet. So it will probably be a combination of a command-line utility and a Ruby module which uses that utility. What I currently have in mind is re-implementing in D the gff3-fetch utility that Pjotr developed in Ruby, to make it faster. But first I'll implement filtering functionality for it, so the users can reduce a file to records which are interesting to them and then parse that using a parser in Ruby, for example. A Ruby module that would make using this utility easier for Ruby developers seems like a good idea for the first release. Part of this utility will be to support GFF3 output, so that will be implemented too (and has already been done today to some extent). From lomereiter at googlemail.com Tue Jun 26 15:45:21 2012 From: lomereiter at googlemail.com (Artem Tarasov) Date: Tue, 26 Jun 2012 19:45:21 +0400 Subject: [BioRuby] [GSoC] weekly report #6 Message-ID: Hello all, here's my weekly report: http://lomereiter.wordpress.com/2012/06/26/gsoc-weekly-report-6/ Summary: Ruby bindings moved to parsing JSON from command-line tool output, everything works fine. That also means you can use JSON output from other languages. SAM input was added. Not optimized at all, parser currently does a lot of unnecessary memory allocations. 
Now it's about 3x as slow as the samtools one, but it should be easy to improve the speed (at least doubling is possible according to profiling results). Also there's now a command line tool called Sambamba, which is used for creating JSON output. But it also outputs SAM and accepts both SAM and BAM formats as input. Options are mostly the same as for the samtools view command, including fetching regions with the same syntax, and some filtering (e.g. on quality).
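The approach described above, Ruby code that reads JSON printed by a command-line tool, could look roughly like the sketch below. It is only an illustration, not the actual bindings: it assumes the tool prints one JSON object per line on standard output, the helper name each_json_record and the field names are made up, and a small Ruby one-liner stands in for the real tool so that no command-line options have to be guessed.

```ruby
require 'json'
require 'open3'
require 'rbconfig'

# Hypothetical helper (not part of any library): runs an external command
# and yields each non-empty line of its standard output parsed as JSON.
# Assumption: the tool emits one JSON document per line.
def each_json_record(*command)
  return enum_for(:each_json_record, *command) unless block_given?

  Open3.popen2(*command) do |stdin, stdout, wait_thread|
    stdin.close                      # the tool gets no input from us
    stdout.each_line do |line|
      line = line.strip
      next if line.empty?
      yield JSON.parse(line)         # records are parsed one at a time
    end
    status = wait_thread.value       # wait for the process to finish
    raise "command failed: #{command.join(' ')}" unless status.success?
  end
end

# Stand-in for the real tool: a Ruby one-liner that prints two records.
# The field names "name" and "pos" are invented for this example.
STAND_IN = [
  RbConfig.ruby, '-e',
  'puts %q({"name":"read1","pos":100}); puts %q({"name":"read2","pos":250})'
].freeze

each_json_record(*STAND_IN) do |record|
  puts "#{record['name']} starts at #{record['pos']}"
end
```

Reading the output line by line keeps memory use flat for large files, and because the only contract is "JSON on stdout", the same tool output could be consumed from any other language with a JSON parser.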