[BioRuby] SIM4 parser

Sat Jul 4 13:59:09 UTC 2009

Hi,

I am now trying to parse a large number of SIM4 outputs.
First, as I did not want to create a separate file for each output,
I inserted "SIM4\n" as a separator, as with BLAST, and modified
the parser to use DELIMITER.
Since I chose the delimiter SIM4 arbitrarily and it is not
standard, this modification will perhaps not go into the
official BioRuby distribution.
This change worked fine, but I then found that
parsing of the alignment often fails.
The problem seems to lie in the parser for the individual output.
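
For illustration, the separator idea can be sketched in plain Ruby,
without touching the BioRuby parser. The helper name split_sim4_records
and the sample text are hypothetical, and the sketch assumes each
output is preceded by a line containing only "SIM4":

```ruby
# Hypothetical helper (not part of BioRuby): split concatenated sim4
# outputs into one string per output, assuming each output is preceded
# by a separator line that contains only "SIM4".
def split_sim4_records(text)
  text.split(/^SIM4\r?\n/).reject { |rec| rec.strip.empty? }
end

# Made-up sample input with two concatenated outputs.
concatenated = "SIM4\nseq1 = a.fa, 100 bp\n\nSIM4\nseq1 = b.fa, 200 bp\n"
records = split_sim4_records(concatenated)
records.each_with_index do |rec, i|
  puts "record #{i}: #{rec.lines.first.chomp}"
end
```

Each element of records could then be handed to the per-output parser.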

One of the causes is related to alignments like this:

     450     .    :    .    :    .    :    .    :    .    :
     447 CTCCCTCAGCGGCCTCTATTTTCAAGGGCTTCCGCATTACAG
         ||||||||||||||||||||||||||||||||||||||||||<<<...<<
    2846 CTCCCTCAGCGGCCTCTATTTTCAAGGGCTTCCGCATTACAGCTG...TA

     500     .    :    .    :    .    :    .    :    .    :
     489  TCTGGGCAGGAGACGGCATGGAAGGGCGAGCTGGGGATGAAGCAACCAA
         <|||||||||||||||||||||||||||||||||||||||||||||||||
    3081 CTCTGGGCAGGAGACGGCATGGAAGGGCGAGCTGGGGATGAAGCAACCAA

This can be corrected with the following modifications:
require exactly one whitespace character after the number
(\d+\s* -> \d+\s), and remove only the newline character at the
end of the line (strip -> chomp):

@@ -343,8 +343,8 @@
            dat.each do |str|
              a = str.split(/\r?\n/)
              a.shift
-            if /^(\s*\d+\s*)(.+)$/ =~ a[0] then
-              range = ($1.length)..($1.length + $2.strip.length - 1)
+            if /^(\s*\d+\s)(.+)$/ =~ a[0] then
+              range = ($1.length)..($1.length + $2.chomp.length - 1)
                a.collect! { |x| x[range] }
                s1 << a.shift
                ml << a.shift

so that spaces at the beginning and end of the line
are not lost.
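
The effect of the patch can be checked on the second alignment block
quoted above. This is only a sketch: column_range is a hypothetical
helper that mirrors the range computation of the patched lines, and it
is not the actual BioRuby method.

```ruby
# Hypothetical helper mirroring the range computation in the patch.
# fixed: false uses the old regexp and strip, fixed: true the new
# regexp and chomp.
def column_range(first_line, fixed: true)
  pattern = fixed ? /^(\s*\d+\s)(.+)$/ : /^(\s*\d+\s*)(.+)$/
  m = pattern.match(first_line)
  return nil unless m
  body = fixed ? m[2].chomp : m[2].strip
  (m[1].length)..(m[1].length + body.length - 1)
end

# The three lines of the second block from the example (ruler removed).
block = [
  "     489  TCTGGGCAGGAGACGGCATGGAAGGGCGAGCTGGGGATGAAGCAACCAA",
  "         <|||||||||||||||||||||||||||||||||||||||||||||||||",
  "    3081 CTCTGGGCAGGAGACGGCATGGAAGGGCGAGCTGGGGATGAAGCAACCAA"
]

old_range = column_range(block[0], fixed: false)
new_range = column_range(block[0], fixed: true)
old_cols = block.map { |line| line[old_range] }
new_cols = block.map { |line| line[new_range] }

puts "old range starts at column #{old_range.first}"
puts "new range starts at column #{new_range.first}"
puts new_cols
```

With the old regexp the extra space after 489 is absorbed into the
prefix, so the range starts one column too late and the leading "<" of
the match line and the first base of the second sequence are cut off.
With the new regexp the range starts directly after the single space
that follows the number.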

The other problem, yet to be resolved, is related to discontiguous
matches that are not considered proper introns, as in the following
example:

180-534  (6091-6445)   99% ==
551-580  (7776-7804)   96%

...

     550
     533 GA
         ||
    6444 GA

       0     .    :    .    :    .    :
     551 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
         |||||||||||||||||||||||||||||-
    7776 AAAAAAAAAAAAAAAAAAAAAAAAAAAAA

I have not found a simple way to modify the current code to handle
this situation.

One way to resolve this may be to check whether the start address of
an alignment block matches a start address given in the preceding
section that lists the ranges of the matches.
I am considering implementing it this way.
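
A rough sketch of this check, independent of the current BioRuby code.
The names SEGMENT_LINE, segment_starts and starts_new_segment? are
hypothetical, and the sketch assumes the range lines look like the two
shown above. It only covers the case where a new match begins exactly
at the start of an alignment block:

```ruby
# Matches range lines such as "180-534  (6091-6445)   99% ==".
SEGMENT_LINE = /^(\d+)-(\d+)\s+\((\d+)-(\d+)\)\s+(\d+)%/

# Collect [seq1_start, seq2_start] for every range line.
def segment_starts(summary_lines)
  summary_lines.map { |line|
    m = SEGMENT_LINE.match(line)
    m && [m[1].to_i, m[3].to_i]
  }.compact
end

# True if the positions printed at the start of a block's two sequence
# lines coincide with the start of one of the listed ranges.
def starts_new_segment?(seq1_line, seq2_line, starts)
  pos1 = seq1_line[/^\s*(\d+)\s/, 1]
  pos2 = seq2_line[/^\s*(\d+)\s/, 1]
  return false unless pos1 && pos2
  starts.include?([pos1.to_i, pos2.to_i])
end

starts = segment_starts([
  "180-534  (6091-6445)   99% ==",
  "551-580  (7776-7804)   96%"
])

# Second block of the example: begins a new match.
new_block = starts_new_segment?(
  "     551 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAG",
  "    7776 AAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
  starts
)
# First block of the example: continues the previous match.
continued = starts_new_segment?("     533 GA", "    6444 GA", starts)
puts "551/7776 starts a new match: #{new_block}"
puts "533/6444 starts a new match: #{continued}"
```

A block for which the check is true could then be appended as a new
match instead of being joined to the previous one.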

What do you think?
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan