[BioRuby] SIM4 parser
Tomoaki NISHIYAMA
tomoakin at kenroku.kanazawa-u.ac.jp
Sat Jul 4 13:59:09 UTC 2009
Hi,
I am now trying to parse a large number of SIM4 outputs.
First, since I did not want to create a separate file for each output,
I inserted "SIM4\n" as a separator between outputs, like BLAST, and
modified the parser to use it as DELIMITER.
Since the delimiter "SIM4" was chosen arbitrarily by me and is not
standard, this modification will probably not go into the official
BioRuby distribution.
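As a rough illustration of the separator idea (a sketch only, not the
actual change to the BioRuby parser; the method name split_sim4_reports
and the sample text are made up for this example):

```ruby
# Hypothetical helper: split concatenated sim4 outputs on a line that
# contains only the ad hoc "SIM4" separator.
def split_sim4_reports(text)
  text.split(/^SIM4\r?\n/).reject { |chunk| chunk.strip.empty? }
end

# Two made-up, heavily shortened outputs joined with the separator.
combined = "SIM4\nseq1 = a.fa, 100 bp\n\n1-100  (1-100)   100%\n" \
           "SIM4\nseq1 = b.fa, 50 bp\n\n1-50  (1-50)   100%\n"

reports = split_sim4_reports(combined)
reports.each { |r| puts r.lines.first }
```

Anchoring the separator to a whole line avoids splitting on a sequence
name that merely ends in "SIM4".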
This change worked fine, but I found that
parsing of the alignment often fails.
The problem seems to lie in the individual parser.
One of the causes is related to alignments like:
    450     .    :    .    :    .    :    .    :    .    :
    447 CTCCCTCAGCGGCCTCTATTTTCAAGGGCTTCCGCATTACAG
        ||||||||||||||||||||||||||||||||||||||||||<<<...<<
   2846 CTCCCTCAGCGGCCTCTATTTTCAAGGGCTTCCGCATTACAGCTG...TA
    500     .    :    .    :    .    :    .    :    .    :
    489  TCTGGGCAGGAGACGGCATGGAAGGGCGAGCTGGGGATGAAGCAACCAA
        <|||||||||||||||||||||||||||||||||||||||||||||||||
   3081 CTCTGGGCAGGAGACGGCATGGAAGGGCGAGCTGGGGATGAAGCAACCAA
This can be corrected with the following modifications:
match exactly one whitespace character after the number
(\d+\s* -> \d+\s), and remove only the newline character at the
end of the line (strip -> chomp)
@@ -343,8 +343,8 @@
dat.each do |str|
a = str.split(/\r?\n/)
a.shift
- if /^(\s*\d+\s*)(.+)$/ =~ a[0] then
- range = ($1.length)..($1.length + $2.strip.length - 1)
+ if /^(\s*\d+\s)(.+)$/ =~ a[0] then
+ range = ($1.length)..($1.length + $2.chomp.length - 1)
a.collect! { |x| x[range] }
s1 << a.shift
ml << a.shift
With this change, the spaces at the beginning and end of the line
are no longer lost.
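The effect of the two changes can be checked in isolation with a small
script. This is only a sketch: slice_columns is a hypothetical stand-in
for the few lines touched by the diff, and the sample lines assume that
the position number is followed by exactly one separator space.

```ruby
OLD_RE = /^(\s*\d+\s*)(.+)$/   # original pattern
NEW_RE = /^(\s*\d+\s)(.+)$/    # pattern after the fix

# Hypothetical stand-in for the patched lines: compute the column range
# from the first line, then cut the same columns out of every line.
def slice_columns(lines, regex, trim)
  m = regex.match(lines[0])
  return nil unless m
  start = m[1].length
  range = start..(start + m[2].public_send(trim).length - 1)
  lines.map { |x| x[range] }
end

# Block that starts inside an intron marker (leading blank in seq1).
leading = ["    489  TCTGG",
           "        <|||||",
           "   3081 CTCTGG"]
# Block that ends inside an intron marker (trailing blanks in seq1).
trailing = ["    447 TACAG   ",
            "        |||||<<<",
            "   2846 TACAGCTG"]

p slice_columns(leading, OLD_RE, :strip)   # first column is dropped
p slice_columns(leading, NEW_RE, :chomp)   # all six columns are kept
p slice_columns(trailing, OLD_RE, :strip)  # last three columns are dropped
p slice_columns(trailing, NEW_RE, :chomp)  # all eight columns are kept
```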
The other problem, not yet resolved, is related to discontiguous matches
that are not considered proper introns, as in the following example:
180-534  (6091-6445)   99% ==
551-580  (7776-7804)   96%
...
    550
    533 GA
        ||
   6444 GA
      0     .    :    .    :    .    :
    551 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
        |||||||||||||||||||||||||||||-
   7776 AAAAAAAAAAAAAAAAAAAAAAAAAAAAA
I have not found a simple way to modify the current code to handle this
situation.
One way to resolve it may be to check whether the start position of an
alignment block matches a start position given in the preceding section
that lists the ranges of the matches.
I'm considering implementing it this way.
What do you think?
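To make the idea concrete, here is a minimal sketch. It is not code from
BioRuby: SEGMENT_RE, parse_segments and segment_starting_at are
hypothetical names, and it assumes that the number at the start of each
sequence line is the position of the first base on that line.

```ruby
# Matches a range line such as "551-580  (7776-7804)   96%".
SEGMENT_RE = /^(\d+)-(\d+)\s+\((\d+)-(\d+)\)\s+(\d+)%/

# Collect the seq1/seq2 ranges listed before the alignment.
def parse_segments(summary)
  summary.each_line.map do |line|
    m = SEGMENT_RE.match(line.strip)
    next unless m
    { seq1: m[1].to_i..m[2].to_i, seq2: m[3].to_i..m[4].to_i }
  end.compact
end

# Index of the segment that begins exactly at the given positions,
# or nil if the block merely continues the current segment.
def segment_starting_at(segments, seq1_pos, seq2_pos)
  segments.index do |s|
    s[:seq1].begin == seq1_pos && s[:seq2].begin == seq2_pos
  end
end

summary = "180-534  (6091-6445)   99% ==\n551-580  (7776-7804)   96%\n"
segments = parse_segments(summary)

p segment_starting_at(segments, 533, 6444)  # continuation of a segment
p segment_starting_at(segments, 551, 7776)  # a new segment begins here
```

A block whose positions match the start of a listed range would begin a
new segment even when no intron marker separates it from the previous
block.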
--
Tomoaki NISHIYAMA
Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan