[BioRuby] EMBL parsing
Anthony Underwood
email2ants at gmail.com
Tue May 8 07:53:37 EDT 2007
Hi Naohisa,
Thanks for the patch. It certainly appears to solve the problem of
slow EMBL entry reading. However, the sequence length is still
reported as 0.
I found this was because the ID line is not interpreted correctly.
On line 97, I changed
tmp['SEQUENCE_LENGTH'] = idline[3].strip.split(' ').first.to_i
to
tmp['SEQUENCE_LENGTH'] = idline.last.strip.split(' ').first.to_i
This was OK for my purposes, but I think the whole ID line
interpretation needs to be looked at (see
http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_4_1).
I could have a look at this if appropriate.
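
To illustrate why the original expression can yield 0, here is a minimal
sketch. It assumes the newer semicolon-separated ID line layout described in
the EMBL user manual and assumes the parser splits the ID line body on "; ".
The sample line and the variable names id_body, old_length and new_length are
illustrative only and are not taken from the BioRuby source.

```ruby
# Illustrative ID line body (the leading "ID" tag is assumed to be stripped).
# Assumed newer layout: accession; version; topology; molecule type;
# data class; division; length.
id_body = "X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP."

# Assumption: the fields are obtained by splitting on "; ".
idline = id_body.split(/; +/)

# Original expression: in this layout index 3 is the molecule type ("mRNA"),
# and String#to_i returns 0 for a string that does not start with digits.
old_length = idline[3].strip.split(' ').first.to_i

# Changed expression: the last field is "1859 BP.", so the first
# whitespace-separated token is the length.
new_length = idline.last.strip.split(' ').first.to_i

puts "old: #{old_length}, new: #{new_length}"
```

Taking the last field only works while the length stays at the end of the
line, which is one reason a fuller review of the ID line parsing seems
worthwhile.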
Thanks
Anthony
On 5 May 2007, at 07:57, Naohisa GOTO wrote:
> Hi,
>
> On Thu, 3 May 2007 12:48:03 +0100
> Anthony Underwood <email2ants at gmail.com> wrote:
>
>> Hi Mitsiteru,
>>
>> Any of the EMBL files downloaded from the EBI site have this problem.
>>
>> For example:
>> http://www.ebi.ac.uk/cgi-bin/dbfetch?db=embl&style=raw&id=CP000360
>>
>> Ruby takes all of the CPU power :(
>
> It seems to be caused by thousands of iterations of str1 += str2,
> which creates a new string object every time.
> A patch is attached (Ruby 1.8.0 or newer is required).
>
> --- lib/bio/db.rb 5 Apr 2007 23:35:39 -0000 0.37
> +++ lib/bio/db.rb 5 May 2007 06:08:39 -0000
> @@ -313,12 +313,12 @@
>
> # Returns the contents of the entry as a Hash.
> def entry2hash(entry)
> - hash = Hash.new('')
> + hash = Hash.new { |h, k| h[k] = '' }
> entry.each_line do |line|
> tag = tag_get(line)
> next if tag == 'XX'
> tag = 'R' if tag =~ /^R./ # Reference lines
> - hash[tag] += line
> + hash[tag].concat line
> end
> return hash
> end
>
>
> Naohisa Goto
> ng at bioruby.org
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
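
The two changes in the quoted patch go together, and a small sketch may make
the reason clearer. It is not part of the patch; the variable names are
illustrative, and String.new is used in place of '' only so that the sketch
also runs where string literals are frozen.

```ruby
# += builds a new String object each time; concat appends in place.
s = String.new("ID ")
before = s.object_id
s += "line\n"
plus_makes_new_object = (s.object_id != before)

t = String.new("ID ")
before = t.object_id
t.concat("line\n")
concat_keeps_object = (t.object_id == before)

# Hash.new(obj) returns the same default object for every missing key and
# does not store it. Using concat with it would mix all tags into one shared
# string and leave the hash empty.
shared = Hash.new(String.new)
shared['ID'].concat("a")
shared['DE'].concat("b")
shared_keys = shared.keys      # nothing was stored
shared_default = shared['XX']  # the shared default now holds both pieces

# The block form creates and stores a fresh string per key, so concat is safe.
per_key = Hash.new { |h, k| h[k] = String.new }
per_key['ID'].concat("a")
per_key['DE'].concat("b")

puts plus_makes_new_object, concat_keeps_object
p shared_keys, shared_default, per_key
```

This is why the patch switches the Hash default to a block at the same time
as it replaces += with concat.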