[BioRuby] EMBL parsing

Tue May 8 11:53:37 UTC 2007

Hi Naohisa,

Thanks for the patch. It certainly appears to solve the problem of
slow EMBL entry reading. However, the sequence length is still
reported as 0.

I found this was due to the idline not being interpreted correctly.

I changed line 97 from

tmp['SEQUENCE_LENGTH'] = idline[3].strip.split(' ').first.to_i

to

tmp['SEQUENCE_LENGTH'] = idline.last.strip.split(' ').first.to_i

This was OK for my purposes, but I think the whole idline
interpretation needs to be looked at; see
http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_4_1
I could have a look at this if appropriate.
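
To illustrate why a fixed index returns 0 while taking the last field
works, here is a minimal sketch. It assumes idline is the ID line split
on ';', which the thread does not actually show. The helper names and
the two sample ID lines are my own illustrations of the older
four-field and newer seven-field layouts, not BioRuby code.

```ruby
# Hypothetical helper (not BioRuby's code): drop the leading "ID" tag
# and split the remainder on ';'.
def id_fields(id_line)
  id_line.sub(/\AID\s+/, '').chomp.split(';')
end

# Length taken from a fixed position, as on the original line 97.
def length_by_index(id_line)
  id_fields(id_line)[3].strip.split(' ').first.to_i
end

# Length taken from the last field, as in the changed line.
def length_by_last(id_line)
  id_fields(id_line).last.strip.split(' ').first.to_i
end

# Illustrative ID lines (assumed layouts).
OLD_STYLE = "ID   TRBG361    standard; RNA; PLN; 1859 BP.\n"
NEW_STYLE = "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.\n"

puts length_by_index(OLD_STYLE)  # field 3 is " 1859 BP."
puts length_by_index(NEW_STYLE)  # field 3 is " mRNA", and "mRNA".to_i is 0
puts length_by_last(OLD_STYLE)
puts length_by_last(NEW_STYLE)
```

Taking the last field only recovers the length; the other fields of
the newer layout would still land in the wrong slots.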

Thanks

Anthony

On 5 May 2007, at 07:57, Naohisa GOTO wrote:

> Hi,
>
> On Thu, 3 May 2007 12:48:03 +0100
> Anthony Underwood <email2ants at gmail.com> wrote:
>
>> Hi Mitsiteru,
>>
>> Any of the embl files downloaded from the ebi site have this problem.
>>
>> For example:
>> http://www.ebi.ac.uk/cgi-bin/dbfetch?db=embl&style=raw&id=CP000360
>>
>> Ruby takes all of the cpu power :(
>
> It seems it is caused by thousands of iterations of str1 += str2
> because it creates a new string object every time.
> A patch is attached. (Ruby 1.8.0 or newer version required)
>
> --- lib/bio/db.rb       5 Apr 2007 23:35:39 -0000       0.37
> +++ lib/bio/db.rb       5 May 2007 06:08:39 -0000
> @@ -313,12 +313,12 @@
>
>    # Returns the contents of the entry as a Hash.
>    def entry2hash(entry)
> -    hash = Hash.new('')
> +    hash = Hash.new { |h, k| h[k] = '' }
>      entry.each_line do |line|
>        tag = tag_get(line)
>        next if tag == 'XX'
>        tag = 'R' if tag =~ /^R./        # Reference lines
> -      hash[tag] += line
> +      hash[tag].concat line
>      end
>      return hash
>    end
>
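
A small standalone sketch of what the two changed lines do, in case it
helps anyone reading the archive. The hash names and sample strings are
made up for illustration, and String.new stands in for the '' literal
so the sketch also runs where string literals are frozen.

```ruby
# Original behaviour: one shared default string, and += builds a new
# String on every call before assigning it back under the key.
shared = Hash.new(String.new)
shared['ID'] += "line 1\n"
shared['ID'] += "line 2\n"   # copies the growing value again

# Patched behaviour: the block stores a fresh string per key on first
# lookup, and concat appends to that string in place.
per_key = Hash.new { |h, k| h[k] = String.new }
per_key['ID'].concat "line 1\n"
per_key['ID'].concat "line 2\n"

# Why both lines change together: concat on a shared default never
# assigns a key, so every tag would append to the same object.
broken = Hash.new(String.new)
broken['ID'].concat "line 1\n"
broken['DE'].concat "line 2\n"

p shared['ID'] == per_key['ID']
p broken.keys
p broken['SQ']
```

With a plain default object and concat, the hash stays empty and all
lines pile up in the default, which is why the constructor has to
change along with the append.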
>
> Naohisa Goto
> ng at bioruby.org