[BioRuby] Is there a limit to string / naseq length?

Lixin Zhou lzhou at illumina.com
Tue Mar 23 14:38:49 EST 2004


Hello,

I've tried the patch with the latest RefSeq 34 version 3 (and v2 as
well).  Perhaps I did it wrong: it's a few times slower than the
previous release, and perhaps uses more memory as well.  I haven't had a
close look, so I don't know what caused the slowness.  I simply
switched back to the previous release.

Has anyone tried to parse the ASN.1 format using Ruby, Common Lisp, or Scheme?

Thanks!

Lixin

Toshiaki Katayama wrote:
> Hi,
> 
> The following change affects all subclasses of Bio::NCBIDB:
> I have changed the regexp in bio/db.rb that matches top-level tags from
> /\n(\S)/ to /\n([A-Za-z\])/ so that it no longer matches digits.
> 
> Plus, sequence extraction became faster by replacing gsub with
> tr in genbank.rb.
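
As a rough illustration of why swapping gsub for tr can help: tr deletes characters from a fixed set without running the regexp engine. The sketch below is not the code from genbank.rb (the exact expressions are not shown in this thread); ORIGIN_BLOCK, strip_with_gsub and strip_with_tr are hypothetical names, and the data is synthetic.

```ruby
require "benchmark"

# Synthetic ORIGIN-style block: numbered lines of 60 bases in groups of 10.
# (Hypothetical data, only meant to resemble a GenBank sequence section.)
BASES_LINE = "gaattcacac atcacaaaga agtttcacag aatgcttctg tgtggttttt atgtgaacat"
ORIGIN_BLOCK = (0...20_000).map { |i|
  format("%9d %s\n", i * 60 + 1, BASES_LINE)
}.join

# Regexp-based removal of whitespace and digits.
def strip_with_gsub(text)
  text.gsub(/[\s\d]/, "")
end

# Character-set deletion: with an empty replacement string, tr removes
# every character in the set (here: everything that is not a-z).
def strip_with_tr(text)
  text.tr("^a-z", "")
end

t_gsub = Benchmark.realtime { strip_with_gsub(ORIGIN_BLOCK) }
t_tr   = Benchmark.realtime { strip_with_tr(ORIGIN_BLOCK) }
puts format("gsub: %.4fs, tr: %.4fs", t_gsub, t_tr)
```

Both methods return the same string for this input; the timings depend on the Ruby version and the machine, so they are printed rather than asserted.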
> 
> Please try these changes from CVS and report if they break anything.
> 
> 
> Lixin, thank you for your report.
> 
> Regards,
> Toshiaki Katayama
> 
> On 2004/03/03, at 1:58, Zhou, Lixin wrote:
> 
>> Hi,
>>
>> Thanks for pinpointing the bug.  I was just checking
>> bio/db/genbank/genbank.rb and realized that the fields of the ^LOCUS line
>> are tokenized according to the GenBank "definition".  Apparently, GenBank will
>> have to break its own rules sooner or later.  Perhaps we can simply split
>> the line as long as the total number of fields remains the same?
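
The suggestion above could be sketched roughly as follows, assuming "split the line" means splitting on runs of whitespace instead of fixed column positions. The field names in LOCUS_FIELDS and the method parse_locus are hypothetical and are not taken from BioRuby.

```ruby
# Hypothetical field names for a LOCUS line, in order of appearance.
LOCUS_FIELDS = %w[tag entry_id length unit natype topology division date].freeze

# Split a LOCUS line on whitespace and accept it only when the number of
# tokens matches the expected number of fields.
def parse_locus(line)
  tokens = line.split
  unless tokens.size == LOCUS_FIELDS.size
    raise ArgumentError,
          "expected #{LOCUS_FIELDS.size} fields, got #{tokens.size}"
  end
  LOCUS_FIELDS.zip(tokens).to_h
end

locus = "LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004"
record = parse_locus(locus)
puts record["entry_id"]   # NT_005612
puts record["length"]     # 100530261
```

The field-count check is the safety net here: a LOCUS line that omits an optional field would be rejected instead of being silently mis-assigned, so a real parser may still need a fallback for such records.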
>>
>> Thanks!
>>
>> Lixin Zhou
>>
>>> -----Original Message-----
>>> From: Toshiaki Katayama [mailto:ktym at hgc.jp]
>>> Sent: Tuesday, March 02, 2004 12:55 AM
>>> To: bioruby at open-bio.org
>>> Subject: Re: [BioRuby] Is there a limit to string / naseq length?
>>>
>>>
>>> Hi,
>>>
>>> I have confirmed that this also occurs on my OS X and Linux boxes
>>> with Ruby 1.6.8 and 1.8.1 when parsing the following file.
>>>
>>>
>>> ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_03/hs_chr3.gbk.gz
>>>
>>> My implementation of the GenBank parser and the Bio::Sequence classes
>>> doesn't limit the sequence length.
>>>
>>> ...however...
>>>
>>> The problem was that, when I wrote bio/db.rb, I didn't imagine that
>>> the sequence coordinate numbers in the NCBI GenBank format could
>>> reach the head of the line, so the parser misses the lines from
>>> 100000021 onward.
>>>
>>> ------------------------------------------------------------------------------
>>> LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004
>>> DEFINITION  Homo sapiens chromosome 3 genomic contig.
>>> ACCESSION   NT_005612
>>> (snip)
>>> ORIGIN
>>>         1 gaattcacac atcacaaaga agtttcacag aatgcttctg tgtggttttt atgtgaacat
>>>        61 atttcctttt ccgctggcag attctacaaa aagagtgtat ccaaagtgct cagtcaaaag
>>> (snip)
>>>  99999961 aagcaatgaa ctgtctgtgg agtgagtgtg tattaaaacg tggaatgagg atctccccca
>>> 100000021 cggggcgggg agactaggag aaagctgcca gaggctgctg gcaagagata tccactggtt
>>> (snip)
>>> 100530181 taggtttgaa agctaggtgt cagccactgg gcctccatgc tgagattcat actcccatct
>>> 100530241 tgttcatgat tattctgaat t
>>> //
>>> ------------------------------------------------------------------------------
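
The effect described above can be reproduced with a small sketch. It is not the code in bio/db.rb: split_fields, sequence_of, SEP and ENTRY are hypothetical names, the entry is a shortened stand-in for NT_005612, and the letters-plus-slash pattern is only an assumed example of a pattern that ignores digits.

```ruby
# Shortened stand-in for a GenBank entry. Coordinates are right-justified
# in 9 columns, so a 9-digit coordinate starts in the first column.
ENTRY = [
  "LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004",
  "DEFINITION  Homo sapiens chromosome 3 genomic contig.",
  "ORIGIN",
  "        1 gaattcacac atcacaaaga",
  " 99999961 aagcaatgaa ctgtctgtgg",
  "100000021 cggggcgggg agactaggag",
  "//",
].join("\n") + "\n"

SEP = "\001"

# Mark every newline followed by a "tag start" character, then split there.
def split_fields(entry, pattern)
  entry.gsub(pattern) { "\n#{SEP}#{$1}" }.split(SEP)
end

# Take the ORIGIN field and keep only the lowercase base letters.
def sequence_of(fields)
  origin = fields.find { |f| f.start_with?("ORIGIN") }
  origin.sub(/\AORIGIN.*\n/, "").tr("^a-z", "")
end

# Any non-space character starts a field: "100000021" looks like a new tag.
any_char = split_fields(ENTRY, /\n(\S)/)
# Assumed alternative: only letters (and "/" for the "//" terminator).
letters  = split_fields(ENTRY, /\n([A-Za-z\/])/)

puts sequence_of(any_char).length   # 40: the last sequence line is lost
puts sequence_of(letters).length    # 60: all three sequence lines are kept
```

With the first pattern, the line beginning with 100000021 is split off as a separate top-level field, so the ORIGIN field ends just before it, which matches the truncation at 100,000,020 bp reported in this thread.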
>>>
>>> I will fix this in CVS, although it may take some time.
>>>
>>> Sorry for the inconvenience,
>>> Toshiaki Katayama
>>>
>>>
>>> On 2004/03/02, at 14:14, Zhou, Lixin wrote:
>>>
>>>> I've just deleted some lines of annotation in the feature table in
>>>> NT_005612 and found that the sequence is still truncated to
>>>> 100,000,020 bp.  Therefore, the bug may have nothing to do with the
>>>> number of lines in the RefSeq record.
>>>>
>>>> Here are corrections to the mistakes / typos in my previous message:
>>>>
>>>> 1. The sequence is from ORIGIN not SOURCE.
>>>> 2. The sequence length is greater than 100 M bp.
>>>>
>>>> -----Original Message-----
>>>> From:    Zhou, Lixin
>>>> Sent:    Mon 3/1/2004 6:39 PM
>>>> To:    bioruby at open-bio.org
>>>> Cc:   
>>>> Subject:    [BioRuby] Is there a limit to string / naseq length?
>>>> Hi all,
>>>>
>>>> I was parsing NCBI's human RefSeq 34 version 2 and noticed that the
>>>> DNA sequence from SOURCE is truncated.  This appears to be reproducible
>>>> when I require "bio/db/genbank/refseq".
>>>>
>>>> The length of the NT_005612 sequence from CHR_03 is 100,530,261 bp
>>>> (longest in human RefSeq 34 v2 and the only one whose sequence is
>>>> greater than 1M bp).  I was parsing the entire RefSeq and then cutting
>>>> out exon sequences, and noticed that a few NM / XM entries returned
>>>> empty sequences from NT_005612.  A careful examination indicates that
>>>> their coordinates are greater than 100,000,000.  I tried to print out
>>>> gb.naseq and, indeed, the sequence is truncated to about 100,000,020.
>>>> By the way, it appears that bioruby takes only the first 2,575,408
>>>> lines of the entire RefSeq record, because the 100,000,021st base
>>>> starts at line 2,575,409 of the NT record.
>>>>
>>>> I briefly checked the bioruby source and have not found a limit on the
>>>> sequence length.  Is this a bug in Ruby 1.8.1, which I use?
>>>>
>>>> Thanks.
>>>>
>>>> Lixin Zhou
>>>> lzhou at illumina.com
>>>>
>>>> _______________________________________________
>>>> BioRuby mailing list
>>>> BioRuby at open-bio.org
>>>> http://portal.open-bio.org/mailman/listinfo/bioruby

