[BioRuby] Is there a limit to string / naseq length?

Toshiaki Katayama ktym at hgc.jp
Tue Mar 2 03:54:55 EST 2004


Hi,

I have confirmed that this also occurs on my OS X and Linux boxes
with Ruby 1.6.8 and 1.8.1 when parsing the following file.

    ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_03/hs_chr3.gbk.gz

My implementations of the GenBank parser and the Bio::Sequence classes
don't limit sequence length.

...however...

The problem is that when I wrote bio/db.rb, I did not imagine that the
sequence coordinate number in the NCBI GenBank format could grow wide
enough to reach the head of the line, so the parser misses every line
from 100000021 onward.
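
To illustrate the failure mode, here is a minimal sketch. It is not the
actual bio/db.rb code: the constant RECORD and the methods
split_fields_naive, split_fields_tolerant and origin_sequence are
hypothetical names made up for this example, and the sequence lines are
shortened. It assumes a parser that starts a new field whenever a line
begins with a non-blank character, and compares that with a rule that
only accepts a letter or "/" at the line head.

```ruby
# Hypothetical sketch: NOT the actual bio/db.rb code.
# Shortened sequence lines; only the leading blanks matter here.
RECORD = [
  "ORIGIN",
  "         1 gaattcacac atcacaaaga",
  "  99999961 aagcaatgaa ctgtctgtgg",
  "100000021 cggggcgggg agactaggag",
  "100530241 tgttcatgat tattctgaat t",
  "//"
].join("\n")

# Naive rule (assumed): any non-blank character at the line head
# starts a new field, so a digit-led line is cut off from ORIGIN.
def split_fields_naive(text)
  text.split(/\n(?=\S)/)
end

# More tolerant rule: only a letter or "/" at the line head starts a
# new field, so digit-led sequence lines stay attached to ORIGIN.
def split_fields_tolerant(text)
  text.split(/\n(?=[A-Za-z\/])/)
end

# Collect the bases from the field that begins with ORIGIN.
def origin_sequence(fields)
  origin = fields.find { |f| f.start_with?("ORIGIN") }
  origin.sub(/\AORIGIN.*$/, "").gsub(/[^a-z]/, "")
end

naive    = origin_sequence(split_fields_naive(RECORD))
tolerant = origin_sequence(split_fields_tolerant(RECORD))
puts naive.length     # 40: the two lines with nine-digit coordinates are lost
puts tolerant.length  # 81: all four sequence lines are kept
```

With the naive rule the ORIGIN field ends just before the first line
whose coordinate has no leading blank, which matches the truncation
reported in this thread.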

------------------------------------------------------------------------------
LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004
DEFINITION  Homo sapiens chromosome 3 genomic contig.
ACCESSION   NT_005612
(snip)
ORIGIN
         1 gaattcacac atcacaaaga agtttcacag aatgcttctg tgtggttttt atgtgaacat
        61 atttcctttt ccgctggcag attctacaaa aagagtgtat ccaaagtgct cagtcaaaag
(snip)
  99999961 aagcaatgaa ctgtctgtgg agtgagtgtg tattaaaacg tggaatgagg atctccccca
100000021 cggggcgggg agactaggag aaagctgcca gaggctgctg gcaagagata tccactggtt
(snip)
100530181 taggtttgaa agctaggtgt cagccactgg gcctccatgc tgagattcat actcccatct
100530241 tgttcatgat tattctgaat t
//
------------------------------------------------------------------------------
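
As a quick check of the numbers in the excerpt above, the following
sketch computes where the first nine-digit coordinate appears. It
assumes 60 bases per sequence line, as in the excerpt; BASES_PER_LINE,
line_start and first_line_start_with_digits are hypothetical names used
only for this illustration.

```ruby
# Assumption: 60 bases per sequence line, as in the excerpt.
BASES_PER_LINE = 60

# Coordinate printed at the head of the line that contains base `pos`.
def line_start(pos)
  ((pos - 1) / BASES_PER_LINE) * BASES_PER_LINE + 1
end

# First line-head coordinate that has at least `digits` digits.
# Exactly one line start falls within any window of 60 positions.
def first_line_start_with_digits(digits)
  smallest = 10**(digits - 1)
  line_start(smallest + BASES_PER_LINE - 1)
end

first = first_line_start_with_digits(9)
puts first                    # 100000021
puts first - 1                # 100000020 bases precede that line
puts line_start(100_530_261)  # 100530241, the last line of the entry
```

The 100,000,020 bases that precede the first nine-digit line agree with
the truncated length reported in this thread.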

I will fix this in CVS, although it may take some time.

Sorry for the inconvenience,
Toshiaki Katayama


On 2004/03/02, at 14:14, Zhou, Lixin wrote:

> I've just deleted some lines of annotation in the feature table in
> NT_005612 and found that the sequence is still truncated to
> 100,000,020 bp.  Therefore, the bug may have nothing to do with the
> number of lines in the RefSeq record.
>
> Here is to correct the mistakes / typos in the previous message:
>
> 1. The sequence is from ORIGIN not SOURCE.
> 2. The sequence length is greater than 100 M bp.
>
> -----Original Message-----
> From:	Zhou, Lixin
> Sent:	Mon 3/1/2004 6:39 PM
> To:	bioruby at open-bio.org
> Cc:	
> Subject:	[BioRuby] Is there a limit to string / naseq length?
> Hi all,
>
> I was parsing NCBI's human RefSeq 34 version 2 and noticed that the DNA
> sequence from SOURCE is truncated.  This appears to be reproducible
> when I "require \"bio/db/genbank/refseq\"".
>
> The length of the NT_005612 sequence from CHR_03 is 100,530,261 bp
> (longest in human RefSeq 34 v2 and the only one whose sequence is
> greater than 1M bp).  I was parsing the entire RefSeq and then cutting
> exon sequence and noticed a few NM / XM entries returned empty sequence
> from NT_005612.  A careful examination indicates that their coordinates
> are greater than 100,000,000.  I tried to print out gb.naseq and
> indeed, the sequence is truncated to about 100,000,020.  By the way, it
> appears bioruby takes only the first 2575408 lines of the entire RefSeq
> record, because the 100,000,021st base starts at line 2,575,409 of the
> NT record.
>
> I briefly checked bioruby source and have not found a limit to the
> sequence length.  Is this a bug from Ruby 1.8.1, which I use?
>
> Thanks.
>
> Lixin Zhou
> lzhou at illumina.com
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioruby


