[BioRuby] Is there a limit to string / naseq length?
Lixin Zhou
lzhou at illumina.com
Tue Mar 23 14:38:49 EST 2004
Hello,
I've tried the patch on the latest RefSeq 34 version 3 (and v2 as
well). Perhaps I did something wrong: it's a few times slower than the
previous release, and perhaps uses more memory as well. I've not had a
close look, so I don't know what caused the slowness. I simply
switched back to the previous release.
Has anyone tried to parse the ASN.1 format using Ruby, or Common Lisp / Scheme?
Thanks!
Lixin
Toshiaki Katayama wrote:
> Hi,
>
> The following change affects all subclasses of Bio::NCBIDB:
> I have changed the regexp in bio/db.rb that matches top-level tags from
> /\n(\S)/ to /\n([A-Za-z\/])/ to avoid matching digits.
>
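The effect of the regexp change can be sketched on a toy record. This is only an illustration, not the actual bio/db.rb code: the helper name split_top_level and the record are made up, and the character class [A-Za-z\/] is an assumption (letters plus "/" so that the "//" terminator still counts as a top-level tag).

```ruby
# Toy GenBank-like record. Coordinates are right-justified in 9 columns,
# so a 9-digit coordinate starts in the first column of its line.
RECORD = [
  "LOCUS       TOY",
  "ORIGIN",
  " 99999961 aagcaatgaa",
  "100000021 cggggcgggg",
  "//",
].join("\n") + "\n"

# Hypothetical helper: cut the entry in front of every line whose first
# character is captured by +pattern+ (the captured character is kept).
def split_top_level(entry, pattern)
  sep = "\001"
  entry.gsub(pattern, "\n#{sep}\\1").split(sep)
end

old_fields = split_top_level(RECORD, /\n(\S)/)
new_fields = split_top_level(RECORD, /\n([A-Za-z\/])/)

puts old_fields.size  # 4: the "100000021 ..." line is split off as its own field
puts new_fields.size  # 3: LOCUS, ORIGIN (with both sequence lines), //
puts new_fields[1]
```

With the old pattern, everything from the 9-digit coordinate onward is detached from the ORIGIN field, which matches the truncation reported in this thread.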
> Plus, sequence extraction became faster by replacing gsub with
> tr in genbank.rb.
>
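For comparison, here is a rough sketch of the two ways of stripping coordinates and whitespace from an ORIGIN body. The exact gsub and tr arguments used in genbank.rb are not shown in this thread, so the ones below are assumptions, and the input is synthetic.

```ruby
require "benchmark"

# Synthetic ORIGIN body: coordinates, spaces and newlines mixed with bases.
origin = (0...50_000).map { |i|
  format("%9d %s\n", i * 60 + 1, (["acgtacgtac"] * 6).join(" "))
}.join

with_gsub = origin.gsub(/[\s\d]/, "")  # regexp-based removal
with_tr   = origin.tr(" \n0-9", "")    # character-set removal; an empty
                                       # replacement deletes the characters

puts with_gsub == with_tr  # both approaches give the same sequence
puts with_tr.length        # 50_000 lines * 60 bases

# Timings depend on the Ruby version and machine, so none are quoted here.
Benchmark.bm(5) do |x|
  x.report("gsub") { origin.gsub(/[\s\d]/, "") }
  x.report("tr")   { origin.tr(" \n0-9", "") }
end
```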
> Please try these changes in CVS and report if they break anything.
>
>
> Lixin, thank you for your report.
>
> Regards,
> Toshiaki Katayama
>
> On 2004/03/03, at 1:58, Zhou, Lixin wrote:
>
>> Hi,
>>
>> Thanks for pinpointing the bug. I was just checking
>> bio/db/genbank/genbank.rb and realized that the fields from the ^LOCUS
>> line are tokenized using the GenBank "definition". Apparently, GenBank
>> will have to break their rules sooner or later. Perhaps we can simply
>> split the line as long as the total number of fields remains the same?
>>
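The whitespace-splitting idea could look roughly like this. It is a sketch under assumptions: parse_locus and the field names are hypothetical (not BioRuby's accessors), and the expected field count of 8 is taken from the NT_005612 LOCUS line quoted in this thread.

```ruby
# Illustrative field names, one per whitespace-separated token.
LOCUS_FIELDS = %i[tag entry_id length unit natype circular division date].freeze

# Hypothetical helper: split on whitespace and accept the line only if the
# number of fields is the expected one; otherwise return nil.
def parse_locus(line)
  tokens = line.split
  return nil unless tokens.size == LOCUS_FIELDS.size && tokens.first == "LOCUS"

  locus = LOCUS_FIELDS.zip(tokens).to_h
  locus[:length] = Integer(locus[:length], 10)
  locus
end

line  = "LOCUS NT_005612 100530261 bp DNA linear CON 23-JAN-2004"
locus = parse_locus(line)

puts locus[:entry_id]  # NT_005612
puts locus[:length]    # 100530261
```

A LOCUS line that omits a column returns nil here, so a real parser would still need a fallback, for example the existing column-based tokenizing.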
>> Thanks!
>>
>> Lixin Zhou
>>
>>> -----Original Message-----
>>> From: Toshiaki Katayama [mailto:ktym at hgc.jp]
>>> Sent: Tuesday, March 02, 2004 12:55 AM
>>> To: bioruby at open-bio.org
>>> Subject: Re: [BioRuby] Is there a limit to string / naseq length?
>>>
>>>
>>> Hi,
>>>
>>> I have confirmed this also occurs on my OS X and Linux boxes
>>> with Ruby 1.6.8 and 1.8.1 by parsing the following file.
>>>
>>>
>>> ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_03/hs_chr3.gbk.gz
>>>
>>> My implementation of the GenBank parser and the Bio::Sequence classes
>>> doesn't limit sequence length.
>>>
>>> ...however...
>>>
>>> The problem was that, when I wrote bio/db.rb, I couldn't imagine
>>> that the sequence coordinate number in the NCBI GenBank format could
>>> reach the head of the line, so the parser misses the lines from
>>> 100000021 onward.
>>>
>>> ------------------------------------------------------------------------
>>> LOCUS NT_005612 100530261 bp DNA linear CON 23-JAN-2004
>>> DEFINITION Homo sapiens chromosome 3 genomic contig.
>>> ACCESSION NT_005612
>>> (snip)
>>> ORIGIN
>>>         1 gaattcacac atcacaaaga agtttcacag aatgcttctg tgtggttttt atgtgaacat
>>>        61 atttcctttt ccgctggcag attctacaaa aagagtgtat ccaaagtgct cagtcaaaag
>>> (snip)
>>>  99999961 aagcaatgaa ctgtctgtgg agtgagtgtg tattaaaacg tggaatgagg atctccccca
>>> 100000021 cggggcgggg agactaggag aaagctgcca gaggctgctg gcaagagata tccactggtt
>>> (snip)
>>> 100530181 taggtttgaa agctaggtgt cagccactgg gcctccatgc tgagattcat actcccatct
>>> 100530241 tgttcatgat tattctgaat t
>>> //
>>> ------------------------------------------------------------------------
>>>
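Why the coordinate reaches the head of the line only at this point can be seen by formatting a few coordinates. This sketch assumes the GenBank flat-file layout in which the coordinate is right-justified in a 9-character column; origin_line is a made-up helper name.

```ruby
# Build one sequence line: coordinate right-justified in 9 columns,
# a space, then the bases.
def origin_line(coordinate, bases)
  format("%9d %s", coordinate, bases)
end

[1, 99_999_961, 100_000_021].each do |pos|
  line = origin_line(pos, "acgtacgtac")
  puts "#{line.inspect} starts with whitespace? #{line.start_with?(' ')}"
end
```

Only the 9-digit coordinate fills the whole column, so it is the first line in the ORIGIN block that begins with a non-whitespace character.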
>>> I will fix this in CVS, although it may take some time.
>>>
>>> Sorry for the inconvenience,
>>> Toshiaki Katayama
>>>
>>>
>>> On 2004/03/02, at 14:14, Zhou, Lixin wrote:
>>>
>>>> I've just deleted some lines of annotation in the feature table in
>>>> NT_005612 and found that the sequence is still truncated to
>>>> 100,000,020 bp. Therefore, the bug may have nothing to do with the
>>>> number of lines in the RefSeq record.
>>>>
>>>> Here are corrections to the mistakes / typos in my previous message:
>>>>
>>>> 1. The sequence is from ORIGIN not SOURCE.
>>>> 2. The sequence length is greater than 100 M bp.
>>>>
>>>> -----Original Message-----
>>>> From: Zhou, Lixin
>>>> Sent: Mon 3/1/2004 6:39 PM
>>>> To: bioruby at open-bio.org
>>>> Cc:
>>>> Subject: [BioRuby] Is there a limit to string / naseq length?
>>>> Hi all,
>>>>
>>>> I was parsing NCBI's human RefSeq 34 version 2 and noticed that the
>>>> DNA sequence from SOURCE is truncated. This appears to be reproducible
>>>> when I require "bio/db/genbank/refseq".
>>>>
>>>> The length of the NT_005612 sequence from CHR_03 is 100,530,261 bp
>>>> (longest in human RefSeq 34 v2 and the only one whose sequence is
>>>> greater than 1M bp). I was parsing the entire RefSeq and then cutting
>>>> exon sequences, and noticed that a few NM / XM entries returned an
>>>> empty sequence from NT_005612. A careful examination indicates that
>>>> their coordinates are greater than 100,000,000. I tried to print out
>>>> gb.naseq and, indeed, the sequence is truncated to about 100,000,020.
>>>> By the way, it appears bioruby takes only the first 2575408 lines of
>>>> the entire RefSeq record - because the 100,000,021st base starts at
>>>> line 2,575,409 of the NT record.
>>>>
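The line numbers above can be cross-checked with a little arithmetic. This is a sketch that assumes 60 bases per sequence line throughout the record, as the coordinates 1, 61, ... in the quoted sample suggest.

```ruby
BASES_PER_LINE     = 60         # assumed from the coordinates in the sample
first_missing_base = 100_000_021
record_line        = 2_575_409  # record line holding that base, as reported

# 1-based index of the sequence line that holds the first missing base
seq_line = (first_missing_base - 1) / BASES_PER_LINE + 1
# lines of the record that come before the first sequence line
lines_before_sequence = record_line - seq_line
# does the truncation fall exactly between two sequence lines?
on_line_boundary = ((first_missing_base - 1) % BASES_PER_LINE).zero?

puts seq_line               # 1666668
puts lines_before_sequence  # 908741
puts on_line_boundary       # true
```

The truncation falls exactly on a line boundary, which fits a problem in how lines are handled rather than a limit on string length.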
>>>> I briefly checked bioruby source and have not found a limit to the
>>>> sequence length. Is this a bug from Ruby 1.8.1, which I use?
>>>>
>>>> Thanks.
>>>>
>>>> Lixin Zhou
>>>> lzhou at illumina.com
>>>>
>>>> _______________________________________________
>>>> BioRuby mailing list
>>>> BioRuby at open-bio.org
>>>> http://portal.open-bio.org/mailman/listinfo/bioruby