From LZhou at illumina.com  Mon Mar  1 21:39:51 2004
From: LZhou at illumina.com (Zhou, Lixin)
Date: Mon Mar  1 21:45:44 2004
Subject: [BioRuby] Is there a limit to string / naseq length?
Message-ID: <B9CE642AED48ED4D97C66A943728F1860F74EC@ilmn-exch.illumina.com>

Hi all,

I was parsing NCBI's human RefSeq 34 version 2 and noticed that the DNA
sequence from SOURCE is truncated.  This appears to be reproducible when
I require "bio/db/genbank/refseq".

The length of the NT_005612 sequence from CHR_03 is 100,530,261 bp
(longest in human RefSeq 34 v2 and the only one whose sequence is
greater than 1M bp).  I was parsing the entire RefSeq and then cutting
exon sequences, and noticed that a few NM / XM entries returned empty
sequences from NT_005612.  A careful examination indicates that their
coordinates are greater than 100,000,000.  I tried printing gb.naseq and
indeed, the sequence is truncated at about 100,000,020 bp.  By the way, it
appears that bioruby takes only the first 2,575,408 lines of the entire
RefSeq record, because the 100,000,021st base starts at line 2,575,409
of the NT record.

I briefly checked the bioruby source and did not find a limit on the
sequence length.  Is this a bug in Ruby 1.8.1, which I use?

Thanks.

Lixin Zhou
lzhou@illumina.com

From LZhou at illumina.com  Tue Mar  2 00:14:05 2004
From: LZhou at illumina.com (Zhou, Lixin)
Date: Tue Mar  2 00:19:59 2004
Subject: [BioRuby] Is there a limit to string / naseq length?
Message-ID: <B9CE642AED48ED4D97C66A943728F1860F74ED@ilmn-exch.illumina.com>

I've just deleted some lines of annotation in the feature table in NT_005612 and found that the sequence is still truncated to 100,000,020 bp.  Therefore, the bug may have nothing to do with the number of lines in the RefSeq record.

Here are corrections to the mistakes / typos in my previous message:

1. The sequence is from ORIGIN not SOURCE.
2. The sequence length is greater than 100 M bp.


From ktym at hgc.jp  Tue Mar  2 03:54:55 2004
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue Mar  2 04:00:58 2004
Subject: [BioRuby] Is there a limit to string / naseq length?
In-Reply-To: <B9CE642AED48ED4D97C66A943728F1860F74ED@ilmn-exch.illumina.com>
References: <B9CE642AED48ED4D97C66A943728F1860F74ED@ilmn-exch.illumina.com>
Message-ID: <470F467A-6C27-11D8-9053-000A95D919D8@hgc.jp>

Hi,

I have confirmed that this also occurs on my OS X and Linux boxes
with Ruby 1.6.8 and 1.8.1 when parsing the following file:

    ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_03/hs_chr3.gbk.gz

My implementation of the GenBank parser and the Bio::Sequence classes
doesn't limit sequence length.

...however...

The problem is that when I wrote bio/db.rb, I didn't imagine that the
sequence coordinate number in the NCBI GenBank format could ever reach
the head of the line (a nine-digit position fills the whole number
column, so the line no longer starts with a space).  As a result, the
parser misses the lines from 100000021 onward.

------------------------------------------------------------------------------
LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004
DEFINITION  Homo sapiens chromosome 3 genomic contig.
ACCESSION   NT_005612
(snip)
ORIGIN
        1 gaattcacac atcacaaaga agtttcacag aatgcttctg tgtggttttt atgtgaacat
       61 atttcctttt ccgctggcag attctacaaa aagagtgtat ccaaagtgct cagtcaaaag
(snip)
 99999961 aagcaatgaa ctgtctgtgg agtgagtgtg tattaaaacg tggaatgagg atctccccca
100000021 cggggcgggg agactaggag aaagctgcca gaggctgctg gcaagagata tccactggtt
(snip)
100530181 taggtttgaa agctaggtgt cagccactgg gcctccatgc tgagattcat actcccatct
100530241 tgttcatgat tattctgaat t
//
------------------------------------------------------------------------------
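
The effect can be reproduced in a few lines of plain Ruby.  This is only a
minimal sketch: it assumes the entry is cut into top-level fields at every
newline followed by a non-space character (the /\n(\S)/ pattern mentioned
elsewhere in this thread).  The split_fields helper, FIELD_MARK, and the
shortened record are hypothetical and are not BioRuby's actual code.

```ruby
# Miniature stand-in for the NT_005612 record (shortened, illustrative data).
RECORD = [
  "LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004",
  "ORIGIN",
  " 99999961 aagcaatgaa ctgtctgtgg",
  "100000021 cggggcgggg agactaggag",
  "100530241 tgttcatgat tattctgaat t",
  "//"
].join("\n") + "\n"

# Marker character inserted in front of every detected top-level tag.
FIELD_MARK = "\001"

# Hypothetical helper: cut an entry into top-level fields wherever
# tag_pattern matches (a newline plus the first character of a tag).
def split_fields(entry, tag_pattern)
  entry.gsub(tag_pattern) { "\n#{FIELD_MARK}#{$1}" }.split(FIELD_MARK)
end

# Any non-space character at the start of a line is taken as a new field.
fields = split_fields(RECORD, /\n(\S)/)
origin = fields.find { |field| field.start_with?("ORIGIN") }

puts "number of fields: #{fields.length}"
puts "ORIGIN field:"
puts origin
```

Under this assumption, the lines that start with a nine-digit position are
split off as separate fields of their own, so the ORIGIN field ends just
before position 100000021.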

I will fix this in CVS, although it may take some time.

Sorry for the inconvenience,
Toshiaki Katayama



From LZhou at illumina.com  Tue Mar  2 11:58:59 2004
From: LZhou at illumina.com (Zhou, Lixin)
Date: Tue Mar  2 12:04:52 2004
Subject: [BioRuby] Is there a limit to string / naseq length?
Message-ID: <B9CE642AED48ED4D97C66A943728F1860F74EE@ilmn-exch.illumina.com>

Hi,

Thanks for pinpointing the bug.  I was just checking
bio/db/genbank/genbank.rb and realized that the fields of the ^LOCUS line
are tokenized according to the GenBank "definition".  Apparently, GenBank
will have to break its own rules sooner or later.  Perhaps we can simply
split the line, as long as the total number of fields remains the same?
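
A rough sketch of that idea, reading "simply split" as splitting on runs of
whitespace (my assumption).  The parse_locus helper and the LocusInfo struct
are hypothetical names, and the eight-field layout is taken only from the
LOCUS line quoted in this thread; other GenBank records may carry a different
number of fields.

```ruby
# Hypothetical container for the pieces of a LOCUS line.
LocusInfo = Struct.new(:entry_id, :length, :natype, :topology, :division, :date)

# Split the LOCUS line on whitespace instead of fixed column positions,
# and accept it only if the field count is the expected one.
def parse_locus(line)
  fields = line.split # no argument: split on runs of whitespace
  unless fields.size == 8 && fields[0] == "LOCUS"
    raise ArgumentError, "unexpected LOCUS line: #{line.inspect}"
  end
  _tag, entry_id, length, _unit, natype, topology, division, date = fields
  LocusInfo.new(entry_id, Integer(length), natype, topology, division, date)
end

line = "LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004"
info = parse_locus(line)
puts "#{info.entry_id}: #{info.length} bp, #{info.natype}, #{info.date}"
```

The field-count check is what keeps this safe: a line with a missing or extra
field is rejected rather than silently misassigned.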

Thanks!

Lixin Zhou


From ktym at hgc.jp  Tue Mar  2 13:11:07 2004
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue Mar  2 13:17:01 2004
Subject: [BioRuby] Is there a limit to string / naseq length?
In-Reply-To: <B9CE642AED48ED4D97C66A943728F1860F74EE@ilmn-exch.illumina.com>
References: <B9CE642AED48ED4D97C66A943728F1860F74EE@ilmn-exch.illumina.com>
Message-ID: <FA0826D2-6C74-11D8-B650-000A95CD9782@hgc.jp>

Hi,

The following change affects all subclasses of Bio::NCBIDB: I have
changed the regexp in bio/db.rb that matches a top-level tag from
/\n(\S)/ to /\n([A-Za-z\/])/ so that it no longer matches digits.
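
A small comparison of the two patterns, as a sketch only.  It assumes the new
character class is letters plus "/" (so that the "//" terminator still starts
a field); top_level_fields, SEPARATOR, and the shortened entry are
hypothetical and do not reproduce the code in bio/db.rb.

```ruby
SEPARATOR = "\001"

OLD_TAG = /\n(\S)/          # any non-space character starts a field
NEW_TAG = /\n([A-Za-z\/])/  # only a letter or "/" starts a field (assumed)

# Hypothetical helper: mark each top-level tag, then split on the marks.
def top_level_fields(entry, tag_pattern)
  entry.gsub(tag_pattern) { "\n#{SEPARATOR}#{$1}" }.split(SEPARATOR)
end

# Shortened, illustrative entry with one nine-digit position line.
entry = [
  "LOCUS       NT_005612",
  "ORIGIN",
  " 99999961 aagcaatgaa",
  "100000021 cggggcgggg",
  "//"
].join("\n") + "\n"

old_fields = top_level_fields(entry, OLD_TAG)
new_fields = top_level_fields(entry, NEW_TAG)

old_origin = old_fields.find { |f| f.start_with?("ORIGIN") }
new_origin = new_fields.find { |f| f.start_with?("ORIGIN") }

puts "old pattern: #{old_fields.length} fields, ORIGIN has #{old_origin.lines.length} lines"
puts "new pattern: #{new_fields.length} fields, ORIGIN has #{new_origin.lines.length} lines"
```

With the letter-only class, the line beginning with 100000021 stays inside
the ORIGIN field instead of being split off as a field of its own.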

In addition, sequence extraction is now faster because gsub was
replaced with tr in genbank.rb.
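
To illustrate the kind of replacement meant here, a sketch under my own
assumptions: the exact expressions in genbank.rb are not shown in this
thread, so the patterns and the sample ORIGIN text below are illustrative.
String#tr with an empty replacement string deletes the listed characters,
which avoids running a regular expression over a very long sequence.

```ruby
# Illustrative ORIGIN block (two shortened sequence lines plus terminator).
origin = "ORIGIN\n" \
         "        1 gaattcacac atcacaaaga\n" \
         "       61 atttcctttt ccgctggcag\n" \
         "//\n"

# Drop the "ORIGIN" header line, leaving position numbers, spaces, and bases.
body = origin.sub(/\AORIGIN[^\n]*\n/, "")

# Regexp-based cleanup: remove whitespace, digits, and slashes.
via_gsub = body.gsub(/[\s\d\/]+/, "")

# Character-based cleanup: tr with an empty second argument deletes
# every character in the first set (digits, space, newline, slash).
via_tr = body.tr("0-9 \n/", "")

# String#delete expresses the same deletion more explicitly.
via_delete = body.delete("0-9 \n/")

puts via_tr
puts via_tr == via_gsub && via_tr == via_delete
```

All three forms give the same bare sequence on this input; any speed
difference would need to be measured on real data.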

Please try these changes from CVS and report if they break anything.


Lixin, thank you for your report.

Regards,
Toshiaki Katayama



From LZhou at illumina.com  Tue Mar  2 13:15:19 2004
From: LZhou at illumina.com (Zhou, Lixin)
Date: Tue Mar  2 13:21:10 2004
Subject: [BioRuby] Is there a limit to string / naseq length?
Message-ID: <B9CE642AED48ED4D97C66A943728F1860132EACC@ilmn-exch.illumina.com>

Thank you very much for your quick and great work!  I'll try it out.

From LZhou at illumina.com  Tue Mar  2 13:15:19 2004
From: LZhou at illumina.com (Zhou, Lixin)
Date: Tue Mar  2 13:21:11 2004
Subject: [BioRuby] Is there a limit to string / naseq length?
Message-ID: <B9CE642AED48ED4D97C66A943728F1860132EACC@ilmn-exch.illumina.com>

Thank you very much for your quick and great work!  I'll try it out.

> -----Original Message-----
> From: Toshiaki Katayama [mailto:ktym@hgc.jp] 
> Sent: Tuesday, March 02, 2004 10:11 AM
> To: BioRuby Discussion List Project
> Subject: Re: [BioRuby] Is there a limit to string / naseq length?
> 
> 
> Hi,
> 
> Following change affects all sub-classes of the Bio::NCBIDB 
> and I have changed regexp in bio/db.rb to match top level tag 
> from /\n(\S)/ to /\n([A-Za-z\])/ for avoiding digits.
> 
> Plus, sequence extraction became faster by replacing gsub 
> with tr in genbank.rb.
> 
> Try these changes in CVS and please report if break anything.
> 
> 
> Lixin, thank you for your report.
> 
> Regards,
> Toshiaki Katayama
> 
> On 2004/03/03, at 1:58, Zhou, Lixin wrote:
> 
> > Hi,
> >
> > Thanks for pinpointing the bug.  I was just checking 
> > bio/db/genbank/genbank.rb and realized that the fields from ^LOCUS 
> > line was tokenized using the GenBank "definition".  Apparently, 
> > GenBank will have to break their rules soon or later.  
> Perhaps we can 
> > simply split the line as long as the total number of fields remains 
> > the same?
> >
> > Thanks!
> >
> > Lixin Zhou
> >
> >> -----Original Message-----
> >> From: Toshiaki Katayama [mailto:ktym@hgc.jp]
> >> Sent: Tuesday, March 02, 2004 12:55 AM
> >> To: bioruby@open-bio.org
> >> Subject: Re: [BioRuby] Is there a limit to string / naseq length?
> >>
> >>
> >> Hi,
> >>
> >> I have confirmed this also occurs on my OS X and Linux box with Ruby
> >> 1.6.8 and 1.8.1 by parsing the following file.
> >>
> >>
> >> ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_03/hs_chr3.gbk.gz
> >>
> >> My implementation of GenBank parser and Bio::Sequence classes doesn't
> >> limit sequence length.
> >>
> >> ...however...
> >>
> >> The problem was that I couldn't imagine the sequence coordinate
> >> number in the NCBI GenBank format could reach the line head when I
> >> wrote bio/db.rb, so it misses lines from 100000021 onward.
> >>
> >> ------------------------------------------------------------------------------
> >> LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004
> >> DEFINITION  Homo sapiens chromosome 3 genomic contig.
> >> ACCESSION   NT_005612
> >> (snip)
> >> ORIGIN
> >>          1 gaattcacac atcacaaaga agtttcacag aatgcttctg tgtggttttt atgtgaacat
> >>         61 atttcctttt ccgctggcag attctacaaa aagagtgtat ccaaagtgct cagtcaaaag
> >> (snip)
> >>   99999961 aagcaatgaa ctgtctgtgg agtgagtgtg tattaaaacg tggaatgagg atctccccca
> >> 100000021 cggggcgggg agactaggag aaagctgcca gaggctgctg gcaagagata tccactggtt
> >> (snip)
> >> 100530181 taggtttgaa agctaggtgt cagccactgg gcctccatgc tgagattcat actcccatct
> >> 100530241 tgttcatgat tattctgaat t
> >> //
> >> ------------------------------------------------------------------------------
> >>
> >> I will fix this in the CVS although it may take some time to be done.
> >>
> >> Sorry for the inconvenience,
> >> Toshiaki Katayama
> >>
> >>
> >> On 2004/03/02, at 14:14, Zhou, Lixin wrote:
> >>
> >>> I've just deleted some lines of annotation in the feature table in
> >>> NT_005612 and found that the sequence is still truncated to
> >>> 100,000,020 bp.  Therefore, the bug may have nothing to do with the
> >>> number of lines in the RefSeq record.
> >>>
> >>> Here is to correct the mistakes / typos in the previous message:
> >>>
> >>> 1. The sequence is from ORIGIN not SOURCE.
> >>> 2. The sequence length is greater than 100 M bp.
> >>>
> >>> -----Original Message-----
> >>> From:	Zhou, Lixin
> >>> Sent:	Mon 3/1/2004 6:39 PM
> >>> To:	bioruby@open-bio.org
> >>> Cc:	
> >>> Subject:	[BioRuby] Is there a limit to string / naseq length?
> >>> Hi all,
> >>>
> >>> I was parsing NCBI's human RefSeq 34 version 2 and noticed that the
> >>> DNA sequence from SOURCE is truncated.  This appears to be reproducible
> >>> when I "require \"bio/db/genbank/refseq\"".
> >>>
> >>> The length of the NT_005612 sequence from CHR_03 is 100,530,261 bp
> >>> (longest in human RefSeq 34 v2 and the only one whose sequence is
> >>> greater than 1M bp).  I was parsing the entire RefSeq and then cutting
> >>> exon sequence and noticed a few NM / XM entries returned empty
> >>> sequence from NT_005612.  A careful examination indicate that their
> >>> coordinates are greater than 100,000,000.  I tried to print out gb.naseq
> >>> and indeed, the sequence is truncated to about 100,000,020.  By the way,
> >>> it appears bioruby takes only the first 2575408 lines of the entire
> >>> RefSeq record - because 100,000,021st base starts at the line 2,575,409
> >>> of the NT record.
> >>>
> >>> I briefly checked bioruby source and have not found a limit to the
> >>> sequence length.  Is this a bug from Ruby 1.8.1, which I use?
> >>>
> >>> Thanks.
> >>>
> >>> Lixin Zhou
> >>> lzhou@illumina.com
> >>>
> >>> _______________________________________________
> >>> BioRuby mailing list
> >>> BioRuby@open-bio.org 
> >>> http://portal.open-bio.org/mailman/listinfo/bioruby
> >>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> BioRuby mailing list
> >>> BioRuby@open-bio.org 
> >>> http://portal.open-bio.org/mailman/listinfo/bioruby
> >>
> >> _______________________________________________
> >> BioRuby mailing list
> >> BioRuby@open-bio.org
> >> http://portal.open-bio.org/mailman/listinfo/bioruby
> >>
> >
> > _______________________________________________
> > BioRuby mailing list
> > BioRuby@open-bio.org 
> > http://portal.open-bio.org/mailman/listinfo/bioruby
> 
> 
> _______________________________________________
> BioRuby mailing list
> BioRuby@open-bio.org 
> http://portal.open-bio.org/mailman/listinfo/bioruby
> 
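
The top-level tag problem discussed in this thread can be reproduced on a tiny, made-up entry. The sketch below is not the code from bio/db.rb; the helper name split_top_level and the sample entry are hypothetical, and the corrected pattern is assumed to accept letters plus "/" (for the "//" terminator). It only shows why a pattern that accepts any non-space character treats a nine-digit coordinate in column 0 as the start of a new field.

```ruby
# Hypothetical miniature GenBank entry. The last sequence line starts in
# column 0 because its coordinate is nine digits wide.
ENTRY = <<~GENBANK
  LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004
  DEFINITION  Homo sapiens chromosome 3 genomic contig.
  ORIGIN
     99999961 aagcaatgaa ctgtctgtgg
  100000021 cggggcgggg agactaggag
  //
GENBANK

# Split an entry into top-level fields wherever a newline is followed by a
# character captured by tag_start (a regexp with one capture group).
def split_top_level(entry, tag_start)
  sep = "\001"
  entry.gsub(tag_start, "\n#{sep}\\1").split(sep)
end

OLD_TAG = /\n(\S)/           # any non-space character starts a new field
NEW_TAG = /\n([A-Za-z\/])/   # assumed fix: only letters or "/" start a field

old_fields = split_top_level(ENTRY, OLD_TAG)
new_fields = split_top_level(ENTRY, NEW_TAG)

origin_old = old_fields.find { |field| field.start_with?('ORIGIN') }
origin_new = new_fields.find { |field| field.start_with?('ORIGIN') }

puts "old pattern: #{old_fields.size} fields, ORIGIN has #{origin_old.lines.size} lines"
puts "new pattern: #{new_fields.size} fields, ORIGIN has #{origin_new.lines.size} lines"
```

With the old pattern the line starting with 100000021 becomes a field of its own, so the ORIGIN field ends just before it, which matches the truncation reported above.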
From ktym at hgc.jp  Tue Mar  2 14:11:18 2004
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue Mar  2 14:17:12 2004
Subject: [BioRuby] Re: BioRuby PDB Classes
In-Reply-To: <371B90E2-5E0A-11D8-B0EA-000A957E44DC@ebi.ac.uk>
References: <371B90E2-5E0A-11D8-B0EA-000A957E44DC@ebi.ac.uk>
Message-ID: <6224FA24-6C7D-11D8-B650-000A95CD9782@hgc.jp>

Hi Alex,

I have received a patch developed by you and N. Goto.

Changes are already committed to the CVS repository:
* Added Patches from Alex Gutteridge
   - New classes: PDB::Atom, PDB::Residue, PDB::Chain, PDB::Model
   - New modules: Bio::PDBUtils, Bio::{Atom|Residue|Chain|Model}Finder
   - New methods: iterators, ...
   - Bug fix
* Bio::Coordinate class storing coordinate data (inherits Vector)
* Many incompatible (but very useful) changes are made, please be 
careful.

Regards,
Toshiaki Katayama

On 2004/02/13, at 18:51, Alex Gutteridge wrote:

>>> My question(s) to the list are:
>>>
>>> 1. Am I treading on other peoples toes here? Is someone else actively
>>> developing the pdb.rb module? Naohisa Goto?
>>
>> I'm Naohisa Goto, but I'm not actively developing the pdb.rb now,
>> and no one (except you) are doing, as far as I know.
>> So, you can freely modify the pdb.rb.
>>
>> If you want to change existing class/method's name, or massively 
>> change
>> existing class/method's specification or definition, please tell us.
>
> I've left the main PDB class alone except for:
>
> - I've changed the seqres method to return Bio::Seq objects rather 
> than just strings
> - I've removed the old model parsing section and replaced it with mine
>
>>> 2. If not, should I post the code to the mailing list, or somewhere
>>> else? I'm sure it needs some tidying up and bioruby-fication. It 
>>> would
>>> be great if someone more experienced than I could give some
>>> comments/criticisms.
>>
>> If the code is short, please post to the mailing list.
>> For long codes, please send to staff@bioruby.org, or you can use
>> the BioRuby Project Wiki page (http://wiki.bioruby.org/English/).
>
> pdb.rb is now ~2000 lines, so I won't post it to the list! I'll post 
> it to staff@bioruby.org (the wiki seems to be broken at the moment). 
> pdb.rb probably needs splitting up into separate files, but like I 
> said, I'm not sure what the BioRuby conventions are for doing this 
> (would it need a new bio/db/pdb directory?). Currently it looks like 
> this:
>
> module Bio
> 	#This module provides some generic mixin methods that all classes use
> 	module PDBUtils
> 		[snip!]
> 	end
> 	#There are several *Finder mixins which provide some of the searching methods
> 	module AtomFinder
> 		[snip!]
> 	end
> 	#This is the main PDB class that was here originally - I've only added methods
> 	#so all the old interface is still here (apart from .seqres and .model)
> 	class PDB
> 		#There are a few modules and classes here used for the old style parsing
> 		class FieldDef
> 			[snip!]
> 		end
> 		class Record < Hash
> 			[snip!]
> 		end
> 		[snip!]
> 		#My new classes for atoms, residues, chains and models go here
> 		class Atom
> 			[snip!]
> 		end
> 		class Residue
> 			[snip!]
> 		end
> 		class Chain
> 			[snip!]
> 		end
> 		class Model
> 			[snip!]
> 		end
> 	end #class PDB
> end #module Bio
>
> Perhaps the PDBUtils and Finder modules should go inside the PDB 
> class? Or a separate file for each class and the mixins?
>
> Alex Gutteridge
> European Bioinformatics Institute
> Cambridge CB10 1SD
> UK
>
> Tel: 01223 492550
> Email: alexg@ebi.ac.uk
>
> _______________________________________________
> BioRuby mailing list
> BioRuby@open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioruby
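
One of the layouts asked about above, with the Finder mixin nested inside the PDB class, might look roughly like the following. This is only a minimal sketch under assumptions: the method names find_atom and each_atom, the Struct-based Atom, and the sample coordinates are hypothetical stand-ins, not the code from the patch.

```ruby
module Bio
  class PDB
    # Hypothetical mixin: any class that provides each_atom gains find_atom,
    # which returns the atoms for which the given block is true.
    module AtomFinder
      def find_atom
        atoms = []
        each_atom { |atom| atoms << atom if yield(atom) }
        atoms
      end
    end

    # Stand-in for the real Atom class, reduced to a name and coordinates.
    Atom = Struct.new(:name, :x, :y, :z)

    class Residue
      include AtomFinder

      attr_reader :name, :atoms

      def initialize(name, atoms)
        @name = name
        @atoms = atoms
      end

      # The iterator that AtomFinder relies on.
      def each_atom(&block)
        @atoms.each(&block)
      end
    end
  end
end

residue = Bio::PDB::Residue.new('GLY', [
  Bio::PDB::Atom.new('N', 0.0, 0.0, 0.0),
  Bio::PDB::Atom.new('CA', 1.5, 0.0, 0.0)
])
alpha_carbons = residue.find_atom { |atom| atom.name == 'CA' }
puts alpha_carbons.map(&:name).inspect
```

Nesting the mixins keeps every PDB-related name under Bio::PDB; splitting them into one file per class would not change how they are included.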

From ktym at hgc.jp  Wed Mar  3 19:37:31 2004
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Wed Mar  3 19:43:24 2004
Subject: [BioRuby] Fwd: BOSC 2004 Announcement and Call for Papers
Message-ID: <1F799BB8-6D74-11D8-B4B6-000A95CD9782@hgc.jp>

Hi,

The CFP for BOSC 2004 has been announced.
Plans and discussions for BioRuby's presentation are welcome.

Thanks,
Toshiaki Katayama

Begin forwarded message:

> From: Darin London <dlondon@ebi.ac.uk>
> Date: March 4, 2004 6:06:52 JST
> To: bioperl-announce-l@bioperl.org, <biojava-l@biojava.org>, 
> <biopython-announce@biopython.org>, <bioruby@open-bio.org>, 
> <bakup_moby-announce@biomoby.org>, <das@biodas.org>, 
> <ensembl-dev@ebi.ac.uk>
> Cc:
> Subject: [Bioperl-announce-l] BOSC 2004 Announcement and Call for Papers
>
>  {Please pass the word!}
>
>  MEETING ANNOUNCEMENT & CALL FOR SPEAKERS
>
>  The 5th annual Bioinformatics Open Source Conference (BOSC'2004) is
>  organized by the not-for-profit Open Bioinformatics Foundation. The
>  meeting will take place July 29-30, 2004 in Glasgow, Scotland, and is
>  one of several Special Interest Group (SIG) meetings occurring in
>  conjunction with the 12th International Conference on Intelligent
>  Systems for Molecular Biology.
>
>  see http://www.iscb.org/ismb2004/ for more information.
>
>  The focus of the meeting will be on current and emerging Open Source**
>  informatics tools and toolkits. BOSC provides a forum for developers,
>  project groups, users and interested parties to meet personally,
>  exchange ideas and collaborate together.
>
>  In addition, keynote speeches from well known Open Source Bioinformatics
>  leaders are being planned.
>
>  BOSC PROGRAM & CONTACT INFO
>
>  * Web: http://www.open-bio.org/bosc2004/
>  * Email: bosc@open-bio.org
>  * Online registration: https://www.cteusa.com/iscb3/
>
>
>  FEES
>
>  * Corporate: GBP £165.00 (British pounds sterling)
>  * Academic: GBP £120.00 (British pounds sterling)
>  * Student: GBP £90.00 (British pounds sterling)
>
>  A 17.5% Value Added Tax (VAT) will be added to all fees.
>
>  Note: We have tried to set our fees as low as possible without risking
>  the chance that the foundation will lose money on the event. We budget
>  with the goal of breaking even on costs or realizing a small profit.
>
>  REGISTER ONLINE FOR BOSC'2004 & ISMB AT:
>  https://www.cteusa.com/iscb3/
>
>  SPEAKERS & ABSTRACTS WANTED
>
>  The program committee is currently seeking abstracts for talks at BOSC
>  2004. BOSC is a great opportunity for you to tell the community about
>  your use, development, or philosophy of open source software development
>  in bioinformatics. The committee will select several submitted abstracts
>  for 25-minute talks and others for shorter "lightning" talks. Accepted
>  abstracts will be published on the BOSC web site.
>
>  If you are interested in speaking at BOSC 2004,
>  please send us:
>
>  * an abstract (no more than a few paragraphs)
>  * a URL for the project page, if applicable
>  * information about the open source license used for your software or
>    your release plans.
>
>  LIGHTNING-TALK SPEAKERS WANTED!
>
>  The program committee is currently seeking speakers for the lightning
>  talks at BOSC 2004. Lightning talks are quick - only five minutes
>  long - and a great opportunity for you to give people a quick
>  summary of your open source project, code, idea, or vision of the future.
>
>  If you are interested in giving a lightning talk at BOSC 2004,
>  please send us:
>
>  * a brief title and summary (one or two lines)
>  * a URL for the project page, if applicable
>  * information about the open source license used for your software or
>    your release plans.
>
>  We will accept entries on-line until BOSC starts, but
>  space for demos and lightning talks is limited.
>
>  SOFTWARE DEMONSTRATIONS WANTED!
>
>  If you are involved in the development of Open Source Bioinformatics
>  Software, you are invited to provide a short demonstration to attendees
>  of BOSC 2004.
>
>  If you are interested in giving a software demonstration at BOSC 2004,
>  please send us:
>
>  * a brief title and summary (one or two lines)
>  * a URL for the project page, if applicable
>  * Internet connectivity requirements (e.g. website Application served on
>    the world wide web, or web based client application).
>
>    We will accept entries on-line until the BOSC starts, but
>    space for demos and lightning talks is limited.
>
>  ** Because the mission of the OBF is to promote Open Source software, we
>  will favor submissions for projects that apply a recognized Open Source
>  License, or adhere to the general Open Source Philosophy.
>
>  See the following websites for further details:
>  http://www.opensource.org/licenses/
>  http://www.opensource.org/docs/definition.php
>
>
>
>
> _______________________________________________
> Bioperl-announce-l mailing list
> Bioperl-announce-l@portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-announce-l

From ngoto at gen-info.osaka-u.ac.jp  Thu Mar  4 11:30:52 2004
From: ngoto at gen-info.osaka-u.ac.jp (GOTO Naohisa)
Date: Thu Mar  4 11:36:42 2004
Subject: [BioRuby] Re: Non-standard FASTA
Message-ID: <E1AyvkK-0002dR-00@lng.gen-info.osaka-u.ac.jp>

Hi,

I've now changed lib/bio/db/fasta.rb and lib/bio/io/flatfile.rb in CVS
to support the non-standard FASTA format with comment lines, as suggested
by Mr. Pjotr Prins last year. The support is now integrated into the
Bio::FastaFormat class, with very little loss of performance.
Would you please check that they work correctly?

Regards,
-- 
Naohisa GOTO
ngoto@gen-info.osaka-u.ac.jp
Genome Information Research Center, Osaka University, Japan
From Jean-Philippe.Vert at mines.org  Tue Mar 16 05:54:52 2004
From: Jean-Philippe.Vert at mines.org (Jean-Philippe Vert)
Date: Tue Mar 16 04:59:46 2004
Subject: [BioRuby] proxy
Message-ID: <4056DCFC.8090902@mines.org>

Dear friends,

I'd like to use the wonderful KEGG API with Ruby, but I can only connect 
to the web through a proxy. Would someone know what I should do to set 
up the proxy connection? For example, let's say I want to run the script:

#!/usr/bin/env ruby
require 'bio'
serv = Bio::KEGG::API.new
puts serv.get_best_neighbors_by_gene('eco:b0002', 500, ['hin', 'bsu'])

It does not work currently because it tries to connect directly. Should 
I change something in this code, or in a different configuration file?

Thanks if you have time to answer this basic question, or give me a link.

yoroshiku onegaishimasu
jp

-- 
Jean-Philippe Vert
Ecole des Mines de Paris
http://www.cg.ensmp.fr/~vert


From Panayiotis.Periorellis at newcastle.ac.uk  Tue Mar 16 05:03:12 2004
From: Panayiotis.Periorellis at newcastle.ac.uk (Panayiotis Periorellis)
Date: Tue Mar 16 05:09:09 2004
Subject: [BioRuby] proxy
Message-ID: <C80380E6149238449E1AE20B15935862010D9CCC@pinewood.ncl.ac.uk>

Are you writing your code in Java? If yes, let me know and
I will tell you.

-----Original Message-----
From: Jean-Philippe Vert [mailto:Jean-Philippe.Vert@mines.org] 
Sent: 16 March 2004 10:55
To: bioruby@open-bio.org
Subject: [BioRuby] proxy


Dear friends,

I'd like to use the wonderful KEGG API with Ruby, but I can only connect
to the web through a proxy. Would someone know what I should do to set 
up the proxy connection? For example, let's say I want to run the
script:

#!/usr/bin/env ruby
require 'bio'
serv = Bio::KEGG::API.new
puts serv.get_best_neighbors_by_gene('eco:b0002', 500, ['hin', 'bsu'])

It does not work currently because it tries to connect directly. Should 
I change something in this code, or in a different configuration file?

Thanks if you have time to answer this basic question, or give me a
link.

yoroshiku onegaishimasu
jp

-- 
Jean-Philippe Vert
Ecole des Mines de Paris
http://www.cg.ensmp.fr/~vert


_______________________________________________
BioRuby mailing list
BioRuby@open-bio.org http://portal.open-bio.org/mailman/listinfo/bioruby

From ktym at hgc.jp  Tue Mar 16 05:09:14 2004
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue Mar 16 05:14:46 2004
Subject: [BioRuby] proxy
In-Reply-To: <4056DCFC.8090902@mines.org>
References: <4056DCFC.8090902@mines.org>
Message-ID: <FA5FD0A2-7731-11D8-8169-000A95D919D8@hgc.jp>

Hi Vert,

According to the SOAP4R documentation at
   http://rrr.jin.gr.jp/doc/soap4r/RELEASE_en.html
you will need to set two environmental variables or to create
a configuration file (I have no proxy and have never tried).

1. set 'soap_use_proxy' variable with its value 'on' and
    set 'http_proxy' variable with URL of your proxy as its value

or

2. create a $RUBYLIB/soap/property file (where $RUBYLIB is the
    directory of your choice something like $HOME/lib/ruby/ or
    /usr/local/lib/ruby/site_ruby/1.8/ etc.) to specify the proxy like:
      client.protocol.http.proxy = http://myproxy:8080

Hope this helps,
Toshiaki Katayama


On 2004/03/16, at 19:54, Jean-Philippe Vert wrote:

> Dear friends,
>
> I'd like to use the wonderful KEGG API with Ruby, but I can only 
> connect to the web through a proxy. Would someone know what I should 
> do to set up the proxy connection? For example, let's say I want to 
> run the script:
>
> #!/usr/bin/env ruby
> require 'bio'
> serv = Bio::KEGG::API.new
> puts serv.get_best_neighbors_by_gene('eco:b0002', 500, ['hin', 'bsu'])
>
> It does not work currently because it tries to connect directly. 
> Should I change something in this code, or in a different 
> configuration file?
>
> Thanks if you have time to answer this basic question, or give me a 
> link.
>
> yoroshiku onegaishimasu
> jp
>
> -- 
> Jean-Philippe Vert
> Ecole des Mines de Paris
> http://www.cg.ensmp.fr/~vert
>
>
> _______________________________________________
> BioRuby mailing list
> BioRuby@open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioruby
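
The first option above can also be tried from inside the script instead of the shell. The following is a minimal sketch, assuming the SOAP library picks these two variables up from the process environment when it is loaded, which is why they are set before any require; the helper name configure_soap_proxy is hypothetical and http://myproxy:8080 is the placeholder proxy from the message above.

```ruby
# Hypothetical helper: export the two variables named in the reply above
# so that libraries loaded afterwards can see them.
def configure_soap_proxy(proxy_url)
  ENV['soap_use_proxy'] = 'on'
  ENV['http_proxy'] = proxy_url
end

configure_soap_proxy('http://myproxy:8080')

# Assumption: the library is loaded only after the environment is prepared.
# require 'bio'
# serv = Bio::KEGG::API.new

puts "soap_use_proxy=#{ENV['soap_use_proxy']} http_proxy=#{ENV['http_proxy']}"
```

Setting the variables in the shell, as in option 1, has the same effect and leaves the script itself unchanged.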

From Jean-Philippe.Vert at mines.org  Tue Mar 16 06:27:20 2004
From: Jean-Philippe.Vert at mines.org (Jean-Philippe Vert)
Date: Tue Mar 16 05:32:38 2004
Subject: [BioRuby] proxy
References: <4056DCFC.8090902@mines.org>
	<FA5FD0A2-7731-11D8-8169-000A95D919D8@hgc.jp>
Message-ID: <4056E498.3010902@mines.org>

wonderful!
thanks a lot, the first solution:

setenv SOAP_USE_PROXY on
setenv HTTP_PROXY myproxy:8080

works very well
ookini!

jp

Toshiaki Katayama wrote:

> Hi Vert,
>
> According to the SOAP4R documentation at
>   http://rrr.jin.gr.jp/doc/soap4r/RELEASE_en.html
> you will need to set two environmental variables or to create
> a configuration file (I have no proxy and have never tried).
>
> 1. set 'soap_use_proxy' variable with its value 'on' and
>    set 'http_proxy' variable with URL of your proxy as its value
>
> or
>
> 2. create a $RUBYLIB/soap/property file (where $RUBYLIB is the
>    directory of your choice something like $HOME/lib/ruby/ or
>    /usr/local/lib/ruby/site_ruby/1.8/ etc.) to specify the proxy like:
>      client.protocol.http.proxy = http://myproxy:8080
>
> Hope this helps,
> Toshiaki Katayama
>
>
> On 2004/03/16, at 19:54, Jean-Philippe Vert wrote:
>
>> Dear friends,
>>
>> I'd like to use the wonderful KEGG API with Ruby, but I can only 
>> connect to the web through a proxy. Would someone know what I should 
>> do to set up the proxy connection? For example, let's say I want to 
>> run the script:
>>
>> #!/usr/bin/env ruby
>> require 'bio'
>> serv = Bio::KEGG::API.new
>> puts serv.get_best_neighbors_by_gene('eco:b0002', 500, ['hin', 'bsu'])
>>
>> It does not work currently because it tries to connect directly. 
>> Should I change something in this code, or in a different 
>> configuration file?
>>
>> Thanks if you have time to answer this basic question, or give me a 
>> link.
>>
>> yoroshiku onegaishimasu
>> jp
>>
>> -- 
>> Jean-Philippe Vert
>> Ecole des Mines de Paris
>> http://www.cg.ensmp.fr/~vert
>>
>>
>> _______________________________________________
>> BioRuby mailing list
>> BioRuby@open-bio.org
>> http://portal.open-bio.org/mailman/listinfo/bioruby
>
>
> _______________________________________________
> BioRuby mailing list
> BioRuby@open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioruby
>
>

-- 
Jean-Philippe Vert
Ecole des Mines de Paris
http://www.cg.ensmp.fr/~vert


From pjotr at pckassa.com  Sun Mar 21 08:31:37 2004
From: pjotr at pckassa.com (pjotr@pckassa.com)
Date: Sun Mar 21 08:37:04 2004
Subject: [BioRuby] Fwd: BOSC 2004 Announcement and Call for Papers
In-Reply-To: <1F799BB8-6D74-11D8-B4B6-000A95CD9782@hgc.jp>
References: <1F799BB8-6D74-11D8-B4B6-000A95CD9782@hgc.jp>
Message-ID: <20040321133137.GA26257@team-machine.donck.com>

Fair chance I will be going. If so I'll submit a paper.

Pj.

From pjotr at pckassa.com  Sun Mar 21 08:33:00 2004
From: pjotr at pckassa.com (pjotr@pckassa.com)
Date: Sun Mar 21 08:38:25 2004
Subject: [BioRuby] RegEx search example fasta file
Message-ID: <20040321133300.GB26257@team-machine.donck.com>

Can this go in the sample directory of bioruby - I have added it to
the Wiki. Comments welcome.

Pj.


#! /usr/bin/ruby
#
#   $Id: fastasearch,v 1.1 2004/03/21 13:18:41 wrk Exp $
#   $Source: /home/cvs/home/pjotr/lwrk/luw/fasta/fastasearch,v $
#

# require 'profile'

COPYRIGHT = "GPL (c) 2003-2004"

usage = <<USAGE

    Search fasta file(s) tags using a regular expression (regex)

    Usage: fastasearch [-q query] filename(s)

    Example:

      ruby fastasearch -q '([Hh]uman|[Hh]omo sapiens)' nr.fa

    For more information see 

        http://thebird.nl/bioinformatics/
	
    Pjotr Prins
    Wageningen University and Research Centre
    http://www.wur.nl/
    http://www.dpw.wageningen-ur.nl/nema/

USAGE

# --------------------------------------------------------------------

srcpath=File.dirname($0)
libpath=File.dirname(srcpath)+'/lib'
$: << srcpath         # ---- Add start path to search libraries
$: << libpath

require 'getoptlong'
require 'bio'

# ---- Parse command line
opts = GetoptLong.new(
 [ "--help", "-h", GetoptLong::NO_ARGUMENT ],
 [ "--query", "-q", GetoptLong::REQUIRED_ARGUMENT ]
)

do_help       = false
query=nil

opts.each do | opt, arg |
   do_help   |= (opt == '--help')
   query = arg if (opt == '--query')
end

# ---- Print usage
if (do_help || ARGV.size==0)
  print usage
  exit 1
end

if !query
  print "Give query: "
  query = $stdin.gets.chomp
end

ARGV.each do | fn |
  $stderr.print "Loading #{fn}..."
  f = Bio::FlatFile.auto(fn)
  $stderr.print " detected: #{f.dbclass}\n"
  f.each_entry do | e |
    if e.definition =~ /#{query}/
      print '>',e.definition,e.data
    end
  end
end
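
The matching step of the script can be tried without BioRuby installed. The sketch below is an illustration under assumptions rather than a replacement for Bio::FlatFile: the helpers each_fasta_entry and search_fasta and the two sample records are made up for this example. It also shows that the query is given without surrounding slashes, because the string is interpolated into a regular expression as-is.

```ruby
# Yield [definition, sequence] pairs from FASTA-formatted text.
def each_fasta_entry(text)
  text.split(/^>/).each do |chunk|
    next if chunk.strip.empty?
    definition, *sequence_lines = chunk.lines.map(&:chomp)
    yield definition, sequence_lines.join
  end
end

# Return the entries whose definition line matches the query string.
def search_fasta(text, query)
  pattern = Regexp.new(query) # compile once, outside the loop
  hits = []
  each_fasta_entry(text) do |definition, sequence|
    hits << [definition, sequence] if definition =~ pattern
  end
  hits
end

# Made-up sample records, for illustration only.
SAMPLE = <<~FASTA
  >seq1 Homo sapiens example protein
  MKTAYIAKQR
  QISFVKSHFS
  >seq2 Mus musculus example protein
  MVLSPADKTN
FASTA

hits = search_fasta(SAMPLE, '([Hh]uman|[Hh]omo sapiens)')
hits.each do |definition, sequence|
  puts ">#{definition}", sequence
end
```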

From pjotr at pckassa.com  Mon Mar 22 13:20:51 2004
From: pjotr at pckassa.com (pjotr@pckassa.com)
Date: Mon Mar 22 13:26:17 2004
Subject: [BioRuby] Fwd: BOSC 2004 Announcement and Call for Papers
In-Reply-To: <20040321133137.GA26257@team-machine.donck.com>
References: <1F799BB8-6D74-11D8-B4B6-000A95CD9782@hgc.jp>
	<20040321133137.GA26257@team-machine.donck.com>
Message-ID: <20040322182051.GA10462@team-machine.donck.com>

Yes, I am attending BOSC (permission granted by my Professor). Anyone
else of BioRuby coming?

I can do a talk - what would be the most interesting to discuss? What
came up during the last BOSC? Some feedback would be useful.

Yours,

Pj.
From lzhou at illumina.com  Tue Mar 23 14:38:49 2004
From: lzhou at illumina.com (Lixin Zhou)
Date: Tue Mar 23 14:44:18 2004
Subject: [BioRuby] Is there a limit to string / naseq length?
In-Reply-To: <FA0826D2-6C74-11D8-B650-000A95CD9782@hgc.jp>
References: <B9CE642AED48ED4D97C66A943728F1860F74EE@ilmn-exch.illumina.com>
	<FA0826D2-6C74-11D8-B650-000A95CD9782@hgc.jp>
Message-ID: <40609249.1000902@illumina.com>

Hello,

I've tried the patch for the latest RefSeq 34 version 3 (and v2 as
well).  Perhaps I did it wrong - it's a few times slower than the
previous release, and perhaps uses more memory as well.  I've not had a
close look, so I don't know what caused the slowness. I simply
switched back to the previous release.

Has anyone tried to parse ASN.1 format using Ruby, or common LISP / scheme?

Thanks!

Lixin

Toshiaki Katayama wrote:
> Hi,
> 
> The following change affects all sub-classes of Bio::NCBIDB:
> I have changed the regexp in bio/db.rb that matches top level tags from
> /\n(\S)/ to /\n([A-Za-z\/])/ to avoid matching digits.
> 
> Plus, sequence extraction became faster by replacing gsub with
> tr in genbank.rb.
> 
> Try these changes in CVS and please report if they break anything.
> 
> 
> Lixin, thank you for your report.
> 
> Regards,
> Toshiaki Katayama
> 
> On 2004/03/03, at 1:58, Zhou, Lixin wrote:
> 
>> Hi,
>>
>> Thanks for pinpointing the bug.  I was just checking
>> bio/db/genbank/genbank.rb and realized that the fields from the ^LOCUS line
>> were tokenized using the GenBank "definition".  Apparently, GenBank will
>> have to break their rules sooner or later.  Perhaps we can simply split
>> the line as long as the total number of fields remains the same?
>>
>> Thanks!
>>
>> Lixin Zhou
>>
>>> -----Original Message-----
>>> From: Toshiaki Katayama [mailto:ktym@hgc.jp]
>>> Sent: Tuesday, March 02, 2004 12:55 AM
>>> To: bioruby@open-bio.org
>>> Subject: Re: [BioRuby] Is there a limit to string / naseq length?
>>>
>>>
>>> Hi,
>>>
>>> I have confirmed this also occurs on my OS X and Linux box
>>> with Ruby 1.6.8 and 1.8.1 by parsing the following file.
>>>
>>>
>>> ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_03/hs_chr3.gbk.gz
>>>
>>> My implementation of GenBank parser and Bio::Sequence classes
>>> doesn't limit sequence length.
>>>
>>> ...however...
>>>
>>> The problem was that I couldn't imagine the sequence coordinate
>>> number in the NCBI GenBank format could reach the line head when I
>>> wrote bio/db.rb, so it misses lines from 100000021 onward.
>>>
>>> ------------------------------------------------------------------------------
>>> LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004
>>> DEFINITION  Homo sapiens chromosome 3 genomic contig.
>>> ACCESSION   NT_005612
>>> (snip)
>>> ORIGIN
>>>          1 gaattcacac atcacaaaga agtttcacag aatgcttctg tgtggttttt atgtgaacat
>>>         61 atttcctttt ccgctggcag attctacaaa aagagtgtat ccaaagtgct cagtcaaaag
>>> (snip)
>>>   99999961 aagcaatgaa ctgtctgtgg agtgagtgtg tattaaaacg tggaatgagg atctccccca
>>> 100000021 cggggcgggg agactaggag aaagctgcca gaggctgctg gcaagagata tccactggtt
>>> (snip)
>>> 100530181 taggtttgaa agctaggtgt cagccactgg gcctccatgc tgagattcat actcccatct
>>> 100530241 tgttcatgat tattctgaat t
>>> //
>>> ------------------------------------------------------------------------------
>>>
>>> I will fix this in the CVS although it may take some time to be done.
>>>
>>> Sorry for the inconvenience,
>>> Toshiaki Katayama
>>>
>>>
>>> On 2004/03/02, at 14:14, Zhou, Lixin wrote:
>>>
>>>> I've just deleted some lines of annotation in the feature table in
>>>> NT_005612 and found that the sequence is still truncated to
>>>> 100,000,020 bp.  Therefore, the bug may have nothing to do with the
>>>> number of lines in the RefSeq record.
>>>>
>>>> Here is to correct the mistakes / typos in the previous message:
>>>>
>>>> 1. The sequence is from ORIGIN not SOURCE.
>>>> 2. The sequence length is greater than 100 M bp.
>>>>
>>>> -----Original Message-----
>>>> From:    Zhou, Lixin
>>>> Sent:    Mon 3/1/2004 6:39 PM
>>>> To:    bioruby@open-bio.org
>>>> Cc:   
>>>> Subject:    [BioRuby] Is there a limit to string / naseq length?
>>>> Hi all,
>>>>
>>>> I was parsing NCBI's human RefSeq 34 version 2 and noticed that the
>>>> DNA sequence from SOURCE is truncated.  This appears to be reproducible
>>>> when I "require \"bio/db/genbank/refseq\"".
>>>>
>>>> The length of the NT_005612 sequence from CHR_03 is 100,530,261 bp
>>>> (longest in human RefSeq 34 v2 and the only one whose sequence is
>>>> greater than 1M bp).  I was parsing the entire RefSeq and then cutting
>>>> exon sequence and noticed a few NM / XM entries returned empty
>>>> sequence from NT_005612.  A careful examination indicate that their
>>>> coordinates are greater than 100,000,000.  I tried to print out gb.naseq
>>>> and indeed, the sequence is truncated to about 100,000,020.  By the way,
>>>> it appears bioruby takes only the first 2575408 lines of the entire
>>>> RefSeq record - because 100,000,021st base starts at the line 2,575,409
>>>> of the NT record.
>>>>
>>>> I briefly checked bioruby source and have not found a limit to the
>>>> sequence length.  Is this a bug from Ruby 1.8.1, which I use?
>>>>
>>>> Thanks.
>>>>
>>>> Lixin Zhou
>>>> lzhou@illumina.com
>>>>
>>>> _______________________________________________
>>>> BioRuby mailing list
>>>> BioRuby@open-bio.org
>>>> http://portal.open-bio.org/mailman/listinfo/bioruby
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> BioRuby mailing list
>>>> BioRuby@open-bio.org
>>>> http://portal.open-bio.org/mailman/listinfo/bioruby
>>>
>>>
>>> _______________________________________________
>>> BioRuby mailing list
>>> BioRuby@open-bio.org
>>> http://portal.open-bio.org/mailman/listinfo/bioruby
>>>
>>
>> _______________________________________________
>> BioRuby mailing list
>> BioRuby@open-bio.org
>> http://portal.open-bio.org/mailman/listinfo/bioruby
> 
> 
> 
> _______________________________________________
> BioRuby mailing list
> BioRuby@open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioruby
> 
> 
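
One way to check whether the gsub-to-tr change itself contributes to the slowdown reported above would be to time both on the same input. The sketch below makes no claim about which is faster; the character classes and the two made-up ORIGIN lines are assumptions and are not taken from genbank.rb.

```ruby
require 'benchmark'

# Made-up ORIGIN lines in GenBank layout: a coordinate, then blocks of bases.
ORIGIN_TEXT = "        1 gaattcacac atcacaaaga\n" \
              "       21 atttcctttt ccgctggcag\n"

# Strip everything except lowercase letters with a regexp substitution.
def extract_with_gsub(origin)
  origin.gsub(/[^a-z]/, '')
end

# Same result with tr: an empty replacement deletes the matched characters.
def extract_with_tr(origin)
  origin.tr('^a-z', '')
end

big_input = ORIGIN_TEXT * 50_000
gsub_time = Benchmark.realtime { extract_with_gsub(big_input) }
tr_time   = Benchmark.realtime { extract_with_tr(big_input) }

puts extract_with_tr(ORIGIN_TEXT)
puts format('gsub: %.4fs  tr: %.4fs', gsub_time, tr_time)
```

If both timings are small compared with the total parsing time, the slowdown would have to come from somewhere else, for example the top-level tag splitting.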
From lzhou at illumina.com  Tue Mar 23 14:38:49 2004
From: lzhou at illumina.com (Lixin Zhou)
Date: Tue Mar 23 14:44:19 2004
Subject: [BioRuby] Is there a limit to string / naseq length?
In-Reply-To: <FA0826D2-6C74-11D8-B650-000A95CD9782@hgc.jp>
References: <B9CE642AED48ED4D97C66A943728F1860F74EE@ilmn-exch.illumina.com>
	<FA0826D2-6C74-11D8-B650-000A95CD9782@hgc.jp>
Message-ID: <40609249.1000902@illumina.com>

Hello,

I've tried the patch for the latest RefSeq 34 version 3 (and v2 as
well).  Perhaps I did it wrong - it's a few times slower than the
previous release, and perhaps use more memory as well.  I've not had a
close look, so that I don't know what caused the slowness. I simply
switched back to the previous release.

Has anyone tried to parse ASN.1 format using Ruby, or common LISP / scheme?

Thanks!

Lixin

Toshiaki Katayama wrote:
> Hi,
> 
> Following change affects all sub-classes of the Bio::NCBIDB and
> I have changed regexp in bio/db.rb to match top level tag from
> /\n(\S)/ to /\n([A-Za-z\])/ for avoiding digits.
> 
> Plus, sequence extraction became faster by replacing gsub with
> tr in genbank.rb.
> 
> Try these changes in CVS and please report if break anything.
> 
> 
> Lixin, thank you for your report.
> 
> Regards,
> Toshiaki Katayama
> 
> On 2004/03/03, at 1:58, Zhou, Lixin wrote:
> 
>> Hi,
>>
>> Thanks for pinpointing the bug.  I was just checking
>> bio/db/genbank/genbank.rb and realized that the fields from ^LOCUS line
>> was tokenized using the GenBank "definition".  Apparently, GenBank will
>> have to break their rules soon or later.  Perhaps we can simply split
>> the line as long as the total number of fields remains the same?
>>
>> Thanks!
>>
>> Lixin Zhou
>>
>>> -----Original Message-----
>>> From: Toshiaki Katayama [mailto:ktym@hgc.jp]
>>> Sent: Tuesday, March 02, 2004 12:55 AM
>>> To: bioruby@open-bio.org
>>> Subject: Re: [BioRuby] Is there a limit to string / naseq length?
>>>
>>>
>>> Hi,
>>>
>>> I have confirmed this also occurs on my OS X and Linux box
>>> with Ruby 1.6.8 and 1.8.1 by parsing the following file.
>>>
>>>
>>> ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_03/hs_ch
>>> r3.gbk.gz
>>>
>>> My implementation of GenBank parser and Bio::Sequence classes
>>> doesn't limit sequence length.
>>>
>>> ...however...
>>>
>>> The problem was that I couldn't imagine the sequence
>>> coordination number in the NCBI GenBank format can reach at
>>> the line head when I wrote bio/db.rb so that it misses lines
>>> after 100000021.
>>>
>>> ------------------------------------------------------------------------------
>>> LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004
>>> DEFINITION  Homo sapiens chromosome 3 genomic contig.
>>> ACCESSION   NT_005612
>>> (snip)
>>> ORIGIN
>>>         1 gaattcacac atcacaaaga agtttcacag aatgcttctg tgtggttttt atgtgaacat
>>>        61 atttcctttt ccgctggcag attctacaaa aagagtgtat ccaaagtgct cagtcaaaag
>>> (snip)
>>>  99999961 aagcaatgaa ctgtctgtgg agtgagtgtg tattaaaacg tggaatgagg atctccccca
>>> 100000021 cggggcgggg agactaggag aaagctgcca gaggctgctg gcaagagata tccactggtt
>>> (snip)
>>> 100530181 taggtttgaa agctaggtgt cagccactgg gcctccatgc tgagattcat actcccatct
>>> 100530241 tgttcatgat tattctgaat t
>>> //
>>> ------------------------------------------------------------------------------
>>>
>>> I will fix this in CVS, although it may take some time.
>>>
>>> Sorry for the inconvenience,
>>> Toshiaki Katayama
>>>
>>>
>>> On 2004/03/02, at 14:14, Zhou, Lixin wrote:
>>>
>>>> I've just deleted some lines of annotation in the feature table in
>>>> NT_005612 and found that the sequence is still truncated to
>>>> 100,000,020 bp.  Therefore, the bug may have nothing to do with the
>>>> number of lines in the RefSeq record.
>>>>
>>>> Here is to correct the mistakes / typos in the previous message:
>>>>
>>>> 1. The sequence is from ORIGIN not SOURCE.
>>>> 2. The sequence length is greater than 100 M bp.
>>>>
>>>> -----Original Message-----
>>>> From:    Zhou, Lixin
>>>> Sent:    Mon 3/1/2004 6:39 PM
>>>> To:    bioruby@open-bio.org
>>>> Cc:   
>>>> Subject:    [BioRuby] Is there a limit to string / naseq length?
>>>> Hi all,
>>>>
>>>> I was parsing NCBI's human RefSeq 34 version 2 and noticed that the
>>>> DNA sequence from SOURCE is truncated.  This appears to be reproducible
>>>> when I "require \"bio/db/genbank/refseq\"".
>>>>
>>>> The length of the NT_005612 sequence from CHR_03 is 100,530,261 bp
>>>> (longest in human RefSeq 34 v2 and the only one whose sequence is
>>>> greater than 1M bp).  I was parsing the entire RefSeq and then cutting
>>>> exon sequence and noticed a few NM / XM entries returned empty
>>>> sequence from NT_005612.  A careful examination indicate that their
>>>> coordinates are greater than 100,000,000.  I tried to print out gb.naseq
>>>> and indeed, the sequence is truncated to about 100,000,020.  By the way,
>>>> it appears bioruby takes only the first 2575408 lines of the entire
>>>> RefSeq record - because 100,000,021st base starts at the line 2,575,409
>>>> of the NT record.
>>>>
>>>> I briefly checked bioruby source and have not found a limit to the
>>>> sequence length.  Is this a bug from Ruby 1.8.1, which I use?
>>>>
>>>> Thanks.
>>>>
>>>> Lixin Zhou
>>>> lzhou@illumina.com
>>>>
>>>> _______________________________________________
>>>> BioRuby mailing list
>>>> BioRuby@open-bio.org
>>>> http://portal.open-bio.org/mailman/listinfo/bioruby
> 
From uehara at cbo.mss.co.jp  Tue Mar 23 21:02:12 2004
From: uehara at cbo.mss.co.jp (UEHARA Keizou)
Date: Tue Mar 23 21:08:30 2004
Subject: [BioRuby] Is there a limit to string / naseq length?
In-Reply-To: <40609249.1000902@illumina.com>
References: <40609249.1000902@illumina.com>
Message-ID: <200403240202.AA00455@C1623.cbo.mss.co.jp>

>Hello,
>
>I've tried the patch for the latest RefSeq 34 version 3 (and v2 as
>well).  Perhaps I did it wrong - it's a few times slower than the
>previous release, and perhaps uses more memory as well.  I've not had a
>close look, so I don't know what caused the slowness. I simply
>switched back to the previous release.
>
>Has anyone tried to parse ASN.1 format using Ruby, or common LISP / scheme?

I convert ASN.1 to XML using asn2gb and parse it with xmlparser.
From ktym at hgc.jp  Tue Mar 23 21:28:51 2004
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue Mar 23 21:34:19 2004
Subject: [BioRuby] Fwd: BOSC 2004 Announcement and Call for Papers
In-Reply-To: <20040322182051.GA10462@team-machine.donck.com>
References: <1F799BB8-6D74-11D8-B4B6-000A95CD9782@hgc.jp>
	<20040321133137.GA26257@team-machine.donck.com>
	<20040322182051.GA10462@team-machine.donck.com>
Message-ID: <FCD18ED5-7D3A-11D8-8E2D-000A95AE7AB4@hgc.jp>

On 2004/03/23, at 3:20, pjotr@pckassa.com wrote:
> Yes, I am attending BOSC (permission granted by my Professor). Anyone
> else of BioRuby coming?

That's nice.

I will be there (although this is not yet confirmed).
Maybe we can have a BioRuby BOF session.

> I can do a talk - what would be the most interesting to discuss? What
> came up during the last BOSC? Some feedback would be useful.

For example, could you prepare an interesting or typical application of
BioRuby?

Last year, I presented the capabilities of BioRuby and the use of
the KEGG API (http://open-bio.org/bosc2003/talks.html#kegg).
Other projects also showed applications such as a user's perspective,
pipelines etc., and they were impressive.

In my case, I can prepare a report on our project overview and recent
progress, including some topics of my own interest.

Currently, I'm interested in the KEGG API, DAS, GFF3 etc. and in integrating
them with GMOD.  I'm also interested in having a Bio::Graphics equivalent
in BioRuby (in a more generalized way?).

-k

From ktym at hgc.jp  Tue Mar 23 21:58:20 2004
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue Mar 23 22:03:45 2004
Subject: [BioRuby] RegEx search example fasta file
In-Reply-To: <20040321133300.GB26257@team-machine.donck.com>
References: <20040321133300.GB26257@team-machine.donck.com>
Message-ID: <1BBD2286-7D3F-11D8-8E2D-000A95AE7AB4@hgc.jp>

On 2004/03/21, at 22:33, pjotr@pckassa.com wrote:
> Can this go in the sample directory of bioruby - I have added it to
> the Wiki. Comments welcome.

As for the wiki page: in the original BJIA
(http://www.biojava.org/docs/bj_in_anger/FastaParser.htm),
this section answers how to parse fasta results.

Because Bio::FlatFile.auto in BioRuby is very powerful and
entry.definition is implemented in various DB classes,
your approach of finding entries by regexp
is not limited to FastaFormat, as the following shows:

% re_grep_def.rb 'serine.* kinase' genbank/gb*.seq
% re_grep_def.rb 'serine.* kinase' kegg/genes/*.ent
% re_grep_def.rb 'serine.* kinase' kegg/sequences/*.pep

----------------------------------------------
#!/usr/bin/env ruby

require 'bio'

re = /#{ARGV.shift}/i

Bio::FlatFile.auto(ARGF) do |ff|
   ff.each do |entry|
     if re.match(entry.definition)
       puts ff.entry_raw
     end
   end
end
----------------------------------------------
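The line `re = /#{ARGV.shift}/i` in the script above turns a command-line string into a case-insensitive regexp. A minimal sketch of that step in isolation, without BioRuby, is given here; the definition_regexp helper and the sample definition strings are made up for illustration.

```ruby
# Build a case-insensitive regexp from a plain string, in the same spirit
# as /#{str}/i in the script above.
def definition_regexp(query)
  Regexp.new(query, Regexp::IGNORECASE)
end

# Made-up definition lines standing in for entry.definition values.
definitions = [
  "serine/threonine kinase 11",
  "Homo sapiens serine protease",
  "tyrosine kinase receptor"
]

re = definition_regexp('serine.* kinase')
hits = definitions.select { |d| re.match(d) }
puts hits

# Slashes inside the string are matched literally, so a query written with
# regexp delimiters finds nothing in this list.
delimited = definition_regexp('/serine.* kinase/')
puts definitions.select { |d| delimited.match(d) }.empty?
```

The pattern is therefore passed on the command line without surrounding slashes, as in the `re_grep_def.rb 'serine.* kinase'` invocations above.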


-k


>
> Pj.
>
>
> #! /usr/bin/ruby
> #
> #   $Id: fastasearch,v 1.1 2004/03/21 13:18:41 wrk Exp $
> #   $Source: /home/cvs/home/pjotr/lwrk/luw/fasta/fastasearch,v $
> #
>
> # require 'profile'
>
> COPYRIGHT = "GPL (c) 2003-2004"
>
> usage = <<USAGE
>
>     Search fasta file(s) tags using a regular expression (regex)
>
>     Usage: fastasearch [-q query] filename(s)
>
>     Example:
>
>       ruby fastasearch -q '([Hh]uman|[Hh]omo sapiens)' nr.fa
>
>     For more information see
>
>         http://thebird.nl/bioinformatics/
> 	
>     Pjotr Prins
>     Wageningen University and Research Centre
>     http://www.wur.nl/
>     http://www.dpw.wageningen-ur.nl/nema/
>
> USAGE
>
> # --------------------------------------------------------------------
>
> srcpath=File.dirname($0)
> libpath=File.dirname(srcpath)+'/lib'
> $: << srcpath         # ---- Add start path to search libraries
> $: << libpath
>
> require 'getoptlong'
> require 'bio'
>
> # ---- Parse command line
> opts = GetoptLong.new(
>  [ "--help", "-h", GetoptLong::NO_ARGUMENT ],
>  [ "--query", "-q", GetoptLong::REQUIRED_ARGUMENT ]
> )
>
> do_help       = false
> query=nil
>
> opts.each do | opt, arg |
>    do_help   |= (opt == '--help')
>    query = arg if (opt == '--query')
> end
>
> # ---- Print usage
> if (do_help || ARGV.size==0)
>   print usage
>   exit 1
> end
>
> if !query
>   print "Give query: "
>   query = $stdin.gets.chomp
> end
>
> ARGV.each do | fn |
>   $stderr.print "Loading #{fn}..."
>   f = Bio::FlatFile.auto(fn)
>   $stderr.print " detected: #{f.dbclass}\n"
>   f.each_entry do | e |
>     if e.definition =~ /#{query}/
>       print '>',e.definition,e.data
>     end
>   end
> end

From pjotr at pckassa.com  Wed Mar 24 01:24:16 2004
From: pjotr at pckassa.com (pjotr@pckassa.com)
Date: Wed Mar 24 01:29:38 2004
Subject: [BioRuby] RegEx search example fasta file
In-Reply-To: <1BBD2286-7D3F-11D8-8E2D-000A95AE7AB4@hgc.jp>
References: <20040321133300.GB26257@team-machine.donck.com>
	<1BBD2286-7D3F-11D8-8E2D-000A95AE7AB4@hgc.jp>
Message-ID: <20040324062416.GA31443@team-machine.donck.com>

Thanks! I'll have a look and will improve the Wiki to cover that. Pays
off immediately ;-).

Pj.

On Wed, Mar 24, 2004 at 11:58:20AM +0900, Toshiaki Katayama wrote:
> On 2004/03/21, at 22:33, pjotr@pckassa.com wrote:
> >Can this go in the sample directory of bioruby - I have added it to
> >the Wiki. Comments welcome.
> 
> As for the wiki page: in the original BJIA
> (http://www.biojava.org/docs/bj_in_anger/FastaParser.htm),
> this section answers how to parse fasta results.
> 
> Because Bio::FlatFile.auto in BioRuby is very powerful and
> entry.definition is implemented in various DB classes,
> your approach of finding entries by regexp
> is not limited to FastaFormat, as the following shows:
> 
> % re_grep_def.rb 'serine.* kinase' genbank/gb*.seq
> % re_grep_def.rb 'serine.* kinase' kegg/genes/*.ent
> % re_grep_def.rb 'serine.* kinase' kegg/sequences/*.pep
> 
> ----------------------------------------------
> #!/usr/bin/env ruby
> 
> require 'bio'
> 
> re = /#{ARGV.shift}/i
> 
> Bio::FlatFile.auto(ARGF) do |ff|
>   ff.each do |entry|
>     if re.match(entry.definition)
>       puts ff.entry_raw
>     end
>   end
> end
> ----------------------------------------------
> 
> 
> -k
> 
> 
> 
> >
> >Pj.
> >
> >
> >#! /usr/bin/ruby
> >#
> >#   $Id: fastasearch,v 1.1 2004/03/21 13:18:41 wrk Exp $
> >#   $Source: /home/cvs/home/pjotr/lwrk/luw/fasta/fastasearch,v $
> >#
> >
> ># require 'profile'
> >
> >COPYRIGHT = "GPL (c) 2003-2004"
> >
> >usage = <<USAGE
> >
> >    Search fasta file(s) tags using a regular expression (regex)
> >
> >    Usage: fastasearch [-q query] filename(s)
> >
> >    Example:
> >
> >      ruby fastasearch -q '([Hh]uman|[Hh]omo sapiens)' nr.fa
> >
> >    For more information see
> >
> >        http://thebird.nl/bioinformatics/
> >	
> >    Pjotr Prins
> >    Wageningen University and Research Centre
> >    http://www.wur.nl/
> >    http://www.dpw.wageningen-ur.nl/nema/
> >
> >USAGE
> >
> ># --------------------------------------------------------------------
> >
> >srcpath=File.dirname($0)
> >libpath=File.dirname(srcpath)+'/lib'
> >$: << srcpath         # ---- Add start path to search libraries
> >$: << libpath
> >
> >require 'getoptlong'
> >require 'bio'
> >
> ># ---- Parse command line
> >opts = GetoptLong.new(
> > [ "--help", "-h", GetoptLong::NO_ARGUMENT ],
> > [ "--query", "-q", GetoptLong::REQUIRED_ARGUMENT ]
> >)
> >
> >do_help       = false
> >query=nil
> >
> >opts.each do | opt, arg |
> >   do_help   |= (opt == '--help')
> >   query = arg if (opt == '--query')
> >end
> >
> ># ---- Print usage
> >if (do_help || ARGV.size==0)
> >  print usage
> >  exit 1
> >end
> >
> >if !query
> >  print "Give query: "
> >  query = $stdin.gets.chomp
> >end
> >
> >ARGV.each do | fn |
> >  $stderr.print "Loading #{fn}..."
> >  f = Bio::FlatFile.auto(fn)
> >  $stderr.print " detected: #{f.dbclass}\n"
> >  f.each_entry do | e |
> >    if e.definition =~ /#{query}/
> >      print '>',e.definition,e.data
> >    end
> >  end
> >end