From LZhou at illumina.com Mon Mar 1 21:39:51 2004
From: LZhou at illumina.com (Zhou, Lixin)
Date: Mon Mar 1 21:45:44 2004
Subject: [BioRuby] Is there a limit to string / naseq length?

Hi all,

I was parsing NCBI's human RefSeq 34 version 2 and noticed that the DNA
sequence from SOURCE is truncated. This appears to be reproducible when
I require "bio/db/genbank/refseq".

The length of the NT_005612 sequence from CHR_03 is 100,530,261 bp
(longest in human RefSeq 34 v2 and the only one whose sequence is
greater than 1M bp). I was parsing the entire RefSeq and then cutting
out exon sequences, and noticed that a few NM / XM entries returned
empty sequences from NT_005612. A careful examination indicates that
their coordinates are greater than 100,000,000. I tried printing
gb.naseq and, indeed, the sequence is truncated to about 100,000,020 bp.
By the way, it appears that bioruby takes only the first 2,575,408 lines
of the RefSeq record, because the 100,000,021st base starts at line
2,575,409 of the NT record.

I briefly checked the bioruby source and did not find a limit on the
sequence length. Is this a bug in Ruby 1.8.1, which I use?

Thanks.

Lixin Zhou
lzhou@illumina.com

From LZhou at illumina.com Tue Mar 2 00:14:05 2004
From: LZhou at illumina.com (Zhou, Lixin)
Date: Tue Mar 2 00:19:59 2004
Subject: [BioRuby] Is there a limit to string / naseq length?

I've just deleted some lines of annotation from the feature table in
NT_005612 and found that the sequence is still truncated to
100,000,020 bp. Therefore, the bug may have nothing to do with the
number of lines in the RefSeq record.

Here are corrections to the mistakes / typos in my previous message:

1. The sequence is from ORIGIN, not SOURCE.
2. The sequence length is greater than 100 M bp.

_______________________________________________
BioRuby mailing list
BioRuby@open-bio.org
http://portal.open-bio.org/mailman/listinfo/bioruby

From ktym at hgc.jp Tue Mar 2 03:54:55 2004
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue Mar 2 04:00:58 2004
Subject: [BioRuby] Is there a limit to string / naseq length?
Message-ID: <470F467A-6C27-11D8-9053-000A95D919D8@hgc.jp>

Hi,

I have confirmed that this also occurs on my OS X and Linux boxes with
Ruby 1.6.8 and 1.8.1 by parsing the following file.

  ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_03/hs_chr3.gbk.gz

My implementation of the GenBank parser and the Bio::Sequence classes
doesn't limit sequence length.

...however...

The problem is that, when I wrote bio/db.rb, I didn't imagine that the
sequence coordinate number in the NCBI GenBank format could reach the
head of the line, so the parser misses the lines from 100000021 onward.

------------------------------------------------------------------------------
LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004
DEFINITION  Homo sapiens chromosome 3 genomic contig.
ACCESSION   NT_005612
(snip)
ORIGIN
        1 gaattcacac atcacaaaga agtttcacag aatgcttctg tgtggttttt atgtgaacat
       61 atttcctttt ccgctggcag attctacaaa aagagtgtat ccaaagtgct cagtcaaaag
(snip)
 99999961 aagcaatgaa ctgtctgtgg agtgagtgtg tattaaaacg tggaatgagg atctccccca
100000021 cggggcgggg agactaggag aaagctgcca gaggctgctg gcaagagata tccactggtt
(snip)
100530181 taggtttgaa agctaggtgt cagccactgg gcctccatgc tgagattcat actcccatct
100530241 tgttcatgat tattctgaat t
//
------------------------------------------------------------------------------

I will fix this in CVS, although it may take some time.

Sorry for the inconvenience,
Toshiaki Katayama
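The alignment problem described in this message can be reproduced with a short sketch. It assumes that the sequence coordinate is right-justified in a nine-column field, as the excerpt suggests, and it uses the /\n(\S)/ pattern that a later message in this thread names as the one bio/db.rb used for top-level tags. The helper origin_line and the sample record are hypothetical and are not BioRuby code.

```ruby
# Sketch only: origin_line and the sample record are hypothetical,
# not part of BioRuby.

# Format one ORIGIN line: coordinate right-justified in 9 columns,
# then the bases in space-separated groups of 10.
def origin_line(coord, bases)
  format("%9d %s", coord, bases.scan(/.{1,10}/).join(" "))
end

short = origin_line(99_999_961, "aagcaatgaa")
long  = origin_line(100_000_021, "cggggcgggg")

puts short  # " 99999961 aagcaatgaa"  (leading space)
puts long   # "100000021 cggggcgggg"  (digit in column 1)

# A pattern that treats any non-space character at the start of a line
# as the beginning of a new top-level tag.
old_toptag = /\n(\S)/

record = "ORIGIN\n#{short}\n#{long}\n//"

# The nine-digit coordinate line is reported as a tag start, alongside
# the "//" terminator; the eight-digit line is not.
puts record.scan(old_toptag).flatten.inspect  # ["1", "/"]
```

Under these assumptions, any line whose coordinate has nine or more digits looks like a new top-level field, which would explain why the sequence stops at base 100,000,020.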
From LZhou at illumina.com Tue Mar 2 11:58:59 2004
From: LZhou at illumina.com (Zhou, Lixin)
Date: Tue Mar 2 12:04:52 2004
Subject: [BioRuby] Is there a limit to string / naseq length?

Hi,

Thanks for pinpointing the bug. I was just checking
bio/db/genbank/genbank.rb and realized that the fields of the ^LOCUS
line are tokenized using the GenBank "definition". Apparently, GenBank
will have to break their own rules sooner or later. Perhaps we can
simply split the line, as long as the total number of fields remains
the same?

Thanks!

Lixin Zhou
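The suggestion of splitting the LOCUS line instead of relying on column positions could look roughly like the sketch below. The field names, the helper split_locus, and the exact spacing of the sample line are assumptions made for illustration; this is not how genbank.rb is implemented.

```ruby
# Sketch only: LOCUS_FIELDS and split_locus are hypothetical names.
LOCUS_FIELDS = [:tag, :entry_id, :length, :unit,
                :natype, :topology, :division, :date]

# Split a LOCUS line on runs of whitespace and accept the result only
# when the number of fields is the expected one.
def split_locus(line)
  fields = line.split
  return nil unless fields.size == LOCUS_FIELDS.size
  Hash[LOCUS_FIELDS.zip(fields)]
end

line = "LOCUS       NT_005612          100530261 bp    DNA     linear   CON 23-JAN-2004"
locus = split_locus(line)

puts locus[:entry_id]     # NT_005612
puts locus[:length].to_i  # 100530261
```

Returning nil when the field count differs leaves room to fall back to another strategy for records whose LOCUS line has a different number of fields.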
From ktym at hgc.jp Tue Mar 2 13:11:07 2004
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue Mar 2 13:17:01 2004
Subject: [BioRuby] Is there a limit to string / naseq length?

Hi,

The following change affects all subclasses of Bio::NCBIDB: I have
changed the regexp in bio/db.rb that matches top-level tags from
/\n(\S)/ to /\n([A-Za-z\/])/ so that it no longer matches digits.

In addition, sequence extraction is now faster because gsub was
replaced with tr in genbank.rb.

Please try these changes from CVS and report if they break anything.

Lixin, thank you for your report.

Regards,
Toshiaki Katayama
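The effect of the regexp change can be illustrated with a small sketch. The new pattern is written here as /\n([A-Za-z\/])/, which is an assumption about the intended character class (letters plus the slash of the // terminator). The helpers split_toptags and bases and the sample record are hypothetical and only mimic the idea of splitting an entry at its top-level tags.

```ruby
# Sketch only: helpers and sample data are hypothetical, not BioRuby code.
record = [
  "LOCUS       NT_005612",
  "ORIGIN",
  " 99999961 aagcaatgaa ctgtctgtgg",
  "100000021 cggggcgggg agactaggag",
  "//"
].join("\n")

old_toptag = /\n(\S)/           # any non-space character in column 1
new_toptag = /\n([A-Za-z\/])/   # assumed: letters or "/" in column 1

# Mark each top-level tag with a separator character and split on it.
def split_toptags(str, pattern)
  sep = "\001"
  str.gsub(pattern, "\n#{sep}\\1").split(sep)
end

# Drop the "ORIGIN" header line, then delete digits, spaces and newlines.
# String#tr with an empty replacement deletes the listed characters.
def bases(origin_field)
  origin_field.sub(/\AORIGIN.*\n/, "").tr("0-9 \n", "")
end

old_origin = split_toptags(record, old_toptag).find { |f| f.start_with?("ORIGIN") }
new_origin = split_toptags(record, new_toptag).find { |f| f.start_with?("ORIGIN") }

puts bases(old_origin).length  # 20 (cut off before the nine-digit line)
puts bases(new_origin).length  # 40 (both sequence lines kept)
```

With the old pattern the ORIGIN field ends just before the line that starts with 100000021; with the letters-and-slash pattern both sequence lines stay in the same field. Using tr for the cleanup avoids a regexp substitution over a very long string, which is presumably where the reported speedup comes from.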
From LZhou at illumina.com Tue Mar 2 13:15:19 2004
From: LZhou at illumina.com (Zhou, Lixin)
Date: Tue Mar 2 13:21:10 2004
Subject: [BioRuby] Is there a limit to string / naseq length?

Thank you very much for your quick and great work! I'll try it out.
> >>> > >>> The length of the NT_005612 sequence from CHR_03 is > 100,530,261 bp > >>> (longest in human RefSeq 34 v2 and the only one whose sequence is > >>> greater than 1M bp). I was parsing the entire RefSeq and > >> then cutting > >>> exon sequence and noticed a few NM / XM entries returned empty > >>> sequence from NT_005612. A careful examination indicate > that their > >>> coordinates are greater than 100,000,000. I tried to print > >> out gb.naseq and > >>> indeed, > >>> the sequence is truncated to about 100,000,020. By the > >> way, it appears > >>> bioruby takes only the first 2575408 lines of the entire > >> RefSeq record > >>> - > >>> because 100,000,021st base starts at the line 2,575,409 of the NT > >>> record. > >>> > >>> I briefly checked bioruby source and have not found a > limit to the > >>> sequence length. Is this a bug from Ruby 1.8.1, which I use? > >>> > >>> Thanks. > >>> > >>> Lixin Zhou > >>> lzhou@illumina.com > >>> > >>> _______________________________________________ > >>> BioRuby mailing list > >>> BioRuby@open-bio.org > >>> http://portal.open-bio.org/mailman/listinfo/bioruby > >>> > >>> > >>> > >>> > >>> _______________________________________________ > >>> BioRuby mailing list > >>> BioRuby@open-bio.org > >>> http://portal.open-bio.org/mailman/listinfo/bioruby > >> > >> _______________________________________________ > >> BioRuby mailing list > >> BioRuby@open-bio.org > >> http://portal.open-> bio.org/mailman/listinfo/bioruby > >> > > > > _______________________________________________ > > BioRuby mailing list > > BioRuby@open-bio.org > > http://portal.open-bio.org/mailman/listinfo/bioruby > > > _______________________________________________ > BioRuby mailing list > BioRuby@open-bio.org > http://portal.open-bio.org/mailman/listinfo/bioruby > From ktym at hgc.jp Tue Mar 2 14:11:18 2004 From: ktym at hgc.jp (Toshiaki Katayama) Date: Tue Mar 2 14:17:12 2004 Subject: [BioRuby] Re: BioRuby PDB Classes In-Reply-To: 
<371B90E2-5E0A-11D8-B0EA-000A957E44DC@ebi.ac.uk>
References: <371B90E2-5E0A-11D8-B0EA-000A957E44DC@ebi.ac.uk>
Message-ID: <6224FA24-6C7D-11D8-B650-000A95CD9782@hgc.jp>

Hi Alex,

I have received a patch developed by you and N. Goto.
The changes are already committed to the CVS repository:

* Added patches from Alex Gutteridge
  - New classes: PDB::Atom, PDB::Residue, PDB::Chain, PDB::Model
  - New modules: Bio::PDBUtils, Bio::{Atom|Residue|Chain|Model}Finder
  - New methods: iterators, ...
  - Bug fix
* Bio::Coordinate class storing coordinate data (inherits Vector)
* Many incompatible (but very useful) changes have been made, so please be careful.

Regards,
Toshiaki Katayama

On 2004/02/13, at 18:51, Alex Gutteridge wrote:

>>> My question(s) to the list are:
>>>
>>> 1. Am I treading on other people's toes here? Is someone else actively
>>> developing the pdb.rb module? Naohisa Goto?
>>
>> I'm Naohisa Goto, but I'm not actively developing pdb.rb now,
>> and no one (except you) is, as far as I know.
>> So, you can freely modify pdb.rb.
>>
>> If you want to change an existing class/method's name, or massively
>> change an existing class/method's specification or definition, please tell us.
>
> I've left the main PDB class alone except for:
>
> - I've changed the seqres method to return Bio::Seq objects rather
>   than just strings
> - I've removed the old model parsing section and replaced it with mine
>
>>> 2. If not, should I post the code to the mailing list, or somewhere
>>> else? I'm sure it needs some tidying up and bioruby-fication. It
>>> would be great if someone more experienced than I could give some
>>> comments/criticisms.
>>
>> If the code is short, please post to the mailing list.
>> For long codes, please send to staff@bioruby.org, or you can use
>> the BioRuby Project Wiki page (http://wiki.bioruby.org/English/).
>
> pdb.rb is now ~2000 lines, so I won't post it to the list!
> I'll post it to staff@bioruby.org (the wiki seems to be broken at the moment).
> pdb.rb probably needs splitting up into separate files, but like I
> said, I'm not sure what the BioRuby conventions are for doing this
> (would it need a new bio/db/pdb directory?). Currently it looks like
> this:
>
> module Bio
>   # This module provides some generic mixin methods that all classes use
>   module PDBUtils
>     [snip!]
>   end
>   # There are several *Finder mixins which provide some of the searching methods
>   module AtomFinder
>     [snip!]
>   end
>   # This is the main PDB class that was here originally - I've only added methods
>   # so all the old interface is still here (apart from .seqres and .model)
>   class PDB
>     # There are a few modules and classes here used for the old style parsing
>     class FieldDef
>       [snip!]
>     end
>     class Record < Hash
>       [snip!]
>     end
>     [snip!]
>     # My new classes for atoms, residues, chains and models go here
>     class Atom
>       [snip!]
>     end
>     class Residue
>       [snip!]
>     end
>     class Chain
>       [snip!]
>     end
>     class Model
>       [snip!]
>     end
>   end #class PDB
> end #module Bio
>
> Perhaps the PDBUtils and Finder modules should go inside the PDB
> class? Or a separate file for each class and the mixins?
>
> Alex Gutteridge
> European Bioinformatics Institute
> Cambridge CB10 1SD
> UK
>
> Tel: 01223 492550
> Email: alexg@ebi.ac.uk
>
> _______________________________________________
> BioRuby mailing list
> BioRuby@open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioruby

From ktym at hgc.jp Wed Mar 3 19:37:31 2004
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Wed Mar 3 19:43:24 2004
Subject: [BioRuby] Fwd: BOSC 2004 Announcement and Call for Papers
Message-ID: <1F799BB8-6D74-11D8-B4B6-000A95CD9782@hgc.jp>

Hi,

The CFP for BOSC 2004 has been announced.
Plans and discussions for BioRuby's presentation are welcome.

Thanks,
Toshiaki Katayama

Begin forwarded message:

> From: Darin London
> Date: March 4, 2004 6:06:52 JST
> To: bioperl-announce-l@bioperl.org
> Cc:
> Subject: [Bioperl-announce-l] BOSC 2004 Announcement and Call for Papers
>
> {Please pass the word!}
>
> MEETING ANNOUNCEMENT & CALL FOR SPEAKERS
>
> The 5th annual Bioinformatics Open Source Conference (BOSC'2004) is
> organized by the not-for-profit Open Bioinformatics Foundation. The
> meeting will take place July 29-30, 2004 in Glasgow, Scotland, and is
> one of several Special Interest Group (SIG) meetings occurring in
> conjunction with the 12th International Conference on Intelligent
> Systems for Molecular Biology.
>
> see http://www.iscb.org/ismb2004/ for more information.
>
> The focus of the meeting will be on current and emerging Open Source**
> informatics tools and toolkits. BOSC provides a forum for developers,
> project groups, users and interested parties to meet personally,
> exchange ideas and collaborate together.
>
> In addition, keynote speeches from well known Open Source Bioinformatics
> leaders are being planned.
>
> BOSC PROGRAM & CONTACT INFO
>
> * Web: http://www.open-bio.org/bosc2004/
> * Email: bosc@open-bio.org
> * Online registration: https://www.cteusa.com/iscb3/
>
>
> FEES
>
> * Corporate: GBP £165.00 British pounds sterling
> * Academic: GBP £120.00 British pounds sterling
> * Student: GBP £90.00 British pounds sterling
>
> A 17.5% Value Added Tax (VAT) will be added to all fees.
>
> Note: We have tried to set our fees as low as possible without risking
> the chance that the foundation will lose money on the event. We budget
> with the goal of breaking even on costs or realizing a small profit.
>
> REGISTER ONLINE FOR BOSC'2004 & ISMB AT:
> https://www.cteusa.com/iscb3/
>
> SPEAKERS & ABSTRACTS WANTED
>
> The program committee is currently seeking abstracts for talks at BOSC
> 2004. BOSC is a great opportunity for you to tell the community about
> your use, development, or philosophy of open source software development
> in bioinformatics.
The committee will select several submitted > abstracts > for 25-minute talks and others for shorter "lightning" talks. Accepted > abstracts will be published on the BOSC web site. > > If you are interested in speaking at BOSC 2004, > please send us: > > * an abstract (no more than a few paragraphs) > * a URL for the project page, if applicable > * information about the open source license used for your software or > your release plans. > > LIGHTNING-TALK SPEAKERS WANTED! > > The program committee is currently seeking speakers for the lightning > talks at BOSC 2004. Lightning talks are quick - only five minutes > long - and a great opportunity for you to give people a quick > summary of your open source project, code, idea, or vision of the > future. > > If you are interested in giving a lightning talk at BOSC 2004, > please send us: > > * a brief title and summary (one or two lines) > * a URL for the project page, if applicable > * information about the open source license used for your software or > your release plans. > > We will accept entries on-line until BOSC starts, but > space for demos and lightning talks is limited.
>
> SOFTWARE DEMONSTRATIONS WANTED!
>
> If you are involved in the development of Open Source Bioinformatics
> Software, you are invited to provide a short demonstration to attendees
> of BOSC 2004.
>
> If you are interested in giving a software demonstration at BOSC 2004,
> please send us:
>
> * a brief title and summary (one or two lines)
> * a URL for the project page, if applicable
> * Internet connectivity requirements (e.g. website Application served on
>   the world wide web, or web based client application).
>
> We will accept entries on-line until BOSC starts, but
> space for demos and lightning talks is limited.
>
> ** Because the mission of the OBF is to promote Open Source software, we
> will favor submissions for projects that apply a recognized Open Source
> License, or adhere to the general Open Source Philosophy.
>
> See the following websites for further details:
> http://www.opensource.org/licenses/
> http://www.opensource.org/docs/definition.php
>
> _______________________________________________
> Bioperl-announce-l mailing list
> Bioperl-announce-l@portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-announce-l

From ngoto at gen-info.osaka-u.ac.jp Thu Mar 4 11:30:52 2004
From: ngoto at gen-info.osaka-u.ac.jp (GOTO Naohisa)
Date: Thu Mar 4 11:36:42 2004
Subject: [BioRuby] Re: Non-standard FASTA
Message-ID:

Hi,

I've now changed lib/bio/db/fasta.rb and lib/bio/io/flatfile.rb in CVS
to support the non-standard FASTA format with comment lines, as suggested
by Mr. Pjotr Prins last year. The support is now integrated into the
Bio::FastaFormat class, with very little loss of performance.

Would you please check that they work correctly?
Regards, -- Naohisa GOTO ngoto@gen-info.osaka-u.ac.jp Genome Information Research Center, Osaka University, Japan From Jean-Philippe.Vert at mines.org Tue Mar 16 05:54:52 2004 From: Jean-Philippe.Vert at mines.org (Jean-Philippe Vert) Date: Tue Mar 16 04:59:46 2004 Subject: [BioRuby] proxy Message-ID: <4056DCFC.8090902@mines.org> Dear friends, I'd like to use the wonderful KEGG API with Ruby, but I can only connect to the web through a proxy. Would someone know what I should do to set up the proxy connection? For example, let's say I want to run the script: #!/usr/bin/env ruby require 'bio' serv = Bio::KEGG::API.new puts serv.get_best_neighbors_by_gene('eco:b0002', 500, ['hin', 'bsu']) It does not work currently because it tries to connect directly. Should I change something in this code, or in a different configuration file? Thanks if you have time to answer this basic question, or give me a link. yoroshiku onegaishimasu jp -- Jean-Philippe Vert Ecole des Mines de Paris http://www.cg.ensmp.fr/~vert From Panayiotis.Periorellis at newcastle.ac.uk Tue Mar 16 05:03:12 2004 From: Panayiotis.Periorellis at newcastle.ac.uk (Panayiotis Periorellis) Date: Tue Mar 16 05:09:09 2004 Subject: [BioRuby] proxy Message-ID: R you doing your code on java? If yes let me knw and I will I will tell you .. -----Original Message----- From: Jean-Philippe Vert [mailto:Jean-Philippe.Vert@mines.org] Sent: 16 March 2004 10:55 To: bioruby@open-bio.org Subject: [BioRuby] proxy Dear friends, I'd like to use the wonderful KEGG API with Ruby, but I can only connect to the web through a proxy. Would someone know what I should do to set up the proxy connection? For example, let's say I want to run the script: #!/usr/bin/env ruby require 'bio' serv = Bio::KEGG::API.new puts serv.get_best_neighbors_by_gene('eco:b0002', 500, ['hin', 'bsu']) It does not work currently because it tries to connect directly. Should I change something in this code, or in a different configuration file? 
Thanks if you have time to answer this basic question, or give me a link. yoroshiku onegaishimasu jp -- Jean-Philippe Vert Ecole des Mines de Paris http://www.cg.ensmp.fr/~vert _______________________________________________ BioRuby mailing list BioRuby@open-bio.org http://portal.open-bio.org/mailman/listinfo/bioruby From ktym at hgc.jp Tue Mar 16 05:09:14 2004 From: ktym at hgc.jp (Toshiaki Katayama) Date: Tue Mar 16 05:14:46 2004 Subject: [BioRuby] proxy In-Reply-To: <4056DCFC.8090902@mines.org> References: <4056DCFC.8090902@mines.org> Message-ID: Hi Vert, According to the SOAP4R documentation at http://rrr.jin.gr.jp/doc/soap4r/RELEASE_en.html you will need to set two environmental variables or to create a configuration file (I have no proxy and have never tried). 1. set 'soap_use_proxy' variable with its value 'on' and set 'http_proxy' variable with URL of your proxy as its value or 2. create a $RUBYLIB/soap/property file (where $RUBYLIB is the directory of your choice something like $HOME/lib/ruby/ or /usr/local/lib/ruby/site_ruby/1.8/ etc.) to specify the proxy like: client.protocol.http.proxy = http://myproxy:8080 Hope this helps, Toshiaki Katayama On 2004/03/16, at 19:54, Jean-Philippe Vert wrote: > Dear friends, > > I'd like to use the wonderful KEGG API with Ruby, but I can only > connect to the web through a proxy. Would someone know what I should > do to set up the proxy connection? For example, let's say I want to > run the script: > > #!/usr/bin/env ruby > require 'bio' > serv = Bio::KEGG::API.new > puts serv.get_best_neighbors_by_gene('eco:b0002', 500, ['hin', 'bsu']) > > It does not work currently because it tries to connect directly. > Should I change something in this code, or in a different > configuration file? > > Thanks if you have time to answer this basic question, or give me a > link. 
> > yoroshiku onegaishimasu > jp > > -- > Jean-Philippe Vert > Ecole des Mines de Paris > http://www.cg.ensmp.fr/~vert > > > _______________________________________________ > BioRuby mailing list > BioRuby@open-bio.org > http://portal.open-bio.org/mailman/listinfo/bioruby From Jean-Philippe.Vert at mines.org Tue Mar 16 06:27:20 2004 From: Jean-Philippe.Vert at mines.org (Jean-Philippe Vert) Date: Tue Mar 16 05:32:38 2004 Subject: [BioRuby] proxy References: <4056DCFC.8090902@mines.org> Message-ID: <4056E498.3010902@mines.org> wonderful! thanks a lot, the first solution: setenv SOAP_USE_PROXY on setenv HTTP_PROXY myproxy:8080 works very well ookini! jp Toshiaki Katayama wrote: > Hi Vert, > > According to the SOAP4R documentation at > http://rrr.jin.gr.jp/doc/soap4r/RELEASE_en.html > you will need to set two environmental variables or to create > a configuration file (I have no proxy and have never tried). > > 1. set 'soap_use_proxy' variable with its value 'on' and > set 'http_proxy' variable with URL of your proxy as its value > > or > > 2. create a $RUBYLIB/soap/property file (where $RUBYLIB is the > directory of your choice something like $HOME/lib/ruby/ or > /usr/local/lib/ruby/site_ruby/1.8/ etc.) to specify the proxy like: > client.protocol.http.proxy = http://myproxy:8080 > > Hope this helps, > Toshiaki Katayama > > > On 2004/03/16, at 19:54, Jean-Philippe Vert wrote: > >> Dear friends, >> >> I'd like to use the wonderful KEGG API with Ruby, but I can only >> connect to the web through a proxy. Would someone know what I should >> do to set up the proxy connection? For example, let's say I want to >> run the script: >> >> #!/usr/bin/env ruby >> require 'bio' >> serv = Bio::KEGG::API.new >> puts serv.get_best_neighbors_by_gene('eco:b0002', 500, ['hin', 'bsu']) >> >> It does not work currently because it tries to connect directly. >> Should I change something in this code, or in a different >> configuration file? 
>> >> Thanks if you have time to answer this basic question, or give me a >> link. >> >> yoroshiku onegaishimasu >> jp >> >> -- >> Jean-Philippe Vert >> Ecole des Mines de Paris >> http://www.cg.ensmp.fr/~vert >> >> >> _______________________________________________ >> BioRuby mailing list >> BioRuby@open-bio.org >> http://portal.open-bio.org/mailman/listinfo/bioruby > > > _______________________________________________ > BioRuby mailing list > BioRuby@open-bio.org > http://portal.open-bio.org/mailman/listinfo/bioruby > > -- Jean-Philippe Vert Ecole des Mines de Paris http://www.cg.ensmp.fr/~vert From pjotr at pckassa.com Sun Mar 21 08:31:37 2004 From: pjotr at pckassa.com (pjotr@pckassa.com) Date: Sun Mar 21 08:37:04 2004 Subject: [BioRuby] Fwd: BOSC 2004 Announcement and Call for Papers In-Reply-To: <1F799BB8-6D74-11D8-B4B6-000A95CD9782@hgc.jp> References: <1F799BB8-6D74-11D8-B4B6-000A95CD9782@hgc.jp> Message-ID: <20040321133137.GA26257@team-machine.donck.com> Fair chance I will be going. If so I'll submit a paper. Pj. From pjotr at pckassa.com Sun Mar 21 08:33:00 2004 From: pjotr at pckassa.com (pjotr@pckassa.com) Date: Sun Mar 21 08:38:25 2004 Subject: [BioRuby] RegEx search example fasta file Message-ID: <20040321133300.GB26257@team-machine.donck.com> Can this go in the sample directory of bioruby - I have added it to the Wiki. Comments welcome. Pj. #! 
/usr/bin/ruby # # $Id: fastasearch,v 1.1 2004/03/21 13:18:41 wrk Exp $ # $Source: /home/cvs/home/pjotr/lwrk/luw/fasta/fastasearch,v $ # # require 'profile' COPYRIGHT = "GPL (c) 2003-2004" usage = <',e.definition,e.data end end end From pjotr at pckassa.com Mon Mar 22 13:20:51 2004 From: pjotr at pckassa.com (pjotr@pckassa.com) Date: Mon Mar 22 13:26:17 2004 Subject: [BioRuby] Fwd: BOSC 2004 Announcement and Call for Papers In-Reply-To: <20040321133137.GA26257@team-machine.donck.com> References: <1F799BB8-6D74-11D8-B4B6-000A95CD9782@hgc.jp> <20040321133137.GA26257@team-machine.donck.com> Message-ID: <20040322182051.GA10462@team-machine.donck.com> Yes, I am attending BOSC (permission granted by my Professor). Anyone else of BioRuby coming? I can do a talk - what would be the most interesting to discuss? What came up during the last BOSC? Some feedback would be useful. Yours, Pj. From lzhou at illumina.com Tue Mar 23 14:38:49 2004 From: lzhou at illumina.com (Lixin Zhou) Date: Tue Mar 23 14:44:18 2004 Subject: [BioRuby] Is there a limit to string / naseq length? In-Reply-To: References: Message-ID: <40609249.1000902@illumina.com> Hello, I've tried the patch for the latest RefSeq 34 version 3 (and v2 as well). Perhaps I did it wrong - it's a few times slower than the previous release, and perhaps use more memory as well. I've not had a close look, so that I don't know what caused the slowness. I simply switched back to the previous release. Has anyone tried to parse ASN.1 format using Ruby, or common LISP / scheme? Thanks! Lixin Toshiaki Katayama wrote: > Hi, > > Following change affects all sub-classes of the Bio::NCBIDB and > I have changed regexp in bio/db.rb to match top level tag from > /\n(\S)/ to /\n([A-Za-z\])/ for avoiding digits. > > Plus, sequence extraction became faster by replacing gsub with > tr in genbank.rb. > > Try these changes in CVS and please report if break anything. > > > Lixin, thank you for your report. 
> > Regards, > Toshiaki Katayama > > On 2004/03/03, at 1:58, Zhou, Lixin wrote: > >> Hi, >> >> Thanks for pinpointing the bug. I was just checking >> bio/db/genbank/genbank.rb and realized that the fields from ^LOCUS line >> was tokenized using the GenBank "definition". Apparently, GenBank will >> have to break their rules soon or later. Perhaps we can simply split >> the line as long as the total number of fields remains the same? >> >> Thanks! >> >> Lixin Zhou >> >>> -----Original Message----- >>> From: Toshiaki Katayama [mailto:ktym@hgc.jp] >>> Sent: Tuesday, March 02, 2004 12:55 AM >>> To: bioruby@open-bio.org >>> Subject: Re: [BioRuby] Is there a limit to string / naseq length? >>> >>> >>> Hi, >>> >>> I have confirmed this also occurs on my OS X and Linux box >>> with Ruby 1.6.8 and 1.8.1 by parsing the following file. >>> >>> >>> ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/H_sapiens/CHR_03/hs_ch >>> r3.gbk.gz >>> >>> My implementation of GenBank parser and Bio::Sequence classes >>> doesn't limit sequence length. >>> >>> ...however... >>> >>> The problem was that I couldn't imagine the sequence >>> coordination number in the NCBI GenBank format can reach at >>> the line head when I wrote bio/db.rb so that it misses lines >>> after 100000021. >>> >>> -------------------------------------------------------------- >>> ---------- >>> ------ >>> LOCUS NT_005612 100530261 bp DNA linear CON >>> 23-JAN-2004 >>> DEFINITION Homo sapiens chromosome 3 genomic contig. 
>>> ACCESSION NT_005612 >>> (snip) >>> ORIGIN >>> 1 gaattcacac atcacaaaga agtttcacag aatgcttctg tgtggttttt >>> atgtgaacat >>> 61 atttcctttt ccgctggcag attctacaaa aagagtgtat ccaaagtgct >>> cagtcaaaag >>> (snip) >>> 99999961 aagcaatgaa ctgtctgtgg agtgagtgtg tattaaaacg tggaatgagg >>> atctccccca >>> 100000021 cggggcgggg agactaggag aaagctgcca gaggctgctg gcaagagata >>> tccactggtt >>> (snip) >>> 100530181 taggtttgaa agctaggtgt cagccactgg gcctccatgc tgagattcat >>> actcccatct >>> 100530241 tgttcatgat tattctgaat t >>> // >>> -------------------------------------------------------------- >>> ---------- >>> ------ >>> >>> I will fix this in the CVS although it may take some time to be done. >>> >>> Sorry for the inconvenience, >>> Toshiaki Katayama >>> >>> >>> On 2004/03/02, at 14:14, Zhou, Lixin wrote: >>> >>>> I've just deleted some lines of annotation in the feature table in >>>> NT_005612 and found that the sequence is still truncated to >>>> 100,000,020 bp. Therefore, the bug may have nothing to do >>> >>> with the >>> >>>> number of lines in the RefSeq record. >>>> >>>> Here is to correct the mistakes / typos in the previous message: >>>> >>>> 1. The sequence is from ORIGIN not SOURCE. >>>> 2. The sequence length is greater than 100 M bp. >>>> >>>> -----Original Message----- >>>> From: Zhou, Lixin >>>> Sent: Mon 3/1/2004 6:39 PM >>>> To: bioruby@open-bio.org >>>> Cc: >>>> Subject: [BioRuby] Is there a limit to string / naseq length? >>>> Hi all, >>>> >>>> I was parsing NCBI's human RefSeq 34 version 2 and noticed that the >>>> DNA sequence from SOURCE is truncated. This appears to be >>> >>> reproducible >>> >>>> when >>>> I "require ?"bio/db/genbank/refseq?"". >>>> >>>> The length of the NT_005612 sequence from CHR_03 is 100,530,261 bp >>>> (longest in human RefSeq 34 v2 and the only one whose sequence is >>>> greater than 1M bp). 
I was parsing the entire RefSeq and >>> >>> then cutting >>> >>>> exon sequence and noticed a few NM / XM entries returned empty >>>> sequence from NT_005612. A careful examination indicate that their >>>> coordinates are greater than 100,000,000. I tried to print >>> >>> out gb.naseq and >>> >>>> indeed, >>>> the sequence is truncated to about 100,000,020. By the >>> >>> way, it appears >>> >>>> bioruby takes only the first 2575408 lines of the entire >>> >>> RefSeq record >>> >>>> - >>>> because 100,000,021st base starts at the line 2,575,409 of the NT >>>> record. >>>> >>>> I briefly checked bioruby source and have not found a limit to the >>>> sequence length. Is this a bug from Ruby 1.8.1, which I use? >>>> >>>> Thanks. >>>> >>>> Lixin Zhou >>>> lzhou@illumina.com >>>> >>>> _______________________________________________ >>>> BioRuby mailing list >>>> BioRuby@open-bio.org >>>> http://portal.open-bio.org/mailman/listinfo/bioruby >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> BioRuby mailing list >>>> BioRuby@open-bio.org >>>> http://portal.open-bio.org/mailman/listinfo/bioruby >>> >>> >>> _______________________________________________ >>> BioRuby mailing list >>> BioRuby@open-bio.org >>> http://portal.open-> bio.org/mailman/listinfo/bioruby >>> >> >> _______________________________________________ >> BioRuby mailing list >> BioRuby@open-bio.org >> http://portal.open-bio.org/mailman/listinfo/bioruby > > > > _______________________________________________ > BioRuby mailing list > BioRuby@open-bio.org > http://portal.open-bio.org/mailman/listinfo/bioruby > >
From uehara at cbo.mss.co.jp Tue Mar 23 21:02:12 2004 From: uehara at cbo.mss.co.jp (UEHARA Keizou) Date: Tue Mar 23 21:08:30 2004 Subject: [BioRuby] Is there a limit to string / naseq length?
In-Reply-To: <40609249.1000902@illumina.com>
References: <40609249.1000902@illumina.com>
Message-ID: <200403240202.AA00455@C1623.cbo.mss.co.jp>

>Hello,
>
>I've tried the patch for the latest RefSeq 34 version 3 (and v2 as
>well). Perhaps I did it wrong - it's a few times slower than the
>previous release, and perhaps use more memory as well. I've not had a
>close look, so that I don't know what caused the slowness. I simply
>switched back to the previous release.
>
>Has anyone tried to parse ASN.1 format using Ruby, or common LISP / scheme?

I convert ASN.1 to XML using asn2gb and parse it with xmlparser.

From ktym at hgc.jp Tue Mar 23 21:28:51 2004
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue Mar 23 21:34:19 2004
Subject: [BioRuby] Fwd: BOSC 2004 Announcement and Call for Papers
In-Reply-To: <20040322182051.GA10462@team-machine.donck.com>
References: <1F799BB8-6D74-11D8-B4B6-000A95CD9782@hgc.jp> <20040321133137.GA26257@team-machine.donck.com> <20040322182051.GA10462@team-machine.donck.com>
Message-ID:

On 2004/03/23, at 3:20, pjotr@pckassa.com wrote:

> Yes, I am attending BOSC (permission granted by my Professor). Anyone
> else of BioRuby coming?

That's nice. I will be there (not yet confirmed, though).
Maybe we can have a BioRuby BOF session.

> I can do a talk - what would be the most interesting to discuss? What
> came up during the last BOSC? Some feedback would be useful.

For example, could you prepare an interesting/typical application of
BioRuby? Last year, I presented the capabilities of BioRuby and the use
of the KEGG API (http://open-bio.org/bosc2003/talks.html#kegg).
Other projects also showed some applications, such as a user's perspective,
pipelines etc., and they were impressive.

In my case, I can prepare a report on our project overview and recent
progress, including some topics of my interest. Currently, I'm interested
in KEGG API, DAS, GFF3 etc. and integrating them with GMOD.
I am also interested in having a Bio::Graphics equivalent in BioRuby
(in a more generalized way?).

-k

From ktym at hgc.jp Tue Mar 23 21:58:20 2004
From: ktym at hgc.jp (Toshiaki Katayama)
Date: Tue Mar 23 22:03:45 2004
Subject: [BioRuby] RegEx search example fasta file
In-Reply-To: <20040321133300.GB26257@team-machine.donck.com>
References: <20040321133300.GB26257@team-machine.donck.com>
Message-ID: <1BBD2286-7D3F-11D8-8E2D-000A95AE7AB4@hgc.jp>

On 2004/03/21, at 22:33, pjotr@pckassa.com wrote:
> Can this go in the sample directory of bioruby - I have added it to
> the Wiki. Comments welcome.

As for the wiki page, compared with the original BJIA
(http://www.biojava.org/docs/bj_in_anger/FastaParser.htm),
this section is meant to answer how to parse fasta results.

Since Bio::FlatFile.auto in BioRuby is very powerful and
entry.definition is implemented in various DB classes,
your approach of finding entries by regexp
is not limited to FastaFormat, as follows:

  % re_grep_def.rb 'serine.* kinase' genbank/gb*.seq
  % re_grep_def.rb 'serine.* kinase' kegg/genes/*.ent
  % re_grep_def.rb 'serine.* kinase' kegg/sequences/*.pep

----------------------------------------------
#!/usr/bin/env ruby

require 'bio'

re = /#{ARGV.shift}/i

Bio::FlatFile.auto(ARGF) do |ff|
  ff.each do |entry|
    if re.match(entry.definition)
      puts ff.entry_raw
    end
  end
end
----------------------------------------------

-k

>
> Pj.
>
>
> #! /usr/bin/ruby
> #
> # $Id: fastasearch,v 1.1 2004/03/21 13:18:41 wrk Exp $
> # $Source: /home/cvs/home/pjotr/lwrk/luw/fasta/fastasearch,v $
> #
>
> # require 'profile'
>
> COPYRIGHT = "GPL (c) 2003-2004"
>
> usage = <<USAGE
>
>   Search fasta file(s) tags using a regular expression (regex)
>
>     Usage: fastasearch [-q query] filename(s)
>
>     Example:
>
>       ruby fastasearch -q '/([Hh]uman|[Hh]omo sapiens)/' nr.fa
>
>   For more information see
>
>     http://thebird.nl/bioinformatics/
>
>   Pjotr Prins
>   Wageningen University and Research Centre
>   http://www.wur.nl/
>   http://www.dpw.wageningen-ur.nl/nema/
>
> USAGE
>
> # --------------------------------------------------------------------
>
> srcpath=File.dirname($0)
> libpath=File.dirname(srcpath)+'/lib'
> $: << srcpath   # ---- Add start path to search libraries
> $: << libpath
>
> require 'getoptlong'
> require 'bio'
>
> # ---- Parse command line
> opts = GetoptLong.new(
>   [ "--help", "-h", GetoptLong::NO_ARGUMENT ],
>   [ "--query", "-q", GetoptLong::REQUIRED_ARGUMENT ]
> )
>
> do_help = false
> query=nil
>
> opts.each do | opt, arg |
>   do_help |= (opt == '--help')
>   query = arg if (opt == '--query')
> end
>
> # ---- Print usage
> if (do_help || ARGV.size==0)
>   print usage
>   exit 1
> end
>
> if !query
>   print "Give query: "
>   query = $stdin.gets.chomp
> end
>
> ARGV.each do | fn |
>   $stderr.print "Loading #{fn}..."
>   f = Bio::FlatFile.auto(fn)
>   $stderr.print " detected: #{f.dbclass}\n"
>   f.each_entry do | e |
>     if e.definition =~ /#{query}/
>       print '>',e.definition,e.data
>     end
>   end
> end

From pjotr at pckassa.com Wed Mar 24 01:24:16 2004
From: pjotr at pckassa.com (pjotr@pckassa.com)
Date: Wed Mar 24 01:29:38 2004
Subject: [BioRuby] RegEx search example fasta file
In-Reply-To: <1BBD2286-7D3F-11D8-8E2D-000A95AE7AB4@hgc.jp>
References: <20040321133300.GB26257@team-machine.donck.com> <1BBD2286-7D3F-11D8-8E2D-000A95AE7AB4@hgc.jp>
Message-ID: <20040324062416.GA31443@team-machine.donck.com>

Thanks! I'll have a look and will improve the Wiki to cover that. Pays
off immediately ;-).

Pj.
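
For readers without BioRuby at hand, the regexp search over definition lines discussed in this thread can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the code posted above: `query_to_regexp` and `grep_fasta_definitions` are hypothetical helper names, the parser handles only simple FASTA text rather than the many formats Bio::FlatFile.auto detects, and dropping the surrounding slashes from a query written as '/.../' is a guess at what the usage example intends.

```ruby
require 'stringio'

# Hypothetical helper: turn a command-line query into a Regexp.
# Assumption: a query written as '/.../' is meant as a regex literal,
# so the surrounding slashes are dropped before compiling.
def query_to_regexp(query)
  m = %r{\A/(.*)/\z}m.match(query)
  Regexp.new(m ? m[1] : query)
end

# Hypothetical helper: scan simple FASTA text and return
# [definition, sequence] pairs whose definition line matches regexp.
def grep_fasta_definitions(io, regexp)
  hits = []
  definition = nil
  sequence = []
  # Store the entry collected so far if its definition matches.
  flush = lambda do
    hits << [definition, sequence.join] if definition && regexp.match(definition)
  end
  io.each_line do |line|
    line = line.chomp
    if line.start_with?('>')
      flush.call                 # previous entry is complete
      definition = line[1..-1]   # definition without the leading '>'
      sequence = []
    elsif definition
      sequence << line.strip     # sequence may span several lines
    end
  end
  flush.call                     # last entry in the input
  hits
end

# Made-up sample data, for illustration only.
fasta = <<FASTA
>sp|P1| serine/threonine kinase [Homo sapiens]
MKT
AAV
>sp|P2| hypothetical protein [Mus musculus]
GGS
FASTA

re = query_to_regexp('/([Hh]uman|[Hh]omo sapiens)/')
grep_fasta_definitions(StringIO.new(fasta), re).each do |definition, seq|
  puts ">#{definition}"
  puts seq
end
```

Returning an array of pairs keeps the sketch easy to test; for files the size of nr.fa, yielding each hit to a block instead would avoid holding all matches in memory.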