From georgkam at gmail.com Wed Dec 16 06:37:45 2009
From: georgkam at gmail.com (George Githinji)
Date: Wed, 16 Dec 2009 14:37:45 +0300
Subject: [BioRuby] reading fastq files
Message-ID: <55915f820912160337g3aa1121fv55ded38f8802bd7f@mail.gmail.com>

I am trying out the current bioruby snapshot from github ....
which I have compiled as a local gem version 1.3.1.5000

There seems to be a number of changes and file re-naming from the current
stable release 1.3.1.
How do I parse a fastq file format?
I am getting an error while trying to read a fastq file.

#Read a fastq file
fastq = "/home/george/Assembly_pipeline/data/Sort.caf.fastq"
Bio::Fastq.new(fastq) do |f|
  f.each do |entry|
    puts entry.class
  end
end


Error: /home/george/NetBeansProjects/contig_assembly/lib/assemble_raw_read.rb:6:
no such file to load -- bio/db/fastq (LoadError)

Replacing the above with a call to Bio::FlatFile.auto does not seem to help
either. I have a feeling I am making a stupid mistake somewhere
or doing it the wrong way....
any ideas?


Thank you
George

--
---------------
Sincerely
George

Skype: george_g2
Blog: http://biorelated.wordpress.com/

From ngoto at gen-info.osaka-u.ac.jp Wed Dec 16 12:38:42 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Thu, 17 Dec 2009 02:38:42 +0900
Subject: [BioRuby] reading fastq files
In-Reply-To: <55915f820912160337g3aa1121fv55ded38f8802bd7f@mail.gmail.com>
References: <55915f820912160337g3aa1121fv55ded38f8802bd7f@mail.gmail.com>
Message-ID: <20091217023602.420E.EEF6E030@gen-info.osaka-u.ac.jp>

Hi,

> I am trying out the current bioruby snapshot from github ....
> which I have compiled as a local gem version 1.3.1.5000

In that case, the file bioruby.gemspec needs to be updated:
run "rake regemspec" in your local git repository.
The file bioruby.gemspec contains the list of files to be stored
in the gem, but it isn't frequently updated.

> There seems to be a number of changes and file re-naming from the current
> stable release 1.3.1.
> How do I parse a fastq file format?
> I am getting an error while trying to read a fastq file.
>
> #Read a fastq file
> fastq = "/home/george/Assembly_pipeline/data/Sort.caf.fastq"
> Bio::Fastq.new(fastq) do |f|

Bio::FlatFile.open(fastq) do |f|

Bio::Fastq.new takes a String containing a FASTQ entry, not a filename.

>   f.each do |entry|
>     puts entry.class
>   end
> end
>
>
> Error: /home/george/NetBeansProjects/contig_assembly/lib/assemble_raw_read.rb:6:
> no such file to load -- bio/db/fastq (LoadError)
>
> Replacing the above with a call to Bio::FlatFile.auto does not seem to help
> either. I have a feeling I am making a stupid mistake somewhere
> or doing it the wrong way....
> any ideas?

Please update to the newest snapshot and try again.

>
>
> Thank you
> George
>
>
>
> --
> ---------------
> Sincerely
> George
>
> Skype: george_g2
> Blog: http://biorelated.wordpress.com/
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

--
Naohisa Goto

From georgkam at gmail.com Thu Dec 17 00:25:10 2009
From: georgkam at gmail.com (George Githinji)
Date: Thu, 17 Dec 2009 08:25:10 +0300
Subject: [BioRuby] reading fastq files
Message-ID: <55915f820912162125l5aa6dc7an6e9139191efa56b5@mail.gmail.com>

Hi all,
It seems that the bioruby.gemspec file did not include the following lines
when building the initial gem

"lib/bio/db/fasta/fasta_to_biosequence.rb",
"lib/bio/db/fastq/fastq_to_biosequence.rb",
"lib/bio/db/fastq/format_fastq.rb",
"lib/bio/db/fastq.rb",
"lib/bio/db/sanger_chromatogram/abif.rb",
"lib/bio/db/sanger_chromatogram/chromatogram.rb",
"lib/bio/db/sanger_chromatogram/chromatogram_to_biosequence.rb",
"lib/bio/db/sanger_chromatogram/scf.rb",
"lib/bio/db/phyloxml/phyloxml.xsd",
"lib/bio/db/phyloxml/phyloxml_elements.rb",
"lib/bio/db/phyloxml/phyloxml_parser.rb",
"lib/bio/db/phyloxml/phyloxml_writer.rb",
"lib/bio/sequence/quality_score.rb",
Upon adding the above lines to the bioruby.gemspec and rebuilding the gem,
the functionality is now available.

>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 16 Dec 2009 14:37:45 +0300
> From: George Githinji
> Subject: [BioRuby] reading fastq files
> To: bioruby at lists.open-bio.org
> Message-ID:
>        <55915f820912160337g3aa1121fv55ded38f8802bd7f at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> I am trying out the current bioruby snapshot from github ....
> which I have compiled as a local gem version 1.3.1.5000
>
> There seems to be a number of changes and file re-naming from the current
> stable release 1.3.1.
> How do I parse a fastq file format?
> I am getting an error while trying to read a fastq file.
>
> #Read a fastq file
> fastq = "/home/george/Assembly_pipeline/data/Sort.caf.fastq"
> Bio::Fastq.new(fastq) do |f|
>   f.each do |entry|
>     puts entry.class
>   end
> end
>
>
> Error:
> /home/george/NetBeansProjects/contig_assembly/lib/assemble_raw_read.rb:6:
> no such file to load -- bio/db/fastq (LoadError)
>
> Replacing the above with a call to Bio::FlatFile.auto does not seem to help
> either. I have a feeling I am making a stupid mistake somewhere
> or doing it the wrong way....
> any ideas?
>
>
> Thank you
> George
>
>
>
> --
> ---------------
> Sincerely
> George
>
> Skype: george_g2
> Blog: http://biorelated.wordpress.com/
>
>
> ------------------------------
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>
>
> End of BioRuby Digest, Vol 51, Issue 1
> **************************************
>

--
---------------
Sincerely
George

Skype: george_g2
Blog: http://biorelated.wordpress.com/

From georgkam at gmail.com Fri Dec 18 01:37:41 2009
From: georgkam at gmail.com (George Githinji)
Date: Fri, 18 Dec 2009 09:37:41 +0300
Subject: [BioRuby] Parsing CAF files(Common Assembly file format)
Message-ID: <55915f820912172237o46a7e59ane41546dc7a8f7f78@mail.gmail.com>

Hi,
Is there a parser for the Common Assembly File format in Ruby?
(http://www.sanger.ac.uk/resources/software/caf.html)
Anyone working on it?

Thank you.
George

From pjotr.public14 at thebird.nl Sun Dec 27 11:07:47 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sun, 27 Dec 2009 17:07:47 +0100
Subject: [BioRuby] Parsing ClustalW files
Message-ID: <20091227160747.GA7908@thebird.nl>

On my ALN branch (http://github.com/pjotrp/bioruby/tree/ALN) I have
added a unit test for ClustalW ALN format, as well as an update to the
tutorial.

I have three comments. First I think the alignment parser belongs in
./lib/bio/db/clustalw.rb, rather than in ./lib/app/clustalw/report.rb.
I can see how that originated, but it is an independent database
format. This should also change the constructor call to, for example,
Bio::ClustalWFormat.new, analogous to FastaFormat. As ClustalW files
are ubiquitous we may want to rename this to an ALN format.

Second, I added an index method [], to Bio::ClustalW::Report, so I can
refetch a Bio::Sequence object *with* the ID/definition (see below).
However it may be more appropriate to have this shared at the
Bio::Alignment level.
If you have a better way, I am all ears.

  bioruby> aln = Bio::ClustalW::Report.new(File.new('../test/data/clustalw/example1.aln').readlines.join)
  bioruby> aln.header
  ==> "CLUSTAL 2.0.9 multiple sequence alignment"

Fetch a sequence

  bioruby> seq = aln[1]
  bioruby> seq.definition
  ==> "gi|115023|sp|P10425|"

Get the partial sequences

  bioruby> seq.to_s[60..120]
  ==> "LGYFNG-EAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTDVIITHAHAD"

Show the full alignment residue match information for the sequences in the set

  bioruby> aln.match_line[60..120]
  ==> " . **. . .. ::*: . * : : . .: .* * *"

Return a Bio::Alignment object

  bioruby> aln.alignment.consensus[60..120]
  ==> "???????????SN?????????????D??????????L??????????????????H?H?D"

I also kinda disagree with the implementation of the current parser
(Report). It has virtually no checking for bad input data, and it
should accept an array of lines in addition to a String.

Was that three comments already? ;)

Happy new year to everyone, and let 2010 be a strong year for BioRuby
and friends!

Pj.

From ngoto at gen-info.osaka-u.ac.jp Mon Dec 28 10:26:52 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 29 Dec 2009 00:26:52 +0900
Subject: [BioRuby] Parsing ClustalW files
In-Reply-To: <20091227160747.GA7908@thebird.nl>
References: <20091227160747.GA7908@thebird.nl>
Message-ID: <20091228152653.1EEE01CBC4A1@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Sun, 27 Dec 2009 17:07:47 +0100
Pjotr Prins wrote:

> On my ALN branch (http://github.com/pjotrp/bioruby/tree/ALN) I have
> added a unit test for ClustalW ALN format, as well as an update to the
> tutorial.
>
> I have three comments. First I think the alignment parser belongs in
> ./lib/bio/db/clustalw.rb, rather than in ./lib/app/clustalw/report.rb.
> I can see how that originated, but it is an independent database
> format. This should also change the constructor call to, for example,
> Bio::ClustalWFormat.new, analogous to FastaFormat.
> As ClustalW files
> are ubiquitous we may want to rename this to an ALN format.

I think it is good to follow EMBOSS's naming rule.
In EMBOSS, the format names are "clustal" or "aln".
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

By the way, it is interesting that Clustal format isn't
described in the EMBOSS alignment formats
(http://emboss.sourceforge.net/docs/themes/AlignFormats.html).

> Second, I added an index method [], to Bio::ClustalW::Report, so I can
> refetch a Bio::Sequence object *with* the ID/definition (see below).
> However it may be more appropriate to have this shared at the
> Bio::Alignment level. If you have a better way, I am all ears.

There are no methods that return a Bio::Sequence object because the
ClustalW parser and Bio::Alignment were first written before
Bio::Sequence was improved. It would be good to write methods
returning Bio::Sequence object(s) for the ClustalW parser.

Bio::Alignment is a container class, and I'm still looking for
better ways to store sequences and other information.
Any suggestions are welcome.

> bioruby> aln = Bio::ClustalW::Report.new(File.new('../test/data/clustalw/example1.aln').readlines.join)

I think using File.read("...") is better, instead of
File.new("...").readlines.join.

> bioruby> aln.header
> ==> "CLUSTAL 2.0.9 multiple sequence alignment"
>
> Fetch a sequence
>
> bioruby> seq = aln[1]
> bioruby> seq.definition
> ==> "gi|115023|sp|P10425|"
>
> Get the partial sequences
>
> bioruby> seq.to_s[60..120]
> ==> "LGYFNG-EAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTDVIITHAHAD"
>
> Show the full alignment residue match information for the sequences in the set
>
> bioruby> aln.match_line[60..120]
> ==> " . **. . .. ::*: . * : : . .: .* * *"
>
> Return a Bio::Alignment object
>
> bioruby> aln.alignment.consensus[60..120]
> ==> "???????????SN?????????????D??????????L??????????????????H?H?D"
>
> I also kinda disagree with the implementation of the current parser
> (Report).
> It has virtually no checking for bad input data,

That is because there is no strict format definition and no detailed
documentation, so it is hard to distinguish what is really "bad".
In addition, when I implemented the parser, I thought it was good to be
able to salvage data from a broken or incomplete file rather than to
report an error and stop parsing.

> and it
> should accept an array of lines in addition to a String.

I don't think so, because accepting two different data types would
make things complicated, and make it harder to implement parsers.

> Was that three comments already? ;)
>
> Happy new year to everyone, and let 2010 be a strong year for BioRuby
> and friends!
>
> Pj.
>

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From ngoto at gen-info.osaka-u.ac.jp Tue Dec 29 04:45:20 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 29 Dec 2009 18:45:20 +0900
Subject: [BioRuby] BioRuby 1.4.0 is released
Message-ID: <20091229094520.D1C4B1CBC436@idnmail.gen-info.osaka-u.ac.jp>

Hi all,

We are pleased to announce the release of BioRuby 1.4.0.
The archive is available at:

  http://bioruby.org/archive/bioruby-1.4.0.tar.gz

Web page:
  http://bioruby.org/
  http://bioruby.open-bio.org/

API documentation:
  http://bioruby.org/rdoc/

Bug report:
  http://rubyforge.org/projects/bioruby/

We also put the RubyGems package at RubyForge and Gemcutter.
You can easily install it by using RubyGems. First, check
the version number by using the search command:

  % gem search --remote bio

and find "bio (1.4.0)" in the list. Then,

  % sudo gem install bio

You can also obtain the bioruby gem file from bioruby.org.

  http://bioruby.org/archive/gems/bio-1.4.0.gem

Here is a brief summary of changes.

= PhyloXML support

Support for reading and writing PhyloXML file format is added,
developed by Diana Jaunzeikare, mentored by Christian M Zmasek and
co-mentors, supported by Google Summer of Code 2009 in collaboration
with the National Evolutionary Synthesis Center (NESCent).
= FASTQ file format support

Support for reading and writing FASTQ file format is added.
All of the three FASTQ format variants are supported.
The code is written by Naohisa Goto, with the help of discussions in
the open-bio-l mailing list. The prototype of Bio::Fastq class was
first developed during the BioHackathon 2009 held in Okinawa.

= DNA chromatogram support

Support for reading DNA chromatogram files is added. SCF and
ABIF file formats are supported. The code is developed by
Anthony Underwood.

= MEME (motif-based sequence analysis tools) support

Support for running MAST (Motif Alignment & Search Tool, part of
the MEME Suite, motif-based sequence analysis tools) and parsing its
results is added. The code is developed by Adam Kraut.

= Improvement of KEGG parser classes

Some new methods are added to parse new fields added to some KEGG file
formats. Unit tests for KEGG parsers are also added and improved.

= Many sample scripts are added

Many sample scripts demonstrating the usage of classes are added.
They were originally primitive test codes written in
the "if __FILE__ == $0" convention.

= Unit tests can test installed BioRuby

The mechanism to load the library and to find test data in the unit
tests is changed, and the target library path and test data path can
be changed with environment variables.

In addition, many changes have been made, including incompatible
changes. For more information, see RELEASE_NOTES.rdoc and ChangeLog
in the release archive or at:

  http://github.com/bioruby/bioruby/blob/1.4.0/RELEASE_NOTES.rdoc
  http://github.com/bioruby/bioruby/blob/1.4.0/ChangeLog

Hope you enjoy.
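To illustrate what "three FASTQ format variants" means in practice, here is a minimal sketch in plain Ruby. It assumes the variants are the ones usually called fastq-sanger (Phred scores, ASCII offset 33), fastq-solexa (Solexa scores, offset 64) and fastq-illumina (Phred scores, offset 64). It does not use or reproduce the Bio::Fastq API; the names OFFSETS, solexa_to_phred and phred_scores are made up for this example.

```ruby
# Illustrative only: decode a FASTQ quality string into Phred scores.
# The offsets below are the commonly documented ones for each variant
# (an assumption of this sketch, not taken from the BioRuby source).
OFFSETS = { sanger: 33, solexa: 64, illumina: 64 }.freeze

# Solexa scores use a different log-odds scale; map them onto Phred.
def solexa_to_phred(q)
  (10 * Math.log10(10**(q / 10.0) + 1)).round
end

# Returns an Array of Integer Phred scores, one per quality character.
def phred_scores(quality_string, variant)
  offset = OFFSETS.fetch(variant)   # raises KeyError for unknown variants
  raw = quality_string.each_byte.map { |byte| byte - offset }
  variant == :solexa ? raw.map { |q| solexa_to_phred(q) } : raw
end

p phred_scores("II5+!", :sanger)    # => [40, 40, 20, 10, 0]
p phred_scores("hT@", :illumina)    # => [40, 20, 0]
p phred_scores("h@;", :solexa)      # => [40, 3, 1]
```

The same character therefore means different things depending on the variant, which is why a reader has to be told (or has to guess) which one a file uses.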
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From pjotr.public14 at thebird.nl Tue Dec 29 09:00:49 2009
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 29 Dec 2009 15:00:49 +0100
Subject: [BioRuby] Parsing ClustalW files
In-Reply-To: <20091228152653.1EEE01CBC4A1@idnmail.gen-info.osaka-u.ac.jp>
References: <20091227160747.GA7908@thebird.nl>
	<20091228152653.1EEE01CBC4A1@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20091229140049.GA917@thebird.nl>

Hi Naohisa,

Thanks for the reply. I can improve my proposal if you help decide on
the points below.

On Tue, Dec 29, 2009 at 12:26:52AM +0900, Naohisa GOTO wrote:
> I think it is good to follow EMBOSS's naming rule.
> In EMBOSS, the format names are "clustal" or "aln".
> http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

So, should we move the existing app/clustalw/report.rb to
db/clustal.rb? Or do we start a new db/clustal.rb - retaining the old
one for compatibility? Or do we just merge in my small changes?

> By the way, it is interesting that Clustal format isn't
> described in the EMBOSS alignment formats
> (http://emboss.sourceforge.net/docs/themes/AlignFormats.html).

It is a bit of a non-standard, I guess. Clustal and Muscle use it. It
is mostly nice for checking alignments in a text file. I need it to
send to other people. It is standard enough.

> > Second, I added an index method [], to Bio::ClustalW::Report, so I can
> > refetch a Bio::Sequence object *with* the ID/definition (see below).
> > However it may be more appropriate to have this shared at the
> > Bio::Alignment level. If you have a better way, I am all ears.
>
> There are no methods that return a Bio::Sequence object because the
> ClustalW parser and Bio::Alignment were first written before
> Bio::Sequence was improved. It would be good to write methods
> returning Bio::Sequence object(s) for the ClustalW parser.
>
> Bio::Alignment is a container class, and I'm still looking for
> better ways to store sequences and other information.
> Any suggestions are welcome.

I think using [] is a good way to fetch Bio::Sequence objects from the
alignment. Also we could introduce an 'each_sequence' iterator,
returning Bio::Sequence, though 'each' itself would be more
consistent. The latter will break things, perhaps, for users.

The current alignment class stores keys and sequences separately. I
guess a list of Bio::Sequence would be more consistent.

Maybe we can discuss deeper design issues soon. I have some opinions
which are better vented in a round table.

> > bioruby> aln = Bio::ClustalW::Report.new(File.new('../test/data/clustalw/example1.aln').readlines.join)
>
> I think using File.read("...") is better, instead of
> File.new("...").readlines.join.

OK

> That is because there is no strict format definition and no detailed
> documentation, so it is hard to distinguish what is really "bad".
> In addition, when I implemented the parser, I thought it was good to be
> able to salvage data from a broken or incomplete file rather than to
> report an error and stop parsing.

Hmmm. I think Bioruby should throw an exception when hitting bad data.
In this case it is easy - see the example I sent two days ago. It just
tests the indentation. Any data failure would be signalled.

> > and it
> > should accept an array of lines in addition to a String.
>
> I don't think so, because accepting two different data types would
> make things complicated, and make it harder to implement parsers.

Well, we may want to change this. The current edition takes a string,
and its first action is to split it into an Array. I have to do an
Array.join to pass it in. That is two actions too many. This can get
bad with big data.

I also think the current parsers should *not* be string based. For
this example it is not a problem (ALN files are generally small), but
the only way to get rid of memory use issues is splitting the data.
I think it is not a good idea to have 1 Gb strings. Anyway, based on above I think my current ALN patch is acceptable - apart from the documentation. Pj. From pjotr.public14 at thebird.nl Thu Dec 31 09:15:46 2009 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Thu, 31 Dec 2009 15:15:46 +0100 Subject: [BioRuby] Codeml parser Message-ID: <20091231141546.GA5770@thebird.nl> Hi Michael, I have a writeup on improving the current PAML functionality. Are you OK with this? http://bioruby.open-bio.org/wiki/BIORUBY_PAML (maybe it does not belong on the bioruby Wiki - but I think of it like a 'design' document). Pj. From georgkam at gmail.com Wed Dec 16 11:37:45 2009 From: georgkam at gmail.com (George Githinji) Date: Wed, 16 Dec 2009 14:37:45 +0300 Subject: [BioRuby] reading fastq files Message-ID: <55915f820912160337g3aa1121fv55ded38f8802bd7f@mail.gmail.com> I am trying out the current bioruby snapshot from github .... which i have compiled as a local gem version 1.3.1.5000 There seems to be a number of changes and file re-naming from the current stable release 1.3.1. How do i parse a fastq file format? I am getting an error while trying to read a fastq file. #Read a fastq file fastq = "/home/george/Assembly_pipeline/data/Sort.caf.fastq" Bio::Fastq.new(fastq) do |f| f.each do |entry| puts entry.class end end Error: /home/george/NetBeansProjects/contig_assembly/lib/assemble_raw_read.rb:6: no such file to load -- bio/db/fastq (LoadError) Replacing the above with a call to Bio::FlatFile.auto does not seem to help either. I have feeling am making a stupid mistake somewhere or doing it the wrong way.... any ideas? 
Thank you George -- --------------- Sincerely George Skype: george_g2 Blog: http://biorelated.wordpress.com/ From ngoto at gen-info.osaka-u.ac.jp Wed Dec 16 17:38:42 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto) Date: Thu, 17 Dec 2009 02:38:42 +0900 Subject: [BioRuby] reading fastq files In-Reply-To: <55915f820912160337g3aa1121fv55ded38f8802bd7f@mail.gmail.com> References: <55915f820912160337g3aa1121fv55ded38f8802bd7f@mail.gmail.com> Message-ID: <20091217023602.420E.EEF6E030@gen-info.osaka-u.ac.jp> Hi, > I am trying out the current bioruby snapshot from github .... > which i have compiled as a local gem version 1.3.1.5000 In that case, updating of the file bioruby.gemspec would be needed, "rake regemspec" in your local git repository. The file bioruby.gemspec contains the list of files to be stored to the gem, but it isn't frequently updated. > There seems to be a number of changes and file re-naming from the current > stable release 1.3.1. > How do i parse a fastq file format? > I am getting an error while trying to read a fastq file. > > #Read a fastq file > fastq = "/home/george/Assembly_pipeline/data/Sort.caf.fastq" > Bio::Fastq.new(fastq) do |f| Bio::FlatFile.open(fastq) do |f| Bio::Fastq.new takes a String of FASTQ entry, not filename. > f.each do |entry| > puts entry.class > end > end > > > Error: /home/george/NetBeansProjects/contig_assembly/lib/assemble_raw_read.rb:6: > no such file to load -- bio/db/fastq (LoadError) > > Replacing the above with a call to Bio::FlatFile.auto does not seem to help > either. I have feeling am making a stupid mistake somewhere > or doing it the wrong way.... > any ideas? Please update to the newest snapshot and try again. 
> > > Thank you > George > > > > -- > --------------- > Sincerely > George > > Skype: george_g2 > Blog: http://biorelated.wordpress.com/ > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby -- Naohisa Goto From georgkam at gmail.com Thu Dec 17 05:25:10 2009 From: georgkam at gmail.com (George Githinji) Date: Thu, 17 Dec 2009 08:25:10 +0300 Subject: [BioRuby] reading fastq files Message-ID: <55915f820912162125l5aa6dc7an6e9139191efa56b5@mail.gmail.com> Hi all, It seems that the bioruby.gemspec file did not include the following lines when building the initial gem "lib/bio/db/fasta/fasta_to_biosequence.rb", "lib/bio/db/fastq/fastq_to_biosequence.rb", "lib/bio/db/fastq/format_fastq.rb", "lib/bio/db/fastq.rb", "lib/bio/db/sanger_chromatogram/abif.rb", "lib/bio/db/sanger_chromatogram/chromatogram.rb", "lib/bio/db/sanger_chromatogram/chromatogram_to_biosequence.rb", "lib/bio/db/sanger_chromatogram/scf.rb", "lib/bio/db/phyloxml/phyloxml.xsd", "lib/bio/db/phyloxml/phyloxml_elements.rb", "lib/bio/db/phyloxml/phyloxml_parser.rb", "lib/bio/db/phyloxml/phyloxml_writer.rb", "lib/bio/sequence/quality_score.rb", Upon adding the above lines to the bioruby.gemspec and rebuilding the gem, the functionality is now available. > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 16 Dec 2009 14:37:45 +0300 > From: George Githinji > Subject: [BioRuby] reading fastq files > To: bioruby at lists.open-bio.org > Message-ID: > <55915f820912160337g3aa1121fv55ded38f8802bd7f at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > I am trying out the current bioruby snapshot from github .... > which i have compiled as a local gem version 1.3.1.5000 > > There seems to be a number of changes and file re-naming from the current > stable release 1.3.1. > How do i parse a fastq file format? 
> I am getting an error while trying to read a fastq file. > > #Read a fastq file > fastq = "/home/george/Assembly_pipeline/data/Sort.caf.fastq" > Bio::Fastq.new(fastq) do |f| > f.each do |entry| > puts entry.class > end > end > > > Error: > /home/george/NetBeansProjects/contig_assembly/lib/assemble_raw_read.rb:6: > no such file to load -- bio/db/fastq (LoadError) > > Replacing the above with a call to Bio::FlatFile.auto does not seem to help > either. I have feeling am making a stupid mistake somewhere > or doing it the wrong way.... > any ideas? > > > Thank you > George > > > > -- > --------------- > Sincerely > George > > Skype: george_g2 > Blog: http://biorelated.wordpress.com/ > > > ------------------------------ > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > > > End of BioRuby Digest, Vol 51, Issue 1 > ************************************** > -- --------------- Sincerely George Skype: george_g2 Blog: http://biorelated.wordpress.com/ From georgkam at gmail.com Fri Dec 18 06:37:41 2009 From: georgkam at gmail.com (George Githinji) Date: Fri, 18 Dec 2009 09:37:41 +0300 Subject: [BioRuby] Parsing CAF files(Common Assembly file format) Message-ID: <55915f820912172237o46a7e59ane41546dc7a8f7f78@mail.gmail.com> Hi, Are there ways or a parser for Common Assembly File format in ruby?( http://www.sanger.ac.uk/resources/software/caf.html) Anyone working on it? Thank you. George From pjotr.public14 at thebird.nl Sun Dec 27 16:07:47 2009 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Sun, 27 Dec 2009 17:07:47 +0100 Subject: [BioRuby] Parsing ClustalW files Message-ID: <20091227160747.GA7908@thebird.nl> On my ALN branch (http://github.com/pjotrp/bioruby/tree/ALN) I have added a unit test for ClustalW ALN format, as well as an update to the tutorial. I have three comments. 
First I think the alignment parser belong in ./lib/bio/db/clustalw.rb, rather than in ./lib/app/clustalw/report.rb. I can see how that originated, but it is an independent database format. This should also change the constructor call to, for example, Bio::ClustalWFormat.new, analogues to FastaFormat. Als ClustalW files are ubiquous we may want to rename this to an ALN format. Second, I added an index method [], to Bio::ClustalW::Report, so I can refetch a Bio::Sequence object *with* the ID/definition (see below). However it may be more appropriate to have this shared at the Bio::Alignment level. If you have a better way, I am all ears. bioruby> aln = Bio::ClustalW::Report.new(File.new('../test/data/clustalw/example1.aln').readlines.join) bioruby> aln.header ==> "CLUSTAL 2.0.9 multiple sequence alignment" Fetch a sequence bioruby> seq = aln[1] bioruby> seq.definition ==> "gi|115023|sp|P10425|" Get the partial sequences bioruby> seq.to_s[60..120] ==> "LGYFNG-EAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTDVIITHAHAD" Show the full alignment residue match information for the sequences in the set bioruby> aln.match_line[60..120] ==> " . **. . .. ::*: . * : : . .: .* * *" Return a Bio::Alignment object bioruby> aln.alignment.consensus[60..120] ==> "???????????SN?????????????D??????????L??????????????????H?H?D" I also kinda disagree with the implementation of the current parser (Report). It has virtually no checking for bad input data, and it should accept an array of lines in addition to a String. Was that three comments already? ;) Happy new year to everyone, and let 2010 be a strong year for BioRuby and friends! Pj. 
From ngoto at gen-info.osaka-u.ac.jp Mon Dec 28 15:26:52 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Tue, 29 Dec 2009 00:26:52 +0900 Subject: [BioRuby] Parsing ClustalW files In-Reply-To: <20091227160747.GA7908@thebird.nl> References: <20091227160747.GA7908@thebird.nl> Message-ID: <20091228152653.1EEE01CBC4A1@idnmail.gen-info.osaka-u.ac.jp> Hi, On Sun, 27 Dec 2009 17:07:47 +0100 Pjotr Prins wrote: > On my ALN branch (http://github.com/pjotrp/bioruby/tree/ALN) I have > added a unit test for ClustalW ALN format, as well as an update to the > tutorial. > > I have three comments. First I think the alignment parser belong in > ./lib/bio/db/clustalw.rb, rather than in ./lib/app/clustalw/report.rb. > I can see how that originated, but it is an independent database > format. This should also change the constructor call to, for example, > Bio::ClustalWFormat.new, analogues to FastaFormat. Als ClustalW files > are ubiquous we may want to rename this to an ALN format. I think it is good to follow EMBOSS's naming rule. In EMBOSS, the format names are "clustal" or "aln". http://emboss.sourceforge.net/docs/themes/SequenceFormats.html By the way, it is interesting that Clustal format isn't described in the EMBOSS alignment formats (http://emboss.sourceforge.net/docs/themes/AlignFormats.html). > Second, I added an index method [], to Bio::ClustalW::Report, so I can > refetch a Bio::Sequence object *with* the ID/definition (see below). > However it may be more appropriate to have this shared at the > Bio::Alignment level. If you have a better way, I am all ears. Why no methods that return a Bio::Sequence object is because the ClustalW parser and Bio::Alginment were first written before Bio::Sequence have been improved. It is good to write methods returning Bio::Sequence object(s) for ClustalW parser. Bio::Alginment is a container class, and I'm still seeking what are better ways to store sequences and other information. Any suggestions are welcomed. 
> bioruby> aln = Bio::ClustalW::Report.new(File.new('../test/data/clustalw/example1.aln').readlines.join) I think using File.read("...") is better, instead of File.new("...").readlines.join. > bioruby> aln.header > ==> "CLUSTAL 2.0.9 multiple sequence alignment" > > Fetch a sequence > > bioruby> seq = aln[1] > bioruby> seq.definition > ==> "gi|115023|sp|P10425|" > > Get the partial sequences > > bioruby> seq.to_s[60..120] > ==> "LGYFNG-EAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTDVIITHAHAD" > > Show the full alignment residue match information for the sequences in the set > > bioruby> aln.match_line[60..120] > ==> " . **. . .. ::*: . * : : . .: .* * *" > > Return a Bio::Alignment object > > bioruby> aln.alignment.consensus[60..120] > ==> "???????????SN?????????????D??????????L??????????????????H?H?D" > > I also kinda disagree with the implementation of the current parser > (Report). It has virtually no checking for bad input data, Because no strict format definition and no detailed documents, and it is hard to distinguish what is really "bad". In addition, when I implemented the parser, I thoght it was good to be able to salvage data from broken or incomplete format rather than to report error and to stop parsing. > and it > should accept an array of lines in addition to a String. I don't think so, because to accept two differenct data types would make things complicated, and make harder to implement parsers. > Was that three comments already? ;) > > Happy new year to everyone, and let 2010 be a strong year for BioRuby > and friends! > > Pj. > Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From ngoto at gen-info.osaka-u.ac.jp Tue Dec 29 09:45:20 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Tue, 29 Dec 2009 18:45:20 +0900 Subject: [BioRuby] BioRuby 1.4.0 is released Message-ID: <20091229094520.D1C4B1CBC436@idnmail.gen-info.osaka-u.ac.jp> Hi all, We are pleased to announce the release of BioRuby 1.4.0. 
The archive is available at: http://bioruby.org/archive/bioruby-1.4.0.tar.gz Web page: http://bioruby.org/ http://bioruby.open-bio.org/ API documentation: http://bioruby.org/rdoc/ Bug report: http://rubyforge.org/projects/bioruby/ We also put RubyGems pacakge at RubyForge and Gemcutter. You can easily install by using RubyGems. First, check the version number by using search command: % gem search --remote bio and find "bio (1.4.0)" in the list. Then, % sudo gem install bio You can also obtain bioruby gem file from bioruby.org. http://bioruby.org/archive/gems/bio-1.4.0.gem Here is a brief summary of changes. = PhyloXML support Support for reading and writing PhyloXML file format is added, developed by Diana Jaunzeikare, mentored by Christian M Zmasek and co-mentors, supported by Google Summer of Code 2009 in collaboration with the National Evolutionary Synthesis Center (NESCent). = FASTQ file format support Support for reading and writing FASTQ file format is added. All of the three FASTQ format variants are supported. The code is written by Naohisa Goto, with the help of discussions in the open-bio-l mailing list. The prototype of Bio::Fastq class was first developed during the BioHackathon 2009 held in Okinawa. = DNA chromatogram support Support for reading DNA chromatogram files are added. SCF and ABIF file formats are supported. The code is developed by Anthony Underwood. = MEME (motif-based sequence analysis tools) support Support for running MAST (Motif Aliginment & Search Tool, part of the MEME Suite, motif-based sequence analysis tools) and parsing its results are added. The code is developed by Adam Kraut. = Improvement of KEGG parser classes Some new methods are added to parse new fields added to some KEGG file formats. Unit tests for KEGG parsers are also added and improved. = Many sample scripts are added Many sample scripts showing demonstrations of usages of classes are added. 
They were originally primitive test code written in the "if __FILE__ == $0" convention. = Unit tests can test installed BioRuby The mechanism to load the library and to find test data in the unit tests is changed, and the target library path and test data path can be changed with environment variables. In addition, many changes have been made, including incompatible changes. For more information, see RELEASE_NOTES.rdoc and ChangeLog in the release archive or at: http://github.com/bioruby/bioruby/blob/1.4.0/RELEASE_NOTES.rdoc http://github.com/bioruby/bioruby/blob/1.4.0/ChangeLog Hope you enjoy. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From pjotr.public14 at thebird.nl Tue Dec 29 14:00:49 2009 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 29 Dec 2009 15:00:49 +0100 Subject: [BioRuby] Parsing ClustalW files In-Reply-To: <20091228152653.1EEE01CBC4A1@idnmail.gen-info.osaka-u.ac.jp> References: <20091227160747.GA7908@thebird.nl> <20091228152653.1EEE01CBC4A1@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <20091229140049.GA917@thebird.nl> Hi Naohisa, Thanks for the reply. I can improve my proposal if you help decide on the points below. On Tue, Dec 29, 2009 at 12:26:52AM +0900, Naohisa GOTO wrote: > I think it is good to follow EMBOSS's naming rule. > In EMBOSS, the format names are "clustal" or "aln". > http://emboss.sourceforge.net/docs/themes/SequenceFormats.html So, should we move the existing app/clustalw/report.rb to db/clustal.rb? Or do we start a new db/clustal.rb - retaining the old one for compatibility? Or do we just merge in my small changes? > By the way, it is interesting that Clustal format isn't > described in the EMBOSS alignment formats > (http://emboss.sourceforge.net/docs/themes/AlignFormats.html). It is a bit of a non-standard, I guess. Clustal and Muscle use it. It is mostly nice for checking alignments in a text file. I need it to send to other people. It is standard enough.
> > Second, I added an index method [], to Bio::ClustalW::Report, so I can > > refetch a Bio::Sequence object *with* the ID/definition (see below). > > However it may be more appropriate to have this shared at the > > Bio::Alignment level. If you have a better way, I am all ears. > > Why no methods that return a Bio::Sequence object is because the > ClustalW parser and Bio::Alignment were first written before > Bio::Sequence have been improved. It is good to write methods > returning Bio::Sequence object(s) for ClustalW parser. > > Bio::Alignment is a container class, and I'm still seeking > what are better ways to store sequences and other information. > Any suggestions are welcomed. I think using [] is a good way to fetch Bio::Sequence objects from the alignment. Also we could introduce an 'each_sequence' iterator, returning Bio::Sequence, though 'each' itself would be more consistent. The latter will break things, perhaps, for users. The current alignment class stores keys and sequences separately. I guess a list of Bio::Sequence would be more consistent. Maybe we can discuss deeper design issues soon. I have some opinions which are better vented in a round table. > > bioruby> aln = Bio::ClustalW::Report.new(File.new('../test/data/clustalw/example1.aln').readlines.join) > > I think using File.read("...") is better, instead of > File.new("...").readlines.join. OK > Because there is no strict format definition and no detailed documentation, > it is hard to distinguish what is really "bad". In addition, when I > implemented the parser, I thought it was good to be able to salvage > data from a broken or incomplete format rather than to report an error > and stop parsing. Hmmm. I think BioRuby should throw an exception when hitting bad data. In this case it is easy - see the example I sent two days ago. It just tests the indentation. Any data failure would be signalled. > > and it > > should accept an array of lines in addition to a String.
> > I don't think so, because accepting two different data types would > make things complicated, and make it harder to implement parsers. Well, we may want to change this. The current implementation takes a string, and its first action is to split it into an Array. I have to do an Array.join to pass it in. That is two actions too many. This can get bad with big data. I also think the current parsers should *not* be string based. For this example it is not a problem (ALN files are generally small), but the only way to get rid of memory use issues is splitting the data. I think it is not a good idea to have 1 GB strings. Anyway, based on the above I think my current ALN patch is acceptable - apart from the documentation. Pj. From pjotr.public14 at thebird.nl Thu Dec 31 14:15:46 2009 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Thu, 31 Dec 2009 15:15:46 +0100 Subject: [BioRuby] Codeml parser Message-ID: <20091231141546.GA5770@thebird.nl> Hi Michael, I have a writeup on improving the current PAML functionality. Are you OK with this? http://bioruby.open-bio.org/wiki/BIORUBY_PAML (maybe it does not belong on the bioruby Wiki - but I think of it like a 'design' document). Pj.
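Editor's illustration for the "Parsing ClustalW files" thread above: a minimal sketch of the two ideas argued there, namely a parser that consumes input line by line (so a String, an IO or an Array of lines all work without a join/split round trip) and that raises an exception on malformed data instead of salvaging it. The module name ClustalSketch, its FormatError class and the returned Hash layout are hypothetical and exist only for this example. This is not Bio::ClustalW::Report, and it is not the patch discussed in the thread; the validation rules are assumptions about what a simple ALN file looks like.

```ruby
# Hypothetical sketch, not part of BioRuby.
module ClustalSketch
  class FormatError < StandardError; end

  # Accepts anything that yields lines: a String or IO (each_line)
  # or an Array of lines (each). Nothing is joined into one big String.
  def self.parse(source)
    lines = source.respond_to?(:each_line) ? source.each_line : source.each
    header = nil
    seqs = {}

    lines.each_with_index do |raw, i|
      line = raw.chomp
      next if line.strip.empty?          # blank line between blocks

      if header.nil?
        unless line.start_with?("CLUSTAL")
          raise FormatError, "line #{i + 1}: missing CLUSTAL header"
        end
        header = line
        next
      end

      next if line.start_with?(" ")      # indented match line ("*", ":", ".")

      # Assumed layout: "<id> <residues> [optional running count]"
      fields = line.split
      unless (2..3).cover?(fields.size) && fields[1] =~ /\A[A-Za-z*\-]+\z/
        raise FormatError, "line #{i + 1}: malformed sequence line"
      end
      seqs[fields[0]] = (seqs[fields[0]] || "") + fields[1]
    end

    raise FormatError, "no CLUSTAL header found" if header.nil?
    { :header => header, :sequences => seqs }
  end
end

# Usage: the same call works for a String and for an Array of lines.
text = "CLUSTAL 2.0.9 multiple sequence alignment\n\n" \
       "seq1  ACGT-A\nseq2  ACGTTA\n      **** *\n"
from_string = ClustalSketch.parse(text)
from_array  = ClustalSketch.parse(text.lines.to_a)
puts from_string[:header]
puts from_array[:sequences].inspect
```

Passing File.open(path) instead of a String would let the same method read a large file one line at a time, which is the memory argument made in the thread.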