From ngoto at gen-info.osaka-u.ac.jp Tue Aug 4 08:03:31 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 4 Aug 2009 21:03:31 +0900
Subject: [BioRuby] how to convert sequence to fasta format with header information?
In-Reply-To: <4057d3bf0907301504i21d74c8dk6cccf6833476dcd6@mail.gmail.com>
References: <4057d3bf0907301504i21d74c8dk6cccf6833476dcd6@mail.gmail.com>
Message-ID: <20090804120333.198301CBC53D@idnmail.gen-info.osaka-u.ac.jp>

Hi,

It seems the documentation is wrong. Bio::Sequence#to_fasta is deprecated,
but Bio::Sequence::NA#to_fasta, Bio::Sequence::AA#to_fasta, and
Bio::Sequence::Generic#to_fasta are NOT deprecated.

Currently, Bio::Sequence#output is defined, but there is no
Bio::Sequence::NA#output (nor AA#output or Generic#output).

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Thu, 30 Jul 2009 18:04:59 -0400
Diana Jaunzeikare wrote:

> Hi all,
>
> I want to retrieve sequence from a pdb file and save it in fasta format
> where *header holds the pdb entry id*. This is how I did it:
>
> file = File.new('1OOP.pdb').gets(nil)
> structure = Bio::PDB.new(file)
> seq = structure.seqres['A']
> puts seq.to_fasta("1OOP", 70)
>
> it works and produces result i want:
>
> #>1OOP
> #GPPGEVMGRAIARVADTIGSGPVNSESIPALTAAETGHTSQVVPSDTMQTRHVKNYHSRSESTVENFLCR
> #SACVFYTTYENHDSDGDNFAYWVINTRQVAQLRRKLEMFTYARFDLELTFVITSTQEQPTVRGQDAPVLT
> #HQIMYVPPGGPVPTKVNSYSWQTSTNPSVFWTEGSAPPRMSVPFIGIGNAYSMFYDGWARFDKQGTYGIS
> #TLNNMGTLYMRHVNDGGPGPIVSTVRIYFKPKHVKTWVPRPPRLCQYQKAGNVNFEPTGVTEGRTDITTM
> #KTT
>
> However, according to documentation Bio::Sequence::Common#to_fasta is a
> deprecated method and it suggests to use Bio::Sequence#output, but when I
> modify code to
>
> puts seq.output(:fasta)
>
> it gives error that method is not defined. Also I don't see a way how to
> define the header.
>
> What should i use in place of the deprecated to_fasta method?
> > Thanks, > > Diana > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From rozziite at gmail.com Fri Aug 7 13:31:45 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Fri, 7 Aug 2009 13:31:45 -0400 Subject: [BioRuby] GSOC PhyloXML profiling, bottleneck is Bio::Tree#children Message-ID: <4057d3bf0908071031g39e64004t50abceae87e12fed@mail.gmail.com> Hi all, Here is update on Google Summer of Code Bioruby PhyloXML project. I was profiling and refactoring Bioruby PhyloXML Parser code and got 67% speed increase. With profiling PhyloXML Writer the story is different. It takes 24minutes to write the 1.5MB mollusca taxonomy tree and forever other larger files. Again the bottleneck is bfs_shortest_path, which is called from Tree#children method. It takes forever to just iterate over all the children nodes. To solve this I propose to save an array of the children of the node within my PhyloXML::Node (which corresponds to a clade) class. This would also ensure that when a phyloxml file is parsed and then written back, clades would be the same order in the input and output files. Have a good weekend, Diana From czmasek at burnham.org Fri Aug 7 16:47:31 2009 From: czmasek at burnham.org (Christian Zmasek) Date: Fri, 7 Aug 2009 13:47:31 -0700 Subject: [BioRuby] GSOC PhyloXML profiling, bottleneck is Bio::Tree#children In-Reply-To: <4057d3bf0908071031g39e64004t50abceae87e12fed@mail.gmail.com> References: <4057d3bf0908071031g39e64004t50abceae87e12fed@mail.gmail.com> Message-ID: Hi, Diana: I think your array based solution is fine. Although, at one point Bioruby's Tree might need to be enhanced, in order to avoid being slowed down by this bfs_shortest_path method. This issue is likely to prohibit (time-wise) many types of algorithms acting on large tree objects. 
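The array-based idea discussed above can be sketched in plain Ruby. This is only an illustration, not BioRuby's Bio::Tree or the PhyloXML code: the class name ChildIndexedTree and all of its methods are hypothetical. It keeps an ordered child array per node and updates it on every structural change, so listing children needs no graph search and child order is preserved.

```ruby
# Hypothetical sketch (NOT part of BioRuby): a rooted tree that stores an
# ordered array of children for each node and keeps it in sync on changes.
class ChildIndexedTree
  attr_reader :root

  def initialize(root)
    @root = root
    @children = { root => [] }  # node => ordered array of child nodes
    @parent = { root => nil }   # node => parent node (nil for the root)
  end

  # Constant-time lookup; a copy is returned so callers cannot
  # modify the internal index by accident.
  def children(node)
    @children.fetch(node).dup
  end

  def parent(node)
    @parent.fetch(node)
  end

  # Attach a new leaf under an existing node, preserving insertion order.
  def add_child(parent, child)
    raise ArgumentError, 'unknown parent' unless @children.key?(parent)
    raise ArgumentError, 'node already in tree' if @children.key?(child)
    @children[parent] << child
    @children[child] = []
    @parent[child] = parent
    child
  end

  # Remove a node together with its whole subtree, updating both indexes.
  def remove_subtree(node)
    raise ArgumentError, 'cannot remove root' if node == @root
    @children[@parent.fetch(node)].delete(node)
    stack = [node]
    until stack.empty?
      current = stack.pop
      stack.concat(@children.delete(current))
      @parent.delete(current)
    end
    node
  end

  # Pre-order traversal (parent before children, children in stored order).
  def each_node(start = @root)
    return enum_for(:each_node, start) unless block_given?
    stack = [start]
    until stack.empty?
      node = stack.pop
      yield node
      stack.concat(@children.fetch(node).reverse)
    end
  end
end
```

The cost of this design is that every operation that changes the tree has to go through methods that update both hashes; code that edits the underlying graph directly would leave the child arrays stale.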
Christian ________________________________________ From: Diana Jaunzeikare [rozziite at gmail.com] Sent: Friday, August 07, 2009 10:31 AM To: Christian Zmasek; Phyloinformatics Group; Pjotr Prins; bioruby at lists.open-bio.org; Naohisa GOTO Subject: GSOC PhyloXML profiling, bottleneck is Bio::Tree#children Hi all, Here is update on Google Summer of Code Bioruby PhyloXML project. I was profiling and refactoring Bioruby PhyloXML Parser code and got 67% speed increase. With profiling PhyloXML Writer the story is different. It takes 24minutes to write the 1.5MB mollusca taxonomy tree and forever other larger files. Again the bottleneck is bfs_shortest_path, which is called from Tree#children method. It takes forever to just iterate over all the children nodes. To solve this I propose to save an array of the children of the node within my PhyloXML::Node (which corresponds to a clade) class. This would also ensure that when a phyloxml file is parsed and then written back, clades would be the same order in the input and output files. Have a good weekend, Diana From ngoto at gen-info.osaka-u.ac.jp Sun Aug 9 22:58:12 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Mon, 10 Aug 2009 11:58:12 +0900 Subject: [BioRuby] GSOC PhyloXML profiling, bottleneck is Bio::Tree#children In-Reply-To: <4057d3bf0908071031g39e64004t50abceae87e12fed@mail.gmail.com> References: <4057d3bf0908071031g39e64004t50abceae87e12fed@mail.gmail.com> Message-ID: <20090810025814.9A8371CBC49D@idnmail.gen-info.osaka-u.ac.jp> Hi, On Fri, 7 Aug 2009 13:31:45 -0400 Diana Jaunzeikare wrote: > Hi all, > > Here is update on Google Summer of Code Bioruby PhyloXML project. I was > profiling and refactoring Bioruby PhyloXML Parser code and got 67% speed > increase. > > With profiling PhyloXML Writer the story is different. It takes 24minutes to > write the 1.5MB mollusca taxonomy tree and forever other larger files. 
> Again the bottleneck is bfs_shortest_path, which is called from > Tree#children method. It takes forever to just iterate over all the children > nodes. Bio::Tree uses graph algorithm implemented in the Bio::Pathway class. As you say, their implementations are naive and should be rewritten in the future. > To solve this I propose to save an array of the children of the node within > my PhyloXML::Node (which corresponds to a clade) class. This would also > ensure that when a phyloxml file is parsed and then written back, clades > would be the same order in the input and output files. How to maintain the array when modifying the tree, e.g. to add, delete, or replace some nodes and edges? Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > Have a good weekend, > > Diana > From rozziite at gmail.com Mon Aug 10 11:00:54 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Mon, 10 Aug 2009 11:00:54 -0400 Subject: [BioRuby] GSOC PhyloXML profiling, bottleneck is Bio::Tree#children In-Reply-To: <20090810025814.9A8371CBC49D@idnmail.gen-info.osaka-u.ac.jp> References: <4057d3bf0908071031g39e64004t50abceae87e12fed@mail.gmail.com> <20090810025814.9A8371CBC49D@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <4057d3bf0908100800j544d0bcaqa063b0b9f5267a53@mail.gmail.com> On Sun, Aug 9, 2009 at 10:58 PM, Naohisa GOTO wrote: > Hi, > > On Fri, 7 Aug 2009 13:31:45 -0400 > Diana Jaunzeikare wrote: > > > Hi all, > > > > Here is update on Google Summer of Code Bioruby PhyloXML project. I was > > profiling and refactoring Bioruby PhyloXML Parser code and got 67% speed > > increase. > > > > With profiling PhyloXML Writer the story is different. It takes 24minutes > to > > write the 1.5MB mollusca taxonomy tree and forever other larger files. > > Again the bottleneck is bfs_shortest_path, which is called from > > Tree#children method. It takes forever to just iterate over all the > children > > nodes. 
> > Bio::Tree uses graph algorithm implemented in the Bio::Pathway class. > As you say, their implementations are naive and should be rewritten > in the future. > > > To solve this I propose to save an array of the children of the node > within > > my PhyloXML::Node (which corresponds to a clade) class. This would also > > ensure that when a phyloxml file is parsed and then written back, clades > > would be the same order in the input and output files. > > How to maintain the array when modifying the tree, e.g. to add, > delete, or replace some nodes and edges? > I didn't think of this issue. I guess then instead of adding that support in PhyloXML class I could rewrite Bio::Tree class, thus other projects based on Bio::Tree class would benefit in the future. Diana > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > > > Have a good weekend, > > > > Diana > > > From rozziite at gmail.com Mon Aug 10 15:54:55 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Mon, 10 Aug 2009 15:54:55 -0400 Subject: [BioRuby] GSOC: Bioruby PhyloXML update 12 Message-ID: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> Hi all, What was done last week: * Coding. Added changes so that now it is completely compatible with phyloxml schema 1.10 * Testing. added more unit tests (now writer has 9 tests, 26 assertions; parser: 40 tests, 134 assertions) * Profiling. I discovered that writer is really slow. The reason is the implementation of the Tree#children method, which does bfs_shortest_path algorithm. I had idea of tracking node children inside the node class as an array, but Naohisa Goto pointed out that then I would also have to deal with new node, edge addition, removal, etc. So better solution seems to, for now leave it as it is, and first improve Bio::Tree class. I am planning to do that after GSOC, since there is only one week left. * Refactored parser class, got around 3-fold speed increase. 
Now it can parse Metazoa taxonomy 33MB file in ~14 seconds (Ubuntu 9.04, ruby 1.8.7 [i486-linux], Intel Core 2 Duo P8600 @2.4GHz) Next week: * Create howto wiki page with code examples and usage. * Do more testing (Anybody has some more phyloxml xml files for me to test, other than those on phyloxml.org?) * Any other suggestions from you? Questions/issues: * Where should the HOWTO and code example documentation go? Seems reasonable for it to go here http://bioruby.open-bio.org/wiki/HOWTO:Trees and/or http://bioruby.open-bio.org/wiki/Phyloxml_tree_format (which is linked from previous link). * How does integration to the master branch goes? Is all i have to do is pull_request on github? * I have implemented PhyloXML::Sequence#to_biosequence, however it returns incomplete data, since info for Bio::Sequence#classification, Bio::Sequence#species, Bio::Sequence#division would come from PhyloXML::Taxonomy class, but it is not accessible from Sequence class. Should there be PhyloXML::Node#to_biosequence method which would gather information from both PhyloXML::Sequence and PhyloXML::Taxonomy? or maybe Bio::Sequence should not hold taxonomic information? You are all welcome to test my code. It is available on http://github.com/latvianlinuxgirl/bioruby/tree/dev Thanks, Diana From john.woods at marcottelab.org Fri Aug 7 14:23:22 2009 From: john.woods at marcottelab.org (John O. Woods) Date: Fri, 7 Aug 2009 13:23:22 -0500 Subject: [BioRuby] Feedback requested on OBO module Message-ID: <91656c3f0908071123n3f130f29p2a1a33c4db50214d@mail.gmail.com> Hi all, I'm pretty new to Ruby (and Rails), first of all. I wrote a simple reader for OBO files (attached). It does not write (currently), but it can interface with ActiveRecord and acts-as-dag to create a directed acyclic graph in a database. As a newbie, I suspect I'm doing some things wrong, or at least not as correctly as they could be done. 
For example, usually rails migrations that fail undo all of their changes; but I can't figure out how I'd write that. I attempted to do so, but it apparently does not work (have to set :save_immediately => true to get it to work). I'd love some feedback, if anyone would be willing to take a quick look at my code. I've also attached a test file from Flybase. An object can be created with: od = Bio::OboDoc.new("filename.obo") Thanks! John From ngoto at gen-info.osaka-u.ac.jp Thu Aug 13 04:21:39 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 13 Aug 2009 17:21:39 +0900 Subject: [BioRuby] Feedback requested on OBO module In-Reply-To: <91656c3f0908071123n3f130f29p2a1a33c4db50214d@mail.gmail.com> References: <91656c3f0908071123n3f130f29p2a1a33c4db50214d@mail.gmail.com> Message-ID: <20090813082140.0536D1CBC413@idnmail.gen-info.osaka-u.ac.jp> Hi, In this mailing list, attachment files are normally dropped. To continue discussions, if the code is short, desribe it in the message body. If it isn't short, put your code on your web site or on a code repository (for example http://gist.github.com/ ) and post only the URL. Thanks. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Fri, 7 Aug 2009 13:23:22 -0500 "John O. Woods" wrote: > Hi all, > I'm pretty new to Ruby (and Rails), first of all. I wrote a simple reader > for OBO files (attached). It does not write (currently), but it can > interface with ActiveRecord and acts-as-dag to create a directed acyclic > graph in a database. > > As a newbie, I suspect I'm doing some things wrong, or at least not as > correctly as they could be done. For example, usually rails migrations that > fail undo all of their changes; but I can't figure out how I'd write that. I > attempted to do so, but it apparently does not work (have to set > :save_immediately => true to get it to work). > > I'd love some feedback, if anyone would be willing to take a quick look at > my code. 
> > I've also attached a test file from Flybase. An object can be created with: > od = Bio::OboDoc.new("filename.obo") > > Thanks! > John > From rozziite at gmail.com Thu Aug 13 17:50:01 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Thu, 13 Aug 2009 17:50:01 -0400 Subject: [BioRuby] GSOC: Bioruby PhyloXML update 12 In-Reply-To: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> References: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> Message-ID: <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> Hi all, I added here a HOWTO for BioRuby PhyloXML implementation https://www.nescent.org/wg_phyloinformatics/BioRuby_PhyloXML_HowTo_documentation Let me know, what you think Diana On Mon, Aug 10, 2009 at 3:54 PM, Diana Jaunzeikare wrote: > Hi all, > > What was done last week: > > * Coding. Added changes so that now it is completely compatible with > phyloxml schema 1.10 > > * Testing. added more unit tests (now writer has 9 tests, 26 assertions; > parser: 40 tests, 134 assertions) > > * Profiling. I discovered that writer is really slow. The reason is the > implementation of the Tree#children method, which does bfs_shortest_path > algorithm. I had idea of tracking node children inside the node class as an > array, but Naohisa Goto pointed out that then I would also have to deal with > new node, edge addition, removal, etc. So better solution seems to, for now > leave it as it is, and first improve Bio::Tree class. I am planning to do > that after GSOC, since there is only one week left. > > * Refactored parser class, got around 3-fold speed increase. Now it can > parse Metazoa taxonomy 33MB file in ~14 seconds (Ubuntu 9.04, ruby 1.8.7 > [i486-linux], Intel Core 2 Duo P8600 @2.4GHz) > > Next week: > > * Create howto wiki page with code examples and usage. > * Do more testing (Anybody has some more phyloxml xml files for me to test, > other than those on phyloxml.org?) > * Any other suggestions from you? 
> > Questions/issues: > > * Where should the HOWTO and code example documentation go? Seems > reasonable for it to go here > http://bioruby.open-bio.org/wiki/HOWTO:Trees and/or > http://bioruby.open-bio.org/wiki/Phyloxml_tree_format (which is linked > from previous link). > > * How does integration to the master branch goes? Is all i have to do is > pull_request on github? > > * I have implemented PhyloXML::Sequence#to_biosequence, however it returns > incomplete data, since info for Bio::Sequence#classification, > Bio::Sequence#species, Bio::Sequence#division would come from > PhyloXML::Taxonomy class, but it is not accessible from Sequence class. > Should there be PhyloXML::Node#to_biosequence method which would gather > information from both PhyloXML::Sequence and PhyloXML::Taxonomy? or maybe > Bio::Sequence should not hold taxonomic information? > > You are all welcome to test my code. It is available on > http://github.com/latvianlinuxgirl/bioruby/tree/dev > > Thanks, > > Diana > From czmasek at burnham.org Fri Aug 14 14:41:13 2009 From: czmasek at burnham.org (Christian M Zmasek) Date: Fri, 14 Aug 2009 11:41:13 -0700 Subject: [BioRuby] GSOC: Bioruby PhyloXML update 12 In-Reply-To: <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> References: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> Message-ID: <4A85AFC9.4020106@burnham.org> Very nice!! I guess this will be placed/copied to the BioRuby tutorial (at http://bioruby.open-bio.org/wiki/Tutorial) at one point, correct? A very tiny, minuscule even, issue I noticed (and maybe even a problem of my web browser): A the very end, the blue box seems broken at the "wrong place" -- so to speak, i.e. 
"#Once we know whats there, lets output just sequences phyloxml.other[0].children.each do |node| puts node.value end" and "#=> # #acgtcgcggcccgtggaagtcctctcct #aggtcgcggcctgtggaagtcctctcct #taaatcgc--cccgtgg-agtccc-cct" should be in the same box, but they appear to be in different ones. Christian Diana Jaunzeikare wrote: > Hi all, > > I added here a HOWTO for BioRuby PhyloXML implementation > > https://www.nescent.org/wg_phyloinformatics/BioRuby_PhyloXML_HowTo_documentation > > Let me know, what you think > > Diana > > On Mon, Aug 10, 2009 at 3:54 PM, Diana Jaunzeikare > wrote: > > Hi all, > > What was done last week: > > * Coding. Added changes so that now it is completely compatible with > phyloxml schema 1.10 > > * Testing. added more unit tests (now writer has 9 tests, 26 > assertions; parser: 40 tests, 134 assertions) > > * Profiling. I discovered that writer is really slow. The reason is > the implementation of the Tree#children method, which does > bfs_shortest_path algorithm. I had idea of tracking node children > inside the node class as an array, but Naohisa Goto pointed out that > then I would also have to deal with new node, edge addition, > removal, etc. So better solution seems to, for now leave it as it > is, and first improve Bio::Tree class. I am planning to do that > after GSOC, since there is only one week left. > > * Refactored parser class, got around 3-fold speed increase. Now it > can parse Metazoa taxonomy 33MB file in ~14 seconds (Ubuntu 9.04, > ruby 1.8.7 [i486-linux], Intel Core 2 Duo P8600 @2.4GHz) > > Next week: > > * Create howto wiki page with code examples and usage. > * Do more testing (Anybody has some more phyloxml xml files for me > to test, other than those on phyloxml.org ?) > * Any other suggestions from you? > > Questions/issues: > > * Where should the HOWTO and code example documentation go? 
Seems > reasonable for it to go here > http://bioruby.open-bio.org/wiki/HOWTO:Trees and/or > http://bioruby.open-bio.org/wiki/Phyloxml_tree_format (which is > linked from previous link). > > * How does integration to the master branch goes? Is all i have to > do is pull_request on github? > > * I have implemented PhyloXML::Sequence#to_biosequence, however it > returns incomplete data, since info for > Bio::Sequence#classification, Bio::Sequence#species, > Bio::Sequence#division would come from PhyloXML::Taxonomy class, but > it is not accessible from Sequence class. Should there be > PhyloXML::Node#to_biosequence method which would gather information > from both PhyloXML::Sequence and PhyloXML::Taxonomy? or maybe > Bio::Sequence should not hold taxonomic information? > > You are all welcome to test my code. It is available on > http://github.com/latvianlinuxgirl/bioruby/tree/dev > > Thanks, > > Diana > > From czmasek at burnham.org Fri Aug 14 14:52:04 2009 From: czmasek at burnham.org (Christian M Zmasek) Date: Fri, 14 Aug 2009 11:52:04 -0700 Subject: [BioRuby] GSOC: Bioruby PhyloXML update 12 In-Reply-To: <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> References: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> Message-ID: <4A85B254.8020601@burnham.org> Hi, Diana: > > * How does integration to the master branch goes? Is all i have to > do is pull_request on github? I think so (never done this myself, though). Maybe the git experts can answer this? > * I have implemented PhyloXML::Sequence#to_biosequence, however it > returns incomplete data, since info for > Bio::Sequence#classification, Bio::Sequence#species, > Bio::Sequence#division would come from PhyloXML::Taxonomy class, but > it is not accessible from Sequence class. Should there be > PhyloXML::Node#to_biosequence method which would gather information > from both PhyloXML::Sequence and PhyloXML::Taxonomy? 
or maybe > Bio::Sequence should not hold taxonomic information? Good point. Personally, I would leave your PhyloXML::Sequence#to_biosequence as it is (and add a warning about this to the documentation) and, in addition, create PhyloXML::Node#to_biosequence -- although, I would not call it to_biosequence but maybe something like extract_biosequence. Needless to say, that an almost infinite number of "solutions" to this exists, without a clear "winner" (in my opinion). Christian From ngoto at gen-info.osaka-u.ac.jp Sat Aug 15 06:38:24 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Sat, 15 Aug 2009 19:38:24 +0900 Subject: [BioRuby] GSOC: Bioruby PhyloXML update 12 In-Reply-To: <4A85B254.8020601@burnham.org> References: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> <4A85B254.8020601@burnham.org> Message-ID: <20090815103824.DEC901CBC494@idnmail.gen-info.osaka-u.ac.jp> Hi, On Fri, 14 Aug 2009 11:52:04 -0700 Christian M Zmasek wrote: > Hi, Diana: > > > > > * How does integration to the master branch goes? Is all i have to > > do is pull_request on github? > > I think so (never done this myself, though). Maybe the git experts can > answer this? I'll soon release 1.3.1, bug-fix release of 1.3.0. After that, the PhyloXML support and some other new functions such as Chromatogram class http://github.com/aunderwo/bioruby/tree will be merged, and will be released as 1.4.0. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From andrew.j.grimm at gmail.com Sun Aug 16 06:42:23 2009 From: andrew.j.grimm at gmail.com (Andrew Grimm) Date: Sun, 16 Aug 2009 20:42:23 +1000 Subject: [BioRuby] Code style question Message-ID: The Readme for developers mentions that if __FILE__ == $0 is deprecated for testing. In lib/bio/db/fasta.rb , there's a __FILE__ == $0 that isn't used for testing, but for demonstrating what the module's supposed to do. Is this kind of coding non-deprecated? 
Andrew From ngoto at gen-info.osaka-u.ac.jp Sun Aug 16 08:57:10 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto) Date: Sun, 16 Aug 2009 21:57:10 +0900 Subject: [BioRuby] Code style question In-Reply-To: References: Message-ID: <20090816215623.4B48.EEF6E030@gen-info.osaka-u.ac.jp> Hi, > The Readme for developers mentions that > > if __FILE__ == $0 > > is deprecated for testing. > > In lib/bio/db/fasta.rb , there's a __FILE__ == $0 that isn't used for > testing, but for demonstrating what the module's supposed to do. Is > this kind of coding non-deprecated? > > Andrew Yes, deprecated for new codes. Old codes before the style have been deprecated may be existed. We gradually move and rewrite them to unit tests and/or sample codes. -- Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From rozziite at gmail.com Sun Aug 16 16:09:49 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Sun, 16 Aug 2009 16:09:49 -0400 Subject: [BioRuby] GSOC: Bioruby PhyloXML update 12 In-Reply-To: <4A85B254.8020601@burnham.org> References: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> <4A85B254.8020601@burnham.org> Message-ID: <4057d3bf0908161309p634fd95bt24b81debe9b2a669@mail.gmail.com> On Fri, Aug 14, 2009 at 2:52 PM, Christian M Zmasek wrote: > Hi, Diana: > > >> * How does integration to the master branch goes? Is all i have to >> do is pull_request on github? >> > > I think so (never done this myself, though). Maybe the git experts can > answer this? > > > * I have implemented PhyloXML::Sequence#to_biosequence, however it >> returns incomplete data, since info for >> Bio::Sequence#classification, Bio::Sequence#species, >> Bio::Sequence#division would come from PhyloXML::Taxonomy class, but >> it is not accessible from Sequence class. Should there be >> PhyloXML::Node#to_biosequence method which would gather information >> from both PhyloXML::Sequence and PhyloXML::Taxonomy? 
or maybe >> Bio::Sequence should not hold taxonomic information? >> > > Good point. Personally, I would leave your > PhyloXML::Sequence#to_biosequence as it is (and add a warning about this to > the documentation) and, in addition, create PhyloXML::Node#to_biosequence -- > although, I would not call it to_biosequence but maybe something like > extract_biosequence. > Needless to say, that an almost infinite number of "solutions" to this > exists, without a clear "winner" (in my opinion). > > Christian > > I added PhyloXML::Node#extract_biosequence method. It first calls sequence#to_biosequence and then adds additional information from taxonomy elements. Diana From czmasek at burnham.org Sun Aug 16 21:57:06 2009 From: czmasek at burnham.org (Christian Zmasek) Date: Sun, 16 Aug 2009 18:57:06 -0700 Subject: [BioRuby] GSOC: Bioruby PhyloXML update 12 In-Reply-To: <4057d3bf0908161309p634fd95bt24b81debe9b2a669@mail.gmail.com> References: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> <4A85B254.8020601@burnham.org>, <4057d3bf0908161309p634fd95bt24b81debe9b2a669@mail.gmail.com> Message-ID: Sounds very good! ________________________________________ From: Diana Jaunzeikare [rozziite at gmail.com] Sent: Sunday, August 16, 2009 1:09 PM To: Christian Zmasek Cc: bioruby at lists.open-bio.org; Pjotr Prins; Naohisa GOTO; Phyloinformatics Group Subject: Re: GSOC: Bioruby PhyloXML update 12 On Fri, Aug 14, 2009 at 2:52 PM, Christian M Zmasek > wrote: Hi, Diana: * How does integration to the master branch goes? Is all i have to do is pull_request on github? I think so (never done this myself, though). Maybe the git experts can answer this? * I have implemented PhyloXML::Sequence#to_biosequence, however it returns incomplete data, since info for Bio::Sequence#classification, Bio::Sequence#species, Bio::Sequence#division would come from PhyloXML::Taxonomy class, but it is not accessible from Sequence class. 
Should there be PhyloXML::Node#to_biosequence method which would gather information from both PhyloXML::Sequence and PhyloXML::Taxonomy? or maybe Bio::Sequence should not hold taxonomic information? Good point. Personally, I would leave your PhyloXML::Sequence#to_biosequence as it is (and add a warning about this to the documentation) and, in addition, create PhyloXML::Node#to_biosequence -- although, I would not call it to_biosequence but maybe something like extract_biosequence. Needless to say, that an almost infinite number of "solutions" to this exists, without a clear "winner" (in my opinion). Christian I added PhyloXML::Node#extract_biosequence method. It first calls sequence#to_biosequence and then adds additional information from taxonomy elements. Diana From rozziite at gmail.com Mon Aug 17 11:20:45 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Mon, 17 Aug 2009 11:20:45 -0400 Subject: [BioRuby] Bioruby PhyloXML pencils down Message-ID: <4057d3bf0908170820g552389d7r958a52af3f62b7d5@mail.gmail.com> Hi all, So the Google Summer of Code is over. I had a great time learning and improving my programming skills in real world environment. Especially I am glad that I learned more of "doing things in the ruby way". I think that writing open source will definitely help me to land with a better job after graduation since the result is something i can actually show to the potential employers as example of my work. Here is the summary: * Wrote Bio::PhyloXML::Parser class which is responsible for parsing xml files in phyloxml format. It inherits from Bio::Tree and thus have all the functionality Bio::Tree class have. * Wrote Bio::PhyloXML::Writer class which is responsible for writing (hopefully) valid xml files against phyloxml schema. For most part it produces xml files that validate against phyloxml schema, but probably it is possible to produce artificial example when it is not, since not everywhere i have checks for valid input or data type. 
This is where I can improve in future.

* Since these classes are meant to deal with big data files, I changed code
several times to make it faster. I managed to improve the speed of the Parser
by a lot, but not for the Writer. In order to improve speed of the Writer, I
will first improve the underlying Bio::Tree class (thus others also might
benefit from faster Bio::Tree class).

* Bio::PhyloXML module holds classes corresponding to complex phyloxml
elements. Here http://wiki.github.com/latvianlinuxgirl/bioruby is the design
of classes.

* The code is here http://github.com/latvianlinuxgirl/bioruby/tree/dev

* The project page is here:
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby,
where I will probably continue to post future plans.

* Code examples are at
https://www.nescent.org/wg_phyloinformatics/BioRuby_PhyloXML_HowTo_documentation.
The same information is also incorporated in the bioruby tutorial.

I will definitely stay involved in BioRuby and open source community. The
next step will be to improve Bio::Tree class, especially Bio::Tree#parent and
Bio::Tree#children methods. Next semester I will be taking Algorithms class,
so maybe I will be able to apply that knowledge to this task. And of course I
plan to maintain my code, since when people start using the code, bugs
will start to creep up.

I would like to thank my mentors Christian and Pjotr for support and timely
answers to my questions. Also thanks to Naohisa and others who answered my
questions and helped me.

Diana

From john.woods at marcottelab.org Fri Aug 21 20:04:03 2009
From: john.woods at marcottelab.org (John O. Woods)
Date: Fri, 21 Aug 2009 19:04:03 -0500
Subject: [BioRuby] Batch Entrez search
Message-ID: <91656c3f0908211704p5b97f909ve168f4fcb7873fee@mail.gmail.com>

Right now I'm doing, one at a time:

n.search("gene", "FBgn0041721")

I'm trying to get the gene ID corresponding to a Flybase gene ID. This takes
forever, since you can only do one every three seconds.
Also, sometimes it returns two items, and one might be deprecated--but
there's no way to tell.

Is there a way to batch search with BioRuby for a whole bunch of Flybase
IDs?

Many thanks,
John

Marcotte Lab | The University of Texas at Austin

From tomoakin at kenroku.kanazawa-u.ac.jp Sun Aug 23 22:21:18 2009
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Mon, 24 Aug 2009 11:21:18 +0900
Subject: [BioRuby] Batch Entrez search
In-Reply-To: <91656c3f0908211704p5b97f909ve168f4fcb7873fee@mail.gmail.com>
References: <91656c3f0908211704p5b97f909ve168f4fcb7873fee@mail.gmail.com>
Message-ID:

Hi,

> Is there a way to batch search with BioRuby for a whole bunch of Flybase
> IDs?

According to
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpgene&part=genefaq
the data for entrez gene are available in
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz

If you need really a lot, perhaps it's better to download that file
(about 93 Mbytes).
(It contains data for other organisms, which is not necessary, but
does not take forever)
The format seems simple enough that you can easily get the gene ID for
a flybase ID.

#Format: tax_id GeneID Symbol LocusTag Synonyms dbXrefs chromosome map_location description type_of_gene Symbol_from_nomenclature_authority Full_name_from_nomenclature_authority Nomenclature_status Other_designations Modification_date (tab is used as a separator, pound sign - start of a comment)

--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From marc.hoeppner at molbio.su.se Mon Aug 24 01:36:48 2009
From: marc.hoeppner at molbio.su.se (Marc Hoeppner)
Date: Mon, 24 Aug 2009 07:36:48 +0200
Subject: [BioRuby] Batch Entrez search
In-Reply-To:
References: <91656c3f0908211704p5b97f909ve168f4fcb7873fee@mail.gmail.com>
Message-ID: <4A9226F0.4050206@molbio.su.se>

Hi,

I suppose for FlyBase genes you could also use the ruby Ensembl API.
Something like:

  require 'ensembl'

  # Connect to the Drosophila core database (Ensembl release 55).
  Ensembl::Core::DBConnection.connect('drosophila_melanogaster', 55)

  # One FlyBase ID per line in my_infile.
  IO.foreach('my_infile') do |line|
    flybase_id = line.strip
    gene = Ensembl::Core::Gene.find_by_stable_id(flybase_id)
    next unless gene   # skip IDs missing from this release
    gene.all_xrefs.each do |xref|
      puts xref
    end
  end

Well, you get the idea. The methods are well documented in the corresponding API, but when in doubt I can offer some help, too.

P.S.: To make it real easy you could also use the BioMart on www.ensembl.org - unless you need this to be a script.

Cheers,

Marc

> Hi,
>
>> Is there a way to batch search with BioRuby for a whole bunch of Flybase
>> IDs?
>
>
> According to
> http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpgene&part=genefaq
> the data for entrez gene are available in
> ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
>
> If you need really a lot, perhaps its better to download that file
> (about 93 Mbytes).
> (It contains data for other organisms, which is not necessary, but
> does not take forever)
> The format seems simple enough that you can easily get the gene ID for
> a flybase ID.
>
> #Format: tax_id GeneID Symbol LocusTag Synonyms dbXrefs chromosome
> map_location description type_of_gene
> Symbol_from_nomenclature_authority Full_name_from_nomenclature_authority
> Nomenclature_status Other_designations Modification_date
> (tab is used as a separator, pound sign - start of a comment)
>

--

Marc P. Hoeppner
PhD student
Department of Molecular Biology and Functional Genomics
Stockholm University, 10691 Stockholm, Sweden

marc.hoeppner at molbio.su.se
Tel: +46 (0)8 - 164195

From john.woods at marcottelab.org Mon Aug 24 08:44:31 2009
From: john.woods at marcottelab.org (John O.
Woods) Date: Mon, 24 Aug 2009 07:44:31 -0500 Subject: [BioRuby] Batch Entrez search In-Reply-To: <4A9226F0.4050206@molbio.su.se> References: <91656c3f0908211704p5b97f909ve168f4fcb7873fee@mail.gmail.com> <4A9226F0.4050206@molbio.su.se> Message-ID: <91656c3f0908240544h20577825n38817c7667081b90@mail.gmail.com> Unfortunately, many Flybase IDs seem to be missing from BioMart, which leads me to think they'd also be absent from Ensembl. On Mon, Aug 24, 2009 at 12:36 AM, Marc Hoeppner wrote: > Hi, > > I suppose for FlyBase genes you could also use the ruby Ensembl API. > > Something like: > > require 'ensembl' > > > Ensembl::Core::DBConnection('drosophila_melanogaster','55) > > IO.foreach('my_infile') do |flybase_id| > > gene = Ensembl::Core::Gene.find_by_stable_id(flybase_id) > gene.all_xrefs.each do |xref| > puts xref > end > > end > > Well, you get the idea. The methods are well documented in the > corresponding API, but when in doubt I can offer some help, too. > > P.S.: To make it real easy you could also use the BioMart on > www.ensembl.org - unless you need this to be a script. > > Cheers, > > Marc > >> Hi, >> >> Is there a way to batch search with BioRuby for a whole bunch of Flybase >>> IDs? >>> >> >> >> According to >> http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpgene&part=genefaq >> the data for entrez gene are available in >> ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz >> >> If you need really a lot, perhaps its better to download that file (about >> 93 Mbytes). >> (It contains data for other organisms, which is not necessary, but does >> not take forever) >> The format seems simple enough that you can easily get the gene ID for a >> flybase ID. 
>> >> #Format: tax_id GeneID Symbol LocusTag Synonyms dbXrefs chromosome >> map_location description type_of_gene Symbol_from_nomenclature_authority >> Full_name_from_nomenclatur >> e_authority Nomenclature_status Other_designations Modification_date (tab >> is used as a separator, pound sign - start of a comment) >> >> > > -- > > Marc P. Hoeppner > PhD student > Department of Molecular Biology and Functional Genomics > Stockholm University, 10691 Stockholm, Sweden > > marc.hoeppner at molbio.su.se > Tel: +46 (0)8 - 164195 > > From biopython at maubp.freeserve.co.uk Thu Aug 27 06:46:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 27 Aug 2009 11:46:17 +0100 Subject: [BioRuby] FASTQ in BioRuby? Message-ID: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com> Hello BioRuby team, I am one of the Biopython developers, and together with Peter Rice (EMBOSS) and Chris Fields (BioPerl) we have been coordinating how these Open Bioinformatics Foundation (OBF) projects will interpret the FASTQ file format used in next generation sequencing. This includes standardising our naming conventions for the original Sanger FASTQ variant, and the later Solexa/early Illumina, and recent Illumina 1.3+ variants. We have also put together a set of test files, including reference conversions between the different FASTQ variants. We would be delighted to get BioRuby involved. I tried to contact Naohisa Goto about this directly last month, but perhaps my email did not arrive. If BioRuby is working on (or planning to work on) FASTQ support, please could the developers concerned sign up to the OBF joint mailing list where we have been discussing this: http://lists.open-bio.org/mailman/listinfo/open-bio-l Thank you, Peter From ngoto at gen-info.osaka-u.ac.jp Thu Aug 27 07:20:46 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 27 Aug 2009 20:20:46 +0900 Subject: [BioRuby] [Open-bio-l] FASTQ in BioRuby? 
In-Reply-To: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com> References: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com> Message-ID: <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp> Hello Peter, sorry for responding too late. I've subscribed to open-bio-l, but I could not actively join to the discussions, because of lack of my knowledge about FASTQ. There is a small primitive code attempt to support FASTQ format in BioRuby, which is not yet merged in the main repository. http://github.com/ngoto/bioruby/tree/master Recently, Anthony Underwood contributed chromatgram classes to support SCF/ABI formats, which will be merged soon, after bug-fix maintenance release of 1.3.1. http://github.com/aunderwo/bioruby/tree/master I'm now planning to rewrite my FASTQ code to be consistent with the chromatgram classes, and with the open-bio standards. Thank you, Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Thu, 27 Aug 2009 11:46:17 +0100 Peter wrote: > Hello BioRuby team, > > I am one of the Biopython developers, and together with Peter Rice > (EMBOSS) and Chris Fields (BioPerl) we have been coordinating > how these Open Bioinformatics Foundation (OBF) projects will > interpret the FASTQ file format used in next generation sequencing. > > This includes standardising our naming conventions for the original > Sanger FASTQ variant, and the later Solexa/early Illumina, and > recent Illumina 1.3+ variants. We have also put together a set of > test files, including reference conversions between the different > FASTQ variants. > > We would be delighted to get BioRuby involved. I tried to contact > Naohisa Goto about this directly last month, but perhaps my email > did not arrive. 
If BioRuby is working on (or planning to work on) > FASTQ support, please could the developers concerned sign up > to the OBF joint mailing list where we have been discussing this: > http://lists.open-bio.org/mailman/listinfo/open-bio-l > > Thank you, > > Peter > _______________________________________________ > Open-Bio-l mailing list > Open-Bio-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/open-bio-l From biopython at maubp.freeserve.co.uk Thu Aug 27 08:08:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 27 Aug 2009 13:08:28 +0100 Subject: [BioRuby] [Open-bio-l] FASTQ in BioRuby? In-Reply-To: <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp> References: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com> <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <320fb6e00908270508o485ba990k96b8bd3b722c09b6@mail.gmail.com> On Thu, Aug 27, 2009 at 12:20 PM, Naohisa GOTO wrote: > > Hello Peter, > > sorry for responding too late. I've subscribed to open-bio-l, > but I could not actively join to the discussions, because of > lack of my knowledge about FASTQ. > > There is a small primitive code attempt to support FASTQ format > in BioRuby, which is not yet merged in the main repository. > http://github.com/ngoto/bioruby/tree/master > > Recently, Anthony Underwood contributed chromatgram classes > to support SCF/ABI formats, which will be merged soon, > after bug-fix maintenance release of 1.3.1. > http://github.com/aunderwo/bioruby/tree/master > > I'm now planning to rewrite my FASTQ code to be consistent > with the chromatgram classes, and with the open-bio standards. 
> > Thank you, > > Naohisa Goto That is excellent news :) I'm not sure how format names work in BioRuby, but if you do have a set of format names as strings as we do in Biopython, BioPerl and EMBOSS it would be nice to be consistent here: http://biopython.org/wiki/SeqIO http://bioperl.org/wiki/HOWTO:SeqIO http://emboss.sourceforge.net/docs/themes/SequenceFormats.html There is some basic information on wikipedia, but this does not go into detail: http://www.bioperl.org/wiki/FASTQ_sequence_format Please feel free to ask any questions about how we are interpreting things. Thank you, Peter From ngoto at gen-info.osaka-u.ac.jp Mon Aug 31 10:22:11 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Mon, 31 Aug 2009 23:22:11 +0900 Subject: [BioRuby] SIM4 parser In-Reply-To: <0C3F8576-899A-426E-869A-C9DCF8F47868@kenroku.kanazawa-u.ac.jp> References: <5510B566-E723-4AEE-8DEC-63BE1ABD9F19@kenroku.kanazawa-u.ac.jp> <0C3F8576-899A-426E-869A-C9DCF8F47868@kenroku.kanazawa-u.ac.jp> Message-ID: <20090831142212.3468E1CBC58D@idnmail.gen-info.osaka-u.ac.jp> Hi, On Sun, 5 Jul 2009 20:28:33 +0900 Tomoaki NISHIYAMA wrote: > Hi, > > > A way to resolve may to check if the start address match the > > address that > > was specified in the previous section stating the ranges of the > > matches. > > I'm considering implementing this way. > > > A working code is obtained and a diff relative to 1.3.0 is attached. > The code was changed to parse alignment only after the SegemntPairs > are prepared The bug is fixed. http://github.com/bioruby/bioruby/commit/02d531e36ecf789f232cf3e05f85391b60279f00 Thank you for sending a patch. I didn't fully use your patch, but it was very helpful. > During this work, I also noticed that the semantics of the structure > might be misunderstood: > 1. The mark after the match, either "->", "<-", "--", or "==" > does not represent the direction of the exon, but indicates > the presumed direction of the intron following the exon. 
> "--" corresponds in case part of the intervening sequence > and midline is shown and > "==" is for cases without information for intervening sequence. > I do not understand how these patterns are determined by SIM4, > but "->" and "<-" can be estimated based on GU-AG rule. > Since these directions are essentially assigned to the > introns rather than exons, it might be inappropriate to assign > these strings to the exon. There is actually rare cases that > introns in different direction is deduced: in such case > assuming the direction of the exon is same as the 3' intron > rather than 5' intron of the exon is not desired. So, it seems > arguable to make directions for exon deprecated. > > From current state of the parser, I bet there are few people using > bioruby to parse sim4 alignment output, and changing the interface > is acceptable this time. You are right. However, currently, to keep compatibility, the method Bio::Sim4::Report::SegmentPair#direction is still being used. In next major release (1.4.0?), the method will be deprecated, and other method would be added. -- Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From ngoto at gen-info.osaka-u.ac.jp Tue Aug 4 12:03:31 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Tue, 4 Aug 2009 21:03:31 +0900 Subject: [BioRuby] how to convert sequence to fasta format with header information? In-Reply-To: <4057d3bf0907301504i21d74c8dk6cccf6833476dcd6@mail.gmail.com> References: <4057d3bf0907301504i21d74c8dk6cccf6833476dcd6@mail.gmail.com> Message-ID: <20090804120333.198301CBC53D@idnmail.gen-info.osaka-u.ac.jp> Hi, It seems the document is wrong. Bio::Sequence#to_fasta is deprecated, but Bio::Sequence::NA#to_fasta, Bio::Sequence::AA#to_fasta, and Bio::Sequence::Generic#to_fasta are NOT deprecated. Currently, Bio::Sequence#output is defined, but no Bio::Sequence::NA#output (and AA#output, Generic#output). 
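
For reference, the layout produced by a call such as to_fasta("1OOP", 70) in the question quoted below is a ">" header line followed by the sequence wrapped at the given width. A minimal plain-Ruby sketch of that layout follows; fasta_record is a hypothetical helper written only for illustration, it is not part of BioRuby and is not how BioRuby implements to_fasta.

```ruby
# Hypothetical helper (NOT a BioRuby method): builds one FASTA record as a
# string, with the sequence wrapped to at most `width` characters per line.
def fasta_record(header, sequence, width = 70)
  # scan splits the sequence into chunks of up to `width` characters
  lines = sequence.to_s.scan(/.{1,#{width}}/)
  ([">#{header}"] + lines).join("\n") + "\n"
end

# Example: the first residues of the chain from the question, wrapped at 20.
puts fasta_record("1OOP", "GPPGEVMGRAIARVADTIGSGPVNSESIPALTAAETGHTSQVV", 20)
```

With BioRuby itself, the point above still applies: call to_fasta on the Bio::Sequence::AA or Bio::Sequence::NA object directly.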
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Thu, 30 Jul 2009 18:04:59 -0400 Diana Jaunzeikare wrote: > Hi all, > > I want to retrieve sequence from a pdb file and save it in fasta format > where* header holds the pdb entry id*. This is how I did it: > > file = File.new('1OOP.pdb').gets(nil) > structure = Bio::PDB.new(file) > seq = structure.seqres['A'] > puts seq.to_fasta("1OOP", 70) > > it works and produces result i want: > > #>1OOP > #GPPGEVMGRAIARVADTIGSGPVNSESIPALTAAETGHTSQVVPSDTMQTRHVKNYHSRSESTVENFLCR > #SACVFYTTYENHDSDGDNFAYWVINTRQVAQLRRKLEMFTYARFDLELTFVITSTQEQPTVRGQDAPVLT > #HQIMYVPPGGPVPTKVNSYSWQTSTNPSVFWTEGSAPPRMSVPFIGIGNAYSMFYDGWARFDKQGTYGIS > #TLNNMGTLYMRHVNDGGPGPIVSTVRIYFKPKHVKTWVPRPPRLCQYQKAGNVNFEPTGVTEGRTDITTM > #KTT > > However, according to documation Bio::Sequence::Common#to_fasta is a > deprecated method and it suggests to use Bio::Sequence#output, but when I > modify code to > > puts seq.output(:fasta) > > it gives error that method is not defined. Also I don't see a way how to > define the header. > > What should i use in place of the deprecated to_fasta method? > > Thanks, > > Diana > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From rozziite at gmail.com Fri Aug 7 17:31:45 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Fri, 7 Aug 2009 13:31:45 -0400 Subject: [BioRuby] GSOC PhyloXML profiling, bottleneck is Bio::Tree#children Message-ID: <4057d3bf0908071031g39e64004t50abceae87e12fed@mail.gmail.com> Hi all, Here is update on Google Summer of Code Bioruby PhyloXML project. I was profiling and refactoring Bioruby PhyloXML Parser code and got 67% speed increase. With profiling PhyloXML Writer the story is different. It takes 24minutes to write the 1.5MB mollusca taxonomy tree and forever other larger files. Again the bottleneck is bfs_shortest_path, which is called from Tree#children method. 
It takes forever to just iterate over all the children nodes. To solve this I propose to save an array of the children of the node within my PhyloXML::Node (which corresponds to a clade) class. This would also ensure that when a phyloxml file is parsed and then written back, clades would be the same order in the input and output files. Have a good weekend, Diana From czmasek at burnham.org Fri Aug 7 20:47:31 2009 From: czmasek at burnham.org (Christian Zmasek) Date: Fri, 7 Aug 2009 13:47:31 -0700 Subject: [BioRuby] GSOC PhyloXML profiling, bottleneck is Bio::Tree#children In-Reply-To: <4057d3bf0908071031g39e64004t50abceae87e12fed@mail.gmail.com> References: <4057d3bf0908071031g39e64004t50abceae87e12fed@mail.gmail.com> Message-ID: Hi, Diana: I think your array based solution is fine. Although, at one point Bioruby's Tree might need to be enhanced, in order to avoid being slowed down by this bfs_shortest_path method. This issue is likely to prohibit (time-wise) many types of algorithms acting on large tree objects. Christian ________________________________________ From: Diana Jaunzeikare [rozziite at gmail.com] Sent: Friday, August 07, 2009 10:31 AM To: Christian Zmasek; Phyloinformatics Group; Pjotr Prins; bioruby at lists.open-bio.org; Naohisa GOTO Subject: GSOC PhyloXML profiling, bottleneck is Bio::Tree#children Hi all, Here is update on Google Summer of Code Bioruby PhyloXML project. I was profiling and refactoring Bioruby PhyloXML Parser code and got 67% speed increase. With profiling PhyloXML Writer the story is different. It takes 24minutes to write the 1.5MB mollusca taxonomy tree and forever other larger files. Again the bottleneck is bfs_shortest_path, which is called from Tree#children method. It takes forever to just iterate over all the children nodes. To solve this I propose to save an array of the children of the node within my PhyloXML::Node (which corresponds to a clade) class. 
This would also ensure that when a phyloxml file is parsed and then written back, clades would be the same order in the input and output files. Have a good weekend, Diana From ngoto at gen-info.osaka-u.ac.jp Mon Aug 10 02:58:12 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Mon, 10 Aug 2009 11:58:12 +0900 Subject: [BioRuby] GSOC PhyloXML profiling, bottleneck is Bio::Tree#children In-Reply-To: <4057d3bf0908071031g39e64004t50abceae87e12fed@mail.gmail.com> References: <4057d3bf0908071031g39e64004t50abceae87e12fed@mail.gmail.com> Message-ID: <20090810025814.9A8371CBC49D@idnmail.gen-info.osaka-u.ac.jp> Hi, On Fri, 7 Aug 2009 13:31:45 -0400 Diana Jaunzeikare wrote: > Hi all, > > Here is update on Google Summer of Code Bioruby PhyloXML project. I was > profiling and refactoring Bioruby PhyloXML Parser code and got 67% speed > increase. > > With profiling PhyloXML Writer the story is different. It takes 24minutes to > write the 1.5MB mollusca taxonomy tree and forever other larger files. > Again the bottleneck is bfs_shortest_path, which is called from > Tree#children method. It takes forever to just iterate over all the children > nodes. Bio::Tree uses graph algorithm implemented in the Bio::Pathway class. As you say, their implementations are naive and should be rewritten in the future. > To solve this I propose to save an array of the children of the node within > my PhyloXML::Node (which corresponds to a clade) class. This would also > ensure that when a phyloxml file is parsed and then written back, clades > would be the same order in the input and output files. How to maintain the array when modifying the tree, e.g. to add, delete, or replace some nodes and edges? 
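
One way to picture the bookkeeping this question asks about, as a minimal sketch only: every operation that changes the tree also updates a per-node child list, so looking up children never needs a graph search and the child order is preserved. IndexedTree and all of its method names are hypothetical; this is not the Bio::Tree or PhyloXML::Node API.

```ruby
# Hypothetical sketch (NOT Bio::Tree): a rooted tree that keeps parent and
# ordered child lists in sync on every edit.
class IndexedTree
  attr_reader :root

  def initialize(root)
    @root = root
    @parent = { root => nil }
    @children = { root => [] }
  end

  # Appends child as the last child of parent.
  def add_child(parent, child)
    raise ArgumentError, "unknown parent" unless @children.key?(parent)
    raise ArgumentError, "node already in tree" if @children.key?(child)
    @children[parent] << child
    @children[child] = []
    @parent[child] = parent
    child
  end

  # Removes node together with its whole subtree.
  def remove_subtree(node)
    raise ArgumentError, "cannot remove root" if node == @root
    raise ArgumentError, "unknown node" unless @children.key?(node)
    @children[@parent[node]].delete(node)
    stack = [node]
    until stack.empty?
      current = stack.pop
      stack.concat(@children.delete(current))
      @parent.delete(current)
    end
  end

  # Swaps old_node for new_node, keeping its position and its subtree.
  def replace_node(old_node, new_node)
    raise ArgumentError, "unknown node" unless @children.key?(old_node)
    raise ArgumentError, "node already in tree" if @children.key?(new_node)
    par = @parent.delete(old_node)
    kids = @children.delete(old_node)
    @parent[new_node] = par
    @children[new_node] = kids
    kids.each { |kid| @parent[kid] = new_node }
    if par.nil?
      @root = new_node
    else
      siblings = @children[par]
      siblings[siblings.index(old_node)] = new_node
    end
    new_node
  end

  # Returns a copy of the ordered child list; a hash lookup, no search.
  def children(node)
    @children.fetch(node).dup
  end

  def parent(node)
    @parent.fetch(node)
  end
end

tree = IndexedTree.new(:root)
tree.add_child(:root, :a)
tree.add_child(:root, :b)
tree.add_child(:a, :a1)
tree.replace_node(:a, :x)
tree.remove_subtree(:b)
p tree.children(:root)
p tree.children(:x)
```

The cost is that every mutating method has to touch both tables, which is the maintenance burden the question points at.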
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > Have a good weekend, > > Diana > From rozziite at gmail.com Mon Aug 10 15:00:54 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Mon, 10 Aug 2009 11:00:54 -0400 Subject: [BioRuby] GSOC PhyloXML profiling, bottleneck is Bio::Tree#children In-Reply-To: <20090810025814.9A8371CBC49D@idnmail.gen-info.osaka-u.ac.jp> References: <4057d3bf0908071031g39e64004t50abceae87e12fed@mail.gmail.com> <20090810025814.9A8371CBC49D@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <4057d3bf0908100800j544d0bcaqa063b0b9f5267a53@mail.gmail.com> On Sun, Aug 9, 2009 at 10:58 PM, Naohisa GOTO wrote: > Hi, > > On Fri, 7 Aug 2009 13:31:45 -0400 > Diana Jaunzeikare wrote: > > > Hi all, > > > > Here is update on Google Summer of Code Bioruby PhyloXML project. I was > > profiling and refactoring Bioruby PhyloXML Parser code and got 67% speed > > increase. > > > > With profiling PhyloXML Writer the story is different. It takes 24minutes > to > > write the 1.5MB mollusca taxonomy tree and forever other larger files. > > Again the bottleneck is bfs_shortest_path, which is called from > > Tree#children method. It takes forever to just iterate over all the > children > > nodes. > > Bio::Tree uses graph algorithm implemented in the Bio::Pathway class. > As you say, their implementations are naive and should be rewritten > in the future. > > > To solve this I propose to save an array of the children of the node > within > > my PhyloXML::Node (which corresponds to a clade) class. This would also > > ensure that when a phyloxml file is parsed and then written back, clades > > would be the same order in the input and output files. > > How to maintain the array when modifying the tree, e.g. to add, > delete, or replace some nodes and edges? > I didn't think of this issue. I guess then instead of adding that support in PhyloXML class I could rewrite Bio::Tree class, thus other projects based on Bio::Tree class would benefit in the future. 
Diana > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > > > Have a good weekend, > > > > Diana > > > From rozziite at gmail.com Mon Aug 10 19:54:55 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Mon, 10 Aug 2009 15:54:55 -0400 Subject: [BioRuby] GSOC: Bioruby PhyloXML update 12 Message-ID: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> Hi all, What was done last week: * Coding. Added changes so that now it is completely compatible with phyloxml schema 1.10 * Testing. added more unit tests (now writer has 9 tests, 26 assertions; parser: 40 tests, 134 assertions) * Profiling. I discovered that writer is really slow. The reason is the implementation of the Tree#children method, which does bfs_shortest_path algorithm. I had idea of tracking node children inside the node class as an array, but Naohisa Goto pointed out that then I would also have to deal with new node, edge addition, removal, etc. So better solution seems to, for now leave it as it is, and first improve Bio::Tree class. I am planning to do that after GSOC, since there is only one week left. * Refactored parser class, got around 3-fold speed increase. Now it can parse Metazoa taxonomy 33MB file in ~14 seconds (Ubuntu 9.04, ruby 1.8.7 [i486-linux], Intel Core 2 Duo P8600 @2.4GHz) Next week: * Create howto wiki page with code examples and usage. * Do more testing (Anybody has some more phyloxml xml files for me to test, other than those on phyloxml.org?) * Any other suggestions from you? Questions/issues: * Where should the HOWTO and code example documentation go? Seems reasonable for it to go here http://bioruby.open-bio.org/wiki/HOWTO:Trees and/or http://bioruby.open-bio.org/wiki/Phyloxml_tree_format (which is linked from previous link). * How does integration to the master branch goes? Is all i have to do is pull_request on github? 
* I have implemented PhyloXML::Sequence#to_biosequence, however it returns incomplete data, since info for Bio::Sequence#classification, Bio::Sequence#species, Bio::Sequence#division would come from PhyloXML::Taxonomy class, but it is not accessible from Sequence class. Should there be PhyloXML::Node#to_biosequence method which would gather information from both PhyloXML::Sequence and PhyloXML::Taxonomy? or maybe Bio::Sequence should not hold taxonomic information? You are all welcome to test my code. It is available on http://github.com/latvianlinuxgirl/bioruby/tree/dev Thanks, Diana From john.woods at marcottelab.org Fri Aug 7 18:23:22 2009 From: john.woods at marcottelab.org (John O. Woods) Date: Fri, 7 Aug 2009 13:23:22 -0500 Subject: [BioRuby] Feedback requested on OBO module Message-ID: <91656c3f0908071123n3f130f29p2a1a33c4db50214d@mail.gmail.com> Hi all, I'm pretty new to Ruby (and Rails), first of all. I wrote a simple reader for OBO files (attached). It does not write (currently), but it can interface with ActiveRecord and acts-as-dag to create a directed acyclic graph in a database. As a newbie, I suspect I'm doing some things wrong, or at least not as correctly as they could be done. For example, usually rails migrations that fail undo all of their changes; but I can't figure out how I'd write that. I attempted to do so, but it apparently does not work (have to set :save_immediately => true to get it to work). I'd love some feedback, if anyone would be willing to take a quick look at my code. I've also attached a test file from Flybase. An object can be created with: od = Bio::OboDoc.new("filename.obo") Thanks! 
John From ngoto at gen-info.osaka-u.ac.jp Thu Aug 13 08:21:39 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 13 Aug 2009 17:21:39 +0900 Subject: [BioRuby] Feedback requested on OBO module In-Reply-To: <91656c3f0908071123n3f130f29p2a1a33c4db50214d@mail.gmail.com> References: <91656c3f0908071123n3f130f29p2a1a33c4db50214d@mail.gmail.com> Message-ID: <20090813082140.0536D1CBC413@idnmail.gen-info.osaka-u.ac.jp> Hi, In this mailing list, attachment files are normally dropped. To continue discussions, if the code is short, desribe it in the message body. If it isn't short, put your code on your web site or on a code repository (for example http://gist.github.com/ ) and post only the URL. Thanks. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Fri, 7 Aug 2009 13:23:22 -0500 "John O. Woods" wrote: > Hi all, > I'm pretty new to Ruby (and Rails), first of all. I wrote a simple reader > for OBO files (attached). It does not write (currently), but it can > interface with ActiveRecord and acts-as-dag to create a directed acyclic > graph in a database. > > As a newbie, I suspect I'm doing some things wrong, or at least not as > correctly as they could be done. For example, usually rails migrations that > fail undo all of their changes; but I can't figure out how I'd write that. I > attempted to do so, but it apparently does not work (have to set > :save_immediately => true to get it to work). > > I'd love some feedback, if anyone would be willing to take a quick look at > my code. > > I've also attached a test file from Flybase. An object can be created with: > od = Bio::OboDoc.new("filename.obo") > > Thanks! 
> John > From rozziite at gmail.com Thu Aug 13 21:50:01 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Thu, 13 Aug 2009 17:50:01 -0400 Subject: [BioRuby] GSOC: Bioruby PhyloXML update 12 In-Reply-To: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> References: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> Message-ID: <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> Hi all, I added here a HOWTO for BioRuby PhyloXML implementation https://www.nescent.org/wg_phyloinformatics/BioRuby_PhyloXML_HowTo_documentation Let me know, what you think Diana On Mon, Aug 10, 2009 at 3:54 PM, Diana Jaunzeikare wrote: > Hi all, > > What was done last week: > > * Coding. Added changes so that now it is completely compatible with > phyloxml schema 1.10 > > * Testing. added more unit tests (now writer has 9 tests, 26 assertions; > parser: 40 tests, 134 assertions) > > * Profiling. I discovered that writer is really slow. The reason is the > implementation of the Tree#children method, which does bfs_shortest_path > algorithm. I had idea of tracking node children inside the node class as an > array, but Naohisa Goto pointed out that then I would also have to deal with > new node, edge addition, removal, etc. So better solution seems to, for now > leave it as it is, and first improve Bio::Tree class. I am planning to do > that after GSOC, since there is only one week left. > > * Refactored parser class, got around 3-fold speed increase. Now it can > parse Metazoa taxonomy 33MB file in ~14 seconds (Ubuntu 9.04, ruby 1.8.7 > [i486-linux], Intel Core 2 Duo P8600 @2.4GHz) > > Next week: > > * Create howto wiki page with code examples and usage. > * Do more testing (Anybody has some more phyloxml xml files for me to test, > other than those on phyloxml.org?) > * Any other suggestions from you? > > Questions/issues: > > * Where should the HOWTO and code example documentation go? 
Seems > reasonable for it to go here > http://bioruby.open-bio.org/wiki/HOWTO:Trees and/or > http://bioruby.open-bio.org/wiki/Phyloxml_tree_format (which is linked > from previous link). > > * How does integration to the master branch goes? Is all i have to do is > pull_request on github? > > * I have implemented PhyloXML::Sequence#to_biosequence, however it returns > incomplete data, since info for Bio::Sequence#classification, > Bio::Sequence#species, Bio::Sequence#division would come from > PhyloXML::Taxonomy class, but it is not accessible from Sequence class. > Should there be PhyloXML::Node#to_biosequence method which would gather > information from both PhyloXML::Sequence and PhyloXML::Taxonomy? or maybe > Bio::Sequence should not hold taxonomic information? > > You are all welcome to test my code. It is available on > http://github.com/latvianlinuxgirl/bioruby/tree/dev > > Thanks, > > Diana > From czmasek at burnham.org Fri Aug 14 18:41:13 2009 From: czmasek at burnham.org (Christian M Zmasek) Date: Fri, 14 Aug 2009 11:41:13 -0700 Subject: [BioRuby] GSOC: Bioruby PhyloXML update 12 In-Reply-To: <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> References: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> Message-ID: <4A85AFC9.4020106@burnham.org> Very nice!! I guess this will be placed/copied to the BioRuby tutorial (at http://bioruby.open-bio.org/wiki/Tutorial) at one point, correct? A very tiny, minuscule even, issue I noticed (and maybe even a problem of my web browser): A the very end, the blue box seems broken at the "wrong place" -- so to speak, i.e. "#Once we know whats there, lets output just sequences phyloxml.other[0].children.each do |node| puts node.value end" and "#=> # #acgtcgcggcccgtggaagtcctctcct #aggtcgcggcctgtggaagtcctctcct #taaatcgc--cccgtgg-agtccc-cct" should be in the same box, but they appear to be in different ones. 
Christian Diana Jaunzeikare wrote: > Hi all, > > I added here a HOWTO for BioRuby PhyloXML implementation > > https://www.nescent.org/wg_phyloinformatics/BioRuby_PhyloXML_HowTo_documentation > > Let me know, what you think > > Diana > > On Mon, Aug 10, 2009 at 3:54 PM, Diana Jaunzeikare > wrote: > > Hi all, > > What was done last week: > > * Coding. Added changes so that now it is completely compatible with > phyloxml schema 1.10 > > * Testing. added more unit tests (now writer has 9 tests, 26 > assertions; parser: 40 tests, 134 assertions) > > * Profiling. I discovered that writer is really slow. The reason is > the implementation of the Tree#children method, which does > bfs_shortest_path algorithm. I had idea of tracking node children > inside the node class as an array, but Naohisa Goto pointed out that > then I would also have to deal with new node, edge addition, > removal, etc. So better solution seems to, for now leave it as it > is, and first improve Bio::Tree class. I am planning to do that > after GSOC, since there is only one week left. > > * Refactored parser class, got around 3-fold speed increase. Now it > can parse Metazoa taxonomy 33MB file in ~14 seconds (Ubuntu 9.04, > ruby 1.8.7 [i486-linux], Intel Core 2 Duo P8600 @2.4GHz) > > Next week: > > * Create howto wiki page with code examples and usage. > * Do more testing (Anybody has some more phyloxml xml files for me > to test, other than those on phyloxml.org ?) > * Any other suggestions from you? > > Questions/issues: > > * Where should the HOWTO and code example documentation go? Seems > reasonable for it to go here > http://bioruby.open-bio.org/wiki/HOWTO:Trees and/or > http://bioruby.open-bio.org/wiki/Phyloxml_tree_format (which is > linked from previous link). > > * How does integration to the master branch goes? Is all i have to > do is pull_request on github? 
> > * I have implemented PhyloXML::Sequence#to_biosequence, however it > returns incomplete data, since info for > Bio::Sequence#classification, Bio::Sequence#species, > Bio::Sequence#division would come from PhyloXML::Taxonomy class, but > it is not accessible from Sequence class. Should there be > PhyloXML::Node#to_biosequence method which would gather information > from both PhyloXML::Sequence and PhyloXML::Taxonomy? or maybe > Bio::Sequence should not hold taxonomic information? > > You are all welcome to test my code. It is available on > http://github.com/latvianlinuxgirl/bioruby/tree/dev > > Thanks, > > Diana > > From czmasek at burnham.org Fri Aug 14 18:52:04 2009 From: czmasek at burnham.org (Christian M Zmasek) Date: Fri, 14 Aug 2009 11:52:04 -0700 Subject: [BioRuby] GSOC: Bioruby PhyloXML update 12 In-Reply-To: <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> References: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> Message-ID: <4A85B254.8020601@burnham.org> Hi, Diana: > > * How does integration to the master branch goes? Is all i have to > do is pull_request on github? I think so (never done this myself, though). Maybe the git experts can answer this? > * I have implemented PhyloXML::Sequence#to_biosequence, however it > returns incomplete data, since info for > Bio::Sequence#classification, Bio::Sequence#species, > Bio::Sequence#division would come from PhyloXML::Taxonomy class, but > it is not accessible from Sequence class. Should there be > PhyloXML::Node#to_biosequence method which would gather information > from both PhyloXML::Sequence and PhyloXML::Taxonomy? or maybe > Bio::Sequence should not hold taxonomic information? Good point. 
Personally, I would leave your PhyloXML::Sequence#to_biosequence as it is (and add a warning about this to the documentation) and, in addition, create PhyloXML::Node#to_biosequence -- although, I would not call it to_biosequence but maybe something like extract_biosequence. Needless to say, that an almost infinite number of "solutions" to this exists, without a clear "winner" (in my opinion). Christian From ngoto at gen-info.osaka-u.ac.jp Sat Aug 15 10:38:24 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Sat, 15 Aug 2009 19:38:24 +0900 Subject: [BioRuby] GSOC: Bioruby PhyloXML update 12 In-Reply-To: <4A85B254.8020601@burnham.org> References: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> <4A85B254.8020601@burnham.org> Message-ID: <20090815103824.DEC901CBC494@idnmail.gen-info.osaka-u.ac.jp> Hi, On Fri, 14 Aug 2009 11:52:04 -0700 Christian M Zmasek wrote: > Hi, Diana: > > > > > * How does integration to the master branch goes? Is all i have to > > do is pull_request on github? > > I think so (never done this myself, though). Maybe the git experts can > answer this? I'll soon release 1.3.1, bug-fix release of 1.3.0. After that, the PhyloXML support and some other new functions such as Chromatogram class http://github.com/aunderwo/bioruby/tree will be merged, and will be released as 1.4.0. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From andrew.j.grimm at gmail.com Sun Aug 16 10:42:23 2009 From: andrew.j.grimm at gmail.com (Andrew Grimm) Date: Sun, 16 Aug 2009 20:42:23 +1000 Subject: [BioRuby] Code style question Message-ID: The Readme for developers mentions that if __FILE__ == $0 is deprecated for testing. In lib/bio/db/fasta.rb , there's a __FILE__ == $0 that isn't used for testing, but for demonstrating what the module's supposed to do. Is this kind of coding non-deprecated? 
Andrew From ngoto at gen-info.osaka-u.ac.jp Sun Aug 16 12:57:10 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto) Date: Sun, 16 Aug 2009 21:57:10 +0900 Subject: [BioRuby] Code style question In-Reply-To: References: Message-ID: <20090816215623.4B48.EEF6E030@gen-info.osaka-u.ac.jp> Hi, > The Readme for developers mentions that > > if __FILE__ == $0 > > is deprecated for testing. > > In lib/bio/db/fasta.rb , there's a __FILE__ == $0 that isn't used for > testing, but for demonstrating what the module's supposed to do. Is > this kind of coding non-deprecated? > > Andrew Yes, deprecated for new codes. Old codes before the style have been deprecated may be existed. We gradually move and rewrite them to unit tests and/or sample codes. -- Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From rozziite at gmail.com Sun Aug 16 20:09:49 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Sun, 16 Aug 2009 16:09:49 -0400 Subject: [BioRuby] GSOC: Bioruby PhyloXML update 12 In-Reply-To: <4A85B254.8020601@burnham.org> References: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> <4A85B254.8020601@burnham.org> Message-ID: <4057d3bf0908161309p634fd95bt24b81debe9b2a669@mail.gmail.com> On Fri, Aug 14, 2009 at 2:52 PM, Christian M Zmasek wrote: > Hi, Diana: > > >> * How does integration to the master branch goes? Is all i have to >> do is pull_request on github? >> > > I think so (never done this myself, though). Maybe the git experts can > answer this? > > > * I have implemented PhyloXML::Sequence#to_biosequence, however it >> returns incomplete data, since info for >> Bio::Sequence#classification, Bio::Sequence#species, >> Bio::Sequence#division would come from PhyloXML::Taxonomy class, but >> it is not accessible from Sequence class. Should there be >> PhyloXML::Node#to_biosequence method which would gather information >> from both PhyloXML::Sequence and PhyloXML::Taxonomy? 
or maybe >> Bio::Sequence should not hold taxonomic information? >> > > Good point. Personally, I would leave your > PhyloXML::Sequence#to_biosequence as it is (and add a warning about this to > the documentation) and, in addition, create PhyloXML::Node#to_biosequence -- > although, I would not call it to_biosequence but maybe something like > extract_biosequence. > Needless to say, that an almost infinite number of "solutions" to this > exists, without a clear "winner" (in my opinion). > > Christian > > I added PhyloXML::Node#extract_biosequence method. It first calls sequence#to_biosequence and then adds additional information from taxonomy elements. Diana From czmasek at burnham.org Mon Aug 17 01:57:06 2009 From: czmasek at burnham.org (Christian Zmasek) Date: Sun, 16 Aug 2009 18:57:06 -0700 Subject: [BioRuby] GSOC: Bioruby PhyloXML update 12 In-Reply-To: <4057d3bf0908161309p634fd95bt24b81debe9b2a669@mail.gmail.com> References: <4057d3bf0908101254l6d2b793aibed738981a79680a@mail.gmail.com> <4057d3bf0908131450v54583bc2ve9bd3f0057d634e0@mail.gmail.com> <4A85B254.8020601@burnham.org>, <4057d3bf0908161309p634fd95bt24b81debe9b2a669@mail.gmail.com> Message-ID: Sounds very good! ________________________________________ From: Diana Jaunzeikare [rozziite at gmail.com] Sent: Sunday, August 16, 2009 1:09 PM To: Christian Zmasek Cc: bioruby at lists.open-bio.org; Pjotr Prins; Naohisa GOTO; Phyloinformatics Group Subject: Re: GSOC: Bioruby PhyloXML update 12 On Fri, Aug 14, 2009 at 2:52 PM, Christian M Zmasek > wrote: Hi, Diana: * How does integration to the master branch goes? Is all i have to do is pull_request on github? I think so (never done this myself, though). Maybe the git experts can answer this? * I have implemented PhyloXML::Sequence#to_biosequence, however it returns incomplete data, since info for Bio::Sequence#classification, Bio::Sequence#species, Bio::Sequence#division would come from PhyloXML::Taxonomy class, but it is not accessible from Sequence class. 
Should there be PhyloXML::Node#to_biosequence method which would gather information from both PhyloXML::Sequence and PhyloXML::Taxonomy? or maybe Bio::Sequence should not hold taxonomic information? Good point. Personally, I would leave your PhyloXML::Sequence#to_biosequence as it is (and add a warning about this to the documentation) and, in addition, create PhyloXML::Node#to_biosequence -- although, I would not call it to_biosequence but maybe something like extract_biosequence. Needless to say, that an almost infinite number of "solutions" to this exists, without a clear "winner" (in my opinion). Christian I added PhyloXML::Node#extract_biosequence method. It first calls sequence#to_biosequence and then adds additional information from taxonomy elements. Diana From rozziite at gmail.com Mon Aug 17 15:20:45 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Mon, 17 Aug 2009 11:20:45 -0400 Subject: [BioRuby] Bioruby PhyloXML pencils down Message-ID: <4057d3bf0908170820g552389d7r958a52af3f62b7d5@mail.gmail.com> Hi all, So the Google Summer of Code is over. I had a great time learning and improving my programming skills in real world environment. Especially I am glad that I learned more of "doing things in the ruby way". I think that writing open source will definitely help me to land with a better job after graduation since the result is something i can actually show to the potential employers as example of my work. Here is the summary: * Wrote Bio::PhyloXML::Parser class which is responsible for parsing xml files in phyloxml format. It inherits from Bio::Tree and thus have all the functionality Bio::Tree class have. * Wrote Bio::PhyloXML::Writer class which is responsible for writing (hopefully) valid xml files against phyloxml schema. For most part it produces xml files that validate against phyloxml schema, but probably it is possible to produce artificial example when it is not, since not everywhere i have checks for valid input or data type. 
  This is where I can improve in future.

* Since these classes are meant to deal with big data files, I changed
  code several times to make it faster. I managed to improve the speed of
  the Parser by a lot, but not for the Writer. In order to improve speed
  of the Writer, I will first improve the underlying Bio::Tree class
  (thus others also might benefit from faster Bio::Tree class).

* Bio::PhyloXML module holds classes corresponding to complex phyloxml
  elements. The design of the classes is here:
  http://wiki.github.com/latvianlinuxgirl/bioruby

* The code is here:
  http://github.com/latvianlinuxgirl/bioruby/tree/dev

* The project page, where I will probably continue to post future plans,
  is here:
  https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby

* Code examples are here:
  https://www.nescent.org/wg_phyloinformatics/BioRuby_PhyloXML_HowTo_documentation
  The same information is also incorporated in the bioruby tutorial.

I will definitely stay involved in BioRuby and open source community. The
next step will be to improve Bio::Tree class, especially Bio::Tree#parent
and Bio::Tree#children methods. Next semester I will be taking Algorithms
class, so maybe I will be able to apply that knowledge to this task. And
of course I plan to maintain my code, since when people will start using
the code, bugs will start to creep up.

I would like to thank my mentors Christian and Pjotr for support and
timely answers to my questions. Also thanks to Naohisa and others who
answered my questions and helped me.

Diana

From john.woods at marcottelab.org Sat Aug 22 00:04:03 2009
From: john.woods at marcottelab.org (John O. Woods)
Date: Fri, 21 Aug 2009 19:04:03 -0500
Subject: [BioRuby] Batch Entrez search
Message-ID: <91656c3f0908211704p5b97f909ve168f4fcb7873fee@mail.gmail.com>

Right now I'm doing, one at a time:

    n.search("gene", "FBgn0041721")

I'm trying to get the gene ID corresponding to a Flybase gene ID. This
takes forever, since you can only do one every three seconds.
Also, sometimes it returns two items, and one might be deprecated--but
there's no way to tell.

Is there a way to batch search with BioRuby for a whole bunch of Flybase
IDs?

Many thanks,
John

Marcotte Lab | The University of Texas at Austin

From tomoakin at kenroku.kanazawa-u.ac.jp Mon Aug 24 02:21:18 2009
From: tomoakin at kenroku.kanazawa-u.ac.jp (Tomoaki NISHIYAMA)
Date: Mon, 24 Aug 2009 11:21:18 +0900
Subject: [BioRuby] Batch Entrez search
In-Reply-To: <91656c3f0908211704p5b97f909ve168f4fcb7873fee@mail.gmail.com>
References: <91656c3f0908211704p5b97f909ve168f4fcb7873fee@mail.gmail.com>
Message-ID:

Hi,

> Is there a way to batch search with BioRuby for a whole bunch of
> Flybase
> IDs?

According to
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpgene&part=genefaq
the data for entrez gene are available in
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz

If you really need a lot, perhaps it's better to download that file
(about 93 Mbytes).
(It contains data for other organisms, which is not necessary, but
does not take forever)
The format seems simple enough that you can easily get the gene ID for
a flybase ID.

#Format: tax_id GeneID Symbol LocusTag Synonyms dbXrefs chromosome
map_location description type_of_gene Symbol_from_nomenclature_authority
Full_name_from_nomenclature_authority Nomenclature_status
Other_designations Modification_date
(tab is used as a separator, pound sign - start of a comment)
--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

From marc.hoeppner at molbio.su.se Mon Aug 24 05:36:48 2009
From: marc.hoeppner at molbio.su.se (Marc Hoeppner)
Date: Mon, 24 Aug 2009 07:36:48 +0200
Subject: [BioRuby] Batch Entrez search
In-Reply-To:
References: <91656c3f0908211704p5b97f909ve168f4fcb7873fee@mail.gmail.com>
Message-ID: <4A9226F0.4050206@molbio.su.se>

Hi,

I suppose for FlyBase genes you could also use the ruby Ensembl API.
Something like:

    require 'ensembl'

    Ensembl::Core::DBConnection.connect('drosophila_melanogaster', 55)

    IO.foreach('my_infile') do |line|
      flybase_id = line.chomp  # strip the trailing newline before the lookup
      gene = Ensembl::Core::Gene.find_by_stable_id(flybase_id)
      gene.all_xrefs.each do |xref|
        puts xref
      end
    end

Well, you get the idea. The methods are well documented in the
corresponding API, but when in doubt I can offer some help, too.

P.S.: To make it real easy you could also use the BioMart on
www.ensembl.org - unless you need this to be a script.

Cheers,

Marc

> Hi,
>
>> Is there a way to batch search with BioRuby for a whole bunch of Flybase
>> IDs?
>
>
> According to
> http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpgene&part=genefaq
> the data for entrez gene are available in
> ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
>
> If you really need a lot, perhaps it's better to download that file
> (about 93 Mbytes).
> (It contains data for other organisms, which is not necessary, but
> does not take forever)
> The format seems simple enough that you can easily get the gene ID for
> a flybase ID.
>
> #Format: tax_id GeneID Symbol LocusTag Synonyms dbXrefs chromosome
> map_location description type_of_gene
> Symbol_from_nomenclature_authority
> Full_name_from_nomenclature_authority Nomenclature_status
> Other_designations Modification_date
> (tab is used as a separator, pound sign - start of a comment)
>

--

Marc P. Hoeppner
PhD student
Department of Molecular Biology and Functional Genomics
Stockholm University, 10691 Stockholm, Sweden

marc.hoeppner at molbio.su.se
Tel: +46 (0)8 - 164195

From john.woods at marcottelab.org Mon Aug 24 12:44:31 2009
From: john.woods at marcottelab.org (John O. Woods)
Date: Mon, 24 Aug 2009 07:44:31 -0500
Subject: [BioRuby] Batch Entrez search
In-Reply-To: <4A9226F0.4050206@molbio.su.se>
References: <91656c3f0908211704p5b97f909ve168f4fcb7873fee@mail.gmail.com>
 <4A9226F0.4050206@molbio.su.se>
Message-ID: <91656c3f0908240544h20577825n38817c7667081b90@mail.gmail.com>

Unfortunately, many Flybase IDs seem to be missing from BioMart, which
leads me to think they'd also be absent from Ensembl.

On Mon, Aug 24, 2009 at 12:36 AM, Marc Hoeppner wrote:

> Hi,
>
> I suppose for FlyBase genes you could also use the ruby Ensembl API.
>
> Something like:
>
>     require 'ensembl'
>
>     Ensembl::Core::DBConnection.connect('drosophila_melanogaster', 55)
>
>     IO.foreach('my_infile') do |line|
>       flybase_id = line.chomp  # strip the trailing newline before the lookup
>       gene = Ensembl::Core::Gene.find_by_stable_id(flybase_id)
>       gene.all_xrefs.each do |xref|
>         puts xref
>       end
>     end
>
> Well, you get the idea. The methods are well documented in the
> corresponding API, but when in doubt I can offer some help, too.
>
> P.S.: To make it real easy you could also use the BioMart on
> www.ensembl.org - unless you need this to be a script.
>
> Cheers,
>
> Marc
>
>> Hi,
>>
>>> Is there a way to batch search with BioRuby for a whole bunch of Flybase
>>> IDs?
>>>
>>
>>
>> According to
>> http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpgene&part=genefaq
>> the data for entrez gene are available in
>> ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
>>
>> If you really need a lot, perhaps it's better to download that file (about
>> 93 Mbytes).
>> (It contains data for other organisms, which is not necessary, but does
>> not take forever)
>> The format seems simple enough that you can easily get the gene ID for a
>> flybase ID.
>>
>> #Format: tax_id GeneID Symbol LocusTag Synonyms dbXrefs chromosome
>> map_location description type_of_gene Symbol_from_nomenclature_authority
>> Full_name_from_nomenclature_authority Nomenclature_status
>> Other_designations Modification_date (tab is used as a separator,
>> pound sign - start of a comment)
>>
>>
>
> --
>
> Marc P. Hoeppner
> PhD student
> Department of Molecular Biology and Functional Genomics
> Stockholm University, 10691 Stockholm, Sweden
>
> marc.hoeppner at molbio.su.se
> Tel: +46 (0)8 - 164195
>
>

From biopython at maubp.freeserve.co.uk Thu Aug 27 10:46:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 27 Aug 2009 11:46:17 +0100
Subject: [BioRuby] FASTQ in BioRuby?
Message-ID: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com>

Hello BioRuby team,

I am one of the Biopython developers, and together with Peter Rice
(EMBOSS) and Chris Fields (BioPerl) we have been coordinating
how these Open Bioinformatics Foundation (OBF) projects will
interpret the FASTQ file format used in next generation sequencing.

This includes standardising our naming conventions for the original
Sanger FASTQ variant, and the later Solexa/early Illumina, and
recent Illumina 1.3+ variants. We have also put together a set of
test files, including reference conversions between the different
FASTQ variants.

We would be delighted to get BioRuby involved. I tried to contact
Naohisa Goto about this directly last month, but perhaps my email
did not arrive. If BioRuby is working on (or planning to work on)
FASTQ support, please could the developers concerned sign up
to the OBF joint mailing list where we have been discussing this:
http://lists.open-bio.org/mailman/listinfo/open-bio-l

Thank you,

Peter

From ngoto at gen-info.osaka-u.ac.jp Thu Aug 27 11:20:46 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Thu, 27 Aug 2009 20:20:46 +0900
Subject: [BioRuby] [Open-bio-l] FASTQ in BioRuby?
In-Reply-To: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com>
References: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com>
Message-ID: <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp>

Hello Peter,

Sorry for responding so late. I've subscribed to open-bio-l,
but I could not actively join the discussions, because of
my lack of knowledge about FASTQ.

There is a small primitive code attempt to support FASTQ format
in BioRuby, which is not yet merged in the main repository.
http://github.com/ngoto/bioruby/tree/master

Recently, Anthony Underwood contributed chromatogram classes
to support SCF/ABI formats, which will be merged soon,
after the bug-fix maintenance release of 1.3.1.
http://github.com/aunderwo/bioruby/tree/master

I'm now planning to rewrite my FASTQ code to be consistent
with the chromatogram classes, and with the open-bio standards.

Thank you,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Thu, 27 Aug 2009 11:46:17 +0100
Peter wrote:

> Hello BioRuby team,
>
> I am one of the Biopython developers, and together with Peter Rice
> (EMBOSS) and Chris Fields (BioPerl) we have been coordinating
> how these Open Bioinformatics Foundation (OBF) projects will
> interpret the FASTQ file format used in next generation sequencing.
>
> This includes standardising our naming conventions for the original
> Sanger FASTQ variant, and the later Solexa/early Illumina, and
> recent Illumina 1.3+ variants. We have also put together a set of
> test files, including reference conversions between the different
> FASTQ variants.
>
> We would be delighted to get BioRuby involved. I tried to contact
> Naohisa Goto about this directly last month, but perhaps my email
> did not arrive. If BioRuby is working on (or planning to work on)
> FASTQ support, please could the developers concerned sign up
> to the OBF joint mailing list where we have been discussing this:
> http://lists.open-bio.org/mailman/listinfo/open-bio-l
>
> Thank you,
>
> Peter
> _______________________________________________
> Open-Bio-l mailing list
> Open-Bio-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/open-bio-l

From biopython at maubp.freeserve.co.uk Thu Aug 27 12:08:28 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 27 Aug 2009 13:08:28 +0100
Subject: [BioRuby] [Open-bio-l] FASTQ in BioRuby?
In-Reply-To: <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp>
References: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com>
 <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <320fb6e00908270508o485ba990k96b8bd3b722c09b6@mail.gmail.com>

On Thu, Aug 27, 2009 at 12:20 PM, Naohisa GOTO wrote:
>
> Hello Peter,
>
> Sorry for responding so late. I've subscribed to open-bio-l,
> but I could not actively join the discussions, because of
> my lack of knowledge about FASTQ.
>
> There is a small primitive code attempt to support FASTQ format
> in BioRuby, which is not yet merged in the main repository.
> http://github.com/ngoto/bioruby/tree/master
>
> Recently, Anthony Underwood contributed chromatogram classes
> to support SCF/ABI formats, which will be merged soon,
> after the bug-fix maintenance release of 1.3.1.
> http://github.com/aunderwo/bioruby/tree/master
>
> I'm now planning to rewrite my FASTQ code to be consistent
> with the chromatogram classes, and with the open-bio standards.
>
> Thank you,
>
> Naohisa Goto

That is excellent news :)

I'm not sure how format names work in BioRuby, but if you do
have a set of format names as strings as we do in Biopython,
BioPerl and EMBOSS it would be nice to be consistent here:

http://biopython.org/wiki/SeqIO
http://bioperl.org/wiki/HOWTO:SeqIO
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

There is some basic information on wikipedia, but this does
not go into detail:
http://www.bioperl.org/wiki/FASTQ_sequence_format

Please feel free to ask any questions about how we are
interpreting things.

Thank you,

Peter

From ngoto at gen-info.osaka-u.ac.jp Mon Aug 31 14:22:11 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 31 Aug 2009 23:22:11 +0900
Subject: [BioRuby] SIM4 parser
In-Reply-To: <0C3F8576-899A-426E-869A-C9DCF8F47868@kenroku.kanazawa-u.ac.jp>
References: <5510B566-E723-4AEE-8DEC-63BE1ABD9F19@kenroku.kanazawa-u.ac.jp>
 <0C3F8576-899A-426E-869A-C9DCF8F47868@kenroku.kanazawa-u.ac.jp>
Message-ID: <20090831142212.3468E1CBC58D@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Sun, 5 Jul 2009 20:28:33 +0900
Tomoaki NISHIYAMA wrote:

> Hi,
>
> > A way to resolve may be to check if the start address matches the
> > address that
> > was specified in the previous section stating the ranges of the
> > matches.
> > I'm considering implementing this way.
>
>
> A working code is obtained and a diff relative to 1.3.0 is attached.
> The code was changed to parse alignment only after the SegmentPairs
> are prepared

The bug is fixed.
http://github.com/bioruby/bioruby/commit/02d531e36ecf789f232cf3e05f85391b60279f00

Thank you for sending a patch. I didn't fully use your patch,
but it was very helpful.

> During this work, I also noticed that the semantics of the structure
> might be misunderstood:
> 1. The mark after the match, either "->", "<-", "--", or "=="
>    does not represent the direction of the exon, but indicates
>    the presumed direction of the intron following the exon.
>    "--" corresponds in case part of the intervening sequence
>    and midline is shown and
>    "==" is for cases without information for intervening sequence.
>    I do not understand how these patterns are determined by SIM4,
>    but "->" and "<-" can be estimated based on GU-AG rule.
>    Since these directions are essentially assigned to the
>    introns rather than exons, it might be inappropriate to assign
>    these strings to the exon. There is actually rare cases that
>    introns in different direction is deduced: in such case
>    assuming the direction of the exon is same as the 3' intron
>    rather than 5' intron of the exon is not desired. So, it seems
>    arguable to make directions for exon deprecated.
>
> From current state of the parser, I bet there are few people using
> bioruby to parse sim4 alignment output, and changing the interface
> is acceptable this time.

You are right. However, for now, to keep compatibility, the method
Bio::Sim4::Report::SegmentPair#direction is still in use.
In the next major release (1.4.0?), the method will be deprecated,
and another method would be added.

--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
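
To illustrate the point discussed above, that the mark after each exon
line describes the intron that follows rather than the exon itself, here
is a minimal sketch. It is not the Bio::Sim4 implementation: the names
SIM4_EXON_LINE, Sim4Exon, Sim4Intron and exons_and_introns are
hypothetical, and the exon line layout ("1-100  (2001-2100)   98% ->",
with the second range on the genomic sequence) is an assumption about
sim4's default output. Reverse-strand coordinate handling is ignored.

```ruby
# Hypothetical helper, NOT part of Bio::Sim4. It assumes sim4 exon lines of
# the form "1-100  (2001-2100)   98% ->" (seq1 range, seq2 range, identity,
# optional mark) and attaches each mark to the intron that follows the exon.
SIM4_EXON_LINE = /\A(\d+)-(\d+)\s+\((\d+)-(\d+)\)\s+(\d+)%\s*(->|<-|--|==)?\z/

Sim4Exon   = Struct.new(:from1, :to1, :from2, :to2, :percent)
Sim4Intron = Struct.new(:from2, :to2, :mark)

def exons_and_introns(text)
  exons = []
  marks = []
  text.each_line do |line|
    m = SIM4_EXON_LINE.match(line.strip)
    next unless m                      # skip headers and alignment lines
    exons << Sim4Exon.new(*m[1, 5].map { |s| s.to_i })
    marks << m[6]                      # nil when no mark follows the exon
  end
  introns = []
  (0...exons.size - 1).each do |i|
    # the intron spans the seq2 gap between exon i and exon i + 1
    introns << Sim4Intron.new(exons[i].to2 + 1, exons[i + 1].from2 - 1, marks[i])
  end
  [exons, introns]
end

# Made-up sample lines, for illustration only.
sample = <<EOS
1-100  (2001-2100)   98% ->
101-250  (2501-2650)   100% <-
251-300  (3001-3050)   96%
EOS

_exons, introns = exons_and_introns(sample)
introns.each { |i| puts "#{i.from2}-#{i.to2} #{i.mark}" }
```

Keeping the mark on a separate intron record means two introns flanking
the same exon can carry different directions, which is the rare case
mentioned in the quoted message.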