From anurag08priyam at gmail.com Thu Jun 3 05:00:06 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Thu, 3 Jun 2010 14:30:06 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Update Message-ID: Hello all, I know this update is coming quite late. Sorry for holding this back for so long. From now on I will be updating this list weekly on my progress. Just to keep everyone in the loop, [1] is my project page. What has been done? Till now I have been able to do a significant amount of work on the NeXML parser. The parser recognizes otus, otu and trees. The trees implementation is not complete as per the NeXML schema. Trees with multiple rootings, coalescent trees and networks remain to be done. Problems Faced: Initially it was decided to stream parse any NeXML document as DOM parsing would be slow for larger documents. But with NeXML's non linear design, streaming seems non natural and proves to be a little difficult. Currently, I have written a wrapper over the StAX parsing API of libxml but the entire document is parsed in one go; at the start. Current git head[2] can be built and the code tested out. A tutorial( kind of ) on how to use the NeXML can be found here[3]. [1] https://www.nescent.org/wg_phyloinformatics/Category:NeXML_and_RDF_API_for_BioRuby [2] http://github.com/yeban/bioruby [3] https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From anurag08priyam at gmail.com Fri Jun 4 04:39:34 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Fri, 4 Jun 2010 14:09:34 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Tree with Multiple rootings. Message-ID: Hello all, NeXML allows for trees with multiple rootings. In the NeXML lib trees are represented by Bio::NeXML::Tree which inherits from Bio::Tree. This allows for the usage of the excellent Bio::Tree framework for manipulating NeXML trees. 
However, Bio::Tree class supports only one root node. There are a couple of functions that require the presence of a root node: parent, children, descendants, ancestors, lowest_common_ancestor. Now, these functions can take a root node as a parameter. So it is possible to extend the current framework to work with trees with multiple root nodes. Though this may not be required, a possibility is to add the multiple root functionality to Bio::Tree class itself. Currently, I am adding multiple root support to Bio::NeXML::Tree class. If need be we can move the functionality to Bio::Tree. Anything? -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From rutgeraldo at gmail.com Fri Jun 4 10:21:27 2010 From: rutgeraldo at gmail.com (Rutger Vos) Date: Fri, 4 Jun 2010 15:21:27 +0100 Subject: [BioRuby] [GSoC][NeXML and RDF API] Tree with Multiple rootings. In-Reply-To: References: Message-ID: Hi Anurag, in practice I haven't actually seen trees with multiple rootings being used much, so it might not be urgent that this moves to the bioruby core. My main worry would be in picking the "right" root node to expose to the core api. I think that it should be the node from which all other nodes can be visited in a recursive traversal (which I expect client code to do), as opposed to a node that has been indicated using an XML attribute to be the root, but isn't in terms of the actual topology that emerges from the node and edge tables. However, I'm curious to hear other people's opinions whether a flag (e.g. "is_root") might be added Bio::Tree::Node, and a "get_roots" method in Bio::Tree that returns a list of roots that typically only holds the value of the "root" attribute, but could potentially have multiple rootings. Rutger On Fri, Jun 4, 2010 at 9:39 AM, Anurag Priyam wrote: > Hello all, > > NeXML allows for trees with multiple rootings. 
In the NeXML lib trees are > represented by Bio::NeXML::Tree which inherits from Bio::Tree. This allows > for the usage of the excellent Bio::Tree framework for manipulating NeXML > trees. However, Bio::Tree class supports only one root node. > > There are a couple of functions that require the presence of a root node: > parent, children, descendants, ancestors, lowest_common_ancestor. Now, these > functions can take a root node as a parameter. So it is possible to extend > the current framework to work with trees with multiple root nodes. > > Though this may not be required, a possibility is to add the multiple root > functionality to Bio::Tree class itself. Currently, I am adding multiple > root support to Bio::NeXML::Tree class. If need be we can move the > functionality to Bio::Tree. > > Anything? > > -- > Anurag Priyam, > 2nd Year Undergraduate, > Department of Mechanical Engineering, > IIT Kharagpur. > +91-9775550642 > -- Dr. Rutger A. Vos School of Biological Sciences Philip Lyle Building, Level 4 University of Reading Reading RG6 6BX United Kingdom Tel: +44 (0) 118 378 7535 http://www.nexml.org http://rutgervos.blogspot.com From hlapp at drycafe.net Fri Jun 4 15:09:11 2010 From: hlapp at drycafe.net (Hilmar Lapp) Date: Fri, 4 Jun 2010 15:09:11 -0400 Subject: [BioRuby] [GSoC][NeXML and RDF API] Tree with Multiple rootings. In-Reply-To: References: Message-ID: <01FB94FA-D962-4624-B612-49AA26FF3E2D@drycafe.net> Multiple roots can be the result of a Bayesian analysis. (The PhyloDB module in BioSQL, for example, does support multiple roots.) However, representing multiple roots is useless without also being able to indicate whether a root is an alternate root or the main root node, and what its significance (posterior prob. for a Bayesian analysis) is. 
For reference, here is the column documentation for these two properties in PhyloDB's tree_root table: COMMENT ON COLUMN tree_root.is_alternate IS 'True if the root node is the preferential (most likely) root node of the tree, and false otherwise.'; COMMENT ON COLUMN tree_root.significance IS 'The significance (such as likelihood, or posterior probability) with which the node is the root node. This only has meaning if the method used for reconstructing the tree calculates this value.'; -hilmar On Jun 4, 2010, at 10:21 AM, Rutger Vos wrote: > Hi Anurag, > > in practice I haven't actually seen trees with multiple rootings being > used much, so it might not be urgent that this moves to the bioruby > core. My main worry would be in picking the "right" root node to > expose to the core api. I think that it should be the node from which > all other nodes can be visited in a recursive traversal (which I > expect client code to do), as opposed to a node that has been > indicated using an XML attribute to be the root, but isn't in terms of > the actual topology that emerges from the node and edge tables. > > However, I'm curious to hear other people's opinions whether a flag > (e.g. "is_root") might be added Bio::Tree::Node, and a "get_roots" > method in Bio::Tree that returns a list of roots that typically only > holds the value of the "root" attribute, but could potentially have > multiple rootings. > > Rutger > > On Fri, Jun 4, 2010 at 9:39 AM, Anurag Priyam > wrote: >> Hello all, >> >> NeXML allows for trees with multiple rootings. In the NeXML lib >> trees are >> represented by Bio::NeXML::Tree which inherits from Bio::Tree. This >> allows >> for the usage of the excellent Bio::Tree framework for manipulating >> NeXML >> trees. However, Bio::Tree class supports only one root node. >> >> There are a couple of functions that require the presence of a root >> node: >> parent, children, descendants, ancestors, lowest_common_ancestor. 
>> Now, these >> functions can take a root node as a parameter. So it is possible to >> extend >> the current framework to work with trees with multiple root nodes. >> >> Though this may not be required, a possibility is to add the >> multiple root >> functionality to Bio::Tree class itself. Currently, I am adding >> multiple >> root support to Bio::NeXML::Tree class. If need be we can move the >> functionality to Bio::Tree. >> >> Anything? >> >> -- >> Anurag Priyam, >> 2nd Year Undergraduate, >> Department of Mechanical Engineering, >> IIT Kharagpur. >> +91-9775550642 >> > > > > -- > Dr. Rutger A. Vos > School of Biological Sciences > Philip Lyle Building, Level 4 > University of Reading > Reading > RG6 6BX > United Kingdom > Tel: +44 (0) 118 378 7535 > http://www.nexml.org > http://rutgervos.blogspot.com > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : =========================================================== From mitlox at op.pl Sun Jun 6 01:30:09 2010 From: mitlox at op.pl (xyz) Date: Sun, 06 Jun 2010 15:30:09 +1000 Subject: [BioRuby] fastq files reading In-Reply-To: <20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp> References: <20100529221404.0175ee75@wp01> <20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp> Message-ID: <4C0B3261.3020909@op.pl> Thank you for the solutions it works. 
From sararayburn at gmail.com Mon Jun 7 14:09:07 2010
From: sararayburn at gmail.com (Sara Rayburn)
Date: Mon, 7 Jun 2010 13:09:07 -0500
Subject: [BioRuby] GSoC speciation/duplication inference question
Message-ID: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com>

Hello,

While implementing my GSoC project (the speciation/duplication inference algorithm), I have been unsure where to put the module I'm developing in the BioRuby library. I've been developing in the lib/bio/util directory, but I want to make sure that's the place you all would prefer for the module. Any suggestions or thoughts?

Thanks,

Sara Rayburn
sararayburn at gmail.com

From anurag08priyam at gmail.com Wed Jun 9 04:17:55 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Wed, 9 Jun 2010 13:47:55 +0530
Subject: [BioRuby] fastq files reading
In-Reply-To: <4C0B3261.3020909@op.pl>
References: <20100529221404.0175ee75@wp01> <20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp> <4C0B3261.3020909@op.pl>
Message-ID:

Maybe we should add this to the wiki [1]

[1] http://bioruby.open-bio.org/wiki/SampleCodes

On Sun, Jun 6, 2010 at 11:00 AM, xyz wrote:
> Thank you for the solutions it works.
>

--
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From mitlox at op.pl Wed Jun 9 08:42:36 2010
From: mitlox at op.pl (xyz)
Date: Wed, 09 Jun 2010 22:42:36 +1000
Subject: [BioRuby] fastq files reading
In-Reply-To:
References: <20100529221404.0175ee75@wp01> <20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp> <4C0B3261.3020909@op.pl>
Message-ID: <4C0F8C3C.1030303@op.pl>

Good idea.

On 06/09/10 18:17, Anurag Priyam wrote:
> Maybe we should add this to the wiki [1]
>
> [1] http://bioruby.open-bio.org/wiki/SampleCodes
>
> On Sun, Jun 6, 2010 at 11:00 AM, xyz wrote:
> > Thank you for the solutions it works.
> >
>
>
> --
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642 From anurag08priyam at gmail.com Wed Jun 9 15:49:35 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Thu, 10 Jun 2010 01:19:35 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. Message-ID: Last week, I worked on finishing implementation of Trees: trees, tree, network; and started work on the characters element. This weeks target is to complete the implementation of the characters element. It would be awesome to have some code review including: implementation, API design, coding style and tests. I am planning to give a good amount of time in the fourth week in making the code more robust. It would make perfect sense to have some feedback to serve as guidelines :). The master branch and API discussion page are at: [1] http://github.com/yeban/bioruby [2] https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From czmasek at burnham.org Wed Jun 9 21:50:16 2010 From: czmasek at burnham.org (Christian M Zmasek) Date: Wed, 9 Jun 2010 18:50:16 -0700 Subject: [BioRuby] gsoc questions In-Reply-To: References: <2B862774-AD61-4BC0-86D6-69DCD832EB78@gmail.com> Message-ID: <4C1044D8.4010600@burnham.org> Hi Sara: > > > On Mon, Jun 7, 2010 at 11:20 AM, Sara Rayburn > wrote: > > Hi Christian and Diana, > > Two questions: > > 1) On the phylosoft website for forester/sdi > (http://www.phylosoft.org/forester/applications/sdi_r/) I've read > this about the two trees: > "The important point to keep in mind is that there must be at least > one sub-element of the 'taxonomy' element which allows to match the > sequences in the gene tree with a taxonomy in the species tree. In > this example this sub-element of the 'taxonomy' element is 'code'." > > Does this mean that the sub-element for matching will *always* be > 'code'? Or should I just be looking for anything at all that > matches? 
Also, will all phyloxml trees have the 'code' sub-element?
>
>
> To find out whether some element will always contain some other element
> you can look at PhyloXML documentation [0]. For example at the Taxonomy
> element documentation [1] you can see that it has a sub-element "code"
> which is [0..1], which means that there either is no "code" sub-element
> or there is one and no more, whereas there could be none or many "synonym"
> sub-elements
>
> [0] http://www.phyloxml.org/documentation/version_1.10/phyloxml.xsd.html
> [1] http://www.phyloxml.org/documentation/version_1.10/phyloxml.xsd.html#h888650454

Good point! This matching of taxonomic information is crucial. I recommend implementing it in the same manner as the "isEqual" method of the org.forester.phylogeny.data.Taxonomy class of the forester library, see:
http://forester-atv.cvs.sourceforge.net/viewvc/forester-atv/forester-atv/java/src/org/forester/phylogeny/data/Taxonomy.java?revision=1.57&view=markup

In this (Java) class the matching works like this:

1. If both Taxonomies to be compared have identifiers with the same source (e.g. NCBI taxonomy), use these identifiers to match. In Java:

    if ( ( getIdentifier() != null ) && ( tax.getIdentifier() != null ) ) {
        return getIdentifier().isEqual( tax.getIdentifier() );
    }

2. Otherwise, if both Taxonomies have taxonomy codes, use the taxonomy codes to match. In Java:

    else if ( !ForesterUtil.isEmpty( getTaxonomyCode() ) && !ForesterUtil.isEmpty( tax.getTaxonomyCode() ) ) {
        return getTaxonomyCode().equals( tax.getTaxonomyCode() );
    }

3. Otherwise, if both Taxonomies have scientific names, use the scientific names to match.

4. Otherwise, if both Taxonomies have common names, use the common names to match.

5. Otherwise, matching is not possible and an error should be thrown.
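
For illustration, the five steps above could be sketched in Ruby roughly as follows. This is only a sketch under stated assumptions: TaxonId, Taxonomy, TaxonomyMatchError and same_taxonomy? are hypothetical stand-ins invented for this example (they are not the Bio::PhyloXML classes), and identifier equality is assumed to mean "same value and same provider".

```ruby
# Hypothetical stand-ins for illustration only; NOT the Bio::PhyloXML classes.
TaxonId  = Struct.new(:value, :provider)
Taxonomy = Struct.new(:id, :code, :scientific_name, :common_name)

# Raised when two taxonomies share no comparable sub-element (step 5).
class TaxonomyMatchError < StandardError; end

# True if the string is neither nil nor blank.
def present?(str)
  !str.nil? && !str.to_s.strip.empty?
end

# Compare two taxonomies using the cascade: identifier, code,
# scientific name, common name, otherwise raise.
def same_taxonomy?(a, b)
  if a.id && b.id
    a.id == b.id # Struct equality: value and provider must both match (assumption)
  elsif present?(a.code) && present?(b.code)
    a.code == b.code
  elsif present?(a.scientific_name) && present?(b.scientific_name)
    a.scientific_name == b.scientific_name
  elsif present?(a.common_name) && present?(b.common_name)
    a.common_name == b.common_name
  else
    raise TaxonomyMatchError, "no comparable taxonomy sub-elements"
  end
end

# Identifiers take precedence, so the differing common names are ignored.
a = Taxonomy.new(TaxonId.new("9606", "ncbi"), nil, nil, "human")
b = Taxonomy.new(TaxonId.new("9606", "ncbi"), nil, nil, "man")
puts same_taxonomy?(a, b) # true
```

Because the identifier check comes first, two taxonomies with the same identifier match even when their common names differ, and a pair with nothing in common to compare raises instead of silently returning false.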
Generally speaking, I recommend to get the source code of forester and look at the classes in the org.forester.sdi directory (especially SDI.java, SDIse.java, and SDIR.java). > > 2) Here's my assumptions about the final output of the algorithm: > Each node in the tree should be updated with speciation OR > duplication, and the tree as a whole has a count of > speciation/duplication events. Am I on the right track here? Yes, the primary goal of the algorithm is to calculate for each node in the gene tree whether it is a duplication or a speciation, and thus each node should be annotated as duplication or speciation. Keeping track of the sum of duplications and speciations is useful too, but cannot, as far as I know, stored in the tree object itself. Maybe the algorithm could return a small "SDI_result" object which is used to store such "summary" information. Christian From ngoto at gen-info.osaka-u.ac.jp Thu Jun 10 09:46:40 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 10 Jun 2010 22:46:40 +0900 Subject: [BioRuby] GSoC speciation/duplication inference question In-Reply-To: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com> References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com> Message-ID: <20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp> Hi, I think the abbreviation SDI is not common in the field of biology and bioinformatics. In this case, it is generally good not to abbreviate, but the "speciation/duplication inference" is too long. For file/directory names, because the length limit is tight, using abbreviation is good. For the location of files, I suggest lib/bio/util/evolution/SDI/ or lib/bio/util/phylogeny/SDI/ to show the word SDI is in the field of evolution or phylogeny. For the class/module namespace, possible candidates are Bio::SpeciationDuplicationInference, Bio::Evolution::SDI, Bio::Algorithm::SDI, but I couldn't determine which is the best. If you have good idea, please tell us. 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Mon, 7 Jun 2010 13:09:07 -0500 Sara Rayburn wrote:
> Hello,
>
> While implementing my gsoc project (the speciation/duplication inference algorithm), I'm not sure where to put the module I'm developing in the bioruby library. I've been developing in the lib/bio/util directory, but I want to make sure that's the place you all would prefer for the module. Any suggestions or thoughts?
>
> Thanks,
>
> Sara Rayburn
> sararayburn at gmail.com

From kpatil at science.uva.nl Thu Jun 17 05:22:12 2010
From: kpatil at science.uva.nl (K. Patil)
Date: Thu, 17 Jun 2010 11:22:12 +0200 (CEST)
Subject: [BioRuby] newick to phyloxml
In-Reply-To: <20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp>
References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com> <20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <1764.139.19.75.1.1276766532.squirrel@webmail.science.uva.nl>

Hi,

I noticed the inclusion of phyloxml support in bioruby, thanks a lot, it's very useful. I was wondering if there is any straightforward way to convert a newick tree to phyloxml?

best

From czmasek at burnham.org Thu Jun 17 18:49:12 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Thu, 17 Jun 2010 15:49:12 -0700
Subject: [BioRuby] Gene duplications GSoC project: answers to some of your questions
Message-ID: <4C1AA668.7040801@burnham.org>

Hi, Sara:

Regarding some of your questions posted on http://wiki.github.com/srayburn/bioruby/gsoc-2010-implementing-sdi-project-updates

Re: "Right now initialization loads from a hard coded file. I need to make this flexible so that trees can come from any file or from a previously loaded tree object":

The input of the algorithm(s) should be tree objects; reading the trees should not be part of the algorithm implementation.
Clearly, for testing you need to read the trees from files, but this should be implemented in your test code, not as part of the algorithm implementation itself.

Re: "The names of leaf nodes: how standard are they? Is there a standard format here? I'm going to look at example trees from the forester implementation to get ideas about this. If I'm still stumped I'll check with my mentors."

No, there is no standard. The only question for the purpose of this algorithm is whether they match or not, i.e. the names could be just numbers, common names, or scientific names.

Hope this helps,

Christian

From czmasek at burnham.org Thu Jun 17 23:16:20 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Thu, 17 Jun 2010 20:16:20 -0700
Subject: [BioRuby] newick to phyloxml
In-Reply-To: <1764.139.19.75.1.1276766532.squirrel@webmail.science.uva.nl>
References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com> <20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp> <1764.139.19.75.1.1276766532.squirrel@webmail.science.uva.nl>
Message-ID: <4C1AE504.3000307@burnham.org>

Hi,

Unfortunately, this is not possible in a straightforward way. The problem is that the tree object (Bio::Tree) returned by:

    input = Bio::FlatFile.open(Bio::Newick, "tree.nh")
    tree = input.next_entry.tree

is the parent type of the tree object (Bio::PhyloXML::Tree) required by:

    writer = Bio::PhyloXML::Writer.new("tree.xml")
    writer.write(phyloxml_tree)

Christian

K. Patil wrote:
> Hi,
>
> I noticed the inclusion of phyloxml support in bioruby, thanks a lot, it's
> very useful. I was wondering if there is any straightforward way to
> convert a newick tree to phyloxml?
> > best > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From anurag08priyam at gmail.com Tue Jun 22 04:46:19 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Tue, 22 Jun 2010 14:16:19 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Update In-Reply-To: References: Message-ID: Hello all, Much of the parser implementation is complete as of now. Last time I had sent an update I had begun implementing characters element. Week 3( June 7-13) was quite low on work due to power shortage where I live. Consequently implementation of characters spanned Week 4( June 14-20 ) too. Work on NeXML serialization has begun. As of now it can serialize taxa blocks. This week( week 5 - June 21-28 ) I will be working on serializing trees and characters element. I would also like to update a little more on future development plans. I am targeting to finish much of the software development by week 9( July 19-25 ), leaving week 10, week 11 and week 12 for feedback and iterations. This is the time where I should make up for any mistakes or lost work. Perhaps in this week we can make the code ready for merging in BioRuby's master branch. Apart from this, I am targeting to finish serializer and start working on the RDF API by week 6. Maybe we could have a round of code review after that too? I am notifying this in advance so that if possible developers can allocate time for this. Sounds good? -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. 
+91-9775550642 From czmasek at burnham.org Tue Jun 22 14:29:09 2010 From: czmasek at burnham.org (Christian M Zmasek) Date: Tue, 22 Jun 2010 11:29:09 -0700 Subject: [BioRuby] gsoc update In-Reply-To: <6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com> References: <4C1AACBA.4030908@burnham.org> <6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com> Message-ID: <4C2100F5.3020306@burnham.org> Hi, Sara: Hopefully you and your son are fully recovered now! To me, Bio::Algorithm::SDI would make the most sense. Re: "It seems that forester has the assumption built in that any node in a tree that has a child must have two children. Is this a property of phylogenetic trees?" Being composed of entirely binary nodes is indeed a property of trees produced by most programs for phylogenetic inference. In contrast, if multiple (binary) trees are used to calculate a consensus tree (e.g. bootstrap resampling), then the resulting consensus tree might contain nodes with more than two children (depending on the method of consensus tree calculation and the degree of divergence among the resampled trees). Furthermore, if (phylogenetic or taxonomic) trees are "manually" created (or by various "supertree" approaches), nodes with more than two children are oftentimes used to express uncertainty. For the purpose of gene duplication inference, it would be particularly useful to allow non-binary species trees (expressing uncertainty about the tree-of-life and preventing the introduction of spurious duplications). Re: "For the non-binary case, should I go forward planning to implement the algorithm from the Vernot et al. paper or should I be planning to extend your algorithm?" You should plan on working on the SDI algorithm and 'modify' it so that it correctly works on non-binary species trees. Now, this is easier said than done. A while ago, I developed such an algorithm and implemented it as org.forester.sdi.GSDI (for Generalized SDI). You can look at it in file org/forester/sdi/GSDI. 
Yet, the big issue is that while this algorithm seems to work, I don't have a mathematical proof for its correctness.

In any case, I recommend doing the following:

1. Thoroughly test (and write unit tests for) your current implementation of binary SDI. For example, does it correctly use the different sub-elements of taxonomy for matching? That is: Does it work if both species and gene use scientific names for taxonomic identification? Does it work if both species and gene use NCBI identifiers for taxonomic identification? Does it work if both species and gene use NCBI identifiers for taxonomic identification but also have non-matching common names (in this case it should use the identifiers and ignore common names)? Will it throw an exception if no matching sub-elements of taxonomy are present?

2. Perform timing benchmarks. Does it behave similarly (although slower overall) to the Java implementation (see Figure 4 in Zmasek and Eddy, 2001)? Oftentimes, an unexpected timing benchmark result is an indication of an underlying problem.

3. I will look at your implementation as well.

4. Look at org.forester.sdi.GSDI and see if you can understand it and test it on paper. If this makes sense to you then we can go ahead and plan on implementing this within BioRuby.

Christian

Sara Rayburn wrote:
> Hi,
>
> Well, as far as I can tell, things are looking much, much better. I'm sorry I got a bit behind, but my son and I have been sick this past week.
>
> For the namespace/file locations, the response from the mailing list has been:
> Bio::SpeciationDuplicationInference, Bio::Evolution::SDI,
> Bio::Algorithm::SDI with the files in lib/bio/util/Phylogeny/SDI, or lib/bio/phylo/sdi.rb and Bio::Phylo::SDI
>
> What do you guys think?
>
> Also, when I've been in doubt I've looked at the java implementation. It seems that forester has the assumption built in that any node in a tree that has a child must have two children. Is this a property of phylogenetic trees?
> > Other than tying up a couple of loose ends, I think the binary case is pretty much wrapped up. Please let me know if there are things I need to modify or rethink. > > For the non-binary case, should I go forward planning to implement the algorithm from the Vernot et al. paper or should I be planning to extend your algorithm? > > Thanks and again, sorry for getting a bit behind. > > Sara From rutgeraldo at gmail.com Wed Jun 23 16:48:02 2010 From: rutgeraldo at gmail.com (Rutger Vos) Date: Wed, 23 Jun 2010 21:48:02 +0100 Subject: [BioRuby] [GSoC][NeXML and RDF API] Update In-Reply-To: References: Message-ID: Hi Anurag, thanks for the update - your time projection and current progress sounds good. Can you forward this update to the phylosoc (nescent) list as well? Thanks, Rutger On Tue, Jun 22, 2010 at 9:46 AM, Anurag Priyam wrote: > Hello all, > > Much of the parser implementation is complete as of now. Last time I had > sent an update I had begun implementing characters element. Week 3( June > 7-13) was quite low on work due to power shortage where I live. Consequently > implementation of characters spanned Week 4( June 14-20 ) too. > > Work on NeXML serialization has begun. As of now it can serialize taxa > blocks. This week( week 5 - June 21-28 ) I will be working on serializing > trees and characters element. > > I would also like to update a little more on future development plans. I am > targeting to finish much of the software development by week 9( July 19-25 > ), leaving week 10, week 11 and week 12 for feedback and iterations. This is > the time where I should make up for any mistakes or lost work. Perhaps in > this week we can make the code ready for merging in BioRuby's master branch. > Apart from this, I am targeting to finish serializer and start working on > the RDF API by week 6. Maybe we could have a round of code review after that > too? I am notifying this in advance so that if possible developers can > allocate time for this. Sounds good? 
>
>
> --
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>

--
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com

From pjotr.public14 at thebird.nl Thu Jun 24 09:54:11 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 24 Jun 2010 15:54:11 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To:
References:
Message-ID: <20100624135411.GA14658@thebird.nl>

Hi Everyone,

I am going to review Anurag's code. Naohisa, and perhaps others, will join in.

A quick recap: Anurag is working on implementing an NeXML parser with RDF support (for the semantic web). NeXML is an XMLized and improved version of Nexus, and is used for interchanging sequences, alignments and trees between programs/services (correct me if I am wrong). A full description of NeXML can be found at

https://www.nescent.org/wg_evoinfo/Future_Data_Exchange_Standard

NeXML is an important standard, and very good to have in BioRuby.

Anurag: thanks for the good work so far. I can see you have put a lot of work in. And I like your style. I can see you are a competent programmer, so you can expect the worst criticism ;)

I am going to start with some high-level questions. Can someone who has worked with NeXML (Rutger) have a look at the interface description on:

https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby

It looks natural to query this way if you know what the NeXML file contains (e.g. trees, or sequences). What would be the natural approach if you do *not* know the contents? I.e. how does one iterate over the NeXML object?
Anurag, your web page states you implemented a LibXML::Parser, and you named it Parser. Meanwhile, it looks like you have implemented libxml2 streaming, using a Reader. This is a bit confusing. I presume you are using the technique used in Diana's PhyloXML parser. You are requiring the 'xml' package. Is that libxml2 these days, or is it actually 'libxml'? Does it work for all Ruby versions? libxml is an external (binary) dependency, so it may not exist and fail. PhyloXML does not handle failure either. The other high-level questions concern testing. For others, the unit tests are here: http://github.com/yeban/bioruby/tree/master/test/unit/bio/db/nexml/ I notice you have limited test input data. How can you be really sure your code works for all cases? How can you be really sure that future changes to the code don't break? And how are you going to measure performance of your code? Finally, getting down to some code. Most of the code is in a single file: http://github.com/yeban/bioruby/blob/master/lib/bio/db/nexml/elements.rb or http://github.com/yeban/bioruby/blob/3abfc592e2f7072a8e2970ee077677a9ab7564ae/lib/bio/db/nexml/elements.rb I think it should be broken up. It would be logical to split by type of elements - at least. I know in BioRuby we are ambiguous about file sizes - I think a single file should describe one concept. That way file names become self describing. Files larger than 300 lines tend to be hard to digest - and probably point out some bigger issue. Also, when I look at DnaSeqRow, RnaSeqRow and others derived from SeqRow (line 2148 and onwards in element.rb), I can see duplicated coding 'patterns'. You are repeating a concept. Would there not be a more elegant way in Ruby to handle this? Hint: Inheritance is just one mechanism, I see no real reason to use an inheritance tree. Why not use one Sequence class for all of these which can contain different formed elements? I bet the code would become a lot shorter and (probably) less error prone. 
Take Ruby's Array container class as an example - it is just one implementation of a container which allows many types of elements. A final comment for this session: The class/method descriptions are not very informative. It may be early days - especially since we can see some refactoring coming, but it usually helps to write out examples giving the 'nicest' interface for people to use. And stick those in the source code. Personally I favour rubydoctests, see http://github.com/tablatom/rubydoctest I used these in bio/appl/paml/codeml/report.rb - these are examples that double as tests. Kill two birds with one stone! The BioRuby tutorial also uses doctests - i.e. the code in the Tutorial can be validated against the installed bioruby. If you want to use this you need an extra conversion - I have that tool. Another possibility is to start using RSpec. http://rspec.info/ I really like RSpec too - it is more of a replacement for unit tests - and easier to understand, so Specs double as documentation. I am interested to see what you want to do for RDF support. Maybe you can write out the API as an RSpec? That would be a good start. Do not hesitate to stand up to me. You will probably get support from someone on this list ;) Pj. On Thu, Jun 10, 2010 at 01:19:35AM +0530, Anurag Priyam wrote: > Last week, I worked on finishing implementation of Trees: trees, tree, > network; and started work on the characters element. This weeks target is to > complete the implementation of the characters element. > > It would be awesome to have some code review including: implementation, API > design, coding style and tests. I am planning to give a good amount of time > in the fourth week in making the code more robust. It would make perfect > sense to have some feedback to serve as guidelines :). 
The master branch and
> API discussion page are at:
>
> [1] http://github.com/yeban/bioruby
> [2] https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby
>
> --
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642

From czmasek at burnham.org Thu Jun 24 22:18:41 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Thu, 24 Jun 2010 19:18:41 -0700
Subject: [BioRuby] gsoc: SDI - unrooted trees
In-Reply-To: <6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com>
References: <4C1AACBA.4030908@burnham.org> <6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com>
Message-ID: <4C241201.9060604@burnham.org>

Hi, Sara:

Something I forgot to mention.

As you know, most phylogeny inference methods produce trees which are unrooted (these trees might look rooted, but for most methods the root is placed randomly, and thus incorrectly). In the context of duplication inference, a reasonable way to root a tree is by placing the root in such a way that the sum of inferred duplications is minimized.

The brute force approach to accomplish this is by sequentially placing the root on each branch and then running the SDI algorithm on each differently rooted tree and retaining the root position which results in the smallest sum of duplications.

A more time-efficient approach is possible by realizing that the mapping function only changes for a few nodes if the root is moved from one branch to a neighboring one. This approach is implemented in org.forester.sdi.SDIR.

Besides extending the algorithm to work on non-binary trees, this is another useful extension which you might think about tackling.
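
The brute force rooting described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the forester SDIR code and not the BioRuby API: the species tree, the gene tree edge list, and all method names (species_lca, map_subtree, combine, best_rooting) are hypothetical toy stand-ins. A node is counted as a duplication when its species mapping equals the mapping of one of its children.

```ruby
# Hypothetical toy data: species tree ((A,B),C) stored as child => parent.
SPECIES_PARENT = { 'A' => 'AB', 'B' => 'AB', 'AB' => 'ABC', 'C' => 'ABC' }.freeze
# Hypothetical unrooted gene tree (edge list) and leaf => species map.
EDGES = [%w[x a1], %w[x b1], %w[x c1]].freeze
SPECIES_OF = { 'a1' => 'A', 'b1' => 'B', 'c1' => 'C' }.freeze

# Lowest common ancestor of two nodes of the species tree.
def species_lca(a, b)
  ancestors = [a]
  ancestors << SPECIES_PARENT[ancestors.last] while SPECIES_PARENT[ancestors.last]
  b = SPECIES_PARENT[b] until ancestors.include?(b)
  b
end

# Combine child results ([mapping, duplications] pairs) into the parent's
# result; the parent is a duplication if it maps to the same species-tree
# node as one of its children.
def combine(results)
  mappings = results.map(&:first)
  mapping = mappings.reduce { |a, b| species_lca(a, b) }
  dups = results.sum(&:last)
  dups += 1 if mappings.include?(mapping)
  [mapping, dups]
end

# [mapping, duplications] for the subtree at `node` as seen from `parent`.
def map_subtree(node, parent, adjacency, species_of)
  return [species_of[node], 0] if species_of.key?(node)

  children = adjacency[node] - [parent]
  combine(children.map { |c| map_subtree(c, node, adjacency, species_of) })
end

# Place the root on every branch in turn and keep the branch that yields
# the fewest inferred duplications. Returns [edge, duplications].
def best_rooting(edges, species_of)
  adjacency = Hash.new { |h, k| h[k] = [] }
  edges.each { |u, v| adjacency[u] << v; adjacency[v] << u }
  scored = edges.map do |u, v|
    halves = [map_subtree(u, v, adjacency, species_of),
              map_subtree(v, u, adjacency, species_of)]
    [[u, v], combine(halves).last]
  end
  scored.min_by(&:last)
end

edge, dups = best_rooting(EDGES, SPECIES_OF)
puts "root on #{edge.join('-')} with #{dups} duplication(s)"
```

Every rooting is re-scored from scratch here, which is exactly the cost the incremental approach avoids by updating only the few mappings that change between neighboring branches.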

Christian

From anurag08priyam at gmail.com Fri Jun 25 02:23:34 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Fri, 25 Jun 2010 11:53:34 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100624135411.GA14658@thebird.nl>
References: <20100624135411.GA14658@thebird.nl>
Message-ID:

> Can someone who has worked with NeXML (Rutger) have a look at the
> interface description on:
>
> https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby
>
> It looks natural to query this way, if you know what the NeXML files
> contains (e.g. trees, or sequences). What would be the natural
> approach if you do *not* know the contents? I.e. how does one iterate
> over the NeXML object?
>
>
NeXML has three primary elements: otus, trees, characters. All three of them are containers for other elements: otu, tree, network, matrix. Currently the each method of a nexml object iterates over each tree object. I did this thinking that the tree is the most important part of a phylogenetic analysis (and also because I had not implemented characters then). What were you thinking here? Should each iterate over all otu, tree and matrix elements, or over the primary otus, trees and characters elements? I would go for the latter.

> Anurag, your web page states you implemented a LibXML::Parser, and you
> named it Parser. Meanwhile, it looks like you have implemented libxml2
> streaming, using a Reader. This is a bit confusing. I presume you are
> using the technique used in Diana's PhyloXML parser. You are requiring
> the 'xml' package. Is that libxml2 these days, or is it actually
> 'libxml'? Does it work for all Ruby versions? libxml is an external
> (binary) dependency, so it may not exist and fail. PhyloXML does not
> handle failure either.
>
>
I am glad you asked this. I wanted to discuss it here. I have used libxml2 streaming api, without actually streaming the document to the user.
The cursor does not move through the document when you iterate over elements( phyloxml does that ). I am parsing the document at one go; at the start, and storing the objects in memory. Should we want to switch to streaming, using libxml's streaming API from start should make it easier. Yes it is libxml2 these days. The site states that it works with ruby 1.8. I am myself working with 1.8.7. I will have to test the compatibility with ruby 1.9. > The other high-level questions concern testing. For others, the unit > tests are here: > > http://github.com/yeban/bioruby/tree/master/test/unit/bio/db/nexml/ > > I notice you have limited test input data. How can you be really sure > your code works for all cases? How can you be really sure that future > changes to the code don't break? Right. I am working on improving the test suites taking lessons from the other bioruby test suites. > And how are you going to measure > performance of your code? > > Actually I have not done anything here. I will benchmark and profile the code and discuss the results here. Finally, getting down to some code. Most of the code is in a single > file: > > http://github.com/yeban/bioruby/blob/master/lib/bio/db/nexml/elements.rb > > or > > > http://github.com/yeban/bioruby/blob/3abfc592e2f7072a8e2970ee077677a9ab7564ae/lib/bio/db/nexml/elements.rb > > I think it should be broken up. It would be logical to split by type > of elements - at least. I know in BioRuby we are ambiguous about file > sizes - I think a single file should describe one concept. That way > file names become self describing. Files larger than 300 lines tend > to be hard to digest - and probably point out some bigger issue. > > Agreed. > Also, when I look at DnaSeqRow, RnaSeqRow and others derived from > SeqRow (line 2148 and onwards in element.rb), I can see duplicated > coding 'patterns'. You are repeating a concept. Would there not be a > more elegant way in Ruby to handle this? 
Hint: Inheritance is just one > mechanism, I see no real reason to use an inheritance tree. Why not > use one Sequence class for all of these which can contain different > formed elements? I bet the code would become a lot shorter and > (probably) less error prone. Take Ruby's Array container class as an > example - it is just one implementation of a container which allows > many types of elements. > The idea here was to implement a type system and stick close to the class hierarchy followed in the schema. However, looking back, I myself do not find the code for the Matrix class very elegant. A final comment for this session: The class/method descriptions are > not very informative. It may be early days - especially since we can > see some refactoring coming, but it usually helps to write out > examples giving the 'nicest' interface for people to use. And stick > those in the source code. Personally I favour rubydoctests, see > > http://github.com/tablatom/rubydoctest > > Hey, I did not know that doctests existed for Ruby too. I will have a look into it. > I used these in bio/appl/paml/codeml/report.rb - these are examples > that double as tests. Kill two birds with one stone! The BioRuby > tutorial also uses doctests - i.e. the code in the Tutorial can be > validated against the installed bioruby. If you want to use this you > need an extra conversion - I have that tool. > I will check out the examples. What tool? I would like to know more. > > Another possibility is to start using RSpec. > > http://rspec.info/ > > I really like RSpec too - it is more of a replacement for unit > tests - and easier to understand, so Specs double as documentation. > > I am missing Rspec too from my Rails and Merb days. I picked up unit tests because much of the framework had used the same and also because I wanted to try it out :). > I am interested to see what you want to do for RDF support. Maybe you > can write out the API as an RSpec? That would be a good start. 
> > That sounds like a nice idea. > Do not hesitate to stand up to me. You will probably get support from > someone on this list ;) > > Pj. > > On Thu, Jun 10, 2010 at 01:19:35AM +0530, Anurag Priyam wrote: > > Last week, I worked on finishing implementation of Trees: trees, tree, > > network; and started work on the characters element. This weeks target is > to > > complete the implementation of the characters element. > > > > It would be awesome to have some code review including: implementation, > API > > design, coding style and tests. I am planning to give a good amount of > time > > in the fourth week in making the code more robust. It would make perfect > > sense to have some feedback to serve as guidelines :). The master branch > and > > API discussion page are at: > > > > [1] http://github.com/yeban/bioruby > > [2] > > > https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby > > > > -- > > Anurag Priyam, > > 2nd Year Undergraduate, > > Department of Mechanical Engineering, > > IIT Kharagpur. > > +91-9775550642 > > _______________________________________________ > > BioRuby Project - http://www.bioruby.org/ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From pjotr.public14 at thebird.nl Fri Jun 25 02:46:05 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 25 Jun 2010 08:46:05 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> Message-ID: <20100625064605.GA22887@thebird.nl> (splitting up the discussion) On Fri, Jun 25, 2010 at 11:53:34AM +0530, Anurag Priyam wrote: > Should each iterate over all otu, tree and matrix or the > primary otus, trees and characters elements? I would go for the later. I think Rutger should answer this. Pj. 
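
The two iteration options under discussion can be put side by side in a small sketch. Everything in it is a hypothetical stand-in (ToyNexml, Container, each_member are invented for illustration and are not part of Bio::NeXML): `each` walks the primary otus, trees and characters containers, while `each_member` flattens them into individual otu, tree and matrix entries.

```ruby
# Hypothetical stand-in for a parsed NeXML document; not the Bio::NeXML API.
class ToyNexml
  include Enumerable

  # kind is :otus, :trees or :characters; members are the contained ids.
  Container = Struct.new(:kind, :id, :members)

  def initialize(containers)
    @containers = containers
  end

  # Option 1: iterate over the primary otus/trees/characters containers.
  def each(&block)
    @containers.each(&block)
  end

  # Option 2: iterate over every otu, tree and matrix, tagged with the
  # kind of container it came from.
  def each_member
    return enum_for(:each_member) unless block_given?

    @containers.each do |container|
      container.members.each { |member| yield container.kind, member }
    end
  end
end

doc = ToyNexml.new([
  ToyNexml::Container.new(:otus, 'taxa1', %w[t1 t2]),
  ToyNexml::Container.new(:trees, 'trees1', %w[tree1]),
  ToyNexml::Container.new(:characters, 'chars1', %w[matrix1])
])

doc.each { |container| puts "#{container.kind}: #{container.id}" }
doc.each_member { |kind, member| puts "#{kind} member #{member}" }
```

With the container-level `each`, a caller who does not know the file contents can branch on the container kind first and only then descend, which is one argument for iterating over the primary elements.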
From pjotr.public14 at thebird.nl Fri Jun 25 02:49:11 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 25 Jun 2010 08:49:11 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> Message-ID: <20100625064911.GB22887@thebird.nl> > I have used libxml2 streaming api, without actually streaming the document > to the user. The cursor does not move through the document when you iterate > over elements( phyloxml does that ). I am parsing the document at one go; at > the start, and storing the objects in memory. Should we want to switch to > streaming, using libxml's streaming API from start should make it easier. > > Yes it is libxml2 these days. The site states that it works with ruby 1.8. I > am myself working with 1.8.7. I will have to test the compatibility with > ruby 1.9. OK, glad to see that libxml is a standard package these days - though it has some horrific error handling. At least it is fast. How much time would it cost you to stream the data - and what does it mean with regard to changing the API? I guess, in general, NeXML files won't be that large, so it may not be that important (Rutger)? Pj. From pjotr.public14 at thebird.nl Fri Jun 25 02:51:58 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 25 Jun 2010 08:51:58 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> Message-ID: <20100625065158.GC22887@thebird.nl> > > I notice you have limited test input data. How can you be really sure > > your code works for all cases? How can you be really sure that future > > changes to the code don't break? > > Right. I am working on improving the test suites taking lessons from the > other bioruby test suites. Unit tests are one approach. How about adding some regression tests on larger files? When you have output that should be a good idea. 
We don't like large datasets in the bioruby tree, but there are two ways around that - create a special branch on github, or pull the data on demand (though Naohisa may frown on that). Ask Diana what she has done. > Actually I have not done anything here. I will benchmark and profile the > code and discuss the results here. Diana created a special profiling branch. It was really helpful to profile. Pj. From pjotr.public14 at thebird.nl Fri Jun 25 02:55:39 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 25 Jun 2010 08:55:39 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> Message-ID: <20100625065539.GD22887@thebird.nl> On Fri, Jun 25, 2010 at 11:53:34AM +0530, Anurag Priyam wrote: > The idea here was to implement a type system and stick close to the class > hierarchy followed in the schema. However, looking back, I myself do not > find the code for the Matrix class very elegant. Over 3000 lines of code for an XML parser sends out alarm bells. If you have the right testing files it should be easy to refactor. Make it simpler. Also, when parsing this type of XML some Ruby reflection may come in handy - I did some of that in my BioRuby GEO parser, which lives in my GEO branch on github. You should look at each class and see if you can refactor it down to a single solution. Just make sure it is not at the expense of readability and understanding. Post us some ideas here, before you start hacking code. Pj. From pjotr.public14 at thebird.nl Fri Jun 25 03:08:04 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 25 Jun 2010 09:08:04 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> Message-ID: <20100625070804.GE22887@thebird.nl> > > http://github.com/tablatom/rubydoctest > > > > > Hey, I did not know that doctests existed for Ruby too. I will have a look > into it. 
They are good, however finding bugs is a bit problematic as the stack traces are lengthy and often not descriptive. So with troubling code I tend to write extra unit tests. Also, with BioRuby we have not settled on doctests yet, so you need to reach coverage with unit tests and/or Specs. I really think it is good for validating documentation. > > I used these in bio/appl/paml/codeml/report.rb - these are examples > > that double as tests. Kill two birds with one stone! The BioRuby > > tutorial also uses doctests - i.e. the code in the Tutorial can be > > validated against the installed bioruby. If you want to use this you > > need an extra conversion - I have that tool. > > > > I will check out the examples. What tool? I would like to know more. It simply parses out commented code in the source headers, and turns them over to rubydoctest. The tool is in my bioruby-support tree on github - see http://github.com/pjotrp/bioruby-support/blob/master/bin/uncomment_doctest you can see it uses an environment variable. > I am missing Rspec too from my Rails and Merb days. I picked up unit tests > because much of the framework had used the same and also because I wanted to > try it out :). > > > > I am interested to see what you want to do for RDF support. Maybe you > > can write out the API as an RSpec? That would be a good start. > > > > > That sounds like a nice idea. RSpec is new for BioRuby. Since you have experience you are the right one to introduce it to us ;). If it is convincing to the others we may accept it as standard use (personally I think it is a step forward from unit testing - unit tests are not very good as documentation). Pj. From anurag08priyam at gmail.com Fri Jun 25 03:34:21 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Fri, 25 Jun 2010 13:04:21 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. 
In-Reply-To: <20100625064911.GB22887@thebird.nl> References: <20100624135411.GA14658@thebird.nl> <20100625064911.GB22887@thebird.nl> Message-ID: On Fri, Jun 25, 2010 at 12:19 PM, Pjotr Prins wrote: > > I have used libxml2 streaming api, without actually streaming the > document > > to the user. The cursor does not move through the document when you > iterate > > over elements( phyloxml does that ). I am parsing the document at one go; > at > > the start, and storing the objects in memory. Should we want to switch to > > streaming, using libxml's streaming API from start should make it easier. > > > > Yes it is libxml2 these days. The site states that it works with ruby > 1.8. I > > am myself working with 1.8.7. I will have to test the compatibility with > > ruby 1.9. > > OK, glad to see that libxml is a standard package these days - > though it has some horrific error handling. At least it is fast. > > Yea it is fast but it has its own share of bugs. Now, I myself have started working on the ruby-libxml code and helping in maintaining it. > How much time would it cost you to stream the data - and what does it > mean with regard to changing the API? I guess, in general, NeXML > files won't be that large, so it may not be that important (Rutger)? > > Pj. > > I mean switching the parsing implementation to streaming from "parsing at the start" and not the API. Just that using Reader API over the DOM API would help in the switch. Even if we do not switch, the Reader API offers a more memory efficient solution than the DOM API. Btw, I am not in a favour of switch. You cannot move backwards in document that way. I can not fetch a tree by id if I the cursor is ahead of that tree. Doing nexml.each_characters and nexml.each_trees is impossible with pure streaming. I will have to stream one while cache the other. Otus and otu provide a one to many relation with trees and characters, and rows. 
An API call of the type otus.trees or otus.characters or otu.sequences would be impossible (not that I have already added the API call). Imo, NeXML is non-linear and not meant to be streamed. Besides, other NeXML implementations also parse the file at the start.

--
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From ngoto at gen-info.osaka-u.ac.jp Fri Jun 25 03:15:58 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Fri, 25 Jun 2010 16:15:58 +0900
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100625065158.GC22887@thebird.nl>
References: <20100624135411.GA14658@thebird.nl> <20100625065158.GC22887@thebird.nl>
Message-ID: <20100625071558.9CAB01CBC5B0@idnmail.gen-info.osaka-u.ac.jp>

Most of the special testing program created by Diana for PhyloXML is now put in sample/test_phyloxml_big.rb, i.e. it is now regarded as a sample script.

To run the program, for example,

  % mkdir /tmp/phyloxml
  % ruby sample/test_phyloxml_big.rb /tmp/phyloxml -v

It executes round-trip tests for large PhyloXML files. Data files are downloaded from the internet and are stored in a directory specified by the user.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Fri, 25 Jun 2010 08:51:58 +0200
Pjotr Prins wrote:

> > Actually I have not done anything here. I will benchmark and profile the
> > code and discuss the results here.
>
> Diana created a special profiling branch. It was really helpful to
> profile.
>
> Pj.

--
ngoto at gen-info.osaka-u.ac.jp
Phone: 06-6879-8365 / FAX: 06-6879-2047 From pjotr.public14 at thebird.nl Fri Jun 25 03:42:13 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 25 Jun 2010 09:42:13 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> <20100625064911.GB22887@thebird.nl> Message-ID: <20100625074213.GA27044@thebird.nl> I think this needs to be answered by Rutger. Are we going to face NeXML files in the future that can easily outrun memory? Pj. On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote: > > How much time would it cost you to stream the data - and what does it > > mean with regard to changing the API? I guess, in general, NeXML > > files won't be that large, so it may not be that important (Rutger)? > > > > Pj. > > > > > I mean switching the parsing implementation to streaming from "parsing at > the start" and not the API. Just that using Reader API over the DOM API > would help in the switch. Even if we do not switch, the Reader API offers a > more memory efficient solution than the DOM API. > > Btw, I am not in a favour of switch. You cannot move backwards in document > that way. I can not fetch a tree by id if I the cursor is ahead of that > tree. Doing nexml.each_characters and nexml.each_trees is impossible with > pure streaming. I will have to stream one while cache the other. Otus and > otu provide a one to many relation with trees and characters, and rows. An > API call of the type otus.trees or otus.characters or otu.seuences would be > impossible( not that I have already added the API call ). Imo, NeXML is > non-linear and not meant to be streamed. Besides other NeXML implementations > also parse the file at the start. > > -- > Anurag Priyam, > 2nd Year Undergraduate, > Department of Mechanical Engineering, > IIT Kharagpur. 
> +91-9775550642 From rutgeraldo at gmail.com Fri Jun 25 04:14:44 2010 From: rutgeraldo at gmail.com (Rutger Vos) Date: Fri, 25 Jun 2010 09:14:44 +0100 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: <20100625074213.GA27044@thebird.nl> References: <20100624135411.GA14658@thebird.nl> <20100625064911.GB22887@thebird.nl> <20100625074213.GA27044@thebird.nl> Message-ID: This is very possible (and it's why Anurag has been focusing on stream-based parsing) but I am personally of the opinion that worrying too much about that right now would be a premature optimization. It seems to me that we want to get a nice interface that captures what NeXML can express first, and worry about performance and memory footprint later - but that's just my own opinion and certainly open for discussion. On Fri, Jun 25, 2010 at 8:42 AM, Pjotr Prins wrote: > I think this needs to be answered by Rutger. Are we going to face > NeXML files in the future that can easily outrun memory? > > Pj. > > On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote: >> > How much time would it cost you to stream the data - and what does it >> > mean with regard to changing the API? I guess, in general, NeXML >> > files won't be that large, so it may not be that important (Rutger)? >> > >> > Pj. >> > >> > >> I mean switching the parsing implementation to streaming from "parsing at >> the start" and not the API. Just that using Reader API over the DOM API >> would help in the switch. Even if we do not switch, the Reader API offers a >> more memory efficient solution than the DOM API. >> >> Btw, I am not in a favour of switch. You cannot move backwards in document >> that way. I can not fetch a tree by id if I the cursor is ahead of that >> tree. Doing nexml.each_characters and nexml.each_trees is impossible with >> pure streaming. I will have to stream one while cache the other. Otus and >> otu provide a one to many relation with trees and characters, and rows. 
An >> API call of the type otus.trees or otus.characters or otu.seuences would be >> impossible( not that I have already added the API call ). Imo, NeXML is >> non-linear and not meant to be streamed. Besides other NeXML implementations >> also parse the file at the start. >> >> -- >> Anurag Priyam, >> 2nd Year Undergraduate, >> Department of Mechanical Engineering, >> IIT Kharagpur. >> +91-9775550642 > -- Dr. Rutger A. Vos School of Biological Sciences Philip Lyle Building, Level 4 University of Reading Reading RG6 6BX United Kingdom Tel: +44 (0) 118 378 7535 http://www.nexml.org http://rutgervos.blogspot.com From pjotr.public14 at thebird.nl Fri Jun 25 04:38:36 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 25 Jun 2010 10:38:36 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> <20100625064911.GB22887@thebird.nl> <20100625074213.GA27044@thebird.nl> Message-ID: <20100625083836.GA28214@thebird.nl> On Fri, Jun 25, 2010 at 09:14:44AM +0100, Rutger Vos wrote: > This is very possible (and it's why Anurag has been focusing on > stream-based parsing) but I am personally of the opinion that worrying > too much about that right now would be a premature optimization. It > seems to me that we want to get a nice interface that captures what > NeXML can express first, and worry about performance and memory > footprint later - but that's just my own opinion and certainly open > for discussion. Oh, I agree about implementation. But it does mean Anurag needs to change his preferential solution (like back-tracking in the tree). Pj. 
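
One possible middle ground between pure streaming and holding every object in memory is a single streaming pass that keeps only a small index, so later lookups (for example, which otus a tree refers to) do not need back-tracking. The sketch below is an assumption-laden illustration: it uses Ruby's bundled REXML stream parser instead of the libxml Reader used in the project, the toy document is a simplified, namespace-free imitation of NeXML, and TreeOtuIndex is a hypothetical name.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: records, per tree id, the otu ids its nodes
# refer to. Nothing else from the document is retained.
class TreeOtuIndex
  include REXML::StreamListener

  attr_reader :otus_by_tree

  def initialize
    @otus_by_tree = Hash.new { |hash, key| hash[key] = [] }
    @current_tree = nil
  end

  def tag_start(name, attrs)
    attrs = attrs.to_h # tolerate either a Hash or an array of pairs
    case name
    when 'tree'
      @current_tree = attrs['id']
    when 'node'
      @otus_by_tree[@current_tree] << attrs['otu'] if @current_tree && attrs['otu']
    end
  end

  def tag_end(name)
    @current_tree = nil if name == 'tree'
  end
end

# Simplified toy input; real NeXML carries namespaces and more attributes.
TOY_NEXML = <<~XML
  <nexml>
    <otus id="taxa1"><otu id="t1"/><otu id="t2"/></otus>
    <trees id="trees1" otus="taxa1">
      <tree id="tree1">
        <node id="n1"/><node id="n2" otu="t1"/><node id="n3" otu="t2"/>
      </tree>
      <tree id="tree2"><node id="n4" otu="t2"/></tree>
    </trees>
  </nexml>
XML

index = TreeOtuIndex.new
REXML::Document.parse_stream(TOY_NEXML, index)
index.otus_by_tree.each { |tree, otus| puts "#{tree}: #{otus.join(', ')}" }
```

The memory cost is proportional to the index rather than to the document, at the price of a second pass (or a seek) whenever the full content of a particular tree is actually needed.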
From sararayburn at gmail.com Fri Jun 25 14:57:03 2010 From: sararayburn at gmail.com (Sara Rayburn) Date: Fri, 25 Jun 2010 13:57:03 -0500 Subject: [BioRuby] GSoC speciation/duplication inference question In-Reply-To: References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com> <20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <00A7DE6C-2985-4173-A302-619429F964BB@gmail.com> Hi, I think between the list response and conversations with my mentor, I would probably go with Bio::Algorithm::SDI, with the files in lib/bio/util/phylogeny/SDI/ I can definitely see the others as good possibilities, though. If anyone objects to this naming, please let me know so I can change it. Thanks, Sara Rayburn sararayburn at gmail.com On Jun 15, 2010, at 9:06 PM, Toshiaki Katayama wrote: > Hi, > > Replying personally as I delayed to find this thread. > > I prefer something like lib/bio/phylo/sdi.rb and Bio::Phylo::SDI, how about to gather other phyloinformatics modules under the same directory as well? > > Toshiaki > > > On 2010/06/10, at 22:46, Naohisa GOTO wrote: > >> Hi, >> >> I think the abbreviation SDI is not common in the field of biology >> and bioinformatics. In this case, it is generally good not to >> abbreviate, but the "speciation/duplication inference" is too long. >> For file/directory names, because the length limit is tight, >> using abbreviation is good. >> >> For the location of files, I suggest >> lib/bio/util/evolution/SDI/ or lib/bio/util/phylogeny/SDI/ >> to show the word SDI is in the field of evolution or phylogeny. >> >> For the class/module namespace, possible candidates are >> Bio::SpeciationDuplicationInference, Bio::Evolution::SDI, >> Bio::Algorithm::SDI, but I couldn't determine which is the best. >> If you have good idea, please tell us. 
>> >> Naohisa Goto >> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org >> >> >> On Mon, 7 Jun 2010 13:09:07 -0500 >> Sara Rayburn wrote: >> >>> Hello, >>> >>> While implementing my gsoc project (the speciation/duplicaiton inference algorithm), I'm not sure where to put the module I'm developing in the bioruby library. I've been developing in the lib/bio/util directory, but I want to make sure that's the place you all would prefer for the module. Any suggestions or thoughts? >>> >>> Thanks, >>> >>> Sara Rayburn >>> sararayburn at gmail.com >> >> >> _______________________________________________ >> BioRuby Project - http://www.bioruby.org/ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby > From anurag08priyam at gmail.com Sat Jun 26 07:35:34 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Sat, 26 Jun 2010 17:05:34 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: <20100625070804.GE22887@thebird.nl> References: <20100624135411.GA14658@thebird.nl> <20100625070804.GE22887@thebird.nl> Message-ID: On Fri, Jun 25, 2010 at 12:38 PM, Pjotr Prins wrote: > > > http://github.com/tablatom/rubydoctest > > > > > > > > Hey, I did not know that doctests existed for Ruby too. I will have a > look > > into it. > > They are good, however finding bugs is a bit problematic as the stack > traces are lengthy and often not descriptive. So with troubling code > I tend to write extra unit tests. Also, with BioRuby we have not > settled on doctests yet, so you need to reach coverage with unit > tests and/or Specs. > > I really think it is good for validating documentation. > > > > I used these in bio/appl/paml/codeml/report.rb - these are examples > > > that double as tests. Kill two birds with one stone! The BioRuby > > > tutorial also uses doctests - i.e. the code in the Tutorial can be > > > validated against the installed bioruby. 
If you want to use this you > > > need an extra conversion - I have that tool. > > > > > > > I will check out the examples. What tool? I would like to know more. > > It simply parses out commented code in the source headers, and turns > them over to rubydoctest. The tool is in my bioruby-support tree on > github - see > > > http://github.com/pjotrp/bioruby-support/blob/master/bin/uncomment_doctest > > you can see it uses an environment variable. > > Perfect. I will use it when expanding the documentation. > > I am missing Rspec too from my Rails and Merb days. I picked up unit > tests > > because much of the framework had used the same and also because I wanted > to > > try it out :). > > > > > > > I am interested to see what you want to do for RDF support. Maybe you > > > can write out the API as an RSpec? That would be a good start. > > > > > > > > That sounds like a nice idea. > > RSpec is new for BioRuby. Since you have experience you are the right > one to introduce it to us ;). If it is convincing to the others we > may accept it as standard use (personally I think it is a step > forward from unit testing - unit tests are not very good as > documentation). > > I am willing to use Rspec for the RDF API part. Converting the already existing unit tests I have written to Rspec does not sound a good idea? -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From pjotr.public14 at thebird.nl Sat Jun 26 08:19:02 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Sat, 26 Jun 2010 14:19:02 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> <20100625070804.GE22887@thebird.nl> Message-ID: <20100626121902.GA5700@thebird.nl> On Sat, Jun 26, 2010 at 05:05:34PM +0530, Anurag Priyam wrote: > I am willing to use Rspec for the RDF API part. Converting the already > existing unit tests I have written to Rspec does not sound a good idea? 
No need. Do the RDF as a proof-of-concept for the rest of BioRuby. Unit tests will (always) remain. Pj. From hlapp at drycafe.net Sat Jun 26 20:30:19 2010 From: hlapp at drycafe.net (Hilmar Lapp) Date: Sat, 26 Jun 2010 17:30:19 -0700 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: <20100625074213.GA27044@thebird.nl> References: <20100624135411.GA14658@thebird.nl> <20100625064911.GB22887@thebird.nl> <20100625074213.GA27044@thebird.nl> Message-ID: Our ability to reconstruct trees of hundreds, thousands, and even tens of thousands of characters has improved dramatically over the past couple of years, and is increasingly often the goal of an analysis. Genome-scale alignments also aren't so rare anymore. Aside from analysis, NeXML files can be produced by a database, and hence could hold large taxonomies, or the tree of life. NeXML is an emerging standard. If implementations can't cope with the large scale data that are becoming increasingly popular, it'll have a hard time to get uptake. -hilmar On Jun 25, 2010, at 12:42 AM, Pjotr Prins wrote: > I think this needs to be answered by Rutger. Are we going to face > NeXML files in the future that can easily outrun memory? > > Pj. > > On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote: >>> How much time would it cost you to stream the data - and what does >>> it >>> mean with regard to changing the API? I guess, in general, NeXML >>> files won't be that large, so it may not be that important (Rutger)? >>> >>> Pj. >>> >>> >> I mean switching the parsing implementation to streaming from >> "parsing at >> the start" and not the API. Just that using Reader API over the DOM >> API >> would help in the switch. Even if we do not switch, the Reader API >> offers a >> more memory efficient solution than the DOM API. >> >> Btw, I am not in a favour of switch. You cannot move backwards in >> document >> that way. I can not fetch a tree by id if I the cursor is ahead of >> that >> tree. 
Doing nexml.each_characters and nexml.each_trees is >> impossible with >> pure streaming. I will have to stream one while cache the other. >> Otus and >> otu provide a one to many relation with trees and characters, and >> rows. An >> API call of the type otus.trees or otus.characters or otu.seuences >> would be >> impossible( not that I have already added the API call ). Imo, >> NeXML is >> non-linear and not meant to be streamed. Besides other NeXML >> implementations >> also parse the file at the start. >> >> -- >> Anurag Priyam, >> 2nd Year Undergraduate, >> Department of Mechanical Engineering, >> IIT Kharagpur. >> +91-9775550642 > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : =========================================================== From pjotr.public14 at thebird.nl Sun Jun 27 02:47:31 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Sun, 27 Jun 2010 08:47:31 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> <20100625064911.GB22887@thebird.nl> <20100625074213.GA27044@thebird.nl> Message-ID: <20100627064731.GA15508@thebird.nl> Thanks Rutger and Hilmar, Anurag, let's not load everything in memory. Pj. On Sat, Jun 26, 2010 at 05:30:19PM -0700, Hilmar Lapp wrote: > Our ability to reconstruct trees of hundreds, thousands, and even tens > of thousands of characters has improved dramatically over the past > couple of years, and is increasingly often the goal of an analysis. > Genome-scale alignments also aren't so rare anymore. > > Aside from analysis, NeXML files can be produced by a database, and > hence could hold large taxonomies, or the tree of life. > > NeXML is an emerging standard. 
If implementations can't cope with the > large scale data that are becoming increasingly popular, it'll have a > hard time to get uptake. > > -hilmar > > On Jun 25, 2010, at 12:42 AM, Pjotr Prins wrote: > >> I think this needs to be answered by Rutger. Are we going to face >> NeXML files in the future that can easily outrun memory? >> >> Pj. >> >> On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote: >>>> How much time would it cost you to stream the data - and what does >>>> it >>>> mean with regard to changing the API? I guess, in general, NeXML >>>> files won't be that large, so it may not be that important (Rutger)? >>>> >>>> Pj. >>>> >>>> >>> I mean switching the parsing implementation to streaming from >>> "parsing at >>> the start" and not the API. Just that using Reader API over the DOM >>> API >>> would help in the switch. Even if we do not switch, the Reader API >>> offers a >>> more memory efficient solution than the DOM API. >>> >>> Btw, I am not in a favour of switch. You cannot move backwards in >>> document >>> that way. I can not fetch a tree by id if I the cursor is ahead of >>> that >>> tree. Doing nexml.each_characters and nexml.each_trees is impossible >>> with >>> pure streaming. I will have to stream one while cache the other. >>> Otus and >>> otu provide a one to many relation with trees and characters, and >>> rows. An >>> API call of the type otus.trees or otus.characters or otu.seuences >>> would be >>> impossible( not that I have already added the API call ). Imo, NeXML >>> is >>> non-linear and not meant to be streamed. Besides other NeXML >>> implementations >>> also parse the file at the start. >>> >>> -- >>> Anurag Priyam, >>> 2nd Year Undergraduate, >>> Department of Mechanical Engineering, >>> IIT Kharagpur. 
>>> +91-9775550642 >> _______________________________________________ >> BioRuby Project - http://www.bioruby.org/ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : > =========================================================== > > > > From ngoto at gen-info.osaka-u.ac.jp Sun Jun 27 03:45:43 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto) Date: Sun, 27 Jun 2010 16:45:43 +0900 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: <20100627064731.GA15508@thebird.nl> References: <20100627064731.GA15508@thebird.nl> Message-ID: <20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp> Hi, I think the ability to handle large data and the memory usage whether or not to load all data in memory at a time, is essentially independent. Not loading everything in memory does not guarantee the ability to handle large data, due to the disk I/O bottleneck and memory management overhead. I think it is currently OK to depend on memory. The price of memory is gradually going down, and I think buying a machine with huge memory could be a solution to treat large data. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > Thanks Rutger and Hilmar, > > Anurag, let's not load everything in memory. > > Pj. > > On Sat, Jun 26, 2010 at 05:30:19PM -0700, Hilmar Lapp wrote: > > Our ability to reconstruct trees of hundreds, thousands, and even tens > > of thousands of characters has improved dramatically over the past > > couple of years, and is increasingly often the goal of an analysis. > > Genome-scale alignments also aren't so rare anymore. > > > > Aside from analysis, NeXML files can be produced by a database, and > > hence could hold large taxonomies, or the tree of life. > > > > NeXML is an emerging standard. 
If implementations can't cope with the > > large scale data that are becoming increasingly popular, it'll have a > > hard time to get uptake. > > > > -hilmar > > > > On Jun 25, 2010, at 12:42 AM, Pjotr Prins wrote: > > > >> I think this needs to be answered by Rutger. Are we going to face > >> NeXML files in the future that can easily outrun memory? > >> > >> Pj. > >> > >> On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote: > >>>> How much time would it cost you to stream the data - and what does > >>>> it > >>>> mean with regard to changing the API? I guess, in general, NeXML > >>>> files won't be that large, so it may not be that important (Rutger)? > >>>> > >>>> Pj. > >>>> > >>>> > >>> I mean switching the parsing implementation to streaming from > >>> "parsing at > >>> the start" and not the API. Just that using Reader API over the DOM > >>> API > >>> would help in the switch. Even if we do not switch, the Reader API > >>> offers a > >>> more memory efficient solution than the DOM API. > >>> > >>> Btw, I am not in a favour of switch. You cannot move backwards in > >>> document > >>> that way. I can not fetch a tree by id if I the cursor is ahead of > >>> that > >>> tree. Doing nexml.each_characters and nexml.each_trees is impossible > >>> with > >>> pure streaming. I will have to stream one while cache the other. > >>> Otus and > >>> otu provide a one to many relation with trees and characters, and > >>> rows. An > >>> API call of the type otus.trees or otus.characters or otu.seuences > >>> would be > >>> impossible( not that I have already added the API call ). Imo, NeXML > >>> is > >>> non-linear and not meant to be streamed. Besides other NeXML > >>> implementations > >>> also parse the file at the start. > >>> > >>> -- > >>> Anurag Priyam, > >>> 2nd Year Undergraduate, > >>> Department of Mechanical Engineering, > >>> IIT Kharagpur. 
> >>> +91-9775550642 > >> _______________________________________________ > >> BioRuby Project - http://www.bioruby.org/ > >> BioRuby mailing list > >> BioRuby at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioruby > > > > -- > > =========================================================== > > : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : > > =========================================================== > > > > > > > > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From pjotr.public14 at thebird.nl Sun Jun 27 04:43:22 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Sun, 27 Jun 2010 10:43:22 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: <20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp> References: <20100627064731.GA15508@thebird.nl> <20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp> Message-ID: <20100627084322.GA18815@thebird.nl> On Sun, Jun 27, 2010 at 04:45:43PM +0900, Naohisa Goto wrote: > Hi, > > I think the ability to handle large data and the memory usage > whether or not to load all data in memory at a time, is essentially > independent. Not loading everything in memory does not guarantee > the ability to handle large data, due to the disk I/O bottleneck and > memory management overhead. Well, depends on what you plan to do with that data :). I think you are saying that streaming data may not be efficient, for example for treating alignments. That could be true. However, I think the default strategy should be non-memory bound, if possible. Throughout BioRuby the strategy is the opposite, at the moment. For example, by default FASTA files are loaded in RAM. Same for BLAST XML. I regularly have files that exceed RAM and work around these limitations. I don't think this should be the *default* strategy. 
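As an illustration of the non-memory-bound default argued for here, the following is a minimal sketch in plain Ruby. It is not BioRuby's FASTA reader: the helper name each_fasta_record is hypothetical, and the sketch assumes simple FASTA input where each record starts with a ">" header line. Only the current record is held in memory, so the same code works on a file or on a Unix pipe.

```ruby
require 'stringio'

# Hypothetical helper (not part of BioRuby): yields [header, sequence]
# pairs one at a time, keeping only the current record in memory.
def each_fasta_record(io)
  return enum_for(:each_fasta_record, io) unless block_given?

  header = nil
  seq = +''
  io.each_line do |line|
    line = line.chomp
    if line.start_with?('>')
      # A new header closes the previous record, if there was one.
      yield header, seq unless header.nil?
      header = line[1..-1]
      seq = +''
    elsif header
      seq << line.strip
    end
  end
  # Emit the last record once the input is exhausted.
  yield header, seq unless header.nil?
end

# Usage: any IO object works, for example $stdin in a pipe.
sample = StringIO.new(">seq1\nACGT\nAC\n>seq2\nGG\n")
each_fasta_record(sample) { |id, s| puts "#{id}\t#{s.length}" }
```

Passing $stdin instead of the StringIO would let the sketch sit in a shell pipeline; whether this approach suits non-linear formats such as NeXML is a separate question, as discussed elsewhere in this thread.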
I prefer the Unix way of using pipes. Only use memory when it is available. With new code we should design for big data. If it is done from the start, it takes no real effort. > I think it is currently OK to depend on memory. The price of memory > is gradually going down, and I think buying a machine with huge > memory could be a solution to treat large data. We can not all afford big machines. It would hamper many groups/students. RAM is getting cheaper, but data is growing faster. Anurag, what is the size of RAM you have access to? Pj. From anurag08priyam at gmail.com Sun Jun 27 04:49:37 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Sun, 27 Jun 2010 14:19:37 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: <20100627084322.GA18815@thebird.nl> References: <20100627064731.GA15508@thebird.nl> <20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp> <20100627084322.GA18815@thebird.nl> Message-ID: On Sun, Jun 27, 2010 at 2:13 PM, Pjotr Prins wrote: > On Sun, Jun 27, 2010 at 04:45:43PM +0900, Naohisa Goto wrote: > > Hi, > > > > I think the ability to handle large data and the memory usage > > whether or not to load all data in memory at a time, is essentially > > independent. Not loading everything in memory does not guarantee > > the ability to handle large data, due to the disk I/O bottleneck and > > memory management overhead. > > Well, depends on what you plan to do with that data :). I think you > are saying that streaming data may not be efficient, for example for > treating alignments. That could be true. However, I think the default > strategy should be non-memory bound, if possible. Throughout BioRuby > the strategy is the opposite, at the moment. For example, by default > FASTA files are loaded in RAM. Same for BLAST XML. I regularly have > files that exceed RAM and work around these limitations. I don't think > this should be the *default* strategy. > > I prefer the Unix way of using pipes. 
Only use memory when it is > available. > > With new code we should design for big data. If it is done from the > start, it takes no real effort. > > > I think it is currently OK to depend on memory. The price of memory > > is gradually going down, and I think buying a machine with huge > > memory could be a solution to treat large data. > > We can not all afford big machines. It would hamper many > groups/students. RAM is getting cheaper, but data is growing faster. > > Anurag, what is the size of RAM you have access to? > > 3GB. The biggest sample file I am working with is 500 lines( characters.xml in the examples ); working with it has hardly any effect on my memory. From, where can I get a bigger one? I can test the memory consumption with a large enough file and report. -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From hlapp at drycafe.net Sun Jun 27 19:23:19 2010 From: hlapp at drycafe.net (Hilmar Lapp) Date: Sun, 27 Jun 2010 16:23:19 -0700 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100627064731.GA15508@thebird.nl> <20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp> <20100627084322.GA18815@thebird.nl> Message-ID: On Jun 27, 2010, at 1:49 AM, Anurag Priyam wrote: > 3GB. The biggest sample file I am working with is 500 > lines( characters.xml > in the examples ); working with it has hardly any effect on my > memory. From, > where can I get a bigger one? Use the NCBI taxonomy :-) Or download the tree from tolweb.org and convert to NeXML. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : =========================================================== From anurag08priyam at gmail.com Mon Jun 28 05:31:26 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Mon, 28 Jun 2010 15:01:26 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. 
In-Reply-To: <20100624135411.GA14658@thebird.nl> References: <20100624135411.GA14658@thebird.nl> Message-ID: > > > A final comment for this session: The class/method descriptions are > not very informative. It may be early days - especially since we can > see some refactoring coming, but it usually helps to write out > examples giving the 'nicest' interface for people to use. And stick > those in the source code. Personally I favour rubydoctests, see > > http://github.com/tablatom/rubydoctest > > I am loving rubydoctest. Thanks for showing it to me :). As of now I am using it in my nexml serialization implementation. -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From anurag08priyam at gmail.com Mon Jun 28 05:52:32 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Mon, 28 Jun 2010 15:22:32 +0530 Subject: [BioRuby] Testing complex nexml output. Message-ID: I am finding it a little difficult to test the nexml serializer. Any nexml object, say otu, is serialized by a function call of the type NeXML::Writer#serialize_otu, which returns an XML::Node object. A raw nexml representation can be obtained by calling to_s on the return value. These nodes are added to the document root and then saved to a file by calling XML::Document#save. Now, when it comes to testing, comparing nexml strings does not make sense because the test is rendered invalid even by a different ordering of the attributes of a node or by newline issues. What I am doing is to initialize two XML::Node objects: one from a test file and one that I generate with the serialize_otu function, and then compare these xml nodes for equality attribute by attribute and child by child. An example here: http://github.com/yeban/bioruby/blob/writer/test/unit/bio/db/nexml/tc_writer.rb#L166 However, the lack of a proper XML::Node#eql? is making things a little difficult for me.
See: http://github.com/yeban/bioruby/blob/writer/test/unit/bio/db/nexml/tc_writer.rb#L222 An obvious solution is to define an eql? method myself in Bio::Node. But am I going in the right direction when it comes to testing xml output? -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From anurag08priyam at gmail.com Mon Jun 28 05:56:52 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Mon, 28 Jun 2010 15:26:52 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: <20100625065539.GD22887@thebird.nl> References: <20100624135411.GA14658@thebird.nl> <20100625065539.GD22887@thebird.nl> Message-ID: > ..... Also, when parsing this type of XML some Ruby reflection > may come in handy - I did some of that in my BioRuby GEO parser, which > lives in my GEO branch on github. I picked up the method_missing trick for the serializer. http://github.com/yeban/bioruby/blob/writer/lib/bio/db/nexml/writer.rb > You should look at each class and > see if you can refactor it down to a single solution. Just make sure > it is not at the expense of readability and understanding. > > Post us some ideas here, before you start hacking code. > > Pj. > > I will. -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From ngoto at gen-info.osaka-u.ac.jp Mon Jun 28 08:00:05 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Mon, 28 Jun 2010 21:00:05 +0900 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> <20100625065539.GD22887@thebird.nl> Message-ID: <20100628120005.61D751CBC32B@idnmail.gen-info.osaka-u.ac.jp> Hi, Please never use method_missing. It breaks error reporting and makes it very hard to debug and maintain both library code and user scripts.
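One way to keep the brevity of dynamic dispatch without method_missing is to generate real methods from an explicit list. The following is only a sketch under stated assumptions: SketchWriter and SERIALIZABLE are hypothetical names, the XML strings are placeholders, and none of this is taken from the actual Bio::NeXML::Writer code. Because the methods really exist, respond_to? works and a misspelled call raises an ordinary NoMethodError.

```ruby
# Hypothetical stand-in for a serializer class; not the real Bio::NeXML::Writer.
class SketchWriter
  # Explicit list of the element kinds this writer can serialize.
  SERIALIZABLE = %w[otu otus tree].freeze

  # Define one real method per kind when the class is loaded,
  # instead of intercepting unknown calls in method_missing.
  SERIALIZABLE.each do |kind|
    define_method("serialize_#{kind}") do |id|
      "<#{kind} id=\"#{id}\"/>"
    end
  end
end

writer = SketchWriter.new
puts writer.serialize_otu('t1')
puts writer.respond_to?(:serialize_tree)

begin
  writer.serialize_otuu('t1') # misspelled on purpose
rescue NoMethodError => e
  # The standard error is raised; nothing swallows or rewrites it.
  puts "#{e.class}: #{e.message}"
end
```

The trade-off is that every supported name must appear in the list, which is arguably the point: the public interface stays visible in one place.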
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Mon, 28 Jun 2010 15:26:52 +0530 Anurag Priyam wrote: > > ..... Also, when parsing this type of XML some Ruby reflection > > may come in handy - I did some of that in my BioRuby GEO parser, which > > lives in my GEO branch on github. > > > I picked up the method_missing trick for the serializer. > > http://github.com/yeban/bioruby/blob/writer/lib/bio/db/nexml/writer.rb > > > > You should look at each class and > > see if you can refactor it down to a single solution. Just make sure > > it is not at the expense of readability and understanding. > > > > Post us some ideas here, before you start hacking code. > > > > Pj. > > > > > I will. > > -- > Anurag Priyam, > 2nd Year Undergraduate, > Department of Mechanical Engineering, > IIT Kharagpur. > +91-9775550642 > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From ngoto at gen-info.osaka-u.ac.jp Mon Jun 28 08:54:09 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Mon, 28 Jun 2010 21:54:09 +0900 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> <20100625065539.GD22887@thebird.nl> Message-ID: <20100628125409.B23271CBC32B@idnmail.gen-info.osaka-u.ac.jp> Dear Anurag, Do not add methods to other classes and modules outside Bio. Modifying classes and modules outside the Bio namespace is prohibited in the BioRuby library because such code could conflict with user scripts or other libraries when each defines a method with the same name but different behavior, or when the original class is refactored by its original authors. It is BioRuby's policy to respect the user's freedom. For example, if we defined Array#has?, a user who wants to define Array#has? with a different meaning could not use BioRuby.
So, to protect users' rights, it is our policy not to change anything outside Bio as far as possible. PS. You may find some exceptional code in Bio::Shell and in sample scripts, because they are separate applications. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Mon, 28 Jun 2010 15:26:52 +0530 Anurag Priyam wrote: > > ..... Also, when parsing this type of XML some Ruby reflection > > may come in handy - I did some of that in my BioRuby GEO parser, which > > lives in my GEO branch on github. > > > I picked up the method_missing trick for the serializer. > > http://github.com/yeban/bioruby/blob/writer/lib/bio/db/nexml/writer.rb > > > > You should look at each class and > > see if you can refactor it down to a single solution. Just make sure > > it is not at the expense of readability and understanding. > > > > Post us some ideas here, before you start hacking code. > > > > Pj. > > > > > I will. > > -- > Anurag Priyam, > 2nd Year Undergraduate, > Department of Mechanical Engineering, > IIT Kharagpur. > +91-9775550642 > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From anurag08priyam at gmail.com Mon Jun 28 10:13:36 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Mon, 28 Jun 2010 19:43:36 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: <20100628125409.B23271CBC32B@idnmail.gen-info.osaka-u.ac.jp> References: <20100624135411.GA14658@thebird.nl> <20100625065539.GD22887@thebird.nl> <20100628125409.B23271CBC32B@idnmail.gen-info.osaka-u.ac.jp> Message-ID: > It is BioRuby's policy to respect the user's freedom. For example, > if we defined Array#has?, a user who wants to define Array#has? > with a different meaning could not use BioRuby. So, to protect > users' rights, it is our policy not to change anything outside Bio as > far as possible. > > Corrected.
Thanks for pointing this out, Goto-san :). -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From anurag08priyam at gmail.com Mon Jun 28 10:22:37 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Mon, 28 Jun 2010 19:52:37 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: <20100628120005.61D751CBC32B@idnmail.gen-info.osaka-u.ac.jp> References: <20100624135411.GA14658@thebird.nl> <20100625065539.GD22887@thebird.nl> <20100628120005.61D751CBC32B@idnmail.gen-info.osaka-u.ac.jp> Message-ID: > Please never use method_missing. It breaks error reporting and > makes it very hard to debug and maintain both library code and > user scripts. > Hmm, I have experienced that. But the way I have used it affects only the Bio::NeXML::Writer class, so is it not safe in this case? Anyway, I will change it, as it does not offer much improvement to the code readability in my case. I just find it exciting :). -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From yogiprasanna at gmail.com Wed Jun 30 10:11:42 2010 From: yogiprasanna at gmail.com (Prasanna Bala) Date: Wed, 30 Jun 2010 19:41:42 +0530 Subject: [BioRuby] Contribution in Bioruby... Message-ID: Hi, My name is Prasanna. I am working in a software firm with Ruby on Rails technology. I am new to BioRuby. I am interested in contributing to the BioRuby project. I would like to know where to start and whom to approach for specific tasks. I have extensive experience in biomedical text mining. Is there any group specifically working on biomedical text mining, ontology mapping, etc.? I would also like to know which issues the community is working on now, and I would like a list of the current topics in BioRuby. Regards, Prasanna.
From pjotr.public14 at thebird.nl Wed Jun 30 11:31:05 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Wed, 30 Jun 2010 17:31:05 +0200 Subject: [BioRuby] Contribution in Bioruby... In-Reply-To: References: Message-ID: <20100630153105.GB10804@thebird.nl> Hi Prasanna, On Wed, Jun 30, 2010 at 07:41:42PM +0530, Prasanna Bala wrote: > Hi, > My name is Prasanna. I am working in a software firm in ruby on rails > technology. I am new to Bioruby. I am interested in contributing for > Bio-ruby project. I would like to know where to start things. To whom to > approach for specific tasks. I have extensive experience in Biomedical text > mining. Is there is any group specifically working on Biomedical text > mining, Ontology Mapping etc.. And I also want to know what are the issues > now the community is working on ? I want to know list of current topics > that's going on in Bioruby. Thanks for showing your interest. It would be great if you were to look at text mining and ontologies for BioRuby. It is relevant for our work. To start with BioRuby get a github.com account and clone the repository. You can start coding, and post questions on this mailing list. We are having a presentation at BOSC next week, and the slides discuss current work. It will be available for everyone. Where are you located geographically? Pj. From anurag08priyam at gmail.com Wed Jun 30 18:07:09 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Thu, 1 Jul 2010 03:37:09 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Update Message-ID: In the last week and half of this week I have: * been able to work out an NeXML serializer - the code sits in the master branch[1]. In the API page[ 2 ] I have added a discussion on the implementation. 
* started working on the RDF API - i should be able to come up with RSpecs by the end of this week In the remaining part of the week I will: * come with an RDF API implementation * work on refactoring some of the previous code( matrix and the sequences part ) as Pjotr had pointed out in the last review. Perhaps, we can have another round of code review: for the NeXML serializer? This will help me allocate time in the coming weeks to fix the issues with the code. [1] http://github.com/yeban/bioruby [2] https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From anurag08priyam at gmail.com Wed Jun 30 18:15:08 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Thu, 1 Jul 2010 03:45:08 +0530 Subject: [BioRuby] [GSoC] Message-ID: I hope you guys are tuned to my updates on both the lists and the code and the project plan. Please do keep reminding me if I am missing out on something obvious :). -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From anurag08priyam at gmail.com Thu Jun 3 09:00:06 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Thu, 3 Jun 2010 14:30:06 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Update Message-ID: Hello all, I know this update is coming quite late. Sorry for holding this back for so long. From now on I will be updating this list weekly on my progress. Just to keep everyone in the loop, [1] is my project page. What has been done? Till now I have been able to do a significant amount of work on the NeXML parser. The parser recognizes otus, otu and trees. The trees implementation is not complete as per the NeXML schema. Trees with multiple rootings, coalescent trees and networks remain to be done. Problems Faced: Initially it was decided to stream parse any NeXML document as DOM parsing would be slow for larger documents. 
But with NeXML's non linear design, streaming seems non natural and proves to be a little difficult. Currently, I have written a wrapper over the StAX parsing API of libxml but the entire document is parsed in one go; at the start. Current git head[2] can be built and the code tested out. A tutorial( kind of ) on how to use the NeXML can be found here[3]. [1] https://www.nescent.org/wg_phyloinformatics/Category:NeXML_and_RDF_API_for_BioRuby [2] http://github.com/yeban/bioruby [3] https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From anurag08priyam at gmail.com Fri Jun 4 08:39:34 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Fri, 4 Jun 2010 14:09:34 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Tree with Multiple rootings. Message-ID: Hello all, NeXML allows for trees with multiple rootings. In the NeXML lib trees are represented by Bio::NeXML::Tree which inherits from Bio::Tree. This allows for the usage of the excellent Bio::Tree framework for manipulating NeXML trees. However, Bio::Tree class supports only one root node. There are a couple of functions that require the presence of a root node: parent, children, descendants, ancestors, lowest_common_ancestor. Now, these functions can take a root node as a parameter. So it is possible to extend the current framework to work with trees with multiple root nodes. Though this may not be required, a possibility is to add the multiple root functionality to Bio::Tree class itself. Currently, I am adding multiple root support to Bio::NeXML::Tree class. If need be we can move the functionality to Bio::Tree. Anything? -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. 
+91-9775550642 From rutgeraldo at gmail.com Fri Jun 4 14:21:27 2010 From: rutgeraldo at gmail.com (Rutger Vos) Date: Fri, 4 Jun 2010 15:21:27 +0100 Subject: [BioRuby] [GSoC][NeXML and RDF API] Tree with Multiple rootings. In-Reply-To: References: Message-ID: Hi Anurag, in practice I haven't actually seen trees with multiple rootings being used much, so it might not be urgent that this moves to the bioruby core. My main worry would be in picking the "right" root node to expose to the core api. I think that it should be the node from which all other nodes can be visited in a recursive traversal (which I expect client code to do), as opposed to a node that has been indicated using an XML attribute to be the root, but isn't in terms of the actual topology that emerges from the node and edge tables. However, I'm curious to hear other people's opinions whether a flag (e.g. "is_root") might be added Bio::Tree::Node, and a "get_roots" method in Bio::Tree that returns a list of roots that typically only holds the value of the "root" attribute, but could potentially have multiple rootings. Rutger On Fri, Jun 4, 2010 at 9:39 AM, Anurag Priyam wrote: > Hello all, > > NeXML allows for trees with multiple rootings. In the NeXML lib trees are > represented by Bio::NeXML::Tree which inherits from Bio::Tree. This allows > for the usage of the excellent Bio::Tree framework for manipulating NeXML > trees. However, Bio::Tree class supports only one root node. > > There are a couple of functions that require the presence of a root node: > parent, children, descendants, ancestors, lowest_common_ancestor. Now, these > functions can take a root node as a parameter. So it is possible to extend > the current framework to work with trees with multiple root nodes. > > Though this may not be required, a possibility is to add the multiple root > functionality to Bio::Tree class itself. Currently, I am adding multiple > root support to Bio::NeXML::Tree class. 
If need be we can move the > functionality to Bio::Tree. > > Anything? > > -- > Anurag Priyam, > 2nd Year Undergraduate, > Department of Mechanical Engineering, > IIT Kharagpur. > +91-9775550642 > -- Dr. Rutger A. Vos School of Biological Sciences Philip Lyle Building, Level 4 University of Reading Reading RG6 6BX United Kingdom Tel: +44 (0) 118 378 7535 http://www.nexml.org http://rutgervos.blogspot.com From hlapp at drycafe.net Fri Jun 4 19:09:11 2010 From: hlapp at drycafe.net (Hilmar Lapp) Date: Fri, 4 Jun 2010 15:09:11 -0400 Subject: [BioRuby] [GSoC][NeXML and RDF API] Tree with Multiple rootings. In-Reply-To: References: Message-ID: <01FB94FA-D962-4624-B612-49AA26FF3E2D@drycafe.net> Multiple roots can be the result of a Bayesian analysis. (The PhyloDB module in BioSQL, for example, does support multiple roots.) However, representing multiple roots is useless without also being able to indicate whether a root is an alternate root or the main root node, and what its significance (posterior prob. for a Bayesian analysis) is. For reference, here is the column documentation for these two properties in PhyloDB's tree_root table: COMMENT ON COLUMN tree_root.is_alternate IS 'True if the root node is the preferential (most likely) root node of the tree, and false otherwise.'; COMMENT ON COLUMN tree_root.significance IS 'The significance (such as likelihood, or posterior probability) with which the node is the root node. This only has meaning if the method used for reconstructing the tree calculates this value.'; -hilmar On Jun 4, 2010, at 10:21 AM, Rutger Vos wrote: > Hi Anurag, > > in practice I haven't actually seen trees with multiple rootings being > used much, so it might not be urgent that this moves to the bioruby > core. My main worry would be in picking the "right" root node to > expose to the core api. 
I think that it should be the node from which > all other nodes can be visited in a recursive traversal (which I > expect client code to do), as opposed to a node that has been > indicated using an XML attribute to be the root, but isn't in terms of > the actual topology that emerges from the node and edge tables. > > However, I'm curious to hear other people's opinions whether a flag > (e.g. "is_root") might be added Bio::Tree::Node, and a "get_roots" > method in Bio::Tree that returns a list of roots that typically only > holds the value of the "root" attribute, but could potentially have > multiple rootings. > > Rutger > > On Fri, Jun 4, 2010 at 9:39 AM, Anurag Priyam > wrote: >> Hello all, >> >> NeXML allows for trees with multiple rootings. In the NeXML lib >> trees are >> represented by Bio::NeXML::Tree which inherits from Bio::Tree. This >> allows >> for the usage of the excellent Bio::Tree framework for manipulating >> NeXML >> trees. However, Bio::Tree class supports only one root node. >> >> There are a couple of functions that require the presence of a root >> node: >> parent, children, descendants, ancestors, lowest_common_ancestor. >> Now, these >> functions can take a root node as a parameter. So it is possible to >> extend >> the current framework to work with trees with multiple root nodes. >> >> Though this may not be required, a possibility is to add the >> multiple root >> functionality to Bio::Tree class itself. Currently, I am adding >> multiple >> root support to Bio::NeXML::Tree class. If need be we can move the >> functionality to Bio::Tree. >> >> Anything? >> >> -- >> Anurag Priyam, >> 2nd Year Undergraduate, >> Department of Mechanical Engineering, >> IIT Kharagpur. >> +91-9775550642 >> > > > > -- > Dr. Rutger A. 
Vos > School of Biological Sciences > Philip Lyle Building, Level 4 > University of Reading > Reading > RG6 6BX > United Kingdom > Tel: +44 (0) 118 378 7535 > http://www.nexml.org > http://rutgervos.blogspot.com > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net : =========================================================== From mitlox at op.pl Sun Jun 6 05:30:09 2010 From: mitlox at op.pl (xyz) Date: Sun, 06 Jun 2010 15:30:09 +1000 Subject: [BioRuby] fastq files reading In-Reply-To: <20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp> References: <20100529221404.0175ee75@wp01> <20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp> Message-ID: <4C0B3261.3020909@op.pl> Thank you for the solutions it works. From sararayburn at gmail.com Mon Jun 7 18:09:07 2010 From: sararayburn at gmail.com (Sara Rayburn) Date: Mon, 7 Jun 2010 13:09:07 -0500 Subject: [BioRuby] GSoC speciation/duplication inference question Message-ID: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com> Hello, While implementing my gsoc project (the speciation/duplicaiton inference algorithm), I'm not sure where to put the module I'm developing in the bioruby library. I've been developing in the lib/bio/util directory, but I want to make sure that's the place you all would prefer for the module. Any suggestions or thoughts? 
Thanks, Sara Rayburn sararayburn at gmail.com From anurag08priyam at gmail.com Wed Jun 9 08:17:55 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Wed, 9 Jun 2010 13:47:55 +0530 Subject: [BioRuby] fastq files reading In-Reply-To: <4C0B3261.3020909@op.pl> References: <20100529221404.0175ee75@wp01> <20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp> <4C0B3261.3020909@op.pl> Message-ID: Maybe we should add this to the wiki [1] [1] http://bioruby.open-bio.org/wiki/SampleCodes On Sun, Jun 6, 2010 at 11:00 AM, xyz wrote: > Thank you for the solutions it works. > -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From mitlox at op.pl Wed Jun 9 12:42:36 2010 From: mitlox at op.pl (xyz) Date: Wed, 09 Jun 2010 22:42:36 +1000 Subject: [BioRuby] fastq files reading In-Reply-To: References: <20100529221404.0175ee75@wp01> <20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp> <4C0B3261.3020909@op.pl> Message-ID: <4C0F8C3C.1030303@op.pl> Good idea. On 06/09/10 18:17, Anurag Priyam wrote: > Maybe we should add this to the wiki [1] > > [1] http://bioruby.open-bio.org/wiki/SampleCodes > > On Sun, Jun 6, 2010 at 11:00 AM, xyz > wrote: > > Thank you for the solutions it works. > > > > > -- > Anurag Priyam, > 2nd Year Undergraduate, > Department of Mechanical Engineering, > IIT Kharagpur. > +91-9775550642 From anurag08priyam at gmail.com Wed Jun 9 19:49:35 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Thu, 10 Jun 2010 01:19:35 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. Message-ID: Last week, I worked on finishing implementation of Trees: trees, tree, network; and started work on the characters element. This weeks target is to complete the implementation of the characters element. It would be awesome to have some code review including: implementation, API design, coding style and tests. I am planning to give a good amount of time in the fourth week in making the code more robust. 
It would make perfect sense to have some feedback to serve as guidelines :). The master branch and API discussion page are at: [1] http://github.com/yeban/bioruby [2] https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From czmasek at burnham.org Thu Jun 10 01:50:16 2010 From: czmasek at burnham.org (Christian M Zmasek) Date: Wed, 9 Jun 2010 18:50:16 -0700 Subject: [BioRuby] gsoc questions In-Reply-To: References: <2B862774-AD61-4BC0-86D6-69DCD832EB78@gmail.com> Message-ID: <4C1044D8.4010600@burnham.org> Hi Sara: > > > On Mon, Jun 7, 2010 at 11:20 AM, Sara Rayburn > wrote: > > Hi Christian and Diana, > > Two questions: > > 1) On the phylosoft website for forester/sdi > (http://www.phylosoft.org/forester/applications/sdi_r/) I've read > this about the two trees: > "The important point to keep in mind is that there must be at least > one sub-element of the 'taxonomy' element which allows to match the > sequences in the gene tree with a taxonomy in the species tree. In > this example this sub-element of the 'taxonomy' element is 'code'." > > Does this mean that the sub-element for matching will *always* be > 'code'? Or should I just be looking for anything at all that > matches? Also, will all phyloxml trees have the 'code' sub-element? > > > To find out whether some element will always contain some other element > you can look at PhyloXML documentation [0]. For example at the Taxonomy > element documentation [1] you can see that it has a sub-element "code" > which is [0..1], which means that there either is no "code" sub-element > or there is one and no more, whereas there could none or many "synonym" > sub-elements > > [0] http://www.phyloxml.org/documentation/version_1.10/phyloxml.xsd.html > [1] http://www.phyloxml.org/documentation/version_1.10/phyloxml.xsd.html#h888650454 Good point! 
This matching of taxonomic information is a crucial point. I recommend implementing this in the same manner as it is implemented in the "isEqual" method of the org.forester.phylogeny.data.Taxonomy class of the forester library, see:

http://forester-atv.cvs.sourceforge.net/viewvc/forester-atv/forester-atv/java/src/org/forester/phylogeny/data/Taxonomy.java?revision=1.57&view=markup

In this (Java) class the matching works like this:

1. If both Taxonomies to be compared have identifiers with the same source (e.g. NCBI taxonomy), use these identifiers to match. In Java:

    if ( ( getIdentifier() != null ) && ( tax.getIdentifier() != null ) ) {
        return getIdentifier().isEqual( tax.getIdentifier() );
    }

2. Otherwise, if both Taxonomies have taxonomy codes, use the taxonomy codes to match. In Java:

    else if ( !ForesterUtil.isEmpty( getTaxonomyCode() ) && !ForesterUtil.isEmpty( tax.getTaxonomyCode() ) ) {
        return getTaxonomyCode().equals( tax.getTaxonomyCode() );
    }

3. Otherwise, if both Taxonomies have scientific names, use the scientific names to match.

4. Otherwise, if both Taxonomies have common names, use the common names to match.

5. Otherwise, matching is not possible and an error should be thrown.

Generally speaking, I recommend getting the source code of forester and looking at the classes in the org.forester.sdi directory (especially SDI.java, SDIse.java, and SDIR.java).

>
> 2) Here's my assumptions about the final output of the algorithm:
> Each node in the tree should be updated with speciation OR
> duplication, and the tree as a whole has a count of
> speciation/duplication events. Am I on the right track here?

Yes, the primary goal of the algorithm is to calculate for each node in the gene tree whether it is a duplication or a speciation, and thus each node should be annotated as duplication or speciation. Keeping track of the sum of duplications and speciations is useful too, but it cannot, as far as I know, be stored in the tree object itself.
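The matching order in steps 1 to 5 above could be sketched in Ruby roughly as follows. This is only an illustration under stated assumptions: TaxonomySketch and its field names (id_value, id_provider, code, scientific_name, common_name), TaxonomyMatchError, blank_field?, both_have? and taxonomies_match? are hypothetical names made up for this sketch, not the Bio::PhyloXML or forester API. Falling through to the next step when the identifier sources differ is one possible reading of step 1.

```ruby
# Hypothetical stand-in for a taxonomy record. The field names are
# assumptions for this sketch, not the Bio::PhyloXML::Taxonomy API.
TaxonomySketch = Struct.new(:id_value, :id_provider, :code,
                            :scientific_name, :common_name)

# Raised when two taxonomies share no comparable field (step 5).
class TaxonomyMatchError < StandardError; end

# True if the value is nil or contains only whitespace.
def blank_field?(value)
  value.nil? || value.to_s.strip.empty?
end

# True if both records carry a usable value for the given field.
def both_have?(a, b, field)
  !blank_field?(a[field]) && !blank_field?(b[field])
end

# Compares two taxonomies using the first field both of them provide.
def taxonomies_match?(a, b)
  if both_have?(a, b, :id_value) && a.id_provider == b.id_provider
    a.id_value == b.id_value                 # step 1: identifiers, same source
  elsif both_have?(a, b, :code)
    a.code == b.code                         # step 2: taxonomy codes
  elsif both_have?(a, b, :scientific_name)
    a.scientific_name == b.scientific_name   # step 3: scientific names
  elsif both_have?(a, b, :common_name)
    a.common_name == b.common_name           # step 4: common names
  else
    raise TaxonomyMatchError, "no common field to compare"  # step 5
  end
end

# Identifiers take precedence over the differing common names.
gene    = TaxonomySketch.new("9606", "ncbi", nil, nil, "man")
species = TaxonomySketch.new("9606", "ncbi", nil, nil, "human")
taxonomies_match?(gene, species)  # => true
```

Raising in the last branch, rather than returning false, keeps "these taxonomies differ" apart from "these taxonomies cannot be compared", which is the distinction step 5 asks for.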
Maybe the algorithm could return a small "SDI_result" object which is used to store such "summary" information. Christian From ngoto at gen-info.osaka-u.ac.jp Thu Jun 10 13:46:40 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 10 Jun 2010 22:46:40 +0900 Subject: [BioRuby] GSoC speciation/duplication inference question In-Reply-To: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com> References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com> Message-ID: <20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp> Hi, I think the abbreviation SDI is not common in the field of biology and bioinformatics. In this case, it is generally good not to abbreviate, but the "speciation/duplication inference" is too long. For file/directory names, because the length limit is tight, using abbreviation is good. For the location of files, I suggest lib/bio/util/evolution/SDI/ or lib/bio/util/phylogeny/SDI/ to show the word SDI is in the field of evolution or phylogeny. For the class/module namespace, possible candidates are Bio::SpeciationDuplicationInference, Bio::Evolution::SDI, Bio::Algorithm::SDI, but I couldn't determine which is the best. If you have good idea, please tell us. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Mon, 7 Jun 2010 13:09:07 -0500 Sara Rayburn wrote: > Hello, > > While implementing my gsoc project (the speciation/duplicaiton inference algorithm), I'm not sure where to put the module I'm developing in the bioruby library. I've been developing in the lib/bio/util directory, but I want to make sure that's the place you all would prefer for the module. Any suggestions or thoughts? > > Thanks, > > Sara Rayburn > sararayburn at gmail.com From kpatil at science.uva.nl Thu Jun 17 09:22:12 2010 From: kpatil at science.uva.nl (K. 
Patil)
Date: Thu, 17 Jun 2010 11:22:12 +0200 (CEST)
Subject: [BioRuby] newick to phyloxml
In-Reply-To: <20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp>
References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com> <20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <1764.139.19.75.1.1276766532.squirrel@webmail.science.uva.nl>

Hi,

I noticed the inclusion of phyloxml support in bioruby, thanks a lot, it's very useful. I was wondering if there is any straightforward way to convert a newick tree to phyloxml?

best

From czmasek at burnham.org Thu Jun 17 22:49:12 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Thu, 17 Jun 2010 15:49:12 -0700
Subject: [BioRuby] Gene duplications GSoC project: answers to some of your questions
Message-ID: <4C1AA668.7040801@burnham.org>

Hi, Sara:

Regarding some of your questions posted on http://wiki.github.com/srayburn/bioruby/gsoc-2010-implementing-sdi-project-updates

Re: "Right now initialization loads from a hard coded file. I need to make this flexible so that trees can come from any file or from a previously loaded tree object":

The input of the algorithm(s) should be tree objects; reading the trees should not be part of the algorithm implementation. Clearly, for testing you need to read the trees from files, but this should be implemented in your test code, not as part of the algorithm implementation itself.

Re: "The names of leaf nodes: how standard are they? Is there a standard format here? I'm going to look at example trees from the forester implementation to get ideas about this. If I'm still stumped I'll check with my mentors."

No, there is no standard. The only question for the purpose of this algorithm is whether they match or not. I.e. the names could just be numbers, common names, or scientific names.
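As a small illustration of both points (the algorithm receives tree objects rather than file names, and leaf names only need to match), here is a hedged Ruby sketch. LeafSketch and map_leaves_by_name are hypothetical names invented for this example and are not part of BioRuby; a real implementation would work on Bio::Tree nodes, and comparing names as strings is an assumption of the sketch.

```ruby
# Hypothetical minimal leaf type, used only for this sketch.
LeafSketch = Struct.new(:name)

# Pairs every gene-tree leaf with the species-tree leaf of the same name.
# Takes already-loaded leaf objects; no file is read in here.
# Returns an array of [gene_leaf, species_leaf] pairs, because several
# gene leaves may legitimately map to the same species leaf.
def map_leaves_by_name(gene_leaves, species_leaves)
  by_name = {}
  species_leaves.each { |leaf| by_name[leaf.name.to_s] = leaf }

  gene_leaves.map do |leaf|
    match = by_name[leaf.name.to_s]
    if match.nil?
      raise ArgumentError, "no species leaf named #{leaf.name.inspect}"
    end
    [leaf, match]
  end
end

# Names may be numbers, common names or scientific names, as long as
# the two trees agree on them.
species = [LeafSketch.new("human"), LeafSketch.new("mouse")]
genes   = [LeafSketch.new("human"), LeafSketch.new("mouse"), LeafSketch.new("human")]
pairs   = map_leaves_by_name(genes, species)
pairs.length  # => 3
```

Reading the trees from files would then live in the test code, which builds the leaf (or tree) objects and hands them to the method.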
Hope this helps, Christian From czmasek at burnham.org Fri Jun 18 03:16:20 2010 From: czmasek at burnham.org (Christian M Zmasek) Date: Thu, 17 Jun 2010 20:16:20 -0700 Subject: [BioRuby] newick to phyloxml In-Reply-To: <1764.139.19.75.1.1276766532.squirrel@webmail.science.uva.nl> References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com> <20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp> <1764.139.19.75.1.1276766532.squirrel@webmail.science.uva.nl> Message-ID: <4C1AE504.3000307@burnham.org> Hi, Unfortunately, this is not possible in a straightforward way. The problem is that the tree object (Bio::Tree) returned by: input = Bio::FlatFile.open(Bio::Newick, "tree.nh") tree = input.next_entry.tree is the parent type of the tree object(Bio::PhyloXML::Tree) required by: writer = Bio::PhyloXML::Writer.new("tree.xml") writer.write(phyloxml_tree) Christian K. Patil wrote: > Hi, > > I noticed the inclusion of phyloxml support in bioruby, thanks a lot, its > very useful. I was wondering if there is any straightforward way to > convert a newick tree to phyloxml? > > best > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From anurag08priyam at gmail.com Tue Jun 22 08:46:19 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Tue, 22 Jun 2010 14:16:19 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Update In-Reply-To: References: Message-ID: Hello all, Much of the parser implementation is complete as of now. Last time I had sent an update I had begun implementing characters element. Week 3( June 7-13) was quite low on work due to power shortage where I live. Consequently implementation of characters spanned Week 4( June 14-20 ) too. Work on NeXML serialization has begun. As of now it can serialize taxa blocks. This week( week 5 - June 21-28 ) I will be working on serializing trees and characters element. 
I would also like to update a little more on future development plans. I am targeting to finish much of the software development by week 9( July 19-25 ), leaving week 10, week 11 and week 12 for feedback and iterations. This is the time where I should make up for any mistakes or lost work. Perhaps in this week we can make the code ready for merging in BioRuby's master branch. Apart from this, I am targeting to finish serializer and start working on the RDF API by week 6. Maybe we could have a round of code review after that too? I am notifying this in advance so that if possible developers can allocate time for this. Sounds good? -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From czmasek at burnham.org Tue Jun 22 18:29:09 2010 From: czmasek at burnham.org (Christian M Zmasek) Date: Tue, 22 Jun 2010 11:29:09 -0700 Subject: [BioRuby] gsoc update In-Reply-To: <6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com> References: <4C1AACBA.4030908@burnham.org> <6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com> Message-ID: <4C2100F5.3020306@burnham.org> Hi, Sara: Hopefully you and your son are fully recovered now! To me, Bio::Algorithm::SDI would make the most sense. Re: "It seems that forester has the assumption built in that any node in a tree that has a child must have two children. Is this a property of phylogenetic trees?" Being composed of entirely binary nodes is indeed a property of trees produced by most programs for phylogenetic inference. In contrast, if multiple (binary) trees are used to calculate a consensus tree (e.g. bootstrap resampling), then the resulting consensus tree might contain nodes with more than two children (depending on the method of consensus tree calculation and the degree of divergence among the resampled trees). 
Furthermore, if (phylogenetic or taxonomic) trees are "manually" created (or by various "supertree" approaches), nodes with more than two children are oftentimes used to express uncertainty. For the purpose of gene duplication inference, it would be particularly useful to allow non-binary species trees (expressing uncertainty about the tree-of-life and preventing the introduction of spurious duplications).

Re: "For the non-binary case, should I go forward planning to implement the algorithm from the Vernot et al. paper or should I be planning to extend your algorithm?"

You should plan on working on the SDI algorithm and 'modify' it so that it correctly works on non-binary species trees. Now, this is easier said than done. A while ago, I developed such an algorithm and implemented it as org.forester.sdi.GSDI (for Generalized SDI). You can look at it in file org/forester/sdi/GSDI. Yet, the big issue is that while this algorithm seems to work, I don't have a mathematical proof for its correctness.

In any case, I recommend doing the following:

1. Thoroughly test (and write unit tests for) your current implementation of binary SDI. For example, does it correctly use the different sub-elements of taxonomy for matching, i.e.
   - does it work if both species tree and gene tree use scientific names for taxonomic identification?
   - does it work if both species tree and gene tree use NCBI identifiers for taxonomic identification?
   - does it work if both species tree and gene tree use NCBI identifiers for taxonomic identification but also have non-matching common names (in this case it should use the identifiers and ignore common names)?
   - will it throw an exception if no matching sub-elements of taxonomy are present?

2. Perform timing benchmarks. Does it behave similarly (although overall slower) to the Java implementation (see Figure 4 in Zmasek and Eddy, 2001)? Oftentimes, an unexpected timing benchmark result is an indication of an underlying problem.

3. I will look at your implementation as well.

4.
Look at org.forester.sdi.GSDI and see if you can understand it and test it on paper. If this makes sense to you then we can go ahead and plan implementing this within BioRuby. Christian Sara Rayburn wrote: > Hi, > > Well, as far as I can tell, things are looking much, much better. I'm sorry I got a bit behind, but my son and I have been sick this past week. > > For the namespace/file locations, the response from the mailing list has been: > Bio::SpeciationDuplicationInference, Bio::Evolution::SDI, > Bio::Algorithm::SD with the files in lib/bio/util/Phylogeny/SDI, or lib/bio/phylo/sdi.rb and Bio::Phylo::SDI > > What do you guys think? > > Also, when I've been in doubt I've looked at the java implementation. It seems that forester has the assumption built in that any node in a tree that has a child must have two children. Is this a property of phylogenetic trees? > > Other than tying up a couple of loose ends, I think the binary case is pretty much wrapped up. Please let me know if there are things I need to modify or rethink. > > For the non-binary case, should I go forward planning to implement the algorithm from the Vernot et al. paper or should I be planning to extend your algorithm? > > Thanks and again, sorry for getting a bit behind. > > Sara From rutgeraldo at gmail.com Wed Jun 23 20:48:02 2010 From: rutgeraldo at gmail.com (Rutger Vos) Date: Wed, 23 Jun 2010 21:48:02 +0100 Subject: [BioRuby] [GSoC][NeXML and RDF API] Update In-Reply-To: References: Message-ID: Hi Anurag, thanks for the update - your time projection and current progress sounds good. Can you forward this update to the phylosoc (nescent) list as well? Thanks, Rutger On Tue, Jun 22, 2010 at 9:46 AM, Anurag Priyam wrote: > Hello all, > > Much of the parser implementation is complete as of now. Last time I had > sent an update I had begun implementing characters element. Week 3( June > 7-13) was quite low on work due to power shortage where I live. 
Consequently > implementation of characters spanned Week 4( June 14-20 ) too. > > Work on NeXML serialization has begun. As of now it can serialize taxa > blocks. This week( week 5 - June 21-28 ) I will be working on serializing > trees and characters element. > > I would also like to update a little more on future development plans. I am > targeting to finish much of the software development by week 9( July 19-25 > ), leaving week 10, week 11 and week 12 for feedback and iterations. This is > the time where I should make up for any mistakes or lost work. Perhaps in > this week we can make the code ready for merging in BioRuby's master branch. > Apart from this, I am targeting to finish serializer and start working on > the RDF API by week 6. Maybe we could have a round of code review after that > too? I am notifying this in advance so that if possible developers can > allocate time for this. Sounds good? > > > -- > Anurag Priyam, > 2nd Year Undergraduate, > Department of Mechanical Engineering, > IIT Kharagpur. > +91-9775550642 > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > -- Dr. Rutger A. Vos School of Biological Sciences Philip Lyle Building, Level 4 University of Reading Reading RG6 6BX United Kingdom Tel: +44 (0) 118 378 7535 http://www.nexml.org http://rutgervos.blogspot.com From pjotr.public14 at thebird.nl Thu Jun 24 13:54:11 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Thu, 24 Jun 2010 15:54:11 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: Message-ID: <20100624135411.GA14658@thebird.nl> Hi Everyone, I am going to review Anurag's code. Naohisa, and perhaps others, will join in. A quick recap: Anurag is working on implementing an NeXML parser with RDF support (for the semantic web). 
NeXML is an XMLized and improved version of Nexus, and is used for interchanging sequences, alignments and trees between programs/services (correct me if I am wrong). A full description of NeXML can be found at

https://www.nescent.org/wg_evoinfo/Future_Data_Exchange_Standard

NeXML is an important standard, and very good to have in BioRuby.

Anurag: thanks for the good work, so far. I can see you have put a lot of work in. And, I like your style. I can see you are a competent programmer, so you can expect the worst criticism ;)

I am going to start with some high-level questions.

Can someone who has worked with NeXML (Rutger) have a look at the interface description on:

https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby

It looks natural to query this way, if you know what the NeXML file contains (e.g. trees, or sequences). What would be the natural approach if you do *not* know the contents? I.e. how does one iterate over the NeXML object?

Anurag, your web page states you implemented a LibXML::Parser, and you named it Parser. Meanwhile, it looks like you have implemented libxml2 streaming, using a Reader. This is a bit confusing. I presume you are using the technique used in Diana's PhyloXML parser. You are requiring the 'xml' package. Is that libxml2 these days, or is it actually 'libxml'? Does it work for all Ruby versions? libxml is an external (binary) dependency, so it may not be present, in which case the require fails. PhyloXML does not handle failure either.

The other high-level questions concern testing. For others, the unit tests are here:

http://github.com/yeban/bioruby/tree/master/test/unit/bio/db/nexml/

I notice you have limited test input data. How can you be really sure your code works for all cases? How can you be really sure that future changes to the code don't break anything? And how are you going to measure performance of your code?

Finally, getting down to some code.
Most of the code is in a single file: http://github.com/yeban/bioruby/blob/master/lib/bio/db/nexml/elements.rb or http://github.com/yeban/bioruby/blob/3abfc592e2f7072a8e2970ee077677a9ab7564ae/lib/bio/db/nexml/elements.rb I think it should be broken up. It would be logical to split by type of elements - at least. I know in BioRuby we are ambiguous about file sizes - I think a single file should describe one concept. That way file names become self describing. Files larger than 300 lines tend to be hard to digest - and probably point out some bigger issue. Also, when I look at DnaSeqRow, RnaSeqRow and others derived from SeqRow (line 2148 and onwards in element.rb), I can see duplicated coding 'patterns'. You are repeating a concept. Would there not be a more elegant way in Ruby to handle this? Hint: Inheritance is just one mechanism, I see no real reason to use an inheritance tree. Why not use one Sequence class for all of these which can contain different formed elements? I bet the code would become a lot shorter and (probably) less error prone. Take Ruby's Array container class as an example - it is just one implementation of a container which allows many types of elements. A final comment for this session: The class/method descriptions are not very informative. It may be early days - especially since we can see some refactoring coming, but it usually helps to write out examples giving the 'nicest' interface for people to use. And stick those in the source code. Personally I favour rubydoctests, see http://github.com/tablatom/rubydoctest I used these in bio/appl/paml/codeml/report.rb - these are examples that double as tests. Kill two birds with one stone! The BioRuby tutorial also uses doctests - i.e. the code in the Tutorial can be validated against the installed bioruby. If you want to use this you need an extra conversion - I have that tool. Another possibility is to start using RSpec. 
http://rspec.info/ I really like RSpec too - it is more of a replacement for unit tests - and easier to understand, so Specs double as documentation. I am interested to see what you want to do for RDF support. Maybe you can write out the API as an RSpec? That would be a good start. Do not hesitate to stand up to me. You will probably get support from someone on this list ;) Pj. On Thu, Jun 10, 2010 at 01:19:35AM +0530, Anurag Priyam wrote: > Last week, I worked on finishing implementation of Trees: trees, tree, > network; and started work on the characters element. This weeks target is to > complete the implementation of the characters element. > > It would be awesome to have some code review including: implementation, API > design, coding style and tests. I am planning to give a good amount of time > in the fourth week in making the code more robust. It would make perfect > sense to have some feedback to serve as guidelines :). The master branch and > API discussion page are at: > > [1] http://github.com/yeban/bioruby > [2] > https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby > > -- > Anurag Priyam, > 2nd Year Undergraduate, > Department of Mechanical Engineering, > IIT Kharagpur. > +91-9775550642 > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From czmasek at burnham.org Fri Jun 25 02:18:41 2010 From: czmasek at burnham.org (Christian M Zmasek) Date: Thu, 24 Jun 2010 19:18:41 -0700 Subject: [BioRuby] gsoc: SDI - unrooted trees In-Reply-To: <6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com> References: <4C1AACBA.4030908@burnham.org> <6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com> Message-ID: <4C241201.9060604@burnham.org> Hi, Sara: Something I forgot the mention. 
As you know, most phylogeny inference methods produce trees which are unrooted (these trees might look rooted, but for most methods the root is placed randomly, and thus incorrectly).

In the context of duplication inference, a reasonable way to root a tree is by placing the root in such a way that the sum of inferred duplications is minimized. The brute force approach to accomplish this is by sequentially placing the root on each branch and then running the SDI algorithm on each differently rooted tree and retaining the root position which results in the smallest sum of duplications. A more time efficient approach is possible by realizing that the mapping function only changes for a few nodes if the root is moved from one branch to a neighboring one. This approach is implemented in org.forester.sdi.SDIR.

Besides extending the algorithm to work on non-binary trees, this is another useful extension which you might think about tackling.

Christian

From anurag08priyam at gmail.com Fri Jun 25 06:23:34 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Fri, 25 Jun 2010 11:53:34 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100624135411.GA14658@thebird.nl>
References: <20100624135411.GA14658@thebird.nl>
Message-ID:

> Can someone who has worked with NeXML (Rutger) have a look at the
> interface description on:
>
> https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby
>
> It looks natural to query this way, if you know what the NeXML files
> contains (e.g. trees, or sequences). What would be the natural
> approach if you do *not* know the contents? I.e. how does one iterate
> over the NeXML object?
>
>

NeXML has three primary elements: otus, trees, characters. All three of them are containers for other elements: otu, tree, network, matrix. Currently the "each" method of a nexml object iterates over each tree object.
I did this thinking that the tree is the most important part of a phylogenetic analysis (and also because I had not implemented characters then). What were you thinking here? Should each iterate over all otu, tree and matrix elements, or over the primary otus, trees and characters elements? I would go for the latter. > Anurag, your web page states you implemented a LibXML::Parser, and you > named it Parser. Meanwhile, it looks like you have implemented libxml2 > streaming, using a Reader. This is a bit confusing. I presume you are > using the technique used in Diana's PhyloXML parser. You are requiring > the 'xml' package. Is that libxml2 these days, or is it actually > 'libxml'? Does it work for all Ruby versions? libxml is an external > (binary) dependency, so it may not exist and fail. PhyloXML does not > handle failure either. > > I am glad you asked this. I wanted to discuss it here. I have used the libxml2 streaming API, without actually streaming the document to the user. The cursor does not move through the document when you iterate over elements (phyloxml does that). I am parsing the document in one go, at the start, and storing the objects in memory. Should we want to switch to streaming, using libxml's streaming API from the start should make it easier. Yes, it is libxml2 these days. The site states that it works with ruby 1.8. I am myself working with 1.8.7. I will have to test the compatibility with ruby 1.9. > The other high-level questions concern testing. For others, the unit > tests are here: > > http://github.com/yeban/bioruby/tree/master/test/unit/bio/db/nexml/ > > I notice you have limited test input data. How can you be really sure > your code works for all cases? How can you be really sure that future > changes to the code don't break? Right. I am working on improving the test suites, taking lessons from the other bioruby test suites. > And how are you going to measure > performance of your code? > > Actually I have not done anything here.
I will benchmark and profile the code and discuss the results here. Finally, getting down to some code. Most of the code is in a single > file: > > http://github.com/yeban/bioruby/blob/master/lib/bio/db/nexml/elements.rb > > or > > > http://github.com/yeban/bioruby/blob/3abfc592e2f7072a8e2970ee077677a9ab7564ae/lib/bio/db/nexml/elements.rb > > I think it should be broken up. It would be logical to split by type > of elements - at least. I know in BioRuby we are ambiguous about file > sizes - I think a single file should describe one concept. That way > file names become self describing. Files larger than 300 lines tend > to be hard to digest - and probably point out some bigger issue. > > Agreed. > Also, when I look at DnaSeqRow, RnaSeqRow and others derived from > SeqRow (line 2148 and onwards in element.rb), I can see duplicated > coding 'patterns'. You are repeating a concept. Would there not be a > more elegant way in Ruby to handle this? Hint: Inheritance is just one > mechanism, I see no real reason to use an inheritance tree. Why not > use one Sequence class for all of these which can contain different > formed elements? I bet the code would become a lot shorter and > (probably) less error prone. Take Ruby's Array container class as an > example - it is just one implementation of a container which allows > many types of elements. > The idea here was to implement a type system and stick close to the class hierarchy followed in the schema. However, looking back, I myself do not find the code for the Matrix class very elegant. A final comment for this session: The class/method descriptions are > not very informative. It may be early days - especially since we can > see some refactoring coming, but it usually helps to write out > examples giving the 'nicest' interface for people to use. And stick > those in the source code. Personally I favour rubydoctests, see > > http://github.com/tablatom/rubydoctest > > Hey, I did not know that doctests existed for Ruby too. 
I will have a look into it. > I used these in bio/appl/paml/codeml/report.rb - these are examples > that double as tests. Kill two birds with one stone! The BioRuby > tutorial also uses doctests - i.e. the code in the Tutorial can be > validated against the installed bioruby. If you want to use this you > need an extra conversion - I have that tool. > I will check out the examples. What tool? I would like to know more. > > Another possibility is to start using RSpec. > > http://rspec.info/ > > I really like RSpec too - it is more of a replacement for unit > tests - and easier to understand, so Specs double as documentation. > > I am missing Rspec too from my Rails and Merb days. I picked up unit tests because much of the framework had used the same and also because I wanted to try it out :). > I am interested to see what you want to do for RDF support. Maybe you > can write out the API as an RSpec? That would be a good start. > > That sounds like a nice idea. > Do not hesitate to stand up to me. You will probably get support from > someone on this list ;) > > Pj. > > On Thu, Jun 10, 2010 at 01:19:35AM +0530, Anurag Priyam wrote: > > Last week, I worked on finishing implementation of Trees: trees, tree, > > network; and started work on the characters element. This weeks target is > to > > complete the implementation of the characters element. > > > > It would be awesome to have some code review including: implementation, > API > > design, coding style and tests. I am planning to give a good amount of > time > > in the fourth week in making the code more robust. It would make perfect > > sense to have some feedback to serve as guidelines :). The master branch > and > > API discussion page are at: > > > > [1] http://github.com/yeban/bioruby > > [2] > > > https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby > > > > -- > > Anurag Priyam, > > 2nd Year Undergraduate, > > Department of Mechanical Engineering, > > IIT Kharagpur. 
> > +91-9775550642 > > _______________________________________________ > > BioRuby Project - http://www.bioruby.org/ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From pjotr.public14 at thebird.nl Fri Jun 25 06:46:05 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 25 Jun 2010 08:46:05 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> Message-ID: <20100625064605.GA22887@thebird.nl> (splitting up the discussion) On Fri, Jun 25, 2010 at 11:53:34AM +0530, Anurag Priyam wrote: > Should each iterate over all otu, tree and matrix or the > primary otus, trees and characters elements? I would go for the later. I think Rutger should answer this. Pj. From pjotr.public14 at thebird.nl Fri Jun 25 06:49:11 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 25 Jun 2010 08:49:11 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> Message-ID: <20100625064911.GB22887@thebird.nl> > I have used libxml2 streaming api, without actually streaming the document > to the user. The cursor does not move through the document when you iterate > over elements( phyloxml does that ). I am parsing the document at one go; at > the start, and storing the objects in memory. Should we want to switch to > streaming, using libxml's streaming API from start should make it easier. > > Yes it is libxml2 these days. The site states that it works with ruby 1.8. I > am myself working with 1.8.7. I will have to test the compatibility with > ruby 1.9. OK, glad to see that libxml is a standard package these days - though it has some horrific error handling. At least it is fast. 
How much time would it cost you to stream the data - and what does it mean with regard to changing the API? I guess, in general, NeXML files won't be that large, so it may not be that important (Rutger)? Pj. From pjotr.public14 at thebird.nl Fri Jun 25 06:51:58 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 25 Jun 2010 08:51:58 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> Message-ID: <20100625065158.GC22887@thebird.nl> > > I notice you have limited test input data. How can you be really sure > > your code works for all cases? How can you be really sure that future > > changes to the code don't break? > > Right. I am working on improving the test suites taking lessons from the > other bioruby test suites. Unit tests are one approach. How about adding some regression tests on larger files? When you have output that should be a good idea. We don't like large datasets in the bioruby tree, but there are two ways around that - create a special branch on github, or pull the data on demand (though Naohisa may frown on that). Ask Diana what she has done. > Actually I have not done anything here. I will benchmark and profile the > code and discuss the results here. Diana created a special profiling branch. It was really helpful to profile. Pj. From pjotr.public14 at thebird.nl Fri Jun 25 06:55:39 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 25 Jun 2010 08:55:39 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> Message-ID: <20100625065539.GD22887@thebird.nl> On Fri, Jun 25, 2010 at 11:53:34AM +0530, Anurag Priyam wrote: > The idea here was to implement a type system and stick close to the class > hierarchy followed in the schema. However, looking back, I myself do not > find the code for the Matrix class very elegant. Over 3000 lines of code for an XML parser sends out alarm bells. 
If you have the right testing files it should be easy to refactor. Make it simpler. Also, when parsing this type of XML some Ruby reflection may come in handy - I did some of that in my BioRuby GEO parser, which lives in my GEO branch on github. You should look at each class and see if you can refactor it down to a single solution. Just make sure it is not at the expense of readability and understanding. Post us some ideas here, before you start hacking code. Pj. From pjotr.public14 at thebird.nl Fri Jun 25 07:08:04 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 25 Jun 2010 09:08:04 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> Message-ID: <20100625070804.GE22887@thebird.nl> > > http://github.com/tablatom/rubydoctest > > > > > Hey, I did not know that doctests existed for Ruby too. I will have a look > into it. They are good, however finding bugs is a bit problematic as the stack traces are lengthy and often not descriptive. So with troubling code I tend to write extra unit tests. Also, with BioRuby we have not settled on doctests yet, so you need to reach coverage with unit tests and/or Specs. I really think it is good for validating documentation. > > I used these in bio/appl/paml/codeml/report.rb - these are examples > > that double as tests. Kill two birds with one stone! The BioRuby > > tutorial also uses doctests - i.e. the code in the Tutorial can be > > validated against the installed bioruby. If you want to use this you > > need an extra conversion - I have that tool. > > > > I will check out the examples. What tool? I would like to know more. It simply parses out commented code in the source headers, and turns them over to rubydoctest. The tool is in my bioruby-support tree on github - see http://github.com/pjotrp/bioruby-support/blob/master/bin/uncomment_doctest you can see it uses an environment variable. > I am missing Rspec too from my Rails and Merb days. 
I picked up unit tests > because much of the framework had used the same and also because I wanted to > try it out :). > > > > I am interested to see what you want to do for RDF support. Maybe you > > can write out the API as an RSpec? That would be a good start. > > > > > That sounds like a nice idea. RSpec is new for BioRuby. Since you have experience you are the right one to introduce it to us ;). If it is convincing to the others we may accept it as standard use (personally I think it is a step forward from unit testing - unit tests are not very good as documentation). Pj. From anurag08priyam at gmail.com Fri Jun 25 07:34:21 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Fri, 25 Jun 2010 13:04:21 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: <20100625064911.GB22887@thebird.nl> References: <20100624135411.GA14658@thebird.nl> <20100625064911.GB22887@thebird.nl> Message-ID: On Fri, Jun 25, 2010 at 12:19 PM, Pjotr Prins wrote: > > I have used libxml2 streaming api, without actually streaming the > document > > to the user. The cursor does not move through the document when you > iterate > > over elements( phyloxml does that ). I am parsing the document at one go; > at > > the start, and storing the objects in memory. Should we want to switch to > > streaming, using libxml's streaming API from start should make it easier. > > > > Yes it is libxml2 these days. The site states that it works with ruby > 1.8. I > > am myself working with 1.8.7. I will have to test the compatibility with > > ruby 1.9. > > OK, glad to see that libxml is a standard package these days - > though it has some horrific error handling. At least it is fast. > > Yea it is fast but it has its own share of bugs. Now, I myself have started working on the ruby-libxml code and helping in maintaining it. > How much time would it cost you to stream the data - and what does it > mean with regard to changing the API? 
I guess, in general, NeXML > files won't be that large, so it may not be that important (Rutger)? > > Pj. > > I mean switching the parsing implementation from "parsing at the start" to streaming, not changing the API. Using the Reader API rather than the DOM API would simply make that switch easier. Even if we do not switch, the Reader API offers a more memory-efficient solution than the DOM API. Btw, I am not in favour of the switch. You cannot move backwards in the document that way. I cannot fetch a tree by id if the cursor is ahead of that tree. Doing both nexml.each_characters and nexml.each_trees is impossible with pure streaming; I would have to stream one while caching the other. Otus and otu provide a one-to-many relation with trees and characters, and rows. An API call of the type otus.trees or otus.characters or otu.sequences would be impossible (not that I have already added the API call). Imo, NeXML is non-linear and not meant to be streamed. Besides, other NeXML implementations also parse the file at the start. -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From ngoto at gen-info.osaka-u.ac.jp Fri Jun 25 07:15:58 2010 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Fri, 25 Jun 2010 16:15:58 +0900 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: <20100625065158.GC22887@thebird.nl> References: <20100624135411.GA14658@thebird.nl> <20100625065158.GC22887@thebird.nl> Message-ID: <20100625071558.9CAB01CBC5B0@idnmail.gen-info.osaka-u.ac.jp> Most of the special testing program created by Diana for PhyloXML is now in sample/test_phyloxml_big.rb, i.e. it is now regarded as a sample script. To run the program, for example, % mkdir /tmp/phyloxml % ruby sample/test_phyloxml_big.rb /tmp/phyloxml -v It executes round-trip tests for large PhyloXML files. Data files are downloaded from the internet and are stored in a directory specified by the user.
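The round-trip idea itself (parse, write, parse again, compare) can be sketched in a few lines. This is a minimal, self-contained illustration and not the code in sample/test_phyloxml_big.rb; the `ToyFormat` module, the `Otu` struct and the tab-separated format are hypothetical stand-ins for a real parser and writer.

```ruby
# Hypothetical stand-in for a real parser/writer pair (not BioRuby API).
Otu = Struct.new(:id, :label)

module ToyFormat
  # Parse "id<TAB>label" lines into Otu records, skipping blank lines.
  def self.parse(text)
    text.each_line.map(&:chomp).reject(&:empty?).map do |line|
      id, label = line.split("\t", 2)
      Otu.new(id, label)
    end
  end

  # Serialize Otu records back to the same line-based format.
  def self.write(otus)
    otus.map { |otu| "#{otu.id}\t#{otu.label}\n" }.join
  end
end

# A round trip passes when parsing the written output gives back
# the same records that the first parse produced.
def round_trip_ok?(text)
  first  = ToyFormat.parse(text)
  second = ToyFormat.parse(ToyFormat.write(first))
  first == second
end

sample = "t1\tHomo sapiens\n\nt2\tPan troglodytes\n"
puts round_trip_ok?(sample)
```

Comparing the parsed objects rather than the raw text keeps the check independent of harmless formatting differences, such as the blank line in the sample input.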
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Fri, 25 Jun 2010 08:51:58 +0200 Pjotr Prins wrote: > > Actually I have not done anything here. I will benchmark and profile the > > code and discuss the results here. > > Diana created a special profiling branch. It was really helpful to > profile. > > Pj. > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby -- ngoto at gen-info.osaka-u.ac.jp Phone: 06-6879-8365 / FAX: 06-6879-2047 From pjotr.public14 at thebird.nl Fri Jun 25 07:42:13 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 25 Jun 2010 09:42:13 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> <20100625064911.GB22887@thebird.nl> Message-ID: <20100625074213.GA27044@thebird.nl> I think this needs to be answered by Rutger. Are we going to face NeXML files in the future that can easily outrun memory? Pj. On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote: > > How much time would it cost you to stream the data - and what does it > > mean with regard to changing the API? I guess, in general, NeXML > > files won't be that large, so it may not be that important (Rutger)? > > > > Pj. > > > > > I mean switching the parsing implementation to streaming from "parsing at > the start" and not the API. Just that using Reader API over the DOM API > would help in the switch. Even if we do not switch, the Reader API offers a > more memory efficient solution than the DOM API. > > Btw, I am not in a favour of switch. You cannot move backwards in document > that way. I can not fetch a tree by id if I the cursor is ahead of that > tree. Doing nexml.each_characters and nexml.each_trees is impossible with > pure streaming. I will have to stream one while cache the other.
Otus and > otu provide a one to many relation with trees and characters, and rows. An > API call of the type otus.trees or otus.characters or otu.seuences would be > impossible( not that I have already added the API call ). Imo, NeXML is > non-linear and not meant to be streamed. Besides other NeXML implementations > also parse the file at the start. > > -- > Anurag Priyam, > 2nd Year Undergraduate, > Department of Mechanical Engineering, > IIT Kharagpur. > +91-9775550642 From rutgeraldo at gmail.com Fri Jun 25 08:14:44 2010 From: rutgeraldo at gmail.com (Rutger Vos) Date: Fri, 25 Jun 2010 09:14:44 +0100 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: <20100625074213.GA27044@thebird.nl> References: <20100624135411.GA14658@thebird.nl> <20100625064911.GB22887@thebird.nl> <20100625074213.GA27044@thebird.nl> Message-ID: This is very possible (and it's why Anurag has been focusing on stream-based parsing) but I am personally of the opinion that worrying too much about that right now would be a premature optimization. It seems to me that we want to get a nice interface that captures what NeXML can express first, and worry about performance and memory footprint later - but that's just my own opinion and certainly open for discussion. On Fri, Jun 25, 2010 at 8:42 AM, Pjotr Prins wrote: > I think this needs to be answered by Rutger. Are we going to face > NeXML files in the future that can easily outrun memory? > > Pj. > > On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote: >> > How much time would it cost you to stream the data - and what does it >> > mean with regard to changing the API? I guess, in general, NeXML >> > files won't be that large, so it may not be that important (Rutger)? >> > >> > Pj. >> > >> > >> I mean switching the parsing implementation to streaming from "parsing at >> the start" and not the API. Just that using Reader API over the DOM API >> would help in the switch. 
Even if we do not switch, the Reader API offers a >> more memory efficient solution than the DOM API. >> >> Btw, I am not in a favour of switch. You cannot move backwards in document >> that way. I can not fetch a tree by id if I the cursor is ahead of that >> tree. Doing nexml.each_characters and nexml.each_trees is impossible with >> pure streaming. I will have to stream one while cache the other. Otus and >> otu provide a one to many relation with trees and characters, and rows. An >> API call of the type otus.trees or otus.characters or otu.seuences would be >> impossible( not that I have already added the API call ). Imo, NeXML is >> non-linear and not meant to be streamed. Besides other NeXML implementations >> also parse the file at the start. >> >> -- >> Anurag Priyam, >> 2nd Year Undergraduate, >> Department of Mechanical Engineering, >> IIT Kharagpur. >> +91-9775550642 > -- Dr. Rutger A. Vos School of Biological Sciences Philip Lyle Building, Level 4 University of Reading Reading RG6 6BX United Kingdom Tel: +44 (0) 118 378 7535 http://www.nexml.org http://rutgervos.blogspot.com From pjotr.public14 at thebird.nl Fri Jun 25 08:38:36 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Fri, 25 Jun 2010 10:38:36 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> <20100625064911.GB22887@thebird.nl> <20100625074213.GA27044@thebird.nl> Message-ID: <20100625083836.GA28214@thebird.nl> On Fri, Jun 25, 2010 at 09:14:44AM +0100, Rutger Vos wrote: > This is very possible (and it's why Anurag has been focusing on > stream-based parsing) but I am personally of the opinion that worrying > too much about that right now would be a premature optimization. It > seems to me that we want to get a nice interface that captures what > NeXML can express first, and worry about performance and memory > footprint later - but that's just my own opinion and certainly open > for discussion. 
Oh, I agree about implementation. But it does mean Anurag needs to change his preferred solution (like back-tracking in the tree). Pj. From sararayburn at gmail.com Fri Jun 25 18:57:03 2010 From: sararayburn at gmail.com (Sara Rayburn) Date: Fri, 25 Jun 2010 13:57:03 -0500 Subject: [BioRuby] GSoC speciation/duplication inference question In-Reply-To: References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com> <20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <00A7DE6C-2985-4173-A302-619429F964BB@gmail.com> Hi, Based on the list responses and conversations with my mentor, I would probably go with Bio::Algorithm::SDI, with the files in lib/bio/util/phylogeny/SDI/. I can definitely see the others as good possibilities, though. If anyone objects to this naming, please let me know so I can change it. Thanks, Sara Rayburn sararayburn at gmail.com On Jun 15, 2010, at 9:06 PM, Toshiaki Katayama wrote: > Hi, > > Replying personally as I delayed to find this thread. > > I prefer something like lib/bio/phylo/sdi.rb and Bio::Phylo::SDI, how about to gather other phyloinformatics modules under the same directory as well? > > Toshiaki > > > On 2010/06/10, at 22:46, Naohisa GOTO wrote: > >> Hi, >> >> I think the abbreviation SDI is not common in the field of biology >> and bioinformatics. In this case, it is generally good not to >> abbreviate, but the "speciation/duplication inference" is too long. >> For file/directory names, because the length limit is tight, >> using abbreviation is good. >> >> For the location of files, I suggest >> lib/bio/util/evolution/SDI/ or lib/bio/util/phylogeny/SDI/ >> to show the word SDI is in the field of evolution or phylogeny. >> >> For the class/module namespace, possible candidates are >> Bio::SpeciationDuplicationInference, Bio::Evolution::SDI, >> Bio::Algorithm::SDI, but I couldn't determine which is the best. >> If you have good idea, please tell us.
>> >> Naohisa Goto >> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org >> >> >> On Mon, 7 Jun 2010 13:09:07 -0500 >> Sara Rayburn wrote: >> >>> Hello, >>> >>> While implementing my gsoc project (the speciation/duplicaiton inference algorithm), I'm not sure where to put the module I'm developing in the bioruby library. I've been developing in the lib/bio/util directory, but I want to make sure that's the place you all would prefer for the module. Any suggestions or thoughts? >>> >>> Thanks, >>> >>> Sara Rayburn >>> sararayburn at gmail.com >> >> >> _______________________________________________ >> BioRuby Project - http://www.bioruby.org/ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby > From anurag08priyam at gmail.com Sat Jun 26 11:35:34 2010 From: anurag08priyam at gmail.com (Anurag Priyam) Date: Sat, 26 Jun 2010 17:05:34 +0530 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: <20100625070804.GE22887@thebird.nl> References: <20100624135411.GA14658@thebird.nl> <20100625070804.GE22887@thebird.nl> Message-ID: On Fri, Jun 25, 2010 at 12:38 PM, Pjotr Prins wrote: > > > http://github.com/tablatom/rubydoctest > > > > > > > > Hey, I did not know that doctests existed for Ruby too. I will have a > look > > into it. > > They are good, however finding bugs is a bit problematic as the stack > traces are lengthy and often not descriptive. So with troubling code > I tend to write extra unit tests. Also, with BioRuby we have not > settled on doctests yet, so you need to reach coverage with unit > tests and/or Specs. > > I really think it is good for validating documentation. > > > > I used these in bio/appl/paml/codeml/report.rb - these are examples > > > that double as tests. Kill two birds with one stone! The BioRuby > > > tutorial also uses doctests - i.e. the code in the Tutorial can be > > > validated against the installed bioruby. 
If you want to use this you > > > need an extra conversion - I have that tool. > > > > > > > I will check out the examples. What tool? I would like to know more. > > It simply parses out commented code in the source headers, and turns > them over to rubydoctest. The tool is in my bioruby-support tree on > github - see > > > http://github.com/pjotrp/bioruby-support/blob/master/bin/uncomment_doctest > > you can see it uses an environment variable. > > Perfect. I will use it when expanding the documentation. > > I am missing Rspec too from my Rails and Merb days. I picked up unit > tests > > because much of the framework had used the same and also because I wanted > to > > try it out :). > > > > > > > I am interested to see what you want to do for RDF support. Maybe you > > > can write out the API as an RSpec? That would be a good start. > > > > > > > > That sounds like a nice idea. > > RSpec is new for BioRuby. Since you have experience you are the right > one to introduce it to us ;). If it is convincing to the others we > may accept it as standard use (personally I think it is a step > forward from unit testing - unit tests are not very good as > documentation). > > I am willing to use Rspec for the RDF API part. Converting the already existing unit tests I have written to Rspec does not sound a good idea? -- Anurag Priyam, 2nd Year Undergraduate, Department of Mechanical Engineering, IIT Kharagpur. +91-9775550642 From pjotr.public14 at thebird.nl Sat Jun 26 12:19:02 2010 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Sat, 26 Jun 2010 14:19:02 +0200 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: References: <20100624135411.GA14658@thebird.nl> <20100625070804.GE22887@thebird.nl> Message-ID: <20100626121902.GA5700@thebird.nl> On Sat, Jun 26, 2010 at 05:05:34PM +0530, Anurag Priyam wrote: > I am willing to use Rspec for the RDF API part. Converting the already > existing unit tests I have written to Rspec does not sound a good idea? 
No need. Do the RDF as a proof-of-concept for the rest of BioRuby. Unit tests will (always) remain. Pj. From hlapp at drycafe.net Sun Jun 27 00:30:19 2010 From: hlapp at drycafe.net (Hilmar Lapp) Date: Sat, 26 Jun 2010 17:30:19 -0700 Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review. In-Reply-To: <20100625074213.GA27044@thebird.nl> References: <20100624135411.GA14658@thebird.nl> <20100625064911.GB22887@thebird.nl> <20100625074213.GA27044@thebird.nl> Message-ID: Our ability to reconstruct trees of hundreds, thousands, and even tens of thousands of characters has improved dramatically over the past couple of years, and is increasingly often the goal of an analysis. Genome-scale alignments also aren't so rare anymore. Aside from analysis, NeXML files can be produced by a database, and hence could hold large taxonomies, or the tree of life. NeXML is an emerging standard. If implementations can't cope with the large scale data that are becoming increasingly popular, it'll have a hard time to get uptake. -hilmar On Jun 25, 2010, at 12:42 AM, Pjotr Prins wrote: > I think this needs to be answered by Rutger. Are we going to face > NeXML files in the future that can easily outrun memory? > > Pj. > > On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote: >>> How much time would it cost you to stream the data - and what does >>> it >>> mean with regard to changing the API? I guess, in general, NeXML >>> files won't be that large, so it may not be that important (Rutger)? >>> >>> Pj. >>> >>> >> I mean switching the parsing implementation to streaming from >> "parsing at >> the start" and not the API. Just that using Reader API over the DOM >> API >> would help in the switch. Even if we do not switch, the Reader API >> offers a >> more memory efficient solution than the DOM API. >> >> Btw, I am not in a favour of switch. You cannot move backwards in >> document >> that way. I can not fetch a tree by id if I the cursor is ahead of >> that >> tree. 
Doing nexml.each_characters and nexml.each_trees is
>> impossible with
>> pure streaming. I will have to stream one while cache the other.
>> Otus and
>> otu provide a one to many relation with trees and characters, and
>> rows. An
>> API call of the type otus.trees or otus.characters or otu.seuences
>> would be
>> impossible( not that I have already added the API call ). Imo,
>> NeXML is
>> non-linear and not meant to be streamed. Besides other NeXML
>> implementations
>> also parse the file at the start.
>>
>> --
>> Anurag Priyam,
>> 2nd Year Undergraduate,
>> Department of Mechanical Engineering,
>> IIT Kharagpur.
>> +91-9775550642
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================

From pjotr.public14 at thebird.nl Sun Jun 27 06:47:31 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sun, 27 Jun 2010 08:47:31 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To:
References: <20100624135411.GA14658@thebird.nl> <20100625064911.GB22887@thebird.nl> <20100625074213.GA27044@thebird.nl>
Message-ID: <20100627064731.GA15508@thebird.nl>

Thanks Rutger and Hilmar,

Anurag, let's not load everything in memory.

Pj.

On Sat, Jun 26, 2010 at 05:30:19PM -0700, Hilmar Lapp wrote:
> Our ability to reconstruct trees of hundreds, thousands, and even tens
> of thousands of characters has improved dramatically over the past
> couple of years, and is increasingly often the goal of an analysis.
> Genome-scale alignments also aren't so rare anymore.
>
> Aside from analysis, NeXML files can be produced by a database, and
> hence could hold large taxonomies, or the tree of life.
>
> NeXML is an emerging standard. If implementations can't cope with the
> large scale data that are becoming increasingly popular, it'll have a
> hard time to get uptake.
>
> -hilmar
>
> On Jun 25, 2010, at 12:42 AM, Pjotr Prins wrote:
>
>> I think this needs to be answered by Rutger. Are we going to face
>> NeXML files in the future that can easily outrun memory?
>>
>> Pj.
>>
>> On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote:
>>>> How much time would it cost you to stream the data - and what does
>>>> it
>>>> mean with regard to changing the API? I guess, in general, NeXML
>>>> files won't be that large, so it may not be that important (Rutger)?
>>>>
>>>> Pj.
>>>>
>>>>
>>> I mean switching the parsing implementation to streaming from
>>> "parsing at
>>> the start" and not the API. Just that using Reader API over the DOM
>>> API
>>> would help in the switch. Even if we do not switch, the Reader API
>>> offers a
>>> more memory efficient solution than the DOM API.
>>>
>>> Btw, I am not in a favour of switch. You cannot move backwards in
>>> document
>>> that way. I can not fetch a tree by id if I the cursor is ahead of
>>> that
>>> tree. Doing nexml.each_characters and nexml.each_trees is impossible
>>> with
>>> pure streaming. I will have to stream one while cache the other.
>>> Otus and
>>> otu provide a one to many relation with trees and characters, and
>>> rows. An
>>> API call of the type otus.trees or otus.characters or otu.seuences
>>> would be
>>> impossible( not that I have already added the API call ). Imo, NeXML
>>> is
>>> non-linear and not meant to be streamed. Besides other NeXML
>>> implementations
>>> also parse the file at the start.
>>>
>>> --
>>> Anurag Priyam,
>>> 2nd Year Undergraduate,
>>> Department of Mechanical Engineering,
>>> IIT Kharagpur.
>>> +91-9775550642
>> _______________________________________________
>> BioRuby Project - http://www.bioruby.org/
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
> ===========================================================
>

From ngoto at gen-info.osaka-u.ac.jp Sun Jun 27 07:45:43 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Sun, 27 Jun 2010 16:45:43 +0900
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100627064731.GA15508@thebird.nl>
References: <20100627064731.GA15508@thebird.nl>
Message-ID: <20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp>

Hi,

I think the ability to handle large data and the memory usage
whether or not to load all data in memory at a time, is essentially
independent. Not loading everything in memory does not guarantee
the ability to handle large data, due to the disk I/O bottleneck and
memory management overhead.

I think it is currently OK to depend on memory. The price of memory
is gradually going down, and I think buying a machine with huge
memory could be a solution to treat large data.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

> Thanks Rutger and Hilmar,
>
> Anurag, let's not load everything in memory.
>
> Pj.
>
> On Sat, Jun 26, 2010 at 05:30:19PM -0700, Hilmar Lapp wrote:
> > Our ability to reconstruct trees of hundreds, thousands, and even tens
> > of thousands of characters has improved dramatically over the past
> > couple of years, and is increasingly often the goal of an analysis.
> > Genome-scale alignments also aren't so rare anymore.
> >
> > Aside from analysis, NeXML files can be produced by a database, and
> > hence could hold large taxonomies, or the tree of life.
> >
> > NeXML is an emerging standard. If implementations can't cope with the
> > large scale data that are becoming increasingly popular, it'll have a
> > hard time to get uptake.
> >
> > -hilmar
> >
> > On Jun 25, 2010, at 12:42 AM, Pjotr Prins wrote:
> >
> >> I think this needs to be answered by Rutger. Are we going to face
> >> NeXML files in the future that can easily outrun memory?
> >>
> >> Pj.
> >>
> >> On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote:
> >>>> How much time would it cost you to stream the data - and what does
> >>>> it
> >>>> mean with regard to changing the API? I guess, in general, NeXML
> >>>> files won't be that large, so it may not be that important (Rutger)?
> >>>>
> >>>> Pj.
> >>>>
> >>>>
> >>> I mean switching the parsing implementation to streaming from
> >>> "parsing at
> >>> the start" and not the API. Just that using Reader API over the DOM
> >>> API
> >>> would help in the switch. Even if we do not switch, the Reader API
> >>> offers a
> >>> more memory efficient solution than the DOM API.
> >>>
> >>> Btw, I am not in a favour of switch. You cannot move backwards in
> >>> document
> >>> that way. I can not fetch a tree by id if I the cursor is ahead of
> >>> that
> >>> tree. Doing nexml.each_characters and nexml.each_trees is impossible
> >>> with
> >>> pure streaming. I will have to stream one while cache the other.
> >>> Otus and
> >>> otu provide a one to many relation with trees and characters, and
> >>> rows. An
> >>> API call of the type otus.trees or otus.characters or otu.seuences
> >>> would be
> >>> impossible( not that I have already added the API call ). Imo, NeXML
> >>> is
> >>> non-linear and not meant to be streamed. Besides other NeXML
> >>> implementations
> >>> also parse the file at the start.
> >>>
> >>> --
> >>> Anurag Priyam,
> >>> 2nd Year Undergraduate,
> >>> Department of Mechanical Engineering,
> >>> IIT Kharagpur.
> >>> +91-9775550642
> >> _______________________________________________
> >> BioRuby Project - http://www.bioruby.org/
> >> BioRuby mailing list
> >> BioRuby at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioruby
> >
> > --
> > ===========================================================
> > : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
> > ===========================================================
> >
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From pjotr.public14 at thebird.nl Sun Jun 27 08:43:22 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sun, 27 Jun 2010 10:43:22 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp>
References: <20100627064731.GA15508@thebird.nl> <20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp>
Message-ID: <20100627084322.GA18815@thebird.nl>

On Sun, Jun 27, 2010 at 04:45:43PM +0900, Naohisa Goto wrote:
> Hi,
>
> I think the ability to handle large data and the memory usage
> whether or not to load all data in memory at a time, is essentially
> independent. Not loading everything in memory does not guarantee
> the ability to handle large data, due to the disk I/O bottleneck and
> memory management overhead.

Well, depends on what you plan to do with that data :). I think you
are saying that streaming data may not be efficient, for example for
treating alignments. That could be true. However, I think the default
strategy should be non-memory bound, if possible. Throughout BioRuby
the strategy is the opposite, at the moment. For example, by default
FASTA files are loaded in RAM. Same for BLAST XML. I regularly have
files that exceed RAM and work around these limitations. I don't think
this should be the *default* strategy.
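
A minimal Ruby sketch of what a non-memory-bound default could look like,
assuming a FASTA-like input. The helper each_fasta_record is hypothetical
and is not part of BioRuby; it reads any IO object line by line and keeps
only one record in memory at a time, so the same code works on a file or
on a pipe.

```ruby
require 'stringio'

# Hypothetical helper (not a BioRuby API): yields one record at a time
# as (header, sequence) while reading the input line by line.
def each_fasta_record(io)
  return enum_for(:each_fasta_record, io) unless block_given?

  header = nil
  sequence = +''
  io.each_line do |line|
    line = line.chomp
    if line.start_with?('>')
      # A new header closes the previous record, if there was one.
      yield(header, sequence) unless header.nil?
      header = line[1..-1]
      sequence = +''
    elsif !line.empty?
      sequence << line
    end
  end
  # Emit the last record once the input is exhausted.
  yield(header, sequence) unless header.nil?
end

# StringIO stands in for a large file or for a pipe such as $stdin.
input = StringIO.new(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n")
lengths = each_fasta_record(input).map { |name, seq| [name, seq.length] }
# lengths is [["seq1", 8], ["seq2", 4]]
```

Calling File.read on the same input would instead hold the whole file in
one string before any record is processed.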

I prefer the Unix way of using pipes. Only use memory when it is
available.

With new code we should design for big data. If it is done from the
start, it takes no real effort.

> I think it is currently OK to depend on memory. The price of memory
> is gradually going down, and I think buying a machine with huge
> memory could be a solution to treat large data.

We can not all afford big machines. It would hamper many
groups/students. RAM is getting cheaper, but data is growing faster.

Anurag, what is the size of RAM you have access to?

Pj.

From anurag08priyam at gmail.com Sun Jun 27 08:49:37 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Sun, 27 Jun 2010 14:19:37 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100627084322.GA18815@thebird.nl>
References: <20100627064731.GA15508@thebird.nl> <20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp> <20100627084322.GA18815@thebird.nl>
Message-ID:

On Sun, Jun 27, 2010 at 2:13 PM, Pjotr Prins wrote:
> On Sun, Jun 27, 2010 at 04:45:43PM +0900, Naohisa Goto wrote:
> > Hi,
> >
> > I think the ability to handle large data and the memory usage
> > whether or not to load all data in memory at a time, is essentially
> > independent. Not loading everything in memory does not guarantee
> > the ability to handle large data, due to the disk I/O bottleneck and
> > memory management overhead.
>
> Well, depends on what you plan to do with that data :). I think you
> are saying that streaming data may not be efficient, for example for
> treating alignments. That could be true. However, I think the default
> strategy should be non-memory bound, if possible. Throughout BioRuby
> the strategy is the opposite, at the moment. For example, by default
> FASTA files are loaded in RAM. Same for BLAST XML. I regularly have
> files that exceed RAM and work around these limitations. I don't think
> this should be the *default* strategy.
>
> I prefer the Unix way of using pipes. Only use memory when it is
> available.
>
> With new code we should design for big data. If it is done from the
> start, it takes no real effort.
>
> > I think it is currently OK to depend on memory. The price of memory
> > is gradually going down, and I think buying a machine with huge
> > memory could be a solution to treat large data.
>
> We can not all afford big machines. It would hamper many
> groups/students. RAM is getting cheaper, but data is growing faster.
>
> Anurag, what is the size of RAM you have access to?
>

3GB. The biggest sample file I am working with is 500 lines
(characters.xml in the examples); working with it has hardly any effect
on my memory. Where can I get a bigger one? I can test the memory
consumption with a large enough file and report.

--
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From hlapp at drycafe.net Sun Jun 27 23:23:19 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Sun, 27 Jun 2010 16:23:19 -0700
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To:
References: <20100627064731.GA15508@thebird.nl> <20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp> <20100627084322.GA18815@thebird.nl>
Message-ID:

On Jun 27, 2010, at 1:49 AM, Anurag Priyam wrote:

> 3GB. The biggest sample file I am working with is 500
> lines( characters.xml
> in the examples ); working with it has hardly any effect on my
> memory. From,
> where can I get a bigger one?

Use the NCBI taxonomy :-) Or download the tree from tolweb.org and
convert to NeXML.

-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================

From anurag08priyam at gmail.com Mon Jun 28 09:31:26 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 28 Jun 2010 15:01:26 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100624135411.GA14658@thebird.nl>
References: <20100624135411.GA14658@thebird.nl>
Message-ID:

>
> A final comment for this session: The class/method descriptions are
> not very informative. It may be early days - especially since we can
> see some refactoring coming, but it usually helps to write out
> examples giving the 'nicest' interface for people to use. And stick
> those in the source code. Personally I favour rubydoctests, see
>
> http://github.com/tablatom/rubydoctest
>

I am loving rubydoctest. Thanks for showing it to me :). As of now I am
using it in my NeXML serialization implementation.

--
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From anurag08priyam at gmail.com Mon Jun 28 09:52:32 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 28 Jun 2010 15:22:32 +0530
Subject: [BioRuby] Testing complex nexml output.
Message-ID:

I am finding it a little difficult to test the NeXML serializer. Any
NeXML object, say an otu, is serialized by a function call of the type
NeXML::Writer#serialize_otu, which returns an XML::Node object. A raw
NeXML representation can be obtained by calling to_s on the return
value. These nodes are added to the document root and then saved to a
file by calling XML::Document#save.

Now, when it comes to testing, comparing NeXML strings does not make
sense, because the test is rendered invalid merely by a different
ordering of the attributes of a node or by newline issues. What I am
doing is to initialize two XML::Node objects, one from a test file and
one that I generate with the serialize_otu function, and then compare
these XML nodes for equality attribute by attribute and child by child.
An example is here:
http://github.com/yeban/bioruby/blob/writer/test/unit/bio/db/nexml/tc_writer.rb#L166

However, the lack of a proper XML::Node#eql? is making things a little
difficult for me.
See:
http://github.com/yeban/bioruby/blob/writer/test/unit/bio/db/nexml/tc_writer.rb#L222

An obvious solution is to define an eql? method myself in Bio::Node.
But am I going in the right direction when it comes to testing XML
output?

--
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From anurag08priyam at gmail.com Mon Jun 28 09:56:52 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 28 Jun 2010 15:26:52 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100625065539.GD22887@thebird.nl>
References: <20100624135411.GA14658@thebird.nl> <20100625065539.GD22887@thebird.nl>
Message-ID:

> ..... Also, when parsing this type of XML some Ruby reflection
> may come in handy - I did some of that in my BioRuby GEO parser, which
> lives in my GEO branch on github.

I picked up the method_missing trick for the serializer.

http://github.com/yeban/bioruby/blob/writer/lib/bio/db/nexml/writer.rb

> You should look at each class and
> see if you can refactor it down to a single solution. Just make sure
> it is not at the expense of readability and understanding.
>
> Post us some ideas here, before you start hacking code.
>
> Pj.
>

I will.

--
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From ngoto at gen-info.osaka-u.ac.jp Mon Jun 28 12:00:05 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 28 Jun 2010 21:00:05 +0900
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To:
References: <20100624135411.GA14658@thebird.nl> <20100625065539.GD22887@thebird.nl>
Message-ID: <20100628120005.61D751CBC32B@idnmail.gen-info.osaka-u.ac.jp>

Hi,

Please never use method_missing. It breaks error reporting and
makes it very hard to debug and maintain both library code and
user scripts.
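
A small Ruby sketch of the error-reporting problem, using hypothetical
classes rather than the actual Bio::NeXML::Writer code. CatchAllWriter
answers every unknown method through method_missing, so a misspelled
call succeeds silently; ExplicitWriter generates real methods with
define_method, so the same typo still raises NoMethodError. The element
names are assumptions chosen for illustration.

```ruby
# Hypothetical: every unknown method is treated as a serialize request.
class CatchAllWriter
  def method_missing(name, *args)
    element = name.to_s.sub(/\Aserialize_/, '')
    "<#{element}/>"
  end
end

# Hypothetical: one real method per known element, nothing else.
class ExplicitWriter
  %w[otu otus tree].each do |element|
    define_method("serialize_#{element}") { "<#{element}/>" }
  end
end

# The typo "serialize_otuu" goes unnoticed with the catch-all version.
typo_result = CatchAllWriter.new.serialize_otuu

# With explicit methods the same typo is reported as usual.
explicit_error = begin
  ExplicitWriter.new.serialize_otuu
  nil
rescue NoMethodError => e
  e
end
```

Here typo_result holds a string for an element that does not exist,
while explicit_error holds the NoMethodError that points at the
misspelled call.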

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Mon, 28 Jun 2010 15:26:52 +0530
Anurag Priyam wrote:

> > ..... Also, when parsing this type of XML some Ruby reflection
> > may come in handy - I did some of that in my BioRuby GEO parser, which
> > lives in my GEO branch on github.
> >
> I picked up the method_missing trick for the serializer.
>
> http://github.com/yeban/bioruby/blob/writer/lib/bio/db/nexml/writer.rb
>
>
> > You should look at each class and
> > see if you can refactor it down to a single solution. Just make sure
> > it is not at the expense of readability and understanding.
> >
> > Post us some ideas here, before you start hacking code.
> >
> > Pj.
> >
> >
> I will.
>
> --
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From ngoto at gen-info.osaka-u.ac.jp Mon Jun 28 12:54:09 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 28 Jun 2010 21:54:09 +0900
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To:
References: <20100624135411.GA14658@thebird.nl> <20100625065539.GD22887@thebird.nl>
Message-ID: <20100628125409.B23271CBC32B@idnmail.gen-info.osaka-u.ac.jp>

Dear Anurag,

Do not add methods to other classes and modules outside Bio.

Modifying classes and modules outside the Bio namespace is prohibited
in the BioRuby library, because this kind of code can cause conflicts
with user scripts or other libraries when each defines a method with
the same name but different behavior, or when the original class is
refactored by its original authors.

It is BioRuby's policy to respect users' freedom. For example,
if we defined Array#has?, a user who wants to define Array#has?
with a different meaning could not use BioRuby.
So, to protect users' rights, it is our policy not to change anything
outside Bio as far as possible.

PS. You may find some exceptional code in Bio::Shell and in sample
scripts, because they are separate applications.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Mon, 28 Jun 2010 15:26:52 +0530
Anurag Priyam wrote:

> > ..... Also, when parsing this type of XML some Ruby reflection
> > may come in handy - I did some of that in my BioRuby GEO parser, which
> > lives in my GEO branch on github.
> >
> I picked up the method_missing trick for the serializer.
>
> http://github.com/yeban/bioruby/blob/writer/lib/bio/db/nexml/writer.rb
>
>
> > You should look at each class and
> > see if you can refactor it down to a single solution. Just make sure
> > it is not at the expense of readability and understanding.
> >
> > Post us some ideas here, before you start hacking code.
> >
> > Pj.
> >
> >
> I will.
>
> --
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From anurag08priyam at gmail.com Mon Jun 28 14:13:36 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 28 Jun 2010 19:43:36 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100628125409.B23271CBC32B@idnmail.gen-info.osaka-u.ac.jp>
References: <20100624135411.GA14658@thebird.nl> <20100625065539.GD22887@thebird.nl> <20100628125409.B23271CBC32B@idnmail.gen-info.osaka-u.ac.jp>
Message-ID:

> It is BioRuby's policy to respect user's freedom. For example,
> if we defined Array#has?, a user who want to define Array#has?
> with different meanings could not use BioRuby. So, to keep
> user's right, it is our policy not to change outside Bio as
> far as possible.
>

Corrected.
Thanks for pointing this out, GOTO san :).

--
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From anurag08priyam at gmail.com Mon Jun 28 14:22:37 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 28 Jun 2010 19:52:37 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100628120005.61D751CBC32B@idnmail.gen-info.osaka-u.ac.jp>
References: <20100624135411.GA14658@thebird.nl> <20100625065539.GD22887@thebird.nl> <20100628120005.61D751CBC32B@idnmail.gen-info.osaka-u.ac.jp>
Message-ID:

> Please never use method_missing. It breaks error reporting and
> makes very hard to debug and maintain both library codes and
> user scripts.
>

Hmm, I have experienced that. But the way I have used it affects only
the Bio::NeXML::Writer class, so is it not safe in this case? Anyway, I
will change it, as it does not improve the readability of the code much
in my case. I just find it exciting :).

--
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From yogiprasanna at gmail.com Wed Jun 30 14:11:42 2010
From: yogiprasanna at gmail.com (Prasanna Bala)
Date: Wed, 30 Jun 2010 19:41:42 +0530
Subject: [BioRuby] Contribution in Bioruby...
Message-ID:

Hi,
My name is Prasanna. I work at a software firm on Ruby on Rails. I am
new to BioRuby and I am interested in contributing to the project. I
would like to know where to start and whom to approach for specific
tasks. I have extensive experience in biomedical text mining. Is there
any group specifically working on biomedical text mining, ontology
mapping, etc.? I would also like to know which issues the community is
working on now, and to see a list of the current topics in BioRuby.

Regards,
Prasanna.

From pjotr.public14 at thebird.nl Wed Jun 30 15:31:05 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Wed, 30 Jun 2010 17:31:05 +0200
Subject: [BioRuby] Contribution in Bioruby...
In-Reply-To:
References:
Message-ID: <20100630153105.GB10804@thebird.nl>

Hi Prasanna,

On Wed, Jun 30, 2010 at 07:41:42PM +0530, Prasanna Bala wrote:
> Hi,
> My name is Prasanna. I am working in a software firm in ruby on rails
> technology. I am new to Bioruby. I am interested in contributing for
> Bio-ruby project. I would like to know where to start things. To whom to
> approach for specific tasks. I have extensive experience in Biomedical text
> mining. Is there is any group specifically working on Biomedical text
> mining, Ontology Mapping etc.. And I also want to know what are the issues
> now the community is working on ? I want to know list of current topics
> that's going on in Bioruby.

Thanks for showing your interest. It would be great if you were to look
at text mining and ontologies for BioRuby. It is relevant for our work.

To start with BioRuby, get a github.com account and clone the
repository. You can start coding, and post questions on this mailing
list.

We are having a presentation at BOSC next week, and the slides discuss
current work. They will be available for everyone.

Where are you located geographically?

Pj.

From anurag08priyam at gmail.com Wed Jun 30 22:07:09 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Thu, 1 Jul 2010 03:37:09 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Update
Message-ID:

In the last week and half of this week I have:

* been able to work out a NeXML serializer - the code sits in the
  master branch[1]. On the API page[2] I have added a discussion of the
  implementation.
* started working on the RDF API - I should be able to come up with
  RSpecs by the end of this week

In the remaining part of the week I will:

* come up with an RDF API implementation
* work on refactoring some of the previous code (the matrix and
  sequences parts), as Pjotr had pointed out in the last review.

Perhaps we can have another round of code review, for the NeXML
serializer? This will help me allocate time in the coming weeks to fix
the issues with the code.

[1] http://github.com/yeban/bioruby
[2] https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby

--
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From anurag08priyam at gmail.com Wed Jun 30 22:15:08 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Thu, 1 Jul 2010 03:45:08 +0530
Subject: [BioRuby] [GSoC]
Message-ID:

I hope you are all tuned in to my updates on both lists, the code and
the project plan. Please do keep reminding me if I am missing out on
something obvious :).

--
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642