From ngoto at gen-info.osaka-u.ac.jp Wed May 6 03:56:48 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 6 May 2009 16:56:48 +0900
Subject: [BioRuby] Made a change in format10.rb
In-Reply-To: <49B4B4EA.6060401@bioreg.kyushu-u.ac.jp>
References: <49B4B4EA.6060401@bioreg.kyushu-u.ac.jp>
Message-ID: <20090506075650.37E4A1CBC4EB@idnmail.gen-info.osaka-u.ac.jp>

Hi,

Thank you for reporting the bug. I've changed the code to support
results containing two or more query sequences.

http://github.com/bioruby/bioruby/commit/e57349594427ad1a51979c9d4e0c3efcffd160c2
http://github.com/bioruby/bioruby/commit/3d3edc44127f4fd97abcc17a859e36623facdc7c

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Mon, 09 Mar 2009 15:19:22 +0900
Fredrik Johansson wrote:

> I found that Bioruby can't handle large amounts of output from Fasta. So
> I made this change to
> /usr/lib/ruby/gems/1.8/gems/bio-1.3.0/lib/bio/appl/fasta/format10.rb :
>
> 6,7c6,8
> < data.sub!(/(.*)\n\n>>>/m, '')
> < @list = "The best scores are" + $1
> ---
> > border = data.index("\n\n>>>")
> > @list = "The best scores are" + data[0...border]
> > data = data[border+5..-1]
>
> The old code reported an error when the output was huge:
> RegexpError: Stack overflow in regexp matcher: /(.*)\n\n>>>/m
>
> So I thought that maybe these lines of code should be changed in Bioruby.
>
> Regards,
> Fredrik Johansson
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From andrew.j.grimm at gmail.com Thu May 7 03:31:26 2009
From: andrew.j.grimm at gmail.com (Andrew Grimm)
Date: Thu, 7 May 2009 17:31:26 +1000
Subject: [BioRuby] Non deprecated way of converting a naseq to fasta?
Message-ID:

The documentation for Bio::Sequence::Common talks about #to_fasta being
deprecated, in favor of Bio::Sequence #output instead.
#output seems to work for Bio::Sequence objects, but not for
Bio::Sequence::NA or Bio::Sequence::AA objects.

I can happily create a new FastaFormat object instead, but I'm wondering if
I'm doing it the right way.

Also, the wiki is still suggesting using to_fasta in some of its code
samples.

Thanks,

Andrew Grimm

From ngoto at gen-info.osaka-u.ac.jp Wed May 13 10:55:12 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 13 May 2009 23:55:12 +0900
Subject: [BioRuby] Non deprecated way of converting a naseq to fasta?
In-Reply-To:
References:
Message-ID: <20090513145512.D17251CBC3CB@idnmail.gen-info.osaka-u.ac.jp>

On Thu, 7 May 2009 17:31:26 +1000
Andrew Grimm wrote:

> The documentation for Bio::Sequence::Common talks about #to_fasta being
> deprecated, in favor of Bio::Sequence #output instead. #output seems to work
> for Bio::Sequence objects, but not for Bio::Sequence::NA or
> Bio::Sequence::AA objects.

Because "to_fasta" is widely and frequently used, and the alternative
methods are not fully implemented yet, you can still use "to_fasta".

"to_fasta" is planned to be deprecated because, in Ruby, the name to_XXX
is usually used for conversion to another data class, whereas "to_fasta"
currently returns a formatted string.

"to_fasta" will be deprecated in a future release, once the alternative
methods are fully ready. In addition, for a smooth migration, "to_fasta"
may remain for a while as an alias (or shortcut) for the alternative
methods.

> I can happily create a new FastaFormat object instead, but I'm wondering if
> I'm doing it the right way.

Creating a new Bio::Sequence object is the best way. Bio::FastaFormat is
a parser class for reading formatted strings, and is not intended to
generate them.

> Also, the wiki is still suggesting using to_fasta in some of its code
> samples.

Those will be rewritten in the future.
> > Thanks, > > Andrew Grimm > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby Thank you. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From pjotr.public14 at thebird.nl Sat May 16 03:57:34 2009 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Sat, 16 May 2009 09:57:34 +0200 Subject: [BioRuby] Google Summer of Code Intro: PhyloXML support in BioRuby In-Reply-To: <4057d3bf0904301837r302bfb2buaa8a644c448267fa@mail.gmail.com> References: <4057d3bf0904301837r302bfb2buaa8a644c448267fa@mail.gmail.com> Message-ID: <20090516075734.GA27669@thebird.nl> Hi Diana, I think your contribution will be very important for BioRuby, and as you are implementing together with others who are adding support for BioPerl and BioPython I have great faith in you getting results. Thank you for taking an interest. We are looking forward to your contribution. If you have any questions on BioRuby ideas and options, please post to this list. Note that responses can be slow, but everyone is tracking. Pj. On Thu, Apr 30, 2009 at 09:37:07PM -0400, Diana Jaunzeikare wrote: > Hi all, > > I would like to introduce myself. My name is Diana and I have been accepted > for Google Summer of Code to implement PhyloXML support for BioRuby. I am a > junior at Smith College double majoring in Computer Science and Math. I am > interested in Bioinformatics, especially protein structure based > phylogenetics. > > Here is the project abstract: > > === > Phylogenetic trees are used in important applications, including > phylogenomics, phylogeography, gene function prediction, cladistics and the > study of molecular evolution. In order to foster successful analysis, > exchange, storage and reuse of phylogenetic trees and associated data, the > phyloXML format was developed. It can store all necessary information about > the phylogenetic tree, like clade, sequence, name and distance. 
The goal of > this project is to implement support for phyloXML in BioRuby. > === > > Here is wiki: > https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby > > > Any comments are welcome! > > Cheers, > > Diana > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From georgkam at gmail.com Sat May 16 08:20:59 2009 From: georgkam at gmail.com (George Githinji) Date: Sat, 16 May 2009 15:20:59 +0300 Subject: [BioRuby] Fwd: [BioSQL-l] BioSQL at BOSC 2009? In-Reply-To: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com> References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com> Message-ID: <55915f820905160520p1204f35bpbc63b6cabd936c53@mail.gmail.com> ---------- Forwarded message ---------- From: Peter Date: Sat, May 16, 2009 at 3:12 PM Subject: [BioSQL-l] BioSQL at BOSC 2009? To: biosql-l Cc: Jason Stajich Hi, Will any of the key BioSQL people from the Bio* projects be at BOSC (and ISMB) this year? http://open-bio.org/wiki/BOSC_2009 There will be several people from Biopython there this year, including me and Brad Chapman who are both familiar with BioSQL. This would be a nice opportunity for further improving BioSQL compatibility between the Bio* projects - something that has been suggested in the past, e.g. http://lists.open-bio.org/pipermail/biopython/2007-November/003893.html http://lists.open-bio.org/pipermail/biojava-l/2007-November/006037.html I don't follow the BioPerl, BioJava or BioRuby mailing lists - and I doubt many of their developers follow the Biopython mailing lists. So, rather than having any BioSQL compatibility discussions split over individual Bio* project specific mailing lists, it seems using the BioSQL mailing list is most appropriate. I have CC'd a few key people just in case they are not on the BioSQL mailing list, if I have missed anyone please forward this to them and ask them to sign up. 
Thanks,

Peter
_______________________________________________
BioSQL-l mailing list
BioSQL-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biosql-l

--
---------------
Sincerely
George
Skype: george_g2
Blog: http://biorelated.wordpress.com/

From bonnalraoul at ingm.it Mon May 18 04:13:47 2009
From: bonnalraoul at ingm.it (Raoul JP Bonnal)
Date: Mon, 18 May 2009 10:13:47 +0200
Subject: [BioRuby] Meet Italian Developers
Message-ID: <4A1118BB.6070205@ingm.it>

Hi Guys,

I'd like to meet Italian BioRuby developers. Do you think it would be
possible to organize an informal meeting in Milan?
I'd also like to know how many BioRuby devs are following the list.

--
Ra

From rozziite at gmail.com Tue May 19 17:07:59 2009
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Tue, 19 May 2009 17:07:59 -0400
Subject: [BioRuby] Update on phyloXML support for BioRuby project
Message-ID: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com>

Hi all,

I want to update you on my thoughts about this project, and I have some
questions.

So, I think we have reached consensus that the best choice is the
libxml2-ruby SAX-based XML parser.

Since BioRuby has a Tree class (http://bioruby.org/rdoc/), it seems
logical that the parser should return a Tree object. By using a SAX
parser we avoid having the whole XML file in memory, but phylogenetic
trees can still be very large, and it might be too much to store a whole
one as a tree object in memory. This could be mitigated somewhat by a
function next_tree (or next_phylogeny) that reads one tree at a time when
a phyloXML file contains several of them (this is similar to the BioPerl
implementation). I don't think child nodes can be handled the same way:
since SAX parses sequentially, getting the next node (a child one level
down) means parsing its whole subtree first (waiting for the end-tag
event of that child), which costs speed. Any thoughts on this?
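A minimal sketch of the one-tree-at-a-time idea above, assuming a
pull-style parser. It uses REXML's pull parser, which ships with Ruby, as
a stand-in for libxml2-ruby; that swap is made here only to keep the
example self-contained. ToyPhyloReader and next_tree are hypothetical
names, not BioRuby API, and the sample XML is made up. For brevity each
"tree" is reduced to the list of its clade names.

```ruby
require 'rexml/parsers/pullparser'
require 'stringio'

# Hypothetical reader: hands back one <phylogeny> per call instead of
# loading every tree in the file at once.
class ToyPhyloReader
  def initialize(io)
    @parser = REXML::Parsers::PullParser.new(io)
    @stack = [] # names of the currently open elements
  end

  # Returns the clade names of the next <phylogeny>, or nil when no
  # further tree is found.
  def next_tree
    names = nil
    while @parser.has_next?
      event = @parser.pull
      return nil if event.event_type == :end_document

      if event.start_element?
        @stack.push(event[0])
        names = [] if event[0] == 'phylogeny'
      elsif event.text?
        # Keep only <name> elements that sit directly inside a <clade>.
        names << event[0].strip if names && @stack.last(2) == %w[clade name]
      elsif event.end_element?
        @stack.pop
        return names if event[0] == 'phylogeny'
      end
    end
    nil
  end
end

SAMPLE_XML = <<~XML.strip
  <phyloxml>
    <phylogeny rooted="true">
      <name>first tree</name>
      <clade>
        <clade><name>A</name></clade>
        <clade><name>B</name></clade>
      </clade>
    </phylogeny>
    <phylogeny rooted="true">
      <clade>
        <clade><name>C</name></clade>
        <clade><name>D</name></clade>
      </clade>
    </phylogeny>
  </phyloxml>
XML

reader = ToyPhyloReader.new(StringIO.new(SAMPLE_XML))
while (names = reader.next_tree)
  puts names.join(', ')
end
```

Only the tree currently being read is held in memory; a real
implementation would build a tree object at the same point where this
sketch collects names.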
Also the Tree class should be extended and added method output_phyloXML since it has methods output_newick, output_nhx. I think in order to understand what should be returned after parsing it would be useful to know how people use phylogenetic tree data. Here are some I could come up, * visualize / print * calculate total branch length of a tree * query info about specific nodes * create consensus trees Any others? I am a little confused about the require statements in BioRuby classes. It looks like bio/tree.rb should hold a general class, but it requires bio/db/newick.rb, but this file in turn requires bio/tree.rb. Thanks, Diana Project Page: https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby From czmasek at burnham.org Tue May 19 17:54:18 2009 From: czmasek at burnham.org (Christian M Zmasek) Date: Tue, 19 May 2009 14:54:18 -0700 Subject: [BioRuby] Update on phyloXML support for BioRuby project In-Reply-To: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> Message-ID: <4A132A8A.70102@burnham.org> Hi, Diana: I think it is a good idea to have the parser return one tree at a time, as opposed to returning a list of trees. On the other hand, the same does not apply to nodes. I think it is perfectly acceptable to expect to have enough memory to keep at least one tree in memory (a good target size might be a binary tree with ten-thousand external nodes and 200 bytes of annotation per node, which according to my rough calculations would require less than 5MB). For your tree use cases, important ones to add are: * iteration over all nodes * retrieval/finding of specific nodes according to some criterion (e.g. find all nodes for which the species is "E. coli") * tree reconciliation (e.g. 
compare a gene tree to a species tree, in order to determine duplications on the gene tree) In any case, all these applications/algorithms will be most time efficient and easiest to implement with trees which are completely in memory. Re. "I am a little confused about the require statements in BioRuby classes. It looks like bio/tree.rb should hold a general class, but it requires bio/db/newick.rb, but this file in turn requires bio/tree.rb." I am not clear about your question about this. ;) Christian Diana Jaunzeikare wrote: > Hi all, > > I want to update you on my thoughts about this project and I have some > questions. > > So, I think we have reached consensus that the best choice is > libxml2-ruby SAX based XML parser. > > Since BioRuby has Tree class ( http://bioruby.org/rdoc/) it seems > logical that the parser should return a Tree class object. By using > SAX parser we avoid the problem of having whole XML file in memory, > but still the phylogenetic trees can be very large, and it might be > too much to store whole thing as a tree object in memory. This could > be a little remediated by having a function next_tree (or > next_phylogeny) which would read one tree at a time if phyloXML file > has several of them (this is similar to BioPerl implementation). I > don't think the children nodes can be done in similar fashion. Since > SAX parses sequentially, to get next node (child one level down) in > the tree, whole subtree has to be parsed (in order to wait while there > is event for the end tag of that child), thus loosing on speed. Any > thoughts on this? > > Also the Tree class should be extended and added method > output_phyloXML since it has methods output_newick, output_nhx. > > I think in order to understand what should be returned after parsing > it would be useful to know how people use phylogenetic tree data. 
Here > are some I could come up, > * visualize / print > * calculate total branch length of a tree > * query info about specific nodes > * create consensus trees > Any others? > > I am a little confused about the require statements in BioRuby > classes. It looks like bio/tree.rb should hold a general class, but it > requires bio/db/newick.rb, but this file in turn requires bio/tree.rb. > > Thanks, > > Diana > > Project Page: > https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby > From ngoto at gen-info.osaka-u.ac.jp Wed May 20 02:09:17 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 20 May 2009 15:09:17 +0900 Subject: [BioRuby] Update on phyloXML support for BioRuby project In-Reply-To: <4A132A8A.70102@burnham.org> References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org> Message-ID: <20090520060918.B07C71CBC3F4@idnmail.gen-info.osaka-u.ac.jp> Hi all, On Tue, 19 May 2009 17:07:59 -0400 Diana Jaunzeikare wrote: > So, I think we have reached consensus that the best choice is libxml2-ruby > SAX based XML parser. In libxml2-ruby, I think LibXML::XML::Reader is the best choice, because it is memory efficient than DOM and its API is simpler than that of SAX. LibXML::XML::SAXParser is not bad, but I wonder if the SAX's callback based API makes our codes too complex and difficult to maintain. > Since BioRuby has Tree class ( http://bioruby.org/rdoc/) it seems > logical that the parser should return a Tree class object. By using > SAX parser we avoid the problem of having whole XML file in memory, I think so. Alternative way is to return an object of wrapper class which mimics Bio::Tree's API. However, it may be too hard to implement such class, and data type conversion from/to Bio::Tree is still needed even in this case. So, I think to return a Bio::Tree object is good. > I am a little confused about the require statements in BioRuby classes. 
It > looks like bio/tree.rb should hold a general class, but it requires > bio/db/newick.rb, but this file in turn requires bio/tree.rb. The only reason why bio/tree.rb requires bio/db/newick.rb is for the Newick and NHX output of the tree. The codes will be refactored in the future. On Tue, 19 May 2009 14:54:18 -0700 Christian M Zmasek wrote: > Hi, Diana: > > I think it is a good idea to have the parser return one tree at a time, > as opposed to returning a list of trees. I think so. > On the other hand, the same does not apply to nodes. I think it is > perfectly acceptable to expect to have enough memory to keep at least > one tree in memory (a good target size might be a binary tree with > ten-thousand external nodes and 200 bytes of annotation per node, which > according to my rough calculations would require less than 5MB). > > For your tree use cases, important ones to add are: > * iteration over all nodes > * retrieval/finding of specific nodes according to some criterion (e.g. > find all nodes for which the species is "E. coli") > * tree reconciliation (e.g. compare a gene tree to a species tree, in > order to determine duplications on the gene tree) > > In any case, all these applications/algorithms will be most time > efficient and easiest to implement with trees which are completely in > memory. In addition, it is easy to implement manipulation of trees (adding/deleting nodes and edges, etc.). 
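
As a rough check of the memory estimate quoted above, and of the node
iteration and lookup that a fully in-memory tree makes easy, here is a
small sketch. ToyNode and annotation_bytes are hypothetical helpers, not
Bio::Tree. The byte count covers only the 200 bytes of annotation per
node and ignores Ruby's per-object overhead, so it is a lower bound
rather than a measurement.

```ruby
# Toy in-memory node; a hypothetical stand-in, NOT Bio::Tree.
ToyNode = Struct.new(:name, :species, :branch_length, :children) do
  # Pre-order traversal: visit this node, then each subtree in turn.
  def each_node(&block)
    return enum_for(:each_node) unless block

    block.call(self)
    children.each { |child| child.each_node(&block) }
  end
end

# A rooted binary tree with n external nodes has n - 1 internal nodes,
# so 2n - 1 nodes in total.
def annotation_bytes(external_nodes, bytes_per_node)
  (2 * external_nodes - 1) * bytes_per_node
end

bytes = annotation_bytes(10_000, 200)
puts "#{bytes} bytes of annotation" # 3999800 bytes, i.e. under 5 MB

leaf_a = ToyNode.new('A', 'E. coli', 0.1, [])
leaf_b = ToyNode.new('B', 'B. subtilis', 0.2, [])
leaf_c = ToyNode.new('C', 'E. coli', 0.3, [])
inner  = ToyNode.new('AB', nil, 0.05, [leaf_a, leaf_b])
root   = ToyNode.new('root', nil, 0.0, [inner, leaf_c])

# Iteration over all nodes.
puts root.each_node.map(&:name).join(' ')

# Finding nodes by a criterion, here the species.
ecoli = root.each_node.select { |node| node.species == 'E. coli' }.map(&:name)
puts ecoli.join(' ')

# Total branch length of the tree.
total = root.each_node.sum(&:branch_length)
puts total.round(2)
```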
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From rozziite at gmail.com Wed May 20 10:51:26 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Wed, 20 May 2009 10:51:26 -0400 Subject: [BioRuby] Update on phyloXML support for BioRuby project In-Reply-To: <20090520060918.B07C71CBC3F4@idnmail.gen-info.osaka-u.ac.jp> References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org> <20090520060918.B07C71CBC3F4@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <4057d3bf0905200751u739b37cdqe1e7d275fb82d08c@mail.gmail.com> On Wed, May 20, 2009 at 2:09 AM, Naohisa GOTO wrote: > Hi all, > > On Tue, 19 May 2009 17:07:59 -0400 > Diana Jaunzeikare wrote: > > > So, I think we have reached consensus that the best choice is > libxml2-ruby > > SAX based XML parser. > > In libxml2-ruby, I think LibXML::XML::Reader is the best choice, > because it is memory efficient than DOM and its API is simpler > than that of SAX. LibXML::XML::SAXParser is not bad, but I wonder > if the SAX's callback based API makes our codes too complex and > difficult to maintain. > I wrote sample code using both LibXML::XML::Reader and LibXML::XML::SAXParser and I agree that SAX's callback based API might get very complex and hard to maintain. > > > Since BioRuby has Tree class ( http://bioruby.org/rdoc/) it seems > > logical that the parser should return a Tree class object. By using > > SAX parser we avoid the problem of having whole XML file in memory, > > I think so. > Alternative way is to return an object of wrapper class which mimics > Bio::Tree's API. However, it may be too hard to implement such class, > and data type conversion from/to Bio::Tree is still needed even in > this case. So, I think to return a Bio::Tree object is good. > > > I am a little confused about the require statements in BioRuby classes. 
> It > > looks like bio/tree.rb should hold a general class, but it requires > > bio/db/newick.rb, but this file in turn requires bio/tree.rb. > > The only reason why bio/tree.rb requires bio/db/newick.rb is > for the Newick and NHX output of the tree. The codes will > be refactored in the future. > > On Tue, 19 May 2009 14:54:18 -0700 > Christian M Zmasek wrote: > > > Hi, Diana: > > > > I think it is a good idea to have the parser return one tree at a time, > > as opposed to returning a list of trees. > > I think so. > > > On the other hand, the same does not apply to nodes. I think it is > > perfectly acceptable to expect to have enough memory to keep at least > > one tree in memory (a good target size might be a binary tree with > > ten-thousand external nodes and 200 bytes of annotation per node, which > > according to my rough calculations would require less than 5MB). > > > > For your tree use cases, important ones to add are: > > * iteration over all nodes > > * retrieval/finding of specific nodes according to some criterion (e.g. > > find all nodes for which the species is "E. coli") > > * tree reconciliation (e.g. compare a gene tree to a species tree, in > > order to determine duplications on the gene tree) > > > > In any case, all these applications/algorithms will be most time > > efficient and easiest to implement with trees which are completely in > > memory. > > In addition, it is easy to implement manipulation of trees > (adding/deleting nodes and edges, etc.). > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > From rozziite at gmail.com Wed May 20 22:29:40 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Wed, 20 May 2009 22:29:40 -0400 Subject: [BioRuby] [Wg-phyloinformatics] Update on phyloXML support for BioRubyproject In-Reply-To: <1E5916D0F3C84F978A4A62DD89275053@NewLife> References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>

<1E5916D0F3C84F978A4A62DD89275053@NewLife>
Message-ID: <4057d3bf0905201929y658cb924r36c2d48f62176ed4@mail.gmail.com>

Both NeXML and phyloXML are XML formats for holding information about
phylogenetic trees, and both seem to be fairly new. What's the difference?
Isn't there an ultimate goal to have one universal format for phylogenetic
data exchange? If so, which of these two formats would be better suited to
it, or do they serve different purposes (as NeXML is based on the NEXUS
format)?

My other question is about Perl phylogenetics-related packages (warning: I
am not at all familiar with BioPerl classes). Bio::TreeIO and
Bio::Phylo::IO seem to me to be doing the same task. What are the main
differences between them? If there are no fundamental differences, why are
there two classes that do the same thing?

Diana

On Wed, May 20, 2009 at 11:13 AM, Mark A. Jensen wrote:

> Should I plug my "web service blog":
> https://www.nescent.org/wg_evoinfo/User_talk:Mjensen#.22Streaming.22_NeXML.3F?
> Some possible useful ideas there. I've also written a flexible DOM
> implementation for Bio::Phylo, that could be generalized/cannibalized/stolen
> and used for iterating/streaming XML formats. It provides a standard
> interface for writing dom elements, but allows for easily swapping in
> different XML handling packages (in a BioPerly '-format => libxml' way).
> All in Perl I'm afraid (or proud?) to say, but it might provide ideas too.
> (I sent it to Rutger, but he hasn't passed judgment yet, so it's currently
> available 'by request to the author'.)
> cheers; excellent work Diana- MAJ > > ----- Original Message ----- > From: "Chris Fields" > To: "Hilmar Lapp" > Cc: ; "Phyloinformatics Group" > > Sent: Wednesday, May 20, 2009 10:52 AM > Subject: Re: [Wg-phyloinformatics] Update on phyloXML support for > BioRubyproject > > > > > > On May 20, 2009, at 8:22 AM, Hilmar Lapp wrote: > > > >> > >> On May 19, 2009, at 5:54 PM, Christian M Zmasek wrote: > >> > >>> I think it is perfectly acceptable to expect to have enough memory > >>> to keep at least > >>> one tree in memory > >> > >> Sounds like a good and perfectly reasonable starting point to me too. > >> It's also the way other toolkits (such as BioPerl) work. > >> > >> Having said that, I don't find it inconceivable that we may be working > >> with trees in the near future that don't fit into memory for a 1GB RAM > >> machine if they are richly decorated (which is something that phyloXML > >> wants to enable, isn't it?). Solving that to me though seems to be > >> question of writing an appropriate Tree implementation that happens to > >> store most of the data on disk rather than in memory, and not an issue > >> for how to write a parser. Ideally though, the parser uses a factory > >> for creating the (tree and/or node) objects, so that later it can be > >> made to use an on-disk Tree implementation simply by passing it > >> another factory. I.e., ideally the parser would not assume and hard- > >> code the Tree implementation class. > >> > >> Just my $0.02. > >> > >> -hilmar > > > > This could be implemented in a lazy way or using lightweight objects. > > The Tree object itself contains the XML parser or a reference thereof > > (probably LibXML Reader-based) and creates the relevant nodes as > > needed. The only thing needed would be some light parsing to indicate > > start-end file points. > > > > It's tricky with re: to a number of aspects, but it can be done. For > > instance, if one wanted to modify the created nodes (i.e. 
if the nodes > > are mutable), or creating a generic Lazy set of classes capable of > > dealing with multiple formats. > > > > Just in case anyone's wondering, I have been thinking along these > > lines for a while re: BioPerl, Bio::Seq, and very large files... ;> > > > > chris > > _______________________________________________ > > Wg-phyloinformatics mailing list > > Wg-phyloinformatics at nescent.org > > https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics > > > > > > _______________________________________________ > Wg-phyloinformatics mailing list > Wg-phyloinformatics at nescent.org > https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics > From czmasek at burnham.org Thu May 21 00:02:11 2009 From: czmasek at burnham.org (Christian M Zmasek) Date: Wed, 20 May 2009 21:02:11 -0700 Subject: [BioRuby] [Wg-phyloinformatics] Update on phyloXML support for BioRuby project In-Reply-To: References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>

Message-ID: <4A14D243.2010101@burnham.org>

Hi:

Thanks to Hilmar and Chris for the detailed replies!

I think it is a very good idea to keep such very large trees in mind, and
possibly to implement a solution that loads only the requested nodes into
memory (as described by Hilmar and Chris) if there is enough time left at
the end of the project.

Re "It's tricky with re: to a number of aspects, but it can be done. For
instance, if one wanted to modify the created nodes (i.e. if the nodes
are mutable), or creating a generic Lazy set of classes capable of
dealing with multiple formats."

How would you do post-order or pre-order iteration of nodes? Wouldn't you
have to go back and forth in the file?

CZ

Chris Fields wrote:
> On May 20, 2009, at 8:22 AM, Hilmar Lapp wrote:
>
>> On May 19, 2009, at 5:54 PM, Christian M Zmasek wrote:
>>
>>> I think it is perfectly acceptable to expect to have enough memory
>>> to keep at least one tree in memory
>>
>> Sounds like a good and perfectly reasonable starting point to me too.
>> It's also the way other toolkits (such as BioPerl) work.
>>
>> Having said that, I don't find it inconceivable that we may be working
>> with trees in the near future that don't fit into memory for a 1GB RAM
>> machine if they are richly decorated (which is something that phyloXML
>> wants to enable, isn't it?). Solving that to me though seems to be
>> question of writing an appropriate Tree implementation that happens to
>> store most of the data on disk rather than in memory, and not an issue
>> for how to write a parser. Ideally though, the parser uses a factory
>> for creating the (tree and/or node) objects, so that later it can be
>> made to use an on-disk Tree implementation simply by passing it
>> another factory. I.e., ideally the parser would not assume and
>> hard-code the Tree implementation class.
>>
>> Just my $0.02.
>>
>> -hilmar
>
> This could be implemented in a lazy way or using lightweight objects.
> The Tree object itself contains the XML parser or a reference thereof > (probably LibXML Reader-based) and creates the relevant nodes as > needed. The only thing needed would be some light parsing to indicate > start-end file points. > > It's tricky with re: to a number of aspects, but it can be done. For > instance, if one wanted to modify the created nodes (i.e. if the nodes > are mutable), or creating a generic Lazy set of classes capable of > dealing with multiple formats. > > Just in case anyone's wondering, I have been thinking along these > lines for a while re: BioPerl, Bio::Seq, and very large files... ;> > > chris > From rozziite at gmail.com Sat May 23 16:12:31 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Sat, 23 May 2009 16:12:31 -0400 Subject: [BioRuby] GSOC: Unit testing in BioRuby Message-ID: <4057d3bf0905231312j13d44bf4s76d9f97a534fce79@mail.gmail.com> Hi, How does unit testing in BioRuby works? I created new file /lib/bio/db/phyloxml.rb and test/unit/bio/db/test_phyloxml.rb I also put some sample xml files in /test/unit/data/phyloxml/ directory. How can I access the data set in data directory from the test_phyloxml.rb file? Can I give path relative path to the data file? for example "../../../data/phyloxml/phyloxml_examples.xml" I saw other test files have this code: "require 'pathname' libpath = Pathname.new(File.join(File.dirname(__FILE__), ['..'] * 3, 'lib')).cleanpath.to_s $:.unshift(libpath) unless $:.include?(libpath) " Is this code for that purpose? I am not really sure what this piece of code means. Thanks, Diana From mail at michaelbarton.me.uk Sat May 23 19:00:26 2009 From: mail at michaelbarton.me.uk (Michael Barton) Date: Sun, 24 May 2009 00:00:26 +0100 Subject: [BioRuby] GSOC: Unit testing in BioRuby In-Reply-To: <4057d3bf0905231312j13d44bf4s76d9f97a534fce79@mail.gmail.com> References: <4057d3bf0905231312j13d44bf4s76d9f97a534fce79@mail.gmail.com> Message-ID: Hi Diana. 
You could try adding a module along these lines to your test file,
inside the TestPhyloXML module.

  module TestPhyloXMLData

    bioruby_root = Pathname.new(File.join(File.dirname(__FILE__), ['..'] * 5)).cleanpath.to_s
    TEST_DATA = Pathname.new(File.join(bioruby_root, 'test', 'data', 'phyloxml')).cleanpath.to_s

    def self.example_xml
      File.join TEST_DATA, 'phyloxml_examples.xml'
    end

  end

You can then use

  TestPhyloXMLData.example_xml

to return the path to the example test data file.

The code you described in your email adds the BioRuby lib directory to
the library path so you can do

  require 'bio/db/phyloxml'

at the top of your test file.

Is this any help?

Cheers

Mike

2009/5/23 Diana Jaunzeikare
>
> Hi,
>
> How does unit testing in BioRuby works?
>
> I created new file /lib/bio/db/phyloxml.rb and
> test/unit/bio/db/test_phyloxml.rb I also put some sample xml files in
> /test/unit/data/phyloxml/ directory.
>
> How can I access the data set in data directory from the test_phyloxml.rb
> file?
>
> Can I give path relative path to the data file? for example
> "../../../data/phyloxml/phyloxml_examples.xml"
>
> I saw other test files have this code:
>
> "require 'pathname'
> libpath = Pathname.new(File.join(File.dirname(__FILE__), ['..'] * 3,
> 'lib')).cleanpath.to_s
> $:.unshift(libpath) unless $:.include?(libpath)
> "
> Is this code for that purpose? I am not really sure what this piece of code
> means.
>
> Thanks,
>
> Diana

From rozziite at gmail.com Sun May 24 16:42:18 2009
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Sun, 24 May 2009 16:42:18 -0400
Subject: [BioRuby] GSOC: Unit testing in BioRuby
In-Reply-To:
References: <4057d3bf0905231312j13d44bf4s76d9f97a534fce79@mail.gmail.com>
Message-ID: <4057d3bf0905241342x6cba94f1mf8e236cd2062199f@mail.gmail.com>

Thanks a lot Michael! It worked.
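
For reference, the path arithmetic discussed in this thread can be
checked in isolation. This is a minimal sketch: path_from is a
hypothetical helper and /work/bioruby is a made-up checkout location.
The number of '..' segments depends on how deep the test file sits; for
a file in test/unit/bio/db, as assumed here, four of them reach the
checkout root. Pathname#cleanpath works on the string only, so nothing
has to exist on disk.

```ruby
require 'pathname'

# Climb `levels` directories up from the directory that contains `file`,
# then descend into `parts`; cleanpath collapses the '..' segments.
def path_from(file, levels, *parts)
  joined = File.join(File.dirname(file), ['..'] * levels, *parts)
  Pathname.new(joined).cleanpath.to_s
end

# Hypothetical location of the test file inside a checkout.
test_file = '/work/bioruby/test/unit/bio/db/test_phyloxml.rb'

libpath   = path_from(test_file, 4, 'lib')
data_file = path_from(test_file, 4, 'test', 'data', 'phyloxml',
                      'phyloxml_examples.xml')

puts libpath   # /work/bioruby/lib
puts data_file # /work/bioruby/test/data/phyloxml/phyloxml_examples.xml

# One level too few stops inside test/ instead of the checkout root.
puts path_from(test_file, 3, 'lib') # /work/bioruby/test/lib
```

In a real test file, __FILE__ takes the place of test_file, and libpath
is what gets prepended to $: before require 'bio/db/phyloxml'.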

Diana

On Sat, May 23, 2009 at 7:00 PM, Michael Barton wrote:

> Hi Diana.
>
> You could try adding a module along these lines to your test file,
> inside the TestPhyloXML module.
>
>   module TestPhyloXMLData
>
>     bioruby_root = Pathname.new(File.join(File.dirname(__FILE__),
>                                           ['..'] * 5)).cleanpath.to_s
>     TEST_DATA = Pathname.new(File.join(bioruby_root, 'test', 'data',
>                                        'phyloxml')).cleanpath.to_s
>
>     def self.example_xml
>       File.join TEST_DATA, 'phyloxml_examples.xml'
>     end
>
>   end
>
> You can then use
>
>   TestPhyloXMLData.example_xml
>
> to return the path to the example test data file.
>
> The code you described in your email adds the bioruby root to the
> library path so you can do
>
>   require 'bio/db/phyloxml'
>
> at the top of your test file.
>
> Is this any help?
>
> Cheers
>
> Mike
>
> 2009/5/23 Diana Jaunzeikare
> >
> > Hi,
> >
> > How does unit testing in BioRuby work?
> >
> > I created a new file /lib/bio/db/phyloxml.rb and
> > test/unit/bio/db/test_phyloxml.rb. I also put some sample xml files in
> > the /test/unit/data/phyloxml/ directory.
> >
> > How can I access the data set in the data directory from the
> > test_phyloxml.rb file?
> >
> > Can I give a relative path to the data file? For example
> > "../../../data/phyloxml/phyloxml_examples.xml"
> >
> > I saw other test files have this code:
> >
> > "require 'pathname'
> > libpath = Pathname.new(File.join(File.dirname(__FILE__), ['..'] * 3,
> > 'lib')).cleanpath.to_s
> > $:.unshift(libpath) unless $:.include?(libpath)
> > "
> > Is this code for that purpose? I am not really sure what this piece of
> > code means.
> >
> > Thanks,
> >
> > Diana
> > _______________________________________________
> > BioRuby mailing list
> > BioRuby at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioruby
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>

From mail at michaelbarton.me.uk Sun May 24 18:28:07 2009
From: mail at michaelbarton.me.uk (Michael Barton)
Date: Sun, 24 May 2009 23:28:07 +0100
Subject: [BioRuby] GSOC: Unit testing in BioRuby
In-Reply-To: <4057d3bf0905241342x6cba94f1mf8e236cd2062199f@mail.gmail.com>
References: <4057d3bf0905231312j13d44bf4s76d9f97a534fce79@mail.gmail.com> <4057d3bf0905241342x6cba94f1mf8e236cd2062199f@mail.gmail.com>
Message-ID:

Cheers. Good luck with the rest of the project.

From rozziite at gmail.com Sun May 24 20:29:57 2009
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Sun, 24 May 2009 20:29:57 -0400
Subject: [BioRuby] [Wg-phyloinformatics] Update on phyloXML support for BioRuby project
In-Reply-To: <86472996-C665-4AF0-A8A9-9E8D4EEB2A79@illinois.edu>
References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>
<4A14D243.2010101@burnham.org> <86472996-C665-4AF0-A8A9-9E8D4EEB2A79@illinois.edu>
Message-ID: <4057d3bf0905241729o69447925hd830404950fd28e1@mail.gmail.com>

Hi all,

Since there are many more elements in PhyloXML than in Bio::Tree, I
propose to make a class PhyloXMLNode which inherits from Bio::Tree::Node.

PhyloXMLNode:
# attributes from Bio::Tree::Node
* bootstrap
* bootstrap_string
* ec_number
* name
* scientific_name
* taxonomy_id

# new attributes
* id_source
* confidence [] ([] means array of elements)
* color
* node_id
* taxonomy []
* sequence [] (Bio::Sequence object)
* events
* binary_characters
* distribution []
* date
* reference []
* property []

Also, since the <phylogeny> element does not only consist of <clade>
elements, but other elements also, the Bio::Tree class should be extended.

PhyloXMLTree
# inherited from Bio::Tree
* options
* root

# new attributes
* rooted (boolean)
* rerootable (boolean)
* branch_length_unit
* type
* name
* id
* description
* date
* confidence []
* clade_relation []
* sequence_relation []
* property []

I think inheritance is better than creating a separate class, because
then users will be able to use Bio::Tree as before, while also being
able to read PhyloXML data files. Conversion from PhyloXML to other
formats will then also be easy, since the Bio::Tree class has
output_newick, output_nhx and output_phylip_distance_matrix methods.

Diana

Project Page:
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby

On Thu, May 21, 2009 at 9:03 AM, Chris Fields wrote:

> Actually, as Perl's XML::LibXML::Reader is described it almost sounds
> perfect, though I'm unsure of backtracking to a specific node in the
> tree (and thus post/pre-order of nodes). Saying that, I would be
> surprised if it weren't possible, though.
>
> chris
>
> On May 20, 2009, at 11:02 PM, Christian M Zmasek wrote:
>
> > Hi:
> >
> > Thanks for the detailed replies by Hilmar and Chris!
> > I think it is a very good idea to keep such very large trees in
> > mind, and possibly implement a solution which only loads requested
> > nodes into memory (as described by Hilmar and Chris) if there is
> > enough time left at the end of the project.
> >
> > Re "It's tricky with re: to a number of aspects, but it can be
> > done. For instance, if one wanted to modify the created nodes
> > (i.e. if the nodes are mutable), or creating a generic Lazy set of
> > classes capable of dealing with multiple formats."
> >
> > How would you do post-order or pre-order iteration of nodes?
> > Wouldn't you have to back and forth in the file?
> >
> > CZ
> >
> > Chris Fields wrote:
> >> On May 20, 2009, at 8:22 AM, Hilmar Lapp wrote:
> >>
> >>> On May 19, 2009, at 5:54 PM, Christian M Zmasek wrote:
> >>>
> >>>> I think it is perfectly acceptable to expect to have enough memory
> >>>> to keep at least one tree in memory
> >>>>
> >>> Sounds like a good and perfectly reasonable starting point to me
> >>> too. It's also the way other toolkits (such as BioPerl) work.
> >>>
> >>> Having said that, I don't find it inconceivable that we may be
> >>> working with trees in the near future that don't fit into memory
> >>> for a 1GB RAM machine if they are richly decorated (which is
> >>> something that phyloXML wants to enable, isn't it?). Solving that
> >>> to me though seems to be question of writing an appropriate Tree
> >>> implementation that happens to store most of the data on disk
> >>> rather than in memory, and not an issue for how to write a parser.
> >>> Ideally though, the parser uses a factory for creating the (tree
> >>> and/or node) objects, so that later it can be made to use an
> >>> on-disk Tree implementation simply by passing it another factory.
> >>> I.e., ideally the parser would not assume and hard-code the Tree
> >>> implementation class.
> >>>
> >>> Just my $0.02.
> >>>
> >>> -hilmar
> >>>
> >>
> >> This could be implemented in a lazy way or using lightweight
> >> objects. The Tree object itself contains the XML parser or a
> >> reference thereof (probably LibXML Reader-based) and creates the
> >> relevant nodes as needed. The only thing needed would be some
> >> light parsing to indicate start-end file points.
> >>
> >> It's tricky with re: to a number of aspects, but it can be done.
> >> For instance, if one wanted to modify the created nodes (i.e. if
> >> the nodes are mutable), or creating a generic Lazy set of classes
> >> capable of dealing with multiple formats.
> >>
> >> Just in case anyone's wondering, I have been thinking along these
> >> lines for a while re: BioPerl, Bio::Seq, and very large files... ;>
> >>
> >> chris
> >>
> >
>
> _______________________________________________
> Wg-phyloinformatics mailing list
> Wg-phyloinformatics at nescent.org
> https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics
>

From ngoto at gen-info.osaka-u.ac.jp Mon May 25 10:04:22 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 25 May 2009 23:04:22 +0900
Subject: [BioRuby] GSOC: Unit testing in BioRuby
In-Reply-To:
References: <4057d3bf0905231312j13d44bf4s76d9f97a534fce79@mail.gmail.com>
Message-ID: <20090525140422.D258B1CBC3F4@idnmail.gen-info.osaka-u.ac.jp>

Hi,

> libpath = Pathname.new(File.join(File.dirname(__FILE__), ['..'] * 3,
> 'lib')).cleanpath.to_s
> $:.unshift(libpath) unless $:.include?(libpath)

> bioruby_root = Pathname.new(File.join(File.dirname(__FILE__),
> ['..'] * 5)).cleanpath.to_s
> TEST_DATA = Pathname.new(File.join(bioruby_root, 'test', 'data',
> 'phyloxml')).cleanpath.to_s

The magic number (in the above cases, 3 or 5) depends on the depth
of the directory of the test.

For example,

  File location                        The number
  test/unit/bio/test_AAA.rb            3
  test/unit/bio/BBB/test_AAA.rb        4
  test/unit/bio/CCC/BBB/test_AAA.rb    5

--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From czmasek at burnham.org Mon May 25 18:16:08 2009
From: czmasek at burnham.org (Christian M Zmasek)
Date: Mon, 25 May 2009 15:16:08 -0700
Subject: [BioRuby] [Wg-phyloinformatics] bioruby classes for phyloxml support
In-Reply-To: <4057d3bf0905241729o69447925hd830404950fd28e1@mail.gmail.com>
References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>
<4A14D243.2010101@burnham.org> <86472996-C665-4AF0-A8A9-9E8D4EEB2A79@illinois.edu> <4057d3bf0905241729o69447925hd830404950fd28e1@mail.gmail.com>
Message-ID: <4A1B18A8.1070104@burnham.org>

Hi, Diana:

What you wrote looks more or less OK.
I agree it is better to extend existing classes, as opposed to change
them drastically.
One thing to keep in mind, is that many attributes are composed of
multiple fields themselves, i.e. you would need to create a class for
them (if such a class not already exists).
The most important element besides sequence, is the taxonomy class.

Since BioRuby does not contain a general purpose taxonomy class at
this point, it might be worth spending some time in designing such a
class.

I propose a taxonomy class with the following elements:
-scientific name (e.g. Nematostella vectensis)
-common name (e.g. starlet sea anemone)
-code (or mnemonic, as used by Swiss-Prot) (e.g. NEMVE)
-rank (e.g. species)

phyloxml also has a URI for taxonomies, but I am not sure if this is
important for a general taxonomy class.

On the other hand, a general taxonomy class might also have
- authority (e.g. Stephenson, 1935)
- aliases []

(if these elements are considered important, they of course could be
added to the next version of phyloxml)

What do people think about this?

Christian

From bonnalraoul at ingm.it Tue May 26 03:25:34 2009
From: bonnalraoul at ingm.it (Raoul JP Bonnal)
Date: Tue, 26 May 2009 09:25:34 +0200
Subject: [BioRuby] [Wg-phyloinformatics] bioruby classes for phyloxml support
In-Reply-To: <4A1B18A8.1070104@burnham.org>
References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>
<4A14D243.2010101@burnham.org> <86472996-C665-4AF0-A8A9-9E8D4EEB2A79@illinois.edu> <4057d3bf0905241729o69447925hd830404950fd28e1@mail.gmail.com> <4A1B18A8.1070104@burnham.org>
Message-ID: <4A1B996E.2010702@ingm.it>

Christian M Zmasek wrote:
> I agree it is better to extend existing classes, as opposed to change
> them drastically.
> One thing to keep in mind, is that many attributes are composed of
> multiple fields themselves, i.e. you would need to create a class for
> them (if such a class not already exists).
> The most important element besides sequence, is the taxonomy class.
>
> Since BioRuby does not contain a general purpose taxonomy class at
> this point, it might be worth spending some time in designing such a
> class.
>
> I propose a taxonomy class with the following elements:
> -scientific name (e.g. Nematostella vectensis)
> -common name (e.g. starlet sea anemone)
> -code (or mnemonic, as used by Swiss-Prot) (e.g. NEMVE)
> -rank (e.g. species)
>
> phyloxml also has a URI for taxonomies, but I am not sure if this is
> important for a general taxonomy class.
>
> On the other hand, a general taxonomy class might also have
> - authority (e.g. Stephenson, 1935)
> - aliases []
> (if these elements are considered important, they of course could be
> added to the next version of phyloxml)
>
> What do people think about this?

How do other languages represent that class?
I think that, since we have the chance to define a new class, there is
an opportunity to define a similar API among the bio-languages.
Then the taxonomy class could be used by biosequence objects
representing/grabbing data from BioSQL, for example.
http://code.open-bio.org/svnweb/index.cgi/biosql/checkout/biosql-schema/trunk/doc/biosql-ERD.pdf

--
Ra

From francesco.strozzi at gmail.com Tue May 26 03:48:39 2009
From: francesco.strozzi at gmail.com (Francesco Strozzi)
Date: Tue, 26 May 2009 09:48:39 +0200
Subject: [BioRuby] Miranda target scan
Message-ID:

Hi all,
I need to parse Miranda output files, after a whole genome scan for
microRNA target sites. Is there any package available to do this in
BioRuby?
Thanks and cheers.

--
Francesco
skype: francescostrozzi

From rozziite at gmail.com Tue May 26 08:17:24 2009
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Tue, 26 May 2009 08:17:24 -0400
Subject: [BioRuby] Bioruby PhyloXML update
Message-ID: <4057d3bf0905260517m32167f53j5ce8ca45b43ce655@mail.gmail.com>

Hi all!

Here is what was done during the community bonding period:

- subscribed to mailing lists
- created a blog
- got familiar with Git (this was particularly useful:
  http://www.gitcasts.com/posts/railsconf-git-talk )
- created a GitHub account and forked the bioruby project.
- made first commit by adding sample phyloxml data files from
  www.phyloxml.org
- reviewed the BioPerl phyloXML implementation (also
  http://www.bioperl.org/wiki/HOWTO:Trees )
- got familiar with libxml-ruby. Wrote a simple program using both
  LibXML::XML::Reader and LibXML::XML::SAXParser to parse a simple xml
  file.
- reviewed Ruby classes - Bio::Tree, Bio::Pathway
- After discussions on the mailing lists it has been agreed to use the
  LibXML-ruby library, specifically the LibXML::XML::Reader class

This week's plan:

- Start writing the parser using LibXML::XML::Reader. It should return a
  Bio::Tree object.
- Implement function next_tree to parse and return the next phylogeny.
- Design a Tree::Node object for containing phyloxml elements.
- Start mapping phyloxml elements to Bio::Tree::Node, starting with
  taxonomy, branch_length, scientific_name
- Write simple unit tests.
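
As a rough illustration of the next_tree idea in the plan above, here is a
minimal sketch of a pull-style reader that hands back one phylogeny at a
time. It is not the project's code: the thread settled on libxml-ruby's
LibXML::XML::Reader, whereas this sketch swaps in REXML's pull parser from
Ruby's bundled libraries so that it runs without extra gems. The class name
PhylogenyReader and the returned Hash layout are hypothetical, and a real
parser would presumably build Bio::Tree objects instead.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical streaming reader: each call to next_tree consumes XML
# events up to the end of the next <phylogeny> element and no further.
class PhylogenyReader
  def initialize(xml)
    @parser = REXML::Parsers::PullParser.new(xml)
    @path = [] # names of the currently open elements
  end

  # Returns a Hash describing the next phylogeny, or nil at end of input.
  def next_tree
    tree = nil
    while @parser.has_next?
      event = @parser.pull
      if event.start_element?
        @path.push(event[0])
        if event[0] == 'phylogeny'
          tree = { rooted: event[1]['rooted'] == 'true',
                   name: nil, clade_names: [] }
        end
      elsif event.text? && tree
        text = event[0].strip
        next if text.empty? || @path[-1] != 'name'
        # A <name> directly under <phylogeny> names the tree,
        # a <name> directly under <clade> names a node.
        case @path[-2]
        when 'phylogeny' then tree[:name] = text
        when 'clade'     then tree[:clade_names] << text
        end
      elsif event.end_element?
        @path.pop
        return tree if event[0] == 'phylogeny'
      end
    end
    nil
  end
end

# Made-up sample input, only for illustration.
SAMPLE_XML = <<~XML
  <phyloxml>
    <phylogeny rooted="true">
      <name>first</name>
      <clade><name>A</name></clade>
    </phylogeny>
    <phylogeny rooted="false">
      <name>second</name>
      <clade>
        <clade><name>B</name></clade>
        <clade><name>C</name></clade>
      </clade>
    </phylogeny>
  </phyloxml>
XML

reader = PhylogenyReader.new(SAMPLE_XML)
while (tree = reader.next_tree)
  puts "#{tree[:name]}: #{tree[:clade_names].join(', ')}"
end
```

Because each call stops at the closing phylogeny tag, only one tree's worth
of events is consumed per call, which is the property a next_tree style
interface is after.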

Diana

Project page:
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby

From czmasek at burnham.org Tue May 26 22:01:28 2009
From: czmasek at burnham.org (Christian M Zmasek)
Date: Tue, 26 May 2009 19:01:28 -0700
Subject: [BioRuby] [Wg-phyloinformatics] bioruby classes for phyloxml support
In-Reply-To: <75226546-345B-446D-82D4-FBD5016905D8@duke.edu>
References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>
<4A14D243.2010101@burnham.org> <86472996-C665-4AF0-A8A9-9E8D4EEB2A79@illinois.edu> <4057d3bf0905241729o69447925hd830404950fd28e1@mail.gmail.com> <4A1B18A8.1070104@burnham.org> <75226546-345B-446D-82D4-FBD5016905D8@duke.edu>
Message-ID: <4A1C9EF8.9010505@burnham.org>

Hi:

It's a great idea to look at TDWG's standards.
But to me, these standards seem designed specifically for collections in
museums and, not surprisingly, biodiversity applications. For the
purposes of comparative genomics and related fields, these taxonomy
concepts seem a little bit overkill.

In my experience, taxonomy objects which can contain a scientific name,
a common name, a mnemonic, and a (typed) identifier (which could be a
Uniform Resource Name (URN) or an NCBI taxonomy id) are sufficient for
most applications. This is pretty much what phyloXML's taxon element
contains now. Of course, this does not mean that a potential taxonomy
class in BioRuby has to follow the concept for phyloXML.

What do you think?

Christian

Hilmar Lapp wrote:
> On May 25, 2009, at 6:16 PM, Christian M Zmasek wrote:
>
>> I propose a taxonomy class with the following elements:
>> -scientific name (e.g. Nematostella vectensis)
>> -common name (e.g. starlet sea anemone)
>> -code (or mnemonic, as used by Swiss-Prot) (e.g. NEMVE)
>> -rank (e.g. species)
>>
>> phyloxml also has a URI for taxonomies, but I am not sure if this is
>> important for a general taxonomy class.
>>
>> On the other hand, a general taxonomy class might also have
>> - authority (e.g. Stephenson, 1935)
>> - aliases []
>>
>> (if these elements are considered important, they of course could be
>> added to the next version of phyloxml)
>>
>
> Note that there is the Taxonomic Concepts Transfer Schema as a
> ratified TDWG standard, so if you really want to have a rich
> representation of taxonomic entities or concepts I wouldn't try to
> roll my own.
>
> http://www.tdwg.org/standards/117/
>
> For lightweight taxonomic designation, there are taxonomic elements in
> Darwin Core:
>
> http://wiki.tdwg.org/DarwinCore
>
> -hilmar
>

From rozziite at gmail.com Tue May 26 22:51:48 2009
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Tue, 26 May 2009 22:51:48 -0400
Subject: [BioRuby] [Wg-phyloinformatics] bioruby classes for phyloxml support
In-Reply-To: <4A1C9EF8.9010505@burnham.org>
References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>
<4A14D243.2010101@burnham.org> <86472996-C665-4AF0-A8A9-9E8D4EEB2A79@illinois.edu> <4057d3bf0905241729o69447925hd830404950fd28e1@mail.gmail.com> <4A1B18A8.1070104@burnham.org> <75226546-345B-446D-82D4-FBD5016905D8@duke.edu> <4A1C9EF8.9010505@burnham.org>
Message-ID: <4057d3bf0905261951w47d17ef8l70c44d46fbbd933c@mail.gmail.com>

On Tue, May 26, 2009 at 10:01 PM, Christian M Zmasek wrote:

> Hi:
>
> It's a great idea to look at TDWG's standards.
> But to me, these standards seem designed specifically for collections in
> museums and, not surprisingly, biodiversity applications. For the purposes
> of comparative genomics and related fields, these taxonomy concepts seem a
> little bit overkill.

I totally agree that TDWG is overkill. Here is a rough overview of the
data fields in the TDWG standard. I agree that most of them are
irrelevant for applications not specific to taxonomic studies.

* MetaData
* Specimens []
 - id
 - Institution
 - Collection
 - SpecimenItem
* Publications []
* TaxonNames []
 - Rank
 - Canonical Name
 - CanonicalAutorship
 - PublishedIn
 - Year
 - MicroReference
 - Typification
 - SpellingCorrectionOf
 - Basionym
 - BasedOn
 - ConservedAgainst
 - LaterHomonymesOf
 - Sanctioned
 - ReplacementNameFor
 - PublicationStatus
 - ProviderLink
 - ProviderSpecificData
* TaxonConcepts []
 - id
 - Name
 - Rank
 - AccordingTo
 - TaxonRelationships
 - SpecimenCircumscription
 - CharacterCircumscription
 - ProviderLink
* TaxonRelationshipAssertions

Diana

From rozziite at gmail.com Tue May 26 23:02:20 2009
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Tue, 26 May 2009 23:02:20 -0400
Subject: [BioRuby] [Wg-phyloinformatics] bioruby classes for phyloxml support
In-Reply-To: <4A1B996E.2010702@ingm.it>
References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>
<4A14D243.2010101@burnham.org> <86472996-C665-4AF0-A8A9-9E8D4EEB2A79@illinois.edu> <4057d3bf0905241729o69447925hd830404950fd28e1@mail.gmail.com> <4A1B18A8.1070104@burnham.org> <75226546-345B-446D-82D4-FBD5016905D8@duke.edu> <4A1C9EF8.9010505@burnham.org> <4057d3bf0905261951w47d17ef8l70c44d46fbbd933c@mail.gmail.com> <24AC846C-CEAA-4D01-B1CB-A8FC8033172D@berkeleybop.org> <11C49404-46B9-4D47-949B-1B7DE3CBAAF9@duke.edu>
Message-ID: <4A1D8684.10701@burnham.org>

Hi:

> (BTW you will also notice that there is
> nothing even close to the Swissprot "mnemonic" - nobody does this
> except Swissprot - whose chief business is protein annotation, not
> taxonomy - so you may want to consider how much significance you want
> to give this in your object model.)
>

Again, agreed. But in practice this "mnemonic" is very, very useful.
For example, it is a very handy way to create short protein
names/identifiers which still contain human readable species
information but are short enough to be used in a variety of
alignment/phylogeny reconstruction programs.

> As an aside, the name of a taxon really is a proxy for a taxon
> concept, whether that is a species or not, except that typically a
> taxon name isn't given in full (i.e., with author, year, and
> publication) to allow unambiguous identification. That's one of the
> reasons why taxon identifiers are key.

Indeed. And the taxon concept is a "science" in itself....

> BTW NCBI taxon IDs are just one
> kind of taxon IDs. There are also Catalog of Life, ITIS, IPNI, and
> others.
>

Definitely. That's why the phyloxml taxonomy has a typed id.
Like so: 594569.
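
For readers following the taxonomy discussion, the lightweight record
described in this thread could be sketched in plain Ruby roughly as below.
This is only an illustration under stated assumptions: it uses the fields
named in the discussion (scientific name, common name, Swiss-Prot style
mnemonic, rank, and a typed identifier), the names Taxonomy, TaxonId and
label_for are hypothetical and not part of BioRuby, and the identifier
provider and value are placeholders rather than real database entries.

```ruby
# Hypothetical typed identifier: the provider says which database
# the value belongs to (for example NCBI, ITIS, Catalog of Life).
TaxonId = Struct.new(:provider, :value) do
  def to_s
    "#{provider}:#{value}"
  end
end

# Hypothetical lightweight taxonomy record with the fields proposed
# in the thread. Fields that are not supplied default to nil.
Taxonomy = Struct.new(:scientific_name, :common_name, :code, :rank, :id,
                      keyword_init: true) do
  # Builds a short, human-readable sequence label from the mnemonic,
  # falling back to the bare protein name when no code is known.
  def label_for(protein_name)
    code ? "#{protein_name}_#{code}" : protein_name
  end
end

taxon = Taxonomy.new(
  scientific_name: 'Nematostella vectensis',
  common_name: 'starlet sea anemone',
  code: 'NEMVE',
  rank: 'species',
  id: TaxonId.new('example', '12345') # placeholder provider and value
)

puts taxon.label_for('ABC')
puts taxon.id
```

Keeping the identifier as its own small type leaves room for several
providers without changing the taxonomy record itself, which is the point
made above about typed ids.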

--CZ

From czmasek at burnham.org Wed May 27 14:41:41 2009
From: czmasek at burnham.org (Christian M Zmasek)
Date: Wed, 27 May 2009 11:41:41 -0700
Subject: [BioRuby] [Wg-phyloinformatics] bioruby classes for phyloxml support
In-Reply-To: <4057d3bf0905262002w448a7652leb48a40e680b13b8@mail.gmail.com>
References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>
<4A14D243.2010101@burnham.org> <86472996-C665-4AF0-A8A9-9E8D4EEB2A79@illinois.edu> <4057d3bf0905241729o69447925hd830404950fd28e1@mail.gmail.com> <4A1B18A8.1070104@burnham.org> <4A1B996E.2010702@ingm.it> <4057d3bf0905262002w448a7652leb48a40e680b13b8@mail.gmail.com> <4A1D8965.5050404@burnham.org> <4057d3bf0905271949x5b35bc87s17c681fb90d10294@mail.gmail.com> Message-ID: <4A1F15A5.60801@burnham.org> Hi, Diana: Your wiki [at http://wiki.github.com/latvianlinuxgirl/bioruby] looks like a good idea! I made some comments! CZ From rozziite at gmail.com Sat May 30 17:27:52 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Sat, 30 May 2009 17:27:52 -0400 Subject: [BioRuby] GSOC: phyloXML for BioRuby: Mapping sequence Message-ID: <4057d3bf0905301427u2e6cd6c8t759c29566b08f4db@mail.gmail.com> Hi all, So I looked more carefully at the sequence element of phyloXML and it consists of information which cannot be mapped to Bio::Sequence object. I suggest to have a sequence class which closely resembles phyloXML structure and then have a method to extract relevant elements return Bio::Sequence object. What do you think? Here on the left i listed phyloXML sequence tag elements and after the arrow -> the possible corresponding attribute of Bio::Sequence * type ** rna, dna -> Bio::Sequence::NA -> molecule type ** aa -> Bio::Sequence::AA * id_source (string ?) -> id_namespace * id_ref (string ) -> entry_id * symbol (string ?) * accession ** source (example: "UniProtKB") -> ** id (example: "P17304") -> primary_accession * name (string ) * location (string ? 
) * mol_seq (string) -> seq / Bio::Sequence::NA/AA * uri ** desc (string) ** type (string ) ** uri * annotation [] ** ref ** source ** evidence ** type ** desc ** confidence ** property [] ** uri * domain_architecture ** length ** domain [] *** from *** to *** confidence *** id Diana From rozziite at gmail.com Sat May 30 21:36:52 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Sat, 30 May 2009 21:36:52 -0400 Subject: [BioRuby] Bioruby PhyloXML update In-Reply-To: <4A1CEC2D.6010202@ingm.it> References: <4057d3bf0905260517m32167f53j5ce8ca45b43ce655@mail.gmail.com> <4A1CEC2D.6010202@ingm.it> Message-ID: <4057d3bf0905301836k4cd0358btf87497b047a5f9d3@mail.gmail.com> I looked at libxml-jruby and tried to install it, but i get "no such file to load -- java" on the line "require 'java'". I have both jruby and java installed. I also have .../jruby/bin directory in my $PATH. Any suggestions? Diana On Wed, May 27, 2009 at 3:30 AM, Raoul JP Bonnal wrote: > Hi Diana, > a portability issue. > Have you guys verified if LibXML::XML is it running on JRuby ? I don't > think so because it's implemented in C. A solution could be > this http://github.com/mguterl/libxml-jruby/tree/master but today I can't > access to it, probably due to my network problems. > > Some test, 1 year ago - last post -: > > http://www.nabble.com/comparing-xml-parsing-in-JRuby-and-MRI-td16268560.html#a16268560 > > this is the test code - good portability example too - : > > https://svn.concord.org/svn/projects/trunk/common/ruby/xml_benchmarks/xml_benchmarks.rb > > Cheers. > > Diana Jaunzeikare ha scritto: > >> Hi all! >> >> Here is what was done during community bonding period: >> >> - subscribed to mailing lists >> - created a blog >> - got familiar with Git (this was particularly useful: >> http://www.gitcasts.com/posts/railsconf-git-talk ) >> - created GitHub account and forked bioruby project. 
>> - made first commit by adding sample phyloxml data files from >> www.phyloxml.org >> - reviewed BioPerl phyloXML implementation (also >> http://www.bioperl.org/wiki/HOWTO:Trees ) >> - got familiar with libxml-ruby. Wrote simple program using both >> LibXML::XML::Reader and LibXML::XML::SAXParser to parse a simple xml >> file. >> - reviewed Ruby classes - Bio:Tree, Bio::Pathways >> - After discussions in mailing lists it has been agreed to use >> LibXML-ruby library, the LibXML::XML::Reader class >> >> >> This weeks plan: >> >> - Start writing parser using LibXML::XML::Reader. It should return a >> Bio::Tree object. >> - Implement function next_tree to parse and return the next phylogeny. >> - Design Tree::Node object for containing phyloxml elements. >> - Start mapping phyloxml elements to Bio::Tree::Node, start with >> taxonomy, branch_length, scientific_name >> - Write simple unit tests. >> >> > > > From mail at michaelbarton.me.uk Sun May 31 05:45:32 2009 From: mail at michaelbarton.me.uk (Michael Barton) Date: Sun, 31 May 2009 10:45:32 +0100 Subject: [BioRuby] GSOC: phyloXML for BioRuby: Mapping sequence In-Reply-To: <4057d3bf0905301427u2e6cd6c8t759c29566b08f4db@mail.gmail.com> References: <4057d3bf0905301427u2e6cd6c8t759c29566b08f4db@mail.gmail.com> Message-ID: I'm not very familiar with phyloXML, but when you write sequence, do you mean a multiple sequence alignment from which the phylogeny was estimated? If that's the case, there is a MSA class in bioruby which this could be mapped to perhaps? 2009/5/30 Diana Jaunzeikare : > Hi all, > > So I looked more carefully at the sequence element of phyloXML and it > consists of information which cannot be mapped to Bio::Sequence object. I > suggest to have a sequence class which closely resembles phyloXML structure > and then have a method to extract relevant elements return Bio::Sequence > object. ?What do you think? 
>
> Here on the left i listed phyloXML sequence tag elements and after the arrow
> -> the possible corresponding attribute of Bio::Sequence
> * type
> ** rna, dna -> Bio::Sequence::NA -> molecule type
> ** aa -> Bio::Sequence::AA
> * id_source (string ?) -> id_namespace
> * id_ref (string ) -> entry_id
> * symbol (string ?)
> * accession
> ** source (example: "UniProtKB") ->
> ** id (example: "P17304") -> primary_accession
> * name (string )
> * location (string ? )
> * mol_seq (string) -> seq / Bio::Sequence::NA/AA
> * uri
> ** desc (string)
> ** type (string )
> ** uri
>
> * annotation []
> ** ref
> ** source
> ** evidence
> ** type
> ** desc
> ** confidence
> ** property []
> ** uri
>
> * domain_architecture
> ** length
> ** domain []
> *** from
> *** to
> *** confidence
> *** id
>
> Diana
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>

From czmasek at burnham.org Sun May 31 13:17:49 2009
From: czmasek at burnham.org (Christian Zmasek)
Date: Sun, 31 May 2009 10:17:49 -0700
Subject: [BioRuby] GSOC: phyloXML for BioRuby: Mapping sequence
In-Reply-To:
References: <4057d3bf0905301427u2e6cd6c8t759c29566b08f4db@mail.gmail.com>,
Message-ID:

Hi, Michael:

Good point. Actually, it is not specified. It is just a sequence associated with a node.
In my own work, I use it for the original sequence (before the introduction of gaps, and possible trimming of columns, during and after the alignment process).
Hence, I do not think reuse of the MSA class is appropriate.
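
[Editorial sketch] The proposal quoted above (a class that mirrors the phyloXML sequence element, plus a method that extracts only the fields with a Bio::Sequence counterpart) might look roughly like the following. This is only an illustration under assumptions: PhyloSequence, SEQ_KIND_FOR_TYPE and to_biosequence_fields are hypothetical names, not BioRuby API, the type values follow the list in the quoted message, and the method returns a plain Hash instead of building a real Bio::Sequence so that it runs without the bio gem.

```ruby
# Hypothetical names throughout; nothing here is part of BioRuby.
# Which kind of BioRuby sequence each phyloXML type would map to,
# following the mapping listed in the thread.
SEQ_KIND_FOR_TYPE = { 'dna' => :na, 'rna' => :na, 'aa' => :aa }.freeze

# Mirrors the top-level fields of the phyloXML <sequence> element.
PhyloSequence = Struct.new(
  :type, :id_source, :id_ref, :symbol, :accession_source, :accession_id,
  :name, :location, :mol_seq, :uri, :annotations, :domain_architecture,
  keyword_init: true
) do
  # Collect only the fields that the thread maps onto Bio::Sequence
  # attributes; everything else stays on the PhyloSequence object.
  def to_biosequence_fields
    kind = SEQ_KIND_FOR_TYPE[type]
    fields = {
      seq_kind: kind,                                   # Bio::Sequence::NA or ::AA
      molecule_type: (kind == :na ? type : nil),        # only stated for rna/dna
      id_namespace: id_source,
      entry_id: id_ref,
      primary_accession: accession_id,
      seq: mol_seq
    }
    fields.reject { |_key, value| value.nil? }          # drop fields that were not set
  end
end

# Accession values are the examples from the thread; the residues and the
# symbol are placeholders, not real data for that accession.
seq = PhyloSequence.new(type: 'aa', symbol: 'SYM1',
                        accession_source: 'UniProtKB', accession_id: 'P17304',
                        mol_seq: 'MKT')
p seq.to_biosequence_fields
```

Fields without a counterpart (symbol, annotations, domain_architecture and so on) remain available on the mirroring object, which is the point of keeping a separate class.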
Christian ________________________________________ From: bioruby-bounces at lists.open-bio.org [bioruby-bounces at lists.open-bio.org] On Behalf Of Michael Barton [mail at michaelbarton.me.uk] Sent: Sunday, May 31, 2009 2:45 AM To: Phyloinformatics Group; bioruby at lists.open-bio.org Subject: Re: [BioRuby] GSOC: phyloXML for BioRuby: Mapping sequence I'm not very familiar with phyloXML, but when you write sequence, do you mean a multiple sequence alignment from which the phylogeny was estimated? If that's the case, there is a MSA class in bioruby which this could be mapped to perhaps? 2009/5/30 Diana Jaunzeikare : > Hi all, > > So I looked more carefully at the sequence element of phyloXML and it > consists of information which cannot be mapped to Bio::Sequence object. I > suggest to have a sequence class which closely resembles phyloXML structure > and then have a method to extract relevant elements return Bio::Sequence > object. What do you think? > > Here on the left i listed phyloXML sequence tag elements and after the arrow > -> the possible corresponding attribute of Bio::Sequence > * type > ** rna, dna -> Bio::Sequence::NA -> molecule type > ** aa -> Bio::Sequence::AA > * id_source (string ?) -> id_namespace > * id_ref (string ) -> entry_id > * symbol (string ?) > * accession > ** source (example: "UniProtKB") -> > ** id (example: "P17304") -> primary_accession > * name (string ) > * location (string ? 
) > * mol_seq (string) -> seq / Bio::Sequence::NA/AA > * uri > ** desc (string) > ** type (string ) > ** uri > > * annotation [] > ** ref > ** source > ** evidence > ** type > ** desc > ** confidence > ** property [] > ** uri > > * domain_architecture > ** length > ** domain [] > *** from > *** to > *** confidence > *** id > > Diana > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > _______________________________________________ BioRuby mailing list BioRuby at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioruby From mail at michaelbarton.me.uk Sun May 31 16:59:09 2009 From: mail at michaelbarton.me.uk (Michael Barton) Date: Sun, 31 May 2009 21:59:09 +0100 Subject: [BioRuby] GSOC: phyloXML for BioRuby: Mapping sequence In-Reply-To: References: <4057d3bf0905301427u2e6cd6c8t759c29566b08f4db@mail.gmail.com> Message-ID: Hi Christian Would this mean that there is a predicted single ancestor sequence object associated at each node in a phylogenetic tree? You could start at a specific node and traverse to ancestor or descendant nodes, and therefore sequences? When you write original sequence, you mean the multiple sequence alignment? Cheers Mike 2009/5/31 Christian Zmasek : > Hi, Michael: > > Good point. Actually, it is not specified. It is just a sequence associated with a node. > In my own work, I use it for the original sequence (before the introduction of gaps, and possible trimming of columns, during and after the alignment process). > Hence, I do not think reuse of the MSA class is appropriate. 
>
> Christian
>
>
>
>
> ________________________________________
> From: bioruby-bounces at lists.open-bio.org [bioruby-bounces at lists.open-bio.org] On Behalf Of Michael Barton [mail at michaelbarton.me.uk]
> Sent: Sunday, May 31, 2009 2:45 AM
> To: Phyloinformatics Group; bioruby at lists.open-bio.org
> Subject: Re: [BioRuby] GSOC: phyloXML for BioRuby: Mapping sequence
>
> I'm not very familiar with phyloXML, but when you write sequence, do
> you mean a multiple sequence alignment from which the phylogeny was
> estimated? If that's the case, there is a MSA class in bioruby which
> this could be mapped to perhaps?
>
> 2009/5/30 Diana Jaunzeikare :
>> Hi all,
>>
>> So I looked more carefully at the sequence element of phyloXML and it
>> consists of information which cannot be mapped to Bio::Sequence object. I
>> suggest to have a sequence class which closely resembles phyloXML structure
>> and then have a method to extract relevant elements return Bio::Sequence
>> object. What do you think?
>>
>> Here on the left i listed phyloXML sequence tag elements and after the arrow
>> -> the possible corresponding attribute of Bio::Sequence
>> * type
>> ** rna, dna -> Bio::Sequence::NA -> molecule type
>> ** aa -> Bio::Sequence::AA
>> * id_source (string ?) -> id_namespace
>> * id_ref (string ) -> entry_id
>> * symbol (string ?)
>> * accession
>> ** source (example: "UniProtKB") ->
>> ** id (example: "P17304") -> primary_accession
>> * name (string )
>> * location (string ?
) >> * mol_seq (string) -> seq / Bio::Sequence::NA/AA >> * uri >> ** desc (string) >> ** type (string ) >> ** uri >> >> * annotation [] >> ** ref >> ** source >> ** evidence >> ** type >> ** desc >> ** confidence >> ** property [] >> ** uri >> >> * domain_architecture >> ** length >> ** domain [] >> *** from >> *** to >> *** confidence >> *** id >> >> Diana >> _______________________________________________ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby >> > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From czmasek at burnham.org Sun May 31 23:46:14 2009 From: czmasek at burnham.org (Christian Zmasek) Date: Sun, 31 May 2009 20:46:14 -0700 Subject: [BioRuby] GSOC: phyloXML for BioRuby: Mapping sequence In-Reply-To: References: <4057d3bf0905301427u2e6cd6c8t759c29566b08f4db@mail.gmail.com> , Message-ID: Hi, Mike: In general only external nodes would have a sequence associated with them. In the case of ancestral sequence reconstruction attempts, internal nodes might have sequences, too, though. Please remember to none of the elements in phyloXML are mandatory. With 'original' sequence I meant the sequence prior to alignment. Cheers, Christian ________________________________________ From: Michael Barton [mail at michaelbarton.me.uk] Sent: Sunday, May 31, 2009 1:59 PM To: Christian Zmasek Cc: Phyloinformatics Group; bioruby at lists.open-bio.org; rozziite at gmail.com Subject: Re: [BioRuby] GSOC: phyloXML for BioRuby: Mapping sequence Hi Christian Would this mean that there is a predicted single ancestor sequence object associated at each node in a phylogenetic tree? You could start at a specific node and traverse to ancestor or descendant nodes, and therefore sequences? When you write original sequence, you mean the multiple sequence alignment? 
Cheers Mike 2009/5/31 Christian Zmasek : > Hi, Michael: > > Good point. Actually, it is not specified. It is just a sequence associated with a node. > In my own work, I use it for the original sequence (before the introduction of gaps, and possible trimming of columns, during and after the alignment process). > Hence, I do not think reuse of the MSA class is appropriate. > > Christian > > > > > ________________________________________ > From: bioruby-bounces at lists.open-bio.org [bioruby-bounces at lists.open-bio.org] On Behalf Of Michael Barton [mail at michaelbarton.me.uk] > Sent: Sunday, May 31, 2009 2:45 AM > To: Phyloinformatics Group; bioruby at lists.open-bio.org > Subject: Re: [BioRuby] GSOC: phyloXML for BioRuby: Mapping sequence > > I'm not very familiar with phyloXML, but when you write sequence, do > you mean a multiple sequence alignment from which the phylogeny was > estimated? If that's the case, there is a MSA class in bioruby which > this could be mapped to perhaps? > > 2009/5/30 Diana Jaunzeikare : >> Hi all, >> >> So I looked more carefully at the sequence element of phyloXML and it >> consists of information which cannot be mapped to Bio::Sequence object. I >> suggest to have a sequence class which closely resembles phyloXML structure >> and then have a method to extract relevant elements return Bio::Sequence >> object. What do you think? >> >> Here on the left i listed phyloXML sequence tag elements and after the arrow >> -> the possible corresponding attribute of Bio::Sequence >> * type >> ** rna, dna -> Bio::Sequence::NA -> molecule type >> ** aa -> Bio::Sequence::AA >> * id_source (string ?) -> id_namespace >> * id_ref (string ) -> entry_id >> * symbol (string ?) >> * accession >> ** source (example: "UniProtKB") -> >> ** id (example: "P17304") -> primary_accession >> * name (string ) >> * location (string ? 
) >> * mol_seq (string) -> seq / Bio::Sequence::NA/AA >> * uri >> ** desc (string) >> ** type (string ) >> ** uri >> >> * annotation [] >> ** ref >> ** source >> ** evidence >> ** type >> ** desc >> ** confidence >> ** property [] >> ** uri >> >> * domain_architecture >> ** length >> ** domain [] >> *** from >> *** to >> *** confidence >> *** id >> >> Diana >> _______________________________________________ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby >> > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From rozziite at gmail.com Fri May 1 01:37:07 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Thu, 30 Apr 2009 21:37:07 -0400 Subject: [BioRuby] Google Summer of Code Intro: PhyloXML support in BioRuby Message-ID: <4057d3bf0904301837r302bfb2buaa8a644c448267fa@mail.gmail.com> Hi all, I would like to introduce myself. My name is Diana and I have been accepted for Google Summer of Code to implement PhyloXML support for BioRuby. I am a junior at Smith College double majoring in Computer Science and Math. I am interested in Bioinformatics, especially protein structure based phylogenetics. Here is the project abstract: === Phylogenetic trees are used in important applications, including phylogenomics, phylogeography, gene function prediction, cladistics and the study of molecular evolution. In order to foster successful analysis, exchange, storage and reuse of phylogenetic trees and associated data, the phyloXML format was developed. It can store all necessary information about the phylogenetic tree, like clade, sequence, name and distance. The goal of this project is to implement support for phyloXML in BioRuby. === Here is wiki: https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby Any comments are welcome! 
Cheers, Diana From ngoto at gen-info.osaka-u.ac.jp Wed May 6 07:56:48 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 6 May 2009 16:56:48 +0900 Subject: [BioRuby] Made a change in format10.rb In-Reply-To: <49B4B4EA.6060401@bioreg.kyushu-u.ac.jp> References: <49B4B4EA.6060401@bioreg.kyushu-u.ac.jp> Message-ID: <20090506075650.37E4A1CBC4EB@idnmail.gen-info.osaka-u.ac.jp> Hi, Thank you for reporting a bug. I've changed codes to support results containing two or more query sequences. http://github.com/bioruby/bioruby/commit/e57349594427ad1a51979c9d4e0c3efcffd160c2 http://github.com/bioruby/bioruby/commit/3d3edc44127f4fd97abcc17a859e36623facdc7c Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Mon, 09 Mar 2009 15:19:22 +0900 Fredrik Johansson wrote: > I found that Bioruby can't handle large amounts of output from Fasta. So > I made this change to > /usr/lib/ruby/gems/1.8/gems/bio-1.3.0/lib/bio/appl/fasta/format10.rb : > > 6,7c6,8 > < data.sub!(/(.*)\n\n>>>/m, '') > < @list = "The best scores are" + $1 > --- > > border = data.index("\n\n>>>") > > @list = "The best scores are" + data[0...border] > > data = data[border+5..-1] > > > The old code reported an error when the output was huge: > RegexpError: Stack overflow in regexp matcher: /(.*)\n\n>>>/m > > So I thought that maybe these lines of code should be changed in Bioruby. > > Regards, > Fredrik Johansson > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From andrew.j.grimm at gmail.com Thu May 7 07:31:26 2009 From: andrew.j.grimm at gmail.com (Andrew Grimm) Date: Thu, 7 May 2009 17:31:26 +1000 Subject: [BioRuby] Non deprecated way of converting a naseq to fasta? Message-ID: The documentation for Bio::Sequence::Common talks about #to_fasta being deprecated, in favor of Bio::Sequence #output instead. 
#output seems to work for Bio::Sequence objects, but not for Bio::Sequence::NA or Bio::Sequence::AA objects.

I can happily create a new FastaFormat object instead, but I'm wondering if I'm doing it the right way.

Also, the wiki is still suggesting using to_fasta in some of its code samples.

Thanks,

Andrew Grimm

From ngoto at gen-info.osaka-u.ac.jp Wed May 13 14:55:12 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 13 May 2009 23:55:12 +0900
Subject: [BioRuby] Non deprecated way of converting a naseq to fasta?
In-Reply-To:
References:
Message-ID: <20090513145512.D17251CBC3CB@idnmail.gen-info.osaka-u.ac.jp>

On Thu, 7 May 2009 17:31:26 +1000
Andrew Grimm wrote:

> The documentation for Bio::Sequence::Common talks about #to_fasta being
> deprecated, in favor of Bio::Sequence #output instead. #output seems to work
> for Bio::Sequence objects, but not for Bio::Sequence::NA or
> Bio::Sequence::AA objects.

Because the method "to_fasta" is widely and frequently used, and
alternative methods are not fully implemented, you can still use
"to_fasta".

The reason "to_fasta" is planned to be deprecated is that names of the
form to_XXX are usually used for data class conversion in Ruby, but the
current behavior of "to_fasta" is to output a formatted string.

The method "to_fasta" will be deprecated in a future release, after
alternative methods are fully ready. In addition, for smooth migration,
"to_fasta" may exist as an alias (or a shortcut) of the alternative
methods for a while.

> I can happily create a new FastaFormat object instead, but I'm wondering if
> I'm doing it the right way.

To create a new Bio::Sequence object is the best way.
Bio::FastaFormat is a parser class for reading formatted strings, and is
not intended to generate formatted strings.

> Also, the wiki is still suggesting using to_fasta in some of its code
> samples.

Those will be rewritten in the future.
> > Thanks, > > Andrew Grimm > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby Thank you. Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From pjotr.public14 at thebird.nl Sat May 16 07:57:34 2009 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Sat, 16 May 2009 09:57:34 +0200 Subject: [BioRuby] Google Summer of Code Intro: PhyloXML support in BioRuby In-Reply-To: <4057d3bf0904301837r302bfb2buaa8a644c448267fa@mail.gmail.com> References: <4057d3bf0904301837r302bfb2buaa8a644c448267fa@mail.gmail.com> Message-ID: <20090516075734.GA27669@thebird.nl> Hi Diana, I think your contribution will be very important for BioRuby, and as you are implementing together with others who are adding support for BioPerl and BioPython I have great faith in you getting results. Thank you for taking an interest. We are looking forward to your contribution. If you have any questions on BioRuby ideas and options, please post to this list. Note that responses can be slow, but everyone is tracking. Pj. On Thu, Apr 30, 2009 at 09:37:07PM -0400, Diana Jaunzeikare wrote: > Hi all, > > I would like to introduce myself. My name is Diana and I have been accepted > for Google Summer of Code to implement PhyloXML support for BioRuby. I am a > junior at Smith College double majoring in Computer Science and Math. I am > interested in Bioinformatics, especially protein structure based > phylogenetics. > > Here is the project abstract: > > === > Phylogenetic trees are used in important applications, including > phylogenomics, phylogeography, gene function prediction, cladistics and the > study of molecular evolution. In order to foster successful analysis, > exchange, storage and reuse of phylogenetic trees and associated data, the > phyloXML format was developed. It can store all necessary information about > the phylogenetic tree, like clade, sequence, name and distance. 
The goal of > this project is to implement support for phyloXML in BioRuby. > === > > Here is wiki: > https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby > > > Any comments are welcome! > > Cheers, > > Diana > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From georgkam at gmail.com Sat May 16 12:20:59 2009 From: georgkam at gmail.com (George Githinji) Date: Sat, 16 May 2009 15:20:59 +0300 Subject: [BioRuby] Fwd: [BioSQL-l] BioSQL at BOSC 2009? In-Reply-To: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com> References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com> Message-ID: <55915f820905160520p1204f35bpbc63b6cabd936c53@mail.gmail.com> ---------- Forwarded message ---------- From: Peter Date: Sat, May 16, 2009 at 3:12 PM Subject: [BioSQL-l] BioSQL at BOSC 2009? To: biosql-l Cc: Jason Stajich Hi, Will any of the key BioSQL people from the Bio* projects be at BOSC (and ISMB) this year? http://open-bio.org/wiki/BOSC_2009 There will be several people from Biopython there this year, including me and Brad Chapman who are both familiar with BioSQL. This would be a nice opportunity for further improving BioSQL compatibility between the Bio* projects - something that has been suggested in the past, e.g. http://lists.open-bio.org/pipermail/biopython/2007-November/003893.html http://lists.open-bio.org/pipermail/biojava-l/2007-November/006037.html I don't follow the BioPerl, BioJava or BioRuby mailing lists - and I doubt many of their developers follow the Biopython mailing lists. So, rather than having any BioSQL compatibility discussions split over individual Bio* project specific mailing lists, it seems using the BioSQL mailing list is most appropriate. I have CC'd a few key people just in case they are not on the BioSQL mailing list, if I have missed anyone please forward this to them and ask them to sign up. 
Thanks,

Peter
_______________________________________________
BioSQL-l mailing list
BioSQL-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biosql-l

--
---------------
Sincerely
George
Skype: george_g2
Blog: http://biorelated.wordpress.com/

From bonnalraoul at ingm.it Mon May 18 08:13:47 2009
From: bonnalraoul at ingm.it (Raoul JP Bonnal)
Date: Mon, 18 May 2009 10:13:47 +0200
Subject: [BioRuby] Meet Italian Developers
Message-ID: <4A1118BB.6070205@ingm.it>

Hi Guys,
I'd like to meet Italian BioRuby developers.
Do you think it would be possible to organize an informal meeting in Milan?
Then I'd like to know how many BioRuby devs are following the list.

--
Ra

From rozziite at gmail.com Tue May 19 21:07:59 2009
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Tue, 19 May 2009 17:07:59 -0400
Subject: [BioRuby] Update on phyloXML support for BioRuby project
Message-ID: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com>

Hi all,

I want to update you on my thoughts about this project and I have some
questions.

So, I think we have reached consensus that the best choice is the
libxml2-ruby SAX based XML parser.

Since BioRuby has a Tree class ( http://bioruby.org/rdoc/) it seems
logical that the parser should return a Tree class object. By using a
SAX parser we avoid the problem of having the whole XML file in memory,
but the phylogenetic trees can still be very large, and it might be too
much to store the whole thing as a tree object in memory. This could be
partly remedied by having a function next_tree (or next_phylogeny) which
would read one tree at a time if the phyloXML file has several of them
(this is similar to the BioPerl implementation). I don't think the
children nodes can be done in similar fashion. Since SAX parses
sequentially, to get the next node (child one level down) in the tree,
the whole subtree has to be parsed (in order to wait for the event for
the end tag of that child), thus losing on speed. Any thoughts on this?
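
[Editorial sketch] The one-tree-at-a-time idea described above can be illustrated with a small pull-style reader. Note the substitution: this sketch uses REXML's pull parser, which ships with Ruby, rather than the libxml-ruby classes discussed in the thread, so that it runs without extra gems; code written against libxml-ruby would look different. PhyloxmlPullSketch, Clade and Phylogeny are hypothetical names, not Bio::Tree or any BioRuby class, and only name and branch_length are read.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical minimal containers; Bio::Tree is not used here.
Clade = Struct.new(:name, :branch_length, :children)
Phylogeny = Struct.new(:name, :root)

class PhyloxmlPullSketch
  def initialize(source)
    @parser = REXML::Parsers::PullParser.new(source)
  end

  # Parse and return the next <phylogeny>, or nil when none is left.
  # Only the events belonging to that one phylogeny are consumed.
  def next_tree
    while @parser.has_next?
      event = @parser.pull
      return read_phylogeny if event.start_element? && event[0] == 'phylogeny'
    end
    nil
  end

  private

  def read_phylogeny
    tree = Phylogeny.new(nil, nil)
    open_clades = []   # clades whose end tag has not been seen yet
    path = []          # names of the open elements below <phylogeny>
    while @parser.has_next?
      event = @parser.pull
      if event.start_element?
        path.push(event[0])
        open_clade(tree, open_clades) if event[0] == 'clade'
      elsif event.text?
        store_text(tree, open_clades, path, event[0].strip)
      elsif event.end_element?
        return tree if event[0] == 'phylogeny'
        open_clades.pop if event[0] == 'clade'
        path.pop
      end
    end
    tree
  end

  # Attach a new clade to its parent (or make it the root) and open it.
  def open_clade(tree, open_clades)
    clade = Clade.new(nil, nil, [])
    if open_clades.empty?
      tree.root = clade
    else
      open_clades.last.children << clade
    end
    open_clades.push(clade)
  end

  # Keep only <name> and <branch_length> that sit directly under
  # <phylogeny> or <clade>; all other text is ignored in this sketch.
  def store_text(tree, open_clades, path, text)
    return if text.empty?
    element, parent = path[-1], path[-2]
    if element == 'name' && parent.nil?
      tree.name = text
    elsif element == 'name' && parent == 'clade'
      open_clades.last.name = text
    elsif element == 'branch_length' && parent == 'clade'
      open_clades.last.branch_length = text.to_f
    end
  end
end

# Made-up example document with two phylogenies.
xml = <<~XML.strip
  <phyloxml>
    <phylogeny rooted="true">
      <name>first</name>
      <clade>
        <clade><name>A</name><branch_length>0.1</branch_length></clade>
        <clade><name>B</name><branch_length>0.2</branch_length></clade>
      </clade>
    </phylogeny>
    <phylogeny rooted="true">
      <name>second</name>
      <clade><name>C</name></clade>
    </phylogeny>
  </phyloxml>
XML

reader = PhyloxmlPullSketch.new(xml)
names = []
while (tree = reader.next_tree)
  names << tree.name
  puts "#{tree.name}: #{tree.root.children.size} child clade(s) under the root"
end
```

Each call holds one phylogeny in memory at a time, but within a phylogeny every clade is still read through to its end tag, which is the limitation on child nodes raised in the message above.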
Also the Tree class should be extended and added method output_phyloXML since it has methods output_newick, output_nhx. I think in order to understand what should be returned after parsing it would be useful to know how people use phylogenetic tree data. Here are some I could come up, * visualize / print * calculate total branch length of a tree * query info about specific nodes * create consensus trees Any others? I am a little confused about the require statements in BioRuby classes. It looks like bio/tree.rb should hold a general class, but it requires bio/db/newick.rb, but this file in turn requires bio/tree.rb. Thanks, Diana Project Page: https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby From czmasek at burnham.org Tue May 19 21:54:18 2009 From: czmasek at burnham.org (Christian M Zmasek) Date: Tue, 19 May 2009 14:54:18 -0700 Subject: [BioRuby] Update on phyloXML support for BioRuby project In-Reply-To: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> Message-ID: <4A132A8A.70102@burnham.org> Hi, Diana: I think it is a good idea to have the parser return one tree at a time, as opposed to returning a list of trees. On the other hand, the same does not apply to nodes. I think it is perfectly acceptable to expect to have enough memory to keep at least one tree in memory (a good target size might be a binary tree with ten-thousand external nodes and 200 bytes of annotation per node, which according to my rough calculations would require less than 5MB). For your tree use cases, important ones to add are: * iteration over all nodes * retrieval/finding of specific nodes according to some criterion (e.g. find all nodes for which the species is "E. coli") * tree reconciliation (e.g. 
compare a gene tree to a species tree, in order to determine duplications on the gene tree) In any case, all these applications/algorithms will be most time efficient and easiest to implement with trees which are completely in memory. Re. "I am a little confused about the require statements in BioRuby classes. It looks like bio/tree.rb should hold a general class, but it requires bio/db/newick.rb, but this file in turn requires bio/tree.rb." I am not clear about your question about this. ;) Christian Diana Jaunzeikare wrote: > Hi all, > > I want to update you on my thoughts about this project and I have some > questions. > > So, I think we have reached consensus that the best choice is > libxml2-ruby SAX based XML parser. > > Since BioRuby has Tree class ( http://bioruby.org/rdoc/) it seems > logical that the parser should return a Tree class object. By using > SAX parser we avoid the problem of having whole XML file in memory, > but still the phylogenetic trees can be very large, and it might be > too much to store whole thing as a tree object in memory. This could > be a little remediated by having a function next_tree (or > next_phylogeny) which would read one tree at a time if phyloXML file > has several of them (this is similar to BioPerl implementation). I > don't think the children nodes can be done in similar fashion. Since > SAX parses sequentially, to get next node (child one level down) in > the tree, whole subtree has to be parsed (in order to wait while there > is event for the end tag of that child), thus loosing on speed. Any > thoughts on this? > > Also the Tree class should be extended and added method > output_phyloXML since it has methods output_newick, output_nhx. > > I think in order to understand what should be returned after parsing > it would be useful to know how people use phylogenetic tree data. 
Here > are some I could come up, > * visualize / print > * calculate total branch length of a tree > * query info about specific nodes > * create consensus trees > Any others? > > I am a little confused about the require statements in BioRuby > classes. It looks like bio/tree.rb should hold a general class, but it > requires bio/db/newick.rb, but this file in turn requires bio/tree.rb. > > Thanks, > > Diana > > Project Page: > https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby > From ngoto at gen-info.osaka-u.ac.jp Wed May 20 06:09:17 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Wed, 20 May 2009 15:09:17 +0900 Subject: [BioRuby] Update on phyloXML support for BioRuby project In-Reply-To: <4A132A8A.70102@burnham.org> References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org> Message-ID: <20090520060918.B07C71CBC3F4@idnmail.gen-info.osaka-u.ac.jp> Hi all, On Tue, 19 May 2009 17:07:59 -0400 Diana Jaunzeikare wrote: > So, I think we have reached consensus that the best choice is libxml2-ruby > SAX based XML parser. In libxml2-ruby, I think LibXML::XML::Reader is the best choice, because it is memory efficient than DOM and its API is simpler than that of SAX. LibXML::XML::SAXParser is not bad, but I wonder if the SAX's callback based API makes our codes too complex and difficult to maintain. > Since BioRuby has Tree class ( http://bioruby.org/rdoc/) it seems > logical that the parser should return a Tree class object. By using > SAX parser we avoid the problem of having whole XML file in memory, I think so. Alternative way is to return an object of wrapper class which mimics Bio::Tree's API. However, it may be too hard to implement such class, and data type conversion from/to Bio::Tree is still needed even in this case. So, I think to return a Bio::Tree object is good. > I am a little confused about the require statements in BioRuby classes. 
It > looks like bio/tree.rb should hold a general class, but it requires > bio/db/newick.rb, but this file in turn requires bio/tree.rb. The only reason why bio/tree.rb requires bio/db/newick.rb is for the Newick and NHX output of the tree. The codes will be refactored in the future. On Tue, 19 May 2009 14:54:18 -0700 Christian M Zmasek wrote: > Hi, Diana: > > I think it is a good idea to have the parser return one tree at a time, > as opposed to returning a list of trees. I think so. > On the other hand, the same does not apply to nodes. I think it is > perfectly acceptable to expect to have enough memory to keep at least > one tree in memory (a good target size might be a binary tree with > ten-thousand external nodes and 200 bytes of annotation per node, which > according to my rough calculations would require less than 5MB). > > For your tree use cases, important ones to add are: > * iteration over all nodes > * retrieval/finding of specific nodes according to some criterion (e.g. > find all nodes for which the species is "E. coli") > * tree reconciliation (e.g. compare a gene tree to a species tree, in > order to determine duplications on the gene tree) > > In any case, all these applications/algorithms will be most time > efficient and easiest to implement with trees which are completely in > memory. In addition, it is easy to implement manipulation of trees (adding/deleting nodes and edges, etc.). 
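
[Editorial sketch] The use cases listed in this exchange (iterating over all nodes, finding nodes by a criterion, summing branch lengths) are straightforward once a whole tree is in memory. The sketch below uses a hypothetical TreeNode struct with made-up data, not Bio::Tree. Its last lines redo the rough memory estimate quoted above, under the assumption that a rooted binary tree with n external nodes has 2n - 1 nodes in total; it counts annotation bytes only and ignores Ruby object overhead.

```ruby
# Hypothetical node type for illustration; Bio::Tree is not used here.
TreeNode = Struct.new(:name, :species, :branch_length, :children) do
  # Yield this node and then every descendant (pre-order traversal).
  def each_node(&block)
    return enum_for(:each_node) unless block_given?
    yield self
    children.each { |child| child.each_node(&block) }
  end
end

# Convenience constructor for an external node.
def leaf(name, species, length)
  TreeNode.new(name, species, length, [])
end

# Made-up example tree with three external nodes.
root = TreeNode.new('root', nil, 0.0, [
  TreeNode.new('n1', nil, 0.25, [
    leaf('g1', 'E. coli', 0.5),
    leaf('g2', 'B. subtilis', 0.75)
  ]),
  leaf('g3', 'E. coli', 1.0)
])

# Iteration over all nodes.
all_names = root.each_node.map(&:name)

# Finding nodes according to a criterion (species is "E. coli").
ecoli_names = root.each_node.select { |node| node.species == 'E. coli' }.map(&:name)

# Total branch length of the tree.
total_length = root.each_node.sum(&:branch_length)

# Rough memory estimate from the thread: 10,000 external nodes,
# 200 bytes of annotation per node (annotation payload only).
external_nodes = 10_000
total_nodes = 2 * external_nodes - 1
annotation_bytes = total_nodes * 200

puts all_names.join(', ')
puts ecoli_names.join(', ')
puts total_length
puts annotation_bytes
```

With these assumptions the annotation payload comes to 19,999 x 200 = 3,999,800 bytes, consistent with the "less than 5MB" figure quoted above.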
Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From rozziite at gmail.com Wed May 20 14:51:26 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Wed, 20 May 2009 10:51:26 -0400 Subject: [BioRuby] Update on phyloXML support for BioRuby project In-Reply-To: <20090520060918.B07C71CBC3F4@idnmail.gen-info.osaka-u.ac.jp> References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org> <20090520060918.B07C71CBC3F4@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <4057d3bf0905200751u739b37cdqe1e7d275fb82d08c@mail.gmail.com> On Wed, May 20, 2009 at 2:09 AM, Naohisa GOTO wrote: > Hi all, > > On Tue, 19 May 2009 17:07:59 -0400 > Diana Jaunzeikare wrote: > > > So, I think we have reached consensus that the best choice is > libxml2-ruby > > SAX based XML parser. > > In libxml2-ruby, I think LibXML::XML::Reader is the best choice, > because it is memory efficient than DOM and its API is simpler > than that of SAX. LibXML::XML::SAXParser is not bad, but I wonder > if the SAX's callback based API makes our codes too complex and > difficult to maintain. > I wrote sample code using both LibXML::XML::Reader and LibXML::XML::SAXParser and I agree that SAX's callback based API might get very complex and hard to maintain. > > > Since BioRuby has Tree class ( http://bioruby.org/rdoc/) it seems > > logical that the parser should return a Tree class object. By using > > SAX parser we avoid the problem of having whole XML file in memory, > > I think so. > Alternative way is to return an object of wrapper class which mimics > Bio::Tree's API. However, it may be too hard to implement such class, > and data type conversion from/to Bio::Tree is still needed even in > this case. So, I think to return a Bio::Tree object is good. > > > I am a little confused about the require statements in BioRuby classes. 
> It > > looks like bio/tree.rb should hold a general class, but it requires > > bio/db/newick.rb, but this file in turn requires bio/tree.rb. > > The only reason why bio/tree.rb requires bio/db/newick.rb is > for the Newick and NHX output of the tree. The codes will > be refactored in the future. > > On Tue, 19 May 2009 14:54:18 -0700 > Christian M Zmasek wrote: > > > Hi, Diana: > > > > I think it is a good idea to have the parser return one tree at a time, > > as opposed to returning a list of trees. > > I think so. > > > On the other hand, the same does not apply to nodes. I think it is > > perfectly acceptable to expect to have enough memory to keep at least > > one tree in memory (a good target size might be a binary tree with > > ten-thousand external nodes and 200 bytes of annotation per node, which > > according to my rough calculations would require less than 5MB). > > > > For your tree use cases, important ones to add are: > > * iteration over all nodes > > * retrieval/finding of specific nodes according to some criterion (e.g. > > find all nodes for which the species is "E. coli") > > * tree reconciliation (e.g. compare a gene tree to a species tree, in > > order to determine duplications on the gene tree) > > > > In any case, all these applications/algorithms will be most time > > efficient and easiest to implement with trees which are completely in > > memory. > > In addition, it is easy to implement manipulation of trees > (adding/deleting nodes and edges, etc.). > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > From rozziite at gmail.com Thu May 21 02:29:40 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Wed, 20 May 2009 22:29:40 -0400 Subject: [BioRuby] [Wg-phyloinformatics] Update on phyloXML support for BioRubyproject In-Reply-To: <1E5916D0F3C84F978A4A62DD89275053@NewLife> References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>

<1E5916D0F3C84F978A4A62DD89275053@NewLife> Message-ID: <4057d3bf0905201929y658cb924r36c2d48f62176ed4@mail.gmail.com> Both NeXML and phyloXML are XML formats for holding information about phylogenetic trees. Both seem to be fairly new. What's the difference? Isn't there an ultimate goal to have one universal format for phylogenetic data exchange? If so, which of these two formats would be better suited to it, or do they serve different purposes (as NeXML is based on the NEXUS format)? My other question is about the Perl phylogenetics-related packages (warning: I am not at all familiar with the BioPerl classes). Bio::TreeIO and Bio::Phylo::IO seem to me to be doing the same task. What are the main differences between them? If there are no fundamental differences, why are there two classes that do the same thing? Diana On Wed, May 20, 2009 at 11:13 AM, Mark A. Jensen wrote: > Should I plug my "web service blog": > > https://www.nescent.org/wg_evoinfo/User_talk:Mjensen#.22Streaming.22_NeXML.3F? > Some possible useful ideas there. I've also written a flexible DOM > implementation for Bio::Phylo, that could be > generalized/cannibalized/stolen and > used for iterating/streaming XML formats. It provides a standard interface > for > writing dom elements, but allows for easily swapping in different XML > handling > packages (in a BioPerly '-format => libxml' way). All in Perl I'm afraid > (or > proud?) to say, but it might provide ideas too. (I sent it to Rutger, but > he > hasn't passed judgment yet, so it's currently available 'by request to the > author'.) 
> cheers; excellent work Diana- MAJ > > ----- Original Message ----- > From: "Chris Fields" > To: "Hilmar Lapp" > Cc: ; "Phyloinformatics Group" > > Sent: Wednesday, May 20, 2009 10:52 AM > Subject: Re: [Wg-phyloinformatics] Update on phyloXML support for > BioRubyproject > > > > > > On May 20, 2009, at 8:22 AM, Hilmar Lapp wrote: > > > >> > >> On May 19, 2009, at 5:54 PM, Christian M Zmasek wrote: > >> > >>> I think it is perfectly acceptable to expect to have enough memory > >>> to keep at least > >>> one tree in memory > >> > >> Sounds like a good and perfectly reasonable starting point to me too. > >> It's also the way other toolkits (such as BioPerl) work. > >> > >> Having said that, I don't find it inconceivable that we may be working > >> with trees in the near future that don't fit into memory for a 1GB RAM > >> machine if they are richly decorated (which is something that phyloXML > >> wants to enable, isn't it?). Solving that to me though seems to be > >> question of writing an appropriate Tree implementation that happens to > >> store most of the data on disk rather than in memory, and not an issue > >> for how to write a parser. Ideally though, the parser uses a factory > >> for creating the (tree and/or node) objects, so that later it can be > >> made to use an on-disk Tree implementation simply by passing it > >> another factory. I.e., ideally the parser would not assume and hard- > >> code the Tree implementation class. > >> > >> Just my $0.02. > >> > >> -hilmar > > > > This could be implemented in a lazy way or using lightweight objects. > > The Tree object itself contains the XML parser or a reference thereof > > (probably LibXML Reader-based) and creates the relevant nodes as > > needed. The only thing needed would be some light parsing to indicate > > start-end file points. > > > > It's tricky with re: to a number of aspects, but it can be done. For > > instance, if one wanted to modify the created nodes (i.e. 
if the nodes > > are mutable), or creating a generic Lazy set of classes capable of > > dealing with multiple formats. > > > > Just in case anyone's wondering, I have been thinking along these > > lines for a while re: BioPerl, Bio::Seq, and very large files... ;> > > > > chris > > _______________________________________________ > > Wg-phyloinformatics mailing list > > Wg-phyloinformatics at nescent.org > > https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics > > > > > > _______________________________________________ > Wg-phyloinformatics mailing list > Wg-phyloinformatics at nescent.org > https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics > From czmasek at burnham.org Thu May 21 04:02:11 2009 From: czmasek at burnham.org (Christian M Zmasek) Date: Wed, 20 May 2009 21:02:11 -0700 Subject: [BioRuby] [Wg-phyloinformatics] Update on phyloXML support for BioRuby project In-Reply-To: References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>

Message-ID: <4A14D243.2010101@burnham.org> Hi: Thanks for the detailed replies by Hilmar and Chris! I think it is a very good idea to keep such very large trees in mind, and possibly implement a solution which only loads requested nodes into memory (as described by Hilmar and Chris) if there is enough time left at the end of the project. Re "It's tricky with re: to a number of aspects, but it can be done. For instance, if one wanted to modify the created nodes (i.e. if the nodes are mutable), or creating a generic Lazy set of classes capable of dealing with multiple formats." How would you do post-order or pre-order iteration of nodes? Wouldn't you have to back and forth in the file? CZ Chris Fields wrote: > On May 20, 2009, at 8:22 AM, Hilmar Lapp wrote: > > >> On May 19, 2009, at 5:54 PM, Christian M Zmasek wrote: >> >> >>> I think it is perfectly acceptable to expect to have enough memory >>> to keep at least >>> one tree in memory >>> >> Sounds like a good and perfectly reasonable starting point to me too. >> It's also the way other toolkits (such as BioPerl) work. >> >> Having said that, I don't find it inconceivable that we may be working >> with trees in the near future that don't fit into memory for a 1GB RAM >> machine if they are richly decorated (which is something that phyloXML >> wants to enable, isn't it?). Solving that to me though seems to be >> question of writing an appropriate Tree implementation that happens to >> store most of the data on disk rather than in memory, and not an issue >> for how to write a parser. Ideally though, the parser uses a factory >> for creating the (tree and/or node) objects, so that later it can be >> made to use an on-disk Tree implementation simply by passing it >> another factory. I.e., ideally the parser would not assume and hard- >> code the Tree implementation class. >> >> Just my $0.02. >> >> -hilmar >> > > This could be implemented in a lazy way or using lightweight objects. 
> The Tree object itself contains the XML parser or a reference thereof
> (probably LibXML Reader-based) and creates the relevant nodes as
> needed. The only thing needed would be some light parsing to indicate
> start-end file points.
>
> It's tricky with re: to a number of aspects, but it can be done. For
> instance, if one wanted to modify the created nodes (i.e. if the nodes
> are mutable), or creating a generic Lazy set of classes capable of
> dealing with multiple formats.
>
> Just in case anyone's wondering, I have been thinking along these
> lines for a while re: BioPerl, Bio::Seq, and very large files... ;>
>
> chris
>

From rozziite at gmail.com Sat May 23 20:12:31 2009
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Sat, 23 May 2009 16:12:31 -0400
Subject: [BioRuby] GSOC: Unit testing in BioRuby
Message-ID: <4057d3bf0905231312j13d44bf4s76d9f97a534fce79@mail.gmail.com>

Hi,

How does unit testing in BioRuby work?

I created a new file, /lib/bio/db/phyloxml.rb, and test/unit/bio/db/test_phyloxml.rb. I also put some sample XML files in the /test/unit/data/phyloxml/ directory.

How can I access the data set in the data directory from the test_phyloxml.rb file?

Can I give a relative path to the data file? For example, "../../../data/phyloxml/phyloxml_examples.xml"

I saw that other test files have this code:

  require 'pathname'
  libpath = Pathname.new(File.join(File.dirname(__FILE__), ['..'] * 3,
                                   'lib')).cleanpath.to_s
  $:.unshift(libpath) unless $:.include?(libpath)

Is this code for that purpose? I am not really sure what this piece of code means.

Thanks,

Diana

From mail at michaelbarton.me.uk Sat May 23 23:00:26 2009
From: mail at michaelbarton.me.uk (Michael Barton)
Date: Sun, 24 May 2009 00:00:26 +0100
Subject: [BioRuby] GSOC: Unit testing in BioRuby
In-Reply-To: <4057d3bf0905231312j13d44bf4s76d9f97a534fce79@mail.gmail.com>
References: <4057d3bf0905231312j13d44bf4s76d9f97a534fce79@mail.gmail.com>
Message-ID:

Hi Diana.
You could try adding a module along these lines to your test file, inside the TestPhyloXML module.

  module TestPhyloXMLData
    # Relies on require 'pathname', which the usual test header already does.
    bioruby_root = Pathname.new(File.join(File.dirname(__FILE__), ['..'] * 5)).cleanpath.to_s
    TEST_DATA = Pathname.new(File.join(bioruby_root, 'test', 'data', 'phyloxml')).cleanpath.to_s

    # Path to the example PhyloXML file
    def self.example_xml
      File.join TEST_DATA, 'phyloxml_examples.xml'
    end
  end

You can then use

  TestPhyloXMLData.example_xml

to return the path to the example test data file.

The code you described in your email adds the BioRuby lib directory to the library path so you can do

  require 'bio/db/phyloxml'

at the top of your test file.

Is this any help?

Cheers

Mike

2009/5/23 Diana Jaunzeikare
>
> Hi,
>
> How does unit testing in BioRuby works?
>
> I created new file /lib/bio/db/phyloxml.rb and
> test/unit/bio/db/test_phyloxml.rb I also put some sample xml files in
> /test/unit/data/phyloxml/ directory.
>
> How can I access the data set in data directory from the test_phyloxml.rb
> file?
>
> Can I give path relative path to the data file? for example
> "../../../data/phyloxml/phyloxml_examples.xml"
>
> I saw other test files have this code:
>
> "require 'pathname'
> libpath = Pathname.new(File.join(File.dirname(__FILE__), ['..'] * 3,
> 'lib')).cleanpath.to_s
> $:.unshift(libpath) unless $:.include?(libpath)
> "
> Is this code for that purpose? I am not really sure what this piece of code
> means.
>
> Thanks,
>
> Diana
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From rozziite at gmail.com Sun May 24 20:42:18 2009
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Sun, 24 May 2009 16:42:18 -0400
Subject: [BioRuby] GSOC: Unit testing in BioRuby
In-Reply-To:
References: <4057d3bf0905231312j13d44bf4s76d9f97a534fce79@mail.gmail.com>
Message-ID: <4057d3bf0905241342x6cba94f1mf8e236cd2062199f@mail.gmail.com>

Thanks a lot Michael! It worked.
Diana On Sat, May 23, 2009 at 7:00 PM, Michael Barton wrote: > Hi Diana. > > You could try adding a module along these lines to your test file, > inside the TestPhyloXML module. > > module TestPyloXMLData > > bioruby_root = Pathname.new(File.join(File.dirname(__FILE__), > ['..'] * 5)).cleanpath.to_s > TEST_DATA = Pathname.new(File.join(bioruby_root, 'test', 'data', > 'pyloxml')).cleanpath.to_s > > def self.example_xml > File.join TEST_DATA, 'phyloxml_examples.xml' > end > > end > > You can then use > > TestPhyloXMLData.example_xml > > To return the path to the example test data file. > > > > The code you described in your email adds the bioruby root to the > library path so you can do > > require 'bio/db/phyloxml' > > at the top of your test file. > > Is this any help? > > Cheers > > Mike > > > 2009/5/23 Diana Jaunzeikare > > > > Hi, > > > > How does unit testing in BioRuby works? > > > > I created new file /lib/bio/db/phyloxml.rb and > > test/unit/bio/db/test_phyloxml.rb I also put some sample xml files in > > /test/unit/data/phyloxml/ directory. > > > > How can I access the data set in data directory from the test_phyloxml.rb > > file? > > > > Can I give path relative path to the data file? for example > > "../../../data/phyloxml/phyloxml_examples.xml" > > > > I saw other test files have this code: > > > > "require 'pathname' > > libpath = Pathname.new(File.join(File.dirname(__FILE__), ['..'] * 3, > > 'lib')).cleanpath.to_s > > $:.unshift(libpath) unless $:.include?(libpath) > > " > > Is this code for that purpose? I am not really sure what this piece of > code > > means. 
> > > > Thanks, > > > > Diana > > _______________________________________________ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From mail at michaelbarton.me.uk Sun May 24 22:28:07 2009 From: mail at michaelbarton.me.uk (Michael Barton) Date: Sun, 24 May 2009 23:28:07 +0100 Subject: [BioRuby] GSOC: Unit testing in BioRuby In-Reply-To: <4057d3bf0905241342x6cba94f1mf8e236cd2062199f@mail.gmail.com> References: <4057d3bf0905231312j13d44bf4s76d9f97a534fce79@mail.gmail.com> <4057d3bf0905241342x6cba94f1mf8e236cd2062199f@mail.gmail.com> Message-ID: Cheers. Good luck with the rest of the project. 2009/5/24 Diana Jaunzeikare : > Thanks a lot Michael! It worked. > > Diana > > On Sat, May 23, 2009 at 7:00 PM, Michael Barton > wrote: >> >> Hi Diana. >> >> You could try adding a module along these lines to your test file, >> inside the TestPhyloXML module. >> >> module TestPyloXMLData >> >> bioruby_root = Pathname.new(File.join(File.dirname(__FILE__), >> ['..'] * 5)).cleanpath.to_s >> TEST_DATA = Pathname.new(File.join(bioruby_root, 'test', 'data', >> 'pyloxml')).cleanpath.to_s >> >> def self.example_xml >> File.join TEST_DATA, 'phyloxml_examples.xml' >> end >> >> end >> >> You can then use >> >> TestPhyloXMLData.example_xml >> >> To return the path to the example test data file. >> >> >> >> The code you described in your email adds the bioruby root to the >> library path so you can do >> >> require 'bio/db/phyloxml' >> >> at the top of your test file. >> >> Is this any help? >> >> Cheers >> >> Mike >> >> >> 2009/5/23 Diana Jaunzeikare >> > >> > Hi, >> > >> > How does unit testing in BioRuby works? 
>> > >> > I created new file /lib/bio/db/phyloxml.rb and >> > test/unit/bio/db/test_phyloxml.rb I also put some sample xml files in >> > /test/unit/data/phyloxml/ directory. >> > >> > How can I access the data set in data directory from the >> > test_phyloxml.rb >> > file? >> > >> > Can I give path relative path to the data file? for example >> > "../../../data/phyloxml/phyloxml_examples.xml" >> > >> > I saw other test files have this code: >> > >> > "require 'pathname' >> > libpath = Pathname.new(File.join(File.dirname(__FILE__), ['..'] * 3, >> > 'lib')).cleanpath.to_s >> > $:.unshift(libpath) unless $:.include?(libpath) >> > " >> > Is this code for that purpose? I am not really sure what this piece of >> > code >> > means. >> > >> > Thanks, >> > >> > Diana >> > _______________________________________________ >> > BioRuby mailing list >> > BioRuby at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/bioruby >> >> _______________________________________________ >> BioRuby mailing list >> BioRuby at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioruby > > From rozziite at gmail.com Mon May 25 00:29:57 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Sun, 24 May 2009 20:29:57 -0400 Subject: [BioRuby] [Wg-phyloinformatics] Update on phyloXML support for BioRuby project In-Reply-To: <86472996-C665-4AF0-A8A9-9E8D4EEB2A79@illinois.edu> References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>

<4A14D243.2010101@burnham.org> <86472996-C665-4AF0-A8A9-9E8D4EEB2A79@illinois.edu> Message-ID: <4057d3bf0905241729o69447925hd830404950fd28e1@mail.gmail.com>

Hi all,

Since there are many more elements in PhyloXML than in Bio::Tree, I propose to make a class PhyloXMLNode which inherits from Bio::Tree::Node.

PhyloXMLNode:
# attributes from Bio::Tree::Node
* bootstrap
* bootstrap_string
* ec_number
* name
* scientific_name
* taxonomy_id

# new attributes
* id_source
* confidence [] ([] means array of elements)
* color
* node_id
* taxonomy []
* sequence [] (Bio::Sequence object)
* events
* binary_characters
* distribution []
* date
* reference []
* property []

Also, since the <phylogeny> element does not consist only of <clade> elements but of other elements as well, the Bio::Tree class should be extended.

PhyloXMLTree
# inherited from Bio::Tree
* options
* root

# new attributes
* rooted (boolean)
* rerootable (boolean)
* branch_length_unit
* type
* name
* id
* description
* date
* confidence []
* clade_relation []
* sequence_relation []
* property []

I think inheritance is better than creating a separate class, because users will then be able to use Bio::Tree as before while also being able to read PhyloXML data files. Conversion from PhyloXML to other formats will also be easy, since the Bio::Tree class has output_newick, output_nhx, and output_phylip_distance_matrix methods.

Diana

Project Page: https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby

On Thu, May 21, 2009 at 9:03 AM, Chris Fields wrote:
> Actually, as Perl's XML::LibXML::Reader is described it almost sounds
> perfect, though I'm unsure of backtracking to a specific node in the
> tree (and thus post/pre-order of nodes). Saying that, I would be
> surprised if it weren't possible, though.
>
> chris
>
> On May 20, 2009, at 11:02 PM, Christian M Zmasek wrote:
>
> > Hi:
> >
> > Thanks for the detailed replies by Hilmar and Chris!
> > I think it is a very good idea to keep such very large trees in > > mind, and possibly implement a solution which only loads requested > > nodes into memory (as described by Hilmar and Chris) if there is > > enough time left at the end of the project. > > > > Re "It's tricky with re: to a number of aspects, but it can be > > done. For instance, if one wanted to modify the created nodes > > (i.e. if the nodes are mutable), or creating a generic Lazy set of > > classes capable of dealing with multiple formats." > > > > How would you do post-order or pre-order iteration of nodes? > > Wouldn't you have to back and forth in the file? > > > > CZ > > > > Chris Fields wrote: > >> On May 20, 2009, at 8:22 AM, Hilmar Lapp wrote: > >> > >> > >>> On May 19, 2009, at 5:54 PM, Christian M Zmasek wrote: > >>> > >>> > >>>> I think it is perfectly acceptable to expect to have enough memory > >>>> to keep at least > >>>> one tree in memory > >>>> > >>> Sounds like a good and perfectly reasonable starting point to me > >>> too. > >>> It's also the way other toolkits (such as BioPerl) work. > >>> > >>> Having said that, I don't find it inconceivable that we may be > >>> working > >>> with trees in the near future that don't fit into memory for a 1GB > >>> RAM > >>> machine if they are richly decorated (which is something that > >>> phyloXML > >>> wants to enable, isn't it?). Solving that to me though seems to be > >>> question of writing an appropriate Tree implementation that > >>> happens to > >>> store most of the data on disk rather than in memory, and not an > >>> issue > >>> for how to write a parser. Ideally though, the parser uses a factory > >>> for creating the (tree and/or node) objects, so that later it can be > >>> made to use an on-disk Tree implementation simply by passing it > >>> another factory. I.e., ideally the parser would not assume and hard- > >>> code the Tree implementation class. > >>> > >>> Just my $0.02. 
> >>> > >>> -hilmar > >>> > >> > >> This could be implemented in a lazy way or using lightweight > >> objects. The Tree object itself contains the XML parser or a > >> reference thereof (probably LibXML Reader-based) and creates the > >> relevant nodes as needed. The only thing needed would be some > >> light parsing to indicate start-end file points. > >> > >> It's tricky with re: to a number of aspects, but it can be done. > >> For instance, if one wanted to modify the created nodes (i.e. if > >> the nodes are mutable), or creating a generic Lazy set of classes > >> capable of dealing with multiple formats. > >> > >> Just in case anyone's wondering, I have been thinking along these > >> lines for a while re: BioPerl, Bio::Seq, and very large files... ;> > >> > >> chris > >> > > > > _______________________________________________ > Wg-phyloinformatics mailing list > Wg-phyloinformatics at nescent.org > https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics > From ngoto at gen-info.osaka-u.ac.jp Mon May 25 14:04:22 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Mon, 25 May 2009 23:04:22 +0900 Subject: [BioRuby] GSOC: Unit testing in BioRuby In-Reply-To: References: <4057d3bf0905231312j13d44bf4s76d9f97a534fce79@mail.gmail.com> Message-ID: <20090525140422.D258B1CBC3F4@idnmail.gen-info.osaka-u.ac.jp> Hi, > libpath = Pathname.new(File.join(File.dirname(__FILE__), ['..'] * 3, > 'lib')).cleanpath.to_s > $:.unshift(libpath) unless $:.include?(libpath) > bioruby_root = Pathname.new(File.join(File.dirname(__FILE__), > ['..'] * 5)).cleanpath.to_s > TEST_DATA = Pathname.new(File.join(bioruby_root, 'test', 'data', > 'pyloxml')).cleanpath.to_s The magic number (in the above cases, 3 or 5) depends on the depth of the directory of the test. 
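A minimal sketch of that rule, assuming the test file path is written relative to the BioRuby root. The helper names levels_up and lib_path_for are made up for this illustration and are not part of BioRuby; the Pathname/File.join expression is the same idiom as in the test header quoted above.

```ruby
require 'pathname'

# Hypothetical helper: number of '..' steps needed to get from the directory
# of a test file back to the project root. Assumes a root-relative path.
def levels_up(test_file)
  test_file.split('/').length - 1
end

# Hypothetical helper: the lib directory as the usual test header computes
# it, but for an arbitrary test file path instead of __FILE__.
def lib_path_for(test_file)
  Pathname.new(File.join(File.dirname(test_file),
                         ['..'] * levels_up(test_file),
                         'lib')).cleanpath.to_s
end

['test/unit/bio/test_AAA.rb',
 'test/unit/bio/BBB/test_AAA.rb',
 'test/unit/bio/CCC/BBB/test_AAA.rb'].each do |path|
  puts "#{path}: #{levels_up(path)} -> #{lib_path_for(path)}"
end
```

In a real test file the number is normally just written out by hand; the point of the sketch is only that it equals the number of directories between the test file and the project root.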
For example,

  File location                        The number
  test/unit/bio/test_AAA.rb            3
  test/unit/bio/BBB/test_AAA.rb        4
  test/unit/bio/CCC/BBB/test_AAA.rb    5

-- 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From czmasek at burnham.org Mon May 25 22:16:08 2009
From: czmasek at burnham.org (Christian M Zmasek)
Date: Mon, 25 May 2009 15:16:08 -0700
Subject: [BioRuby] [Wg-phyloinformatics] bioruby classes for phyloxml support
In-Reply-To: <4057d3bf0905241729o69447925hd830404950fd28e1@mail.gmail.com>
References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>

<4A14D243.2010101@burnham.org> <86472996-C665-4AF0-A8A9-9E8D4EEB2A79@illinois.edu> <4057d3bf0905241729o69447925hd830404950fd28e1@mail.gmail.com> Message-ID: <4A1B18A8.1070104@burnham.org>

Hi, Diana:

What you wrote looks more or less OK.

I agree it is better to extend existing classes, as opposed to changing them drastically.
One thing to keep in mind is that many attributes are themselves composed of multiple fields, i.e. you would need to create a class for them (if such a class does not already exist).
The most important element, besides sequence, is the taxonomy class.

Since BioRuby does not contain a general-purpose taxonomy class at this point, it might be worth spending some time on designing such a class.

I propose a taxonomy class with the following elements:
- scientific name (e.g. Nematostella vectensis)
- common name (e.g. starlet sea anemone)
- code (or mnemonic, as used by Swiss-Prot) (e.g. NEMVE)
- rank (e.g. species)

phyloXML also has a URI for taxonomies, but I am not sure if this is important for a general taxonomy class.

On the other hand, a general taxonomy class might also have
- authority (e.g. Stephenson, 1935)
- aliases []
(if these elements are considered important, they could of course be added to the next version of phyloXML)

What do people think about this?

Christian

Diana Jaunzeikare wrote:
> Hi all,
>
> Since there are much more elements in PhyloXML than in Bio::Tree I
> propose to make a class PhyloXMLNode which inherits from Bio::Tree::Node.
> > PhyloXMLNode: > # attributes from Bio::Tree::Node > * bootstrap > * bootstrap_string > * ec_number > * name > * scientific_name > * taxonomy_id > > #new attributes > * id_source > * confidence [] ([] means array of elements) > * color > * node_id > * taxonomy [] > * sequence [] (Bio::Sequence object) > * events > * binary_characters > * distribution [] > * date > * reference [] > * property [] > > Also, since element does not only consist of > elements, but other elements also, Bio::Tree class should be extended. > > PhyloXMLTree > #inherited from Bio::Tree > * options > * root > > # new attributes > * rooted (boolean) > * rerootable (boolean) > * branch_length_unit > * type > * name > * id > * description > * date > * confidence [] > * clade_relation [] > * sequence_relation [] > * property [] > > > I think inheritance is better than creating a separate class, because > then users will be able to use Bio::Tree as before, but also being > able to read PhyloXML data files. Also then conversion from PhyloXML > to other formats will be easy since Bio::Tree class has output_newick, > output_nhx, output_phylip_distance_matrix methods. > > Diana > > Project Page: > https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby > > On Thu, May 21, 2009 at 9:03 AM, Chris Fields > wrote: > > Actually, as Perl's XML::LibXML::Reader is described it almost sounds > perfect, though I'm unsure of backtracking to a specific node in the > tree (and thus post/pre-order of nodes). Saying that, I would be > surprised if it weren't possible, though. > > chris > > On May 20, 2009, at 11:02 PM, Christian M Zmasek wrote: > > > Hi: > > > > Thanks for the detailed replies by Hilmar and Chris! > > I think it is a very good idea to keep such very large trees in > > mind, and possibly implement a solution which only loads requested > > nodes into memory (as described by Hilmar and Chris) if there is > > enough time left at the end of the project. 
> > > > Re "It's tricky with re: to a number of aspects, but it can be > > done. For instance, if one wanted to modify the created nodes > > (i.e. if the nodes are mutable), or creating a generic Lazy set of > > classes capable of dealing with multiple formats." > > > > How would you do post-order or pre-order iteration of nodes? > > Wouldn't you have to back and forth in the file? > > > > CZ > > > > Chris Fields wrote: > >> On May 20, 2009, at 8:22 AM, Hilmar Lapp wrote: > >> > >> > >>> On May 19, 2009, at 5:54 PM, Christian M Zmasek wrote: > >>> > >>> > >>>> I think it is perfectly acceptable to expect to have enough > memory > >>>> to keep at least > >>>> one tree in memory > >>>> > >>> Sounds like a good and perfectly reasonable starting point to me > >>> too. > >>> It's also the way other toolkits (such as BioPerl) work. > >>> > >>> Having said that, I don't find it inconceivable that we may be > >>> working > >>> with trees in the near future that don't fit into memory for a 1GB > >>> RAM > >>> machine if they are richly decorated (which is something that > >>> phyloXML > >>> wants to enable, isn't it?). Solving that to me though seems to be > >>> question of writing an appropriate Tree implementation that > >>> happens to > >>> store most of the data on disk rather than in memory, and not an > >>> issue > >>> for how to write a parser. Ideally though, the parser uses a > factory > >>> for creating the (tree and/or node) objects, so that later it > can be > >>> made to use an on-disk Tree implementation simply by passing it > >>> another factory. I.e., ideally the parser would not assume and > hard- > >>> code the Tree implementation class. > >>> > >>> Just my $0.02. > >>> > >>> -hilmar > >>> > >> > >> This could be implemented in a lazy way or using lightweight > >> objects. The Tree object itself contains the XML parser or a > >> reference thereof (probably LibXML Reader-based) and creates the > >> relevant nodes as needed. 
The only thing needed would be some > >> light parsing to indicate start-end file points. > >> > >> It's tricky with re: to a number of aspects, but it can be done. > >> For instance, if one wanted to modify the created nodes (i.e. if > >> the nodes are mutable), or creating a generic Lazy set of classes > >> capable of dealing with multiple formats. > >> > >> Just in case anyone's wondering, I have been thinking along these > >> lines for a while re: BioPerl, Bio::Seq, and very large files... ;> > >> > >> chris > >> > > > > _______________________________________________ > Wg-phyloinformatics mailing list > Wg-phyloinformatics at nescent.org > > https://lists.nescent.org/mailman/listinfo/wg-phyloinformatics > > From bonnalraoul at ingm.it Tue May 26 07:25:34 2009 From: bonnalraoul at ingm.it (Raoul JP Bonnal) Date: Tue, 26 May 2009 09:25:34 +0200 Subject: [BioRuby] [Wg-phyloinformatics] bioruby classes for phyloxml support In-Reply-To: <4A1B18A8.1070104@burnham.org> References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>

<4A14D243.2010101@burnham.org> <86472996-C665-4AF0-A8A9-9E8D4EEB2A79@illinois.edu> <4057d3bf0905241729o69447925hd830404950fd28e1@mail.gmail.com> <4A1B18A8.1070104@burnham.org> Message-ID: <4A1B996E.2010702@ingm.it> Christian M Zmasek ha scritto: > I agree it is better to extend existing classes, as opposed to change > them drastically. > One thing to keep in mind, is that many attributes are composed of > multiple fields themselves, i.e. you would need to create a class for > them (if such a class not already exists). > The most important element besides sequence, is the taxonomy class. > > Since BioRuby does not contain a general purpose taxonomy class at > this point, it might be worth spending some time in designing such a > class. > > I propose a taxonomy class with the following elements: > -scientific name (e.g. Nematostella vectensis) > -common name (e.g. starlet sea anemone) > -code (or mnemonic, as used by swiss-port) (e.g. NEMVE) > -rank (e.g. species) > > phyloxml also has a URI for taxonomies, but I am not sure if this is > important for a general taxonomy class. > > On the other hand, a general taxonomy class might also have > - authority (e.g. Stephenson, 1935) > - aliases [] > (if these elements are considered important, they of course could be > added to the next version of phyloxml) > > What do people think about this? How other langs represent that class ? I think that having the chance to define a new class there is the opportunity to define a similar api among bio-languages. Then, taxonomy class could be used by biosequences objects representing/grabbing data from biosql for example. 
http://code.open-bio.org/svnweb/index.cgi/biosql/checkout/biosql-schema/trunk/doc/biosql-ERD.pdf

-- 
Ra

From francesco.strozzi at gmail.com Tue May 26 07:48:39 2009
From: francesco.strozzi at gmail.com (Francesco Strozzi)
Date: Tue, 26 May 2009 09:48:39 +0200
Subject: [BioRuby] Miranda target scan
Message-ID:

Hi all,

I need to parse Miranda output files, after a whole genome scan for microRNA target sites. Is there any package available to do this in BioRuby?

Thanks and cheers.

-- 
Francesco
skype: francescostrozzi

From rozziite at gmail.com Tue May 26 12:17:24 2009
From: rozziite at gmail.com (Diana Jaunzeikare)
Date: Tue, 26 May 2009 08:17:24 -0400
Subject: [BioRuby] Bioruby PhyloXML update
Message-ID: <4057d3bf0905260517m32167f53j5ce8ca45b43ce655@mail.gmail.com>

Hi all!

Here is what was done during the community bonding period:
- subscribed to mailing lists
- created a blog
- got familiar with Git (this was particularly useful: http://www.gitcasts.com/posts/railsconf-git-talk )
- created a GitHub account and forked the bioruby project
- made a first commit by adding sample phyloXML data files from www.phyloxml.org
- reviewed the BioPerl phyloXML implementation (also http://www.bioperl.org/wiki/HOWTO:Trees )
- got familiar with libxml-ruby. Wrote a simple program using both LibXML::XML::Reader and LibXML::XML::SAXParser to parse a simple XML file.
- reviewed Ruby classes: Bio::Tree, Bio::Pathway
- After discussions on the mailing lists it has been agreed to use the libxml-ruby library, specifically the LibXML::XML::Reader class

This week's plan:
- Start writing the parser using LibXML::XML::Reader. It should return a Bio::Tree object.
- Implement a function next_tree to parse and return the next phylogeny.
- Design a Tree::Node object for containing phyloXML elements.
- Start mapping phyloXML elements to Bio::Tree::Node, starting with taxonomy, branch_length, scientific_name
- Write simple unit tests.
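
As a rough sketch of the node design item in the plan above, assuming nothing about the final implementation: StandInNode is a made-up placeholder for Bio::Tree::Node so that the example runs without BioRuby installed, and the Taxonomy fields follow the proposal made earlier on the list (scientific name, common name, code, rank). Only a few of the proposed attributes are shown.

```ruby
# Stand-in for Bio::Tree::Node (hypothetical; the real class is defined in
# BioRuby's bio/tree.rb and has more attributes and behaviour).
class StandInNode
  attr_accessor :name, :bootstrap, :scientific_name, :taxonomy_id

  def initialize(name = nil)
    @name = name
  end
end

# Taxonomy fields as proposed on the list; a Struct is just one possible
# way to hold them.
Taxonomy = Struct.new(:scientific_name, :common_name, :code, :rank)

# Hypothetical PhyloXML node: inherits the generic attributes and adds some
# PhyloXML-specific ones. List-valued attributes start out as empty arrays.
class PhyloXMLNode < StandInNode
  attr_accessor :id_source, :color, :date
  attr_reader :confidence, :taxonomy, :sequence

  def initialize(name = nil)
    super
    @confidence = []
    @taxonomy = []
    @sequence = []
  end
end

node = PhyloXMLNode.new('A')
node.taxonomy << Taxonomy.new('Nematostella vectensis',
                              'starlet sea anemone', 'NEMVE', 'species')
puts node.taxonomy.first.code
```

Because PhyloXMLNode is still a kind of the base node class, code written against the generic node interface would keep working on it, which is the argument made for inheritance in this thread.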
Diana Project page: https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby From czmasek at burnham.org Wed May 27 02:01:28 2009 From: czmasek at burnham.org (Christian M Zmasek) Date: Tue, 26 May 2009 19:01:28 -0700 Subject: [BioRuby] [Wg-phyloinformatics] bioruby classes for phyloxml support In-Reply-To: <75226546-345B-446D-82D4-FBD5016905D8@duke.edu> References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>

<4A14D243.2010101@burnham.org> <86472996-C665-4AF0-A8A9-9E8D4EEB2A79@illinois.edu> <4057d3bf0905241729o69447925hd830404950fd28e1@mail.gmail.com> <4A1B18A8.1070104@burnham.org> <75226546-345B-446D-82D4-FBD5016905D8@duke.edu>
Message-ID: <4A1C9EF8.9010505@burnham.org>

Hi:

It's a great idea to look at TDWG's standards.
But to me, these standards seem designed specifically for collections in museums and, not surprisingly, biodiversity applications. For the purposes of comparative genomics and related fields, these taxonomy concepts seem a little bit overkill.

In my experience, taxonomy objects which can contain a scientific name, a common name, a mnemonic, and a (typed) identifier (which could be a Uniform Resource Name (URN) or a NCBI taxonomy id) are sufficient for most applications. This is pretty much what phyloXML's taxon element contains now. Of course, this does not mean that a potential taxonomy class in BioRuby has to follow the concept for phyloXML.

What do you think?

Christian

Hilmar Lapp wrote:
> On May 25, 2009, at 6:16 PM, Christian M Zmasek wrote:
>
>
>> I propose a taxonomy class with the following elements:
>> -scientific name (e.g. Nematostella vectensis)
>> -common name (e.g. starlet sea anemone)
>> -code (or mnemonic, as used by swiss-port) (e.g. NEMVE)
>> -rank (e.g. species)
>>
>> phyloxml also has a URI for taxonomies, but I am not sure if this is
>> important for a general taxonomy class.
>>
>> On the other hand, a general taxonomy class might also have
>> - authority (e.g. Stephenson, 1935)
>> - aliases []
>>
>> (if these elements are considered important, they of course could be
>> added to the next version of phyloxml)
>>
>
>
> Note that there is the Taxonomic Concepts Transfer Schema as a
> ratified TDWG standard, so if you really want to have a rich
> representation of taxonomic entities or concepts I wouldn't try to
> roll my own.
> > http://www.tdwg.org/standards/117/ > > For lightweight taxonomic designation, there are taxonomic elements in > Darwin Core: > > http://wiki.tdwg.org/DarwinCore > > -hilmar > From rozziite at gmail.com Wed May 27 02:51:48 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Tue, 26 May 2009 22:51:48 -0400 Subject: [BioRuby] [Wg-phyloinformatics] bioruby classes for phyloxml support In-Reply-To: <4A1C9EF8.9010505@burnham.org> References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>

<4A14D243.2010101@burnham.org> <86472996-C665-4AF0-A8A9-9E8D4EEB2A79@illinois.edu> <4057d3bf0905241729o69447925hd830404950fd28e1@mail.gmail.com> <4A1B18A8.1070104@burnham.org> <75226546-345B-446D-82D4-FBD5016905D8@duke.edu> <4A1C9EF8.9010505@burnham.org>
Message-ID: <4057d3bf0905261951w47d17ef8l70c44d46fbbd933c@mail.gmail.com>

On Tue, May 26, 2009 at 10:01 PM, Christian M Zmasek wrote:
> Hi:
>
> It's a great idea to look at TDWG's standards.
> But to me, these standards seem designed specifically for collections in
> museums and, not surprisingly, biodiversity applications. For the purposes
> of comparative genomics and related fields, these taxonomy concepts seem a
> little bit overkill.

I totally agree that TDWG is overkill. Here is a rough overview of the data fields in the TDWG standard. I agree that most of them are irrelevant for applications not specific to taxonomic studies.

* MetaData
* Specimens []
  - id
  - Institution
  - Collection
  - SpecimenItem
* Publications []
* TaxonNames []
  - Rank
  - Canonical Name
  - CanonicalAuthorship
  - PublishedIn
  - Year
  - MicroReference
  - Typification
  - SpellingCorrectionOf
  - Basionym
  - BasedOn
  - ConservedAgainst
  - LaterHomonymesOf
  - Sanctioned
  - ReplacementNameFor
  - PublicationStatus
  - ProviderLink
  - ProviderSpecificData
* TaxonConcepts []
  - id
  - Name
  - Rank
  - AccordingTo
  - TaxonRelationships
  - SpecimenCircumscription
  - CharacterCircumscription
  - ProviderLink
* TaxonRelationshipAssertions

Diana

>
> In my experience, taxonomy objects which can contain a scientific name, a
> common name, a mnemonic, and a (typed) identifier (which could be a Uniform
> Resource Name (URN) or a NCBI taxonomy id) are sufficient for most
> applications. This is pretty much what phyloXML's taxon element contains
> now. Of course, this does not mean that a potential taxonomy class in
> BioRuby has to follow the concept for phyloXML.
>
> What do you think?
> > Christian > > > > > > > Hilmar Lapp wrote: > >> On May 25, 2009, at 6:16 PM, Christian M Zmasek wrote: >> >> >> >>> I propose a taxonomy class with the following elements: >>> -scientific name (e.g. Nematostella vectensis) >>> -common name (e.g. starlet sea anemone) >>> -code (or mnemonic, as used by swiss-port) (e.g. NEMVE) >>> -rank (e.g. species) >>> >>> phyloxml also has a URI for taxonomies, but I am not sure if this is >>> important for a general taxonomy class. >>> >>> On the other hand, a general taxonomy class might also have >>> - authority (e.g. Stephenson, 1935) >>> - aliases [] >>> >>> (if these elements are considered important, they of course could be >>> added to the next version of phyloxml) >>> >>> >> >> >> Note that there is the Taxonomic Concepts Transfer Schema as a ratified >> TDWG standard, so if you really want to have a rich representation of >> taxonomic entities or concepts I wouldn't try to roll my own. >> >> http://www.tdwg.org/standards/117/ >> >> For lightweight taxonomic designation, there are taxonomic elements in >> Darwin Core: >> >> http://wiki.tdwg.org/DarwinCore >> >> -hilmar >> >> > > From rozziite at gmail.com Wed May 27 03:02:20 2009 From: rozziite at gmail.com (Diana Jaunzeikare) Date: Tue, 26 May 2009 23:02:20 -0400 Subject: [BioRuby] [Wg-phyloinformatics] bioruby classes for phyloxml support In-Reply-To: <4A1B996E.2010702@ingm.it> References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>

<4A14D243.2010101@burnham.org> <86472996-C665-4AF0-A8A9-9E8D4EEB2A79@illinois.edu> <4057d3bf0905241729o69447925hd830404950fd28e1@mail.gmail.com> <4A1B18A8.1070104@burnham.org> <75226546-345B-446D-82D4-FBD5016905D8@duke.edu> <4A1C9EF8.9010505@burnham.org> <4057d3bf0905261951w47d17ef8l70c44d46fbbd933c@mail.gmail.com> <24AC846C-CEAA-4D01-B1CB-A8FC8033172D@berkeleybop.org> <11C49404-46B9-4D47-949B-1B7DE3CBAAF9@duke.edu>
Message-ID: <4A1D8684.10701@burnham.org>

Hi:

> (BTW you will also notice that there is
> nothing even close to the Swissprot "mnemonic" - nobody does this
> except Swissprot - whose chief business is protein annotation, not
> taxonomy - so you may want to consider how much significance you want
> to give this in your object model.)
>

Again, agreed. But in practice this "mnemonic" is very, very useful. For example, it is a very handy way to create short protein names/identifiers which still contain human-readable species information but are short enough to be used in a variety of alignment/phylogeny reconstruction programs.

> As an aside, the name of a taxon really is a proxy for a taxon
> concept, whether that is a species or not, except that typically a
> taxon name isn't given in full (i.e., with author, year, and
> publication) to allow unambiguous identification. That's one of the
> reasons why taxon identifiers are key.

Indeed. And the taxon concept is a "science" in itself....

> BTW NCBI taxon IDs are just one
> kind of taxon IDs. There are also Catalog of Life, ITIS, IPNI, and
> others.
>

Definitely. That's why the phyloxml taxonomy has a typed id. Like so: 594569.
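
To make the lightweight taxonomy idea from this thread concrete, here is a minimal sketch in plain Ruby. The class name TaxonomySketch, the TaxonId struct and the label_for helper are all hypothetical; they are not part of BioRuby or phyloXML. The fields simply mirror the ones proposed above: scientific name, common name, code (mnemonic), rank, and a typed identifier.

```ruby
# Hypothetical typed identifier: the type names the source of the ID
# (for example a taxonomy database), the value is the ID itself.
TaxonId = Struct.new(:type, :value)

# Hypothetical lightweight taxonomy record; field names are assumptions
# taken from the elements proposed in this thread.
class TaxonomySketch
  attr_reader :scientific_name, :common_name, :code, :rank, :id

  def initialize(scientific_name:, common_name: nil, code: nil, rank: nil, id: nil)
    @scientific_name = scientific_name
    @common_name = common_name
    @code = code
    @rank = rank
    @id = id
  end

  # Builds a short identifier that still carries species information by
  # appending the mnemonic code; falls back to the bare name without a code.
  def label_for(protein_name)
    code ? "#{protein_name}_#{code}" : protein_name
  end
end
```

Keeping the identifier as a (type, value) pair leaves room for IDs from providers other than NCBI without changing the class.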

--CZ

From czmasek at burnham.org Wed May 27 18:41:41 2009
From: czmasek at burnham.org (Christian M Zmasek)
Date: Wed, 27 May 2009 11:41:41 -0700
Subject: [BioRuby] [Wg-phyloinformatics] bioruby classes for phyloxml support
In-Reply-To: <4057d3bf0905262002w448a7652leb48a40e680b13b8@mail.gmail.com>
References: <4057d3bf0905191407u44a10b9m656e615b2241bc14@mail.gmail.com> <4A132A8A.70102@burnham.org>