From anurag08priyam at gmail.com  Thu Jun  3 05:00:06 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Thu, 3 Jun 2010 14:30:06 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Update
Message-ID: <AANLkTilpv6Uur-NAhrRBoR_n3fjNY4ZzkU7XG9ZHDGEz@mail.gmail.com>

Hello all,

I know this update is coming quite late. Sorry for holding this back for so
long. From now on I will be updating this list weekly on my progress. Just
to keep everyone in the loop, [1] is my project page.

What has been done?
So far I have done a significant amount of work on the NeXML
parser. The parser recognizes otus, otu and trees. The trees implementation
does not yet cover the full NeXML schema: trees with multiple rootings,
coalescent trees and networks remain to be done.

Problems Faced:
Initially it was decided to stream parse any NeXML document, as DOM parsing
would be slow for larger documents. But with NeXML's non-linear design,
streaming feels unnatural and is proving a little difficult. Currently,
I have written a wrapper over the StAX parsing API of libxml, but the entire
document is parsed in one go, at the start.

The current git head [2] can be built and the code tested. A tutorial (of
sorts) on how to use the NeXML parser can be found at [3].

[1]
https://www.nescent.org/wg_phyloinformatics/Category:NeXML_and_RDF_API_for_BioRuby
[2] http://github.com/yeban/bioruby
[3]
https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From anurag08priyam at gmail.com  Fri Jun  4 04:39:34 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Fri, 4 Jun 2010 14:09:34 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Tree with Multiple rootings.
Message-ID: <AANLkTinGx3qR_5RgDIxT7OuUEp99sHfpS922zoaqiLqG@mail.gmail.com>

Hello all,

NeXML allows for trees with multiple rootings. In the NeXML lib trees are
represented by Bio::NeXML::Tree which inherits from Bio::Tree. This allows
for the usage of the excellent Bio::Tree framework for manipulating NeXML
trees. However, the Bio::Tree class supports only one root node.

Several functions require the presence of a root node:
parent, children, descendants, ancestors, lowest_common_ancestor. These
functions can take a root node as a parameter, so it is possible to extend
the current framework to work with trees with multiple root nodes.
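
As a rough illustration of why passing the root as a parameter is enough,
here is a minimal sketch in plain Ruby. SketchTree is a hypothetical
stand-in that stores the tree as an undirected graph; it is not the
Bio::Tree implementation, and the method signatures are assumptions made
for this example only.

```ruby
# Hypothetical stand-in for a tree stored as an undirected graph.
# Not the Bio::Tree implementation; names are illustrative only.
class SketchTree
  def initialize
    @adjacency = Hash.new { |hash, key| hash[key] = [] }
  end

  # Connects two nodes; returns self so calls can be chained.
  def add_edge(a, b)
    @adjacency[a] << b
    @adjacency[b] << a
    self
  end

  # Parent of +node+ when the tree is rooted at +root+ (nil for the root).
  def parent(node, root)
    path = path_from(root, node)
    path && path.length > 1 ? path[-2] : nil
  end

  # Children are all neighbours except the parent under this rooting.
  def children(node, root)
    @adjacency[node] - [parent(node, root)]
  end

  private

  # Depth-first search for the path from +root+ to +target+.
  def path_from(root, target, visited = {})
    return [root] if root == target
    visited[root] = true
    @adjacency[root].each do |neighbour|
      next if visited[neighbour]
      sub_path = path_from(neighbour, target, visited)
      return [root] + sub_path if sub_path
    end
    nil
  end
end

tree = SketchTree.new.add_edge(:a, :b).add_edge(:b, :c).add_edge(:b, :d)
tree.parent(:b, :a)   # rooted at :a, the parent of :b is :a
tree.parent(:b, :c)   # rooted at :c, the parent of :b is :c
```

The same topology answers differently depending on which root is passed
in, so a tree with several rootings only needs to keep a list of
candidate roots alongside the graph.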

Though this may not be required, a possibility is to add the multiple root
functionality to Bio::Tree class itself. Currently, I am adding multiple
root support to Bio::NeXML::Tree class. If need be we can move the
functionality to Bio::Tree.

Anything?

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From rutgeraldo at gmail.com  Fri Jun  4 10:21:27 2010
From: rutgeraldo at gmail.com (Rutger Vos)
Date: Fri, 4 Jun 2010 15:21:27 +0100
Subject: [BioRuby] [GSoC][NeXML and RDF API] Tree with Multiple rootings.
In-Reply-To: <AANLkTinGx3qR_5RgDIxT7OuUEp99sHfpS922zoaqiLqG@mail.gmail.com>
References: <AANLkTinGx3qR_5RgDIxT7OuUEp99sHfpS922zoaqiLqG@mail.gmail.com>
Message-ID: <AANLkTinJIQ26bRYKztfD2pHjP3XTsM3AchgNFS6qfuSM@mail.gmail.com>

Hi Anurag,

in practice I haven't actually seen trees with multiple rootings being
used much, so it might not be urgent that this moves to the bioruby
core. My main worry would be in picking the "right" root node to
expose to the core api. I think that it should be the node from which
all other nodes can be visited in a recursive traversal (which I
expect client code to do), as opposed to a node that has been
indicated using an XML attribute to be the root, but isn't in terms of
the actual topology that emerges from the node and edge tables.

However, I'm curious to hear other people's opinions whether a flag
(e.g. "is_root") might be added to Bio::Tree::Node, and a "get_roots"
method in Bio::Tree that returns a list of roots that typically only
holds the value of the "root" attribute, but could potentially have
multiple rootings.

Rutger

On Fri, Jun 4, 2010 at 9:39 AM, Anurag Priyam <anurag08priyam at gmail.com> wrote:
> Hello all,
>
> NeXML allows for trees with multiple rootings. In the NeXML lib trees are
> represented by Bio::NeXML::Tree which inherits from Bio::Tree. This allows
> for the usage of the excellent Bio::Tree framework for manipulating NeXML
> trees. However, Bio::Tree class supports only one root node.
>
> There are a couple of functions that require the presence of a root node:
> parent, children, descendants, ancestors, lowest_common_ancestor. Now, these
> functions can take a root node as a parameter. So it is possible to extend
> the current framework to work with trees with multiple root nodes.
>
> Though this may not be required, a possibility is to add the multiple root
> functionality to Bio::Tree class itself. Currently, I am adding multiple
> root support to Bio::NeXML::Tree class. If need be we can move the
> functionality to Bio::Tree.
>
> Anything?
>
> --
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642
>


-- 
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading
RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com

From hlapp at drycafe.net  Fri Jun  4 15:09:11 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Fri, 4 Jun 2010 15:09:11 -0400
Subject: [BioRuby] [GSoC][NeXML and RDF API] Tree with Multiple rootings.
In-Reply-To: <AANLkTinJIQ26bRYKztfD2pHjP3XTsM3AchgNFS6qfuSM@mail.gmail.com>
References: <AANLkTinGx3qR_5RgDIxT7OuUEp99sHfpS922zoaqiLqG@mail.gmail.com>
	<AANLkTinJIQ26bRYKztfD2pHjP3XTsM3AchgNFS6qfuSM@mail.gmail.com>
Message-ID: <01FB94FA-D962-4624-B612-49AA26FF3E2D@drycafe.net>

Multiple roots can be the result of a Bayesian analysis. (The PhyloDB  
module in BioSQL, for example, does support multiple roots.)

However, representing multiple roots is useless without also being  
able to indicate whether a root is an alternate root or the main root  
node, and what its significance (posterior prob. for a Bayesian  
analysis) is.

For reference, here is the column documentation for these two  
properties in PhyloDB's tree_root table:

COMMENT ON COLUMN tree_root.is_alternate IS 'True if the root node is  
the preferential (most likely) root node of the tree, and false  
otherwise.';

COMMENT ON COLUMN tree_root.significance IS 'The significance (such as  
likelihood, or posterior probability) with which the node is the root  
node. This only has meaning if the method used for reconstructing the  
tree calculates this value.';
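
To make the idea concrete, a minimal sketch in plain Ruby of how several
candidate roots could carry a significance value, with one of them
selectable as the main root. RootCandidate and RootSet are hypothetical
names invented for this example; they are not part of BioRuby or
PhyloDB, and picking the highest significance as the main root is an
assumption of the sketch.

```ruby
# Hypothetical record for one candidate rooting; field names are illustrative.
RootCandidate = Struct.new(:node, :significance)

# Hypothetical container for the candidate roots of one tree.
class RootSet
  def initialize
    @candidates = []
  end

  # Registers a candidate root; significance is optional.
  def add(node, significance = nil)
    @candidates << RootCandidate.new(node, significance)
    self
  end

  # All candidate root nodes, in insertion order.
  def roots
    @candidates.map(&:node)
  end

  # The candidate with the highest significance; falls back to the first
  # candidate when no significance values were recorded.
  def main_root
    scored = @candidates.reject { |candidate| candidate.significance.nil? }
    best = scored.max_by(&:significance) || @candidates.first
    best && best.node
  end
end

root_set = RootSet.new.add(:n1, 0.2).add(:n2, 0.7).add(:n3)
root_set.roots      # all three candidates
root_set.main_root  # :n2, the candidate with the highest significance
```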

	-hilmar

On Jun 4, 2010, at 10:21 AM, Rutger Vos wrote:

> Hi Anurag,
>
> in practice I haven't actually seen trees with multiple rootings being
> used much, so it might not be urgent that this moves to the bioruby
> core. My main worry would be in picking the "right" root node to
> expose to the core api. I think that it should be the node from which
> all other nodes can be visited in a recursive traversal (which I
> expect client code to do), as opposed to a node that has been
> indicated using an XML attribute to be the root, but isn't in terms of
> the actual topology that emerges from the node and edge tables.
>
> However, I'm curious to hear other people's opinions whether a flag
> (e.g. "is_root") might be added Bio::Tree::Node, and a "get_roots"
> method in Bio::Tree that returns a list of roots that typically only
> holds the value of the "root" attribute, but could potentially have
> multiple rootings.
>
> Rutger
>
> On Fri, Jun 4, 2010 at 9:39 AM, Anurag Priyam <anurag08priyam at gmail.com 
> > wrote:
>> Hello all,
>>
>> NeXML allows for trees with multiple rootings. In the NeXML lib  
>> trees are
>> represented by Bio::NeXML::Tree which inherits from Bio::Tree. This  
>> allows
>> for the usage of the excellent Bio::Tree framework for manipulating  
>> NeXML
>> trees. However, Bio::Tree class supports only one root node.
>>
>> There are a couple of functions that require the presence of a root  
>> node:
>> parent, children, descendants, ancestors, lowest_common_ancestor.  
>> Now, these
>> functions can take a root node as a parameter. So it is possible to  
>> extend
>> the current framework to work with trees with multiple root nodes.
>>
>> Though this may not be required, a possibility is to add the  
>> multiple root
>> functionality to Bio::Tree class itself. Currently, I am adding  
>> multiple
>> root support to Bio::NeXML::Tree class. If need be we can move the
>> functionality to Bio::Tree.
>>
>> Anything?
>>
>> --
>> Anurag Priyam,
>> 2nd Year Undergraduate,
>> Department of Mechanical Engineering,
>> IIT Kharagpur.
>> +91-9775550642
>>
>
>
>
> -- 
> Dr. Rutger A. Vos
> School of Biological Sciences
> Philip Lyle Building, Level 4
> University of Reading
> Reading
> RG6 6BX
> United Kingdom
> Tel: +44 (0) 118 378 7535
> http://www.nexml.org
> http://rutgervos.blogspot.com
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================


From mitlox at op.pl  Sun Jun  6 01:30:09 2010
From: mitlox at op.pl (xyz)
Date: Sun, 06 Jun 2010 15:30:09 +1000
Subject: [BioRuby] fastq files reading
In-Reply-To: <20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp>
References: <20100529221404.0175ee75@wp01>
	<AANLkTil1Nrbd5ULu3T-esvbg8VXoCYojW53eCL72oihi@mail.gmail.com>
	<20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp>
Message-ID: <4C0B3261.3020909@op.pl>

Thank you for the solutions; it works.

From sararayburn at gmail.com  Mon Jun  7 14:09:07 2010
From: sararayburn at gmail.com (Sara Rayburn)
Date: Mon, 7 Jun 2010 13:09:07 -0500
Subject: [BioRuby] GSoC speciation/duplication inference question
Message-ID: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com>

Hello,

While implementing my GSoC project (the speciation/duplication inference algorithm), I'm not sure where to put the module I'm developing in the bioruby library. I've been developing in the lib/bio/util directory, but I want to make sure that's the place you all would prefer for the module. Any suggestions or thoughts?

Thanks,

Sara Rayburn
sararayburn at gmail.com

From anurag08priyam at gmail.com  Wed Jun  9 04:17:55 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Wed, 9 Jun 2010 13:47:55 +0530
Subject: [BioRuby] fastq files reading
In-Reply-To: <4C0B3261.3020909@op.pl>
References: <20100529221404.0175ee75@wp01>
	<AANLkTil1Nrbd5ULu3T-esvbg8VXoCYojW53eCL72oihi@mail.gmail.com>
	<20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp>
	<4C0B3261.3020909@op.pl>
Message-ID: <AANLkTilS8gwRSQ0ZPC1mvRL7RLeGmsKOT1KYGB_J-i00@mail.gmail.com>

Maybe we should add this to the wiki [1]

[1] http://bioruby.open-bio.org/wiki/SampleCodes

On Sun, Jun 6, 2010 at 11:00 AM, xyz <mitlox at op.pl> wrote:

> Thank you for the solutions it works.
>


-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From mitlox at op.pl  Wed Jun  9 08:42:36 2010
From: mitlox at op.pl (xyz)
Date: Wed, 09 Jun 2010 22:42:36 +1000
Subject: [BioRuby] fastq files reading
In-Reply-To: <AANLkTilS8gwRSQ0ZPC1mvRL7RLeGmsKOT1KYGB_J-i00@mail.gmail.com>
References: <20100529221404.0175ee75@wp01>	<AANLkTil1Nrbd5ULu3T-esvbg8VXoCYojW53eCL72oihi@mail.gmail.com>	<20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp>	<4C0B3261.3020909@op.pl>
	<AANLkTilS8gwRSQ0ZPC1mvRL7RLeGmsKOT1KYGB_J-i00@mail.gmail.com>
Message-ID: <4C0F8C3C.1030303@op.pl>

Good idea.

On 06/09/10 18:17, Anurag Priyam wrote:
> Maybe we should add this to the wiki [1]
>
> [1] http://bioruby.open-bio.org/wiki/SampleCodes
>
> On Sun, Jun 6, 2010 at 11:00 AM, xyz <mitlox at op.pl
> <mailto:mitlox at op.pl>> wrote:
>
>     Thank you for the solutions it works.
>
>
>
>
> -- 
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642


From anurag08priyam at gmail.com  Wed Jun  9 15:49:35 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Thu, 10 Jun 2010 01:19:35 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
Message-ID: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>

Last week, I worked on finishing the implementation of trees (trees, tree,
network) and started work on the characters element. This week's target is to
complete the implementation of the characters element.

It would be awesome to have some code review covering implementation, API
design, coding style and tests. I am planning to spend a good amount of time
in the fourth week making the code more robust, so it would make perfect
sense to have some feedback to serve as guidelines :). The master branch and
API discussion page are at:

[1] http://github.com/yeban/bioruby
[2]
https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From czmasek at burnham.org  Wed Jun  9 21:50:16 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Wed, 9 Jun 2010 18:50:16 -0700
Subject: [BioRuby] gsoc questions
In-Reply-To: <AANLkTin5fGYJKbCWftc8AZYdks11iEiYZgTjebwJ_3f4@mail.gmail.com>
References: <2B862774-AD61-4BC0-86D6-69DCD832EB78@gmail.com>
	<AANLkTin5fGYJKbCWftc8AZYdks11iEiYZgTjebwJ_3f4@mail.gmail.com>
Message-ID: <4C1044D8.4010600@burnham.org>

Hi Sara:
> 
> 
> On Mon, Jun 7, 2010 at 11:20 AM, Sara Rayburn <sararayburn at gmail.com 
> <mailto:sararayburn at gmail.com>> wrote:
> 
>     Hi Christian and Diana,
> 
>     Two questions: 
> 
>     1) On the phylosoft website for forester/sdi
>     (http://www.phylosoft.org/forester/applications/sdi_r/) I've read
>     this about the two trees: 
>     "The important point to keep in mind is that there must be at least
>     one sub-element of the 'taxonomy' element which allows to match the
>     sequences in the gene tree with a taxonomy in the species tree. In
>     this example this sub-element of the 'taxonomy' element is 'code'."
> 
>     Does this mean that the sub-element for matching will *always* be
>     'code'? Or should I just be looking for anything at all that
>     matches? Also, will all phyloxml trees have the 'code' sub-element?
> 
> 
> To find out whether some element will always contain some other element 
> you can look at PhyloXML documentation [0]. For example at the Taxonomy 
> element documentation [1] you can see that it has a sub-element "code" 
> which is [0..1], which means that there either is no "code" sub-element 
> or there is one and no more, whereas there could be none or many "synonym" 
> sub-elements.
> 
> [0] http://www.phyloxml.org/documentation/version_1.10/phyloxml.xsd.html
> [1] http://www.phyloxml.org/documentation/version_1.10/phyloxml.xsd.html#h888650454


Good point! This matching of taxonomic information is a crucial point.
I recommend implementing this in the same manner as it is implemented in 
the "isEqual" method of the org.forester.phylogeny.data.Taxonomy class 
of the forester library, see:
http://forester-atv.cvs.sourceforge.net/viewvc/forester-atv/forester-atv/java/src/org/forester/phylogeny/data/Taxonomy.java?revision=1.57&view=markup

In this (Java) class the matching works like this:

1. If both Taxonomies to be compared have identifiers with the 
same source (e.g. NCBI taxonomy), use these identifiers to match.

  In Java:
   if ( ( getIdentifier() != null ) && ( tax.getIdentifier() != null ) )
   {
     return getIdentifier().isEqual( tax.getIdentifier() );
   }

2. Otherwise, if both Taxonomies have taxonomy codes, use the taxonomy 
codes to match.

  In Java:
   else if ( !ForesterUtil.isEmpty( getTaxonomyCode() ) &&
             !ForesterUtil.isEmpty( tax.getTaxonomyCode() ) )
   {
     return getTaxonomyCode().equals( tax.getTaxonomyCode() );
   }

3. Otherwise, if both Taxonomies have scientific names, use the 
scientific names to match.


4. Otherwise, if both Taxonomies have common names, use the common names 
to match.


5. Otherwise, matching is not possible and an error should be thrown.
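
For a Ruby port, the five steps above could be sketched roughly as
follows. This is a simplified illustration, not a translation of the
forester code: the Taxonomy struct, its field names and
TaxonomyMatchError are assumptions made for this example, and step 1 is
interpreted here as "both identifiers present and from the same source".

```ruby
# Hypothetical, simplified taxonomy record; field names are assumptions.
Taxonomy = Struct.new(:id_source, :id_value, :code,
                      :scientific_name, :common_name, keyword_init: true)

# Raised when two taxonomies share no comparable field (step 5).
class TaxonomyMatchError < StandardError; end

# True when neither value is nil or blank.
def both_present?(first, second)
  [first, second].none? { |value| value.nil? || value.to_s.strip.empty? }
end

# Compares two taxonomies using the first field both of them provide,
# in the order: identifier, code, scientific name, common name.
def taxonomies_match?(a, b)
  if both_present?(a.id_value, b.id_value) && a.id_source == b.id_source
    a.id_value == b.id_value
  elsif both_present?(a.code, b.code)
    a.code == b.code
  elsif both_present?(a.scientific_name, b.scientific_name)
    a.scientific_name == b.scientific_name
  elsif both_present?(a.common_name, b.common_name)
    a.common_name == b.common_name
  else
    raise TaxonomyMatchError, 'no common taxonomy field to compare'
  end
end

human_a = Taxonomy.new(id_source: 'ncbi', id_value: '9606', common_name: 'human')
human_b = Taxonomy.new(id_source: 'ncbi', id_value: '9606', common_name: 'man')
taxonomies_match?(human_a, human_b) # identifiers decide; common names are ignored
```

Because the branches are tried in a fixed order, two records whose
identifiers agree match even when their common names differ.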

Generally speaking, I recommend getting the source code of forester and 
looking at the classes in the org.forester.sdi directory (especially 
SDI.java, SDIse.java, and SDIR.java).


> 
>     2) Here's my assumptions about the final output of the algorithm:
>     Each node in the tree should be updated with speciation OR
>     duplication, and the tree as a whole has a count of
>     speciation/duplication events. Am I on the right track here?

Yes, the primary goal of the algorithm is to calculate for each node in 
the gene tree whether it is a duplication or a speciation, and thus each 
node should be annotated as duplication or speciation.
Keeping track of the sum of duplications and speciations is useful too, 
but these sums cannot, as far as I know, be stored in the tree object itself.
Maybe the algorithm could return a small "SDI_result" object which is 
used to store such "summary" information.
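
One possible shape for such a summary object, sketched in plain Ruby.
The names SDIResult and summarize_events are hypothetical, and the input
is assumed to be a flat list of per-node annotations (:duplication or
:speciation) rather than a real tree object.

```ruby
# Hypothetical summary object for one inference run.
SDIResult = Struct.new(:duplications, :speciations) do
  # Total number of annotated internal nodes.
  def total_events
    duplications + speciations
  end
end

# Builds the summary from a list of per-node event annotations.
def summarize_events(events)
  SDIResult.new(events.count(:duplication), events.count(:speciation))
end

result = summarize_events([:speciation, :duplication, :speciation])
result.duplications  # 1
result.speciations   # 2
result.total_events  # 3
```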


Christian

From ngoto at gen-info.osaka-u.ac.jp  Thu Jun 10 09:46:40 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Thu, 10 Jun 2010 22:46:40 +0900
Subject: [BioRuby] GSoC speciation/duplication inference question
In-Reply-To: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com>
References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com>
Message-ID: <20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp>

Hi,

I think the abbreviation SDI is not common in the field of biology
and bioinformatics. In such cases it is generally better not to
abbreviate, but "speciation/duplication inference" is too long.
For file/directory names, because the length limit is tight,
using the abbreviation is fine.

For the location of files, I suggest
lib/bio/util/evolution/SDI/ or lib/bio/util/phylogeny/SDI/
to show the word SDI is in the field of evolution or phylogeny.

For the class/module namespace,  possible candidates are
Bio::SpeciationDuplicationInference, Bio::Evolution::SDI,
Bio::Algorithm::SDI, but I couldn't determine which is the best.
If you have a good idea, please tell us.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


On Mon, 7 Jun 2010 13:09:07 -0500
Sara Rayburn <sararayburn at gmail.com> wrote:

> Hello,
> 
> While implementing my GSoC project (the speciation/duplication inference algorithm), I'm not sure where to put the module I'm developing in the bioruby library. I've been developing in the lib/bio/util directory, but I want to make sure that's the place you all would prefer for the module. Any suggestions or thoughts?
> 
> Thanks,
> 
> Sara Rayburn
> sararayburn at gmail.com


From kpatil at science.uva.nl  Thu Jun 17 05:22:12 2010
From: kpatil at science.uva.nl (K. Patil)
Date: Thu, 17 Jun 2010 11:22:12 +0200 (CEST)
Subject: [BioRuby] newick to phyloxml
In-Reply-To: <20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp>
References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com>
	<20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <1764.139.19.75.1.1276766532.squirrel@webmail.science.uva.nl>

Hi,

I noticed the inclusion of phyloxml support in bioruby, thanks a lot, it's
very useful. I was wondering if there is any straightforward way to
convert a newick tree to phyloxml?

best


From czmasek at burnham.org  Thu Jun 17 18:49:12 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Thu, 17 Jun 2010 15:49:12 -0700
Subject: [BioRuby] Gene duplications GSoC project: answers to some of your
	questions
Message-ID: <4C1AA668.7040801@burnham.org>

Hi, Sara:

Regarding some of your questions posted on 
http://wiki.github.com/srayburn/bioruby/gsoc-2010-implementing-sdi-project-updates

Re: "Right now initialization loads from a hard coded file. I need to 
make this flexible so that trees can come from any file or from a 
previously loaded tree object":

The input of the algorithm(s) should be tree objects; reading the trees 
should not be part of the algorithm implementation.

Clearly, for testing you need to read the trees from files, but this 
should be implemented in your test code, not as part of the algorithm 
implementation itself.

Re: "The names of leaf nodes: how standard are they? Is there a standard 
format here? I'm going to look at example trees from the forester 
implementation to get ideas about this. If I'm still stumped I'll check 
with my mentors."

No, there is no standard. The only question for the purpose of this 
algorithm is whether they match or not, i.e. the names could be just numbers, 
common names, or scientific names.

Hope this helps,

Christian


From czmasek at burnham.org  Thu Jun 17 23:16:20 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Thu, 17 Jun 2010 20:16:20 -0700
Subject: [BioRuby] newick to phyloxml
In-Reply-To: <1764.139.19.75.1.1276766532.squirrel@webmail.science.uva.nl>
References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com>	<20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp>
	<1764.139.19.75.1.1276766532.squirrel@webmail.science.uva.nl>
Message-ID: <4C1AE504.3000307@burnham.org>

Hi,

Unfortunately, this is not possible in a straightforward way.

The problem is that the tree object (Bio::Tree) returned by:
  input = Bio::FlatFile.open(Bio::Newick, "tree.nh")
  tree = input.next_entry.tree

is the parent type of the tree object (Bio::PhyloXML::Tree) required by:

  writer = Bio::PhyloXML::Writer.new("tree.xml")
  writer.write(phyloxml_tree)


Christian


K. Patil wrote:
> Hi,
> 
> I noticed the inclusion of phyloxml support in bioruby, thanks a lot, it's
> very useful. I was wondering if there is any straightforward way to
> convert a newick tree to phyloxml?
> 
> best
> 


From anurag08priyam at gmail.com  Tue Jun 22 04:46:19 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Tue, 22 Jun 2010 14:16:19 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Update
In-Reply-To: <AANLkTimsRqsi0-Qj0JgWerTwmSgo90pyx0Xky9rGxYC8@mail.gmail.com>
References: <AANLkTimsRqsi0-Qj0JgWerTwmSgo90pyx0Xky9rGxYC8@mail.gmail.com>
Message-ID: <AANLkTilqCdMCe5QIDk09Iw-Din8URsWNI_EGyQkgAfwY@mail.gmail.com>

Hello all,

Much of the parser implementation is complete as of now. When I last
sent an update I had begun implementing the characters element. Week 3 (June
7-13) was quite low on work due to a power shortage where I live. Consequently,
implementation of characters spanned Week 4 (June 14-20) too.

Work on NeXML serialization has begun. As of now it can serialize taxa
blocks. This week (week 5, June 21-28) I will be working on serializing
the trees and characters elements.

I would also like to update a little more on future development plans. I am
targeting to finish much of the software development by week 9 (July 19-25),
leaving week 10, week 11 and week 12 for feedback and iterations. This is
the time where I should make up for any mistakes or lost work. Perhaps in
this period we can make the code ready for merging into BioRuby's master branch.
Apart from this, I am targeting to finish the serializer and start working on
the RDF API by week 6. Maybe we could have a round of code review after that
too? I am giving notice in advance so that, if possible, developers can
allocate time for this. Sounds good?


-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From czmasek at burnham.org  Tue Jun 22 14:29:09 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Tue, 22 Jun 2010 11:29:09 -0700
Subject: [BioRuby] gsoc update
In-Reply-To: <6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com>
References: <4C1AACBA.4030908@burnham.org>
	<6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com>
Message-ID: <4C2100F5.3020306@burnham.org>

Hi, Sara:

Hopefully you and your son are fully recovered now!

To me, Bio::Algorithm::SDI would make the most sense.

Re: "It seems that forester has the assumption built in that any node in 
a tree that has a child must have two children. Is this a property of 
phylogenetic trees?"

Being composed of entirely binary nodes is indeed a property of trees 
produced by most programs for phylogenetic inference. In contrast, if 
multiple (binary) trees are used to calculate a consensus tree (e.g. 
bootstrap resampling), then the resulting consensus tree might contain 
nodes with more than two children (depending on the method of consensus 
tree calculation and the degree of divergence among the resampled 
trees). Furthermore, if (phylogenetic or taxonomic) trees are "manually" 
created (or by various "supertree" approaches), nodes with more than two 
children are oftentimes used to express uncertainty.
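
A quick way to find out whether a given tree is strictly binary is to
check the child count of every internal node. The sketch below is plain
Ruby and assumes a hypothetical representation in which the tree is a
Hash mapping each node to an array of its children (leaves map to an
empty array); it does not use the Bio::Tree API.

```ruby
# Returns the internal nodes that do not have exactly two children.
# +children+ is a Hash of node => array of child nodes (assumed layout).
def non_binary_nodes(children)
  children.select { |_node, kids| !kids.empty? && kids.length != 2 }.keys
end

# True when every internal node has exactly two children.
def strictly_binary?(children)
  non_binary_nodes(children).empty?
end

consensus = {
  root: [:x, :y, :z], # a node with three children, e.g. from a consensus tree
  x: [], y: [], z: []
}
strictly_binary?(consensus)  # false
non_binary_nodes(consensus)  # [:root]
```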

For the purpose of gene duplication inference, it would be particularly 
useful to allow non-binary species trees (expressing uncertainty about 
the tree-of-life and preventing the introduction of spurious duplications).

Re: "For the non-binary case, should I go forward planning to implement 
the algorithm from the Vernot et al. paper or should I be planning to 
extend your algorithm?"

You should plan on working on the SDI algorithm and 'modify' it so that 
it correctly works on non-binary species trees.
Now, this is easier said than done.
A while ago, I developed such an algorithm and implemented it as 
org.forester.sdi.GSDI (for Generalized SDI). You can look at it in file 
org/forester/sdi/GSDI.
Yet, the big issue is that while this algorithm seems to work, I don't 
have a mathematical proof for its correctness.

In any case, I recommend doing the following:
1. Thoroughly test your current implementation of binary SDI (and write 
unit tests for it). For example, does it correctly use the different 
sub-elements of taxonomy for matching, i.e. does it work if both species 
and gene use scientific names for taxonomic identification? Does it 
work if both species and gene use NCBI identifiers for taxonomic 
identification? Does it work if both species and gene use NCBI 
identifiers for taxonomic identification but also have non-matching 
common names (in this case it should use the identifiers and ignore 
common names)? Will it throw an exception if no matching sub-elements of 
taxonomy are present?
2. Perform timing benchmarks. Does it behave similarly (although 
overall slower) to the Java implementation (see Figure 4 in Zmasek and 
Eddy, 2001)? Oftentimes, an unexpected timing benchmark result is an 
indication of an underlying problem.
3. I will look at your implementation as well.
4. Look at org.forester.sdi.GSDI and see if you can understand it and 
test it on paper. If this makes sense to you then we can go ahead and 
plan implementing this within BioRuby.
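
For the timing benchmarks in point 2, Ruby's standard Benchmark module
is probably sufficient. A minimal sketch follows; time_over_sizes is a
hypothetical helper, and the workload in the block is only a placeholder
standing in for a real inference call on trees of the given size.

```ruby
require 'benchmark'

# Runs the given block once per size and returns [size, seconds] pairs.
def time_over_sizes(sizes)
  sizes.map do |size|
    seconds = Benchmark.realtime { yield size }
    [size, seconds]
  end
end

timings = time_over_sizes([100, 200, 400]) do |n|
  # Placeholder workload; replace with the inference on trees with n leaves.
  (1..n).reduce(:+)
end

timings.each do |size, seconds|
  puts format('%6d leaves: %.6f s', size, seconds)
end
```

Plotting the measured seconds against size makes it easy to see whether
the growth looks roughly like that of the reference implementation.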

Christian


Sara Rayburn wrote:
> Hi,
> 
> Well, as far as I can tell, things are looking much, much better.  I'm sorry I got a bit behind, but my son and I have been sick this past week. 
> 
> For the namespace/file locations, the response from the mailing list has been:
> Bio::SpeciationDuplicationInference, Bio::Evolution::SDI,
> Bio::Algorithm::SDI with the files in lib/bio/util/Phylogeny/SDI, or lib/bio/phylo/sdi.rb and Bio::Phylo::SDI
> 
> What do you guys think?
> 
> Also, when I've been in doubt I've looked at the java implementation. It seems that forester has the assumption built in that any node in a tree that has a child must have two children. Is this a property of phylogenetic trees?
> 
> Other than tying up a couple of loose ends, I think the binary case is pretty much wrapped up. Please let me know if there are things I need to modify or rethink.
> 
> For the non-binary case, should I go forward planning to implement the algorithm from the Vernot et al. paper or should I be planning to extend your algorithm? 
> 
> Thanks and again, sorry for getting a bit behind.
> 
> Sara


From rutgeraldo at gmail.com  Wed Jun 23 16:48:02 2010
From: rutgeraldo at gmail.com (Rutger Vos)
Date: Wed, 23 Jun 2010 21:48:02 +0100
Subject: [BioRuby] [GSoC][NeXML and RDF API] Update
In-Reply-To: <AANLkTilqCdMCe5QIDk09Iw-Din8URsWNI_EGyQkgAfwY@mail.gmail.com>
References: <AANLkTimsRqsi0-Qj0JgWerTwmSgo90pyx0Xky9rGxYC8@mail.gmail.com>
	<AANLkTilqCdMCe5QIDk09Iw-Din8URsWNI_EGyQkgAfwY@mail.gmail.com>
Message-ID: <AANLkTilcdN3Q8keRE5SQlEEMZJ43wkFe_CKAticag8Lm@mail.gmail.com>

Hi Anurag,

thanks for the update - your time projection and current progress
sound good. Can you forward this update to the phylosoc (nescent)
list as well?

Thanks,

Rutger

On Tue, Jun 22, 2010 at 9:46 AM, Anurag Priyam <anurag08priyam at gmail.com> wrote:
> Hello all,
>
> Much of the parser implementation is complete as of now. Last time I had
> sent an update I had begun implementing characters element. Week 3( June
> 7-13) was quite low on work due to power shortage where I live. Consequently
> implementation of characters spanned Week 4( June 14-20 ) too.
>
> Work on NeXML serialization has begun. As of now it can serialize taxa
> blocks. This week( week 5 - June 21-28 ) I will be working on serializing
> trees and characters element.
>
> I would also like to update a little more on future development plans. I am
> targeting to finish much of the software development by week 9( July 19-25
> ), leaving week 10, week 11 and week 12 for feedback and iterations. This is
> the time where I should make up for any mistakes or lost work. Perhaps in
> this week we can make the code ready for merging in BioRuby's master branch.
> Apart from this, I am targeting to finish serializer and start working on
> the RDF API by week 6. Maybe we could have a round of code review after that
> too? I am notifying this in advance so that if possible developers can
> allocate time for this. Sounds good?
>
>
> --
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642
>


-- 
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading
RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com

From pjotr.public14 at thebird.nl  Thu Jun 24 09:54:11 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 24 Jun 2010 15:54:11 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
Message-ID: <20100624135411.GA14658@thebird.nl>

Hi Everyone,

I am going to review Anurag's code. Naohisa, and perhaps others, will
join in. 

A quick recap: Anurag is working on implementing an NeXML parser with
RDF support (for the semantic web).

NeXML is an XMLized and improved version of Nexus, and is used for
interchanging sequences, alignments and trees between
programs/services (correct me if I am wrong). A full description of
NeXML can be found at

  https://www.nescent.org/wg_evoinfo/Future_Data_Exchange_Standard

NeXML is an important standard, and very good to have in BioRuby.

Anurag: thanks for the good work, so far. I can see you have put a lot
of work in. And, I like your style. I can see you are a competent
programmer, so you can expect the worst criticism ;) I am going to
start with some high level questions.

Can someone who has worked with NeXML (Rutger) have a look at the
interface description on:

  https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby

It looks natural to query this way, if you know what the NeXML file
contains (e.g. trees, or sequences). What would be the natural
approach if you do *not* know the contents? I.e. how does one iterate
over the NeXML object?

Anurag, your web page states you implemented a LibXML::Parser, and you
named it Parser. Meanwhile, it looks like you have implemented libxml2
streaming, using a Reader. This is a bit confusing. I presume you are
using the technique used in Diana's PhyloXML parser. You are requiring
the 'xml' package. Is that libxml2 these days, or is it actually
'libxml'? Does it work for all Ruby versions? libxml is an external
(binary) dependency, so it may not exist and fail.  PhyloXML does not
handle failure either.

The other high-level questions concern testing. For others, the unit
tests are here:

  http://github.com/yeban/bioruby/tree/master/test/unit/bio/db/nexml/

I notice you have limited test input data. How can you be really sure
your code works for all cases? How can you be really sure that future
changes to the code don't break? And how are you going to measure
performance of your code?

Finally, getting down to some code. Most of the code is in a single
file:

  http://github.com/yeban/bioruby/blob/master/lib/bio/db/nexml/elements.rb

or

  http://github.com/yeban/bioruby/blob/3abfc592e2f7072a8e2970ee077677a9ab7564ae/lib/bio/db/nexml/elements.rb

I think it should be broken up. It would be logical to split by type
of elements - at least. I know in BioRuby we are ambiguous about file
sizes - I think a single file should describe one concept. That way
file names become self describing. Files larger than 300 lines tend
to be hard to digest - and probably point out some bigger issue.

Also, when I look at DnaSeqRow, RnaSeqRow and others derived from
SeqRow (line 2148 and onwards in elements.rb), I can see duplicated
coding 'patterns'. You are repeating a concept. Would there not be a
more elegant way in Ruby to handle this? Hint: Inheritance is just one
mechanism, I see no real reason to use an inheritance tree. Why not
use one Sequence class for all of these which can contain different
formed elements? I bet the code would become a lot shorter and
(probably) less error prone. Take Ruby's Array container class as an
example - it is just one implementation of a container which allows
many types of elements.
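
A minimal sketch of the "one class for all row types" idea, under
stated assumptions: the class name, the ALPHABETS table and its symbol
sets are hypothetical illustrations, not the code in elements.rb and
not the alphabets defined by the NeXML schema.

```ruby
# Hypothetical sketch: a single row class parameterised by data type,
# instead of one subclass per type (DnaSeqRow, RnaSeqRow, ...).
class SeqRow
  # Illustrative symbol sets only; not taken from the NeXML schema.
  ALPHABETS = {
    :dna => 'ACGTN-?',
    :rna => 'ACGUN-?'
  }

  attr_reader :id, :type, :symbols

  def initialize(id, type, sequence)
    alphabet = ALPHABETS.fetch(type) { raise ArgumentError, "unknown type: #{type}" }
    symbols = sequence.upcase
    # Collect every character that is not part of the chosen alphabet.
    bad = symbols.each_char.reject { |c| alphabet.include?(c) }
    raise ArgumentError, "invalid #{type} symbols: #{bad.join}" unless bad.empty?
    @id, @type, @symbols = id, type, symbols
  end
end

dna = SeqRow.new('row1', :dna, 'acgt-n')
rna = SeqRow.new('row2', :rna, 'ACGU')
puts "#{dna.id} (#{dna.type}): #{dna.symbols}"
puts "#{rna.id} (#{rna.type}): #{rna.symbols}"
```

Adding a new data type then means adding one entry to the table
rather than a new class.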

A final comment for this session: The class/method descriptions are
not very informative. It may be early days - especially since we can
see some refactoring coming, but it usually helps to write out
examples giving the 'nicest' interface for people to use. And stick
those in the source code. Personally I favour rubydoctests, see

  http://github.com/tablatom/rubydoctest

I used these in bio/appl/paml/codeml/report.rb - these are examples
that double as tests. Kill two birds with one stone! The BioRuby
tutorial also uses doctests - i.e. the code in the Tutorial can be
validated against the installed bioruby. If you want to use this you
need an extra conversion - I have that tool.

Another possibility is to start using RSpec. 

  http://rspec.info/

I really like RSpec too - it is more of a replacement for unit
tests - and easier to understand, so Specs double as documentation.

I am interested to see what you want to do for RDF support. Maybe you
can write out the API as an RSpec? That would be a good start.

Do not hesitate to stand up to me. You will probably get support from
someone on this list ;)

Pj.

On Thu, Jun 10, 2010 at 01:19:35AM +0530, Anurag Priyam wrote:
> Last week, I worked on finishing implementation of Trees: trees, tree,
> network; and started work on the characters element. This weeks target is to
> complete the implementation of the characters element.
> 
> It would be awesome to have some code review including: implementation, API
> design, coding style and tests. I am planning to give a good amount of time
> in the fourth week in making the code more robust. It would make perfect
> sense to have some feedback to serve as guidelines :). The master branch and
> API discussion page are at:
> 
> [1] http://github.com/yeban/bioruby
> [2]
> https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby
> 
> -- 
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642

From czmasek at burnham.org  Thu Jun 24 22:18:41 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Thu, 24 Jun 2010 19:18:41 -0700
Subject: [BioRuby] gsoc: SDI - unrooted trees
In-Reply-To: <6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com>
References: <4C1AACBA.4030908@burnham.org>
	<6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com>
Message-ID: <4C241201.9060604@burnham.org>

Hi, Sara:

Something I forgot to mention.

As you know, most phylogeny inference methods produce trees which are 
unrooted (these trees might look rooted, but for most methods the root 
is placed randomly, and thus incorrectly).

In the context of duplication inference, a reasonable way to root a 
tree is by placing the root in such a way that the sum of inferred 
duplications is minimized.

The brute force approach to accomplish this is by sequentially placing 
the root on each branch and then running the SDI algorithm on each 
differently rooted tree and retaining the root position which results in 
the smallest sum of duplications.
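
The brute-force approach can be sketched as follows. This is only an
illustration under stated assumptions, not the forester
implementation: the tiny species tree, the gene tree, all names, and
the helper methods are made up, and a node is counted as a duplication
when it maps to the same species-tree node as one of its children
(the usual LCA-mapping criterion).

```ruby
# Species tree as child => parent (the root maps to nil): ((A,B),C).
SPECIES_PARENT = { 'A' => 'AB', 'B' => 'AB', 'AB' => 'ABC', 'C' => 'ABC', 'ABC' => nil }

# Unrooted gene tree as an adjacency list, plus the species of each leaf.
GENE_TREE = {
  'a1' => ['x'], 'a2' => ['x'], 'b1' => ['y'], 'c1' => ['y'],
  'x' => ['a1', 'a2', 'y'], 'y' => ['b1', 'c1', 'x']
}
SPECIES_OF = { 'a1' => 'A', 'a2' => 'A', 'b1' => 'B', 'c1' => 'C' }

# Path from a species-tree node up to the root, starting with the node itself.
def ancestors(node)
  path = []
  while node
    path << node
    node = SPECIES_PARENT[node]
  end
  path
end

# Least common ancestor of two species-tree nodes.
def lca(a, b)
  seen = ancestors(a)
  ancestors(b).find { |n| seen.include?(n) }
end

# Combine child results [mapped node, duplications] into the parent's result.
def join_mappings(mapped)
  here = mapped.map(&:first).inject { |a, b| lca(a, b) }
  dups = mapped.map(&:last).inject(:+)
  dups += 1 if mapped.any? { |m, _| m == here }
  [here, dups]
end

# Map the subtree at +node+, entered from neighbour +from+, onto the species tree.
def map_subtree(node, from)
  children = GENE_TREE[node] - [from]
  return [SPECIES_OF[node], 0] if children.empty?
  join_mappings(children.map { |c| map_subtree(c, node) })
end

# Place the root on every branch in turn and keep the position with the
# fewest inferred duplications. Returns [duplications, [node, node]].
def best_rooting
  branches = GENE_TREE.flat_map { |u, vs| vs.map { |v| [u, v].sort } }.uniq
  scored = branches.map do |u, v|
    _, dups = join_mappings([map_subtree(u, v), map_subtree(v, u)])
    [dups, [u, v]]
  end
  scored.min_by(&:first)
end

dups, branch = best_rooting
puts "root on branch #{branch.join('-')}: #{dups} duplication(s)"
```

Every candidate root triggers a full re-mapping of the gene tree
here, which is exactly the cost the incremental approach avoids.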

A more time efficient approach is possible by realizing that the mapping 
function only changes for a few nodes if the root is moved from one 
branch to a neighboring one.
  	
This approach is implemented in org.forester.sdi.SDIR.

Besides extending the algorithm to work on non-binary trees, this is 
another useful extension which you might think about tackling.

Christian


From anurag08priyam at gmail.com  Fri Jun 25 02:23:34 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Fri, 25 Jun 2010 11:53:34 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100624135411.GA14658@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
Message-ID: <AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>

> Can someone who has worked with NeXML (Rutger) have a look at the
> interface description on:
>
>  https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby
>
> It looks natural to query this way, if you know what the NeXML file
> contains (e.g. trees, or sequences). What would be the natural
> approach if you do *not* know the contents? I.e. how does one iterate
> over the NeXML object?
>
>
NeXML has three primary elements: otus, trees and characters. All three are
containers for other elements: otu, tree, network and matrix. Currently the
each method of a nexml object iterates over each tree object. I did this
thinking that the tree is the most important part of a phylogenetic analysis
(and also because I had not implemented characters then). What were you
thinking here? Should each iterate over all otu, tree and matrix elements, or
over the primary otus, trees and characters elements? I would go for the latter.
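
To make the second option concrete, here is a rough sketch of an each
that yields the primary blocks in document order. The class and method
names (NexmlDocument, add, of_kind) are hypothetical and only stand in
for whatever the Bio::NeXML classes end up providing.

```ruby
# Hypothetical container; the real Bio::NeXML classes may look different.
class NexmlDocument
  include Enumerable

  def initialize
    @blocks = []   # otus, trees and characters blocks, in document order
  end

  # Register a top-level block; returns self so calls can be chained.
  def add(kind, id)
    @blocks << { :kind => kind, :id => id }
    self
  end

  # Yield every top-level block, whatever its kind.
  def each(&block)
    @blocks.each(&block)
  end

  # Filter built on top of each, for callers who do know what they want.
  def of_kind(kind)
    select { |b| b[:kind] == kind }
  end
end

doc = NexmlDocument.new
doc.add(:otus, 'taxa1').add(:trees, 'trees1').add(:characters, 'chars1')

# A caller who does not know the contents can simply walk the document.
doc.each { |b| puts "#{b[:kind]}: #{b[:id]}" }
```

Because each is defined and Enumerable is mixed in, map, select and
friends come for free.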


> Anurag, your web page states you implemented a LibXML::Parser, and you
> named it Parser. Meanwhile, it looks like you have implemented libxml2
> streaming, using a Reader. This is a bit confusing. I presume you are
> using the technique used in Diana's PhyloXML parser. You are requiring
> the 'xml' package. Is that libxml2 these days, or is it actually
> 'libxml'? Does it work for all Ruby versions? libxml is an external
> (binary) dependency, so it may not exist and fail.  PhyloXML does not
> handle failure either.
>
>
I am glad you asked this. I wanted to discuss it here.

I have used the libxml2 streaming API, without actually streaming the document
to the user. The cursor does not move through the document when you iterate
over elements (phyloxml does that). I am parsing the document in one go, at
the start, and storing the objects in memory. Should we want to switch to
streaming, using libxml's streaming API from the start should make it easier.

Yes it is libxml2 these days. The site states that it works with ruby 1.8. I
am myself working with 1.8.7. I will have to test the compatibility with
ruby 1.9.


> The other high-level questions concern testing. For others, the unit
> tests are here:
>
>  http://github.com/yeban/bioruby/tree/master/test/unit/bio/db/nexml/
>
> I notice you have limited test input data. How can you be really sure
> your code works for all cases? How can you be really sure that future
> changes to the code don't break?


Right. I am working on improving the test suites taking lessons from the
other bioruby test suites.


> And how are you going to measure
> performance of your code?
>
>
Actually I have not done anything here. I will benchmark and profile the
code and discuss the results here.

> Finally, getting down to some code. Most of the code is in a single
> file:
>
>  http://github.com/yeban/bioruby/blob/master/lib/bio/db/nexml/elements.rb
>
> or
>
>
> http://github.com/yeban/bioruby/blob/3abfc592e2f7072a8e2970ee077677a9ab7564ae/lib/bio/db/nexml/elements.rb
>
> I think it should be broken up. It would be logical to split by type
> of elements - at least. I know in BioRuby we are ambiguous about file
> sizes - I think a single file should describe one concept. That way
> file names become self describing. Files larger than 300 lines tend
> to be hard to digest - and probably point out some bigger issue.
>
>
Agreed.


> Also, when I look at DnaSeqRow, RnaSeqRow and others derived from
> SeqRow (line 2148 and onwards in element.rb), I can see duplicated
> coding 'patterns'. You are repeating a concept. Would there not be a
> more elegant way in Ruby to handle this? Hint: Inheritance is just one
> mechanism, I see no real reason to use an inheritance tree. Why not
> use one Sequence class for all of these which can contain different
> formed elements? I bet the code would become a lot shorter and
> (probably) less error prone. Take Ruby's Array container class as an
> example - it is just one implementation of a container which allows
> many types of elements.
>

The idea here was to implement a type system and stick close to the class
hierarchy followed in the schema. However, looking back, I myself do not
find the code for the Matrix class very elegant.

> A final comment for this session: The class/method descriptions are
> not very informative. It may be early days - especially since we can
> see some refactoring coming, but it usually helps to write out
> examples giving the 'nicest' interface for people to use. And stick
> those in the source code. Personally I favour rubydoctests, see
>
>  http://github.com/tablatom/rubydoctest
>
>
Hey, I did not know that doctests existed for Ruby too. I will have a look
into it.


> I used these in bio/appl/paml/codeml/report.rb - these are examples
> that double as tests. Kill two birds with one stone! The BioRuby
> tutorial also uses doctests - i.e. the code in the Tutorial can be
> validated against the installed bioruby. If you want to use this you
> need an extra conversion - I have that tool.
>

I will check out the examples. What tool? I would like to know more.


>
> Another possibility is to start using RSpec.
>
>  http://rspec.info/
>
> I really like RSpec too - it is more of a replacement for unit
> tests - and easier to understand, so Specs double as documentation.
>
>
I miss RSpec too, from my Rails and Merb days. I picked up unit tests
because much of the framework already used them and also because I wanted to
try them out :).


> I am interested to see what you want to do for RDF support. Maybe you
> can write out the API as an RSpec? That would be a good start.
>
>
That sounds like a nice idea.


> Do not hesitate to stand up to me. You will probably get support from
> someone on this list ;)
>
> Pj.
>
> On Thu, Jun 10, 2010 at 01:19:35AM +0530, Anurag Priyam wrote:
> > Last week, I worked on finishing implementation of Trees: trees, tree,
> > network; and started work on the characters element. This weeks target is to
> > complete the implementation of the characters element.
> >
> > It would be awesome to have some code review including: implementation, API
> > design, coding style and tests. I am planning to give a good amount of time
> > in the fourth week in making the code more robust. It would make perfect
> > sense to have some feedback to serve as guidelines :). The master branch and
> > API discussion page are at:
> >
> > [1] http://github.com/yeban/bioruby
> > [2]
> > https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby
> >
> > --
> > Anurag Priyam,
> > 2nd Year Undergraduate,
> > Department of Mechanical Engineering,
> > IIT Kharagpur.
> > +91-9775550642
>


-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From pjotr.public14 at thebird.nl  Fri Jun 25 02:46:05 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 25 Jun 2010 08:46:05 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
Message-ID: <20100625064605.GA22887@thebird.nl>

(splitting up the discussion)

On Fri, Jun 25, 2010 at 11:53:34AM +0530, Anurag Priyam wrote:
> Should each iterate over all otu, tree and matrix or the
> primary otus, trees and characters elements? I would go for the latter.

I think Rutger should answer this.

Pj.

From pjotr.public14 at thebird.nl  Fri Jun 25 02:49:11 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 25 Jun 2010 08:49:11 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
Message-ID: <20100625064911.GB22887@thebird.nl>

> I have used libxml2 streaming api, without actually streaming the document
> to the user. The cursor does not move through the document when you iterate
> over elements( phyloxml does that ). I am parsing the document at one go; at
> the start, and storing the objects in memory. Should we want to switch to
> streaming, using libxml's streaming API from start should make it easier.
> 
> Yes it is libxml2 these days. The site states that it works with ruby 1.8. I
> am myself working with 1.8.7. I will have to test the compatibility with
> ruby 1.9.

OK, glad to see that libxml is a standard package these days -
though it has some horrific error handling. At least it is fast.

How much time would it cost you to stream the data - and what does it
mean with regard to changing the API? I guess, in general, NeXML
files won't be that large, so it may not be that important (Rutger)?

Pj.


From pjotr.public14 at thebird.nl  Fri Jun 25 02:51:58 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 25 Jun 2010 08:51:58 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
Message-ID: <20100625065158.GC22887@thebird.nl>

> > I notice you have limited test input data. How can you be really sure
> > your code works for all cases? How can you be really sure that future
> > changes to the code don't break?
> 
> Right. I am working on improving the test suites taking lessons from the
> other bioruby test suites.

Unit tests are one approach. How about adding some regression tests
on larger files? When you have output that should be a good idea. We
don't like large datasets in the bioruby tree, but there are two ways
around that - create a special branch on github, or pull the data on
demand (though Naohisa may frown on that). Ask Diana what she has
done.

> Actually I have not done anything here. I will benchmark and profile the
> code and discuss the results here.

Diana created a special profiling branch. It was really helpful to
profile.

Pj.


From pjotr.public14 at thebird.nl  Fri Jun 25 02:55:39 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 25 Jun 2010 08:55:39 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
Message-ID: <20100625065539.GD22887@thebird.nl>

On Fri, Jun 25, 2010 at 11:53:34AM +0530, Anurag Priyam wrote:
> The idea here was to implement a type system and stick close to the class
> hierarchy followed in the schema. However, looking back, I myself do not
> find the code for the Matrix class very elegant.

Over 3000 lines of code for an XML parser sends out alarm bells. If
you have the right testing files it should be easy to refactor. Make
it simpler. Also, when parsing this type of XML some Ruby reflection
may come in handy - I did some of that in my BioRuby GEO parser, which
lives in my GEO branch on github.  You should look at each class and
see if you can refactor it down to a single solution. Just make sure
it is not at the expense of readability and understanding.
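
One way reflection can cut down repeated parsing code is to dispatch
on the element name. The sketch below is an assumption-laden
illustration, not the GEO parser's code: ElementBuilder, start_element
and the build_* methods are hypothetical names, and the element name
and attributes are whatever the XML reader would report.

```ruby
# Hypothetical sketch: pick a handler method from the element name.
class ElementBuilder
  attr_reader :built

  def initialize
    @built = []
  end

  # Called once per start tag with the tag name and its attributes.
  def start_element(name, attrs)
    handler = "build_#{name}"
    if respond_to?(handler, true)   # true: also look at private methods
      @built << send(handler, attrs)
    else
      @built << [:unknown, name]    # unhandled elements are recorded, not dropped
    end
  end

  private

  # One small builder per element type; supporting a new element means
  # adding one method, with no change to start_element.
  def build_otu(attrs)
    [:otu, attrs['id']]
  end

  def build_tree(attrs)
    [:tree, attrs['id']]
  end
end

builder = ElementBuilder.new
builder.start_element('otu', { 'id' => 't1' })
builder.start_element('tree', { 'id' => 'tree1' })
builder.start_element('meta', {})
p builder.built
```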

Post us some ideas here, before you start hacking code.

Pj.


From pjotr.public14 at thebird.nl  Fri Jun 25 03:08:04 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 25 Jun 2010 09:08:04 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
Message-ID: <20100625070804.GE22887@thebird.nl>

> >  http://github.com/tablatom/rubydoctest
> >
> >
> Hey, I did not know that doctests existed for Ruby too. I will have a look
> into it.

They are good; however, finding bugs is a bit problematic as the stack
traces are lengthy and often not descriptive. So with troublesome code
I tend to write extra unit tests. Also, with BioRuby we have not
settled on doctests yet, so you need to reach coverage with unit
tests and/or Specs.

I really think it is good for validating documentation.

> > I used these in bio/appl/paml/codeml/report.rb - these are examples
> > that double as tests. Kill two birds with one stone! The BioRuby
> > tutorial also uses doctests - i.e. the code in the Tutorial can be
> > validated against the installed bioruby. If you want to use this you
> > need an extra conversion - I have that tool.
> >
> 
> I will check out the examples. What tool? I would like to know more.

It simply parses out commented code in the source headers, and turns
them over to rubydoctest. The tool is in my bioruby-support tree on
github - see 

  http://github.com/pjotrp/bioruby-support/blob/master/bin/uncomment_doctest

you can see it uses an environment variable.

> I am missing Rspec too from my Rails and Merb days. I picked up unit tests
> because much of the framework had used the same and also because I wanted to
> try it out :).
> 
> 
> > I am interested to see what you want to do for RDF support. Maybe you
> > can write out the API as an RSpec? That would be a good start.
> >
> >
> That sounds like a nice idea.

RSpec is new for BioRuby. Since you have experience you are the right
one to introduce it to us ;). If it is convincing to the others we
may accept it as standard use (personally I think it is a step
forward from unit testing - unit tests are not very good as
documentation).

Pj.

From anurag08priyam at gmail.com  Fri Jun 25 03:34:21 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Fri, 25 Jun 2010 13:04:21 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100625064911.GB22887@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625064911.GB22887@thebird.nl>
Message-ID: <AANLkTinQ5P28dHE4f3gaRaSMQAbctD0UrtA69j8uwqvF@mail.gmail.com>

On Fri, Jun 25, 2010 at 12:19 PM, Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> > I have used libxml2 streaming api, without actually streaming the document
> > to the user. The cursor does not move through the document when you iterate
> > over elements( phyloxml does that ). I am parsing the document at one go; at
> > the start, and storing the objects in memory. Should we want to switch to
> > streaming, using libxml's streaming API from start should make it easier.
> >
> > Yes it is libxml2 these days. The site states that it works with ruby 1.8. I
> > am myself working with 1.8.7. I will have to test the compatibility with
> > ruby 1.9.
>
> OK, glad to see that libxml is a standard package these days -
> though it has some horrific error handling. At least it is fast.
>
>
Yes, it is fast, but it has its own share of bugs. I have now started working
on the ruby-libxml code myself and helping to maintain it.


> How much time would it cost you to stream the data - and what does it
> mean with regard to changing the API? I guess, in general, NeXML
> files won't be that large, so it may not be that important (Rutger)?
>
> Pj.
>
>
I mean switching the parsing implementation from "parsing at the start" to
streaming, not changing the API. Using the Reader API rather than the DOM API
would simply help with that switch. Even if we do not switch, the Reader API
offers a more memory-efficient solution than the DOM API.

Btw, I am not in favour of the switch. You cannot move backwards in the
document that way. I cannot fetch a tree by id if the cursor is already ahead
of that tree. Doing nexml.each_characters and nexml.each_trees is impossible
with pure streaming: I will have to stream one while caching the other. Otus
and otu have a one-to-many relation with trees and characters, and with rows.
An API call of the type otus.trees or otus.characters or otu.sequences would be
impossible( not that I have already added the API call ). Imo, NeXML is
non-linear and not meant to be streamed. Besides, other NeXML implementations
also parse the file at the start.
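
The difference can be shown without any XML library. In this sketch a
plain Ruby Enumerator stands in for the forward-only reader cursor;
the event list and the method names (stream_find, build_index) are
made up for illustration and are not part of libxml or of the NeXML
code.

```ruby
# A made-up event list; EVENTS.each gives a forward-only cursor over it.
EVENTS = [
  [:trees, 'trees1'], [:tree, 'tree1'], [:tree, 'tree2'],
  [:characters, 'chars1']
]

# Pure streaming: advance until the id is found. Items already passed
# cannot be revisited.
def stream_find(cursor, id)
  loop do
    kind, found = cursor.next
    return kind if found == id
  end
  nil   # reached once the cursor is exhausted (loop rescues StopIteration)
end

# Parsing at the start: drain the cursor once and index everything by id.
def build_index(cursor)
  index = {}
  loop do
    kind, id = cursor.next
    index[id] = kind
  end
  index
end

cursor = EVENTS.each
p stream_find(cursor, 'tree2')   # => :tree
p stream_find(cursor, 'tree1')   # => nil, the cursor is already past it

index = build_index(EVENTS.each)
p index['tree2']                 # => :tree
p index['tree1']                 # => :tree, look-up order no longer matters
```

The index costs memory proportional to the document, which is the
trade-off under discussion.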

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From ngoto at gen-info.osaka-u.ac.jp  Fri Jun 25 03:15:58 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Fri, 25 Jun 2010 16:15:58 +0900
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100625065158.GC22887@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625065158.GC22887@thebird.nl>
Message-ID: <20100625071558.9CAB01CBC5B0@idnmail.gen-info.osaka-u.ac.jp>


Most of the special testing program created by Diana for
PhyloXML is now in sample/test_phyloxml_big.rb, i.e. it is
now regarded as a sample script.

To run the program, for example,
 % mkdir /tmp/phyloxml
 % ruby sample/test_phyloxml_big.rb /tmp/phyloxml -v

It executes round-trip tests for large PhyloXML files.
Data files are downloaded from the internet and are stored
in a directory specified by the user.
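
For readers unfamiliar with the term, a round-trip test checks that
parsing a file and writing it out again reproduces the input. The
sketch below shows the idea on a toy bracket notation; parse,
serialize and round_trip_ok? are hypothetical stand-ins and have
nothing to do with the actual code in sample/test_phyloxml_big.rb.

```ruby
# Toy serializer: nested arrays become a bracketed, comma-separated string.
def serialize(tree)
  tree.is_a?(Array) ? "(#{tree.map { |t| serialize(t) }.join(',')})" : tree.to_s
end

# Toy parser for the same notation, using a stack of open groups.
def parse(text)
  stack = [[]]
  text.scan(/[(),]|[^(),]+/) do |tok|
    case tok
    when '('
      stack.push([])
    when ')'
      done = stack.pop
      stack.last << done
    when ','
      next
    else
      stack.last << tok
    end
  end
  stack.first.first
end

# The round trip: text -> objects -> text must give back the input.
def round_trip_ok?(text)
  serialize(parse(text)) == text
end

['((a,b),c)', '(a,(b,(c,d)))'].each do |input|
  puts "#{input}: #{round_trip_ok?(input) ? 'ok' : 'FAILED'}"
end
```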

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Fri, 25 Jun 2010 08:51:58 +0200
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> > Actually I have not done anything here. I will benchmark and profile the
> > code and discuss the results here.
> 
> Diana created a special profiling branch. It was really helpful to
> profile.
> 
> Pj.
> 


-- 
Naohisa GOTO  ngoto at gen-info.osaka-u.ac.jp
Phone: 06-6879-8365 / FAX: 06-6879-2047

From pjotr.public14 at thebird.nl  Fri Jun 25 03:42:13 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 25 Jun 2010 09:42:13 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinQ5P28dHE4f3gaRaSMQAbctD0UrtA69j8uwqvF@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625064911.GB22887@thebird.nl>
	<AANLkTinQ5P28dHE4f3gaRaSMQAbctD0UrtA69j8uwqvF@mail.gmail.com>
Message-ID: <20100625074213.GA27044@thebird.nl>

I think this needs to be answered by Rutger. Are we going to face
NeXML files in the future that can easily outrun memory?

Pj.

On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote:
> > How much time would it cost you to stream the data - and what does it
> > mean with regard to changing the API? I guess, in general, NeXML
> > files won't be that large, so it may not be that important (Rutger)?
> >
> > Pj.
> >
> >
> I mean switching the parsing implementation to streaming from "parsing at
> the start" and not the API. Just that using Reader API over the DOM API
> would help in the switch. Even if we do not switch, the Reader API offers a
> more memory efficient solution than the DOM API.
> 
> Btw, I am not in favour of the switch. You cannot move backwards in document
> that way. I can not fetch a tree by id if the cursor is ahead of that
> tree. Doing nexml.each_characters and nexml.each_trees is impossible with
> pure streaming. I will have to stream one while cache the other. Otus and
> otu provide a one to many relation with trees and characters, and rows. An
> API call of the type otus.trees or otus.characters or otu.sequences would be
> impossible( not that I have already added the API call ). Imo, NeXML is
> non-linear and not meant to be streamed. Besides, other NeXML implementations
> also parse the file at the start.
> 
> -- 
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642

From rutgeraldo at gmail.com  Fri Jun 25 04:14:44 2010
From: rutgeraldo at gmail.com (Rutger Vos)
Date: Fri, 25 Jun 2010 09:14:44 +0100
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100625074213.GA27044@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625064911.GB22887@thebird.nl>
	<AANLkTinQ5P28dHE4f3gaRaSMQAbctD0UrtA69j8uwqvF@mail.gmail.com>
	<20100625074213.GA27044@thebird.nl>
Message-ID: <AANLkTimW-P4ijZ4TI7IRyg8RFcrvC0dv-2C8Yb8vSbKn@mail.gmail.com>

This is very possible (and it's why Anurag has been focusing on
stream-based parsing) but I am personally of the opinion that worrying
too much about that right now would be a premature optimization. It
seems to me that we want to get a nice interface that captures what
NeXML can express first, and worry about performance and memory
footprint later - but that's just my own opinion and certainly open
for discussion.

On Fri, Jun 25, 2010 at 8:42 AM, Pjotr Prins <pjotr.public14 at thebird.nl> wrote:
> I think this needs to be answered by Rutger. Are we going to face
> NeXML files in the future that can easily outrun memory?
>
> Pj.
>
> On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote:
>> > How much time would it cost you to stream the data - and what does it
>> > mean with regard to changing the API? I guess, in general, NeXML
>> > files won't be that large, so it may not be that important (Rutger)?
>> >
>> > Pj.
>> >
>> >
>> I mean switching the parsing implementation to streaming from "parsing at
>> the start" and not the API. Just that using Reader API over the DOM API
>> would help in the switch. Even if we do not switch, the Reader API offers a
>> more memory efficient solution than the DOM API.
>>
>> Btw, I am not in favour of the switch. You cannot move backwards in document
>> that way. I can not fetch a tree by id if the cursor is ahead of that
>> tree. Doing nexml.each_characters and nexml.each_trees is impossible with
>> pure streaming. I will have to stream one while cache the other. Otus and
>> otu provide a one to many relation with trees and characters, and rows. An
>> API call of the type otus.trees or otus.characters or otu.sequences would be
>> impossible( not that I have already added the API call ). Imo, NeXML is
>> non-linear and not meant to be streamed. Besides, other NeXML implementations
>> also parse the file at the start.
>>
>> --
>> Anurag Priyam,
>> 2nd Year Undergraduate,
>> Department of Mechanical Engineering,
>> IIT Kharagpur.
>> +91-9775550642
>


-- 
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading
RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com

From pjotr.public14 at thebird.nl  Fri Jun 25 04:38:36 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 25 Jun 2010 10:38:36 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTimW-P4ijZ4TI7IRyg8RFcrvC0dv-2C8Yb8vSbKn@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625064911.GB22887@thebird.nl>
	<AANLkTinQ5P28dHE4f3gaRaSMQAbctD0UrtA69j8uwqvF@mail.gmail.com>
	<20100625074213.GA27044@thebird.nl>
	<AANLkTimW-P4ijZ4TI7IRyg8RFcrvC0dv-2C8Yb8vSbKn@mail.gmail.com>
Message-ID: <20100625083836.GA28214@thebird.nl>

On Fri, Jun 25, 2010 at 09:14:44AM +0100, Rutger Vos wrote:
> This is very possible (and it's why Anurag has been focusing on
> stream-based parsing) but I am personally of the opinion that worrying
> too much about that right now would be a premature optimization. It
> seems to me that we want to get a nice interface that captures what
> NeXML can express first, and worry about performance and memory
> footprint later - but that's just my own opinion and certainly open
> for discussion.

Oh, I agree about implementation. But it does mean Anurag needs to
change his preferred solution (like back-tracking in the tree).

Pj.


From sararayburn at gmail.com  Fri Jun 25 14:57:03 2010
From: sararayburn at gmail.com (Sara Rayburn)
Date: Fri, 25 Jun 2010 13:57:03 -0500
Subject: [BioRuby] GSoC speciation/duplication inference question
In-Reply-To: <D8EBFDB3-5F9B-441D-97C2-6785621B50FE@hgc.jp>
References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com>
	<20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp>
	<D8EBFDB3-5F9B-441D-97C2-6785621B50FE@hgc.jp>
Message-ID: <00A7DE6C-2985-4173-A302-619429F964BB@gmail.com>

Hi,

I think between the list response and conversations with my mentor, I would probably go with 
Bio::Algorithm::SDI, with the files in lib/bio/util/phylogeny/SDI/

I can definitely see the others as good possibilities, though. If anyone objects to this naming, please let me know so I can change it.

Thanks,

Sara Rayburn
sararayburn at gmail.com 

On Jun 15, 2010, at 9:06 PM, Toshiaki Katayama wrote:

> Hi,
> 
> Replying personally as I delayed to find this thread.
> 
> I prefer something like lib/bio/phylo/sdi.rb and Bio::Phylo::SDI. How about gathering the other phyloinformatics modules under the same directory as well?
> 
> Toshiaki
> 
> 
> On 2010/06/10, at 22:46, Naohisa GOTO wrote:
> 
>> Hi,
>> 
>> I think the abbreviation SDI is not common in the field of biology
>> and bioinformatics. In this case, it is generally good not to
>> abbreviate, but the "speciation/duplication inference" is too long.
>> For file/directory names, because the length limit is tight,
>> using abbreviation is good.
>> 
>> For the location of files, I suggest
>> lib/bio/util/evolution/SDI/ or lib/bio/util/phylogeny/SDI/
>> to show the word SDI is in the field of evolution or phylogeny.
>> 
>> For the class/module namespace,  possible candidates are
>> Bio::SpeciationDuplicationInference, Bio::Evolution::SDI,
>> Bio::Algorithm::SDI, but I couldn't determine which is the best.
>> If you have a good idea, please tell us.
>> 
>> Naohisa Goto
>> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>> 
>> 
>> On Mon, 7 Jun 2010 13:09:07 -0500
>> Sara Rayburn <sararayburn at gmail.com> wrote:
>> 
>>> Hello,
>>> 
>>> While implementing my GSoC project (the speciation/duplication inference algorithm), I'm not sure where to put the module I'm developing in the bioruby library. I've been developing in the lib/bio/util directory, but I want to make sure that's the place you all would prefer for the module. Any suggestions or thoughts?
>>> 
>>> Thanks,
>>> 
>>> Sara Rayburn
>>> sararayburn at gmail.com
>> 
>> 
>> _______________________________________________
>> BioRuby Project - http://www.bioruby.org/
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
> 


From anurag08priyam at gmail.com  Sat Jun 26 07:35:34 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Sat, 26 Jun 2010 17:05:34 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100625070804.GE22887@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625070804.GE22887@thebird.nl>
Message-ID: <AANLkTimB8_u8UsvGha0JF1z0riC96qzkC2H127vIQs7J@mail.gmail.com>

On Fri, Jun 25, 2010 at 12:38 PM, Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> > >  http://github.com/tablatom/rubydoctest
> > >
> > >
> > Hey, I did not know that doctests existed for Ruby too. I will have a
> > look into it.
>
> They are good, however finding bugs is a bit problematic as the stack
> traces are lengthy and often not descriptive. So with troubling code
> I tend to write extra unit tests. Also, with BioRuby we have not
> settled on doctests yet, so you need to reach coverage with unit
> tests and/or Specs.
>
> I really think it is good for validating documentation.
>
> > > I used these in bio/appl/paml/codeml/report.rb - these are examples
> > > that double as tests. Kill two birds with one stone! The BioRuby
> > > tutorial also uses doctests - i.e. the code in the Tutorial can be
> > > validated against the installed bioruby. If you want to use this you
> > > need an extra conversion - I have that tool.
> > >
> >
> > I will check out the examples. What tool? I would like to know more.
>
> It simply parses out commented code in the source headers, and turns
> them over to rubydoctest. The tool is in my bioruby-support tree on
> github - see
>
>
> http://github.com/pjotrp/bioruby-support/blob/master/bin/uncomment_doctest
>
> you can see it uses an environment variable.
>
>
Perfect. I will use it when expanding the documentation.


> > I am missing RSpec too from my Rails and Merb days. I picked up unit
> > tests because much of the framework had used the same and also because
> > I wanted to try it out :).
> >
> >
> > > I am interested to see what you want to do for RDF support. Maybe you
> > > can write out the API as an RSpec? That would be a good start.
> > >
> > >
> > That sounds like a nice idea.
>
> RSpec is new for BioRuby. Since you have experience you are the right
> one to introduce it to us ;). If it is convincing to the others we
> may accept it as standard use (personally I think it is a step
> forward from unit testing - unit tests are not very good as
> documentation).
>
>
I am willing to use RSpec for the RDF API part. Converting the unit tests
I have already written to RSpec does not sound like a good idea, though?

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From pjotr.public14 at thebird.nl  Sat Jun 26 08:19:02 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sat, 26 Jun 2010 14:19:02 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTimB8_u8UsvGha0JF1z0riC96qzkC2H127vIQs7J@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625070804.GE22887@thebird.nl>
	<AANLkTimB8_u8UsvGha0JF1z0riC96qzkC2H127vIQs7J@mail.gmail.com>
Message-ID: <20100626121902.GA5700@thebird.nl>

On Sat, Jun 26, 2010 at 05:05:34PM +0530, Anurag Priyam wrote:
> I am willing to use RSpec for the RDF API part. Converting the unit tests
> I have already written to RSpec does not sound like a good idea, though?

No need. Do the RDF as a proof-of-concept for the rest of BioRuby.
Unit tests will (always) remain.

Pj.

From hlapp at drycafe.net  Sat Jun 26 20:30:19 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Sat, 26 Jun 2010 17:30:19 -0700
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100625074213.GA27044@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625064911.GB22887@thebird.nl>
	<AANLkTinQ5P28dHE4f3gaRaSMQAbctD0UrtA69j8uwqvF@mail.gmail.com>
	<20100625074213.GA27044@thebird.nl>
Message-ID: <F101AC28-0256-49ED-9D1A-A9477C8A7349@drycafe.net>

Our ability to reconstruct trees of hundreds, thousands, and even tens  
of thousands of characters has improved dramatically over the past  
couple of years, and is increasingly often the goal of an analysis.  
Genome-scale alignments also aren't so rare anymore.

Aside from analysis, NeXML files can be produced by a database, and  
hence could hold large taxonomies, or the tree of life.

NeXML is an emerging standard. If implementations can't cope with the  
large scale data that are becoming increasingly popular, it'll have a  
hard time to get uptake.

	-hilmar

On Jun 25, 2010, at 12:42 AM, Pjotr Prins wrote:

> I think this needs to be answered by Rutger. Are we going to face
> NeXML files in the future that can easily outrun memory?
>
> Pj.
>
> On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote:
>>> How much time would it cost you to stream the data - and what does it
>>> mean with regard to changing the API? I guess, in general, NeXML
>>> files won't be that large, so it may not be that important (Rutger)?
>>>
>>> Pj.
>>>
>>>
>> I mean switching the parsing implementation to streaming from "parsing at
>> the start", not switching the API. Using the Reader API rather than the DOM
>> API would simply make that switch easier. Even if we do not switch, the
>> Reader API offers a more memory-efficient solution than the DOM API.
>>
>> Btw, I am not in favour of the switch. You cannot move backwards in the
>> document that way. I cannot fetch a tree by id if the cursor is already
>> ahead of that tree. Doing nexml.each_characters and nexml.each_trees is
>> impossible with pure streaming; I would have to stream one while caching
>> the other. Otus and otu have a one-to-many relation with trees, characters,
>> and rows. An API call of the type otus.trees, otus.characters or
>> otu.sequences would be impossible (not that I have already added these API
>> calls). Imo, NeXML is non-linear and not meant to be streamed. Besides,
>> other NeXML implementations also parse the file at the start.
>>
>> -- 
>> Anurag Priyam,
>> 2nd Year Undergraduate,
>> Department of Mechanical Engineering,
>> IIT Kharagpur.
>> +91-9775550642

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================


From pjotr.public14 at thebird.nl  Sun Jun 27 02:47:31 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sun, 27 Jun 2010 08:47:31 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <F101AC28-0256-49ED-9D1A-A9477C8A7349@drycafe.net>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625064911.GB22887@thebird.nl>
	<AANLkTinQ5P28dHE4f3gaRaSMQAbctD0UrtA69j8uwqvF@mail.gmail.com>
	<20100625074213.GA27044@thebird.nl>
	<F101AC28-0256-49ED-9D1A-A9477C8A7349@drycafe.net>
Message-ID: <20100627064731.GA15508@thebird.nl>

Thanks Rutger and Hilmar,

Anurag, let's not load everything in memory.

Pj.


From ngoto at gen-info.osaka-u.ac.jp  Sun Jun 27 03:45:43 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Sun, 27 Jun 2010 16:45:43 +0900
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100627064731.GA15508@thebird.nl>
References: <F101AC28-0256-49ED-9D1A-A9477C8A7349@drycafe.net>
	<20100627064731.GA15508@thebird.nl>
Message-ID: <20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp>

Hi,

I think the ability to handle large data and the question of whether or
not to load all data into memory at once are essentially independent.
Not loading everything in memory does not guarantee the ability to handle
large data, due to the disk I/O bottleneck and memory management
overhead.

I think it is currently OK to depend on memory. The price of memory is
gradually going down, and I think buying a machine with huge memory
could be a solution to treat large data.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

> Thanks Rutger and Hilmar,
> 
> Anurag, let's not load everything in memory.
> 
> Pj.
> 


From pjotr.public14 at thebird.nl  Sun Jun 27 04:43:22 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sun, 27 Jun 2010 10:43:22 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp>
References: <F101AC28-0256-49ED-9D1A-A9477C8A7349@drycafe.net>
	<20100627064731.GA15508@thebird.nl>
	<20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp>
Message-ID: <20100627084322.GA18815@thebird.nl>

On Sun, Jun 27, 2010 at 04:45:43PM +0900, Naohisa Goto wrote:
> Hi,
> 
> I think the ability to handle large data and the question of whether
> or not to load all data into memory at once are essentially
> independent.  Not loading everything in memory does not guarantee
> the ability to handle large data, due to the disk I/O bottleneck and
> memory management overhead.

Well, depends on what you plan to do with that data :). I think you
are saying that streaming data may not be efficient, for example for
treating alignments. That could be true. However, I think the default
strategy should be non-memory bound, if possible. Throughout BioRuby
the strategy is the opposite, at the moment. For example, by default
FASTA files are loaded in RAM. Same for BLAST XML. I regularly have
files that exceed RAM and work around these limitations. I don't think
this should be the *default* strategy.

I prefer the Unix way of using pipes. Only use memory when it is
available.

With new code we should design for big data. If it is done from the
start, it takes no real effort. 
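
As a minimal sketch of what a non-memory-bound default could look like
(illustrative only; each_fasta_entry is a made-up name, not existing BioRuby
API), a reader can yield one record at a time from any IO object, so it also
works at the end of a Unix pipe:

```ruby
require 'stringio'

# Hypothetical helper, not part of BioRuby: yields [header, sequence]
# pairs one at a time, holding only the current record in memory.
def each_fasta_entry(io)
  return enum_for(:each_fasta_entry, io) unless block_given?

  header = nil
  seq = +''
  io.each_line do |line|
    line = line.chomp
    if line.start_with?('>')
      # A new header closes the previous record, if there was one.
      yield header, seq unless header.nil?
      header = line[1..-1]
      seq = +''
    elsif header
      seq << line.strip
    end
  end
  # Emit the last record once the input is exhausted.
  yield header, seq unless header.nil?
end

# StringIO stands in for a file or $stdin here.
fasta = StringIO.new(">seq1 test\nACGT\nAC\n>seq2\nGGG\n")
each_fasta_entry(fasta) do |header, seq|
  puts "#{header}: #{seq.length}"
end
```

Passing $stdin instead of the StringIO would let the same method consume
piped input. Memory use then depends on the largest single record rather
than on the file size.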

> I think it is currently OK to depend on memory. The price of memory
> is gradually going down, and I think buying a machine with huge
> memory could be a solution to treat large data.

We cannot all afford big machines. It would hamper many
groups/students. RAM is getting cheaper, but data is growing faster.

Anurag, what is the size of RAM you have access to?

Pj.

From anurag08priyam at gmail.com  Sun Jun 27 04:49:37 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Sun, 27 Jun 2010 14:19:37 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100627084322.GA18815@thebird.nl>
References: <F101AC28-0256-49ED-9D1A-A9477C8A7349@drycafe.net>
	<20100627064731.GA15508@thebird.nl>
	<20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp>
	<20100627084322.GA18815@thebird.nl>
Message-ID: <AANLkTin8_WJlAnHRFGSdCagRRZpYdIaGMGYNYqGeOuAR@mail.gmail.com>

On Sun, Jun 27, 2010 at 2:13 PM, Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> On Sun, Jun 27, 2010 at 04:45:43PM +0900, Naohisa Goto wrote:
> > Hi,
> >
> > I think the ability to handle large data and the memory usage
> > whether or not to load all data in memory at a time, is essentially
> > independent.  Not loading everything in memory does not guarantee
> > the ability to handle large data, due to the disk I/O bottleneck and
> > memory management overhead.
>
> Well, depends on what you plan to do with that data :). I think you
> are saying that streaming data may not be efficient, for example for
> treating alignments. That could be true. However, I think the default
> strategy should be non-memory bound, if possible. Throughout BioRuby
> the strategy is the opposite, at the moment. For example, by default
> FASTA files are loaded in RAM. Same for BLAST XML. I regularly have
> files that exceed RAM and work around these limitations. I don't think
> this should be the *default* strategy.
>
> I prefer the Unix way of using pipes. Only use memory when it is
> available.
>
> With new code we should design for big data. If it is done from the
> start, it takes no real effort.
>
> > I think it is currently OK to depend on memory. The price of memory
> > is gradually going down, and I think buying a machine with huge
> > memory could be a solution to treat large data.
>
> We can not all afford big machines. It would hamper many
> groups/students. RAM is getting cheaper, but data is growing faster.
>
> Anurag, what is the size of RAM you have access to?
>
>
3GB. The biggest sample file I am working with is 500 lines (characters.xml
in the examples); working with it has hardly any effect on my memory. Where
can I get a bigger one? I can test the memory consumption with a large
enough file and report.

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From hlapp at drycafe.net  Sun Jun 27 19:23:19 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Sun, 27 Jun 2010 16:23:19 -0700
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTin8_WJlAnHRFGSdCagRRZpYdIaGMGYNYqGeOuAR@mail.gmail.com>
References: <F101AC28-0256-49ED-9D1A-A9477C8A7349@drycafe.net>
	<20100627064731.GA15508@thebird.nl>
	<20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp>
	<20100627084322.GA18815@thebird.nl>
	<AANLkTin8_WJlAnHRFGSdCagRRZpYdIaGMGYNYqGeOuAR@mail.gmail.com>
Message-ID: <A29051B5-84D1-4412-80DB-B8442C6BF0FE@drycafe.net>

On Jun 27, 2010, at 1:49 AM, Anurag Priyam wrote:

> 3GB. The biggest sample file I am working with is 500 lines (characters.xml
> in the examples); working with it has hardly any effect on my memory. Where
> can I get a bigger one?

Use the NCBI taxonomy :-) Or download the tree from tolweb.org and  
convert to NeXML.

	-hilmar
-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================


From anurag08priyam at gmail.com  Mon Jun 28 05:31:26 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 28 Jun 2010 15:01:26 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100624135411.GA14658@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
Message-ID: <AANLkTinEJ0Sfmg989yVWxkm3WShAckzjfvNCa-f4E3jn@mail.gmail.com>

>
>
> A final comment for this session: The class/method descriptions are
> not very informative. It may be early days - especially since we can
> see some refactoring coming, but it usually helps to write out
> examples giving the 'nicest' interface for people to use. And stick
> those in the source code. Personally I favour rubydoctests, see
>
>  http://github.com/tablatom/rubydoctest
>
>
I am loving rubydoctest. Thanks for showing it to me:). As of now I am using
it in my nexml serialization implementation.

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From anurag08priyam at gmail.com  Mon Jun 28 05:52:32 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 28 Jun 2010 15:22:32 +0530
Subject: [BioRuby] Testing complex nexml output.
Message-ID: <AANLkTinUXvhbaiO_lk93fkN8SS5zy9ppe9yNPRNKOXoE@mail.gmail.com>

I am finding it a little difficult to test the NeXML serializer.

Any NeXML object, say an otu, is serialized by a function call of the type
NeXML::Writer#serialize_otu, which returns an XML::Node object. A raw NeXML
representation can be obtained by calling to_s on the return value. These
nodes are added to the document root and then saved to a file by calling
XML::Document#save.

Now, when it comes to testing, comparing NeXML strings does not make sense,
because the test fails merely because of a different ordering of a node's
attributes or because of newline differences. What I am doing instead is to
initialize two XML::Node objects, one from a test file and one that I
generate with the serialize_otu function, and then compare these XML nodes
for equality attribute by attribute and child by child. An example is here:

http://github.com/yeban/bioruby/blob/writer/test/unit/bio/db/nexml/tc_writer.rb#L166

However, the lack of a proper XML::Node#eql? is making things a little
difficult for me. See:

http://github.com/yeban/bioruby/blob/writer/test/unit/bio/db/nexml/tc_writer.rb#L222

An obvious solution is to define an eql? method in Bio::Node myself. But am
I going in the right direction when it comes to testing XML output?
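
The attribute-by-attribute, child-by-child comparison described above can be
sketched as a standalone helper. This is only a sketch under assumptions: it
uses REXML from the Ruby distribution rather than libxml, to stay
self-contained, and the XMLCompare module and its method names are
hypothetical.

```ruby
require 'rexml/document'

# Hypothetical helper; names are illustrative, not BioRuby API.
module XMLCompare
  # True when two elements have the same name, the same attributes
  # (in any order), the same stripped text, and equal child elements
  # in the same order.
  def self.same_node?(a, b)
    return false unless a.name == b.name
    return false unless attributes_of(a) == attributes_of(b)
    return false unless a.text.to_s.strip == b.text.to_s.strip

    kids_a = a.elements.to_a
    kids_b = b.elements.to_a
    return false unless kids_a.size == kids_b.size

    kids_a.zip(kids_b).all? { |x, y| same_node?(x, y) }
  end

  # Attributes as a plain Hash, so that their ordering does not matter.
  def self.attributes_of(element)
    hash = {}
    element.attributes.each { |name, value| hash[name] = value }
    hash
  end
end

# Same content, different attribute order, quoting and whitespace.
expected  = REXML::Document.new('<otu id="t1" label="A"><meta/></otu>').root
generated = REXML::Document.new("<otu label='A' id='t1'>\n  <meta/>\n</otu>").root
puts XMLCompare.same_node?(expected, generated)
```

Because the helper takes both nodes as arguments, no library class has to be
reopened to get the comparison.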

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From anurag08priyam at gmail.com  Mon Jun 28 05:56:52 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 28 Jun 2010 15:26:52 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100625065539.GD22887@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625065539.GD22887@thebird.nl>
Message-ID: <AANLkTinGXSNGmuSAdYTzHmPN7x0fMu2OrKGvwdDUlBKj@mail.gmail.com>

>   ..... Also, when parsing this type of XML some Ruby reflection
> may come in handy - I did some of that in my BioRuby GEO parser, which
> lives in my GEO branch on github.


I picked up the method_missing trick for the serializer.

http://github.com/yeban/bioruby/blob/writer/lib/bio/db/nexml/writer.rb


>  You should look at each class and
> see if you can refactor it down to a single solution. Just make sure
> it is not at the expense of readability and understanding.
>
> Post us some ideas here, before you start hacking code.
>
> Pj.
>
>
I will.

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From ngoto at gen-info.osaka-u.ac.jp  Mon Jun 28 08:00:05 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 28 Jun 2010 21:00:05 +0900
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinGXSNGmuSAdYTzHmPN7x0fMu2OrKGvwdDUlBKj@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625065539.GD22887@thebird.nl>
	<AANLkTinGXSNGmuSAdYTzHmPN7x0fMu2OrKGvwdDUlBKj@mail.gmail.com>
Message-ID: <20100628120005.61D751CBC32B@idnmail.gen-info.osaka-u.ac.jp>

Hi,

Please never use method_missing. It breaks error reporting and
makes it very hard to debug and maintain both library code and
user scripts.
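
As a sketch of one alternative (the class and element names below are
hypothetical, not the actual Bio::NeXML::Writer), the serialize_* methods
can be generated explicitly with define_method, so that a misspelled call
still raises an ordinary NoMethodError:

```ruby
# Hypothetical writer; not the real Bio::NeXML::Writer.
class SketchWriter
  ELEMENTS = %w[otu otus tree].freeze

  # Define one real method per element instead of catching
  # arbitrary method names in method_missing.
  ELEMENTS.each do |element|
    define_method("serialize_#{element}") do |id|
      %(<#{element} id="#{id}"/>)
    end
  end
end

writer = SketchWriter.new
puts writer.serialize_otu('t1')

begin
  writer.serialize_otuu('t1') # typo in the method name
rescue NoMethodError => e
  puts e.message
end
```

The generated methods also show up in respond_to? and in documentation
tools, which method_missing dispatch does not give for free.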

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Mon, 28 Jun 2010 15:26:52 +0530
Anurag Priyam <anurag08priyam at gmail.com> wrote:

> >   ..... Also, when parsing this type of XML some Ruby reflection
> > may come in handy - I did some of that in my BioRuby GEO parser, which
> > lives in my GEO branch on github.
> 
> 
> I picked up the method_missing trick for the serializer.
> 
> http://github.com/yeban/bioruby/blob/writer/lib/bio/db/nexml/writer.rb
> 
> 
> >  You should look at each class and
> > see if you can refactor it down to a single solution. Just make sure
> > it is not at the expense of readability and understanding.
> >
> > Post us some ideas here, before you start hacking code.
> >
> > Pj.
> >
> >
> I will.
> 
> -- 
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642

From ngoto at gen-info.osaka-u.ac.jp  Mon Jun 28 08:54:09 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 28 Jun 2010 21:54:09 +0900
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinGXSNGmuSAdYTzHmPN7x0fMu2OrKGvwdDUlBKj@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625065539.GD22887@thebird.nl>
	<AANLkTinGXSNGmuSAdYTzHmPN7x0fMu2OrKGvwdDUlBKj@mail.gmail.com>
Message-ID: <20100628125409.B23271CBC32B@idnmail.gen-info.osaka-u.ac.jp>

Dear Anurag,

Do not add methods to classes and modules outside Bio.
Modifying classes and modules outside the Bio namespace is
prohibited in the BioRuby library, because such code could
conflict with user scripts or other libraries when each
defines a method with the same name but different behavior,
or when the original class is refactored by its original authors.

It is BioRuby's policy to respect the user's freedom. For example,
if we defined Array#has?, a user who wants to define Array#has?
with a different meaning could not use BioRuby. So, to preserve
the user's rights, it is our policy not to change anything outside
Bio as far as possible.
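
To illustrate the policy with a sketch (the module and method names here
are hypothetical), a helper can live under the Bio namespace and take the
array as an argument, leaving Array itself untouched:

```ruby
# Hypothetical example; Bio::SketchUtil is not part of BioRuby.
module Bio
  module SketchUtil
    # The behavior one might be tempted to add as Array#has?,
    # defined inside the Bio namespace instead.
    def self.has?(array, item)
      array.include?(item)
    end
  end
end

puts Bio::SketchUtil.has?(%w[otu tree], 'tree')
# Array has not been reopened, so a user remains free to define Array#has?.
puts [].respond_to?(:has?)
```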

PS. You may find some exceptional code in Bio::Shell and in
sample scripts, because they are separate applications.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Mon, 28 Jun 2010 15:26:52 +0530
Anurag Priyam <anurag08priyam at gmail.com> wrote:

> >   ..... Also, when parsing this type of XML some Ruby reflection
> > may come in handy - I did some of that in my BioRuby GEO parser, which
> > lives in my GEO branch on github.
> 
> 
> I picked up the method_missing trick for the serializer.
> 
> http://github.com/yeban/bioruby/blob/writer/lib/bio/db/nexml/writer.rb
> 
> 
> >  You should look at each class and
> > see if you can refactor it down to a single solution. Just make sure
> > it is not at the expense of readability and understanding.
> >
> > Post us some ideas here, before you start hacking code.
> >
> > Pj.
> >
> >
> I will.
> 
> -- 
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642


From anurag08priyam at gmail.com  Mon Jun 28 10:13:36 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 28 Jun 2010 19:43:36 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100628125409.B23271CBC32B@idnmail.gen-info.osaka-u.ac.jp>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625065539.GD22887@thebird.nl>
	<AANLkTinGXSNGmuSAdYTzHmPN7x0fMu2OrKGvwdDUlBKj@mail.gmail.com>
	<20100628125409.B23271CBC32B@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <AANLkTikr9OltJLKnJHsg6l5PIy1Jt6STBsuEjGMtblW2@mail.gmail.com>

> It is BioRuby's policy to respect the user's freedom. For example,
> if we defined Array#has?, a user who wants to define Array#has?
> with a different meaning could not use BioRuby. So, to preserve
> the user's rights, it is our policy not to change anything outside
> Bio as far as possible.
>
>
Corrected. Thanks for pointing this out, Goto-san :).

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From anurag08priyam at gmail.com  Mon Jun 28 10:22:37 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 28 Jun 2010 19:52:37 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100628120005.61D751CBC32B@idnmail.gen-info.osaka-u.ac.jp>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625065539.GD22887@thebird.nl>
	<AANLkTinGXSNGmuSAdYTzHmPN7x0fMu2OrKGvwdDUlBKj@mail.gmail.com>
	<20100628120005.61D751CBC32B@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <AANLkTilD246h5drgc45C-ymvFhWwPqfQKgGkx0ZzxZIR@mail.gmail.com>

> Please never use method_missing. It breaks error reporting and
> makes very hard to debug and maintain both library codes and
> user scripts.
>

Hmm, I have experienced that. But the way I have used it affects only the
Bio::NeXML::Writer class, so is it not safe in this case? Anyways I will
change it as it does not offer much improvement to the code readability in
my case. I just find it exciting :).
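
One possible replacement, sketched under assumptions: the class name and
element list below are hypothetical and do not mirror the actual
Bio::NeXML::Writer. Generating the methods explicitly with define_method
keeps the convenience while a misspelled call still raises Ruby's ordinary
NoMethodError.

```ruby
# Hypothetical writer; the class name and element list are assumptions for
# illustration only.
class SketchWriter
  ELEMENTS = %w[otus otu trees tree].freeze

  # Generate one explicit serialize_<element> method per element name,
  # instead of catching the calls in method_missing.
  ELEMENTS.each do |name|
    define_method("serialize_#{name}") do |id|
      "<#{name} id=\"#{id}\"/>"
    end
  end
end

writer = SketchWriter.new
puts writer.serialize_otu("t1")

begin
  writer.serialize_otuz("t1")   # typo in the method name
rescue NoMethodError => e
  puts e.message                # the standard error, nothing is swallowed
end
```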

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From yogiprasanna at gmail.com  Wed Jun 30 10:11:42 2010
From: yogiprasanna at gmail.com (Prasanna Bala)
Date: Wed, 30 Jun 2010 19:41:42 +0530
Subject: [BioRuby] Contribution in Bioruby...
Message-ID: <AANLkTiloqObN2cMR6834z8k7cusRMA-wQeaYjPAFCNSj@mail.gmail.com>

Hi,
My name is Prasanna. I work at a software firm that uses Ruby on Rails. I am
new to BioRuby and am interested in contributing to the project. I would like
to know where to start and whom to approach for specific tasks. I have
extensive experience in biomedical text mining. Is there any group working
specifically on biomedical text mining, ontology mapping, etc.? I would also
like to know which issues the community is working on now, and to see a list
of the current topics in BioRuby.

Regards,
Prasanna.

From pjotr.public14 at thebird.nl  Wed Jun 30 11:31:05 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Wed, 30 Jun 2010 17:31:05 +0200
Subject: [BioRuby] Contribution in Bioruby...
In-Reply-To: <AANLkTiloqObN2cMR6834z8k7cusRMA-wQeaYjPAFCNSj@mail.gmail.com>
References: <AANLkTiloqObN2cMR6834z8k7cusRMA-wQeaYjPAFCNSj@mail.gmail.com>
Message-ID: <20100630153105.GB10804@thebird.nl>

Hi Prasanna,

On Wed, Jun 30, 2010 at 07:41:42PM +0530, Prasanna Bala wrote:
> Hi,
> My name is Prasanna. I am working in a software firm in ruby on rails
> technology. I am new to Bioruby. I am interested in contributing for
> Bio-ruby project. I would like to know where to start things. To whom to
> approach for specific tasks. I have extensive experience in Biomedical text
> mining. Is there is any group specifically working on Biomedical text
> mining, Ontology Mapping etc.. And I also want to know what are the issues
> now the community is working on ? I want to know list of current topics
> that's going on in Bioruby.

Thanks for showing your interest. It would be great if you were to
look at text mining and ontologies for BioRuby. It is relevant for
our work. To start with BioRuby get a github.com account and clone
the repository. You can start coding, and post questions on this
mailing list. We are having a presentation at BOSC next week, and the
slides discuss current work. They will be available to everyone.

Where are you located geographically?

Pj.

From anurag08priyam at gmail.com  Wed Jun 30 18:07:09 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Thu, 1 Jul 2010 03:37:09 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Update
Message-ID: <AANLkTinySE3wA5_CVZBobhFRPgf06nuBACWbbQuCTrcb@mail.gmail.com>

In the last week and half of this week I have:
* worked out a NeXML serializer - the code sits in the master
branch[1]. On the API page[2] I have added a discussion of the
implementation.
* started working on the RDF API - I should be able to come up with RSpecs
by the end of this week

In the remaining part of the week I will:
* come up with an RDF API implementation
* work on refactoring some of the previous code (the matrix and sequences
part), as Pjotr pointed out in the last review.

Perhaps we can have another round of code review for the NeXML serializer?
This will help me allocate time in the coming weeks to fix the issues with
the code.

[1] http://github.com/yeban/bioruby
[2]
https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From anurag08priyam at gmail.com  Wed Jun 30 18:15:08 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Thu, 1 Jul 2010 03:45:08 +0530
Subject: [BioRuby] [GSoC]
Message-ID: <AANLkTilACLn73nOBi-nLtgrKL_Ybso5dckH0tgDvQLNa@mail.gmail.com>

I hope you guys are tuned to my updates on both the lists and the code and
the project plan. Please do keep reminding me if I am missing out on
something obvious :).

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642

From anurag08priyam at gmail.com  Fri Jun  4 08:39:34 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Fri, 4 Jun 2010 14:09:34 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Tree with Multiple rootings.
Message-ID: <AANLkTinGx3qR_5RgDIxT7OuUEp99sHfpS922zoaqiLqG@mail.gmail.com>

Hello all,

NeXML allows for trees with multiple rootings. In the NeXML lib trees are
represented by Bio::NeXML::Tree which inherits from Bio::Tree. This allows
for the usage of the excellent Bio::Tree framework for manipulating NeXML
trees. However, Bio::Tree class supports only one root node.

There are a couple of functions that require the presence of a root node:
parent, children, descendants, ancestors, lowest_common_ancestor. Now, these
functions can take a root node as a parameter. So it is possible to extend
the current framework to work with trees with multiple root nodes.

Though this may not be required, a possibility is to add the multiple root
functionality to Bio::Tree class itself. Currently, I am adding multiple
root support to Bio::NeXML::Tree class. If need be we can move the
functionality to Bio::Tree.
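
A minimal sketch of the idea, assuming a hypothetical stand-in class
(SketchTree is not Bio::Tree; it only illustrates the technique). The tree is
stored as an undirected graph and every query receives the root explicitly,
so the same topology can be read under several rootings.

```ruby
# Hypothetical stand-in for a tree stored as an undirected graph; this is
# not Bio::Tree, only an illustration of passing the root explicitly.
class SketchTree
  def initialize(edges)
    @adjacent = Hash.new { |hash, key| hash[key] = [] }
    edges.each do |a, b|
      @adjacent[a] << b
      @adjacent[b] << a
    end
  end

  # Parent of +node+ when the tree is viewed as rooted at +root+.
  def parent(node, root)
    return nil if node == root
    path(root, node)[-2]
  end

  # Children of +node+ when the tree is viewed as rooted at +root+.
  def children(node, root)
    @adjacent[node] - [parent(node, root)]
  end

  private

  # Depth-first search for the unique path from +from+ to +to+.
  def path(from, to, seen = [])
    return [from] if from == to
    (@adjacent[from] - seen).each do |nxt|
      rest = path(nxt, to, seen + [from])
      return [from] + rest if rest
    end
    nil
  end
end

tree = SketchTree.new([%w[n1 n2], %w[n2 n3], %w[n2 n4]])
p tree.parent("n2", "n1")     # parent of n2 with n1 as the root
p tree.parent("n2", "n3")     # same topology, rooted at n3 instead
p tree.children("n2", "n3")
```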

Anything?

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642


From rutgeraldo at gmail.com  Fri Jun  4 14:21:27 2010
From: rutgeraldo at gmail.com (Rutger Vos)
Date: Fri, 4 Jun 2010 15:21:27 +0100
Subject: [BioRuby] [GSoC][NeXML and RDF API] Tree with Multiple rootings.
In-Reply-To: <AANLkTinGx3qR_5RgDIxT7OuUEp99sHfpS922zoaqiLqG@mail.gmail.com>
References: <AANLkTinGx3qR_5RgDIxT7OuUEp99sHfpS922zoaqiLqG@mail.gmail.com>
Message-ID: <AANLkTinJIQ26bRYKztfD2pHjP3XTsM3AchgNFS6qfuSM@mail.gmail.com>

Hi Anurag,

in practice I haven't actually seen trees with multiple rootings being
used much, so it might not be urgent that this moves to the bioruby
core. My main worry would be in picking the "right" root node to
expose to the core api. I think that it should be the node from which
all other nodes can be visited in a recursive traversal (which I
expect client code to do), as opposed to a node that has been
indicated using an XML attribute to be the root, but isn't in terms of
the actual topology that emerges from the node and edge tables.

However, I'm curious to hear other people's opinions whether a flag
(e.g. "is_root") might be added to Bio::Tree::Node, and a "get_roots"
method in Bio::Tree that returns a list of roots that typically only
holds the value of the "root" attribute, but could potentially have
multiple rootings.
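
A rough sketch of the distinction, assuming the edge table is a list of
directed [source, target] pairs (the node ids and the function name are made
up for illustration). The topological root is taken to be a node that never
appears as a target, which may differ from the node an attribute flags as
root.

```ruby
# Nodes that never occur as the target of an edge; in a well-formed rooted
# tree there is exactly one. Edges are [source, target] pairs (an assumption).
def topological_roots(nodes, edges)
  targets = edges.map { |_source, target| target }
  nodes.reject { |node| targets.include?(node) }
end

nodes = %w[n1 n2 n3 n4]
edges = [%w[n1 n2], %w[n1 n3], %w[n3 n4]]
flagged_root = "n3"   # what a root attribute in the file might claim

roots = topological_roots(nodes, edges)
p roots
puts "flag disagrees with topology" unless roots.include?(flagged_root)
```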

Rutger

On Fri, Jun 4, 2010 at 9:39 AM, Anurag Priyam <anurag08priyam at gmail.com> wrote:
> Hello all,
>
> NeXML allows for trees with multiple rootings. In the NeXML lib trees are
> represented by Bio::NeXML::Tree which inherits from Bio::Tree. This allows
> for the usage of the excellent Bio::Tree framework for manipulating NeXML
> trees. However, Bio::Tree class supports only one root node.
>
> There are a couple of functions that require the presence of a root node:
> parent, children, descendants, ancestors, lowest_common_ancestor. Now, these
> functions can take a root node as a parameter. So it is possible to extend
> the current framework to work with trees with multiple root nodes.
>
> Though this may not be required, a possibility is to add the multiple root
> functionality to Bio::Tree class itself. Currently, I am adding multiple
> root support to Bio::NeXML::Tree class. If need be we can move the
> functionality to Bio::Tree.
>
> Anything?
>
> --
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642
>


-- 
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading
RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com


From hlapp at drycafe.net  Fri Jun  4 19:09:11 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Fri, 4 Jun 2010 15:09:11 -0400
Subject: [BioRuby] [GSoC][NeXML and RDF API] Tree with Multiple rootings.
In-Reply-To: <AANLkTinJIQ26bRYKztfD2pHjP3XTsM3AchgNFS6qfuSM@mail.gmail.com>
References: <AANLkTinGx3qR_5RgDIxT7OuUEp99sHfpS922zoaqiLqG@mail.gmail.com>
	<AANLkTinJIQ26bRYKztfD2pHjP3XTsM3AchgNFS6qfuSM@mail.gmail.com>
Message-ID: <01FB94FA-D962-4624-B612-49AA26FF3E2D@drycafe.net>

Multiple roots can be the result of a Bayesian analysis. (The PhyloDB  
module in BioSQL, for example, does support multiple roots.)

However, representing multiple roots is useless without also being  
able to indicate whether a root is an alternate root or the main root  
node, and what its significance (posterior prob. for a Bayesian  
analysis) is.

For reference, here is the column documentation for these two  
properties in PhyloDB's tree_root table:

COMMENT ON COLUMN tree_root.is_alternate IS 'True if the root node is  
the preferential (most likely) root node of the tree, and false  
otherwise.';

COMMENT ON COLUMN tree_root.significance IS 'The significance (such as  
likelihood, or posterior probability) with which the node is the root  
node. This only has meaning if the method used for reconstructing the  
tree calculates this value.';

	-hilmar

On Jun 4, 2010, at 10:21 AM, Rutger Vos wrote:

> Hi Anurag,
>
> in practice I haven't actually seen trees with multiple rootings being
> used much, so it might not be urgent that this moves to the bioruby
> core. My main worry would be in picking the "right" root node to
> expose to the core api. I think that it should be the node from which
> all other nodes can be visited in a recursive traversal (which I
> expect client code to do), as opposed to a node that has been
> indicated using an XML attribute to be the root, but isn't in terms of
> the actual topology that emerges from the node and edge tables.
>
> However, I'm curious to hear other people's opinions whether a flag
> (e.g. "is_root") might be added Bio::Tree::Node, and a "get_roots"
> method in Bio::Tree that returns a list of roots that typically only
> holds the value of the "root" attribute, but could potentially have
> multiple rootings.
>
> Rutger
>
> On Fri, Jun 4, 2010 at 9:39 AM, Anurag Priyam <anurag08priyam at gmail.com 
> > wrote:
>> Hello all,
>>
>> NeXML allows for trees with multiple rootings. In the NeXML lib  
>> trees are
>> represented by Bio::NeXML::Tree which inherits from Bio::Tree. This  
>> allows
>> for the usage of the excellent Bio::Tree framework for manipulating  
>> NeXML
>> trees. However, Bio::Tree class supports only one root node.
>>
>> There are a couple of functions that require the presence of a root  
>> node:
>> parent, children, descendants, ancestors, lowest_common_ancestor.  
>> Now, these
>> functions can take a root node as a parameter. So it is possible to  
>> extend
>> the current framework to work with trees with multiple root nodes.
>>
>> Though this may not be required, a possibility is to add the  
>> multiple root
>> functionality to Bio::Tree class itself. Currently, I am adding  
>> multiple
>> root support to Bio::NeXML::Tree class. If need be we can move the
>> functionality to Bio::Tree.
>>
>> Anything?
>>
>> --
>> Anurag Priyam,
>> 2nd Year Undergraduate,
>> Department of Mechanical Engineering,
>> IIT Kharagpur.
>> +91-9775550642
>>
>
>
>
> -- 
> Dr. Rutger A. Vos
> School of Biological Sciences
> Philip Lyle Building, Level 4
> University of Reading
> Reading
> RG6 6BX
> United Kingdom
> Tel: +44 (0) 118 378 7535
> http://www.nexml.org
> http://rutgervos.blogspot.com

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================


From mitlox at op.pl  Sun Jun  6 05:30:09 2010
From: mitlox at op.pl (xyz)
Date: Sun, 06 Jun 2010 15:30:09 +1000
Subject: [BioRuby] fastq files reading
In-Reply-To: <20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp>
References: <20100529221404.0175ee75@wp01>
	<AANLkTil1Nrbd5ULu3T-esvbg8VXoCYojW53eCL72oihi@mail.gmail.com>
	<20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp>
Message-ID: <4C0B3261.3020909@op.pl>

Thank you for the solutions it works.


From sararayburn at gmail.com  Mon Jun  7 18:09:07 2010
From: sararayburn at gmail.com (Sara Rayburn)
Date: Mon, 7 Jun 2010 13:09:07 -0500
Subject: [BioRuby] GSoC speciation/duplication inference question
Message-ID: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com>

Hello,

While implementing my gsoc project (the speciation/duplication inference algorithm), I'm not sure where to put the module I'm developing in the bioruby library. I've been developing in the lib/bio/util directory, but I want to make sure that's the place you all would prefer for the module. Any suggestions or thoughts?

Thanks,

Sara Rayburn
sararayburn at gmail.com


From anurag08priyam at gmail.com  Wed Jun  9 08:17:55 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Wed, 9 Jun 2010 13:47:55 +0530
Subject: [BioRuby] fastq files reading
In-Reply-To: <4C0B3261.3020909@op.pl>
References: <20100529221404.0175ee75@wp01>
	<AANLkTil1Nrbd5ULu3T-esvbg8VXoCYojW53eCL72oihi@mail.gmail.com>
	<20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp>
	<4C0B3261.3020909@op.pl>
Message-ID: <AANLkTilS8gwRSQ0ZPC1mvRL7RLeGmsKOT1KYGB_J-i00@mail.gmail.com>

Maybe we should add this to the wiki [1]

[1] http://bioruby.open-bio.org/wiki/SampleCodes

On Sun, Jun 6, 2010 at 11:00 AM, xyz <mitlox at op.pl> wrote:

> Thank you for the solutions it works.
>


-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642


From mitlox at op.pl  Wed Jun  9 12:42:36 2010
From: mitlox at op.pl (xyz)
Date: Wed, 09 Jun 2010 22:42:36 +1000
Subject: [BioRuby] fastq files reading
In-Reply-To: <AANLkTilS8gwRSQ0ZPC1mvRL7RLeGmsKOT1KYGB_J-i00@mail.gmail.com>
References: <20100529221404.0175ee75@wp01>	<AANLkTil1Nrbd5ULu3T-esvbg8VXoCYojW53eCL72oihi@mail.gmail.com>	<20100530233154.1B8A.EEF6E030@gen-info.osaka-u.ac.jp>	<4C0B3261.3020909@op.pl>
	<AANLkTilS8gwRSQ0ZPC1mvRL7RLeGmsKOT1KYGB_J-i00@mail.gmail.com>
Message-ID: <4C0F8C3C.1030303@op.pl>

Good idea.

On 06/09/10 18:17, Anurag Priyam wrote:
> Maybe we should add this to the wiki [1]
>
> [1] http://bioruby.open-bio.org/wiki/SampleCodes
>
> On Sun, Jun 6, 2010 at 11:00 AM, xyz <mitlox at op.pl
> <mailto:mitlox at op.pl>> wrote:
>
>     Thank you for the solutions it works.
>
>
>
>
> -- 
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642


From anurag08priyam at gmail.com  Wed Jun  9 19:49:35 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Thu, 10 Jun 2010 01:19:35 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
Message-ID: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>

Last week, I worked on finishing implementation of Trees: trees, tree,
network; and started work on the characters element. This weeks target is to
complete the implementation of the characters element.

It would be awesome to have some code review including: implementation, API
design, coding style and tests. I am planning to give a good amount of time
in the fourth week in making the code more robust. It would make perfect
sense to have some feedback to serve as guidelines :). The master branch and
API discussion page are at:

[1] http://github.com/yeban/bioruby
[2]
https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642


From czmasek at burnham.org  Thu Jun 10 01:50:16 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Wed, 9 Jun 2010 18:50:16 -0700
Subject: [BioRuby] gsoc questions
In-Reply-To: <AANLkTin5fGYJKbCWftc8AZYdks11iEiYZgTjebwJ_3f4@mail.gmail.com>
References: <2B862774-AD61-4BC0-86D6-69DCD832EB78@gmail.com>
	<AANLkTin5fGYJKbCWftc8AZYdks11iEiYZgTjebwJ_3f4@mail.gmail.com>
Message-ID: <4C1044D8.4010600@burnham.org>

Hi Sara:
> 
> 
> On Mon, Jun 7, 2010 at 11:20 AM, Sara Rayburn <sararayburn at gmail.com 
> <mailto:sararayburn at gmail.com>> wrote:
> 
>     Hi Christian and Diana,
> 
>     Two questions: 
> 
>     1) On the phylosoft website for forester/sdi
>     (http://www.phylosoft.org/forester/applications/sdi_r/) I've read
>     this about the two trees: 
>     "The important point to keep in mind is that there must be at least
>     one sub-element of the 'taxonomy' element which allows to match the
>     sequences in the gene tree with a taxonomy in the species tree. In
>     this example this sub-element of the 'taxonomy' element is 'code'."
> 
>     Does this mean that the sub-element for matching will *always* be
>     'code'? Or should I just be looking for anything at all that
>     matches? Also, will all phyloxml trees have the 'code' sub-element?
> 
> 
> To find out whether some element will always contain some other element 
> you can look at PhyloXML documentation [0]. For example at the Taxonomy 
> element documentation [1] you can see that it has a sub-element "code" 
> which is [0..1], which means that there either is no "code" sub-element 
> or there is one and no more, whereas there could none or many "synonym" 
> sub-elements
> 
> [0] http://www.phyloxml.org/documentation/version_1.10/phyloxml.xsd.html
> [1] http://www.phyloxml.org/documentation/version_1.10/phyloxml.xsd.html#h888650454


Good point! This matching of taxonomic information is a crucial point.
I recommend implementing this in the same manner as it is implemented in 
the "isEqual" method of the org.forester.phylogeny.data.Taxonomy class 
of the forester library, see:
http://forester-atv.cvs.sourceforge.net/viewvc/forester-atv/forester-atv/java/src/org/forester/phylogeny/data/Taxonomy.java?revision=1.57&view=markup

In this (Java) class the matching works like this:

1. If both the two Taxonomies to be compared have identifiers with the 
same source (e.g. NCBI taxonomy), use these identifiers to match.

  In Java:
   if ( ( getIdentifier() != null ) && ( tax.getIdentifier() != null ) )
   {
     return getIdentifier().isEqual( tax.getIdentifier() );
   }

2. Otherwise, if both Taxonomies have taxonomy codes, use the taxonomy 
codes to match.

  In Java:
   else if ( !ForesterUtil.isEmpty( getTaxonomyCode() ) &&
             !ForesterUtil.isEmpty( tax.getTaxonomyCode() ) )
   {
     return getTaxonomyCode().equals( tax.getTaxonomyCode() );
   }

3. Otherwise, if both Taxonomies have scientific names, use the 
scientific names to match.


4. Otherwise, if both Taxonomies have common names, use the common names 
to match.


5. Otherwise, matching is not possible and an error should be thrown.
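
The five steps above could be sketched in Ruby roughly as follows. The Taxon
struct and its field names are assumptions made for this sketch and are not
the accessors of Bio::PhyloXML::Taxonomy; treating a differing identifier
provider as a mismatch is also an assumption.

```ruby
# Hypothetical taxonomy record; the field names are assumptions made for this
# sketch.
Taxon = Struct.new(:id_provider, :id_value, :code, :scientific_name, :common_name)

# True when +value+ is a non-empty string.
def filled?(value)
  !value.nil? && !value.empty?
end

# Compare two taxa using the first kind of information both of them carry.
def same_taxon?(a, b)
  if filled?(a.id_value) && filled?(b.id_value)
    a.id_provider == b.id_provider && a.id_value == b.id_value
  elsif filled?(a.code) && filled?(b.code)
    a.code == b.code
  elsif filled?(a.scientific_name) && filled?(b.scientific_name)
    a.scientific_name == b.scientific_name
  elsif filled?(a.common_name) && filled?(b.common_name)
    a.common_name == b.common_name
  else
    raise ArgumentError, "taxonomies share no comparable field"
  end
end

human_a = Taxon.new("ncbi", "9606", nil, nil, "human")
human_b = Taxon.new("ncbi", "9606", nil, nil, "man")
puts same_taxon?(human_a, human_b)   # identifiers are used; common names are ignored
```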

Generally speaking, I recommend to get the source code of forester and 
look at the classes in the org.forester.sdi directory (especially 
SDI.java, SDIse.java, and SDIR.java).


> 
>     2) Here's my assumptions about the final output of the algorithm:
>     Each node in the tree should be updated with speciation OR
>     duplication, and the tree as a whole has a count of
>     speciation/duplication events. Am I on the right track here?

Yes, the primary goal of the algorithm is to calculate for each node in 
the gene tree whether it is a duplication or a speciation, and thus each 
node should be annotated as duplication or speciation.
Keeping track of the sum of duplications and speciations is useful too, 
but cannot, as far as I know, be stored in the tree object itself.
Maybe the algorithm could return a small "SDI_result" object which is 
used to store such "summary" information.
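
Such a summary object might look like the sketch below. The name SDIResult,
its fields, and the shape of the events hash are all assumptions for
illustration, not an existing BioRuby API.

```ruby
# Hypothetical summary object; the name SDIResult and its fields are
# assumptions for this sketch.
SDIResult = Struct.new(:duplications, :speciations)

# +events+ maps each internal gene-tree node to :duplication or :speciation.
def summarize(events)
  SDIResult.new(events.values.count(:duplication),
                events.values.count(:speciation))
end

events = { "n1" => :speciation, "n2" => :duplication, "n3" => :speciation }
result = summarize(events)
puts "duplications: #{result.duplications}, speciations: #{result.speciations}"
```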


Christian


From ngoto at gen-info.osaka-u.ac.jp  Thu Jun 10 13:46:40 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Thu, 10 Jun 2010 22:46:40 +0900
Subject: [BioRuby] GSoC speciation/duplication inference question
In-Reply-To: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com>
References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com>
Message-ID: <20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp>

Hi,

I think the abbreviation SDI is not common in the field of biology
and bioinformatics. In this case, it is generally good not to
abbreviate, but the "speciation/duplication inference" is too long.
For file/directory names, because the length limit is tight,
using abbreviation is good.

For the location of files, I suggest
lib/bio/util/evolution/SDI/ or lib/bio/util/phylogeny/SDI/
to show the word SDI is in the field of evolution or phylogeny.

For the class/module namespace,  possible candidates are
Bio::SpeciationDuplicationInference, Bio::Evolution::SDI,
Bio::Algorithm::SDI, but I couldn't determine which is the best.
If you have a good idea, please tell us.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


On Mon, 7 Jun 2010 13:09:07 -0500
Sara Rayburn <sararayburn at gmail.com> wrote:

> Hello,
> 
> While implementing my gsoc project (the speciation/duplicaiton inference algorithm), I'm not sure where to put the module I'm developing in the bioruby library. I've been developing in the lib/bio/util directory, but I want to make sure that's the place you all would prefer for the module. Any suggestions or thoughts?
> 
> Thanks,
> 
> Sara Rayburn
> sararayburn at gmail.com


From kpatil at science.uva.nl  Thu Jun 17 09:22:12 2010
From: kpatil at science.uva.nl (K. Patil)
Date: Thu, 17 Jun 2010 11:22:12 +0200 (CEST)
Subject: [BioRuby] newick to phyloxml
In-Reply-To: <20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp>
References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com>
	<20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <1764.139.19.75.1.1276766532.squirrel@webmail.science.uva.nl>

Hi,

I noticed the inclusion of phyloxml support in bioruby, thanks a lot, it's
very useful. I was wondering if there is any straightforward way to
convert a newick tree to phyloxml?

best


From czmasek at burnham.org  Thu Jun 17 22:49:12 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Thu, 17 Jun 2010 15:49:12 -0700
Subject: [BioRuby] Gene duplications GSoC project: answers to some of your
	questions
Message-ID: <4C1AA668.7040801@burnham.org>

Hi, Sara:

Regarding some of your questions posted on 
http://wiki.github.com/srayburn/bioruby/gsoc-2010-implementing-sdi-project-updates

Re: "Right now initialization loads from a hard coded file. I need to 
make this flexible so that trees can come from any file or from a 
previously loaded tree object":

The input of the algorithm(s) should be tree objects; reading the trees 
should not be part of the algorithm implementation.

Clearly, for testing you need to read the trees from files, but this 
should be implemented in your test code, not as part of the algorithm 
implementation itself.

Re: "The names of leaf nodes: how standard are they? Is there a standard 
format here? I'm going to look at example trees from the forester 
implementation to get ideas about this. If I'm still stumped I'll check 
with my mentors."

No, there is no standard. The only question for the purpose of this 
algorithm is whether they match or not, i.e. the names could just be 
numbers, common names, or scientific names.

Hope this helps,

Christian


From czmasek at burnham.org  Fri Jun 18 03:16:20 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Thu, 17 Jun 2010 20:16:20 -0700
Subject: [BioRuby] newick to phyloxml
In-Reply-To: <1764.139.19.75.1.1276766532.squirrel@webmail.science.uva.nl>
References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com>	<20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp>
	<1764.139.19.75.1.1276766532.squirrel@webmail.science.uva.nl>
Message-ID: <4C1AE504.3000307@burnham.org>

Hi,

Unfortunately, this is not possible in a straightforward way.

The problem is that the tree object (Bio::Tree) returned by:
  input = Bio::FlatFile.open(Bio::Newick, "tree.nh")
  tree = input.next_entry.tree

is the parent type of the tree object(Bio::PhyloXML::Tree) required by:

  writer = Bio::PhyloXML::Writer.new("tree.xml")
  writer.write(phyloxml_tree)


Christian


K. Patil wrote:
> Hi,
> 
> I noticed the inclusion of phyloxml support in bioruby, thanks a lot, its
> very useful. I was wondering if there is any straightforward way to
> convert a newick tree to phyloxml?
> 
> best
> 


From anurag08priyam at gmail.com  Tue Jun 22 08:46:19 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Tue, 22 Jun 2010 14:16:19 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Update
In-Reply-To: <AANLkTimsRqsi0-Qj0JgWerTwmSgo90pyx0Xky9rGxYC8@mail.gmail.com>
References: <AANLkTimsRqsi0-Qj0JgWerTwmSgo90pyx0Xky9rGxYC8@mail.gmail.com>
Message-ID: <AANLkTilqCdMCe5QIDk09Iw-Din8URsWNI_EGyQkgAfwY@mail.gmail.com>

Hello all,

Much of the parser implementation is complete as of now. Last time I had
sent an update I had begun implementing characters element. Week 3( June
7-13) was quite low on work due to power shortage where I live. Consequently
implementation of characters spanned Week 4( June 14-20 ) too.

Work on NeXML serialization has begun. As of now it can serialize taxa
blocks. This week( week 5 - June 21-28 ) I will be working on serializing
trees and characters element.

I would also like to update a little more on future development plans. I am
targeting to finish much of the software development by week 9( July 19-25
), leaving week 10, week 11 and week 12 for feedback and iterations. This is
the time where I should make up for any mistakes or lost work. Perhaps in
this week we can make the code ready for merging in BioRuby's master branch.
Apart from this, I am targeting to finish serializer and start working on
the RDF API by week 6. Maybe we could have a round of code review after that
too? I am notifying this in advance so that if possible developers can
allocate time for this. Sounds good?


-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642


From czmasek at burnham.org  Tue Jun 22 18:29:09 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Tue, 22 Jun 2010 11:29:09 -0700
Subject: [BioRuby] gsoc update
In-Reply-To: <6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com>
References: <4C1AACBA.4030908@burnham.org>
	<6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com>
Message-ID: <4C2100F5.3020306@burnham.org>

Hi, Sara:

Hopefully you and your son are fully recovered now!

To me, Bio::Algorithm::SDI would make the most sense.

Re: "It seems that forester has the assumption built in that any node in 
a tree that has a child must have two children. Is this a property of 
phylogenetic trees?"

Being composed of entirely binary nodes is indeed a property of trees 
produced by most programs for phylogenetic inference. In contrast, if 
multiple (binary) trees are used to calculate a consensus tree (e.g. 
bootstrap resampling), then the resulting consensus tree might contain 
nodes with more than two children (depending on the method of consensus 
tree calculation and the degree of divergence among the resampled 
trees). Furthermore, if (phylogenetic or taxonomic) trees are "manually" 
created (or by various "supertree" approaches), nodes with more than two 
children are oftentimes used to express uncertainty.

For the purpose of gene duplication inference, it would be particularly 
useful to allow non-binary species trees (expressing uncertainty about 
the tree-of-life and preventing the introduction of spurious duplications).
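
A small sketch of how an implementation could detect such nodes before
choosing between the binary and the generalized algorithm. The
children_of hash and both function names are assumptions for illustration.

```ruby
# +children_of+ maps each node id to the ids of its children (made-up data).
# Nodes with more than two children, i.e. unresolved nodes.
def polytomies(children_of)
  children_of.select { |_node, kids| kids.size > 2 }.keys
end

# True when every node is either external or has exactly two children.
def binary?(children_of)
  children_of.values.all? { |kids| kids.empty? || kids.size == 2 }
end

species_tree = { "root" => %w[a b c], "a" => [], "b" => [], "c" => [] }
p polytomies(species_tree)
p binary?(species_tree)
```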

Re: "For the non-binary case, should I go forward planning to implement 
the algorithm from the Vernot et al. paper or should I be planning to 
extend your algorithm?"

You should plan on working on the SDI algorithm and 'modify' it so that 
it correctly works on non-binary species trees.
Now, this is easier said than done.
A while ago, I developed such an algorithm and implemented it as 
org.forester.sdi.GSDI (for Generalized SDI). You can look at it in file 
org/forester/sdi/GSDI.
Yet, the big issue is that while this algorithm seems to work, I don't 
have a mathematical proof for its correctness.

In any case, I recommend to do the following:
1. Thoroughly test (and write unit tests for) your current implementation 
of binary SDI. For example, does it correctly use the different 
sub-elements of taxonomy for matching? That is: does it work if both the 
species and gene trees use scientific names for taxonomic identification? 
Does it work if both use NCBI identifiers? Does it work if both use NCBI 
identifiers but also have non-matching common names (in this case it 
should use the identifiers and ignore the common names)? Will it throw an 
exception if no matching sub-elements of taxonomy are present?
2. Perform timing benchmarks. Does it behave similarly (although 
overall slower) to the Java implementation (see Figure 4 in Zmasek and 
Eddy, 2001)? Oftentimes, an unexpected timing benchmark result is an 
indication of an underlying problem.
3. I will look at your implementation as well.
4. Look at org.forester.sdi.GSDI and see if you can understand it and 
test it on paper. If this makes sense to you then we can go ahead and 
plan implementing this within BioRuby.
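
For the timing benchmarks in point 2, a harness along these lines might be
enough to start with. It uses Ruby's standard Benchmark module; the
placeholder_sdi method is a made-up stand-in workload, to be replaced with
the real SDI call on trees of increasing size.

```ruby
require "benchmark"

# Placeholder workload standing in for an SDI run on a gene tree with
# +size+ external nodes; replace with the real call under test.
def placeholder_sdi(size)
  (1..size).reduce(0) { |sum, i| sum + i }
end

# Time the workload at several input sizes to see how the run time grows.
[1_000, 10_000, 100_000].each do |size|
  seconds = Benchmark.realtime { placeholder_sdi(size) }
  puts format("%7d nodes: %.6f s", size, seconds)
end
```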

Christian


Sara Rayburn wrote:
> Hi,
> 
> Well, as far as I can tell, things are looking much, much better.  I'm sorry I got a bit behind, but my son and I have been sick this past week. 
> 
> For the namespace/file locations, the response from the mailing list has been:
> Bio::SpeciationDuplicationInference, Bio::Evolution::SDI,
> Bio::Algorithm::SD with the files in lib/bio/util/Phylogeny/SDI, or lib/bio/phylo/sdi.rb and Bio::Phylo::SDI
> 
> What do you guys think?
> 
> Also, when I've been in doubt I've looked at the java implementation. It seems that forester has the assumption built in that any node in a tree that has a child must have two children. Is this a property of phylogenetic trees?
> 
> Other than tying up a couple of loose ends, I think the binary case is pretty much wrapped up. Please let me know if there are things I need to modify or rethink.
> 
> For the non-binary case, should I go forward planning to implement the algorithm from the Vernot et al. paper or should I be planning to extend your algorithm? 
> 
> Thanks and again, sorry for getting a bit behind.
> 
> Sara


From rutgeraldo at gmail.com  Wed Jun 23 20:48:02 2010
From: rutgeraldo at gmail.com (Rutger Vos)
Date: Wed, 23 Jun 2010 21:48:02 +0100
Subject: [BioRuby] [GSoC][NeXML and RDF API] Update
In-Reply-To: <AANLkTilqCdMCe5QIDk09Iw-Din8URsWNI_EGyQkgAfwY@mail.gmail.com>
References: <AANLkTimsRqsi0-Qj0JgWerTwmSgo90pyx0Xky9rGxYC8@mail.gmail.com>
	<AANLkTilqCdMCe5QIDk09Iw-Din8URsWNI_EGyQkgAfwY@mail.gmail.com>
Message-ID: <AANLkTilcdN3Q8keRE5SQlEEMZJ43wkFe_CKAticag8Lm@mail.gmail.com>

Hi Anurag,

thanks for the update - your time projection and current progress
sound good. Can you forward this update to the phylosoc (nescent)
list as well?

Thanks,

Rutger

On Tue, Jun 22, 2010 at 9:46 AM, Anurag Priyam <anurag08priyam at gmail.com> wrote:
> Hello all,
>
> Much of the parser implementation is complete as of now. Last time I had
> sent an update I had begun implementing characters element. Week 3( June
> 7-13) was quite low on work due to power shortage where I live. Consequently
> implementation of characters spanned Week 4( June 14-20 ) too.
>
> Work on NeXML serialization has begun. As of now it can serialize taxa
> blocks. This week( week 5 - June 21-28 ) I will be working on serializing
> trees and characters element.
>
> I would also like to update a little more on future development plans. I am
> targeting to finish much of the software development by week 9( July 19-25
> ), leaving week 10, week 11 and week 12 for feedback and iterations. This is
> the time where I should make up for any mistakes or lost work. Perhaps in
> this week we can make the code ready for merging in BioRuby's master branch.
> Apart from this, I am targeting to finish serializer and start working on
> the RDF API by week 6. Maybe we could have a round of code review after that
> too? I am notifying this in advance so that if possible developers can
> allocate time for this. Sounds good?
>
>
> --
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby
>


-- 
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading
RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com


From pjotr.public14 at thebird.nl  Thu Jun 24 13:54:11 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Thu, 24 Jun 2010 15:54:11 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
Message-ID: <20100624135411.GA14658@thebird.nl>

Hi Everyone,

I am going to review Anurag's code. Naohisa, and perhaps others, will
join in. 

A quick recap: Anurag is working on implementing an NeXML parser with
RDF support (for the semantic web).

NeXML is an XMLized and improved version of Nexus, and is used for
interchanging sequences, alignments and trees between
programs/services (correct me if I am wrong). A full description of
NeXML can be found at

  https://www.nescent.org/wg_evoinfo/Future_Data_Exchange_Standard

NeXML is an important standard, and very good to have in BioRuby.

Anurag: thanks for the good work, so far. I can see you have put a lot
of work in. And, I like your style. I can see you are a competent
programmer, so you can expect the worst criticism ;) I am going to
start with some high level questions.

Can someone who has worked with NeXML (Rutger) have a look at the
interface description on:

  https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby

It looks natural to query this way, if you know what the NeXML file
contains (e.g. trees, or sequences). What would be the natural
approach if you do *not* know the contents? I.e. how does one iterate
over the NeXML object?

Anurag, your web page states you implemented a LibXML::Parser, and you
named it Parser. Meanwhile, it looks like you have implemented libxml2
streaming, using a Reader. This is a bit confusing. I presume you are
using the technique used in Diana's PhyloXML parser. You are requiring
the 'xml' package. Is that libxml2 these days, or is it actually
'libxml'? Does it work for all Ruby versions? libxml is an external
(binary) dependency, so it may not exist and fail.  PhyloXML does not
handle failure either.

The other high-level questions concern testing. For others, the unit
tests are here:

  http://github.com/yeban/bioruby/tree/master/test/unit/bio/db/nexml/

I notice you have limited test input data. How can you be really sure
your code works for all cases? How can you be really sure that future
changes to the code don't break? And how are you going to measure
performance of your code?

Finally, getting down to some code. Most of the code is in a single
file:

  http://github.com/yeban/bioruby/blob/master/lib/bio/db/nexml/elements.rb

or

  http://github.com/yeban/bioruby/blob/3abfc592e2f7072a8e2970ee077677a9ab7564ae/lib/bio/db/nexml/elements.rb

I think it should be broken up. It would be logical to split by type
of elements - at least. I know in BioRuby we are ambiguous about file
sizes - I think a single file should describe one concept. That way
file names become self describing. Files larger than 300 lines tend
to be hard to digest - and probably point out some bigger issue.

Also, when I look at DnaSeqRow, RnaSeqRow and others derived from
SeqRow (line 2148 and onwards in elements.rb), I can see duplicated
coding 'patterns'. You are repeating a concept. Would there not be a
more elegant way in Ruby to handle this? Hint: Inheritance is just one
mechanism, I see no real reason to use an inheritance tree. Why not
use one Sequence class for all of these which can contain different
formed elements? I bet the code would become a lot shorter and
(probably) less error prone. Take Ruby's Array container class as an
example - it is just one implementation of a container which allows
many types of elements.
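
A minimal sketch of this idea, with made-up names (this is not the code in
elements.rb, and the alphabets are only the IUPAC nucleotide codes plus gap
and missing characters): one row class that takes the sequence type as
data, instead of one subclass per type.

```ruby
# Hypothetical single row class; the type is a value, not a subclass.
class SeqRow
  # Allowed characters per sequence type. Supporting a new type means
  # adding one entry here rather than writing a new class.
  ALPHABETS = {
    :dna => /\A[ACGTKMRSWYBDHVN?\-]*\z/i,
    :rna => /\A[ACGUKMRSWYBDHVN?\-]*\z/i
  }.freeze

  attr_reader :id, :type, :seq

  def initialize(id, type, seq)
    pattern = ALPHABETS.fetch(type) do
      raise ArgumentError, "unknown sequence type #{type.inspect}"
    end
    raise ArgumentError, "invalid #{type} sequence" unless seq =~ pattern
    @id = id
    @type = type
    @seq = seq
  end
end

row = SeqRow.new('row1', :dna, 'ACGTN-')
puts "#{row.id} (#{row.type}): #{row.seq}"
```

Whether this fits the NeXML schema's own type hierarchy is a separate
question; the sketch only shows where the duplication could go.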

A final comment for this session: The class/method descriptions are
not very informative. It may be early days - especially since we can
see some refactoring coming, but it usually helps to write out
examples giving the 'nicest' interface for people to use. And stick
those in the source code. Personally I favour rubydoctests, see

  http://github.com/tablatom/rubydoctest

I used these in bio/appl/paml/codeml/report.rb - these are examples
that double as tests. Kill two birds with one stone! The BioRuby
tutorial also uses doctests - i.e. the code in the Tutorial can be
validated against the installed bioruby. If you want to use this you
need an extra conversion - I have that tool.

Another possibility is to start using RSpec. 

  http://rspec.info/

I really like RSpec too - it is more of a replacement for unit
tests - and easier to understand, so Specs double as documentation.

I am interested to see what you want to do for RDF support. Maybe you
can write out the API as an RSpec? That would be a good start.

Do not hesitate to stand up to me. You will probably get support from
someone on this list ;)

Pj.

On Thu, Jun 10, 2010 at 01:19:35AM +0530, Anurag Priyam wrote:
> Last week, I worked on finishing implementation of Trees: trees, tree,
> network; and started work on the characters element. This weeks target is to
> complete the implementation of the characters element.
> 
> It would be awesome to have some code review including: implementation, API
> design, coding style and tests. I am planning to give a good amount of time
> in the fourth week in making the code more robust. It would make perfect
> sense to have some feedback to serve as guidelines :). The master branch and
> API discussion page are at:
> 
> [1] http://github.com/yeban/bioruby
> [2]
> https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby
> 
> -- 
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642


From czmasek at burnham.org  Fri Jun 25 02:18:41 2010
From: czmasek at burnham.org (Christian M Zmasek)
Date: Thu, 24 Jun 2010 19:18:41 -0700
Subject: [BioRuby] gsoc: SDI - unrooted trees
In-Reply-To: <6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com>
References: <4C1AACBA.4030908@burnham.org>
	<6B35175E-1DFF-430E-AE22-E724F3D58D9D@gmail.com>
Message-ID: <4C241201.9060604@burnham.org>

Hi, Sara:

Something I forgot to mention.

As you know, most phylogeny inference methods produce trees which are 
unrooted (these trees might look rooted, but for most methods the root 
is placed randomly, and thus incorrectly).

In the context of duplication inference, a reasonable way to root a 
tree is by placing the root in such a way that the sum of inferred 
duplications is minimized.

The brute force approach to accomplish this is by sequentially placing 
the root on each branch and then running the SDI algorithm on each 
differently rooted tree and retaining the root position which results in 
the smallest sum of duplications.
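
The brute force idea can be sketched in a few lines. This is only a toy
under stated assumptions, not the SDI traversal and not forester code: the
species tree is a hard-coded child => parent hash, the unrooted gene tree
is an adjacency hash, and a node is counted as a duplication when it maps
(by least common ancestor) to the same species node as one of its children.
All method names are made up for the example.

```ruby
# Species tree ((A,B)AB,C)ABC as child => parent; illustrative data only.
SPECIES_PARENT = {
  'A' => 'AB', 'B' => 'AB', 'AB' => 'ABC', 'C' => 'ABC', 'ABC' => nil
}.freeze

# Path from a species node up to the root.
def lineage(node)
  path = []
  while node
    path << node
    node = SPECIES_PARENT[node]
  end
  path
end

# Least common ancestor of two species nodes.
def lca(a, b)
  seen = lineage(a)
  lineage(b).find { |n| seen.include?(n) }
end

# Maps the gene subtree at `node`, entered from neighbour `from`, onto the
# species tree. Returns [species_node, duplications_in_subtree].
def map_subtree(adj, species_of, node, from)
  children = adj[node] - [from]
  return [species_of.fetch(node), 0] if children.empty?

  results = children.map { |c| map_subtree(adj, species_of, c, node) }
  mapping = results.map(&:first).inject { |x, y| lca(x, y) }
  dups = results.inject(0) { |sum, (_, d)| sum + d }
  dups += 1 if results.any? { |m, _| m == mapping }
  [mapping, dups]
end

# Number of duplications when the root is placed on the branch u-v.
def duplications_rooted_on(adj, species_of, u, v)
  mu, du = map_subtree(adj, species_of, u, v)
  mv, dv = map_subtree(adj, species_of, v, u)
  root = lca(mu, mv)
  du + dv + ([mu, mv].include?(root) ? 1 : 0)
end

# Tries every branch and keeps the rooting with the fewest duplications.
def best_rooting(adj, species_of)
  edges = adj.flat_map { |u, vs| vs.map { |v| [u, v].sort } }.uniq
  edges.map { |u, v| [[u, v], duplications_rooted_on(adj, species_of, u, v)] }
       .min_by { |_, dups| dups }
end

# Unrooted gene tree with one internal node x and three leaves.
adj = { 'x' => %w[a1 b1 c1], 'a1' => %w[x], 'b1' => %w[x], 'c1' => %w[x] }
species_of = { 'a1' => 'A', 'b1' => 'B', 'c1' => 'C' }

edge, dups = best_rooting(adj, species_of)
puts "root on branch #{edge.join('-')}: #{dups} duplication(s)"
```

Every candidate rooting repeats the full mapping here, so the cost grows
with the number of branches times the size of the tree.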

A more time-efficient approach is possible by realizing that the mapping 
function only changes for a few nodes if the root is moved from one 
branch to a neighboring one.
  	
This approach is implemented in org.forester.sdi.SDIR.

Besides extending the algorithm to work on non-binary trees, this is 
another useful extension which you might think about tackling.

Christian


From anurag08priyam at gmail.com  Fri Jun 25 06:23:34 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Fri, 25 Jun 2010 11:53:34 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100624135411.GA14658@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
Message-ID: <AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>

> Can someone who has worked with NeXML (Rutger) have a look at the
> interface description on:
>
>  https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby
>
> It looks natural to query this way, if you know what the NeXML files
> contains (e.g. trees, or sequences). What would be the natural
> approach if you do *not* know the contents? I.e. how does one iterate
> over the NeXML object?
>
>
NeXML has three primary elements: otus, trees, characters. All three of them
are containers for other elements: otu, tree, network, matrix. Currently the
each method of a nexml object iterates over each tree object. I did this
thinking that the tree is the most important part of a phylogenetic analysis
(and also because I had not implemented characters then). What were you
thinking here? Should each iterate over all otu, tree and matrix elements, or
over the primary otus, trees and characters elements? I would go for the
latter.
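
A minimal sketch of the second option, with hypothetical stand-in classes
(Block and Document are not the real Bio::NeXML classes, and each_child_of
is a made-up name): each walks the primary containers in document order,
and a separate iterator reaches the contained elements of one kind.

```ruby
# Hypothetical stand-in for a primary element (otus, trees, characters).
Block = Struct.new(:kind, :id, :children)

class Document
  include Enumerable

  def initialize(blocks)
    @blocks = blocks
  end

  # Yields each primary block in document order, whatever its kind, so a
  # caller who does not know the contents can still walk the file.
  def each(&block)
    return enum_for(:each) unless block
    @blocks.each(&block)
  end

  # Yields the contained elements (otu, tree, matrix, ...) of all blocks
  # of the given kind.
  def each_child_of(kind)
    return enum_for(:each_child_of, kind) unless block_given?
    @blocks.each do |b|
      next unless b.kind == kind
      b.children.each { |c| yield c }
    end
  end
end

doc = Document.new([
  Block.new(:otus, 'taxa1', %w[t1 t2]),
  Block.new(:trees, 'trees1', %w[tree1]),
  Block.new(:characters, 'chars1', %w[matrix1])
])

doc.each { |b| puts "#{b.kind} #{b.id}: #{b.children.size} element(s)" }
```

Including Enumerable gives map, select and friends over the primary blocks
for free.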


> Anurag, your web page states you implemented a LibXML::Parser, and you
> named it Parser. Meanwhile, it looks like you have implemented libxml2
> streaming, using a Reader. This is a bit confusing. I presume you are
> using the technique used in Diana's PhyloXML parser. You are requiring
> the 'xml' package. Is that libxml2 these days, or is it actually
> 'libxml'? Does it work for all Ruby versions? libxml is an external
> (binary) dependency, so it may not exist and fail.  PhyloXML does not
> handle failure either.
>
>
I am glad you asked this. I wanted to discuss it here.

I have used the libxml2 streaming API, without actually streaming the document
to the user. The cursor does not move through the document when you iterate
over elements (phyloxml does that). I am parsing the document in one go, at
the start, and storing the objects in memory. Should we want to switch to
streaming, having used libxml's streaming API from the start should make it
easier.

Yes it is libxml2 these days. The site states that it works with ruby 1.8. I
am myself working with 1.8.7. I will have to test the compatibility with
ruby 1.9.


> The other high-level questions concern testing. For others, the unit
> tests are here:
>
>  http://github.com/yeban/bioruby/tree/master/test/unit/bio/db/nexml/
>
> I notice you have limited test input data. How can you be really sure
> your code works for all cases? How can you be really sure that future
> changes to the code don't break?


Right. I am working on improving the test suites taking lessons from the
other bioruby test suites.


> And how are you going to measure
> performance of your code?
>
>
Actually I have not done anything here. I will benchmark and profile the
code and discuss the results here.
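
A rough sketch of the kind of timing harness I have in mind. The helper
name time_by_size and the workload are placeholders (the block below just
builds and sorts strings); the real run would parse NeXML files of growing
size instead.

```ruby
# Runs the given block once per input size and returns { size => seconds }.
# Uses the monotonic clock so wall-clock adjustments do not skew results.
def time_by_size(sizes)
  sizes.each_with_object({}) do |n, timings|
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield n
    timings[n] = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  end
end

# Placeholder workload standing in for "parse a document with n otus".
timings = time_by_size([1_000, 10_000, 100_000]) do |n|
  (1..n).map { |i| "otu#{i}" }.sort
end

timings.each { |n, secs| puts format('%8d items: %.4f s', n, secs) }
```

Plotting time against size should show whether the growth is roughly what
one expects; a sudden jump would point at something to profile.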

> Finally, getting down to some code. Most of the code is in a single
> file:
>
>  http://github.com/yeban/bioruby/blob/master/lib/bio/db/nexml/elements.rb
>
> or
>
>
> http://github.com/yeban/bioruby/blob/3abfc592e2f7072a8e2970ee077677a9ab7564ae/lib/bio/db/nexml/elements.rb
>
> I think it should be broken up. It would be logical to split by type
> of elements - at least. I know in BioRuby we are ambiguous about file
> sizes - I think a single file should describe one concept. That way
> file names become self describing. Files larger than 300 lines tend
> to be hard to digest - and probably point out some bigger issue.
>
>
Agreed.


> Also, when I look at DnaSeqRow, RnaSeqRow and others derived from
> SeqRow (line 2148 and onwards in element.rb), I can see duplicated
> coding 'patterns'. You are repeating a concept. Would there not be a
> more elegant way in Ruby to handle this? Hint: Inheritance is just one
> mechanism, I see no real reason to use an inheritance tree. Why not
> use one Sequence class for all of these which can contain different
> formed elements? I bet the code would become a lot shorter and
> (probably) less error prone. Take Ruby's Array container class as an
> example - it is just one implementation of a container which allows
> many types of elements.
>

The idea here was to implement a type system and stick close to the class
hierarchy followed in the schema. However, looking back, I myself do not
find the code for the Matrix class very elegant.

> A final comment for this session: The class/method descriptions are
> not very informative. It may be early days - especially since we can
> see some refactoring coming, but it usually helps to write out
> examples giving the 'nicest' interface for people to use. And stick
> those in the source code. Personally I favour rubydoctests, see
>
>  http://github.com/tablatom/rubydoctest
>
>
Hey, I did not know that doctests existed for Ruby too. I will have a look
into it.


> I used these in bio/appl/paml/codeml/report.rb - these are examples
> that double as tests. Kill two birds with one stone! The BioRuby
> tutorial also uses doctests - i.e. the code in the Tutorial can be
> validated against the installed bioruby. If you want to use this you
> need an extra conversion - I have that tool.
>

I will check out the examples. What tool? I would like to know more.


>
> Another possibility is to start using RSpec.
>
>  http://rspec.info/
>
> I really like RSpec too - it is more of a replacement for unit
> tests - and easier to understand, so Specs double as documentation.
>
>
I miss RSpec too, from my Rails and Merb days. I picked up unit tests
because much of the framework uses them and also because I wanted to
try them out :).


> I am interested to see what you want to do for RDF support. Maybe you
> can write out the API as an RSpec? That would be a good start.
>
>
That sounds like a nice idea.


> Do not hesitate to stand up to me. You will probably get support from
> someone on this list ;)
>
> Pj.
>
> On Thu, Jun 10, 2010 at 01:19:35AM +0530, Anurag Priyam wrote:
> > Last week, I worked on finishing implementation of Trees: trees, tree,
> > network; and started work on the characters element. This weeks target is
> to
> > complete the implementation of the characters element.
> >
> > It would be awesome to have some code review including: implementation,
> API
> > design, coding style and tests. I am planning to give a good amount of
> time
> > in the fourth week in making the code more robust. It would make perfect
> > sense to have some feedback to serve as guidelines :). The master branch
> and
> > API discussion page are at:
> >
> > [1] http://github.com/yeban/bioruby
> > [2]
> >
> https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby
> >
> > --
> > Anurag Priyam,
> > 2nd Year Undergraduate,
> > Department of Mechanical Engineering,
> > IIT Kharagpur.
> > +91-9775550642
>


-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642


From pjotr.public14 at thebird.nl  Fri Jun 25 06:46:05 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 25 Jun 2010 08:46:05 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
Message-ID: <20100625064605.GA22887@thebird.nl>

(splitting up the discussion)

On Fri, Jun 25, 2010 at 11:53:34AM +0530, Anurag Priyam wrote:
> Should each iterate over all otu, tree and matrix or the
> primary otus, trees and characters elements? I would go for the later.

I think Rutger should answer this.

Pj.


From pjotr.public14 at thebird.nl  Fri Jun 25 06:49:11 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 25 Jun 2010 08:49:11 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
Message-ID: <20100625064911.GB22887@thebird.nl>

> I have used libxml2 streaming api, without actually streaming the document
> to the user. The cursor does not move through the document when you iterate
> over elements( phyloxml does that ). I am parsing the document at one go; at
> the start, and storing the objects in memory. Should we want to switch to
> streaming, using libxml's streaming API from start should make it easier.
> 
> Yes it is libxml2 these days. The site states that it works with ruby 1.8. I
> am myself working with 1.8.7. I will have to test the compatibility with
> ruby 1.9.

OK, glad to see that libxml is a standard package these days -
though it has some horrific error handling. At least it is fast.

How much time would it cost you to stream the data - and what does it
mean with regard to changing the API? I guess, in general, NeXML
files won't be that large, so it may not be that important (Rutger)?

Pj.


From pjotr.public14 at thebird.nl  Fri Jun 25 06:51:58 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 25 Jun 2010 08:51:58 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
Message-ID: <20100625065158.GC22887@thebird.nl>

> > I notice you have limited test input data. How can you be really sure
> > your code works for all cases? How can you be really sure that future
> > changes to the code don't break?
> 
> Right. I am working on improving the test suites taking lessons from the
> other bioruby test suites.

Unit tests are one approach. How about adding some regression tests
on larger files? When you have output that should be a good idea. We
don't like large datasets in the bioruby tree, but there are two ways
around that - create a special branch on github, or pull the data on
demand (though Naohisa may frown on that). Ask Diana what she has
done.

> Actually I have not done anything here. I will benchmark and profile the
> code and discuss the results here.

Diana created a special profiling branch. It was really helpful to
profile.

Pj.


From pjotr.public14 at thebird.nl  Fri Jun 25 06:55:39 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 25 Jun 2010 08:55:39 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
Message-ID: <20100625065539.GD22887@thebird.nl>

On Fri, Jun 25, 2010 at 11:53:34AM +0530, Anurag Priyam wrote:
> The idea here was to implement a type system and stick close to the class
> hierarchy followed in the schema. However, looking back, I myself do not
> find the code for the Matrix class very elegant.

Over 3000 lines of code for an XML parser sends out alarm bells. If
you have the right testing files it should be easy to refactor. Make
it simpler. Also, when parsing this type of XML some Ruby reflection
may come in handy - I did some of that in my BioRuby GEO parser, which
lives in my GEO branch on github.  You should look at each class and
see if you can refactor it down to a single solution. Just make sure
it is not at the expense of readability and understanding.

Post us some ideas here, before you start hacking code.

Pj.


From pjotr.public14 at thebird.nl  Fri Jun 25 07:08:04 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 25 Jun 2010 09:08:04 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
Message-ID: <20100625070804.GE22887@thebird.nl>

> >  http://github.com/tablatom/rubydoctest
> >
> >
> Hey, I did not know that doctests existed for Ruby too. I will have a look
> into it.

They are good, however finding bugs is a bit problematic as the stack
traces are lengthy and often not descriptive. So with troubling code
I tend to write extra unit tests. Also, with BioRuby we have not
settled on doctests yet, so you need to reach coverage with unit
tests and/or Specs.

I really think it is good for validating documentation.

> > I used these in bio/appl/paml/codeml/report.rb - these are examples
> > that double as tests. Kill two birds with one stone! The BioRuby
> > tutorial also uses doctests - i.e. the code in the Tutorial can be
> > validated against the installed bioruby. If you want to use this you
> > need an extra conversion - I have that tool.
> >
> 
> I will check out the examples. What tool? I would like to know more.

It simply parses out commented code in the source headers, and hands
it over to rubydoctest. The tool is in my bioruby-support tree on
github - see 

  http://github.com/pjotrp/bioruby-support/blob/master/bin/uncomment_doctest

you can see it uses an environment variable.

> I am missing Rspec too from my Rails and Merb days. I picked up unit tests
> because much of the framework had used the same and also because I wanted to
> try it out :).
> 
> 
> > I am interested to see what you want to do for RDF support. Maybe you
> > can write out the API as an RSpec? That would be a good start.
> >
> >
> That sounds like a nice idea.

RSpec is new for BioRuby. Since you have experience you are the right
one to introduce it to us ;). If it is convincing to the others we
may accept it as standard use (personally I think it is a step
forward from unit testing - unit tests are not very good as
documentation).

Pj.


From anurag08priyam at gmail.com  Fri Jun 25 07:34:21 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Fri, 25 Jun 2010 13:04:21 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100625064911.GB22887@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625064911.GB22887@thebird.nl>
Message-ID: <AANLkTinQ5P28dHE4f3gaRaSMQAbctD0UrtA69j8uwqvF@mail.gmail.com>

On Fri, Jun 25, 2010 at 12:19 PM, Pjotr Prins <pjotr.public14 at thebird.nl>wrote:

> > I have used libxml2 streaming api, without actually streaming the
> document
> > to the user. The cursor does not move through the document when you
> iterate
> > over elements( phyloxml does that ). I am parsing the document at one go;
> at
> > the start, and storing the objects in memory. Should we want to switch to
> > streaming, using libxml's streaming API from start should make it easier.
> >
> > Yes it is libxml2 these days. The site states that it works with ruby
> 1.8. I
> > am myself working with 1.8.7. I will have to test the compatibility with
> > ruby 1.9.
>
> OK, glad to see that libxml is a standard package these days -
> though it has some horrific error handling. At least it is fast.
>
>
Yea it is fast but it has its own share of bugs. Now, I myself have started
working on the ruby-libxml code and helping in maintaining it.


> How much time would it cost you to stream the data - and what does it
> mean with regard to changing the API? I guess, in general, NeXML
> files won't be that large, so it may not be that important (Rutger)?
>
> Pj.
>
>
I meant switching the parsing implementation from "parsing at the start" to
streaming, not changing the API. Using the Reader API rather than the DOM API
would simply make that switch easier. Even if we do not switch, the Reader API
offers a more memory-efficient solution than the DOM API.

Btw, I am not in favour of the switch. You cannot move backwards in the
document that way. I cannot fetch a tree by id if the cursor is already ahead
of that tree. Doing nexml.each_characters and nexml.each_trees is impossible
with pure streaming; I would have to stream one while caching the other. Otus
and otu have a one-to-many relation with trees and characters, and rows. An
API call of the type otus.trees or otus.characters or otu.sequences would be
impossible (not that I have already added the API call). Imo, NeXML is
non-linear and not meant to be streamed. Besides, other NeXML implementations
also parse the file at the start.
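
To make the point concrete, here is a toy sketch in plain Ruby. It does not
use libxml at all: a list of [kind, id] pairs stands in for the elements a
reader would deliver, and ForwardCursor and build_index are made-up names
for the example.

```ruby
# Forward-only cursor, standing in for a streaming XML reader.
class ForwardCursor
  def initialize(events)
    @events = events.each # external enumerator: can only move forward
  end

  # Advances until an element with the given id is seen. Returns nil if
  # the stream runs out, e.g. because that element was already passed.
  def seek(id)
    loop do
      kind, eid = @events.next
      return [kind, eid] if eid == id
    end
    nil
  end
end

# Parse-at-the-start alternative: read everything once, index by id.
def build_index(events)
  events.each_with_object({}) { |(kind, id), index| index[id] = kind }
end

events = [[:otus, 'taxa1'], [:trees, 'trees1'], [:characters, 'chars1']]

cursor = ForwardCursor.new(events)
p cursor.seek('chars1') # found, but the cursor is now past trees1
p cursor.seek('trees1') # nil: cannot go back

index = build_index(events)
p index['trees1']       # lookup works in any order
p index['taxa1']
```

With the index, calls like fetching a tree by id after reading characters
are straightforward, at the cost of holding the parsed objects in memory.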

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642


From ngoto at gen-info.osaka-u.ac.jp  Fri Jun 25 07:15:58 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Fri, 25 Jun 2010 16:15:58 +0900
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100625065158.GC22887@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625065158.GC22887@thebird.nl>
Message-ID: <20100625071558.9CAB01CBC5B0@idnmail.gen-info.osaka-u.ac.jp>


Most of the special testing program created by Diana for
PhyloXML has now been put in sample/test_phyloxml_big.rb, i.e. it is
now regarded as a sample script.

To run the program, for example:
 % mkdir /tmp/phyloxml
 % ruby sample/test_phyloxml_big.rb /tmp/phyloxml -v

It executes round-trip tests for large PhyloXML files.
Data files are downloaded from the internet and stored
in a directory specified by the user.

Naohisa Goto
ngoto at ge-info.osaka-u.ac.jp / ng at bioruby.org

On Fri, 25 Jun 2010 08:51:58 +0200
Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> > Actually I have not done anything here. I will benchmark and profile the
> > code and discuss the results here.
> 
> Diana created a special profiling branch. It was really helpful to
> profile.
> 
> Pj.
> 


-- 
ngoto at gen-info.osaka-u.ac.jp
Phone: 06-6879-8365 / FAX: 06-6879-2047


From pjotr.public14 at thebird.nl  Fri Jun 25 07:42:13 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 25 Jun 2010 09:42:13 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinQ5P28dHE4f3gaRaSMQAbctD0UrtA69j8uwqvF@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625064911.GB22887@thebird.nl>
	<AANLkTinQ5P28dHE4f3gaRaSMQAbctD0UrtA69j8uwqvF@mail.gmail.com>
Message-ID: <20100625074213.GA27044@thebird.nl>

I think this needs to be answered by Rutger. Are we going to face
NeXML files in the future that can easily outrun memory?

Pj.

On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote:
> > How much time would it cost you to stream the data - and what does it
> > mean with regard to changing the API? I guess, in general, NeXML
> > files won't be that large, so it may not be that important (Rutger)?
> >
> > Pj.
> >
> >
> I mean switching the parsing implementation to streaming from "parsing at
> the start" and not the API. Just that using Reader API over the DOM API
> would help in the switch. Even if we do not switch, the Reader API offers a
> more memory efficient solution than the DOM API.
> 
> Btw, I am not in a favour of switch. You cannot move backwards in document
> that way. I can not fetch a tree by id if I the cursor is ahead of that
> tree. Doing nexml.each_characters and nexml.each_trees is impossible with
> pure streaming. I will have to stream one while cache the other. Otus and
> otu provide a one to many relation with trees and characters, and rows. An
> API call of the type otus.trees or otus.characters or otu.seuences would be
> impossible( not that I have already added the API call ). Imo, NeXML is
> non-linear and not meant to be streamed. Besides other NeXML implementations
> also parse the file at the start.
> 
> -- 
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642


From rutgeraldo at gmail.com  Fri Jun 25 08:14:44 2010
From: rutgeraldo at gmail.com (Rutger Vos)
Date: Fri, 25 Jun 2010 09:14:44 +0100
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100625074213.GA27044@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625064911.GB22887@thebird.nl>
	<AANLkTinQ5P28dHE4f3gaRaSMQAbctD0UrtA69j8uwqvF@mail.gmail.com>
	<20100625074213.GA27044@thebird.nl>
Message-ID: <AANLkTimW-P4ijZ4TI7IRyg8RFcrvC0dv-2C8Yb8vSbKn@mail.gmail.com>

This is very possible (and it's why Anurag has been focusing on
stream-based parsing) but I am personally of the opinion that worrying
too much about that right now would be a premature optimization. It
seems to me that we want to get a nice interface that captures what
NeXML can express first, and worry about performance and memory
footprint later - but that's just my own opinion and certainly open
for discussion.

On Fri, Jun 25, 2010 at 8:42 AM, Pjotr Prins <pjotr.public14 at thebird.nl> wrote:
> I think this needs to be answered by Rutger. Are we going to face
> NeXML files in the future that can easily outrun memory?
>
> Pj.
>
> On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote:
>> > How much time would it cost you to stream the data - and what does it
>> > mean with regard to changing the API? I guess, in general, NeXML
>> > files won't be that large, so it may not be that important (Rutger)?
>> >
>> > Pj.
>> >
>> >
>> I mean switching the parsing implementation to streaming from "parsing at
>> the start" and not the API. Just that using Reader API over the DOM API
>> would help in the switch. Even if we do not switch, the Reader API offers a
>> more memory efficient solution than the DOM API.
>>
>> Btw, I am not in a favour of switch. You cannot move backwards in document
>> that way. I can not fetch a tree by id if I the cursor is ahead of that
>> tree. Doing nexml.each_characters and nexml.each_trees is impossible with
>> pure streaming. I will have to stream one while cache the other. Otus and
>> otu provide a one to many relation with trees and characters, and rows. An
>> API call of the type otus.trees or otus.characters or otu.seuences would be
>> impossible( not that I have already added the API call ). Imo, NeXML is
>> non-linear and not meant to be streamed. Besides other NeXML implementations
>> also parse the file at the start.
>>
>> --
>> Anurag Priyam,
>> 2nd Year Undergraduate,
>> Department of Mechanical Engineering,
>> IIT Kharagpur.
>> +91-9775550642
>


-- 
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading
RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com


From pjotr.public14 at thebird.nl  Fri Jun 25 08:38:36 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Fri, 25 Jun 2010 10:38:36 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTimW-P4ijZ4TI7IRyg8RFcrvC0dv-2C8Yb8vSbKn@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625064911.GB22887@thebird.nl>
	<AANLkTinQ5P28dHE4f3gaRaSMQAbctD0UrtA69j8uwqvF@mail.gmail.com>
	<20100625074213.GA27044@thebird.nl>
	<AANLkTimW-P4ijZ4TI7IRyg8RFcrvC0dv-2C8Yb8vSbKn@mail.gmail.com>
Message-ID: <20100625083836.GA28214@thebird.nl>

On Fri, Jun 25, 2010 at 09:14:44AM +0100, Rutger Vos wrote:
> This is very possible (and it's why Anurag has been focusing on
> stream-based parsing) but I am personally of the opinion that worrying
> too much about that right now would be a premature optimization. It
> seems to me that we want to get a nice interface that captures what
> NeXML can express first, and worry about performance and memory
> footprint later - but that's just my own opinion and certainly open
> for discussion.

Oh, I agree about implementation. But it does mean Anurag needs to
change his preferred solution (like back-tracking in the tree).

Pj.


From sararayburn at gmail.com  Fri Jun 25 18:57:03 2010
From: sararayburn at gmail.com (Sara Rayburn)
Date: Fri, 25 Jun 2010 13:57:03 -0500
Subject: [BioRuby] GSoC speciation/duplication inference question
In-Reply-To: <D8EBFDB3-5F9B-441D-97C2-6785621B50FE@hgc.jp>
References: <07B4F0F7-BF33-428C-9B20-F2F0B5CE5052@gmail.com>
	<20100610134640.F39581CBC650@idnmail.gen-info.osaka-u.ac.jp>
	<D8EBFDB3-5F9B-441D-97C2-6785621B50FE@hgc.jp>
Message-ID: <00A7DE6C-2985-4173-A302-619429F964BB@gmail.com>

Hi,

I think between the list response and conversations with my mentor, I would probably go with 
Bio::Algorithm::SDI, with the files in lib/bio/util/phylogeny/SDI/

I can definitely see the others as good possibilities, though. If anyone objects to this naming, please let me know so I can change it.

Thanks,

Sara Rayburn
sararayburn at gmail.com 

On Jun 15, 2010, at 9:06 PM, Toshiaki Katayama wrote:

> Hi,
> 
> Replying personally as I delayed to find this thread.
> 
> I prefer something like lib/bio/phylo/sdi.rb and Bio::Phylo::SDI, how about to gather other phyloinformatics modules under the same directory as well?
> 
> Toshiaki
> 
> 
> On 2010/06/10, at 22:46, Naohisa GOTO wrote:
> 
>> Hi,
>> 
>> I think the abbreviation SDI is not common in the field of biology
>> and bioinformatics. In this case, it is generally good not to
>> abbreviate, but the "speciation/duplication inference" is too long.
>> For file/directory names, because the length limit is tight,
>> using abbreviation is good.
>> 
>> For the location of files, I suggest
>> lib/bio/util/evolution/SDI/ or lib/bio/util/phylogeny/SDI/
>> to show the word SDI is in the field of evolution or phylogeny.
>> 
>> For the class/module namespace,  possible candidates are
>> Bio::SpeciationDuplicationInference, Bio::Evolution::SDI,
>> Bio::Algorithm::SDI, but I couldn't determine which is the best.
>> If you have good idea, please tell us.
>> 
>> Naohisa Goto
>> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>> 
>> 
>> On Mon, 7 Jun 2010 13:09:07 -0500
>> Sara Rayburn <sararayburn at gmail.com> wrote:
>> 
>>> Hello,
>>> 
>>> While implementing my gsoc project (the speciation/duplicaiton inference algorithm), I'm not sure where to put the module I'm developing in the bioruby library. I've been developing in the lib/bio/util directory, but I want to make sure that's the place you all would prefer for the module. Any suggestions or thoughts?
>>> 
>>> Thanks,
>>> 
>>> Sara Rayburn
>>> sararayburn at gmail.com
>> 
>> 
>> _______________________________________________
>> BioRuby Project - http://www.bioruby.org/
>> BioRuby mailing list
>> BioRuby at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioruby
> 


From anurag08priyam at gmail.com  Sat Jun 26 11:35:34 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Sat, 26 Jun 2010 17:05:34 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100625070804.GE22887@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625070804.GE22887@thebird.nl>
Message-ID: <AANLkTimB8_u8UsvGha0JF1z0riC96qzkC2H127vIQs7J@mail.gmail.com>

On Fri, Jun 25, 2010 at 12:38 PM, Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> > >  http://github.com/tablatom/rubydoctest
> > >
> > >
> > Hey, I did not know that doctests existed for Ruby too. I will have a look
> > into it.
>
> They are good, however finding bugs is a bit problematic as the stack
> traces are lengthy and often not descriptive. So with troubling code
> I tend to write extra unit tests. Also, with BioRuby we have not
> settled on doctests yet, so you need to reach coverage with unit
> tests and/or Specs.
>
> I really think it is good for validating documentation.
>
> > > I used these in bio/appl/paml/codeml/report.rb - these are examples
> > > that double as tests. Kill two birds with one stone! The BioRuby
> > > tutorial also uses doctests - i.e. the code in the Tutorial can be
> > > validated against the installed bioruby. If you want to use this you
> > > need an extra conversion - I have that tool.
> > >
> >
> > I will check out the examples. What tool? I would like to know more.
>
> It simply parses out commented code in the source headers, and turns
> them over to rubydoctest. The tool is in my bioruby-support tree on
> github - see
>
>
> http://github.com/pjotrp/bioruby-support/blob/master/bin/uncomment_doctest
>
> you can see it uses an environment variable.
>
>
Perfect. I will use it when expanding the documentation.


> > I am missing Rspec too from my Rails and Merb days. I picked up unit tests
> > because much of the framework had used the same and also because I wanted to
> > try it out :).
> >
> >
> > > I am interested to see what you want to do for RDF support. Maybe you
> > > can write out the API as an RSpec? That would be a good start.
> > >
> > >
> > That sounds like a nice idea.
>
> RSpec is new for BioRuby. Since you have experience you are the right
> one to introduce it to us ;). If it is convincing to the others we
> may accept it as standard use (personally I think it is a step
> forward from unit testing - unit tests are not very good as
> documentation).
>
>
I am willing to use RSpec for the RDF API part. Converting the unit tests
I have already written to RSpec does not sound like a good idea, does it?

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642


From pjotr.public14 at thebird.nl  Sat Jun 26 12:19:02 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sat, 26 Jun 2010 14:19:02 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTimB8_u8UsvGha0JF1z0riC96qzkC2H127vIQs7J@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625070804.GE22887@thebird.nl>
	<AANLkTimB8_u8UsvGha0JF1z0riC96qzkC2H127vIQs7J@mail.gmail.com>
Message-ID: <20100626121902.GA5700@thebird.nl>

On Sat, Jun 26, 2010 at 05:05:34PM +0530, Anurag Priyam wrote:
> I am willing to use Rspec for the RDF API part. Converting the already
> existing unit tests I have written to Rspec does not sound a good idea?

No need. Do the RDF as a proof-of-concept for the rest of BioRuby.
Unit tests will (always) remain.

Pj.


From hlapp at drycafe.net  Sun Jun 27 00:30:19 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Sat, 26 Jun 2010 17:30:19 -0700
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100625074213.GA27044@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625064911.GB22887@thebird.nl>
	<AANLkTinQ5P28dHE4f3gaRaSMQAbctD0UrtA69j8uwqvF@mail.gmail.com>
	<20100625074213.GA27044@thebird.nl>
Message-ID: <F101AC28-0256-49ED-9D1A-A9477C8A7349@drycafe.net>

Our ability to reconstruct trees of hundreds, thousands, and even tens  
of thousands of characters has improved dramatically over the past  
couple of years, and is increasingly often the goal of an analysis.  
Genome-scale alignments also aren't so rare anymore.

Aside from analysis, NeXML files can be produced by a database, and  
hence could hold large taxonomies, or the tree of life.

NeXML is an emerging standard. If implementations can't cope with the  
large scale data that are becoming increasingly popular, it'll have a  
hard time to get uptake.

	-hilmar

On Jun 25, 2010, at 12:42 AM, Pjotr Prins wrote:

> I think this needs to be answered by Rutger. Are we going to face
> NeXML files in the future that can easily outrun memory?
>
> Pj.
>
> On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote:
>>> How much time would it cost you to stream the data - and what does  
>>> it
>>> mean with regard to changing the API? I guess, in general, NeXML
>>> files won't be that large, so it may not be that important (Rutger)?
>>>
>>> Pj.
>>>
>>>
>> I mean switching the parsing implementation to streaming from  
>> "parsing at
>> the start" and not the API. Just that using Reader API over the DOM  
>> API
>> would help in the switch. Even if we do not switch, the Reader API  
>> offers a
>> more memory efficient solution than the DOM API.
>>
>> Btw, I am not in a favour of switch. You cannot move backwards in  
>> document
>> that way. I can not fetch a tree by id if I the cursor is ahead of  
>> that
>> tree. Doing nexml.each_characters and nexml.each_trees is  
>> impossible with
>> pure streaming. I will have to stream one while cache the other.  
>> Otus and
>> otu provide a one to many relation with trees and characters, and  
>> rows. An
>> API call of the type otus.trees or otus.characters or otu.seuences  
>> would be
>> impossible( not that I have already added the API call ). Imo,  
>> NeXML is
>> non-linear and not meant to be streamed. Besides other NeXML  
>> implementations
>> also parse the file at the start.
>>
>> -- 
>> Anurag Priyam,
>> 2nd Year Undergraduate,
>> Department of Mechanical Engineering,
>> IIT Kharagpur.
>> +91-9775550642

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================


From pjotr.public14 at thebird.nl  Sun Jun 27 06:47:31 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sun, 27 Jun 2010 08:47:31 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <F101AC28-0256-49ED-9D1A-A9477C8A7349@drycafe.net>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625064911.GB22887@thebird.nl>
	<AANLkTinQ5P28dHE4f3gaRaSMQAbctD0UrtA69j8uwqvF@mail.gmail.com>
	<20100625074213.GA27044@thebird.nl>
	<F101AC28-0256-49ED-9D1A-A9477C8A7349@drycafe.net>
Message-ID: <20100627064731.GA15508@thebird.nl>

Thanks Rutger and Hilmar,

Anurag, let's not load everything in memory.

Pj.

On Sat, Jun 26, 2010 at 05:30:19PM -0700, Hilmar Lapp wrote:
> Our ability to reconstruct trees of hundreds, thousands, and even tens  
> of thousands of characters has improved dramatically over the past  
> couple of years, and is increasingly often the goal of an analysis.  
> Genome-scale alignments also aren't so rare anymore.
>
> Aside from analysis, NeXML files can be produced by a database, and  
> hence could hold large taxonomies, or the tree of life.
>
> NeXML is an emerging standard. If implementations can't cope with the  
> large scale data that are becoming increasingly popular, it'll have a  
> hard time to get uptake.
>
> 	-hilmar
>
> On Jun 25, 2010, at 12:42 AM, Pjotr Prins wrote:
>
>> I think this needs to be answered by Rutger. Are we going to face
>> NeXML files in the future that can easily outrun memory?
>>
>> Pj.
>>
>> On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote:
>>>> How much time would it cost you to stream the data - and what does  
>>>> it
>>>> mean with regard to changing the API? I guess, in general, NeXML
>>>> files won't be that large, so it may not be that important (Rutger)?
>>>>
>>>> Pj.
>>>>
>>>>
>>> I mean switching the parsing implementation to streaming from  
>>> "parsing at
>>> the start" and not the API. Just that using Reader API over the DOM  
>>> API
>>> would help in the switch. Even if we do not switch, the Reader API  
>>> offers a
>>> more memory efficient solution than the DOM API.
>>>
>>> Btw, I am not in a favour of switch. You cannot move backwards in  
>>> document
>>> that way. I can not fetch a tree by id if I the cursor is ahead of  
>>> that
>>> tree. Doing nexml.each_characters and nexml.each_trees is impossible 
>>> with
>>> pure streaming. I will have to stream one while cache the other.  
>>> Otus and
>>> otu provide a one to many relation with trees and characters, and  
>>> rows. An
>>> API call of the type otus.trees or otus.characters or otu.seuences  
>>> would be
>>> impossible( not that I have already added the API call ). Imo, NeXML 
>>> is
>>> non-linear and not meant to be streamed. Besides other NeXML  
>>> implementations
>>> also parse the file at the start.
>>>
>>> -- 
>>> Anurag Priyam,
>>> 2nd Year Undergraduate,
>>> Department of Mechanical Engineering,
>>> IIT Kharagpur.
>>> +91-9775550642
>
> -- 
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
> ===========================================================
>
>
>
>


From ngoto at gen-info.osaka-u.ac.jp  Sun Jun 27 07:45:43 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Sun, 27 Jun 2010 16:45:43 +0900
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100627064731.GA15508@thebird.nl>
References: <F101AC28-0256-49ED-9D1A-A9477C8A7349@drycafe.net>
	<20100627064731.GA15508@thebird.nl>
Message-ID: <20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp>

Hi,

I think the ability to handle large data and the question of whether or
not to load all data in memory at once are essentially independent.
Not loading everything in memory does not guarantee the ability to handle
large data, because of disk I/O bottlenecks and memory management
overhead.

I think it is currently OK to depend on memory. The price of memory is
gradually going down, and I think buying a machine with a large amount
of memory could be a solution for handling large data.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

> Thanks Rutger and Hilmar,
> 
> Anurag, let's not load everything in memory.
> 
> Pj.
> 
> On Sat, Jun 26, 2010 at 05:30:19PM -0700, Hilmar Lapp wrote:
> > Our ability to reconstruct trees of hundreds, thousands, and even tens  
> > of thousands of characters has improved dramatically over the past  
> > couple of years, and is increasingly often the goal of an analysis.  
> > Genome-scale alignments also aren't so rare anymore.
> >
> > Aside from analysis, NeXML files can be produced by a database, and  
> > hence could hold large taxonomies, or the tree of life.
> >
> > NeXML is an emerging standard. If implementations can't cope with the  
> > large scale data that are becoming increasingly popular, it'll have a  
> > hard time to get uptake.
> >
> > 	-hilmar
> >
> > On Jun 25, 2010, at 12:42 AM, Pjotr Prins wrote:
> >
> >> I think this needs to be answered by Rutger. Are we going to face
> >> NeXML files in the future that can easily outrun memory?
> >>
> >> Pj.
> >>
> >> On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote:
> >>>> How much time would it cost you to stream the data - and what does  
> >>>> it
> >>>> mean with regard to changing the API? I guess, in general, NeXML
> >>>> files won't be that large, so it may not be that important (Rutger)?
> >>>>
> >>>> Pj.
> >>>>
> >>>>
> >>> I mean switching the parsing implementation to streaming from  
> >>> "parsing at
> >>> the start" and not the API. Just that using Reader API over the DOM  
> >>> API
> >>> would help in the switch. Even if we do not switch, the Reader API  
> >>> offers a
> >>> more memory efficient solution than the DOM API.
> >>>
> >>> Btw, I am not in a favour of switch. You cannot move backwards in  
> >>> document
> >>> that way. I can not fetch a tree by id if I the cursor is ahead of  
> >>> that
> >>> tree. Doing nexml.each_characters and nexml.each_trees is impossible 
> >>> with
> >>> pure streaming. I will have to stream one while cache the other.  
> >>> Otus and
> >>> otu provide a one to many relation with trees and characters, and  
> >>> rows. An
> >>> API call of the type otus.trees or otus.characters or otu.seuences  
> >>> would be
> >>> impossible( not that I have already added the API call ). Imo, NeXML 
> >>> is
> >>> non-linear and not meant to be streamed. Besides other NeXML  
> >>> implementations
> >>> also parse the file at the start.
> >>>
> >>> -- 
> >>> Anurag Priyam,
> >>> 2nd Year Undergraduate,
> >>> Department of Mechanical Engineering,
> >>> IIT Kharagpur.
> >>> +91-9775550642
> >
> > -- 
> > ===========================================================
> > : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
> > ===========================================================
> >
> >
> >
> >


From pjotr.public14 at thebird.nl  Sun Jun 27 08:43:22 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sun, 27 Jun 2010 10:43:22 +0200
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp>
References: <F101AC28-0256-49ED-9D1A-A9477C8A7349@drycafe.net>
	<20100627064731.GA15508@thebird.nl>
	<20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp>
Message-ID: <20100627084322.GA18815@thebird.nl>

On Sun, Jun 27, 2010 at 04:45:43PM +0900, Naohisa Goto wrote:
> Hi,
> 
> I think the ability to handle large data and the memory usage
> whether or not to load all data in memory at a time, is essentially
> independent.  Not loading everything in memory does not guarantee
> the ability to handle large data, due to the disk I/O bottleneck and
> memory management overhead.

Well, depends on what you plan to do with that data :). I think you
are saying that streaming data may not be efficient, for example for
treating alignments. That could be true. However, I think the default
strategy should be non-memory bound, if possible. Throughout BioRuby
the strategy is the opposite, at the moment. For example, by default
FASTA files are loaded in RAM. Same for BLAST XML. I regularly have
files that exceed RAM and work around these limitations. I don't think
this should be the *default* strategy.

I prefer the Unix way of using pipes. Only use memory when it is
available.

With new code we should design for big data. If it is done from the
start, it takes no real effort. 
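
As an illustration of the non-memory-bound default described above, here is a
minimal Ruby sketch that reads FASTA records one at a time from any IO object
(a file, a pipe, or STDIN). The method name each_fasta_record is hypothetical
and is not part of BioRuby's existing API; it only shows the general shape of
the approach.

```ruby
require 'stringio'

# Hypothetical helper (not BioRuby's FlatFile API): yields one
# [header, sequence] pair at a time, so only the current record
# is held in memory.
def each_fasta_record(io)
  return enum_for(:each_fasta_record, io) unless block_given?

  header = nil
  seq = +''
  io.each_line do |line|
    line = line.chomp
    if line.start_with?('>')
      # A new header closes the previous record, if there was one.
      yield header, seq unless header.nil?
      header = line[1..-1]
      seq = +''
    else
      seq << line
    end
  end
  # Emit the final record once the input is exhausted.
  yield header, seq unless header.nil?
end

# StringIO stands in for a large file or a Unix pipe.
input = StringIO.new(">a\nACGT\nAC\n>b\nGG\n")
lengths = each_fasta_record(input).map { |name, seq| [name, seq.length] }
```

The same method would work unchanged with $stdin at the end of a Unix pipe,
since it only relies on IO#each_line.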

> I think it is currently OK to depend on memory. The price of memory
> is gradually going down, and I think buying a machine with huge
> memory could be a solution to treat large data.

We can not all afford big machines. It would hamper many
groups/students. RAM is getting cheaper, but data is growing faster.

Anurag, what is the size of RAM you have access to?

Pj.


From anurag08priyam at gmail.com  Sun Jun 27 08:49:37 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Sun, 27 Jun 2010 14:19:37 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100627084322.GA18815@thebird.nl>
References: <F101AC28-0256-49ED-9D1A-A9477C8A7349@drycafe.net>
	<20100627064731.GA15508@thebird.nl>
	<20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp>
	<20100627084322.GA18815@thebird.nl>
Message-ID: <AANLkTin8_WJlAnHRFGSdCagRRZpYdIaGMGYNYqGeOuAR@mail.gmail.com>

On Sun, Jun 27, 2010 at 2:13 PM, Pjotr Prins <pjotr.public14 at thebird.nl> wrote:

> On Sun, Jun 27, 2010 at 04:45:43PM +0900, Naohisa Goto wrote:
> > Hi,
> >
> > I think the ability to handle large data and the memory usage
> > whether or not to load all data in memory at a time, is essentially
> > independent.  Not loading everything in memory does not guarantee
> > the ability to handle large data, due to the disk I/O bottleneck and
> > memory management overhead.
>
> Well, depends on what you plan to do with that data :). I think you
> are saying that streaming data may not be efficient, for example for
> treating alignments. That could be true. However, I think the default
> strategy should be non-memory bound, if possible. Throughout BioRuby
> the strategy is the opposite, at the moment. For example, by default
> FASTA files are loaded in RAM. Same for BLAST XML. I regularly have
> files that exceed RAM and work around these limitations. I don't think
> this should be the *default* strategy.
>
> I prefer the Unix way of using pipes. Only use memory when it is
> available.
>
> With new code we should design for big data. If it is done from the
> start, it takes no real effort.
>
> > I think it is currently OK to depend on memory. The price of memory
> > is gradually going down, and I think buying a machine with huge
> > memory could be a solution to treat large data.
>
> We can not all afford big machines. It would hamper many
> groups/students. RAM is getting cheaper, but data is growing faster.
>
> Anurag, what is the size of RAM you have access to?
>
>
3GB. The biggest sample file I am working with is 500 lines (characters.xml
in the examples); working with it has hardly any effect on my memory. Where
can I get a bigger one? I can test the memory consumption with a large
enough file and report.

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642


From hlapp at drycafe.net  Sun Jun 27 23:23:19 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Sun, 27 Jun 2010 16:23:19 -0700
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTin8_WJlAnHRFGSdCagRRZpYdIaGMGYNYqGeOuAR@mail.gmail.com>
References: <F101AC28-0256-49ED-9D1A-A9477C8A7349@drycafe.net>
	<20100627064731.GA15508@thebird.nl>
	<20100627164543.2950.EEF6E030@gen-info.osaka-u.ac.jp>
	<20100627084322.GA18815@thebird.nl>
	<AANLkTin8_WJlAnHRFGSdCagRRZpYdIaGMGYNYqGeOuAR@mail.gmail.com>
Message-ID: <A29051B5-84D1-4412-80DB-B8442C6BF0FE@drycafe.net>

On Jun 27, 2010, at 1:49 AM, Anurag Priyam wrote:

> 3GB. The biggest sample file I am working with is 500  
> lines( characters.xml
> in the examples ); working with it has hardly any effect on my  
> memory. From,
> where can I get a bigger one?

Use the NCBI taxonomy :-) Or download the tree from tolweb.org and  
convert to NeXML.

	-hilmar
-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================


From anurag08priyam at gmail.com  Mon Jun 28 09:31:26 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 28 Jun 2010 15:01:26 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100624135411.GA14658@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
Message-ID: <AANLkTinEJ0Sfmg989yVWxkm3WShAckzjfvNCa-f4E3jn@mail.gmail.com>

>
>
> A final comment for this session: The class/method descriptions are
> not very informative. It may be early days - especially since we can
> see some refactoring coming, but it usually helps to write out
> examples giving the 'nicest' interface for people to use. And stick
> those in the source code. Personally I favour rubydoctests, see
>
>  http://github.com/tablatom/rubydoctest
>
>
I am loving rubydoctest. Thanks for showing it to me:). As of now I am using
it in my nexml serialization implementation.

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642


From anurag08priyam at gmail.com  Mon Jun 28 09:52:32 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 28 Jun 2010 15:22:32 +0530
Subject: [BioRuby] Testing complex nexml output.
Message-ID: <AANLkTinUXvhbaiO_lk93fkN8SS5zy9ppe9yNPRNKOXoE@mail.gmail.com>

I am finding it a little difficult testing the nexml serializer.

Any NeXML object, say an otu, is serialized by a method call such as
NeXML::Writer#serialize_otu, which returns an XML::Node object. A raw NeXML
representation can be obtained by calling to_s on the return value. These
nodes are added to the document root and then saved to a file by calling
XML::Document#save.

Now, when it comes to testing, comparing NeXML strings does not make sense,
because a test can fail merely because of a different ordering of the
attributes of a node or because of newline differences. What I am doing is to
initialize two XML::Node objects, one from a test file and one that I generate
with the serialize_otu function, and then compare these XML nodes for equality
attribute by attribute and child by child. An example here:

http://github.com/yeban/bioruby/blob/writer/test/unit/bio/db/nexml/tc_writer.rb#L166

However, the lack of a proper XML::Node#eql? is making things a little
difficult for me. See:

http://github.com/yeban/bioruby/blob/writer/test/unit/bio/db/nexml/tc_writer.rb#L222

An obvious solution is to define an eql? method in Bio::Node myself. But am
I going in the right direction when it comes to testing XML output?
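
The attribute-by-attribute, child-by-child comparison described above can be
sketched roughly as follows. This is only an illustration under assumptions:
it uses REXML instead of the libxml XML::Node class used in the writer, and
the NodeCompare module and its method names are hypothetical, not existing
BioRuby code.

```ruby
require 'rexml/document'

# Hypothetical helper module; not part of BioRuby or libxml-ruby.
module NodeCompare
  module_function

  # Copy attributes into a plain Hash so attribute order is irrelevant.
  def attributes_of(element)
    attrs = {}
    element.attributes.each { |name, value| attrs[name] = value }
    attrs
  end

  # Two elements match when name, attributes, stripped text and all
  # child elements (compared in document order) match.
  def same_node?(a, b)
    return false unless a.name == b.name
    return false unless attributes_of(a) == attributes_of(b)
    return false unless a.text.to_s.strip == b.text.to_s.strip

    kids_a = a.elements.to_a
    kids_b = b.elements.to_a
    return false unless kids_a.size == kids_b.size

    kids_a.zip(kids_b).all? { |x, y| same_node?(x, y) }
  end
end

def root_of(xml)
  REXML::Document.new(xml).root
end

expected  = root_of('<otus id="o1" label="taxa"><otu id="t1" label="Homo"/></otus>')
# Same content, but different attribute order, quoting and whitespace.
generated = root_of("<otus label='taxa' id='o1'>\n  <otu label='Homo' id='t1'/>\n</otus>")
different = root_of('<otus id="o1" label="taxa"><otu id="t2" label="Homo"/></otus>')

same = NodeCompare.same_node?(expected, generated)
diff = NodeCompare.same_node?(expected, different)
```

Keeping the comparison in a helper module, rather than adding eql? to the XML
library's own class, leaves that class untouched.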

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642


From anurag08priyam at gmail.com  Mon Jun 28 09:56:52 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 28 Jun 2010 15:26:52 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100625065539.GD22887@thebird.nl>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625065539.GD22887@thebird.nl>
Message-ID: <AANLkTinGXSNGmuSAdYTzHmPN7x0fMu2OrKGvwdDUlBKj@mail.gmail.com>

>   ..... Also, when parsing this type of XML some Ruby reflection
> may come in handy - I did some of that in my BioRuby GEO parser, which
> lives in my GEO branch on github.


I picked up the method_missing trick for the serializer.

http://github.com/yeban/bioruby/blob/writer/lib/bio/db/nexml/writer.rb


>  You should look at each class and
> see if you can refactor it down to a single solution. Just make sure
> it is not at the expense of readability and understanding.
>
> Post us some ideas here, before you start hacking code.
>
> Pj.
>
>
I will.

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642


From ngoto at gen-info.osaka-u.ac.jp  Mon Jun 28 12:00:05 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 28 Jun 2010 21:00:05 +0900
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinGXSNGmuSAdYTzHmPN7x0fMu2OrKGvwdDUlBKj@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625065539.GD22887@thebird.nl>
	<AANLkTinGXSNGmuSAdYTzHmPN7x0fMu2OrKGvwdDUlBKj@mail.gmail.com>
Message-ID: <20100628120005.61D751CBC32B@idnmail.gen-info.osaka-u.ac.jp>

Hi,

Please never use method_missing. It breaks error reporting and
makes it very hard to debug and maintain both library code and
user scripts.
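
A small sketch of the error-reporting problem, using hypothetical classes that
are not the actual NeXML writer. The define_method alternative shown here is
one possible replacement and is an assumption of this example, not something
proposed in the thread.

```ruby
# Hypothetical class: answers any serialize_* call via method_missing.
class DynamicWriter
  def method_missing(name, *args)
    if name.to_s.start_with?('serialize_')
      "<#{name.to_s.sub('serialize_', '')}/>"
    else
      super
    end
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?('serialize_') || super
  end
end

# Hypothetical class: the same methods, but defined explicitly up front.
class ExplicitWriter
  %w[otu otus tree].each do |tag|
    define_method("serialize_#{tag}") { "<#{tag}/>" }
  end
end

# A misspelled method name is silently accepted by method_missing...
typo_result = DynamicWriter.new.serialize_otuu

# ...but is reported immediately when the methods are defined explicitly.
explicit_error =
  begin
    ExplicitWriter.new.serialize_otuu
    nil
  rescue NoMethodError => e
    e
  end
```

With explicit definitions, a typo in a user script fails at the call site with
a NoMethodError instead of producing unexpected output further downstream.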

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Mon, 28 Jun 2010 15:26:52 +0530
Anurag Priyam <anurag08priyam at gmail.com> wrote:

> >   ..... Also, when parsing this type of XML some Ruby reflection
> > may come in handy - I did some of that in my BioRuby GEO parser, which
> > lives in my GEO branch on github.
> 
> 
> I picked up the method_missing trick for the serializer.
> 
> http://github.com/yeban/bioruby/blob/writer/lib/bio/db/nexml/writer.rb
> 
> 
> >  You should look at each class and
> > see if you can refactor it down to a single solution. Just make sure
> > it is not at the expense of readability and understanding.
> >
> > Post us some ideas here, before you start hacking code.
> >
> > Pj.
> >
> >
> I will.
> 
> -- 
> Anurag Priyam,
> 2nd Year Undergraduate,
> Department of Mechanical Engineering,
> IIT Kharagpur.
> +91-9775550642
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby


From ngoto at gen-info.osaka-u.ac.jp  Mon Jun 28 12:54:09 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 28 Jun 2010 21:54:09 +0900
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <AANLkTinGXSNGmuSAdYTzHmPN7x0fMu2OrKGvwdDUlBKj@mail.gmail.com>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625065539.GD22887@thebird.nl>
	<AANLkTinGXSNGmuSAdYTzHmPN7x0fMu2OrKGvwdDUlBKj@mail.gmail.com>
Message-ID: <20100628125409.B23271CBC32B@idnmail.gen-info.osaka-u.ac.jp>

Dear Anurag,

Do not add methods to classes and modules outside Bio.
Modifying classes and modules outside the Bio namespace is
prohibited in the BioRuby library because such code can
conflict with user scripts or other libraries when each
defines a method with the same name but different behavior,
or when the original class is refactored by its authors.

It is BioRuby's policy to respect the user's freedom. For example,
if we defined Array#has?, a user who wants to define Array#has?
with a different meaning could not use BioRuby. So, to preserve
the user's rights, it is our policy not to change anything outside
Bio as far as possible.
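
A minimal sketch of the alternative, assuming a hypothetical library
namespace (SketchLib and ArrayUtil are made-up names, not BioRuby
modules): the helper lives inside the library's own namespace, so a user
script can still define Array#has? with whatever meaning it likes.

```ruby
# Hypothetical library namespace, for illustration only.
module SketchLib
  module ArrayUtil
    # Does the job a patched Array#has? would have done,
    # but leaves the core Array class untouched.
    def self.has?(array, item)
      array.include?(item)
    end
  end
end

# A user script remains free to give Array#has? its own meaning.
class Array
  def has?(size)
    length == size
  end
end

taxa = ["t1", "t2"]
puts SketchLib::ArrayUtil.has?(taxa, "t1")  # library helper: membership test
puts taxa.has?(2)                           # user's definition: size test
```

Both definitions coexist because the library never reopened Array.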

PS. You may find some exceptional code in Bio::Shell and in
sample scripts, because they are separate applications.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


From anurag08priyam at gmail.com  Mon Jun 28 14:13:36 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 28 Jun 2010 19:43:36 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100628125409.B23271CBC32B@idnmail.gen-info.osaka-u.ac.jp>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625065539.GD22887@thebird.nl>
	<AANLkTinGXSNGmuSAdYTzHmPN7x0fMu2OrKGvwdDUlBKj@mail.gmail.com>
	<20100628125409.B23271CBC32B@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <AANLkTikr9OltJLKnJHsg6l5PIy1Jt6STBsuEjGMtblW2@mail.gmail.com>

> It is BioRuby's policy to respect the user's freedom. For example,
> if we defined Array#has?, a user who wants to define Array#has?
> with a different meaning could not use BioRuby. So, to preserve
> the user's rights, it is our policy not to change anything outside
> Bio as far as possible.
>
>
Corrected. Thanks for pointing this out, Goto-san :).

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642


From anurag08priyam at gmail.com  Mon Jun 28 14:22:37 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Mon, 28 Jun 2010 19:52:37 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Code Review.
In-Reply-To: <20100628120005.61D751CBC32B@idnmail.gen-info.osaka-u.ac.jp>
References: <AANLkTikbVvzD6jbRbOW9WilBxyYGBvxrkaHaoujGx3as@mail.gmail.com>
	<20100624135411.GA14658@thebird.nl>
	<AANLkTinJkEiu2DyAt3Xvv0-kQ7I4LWpAkDgaw2s50ci8@mail.gmail.com>
	<20100625065539.GD22887@thebird.nl>
	<AANLkTinGXSNGmuSAdYTzHmPN7x0fMu2OrKGvwdDUlBKj@mail.gmail.com>
	<20100628120005.61D751CBC32B@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <AANLkTilD246h5drgc45C-ymvFhWwPqfQKgGkx0ZzxZIR@mail.gmail.com>

> Please never use method_missing. It breaks error reporting and
> makes it very hard to debug and maintain both library code and
> user scripts.
>

Hmm, I have experienced that. But the way I have used it affects only the
Bio::NeXML::Writer class, so is it not safe in this case? Anyway, I will
change it, as it does not improve code readability much in my case.
I just find it exciting :).
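
One possible replacement, sketched under assumptions: the class
SketchSerializer and its element list are hypothetical and do not reflect
the actual Bio::NeXML::Writer. Generating real methods up front with
define_method keeps the code short while letting misspelled calls raise
NoMethodError as usual.

```ruby
# Hypothetical serializer, for illustration only (not the actual
# Bio::NeXML::Writer).
class SketchSerializer
  # The element names are an assumption made for this sketch.
  ELEMENTS = [:otus, :otu, :tree].freeze

  # Define one real method per element instead of catching
  # unknown messages at call time.
  ELEMENTS.each do |element|
    define_method("serialize_#{element}") do |content|
      "<#{element}>#{content}</#{element}>"
    end
  end
end

serializer = SketchSerializer.new
puts serializer.serialize_otu("t1")

begin
  serializer.serialize_otuu("t1")   # typo
rescue NoMethodError => e
  puts "NoMethodError: #{e.name}"   # the typo is reported, not swallowed
end
```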

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642


From yogiprasanna at gmail.com  Wed Jun 30 14:11:42 2010
From: yogiprasanna at gmail.com (Prasanna Bala)
Date: Wed, 30 Jun 2010 19:41:42 +0530
Subject: [BioRuby] Contribution in Bioruby...
Message-ID: <AANLkTiloqObN2cMR6834z8k7cusRMA-wQeaYjPAFCNSj@mail.gmail.com>

Hi,
My name is Prasanna. I work at a software firm with Ruby on Rails.
I am new to BioRuby and am interested in contributing to the
BioRuby project. I would like to know where to start and whom to
approach for specific tasks. I have extensive experience in biomedical text
mining. Is there any group working specifically on biomedical text
mining, ontology mapping, etc.? I would also like to know which issues
the community is working on now, and the list of current topics
in BioRuby.

Regards,
Prasanna.


From pjotr.public14 at thebird.nl  Wed Jun 30 15:31:05 2010
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Wed, 30 Jun 2010 17:31:05 +0200
Subject: [BioRuby] Contribution in Bioruby...
In-Reply-To: <AANLkTiloqObN2cMR6834z8k7cusRMA-wQeaYjPAFCNSj@mail.gmail.com>
References: <AANLkTiloqObN2cMR6834z8k7cusRMA-wQeaYjPAFCNSj@mail.gmail.com>
Message-ID: <20100630153105.GB10804@thebird.nl>

Hi Prasanna,

On Wed, Jun 30, 2010 at 07:41:42PM +0530, Prasanna Bala wrote:
> Hi,
> My name is Prasanna. I work at a software firm with Ruby on Rails.
> I am new to BioRuby and am interested in contributing to the
> BioRuby project. I would like to know where to start and whom to
> approach for specific tasks. I have extensive experience in biomedical text
> mining. Is there any group working specifically on biomedical text
> mining, ontology mapping, etc.? I would also like to know which issues
> the community is working on now, and the list of current topics
> in BioRuby.

Thanks for showing your interest. It would be great if you were to
look at text mining and ontologies for BioRuby. It is relevant to
our work. To start with BioRuby, get a github.com account and clone
the repository. You can start coding, and post questions on this
mailing list. We are giving a presentation at BOSC next week, and the
slides discuss current work. They will be available to everyone.

Where are you located geographically?

Pj.


From anurag08priyam at gmail.com  Wed Jun 30 22:07:09 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Thu, 1 Jul 2010 03:37:09 +0530
Subject: [BioRuby] [GSoC][NeXML and RDF API] Update
Message-ID: <AANLkTinySE3wA5_CVZBobhFRPgf06nuBACWbbQuCTrcb@mail.gmail.com>

Over the last week and the first half of this week I have:
* worked out an NeXML serializer - the code sits in the master
branch[1]. On the API page[2] I have added a discussion of the
implementation.
* started working on the RDF API - I should be able to come up with RSpecs
by the end of this week

In the remaining part of the week I will:
* come up with an RDF API implementation
* work on refactoring some of the previous code (the matrix and sequences
parts), as Pjotr pointed out in the last review.

Perhaps we can have another round of code review, for the NeXML serializer?
This will help me allocate time in the coming weeks to fix the issues with
the code.

[1] http://github.com/yeban/bioruby
[2]
https://www.nescent.org/wg_phyloinformatics/NeXML_and_RDF_API_for_BioRuby

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642


From anurag08priyam at gmail.com  Wed Jun 30 22:15:08 2010
From: anurag08priyam at gmail.com (Anurag Priyam)
Date: Thu, 1 Jul 2010 03:45:08 +0530
Subject: [BioRuby] [GSoC]
Message-ID: <AANLkTilACLn73nOBi-nLtgrKL_Ybso5dckH0tgDvQLNa@mail.gmail.com>

I hope you are all following my updates on both lists, as well as the code
and the project plan. Please do keep reminding me if I am missing
something obvious :).

-- 
Anurag Priyam,
2nd Year Undergraduate,
Department of Mechanical Engineering,
IIT Kharagpur.
+91-9775550642