[BioRuby] [GSoC][NeXML and RDF API] Code Review.

Sun Jun 27 03:45:43 EDT 2010

Hi,

I think the ability to handle large data and the question of whether or
not to load all data into memory at once are essentially independent.
Not loading everything into memory does not guarantee the ability to
handle large data, because of the disk I/O bottleneck and memory
management overhead.

I think it is currently OK to depend on memory. The price of memory is
gradually going down, and I think buying a machine with a large amount
of memory could be a solution for handling large data.
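
As a rough sketch of the two parsing styles discussed in the thread below
(this is not the actual NeXML API code: the element names are simplified,
namespaces are left out, and the helper names stream_tree_ids and find_tree
are made up for illustration), here is the difference using REXML instead of
the Reader and DOM APIs mentioned in the quoted mails:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Tiny NeXML-like document; simplified and without namespaces (assumption).
SAMPLE = <<~XML
  <nexml>
    <otus id="taxa1"><otu id="t1"/><otu id="t2"/></otus>
    <trees id="trees1" otus="taxa1">
      <tree id="tree1"/>
      <tree id="tree2"/>
    </trees>
  </nexml>
XML

# Streaming: a single forward-only pass. Only what the listener decides
# to keep (here, the tree ids) stays in memory.
class TreeIdListener
  include REXML::StreamListener
  attr_reader :tree_ids

  def initialize
    @tree_ids = []
  end

  # Called once per opening tag; attrs is a Hash of attribute name => value.
  def tag_start(name, attrs)
    @tree_ids << attrs['id'] if name == 'tree'
  end
end

# Hypothetical helper: collect all tree ids in one streaming pass.
def stream_tree_ids(xml)
  listener = TreeIdListener.new
  REXML::Document.parse_stream(xml, listener)
  listener.tree_ids
end

# In-memory: the whole document is parsed up front, so elements can be
# looked up by id in any order and related elements can be reached.
# Hypothetical helper: fetch a tree element by its id.
def find_tree(doc, id)
  REXML::XPath.first(doc, "//tree[@id='#{id}']")
end

doc = REXML::Document.new(SAMPLE)
later   = find_tree(doc, 'tree2')
earlier = find_tree(doc, 'tree1')   # going "backwards" is no problem here
puts stream_tree_ids(SAMPLE).inspect
puts earlier.parent.attributes['otus']
```

With the in-memory document, any tree can be fetched regardless of order and
its parent's otus reference followed; with the streaming listener, anything
not cached during the pass is gone once the parser has moved past it.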

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

> Thanks Rutger and Hilmar,
> 
> Anurag, let's not load everything in memory.
> 
> Pj.
> 
> On Sat, Jun 26, 2010 at 05:30:19PM -0700, Hilmar Lapp wrote:
> > Our ability to reconstruct trees of hundreds, thousands, and even tens  
> > of thousands of characters has improved dramatically over the past  
> > couple of years, and is increasingly often the goal of an analysis.  
> > Genome-scale alignments also aren't so rare anymore.
> >
> > Aside from analysis, NeXML files can be produced by a database, and  
> > hence could hold large taxonomies, or the tree of life.
> >
> > NeXML is an emerging standard. If implementations can't cope with the
> > large-scale data that are becoming increasingly popular, it'll have a
> > hard time gaining uptake.
> >
> > 	-hilmar
> >
> > On Jun 25, 2010, at 12:42 AM, Pjotr Prins wrote:
> >
> >> I think this needs to be answered by Rutger. Are we going to face
> >> NeXML files in the future that can easily outrun memory?
> >>
> >> Pj.
> >>
> >> On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote:
> >>>> How much time would it cost you to stream the data - and what does it
> >>>> mean with regard to changing the API? I guess, in general, NeXML
> >>>> files won't be that large, so it may not be that important (Rutger)?
> >>>>
> >>>> Pj.
> >>>>
> >>>>
> >>> I mean switching the parsing implementation from "parsing at the
> >>> start" to streaming, not changing the API. It is just that using the
> >>> Reader API rather than the DOM API would help with the switch. Even if
> >>> we do not switch, the Reader API offers a more memory-efficient
> >>> solution than the DOM API.
> >>>
> >>> Btw, I am not in favour of the switch. You cannot move backwards in
> >>> the document that way. I cannot fetch a tree by id if the cursor is
> >>> already ahead of that tree. Doing both nexml.each_characters and
> >>> nexml.each_trees is impossible with pure streaming; I would have to
> >>> stream one while caching the other. Otus and otu have a one-to-many
> >>> relation with trees, characters, and rows. An API call of the type
> >>> otus.trees, otus.characters, or otu.sequences would be impossible
> >>> (not that I have already added the API call). Imo, NeXML is
> >>> non-linear and not meant to be streamed. Besides, other NeXML
> >>> implementations also parse the file at the start.
> >>>
> >>> -- 
> >>> Anurag Priyam,
> >>> 2nd Year Undergraduate,
> >>> Department of Mechanical Engineering,
> >>> IIT Kharagpur.
> >>> +91-9775550642
> >> _______________________________________________
> >> BioRuby Project - http://www.bioruby.org/
> >> BioRuby mailing list
> >> BioRuby at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioruby
> >
> > -- 
> > ===========================================================
> > : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
> > ===========================================================