[BioRuby] [GSoC][NeXML and RDF API] Code Review.
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp
Sun Jun 27 03:45:43 EDT 2010
Hi,
I think the ability to handle large data and the question of memory usage,
i.e. whether or not to load all data into memory at once, are essentially
independent. Not loading everything into memory does not guarantee the
ability to handle large data, because of the disk I/O bottleneck and
memory-management overhead.

I think it is currently OK to depend on memory. The price of memory is
gradually going down, and buying a machine with a large amount of memory
could be one way to handle large data.
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
> Thanks Rutger and Hilmar,
>
> Anurag, let's not load everything in memory.
>
> Pj.
>
> On Sat, Jun 26, 2010 at 05:30:19PM -0700, Hilmar Lapp wrote:
> > Our ability to reconstruct trees of hundreds, thousands, and even tens
> > of thousands of characters has improved dramatically over the past
> > couple of years, and is increasingly often the goal of an analysis.
> > Genome-scale alignments also aren't so rare anymore.
> >
> > Aside from analysis, NeXML files can be produced by a database, and
> > hence could hold large taxonomies, or the tree of life.
> >
> > NeXML is an emerging standard. If implementations can't cope with the
> > large-scale data that are becoming increasingly popular, it'll have a
> > hard time getting uptake.
> >
> > -hilmar
> >
> > On Jun 25, 2010, at 12:42 AM, Pjotr Prins wrote:
> >
> >> I think this needs to be answered by Rutger. Are we going to face
> >> NeXML files in the future that can easily outrun memory?
> >>
> >> Pj.
> >>
> >> On Fri, Jun 25, 2010 at 01:04:21PM +0530, Anurag Priyam wrote:
> >>>> How much time would it cost you to stream the data - and what does it
> >>>> mean with regard to changing the API? I guess, in general, NeXML files
> >>>> won't be that large, so it may not be that important (Rutger)?
> >>>>
> >>>> Pj.
> >>>>
> >>>>
> >>> I mean switching the parsing implementation from "parsing at the start"
> >>> to streaming, not changing the API. It is just that using the Reader API
> >>> rather than the DOM API would make such a switch easier. Even if we do
> >>> not switch, the Reader API offers a more memory-efficient solution than
> >>> the DOM API.
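
For illustration, here is a rough sketch of the two libxml-ruby parsing
styles being contrasted; the file name and element handling are made up,
and this is not the actual bio-nexml code.

  require 'xml'   # libxml-ruby

  # DOM API: the whole document is materialised in memory up front,
  # so memory use grows with the size of the file.
  doc = XML::Document.file('study.xml')
  doc.root.each_element { |node| puts node.name }

  # Reader API: a forward-only cursor; only the current node is held,
  # so memory use stays roughly constant regardless of file size.
  reader = XML::Reader.file('study.xml')
  while reader.read
    next unless reader.node_type == XML::Reader::TYPE_ELEMENT
    puts reader.name
  end
  reader.close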
> >>>
> >>> Btw, I am not in favour of a switch. You cannot move backwards in the
> >>> document that way: I cannot fetch a tree by id if the cursor is already
> >>> past that tree. Doing nexml.each_characters and nexml.each_trees is
> >>> impossible with pure streaming; I would have to stream one while caching
> >>> the other. Otus and otu provide a one-to-many relation with trees,
> >>> characters, and rows, so an API call of the form otus.trees,
> >>> otus.characters, or otu.sequences would be impossible (not that I have
> >>> already added these API calls). IMO, NeXML is non-linear and not meant
> >>> to be streamed. Besides, other NeXML implementations also parse the file
> >>> at the start.
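
As a concrete illustration of the random access involved, here is a rough
sketch assuming libxml-ruby's DOM interface; the file name, tree id, and
namespace mapping are made up and may not match actual NeXML documents.

  require 'xml'   # libxml-ruby

  doc = XML::Document.file('study.xml')

  # Jump straight to a <tree> element by its id attribute via XPath.
  # A forward-only Reader cursor could not do this once it has read
  # past that element.
  ns   = 'nex:http://www.nexml.org/2009'
  tree = doc.find_first("//nex:tree[@id='tree1']", ns)
  puts tree['label'] if tree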
> >>>
> >>> --
> >>> Anurag Priyam,
> >>> 2nd Year Undergraduate,
> >>> Department of Mechanical Engineering,
> >>> IIT Kharagpur.
> >>> +91-9775550642
> >> _______________________________________________
> >> BioRuby Project - http://www.bioruby.org/
> >> BioRuby mailing list
> >> BioRuby at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioruby
> >
> > --
> > ===========================================================
> > : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
> > ===========================================================
> >
> >
> >
> >
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby