[Bioperl-l] Bio::SeqIO::game

Bradley Marshall bradmars@yahoo.com
Fri, 1 Dec 2000 11:12:08 -0800 (PST)


> It is the GAME DTD which does not provide a "parse unit" tag
> (I think) and I think this is a clear bug in GAME. As people
> have noticed, if we dump Ensembl in GAME (feasible now) then
> loading the whole thing up in a non chunk by chunk way will be
> murder (this is about 10GB of data).
>
> I would propose that either <game> </game> becomes an official
> "parse unit" tag or that the game people figure out another tag
> that we can chunk on...
> 

Hmm...  This may be more difficult than it sounds.
The problem is that features may produce their own
sequences.  If you have a genomic sequence, it can
have exon features that point to mRNA sequences and
amino acid sequences.  So one <seq_chunk/> that has a
top-level genomic sequence can also have subsequences.

At BDGP we export the XML into files, each representing
one GenBank accession unit of around 300 kb.  I haven't
done performance testing with the SAX parser, but it
only takes a few seconds (maybe 3-5) to do one of
these chunks with annotations.  I suppose it would be
pretty inefficient if we exported the whole database
into one XML file, but I don't think it would be
terrible, either.

Still, we could add a <parse_unit/> tag which could
break it down by top-level sequences.  Each
<parse_unit/> could be read into memory as a string and
the SAX parser could have at them one at a time.  Does
this sound like a good solution to everybody?
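
To make the idea concrete, here is a rough sketch of the chunking
step. It is in Python rather than Perl and is only an illustration:
the <parse_unit> tag does not exist in GAME yet, the <seq> element
name and the helper names are made up for the example, and it
assumes the tag carries no attributes and is never nested. Each
unit is pulled out of the stream as a string and handed to a SAX
parser on its own, so only one unit is in memory at a time.

```python
import io
import xml.sax

# Hypothetical tag names; GAME does not define <parse_unit> today.
OPEN, CLOSE = "<parse_unit>", "</parse_unit>"


def iter_parse_units(stream, block_size=65536):
    """Yield each <parse_unit>...</parse_unit> span as a string,
    reading the stream block by block instead of all at once."""
    buf = ""
    while True:
        block = stream.read(block_size)
        buf += block
        while True:
            start = buf.find(OPEN)
            if start == -1:
                # Keep only a tail that could hold a split opening tag.
                buf = buf[-(len(OPEN) - 1):]
                break
            end = buf.find(CLOSE, start)
            if end == -1:
                # Unit is incomplete; drop text before it and read more.
                buf = buf[start:]
                break
            end += len(CLOSE)
            yield buf[start:end]
            buf = buf[end:]
        if not block:  # end of stream
            break


class SeqCounter(xml.sax.ContentHandler):
    """Toy handler: counts <seq> elements (assumed name) in one unit."""

    def __init__(self):
        super().__init__()
        self.seq_count = 0

    def startElement(self, name, attrs):
        if name == "seq":
            self.seq_count += 1


def count_seqs_per_unit(stream, block_size=65536):
    """Run a fresh SAX parse over each unit and collect the counts."""
    counts = []
    for unit in iter_parse_units(stream, block_size):
        handler = SeqCounter()
        xml.sax.parseString(unit.encode("utf-8"), handler)
        counts.append(handler.seq_count)
    return counts
```

Note that a unit can still hold a top-level sequence plus its
subsequences (the nested <seq> in a test document would count
toward the same unit), so the chunking does not split a genomic
sequence from the mRNA or aa sequences hanging off its features.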

Thanks for the feedback, guys.

Brad

