[Open-bio-l] Status of OBDA and indexed flatfiles?
Naohisa GOTO
ngoto at gen-info.osaka-u.ac.jp
Mon Aug 31 10:01:46 EDT 2009
Hi Peter,
On Mon, 31 Aug 2009 13:07:45 +0100
Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> I'm looking at indexing next generation sequence files for Biopython
> (e.g. FASTQ short read files with 10s of millions of entries), where
> even just holding the record names and their file offsets in memory
> is beginning to be a bottleneck.
>
> What is the current status of Open Biological Database Access (OBDA),
> and in particular the index files for sequence "flat files" like FASTA or
> GenBank (or FASTQ)?
>
> http://www.bioperl.org/wiki/HOWTO:Flat_databases
> http://www.bioperl.org/wiki/HOWTO:OBDA
> http://obda.open-bio.org/
>
> The spec files are still in CVS (and ViewCVS is still broken since
> the recent server move), rather than having been migrated to SVN
> which may suggest things are obsolete (or on the bright side, stable).
>
> Presumably BioPerl still uses these index files? What about the
> other projects? I know EMBOSS has some indexing system for
> example but I have no idea how it works internally.
BioRuby still uses them. To gain performance, names and offsets are
written to temporary files and then sorted using an external sort
program (by default /usr/bin/sort).
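As a rough illustration of that approach (a minimal sketch in Python, not
BioRuby's actual code and not the OBDA index format), the following writes
"name<TAB>offset" lines to a temporary file and hands them to an external
sort program, so the full list of names never has to be held in memory. The
function names, the tab-separated layout, and the assumption of strict
four-line FASTQ records are all hypothetical choices made for this example.

```python
import os
import subprocess
import tempfile


def build_sorted_index(fastq_path, index_path, sort_program="sort"):
    """Write 'name<TAB>offset' lines per record, then sort them externally."""
    fd, unsorted_path = tempfile.mkstemp(suffix=".unsorted")
    try:
        with os.fdopen(fd, "wb") as out, open(fastq_path, "rb") as handle:
            offset = 0
            while True:
                header = handle.readline()
                if not header:
                    break
                # Assumption: strict four-line records (no wrapped sequences).
                rest = [handle.readline() for _ in range(3)]
                # Record name = text after '@' up to the first whitespace.
                name = header[1:].split(None, 1)[0]
                out.write(name + b"\t" + str(offset).encode("ascii") + b"\n")
                offset += len(header) + sum(len(line) for line in rest)
        # LC_ALL=C asks sort for plain byte ordering, independent of locale.
        env = dict(os.environ, LC_ALL="C")
        subprocess.check_call(
            [sort_program, "-t", "\t", "-k", "1,1", "-o", index_path, unsorted_path],
            env=env,
        )
    finally:
        os.remove(unsorted_path)


def lookup_offset(index_path, name):
    """Scan the sorted index; stop early once past where the name would be."""
    key = name.encode("ascii")
    with open(index_path, "rb") as index:
        for line in index:
            entry, _, offset = line.rstrip(b"\n").partition(b"\t")
            if entry == key:
                return int(offset)
            if entry > key:
                break
    return None
```

The lookup here is a simple linear scan with an early exit; a real index
would more likely use a binary search over the sorted file, which is what
makes the sorting step worthwhile.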
In BioRuby, the flatfile-only solution works fine, but the BerkeleyDB
indexes would be incompatible with other projects because of confusion
in the spec, as discussed in BioPerl Bugzilla Bug #2337.
http://bugzilla.open-bio.org/show_bug.cgi?id=2337
Thanks,
--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org