[Biojava-dev] Biojava.util package?

Thu Mar 29 18:00:37 UTC 2012

>> so far it still feels like a wrapper for what is already there.
>
> That would still be useful if you wanted to write a format agnostic
> tool wouldn't it?

Sure! Since we are talking about a Google Summer of Code project here,
I am trying to help David propose something awesome, rather than
something incremental ;-)

A solution which I would find exciting would deal with:

- IDs as input
- proxy-fetching from remote primary databases
- local caching (optional)
- smart detection of the data type of arbitrary user input (very useful
in web development, where users can upload arbitrary files)

plus probably a couple of other things that I have not thought of yet.
The AtomCache class (which in the future could be named StructureIO?)
is actually already dealing with similar requirements in the protein
structure world.
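
To make the first three points concrete, here is a minimal sketch of
ID-based fetching with an optional local cache. Everything in it is
hypothetical: the CachingFetcher name, the one-file-per-ID cache layout
and the Function used as a stand-in for a real remote database client
are my assumptions, not existing BioJava API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;

/** Hypothetical sketch: resolve an ID from a local cache, else fetch remotely. */
public class CachingFetcher {
    private final Path cacheDir;                    // null disables caching
    private final Function<String, String> remote;  // stand-in for a database client

    public CachingFetcher(Path cacheDir, Function<String, String> remote) {
        this.cacheDir = cacheDir;
        this.remote = remote;
    }

    public String fetch(String id) throws IOException {
        // Only accept simple IDs so they are safe to use as file names.
        if (!id.matches("[A-Za-z0-9_.-]+")) {
            throw new IllegalArgumentException("unsupported ID: " + id);
        }
        if (cacheDir == null) {
            return remote.apply(id);
        }
        Path cached = cacheDir.resolve(id + ".txt");
        if (Files.exists(cached)) {
            // Cache hit: no remote call is made.
            return new String(Files.readAllBytes(cached), StandardCharsets.UTF_8);
        }
        // Cache miss: fetch once, then store the record for next time.
        String record = remote.apply(id);
        Files.createDirectories(cacheDir);
        Files.write(cached, record.getBytes(StandardCharsets.UTF_8));
        return record;
    }
}
```

A real version would also need cache expiry and error handling for
failed downloads, which this sketch leaves out.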

> I don't think it is possible to reliably distinguish all sequence file
> formats
> As a specific example, the different FASTQ formats are tricky.

OK, my main goal is to make this as easy as possible to work with.
There can certainly be limits beyond which the user has to provide
more specific details.

> Also doing format guessing with a stream input (e.g. stdin)
> would be fiddly due to the need to buffer the data while you
> decide how to interpret it.

If easier, format guessing could be on the file level. E.g. we already
have some tools for automatically uncompressing files on the fly: the
InputStreamProvider class in core utils.
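
For the stream case, the buffering might be less fiddly than it sounds
if we only peek at the first few bytes. Below is a rough sketch using
the standard mark/reset calls on InputStream. The FormatGuesser class,
its Format enum and the first-character rules are made up for
illustration; the rules are far too naive for real use (for instance
'@' also starts SAM headers, and this says nothing about which FASTQ
variant a file is).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: guess a format by peeking at the start of a stream. */
public class FormatGuesser {
    public enum Format { GZIP, FASTA, FASTQ, GENBANK, UNKNOWN }

    private static final int PEEK = 64; // number of bytes to inspect

    /** Reads up to PEEK bytes, then rewinds so the caller can parse from the start. */
    public static Format guess(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        in.mark(PEEK);
        byte[] head = new byte[PEEK];
        int n = 0;
        while (n < PEEK) {
            int r = in.read(head, n, PEEK - n);
            if (r < 0) break; // end of stream
            n += r;
        }
        in.reset();

        // gzip files start with the magic bytes 0x1f 0x8b.
        if (n >= 2 && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b) {
            return Format.GZIP;
        }
        // Naive first-character heuristics (assumptions, see the note above).
        String text = new String(head, 0, n, StandardCharsets.US_ASCII);
        if (text.startsWith(">")) return Format.FASTA;
        if (text.startsWith("@")) return Format.FASTQ;
        if (text.startsWith("LOCUS")) return Format.GENBANK;
        return Format.UNKNOWN;
    }
}
```

Wrapping stdin as new BufferedInputStream(System.in) would be enough
to call guess() on it, and whenever the answer is UNKNOWN the user
would have to name the format explicitly.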

Andreas