[Biojava-dev] Biojava.util package?

Thu Mar 29 16:10:47 UTC 2012

On Thu, Mar 29, 2012 at 3:39 PM, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi David,
>
> so far it still feels like a wrapper for what is already there.

That would still be useful if you wanted to write a format-agnostic
tool, wouldn't it? How much boilerplate code would it save over
a big if statement selecting the appropriate parser, where each
may follow a different style and may even return different classes?
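
To make the comparison concrete, here is a minimal sketch of such a
wrapper in Python (not BioJava code). The names read_sequences,
parse_fasta, parse_tab and PARSERS are all made up for illustration,
and the two toy parsers only handle the simplest form of each format.
The point is that every parser is registered under a format name and
returns the same kind of record, so the caller never writes the if
statement:

```python
from io import StringIO


def parse_fasta(handle):
    """Yield (identifier, sequence) pairs from simple FASTA text."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # The identifier is the first word of the title line.
            name, chunks = (fields[0] if fields else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def parse_tab(handle):
    """Yield (identifier, sequence) pairs from two-column tabbed text."""
    for line in handle:
        line = line.rstrip("\n")
        if line:
            name, seq = line.split("\t")[:2]
            yield name, seq


# Hypothetical registry: format name -> parser function.
PARSERS = {"fasta": parse_fasta, "tab": parse_tab}


def read_sequences(handle, fmt):
    """Return an iterator of (identifier, sequence) for a named format."""
    try:
        parser = PARSERS[fmt.lower()]
    except KeyError:
        raise ValueError("Unknown format: %r" % fmt) from None
    return parser(handle)


records = list(read_sequences(StringIO(">a demo\nAC\nGT\n>b\nTT\n"), "fasta"))
```

Adding a format then means adding one entry to the registry, and code
written against read_sequences does not change.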

> Try to
> take it to the next level. Why does the user still need to provide the
> type of file, can't this be auto-detected? What is the behaviour for
> non-fasta files, what can be supported and where are the limits, etc.
>
> Andreas

I don't think it is possible to reliably distinguish all sequence file
formats - BioPerl tries, Biopython doesn't, which partly reflects
the difference in language style between Perl and Python.

As a specific example, the different FASTQ formats are tricky.
You can look at the distribution of ASCII quality characters and
in some cases determine it must be Sanger FASTQ since the
scores are invalid in Solexa/Illumina's legacy formats - but in
general this is an educated guess.
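
As a rough sketch of that heuristic (in Python, with a made-up
function name, and not how any particular library does it): Sanger
FASTQ uses an ASCII offset of 33, Solexa scores start at -5 with an
offset of 64 (lowest character ';', ASCII 59), and Illumina 1.3+
scores start at 0 with an offset of 64 (lowest character '@', ASCII
64). So only the lowest character seen can rule anything out:

```python
def guess_fastq_variant(quality_strings):
    """Guess the FASTQ variant from the lowest quality character seen.

    Only a low character is conclusive; high characters are valid in
    every variant, so the answer is then just "ambiguous".
    """
    lowest = min((ord(c) for q in quality_strings for c in q), default=None)
    if lowest is None:
        return "unknown"  # no quality data at all
    if lowest < 59:
        return "sanger"  # below ';', invalid in both offset-64 variants
    if lowest < 64:
        return "sanger-or-solexa"  # below '@', invalid in Illumina 1.3+
    return "ambiguous"  # valid in all three variants
```

A file containing only high quality reads falls into the last case,
which is why this remains a guess rather than a detection.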

Also, doing format guessing with a stream input (e.g. stdin)
would be fiddly due to the need to buffer the data while you
decide how to interpret it.
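
To illustrate the buffering issue, here is a small Python sketch (the
function name sniff_format is hypothetical, and checking a single
leading byte is far cruder than a real format guesser would need). It
relies on io.BufferedReader.peek, which returns buffered bytes without
consuming them, so the chosen parser still sees the stream from the
start:

```python
import io


def sniff_format(buffered):
    """Guess a format from the first byte without consuming the stream."""
    first = buffered.peek(1)[:1]  # peek may return more than one byte
    if first == b">":
        return "fasta"
    if first == b"@":
        return "fastq"
    return "unknown"


# An in-memory stand-in for a binary stream such as sys.stdin.buffer.
stream = io.BufferedReader(io.BytesIO(b">seq1\nACGT\n"))
fmt = sniff_format(stream)
rest = stream.read()  # still starts at the first byte
```

The catch is that peek can only look as far as the reader's internal
buffer, so any guess that needs to examine many records (like the
FASTQ quality distribution) means managing your own buffer.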

Peter