[Biopython] File format autodetection.

Peter Cock p.j.a.cock at googlemail.com
Tue Jun 24 16:54:12 UTC 2014


Hi Ivan,

Biopython's SeqIO does not (and will not) do automatic file
format detection. It is just too hard to get right, so instead
that's the user's task:

Zen of Python: Explicit is better than implicit.
http://legacy.python.org/dev/peps/pep-0020/

(BioPerl's SeqIO can do format guessing)

Your use case is one which highlights a technical reason
why this is hard - you are using stdin, a read-once handle.
You cannot peek at the file, guess the format, seek back to
the beginning, and then give the handle to a specific parser.

You could use Biopython's UndoHandle here, but it will
impose a (modest) performance overhead.

from Bio.File import UndoHandle
help(UndoHandle)

e.g. Use the .peekline() method to spot FASTA vs FASTQ? The first
line of a FASTA file starts with ">", while FASTQ starts with "@".
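
A minimal sketch of the peek-then-dispatch idea, using only the standard
library so it runs without Biopython. PeekableLines, guess_format and
count_records are hypothetical names made up for this illustration; the
class only mimics the peekline() idea and is not Biopython's UndoHandle.
The FASTQ count assumes strict four-line records (no wrapped sequence or
quality lines), which may not hold for every file.

```python
import io
import sys


class PeekableLines:
    """Hypothetical wrapper: look at the first line of a read-once
    stream without losing it (not Biopython's UndoHandle)."""

    def __init__(self, handle):
        self._handle = handle
        self._saved = []  # lines read ahead but not yet consumed

    def peekline(self):
        # Read one line ahead and remember it for later replay.
        if not self._saved:
            self._saved.append(self._handle.readline())
        return self._saved[0]

    def __iter__(self):
        # Replay any saved line first, then continue with the stream.
        while self._saved:
            line = self._saved.pop(0)
            if line:
                yield line
        yield from self._handle


def guess_format(first_line):
    """Guess 'fasta' or 'fastq' from the first character of the stream."""
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith("@"):
        return "fastq"
    raise ValueError("Input looks like neither FASTA nor FASTQ")


def count_records(handle):
    """Return (format, record count) for a text handle such as sys.stdin."""
    stream = PeekableLines(handle)
    fmt = guess_format(stream.peekline())
    if fmt == "fasta":
        # One header line per FASTA record.
        count = sum(1 for line in stream if line.startswith(">"))
    else:
        # Assumption: every FASTQ record is exactly four lines long.
        count = sum(1 for _ in stream) // 4
    return fmt, count


if __name__ == "__main__":
    fmt, count = count_records(sys.stdin)
    print(f"{fmt}\t{count}")
```

No seek() is ever called, so this should work on a pipe such as
"zcat myfile.fq.gz | python script.py". The guessed format string could
instead be handed to SeqIO.parse(handle, fmt), provided the wrapper
offers whatever handle methods that parser needs.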

Peter

On Tue, Jun 24, 2014 at 5:16 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> Hello Biopythoneers,
>
> The question:
>
> What is the strategy currently used for file format autodetection?
>
>
> The context:
>
> I have written a command line program that gets a stream of FASTQ data
> and reports how many records are contained. You can visualise it like
> this
>
> zcat myfile.fq.gz | fxcounttags.py -i /dev/stdin -o /dev/stdout > myfile.counts
>
> That works fine for FASTQ but I need to extend the functionality to
> FASTA streams. How would you write fxcounttags.py to detect
> FASTQ/FASTA?
>
> Thank you,
>
> Ivan
>
>
>
> Ivan Gregoretti, PhD
> Bioinformatics
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython

