[Biopython] newbie question: sequence parsing

Peter Cock p.j.a.cock at googlemail.com
Tue Oct 18 19:31:06 UTC 2011


On Tue, Oct 18, 2011 at 8:11 PM, Fields, Christopher J
<cjfields at illinois.edu> wrote:
> On Oct 18, 2011, at 2:04 PM, Peter Cock wrote:
>
>> On Tue, Oct 18, 2011 at 7:08 PM, Nat Echols <nathaniel.echols at gmail.com> wrote:
>>> ...
>>> 2) Is there a single function that will take a file (and/or string) of
>>> unknown format and try the different parsers until it finds one that works?
>>>  We currently use several different formats (raw string, FASTA, PIR, and
>>> possibly others), and we try not to rely on the file extension alone to
>>> determine the type.  We already have something that does this using our
>>> parsers, which could be refactored to use Bio.SeqIO instead, but if
>>> BioPython has something similar I'd rather use that.
>>
>> No, we don't have such a function. There are many difficulties
>> with format guessing - both from the file contents and even the
>> filename. I usually cite the Zen of Python, Explicit is Better Than
>> Implicit.
>>
>> Peter
>
> Some implicitness is fine, but speaking from experience
> (BioPerl's GuessSeqFormat) trying to guess the format
> from the dozens that litter the bioinformatics landscape
> is a nest of hornets no one wants to maintain.
>
> chris

I think "nest of hornets" is a much more beautiful phrase
than my deadpan "many difficulties".

The practical reality is that while some file formats are
easy (binary files with 4-byte "magic" identifiers), others
are horrible, and the definitions shift over time as new
formats or variants are added. I really don't want to go
there.
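
To illustrate the contrast, here is a rough sketch (not anything in
Biopython; the names MAGIC_FORMATS, sniff_magic and sniff_text are made
up for this example). It assumes SFF files start with the bytes ".sff"
and ABI trace files with "ABIF", and it treats a first line like
">P1;..." as a possible PIR header. The binary case is a dictionary
lookup; the text case can only return a list of candidates, because a
PIR header is also a legal FASTA header.

```python
# Hypothetical helpers for illustration only; not part of Biopython.

# Binary formats that announce themselves with a 4-byte magic identifier.
# The labels on the right are illustrative format names.
MAGIC_FORMATS = {
    b".sff": "sff",  # Roche 454 Standard Flowgram Format
    b"ABIF": "abi",  # Applied Biosystems trace files
}


def sniff_magic(data):
    """Return a format label if the first 4 bytes are a known magic number, else None."""
    return MAGIC_FORMATS.get(bytes(data[:4]))


def sniff_text(text):
    """Return every text format the first line is consistent with.

    The result is a list because the answer is often ambiguous.
    """
    stripped = text.strip()
    first = stripped.splitlines()[0] if stripped else ""
    candidates = []
    if first.startswith(">"):
        # Any line starting with ">" is a plausible FASTA header.
        candidates.append("fasta")
        # PIR/NBRF headers look like ">P1;CRAB_ANAPL", which FASTA also accepts.
        if len(first) > 3 and first[3] == ";":
            candidates.append("pir")
    elif first.isalpha():
        # A bare run of letters might be a raw sequence, or might not.
        candidates.append("raw")
    return candidates
```

For example, sniff_text(">P1;CRAB_ANAPL\n...") gives both "fasta" and
"pir", and nothing in the first line says which one the author meant.
That is with only three text formats; every extra format or variant adds
more overlapping cases like this.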

Peter



