[Biopython] newbie question: sequence parsing

Tue Oct 18 19:04:14 UTC 2011

On Tue, Oct 18, 2011 at 7:08 PM, Nat Echols <nathaniel.echols at gmail.com> wrote:
> Greetings--
>
> We have started using BioPython in our (non-bioinformatics) application and
> are investigating the possibility of replacing our existing (custom-made)
> sequence parsers.  Two quick questions:
>
> 1) Is there a sequence parser that works with just a simple string, without
> any header or additional metadata?  If not, how could we write one that
> results in the same basic object as those in Bio.SeqIO?  (The parsing is of
> course easy, I just want to have the API be consistent regardless of
> format.)

Sounds like the "raw" format in EMBOSS, although there are two
interpretations: one sequence per line, or one sequence for the
whole file.

Have a look at the FASTA parser in Bio/SeqIO/FastaIO.py, which is
the simplest case. Essentially you create a SeqRecord object for
each sequence (SeqRecord is covered in the Tutorial).
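
As a rough sketch of that idea (not an existing Bio.SeqIO parser), something like the following could cover both "raw" interpretations. The names iter_raw_sequences, parse_raw, one_per_line and id_prefix are made up for illustration, and the "raw_1"-style identifiers are placeholders, since raw input carries no names of its own.

```python
def iter_raw_sequences(handle, one_per_line=True):
    """Yield plain sequence strings from a header-less text handle.

    one_per_line=True:  every non-blank line is a separate sequence.
    one_per_line=False: the whole file is one sequence; whitespace is dropped.
    """
    if one_per_line:
        for line in handle:
            seq = line.strip()
            if seq:
                yield seq
    else:
        seq = "".join(handle.read().split())
        if seq:
            yield seq


def parse_raw(handle, one_per_line=True, id_prefix="raw"):
    """Wrap each raw sequence in a SeqRecord, as the Bio.SeqIO parsers do."""
    # Imported here so the string-only helper above works without Biopython.
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord

    sequences = iter_raw_sequences(handle, one_per_line)
    for index, seq in enumerate(sequences, start=1):
        # Hypothetical placeholder id: raw input has no identifier to reuse.
        yield SeqRecord(Seq(seq), id="%s_%i" % (id_prefix, index),
                        description="")
```

Calling parse_raw(open("example.txt")) would then give an iterator of SeqRecord objects, the same kind of thing you get back from Bio.SeqIO.parse(), so the rest of your code need not care which one produced them.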

> 2) Is there a single function that will take a file (and/or string) of
> unknown format and try the different parsers until it finds one that works?
>  We currently use several different formats (raw string, FASTA, PIR, and
> possibly others), and we try not to rely on the file extension alone to
> determine the type.  We already have something that does this using our
> parsers, which could be refactored to use Bio.SeqIO instead, but if
> BioPython has something similar I'd rather use that.

No, we don't have such a function. There are many difficulties
with format guessing - both from the file contents and even the
filename. I usually cite the Zen of Python: "Explicit is better than
implicit."
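
If you do refactor your own fallback logic on top of Bio.SeqIO, one way to stay explicit is to have the caller supply the ordered list of formats to try. The sketch below is only an illustration of that, not a Biopython function, and parse_with_candidates is a made-up name. Two caveats show why guessing is fragile: depending on the parser and the Biopython version, the wrong parser may return no records at all rather than raise an error, and PIR records also begin with ">", so the FASTA parser may well accept a PIR file. The order of the list therefore matters.

```python
from io import StringIO


def parse_with_candidates(text, candidate_formats):
    """Return (format_name, records) for the first format yielding records.

    candidate_formats is an explicit, ordered list chosen by the caller,
    for example ["pir", "fasta"]. Nothing is inferred from file names.
    """
    # Imported inside the function so this module loads without Biopython.
    from Bio import SeqIO

    for fmt in candidate_formats:
        try:
            records = list(SeqIO.parse(StringIO(text), fmt))
        except ValueError:
            # This parser rejected the input (an unknown format name also
            # lands here); move on to the next candidate.
            continue
        if records:
            return fmt, records
    raise ValueError("no candidate format produced any records")
```

For example, parse_with_candidates(data, ["pir", "fasta"]) tries PIR before FASTA. The first format that happens to parse is not necessarily the right one, which is really the point above.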

Peter