[Biopython] SeqIO fasta "fakes" recognition

Thu Feb 23 16:21:36 UTC 2012

On Thu, Feb 23, 2012 at 4:05 PM, Marco Galardini
<marco.galardini at unifi.it> wrote:
> Hi all,
>
> I was wondering if you are aware of a method to distinguish between "real"
> fasta files and files that just happen to have a ">" character.
> I would like to scan a directory and return only the "real" fasta files.
> I tried to open a .png file and surprisingly it gave me the following
> results:

The FASTA parser doesn't attempt to restrict the sequence alphabet;
indeed, some FASTA-like files use all sorts of unusual characters
(e.g. RNA secondary structure). It also allows 'free text' before
the first record (useful in several situations, including FASTA records
embedded at the end of a GFF file). As a side effect of this need for
tolerance, the code does its best to read any file you give it, but
this is clearly a case of garbage in, garbage out (GIGO).

Guessing bioinformatics file types is non-trivial, and not something
that Bio.SeqIO attempts to do (unlike BioPerl). We take the Python
approach that you, the user, need to be explicit: if you say it is
a FASTA file, we'll try to treat it as such.

Detecting image files (or indeed most binary file types) on the other
hand is much easier - so do that instead?
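
As a rough sketch of that idea (this is not something Bio.SeqIO does
for you), you could read the first few bytes of each file and skip
anything that starts with a known binary signature or contains a NUL
byte. The function names, the signature list and the sample size below
are all just illustrative assumptions, and passing this check does not
prove that a file is FASTA, only that it is not obviously binary:

```python
import os

# Illustrative (assumed, incomplete) list of binary "magic numbers".
BINARY_SIGNATURES = (
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"\xff\xd8\xff",       # JPEG
    b"GIF87a",             # GIF
    b"GIF89a",             # GIF
    b"%PDF",               # PDF
    b"PK\x03\x04",         # ZIP
    b"\x1f\x8b",           # gzip
)


def looks_binary(path, sample_size=1024):
    """Heuristic: True if the start of the file looks like binary data."""
    with open(path, "rb") as handle:
        chunk = handle.read(sample_size)
    # A known signature at the very start is a strong hint.
    if chunk.startswith(BINARY_SIGNATURES):
        return True
    # Plain text files should not normally contain NUL bytes.
    return b"\x00" in chunk


def candidate_fasta_files(directory):
    """Yield paths of regular files in directory that do not look binary."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and not looks_binary(path):
            yield path
```

You would then hand each surviving path to Bio.SeqIO with the format
given explicitly, as usual. Note that an empty file passes this check,
and a gzipped FASTA file is skipped, so adjust the list to taste.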

Peter