[Biopython] SeqIO fasta "fakes" recognition

Thu Feb 23 16:35:29 UTC 2012

On Thu, Feb 23, 2012 at 11:21 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Feb 23, 2012 at 4:05 PM, Marco Galardini
> <marco.galardini at unifi.it> wrote:
> > Hi all,
> >
> > I was wondering if you are aware of a method to distinguish between "real"
> > FASTA files and files that just happen to have a ">" character.
> > I would like to scan a directory and return only the "real" fasta files.
> > I tried to open a .png file and surprisingly it gave me the following
> > results:
>
> The FASTA parser doesn't attempt to restrict the sequence alphabet,
> indeed some FASTA-like files do use all sorts of weird characters
> (e.g. RNA secondary structure). Also it allows for 'free text' before
> the first record (useful in several situations including FASTA records
> embedded at the end of a GFF file). As a side effect of this need for
> tolerance, the code does its best to read any file you give it - but
> this is clearly a case of garbage in, garbage out (GIGO).
>
> Guessing bioinformatics file types is non-trivial, and not something
> that Bio.SeqIO attempts to do (unlike BioPerl). We take the Python
> approach that you the user need to be explicit, and if you say it is
> a FASTA file we'll try to treat it as such.
>
>
I suppose there's always:

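from Bio import SeqIO

try:
    # SeqIO.read() raises ValueError unless the file holds exactly one record
    record = SeqIO.read("gigo.png", "fasta")
    if not str(record.seq).isalpha():
        raise ValueError("sequence contains non-letter characters")
except ValueError as err:
    print("Not a usable FASTA file: %s" % err)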

At some point, didn't we discuss adding optional alphabet validation, e.g.
a validate() method or something more automatic?
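
In the meantime, something along these lines might do for scanning a
directory. This is only a rough sketch and not part of Biopython: the names
looks_like_fasta, find_fasta_files and ALLOWED are made up for illustration,
and the allowed character set (letters plus "*" and "-") is an assumption
that would reject legitimate FASTA-like files such as RNA secondary
structure. It is also deliberately stricter than Bio.SeqIO in that it
refuses free text before the first ">" line.

```python
import os

# Assumption: sequences use only letters, plus "*" (stop) and "-" (gap).
ALLOWED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ*-")


def looks_like_fasta(path, allowed=ALLOWED):
    """Heuristic check: a '>' header comes first, then only allowed characters."""
    seen_header = False
    seen_sequence = False
    try:
        with open(path, "r", encoding="ascii") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # ignore blank lines
                if line.startswith(">"):
                    seen_header = True
                elif not seen_header:
                    return False  # text before the first record
                elif set(line.upper()) <= allowed:
                    seen_sequence = True
                else:
                    return False  # unexpected character in a sequence line
    except (UnicodeDecodeError, OSError):
        # Binary files (e.g. PNG) fail to decode as ASCII; unreadable files are skipped.
        return False
    return seen_header and seen_sequence


def find_fasta_files(directory):
    """Return the sorted names of files in `directory` that pass the heuristic."""
    return sorted(
        name
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
        and looks_like_fasta(os.path.join(directory, name))
    )
```

Calling find_fasta_files("some_dir") returns just the file names that pass,
so each one can then be handed to SeqIO.parse(..., "fasta") as usual.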

-Eric