[Biopython] how to validate fasta format

Peter biopython at maubp.freeserve.co.uk
Tue Oct 27 10:08:41 UTC 2009


On Tue, Oct 27, 2009 at 9:41 AM, Yvan Strahm <yvan.strahm at bccs.uib.no> wrote:
> Hello All,
>
> Is it possible to validate a sequence format, for example while the sequence
> is parsed by SeqIO.parse and using IUPAC.py? Or should I try to search for
> illegal characters in .seq?
>
> Cheers,
> yvan

It depends on what you mean by validate. If you want to check the
letters against a whitelist, then currently you would have to look
at the letters in the sequence yourself. I would use sets for this, e.g.

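from Bio import SeqIO

wanted = set("ACGT")  # whitelist of allowed letters (upper case only)
# handle is an open file handle for your FASTA file
for record in SeqIO.parse(handle, "fasta"):
    if not wanted.issuperset(record.seq):
        print("Bad: %s" % record.id)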
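
If you also want to know which letters were unexpected, a set
difference gives you that. This is only a rough sketch: illegal_letters
is a made-up helper name (it is not part of Biopython), the example
records are invented, and upper-casing before the comparison is an
assumption about how you want lower-case sequence treated.

```python
def illegal_letters(sequence, allowed="ACGT"):
    """Return a sorted list of the letters in sequence that are not in allowed."""
    # Upper-case first so lower-case (e.g. soft-masked) sequence is not flagged.
    # This is an assumption; drop .upper() for a strict case-sensitive check.
    return sorted(set(str(sequence).upper()) - set(allowed))


# Invented (id, sequence) pairs standing in for parsed records
examples = [("seq1", "ACGTTGCA"), ("seq2", "ACGNNT-A"), ("seq3", "acgt")]
for name, seq in examples:
    bad = illegal_letters(seq)
    if bad:
        print("Bad: %s (unexpected letters: %s)" % (name, "".join(bad)))
```

With records from SeqIO.parse you should be able to pass record.seq
directly, since the helper converts its argument with str() first.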

Making the Seq object validate against explicit alphabets (where
the allowed letters are given) is something I have wondered about
for the future.

Peter


