[Biopython] how to validate fasta format

Yvan Strahm yvan.strahm at bccs.uib.no
Tue Oct 27 12:03:11 UTC 2009


Peter wrote:
> On Tue, Oct 27, 2009 at 9:41 AM, Yvan Strahm <yvan.strahm at bccs.uib.no> wrote:
>> Hello All,
>>
>> Is it possible to validate a sequence format, for example while the sequence
>> is parsed by SeqIO.parse and using IUPAC.py? Or should I try to search for
>> illegal characters in .seq?
>>
>> Cheers,
>> yvan
> 
> It depends on what you mean by validate - if you want to check for
> specific letters against a whitelist, then currently you would have to
> look at the letters in the sequence. I would use sets for this. e.g.
> 
> from Bio import SeqIO
> 
> wanted = set("ACGT")
> for record in SeqIO.parse(handle, "fasta"):
>     # Flag any record containing letters outside the whitelist
>     if not wanted.issuperset(record.seq):
>         print("Bad: %s" % record.id)
> 
> Making the Seq object validate against explicit alphabets (where
> the allowed letters are given) is something I have wondered about
> for the future.
> 
> Peter

Thanks for the quick reply.

Yes, by validating I mainly meant checking for the correct alphabet in the Seq object, but also
for the correct header format. So I guess I have to trust the user... ;-)
Thanks again,
yvan
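
For reference, the two checks discussed above (allowed letters and header format) could be sketched roughly as follows. This is only an illustration in plain Python, not something Biopython provides: the function name validate_fasta, the header pattern, and the ACGT-only alphabet are all assumptions to adapt to your own data.

```python
import re

# Assumption: a header is ">identifier optional description", where the
# identifier contains no whitespace. Adjust to your own convention.
HEADER_RE = re.compile(r"^>\S+( .*)?$")
# Assumption: unambiguous upper-case DNA only.
ALLOWED = set("ACGT")


def validate_fasta(lines, allowed=ALLOWED):
    """Return a list of (line_number, message) problems found in FASTA lines."""
    problems = []
    seen_header = False
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            seen_header = True
            if not HEADER_RE.match(line):
                problems.append((number, "malformed header"))
        elif not seen_header:
            problems.append((number, "sequence data before first header"))
        else:
            # Any letter outside the whitelist is reported
            bad = set(line) - allowed
            if bad:
                problems.append(
                    (number, "illegal letters: %s" % "".join(sorted(bad)))
                )
    return problems
```

An open file handle can be passed directly as lines. When parsing with SeqIO instead, the same set difference could be applied to str(record.seq) for each record.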



