[Biopython] how to validate fasta format

Tue Oct 27 13:41:36 UTC 2009

On Tue, Oct 27, 2009 at 1:36 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Tue, Oct 27, 2009 at 12:03 PM, Yvan Strahm <yvan.strahm at bccs.uib.no> wrote:
> > Yes by validating I mainly meant check for the correct alphabet in the
> > Seq object but also the correct header's format. So I guess, I have to
> > trust the user.... ;-)
>
> The FASTA header is basically free format - almost anything is valid,
> although some tools object to things like pipes and underscores.
> You will need to test the data in terms of your own criteria.
>
>

In principle it is as you say, but if you want to implement a validator, I
would take into account that:
- many programs fail if the first character after the '>' is a space
- the first word after the '>' is usually treated as the name of the
  sequence; any further description must be separated from it by spaces or '|'
- the sequence should be continuous, i.e. not interrupted by blank lines
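
The three points above could be checked with something like the following.
This is only a minimal sketch in plain Python, not Biopython's own API: the
names validate_fasta and split_header are hypothetical, the input is assumed
to be an iterable of text lines, and blank lines between records are assumed
to be acceptable (only blank lines inside a record are reported).

```python
def split_header(header):
    """Split a header (without the leading '>') into (name, description).

    Assumption: the name is the first whitespace-delimited word.
    """
    parts = header.split(None, 1)
    name = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return name, description


def validate_fasta(lines):
    """Return a list of (line_number, message) problems found in FASTA lines."""
    problems = []
    seen_header = False
    blank_pending = None  # line number of a blank line seen after a header
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            seen_header = True
            blank_pending = None  # a blank line just before a new record is tolerated
            header = line[1:]
            if not header.strip():
                problems.append((number, "empty header"))
            elif header[0].isspace():
                problems.append((number, "whitespace right after '>'"))
        elif not line.strip():
            if seen_header:
                blank_pending = number
        else:
            if not seen_header:
                problems.append((number, "sequence data before first header"))
            elif blank_pending is not None:
                # sequence data resumes after a blank line: record is interrupted
                problems.append((blank_pending, "blank line inside sequence"))
                blank_pending = None
    return problems
```

For example, validate_fasta(open("my.fasta")) would return an empty list for
a file that passes these checks ("my.fasta" is just a placeholder name).
Checking the alphabet of each sequence is a separate step and is not covered
here.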

> Peter

-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it