[Biopython] how to validate fasta format

Peter biopython at maubp.freeserve.co.uk
Tue Oct 27 10:07:05 EDT 2009


On Tue, Oct 27, 2009 at 1:41 PM, Giovanni Marco Dall'Olio
<dalloliogm at gmail.com> wrote:
>
> In principle is as you say, but if you want to implement a validator, I
> would take into account that:
> - many programs fail if the first character after the '>' is a space

Good point. I'd interpret that as a record without a name/identifier,
but with a description. We should double check that Biopython handles
this gracefully.
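
As a rough illustration of that interpretation, here is a plain Python
sketch. It is not what Biopython's parser does, and split_header is a
made-up helper name. It assumes that whitespace straight after the '>'
means an empty identifier, with the rest of the line as the description:

```python
def split_header(line):
    """Split a FASTA header line into (identifier, description).

    Assumption for this sketch: if whitespace follows '>' directly,
    the record has an empty identifier and the rest is description.
    """
    if not line.startswith(">"):
        raise ValueError("FASTA header lines must start with '>'")
    text = line[1:].rstrip("\r\n")
    if not text or text[0].isspace():
        # No name/identifier, only an (optional) description
        return "", text.strip()
    # First word is the identifier, anything after it is description
    parts = text.split(None, 1)
    identifier = parts[0]
    description = parts[1].strip() if len(parts) > 1 else ""
    return identifier, description


print(split_header(">seq1 some protein"))
print(split_header("> only a description"))
```

A fussier tool could instead treat the empty identifier as an error,
which is the kind of choice a validator has to make explicitly.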

> - the first word after the '>' is usually considered as being the name of
> the sequence; further descriptions must be separated by spaces or '|'

I'm not sure what you mean about the pipe (|) in descriptions - basically
anything is allowed in the description, but some tools are fussy.

> - the sequence is continuous and it should not be interrupted by blank lines

I think according to the original FASTA tools, blank lines are fine.
But again, some tools are fussy. Here Biopython should tolerate
blank lines on input, and not write them on output.
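
To make the "tolerate on input, avoid on output" idea concrete, here is
a minimal plain Python sketch. It is not Biopython code; read_fasta,
write_fasta and the line width of 60 are assumptions for illustration:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, ignoring blank lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines anywhere on input
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_fasta(records, width=60):
    """Return FASTA text with wrapped sequence lines and no blank lines."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        for i in range(0, len(seq), width):
            out.append(seq[i:i + width])
    return "".join(line + "\n" for line in out)


messy = ">a desc\nACGT\n\nTTGA\n\n>b\n\nGG\n"
records = list(read_fasta(messy.splitlines()))
print(write_fasta(records, width=4), end="")
```

Reading the messy input and writing it back gives a file that the
fussier tools should also accept.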

i.e. FASTA "validation" always depends on what you are doing it
for. As another example, when preparing data for TMHMM it is sensible
to impose a minimum length on the sequence - but a short or
even zero length sequence is valid in FASTA files in general.
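
One way to express this is to make the purpose-specific rules
parameters of the check rather than part of the format. A small plain
Python sketch follows; check_records is a hypothetical helper, and the
threshold of 20 is an arbitrary example value, not a documented TMHMM
requirement:

```python
def check_records(records, min_length=0):
    """Return a list of problems found in (identifier, sequence) pairs.

    min_length is a caller-chosen threshold; the default of 0 accepts
    empty sequences, which are valid FASTA in general.
    """
    problems = []
    for identifier, seq in records:
        if len(seq) < min_length:
            problems.append(
                "%s: length %d is below the minimum of %d"
                % (identifier or "<no identifier>", len(seq), min_length)
            )
    return problems


records = [("short", "MKT"), ("empty", "")]
print(check_records(records))                 # general FASTA: nothing to report
print(check_records(records, min_length=20))  # stricter, tool-specific check
```

The same records pass the general check and fail the stricter one,
which is the point: neither answer is "the" validation result.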

Peter

