[Biopython] SeqIO.parse Question

Mon Nov 23 11:19:46 UTC 2009

Thanks for clarifying João :)

On Mon, Nov 23, 2009 at 10:49 AM, João Rodrigues <anaryin at gmail.com> wrote:
> Sorry for the clouded explanation :x I'll try to show you an example:
>
> I have a server that runs BLAST queries from user deposited sequences. Those
> sequences have to in FASTA format.  4 Users deposit their sequences
>
> User 1:
>>SequenceName
> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Valid record, fine.

> User2:
> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Missing ">" header, this contains no FASTA records.

> User3:
>>Sequence1
> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
> Sequence2
> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Assuming you don't mind numbers in your sequence (which
do get used in some situations), this is a valid FASTA file
with a single record, equivalent to identifier "Sequence1"
and sequence:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAASequence2BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

> User4:
>>SequenceOops
> AAAAAAAAAAAAAA1AAAAAAAAAAAAAAAAAAAAAAA

As in example 1, assuming you don't mind numbers in your
sequence, this is a valid FASTA file.

> Now, if I run this through a python script that has simply something like
> this:
>
> user_input = getInput() # Gets input from the user (can be single or
> multiple sequences)
>
> for record in SeqIO.parse(StringIO(user_input),'fasta'): # Parses each
> sequence on at a time
>   print record.id
> print "Parsed"
>
> This will happen for each of the users up there:
>
> User1 will run smoothly and 'SequenceName' will be displayed. 'Parsed' will
> also be displayed.
>
> User2 will be shown 'Parsed'. Despite his sequence is not in FASTA format,
> the parser didn't throw an exception saying so. It just skips the for loop (
> maybe treats the SeqIO.parse as None ).
>
> User3 will be shown 'Sequence1' and 'Parsed', although his second sequence
> is not correctly formatted.
>
> User4 will be shown 'SequenceOops' and 'Parsed', although there is a 1 in
> the sequence ( which is not a valid character for any sequence ).
>
> My question is basically: is there a way to do a sanity check to a file to
> see if it really contains proper FASTA sequences? The way I'm doing it works
> ok but it seems to be a bit too messy to be the best solution.
>
> I'm first checking if the first character of the user input is a '>'. If it
> is, I'm then passing the whole input to the Biopython parser.

I probably do something similar - but I would first strip the white space.
After all, "\n\n\n>ID\nACGT\n\n\n" is a valid FASTA file with one record.

If the sequence lacks the ">", then I would either raise an error, or
add something like ">Default\n" to the start automatically. Do whatever
the BLAST webpage does to make it consistent for your users.

> For each
> record the parser consumes, I get the sequence back, or what the parser
> thinks is a sequence, and then I check to see if there are any numbers,
> blankspaces, etc, in the sequence. If there are, I'll raise an exception.

Again, I might do the same (but see below).

> With those 4 examples:
>
> User 1 passes everything ok
> User 2 fails the first check.
> User 3 and 4 fail the second check because of blank spaces and numbers.
>
> This might sound a bit stupid on my part, and I apologize in advance, but
> this way I don't see much of a use in SeqIO.parse function. I'd do almost
> the same with user_input.split('\n>').
>
> Is this clearer? My code is here: http://pastebin.com/m4d993239

The problem is your definition of "valid FASTA" and Biopython's differ.
This is largely because the FASTA file format has never been strictly
defined. You'll find lots of differences in different tools (e.g. some like
ClustalW can't cope with long description lines; some tools allow
comment lines; in some cases characters like "." and "*" are allowed
but not all).

Also, you appear to want something very narrow - protein FASTA
files with a limited character set (some but not all of the full IUPAC
set) plus the minus sign (as a gap).

Bio.SeqIO is not trying to do file format validation - it is trying to do
file parsing, and for your needs it is being too tolerant. In this situation
then yes, doing your own validation (without using Biopython) might
be simplest.

How I would like to "fix" this is to implement Bug 2597 (strict alphabet
checking in the Seq object). Then, when you call Bio.SeqIO.parse,
include the expected alphabet which should specify the allowed
letters (and exclude numbers etc). See:
http://bugzilla.open-bio.org/show_bug.cgi?id=2597

Peter

P.S. In your code, using a set should be faster for checking membership:

allowed = set('ABCDEFGHIKLMNPQRSTUVWYZX-')

In fact, I would make the allowed list include both cases, then
you don't have to make all those calls to upper.

I would also double check to see if the latest version of BLAST
does in fact accept O (Pyrrolysine) or J (Leucine or Isoleucine),
and if need be contact the NCBI to update this webpage:
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml

Peter