[Biopython] SeqIO.parse Question

Mon Nov 23 11:34:53 UTC 2009

My definition of FASTA is actually what BLASTp requires. It's quite a picky
tool :) I had already understood that FASTA is quite... lax. But I thought I
was missing something, thus asking the list. Is the alphabet patch already
included?

Thanks for the tip on the leading white space, had missed that :)

João [...] Rodrigues
@ http://stanford.edu/~joaor/

On Mon, Nov 23, 2009 at 3:19 AM, Peter <biopython at maubp.freeserve.co.uk>wrote:

> Thanks for clarifying João :)
>
> On Mon, Nov 23, 2009 at 10:49 AM, João Rodrigues <anaryin at gmail.com>
> wrote:
> > Sorry for the clouded explanation :x I'll try to show you an example:
> >
> > I have a server that runs BLAST queries from user deposited sequences.
> Those
> > sequences have to in FASTA format.  4 Users deposit their sequences
> >
> > User 1:
> >>SequenceName
> > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>
> Valid record, fine.
>
> > User2:
> > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>
> Missing ">" header, this contains no FASTA records.
>
> > User3:
> >>Sequence1
> > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
> > Sequence2
> > BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>
> Assuming you don't mind numbers in your sequence (which
> do get used in some situations), this is a valid FASTA file
> with a single record, equivalent to identifier "Sequence1"
> and sequence:
>
>
> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAASequence2BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>
> > User4:
> >>SequenceOops
> > AAAAAAAAAAAAAA1AAAAAAAAAAAAAAAAAAAAAAA
>
> As in example 1, assuming you don't mind numbers in your
> sequence, this is a valid FASTA file.
>
> > Now, if I run this through a python script that has simply something like
> > this:
> >
> > user_input = getInput() # Gets input from the user (can be single or
> > multiple sequences)
> >
> > for record in SeqIO.parse(StringIO(user_input),'fasta'): # Parses each
> > sequence on at a time
> >   print record.id
> > print "Parsed"
> >
> > This will happen for each of the users up there:
> >
> > User1 will run smoothly and 'SequenceName' will be displayed. 'Parsed'
> will
> > also be displayed.
> >
> > User2 will be shown 'Parsed'. Despite his sequence is not in FASTA
> format,
> > the parser didn't throw an exception saying so. It just skips the for
> loop (
> > maybe treats the SeqIO.parse as None ).
> >
> > User3 will be shown 'Sequence1' and 'Parsed', although his second
> sequence
> > is not correctly formatted.
> >
> > User4 will be shown 'SequenceOops' and 'Parsed', although there is a 1 in
> > the sequence ( which is not a valid character for any sequence ).
> >
> > My question is basically: is there a way to do a sanity check to a file
> to
> > see if it really contains proper FASTA sequences? The way I'm doing it
> works
> > ok but it seems to be a bit too messy to be the best solution.
> >
> > I'm first checking if the first character of the user input is a '>'. If
> it
> > is, I'm then passing the whole input to the Biopython parser.
>
> I probably do something similar - but I would first strip the white space.
> After all, "\n\n\n>ID\nACGT\n\n\n" is a valid FASTA file with one record.
>
> If the sequence lacks the ">", then I would either raise an error, or
> add something like ">Default\n" to the start automatically. Do whatever
> the BLAST webpage does to make it consistent for your users.
>
> > For each
> > record the parser consumes, I get the sequence back, or what the parser
> > thinks is a sequence, and then I check to see if there are any numbers,
> > blankspaces, etc, in the sequence. If there are, I'll raise an exception.
>
> Again, I might do the same (but see below).
>
> > With those 4 examples:
> >
> > User 1 passes everything ok
> > User 2 fails the first check.
> > User 3 and 4 fail the second check because of blank spaces and numbers.
> >
> > This might sound a bit stupid on my part, and I apologize in advance, but
> > this way I don't see much of a use in SeqIO.parse function. I'd do almost
> > the same with user_input.split('\n>').
> >
> > Is this clearer? My code is here: http://pastebin.com/m4d993239
>
> The problem is your definition of "valid FASTA" and Biopython's differ.
> This is largely because the FASTA file format has never been strictly
> defined. You'll find lots of differences in different tools (e.g. some like
> ClustalW can't cope with long description lines; some tools allow
> comment lines; in some cases characters like "." and "*" are allowed
> but not all).
>
> Also, you appear to want something very narrow - protein FASTA
> files with a limited character set (some but not all of the full IUPAC
> set) plus the minus sign (as a gap).
>
> Bio.SeqIO is not trying to do file format validation - it is trying to do
> file parsing, and for your needs it is being too tolerant. In this
> situation
> then yes, doing your own validation (without using Biopython) might
> be simplest.
>
> How I would like to "fix" this is to implement Bug 2597 (strict alphabet
> checking in the Seq object). Then, when you call Bio.SeqIO.parse,
> include the expected alphabet which should specify the allowed
> letters (and exclude numbers etc). See:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2597
>
> Peter
>
> P.S. In your code, using a set should be faster for checking membership:
>
> allowed = set('ABCDEFGHIKLMNPQRSTUVWYZX-')
>
> In fact, I would make the allowed list include both cases, then
> you don't have to make all those calls to upper.
>
> I would also double check to see if the latest version of BLAST
> does in fact accept O (Pyrrolysine) or J (Leucine or Isoleucine),
> and if need be contact the NCBI to update this webpage:
> http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml
>
> Peter
>