[Biopython] SeqIO.parse Question

João Rodrigues anaryin at gmail.com
Mon Nov 23 10:49:14 UTC 2009


Sorry for the clouded explanation :x I'll try to show you an example:

I have a server that runs BLAST queries on user-deposited sequences. Those
sequences have to be in FASTA format. Four users deposit their sequences:

User 1:
>SequenceName
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

User 2:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

User 3:
>Sequence1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Sequence2
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

User 4:
>SequenceOops
AAAAAAAAAAAAAA1AAAAAAAAAAAAAAAAAAAAAAA

Now, if I run this through a Python script that simply does something like
this:

from io import StringIO
from Bio import SeqIO

user_input = getInput()  # Gets input from the user (single or multiple sequences)

# Parse the sequences one at a time
for record in SeqIO.parse(StringIO(user_input), 'fasta'):
    print(record.id)
print("Parsed")

This is what happens for each of the users above:

User 1 runs smoothly: 'SequenceName' is displayed, and so is 'Parsed'.

User 2 is shown only 'Parsed'. Although his sequence is not in FASTA format,
the parser doesn't throw an exception saying so. It just skips the for loop
(maybe it treats the result of SeqIO.parse as None).

User 3 is shown 'Sequence1' and 'Parsed', although his second sequence
is not correctly formatted.

User 4 is shown 'SequenceOops' and 'Parsed', although there is a 1 in
the sequence (which is not a valid character for any sequence).

My question is basically: is there a way to run a sanity check on a file to
see if it really contains proper FASTA sequences? The way I'm doing it works
OK, but it seems a bit too messy to be the best solution.

I first check whether the first character of the user input is a '>'. If it
is, I pass the whole input to the Biopython parser. For each record the
parser yields, I take the sequence (or what the parser thinks is a
sequence) and check whether it contains any numbers, blank spaces, etc.
If it does, I raise an exception.
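
A rough sketch of those two checks, in case it makes the description
easier to follow. The function name validate_fasta and the "letters only"
rule are assumptions made for illustration, not what the pastebin code
does; the sketch also strips surrounding whitespace before looking at the
first character, and it rejects records with an empty sequence.

```python
from io import StringIO

from Bio import SeqIO


def validate_fasta(user_input):
    """Return the parsed records, or raise ValueError on malformed input."""
    text = user_input.strip()

    # First check: the input must start with a FASTA header line.
    if not text.startswith(">"):
        raise ValueError("input does not start with '>'")

    records = list(SeqIO.parse(StringIO(text), "fasta"))
    if not records:
        raise ValueError("no FASTA records found")

    # Second check: whatever the parser took as sequence must be letters
    # only (an assumption; relax it if '-' or '*' should be accepted).
    # An empty sequence also fails, because "".isalpha() is False.
    for record in records:
        sequence = str(record.seq)
        if not sequence.isalpha():
            raise ValueError("record %r contains invalid characters" % record.id)
    return records
```

With the four inputs above, only User 1 should get records back; the other
three should raise ValueError (User 3 because the unmarked 'Sequence2' line
ends up inside the first sequence, digit included).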

With those 4 examples:

User 1 passes both checks.
User 2 fails the first check.
Users 3 and 4 fail the second check because of blank spaces and numbers.

This might sound a bit stupid on my part, and I apologize in advance, but
this way I don't see much use in the SeqIO.parse function. I'd do almost
the same with user_input.split('\n>').
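
For comparison, this is roughly what the split-based version would look
like without Biopython. It is only a sketch: split_fasta is a made-up name,
and it returns plain (header, sequence) tuples rather than the SeqRecord
objects that SeqIO.parse yields.

```python
def split_fasta(user_input):
    """Split raw FASTA text into (header, sequence) pairs."""
    text = user_input.strip()
    if not text.startswith(">"):
        raise ValueError("input does not start with '>'")

    pairs = []
    # Drop the leading '>' so every chunk starts with its header line.
    for chunk in text[1:].split("\n>"):
        header, _, body = chunk.partition("\n")
        # Join the sequence lines, trimming only line-end whitespace so
        # that stray characters inside a line remain visible to later checks.
        sequence = "".join(line.strip() for line in body.splitlines())
        pairs.append((header.strip(), sequence))
    return pairs
```

Like the parser, this folds User 3's unmarked 'Sequence2' line into the
first sequence, so the character check is still needed afterwards.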

Is this clearer? My code is here: http://pastebin.com/m4d993239


