[Biopython] SeqIO.parse Question
João Rodrigues
anaryin at gmail.com
Mon Nov 23 05:49:14 EST 2009
Sorry for the clouded explanation :x I'll try to show you an example:
I have a server that runs BLAST queries on user-deposited sequences. Those
sequences have to be in FASTA format. Four users deposit their sequences:
User 1:
>SequenceName
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
User 2:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
User 3:
>Sequence1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Sequence2
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
User 4:
>SequenceOops
AAAAAAAAAAAAAA1AAAAAAAAAAAAAAAAAAAAAAA
Now, if I run this through a Python script that does something as simple as
this:

from io import StringIO
from Bio import SeqIO

user_input = getInput()  # Gets input from the user (single or multiple sequences)
# Parse the input one sequence at a time
for record in SeqIO.parse(StringIO(user_input), 'fasta'):
    print(record.id)
print("Parsed")

This is what happens for each of the users above:
User 1 runs smoothly: 'SequenceName' is displayed, followed by 'Parsed'.
User 2 is shown only 'Parsed'. Although his sequence is not in FASTA format,
the parser doesn't throw an exception saying so. It just skips the for loop
(maybe it treats the result of SeqIO.parse as None).
User 3 is shown 'Sequence1' and 'Parsed', although his second sequence
is not correctly formatted.
User 4 is shown 'SequenceOops' and 'Parsed', although there is a 1 in
the sequence (which is not a valid character for any sequence).
My question is basically: is there a way to run a sanity check on a file to
see if it really contains proper FASTA sequences? The way I'm doing it works
OK, but it seems a bit too messy to be the best solution.
I first check whether the first character of the user input is a '>'. If it
is, I pass the whole input to the Biopython parser. For each record the
parser yields, I take the sequence (or what the parser thinks is a
sequence) and check it for numbers, blank spaces, etc. If there are any,
I raise an exception.
With those four examples:
User 1 passes both checks.
User 2 fails the first check.
Users 3 and 4 fail the second check because of blank spaces and numbers.
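
The second check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pastebin link: the function name find_invalid_characters and the ALLOWED set (letters plus '-' and '*') are assumptions made for the example, and a real service would probably want a stricter nucleotide or protein alphabet.

```python
import string

# Assumption: any letter, plus the gap "-" and stop "*" symbols, counts as valid.
ALLOWED = set(string.ascii_letters) | {"-", "*"}


def find_invalid_characters(sequence, allowed=ALLOWED):
    """Return (position, character) pairs for every character not in `allowed`."""
    return [(i, ch) for i, ch in enumerate(sequence) if ch not in allowed]


# Sequences modelled on the User 1 and User 4 examples.
user1 = "A" * 39
user4 = "A" * 14 + "1" + "A" * 23

print(find_invalid_characters(user1))  # []
print(find_invalid_characters(user4))  # [(14, '1')]
```

Returning the offending positions instead of raising on the first one makes it possible to tell the user exactly what to fix; raising an exception when the list is non-empty gives the behaviour described above.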
This might sound a bit stupid on my part, and I apologize in advance, but
this way I don't see much use in the SeqIO.parse function. I could do almost
the same with user_input.split('\n>').
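
For comparison, here is a rough sketch of what the split('\n>') alternative might look like, including the first-character check. The function name split_fasta is hypothetical, and this is not how SeqIO.parse works internally; it only illustrates the idea mentioned above.

```python
def split_fasta(text):
    """Split FASTA-like text into (header, sequence) pairs.

    Raises ValueError if the text does not start with '>'.
    """
    text = text.strip()
    if not text.startswith(">"):
        raise ValueError("input does not start with '>'")
    records = []
    # Drop the leading '>' so that every chunk begins with its header line.
    for chunk in text[1:].split("\n>"):
        lines = chunk.splitlines()
        header = lines[0].strip() if lines else ""
        # Join the remaining lines into one sequence string.
        sequence = "".join(line.strip() for line in lines[1:])
        records.append((header, sequence))
    return records


# Input modelled on User 3: the second header is missing its '>'.
user3 = ">Sequence1\n" + "A" * 38 + "\nSequence2\n" + "B" * 38
records = split_fasta(user3)
print(len(records))     # 1
print(records[0][0])    # Sequence1
```

Note that the malformed 'Sequence2' line ends up inside the first record's sequence, so a character check on the sequence is still needed to catch it.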
Is this clearer? My code is here: http://pastebin.com/m4d993239