[Biopython] SeqIO.parse Question

João Rodrigues anaryin at gmail.com
Mon Nov 23 04:02:06 EST 2009


Dear all,

This is merely a suggestion. I've been using SeqIO.parse on some user input
I receive from a server.

I'm using the following code:

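from io import StringIO
from Bio import SeqIO

# FASTA_sequence holds the raw text submitted by the user
for num, record in enumerate(SeqIO.parse(StringIO(FASTA_sequence), 'fasta')):
    req_seq = str(record.seq)  # record.seq.tostring() in older Biopython releases
    req_name = record.id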

Since I have no idea how many sequences the user will submit, I have to use
parse instead of read. If I submit a single, valid FASTA sequence, it works
flawlessly. If I submit several FASTA sequences and one of them is wrongly
formatted, it won't complain at all. If I submit a single malformed
sequence, it doesn't complain either.

Is there a convenient way for me to check FASTA formats? The usual
startswith('>') check doesn't work for multiple sequences. The user might
also have spaces in the sequence, so splitting the input with split('\n')
to separate the sequences is ruled out as well.

At the moment, I check whether the first sequence of the input starts with
'>'. If it does, the parser kicks in, and for every req_seq object I check
whether it contains any invalid character (a number or an otherwise weird
character). If a mis-formatted sequence gets in there, the check will
complain, because spaces, newlines, and numbers (often found in sequence
names) are not in my allowed list.
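
For illustration, here is a minimal sketch of that kind of check done as a
single pass over the raw text, before any parser is involved. It uses only the
standard library. The names validate_fasta and ALLOWED are hypothetical and
not part of Biopython, and the allowed alphabet (ASCII letters plus '*' and
'-') is an assumption that should be adjusted to the data at hand.

```python
import string

# Assumption: characters accepted inside sequence lines. This is not a
# Biopython constant; narrow it (e.g. to "ACGTN") if the input allows.
ALLOWED = set(string.ascii_letters) | {"*", "-"}


def validate_fasta(text):
    """Return a list of (line_number, message) problems; empty means OK.

    Hypothetical helper for illustration, not part of Biopython.
    """
    problems = []
    have_header = False  # has any '>' line been seen yet?
    seq_len = 0          # characters collected for the current record
    header_line = 0      # line number of the current record's header
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            # A new header closes the previous record.
            if have_header and seq_len == 0:
                problems.append((header_line, "record has no sequence"))
            if len(line) == 1:
                problems.append((lineno, "empty record name"))
            have_header, seq_len, header_line = True, 0, lineno
        elif not have_header:
            problems.append((lineno, "sequence data before first '>' header"))
        else:
            # Sequence line: anything outside ALLOWED (spaces, digits, ...)
            # is reported, which also catches a header that lost its '>'.
            bad = sorted(set(line) - ALLOWED)
            if bad:
                problems.append((lineno, "invalid characters: %r" % "".join(bad)))
            seq_len += len(line)
    if not have_header:
        if not problems:
            problems.append((0, "input is empty"))
    elif seq_len == 0:
        problems.append((header_line, "record has no sequence"))
    return problems
```

If validate_fasta(FASTA_sequence) returns an empty list, the text can then be
handed to SeqIO.parse as before. One limitation of this sketch: a name line
that lost its '>' and consists only of letters will still be mistaken for
sequence data.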

However, if there's an easier way, it would save me some if checks and for
loops :) Suggestions?

Best regards to all,

João [...] Rodrigues
@ http://stanford.edu/~joaor/


