[Biopython] Parsing FASTA records based on headers

Mon Jul 11 16:07:59 UTC 2011

Hi all,

I tried to parse a FASTA file to select the sequences whose headers satisfy a 
condition. The condition is that the first word of the header belongs to a list 
named SelectedSequencesId.
On the page http://biopython.org/wiki/SeqIO, I found this example, where the 
condition is that the sequence length is < 300:

1 from Bio import SeqIO
2 
3 input_seq_iterator = SeqIO.parse(open("cor6_6.gb", "rU"), "genbank")
4 short_seq_iterator = (record for record in input_seq_iterator \
5                      if len(record.seq) < 300)
6 
7 output_handle = open("short_seqs.fasta", "w")
8 SeqIO.write(short_seq_iterator, output_handle, "fasta")
9 output_handle.close()

So I tried to replace line 5 with:
5 record.id.split()[0] in SelectedSequencesId)

But it did not work.
I was able to get what I wanted by building a list of all the records and 
then filtering it, but I'd like to find a solution that uses a generator 
expression.
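
A minimal sketch of the generator-expression approach described above. It 
assumes the input is parsed with the "fasta" format (the wiki example reads 
GenBank) and relies on SeqIO setting record.id to the first word of a FASTA 
header. The in-memory FASTA text, the sequence IDs, and the helper name 
select_records are made up for illustration; SelectedSequencesId is the list 
named above.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical FASTA content, used here instead of a real file.
fasta_text = """>seq1 first example
ACGTACGT
>seq2 second example
GGGTTT
>seq3 third example
AAAACCCC
"""

# Illustrative IDs only; a set would also work and gives faster lookups.
SelectedSequencesId = ["seq1", "seq3"]


def select_records(handle, wanted_ids):
    """Return a generator of FASTA records whose ID is in wanted_ids.

    For FASTA input, record.id is already the first word of the header,
    so no extra split() is needed.
    """
    return (record for record in SeqIO.parse(handle, "fasta")
            if record.id in wanted_ids)


input_handle = StringIO(fasta_text)
output_handle = StringIO()

# SeqIO.write consumes the generator lazily and returns the record count.
count = SeqIO.write(select_records(input_handle, SelectedSequencesId),
                    output_handle, "fasta")

print(count)
print(output_handle.getvalue())
```

With real files, the two StringIO objects would be replaced by open file 
handles; the records are still processed one at a time, without building a 
list first.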

Thanks in advance,

Fabio

-- 

F. Gori, PhD student
Intelligent Systems
ICIS (Institute for Computing and Information Sciences)
Radboud University Nijmegen

Home Page: http://www.cs.ru.nl/~gori/