[Biopython] parsing fasta based on header

Tue Nov 1 21:04:39 UTC 2011

On Tue, Nov 1, 2011 at 7:53 PM, Wibowo Arindrarto
<w.arindrarto at gmail.com> wrote:
> Hi Matthew,
>
> You can use Python generators for this. Here's a rough example:
>
> # generators for the two different groups
> seq_1 = (r for r in SeqIO.parse(open('QHM-clean.fasta', 'rU'), 'fasta') if
> r.id.startswith('1'))
> seq_2 = (r for r in SeqIO.parse(open('QHM-clean.fasta', 'rU'), 'fasta') if
> r.id.startswith('2'))
>
> # seqs, filenames pair list
> pairs = [(seq_1, 'file_1'), (seq_2, 'file_2')]
>
> # the actual write
> for seq, filename in pairs:
> SeqIO.write(seq, open(filename, 'w'), 'fasta')
>
> cheers,
> Bowo

Email does tend to mess up the indentation in Python :(

I'm pleased to see that's very similar to my answer earlier,
http://biostar.stackexchange.com/questions/13791/parsing-fasta-based-on-header/13793

By the way Wibowo, rather than this:

SeqIO.write(seq, open(filename, 'w'), 'fasta')

use this:

SeqIO.write(seq, filename, 'fasta')

It is shorter, and it also ensures the handle is closed
promptly on Jython/PyPy, where garbage collection
isn't as predictable as on normal CPython.
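
If you do need to pass an open handle rather than a filename (say, to
write several batches into one file), a with block gives the same
prompt close without relying on garbage collection. Here is a rough
standard-library sketch of the idea. Note that write_fasta, the sample
records and the output path are all made up for illustration;
write_fasta only stands in for SeqIO.write and is not part of Biopython.

```python
import os
import tempfile


def write_fasta(records, handle):
    """Hypothetical stand-in for SeqIO.write: writes (id, sequence) pairs."""
    count = 0
    for rec_id, seq in records:
        handle.write(">%s\n%s\n" % (rec_id, seq))
        count += 1
    return count


# Made-up records; in the real script these would come from SeqIO.parse.
records = [("1_a", "ACGT"), ("2_b", "GGCC"), ("1_c", "TTAA")]

# Generator selecting the records whose identifier starts with '1'.
group_1 = (r for r in records if r[0].startswith("1"))

# Temporary output location, assumed here just for the demonstration.
out_path = os.path.join(tempfile.mkdtemp(), "file_1.fasta")

# The with block closes the handle as soon as the block exits,
# whichever Python implementation is running the script.
with open(out_path, "w") as handle:
    written = write_fasta(group_1, handle)
```

Once the with block has finished, handle.closed is True on CPython,
Jython and PyPy alike, so the output is flushed to disk before the
script moves on.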

Peter