[Biopython-dev] Sequential SFF IO

Wed Jan 26 15:45:56 UTC 2011

On Wed, Jan 26, 2011 at 3:14 PM, Kevin Jacobs wrote:
> Any objections/worries about converting the SFF writer to use the
> sequential/incremental writer object interface?  I know it looks
> specialized for text formats, but

It already uses Bio.SeqIO.Interfaces.SequenceWriter.

> ... I need to split large SFF files into many smaller ones
> and would rather not materialize the whole thing.  The SFF writer
> code already allows for deferred writing of read counts and index
> creation, so it looks to be only minor surgery.

I don't understand what problem you are having with the SeqIO API.
It should be quite happy to take a generator function, iterator, etc.
(as opposed to a list of SeqRecord objects, which I assume is what
you mean by "materialize the whole thing").
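
As a rough sketch of what that looks like (the function names, the
file paths and the length cut-off here are made up for illustration,
they are not part of Biopython), a generator function can be handed
straight to Bio.SeqIO.write so that only one record is held in memory
at a time:

```python
def long_reads(records, min_length):
    """Lazily yield only the records with at least min_length bases."""
    for record in records:
        if len(record) >= min_length:
            yield record


def filter_sff(in_path, out_path, min_length=100):
    """Copy reads of at least min_length from one SFF file to another.

    Hypothetical helper: the name and the default cut-off are
    illustrative only. Returns the number of records written.
    """
    from Bio import SeqIO  # Biopython, imported here so long_reads stays stdlib-only

    records = SeqIO.parse(in_path, "sff")  # an iterator, not a list
    return SeqIO.write(long_reads(records, min_length), out_path, "sff")
```

Calling filter_sff("big.sff", "long_only.sff") would stream the
records through without ever building a list of SeqRecord objects.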

> There doesn't seem to be an obvious API for obtaining such a writer
> using the SeqIO interface.

You can do that with:

from Bio.SeqIO.SffIO import SffWriter

> Am I missing something obvious?
>

Probably. You can divide a large SFF file into smaller SFF files via the
high-level Bio.SeqIO.parse/write interface. Personally, I like to use
generator expressions for filtering operations.
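
For the splitting case, one possible approach (a sketch only; the
helper names, the output naming scheme and the default batch size are
assumptions for illustration) is to cut the parse iterator into lazy
batches and pass each batch to Bio.SeqIO.write:

```python
from itertools import chain, islice


def batches(iterable, size):
    """Yield lazy batches of up to `size` items from any iterable.

    Each batch must be fully consumed before asking for the next one,
    which is what Bio.SeqIO.write does when it writes a file.
    """
    iterator = iter(iterable)
    for first in iterator:
        # Put back the item used to detect that more data remains.
        yield chain([first], islice(iterator, size - 1))


def split_sff(in_path, prefix, size=100000):
    """Split one SFF file into numbered files of up to `size` reads each.

    Hypothetical helper; returns the list of file names written.
    """
    from Bio import SeqIO  # Biopython, imported here so batches stays stdlib-only

    names = []
    records = SeqIO.parse(in_path, "sff")
    for number, batch in enumerate(batches(records, size), start=1):
        out_path = "%s_%03d.sff" % (prefix, number)
        SeqIO.write(batch, out_path, "sff")
        names.append(out_path)
    return names
```

Note this goes through the plain parse/write interface, so it does not
carry over any Roche XML manifest from the input file.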

Note that if you want to divide a large SFF file while preserving the
Roche XML manifest, things are a little trickier. You should
use the ReadRocheXmlManifest function in combination with
the SffWriter. You can see an example of this in sff_filter_by_id.py,
a tool I wrote for Galaxy; search for "Filter SFF by ID" here:
http://community.g2.bx.psu.edu/
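
In outline, that combination might look something like the following.
This is only a sketch along those lines, not a copy of the Galaxy
tool: the helper names are invented, and it assumes that
ReadRocheXmlManifest raises ValueError when the file has no manifest
and that SffWriter accepts the manifest via its xml argument.

```python
def select_by_id(records, wanted_ids):
    """Lazily yield the records whose .id is in wanted_ids."""
    wanted = set(wanted_ids)
    return (record for record in records if record.id in wanted)


def filter_sff_keep_manifest(in_path, out_path, wanted_ids):
    """Write the wanted reads to a new SFF file, copying any XML manifest.

    Hypothetical helper; returns the number of records written.
    """
    # Biopython, imported here so select_by_id stays stdlib-only.
    from Bio.SeqIO.SffIO import ReadRocheXmlManifest, SffIterator, SffWriter

    with open(in_path, "rb") as in_handle:
        try:
            manifest = ReadRocheXmlManifest(in_handle)
        except ValueError:
            manifest = None  # assumed: raised when no manifest is present
        in_handle.seek(0)  # start again after reading the manifest
        with open(out_path, "wb") as out_handle:
            writer = SffWriter(out_handle, xml=manifest)
            records = SffIterator(in_handle)
            return writer.write_file(select_by_id(records, wanted_ids))
```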

Peter