[Biopython-dev] Sequential SFF IO

Wed Jan 26 17:19:36 UTC 2011

On Wed, Jan 26, 2011 at 4:44 PM, Kevin Jacobs <jacobs at bioinformed.com>
<bioinformed at gmail.com> wrote:
> On Wed, Jan 26, 2011 at 10:45 AM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> On Wed, Jan 26, 2011 at 3:14 PM, Kevin Jacobs wrote:
>> > Any objections/worries about converting the SFF writer to use the
>> > sequential/incremental writer object interface?  I know it looks
>> > specialized for text formats, but
>>
>> It already uses Bio.SeqIO.Interfaces.SequenceWriter
>>
>
> Sorry-- was shooting from the hip.  I meant a SequentialSequenceWriter.
>

The file formats which use SequentialSequenceWriter have trivial
(or no) header/footer, which require no additional arguments. The
SFF file format has a non-trivial header which records flow space
settings etc. Any write_header method would have to be SFF specific,
likewise any write_footer method for the index and XML manifest.
I don't see what you have in mind.

In fact, looking at SffIO.py again now, I think the SffWriter's
write_header and write_record methods should be private, with
just write_file as a public method.

>> > ... I need to split large SFF files into many smaller ones
>> > and would rather not materialize the whole thing.  The SFF writer
>> > code already allows for deferred writing of read counts and index
>> > creation, so it looks to be only minor surgery.
>>
>> I don't understand what problem you are having with the SeqIO API.
>> It should be quite happy to take a generator function, iterator, etc
>> (as opposed to a list of SeqRecord objects which I assume is what
>> you mean by "materialize the whole thing").
>
> The goal is to demultiplex a larger file, so I need a "push" interface.
>  e.g.
> out = dict(...) # of SffWriters
> for rec in SeqIO(filename,'sff-trim'):
>   out[id(read)].write_record(rec)
>
> for writer in out.itervalues():
>   writer.write_footer()

I don't think the above will work without some "magic" to record the
SFF header (which currently would require using private attributes
of the SffWriter objects) as done via its write_file method.

Also you can't read in SFF files with "sff-trim" if you want to output
them, since this discards all the flow space information. You have
to use format "sff" instead.

> I could use a simple generator if I was merely filtering records, but the
> write_file interface would require more co-routine functionality than
> generators provide.

How many output files do you have? Assuming the number is small, I'd
go for the simple solution of one loop over the input SFF file for each
output file.
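
As a rough sketch of that multi-pass approach (the helper names select
and write_subsets, and the filename-to-read-IDs mapping, are made up
for illustration and are not part of Biopython; this also assumes a
Biopython recent enough that SeqIO.parse and SeqIO.write accept file
names for SFF):

```python
def select(records, wanted_ids):
    """Lazily yield only the records whose .id is in wanted_ids."""
    wanted = set(wanted_ids)
    return (rec for rec in records if rec.id in wanted)


def write_subsets(input_path, subsets):
    """Write one SFF file per entry in subsets.

    subsets is a hypothetical mapping of output filename -> read IDs.
    """
    # Imported here so that select() can be used without Biopython.
    from Bio import SeqIO

    for out_path, wanted_ids in subsets.items():
        # One full pass over the input per output file. Format "sff"
        # (not "sff-trim") keeps the flow space data needed for output.
        records = SeqIO.parse(input_path, "sff")
        SeqIO.write(select(records, wanted_ids), out_path, "sff")
```

Only one record needs to be in memory at a time, at the cost of
re-reading the input once per output file.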

A variation on this would be to make a list of read IDs for each
output file, then use Bio.SeqIO.index for random access to fetch
those records, e.g.

from Bio import SeqIO

index = SeqIO.index(original_filename, "sff")
for filename in [...]:
    wanted = [...] # some list or generator of read IDs
    records = (index[read_id] for read_id in wanted)
    SeqIO.write(records, filename, "sff")

Otherwise look at itertools.tee for splitting the iterator if you really
want to make a single pass through the original SFF file.
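
A minimal sketch of the tee idea, shown on plain iterators (the
function name fan_out and the name-to-predicate mapping are
hypothetical, not anything in Biopython):

```python
from itertools import tee


def fan_out(records, predicates):
    """Return {name: lazy filtered iterator}, all fed by one source pass.

    predicates is a hypothetical mapping of name -> function(record) -> bool.
    """
    # tee gives each output its own independent view of the same iterator.
    copies = tee(records, len(predicates))
    return {
        name: filter(keep, copy)
        for (name, keep), copy in zip(predicates.items(), copies)
    }
```

Each resulting iterator could then be handed to SeqIO.write(...,
"sff"). Note that tee has to buffer every item that one copy has
consumed and another has not, so writing the output files one after
another may end up holding most of the input in memory.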

>> > There doesn't seem to be an obvious API for obtaining such a
>> > writer using the SeqIO interface.
>>
>> You can do that with:
>>
>> from Bio.SeqIO.SffIO import SffWriter
>>
>
> For my immediate need, this is fine.  However, the more general
> API doesn't have a SeqIO.writer to get SequentialSequenceWriter
> objects.

For good reason - not all the writers use SequentialSequenceWriter,
because for many file formats it is too narrow in scope.

Peter