[Biopython-dev] Merging Bio.SeqIO SFF support?

Tue Mar 2 05:08:27 EST 2010

On Mon, Mar 1, 2010 at 11:22 PM, Kevin Jacobs <jacobs at bioinformed.com>
<bioinformed at gmail.com> wrote:
> On Thu, Feb 11, 2010 at 12:29 AM, Peter <biopython at maubp.freeserve.co.uk>
> wrote:
>>
>> On Mon, Jan 11, 2010 at 5:11 PM, Peter <biopython at maubp.freeserve.co.uk>
>> wrote:
>> > I didn't want to rush the SFF support into Biopython 1.53, but it's been
>> > waiting "ready" for a while now. Any objections or comments about
>> > me merging this now?
>>
>> There were no objections, and I ran this by Brad and Michiel and
>> have just merged this into the master branch. Time for some more
>> testing!
>>
>
> I've tried out the recently landed SFF SeqIO code and am pleased to
> report that it works very well.

Great :)

If you have suggestions for the documentation, please voice them.
Also, did the handling of trimmed reads seem sensible? Until we
release this we can still tweak the API.

> I am parsing gsMapper 454PairAlign.txt output and
> converting it to SAM/BAM format to view in IGV (among other things) and
> wanted to include per-base quality score information from the SFF files.

Are you reading and writing SAM/BAM format with Python? Looking
into this is on my (long) todo list.

> The only glitch so far is that the indexed access mode yields sequences
> with no alphabet assigned.  The solution is to add the following to the
> beginning of SffDict.__init__:
>         if alphabet is None:
>           alphabet = Alphabet.generic_dna

Thanks - I'll look at that.

> My only other comment is that several file reads and struct.unpacks can be
> merged in _sff_read_seq_record.  Given the number of records in most 454 SFF
> files, I suspect the micro-optimization effort will be worth the slight cost
> in code clarity.

I did spend some effort on the run time, but it wouldn't
surprise me if there was still room for improvement. Since most
of my SFF files were only up to 2GB with under a million reads,
I found this wasn't such an issue (compared to FASTQ files with
Solexa data).

I guess you mean the flowgram values, flowgram index, bases
and qualities might be loaded with a single read? That would
be worth trying.
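
Roughly, the idea could be sketched like this. This is only an
illustration, not the actual _sff_read_seq_record code: the function
name and the synthetic record are made up, and it assumes the SFF
read data layout (big-endian uint16 flowgram values, then one byte
per base each for the flow index, the bases and the qualities, with
the section padded to a multiple of 8 bytes).

```python
import io
import struct


def read_sff_read_data(handle, num_flows, num_bases):
    """Load one read's data section with one read and one unpack.

    Hypothetical helper for illustration. Assumed layout, big-endian:
    num_flows uint16 flowgram values, num_bases uint8 flow index
    values, num_bases base characters, num_bases uint8 qualities,
    then zero padding up to a multiple of 8 bytes.
    """
    fmt = ">%dH%dB%ds%dB" % (num_flows, num_bases, num_bases, num_bases)
    size = struct.calcsize(fmt)
    padding = -size % 8
    # One read covers all four fields plus the trailing padding.
    data = handle.read(size + padding)
    if len(data) != size + padding:
        raise ValueError("Premature end of file in read data section")
    # One unpack call instead of four; the padding bytes are ignored.
    fields = struct.unpack(fmt, data[:size])
    flows = fields[:num_flows]
    flow_index = fields[num_flows:num_flows + num_bases]
    bases = fields[num_flows + num_bases]  # a single bytes object
    quals = fields[num_flows + num_bases + 1:]
    return flows, flow_index, bases, quals


# Synthetic record with 4 flows and 3 bases (made-up values).
num_flows, num_bases = 4, 3
record = struct.pack(">4H3B3s3B",
                     101, 5, 198, 96,  # flowgram values
                     1, 2, 1,          # flow index per base
                     b"TCA",           # bases
                     30, 28, 25)       # quality scores
record += b"\0" * (-len(record) % 8)   # pad to an 8 byte boundary
handle = io.BytesIO(record)
flows, flow_index, bases, quals = read_sff_read_data(
    handle, num_flows, num_bases)
```

Whether this is measurably faster than separate reads would need
timing against a real SFF file.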

> Thanks to Peter and Jose for all of their hard work!
> Best regards,
> -Kevin

And thanks for the feedback :)

Peter