[BioPython] Reading Roche 454 binary SFF files in Python

Wed Apr 15 08:42:12 UTC 2009

On Wed, Apr 15, 2009 at 8:07 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
>
>> I was aware that some information was available about the SFF file
>> format, and it should be possible to reverse engineer the format in
>> order to read and write it directly from Biopython.
>
> The sff format is fully documented in the NCBI's SRA web site.
> http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats#sff

Nice link - thanks.  Given the specification is public (and as you say
later, well thought out), we shouldn't have to worry so much about Roche
making changes to it in future releases.

>> Right now with your code under the GPL, we can't incorporate it into
>> Biopython, but if you and Bastien are prepared to offer it to
>> Biopython under our MIT/BSD license that could be very useful.  Even
>> without that, any documentation on the file format or example files
>> you might be able to share could be valuable.
>
> I guess that it wouldn't be a problem to offer you the code under your
> license. But I don't think that's the best approach. The code as it is right
> now is not well suited to be integrated in a library. It would be easier to
> rewrite the sff reading part from scratch. I could do that for you in no
> time.

I was expecting your sff_extract code would serve only as a basis - perhaps
just lifting some core routines.  If you are happy to extract/rewrite the core
bits and give them to Biopython under the Biopython License that would be
great.  See http://biopython.org/DIST/LICENSE (basically MIT/BSD style).

> The main problem would be to have sff files small enough to be used
> for the test.

The Roche command line tools allow you to take a large SFF file and
produce a filtered version (use sfffile with the -i option and a simple
text file of read identifiers).  So making a small SFF file for unit tests
should be simple.
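
As a rough sketch of that step driven from Python: the helper name and file names below are made up for illustration, and only the -i option is something I've described above. The -o option for naming the output file is my assumption, so check it against the usage message of your sfffile version before relying on it.

```python
def build_sfffile_command(read_ids, big_sff, small_sff, id_file):
    """Write read_ids (one per line) to id_file and return an sfffile argv.

    Hypothetical helper: nothing here is run, it only prepares the call.
    """
    with open(id_file, "w") as handle:
        for read_id in read_ids:
            handle.write(read_id + "\n")
    # -i: text file of read identifiers to keep (as described above).
    # -o: output file name (ASSUMPTION, verify with your sfffile install).
    return ["sfffile", "-o", small_sff, "-i", id_file, big_sff]
```

The returned list could then be passed to subprocess.check_call() on a machine with the Roche tools installed.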

> If you could provide that I could write the code to extract the
> information from the sff file for you. It would be easy to build a
> generator able to deliver the sequences one by one.

That would be very welcome :)
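
To make the idea concrete, here is a minimal sketch of such a generator, written from my reading of the public SFF specification linked above (big-endian fields, sections padded to 8 bytes). The function name is hypothetical, this is not the sff_extract code, and it assumes the index block (if any) sits after the last read rather than between reads. Clipping values are read but ignored.

```python
import struct

SFF_MAGIC = 0x2E736666  # the bytes ".sff"

def iter_sff_reads(handle):
    """Yield (name, bases, qualities) per read from a binary SFF handle."""
    # Common header, fixed part (31 bytes): magic, version, index offset,
    # index length, read count, header length, key length, flows per read,
    # flowgram format code.
    fields = struct.unpack(">I4sQIIHHHB", handle.read(31))
    magic, _version, _idx_off, _idx_len, n_reads, header_len = fields[:6]
    n_flows = fields[7]
    if magic != SFF_MAGIC:
        raise ValueError("Not an SFF file (bad magic number)")
    # Skip flow characters, key sequence and padding.
    handle.read(header_len - 31)

    for _ in range(n_reads):
        # Read header, fixed part (16 bytes).
        read_header_len, name_len, n_bases = struct.unpack(
            ">HHI", handle.read(8))
        handle.read(8)  # four uint16 clipping values, ignored in this sketch
        name = handle.read(name_len).decode("ascii")
        handle.read(read_header_len - 16 - name_len)  # header padding

        # Read data: flowgram values, flow index per base, bases, qualities.
        handle.read(2 * n_flows)
        handle.read(n_bases)
        bases = handle.read(n_bases).decode("ascii")
        qualities = list(handle.read(n_bases))
        data_len = 2 * n_flows + 3 * n_bases
        handle.read(-data_len % 8)  # padding up to a multiple of 8 bytes
        yield name, bases, qualities
```

Usage would be along the lines of opening the file in binary mode and looping: for name, bases, quals in iter_sff_reads(open("example.sff", "rb")), where example.sff is a placeholder file name.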

> sff_extract is also able to split the paired-end reads. That's the part
> that Bastien wrote. Integrating that would be nice, but I think that in
> Biopython that should be treated as an independent problem.

Quite possibly - I haven't yet had to work with paired-end reads, and
at this point I'm not sure how best to represent them with the Biopython
SeqRecord object.  In one sense they are two short sequences (so
using two Biopython SeqRecord objects would work, given some kind of
cross-referencing between them).  Alternatively you might treat them as
one long sequence with known end regions and a middle region of unknown
sequence and length (something we don't currently have a sequence
object to represent).
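
Purely to illustrate the two options, here is a sketch using a plain stand-in class rather than SeqRecord. Everything in it is hypothetical: the Read class, the ".f"/".r" identifier suffixes, and the guessed gap length are my own placeholders, and the orientation of the second read is ignored.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Read:
    """Stand-in for a sequence record (NOT the Biopython SeqRecord)."""
    id: str
    seq: str
    mate_id: Optional[str] = None

def as_mates(template_id, left_seq, right_seq):
    """Option 1: two short records that cross-reference each other."""
    left = Read(template_id + ".f", left_seq)
    right = Read(template_id + ".r", right_seq)
    left.mate_id, right.mate_id = right.id, left.id
    return left, right

def as_gapped(template_id, left_seq, right_seq, guessed_gap=100):
    """Option 2: one long record, with N filling a gap of guessed length."""
    return Read(template_id, left_seq + "N" * guessed_gap + right_seq)
```

The second option shows the weakness I mean: the true gap length is unknown, so any fixed run of N is only a guess.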

>> P.S. Have you tested your sff_extract software on SFF files from the
>> new Roche v2 software, released about the same time as the "titanium"
>> 454 upgrade?
>
> Not me, but I think that Bastien has and he has found no problem at all with
> that.

Great.

> The sff format is well thought out and consistent; the 454 people did a
> much better job than the ABI people did with the abi format.

That makes a pleasant change - the FASTQ format strikes me as
less than ideal in several ways (and the fact that Solexa made their own
incompatible variant just made things worse).

Peter