[Biopython-dev] sff reader

Fri May 22 18:40:45 UTC 2009

On Fri, Apr 17, 2009 at 12:08 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Fri, Apr 17, 2009 at 11:46 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
>> Hi Peter:
>> Here you have some code to read the sff files.
>
> Thanks - I'm not sure when I'll get to look at this, maybe next week.
>
>> For the time being it creates a dict for the sequences. I'm not sure about
>> how to integrate the generated data in BioPython. The sequence and
>> qualities should go to a SeqRecord, but there is also the information
>> about the clipping.
>
> For Bio.SeqIO, we would need to use a SeqRecord.  Ideally we'd want to
> be able to read and write SFF files, and to do that we'll have to record all
> the essential annotation (i.e. clipping) somehow.

I had a look at your code this evening and wrote a rough SeqIO
module using it, available on enhancement Bug 2837:
http://bugzilla.open-bio.org/show_bug.cgi?id=2837

> Can you write SFF files?
>
>> For my work I use a kind of SeqRecord with a mask property and the
>> mask is a Location that shows which part of the sequence is ok. I don't
>> know if that's a valid model for BioPython.
>
> A mask could be done as a list of booleans, and we can treat it as
> another per-letter-annotation in the SeqRecord.  I'm not sure if this
> is helpful or not.
>
> The Roche tools let you choose to extract trimmed reads as FASTA
> and QUAL, or untrimmed.  Perhaps for reading SFF files with
> Bio.SeqIO we should get the user to choose between these
> options (e.g. format names "roche-sff" and "roche-sff-notrim")?

This would work...

> Roche's FASTA files use upper case for the trimmed region, and
> lower case for the start/end which would get trimmed off. This is
> simple and we could do this for Biopython too - meaning you'd get
> the same data if you read the SFF file directly, or used Roche's
> FASTA+QUAL files with SeqIO.  Note that when reading an SFF
> file directly, we should probably record the real trim data as well.

In my current code, I decided to use the same quality trimming
representation that Roche uses when converting an SFF file into FASTA
format (the leading and trailing trim regions are in lower case). We
may want to record the trim positions in the SeqRecord's annotation
as well.
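
To make the lower-case convention concrete, here is a minimal pure-Python
sketch of how the trim positions and a boolean mask could be recovered from
a mixed-case sequence, and how the case could be re-applied from stored
positions. The function names are hypothetical and this is not the code
attached to the bug; it only assumes that the trimmed-off ends are lower
case and the retained region begins and ends with an upper-case letter.

```python
def trim_bounds_from_case(seq):
    """Return (start, end) so that seq[start:end] is the retained region.

    Assumption: leading/trailing trimmed bases are lower case, as in
    Roche's FASTA output. Positions are 0-based, end-exclusive.
    """
    start = 0
    while start < len(seq) and seq[start].islower():
        start += 1
    end = len(seq)
    while end > start and seq[end - 1].islower():
        end -= 1
    return start, end


def mask_from_bounds(length, start, end):
    """Per-letter boolean mask: True where the base is kept."""
    return [start <= i < end for i in range(length)]


def apply_case(seq, start, end):
    """Re-create the mixed-case representation from trim positions."""
    return seq[:start].lower() + seq[start:end].upper() + seq[end:].lower()


if __name__ == "__main__":
    read = "tcagACGTACGTnn"  # made-up example read
    start, end = trim_bounds_from_case(read)
    print(start, end, read[start:end])
    print(mask_from_bounds(len(read), start, end))
```

A mask like this could be stored as another per-letter annotation, while
the two integers would fit in the record's annotations dictionary.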

>> There are also a couple more tricks with the clipping.
>> In theory there's clip_qual and clip_adapter, but in the files
>> we've seen clip_adapter is always zero and clip_qual is used
>> instead for both quality and adapter. I think we could generate
>> one clipping combining both. Let me know what you think.
>> Also take into account that in some cases the clipping generated
>> by the 454 software is just wrong.
>
> I'll need to learn more about the details before coming to any
> conclusions about how to deal with this information in Biopython.

I have not yet looked at the left/right adaptor clipping information; as
you found, these fields are zero in the example file I have looked at.
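
For reference, here is a sketch of how the quality and adapter clips could
be merged into a single clipping, along the lines Jose suggests. It assumes
the four per-read fields are called clip_qual_left, clip_qual_right,
clip_adapter_left and clip_adapter_right, that they are 1-based, and that
zero means "not set". That is my reading of the SFF format description, so
treat it as an assumption rather than tested behaviour; the function names
are made up for illustration.

```python
def combined_clip(clip_qual_left, clip_qual_right,
                  clip_adapter_left, clip_adapter_right,
                  number_of_bases):
    """Merge quality and adapter clips into one 1-based inclusive range.

    Assumption: a value of 0 means the clip point was not set, so the
    left side defaults to base 1 and the right side to the last base.
    """
    left = max(1, clip_qual_left, clip_adapter_left)
    right = min(
        clip_qual_right or number_of_bases,
        clip_adapter_right or number_of_bases,
    )
    return left, right


def to_python_slice(left, right):
    """Convert a 1-based inclusive range to 0-based (start, end)."""
    return left - 1, right


if __name__ == "__main__":
    # Adapter fields all zero, as in the example files discussed here.
    left, right = combined_clip(5, 85, 0, 0, 100)
    print(left, right, to_python_slice(left, right))
```

Note this does nothing about the cases where the clipping from the 454
software is simply wrong; it only combines whatever values the file holds.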

Note I will be away for the next week, so am unlikely to respond to
any emails on this.

Peter