[Biopython-dev] sff reader

Fri Apr 17 07:08:12 EDT 2009

On Fri, Apr 17, 2009 at 11:46 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> Hi Peter:
> Here is some code to read SFF files.

Thanks - I'm not sure when I'll get to look at this, maybe next week.

> For the time being it creates a dict for the sequences. I'm not sure
> how to integrate the generated data into Biopython. The sequence and
> qualities should go into a SeqRecord, but there is also the clipping
> information.

For Bio.SeqIO, we would need to use a SeqRecord.  Ideally we'd want to
be able to read and write SFF files, and to do that we'll have to record all
the essential annotation (i.e. clipping) somehow.  Can you write SFF files?

> For my work I use a kind of SeqRecord with a mask property; the
> mask is a Location that shows which part of the sequence is OK. I don't
> know if that's a valid model for Biopython.

A mask could be done as a list of booleans, and we can treat it as
another per-letter-annotation in the SeqRecord.  I'm not sure if this
is helpful or not.
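
As a rough illustration of the boolean-mask idea, here is a minimal
sketch in plain Python. The function name mask_from_clip and the
0-based, half-open start/end coordinates are assumptions made for this
example only; a plain dict stands in for the per-letter annotation.

```python
def mask_from_clip(length, start, end):
    """Return a list of booleans, True for positions inside [start, end).

    start/end are 0-based, half-open coordinates (an assumption of
    this sketch, not something taken from the SFF code).
    """
    return [start <= i < end for i in range(length)]


seq = "tcagACGTACGTnn"
mask = mask_from_clip(len(seq), 4, 12)

# A plain dict standing in for a per-letter annotation container:
# every value has exactly one entry per letter of the sequence.
per_letter = {"mask": mask}

# Keep only the letters whose mask entry is True.
good = "".join(base for base, keep in zip(seq, mask) if keep)
```

The mask has one entry per base, so it can sit next to the quality
scores without any extra bookkeeping.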

The Roche tools let you choose to extract trimmed reads as FASTA
and QUAL, or untrimmed.  Perhaps for reading SFF files with
Bio.SeqIO we should get the user to choose between these
options (e.g. format names "roche-sff" and "roche-sff-notrim")?

Roche's FASTA files use upper case for the trimmed region, and
lower case for the start/end which would get trimmed off. This is
simple and we could do this for Biopython too - meaning you'd get
the same data if you read the SFF file directly, or used Roche's
FASTA+QUAL files with SeqIO.  Note that when reading an SFF
file directly, we should probably record the real trim data as well.
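
The upper/lower case convention described above could be sketched
like this. Both helper names (apply_case, clip_from_case) and the
0-based, half-open coordinates are hypothetical, and the second helper
assumes the upper-case region is one contiguous block.

```python
def apply_case(seq, start, end):
    """Lower-case the letters outside [start, end), upper-case those inside."""
    return seq[:start].lower() + seq[start:end].upper() + seq[end:].lower()


def clip_from_case(seq):
    """Recover (start, end) of the upper-case region of a mixed-case read.

    Assumes the upper-case letters form a single contiguous block.
    Returns (0, 0) if there are no upper-case letters at all.
    """
    upper = [i for i, base in enumerate(seq) if base.isupper()]
    if not upper:
        return 0, 0
    return upper[0], upper[-1] + 1


mixed = apply_case("TCAGACGTACGTNN", 4, 12)
start, end = clip_from_case(mixed)
```

Going both ways like this is what would let a mixed-case FASTA file and
a directly parsed SFF file carry the same trim information.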

> In the extract_sff script we generated three files: the fasta sequences,
> the fasta qualities and the xml with the clippings.
> One option could be to clip the sequences, but I don't know if that's the
> desired behaviour in all cases.

Trimming is probably a sensible default.  If we do give the untrimmed
sequences, we'd need a way to easily trim them.
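
For the "easy way to trim" part, a minimal sketch (assuming a
hypothetical helper trim_read and 0-based, half-open coordinates) only
needs to slice the sequence and the qualities with the same bounds:

```python
def trim_read(seq, quals, start, end):
    """Return the (sequence, qualities) pair restricted to [start, end).

    seq is a string and quals a list of integers of the same length.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and qualities differ in length")
    return seq[start:end], quals[start:end]


untrimmed_seq = "tcagACGTACGTnn"
untrimmed_quals = [10, 10, 10, 10, 30, 31, 32, 33, 34, 35, 36, 37, 5, 5]

trimmed_seq, trimmed_quals = trim_read(untrimmed_seq, untrimmed_quals, 4, 12)
```

Slicing both with the same bounds keeps the two the same length after
trimming.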

> There are also a couple more tricks with the clipping.
> In theory there's clip_qual and clip_adapter, but in the files
> we've seen clip_adapter is always zero and clip_qual is used
> instead for both quality and adapter. I think we could generate
> one clipping combining both. Let me know what you think.
> Also take into account that in some cases the clippings generated
> by the 454 software are just wrong.

I'll need to learn more about the details before coming to any
conclusions about how to deal with this information in Biopython.
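
Purely for illustration, and not a decision on how Biopython should
behave, combining the two clippings as Jose suggests might look like
the sketch below. It assumes the clip values are 1-based, inclusive
positions in which 0 means "not set"; that convention and the function
name combine_clips are assumptions to be checked against the SFF format
description.

```python
def combine_clips(qual_left, qual_right, adapter_left, adapter_right, length):
    """Merge quality and adapter clip points into one 0-based [start, end) pair.

    Inputs are assumed to be 1-based, inclusive positions where 0 means
    the value was not set. The result is the most restrictive window.
    """
    # Left edge: the larger of the two values (0 if neither is set).
    left = max(qual_left, adapter_left)
    # Right edge: the smaller of the values that are actually set.
    rights = [r for r in (qual_right, adapter_right) if r > 0]
    right = min(rights) if rights else length

    start = max(left - 1, 0)   # convert 1-based inclusive to 0-based
    end = min(right, length)   # inclusive 1-based end == exclusive 0-based end
    return start, max(end, start)  # never return a negative-length window


only_qual = combine_clips(5, 12, 0, 0, 14)
nothing_set = combine_clips(0, 0, 0, 0, 14)
both_set = combine_clips(5, 12, 7, 10, 14)
```

With clip_adapter always zero, as in the files Jose has seen, the result
reduces to the quality clipping alone.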

> If you want to forward this mail to the list you're more than welcome.
> Best regards,
>
> Jose Blanca

I've CC'd this reply to the list (without the python file attachments).

Regards,

Peter