[Biopython-dev] Quality scores (and per-letter-annotation) in a SeqRecord?

Sat Feb 21 14:03:14 EST 2009

On Sat, Feb 21, 2009 at 12:24 AM, Iddo Friedberg <idoerg at gmail.com> wrote:
>
> Hi all,
>
> I am sort of living in this world right now, doing a lot of
> metagenomics, so here are my $0.02. I agree with Leighton (assuming I
> understand him): We should consider the possible applications people
> will run using the quality data when designing the [parser?]

Sure.  By having the FASTQ and QUAL files integrated into Bio.SeqIO
(using SeqRecord objects) one simple use case is supported -
interconverting these files into other formats (e.g. FASTQ to FASTA,
or with a little more effort FASTA+QUAL to FASTQ).   Your trimming
example is another good use case - which could also be done with the
SeqRecord representation.
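
To make the trimming use case concrete, here is a minimal sketch in plain
Python.  It does not use the Biopython API at all: the function name
trim_low_quality_tail and the default threshold of 20 are made up for
illustration, and the plain list of integers simply stands in for whatever
per-letter annotation a SeqRecord ends up carrying.

```python
def trim_low_quality_tail(sequence, qualities, min_quality=20):
    """Drop trailing letters whose quality score is below min_quality.

    Hypothetical helper: `sequence` is a string and `qualities` a list of
    integer scores, one per letter (a stand-in for per-letter annotation).
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    end = len(sequence)
    # Walk back from the 3' end until a letter meets the threshold.
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    # Slice both together so letters and scores stay in step.
    return sequence[:end], qualities[:end]


if __name__ == "__main__":
    seq, quals = trim_low_quality_tail("ACGTACGT", [30, 32, 31, 28, 25, 12, 9, 5])
    print(seq, quals)
```

The point is only that the sequence and its scores have to be sliced
together, which is what a per-letter annotation on the record would need
to support.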

For anything more complicated (like assembly or mapping onto a
genome), with massive datasets the modest overhead of the SeqRecord
and Seq objects could be an issue - but isn't this sort of thing
usually best handled by an external tool (written in C or C++ by a
specialist)?

Anyway - if you have a look at the first attachment on Bug 2767, I did
the core of the FASTQ parser as a generic function returning a tuple
of strings (the record title, the sequence and the encoded quality
string - see FastqGeneralIterator).  While this could be just a private
function, I was thinking it could actually be very helpful for
anyone trying to do something where speed or memory usage
is important. On top of this core parser, I had a FastqPhredIterator
(and would similarly have a FastqSolexaIterator) function which turns
these tuples into SeqRecord objects for use via the Bio.SeqIO API.  i.e. we
can offer both the standard Bio.SeqIO interface using SeqRecords, and
a simpler string based parser for those who need it.
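
As a rough illustration of the two-layer idea (this is not the code
attached to the bug), a string based core might look something like the
sketch below.  The names simple_fastq_iterator and phred_scores are
hypothetical, the parser assumes strict four-line records with no line
wrapping, and the decoder assumes an ASCII offset of 33; Solexa files use
a different encoding and would need their own decoder.

```python
import io


def simple_fastq_iterator(handle):
    """Yield (title, sequence, quality_string) tuples of plain strings.

    Hypothetical sketch: assumes each record is exactly four lines
    (no wrapped sequence or quality lines).
    """
    while True:
        title_line = handle.readline()
        if not title_line:
            return  # end of file
        if not title_line.strip():
            continue  # tolerate blank lines between records
        if not title_line.startswith("@"):
            raise ValueError("FASTQ records should start with '@'")
        sequence = handle.readline().rstrip("\r\n")
        plus_line = handle.readline()
        quality = handle.readline().rstrip("\r\n")
        if not plus_line.startswith("+"):
            raise ValueError("expected a '+' separator line")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield title_line[1:].rstrip(), sequence, quality


def phred_scores(quality_string, offset=33):
    """Decode an encoded quality string into integers (offset is an assumption)."""
    return [ord(letter) - offset for letter in quality_string]


if __name__ == "__main__":
    example = io.StringIO("@read1 sample\nACGT\n+\nIIII\n@read2\nGG\n+\n!5\n")
    for title, seq, qual in simple_fastq_iterator(example):
        print(title, seq, phred_scores(qual))
```

Anyone who only needs the raw strings can stop at the first function and
never pay for building objects; the SeqRecord based iterators would sit on
top and do the decoding.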

Peter