[BioPython] Does biopython have a parser for .qual files?

Fri Jan 9 15:01:24 UTC 2009

On Fri, Jan 9, 2009 at 2:24 PM, Martin MOKREJŠ
<mmokrejs at ribosome.natur.cuni.cz> wrote:
> Hi,
>  is there a way in biopython to access the quality values from
> NCBI trace archive? I had a look briefly into
> http://biopython.org/DIST/docs/api/ but cannot find anything
> related. NCBItrace provides some Perl script (maybe I could do the
> same with Bio.Entrez.esearch; I haven't tried yet) ... I will need
> to reverse the order of values to get them for minus strand
> orientation. If nobody needed to do this before I will invent
> the wheel. ;)
> Thanks for your comments,
> Martin

In the short term, I'm sure a quick parser shouldn't take you more
than five minutes to implement (based on any of the FASTA parsers),
giving you record names with lists of integer scores.  The trouble for
integrating this into Biopython nicely is how to represent the data.
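
A minimal sketch of such a quick parser follows, in plain Python with no
Biopython dependency.  It assumes the usual FASTA-like layout for .qual
files: a ">" header line whose first word is the record name, followed by
lines of whitespace-separated integer scores.  The function name
parse_qual and the example data are made up for illustration and are not
part of any library.

```python
import io


def parse_qual(handle):
    """Yield (name, scores) tuples from a FASTA-like .qual file handle.

    Assumes each record starts with a '>' header line and that all other
    non-blank lines hold whitespace-separated integer quality scores.
    """
    name = None
    scores = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, scores  # emit the previous record
            fields = line[1:].split(None, 1)
            name = fields[0] if fields else ""  # first word is the name
            scores = []
        elif name is not None:
            scores.extend(int(value) for value in line.split())
    if name is not None:
        yield name, scores  # emit the final record


# Made-up example data, standing in for a real .qual file.
example = io.StringIO(
    ">read1 length=5\n"
    "10 20 30\n"
    "40 50\n"
    ">read2\n"
    "5 6 7\n"
)
records = list(parse_qual(example))
for name, scores in records:
    # Reversing the list gives the scores in minus strand orientation.
    print(name, scores, scores[::-1])
```

Reversing the list with scores[::-1] covers the minus strand case from
the original question, since the scores are per-base and need no
complementing.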

Have a look at Bug 2382 for some related ideas (including other
FASTA-like formats), and this thread from just over a year ago:
http://bugzilla.open-bio.org/show_bug.cgi?id=2382
http://lists.open-bio.org/pipermail/biopython-dev/2007-October/003131.html

I can see these qual files (and also fastq files which have both the
sequence and the quality scores) fitting into Bio.SeqIO but this would
require an elegant way to deal with unknown sequences of known length
(see next paragraph), and a good way to handle per-letter annotation
(which has come up on the mailing lists fairly recently).

For this reason, I had wondered about creating an UnknownSeq as a
subclass of Seq.  To create an instance you would supply the length
and a character to use (typically N for nucleotides or X for proteins,
perhaps defaulting to ?).  This would then act like a Seq object as
much as possible (for example, translation of an UnknownSeq with a
nucleotide alphabet could give an UnknownSeq with a protein alphabet
with appropriate length).  An UnknownSeq object could be used for
these qual files, or even certain GenBank files (where the sequence is
not always included).  There is a risk of user confusion here though,
as there isn't really a sequence present!
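
To make the idea concrete, here is a rough standalone sketch of what
such an object might look like.  It is only an illustration of the
proposal above, not Biopython's API: it does not subclass Seq (to keep
the example self-contained), it ignores alphabets, and the method names
and the choice of "X" for the translated character are assumptions.

```python
class UnknownSeq:
    """Hypothetical sequence of known length but unknown content."""

    def __init__(self, length, character="?"):
        if length < 0:
            raise ValueError("length must be non-negative")
        if len(character) != 1:
            raise ValueError("character must be a single letter")
        self._length = length
        self._character = character

    def __len__(self):
        return self._length

    def __str__(self):
        # Only build the full string when explicitly asked for it.
        return self._character * self._length

    def __repr__(self):
        return "UnknownSeq(%i, character=%r)" % (self._length, self._character)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice of an unknown sequence is a shorter unknown sequence.
            new_length = len(range(*index.indices(self._length)))
            return UnknownSeq(new_length, self._character)
        if not -self._length <= index < self._length:
            raise IndexError("sequence index out of range")
        return self._character

    def translate(self):
        # Assumption: an unknown nucleotide sequence translates to an
        # unknown protein of one third the length, written with "X".
        return UnknownSeq(self._length // 3, "X")


unk = UnknownSeq(10, "N")
print(len(unk), str(unk), str(unk[2:5]), str(unk.translate()))
```

Storing just the length and character means a very long placeholder
costs almost no memory until someone asks for the string.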

Peter