[Biopython-dev] 454 GSFlex quality score files
Jared Flatow
jflatow at northwestern.edu
Tue Oct 16 18:46:53 UTC 2007
Hi Peter,
>>>> I have also needed to create a modified FASTA parser so that I
>>>> can read things like quality score files.
>>>
>>> Could you be a little more specific - what exactly do you mean by a
>>> quality score files (links and/or examples). It may be that this
>>> warrants setting up a new file format in Bio.SeqIO
>> That is what I did. The quality score files I meant are simply
>> FASTA- like records that indicate the quality of each base pair
>> read from a sequencing machine, on a scale of something like 1 to
>> 64. The values are tab separated and correspond to 'reads' in
>> another FASTA file that contain the actual sequences read. This
>> is the way the 454 GSFlex machines output their sequencing reads,
>> so for every set of reads there will be a pair of 454Reads.fna,
>> 454Reads.qual files. The only difference between a parser that
>> processes these qual files and one that processes the sequence
>> files is that it shouldn't get rid of spaces, and the newlines
>> should not to be stripped but converted into spaces (when 454
>> writes a newline of scores they omit the space). Essentially I
>> have made a duplicate of FastaIOs iterator, named it something
>> else, made these two small changes and put an entry for it in the
>> SeqIO file.
>
> Patches and emails don't do well together. Could you file an
> enhancement bug, and then upload your code as an attachment? If
> you have a few examples of matched pairs of FASTA files and quality
> files which you can contribute that would be very helpful too.
>
Yes I'll get on that.
> It looks like you are trying to construct a "sequence" of numerical
> values (rather than a sequence of letters like nucleotides/amino
> acids). As written I don't think it would work for element access/
> splicing etc. However, with some extra work I suppose we could
> stretch the Seq object in this way - and define a new
> "IntegerAlphabet".
>
> But on balance, I don't think "lists of quality values" should be
> treated in the same way as sequences (and thus it doesn't seem to
> belong in Bio.SeqIO).
>
I agree.
> Alternatively you could regard the quality scores as sequence meta-
> data or annotation. One idea would be to generate SeqRecord
> objects containing dummy sequences of the correct length made up of
> the ambiguous character "N", with the associated quality scores
> held as a list of integers in the SeqRecord's annotation
> dictionary. Then it would fit into the Bio.SeqIO framework [I was
> thinking of something similar for PTT files, NCBI Protein tables,
> where again we have annotation but not the actual sequence].
I agree, and this way is most flexible.
>
> Maybe there should just be a separate parser for GSFlex quality
> records which returns iterator giving each record name with a list
> of integers. A more elegant scheme would read in the pair of files
> together (the FASTA file and the quality file) and generate nicely
> annotated SeqRecords with the sequence and the quality. This isn't
> really possible with the Bio.SeqIO framework.
>
Yes, at first I liked this idea best, but it puts some constraints on
the way these things are read in. Like if it is to be an iterator,
you must have a guarantee that these files contain exactly the same
sequences in exactly the same order. This seems like it could
potentially be fine for the GSFlex files, but I wonder if there might
somewhere down the line be use for quality information about
sequences in other cases. If I am not mistaken, some sources use
upper/lower case letters now to indicate a bistable degree of
confidence in a sequence letter. In any event, this seems like an
unnecessary restriction.
The way I do it now is I load the reads into a database, then update
the database when I read in a quality score file. I think Biopython
should have a simple way of implementing something similar which can
solve both our metadata problems.
In Bio.Fasta there are Parsers which really belong in
Bio.SeqIO.FastaIO, if anywhere. How about Bio.Fasta becomes the more
general Fasta reader, nothing to do with sequences. It can iterate
over a FASTA file using the '>' as the record separator, creating
Record objects, much like it does now, except without processing them
at all or assuming they are sequences.
>Record.header
Record.data
Now Bio.SeqIO.FastaIO can use Bio.Fasta to iterate over the Record
objects in a file and transform them into SeqRecord object. If you
like, you can provide it with a function header_todict, which takes a
string (in this case Record.header) and returns a dictionary, which
gets unpacked and passed to the SeqRecord initializer. Basically the
Bio.SeqIO.FastaIO returns a generator that looks something like this:
(SeqRecord(seq=cleanup(record.data), **header_todict(record.header))
for record in Bio.Fasta.parse(file))
I can also use the Bio.Fasta.parse function now to parse my quality
files and add them as metadata:
# I create an initial SeqRecord dictionary using the
Bio.SeqIO.FastaIO parser
seq_dict = SeqIO.to_dict(SeqIO.FastaIO.parse(seq_file,
my_header_todict))
# Then I iterate over the sequences in the qual file and look them up
in the seq_dict using the same header parsing function
# I passed to create my initial SeqRecords, setting the quality
scores as I find them them
for record in Bio.Fasta.parse(qual_file):
seq_dict[my_header_todict(record.header)['id']].quality =
my_qualitycleanup(record.data)
I hope that makes sense. The advantage to doing it this way is that I
can reuse my header parsing function for both the sequence and the
metadata, and I can do whatever I want with the fasta record data
without writing a whole new parser. The SeqIO fasta parsing functions
just makes some default assumptions (like the data is a sequence).
Let me know what you think.
Jared
More information about the Biopython-dev
mailing list