[Biopython-dev] 454 GSFlex quality score files

Tue Oct 16 18:46:53 UTC 2007

Hi Peter,

>>>> I have also needed to create a modified FASTA parser so that I  
>>>> can  read things like quality score files.
>>>
>>> Could you be a little more specific - what exactly do you mean by a
>>> quality score files (links and/or examples).  It may be that this
>>> warrants setting up a new file format in Bio.SeqIO
>> That is what I did. The quality score files I meant are simply  
>> FASTA- like records that indicate the quality of each base pair  
>> read from a  sequencing machine, on a scale of something like 1 to  
>> 64. The values  are tab separated and correspond to 'reads' in  
>> another FASTA file  that contain the actual sequences read. This  
>> is the way the 454  GSFlex machines output their sequencing reads,  
>> so for every set of  reads there will be a pair of 454Reads.fna,  
>> 454Reads.qual files. The  only difference between a parser that  
>> processes these qual files and  one that processes the sequence  
>> files is that it shouldn't get rid of  spaces, and the newlines  
>> should not to be stripped but converted into  spaces (when 454  
>> writes a newline of scores they omit the space).  Essentially I  
>> have made a duplicate of FastaIOs iterator, named it  something  
>> else, made these two small changes and put an entry for it  in the  
>> SeqIO file.
>
> Patches and emails don't do well together.  Could you file an  
> enhancement bug, and then upload your code as an attachment?  If  
> you have a few examples of matched pairs of FASTA files and quality  
> files which you can contribute that would be very helpful too.
>

Yes I'll get on that.

> It looks like you are trying to construct a "sequence" of numerical  
> values (rather than a sequence of letters like nucleotides/amino  
> acids).  As written I don't think it would work for element access/ 
> splicing etc. However, with some extra work I suppose we could  
> stretch the Seq object in this way - and define a new  
> "IntegerAlphabet".
>
> But on balance, I don't think "lists of quality values" should be  
> treated in the same way as sequences (and thus it doesn't seem to  
> belong in Bio.SeqIO).
>

I agree.

> Alternatively you could regard the quality scores as sequence meta- 
> data or annotation.  One idea would be to generate SeqRecord  
> objects containing dummy sequences of the correct length made up of  
> the ambiguous character "N", with the associated quality scores  
> held as a list of integers in the SeqRecord's annotation  
> dictionary.  Then it would fit into the Bio.SeqIO framework [I was  
> thinking of something similar for PTT files, NCBI Protein tables,  
> where again we have annotation but not the actual sequence].

I agree, and this way is most flexible.

>
> Maybe there should just be a separate parser for GSFlex quality  
> records  which returns iterator giving each record name with a list  
> of integers. A more elegant scheme would read in the pair of files  
> together (the FASTA file and the quality file) and generate nicely  
> annotated SeqRecords with the sequence and the quality.  This isn't  
> really possible with the Bio.SeqIO framework.
>

Yes, at first I liked this idea best, but it puts some constraints on  
the way these things are read in. Like if it is to be an iterator,  
you must have a guarantee that these files contain exactly the same  
sequences in exactly the same order. This seems like it could  
potentially be fine for the GSFlex files, but I wonder if there might  
somewhere down the line be use for quality information about  
sequences in other cases. If I am not mistaken, some sources use  
upper/lower case letters now to indicate a bistable degree of  
confidence in a sequence letter. In any event, this seems like an  
unnecessary restriction.

The way I do it now is I load the reads into a database, then update  
the database when I read in a quality score file. I think Biopython  
should have a simple way of implementing something similar which can  
solve both our metadata problems.

In Bio.Fasta there are Parsers which really belong in  
Bio.SeqIO.FastaIO, if anywhere. How about Bio.Fasta becomes the more  
general Fasta reader, nothing to do with sequences. It can iterate  
over a FASTA file using the '>' as the record separator, creating  
Record objects, much like it does now, except without processing them  
at all or assuming they are sequences.

 >Record.header
Record.data

Now Bio.SeqIO.FastaIO can use Bio.Fasta to iterate over the Record  
objects in a file and transform them into SeqRecord object. If you  
like, you can provide it with a function header_todict, which takes a  
string (in this case Record.header) and returns a dictionary, which  
gets unpacked and passed to the SeqRecord initializer. Basically the  
Bio.SeqIO.FastaIO returns a generator that looks something like this:

(SeqRecord(seq=cleanup(record.data), **header_todict(record.header))  
for record in Bio.Fasta.parse(file))

  I can also use the Bio.Fasta.parse function now to parse my quality  
files and add them as metadata:

# I create an initial SeqRecord dictionary using the  
Bio.SeqIO.FastaIO parser
seq_dict = SeqIO.to_dict(SeqIO.FastaIO.parse(seq_file,  
my_header_todict))

# Then I iterate over the sequences in the qual file and look them up  
in the seq_dict using the same header parsing function
# I passed to create my initial SeqRecords, setting the quality  
scores as I find them them
for record in Bio.Fasta.parse(qual_file):
	seq_dict[my_header_todict(record.header)['id']].quality =  
my_qualitycleanup(record.data)

I hope that makes sense. The advantage to doing it this way is that I  
can reuse my header parsing function for both the sequence and the  
metadata, and I can do whatever I want with the fasta record data  
without writing a whole new parser. The SeqIO fasta parsing functions  
just makes some default assumptions (like the data is a sequence).

Let me know what you think.

Jared