[Biopython] internal function to convert illumina quality scores to phred

Mon Jan 31 20:50:30 UTC 2011

On Mon, Jan 31, 2011 at 7:54 PM, Alan Bergland <bergland at stanford.edu> wrote:
> Hi Peter,
>
>        Thanks for the quick reply.  So, I am trying to iterate through two
> large fastq files (each file is one paired-end read) and split the reads by
> one of 9 barcodes found on both 5' ends of each read.  I would like to use
> the quality information for those barcode reads to assess which
> barcode-group they belong to.

I suspect that it is simpler to just ignore reads which don't match
the barcode within 1 or 2 mismatches, without worrying about their
qualities. It will cost you computational time and effort for a relative
small improvement in the number of filtered reads. YMMV.

If you look at the archives there was a discussion earlier in Jan about
doing this sort of thing with SFF files. I'm currently (between other tasks)
trying to wrap up some sort of PCR-primer/barcode/adaptor filtering
(Bio)python script as a tool for Galaxy, see http://usegalaxy.org

>        I think it would be nice to use FastqGeneralIterator because I don't
> need to translate the quality scores for the full read (100bp) back and
> forth while I iterate through the file.  I gather that when I use
> SeqIO.parse and SeqIO.write, the quality scores are converted back and
> forth.  There is no need to do this for the whole read.
>
>        I've written a little snippet of code that simply prints the quality
> scores from the barcodes:
>
> from Bio import SeqIO
> from Bio.SeqIO.QualityIO import *
> from Bio.SeqIO import *
>
> pe1 = open("head2_pe1.fastq", "r")
> pe2 = open("head2_pe2.fastq", "r")
>
> pe1_record_it = FastqGeneralIterator(pe1)
>
> for pe1_seq_record in pe1_record_it:
>    bc = SeqRecord(Seq(pe1_seq_record[1][:6]), id="a")
>    bc.letter_annotations['fastq-illumina'] = pe1_seq_record[2][:6]
>
>    print bc.letter_annotations["fastq-illumina"]
>
> this just prints out the illumina encoded quality scores.  How would I print
> out the phred scores instead?
>
> Thanks,
> Alan

If you have Illumina 1.3+ or later, then they use PHRED scores
(not Solexa scores which have a different log encoding). If as
it appears above you want to use SeqRecord objects, I think
you might as well use Bio.SeqIO.parse and write.

The point about using FastqGeneralIterator for speed is to
avoid using a SeqRecord due to the overheads. See e.g.:
http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

Peter

P.S. Watch out for the fact that Illumina are planning to switch
to the standard Sanger FASTQ encoding in their next release:
http://seqanswers.com/forums/showthread.php?t=8895