[Bioperl-l] Creating a fastq format file?

Heikki Lehvaslaiho heikki.lehvaslaiho at gmail.com
Mon Apr 27 01:42:03 EDT 2009


> I have tried to summarise this in a central place:
> http://en.wikipedia.org/wiki/FASTQ_format

Torsten,

Thanks for putting this together. Very helpful.

Do you have a plan of action?  Let me propose one for BioPerl. It is
based on the following assumptions:

1. There is a multitude of different ways of encoding quality values out there.
2. Bio::Seq::Quality is agnostic of any quality value range rules
3. The emerging open standard is the Sanger fastq specification
4. Open source programs use the Sanger fastq specs


From these it follows that:


1. BioPerl should support the Sanger fastq standard

1.1. It already does, and there are other SeqIO modules for dealing
with non-fastq formats.

2. BioPerl should offer simple ways of converting between quality range rules

2.1. Have a generic method accessible from Bio::Seq::Quality with
preset versions of the method for converting between known variants
(Sanger fastq and the two Illumina versions)

For example:

range_convert ($from_lower, $from_upper, $to_lower, $to_upper, $value)
  throw if $value < $from_lower or $value > $from_upper
  return $newvalue

range_convert_illumina2fastq(), range_convert_fastq2illumina(),
range_convert_fastq2phred(),  range_convert_phred2fastq()....

(assuming that illumina 1.3 eq phred)
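
One possible reading of range_convert, as a minimal Perl sketch: both ranges are treated as windows of ASCII codes (or numeric scores), and the value is shifted by the difference between the lower bounds. The bounds used in the presets (ASCII 33..126 for Sanger fastq, ASCII 64..126 for Illumina 1.3, 0..93 for numeric Phred) and the use of die in place of BioPerl's throw are assumptions for illustration; none of this is existing Bio::Seq::Quality API.

```perl
use strict;
use warnings;

# Shift $value from the range [$from_lower, $from_upper] into the range
# [$to_lower, $to_upper], keeping its distance from the lower bound.
# Dies (a stand-in for BioPerl's throw) if the input or the result is
# out of range.
sub range_convert {
    my ($from_lower, $from_upper, $to_lower, $to_upper, $value) = @_;
    die "value $value outside input range $from_lower..$from_upper\n"
        if $value < $from_lower or $value > $from_upper;
    my $newvalue = $value - $from_lower + $to_lower;
    die "value $newvalue outside output range $to_lower..$to_upper\n"
        if $newvalue > $to_upper;
    return $newvalue;
}

# Presets. The bounds are assumptions: Sanger fastq uses ASCII 33..126,
# Illumina 1.3 uses ASCII 64..126, numeric Phred scores run 0..93.
sub range_convert_illumina2fastq { return range_convert(64, 126, 33, 126, shift) }
sub range_convert_fastq2illumina { return range_convert(33, 126, 64, 126, shift) }
sub range_convert_fastq2phred    { return range_convert(33, 126, 0, 93, shift) }
sub range_convert_phred2fastq    { return range_convert(0, 93, 33, 126, shift) }

# Re-encode a whole Illumina 1.3 quality string as Sanger fastq.
my $illumina = 'hhhgfB@';
my $sanger   = join '',
    map { chr(range_convert_illumina2fastq(ord($_))) } split //, $illumina;
print "$sanger\n";
```

A plain shift like this only covers variants that share the Phred scale. The older Illumina scores (the -5..40 range mentioned in the quoted message) are defined on a log-odds scale, so they would need a conversion of the value itself as well as a change of offset.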

2.2. Bio::SeqIO::Fastq::next_seq methods should convert Illumina
qualities into Sanger fastq on the fly

2.2.1. Bio::SeqIO::Fastq::next_seq should detect the quality value range
of the incoming stream, either automatically or from a keyword
parameter indicating the range.

2.2.2. Bio::SeqIO::Fastq::next_seq should throw an error if it detects
a quality value out of range.

2.2.3. Bio::SeqIO::Fastq::write_seq should throw an error if it
detects a quality value out of range.

2.2.4. It would be useful but not absolutely necessary for
Bio::SeqIO::Fastq::write_seq to be able to write out in Illumina
ranges
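
The behaviour proposed under 2.2 could look roughly like the sketch below. It is plain Perl, not the Bio::SeqIO fastq code: the names guess_variant, decode_quality and encode_sanger, the variant keyword and the ASCII bounds are all hypothetical. The automatic detection is only a heuristic. A line made up entirely of characters at or above ASCII 64 is taken to be Illumina, although it could equally be high-quality Sanger data, which is why the keyword override matters.

```perl
use strict;
use warnings;

# Assumed ASCII windows per variant: [offset, lowest code, highest code].
my %RANGE = (
    sanger   => [33, 33, 126],
    illumina => [64, 64, 126],
);

# Heuristic: a character below ASCII 64 ('!' to '?') can only come from
# Sanger fastq; otherwise assume the Illumina offset.
sub guess_variant {
    my ($qual) = @_;
    return $qual =~ /[!-?]/ ? 'sanger' : 'illumina';
}

# Decode a quality line into numeric scores, dying on out-of-range
# characters. The variant comes from the optional keyword argument,
# or is guessed from the line itself.
sub decode_quality {
    my ($qual, %args) = @_;
    my $variant = $args{variant} || guess_variant($qual);
    my $range   = $RANGE{$variant} or die "unknown variant '$variant'\n";
    my ($offset, $low, $high) = @$range;
    my @scores;
    for my $char (split //, $qual) {
        my $code = ord($char);
        die "quality character '$char' out of range for $variant\n"
            if $code < $low or $code > $high;
        push @scores, $code - $offset;
    }
    return ($variant, \@scores);
}

# Re-encode numeric scores as a Sanger fastq quality line, dying on
# scores that cannot be represented.
sub encode_sanger {
    my $line = '';
    for my $score (@_) {
        die "score $score out of range 0..93\n" if $score < 0 or $score > 93;
        $line .= chr($score + 33);
    }
    return $line;
}

my ($variant, $scores) = decode_quality('hhhgfB@');
print "${variant}: @$scores\n";
print encode_sanger(@$scores), "\n";
```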


What do you think?

    -Heikki

2009/4/26 Torsten Seemann <torsten.seemann at infotech.monash.edu.au>:
>> > This might be a good place to ask the question: having looked at the
>> > fastq.pm page, is the fastq format defined (only) by a "@" followed by a
>> > sequence line and a "+" header followed by a quality line and the two
>> > headers have to agree? Now that Illumina is using phred scaling, are
>> > 'Sanger' and 'Illumina' versions the same?
>>
>> No they aren't the same, Illumina still encodes the ascii as value + 64
>> and Sanger as value + 33.
>>
>
> Illumina have now CHANGED how they calculate the quality value however in
> the last month or so... Their Q range used to be -5..40 mapped to ASCII 64+,
> but now they produce Q >= 0 and it is unclear if they start at 69 or 64
> now...
>
> I have tried to summarise this in a central place:
>
> http://en.wikipedia.org/wiki/FASTQ_format
>
> Corrections welcome!
>
>
> --Torsten Seemann
> --Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
> University, AUSTRALIA



-- 
    -Heikki
Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
cell: +27 (0)714328090
Sent from Claremont, WC, South Africa


