[Biojava-l] converting fastq format

Michael Heuer heuermh at gmail.com
Wed Sep 23 00:28:07 UTC 2015


Hello Daniel,

I am sorry, and this is embarrassing, but I thought I remembered the
writers supporting implicit conversion, which, as you point out, is not
the case.

The conversion needs to go to error probabilities and back because the
quality score metrics are different (PHRED versus Solexa scores); see the
NAR paper linked below for details.
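
A rough sketch of that round trip through error probabilities, using the
PHRED and Solexa score definitions. The class and method names here are
hypothetical and are not part of the BioJava API. Mapping PHRED 0 to
Solexa -5 is an assumption based on the usual Solexa minimum.

```java
/** Hypothetical helper (not BioJava API): PHRED <-> Solexa via error probability. */
final class QualityScoreSketch {

    private QualityScoreSketch() {}

    /** PHRED score to error probability: p = 10^(-Q/10). */
    static double phredToErrorProbability(double phred) {
        return Math.pow(10.0, -phred / 10.0);
    }

    /** Error probability to PHRED score: Q = -10 * log10(p). */
    static double errorProbabilityToPhred(double p) {
        return -10.0 * Math.log10(p);
    }

    /** Solexa score to error probability: p = 1 / (10^(Q/10) + 1). */
    static double solexaToErrorProbability(double solexa) {
        return 1.0 / (Math.pow(10.0, solexa / 10.0) + 1.0);
    }

    /** Error probability to Solexa score: Q = -10 * log10(p / (1 - p)). */
    static double errorProbabilityToSolexa(double p) {
        return -10.0 * Math.log10(p / (1.0 - p));
    }

    /** Solexa -> error probability -> PHRED, rounded to the nearest integer. */
    static int solexaToPhred(int solexa) {
        return (int) Math.round(errorProbabilityToPhred(solexaToErrorProbability(solexa)));
    }

    /**
     * PHRED -> error probability -> Solexa, rounded to the nearest integer.
     * Assumption: results below -5 (including PHRED 0, where p = 1) are
     * clamped to -5.
     */
    static int phredToSolexa(int phred) {
        double solexa = errorProbabilityToSolexa(phredToErrorProbability(phred));
        return (int) Math.round(Math.max(-5.0, solexa));
    }

    public static void main(String[] args) {
        for (int q : new int[] {0, 1, 3, 10, 40}) {
            System.out.println("PHRED " + q + " -> Solexa " + phredToSolexa(q));
        }
    }
}
```

The two scales nearly coincide at high quality and diverge at low
quality, which is why a plain offset shift is not enough when Solexa
scores are involved.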

This pull request adds explicit conversion support and round-trip
functional tests based on the test data described in the paper.

https://github.com/biojava/biojava/pull/334
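
For the Sanger reader and Illumina writer pairing discussed in this
thread, both variants store PHRED scores, so a conversion can re-encode
the quality string with a different ASCII offset (33 for Sanger, 64 for
Illumina 1.3+). The sketch below is a standalone illustration with
hypothetical names, not the code in the pull request. Capping at 62 is
an assumption that follows from the Illumina 1.3+ score range.

```java
/** Hypothetical helper (not BioJava API): re-encode Sanger qualities as Illumina 1.3+. */
final class OffsetReencodeSketch {

    static final int SANGER_OFFSET = 33;      // '!' encodes PHRED 0
    static final int ILLUMINA_OFFSET = 64;    // '@' encodes PHRED 0
    static final int ILLUMINA_MAX_SCORE = 62; // '~' is the highest printable ASCII character

    private OffsetReencodeSketch() {}

    /** Decode each character with the Sanger offset and encode it with the Illumina offset. */
    static String sangerToIllumina(String sangerQuality) {
        StringBuilder illumina = new StringBuilder(sangerQuality.length());
        for (int i = 0; i < sangerQuality.length(); i++) {
            int phred = sangerQuality.charAt(i) - SANGER_OFFSET;
            if (phred < 0) {
                throw new IllegalArgumentException(
                        "not a Sanger quality character at index " + i);
            }
            // Sanger allows scores up to 93; cap so the result stays printable.
            int capped = Math.min(phred, ILLUMINA_MAX_SCORE);
            illumina.append((char) (capped + ILLUMINA_OFFSET));
        }
        return illumina.toString();
    }

    public static void main(String[] args) {
        System.out.println(sangerToIllumina("!#G"));
    }
}
```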

   michael


On Thu, Sep 17, 2015 at 6:45 PM, Daniel Katzel <dkatzel at gmail.com> wrote:

> Sorry, not using the SangerFastqReader in my original post was a typo.
> I tried all the different readers just in case...
>
> Using SangerFastqReader still throws an exception if I use a non-Sanger
> writer:
>
>         FastqReader fastqReader = new SangerFastqReader();
>         FastqWriter fastqWriter = new IlluminaFastqWriter();
>
>         PrintStream out = ...
>         InputStream in = ...
>         fastqReader.stream(in, new StreamListener() {
>
>             @Override
>             public void fastq(Fastq fastq) {
>                 // keep only reads longer than 20 bases
>                 if (fastq.getSequence().length() > 20) {
>                     try {
>                         // fails here: the Illumina writer rejects Sanger records
>                         fastqWriter.append(out, fastq);
>                     } catch (IOException e) {
>                         throw new UncheckedIOException(e);
>                     }
>                 }
>             }
>         });
>
> This still throws an exception when trying to write the first read.
>
> If I change the FastqWriter to a SangerFastqWriter so the reader and
> writer are the same variant, it works as expected.
>
> Stepping through the code, there is no code that actually performs any
> conversion in the Writer implementations or their parent class
> AbstractFastqWriter.
>
> It would be easy to add to the abstract writer; the code would be similar
> to what I posted above to make a new encoded quality string using the
> correct offset.
>
> On Thu, Sep 17, 2015 at 5:50 AM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>
>> On Thu, Sep 17, 2015 at 3:26 AM, Daniel Katzel <dkatzel at gmail.com> wrote:
>> >
>> > The fastq file I was using is part of the 1000genomes phase 3 dataset
>> > (very large gzipped files) with about 25 million records each.  The
>> > reads are short so it is probably old.
>> >
>> > Here's the file I used
>> >
>> > ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/SRR062634_1.filt.fastq.gz
>> >
>> > I made a histogram of the encoded quality values as ascii:
>> >
>> >   33 :          166838
>> >   34 :               0
>> >   35 :       100598505
>> >   36 :           26817
>> >   37 :          156873
>> >   38 :          268700
>> >   39 :          419677
>> >   40 :          807326
>> >   41 :          997720
>> >   42 :          889665
>> >   43 :          946268
>> >   44 :         2372479
>> >   45 :         4147316
>> >   46 :          760108
>> >   47 :          850433
>> >   48 :         1433894
>> >   49 :         1165379
>> >   50 :         1769347
>> >   51 :         2493316
>> >   52 :         2966864
>> >   53 :        12457233
>> >   54 :         3172484
>> >   55 :         3741809
>> >   56 :         3722004
>> >   57 :         4320581
>> >   58 :        23804570
>> >   59 :         6554713
>> >   60 :         7207725
>> >   61 :        33021639
>> >   62 :        13106991
>> >   63 :        60909837
>> >   64 :        36753951
>> >   65 :        70258165
>> >   66 :        91889938
>> >   67 :       102533947
>> >   68 :       129093976
>> >   69 :       368143099
>> >   70 :       231023980
>> >   71 :      1089945133
>> >
>> >
>> > You can see the lowest value is 33 which means SANGER encoding.
>> >
>>
>> Yes, this looks like the Sanger FASTQ encoding :)
>>
>> (Some data archives would convert from the legacy Solexa or Illumina
>> 1.3+ quality encodings into the standard Sanger FASTQ encoding).
>>
>> Because this is the Sanger FASTQ encoding, you should be using the
>> SangerFastqReader. Your original email was using the
>> IlluminaFastqReader which should have complained that there were ASCII
>> characters under 64 present. That is presumably what happened given
>> the message:
>>
>>
>> Caused by: java.io.IOException: sequence SRR062634.1 HWI-EAS110_103327062:6:1:1092:8469/1 not fastq-illumina format, was fastq-sanger
>>         at org.biojava.nbio.sequencing.io.fastq.IlluminaFastqWriter.validate(IlluminaFastqWriter.java:43)
>>         at org.biojava.nbio.sequencing.io.fastq.AbstractFastqWriter.append(AbstractFastqWriter.java:62)
>>         at org.biojava.nbio.sequencing.io.fastq.AbstractFastqWriter.append(AbstractFastqWriter.java:46)
>>
>>
>> Do you think this error message can be made clearer?
>>
>> We did come up with a whole set of functional tests including
>> inter-conversion of the FASTQ encodings which are provided with the
>> NAR paper as supplementary materials and used in the Bio* and EMBOSS
>> test suites.
>>
>> http://dx.doi.org/10.1093/nar/gkp1137
>>
>> Peter
>>
>
>