From biopython at maubp.freeserve.co.uk Mon Jul 27 06:15:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Jul 2009 11:15:12 +0100 Subject: [Open-bio-l] Open-bio cross-project issues In-Reply-To: <4A6D6B8F.9060108@ebi.ac.uk> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <4A6D6B8F.9060108@ebi.ac.uk> Message-ID: <320fb6e00907270315h1d2cbe11p4d003c63468136a5@mail.gmail.com> On Mon, Jul 27, 2009 at 9:55 AM, Peter Rice wrote: > > Peter C. wrote (to bioperl-l, biopython-l, emboss-dev): >> Hi all, >> >> Peter Rice kindly said he will look into an OBF cross project mailing >> list, but in the meantime this has been cross posted to the Biopython, >> BioPerl, and EMBOSS development lists. > > There is a list already for this purpose - open-bio-l Ah. Well spotted. Should we update the mailing list description to which seems a little dated (e.g. BioSQL has its own mailing list now), to make it clearer that things like file format specifications and sample data are also a suitable topic for this list? > I think we will also need a cross-project wiki space on the OBF site. Is > there something already used by other projects or should we set > something up? Good idea. As I have mentioned in the past, I think the BioPerl wiki has a good collection of pages on sequence and alignment file formats which might be relocated there. Very few of these pages are actually BioPerl specific. > I am cross-posting this to other OBF project lists to encourage > developers interested in combining efforts to address common > problems. This started with FASTQ short read formats, and > open-bio-l (a low volume list) has also seen discussion of > common test data sets. 
For reference, the FASTQ threads include this triple posted one: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000605.html http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006467.html http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030698.html This EMBOSS + Biopython thread: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000576.html http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006416.html And this BioPerl + Biopython thread: http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030404.html http://lists.open-bio.org/pipermail/biopython/2009-July/005335.html Peter C. From pmr at ebi.ac.uk Mon Jul 27 04:55:43 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Mon, 27 Jul 2009 09:55:43 +0100 Subject: [Open-bio-l] Open-bio cross-project issues In-Reply-To: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> Message-ID: <4A6D6B8F.9060108@ebi.ac.uk> Peter C. wrote (to bioperl-l, biopython-l, emboss-dev): > Hi all, > > Peter Rice kindly said he will look into an OBF cross project mailing > list, but in the meantime this has been cross posted to the Biopython, > BioPerl, and EMBOSS development lists. There is a list already for this purpose - open-bio-l I think we will also need a cross-project wiki space on the OBF site. Is there something already used by other projects or should we set something up? I am cross-posting this to other OBF project lists to encourage developers interested in combining efforts to address common problems. This started with FASTQ short read formats, and open-bio-l (a low volume list) has also seen discussion of common test data sets. Please sign up to open-bio-l (if you are not there already) and post suggestions for cross-project issues there. 
The list subscription page is:
http://lists.open-bio.org/mailman/listinfo/open-bio-l

Please feel free to forward this to any other projects I may have missed
(I picked the obvious addresses from the list.open-bio-org server)

regards,

Peter Rice

From biopython at maubp.freeserve.co.uk Mon Jul 27 07:51:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 27 Jul 2009 12:51:13 +0100
Subject: [Open-bio-l] [Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS
In-Reply-To: <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu>
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
 <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com>
 <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com>
 <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu>
Message-ID: <320fb6e00907270451i3d40b4ffq607360cfcb6f6282@mail.gmail.com>

On Sat, Jul 25, 2009 at 8:50 PM, Chris Fields wrote:
>
> From this it could be summarized that converting to sanger format is least
> problematic, as possible issues may be encountered when converting to the
> other variants.  We'll need to fix the solexa quality calculations in the
> BioPerl parser as noted in your previous post; I'll work on that.
>

BioPerl SVN (revision 15887, just updated on the off chance you have
committed any fixes recently) also has a problem going the other way
(from FASTQ Sanger to FASTQ Solexa),

$ more sanger_faked.fastq
@Test PHRED qualities from 40 to 0 inclusive
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
+
IHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"!

$ perl bioperl_sanger2solexa.pl < sanger_faked.fastq
@Test PHRED qualities from 40 to 0 inclusive
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
+Test PHRED qualities from 40 to 0 inclusive
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFEDB@><

Depending on your email viewer this may not be obvious, but the
sequence line is length 41 but the quality line is only 40 characters.
And again, I also suspect a problem in the mapping itself.
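
For reference, here is a minimal Python sketch of the Sanger to Solexa quality mapping and of the length check described above. It assumes the usual relationship between the two scales, Q_solexa = 10 * log10(10^(Q_phred/10) - 1), rounded to the nearest integer with a floor of -5. The function names are illustrative only and are not taken from BioPerl, Biopython or EMBOSS.

```python
from math import log10

SANGER_OFFSET = 33   # fastq-sanger: PHRED score + 33
SOLEXA_OFFSET = 64   # fastq-solexa: Solexa score + 64
SOLEXA_MIN = -5      # lowest Solexa score assumed in this sketch


def phred_to_solexa(phred):
    """Convert one integer PHRED score to the nearest integer Solexa score."""
    if phred < 0:
        raise ValueError("PHRED scores cannot be negative: %r" % phred)
    if phred == 0:
        # 10**0 - 1 == 0, so the logarithm is undefined; use the floor.
        return SOLEXA_MIN
    solexa = 10 * log10(10 ** (phred / 10.0) - 1)
    return max(SOLEXA_MIN, int(round(solexa)))


def sanger_to_solexa_quality(seq, sanger_qual):
    """Re-encode a Sanger FASTQ quality string as a Solexa FASTQ one.

    No upper score limit is enforced here; this sketch only covers the
    mapping itself and the one-character-per-base invariant.
    """
    if len(seq) != len(sanger_qual):
        raise ValueError(
            "Sequence length %d but quality length %d"
            % (len(seq), len(sanger_qual))
        )
    solexa_qual = "".join(
        chr(phred_to_solexa(ord(char) - SANGER_OFFSET) + SOLEXA_OFFSET)
        for char in sanger_qual
    )
    # One quality character per base must survive the conversion.
    assert len(solexa_qual) == len(seq)
    return solexa_qual
```

Applied to the 41 base test record above, a conversion along these lines should return a quality string that is again 41 characters long.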
Peter

From biopython at maubp.freeserve.co.uk Mon Jul 27 09:15:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 27 Jul 2009 14:15:39 +0100
Subject: [Open-bio-l] [Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS
In-Reply-To:
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
 <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com>
 <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com>
 <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu>
 <320fb6e00907270451i3d40b4ffq607360cfcb6f6282@mail.gmail.com>
Message-ID: <320fb6e00907270615m438b4230wbaed5895d5ed35d1@mail.gmail.com>

On Mon, Jul 27, 2009 at 2:06 PM, Chris Fields wrote:
>
>>
>> Depending on your email viewer this may not be obvious, but
>> the sequence line is length 41 but the quality line is only 40
>> characters. And again, I also suspect a problem in the mapping
>> itself.
>>
>> Peter
>
> I added this (and the others) to our ticket tracking this.  Looks like
> solexa conversion either way is borked, which is very likely an issue with
> conversion.
>
> chris

I'm afraid so. I'll keep an eye on that then (Bug 2857)
http://bugzilla.open-bio.org/show_bug.cgi?id=2857

Peter

From hlapp at gmx.net Mon Jul 27 10:01:21 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 27 Jul 2009 10:01:21 -0400
Subject: [Open-bio-l] [Bioperl-l] Open-bio cross-project issues
In-Reply-To: <4A6D6B8F.9060108@ebi.ac.uk>
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
 <4A6D6B8F.9060108@ebi.ac.uk>
Message-ID:

On Jul 27, 2009, at 4:55 AM, Peter Rice wrote:

> I think we will also need a cross-project wiki space on the OBF site.

What about the open-bio.org wiki? It doesn't say anywhere that it can
only be about OBF business (and in that sense, all OBF projects are OBF
business by definition).
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Mon Jul 27 10:17:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Jul 2009 15:17:55 +0100 Subject: [Open-bio-l] [Bioperl-l] Open-bio cross-project issues In-Reply-To: References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <4A6D6B8F.9060108@ebi.ac.uk> Message-ID: <320fb6e00907270717g4fc782aci4c4a0caf1a107ecb@mail.gmail.com> On Mon, Jul 27, 2009 at 3:01 PM, Hilmar Lapp wrote: > > On Jul 27, 2009, at 4:55 AM, Peter Rice wrote: > >> I think we will also need a cross-project wiki space on the OBF site. >> > > What about the open-bio.org wiki? It doesn't say anywhere that it can only > be about OBF business (and in that sense, all OBF projects are OBF > business by definition). That makes sense to me. I was assuming Peter R just meant something on the existing open-bio.org wiki, although I didn't say so explicitly. Peter C From pmr at ebi.ac.uk Mon Jul 27 10:46:00 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Mon, 27 Jul 2009 15:46:00 +0100 Subject: [Open-bio-l] [Bioperl-l] Open-bio cross-project issues In-Reply-To: <320fb6e00907270717g4fc782aci4c4a0caf1a107ecb@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <4A6D6B8F.9060108@ebi.ac.uk> <320fb6e00907270717g4fc782aci4c4a0caf1a107ecb@mail.gmail.com> Message-ID: <4A6DBDA8.8000807@ebi.ac.uk> Peter wrote: > On Mon, Jul 27, 2009 at 3:01 PM, Hilmar Lapp wrote: >> On Jul 27, 2009, at 4:55 AM, Peter Rice wrote: >> >>> I think we will also need a cross-project wiki space on the OBF site. >>> >> What about the open-bio.org wiki? It doesn't say anywhere that it can only >> be about OBF business (and in that sense, all OBF projects are OBF >> business by definition). > > That makes sense to me. 
I was assuming Peter R just meant something
> on the existing open-bio.org wiki, although I didn't say so explicitly.

The existing open-bio.org wiki is indeed what I had in mind.

Can someone provide guidelines on page naming so we can keep the pages
together?

regards,

Peter

From biopython at maubp.freeserve.co.uk Wed Jul 29 06:15:55 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 29 Jul 2009 11:15:55 +0100
Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS
In-Reply-To: <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com>
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
 <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com>
 <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com>
 <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu>
 <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com>
 <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com>
Message-ID: <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com>

Hi all,

This is a follow up to the earlier discussion about high quality scores
in Solexa or Illumina 1.3+ FASTQ files and the problem of non printable
ASCII codes (which can occur if converting from Sanger FASTQ).

> On Sat, Jul 25, 2009 at 8:50 PM, Chris Fields wrote:
>>
>>> Now, here comes the problem. I believe FASTQ files directly
>>> from an Illumina 1.3+ pipeline will have PHRED scores in the
>>> range 0 to 40 (as in this example). However, much higher
>>> PHRED scores are possible during assembly / contig'ing
>>> and read mapping. For example, the tool MAQ will output
>>> Sanger style FASTQ files with PHRED scores in the range
>>> 0 to 93 inclusive.
>>
>> We can support it as Illumina 1.3, but my point is this may getting into a
>> grey area and may be something that Illumina doesn't/wouldn't support.
>> Reminds me a little of the multiple GFF2 variations (one of the main
>> reasons for a GFF3).
>
> I agree this is an grey area (high scores in Solexa/Illumina
> FASTQ files).
>
> ...
>
> i.e. An Illumina FASTQ format file can hold PHRED scores in the
> range 0 to 62 without using problem characters. And likewise
> for a Solexa FASTQ file (Solexa scores up to 62).

Peter Rice and I have been talking about this off list, and have
a proposal for the high score problem. Basically we want to
restrict FASTQ quality strings to printable ASCII, which means
126 (0x7e) is a firm upper limit, while otherwise allowing for
as high scores as possible. This limit comes from ASCII 127 being
"delete", and the even higher characters also being non-printable.

i.e. We are suggesting:

"fastq-sanger" - Allows PHRED scores 0 to 93 inclusive, mapped
with an ASCII offset of 33 to ASCII characters 33 to 126 (or in hex,
0x21 to 0x7e). This is as defined on the MAQ web pages.

"fastq-illumina" - Believed to use at least PHRED scores 0 to 40,
mapped with an ASCII offset of 64 to ASCII characters 64 to 104
(or in hex, 0x40 to 0x68). It is a reasonable and well defined
extension to permit PHRED scores from 0 to 62 inclusive, which
map to ASCII 64 to 126 (or in hex 0x40 to 0x7e). This avoids the
non printing characters, and gives some head room for improved
sequencing technology from Illumina giving higher raw scores.

"fastq-solexa" - Believed to use Solexa scores from -5 to at least
40, again mapped with an ASCII offset of 64 giving ASCII characters
59 to 104 (or in hex, 0x3b to 0x68). Again, a reasonable and well
defined extension would permit Solexa scores in the range -5 to 62
inclusive, using ASCII 59 to 126 (or in hex, 0x3b to 0x7e).

[Peter R. - please correct me if any of the above is not what you
had in mind]

If in the process of converting between formats, a quality score
is too high (it would result in ASCII 127 or higher), then I would
argue any of the following would be acceptable:
(a) Silently impose the maximum score (ASCII 126, 0x7e)
(b) Impose the maximum score with a warning
(c) Raise an error

I don't think EMBOSS, BioPerl and Biopython have to handle
this exactly the same way, but I would favour (b) then (a).

Peter

From biopython at maubp.freeserve.co.uk Thu Jul 30 06:18:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 30 Jul 2009 11:18:26 +0100
Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS
In-Reply-To: <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com>
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
 <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com>
 <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com>
 <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu>
 <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com>
 <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com>
 <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com>
Message-ID: <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com>

On Wed, Jul 29, 2009 at 11:15 AM, Peter wrote:
> Hi all,
>
> This is a follow up to the earlier discussion about high quality scores
> in Solexa or Illumina 1.3+ FASTQ files and the problem of non printable
> ASCII codes (which can occur if converting from Sanger FASTQ).
>
> ...
>
> Peter Rice and I have been talking about this off list, and have
> a proposal for the high score problem. Basically we want to
> restrict FASTQ quality strings to printable ASCII, which means
> 126 (0x7e) is a firm upper limit, while otherwise allowing for a
> high scores as possible. This limit comes from ASCII 127 being
> "delete", and the even higher characters also being non-printable.
>
> i.e. We are suggesting:
>
> "fastq-sanger" - Allows PHRED scores 0 to 93 inclusive, mapped
> with an ASCII offset of 33 to ASCII characters 33 to 126 (or in hex,
> 0x21 to 0x7e). This is as defined on the MAQ web pages.
>
> "fastq-illumina" - Believed to use at least PHRED scores 0 to 40,
> mapped with an ASCII offset of 64 to ASCII characters 64 to 104
> (or in hex, to 0x40 to 0x68). It is a reasonable and well defined
> extension to permit PHRED scores from 0 to 62 inclusive, which
> map to ASCII 64 to 126 (or in hex 0x40 to 0x7e). This avoids the
> non printing characters, and gives some head room for improved
> sequencing technology from Illumina giving higher raw scores.
>
> "fastq-solexa" - Believed to use Solexa scores from -5 to at least
> 40, again mapped with an ASCII offset of 64 giving ASCII characters
> 59 to 104 (or in hex, 0x3b to 0x68). Again, a reasonable and well
> defined extension would permit Solexa scores in the range -5 to 62
> inclusive, using ASCII 59 to 126 (or in hex, 0x3b to 0x7e).

The latest version of Biopython in our repository now follows this,
avoiding any non-printing characters (which should trigger an error
on parsing).

> If in the process of converting between formats, a quality score
> is too high (it would result in ASCII 127 or higher), then I would
> argue any of the following would be acceptable:
> (a) Silently impose the maximum score (ASCII 126, 0x7e)
> (b) Impose the maximum score with a warning
> (c) Raise an error
>
> I don't think EMBOSS, BioPerl and Biopython have to handle
> this exactly the same way, but I would favour (b) then (a).

The EMBOSS patch I was testing from Peter Rice went for a
silent truncation, and in Biopython I have also for the moment gone
for silently imposing the maximum scores (ASCII 126, 0x7e)
of 93, 62 and 62 for the three formats. Another reason for this
is speed.
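
The three proposed score ranges and the three overflow options (a), (b) and (c) could be sketched in Python roughly as follows. This is only an illustration under the limits proposed in this thread; `encode_quality` and its `on_overflow` argument are hypothetical names, and this is not the actual Biopython or EMBOSS code.

```python
import warnings

# variant name -> (ASCII offset, lowest score, highest score),
# using the printable-ASCII limits proposed in this thread.
FASTQ_VARIANTS = {
    "fastq-sanger": (33, 0, 93),     # PHRED, ASCII 33 to 126
    "fastq-illumina": (64, 0, 62),   # PHRED, ASCII 64 to 126
    "fastq-solexa": (64, -5, 62),    # Solexa, ASCII 59 to 126
}


def encode_quality(scores, variant, on_overflow="warn"):
    """Encode integer scores (already on the target scale) as a string.

    on_overflow selects what happens when a score is above the maximum:
    "silent" is option (a), "warn" is option (b), "error" is option (c).
    """
    if on_overflow not in ("silent", "warn", "error"):
        raise ValueError("Unknown on_overflow setting: %r" % on_overflow)
    offset, lowest, highest = FASTQ_VARIANTS[variant]
    chars = []
    for score in scores:
        if score < lowest:
            raise ValueError(
                "value %r out of range for %s" % (score, variant)
            )
        if score > highest:
            if on_overflow == "error":
                raise ValueError(
                    "value %r out of range for %s" % (score, variant)
                )
            if on_overflow == "warn":
                warnings.warn(
                    "value %r capped at %d for %s"
                    % (score, highest, variant)
                )
            score = highest  # impose the maximum (ASCII 126, 0x7e)
        chars.append(chr(score + offset))
    return "".join(chars)
```

The range test is needed in all three modes, so any extra cost of the warning is only paid for scores that are actually out of range.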
Peter

From biopython at maubp.freeserve.co.uk Thu Jul 30 11:35:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 30 Jul 2009 16:35:25 +0100
Subject: [Open-bio-l] FASTQ records with no sequence?
In-Reply-To: <320fb6e00907300819x35ae00c3wa20a382376134db7@mail.gmail.com>
References: <320fb6e00907300800x5f8e78eci5df8333df713e4c@mail.gmail.com>
 <4A71B7B5.40502@ebi.ac.uk>
 <320fb6e00907300819x35ae00c3wa20a382376134db7@mail.gmail.com>
Message-ID: <320fb6e00907300835v3a9d46d4w77c344bbf6efa08d@mail.gmail.com>

Hi all,

On the continuing topic of the nebulous FASTQ format, are there
any strong views as to whether a FASTQ file could hold records
without a sequence (and therefore no quality scores)? This could
make sense as output from an (aggressive) quality filter.

This was a discussion I meant to start on the OBF list, not the
EMBOSS list - so here is the start of the thread:
http://lists.open-bio.org/pipermail/emboss/2009-July/003707.html

Basically in some contexts an empty FASTQ record makes sense,
so perhaps we should include examples of this for our test suite.
However, there is more than one reasonable way to represent
such a record (either omitting the sequence and quality lines, or
including blank sequence and quality lines).

On Thu, Jul 30, 2009 at 4:09 PM, Peter Rice wrote:
>
> Peter C. wrote:
>
>> As we are recommending no line wrapping on output this means
>> typical FASTQ records would be four lines - so doing the same
>> makes sense here too.
>
> I vote for 4 lines on output.

If we want to allow zero length sequences, then yes, I would also
vote for the 4 line output (i.e. blank lines for the sequence and
the quality string).

> It should be possible to allow zero lines on input depending on
> where the '+' check is.

Yes, I'm pretty sure a parser could cope with any of the zero length
sequence FASTQ examples I gave.
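
As a rough illustration of how a parser might cope with both representations of an empty record (blank sequence and quality lines, or those lines omitted), here is a minimal Python sketch. It assumes unwrapped FASTQ input (one sequence line and one quality line per record) and reads the whole file into memory for clarity; `parse_fastq` is a hypothetical name, not the Biopython, BioPerl or EMBOSS parser.

```python
import io


def parse_fastq(handle):
    """Yield (title, sequence, quality) from unwrapped FASTQ text.

    Zero length records are accepted with the sequence and/or quality
    line either blank or left out altogether.
    """
    lines = [line.rstrip("\r\n") for line in handle]
    i = 0
    total = len(lines)
    while i < total:
        title = lines[i]
        if not title.startswith("@"):
            raise ValueError("Expected '@' title line, got %r" % title)
        i += 1
        # Sequence line: blank, or missing (the next line is already '+').
        if i < total and not lines[i].startswith("+"):
            seq = lines[i]
            i += 1
        else:
            seq = ""
        if i >= total or not lines[i].startswith("+"):
            raise ValueError("Missing '+' line in record %r" % title)
        i += 1
        if seq:
            if i >= total:
                raise ValueError("Missing quality line in record %r" % title)
            qual = lines[i]
            i += 1
            if len(qual) != len(seq):
                raise ValueError("Length mismatch in record %r" % title)
        else:
            # No sequence, so no quality: consume a blank line if present,
            # otherwise leave the next line alone (it starts a new record).
            qual = ""
            if i < total and lines[i] == "":
                i += 1
        yield title[1:], seq, qual


if __name__ == "__main__":
    example = "@empty4\n\n+\n\n@empty2\n+\n@read\nACGT\n+\nIIII\n"
    for record in parse_fastq(io.StringIO(example)):
        print(record)
```

Checking for the '+' line before reading a sequence line is what lets the two line form through; on output, the four line form with blank lines remains the unambiguous choice.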
Peter

From biopython at maubp.freeserve.co.uk Thu Jul 30 11:55:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 30 Jul 2009 16:55:56 +0100
Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS
In-Reply-To:
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
 <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com>
 <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com>
 <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu>
 <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com>
 <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com>
 <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com>
 <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com>
Message-ID: <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com>

On Thu, Jul 30, 2009 at 4:46 PM, Chris Fields wrote:
>> The EMBOSS patch I was testing from Peter Rice went for a
>> silent truncation, in Biopython have also for the moment gone
>> for silently imposing the maximum scores (ASCII 126, 0x7e)
>> of 93, 62 and 62 for the three formats. Another reason for this
>> is speed.
>>
>> Peter
>
> Speed is one reason to worry, but we also should think carefully about
> silently truncating the data w/o the user's knowledge.  One thing we
> don't want to propagate is loss of data w/o warning.

Yes and no. Do you warn about converting from EMBL/GenBank to
FASTA? Or from a PFAM alignment to a ClustalW or PHYLIP
alignment? In those cases, anyone familiar with the file formats will
expect data loss as you are going from a richly annotated file format
to something much simpler.

Likewise here, anyone familiar with the FASTQ variants (and our
documentation should cover this) shouldn't be surprised at this
quality truncation. But I must concede, this is a more subtle and
less obvious data issue. So maybe you are right.

I can take a look at this and see how badly it would impact the
speed for Biopython...
Peter From cjfields at illinois.edu Thu Jul 30 11:46:51 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 30 Jul 2009 10:46:51 -0500 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> Message-ID: On Jul 30, 2009, at 5:18 AM, Peter wrote: > On Wed, Jul 29, 2009 at 11:15 AM, Peter > wrote: >> Hi all, >> >> This is a follow up to the earlier discussion about high quality >> scores >> in Solexa or Illumina 1.3+ FASTQ files and the problem of non >> printable >> ASCII codes (which can occur if converting from Sanger FASTQ). >> >> ... >> >> Peter Rice and I have been talking about this off list, and have >> a proposal for the high score problem. Basically we want to >> restrict FASTQ quality strings to printable ASCII, which means >> 126 (0x7e) is a firm upper limit, while otherwise allowing for a >> high scores as possible. This limit comes from ASCII 127 being >> "delete", and the even higher characters also being non-printable. >> >> i.e. We are suggesting: >> >> "fastq-sanger" - Allows PHRED scores 0 to 93 inclusive, mapped >> with an ASCII offset of 33 to ASCII characters 33 to 126 (or in hex, >> 0x21 to 0x7e). This is as defined on the MAQ web pages. >> >> "fastq-illumina" - Believed to use at least PHRED scores 0 to 40, >> mapped with an ASCII offset of 64 to ASCII characters 64 to 104 >> (or in hex, to 0x40 to 0x68). 
It is a reasonable and well defined >> extension to permit PHRED scores from 0 to 62 inclusive, which >> map to ASCII 64 to 126 (or in hex 0x40 to 0x7e). This avoids the >> non printing characters, and gives some head room for improved >> sequencing technology from Illumina giving higher raw scores. >> >> "fastq-solexa" - Believed to use Solexa scores from -5 to at least >> 40, again mapped with an ASCII offset of 64 giving ASCII characters >> 59 to 104 (or in hex, 0x3b to 0x68). Again, a reasonable and well >> defined extension would permit Solexa scores in the range -5 to 62 >> inclusive, using ASCII 59 to 126 (or in hex, 0x3b to 0x7e). > > The latest version of Biopython in our repository now follows this, > avoiding any non-printing characters (which should trigger an error > on parsing). > >> If in the process of converting between formats, a quality score >> is too high (it would result in ASCII 127 or higher), then I would >> argue any of the following would be acceptable: >> (a) Silently impose the maximum score (ASCII 126, 0x7e) >> (b) Impose the maximum score with a warning >> (c) Raise an error >> >> I don't think EMBOSS, BioPerl and Biopython have to handle >> this exactly the same way, but I would favour (b) then (a). > > The EMBOSS patch I was testing from Peter Rice went for a > silent truncation, in Biopython have also for the moment gone > for silently imposing the maximum scores (ASCII 126, 0x7e) > of 93, 62 and 62 for the three formats. Another reason for this > is speed. > > Peter Speed is one reason to worry, but we also should think carefully about silently truncating the data w/o the user's knowledge. One thing we don't want to propagate is loss of data w/o warning. 
chris

From cjfields at illinois.edu Thu Jul 30 16:08:36 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Thu, 30 Jul 2009 15:08:36 -0500
Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS
In-Reply-To: <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com>
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
 <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com>
 <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com>
 <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu>
 <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com>
 <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com>
 <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com>
 <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com>
 <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com>
Message-ID:

On Jul 30, 2009, at 10:55 AM, Peter wrote:

> On Thu, Jul 30, 2009 at 4:46 PM, Chris Fields
> wrote:
>>> The EMBOSS patch I was testing from Peter Rice went for a
>>> silent truncation, in Biopython have also for the moment gone
>>> for silently imposing the maximum scores (ASCII 126, 0x7e)
>>> of 93, 62 and 62 for the three formats. Another reason for this
>>> is speed.
>>>
>>> Peter
>>
>> Speed is one reason to worry, but we also should think carefully
>> about
>> silently truncating the data w/o the user's knowledge. One thing we
>> don't want to propagate is loss of data w/o warning.
>
> Yes and no. Do you warn about converting from EMBL/GenBank to
> FASTA? Or from a PFAM alignment to a ClustalW or PHYLIP
> alignment? In those cases, anyone familiar with the file formats will
> expect data loss as you are going from a richly annotated file format
> to something much simpler.

Right, but this doesn't follow along the same lines. Going from an
annotation- and feature-rich format to a very lightweight format is
one thing. This situation (at least to me) is more analogous to
exclusion of a subset of features b/c they don't fit certain
parameters.

I do think if it affects performance to a significant enough degree we
can do this silently, we just need to ensure this is well-documented.
My opinion is this use will prove to be an edge case anyway (most will
want conversion to Sanger vs. Illumina/Solexa).

> Likewise here, anyone familiar with the FASTQ variants (and our
> documentation should cover this) shouldn't be surprised at this
> quality truncation. But I must concede, this is a more subtle and
> less obvious data issue. So maybe you are right.
>
> I can take a look at this and see how badly it would impact the
> speed for Biopython...
>
> Peter

Will try to do the same for bioperl.

chris

From cjfields at illinois.edu Thu Jul 30 15:59:50 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Thu, 30 Jul 2009 14:59:50 -0500
Subject: [Open-bio-l] FASTQ records with no sequence?
In-Reply-To: <320fb6e00907300835v3a9d46d4w77c344bbf6efa08d@mail.gmail.com>
References: <320fb6e00907300800x5f8e78eci5df8333df713e4c@mail.gmail.com>
 <4A71B7B5.40502@ebi.ac.uk>
 <320fb6e00907300819x35ae00c3wa20a382376134db7@mail.gmail.com>
 <320fb6e00907300835v3a9d46d4w77c344bbf6efa08d@mail.gmail.com>
Message-ID:

On Jul 30, 2009, at 10:35 AM, Peter wrote:

> Hi all,
>
> On the continuing topic of the nebulous FASTQ format, are there
> any strong views as to weather a FASTQ files could hold records
> without a sequence (and therefore no quality scores)? This could
> make sense as output from an (aggressive) quality filter.
>
> This was a discussion I meant to start on the OBF list, not the
> EMBOSS list - so here is the start of the thread:
> http://lists.open-bio.org/pipermail/emboss/2009-July/003707.html
>
> Basically in some contexts an empty FASTQ record makes sense,
> so perhaps we should include examples of this for our test suite.
> However, there is more than one reasonable way to represent > such a record (either omitting the sequence and quality lines, or > including blank sequence and quality lines). > > On Thu, Jul 30, 2009 at 4:09 PM, Peter Rice wrote: >> >> Peter C. wrote: >> >>> As we are recommending no line wrapping on output this means >>> typical FASTQ records would be four lines - so doing the same >>> makes sense here too. >> >> I vote for 4 lines on output. > > If we want to allow zero length sequences, then yes, I would also > vote for the 4 line output (i.e. blank lines for the sequence and > the quality string). Same here. >> It should be possible to allow zero lines on input depending on >> where the '+' check is. > > Yes, I'm pretty sure a parser could cope with any of the zero length > sequence FASTQ examples I gave. > > Peter Should be easy to do this with bioperl as well. chris From biopython at maubp.freeserve.co.uk Thu Jul 30 17:50:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 30 Jul 2009 22:50:34 +0100 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> Message-ID: <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> On Thu, Jul 30, 2009 at 9:08 PM, Chris Fields wrote: > > I do think if it affects performance to a significant enough degree we > can do this silently, we just need to ensure this is well-documented. Agreed. 
> My opinions is this use will prove to be a edge case anyway (most will > want conversion to Sanger vs. Illumina/Solexa). Absolutely. Going from Solexa/Illumina to Sanger FASTQ will be more common (and there are no truncation issues). Going from Sanger FASTQ to Solexa or Illumina FASTQ will be rarer, and while a truncation is possible it requires very high scores (above PHRED 62) which are likely only to be possible from a consensus alignment or such like. i.e. Yes, it should be an edge case. I guess this expected usage supports the argument about issuing a warning on truncation, even with a modest performance overhead (because it only slows down the rarer expected usage). But let's get some benchmarks done to help settle this... Peter From ajmackey at gmail.com Thu Jul 30 19:52:03 2009 From: ajmackey at gmail.com (Aaron Mackey) Date: Thu, 30 Jul 2009 19:52:03 -0400 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> Message-ID: <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> I would strongly warn against truncation, for any reason. Use the formulas you have for quality-encoding conversions, but do not assume that you know more than I do about what my data contains, or that you are in any way helping me by altering my data, silently or otherwise. 
Said another way, feel free to warn me that my data may contain garbage, and utterly fail to convert it for me, but do not try to fix it for me. -Aaron On Thu, Jul 30, 2009 at 5:50 PM, Peter wrote: > On Thu, Jul 30, 2009 at 9:08 PM, Chris Fields > wrote: > > > > I do think if it affects performance to a significant enough degree we > > can do this silently, we just need to ensure this is well-documented. > > Agreed. > > > My opinions is this use will prove to be a edge case anyway (most will > > want conversion to Sanger vs. Illumina/Solexa). > > Absolutely. > > Going from Solexa/Illumina to Sanger FASTQ will be more common > (and there are no truncation issues). Going from Sanger FASTQ to > Solexa or Illumina FASTQ will be rarer, and while a truncation is > possible it requires very high scores (above PHRED 62) which are > likely only to be possible from a consensus alignment or such like. > i.e. Yes, it should be an edge case. > > I guess this expected usage supports the argument about issuing a > warning on truncation, even with a modest performance overhead > (because it only slows down the rarer expected usage). > > But let's get some benchmarks done to help settle this... 
> > Peter > _______________________________________________ > Open-Bio-l mailing list > Open-Bio-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/open-bio-l > From cjfields at illinois.edu Thu Jul 30 21:19:56 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 30 Jul 2009 20:19:56 -0500 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> Message-ID: I do tend to agree, and I don't think any savings from a performance hit will be worth the headache of having to repeatedly explain why it's (silently) doing so, when a simple warning or error message ('value X out of range for fastq format y') would suffice. chris On Jul 30, 2009, at 6:52 PM, Aaron Mackey wrote: > I would strongly warn against truncation, for any reason. Use the > formulas you have for quality-encoding conversions, but do not > assume that you know more than I do about what my data contains, or > that you are in any way helping me by altering my data, silently or > otherwise. Said another way, feel free to warn me that my data may > contain garbage, and utterly fail to convert it for me, but do not > try to fix it for me. 
> > -Aaron > > On Thu, Jul 30, 2009 at 5:50 PM, Peter > wrote: > On Thu, Jul 30, 2009 at 9:08 PM, Chris Fields > wrote: > > > > I do think if it affects performance to a significant enough > degree we > > can do this silently, we just need to ensure this is well- > documented. > > Agreed. > > > My opinions is this use will prove to be a edge case anyway (most > will > > want conversion to Sanger vs. Illumina/Solexa). > > Absolutely. > > Going from Solexa/Illumina to Sanger FASTQ will be more common > (and there are no truncation issues). Going from Sanger FASTQ to > Solexa or Illumina FASTQ will be rarer, and while a truncation is > possible it requires very high scores (above PHRED 62) which are > likely only to be possible from a consensus alignment or such like. > i.e. Yes, it should be an edge case. > > I guess this expected usage supports the argument about issuing a > warning on truncation, even with a modest performance overhead > (because it only slows down the rarer expected usage). > > But let's get some benchmarks done to help settle this... 
> > Peter > _______________________________________________ > Open-Bio-l mailing list > Open-Bio-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/open-bio-l > From pmr at ebi.ac.uk Fri Jul 31 04:16:20 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Fri, 31 Jul 2009 09:16:20 +0100 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> Message-ID: <4A72A854.8080302@ebi.ac.uk> Aaron Mackey wrote: > I would strongly warn against truncation, for any reason. Use the formulas > you have for quality-encoding conversions, but do not assume that you know > more than I do about what my data contains, or that you are in any way > helping me by altering my data, silently or otherwise. Said another way, > feel free to warn me that my data may contain garbage, and utterly fail to > convert it for me, but do not try to fix it for me. We should bear in mind what the outer limit quality scores are. A quality score of 60 means a 1 in a million chance of an error. A quality of 90 means a 1 in a billion chance of an error (or 3 in an entire mammalian genome). Quality scores below 1 (phred) or -5 (solexa) mean the base is wrong (worse than random). 
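For reference, the figures above follow from the usual definitions: a PHRED score Q corresponds to an error probability of 10^(-Q/10), while a Solexa score is defined on the odds p/(1-p) rather than on p itself. A minimal sketch (the function names are illustrative only and are not taken from EMBOSS, BioPerl or Biopython):

```python
def phred_to_error(q):
    """Error probability for a PHRED quality score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10.0)


def solexa_to_error(q):
    """Error probability for a Solexa quality score.

    Solexa scores are defined on the odds p/(1-p), so we recover the
    odds first and then convert them back to a probability.
    """
    odds = 10 ** (-q / 10.0)
    return odds / (1.0 + odds)
```

With these definitions PHRED 60 and PHRED 90 give 1e-6 and 1e-9, matching the numbers quoted, while PHRED 1 and Solexa -5 come out near 0.79 and 0.76, i.e. slightly worse than the 0.75 error rate of picking one of four bases at random.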
I do not believe we are losing anything biologically significant by the score limits - but we are using a tighter definition of the FASTQ format to protect other parsers from terrible errors with, for example, signed characters. On the subject of warnings ... While I am happy to issue warnings, I suggest we take some care over what happens when someone picks the wrong format and a million reads have quality scores out of range. We could, for example, report the first error and then count up so we can later (at the end or when another error occurs) say "and another 987654 up to ..." and give the latest one. regards, Peter Rice From pmr at ebi.ac.uk Fri Jul 31 04:19:05 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Fri, 31 Jul 2009 09:19:05 +0100 Subject: [Open-bio-l] FASTQ identifiers In-Reply-To: <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> Message-ID: <4A72A8F9.9020903@ebi.ac.uk> Another FASTQ topic. Should we try to understand FASTQ identifiers? There are some standard identifiers with meaningful elements that could be useful for reporting or subsetting FASTQ data. Can we agree on how to parse those and what they can be used for? What other naming conventions are in common use, e.g. for non-Solexa instruments?
regards, Peter From biopython at maubp.freeserve.co.uk Fri Jul 31 05:01:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 31 Jul 2009 10:01:57 +0100 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <4A72A854.8080302@ebi.ac.uk> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> <4A72A854.8080302@ebi.ac.uk> Message-ID: <320fb6e00907310201o7588dbfeg9e0bc5be6138c754@mail.gmail.com> On Fri, Jul 31, 2009 at 9:16 AM, Peter Rice wrote: > > On the subject or warnings ... While I am happy to issue warnings, > I suggest we take some care over what happens when someone > picks the wrong format and a million reads have quality scores out > of range. > > We could, for example, report the first error and then count up so > we can later (at the end or when another error occurs) say "and > another 987654 up to ..." and give the latest one. Yes indeed, a warning for every record which had its quality score truncated would be madness (given the number of reads you might have in a FASTQ file). One warning for the whole file would be enough. 
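As a rough illustration of the "report the first and count the rest" idea, here is a minimal Python sketch. The class name and message wording are hypothetical and do not reflect how EMBOSS or Biopython actually implement this; it only shows that a single warning per file is cheap to arrange.

```python
import warnings


class TruncationReport:
    """Remember the first out-of-range score and count all of them."""

    def __init__(self, fmt, max_score):
        self.fmt = fmt              # e.g. "fastq-illumina"
        self.max_score = max_score  # e.g. 62
        self.first = None           # first offending score seen
        self.count = 0              # number of scores truncated so far

    def clamp(self, score):
        """Return the score, truncated to max_score if it is too high."""
        if score > self.max_score:
            if self.first is None:
                self.first = score
            self.count += 1
            return self.max_score
        return score

    def finish(self):
        """Issue one warning for the whole file; return the number truncated."""
        if self.count:
            warnings.warn(
                "value %s out of range for %s (max %i); %i score(s) truncated in total"
                % (self.first, self.fmt, self.max_score, self.count)
            )
        return self.count
```

A writer would call clamp() on every score and finish() once after the last record, so the per-score cost in the common case is a single comparison.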
Peter From biopython at maubp.freeserve.co.uk Fri Jul 31 05:15:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 31 Jul 2009 10:15:57 +0100 Subject: [Open-bio-l] FASTQ identifiers In-Reply-To: <4A72A8F9.9020903@ebi.ac.uk> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> <4A72A8F9.9020903@ebi.ac.uk> Message-ID: <320fb6e00907310215n6ed33a17id8a5ff5913c9d3b5@mail.gmail.com> On Fri, Jul 31, 2009 at 9:19 AM, Peter Rice wrote: > Another FASTQ topic. > > Should we try to understand FASTQ identifiers. I would say no (see below). Although project interoperability shouldn't stop EMBOSS from doing this if it wants to. Related to this, what about the corner case of reads with NO identifier? The FASTQ (and indeed the FASTA) formats can hold such things - just use a blank title line. In the case of next generation sequencing reads, the names themselves are not actually that important - so you can imagine a pipeline which doesn't actually bother with them at all. > There are some standard identifiers with meaningful elements > that could be useful for reporting or subsetting FASTQ data. True. > Can we agree on how to parse those and what they can be used for? The situation is similar to the FASTA format (and others), in that there are a number of reasonably well documented conventions in use (e.g. the NCBI FASTA identifiers with | characters). However, equally, there are thousands of ad hoc local conventions. In EMBOSS, you cater to a few FASTA variants where you do parse the identifier. This might address the FASTQ situation too. 
In Biopython we don't do anything clever with the FASTA identifier, nor the FASTQ identifier. Zen of Python: "In the face of ambiguity, refuse the temptation to guess." In the case of wanting to parse the identifier and, say, filter on the lane number, for Biopython the user can do this themselves if they need to. > What other naming conventions are in common use e.g. for non-SOlexa > instruments? Keep in mind that even for a single manufacturer's instrument, there are different versions of the pipeline, and indeed alternative pipelines. For example, I understand Sanger is using a modified pipeline on their Illumina sequencers, which may introduce their own naming. For Roche 454, their tools don't currently let you produce FASTQ directly, but this is easy to get from the FASTA and QUAL file Roche will output. This indirectly defines a Roche identifier convention. Peter C. From biopython at maubp.freeserve.co.uk Fri Jul 31 10:04:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 31 Jul 2009 15:04:41 +0100 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <4A72A854.8080302@ebi.ac.uk> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> <4A72A854.8080302@ebi.ac.uk> Message-ID: <320fb6e00907310704q4c9f8df7qe688089315304af0@mail.gmail.com> Aaron Mackey wrote: >>> I would strongly warn against truncation, for any reason.
Use the >>> formulas you have for quality-encoding conversions, but do not >>> assume that you know more than I do about what my data contains, >>> or that you are in any way helping me by altering my data, silently or >>> otherwise. Said another way, feel free to warn me that my data may >>> contain garbage, and utterly fail to convert it for me, but do not try >>> to fix it for me. http://lists.open-bio.org/pipermail/open-bio-l/2009-July/000520.html Earlier I wrote: >>>> If in the process of converting between formats, a quality score >>>> is too high (it would result in ASCII 127 or higher), then I would >>>> argue any of the following would be acceptable: >>>> (a) Silently impose the maximum score (ASCII 126, 0x7e) >>>> (b) Impose the maximum score with a warning >>>> (c) Raise an error >>>> >>>> I don't think EMBOSS, BioPerl and Biopython have to handle >>>> this exactly the same way, but I would favour (b) then (a). Aaron, are you saying you support raising an error (option c), or truncation with a warning (option b), but are against a silent score truncation (option a)? The problem with just raising an error (option c) is it prevents a valid operation (conversion with truncation). Peter Rice wrote: >> We should bear in mind what the outer limit quality scores are. A quality >> score of 60 means a 1 in a million chance of an error. A quality of 90 means >> a 1 in a billion chance of an error (or 3 in an entire mammalian genome). >> Quality scores below 1 (phred) or -5 (solexa) mean the base is wrong >> (worse than random). >> >> I do not believe we are losing anything biologically significant by the >> score limits... Good point. On Fri, Jul 31, 2009 at 2:19 AM, Chris Fields wrote: > > I do tend to agree, and I don't think any savings from a performance hit > will be worth the headache of having to repeatedly explain why it's > (silently) doing so, when a simple warning or error message ('value X out of > range for fastq format y') would suffice.
That's a shift from your early stance: > I do think if it affects performance to a significant enough degree we > can do this silently, we just need to ensure this is well-documented. Still, I guess it boils down to how big a penalty the warnings would impose on typical conversions. And for Biopython, it looks like the answer is not much. I've updated Biopython to issue warnings on writing FASTQ files if the quality score had to be truncated to fit the given encoding. i.e. If you had a PHRED quality above 93 for "fastq-sanger", or above 62 for "fastq-illumina", or a Solexa quality above 62 for "fastq-solexa". As implemented there is a speed penalty, but *only* for these fringe cases. Peter From biopython at maubp.freeserve.co.uk Mon Jul 27 11:51:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Jul 2009 12:51:13 +0100 Subject: [Open-bio-l] [Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> Message-ID: <320fb6e00907270451i3d40b4ffq607360cfcb6f6282@mail.gmail.com> On Sat, Jul 25, 2009 at 8:50 PM, Chris Fields wrote: > > From this it could be summarized that converting to sanger format is least > problematic, as possible issues may be encountered when converting to the > other variants. We'll need to fix the solexa quality calculations in the > BioPerl parser as noted in your previous post; I'll work on that. > BioPerl SVN (revision 15887, just updated on the off chance you have committed any fixes recently) also has a problem going the other way (from FASTQ Sanger to FASTQ Solexa),
$ more sanger_faked.fastq
@Test PHRED qualities from 40 to 0 inclusive
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
+
IHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"!
$ perl bioperl_sanger2solexa.pl < sanger_faked.fastq
@Test PHRED qualities from 40 to 0 inclusive
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
+Test PHRED qualities from 40 to 0 inclusive
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFEDB@><
Depending on your email viewer this may not be obvious, but the sequence line is length 41 but the quality line is only 40 characters. And again, I also suspect a problem in the mapping itself. Peter From biopython at maubp.freeserve.co.uk Mon Jul 27 13:15:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Jul 2009 14:15:39 +0100 Subject: [Open-bio-l] [Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907270451i3d40b4ffq607360cfcb6f6282@mail.gmail.com> Message-ID: <320fb6e00907270615m438b4230wbaed5895d5ed35d1@mail.gmail.com> On Mon, Jul 27, 2009 at 2:06 PM, Chris Fields wrote: > >> >> Depending on your email viewer this may not be obvious, but >> the sequence line is length 41 but the quality line is only 40 >> characters. And again, I also suspect a problem in the mapping >> itself. >> >> Peter > > I added this (and the others) to our ticket tracking this. Looks like > solexa conversion either way is borked, which is very likely an issue with > conversion. > > chris I'm afraid so.
I'll keep an eye on that then (Bug 2857) http://bugzilla.open-bio.org/show_bug.cgi?id=2857 Peter From hlapp at gmx.net Mon Jul 27 14:01:21 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 27 Jul 2009 10:01:21 -0400 Subject: [Open-bio-l] [Bioperl-l] Open-bio cross-project issues In-Reply-To: <4A6D6B8F.9060108@ebi.ac.uk> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <4A6D6B8F.9060108@ebi.ac.uk> Message-ID: On Jul 27, 2009, at 4:55 AM, Peter Rice wrote: > I think we will also need a cross-project wiki space on the OBF site. What about the open-bio.org wiki? It doesn't say anywhere that it can only be about OBF business (and in that sense, all OBF projects are OBF business by definition). -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Mon Jul 27 14:17:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Jul 2009 15:17:55 +0100 Subject: [Open-bio-l] [Bioperl-l] Open-bio cross-project issues In-Reply-To: References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <4A6D6B8F.9060108@ebi.ac.uk> Message-ID: <320fb6e00907270717g4fc782aci4c4a0caf1a107ecb@mail.gmail.com> On Mon, Jul 27, 2009 at 3:01 PM, Hilmar Lapp wrote: > > On Jul 27, 2009, at 4:55 AM, Peter Rice wrote: > >> I think we will also need a cross-project wiki space on the OBF site. >> > > What about the open-bio.org wiki? It doesn't say anywhere that it can only > be about OBF business (and in that sense, all OBF projects are OBF > business by definition). That makes sense to me. I was assuming Peter R just meant something on the existing open-bio.org wiki, although I didn't say so explicitly. 
Peter C From pmr at ebi.ac.uk Mon Jul 27 14:46:00 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Mon, 27 Jul 2009 15:46:00 +0100 Subject: [Open-bio-l] [Bioperl-l] Open-bio cross-project issues In-Reply-To: <320fb6e00907270717g4fc782aci4c4a0caf1a107ecb@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <4A6D6B8F.9060108@ebi.ac.uk> <320fb6e00907270717g4fc782aci4c4a0caf1a107ecb@mail.gmail.com> Message-ID: <4A6DBDA8.8000807@ebi.ac.uk> Peter wrote: > On Mon, Jul 27, 2009 at 3:01 PM, Hilmar Lapp wrote: >> On Jul 27, 2009, at 4:55 AM, Peter Rice wrote: >> >>> I think we will also need a cross-project wiki space on the OBF site. >>> >> What about the open-bio.org wiki? It doesn't say anywhere that it can only >> be about OBF business (and in that sense, all OBF projects are OBF >> business by definition). > > That makes sense to me. I was assuming Peter R just meant something > on the existing open-bio.org wiki, although I didn't say so explicitly. The existing open-bio.org wiki is indeed what I had in mind. Can someone provide guidelines on page naming so we can keep the pages together? 
regards, Peter From biopython at maubp.freeserve.co.uk Wed Jul 29 10:15:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Jul 2009 11:15:55 +0100 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> Message-ID: <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> Hi all, This is a follow up to the earlier discussion about high quality scores in Solexa or Illumina 1.3+ FASTQ files and the problem of non printable ASCII codes (which can occur if converting from Sanger FASTQ). > On Sat, Jul 25, 2009 at 8:50 PM, Chris Fields wrote: >> >>> Now, here comes the problem. I believe FASTQ files directly >>> from an Illumina 1.3+ pipeline will have PHRED scores in the >>> range 0 to 40 (as in this example). However, much higher >>> PHRED scores are possible during assembly / contig'ing >>> and read mapping. For example, the tool MAQ will output >>> Sanger style FASTQ files with PHRED scores in the range >>> 0 to 93 inclusive. >> >> We can support it as Illumina 1.3, but my point is this may getting into a >> grey area and may be something that Illumina doesn't/wouldn't support. >> Reminds me a little of the multiple GFF2 variations (one of the main >> reasons for a GFF3). > > I agree this is an grey area (high scores in Solexa/Illumina > FASTQ files). > > ... > > i.e. An Illumina FASTQ format file can hold PHRED scores in the > range 0 to 62 without using problem characters. And likewise > for a Solexa FASTQ file (Solexa scores up to 62).
Peter Rice and I have been talking about this off list, and have a proposal for the high score problem. Basically we want to restrict FASTQ quality strings to printable ASCII, which means 126 (0x7e) is a firm upper limit, while otherwise allowing for scores as high as possible. This limit comes from ASCII 127 being "delete", and the even higher characters also being non-printable.

i.e. We are suggesting:

"fastq-sanger" - Allows PHRED scores 0 to 93 inclusive, mapped with an ASCII offset of 33 to ASCII characters 33 to 126 (or in hex, 0x21 to 0x7e). This is as defined on the MAQ web pages.

"fastq-illumina" - Believed to use at least PHRED scores 0 to 40, mapped with an ASCII offset of 64 to ASCII characters 64 to 104 (or in hex, 0x40 to 0x68). It is a reasonable and well defined extension to permit PHRED scores from 0 to 62 inclusive, which map to ASCII 64 to 126 (or in hex 0x40 to 0x7e). This avoids the non printing characters, and gives some head room for improved sequencing technology from Illumina giving higher raw scores.

"fastq-solexa" - Believed to use Solexa scores from -5 to at least 40, again mapped with an ASCII offset of 64 giving ASCII characters 59 to 104 (or in hex, 0x3b to 0x68). Again, a reasonable and well defined extension would permit Solexa scores in the range -5 to 62 inclusive, using ASCII 59 to 126 (or in hex, 0x3b to 0x7e).

[Peter R. - please correct me if any of the above is not what you had in mind]

If in the process of converting between formats, a quality score is too high (it would result in ASCII 127 or higher), then I would argue any of the following would be acceptable:
(a) Silently impose the maximum score (ASCII 126, 0x7e)
(b) Impose the maximum score with a warning
(c) Raise an error

I don't think EMBOSS, BioPerl and Biopython have to handle this exactly the same way, but I would favour (b) then (a).
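To make the proposed limits concrete, here is a minimal Python sketch of the three variants and of options (a), (b) and (c). The table and function names are hypothetical, chosen for illustration, and are not the Biopython, BioPerl or EMBOSS API. The sketch only encodes scores already on the right scale; it does not convert between PHRED and Solexa scores.

```python
import warnings

# Variant name -> (ASCII offset, lowest score, highest score), per the proposal above.
FASTQ_VARIANTS = {
    "fastq-sanger": (33, 0, 93),    # PHRED scores, ASCII 33 to 126
    "fastq-illumina": (64, 0, 62),  # PHRED scores, ASCII 64 to 126
    "fastq-solexa": (64, -5, 62),   # Solexa scores, ASCII 59 to 126
}


def encode_quality(scores, variant, on_overflow="warn"):
    """Encode integer scores as a quality string for the given variant.

    on_overflow is "silent" (option a), "warn" (option b) or "error" (option c).
    """
    if on_overflow not in ("silent", "warn", "error"):
        raise ValueError("unknown on_overflow setting: %r" % on_overflow)
    offset, lowest, highest = FASTQ_VARIANTS[variant]
    chars = []
    truncated = 0
    for score in scores:
        if score < lowest:
            raise ValueError("score %i below minimum for %s" % (score, variant))
        if score > highest:
            if on_overflow == "error":
                raise ValueError("value %i out of range for %s" % (score, variant))
            truncated += 1
            score = highest  # always maps to ASCII 126 (0x7e)
        chars.append(chr(score + offset))
    if truncated and on_overflow == "warn":
        # One warning per call, not one per score.
        warnings.warn("%i score(s) truncated to %i for %s" % (truncated, highest, variant))
    return "".join(chars)
```

For example, encode_quality([0, 40, 93], "fastq-sanger") gives "!I~", while passing a PHRED score of 93 with "fastq-illumina" either truncates to "~" or raises, depending on on_overflow.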
Peter From biopython at maubp.freeserve.co.uk Thu Jul 30 10:18:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 30 Jul 2009 11:18:26 +0100 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> Message-ID: <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> On Wed, Jul 29, 2009 at 11:15 AM, Peter wrote: > Hi all, > > This is a follow up to the earlier discussion about high quality scores > in Solexa or Illumina 1.3+ FASTQ files and the problem of non printable > ASCII codes (which can occur if converting from Sanger FASTQ). > > ... > > Peter Rice and I have been talking about this off list, and have > a proposal for the high score problem. Basically we want to > restrict FASTQ quality strings to printable ASCII, which means > 126 (0x7e) is a firm upper limit, while otherwise allowing for a > high scores as possible. This limit comes from ASCII 127 being > "delete", and the even higher characters also being non-printable. > > i.e. We are suggesting: > > "fastq-sanger" - Allows PHRED scores 0 to 93 inclusive, mapped > with an ASCII offset of 33 to ASCII characters 33 to 126 (or in hex, > 0x21 to 0x7e). This is as defined on the MAQ web pages. > > "fastq-illumina" - Believed to use at least PHRED scores 0 to 40, > mapped with an ASCII offset of 64 to ASCII characters 64 to 104 > (or in hex, to 0x40 to 0x68). 
It is a reasonable and well defined > extension to permit PHRED scores from 0 to 62 inclusive, which > map to ASCII 64 to 126 (or in hex 0x40 to 0x7e). This avoids the > non printing characters, and gives some head room for improved > sequencing technology from Illumina giving higher raw scores. > > "fastq-solexa" - Believed to use Solexa scores from -5 to at least > 40, again mapped with an ASCII offset of 64 giving ASCII characters > 59 to 104 (or in hex, 0x3b to 0x68). Again, a reasonable and well > defined extension would permit Solexa scores in the range -5 to 62 > inclusive, using ASCII 59 to 126 (or in hex, 0x3b to 0x7e). The latest version of Biopython in our repository now follows this, avoiding any non-printing characters (which should trigger an error on parsing). > If in the process of converting between formats, a quality score > is too high (it would result in ASCII 127 or higher), then I would > argue any of the following would be acceptable: > (a) Silently impose the maximum score (ASCII 126, 0x7e) > (b) Impose the maximum score with a warning > (c) Raise an error > > I don't think EMBOSS, BioPerl and Biopython have to handle > this exactly the same way, but I would favour (b) then (a). The EMBOSS patch I was testing from Peter Rice went for a silent truncation, and in Biopython we have also for the moment gone for silently imposing the maximum scores (ASCII 126, 0x7e) of 93, 62 and 62 for the three formats. Another reason for this is speed. Peter From biopython at maubp.freeserve.co.uk Thu Jul 30 15:35:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 30 Jul 2009 16:35:25 +0100 Subject: [Open-bio-l] FASTQ records with no sequence?
In-Reply-To: <320fb6e00907300819x35ae00c3wa20a382376134db7@mail.gmail.com> References: <320fb6e00907300800x5f8e78eci5df8333df713e4c@mail.gmail.com> <4A71B7B5.40502@ebi.ac.uk> <320fb6e00907300819x35ae00c3wa20a382376134db7@mail.gmail.com> Message-ID: <320fb6e00907300835v3a9d46d4w77c344bbf6efa08d@mail.gmail.com> Hi all, On the continuing topic of the nebulous FASTQ format, are there any strong views as to whether a FASTQ file could hold records without a sequence (and therefore no quality scores)? This could make sense as output from an (aggressive) quality filter. This was a discussion I meant to start on the OBF list, not the EMBOSS list - so here is the start of the thread: http://lists.open-bio.org/pipermail/emboss/2009-July/003707.html Basically in some contexts an empty FASTQ record makes sense, so perhaps we should include examples of this for our test suite. However, there is more than one reasonable way to represent such a record (either omitting the sequence and quality lines, or including blank sequence and quality lines). On Thu, Jul 30, 2009 at 4:09 PM, Peter Rice wrote: > > Peter C. wrote: > >> As we are recommending no line wrapping on output this means >> typical FASTQ records would be four lines - so doing the same >> makes sense here too. > > I vote for 4 lines on output. If we want to allow zero length sequences, then yes, I would also vote for the 4 line output (i.e. blank lines for the sequence and the quality string). > It should be possible to allow zero lines on input depending on > where the '+' check is. Yes, I'm pretty sure a parser could cope with any of the zero length sequence FASTQ examples I gave.
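A minimal sketch of the four-line convention, assuming unwrapped records and treating an empty sequence as two blank lines. The function names are hypothetical, and the reader below deliberately handles only the strict four-line layout, not the variant where the sequence and quality lines are omitted altogether:

```python
def format_fastq(title, seq, qual):
    """Return one unwrapped four-line FASTQ record.

    A zero length sequence gives blank sequence and quality lines.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return "@%s\n%s\n+\n%s\n" % (title, seq, qual)


def parse_four_line_fastq(handle):
    """Yield (title, seq, qual) tuples from strict four-line records."""
    while True:
        title_line = handle.readline()
        if not title_line:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus_line = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not title_line.startswith("@") or not plus_line.startswith("+"):
            raise ValueError("not a four-line FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield title_line[1:].rstrip("\n"), seq, qual
```

Because every record is exactly four lines, the empty record needs no special case in either direction, which is the main attraction of always writing the blank lines.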
Peter From biopython at maubp.freeserve.co.uk Thu Jul 30 15:55:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 30 Jul 2009 16:55:56 +0100 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> Message-ID: <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> On Thu, Jul 30, 2009 at 4:46 PM, Chris Fields wrote: >> The EMBOSS patch I was testing from Peter Rice went for a >> silent truncation, in Biopython have also for the moment gone >> for silently imposing the maximum scores (ASCII 126, 0x7e) >> of 93, 62 and 62 for the three formats. Another reason for this >> is speed. >> >> Peter > > Speed is one reason to worry, but we also should think carefully about > silently truncating the data w/o the user's knowledge. One thing we > don't want to propagate is loss of data w/o warning. Yes and no. Do you warn about converting from EMBL/GenBank to FASTA? Or from a PFAM alignment to a ClustalW or PHYLIP alignment? In those cases, anyone familiar with the file formats will expect data loss as you are going from a richly annotated file format to something much simpler. Likewise here, anyone familiar with the FASTQ variants (and our documentation should cover this) shouldn't be surprised at this quality truncation. But I must concede, this is a more subtle and less obvious data issue. So maybe you are right. I can take a look at this and see how badly it would impact the speed for Biopython...
Peter From cjfields at illinois.edu Thu Jul 30 15:46:51 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 30 Jul 2009 10:46:51 -0500 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> Message-ID: On Jul 30, 2009, at 5:18 AM, Peter wrote: > On Wed, Jul 29, 2009 at 11:15 AM, Peter > wrote: >> Hi all, >> >> This is a follow up to the earlier discussion about high quality >> scores >> in Solexa or Illumina 1.3+ FASTQ files and the problem of non >> printable >> ASCII codes (which can occur if converting from Sanger FASTQ). >> >> ... >> >> Peter Rice and I have been talking about this off list, and have >> a proposal for the high score problem. Basically we want to >> restrict FASTQ quality strings to printable ASCII, which means >> 126 (0x7e) is a firm upper limit, while otherwise allowing for a >> high scores as possible. This limit comes from ASCII 127 being >> "delete", and the even higher characters also being non-printable. >> >> i.e. We are suggesting: >> >> "fastq-sanger" - Allows PHRED scores 0 to 93 inclusive, mapped >> with an ASCII offset of 33 to ASCII characters 33 to 126 (or in hex, >> 0x21 to 0x7e). This is as defined on the MAQ web pages. >> >> "fastq-illumina" - Believed to use at least PHRED scores 0 to 40, >> mapped with an ASCII offset of 64 to ASCII characters 64 to 104 >> (or in hex, to 0x40 to 0x68). 
It is a reasonable and well defined >> extension to permit PHRED scores from 0 to 62 inclusive, which >> map to ASCII 64 to 126 (or in hex 0x40 to 0x7e). This avoids the >> non printing characters, and gives some head room for improved >> sequencing technology from Illumina giving higher raw scores. >> >> "fastq-solexa" - Believed to use Solexa scores from -5 to at least >> 40, again mapped with an ASCII offset of 64 giving ASCII characters >> 59 to 104 (or in hex, 0x3b to 0x68). Again, a reasonable and well >> defined extension would permit Solexa scores in the range -5 to 62 >> inclusive, using ASCII 59 to 126 (or in hex, 0x3b to 0x7e). > > The latest version of Biopython in our repository now follows this, > avoiding any non-printing characters (which should trigger an error > on parsing). > >> If in the process of converting between formats, a quality score >> is too high (it would result in ASCII 127 or higher), then I would >> argue any of the following would be acceptable: >> (a) Silently impose the maximum score (ASCII 126, 0x7e) >> (b) Impose the maximum score with a warning >> (c) Raise an error >> >> I don't think EMBOSS, BioPerl and Biopython have to handle >> this exactly the same way, but I would favour (b) then (a). > > The EMBOSS patch I was testing from Peter Rice went for a > silent truncation, in Biopython have also for the moment gone > for silently imposing the maximum scores (ASCII 126, 0x7e) > of 93, 62 and 62 for the three formats. Another reason for this > is speed. > > Peter Speed is one reason to worry, but we also should think carefully about silently truncating the data w/o the user's knowledge. One thing we don't want to propagate is loss of data w/o warning. 
chris From cjfields at illinois.edu Thu Jul 30 20:08:36 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 30 Jul 2009 15:08:36 -0500 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> Message-ID: On Jul 30, 2009, at 10:55 AM, Peter wrote: > On Thu, Jul 30, 2009 at 4:46 PM, Chris Fields > wrote: >>> The EMBOSS patch I was testing from Peter Rice went for a >>> silent truncation, in Biopython have also for the moment gone >>> for silently imposing the maximum scores (ASCII 126, 0x7e) >>> of 93, 62 and 62 for the three formats. Another reason for this >>> is speed. >>> >>> Peter >> >> Speed is one reason to worry, but we also should think carefully >> about >> silently truncating the data w/o the user's knowledge. One thing we >> don't want to propagate is loss of data w/o warning. > > Yes and no. Do you warn about converting from EMBL/GenBank to > FASTA? Or from a PFAM alignment to a ClustalW or PHYLIP > alignment? In those cases, anyone familiar with the file formats will > expect data loss as you are going from a richly annotated file format > to something much simpler. Right, but this doesn't follow along the same lines. Going from an annotation- and feature-rich format to a very lightweight format is one thing.
This situation (at least to me) is more analogous to exclusion of a subset of features b/c they don't fit certain parameters. I do think if it affects performance to a significant enough degree we can do this silently; we just need to ensure this is well-documented. My opinion is that this use will prove to be an edge case anyway (most will want conversion to Sanger vs. Illumina/Solexa). > Likewise here, anyone familiar with the FASTQ variants (and our > documentation should cover this) shouldn't be surprised at this > quality truncation. But I must concede, this is a more subtle and > less obvious data issue. So maybe you are right. > > I can take a look at this and see how badly it would impact the > speed for Biopython... > > Peter Will try to do the same for bioperl. chris From cjfields at illinois.edu Thu Jul 30 19:59:50 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 30 Jul 2009 14:59:50 -0500 Subject: [Open-bio-l] FASTQ records with no sequence? In-Reply-To: <320fb6e00907300835v3a9d46d4w77c344bbf6efa08d@mail.gmail.com> References: <320fb6e00907300800x5f8e78eci5df8333df713e4c@mail.gmail.com> <4A71B7B5.40502@ebi.ac.uk> <320fb6e00907300819x35ae00c3wa20a382376134db7@mail.gmail.com> <320fb6e00907300835v3a9d46d4w77c344bbf6efa08d@mail.gmail.com> Message-ID: On Jul 30, 2009, at 10:35 AM, Peter wrote: > Hi all, > > On the continuing topic of the nebulous FASTQ format, are there > any strong views as to whether a FASTQ file could hold records > without a sequence (and therefore no quality scores)? This could > make sense as output from an (aggressive) quality filter. > > This was a discussion I meant to start on the OBF list, not the > EMBOSS list - so here is the start of the thread: > http://lists.open-bio.org/pipermail/emboss/2009-July/003707.html > > Basically in some contexts an empty FASTQ record makes sense, > so perhaps we should include examples of this for our test suite.
> However, there is more than one reasonable way to represent > such a record (either omitting the sequence and quality lines, or > including blank sequence and quality lines). > > On Thu, Jul 30, 2009 at 4:09 PM, Peter Rice wrote: >> >> Peter C. wrote: >> >>> As we are recommending no line wrapping on output this means >>> typical FASTQ records would be four lines - so doing the same >>> makes sense here too. >> >> I vote for 4 lines on output. > > If we want to allow zero length sequences, then yes, I would also > vote for the 4 line output (i.e. blank lines for the sequence and > the quality string). Same here. >> It should be possible to allow zero lines on input depending on >> where the '+' check is. > > Yes, I'm pretty sure a parser could cope with any of the zero length > sequence FASTQ examples I gave. > > Peter Should be easy to do this with bioperl as well. chris From biopython at maubp.freeserve.co.uk Thu Jul 30 21:50:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 30 Jul 2009 22:50:34 +0100 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> Message-ID: <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> On Thu, Jul 30, 2009 at 9:08 PM, Chris Fields wrote: > > I do think if it affects performance to a significant enough degree we > can do this silently, we just need to ensure this is well-documented. Agreed. 
> My opinions is this use will prove to be a edge case anyway (most will > want conversion to Sanger vs. Illumina/Solexa). Absolutely. Going from Solexa/Illumina to Sanger FASTQ will be more common (and there are no truncation issues). Going from Sanger FASTQ to Solexa or Illumina FASTQ will be rarer, and while a truncation is possible it requires very high scores (above PHRED 62) which are likely only to be possible from a consensus alignment or such like. i.e. Yes, it should be an edge case. I guess this expected usage supports the argument about issuing a warning on truncation, even with a modest performance overhead (because it only slows down the rarer expected usage). But let's get some benchmarks done to help settle this... Peter From ajmackey at gmail.com Thu Jul 30 23:52:03 2009 From: ajmackey at gmail.com (Aaron Mackey) Date: Thu, 30 Jul 2009 19:52:03 -0400 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> Message-ID: <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> I would strongly warn against truncation, for any reason. Use the formulas you have for quality-encoding conversions, but do not assume that you know more than I do about what my data contains, or that you are in any way helping me by altering my data, silently or otherwise. 
Said another way, feel free to warn me that my data may contain garbage, and utterly fail to convert it for me, but do not try to fix it for me. -Aaron On Thu, Jul 30, 2009 at 5:50 PM, Peter wrote: > On Thu, Jul 30, 2009 at 9:08 PM, Chris Fields > wrote: > > > > I do think if it affects performance to a significant enough degree we > > can do this silently, we just need to ensure this is well-documented. > > Agreed. > > > My opinions is this use will prove to be a edge case anyway (most will > > want conversion to Sanger vs. Illumina/Solexa). > > Absolutely. > > Going from Solexa/Illumina to Sanger FASTQ will be more common > (and there are no truncation issues). Going from Sanger FASTQ to > Solexa or Illumina FASTQ will be rarer, and while a truncation is > possible it requires very high scores (above PHRED 62) which are > likely only to be possible from a consensus alignment or such like. > i.e. Yes, it should be an edge case. > > I guess this expected usage supports the argument about issuing a > warning on truncation, even with a modest performance overhead > (because it only slows down the rarer expected usage). > > But let's get some benchmarks done to help settle this... 
> > Peter > _______________________________________________ > Open-Bio-l mailing list > Open-Bio-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/open-bio-l > From cjfields at illinois.edu Fri Jul 31 01:19:56 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 30 Jul 2009 20:19:56 -0500 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> Message-ID: I do tend to agree, and I don't think any savings from a performance hit will be worth the headache of having to repeatedly explain why it's (silently) doing so, when a simple warning or error message ('value X out of range for fastq format y') would suffice. chris On Jul 30, 2009, at 6:52 PM, Aaron Mackey wrote: > I would strongly warn against truncation, for any reason. Use the > formulas you have for quality-encoding conversions, but do not > assume that you know more than I do about what my data contains, or > that you are in any way helping me by altering my data, silently or > otherwise. Said another way, feel free to warn me that my data may > contain garbage, and utterly fail to convert it for me, but do not > try to fix it for me. 
> > -Aaron > > On Thu, Jul 30, 2009 at 5:50 PM, Peter > wrote: > On Thu, Jul 30, 2009 at 9:08 PM, Chris Fields > wrote: > > > > I do think if it affects performance to a significant enough > degree we > > can do this silently, we just need to ensure this is well- > documented. > > Agreed. > > > My opinions is this use will prove to be a edge case anyway (most > will > > want conversion to Sanger vs. Illumina/Solexa). > > Absolutely. > > Going from Solexa/Illumina to Sanger FASTQ will be more common > (and there are no truncation issues). Going from Sanger FASTQ to > Solexa or Illumina FASTQ will be rarer, and while a truncation is > possible it requires very high scores (above PHRED 62) which are > likely only to be possible from a consensus alignment or such like. > i.e. Yes, it should be an edge case. > > I guess this expected usage supports the argument about issuing a > warning on truncation, even with a modest performance overhead > (because it only slows down the rarer expected usage). > > But let's get some benchmarks done to help settle this... 
> > Peter > _______________________________________________ > Open-Bio-l mailing list > Open-Bio-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/open-bio-l > From pmr at ebi.ac.uk Fri Jul 31 08:16:20 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Fri, 31 Jul 2009 09:16:20 +0100 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> Message-ID: <4A72A854.8080302@ebi.ac.uk> Aaron Mackey wrote: > I would strongly warn against truncation, for any reason. Use the formulas > you have for quality-encoding conversions, but do not assume that you know > more than I do about what my data contains, or that you are in any way > helping me by altering my data, silently or otherwise. Said another way, > feel free to warn me that my data may contain garbage, and utterly fail to > convert it for me, but do not try to fix it for me. We should bear in mind what the outer limit quality scores are. A quality score of 60 means a 1 in a million chance of an error. A quality of 90 means a 1 in a billion chance of an error (or 3 in an entire mammalian genome). Quality scores below 1 (phred) or -5 (solexa) mean the base is wrong (worse than random). 
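As a quick check on those figures, a small sketch (the function names are illustrative, not from any of the three toolkits). It assumes the usual definitions: a PHRED score is -10 * log10(p) for error probability p, and a Solexa score is -10 * log10(p / (1 - p)).

```python
def phred_error_probability(q):
    """Error probability implied by a PHRED quality score q."""
    return 10 ** (-q / 10.0)


def solexa_error_probability(q):
    """Error probability implied by a Solexa quality score q.

    Solexa scores are defined on the odds p / (1 - p), so we recover
    the odds first and then convert them back to a probability.
    """
    odds = 10 ** (-q / 10.0)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    for q in (0, 40, 60, 62, 90, 93):
        print("PHRED %2d -> p = %.3g" % (q, phred_error_probability(q)))
    for q in (-5, 0, 40, 62):
        print("Solexa %2d -> p = %.3g" % (q, solexa_error_probability(q)))
```

Under these definitions PHRED 60 and 90 give the one in a million and one in a billion figures quoted above, and the proposed cap of 62 sits just beyond the one in a million mark.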
I do not believe we are losing anything biologically significant by the score limits - but we are using a tighter definition of the FASTQ format to protect other parsers from terrible errors with, for example, signed characters. On the subject of warnings ... While I am happy to issue warnings, I suggest we take some care over what happens when someone picks the wrong format and a million reads have quality scores out of range. We could, for example, report the first error and then count up so we can later (at the end or when another error occurs) say "and another 987654 up to ..." and give the latest one. regards, Peter Rice From pmr at ebi.ac.uk Fri Jul 31 08:19:05 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Fri, 31 Jul 2009 09:19:05 +0100 Subject: [Open-bio-l] FASTQ identifiers In-Reply-To: <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> Message-ID: <4A72A8F9.9020903@ebi.ac.uk> Another FASTQ topic. Should we try to understand FASTQ identifiers? There are some standard identifiers with meaningful elements that could be useful for reporting or subsetting FASTQ data. Can we agree on how to parse those and what they can be used for? What other naming conventions are in common use, e.g. for non-Solexa instruments?
regards, Peter From biopython at maubp.freeserve.co.uk Fri Jul 31 09:01:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 31 Jul 2009 10:01:57 +0100 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <4A72A854.8080302@ebi.ac.uk> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> <4A72A854.8080302@ebi.ac.uk> Message-ID: <320fb6e00907310201o7588dbfeg9e0bc5be6138c754@mail.gmail.com> On Fri, Jul 31, 2009 at 9:16 AM, Peter Rice wrote: > > On the subject or warnings ... While I am happy to issue warnings, > I suggest we take some care over what happens when someone > picks the wrong format and a million reads have quality scores out > of range. > > We could, for example, report the first error and then count up so > we can later (at the end or when another error occurs) say "and > another 987654 up to ..." and give the latest one. Yes indeed, a warning for every record which had its quality score truncated would be madness (given the number of reads you might have in a FASTQ file). One warning for the whole file would be enough. 
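A rough sketch of the "report the first, count the rest" idea, assuming a per-file tally object. The class name OutOfRangeTally and the message wording are invented for this illustration; none of it is existing toolkit code.

```python
class OutOfRangeTally(object):
    """Collect out-of-range quality scores and summarise them once.

    Hypothetical helper: remembers the first and the latest offending
    read, plus a running count, so one message can cover a whole file.
    """

    def __init__(self):
        self.first = None   # (read_id, score) of the first problem seen
        self.latest = None  # (read_id, score) of the most recent problem
        self.count = 0

    def record(self, read_id, score):
        """Note one out-of-range score without emitting anything yet."""
        if self.first is None:
            self.first = (read_id, score)
        self.latest = (read_id, score)
        self.count += 1

    def summary(self):
        """Return a single message for the file, or '' if all was fine."""
        if self.count == 0:
            return ""
        read_id, score = self.first
        message = "quality score %d out of range in read %s" % (score, read_id)
        if self.count > 1:
            message += "; and another %d up to read %s" % (
                self.count - 1, self.latest[0])
        return message
```

A converter would call record() inside its loop and print or warn with summary() once at the end of the file, so the cost in the normal case is a single comparison per score.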
Peter From biopython at maubp.freeserve.co.uk Fri Jul 31 09:15:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 31 Jul 2009 10:15:57 +0100 Subject: [Open-bio-l] FASTQ identifiers In-Reply-To: <4A72A8F9.9020903@ebi.ac.uk> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> <4A72A8F9.9020903@ebi.ac.uk> Message-ID: <320fb6e00907310215n6ed33a17id8a5ff5913c9d3b5@mail.gmail.com> On Fri, Jul 31, 2009 at 9:19 AM, Peter Rice wrote: > Another FASTQ topic. > > Should we try to understand FASTQ identifiers. I would say no (see below). Although project interoperability shouldn't stop EMBOSS from doing this if it wants to. Related to this, what about the corner case of reads with NO identifier? The FASTQ (and indeed the FASTA) formats can hold such things - just use a blank title line. In the case of next generation sequencing reads, the names themselves are not actually that important - so you can imagine a pipeline which doesn't actually bother with them at all. > There are some standard identifiers with meaningful elements > that could be useful for reporting or subsetting FASTQ data. True. > Can we agree on how to parse those and what they can be used for? The situation is similar to the FASTA format (and others), in that there are a number of reasonably well documented conventions in use (e.g. the NCBI FASTA identifiers with | characters). However, equally, there are thousands of ad hoc local conventions. In EMBOSS, you cater to a few FASTA variants where you do parse the identifier. This might address the FASTQ situation too. 
In Biopython we don't do anything clever with the FASTA identifier, nor the FASTQ identifier. As the Zen of Python says: "In the face of ambiguity, refuse the temptation to guess." In the case of wanting to parse the identifier and, say, filter on the lane number, for Biopython the user can do this themselves if they need to. > What other naming conventions are in common use e.g. for non-Solexa > instruments? Keep in mind that even for a single manufacturer's instrument, there are different versions of the pipeline, and indeed alternative pipelines. For example, I understand Sanger is using a modified pipeline on their Illumina sequencers, which may introduce their own naming. For Roche 454, their tools don't currently let you produce FASTQ directly, but this is easy to get from the FASTA and QUAL files Roche will output. This indirectly defines a Roche identifier convention. Peter C. From biopython at maubp.freeserve.co.uk Fri Jul 31 14:04:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 31 Jul 2009 15:04:41 +0100 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <4A72A854.8080302@ebi.ac.uk> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> <4A72A854.8080302@ebi.ac.uk> Message-ID: <320fb6e00907310704q4c9f8df7qe688089315304af0@mail.gmail.com> Aaron Mackey wrote: >>> I would strongly warn against truncation, for any reason.
Use the >>> formulas you have for quality-encoding conversions, but do not >>> assume that you know more than I do about what my data contains, >>> or that you are in any way helping me by altering my data, silently or >>> otherwise. Said another way, feel free to warn me that my data may >>> contain garbage, and utterly fail to convert it for me, but do not try >>> to fix it for me. http://lists.open-bio.org/pipermail/open-bio-l/2009-July/000520.html Earlier I wrote: >>>> If in the process of converting between formats, a quality score >>>> is too high (it would result in ASCII 127 or higher), then I would >>>> argue any of the following would be acceptable: >>>> (a) Silently impose the maximum score (ASCII 126, 0x7e) >>>> (b) Impose the maximum score with a warning >>>> (c) Raise an error >>>> >>>> I don't think EMBOSS, BioPerl and Biopython have to handle >>>> this exactly the same way, but I would favour (b) then (a). Aaron, are you saying you support raising an error (option c), or truncation with a warning (option b), but are against a silent score truncation (option a)? The problem with just raising an error (option c) is it prevents a valid operation (conversion with truncation). Peter Rice wrote: >> We should bear in mind what the outer limit quality scores are. A quality >> score of 60 means a 1 in a million chance of an error. A quality of 90 means >> a 1 in a billion chance of an error (or 3 in an entire mammalian genome). >> Quality scores below 1 (phred) or -5 (solexa) mean the base is wrong >> (worse than random). >> >> I do not believe we are losing anything biologically significant by the >> score limits... Good point. On Fri, Jul 31, 2009 at 2:19 AM, Chris Fields wrote: > > I do tend to agree, and I don't think any savings from a performance hit > will be worth the headache of having to repeatedly explain why it's > (silently) doing so, when a simple warning or error message ('value X out of > range for fastq format y') would suffice.
That's a shift from your earlier stance: > I do think if it affects performance to a significant enough degree we > can do this silently, we just need to ensure this is well-documented. Still, I guess it boils down to how big a penalty the warnings would impose on typical conversions. And for Biopython, it looks like the answer is not much. I've updated Biopython to issue warnings on writing FASTQ files if the quality score had to be truncated to fit the given encoding, i.e. if you had a PHRED quality above 93 for "fastq-sanger", or above 62 for "fastq-illumina", or a Solexa quality above 62 for "fastq-solexa". As implemented there is a speed penalty, but *only* for these fringe cases. Peter
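For readers who want to see option (b) spelled out, here is a minimal sketch of truncation with a single warning per quality string. This is not the Biopython implementation: encode_qualities and FASTQ_VARIANTS are hypothetical names, the offsets and limits are the ones proposed earlier in this thread, and the scores are assumed to already be on the right scale (PHRED or Solexa) for the chosen variant.

```python
import warnings

# variant name -> (ASCII offset, minimum score, maximum score),
# following the ranges proposed in this thread.
FASTQ_VARIANTS = {
    "fastq-sanger": (33, 0, 93),
    "fastq-illumina": (64, 0, 62),
    "fastq-solexa": (64, -5, 62),
}


def encode_qualities(scores, variant):
    """Encode integer scores as a FASTQ quality string (option b).

    Scores above the variant's maximum are capped, and one warning is
    issued per call rather than one per score. Scores below the minimum
    raise ValueError, since they cannot come from a valid conversion.
    """
    offset, low, high = FASTQ_VARIANTS[variant]
    truncated = 0
    chars = []
    for score in scores:
        if score < low:
            raise ValueError("score %d below minimum %d for %s"
                             % (score, low, variant))
        if score > high:
            # Only this rare branch pays for the bookkeeping.
            truncated += 1
            score = high
        chars.append(chr(score + offset))
    if truncated:
        warnings.warn("%d quality score(s) above %d truncated for %s"
                      % (truncated, high, variant))
    return "".join(chars)
```

For example, encode_qualities([0, 40, 70], "fastq-illumina") returns "@h~" and issues one warning, while in-range input takes the fast path with no warning at all.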