From charles-listes+open-bio at plessy.org Sat Aug 1 21:25:37 2009
From: charles-listes+open-bio at plessy.org (Charles Plessy)
Date: Sun, 2 Aug 2009 10:25:37 +0900
Subject: [Open-bio-l] FASTQ identifiers
In-Reply-To: <320fb6e00907310215n6ed33a17id8a5ff5913c9d3b5@mail.gmail.com>
References: <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com>
 <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com>
 <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com>
 <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com>
 <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com>
 <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com>
 <4A72A8F9.9020903@ebi.ac.uk>
 <320fb6e00907310215n6ed33a17id8a5ff5913c9d3b5@mail.gmail.com>
Message-ID: <20090802012537.GD2479@kunpuu.plessy.org>

On Fri, Jul 31, 2009 at 10:15:57AM +0100, Peter wrote:
> The situation is similar to the FASTA format (and others), in that there
> are a number of reasonably well documented conventions in use
> (e.g. the NCBI FASTA identifiers with | characters). However, equally,
> there are thousands of ad hoc local conventions.

Hello,

I would just like to mention one such ad hoc convention in use at my
workplace: with FASTQ sequences we sometimes replace the original name
with the sequence itself. This can be useful, for instance, to
troubleshoot some sequence manipulations.

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
;;3;;;;;;;;;;;;7;;;;;;;88

becomes:

@CCCTTCTTGTCTTCAGCGTTTCTCC
CCCTTCTTGTCTTCAGCGTTTCTCC
+CCCTTCTTGTCTTCAGCGTTTCTCC
;;3;;;;;;;;;;;;7;;;;;;;88

and after some arbitrary trimming at the ends:

@CCCTTCTTGTCTTCAGCGTTTCTCC
TTCTTGTCTTCAGCGTTTCT
+CCCTTCTTGTCTTCAGCGTTTCTCC
;;;;;;;;;;;;7;;;;;;;

With FASTA format, we sometimes eliminate redundant sequences and record
how many times they occurred by adding the count to the name.
For instance:

>seq1
AAATTT
>seq2
AAATAT
>seq3
AAATTT

becomes:

>AAATTT_2
AAATTT
>AAATAT_1
AAATAT

If this is popular elsewhere, it may be useful to implement functions
that allow doing this efficiently.

Have a nice day,

--
Charles Plessy
Tsurumi, Kanagawa, Japan

From biopython at maubp.freeserve.co.uk Mon Aug 3 05:30:09 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 3 Aug 2009 10:30:09 +0100
Subject: [Open-bio-l] FASTQ identifiers
In-Reply-To: <20090802012537.GD2479@kunpuu.plessy.org>
References: <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com>
 <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com>
 <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com>
 <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com>
 <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com>
 <4A72A8F9.9020903@ebi.ac.uk>
 <320fb6e00907310215n6ed33a17id8a5ff5913c9d3b5@mail.gmail.com>
 <20090802012537.GD2479@kunpuu.plessy.org>
Message-ID: <320fb6e00908030230x52bf32a8o3b640ce8d0a76b8@mail.gmail.com>

On Sun, Aug 2, 2009 at 2:25 AM, Charles Plessy wrote:
> On Fri, Jul 31, 2009 at 10:15:57AM +0100, Peter wrote:
>> The situation is similar to the FASTA format (and others), in that there
>> are a number of reasonably well documented conventions in use
>> (e.g. the NCBI FASTA identifiers with | characters). However, equally,
>> there are thousands of ad hoc local conventions.
>
> Hello,
>
> I would just like to mention one such ad hoc convention in use at my
> workplace: with FASTQ sequences we sometimes replace the original
> name with the sequence itself. This can be useful, for instance, to
> troubleshoot some sequence manipulations.
>
> @EAS54_6_R1_2_1_413_324
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +EAS54_6_R1_2_1_413_324
> ;;3;;;;;;;;;;;;7;;;;;;;88
>
> becomes:
>
> @CCCTTCTTGTCTTCAGCGTTTCTCC
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +CCCTTCTTGTCTTCAGCGTTTCTCC
> ;;3;;;;;;;;;;;;7;;;;;;;88
>

That certainly demonstrates we can't make any big assumptions about the
title line formatting ;)

Your example is interesting - but I don't quite understand why you do
this. Surely any debug message or output file for bad reads would
(normally) have a unique read ID which (indirectly) tells you the read
sequence? If you are writing the code which gives these error messages,
can't you explicitly give the read sequence? Is the aim to be able to
look at error messages from third party tools (which just give the read
name) and see the read sequence directly (without looking up the read
name in the original FASTQ file)?

This is similar in some ways to my comment that I could see a real use
for FASTQ (and FASTA) files with no record identifiers:

>> Related to this, what about the corner case of reads with NO
>> identifier? The FASTQ (and indeed the FASTA) formats can
>> hold such things - just use a blank title line. In the case of
>> next generation sequencing reads, the names themselves
>> are not actually that important - so you can imagine a pipeline
>> which doesn't actually bother with them at all.

In your pipeline you clearly don't care about the original FASTQ
identifiers, and (if the pipeline would accept it), using blank title
lines might also work (and would certainly save disk space).
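
For anyone who wants to try the two renaming conventions Charles
describes in this thread, a minimal sketch in plain Python follows. The
function names and the tuple layouts are assumptions made for
illustration only; they are not taken from Biopython, BioPerl or EMBOSS.

```python
from collections import Counter


def rename_fastq_by_sequence(records):
    """Replace each FASTQ title with the read sequence itself.

    `records` is an iterable of (title, sequence, quality) tuples;
    this layout is an assumption made for the sketch.
    """
    for _title, seq, qual in records:
        yield seq, seq, qual


def collapse_fasta(records):
    """Merge identical FASTA sequences, naming each '<sequence>_<count>'.

    `records` is an iterable of (name, sequence) tuples. Output order
    follows the first occurrence of each distinct sequence.
    """
    counts = Counter(seq for _name, seq in records)
    return [(f"{seq}_{n}", seq) for seq, n in counts.items()]


fasta = [("seq1", "AAATTT"), ("seq2", "AAATAT"), ("seq3", "AAATTT")]
for name, seq in collapse_fasta(fasta):
    print(f">{name}\n{seq}")
```

Renaming before any trimming step keeps the untrimmed read visible in
the title, which is the troubleshooting use described above.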

Peter

From cjfields at illinois.edu Wed Aug 5 11:12:18 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 5 Aug 2009 10:12:18 -0500
Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS
In-Reply-To: <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com>
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
 <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com>
 <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com>
 <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu>
 <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com>
 <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com>
 <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com>
Message-ID: <3C298ABC-07CD-4597-BA83-F8F5992BF73A@illinois.edu>

On Jul 29, 2009, at 5:15 AM, Peter wrote:

> Hi all,
>
> This is a follow up to the earlier discussion about high quality scores
> in Solexa or Illumina 1.3+ FASTQ files and the problem of non printable
> ASCII codes (which can occur if converting from Sanger FASTQ).
>
>> On Sat, Jul 25, 2009 at 8:50 PM, Chris Fields wrote:
>>>
>>>> Now, here comes the problem. I believe FASTQ files directly
>>>> from an Illumina 1.3+ pipeline will have PHRED scores in the
>>>> range 0 to 40 (as in this example). However, much higher
>>>> PHRED scores are possible during assembly / contig'ing
>>>> and read mapping. For example, the tool MAQ will output
>>>> Sanger style FASTQ files with PHRED scores in the range
>>>> 0 to 93 inclusive.
>>>
>>> We can support it as Illumina 1.3, but my point is this may be
>>> getting into a grey area and may be something that Illumina
>>> doesn't/wouldn't support. Reminds me a little of the multiple
>>> GFF2 variations (one of the main reasons for a GFF3).
>>
>> I agree this is a grey area (high scores in Solexa/Illumina
>> FASTQ files).
>>
>> ...
>>
>> i.e.
>> An Illumina FASTQ format file can hold PHRED scores in the
>> range 0 to 62 without using problem characters. And likewise
>> for a Solexa FASTQ file (Solexa scores up to 62).
>
> Peter Rice and I have been talking about this off list, and have
> a proposal for the high score problem. Basically we want to
> restrict FASTQ quality strings to printable ASCII, which means
> 126 (0x7e) is a firm upper limit, while otherwise allowing for as
> high scores as possible. This limit comes from ASCII 127 being
> "delete", and the even higher characters also being non-printable.
>
> i.e. We are suggesting:
>
> "fastq-sanger" - Allows PHRED scores 0 to 93 inclusive, mapped
> with an ASCII offset of 33 to ASCII characters 33 to 126 (or in hex,
> 0x21 to 0x7e). This is as defined on the MAQ web pages.
>
> "fastq-illumina" - Believed to use at least PHRED scores 0 to 40,
> mapped with an ASCII offset of 64 to ASCII characters 64 to 104
> (or in hex, 0x40 to 0x68). It is a reasonable and well defined
> extension to permit PHRED scores from 0 to 62 inclusive, which
> map to ASCII 64 to 126 (or in hex 0x40 to 0x7e). This avoids the
> non printing characters, and gives some head room for improved
> sequencing technology from Illumina giving higher raw scores.
>
> "fastq-solexa" - Believed to use Solexa scores from -5 to at least
> 40, again mapped with an ASCII offset of 64 giving ASCII characters
> 59 to 104 (or in hex, 0x3b to 0x68). Again, a reasonable and well
> defined extension would permit Solexa scores in the range -5 to 62
> inclusive, using ASCII 59 to 126 (or in hex, 0x3b to 0x7e).
>
> [Peter R.
> - please correct me if any of the above is not what you had
> in mind]
>
> If in the process of converting between formats, a quality score
> is too high (it would result in ASCII 127 or higher), then I would
> argue any of the following would be acceptable:
> (a) Silently impose the maximum score (ASCII 126, 0x7e)
> (b) Impose the maximum score with a warning
> (c) Raise an error
>
> I don't think EMBOSS, BioPerl and Biopython have to handle
> this exactly the same way, but I would favour (b) then (a).
>
> Peter

I think, based on Aaron's comments, with bioperl we'll adopt (b) to
deal with format validation, but try to do it in a way that 'caches'
bad data so it doesn't report a warning on every out-of-range value. I
am planning on a Moose-based parser at some point that will do the same.

chris

From biopython at maubp.freeserve.co.uk Wed Aug 5 12:01:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 5 Aug 2009 17:01:32 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
Message-ID: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>

Hi all,

Another FASTQ issue to debate: Should we care what case the sequence
strings are? I've never seen anything written down, but all the
examples I recall used upper case. But there is nothing to stop people
using mixed case, is there?

With FASTA on the other hand, while all uppercase is most common,
mixed case has its uses (e.g. representing trimmed regions, or low
quality scores).

I would suggest that OBF tools all treat the sequence in FASTQ files
as is, and preserve the case on output.

Any thoughts?

Peter

From dan.bolser at gmail.com Wed Aug 5 12:50:56 2009
From: dan.bolser at gmail.com (Dan Bolser)
Date: Wed, 5 Aug 2009 17:50:56 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
Message-ID: <2c8757af0908050950y97863fcj2b4deda1b8bb37c8@mail.gmail.com>

2009/8/5 Peter:
> Hi all,
>
> Another FASTQ issue to debate: Should we care what case the sequence
> strings are? I've never seen anything written down, but all the
> examples I recall used upper case. But there is nothing to stop people
> using mixed case, is there?
>
> With FASTA on the other hand, while all uppercase is most common,
> mixed case has its uses (e.g. representing trimmed regions, or low
> quality scores).
>
> I would suggest that OBF tools all treat the sequence in FASTQ files
> as is, and preserve the case on output.
>
> Any thoughts?

Agree.

> Peter

From biopython at maubp.freeserve.co.uk Thu Aug 6 04:17:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 6 Aug 2009 09:17:07 +0100
Subject: [Open-bio-l] Naming for FASTQ example files
Message-ID: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com>

Hi all,

I am planning on compiling a set of FASTQ files, for use by Biopython,
BioPerl, EMBOSS and anyone else that wants to test a parser. Modest
size contributions will be welcome (no big files though).

I will have two types of files: valid ones, and invalid ones. The
basic idea is any parser should understand what we consider to be
valid files (we may need to provide matching FASTA and QUAL files or
something like this for verification), but also reject all the files
we consider to be invalid.

Regarding names, does "error_*.fastq" or "invalid_*.fastq" sound fine?

Any preference for meaningful names ("error_qual_short.fastq",
"error_qual_bad_char.fastq", ...) versus numbers ("error_001.fastq",
"error_002.fastq", ...)?
Either way I think a README file would need to accompany the dataset
stating what we think makes each example invalid (e.g. quality string
shorter than sequence, invalid character in quality string, ...).

Peter

From biopython at maubp.freeserve.co.uk Sat Aug 8 08:53:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 8 Aug 2009 13:53:17 +0100
Subject: [Open-bio-l] Naming for FASTQ example files
In-Reply-To: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com>
References: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com>
Message-ID: <320fb6e00908080553h115ab748jecc820d14cb5e524@mail.gmail.com>

On Thu, Aug 6, 2009 at 9:17 AM, Peter wrote:
> Hi all,
>
> I am planning on compiling a set of FASTQ files, for use by
> Biopython, BioPerl, EMBOSS and anyone else that wants to test a
> parser. Modest size contributions will be welcome (no big files
> though).
>
> I will have two types of files: valid ones, and invalid ones. The
> basic idea is any parser should understand what we consider to be
> valid files (we may need to provide matching FASTA and QUAL files or
> something like this for verification), but also reject all the files
> we consider to be invalid.
>
> Regarding names, does "error_*.fastq" or "invalid_*.fastq" sound fine?
>
> Any preference for meaningful names ("error_qual_short.fastq",
> "error_qual_bad_char.fastq", ...) versus numbers ("error_001.fastq",
> "error_002.fastq", ...)? Either way I think a README file would need
> to accompany the dataset stating what we think makes each example
> invalid (e.g. quality string shorter than sequence, invalid character
> in quality string, ...).

I've gone for "error_*.fastq" and have tried to use meaningful names
rather than numbers.
Currently these files are only in the Biopython repository (under
biopython/Tests/Quality), but could be added to the (currently) unused
Biodata repository - although that is still on CVS:
http://lists.open-bio.org/pipermail/open-bio-l/2009-January/000511.html

As these examples are all small and we don't expect to change them, I
could also just email them (off the mailing list) to EMBOSS/BioPerl
people directly on request.

Currently my error examples are as follows, broken down into groups.

Quality strings with invalid ASCII characters (not the full set, but we
could do that):

error_qual_null.fastq
error_qual_vtab.fastq
error_qual_tab.fastq
error_qual_escape.fastq
error_qual_unit_sep.fastq
error_qual_space.fastq
error_qual_del.fastq

Misc errors:

error_diff_ids.fastq
error_spaces.fastq
error_tabs.fastq
error_short_qual.fastq
error_long_qual.fastq
error_no_qual.fastq

Simulated truncation part way through a file:

error_trunc_at_plus.fastq
error_trunc_at_qual.fastq
error_trunc_at_seq.fastq

Note they are all based on the same example file which, due to the
quality characters, can be interpreted as any of the three FASTQ
variants we're supporting (Sanger, Solexa, Illumina 1.3+). This was
deliberate. Additional examples of files which could be Sanger or
Solexa but not Illumina 1.3+ (or valid Sanger but can't be Solexa or
Illumina 1.3+) are also a good idea.

Note that in many of these examples the error is part way into the
file, so there are initially some valid reads and then an error.

Peter

From biopython at maubp.freeserve.co.uk Sat Aug 8 14:56:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 8 Aug 2009 19:56:32 +0100
Subject: [Open-bio-l] White space in FASTQ files?
Message-ID: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com>

Hi all,

Other than the special case of new lines which we have already covered
(allowed but line wrapping is discouraged), should FASTQ sequence
lines (and indeed the quality lines) ever be allowed to include white
space (e.g. spaces and tabs)? I've never seen this in a real FASTQ
file, and would like to suggest this be considered an error.

Comments? Counter suggestions?

Peter

From pmr at ebi.ac.uk Mon Aug 10 09:02:47 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Mon, 10 Aug 2009 14:02:47 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
Message-ID: <4A801A77.20704@ebi.ac.uk>

Peter C. wrote:
> I would suggest that OBF tools all treat the sequence in FASTQ files
> as is, and preserve the case on output.
>
> Any thoughts?

EMBOSS does that with all sequence formats. The case of the original
sequence is preserved and reproduced on output. We have not specified
upper or lower case only for any of our current output formats.

We provide command line options to force sequences to be converted to
upper or lower case if the user wants to specify one or the other -
usually just to convert sequences for post processing by some other tool.

regards,

Peter Rice

From biopython at maubp.freeserve.co.uk Mon Aug 10 09:06:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 10 Aug 2009 14:06:23 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <4A801A77.20704@ebi.ac.uk>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
 <4A801A77.20704@ebi.ac.uk>
Message-ID: <320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com>

On Mon, Aug 10, 2009 at 2:02 PM, Peter Rice wrote:
> Peter C.
> wrote:
>
>> I would suggest that OBF tools all treat the sequence in FASTQ files
>> as is, and preserve the case on output.
>>
>> Any thoughts?
>
> EMBOSS does that with all sequence formats. The case of the original
> sequence is preserved and reproduced on output. We have not specified
> upper or lower case only for any of our current output formats.
>
> We provide command line options to force sequences to be converted to
> upper or lower case if the user wants to specify one or the other -
> usually just to convert sequences for post processing by some other tool.

Cool. It looks like we are on the same wavelength here :)

Peter

From pmr at ebi.ac.uk Mon Aug 10 09:09:14 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Mon, 10 Aug 2009 14:09:14 +0100
Subject: [Open-bio-l] White space in FASTQ files?
In-Reply-To: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com>
References: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com>
Message-ID: <4A801BFA.2040208@ebi.ac.uk>

Peter C. wrote:
> Other than the special case of new lines which we have already covered
> (allowed but line wrapping is discouraged), should FASTQ sequence
> lines (and indeed the quality lines) ever be allowed to include white
> space (e.g. spaces and tabs)? I've never seen this in a real FASTQ
> file, and would like to suggest this be considered an error.
>
> Comments? Counter suggestions?

I am happy adding a warning message in EMBOSS for this.

If we add too many warning messages then we could break our plan to
issue one message and follow with "and another 999999 up to ..." if we
find ourselves issuing more than one warning per sequence.

regards,

Peter Rice

From biopython at maubp.freeserve.co.uk Mon Aug 10 09:36:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 10 Aug 2009 14:36:26 +0100
Subject: [Open-bio-l] White space in FASTQ files?
In-Reply-To: <4A801BFA.2040208@ebi.ac.uk>
References: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com>
 <4A801BFA.2040208@ebi.ac.uk>
Message-ID: <320fb6e00908100636r3e95b505x1fad838c566c973d@mail.gmail.com>

On Mon, Aug 10, 2009 at 2:09 PM, Peter Rice wrote:
> Peter C. wrote:
>> Other than the special case of new lines which we have already covered
>> (allowed but line wrapping is discouraged), should FASTQ sequence
>> lines (and indeed the quality lines) ever be allowed to include white
>> space (e.g. spaces and tabs)? I've never seen this in a real FASTQ
>> file, and would like to suggest this be considered an error.
>>
>> Comments? Counter suggestions?
>
> I am happy adding a warning message in EMBOSS for this.
>

So you are thinking you'll try and cope with white space, and issue a
warning? This sounds dangerous to me. One of the properties of a FASTQ
file is the sequence string and the quality string should be the same
length (after removing the line wrapping). Allowing whitespace in these
strings makes that ambiguous. What if the sequence has white space but
not the quality? What if they both have white space but in different
positions?

Just calling any whitespace (other than the new line characters) an
error seems much safer. If there are any real files which do this, we
can revisit this.

Peter

From cjfields at illinois.edu Tue Aug 11 19:32:00 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 11 Aug 2009 18:32:00 -0500
Subject: [Open-bio-l] White space in FASTQ files?
In-Reply-To: <4A801BFA.2040208@ebi.ac.uk>
References: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com>
 <4A801BFA.2040208@ebi.ac.uk>
Message-ID:

On Aug 10, 2009, at 8:09 AM, Peter Rice wrote:

> Peter C. wrote:
>> Other than the special case of new lines which we have already covered
>> (allowed but line wrapping is discouraged), should FASTQ sequence
>> lines (and indeed the quality lines) ever be allowed to include white
>> space (e.g. spaces and tabs)?
>> I've never seen this in a real FASTQ
>> file, and would like to suggest this be considered an error.
>>
>> Comments? Counter suggestions?
>
> I am happy adding a warning message in EMBOSS for this.
>
> If we add too many warning messages then we could break our plan to
> issue one message and follow with "and another 999999 up to ..." if we
> find ourselves issuing more than one warning per sequence.
>
> regards,
>
> Peter Rice

This is quite similar to the 'qual range out-of-bounds for this FASTQ
variant' warning we discussed earlier. We could essentially merge these
to be one and the same.

chris

From cjfields at illinois.edu Tue Aug 11 19:32:10 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 11 Aug 2009 18:32:10 -0500
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
 <4A801A77.20704@ebi.ac.uk>
 <320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com>
Message-ID: <0D818138-133D-4937-9B24-8A07CD62C170@illinois.edu>

On Aug 10, 2009, at 8:06 AM, Peter wrote:

> On Mon, Aug 10, 2009 at 2:02 PM, Peter Rice wrote:
>> Peter C. wrote:
>>
>>> I would suggest that OBF tools all treat the sequence in FASTQ files
>>> as is, and preserve the case on output.
>>>
>>> Any thoughts?
>>
>> EMBOSS does that with all sequence formats. The case of the original
>> sequence is preserved and reproduced on output. We have not specified
>> upper or lower case only for any of our current output formats.
>>
>> We provide command line options to force sequences to be converted to
>> upper or lower case if the user wants to specify one or the other -
>> usually just to convert sequences for post processing by some other
>> tool.
>
> Cool. It looks like we are on the same wavelength here :)
>
> Peter

I believe so (sorry about lack of responsiveness, just got back in town).

chris

From biopython at maubp.freeserve.co.uk Wed Aug 12 06:23:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 12 Aug 2009 11:23:52 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <0D818138-133D-4937-9B24-8A07CD62C170@illinois.edu>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
 <4A801A77.20704@ebi.ac.uk>
 <320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com>
 <0D818138-133D-4937-9B24-8A07CD62C170@illinois.edu>
Message-ID: <320fb6e00908120323q39e1b3e9x1ff6b56203149943@mail.gmail.com>

On Wed, Aug 12, 2009 at 12:32 AM, Chris Fields wrote:
>
> On Aug 10, 2009, at 8:06 AM, Peter wrote:
>
>> On Mon, Aug 10, 2009 at 2:02 PM, Peter Rice wrote:
>>>
>>> Peter C. wrote:
>>>
>>>> I would suggest that OBF tools all treat the sequence in FASTQ files
>>>> as is, and preserve the case on output.
>>>>
>>>> Any thoughts?
>>>
>>> EMBOSS does that with all sequence formats. The case of the original
>>> sequence is preserved and reproduced on output. We have not specified
>>> upper or lower case only for any of our current output formats.
>>>
>>> We provide command line options to force sequences to be converted to
>>> upper or lower case if the user wants to specify one or the other -
>>> usually just to convert sequences for post processing by some other tool.
>>
>> Cool. It looks like we are on the same wavelength here :)
>>
>> Peter
>
> I believe so (sorry about lack of responsiveness, just got back in town).
>
> chris

Great - I've added some unit test code in Biopython to confirm we leave
the sequence case as-is when loading and saving FASTQ files.
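
A check along those lines can be sketched without any library at all.
The following is only an illustration of the two rules agreed in this
thread (preserve sequence case; treat spaces and tabs in sequence or
quality lines as an error), not the Biopython test itself. It assumes
unwrapped four-line records, and the names parse_fastq and format_fastq
are hypothetical.

```python
def parse_fastq(lines):
    """Yield (title, sequence, quality) from unwrapped four-line records."""
    it = iter(lines)
    for title_line in it:
        title_line = title_line.rstrip("\r\n")
        if not title_line.startswith("@"):
            raise ValueError("title line must start with '@'")
        try:
            seq = next(it).rstrip("\r\n")
            plus = next(it).rstrip("\r\n")
            qual = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError("file truncated part way through a record") from None
        if not plus.startswith("+"):
            raise ValueError("third line of a record must start with '+'")
        if any(ch in " \t" for ch in seq + qual):
            raise ValueError("white space in sequence or quality line")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # The sequence is passed through untouched, so its case survives.
        yield title_line[1:], seq, qual


def format_fastq(records):
    """Write records as four lines each, without repeating the title."""
    return "".join(f"@{t}\n{s}\n+\n{q}\n" for t, s, q in records)


text = "@read1\nacgtNNacGT\n+\n" + ";" * 10 + "\n"
records = list(parse_fastq(text.splitlines(True)))
print(format_fastq(records) == text)
```

Rejecting white space outright keeps the "same length" check
unambiguous, which is the argument made for treating it as an error.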

Peter

From biopython at maubp.freeserve.co.uk Mon Aug 24 10:18:20 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 24 Aug 2009 15:18:20 +0100
Subject: [Open-bio-l] [Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS
In-Reply-To:
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
 <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com>
 <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com>
 <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu>
 <320fb6e00907270451i3d40b4ffq607360cfcb6f6282@mail.gmail.com>
Message-ID: <320fb6e00908240718q194afe78j4a05b31aeb33e313@mail.gmail.com>

On Mon, Jul 27, 2009 at 2:06 PM, Chris Fields wrote:
>
> I added this (and the others) to our ticket tracking this. Looks like
> solexa conversion either way is borked, which is very likely an issue
> with conversion.

Hi Chris,

I've been digging into the current SVN code for BioPerl's FASTQ support -
I realised you are doing the Solexa to PHRED mapping twice when parsing
"fastq-solexa" files. Using "qual" output (which shows the PHRED scores
in plain text) makes it very clear something is wrong:

$ cat solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;

That is Solexa scores from 40 (h) down to -5 (;), which should map onto
PHRED scores from 40 down to 1 (according to our prior discussions).
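
For readers following along, the expected mapping can be reproduced with
a few lines of Python. This is a sketch that assumes the usual
Solexa-to-PHRED relation, Q_phred = 10 * log10(10^(Q_solexa / 10) + 1),
with rounding to the nearest integer; the helper names are made up for
illustration and are not any project's API.

```python
import math

SOLEXA_OFFSET = 64  # ASCII offset used by Solexa FASTQ quality strings


def decode_solexa(quality_string):
    """Turn a Solexa quality string into a list of integer Solexa scores."""
    return [ord(ch) - SOLEXA_OFFSET for ch in quality_string]


def solexa_to_phred(solexa_score):
    """Convert one Solexa score to a PHRED score, returned as a float."""
    return 10 * math.log10(10 ** (solexa_score / 10) + 1)


# Rebuild the quality string of the example read: Solexa 40 ('h') to -5 (';').
quality = "".join(chr(SOLEXA_OFFSET + q) for q in range(40, -6, -1))
solexa_scores = decode_solexa(quality)
phred_scores = [round(solexa_to_phred(q)) for q in solexa_scores]
print(" ".join(str(q) for q in phred_scores))
```

Applied once, the conversion runs from 40 down to 1. Note that Solexa 9
and Solexa 10 both round to PHRED 10, so rounding to integers already
discards some detail at the low-quality end.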

$ ./bioperl_solexa2qual.pl < solexa_faked.fastq
>slxa_0001_1_0001_01
40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 10 10 9 8 7 6 6 5 5 5 5 4 4 4 4

For reference,

$ python biopython_solexa2qual.py < solexa_faked.fastq
>slxa_0001_1_0001_01
40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1 1

I can "fix" this in fastq.pm by commenting out one of the log mappings,
for example see the patch I've just uploaded to Bug 2857:
http://bugzilla.open-bio.org/show_bug.cgi?id=2857

That brings me to another problem, consider the following (with the
double conversion fixed):

$ ./bioperl_solexa2solexa.pl < solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJJHGFEDDBB@@>><<

If you compare that to the original, you'll notice a loss of detail in
the poor quality reads. e.g. Solexa scores 9 (I) and 10 (J) have both
been mapped onto 10 (J).

I believe this happens because BioPerl is converting the Solexa scores
to PHRED scores on loading (which is fine - EMBOSS does this too), but
you are also storing them as integers! In order to preserve these
details, I think you'll have to hold the converted PHRED scores as
floating point numbers (which I think is what EMBOSS does). This has
the downside of taking more memory, and may also complicate file output
(you may need to round things).

Regards,

Peter (@Biopython)

From biopython at maubp.freeserve.co.uk Tue Aug 25 07:24:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 25 Aug 2009 12:24:27 +0100
Subject: [Open-bio-l] More FASTQ examples for cross project testing
Message-ID: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>

Hi all,

I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
off list about this plan.
I'm going to co-ordinate putting together a set of valid FASTQ files
for shared testing (to supplement the existing set of invalid FASTQ
files already done and being used in Biopython and BioPerl's unit
tests - and hopefully with EMBOSS soon).

What I have in mind is:

XXX_original_YYY.fastq - sample input
XXX_as_sanger.fastq - reference output
XXX_as_solexa.fastq - reference output
XXX_as_illumina.fastq - reference output

where XXX is some name (e.g. wrapped1, wrapped2, shortreads,
longreads, sanger_full_range, solexa_full_range ...) and YYY is the
FASTQ variant (sanger, solexa or illumina) for the "input" file.

For example, we might have:

wrapped1_original_sanger.fastq - A Sanger FASTQ using line wrapping,
perhaps repeating the title on the plus lines
wrapped1_as_sanger.fastq - The same data but using the consensus of no
line wrapping and omitting the repeated title on the plus lines.
wrapped1_as_solexa.fastq - As above, but converted to Solexa scores
(ASCII offset 64), with capping at Solexa 62 (ASCII 126).
wrapped1_as_illumina.fastq - As above, but converted to Illumina ASCII
offset 64, with capping at PHRED 62 (ASCII 126).

Here "wrapped1" would be a Sanger FASTQ file with some line wrapping
(e.g. at 60 characters). I will include "sanger_full_range" which
would cover all the valid PHRED scores from 0 to 93, and similarly for
Solexa and Illumina files - these are important for testing the score
conversions. I have some ideas for deliberately tricky (but valid)
files which should properly test any parser.

The point is we have "perhaps odd but valid" originals, plus the
"cleaned up" versions (using the same FASTQ variant), and "cleaned up"
versions in the other two FASTQ variants.

Ideally asking Biopython/BioPerl/EMBOSS to convert the
XXX_original_YYY.fastq files into any of the three FASTQ variants will
give exactly the same as the reference outputs.

If anyone has any comments or suggestions please speak up (e.g. my
suggested naming conventions).
Real life examples of FASTQ files anyone has had trouble parsing (even
with 3rd party tools) would be particularly useful - although we'd
probably want to cut down big example files in order to keep the
dataset to a reasonable size.

Thanks,

Peter

From heuermh at acm.org Tue Aug 25 22:56:20 2009
From: heuermh at acm.org (Michael Heuer)
Date: Tue, 25 Aug 2009 22:56:20 -0400 (EDT)
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
Message-ID:

Peter wrote:

> Hi all,
>
> I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
> off list about this plan. I'm going to co-ordinate putting together a
> set of valid FASTQ files for shared testing (to supplement the
> existing set of invalid FASTQ files already done and being used in
> Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).
>
> What I have in mind is:
>
> XXX_original_YYY.fastq - sample input
> XXX_as_sanger.fastq - reference output
> XXX_as_solexa.fastq - reference output
> XXX_as_illumina.fastq - reference output
>
> where XXX is some name (e.g. wrapped1, wrapped2, shortreads,
> longreads, sanger_full_range, solexa_full_range ...) and YYY is the
> FASTQ variant (sanger, solexa or illumina) for the "input" file.
>
> For example, we might have:
>
> wrapped1_original_sanger.fastq - A Sanger FASTQ using line wrapping,
> perhaps repeating the title on the plus lines
> wrapped1_as_sanger.fastq - The same data but using the consensus of no
> line wrapping and omitting the repeated title on the plus lines.
> wrapped1_as_solexa.fastq - As above, but converted to Solexa scores
> (ASCII offset 64), with capping at Solexa 62 (ASCII 126).
> wrapped1_as_illumina.fastq - As above, but converted to Illumina ASCII
> offset 64, with capping at PHRED 62 (ASCII 126).
>
> Here "wrapped1" would be a Sanger FASTQ file with some line wrapping
> (e.g. at 60 characters).
I will include "sanger_full_range" which > would cover all the valid PHRED scores from 0 to 93, and similarly for > Solexa and Illumina files - these are important for testing the score > conversions. I have some ideas for deliberately tricky (but valid) > files which should properly test any parser. > > The point is we have "perhaps odd but valid" originals, plus the > "cleaned up" versions (using the same FASTQ variant), and "cleaned up" > versions in the other two FASTQ variants. > > Ideally asking Biopython/BioPerl/EMBOSS to convert the > XXX_original_YYY.fastq files into any of the three FASTQ variants will > give exactly the same as the reference outputs. > > If anyone has any comments or suggestions please speak up (e.g. my > suggested naming conventions). Very cool idea, Peter, and Peter, and Chris. I don't believe anyone from biojava has spoken up on this thread yet, so I thought I should add that we are working towards a compatible implementation as well. michael From biopython at maubp.freeserve.co.uk Wed Aug 26 06:06:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 26 Aug 2009 11:06:39 +0100 Subject: [Open-bio-l] More FASTQ examples for cross project testing In-Reply-To: References: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com> Message-ID: <320fb6e00908260306o37eb3d80q53fc50dce708d0b@mail.gmail.com> On Wed, Aug 26, 2009 at 3:56 AM, Michael Heuer wrote: > Peter wrote: > >> Hi all, >> >> I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl) >> off list about this plan. I'm going to co-ordinate putting together a >> set of valid FASTQ files for shared testing (to supplement the >> existing set of invalid FASTQ files already done and being used in >> Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon). >> ... > > Very cool idea, Peter, and Peter, and Chris. 
I don't believe anyone from > biojava has spoken up on this thread yet, so I thought I should add that > we are working towards a compatible implementation as well. > > michael Hi Michael - we asked the BioJava guys a while back, and at the time there was interest but no volunteers. Who is working on this now? Peter From biopython at maubp.freeserve.co.uk Wed Aug 26 18:04:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 26 Aug 2009 23:04:18 +0100 Subject: [Open-bio-l] More FASTQ examples for cross project testing In-Reply-To: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com> References: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com> Message-ID: <320fb6e00908261504l18c3ee5em4c90b0818fc2c844@mail.gmail.com> On Tue, Aug 25, 2009 at 12:24 PM, Peter wrote: > Hi all, > > I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl) > off list about this plan. I'm going to co-ordinate putting together a > set of valid FASTQ files for shared testing (to supplement the > existing set of invalid FASTQ files already done and being used in > Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon). > > What I have in mind is: > > XXX_original_YYY.fastq - sample input > XXX_as_sanger.fastq - reference output > XXX_as_solexa.fastq - reference output > XXX_as_illumina.fastq - reference output > > where XXX is some name (e.g. wrapped1, wrapped2, shortreads, > longreads, sanger_full_range, solexa_full_range ...) and YYY is the > FASTQ variant (sanger, solexa or illumina) for the "input" file. I didn't want to clog up the mailing list with attachments, but just for the record, I've sent my first attempt at this to Peter (EMBOSS) and Chris (BioPerl) for comment (and checking). My earlier set of error_*.fastq files are in Biopython CVS/github and have since been copied to BioPerl SVN as well.
Peter From biopython at maubp.freeserve.co.uk Thu Aug 27 06:46:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 27 Aug 2009 11:46:17 +0100 Subject: [Open-bio-l] FASTQ in BioRuby? Message-ID: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com> Hello BioRuby team, I am one of the Biopython developers, and together with Peter Rice (EMBOSS) and Chris Fields (BioPerl) we have been coordinating how these Open Bioinformatics Foundation (OBF) projects will interpret the FASTQ file format used in next generation sequencing. This includes standardising our naming conventions for the original Sanger FASTQ variant, and the later Solexa/early Illumina, and recent Illumina 1.3+ variants. We have also put together a set of test files, including reference conversions between the different FASTQ variants. We would be delighted to get BioRuby involved. I tried to contact Naohisa Goto about this directly last month, but perhaps my email did not arrive. If BioRuby is working on (or planning to work on) FASTQ support, please could the developers concerned sign up to the OBF joint mailing list where we have been discussing this: http://lists.open-bio.org/mailman/listinfo/open-bio-l Thank you, Peter From ngoto at gen-info.osaka-u.ac.jp Thu Aug 27 07:20:46 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 27 Aug 2009 20:20:46 +0900 Subject: [Open-bio-l] FASTQ in BioRuby? In-Reply-To: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com> References: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com> Message-ID: <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp> Hello Peter, sorry for responding too late. I've subscribed to open-bio-l, but I could not actively join to the discussions, because of lack of my knowledge about FASTQ. There is a small primitive code attempt to support FASTQ format in BioRuby, which is not yet merged in the main repository. 
http://github.com/ngoto/bioruby/tree/master Recently, Anthony Underwood contributed chromatogram classes to support SCF/ABI formats, which will be merged soon, after the bug-fix maintenance release 1.3.1. http://github.com/aunderwo/bioruby/tree/master I'm now planning to rewrite my FASTQ code to be consistent with the chromatogram classes, and with the open-bio standards. Thank you, Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Thu, 27 Aug 2009 11:46:17 +0100 Peter wrote: > Hello BioRuby team, > > I am one of the Biopython developers, and together with Peter Rice > (EMBOSS) and Chris Fields (BioPerl) we have been coordinating > how these Open Bioinformatics Foundation (OBF) projects will > interpret the FASTQ file format used in next generation sequencing. > > This includes standardising our naming conventions for the original > Sanger FASTQ variant, and the later Solexa/early Illumina, and > recent Illumina 1.3+ variants. We have also put together a set of > test files, including reference conversions between the different > FASTQ variants. > > We would be delighted to get BioRuby involved. I tried to contact > Naohisa Goto about this directly last month, but perhaps my email > did not arrive. If BioRuby is working on (or planning to work on) > FASTQ support, please could the developers concerned sign up > to the OBF joint mailing list where we have been discussing this: > http://lists.open-bio.org/mailman/listinfo/open-bio-l > > Thank you, > > Peter > _______________________________________________ > Open-Bio-l mailing list > Open-Bio-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/open-bio-l From biopython at maubp.freeserve.co.uk Thu Aug 27 08:08:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 27 Aug 2009 13:08:28 +0100 Subject: [Open-bio-l] FASTQ in BioRuby?
In-Reply-To: <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp> References: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com> <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <320fb6e00908270508o485ba990k96b8bd3b722c09b6@mail.gmail.com> On Thu, Aug 27, 2009 at 12:20 PM, Naohisa GOTO wrote: > > Hello Peter, > > sorry for responding too late. I've subscribed to open-bio-l, > but I could not actively join to the discussions, because of > lack of my knowledge about FASTQ. > > There is a small primitive code attempt to support FASTQ format > in BioRuby, which is not yet merged in the main repository. > http://github.com/ngoto/bioruby/tree/master > > Recently, Anthony Underwood contributed chromatgram classes > to support SCF/ABI formats, which will be merged soon, > after bug-fix maintenance release of 1.3.1. > http://github.com/aunderwo/bioruby/tree/master > > I'm now planning to rewrite my FASTQ code to be consistent > with the chromatgram classes, and with the open-bio standards. > > Thank you, > > Naohisa Goto That is excellent news :) I'm not sure how format names work in BioRuby, but if you do have a set of format names as strings as we do in Biopython, BioPerl and EMBOSS it would be nice to be consistent here: http://biopython.org/wiki/SeqIO http://bioperl.org/wiki/HOWTO:SeqIO http://emboss.sourceforge.net/docs/themes/SequenceFormats.html There is some basic information on wikipedia, but this does not go into detail: http://www.bioperl.org/wiki/FASTQ_sequence_format Please feel free to ask any questions about how we are interpreting things. 
Thank you, Peter From biopython at maubp.freeserve.co.uk Thu Aug 27 11:26:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 27 Aug 2009 16:26:23 +0100 Subject: [Open-bio-l] Naming for FASTQ example files In-Reply-To: <320fb6e00908080553h115ab748jecc820d14cb5e524@mail.gmail.com> References: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com> <320fb6e00908080553h115ab748jecc820d14cb5e524@mail.gmail.com> Message-ID: <320fb6e00908270826i729cfdd6o1fdc56f47e5f3c02@mail.gmail.com> On Sat, Aug 8, 2009 at 1:53 PM, Peter wrote: > On Thu, Aug 6, 2009 at 9:17 AM, Peter wrote: >> Hi all, >> >> I am planning on compiling a set of set FASTQ files, for use by >> Biopython, BioPerl, EMBOSS and anyone else that wants to test a >> parser. Modest size contributions will be welcome (no big files >> though). >> >> I will have two types of files: valid ones, and invalid ones. The >> basic idea is any parser should understand what we consider to be >> valid files (we may need to provide matching FASTA and QUAL files or >> something like this for verification), but also reject all the files >> we consider to be invalid. >> ... > > I've gone for "error_*.fastq" and have tried to use meaningful names > rather than numbers. Currently these files are only in the Biopython > repository (under biopython/Tests/Quality), but could be added to the > (currently) unused Biodata repository - although that is still on CVS: > > http://lists.open-bio.org/pipermail/open-bio-l/2009-January/000511.html > > As these examples are all small and we don't expect to change them, > I could also just email them (off the mailing list) to EMBOSS/BioPerl > people directly on request. Chris Fields has already included the original "error_*.fastq" files in BioPerl SVN as test cases. 
Peter Rice has pointed out a minor error in "error_short_qual.fastq" which I have now corrected (it had a short sequence, not a short quality line), and after discussion we have come up with a few more truncation examples: error_trunc_in_title.fastq error_trunc_in_seq.fastq error_trunc_in_plus.fastq error_trunc_in_qual.fastq Again, you can grab these five files (four new, one updated) from Biopython CVS/git, and I will also be emailing Chris & Peter R directly. Peter C. From biopython at maubp.freeserve.co.uk Thu Aug 27 12:40:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 27 Aug 2009 17:40:21 +0100 Subject: [Open-bio-l] More FASTQ examples for cross project testing In-Reply-To: References: <320fb6e00908260306o37eb3d80q53fc50dce708d0b@mail.gmail.com> Message-ID: <320fb6e00908270940r495aecc1n833aa9a28e8f3db3@mail.gmail.com> On Thu, Aug 27, 2009 at 5:31 PM, Michael Heuer wrote: > > Peter wrote: >> Hi Michael - we asked the BioJava guys a while back, and at the time >> there was interest but no volunteers. Who is working on this now? > > Perhaps I should have kept quiet -- I think I just volunteered. ;) > > michael Assuming you're serious, great :) Peter From heuermh at acm.org Thu Aug 27 12:31:43 2009 From: heuermh at acm.org (Michael Heuer) Date: Thu, 27 Aug 2009 12:31:43 -0400 (EDT) Subject: [Open-bio-l] More FASTQ examples for cross project testing In-Reply-To: <320fb6e00908260306o37eb3d80q53fc50dce708d0b@mail.gmail.com> Message-ID: Peter wrote: > On Wed, Aug 26, 2009 at 3:56 AM, Michael Heuer wrote: > > Peter wrote: > > > >> Hi all, > >> > >> I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl) > >> off list about this plan. I'm going to co-ordinate putting together a > >> set of valid FASTQ files for shared testing (to supplement the > >> existing set of invalid FASTQ files already done and being used in > >> Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon). > >> ...
> > > > Very cool idea, Peter, and Peter, and Chris. I don't believe anyone from > > biojava has spoken up on this thread yet, so I thought I should add that > > we are working towards a compatible implementation as well. > > > > michael > > Hi Michael - we asked the BioJava guys a while back, and at the time > there was interest but no volunteers. Who is working on this now? Perhaps I should have kept quiet -- I think I just volunteered. ;) michael From biopython at maubp.freeserve.co.uk Mon Aug 31 08:07:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 31 Aug 2009 13:07:45 +0100 Subject: [Open-bio-l] Status of OBDA and indexed flatfiles? Message-ID: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com> Hi all, I'm looking at indexing next generation sequence files for Biopython (e.g. FASTQ short read files with 10s of millions of entries), where even just holding the record names and their file offsets in memory is beginning to be a bottleneck. What is the current status of Open Biological Database Access (OBDA), and in particular the index files for sequence "flat files" like FASTA or GenBank (or FASTQ)? http://www.bioperl.org/wiki/HOWTO:Flat_databases http://www.bioperl.org/wiki/HOWTO:OBDA http://obda.open-bio.org/ The spec files are still in CVS (and ViewCVS is still broken since the recent server move), rather than having been migrated to SVN which may suggest things are obsolete (or on the bright side, stable). Presumably BioPerl still uses these index files? What about the other projects? I know EMBOSS has some indexing system for example but I have no idea how it works internally. Thanks, Peter From ngoto at gen-info.osaka-u.ac.jp Mon Aug 31 10:01:46 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Mon, 31 Aug 2009 23:01:46 +0900 Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com> References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com> Message-ID: <20090831140146.EAF851CBC594@idnmail.gen-info.osaka-u.ac.jp> Hi Peter, On Mon, 31 Aug 2009 13:07:45 +0100 Peter wrote: > Hi all, > > I'm looking at indexing next generation sequence files for Biopython > (e.g. FASTQ short read files with 10s of millions of entries), where > even just holding the record names and their file offsets in memory > is beginning to be a bottleneck. > > What is the current status of Open Biological Database Access (OBDA), > and in particular the index files for sequence "flat files" like FASTA or > GenBank (or FASTQ)? > > http://www.bioperl.org/wiki/HOWTO:Flat_databases > http://www.bioperl.org/wiki/HOWTO:OBDA > http://obda.open-bio.org/ > > The spec files are still in CVS (and ViewCVS is still broken since > the recent server move), rather than having been migrated to SVN > which may suggest things are obsolete (or on the bright side, stable). > > Presumably BioPerl still uses these index files? What about the > other projects? I know EMBOSS has some indexing system for > example but I have no idea how it works internally. BioRuby still uses them. To gain performance, names and offsets are written to temporary files and using external sort program (default /usr/bin/sort). In BioRuby, flatfile-only solution works fine, but BerkeleyDB indexes would be incompatible with other projects, because of confusion in the spec, discussed in BioPerl Bugzilla Bug #2337. http://bugzilla.open-bio.org/show_bug.cgi?id=2337 Thanks, -- Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org From biopython at maubp.freeserve.co.uk Mon Aug 31 11:07:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 31 Aug 2009 16:07:28 +0100 Subject: [Open-bio-l] Status of OBDA and indexed flatfiles? 
In-Reply-To: <20090831140146.EAF851CBC594@idnmail.gen-info.osaka-u.ac.jp> References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com> <20090831140146.EAF851CBC594@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <320fb6e00908310807j2fbc9710x3621a16a54ad5e45@mail.gmail.com> On Mon, Aug 31, 2009 at 3:01 PM, Naohisa GOTO wrote: > Hi Peter, > >> Presumably BioPerl still uses these index files? What about the >> other projects? I know EMBOSS has some indexing system for >> example but I have no idea how it works internally. > > BioRuby still uses them. To gain performance, names and offsets are > written to temporary files and using external sort program (default > /usr/bin/sort). That makes sense. Have you tried this on very large files? e.g. FASTA with 10 million short reads? > In BioRuby, flatfile-only solution works fine, but BerkeleyDB indexes > would be incompatible with other projects, because of confusion in > the spec, discussed in BioPerl Bugzilla Bug #2337. > http://bugzilla.open-bio.org/show_bug.cgi?id=2337 Thank you for the link to that bug - I'll need to read that carefully. Peter From biopython at maubp.freeserve.co.uk Mon Aug 31 11:45:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 31 Aug 2009 16:45:51 +0100 Subject: [Open-bio-l] Status of OBDA and indexed flatfiles? In-Reply-To: <8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu> References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com> <8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu> Message-ID: <320fb6e00908310845n3322d2c6hc992f78b81c0e226@mail.gmail.com> On Mon, Aug 31, 2009 at 4:33 PM, Chris Fields wrote: > > I don't use OBDA, personally, but I can check on the status with Brian > Osborne (he was heading it up last I checked). However, I don't think > BioPerl has an OBDA FASTQ parser. > > You may be thinking about Bio::Index::FASTQ? That one is not OBDA, > but just a simple flat file indexer.
We could probably set an OBDA parser > up fairly easily if needed. I didn't know if Bio::Index was using OBDA "under the hood" or not. Does this mean BioPerl has multiple indexing systems available? As I noted on Bug 2337 earlier today, Biopython used to have some sort of OBDA compliant indexing, but for unrelated reasons we have deprecated and removed that code. We're now revisiting this topic due in part to having to deal with ever larger data files - and I wanted to see if OBDA was still "alive" as a standard, and furthermore how well it had scaled for the other OBF projects. Peter From cjfields at illinois.edu Mon Aug 31 11:33:02 2009 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 31 Aug 2009 10:33:02 -0500 Subject: [Open-bio-l] Status of OBDA and indexed flatfiles? In-Reply-To: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com> References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com> Message-ID: <8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu> On Aug 31, 2009, at 7:07 AM, Peter wrote: > Hi all, > > I'm looking at indexing next generation sequence files for Biopython > (e.g. FASTQ short read files with 10s of millions of entries), where > even just holding the record names and their file offsets in memory > is beginning to be a bottleneck. > > What is the current status of Open Biological Database Access (OBDA), > and in particular the index files for sequence "flat files" like > FASTA or > GenBank (or FASTQ)? > > http://www.bioperl.org/wiki/HOWTO:Flat_databases > http://www.bioperl.org/wiki/HOWTO:OBDA > http://obda.open-bio.org/ > > The spec files are still in CVS (and ViewCVS is still broken since > the recent server move), rather than having been migrated to SVN > which may suggest things are obsolete (or on the bright side, stable). > > Presumably BioPerl still uses these index files? What about the > other projects? I know EMBOSS has some indexing system for > example but I have no idea how it works internally.
> > Thanks, > > Peter I don't use OBDA, personally, but I can check on the status with Brian Osborne (he was heading it up last I checked). However, I don't think BioPerl has an OBDA FASTQ parser. You may be thinking about Bio::Index::FASTQ? That one is not OBDA, but just a simple flat file indexer. We could probably set an OBDA parser up fairly easily if needed. chris From cjfields at illinois.edu Mon Aug 31 14:22:36 2009 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 31 Aug 2009 13:22:36 -0500 Subject: [Open-bio-l] Status of OBDA and indexed flatfiles? In-Reply-To: <320fb6e00908310845n3322d2c6hc992f78b81c0e226@mail.gmail.com> References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com> <8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu> <320fb6e00908310845n3322d2c6hc992f78b81c0e226@mail.gmail.com> Message-ID: On Aug 31, 2009, at 10:45 AM, Peter wrote: > On Mon, Aug 31, 2009 at 4:33 PM, Chris Fields > wrote: >> >> I don't use OBDA, personally, but I can check on the status with >> Brian >> Osborne (he was heading it up last I checked). However, I don't >> think >> BioPerl has an OBDA FASTQ parser. >> >> You may be thinking about Bio::Index::FASTQ? That one is not OBDA, >> but just a simple flat file indexer. We could probably set an OBDA >> parser >> up fairly easily if needed. > > I didn't know if Bio::Index was using OBDA "under the hood" or not. > Does this mean BioPerl has multiple indexing systems available? Yes. We have Bio::Index::*, Bio::DB::Flat (which I think is OBDA). There is also the older Bio::DB::Fasta, which is actually still in wide use. Note with Bio::Index::* we allow streaming of any report type (sequence, alignment, analysis like BLAST, etc). We have talked about switching many of the Bio::Index::* sequence- based ones to OBDA but I haven't seen anyone take that up. 
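For readers following the indexing discussion, the basic "record name to file offset" idea behind a simple flat file indexer can be sketched as below. This is only an illustration under stated assumptions: it is not the OBDA index format, nor what Bio::Index, BioRuby or Biopython actually do. The function names are hypothetical, and it assumes unwrapped four-line FASTQ records. It keeps the whole index in an in-memory dictionary, which is exactly the approach described in this thread as becoming a bottleneck for tens of millions of reads.

```python
def build_offset_index(path):
    """Map each record name to the byte offset of its '@' title line.

    Assumes unwrapped four-line FASTQ records (title, sequence, plus, quality).
    """
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            title = handle.readline()
            if not title:
                break  # end of file
            if not title.startswith(b"@"):
                raise ValueError("expected '@' title line at offset %d" % offset)
            fields = title[1:].split(None, 1)
            name = fields[0].decode("ascii") if fields else ""
            # Skip the sequence, plus and quality lines of this record.
            for _ in range(3):
                handle.readline()
            if name in index:
                raise ValueError("duplicate record name %r" % name)
            index[name] = offset
    return index


def fetch_record(path, offset):
    """Return the four lines of the record starting at the given byte offset."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return [handle.readline().decode("ascii").rstrip("\r\n") for _ in range(4)]
```

For very large files the same name and offset pairs could instead be written to disk and sorted externally, as described for BioRuby earlier in the thread.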
> As I noted on Bug 2337 earlier today, Biopython used to have some > sort of OBDA compliant indexing, but for unrelated reasons we have > deprecated and removed that code. We're now revisiting this topic > due in part to having to deal with ever larger data files - and I > wanted > to see if OBDA was still "alive" as a standard, and furthermore how > well it had scaled for the other OBF projects. > > Peter I think it's still alive and being used, just not sure what the compliance level is amongst the different Bio* projects. chris From charles-listes+open-bio at plessy.org Sun Aug 2 01:25:37 2009 From: charles-listes+open-bio at plessy.org (Charles Plessy) Date: Sun, 2 Aug 2009 10:25:37 +0900 Subject: [Open-bio-l] FASTQ identifiers In-Reply-To: <320fb6e00907310215n6ed33a17id8a5ff5913c9d3b5@mail.gmail.com> References: <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> <4A72A8F9.9020903@ebi.ac.uk> <320fb6e00907310215n6ed33a17id8a5ff5913c9d3b5@mail.gmail.com> Message-ID: <20090802012537.GD2479@kunpuu.plessy.org> On Fri, Jul 31, 2009 at 10:15:57AM +0100, Peter wrote: > The situation is similar to the FASTA format (and others), in that there > are a number of reasonably well documented conventions in use > (e.g. the NCBI FASTA identifiers with | characters). However, equally, > there are thousands of ad hoc local conventions. Hello, I just would like to mention such an ad-hoc convention in use at workplace: with FASTQ sequences we sometimes replace the original name by the sequence itself. This can be useful for instance to troubleshoot some sequence manipulations.
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
;;3;;;;;;;;;;;;7;;;;;;;88

becomes:

@CCCTTCTTGTCTTCAGCGTTTCTCC
CCCTTCTTGTCTTCAGCGTTTCTCC
+CCCTTCTTGTCTTCAGCGTTTCTCC
;;3;;;;;;;;;;;;7;;;;;;;88

and after some arbitrary trimming at the ends:

@CCCTTCTTGTCTTCAGCGTTTCTCC
TTCTTGTCTTCAGCGTTTCT
+CCCTTCTTGTCTTCAGCGTTTCTCC
;;;;;;;;;;;;7;;;;;;;

With FASTA format, we sometimes eliminate redundant sequences and record how many times they occurred by adding the count to the name. For instance:

>seq1
AAATTT
>seq2
AAATAT
>seq3
AAATTT

becomes:

>AAATTT_2
AAATTT
>AAATAT_1
AAATAT

If this is popular elsewhere, it may be useful to implement functions that allow doing this efficiently.

Have a nice day,

--
Charles Plessy
Tsurumi, Kanagawa, Japan

From biopython at maubp.freeserve.co.uk Mon Aug 3 09:30:09 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 3 Aug 2009 10:30:09 +0100
Subject: [Open-bio-l] FASTQ identifiers
In-Reply-To: <20090802012537.GD2479@kunpuu.plessy.org>
References: <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com> <320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com> <320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com> <24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com> <4A72A8F9.9020903@ebi.ac.uk> <320fb6e00907310215n6ed33a17id8a5ff5913c9d3b5@mail.gmail.com> <20090802012537.GD2479@kunpuu.plessy.org>
Message-ID: <320fb6e00908030230x52bf32a8o3b640ce8d0a76b8@mail.gmail.com>

On Sun, Aug 2, 2009 at 2:25 AM, Charles Plessy wrote:
> On Fri, Jul 31, 2009 at 10:15:57AM +0100, Peter wrote:
>> The situation is similar to the FASTA format (and others), in that there
>> are a number of reasonably well documented conventions in use
>> (e.g. the NCBI FASTA identifiers with | characters). However, equally,
>> there are thousands of ad hoc local conventions.
> > Hello, > > I just would like to mention such an ad-hoc convention in use at > workplace: with FASTQ sequences we sometimes replace the original > name by the sequence itself. This can be useful for instance to > troubleshoot some sequence manipulations. > > @EAS54_6_R1_2_1_413_324 > CCCTTCTTGTCTTCAGCGTTTCTCC > +EAS54_6_R1_2_1_413_324 > ;;3;;;;;;;;;;;;7;;;;;;;88 > > becomes: > > @CCCTTCTTGTCTTCAGCGTTTCTCC > CCCTTCTTGTCTTCAGCGTTTCTCC > +CCCTTCTTGTCTTCAGCGTTTCTCC > ;;3;;;;;;;;;;;;7;;;;;;;88 > That certainly demonstrates we can't make any big assumptions about the title line formatting ;) Your example is interesting - but I don't quite understand why you do this. Surely any debug message or output file for bad reads would (normally) have a unique read ID which (indirectly) tells you the read sequence? If you are writing the code which gives these error messages, can't you explicitly give the read sequence? Is the aim to be able to look at error messages from third party tools (which just give the read name) and see the read sequence directly (without looking up the read name in the original FASTQ file)? This is similar in some ways to my comment that I could see a real use for FASTQ (and FASTA) files with no record identifiers: >> Related to this, what about the corner case of reads with NO >> identifier? The FASTQ (and indeed the FASTA) formats can >> hold such things - just use a blank title line. In the case of >> next generation sequencing reads, the names themselves >> are not actually that important - so you can imagine a pipeline >> which doesn't actually bother with them at all. In your pipeline you clearly don't care about the original FASTQ identifiers, and (if the pipeline would accept it), using blank title lines might also work (and would certainly save disk space). 
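The two ad hoc conventions Charles describes (naming a FASTQ read after its own sequence, and collapsing duplicate FASTA sequences with a count in the name) could be sketched in Python roughly as follows. This is a minimal illustration, not an existing Biopython or BioPerl function: both function names are hypothetical, and the FASTQ helper assumes unwrapped four-line records.

```python
from collections import Counter


def rename_fastq_by_sequence(lines):
    """Yield FASTQ lines with each title (and plus line) replaced by the read.

    Assumes unwrapped four-line records: title, sequence, plus, quality.
    """
    stripped = [line.rstrip("\r\n") for line in lines]
    if len(stripped) % 4 != 0:
        raise ValueError("expected a multiple of four lines")
    for i in range(0, len(stripped), 4):
        title, seq, plus, qual = stripped[i:i + 4]
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record starting at line %d" % (i + 1))
        yield "@" + seq
        yield seq
        yield "+" + seq
        yield qual


def collapse_fasta(records):
    """Collapse (name, sequence) pairs into (sequence_count, sequence) pairs.

    Output keeps the order in which each distinct sequence was first seen.
    """
    counts = Counter(seq for _name, seq in records)
    return [("%s_%d" % (seq, n), seq) for seq, n in counts.items()]
```

Applied to the three-record FASTA example quoted above, collapse_fasta gives the names AAATTT_2 and AAATAT_1, matching the convention described.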
Peter From cjfields at illinois.edu Wed Aug 5 15:12:18 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 5 Aug 2009 10:12:18 -0500 Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com> <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com> Message-ID: <3C298ABC-07CD-4597-BA83-F8F5992BF73A@illinois.edu> On Jul 29, 2009, at 5:15 AM, Peter wrote: > Hi all, > > This is a follow up to the earlier discussion about high quality > scores > in Solexa or Illumina 1.3+ FASTQ files and the problem of non > printable > ASCII codes (which can occur if converting from Sanger FASTQ). > >> On Sat, Jul 25, 2009 at 8:50 PM, Chris >> Fields wrote: >>> >>>> Now, here comes the problem. I believe FASTQ files directly >>>> from an Illumina 1.3+ pipeline will have PHRED scores in the >>>> range 0 to 40 (as in this example). However, much higher >>>> PHRED scores are possible during assembly / contig'ing >>>> and read mapping. For example, the tool MAQ will output >>>> Sanger style FASTQ files with PHRED scores in the range >>>> 0 to 93 inclusive. >>> >>> We can support it as Illumina 1.3, but my point is this may >>> getting into a >>> grey area and may be something that Illumina doesn't/wouldn't >>> support. >>> Reminds me a little of the multiple GFF2 variations (one of the >>> main >>> reasons for a GFF3). >> >> I agree this is an grey area (high scores in Solexa/Illumina >> FASTQ files). >> >> ... >> >> i.e. 
An Illumina FASTQ format file can hold PHRED scores in the >> range 0 to 62 without using problem characters. And likewise >> for a Solexa FASTQ file (Solexa scores up to 62). > > Peter Rice and I have been talking about this off list, and have > a proposal for the high score problem. Basically we want to > restrict FASTQ quality strings to printable ASCII, which means > 126 (0x7e) is a firm upper limit, while otherwise allowing for a > high scores as possible. This limit comes from ASCII 127 being > "delete", and the even higher characters also being non-printable. > > i.e. We are suggesting: > > "fastq-sanger" - Allows PHRED scores 0 to 93 inclusive, mapped > with an ASCII offset of 33 to ASCII characters 33 to 126 (or in hex, > 0x21 to 0x7e). This is as defined on the MAQ web pages. > > "fastq-illumina" - Believed to use at least PHRED scores 0 to 40, > mapped with an ASCII offset of 64 to ASCII characters 64 to 104 > (or in hex, to 0x40 to 0x68). It is a reasonable and well defined > extension to permit PHRED scores from 0 to 62 inclusive, which > map to ASCII 64 to 126 (or in hex 0x40 to 0x7e). This avoids the > non printing characters, and gives some head room for improved > sequencing technology from Illumina giving higher raw scores. > > "fastq-solexa" - Believed to use Solexa scores from -5 to at least > 40, again mapped with an ASCII offset of 64 giving ASCII characters > 59 to 104 (or in hex, 0x3b to 0x68). Again, a reasonable and well > defined extension would permit Solexa scores in the range -5 to 62 > inclusive, using ASCII 59 to 126 (or in hex, 0x3b to 0x7e). > > [Peter R. 
- please correct me if of the above is not what you had > in mind] > > If in the process of converting between formats, a quality score > is too high (it would result in ASCII 127 or higher), then I would > argue any of the following would be acceptable: > (a) Silently impose the maximum score (ASCII 126, 0x7e) > (b) Impose the maximum score with a warning > (c) Raise an error > > I don't think EMBOSS, BioPerl and Biopython have to handle > this exactly the same way, but I would favour (b) then (a). > > Peter I think, based on Aaron's comments, with BioPerl we'll adopt (b) to deal with format validation, but try to do it in a way that 'caches' bad data so it doesn't report a warning on every out-of-range value. I am planning on a Moose-based parser at some point that will do the same. chris From biopython at maubp.freeserve.co.uk Wed Aug 5 16:01:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 5 Aug 2009 17:01:32 +0100 Subject: [Open-bio-l] Mixed case sequence strings in FASTQ? Message-ID: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com> Hi all, Another FASTQ issue to debate: Should we care what case the sequence strings are? I've never seen anything written down, but all the examples I recall used upper case. But there is nothing to stop people using mixed case, is there? With FASTA on the other hand, while all uppercase is most common, mixed case has its uses (e.g. representing trimmed regions, or low quality scores). I would suggest that OBF tools all treat the sequence in FASTQ files as is, and preserve the case on output. Any thoughts? Peter From dan.bolser at gmail.com Wed Aug 5 16:50:56 2009 From: dan.bolser at gmail.com (Dan Bolser) Date: Wed, 5 Aug 2009 17:50:56 +0100 Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
Message-ID: <2c8757af0908050950y97863fcj2b4deda1b8bb37c8@mail.gmail.com>

2009/8/5 Peter :
> Hi all,
>
> Another FASTQ issue to debate: Should we care what case the sequence
> strings are? I've never seen anything written down, but all the
> examples I recall used upper case. But there is nothing to stop people
> using mixed case, is there?
>
> With FASTA on the other hand, while all uppercase is most common,
> mixed case has its uses (e.g. representing trimmed regions, or low
> quality scores).
>
> I would suggest that OBF tools all treat the sequence in FASTQ files
> as is, and preserve the case on output.
>
> Any thoughts?

Agree.

> Peter
> _______________________________________________
> Open-Bio-l mailing list
> Open-Bio-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/open-bio-l
>

From biopython at maubp.freeserve.co.uk Thu Aug 6 08:17:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 6 Aug 2009 09:17:07 +0100
Subject: [Open-bio-l] Naming for FASTQ example files
Message-ID: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com>

Hi all,

I am planning on compiling a set of FASTQ files, for use by Biopython, BioPerl, EMBOSS and anyone else that wants to test a parser. Modest size contributions will be welcome (no big files though).

I will have two types of files: valid ones, and invalid ones. The basic idea is any parser should understand what we consider to be valid files (we may need to provide matching FASTA and QUAL files or something like this for verification), but also reject all the files we consider to be invalid.

Regarding names, does "error_*.fastq" or "invalid_*.fastq" sound fine?

Any preference for meaningful names ("error_qual_short.fastq", "error_qual_bad_char.fastq", ...) versus numbers ("error_001.fastq", "error_002.fastq", ...).
Either way I think a README file would need to accompany the dataset stating what we think makes each example invalid (e.g. quality string shorter than sequence, invalid character in quality string, ...).

Peter

From biopython at maubp.freeserve.co.uk Sat Aug 8 12:53:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 8 Aug 2009 13:53:17 +0100
Subject: [Open-bio-l] Naming for FASTQ example files
In-Reply-To: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com>
References: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com>
Message-ID: <320fb6e00908080553h115ab748jecc820d14cb5e524@mail.gmail.com>

On Thu, Aug 6, 2009 at 9:17 AM, Peter wrote:
> Hi all,
>
> I am planning on compiling a set of FASTQ files, for use by
> Biopython, BioPerl, EMBOSS and anyone else that wants to test a
> parser. Modest size contributions will be welcome (no big files
> though).
>
> I will have two types of files: valid ones, and invalid ones. The
> basic idea is any parser should understand what we consider to be
> valid files (we may need to provide matching FASTA and QUAL files or
> something like this for verification), but also reject all the files
> we consider to be invalid.
>
> Regarding names, does "error_*.fastq" or "invalid_*.fastq" sound fine?
>
> Any preference for meaningful names ("error_qual_short.fastq",
> "error_qual_bad_char.fastq", ...) versus numbers ("error_001.fastq",
> "error_002.fastq", ...). Either way I think a README file would need
> to accompany the dataset stating what we think makes each example
> invalid (e.g. quality string shorter than sequence, invalid character
> in quality string, ...).

I've gone for "error_*.fastq" and have tried to use meaningful names rather than numbers.
Currently these files are only in the Biopython repository (under biopython/Tests/Quality), but could be added to the (currently) unused Biodata repository - although that is still on CVS:

http://lists.open-bio.org/pipermail/open-bio-l/2009-January/000511.html

As these examples are all small and we don't expect to change them, I could also just email them (off the mailing list) to EMBOSS/BioPerl people directly on request.

Currently my error examples are as follows, broken down into groups.

Quality strings with invalid ASCII characters (not the full set, but we could do that):
error_qual_null.fastq
error_qual_vtab.fastq
error_qual_tab.fastq
error_qual_escape.fastq
error_qual_unit_sep.fastq
error_qual_space.fastq
error_qual_del.fastq

Misc errors:
error_diff_ids.fastq
error_spaces.fastq
error_tabs.fastq
error_short_qual.fastq
error_long_qual.fastq
error_no_qual.fastq

Simulated truncation part way through a file:
error_trunc_at_plus.fastq
error_trunc_at_qual.fastq
error_trunc_at_seq.fastq

Note they are all based on the same example file which due to the quality characters can be interpreted as any of the three FASTQ variants we're supporting (Sanger, Solexa, Illumina 1.3+). This was deliberate. Additional examples of files which could be Sanger or Solexa but not Illumina 1.3+ (or valid Sanger but can't be Solexa or Illumina 1.3+) are also a good idea.

Note that in many of these examples the error is part way into the file, so there are initially some valid reads and then an error.

Peter

From biopython at maubp.freeserve.co.uk Sat Aug 8 18:56:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 8 Aug 2009 19:56:32 +0100
Subject: [Open-bio-l] White space in FASTQ files?
Message-ID: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com> Hi all, Other than the special case of new lines which we have already covered (allowed but line wrapping is discouraged), should FASTQ sequence lines (and indeed the quality lines) ever be allowed to include white space (e.g. spaces and tabs)? I've never seen this in a real FASTQ file, and would like to suggest this be considered an error. Comments? Counter suggestions? Peter From pmr at ebi.ac.uk Mon Aug 10 13:02:47 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Mon, 10 Aug 2009 14:02:47 +0100 Subject: [Open-bio-l] Mixed case sequence strings in FASTQ? In-Reply-To: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com> References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com> Message-ID: <4A801A77.20704@ebi.ac.uk> Peter C. wrote: > I would suggest that OBF tools all treat the sequence in FASTQ files > as is, and preserve the case on output. > > Any thoughts? EMBOSS does that with all sequence formats. The case of the original sequence is preserved and reproduced on output. We have not specified upper or lower case only for any of our current output formats. We provide command line options to force sequences to be converted to upper or lower case if the user want to specify one or the other - usually just to convert sequences for post processing by some other tool. regards, Peter Rice From biopython at maubp.freeserve.co.uk Mon Aug 10 13:06:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 14:06:23 +0100 Subject: [Open-bio-l] Mixed case sequence strings in FASTQ? In-Reply-To: <4A801A77.20704@ebi.ac.uk> References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com> <4A801A77.20704@ebi.ac.uk> Message-ID: <320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com> On Mon, Aug 10, 2009 at 2:02 PM, Peter Rice wrote: > Peter C. 
wrote: > >> I would suggest that OBF tools all treat the sequence in FASTQ files >> as is, and preserve the case on output. >> >> Any thoughts? > > EMBOSS does that with all sequence formats. The case of the original > sequence is preserved and reproduced on output. We have not specified > upper or lower case only for any of our current output formats. > > We provide command line options to force sequences to be converted to > upper or lower case if the user want to specify one or the other - > usually just to convert sequences for post processing by some other tool. Cool. It looks like we are on the same wavelength here :) Peter From pmr at ebi.ac.uk Mon Aug 10 13:09:14 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Mon, 10 Aug 2009 14:09:14 +0100 Subject: [Open-bio-l] White space in FASTQ files? In-Reply-To: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com> References: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com> Message-ID: <4A801BFA.2040208@ebi.ac.uk> Peter C. wrote: > Other than the special case of new lines which we have already covered > (allowed but line wrapping is discouraged), should FASTQ sequence > lines (and indeed the quality lines) ever be allowed to include white > space (e.g. spaces and tabs)? I've never seen this in a real FASTQ > file, and would like to suggest this be considered an error. > > Comments? Counter suggestions? I am happy adding a warning message in EMBOSS for this. If we add too many warning messages then we could break our plan to issue one message and follow with "and another 999999 up to ..." if we find ourselves issuing more than one warning per sequence. regards, Peter Rice From biopython at maubp.freeserve.co.uk Mon Aug 10 13:36:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 10 Aug 2009 14:36:26 +0100 Subject: [Open-bio-l] White space in FASTQ files? 
In-Reply-To: <4A801BFA.2040208@ebi.ac.uk> References: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com> <4A801BFA.2040208@ebi.ac.uk> Message-ID: <320fb6e00908100636r3e95b505x1fad838c566c973d@mail.gmail.com> On Mon, Aug 10, 2009 at 2:09 PM, Peter Rice wrote: > Peter C. wrote: >> Other than the special case of new lines which we have already covered >> (allowed but line wrapping is discouraged), should FASTQ sequence >> lines (and indeed the quality lines) ever be allowed to include white >> space (e.g. spaces and tabs)? I've never seen this in a real FASTQ >> file, and would like to suggest this be considered an error. >> >> Comments? Counter suggestions? > > I am happy adding a warning message in EMBOSS for this. > So you are thinking you'll try and cope with white space, and issue a warning? This sounds dangerous to me. One of the properties of a FASTQ file is the sequence string and the quality string should be the same length (after removing the line wrapping). Allowing whitespace in these strings makes that ambiguous. What if the sequence has white space but not the quality? What if they both have white space but in different positions? Just calling any whitespace (other than the new line characters) an error seems much safer. If there are any real files which do this, we can revisit this. Peter From cjfields at illinois.edu Tue Aug 11 23:32:00 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 11 Aug 2009 18:32:00 -0500 Subject: [Open-bio-l] White space in FASTQ files? In-Reply-To: <4A801BFA.2040208@ebi.ac.uk> References: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com> <4A801BFA.2040208@ebi.ac.uk> Message-ID: On Aug 10, 2009, at 8:09 AM, Peter Rice wrote: > Peter C. wrote: >> Other than the special case of new lines which we have already >> covered >> (allowed but line wrapping is discouraged), should FASTQ sequence >> lines (and indeed the quality lines) ever be allowed to include white >> space (e.g. spaces and tabs)? 
I've never seen this in a real FASTQ >> file, and would like to suggest this be considered an error. >> >> Comments? Counter suggestions? > > I am happy adding a warning message in EMBOSS for this. > > If we add too many warning messages then we could break our plan to > issue one message and follow with "and another 999999 up to ..." if we > find ourselves issuing more than one warning per sequence. > > regards, > > Peter Rice This is quite similar to the 'qual range out-of-bounds for this FASTQ variant' warning we discussed earlier. We could essentially merge these to be one and the same. chris From cjfields at illinois.edu Tue Aug 11 23:32:10 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 11 Aug 2009 18:32:10 -0500 Subject: [Open-bio-l] Mixed case sequence strings in FASTQ? In-Reply-To: <320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com> References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com> <4A801A77.20704@ebi.ac.uk> <320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com> Message-ID: <0D818138-133D-4937-9B24-8A07CD62C170@illinois.edu> On Aug 10, 2009, at 8:06 AM, Peter wrote: > On Mon, Aug 10, 2009 at 2:02 PM, Peter Rice wrote: >> Peter C. wrote: >> >>> I would suggest that OBF tools all treat the sequence in FASTQ files >>> as is, and preserve the case on output. >>> >>> Any thoughts? >> >> EMBOSS does that with all sequence formats. The case of the original >> sequence is preserved and reproduced on output. We have not specified >> upper or lower case only for any of our current output formats. >> >> We provide command line options to force sequences to be converted to >> upper or lower case if the user want to specify one or the other - >> usually just to convert sequences for post processing by some other >> tool. > > Cool. It looks like we are on the same wavelength here :) > > Peter I believe so (sorry about lack of responsiveness, just got back in town). 
chris

From biopython at maubp.freeserve.co.uk Wed Aug 12 10:23:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 12 Aug 2009 11:23:52 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <0D818138-133D-4937-9B24-8A07CD62C170@illinois.edu>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com> <4A801A77.20704@ebi.ac.uk> <320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com> <0D818138-133D-4937-9B24-8A07CD62C170@illinois.edu>
Message-ID: <320fb6e00908120323q39e1b3e9x1ff6b56203149943@mail.gmail.com>

On Wed, Aug 12, 2009 at 12:32 AM, Chris Fields wrote:
>
> On Aug 10, 2009, at 8:06 AM, Peter wrote:
>
>> On Mon, Aug 10, 2009 at 2:02 PM, Peter Rice wrote:
>>>
>>> Peter C. wrote:
>>>
>>>> I would suggest that OBF tools all treat the sequence in FASTQ files
>>>> as is, and preserve the case on output.
>>>>
>>>> Any thoughts?
>>>
>>> EMBOSS does that with all sequence formats. The case of the original
>>> sequence is preserved and reproduced on output. We have not specified
>>> upper or lower case only for any of our current output formats.
>>>
>>> We provide command line options to force sequences to be converted to
>>> upper or lower case if the user wants to specify one or the other -
>>> usually just to convert sequences for post processing by some other tool.
>>
>> Cool. It looks like we are on the same wavelength here :)
>>
>> Peter
>
> I believe so (sorry about lack of responsiveness, just got back in town).
>
> chris

Great - I've added some unit test code in Biopython to confirm we leave the sequence case as-is when loading and saving FASTQ files.
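A round-trip check of this kind might look like the sketch below. It is not the Biopython unit test itself: `read_fastq` and `write_fastq` are hypothetical helper names, and the reader assumes simple unwrapped four-line records.

```python
import io


def read_fastq(handle):
    """Yield (title, sequence, quality) from unwrapped four-line FASTQ records.

    Hypothetical helper for illustration only. The sequence is returned
    exactly as written, with no upper- or lower-casing.
    """
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            return  # end of file (a blank line also stops the sketch)
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("Malformed FASTQ record: %r" % title)
        if len(seq) != len(qual):
            raise ValueError("Sequence and quality lengths differ: %r" % title)
        yield title[1:], seq, qual


def write_fastq(records, handle):
    """Write records back out, leaving the sequence case untouched."""
    for title, seq, qual in records:
        handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))


# A mixed case read should come back byte-for-byte identical.
mixed_case_seq = "ACGTacgtNNnn"
original = "@read1\n%s\n+\n%s\n" % (mixed_case_seq, "I" * len(mixed_case_seq))
output = io.StringIO()
write_fastq(read_fastq(io.StringIO(original)), output)
print(output.getvalue() == original)
```

The comparison is on the whole file text, so any silent change of case on input or output would make the check fail.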
Peter

From biopython at maubp.freeserve.co.uk Mon Aug 24 14:18:20 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 24 Aug 2009 15:18:20 +0100
Subject: [Open-bio-l] [Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS
In-Reply-To:
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> <320fb6e00907270451i3d40b4ffq607360cfcb6f6282@mail.gmail.com>
Message-ID: <320fb6e00908240718q194afe78j4a05b31aeb33e313@mail.gmail.com>

On Mon, Jul 27, 2009 at 2:06 PM, Chris Fields wrote:
>
> I added this (and the others) to our ticket tracking this. Looks like
> solexa conversion either way is borked, which is very likely an issue
> with conversion.

Hi Chris,

I've been digging into the current SVN code for BioPerl's FASTQ support - I realised you are doing the Solexa to PHRED mapping twice when parsing "fastq-solexa" files. Using "qual" output (which shows the PHRED scores in plain text) makes it very clear something is wrong:

$ cat solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;

That is Solexa scores from 40 (h) down to -5 (;), which should map onto PHRED scores from 40 down to 1 (according to our prior discussions).
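The expected mapping can be checked with a short sketch. It assumes the usual relationship between the two scales, PHRED = 10 * log10(10^(Solexa/10) + 1), rounded to the nearest integer; the function name `solexa_to_phred` is hypothetical and not taken from any of the libraries discussed here.

```python
import math


def solexa_to_phred(solexa):
    """Convert one Solexa score to the nearest integer PHRED score.

    Assumes PHRED = 10 * log10(10 ** (Solexa / 10) + 1).
    """
    return int(round(10 * math.log10(10 ** (solexa / 10.0) + 1)))


# Rebuild the quality line from solexa_faked.fastq: Solexa 40 down to -5,
# each score stored as chr(score + 64).
quality = "".join(chr(score + 64) for score in range(40, -6, -1))

# Decode each character back to a Solexa score and map it once to PHRED.
phred = [solexa_to_phred(ord(char) - 64) for char in quality]
print(" ".join(str(q) for q in phred))
```

Applied once, the mapping runs from PHRED 40 for "h" down to PHRED 1 for ";". Applying it a second time to the already converted values would push the low scores upwards, which is the symptom described here.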
$ ./bioperl_solexa2qual.pl < solexa_faked.fastq >slxa_0001_1_0001_01 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 10 10 9 8 7 6 6 5 5 5 5 4 4 4 4 For reference, $ python biopython_solexa2qual.py < solexa_faked.fastq >slxa_0001_1_0001_01 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1 1 I can "fix" this in fastq.pm by commenting out one of the log mappings, for example see the patch I've just uploaded to Bug 2857: http://bugzilla.open-bio.org/show_bug.cgi?id=2857 That brings me to another problem, consider the following (with the double conversion fixed): $ ./bioperl_solexa2solexa.pl < solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN +slxa_0001_1_0001_01 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJJHGFEDDBB@@>><< If you compare that to the original, you'll notice a loss of detail in the poor quality reads. e.g. Solexa scores 9 (I) and 10 (J) have both been mapped onto 10 (J). I believe this happens because BioPerl is converting the Solexa scores to PHRED scores on loading (which is fine - EMBOSS does this too), but you are also storing them as integers! In order to preserve these details, I think you'll have to hold the converted PHRED scores as floating point numbers (which I think is what EMBOSS does). This has the downside of taking more memory, and may also complicate file output (you may need to round things). Regards, Peter (@Biopython) From biopython at maubp.freeserve.co.uk Tue Aug 25 11:24:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Aug 2009 12:24:27 +0100 Subject: [Open-bio-l] More FASTQ examples for cross project testing Message-ID: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com> Hi all, I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl) off list about this plan. 
I'm going to co-ordinate putting together a set of valid FASTQ files for shared testing (to supplement the existing set of invalid FASTQ files already done and being used in Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon). What I have in mind is: XXX_original_YYY.fastq - sample input XXX_as_sanger.fastq - reference output XXX_as_solexa.fastq - reference output XXX_as_illumina.fastq - reference output where XXX is some name (e.g. wrapped1, wrapped2, shortreads, longreads, sanger_full_range, solexa_full_range ...) and YYY is the FASTQ variant (sanger, solexa or illumina) for the "input" file. For example, we might have: wrapped1_original_sanger.fastq - A Sanger FASTQ using line wrapping, perhaps repeating the title on the plus lines wrapped1_as_sanger.fastq - The same data but using the consensus of no line wrapping and omitting the repeated title on the plus lines. wrapped1_as_solexa.fastq - As above, but converted in Solexa scores (ASCII offset 64), with capping at Solexa 62 (ASCII 126). wrapped1_as_illumina.fastq - As above, but converted to Illumina ASCII offset 64, with capping at PHRED 62 (ASCII 126). Here "wrapped1" would be a Sanger FASTQ file with some line wrapping (e.g. at 60 characters). I will include "sanger_full_range" which would cover all the valid PHRED scores from 0 to 93, and similarly for Solexa and Illumina files - these are important for testing the score conversions. I have some ideas for deliberately tricky (but valid) files which should properly test any parser. The point is we have "perhaps odd but valid" originals, plus the "cleaned up" versions (using the same FASTQ variant), and "cleaned up" versions in the other two FASTQ variants. Ideally asking Biopython/BioPerl/EMBOSS to convert the XXX_original_YYY.fastq files into any of the three FASTQ variants will give exactly the same as the reference outputs. If anyone has any comments or suggestions please speak up (e.g. my suggested naming conventions). 
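The Sanger to Illumina 1.3+ conversion with capping described above could be sketched as follows. This is only an illustration of the two offsets and the cap: the function name `sanger_to_illumina` is hypothetical, and the sketch caps silently rather than issuing a warning.

```python
SANGER_OFFSET = 33
ILLUMINA_OFFSET = 64
MAX_ASCII = 126  # last printable ASCII character, "~"


def sanger_to_illumina(quality):
    """Re-encode a Sanger quality string (offset 33) as Illumina 1.3+ (offset 64).

    PHRED scores above 62 are silently capped so that no output character
    goes past ASCII 126.
    """
    max_phred = MAX_ASCII - ILLUMINA_OFFSET  # 62
    converted = []
    for char in quality:
        phred = ord(char) - SANGER_OFFSET
        if phred < 0 or ord(char) > MAX_ASCII:
            raise ValueError("Invalid Sanger quality character: %r" % char)
        converted.append(chr(min(phred, max_phred) + ILLUMINA_OFFSET))
    return "".join(converted)


# PHRED 0, 40 and 93 in Sanger encoding are "!", "I" and "~".
print(sanger_to_illumina("!I~"))
```

A reference file such as wrapped1_as_illumina.fastq would then be the result of applying this kind of re-encoding to every quality line of the cleaned-up Sanger file.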
Real life examples of FASTQ files anyone has had trouble parsing (even with 3rd party tools) would be particularly useful - although we'd probably want to cut down big example files in order to keep the dataset to a reasonable size. Thanks, Peter From heuermh at acm.org Wed Aug 26 02:56:20 2009 From: heuermh at acm.org (Michael Heuer) Date: Tue, 25 Aug 2009 22:56:20 -0400 (EDT) Subject: [Open-bio-l] More FASTQ examples for cross project testing In-Reply-To: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com> Message-ID: Peter wrote: > Hi all, > > I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl) > off list about this plan. I'm going to co-ordinate putting together a > set of valid FASTQ files for shared testing (to supplement the > existing set of invalid FASTQ files already done and being used in > Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon). > > What I have in mind is: > > XXX_original_YYY.fastq - sample input > XXX_as_sanger.fastq - reference output > XXX_as_solexa.fastq - reference output > XXX_as_illumina.fastq - reference output > > where XXX is some name (e.g. wrapped1, wrapped2, shortreads, > longreads, sanger_full_range, solexa_full_range ...) and YYY is the > FASTQ variant (sanger, solexa or illumina) for the "input" file. > > For example, we might have: > > wrapped1_original_sanger.fastq - A Sanger FASTQ using line wrapping, > perhaps repeating the title on the plus lines > wrapped1_as_sanger.fastq - The same data but using the consensus of no > line wrapping and omitting the repeated title on the plus lines. > wrapped1_as_solexa.fastq - As above, but converted in Solexa scores > (ASCII offset 64), with capping at Solexa 62 (ASCII 126). > wrapped1_as_illumina.fastq - As above, but converted to Illumina ASCII > offset 64, with capping at PHRED 62 (ASCII 126). > > Here "wrapped1" would be a Sanger FASTQ file with some line wrapping > (e.g. at 60 characters). 
I will include "sanger_full_range" which > would cover all the valid PHRED scores from 0 to 93, and similarly for > Solexa and Illumina files - these are important for testing the score > conversions. I have some ideas for deliberately tricky (but valid) > files which should properly test any parser. > > The point is we have "perhaps odd but valid" originals, plus the > "cleaned up" versions (using the same FASTQ variant), and "cleaned up" > versions in the other two FASTQ variants. > > Ideally asking Biopython/BioPerl/EMBOSS to convert the > XXX_original_YYY.fastq files into any of the three FASTQ variants will > give exactly the same as the reference outputs. > > If anyone has any comments or suggestions please speak up (e.g. my > suggested naming conventions). Very cool idea, Peter, and Peter, and Chris. I don't believe anyone from biojava has spoken up on this thread yet, so I thought I should add that we are working towards a compatible implementation as well. michael From biopython at maubp.freeserve.co.uk Wed Aug 26 10:06:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 26 Aug 2009 11:06:39 +0100 Subject: [Open-bio-l] More FASTQ examples for cross project testing In-Reply-To: References: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com> Message-ID: <320fb6e00908260306o37eb3d80q53fc50dce708d0b@mail.gmail.com> On Wed, Aug 26, 2009 at 3:56 AM, Michael Heuer wrote: > Peter wrote: > >> Hi all, >> >> I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl) >> off list about this plan. I'm going to co-ordinate putting together a >> set of valid FASTQ files for shared testing (to supplement the >> existing set of invalid FASTQ files already done and being used in >> Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon). >> ... > > Very cool idea, Peter, and Peter, and Chris. 
I don't believe anyone from
> biojava has spoken up on this thread yet, so I thought I should add that
> we are working towards a compatible implementation as well.
>
>    michael

Hi Michael - we asked the BioJava guys a while back, and at the time there was interest but no volunteers. Who is working on this now?

Peter

From biopython at maubp.freeserve.co.uk Wed Aug 26 22:04:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 26 Aug 2009 23:04:18 +0100
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
References: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
Message-ID: <320fb6e00908261504l18c3ee5em4c90b0818fc2c844@mail.gmail.com>

On Tue, Aug 25, 2009 at 12:24 PM, Peter wrote:
> Hi all,
>
> I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
> off list about this plan. I'm going to co-ordinate putting together a
> set of valid FASTQ files for shared testing (to supplement the
> existing set of invalid FASTQ files already done and being used in
> Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).
>
> What I have in mind is:
>
> XXX_original_YYY.fastq - sample input
> XXX_as_sanger.fastq - reference output
> XXX_as_solexa.fastq - reference output
> XXX_as_illumina.fastq - reference output
>
> where XXX is some name (e.g. wrapped1, wrapped2, shortreads,
> longreads, sanger_full_range, solexa_full_range ...) and YYY is the
> FASTQ variant (sanger, solexa or illumina) for the "input" file.

I didn't want to clog up the mailing list with attachments, but just for the record, I've sent my first attempt at this to Peter (EMBOSS) and Chris (BioPerl) for comment (and checking). My earlier set of error_*.fastq files are in Biopython CVS/github and have since been copied to BioPerl SVN as well.
Peter From biopython at maubp.freeserve.co.uk Thu Aug 27 10:46:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 27 Aug 2009 11:46:17 +0100 Subject: [Open-bio-l] FASTQ in BioRuby? Message-ID: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com> Hello BioRuby team, I am one of the Biopython developers, and together with Peter Rice (EMBOSS) and Chris Fields (BioPerl) we have been coordinating how these Open Bioinformatics Foundation (OBF) projects will interpret the FASTQ file format used in next generation sequencing. This includes standardising our naming conventions for the original Sanger FASTQ variant, and the later Solexa/early Illumina, and recent Illumina 1.3+ variants. We have also put together a set of test files, including reference conversions between the different FASTQ variants. We would be delighted to get BioRuby involved. I tried to contact Naohisa Goto about this directly last month, but perhaps my email did not arrive. If BioRuby is working on (or planning to work on) FASTQ support, please could the developers concerned sign up to the OBF joint mailing list where we have been discussing this: http://lists.open-bio.org/mailman/listinfo/open-bio-l Thank you, Peter From ngoto at gen-info.osaka-u.ac.jp Thu Aug 27 11:20:46 2009 From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO) Date: Thu, 27 Aug 2009 20:20:46 +0900 Subject: [Open-bio-l] FASTQ in BioRuby? In-Reply-To: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com> References: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com> Message-ID: <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp> Hello Peter, sorry for responding too late. I've subscribed to open-bio-l, but I could not actively join to the discussions, because of lack of my knowledge about FASTQ. There is a small primitive code attempt to support FASTQ format in BioRuby, which is not yet merged in the main repository. 
http://github.com/ngoto/bioruby/tree/master Recently, Anthony Underwood contributed chromatgram classes to support SCF/ABI formats, which will be merged soon, after bug-fix maintenance release of 1.3.1. http://github.com/aunderwo/bioruby/tree/master I'm now planning to rewrite my FASTQ code to be consistent with the chromatgram classes, and with the open-bio standards. Thank you, Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org On Thu, 27 Aug 2009 11:46:17 +0100 Peter wrote: > Hello BioRuby team, > > I am one of the Biopython developers, and together with Peter Rice > (EMBOSS) and Chris Fields (BioPerl) we have been coordinating > how these Open Bioinformatics Foundation (OBF) projects will > interpret the FASTQ file format used in next generation sequencing. > > This includes standardising our naming conventions for the original > Sanger FASTQ variant, and the later Solexa/early Illumina, and > recent Illumina 1.3+ variants. We have also put together a set of > test files, including reference conversions between the different > FASTQ variants. > > We would be delighted to get BioRuby involved. I tried to contact > Naohisa Goto about this directly last month, but perhaps my email > did not arrive. If BioRuby is working on (or planning to work on) > FASTQ support, please could the developers concerned sign up > to the OBF joint mailing list where we have been discussing this: > http://lists.open-bio.org/mailman/listinfo/open-bio-l > > Thank you, > > Peter > _______________________________________________ > Open-Bio-l mailing list > Open-Bio-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/open-bio-l From biopython at maubp.freeserve.co.uk Thu Aug 27 12:08:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 27 Aug 2009 13:08:28 +0100 Subject: [Open-bio-l] FASTQ in BioRuby? 
In-Reply-To: <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp> References: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com> <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <320fb6e00908270508o485ba990k96b8bd3b722c09b6@mail.gmail.com> On Thu, Aug 27, 2009 at 12:20 PM, Naohisa GOTO wrote: > > Hello Peter, > > sorry for responding too late. I've subscribed to open-bio-l, > but I could not actively join to the discussions, because of > lack of my knowledge about FASTQ. > > There is a small primitive code attempt to support FASTQ format > in BioRuby, which is not yet merged in the main repository. > http://github.com/ngoto/bioruby/tree/master > > Recently, Anthony Underwood contributed chromatgram classes > to support SCF/ABI formats, which will be merged soon, > after bug-fix maintenance release of 1.3.1. > http://github.com/aunderwo/bioruby/tree/master > > I'm now planning to rewrite my FASTQ code to be consistent > with the chromatgram classes, and with the open-bio standards. > > Thank you, > > Naohisa Goto That is excellent news :) I'm not sure how format names work in BioRuby, but if you do have a set of format names as strings as we do in Biopython, BioPerl and EMBOSS it would be nice to be consistent here: http://biopython.org/wiki/SeqIO http://bioperl.org/wiki/HOWTO:SeqIO http://emboss.sourceforge.net/docs/themes/SequenceFormats.html There is some basic information on wikipedia, but this does not go into detail: http://www.bioperl.org/wiki/FASTQ_sequence_format Please feel free to ask any questions about how we are interpreting things. 
Thank you, Peter From biopython at maubp.freeserve.co.uk Thu Aug 27 15:26:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 27 Aug 2009 16:26:23 +0100 Subject: [Open-bio-l] Naming for FASTQ example files In-Reply-To: <320fb6e00908080553h115ab748jecc820d14cb5e524@mail.gmail.com> References: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com> <320fb6e00908080553h115ab748jecc820d14cb5e524@mail.gmail.com> Message-ID: <320fb6e00908270826i729cfdd6o1fdc56f47e5f3c02@mail.gmail.com> On Sat, Aug 8, 2009 at 1:53 PM, Peter wrote: > On Thu, Aug 6, 2009 at 9:17 AM, Peter wrote: >> Hi all, >> >> I am planning on compiling a set of set FASTQ files, for use by >> Biopython, BioPerl, EMBOSS and anyone else that wants to test a >> parser. Modest size contributions will be welcome (no big files >> though). >> >> I will have two types of files: valid ones, and invalid ones. The >> basic idea is any parser should understand what we consider to be >> valid files (we may need to provide matching FASTA and QUAL files or >> something like this for verification), but also reject all the files >> we consider to be invalid. >> ... > > I've gone for "error_*.fastq" and have tried to use meaningful names > rather than numbers. Currently these files are only in the Biopython > repository (under biopython/Tests/Quality), but could be added to the > (currently) unused Biodata repository - although that is still on CVS: > > http://lists.open-bio.org/pipermail/open-bio-l/2009-January/000511.html > > As these examples are all small and we don't expect to change them, > I could also just email them (off the mailing list) to EMBOSS/BioPerl > people directly on request. Chris Fields has already included the original "error_*.fastq" files in BioPerl SVN as test cases. 
Peter Rice has pointed out a minor error in "error_short_qual.fastq" which I have now corrected (it had a short sequence, not a short quality line), and after discussion we have come up with a few more truncation examples:

error_trunc_in_title.fastq
error_trunc_in_seq.fastq
error_trunc_in_plus.fastq
error_trunc_in_qual.fastq

Again, you can grab these five files (four new, one updated) from Biopython CVS/git, and I will also be emailing Chris & Peter R directly.

Peter C.

From biopython at maubp.freeserve.co.uk Thu Aug 27 16:40:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 27 Aug 2009 17:40:21 +0100
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To:
References: <320fb6e00908260306o37eb3d80q53fc50dce708d0b@mail.gmail.com>
Message-ID: <320fb6e00908270940r495aecc1n833aa9a28e8f3db3@mail.gmail.com>

On Thu, Aug 27, 2009 at 5:31 PM, Michael Heuer wrote:
>
> Peter wrote:
>> Hi Michael - we asked the BioJava guys a while back, and at the time
>> there was interest but no volunteers. Who is working on this now?
>
> Perhaps I should have kept quiet -- I think I just volunteered. ;)
>
>    michael

Assuming you're serious, great :)

Peter

From heuermh at acm.org Thu Aug 27 16:31:43 2009
From: heuermh at acm.org (Michael Heuer)
Date: Thu, 27 Aug 2009 12:31:43 -0400 (EDT)
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <320fb6e00908260306o37eb3d80q53fc50dce708d0b@mail.gmail.com>
Message-ID:

Peter wrote:

> On Wed, Aug 26, 2009 at 3:56 AM, Michael Heuer wrote:
> > Peter wrote:
> >
> >> Hi all,
> >>
> >> I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
> >> off list about this plan. I'm going to co-ordinate putting together a
> >> set of valid FASTQ files for shared testing (to supplement the
> >> existing set of invalid FASTQ files already done and being used in
> >> Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).
> >> ...
> >
> > Very cool idea, Peter, and Peter, and Chris. I don't believe anyone from
> > biojava has spoken up on this thread yet, so I thought I should add that
> > we are working towards a compatible implementation as well.
> >
> >    michael
>
> Hi Michael - we asked the BioJava guys a while back, and at the time
> there was interest but no volunteers. Who is working on this now?

Perhaps I should have kept quiet -- I think I just volunteered. ;)

   michael

From biopython at maubp.freeserve.co.uk Mon Aug 31 12:07:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 13:07:45 +0100
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
Message-ID: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>

Hi all,

I'm looking at indexing next generation sequence files for Biopython (e.g. FASTQ short read files with 10s of millions of entries), where even just holding the record names and their file offsets in memory is beginning to be a bottleneck.

What is the current status of Open Biological Database Access (OBDA), and in particular the index files for sequence "flat files" like FASTA or GenBank (or FASTQ)?

http://www.bioperl.org/wiki/HOWTO:Flat_databases
http://www.bioperl.org/wiki/HOWTO:OBDA
http://obda.open-bio.org/

The spec files are still in CVS (and ViewCVS is still broken since the recent server move), rather than having been migrated to SVN which may suggest things are obsolete (or on the bright side, stable).

Presumably BioPerl still uses these index files? What about the other projects? I know EMBOSS has some indexing system for example but I have no idea how it works internally.

Thanks,

Peter

From ngoto at gen-info.osaka-u.ac.jp Mon Aug 31 14:01:46 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 31 Aug 2009 23:01:46 +0900
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
Message-ID: <20090831140146.EAF851CBC594@idnmail.gen-info.osaka-u.ac.jp>

Hi Peter,

On Mon, 31 Aug 2009 13:07:45 +0100 Peter wrote:
> Hi all,
>
> I'm looking at indexing next generation sequence files for Biopython
> (e.g. FASTQ short read files with 10s of millions of entries), where
> even just holding the record names and their file offsets in memory
> is beginning to be a bottleneck.
>
> What is the current status of Open Biological Database Access (OBDA),
> and in particular the index files for sequence "flat files" like FASTA or
> GenBank (or FASTQ)?
>
> http://www.bioperl.org/wiki/HOWTO:Flat_databases
> http://www.bioperl.org/wiki/HOWTO:OBDA
> http://obda.open-bio.org/
>
> The spec files are still in CVS (and ViewCVS is still broken since
> the recent server move), rather than having been migrated to SVN
> which may suggest things are obsolete (or on the bright side, stable).
>
> Presumably BioPerl still uses these index files? What about the
> other projects? I know EMBOSS has some indexing system for
> example but I have no idea how it works internally.

BioRuby still uses them. To gain performance, names and offsets are written to temporary files and sorted using an external sort program (default /usr/bin/sort).

In BioRuby, the flatfile-only solution works fine, but BerkeleyDB indexes would be incompatible with other projects, because of confusion in the spec, discussed in BioPerl Bugzilla Bug #2337.
http://bugzilla.open-bio.org/show_bug.cgi?id=2337

Thanks,

-- 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From biopython at maubp.freeserve.co.uk Mon Aug 31 15:07:28 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 16:07:28 +0100
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <20090831140146.EAF851CBC594@idnmail.gen-info.osaka-u.ac.jp>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com> <20090831140146.EAF851CBC594@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <320fb6e00908310807j2fbc9710x3621a16a54ad5e45@mail.gmail.com>

On Mon, Aug 31, 2009 at 3:01 PM, Naohisa GOTO wrote:
> Hi Peter,
>
>> Presumably BioPerl still uses these index files? What about the
>> other projects? I know EMBOSS has some indexing system for
>> example but I have no idea how it works internally.
>
> BioRuby still uses them. To gain performance, names and offsets are
> written to temporary files and sorted using an external sort program
> (default /usr/bin/sort).

That makes sense. Have you tried this on very large files? e.g. FASTA with 10 million short reads?

> In BioRuby, the flatfile-only solution works fine, but BerkeleyDB indexes
> would be incompatible with other projects, because of confusion in
> the spec, discussed in BioPerl Bugzilla Bug #2337.
> http://bugzilla.open-bio.org/show_bug.cgi?id=2337

Thank you for the link to that bug - I'll need to read that carefully.

Peter

From biopython at maubp.freeserve.co.uk Mon Aug 31 15:45:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 16:45:51 +0100
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com> <8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu>
Message-ID: <320fb6e00908310845n3322d2c6hc992f78b81c0e226@mail.gmail.com>

On Mon, Aug 31, 2009 at 4:33 PM, Chris Fields wrote:
>
> I don't use OBDA, personally, but I can check on the status with Brian
> Osborne (he was heading it up last I checked). However, I don't think
> BioPerl has an OBDA FASTQ parser.
>
> You may be thinking about Bio::Index::FASTQ? That one is not OBDA,
> but just a simple flat file indexer.
> We could probably set an OBDA parser
> up fairly easily if needed.

I didn't know if Bio::Index was using OBDA "under the hood" or not. Does this mean BioPerl has multiple indexing systems available?

As I noted on Bug 2337 earlier today, Biopython used to have some sort of OBDA compliant indexing, but for unrelated reasons we have deprecated and removed that code. We're now revisiting this topic due in part to having to deal with ever larger data files - and I wanted to see if OBDA was still "alive" as a standard, and furthermore how well it had scaled for the other OBF projects.

Peter

From cjfields at illinois.edu Mon Aug 31 15:33:02 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 31 Aug 2009 10:33:02 -0500
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
Message-ID: <8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu>

On Aug 31, 2009, at 7:07 AM, Peter wrote:
> Hi all,
>
> I'm looking at indexing next generation sequence files for Biopython
> (e.g. FASTQ short read files with 10s of millions of entries), where
> even just holding the record names and their file offsets in memory
> is beginning to be a bottleneck.
>
> What is the current status of Open Biological Database Access (OBDA),
> and in particular the index files for sequence "flat files" like FASTA or
> GenBank (or FASTQ)?
>
> http://www.bioperl.org/wiki/HOWTO:Flat_databases
> http://www.bioperl.org/wiki/HOWTO:OBDA
> http://obda.open-bio.org/
>
> The spec files are still in CVS (and ViewCVS is still broken since
> the recent server move), rather than having been migrated to SVN
> which may suggest things are obsolete (or on the bright side, stable).
>
> Presumably BioPerl still uses these index files? What about the
> other projects? I know EMBOSS has some indexing system for
> example but I have no idea how it works internally.
>
> Thanks,
>
> Peter

I don't use OBDA, personally, but I can check on the status with Brian Osborne (he was heading it up last I checked). However, I don't think BioPerl has an OBDA FASTQ parser.

You may be thinking about Bio::Index::FASTQ? That one is not OBDA, but just a simple flat file indexer. We could probably set an OBDA parser up fairly easily if needed.

chris

From cjfields at illinois.edu Mon Aug 31 18:22:36 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 31 Aug 2009 13:22:36 -0500
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <320fb6e00908310845n3322d2c6hc992f78b81c0e226@mail.gmail.com>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com> <8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu> <320fb6e00908310845n3322d2c6hc992f78b81c0e226@mail.gmail.com>
Message-ID:

On Aug 31, 2009, at 10:45 AM, Peter wrote:
> On Mon, Aug 31, 2009 at 4:33 PM, Chris Fields wrote:
>>
>> I don't use OBDA, personally, but I can check on the status with Brian
>> Osborne (he was heading it up last I checked). However, I don't think
>> BioPerl has an OBDA FASTQ parser.
>>
>> You may be thinking about Bio::Index::FASTQ? That one is not OBDA,
>> but just a simple flat file indexer. We could probably set an OBDA parser
>> up fairly easily if needed.
>
> I didn't know if Bio::Index was using OBDA "under the hood" or not.
> Does this mean BioPerl has multiple indexing systems available?

Yes. We have Bio::Index::* and Bio::DB::Flat (which I think is OBDA). There is also the older Bio::DB::Fasta, which is actually still in wide use. Note with Bio::Index::* we allow streaming of any report type (sequence, alignment, analysis like BLAST, etc).

We have talked about switching many of the Bio::Index::* sequence-based ones to OBDA but I haven't seen anyone take that up.

> As I noted on Bug 2337 earlier today, Biopython used to have some
> sort of OBDA compliant indexing, but for unrelated reasons we have
> deprecated and removed that code. We're now revisiting this topic
> due in part to having to deal with ever larger data files - and I wanted
> to see if OBDA was still "alive" as a standard, and furthermore how
> well it had scaled for the other OBF projects.
>
> Peter

I think it's still alive and being used, just not sure what the compliance level is amongst the different Bio* projects.

chris
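
For readers following the thread, the "record names and their file offsets" idea can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the OBDA index format and not code from Biopython, BioPerl or BioRuby: build_fastq_index and fetch_record are hypothetical helper names, records are assumed to be unwrapped (exactly four lines each, no blank lines), and the index is a plain in-memory dict, which is precisely the part that stops scaling at tens of millions of reads.

```python
import io


def build_fastq_index(handle):
    """Map each record name to the byte offset of its '@' title line.

    Hypothetical helper. Assumes a binary, seekable handle and strict
    four-line FASTQ records (no line wrapping, no blank lines).
    """
    index = {}
    while True:
        offset = handle.tell()
        title = handle.readline()
        if not title:
            break  # clean end of file
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        fields = title[1:].split(None, 1)
        if not (title.startswith(b"@") and fields and plus.startswith(b"+")):
            raise ValueError("Malformed or truncated record at offset %d" % offset)
        # A missing quality line, or one whose length differs from the
        # sequence, is treated as a truncated or invalid record.
        if not qual or len(seq.rstrip(b"\r\n")) != len(qual.rstrip(b"\r\n")):
            raise ValueError("Truncated or invalid record at offset %d" % offset)
        name = fields[0].decode("ascii")
        if name in index:
            raise ValueError("Duplicate record name %r" % name)
        index[name] = offset
    return index


def fetch_record(handle, index, name):
    """Seek to a record by name and return its four lines as strings."""
    handle.seek(index[name])
    return [handle.readline().rstrip(b"\r\n").decode("ascii") for _ in range(4)]


if __name__ == "__main__":
    # Made-up two-record example, not one of the shared test files.
    data = b"@read1\nACGT\n+\n!!!!\n@read2 desc\nGG\n+read2\n;;\n"
    handle = io.BytesIO(data)
    index = build_fastq_index(handle)
    print(index)
    print(fetch_record(handle, index, "read2"))
```

For very large files the dict would have to be replaced by something on disk, for example name/offset pairs written to a temporary file and sorted externally as described for BioRuby above, or a BerkeleyDB table; this sketch does not attempt either.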