From charles-listes+open-bio at plessy.org  Sat Aug  1 21:25:37 2009
From: charles-listes+open-bio at plessy.org (Charles Plessy)
Date: Sun, 2 Aug 2009 10:25:37 +0900
Subject: [Open-bio-l] FASTQ identifiers
In-Reply-To: <320fb6e00907310215n6ed33a17id8a5ff5913c9d3b5@mail.gmail.com>
References: <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com>
	<320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com>
	<320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com>
	<E5FA39FD-799D-4FF0-9117-D3186FF95FB2@illinois.edu>
	<320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com>
	<BE6B3A71-3130-4BD9-96C1-FD1A09C6F4CC@illinois.edu>
	<320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com>
	<24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com>
	<4A72A8F9.9020903@ebi.ac.uk>
	<320fb6e00907310215n6ed33a17id8a5ff5913c9d3b5@mail.gmail.com>
Message-ID: <20090802012537.GD2479@kunpuu.plessy.org>

On Fri, Jul 31, 2009 at 10:15:57AM +0100, Peter wrote:
> The situation is similar to the FASTA format (and others), in that there
> are a number of reasonably well documented conventions in use
> (e.g. the NCBI FASTA identifiers with | characters). However, equally,
> there are thousands of ad hoc local conventions.

Hello,

I would just like to mention one such ad hoc convention in use at my workplace:
with FASTQ sequences we sometimes replace the original name by the sequence
itself. This can be useful for instance to troubleshoot some sequence
manipulations.

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
;;3;;;;;;;;;;;;7;;;;;;;88

becomes:

@CCCTTCTTGTCTTCAGCGTTTCTCC
CCCTTCTTGTCTTCAGCGTTTCTCC
+CCCTTCTTGTCTTCAGCGTTTCTCC
;;3;;;;;;;;;;;;7;;;;;;;88

and after some arbitrary trimming at the ends:

@CCCTTCTTGTCTTCAGCGTTTCTCC
TTCTTGTCTTCAGCGTTTCT
+CCCTTCTTGTCTTCAGCGTTTCTCC
;;;;;;;;;;;;7;;;;;;;
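
The renaming and trimming steps shown above can be sketched in a few lines of Python. This is only an illustration: the FastqRecord tuple and the function names rename_by_sequence, trim and format_fastq are assumptions made for this example, not part of any existing toolkit.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    """One FASTQ record: title (without '@'), sequence, quality string."""
    title: str
    seq: str
    qual: str


def rename_by_sequence(rec: FastqRecord) -> FastqRecord:
    """Replace the read name with the read's own (untrimmed) sequence."""
    return rec._replace(title=rec.seq)


def trim(rec: FastqRecord, left: int, right: int) -> FastqRecord:
    """Remove `left` bases from the start and `right` from the end.

    The title is left alone, so it still shows the original sequence.
    """
    end = len(rec.seq) - right
    return rec._replace(seq=rec.seq[left:end], qual=rec.qual[left:end])


def format_fastq(rec: FastqRecord) -> str:
    """Write the record with the title repeated on the '+' line."""
    return f"@{rec.title}\n{rec.seq}\n+{rec.title}\n{rec.qual}\n"


original = FastqRecord(
    "EAS54_6_R1_2_1_413_324",
    "CCCTTCTTGTCTTCAGCGTTTCTCC",
    ";;3;;;;;;;;;;;;7;;;;;;;88",
)
renamed = rename_by_sequence(original)
trimmed = trim(renamed, left=3, right=2)
print(format_fastq(trimmed), end="")
```

Because the title keeps the untrimmed sequence, a trimmed record can be compared against what it looked like before the manipulation without going back to the original file.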


With FASTA format, we sometimes eliminate redundant sequences and record how
many times they occurred by adding the count to the name.

For instance:

>seq1
AAATTT
>seq2
AAATAT
>seq3
AAATTT

becomes:

>AAATTT_2
AAATTT
>AAATAT_1
AAATAT

If this is popular elsewhere, it may be useful to implement functions that
allow doing this efficiently.
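
As a rough idea of what such a function could look like, here is a minimal Python sketch of the FASTA collapsing step. The name collapse_duplicates and the (name, sequence) tuple layout are assumptions for this example. It keeps every distinct sequence in memory, so it is only a starting point for large read sets.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def collapse_duplicates(
    records: Iterable[Tuple[str, str]]
) -> Iterator[Tuple[str, str]]:
    """Yield (new_name, sequence) pairs, one per distinct sequence.

    The new name is "<sequence>_<count>". Sequences come out in order
    of first appearance (Counter preserves insertion order).
    """
    counts = Counter(seq for _name, seq in records)
    for seq, count in counts.items():
        yield f"{seq}_{count}", seq


records = [("seq1", "AAATTT"), ("seq2", "AAATAT"), ("seq3", "AAATTT")]
for name, seq in collapse_duplicates(records):
    print(f">{name}\n{seq}")
```

With the three example records this prints the two collapsed entries shown above, AAATTT_2 and AAATAT_1.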

Have a nice day,

-- 
Charles Plessy
Tsurumi, Kanagawa, Japan

From biopython at maubp.freeserve.co.uk  Mon Aug  3 05:30:09 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 3 Aug 2009 10:30:09 +0100
Subject: [Open-bio-l] FASTQ identifiers
In-Reply-To: <20090802012537.GD2479@kunpuu.plessy.org>
References: <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com>
	<320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com>
	<E5FA39FD-799D-4FF0-9117-D3186FF95FB2@illinois.edu>
	<320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com>
	<BE6B3A71-3130-4BD9-96C1-FD1A09C6F4CC@illinois.edu>
	<320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com>
	<24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com>
	<4A72A8F9.9020903@ebi.ac.uk>
	<320fb6e00907310215n6ed33a17id8a5ff5913c9d3b5@mail.gmail.com>
	<20090802012537.GD2479@kunpuu.plessy.org>
Message-ID: <320fb6e00908030230x52bf32a8o3b640ce8d0a76b8@mail.gmail.com>

On Sun, Aug 2, 2009 at 2:25 AM, Charles
Plessy<charles-listes+open-bio at plessy.org> wrote:
> On Fri, Jul 31, 2009 at 10:15:57AM +0100, Peter wrote:
>> The situation is similar to the FASTA format (and others), in that there
>> are a number of reasonably well documented conventions in use
>> (e.g. the NCBI FASTA identifiers with | characters). However, equally,
>> there are thousands of ad hoc local conventions.
>
> Hello,
>
> I would just like to mention one such ad hoc convention in use at my
> workplace: with FASTQ sequences we sometimes replace the original
> name by the sequence itself. This can be useful for instance to
> troubleshoot some sequence manipulations.
>
> @EAS54_6_R1_2_1_413_324
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +EAS54_6_R1_2_1_413_324
> ;;3;;;;;;;;;;;;7;;;;;;;88
>
> becomes:
>
> @CCCTTCTTGTCTTCAGCGTTTCTCC
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +CCCTTCTTGTCTTCAGCGTTTCTCC
> ;;3;;;;;;;;;;;;7;;;;;;;88
>

That certainly demonstrates we can't make any big assumptions
about the title line formatting ;)

Your example is interesting - but I don't quite understand why you
do this. Surely any debug message or output file for bad reads
would (normally) have a unique read ID which (indirectly) tells
you the read sequence? If you are writing the code which gives
these error messages, can't you explicitly give the read sequence?
Is the aim to be able to look at error messages from third party
tools (which just give the read name) and see the read sequence
directly (without looking up the read name in the original FASTQ
file)?

This is similar in some ways to my comment that I could see a real
use for FASTQ (and FASTA) files with no record identifiers:

>> Related to this, what about the corner case of reads with NO
>> identifier? The FASTQ (and indeed the FASTA) formats can
>> hold such things - just use a blank title line. In the case of
>> next generation sequencing reads, the names themselves
>> are not actually that important - so you can imagine a pipeline
>> which doesn't actually bother with them at all.

In your pipeline you clearly don't care about the original FASTQ
identifiers, and (if the pipeline would accept it), using blank title
lines might also work (and would certainly save disk space).

Peter


From cjfields at illinois.edu  Wed Aug  5 11:12:18 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 5 Aug 2009 10:12:18 -0500
Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS
In-Reply-To: <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com>
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
	<320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com>
	<320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com>
	<32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu>
	<320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com>
	<320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com>
	<320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com>
Message-ID: <3C298ABC-07CD-4597-BA83-F8F5992BF73A@illinois.edu>


On Jul 29, 2009, at 5:15 AM, Peter wrote:

> Hi all,
>
> This is a follow up to the earlier discussion about high quality scores
> in Solexa or Illumina 1.3+ FASTQ files and the problem of non printable
> ASCII codes (which can occur if converting from Sanger FASTQ).
>
>> On Sat, Jul 25, 2009 at 8:50 PM, Chris  
>> Fields<cjfields at illinois.edu> wrote:
>>>
>>>> Now, here comes the problem. I believe FASTQ files directly
>>>> from an Illumina 1.3+ pipeline will have PHRED scores in the
>>>> range 0 to 40 (as in this example). However, much higher
>>>> PHRED scores are possible during assembly / contig'ing
>>>> and read mapping. For example, the tool MAQ will output
>>>> Sanger style FASTQ files with PHRED scores in the range
>>>> 0 to 93 inclusive.
>>>
>>> We can support it as Illumina 1.3, but my point is this may be getting
>>> into a grey area and may be something that Illumina doesn't/wouldn't
>>> support. Reminds me a little of the multiple GFF2 variations (one of
>>> the main reasons for a GFF3).
>>
>> I agree this is a grey area (high scores in Solexa/Illumina
>> FASTQ files).
>>
>> ...
>>
>> i.e. An Illumina FASTQ format file can hold PHRED scores in the
>> range 0 to 62 without using problem characters. And likewise
>> for a Solexa FASTQ file (Solexa scores up to 62).
>
> Peter Rice and I have been talking about this off list, and have
> a proposal for the high score problem. Basically we want to
> restrict FASTQ quality strings to printable ASCII, which means
> 126 (0x7e) is a firm upper limit, while otherwise allowing for as
> high scores as possible. This limit comes from ASCII 127 being
> "delete", and the even higher characters also being non-printable.
>
> i.e. We are suggesting:
>
> "fastq-sanger" - Allows PHRED scores 0 to 93 inclusive, mapped
> with an ASCII offset of 33 to ASCII characters 33 to 126 (or in hex,
> 0x21 to 0x7e). This is as defined on the MAQ web pages.
>
> "fastq-illumina" - Believed to use at least PHRED scores 0 to 40,
> mapped with an ASCII offset of 64 to ASCII characters 64 to 104
> (or in hex, 0x40 to 0x68). It is a reasonable and well defined
> extension to permit PHRED scores from 0 to 62 inclusive, which
> map to ASCII 64 to 126 (or in hex 0x40 to 0x7e). This avoids the
> non printing characters, and gives some head room for improved
> sequencing technology from Illumina giving higher raw scores.
>
> "fastq-solexa" - Believed to use Solexa scores from -5 to at least
> 40, again mapped with an ASCII offset of 64 giving ASCII characters
> 59 to 104 (or in hex, 0x3b to 0x68). Again, a reasonable and well
> defined extension would permit Solexa scores in the range -5 to 62
> inclusive, using ASCII 59 to 126 (or in hex, 0x3b to 0x7e).
>
> [Peter R. - please correct me if any of the above is not what you had
> in mind]
>
> If in the process of converting between formats, a quality score
> is too high (it would result in ASCII 127 or higher), then I would
> argue any of the following would be acceptable:
> (a) Silently impose the maximum score (ASCII 126, 0x7e)
> (b) Impose the maximum score with a warning
> (c) Raise an error
>
> I don't think EMBOSS, BioPerl and Biopython have to handle
> this exactly the same way, but I would favour (b) then (a).
>
> Peter

I think, based on Aaron's comments, with BioPerl we'll adopt (b) to
deal with format validation, but try to do it in a way that 'caches'
bad data so it doesn't report a warning on every out-of-range value.
I am planning on a Moose-based parser at some point that will do the
same.
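
To make option (b) concrete, here is a minimal Python sketch of capping with a single summary warning. It is not BioPerl's implementation. The offsets and upper limits come from the proposal quoted above; the names VARIANTS and encode_quality, the choice to warn once per call, and the ValueError for scores below the minimum are assumptions made for this example.

```python
import warnings

# ASCII offset and allowed score range for each proposed variant.
# For "fastq-solexa" the scores are Solexa scores, otherwise PHRED.
VARIANTS = {
    "fastq-sanger": {"offset": 33, "min": 0, "max": 93},
    "fastq-illumina": {"offset": 64, "min": 0, "max": 62},
    "fastq-solexa": {"offset": 64, "min": -5, "max": 62},
}


def encode_quality(scores, variant):
    """Encode integer scores as a quality string, capping at ASCII 126.

    Scores above the variant's maximum are capped and counted, and one
    summary warning is issued per call instead of one per value.
    """
    spec = VARIANTS[variant]
    capped = 0
    chars = []
    for score in scores:
        if score < spec["min"]:
            # Assumption: a score below the minimum is a hard error.
            raise ValueError(f"score {score} is below the {variant} minimum")
        if score > spec["max"]:
            score = spec["max"]
            capped += 1
        chars.append(chr(score + spec["offset"]))
    if capped:
        warnings.warn(
            f"{capped} score(s) above {spec['max']} were capped for {variant}"
        )
    return "".join(chars)
```

For example, encode_quality([0, 40, 62, 93], "fastq-illumina") returns "@h~~" and emits one warning saying that one score was capped.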

chris


From biopython at maubp.freeserve.co.uk  Wed Aug  5 12:01:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 5 Aug 2009 17:01:32 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
Message-ID: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>

Hi all,

Another FASTQ issue to debate: Should we care what case the sequence
strings are? I've never seen anything written down, but all the
examples I recall used upper case. But there is nothing to stop people
using mixed case, is there?

With FASTA on the other hand, while all uppercase is most common,
mixed case has its uses (e.g. representing trimmed regions, or low
quality scores).

I would suggest that OBF tools all treat the sequence in FASTQ files
as is, and preserve the case on output.

Any thoughts?

Peter

From dan.bolser at gmail.com  Wed Aug  5 12:50:56 2009
From: dan.bolser at gmail.com (Dan Bolser)
Date: Wed, 5 Aug 2009 17:50:56 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
Message-ID: <2c8757af0908050950y97863fcj2b4deda1b8bb37c8@mail.gmail.com>

2009/8/5 Peter <biopython at maubp.freeserve.co.uk>:
> Hi all,
>
> Another FASTQ issue to debate: Should we care what case the sequence
> strings are? I've never seen anything written down, but all the
> examples I recall used upper case. But there is nothing to stop people
> using mixed case, is there?
>
> With FASTA on the other hand, while all uppercase is most common,
> mixed case has its uses (e.g. representing trimmed regions, or low
> quality scores).
>
> I would suggest that OBF tools all treat the sequence in FASTQ files
> as is, and preserve the case on output.
>
> Any thoughts?

Agree.


> Peter
> _______________________________________________
> Open-Bio-l mailing list
> Open-Bio-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/open-bio-l
>

From biopython at maubp.freeserve.co.uk  Thu Aug  6 04:17:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 6 Aug 2009 09:17:07 +0100
Subject: [Open-bio-l] Naming for FASTQ example files
Message-ID: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com>

Hi all,

I am planning on compiling a set of FASTQ files, for use by
Biopython, BioPerl, EMBOSS and anyone else that wants to test a
parser. Modest size contributions will be welcome (no big files
though).

I will have two types of files: valid ones, and invalid ones. The
basic idea is any parser should understand what we consider to be
valid files (we may need to provide matching FASTA and QUAL files or
something like this for verification), but also reject all the files
we consider to be invalid.

Regarding names, does "error_*.fastq" or "invalid_*.fastq" sound fine?

Any preference for meaningful names ("error_qual_short.fastq",
"error_qual_bad_char.fastq", ...) versus numbers ("error_001.fastq",
"error_002.fastq", ...). Either way I think a README file would need
to accompany the dataset stating what we think makes each example
invalid (e.g. quality string shorter than sequence, invalid character
in quality string, ...).

Peter

From biopython at maubp.freeserve.co.uk  Sat Aug  8 08:53:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 8 Aug 2009 13:53:17 +0100
Subject: [Open-bio-l] Naming for FASTQ example files
In-Reply-To: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com>
References: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com>
Message-ID: <320fb6e00908080553h115ab748jecc820d14cb5e524@mail.gmail.com>

On Thu, Aug 6, 2009 at 9:17 AM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> I am planning on compiling a set of FASTQ files, for use by
> Biopython, BioPerl, EMBOSS and anyone else that wants to test a
> parser. Modest size contributions will be welcome (no big files
> though).
>
> I will have two types of files: valid ones, and invalid ones. The
> basic idea is any parser should understand what we consider to be
> valid files (we may need to provide matching FASTA and QUAL files or
> something like this for verification), but also reject all the files
> we consider to be invalid.
>
> Regarding names, does "error_*.fastq" or "invalid_*.fastq" sound fine?
>
> Any preference for meaningful names ("error_qual_short.fastq",
> "error_qual_bad_char.fastq", ...) versus numbers ("error_001.fastq",
> "error_002.fastq", ...). Either way I think a README file would need
> to accompany the dataset stating what we think makes each example
> invalid (e.g. quality string shorter than sequence, invalid character
> in quality string, ...).

I've gone for "error_*.fastq" and have tried to use meaningful names
rather than numbers. Currently these files are only in the Biopython
repository (under biopython/Tests/Quality), but could be added to the
(currently) unused Biodata repository - although that is still on CVS:

http://lists.open-bio.org/pipermail/open-bio-l/2009-January/000511.html

As these examples are all small and we don't expect to change them,
I could also just email them (off the mailing list) to EMBOSS/BioPerl
people directly on request.

Currently my error examples are as follows, broken down into groups.

Quality strings with invalid ASCII characters (not the full set, but
we could do that):

error_qual_null.fastq
error_qual_vtab.fastq
error_qual_tab.fastq
error_qual_escape.fastq
error_qual_unit_sep.fastq
error_qual_space.fastq
error_qual_del.fastq

Misc errors:

error_diff_ids.fastq
error_spaces.fastq
error_tabs.fastq
error_short_qual.fastq
error_long_qual.fastq
error_no_qual.fastq

Simulated truncation part way through a file:

error_trunc_at_plus.fastq
error_trunc_at_qual.fastq
error_trunc_at_seq.fastq

Note they are all based on the same example file which due to the
quality characters can be interpreted as any of the three FASTQ
variants we're supporting (Sanger, Solexa, Illumina 1.3+). This was
deliberate. Additional examples of files which could be Sanger or
Solexa but not Illumina 1.3+ (or valid Sanger but can't be Solexa or
Illumina 1.3+) are also a good idea.

Note that in many of these examples the error is part way into the
file, so there are initially some valid reads and then an error.
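
A shared test over these files could be as simple as the following Python sketch. The helper name check_error_files is hypothetical, and the sketch assumes the parser under test signals an invalid file by raising ValueError; other projects would substitute their own error type.

```python
import glob
import os


def check_error_files(directory, parse):
    """Return the names of error_*.fastq files that `parse` accepted.

    `parse` is any callable taking a file path; it is expected to raise
    ValueError for an invalid file. An empty result means every error
    example was rejected.
    """
    accepted = []
    for path in sorted(glob.glob(os.path.join(directory, "error_*.fastq"))):
        try:
            parse(path)
        except ValueError:
            continue  # rejected, as it should be
        accepted.append(os.path.basename(path))
    return accepted
```

Since the error is often part way into the file, `parse` has to read every record before returning. With a lazy, iterator-based parser that means consuming the whole iterator (for example by wrapping it in list), otherwise the leading valid reads would hide the problem.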

Peter

From biopython at maubp.freeserve.co.uk  Sat Aug  8 14:56:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 8 Aug 2009 19:56:32 +0100
Subject: [Open-bio-l] White space in FASTQ files?
Message-ID: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com>

Hi all,

Other than the special case of new lines which we have already covered
(allowed but line wrapping is discouraged), should FASTQ sequence
lines (and indeed the quality lines) ever be allowed to include white
space (e.g. spaces and tabs)? I've never seen this in a real FASTQ
file, and would like to suggest this be considered an error.

Comments? Counter suggestions?

Peter

From pmr at ebi.ac.uk  Mon Aug 10 09:02:47 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Mon, 10 Aug 2009 14:02:47 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
Message-ID: <4A801A77.20704@ebi.ac.uk>

Peter C. wrote:

> I would suggest that OBF tools all treat the sequence in FASTQ files
> as is, and preserve the case on output.
> 
> Any thoughts?

EMBOSS does that with all sequence formats. The case of the original
sequence is preserved and reproduced on output. We have not specified
upper or lower case only for any of our current output formats.

We provide command line options to force sequences to be converted to
upper or lower case if the user wants to specify one or the other -
usually just to convert sequences for post processing by some other tool.

regards,

Peter Rice

From biopython at maubp.freeserve.co.uk  Mon Aug 10 09:06:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 10 Aug 2009 14:06:23 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <4A801A77.20704@ebi.ac.uk>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
	<4A801A77.20704@ebi.ac.uk>
Message-ID: <320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com>

On Mon, Aug 10, 2009 at 2:02 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
> Peter C. wrote:
>
>> I would suggest that OBF tools all treat the sequence in FASTQ files
>> as is, and preserve the case on output.
>>
>> Any thoughts?
>
> EMBOSS does that with all sequence formats. The case of the original
> sequence is preserved and reproduced on output. We have not specified
> upper or lower case only for any of our current output formats.
>
> We provide command line options to force sequences to be converted to
> upper or lower case if the user wants to specify one or the other -
> usually just to convert sequences for post processing by some other tool.

Cool. It looks like we are on the same wavelength here :)

Peter

From pmr at ebi.ac.uk  Mon Aug 10 09:09:14 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Mon, 10 Aug 2009 14:09:14 +0100
Subject: [Open-bio-l] White space in FASTQ files?
In-Reply-To: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com>
References: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com>
Message-ID: <4A801BFA.2040208@ebi.ac.uk>

Peter C. wrote:
> Other than the special case of new lines which we have already covered
> (allowed but line wrapping is discouraged), should FASTQ sequence
> lines (and indeed the quality lines) ever be allowed to include white
> space (e.g. spaces and tabs)? I've never seen this in a real FASTQ
> file, and would like to suggest this be considered an error.
> 
> Comments? Counter suggestions?

I am happy adding a warning message in EMBOSS for this.

If we add too many warning messages then we could break our plan to
issue one message and follow with "and another 999999 up to ..." if we
find ourselves issuing more than one warning per sequence.

regards,

Peter Rice

From biopython at maubp.freeserve.co.uk  Mon Aug 10 09:36:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 10 Aug 2009 14:36:26 +0100
Subject: [Open-bio-l] White space in FASTQ files?
In-Reply-To: <4A801BFA.2040208@ebi.ac.uk>
References: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com>
	<4A801BFA.2040208@ebi.ac.uk>
Message-ID: <320fb6e00908100636r3e95b505x1fad838c566c973d@mail.gmail.com>

On Mon, Aug 10, 2009 at 2:09 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
> Peter C. wrote:
>> Other than the special case of new lines which we have already covered
>> (allowed but line wrapping is discouraged), should FASTQ sequence
>> lines (and indeed the quality lines) ever be allowed to include white
>> space (e.g. spaces and tabs)? I've never seen this in a real FASTQ
>> file, and would like to suggest this be considered an error.
>>
>> Comments? Counter suggestions?
>
> I am happy adding a warning message in EMBOSS for this.
>

So you are thinking you'll try and cope with white space, and issue a
warning? This sounds dangerous to me. One of the properties of a
FASTQ file is the sequence string and the quality string should be the
same length (after removing the line wrapping). Allowing whitespace
in these strings makes that ambiguous. What if the sequence has
white space but not the quality? What if they both have white space
but in different positions?

Just calling any whitespace (other than the new line characters) an
error seems much safer. If there are any real files which do this, we
can revisit this.
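
The strict rule is easy to state in code. Below is a minimal Python sketch, with the hypothetical helper name check_record, which assumes the line breaks have already been stripped from the wrapped lines of one record:

```python
def check_record(seq_lines, qual_lines):
    """Join wrapped lines and validate one record's sequence and quality.

    Line breaks are assumed to be stripped already; any other white
    space is treated as an error, which keeps the length check
    unambiguous.
    """
    seq = "".join(seq_lines)
    qual = "".join(qual_lines)
    for label, text in (("sequence", seq), ("quality", qual)):
        if any(ch.isspace() for ch in text):
            raise ValueError(f"white space found in {label} string")
    if len(seq) != len(qual):
        raise ValueError(
            f"sequence length {len(seq)} != quality length {len(qual)}"
        )
    return seq, qual
```

Rejecting white space first means the length comparison never has to guess whether a space or tab was meant to count as a position.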

Peter

From cjfields at illinois.edu  Tue Aug 11 19:32:00 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 11 Aug 2009 18:32:00 -0500
Subject: [Open-bio-l] White space in FASTQ files?
In-Reply-To: <4A801BFA.2040208@ebi.ac.uk>
References: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com>
	<4A801BFA.2040208@ebi.ac.uk>
Message-ID: <D83EE4A2-9DDD-4684-AACA-EC50E9070B3E@illinois.edu>

On Aug 10, 2009, at 8:09 AM, Peter Rice wrote:

> Peter C. wrote:
>> Other than the special case of new lines which we have already covered
>> (allowed but line wrapping is discouraged), should FASTQ sequence
>> lines (and indeed the quality lines) ever be allowed to include white
>> space (e.g. spaces and tabs)? I've never seen this in a real FASTQ
>> file, and would like to suggest this be considered an error.
>>
>> Comments? Counter suggestions?
>
> I am happy adding a warning message in EMBOSS for this.
>
> If we add too many warning messages then we could break our plan to
> issue one message and follow with "and another 999999 up to ..." if we
> find ourselves issuing more than one warning per sequence.
>
> regards,
>
> Peter Rice

This is quite similar to the 'qual range out-of-bounds for this FASTQ  
variant' warning we discussed earlier.  We could essentially merge  
these to be one and the same.

chris

From cjfields at illinois.edu  Tue Aug 11 19:32:10 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 11 Aug 2009 18:32:10 -0500
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
	<4A801A77.20704@ebi.ac.uk>
	<320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com>
Message-ID: <0D818138-133D-4937-9B24-8A07CD62C170@illinois.edu>


On Aug 10, 2009, at 8:06 AM, Peter wrote:

> On Mon, Aug 10, 2009 at 2:02 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
>> Peter C. wrote:
>>
>>> I would suggest that OBF tools all treat the sequence in FASTQ files
>>> as is, and preserve the case on output.
>>>
>>> Any thoughts?
>>
>> EMBOSS does that with all sequence formats. The case of the original
>> sequence is preserved and reproduced on output. We have not specified
>> upper or lower case only for any of our current output formats.
>>
>> We provide command line options to force sequences to be converted to
>> upper or lower case if the user wants to specify one or the other -
>> usually just to convert sequences for post processing by some other  
>> tool.
>
> Cool. It looks like we are on the same wavelength here :)
>
> Peter

I believe so (sorry about lack of responsiveness, just got back in  
town).

chris

From biopython at maubp.freeserve.co.uk  Wed Aug 12 06:23:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 12 Aug 2009 11:23:52 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <0D818138-133D-4937-9B24-8A07CD62C170@illinois.edu>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
	<4A801A77.20704@ebi.ac.uk>
	<320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com>
	<0D818138-133D-4937-9B24-8A07CD62C170@illinois.edu>
Message-ID: <320fb6e00908120323q39e1b3e9x1ff6b56203149943@mail.gmail.com>

On Wed, Aug 12, 2009 at 12:32 AM, Chris Fields<cjfields at illinois.edu> wrote:
>
> On Aug 10, 2009, at 8:06 AM, Peter wrote:
>
>> On Mon, Aug 10, 2009 at 2:02 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
>>>
>>> Peter C. wrote:
>>>
>>>> I would suggest that OBF tools all treat the sequence in FASTQ files
>>>> as is, and preserve the case on output.
>>>>
>>>> Any thoughts?
>>>
>>> EMBOSS does that with all sequence formats. The case of the original
>>> sequence is preserved and reproduced on output. We have not specified
>>> upper or lower case only for any of our current output formats.
>>>
>>> We provide command line options to force sequences to be converted to
>>> upper or lower case if the user wants to specify one or the other -
>>> usually just to convert sequences for post processing by some other tool.
>>
>> Cool. It looks like we are on the same wavelength here :)
>>
>> Peter
>
> I believe so (sorry about lack of responsiveness, just got back in town).
>
> chris

Great - I've added some unit test code in Biopython to confirm we
leave the sequence case as-is when loading and saving FASTQ files.
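
For other projects wanting a similar check, the idea fits in a few lines. This Python sketch is not the Biopython test itself; parse_fastq and format_fastq are minimal stand-ins written for this example, and they only handle unwrapped four-line records.

```python
def parse_fastq(text):
    """Yield (title, sequence, quality) from unwrapped four-line records."""
    lines = text.splitlines()
    for i in range(0, len(lines), 4):
        title, seq, _plus, qual = lines[i:i + 4]
        yield title[1:], seq, qual  # the sequence is never upper-cased


def format_fastq(records):
    """Write records back out, leaving the sequence exactly as given."""
    return "".join(f"@{t}\n{s}\n+\n{q}\n" for t, s, q in records)


# A mixed case read must survive a load/save round trip unchanged.
mixed = "@read1\nacgtNNacGT\n+\nIIIIIIIIII\n"
assert format_fastq(parse_fastq(mixed)) == mixed
```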

Peter

From biopython at maubp.freeserve.co.uk  Mon Aug 24 10:18:20 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 24 Aug 2009 15:18:20 +0100
Subject: [Open-bio-l] [Bioperl-l] FASTQ support in Biopython, BioPerl,
	and EMBOSS
In-Reply-To: <F94D84BD-26B1-42E2-955D-11B3308C3AB2@illinois.edu>
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
	<320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com>
	<320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com>
	<32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu>
	<320fb6e00907270451i3d40b4ffq607360cfcb6f6282@mail.gmail.com>
	<F94D84BD-26B1-42E2-955D-11B3308C3AB2@illinois.edu>
Message-ID: <320fb6e00908240718q194afe78j4a05b31aeb33e313@mail.gmail.com>

On Mon, Jul 27, 2009 at 2:06 PM, Chris Fields<cjfields at illinois.edu> wrote:
>
> I added this (and the others) to our ticket tracking this. Looks like
> solexa conversion either way is borked, which is very likely an issue
> with conversion.

Hi Chris,

I've been digging into the current SVN code for BioPerl's FASTQ
support - I realised you are doing the Solexa to PHRED mapping
twice when parsing "fastq-solexa" files. Using "qual" output (which
shows the PHRED scores in plain text) makes it very clear
something is wrong:

$ cat solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;

That is Solexa scores from 40 (h) down to -5 (;), which should
map onto PHRED scores from 40 down to 1 (according to our
prior discussions).

$ ./bioperl_solexa2qual.pl < solexa_faked.fastq
>slxa_0001_1_0001_01
40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
15 14 13 12 11 10 10 10 9 8 7 6 6 5 5 5 5 4 4 4 4

For reference,

$ python biopython_solexa2qual.py < solexa_faked.fastq
>slxa_0001_1_0001_01
40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21
20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2
1 1

I can "fix" this in fastq.pm by commenting out one of the log mappings,
for example see the patch I've just uploaded to Bug 2857:
http://bugzilla.open-bio.org/show_bug.cgi?id=2857
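
The two listings above can be reproduced with a short Python sketch. It uses the usual Solexa to PHRED relation, PHRED = 10 * log10(10^(Solexa/10) + 1). Rounding to the nearest integer after each mapping is an assumption about how the BioPerl numbers arose, and the function name solexa_to_phred is chosen for this example only.

```python
from math import log10


def solexa_to_phred(solexa):
    """Map a Solexa score to the nearest integer PHRED score."""
    return int(round(10 * log10(10 ** (solexa / 10.0) + 1)))


# The faked quality string: Solexa 40 ('h') down to -5 (';'), offset 64.
quality = "".join(chr(64 + q) for q in range(40, -6, -1))
solexa = [ord(ch) - 64 for ch in quality]

once = [solexa_to_phred(q) for q in solexa]
twice = [solexa_to_phred(q) for q in once]  # the mapping applied again

print(" ".join(map(str, once)))
print(" ".join(map(str, twice)))
```

Under these assumptions the first line ends in "10 10 9 8 7 6 5 5 4 4 3 3 2 2 1 1", as in the Biopython output, and the second ends in "10 10 10 9 8 7 6 6 5 5 5 5 4 4 4 4", as in the BioPerl output, which is consistent with the mapping being applied twice.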

That brings me to another problem, consider the following (with the
double conversion fixed):

$ ./bioperl_solexa2solexa.pl < solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJJHGFEDDBB@@>><<

If you compare that to the original, you'll notice a loss of detail in
the low quality scores, e.g. Solexa scores 9 (I) and 10 (J) have
both been mapped onto 10 (J).

I believe this happens because BioPerl is converting the Solexa
scores to PHRED scores on loading (which is fine - EMBOSS
does this too), but you are also storing them as integers! In order
to preserve these details, I think you'll have to hold the converted
PHRED scores as floating point numbers (which I think is what
EMBOSS does). This has the downside of taking more memory,
and may also complicate file output (you may need to round things).
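
The effect of the integer storage can be shown with a small Python sketch. The formulas are the standard Solexa/PHRED relations; the function names and the as_int switch are assumptions for this example, and the sketch makes no claim about how BioPerl or EMBOSS do the reverse mapping internally.

```python
from math import log10


def solexa_to_phred(solexa):
    """Solexa score -> PHRED score, as a float."""
    return 10 * log10(10 ** (solexa / 10.0) + 1)


def phred_to_solexa(phred):
    """PHRED score -> Solexa score, as a float (phred must be > 0)."""
    return 10 * log10(10 ** (phred / 10.0) - 1)


def round_trip(solexa, as_int):
    """Convert Solexa -> PHRED -> Solexa, optionally rounding the PHRED."""
    phred = solexa_to_phred(solexa)
    if as_int:
        phred = round(phred)  # what storing PHRED as an integer does
    return round(phred_to_solexa(phred))


for q in (9, 10):
    print(q, round_trip(q, as_int=True), round_trip(q, as_int=False))
```

Solexa 9 corresponds to a PHRED score of about 9.51 and Solexa 10 to about 10.41, so both round to PHRED 10 and come back as Solexa 10. Keeping the float until output lets 9 come back as 9.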

Regards,

Peter
(@Biopython)


From biopython at maubp.freeserve.co.uk  Tue Aug 25 07:24:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 25 Aug 2009 12:24:27 +0100
Subject: [Open-bio-l] More FASTQ examples for cross project testing
Message-ID: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>

Hi all,

I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
off list about this plan. I'm going to co-ordinate putting together a
set of valid FASTQ files for shared testing (to supplement the
existing set of invalid FASTQ files already done and being used in
Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).

What I have in mind is:

XXX_original_YYY.fastq - sample input
XXX_as_sanger.fastq - reference output
XXX_as_solexa.fastq - reference output
XXX_as_illumina.fastq - reference output

where XXX is some name (e.g. wrapped1, wrapped2, shortreads,
longreads, sanger_full_range, solexa_full_range ...) and YYY is the
FASTQ variant (sanger, solexa or illumina) for the "input" file.

For example, we might have:

wrapped1_original_sanger.fastq - A Sanger FASTQ using line wrapping,
perhaps repeating the title on the plus lines
wrapped1_as_sanger.fastq - The same data but using the consensus of no
line wrapping and omitting the repeated title on the plus lines.
wrapped1_as_solexa.fastq - As above, but converted to Solexa scores
(ASCII offset 64), with capping at Solexa 62 (ASCII 126).
wrapped1_as_illumina.fastq - As above, but converted to Illumina ASCII
offset 64, with capping at PHRED 62 (ASCII 126).

Here "wrapped1" would be a Sanger FASTQ file with some line wrapping
(e.g. at 60 characters). I will include "sanger_full_range" which
would cover all the valid PHRED scores from 0 to 93, and similarly for
Solexa and Illumina files - these are important for testing the score
conversions. I have some ideas for deliberately tricky (but valid)
files which should properly test any parser.

The point is we have "perhaps odd but valid" originals, plus the
"cleaned up" versions (using the same FASTQ variant), and "cleaned up"
versions in the other two FASTQ variants.

Ideally asking Biopython/BioPerl/EMBOSS to convert the
XXX_original_YYY.fastq files into any of the three FASTQ variants will
give exactly the same as the reference outputs.
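
A test harness could derive the reference file names from the input name. The following Python sketch encodes the proposed naming convention; the function name reference_outputs is hypothetical, and the convention itself was still open for comment at this point.

```python
import re

VARIANTS = ("sanger", "solexa", "illumina")

# Matches names of the form XXX_original_YYY.fastq
_ORIGINAL = re.compile(
    r"^(?P<name>.+)_original_(?P<variant>sanger|solexa|illumina)\.fastq$"
)


def reference_outputs(original_filename):
    """Return (input_variant, {variant: reference filename})."""
    match = _ORIGINAL.match(original_filename)
    if match is None:
        raise ValueError(
            f"not an XXX_original_YYY.fastq name: {original_filename}"
        )
    name = match.group("name")
    outputs = {v: f"{name}_as_{v}.fastq" for v in VARIANTS}
    return match.group("variant"), outputs
```

A cross project test would then convert each original into all three variants and compare the result byte for byte with the file named in the returned dictionary.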

If anyone has any comments or suggestions please speak up (e.g. my
suggested naming conventions).

Real life examples of FASTQ files anyone has had trouble parsing (even
with 3rd party tools) would be particularly useful - although we'd
probably want to cut down big example files in order to keep the
dataset to a reasonable size.

Thanks,

Peter

From heuermh at acm.org  Tue Aug 25 22:56:20 2009
From: heuermh at acm.org (Michael Heuer)
Date: Tue, 25 Aug 2009 22:56:20 -0400 (EDT)
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
Message-ID: <Pine.GSO.4.44.0908252252240.28440-100000@shell3.shore.net>

Peter wrote:

> Hi all,
>
> I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
> off list about this plan. I'm going to co-ordinate putting together a
> set of valid FASTQ files for shared testing (to supplement the
> existing set of invalid FASTQ files already done and being used in
> Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).
>
> What I have in mind is:
>
> XXX_original_YYY.fastq - sample input
> XXX_as_sanger.fastq - reference output
> XXX_as_solexa.fastq - reference output
> XXX_as_illumina.fastq - reference output
>
> where XXX is some name (e.g. wrapped1, wrapped2, shortreads,
> longreads, sanger_full_range, solexa_full_range ...) and YYY is the
> FASTQ variant (sanger, solexa or illumina) for the "input" file.
>
> For example, we might have:
>
> wrapped1_original_sanger.fastq - A Sanger FASTQ using line wrapping,
> perhaps repeating the title on the plus lines
> wrapped1_as_sanger.fastq - The same data but using the consensus of no
> line wrapping and omitting the repeated title on the plus lines.
> wrapped1_as_solexa.fastq - As above, but converted to Solexa scores
> (ASCII offset 64), with capping at Solexa 62 (ASCII 126).
> wrapped1_as_illumina.fastq - As above, but converted to Illumina ASCII
> offset 64, with capping at PHRED 62 (ASCII 126).
>
> Here "wrapped1" would be a Sanger FASTQ file with some line wrapping
> (e.g. at 60 characters). I will include "sanger_full_range" which
> would cover all the valid PHRED scores from 0 to 93, and similarly for
> Solexa and Illumina files - these are important for testing the score
> conversions. I have some ideas for deliberately tricky (but valid)
> files which should properly test any parser.
>
> The point is we have "perhaps odd but valid" originals, plus the
> "cleaned up" versions (using the same FASTQ variant), and "cleaned up"
> versions in the other two FASTQ variants.
>
> Ideally asking Biopython/BioPerl/EMBOSS to convert the
> XXX_original_YYY.fastq files into any of the three FASTQ variants will
> give exactly the same as the reference outputs.
>
> If anyone has any comments or suggestions please speak up (e.g. my
> suggested naming conventions).

Very cool idea, Peter, and Peter, and Chris.  I don't believe anyone from
biojava has spoken up on this thread yet, so I thought I should add that
we are working towards a compatible implementation as well.

   michael


From biopython at maubp.freeserve.co.uk  Wed Aug 26 06:06:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 26 Aug 2009 11:06:39 +0100
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <Pine.GSO.4.44.0908252252240.28440-100000@shell3.shore.net>
References: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
	<Pine.GSO.4.44.0908252252240.28440-100000@shell3.shore.net>
Message-ID: <320fb6e00908260306o37eb3d80q53fc50dce708d0b@mail.gmail.com>

On Wed, Aug 26, 2009 at 3:56 AM, Michael Heuer<heuermh at acm.org> wrote:
> Peter wrote:
>
>> Hi all,
>>
>> I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
>> off list about this plan. I'm going to co-ordinate putting together a
>> set of valid FASTQ files for shared testing (to supplement the
>> existing set of invalid FASTQ files already done and being used in
>> Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).
>> ...
>
> Very cool idea, Peter, and Peter, and Chris.  I don't believe anyone from
> biojava has spoken up on this thread yet, so I thought I should add that
> we are working towards a compatible implementation as well.
>
>    michael

Hi Michael - we asked the BioJava guys a while back, and at the time
there was interest but no volunteers. Who is working on this now?

Peter


From biopython at maubp.freeserve.co.uk  Wed Aug 26 18:04:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 26 Aug 2009 23:04:18 +0100
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
References: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
Message-ID: <320fb6e00908261504l18c3ee5em4c90b0818fc2c844@mail.gmail.com>

On Tue, Aug 25, 2009 at 12:24 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
> off list about this plan. I'm going to co-ordinate putting together a
> set of valid FASTQ files for shared testing (to supplement the
> existing set of invalid FASTQ files already done and being used in
> Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).
>
> What I have in mind is:
>
> XXX_original_YYY.fastq - sample input
> XXX_as_sanger.fastq - reference output
> XXX_as_solexa.fastq - reference output
> XXX_as_illumina.fastq - reference output
>
> where XXX is some name (e.g. wrapped1, wrapped2, shortreads,
> longreads, sanger_full_range, solexa_full_range ...) and YYY is the
> FASTQ variant (sanger, solexa or illumina) for the "input" file.

I didn't want to clog up the mailing list with attachments, but just
for the record, I've sent my first attempt at this to Peter (EMBOSS)
and Chris (BioPerl) for comment (and checking).

My earlier set of error_*.fastq files are in Biopython CVS/github and
have since been copied to BioPerl SVN as well.

Peter

From biopython at maubp.freeserve.co.uk  Thu Aug 27 06:46:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 27 Aug 2009 11:46:17 +0100
Subject: [Open-bio-l] FASTQ in BioRuby?
Message-ID: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com>

Hello BioRuby team,

I am one of the Biopython developers, and together with Peter Rice
(EMBOSS) and Chris Fields (BioPerl) we have been coordinating
how these Open Bioinformatics Foundation (OBF) projects will
interpret the FASTQ file format used in next generation sequencing.

This includes standardising our naming conventions for the original
Sanger FASTQ variant, and the later Solexa/early Illumina, and
recent Illumina 1.3+ variants. We have also put together a set of
test files, including reference conversions between the different
FASTQ variants.

We would be delighted to get BioRuby involved. I tried to contact
Naohisa Goto about this directly last month, but perhaps my email
did not arrive. If BioRuby is working on (or planning to work on)
FASTQ support, please could the developers concerned sign up
to the OBF joint mailing list where we have been discussing this:
http://lists.open-bio.org/mailman/listinfo/open-bio-l

Thank you,

Peter

From ngoto at gen-info.osaka-u.ac.jp  Thu Aug 27 07:20:46 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Thu, 27 Aug 2009 20:20:46 +0900
Subject: [Open-bio-l] FASTQ in BioRuby?
In-Reply-To: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com>
References: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com>
Message-ID: <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp>

Hello Peter,

Sorry for the late response. I've subscribed to open-bio-l,
but I could not actively join the discussions because of
my lack of knowledge about FASTQ.

There is a small, primitive attempt at FASTQ format support
in BioRuby, which is not yet merged into the main repository.
http://github.com/ngoto/bioruby/tree/master

Recently, Anthony Underwood contributed chromatogram classes
to support SCF/ABI formats, which will be merged soon,
after the bug-fix maintenance release 1.3.1.
http://github.com/aunderwo/bioruby/tree/master

I'm now planning to rewrite my FASTQ code to be consistent
with the chromatogram classes, and with the open-bio standards.

Thank you,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Thu, 27 Aug 2009 11:46:17 +0100
Peter <biopython at maubp.freeserve.co.uk> wrote:

> Hello BioRuby team,
> 
> I am one of the Biopython developers, and together with Peter Rice
> (EMBOSS) and Chris Fields (BioPerl) we have been coordinating
> how these Open Bioinformatics Foundation (OBF) projects will
> interpret the FASTQ file format used in next generation sequencing.
> 
> This includes standardising our naming conventions for the original
> Sanger FASTQ variant, and the later Solexa/early Illumina, and
> recent Illumina 1.3+ variants. We have also put together a set of
> test files, including reference conversions between the different
> FASTQ variants.
> 
> We would be delighted to get BioRuby involved. I tried to contact
> Naohisa Goto about this directly last month, but perhaps my email
> did not arrive. If BioRuby is working on (or planning to work on)
> FASTQ support, please could the developers concerned sign up
> to the OBF joint mailing list where we have been discussing this:
> http://lists.open-bio.org/mailman/listinfo/open-bio-l
> 
> Thank you,
> 
> Peter


From biopython at maubp.freeserve.co.uk  Thu Aug 27 08:08:28 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 27 Aug 2009 13:08:28 +0100
Subject: [Open-bio-l] FASTQ in BioRuby?
In-Reply-To: <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp>
References: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com>
	<20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <320fb6e00908270508o485ba990k96b8bd3b722c09b6@mail.gmail.com>

On Thu, Aug 27, 2009 at 12:20 PM, Naohisa
GOTO<ngoto at gen-info.osaka-u.ac.jp> wrote:
>
> Hello Peter,
>
> Sorry for the late response. I've subscribed to open-bio-l,
> but I could not actively join the discussions because of
> my lack of knowledge about FASTQ.
>
> There is a small, primitive attempt at FASTQ format support
> in BioRuby, which is not yet merged into the main repository.
> http://github.com/ngoto/bioruby/tree/master
>
> Recently, Anthony Underwood contributed chromatogram classes
> to support SCF/ABI formats, which will be merged soon,
> after the bug-fix maintenance release 1.3.1.
> http://github.com/aunderwo/bioruby/tree/master
>
> I'm now planning to rewrite my FASTQ code to be consistent
> with the chromatogram classes, and with the open-bio standards.
>
> Thank you,
>
> Naohisa Goto

That is excellent news :)

I'm not sure how format names work in BioRuby, but if you
do have a set of format names as strings as we do in
Biopython, BioPerl and EMBOSS it would be nice to be
consistent here:

http://biopython.org/wiki/SeqIO
http://bioperl.org/wiki/HOWTO:SeqIO
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

There is some basic information on wikipedia, but this
does not go into detail:
http://www.bioperl.org/wiki/FASTQ_sequence_format

Please feel free to ask any questions about how we are
interpreting things.

Thank you,

Peter

From biopython at maubp.freeserve.co.uk  Thu Aug 27 11:26:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 27 Aug 2009 16:26:23 +0100
Subject: [Open-bio-l] Naming for FASTQ example files
In-Reply-To: <320fb6e00908080553h115ab748jecc820d14cb5e524@mail.gmail.com>
References: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com>
	<320fb6e00908080553h115ab748jecc820d14cb5e524@mail.gmail.com>
Message-ID: <320fb6e00908270826i729cfdd6o1fdc56f47e5f3c02@mail.gmail.com>

On Sat, Aug 8, 2009 at 1:53 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Aug 6, 2009 at 9:17 AM, Peter<biopython at maubp.freeserve.co.uk> wrote:
>> Hi all,
>>
>> I am planning on compiling a set of FASTQ files, for use by
>> Biopython, BioPerl, EMBOSS and anyone else that wants to test a
>> parser. Modest size contributions will be welcome (no big files
>> though).
>>
>> I will have two types of files: valid ones, and invalid ones. The
>> basic idea is any parser should understand what we consider to be
>> valid files (we may need to provide matching FASTA and QUAL files or
>> something like this for verification), but also reject all the files
>> we consider to be invalid.
>> ...
>
> I've gone for "error_*.fastq" and have tried to use meaningful names
> rather than numbers. Currently these files are only in the Biopython
> repository (under biopython/Tests/Quality), but could be added to the
> (currently) unused Biodata repository - although that is still on CVS:
>
> http://lists.open-bio.org/pipermail/open-bio-l/2009-January/000511.html
>
> As these examples are all small and we don't expect to change them,
> I could also just email them (off the mailing list) to EMBOSS/BioPerl
> people directly on request.

Chris Fields has already included the original "error_*.fastq" files in
BioPerl SVN as test cases. Peter Rice has pointed out a minor error
in "error_short_qual.fastq" which I have now corrected (it had a
short sequence, not a short quality line), and after discussion we
have come up with a few more truncation examples:

error_trunc_in_title.fastq
error_trunc_in_seq.fastq
error_trunc_in_plus.fastq
error_trunc_in_qual.fastq

Again, you can grab these five files (four new, one updated) from
Biopython CVS/git, and I will also be emailing Chris & Peter R
directly.
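
For anyone writing a new parser, the truncation cases above can be
caught with fairly little code. This is only a sketch: it assumes the
simple four line layout with no line wrapping (so it would wrongly
reject some valid wrapped files), and the function name is made up
for illustration.

```python
def parse_fastq_strict(lines):
    """Yield (title, seq, qual) tuples; raise ValueError on malformed input."""
    it = iter(lines)
    for title_line in it:
        title_line = title_line.rstrip("\r\n")
        if not title_line.startswith("@"):
            raise ValueError("title line should start with '@': %r" % title_line)
        title = title_line[1:]
        # Pull the remaining three lines; None means the file ended early.
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("file truncated inside record %r" % title)
        seq, plus, qual = (s.rstrip("\r\n") for s in (seq, plus, qual))
        if not plus.startswith("+"):
            raise ValueError("expected '+' line in record %r" % title)
        if len(plus) > 1 and plus[1:] != title:
            raise ValueError("repeated title on '+' line differs in %r" % title)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ in %r" % title)
        yield title, seq, qual
```

A file cut off in the title, sequence or plus line leaves a record with
fewer than four lines, while one cut off in the quality line shows up
as a length mismatch.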

Peter C.

From biopython at maubp.freeserve.co.uk  Thu Aug 27 12:40:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 27 Aug 2009 17:40:21 +0100
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <Pine.GSO.4.44.0908271231100.17078-100000@shell3.shore.net>
References: <320fb6e00908260306o37eb3d80q53fc50dce708d0b@mail.gmail.com>
	<Pine.GSO.4.44.0908271231100.17078-100000@shell3.shore.net>
Message-ID: <320fb6e00908270940r495aecc1n833aa9a28e8f3db3@mail.gmail.com>

On Thu, Aug 27, 2009 at 5:31 PM, Michael Heuer<heuermh at acm.org> wrote:
>
> Peter wrote:
>> Hi Michael - we asked the BioJava guys a while back, and at the time
>> there was interest but no volunteers. Who is working on this now?
>
> Perhaps I should have kept quiet -- I think I just volunteered.  ;)
>
>    michael

Assuming you're serious, great :)

Peter


From heuermh at acm.org  Thu Aug 27 12:31:43 2009
From: heuermh at acm.org (Michael Heuer)
Date: Thu, 27 Aug 2009 12:31:43 -0400 (EDT)
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <320fb6e00908260306o37eb3d80q53fc50dce708d0b@mail.gmail.com>
Message-ID: <Pine.GSO.4.44.0908271231100.17078-100000@shell3.shore.net>

Peter wrote:

> On Wed, Aug 26, 2009 at 3:56 AM, Michael Heuer<heuermh at acm.org> wrote:
> > Peter wrote:
> >
> >> Hi all,
> >>
> >> I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
> >> off list about this plan. I'm going to co-ordinate putting together a
> >> set of valid FASTQ files for shared testing (to supplement the
> >> existing set of invalid FASTQ files already done and being used in
> >> Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).
> >> ...
> >
> > Very cool idea, Peter, and Peter, and Chris.  I don't believe anyone from
> > biojava has spoken up on this thread yet, so I thought I should add that
> > we are working towards a compatible implementation as well.
> >
> >    michael
>
> Hi Michael - we asked the BioJava guys a while back, and at the time
> there was interest but no volunteers. Who is working on this now?

Perhaps I should have kept quiet -- I think I just volunteered.  ;)

   michael


From biopython at maubp.freeserve.co.uk  Mon Aug 31 08:07:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 13:07:45 +0100
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
Message-ID: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>

Hi all,

I'm looking at indexing next generation sequence files for Biopython
(e.g. FASTQ short read files with 10s of millions of entries), where
even just holding the record names and their file offsets in memory
is beginning to be a bottleneck.
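
To be clear about what "names and file offsets in memory" means here,
this is roughly the naive approach, sketched in Python. It assumes
unwrapped four line records and a handle opened in binary mode; the
function name is hypothetical and this is not Biopython's actual code.

```python
import io


def index_fastq_offsets(handle):
    """Map each record name to the byte offset of its '@' title line."""
    index = {}
    while True:
        offset = handle.tell()
        title = handle.readline()
        if not title:
            break  # clean end of file
        if not title.startswith(b"@"):
            raise ValueError("expected '@' at byte offset %d" % offset)
        parts = title[1:].split(None, 1)
        name = parts[0].decode("ascii") if parts else ""
        if name in index:
            raise ValueError("duplicate record name %r" % name)
        # Skip the sequence, '+' and quality lines of this record.
        for _ in range(3):
            if not handle.readline():
                raise ValueError("file truncated inside record %r" % name)
        index[name] = offset
    return index


example = io.BytesIO(b"@r1\nACGT\n+\n!!!!\n@r2 second read\nGG\n+\n##\n")
offsets = index_fastq_offsets(example)
```

With tens of millions of reads, the dictionary itself (one string and
one integer per record) is what becomes the memory problem.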

What is the current status of Open Biological Database Access (OBDA),
and in particular the index files for sequence "flat files" like FASTA or
GenBank (or FASTQ)?

http://www.bioperl.org/wiki/HOWTO:Flat_databases
http://www.bioperl.org/wiki/HOWTO:OBDA
http://obda.open-bio.org/

The spec files are still in CVS (and ViewCVS is still broken since
the recent server move), rather than having been migrated to SVN
which may suggest things are obsolete (or on the bright side, stable).

Presumably BioPerl still uses these index files? What about the
other projects? I know EMBOSS has some indexing system for
example but I have no idea how it works internally.

Thanks,

Peter

From ngoto at gen-info.osaka-u.ac.jp  Mon Aug 31 10:01:46 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 31 Aug 2009 23:01:46 +0900
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
Message-ID: <20090831140146.EAF851CBC594@idnmail.gen-info.osaka-u.ac.jp>

Hi Peter,

On Mon, 31 Aug 2009 13:07:45 +0100
Peter <biopython at maubp.freeserve.co.uk> wrote:

> Hi all,
> 
> I'm looking at indexing next generation sequence files for Biopython
> (e.g. FASTQ short read files with 10s of millions of entries), where
> even just holding the record names and their file offsets in memory
> is beginning to be a bottleneck.
>
> What is the current status of Open Biological Database Access (OBDA),
> and in particular the index files for sequence "flat files" like FASTA or
> GenBank (or FASTQ)?
> 
> http://www.bioperl.org/wiki/HOWTO:Flat_databases
> http://www.bioperl.org/wiki/HOWTO:OBDA
> http://obda.open-bio.org/
>
> The spec files are still in CVS (and ViewCVS is still broken since
> the recent server move), rather than having been migrated to SVN
> which may suggest things are obsolete (or on the bright side, stable).
> 
> Presumably BioPerl still uses these index files? What about the
> other projects? I know EMBOSS has some indexing system for
> example but I have no idea how it works internally.

BioRuby still uses them. To gain performance, names and offsets are
written to temporary files and sorted using an external sort program
(default /usr/bin/sort).

In BioRuby, the flatfile-only solution works fine, but BerkeleyDB indexes
would be incompatible with other projects because of confusion in
the spec, discussed in BioPerl Bugzilla Bug #2337.
http://bugzilla.open-bio.org/show_bug.cgi?id=2337
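
For readers unfamiliar with the flat-file idea (a sorted list of names
and offsets on disk, searched without loading it into memory), here is
a rough Python sketch. The fixed-width record layout and field widths
are assumptions chosen for illustration, not the OBDA specification or
BioRuby's actual format, and Python's sorted() stands in for the
external sort program.

```python
import io


def write_sorted_index(entries, out_handle, name_width=32, offset_width=12):
    """Write (name, offset) pairs as fixed-width ASCII records sorted by name."""
    record_len = name_width + 1 + offset_width + 1
    for name, offset in sorted(entries):
        line = "%-*s\t%*d\n" % (name_width, name, offset_width, offset)
        if len(line) != record_len:
            raise ValueError("name or offset too wide for the record: %r" % name)
        out_handle.write(line.encode("ascii"))
    return record_len


def lookup_offset(index_handle, name, record_len, name_width=32):
    """Binary search the sorted index file; return the offset or None."""
    index_handle.seek(0, io.SEEK_END)
    lo, hi = 0, index_handle.tell() // record_len
    while lo < hi:
        mid = (lo + hi) // 2
        index_handle.seek(mid * record_len)
        record = index_handle.read(record_len).decode("ascii")
        key = record[:name_width].rstrip(" ")
        if key == name:
            return int(record[name_width + 1:])
        if key < name:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Because every record has the same length, a lookup needs only about
log2(N) seeks and reads, and nothing but the current record is held in
memory.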

Thanks,

-- 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From biopython at maubp.freeserve.co.uk  Mon Aug 31 11:07:28 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 16:07:28 +0100
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <20090831140146.EAF851CBC594@idnmail.gen-info.osaka-u.ac.jp>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
	<20090831140146.EAF851CBC594@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <320fb6e00908310807j2fbc9710x3621a16a54ad5e45@mail.gmail.com>

On Mon, Aug 31, 2009 at 3:01 PM, Naohisa
GOTO<ngoto at gen-info.osaka-u.ac.jp> wrote:
> Hi Peter,
>
>> Presumably BioPerl still uses these index files? What about the
>> other projects? I know EMBOSS has some indexing system for
>> example but I have no idea how it works internally.
>
> BioRuby still uses them. To gain performance, names and offsets are
> written to temporary files and sorted using an external sort program
> (default /usr/bin/sort).

That makes sense. Have you tried this on very large files? e.g.
FASTA with 10 million short reads?

> In BioRuby, the flatfile-only solution works fine, but BerkeleyDB indexes
> would be incompatible with other projects because of confusion in
> the spec, discussed in BioPerl Bugzilla Bug #2337.
> http://bugzilla.open-bio.org/show_bug.cgi?id=2337

Thank you for the link to that bug - I'll need to read that carefully.

Peter

From biopython at maubp.freeserve.co.uk  Mon Aug 31 11:45:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 16:45:51 +0100
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
	<8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu>
Message-ID: <320fb6e00908310845n3322d2c6hc992f78b81c0e226@mail.gmail.com>

On Mon, Aug 31, 2009 at 4:33 PM, Chris Fields<cjfields at illinois.edu> wrote:
>
> I don't use OBDA, personally, but I can check on the status with Brian
> Osborne (he was heading it up last I checked).  However, I don't think
> BioPerl has an OBDA FASTQ parser.
>
> You may be thinking about Bio::Index::FASTQ?  That one is not OBDA,
> but just a simple flat file indexer.  We could probably set an OBDA parser
> up fairly easily if needed.

I didn't know if Bio::Index was using OBDA "under the hood" or not.
Does this mean BioPerl has multiple indexing systems available?

As I noted on Bug 2337 earlier today, Biopython used to have some
sort of OBDA compliant indexing, but for unrelated reasons we have
deprecated and removed that code. We're now revisiting this topic
due in part to having to deal with ever larger data files - and I wanted
to see if OBDA was still "alive" as a standard, and furthermore how
well it had scaled for the other OBF projects.

Peter


From cjfields at illinois.edu  Mon Aug 31 11:33:02 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 31 Aug 2009 10:33:02 -0500
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
Message-ID: <8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu>

On Aug 31, 2009, at 7:07 AM, Peter wrote:

> Hi all,
>
> I'm looking at indexing next generation sequence files for Biopython
> (e.g. FASTQ short read files with 10s of millions of entries), where
> even just holding the record names and their file offsets in memory
> is beginning to be a bottleneck.
>
> What is the current status of Open Biological Database Access (OBDA),
> and in particular the index files for sequence "flat files" like FASTA or
> GenBank (or FASTQ)?
>
> http://www.bioperl.org/wiki/HOWTO:Flat_databases
> http://www.bioperl.org/wiki/HOWTO:OBDA
> http://obda.open-bio.org/
>
> The spec files are still in CVS (and ViewCVS is still broken since
> the recent server move), rather than having been migrated to SVN
> which may suggest things are obsolete (or on the bright side, stable).
>
> Presumably BioPerl still uses these index files? What about the
> other projects? I know EMBOSS has some indexing system for
> example but I have no idea how it works internally.
>
> Thanks,
>
> Peter

I don't use OBDA, personally, but I can check on the status with Brian  
Osborne (he was heading it up last I checked).  However, I don't think  
BioPerl has an OBDA FASTQ parser.

You may be thinking about Bio::Index::FASTQ?  That one is not OBDA,  
but just a simple flat file indexer.  We could probably set an OBDA  
parser up fairly easily if needed.

chris

From cjfields at illinois.edu  Mon Aug 31 14:22:36 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 31 Aug 2009 13:22:36 -0500
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <320fb6e00908310845n3322d2c6hc992f78b81c0e226@mail.gmail.com>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
	<8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu>
	<320fb6e00908310845n3322d2c6hc992f78b81c0e226@mail.gmail.com>
Message-ID: <ED58ED84-3B68-4E45-B3DD-BA5F8050551B@illinois.edu>

On Aug 31, 2009, at 10:45 AM, Peter wrote:

> On Mon, Aug 31, 2009 at 4:33 PM, Chris Fields<cjfields at illinois.edu> wrote:
>>
>> I don't use OBDA, personally, but I can check on the status with Brian
>> Osborne (he was heading it up last I checked).  However, I don't think
>> BioPerl has an OBDA FASTQ parser.
>>
>> You may be thinking about Bio::Index::FASTQ?  That one is not OBDA,
>> but just a simple flat file indexer.  We could probably set an OBDA parser
>> up fairly easily if needed.
>
> I didn't know if Bio::Index was using OBDA "under the hood" or not.
> Does this mean BioPerl has multiple indexing systems available?

Yes.  We have Bio::Index::*, Bio::DB::Flat (which I think is OBDA).   
There is also the older Bio::DB::Fasta, which is actually still in  
wide use.  Note with Bio::Index::* we allow streaming of any report  
type (sequence, alignment, analysis like BLAST, etc).

We have talked about switching many of the sequence-based Bio::Index::*
modules to OBDA, but I haven't seen anyone take that up.

> As I noted on Bug 2337 earlier today, Biopython used to have some
> sort of OBDA compliant indexing, but for unrelated reasons we have
> deprecated and removed that code. We're now revisiting this topic
> due in part to having to deal with ever larger data files - and I wanted
> to see if OBDA was still "alive" as a standard, and furthermore how
> well it had scaled for the other OBF projects.
>
> Peter

I think it's still alive and being used, just not sure what the  
compliance level is amongst the different Bio* projects.

chris

From charles-listes+open-bio at plessy.org  Sun Aug  2 01:25:37 2009
From: charles-listes+open-bio at plessy.org (Charles Plessy)
Date: Sun, 2 Aug 2009 10:25:37 +0900
Subject: [Open-bio-l] FASTQ identifiers
In-Reply-To: <320fb6e00907310215n6ed33a17id8a5ff5913c9d3b5@mail.gmail.com>
References: <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com>
	<320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com>
	<320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com>
	<E5FA39FD-799D-4FF0-9117-D3186FF95FB2@illinois.edu>
	<320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com>
	<BE6B3A71-3130-4BD9-96C1-FD1A09C6F4CC@illinois.edu>
	<320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com>
	<24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com>
	<4A72A8F9.9020903@ebi.ac.uk>
	<320fb6e00907310215n6ed33a17id8a5ff5913c9d3b5@mail.gmail.com>
Message-ID: <20090802012537.GD2479@kunpuu.plessy.org>

On Fri, Jul 31, 2009 at 10:15:57AM +0100, Peter wrote:
> The situation is similar to the FASTA format (and others), in that there
> are a number of reasonably well documented conventions in use
> (e.g. the NCBI FASTA identifiers with | characters). However, equally,
> there are thousands of ad hoc local conventions.

Hello,

I would just like to mention one such ad hoc convention in use at my workplace:
with FASTQ sequences we sometimes replace the original name with the sequence
itself. This can be useful, for instance, to troubleshoot some sequence
manipulations.

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
;;3;;;;;;;;;;;;7;;;;;;;88

becomes:

@CCCTTCTTGTCTTCAGCGTTTCTCC
CCCTTCTTGTCTTCAGCGTTTCTCC
+CCCTTCTTGTCTTCAGCGTTTCTCC
;;3;;;;;;;;;;;;7;;;;;;;88

and after some arbitrary trimming at the ends:

@CCCTTCTTGTCTTCAGCGTTTCTCC
TTCTTGTCTTCAGCGTTTCT
+CCCTTCTTGTCTTCAGCGTTTCTCC
;;;;;;;;;;;;7;;;;;;;
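
The renaming and trimming steps above could be sketched in Python as
follows, treating a record as a plain (title, sequence, quality)
tuple. The helper names are hypothetical; this is not code from any
existing toolkit.

```python
def rename_by_sequence(record):
    """Replace the title with the (untrimmed) sequence itself."""
    _title, seq, qual = record
    return (seq, seq, qual)


def trim(record, start, end):
    """Trim sequence and quality together, leaving the title untouched."""
    title, seq, qual = record
    return (title, seq[start:end], qual[start:end])


def format_fastq(record):
    """Format a record as four lines, repeating the title on the '+' line."""
    title, seq, qual = record
    return "@%s\n%s\n+%s\n%s\n" % (title, seq, title, qual)


original = ("EAS54_6_R1_2_1_413_324", "CCCTTCTTGTCTTCAGCGTTTCTCC",
            ";;3" + ";" * 12 + "7" + ";" * 7 + "88")
trimmed = trim(rename_by_sequence(original), 3, 23)
```

Because the title is set before trimming, the trimmed record still
carries the full original read in its name, which is what makes the
convention handy for troubleshooting.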


With FASTA format, we sometimes eliminate redundant sequences and record how
many times they occurred by adding the count to the name.

For instance:

>seq1
AAATTT
>seq2
AAATAT
>seq3
AAATTT

becomes:

>AAATTT_2
AAATTT
>AAATAT_1
AAATAT

If this is popular elsewhere, it may be useful to implement functions that
allow doing this efficiently.
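
A minimal Python sketch of such a function, assuming the records are
already available as (name, sequence) pairs and that first-seen order
should be kept (the function name is made up for illustration):

```python
def collapse_duplicates(records):
    """Merge identical sequences, naming each one '<sequence>_<count>'."""
    counts = {}  # dicts keep insertion order in Python 3.7+
    for _name, seq in records:
        counts[seq] = counts.get(seq, 0) + 1
    return [("%s_%d" % (seq, n), seq) for seq, n in counts.items()]


reads = [("seq1", "AAATTT"), ("seq2", "AAATAT"), ("seq3", "AAATTT")]
collapsed = collapse_duplicates(reads)
```

This keeps one dictionary entry per distinct sequence, so memory use
grows with the number of unique reads rather than the total.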

Have a nice day,

-- 
Charles Plessy
Tsurumi, Kanagawa, Japan


From biopython at maubp.freeserve.co.uk  Mon Aug  3 09:30:09 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 3 Aug 2009 10:30:09 +0100
Subject: [Open-bio-l] FASTQ identifiers
In-Reply-To: <20090802012537.GD2479@kunpuu.plessy.org>
References: <320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com>
	<320fb6e00907300318n28260ccm2e675330896af2b1@mail.gmail.com>
	<E5FA39FD-799D-4FF0-9117-D3186FF95FB2@illinois.edu>
	<320fb6e00907300855t3b41a29aye2148e7843dce5fa@mail.gmail.com>
	<BE6B3A71-3130-4BD9-96C1-FD1A09C6F4CC@illinois.edu>
	<320fb6e00907301450h63ee95fehf5cc92f7ca9be1cf@mail.gmail.com>
	<24c96eca0907301652ie2fb130p3195ba1ffb5ede69@mail.gmail.com>
	<4A72A8F9.9020903@ebi.ac.uk>
	<320fb6e00907310215n6ed33a17id8a5ff5913c9d3b5@mail.gmail.com>
	<20090802012537.GD2479@kunpuu.plessy.org>
Message-ID: <320fb6e00908030230x52bf32a8o3b640ce8d0a76b8@mail.gmail.com>

On Sun, Aug 2, 2009 at 2:25 AM, Charles
Plessy<charles-listes+open-bio at plessy.org> wrote:
> On Fri, Jul 31, 2009 at 10:15:57AM +0100, Peter wrote:
>> The situation is similar to the FASTA format (and others), in that there
>> are a number of reasonably well documented conventions in use
>> (e.g. the NCBI FASTA identifiers with | characters). However, equally,
>> there are thousands of ad hoc local conventions.
>
> Hello,
>
> I would just like to mention one such ad hoc convention in use at my
> workplace: with FASTQ sequences we sometimes replace the original
> name with the sequence itself. This can be useful, for instance, to
> troubleshoot some sequence manipulations.
>
> @EAS54_6_R1_2_1_413_324
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +EAS54_6_R1_2_1_413_324
> ;;3;;;;;;;;;;;;7;;;;;;;88
>
> becomes:
>
> @CCCTTCTTGTCTTCAGCGTTTCTCC
> CCCTTCTTGTCTTCAGCGTTTCTCC
> +CCCTTCTTGTCTTCAGCGTTTCTCC
> ;;3;;;;;;;;;;;;7;;;;;;;88
>

That certainly demonstrates we can't make any big assumptions
about the title line formatting ;)

Your example is interesting - but I don't quite understand why you
do this. Surely any debug message or output file for bad reads
would (normally) have a unique read ID which (indirectly) tells
you the read sequence? If you are writing the code which gives
these error messages, can't you explicitly give the read sequence?
Is the aim to be able to look at error messages from third party
tools (which just give the read name) and see the read sequence
directly (without looking up the read name in the original FASTQ
file)?

This is similar in some ways to my comment that I could see a real
use for FASTQ (and FASTA) files with no record identifiers:

>> Related to this, what about the corner case of reads with NO
>> identifier? The FASTQ (and indeed the FASTA) formats can
>> hold such things - just use a blank title line. In the case of
>> next generation sequencing reads, the names themselves
>> are not actually that important - so you can imagine a pipeline
>> which doesn't actually bother with them at all.

In your pipeline you clearly don't care about the original FASTQ
identifiers, and (if the pipeline would accept it), using blank title
lines might also work (and would certainly save disk space).

Peter


From cjfields at illinois.edu  Wed Aug  5 15:12:18 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 5 Aug 2009 10:12:18 -0500
Subject: [Open-bio-l] FASTQ support in Biopython, BioPerl, and EMBOSS
In-Reply-To: <320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com>
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
	<320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com>
	<320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com>
	<32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu>
	<320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com>
	<320fb6e00907280342r7daef68ai34925df78465d390@mail.gmail.com>
	<320fb6e00907290315o56ede1a9l3735365a30a371a@mail.gmail.com>
Message-ID: <3C298ABC-07CD-4597-BA83-F8F5992BF73A@illinois.edu>


On Jul 29, 2009, at 5:15 AM, Peter wrote:

> Hi all,
>
> This is a follow up to the earlier discussion about high quality scores
> in Solexa or Illumina 1.3+ FASTQ files and the problem of non printable
> ASCII codes (which can occur if converting from Sanger FASTQ).
>
>> On Sat, Jul 25, 2009 at 8:50 PM, Chris Fields<cjfields at illinois.edu> wrote:
>>>
>>>> Now, here comes the problem. I believe FASTQ files directly
>>>> from an Illumina 1.3+ pipeline will have PHRED scores in the
>>>> range 0 to 40 (as in this example). However, much higher
>>>> PHRED scores are possible during assembly / contig'ing
>>>> and read mapping. For example, the tool MAQ will output
>>>> Sanger style FASTQ files with PHRED scores in the range
>>>> 0 to 93 inclusive.
>>>
>>> We can support it as Illumina 1.3, but my point is this may be getting
>>> into a grey area and may be something that Illumina doesn't/wouldn't
>>> support. Reminds me a little of the multiple GFF2 variations (one of
>>> the main reasons for a GFF3).
>>
>> I agree this is a grey area (high scores in Solexa/Illumina
>> FASTQ files).
>>
>> ...
>>
>> i.e. An Illumina FASTQ format file can hold PHRED scores in the
>> range 0 to 62 without using problem characters. And likewise
>> for a Solexa FASTQ file (Solexa scores up to 62).
>
> Peter Rice and I have been talking about this off list, and have
> a proposal for the high score problem. Basically we want to
> restrict FASTQ quality strings to printable ASCII, which means
> 126 (0x7e) is a firm upper limit, while otherwise allowing scores
> as high as possible. This limit comes from ASCII 127 being
> "delete", and the even higher characters also being non-printable.
>
> i.e. We are suggesting:
>
> "fastq-sanger" - Allows PHRED scores 0 to 93 inclusive, mapped
> with an ASCII offset of 33 to ASCII characters 33 to 126 (or in hex,
> 0x21 to 0x7e). This is as defined on the MAQ web pages.
>
> "fastq-illumina" - Believed to use at least PHRED scores 0 to 40,
> mapped with an ASCII offset of 64 to ASCII characters 64 to 104
> (or in hex, 0x40 to 0x68). It is a reasonable and well defined
> extension to permit PHRED scores from 0 to 62 inclusive, which
> map to ASCII 64 to 126 (or in hex 0x40 to 0x7e). This avoids the
> non printing characters, and gives some head room for improved
> sequencing technology from Illumina giving higher raw scores.
>
> "fastq-solexa" - Believed to use Solexa scores from -5 to at least
> 40, again mapped with an ASCII offset of 64 giving ASCII characters
> 59 to 104 (or in hex, 0x3b to 0x68). Again, a reasonable and well
> defined extension would permit Solexa scores in the range -5 to 62
> inclusive, using ASCII 59 to 126 (or in hex, 0x3b to 0x7e).
>
> [Peter R. - please correct me if any of the above is not what you had
> in mind]
>
> If in the process of converting between formats, a quality score
> is too high (it would result in ASCII 127 or higher), then I would
> argue any of the following would be acceptable:
> (a) Silently impose the maximum score (ASCII 126, 0x7e)
> (b) Impose the maximum score with a warning
> (c) Raise an error
>
> I don't think EMBOSS, BioPerl and Biopython have to handle
> this exactly the same way, but I would favour (b) then (a).
>
> Peter
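
As a rough illustration of the three variants proposed above, here is a
minimal Python sketch of the offsets and score ranges. The names VARIANTS,
decode and encode are made up for this example and are not the API of
Biopython, BioPerl or EMBOSS.

```python
# Offsets and ranges as proposed in the quoted message; the dictionary
# and helper names are illustrative only.
VARIANTS = {
    # name: (ASCII offset, minimum score, maximum score)
    "fastq-sanger": (33, 0, 93),    # PHRED scores, ASCII 33 to 126
    "fastq-illumina": (64, 0, 62),  # PHRED scores, ASCII 64 to 126
    "fastq-solexa": (64, -5, 62),   # Solexa scores, ASCII 59 to 126
}


def decode(quality, variant):
    """Turn a quality string into a list of integer scores."""
    offset, low, high = VARIANTS[variant]
    scores = [ord(ch) - offset for ch in quality]
    for score in scores:
        if not low <= score <= high:
            raise ValueError("score %d out of range for %s" % (score, variant))
    return scores


def encode(scores, variant):
    """Turn a list of integer scores into a quality string."""
    offset, low, high = VARIANTS[variant]
    for score in scores:
        if not low <= score <= high:
            raise ValueError("score %d out of range for %s" % (score, variant))
    return "".join(chr(score + offset) for score in scores)
```

For instance, encode([0, 93], "fastq-sanger") gives "!~" (ASCII 33 and
126), and decode("@h", "fastq-illumina") gives [0, 40].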

I think, based on Aaron's comments, with bioperl we'll adopt (b) to
deal with format validation, but try to do it in a way that 'caches'
bad data so it doesn't report a warning on every out-of-range value.
I am planning on a Moose-based parser at some point that will do the
same.

chris
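
To make option (b) concrete, here is a minimal Python sketch of capping
at ASCII 126 while issuing a single summary warning per call rather than
one per value. This is only one possible reading of the 'caching' idea;
the function name encode_capped is hypothetical and this is not how
BioPerl implements it.

```python
import warnings

ASCII_MAX = 126  # "~", the highest printable ASCII character


def encode_capped(scores, offset):
    """Encode scores with the given ASCII offset, capping at ASCII 126.

    Out-of-range values are counted, and a single warning is issued at
    the end instead of one warning per value. Scores below the valid
    range are not checked in this sketch.
    """
    limit = ASCII_MAX - offset
    capped = 0
    chars = []
    for score in scores:
        if score > limit:
            score = limit
            capped += 1
        chars.append(chr(score + offset))
    if capped:
        warnings.warn("%d score(s) above %d capped at %d (ASCII %d)"
                      % (capped, limit, limit, ASCII_MAX))
    return "".join(chars)
```

With an offset of 64, encode_capped([40, 70, 93], 64) returns "h~~" and
warns once about the two capped values.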


From biopython at maubp.freeserve.co.uk  Wed Aug  5 16:01:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 5 Aug 2009 17:01:32 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
Message-ID: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>

Hi all,

Another FASTQ issue to debate: Should we care what case the sequence
strings are? I've never seen anything written down, but all the
examples I recall used upper case. But there is nothing to stop people
using mixed case, is there?

With FASTA on the other hand, while all uppercase is most common,
mixed case has its uses (e.g. representing trimmed regions, or low
quality scores).

I would suggest that OBF tools all treat the sequence in FASTQ files
as is, and preserve the case on output.

Any thoughts?

Peter


From dan.bolser at gmail.com  Wed Aug  5 16:50:56 2009
From: dan.bolser at gmail.com (Dan Bolser)
Date: Wed, 5 Aug 2009 17:50:56 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
Message-ID: <2c8757af0908050950y97863fcj2b4deda1b8bb37c8@mail.gmail.com>

2009/8/5 Peter <biopython at maubp.freeserve.co.uk>:
> Hi all,
>
> Another FASTQ issue to debate: Should we care what case the sequence
> strings are? I've never seen anything written down, but all the
> examples I recall used upper case. But there is nothing to stop people
> using mixed case, is there?
>
> With FASTA on the other hand, while all uppercase is most common,
> mixed case has its uses (e.g. representing trimmed regions, or low
> quality scores).
>
> I would suggest that OBF tools all treat the sequence in FASTQ files
> as is, and preserve the case on output.
>
> Any thoughts?

Agree.


> Peter
> _______________________________________________
> Open-Bio-l mailing list
> Open-Bio-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/open-bio-l
>


From biopython at maubp.freeserve.co.uk  Thu Aug  6 08:17:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 6 Aug 2009 09:17:07 +0100
Subject: [Open-bio-l] Naming for FASTQ example files
Message-ID: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com>

Hi all,

I am planning on compiling a set of FASTQ files, for use by
Biopython, BioPerl, EMBOSS and anyone else that wants to test a
parser. Modest size contributions will be welcome (no big files
though).

I will have two types of files: valid ones, and invalid ones. The
basic idea is any parser should understand what we consider to be
valid files (we may need to provide matching FASTA and QUAL files or
something like this for verification), but also reject all the files
we consider to be invalid.

Regarding names, does "error_*.fastq" or "invalid_*.fastq" sound fine?

Any preference for meaningful names ("error_qual_short.fastq",
"error_qual_bad_char.fastq", ...) versus numbers ("error_001.fastq",
"error_002.fastq", ...). Either way I think a README file would need
to accompany the dataset stating what we think makes each example
invalid (e.g. quality string shorter than sequence, invalid character
in quality string, ...).

Peter


From biopython at maubp.freeserve.co.uk  Sat Aug  8 12:53:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 8 Aug 2009 13:53:17 +0100
Subject: [Open-bio-l] Naming for FASTQ example files
In-Reply-To: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com>
References: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com>
Message-ID: <320fb6e00908080553h115ab748jecc820d14cb5e524@mail.gmail.com>

On Thu, Aug 6, 2009 at 9:17 AM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> I am planning on compiling a set of FASTQ files, for use by
> Biopython, BioPerl, EMBOSS and anyone else that wants to test a
> parser. Modest size contributions will be welcome (no big files
> though).
>
> I will have two types of files: valid ones, and invalid ones. The
> basic idea is any parser should understand what we consider to be
> valid files (we may need to provide matching FASTA and QUAL files or
> something like this for verification), but also reject all the files
> we consider to be invalid.
>
> Regarding names, does "error_*.fastq" or "invalid_*.fastq" sound fine?
>
> Any preference for meaningful names ("error_qual_short.fastq",
> "error_qual_bad_char.fastq", ...) versus numbers ("error_001.fastq",
> "error_002.fastq", ...). Either way I think a README file would need
> to accompany the dataset stating what we think makes each example
> invalid (e.g. quality string shorter than sequence, invalid character
> in quality string, ...).

I've gone for "error_*.fastq" and have tried to use meaningful names
rather than numbers. Currently these files are only in the Biopython
repository (under biopython/Tests/Quality), but could be added to the
(currently) unused Biodata repository - although that is still on CVS:

http://lists.open-bio.org/pipermail/open-bio-l/2009-January/000511.html

As these examples are all small and we don't expect to change them,
I could also just email them (off the mailing list) to EMBOSS/BioPerl
people directly on request.

Currently my error examples are as follows, broken down into groups.

Quality strings with invalid ASCII characters (not the full set, but
we could do that):

error_qual_null.fastq
error_qual_vtab.fastq
error_qual_tab.fastq
error_qual_escape.fastq
error_qual_unit_sep.fastq
error_qual_space.fastq
error_qual_del.fastq

Misc errors:

error_diff_ids.fastq
error_spaces.fastq
error_tabs.fastq
error_short_qual.fastq
error_long_qual.fastq
error_no_qual.fastq

Simulated truncation part way through a file:

error_trunc_at_plus.fastq
error_trunc_at_qual.fastq
error_trunc_at_seq.fastq
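
The kinds of error listed above could be detected along these lines. This
is a minimal Python sketch for unwrapped (four lines per record) files
only; the function names and messages are invented for illustration, and
the mapping of each check to a particular error_*.fastq file is a guess
based on the file names alone.

```python
def check_record(title_line, seq, plus_line, qual):
    """Return a list of problems found in one unwrapped FASTQ record."""
    problems = []
    if not title_line.startswith("@"):
        problems.append("title line does not start with @")
    if not plus_line.startswith("+"):
        problems.append("missing + line")
    elif len(plus_line) > 1 and plus_line[1:] != title_line[1:]:
        problems.append("identifiers differ")
    if len(qual) < len(seq):
        problems.append("quality shorter than sequence")
    elif len(qual) > len(seq):
        problems.append("quality longer than sequence")
    # Only printable ASCII 33 to 126 is allowed in the quality string,
    # which rules out null, tab, space, escape, delete and so on.
    if any(not 33 <= ord(ch) <= 126 for ch in qual):
        problems.append("invalid character in quality")
    return problems


def check_lines(lines):
    """Check a list of lines (line endings already removed).

    Returns (line number, problem) pairs, where the line number is that
    of the record's title line, counting from one.
    """
    problems = []
    for start in range(0, len(lines), 4):
        record = lines[start:start + 4]
        if len(record) < 4:
            problems.append((start + 1, "file truncated mid-record"))
            break
        for problem in check_record(*record):
            problems.append((start + 1, problem))
    return problems
```

A parser built this way would accept the initial valid reads and only
report a problem once it reaches the bad record.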

Note they are all based on the same example file which due to the
quality characters can be interpreted as any of the three FASTQ
variants we're supporting (Sanger, Solexa, Illumina 1.3+). This was
deliberate. Additional examples of files which could be Sanger or
Solexa but not Illumina 1.3+ (or valid Sanger but can't be Solexa or
Illumina 1.3+) are also a good idea.

Note that in many of these examples the error is part way into the
file, so there are initially some valid reads and then an error.

Peter
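
To illustrate the point about one file being readable as any of the three
variants, here is a small Python sketch that reports which variants a
quality string is compatible with, based only on the ASCII ranges we have
agreed. The name possible_variants is hypothetical, and a real tool could
not tell the variants apart this way when all three are possible.

```python
# Lowest allowed ASCII code for each variant; the upper limit is 126
# in all three cases.
MIN_CODE = {
    "fastq-sanger": 33,
    "fastq-solexa": 59,
    "fastq-illumina": 64,
}


def possible_variants(quality):
    """List the FASTQ variants under which this quality string is valid."""
    return [name for name, low in MIN_CODE.items()
            if all(low <= ord(ch) <= 126 for ch in quality)]
```

A string using only characters from "@" upwards is valid in all three
variants, one containing ";" rules out Illumina 1.3+, and one containing
"!" can only be Sanger.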


From biopython at maubp.freeserve.co.uk  Sat Aug  8 18:56:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 8 Aug 2009 19:56:32 +0100
Subject: [Open-bio-l] White space in FASTQ files?
Message-ID: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com>

Hi all,

Other than the special case of new lines which we have already covered
(allowed but line wrapping is discouraged), should FASTQ sequence
lines (and indeed the quality lines) ever be allowed to include white
space (e.g. spaces and tabs)? I've never seen this in a real FASTQ
file, and would like to suggest this be considered an error.

Comments? Counter suggestions?

Peter


From pmr at ebi.ac.uk  Mon Aug 10 13:02:47 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Mon, 10 Aug 2009 14:02:47 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
Message-ID: <4A801A77.20704@ebi.ac.uk>

Peter C. wrote:

> I would suggest that OBF tools all treat the sequence in FASTQ files
> as is, and preserve the case on output.
> 
> Any thoughts?

EMBOSS does that with all sequence formats. The case of the original
sequence is preserved and reproduced on output. We have not specified
upper or lower case only for any of our current output formats.

We provide command line options to force sequences to be converted to
upper or lower case if the user wants to specify one or the other -
usually just to convert sequences for post processing by some other tool.

regards,

Peter Rice


From biopython at maubp.freeserve.co.uk  Mon Aug 10 13:06:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 10 Aug 2009 14:06:23 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <4A801A77.20704@ebi.ac.uk>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
	<4A801A77.20704@ebi.ac.uk>
Message-ID: <320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com>

On Mon, Aug 10, 2009 at 2:02 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
> Peter C. wrote:
>
>> I would suggest that OBF tools all treat the sequence in FASTQ files
>> as is, and preserve the case on output.
>>
>> Any thoughts?
>
> EMBOSS does that with all sequence formats. The case of the original
> sequence is preserved and reproduced on output. We have not specified
> upper or lower case only for any of our current output formats.
>
> We provide command line options to force sequences to be converted to
> upper or lower case if the user wants to specify one or the other -
> usually just to convert sequences for post processing by some other tool.

Cool. It looks like we are on the same wavelength here :)

Peter


From pmr at ebi.ac.uk  Mon Aug 10 13:09:14 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Mon, 10 Aug 2009 14:09:14 +0100
Subject: [Open-bio-l] White space in FASTQ files?
In-Reply-To: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com>
References: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com>
Message-ID: <4A801BFA.2040208@ebi.ac.uk>

Peter C. wrote:
> Other than the special case of new lines which we have already covered
> (allowed but line wrapping is discouraged), should FASTQ sequence
> lines (and indeed the quality lines) ever be allowed to include white
> space (e.g. spaces and tabs)? I've never seen this in a real FASTQ
> file, and would like to suggest this be considered an error.
> 
> Comments? Counter suggestions?

I am happy adding a warning message in EMBOSS for this.

If we add too many warning messages then we could break our plan to
issue one message and follow with "and another 999999 up to ..." if we
find ourselves issuing more than one warning per sequence.

regards,

Peter Rice


From biopython at maubp.freeserve.co.uk  Mon Aug 10 13:36:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 10 Aug 2009 14:36:26 +0100
Subject: [Open-bio-l] White space in FASTQ files?
In-Reply-To: <4A801BFA.2040208@ebi.ac.uk>
References: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com>
	<4A801BFA.2040208@ebi.ac.uk>
Message-ID: <320fb6e00908100636r3e95b505x1fad838c566c973d@mail.gmail.com>

On Mon, Aug 10, 2009 at 2:09 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
> Peter C. wrote:
>> Other than the special case of new lines which we have already covered
>> (allowed but line wrapping is discouraged), should FASTQ sequence
>> lines (and indeed the quality lines) ever be allowed to include white
>> space (e.g. spaces and tabs)? I've never seen this in a real FASTQ
>> file, and would like to suggest this be considered an error.
>>
>> Comments? Counter suggestions?
>
> I am happy adding a warning message in EMBOSS for this.
>

So you are thinking you'll try and cope with white space, and issue a
warning? This sounds dangerous to me. One of the properties of a
FASTQ file is the sequence string and the quality string should be the
same length (after removing the line wrapping). Allowing whitespace
in these strings makes that ambiguous. What if the sequence has
white space but not the quality? What if they both have white space
but in different positions?

Just calling any whitespace (other than the new line characters) an
error seems much safer. If there are any real files which do this, we
can revisit this.

Peter
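
A minimal Python sketch of the strict approach, assuming the sequence and
quality strings are passed in with line wrapping and line endings already
removed. The function name check_no_whitespace is made up for this
example.

```python
def check_no_whitespace(seq, qual):
    """Raise ValueError if either string contains any white space."""
    for label, text in (("sequence", seq), ("quality", qual)):
        if any(ch.isspace() for ch in text):
            raise ValueError("white space in %s line" % label)


# The ambiguity: with a space in the sequence only, the lengths agree
# or disagree depending on whether the space is counted.
seq, qual = "AC GT", "IIII"
raw_lengths_match = len(seq) == len(qual)                       # False
stripped_lengths_match = len(seq.replace(" ", "")) == len(qual)  # True
```

Rejecting the record outright avoids having to decide which of the two
length comparisons is the right one.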


From cjfields at illinois.edu  Tue Aug 11 23:32:00 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 11 Aug 2009 18:32:00 -0500
Subject: [Open-bio-l] White space in FASTQ files?
In-Reply-To: <4A801BFA.2040208@ebi.ac.uk>
References: <320fb6e00908081156u662c9e4djeb7cc790fe4b7dd1@mail.gmail.com>
	<4A801BFA.2040208@ebi.ac.uk>
Message-ID: <D83EE4A2-9DDD-4684-AACA-EC50E9070B3E@illinois.edu>

On Aug 10, 2009, at 8:09 AM, Peter Rice wrote:

> Peter C. wrote:
>> Other than the special case of new lines which we have already covered
>> (allowed but line wrapping is discouraged), should FASTQ sequence
>> lines (and indeed the quality lines) ever be allowed to include white
>> space (e.g. spaces and tabs)? I've never seen this in a real FASTQ
>> file, and would like to suggest this be considered an error.
>>
>> Comments? Counter suggestions?
>
> I am happy adding a warning message in EMBOSS for this.
>
> If we add too many warning messages then we could break our plan to
> issue one message and follow with "and another 999999 up to ..." if we
> find ourselves issuing more than one warning per sequence.
>
> regards,
>
> Peter Rice

This is quite similar to the 'qual range out-of-bounds for this FASTQ  
variant' warning we discussed earlier.  We could essentially merge  
these to be one and the same.

chris


From cjfields at illinois.edu  Tue Aug 11 23:32:10 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 11 Aug 2009 18:32:10 -0500
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
	<4A801A77.20704@ebi.ac.uk>
	<320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com>
Message-ID: <0D818138-133D-4937-9B24-8A07CD62C170@illinois.edu>


On Aug 10, 2009, at 8:06 AM, Peter wrote:

> On Mon, Aug 10, 2009 at 2:02 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
>> Peter C. wrote:
>>
>>> I would suggest that OBF tools all treat the sequence in FASTQ files
>>> as is, and preserve the case on output.
>>>
>>> Any thoughts?
>>
>> EMBOSS does that with all sequence formats. The case of the original
>> sequence is preserved and reproduced on output. We have not specified
>> upper or lower case only for any of our current output formats.
>>
>> We provide command line options to force sequences to be converted to
>> upper or lower case if the user wants to specify one or the other -
>> usually just to convert sequences for post processing by some other
>> tool.
>
> Cool. It looks like we are on the same wavelength here :)
>
> Peter

I believe so (sorry about lack of responsiveness, just got back in  
town).

chris


From biopython at maubp.freeserve.co.uk  Wed Aug 12 10:23:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 12 Aug 2009 11:23:52 +0100
Subject: [Open-bio-l] Mixed case sequence strings in FASTQ?
In-Reply-To: <0D818138-133D-4937-9B24-8A07CD62C170@illinois.edu>
References: <320fb6e00908050901t735bda93ka804dfae79774fa1@mail.gmail.com>
	<4A801A77.20704@ebi.ac.uk>
	<320fb6e00908100606m4cd2a9ceq95c233d657b080ce@mail.gmail.com>
	<0D818138-133D-4937-9B24-8A07CD62C170@illinois.edu>
Message-ID: <320fb6e00908120323q39e1b3e9x1ff6b56203149943@mail.gmail.com>

On Wed, Aug 12, 2009 at 12:32 AM, Chris Fields<cjfields at illinois.edu> wrote:
>
> On Aug 10, 2009, at 8:06 AM, Peter wrote:
>
>> On Mon, Aug 10, 2009 at 2:02 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
>>>
>>> Peter C. wrote:
>>>
>>>> I would suggest that OBF tools all treat the sequence in FASTQ files
>>>> as is, and preserve the case on output.
>>>>
>>>> Any thoughts?
>>>
>>> EMBOSS does that with all sequence formats. The case of the original
>>> sequence is preserved and reproduced on output. We have not specified
>>> upper or lower case only for any of our current output formats.
>>>
>>> We provide command line options to force sequences to be converted to
>>> upper or lower case if the user wants to specify one or the other -
>>> usually just to convert sequences for post processing by some other tool.
>>
>> Cool. It looks like we are on the same wavelength here :)
>>
>> Peter
>
> I believe so (sorry about lack of responsiveness, just got back in town).
>
> chris

Great - I've added some unit test code in Biopython to confirm we
leave the sequence case as-is on loading and saving FASTQ files.

Peter
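
The idea behind such a test can be sketched in plain Python as a round
trip through a reader and a writer. This is not the Biopython test code;
read_fastq and write_fastq are toy functions written for this example,
and they only handle unwrapped four-line records.

```python
import io


def read_fastq(handle):
    """Yield (title, sequence, quality) from an unwrapped FASTQ handle."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the "+" line, ignored in this sketch
        qual = handle.readline().rstrip("\n")
        yield title[1:], seq, qual


def write_fastq(records, handle):
    """Write records in four-line FASTQ, leaving the sequence untouched."""
    for title, seq, qual in records:
        handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))


# Round trip a mixed case record and confirm nothing changed.
text = "@mixed\nACGTacgtNNnn\n+\n;;;;;;;;;;;;\n"
out = io.StringIO()
write_fastq(read_fastq(io.StringIO(text)), out)
assert out.getvalue() == text
```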


From biopython at maubp.freeserve.co.uk  Mon Aug 24 14:18:20 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 24 Aug 2009 15:18:20 +0100
Subject: [Open-bio-l] [Bioperl-l] FASTQ support in Biopython, BioPerl,
	and EMBOSS
In-Reply-To: <F94D84BD-26B1-42E2-955D-11B3308C3AB2@illinois.edu>
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
	<320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com>
	<320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com>
	<32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu>
	<320fb6e00907270451i3d40b4ffq607360cfcb6f6282@mail.gmail.com>
	<F94D84BD-26B1-42E2-955D-11B3308C3AB2@illinois.edu>
Message-ID: <320fb6e00908240718q194afe78j4a05b31aeb33e313@mail.gmail.com>

On Mon, Jul 27, 2009 at 2:06 PM, Chris Fields<cjfields at illinois.edu> wrote:
>
> I added this (and the others) to our ticket tracking this. Looks like
> solexa conversion either way is borked, which is very likely an issue
> with conversion.

Hi Chris,

I've been digging into the current SVN code for BioPerl's FASTQ
support - I realised you are doing the Solexa to PHRED mapping
twice when parsing "fastq-solexa" files. Using "qual" output (which
shows the PHRED scores in plain text) makes it very clear
something is wrong:

$ cat solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;

That is Solexa scores from 40 (h) down to -5 (;), which should
map onto PHRED scores from 40 down to 1 (according to our
prior discussions).

$ ./bioperl_solexa2qual.pl < solexa_faked.fastq
>slxa_0001_1_0001_01
40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
15 14 13 12 11 10 10 10 9 8 7 6 6 5 5 5 5 4 4 4 4

For reference,

$ python biopython_solexa2qual.py < solexa_faked.fastq
>slxa_0001_1_0001_01
40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21
20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2
1 1

I can "fix" this in fastq.pm by commenting out one of the log mappings,
for example see the patch I've just uploaded to Bug 2857:
http://bugzilla.open-bio.org/show_bug.cgi?id=2857

That brings me to another problem, consider the following (with the
double conversion fixed):

$ ./bioperl_solexa2solexa.pl < solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJJHGFEDDBB@@>><<

If you compare that to the original, you'll notice a loss of detail in
the poor quality reads. e.g. Solexa scores 9 (I) and 10 (J) have
both been mapped onto 10 (J).

I believe this happens because BioPerl is converting the Solexa
scores to PHRED scores on loading (which is fine - EMBOSS
does this too), but you are also storing them as integers! In order
to preserve these details, I think you'll have to hold the converted
PHRED scores as floating point numbers (which I think is what
EMBOSS does). This has the downside of taking more memory,
and may also complicate file output (you may need to round things).

Regards,

Peter
(@Biopython)
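
For anyone wanting to check the numbers, here is a minimal Python sketch
of the two mappings, using the usual formulas
PHRED = 10*log10(10^(Solexa/10) + 1) and
Solexa = 10*log10(10^(PHRED/10) - 1), with a floor of -5 on the Solexa
side. The function names are made up for this example. It shows the same
kind of collapse as described above, but it is not expected to reproduce
the BioPerl output character for character.

```python
import math


def solexa_to_phred(q):
    """Convert a Solexa score to a PHRED score, as a float."""
    return 10 * math.log10(10 ** (q / 10.0) + 1)


def phred_to_solexa(q):
    """Convert a PHRED score to a Solexa score, as a float (floor -5)."""
    if q <= 0:
        return -5.0
    return max(-5.0, 10 * math.log10(10 ** (q / 10.0) - 1))


solexa = list(range(40, -6, -1))  # 40 down to -5, as in solexa_faked.fastq

# Converting once and rounding gives the PHRED values in the qual output.
phred_int = [round(solexa_to_phred(q)) for q in solexa]

# Going back to Solexa from integer PHRED scores loses detail...
round_trip_int = [round(phred_to_solexa(p)) for p in phred_int]

# ...while keeping the PHRED scores as floats recovers the original.
round_trip_float = [round(phred_to_solexa(solexa_to_phred(q)))
                    for q in solexa]
```

Here phred_int matches the Biopython output shown above. In
round_trip_int, Solexa 9 and Solexa 10 both come back as 10, whereas
round_trip_float is identical to the original list of scores.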


From biopython at maubp.freeserve.co.uk  Tue Aug 25 11:24:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 25 Aug 2009 12:24:27 +0100
Subject: [Open-bio-l] More FASTQ examples for cross project testing
Message-ID: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>

Hi all,

I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
off list about this plan. I'm going to co-ordinate putting together a
set of valid FASTQ files for shared testing (to supplement the
existing set of invalid FASTQ files already done and being used in
Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).

What I have in mind is:

XXX_original_YYY.fastq - sample input
XXX_as_sanger.fastq - reference output
XXX_as_solexa.fastq - reference output
XXX_as_illumina.fastq - reference output

where XXX is some name (e.g. wrapped1, wrapped2, shortreads,
longreads, sanger_full_range, solexa_full_range ...) and YYY is the
FASTQ variant (sanger, solexa or illumina) for the "input" file.

For example, we might have:

wrapped1_original_sanger.fastq - A Sanger FASTQ using line wrapping,
perhaps repeating the title on the plus lines
wrapped1_as_sanger.fastq - The same data but using the consensus of no
line wrapping and omitting the repeated title on the plus lines.
wrapped1_as_solexa.fastq - As above, but converted to Solexa scores
(ASCII offset 64), with capping at Solexa 62 (ASCII 126).
wrapped1_as_illumina.fastq - As above, but converted to Illumina ASCII
offset 64, with capping at PHRED 62 (ASCII 126).

Here "wrapped1" would be a Sanger FASTQ file with some line wrapping
(e.g. at 60 characters). I will include "sanger_full_range" which
would cover all the valid PHRED scores from 0 to 93, and similarly for
Solexa and Illumina files - these are important for testing the score
conversions. I have some ideas for deliberately tricky (but valid)
files which should properly test any parser.

The point is we have "perhaps odd but valid" originals, plus the
"cleaned up" versions (using the same FASTQ variant), and "cleaned up"
versions in the other two FASTQ variants.

Ideally asking Biopython/BioPerl/EMBOSS to convert the
XXX_original_YYY.fastq files into any of the three FASTQ variants will
give exactly the same as the reference outputs.

If anyone has any comments or suggestions please speak up (e.g. my
suggested naming conventions).

Real life examples of FASTQ files anyone has had trouble parsing (even
with 3rd party tools) would be particularly useful - although we'd
probably want to cut down big example files in order to keep the
dataset to a reasonable size.

Thanks,

Peter
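
To show what turning a "wrapped" original into the "cleaned up" form
involves, here is a minimal Python sketch. It assumes the quality string
is read until it is as long as the sequence, since "@" and "+" are both
valid quality characters and cannot be used to spot record boundaries.
The names unwrap_fastq and format_record are invented for this example.

```python
def unwrap_fastq(lines):
    """Yield (title, sequence, quality) from possibly wrapped FASTQ lines."""
    lines = iter(lines)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines between records
        if not line.startswith("@"):
            raise ValueError("expected a title line starting with @")
        title = line[1:]
        # Sequence lines run until the line starting with "+".
        seq_parts = []
        for line in lines:
            line = line.rstrip("\r\n")
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError("file ended before the + line")
        seq = "".join(seq_parts)
        # Quality lines run until we have as many characters as bases.
        qual = ""
        while len(qual) < len(seq):
            try:
                qual += next(lines).rstrip("\r\n")
            except StopIteration:
                raise ValueError("file ended inside the quality string") from None
        if len(qual) != len(seq):
            raise ValueError("quality and sequence lengths differ")
        yield title, seq, qual


def format_record(title, seq, qual):
    """Write one record with no wrapping and no title on the + line."""
    return "@%s\n%s\n+\n%s\n" % (title, seq, qual)
```

Note that a wrapped quality line may itself start with "@", which is why
the sketch counts characters rather than looking for the next title line.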


From heuermh at acm.org  Wed Aug 26 02:56:20 2009
From: heuermh at acm.org (Michael Heuer)
Date: Tue, 25 Aug 2009 22:56:20 -0400 (EDT)
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
Message-ID: <Pine.GSO.4.44.0908252252240.28440-100000@shell3.shore.net>

Peter wrote:

> Hi all,
>
> I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
> off list about this plan. I'm going to co-ordinate putting together a
> set of valid FASTQ files for shared testing (to supplement the
> existing set of invalid FASTQ files already done and being used in
> Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).
>
> What I have in mind is:
>
> XXX_original_YYY.fastq - sample input
> XXX_as_sanger.fastq - reference output
> XXX_as_solexa.fastq - reference output
> XXX_as_illumina.fastq - reference output
>
> where XXX is some name (e.g. wrapped1, wrapped2, shortreads,
> longreads, sanger_full_range, solexa_full_range ...) and YYY is the
> FASTQ variant (sanger, solexa or illumina) for the "input" file.
>
> For example, we might have:
>
> wrapped1_original_sanger.fastq - A Sanger FASTQ using line wrapping,
> perhaps repeating the title on the plus lines
> wrapped1_as_sanger.fastq - The same data but using the consensus of no
> line wrapping and omitting the repeated title on the plus lines.
> wrapped1_as_solexa.fastq - As above, but converted to Solexa scores
> (ASCII offset 64), with capping at Solexa 62 (ASCII 126).
> wrapped1_as_illumina.fastq - As above, but converted to Illumina ASCII
> offset 64, with capping at PHRED 62 (ASCII 126).
>
> Here "wrapped1" would be a Sanger FASTQ file with some line wrapping
> (e.g. at 60 characters). I will include "sanger_full_range" which
> would cover all the valid PHRED scores from 0 to 93, and similarly for
> Solexa and Illumina files - these are important for testing the score
> conversions. I have some ideas for deliberately tricky (but valid)
> files which should properly test any parser.
>
> The point is we have "perhaps odd but valid" originals, plus the
> "cleaned up" versions (using the same FASTQ variant), and "cleaned up"
> versions in the other two FASTQ variants.
>
> Ideally asking Biopython/BioPerl/EMBOSS to convert the
> XXX_original_YYY.fastq files into any of the three FASTQ variants will
> give exactly the same as the reference outputs.
>
> If anyone has any comments or suggestions please speak up (e.g. my
> suggested naming conventions).

Very cool idea, Peter, and Peter, and Chris.  I don't believe anyone from
biojava has spoken up on this thread yet, so I thought I should add that
we are working towards a compatible implementation as well.

   michael


From biopython at maubp.freeserve.co.uk  Wed Aug 26 10:06:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 26 Aug 2009 11:06:39 +0100
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <Pine.GSO.4.44.0908252252240.28440-100000@shell3.shore.net>
References: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
	<Pine.GSO.4.44.0908252252240.28440-100000@shell3.shore.net>
Message-ID: <320fb6e00908260306o37eb3d80q53fc50dce708d0b@mail.gmail.com>

On Wed, Aug 26, 2009 at 3:56 AM, Michael Heuer<heuermh at acm.org> wrote:
> Peter wrote:
>
>> Hi all,
>>
>> I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
>> off list about this plan. I'm going to co-ordinate putting together a
>> set of valid FASTQ files for shared testing (to supplement the
>> existing set of invalid FASTQ files already done and being used in
>> Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).
>> ...
>
> Very cool idea, Peter, and Peter, and Chris. I don't believe anyone from
> biojava has spoken up on this thread yet, so I thought I should add that
> we are working towards a compatible implementation as well.
>
>    michael

Hi Michael - we asked the BioJava guys a while back, and at the time
there was interest but no volunteers. Who is working on this now?

Peter


From biopython at maubp.freeserve.co.uk  Wed Aug 26 22:04:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 26 Aug 2009 23:04:18 +0100
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
References: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
Message-ID: <320fb6e00908261504l18c3ee5em4c90b0818fc2c844@mail.gmail.com>

On Tue, Aug 25, 2009 at 12:24 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
> off list about this plan. I'm going to co-ordinate putting together a
> set of valid FASTQ files for shared testing (to supplement the
> existing set of invalid FASTQ files already done and being used in
> Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).
>
> What I have in mind is:
>
> XXX_original_YYY.fastq - sample input
> XXX_as_sanger.fastq - reference output
> XXX_as_solexa.fastq - reference output
> XXX_as_illumina.fastq - reference output
>
> where XXX is some name (e.g. wrapped1, wrapped2, shortreads,
> longreads, sanger_full_range, solexa_full_range ...) and YYY is the
> FASTQ variant (sanger, solexa or illumina) for the "input" file.

I didn't want to clog up the mailing list with attachments, but just
for the record, I've sent my first attempt at this to Peter (EMBOSS)
and Chris (BioPerl) for comment (and checking).

My earlier set of error_*.fastq files are in Biopython CVS/github and
have since been copied to BioPerl SVN as well.

Peter


From biopython at maubp.freeserve.co.uk  Thu Aug 27 10:46:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 27 Aug 2009 11:46:17 +0100
Subject: [Open-bio-l] FASTQ in BioRuby?
Message-ID: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com>

Hello BioRuby team,

I am one of the Biopython developers, and together with Peter Rice
(EMBOSS) and Chris Fields (BioPerl) we have been coordinating
how these Open Bioinformatics Foundation (OBF) projects will
interpret the FASTQ file format used in next generation sequencing.

This includes standardising our naming conventions for the original
Sanger FASTQ variant, and the later Solexa/early Illumina, and
recent Illumina 1.3+ variants. We have also put together a set of
test files, including reference conversions between the different
FASTQ variants.

We would be delighted to get BioRuby involved. I tried to contact
Naohisa Goto about this directly last month, but perhaps my email
did not arrive. If BioRuby is working on (or planning to work on)
FASTQ support, please could the developers concerned sign up
to the OBF joint mailing list where we have been discussing this:
http://lists.open-bio.org/mailman/listinfo/open-bio-l

Thank you,

Peter


From ngoto at gen-info.osaka-u.ac.jp  Thu Aug 27 11:20:46 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Thu, 27 Aug 2009 20:20:46 +0900
Subject: [Open-bio-l] FASTQ in BioRuby?
In-Reply-To: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com>
References: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com>
Message-ID: <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp>

Hello Peter,

Sorry for the late response. I've subscribed to open-bio-l,
but I could not actively join the discussions because I
don't know enough about FASTQ.

There is a small, primitive attempt at FASTQ support for
BioRuby, which is not yet merged into the main repository.
http://github.com/ngoto/bioruby/tree/master

Recently, Anthony Underwood contributed chromatogram classes
to support SCF/ABI formats, which will be merged soon,
after the bug-fix maintenance release of 1.3.1.
http://github.com/aunderwo/bioruby/tree/master

I'm now planning to rewrite my FASTQ code to be consistent
with the chromatogram classes, and with the open-bio standards.

Thank you,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Thu, 27 Aug 2009 11:46:17 +0100
Peter <biopython at maubp.freeserve.co.uk> wrote:

> Hello BioRuby team,
> 
> I am one of the Biopython developers, and together with Peter Rice
> (EMBOSS) and Chris Fields (BioPerl) we have been coordinating
> how these Open Bioinformatics Foundation (OBF) projects will
> interpret the FASTQ file format used in next generation sequencing.
> 
> This includes standardising our naming conventions for the original
> Sanger FASTQ variant, and the later Solexa/early Illumina, and
> recent Illumina 1.3+ variants. We have also put together a set of
> test files, including reference conversions between the different
> FASTQ variants.
> 
> We would be delighted to get BioRuby involved. I tried to contact
> Naohisa Goto about this directly last month, but perhaps my email
> did not arrive. If BioRuby is working on (or planning to work on)
> FASTQ support, please could the developers concerned sign up
> to the OBF joint mailing list where we have been discussing this:
> http://lists.open-bio.org/mailman/listinfo/open-bio-l
> 
> Thank you,
> 
> Peter


From biopython at maubp.freeserve.co.uk  Thu Aug 27 12:08:28 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 27 Aug 2009 13:08:28 +0100
Subject: [Open-bio-l] FASTQ in BioRuby?
In-Reply-To: <20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp>
References: <320fb6e00908270346y5d653d29mdd2dc7ebc76af3c1@mail.gmail.com>
	<20090827112046.E2C741CBC4BA@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <320fb6e00908270508o485ba990k96b8bd3b722c09b6@mail.gmail.com>

On Thu, Aug 27, 2009 at 12:20 PM, Naohisa
GOTO<ngoto at gen-info.osaka-u.ac.jp> wrote:
>
> Hello Peter,
>
> Sorry for the late response. I've subscribed to open-bio-l,
> but I could not actively join the discussions because I
> don't know enough about FASTQ.
>
> There is a small, primitive attempt at FASTQ support for
> BioRuby, which is not yet merged into the main repository.
> http://github.com/ngoto/bioruby/tree/master
>
> Recently, Anthony Underwood contributed chromatogram classes
> to support SCF/ABI formats, which will be merged soon,
> after the bug-fix maintenance release of 1.3.1.
> http://github.com/aunderwo/bioruby/tree/master
>
> I'm now planning to rewrite my FASTQ code to be consistent
> with the chromatogram classes, and with the open-bio standards.
>
> Thank you,
>
> Naohisa Goto

That is excellent news :)

I'm not sure how format names work in BioRuby, but if you
do have a set of format names as strings as we do in
Biopython, BioPerl and EMBOSS it would be nice to be
consistent here:

http://biopython.org/wiki/SeqIO
http://bioperl.org/wiki/HOWTO:SeqIO
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

There is some basic information on the BioPerl wiki, but this
does not go into detail:
http://www.bioperl.org/wiki/FASTQ_sequence_format

Please feel free to ask any questions about how we are
interpreting things.

Thank you,

Peter


From biopython at maubp.freeserve.co.uk  Thu Aug 27 15:26:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 27 Aug 2009 16:26:23 +0100
Subject: [Open-bio-l] Naming for FASTQ example files
In-Reply-To: <320fb6e00908080553h115ab748jecc820d14cb5e524@mail.gmail.com>
References: <320fb6e00908060117w703a2b08h6aeb2489530e0b51@mail.gmail.com>
	<320fb6e00908080553h115ab748jecc820d14cb5e524@mail.gmail.com>
Message-ID: <320fb6e00908270826i729cfdd6o1fdc56f47e5f3c02@mail.gmail.com>

On Sat, Aug 8, 2009 at 1:53 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Aug 6, 2009 at 9:17 AM, Peter<biopython at maubp.freeserve.co.uk> wrote:
>> Hi all,
>>
>> I am planning on compiling a set of FASTQ files, for use by
>> Biopython, BioPerl, EMBOSS and anyone else that wants to test a
>> parser. Modest size contributions will be welcome (no big files
>> though).
>>
>> I will have two types of files: valid ones, and invalid ones. The
>> basic idea is any parser should understand what we consider to be
>> valid files (we may need to provide matching FASTA and QUAL files or
>> something like this for verification), but also reject all the files
>> we consider to be invalid.
>> ...
>
> I've gone for "error_*.fastq" and have tried to use meaningful names
> rather than numbers. Currently these files are only in the Biopython
> repository (under biopython/Tests/Quality), but could be added to the
> (currently) unused Biodata repository - although that is still on CVS:
>
> http://lists.open-bio.org/pipermail/open-bio-l/2009-January/000511.html
>
> As these examples are all small and we don't expect to change them,
> I could also just email them (off the mailing list) to EMBOSS/BioPerl
> people directly on request.

Chris Fields has already included the original "error_*.fastq" files in
BioPerl SVN as test cases. Peter Rice has pointed out a minor error
in "error_short_qual.fastq" which I have now corrected (it had a
short sequence, not a short quality line), and after discussion we
have come up with a few more truncation examples:

error_trunc_in_title.fastq
error_trunc_in_seq.fastq
error_trunc_in_plus.fastq
error_trunc_in_qual.fastq

Again, you can grab these five files (four new, one updated) from
Biopython CVS/git, and I will also be emailing Chris & Peter R
directly.
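
As a rough illustration of what rejecting these truncated files
involves, here is a minimal strict reader. It is a sketch only: it
assumes unwrapped four-line records (real FASTQ files may wrap the
sequence and quality lines), and parse_fastq is a hypothetical name,
not any project's API.

```python
import io


def parse_fastq(handle):
    """Yield (title, sequence, quality) tuples from a text-mode handle.

    Raises ValueError on malformed or truncated input. Assumes each
    record is exactly four lines (no line wrapping).
    """
    while True:
        title = handle.readline()
        if not title:
            return  # clean end of file between records
        if not title.startswith("@"):
            raise ValueError("expected '@' at start of title line")
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        # Truncation in the title or sequence leaves no '+' line at all.
        if not plus.startswith("+"):
            raise ValueError("missing or truncated '+' line")
        # Truncation in the '+' or quality line leaves too few scores.
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield title[1:].rstrip("\r\n"), seq, qual


# Usage: a complete record parses, a record cut off mid-quality does not.
good = io.StringIO("@r1\nACGT\n+\n;;;;\n")
records = list(parse_fastq(good))
```

Because the reader consumes four lines at a time, a quality line that
happens to begin with "@" is never mistaken for the next title line.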

Peter C.


From biopython at maubp.freeserve.co.uk  Thu Aug 27 16:40:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 27 Aug 2009 17:40:21 +0100
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <Pine.GSO.4.44.0908271231100.17078-100000@shell3.shore.net>
References: <320fb6e00908260306o37eb3d80q53fc50dce708d0b@mail.gmail.com>
	<Pine.GSO.4.44.0908271231100.17078-100000@shell3.shore.net>
Message-ID: <320fb6e00908270940r495aecc1n833aa9a28e8f3db3@mail.gmail.com>

On Thu, Aug 27, 2009 at 5:31 PM, Michael Heuer<heuermh at acm.org> wrote:
>
> Peter wrote:
>> Hi Michael - we asked the BioJava guys a while back, and at the time
>> there was interest but no volunteers. Who is working on this now?
>
> Perhaps I should have kept quiet -- I think I just volunteered. ;)
>
>    michael

Assuming you're serious, great :)

Peter


From heuermh at acm.org  Thu Aug 27 16:31:43 2009
From: heuermh at acm.org (Michael Heuer)
Date: Thu, 27 Aug 2009 12:31:43 -0400 (EDT)
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <320fb6e00908260306o37eb3d80q53fc50dce708d0b@mail.gmail.com>
Message-ID: <Pine.GSO.4.44.0908271231100.17078-100000@shell3.shore.net>

Peter wrote:

> On Wed, Aug 26, 2009 at 3:56 AM, Michael Heuer<heuermh at acm.org> wrote:
> > Peter wrote:
> >
> >> Hi all,
> >>
> >> I've been chatting with Peter Rice (EMBOSS) and Chris Fields (BioPerl)
> >> off list about this plan. I'm going to co-ordinate putting together a
> >> set of valid FASTQ files for shared testing (to supplement the
> >> existing set of invalid FASTQ files already done and being used in
> >> Biopython and BioPerl's unit tests - and hopefully with EMBOSS soon).
> >> ...
> >
> > Very cool idea, Peter, and Peter, and Chris. I don't believe anyone from
> > biojava has spoken up on this thread yet, so I thought I should add that
> > we are working towards a compatible implementation as well.
> >
> >    michael
>
> Hi Michael - we asked the BioJava guys a while back, and at the time
> there was interest but no volunteers. Who is working on this now?

Perhaps I should have kept quiet -- I think I just volunteered.  ;)

   michael


From biopython at maubp.freeserve.co.uk  Mon Aug 31 12:07:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 13:07:45 +0100
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
Message-ID: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>

Hi all,

I'm looking at indexing next generation sequence files for Biopython
(e.g. FASTQ short read files with 10s of millions of entries), where
even just holding the record names and their file offsets in memory
is beginning to be a bottleneck.
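
A minimal sketch of the approach in question, with one dictionary entry
(record name to byte offset) per read. It assumes unwrapped four-line
records and a file opened in binary mode; build_offset_index and
fetch_record are hypothetical names, not Biopython functions.

```python
import io


def build_offset_index(handle):
    """Map each record name to the byte offset of its '@' title line.

    The handle must be opened in binary mode so that tell() and seek()
    work on plain byte offsets. Assumes four lines per record.
    """
    index = {}
    while True:
        offset = handle.tell()
        title = handle.readline()
        if not title:
            break
        # The name is the first whitespace-delimited word after '@'.
        name = title[1:].split(None, 1)[0].decode("ascii")
        index[name] = offset
        for _ in range(3):  # skip the sequence, '+' and quality lines
            handle.readline()
    return index


def fetch_record(handle, index, name):
    """Seek to a record and return its four lines as text."""
    handle.seek(index[name])
    return [handle.readline().rstrip(b"\r\n").decode("ascii") for _ in range(4)]


# Usage with an in-memory file standing in for a large FASTQ file.
data = io.BytesIO(b"@r1\nACGT\n+\n;;;;\n@r2 desc\nTTTT\n+\n!!!!\n")
index = build_offset_index(data)
record = fetch_record(data, index, "r2")
```

With tens of millions of reads, the dictionary itself (one string key
plus one integer per record) is what dominates memory use.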

What is the current status of Open Biological Database Access (OBDA),
and in particular the index files for sequence "flat files" like FASTA or
GenBank (or FASTQ)?

http://www.bioperl.org/wiki/HOWTO:Flat_databases
http://www.bioperl.org/wiki/HOWTO:OBDA
http://obda.open-bio.org/

The spec files are still in CVS (and ViewCVS is still broken since
the recent server move), rather than having been migrated to SVN
which may suggest things are obsolete (or on the bright side, stable).

Presumably BioPerl still uses these index files? What about the
other projects? I know EMBOSS has some indexing system for
example but I have no idea how it works internally.

Thanks,

Peter


From ngoto at gen-info.osaka-u.ac.jp  Mon Aug 31 14:01:46 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 31 Aug 2009 23:01:46 +0900
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
Message-ID: <20090831140146.EAF851CBC594@idnmail.gen-info.osaka-u.ac.jp>

Hi Peter,

On Mon, 31 Aug 2009 13:07:45 +0100
Peter <biopython at maubp.freeserve.co.uk> wrote:

> Hi all,
> 
> I'm looking at indexing next generation sequence files for Biopython
> (e.g. FASTQ short read files with 10s of millions of entries), where
> even just holding the record names and their file offsets in memory
> is beginning to be a bottleneck.
>
> What is the current status of Open Biological Database Access (OBDA),
> and in particular the index files for sequence "flat files" like FASTA or
> GenBank (or FASTQ)?
> 
> http://www.bioperl.org/wiki/HOWTO:Flat_databases
> http://www.bioperl.org/wiki/HOWTO:OBDA
> http://obda.open-bio.org/
>
> The spec files are still in CVS (and ViewCVS is still broken since
> the recent server move), rather than having been migrated to SVN
> which may suggest things are obsolete (or on the bright side, stable).
> 
> Presumably BioPerl still uses these index files? What about the
> other projects? I know EMBOSS has some indexing system for
> example but I have no idea how it works internally.

BioRuby still uses them. To gain performance, names and offsets are
written to temporary files and sorted using an external sort program
(default /usr/bin/sort).
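
The same idea can be sketched without holding every entry in memory.
The example below is not BioRuby's code and does not call
/usr/bin/sort. As a stand-in, it sorts fixed-size chunks of
(name, offset) pairs into temporary files and merges them with Python's
heapq.merge, which amounts to an external merge sort. All function
names are hypothetical.

```python
import heapq
import os
import tempfile


def write_sorted_chunks(pairs, chunk_size, tmpdir):
    """Write (name, offset) pairs to sorted chunk files.

    At most chunk_size pairs are held in memory at any one time.
    Returns the list of chunk file paths.
    """
    paths = []
    chunk = []

    def flush():
        chunk.sort()
        fd, path = tempfile.mkstemp(dir=tmpdir, suffix=".idx")
        with os.fdopen(fd, "w") as out:
            for name, offset in chunk:
                out.write("%s\t%d\n" % (name, offset))
        paths.append(path)
        del chunk[:]

    for pair in pairs:
        chunk.append(pair)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()
    return paths


def _read_pairs(path):
    """Stream (name, offset) pairs back from one chunk file."""
    with open(path) as handle:
        for line in handle:
            name, offset = line.rstrip("\n").split("\t")
            yield name, int(offset)


def merge_chunks(paths, out_path):
    """Merge the sorted chunk files into one sorted name<TAB>offset file."""
    with open(out_path, "w") as out:
        for name, offset in heapq.merge(*[_read_pairs(p) for p in paths]):
            out.write("%s\t%d\n" % (name, offset))
```

Once the merged file is sorted by name, a lookup can binary-search it
on disk rather than loading it all.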

In BioRuby, the flatfile-only solution works fine, but the BerkeleyDB
indexes would be incompatible with other projects because of confusion
in the spec, as discussed in BioPerl Bugzilla Bug #2337.
http://bugzilla.open-bio.org/show_bug.cgi?id=2337

Thanks,

-- 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


From biopython at maubp.freeserve.co.uk  Mon Aug 31 15:07:28 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 16:07:28 +0100
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <20090831140146.EAF851CBC594@idnmail.gen-info.osaka-u.ac.jp>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
	<20090831140146.EAF851CBC594@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <320fb6e00908310807j2fbc9710x3621a16a54ad5e45@mail.gmail.com>

On Mon, Aug 31, 2009 at 3:01 PM, Naohisa
GOTO<ngoto at gen-info.osaka-u.ac.jp> wrote:
> Hi Peter,
>
>> Presumably BioPerl still uses these index files? What about the
>> other projects? I know EMBOSS has some indexing system for
>> example but I have no idea how it works internally.
>
> BioRuby still uses them. To gain performance, names and offsets are
> written to temporary files and sorted using an external sort program
> (default /usr/bin/sort).

That makes sense. Have you tried this on very large files? e.g.
FASTA with 10 million short reads?

> In BioRuby, the flatfile-only solution works fine, but the BerkeleyDB
> indexes would be incompatible with other projects because of confusion
> in the spec, as discussed in BioPerl Bugzilla Bug #2337.
> http://bugzilla.open-bio.org/show_bug.cgi?id=2337

Thank you for the link to that bug - I'll need to read that carefully.

Peter


From biopython at maubp.freeserve.co.uk  Mon Aug 31 15:45:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Aug 2009 16:45:51 +0100
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
	<8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu>
Message-ID: <320fb6e00908310845n3322d2c6hc992f78b81c0e226@mail.gmail.com>

On Mon, Aug 31, 2009 at 4:33 PM, Chris Fields<cjfields at illinois.edu> wrote:
>
> I don't use OBDA, personally, but I can check on the status with Brian
> Osborne (he was heading it up last I checked). However, I don't think
> BioPerl has an OBDA FASTQ parser.
>
> You may be thinking about Bio::Index::FASTQ? That one is not OBDA,
> but just a simple flat file indexer. We could probably set an OBDA parser
> up fairly easily if needed.

I didn't know if Bio::Index was using OBDA "under the hood" or not.
Does this mean BioPerl has multiple indexing systems available?

As I noted on Bug 2337 earlier today, Biopython used to have some
sort of OBDA compliant indexing, but for unrelated reasons we have
deprecated and removed that code. We're now revisiting this topic
due in part to having to deal with ever larger data files - and I wanted
to see if OBDA was still "alive" as a standard, and furthermore how
well it had scaled for the other OBF projects.

Peter


From cjfields at illinois.edu  Mon Aug 31 15:33:02 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 31 Aug 2009 10:33:02 -0500
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
Message-ID: <8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu>

On Aug 31, 2009, at 7:07 AM, Peter wrote:

> Hi all,
>
> I'm looking at indexing next generation sequence files for Biopython
> (e.g. FASTQ short read files with 10s of millions of entries), where
> even just holding the record names and their file offsets in memory
> is beginning to be a bottleneck.
>
> What is the current status of Open Biological Database Access (OBDA),
> and in particular the index files for sequence "flat files" like FASTA or
> GenBank (or FASTQ)?
>
> http://www.bioperl.org/wiki/HOWTO:Flat_databases
> http://www.bioperl.org/wiki/HOWTO:OBDA
> http://obda.open-bio.org/
>
> The spec files are still in CVS (and ViewCVS is still broken since
> the recent server move), rather than having been migrated to SVN
> which may suggest things are obsolete (or on the bright side, stable).
>
> Presumably BioPerl still uses these index files? What about the
> other projects? I know EMBOSS has some indexing system for
> example but I have no idea how it works internally.
>
> Thanks,
>
> Peter

I don't use OBDA, personally, but I can check on the status with Brian  
Osborne (he was heading it up last I checked).  However, I don't think  
BioPerl has an OBDA FASTQ parser.

You may be thinking about Bio::Index::FASTQ?  That one is not OBDA,  
but just a simple flat file indexer.  We could probably set an OBDA  
parser up fairly easily if needed.

chris


From cjfields at illinois.edu  Mon Aug 31 18:22:36 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 31 Aug 2009 13:22:36 -0500
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <320fb6e00908310845n3322d2c6hc992f78b81c0e226@mail.gmail.com>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
	<8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu>
	<320fb6e00908310845n3322d2c6hc992f78b81c0e226@mail.gmail.com>
Message-ID: <ED58ED84-3B68-4E45-B3DD-BA5F8050551B@illinois.edu>

On Aug 31, 2009, at 10:45 AM, Peter wrote:

> On Mon, Aug 31, 2009 at 4:33 PM, Chris Fields<cjfields at illinois.edu> wrote:
>>
>> I don't use OBDA, personally, but I can check on the status with Brian
>> Osborne (he was heading it up last I checked).  However, I don't think
>> BioPerl has an OBDA FASTQ parser.
>>
>> You may be thinking about Bio::Index::FASTQ?  That one is not OBDA,
>> but just a simple flat file indexer.  We could probably set an OBDA parser
>> up fairly easily if needed.
>
> I didn't know if Bio::Index was using OBDA "under the hood" or not.
> Does this mean BioPerl has multiple indexing systems available?

Yes.  We have Bio::Index::* and Bio::DB::Flat (which I think is OBDA).
There is also the older Bio::DB::Fasta, which is actually still in
wide use.  Note that with Bio::Index::* we allow streaming of any report
type (sequence, alignment, analysis like BLAST, etc).

We have talked about switching many of the sequence-based Bio::Index::*
modules to OBDA, but I haven't seen anyone take that up.

> As I noted on Bug 2337 earlier today, Biopython used to have some
> sort of OBDA compliant indexing, but for unrelated reasons we have
> deprecated and removed that code. We're now revisiting this topic
> due in part to having to deal with ever larger data files - and I wanted
> to see if OBDA was still "alive" as a standard, and furthermore how
> well it had scaled for the other OBF projects.
>
> Peter

I think it's still alive and being used; I'm just not sure what the
compliance level is amongst the different Bio* projects.

chris