[Open-bio-l] FASTQ identifiers

Peter biopython at maubp.freeserve.co.uk
Fri Jul 31 09:15:57 UTC 2009


On Fri, Jul 31, 2009 at 9:19 AM, Peter Rice<pmr at ebi.ac.uk> wrote:
> Another FASTQ topic.
>
> Should we try to understand FASTQ identifiers.

I would say no (see below), although project interoperability
shouldn't stop EMBOSS from doing this if it wants to.

Related to this, what about the corner case of reads with NO
identifier? The FASTQ (and indeed the FASTA) formats can
hold such things - just use a blank title line. In the case of
next generation sequencing reads, the names themselves
are not actually that important - so you can imagine a pipeline
which doesn't actually bother with them at all.
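To make the blank-title corner case concrete, here is a minimal sketch
of a reader that accepts records whose title line is just "@". It is not
Biopython's or EMBOSS's parser; the function name read_fastq and the
sample data are made up for illustration, and it assumes the simple
four-lines-per-record layout with no line wrapping.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (title, sequence, quality) tuples from an unwrapped FASTQ file.

    Hypothetical helper: assumes exactly four lines per record and does
    not try to interpret the title in any way.
    """
    while True:
        title_line = handle.readline()
        if not title_line:
            return  # end of file
        if not title_line.strip():
            continue  # tolerate blank lines between records
        if not title_line.startswith("@"):
            raise ValueError("Expected '@' at start of record")
        seq = handle.readline().rstrip("\r\n")
        plus_line = handle.readline()
        qual = handle.readline().rstrip("\r\n")
        if not plus_line.startswith("+"):
            raise ValueError("Expected '+' separator line")
        if len(seq) != len(qual):
            raise ValueError("Sequence and quality lengths differ")
        # An empty title is kept as an empty string, not treated as an error.
        yield title_line[1:].strip(), seq, qual


# Made-up example: the first read has no identifier at all.
example = "@\nACGT\n+\nIIII\n@read2 some description\nGGCC\n+\n!!!!\n"
records = list(read_fastq(StringIO(example)))
```

Here the first record comes back with an empty string as its title, so
any downstream code that indexes reads by name has to decide for itself
what to do with it.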

> There are some standard identifiers with meaningful elements
> that could be useful for reporting or subsetting FASTQ data.

True.

> Can we agree on how to parse those and what they can be used for?

The situation is similar to the FASTA format (and others), in that there
are a number of reasonably well documented conventions in use
(e.g. the NCBI FASTA identifiers with | characters). However, equally,
there are thousands of ad hoc local conventions.

In EMBOSS, you cater to a few FASTA variants where you do
parse the identifier. This might address the FASTQ situation too.

In Biopython we don't do anything clever with the FASTA identifier,
nor the FASTQ identifier. As the Zen of Python says: "In the face of
ambiguity, refuse the temptation to guess."

If a Biopython user wants to parse the identifier and, say, filter
on the lane number, they can do this themselves.
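As a rough illustration of what "doing it themselves" might look like,
the sketch below filters on lane number in plain Python. It assumes one
particular Illumina/Solexa style title of the form
instrument:lane:tile:x:y (optionally followed by #index and /pair);
other pipelines may well name reads differently. The names
ILLUMINA_TITLE, lane_of and filter_by_lane are made up for this example,
and records are taken to be plain (title, sequence, quality) tuples.

```python
import re

# Assumed convention (not universal): instrument:lane:tile:x:y, with an
# optional "#index" and "/pair" suffix that this pattern simply ignores.
ILLUMINA_TITLE = re.compile(
    r"^(?P<instrument>[^:\s]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>-?\d+):(?P<y>-?\d+)"
)


def lane_of(title):
    """Return the lane number as an int, or None if the title does not match."""
    match = ILLUMINA_TITLE.match(title)
    return int(match.group("lane")) if match else None


def filter_by_lane(records, lane):
    """Yield only those (title, seq, qual) records whose title gives this lane."""
    for title, seq, qual in records:
        if lane_of(title) == lane:
            yield title, seq, qual


# Made-up reads: two following the assumed convention, one with no title.
reads = [
    ("HWUSI-EAS100R:6:73:941:1973#0/1", "ACGT", "IIII"),
    ("HWUSI-EAS100R:7:12:100:200#0/1", "GGCC", "IIII"),
    ("", "TTAA", "IIII"),
]
lane_six = list(filter_by_lane(reads, 6))
```

Titles that do not match (including blank ones) give None rather than a
guess, so they are silently dropped by the filter; a real pipeline might
prefer to report them instead.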

> What other naming conventions are in common use e.g. for non-SOlexa
> instruments?

Keep in mind that even for a single manufacturer's instrument,
there are different versions of the pipeline, and indeed alternative
pipelines. For example, I understand Sanger is using a modified
pipeline on their Illumina sequencers, which may introduce their
own naming.

For Roche 454, their tools don't currently let you produce FASTQ
directly, but this is easy to get from the FASTA and QUAL files that
the Roche tools do output. This indirectly defines a Roche identifier
convention.
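For what it's worth, the FASTA plus QUAL to FASTQ step can be sketched
in a few lines of standard-library Python. This is only an illustration
under stated assumptions: the QUAL file holds space-separated integer
PHRED scores, the output uses the Sanger style encoding (ASCII offset
33), and both files list the reads in the same order. The function
names and the "read1" record are made up, not real Roche output.

```python
from itertools import zip_longest


def parse_fasta_like(text):
    """Yield (title, data_lines) from FASTA or QUAL text (both use '>' titles)."""
    title, lines = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if title is not None:
                yield title, lines
            title, lines = line[1:].strip(), []
        elif line.strip():
            if title is None:
                raise ValueError("Data found before the first '>' line")
            lines.append(line.strip())
    if title is not None:
        yield title, lines


def fasta_qual_to_fastq(fasta_text, qual_text):
    """Return FASTQ text (assumed Sanger encoding, offset 33) from FASTA + QUAL."""
    out = []
    pairs = zip_longest(parse_fasta_like(fasta_text), parse_fasta_like(qual_text))
    for fasta_rec, qual_rec in pairs:
        if fasta_rec is None or qual_rec is None:
            raise ValueError("FASTA and QUAL files have different record counts")
        (f_title, seq_lines), (q_title, qual_lines) = fasta_rec, qual_rec
        # Compare only the first word of each title; [:1] copes with blank titles.
        if f_title.split()[:1] != q_title.split()[:1]:
            raise ValueError("Identifier mismatch: %r vs %r" % (f_title, q_title))
        seq = "".join(seq_lines)
        scores = [int(s) for s in " ".join(qual_lines).split()]
        if len(scores) != len(seq):
            raise ValueError("Sequence and quality lengths differ for %r" % f_title)
        if any(s < 0 or s > 93 for s in scores):
            raise ValueError("PHRED score outside 0-93 for %r" % f_title)
        quality = "".join(chr(s + 33) for s in scores)
        # The title is copied through untouched, whatever convention it follows.
        out.append("@%s\n%s\n+\n%s\n" % (f_title, seq, quality))
    return "".join(out)


# Made-up single-read example.
fastq_text = fasta_qual_to_fastq(">read1 length=4\nACGT\n",
                                 ">read1 length=4\n40 40 30 0\n")
```

Because the title line is passed through unchanged, whatever naming the
FASTA and QUAL files use ends up as the FASTQ identifier.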

Peter C.


