[EMBOSS] FASTQ records with no sequence?
Peter
biopython at maubp.freeserve.co.uk
Thu Jul 30 15:00:37 UTC 2009
Hi all,
On the continuing topic of the nebulous FASTQ format, are there
any strong views as to whether a FASTQ file could hold records
without a sequence (and therefore no quality scores)? This could
make sense as output from an (aggressive) quality filter.
This is a corner case, and applies to other file formats too of course
(e.g. FASTA).
I mentioned this to Peter Rice (EMBOSS) off list, and he replied:
On Thu, Jul 30, 2009 at 2:56 PM, Peter Rice<pmr at ebi.ac.uk> wrote:
> EMBOSS rejects zero length sequences - something we put in some years
> ago for misformatted FASTA files that someone ran through a Taverna
> workflow to launch clustalw via EMBOSS's "emma". The user had got his
> carriage control characters mangled so the sequence was appended to the
> FASTA '>' line and appeared as a long description with no sequence.
>
> I can well imagine for filtering paired reads that zero length sequences
> would be useful.
>
> At the point where the test is made we know the sequence format.
> We can therefore define some or all formats as accepting or rejecting
> zero length sequences.
>
> Similarly we can easily extend to define some applications (e.g. emma)
> as requiring a minimum sequence length.
>
> regards,
>
> Peter
Peter Rice is of course correct - in general the meaning and validity
of a zero length sequence is context dependent.
I think Peter Rice makes a good point regarding paired end reads.
What I assume he was getting at is the situation where due to
quality trimming, one of a pair might be trimmed to nothing - leaving
essentially a singleton read. However, paired end reads are normally
stored using a matched pair of FASTQ files, so it could be important
to keep the zero length read present as a placeholder, so that the
two files can still be read in together in sync.
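To make the sync issue concrete, here is a rough Python sketch. It is not EMBOSS or Biopython code: the function names trim_3prime and trim_pairs, the (identifier, sequence, qualities) tuples and the example reads are all made up for illustration. Each read is trimmed from the 3' end, and a read trimmed to nothing is kept as a zero length record rather than dropped, so both outputs keep the same number of records in the same order.

```python
def trim_3prime(seq, quals, min_qual):
    """Trim low quality bases from the 3' end; may return an empty read."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    return seq[:end], quals[:end]


def trim_pairs(forward, reverse, min_qual):
    """Trim both reads of each pair, keeping zero length reads as placeholders.

    forward and reverse are lists of (identifier, sequence, qualities)
    tuples in matching order (a hypothetical in-memory representation).
    """
    if len(forward) != len(reverse):
        raise ValueError("Paired inputs hold different numbers of records")
    out_f, out_r = [], []
    for (id_f, seq_f, q_f), (id_r, seq_r, q_r) in zip(forward, reverse):
        # Never drop a record, even if nothing is left after trimming,
        # otherwise the two outputs would fall out of step.
        out_f.append((id_f,) + trim_3prime(seq_f, q_f, min_qual))
        out_r.append((id_r,) + trim_3prime(seq_r, q_r, min_qual))
    return out_f, out_r


# Invented example data: the reverse read of pair1 is all low quality.
forward = [("pair1/1", "ACGT", [30, 30, 30, 30]),
           ("pair2/1", "GGCC", [30, 30, 5, 5])]
reverse = [("pair1/2", "TTAA", [2, 2, 2, 2]),
           ("pair2/2", "CATG", [30, 30, 30, 30])]
out_f, out_r = trim_pairs(forward, reverse, 20)
```

Here the reverse read of pair1 comes back with an empty sequence, but it still occupies the first slot, so record N in one output still pairs with record N in the other.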
If we do want to allow zero length sequences in FASTQ, would
both of the following be valid? Should there be empty sequence
and quality lines, or no sequence and quality lines?
"@identifier\n+\n" (two lines, just the @ and + lines)
"@identifier\n\n+\n\n" (four lines, including blank seq and qual lines)
or with the repeated identifier on the plus lines:
"@identifier\n+identifier\n" (two lines, just the @ and + lines)
"@identifier\n\n+identifier\n\n" (four lines, including blank lines)
As we are recommending no line wrapping on output, this means
typical FASTQ records would be four lines - so doing the same
(four lines, with blank sequence and quality lines) makes sense
here too.
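As a minimal sketch of what the four line convention would mean in practice (assuming strict four line records with no line wrapping; format_fastq and parse_fastq_four_line are hypothetical names, not functions from EMBOSS or Biopython):

```python
def format_fastq(identifier, seq, qual, repeat_id=False):
    """Return one FASTQ record as four lines, even if seq is empty."""
    if len(seq) != len(qual):
        raise ValueError("Sequence and quality lengths differ")
    plus = "+" + identifier if repeat_id else "+"
    return "@%s\n%s\n%s\n%s\n" % (identifier, seq, plus, qual)


def parse_fastq_four_line(text):
    """Parse unwrapped FASTQ text into (identifier, seq, qual) tuples.

    Assumes exactly four lines per record, so blank sequence and
    quality lines are accepted but the two line form is rejected.
    """
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("Expected a multiple of four lines")
    records = []
    for i in range(0, len(lines), 4):
        title, seq, plus, qual = lines[i:i + 4]
        if not title.startswith("@"):
            raise ValueError("Record does not start with '@'")
        if not plus.startswith("+"):
            raise ValueError("Third line does not start with '+'")
        if plus[1:] and plus[1:] != title[1:]:
            raise ValueError("Identifier on '+' line does not match")
        if len(seq) != len(qual):
            raise ValueError("Sequence and quality lengths differ")
        records.append((title[1:], seq, qual))
    return records


empty_record = format_fastq("identifier", "", "")
parsed = parse_fastq_four_line(empty_record)
```

With this approach the zero length record is just a special case of the normal record, with no extra branch needed in either the writer or the parser. The two line form would need separate handling, since a parser counting four lines per record would misread it.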
Peter C.