[emboss-dev] EMBOSS 6.3.0 released - SAM/BAM

Tue Aug 3 03:27:21 EDT 2010

On 08/02/10 18:41, Peter C. wrote:
> On Thu, Jul 15, 2010 at 12:36 PM, Peter<biopython at maubp.freeserve.co.uk>  wrote:
>> On Thu, Jul 15, 2010 at 12:12 PM, Peter Rice<pmr at ebi.ac.uk>  wrote:
>>>
>>>> What do you do about naming for paired reads? I was appending
>>>> /1 or /2 to match the Illumina convention. Doing nothing means
>>>> the paired reads will have the same names.
>>>
>>> Not addressed yet - let's look into a common approach though.
>>> We would also have to lok into what the '/' character does to EMBOSS's
>>> handling of sequence names.
>>
>> My rational for appending the /1 and /2 is that in a typical workflow
>> you might take Illumina paired end data as FASTQ and map it onto
>> a genome with BWA giving SAM/BAM. You might then want to reverse
>> this (e.g. if given a SAM/BAM file by a collaborator, and you want to
>> try an alternative mapping tool or reference genome, first you must
>> recover the raw reads again, e.g. as FASTQ files).
>
> Just for the record, EMBOSS 6.3.1 does not append anything to the
> read names, meaning paired end reads cannot be distinguished if
> output as FASTA or FASTQ.
>
> I'm not sure my idea of appending /1 or /2 for paired reads is the
> best solution (especially since there are other naming schemes
> out there like _f and _r as suffixes). Nevertheless, it seems like a
> practical solution. Would including a slash character within a
> sequence name cause problems in EMBOSS (a potential issue
> you raised earlier)?

The /1 and /2 would cause horrible problems. The sequence names are used 
to generate default output file names so a '/' would have to be removed 
or converted, most likely to _1 and _2

_f or _r as a suffix is much better ... but should we always assume 
these meanings? Should we add a command-line switch for paired read 
data? Should we only do something for fastq, sam and bam (or other NGS 
formats?)

It is a mystery to me how paired reads came to have the same name. When 
we first used them at EMBL for the Human HPRT locus we made sure to add 
an "r" suffix to the reverse reads.... but then, as we used the GCG 
assembly system, we were forced to have a unique name :-)

> Also, and this may be a bug, on output as unaligned SAM (and I
> assume also for unaligned BAM), the fact that a read is paired and
> the information about if is it the first or second read is lost. The
> FLAG is just set to 4, meaning unmapped. e.g.
>
> seqret -sformat bam -osformat sam ex1.bam -filter

Hmmm ... this kind of thing is specific to SAM-BAM conversions, as other 
formats will lose it unless we find some way to preserve the detail.

We will take a look at what we can keep between these formats (we do 
make similar efforts between EMBL and GenBank formats)

>> Given the use case of going from (aligned) SAM/BAM back to the
>> original FASTQ, for a round trip you *must* undo the reverse
>> complementation. This is important even for single reads, as quality
>> scores tend to trail off in the (original) read direction so some algorithms
>> may treat a reverse version of the read differently.

We will look into that one too.

Many thanks for the suggestions

Peter Rice