From biopython at maubp.freeserve.co.uk  Mon Aug  2 07:37:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 12:37:35 +0100
Subject: [emboss-dev] Bug report and patch - BAM quality score reading
Message-ID: <AANLkTim_9x1S6Ej8=m2nXi23oTOJKQ8PKJitc1iAhqxR@mail.gmail.com>

Hi all,

Since I had several queries about what EMBOSS would do with SAM/BAM,
http://lists.open-bio.org/pipermail/emboss-dev/2010-July/000656.html
I decided to try it and see. I believe I have found a bug with reading quality
scores from BAM files in EMBOSS 6.3.1

$ seqret -version
EMBOSS:6.3.1

I have been using a small pair of SAM and BAM files, originally downloaded
as a SAM file with reference FASTA sequence from the pysam project, which
I converted to BAM using samtools as they specify in their readme file
http://code.google.com/p/pysam/source/browse/#hg/tests

e.g.

curl -O http://pysam.googlecode.com/hg/tests/ex1.fa
curl -O http://pysam.googlecode.com/hg/tests/ex1.sam.gz
gunzip ex1.sam.gz
samtools faidx ex1.fa
samtools import ex1.fa.fai ex1.sam ex1.bam

If we look at the first two reads in the SAM file, notice their quality strings:

$ head ex1.sam
EAS56_57:6:190:289:82	69	chr1	100	0	*	=	100	0	CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA	<<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;	MF:i:192
EAS56_57:6:190:289:82	137	chr1	100	73	35M	=	100	0	AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC	<<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;	MF:i:64	Aq:i:0	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
...

Now let's ask EMBOSS seqret to convert from SAM to Sanger FASTQ,

$ seqret -sformat sam -osformat fastq-sanger ex1.sam -stdout -auto | head
@EAS56_57:6:190:289:82 chr1
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
+
<<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;
@EAS56_57:6:190:289:82 chr1
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
+
<<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;
@EAS51_64:3:190:727:308 chr1
GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG

The quality strings agree with the SAM file, good.

Now let's ask EMBOSS seqret to convert from BAM to Sanger FASTQ,

$ seqret -sformat bam -osformat fastq-sanger ex1.bam -stdout -auto | head
@EAS56_57:6:190:289:82
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
+
]]]X]]]\]]]]]]]]Y\\]X\U]\]\\\\\ZU]\
@EAS56_57:6:190:289:82
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
+
]]]]]]\]]]]]]]]]]\]]\]]]]\Y]W\Z\\S\
@EAS51_64:3:190:727:308
GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG

The quality strings differ, this is bad.

In the SAM file these two reads have quality strings starting with "<",
ASCII 60, meaning PHRED 60-33 = 27.

In the funny BAM to Sanger FASTQ conversion, EMBOSS has used
"]" which is ASCII 93, giving PHRED 93-33 = 60. i.e. 33 more than it
should be. I suspected that the EMBOSS code for reading BAM files
was wrongly applying a 33 offset to the quality scores. In BAM files
the scores are simply encoded directly as uint8_t without any offset.
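
As a minimal sketch of this arithmetic (the helper names are made up
for illustration and are not EMBOSS functions), the two conversions are:

```c
#include <stdint.h>

/* Offset used by the Sanger FASTQ encoding (and SAM text) for qualities. */
#define SANGER_OFFSET 33

/* Hypothetical helper: raw BAM quality byte -> Sanger FASTQ character.
 * Only meaningful for PHRED scores in the printable range 0..93. */
static char bam_qual_to_sanger_char(uint8_t raw)
{
    return (char) (raw + SANGER_OFFSET);
}

/* Hypothetical helper: Sanger FASTQ (or SAM) character -> PHRED score. */
static int sanger_char_to_phred(char c)
{
    return (int) c - SANGER_OFFSET;
}
```

For the first read, sanger_char_to_phred('<') is 27, and the BAM byte 27
maps back to '<'. If the offset is applied once too often, the stored
score becomes 60 and is then written out as ']'.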

Looking at the source code, file ajax/core/ajseqread.c we have:

        for(i=0; i < (ajuint) c->l_qseq; i++)
        {
            ajFmtPrintAppS(&qualstr, " %02x", 33+d[dpos]);
            thys->Accuracy[i] = (float) (33 + d[dpos++]);
        }

The creation of a quality string appears to be for debug only,
and here adding 33 to make the scores printable ASCII using
the Sanger FASTQ encoding makes sense. However, adding
the offset to the accuracy looks like an oversight. How about:

        for(i=0; i < (ajuint) c->l_qseq; i++)
        {
            ajFmtPrintAppS(&qualstr, " %02x", 33+d[dpos]);
            thys->Accuracy[i] = (float) d[dpos++];
        }

With this tiny change, I get the expected Sanger FASTQ output
from a BAM file using seqret.

Regards,

Peter C.

From biopython at maubp.freeserve.co.uk  Mon Aug  2 09:55:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 14:55:10 +0100
Subject: [emboss-dev] Bug report and patch - SAM parser and negative ISIZE
Message-ID: <AANLkTikBPEbkRJYLtd6yf0367mXr59jG=UHXsEmhPq-A@mail.gmail.com>

Hi again,

This is another bug report for EMBOSS 6.3.1 (compiled on Mac OS X
10.6.4 Snow Leopard) using the same example files as earlier, see:
http://lists.open-bio.org/pipermail/emboss-dev/2010-August/thread.html

For the purposes of a concise example, I'm using seqret to convert
SAM/BAM to FASTA so as to count the number of reads. See also:
http://lists.open-bio.org/pipermail/emboss/2010-July/003951.html

I believe this SAM and BAM file both contain 3270 reads, but EMBOSS
is having trouble with the SAM file:

$ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | grep -c "^>"
3270
$ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | grep -c "^>"
41

If we look at the output,

$ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto
>EAS56_57:6:190:289:82 chr1
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>EAS56_57:6:190:289:82 chr1
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
...
>EAS114_28:6:155:68:326 chr1
CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAA
>EAS188_7:7:19:886:279 chr1
CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTA

Looking at the SAM file, I guessed EMBOSS doesn't like a negative
ISIZE field in the next record, EAS54_61:4:143:69:578, from the
SAM file we have:

...
EAS114_28:6:155:68:326	99	chr1	182	99	36M	=	332	186	CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAA	<<<<<<<<<<<<<<<<<<<<<<<<<<<<:<<<<<<<	MF:i:18	Aq:i:76	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
EAS188_7:7:19:886:279	99	chr1	182	99	35M	=	337	190	CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTA	<9<<<<<<<<<<<<6<28:<<85<<<<<2<;<9<<	MF:i:18	Aq:i:67	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
EAS54_61:4:143:69:578	147	chr1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	MF:i:18	Aq:i:35	NM:i:1	UQ:i:5	H0:i:1	H1:i:0
EAS54_71:4:13:981:659	181	chr1	187	0	*	=	188	0	CGGGACAATGGACGAGGTAAACCGCACATTGACAA	+)---3&&3&--+0)&+3:7777).333:<06<<<	MF:i:192
...

Looking at the source code, currently EMBOSS is wrongly assuming
an unsigned integer will be used. This is not true, the spec allows for
a negative ISIZE. I replaced this code in ajax/core/ajseqread.c

    ajStrTokenNextParseNoskip(&handle,&token); /* ISIZE */
    ajDebug("ISIZE '%S'\n", token);
    if(ajStrGetLen(token)){
        if(!ajStrToUint(token, &flags))
            return ajFalse;
    }

with:

    ajStrTokenNextParseNoskip(&handle,&token); /* ISIZE */
    ajDebug("ISIZE '%S'\n", token);
    if(ajStrGetLen(token)){
        if(!ajStrToInt(token, &flags))
            return ajFalse;
    }

(i.e. Uint to Int), and now I get the correct read count.
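
To illustrate the difference outside EMBOSS, here is a minimal sketch
using only the C standard library. Both functions are hypothetical
stand-ins, not the AJAX ajStrToUint/ajStrToInt implementations. The
unsigned version is assumed to reject any signed text, which models
the failure seen on "-184":

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical unsigned-only field parser: the text must start with a
 * digit, so "-184" is refused. Returns 1 on success, 0 on failure. */
static int parse_unsigned_field(const char *text, unsigned int *out)
{
    char *end = NULL;
    unsigned long value;

    if (!isdigit((unsigned char) text[0]))
        return 0;
    errno = 0;
    value = strtoul(text, &end, 10);
    if (errno != 0 || *end != '\0' || value > UINT_MAX)
        return 0;
    *out = (unsigned int) value;
    return 1;
}

/* Hypothetical signed field parser, which is what ISIZE needs.
 * Returns 1 on success, 0 on failure. */
static int parse_signed_field(const char *text, int *out)
{
    char *end = NULL;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || errno != 0 || *end != '\0')
        return 0;
    if (value < INT_MIN || value > INT_MAX)
        return 0;
    *out = (int) value;
    return 1;
}
```

With these, the ISIZE values 186 and 190 parse either way, while -184
is only accepted by parse_signed_field.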

A related question is why did this error condition not give any
error message to stdout or stderr?

Regards,

Peter C.

From pmr at ebi.ac.uk  Mon Aug  2 11:42:00 2010
From: pmr at ebi.ac.uk (Peter Rice)
Date: Mon, 02 Aug 2010 16:42:00 +0100
Subject: [emboss-dev] Bug reports and patches: BAM quality,
 SAM negative ISIZE
In-Reply-To: <AANLkTikBPEbkRJYLtd6yf0367mXr59jG=UHXsEmhPq-A@mail.gmail.com>
References: <AANLkTikBPEbkRJYLtd6yf0367mXr59jG=UHXsEmhPq-A@mail.gmail.com>
Message-ID: <4C56E748.7010803@ebi.ac.uk>

On 02/08/10 14:55, Peter C. wrote:

> In the funny BAM to Sanger FASTQ conversion, EMBOSS has used
> "]" which is ASCII 93, giving PHRED 93-33 = 60. i.e. 33 more than it
> should be. I suspected that the EMBOSS code for reading BAM files
> was wrongly applying a 33 offset to the quality scores. In BAM files
> the scores are simply encoded directly as uint8_t without any offset.

Thanks for spotting that. We will make a patch with that fix in.


> Looking at the SAM file, I guessed EMBOSS doesn't like a negative
> ISIZE field in the next record, EAS54_61:4:143:69:578,  .........
>
> Looking at the source code, currently EMBOSS is wrongly assuming
> an unsigned integer will be used. This is not true, the spec allows for
> a negative ISIZE. I replaced this code in ajax/core/ajseqread.c

Thanks for the fix. We will add that to the patch.

> A related question is why did this error condition not give any
> error message to stdout or stderr?

This appears to be a general issue with reading unknown and known 
formats. We will fix it so that error messages are turned on for this 
failure condition.

Many thanks for the bug reports - and the fixes!!

Peter R.

From biopython at maubp.freeserve.co.uk  Mon Aug  2 11:52:56 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 16:52:56 +0100
Subject: [emboss-dev] Bug reports and patches: BAM quality,
	SAM negative 	ISIZE
In-Reply-To: <4C56E748.7010803@ebi.ac.uk>
References: <AANLkTikBPEbkRJYLtd6yf0367mXr59jG=UHXsEmhPq-A@mail.gmail.com>
	<4C56E748.7010803@ebi.ac.uk>
Message-ID: <AANLkTin7mvdWHE1_DOKtEjxOS7gDL0ZtZyXjROHhZ_is@mail.gmail.com>

On Mon, Aug 2, 2010 at 4:42 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
> On 02/08/10 14:55, Peter C. wrote:
>
>> In the funny BAM to Sanger FASTQ conversion, EMBOSS has used
>> "]" which is ASCII 93, giving PHRED 93-33 = 60. i.e. 33 more than it
>> should be. I suspected that the EMBOSS code for reading BAM files
>> was wrongly applying a 33 offset to the quality scores. In BAM files
>> the scores are simply encoded directly as uint8_t without any offset.
>
> Thanks for spotting that. We will make a patch with that fix in.
>
>> Looking at the SAM file, I guessed EMBOSS doesn't like a negative
>> ISIZE field in the next record, EAS54_61:4:143:69:578, .........
>>
>> Looking at the source code, currently EMBOSS is wrongly assuming
>> an unsigned integer will be used. This is not true, the spec allows for
>> a negative ISIZE. I replaced this code in ajax/core/ajseqread.c
>
> Thanks for the fix. We will add that to the patch.
>

Great. Are you still issuing patches which don't affect the version number?
I'd prefer to have an easy way to know if a given install of EMBOSS
has certain fixes, and a point release seems quite straightforward from
an outsider's perspective.

P.S. Expect a couple more reports to follow... so don't rush a patch or
point release out just yet ;)

>> A related question is why did this error condition not give any
>> error message to stdout or stderr?
>
> This appears to be a general issue with reading unknown and known formats.
> We will fix it so that error messages are turned on for this failure
> condition.

Good :)

> Many thanks for the bug reports - and the fixes!!
>

No problem,

Peter


From biopython at maubp.freeserve.co.uk  Mon Aug  2 12:26:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 17:26:07 +0100
Subject: [emboss-dev] Inconsistency in SAM vs BAM read description
Message-ID: <AANLkTin1uBSuLn+6cC-_NaS5XzuF5E-WFASbNDvJG-76@mail.gmail.com>

Hi all,

After patching the following two issues,
http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000667.html
http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000668.html
there is a noticeable difference in the output from the SAM and BAM
parsers in the description of the reads:

$ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | head
>EAS56_57:6:190:289:82
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>EAS56_57:6:190:289:82
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
>EAS51_64:3:190:727:308
GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG
>EAS112_34:7:141:80:875
AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA
>EAS219_FC30151:3:40:1128:1940
CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC


$ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | head
>EAS56_57:6:190:289:82 chr1
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>EAS56_57:6:190:289:82 chr1
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
>EAS51_64:3:190:727:308 chr1
GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG
>EAS112_34:7:141:80:875 chr1
AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA
>EAS219_FC30151:3:40:1128:1940 chr1
CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC

As you can see from the above example (using files described in
the linked threads), when parsing SAM files if the read is mapped
then the reference sequence name is used as the description.
This seems like a sensible and useful thing to do. However, when
parsing BAM files this is not currently being done.

Having the SAM and BAM parser produce identical results is
very useful for testing purposes (e.g. running diff on their output
as FASTQ format), so I would like the BAM parser to do the same.

Looking at the source, function seqReadSam in ajax/core/ajseqread.c
does this with the reference name string:

    ajStrTokenNextParseNoskip(&handle,&token); /* RNAME */
    ajDebug("RNAME '%S'\n", token);
    if(ajStrGetLen(token))
        seqAccSave(thys, token);

Therefore the BAM parser needs to do something similar, first
mapping the integer rID (reference sequence ID) to the array of
reference names from the BAM header.

I got as far as a partial solution but it only worked on the first read.
The problem is that although header variable ntargets is stored as
bamdata->Nref it does not appear that the array of strings
targetname is kept (likewise the array of integers targetlen but we
don't care about that here).
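
The lookup itself would be simple once the names are retained. A
minimal sketch, assuming a made-up RefTable structure that is filled
from the BAM header (none of these names exist in EMBOSS):

```c
#include <stddef.h>

/* Hypothetical container for what the BAM header provides: the number
 * of reference sequences and their names. */
typedef struct {
    int nref;                  /* number of reference sequences */
    const char *const *names;  /* names[i] belongs to reference ID i */
} RefTable;

/* Return the reference name for a read, or NULL when the read has no
 * reference (ID -1) or the ID is out of range. */
static const char *ref_name_for_read(const RefTable *table, int ref_id)
{
    if (table == NULL || ref_id < 0 || ref_id >= table->nref)
        return NULL;
    return table->names[ref_id];
}
```

A non-NULL result could then be passed to the same kind of call the
SAM parser makes for RNAME.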

Regards,

Peter C.

From biopython at maubp.freeserve.co.uk  Mon Aug  2 12:42:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 17:42:07 +0100
Subject: [emboss-dev] Inconsistency in SAM vs BAM read description
In-Reply-To: <AANLkTin1uBSuLn+6cC-_NaS5XzuF5E-WFASbNDvJG-76@mail.gmail.com>
References: <AANLkTin1uBSuLn+6cC-_NaS5XzuF5E-WFASbNDvJG-76@mail.gmail.com>
Message-ID: <AANLkTi=LtEjtV3azhCT6FY7ekXNYoXDkkjiGasZTgxtB@mail.gmail.com>

On Mon, Aug 2, 2010 at 5:26 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> After patching the following two issues,
> http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000667.html
> http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000668.html
> there is a noticeable difference in the output from the SAM and BAM
> parsers in the description of the reads:
>
> $ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | head
>>EAS56_57:6:190:289:82
> CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>>EAS56_57:6:190:289:82
> AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
>>EAS51_64:3:190:727:308
> GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG
>>EAS112_34:7:141:80:875
> AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA
>>EAS219_FC30151:3:40:1128:1940
> CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC
>
>
> $ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | head
>>EAS56_57:6:190:289:82 chr1
> CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>>EAS56_57:6:190:289:82 chr1
> AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
>>EAS51_64:3:190:727:308 chr1
> GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG
>>EAS112_34:7:141:80:875 chr1
> AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA
>>EAS219_FC30151:3:40:1128:1940 chr1
> CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC
>
> As you can see from the above example (using files described in
> the linked threads), when parsing SAM files if the read is mapped
> then the reference sequence name is used as the description.
> This seems like a sensible and useful thing to do. However, when
> parsing BAM files this is not currently being done.
>
> Having the SAM and BAM parser produce identical results is
> very useful for testing purposes (e.g. running diff on their output
> as FASTQ format), so I would like the BAM parser to do the same.
>
> Looking at the source, function seqReadSam in ajax/core/ajseqread.c
> does this with the reference name string:
>
>    ajStrTokenNextParseNoskip(&handle,&token); /* RNAME */
>    ajDebug("RNAME '%S'\n", token);
>    if(ajStrGetLen(token))
>        seqAccSave(thys, token);
>

Just as a post script,

Having failed to enhance the BAM parser, for short term testing
I'm just commenting out the above two lines of the SAM parser.
With that trivial change, then the FASTA and FASTQ output
from both the SAM and BAM files agrees 100% (as you would
expect).

Peter C.


From biopython at maubp.freeserve.co.uk  Mon Aug  2 13:41:25 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 18:41:25 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To: <AANLkTimp3WrTbgH1Md-2jOLC6ZN_A8UeijG7FdUpGjhf@mail.gmail.com>
References: <mailman.1170.1279190734.3031.emboss-dev@lists.open-bio.org>
	<AANLkTikdCydb5I08EWmJu7eYjvH3RKkMMMojdfN6tdyk@mail.gmail.com>
	<4C3EED02.7080507@ebi.ac.uk>
	<AANLkTimp3WrTbgH1Md-2jOLC6ZN_A8UeijG7FdUpGjhf@mail.gmail.com>
Message-ID: <AANLkTi=xtj0QKD8T52G0QM72-YpqjJuFmC51efdxEcbD@mail.gmail.com>

On Thu, Jul 15, 2010 at 12:36 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Jul 15, 2010 at 12:12 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>
>>> What do you do about naming for paired reads? I was appending
>>> /1 or /2 to match the Illumina convention. Doing nothing means
>>> the paired reads will have the same names.
>>
>> Not addressed yet - let's look into a common approach though.
>> We would also have to look into what the '/' character does to EMBOSS's
>> handling of sequence names.
>
> My rationale for appending the /1 and /2 is that in a typical workflow
> you might take Illumina paired end data as FASTQ and map it onto
> a genome with BWA giving SAM/BAM. You might then want to reverse
> this (e.g. if given a SAM/BAM file by a collaborator, and you want to
> try an alternative mapping tool or reference genome, first you must
> recover the raw reads again, e.g. as FASTQ files).

Just for the record, EMBOSS 6.3.1 does not append anything to the
read names, meaning paired end reads cannot be distinguished if
output as FASTA or FASTQ.

I'm not sure my idea of appending /1 or /2 for paired reads is the
best solution (especially since there are other naming schemes
out there like _f and _r as suffixes). Nevertheless, it seems like a
practical solution. Would including a slash character within a
sequence name cause problems in EMBOSS (a potential issue
you raised earlier)?

Also, and this may be a bug, on output as unaligned SAM (and I
assume also for unaligned BAM), the fact that a read is paired and
the information about whether it is the first or second read is lost. The
FLAG is just set to 4, meaning unmapped. e.g.

seqret -sformat bam -osformat sam ex1.bam -filter

or:

seqret -sformat sam -osformat sam ex1.sam -filter

>>> What do you do about the strand issue? SAM/BAM stored reads
>>> which map onto the reverse strand in reverse complement. If
>>> you want to get back to the original orientation for output as
>>> FASTQ you must apply the reverse complement (plus reverse
>>> the quality scores too of course).
>>
>> So far we read as sequences. Reading as mapped reads (very large
>> alignments) is planned for the very near future so it can appear in the
>> next release.
>
> Given the use case of going from (aligned) SAM/BAM back to the
> original FASTQ, for a round trip you *must* undo the reverse
> complementation. This is important even for single reads, as quality
> scores tend to trail off in the (original) read direction so some algorithms
> may treat a reverse version of the read differently.

To clarify, EMBOSS 6.3.1 does not flip reads mapped to the reverse strand:
http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000667.html
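
For reference, undoing the flip is a small operation. A minimal sketch
(the function names are invented for illustration, and IUPAC ambiguity
codes are simplified to 'N'), keyed on FLAG bit 0x10:

```c
#include <stddef.h>
#include <string.h>

#define SAM_FLAG_REVERSE 0x10  /* read is stored reverse complemented */

/* Complement one nucleotide; anything unrecognised becomes 'N'. */
static char complement_base(char base)
{
    switch (base) {
    case 'A': return 'T';
    case 'C': return 'G';
    case 'G': return 'C';
    case 'T': return 'A';
    default:  return 'N';
    }
}

/* Hypothetical helper: restore the original read orientation in place.
 * seq and qual must be NUL-terminated strings of the same length. */
static void restore_read_orientation(unsigned int flag, char *seq, char *qual)
{
    size_t len, i;

    if (!(flag & SAM_FLAG_REVERSE))
        return;                       /* forward strand: nothing to do */

    len = strlen(seq);
    for (i = 0; i < len / 2; i++) {
        size_t j = len - 1 - i;
        char s = seq[i];
        char q = qual[i];

        seq[i] = complement_base(seq[j]);   /* swap and complement bases */
        seq[j] = complement_base(s);
        qual[i] = qual[j];                  /* qualities are only reversed */
        qual[j] = q;
    }
    if (len % 2 == 1)                       /* middle base of odd lengths */
        seq[len / 2] = complement_base(seq[len / 2]);
}
```

The quality string is reversed but not otherwise changed, so each score
stays attached to its base.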

Regards,

Peter C.

From pmr at ebi.ac.uk  Tue Aug  3 03:27:21 2010
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 03 Aug 2010 08:27:21 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To: <AANLkTi=xtj0QKD8T52G0QM72-YpqjJuFmC51efdxEcbD@mail.gmail.com>
References: <mailman.1170.1279190734.3031.emboss-dev@lists.open-bio.org>	<AANLkTikdCydb5I08EWmJu7eYjvH3RKkMMMojdfN6tdyk@mail.gmail.com>	<4C3EED02.7080507@ebi.ac.uk>	<AANLkTimp3WrTbgH1Md-2jOLC6ZN_A8UeijG7FdUpGjhf@mail.gmail.com>
	<AANLkTi=xtj0QKD8T52G0QM72-YpqjJuFmC51efdxEcbD@mail.gmail.com>
Message-ID: <4C57C4D9.1010805@ebi.ac.uk>

On 08/02/10 18:41, Peter C. wrote:
> On Thu, Jul 15, 2010 at 12:36 PM, Peter<biopython at maubp.freeserve.co.uk>  wrote:
>> On Thu, Jul 15, 2010 at 12:12 PM, Peter Rice<pmr at ebi.ac.uk>  wrote:
>>>
>>>> What do you do about naming for paired reads? I was appending
>>>> /1 or /2 to match the Illumina convention. Doing nothing means
>>>> the paired reads will have the same names.
>>>
>>> Not addressed yet - let's look into a common approach though.
>>> We would also have to look into what the '/' character does to EMBOSS's
>>> handling of sequence names.
>>
>> My rationale for appending the /1 and /2 is that in a typical workflow
>> you might take Illumina paired end data as FASTQ and map it onto
>> a genome with BWA giving SAM/BAM. You might then want to reverse
>> this (e.g. if given a SAM/BAM file by a collaborator, and you want to
>> try an alternative mapping tool or reference genome, first you must
>> recover the raw reads again, e.g. as FASTQ files).
>
> Just for the record, EMBOSS 6.3.1 does not append anything to the
> read names, meaning paired end reads cannot be distinguished if
> output as FASTA or FASTQ.
>
> I'm not sure my idea of appending /1 or /2 for paired reads is the
> best solution (especially since there are other naming schemes
> out there like _f and _r as suffixes). Nevertheless, it seems like a
> practical solution. Would including a slash character within a
> sequence name cause problems in EMBOSS (a potential issue
> you raised earlier)?

The /1 and /2 would cause horrible problems. The sequence names are used 
to generate default output file names so a '/' would have to be removed 
or converted, most likely to _1 and _2

_f or _r as a suffix is much better ... but should we always assume 
these meanings? Should we add a command-line switch for paired read 
data? Should we only do something for fastq, sam and bam (or other NGS 
formats?)

It is a mystery to me how paired reads came to have the same name. When 
we first used them at EMBL for the Human HPRT locus we made sure to add 
an "r" suffix to the reverse reads.... but then, as we used the GCG 
assembly system, we were forced to have a unique name :-)


> Also, and this may be a bug, on output as unaligned SAM (and I
> assume also for unaligned BAM), the fact that a read is paired and
> the information about whether it is the first or second read is lost. The
> FLAG is just set to 4, meaning unmapped. e.g.
>
> seqret -sformat bam -osformat sam ex1.bam -filter

Hmmm ... this kind of thing is specific to SAM-BAM conversions, as other 
formats will lose it unless we find some way to preserve the detail.

We will take a look at what we can keep between these formats (we do 
make similar efforts between EMBL and GenBank formats)

>> Given the use case of going from (aligned) SAM/BAM back to the
>> original FASTQ, for a round trip you *must* undo the reverse
>> complementation. This is important even for single reads, as quality
>> scores tend to trail off in the (original) read direction so some algorithms
>> may treat a reverse version of the read differently.


We will look into that one too.

Many thanks for the suggestions

Peter Rice

From biopython at maubp.freeserve.co.uk  Tue Aug  3 04:12:27 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 3 Aug 2010 09:12:27 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To: <4C57C4D9.1010805@ebi.ac.uk>
References: <mailman.1170.1279190734.3031.emboss-dev@lists.open-bio.org>
	<AANLkTikdCydb5I08EWmJu7eYjvH3RKkMMMojdfN6tdyk@mail.gmail.com>
	<4C3EED02.7080507@ebi.ac.uk>
	<AANLkTimp3WrTbgH1Md-2jOLC6ZN_A8UeijG7FdUpGjhf@mail.gmail.com>
	<AANLkTi=xtj0QKD8T52G0QM72-YpqjJuFmC51efdxEcbD@mail.gmail.com>
	<4C57C4D9.1010805@ebi.ac.uk>
Message-ID: <AANLkTimmWW4Ud4+FBd3iUzaedFpRC3XTH5sMORhgLJjv@mail.gmail.com>

On Tue, Aug 3, 2010 at 8:27 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>
>> Just for the record, EMBOSS 6.3.1 does not append anything to the
>> read names, meaning paired end reads cannot be distinguished if
>> output as FASTA or FASTQ.
>>
>> I'm not sure my idea of appending /1 or /2 for paired reads is the
>> best solution (especially since there are other naming schemes
>> out there like _f and _r as suffixes). Nevertheless, it seems like a
>> practical solution. Would including a slash character within a
>> sequence name cause problems in EMBOSS (a potential issue
>> you raised earlier)?
>
> The /1 and /2 would cause horrible problems. The sequence names are
> used to generate default output file names so a '/' would have to be
> removed or converted, most likely to _1 and _2

Oh :(

I thought they might cause confusion with slashes in filenames, but
yes, they can't be used in filenames, can they?

> _f or _r as a suffix is much better ... but should we always assume these
> meanings? Should we add a command-line switch for paired read data?

My understanding is there are multiple different naming conventions,
so whatever we/you do it won't please everyone. What would help here
is if the original read name were to be recorded in the SAM/BAM tags,
as I think was suggested last month or so on the samtools-devel mailing
list. However, that would come with a filesize penalty, and won't help
with old files.

> Should we only do something for fastq, sam and bam (or other NGS
> formats?)

And FASTA too, not all assemblers use quality scores. Also QUAL
files if EMBOSS were to support them.

> It is a mystery to me how paired reads came to have the same name.
> When we first used them at EMBL for the Human HPRT locus we made
> sure to add an "r" suffix to the reverse reads.... but then, as we used
> the GCG assembly system, we were forced to have a unique name :-)

With Solexa/Illumina data, pairs got the same name bar a suffix.
Other sequencing centers have also followed this pattern, for example
Sanger sequencing with suffixes of .f and .r. I guess in order to
clearly group paired reads, and save a little space, for SAM/BAM they
opted to store a single name and use the FLAG field to record whether
it is the forward or reverse read. Note that with strobed reads and the
like coming "soon", rather than just two reads in a pair, there could
be many child reads for a single fragment. Even with classic
Sanger sequencing of a PCR product you might end up with multiple
reads (e.g. two forward reads, one reverse) and if and how to handle
this via an extension to SAM/BAM was also raised.

Some pipelines may even use the same name for a forward/reverse
pair, or ignore the names. Velvet for example just takes its paired
data as interleaved files (forward then reverse reads one after the
other).

>> Also, and this may be a bug, on output as unaligned SAM (and I
>> assume also for unaligned BAM), the fact that a read is paired and
>> the information about whether it is the first or second read is lost. The
>> FLAG is just set to 4, meaning unmapped. e.g.
>>
>> seqret -sformat bam -osformat sam ex1.bam -filter
>
> Hmmm ... this kind of thing is specific to SAM-BAM conversions, as other
> formats will lose it unless we find some way to preserve the detail.
>
> We will take a look at what we can keep between these formats (we do
> make similar efforts between EMBL and GenBank formats)

I think it would be useful to track the three bits for paired, read one, and
read two. From memory, all the other bits of the FLAG are only applicable
to mapped reads. Of course, this overlaps with the naming issue above.
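
As a minimal sketch of how those three bits could drive a name suffix
(pair_suffix is a hypothetical helper, and the "_1"/"_2" spelling is
only one of the conventions discussed above):

```c
#define SAM_FLAG_PAIRED 0x1   /* read is one of a pair */
#define SAM_FLAG_READ1  0x40  /* first read of the pair */
#define SAM_FLAG_READ2  0x80  /* second read of the pair */

/* Hypothetical helper: choose a read-name suffix from the FLAG.
 * Unpaired reads, and paired reads with neither bit set, get no suffix.
 * If both bits are set, this sketch simply reports the first read. */
static const char *pair_suffix(unsigned int flag)
{
    if (!(flag & SAM_FLAG_PAIRED))
        return "";
    if (flag & SAM_FLAG_READ1)
        return "_1";
    if (flag & SAM_FLAG_READ2)
        return "_2";
    return "";
}
```

For example, FLAG 69 (1 + 4 + 64) gives "_1" and FLAG 137 (1 + 8 + 128)
gives "_2", while a plain unmapped FLAG of 4 gives no suffix.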

>>> Given the use case of going from (aligned) SAM/BAM back to the
>>> original FASTQ, for a round trip you *must* undo the reverse
>>> complementation. This is important even for single reads, as quality
>>> scores tend to trail off in the (original) read direction so some
>>> algorithms may treat a reverse version of the read differently.
>
> We will look into that one too.
>

Thanks.

> Many thanks for the suggestions
>

No problem.

Peter C.

From ajb at ebi.ac.uk  Fri Aug  6 06:53:18 2010
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Fri, 6 Aug 2010 11:53:18 +0100 (BST)
Subject: [emboss-dev] Configuration in flux
Message-ID: <34847.86.26.12.63.1281091998.squirrel@webmail.ebi.ac.uk>

Dear developers,

The EMBOSS configuration in CVS is in a state of flux at the moment.
The major changes over the last 48 hours have been to make use
of autoheader and also to clear out any system-specific libtool
files.

The upshot is that, from a fresh CVS checkout, the configuration should
just amount to:

     autoreconf -fi
     ./configure [options]

The above should mean that the configuration is relatively independent
of the version of libtool you have installed. Note, however, that
there is now a prerequisite for an autoconf version of at
least 2.59. The use of autoheader means that the compilation
lines are significantly shorter.

There will be further configuration changes over the next few
weeks but nothing quite so fundamental.

Alan


From pjotr.public78 at thebird.nl  Thu Aug 12 06:12:40 2010
From: pjotr.public78 at thebird.nl (Pjotr Prins)
Date: Thu, 12 Aug 2010 12:12:40 +0200
Subject: [emboss-dev] Unreachable code in featReadGff3
In-Reply-To: <AANLkTi=LtEjtV3azhCT6FY7ekXNYoXDkkjiGasZTgxtB@mail.gmail.com>
References: <AANLkTin1uBSuLn+6cC-_NaS5XzuF5E-WFASbNDvJG-76@mail.gmail.com>
	<AANLkTi=LtEjtV3azhCT6FY7ekXNYoXDkkjiGasZTgxtB@mail.gmail.com>
Message-ID: <20100812101240.GA28807@thebird.nl>

Something looks odd in the function featReadGff3: the second
else if(ajRegExec(Gff3Regexregion,line)) appears to be unreachable code:

  if(ajRegExec(Gff3Regexblankline, line))
      version = 3.0;
  else if(ajRegExec(Gff3Regexversion,line))
  {
      verstr = ajStrNew();
      ajRegSubI(Gff3Regexversion, 1, &verstr);
      ajStrToFloat(verstr, &version);
      ajStrDel(&verstr);
      if(version < 3.0)
      {
          ajStrDel(&line);
          return ajFalse;
      }
  }
  else if(ajRegExec(Gff3Regexregion,line))
  {
      start = ajStrNew();
      end   = ajStrNew();
  (...)

From pjotr.public78 at thebird.nl  Thu Aug 12 06:33:35 2010
From: pjotr.public78 at thebird.nl (Pjotr Prins)
Date: Thu, 12 Aug 2010 12:33:35 +0200
Subject: [emboss-dev] GFF3 in EMBOSS
Message-ID: <20100812103335.GA28925@thebird.nl>

I am having a look at the GFF3 implementation in EMBOSS - mostly
ajax/core/ajfeat.c.

All features are loaded into RAM, and also the sequence information,
when in the file. Not only for GFF3, but for all feature data types.

On regular desktops this is a problem when loading a larger set,
and/or multiple genomes.

Is it the idea to load big data and store it in a SQL database? I.e.
should I recommend handling it outside EMBOSS?

Pj.

From pmr at ebi.ac.uk  Thu Aug 12 06:52:23 2010
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 12 Aug 2010 11:52:23 +0100
Subject: [emboss-dev] GFF3 in EMBOSS
In-Reply-To: <20100812103335.GA28925@thebird.nl>
References: <20100812103335.GA28925@thebird.nl>
Message-ID: <4C63D267.3070904@ebi.ac.uk>

Hi Pjotr,

On 12/08/10 11:33, Pjotr Prins wrote:
> I am having a look at the GFF3 implementation in EMBOSS - mostly
> ajax/core/ajfeat.c.
>
> All features are loaded into RAM, and also the sequence information,
> when in the file. Not only for GFF3, but for all feature data types.
>
> On regular desktops this is a problem when loading a larger set,
> and/or multiple genomes.
>
> Is it the idea to load big data and store it in a SQL database? I.e.
> should I recommend handling it outside EMBOSS?

We are looking into storing data structures for large datasets on disk - 
not only for features but also for next-generation mapped reads.

Can you give an example of the input you are trying to handle?

I hope to explore these issues at the GMOD meeting in Cambridge (UK) soon.

regards,

Peter Rice

From pjotr.public78 at thebird.nl  Thu Aug 12 07:57:55 2010
From: pjotr.public78 at thebird.nl (Pjotr Prins)
Date: Thu, 12 Aug 2010 13:57:55 +0200
Subject: [emboss-dev] GFF3 in EMBOSS
In-Reply-To: <4C63D267.3070904@ebi.ac.uk>
References: <20100812103335.GA28925@thebird.nl> <4C63D267.3070904@ebi.ac.uk>
Message-ID: <20100812115755.GA30047@thebird.nl>

On Thu, Aug 12, 2010 at 11:52:23AM +0100, Peter Rice wrote:
> We are looking into storing data structures for large datasets on disk -  
> not only for features but also for next-generation mapped reads.

That is a great idea! The first quick-win is not to load sequence
data in memory, but fetch it on demand using a seek index. Something
that BioPerl has.
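
To illustrate the seek-index idea, here is a minimal Python sketch (not
EMBOSS or BioPerl code). It assumes a plain FASTA file opened in binary
mode; the helper names build_index and fetch are made up for this example.
Only the byte offset of each header is kept in memory, and a sequence is
read from disk when it is asked for.

```python
import io


def build_index(handle):
    """Map each sequence ID to the byte offset of its '>' header line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            seq_id = line[1:].split()[0].decode("ascii")
            index[seq_id] = offset
    return index


def fetch(handle, index, seq_id):
    """Seek to one record and read only its sequence lines."""
    handle.seek(index[seq_id])
    handle.readline()  # skip the header line
    parts = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # reached the next record
        parts.append(line.strip())
    return b"".join(parts).decode("ascii")


# In-memory stand-in for a FASTA file on disk
fasta = io.BytesIO(b">chr1 demo\nACGT\nAC\n>chr2\nGGCC\n")
index = build_index(fasta)
print(fetch(fasta, index, "chr2"))
```

The index itself could also be written to disk, so that it only has to be
built once per file.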

> Can you give an example of the input you are trying to handle?

I am dealing with worms - WormBase uses GFF3 for some worms. EMBOSS
is already memory efficient compared to BioRuby/Python/Perl, so I
am thinking of a BioLib mapping. A writeup is here:

  http://thebird.nl/biolib/Adding_BioLib_EMBOSS_GFF3_Support.html

> I hope to explore these issues at the GMOD meeting in Cambridge (UK) soon.

It makes sense for (desktop) genome browsers, for one.

Pj.

From pjotr.public78 at thebird.nl  Thu Aug 12 17:24:21 2010
From: pjotr.public78 at thebird.nl (Pjotr Prins)
Date: Thu, 12 Aug 2010 23:24:21 +0200
Subject: [emboss-dev] Embassy in Debian-med
Message-ID: <20100812212421.GA3151@thebird.nl>

Debian-med has problems with the Embassy packages, as they fail to
build against EMBOSS-latest.

Andreas Tille writes:

> To put the emboss and embassy packages in consistency in Squeeze, here are
> possible solutions:
>
> - Remove the embassy-* packages from testing.
> - Upload emboss 6.2 to testing-proposed-updates.
> - Upgrade embassy-* packages with the latest upstream version, that builds
>   against emboss 6.3, and let emboss 6.3 in testing.

What is the priority of supporting the Embassy packages? Are they
lesser citizens in EMBOSS? Or can we expect resolution in the near
future?

Pj.

From biopython at maubp.freeserve.co.uk  Fri Aug 13 05:40:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 13 Aug 2010 10:40:35 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To: <AANLkTimmWW4Ud4+FBd3iUzaedFpRC3XTH5sMORhgLJjv@mail.gmail.com>
References: <mailman.1170.1279190734.3031.emboss-dev@lists.open-bio.org>
	<AANLkTikdCydb5I08EWmJu7eYjvH3RKkMMMojdfN6tdyk@mail.gmail.com>
	<4C3EED02.7080507@ebi.ac.uk>
	<AANLkTimp3WrTbgH1Md-2jOLC6ZN_A8UeijG7FdUpGjhf@mail.gmail.com>
	<AANLkTi=xtj0QKD8T52G0QM72-YpqjJuFmC51efdxEcbD@mail.gmail.com>
	<4C57C4D9.1010805@ebi.ac.uk>
	<AANLkTimmWW4Ud4+FBd3iUzaedFpRC3XTH5sMORhgLJjv@mail.gmail.com>
Message-ID: <AANLkTi=gF9yPgeZQfe2HvdpU2K66BFAaiYt-LM3rQZEU@mail.gmail.com>

On Tue, Aug 3, 2010 at 9:12 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Aug 3, 2010 at 8:27 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>>
>>> Just for the record, EMBOSS 6.3.1 does not append anything to the
>>> read names, meaning paired end reads cannot be distinguished if
>>> output as FASTA or FASTQ.
>>>
>>> I'm not sure my idea of appending /1 or /2 for paired reads is the
>>> best solution (especially since there are other naming schemes
>>> out there like _f and _r as suffixes). Nevertheless, it seems like a
>>> practical solution. Would including a slash character within a
>>> sequence name cause problems in EMBOSS (a potential issue
>>> you raised earlier)?
>>
>> The /1 and /2 would cause horrible problems. The sequence names are
>> used to generate default output file names so a '/' would have to be
>> removed or converted, most likely to _1 and _2
>
> Oh :(
>
> I thought they might cause confusion with slashes in filenames, but
> yes, they can't be used in filenames can they.

Thinking about this more, I don't think there is a problem. There are
two main reasons. First, with SAM/BAM/FASTQ files there are typically
so many reads that you would never want to create one file per read.

Also, there are plenty of other file formats where the record ID can
or indeed usually does contain a slash - specifically PFAM/Stockholm
format alignments from PFAM where the ID is name/start-stop, e.g.
http://emboss.sourceforge.net/docs/themes/seqformats/pfam
Surely EMBOSS has already got a mechanism for dealing with
slashes in IDs when asked to use the IDs as filenames?

I think I mentioned storing the original read name in the tags had
been suggested on the samtools-devel list. In the latest draft of
the SAM/BAM spec, a new tag FS (fragment name suffix) has been
proposed, so that the original read names could be recovered by
taking the fragment name (the ID in SAM/BAM) and appending
this suffix. See this thread earlier in August 2010,

[Samtools-devel] Recording original read name in tags
http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTimg%2BvNU3CkW-63Mmug-Qt0md183dyJ_nRqva1rv%40mail.gmail.com&forum_name=samtools-devel

Finally, also on the samtools-help list, it was pointed out that the
hydra-sv project has a bamToFastq tool, see thread:

[Samtools-help] BAM to fastq how?
http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTinBnm%2B8V8bXD_ii9jn8-O%2B0_N1MgWBxBFnqm2Mk%40mail.gmail.com&forum_name=samtools-help

and http://code.google.com/p/hydra-sv/

Peter C.

From gbottu at vub.ac.be  Tue Aug 17 14:43:58 2010
From: gbottu at vub.ac.be (Guy Bottu)
Date: Tue, 17 Aug 2010 20:43:58 +0200
Subject: [emboss-dev] computed maximum forbidden in ACD ?
Message-ID: <4C6AD86E.7070909@vub.ac.be>

	Dear Peter and Alan,

I was doing some development on wrappers4EMBOSS when I noted the 
following. The file blast.acd contains :

   integer: listsize [
     information: "Show only the n best scoring sequences that
                   satisfy E() cutoff"
     default: "500"
     minimum: "0"
   ]

   integer: align [
     information: "Show only alignments for the n first
                   sequences"
     default: "@(@($(listsize) < 250 ) ? $(listsize) : 250)"
     expected: "250"
     minimum: "0"
     maximum: "$(listsize)"                          (this is line 100)
     valid: "Integer 0 or more, but not < listsize"
   ]

When I run blast I get :

Error: File /OPT/emboss63/share/EMBOSS/acd/blast.acd line 100: 
(wordsize) Attribute failrange: required with any calculated min/max

I am as good as certain that this behaviour has appeared with EMBOSS 
version 6.3.0. In the past it was allowed to set a "maximum" that 
depended on the choice of another parameter, and we can see that it 
could occasionally make sense, but this now seems to be forbidden. Is
this a bug or a feature?

	Regards,
	Guy Bottu

From pmr at ebi.ac.uk  Tue Aug 17 16:22:58 2010
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 17 Aug 2010 21:22:58 +0100
Subject: [emboss-dev] computed maximum forbidden in ACD ?
In-Reply-To: <4C6AD86E.7070909@vub.ac.be>
References: <4C6AD86E.7070909@vub.ac.be>
Message-ID: <4C6AEFA2.8000201@ebi.ac.uk>

Dear Guy,

> When I run blast I get :
>
> Error: File /OPT/emboss63/share/EMBOSS/acd/blast.acd line 100:
> (wordsize) Attribute failrange: required with any calculated min/max
>
> I am as good as certain that this behaviour has appeared with EMBOSS
> version 6.3.0. In the past it was allowed to set a "maximum" that
> depended on the choice of another parameter, and we can see that it
> could occasionally make sense, but this seems from now on forbidden. Is
> this a bug or a feature ?

It is a fix for a feature. With calculated maximum or minimum values 
(e.g. depending on a window size) it was possible for the maximum to be 
less than the minimum. In such cases we could logically use either the 
maximum or the minimum - and some applications were found to require one 
choice, others needed the other.

After some discussion we decided to add extra attributes to control the 
behaviour. You can add three new attributes:

trueminimum: "N"   (if max/min overlap, use minimum)
failrange:   "Y"   (fail if (calculated) ranges overlap)
rangemessage: ""   (failure message if (calculated) ranges overlap)

A common solution (good for your case) is:

failrange: "N"
trueminimum: "Y"

By adding the error messages we made sure that an ACD file with a 
calculated range will give messages to the developer suggesting missing 
attributes to be added.

If you set failrange: "Y" you need to define a message explaining to the 
end user why the range might fail.

If you set failrange: "N" the calculated range is accepted, but you also 
need to set trueminimum to say whether you want the minimum value to 
apply (usual to avoid getting negative values) or the maximum to avoid 
values going too large.

So, you get the "failrange is required" message. When you set that you 
get another message (depending whether it is true or false) telling you 
to set one of the other attributes as well.
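
The decision described above can be sketched in Python. This is only one
reading of the explanation, not the ACD implementation: the function name
resolve_conflict and its return convention are invented for the example,
and the three keyword arguments simply mirror the attribute names.

```python
def resolve_conflict(minimum, maximum, failrange, trueminimum, rangemessage=""):
    """Decide what to do when a calculated maximum falls below the minimum.

    Returns None when the limits do not conflict, otherwise the single
    limit that wins. Raises ValueError when failrange is set.
    """
    if maximum >= minimum:
        return None  # limits are consistent, nothing to resolve
    if failrange:
        # failrange: "Y" -> stop and show the developer-supplied message
        raise ValueError(rangemessage or "calculated range is empty")
    # failrange: "N" -> trueminimum chooses which limit applies
    return minimum if trueminimum else maximum


# Hypothetical case: minimum 1, maximum calculated as window - 5, window = 3
print(resolve_conflict(1, 3 - 5, failrange=False, trueminimum=True))
print(resolve_conflict(1, 3 - 5, failrange=False, trueminimum=False))
```

With trueminimum set the minimum (1) wins, which avoids the negative
value; without it the calculated maximum (-2) wins.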

Hope this makes it clearer!

Peter


From biopython at maubp.freeserve.co.uk  Mon Aug  2 11:37:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 12:37:35 +0100
Subject: [emboss-dev] Bug report and patch - BAM quality score reading
Message-ID: <AANLkTim_9x1S6Ej8=m2nXi23oTOJKQ8PKJitc1iAhqxR@mail.gmail.com>

Hi all,

Since I had several queries about what EMBOSS would do with SAM/BAM,
http://lists.open-bio.org/pipermail/emboss-dev/2010-July/000656.html
I decided to try it and see. I believe I have found a bug with reading quality
scores from BAM files in EMBOSS 6.3.1

$ seqret -version
EMBOSS:6.3.1

I have been using a small pair of SAM and BAM files, originally downloaded
as a SAM file with reference FASTA sequence from the pysam project, which
I converted to BAM using samtools as they specify in their readme file
http://code.google.com/p/pysam/source/browse/#hg/tests

e.g.

curl -O http://pysam.googlecode.com/hg/tests/ex1.fa
curl -O http://pysam.googlecode.com/hg/tests/ex1.sam.gz
gunzip ex1.sam.gz
samtools faidx ex1.fa
samtools import ex1.fa.fai ex1.sam ex1.bam

If we look at the first two reads in the SAM file, notice their quality strings:

$ head ex1.sam
EAS56_57:6:190:289:82	69	chr1	100	0	*	=	100	0	CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA	<<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;	MF:i:192
EAS56_57:6:190:289:82	137	chr1	100	73	35M	=	100	0	AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC	<<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;	MF:i:64	Aq:i:0	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
...

Now let's ask EMBOSS seqret to convert from SAM to Sanger FASTQ,

$ seqret -sformat sam -osformat fastq-sanger ex1.sam -stdout -auto | head
@EAS56_57:6:190:289:82 chr1
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
+
<<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;
@EAS56_57:6:190:289:82 chr1
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
+
<<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;
@EAS51_64:3:190:727:308 chr1
GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG

The quality strings agree with the SAM file, good.

Now let's ask EMBOSS seqret to convert from BAM to Sanger FASTQ,

$ seqret -sformat bam -osformat fastq-sanger ex1.bam -stdout -auto | head
@EAS56_57:6:190:289:82
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
+
]]]X]]]\]]]]]]]]Y\\]X\U]\]\\\\\ZU]\
@EAS56_57:6:190:289:82
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
+
]]]]]]\]]]]]]]]]]\]]\]]]]\Y]W\Z\\S\
@EAS51_64:3:190:727:308
GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG

The quality strings differ, this is bad.

In the SAM file these two reads have quality strings starting with "<",
ASCII 60 meaning PHRED 60-33 = 27.

In the funny BAM to Sanger FASTQ conversion, EMBOSS has used
"]" which is ASCII 93, giving PHRED 93-33 = 60. i.e. 33 more than it
should be. I suspected that the EMBOSS code for reading BAM files
was wrongly applying a 33 offset to the quality scores. In BAM files
the scores are simply encoded directly as uint8_t without any offset.
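
The arithmetic can be checked with a few lines of Python. This is a
stand-alone sketch, not EMBOSS code; the helper name phred_to_sanger is
invented, and the four scores are the PHRED values behind the first four
quality characters of the first read.

```python
SANGER_OFFSET = 33


def phred_to_sanger(scores):
    """Encode PHRED scores as a Sanger FASTQ quality string."""
    return "".join(chr(q + SANGER_OFFSET) for q in scores)


# BAM stores the scores as raw bytes, with no ASCII offset applied
bam_scores = bytes([27, 27, 27, 22])

correct = phred_to_sanger(bam_scores)
# Adding 33 while reading, then 33 again while writing, shifts every score
double_offset = phred_to_sanger(q + SANGER_OFFSET for q in bam_scores)

print(correct)        # matches the SAM file
print(double_offset)  # matches the BAM to FASTQ output shown above
```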

Looking at the source code, file ajax/core/ajseqread.c we have:

        for(i=0; i < (ajuint) c->l_qseq; i++)
        {
            ajFmtPrintAppS(&qualstr, " %02x", 33+d[dpos]);
            thys->Accuracy[i] = (float) (33 + d[dpos++]);
        }

The creation of a quality string appears to be for debug only,
and here adding 33 to make the scores printable ASCII using
the Sanger FASTQ encoding makes sense. However, adding
the offset to the accuracy looks like an oversight. How about:

        for(i=0; i < (ajuint) c->l_qseq; i++)
        {
            ajFmtPrintAppS(&qualstr, " %02x", 33+d[dpos]);
            thys->Accuracy[i] = (float) d[dpos++];
        }

With this tiny change, I get the expected Sanger FASTQ output
from a BAM file using seqret.

Regards,

Peter C.


From biopython at maubp.freeserve.co.uk  Mon Aug  2 13:55:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 14:55:10 +0100
Subject: [emboss-dev] Bug report and patch - SAM parser and negative ISIZE
Message-ID: <AANLkTikBPEbkRJYLtd6yf0367mXr59jG=UHXsEmhPq-A@mail.gmail.com>

Hi again,

This is another bug report for EMBOSS 6.3.1 (compiled on Mac OS X
10.6.4 Snow Leopard) using the same example files as earlier, see:
http://lists.open-bio.org/pipermail/emboss-dev/2010-August/thread.html

For the purposes of a concise example, I'm using seqret to convert
SAM/BAM to FASTA so as to count the number of reads. See also:
http://lists.open-bio.org/pipermail/emboss/2010-July/003951.html

I believe these SAM and BAM files both contain 3270 reads, but EMBOSS
is having trouble with the SAM file:

$ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | grep -c "^>"
3270
$ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | grep -c "^>"
41

If we look at the output,

$ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto
>EAS56_57:6:190:289:82 chr1
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>EAS56_57:6:190:289:82 chr1
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
...
>EAS114_28:6:155:68:326 chr1
CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAA
>EAS188_7:7:19:886:279 chr1
CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTA

Looking at the SAM file, I guessed EMBOSS doesn't like a negative
ISIZE field in the next record, EAS54_61:4:143:69:578, from the
SAM file we have:

...
EAS114_28:6:155:68:326	99	chr1	182	99	36M	=	332	186	CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAA	<<<<<<<<<<<<<<<<<<<<<<<<<<<<:<<<<<<<	MF:i:18	Aq:i:76	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
EAS188_7:7:19:886:279	99	chr1	182	99	35M	=	337	190	CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTA	<9<<<<<<<<<<<<6<28:<<85<<<<<2<;<9<<	MF:i:18	Aq:i:67	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
EAS54_61:4:143:69:578	147	chr1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	MF:i:18	Aq:i:35	NM:i:1	UQ:i:5	H0:i:1	H1:i:0
EAS54_71:4:13:981:659	181	chr1	187	0	*	=	188	0	CGGGACAATGGACGAGGTAAACCGCACATTGACAA	+)---3&&3&--+0)&+3:7777).333:<06<<<	MF:i:192
...

Looking at the source code, currently EMBOSS is wrongly assuming
an unsigned integer will be used. This is not true, the spec allows for
a negative ISIZE. I replaced this code in ajax/core/ajseqread.c

    ajStrTokenNextParseNoskip(&handle,&token); /* ISIZE */
    ajDebug("ISIZE '%S'\n", token);
    if(ajStrGetLen(token)){
        if(!ajStrToUint(token, &flags))
            return ajFalse;
    }

with:

    ajStrTokenNextParseNoskip(&handle,&token); /* ISIZE */
    ajDebug("ISIZE '%S'\n", token);
    if(ajStrGetLen(token)){
        if(!ajStrToInt(token, &flags))
            return ajFalse;
    }

(i.e. Uint to Int), and now I get the correct read count.
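
The effect is easy to reproduce outside EMBOSS. The Python sketch below is
an illustration only: parse_unsigned and parse_signed are invented
stand-ins for an unsigned-only and a signed integer parser, applied to the
ISIZE column of the record that stopped the SAM reader.

```python
def parse_unsigned(token):
    """Accept only an unsigned decimal integer."""
    if not (token.isascii() and token.isdigit()):
        raise ValueError(f"not an unsigned integer: {token!r}")
    return int(token)


def parse_signed(token):
    """Accept an optionally signed decimal integer."""
    return int(token)


record = "\t".join([
    "EAS54_61:4:143:69:578", "147", "chr1", "185", "98", "35M", "=", "36",
    "-184", "ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG",
    "222&<21<<<<12<7<01<<<<<0<<<<<<<20<<", "MF:i:18",
])
isize_token = record.split("\t")[8]  # ISIZE is the ninth column

try:
    parse_unsigned(isize_token)
    unsigned_ok = True
except ValueError:
    unsigned_ok = False  # the unsigned parser rejects "-184"

print(unsigned_ok, parse_signed(isize_token))
```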

A related question is why did this error condition not give any
error message to stdout or stderr?

Regards,

Peter C.


From pmr at ebi.ac.uk  Mon Aug  2 15:42:00 2010
From: pmr at ebi.ac.uk (Peter Rice)
Date: Mon, 02 Aug 2010 16:42:00 +0100
Subject: [emboss-dev] Bug reports and patches: BAM quality,
 SAM negative ISIZE
In-Reply-To: <AANLkTikBPEbkRJYLtd6yf0367mXr59jG=UHXsEmhPq-A@mail.gmail.com>
References: <AANLkTikBPEbkRJYLtd6yf0367mXr59jG=UHXsEmhPq-A@mail.gmail.com>
Message-ID: <4C56E748.7010803@ebi.ac.uk>

On 02/08/10 14:55, Peter C. wrote:

> In the funny BAM to Sanger FASTQ conversion, EMBOSS has used
> "]" which is ASCII 93, giving PHRED 93-33 = 60. i.e. 33 more than it
> should be. I suspected that the EMBOSS code for reading BAM files
> was wrongly applying a 33 offset to the quality scores. In BAM files
> the scores are simply encoded directly as uint8_t without any offset.

Thanks for spotting that. We will make a patch with that fix in.


> Looking at the SAM file, I guessed EMBOSS doesn't like a negative
> ISIZE field in the next record, EAS54_61:4:143:69:578,  .........
>
> Looking at the source code, currently EMBOSS is wrongly assuming
> an unsigned integer will be used. This is not true, the spec allows for
> a negative ISIZE. I replaced this code in ajax/core/ajseqread.c

Thanks for the fix. We will add that to the patch.

> A related question is why did this error condition not give any
> error message to stdout or stderr?

This appears to be a general issue with reading unknown and known 
formats. We will fix it so that error messages are turned on for this 
failure condition.

Many thanks for the bug reports - and the fixes!!

Peter R.


From biopython at maubp.freeserve.co.uk  Mon Aug  2 15:52:56 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 16:52:56 +0100
Subject: [emboss-dev] Bug reports and patches: BAM quality,
	SAM negative 	ISIZE
In-Reply-To: <4C56E748.7010803@ebi.ac.uk>
References: <AANLkTikBPEbkRJYLtd6yf0367mXr59jG=UHXsEmhPq-A@mail.gmail.com>
	<4C56E748.7010803@ebi.ac.uk>
Message-ID: <AANLkTin7mvdWHE1_DOKtEjxOS7gDL0ZtZyXjROHhZ_is@mail.gmail.com>

On Mon, Aug 2, 2010 at 4:42 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
> On 02/08/10 14:55, Peter C. wrote:
>
>> In the funny BAM to Sanger FASTQ conversion, EMBOSS has used
>> "]" which is ASCII 93, giving PHRED 93-33 = 60. i.e. 33 more than it
>> should be. I suspected that the EMBOSS code for reading BAM files
>> was wrongly applying a 33 offset to the quality scores. In BAM files
>> the scores are simply encoded directly as uint8_t without any offset.
>
> Thanks for spotting that. We will make a patch with that fix in.
>
>> Looking at the SAM file, I guessed EMBOSS doesn't like a negative
>> ISIZE field in the next record, EAS54_61:4:143:69:578, .........
>>
>> Looking at the source code, currently EMBOSS is wrongly assuming
>> an unsigned integer will be used. This is not true, the spec allows for
>> a negative ISIZE. I replaced this code in ajax/core/ajseqread.c
>
> Thanks for the fix. We will add that to the patch.
>

Great. Are you still issuing patches which don't affect the version number?
I'd prefer to have an easy way to know if a given install of EMBOSS
has certain fixes, and a point release seems quite straightforward from
an outsider's perspective.

P.S. Expect a couple more reports to follow... so don't rush a patch or
point release out just yet ;)

>> A related question is why did this error condition not give any
>> error message to stdout or stderr?
>
> This appears to be a general issue with reading unknown and known formats.
> We will fix it so that error messages are turned on for this failure
> condition.

Good :)

> Many thanks for the bug reports - and the fixes!!
>

No problem,

Peter


From biopython at maubp.freeserve.co.uk  Mon Aug  2 16:26:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 17:26:07 +0100
Subject: [emboss-dev] Inconsistency in SAM vs BAM read description
Message-ID: <AANLkTin1uBSuLn+6cC-_NaS5XzuF5E-WFASbNDvJG-76@mail.gmail.com>

Hi all,

After patching the following two issues,
http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000667.html
http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000668.html
there is a noticeable difference in the output from the SAM and BAM
parsers in the description of the reads:

$ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | head
>EAS56_57:6:190:289:82
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>EAS56_57:6:190:289:82
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
>EAS51_64:3:190:727:308
GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG
>EAS112_34:7:141:80:875
AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA
>EAS219_FC30151:3:40:1128:1940
CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC


$ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | head
>EAS56_57:6:190:289:82 chr1
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>EAS56_57:6:190:289:82 chr1
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
>EAS51_64:3:190:727:308 chr1
GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG
>EAS112_34:7:141:80:875 chr1
AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA
>EAS219_FC30151:3:40:1128:1940 chr1
CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC

As you can see from the above example (using files described in
the linked threads), when parsing SAM files if the read is mapped
then the reference sequence name is used as the description.
This seems like a sensible and useful thing to do. However, when
parsing BAM files this is not currently being done.

Having the SAM and BAM parser produce identical results is
very useful for testing purposes (e.g. running diff on their output
as FASTQ format), so I would like the BAM parser to do the same.

Looking at the source, function seqReadSam in ajax/core/ajseqread.c
does this with the reference name string:

    ajStrTokenNextParseNoskip(&handle,&token); /* RNAME */
    ajDebug("RNAME '%S'\n", token);
    if(ajStrGetLen(token))
        seqAccSave(thys, token);

Therefore the BAM parser needs to do something similar, first
mapping the integer rID (reference sequence ID) to the array of
reference names from the BAM header.

I got as far as a partial solution but it only worked on the first read.
The problem is that although header variable ntargets is stored as
bamdata->Nref it does not appear that the array of strings
targetname is kept (likewise the array of integers targetlen but we
don't care about that here).
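
What the BAM reader would need can be sketched in Python. The assumption
is that the list of reference names from the header is kept for the whole
file rather than discarded after the first read; the names reference_name,
fasta_title and target_names are invented for this example and the header
contents are illustrative.

```python
UNMAPPED_REF_ID = -1  # BAM stores -1 when a read has no reference


def reference_name(ref_id, target_names):
    """Look up a reference name by its index in the header list."""
    if ref_id == UNMAPPED_REF_ID:
        return None
    if not 0 <= ref_id < len(target_names):
        raise IndexError(f"reference ID {ref_id} not in header")
    return target_names[ref_id]


def fasta_title(read_name, ref_id, target_names):
    """Build 'name description' the way the SAM parser output looks."""
    name = reference_name(ref_id, target_names)
    return read_name if name is None else f"{read_name} {name}"


target_names = ["chr1", "chr2"]  # illustrative header contents
print(fasta_title("EAS56_57:6:190:289:82", 0, target_names))
print(fasta_title("EAS56_57:6:190:289:82", -1, target_names))
```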

Regards,

Peter C.


From biopython at maubp.freeserve.co.uk  Mon Aug  2 16:42:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 17:42:07 +0100
Subject: [emboss-dev] Inconsistency in SAM vs BAM read description
In-Reply-To: <AANLkTin1uBSuLn+6cC-_NaS5XzuF5E-WFASbNDvJG-76@mail.gmail.com>
References: <AANLkTin1uBSuLn+6cC-_NaS5XzuF5E-WFASbNDvJG-76@mail.gmail.com>
Message-ID: <AANLkTi=LtEjtV3azhCT6FY7ekXNYoXDkkjiGasZTgxtB@mail.gmail.com>

On Mon, Aug 2, 2010 at 5:26 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi all,
>
> After patching the following two issues,
> http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000667.html
> http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000668.html
> there is a noticeable difference in the output from the SAM and BAM
> parsers in the description of the reads:
>
> $ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | head
>>EAS56_57:6:190:289:82
> CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>>EAS56_57:6:190:289:82
> AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
>>EAS51_64:3:190:727:308
> GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG
>>EAS112_34:7:141:80:875
> AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA
>>EAS219_FC30151:3:40:1128:1940
> CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC
>
>
> $ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | head
>>EAS56_57:6:190:289:82 chr1
> CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>>EAS56_57:6:190:289:82 chr1
> AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
>>EAS51_64:3:190:727:308 chr1
> GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG
>>EAS112_34:7:141:80:875 chr1
> AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA
>>EAS219_FC30151:3:40:1128:1940 chr1
> CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC
>
> As you can see from the above example (using files described in
> the linked threads), when parsing SAM files if the read is mapped
> then the reference sequence name is used as the description.
> This seems like a sensible and useful thing to do. However, when
> parsing BAM files this is not currently being done.
>
> Having the SAM and BAM parser produce identical results is
> very useful for testing purposes (e.g. running diff on their output
> as FASTQ format), so I would like the BAM parser to do the same.
>
> Looking at the source, function seqReadSam in ajax/core/ajseqread.c
> does this with the reference name string:
>
>     ajStrTokenNextParseNoskip(&handle,&token); /* RNAME */
>     ajDebug("RNAME '%S'\n", token);
>     if(ajStrGetLen(token))
>         seqAccSave(thys, token);
>

Just as a post script,

Having failed to enhance the BAM parser, for short term testing
I'm just commenting out the above two lines of the SAM parser.
With that trivial change, the FASTA and FASTQ output from the
SAM and BAM files agrees 100% (as you would expect).

Peter C.


From biopython at maubp.freeserve.co.uk  Mon Aug  2 17:41:25 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 18:41:25 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To: <AANLkTimp3WrTbgH1Md-2jOLC6ZN_A8UeijG7FdUpGjhf@mail.gmail.com>
References: <mailman.1170.1279190734.3031.emboss-dev@lists.open-bio.org>
	<AANLkTikdCydb5I08EWmJu7eYjvH3RKkMMMojdfN6tdyk@mail.gmail.com>
	<4C3EED02.7080507@ebi.ac.uk>
	<AANLkTimp3WrTbgH1Md-2jOLC6ZN_A8UeijG7FdUpGjhf@mail.gmail.com>
Message-ID: <AANLkTi=xtj0QKD8T52G0QM72-YpqjJuFmC51efdxEcbD@mail.gmail.com>

On Thu, Jul 15, 2010 at 12:36 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Jul 15, 2010 at 12:12 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>
>>> What do you do about naming for paired reads? I was appending
>>> /1 or /2 to match the Illumina convention. Doing nothing means
>>> the paired reads will have the same names.
>>
>> Not addressed yet - let's look into a common approach though.
>> We would also have to look into what the '/' character does to EMBOSS's
>> handling of sequence names.
>
> My rationale for appending the /1 and /2 is that in a typical workflow
> you might take Illumina paired end data as FASTQ and map it onto
> a genome with BWA giving SAM/BAM. You might then want to reverse
> this (e.g. if given a SAM/BAM file by a collaborator, and you want to
> try an alternative mapping tool or reference genome, first you must
> recover the raw reads again, e.g. as FASTQ files).

Just for the record, EMBOSS 6.3.1 does not append anything to the
read names, meaning paired end reads cannot be distinguished if
output as FASTA or FASTQ.

I'm not sure my idea of appending /1 or /2 for paired reads is the
best solution (especially since there are other naming schemes
out there like _f and _r as suffixes). Nevertheless, it seems like a
practical solution. Would including a slash character within a
sequence name cause problems in EMBOSS (a potential issue
you raised earlier)?
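
As a sketch of the idea (not what EMBOSS does), the suffix can be derived
from the FLAG bits defined in the SAM specification: 0x1 for a paired
read, 0x40 for the first read and 0x80 for the second. The function name
and the configurable suffix arguments are invented for this example.

```python
FLAG_PAIRED = 0x1
FLAG_FIRST = 0x40
FLAG_SECOND = 0x80


def read_name_with_suffix(name, flag, first="/1", second="/2"):
    """Append a pair suffix to a read name based on its SAM FLAG."""
    if flag & FLAG_PAIRED:
        if flag & FLAG_FIRST:
            return name + first
        if flag & FLAG_SECOND:
            return name + second
    return name  # unpaired reads keep their name


# FLAG values 69 and 137 are those of the first two reads in ex1.sam
print(read_name_with_suffix("EAS56_57:6:190:289:82", 69))
print(read_name_with_suffix("EAS56_57:6:190:289:82", 137))
print(read_name_with_suffix("EAS56_57:6:190:289:82", 137, "_f", "_r"))
```

Keeping the suffixes as arguments leaves room for other conventions such
as _f and _r.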

Also, and this may be a bug, on output as unaligned SAM (and I
assume also for unaligned BAM), the fact that a read is paired and
the information about whether it is the first or second read is lost. The
FLAG is just set to 4, meaning unmapped. e.g.

seqret -sformat bam -osformat sam ex1.bam -filter

or:

seqret -sformat sam -osformat sam ex1.sam -filter
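
One possible way to keep the pairing information is sketched below in
Python. It is an assumption about what unaligned output could look like,
not a description of EMBOSS: the pairing bits (0x1, 0x40, 0x80) are
carried over, and the unmapped bits (0x4 for the read, 0x8 for its mate)
are set. The function name unaligned_flag is invented.

```python
FLAG_PAIRED = 0x1
FLAG_UNMAPPED = 0x4
FLAG_MATE_UNMAPPED = 0x8
FLAG_FIRST = 0x40
FLAG_SECOND = 0x80


def unaligned_flag(flag):
    """Return a FLAG for unaligned output that keeps the pairing bits."""
    kept = flag & (FLAG_PAIRED | FLAG_FIRST | FLAG_SECOND)
    if kept & FLAG_PAIRED:
        kept |= FLAG_MATE_UNMAPPED  # the mate is written unaligned too
    return kept | FLAG_UNMAPPED


for original in (69, 137, 0):
    print(original, "->", unaligned_flag(original))
```

An unpaired read still ends up with FLAG 4, as in the current output.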

>>> What do you do about the strand issue? SAM/BAM stored reads
>>> which map onto the reverse strand in reverse complement. If
>>> you want to get back to the original orientation for output as
>>> FASTQ you must apply the reverse complement (plus reverse
>>> the quality scores too of course).
>>
>> So far we read as sequences. Reading as mapped reads (very large
>> alignments) is planned for the very near future so it can appear in the
>> next release.
>
> Given the use case of going from (aligned) SAM/BAM back to the
> original FASTQ, for a round trip you *must* undo the reverse
> complementation. This is important even for single reads, as quality
> scores tend to trail off in the (original) read direction so some algorithms
> may treat a reverse version of the read differently.

To clarify, EMBOSS 6.3.1 does not flip reads mapped to the reverse strand:
http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000667.html
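
For reference, undoing the flip is a small operation. The Python sketch
below is an illustration under the SAM convention that FLAG bit 0x10
marks a read stored in reverse complement; the function name
original_orientation is invented, and only A, C, G, T and N are handled.

```python
FLAG_REVERSE = 0x10
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def original_orientation(seq, qual, flag):
    """Return sequence and qualities as the sequencer produced them."""
    if flag & FLAG_REVERSE:
        # reverse complement the bases, reverse (only) the qualities
        return seq.translate(COMPLEMENT)[::-1], qual[::-1]
    return seq, qual


# FLAG 147 has the reverse strand bit set, FLAG 99 does not
print(original_orientation("ATTGG", "222&<", 147))
print(original_orientation("CCCAT", "<9<<<", 99))
```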

Regards,

Peter C.


From pmr at ebi.ac.uk  Tue Aug  3 07:27:21 2010
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 03 Aug 2010 08:27:21 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To: <AANLkTi=xtj0QKD8T52G0QM72-YpqjJuFmC51efdxEcbD@mail.gmail.com>
References: <mailman.1170.1279190734.3031.emboss-dev@lists.open-bio.org>	<AANLkTikdCydb5I08EWmJu7eYjvH3RKkMMMojdfN6tdyk@mail.gmail.com>	<4C3EED02.7080507@ebi.ac.uk>	<AANLkTimp3WrTbgH1Md-2jOLC6ZN_A8UeijG7FdUpGjhf@mail.gmail.com>
	<AANLkTi=xtj0QKD8T52G0QM72-YpqjJuFmC51efdxEcbD@mail.gmail.com>
Message-ID: <4C57C4D9.1010805@ebi.ac.uk>

On 08/02/10 18:41, Peter C. wrote:
> On Thu, Jul 15, 2010 at 12:36 PM, Peter<biopython at maubp.freeserve.co.uk>  wrote:
>> On Thu, Jul 15, 2010 at 12:12 PM, Peter Rice<pmr at ebi.ac.uk>  wrote:
>>>
>>>> What do you do about naming for paired reads? I was appending
>>>> /1 or /2 to match the Illumina convention. Doing nothing means
>>>> the paired reads will have the same names.
>>>
>>> Not addressed yet - let's look into a common approach though.
>>> We would also have to look into what the '/' character does to EMBOSS's
>>> handling of sequence names.
>>
>> My rationale for appending the /1 and /2 is that in a typical workflow
>> you might take Illumina paired end data as FASTQ and map it onto
>> a genome with BWA giving SAM/BAM. You might then want to reverse
>> this (e.g. if given a SAM/BAM file by a collaborator, and you want to
>> try an alternative mapping tool or reference genome, first you must
>> recover the raw reads again, e.g. as FASTQ files).
>
> Just for the record, EMBOSS 6.3.1 does not append anything to the
> read names, meaning paired end reads cannot be distinguished if
> output as FASTA or FASTQ.
>
> I'm not sure my idea of appending /1 or /2 for paired reads is the
> best solution (especially since there are other naming schemes
> out there like _f and _r as suffixes). Nevertheless, it seems like a
> practical solution. Would including a slash character within a
> sequence name cause problems in EMBOSS (a potential issue
> you raised earlier)?

The /1 and /2 would cause horrible problems. The sequence names are used 
to generate default output file names so a '/' would have to be removed 
or converted, most likely to _1 and _2

_f or _r as a suffix is much better ... but should we always assume 
these meanings? Should we add a command-line switch for paired read 
data? Should we only do something for fastq, sam and bam (or other NGS 
formats?)

It is a mystery to me how paired reads came to have the same name. When 
we first used them at EMBL for the Human HPRT locus we made sure to add 
an "r" suffix to the reverse reads.... but then, as we used the GCG 
assembly system, we were forced to have a unique name :-)


> Also, and this may be a bug, on output as unaligned SAM (and I
> assume also for unaligned BAM), the fact that a read is paired and
> the information about whether it is the first or second read is lost. The
> FLAG is just set to 4, meaning unmapped. e.g.
>
> seqret -sformat bam -osformat sam ex1.bam -filter

Hmmm ... this kind of thing is specific to SAM-BAM conversions, as other 
formats will lose it unless we find some way to preserve the detail.

We will take a look at what we can keep between these formats (we do 
make similar efforts between EMBL and GenBank formats)

>> Given the use case of going from (aligned) SAM/BAM back to the
>> original FASTQ, for a round trip you *must* undo the reverse
>> complementation. This is important even for single reads, as quality
>> scores tend to trail off in the (original) read direction so some algorithms
>> may treat a reverse version of the read differently.


We will look into that one too.

Many thanks for the suggestions

Peter Rice


From biopython at maubp.freeserve.co.uk  Tue Aug  3 08:12:27 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 3 Aug 2010 09:12:27 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To: <4C57C4D9.1010805@ebi.ac.uk>
References: <mailman.1170.1279190734.3031.emboss-dev@lists.open-bio.org>
	<AANLkTikdCydb5I08EWmJu7eYjvH3RKkMMMojdfN6tdyk@mail.gmail.com>
	<4C3EED02.7080507@ebi.ac.uk>
	<AANLkTimp3WrTbgH1Md-2jOLC6ZN_A8UeijG7FdUpGjhf@mail.gmail.com>
	<AANLkTi=xtj0QKD8T52G0QM72-YpqjJuFmC51efdxEcbD@mail.gmail.com>
	<4C57C4D9.1010805@ebi.ac.uk>
Message-ID: <AANLkTimmWW4Ud4+FBd3iUzaedFpRC3XTH5sMORhgLJjv@mail.gmail.com>

On Tue, Aug 3, 2010 at 8:27 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>
>> Just for the record, EMBOSS 6.3.1 does not append anything to the
>> read names, meaning paired end reads cannot be distinguished if
>> output as FASTA or FASTQ.
>>
>> I'm not sure my idea of appending /1 or /2 for paired reads is the
>> best solution (especially since there are other naming schemes
>> out there like _f and _r as suffixes). Nevertheless, it seems like a
>> practical solution. Would including a slash character within a
>> sequence name cause problems in EMBOSS (a potential issue
>> you raised earlier)?
>
> The /1 and /2 would cause horrible problems. The sequence names are
> used to generate default output file names so a '/' would have to be
> removed or converted, most likely to _1 and _2

Oh :(

I thought they might cause confusion with slashes in filenames, but
yes, they can't be used in filenames can they.

> _f or _r as a suffix is much better ... but should we always assume these
> meanings? Should we add a command-line switch for paired read data?

My understanding is there are multiple different naming conventions,
so whatever we/you do it won't please everyone. What would help here
is if the original read name were to be recorded in the SAM/BAM tags,
as I think was suggested last month or so on the samtools-devel mailing
list. However, that would come with a filesize penalty, and won't help
with old files.

> Should we only do something for fastq, sam and bam (or other NGS
> formats?)

And FASTA too, not all assemblers use quality scores. Also QUAL
files if EMBOSS were to support them.

> It is a mystery to me how paired reads came to have the same name.
> When we first used them at EMBL for the Human HPRT locus we made
> sure to add an "r" suffix to the reverse reads.... but then, as we used
> the GCG assembly system, we were forced to have a unique name :-)

With Solexa/Illumina data, pairs got the same name bar a suffix.
Other sequencing centers have followed this pattern too, for
example Sanger sequencing with suffixes of .f and .r.
I guess in order to clearly group paired reads, and save a little space,
for SAM/BAM they opted to store a single name and use the FLAG field
to record whether it is the forward or reverse read. Note that with strobed reads
and the like coming "soon", rather than just two reads in a pair, there
could be many child reads for a single fragment. Even with classic
Sanger sequencing of a PCR product you might end up with multiple
reads (e.g. two forward reads, one reverse) and if and how to handle
this via an extension to SAM/BAM was also raised.

Some pipelines may even use the same name for a forward/reverse
pair, or ignore the names. Velvet for example just takes its paired
data as interleaved files (forward then reverse reads one after the
other).

>> Also, and this may be a bug, on output as unaligned SAM (and I
>> assume also for unaligned BAM), the fact that a read is paired and
>> the information about whether it is the first or second read is lost. The
>> FLAG is just set to 4, meaning unmapped. e.g.
>>
>> seqret -sformat bam -osformat sam ex1.bam -filter
>
> Hmmm ... this kind of thing is specific to SAM-BAM conversions, as other
> formats will lose it unless we find some way to preserve the detail.
>
> We will take a look at what we can keep between these formats (we do
> make similar efforts between EMBL and GenBank formats)

I think it would be useful to track the three bits for paired, read one, and
read two. From memory, all the other bits of the FLAG are only applicable
to mapped reads. Of course, this overlaps with the naming issue above.
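
As a rough illustration of keeping those three bits (a Python sketch, not
EMBOSS code; the bit values are the ones defined in the SAM specification,
while the function names and the _1/_2 suffix choice are assumptions made
for this example):

```python
# SAM FLAG bits as defined in the SAM specification.
FLAG_PAIRED = 0x1     # read is paired in sequencing
FLAG_UNMAPPED = 0x4   # the read itself is unmapped
FLAG_READ1 = 0x40     # first read of the pair
FLAG_READ2 = 0x80     # second read of the pair


def unaligned_flag(original_flag):
    """Return a FLAG for unaligned output that keeps the pairing bits."""
    keep = FLAG_PAIRED | FLAG_READ1 | FLAG_READ2
    return (original_flag & keep) | FLAG_UNMAPPED


def pair_suffix(flag):
    """Hypothetical helper: '_1', '_2' or '' from the read-in-pair bits."""
    if flag & FLAG_READ1:
        return "_1"
    if flag & FLAG_READ2:
        return "_2"
    return ""
```

With the first two reads of ex1.sam (FLAG 69 and FLAG 137) this would
give 69 and 133 rather than a bare 4, so the pairing survives.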

>>> Given the use case of going from (aligned) SAM/BAM back to the
>>> original FASTQ, for a round trip you *must* undo the reverse
>>> complementation. This is important even for single reads, as quality
>>> scores tend to trail off in the (original) read direction so some
>>> algorithms may treat a reverse version of the read differently.
>
> We will look into that one too.
>

Thanks.
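
For reference, undoing the reverse complementation could look roughly
like this (a minimal Python sketch, assuming plain ACGTN sequences; the
0x10 bit is the reverse strand FLAG from the SAM specification, and the
function name is made up for this example):

```python
FLAG_REVERSE = 0x10  # SAM FLAG bit: SEQ is stored reverse complemented

# Translation table for plain ACGTN sequences (no IUPAC ambiguity codes).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def original_orientation(seq, qual, flag):
    """Return (sequence, quality string) in the original read direction."""
    if flag & FLAG_REVERSE:
        # Reverse complement the bases and reverse the qualities to match.
        return seq.translate(_COMPLEMENT)[::-1], qual[::-1]
    return seq, qual
```

The quality string is only reversed, never complemented, so each score
stays attached to its base.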

> Many thanks for the suggestions
>

No problem.

Peter C.


From ajb at ebi.ac.uk  Fri Aug  6 10:53:18 2010
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Fri, 6 Aug 2010 11:53:18 +0100 (BST)
Subject: [emboss-dev] Configuration in flux
Message-ID: <34847.86.26.12.63.1281091998.squirrel@webmail.ebi.ac.uk>

Dear developers,

The EMBOSS configuration in CVS is in a state of flux at the moment.
The major changes over the last 48 hours have been to make use
of autoheader and also to clear out any system-specific libtool
files.

The upshot is that, from a fresh CVS checkout, the configuration should
just amount to:

     autoreconf -fi
     ./configure [options]

The above should mean that the configuration is relatively independent
of the version of libtool you have installed. Note, however, that
there is now a prerequisite for an autoconf version of at
least 2.59. The use of autoheader means that the compilation
lines are significantly shorter.

There will be further configuration changes over the next few
weeks but nothing quite so fundamental.

Alan


From pjotr.public78 at thebird.nl  Thu Aug 12 10:12:40 2010
From: pjotr.public78 at thebird.nl (Pjotr Prins)
Date: Thu, 12 Aug 2010 12:12:40 +0200
Subject: [emboss-dev] Unreachable code in featReadGff3
In-Reply-To: <AANLkTi=LtEjtV3azhCT6FY7ekXNYoXDkkjiGasZTgxtB@mail.gmail.com>
References: <AANLkTin1uBSuLn+6cC-_NaS5XzuF5E-WFASbNDvJG-76@mail.gmail.com>
	<AANLkTi=LtEjtV3azhCT6FY7ekXNYoXDkkjiGasZTgxtB@mail.gmail.com>
Message-ID: <20100812101240.GA28807@thebird.nl>

Something funny in the function featReadGff3: it looks like the second else
if(ajRegExec(Gff3Regexregion,line)) is unreachable code:

  if(ajRegExec(Gff3Regexblankline, line))
      version = 3.0;
  else if(ajRegExec(Gff3Regexversion,line))
  {
      verstr = ajStrNew();
      ajRegSubI(Gff3Regexversion, 1, &verstr);
      ajStrToFloat(verstr, &version);
      ajStrDel(&verstr);
      if(version < 3.0)
      {
          ajStrDel(&line);
          return ajFalse;
      }
  }
  else if(ajRegExec(Gff3Regexregion,line))
  {
      start = ajStrNew();
      end   = ajStrNew();
  (...)


From pjotr.public78 at thebird.nl  Thu Aug 12 10:33:35 2010
From: pjotr.public78 at thebird.nl (Pjotr Prins)
Date: Thu, 12 Aug 2010 12:33:35 +0200
Subject: [emboss-dev] GFF3 in EMBOSS
Message-ID: <20100812103335.GA28925@thebird.nl>

I am having a look at the GFF3 implementation in EMBOSS - mostly
ajax/core/ajfeat.c.

All features are loaded into RAM, and also the sequence information,
when in the file. Not only for GFF3, but for all feature data types.

On regular desktops this is a problem when loading a larger set,
and/or multiple genomes.

Is it the idea to load big data and store it in a SQL database? I.e.
should I recommend handling it outside EMBOSS?

Pj.


From pmr at ebi.ac.uk  Thu Aug 12 10:52:23 2010
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 12 Aug 2010 11:52:23 +0100
Subject: [emboss-dev] GFF3 in EMBOSS
In-Reply-To: <20100812103335.GA28925@thebird.nl>
References: <20100812103335.GA28925@thebird.nl>
Message-ID: <4C63D267.3070904@ebi.ac.uk>

Hi Pjotr,

On 12/08/10 11:33, Pjotr Prins wrote:
> I am having a look at the GFF3 implementation in EMBOSS - mostly
> ajax/core/ajfeat.c.
>
> All features are loaded into RAM, and also the sequence information,
> when in the file. Not only for GFF3, but for all feature data types.
>
> On regular desktops this is a problem when loading a larger set,
> and/or multiple genomes.
>
> Is it the idea to load big data and store it in a SQL database? I.e.
> should I recommend handling it outside EMBOSS?

We are looking into storing data structures for large datasets on disk - 
not only for features but also for next-generation mapped reads.

Can you give an example of the input you are trying to handle?

I hope to explore these issues at the GMOD meeting in Cambridge (UK) soon.

regards,

Peter Rice


From pjotr.public78 at thebird.nl  Thu Aug 12 11:57:55 2010
From: pjotr.public78 at thebird.nl (Pjotr Prins)
Date: Thu, 12 Aug 2010 13:57:55 +0200
Subject: [emboss-dev] GFF3 in EMBOSS
In-Reply-To: <4C63D267.3070904@ebi.ac.uk>
References: <20100812103335.GA28925@thebird.nl> <4C63D267.3070904@ebi.ac.uk>
Message-ID: <20100812115755.GA30047@thebird.nl>

On Thu, Aug 12, 2010 at 11:52:23AM +0100, Peter Rice wrote:
> We are looking into storing data structures for large datasets on disk -  
> not only for features but also for next-generation mapped reads.

That is a great idea! The first quick-win is not to load sequence
data in memory, but fetch it on demand using a seek index. Something
that BioPerl has.
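
To make the seek index idea concrete, here is a minimal Python sketch
(an illustration only, not how BioPerl or EMBOSS does it; it assumes
FASTA-style records, and the function names are made up): one pass
records the byte offset of each header, and a sequence is then read
only when asked for.

```python
import io


def build_index(handle):
    """Map each FASTA record ID to the byte offset of its '>' line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode("ascii")] = offset
    return index


def fetch(handle, index, seq_id):
    """Seek to one record and return its sequence as a string."""
    handle.seek(index[seq_id])
    handle.readline()  # skip the header line
    parts = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # reached the next record
        parts.append(line.strip())
    return b"".join(parts).decode("ascii")


# Small in-memory example; a real file would be opened in binary mode.
demo = io.BytesIO(b">chr1 test\nACGT\nAC\n>chr2\nGG\n")
demo_index = build_index(demo)
```

Only the dictionary of offsets stays in memory; the sequence data is
left on disk until fetch() is called.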

> Can you give an example of the input you are trying to handle?

I am dealing with Worms - Wormbase uses gff3 for some worms. EMBOSS
is already memory efficient compared to BioRuby/Python/Perl, so I
am thinking of a BioLib mapping. A writeup is here:

  http://thebird.nl/biolib/Adding_BioLib_EMBOSS_GFF3_Support.html

> I hope to explore these issues at the GMOD meeting in Cambridge (UK) soon.

It makes sense for (desktop) genome browsers, for one.

Pj.


From pjotr.public78 at thebird.nl  Thu Aug 12 21:24:21 2010
From: pjotr.public78 at thebird.nl (Pjotr Prins)
Date: Thu, 12 Aug 2010 23:24:21 +0200
Subject: [emboss-dev] Embassy in Debian-med
Message-ID: <20100812212421.GA3151@thebird.nl>

Debian-med has problems with the Embassy packages, as they fail to
build against EMBOSS-latest.

Andreas Tille writes:

> To put the emboss and embassy packages in consistency in Squeeze, here are
> possible solutions:
>
> - Remove the embassy-* packages from testing.
> - Upload emboss 6.2 to testing-proposed-updates.
> - Upgrade embassy-* packages with the latest upstream version, that builds
>   against emboss 6.3, and let emboss 6.3 in testing.

What is the priority of supporting the Embassy packages? Are they
lesser citizens in EMBOSS? Or can we expect resolution in the near
future?

Pj.


From biopython at maubp.freeserve.co.uk  Fri Aug 13 09:40:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 13 Aug 2010 10:40:35 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To: <AANLkTimmWW4Ud4+FBd3iUzaedFpRC3XTH5sMORhgLJjv@mail.gmail.com>
References: <mailman.1170.1279190734.3031.emboss-dev@lists.open-bio.org>
	<AANLkTikdCydb5I08EWmJu7eYjvH3RKkMMMojdfN6tdyk@mail.gmail.com>
	<4C3EED02.7080507@ebi.ac.uk>
	<AANLkTimp3WrTbgH1Md-2jOLC6ZN_A8UeijG7FdUpGjhf@mail.gmail.com>
	<AANLkTi=xtj0QKD8T52G0QM72-YpqjJuFmC51efdxEcbD@mail.gmail.com>
	<4C57C4D9.1010805@ebi.ac.uk>
	<AANLkTimmWW4Ud4+FBd3iUzaedFpRC3XTH5sMORhgLJjv@mail.gmail.com>
Message-ID: <AANLkTi=gF9yPgeZQfe2HvdpU2K66BFAaiYt-LM3rQZEU@mail.gmail.com>

On Tue, Aug 3, 2010 at 9:12 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Aug 3, 2010 at 8:27 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>>
>>> Just for the record, EMBOSS 6.3.1 does not append anything to the
>>> read names, meaning paired end reads cannot be distinguished if
>>> output as FASTA or FASTQ.
>>>
>>> I'm not sure my idea of appending /1 or /2 for paired reads is the
>>> best solution (especially since there are other naming schemes
>>> out there like _f and _r as suffixes). Nevertheless, it seems like a
>>> practical solution. Would including a slash character within a
>>> sequence name cause problems in EMBOSS (a potential issue
>>> you raised earlier)?
>>
>> The /1 and /2 would cause horrible problems. The sequence names are
>> used to generate default output file names so a '/' would have to be
>> removed or converted, most likely to _1 and _2
>
> Oh :(
>
> I thought they might cause confusion with slashes in filenames, but
> yes, they can't be used in filenames can they.

Thinking about this more, I don't think there is a problem. There are
two main reasons. First, with SAM/BAM/FASTQ files there are typically
so many reads that you would never want to create one file per read.

Also, there are plenty of other file formats where the record ID can
or indeed usually does contain a slash - specifically PFAM/Stockholm
format alignments from PFAM where the ID is name/start-stop, e.g.
http://emboss.sourceforge.net/docs/themes/seqformats/pfam
Surely EMBOSS has already got a mechanism for dealing with
slashes in IDs when asked to use the IDs as filenames?
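
Something as simple as the following would do (a Python sketch of the
idea only; I do not know what EMBOSS does internally, and the function
name, the default extension and the second example ID are all
hypothetical):

```python
def id_to_filename(seq_id, extension="fasta"):
    """Hypothetical helper: make a record ID safe to use as a file name."""
    # A slash would be read as a directory separator, so swap it for '_'.
    safe = seq_id.replace("/", "_")
    return f"{safe}.{extension}"


print(id_to_filename("EAS56_57:6:190:289:82/1"))
print(id_to_filename("EXAMPLE_ID/18-71"))
```

This matches the suggested conversion of /1 and /2 to _1 and _2, and
leaves the ID itself untouched inside the output file.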

I think I mentioned storing the original read name in the tags had
been suggested on the samtools-devel list. In the latest draft of
the SAM/BAM spec, a new tag FS (fragment name suffix) has been
proposed, so that the original read names could be recovered by
taking the fragment name (the ID in SAM/BAM) and appending
this suffix. See this thread earlier in August 2010,

[Samtools-devel] Recording original read name in tags
http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTimg%2BvNU3CkW-63Mmug-Qt0md183dyJ_nRqva1rv%40mail.gmail.com&forum_name=samtools-devel

Finally, on the samtools-help list, it was pointed out that the
hydra-sv project has a bamToFastq tool, see thread:

[Samtools-help] BAM to fastq how?
http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTinBnm%2B8V8bXD_ii9jn8-O%2B0_N1MgWBxBFnqm2Mk%40mail.gmail.com&forum_name=samtools-help

and http://code.google.com/p/hydra-sv/

Peter C.


From gbottu at vub.ac.be  Tue Aug 17 18:43:58 2010
From: gbottu at vub.ac.be (Guy Bottu)
Date: Tue, 17 Aug 2010 20:43:58 +0200
Subject: [emboss-dev] computed maximum forbidden in ACD ?
Message-ID: <4C6AD86E.7070909@vub.ac.be>

	Dear Peter and Alan,

I was doing some development on wrappers4EMBOSS when I noted the 
following. The file blast.acd contains :

   integer: listsize [
     information: "Show only the n best scoring sequences that
                   satisfy E() cutoff"
     default: "500"
     minimum: "0"
   ]

   integer: align [
     information: "Show only alignments for the n first
                   sequences"
     default: "@(@($(listsize) < 250 ) ? $(listsize) : 250)"
     expected: "250"
     minimum: "0"
     maximum: "$(listsize)"                          (this is line 100)
     valid: "Integer 0 or more, but not < listsize"
   ]

When I run blast I get :

Error: File /OPT/emboss63/share/EMBOSS/acd/blast.acd line 100: 
(wordsize) Attribute failrange: required with any calculated min/max

I am as good as certain that this behaviour has appeared with EMBOSS 
version 6.3.0. In the past it was allowed to set a "maximum" that 
depended on the choice of another parameter, and we can see that it 
could occasionally make sense, but this seems from now on forbidden. Is 
this a bug or a feature?

	Regards,
	Guy Bottu


From pmr at ebi.ac.uk  Tue Aug 17 20:22:58 2010
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 17 Aug 2010 21:22:58 +0100
Subject: [emboss-dev] computed maximum forbidden in ACD ?
In-Reply-To: <4C6AD86E.7070909@vub.ac.be>
References: <4C6AD86E.7070909@vub.ac.be>
Message-ID: <4C6AEFA2.8000201@ebi.ac.uk>

Dear Guy,

> When I run blast I get :
>
> Error: File /OPT/emboss63/share/EMBOSS/acd/blast.acd line 100:
> (wordsize) Attribute failrange: required with any calculated min/max
>
> I am as good as certain that this behaviour has appeared with EMBOSS
> version 6.3.0. In the past it was allowed to set a "maximum" that
> depended on the choice of another parameter, and we can see that it
> could occasionally make sense, but this seems from now on forbidden. Is
> this a bug or a feature?

It is a fix for a feature. With calculated maximum or minimum values 
(e.g. depending on a window size) it was possible for the maximum to be 
less than the minimum. In such cases we could logically use either the 
maximum or the minimum - and some applications were found to require one 
choice, others needed the other.

After some discussion we decided to add extra attributes to control the 
behaviour. You can add three new attributes:

trueminimum: "N"  (if max/min overlap, use minimum)
failrange:   "Y"  (fail if (calculated) ranges overlap)
rangemessage: ""  (failure message if (calculated) ranges overlap)

A common solution (good for your case) is:

failrange: "N"
trueminimum: "Y"

By adding the error messages we made sure that an ACD file with a 
calculated range will give messages to the developer suggesting missing 
attributes to be added.

If you set failrange: "Y" you need to define a message explaining to the 
end user why the range might fail.

If you set failrange: "N" the calculated range is accepted, but you also 
need to set trueminimum to say whether you want the minimum value to 
apply (the usual choice, to avoid getting negative values) or the maximum, 
to avoid values going too large.

So, you get the "failrange is required" message. When you set that you 
get another message (depending whether it is true or false) telling you 
to set one of the other attributes as well.
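
The decision described above can be sketched as follows (Python used
purely for illustration; this is not the ACD parser, and the function
name, the ValueError and the choice to collapse the range to a single
limit are assumptions about how "use minimum" or "use maximum" plays
out):

```python
def resolve_range(minimum, maximum, failrange, trueminimum, rangemessage=""):
    """Return the (min, max) limits to apply after a calculated range."""
    if maximum >= minimum:
        return minimum, maximum  # nothing to resolve
    if failrange:
        # failrange: "Y" - stop and show the developer-supplied message.
        raise ValueError(rangemessage or "calculated range is empty")
    # failrange: "N" - trueminimum picks which limit wins.
    limit = minimum if trueminimum else maximum
    return limit, limit
```

With failrange: "N" and trueminimum: "Y", a calculated maximum that
falls below the minimum would then give way to the minimum.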

Hope this makes it clearer!

Peter