[emboss-dev] Bug report and patch - SAM parser and negative ISIZE

Peter biopython at maubp.freeserve.co.uk
Mon Aug 2 09:55:10 EDT 2010


Hi again,

This is another bug report for EMBOSS 6.3.1 (compiled on Mac OS X
10.6.4 Snow Leopard) using the same example files as earlier, see:
http://lists.open-bio.org/pipermail/emboss-dev/2010-August/thread.html

For the purposes of a concise example, I'm using seqret to convert
SAM/BAM to FASTA so as to count the number of reads. See also:
http://lists.open-bio.org/pipermail/emboss/2010-July/003951.html

I believe this SAM and BAM file both contain 3270 reads, but EMBOSS
is having trouble with the SAM file:

$ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | grep -c "^>"
3270
$ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | grep -c "^>"
41

If we look at the output,

$ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto
>EAS56_57:6:190:289:82 chr1
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>EAS56_57:6:190:289:82 chr1
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
...
>EAS114_28:6:155:68:326 chr1
CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAA
>EAS188_7:7:19:886:279 chr1
CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTA

Looking at the SAM file, I guessed EMBOSS doesn't like a negative
ISIZE field in the next record, EAS54_61:4:143:69:578, from the
SAM file we have:

...
EAS114_28:6:155:68:326  99      chr1    182     99      36M     =
 332     186     CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAA
<<<<<<<<<<<<<<<<<<<<<<<<<<<<:<<<<<<<    MF:i:18 Aq:i:76 NM:i:0  UQ:i:0
 H0:i:1  H1:i:0
EAS188_7:7:19:886:279   99      chr1    182     99      35M     =
 337     190     CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTA
<9<<<<<<<<<<<<6<28:<<85<<<<<2<;<9<<     MF:i:18 Aq:i:67 NM:i:0  UQ:i:0
 H0:i:1  H1:i:0
EAS54_61:4:143:69:578   147     chr1    185     98      35M     =
 36      -184    ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG
222&<21<<<<12<7<01<<<<<0<<<<<<<20<<     MF:i:18 Aq:i:35 NM:i:1  UQ:i:5
 H0:i:1  H1:i:0
EAS54_71:4:13:981:659   181     chr1    187     0       *       =
 188     0       CGGGACAATGGACGAGGTAAACCGCACATTGACAA
+)---3&&3&--+0)&+3:7777).333:<06<<<     MF:i:192
...

Looking at the source code, currently EMBOSS is wrongly assuming
an unsigned integer will be used. This is not true, the spec allows for
a negative ISIZE. I replaced this code in ajax/core/ajseqread.c

    ajStrTokenNextParseNoskip(&handle,&token); /* ISIZE */
    ajDebug("ISIZE '%S'\n", token);
    if(ajStrGetLen(token)){
        if(!ajStrToUint(token, &flags))
            return ajFalse;
    }

with:

    ajStrTokenNextParseNoskip(&handle,&token); /* MPOS */
    ajDebug("MPOS  '%S'\n", token);
    if(ajStrGetLen(token)){
        if(!ajStrToInt(token, &flags))
            return ajFalse;
    }

(i.e. Uint to Int), and now I get the correct read count.

A related question is why did this error condition not give any
error message to stdout or stderr?

Regards,

Peter C.


More information about the emboss-dev mailing list