[emboss-dev] Bug report and patch - SAM parser and negative ISIZE
Peter
biopython at maubp.freeserve.co.uk
Mon Aug 2 09:55:10 EDT 2010
Hi again,
This is another bug report for EMBOSS 6.3.1 (compiled on Mac OS X
10.6.4 Snow Leopard) using the same example files as earlier, see:
http://lists.open-bio.org/pipermail/emboss-dev/2010-August/thread.html
For the purposes of a concise example, I'm using seqret to convert
SAM/BAM to FASTA so as to count the number of reads. See also:
http://lists.open-bio.org/pipermail/emboss/2010-July/003951.html
I believe this SAM and BAM file both contain 3270 reads, but EMBOSS
is having trouble with the SAM file:
$ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | grep -c "^>"
3270
$ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | grep -c "^>"
41
If we look at the output,
$ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto
>EAS56_57:6:190:289:82 chr1
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>EAS56_57:6:190:289:82 chr1
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
...
>EAS114_28:6:155:68:326 chr1
CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAA
>EAS188_7:7:19:886:279 chr1
CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTA
Looking at the SAM file, I guessed EMBOSS doesn't like a negative
ISIZE field in the next record, EAS54_61:4:143:69:578, from the
SAM file we have:
...
EAS114_28:6:155:68:326 99 chr1 182 99 36M =
332 186 CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAA
<<<<<<<<<<<<<<<<<<<<<<<<<<<<:<<<<<<< MF:i:18 Aq:i:76 NM:i:0 UQ:i:0
H0:i:1 H1:i:0
EAS188_7:7:19:886:279 99 chr1 182 99 35M =
337 190 CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTA
<9<<<<<<<<<<<<6<28:<<85<<<<<2<;<9<< MF:i:18 Aq:i:67 NM:i:0 UQ:i:0
H0:i:1 H1:i:0
EAS54_61:4:143:69:578 147 chr1 185 98 35M =
36 -184 ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG
222&<21<<<<12<7<01<<<<<0<<<<<<<20<< MF:i:18 Aq:i:35 NM:i:1 UQ:i:5
H0:i:1 H1:i:0
EAS54_71:4:13:981:659 181 chr1 187 0 * =
188 0 CGGGACAATGGACGAGGTAAACCGCACATTGACAA
+)---3&&3&--+0)&+3:7777).333:<06<<< MF:i:192
...
Looking at the source code, currently EMBOSS is wrongly assuming
an unsigned integer will be used. This is not true, the spec allows for
a negative ISIZE. I replaced this code in ajax/core/ajseqread.c
ajStrTokenNextParseNoskip(&handle,&token); /* ISIZE */
ajDebug("ISIZE '%S'\n", token);
if(ajStrGetLen(token)){
if(!ajStrToUint(token, &flags))
return ajFalse;
}
with:
ajStrTokenNextParseNoskip(&handle,&token); /* MPOS */
ajDebug("MPOS '%S'\n", token);
if(ajStrGetLen(token)){
if(!ajStrToInt(token, &flags))
return ajFalse;
}
(i.e. Uint to Int), and now I get the correct read count.
A related question is why did this error condition not give any
error message to stdout or stderr?
Regards,
Peter C.
More information about the emboss-dev
mailing list