[emboss-dev] Inconsistency in SAM vs BAM read description

Peter biopython at maubp.freeserve.co.uk
Mon Aug 2 12:26:07 EDT 2010


Hi all,

After patching the following two issues,
http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000667.html
http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000668.html
there is a noticeable difference in the output from the SAM and BAM
parsers in the description of the reads:

$ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | head
>EAS56_57:6:190:289:82
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>EAS56_57:6:190:289:82
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
>EAS51_64:3:190:727:308
GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG
>EAS112_34:7:141:80:875
AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA
>EAS219_FC30151:3:40:1128:1940
CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC


$ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | head
>EAS56_57:6:190:289:82 chr1
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>EAS56_57:6:190:289:82 chr1
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
>EAS51_64:3:190:727:308 chr1
GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG
>EAS112_34:7:141:80:875 chr1
AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA
>EAS219_FC30151:3:40:1128:1940 chr1
CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC

As you can see from the above example (using files described in
the linked threads), when parsing SAM files if the read is mapped
then the reference sequence name is used as the description.
This seems like a sensible and useful thing to do. However, when
parsing BAM files this is not currently being done.

Having the SAM and BAM parser produce identical results is
very useful for testing purposes (e.g. running diff on their output
as FASTQ format), so I would like the BAM parser to do the same.

Looking at the source, function seqReadSam in ajax/core/ajseqread.c
does this with the reference name string:

    ajStrTokenNextParseNoskip(&handle,&token); /* RNAME */
    ajDebug("RNAME '%S'\n", token);
    if(ajStrGetLen(token))
        seqAccSave(thys, token);

Therefore the BAM parser needs to do something similar, first
mapping the integer rID (reference sequence ID) to the array of
reference names from the BAM header.

I got as far as a partial solution but it only worked on the first read.
The problem is that although header variable ntargets is stored as
bamdata->Nref it does not appear that the array of strings
targetname is kept (likewise the array of integers targetlen but we
don't care about that here).

Regards,

Peter C.


More information about the emboss-dev mailing list