From biopython at maubp.freeserve.co.uk Mon Aug 2 07:37:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 12:37:35 +0100
Subject: [emboss-dev] Bug report and patch - BAM quality score reading
Message-ID:

Hi all,

Since I had several queries about what EMBOSS would do with SAM/BAM,
http://lists.open-bio.org/pipermail/emboss-dev/2010-July/000656.html
I decided to try it and see. I believe I have found a bug with reading
quality scores from BAM files in EMBOSS 6.3.1

$ seqret -version
EMBOSS:6.3.1

I have been using a small pair of SAM and BAM files, originally
downloaded as a SAM file with reference FASTA sequence from the
pysam project, which I converted to BAM using samtools as they
specify in their readme file
http://code.google.com/p/pysam/source/browse/#hg/tests
e.g.

curl -O http://pysam.googlecode.com/hg/tests/ex1.fa
curl -O http://pysam.googlecode.com/hg/tests/ex1.sam.gz
gunzip ex1.sam.gz
samtools faidx ex1.fa
samtools import ex1.fa.fai ex1.sam ex1.bam

If we look at the first two reads in the SAM file, notice their
quality strings:

$ head ex1.sam
EAS56_57:6:190:289:82 69 chr1 100 0 * = 100 0 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; MF:i:192
EAS56_57:6:190:289:82 137 chr1 100 73 35M = 100 0 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2; MF:i:64 Aq:i:0 NM:i:0 UQ:i:0 H0:i:1 H1:i:0
...

Now let's ask EMBOSS seqret to convert from SAM to Sanger FASTQ,

$ seqret -sformat sam -osformat fastq-sanger ex1.sam -stdout -auto | head
@EAS56_57:6:190:289:82 chr1
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
+
<<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;
@EAS56_57:6:190:289:82 chr1
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
+
<<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;
@EAS51_64:3:190:727:308 chr1
GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG

The quality strings agree with the SAM file, good.

Now let's ask EMBOSS seqret to convert from BAM to Sanger FASTQ,

$ seqret -sformat bam -osformat fastq-sanger ex1.bam -stdout -auto | head
@EAS56_57:6:190:289:82
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
+
]]]X]]]\]]]]]]]]Y\\]X\U]\]\\\\\ZU]\
@EAS56_57:6:190:289:82
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
+
]]]]]]\]]]]]]]]]]\]]\]]]]\Y]W\Z\\S\
@EAS51_64:3:190:727:308
GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG

The quality strings differ, this is bad.

In the SAM file these two reads have quality strings starting with "<",
ASCII 60 meaning PHRED 60-33 = 27. In the funny BAM to Sanger FASTQ
conversion, EMBOSS has used "]" which is ASCII 93, giving PHRED
93-33 = 60, i.e. 33 more than it should be. I suspected that the
EMBOSS code for reading BAM files was wrongly applying a 33 offset
to the quality scores. In BAM files the scores are simply encoded
directly as uint8_t without any offset.

Looking at the source code, file ajax/core/ajseqread.c we have:

    for(i=0; i < (ajuint) c->l_qseq; i++)
    {
        ajFmtPrintAppS(&qualstr, " %02x", 33+d[dpos]);
        thys->Accuracy[i] = (float) (33 + d[dpos++]);
    }

The creation of a quality string appears to be for debug only, and
here adding 33 to make the scores printable ASCII using the Sanger
FASTQ encoding makes sense. However, adding the offset to the
accuracy looks like an oversight. How about:

    for(i=0; i < (ajuint) c->l_qseq; i++)
    {
        ajFmtPrintAppS(&qualstr, " %02x", 33+d[dpos]);
        thys->Accuracy[i] = (float) d[dpos++];
    }

With this tiny change, I get the expected Sanger FASTQ output from a
BAM file using seqret.

Regards,

Peter C.
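
[Editorial sketch] The offset arithmetic in the report above can be illustrated with a few lines of standalone C. This is only a minimal sketch under stated assumptions: the helper names (phred_from_sanger_char, sanger_char_from_phred, phred_from_bam_bytes) are hypothetical and are not EMBOSS or samtools functions. It shows that SAM and Sanger FASTQ text carries PHRED+33, while BAM bytes are used as they are.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper names for illustration; not part of EMBOSS. */

/* SAM and Sanger FASTQ store PHRED scores as ASCII with an offset of 33. */
enum { SANGER_OFFSET = 33 };

/* Decode one quality character from a SAM or Sanger FASTQ string. */
static int phred_from_sanger_char(char c)
{
    return (int) (unsigned char) c - SANGER_OFFSET;
}

/* Encode one PHRED score as a Sanger FASTQ quality character. */
static char sanger_char_from_phred(int phred)
{
    return (char) (phred + SANGER_OFFSET);
}

/* BAM stores each score as a raw uint8_t, so decoding adds no offset. */
static void phred_from_bam_bytes(const uint8_t *raw, size_t n, float *accuracy)
{
    size_t i;

    for (i = 0; i < n; i++)
        accuracy[i] = (float) raw[i];
}
```

Under these assumptions, '<' (ASCII 60) decodes to PHRED 27, and re-encoding 27 + 33 = 60 gives ']' (ASCII 93), which matches the wrong output reported above.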

From biopython at maubp.freeserve.co.uk Mon Aug 2 09:55:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 14:55:10 +0100
Subject: [emboss-dev] Bug report and patch - SAM parser and negative ISIZE
Message-ID:

Hi again,

This is another bug report for EMBOSS 6.3.1 (compiled on Mac OS X
10.6.4 Snow Leopard) using the same example files as earlier, see:
http://lists.open-bio.org/pipermail/emboss-dev/2010-August/thread.html

For the purposes of a concise example, I'm using seqret to convert
SAM/BAM to FASTA so as to count the number of reads. See also:
http://lists.open-bio.org/pipermail/emboss/2010-July/003951.html

I believe this SAM and BAM file both contain 3270 reads, but EMBOSS
is having trouble with the SAM file:

$ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | grep -c "^>"
3270

$ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | grep -c "^>"
41

If we look at the output,

$ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto
>EAS56_57:6:190:289:82 chr1
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
>EAS56_57:6:190:289:82 chr1
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
...
>EAS114_28:6:155:68:326 chr1
CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAA
>EAS188_7:7:19:886:279 chr1
CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTA

Looking at the SAM file, I guessed EMBOSS doesn't like a negative
ISIZE field in the next record, EAS54_61:4:143:69:578, from the SAM
file we have:

...
EAS114_28:6:155:68:326 99 chr1 182 99 36M = 332 186 CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAA <<<<<<<<<<<<<<<<<<<<<<<<<<<<:<<<<<<< MF:i:18 Aq:i:76 NM:i:0 UQ:i:0 H0:i:1 H1:i:0
EAS188_7:7:19:886:279 99 chr1 182 99 35M = 337 190 CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTA <9<<<<<<<<<<<<6<28:<<85<<<<<2<;<9<< MF:i:18 Aq:i:67 NM:i:0 UQ:i:0 H0:i:1 H1:i:0
EAS54_61:4:143:69:578 147 chr1 185 98 35M = 36 -184 ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG 222&<21<<<<12<7<01<<<<<0<<<<<<<20<< MF:i:18 Aq:i:35 NM:i:1 UQ:i:5 H0:i:1 H1:i:0
EAS54_71:4:13:981:659 181 chr1 187 0 * = 188 0 CGGGACAATGGACGAGGTAAACCGCACATTGACAA +)---3&&3&--+0)&+3:7777).333:<06<<< MF:i:192
...

Looking at the source code, currently EMBOSS is wrongly assuming
an unsigned integer will be used. This is not true, the spec allows for
a negative ISIZE. I replaced this code in ajax/core/ajseqread.c

    ajStrTokenNextParseNoskip(&handle,&token); /* ISIZE */
    ajDebug("ISIZE '%S'\n", token);
    if(ajStrGetLen(token)){
        if(!ajStrToUint(token, &flags))
            return ajFalse;
    }

with:

    ajStrTokenNextParseNoskip(&handle,&token); /* ISIZE */
    ajDebug("ISIZE '%S'\n", token);
    if(ajStrGetLen(token)){
        if(!ajStrToInt(token, &flags))
            return ajFalse;
    }

(i.e. Uint to Int), and now I get the correct read count.

A related question is why did this error condition not give any
error message to stdout or stderr?

Regards,

Peter C.

From pmr at ebi.ac.uk Mon Aug 2 11:42:00 2010
From: pmr at ebi.ac.uk (Peter Rice)
Date: Mon, 02 Aug 2010 16:42:00 +0100
Subject: [emboss-dev] Bug reports and patches: BAM quality, SAM negative ISIZE
In-Reply-To:
References:
Message-ID: <4C56E748.7010803@ebi.ac.uk>

On 02/08/10 14:55, Peter C. wrote:

> In the funny BAM to Sanger FASTQ conversion, EMBOSS has used
> "]" which is ASCII 93, giving PHRED 93-33 = 60. i.e. 33 more than it
> should be. I suspected that the EMBOSS code for reading BAM files
> was wrongly applying a 33 offset to the quality scores. In BAM files
> the scores are simply encoded directly as uint8_t without any offset.

Thanks for spotting that. We will make a patch with that fix in.

> Looking at the SAM file, I guessed EMBOSS doesn't like a negative
> ISIZE field in the next record, EAS54_61:4:143:69:578, .........
>
> Looking at the source code, currently EMBOSS is wrongly assuming
> an unsigned integer will be used. This is not true, the spec allows for
> a negative ISIZE. I replaced this code in ajax/core/ajseqread.c

Thanks for the fix. We will add that to the patch.

> A related question is why did this error condition not give any
> error message to stdout or stderr?

This appears to be a general issue with reading unknown and known
formats. We will fix it so that error messages are turned on for this
failure condition.

Many thanks for the bug reports - and the fixes!!

Peter R.

From biopython at maubp.freeserve.co.uk Mon Aug 2 11:52:56 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 16:52:56 +0100
Subject: [emboss-dev] Bug reports and patches: BAM quality, SAM negative ISIZE
In-Reply-To: <4C56E748.7010803@ebi.ac.uk>
References: <4C56E748.7010803@ebi.ac.uk>
Message-ID:

On Mon, Aug 2, 2010 at 4:42 PM, Peter Rice wrote:
>
> On 02/08/10 14:55, Peter C. wrote:
>
>> In the funny BAM to Sanger FASTQ conversion, EMBOSS has used
>> "]" which is ASCII 93, giving PHRED 93-33 = 60. i.e. 33 more than it
>> should be. I suspected that the EMBOSS code for reading BAM files
>> was wrongly applying a 33 offset to the quality scores. In BAM files
>> the scores are simply encoded directly as uint8_t without any offset.
>
> Thanks for spotting that. We will make a patch with that fix in.
>
>> Looking at the SAM file, I guessed EMBOSS doesn't like a negative
>> ISIZE field in the next record, EAS54_61:4:143:69:578, .........
>>
>> Looking at the source code, currently EMBOSS is wrongly assuming
>> an unsigned integer will be used. This is not true, the spec allows for
>> a negative ISIZE. I replaced this code in ajax/core/ajseqread.c
>
> Thanks for the fix.
We will add that to the patch. > Great. Are you still issuing patches which don't affect the version number? I'd prefer to have an easy way to know if a given install of EMBOSS has certain fixes, and a point release seems quite straightforward from an outsider's perspective. P.S. Expect a couple more reports to follow... so don't rush a patch or point release out just yet ;) >> A related question is why did this error condition not give any >> error message to stdout or stderr? > > This appears to be a general issue with reading unknown and known formats. > We will fix it so that error messages are turned on for this failure > condition. Good :) > Many thanks for the bug reports - and the fixes!! > No problem, Peter From biopython at maubp.freeserve.co.uk Mon Aug 2 12:26:07 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 2 Aug 2010 17:26:07 +0100 Subject: [emboss-dev] Inconsistency in SAM vs BAM read description Message-ID: Hi all, After patching the following two issues, http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000667.html http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000668.html there is a noticeable difference in the output from the SAM and BAM parsers in the description of the reads: $ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | head >EAS56_57:6:190:289:82 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA >EAS56_57:6:190:289:82 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC >EAS51_64:3:190:727:308 GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG >EAS112_34:7:141:80:875 AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA >EAS219_FC30151:3:40:1128:1940 CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC $ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | head >EAS56_57:6:190:289:82 chr1 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA >EAS56_57:6:190:289:82 chr1 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC >EAS51_64:3:190:727:308 chr1 GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG >EAS112_34:7:141:80:875 chr1 AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA >EAS219_FC30151:3:40:1128:1940 chr1 
CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC

As you can see from the above example (using files described in
the linked threads), when parsing SAM files if the read is mapped
then the reference sequence name is used as the description.
This seems like a sensible and useful thing to do. However, when
parsing BAM files this is not currently being done.

Having the SAM and BAM parser produce identical results is
very useful for testing purposes (e.g. running diff on their output
as FASTQ format), so I would like the BAM parser to do the same.

Looking at the source, function seqReadSam in ajax/core/ajseqread.c
does this with the reference name string:

    ajStrTokenNextParseNoskip(&handle,&token); /* RNAME */
    ajDebug("RNAME '%S'\n", token);
    if(ajStrGetLen(token))
        seqAccSave(thys, token);

Therefore the BAM parser needs to do something similar, first
mapping the integer rID (reference sequence ID) to the array of
reference names from the BAM header. I got as far as a partial
solution but it only worked on the first read. The problem is that
although header variable ntargets is stored as bamdata->Nref it
does not appear that the array of strings targetname is kept
(likewise the array of integers targetlen but we don't care about
that here).

Regards,

Peter C.
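
[Editorial sketch] The missing piece described above, keeping the header's reference names so that each read's rID can be mapped back to a name, might look roughly like the standalone C below. Everything here is an assumption for illustration: RefTable and the ref_table_* functions are hypothetical names, plain C strings stand in for AJAX strings, and nothing is taken from the EMBOSS source. The only format fact relied on is that BAM uses a reference ID of -1 for reads without a reference.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical table of reference names kept from the BAM header.
 * None of these names exist in EMBOSS; they only illustrate the idea. */
typedef struct {
    int    nref;   /* number of reference sequences in the header */
    char **names;  /* names[rid] is the name for reference ID rid */
} RefTable;

/* Allocate an empty table for nref references; NULL on failure. */
static RefTable *ref_table_new(int nref)
{
    RefTable *t;

    if (nref < 0)
        return NULL;
    t = malloc(sizeof *t);
    if (!t)
        return NULL;
    t->nref = nref;
    t->names = calloc(nref > 0 ? (size_t) nref : 1, sizeof *t->names);
    if (!t->names) {
        free(t);
        return NULL;
    }
    return t;
}

/* Store a private copy of one name; returns 0 on success, -1 on error. */
static int ref_table_set(RefTable *t, int rid, const char *name)
{
    char *copy;

    if (!t || !name || rid < 0 || rid >= t->nref)
        return -1;
    copy = malloc(strlen(name) + 1);
    if (!copy)
        return -1;
    strcpy(copy, name);
    free(t->names[rid]);
    t->names[rid] = copy;
    return 0;
}

/* Map a read's reference ID to its name. BAM uses -1 for reads with
 * no reference, so those (and out-of-range IDs) give NULL. */
static const char *ref_table_name(const RefTable *t, int rid)
{
    if (!t || rid < 0 || rid >= t->nref)
        return NULL;
    return t->names[rid];
}

/* Release the table and every stored name. */
static void ref_table_free(RefTable *t)
{
    int i;

    if (!t)
        return;
    for (i = 0; i < t->nref; i++)
        free(t->names[i]);
    free(t->names);
    free(t);
}
```

In a parser, such a table would be filled once while reading the header and consulted for every read, which avoids the "only worked on the first read" symptom that arises when the names are discarded after the header is parsed.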
From biopython at maubp.freeserve.co.uk Mon Aug 2 12:42:07 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 2 Aug 2010 17:42:07 +0100 Subject: [emboss-dev] Inconsistency in SAM vs BAM read description In-Reply-To: References: Message-ID: On Mon, Aug 2, 2010 at 5:26 PM, Peter wrote: > Hi all, > > After patching the following two issues, > http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000667.html > http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000668.html > there is a noticeable difference in the output from the SAM and BAM > parsers in the description of the reads: > > $ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | head >>EAS56_57:6:190:289:82 > CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA >>EAS56_57:6:190:289:82 > AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC >>EAS51_64:3:190:727:308 > GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG >>EAS112_34:7:141:80:875 > AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA >>EAS219_FC30151:3:40:1128:1940 > CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC > > > $ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | head >>EAS56_57:6:190:289:82 chr1 > CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA >>EAS56_57:6:190:289:82 chr1 > AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC >>EAS51_64:3:190:727:308 chr1 > GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG >>EAS112_34:7:141:80:875 chr1 > AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA >>EAS219_FC30151:3:40:1128:1940 chr1 > CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC > > As you can see from the above example (using files described in > the linked threads), when parsing SAM files if the read is mapped > then the reference sequence name is used as the description. > This seems like a sensible and useful thing to do. However, when > parsing BAM files this is not currently being done. > > Having the SAM and BAM parser produce identical results is > very useful for testing purposes (e.g. running diff on their output > as FASTQ format), so I would like the BAM parser to do the same. 
>
> Looking at the source, function seqReadSam in ajax/core/ajseqread.c
> does this with the reference name string:
>
>    ajStrTokenNextParseNoskip(&handle,&token); /* RNAME */
>    ajDebug("RNAME '%S'\n", token);
>    if(ajStrGetLen(token))
>        seqAccSave(thys, token);
>

Just as a postscript, having failed to enhance the BAM parser, for
short term testing I'm just commenting out the above two lines of the
SAM parser. With that trivial change, the FASTA and FASTQ output from
both the SAM and BAM files agrees 100% (as you would expect).

Peter C.

From biopython at maubp.freeserve.co.uk Mon Aug 2 13:41:25 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 18:41:25 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To:
References: <4C3EED02.7080507@ebi.ac.uk>
Message-ID:

On Thu, Jul 15, 2010 at 12:36 PM, Peter wrote:
> On Thu, Jul 15, 2010 at 12:12 PM, Peter Rice wrote:
>>
>>> What do you do about naming for paired reads? I was appending
>>> /1 or /2 to match the Illumina convention. Doing nothing means
>>> the paired reads will have the same names.
>>
>> Not addressed yet - let's look into a common approach though.
>> We would also have to look into what the '/' character does to EMBOSS's
>> handling of sequence names.
>
> My rationale for appending the /1 and /2 is that in a typical workflow
> you might take Illumina paired end data as FASTQ and map it onto
> a genome with BWA giving SAM/BAM. You might then want to reverse
> this (e.g. if given a SAM/BAM file by a collaborator, and you want to
> try an alternative mapping tool or reference genome, first you must
> recover the raw reads again, e.g. as FASTQ files).

Just for the record, EMBOSS 6.3.1 does not append anything to the
read names, meaning paired end reads cannot be distinguished if
output as FASTA or FASTQ.
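
[Editorial sketch] For readers unfamiliar with how the pairing information is stored, the standalone C below sketches one way a suffix could be derived from the SAM FLAG field. It is not EMBOSS code: mate_number and name_with_mate are hypothetical names, and passing the separator as a parameter is an assumption made so that either '/' or '_' can be tried. The FLAG bit values (0x1 paired, 0x40 first read, 0x80 second read) are those of the SAM specification.

```c
#include <stddef.h>
#include <stdio.h>

/* FLAG bits as defined in the SAM specification. */
enum {
    SAM_FLAG_PAIRED = 0x1,   /* read is part of a pair      */
    SAM_FLAG_READ1  = 0x40,  /* first read of the pair      */
    SAM_FLAG_READ2  = 0x80   /* second read of the pair     */
};

/* Hypothetical helper: return 1 or 2 for the first or second read of
 * a pair, or 0 when the read is unpaired or the bits are ambiguous. */
static int mate_number(unsigned flag)
{
    int first;
    int second;

    if (!(flag & SAM_FLAG_PAIRED))
        return 0;
    first = (flag & SAM_FLAG_READ1) != 0;
    second = (flag & SAM_FLAG_READ2) != 0;
    if (first == second)
        return 0;
    return first ? 1 : 2;
}

/* Hypothetical helper: write qname, plus sep and the mate number for
 * paired reads, into out. Returns 0 on success, -1 if out is too small. */
static int name_with_mate(char *out, size_t outlen, const char *qname,
                          unsigned flag, char sep)
{
    int mate = mate_number(flag);
    int n;

    if (mate == 0)
        n = snprintf(out, outlen, "%s", qname);
    else
        n = snprintf(out, outlen, "%s%c%d", qname, sep, mate);

    return (n >= 0 && (size_t) n < outlen) ? 0 : -1;
}
```

With the two ex1.sam records shown earlier in this archive, FLAG 69 (1 + 4 + 64) is the first read of its pair and FLAG 137 (1 + 8 + 128) is the second, so the two reads named EAS56_57:6:190:289:82 would become distinguishable.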
I'm not sure my idea of appending /1 or /2 for paired reads is the best solution (especially since there are other naming schemes out there like _f and _r as suffixes). Nevertheless, it seems like a practical solution. Would including a slash character within a sequence name cause problems in EMBOSS (a potential issue you raised earlier)? Also, and this may be a bug, on output as unaligned SAM (and I assume also for unaligned BAM), the fact that a read is paired and the information about if is it the first or second read is lost. The FLAG is just set to 4, meaning unmapped. e.g. seqret -sformat bam -osformat sam ex1.bam -filter or: seqret -sformat sam -osformat sam ex1.sam -filter >>> What do you do about the strand issue? SAM/BAM stored reads >>> which map onto the reverse strand in reverse complement. If >>> you want to get back to the original orientation for output as >>> FASTQ you must apply the reverse complement (plus reverse >>> the quality scores too of course). >> >> So far we read as sequences. Reading as mapped reads (very large >> alignments) is planned for the very near future so it can appear in the >> next release. > > Given the use case of going from (aligned) SAM/BAM back to the > original FASTQ, for a round trip you *must* undo the reverse > complementation. This is important even for single reads, as quality > scores tend to trail off in the (original) read direction so some algorithms > may treat a reverse version of the read differently. To clarify, EMBOSS 6.3.1 does not flip reads mapped to the reverse strand: http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000667.html Regards, Peter C. From pmr at ebi.ac.uk Tue Aug 3 03:27:21 2010 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 03 Aug 2010 08:27:21 +0100 Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM In-Reply-To: References: <4C3EED02.7080507@ebi.ac.uk> Message-ID: <4C57C4D9.1010805@ebi.ac.uk> On 08/02/10 18:41, Peter C. 
wrote: > On Thu, Jul 15, 2010 at 12:36 PM, Peter wrote: >> On Thu, Jul 15, 2010 at 12:12 PM, Peter Rice wrote: >>> >>>> What do you do about naming for paired reads? I was appending >>>> /1 or /2 to match the Illumina convention. Doing nothing means >>>> the paired reads will have the same names. >>> >>> Not addressed yet - let's look into a common approach though. >>> We would also have to lok into what the '/' character does to EMBOSS's >>> handling of sequence names. >> >> My rational for appending the /1 and /2 is that in a typical workflow >> you might take Illumina paired end data as FASTQ and map it onto >> a genome with BWA giving SAM/BAM. You might then want to reverse >> this (e.g. if given a SAM/BAM file by a collaborator, and you want to >> try an alternative mapping tool or reference genome, first you must >> recover the raw reads again, e.g. as FASTQ files). > > Just for the record, EMBOSS 6.3.1 does not append anything to the > read names, meaning paired end reads cannot be distinguished if > output as FASTA or FASTQ. > > I'm not sure my idea of appending /1 or /2 for paired reads is the > best solution (especially since there are other naming schemes > out there like _f and _r as suffixes). Nevertheless, it seems like a > practical solution. Would including a slash character within a > sequence name cause problems in EMBOSS (a potential issue > you raised earlier)? The /1 and /2 would cause horrible problems. The sequence names are used to generate default output file names so a '/' would have to be removed or converted, most likely to _1 and _2 _f or _r as a suffix is much better ... but should we always assume these meanings? Should we add a command-line switch for paired read data? Should we only do something for fastq, sam and bam (or other NGS formats?) It is a mystery to me how paired reads came to have the same name. When we first used them at EMBL for the Human HPRT locus we made sure to add an "r" suffix to the reverse reads.... 
but then, as we used the GCG assembly system, we were forced to have a unique name :-) > Also, and this may be a bug, on output as unaligned SAM (and I > assume also for unaligned BAM), the fact that a read is paired and > the information about if is it the first or second read is lost. The > FLAG is just set to 4, meaning unmapped. e.g. > > seqret -sformat bam -osformat sam ex1.bam -filter Hmmm ... this kind of thing is specific to SAM-BAM conversions, as other formats will lose it unless we find some way to preserve the detail. We will take a look at what we can keep between these formats (we do make similar efforts between EMBL and GenBank formats) >> Given the use case of going from (aligned) SAM/BAM back to the >> original FASTQ, for a round trip you *must* undo the reverse >> complementation. This is important even for single reads, as quality >> scores tend to trail off in the (original) read direction so some algorithms >> may treat a reverse version of the read differently. We will look into that one too. Many thanks for the suggestions Peter Rice From biopython at maubp.freeserve.co.uk Tue Aug 3 04:12:27 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 3 Aug 2010 09:12:27 +0100 Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM In-Reply-To: <4C57C4D9.1010805@ebi.ac.uk> References: <4C3EED02.7080507@ebi.ac.uk> <4C57C4D9.1010805@ebi.ac.uk> Message-ID: On Tue, Aug 3, 2010 at 8:27 AM, Peter Rice wrote: >> >> Just for the record, EMBOSS 6.3.1 does not append anything to the >> read names, meaning paired end reads cannot be distinguished if >> output as FASTA or FASTQ. >> >> I'm not sure my idea of appending /1 or /2 for paired reads is the >> best solution (especially since there are other naming schemes >> out there like _f and _r as suffixes). Nevertheless, it seems like a >> practical solution. Would including a slash character within a >> sequence name cause problems in EMBOSS (a potential issue >> you raised earlier)? 
>
> The /1 and /2 would cause horrible problems. The sequence names are
> used to generate default output file names so a '/' would have to be
> removed or converted, most likely to _1 and _2

Oh :(

I thought they might cause confusion with slashes in filenames, but
yes, they can't be used in filenames can they.

> _f or _r as a suffix is much better ... but should we always assume these
> meanings? Should we add a command-line switch for paired read data?

My understanding is there are multiple different naming conventions,
so whatever we/you do it won't please everyone. What would help
here is if the original read name were to be recorded in the SAM/BAM
tags, as I think was suggested last month or so on the samtools-devel
mailing list. However, that would come with a filesize penalty, and
won't help with old files.

> Should we only do something for fastq, sam and bam (or other NGS
> formats?)

And FASTA too, not all assemblers use quality scores. Also QUAL
files if EMBOSS were to support them.

> It is a mystery to me how paired reads came to have the same name.
> When we first used them at EMBL for the Human HPRT locus we made
> sure to add an "r" suffix to the reverse reads.... but then, as we used
> the GCG assembly system, we were forced to have a unique name :-)

With Solexa/Illumina data, pairs got the same name bar a suffix.
Other sequencing centers have also followed this pattern, for example
Sanger sequencing with suffixes of .f and .r. I guess in order to
clearly group paired reads, and save a little space, for SAM/BAM they
opted to store a single name and use the FLAG field to hold if it is
the forward or reverse read.

Note that with strobed reads and the like coming "soon", rather than
just two reads in a pair, there could be many child reads for a single
fragment. Even with classic Sanger sequencing of a PCR product you
might end up with multiple reads (e.g.
two forward reads, one reverse) and if and how to handle this via an extension to SAM/BAM was also raised. Some pipelines may even use the same name for a forward/reverse pair, or ignore the names. Velvet for example just takes its paired data as interleaved files (forward then reverse reads one after the other). >> Also, and this may be a bug, on output as unaligned SAM (and I >> assume also for unaligned BAM), the fact that a read is paired and >> the information about if is it the first or second read is lost. The >> FLAG is just set to 4, meaning unmapped. e.g. >> >> seqret -sformat bam -osformat sam ex1.bam -filter > > Hmmm ... this kind of thing is specific to SAM-BAM conversions, as other > formats will lose it unless we find some way to preserve the detail. > > We will take a look at what we can keep between these formats (we do > make similar efforts between EMBL and GenBank formats) I think it would be useful to track the three bits for paired, read one, and read two. From memory, all the other bits of the FLAG are only applicable to mapped reads. Of course, this overlaps with the naming issue above. >>> Given the use case of going from (aligned) SAM/BAM back to the >>> original FASTQ, for a round trip you *must* undo the reverse >>> complementation. This is important even for single reads, as quality >>> scores tend to trail off in the (original) read direction so some >>> algorithms may treat a reverse version of the read differently. > > We will look into that one too. > Thanks. > Many thanks for the suggestions > No problem. Peter C. From ajb at ebi.ac.uk Fri Aug 6 06:53:18 2010 From: ajb at ebi.ac.uk (ajb at ebi.ac.uk) Date: Fri, 6 Aug 2010 11:53:18 +0100 (BST) Subject: [emboss-dev] Configuration in flux Message-ID: <34847.86.26.12.63.1281091998.squirrel@webmail.ebi.ac.uk> Dear developers, The EMBOSS configuration in CVS is in a state of flux at the moment. 
The major changes over the last 48 hours have been to make use of
autoheader and also to clear out any system-specific libtool files.

The upshot is that, from a fresh CVS checkout, the configuration
should just amount to:

autoreconf -fi
./configure [options]

The above should mean that the configuration is relatively independent
of the version of libtool you have installed. Note, however, that
there is now a prerequisite for an autoconf version of at least 2.59.

The use of autoheader means that the compilation lines are
significantly shorter.

There will be further configuration changes over the next few weeks
but nothing quite so fundamental.

Alan

From pjotr.public78 at thebird.nl Thu Aug 12 06:12:40 2010
From: pjotr.public78 at thebird.nl (Pjotr Prins)
Date: Thu, 12 Aug 2010 12:12:40 +0200
Subject: [emboss-dev] Unreachable code in featReadGff3
In-Reply-To:
References:
Message-ID: <20100812101240.GA28807@thebird.nl>

Something funny in the function featReadGff3, it looks like the second

  else if(ajRegExec(Gff3Regexregion,line))

is unreachable code:

    if(ajRegExec(Gff3Regexblankline, line))
        version = 3.0;
    else if(ajRegExec(Gff3Regexversion,line))
    {
        verstr = ajStrNew();
        ajRegSubI(Gff3Regexversion, 1, &verstr);
        ajStrToFloat(verstr, &version);
        ajStrDel(&verstr);

        if(version < 3.0)
        {
            ajStrDel(&line);
            return ajFalse;
        }
    }
    else if(ajRegExec(Gff3Regexregion,line))
    {
        start = ajStrNew();
        end = ajStrNew();
        (...)

From pjotr.public78 at thebird.nl Thu Aug 12 06:33:35 2010
From: pjotr.public78 at thebird.nl (Pjotr Prins)
Date: Thu, 12 Aug 2010 12:33:35 +0200
Subject: [emboss-dev] GFF3 in EMBOSS
Message-ID: <20100812103335.GA28925@thebird.nl>

I am having a look at the GFF3 implementation in EMBOSS - mostly
ajax/core/ajfeat.c.

All features are loaded into RAM, and also the sequence information,
when in the file. Not only for GFF3, but for all feature data types.

On regular desktops this is a problem when loading a larger set,
and/or multiple genomes.
Is it the idea to load big data and store it in a SQL database? I.e. should I recommend handling it outside EMBOSS? Pj. From pmr at ebi.ac.uk Thu Aug 12 06:52:23 2010 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 12 Aug 2010 11:52:23 +0100 Subject: [emboss-dev] GFF3 in EMBOSS In-Reply-To: <20100812103335.GA28925@thebird.nl> References: <20100812103335.GA28925@thebird.nl> Message-ID: <4C63D267.3070904@ebi.ac.uk> Hi Pjotr, On 12/08/10 11:33, Pjotr Prins wrote: > I am having a look at the GFF3 implementation in EMBOSS - mostly > ajax/core/ajfeat.c. > > All features are loaded into RAM, and also the sequence information, > when in the file. Not only for GFF3, but for all feature data types. > > On regular desktops this is a problem when loading a larger set, > and/or multiple genomes. > > Is it the idea to load big data and store it in a SQL database? I.e. > should I recommend handling it outside EMBOSS? We are looking into storing data structures for large datasets on disk - not only for features but also for next-generation mapped reads. Can you give an example of the input you are trying to handle? I hope to explore these issues at the GMOD meeting in Cambridge (UK) soon. regards, Peter Rice From pjotr.public78 at thebird.nl Thu Aug 12 07:57:55 2010 From: pjotr.public78 at thebird.nl (Pjotr Prins) Date: Thu, 12 Aug 2010 13:57:55 +0200 Subject: [emboss-dev] GFF3 in EMBOSS In-Reply-To: <4C63D267.3070904@ebi.ac.uk> References: <20100812103335.GA28925@thebird.nl> <4C63D267.3070904@ebi.ac.uk> Message-ID: <20100812115755.GA30047@thebird.nl> On Thu, Aug 12, 2010 at 11:52:23AM +0100, Peter Rice wrote: > We are looking into storing data structures for large datasets on disk - > not only for features but also for next-generation mapped reads. That is a great idea! The first quick-win is not to load sequence data in memory, but fetch it on demand using a seek index. Something that BioPerl has. > Can you give an example of the input you are trying to handle? 
I am dealing with Worms - Wormbase uses gff3 for some worms. EMBOSS, is already memory efficient, compared to BioRuby/Python/Perl - so I am thinking of a BioLib mapping. A writeup is here: http://thebird.nl/biolib/Adding_BioLib_EMBOSS_GFF3_Support.html > I hope to explore these issues at the GMOD meeting in Cambridge (UK) soon. It makes sense for (desktop) genome browsers, for one. Pj. From pjotr.public78 at thebird.nl Thu Aug 12 17:24:21 2010 From: pjotr.public78 at thebird.nl (Pjotr Prins) Date: Thu, 12 Aug 2010 23:24:21 +0200 Subject: [emboss-dev] Embassy in Debian-med Message-ID: <20100812212421.GA3151@thebird.nl> Debian-med has problems with the Embassy packages, as they fail to build against EMBOSS-latest. Andreas Tille writes: > To put the emboss and embassy packages in consistency in Squeeze, here are > possible solutions: > > - Remove the embassy-* packages from testing. > - Upload emboss 6.2 to testing-proposed-updates. > - Upgrade embassy-* packages with the latest upstream version, that builds > against emboss 6.3, and let emboss 6.3 in testing. what is the priority of supporting the Embassy packages? Are they lesser citizens in EMBOSS? Or can we expect resolution in the near future? Pj. From biopython at maubp.freeserve.co.uk Fri Aug 13 05:40:35 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 13 Aug 2010 10:40:35 +0100 Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM In-Reply-To: References: <4C3EED02.7080507@ebi.ac.uk> <4C57C4D9.1010805@ebi.ac.uk> Message-ID: On Tue, Aug 3, 2010 at 9:12 AM, Peter wrote: > On Tue, Aug 3, 2010 at 8:27 AM, Peter Rice wrote: >>> >>> Just for the record, EMBOSS 6.3.1 does not append anything to the >>> read names, meaning paired end reads cannot be distinguished if >>> output as FASTA or FASTQ. >>> >>> I'm not sure my idea of appending /1 or /2 for paired reads is the >>> best solution (especially since there are other naming schemes >>> out there like _f and _r as suffixes). 
Nevertheless, it seems like a >>> practical solution. Would including a slash character within a >>> sequence name cause problems in EMBOSS (a potential issue >>> you raised earlier)? >> >> The /1 and /2 would cause horrible problems. The sequence names are >> used to generate default output file names so a '/' would have to be >> removed or converted, most likely to _1 and _2 > > Oh :( > > I thought they might cause confusion with slashes in filenames, but > yes, they can't be used in filenames can they. Thinking about this more, I don't think there is a problem. There are two main reasons. First, with SAM/BAM/FASTQ files there are typically so many reads that you would never want to create one file per read. Also, there are plenty of other file formats where the record ID can or indeed usually does contain a slash - specifically PFAM/Stockholm format alignments from PFAM where the ID is name/start-stop, e.g. http://emboss.sourceforge.net/docs/themes/seqformats/pfam Surely EMBOSS has already got a mechanism for dealing with slashes in IDs when asked to use the IDs as filenames? I think I mentioned storing the original read name in the tags had been suggested on the samtools-devel list. In the latest draft of the SAM/BAM spec, a new tag FS (fragment name suffix) has been proposed, so that the original read names could be recovered by taking the fragment name (the ID in SAM/BAM) and appending this suffix. See this thread earlier in August 2010, [Samtools-devel] Recording original read name in tags http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTimg%2BvNU3CkW-63Mmug-Qt0md183dyJ_nRqva1rv%40mail.gmail.com&forum_name=samtools-devel Finally, also on the samtools-help list, it was pointed out that the hydra-sv project has a bamToFastq tool, see thread: [Samtools-help] BAM to fastq how? 
http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTinBnm%2B8V8bXD_ii9jn8-O%2B0_N1MgWBxBFnqm2Mk%40mail.gmail.com&forum_name=samtools-help
and http://code.google.com/p/hydra-sv/

Peter C.

From gbottu at vub.ac.be Tue Aug 17 14:43:58 2010
From: gbottu at vub.ac.be (Guy Bottu)
Date: Tue, 17 Aug 2010 20:43:58 +0200
Subject: [emboss-dev] computed maximum forbidden in ACD ?
Message-ID: <4C6AD86E.7070909@vub.ac.be>

Dear Peter and Alan,

I was doing some development on wrappers4EMBOSS when I noticed the
following. The file blast.acd contains:

integer: listsize [
  information: "Show only the n best scoring sequences that satisfy E() cutoff"
  default: "500"
  minimum: "0"
]

integer: align [
  information: "Show only alignments for the n first sequences"
  default: "@(@($(listsize) < 250 ) ? $(listsize) : 250)"
  expected: "250"
  minimum: "0"
  maximum: "$(listsize)"      (this is line 100)
  valid: "Integer 0 or more, but not < listsize"
]

When I run blast I get:

Error: File /OPT/emboss63/share/EMBOSS/acd/blast.acd line 100:
(wordsize) Attribute failrange: required with any calculated min/max

I am as good as certain that this behaviour has appeared with EMBOSS
version 6.3.0. In the past it was allowed to set a "maximum" that
depended on the choice of another parameter, and we can see that it
could occasionally make sense, but this seems from now on forbidden.
Is this a bug or a feature?

Regards,
Guy Bottu

From pmr at ebi.ac.uk Tue Aug 17 16:22:58 2010
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 17 Aug 2010 21:22:58 +0100
Subject: [emboss-dev] computed maximum forbidden in ACD ?
In-Reply-To: <4C6AD86E.7070909@vub.ac.be>
References: <4C6AD86E.7070909@vub.ac.be>
Message-ID: <4C6AEFA2.8000201@ebi.ac.uk>

Dear Guy,

> When I run blast I get :
>
> Error: File /OPT/emboss63/share/EMBOSS/acd/blast.acd line 100:
> (wordsize) Attribute failrange: required with any calculated min/max
>
> I am as good as certain that this behaviour has appeared with EMBOSS
> version 6.3.0.
> In the past it was allowed to set a "maximum" that
> depended on the choice of another parameter, and we can see that it
> could occasionally make sense, but this seems from now on forbidden. Is
> this a bug or a feature ?

It is a fix for a feature.

With calculated maximum or minimum values (e.g. depending on a window
size) it was possible for the maximum to be less than the minimum. In
such cases we could logically use either the maximum or the minimum -
and some applications were found to require one choice, others needed
the other.

After some discussion we decided to add extra attributes to control the
behaviour.

You can add three new attributes:

trueminimum: "N"    (if max/min overlap, use minimum)
failrange: "Y"      (Fail if (calculated) ranges overlap)
rangemessage: ""    (Failure message if (calculated) ranges overlap)

A common solution (good for your case) is:

failrange: "N" trueminimum: "Y"

By adding the error messages we made sure that an ACD file with a
calculated range will give messages to the developer suggesting missing
attributes to be added.

If you set failrange: "Y" you need to define a message explaining to the
end user why the range might fail.

If you set failrange: "N" the calculated range is accepted, but you also
need to set trueminimum to say whether you want the minimum value to
apply (usual, to avoid getting negative values) or the maximum, to avoid
values going too large.

So, you get the "failrange is required" message. When you set that you
get another message (depending on whether it is true or false) telling
you to set one of the other attributes as well.

Hope this makes it clearer!
Peter From biopython at maubp.freeserve.co.uk Mon Aug 2 11:37:35 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 2 Aug 2010 12:37:35 +0100 Subject: [emboss-dev] Bug report and patch - BAM quality score reading Message-ID: Hi all, Since I had several queries about what EMBOSS would do with SAM/BAM, http://lists.open-bio.org/pipermail/emboss-dev/2010-July/000656.html I decided to try it and see. I believe I have found a bug with reading quality scores from BAM files in EMBOSS 6.3.1 $ seqret -version EMBOSS:6.3.1 I have been using a small pair of SAM and BAM files, originally downloaded as a SAM file with reference FASTA sequence from the pysam project, which I converted to BAM using samtools as they specify in their readme file http://code.google.com/p/pysam/source/browse/#hg/tests e.g. curl -O http://pysam.googlecode.com/hg/tests/ex1.fa curl -O http://pysam.googlecode.com/hg/tests/ex1.sam.gz gunzip ex1.sam.gz samtools faidx ex1.fa samtools import ex1.fa.fai ex1.sam ex1.bam If we look at the first two reads in the SAM file, notice their quality strings: $ head ex1.sam EAS56_57:6:190:289:82 69 chr1 100 0 * = 100 0 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; MF:i:192 EAS56_57:6:190:289:82 137 chr1 100 73 35M = 100 0 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2; MF:i:64 Aq:i:0 NM:i:0 UQ:i:0 H0:i:1 H1:i:0 ... Now let's ask EMBOSS seqret to convert from SAM to Sanger FASTQ, $ seqret -sformat sam -osformat fastq-sanger ex1.sam -stdout -auto | head @EAS56_57:6:190:289:82 chr1 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA + <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<; @EAS56_57:6:190:289:82 chr1 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC + <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2; @EAS51_64:3:190:727:308 chr1 GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG The quality strings agree with the SAM file, good. 
Now let's ask EMBOSS seqret to convert from BAM to Sanger FASTQ,

$ seqret -sformat bam -osformat fastq-sanger ex1.bam -stdout -auto | head
@EAS56_57:6:190:289:82
CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA
+
]]]X]]]\]]]]]]]]Y\\]X\U]\]\\\\\ZU]\
@EAS56_57:6:190:289:82
AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC
+
]]]]]]\]]]]]]]]]]\]]\]]]]\Y]W\Z\\S\
@EAS51_64:3:190:727:308
GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG

The quality strings differ, this is bad.

In the SAM file these two reads have quality strings starting with "<",
which is ASCII 60, meaning PHRED 60-33 = 27. In the funny BAM to Sanger
FASTQ conversion, EMBOSS has used "]" which is ASCII 93, giving PHRED
93-33 = 60, i.e. 33 more than it should be. I suspected that the EMBOSS
code for reading BAM files was wrongly applying a 33 offset to the
quality scores. In BAM files the scores are simply encoded directly as
uint8_t without any offset.

Looking at the source code, file ajax/core/ajseqread.c we have:

    for(i=0; i < (ajuint) c->l_qseq; i++)
    {
        ajFmtPrintAppS(&qualstr, " %02x", 33+d[dpos]);
        thys->Accuracy[i] = (float) (33 + d[dpos++]);
    }

The creation of a quality string appears to be for debug only, and here
adding 33 to make the scores printable ASCII using the Sanger FASTQ
encoding makes sense. However, adding the offset to the accuracy looks
like an oversight. How about:

    for(i=0; i < (ajuint) c->l_qseq; i++)
    {
        ajFmtPrintAppS(&qualstr, " %02x", 33+d[dpos]);
        thys->Accuracy[i] = (float) d[dpos++];
    }

With this tiny change, I get the expected Sanger FASTQ output from a BAM
file using seqret.

Regards,

Peter C.
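
The offset arithmetic in the report above can be sketched in a few lines of standalone C. This is only an illustration under stated assumptions: the helper names (bam_qual_to_sanger, sanger_char_to_phred) are hypothetical and are not AJAX functions, and the input is assumed to hold valid PHRED scores in the range 0 to 93. It shows that the +33 offset belongs only where a printable Sanger FASTQ character is produced, while the stored score stays the raw BAM byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Offset used by the Sanger FASTQ encoding (PHRED + 33). */
enum { SANGER_OFFSET = 33 };

/* Hypothetical helper, not part of AJAX.
 * Convert n raw BAM quality bytes (plain PHRED scores, no offset) into a
 * NUL-terminated Sanger FASTQ quality string.
 * Assumes every score is in 0..93 and that out can hold n + 1 chars. */
static void bam_qual_to_sanger(const uint8_t *qual, size_t n, char *out)
{
    size_t i;

    for (i = 0; i < n; i++)
        out[i] = (char) (qual[i] + SANGER_OFFSET);  /* offset applied on output only */
    out[n] = '\0';
}

/* Hypothetical helper, not part of AJAX.
 * Recover the PHRED score from one Sanger FASTQ quality character. */
static int sanger_char_to_phred(char c)
{
    return (int) c - SANGER_OFFSET;
}
```

With these helpers a raw BAM score of 27 is written as "<", matching the SAM file, whereas "]" decodes to 60, the value seen when the offset is applied twice.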
From biopython at maubp.freeserve.co.uk Mon Aug 2 13:55:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 2 Aug 2010 14:55:10 +0100 Subject: [emboss-dev] Bug report and patch - SAM parser and negative ISIZE Message-ID: Hi again, This is another bug report for EMBOSS 6.3.1 (compiled on Mac OS X 10.6.4 Snow Leopard) using the same example files as earlier, see: http://lists.open-bio.org/pipermail/emboss-dev/2010-August/thread.html For the purposes of a concise example, I'm using seqret to convert SAM/BAM to FASTA so as to count the number of reads. See also: http://lists.open-bio.org/pipermail/emboss/2010-July/003951.html I believe this SAM and BAM file both contain 3270 reads, but EMBOSS is having trouble with the SAM file: $ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | grep -c "^>" 3270 $ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | grep -c "^>" 41 If we look at the output, $ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto >EAS56_57:6:190:289:82 chr1 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA >EAS56_57:6:190:289:82 chr1 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC ... >EAS114_28:6:155:68:326 chr1 CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAA >EAS188_7:7:19:886:279 chr1 CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTA Looking at the SAM file, I guessed EMBOSS doesn't like a negative ISIZE field in the next record, EAS54_61:4:143:69:578, from the SAM file we have: ... 
EAS114_28:6:155:68:326 99 chr1 182 99 36M = 332 186 CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAA <<<<<<<<<<<<<<<<<<<<<<<<<<<<:<<<<<<< MF:i:18 Aq:i:76 NM:i:0 UQ:i:0 H0:i:1 H1:i:0
EAS188_7:7:19:886:279 99 chr1 182 99 35M = 337 190 CCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTA <9<<<<<<<<<<<<6<28:<<85<<<<<2<;<9<< MF:i:18 Aq:i:67 NM:i:0 UQ:i:0 H0:i:1 H1:i:0
EAS54_61:4:143:69:578 147 chr1 185 98 35M = 36 -184 ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG 222&<21<<<<12<7<01<<<<<0<<<<<<<20<< MF:i:18 Aq:i:35 NM:i:1 UQ:i:5 H0:i:1 H1:i:0
EAS54_71:4:13:981:659 181 chr1 187 0 * = 188 0 CGGGACAATGGACGAGGTAAACCGCACATTGACAA +)---3&&3&--+0)&+3:7777).333:<06<<< MF:i:192
...

Looking at the source code, currently EMBOSS is wrongly assuming an
unsigned integer will be used. This is not true, the spec allows for a
negative ISIZE. I replaced this code in ajax/core/ajseqread.c

    ajStrTokenNextParseNoskip(&handle,&token); /* ISIZE */
    ajDebug("ISIZE '%S'\n", token);
    if(ajStrGetLen(token)){
        if(!ajStrToUint(token, &flags))
            return ajFalse;
    }

with:

    ajStrTokenNextParseNoskip(&handle,&token); /* ISIZE */
    ajDebug("ISIZE '%S'\n", token);
    if(ajStrGetLen(token)){
        if(!ajStrToInt(token, &flags))
            return ajFalse;
    }

(i.e. Uint to Int), and now I get the correct read count.

A related question is why did this error condition not give any error
message to stdout or stderr?

Regards,

Peter C.

From pmr at ebi.ac.uk Mon Aug 2 15:42:00 2010
From: pmr at ebi.ac.uk (Peter Rice)
Date: Mon, 02 Aug 2010 16:42:00 +0100
Subject: [emboss-dev] Bug reports and patches: BAM quality, SAM negative ISIZE
In-Reply-To:
References:
Message-ID: <4C56E748.7010803@ebi.ac.uk>

On 02/08/10 14:55, Peter C. wrote:

> In the funny BAM to Sanger FASTQ conversion, EMBOSS has used
> "]" which is ASCII 93, giving PHRED 93-33 = 60. i.e. 33 more than it
> should be. I suspected that the EMBOSS code for reading BAM files
> was wrongly applying a 33 offset to the quality scores. In BAM files
> the scores are simply encoded directly as uint8_t without any offset.
Thanks for spotting that. We will make a patch with that fix in.

> Looking at the SAM file, I guessed EMBOSS doesn't like a negative
> ISIZE field in the next record, EAS54_61:4:143:69:578, .........
>
> Looking at the source code, currently EMBOSS is wrongly assuming
> an unsigned integer will be used. This is not true, the spec allows for
> a negative ISIZE. I replaced this code in ajax/core/ajseqread.c

Thanks for the fix. We will add that to the patch.

> A related question is why did this error condition not give any
> error message to stdout or stderr?

This appears to be a general issue with reading unknown and known
formats. We will fix it so that error messages are turned on for this
failure condition.

Many thanks for the bug reports - and the fixes!!

Peter R.

From biopython at maubp.freeserve.co.uk Mon Aug 2 15:52:56 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 16:52:56 +0100
Subject: [emboss-dev] Bug reports and patches: BAM quality, SAM negative ISIZE
In-Reply-To: <4C56E748.7010803@ebi.ac.uk>
References: <4C56E748.7010803@ebi.ac.uk>
Message-ID:

On Mon, Aug 2, 2010 at 4:42 PM, Peter Rice wrote:
>
> On 02/08/10 14:55, Peter C. wrote:
>
>> In the funny BAM to Sanger FASTQ conversion, EMBOSS has used
>> "]" which is ASCII 93, giving PHRED 93-33 = 60. i.e. 33 more than it
>> should be. I suspected that the EMBOSS code for reading BAM files
>> was wrongly applying a 33 offset to the quality scores. In BAM files
>> the scores are simply encoded directly as uint8_t without any offset.
>
> Thanks for spotting that. We will make a patch with that fix in.
>
>> Looking at the SAM file, I guessed EMBOSS doesn't like a negative
>> ISIZE field in the next record, EAS54_61:4:143:69:578, .........
>>
>> Looking at the source code, currently EMBOSS is wrongly assuming
>> an unsigned integer will be used. This is not true, the spec allows for
>> a negative ISIZE. I replaced this code in ajax/core/ajseqread.c
>
> Thanks for the fix.
We will add that to the patch. > Great. Are you still issuing patches which don't affect the version number? I'd prefer to have an easy way to know if a given install of EMBOSS has certain fixes, and a point release seems quite straightforward from an outsider's perspective. P.S. Expect a couple more reports to follow... so don't rush a patch or point release out just yet ;) >> A related question is why did this error condition not give any >> error message to stdout or stderr? > > This appears to be a general issue with reading unknown and known formats. > We will fix it so that error messages are turned on for this failure > condition. Good :) > Many thanks for the bug reports - and the fixes!! > No problem, Peter From biopython at maubp.freeserve.co.uk Mon Aug 2 16:26:07 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 2 Aug 2010 17:26:07 +0100 Subject: [emboss-dev] Inconsistency in SAM vs BAM read description Message-ID: Hi all, After patching the following two issues, http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000667.html http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000668.html there is a noticeable difference in the output from the SAM and BAM parsers in the description of the reads: $ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | head >EAS56_57:6:190:289:82 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA >EAS56_57:6:190:289:82 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC >EAS51_64:3:190:727:308 GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG >EAS112_34:7:141:80:875 AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA >EAS219_FC30151:3:40:1128:1940 CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC $ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | head >EAS56_57:6:190:289:82 chr1 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA >EAS56_57:6:190:289:82 chr1 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC >EAS51_64:3:190:727:308 chr1 GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG >EAS112_34:7:141:80:875 chr1 AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA >EAS219_FC30151:3:40:1128:1940 chr1 
CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC As you can see from the above example (using files described in the linked threads), when parsing SAM files if the read is mapped then the reference sequence name is used as the description. This seems like a sensible and useful thing to do. However, when parsing BAM files this is not currently being done. Having the SAM and BAM parser produce identical results is very useful for testing purposes (e.g. running diff on their output as FASTQ format), so I would like the BAM parser to do the same. Looking at the source, function seqReadSam in ajax/core/ajseqread.c does this with the reference name string: ajStrTokenNextParseNoskip(&handle,&token); /* RNAME */ ajDebug("RNAME '%S'\n", token); if(ajStrGetLen(token)) seqAccSave(thys, token); Therefore the BAM parser needs to do something similar, first mapping the integer rID (reference sequence ID) to the array of reference names from the BAM header. I got as far as a partial solution but it only worked on the first read. The problem is that although header variable ntargets is stored as bamdata->Nref it does not appear that the array of strings targetname is kept (likewise the array of integers targetlen but we don't care about that here). Regards, Peter C. 
From biopython at maubp.freeserve.co.uk Mon Aug 2 16:42:07 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 2 Aug 2010 17:42:07 +0100 Subject: [emboss-dev] Inconsistency in SAM vs BAM read description In-Reply-To: References: Message-ID: On Mon, Aug 2, 2010 at 5:26 PM, Peter wrote: > Hi all, > > After patching the following two issues, > http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000667.html > http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000668.html > there is a noticeable difference in the output from the SAM and BAM > parsers in the description of the reads: > > $ seqret -sformat bam -osformat fasta ex1.bam -stdout -auto | head >>EAS56_57:6:190:289:82 > CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA >>EAS56_57:6:190:289:82 > AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC >>EAS51_64:3:190:727:308 > GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG >>EAS112_34:7:141:80:875 > AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA >>EAS219_FC30151:3:40:1128:1940 > CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC > > > $ seqret -sformat sam -osformat fasta ex1.sam -stdout -auto | head >>EAS56_57:6:190:289:82 chr1 > CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA >>EAS56_57:6:190:289:82 chr1 > AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC >>EAS51_64:3:190:727:308 chr1 > GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG >>EAS112_34:7:141:80:875 chr1 > AGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAA >>EAS219_FC30151:3:40:1128:1940 chr1 > CCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACC > > As you can see from the above example (using files described in > the linked threads), when parsing SAM files if the read is mapped > then the reference sequence name is used as the description. > This seems like a sensible and useful thing to do. However, when > parsing BAM files this is not currently being done. > > Having the SAM and BAM parser produce identical results is > very useful for testing purposes (e.g. running diff on their output > as FASTQ format), so I would like the BAM parser to do the same. 
>
> Looking at the source, function seqReadSam in ajax/core/ajseqread.c
> does this with the reference name string:
>
>    ajStrTokenNextParseNoskip(&handle,&token); /* RNAME */
>    ajDebug("RNAME '%S'\n", token);
>    if(ajStrGetLen(token))
>        seqAccSave(thys, token);
>

Just as a postscript, having failed to enhance the BAM parser, for short
term testing I'm just commenting out the above two lines of the SAM
parser. With that trivial change, the FASTA and FASTQ output from both
the SAM and BAM files agrees 100% (as you would expect).

Peter C.

From biopython at maubp.freeserve.co.uk Mon Aug 2 17:41:25 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 2 Aug 2010 18:41:25 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To:
References: <4C3EED02.7080507@ebi.ac.uk>
Message-ID:

On Thu, Jul 15, 2010 at 12:36 PM, Peter wrote:
> On Thu, Jul 15, 2010 at 12:12 PM, Peter Rice wrote:
>>
>>> What do you do about naming for paired reads? I was appending
>>> /1 or /2 to match the Illumina convention. Doing nothing means
>>> the paired reads will have the same names.
>>
>> Not addressed yet - let's look into a common approach though.
>> We would also have to look into what the '/' character does to EMBOSS's
>> handling of sequence names.
>
> My rationale for appending the /1 and /2 is that in a typical workflow
> you might take Illumina paired end data as FASTQ and map it onto
> a genome with BWA giving SAM/BAM. You might then want to reverse
> this (e.g. if given a SAM/BAM file by a collaborator, and you want to
> try an alternative mapping tool or reference genome, first you must
> recover the raw reads again, e.g. as FASTQ files).

Just for the record, EMBOSS 6.3.1 does not append anything to the read
names, meaning paired end reads cannot be distinguished if output as
FASTA or FASTQ.
I'm not sure my idea of appending /1 or /2 for paired reads is the best solution (especially since there are other naming schemes out there like _f and _r as suffixes). Nevertheless, it seems like a practical solution. Would including a slash character within a sequence name cause problems in EMBOSS (a potential issue you raised earlier)? Also, and this may be a bug, on output as unaligned SAM (and I assume also for unaligned BAM), the fact that a read is paired and the information about if is it the first or second read is lost. The FLAG is just set to 4, meaning unmapped. e.g. seqret -sformat bam -osformat sam ex1.bam -filter or: seqret -sformat sam -osformat sam ex1.sam -filter >>> What do you do about the strand issue? SAM/BAM stored reads >>> which map onto the reverse strand in reverse complement. If >>> you want to get back to the original orientation for output as >>> FASTQ you must apply the reverse complement (plus reverse >>> the quality scores too of course). >> >> So far we read as sequences. Reading as mapped reads (very large >> alignments) is planned for the very near future so it can appear in the >> next release. > > Given the use case of going from (aligned) SAM/BAM back to the > original FASTQ, for a round trip you *must* undo the reverse > complementation. This is important even for single reads, as quality > scores tend to trail off in the (original) read direction so some algorithms > may treat a reverse version of the read differently. To clarify, EMBOSS 6.3.1 does not flip reads mapped to the reverse strand: http://lists.open-bio.org/pipermail/emboss-dev/2010-August/000667.html Regards, Peter C. From pmr at ebi.ac.uk Tue Aug 3 07:27:21 2010 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 03 Aug 2010 08:27:21 +0100 Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM In-Reply-To: References: <4C3EED02.7080507@ebi.ac.uk> Message-ID: <4C57C4D9.1010805@ebi.ac.uk> On 08/02/10 18:41, Peter C. 
wrote: > On Thu, Jul 15, 2010 at 12:36 PM, Peter wrote: >> On Thu, Jul 15, 2010 at 12:12 PM, Peter Rice wrote: >>> >>>> What do you do about naming for paired reads? I was appending >>>> /1 or /2 to match the Illumina convention. Doing nothing means >>>> the paired reads will have the same names. >>> >>> Not addressed yet - let's look into a common approach though. >>> We would also have to lok into what the '/' character does to EMBOSS's >>> handling of sequence names. >> >> My rational for appending the /1 and /2 is that in a typical workflow >> you might take Illumina paired end data as FASTQ and map it onto >> a genome with BWA giving SAM/BAM. You might then want to reverse >> this (e.g. if given a SAM/BAM file by a collaborator, and you want to >> try an alternative mapping tool or reference genome, first you must >> recover the raw reads again, e.g. as FASTQ files). > > Just for the record, EMBOSS 6.3.1 does not append anything to the > read names, meaning paired end reads cannot be distinguished if > output as FASTA or FASTQ. > > I'm not sure my idea of appending /1 or /2 for paired reads is the > best solution (especially since there are other naming schemes > out there like _f and _r as suffixes). Nevertheless, it seems like a > practical solution. Would including a slash character within a > sequence name cause problems in EMBOSS (a potential issue > you raised earlier)? The /1 and /2 would cause horrible problems. The sequence names are used to generate default output file names so a '/' would have to be removed or converted, most likely to _1 and _2 _f or _r as a suffix is much better ... but should we always assume these meanings? Should we add a command-line switch for paired read data? Should we only do something for fastq, sam and bam (or other NGS formats?) It is a mystery to me how paired reads came to have the same name. When we first used them at EMBL for the Human HPRT locus we made sure to add an "r" suffix to the reverse reads.... 
but then, as we used the GCG assembly system, we were forced to have a unique name :-) > Also, and this may be a bug, on output as unaligned SAM (and I > assume also for unaligned BAM), the fact that a read is paired and > the information about if is it the first or second read is lost. The > FLAG is just set to 4, meaning unmapped. e.g. > > seqret -sformat bam -osformat sam ex1.bam -filter Hmmm ... this kind of thing is specific to SAM-BAM conversions, as other formats will lose it unless we find some way to preserve the detail. We will take a look at what we can keep between these formats (we do make similar efforts between EMBL and GenBank formats) >> Given the use case of going from (aligned) SAM/BAM back to the >> original FASTQ, for a round trip you *must* undo the reverse >> complementation. This is important even for single reads, as quality >> scores tend to trail off in the (original) read direction so some algorithms >> may treat a reverse version of the read differently. We will look into that one too. Many thanks for the suggestions Peter Rice From biopython at maubp.freeserve.co.uk Tue Aug 3 08:12:27 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 3 Aug 2010 09:12:27 +0100 Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM In-Reply-To: <4C57C4D9.1010805@ebi.ac.uk> References: <4C3EED02.7080507@ebi.ac.uk> <4C57C4D9.1010805@ebi.ac.uk> Message-ID: On Tue, Aug 3, 2010 at 8:27 AM, Peter Rice wrote: >> >> Just for the record, EMBOSS 6.3.1 does not append anything to the >> read names, meaning paired end reads cannot be distinguished if >> output as FASTA or FASTQ. >> >> I'm not sure my idea of appending /1 or /2 for paired reads is the >> best solution (especially since there are other naming schemes >> out there like _f and _r as suffixes). Nevertheless, it seems like a >> practical solution. Would including a slash character within a >> sequence name cause problems in EMBOSS (a potential issue >> you raised earlier)? 
>
> The /1 and /2 would cause horrible problems. The sequence names are
> used to generate default output file names so a '/' would have to be
> removed or converted, most likely to _1 and _2

Oh :(

I thought they might cause confusion with slashes in filenames, but
yes, they can't be used in filenames can they.

> _f or _r as a suffix is much better ... but should we always assume these
> meanings? Should we add a command-line switch for paired read data?

My understanding is there are multiple different naming conventions, so
whatever we/you do it won't please everyone. What would help here is if
the original read name were to be recorded in the SAM/BAM tags, as I
think was suggested last month or so on the samtools-devel mailing list.
However, that would come with a filesize penalty, and won't help with
old files.

> Should we only do something for fastq, sam and bam (or other NGS
> formats?)

And FASTA too, not all assemblers use quality scores. Also QUAL files if
EMBOSS were to support them.

> It is a mystery to me how paired reads came to have the same name.
> When we first used them at EMBL for the Human HPRT locus we made
> sure to add an "r" suffix to the reverse reads.... but then, as we used
> the GCG assembly system, we were forced to have a unique name :-)

With Solexa/Illumina data, pairs got the same name bar a suffix. Other
sequencing centers have also followed this pattern, for example Sanger
sequencing with suffixes of .f and .r. I guess in order to clearly group
paired reads, and save a little space, for SAM/BAM they opted to store a
single name and use the FLAG field to record whether it is the forward
or reverse read.

Note that with strobed reads and the like coming "soon", rather than
just two reads in a pair, there could be many child reads for a single
fragment. Even with classic Sanger sequencing of a PCR product you might
end up with multiple reads (e.g.
two forward reads, one reverse) and if and how to handle this via an extension to SAM/BAM was also raised. Some pipelines may even use the same name for a forward/reverse pair, or ignore the names. Velvet for example just takes its paired data as interleaved files (forward then reverse reads one after the other). >> Also, and this may be a bug, on output as unaligned SAM (and I >> assume also for unaligned BAM), the fact that a read is paired and >> the information about if is it the first or second read is lost. The >> FLAG is just set to 4, meaning unmapped. e.g. >> >> seqret -sformat bam -osformat sam ex1.bam -filter > > Hmmm ... this kind of thing is specific to SAM-BAM conversions, as other > formats will lose it unless we find some way to preserve the detail. > > We will take a look at what we can keep between these formats (we do > make similar efforts between EMBL and GenBank formats) I think it would be useful to track the three bits for paired, read one, and read two. From memory, all the other bits of the FLAG are only applicable to mapped reads. Of course, this overlaps with the naming issue above. >>> Given the use case of going from (aligned) SAM/BAM back to the >>> original FASTQ, for a round trip you *must* undo the reverse >>> complementation. This is important even for single reads, as quality >>> scores tend to trail off in the (original) read direction so some >>> algorithms may treat a reverse version of the read differently. > > We will look into that one too. > Thanks. > Many thanks for the suggestions > No problem. Peter C. From ajb at ebi.ac.uk Fri Aug 6 10:53:18 2010 From: ajb at ebi.ac.uk (ajb at ebi.ac.uk) Date: Fri, 6 Aug 2010 11:53:18 +0100 (BST) Subject: [emboss-dev] Configuration in flux Message-ID: <34847.86.26.12.63.1281091998.squirrel@webmail.ebi.ac.uk> Dear developers, The EMBOSS configuration in CVS is in a state of flux at the moment. 
The major changes over the last 48 hours have been to make use of autoheader and also to clear out any system-specific libtool files. The upshot is that, from a fresh CVS checkout, the configuration should just amount to: autoreconf -fi ./configure [options] The above should mean that the configuration is relatively independent of the version of libtool you have installed. Note, however, that there is now a prerequisite for an autoconf version of at least 2.59. The use of autoheader means that the compilation lines are significantly shorter. There will be further configuration changes over the next few weeks but nothing quite so fundamental. Alan From pjotr.public78 at thebird.nl Thu Aug 12 10:12:40 2010 From: pjotr.public78 at thebird.nl (Pjotr Prins) Date: Thu, 12 Aug 2010 12:12:40 +0200 Subject: [emboss-dev] Unreachable code in featReadGff3 In-Reply-To: References: Message-ID: <20100812101240.GA28807@thebird.nl> Something funny in the function featReadGff3, it looks like the second else if(ajRegExec(Gff3Regexregion,line)) is unreachable code: if(ajRegExec(Gff3Regexblankline, line)) version = 3.0; else if(ajRegExec(Gff3Regexversion,line)) { verstr = ajStrNew(); ajRegSubI(Gff3Regexversion, 1, &verstr); ajStrToFloat(verstr, &version); ajStrDel(&verstr); if(version < 3.0) { ajStrDel(&line); return ajFalse; } } else if(ajRegExec(Gff3Regexregion,line)) { start = ajStrNew(); end = ajStrNew(); (...) From pjotr.public78 at thebird.nl Thu Aug 12 10:33:35 2010 From: pjotr.public78 at thebird.nl (Pjotr Prins) Date: Thu, 12 Aug 2010 12:33:35 +0200 Subject: [emboss-dev] GFF3 in EMBOSS Message-ID: <20100812103335.GA28925@thebird.nl> I am having a look at the GFF3 implementation in EMBOSS - mostly ajax/core/ajfeat.c. All features are loaded into RAM, and also the sequence information, when in the file. Not only for GFF3, but for all feature data types. On regular desktops this is a problem when loading a larger set, and/or multiple genomes. 
Is it the idea to load big data and store it in a SQL database? I.e. should I recommend handling it outside EMBOSS? Pj. From pmr at ebi.ac.uk Thu Aug 12 10:52:23 2010 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 12 Aug 2010 11:52:23 +0100 Subject: [emboss-dev] GFF3 in EMBOSS In-Reply-To: <20100812103335.GA28925@thebird.nl> References: <20100812103335.GA28925@thebird.nl> Message-ID: <4C63D267.3070904@ebi.ac.uk> Hi Pjotr, On 12/08/10 11:33, Pjotr Prins wrote: > I am having a look at the GFF3 implementation in EMBOSS - mostly > ajax/core/ajfeat.c. > > All features are loaded into RAM, and also the sequence information, > when in the file. Not only for GFF3, but for all feature data types. > > On regular desktops this is a problem when loading a larger set, > and/or multiple genomes. > > Is it the idea to load big data and store it in a SQL database? I.e. > should I recommend handling it outside EMBOSS? We are looking into storing data structures for large datasets on disk - not only for features but also for next-generation mapped reads. Can you give an example of the input you are trying to handle? I hope to explore these issues at the GMOD meeting in Cambridge (UK) soon. regards, Peter Rice From pjotr.public78 at thebird.nl Thu Aug 12 11:57:55 2010 From: pjotr.public78 at thebird.nl (Pjotr Prins) Date: Thu, 12 Aug 2010 13:57:55 +0200 Subject: [emboss-dev] GFF3 in EMBOSS In-Reply-To: <4C63D267.3070904@ebi.ac.uk> References: <20100812103335.GA28925@thebird.nl> <4C63D267.3070904@ebi.ac.uk> Message-ID: <20100812115755.GA30047@thebird.nl> On Thu, Aug 12, 2010 at 11:52:23AM +0100, Peter Rice wrote: > We are looking into storing data structures for large datasets on disk - > not only for features but also for next-generation mapped reads. That is a great idea! The first quick-win is not to load sequence data in memory, but fetch it on demand using a seek index. Something that BioPerl has. > Can you give an example of the input you are trying to handle? 
I am dealing with Worms - Wormbase uses gff3 for some worms. EMBOSS is already memory efficient compared to BioRuby/Python/Perl - so I am thinking of a BioLib mapping. A writeup is here: http://thebird.nl/biolib/Adding_BioLib_EMBOSS_GFF3_Support.html > I hope to explore these issues at the GMOD meeting in Cambridge (UK) soon. It makes sense for (desktop) genome browsers, for one. Pj. From pjotr.public78 at thebird.nl Thu Aug 12 21:24:21 2010 From: pjotr.public78 at thebird.nl (Pjotr Prins) Date: Thu, 12 Aug 2010 23:24:21 +0200 Subject: [emboss-dev] Embassy in Debian-med Message-ID: <20100812212421.GA3151@thebird.nl> Debian-med has problems with the Embassy packages, as they fail to build against EMBOSS-latest. Andreas Tille writes: > To put the emboss and embassy packages in consistency in Squeeze, here are > possible solutions: > > - Remove the embassy-* packages from testing. > - Upload emboss 6.2 to testing-proposed-updates. > - Upgrade embassy-* packages with the latest upstream version, that builds > against emboss 6.3, and let emboss 6.3 in testing. What is the priority of supporting the Embassy packages? Are they lesser citizens in EMBOSS? Or can we expect resolution in the near future? Pj. From biopython at maubp.freeserve.co.uk Fri Aug 13 09:40:35 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 13 Aug 2010 10:40:35 +0100 Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM In-Reply-To: References: <4C3EED02.7080507@ebi.ac.uk> <4C57C4D9.1010805@ebi.ac.uk> Message-ID: On Tue, Aug 3, 2010 at 9:12 AM, Peter wrote: > On Tue, Aug 3, 2010 at 8:27 AM, Peter Rice wrote: >>> >>> Just for the record, EMBOSS 6.3.1 does not append anything to the >>> read names, meaning paired end reads cannot be distinguished if >>> output as FASTA or FASTQ. >>> >>> I'm not sure my idea of appending /1 or /2 for paired reads is the >>> best solution (especially since there are other naming schemes >>> out there like _f and _r as suffixes).
Nevertheless, it seems like a >>> practical solution. Would including a slash character within a >>> sequence name cause problems in EMBOSS (a potential issue >>> you raised earlier)? >> >> The /1 and /2 would cause horrible problems. The sequence names are >> used to generate default output file names so a '/' would have to be >> removed or converted, most likely to _1 and _2 > > Oh :( > > I thought they might cause confusion with slashes in filenames, but > yes, they can't be used in filenames can they. Thinking about this more, I don't think there is a problem. There are two main reasons. First, with SAM/BAM/FASTQ files there are typically so many reads that you would never want to create one file per read. Also, there are plenty of other file formats where the record ID can or indeed usually does contain a slash - specifically PFAM/Stockholm format alignments from PFAM where the ID is name/start-stop, e.g. http://emboss.sourceforge.net/docs/themes/seqformats/pfam Surely EMBOSS has already got a mechanism for dealing with slashes in IDs when asked to use the IDs as filenames? I think I mentioned storing the original read name in the tags had been suggested on the samtools-devel list. In the latest draft of the SAM/BAM spec, a new tag FS (fragment name suffix) has been proposed, so that the original read names could be recovered by taking the fragment name (the ID in SAM/BAM) and appending this suffix. See this thread earlier in August 2010, [Samtools-devel] Recording original read name in tags http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTimg%2BvNU3CkW-63Mmug-Qt0md183dyJ_nRqva1rv%40mail.gmail.com&forum_name=samtools-devel Finally, also on the samtools-help list, it was pointed out that the hydra-sv project has a bamToFastq tool, see thread: [Samtools-help] BAM to fastq how? 
http://sourceforge.net/mailarchive/forum.php?thread_name=AANLkTinBnm%2B8V8bXD_ii9jn8-O%2B0_N1MgWBxBFnqm2Mk%40mail.gmail.com&forum_name=samtools-help and http://code.google.com/p/hydra-sv/ Peter C. From gbottu at vub.ac.be Tue Aug 17 18:43:58 2010 From: gbottu at vub.ac.be (Guy Bottu) Date: Tue, 17 Aug 2010 20:43:58 +0200 Subject: [emboss-dev] computed maximum forbidden in ACD ? Message-ID: <4C6AD86E.7070909@vub.ac.be> Dear Peter and Alan, I was doing some development on wrappers4EMBOSS when I noted the following. The file blast.acd contains : integer: listsize [ information: "Show only the n best scoring sequences that satisfy E() cutoff" default: "500" minimum: "0" ] integer: align [ information: "Show only alignments for the n first sequences" default: "@(@($(listsize) < 250 ) ? $(listsize) : 250)" expected: "250" minimum: "0" maximum: "$(listsize)" (this is line 100) valid: "Integer 0 or more, but not < listsize" ] When I run blast I get : Error: File /OPT/emboss63/share/EMBOSS/acd/blast.acd line 100: (wordsize) Attribute failrange: required with any calculated min/max I am as good as certain that this behaviour has appeared with EMBOSS version 6.3.0. In the past it was allowed to set a "maximum" that depended on the choice of another parameter, and we can see that it could occasionally make sense, but this seems from now on forbidden. Is this a bug or a feature? Regards, Guy Bottu From pmr at ebi.ac.uk Tue Aug 17 20:22:58 2010 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 17 Aug 2010 21:22:58 +0100 Subject: [emboss-dev] computed maximum forbidden in ACD ? In-Reply-To: <4C6AD86E.7070909@vub.ac.be> References: <4C6AD86E.7070909@vub.ac.be> Message-ID: <4C6AEFA2.8000201@ebi.ac.uk> Dear Guy, > When I run blast I get : > > Error: File /OPT/emboss63/share/EMBOSS/acd/blast.acd line 100: > (wordsize) Attribute failrange: required with any calculated min/max > > I am as good as certain that this behaviour has appeared with EMBOSS > version 6.3.0.
In the past it was allowed to set a "maximum" that > depended on the choice of another parameter, and we can see that it > could occasionally make sense, but this seems from now on forbidden. Is > this a bug or a feature ? It is a fix for a feature. With calculated maximum or minimum values (e.g. depending on a window size) it was possible for the maximum to be less than the minimum. In such cases we could logically use either the maximum or the minimum - and some applications were found to require one choice, others needed the other. After some discussion we decided to add extra attributes to control the behaviour. You can add three new attributes: trueminimum: "N" (if max/min overlap, use minimum) failrange: "Y" (Fail if (calculated) ranges overlap) rangemessage: "" (Failure message if (calculated) ranges overlap) A common solution (good for your case) is: failrange: "N" trueminimum: "Y" By adding the error messages we made sure that an ACD file with a calculated range will give messages to the developer suggesting missing attributes to be added. If you set failrange: "Y" you need to define a message explaining to the end user why the range might fail. If you set failrange: "N" the calculated range is accepted, but you also need to set trueminimum to say whether you want the minimum value to apply (usual to avoid getting negative values) or the maximum to avoid values going too large. So, you get the "failrange is required" message. When you set that you get another message (depending on whether it is true or false) telling you to set one of the other attributes as well. Hope this makes it clearer! Peter
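For illustration, here is how the suggested pair of attributes might be applied to the align block quoted from blast.acd. This is an untested sketch: the block itself is copied from Guy's message, but putting failrange and trueminimum directly after the calculated maximum is an assumption, as are the two comment lines, and none of it has been checked against EMBOSS 6.3.x.

```
integer: align [
  information: "Show only alignments for the n first sequences"
  default: "@(@($(listsize) < 250 ) ? $(listsize) : 250)"
  expected: "250"
  minimum: "0"
  maximum: "$(listsize)"
  # Assumed placement: accept the calculated range rather than failing
  failrange: "N"
  # If the calculated maximum falls below the minimum, use the minimum
  trueminimum: "Y"
  valid: "Integer 0 or more, but not < listsize"
]
```

With failrange: "N" no rangemessage should be needed, since the range is never rejected. With trueminimum: "Y" the fixed minimum of 0 would win if $(listsize) ever came out lower than it.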