From biopython at maubp.freeserve.co.uk  Thu Jun 17 13:58:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Jun 2010 18:58:03 +0100
Subject: [emboss-dev] Quality scores in union and splitter output (was: partial GenBank)
Message-ID:

Hi Peter R et al,

This email was prompted by a discussion on the main EMBOSS list:

On Thu, Jun 17, 2010 at 9:38 AM, Peter Rice wrote:
>
> On 16/06/2010 13:55, Roi Brodo wrote:
>>
>> After some more reading I think I can do it using union. The problem is
>> that after I create the list (of the two ranges) using yank, union dies on
>> "union terminated: Bad value for '-sequence' and no prompt". Why is that?
>> Shouldn't i use a yank file?
>
> Yes, yank and union is the correct approach.
>
> The output of yank is a list file, so the input to union should be @filename
> to read a list of sequence addresses from the file.
>
> If you just give the filename it assumes it is sequences (perhaps a fasta
> file of sequences to be joined).
>
> We will add this to our feature requests - it should be possible to make
> seqret handle ranges from circular sequences. This will be after the next
> release as it requires rewriting the way several library functions work to
> allow the circular range.

This is an interesting example and I couldn't resist working out how I
would solve it with Biopython - something like this if anyone cares:

from Bio import SeqIO

# Read the single GenBank record, then join the tail of the sequence
# (800000 to the end) to its first 100000 bases, spanning the origin.
old = SeqIO.read("input.gbk", "gb")
new = old[800000:] + old[:100000]
SeqIO.write(new, "output.gbk", "gb")

We already have several unit tests for Biopython which check some
functionality against an EMBOSS tool. I should probably try using
EMBOSS yank and union to verify our record slicing and addition...
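
The slice-and-join idea for a range that crosses the origin of a circular
sequence can also be sketched on a plain string, without Biopython. This is
an illustration only: circular_range is a hypothetical helper, coordinates
are assumed to be 0-based and half-open as in Python slicing, and nothing
beyond the letters themselves (features, annotation) is carried over.

```python
def circular_range(seq, start, end):
    """Return the range start..end of a circular sequence.

    Coordinates are 0-based and half-open. When start < end this is an
    ordinary slice. Otherwise the range runs from start to the end of the
    sequence and continues from the origin up to end (start == end gives
    the whole sequence, rotated to begin at start).
    """
    n = len(seq)
    if not (0 <= start <= n and 0 <= end <= n):
        raise ValueError("coordinates must lie within the sequence")
    if start < end:
        return seq[start:end]
    # Wrap around the origin: tail of the sequence plus its head.
    return seq[start:] + seq[:end]


if __name__ == "__main__":
    genome = "ABCDEFGHIJ"
    print(circular_range(genome, 8, 3))  # range crossing the origin
    print(circular_range(genome, 2, 5))  # ordinary range
```
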

As part of this it occurred to me to try union on FASTQ files and I
found it does not handle the quality scores properly:

$ more example.fastq
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
;;;;;;;;;;;9;7;;.7;393333

$ union --version
EMBOSS:6.2.0

$ union -sequence example.fastq -sformat fastq-sanger -osformat fastq-sanger -stdout -auto
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCCTTGGCAGGCCAAGGCCGATGGATCAGTTGCTTCTGGCGTGGGTGGGGGGG
+
;;3;;;;;;;;;;;;7;;;;;;;88!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

The output is well formed, and has the correct quality scores from the
first entry, but has failed to include the quality scores of the
subsequent sequences (defaulting to PHRED zero, the exclamation mark).

There is a similar issue with splitter which seems to default to PHRED
one, the double quote, for all entries:

$ splitter -size 5 -sequence example.fastq -sformat fastq-sanger -osformat fastq-sanger -stdout -auto
@EAS54_6_R1_2_1_413_324_1-5
CCCTT
+
"""""
@EAS54_6_R1_2_1_413_324_6-10
CTTGT
+
"""""
...
[cut]

Now admittedly neither of these operations seems to be very natural for
short read data - although it might make sense to take a FASTQ contig
and shred it using splitter for feeding into another assembly tool?

Anyway, I thought I should report these issues.

Regards,

Peter C.

From pmr at ebi.ac.uk  Thu Jun 17 17:01:16 2010
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 17 Jun 2010 22:01:16 +0100
Subject: [emboss-dev] Quality scores in union and splitter output (was: partial GenBank)
In-Reply-To:
References:
Message-ID: <4C1A8D1C.1090901@ebi.ac.uk>

On 17/06/2010 18:58, Peter C. wrote:
> Hi Peter R et al,
>
> We already have several unit tests for Biopython which check some
> functionality against an EMBOSS tool.
> I should probably try using EMBOSS yank and union to verify our record
> slicing and addition...
> As part of this it occurred to me to try union on FASTQ files and I
> found it does not handle the quality scores properly:

Thanks. Just in time to fix those for the release.

regards,

Peter R.

From biopython at maubp.freeserve.co.uk  Thu Jun 17 17:23:04 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Jun 2010 22:23:04 +0100
Subject: [emboss-dev] Quality scores in union and splitter output (was: partial GenBank)
In-Reply-To: <4C1A8D1C.1090901@ebi.ac.uk>
References: <4C1A8D1C.1090901@ebi.ac.uk>
Message-ID:

On Thu, Jun 17, 2010 at 10:01 PM, Peter Rice wrote:
>
> Thanks. Just in time to fix those for the release.
>
> regards,
>
> Peter R.

Great - that was quick work :)

Peter C.

From uludag at ebi.ac.uk  Fri Jun 25 06:49:34 2010
From: uludag at ebi.ac.uk (Mahmut Uludag)
Date: Fri, 25 Jun 2010 11:49:34 +0100
Subject: [emboss-dev] ajSeqNewRange, sequence accuracy values
Message-ID: <1277462975.7755.25.camel@emboss1.ebi.ac.uk>

Hi,

Working on a problem I need to copy sequence accuracy values when a new
sequence object is defined using part of a larger original sequence.

Since ajSeqNewRange functions in ajseq.c doesn't have sequence object as
input arguments (but the sequence string) I either need to define a new
ajSeqNewRange function with the source sequence object as one of the
inputs or I should define a new function similar to the following that
can be called after calling the ajSeqNewRange function.

/* @func ajSeqAssignAccuracy ************************************************
**
** Copies accuracy values from src sequence to dest sequence.
** Assumes dest sequence was a subset of the src sequence.
**
** @param [r] dest [AjPSeq] sequence object to be updated
** @param [r] src [AjPSeq] source sequence object for accuracy values
** @param [r] offset [ajint] start point in the src sequence where the
**                           dest sequence was copied from
**
** @return [void]
******************************************************************************/

void ajSeqAssignAccuracy(AjPSeq dest, AjPSeq src, ajint offset)
{

    dest->Qualsize = ajSeqGetLenUngapped(dest);
    AJCNEW0(dest->Accuracy,dest->Qualsize);
    memmove(dest->Accuracy,src->Accuracy+offset,
            dest->Qualsize*sizeof(float));
}

Another alternative would be to make AJCNEW0 and memmove calls in
embaln.c where we call the ajSeqNewRange function.

I'm finding it difficult to decide, I would appreciate any suggestions.

Mahmut

From ajb at ebi.ac.uk  Fri Jun 25 08:03:45 2010
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Fri, 25 Jun 2010 13:03:45 +0100 (BST)
Subject: [emboss-dev] ajSeqNewRange, sequence accuracy values
In-Reply-To: <1277462975.7755.25.camel@emboss1.ebi.ac.uk>
References: <1277462975.7755.25.camel@emboss1.ebi.ac.uk>
Message-ID: <53051.86.26.12.63.1277467425.squirrel@webmail.ebi.ac.uk>

Hi Mahmut,

As a heuristic: In cases where you find you'd need to access AJAX
object internals from either nucleus (or an application), then that
is a sign that a new AJAX function of some sort is required.
[assuming there isn't a preferred way of doing something which you've
missed of course]

On that basis alone I'd go for the function.

Alan

> Hi,
>
> Working on a problem I need to copy sequence accuracy values when a new
> sequence object is defined using part of a larger original sequence.
>
> Since ajSeqNewRange functions in ajseq.c doesn't have sequence object as
> input arguments (but the sequence string) I either need to define a new
> ajSeqNewRange function with the source sequence object as one of the
> inputs or I should define a new function similar to the following that
> can be called after calling the ajSeqNewRange function.
>
> /* @func ajSeqAssignAccuracy
> ************************************************
> **
> ** Copies accuracy values from src sequence to dest sequence.
> ** Assumes dest sequence was a subset of the src sequence.
> **
> ** @param [r] dest [AjPSeq] sequence object to be updated
> ** @param [r] src [AjPSeq] source sequence object for accuracy values
> ** @param [r] offset [ajint] start point in the src sequence where the
> **                           dest sequence was copied from
> **
> ** @return [void]
> ******************************************************************************/
>
> void ajSeqAssignAccuracy(AjPSeq dest, AjPSeq src, ajint offset)
> {
>
>     dest->Qualsize = ajSeqGetLenUngapped(dest);
>     AJCNEW0(dest->Accuracy,dest->Qualsize);
>     memmove(dest->Accuracy,src->Accuracy+offset,
>             dest->Qualsize*sizeof(float));
> }
>
> Another alternative would be to make AJCNEW0 and memmove calls in
> embaln.c where we call the ajSeqNewRange function.
>
> I'm finding it difficult to decide, I would appreciate any suggestions.
>
> Mahmut
>
>
> _______________________________________________
> emboss-dev mailing list
> emboss-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss-dev
>

From uludag at ebi.ac.uk  Fri Jun 25 08:26:50 2010
From: uludag at ebi.ac.uk (Mahmut Uludag)
Date: Fri, 25 Jun 2010 13:26:50 +0100
Subject: [emboss-dev] ajSeqNewRange, sequence accuracy values
In-Reply-To: <53051.86.26.12.63.1277467425.squirrel@webmail.ebi.ac.uk>
References: <1277462975.7755.25.camel@emboss1.ebi.ac.uk> <53051.86.26.12.63.1277467425.squirrel@webmail.ebi.ac.uk>
Message-ID: <1277468810.7755.38.camel@emboss1.ebi.ac.uk>

Hi Alan,

> As a heuristic: In cases where you find you'd need to access AJAX
> object internals from either nucleus (or an application), then that
> is a sign that a new AJAX function of some sort is required.
> [assuming there isn't a preferred way of doing something which you've
> missed of course]
>
> On that basis alone I'd go for the function.

Thanks.
Did you mean the new function I copied or a new ajSeqNewRange function
with its first argument being a sequence object?

If you meant the new ajSeqAssignAccuracy function then does its name
sound correct?

I have noticed that I should have the following conditions in the
ajSeqAssignAccuracy function rather than checking them before calling
the function.

(src->Accuracy!=NULL && src->Qualsize>0)

Mahmut

> > Working on a problem I need to copy sequence accuracy values when a new
> > sequence object is defined using part of a larger original sequence.
> >
> > Since ajSeqNewRange functions in ajseq.c doesn't have sequence object as
> > input arguments (but the sequence string) I either need to define a new
> > ajSeqNewRange function with the source sequence object as one of the
> > inputs or I should define a new function similar to the following that
> > can be called after calling the ajSeqNewRange function.
> >
> > /* @func ajSeqAssignAccuracy
> > ************************************************
> > **
> > ** Copies accuracy values from src sequence to dest sequence.
> > ** Assumes dest sequence was a subset of the src sequence.
> > **
> > ** @param [r] dest [AjPSeq] sequence object to be updated
> > ** @param [r] src [AjPSeq] source sequence object for accuracy values
> > ** @param [r] offset [ajint] start point in the src sequence where the
> > **                           dest sequence was copied from
> > **
> > ** @return [void]
> > ******************************************************************************/
> >
> > void ajSeqAssignAccuracy(AjPSeq dest, AjPSeq src, ajint offset)
> > {
> >
> >     dest->Qualsize = ajSeqGetLenUngapped(dest);
> >     AJCNEW0(dest->Accuracy,dest->Qualsize);
> >     memmove(dest->Accuracy,src->Accuracy+offset,
> >             dest->Qualsize*sizeof(float));
> > }
> >
> > Another alternative would be to make AJCNEW0 and memmove calls in
> > embaln.c where we call the ajSeqNewRange function.
> >
> > I'm finding it difficult to decide, I would appreciate any suggestions.
> >
> > Mahmut
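
The behaviour both threads are after (qualities that follow the bases
through a join or a sub-range copy, guarded by a check that the source has
any qualities at all) can be modelled in a few lines of Python. This is
only a sketch of the intended semantics, not EMBOSS code: Read, union_reads
and split_read are hypothetical names, and qualities are kept as the raw
FASTQ string rather than the float Accuracy array used in the C above.

```python
from typing import List, NamedTuple, Optional


class Read(NamedTuple):
    name: str
    seq: str
    qual: Optional[str]  # one quality character per base, or None if absent


def union_reads(reads: List[Read]) -> Read:
    """Join reads end to end under the first read's name."""
    if not reads:
        raise ValueError("need at least one read")
    seq = "".join(r.seq for r in reads)
    # Keep qualities only if every input read has a full set of them.
    if all(r.qual is not None and len(r.qual) == len(r.seq) for r in reads):
        qual: Optional[str] = "".join(r.qual or "" for r in reads)
    else:
        qual = None
    return Read(reads[0].name, seq, qual)


def split_read(read: Read, size: int) -> List[Read]:
    """Cut a read into consecutive pieces of at most `size` bases."""
    if size <= 0:
        raise ValueError("size must be positive")
    pieces = []
    for offset in range(0, len(read.seq), size):
        end = min(offset + size, len(read.seq))
        # Analogue of the guard suggested for ajSeqAssignAccuracy: copy the
        # matching quality range only when the source actually has qualities.
        qual = read.qual[offset:end] if read.qual else None
        name = f"{read.name}_{offset + 1}-{end}"  # 1-based, inclusive range
        pieces.append(Read(name, read.seq[offset:end], qual))
    return pieces


if __name__ == "__main__":
    reads = [
        Read("EAS54_6_R1_2_1_413_324", "CCCTTCTTGTCTTCAGCGTTTCTCC",
             ";;3;;;;;;;;;;;;7;;;;;;;88"),
        Read("EAS54_6_R1_2_1_540_792", "TTGGCAGGCCAAGGCCGATGGATCA",
             ";;;;;;;;;;;7;;;;;-;;;3;83"),
    ]
    print(union_reads(reads))
    for piece in split_read(reads[0], 5)[:2]:
        print(piece)
```

The design choice mirrors Alan's heuristic: callers never touch the quality
data directly, and the "does the source have qualities?" check lives inside
the helper rather than at each call site.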