From stephen.taylor at imm.ox.ac.uk Thu Sep 10 09:37:18 2009 From: stephen.taylor at imm.ox.ac.uk (Stephen Taylor) Date: Thu, 10 Sep 2009 14:37:18 +0100 Subject: [EMBOSS] PWMs in EMBOSS Message-ID: <37830416-0044-4FA7-8394-8253BF4C671D@imm.ox.ac.uk> Hi, Is it possible to search PWMs in the following format in EMBOSS? Thanks, Steve Name A: 0.157272762655186 0.101522869320827 0.193269908676739 0.400388333956085 0.0242258313143186 0.00799984439409978 0.985140882956166 0.982735821742488 0.00822519796024231 0.00216973239819103 0.915564050240553 0.620804636015513 0.170770180463701 0.288450091623397 0.357825784599729 0.234712937095288 0.12172997792596 C: 0.585688175787352 0.237479093836628 0.435987353829905 0.147300012153726 0.342087245546403 0.0758172877627498 0.00580600999936562 0.00109267794415073 0.00794630235311888 0.00688337464627776 0.000618817602597609 0.0128822871237646 0.230733571623305 0.276650970023954 0.233013950290859 0.285797488742033 0.426544489294796 G: 0.146549771679813 0.564379933620146 0.162348523327608 0.288303072582316 0.0128822871237646 0.000618817602597609 0.00688337464627776 0.00794630235311888 0.00109267794415073 0.00580600999936562 0.0758172877627498 0.342087245546403 0.181573332311972 0.131046633924716 0.187725054521253 0.247196691859296 0.213778652608617 T: 0.110489289877649 0.0966181032223992 0.208394214165748 0.164008581307874 0.620804636015513 0.915564050240553 0.00216973239819103 0.00822519796024231 0.982735821742488 0.985140882956166 0.00799984439409978 0.0242258313143186 0.416922915601023 0.303852304427932 0.221435210588158 0.232292882303383 0.237946880170627 From pmr at ebi.ac.uk Thu Sep 10 10:54:25 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 10 Sep 2009 15:54:25 +0100 Subject: [EMBOSS] PWMs in EMBOSS In-Reply-To: <37830416-0044-4FA7-8394-8253BF4C671D@imm.ox.ac.uk> References: <37830416-0044-4FA7-8394-8253BF4C671D@imm.ox.ac.uk> Message-ID: <4AA91321.1020002@ebi.ac.uk> Stephen Taylor wrote: > Hi, > > Is it possible to search PWMs 
in the following format in EMBOSS? Not yet ... but we can extend the formats for PWMs. We only support the formats that we write. Votes please on position weight matrix formats EMBOSS should be able to read ... regards, Peter From charles-listes-emboss at plessy.org Wed Sep 16 02:57:55 2009 From: charles-listes-emboss at plessy.org (Charles Plessy) Date: Wed, 16 Sep 2009 15:57:55 +0900 Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools. Message-ID: <20090916065755.GA15425@kunpuu.plessy.org> Dear EMBOSS developers, I have multi-sequence file in FASTQ format that contains sequencing reads, and would like to retreive them the with seqret. But as you see in the following example, quality scores are not preserved: $ seqret P13-CA.fq:F1EZY7316JY25B fastq::stdout Reads and writes (returns) sequences @F1EZY7316JY25B rank=0000040 x=3973.0 y=285.0 length=68 AATGATACGGCGACCACCGAACACTGCGTTTGCTGGCTTTGATGCACTTCTCATGGCCAATTTCATTG + """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" The purpose was to use seqret as a workaround for the fact that vectorstrip does not keep the quality either. Do you think that it would be possible to get this functionality as a patch in the future, or is it big work that needs to wait for the next release? Have a nice day, -- Charles Plessy Tsurumi, Kanagawa, Japan From uludag at ebi.ac.uk Wed Sep 16 05:12:16 2009 From: uludag at ebi.ac.uk (Mahmut Uludag) Date: Wed, 16 Sep 2009 10:12:16 +0100 Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools. In-Reply-To: <20090916065755.GA15425@kunpuu.plessy.org> References: <20090916065755.GA15425@kunpuu.plessy.org> Message-ID: <1253092337.32439.31.camel@emboss2.ebi.ac.uk> Hi Charles, seqret returns quality scores if the input sequence format is explicitly defined on the command line, such as -sformat=fastq-sanger. The following patch looks like fixes the vectorstrip problem. 
*** ajseq.c.org 2009-09-16 10:08:17.000000000 +0100
--- ajseq.c 2009-09-16 09:52:56.000000000 +0100
***************
*** 781,786 ****
--- 781,792 ----
  
      if (seq->Fttable)
          pthis->Fttable = ajFeattableCopy(seq->Fttable);
+ 
+     if (seq->Accuracy)
+     {
+         AJCNEW0(pthis->Accuracy,seq->Seq->Len);
+         memmove(pthis->Accuracy,seq->Accuracy,seq->Seq->Len*sizeof(float));
+     }
  
      return pthis;
  }

Regards,
Mahmut

> I have a multi-sequence file in FASTQ format that contains sequencing reads, and
> would like to retrieve them with seqret. But as you see in the following
> example, quality scores are not preserved:
>
> $ seqret P13-CA.fq:F1EZY7316JY25B fastq::stdout
> Reads and writes (returns) sequences
> @F1EZY7316JY25B rank=0000040 x=3973.0 y=285.0 length=68
> AATGATACGGCGACCACCGAACACTGCGTTTGCTGGCTTTGATGCACTTCTCATGGCCAATTTCATTG
> +
> """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
>
> The purpose was to use seqret as a workaround for the fact that vectorstrip
> does not keep the quality either.
>
> Do you think that it would be possible to get this functionality as a patch in
> the future, or is it big work that needs to wait for the next release?

From biopython at maubp.freeserve.co.uk Wed Sep 16 05:31:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Sep 2009 10:31:22 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <20090916065755.GA15425@kunpuu.plessy.org>
References: <20090916065755.GA15425@kunpuu.plessy.org>
Message-ID: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>

On Wed, Sep 16, 2009 at 7:57 AM, Charles Plessy wrote:
>
> Dear EMBOSS developers,
>
> I have a multi-sequence file in FASTQ format that contains sequencing reads, and
> would like to retrieve them with seqret.
But as you see in the following > example, quality scores are not preserved: > > $ seqret P13-CA.fq:F1EZY7316JY25B fastq::stdout > Reads and writes (returns) sequences > @F1EZY7316JY25B rank=0000040 x=3973.0 y=285.0 length=68 > AATGATACGGCGACCACCGAACACTGCGTTTGCTGGCTTTGATGCACTTCTCATGGCCAATTTCATTG > + > """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" You need to use "fastq-sanger" (or the other variants), since in EMBOSS, "fastq" currently means FASTQ ignoring the qualities. This is documented: http://emboss.sourceforge.net/docs/themes/SequenceFormats.html As an EMBOSS user, I think the current situation is confusing, and it would make much more sense to have "fastq" just an alias for "fastq-sanger" (which would be consistent with Biopython and BioPerl). http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000576.html And also this email - especially the last example: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000599.html > The purpose was to use seqret as a workaround for the fact that > vectorstrip does not keep the quality either. That's also been suggested, and is likely to be supported in future. http://lists.open-bio.org/pipermail/emboss/2009-August/003722.html Peter From charles-listes-emboss at plessy.org Thu Sep 17 02:36:59 2009 From: charles-listes-emboss at plessy.org (Charles Plessy) Date: Thu, 17 Sep 2009 15:36:59 +0900 Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools. 
In-Reply-To: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
    <1253092337.32439.31.camel@emboss2.ebi.ac.uk>
References: <20090916065755.GA15425@kunpuu.plessy.org>
    <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
    <20090916065755.GA15425@kunpuu.plessy.org>
    <1253092337.32439.31.camel@emboss2.ebi.ac.uk>
Message-ID: <20090917063659.GA27021@kunpuu.plessy.org>

On Wed, Sep 16, 2009 at 10:12:16AM +0100, Mahmut Uludag wrote:
>
> seqret returns quality scores if the input sequence format is explicitly
> defined on the command line, such as -sformat=fastq-sanger.
>
> The following patch looks like it fixes the vectorstrip problem.

On Wed, Sep 16, 2009 at 10:31:22AM +0100, Peter wrote:
>
> You need to use "fastq-sanger" (or the other variants), since in
> EMBOSS, "fastq" currently means FASTQ ignoring the qualities.

Hi Mahmut and Peter,

and thank you very much for your answers!

I would also like it if the qualities were kept by default. I had actually
tried to force the fastq-sanger format before, but by adding its name to the
USAs, as in "seqret fastq-sanger::stdin fastq-sanger::stdout". Unfortunately it
did not work; I do not know if it is by design or because of the dash in the
format name. Nevertheless, -sformat=fastq-sanger and -osformat=fastq-sanger
worked very well after I applied Mahmut's patch.

I am tempted to apply it also to the Debian EMBOSS package, but maybe it is
premature.
In particular, I have the following warning each time the quality is encoded
by an equal sign:

Warning: Illegal character '='
Warning: Illegal pattern: =

By the way, I think I found a bug in revseq: it seems that it does not reverse
the qualities:

$ echo -e "@toto\nACTG\n+toto\n12/3" | seqret -filter -sformat=fastq-sanger -osformat=fastq-sanger
@toto
ACTG
+
12/3

$ echo -e "@toto\nACTG\n+toto\n12/3" | revseq -filter -sformat=fastq-sanger -osformat=fastq-sanger
@toto Reversed:
CAGT
+
12/3

Also, contrary to what the documentation predicts, using the fastq format
for the output does not ignore the quality scores. (Not that it would be
particularly useful, but...)

$ echo -e "@toto\nACTG\n+toto\nACTG" | revseq -filter -sformat=fastq-sanger -osformat=fastq
@toto Reversed:
CAGT
+
ACTG

Have a nice day,

-- 
Charles Plessy
http://charles.plessy.org
Tsurumi, Kanagawa, Japan

From pmr at ebi.ac.uk Thu Sep 17 03:24:07 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 17 Sep 2009 08:24:07 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <20090917063659.GA27021@kunpuu.plessy.org>
References: <20090916065755.GA15425@kunpuu.plessy.org>
    <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
    <20090916065755.GA15425@kunpuu.plessy.org>
    <1253092337.32439.31.camel@emboss2.ebi.ac.uk>
    <20090917063659.GA27021@kunpuu.plessy.org>
Message-ID: <4AB1E417.3010405@ebi.ac.uk>

Charles Plessy wrote:
> I would also like it if the qualities were kept by default. I had actually tried
> to force the fastq-sanger format before, but by adding its name to the USAs,
> as in "seqret fastq-sanger::stdin fastq-sanger::stdout". Unfortunately it did
> not work; I do not know if it is by design or because of the dash in the format
> name. Nevertheless, -sformat=fastq-sanger and -osformat=fastq-sanger worked very
> well after I applied Mahmut's patch.

Yes, the dash in the format name is causing problems.
It should be allowed where there is a '::' in the USA (it is not allowed in
database queries because of the dbname-field:value query syntax). I will make
a patch for this.

> I am tempted to apply it also to the Debian EMBOSS package, but maybe it is
> premature. In particular, I have the following warning each time the quality
> is encoded by an equal sign:
>
> Warning: Illegal character '='
> Warning: Illegal pattern: =

This is surprising. Is your EMBOSS version the original distribution, or have
you applied the current patches? If it fails with the patched version, could
you send me an input file that causes this error?

> By the way, I think I found a bug in revseq: it seems that it does not reverse
> the qualities:

True ... this I will also patch. We have used qualities for some years (in
Staden experiment format) but it appears nobody has reversed sequences and
kept the qualities. Life is changing with FASTQ data!

> Also, contrary to what the documentation predicts, using the fastq format
> for the output does not ignore the quality scores. (Not that it would be
> particularly useful, but...)

This is deliberate. We have to write something in FASTQ format and we default
to the fastq-sanger format. On input, "fastq" ignores qualities because there
is no safe way to decide which format is correct.

regards,

Peter

From biopython at maubp.freeserve.co.uk Thu Sep 17 05:11:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 10:11:15 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <4AB1E417.3010405@ebi.ac.uk>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
    <20090916065755.GA15425@kunpuu.plessy.org>
    <1253092337.32439.31.camel@emboss2.ebi.ac.uk>
    <20090917063659.GA27021@kunpuu.plessy.org>
    <4AB1E417.3010405@ebi.ac.uk>
Message-ID: <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>

On Thu, Sep 17, 2009 at 8:24 AM, Peter Rice wrote:
>
>> Also, contrary to what the documentation predicts, using the fastq
>> format for the output does not ignore the quality scores. (Not that
>> it would be particularly useful, but...)
>
> This is deliberate. We have to write something in FASTQ format and we
> default to the fastq-sanger format. On input, "fastq" ignores qualities
> because there is no safe way to decide which format is correct.

So again, could you reconsider making "fastq" act like "fastq-sanger"?
The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
a superset of the Solexa/Illumina FASTQ variants - so even if you don't
know which kind of FASTQ file you have, and you don't care about the
qualities, parsing it as a Sanger FASTQ file will work.

Peter C.

From pmr at ebi.ac.uk Thu Sep 17 05:18:59 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 17 Sep 2009 10:18:59 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
    <20090916065755.GA15425@kunpuu.plessy.org>
    <1253092337.32439.31.camel@emboss2.ebi.ac.uk>
    <20090917063659.GA27021@kunpuu.plessy.org>
    <4AB1E417.3010405@ebi.ac.uk>
    <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>
Message-ID: <4AB1FF03.80705@ebi.ac.uk>

Peter C. wrote:
> On Thu, Sep 17, 2009 at 8:24 AM, Peter Rice wrote:
>>> Also, contrary to what the documentation predicts, using the fastq
>>> format for the output does not ignore the quality scores.
(Not that
>>> it would be particularly useful, but...)
>> This is deliberate. We have to write something in FASTQ format and we
>> default to the fastq-sanger format. On input, "fastq" ignores qualities
>> because there is no safe way to decide which format is correct.
>
> So again, could you reconsider making "fastq" act like "fastq-sanger"?
> The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
> a superset of the Solexa/Illumina FASTQ variants - so even if you don't
> know which kind of FASTQ file you have, and you don't care about the
> qualities, parsing it as a Sanger FASTQ file will work.

Yes, but it is dangerous if they could really be Solexa qualities.

What we could do is provide a utility that reads in fastq-sanger format and
checks whether the quality scores make most sense as Sanger, Solexa or
Illumina.

I consider reading as fastq-sanger by default to be rather dangerous.

Peter

From biopython at maubp.freeserve.co.uk Thu Sep 17 05:32:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 10:32:21 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <4AB1FF03.80705@ebi.ac.uk>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
    <20090916065755.GA15425@kunpuu.plessy.org>
    <1253092337.32439.31.camel@emboss2.ebi.ac.uk>
    <20090917063659.GA27021@kunpuu.plessy.org>
    <4AB1E417.3010405@ebi.ac.uk>
    <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>
    <4AB1FF03.80705@ebi.ac.uk>
Message-ID: <320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com>

On Thu, Sep 17, 2009 at 10:18 AM, Peter Rice wrote:
>
>> So again, could you reconsider making "fastq" act like "fastq-sanger"?
>> The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
>> a superset of the Solexa/Illumina FASTQ variants - so even if you don't
>> know which kind of FASTQ file you have, and you don't care about the
>> qualities, parsing it as a Sanger FASTQ file will work.
>
> Yes, but it is dangerous if they could really be Solexa qualities.

Indeed, or an Illumina 1.3+ encoded FASTQ file.

So if the EMBOSS tools are used to read a FASTQ file without specifying
the FASTQ variant, do they currently detect it is FASTQ and default to the
"fastq" setting and ignore the quality information?

> What we could do is provide a utility that reads in fastq-sanger format and
> checks whether the quality scores make most sense as Sanger, Solexa or
> Illumina.

That could be useful - I guess you could scan all the reads building up
a histogram of the ASCII characters used. This could immediately
rule out some of the options, and then based on the distribution (if
you assume they are raw reads) you could make a good guess.

> I consider reading as fastq-sanger by default to be rather dangerous.

That is understandable. How about removing the current "fastq" output
then? That might prevent some of the confusion at the moment. I'm
struggling to see any purpose for the current "fastq" output - can you
give me any example use case? Right now it has to pick an arbitrary
quality symbol, and uses ASCII 34 (double quote) which means PHRED
1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
Illumina 1.3+ FASTQ file.

Regards,

Peter

From pmr at ebi.ac.uk Thu Sep 17 05:52:48 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 17 Sep 2009 10:52:48 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
    <20090916065755.GA15425@kunpuu.plessy.org>
    <1253092337.32439.31.camel@emboss2.ebi.ac.uk>
    <20090917063659.GA27021@kunpuu.plessy.org>
    <4AB1E417.3010405@ebi.ac.uk>
    <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>
    <4AB1FF03.80705@ebi.ac.uk>
    <320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com>
Message-ID: <4AB206F0.1040205@ebi.ac.uk>

Peter C.
wrote:
> So if the EMBOSS tools are used to read a FASTQ file without specifying
> the FASTQ variant, do they currently detect it is FASTQ and default to the
> "fastq" setting and ignore the quality information?

Yes, exactly so. Reading the sequence data is safe, and may be all the user
wanted to do.

>> What we could do is provide a utility that reads in fastq-sanger format and
>> checks whether the quality scores make most sense as Sanger, Solexa or
>> Illumina.
>
> That could be useful - I guess you could scan all the reads building up
> a histogram of the ASCII characters used. This could immediately
> rule out some of the options, and then based on the distribution (if
> you assume they are raw reads) you could make a good guess.

The ACD file would be 'interesting'. We could set the default format to be
"fastq-sanger" and issue some warning if we find the user had tried to
change it. That way the application would run with a filename as the input,
though it will appear to interfaces to be able to read any sequence input.

Are there rules we can use to decide on improbable qualities? Values below
the Illumina and Solexa minima would seem a good guide, and perhaps values
above the likely short read maximum score.

Maybe some existing pipelines have some cutoff values we could adopt?

>> I consider reading as fastq-sanger by default to be rather dangerous.
>
> That is understandable. How about removing the current "fastq" output
> then? That might prevent some of the confusion at the moment. I'm
> struggling to see any purpose for the current "fastq" output - can you
> give me any example use case? Right now it has to pick an arbitrary
> quality symbol, and uses ASCII 34 (double quote) which means PHRED
> 1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
> Illumina 1.3+ FASTQ file.

It is an alias for fastq-sanger which should be OK.
I prefer to have an output format name for each input format name where it looks sensible, so if we read "fastq" as an input format it should do something on output. Unfortunately that means it has to write quality scores somehow. regards, Peter From biopython at maubp.freeserve.co.uk Thu Sep 17 06:06:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 17 Sep 2009 11:06:16 +0100 Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools. In-Reply-To: <4AB206F0.1040205@ebi.ac.uk> References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com> <20090916065755.GA15425@kunpuu.plessy.org> <1253092337.32439.31.camel@emboss2.ebi.ac.uk> <20090917063659.GA27021@kunpuu.plessy.org> <4AB1E417.3010405@ebi.ac.uk> <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com> <4AB1FF03.80705@ebi.ac.uk> <320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com> <4AB206F0.1040205@ebi.ac.uk> Message-ID: <320fb6e00909170306s328294e9h72717c329baba057@mail.gmail.com> On Thu, Sep 17, 2009 at 10:52 AM, Peter Rice wrote: > >>> What we could do is provide a utility that reads in fastq-sanger format >>> and checks whether the quality scores make most sense as Sanger, >>> SOlexa or Ilumina. >> >> That could be useful - I guess you could scan all the reads building up >> a histogram of the ASCII characters used. This could immediately >> rule out some of the options, and then based on the distribution (if >> you assume they are raw reads) you could make a good guess. > > The ACD file would be 'interesting' We could set the default format to be > "fastq-sanger" and issue some warning if we find the user had tried to > change it. That way the application would run with a filename as the input, > though it will appear to interfaces to be able to read any sequence input. > > Are there rules we can use to decide on improbably qualities? 
Values below
> the Illumina and Solexa minima would seem a good guide, and perhaps
> values above the likely short read maximum score.
>
> Maybe some existing pipelines have some cutoff values we could adopt?

Quite possibly. Telling apart raw Sanger reads and raw Solexa/Illumina
reads should be easy. However, unless there are some ASCII characters
in the range 59 to 63 (Solexa -5 to -1), there isn't going to be a safe
way to tell Solexa and Illumina 1.3+ apart. Of course, if they just have
good reads above Solexa/PHRED 10 (which would be ASCII 74), either
way it isn't going to make much difference.

In any case, it will be heuristic, and sometimes it will get it wrong (e.g.
post-processed Sanger FASTQ files with high scores might look like raw
reads in Solexa/Illumina FASTQ).

>>> I consider reading as fastq-sanger by default to be rather dangerous.
>>
>> That is understandable. How about removing the current "fastq" output
>> then? That might prevent some of the confusion at the moment. I'm
>> struggling to see any purpose for the current "fastq" output - can you
>> give me any example use case? Right now it has to pick an arbitrary
>> quality symbol, and uses ASCII 34 (double quote) which means PHRED
>> 1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
>> Illumina 1.3+ FASTQ file.
>
> It is an alias for fastq-sanger which should be OK. I prefer to have an
> output format name for each input format name where it looks sensible,
> so if we read "fastq" as an input format it should do something on
> output. Unfortunately that means it has to write quality scores somehow.

I'm not convinced that the current "fastq" output (with the double quote
quality string) is entirely "sensible". But I'll drop this now - I've
argued my case, and will leave it at that. As long as the current
behaviour is clear in the documentation, it should be OK.
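
The range-based part of this reasoning can be sketched in Python. This is only an illustration under stated assumptions, not EMBOSS code: the function names are hypothetical, the input is assumed to be strict four-line FASTQ records, and the names "fastq-solexa" and "fastq-illumina" for the other two variants are assumed here (this thread only spells out "fastq-sanger"). It rules variants out by the lowest quality character seen and does not attempt the distribution-based guess.

```python
from collections import Counter

# Lowest ASCII code each variant can contain, as discussed in the thread:
# Sanger starts at 33, Solexa at 59 (score -5), Illumina 1.3+ at 64.
LOWEST_CODE = {
    "fastq-sanger": 33,
    "fastq-solexa": 59,    # format name assumed for illustration
    "fastq-illumina": 64,  # format name assumed for illustration
}
HIGHEST_CODE = 126  # upper limit quoted for Sanger FASTQ


def quality_histogram(lines):
    """Count quality characters, assuming strict four-line FASTQ records."""
    counts = Counter()
    for index, line in enumerate(lines):
        if index % 4 == 3:  # fourth line of each record holds the qualities
            counts.update(line.rstrip("\r\n"))
    return counts


def consistent_variants(counts):
    """Return the variants whose character range covers every character seen."""
    if not counts:
        return []
    codes = [ord(char) for char in counts]
    if max(codes) > HIGHEST_CODE:
        return []
    lowest_seen = min(codes)
    return [name for name, lowest in LOWEST_CODE.items() if lowest_seen >= lowest]


if __name__ == "__main__":
    record = ["@toto\n", "ACTG\n", "+toto\n", "12/3\n"]
    print(consistent_variants(quality_histogram(record)))
```

A file whose lowest quality character is below ASCII 59 can only be Sanger; one whose lowest character is 64 or above remains consistent with all three variants, which is exactly the ambiguous case described above.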
Regards,

Peter

From charles-listes-emboss at plessy.org Fri Sep 18 09:09:46 2009
From: charles-listes-emboss at plessy.org (Charles Plessy)
Date: Fri, 18 Sep 2009 22:09:46 +0900
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <4AB1E417.3010405@ebi.ac.uk>
References: <20090916065755.GA15425@kunpuu.plessy.org>
    <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>
    <20090916065755.GA15425@kunpuu.plessy.org>
    <1253092337.32439.31.camel@emboss2.ebi.ac.uk>
    <20090917063659.GA27021@kunpuu.plessy.org>
    <4AB1E417.3010405@ebi.ac.uk>
Message-ID: <20090918130946.GC16344@kunpuu.plessy.org>

On Thu, Sep 17, 2009 at 08:24:07AM +0100, Peter Rice wrote:
> Charles Plessy wrote:
>
>> I am tempted to apply it also to the Debian EMBOSS package, but maybe it is
>> premature. In particular, I have the following warning each time the quality
>> is encoded by an equal sign:
>>
>> Warning: Illegal character '='
>> Warning: Illegal pattern: =
>
> This is surprising. Is your EMBOSS version the original distribution or
> have you applied the current patches?

Actually, I worked on other files today and could not reproduce the result.
I will file a proper SourceForge bug if it ever comes back...
Have a nice weekend,

-- 
Charles Plessy
Tsurumi, Kanagawa, Japan

From belegdol at gmail.com Tue Sep 29 15:34:56 2009
From: belegdol at gmail.com (Julian Sikorski)
Date: Tue, 29 Sep 2009 21:34:56 +0200
Subject: [EMBOSS] Packaging EMBOSS for Fedora
In-Reply-To:
References: <1244850556.8999.7.camel@login-svr1.ebi.ac.uk>
    <49993.78.105.201.225.1248020309.squirrel@webmail.ebi.ac.uk>
Message-ID:

On 29.07.2009 13:54, Julian Sikorski wrote:
> On 19.07.2009 18:18, uludag at ebi.ac.uk wrote:
>>
>>> there seem to be some problems with make install:
>>>
>>> /usr/bin/make install-exec-hook
>>> make[7]: Entering directory
>>> `/builddir/build/BUILD/EMBOSS-6.1.0/jemboss/org/emboss/jemboss/editor'
>>> mkdir -p --
>>> /builddir/build/BUILDROOT/EMBOSS-6.1.0-1.fc11.x86_64/usr/share/EMBOSS/jemboss/org/emboss/jemboss/editor
>>> /usr/bin/install: cannot stat `*.class': No such file or directory
>>
>> Looks like we didn't test the --with-java and --with-javaos configure
>> options well before this release. However, most users will not need these
>> two options any more as EMBOSS-6.1.0 includes precompiled jemboss class
>> files collected in a java archive file. You should hopefully not get the
>> above error if you omit these two options when you configure your emboss
>> installation.
>>
>> Regards,
>> Mahmut
> Thank you, removing these two seems to have done the trick!
>
> Julian

Another problem arose. It was pointed out to me that a further issue reported
by rpmlint should be fixed:

EMBOSS-libs.x86_64: W: shared-lib-calls-exit /usr/lib64/libeplplot.so.3.2.7 exit@

According to the tool, it means the following:

This library package calls exit() or _exit(), probably in a non-fork()
context. Doing so from a library is strongly discouraged - when a library
function calls exit(), it prevents the calling program from handling the
error, reporting it to the user, closing files properly, and cleaning up any
state that the program has.
It is preferred for the library to return an actual error code and let the calling program decide how to handle the situation. Also, the guidelines [1] say that the .jar files should be compiled from source, so fixing the problem with --with-java would definitely help here. Julian [1] https://fedoraproject.org/wiki/Packaging:Java#Pre-built_JAR_files_.2F_Other_bundled_software From stephen.taylor at imm.ox.ac.uk Thu Sep 10 13:37:18 2009 From: stephen.taylor at imm.ox.ac.uk (Stephen Taylor) Date: Thu, 10 Sep 2009 14:37:18 +0100 Subject: [EMBOSS] PWMs in EMBOSS Message-ID: <37830416-0044-4FA7-8394-8253BF4C671D@imm.ox.ac.uk> Hi, Is it possible to search PWMs in the following format in EMBOSS? Thanks, Steve Name A: 0.157272762655186 0.101522869320827 0.193269908676739 0.400388333956085 0.0242258313143186 0.00799984439409978 0.985140882956166 0.982735821742488 0.00822519796024231 0.00216973239819103 0.915564050240553 0.620804636015513 0.170770180463701 0.288450091623397 0.357825784599729 0.234712937095288 0.12172997792596 C: 0.585688175787352 0.237479093836628 0.435987353829905 0.147300012153726 0.342087245546403 0.0758172877627498 0.00580600999936562 0.00109267794415073 0.00794630235311888 0.00688337464627776 0.000618817602597609 0.0128822871237646 0.230733571623305 0.276650970023954 0.233013950290859 0.285797488742033 0.426544489294796 G: 0.146549771679813 0.564379933620146 0.162348523327608 0.288303072582316 0.0128822871237646 0.000618817602597609 0.00688337464627776 0.00794630235311888 0.00109267794415073 0.00580600999936562 0.0758172877627498 0.342087245546403 0.181573332311972 0.131046633924716 0.187725054521253 0.247196691859296 0.213778652608617 T: 0.110489289877649 0.0966181032223992 0.208394214165748 0.164008581307874 0.620804636015513 0.915564050240553 0.00216973239819103 0.00822519796024231 0.982735821742488 0.985140882956166 0.00799984439409978 0.0242258313143186 0.416922915601023 0.303852304427932 0.221435210588158 0.232292882303383 0.237946880170627 From 
pmr at ebi.ac.uk Thu Sep 10 14:54:25 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 10 Sep 2009 15:54:25 +0100 Subject: [EMBOSS] PWMs in EMBOSS In-Reply-To: <37830416-0044-4FA7-8394-8253BF4C671D@imm.ox.ac.uk> References: <37830416-0044-4FA7-8394-8253BF4C671D@imm.ox.ac.uk> Message-ID: <4AA91321.1020002@ebi.ac.uk> Stephen Taylor wrote: > Hi, > > Is it possible to search PWMs in the following format in EMBOSS? Not yet ... but we can extend the formats for PWMs. We only support the formats that we write. Votes please on position weight matrix formats EMBOSS should be able to read ... regards, Peter From charles-listes-emboss at plessy.org Wed Sep 16 06:57:55 2009 From: charles-listes-emboss at plessy.org (Charles Plessy) Date: Wed, 16 Sep 2009 15:57:55 +0900 Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools. Message-ID: <20090916065755.GA15425@kunpuu.plessy.org> Dear EMBOSS developers, I have multi-sequence file in FASTQ format that contains sequencing reads, and would like to retreive them the with seqret. But as you see in the following example, quality scores are not preserved: $ seqret P13-CA.fq:F1EZY7316JY25B fastq::stdout Reads and writes (returns) sequences @F1EZY7316JY25B rank=0000040 x=3973.0 y=285.0 length=68 AATGATACGGCGACCACCGAACACTGCGTTTGCTGGCTTTGATGCACTTCTCATGGCCAATTTCATTG + """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" The purpose was to use seqret as a workaround for the fact that vectorstrip does not keep the quality either. Do you think that it would be possible to get this functionality as a patch in the future, or is it big work that needs to wait for the next release? Have a nice day, -- Charles Plessy Tsurumi, Kanagawa, Japan From uludag at ebi.ac.uk Wed Sep 16 09:12:16 2009 From: uludag at ebi.ac.uk (Mahmut Uludag) Date: Wed, 16 Sep 2009 10:12:16 +0100 Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools. 
In-Reply-To: <20090916065755.GA15425@kunpuu.plessy.org> References: <20090916065755.GA15425@kunpuu.plessy.org> Message-ID: <1253092337.32439.31.camel@emboss2.ebi.ac.uk> Hi Charles, seqret returns quality scores if the input sequence format is explicitly defined on the command line, such as -sformat=fastq-sanger. The following patch looks like fixes the vectorstrip problem. *** ajseq.c.org 2009-09-16 10:08:17.000000000 +0100 --- ajseq.c 2009-09-16 09:52:56.000000000 +0100 *************** *** 781,786 **** --- 781,792 ---- if (seq->Fttable) pthis->Fttable = ajFeattableCopy(seq->Fttable); + + if (seq->Accuracy) + { + AJCNEW0(pthis->Accuracy,seq->Seq->Len); + memmove(pthis->Accuracy,seq->Accuracy,seq->Seq->Len*sizeof(float)); + } return pthis; } Regards, Mahmut > I have multi-sequence file in FASTQ format that contains sequencing reads, and > would like to retreive them the with seqret. But as you see in the following > example, quality scores are not preserved: > > $ seqret P13-CA.fq:F1EZY7316JY25B fastq::stdout > Reads and writes (returns) sequences > @F1EZY7316JY25B rank=0000040 x=3973.0 y=285.0 length=68 > AATGATACGGCGACCACCGAACACTGCGTTTGCTGGCTTTGATGCACTTCTCATGGCCAATTTCATTG > + > """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" > > The purpose was to use seqret as a workaround for the fact that vectorstrip > does not keep the quality either. > > Do you think that it would be possible to get this functionality as a patch in > the future, or is it big work that needs to wait for the next release? From biopython at maubp.freeserve.co.uk Wed Sep 16 09:31:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Sep 2009 10:31:22 +0100 Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools. 
In-Reply-To: <20090916065755.GA15425@kunpuu.plessy.org>
References: <20090916065755.GA15425@kunpuu.plessy.org>
Message-ID: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com>

On Wed, Sep 16, 2009 at 7:57 AM, Charles Plessy wrote:
>
> Dear EMBOSS developers,
>
> I have a multi-sequence file in FASTQ format that contains sequencing reads, and
> would like to retrieve them with seqret. But as you see in the following
> example, quality scores are not preserved:
>
> $ seqret P13-CA.fq:F1EZY7316JY25B fastq::stdout
> Reads and writes (returns) sequences
> @F1EZY7316JY25B rank=0000040 x=3973.0 y=285.0 length=68
> AATGATACGGCGACCACCGAACACTGCGTTTGCTGGCTTTGATGCACTTCTCATGGCCAATTTCATTG
> +
> """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

You need to use "fastq-sanger" (or the other variants), since in
EMBOSS, "fastq" currently means FASTQ ignoring the qualities.
This is documented:
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

As an EMBOSS user, I think the current situation is confusing,
and it would make much more sense to have "fastq" just an alias
for "fastq-sanger" (which would be consistent with Biopython and
BioPerl).
http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000576.html

And also this email - especially the last example:
http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000599.html

> The purpose was to use seqret as a workaround for the fact that
> vectorstrip does not keep the quality either.

That's also been suggested, and is likely to be supported in future.
http://lists.open-bio.org/pipermail/emboss/2009-August/003722.html

Peter

From charles-listes-emboss at plessy.org Thu Sep 17 06:36:59 2009
From: charles-listes-emboss at plessy.org (Charles Plessy)
Date: Thu, 17 Sep 2009 15:36:59 +0900
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com> <1253092337.32439.31.camel@emboss2.ebi.ac.uk>
References: <20090916065755.GA15425@kunpuu.plessy.org> <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com> <20090916065755.GA15425@kunpuu.plessy.org> <1253092337.32439.31.camel@emboss2.ebi.ac.uk>
Message-ID: <20090917063659.GA27021@kunpuu.plessy.org>

On Wed, Sep 16, 2009 at 10:12:16AM +0100, Mahmut Uludag wrote:
>
> seqret returns quality scores if the input sequence format is explicitly
> defined on the command line, such as -sformat=fastq-sanger.
>
> The following patch looks like it fixes the vectorstrip problem.

On Wed, Sep 16, 2009 at 10:31:22AM +0100, Peter wrote:
>
> You need to use "fastq-sanger" (or the other variants), since in
> EMBOSS, "fastq" currently means FASTQ ignoring the qualities.

Hi Mahmut and Peter,

and thank you very much for your answers!

I would also like it if the qualities were kept by default. I had actually tried
to force the fastq-sanger format before, but by adding its name to the USAs,
as in "seqret fastq-sanger::stdin fastq-sanger::stdout". Unfortunately it did
not work; I do not know if it is by design or because of the dash in the format
name. Nevertheless -sformat=fastq-sanger and -osformat=fastq-sanger worked very
well after I applied Mahmut's patch.

I am tempted to apply it also to the Debian EMBOSS package, but maybe it is
premature.
In particular, I have the following warning each time the quality
is encoded by an equal sign:

Warning: Illegal character '='
Warning: Illegal pattern: =

By the way, I think I found a bug in revseq: it seems that it does not reverse
the qualities:

$ echo -e "@toto\nACTG\n+toto\n12/3" | seqret -filter -sformat=fastq-sanger -osformat=fastq-sanger
@toto
ACTG
+
12/3

$ echo -e "@toto\nACTG\n+toto\n12/3" | revseq -filter -sformat=fastq-sanger -osformat=fastq-sanger
@toto Reversed:
CAGT
+
12/3

Also, contrary to what the documentation predicts, using the fastq format
for the output does not ignore the quality scores. (Not that it would be
particularly useful, but...)

$ echo -e "@toto\nACTG\n+toto\nACTG" | revseq -filter -sformat=fastq-sanger -osformat=fastq
@toto Reversed:
CAGT
+
ACTG

Have a nice day,

--
Charles Plessy http://charles.plessy.org
Tsurumi, Kanagawa, Japan

From pmr at ebi.ac.uk Thu Sep 17 07:24:07 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 17 Sep 2009 08:24:07 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <20090917063659.GA27021@kunpuu.plessy.org>
References: <20090916065755.GA15425@kunpuu.plessy.org> <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com> <20090916065755.GA15425@kunpuu.plessy.org> <1253092337.32439.31.camel@emboss2.ebi.ac.uk> <20090917063659.GA27021@kunpuu.plessy.org>
Message-ID: <4AB1E417.3010405@ebi.ac.uk>

Charles Plessy wrote:
> I would also like it if the qualities were kept by default. I had actually tried
> to force the fastq-sanger format before, but by adding its name to the USAs,
> as in "seqret fastq-sanger::stdin fastq-sanger::stdout". Unfortunately it did
> not work; I do not know if it is by design or because of the dash in the format
> name. Nevertheless -sformat=fastq-sanger and -osformat=fastq-sanger worked very
> well after I applied Mahmut's patch.

Yes, the dash in the format name is causing problems.
It should be allowed where there is a '::' in the USA (it is not allowed in
database queries because of the dbname-field:value query syntax).

I will make a patch for this.

> I am tempted to apply it also to the Debian EMBOSS package, but maybe it is
> premature. In particular, I have the following warning each time the quality
> is encoded by an equal sign:
>
> Warning: Illegal character '='
> Warning: Illegal pattern: =

This is surprising. Is your EMBOSS version the original distribution or
have you applied the current patches?

If it fails with the patched version, could you send me an input file
that causes this error?

> By the way, I think I found a bug in revseq: it seems that it does not reverse
> the qualities:

True ... this I will also patch. We have used qualities for some years
(in Staden experiment format) but it appears nobody has reversed
sequences and kept the qualities. Life is changing with FASTQ data!

> Also, contrary to what the documentation predicts, using the fastq format
> for the output does not ignore the quality scores. (Not that it would be
> particularly useful, but...)

This is deliberate. We have to write something in FASTQ format and we
default to the fastq-sanger format. On input, fastq ignores qualities
because there is no safe way to decide which format is correct.

regards,

Peter

From biopython at maubp.freeserve.co.uk Thu Sep 17 09:11:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 10:11:15 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <4AB1E417.3010405@ebi.ac.uk>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com> <20090916065755.GA15425@kunpuu.plessy.org> <1253092337.32439.31.camel@emboss2.ebi.ac.uk> <20090917063659.GA27021@kunpuu.plessy.org> <4AB1E417.3010405@ebi.ac.uk>
Message-ID: <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>

On Thu, Sep 17, 2009 at 8:24 AM, Peter Rice wrote:
>
>> Also, contrary to what the documentation predicts, using the fastq
>> format for the output does not ignore the quality scores. (Not that
>> it would be particularly useful, but...)
>
> This is deliberate. We have to write something in FASTQ format and we
> default to the fastq-sanger format. On input, fastq ignores qualities
> because there is no safe way to decide which format is correct.

So again, could you reconsider making "fastq" act like "fastq-sanger"?
The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
a superset of the Solexa/Illumina FASTQ variants - so even if you don't
know which kind of FASTQ file you have, and you don't care about the
qualities, parsing it as a Sanger FASTQ file will work.

Peter C.

From pmr at ebi.ac.uk Thu Sep 17 09:18:59 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 17 Sep 2009 10:18:59 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com> <20090916065755.GA15425@kunpuu.plessy.org> <1253092337.32439.31.camel@emboss2.ebi.ac.uk> <20090917063659.GA27021@kunpuu.plessy.org> <4AB1E417.3010405@ebi.ac.uk> <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com>
Message-ID: <4AB1FF03.80705@ebi.ac.uk>

Peter C. wrote:
> On Thu, Sep 17, 2009 at 8:24 AM, Peter Rice wrote:
>>> Also, contrary to what the documentation predicts, using the fastq
>>> format for the output does not ignore the quality scores.
>>> (Not that it would be particularly useful, but...)
>> This is deliberate. We have to write something in FASTQ format and we
>> default to the fastq-sanger format. On input, fastq ignores qualities
>> because there is no safe way to decide which format is correct.
>
> So again, could you reconsider making "fastq" act like "fastq-sanger"?
> The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
> a superset of the Solexa/Illumina FASTQ variants - so even if you don't
> know which kind of FASTQ file you have, and you don't care about the
> qualities, parsing it as a Sanger FASTQ file will work.

Yes, but it is dangerous if they could really be Solexa qualities.

What we could do is provide a utility that reads in fastq-sanger format
and checks whether the quality scores make most sense as Sanger, Solexa
or Illumina.

I consider reading as fastq-sanger by default to be rather dangerous.

Peter

From biopython at maubp.freeserve.co.uk Thu Sep 17 09:32:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 10:32:21 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <4AB1FF03.80705@ebi.ac.uk>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com> <20090916065755.GA15425@kunpuu.plessy.org> <1253092337.32439.31.camel@emboss2.ebi.ac.uk> <20090917063659.GA27021@kunpuu.plessy.org> <4AB1E417.3010405@ebi.ac.uk> <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com> <4AB1FF03.80705@ebi.ac.uk>
Message-ID: <320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com>

On Thu, Sep 17, 2009 at 10:18 AM, Peter Rice wrote:
>
>> So again, could you reconsider making "fastq" act like "fastq-sanger"?
>> The Sanger FASTQ format allows ASCII 33 to 126 for the quality scores,
>> a superset of the Solexa/Illumina FASTQ variants - so even if you don't
>> know which kind of FASTQ file you have, and you don't care about the
>> qualities, parsing it as a Sanger FASTQ file will work.
>
> Yes, but it is dangerous if they could really be Solexa qualities.

Indeed, or an Illumina 1.3+ encoded FASTQ file.

So if the EMBOSS tools are used to read a FASTQ file without specifying
the FASTQ variant, do they currently detect it is FASTQ and default to the
"fastq" setting and ignore the quality information?

> What we could do is provide a utility that reads in fastq-sanger format and
> checks whether the quality scores make most sense as Sanger, Solexa or
> Illumina.

That could be useful - I guess you could scan all the reads building up
a histogram of the ASCII characters used. This could immediately
rule out some of the options, and then based on the distribution (if
you assume they are raw reads) you could make a good guess.

> I consider reading as fastq-sanger by default to be rather dangerous.

That is understandable. How about removing the current "fastq" output
then? That might prevent some of the confusion at the moment. I'm
struggling to see any purpose for the current "fastq" output - can you
give me any example use case? Right now it has to pick an arbitrary
quality symbol, and uses ASCII 34 (double quote) which means PHRED
1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
Illumina 1.3+ FASTQ file.

Regards,

Peter

From pmr at ebi.ac.uk Thu Sep 17 09:52:48 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 17 Sep 2009 10:52:48 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com> <20090916065755.GA15425@kunpuu.plessy.org> <1253092337.32439.31.camel@emboss2.ebi.ac.uk> <20090917063659.GA27021@kunpuu.plessy.org> <4AB1E417.3010405@ebi.ac.uk> <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com> <4AB1FF03.80705@ebi.ac.uk> <320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com>
Message-ID: <4AB206F0.1040205@ebi.ac.uk>

Peter C.
wrote:
> So if the EMBOSS tools are used to read a FASTQ file without specifying
> the FASTQ variant, do they currently detect it is FASTQ and default to the
> "fastq" setting and ignore the quality information?

Yes, exactly so. Reading the sequence data is safe, and may be all the
user wanted to do.

>> What we could do is provide a utility that reads in fastq-sanger format and
>> checks whether the quality scores make most sense as Sanger, Solexa or
>> Illumina.
>
> That could be useful - I guess you could scan all the reads building up
> a histogram of the ASCII characters used. This could immediately
> rule out some of the options, and then based on the distribution (if
> you assume they are raw reads) you could make a good guess.

The ACD file would be 'interesting'. We could set the default format to be
"fastq-sanger" and issue some warning if we find the user had tried to
change it. That way the application would run with a filename as the
input, though it will appear to interfaces to be able to read any
sequence input.

Are there rules we can use to decide on improbable qualities? Values
below the Illumina and Solexa minima would seem a good guide, and
perhaps values above the likely short read maximum score.

Maybe some existing pipelines have some cutoff values we could adopt?

>> I consider reading as fastq-sanger by default to be rather dangerous.
>
> That is understandable. How about removing the current "fastq" output
> then? That might prevent some of the confusion at the moment. I'm
> struggling to see any purpose for the current "fastq" output - can you
> give me any example use case? Right now it has to pick an arbitrary
> quality symbol, and uses ASCII 34 (double quote) which means PHRED
> 1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
> Illumina 1.3+ FASTQ file.

It is an alias for fastq-sanger which should be OK.
I prefer to have an output format name for each input format name where
it looks sensible, so if we read "fastq" as an input format it should do
something on output. Unfortunately that means it has to write quality
scores somehow.

regards,

Peter

From biopython at maubp.freeserve.co.uk Thu Sep 17 10:06:16 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 17 Sep 2009 11:06:16 +0100
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <4AB206F0.1040205@ebi.ac.uk>
References: <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com> <20090916065755.GA15425@kunpuu.plessy.org> <1253092337.32439.31.camel@emboss2.ebi.ac.uk> <20090917063659.GA27021@kunpuu.plessy.org> <4AB1E417.3010405@ebi.ac.uk> <320fb6e00909170211q19737a32mfa7caa8ba8ad3e7d@mail.gmail.com> <4AB1FF03.80705@ebi.ac.uk> <320fb6e00909170232t2cde71dam3b768ab8cd87bac1@mail.gmail.com> <4AB206F0.1040205@ebi.ac.uk>
Message-ID: <320fb6e00909170306s328294e9h72717c329baba057@mail.gmail.com>

On Thu, Sep 17, 2009 at 10:52 AM, Peter Rice wrote:
>
>>> What we could do is provide a utility that reads in fastq-sanger format
>>> and checks whether the quality scores make most sense as Sanger,
>>> Solexa or Illumina.
>>
>> That could be useful - I guess you could scan all the reads building up
>> a histogram of the ASCII characters used. This could immediately
>> rule out some of the options, and then based on the distribution (if
>> you assume they are raw reads) you could make a good guess.
>
> The ACD file would be 'interesting'. We could set the default format to be
> "fastq-sanger" and issue some warning if we find the user had tried to
> change it. That way the application would run with a filename as the input,
> though it will appear to interfaces to be able to read any sequence input.
>
> Are there rules we can use to decide on improbable qualities?
> Values below the Illumina and Solexa minima would seem a good guide,
> and perhaps values above the likely short read maximum score.
>
> Maybe some existing pipelines have some cutoff values we could adopt?

Quite possibly. Telling apart raw Sanger reads and raw Solexa/Illumina
reads should be easy. However, unless there are some ASCII characters
in the range 59 to 63 (Solexa -5 to -1), there isn't going to be a safe way
to tell Solexa and Illumina 1.3+ apart. Of course, if they just have good
reads above Solexa/PHRED 10 (which would be ASCII 74), either way it
isn't going to make much difference.

In any case, it will be heuristic, and sometimes it will get it wrong (e.g.
post processed Sanger FASTQ files with high scores might look like raw
reads in Solexa/Illumina FASTQ).

>>> I consider reading as fastq-sanger by default to be rather dangerous.
>>
>> That is understandable. How about removing the current "fastq" output
>> then? That might prevent some of the confusion at the moment. I'm
>> struggling to see any purpose for the current "fastq" output - can you
>> give me any example use case? Right now it has to pick an arbitrary
>> quality symbol, and uses ASCII 34 (double quote) which means PHRED
>> 1 (random) for a Sanger FASTQ file but is invalid as a Solexa or
>> Illumina 1.3+ FASTQ file.
>
> It is an alias for fastq-sanger which should be OK. I prefer to have an
> output format name for each input format name where it looks sensible,
> so if we read "fastq" as an input format it should do something on
> output. Unfortunately that means it has to write quality scores somehow.

I'm not convinced that the current "fastq" output (with the double quote
quality string) is entirely "sensible". But I'll drop this now - I've argued
my case, and will leave it at that. As long as the current behaviour is
clear in the documentation, it should be OK.
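
For illustration only, the range check discussed in this message (scan the quality lines, build a histogram of the ASCII characters, rule out variants whose range is violated) might be sketched in Python roughly as follows. This is not EMBOSS code. The function names are hypothetical, the input is assumed to be strict four-line FASTQ records, and only the Sanger range (33 to 126) and the Solexa minimum (59) come from this thread; the Illumina 1.3+ minimum of 64 and both upper bounds are assumptions.

```python
from collections import Counter

# ASCII ranges per variant. Sanger 33-126 and the Solexa minimum of 59 are
# quoted in the thread; the Illumina 1.3+ minimum (64) and the upper bounds
# for Solexa/Illumina are assumptions made for this sketch.
VARIANT_RANGES = {
    "fastq-sanger": (33, 126),
    "fastq-solexa": (59, 126),
    "fastq-illumina": (64, 126),
}


def quality_histogram(lines):
    """Count quality characters, assuming strict four-line FASTQ records."""
    counts = Counter()
    for index, line in enumerate(lines):
        if index % 4 == 3:  # the fourth line of each record holds the qualities
            counts.update(line.rstrip("\r\n"))
    return counts


def possible_variants(counts):
    """Return the variants whose ASCII range covers every observed character."""
    if not counts:
        return []
    lowest = min(map(ord, counts))
    highest = max(map(ord, counts))
    return [name for name, (low, high) in VARIANT_RANGES.items()
            if low <= lowest and highest <= high]


if __name__ == "__main__":
    # The small test record used earlier in this thread.
    record = ["@toto", "ACTG", "+toto", "12/3"]
    print(possible_variants(quality_histogram(record)))
```

A check like this can only exclude variants. When more than one name is returned, the file is ambiguous on range alone, and any further guess would have to rely on the score distribution, with the risk of misclassification noted above.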

Regards,

Peter

From charles-listes-emboss at plessy.org Fri Sep 18 13:09:46 2009
From: charles-listes-emboss at plessy.org (Charles Plessy)
Date: Fri, 18 Sep 2009 22:09:46 +0900
Subject: [EMBOSS] Conservation of FASTQ scores by the EMBOSS tools.
In-Reply-To: <4AB1E417.3010405@ebi.ac.uk>
References: <20090916065755.GA15425@kunpuu.plessy.org> <320fb6e00909160231r47555664vc3fdb77346c3a821@mail.gmail.com> <20090916065755.GA15425@kunpuu.plessy.org> <1253092337.32439.31.camel@emboss2.ebi.ac.uk> <20090917063659.GA27021@kunpuu.plessy.org> <4AB1E417.3010405@ebi.ac.uk>
Message-ID: <20090918130946.GC16344@kunpuu.plessy.org>

On Thu, Sep 17, 2009 at 08:24:07AM +0100, Peter Rice wrote:
> Charles Plessy wrote:
>
>> I am tempted to apply it also to the Debian EMBOSS package, but maybe it is
>> premature. In particular, I have the following warning each time the quality
>> is encoded by an equal sign:
>>
>> Warning: Illegal character '='
>> Warning: Illegal pattern: =
>
> This is surprising. Is your EMBOSS version the original distribution or
> have you applied the current patches?

Actually, I worked on other files today and could not reproduce the result. I
will file a proper SourceForge bug if it ever comes back...

Have a nice weekend,

--
Charles Plessy
Tsurumi, Kanagawa, Japan

From belegdol at gmail.com Tue Sep 29 19:34:56 2009
From: belegdol at gmail.com (Julian Sikorski)
Date: Tue, 29 Sep 2009 21:34:56 +0200
Subject: [EMBOSS] Packaging EMBOSS for Fedora
In-Reply-To:
References: <1244850556.8999.7.camel@login-svr1.ebi.ac.uk> <49993.78.105.201.225.1248020309.squirrel@webmail.ebi.ac.uk>
Message-ID:

On 29.07.2009 13:54, Julian Sikorski wrote:
> On 19.07.2009 18:18, uludag at ebi.ac.uk wrote:
>>
>>> there seem to be some problems with make install:
>>>
>>> /usr/bin/make install-exec-hook
>>> make[7]: Entering directory
>>> `/builddir/build/BUILD/EMBOSS-6.1.0/jemboss/org/emboss/jemboss/editor'
>>> mkdir -p --
>>> /builddir/build/BUILDROOT/EMBOSS-6.1.0-1.fc11.x86_64/usr/share/EMBOSS/jemboss/org/emboss/jemboss/editor
>>> /usr/bin/install: cannot stat `*.class': No such file or directory
>>
>> Looks like we didn't test the --with-java and --with-javaos configure
>> options well, before this release. However, most users will not need these
>> two options any more as EMBOSS-6.1.0 includes precompiled jemboss class
>> files collected in a java archive file. You should hopefully not get the
>> above error if you omit these two options when you configure your emboss
>> installation.
>>
>> Regards,
>> Mahmut
> Thank you, removing these two seems to have done the trick!
>
> Julian

Another problem arose. It was pointed out to me that another problem
reported by rpmlint should be fixed:

EMBOSS-libs.x86_64: W: shared-lib-calls-exit /usr/lib64/libeplplot.so.3.2.7 exit@

According to the tool, it means the following:

This library package calls exit() or _exit(), probably in a non-fork()
context. Doing so from a library is strongly discouraged - when a library
function calls exit(), it prevents the calling program from handling the
error, reporting it to the user, closing files properly, and cleaning up any
state that the program has.
It is preferred for the library to return an actual error code and let the
calling program decide how to handle the situation.

Also, the guidelines [1] say that the .jar files should be compiled from
source, so fixing the problem with --with-java would definitely help here.

Julian

[1] https://fedoraproject.org/wiki/Packaging:Java#Pre-built_JAR_files_.2F_Other_bundled_software