From pmr at ebi.ac.uk Mon Aug 3 11:31:41 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Mon, 03 Aug 2009 16:31:41 +0100
Subject: [emboss-dev] EMBOSS and its FASTA like alignment output
In-Reply-To: <320fb6e00907210352l76503d38n37e4dc4fc0f4cc33@mail.gmail.com>
References: <320fb6e00907210352l76503d38n37e4dc4fc0f4cc33@mail.gmail.com>
Message-ID: <4A7702DD.5070507@ebi.ac.uk>

Peter wrote:
> Hi,
>
> One of the many things I talked to Peter Rice about in Sweden
> was the Pearson FASTA like output from needle and water (e.g.
> what EMBOSS calls the markx10 output format), and why it
> includes the EMBOSS header and footer lines (which start with
> a # character), which are not present in real FASTA output.
>
> Biopython can parse the pairwise -m 10 output from Bill
> Pearson's FASTA tools, so in theory we (Biopython) should
> be able to parse the markx10 output from EMBOSS needle
> and water. We could probably cope with the extra header
> and footer, but I think it would be best if EMBOSS could
> produce something more closely matching the real FASTA
> output. Unfortunately, it appears to be more than just the
> headers which upset our parser - even ignoring them,
> EMBOSS markx10 output still looks rather different to
> (current) FASTA -m 10 output. Was the markx10 output
> mimicking a particular (old) version of the FASTA tools?

I have checked the latest FASTA3 and FASTA2 tools from Bill Pearson.

What does BioPython expect as "markx10" and the other markx formats?

There are extra lines reporting equivalent data to the EMBOSS alignment headers which we could include, but I would like to know there is a parser that can accept them as markx* format in each case.
In this case "more closely matching" may not be close enough :-)

regards,

Peter Rice

From biopython at maubp.freeserve.co.uk Mon Aug 3 13:12:09 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 3 Aug 2009 18:12:09 +0100
Subject: [emboss-dev] EMBOSS and its FASTA like alignment output
In-Reply-To: <4A7702DD.5070507@ebi.ac.uk>
References: <320fb6e00907210352l76503d38n37e4dc4fc0f4cc33@mail.gmail.com> <4A7702DD.5070507@ebi.ac.uk>
Message-ID: <320fb6e00908031012j2a05dcdcr401c80025eaabf4f@mail.gmail.com>

On Mon, Aug 3, 2009 at 4:31 PM, Peter Rice wrote:
>
> Peter wrote:
>> Hi,
>>
>> One of the many things I talked to Peter Rice about in Sweden
>> was the Pearson FASTA like output from needle and water (e.g.
>> what EMBOSS calls the markx10 output format), and why it
>> includes the EMBOSS header and footer lines (which start with
>> a # character), which are not present in real FASTA output.
>>
>> Biopython can parse the pairwise -m 10 output from Bill
>> Pearson's FASTA tools, so in theory we (Biopython) should
>> be able to parse the markx10 output from EMBOSS needle
>> and water. We could probably cope with the extra header
>> and footer, but I think it would be best if EMBOSS could
>> produce something more closely matching the real FASTA
>> output. Unfortunately, it appears to be more than just the
>> headers which upset our parser - even ignoring them,
>> EMBOSS markx10 output still looks rather different to
>> (current) FASTA -m 10 output. Was the markx10 output
>> mimicking a particular (old) version of the FASTA tools?
>
> I have checked the latest FASTA3 and FASTA2 tools from
> Bill Pearson.
>
> What does BioPython expect as "markx10" and the other
> markx formats?

We only support the "-m 10" output format from the FASTA tools, which is intended to be machine readable, i.e. what EMBOSS tries to mimic with "markx10". So I am not worried about the other markx formats that EMBOSS can produce.
> There are extra lines reporting equivalent data to the EMBOSS alignment
> headers which we could include, but I would like to know there is a
> parser that can accept them as markx* format in each case.
>
> In this case "more closely matching" may not be close enough :-)

Something by eye that looked "wrong" in the EMBOSS markx10 output concerns the ">" lines. In particular, I expect to see lines starting " 1>>>identifier", " 2>>>identifier", ... to indicate the start of each result set for each query. EMBOSS doesn't output these. In the case of needle and water as things stand, you only ever have one query sequence (although we have discussed a "superneedle" and "superwater" as possible enhancements), so there would only be one such line.

Beyond that, I'd have to dig a little deeper into our code, feeding it EMBOSS markx10 output with the header/footer removed, and see where it falls over. Things like the histogram are optional and we ignore them anyway.

I am happy to test patches (off the list if you prefer). (Although I would prioritise the FASTQ stuff first.)

Regards,

Peter C.

From biopython at maubp.freeserve.co.uk Mon Aug 10 12:13:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 10 Aug 2009 17:13:24 +0100
Subject: [emboss-dev] Repeated "unknown output format" messages from seqret
Message-ID: <320fb6e00908100913pddbc5eco15e7c0b8831f7abb@mail.gmail.com>

Hi all,

Peter has mentioned suppressing repeated error messages (e.g. if on converting from Sanger FASTQ to Solexa FASTQ the quality scores must be truncated at a maximum of 62). I've got another example which is probably going to be more common:

$ seqret -filter -sformat fastq-sanger -osformat fastq-illuminaa < SRR001666_1.fastq | grep "^@SRR" | wc -l
Error: Unknown output format 'fastq-illuminaa'
Error: Unknown output format 'fastq-illuminaa'
Error: unknown output format 'fastq-illuminaa'
Error: unknown output format 'fastq-illuminaa'
...
I think getting the "unknown output format" message once is enough, and 7 million times is overkill ;)

This was of course a user error - a simple typo - what I meant was:

$ seqret -filter -sformat fastq-sanger -osformat fastq-illumina < SRR001666_1.fastq | grep "^@SRR" | wc -l
7047668

Thanks!

Peter C.

From pmr at ebi.ac.uk Mon Aug 10 12:50:19 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Mon, 10 Aug 2009 17:50:19 +0100
Subject: [emboss-dev] Repeated "unknown output format" messages from seqret
In-Reply-To: <320fb6e00908100913pddbc5eco15e7c0b8831f7abb@mail.gmail.com>
References: <320fb6e00908100913pddbc5eco15e7c0b8831f7abb@mail.gmail.com>
Message-ID: <4A804FCB.8040601@ebi.ac.uk>

Peter wrote:
> Hi all,
>
> Peter has mentioned suppressing repeated error messages (e.g. if
> on converting from Sanger FASTQ to Solexa FASTQ the quality
> scores must be truncated at a maximum of 62). I've got another
> example which is probably going to be more common:
>
> $ seqret -filter -sformat fastq-sanger -osformat fastq-illuminaa <
> SRR001666_1.fastq | grep "^@SRR" | wc -l
> Error: Unknown output format 'fastq-illuminaa'
> Error: Unknown output format 'fastq-illuminaa'
> Error: unknown output format 'fastq-illuminaa'
> Error: unknown output format 'fastq-illuminaa'
> ...
>
> I think getting the "unknown output format" message once is enough,
> and 7 million times is overkill ;)

Odd that nobody has spotted that one before. It will be fixed in a future release.

We will aim to report each message once, and then when the program ends we can report how many more times the same message was repeated. We hope to be able to catch several repeated messages because with FASTQ next-generation data there can indeed be millions of sequences each with the same error or warning.
regards,

Peter Rice

From biopython at maubp.freeserve.co.uk Mon Aug 10 16:35:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 10 Aug 2009 21:35:02 +0100
Subject: [emboss-dev] Repeated "unknown output format" messages from seqret
In-Reply-To: <4A804FCB.8040601@ebi.ac.uk>
References: <320fb6e00908100913pddbc5eco15e7c0b8831f7abb@mail.gmail.com> <4A804FCB.8040601@ebi.ac.uk>
Message-ID: <320fb6e00908101335k6711ccf6ibd0a197886c388bd@mail.gmail.com>

On Mon, Aug 10, 2009 at 5:50 PM, Peter Rice wrote:
>
>> I think getting the "unknown output format" message once is enough,
>> and 7 million times is overkill ;)
>
> Odd that nobody has spotted that one before. It will be fixed in a
> future release.

Maybe most people type more accurately than me?

> We will aim to report each message once, and then when the program ends
> we can report how many more times the same message was repeated. We hope
> to be able to catch several repeated messages because with FASTQ
> next-generation data there can indeed be millions of sequences each with
> the same error or warning.

Good plan :)

Thanks

Peter

From pmr at ebi.ac.uk Tue Aug 11 04:29:09 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 11 Aug 2009 09:29:09 +0100
Subject: [emboss-dev] Repeated "unknown output format" messages from seqret
In-Reply-To: <320fb6e00908101335k6711ccf6ibd0a197886c388bd@mail.gmail.com>
References: <320fb6e00908100913pddbc5eco15e7c0b8831f7abb@mail.gmail.com> <4A804FCB.8040601@ebi.ac.uk> <320fb6e00908101335k6711ccf6ibd0a197886c388bd@mail.gmail.com>
Message-ID: <4A812BD5.2020807@ebi.ac.uk>

Peter wrote:
> On Mon, Aug 10, 2009 at 5:50 PM, Peter Rice wrote:
>>> I think getting the "unknown output format" message once is enough,
>>> and 7 million times is overkill ;)
>> Odd that nobody has spotted that one before. It will be fixed in a
>> future release.
>
> Maybe most people type more accurately than me?

Nobody noticed it appears more than once when you have more than one input sequence.
It should of course be checked when the command line is processed, not left until run time when the sequence is written out.

regards,

Peter

From biopython at maubp.freeserve.co.uk Thu Aug 20 06:28:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 Aug 2009 11:28:21 +0100
Subject: [emboss-dev] Use more .cvsignore files in repository?
Message-ID: <320fb6e00908200328xb9684eep6aaeba90fe4e442f@mail.gmail.com>

Hi all,

I've been able to checkout and build EMBOSS from CVS following the instructions here:

http://emboss.sourceforge.net/developers/cvs.html

I've noticed when doing "cvs update" that there are a lot of extra files created by doing an in-place build (as recommended by the above doc) which CVS has not been told to ignore. It would help to ignore the "Makefile" and "Makefile.in" entries in many directories, /emboss/emboss/ajax/*.lo, as well as the expected compiled tools under the "emboss" subdirectory, by listing them in .cvsignore files.

Peter

P.S. I would be happy to make these minor changes with CVS access and EMBOSS's permission. Depending on how the OBF setup the CVS server, I may in fact already have CVS access (since Biopython CVS is on the same machine).
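
Illustration for the seqret thread above: Peter Rice's plan there is to report each message once and then, when the program ends, say how many more times the same message was repeated. A minimal Python sketch of that idea follows. It is not EMBOSS code (EMBOSS is written in C), and the names OnceReporter, report and summary are hypothetical, chosen only for this example.

```python
import sys
from collections import Counter


class OnceReporter:
    """Print each distinct message once and count the repeats.

    Hypothetical helper for illustration only; not part of EMBOSS.
    """

    def __init__(self, stream=None):
        # Default to stderr, where error messages would normally go.
        self.stream = stream if stream is not None else sys.stderr
        self.counts = Counter()

    def report(self, message):
        """Record a message, writing it out only the first time it is seen."""
        self.counts[message] += 1
        if self.counts[message] == 1:
            print(message, file=self.stream)

    def summary(self):
        """Return one line for each message that occurred more than once."""
        return [
            f"{message} (repeated {count - 1} more times)"
            for message, count in self.counts.items()
            if count > 1
        ]


if __name__ == "__main__":
    reporter = OnceReporter()
    # Simulate the same error being raised for every sequence in a file.
    for _ in range(5):
        reporter.report("Error: unknown output format 'fastq-illuminaa'")
    # At program end, report the repeat counts.
    for line in reporter.summary():
        print(line, file=sys.stderr)
```

Keying the counter on the full message text means several different repeated messages can be tracked at once, which matches the stated hope of catching more than one kind of repeated error or warning in large FASTQ runs.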