From pmr at ebi.ac.uk Thu Feb 20 06:34:13 2003
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 20 Feb 2003 11:34:13 +0000
Subject: ACD file and emboss.default file syntax
Message-ID: <3E54BD35.6050509@ebi.ac.uk>

I am cleaning up the parsing of both ACD files and the emboss.default
files. This includes adding diagnostic messages to say what problems
were found and to report the line number (and filename).

Showdb will carry out additional checks on the emboss.default and
~/.embossrc files (valid sequence formats, for example). There is no
need to run these every time the files are read.

At the same time, some of the syntax can be tightened. For example, ACD
files allowed some strange characters that were never used (parentheses
instead of quotes, "=" instead of ":"). These will be removed.

How far should this go? In particular, should white space be required
after a ":" or around "[" and "]" characters?

There are also differences in the definitions of comments. In ACD files
any text after a "#" is ignored. In emboss.default comments must start
at the beginning of the line. This seems preferable as occasionally a
"#" character could be useful in a definition.

For example, both of the following are valid ACD definitions:

#################################
# Full definition from acdpretty
#################################

integer: minlen [
  required: "Y"
  minimum: "1"
  maximum: "50"
  default: "6"
  information: "Minimum length"
]

int:minlen [req:Y min:1 max:50 def:6 info:"Minimum length"] #compact

The first is preferred (and generated by -acdpretty). I would like to
make it *required* so that other ACD parsers (e.g. for GUI definitions)
can cope better.

The changes would be:

1. White space is required after "attribute:"

2. White space is required before and after "[" and "]"

3. Any "#" character at the start of a line is a comment and the line
   will be ignored. Any "#" within a line is part of the definition.

Extra questions are:

4. Should the ACD types (integer, string, ...)
be specified in full? ACD can cope easily with unambiguous
abbreviations, so I prefer to keep the short forms, but perhaps parsers
have problems. These files are created by the developers so we can
update them. One option is to generate warning messages, and to run
acdpretty to fix them before committing the ACD files to CVS.

5. Should the emboss.default types (env, dbname) be specified in full?
   Parsing can cope easily with unambiguous abbreviations, and "db" in
   place of "dbname" is common. These files are created by the site
   administrators and by individual users, so we should avoid breaking
   their existing definitions. But note we have synonyms (env/set) so we
   could allow "db" or "dbname" as alternatives.

6. Should the ACD attribute names (required, information, ...) be
   abbreviated (see question 4)?

7. Should the database (and any other emboss.default) attribute names be
   abbreviated (see question 5)?

Peter Rice

From jkb at mrc-lmb.cam.ac.uk Thu Feb 20 07:04:39 2003
From: jkb at mrc-lmb.cam.ac.uk (James Bonfield)
Date: Thu, 20 Feb 2003 12:04:39 +0000
Subject: ACD file and emboss.default file syntax
In-Reply-To: <3E54BD35.6050509@ebi.ac.uk>; from pmr@ebi.ac.uk on Thu, Feb 20, 2003 at 11:34:13AM +0000
References: <3E54BD35.6050509@ebi.ac.uk>
Message-ID: <20030220120439.A13733@arran.mrc-lmb.cam.ac.uk>

On Thu, Feb 20, 2003 at 11:34:13AM +0000, Peter Rice wrote:
> I am cleaning up the parsing of both ACD files and the emboss.default
> files. This includes adding diagnostic messages to say what problems
> were found and to report the line number (and filename).

Diagnostic messages are a definite help.

> At the same time, some of the syntax can be tightened. For example, ACD
> files allowed some strange characters that were never used (parentheses
> instead of quotes, "=" instead of ":"). These will be removed.

For what it's worth the ACD parser in Spin (staden package) copes with
these things already, and also the variations in spaces.
However I freely admit that it may have been easier to develop with the
changes you propose and so it sounds like a sensible way of promoting
more interfaces.

I'm not 100% convinced on that though; by far the easiest way of helping
people is to provide a full and complete BNF grammar. The documentation
was not sufficiently clear as I recall. I ended up writing my own
version of lex in Tcl and a hand coded parser. Eg I use regexp matching
for identifiers and it's no harder to match regexp
"[^ \t\n:=]+[ \t\n]*[:=]" than "[^:]+:", although the latter is
obviously more readable.

> There are also differences in the definitions of comments. In ACD files
> any text after a "#" is ignored. In emboss.default comments must start
> at the beginning of the line. This seems preferable as occasionally a
> "#" character could be useful in a definition.

This is one change which could cause problems. Existing parsers should,
in theory, already be handling the complexities of : vs =, different
quoting syntaxes, and varying whitespace. So the changes to these will
help new code and not have any effect on existing parsers. Changing
comments though will make existing parsers parse incorrectly on files
where # is used in a definition. However I guess the change needs to be
made due to the points you make (it possibly being a useful character).

> 6. Should the ACD attribute names (required, information, ...) be
> abbreviated (see question 4)?

My approach was to specify the grammar at a higher level of ID, STRING,
etc and then use code for matching ID against a known database of words.
Literally:

    foreach word {application information default required \
                  optional expected documentation outfile \
                  parameter needed delimiter codedelimiter \
                  values selection minimum maximum dirlist} {
        if {[string match -nocase ${id_v}* $word]} {
            set id_v $word
            break
        }
    }

Having full names though makes the 'lex' type part easier as the
tokenising can break things down into more specific words: APPLICATION,
INTEGER, etc rather than just ID. Although it's possible to do this with
regexps right now if you're willing to put up with regexps like
"var(i(a(b(l(e)?)?)?)?)?".

Oddly I dealt with types and attributes in a slightly different way, so
I can only deal with "int" and "integer" and not "integ". I'm not sure
why I did it that way though; sloppiness it seems.

James

--
James Bonfield (jkb at mrc-lmb.cam.ac.uk)   Fax: (+44) 01223 213556
Medical Research Council - Laboratory of Molecular Biology,
Hills Road, Cambridge, CB2 2QH, England.
Also see Staden Package WWW site at http://www.mrc-lmb.cam.ac.uk/pubseq/

From gbottu at ben.vub.ac.be Fri Feb 21 11:18:14 2003
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Fri, 21 Feb 2003 17:18:14 +0100 (CET)
Subject: ACD file and emboss.default file syntax
Message-ID: <200302211618.h1LGIE3F1069274@black.vub.ac.be>

from : BEN

>
> I am cleaning up the parsing of both ACD files and the emboss.default
> files. This includes adding diagnostic messages to say what problems
> were found and to report the line number (and filename).
>
>
>
> There are also differences in the definitions of comments. In ACD files
> any text after a "#" is ignored. In emboss.default comments must start
> at the beginning of the line. This seems preferable as occasionally a
> "#" character could be useful in a definition.

I agree with that.

By the way : the file eprimer3.acd contains :

help: "The maximum allowed melting temperature of the amplicon.
Product Tm is calculated using the formula from Bolton and McCarthy,
PNAS 84:1390 (1962) as presented in Sambrook, Fritsch and Maniatis,
Molecular Cloning, p 11.46 (1989, CSHL Press). \
Tm = 81.5 + 16.6(log10[Na+]) + .41*(%GC) - 600/length \
...

The [Na+] turned out to be "toxic", I had to replace it by {Na+}. Maybe
make that the parser can distinguish [] signs that are part of the
syntax from those that are part of some definition.

>
> Extra questions are:
>
>
> 7. Should the database (and any other emboss.default) attribute names be
> abbreviated (see question 5)?
>

To allow both "swissprot" and "sw" as alternative names I now duplicate
the definition. Allowing for abbreviated database names could be a
solution. But perhaps not that good. Might be confusing. And what about
"imgtmhc" / "mhc" ? A suggestion is to add an attribute "altname", so
that you could have :

DB unannotated [
   type: N
   comment: 'EMBL unannotated/unclassified'
   altname: unc,un,unclassified
   .....

Other things on "wish list" :

- to allow the "nullok" attribute for all objects rather than just some.
  e.g. it can happen that whether program must input sequence depends on
  setting of other parameters. seq object however always needs input, so
  only "hack" now is to write silly default value in ACD file

- extend input/output in several format for sequence/feature/alignment/
  structure to other types of data :
    symbol comparison tables (GCG, BLAST, SIM,...)
    codon_usage_tables (CUTG, GCG, ...)

Sincerely,
Guy Bottu

From ableasby at hgmp.mrc.ac.uk Fri Feb 21 12:47:46 2003
From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk)
Date: Fri, 21 Feb 2003 17:47:46 GMT
Subject: ACD file and emboss.default file syntax
Message-ID: <200302211747.RAA10269@bromine.hgmp.mrc.ac.uk>

Commenting only on the nullok for sequence. That feature was put into
the CVS version a few weeks ago.
Alan

From p.ernst at dkfz-heidelberg.de Mon Feb 24 06:27:34 2003
From: p.ernst at dkfz-heidelberg.de (Peter Ernst)
Date: Mon, 24 Feb 2003 12:27:34 +0100 (MET)
Subject: ACD file and emboss.default file syntax
In-Reply-To: <3E54BD35.6050509@ebi.ac.uk>
Message-ID:

On Thu, 20 Feb 2003, Peter Rice wrote:

> How far should this go? In particular, should white space be required
> after a ":" or around "[" and "]" characters?

How significant are newlines?

Usually a newline is just a whitespace in ACD and can be treated like a
space character. The end of a block is defined by the ']' character.
However there are the qualifiers "variable" and "endsection" where the
end of the block is defined by a newline character.

A simple approach to solve this problem was to say: "a newline marks the
end of a block unless one of the opening characters like '[' appears on
the same line". However this doesn't work with all existing ACD files.

"good" definition:

  appl: hmmgen [
    documentation: "G.."
    ...
  ]

"bad" definition: (found in DOMAINATRIX/.../hmmgen.acd)

  appl: hmmgen
  [
    documentation: "G.."
    ...
  ]

> There are also differences in the definitions of comments. In ACD files
> any text after a "#" is ignored. In emboss.default comments must start
> at the beginning of the line. This seems preferable as occasionally a
> "#" character could be useful in a definition.

Yes it would be better if comments must start from the beginning of a
line. However existing parsers contain code to deal with the other
comments as well. But anyway, a change in the syntax definition for
comments makes sense (for future versions of parsers in GUIs).

> 4. Should the ACD types (integer, string, ...) be specified in full? ACD
> can cope easily with unambiguous abbreviations, [...]
> [...]
>
> 6. Should the ACD attribute names (required, information, ...) be
> abbreviated (see question 4)?

The problem is that abbreviations used in existing ACD files are not
*globally* unambiguous but only *locally* unambiguous, i.e.
the abbreviation MAX is used for MAXSEQS (in ALIGNMENT context), for
MAXIMUM (e.g. INTEGER context) and for MAXLENGTH (e.g. STRING context).

Using a LEX/YACC approach to parse ACD files, it was problematic to
create a simple lexer, because whenever the lexer found "max", it wasn't
clear if the token MAXSEQS, MAXIMUM or MAXLENGTH was meant. (The lexer
had to know its context, to be able to throw the right token.)

Therefore more unambiguity would be welcome (even if this means: no
abbreviations in ACD files).

Regards,
Peter Ernst
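
Editorial illustration of the locally versus globally unambiguous point raised in the last message. This is a minimal Python sketch, not code from EMBOSS or from any of the parsers discussed in the thread. The attribute tables are hypothetical and heavily abridged, and the function names resolve and resolve_global are invented for the example. It only shows why a lexer that sees "max" needs to know the enclosing ACD type before it can emit a single token.

```python
# Hypothetical, heavily abridged attribute tables keyed by ACD type.
# The real ACD attribute lists are longer and are not reproduced here.
ATTRIBUTES = {
    "integer": ["default", "information", "maximum", "minimum", "required"],
    "string": ["default", "information", "maxlength", "minlength", "required"],
    "alignment": ["information", "maxseqs", "minseqs", "required"],
}


def resolve(abbrev, context):
    """Expand an abbreviated attribute name within one type's context."""
    name = abbrev.lower()
    candidates = ATTRIBUTES[context]
    if name in candidates:
        # An exact match always wins over prefix matching.
        return name
    matches = [c for c in candidates if c.startswith(name)]
    if len(matches) != 1:
        raise ValueError(f"{abbrev!r} matches {matches} in {context!r}")
    return matches[0]


def resolve_global(abbrev):
    """Return every attribute, across all types, that the prefix could mean."""
    all_names = sorted({a for names in ATTRIBUTES.values() for a in names})
    return [a for a in all_names if a.startswith(abbrev.lower())]


if __name__ == "__main__":
    for ctx in ("integer", "string", "alignment"):
        print(ctx, "max ->", resolve("max", ctx))
    print("no context: max ->", resolve_global("max"))
```

With the context supplied, "max" expands to exactly one name per type. Without it, the same prefix has three candidates, which is the situation a context-free lexer faces.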