From pmr at ebi.ac.uk  Thu Feb 20 06:34:13 2003
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 20 Feb 2003 11:34:13 +0000
Subject: ACD file and emboss.default file syntax
Message-ID: <3E54BD35.6050509@ebi.ac.uk>

I am cleaning up the parsing of both ACD files and the emboss.default 
files. This includes adding diagnostic messages to say what problems 
were found and to report the line number (and filename).

Showdb will carry out additional checks on the emboss.default and 
~/.embossrc files (valid sequence formats, for example). There is no 
need to run these every time the files are read.

At the same time, some of the syntax can be tightened. For example, ACD 
files allowed some strange characters that were never used (parentheses 
instead of quotes, "=" instead of ":"). These will be removed.

How far should this go? In particular, should white space be required 
after a ":" or around "[" and "]" characters?

There are also differences in the definitions of comments. In ACD files 
any text after a "#" is ignored. In emboss.default comments must start 
at the beginning of the line. This seems preferable as occasionally a 
"#" character could be useful in a definition.

For example, both of the following are valid ACD definitions:

#################################
# Full definition from acdpretty
#################################

integer: minlen  [
   required: "Y"
   minimum: "1"
   maximum: "50"
   default: "6"
   information: "Minimum length"
]

int:minlen [req:Y min:1 max:50 def:6 info:"Minimum length"] #compact

The first is preferred (and generated by -acdpretty). I would like to 
make it *required* so that other ACD parsers (e.g. for GUI definitions) 
can cope better.

The changes would be:

1. White space is required after "attribute:"

2. White space is required before and after "[" and "]"

3. Any "#" character at the start of a line is a comment and the line 
will be ignored. Any "#" within a line is part of the definition.
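
As a rough illustration of what rules 1 to 3 would buy a third-party parser, here is a minimal Python sketch. It is not the EMBOSS parser; the names strip_comments, tokenize and check_strict are hypothetical, and it assumes that "start of a line" means column 0 and that quoted strings use double quotes only.

```python
import re

# A token is either a double-quoted string or a run of non-blank characters.
TOKEN = re.compile(r'"[^"]*"|\S+')

def strip_comments(text):
    """Rule 3: drop lines whose first character is '#'; keep '#' elsewhere."""
    return "\n".join(
        line for line in text.splitlines() if not line.startswith("#")
    )

def tokenize(text):
    """Rules 1 and 2: with mandatory white space, splitting is trivial."""
    return TOKEN.findall(strip_comments(text))

def check_strict(tokens):
    """Report unquoted tokens that break rule 1 or rule 2."""
    problems = []
    for tok in tokens:
        if tok.startswith('"'):
            continue  # quoted values may contain anything
        if len(tok) > 1 and ("[" in tok or "]" in tok):
            problems.append("bracket not set off by white space: " + tok)
        elif ":" in tok[:-1]:
            problems.append("no white space after ':' in: " + tok)
    return problems
```

The acdpretty-style definition above tokenizes cleanly under this sketch, while the compact one-line form is reported by check_strict. A quoted string that continues onto a line starting with "#" would be mishandled; a real parser would need to track quoting before stripping comments.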


Extra questions are:

4. Should the ACD types (integer, string, ...) be specified in full? ACD 
can cope easily with unambiguous abbreviations, so I prefer to keep the 
short forms, but perhaps parsers have problems. These files are created 
by the developers so we can update them. One option is to generate 
warning messages, and to run acdpretty to fix them before committing the 
ACD files to CVS.

5. Should the emboss.default types (env, dbname) be specified in full? 
Parsing can cope easily with unambiguous abbreviations, and "db" in 
place of "dbname" is common. These files are created by the site 
administrators and by individual users, so we should avoid breaking 
their existing definitions. But note we have synonyms (env/set) so we 
could allow "db" or "dbname" as alternatives.

6. Should the ACD attribute names (required, information, ...) be 
abbreviated (see question 4)?

7. Should the database (and any other emboss.default) attribute names be 
abbreviated (see question 5)?


Peter Rice


From jkb at mrc-lmb.cam.ac.uk  Thu Feb 20 07:04:39 2003
From: jkb at mrc-lmb.cam.ac.uk (James Bonfield)
Date: Thu, 20 Feb 2003 12:04:39 +0000
Subject: ACD file and emboss.default file syntax
In-Reply-To: <3E54BD35.6050509@ebi.ac.uk>; from pmr@ebi.ac.uk on Thu, Feb 20, 2003 at 11:34:13AM +0000
References: <3E54BD35.6050509@ebi.ac.uk>
Message-ID: <20030220120439.A13733@arran.mrc-lmb.cam.ac.uk>

On Thu, Feb 20, 2003 at 11:34:13AM +0000, Peter Rice wrote:
> I am cleaning up the parsing of both ACD files and the emboss.default 
> files. This includes adding diagnostic messages to say what problems 
> were found and to report the line number (and filename).

Diagnostic messages are a definite help.

> At the same time, some of the syntax can be tightened. For example, ACD 
> files allowed some strange characters that were never used (parentheses 
> instead of quotes, "=" instead of ":"). These will be removed.

For what it's worth the ACD parser in Spin (staden package) copes with these
things already, and also the variations in spaces.

However I freely admit that it may have been easier to develop with the
changes you propose and so it sounds like a sensible way of promoting more
interfaces. I'm not 100% convinced on that though; by far the easiest way of
helping people is to provide a full and complete BNF grammar. The
documentation was not sufficiently clear as I recall. I ended up writing my
own version of lex in Tcl and a hand-coded parser. E.g. I use regexp matching
for identifiers and it's no harder to match regexp "[^ \t\n:=]+[ \t\n]*[:=]"
than "[^:]+:", although the latter is obviously more readable.

> There are also differences in the definitions of comments. In ACD files 
> any text after a "#" is ignored. In emboss.default comments must start 
> at the beginning of the line. This seems preferable as occasionally a 
> "#" character could be useful in a definition.

This is one change which could cause problems. Existing parsers should, in
theory, already be handling the complexities of : vs =, different quoting
syntaxes, and varying whitespace. So the changes to these will help new code
and not have any effect on existing parsers.

Changing comments though will make existing parsers parse incorrectly on files 
where # is used in a definition. However I guess the change needs to be made
due to the points you make (it possibly being a useful character).

> 6. Should the ACD attribute names (required, information, ...) be 
> abbreviated (see question 4)?

My approach was to specify the grammar at a higher level of ID, STRING, etc
and then use code for matching ID against a known database of
words. Literally:

                foreach word {application information default required \
                              optional expected documentation outfile \
                              parameter needed delimiter codedelimiter \
                              values selection minimum maximum dirlist} {
                    if {[string match -nocase ${id_v}* $word]} {
                        set id_v $word
                        break
                    }
                }

Having full names though makes the 'lex' type part easier as the tokenising
can break things down into more specific words: APPLICATION, INTEGER, etc
rather than just ID. Although it's possible to do this with regexps right now
if you're willing to put up with regexps like "var(i(a(b(l(e)?)?)?)?)?".

Oddly I dealt with types and attributes in a slightly different way, so I
can only deal with "int" and "integer" and not "integ". I'm not sure why I did 
it that way though; sloppiness it seems.
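
For comparison, a small Python sketch of the same prefix-matching idea. Unlike the Tcl loop above, which takes the first word that matches, this variant reports ambiguous abbreviations instead of silently picking one. The function name resolve is hypothetical, and the word list is simply the one from the Tcl snippet, not a complete list of ACD keywords.

```python
# Word list copied from the Tcl snippet above; not a complete ACD keyword set.
WORDS = [
    "application", "information", "default", "required",
    "optional", "expected", "documentation", "outfile",
    "parameter", "needed", "delimiter", "codedelimiter",
    "values", "selection", "minimum", "maximum", "dirlist",
]

def resolve(abbrev, words=WORDS):
    """Expand a case-insensitive abbreviation to the full word."""
    key = abbrev.lower()
    if key in words:
        return key  # an exact match always wins
    hits = [w for w in words if w.startswith(key)]
    if len(hits) == 1:
        return hits[0]
    if not hits:
        raise ValueError("unknown identifier: " + abbrev)
    raise ValueError("ambiguous identifier %s: %s" % (abbrev, ", ".join(hits)))
```

With this list, resolve("info") gives "information" and resolve("req") gives "required", while resolve("d") raises an error because it could mean default, documentation, delimiter or dirlist.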

James

-- 
James Bonfield (jkb at mrc-lmb.cam.ac.uk)   Fax: (+44) 01223 213556
Medical Research Council - Laboratory of Molecular Biology,
Hills Road, Cambridge, CB2 2QH, England.
Also see Staden Package WWW site at http://www.mrc-lmb.cam.ac.uk/pubseq/


From gbottu at ben.vub.ac.be  Fri Feb 21 11:18:14 2003
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Fri, 21 Feb 2003 17:18:14 +0100 (CET)
Subject: ACD file and emboss.default file syntax
Message-ID: <200302211618.h1LGIE3F1069274@black.vub.ac.be>

from : BEN

> 
> I am cleaning up the parsing of both ACD files and the emboss.default 
> files. This includes adding diagnostic messages to say what problems 
> were found and to report the line number (and filename).
> 
>
> 
> There are also differences in the definitions of comments. In ACD files 
> any text after a "#" is ignored. In emboss.default comments must start 
> at the beginning of the line. This seems preferable as occasionally a 
> "#" character could be useful in a definition.

I agree with that.

By the way : the file eprimer3.acd contains : 
help: "The maximum allowed melting temperature of the amplicon. Product Tm is
calculated using the formula from Bolton and McCarthy, PNAS 84:1390 (1962) as
presented in Sambrook, Fritsch and Maniatis, Molecular Cloning, p 11.46 (1989,
CSHL Press). \ Tm = 81.5 + 16.6(log10[Na+]) + .41*(%GC) - 600/length \
...
The [Na+] turned out to be "toxic"; I had to replace it with {Na+}. Perhaps the
parser could distinguish [] characters that are part of the syntax from those
that are part of a definition.
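
One way this could work is to track quoting while scanning, so that brackets inside a quoted string are never treated as syntax. The following Python sketch only illustrates the idea and is not how the EMBOSS parser behaves; the function name syntax_brackets is hypothetical, and it assumes strings are delimited by matching single or double quotes with no escape mechanism.

```python
def syntax_brackets(text):
    """Return (index, char) for each '[' or ']' outside quoted strings."""
    positions = []
    quote = None  # the quote character we are currently inside, if any
    for i, ch in enumerate(text):
        if quote:
            if ch == quote:
                quote = None  # closing quote found
        elif ch in "\"'":
            quote = ch  # opening quote
        elif ch in "[]":
            positions.append((i, ch))
    return positions
```

For a definition whose help text contains 16.6(log10[Na+]) inside double quotes, only the outer "[" and "]" of the definition block are returned.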
> 
> Extra questions are:
> 
>
> 7. Should the database (and any other emboss.default) attribute names be 
> abbreviated (see question 5)?
> 
To allow both "swissprot" and "sw" as alternative names I now duplicate the 
definition. Allowing for abbreviated database names could be a solution. But 
perhaps not that good. Might be confusing. And what about "imgtmhc" / "mhc" ?
A suggestion is to add an attribute "altname", so that you could have :

DB unannotated [ type: N  comment: 'EMBL unannotated/unclassified'
    altname: unc,un,unclassified  .....
    
Other things on "wish list" :

- to allow the "nullok" attribute for all objects rather than just some.
  E.g. whether a program must read an input sequence can depend on the settings
  of other parameters. The seq object, however, always needs input, so the only
  "hack" for now is to write a silly default value in the ACD file.
  
- extend input/output in several formats, as for sequence/feature/alignment/
  structure, to other types of data: symbol comparison tables (GCG, BLAST,
  SIM, ...), codon usage tables (CUTG, GCG, ...)
  
	Sincerely,
	Guy Bottu


From ableasby at hgmp.mrc.ac.uk  Fri Feb 21 12:47:46 2003
From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk)
Date: Fri, 21 Feb 2003 17:47:46 GMT
Subject: ACD file and emboss.default file syntax
Message-ID: <200302211747.RAA10269@bromine.hgmp.mrc.ac.uk>

Commenting only on the nullok for sequence. That feature was
put into the CVS version a few weeks ago.

Alan


From p.ernst at dkfz-heidelberg.de  Mon Feb 24 06:27:34 2003
From: p.ernst at dkfz-heidelberg.de (Peter Ernst)
Date: Mon, 24 Feb 2003 12:27:34 +0100 (MET)
Subject: ACD file and emboss.default file syntax
In-Reply-To: <3E54BD35.6050509@ebi.ac.uk>
Message-ID: <Pine.SOL.4.43.0302241210530.4087-100000@husar.inet.dkfz-heidelberg.de>

On Thu, 20 Feb 2003, Peter Rice wrote:

> How far should this go? In particular, should white space be required
> after a ":" or around "[" and "]" characters?

How significant are newlines?

Usually a newline is just a whitespace in ACD and can be treated like
a space character. The end of a block is defined by the ']'
character. However there are the qualifiers "variable" and
"endsection" where the end of the block is defined by a newline
character.

A simple approach to solve this problem was to say: "a newline marks
the end of a block unless one of the opening characters like '['
appears on the same line". However this doesn't work with all existing
ACD files.

  "good" definition:

 appl: hmmgen [
  documentation: "G.."
  ...
 ]

  "bad" definition: (found in DOMAINATRIX/.../hmmgen.acd)

 appl: hmmgen
[
  documentation: "G.."
 ...
]
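
A possible workaround for the "bad" layout, sketched in Python, is to look ahead to the next non-blank, non-comment line before deciding that a newline ended the statement. This is only one option, not something any existing parser is known to do. The function name opens_block is hypothetical, and the check ignores the case of a "[" inside a quoted string.

```python
def opens_block(lines, i):
    """True if the statement starting on lines[i] is followed by a '[' block."""
    if "[" in lines[i]:
        return True  # the "good" layout: '[' on the same line
    for nxt in lines[i + 1:]:
        stripped = nxt.strip()
        if not stripped or nxt.startswith("#"):
            continue  # skip blank lines and whole-line comments
        return stripped.startswith("[")  # the "bad" layout: '[' on its own line
    return False
```

Both hmmgen layouts above are recognized as block openers, while a newline-terminated statement followed by an unrelated definition is not.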


> There are also differences in the definitions of comments. In ACD files
> any text after a "#" is ignored. In emboss.default comments must start
> at the beginning of the line. This seems preferable as occasionally a
> "#" character could be useful in a definition.

Yes it would be better if comments must start from the beginning of
a line. However existing parsers contain code to deal with the other
comments as well. But anyway, a change in the syntax definition for
comments makes sense (for future versions of parsers in GUIs).


> 4. Should the ACD types (integer, string, ...) be specified in full? ACD
> can cope easily with unambiguous abbreviations, [...]
> [...]
>
> 6. Should the ACD attribute names (required, information, ...) be
> abbreviated (see question 4)?

The problem is that abbreviations used in existing ACD files are not
*globally* unambiguous but only *locally* unambiguous, i.e.

  the abbreviation MAX
  is used for MAXSEQS (in ALIGNMENT context),
          for MAXIMUM (e.g. INTEGER context) and
          for MAXLENGTH (e.g. STRING context).

Using a LEX/YACC approach to parse ACD files, it was problematic to
create a simple lexer, because whenever the lexer found "max", it
wasn't clear if the token MAXSEQS, MAXIMUM or MAXLENGTH was
meant. (The lexer had to know its context, to be able to throw the
right token.)

Therefore less ambiguity would be welcome (even if this means: no
abbreviations in ACD files).
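
To make the local versus global point concrete, here is a short Python sketch. The attribute table is illustrative only: it contains just the names mentioned in this thread, keyed by the contexts named above, and is not the real ACD attribute set. The function name resolve_attr is hypothetical.

```python
# Illustrative table only: contexts and names taken from this thread.
ATTRS = {
    "alignment": ["maxseqs"],
    "integer": ["minimum", "maximum", "default"],
    "string": ["maxlength", "default"],
}

def resolve_attr(abbrev, context=None):
    """Expand an abbreviation, within one context or across all of them."""
    key = abbrev.lower()
    if context is None:
        # Global lookup: what a context-free lexer would have to do.
        names = sorted({n for group in ATTRS.values() for n in group})
    else:
        names = ATTRS[context]
    hits = [n for n in names if n.startswith(key)]
    if len(hits) != 1:
        raise ValueError("%s matches %d names: %s" % (abbrev, len(hits), hits))
    return hits[0]
```

Here resolve_attr("max", "integer") gives "maximum" and resolve_attr("max", "string") gives "maxlength", but resolve_attr("max") with no context fails because three names match. That is the information a plain lexer does not have.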


Regards,
	Peter Ernst

