Version 4.1.0

A new ACD attribute outputmodifier: "Y" identifies qualifiers that cause the kinds of output changes that can break parsers. An obvious example is the -html qualifier in many of the utility programs. This attribute is a warning to wrapper developers and maintainers that they may want to fix the value of this qualifier and not allow users to change it. In some cases (as with toggle qualifiers) it may be useful to wrap each possible value separately. For example, tfm can run as an HTML version (-html) and a text version (-nohtml -nomore).

Backtranseq now keeps stop positions in the sequence and replaces them with the most common stop codon. Previous releases converted stops to 'X' and back-translated them as 'NNN'.

Reading sequences in NBRF (or PIR) format now only removes one '*' from the end, allowing protein sequences to end with a stop codon.

Reading NBRF format sequences as FASTA format was retaining a ';' in front of the sequence ID. This is now fixed.

Pattern files and regular expression files now use the -pformat and -pname associated qualifiers, which were ignored when they first appeared in 4.0.0. Pattern file formats are "fasta" for the original format in 4.0.0 with FASTA style identifiers, and "simple" for files with a single pattern on each line. By default the format is chosen by testing the first character for a '>'. The pattern name is used to set a name of "name1", "name2" and so on if no name is in the FASTA file. By default patterns are called "pattern1" and regular expressions are called "regex1".

Added a new function to read from a buffered file and trim newlines. It was not needed before because input functions were doing their own trimming.

Valgrind memory leak tests now cover all QA tests. The command line is captured and used to generate test cases. Script valgrind.pl knows about the few cases that need input files copied and preprocesses them by name. A few tests can be flagged as ignored.
This is intended for tests known to run for a very long time under valgrind.

Memory leaks are fixed for all programs in the main EMBOSS package and for the most used ones in the EMBASSY packages.

A new environment variable ACDCOMMANDLINELOG takes a filename as its value. This saves the command line equivalent of a program run, converting user responses to prompts into their command line equivalents. A number of bugs in command line saving for report headers were identified and fixed.

Two string functions had their names reversed. ajStrRemoveWhite is to remove all white space from a string; ajStrRemoveWhiteExcess is to remove white space from the ends and replace internal whitespace with single spaces. When function names were standardized these names were reversed. As function calls were converted automatically, EMBOSS code worked as before, but developers will have noticed that the functions did not behave as expected. This is now corrected, and all existing calls in the EMBOSS code have been checked and converted.

Showseq with a sequence end position now stops output at the end of the user-specified range. Previous releases printed the whole of the line with the last base/residue.

SRS servers use "gid" as the field name for GI numbers. The field name has been changed to allow GI searches with local SRS and remote SRSWWW access to Genbank.

A new configure option for developers, --enable-devwarnings, turns on many more warning messages from the gcc compiler. Not all warnings are useful - the less useful gcc options are documented (and commented out) in the configure.in file devwarnings section. Warnings include missing function prototypes, signed/unsigned comparisons, potential loss of precision in casts, and use of global names (index for example) as variables.

Function names in ajseqwrite.c have been standardised. Old names are still accepted but are marked as "deprecated" and will generate warnings with the gcc compiler (see ajstr below). Other compilers will see no difference.

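The corrected behaviour of the two whitespace functions can be illustrated with a minimal Python sketch. It only mirrors the behaviour described above for ajStrRemoveWhite and ajStrRemoveWhiteExcess; the Python function names are hypothetical and this is not the AJAX library code.

```python
import re


def remove_white(text: str) -> str:
    """Remove every whitespace character (behaviour described for ajStrRemoveWhite)."""
    return re.sub(r"\s+", "", text)


def remove_white_excess(text: str) -> str:
    """Trim both ends and collapse each internal whitespace run to a single
    space (behaviour described for ajStrRemoveWhiteExcess)."""
    # str.split() with no argument splits on any whitespace run and
    # discards leading and trailing whitespace.
    return " ".join(text.split())
```

Under these assumptions, an input such as "  AC   GT\tN \n" becomes "ACGTN" with the first function and "AC GT N" with the second.
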
Edialign is a new application, a port of the DIALIGN2 program by B. Morgenstern, using an ACD file written by Guy Bottu. It takes as input nucleic acid or protein sequences and produces as output a multiple sequence alignment. The sequences need not be similar over their complete length, since the program constructs alignments from gap-free pairs of similar segments of the sequences.

Wordfinder is a new application to find word-based matches of limited size. It is based on code from supermatcher. The inputs are reversed so the query sequence set (unaligned) is compared to a streamed database of sequences. (Supermatcher should perhaps have its inputs in this order too.) Limits are provided for the length of the word match and the length of the alignment. The default gap penalties are also increased to limit the gaps allowed in alignment.

Word-based algorithms found too many matches where both sequences contain runs of X (protein) or N (nucleotide). These are now ignored when building the word table.

Word-based algorithms complained if a sequence was shorter than the wordsize. This was a problem for database searches with some short sequences present. They now run silently and simply return no word matches.

The EMBL format sequence entry parser was able to read swissprot sequence data, but not the feature table. Efficiency improvements to set the sequence type to nucleotide for EMBL entries showed that swissprot entries were being read by the EMBL parser. A test for swissprot protein information on the ID line should redirect these entries to the swissprot parser. In previous releases the sequence type was not set, so there was no problem with the sequence type - although feature lines may not have been readable from swissprot format flat files. Database definitions specify the swiss or embl format so they are not affected.

Large sequences were running very slowly.
This was traced to the way sequence types are tested using regular expressions processed by calls to the PCRE library. These calls were replaced by simple string functions as they are only testing that a sequence is entirely composed of characters from an allowed set. An additional speedup was achieved by defining only upper case characters as required (almost halving the number of tests) and testing the upper case version of the sequence characters.

Sequence translation in the reverse direction adds extra amino acids for partial codons. In the forward direction the overhang was miscalculated so these codons were missed. No users have complained, probably because in most cases they are translated as 'X' (it needs a 4-base wobble in the code to convert the first 2 bases of a codon into a single amino acid).

Sequence translation was relatively slow, at least on very large sequences. Profiling with gprof indicated some changes to reduce the number of string handling calls (each was very fast, but there was a very large number of calls). The internal tables were resized (from 15 elements to 16) for more efficient mapping.

Parsing NCBI format ID lines now saves the database name. This is available for writing NCBI formatted output ID lines, but is not to be used in reporting the USA.

Added "refseq" as a sequence and feature format. It is initially a simple alias of GenBank, but we may let them diverge later.

REFSEQ entries have their own idea of what a ProteinID in the feature table looks like, as they use REFSEQP protein IDs. Validation now allows the third character to be an underscore.

Large numbers of database files could make the dbi indexing programs (dbiflat, dbifasta, dbigcg, dbiblast) fail at the sort merge stage when the index files are combined. The sort merge is now in 2 steps to limit the number of open files required in the system sort utility.

Added a script emblsplit.pl to split EMBL and UniProt database files into 2Gbyte chunks.

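The idea behind the two-step sort merge for the dbi programs can be sketched in Python. This is an illustration only: the dbi programs use the system sort utility on index files, whereas the sketch below uses in-memory sorted lists, and the function name and batch_size parameter are hypothetical.

```python
import heapq
from typing import List


def merge_in_two_steps(sorted_runs: List[List[str]], batch_size: int = 4) -> List[str]:
    """Merge many sorted runs while limiting how many are open at once.

    Step 1 merges the runs in batches of at most batch_size.
    Step 2 merges the intermediate results into the final sorted list.
    """
    # Step 1: each batch touches at most batch_size inputs at a time.
    intermediate = [
        list(heapq.merge(*sorted_runs[start:start + batch_size]))
        for start in range(0, len(sorted_runs), batch_size)
    ]
    # Step 2: only the intermediate results are combined here, so the
    # number of inputs is the number of batches, not the number of runs.
    return list(heapq.merge(*intermediate))
```

With N input runs, a single-step merge would need N inputs open together; in this sketch no step needs more than the larger of batch_size and the number of batches.
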
The -sid qualifier now overwrites the sequence id if used. The -sid value will be used for creating the output filename and for reporting the sequence identifier in output files. When there is more than one input sequence, the same ID is currently used for all of them. We may change this in future to generate new IDs from this base name.

New sequence format gifasta is the same as "ncbi" but uses the GI number as the identifier. Because the output is the same for both formats we have to require -sformat gifasta to be on the command line. The default for such files will remain "ncbi" as the automatically processed format. On output if there is no GI number a dummy value of "000000" is currently used.

Coderet now writes non-coding sequence to a new output file.

New feature function ajFeatLocMark marks selected features as lower case. It is used by coderet to report non-coding regions.

The help output now correctly reports output sequence default filenames.

Phylip input distance matrices now allow integer values to be treated as reals, although there is a possible confusion over integer replicate values so the use of a trailing ".0" is strongly recommended.

Sequences with NCBI deflines and no ID after the final "|" were using the version part of the seqversion ("1" from "AB123456.1") instead of the "AB123456" part to set the ID.

Graph titles were not standard on the general "graph" type output, but are consistent for xygraph outputs. A new attribute gdesc defines a prefix for graph titles which can be appended to by the calling program, usually with a description of the input (sequence USA, input filename). A new call ajGraphSetTitlePlus defines the text to add to the gdesc as "[gdesc] of [text]". All graphs were standardized except pepinfo which has 10 subplot titles already in the intended format. This will be corrected later to have standard main titles and shorter subplot titles.

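The NCBI defline fix described above (taking "AB123456" rather than "1" from "AB123456.1" when nothing follows the final "|") can be pictured with a short Python sketch. The function name is hypothetical and the parsing is deliberately simplified; it is not the EMBOSS parser.

```python
def id_from_defline(defline: str) -> str:
    """Pick a sequence ID from an NCBI-style FASTA defline.

    If there is text after the final '|', that text is the ID. Otherwise
    the accession part of the preceding accession.version field is used.
    """
    words = defline.lstrip(">").split()
    if not words:
        return ""
    fields = words[0].split("|")
    last = fields[-1]
    if last or len(fields) < 2:
        return last
    # No ID after the final '|': split the seqversion and keep the
    # accession, not the version number.
    accession, _, _version = fields[-2].partition(".")
    return accession
```

For example, under these assumptions ">gi|12345|dbj|AB123456.1| example" gives "AB123456", while a defline with a locus name after the final "|" keeps that name.
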
The version of plplot we use has a bug in calculating character sizes when the origin in user units is not the default of (0,0). This has been fixed in the plgchrW and plstrlW functions in the copy that is included with EMBOSS.

Dreg and preg ignored sequence begin and end positions. Both programs now use the embpatlist function calls to process sequence ranges.

Fuzznuc, fuzzpro and fuzztran lost the ability to use the sequence begin and end positions when we switched to pattern lists. This has been restored in the pattern list processing code.

The logfile caused a file close error if it was read only (because it had not been successfully opened). Opening the logfile now tests that the file is writable and ignores logging for a read-only file.

More case-sensitive sequence comparison and matching functions were added, to be consistent about providing both versions.

A few sequence databases have no accession number. For these a new database attribute hasaccession: "N" in emboss.default prevents EMBOSS trying to search the ACC field in addition to the ID field.

A few databases with duplicate IDs should be treated as case-sensitive. The original example was a pdbprot database, containing FASTA format sequences of individual chains from PDB entries. In PDB, the entry itself is a 4-character string, and the chain is a single character A through Z. When an entry has more than 26 chains, the next 26 are labelled a through z. Pdbprot appends these as _A, _B, etc. PDBPROT is available from some public SRS servers - see the official list at http://downloads.lionbio.co.uk/publicsrs.html. This is resolved by adding a new database attribute caseidmatch in emboss.default. A value of "Y" will force EMBOSS to exactly match the case of the whole ID. This is done by post-processing and rejecting entries with an ID that fails to match.

The run date included in report output has changed format to have the day first and to lose the leading zero when the day is 1st to 9th of the month.

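The post-processing step behind caseidmatch can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name, the in-memory list of IDs and the example PDB-style identifiers are all hypothetical, and the real lookup goes through the database access method.

```python
from typing import Iterable, List


def lookup_ids(query_id: str, indexed_ids: Iterable[str], case_id_match: bool) -> List[str]:
    """Find entries by ID, optionally enforcing an exact-case match.

    The first pass is case-insensitive. When case_id_match is True, a
    second pass rejects any hit whose ID differs from the query in case.
    """
    hits = [entry for entry in indexed_ids if entry.lower() == query_id.lower()]
    if case_id_match:
        # Post-process: keep only IDs that match the query exactly.
        hits = [entry for entry in hits if entry == query_id]
    return hits
```

With IDs "1ABC_A" and "1ABC_a" in the same database, a query for "1ABC_a" returns both entries without the exact-case pass and only "1ABC_a" with it.
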
Program cpgplot can run on more than one input sequence, but the plot failed on the second sequence. Fixing this required adding a new function ajGraphDataReplaceI to replace the 1st, 2nd, 3rd, etc. subgraph. Some memory cleanup was also added to remove the replaced graph data objects.

Programs pepwindow and pepwindowall can now process any protein sequence. In previous versions pepwindow was restricted to pureprotein (no ambiguity codes) while pepwindowall accepted any protein sequence (it has to handle gaps) but was using a score of zero for unknown amino acid residues. Changed so that missing amino acid values can be filled in using Dayhoff frequency weighted averages for B, J and Z and an overall average for X, J and O.

Program octanol can accept any protein sequence. Interpolated values are used for B, Z and J. An average over all values is used for X and also for O and U where there is no data. Interpolations and averages use the Dayhoff amino acid frequencies.

Program iep can accept any protein sequence. Ambiguity codes B and Z are resolved by converting to the carboxylic acid (D or E) or amide (N or Q) according to the Dayhoff amino acid frequencies, giving a consistent value for any input protein.

Sequence set type testing was checking whether the seqset is defined as protein but ignoring the type of the first sequence. This is now fixed.

Program tfm looked in the obsolete install directory with the -html option. Changed to find the EMBASSY package name from the installed ACD file and then to find the installed HTML file. If EMBOSS has not been installed, it will also search the original source files.

Modified NCBI/FASTA format to preserve the database name from the NCBI style ID.
The database name is reported in one of the many and varied NCBI syntax variants, depending on whether there is a version or accession number, and whether there is an EMBOSS database name also involved (for example, an entry in a file indexed with dbxfasta or dbifasta).

Modified "pearson" sequence format to keep the FASTA file ID complete. For historical reasons GCG-style dbname:id syntax was still having the db part trimmed. This will still be trimmed from fasta or ncbi format.

The report for digest has Cterm and Nterm columns capitalised to match the rest of the report. Sequence ranges now give correct cterm and nterm results.

The list file Cut.index for codon usage tables was changed to remove old file names (commented out list at the end) and to remove underscores from the species names.

Programs water, needle, merger and prophet calculate an internal path size from the lengths of the input sequences. For sequences that are too long, a fatal error is produced. But if the sequences were extremely long, the test failed and the program gave a segmentation fault. This fix tests in a different way that will catch all cases. (added as a fix to 4.0.0)

The new MRS access method used a general search. This gave strange results when the ID or accession appeared in any other entry. It appears that MRS can search for id or accession only. This worked on the main MRS server at least. (added as a fix to 4.0.0)

New database access methods MRS and DBFETCH need to be explicitly turned on so that showdb can report them. (added as a fix to 4.0.0)

When deleting the last line of buffered input, the code failed to reset the pointer to the last buffered line. This only affected debug traces. Unfortunately, the ajFileBuffClear function does call the debug trace. In practice we have only seen this bug when processing sequence data in EMBL format from an MRS server. (added as a fix to 4.0.0)

Pattern and regular expression searches failed to correctly reverse a nucleotide sequence.
The change is to use ajSeqReverseForce (always reverses the sequence provided) instead of ajSeqReverseDo (which only reverses if the reverse flag is set). (added as a fix to 4.0.0)

Reports in list format failed to write a usable USA for "asis" sequence input, and incorrectly reported reverse strand nucleotide features. (added as a fix to 4.0.0)

The list files Matrices.nucleotide, Matrices.protein and Matrices.proteinstructure now have comment headers explaining their format.

Fixed issues with nucleotide features in the reverse direction in reports. The start/end positions were stored the wrong way around and then reversed again when reported in one of the report formats. However, reporting as EMBL features showed the incorrect storage. ajFeatNewII now checks start/end and reverses the feature if start is greater than end. ajFeatNewIIRev sets the reverse strand and also checks that the start position is greater than (or equal to) the end position. (added as a fix to 4.0.0)

To reduce the size of very large reports, for example when fuzznuc or fuzzpro run over very large databases, new qualifiers are added to report output. -rmaxseq gives the maximum hits for any one sequence, -rmaxall gives the total maximum number of hits. The report tail contains a record of the number of hits reported and found. The qualifiers are intended for web interfaces to control the maximum output they need to report. When the maximum hits figure is reached, ajReportWrite returns false so that programs can terminate at that point. (added as a fix to 4.0.0)

Reports now write a header and tail when closed, to make sure that all programs will write something to the report file. The default header contains the command line provenance, the tail contains the number of sequences and hits. (added as a fix to 4.0.0)

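The report hit limits described above can be sketched in Python to show how a per-sequence limit, a total limit, and the "reported versus found" counts interact. The class and attribute names are hypothetical and the logic is an assumption based only on the description; it is not the ajReportWrite implementation.

```python
from typing import List, Optional


class LimitedReport:
    """Collect hits with a per-sequence cap and a total cap."""

    def __init__(self, max_seq: Optional[int] = None, max_all: Optional[int] = None):
        self.max_seq = max_seq      # cap on hits reported for one sequence
        self.max_all = max_all      # cap on hits reported in total
        self.found = 0              # every hit seen, reported or not
        self.reported: List[str] = []

    def write(self, hits: List[str]) -> bool:
        """Record the hits for one sequence.

        Returns False once the total cap has been reached, so that the
        caller can stop searching.
        """
        self.found += len(hits)
        keep = hits if self.max_seq is None else hits[:self.max_seq]
        if self.max_all is not None:
            room = max(self.max_all - len(self.reported), 0)
            keep = keep[:room]
        self.reported.extend(keep)
        return self.max_all is None or len(self.reported) < self.max_all
```

A caller would loop over sequences and stop when write returns False; the found and reported counts are then available for a report tail.
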