From david.martin at biotek.uio.no Tue Aug 1 10:45:20 2000
From: david.martin at biotek.uio.no (David Martin)
Date: Tue, 1 Aug 2000 15:45:20 +0100
Subject: File locations
Message-ID:

When one is asked to specify file locations, should it be possible to specify URIs instead? It would be great to be able to update, e.g., rebase by specifying ftp://ftp.ebi.ac.uk/pub/databases/rebase/withrefm.008 instead of downloading and then extracting the file.

Just a thought and definitely on the backburner list.

..d

---------------------------------------------------------------------
* Dr. David Martin                Biotechnology Centre of Oslo      *
* Node Manager                    Gaustadalleen 21                  *
* The Norwegian EMBNet Node       P.O. box 1125 Blindern            *
* tel +47 22 95 87 56             N-0317 Oslo                       *
* fax +47 22 69 41 30             Norway                            *
---------------------------------------------------------------------

From david.martin at biotek.uio.no Wed Aug 2 10:49:18 2000
From: david.martin at biotek.uio.no (David Martin)
Date: Wed, 2 Aug 2000 15:49:18 +0100
Subject: Pre first draft of admin guide.
Message-ID:

OK it is in raw text form. I'll mark it up for LaTeX soon but here it is for your delectation and delight.

The major sticking points at the moment are Database Indexing, especially DBIBLAST, but there are unresolved issues with DBIFLAT and FASTA files, and with DBIGCG (because it loops until armageddon in the form of SIGDIEDIEDIE), so I haven't been able to test it properly.

Comments are welcome. I'm hoping it can be pretty much a recipe book for EMBOSS setup. With a bit of standardising of macros, it should be possible to dump out the program docs as LaTeX and incorporate those too. I'll look at marking up the quick guide, and then with Val's tutorial and Thon's ACD guide we are approaching a reasonable manual for EMBOSS.

Maybe I should create a small EMBOSS logo in LaTeX like

    EMB
    OSS

that would slot into the text at about the right height.

..d

From david.martin at biotek.uio.no Wed Aug 2 10:50:24 2000
From: david.martin at biotek.uio.no (David Martin)
Date: Wed, 2 Aug 2000 15:50:24 +0100
Subject: Pre first draft of admin guide. (fwd)
Message-ID:

And the file is here as an attachment.

..d

-------------- next part --------------

The EMBOSS Administrators Guide

What is EMBOSS?
Where do I get it?
Installation
Configuration
Databases
    Database access
    Indexing and configuring flatfile databases
    Indexing and configuring GCG format databases
    Indexing and configuring BLAST databases
    Configuring EMBOSS to use SRS for database lookup.
    Indexing and configuring other databases
Other data
Logging

What is EMBOSS?

EMBOSS is a freely available suite of bioinformatics applications and libraries. It can be downloaded via the internet, copied, customised, and passed on under the terms of the various General Public Licenses.

EMBOSS has been developed in response to the need for a powerful, adaptable suite of software that can interface readily with many different situations and meet the needs of professional bioinformaticists, particularly those needing high throughput and/or scriptable capabilities.

EMBOSS has primarily been developed by those responsible for the public extensions to the GCG package (EGCG). Whilst EMBOSS duplicates much of EGCG, it includes far better database interaction and has the benefit of freely accessible source code, so novel applications can be developed rapidly and at minimal cost.

EMBOSS is currently only available for Unix/Linux systems, but it has been known to compile and run on Windows NT. This document will only consider the UNIX version and will assume the reader has some familiarity with UNIX system administration.

Where do I get it?

EMBOSS is available for download from the primary site at the UK EMBnet node via ftp.

    ftp.uk.embnet.org/pub/EMBOSS/

This directory contains the EMBOSS package and several associated packages (collectively known as EMBASSY) that are distributed with EMBOSS. Download these to a suitable location.

Documentation is available at http://www.uk.embnet.org/Software/EMBOSS

Installation

Unpacking

You will have downloaded the EMBOSS and EMBASSY packages to a suitable directory. For this example we will assume you have downloaded them to /packages, so you should now have the following files (or similar), and maybe more packages in EMBASSY.

    EMBOSS-1.0.0.tar.gz
    PHYLIP-3.573c.tar.gz
    MSE-0.0.4.tar.gz
    TOPO-0.1.tar.gz

First unpack the EMBOSS distribution:

    gunzip EMBOSS-1.0.0.tar.gz
    tar xf EMBOSS-1.0.0.tar

This will create a new directory, EMBOSS-1.0.0. Enter the EMBOSS directory:

    cd EMBOSS-1.0.0

Create a directory for the EMBASSY packages:

    mkdir embassy

Now copy the EMBASSY packages to the embassy directory:

    cp ../MSE-0.0.4.tar.gz ../PHYLIP-3.573c.tar.gz ../TOPO-0.1.tar.gz embassy

Go into the embassy directory and unpack those packages:

    cd embassy
    gunzip MSE-0.0.4.tar.gz
    tar xf MSE-0.0.4.tar

and so on for each EMBASSY package. Then go back up one directory to the main EMBOSS package directory and prepare to start compilation.

Compilation

Building EMBOSS is easy. It follows the usual GNU style of configure, make, make install. We'll take these steps one at a time.

Configuration

To accept the default configuration, just type

    ./configure

and let EMBOSS get on with it. You may want to make some changes to the configuration parameters according to your local policy. This section will not cover all the possibilities, just some of the more common ones.

The configuration script will attempt to find the necessary components in your system to determine how to build EMBOSS successfully. It typically expects the GNU C compiler (gcc) and several standard libraries that should already be part of your Unix/Linux system. Most modern Linux distributions should work straight out of the box.

Installation directory

You need to have write permission on the directory in which you eventually wish to install EMBOSS. You may also wish to put it somewhere other than the standard location of /usr/local/emboss. This is controlled by the --prefix argument. In my case I have all my applications owned by a non-privileged user and installed under /site/prog.

    ./configure --prefix=/site/prog/emboss

will install EMBOSS under /site/prog/emboss. The binaries will be in /site/prog/emboss/bin, with shared libraries in /site/prog/emboss/lib. Data will be in /site/prog/emboss/data, and the configuration files (ACD files) for the applications will be under /site/prog/emboss/share in directories corresponding to the package name.

The individual directories for installation can be modified with other configuration options, but this is usually not necessary. Run

    ./configure --help

to get more information on the directories that can be changed and on other configuration options.

Run ./configure with the options you wish to use. This may take a short while, during which various messages will scroll up the screen. Depending on your system you may need to explicitly configure the graphics. Please see the section 'Configuring EMBOSS graphics' below.

    ./configure --prefix=/site/prog/emboss --with-pngdriver=/site/lib

All should be well with this, and configure should exit with a message like this:

    creating ./config.status
    creating plplot/Makefile
    creating plplot/lib/Makefile
    creating nucleus/Makefile
    creating ajax/Makefile
    creating emboss/Makefile
    creating emboss/acd/Makefile
    creating test/Makefile
    creating test/data/Makefile
    creating test/embl/Makefile
    creating test/pir/Makefile
    creating test/swiss/Makefile
    creating test/swnew/Makefile
    creating test/wormpep/Makefile
    creating emboss/data/Makefile
    creating emboss/data/CODONS/Makefile
    creating emboss/data/REBASE/Makefile
    creating emboss/data/PRINTS/Makefile
    creating emboss/data/PROSITE/Makefile
    creating Makefile

Configuration is now complete.

Configuring EMBOSS graphics

The PLPLOT library can produce output to many devices but requires certain libraries that are NOT distributed with EMBOSS.

To get X-windows based output you must have X installed, otherwise PLPLOT will not build the required driver. You may need to specify the location of your X-windows library with the configuration options:

    --x-includes=DIR    (X include files are in DIR)
    --x-libraries=DIR   (X library files are in DIR)

To explicitly configure PLPLOT without X-windows, use --without-x.

To get PLPLOT to produce PNG images you will need to have the z, png and gd libraries installed. In particular, gd version >= 1.6.3 must be used. If for some reason you do not have the required libraries and your system support group will not update them (in particular gd, as the older versions support GIF, which is NOT supported in later versions), then install the latest versions of all three (z, gd, png) to a new directory and add this new directory to your configure line for EMBOSS, i.e.

    ./configure --with-pngdriver=my_dir

where the z, gd and png libraries were each installed using

    ./configure --prefix=my_dir

You can explicitly tell EMBOSS not to include PNG support with --without-pngdriver.

How to tell if ./configure has found PNG.

Watch for something like the following when running ./configure:

    checking if png driver is wanted... yes
    checking for inflateEnd in -lz... (cached) yes
    checking for png_destroy_read_struct in -lpng... (cached) yes
    checking for gdImageCreateFromPng in -lgd... (cached) yes

This means that the configuration script has located the PNG libraries on your system. If you see a message indicating that ./configure could not find the libraries or that the version of gd was too old, then you should install the latest versions of the libraries yourself and rerun configure with the correct --with-pngdriver value.

Building EMBOSS

Building EMBOSS is a matter of typing 'make' and going to find something else to do for the next ten minutes to half an hour, depending on the speed of your system. EMBOSS will first build the shared libraries (PLPLOT, AJAX, and NUCLEUS) and then build the applications. You will see plenty of warnings complaining about libraries not being used to resolve any symbols. These can be safely ignored.

If all goes according to plan you should have built EMBOSS successfully. If not, you will have to try to work out why the build failed. If you can't work it out yourself, send an email describing the problem to emboss-bug at sanger.ac.uk with a copy of the config.status and config.cache files from your EMBOSS directory. (These will tell the developers what state your system was in when compilation failed.)

I am assuming that compilation was successful. You now have to check that you have the correct access permissions for the directory in which you wish to install EMBOSS, and type 'make install'. After a few minutes and many pagefuls of messages, EMBOSS should be installed where you specified.

Tidying up the environment

You will now need to make a few adjustments to your environment to ensure that EMBOSS runs smoothly. EMBOSS looks for certain environment variables to determine where the libraries and data are found.
These instructions assume you installed EMBOSS in /site/prog/emboss. Adjust them to suit your installation.

Insert the following lines at the end of /etc/cshrc (or ~/.cshrc for a personal installation):

    setenv EMBOSS_DATA /site/prog/emboss/data
    setenv PLPLOT_LIB /site/prog/emboss/lib
    set path=( /site/prog/emboss/bin ${path} )

Or, for bash/ksh/sh users, insert the following at the end of /etc/profile or ~/.bashrc:

    EMBOSS_DATA=/site/prog/emboss/data
    PLPLOT_LIB=/site/prog/emboss/lib
    PATH=/site/prog/emboss/bin:$PATH
    export EMBOSS_DATA PLPLOT_LIB PATH

EMBOSS should now be ready for use. You can test this by trying the program 'wossname':

    wossname -auto | more

This should give a long list of programs that are available. Press space to page down through the list. This lists just the EMBOSS programs and doesn't include any of the EMBASSY programs.

Installing EMBASSY

As well as the base libraries and standard EMBOSS distribution, various extra packages (EMBASSY) are distributed with EMBOSS. To install an EMBASSY package, go to the relevant directory, then configure, build and install it. For example, to install PHYLIP (which was unpacked into /packages/EMBOSS-1.0.0/embassy/PHYLIP-3.573c earlier):

    cd /packages/EMBOSS-1.0.0/embassy/PHYLIP-3.573c
    ./configure --prefix=/site/prog/emboss
    make
    make install

NB. You MUST use the same arguments for configure that you used for the installation of the main EMBOSS package.

Repeat as necessary for the other EMBASSY packages. You should now find that running wossname as before lists the EMBASSY programs.

Configuration

EMBOSS can be configured to match your requirements. EMBOSS looks for a configuration file in several places. Firstly it looks in /site/prog/emboss/share/EMBOSS for a file 'emboss.default'. It then looks in your home directory for the file '.embossrc', and finally in the current directory for '.embossrc'. In each case, definitions will override those previously defined.

Several aspects of EMBOSS can be defined.
These are:

    EMBOSS environment variables
    EMBOSS databases
    Default behaviour of EMBOSS programs

As databases are by far the most complex of these, they will be covered in a separate section.

EMBOSS environment variables

These are set with an 'env' or a 'set' declaration. 'env' and 'set' are interchangeable.

The most important environment variable is the location of the ACD files that describe each program.

    set emboss_acdroot /site/prog/emboss/share/EMBOSS/acd

Environment variables are useful for easing the maintenance of your emboss.default. For example, you may want to specify the location of your databases as an environment variable. Then if you move the databases you only have to update one line in the configuration file.

    set emboss_database_dir /data/databases/flatfiles

This would then be referred to as $emboss_database_dir/embl for the directory /data/databases/flatfiles/embl.

Databases

Database access

EMBOSS offers three methods for accessing databases:

    All:    EMBOSS returns all the sequences in the database in no particular order.
    Query:  EMBOSS retrieves a set of sequences corresponding to a wildcard query.
    Single: EMBOSS retrieves a single sequence indexed by ID or accession number.

Each database definition can configure one or many of these methods for database access.

Typically EMBOSS uses the 'emblcd' system of database indexing. This comes in three variants depending on the original format of your database. The emblcd method assumes that you have both ID and accession number in each record. If you do not have both ID and accession number you will have to use an alternative method. Please see the 'other databases' section below.

General database configuration

Each database is configured using a DB declaration.
The generalised form is

    DB databasename [ Configuration options ]

The configuration options are tag/value pairs and must contain at least a description of the access method (using method:, or one or more of methodsingle:, methodquery: and methodall:) and a description of the format the sequences will be returned in (using format:). In addition to these tags there will be other tags that are needed for particular methods, and other tags that are optional.

The available values of method: are listed below. Scope shows which access types a method supports: a (all), q (query), s (single).

DIRECT (scope: a)

Returns all the database entries, one after the other. It assumes no indexing.

    DB mydb [
        # required parameters
        method: direct
        format: fasta
        dir: $emboss_db_dir/mydb
        file: *.dat
        # optional parameters
        type: N
        release: 63.0
        comment: "My own database with no indices"
        exclude: "est*.dat"
    ]

SRS (scope: a q s)

Returns entries from a local installation of SRS, using the -e switch to getz to return entries in the original format.

    DB mydb [
        # required parameters
        method: srs
        format: embl
        app: getz
        # optional parameters
        dbalias: embl
        type: N
        comment: 'My srs indexed database'
        release: '63.0'
    ]

SRSFASTA (scope: a q s)

As SRS but returns the sequences in FASTA format.

URL (scope: s)

Uses a defined web server to retrieve a specific entry. EMBOSS may fail if the HTML causes complications.

    DB mydb [
        # required parameters
        method: url
        format: genbank
        url: "http://www.infobiogen.fr/srs5bin/cgi-bin/wgetz?-e+[genbank-id:%s]"
        # optional parameters
        type: N
        comment: "Genbank by ID from InfoBiogen"
    ]

The %s in the URL string indicates where EMBOSS will insert the identifier portion of the USA.

EMBLCD (scope: a q s)

Uses EMBLCD indices created with DBIFLAT to access EMBL format databases in the original format.

    directory: files:

    DB mydb [
        method: emblcd
        format: embl
        dir: $emboss_db_dir/embl
        file: *.dat
        # optional parameters
        type: N
        release: 63.0
        comment: "my comment"
        exclude: est*.dat
        indexdir: $emboss_db_dir/indices
    ]

GCG (scope: a q s)

Uses EMBLCD indices created with DBIGCG to access databases in GCG format. As for EMBLCD but with format: gcg and method: gcg.

BLAST (scope: a q s)

Uses EMBLCD indices created with DBIBLAST to access databases in BLAST format. As for EMBLCD but with format: blast and method: blast.

EXTERNAL (scope: a q s)

Uses an external application to retrieve sequences, returning them on STDOUT. The ID is passed as an argument to the application, either replacing %s in the command string (if present) or as an additional argument (if there is no %s).

    DB mydb [
        # required parameters
        method: app
        format: fasta
        app: "getfromdb thisfastadb"
        # optional parameters
        type: P
        comment: "my own protein database with a custom retrieval program"
    ]

APP (scope: a q s)

Same as EXTERNAL.

NBRF (scope: a q s)

For a method: declaration, EMBOSS will use that method for those access types that the method supports. If you wish to specify which access type should be handled by which method, then the methodsingle:, methodquery: and methodall: declarations should be used instead of method:.

    DB mydb [
        methodsingle: app
        format: fasta
        app: "customapp myproteindb"
        methodall: direct
        dir: $emboss_db_dir/myproteindb
        file: myproteindb.dat
        type: P
        comment: "single and all access for myproteindb"
    ]

Indexing and configuring flatfile databases

Flatfile databases are those released by EMBL, Swissprot and so on. The EMBOSS program DBIFLAT is used to generate emblcd indices that can be used for all types of database access. DBIFLAT can process databases in EMBL, SWISSPROT and GENBANK format. Pseudo EMBL format databases which do not have unique ID and AC entries will cause DBIFLAT to do mysterious things and should be avoided. DBIFLAT requires the databases to be uncompressed.

This example will not probe the deeper secrets of DBIFLAT (for which the reader is referred to the documentation or, failing that, the source code) but will show a typical installation for a common database.

We assume EMBOSS has been installed and works. This can be tested with the command

    wossname -auto

which should list all the programs available.

In this example we will index and configure the EMBL database for use with EMBOSS.

First download and unpack the EMBL database. This will require a considerable amount of disk space.

cd to the directory in which you have unpacked EMBL. It should look something like this when you run ls:

    est_fun.dat
    est_hum1.dat
    est_hum10.dat
    .
    .
    .
    syn.dat
    unc.dat
    vrl.dat
    vrt.dat

Run DBIFLAT to create the emblcd indices.

    % dbiflat
    Index a flat file database
          EMBL : EMBL
         SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
            GB : Genbank, DDBJ
         FASTA : FASTA format
    Entry format [SWISS]: EMBL
    Database name: embl
    Database directory [.]:
    Wildcard database filename [*.dat]:
    Release number [0.0]: 63.0
    Index date [00/00/00]: 31/07/00

DBIFLAT should happily chug away for some considerable time (up to a few hours depending on the speed of your machine) and will eventually generate the following index files:

    acnum.hit
    acnum.trg
    division.lkp

Now we create an entry in the EMBOSS configuration files to access the database. It is probably a good idea to try new database definitions in your local configuration file first. Put the following entry in your .embossrc:

    set emboss_db_dir /path_to_databases

    DB embl [
        type: N
        method: emblcd
        format: embl
        dir: $emboss_db_dir/embl
        file: "*.dat"
        release: "63.0"
        comment: "EMBL release 63.0"
    ]

Save .embossrc and try showdb. You should see a line that looks like:

    embl    N    OK    OK    OK    EMBL release 63.0

Fine tuning the installation

It is probably a good idea to set up subsections of the database so that end users can search just the regions they wish to search. Files can be included with the declaration files: or excluded with the declaration exclude:. In order to take just the EST files, try the following:

    DB emblest [
        type: N
        method: emblcd
        format: embl
        dir: $emboss_db_dir/embl
        file: "est*.dat"
        release: "63.0"
        comment: "EMBL release 63.0"
    ]

Files can also be given as a space separated list.
For example, to set up a database of all mammalian sequences (except genomes) try the following:

    DB emblallmam [
        type: N
        method: emblcd
        format: embl
        dir: $emboss_db_dir/embl
        file: "rod*.dat hum*.dat mam*.dat"
        release: "63.0"
        comment: "EMBL release 63.0"
    ]

It can be quite tedious to set up a long list of files to search. In many cases you can use the exclude function to make things easier.

    DB emblnoest [
        type: N
        method: emblcd
        format: embl
        dir: $emboss_db_dir/embl
        file: "*.dat"
        exclude: "est*.dat"
        release: "63.0"
        comment: "EMBL release 63.0"
    ]

This configures the emblnoest database to contain all of EMBL except the ESTs.

Indexing and configuring GCG format databases

EMBOSS can access GCG formatted databases, thus avoiding having multiple copies of the same databases in different formats. EMBOSS creates EMBLCD-like indices for the GCG format databases using the program DBIGCG. This runs in much the same way as DBIFLAT. You will need the GCG format .seq and .header files in order to create an indexed database.

cd to the GCG database directory containing your data and run DBIGCG:

    % dbigcg
    Index a GCG formatted database
          EMBL : EMBL
         SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
            GB : Genbank, DDBJ
           PIR : NBRF
    Entry format [EMBL]:
    Database name: embl
    Database directory [.]:
    Wildcard database filename [*.seq]:
    Release number [0.0]: 63.0
    Index date [00/00/00]: 31/07/00

The program will chug along for a while and will then generate the emblcd index files for the GCG format database.

The following entry should be put in your .embossrc:

    DB gcgembl [
        type: N
        method: gcg
        format: embl
        dir: $emboss_db_dir/embl
        file: "*.seq"
        release: "63.0"
        comment: "EMBL release 63.0"
    ]

SHOWDB should show your newly configured database. You can configure subsets of the databases in the same way as for the original format databases.

Indexing and configuring BLAST databases

Here be dragons

Configuring EMBOSS to use SRS for database lookup.

Here be lions

Indexing and configuring other databases

Many institutions may have local databases set up in their own Laboratory Information Management System. EMBOSS provides a simple mechanism for interfacing with such systems. As long as a program is available that can be called noninteractively and returns the specified sequence on standard output, EMBOSS can interface with it.

Use method: app or method: external (the two are equivalent) and app: "program command". The ID given in the USA will be appended to the command used to run the program. It is probably best to specify the methods available using the method subsets methodall:, methodquery: and methodsingle: rather than using the generic method: tag.

Other data

EMBOSS can be integrated with some common biological databases. These are described in this section.

REBASE

Rebase is the restriction enzyme database maintained by New England Biolabs. It is needed for programs such as remap and restrict. The latest version of Rebase can be obtained by anonymous FTP from ftp://ftp.ebi.ac.uk/pub/databases/rebase. EMBOSS needs the 'withrefm' file.

The data is extracted for EMBOSS with the program rebaseextract. If you installed EMBOSS with the --prefix option you may need to create the REBASE directory under the EMBOSS data directory (/site/prog/emboss/data in this example). This directory only needs creating once.

    % mkdir /site/prog/emboss/data/REBASE
    % rebaseextract
    Extract data from REBASE
    Full pathname of WITHREFM: /data/rebase/withrefm.008

Rebase is now installed and ready to use.

TRANSFAC

Transfac is the transcription factor binding site database. It is available by anonymous ftp from ftp://ftp.ebi.ac.uk/pub/databases/transfac/transfac32.tar.Z

Unpacking the distribution reveals a file called site.dat. This is the one EMBOSS needs. Run TFEXTRACT to extract the data from TRANSFAC.

    % tfextract
    Extract data from TRANSFAC
    Full pathname of transfac SITE.DAT: /databases/transfac/site.dat

tfscan can now access the TF database.
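
A note on the external application method from 'Indexing and configuring other databases': the rule given in this guide is that the ID either replaces %s in the app: string or, if there is no %s, is added as a further argument. The Python sketch below only illustrates that rule; it is not taken from the EMBOSS source. The function name build_command, the ID P12345 and the --id= option are made up for the example, and getfromdb is the placeholder program name used in the configuration example earlier.

```python
import shlex


def build_command(app, entry_id):
    """Build the argument list for an external retrieval program.

    Illustration of the rule described in this guide, not EMBOSS code.
    `app` is the value of the app: tag, `entry_id` the ID from the USA.
    """
    parts = shlex.split(app)
    if any("%s" in part for part in parts):
        # A %s placeholder is present: substitute the ID in place.
        return [part.replace("%s", entry_id) for part in parts]
    # No placeholder: pass the ID as an additional final argument.
    return parts + [entry_id]


if __name__ == "__main__":
    # Hypothetical program, option and ID, used only for illustration.
    print(build_command("getfromdb thisfastadb", "P12345"))
    print(build_command("getfromdb --id=%s thisfastadb", "P12345"))
```

Checking a command line this way before putting it in an app: tag can help confirm that a local retrieval program really can be called noninteractively with just the ID.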

PROSITE

Prosite is a database of regular expressions that match potentially diagnostic regions for structural/functional classification of proteins. EMBOSS needs this database for the patmatmotifs program. PROSITE can be obtained via anonymous FTP from ftp://ftp.ebi.ac.uk/pub/databases/prosite.

You may need to create a PROSITE subdirectory under data in the EMBOSS installation directory. Then run prosextract to build the EMBOSS Prosite database.

    % prosextract
    Builds the PROSITE motif database for patmatmotifs to search
    Enter name of prosite directory: /data/prosite

PROSITE is now integrated into your EMBOSS installation.

PRINTS

Prints is a database of diagnostic patterns of blocks of sequence homology in protein families. The PRINTS database can be searched using the EMBOSS program pscan.

PRINTS can be obtained via anonymous ftp from ftp://ftp.ebi.ac.uk/pub/databases/prints. The database is made available as compressed files, which should be uncompressed using gzip before integrating them into EMBOSS.

PRINTS is integrated with EMBOSS using the command printsextract:

    % printsextract
    Extract data from PRINTS
    Input file: /data/prints/prints27_0.dat

The PRINTS database is now integrated with EMBOSS.

Miscellaneous data files

Other data files should be kept in the data directory under the main EMBOSS installation. Individual users' personal data files can be kept in the current working directory, a subdirectory .embossdata of the current directory, their home directory, or a subdirectory .embossdata of their home directory. EMBOSS will search these locations in this order and will stop as soon as it finds a matching file. If the personal directories do not contain the desired file, EMBOSS will search the system wide data directory, /site/prog/emboss/data in this example.

Apparently inexplicable errors when running EMBOSS programs may be caused by the system not using the data files one expects. The search path can be displayed in search order using the command embossdata.
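
To make the search order above concrete, here is a small Python sketch of a first-match lookup over the same five locations. It only models the order as described in this section and is not how EMBOSS itself is implemented; the function names and any file name passed to them are illustrative assumptions.

```python
from pathlib import Path


def data_search_path(cwd, home, system_data):
    """Return the candidate directories in the order described above."""
    cwd, home = Path(cwd), Path(home)
    return [
        cwd,                   # current working directory
        cwd / ".embossdata",   # .embossdata under the current directory
        home,                  # home directory
        home / ".embossdata",  # .embossdata under the home directory
        Path(system_data),     # system wide data directory
    ]


def find_data_file(name, cwd, home, system_data):
    """Return the first existing file called `name`, or None if there is none."""
    for directory in data_search_path(cwd, home, system_data):
        candidate = directory / name
        if candidate.is_file():
            return candidate  # stop at the first match
    return None
```

A call such as find_data_file("some.dat", Path.cwd(), Path.home(), "/site/prog/emboss/data") shows which copy of a file would win under this model. A stale personal copy shadowing the system one is the kind of surprise the paragraph above warns about.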

Logging

Many system administrators may wish to make use of the logging facilities of EMBOSS. Setting the variable emboss_logfile in emboss.default or .embossrc allows the system to keep a log of which programs are used, when, and by whom.

    set emboss_logfile /site/log/emboss.log

The log file structure is very simple. Three tab separated fields are stored: program name, user name, and the date and time.

    prettyplot    joeuser    Wed Aug 02 14:29:13 2000

The file set in emboss_logfile should be world writable.

These settings can be overridden in a user's .embossrc file by redefining emboss_logfile. E.g. to prevent my system usage being logged I can put the following entry in my .embossrc file:

    set emboss_logfile /dev/null

This behaviour may change in the future to prevent users redefining some system settings.

From gwilliam at hgmp.mrc.ac.uk Thu Aug 10 09:19:51 2000
From: gwilliam at hgmp.mrc.ac.uk (gwilliam at hgmp.mrc.ac.uk)
Date: Thu, 10 Aug 2000 14:19:51 +0100 (BST)
Subject: -osformat strangeness
Message-ID: <200008101319.OAA10103@tellurium.hgmp.mrc.ac.uk>

Can anyone explain this behaviour? When I use '-osf staden' or '-osf gcg' I get no output but no error messages.

% seqret ttt.seq stdout
Reads and writes (returns) sequences
>
AGCTAGGGCTTAAA

[OK - this is a short test sequence, but the error occurs with longer sequences, e.g. embl:hsfau]

% seqret ttt.seq stdout -osf gcg
Reads and writes (returns) sequences

[Where did the output go?]

% seqret ttt.seq stdout -osf staden
Reads and writes (returns) sequences

[Ditto!]

% seqret ttt.seq stdout -osf embl
Reads and writes (returns) sequences
ID   standard; DNA; UNC; 14 BP.
SQ   Sequence 14 BP; 5 A; 2 C; 4 G; 3 T; 0 other;
     AGCTAGGGCT TAAA                                                14
//

[now it works]

The only output formats I think this happens with are gcg and staden.

Thanks,

Gary

From gwilliam at hgmp.mrc.ac.uk Thu Aug 10 10:12:48 2000
From: gwilliam at hgmp.mrc.ac.uk (gwilliam at hgmp.mrc.ac.uk)
Date: Thu, 10 Aug 2000 15:12:48 +0100 (BST)
Subject: Followup to -osformat strangeness
Message-ID: <200008101412.PAA10284@tellurium.hgmp.mrc.ac.uk>

Using '-osf gcg' creates a file '.gcg'.
Similarly with '-osf staden' and '.staden'.

Gary

From pmr at sanger.ac.uk Thu Aug 10 10:41:45 2000
From: pmr at sanger.ac.uk (Peter Rice)
Date: Thu, 10 Aug 2000 15:41:45 +0100 (BST)
Subject: -osformat strangeness
In-Reply-To: <200008101319.OAA10103@tellurium.hgmp.mrc.ac.uk> (gwilliam@hgmp.mrc.ac.uk)
References: <200008101319.OAA10103@tellurium.hgmp.mrc.ac.uk>
Message-ID: <200008101441.PAA04522@europa.sanger.ac.uk>

Gary Williams writes:

>Can anyone explain this behaviour? When I use '-osf staden' or '-osf
>gcg' I get no output but no error messages.

Strictly, this happens when '-ossingle' is on. This forces output to go to a separate file for each sequence, and is turned on for formats that do not support multiple sequences in one file (GCG/GCG8, STADEN/EXPERIMENT, RAW/TEXT/PLAIN).

Any other format with -ossingle on the command line will do the same.

GCG and STADEN with -noossingle will write to the same file. For GCG format EMBOSS is able to read the resulting file (but GCG cannot :-)

This behaviour could be changed, but needs some careful planning. For example, with one output file you may want it to be used. With many output files you may want them all to be automatically named.

Perhaps some special processing for stdout and stderr would be useful, and for user-defined output filenames the user's choice could at least be used for the first sequence (maybe with a warning for the second sequence if there is one).

So many cases to consider. All packages that write GCG format face these kinds of problems.
Peter -- ---------------------------------------------------------------------- Peter Rice | Informatics Division, The Sanger Centre, E-mail: pmr at sanger.ac.uk | Wellcome Trust Genome Campus, Tel: (44) 1223 494967 | Hinxton, Cambridge, CB10 1SA, England Fax: (44) 1223 494919 | URL: http://www.sanger.ac.uk/Users/pmr/ From gwilliam at hgmp.mrc.ac.uk Thu Aug 10 11:22:39 2000 From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522) Date: Thu, 10 Aug 2000 16:22:39 +0100 Subject: -osformat strangeness References: <200008101319.OAA10103@tellurium.hgmp.mrc.ac.uk> <200008101441.PAA04522@europa.sanger.ac.uk> Message-ID: <3992C8BF.51B6@hgmp.mrc.ac.uk> Peter Rice wrote: > > Gary Williams writes: > > >Can anyone explain this behaviour? When I use '-osf staden' or '-osf > >gcg' I get no output but no error messages. > > Strictly, this happens when '-ossingle' is on. This forces output to > go to a separate file for each sequence, and is turned on for formats > that do not support multiple sequences in one file (GCG/GCG8, > STADEN/EXPERIMENT, RAW/TEXT/PLAIN) > > Any other format with -ossingle on the command line will do the same. > > GCG and STADEN with -noossingle will write to the same file. For GCG > format EMBOSS is able to read the resulting file (but GCG cannot :-) > > This behaviour could be changed, but needs some careful planning. For > example, with one output file you may want it to be used. With many > output files you may want them all to be automatically named. > > Perhaps some special processing for stdout and stderr would be useful, > and for user-defined output filenames the user's choice could at least > be used for the first sequence (maybe with a warning for the second > sequence if there is one). This sounds more what I might expect. > So many cases to consider. All packages that write GCG format face > these kinds of problems. Hmm I see what you mean, '-ossingle' toggles this behaviour with any format, eg embl. 
It doesn't seem the right behaviour though. I guess that whatever behaviour it is set up to have we will meet discrepancies between what was expected and what is possible.

I still don't see why there should be a difference between using 'staden::outfile' and '-osf staden' though. Is this the expected behaviour? If so, why?

tellurium<116>seqret embl:hsfau jjj.seq
Reads and writes (returns) sequences
tellurium<117>ls -l jjj.seq
-rw-r----- 1 gwilliam cs 560 Aug 10 16:06 jjj.seq

[Behaves as I expected]

tellurium<118>rm jjj.seq
tellurium<119>seqret embl:hsfau staden::jjj.seq
Reads and writes (returns) sequences
tellurium<120>ls -l jjj.seq
-rw-r----- 1 gwilliam cs 539 Aug 10 16:06 jjj.seq

[Behaves as I expected]

tellurium<122>rm jjj.seq
tellurium<123>rm hsfau.staden
tellurium<124>seqret embl:hsfau jjj.seq -osf staden
Reads and writes (returns) sequences
tellurium<125>ty jjj.seq
jjj.seq: No such file or directory
tellurium<126>ls -l hsfau.staden
-rw-r----- 1 gwilliam cs 539 Aug 10 16:09 hsfau.staden

[Doesn't make 'jjj.seq', but makes 'hsfau.staden'.]

Thanks,
Gary

--
Gary Williams Tel: +44 1223 494522 Fax: +44 1223 494512
mailto:G.Williams at hgmp.mrc.ac.uk http://www.hgmp.mrc.ac.uk/
Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK

From david.martin at biotek.uio.no Tue Aug 15 06:12:58 2000
From: david.martin at biotek.uio.no (David Martin)
Date: Tue, 15 Aug 2000 11:12:58 +0100
Subject: Changing default settings in EMBOSS
Message-ID:

I have been looking through the source to try to track down the possible settings that can be defined beforehand in EMBOSS. So far I have the following list:

acdroot
logfile
DATA
all general qualifiers
format (default sequence format)

Rather than go mad trying to find my way through the code I have tried to find functions that resolve the defined variables. Most of these seem to not be hard coded (with the exception of those above excluding the general qualifiers.)
and most of this lookup is in ajacd.c

Which qualifiers can be coded specifically and which can't? ajacd.c is about 172 pages of source code and I don't plan to go through all of it.

..d

---------------------------------------------------------------------
* Dr. David Martin Biotechnology Centre of Oslo *
* Node Manager Gaustadalleen 21 *
* The Norwegian EMBNet Node P.O. box 1125 Blindern *
* tel +47 22 95 87 56 N-0317 Oslo *
* fax +47 22 69 41 30 Norway *
---------------------------------------------------------------------

From gwilliam at hgmp.mrc.ac.uk Thu Aug 17 05:16:16 2000
From: gwilliam at hgmp.mrc.ac.uk (gwilliam at hgmp.mrc.ac.uk)
Date: Thu, 17 Aug 2000 10:16:16 +0100 (BST)
Subject: GFF question
Message-ID: <200008170916.KAA02641@tellurium.hgmp.mrc.ac.uk>

I have two sequences that are being compared.

Is it better to write out a single GFF file with the results for both sequences in it, or should I write out two separate GFF files?

Thanks,
Gary

From gwilliam at hgmp.mrc.ac.uk Thu Aug 17 06:42:15 2000
From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522)
Date: Thu, 17 Aug 2000 11:42:15 +0100
Subject: GFF question
References:
Message-ID: <399BC187.7F20@hgmp.mrc.ac.uk>

David Martin wrote:
>
> On Thu, 17 Aug 2000 gwilliam at hgmp.mrc.ac.uk wrote:
>
> > I have two sequences that are being compared.
> >
> > Is it better to write out a single GFF file with the results for both
> > sequences in it, or should I write out two separate GFF files?
>
> Give the user the option.
>
> I can think of arguments both ways.

What are your arguments?
Gary

--
Gary Williams Tel: +44 1223 494522 Fax: +44 1223 494512
mailto:G.Williams at hgmp.mrc.ac.uk http://www.hgmp.mrc.ac.uk/
Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK

From david.martin at biotek.uio.no Thu Aug 17 07:04:42 2000
From: david.martin at biotek.uio.no (David Martin)
Date: Thu, 17 Aug 2000 12:04:42 +0100
Subject: GFF question
In-Reply-To: <399BC187.7F20@hgmp.mrc.ac.uk>
Message-ID:

On Thu, 17 Aug 2000, Gary Williams, Tel 01223 494522 wrote:

> David Martin wrote:
> >
> > On Thu, 17 Aug 2000 gwilliam at hgmp.mrc.ac.uk wrote:
> >
> > > I have two sequences that are being compared.
> > >
> > > Is it better to write out a single GFF file with the results for both
> > > sequences in it, or should I write out two separate GFF files?
> >
> > Give the user the option.
> >
> > I can think of arguments both ways.
>
> What are your arguments?

-ofsingle ?

but seriously, in some cases one would want a GFF containing features from many subsequences, for example if (as in our case) we are analysing a genome containing many unordered, unconnected contigs but want information such as repeat regions, ISs etc. to be kept in one place.

In other cases one may want to analyse a set of sequences but produce a separate file for each one, e.g. if one is analysing a set of related genes but does not want the GFFs combined.

The overall argument is that EMBOSS should be flexible and empowering, not conforming to any preconceived mindset (except of course that it should be flexible and empowering, not conforming to any preconceived mindset ( except ...))

..d

>
> Gary
>
> --
> Gary Williams Tel: +44 1223 494522 Fax: +44 1223 494512
> mailto:G.Williams at hgmp.mrc.ac.uk http://www.hgmp.mrc.ac.uk/
> Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK
>
>
>

---------------------------------------------------------------------
* Dr.
David Martin Biotechnology Centre of Oslo *
* Node Manager Gaustadalleen 21 *
* The Norwegian EMBNet Node P.O. box 1125 Blindern *
* tel +47 22 95 87 56 N-0317 Oslo *
* fax +47 22 69 41 30 Norway *
---------------------------------------------------------------------

From gwilliam at hgmp.mrc.ac.uk Thu Aug 17 05:16:16 2000
From: gwilliam at hgmp.mrc.ac.uk (gwilliam at hgmp.mrc.ac.uk)
Date: Thu, 17 Aug 2000 10:16:16 +0100 (BST)
Subject: GFF question
Message-ID: <200008170916.KAA02641@tellurium.hgmp.mrc.ac.uk>

A non-text attachment was scrubbed...
Name: not available
Type: text
Size: 200 bytes
Desc: not available
Url : http://lists.open-bio.org/pipermail/emboss-dev/attachments/20000817/9d9c7f70/attachment.ksh

From gwilliam at hgmp.mrc.ac.uk Thu Aug 17 05:16:16 2000
From: gwilliam at hgmp.mrc.ac.uk (gwilliam at hgmp.mrc.ac.uk)
Date: Thu, 17 Aug 2000 10:16:16 +0100 (BST)
Subject: GFF question
Message-ID: <200008170916.KAA02641@tellurium.hgmp.mrc.ac.uk>

A non-text attachment was scrubbed...
Name: not available
Type: text
Size: 202 bytes
Desc: not available
Url : http://lists.open-bio.org/pipermail/emboss-dev/attachments/20000817/9d9c7f70/attachment-0001.ksh

From david.martin at biotek.uio.no Wed Aug 23 08:23:18 2000
From: david.martin at biotek.uio.no (David Martin)
Date: Wed, 23 Aug 2000 13:23:18 +0100
Subject: Some future ideas
Message-ID:

I had some thoughts about future expansions of EMBOSS and really want to put down a few 'placemarkers' to see what people think.

Eventually people are going to get the idea that EMBOSS should be able to do things with data other than just sequences and want to run e.g. microarray and structure type analyses. With the current database definitions in emboss.default there is a type: clause that is not required. I would propose to make this mandatory and extend the values.

N is nucleotide sequence database
P is protein sequence database
S is a structure database
M is a microarray experiment database.
The USA can then be extended to cover structures (pdb:1HTF, for example) and microarray experiments. There are probably other entities that could be included.

With some careful type management we could even convert types on the fly, so you could put in a pdb reference when asked for a protein sequence and it would be automatically derived (OK, there are a lot of problems with such things but it would be useful).

Other possibilities:

XML format output in some suitable XML format? This would probably need a lot of work in the libraries to tidy everything up and make it work.

Still looking for a student to write an EMBOSS-WAP interface ;-)

..d

---------------------------------------------------------------------
* Dr. David Martin Biotechnology Centre of Oslo *
* Node Manager Gaustadalleen 21 *
* The Norwegian EMBNet Node P.O. box 1125 Blindern *
* tel +47 22 95 87 56 N-0317 Oslo *
* fax +47 22 69 41 30 Norway *
---------------------------------------------------------------------

From bernd at golgi.ski.mskcc.org Wed Aug 23 01:09:49 2000
From: bernd at golgi.ski.mskcc.org (Bernd Jagla)
Date: Wed, 23 Aug 2000 11:09:49 +0600
Subject: Some future ideas
References:
Message-ID: <39A35C9D.9B819964@golgi.ski.mskcc.org>

David Martin wrote:

> I had some thoughts about future expansions of EMBOSS and really want to
> put down a few 'placemarkers' to see what people think.
>
> Eventually people are going to get the idea that EMBOSS should be
> able to do things with data other than just sequences and want to run
> e.g. microarray and structure type analyses. With the current database
> definitions in emboss.default there is a type: clause that is not
> required. I would propose to make this mandatory and extend the values.
>
> N is nucleotide sequence database
> P is protein sequence database
> S is a structure database
> M is a microarray experiment database.
>
> The USA can then be extended to cover structures (pdb:1HTF) for an example
> and microarray experiments.
There are probably other entities that could
> be included.
>
> With some careful type management we could even convert types on the fly,
> so you could put in a pdb reference when asked for a protein sequence and
> it would be automatically derived (OK, there are a lot of problems with
> such things but it would be useful).
>
> Other possibilities:
>
> XML format output in some suitable XML format? This would probably need
> a lot of work in the libraries to tidy everything up and make it work.
>
> Still looking for a student to write an EMBOSS-WAP interface ;-)
>
> ..d
>
> ---------------------------------------------------------------------
> * Dr. David Martin Biotechnology Centre of Oslo *
> * Node Manager Gaustadalleen 21 *
> * The Norwegian EMBNet Node P.O. box 1125 Blindern *
> * tel +47 22 95 87 56 N-0317 Oslo *
> * fax +47 22 69 41 30 Norway *
> ---------------------------------------------------------------------

Hi David,

I still feel quite new to EMBOSS and am not that familiar with the databanks, but it sounds very good to be able to analyze some microarray data. I also believe that there should be some other possibilities for data analysis. I personally like artificial neural networks, for they are fast and "easy" to use, and I already have some programs written using EMBOSS. I am thinking of some other statistical analysis tools to implement (information analysis, some visual output of aa content, distribution and so many other things). For this it would be good to be able to build groups of sequences and sequence parts, add some numbers to these groups, and probably have a new class of functions dealing with these groups. Of course, we should discuss the data model in a little more detail if it is interesting...

So, do you think EMBOSS should be able to deal with these kinds of problems as well?
Bernd

From gbottu at ben.vub.ac.be Fri Aug 25 10:35:57 2000
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Fri, 25 Aug 2000 16:35:57 +0200 (MET DST)
Subject: Some future ideas about formats
Message-ID: <200008251435.QAA11192@bigben.vub.ac.be>

> Eventually people are going to get the idea that EMBOSS should be
>able to do things with data other than just sequences and want to run
>e.g. microarray and structure type analyses. With the current database
>definitions in emboss.default there is a type: clause that is not
>required. I would propose to make this mandatory and extend the values.
>
>N is nucleotide sequence database
>P is protein sequence database
>S is a structure database
>M is a microarray experiment database.
>
>The USA can then be extended to cover structures (pdb:1HTF) for an example
>and microarray experiments. There are probably other entities that could
>be included.
>

The way EMBOSS handles sequences is certainly a vast improvement over GCG and should be extended to other kinds of data. But why limit ourselves to structures and microarray data? There are other kinds of data that are handled by many software packages and that hence hang around in several alternative formats: amino acid symbol comparison tables, codon usage tables, sequence motifs defined as patterns/profiles/HMMs, phylogenetic trees, etc. It would be nice if they could all be imported in a transparent way and exported in a user-chosen format directly from the program that creates them.

Guy Bottu, Belgian EMBnet Node

From ableasby at hgmp.mrc.ac.uk Fri Aug 25 10:41:14 2000
From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk)
Date: Fri, 25 Aug 2000 15:41:14 +0100 (BST)
Subject: Some future ideas about formats
Message-ID: <200008251441.PAA11554@tin.hgmp.mrc.ac.uk>

Agreed in principle with Guy and David. Structures were next on the list. The interesting thing with structures is the nightmare of PDB parsing.
My colleague, Jon Ison, has a PDB parser which he developed as part of his PhD. He's using it to create a clean PDB which we can hopefully release soon. He intends EMBOSSifying it. Alan From david.martin at biotek.uio.no Tue Aug 1 14:45:20 2000 From: david.martin at biotek.uio.no (David Martin) Date: Tue, 1 Aug 2000 15:45:20 +0100 Subject: File locations Message-ID: When one is asked to specify file locations, should it be possible to specify URIs instead? It would be great to be able to update eg rebase by specifying ftp://ftp.ebi.ac.uk/pub/databases/rebase/withrefm.008 instead of downloading then extracting the file. Just a thought and definitely on the backburner list. ..d --------------------------------------------------------------------- * Dr. David Martin Biotechnology Centre of Oslo * * Node Manager Gaustadalleen 21 * * The Norwegian EMBNet Node P.O. box 1125 Blindern * * tel +47 22 95 87 56 N-0317 Oslo * * fax +47 22 69 41 30 Norway * --------------------------------------------------------------------- From david.martin at biotek.uio.no Wed Aug 2 14:49:18 2000 From: david.martin at biotek.uio.no (David Martin) Date: Wed, 2 Aug 2000 15:49:18 +0100 Subject: Pre first draft of admin guide. Message-ID: OK it is in raw text form. I'll mark it up for LaTeX soon but here it is for your delectation and delight. The major sticking points at the moment are Database Indexing, especially DBIBLAST but there are unresolved issues with DBIFLAT and FASTA files and DBIGCG (because it loops until armageddon in the form of SIGDIEDIEDIE) so I haven't been able to test it properly. Comments are welcome. I'm hoping it can be pretty much a recipe book for EMBOSS setup. With a bit of standardising of macros, it should be possible to dump out the program docs as LaTeX and incorporate those too. I'll look at marking up the quick guide, and then with Val's tutorial and Thon's ACD guide we are approaching a reasonable manual for EMBOSS. 
Maybe I should create a small EMBOSS logo in LaTeX like EMB that would slot into the text at about the right height. OSS ..d --------------------------------------------------------------------- * Dr. David Martin Biotechnology Centre of Oslo * * Node Manager Gaustadalleen 21 * * The Norwegian EMBNet Node P.O. box 1125 Blindern * * tel +47 22 95 87 56 N-0317 Oslo * * fax +47 22 69 41 30 Norway * --------------------------------------------------------------------- From david.martin at biotek.uio.no Wed Aug 2 14:50:24 2000 From: david.martin at biotek.uio.no (David Martin) Date: Wed, 2 Aug 2000 15:50:24 +0100 Subject: Pre first draft of admin guide. (fwd) Message-ID: And the file is here as an attachment. ..d --------------------------------------------------------------------- * Dr. David Martin Biotechnology Centre of Oslo * * Node Manager Gaustadalleen 21 * * The Norwegian EMBNet Node P.O. box 1125 Blindern * * tel +47 22 95 87 56 N-0317 Oslo * * fax +47 22 69 41 30 Norway * --------------------------------------------------------------------- ---------- Forwarded message ---------- Date: Wed, 2 Aug 2000 15:49:18 +0100 From: David Martin Reply-To: admin at embnet.uio.no To: emboss-dev at embnet.org Subject: Pre first draft of admin guide. OK it is in raw text form. I'll mark it up for LaTeX soon but here it is for your delectation and delight. The major sticking points at the moment are Database Indexing, especially DBIBLAST but there are unresolved issues with DBIFLAT and FASTA files and DBIGCG (because it loops until armageddon in the form of SIGDIEDIEDIE) so I haven't been able to test it properly. Comments are welcome. I'm hoping it can be pretty much a recipe book for EMBOSS setup. With a bit of standardising of macros, it should be possible to dump out the program docs as LaTeX and incorporate those too. I'll look at marking up the quick guide, and then with Val's tutorial and Thon's ACD guide we are approaching a reasonable manual for EMBOSS. 
Maybe I should create a small EMBOSS logo in LaTeX like EMB that would slot into the text at about the right height. OSS

..d

---------------------------------------------------------------------
* Dr. David Martin Biotechnology Centre of Oslo *
* Node Manager Gaustadalleen 21 *
* The Norwegian EMBNet Node P.O. box 1125 Blindern *
* tel +47 22 95 87 56 N-0317 Oslo *
* fax +47 22 69 41 30 Norway *
---------------------------------------------------------------------
-------------- next part --------------

The EMBOSS Administrators Guide

What is EMBOSS?
Where do I get it?
Installation
Configuration
Databases
    Database access
    Indexing and configuring flatfile databases
    Indexing and configuring GCG format databases
    Indexing and configuring BLAST databases
    Configuring EMBOSS to use SRS for database lookup.
    Indexing and configuring other databases
Other data
Logging

What is EMBOSS?

EMBOSS is a freely available suite of bioinformatics applications and libraries. It can be downloaded via the internet, copied, customised, and passed on under the terms of the various General Public Licenses.

EMBOSS has been developed in response to the need for a powerful, adaptable suite of software that can interface readily with many different situations and meet the needs of professional bioinformaticists, particularly those needing high throughput and/or scriptable capabilities.

EMBOSS has primarily been developed by those responsible for the public extensions to the GCG package. Whilst EMBOSS duplicates much of EGCG it includes far better database interaction and has the benefit of freely accessible source code so novel applications can be developed rapidly and at minimal cost.

EMBOSS is currently only available for Unix/Linux systems but it has been known to compile and run on Windows NT. This document will only consider the UNIX version and will assume the reader has some familiarity with UNIX system administration.

Where do I get it?
EMBOSS is available for download from the primary site at the UK EMBnet node via ftp.

    ftp.uk.embnet.org/pub/EMBOSS/

This directory contains the EMBOSS package and several associated packages (collectively known as EMBASSY) that are distributed with EMBOSS. Download these to a suitable location.

Documentation is available at http://www.uk.embnet.org/Software/EMBOSS

Installation

Unpacking

You will have downloaded the EMBOSS and EMBASSY packages to a suitable directory. For this example we will assume you have downloaded them to /packages, so you should now have the following files (or similar) and maybe more packages in EMBASSY.

    EMBOSS-1.0.0.tar.gz
    PHYLIP-3.573c.tar.gz
    MSE-0.0.4.tar.gz
    TOPO-0.1.tar.gz

First unpack the EMBOSS distribution:

    gunzip EMBOSS-1.0.0.tar.gz
    tar xf EMBOSS-1.0.0.tar

This will create a new directory, EMBOSS-1.0.0. Enter the EMBOSS directory:

    cd EMBOSS-1.0.0

Create a directory for the EMBASSY packages:

    mkdir embassy

Now copy the EMBASSY packages to the embassy directory:

    cp ../MSE-0.0.4.tar.gz ../PHYLIP-3.573c.tar.gz ../TOPO-0.1.tar.gz embassy

Go into the embassy directory and unpack those packages:

    cd embassy
    gunzip MSE-0.0.4.tar.gz
    tar xf MSE-0.0.4.tar

and so on for each EMBASSY package. Go back up one directory to the main EMBOSS package directory and prepare to start compilation.

Compilation

Building EMBOSS is easy. It follows the usual GNU style of configure, make, make install. We'll take these steps one at a time.

Configuration

To accept the default configuration, just type

    ./configure

and let EMBOSS get on with it. You may want to make some changes to the configuration parameters according to your local policy. This section will not cover all the possibilities, just some of the more common ones. The configuration script will attempt to find the necessary components in your system to determine how to successfully build EMBOSS.
It typically expects the GNU C compiler (gcc) and several standard libraries that should already be part of your Unix/Linux system. Most modern Linux distributions should work straight out of the box.

Installation directory

You need to have write permission on the directory in which you eventually wish to install EMBOSS. You may also wish to put it somewhere other than the standard location of /usr/local/emboss. This is controlled by the --prefix argument. In my case I have all my applications owned by a non-privileged user and installed under /site/prog.

    ./configure --prefix=/site/prog/emboss

will install EMBOSS under /site/prog/emboss. The binaries will be in /site/prog/emboss/bin with shared libraries in /site/prog/emboss/lib. Data will be in /site/prog/emboss/data, and the configuration files (ACD files) for the applications will be under /site/prog/emboss/share in directories corresponding to the package name.

The individual directories for installation can be modified with other configuration commands but this is usually not necessary. Run ./configure --help to get more information on the directories that can be changed and other configuration options.

Run ./configure with the options you wish to use. This may take a short while, during which various messages will scroll up the screen. Depending on your system you may need to explicitly configure the graphics. Please see the section 'Configuring EMBOSS graphics' below.
    ./configure --prefix=/site/prog/emboss --with-pngdriver=/site/lib

All should be well with this and configure should exit with a message like this:

    creating ./config.status
    creating plplot/Makefile
    creating plplot/lib/Makefile
    creating nucleus/Makefile
    creating ajax/Makefile
    creating emboss/Makefile
    creating emboss/acd/Makefile
    creating test/Makefile
    creating test/data/Makefile
    creating test/embl/Makefile
    creating test/pir/Makefile
    creating test/swiss/Makefile
    creating test/swnew/Makefile
    creating test/wormpep/Makefile
    creating emboss/data/Makefile
    creating emboss/data/CODONS/Makefile
    creating emboss/data/REBASE/Makefile
    creating emboss/data/PRINTS/Makefile
    creating emboss/data/PROSITE/Makefile
    creating Makefile

Configuration is now complete.

Configuring EMBOSS graphics

The PLPLOT library can produce output to many devices but requires certain libraries that are NOT distributed with EMBOSS.

To get X-windows based output you must have X installed, otherwise PLPLOT will not build the required driver. You may need to specify the location of your X-windows library with the configuration options:

    --x-includes=DIR (X include files are in DIR)
    --x-libraries=DIR (X library files are in DIR)

To explicitly configure PLPLOT without X-windows, use --without-x.

To get PLPLOT to produce PNG images you will need to have the z, png and gd libraries installed. In particular gd version >= 1.6.3 must be used. If for some reason you do not have the required libraries and your system support group will not update them (in particular gd, as the older versions support GIF, which is NOT supported in later versions) then install the latest versions of all three (z, gd, png) to a new directory and then add this new directory to your configure line for EMBOSS, i.e.

    ./configure --with-pngdriver=my_dir

where the z, gd and png libraries were each installed using

    ./configure --prefix=my_dir

You can explicitly tell EMBOSS not to include PNG support with --without-pngdriver.

How to tell if ./configure has found PNG.
Watch for something like the following when running ./configure:

    checking if png driver is wanted... yes
    checking for inflateEnd in -lz... (cached) yes
    checking for png_destroy_read_struct in -lpng... (cached) yes
    checking for gdImageCreateFromPng in -lgd... (cached) yes

This means that the configuration script has located the PNG libraries on your system. If you see a message indicating that ./configure could not find the libraries or that the version of gd was too old then you should install the latest versions of the libraries yourself and rerun configure with the correct --with-pngdriver value.

Building EMBOSS

Building EMBOSS is a matter of typing 'make' and going to find something else to do for the next ten minutes to half an hour, depending on the speed of your system. EMBOSS will first build the shared libraries (PLPLOT, AJAX, and NUCLEUS) and then build the applications. You will see plenty of warnings complaining about libraries not being used to resolve any symbols. These can be safely ignored.

If all goes according to plan you should have built EMBOSS successfully. If not, you will have to try to work out why the build failed. If you can't work it out yourself, send an email describing the problem to emboss-bug at sanger.ac.uk with a copy of the config.status and config.cache files from your EMBOSS directory. (These will tell the developers what state your system was in when compilation failed.)

I am assuming that compilation was successful. You now have to check that you have the correct access permissions for the directory in which you wish to install EMBOSS and type 'make install'. After a few minutes and many pagefuls of messages, EMBOSS should be installed where you specified.

Tidying up the environment

You will now need to make a few adjustments to your environment to ensure that EMBOSS runs smoothly. EMBOSS looks for certain environment variables to determine where the libraries and data are found.
These instructions assume you installed EMBOSS in /site/prog/emboss. Adjust them to suit your installation.

Insert the following lines at the end of /etc/cshrc (or ~/.cshrc for a personal installation):

    setenv EMBOSS_DATA /site/prog/emboss/data
    setenv PLPLOT_LIB /site/prog/emboss/lib
    set path=( /site/prog/emboss/bin ${path} )

Or for bash/ksh/sh users, insert the following at the end of /etc/profile or ~/.bashrc:

    EMBOSS_DATA=/site/prog/emboss/data
    PLPLOT_LIB=/site/prog/emboss/lib
    PATH=/site/prog/emboss/bin:$PATH
    export EMBOSS_DATA PLPLOT_LIB PATH

EMBOSS should now be ready for use. You can test this by trying the program 'wossname':

    wossname -auto |more

This should give a long list of programs that are available. Press space to page down through the list. This is just the EMBOSS programs and doesn't include any of the EMBASSY programs.

Installing EMBASSY

As well as the base libraries and standard EMBOSS distribution, various extra packages (EMBASSY) are distributed with EMBOSS. To install an EMBASSY package, go to the relevant directory. For example, to install PHYLIP (which was unpacked into /packages/EMBOSS-1.0.0/embassy/PHYLIP-3.573c earlier):

    cd /packages/EMBOSS-1.0.0/embassy/PHYLIP-3.573c
    ./configure --prefix=/site/prog/emboss
    make
    make install

NB. You MUST use the same arguments for configure that you used for the installation of the main EMBOSS package.

Repeat as necessary for the other EMBASSY packages. You should now find that running wossname as before lists the EMBASSY programs.

Configuration

EMBOSS can be configured to match your requirements. EMBOSS looks for a configuration file in several places. Firstly it looks in /site/prog/emboss/share/EMBOSS for a file 'emboss.default'. It then looks in your home directory for the file '.embossrc' and finally in the current directory for '.embossrc'. In each case definitions will override those previously defined.

Several aspects of EMBOSS can be defined.
These are:

    EMBOSS environment variables
    EMBOSS databases
    Default behaviour of EMBOSS programs

As databases are by far the most complex of these they will be covered in a separate section.

EMBOSS environment variables

These are set with an 'env' or a 'set' declaration. 'env' and 'set' are interchangeable. The most important environment variable is the location of the ACD files that describe each program.

    set emboss_acdroot /site/prog/emboss/share/EMBOSS/acd

Environment variables are useful for easing the maintenance of your emboss.default. For example you may want to specify the location of your databases as an environment variable. Then if you move the databases you only have to update one line in the configuration file.

    set emboss_database_dir /data/databases/flatfiles

This would then be referred to as $emboss_database_dir/embl for the directory /data/databases/flatfiles/embl

Databases

Database access

EMBOSS offers three methods for accessing databases:

    All: EMBOSS returns all the sequences in the database in no particular order.
    Query: EMBOSS retrieves a set of sequences corresponding to a wildcard query.
    Single: EMBOSS retrieves a single sequence indexed by ID or accession number.

Each database definition can configure one or many of these methods for database access.

Typically EMBOSS uses the 'emblcd' system of database indexing. This comes in three variants depending on the original format of your database. The emblcd method assumes that you have both ID and accession number in each record. If you do not have both ID and accession number you will have to use an alternative method. Please see the 'other databases' section below.

General database configuration

Each database is configured using a DB declaration.
The generalised form is

    DB databasename [ Configuration options ]

The configuration options are tag/value pairs and must contain at least a description of the access method (using method: or one or more of methodsingle:, methodquery: and methodall:) and a description of the format the sequences will be returned in (using format:). In addition to these tags there will be other tags that are needed for particular methods and other tags that are optional.

method: | scope | Description
(scope: a = all, q = query, s = single)

DIRECT | a | Returns all the database entries, one after the other. It assumes no indexing.

    DB mydb [
        #required parameters
        method: direct
        format: fasta
        dir: $emboss_db_dir/mydb
        file: *.dat
        #optional parameters
        type: N
        release: 63.0
        comment: "My own database with no indices"
        exclude: "est*.dat"
    ]

SRS | a q s | Returns entries from a local installation of SRS using the -e switch to getz to return entries in the original format.

    DB mydb [
        #required parameters
        method: srs
        format: embl
        app: getz
        #optional parameters
        dbalias: embl
        type: N
        comment: 'My srs indexed database'
        release: '63.0'
    ]

SRSFASTA | a q s | As SRS but returns the sequences in FASTA format.

URL | s | Uses a defined web server to retrieve a specific entry. EMBOSS may fail if the HTML causes complications.

    DB mydb [
        # required parameters
        method: url
        format: genbank
        url: "http://www.infobiogen.fr/srs5bin/cgi-bin/wgetz?-e+[genbank-id:%s]"
        #optional parameters
        type: N
        comment: "Genbank by ID from InfoBiogen"
    ]

The %s in the URL string indicates where EMBOSS will insert the identifier portion of the USA.

EMBLCD | a q s | Uses EMBLCD indices created with DBIFLAT to access EMBL format databases in the original format.

    directory: files:

    DB mydb [
        method: emblcd
        format: embl
        dir: $emboss_db_dir/embl
        file: *.dat
        #optional parameters
        type: N
        release: 63.0
        comment: "my comment"
        exclude: est*.dat
        indexdir: $emboss_db_dir/indices
    ]

GCG | a q s | Uses EMBLCD indices created with DBIGCG to access databases in GCG format.
    As for EMBLCD but with format: gcg and method: gcg.

  BLAST      a q s    Uses EMBLCD indices created with DBIBLAST to access databases in BLAST format.

    As for EMBLCD but with format: blast and method: blast.

  EXTERNAL   a q s    Uses an external application to retrieve sequences, returning them on STDOUT. The ID is passed as an argument to the application, either replacing %s in the command string (if present) or as an additional argument (if there is no %s).

    DB mydb [
      #required parameters
      method: app
      format: fasta
      app: "getfromdb thisfastadb"
      #optional parameters
      type: P
      comment: "my own protein database with a custom retrieval program"
    ]

  APP        a q s    Same as EXTERNAL.

  NBRF       a q s

With a method: declaration, EMBOSS will use that method for every kind of access the method supports. If you wish to specify which kind of access should be handled by which method, use the methodsingle:, methodquery: and methodall: declarations instead of method:.

  DB mydb [
    methodsingle: app
    format: fasta
    app: "customapp myproteindb"
    methodall: direct
    dir: $emboss_db_dir/myproteindb
    file: myproteindb.dat
    type: P
    comment: "single and all access for myproteindb"
  ]

Indexing and configuring flatfile databases

Flatfile databases are those released by EMBL, Swissprot and so on. The EMBOSS program DBIFLAT is used to generate emblcd indices that can be used for all types of database access. DBIFLAT can process databases in EMBL, SWISSPROT and GENBANK format. Pseudo-EMBL format databases which do not have unique ID and AC entries will cause DBIFLAT to do mysterious things and should be avoided. DBIFLAT requires the databases to be uncompressed.

This example will not probe the deeper secrets of DBIFLAT (for which the reader is referred to the documentation, or failing that the source code) but will show a typical installation for a common database.

We assume EMBOSS has been installed and works. This can be tested with the command

  wossname -auto

which should list all the programs available.
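As an aside on the EXTERNAL/APP method described earlier: the rule for passing the ID to the external program (replace %s if the command string contains it, otherwise append the ID as an extra argument) can be sketched in Python as below. This is only an illustration of the rule as stated in this guide, not the EMBOSS implementation. The function name build_app_command, the example ID and the -id flag in the second call are made up for the illustration; "getfromdb thisfastadb" is the command from the example above.

```python
import shlex


def build_app_command(app, entry_id):
    """Turn an app: string plus an ID into an argument list.

    Hypothetical helper, not part of EMBOSS. If the command string
    contains %s, the ID replaces it; otherwise the ID is appended as
    an additional argument.
    """
    args = shlex.split(app)
    if any("%s" in arg for arg in args):
        return [arg.replace("%s", entry_id) for arg in args]
    return args + [entry_id]


# No %s in the command string: the ID is appended.
print(build_app_command("getfromdb thisfastadb", "MYID1"))

# %s present (the -id flag is invented for this example): the ID is substituted.
print(build_app_command("getfromdb -id %s thisfastadb", "MYID1"))
```

Splitting the command before substituting keeps an ID that happens to contain spaces as a single argument.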
In this example we will index and configure the EMBL database for use with EMBOSS.

First download and unpack the EMBL database. This will require a considerable amount of disk space. cd to the directory in which you have unpacked EMBL. Running ls should show something like this:

  est_fun.dat
  est_hum1.dat
  est_hum10.dat
  .
  .
  .
  syn.dat
  unc.dat
  vrl.dat
  vrt.dat

Run DBIFLAT to create the emblcd indices.

  % dbiflat
  Index a flat file database
        EMBL : EMBL
       SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
          GB : Genbank, DDBJ
       FASTA : FASTA format
  Entry format [SWISS]: EMBL
  Database name: embl
  Database directory [.]:
  Wildcard database filename [*.dat]:
  Release number [0.0]: 63.0
  Index date [00/00/00]: 31/07/00

DBIFLAT should happily chug away for some considerable time (up to a few hours depending on the speed of your machine) and will eventually generate the following index files:

  acnum.hit
  acnum.trg
  division.lkp

Now we create an entry in the EMBOSS configuration files to access the database. It is probably a good idea to try new database definitions in your local configuration file first. Put the following entry in your .embossrc:

  set emboss_db_dir /path_to_databases

  DB embl [
    type: N
    method: emblcd
    format: embl
    dir: $emboss_db_dir/embl
    file: "*.dat"
    release: "63.0"
    comment: "EMBL release 63.0"
  ]

Save .embossrc and try showdb. You should see a line that looks like:

  embl N OK OK OK EMBL release 63.0

Fine tuning the installation

It is probably a good idea to set up subsections of the database so that end users can search just the regions they wish to search. Files can be included with the file: declaration or excluded with the exclude: declaration. To take just the EST files, try the following:

  DB emblest [
    type: N
    method: emblcd
    format: embl
    dir: $emboss_db_dir/embl
    file: "est*.dat"
    release: "63.0"
    comment: "EMBL release 63.0"
  ]

Files can also be given as a space-separated list.
For example, to set up a database of all mammalian sequences (except genomes), try the following:

  DB emblallmam [
    type: N
    method: emblcd
    format: embl
    dir: $emboss_db_dir/embl
    file: "rod*.dat hum*.dat mam*.dat"
    release: "63.0"
    comment: "EMBL release 63.0"
  ]

It can be quite tedious to set up a long list of files to search. In many cases you can use the exclude function to make things easier.

  DB emblnoest [
    type: N
    method: emblcd
    format: embl
    dir: $emboss_db_dir/embl
    file: "*.dat"
    exclude: "est*.dat"
    release: "63.0"
    comment: "EMBL release 63.0"
  ]

This configures the emblnoest database to contain all of EMBL except the ESTs.

Indexing and configuring GCG format databases

EMBOSS can access GCG formatted databases, thus avoiding having multiple copies of the same databases in different formats. EMBOSS creates EMBLCD-like indices for the GCG format databases using the program DBIGCG. This runs in much the same way as DBIFLAT. You will need the GCG format .seq and .header files in order to create an indexed database.

cd to the GCG database directory containing your data and run DBIGCG:

  % dbigcg
  Index a GCG formatted database
        EMBL : EMBL
       SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
          GB : Genbank, DDBJ
         PIR : NBRF
  Entry format [EMBL]:
  Database name: embl
  Database directory [.]:
  Wildcard database filename [*.seq]:
  Release number [0.0]: 63.0
  Index date [00/00/00]: 31/07/00

The program will chug along for a while and will then generate the emblcd index files for the GCG format database. The following entry should be put in your .embossrc:

  DB gcgembl [
    type: N
    method: gcg
    format: embl
    dir: $emboss_db_dir/embl
    file: "*.dat"
    release: "63.0"
    comment: "EMBL release 63.0"
  ]

SHOWDB should show your newly configured database. You can configure subsets of the databases in the same way as for the original format databases.

Indexing and configuring BLAST databases

Here be dragons

Configuring EMBOSS to use SRS for database lookup
Here be lions

Indexing and configuring other databases

Many institutions may have local databases set up in their own Laboratory Information Management System. EMBOSS provides a simple mechanism for interfacing with such systems. As long as a program is available that can be called noninteractively and returns the specified sequence on standard output, EMBOSS can interface with it. Use method: app or method: external (the two are equivalent) and app: "program command". The ID given in the USA will be appended to the command used to run the program.

It is probably best to specify the methods available using the method subsets methodall:, methodquery: and methodsingle: rather than using the generic method: tag.

Other data

EMBOSS can be integrated with some common biological databases. These are described in this section.

REBASE

Rebase is the restriction enzyme database maintained by New England Biolabs. It is needed for programs such as remap and restrict. The latest version of Rebase can be obtained by anonymous FTP from ftp://ftp.ebi.ac.uk/pub/databases/rebase. EMBOSS needs the 'withrefm' file.

The data is extracted for EMBOSS with the program rebaseextract. If you installed EMBOSS with the --prefix option you may need to create the REBASE directory under the EMBOSS data directory (/site/prog/emboss/data in this example). This directory only needs creating once.

  % mkdir /site/prog/emboss/data/REBASE
  % rebaseextract
  Extract data from REBASE
  Full pathname of WITHREFM: /data/rebase/withrefm.008

Rebase is now installed and ready to use.

TRANSFAC

Transfac is the transcription factor binding site database. It is available by anonymous FTP from ftp://ftp.ebi.ac.uk/pub/databases/transfac/transfac32.tar.Z

Unpacking the distribution reveals a file called site.dat. This is the one EMBOSS needs. Run TFEXTRACT to extract the data from TRANSFAC.

  % tfextract
  Extract data from TRANSFAC
  Full pathname of transfac SITE.DAT: /databases/transfac/site.dat

tfscan can now access the TF database.
PROSITE

Prosite is a database of regular expressions that match potentially diagnostic regions for structural/functional classification of proteins. EMBOSS needs this database for the patmatmotifs program. PROSITE can be obtained via anonymous FTP from ftp://ftp.ebi.ac.uk/pub/databases/prosite.

You may need to create a PROSITE subdirectory under data in the EMBOSS installation directory. Then run prosextract to build the EMBOSS Prosite database.

  % prosextract
  Builds the PROSITE motif database for patmatmotifs to search
  Enter name of prosite directory: /data/prosite

PROSITE is now integrated into your EMBOSS installation.

PRINTS

Prints is a database of diagnostic patterns of blocks of sequence homology in protein families. The PRINTS database can be searched using the EMBOSS program pscan. PRINTS can be obtained via anonymous FTP from ftp://ftp.ebi.ac.uk/pub/databases/prints.

The database is made available as compressed files, which should be uncompressed using gzip before integrating them into EMBOSS. PRINTS is integrated with EMBOSS using the command printsextract.

  % printsextract
  Extract data from PRINTS
  Input file: /data/prints/prints27_0.dat

The PRINTS database is now integrated with EMBOSS.

Miscellaneous data files

Other data files should be kept in the data directory under the main EMBOSS installation. Individual users' personal data files can be kept in the current working directory, a subdirectory .embossdata of the current directory, their home directory, or a subdirectory .embossdata of their home directory. EMBOSS will search these locations in this order and will stop as soon as it finds a matching file. If the personal directories do not contain the desired file, EMBOSS will search the system-wide data directory, /site/prog/emboss/data in this example.

Apparently inexplicable errors when running EMBOSS programs may be caused by the system not using the data files one expects. The search path can be displayed in search order using the command embossdata.
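When tracking down which copy of a data file is being picked up, it can help to see the search order written out as code. The following Python sketch mimics the order described above; it is not how EMBOSS itself is implemented, and the function names and the file name example.dat are invented for the illustration. The system data directory is the /site/prog/emboss/data used in this guide's examples.

```python
from pathlib import Path


def data_search_path(cwd, home, system_data):
    """Return the directories in the order described in this guide.

    Hypothetical helper: current directory, its .embossdata
    subdirectory, home directory, its .embossdata subdirectory,
    then the system-wide data directory.
    """
    cwd, home = Path(cwd), Path(home)
    return [cwd, cwd / ".embossdata", home, home / ".embossdata", Path(system_data)]


def find_data_file(name, cwd, home, system_data):
    """Return the first matching file along the search path, or None."""
    for directory in data_search_path(cwd, home, system_data):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


# Show the search order for the current user and this guide's example install.
for directory in data_search_path(Path.cwd(), Path.home(), "/site/prog/emboss/data"):
    print(directory)
```

Calling find_data_file("example.dat", Path.cwd(), Path.home(), "/site/prog/emboss/data") then reports which copy, if any, would be found first under this ordering.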
Logging

Many system administrators may wish to make use of the logging facilities of EMBOSS. Setting the variable emboss_logfile in emboss.default or .embossrc allows the system to keep a log of which programs are used, when, and by whom.

  set emboss_logfile /site/log/emboss.log

The log file structure is very simple. Three tab-separated fields are stored: program name, user name, and the date and time.

  prettyplot joeuser Wed Aug 02 14:29:13 2000

The file set in emboss_logfile should be world writable.

These settings can be overridden in a user's .embossrc file by redefining emboss_logfile. E.g. to prevent my system usage being logged I can put the following entry in my .embossrc file:

  set emboss_logfile /dev/null

This behaviour may change in the future to prevent users redefining some system settings.

From gwilliam at hgmp.mrc.ac.uk Thu Aug 10 13:19:51 2000
From: gwilliam at hgmp.mrc.ac.uk (gwilliam at hgmp.mrc.ac.uk)
Date: Thu, 10 Aug 2000 14:19:51 +0100 (BST)
Subject: -osformat strangeness
Message-ID: <200008101319.OAA10103@tellurium.hgmp.mrc.ac.uk>

Can anyone explain this behaviour? When I use '-osf staden' or '-osf gcg' I get no output but no error messages.

% seqret ttt.seq stdout
Reads and writes (returns) sequences
>
AGCTAGGGCTTAAA

[OK - this is a short test sequence, but the error occurs with longer sequences, eg embl:hsfau]

% seqret ttt.seq stdout -osf gcg
Reads and writes (returns) sequences

[Where did the output go?]

% seqret ttt.seq stdout -osf staden
Reads and writes (returns) sequences

[Ditto!]

% seqret ttt.seq stdout -osf embl
Reads and writes (returns) sequences
ID   standard; DNA; UNC; 14 BP.
SQ   Sequence 14 BP; 5 A; 2 C; 4 G; 3 T; 0 other;
     AGCTAGGGCT TAAA        14
//

[now it works]

The only output formats I think this happens with are gcg and staden.
Thanks, Gary From gwilliam at hgmp.mrc.ac.uk Thu Aug 10 14:12:48 2000 From: gwilliam at hgmp.mrc.ac.uk (gwilliam at hgmp.mrc.ac.uk) Date: Thu, 10 Aug 2000 15:12:48 +0100 (BST) Subject: Followup to -osformat strangeness Message-ID: <200008101412.PAA10284@tellurium.hgmp.mrc.ac.uk> Using '-osf gcg' creates a file '.gcg' Similarly with '-osf staden' and '.staden' Gary From pmr at sanger.ac.uk Thu Aug 10 14:41:45 2000 From: pmr at sanger.ac.uk (Peter Rice) Date: Thu, 10 Aug 2000 15:41:45 +0100 (BST) Subject: -osformat strangeness In-Reply-To: <200008101319.OAA10103@tellurium.hgmp.mrc.ac.uk> (gwilliam@hgmp.mrc.ac.uk) References: <200008101319.OAA10103@tellurium.hgmp.mrc.ac.uk> Message-ID: <200008101441.PAA04522@europa.sanger.ac.uk> Gary Williams writes: >Can anyone explain this behaviour? When I use '-osf staden' or '-osf >gcg' I get no output but no error messages. Strictly, this happens when '-ossingle' is on. This forces output to go to a separate file for each sequence, and is turned on for formats that do not support multiple sequences in one file (GCG/GCG8, STADEN/EXPERIMENT, RAW/TEXT/PLAIN) Any other format with -ossingle on the command line will do the same. GCG and STADEN with -noossingle will write to the same file. For GCG format EMBOSS is able to read the resulting file (but GCG cannot :-) This behaviour could be changed, but needs some careful planning. For example, with one output file you may want it to be used. With many output files you may want them all to be automatically named. Perhaps some special processing for stdout and stderr would be useful, and for user-defined output filenames the user's choice could at least be used for the first sequence (maybe with a warning for the second sequence if there is one). So many cases to consider. All packages that write GCG format face these kinds of problems. 
Peter -- ---------------------------------------------------------------------- Peter Rice | Informatics Division, The Sanger Centre, E-mail: pmr at sanger.ac.uk | Wellcome Trust Genome Campus, Tel: (44) 1223 494967 | Hinxton, Cambridge, CB10 1SA, England Fax: (44) 1223 494919 | URL: http://www.sanger.ac.uk/Users/pmr/ From gwilliam at hgmp.mrc.ac.uk Thu Aug 10 15:22:39 2000 From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522) Date: Thu, 10 Aug 2000 16:22:39 +0100 Subject: -osformat strangeness References: <200008101319.OAA10103@tellurium.hgmp.mrc.ac.uk> <200008101441.PAA04522@europa.sanger.ac.uk> Message-ID: <3992C8BF.51B6@hgmp.mrc.ac.uk> Peter Rice wrote: > > Gary Williams writes: > > >Can anyone explain this behaviour? When I use '-osf staden' or '-osf > >gcg' I get no output but no error messages. > > Strictly, this happens when '-ossingle' is on. This forces output to > go to a separate file for each sequence, and is turned on for formats > that do not support multiple sequences in one file (GCG/GCG8, > STADEN/EXPERIMENT, RAW/TEXT/PLAIN) > > Any other format with -ossingle on the command line will do the same. > > GCG and STADEN with -noossingle will write to the same file. For GCG > format EMBOSS is able to read the resulting file (but GCG cannot :-) > > This behaviour could be changed, but needs some careful planning. For > example, with one output file you may want it to be used. With many > output files you may want them all to be automatically named. > > Perhaps some special processing for stdout and stderr would be useful, > and for user-defined output filenames the user's choice could at least > be used for the first sequence (maybe with a warning for the second > sequence if there is one). This sounds more what I might expect. > So many cases to consider. All packages that write GCG format face > these kinds of problems. Hmm I see what you mean, '-ossingle' toggles this behaviour with any format, eg embl. 
It doesn't seem the right behaviour though. I guess that whatever behaviour it is set up to have, we will meet discrepancies between what was expected and what is possible.

I still don't see why there should be a difference between using 'staden::outfile' and '-osf staden' though. Is this the expected behaviour? If so, why?

tellurium<116>seqret embl:hsfau jjj.seq
Reads and writes (returns) sequences
tellurium<117>ls -l jjj.seq
-rw-r----- 1 gwilliam cs 560 Aug 10 16:06 jjj.seq

[Behaves as I expected]

tellurium<118>rm jjj.seq
tellurium<119>seqret embl:hsfau staden::jjj.seq
Reads and writes (returns) sequences
tellurium<120>ls -l jjj.seq
-rw-r----- 1 gwilliam cs 539 Aug 10 16:06 jjj.seq

[Behaves as I expected]

tellurium<122>rm jjj.seq
tellurium<123>rm hsfau.staden
tellurium<124>seqret embl:hsfau jjj.seq -osf staden
Reads and writes (returns) sequences
tellurium<125>ty jjj.seq
jjj.seq: No such file or directory
tellurium<126>ls -l hsfau.staden
-rw-r----- 1 gwilliam cs 539 Aug 10 16:09 hsfau.staden

[Doesn't make 'jjj.seq', but makes 'hsfau.staden'.]

Thanks,
Gary

--
Gary Williams Tel: +44 1223 494522 Fax: +44 1223 494512
mailto:G.Williams at hgmp.mrc.ac.uk http://www.hgmp.mrc.ac.uk/
Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK

From david.martin at biotek.uio.no Tue Aug 15 10:12:58 2000
From: david.martin at biotek.uio.no (David Martin)
Date: Tue, 15 Aug 2000 11:12:58 +0100
Subject: Changing default settings in EMBOSS
Message-ID:

I have been looking through the source to try to track down the possible settings that can be defined beforehand in EMBOSS. So far I have the following list:

acdroot
logfile
DATA
all general qualifiers
format (default sequence format)

Rather than go mad trying to find my way through the code I have tried to find functions that resolve the defined variables. Most of these seem to not be hard coded (with the exception of those above excluding the general qualifiers.)
and most of this lookup is in ajacd.c.

Which qualifiers can be coded specifically and which can't? ajacd.c is about 172 pages of source code and I don't plan to go through all of it.

..d

---------------------------------------------------------------------
* Dr. David Martin          Biotechnology Centre of Oslo            *
* Node Manager              Gaustadalleen 21                        *
* The Norwegian EMBNet Node P.O. box 1125 Blindern                  *
* tel +47 22 95 87 56       N-0317 Oslo                             *
* fax +47 22 69 41 30       Norway                                  *
---------------------------------------------------------------------

From gwilliam at hgmp.mrc.ac.uk Thu Aug 17 09:16:16 2000
From: gwilliam at hgmp.mrc.ac.uk (gwilliam at hgmp.mrc.ac.uk)
Date: Thu, 17 Aug 2000 10:16:16 +0100 (BST)
Subject: GFF question
Message-ID: <200008170916.KAA02641@tellurium.hgmp.mrc.ac.uk>

I have two sequences that are being compared.

Is it better to write out a single GFF file with the results for both sequences in it, or should I write out two separate gff files?

Thanks,
Gary

From gwilliam at hgmp.mrc.ac.uk Thu Aug 17 10:42:15 2000
From: gwilliam at hgmp.mrc.ac.uk (Gary Williams, Tel 01223 494522)
Date: Thu, 17 Aug 2000 11:42:15 +0100
Subject: GFF question
References:
Message-ID: <399BC187.7F20@hgmp.mrc.ac.uk>

David Martin wrote:
>
> On Thu, 17 Aug 2000 gwilliam at hgmp.mrc.ac.uk wrote:
>
> > I have two sequences that are being compared.
> >
> > Is it better to write out a single GFF file with the results for both
> > sequences in it, or should I write out two separate gff files?
>
> Give the user the option.
>
> I can think of arguments both ways.

What are your arguments?
Gary

--
Gary Williams Tel: +44 1223 494522 Fax: +44 1223 494512
mailto:G.Williams at hgmp.mrc.ac.uk http://www.hgmp.mrc.ac.uk/
Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK

From david.martin at biotek.uio.no Thu Aug 17 11:04:42 2000
From: david.martin at biotek.uio.no (David Martin)
Date: Thu, 17 Aug 2000 12:04:42 +0100
Subject: GFF question
In-Reply-To: <399BC187.7F20@hgmp.mrc.ac.uk>
Message-ID:

On Thu, 17 Aug 2000, Gary Williams, Tel 01223 494522 wrote:

> David Martin wrote:
> >
> > On Thu, 17 Aug 2000 gwilliam at hgmp.mrc.ac.uk wrote:
> >
> > > I have two sequences that are being compared.
> > >
> > > Is it better to write out a single GFF file with the results for both
> > > sequences in it, or should I write out two separate gff files?
> >
> > Give the user the option.
> >
> > I can think of arguments both ways.
>
> What are your arguments?

-ofsingle ?

But seriously, in some cases one would want a GFF containing features from many subsequences, for example if (as in our case) we are analysing a genome containing many unordered, unconnected contigs but want information such as repeat regions, ISs etc. to be kept in one place.

In other cases one may want to analyse a set of sequences but produce a separate file for each one, e.g. if one is analysing a set of related genes but does not want the GFFs combined.

The overall argument is that EMBOSS should be flexible and empowering, not conforming to any preconceived mindset (except of course that it should be flexible and empowering, not conforming to any preconceived mindset ( except ...))

..d

>
> Gary
>
> --
> Gary Williams Tel: +44 1223 494522 Fax: +44 1223 494512
> mailto:G.Williams at hgmp.mrc.ac.uk http://www.hgmp.mrc.ac.uk/
> Bioinformatics,MRC HGMP Resource Centre,Hinxton,Cambridge, CB10 1SB,UK
>
>

---------------------------------------------------------------------
* Dr.
David Martin                Biotechnology Centre of Oslo            *
* Node Manager              Gaustadalleen 21                        *
* The Norwegian EMBNet Node P.O. box 1125 Blindern                  *
* tel +47 22 95 87 56       N-0317 Oslo                             *
* fax +47 22 69 41 30       Norway                                  *
---------------------------------------------------------------------

From david.martin at biotek.uio.no Wed Aug 23 12:23:18 2000
From: david.martin at biotek.uio.no (David Martin)
Date: Wed, 23 Aug 2000 13:23:18 +0100
Subject: Some future ideas
Message-ID:

I had some thoughts about future expansions of EMBOSS and really want to put down a few 'placemarkers' to see what people think.

Eventually people are going to get the idea that EMBOSS should be able to do things with data other than just sequences and want to run e.g. microarray and structure type analyses. With the current database definitions in emboss.default there is a type: clause that is not required. I would propose to make this mandatory and extend the values.

N is nucleotide sequence database
P is protein sequence database
S is a structure database
M is a microarray experiment database.
The USA can then be extended to cover structures (pdb:1HTF, for example) and microarray experiments. There are probably other entities that could be included.

With some careful type management we could even convert types on the fly, so you could put in a pdb reference when asked for a protein sequence and it would be automatically derived (OK, there are a lot of problems with such things but it would be useful).

Other possibilities:

XML format output in some suitable XML format? This would probably need a lot of work in the libraries to tidy everything up and make it work.

Still looking for a student to write an EMBOSS-WAP interface ;-)

..d

---------------------------------------------------------------------
* Dr. David Martin          Biotechnology Centre of Oslo            *
* Node Manager              Gaustadalleen 21                        *
* The Norwegian EMBNet Node P.O. box 1125 Blindern                  *
* tel +47 22 95 87 56       N-0317 Oslo                             *
* fax +47 22 69 41 30       Norway                                  *
---------------------------------------------------------------------

From bernd at golgi.ski.mskcc.org Wed Aug 23 05:09:49 2000
From: bernd at golgi.ski.mskcc.org (Bernd Jagla)
Date: Wed, 23 Aug 2000 11:09:49 +0600
Subject: Some future ideas
References:
Message-ID: <39A35C9D.9B819964@golgi.ski.mskcc.org>

David Martin wrote:

> I had some thoughts about future expansions of EMBOSS and really want to
> put down a few 'placemarkers' to see what people think.
>
> Eventually people are going to get the idea that EMBOSS should be
> able to do things with data other than just sequences and want to run
> e.g. microarray and structure type analyses. With the current database
> definitions in emboss.default there is a type: clause that is not
> required. I would propose to make this mandatory and extend the values.
>
> N is nucleotide sequence database
> P is protein sequence database
> S is a structure database
> M is a microarray experiment database.
>
> The USA can then be extended to cover structures (pdb:1HTF, for example)
> and microarray experiments.
There are probably other entities that could
> be included.
>
> With some careful type management we could even convert types on the fly,
> so you could put in a pdb reference when asked for a protein sequence and
> it would be automatically derived (OK, there are a lot of problems with
> such things but it would be useful).
>
> Other possibilities:
>
> XML format output in some suitable XML format? This would probably need
> a lot of work in the libraries to tidy everything up and make it work.
>
> Still looking for a student to write an EMBOSS-WAP interface ;-)
>
> ..d
>
> ---------------------------------------------------------------------
> * Dr. David Martin          Biotechnology Centre of Oslo            *
> * Node Manager              Gaustadalleen 21                        *
> * The Norwegian EMBNet Node P.O. box 1125 Blindern                  *
> * tel +47 22 95 87 56       N-0317 Oslo                             *
> * fax +47 22 69 41 30       Norway                                  *
> ---------------------------------------------------------------------

Hi David,

I still feel quite new to EMBOSS and am not that familiar with the databanks, but it sounds very good to be able to analyze some microarray data. I also believe that there should be some other possibilities for data analysis. I personally like artificial neural networks, for they are fast and "easy" to use, and I already have some programs written using EMBOSS. I am thinking of some other statistical analysis tools to implement (information analysis, some visual output of aa content, distribution and so many other things). For this it would be good to be able to build groups of sequences and sequence parts, add some numbers to these groups, and probably have a new class of functions dealing with these groups. Of course, we should discuss the data model in a little more detail if it is interesting...

So, do you think EMBOSS should be able to deal with these kinds of problems as well?
Bernd

From gbottu at ben.vub.ac.be Fri Aug 25 14:35:57 2000
From: gbottu at ben.vub.ac.be (Guy Bottu)
Date: Fri, 25 Aug 2000 16:35:57 +0200 (MET DST)
Subject: Some future ideas about formats
Message-ID: <200008251435.QAA11192@bigben.vub.ac.be>

> Eventually people are going to get the idea that EMBOSS should be
>able to do things with data other than just sequences and want to run
>e.g. microarray and structure type analyses. With the current database
>definitions in emboss.default there is a type: clause that is not
>required. I would propose to make this mandatory and extend the values.
>
>N is nucleotide sequence database
>P is protein sequence database
>S is a structure database
>M is a microarray experiment database.
>
>The USA can then be extended to cover structures (pdb:1HTF, for example)
>and microarray experiments. There are probably other entities that could
>be included.
>

The way EMBOSS handles sequences is certainly a vast improvement over GCG and should be extended to other kinds of data. But why limit ourselves to structures and microarray data? There are other kinds of data that are handled by many software packages and that hence hang around in several alternative formats: amino acid symbol comparison tables, codon usage tables, sequence motifs defined as patterns/profiles/HMMs, phylogenetic trees, etc. It would be nice if they could all be imported in a transparent way and exported in a user-chosen format directly from the program that creates them.

Guy Bottu, Belgian EMBnet Node

From ableasby at hgmp.mrc.ac.uk Fri Aug 25 14:41:14 2000
From: ableasby at hgmp.mrc.ac.uk (ableasby at hgmp.mrc.ac.uk)
Date: Fri, 25 Aug 2000 15:41:14 +0100 (BST)
Subject: Some future ideas about formats
Message-ID: <200008251441.PAA11554@tin.hgmp.mrc.ac.uk>

Agreed in principle with Guy and David. Structures were next on the list. The interesting thing with structures is the nightmare of PDB parsing.
My colleague, Jon Ison, has a PDB parser which he developed as part of his PhD. He's using it to create a clean PDB which we can hopefully release soon. He intends EMBOSSifying it. Alan