From david.breimann at gmail.com Tue Jul 5 05:33:14 2011
From: david.breimann at gmail.com (David Breimann)
Date: Tue, 5 Jul 2011 12:33:14 +0300
Subject: [EMBOSS] Updating EMBOSS

Hello,

I had an old installation of EMBOSS on my Linux server (EMBOSS 6.0.1 according to embossversion).
I downloaded EMBOSS 6.3.1, unpacked and compiled (following http://emboss.sourceforge.net/download/#Gettingstarted), hoping this will overwrite the older version.
However, embossversion still returns 6.0.1.
What should I do?

Thanks,
Dave

From ajb at ebi.ac.uk Tue Jul 5 06:22:53 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Tue, 5 Jul 2011 11:22:53 +0100 (BST)
Subject: [EMBOSS] Updating EMBOSS
Message-ID: <55403.82.26.12.214.1309861373.squirrel@imap04.ebi.ac.uk>

Hello Dave,

It depends on where/how you installed the different versions.

If you had configured and installed using a prefix which specified a directory root which was to contain only emboss, e.g.

    ./configure --prefix=/fu/bar/emboss

then you can just delete the /fu/bar/emboss directory and reinstall.

If, however, you had installed EMBOSS using no prefix (such that it would be installed under /usr/local) or specified any other shared or system directory then the best means is usually to reinstall the old version (see ftp://emboss.open-bio.org/pub/EMBOSS/old/) on top of itself and then type:

    make uninstall

If it were me I'd then do the same with the new version and have a nose-around to check that all traces of EMBOSS have been deleted, then reinstall the new version.

We do recommend, when installing EMBOSS from source, to install it into its own directory (--prefix=/usr/local/emboss is a favourite example in administration documentation).

HTH

Alan
_______________________________________________
EMBOSS mailing list
EMBOSS at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/emboss

From wo.granon at gmail.com Thu Jul 7 06:33:49 2011
From: wo.granon at gmail.com (Wolfgang)
Date: Thu, 7 Jul 2011 12:33:49 +0200
Subject: [EMBOSS] Plasmid drawing

Hello,

are there any news to plasmid drawing (features and restriction sites) and improvement of cirdna, according to this message from 2005?
http://www.mail-archive.com/emboss at emboss.open-bio.org/msg00040.html

In our labs this is also a big point for users not to switch completely to emboss.

Thanks,
Wolfgang

From pmr at ebi.ac.uk Thu Jul 7 07:33:05 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 07 Jul 2011 12:33:05 +0100
Subject: [EMBOSS] Plasmid drawing
Message-ID: <4E159971.9070509@ebi.ac.uk>

Dear Wolfgang,

> are there any news to plasmid drawing (features and restriction sites) and
> improvement of cirdna, according to this message from 2005?
> http://www.mail-archive.com/emboss at emboss.open-bio.org/msg00040.html
>
> In our labs this is also a big point for users not to switch completely to
> emboss.

Very close to release date next week, so hard to do anything immediately.

However, we did try adding a report format (an output choice for restrict and other applications) to create an input file for cirdna or lindna.

Results at the time were poor, but I note we have revised both cirdna and lindna since.

I will test whether results have improved. One possibility would be to re-enable this format so you can test and give us feedback on the new release.
regards,

Peter

From hrh at fmi.ch Thu Jul 7 07:47:13 2011
From: hrh at fmi.ch (Hans-Rudolf Hotz)
Date: Thu, 07 Jul 2011 13:47:13 +0200
Subject: [EMBOSS] Plasmid drawing
In-Reply-To: <4E159971.9070509@ebi.ac.uk>
References: <4E159971.9070509@ebi.ac.uk>
Message-ID: <4E159CC1.9@fmi.ch>

Hi Peter,

We will be happy to help you testing and give feedback, since we are in a very similar situation to Wolfgang.

Regards, Hans

From pmr at ebi.ac.uk Thu Jul 7 08:07:19 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 07 Jul 2011 13:07:19 +0100
Subject: [EMBOSS] Plasmid drawing
In-Reply-To: <4E159CC1.9@fmi.ch>
References: <4E159971.9070509@ebi.ac.uk> <4E159CC1.9@fmi.ch>
Message-ID: <4E15A177.6040201@ebi.ac.uk>

On 07/07/11 12:47, Hans-Rudolf Hotz wrote:
> Hi Peter,
>
> We will be happy to help you testing and give feedback, since we are in
> a very similar situation to Wolfgang.

I'm curious. How many sites on this list (as a rough sample) are still running GCG?
And how many are using some other commercial package for functions not in EMBOSS?

Could be a very useful guide to the new applications needed.

Peter

From s.newslists at gmail.com Thu Jul 7 09:54:01 2011
From: s.newslists at gmail.com (Stefan)
Date: Thu, 7 Jul 2011 15:54:01 +0200
Subject: [EMBOSS] Plasmid drawing
References: <4E159971.9070509@ebi.ac.uk> <4E159CC1.9@fmi.ch> <4E15A177.6040201@ebi.ac.uk>

Hi Peter,

in our labs the people are also sad that they cannot use the EMBOSS suite for such daily work. We use two different applications:

pDraw32 can draw plasmid maps. Very useful is the feature that it can generate a new plasmid out of two with given restriction enzymes. This can avoid a lot of little mistakes.

ApE "A plasmid Editor" is very useful for finding features in the plasmid. Often we get sequences where features such as the antibiotic resistance are missing. This tool can quickly find them and draw a nice plasmid, including its restriction sites.

We would be happy to use the EMBOSS suite for all of this work. I would also be happy to test.

Best regards,
Stefan
From david.bauer at bayer.com Thu Jul 7 08:58:38 2011
From: david.bauer at bayer.com (david.bauer at bayer.com)
Date: Thu, 7 Jul 2011 14:58:38 +0200
Subject: [EMBOSS] Antwort: Re: Plasmid drawing
In-Reply-To: <4E15A177.6040201@ebi.ac.uk>

We use VectorNTI for plasmid documentation and in-silico cloning.
And as far as I know, another widely used software package for this purpose is 'Clone Manager' from 'Sci-Ed Software'.

David.

From cjfields at illinois.edu Thu Jul 7 11:10:35 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Thu, 7 Jul 2011 10:10:35 -0500
Subject: [EMBOSS] Antwort: Re: Plasmid drawing

I think Geneious and the CLC tools can also draw plasmid maps. Haven't used them extensively, though.

re: GCG, Accelrys stopped GCG development in June 2008; I haven't seen anyone take up the perpetual license (which allows use of GCG, but with outdated databases, etc). Seems like everyone is implicitly being directed to EMBOSS.

chris
From kitagawam at takara-bio.co.jp Fri Jul 8 04:05:48 2011
From: kitagawam at takara-bio.co.jp (kitagawam at takara-bio.co.jp)
Date: Fri, 8 Jul 2011 17:05:48 +0900
Subject: [EMBOSS] Antwort: Re: Plasmid drawing
Message-ID: <678B3FABACE9F64B8FAF7A1045C3D67D4D943A52EE@tkrexmb1.central.takara.co.jp>

I wish to recommend IMC.
http://www.insilicobiology.jp/en/downloads
From friedman at cancercenter.columbia.edu Wed Jul 13 16:56:29 2011
From: friedman at cancercenter.columbia.edu (Richard Friedman)
Date: Wed, 13 Jul 2011 16:56:29 -0400
Subject: [EMBOSS] getting files in GCG format with annotation
Message-ID: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu>

Dear Emboss list,

I am learning to use Emboss after being a long-time GCG user.
The fetch command in GCG returns a file with the sequence in GCG format plus annotation.

In EMBOSS I know how to get just sequence in GCG format with seqret.
In EMBOSS I also know how to get the sequence plus annotation default format.
What I would like to know is how using EMBOSS to get sequence plus annotation in GCG format like in GCG.

Thanks and best wishes,
Rich

------------------------------------------------------------
Richard A. Friedman, PhD
Associate Research Scientist,
Biomedical Informatics Shared Resource
Herbert Irving Comprehensive Cancer Center (HICCC)
Lecturer,
Department of Biomedical Informatics (DBMI)
Educational Coordinator,
Center for Computational Biology and Bioinformatics (C2B2)/
National Center for Multiscale Analysis of Genomic Networks (MAGNet)
Room 824
Irving Cancer Research Center
Columbia University
1130 St. Nicholas Ave
New York, NY 10032
(212)851-4765 (voice)
friedman at cancercenter.columbia.edu
http://cancercenter.columbia.edu/~friedman/

I am a Bayesian. When I see a multiple-choice question on a test and I don't know the answer I say "eeney-meaney-miney-moe".

Rose Friedman, Age 14

From pmr at ebi.ac.uk Wed Jul 13 17:37:13 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 13 Jul 2011 22:37:13 +0100
Subject: [EMBOSS] getting files in GCG format with annotation
In-Reply-To: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu>
References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu>
Message-ID: <4E1E1009.101@ebi.ac.uk>

Dear Richard,

On 13/07/2011 21:56, Richard Friedman wrote:
> I am learning to use Emboss after being a long-time GCG user.
> The fetch command in GCG returns a file with the sequence in GCG format
> plus annotation.
>
> In EMBOSS I know how to get just sequence in GCG format with seqret.
> In EMBOSS I also know how to get the sequence plus annotation default
> format.
> What I would like to know is how using EMBOSS to get sequence plus
> annotation in GCG format
> like in GCG.

Hmmm ... by "annotation in GCG format" you mean the EMBL or Uniprot entry with gaps in the ". ." feature records?

The obvious question is why you need GCG format.
GCG was not very clever in handling the annotation.

You can get the sequence plus annotation in one file with:

    seqret -feature somedb:someid outfile.seq -osformat embl (or swiss)

That gives you one file with "sequence plus annotation"... and you can use the annotation.

You can also get the whole entry text with entret somedb:someid

Hope that helps - and if not, please do ask again!

Peter Rice
EMBOSS Team

From friedman at cancercenter.columbia.edu Thu Jul 14 12:08:17 2011
From: friedman at cancercenter.columbia.edu (Richard Friedman)
Date: Thu, 14 Jul 2011 12:08:17 -0400
Subject: [EMBOSS] getting files in GCG format with annotation
In-Reply-To: <4E1E1009.101@ebi.ac.uk>
References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu> <4E1E1009.101@ebi.ac.uk>
Message-ID: <134BC19D-E1C0-4788-A700-5B212192AD6B@cancercenter.columbia.edu>

Dear Peter and Guy,

I guess I just cling to the familiar. The output formats given by emboss are fine.
One more obscure question:

As far as I can see, the output from "seqret -feature" and "entret" are the same.
Are there any differences?

Thanks and best wishes,
Rich
From pmr at ebi.ac.uk Thu Jul 14 12:14:03 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 14 Jul 2011 17:14:03 +0100
Subject: [EMBOSS] getting files in GCG format with annotation
In-Reply-To: <134BC19D-E1C0-4788-A700-5B212192AD6B@cancercenter.columbia.edu>
References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu> <4E1E1009.101@ebi.ac.uk> <134BC19D-E1C0-4788-A700-5B212192AD6B@cancercenter.columbia.edu>
Message-ID: <4E1F15CB.5020205@ebi.ac.uk>

On 14/07/2011 17:08, Richard Friedman wrote:
> Dear Peter and Guy,
>
> I guess I just cling to the familiar. The output formats given by emboss
> are fine.
> One more obscure question:
>
> As far as I can see, the output from "seqret -feature" and "entret" are
> the same.
> Are there any differences?

Not necessarily ... entret reports the exact text of the original input. seqret -feat with the same format as the input will rewrite everything using the output format. If that comes out identical then we are usually very happy (we do try to preserve everything in EMBL/GenBank and Swissprot formats) but there is no absolute guarantee.

Also, strictly speaking, the output of entret is defined as "text" while the output of seqret is defined as "sequence" which leads to some distinctions - for example, you cannot choose an alternative output format for entret.

Have fun with EMBOSS. Look out for the new release tomorrow!
regards,

Peter

From gbottu at vub.ac.be Thu Jul 14 13:56:14 2011
From: gbottu at vub.ac.be (Guy Bottu)
Date: Thu, 14 Jul 2011 19:56:14 +0200
Subject: [EMBOSS] getting files in GCG format with annotation
In-Reply-To: <4E1E1009.101@ebi.ac.uk>
References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu> <4E1E1009.101@ebi.ac.uk>
Message-ID: <4E1F2DBE.70700@vub.ac.be>

Dear Richard,

I agree with Peter that it is not obvious what GCG simple sequence format is still useful for. To give the sequence as input to other software you can use seqret with any sequence format. To just read the annotation you can use entret. To give the features as input to other software you can use seqret with the parameter -feature (GCG used the GCG RSF format for this, but it did not become popular outside GCG/SeqLab).

I can maybe add that a widely used format for features is GFF format and you can do:

    seqret -feature somedb:someid outfile.seq -osformat gff -oufo somegfffile

You will obtain a file somegfffile in GFF format (with just the features, not the sequence). There is a lot of software that can use it.

Regards,
Guy Bottu, U.L.B.

From ajb at ebi.ac.uk Fri Jul 15 04:54:26 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Fri, 15 Jul 2011 09:54:26 +0100 (BST)
Subject: [EMBOSS] EMBOSS 6.4.0 released
Message-ID: <53026.82.26.12.214.1310720066.squirrel@imap04.ebi.ac.uk>

EMBOSS Release 6.4.0

This release is now available on our OBF ftp server.

UNIX version:
ftp://emboss.open-bio.org/pub/EMBOSS/

mEMBOSS (MS Windows version):
ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.0-setup.exe

It includes major extensions to the type and number of data resources available to EMBOSS users.
In addition, three books are published by Cambridge University Press:

EMBOSS User's Guide: Practical Bioinformatics
http://www.cambridge.org/gb/knowledge/isbn/item5979294/?site_locale=en_GB

EMBOSS Developer's Guide: Bioinformatics Programming
http://www.cambridge.org/gb/knowledge/isbn/item5979293/?site_locale=en_GB

EMBOSS Administrator's Guide: Bioinformatics Software Management
http://www.cambridge.org/gb/knowledge/isbn/item5979238/?site_locale=en_GB

They are comprehensive and definitive guides to administering, developing and using EMBOSS. We hope they will prove useful to the EMBOSS community and to anyone providing training courses covering EMBOSS.

In addition to these publications we have a new website.
http://emboss.open-bio.org

Updates for the new features in 6.4.0 will be made available soon on the new EMBOSS website, with tutorials to be developed on the EBI e-Learning Portal.

Contents:

1.0 New in 6.4.0
    1.1 Server definitions
    1.2 Access methods
    1.3 emboss.standard file
    1.4 New data types
    1.5 New query language
    1.6 Hash tables and lists
    1.7 Cross-references
    1.8 URL generation
    1.9 Database index compression
    1.10 Database indexing applications
    1.11 Generating server cache files
    1.12 Server and database attributes
    1.13 HTTP redirection
    1.14 EMBOSS version number
    1.15 ACD list 'select all'
2.0 EDAM Ontology
    2.1 EDAM in ACD files
    2.2 EDAM applications
3.0 DRCAT Data Resource Catalogue
4.0 NCBI Taxonomy
5.0 Maintenance
6.0 Installation Notes
    6.1 UNIX
        6.1.1 MySQL
        6.1.2 PostgreSQL
        6.1.3 axis2c
        6.1.4 Other optional library software
        6.1.5 eprimer3 and eprimer32
    6.2 mEMBOSS
7.0 New EMBASSY applications
8.0 Future

1.0 New in 6.4.0

1.1 Server definitions

Servers can be defined, in a similar style to a database definition, but covering all databases available from a single server. The server definition names a cache file describing each database, its format and its query fields. Cache files for a core set of public servers are included in the release.
1.2 Access methods

New access methods are provided, including Ensembl, BioMart, DAS, SOAP web services (EBI wsdbfetch and ebeye), REST web services (EBI dbfetch), and GMOD/CHADO.

Ensembl access uses code contributed by Michael Schuster in the Ensembl team at EBI. This code is updated after each Ensembl API release.

Some of these access methods were available but only partly implemented in the previous release. They now support standard server and database definitions and are open for further development.

Data access methods have been restructured to use "text" access for any method which seeks a position in a file and then opens it for reading. This includes reading from a URL and returning a pointer to the start of the output. A few datatype-specific access methods remain, for example reading sequence data from a PIR/NBRF/GCG format database, or from the NCBI taxonomy files, or access to database systems via SQL or DAS.

1.3 emboss.standard file

Previous releases depended on a user defining databases in their emboss.default file. Release 6.4.0 provides a new emboss.standard file defining the core servers and databases, and standard resource settings for database indexing. The local emboss.default file is only needed for local database definitions and settings.

The configuration files emboss.standard, emboss.default and ~/.embossrc resolve variable references (e.g. in directory names) during parsing.

Extensions to the syntax of these files include ALIAS to give secondary names to a database. IF, IFDEF, ELSE and ENDIF directives allow conditional inclusion of sections of the file dependent on variable settings. Special variables EMBOSS_AXIS2, EMBOSS_MYSQL, EMBOSS_POSTGRESQL and EMBOSS_SQL are automatically created for this purpose.

New variable EMBOSS_STANDARD is automatically defined to be the share/EMBOSS install directory (or the emboss source code directory if the package is not installed).
This is by default where the emboss.standard file and server cache files are expected to be found. The value is reported by "embossversion -full".

1.4 New data types

New data types are available as inputs and outputs of applications. Each has a simple definition including qualifiers -iformat for input format and -oformat for output format. The maxreads attribute defines whether the application expects to read a single entry (maxreads: 1) or loop over multiple entries (the default). This is simpler than the sequence and seqall definitions for sequence, which are widely used and will remain unchanged.

* text and outtext: the text of an entry for which EMBOSS has (to date) no specialised parser
* obo and oboout: terms in an OBO ontology. Six ontologies are included in the release as source and index files (EDAM, GO, SO, RO, PW, ECO). We plan to add more and welcome suggestions for inclusion.
* resource and resourceout: entries in the Data Resource Catalogue
* taxon and outtaxon: nodes in the NCBI taxonomy, which is indexed and included in the release
* url and outurl: a database name from the Data Resource Catalogue, and an identifier, converted into a URL which can be pasted into a browser to cover cases where the URL does not return simple text or HTML data.
* for future extension, assembly and variation datatypes are defined for development and use in a later release.

1.5 New query language

All data types use a common query language. The existing "USA" (uniform sequence address) syntax is still valid for sequence data, but is also now used for features, obo terms, data resources, taxons and plain text data.

In response to comments from our Scientific Advisory Board, we have extended the query language to cover multiple identifiers, multiple fields, and operators to combine elements of the query.

* id lists: dbname:{ida,idb,idc} searches for 3 identifiers (id, accession, etc.)
  in a database
* or operator: dbname-{id:h* | des:hemoglobin} searches for all entries with identifiers starting with 'h' plus any others that include the word 'hemoglobin' in their descriptions.
* not operator: dbname-{id:h* ! des:hemoglobin} searches for all entries with identifiers starting with 'h' that do not include the word 'hemoglobin' in their descriptions.
* and operator: dbname-{id:h* & des:hemoglobin} searches for all entries with identifiers starting with 'h' that also include the word 'hemoglobin' in their descriptions.
* eor operator: dbname-{id:h* ^ des:hemoglobin} searches for all entries with identifiers starting with 'h' that do not include the word 'hemoglobin' in their descriptions, and all those starting with another character that do include the word 'hemoglobin' in their description. In other words, it returns the entries that match exactly one of the two terms: everything found by or (|) except what is found by and (&).

Query operators are not supported by all access methods. Where an operator is invalid an error message gives the list of valid operators. For example, the query syntax for SRS (srs, srswww access) does not include the exclusive-or (^) operator but supports the others as these are standard elements in SRS queries.

The query language only allows a single database name in the query. This allows EMBOSS to combine query results for a single query expression. To query multiple databases, a list file input with one database query on each line can be used.

Indexed strings containing non-alphabetic characters including white space are simplified by converting a run of such characters to a single underscore. The same transformation is applied to a query string for the dbx (emboss) access method. This is especially useful for brackets and other characters in data resource names in DRCAT.

We hope that the extended query language and the index file compression will increase the use of locally indexed data in EMBOSS installations, and welcome feedback on further developments of the query language and indexing.
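As a rough illustration of how the four query operators relate to one another, the sketch below models each query term as a set of matching entry identifiers and each operator as a set operation. This is not EMBOSS code: the toy entries and the helper names (ENTRIES, match_id_prefix, match_des_word, combine) are all hypothetical, chosen only to mirror the id:h* and des:hemoglobin examples.

```python
# Toy "database": entry identifier -> description. Hypothetical data only.
ENTRIES = {
    "hba": "hemoglobin alpha chain",
    "h2ax": "histone h2ax",
    "glb1": "extracellular hemoglobin",
    "myg": "myoglobin",
}


def match_id_prefix(prefix):
    """Entries whose identifier starts with prefix (stands in for id:h*)."""
    return {name for name in ENTRIES if name.startswith(prefix)}


def match_des_word(word):
    """Entries whose description contains word as a whole word (stands in for des:hemoglobin)."""
    return {name for name, des in ENTRIES.items() if word in des.split()}


# Each operator symbol from the query language mapped to a Python set operation.
OPERATORS = {
    "|": lambda a, b: a | b,  # or: entries in either result
    "&": lambda a, b: a & b,  # and: entries in both results
    "!": lambda a, b: a - b,  # not: entries in the first result but not the second
    "^": lambda a, b: a ^ b,  # eor: entries in exactly one of the two results
}


def combine(op, left, right):
    """Combine two result sets, rejecting operators that are not defined."""
    if op not in OPERATORS:
        raise ValueError(f"invalid operator {op!r}; valid operators: {sorted(OPERATORS)}")
    return OPERATORS[op](left, right)


ids_h = match_id_prefix("h")
des_hemo = match_des_word("hemoglobin")

for op in "|&!^":
    print(op, sorted(combine(op, ids_h, des_hemo)))
```

Modelling the terms as sets makes the eor case easy to check: its result is the or result with the and result removed.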
1.6 Hash tables and lists

The new query language is supported by extensions to tables and lists in the libraries.

Tables can now be automatically resized. Merge operations on two tables combine their contents using the same operations (or, and, not, eor) as the query language. By resizing the tables first this operation can be made highly efficient.

Destructors can be defined for list data and for table keys and data to automatically clean up after use.

Tables with string keys can use C char* or string object queries in all cases.

Lists and tables can now be reference counted, avoiding unnecessary copying especially in the Ensembl API code.

1.7 Cross-references

Cross-references from UniProt/SwissProt and EMBL/GenBank/DDBJ are collected by extended parsers.

New application seqxref reports the cross-references.

New application seqxrefget creates a script to retrieve cross-referenced data as the original entries, using entret for sequence data, feattext for feature data, ontotext for ontology terms, textget for text and urlget for data where "HTML" is the only available format.

1.8 URL generation

New application urlget returns a query URL from DRCAT with one or more identifiers. Where data is from a UniProt/SwissProt or EMBL/GenBank/DDBJ entry the DRCAT entry definition of the original cross-reference is used to select from several possible identifier terms in EDAM in order to choose the correct query.

1.9 Database index compression

Indexes created by dbxflat or dbxfasta are now, by default, compressed automatically. These files, especially for secondary text indexes such as description, taxonomy or keyword, could be very sparse. Up to 95% space savings were achieved in some cases. The indexes are still updatable by code which uncompresses, updates, and recompresses on-the-fly using a copy of the index.
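To give a feel for why sparse index files compress so well, here is a small sketch that builds one mostly empty fixed-size page and compresses it. Everything in it is an assumption made for illustration: the page size, the page layout and the use of zlib are not taken from EMBOSS, whose actual index format and compression scheme are not described here.

```python
import zlib

PAGE_SIZE = 2048  # hypothetical page size in bytes, not an EMBOSS setting


def make_sparse_page(keys, page_size=PAGE_SIZE):
    """Pack a few keys at the start of a fixed-size page and zero-pad the rest."""
    payload = b"\0".join(key.encode("ascii") for key in keys)
    if len(payload) > page_size:
        raise ValueError("keys do not fit in one page")
    return payload.ljust(page_size, b"\0")


# A page holding only three short keys is almost entirely padding.
page = make_sparse_page(["hemoglobin", "histone", "myoglobin"])
packed = zlib.compress(page)
saving = 1 - len(packed) / len(page)

# Compression is lossless, so the page can be restored, updated and recompressed.
restored = zlib.decompress(packed)

print(f"{len(page)} bytes -> {len(packed)} bytes ({saving:.0%} saved)")
```

The long run of padding bytes is what a general-purpose compressor removes almost entirely, which is the effect the sparse secondary indexes benefit from.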
1.10 Database indexing applications

New indexing applications dbxedam (EDAM), dbxresource (DRCAT), dbxtax (NCBI taxonomy) and dbxobo (any OBO ontology) are added for the new data resources provided as standard. Users can install new releases of the source data and run these applications to update the index files.

Application dbxflat can now index fastq format. This was included in 6.3.1 as a special addition for one user to test and is now fully supported.

New applications dbxreport and dbxstat report on the overall and detailed content of dbx database indexes.

In database indexing applications, the default "resource" name is one included in the emboss.standard file. Users can continue to define their own resource files.

Indexing "resource" definitions can now specify the maximum length of any field, and the page size and cache size for any field, using attributes with the field name as a prefix.

1.11 Generating server cache files

New applications for major access methods query a server (for example, the DAS registry or Ensembl) to update the server cache file with a current set of database definitions. When run by the system administrator these can update the site-wide cache file, but they can also be run by an individual user to create a user-specific set of databases. The cache files are time stamped. EMBOSS uses the most recent system or user file.

1.12 Server and database attributes

New applications showserver and servertell describe all servers or the attributes of a single named server. We expect to extend these applications once we have feedback on the most useful information they should report.

New application dbtell similarly reports on the attributes of a single named database.

Database (and server) definitions can use an attribute more than once if it is defined as "multiple". These include a new "field:" attribute which gives the name and description of a query field.
A list of "field:" attributes supersedes the old "fields:" attribute which listed all query field names but allowed no further annotation.

Database field names are extended from the original fixed set of "SRS sequence" fields to any name. "id" and "acc" are assumed to be the names of identifier and accession fields. The "hasaccession" attribute is set automatically for databases where no "acc" field is found, avoiding some error messages where the attribute has been omitted.

1.13 HTTP redirection

Data retrieval using HTTP now checks the returned header for redirects and automatically replaces the results with the output from the redirected URL. Where redirected URLs were found in standard database definitions (e.g. the EBI's dbfetch service) these have been replaced by the current URL. We have also seen redirects from case-sensitive servers which redirect a lower case accession number to one in upper case in the same URL.

1.14 EMBOSS version number

The EMBOSS version number now has 4 digits (6.4.0.0). The fourth digit is only there so that the Windows port (mEMBOSS) shows the same version number for QA testing. In mEMBOSS the final digit is the build number. QA tests for mEMBOSS now use the same test definition and qatest script as on Linux. mEMBOSS file handling and reporting has been adapted to support POSIX and Windows style paths.

1.15 ACD list 'select all'

In ACD files, a list or selection definition can default to "*" for "select all" if the "minimum" attribute allows all terms to be selected.

2.0 EDAM Ontology

EDAM is a new ontology from the EMBRACE project now further developed by Jon Ison in the EMBOSS team. EDAM describes terms for topics (for applications and data), operations (algorithms), formats, identifiers and data (semantic descriptions of data content).
EDAM terms are used throughout this release: to annotate all ACD files at the application, input, parameter and output levels; to annotate data resources and their web queries in the Data Resource Catalogue; and to annotate database and server definitions.

2.1 EDAM in ACD files

ACD files are annotated extensively with EDAM terms using the term id and the human-readable name. The EMBOSS application groups have been extended to match the EDAM topic annotations, with some applications moving to different or new groups. EDAM has been used to validate these groups by comparing the topics hierarchy with the group designations.

2.2 EDAM applications

EDAM can be queried within any specific namespace by new applications edamname and edamdef.

EDAM and other ontologies are supported by new applications (ontoget, ontotext, ontodown, ontoup, ontogetsibs, ontogetcommon, ontogetroot, ontogetobsolete, ontoisobsolete, ontocount).

New applications search EDAM term names and definitions, retrieve all matching terms and their descendants, and compare to: applications (wosstopic, wossoperation, wossinput, wossoutput, wossdata); data resources (drfindresource, drfindid, drfindformat, drfinddata); and related EDAM terms (edamhasinput, edamhasoutput, edamisid, edamisformat, edamissource).

3.0 DRCAT Data Resource Catalogue

DRCAT, the Data Resource Catalogue, is included in this release. DRCAT started as a description of databases found as cross-references in UniProt/SwissProt, extended by adding databases found as cross-references in EMBL/GenBank/DDBJ, plus others from Nucleic Acids Research, ELIXIR, and other sources.

Any database in DRCAT can be used by name from an EMBOSS application, returning sequence, feature, or text if a suitable data format is defined for any query, or creating a URL which can be pasted into a browser where the results are, for example, a graphical display using javascript which EMBOSS cannot interpret.

We aim to further extend and improve DRCAT in future releases.
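Retrieving a term together with its descendants (or its ancestors), as the ontology applications in section 2.2 are described as doing, amounts to walking a parent/child hierarchy. The sketch below is only an illustration in Python: the term names are made up and are not real EDAM identifiers, and the functions are hypothetical helpers, not part of EMBOSS.

```python
from collections import defaultdict

# Toy hierarchy: each term maps to its direct parents (made-up names).
TOY_PARENTS = {
    "topic": [],
    "sequence_analysis": ["topic"],
    "alignment": ["sequence_analysis"],
    "pairwise_alignment": ["alignment"],
    "phylogeny": ["topic"],
}

def build_children(parents):
    """Invert a term->parents mapping into a parent->children mapping."""
    children = defaultdict(set)
    for term, term_parents in parents.items():
        for parent in term_parents:
            children[parent].add(term)
    return children

def walk(term, links):
    """Collect every term reachable from `term` by following `links`."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for neighbour in links.get(current, ()):
            if neighbour not in found:
                found.add(neighbour)
                stack.append(neighbour)
    return found

def descendants(term, parents):
    """All terms below `term` in the hierarchy."""
    return walk(term, build_children(parents))

def ancestors(term, parents):
    """All terms above `term` in the hierarchy."""
    return walk(term, parents)
```

The same walk works when a term has more than one parent, which is why visited terms are tracked in a set.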
4.0 NCBI Taxonomy

Taxonomy data from the NCBI taxonomy is included as standard in the release. New applications retrieve single nodes and their ancestors and descendants (taxget, taxgetup, taxgetdown, taxgetspecies, taxgetrank).

5.0 Maintenance

Application digest has been renamed pepdigest to avoid a clash with another utility. The name is also in keeping with the EMBOSS naming of other protein analysis applications.

Sequence and feature formats have been reviewed and updated, especially GFF3, GenPept, SAM, BAM and treecon.

GFF3 output now more closely follows the official standard, including the escaping of special characters in the tag/value final column. GFF3 ID and Parent tags are supported.

Features with exons are now stored as a list of exon subfeatures. This change allows easier sorting of features by location, keeping groups of features together, and has simplified the generation of several feature output formats.

Graphical output for more than one input sequence has been corrected and enhanced.

The lindna application has been adjusted to correctly relocate overlapping text and to generate a clean sequence ruler for any range of positions. New report formats allow reported hits (-rformat draw) and restriction sites (-rformat restrict) to be plotted by lindna. We expect to work further on the views that these outputs generate.

The einverted application had a bug (also in the original version) when an inverted repeat maximum score was close to the edge of the search window. This was seen only at low threshold scores. Searches with low threshold scores can be expected to yield slightly different choices of hits.

In ACD files, the "gui" and "batch" application attributes are assumed to be "true" if missing. Previous releases defined them as "false" internally, but fortunately no parsers seem to have used the internal default value.

Database indexes created by the dbx programs now include a count of unique and total keys.
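The GFF3 escaping mentioned in the maintenance notes can be sketched as follows. This is a reading of the GFF3 specification's rules for the attributes column (percent-encode tab, newline, carriage return, "%", control characters, and the reserved ";", "=", "&" and ","), written in Python for illustration; the function names are hypothetical and this is not the EMBOSS implementation.

```python
# Characters with a reserved meaning in the GFF3 attributes column.
GFF3_RESERVED = set(";=&,%\t\n\r")

def escape_gff3_value(value):
    """Percent-encode reserved and control characters in an attribute value."""
    out = []
    for ch in value:
        if ch in GFF3_RESERVED or ord(ch) < 0x20 or ord(ch) == 0x7F:
            out.append(f"%{ord(ch):02X}")
        else:
            out.append(ch)  # everything else, including spaces, is left as-is
    return "".join(out)

def format_attributes(pairs):
    """Join (tag, value) pairs into a GFF3 attributes column string."""
    return ";".join(f"{tag}={escape_gff3_value(str(value))}" for tag, value in pairs)
```

For example, a Note value containing a semicolon must be escaped so that it is not read as the start of the next tag.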
The text index files also report the type as "Identifier" or "Secondary" and whether the index is compressed.

EMBOSS configuration now uses autoheader and has less dependency on the version of libtool.

6.0 Installation notes

6.1 UNIX

The size of the EMBOSS package has shot up by approximately 60MB compared with the last major release. This is largely due to pre-supplied data and index files for ontology/taxonomy/etc. A typical installation size (shared images) is approximately 360MB.

Though not a requirement of EMBOSS there are some associated packages which may be installed prior to configuration that will allow you to use some optional access methods.

6.1.1 MySQL

This is used, for example, by the Ensembl access code. It will be automatically configured if the (MySQL-supplied) 'mysql_config' application is found in the PATH and if the associated development files (compiler headers etc) are also installed. As an example, for Linux systems, both things will be done by installing the mysql-devel (RPM distributions) or mysql-dev (Debian-based distributions) package. If your MySQL installation is in some arbitrary location then you can specify it using the --with-mysql= compilation switch.

6.1.2 PostgreSQL

This is used by some servers (e.g. flybase/genedb). Similar considerations apply to those described for MySQL above. Auto-detection is based on the presence in the PATH of 'pg_config'; dev[el] files must be installed; the --with-postgresql configuration switch can be used for arbitrary locations.

6.1.3 axis2c

EMBOSS optionally uses the 1.6.0 release of Axis2C for retrieval from SOAP servers: http://axis.apache.org/axis2/c/core/

There is a Linux binary distribution but, even so, Linux users may find themselves having to install from source (and may need to do an 'autoreconf -fi' prior to configuration to fix a subsequent compilation error on some systems). Auto-detection (by EMBOSS) of this package is based on the presence of a pkgconfig file that axis2c installs.
It is advised that you install pkgconfig if not already installed (it usually is pre-installed on Linux systems). EMBOSS has a --with_axis2c= configure switch if you install axis2c into a location other than /usr or /usr/local (typically).

6.1.4 Other optional library software

Installation of libraries for PNG (libpng/libgd) and PDF (libhpdf aka libharu) follows considerations given in previous releases and should be familiar to EMBOSS administrators by now.

6.1.5 eprimer3 and eprimer32

The Primer3 authors have released a 2.x.x version which differs significantly from the 1.x.x series. Unfortunately the executable is called the same for both releases (primer3_core). EMBOSS 6.4.0 provides two wrappers for these releases: eprimer3 is for the 1.x.x version and requires the primer3 executable to be called 'primer3_core' (this has always been the case); eprimer32 is for the 2.x.x version and requires the primer3 executable to be called 'primer32_core'. This may involve some minor symlinking and/or directory/PATH reorganisation by administrators.

6.2 mEMBOSS

A typical installation executable is approximately 70MB and results in an installation size of approximately 570MB. MySQL, PostgreSQL, Axis2c, libhpdf (etc) come pre-supplied as part of the mEMBOSS installation.

The QA test suite has been extended to automatically find and test both developer and end-user installations of mEMBOSS.

Note that, with the new server definitions in place (described above), the old SRS database definitions have been removed. You can now access databases using (e.g.) 'dbfetch:uniprotkb:opsd_human' as an ID. Such retrieval is much faster than the previously supplied SRS definitions.

7.0 New EMBASSY applications:

We have provided a wrapper package for the recently released clustal omega software which must, of course, also be installed.
We will add new releases of MIRA and VIENNA at a later date, when the new versions of the original packages are released and integrated. 8.0 Future development EMBOSS is fully funded until the end of December. We have an ambitious schedule of further developments planned for this period. There will be a further release of EMBOSS at the end of the year. We welcome any and all suggestions from our user and developer communities for immediate needs and future directions. At the end of this year the EMBOSS team will be leaving EBI. Peter Rice's maximum 9 year tenure is coming to an end. We do not yet know where we will be from January and are open to suggestions for ways to host and/or to fund further EMBOSS development and for potentially useful partnerships and collaborations to continue the advances we have made. We can most certainly guarantee that we will continue to maintain the existing code base and the latest releases. Alan From rothenbuhler at xoma.com Mon Jul 25 19:42:28 2011 From: rothenbuhler at xoma.com (Jake Rothenbuhler) Date: Mon, 25 Jul 2011 16:42:28 -0700 Subject: [EMBOSS] Algorithms for pI and molecular weight in pepstats Message-ID: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com> Hello, What are the algorithms used to compute the molecular weight and isoelectric point in pepstats? We are currently using pepstats to measure these properties in our in-house bioinformatics tools and some users are concerned because the results can differ from those returned by ExPASy. Thanks in advance, Jake Rothenbuhler Bioinformatics Programmer/Analyst XOMA (US) LLC (510) 204-7452 -- The information contained in this email message may contain confidential or legally privileged information and is intended solely for the use of the named recipient(s). No confidentiality or privilege is waived or lost by any transmission error. 
If the reader of this message is not the intended recipient, please immediately delete the e-mail and all copies of it from your system, destroy any hard copies of it and notify the sender either by telephone or return e-mail. Any direct or indirect use, disclosure, distribution, printing, or copying of any part of this message is prohibited. Any views expressed in this message are those of the individual sender, except where the message states otherwise and the sender is authorized to state them to be the views of XOMA.

From pmr at ebi.ac.uk Tue Jul 26 03:28:14 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 26 Jul 2011 08:28:14 +0100
Subject: [EMBOSS] Algorithms for pI and molecular weight in pepstats
In-Reply-To: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com>
References: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com>
Message-ID: <4E2E6C8E.5040502@ebi.ac.uk>

On 26/07/2011 00:42, Jake Rothenbuhler wrote:
> What are the algorithms used to compute the molecular weight and
> isoelectric point in pepstats? We are currently using pepstats to
> measure these properties in our in-house bioinformatics tools and some
> users are concerned because the results can differ from those returned
> by ExPASy.

There was discussion on this last year on this list too.

There is no single correct answer.

Molecular weights can use the average value for each amino acid to calculate the molecular weight of a protein, or monoisotopic values to calculate peptide masses for mass-spec data. Pepstats has a command line option -mono to use the monoisotopic weights. We use amino acid molecular weights from ExPASy findmod in the calculations.

The isoelectric point can be calculated for various conditions. When I checked last, ExPASy's protparam was set up for the isoelectric focusing phase of 2D gels under high urea conditions. It was unclear at the time where to find all the values needed to reproduce their calculation.
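To make the average versus monoisotopic distinction concrete, here is a minimal Python sketch of the arithmetic: sum the residue masses and add one water for the free termini. The residue masses are rounded textbook values for a handful of amino acids, included as an assumption for illustration; they are not read from the data pepstats uses, and the function name is hypothetical.

```python
# Residue masses in daltons (amino acid minus water); rounded textbook values.
AVERAGE_MASS = {"G": 57.0519, "A": 71.0788, "S": 87.0782, "V": 99.1326, "L": 113.1594}
MONOISOTOPIC_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841, "L": 113.08406}

WATER_AVERAGE = 18.01524
WATER_MONOISOTOPIC = 18.01056

def peptide_mass(sequence, monoisotopic=False):
    """Mass of an unmodified peptide: residue masses plus one water."""
    table = MONOISOTOPIC_MASS if monoisotopic else AVERAGE_MASS
    water = WATER_MONOISOTOPIC if monoisotopic else WATER_AVERAGE
    total = water
    for residue in sequence.upper():
        if residue not in table:
            raise ValueError(f"no mass in this sketch for residue {residue!r}")
        total += table[residue]
    return total
```

Even for a dipeptide the two conventions differ in the second decimal place, and the gap grows with sequence length, which is one reason two tools can report different weights for the same protein.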
We would like to update EMBOSS's protein property calculations, possibly with additional options or alternative parameter sets. Any suggestions from anyone on the list? regards, Peter Rice EMBOSS Team From ajb at ebi.ac.uk Tue Jul 26 11:24:35 2011 From: ajb at ebi.ac.uk (ajb at ebi.ac.uk) Date: Tue, 26 Jul 2011 16:24:35 +0100 (BST) Subject: [EMBOSS] mEMBOSS 6.4.0.1 available Message-ID: <53274.82.26.12.214.1311693875.squirrel@imap04.ebi.ac.uk> This is a bugfix release for the MS Windows version of EMBOSS, primarily to fix a problem printing very long ('long long') integers. Though most users would be unlikely to hit this problem an uninstall/reinstall is nevertheless recommended. The release also contains a few minor bugfixes, notably making visible some potentially hidden SOAP server definitions. It is available from the usual place: ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.1-setup.exe Alan From Narayana.Upadhyaya at csiro.au Wed Jul 27 05:15:09 2011 From: Narayana.Upadhyaya at csiro.au (Narayana.Upadhyaya at csiro.au) Date: Wed, 27 Jul 2011 19:15:09 +1000 Subject: [EMBOSS] getorf output discrepancy Message-ID: Hi, I am trying to get both NT and AA sequence out puts from a file with ~20,000 transcripts models using the getorf with following command:- getorf -minsize 200 -reverse Y myfile.fa -find 3 getorf -minsize 200 -reverse Y myfile.fa -find 1 I get the outputs all right. But I was expecting same number of sequences in both (with identical names in the header). But looks like at 60 odd sequences(which are there in the AA output) are missing in the NT out put. Can anyone explain this discrepancy? I tried putting the minsize option as "201" for both but the problem persists. Regards, Narayana From Narayana.Upadhyaya at csiro.au Wed Jul 27 05:30:09 2011 From: Narayana.Upadhyaya at csiro.au (Narayana.Upadhyaya at csiro.au) Date: Wed, 27 Jul 2011 19:30:09 +1000 Subject: [EMBOSS] getorf output discrepancy Message-ID: Hi I figured out the problem. 
Missing ORFs in NT output are the ones which are just 198 NT length. When I put minsize 198 for NT output I don't miss anything. Sorry for bothering. Narayana Hi, I am trying to get both NT and AA sequence out puts from a file with ~20,000 transcripts models using the getorf with following command:- getorf -minsize 200 -reverse Y myfile.fa -find 3 getorf -minsize 200 -reverse Y myfile.fa -find 1 I get the outputs all right. But I was expecting same number of sequences in both (with identical names in the header). But looks like at 60 odd sequences(which are there in the AA output) are missing in the NT out put. Can anyone explain this discrepancy? I tried putting the minsize option as "201" for both but the problem persists. Regards, Narayana From friedman at cancercenter.columbia.edu Wed Jul 27 12:31:01 2011 From: friedman at cancercenter.columbia.edu (Richard Friedman) Date: Wed, 27 Jul 2011 12:31:01 -0400 Subject: [EMBOSS] dotplots taking similarity into account Message-ID: Dear Emboss list, Is there a way to get dotplots that take similarity according to a similarity matrix, rather than strict identity into account? As far as I can see, dottup is based on identity. Is there a way that we can dotplots based on a similarity matrix similar to dotplot in GCG? I know that it may be tiresome that I use GCG as a standard, but it is what I know and it is serving as a point of departure while I am learning Emboss and redoing the GCG portion of my course in Emboss. I am enjoying learning about the ways in which Emboss offers improved functionality in the process as well. Thanks and best wishes, Rich ------------------------------------------------------------ Richard A. 
Friedman, PhD Associate Research Scientist, Biomedical Informatics Shared Resource Herbert Irving Comprehensive Cancer Center (HICCC) Lecturer, Department of Biomedical Informatics (DBMI) Educational Coordinator, Center for Computational Biology and Bioinformatics (C2B2)/ National Center for Multiscale Analysis of Genomic Networks (MAGNet) Room 824 Irving Cancer Research Center Columbia University 1130 St. Nicholas Ave New York, NY 10032 (212)851-4765 (voice) friedman at cancercenter.columbia.edu http://cancercenter.columbia.edu/~friedman/ I am a Bayesian. When I see a multiple-choice question on a test and I don't know the answer I say "eeney-meaney-miney-moe". Rose Friedman, Age 14 From s.newslists at gmail.com Wed Jul 27 14:14:03 2011 From: s.newslists at gmail.com (Stefan) Date: Wed, 27 Jul 2011 20:14:03 +0200 Subject: [EMBOSS] dotplots taking similarity into account In-Reply-To: References: Message-ID: Dear Richard, Dotmatcher uses a specified substitution matrix: http://emboss.open-bio.org/wiki/Appdoc:Dotmatcher Best regards, Stefan 2011/7/27 Richard Friedman : > Dear Emboss list, > > ? ? ? ?Is there a way to get dotplots that take similarity according to a > similarity matrix, > rather than strict ?identity into account? As far as I can see, dottup is > based on identity. > Is there a way that we can dotplots based on a similarity matrix similar to > dotplot in GCG? > I know that it may be tiresome that I use GCG as a standard, but it is what > I know and > it is serving as a point of departure while I am learning Emboss and redoing > the GCG > portion of my course in Emboss. I am enjoying learning about the ways in > which Emboss > offers improved functionality in the process as well. > > Thanks and best wishes, > Rich > ------------------------------------------------------------ > Richard A. 
Friedman, PhD > Associate Research Scientist, > Biomedical Informatics Shared Resource > Herbert Irving Comprehensive Cancer Center (HICCC) > Lecturer, > Department of Biomedical Informatics (DBMI) > Educational Coordinator, > Center for Computational Biology and Bioinformatics (C2B2)/ > National Center for Multiscale Analysis of Genomic Networks (MAGNet) > Room 824 > Irving Cancer Research Center > Columbia University > 1130 St. Nicholas Ave > New York, NY 10032 > (212)851-4765 (voice) > friedman at cancercenter.columbia.edu > http://cancercenter.columbia.edu/~friedman/ > > I am a Bayesian. When I see a multiple-choice question on a test and I don't > know the answer I say "eeney-meaney-miney-moe". > > Rose Friedman, Age 14 > > > > > > > > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss > From charles-listes-emboss at plessy.org Thu Jul 28 10:38:37 2011 From: charles-listes-emboss at plessy.org (Charles Plessy) Date: Thu, 28 Jul 2011 23:38:37 +0900 Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative Commons Attribution-NoDerivs Message-ID: <20110728143837.GC30927@merveille.plessy.net> Dear EMBOSS developers, (CC Debian Med mailing list) while working on upgrading Debian's emboss package to version 6.4.0 (congratulations, by the way), I found some files in EMBOSS that are not considered ?Free software? by Debian. They were actually present in past releases as well. Here is their list: test/data/amir.swiss test/data/uniprotft.sw test/swiss/seq.dat test/swnew/trembl.dat and emboss/data/dbxref.txt Their license is Creative Commons Attribution-NoDerivs 3.0 Unported (CC BY-ND 3.0), and it disallows modification of the files. The presence of these files in EMBOSS makes it impossible for Debian to redistribute it in our operating system. 
I have confirmed with the UniProt consortium's helpdesk that, even in isolation, these files are covered by the CC BY-ND license. I see three possible solutions. a) Remove the files in Debian's EMBOSS package. b) Distribute EMBOSS with the files, but in the non-free section of the Debian archive. c) Replace the files by Free equivalents, for instance by re-creating records from scratch. I am not very comfortable with any of the solutions, and was wondering if you would have suggestions ? Have a nice day, -- Charles Plessy Debian Med packaging team, http://www.debian.org/devel/debian-med Tsurumi, Kanagawa, Japan From mathog at caltech.edu Thu Jul 28 11:06:50 2011 From: mathog at caltech.edu (David Mathog) Date: Thu, 28 Jul 2011 08:06:50 -0700 Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative Commons Attribution-NoDerivs Message-ID: Charles Plessy wrote: >a) Remove the files in Debian's EMBOSS package. >b) Distribute EMBOSS with the files, but in the non-free section of the >Debian archive. >c) Replace the files by Free equivalents, for instance by re-creating >records from scratch. d) Add a small script that wget's each file from its original distribution site and installs it in the right place. Have the package install script either ask if it should run this script, or have it issue a message which describes the issue and leaves it up to the user to run the script. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From wolfgang.rumpf at gmail.com Thu Jul 28 13:14:23 2011 From: wolfgang.rumpf at gmail.com (Wolfgang Rumpf) Date: Thu, 28 Jul 2011 13:14:23 -0400 Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative Commons Attribution-NoDerivs In-Reply-To: References: Message-ID: <1DAFB6D6-77CF-424C-A2BC-CD1B57FE6671@gmail.com> I would prefer (c) or the newly-added (d) myself.... 
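Option (d) could look something like the sketch below: a small helper that downloads each file from a source URL and writes it to the matching path under the installation tree. It is written in Python with the standard library instead of wget, the function name is hypothetical, and the URL in the example mapping is a placeholder, since the thread does not give the real download locations.

```python
import shutil
import urllib.request
from pathlib import Path

def fetch_files(sources, install_root):
    """Download each URL in `sources` to its relative path under install_root."""
    installed = []
    for relative_path, url in sources.items():
        target = Path(install_root) / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        # Stream the response to disk without loading it all into memory.
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        installed.append(target)
    return installed

# Placeholder mapping: the path is from the thread, the URL is made up.
EXAMPLE_SOURCES = {
    "test/swiss/seq.dat": "https://example.org/path/to/seq.dat",
}
```

A package install script could print a notice about the licence and leave it to the user to call something like fetch_files(EXAMPLE_SOURCES, "/usr/share/EMBOSS") once real URLs are filled in; that installation path is also only an example.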
Cheers, Wolfgang -------------------------------------------------------------------------------------------------------------- Dr. Wolfgang Rumpf Senior Product Specialist & Director of Support, Rescentris Inc. Adjunct Faculty, Dept. of Biotechnology, UMUC -------------------------------------------------------------------------------------------------------------- wolfgang.rumpf at rescentris.com wolfgang.rumpf at gmail.com Mobile - (614) 638-6797 Skype - wolfgang.rumpf -------------------------------------------------------------------------------------------------------------- Read my Blog - "QuantumThoughts" - at http://culture.no-ip.org/quantumthoughts -------------------------------------------------------------------------------------------------------------- On Jul 28, 2011, at 11:06 AM, David Mathog wrote: > Charles Plessy wrote: > >> a) Remove the files in Debian's EMBOSS package. >> b) Distribute EMBOSS with the files, but in the non-free section of the >> Debian archive. >> c) Replace the files by Free equivalents, for instance by re-creating >> records from scratch. > > d) Add a small script that wget's each file from its original > distribution site and installs it in the right place. Have the package > install script either ask if it should run this script, or have it issue > a message which describes the issue and leaves it up to the user to run > the script. 
> > Regards, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss From s.newslists at gmail.com Thu Jul 28 13:24:53 2011 From: s.newslists at gmail.com (Stefan) Date: Thu, 28 Jul 2011 19:24:53 +0200 Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative Commons Attribution-NoDerivs In-Reply-To: <1DAFB6D6-77CF-424C-A2BC-CD1B57FE6671@gmail.com> References: <1DAFB6D6-77CF-424C-A2BC-CD1B57FE6671@gmail.com> Message-ID: I would prefer (d) and I know packages where this is realized like this. For example in SuSE the msttf fonts. Regards, Stefan 2011/7/28 Wolfgang Rumpf : > I would prefer (c) or the newly-added (d) myself.... > > > Cheers, > > > Wolfgang > > -------------------------------------------------------------------------------------------------------------- > Dr. Wolfgang Rumpf > Senior Product Specialist & Director of Support, Rescentris Inc. > Adjunct Faculty, Dept. of Biotechnology, UMUC > -------------------------------------------------------------------------------------------------------------- > wolfgang.rumpf at rescentris.com ? ? ? ? ? wolfgang.rumpf at gmail.com > Mobile - (614) 638-6797 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Skype - wolfgang.rumpf > -------------------------------------------------------------------------------------------------------------- > Read my Blog - "QuantumThoughts" - at http://culture.no-ip.org/quantumthoughts > -------------------------------------------------------------------------------------------------------------- > > On Jul 28, 2011, at 11:06 AM, David Mathog wrote: > >> Charles Plessy wrote: >> >>> a) Remove the files in Debian's EMBOSS package. >>> b) Distribute EMBOSS with the files, but in the non-free section of the >>> Debian archive. 
>>> c) Replace the files by Free equivalents, for instance by re-creating >>> records from scratch. >> >> d) ?Add a small script that wget's each file from its original >> distribution site and installs it in the right place. ?Have the package >> install script either ask if it should run this script, or have it issue >> a message which describes the issue and leaves it up to the user to run >> the script. >> >> Regards, >> >> David Mathog >> mathog at caltech.edu >> Manager, Sequence Analysis Facility, Biology Division, Caltech >> _______________________________________________ >> EMBOSS mailing list >> EMBOSS at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/emboss > > > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss > From rothenbuhler at xoma.com Thu Jul 28 18:44:47 2011 From: rothenbuhler at xoma.com (Jake Rothenbuhler) Date: Thu, 28 Jul 2011 15:44:47 -0700 Subject: [EMBOSS] Algorithms for pI and molecular weight in pepstats In-Reply-To: <4E2E6C8E.5040502@ebi.ac.uk> References: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com> <4E2E6C8E.5040502@ebi.ac.uk> Message-ID: <3110E5050A5DE54F8715EB2AC3D2057725FCED@cypress6.xoma.com> Thanks to Ingo and Peter for the quick and helpful replies. I've read through the discussion you had a year ago on this topic and it seems like it is still unresolved. > The isoelectric point can be calculated for various conditions. When I > checked last, ExPASy's protparam was set up the isoelectric focus phase > of 2D gels under high urea conditions. It was unclear at the time where > to find all the values needed to reproduce their calculation. I have been reading through the literature referenced by ExPASy's documentation. The article does not give pK values for all N-terminal residues. I've asked ExPASy support about the pK values used for residues not listed in the paper. 
If you're interested, I can keep you updated regarding their response. > We would like to update EMBOSS's protein property calculations, possibly > with additional options or alternative parameter sets. If it's something you'd like to include in EMBOSS, I'd be willing to contribute to an additional option for pI calculation that uses ExPASy's pK values. Jake Rothenbuhler Bioinformatics Programmer/Analyst XOMA (US) LLC (510) 204-7452 -- The information contained in this email message may contain confidential or legally privileged information and is intended solely for the use of the named recipient(s). No confidentiality or privilege is waived or lost by any transmission error. If the reader of this message is not the intended recipient, please immediately delete the e-mail and all copies of it from your system, destroy any hard copies of it and notify the sender either by telephone or return e-mail. Any direct or indirect use, disclosure, distribution, printing, or copying of any part of this message is prohibited. Any views expressed in this message are those of the individual sender, except where the message states otherwise and the sender is authorized to state them to be the views of XOMA. From pmr at ebi.ac.uk Fri Jul 29 03:28:48 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Fri, 29 Jul 2011 08:28:48 +0100 Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative Commons Attribution-NoDerivs In-Reply-To: <20110728143837.GC30927@merveille.plessy.net> References: <20110728143837.GC30927@merveille.plessy.net> Message-ID: <4E326130.7030507@ebi.ac.uk> On 28/07/2011 15:38, Charles Plessy wrote: > Dear EMBOSS developers, > (CC Debian Med mailing list) > > while working on upgrading Debian's emboss package to version 6.4.0 > (congratulations, by the way), I found some files in EMBOSS that are > not considered ?Free software? by Debian. They were actually present > in past releases as well. 
Here is their list: > > test/data/amir.swiss > test/data/uniprotft.sw > test/swiss/seq.dat > test/swnew/trembl.dat Huh? Example entries from UniProt? We can of course remove them from the distribution but then the QA tests will not work if anyone tries them. I suspect amir.swiss predates this UniProt licensing, but the others are more recently updated. Anyway, EMBOSS will work perfectly well without them. You can just delete them. > and emboss/data/dbxref.txt That one can go. It was a source for the DRCAT.dat data resource catalogue and yes we do have permission from UniProt to use it. > Their license is Creative Commons Attribution-NoDerivs 3.0 Unported (CC BY-ND > 3.0), and it disallows modification of the files. The presence of these files > in EMBOSS makes it impossible for Debian to redistribute it in our operating > system. I have confirmed with the UniProt consortium's helpdesk that, even in > isolation, these files are covered by the CC BY-ND license. I see three > possible solutions. > > a) Remove the files in Debian's EMBOSS package. > b) Distribute EMBOSS with the files, but in the non-free section of the Debian archive. > c) Replace the files by Free equivalents, for instance by re-creating records from scratch. > > I am not very comfortable with any of the solutions, and was wondering if you > would have suggestions ? I will also have words with the UniProt folk at EBI and if it really is not possible to include a few example entries with EMBOSS then I'll check with the other Open Bio projects. This is really silly. 
regards, Peter Rice EMBOSS Team From pmr at ebi.ac.uk Fri Jul 29 03:46:42 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Fri, 29 Jul 2011 08:46:42 +0100 Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative Commons Attribution-NoDerivs In-Reply-To: <20110728143837.GC30927@merveille.plessy.net> References: <20110728143837.GC30927@merveille.plessy.net> Message-ID: <4E326562.1020001@ebi.ac.uk> On 28/07/2011 15:38, Charles Plessy wrote: > Dear EMBOSS developers, > (CC Debian Med mailing list) > > while working on upgrading Debian's emboss package to version 6.4.0 > (congratulations, by the way), I found some files in EMBOSS that are > not considered ?Free software? by Debian. They were actually present > in past releases as well. > > Their license is Creative Commons Attribution-NoDerivs 3.0 Unported (CC BY-ND > 3.0), and it disallows modification of the files. The presence of these files > in EMBOSS makes it impossible for Debian to redistribute it in our operating > system. I have confirmed with the UniProt consortium's helpdesk that, even in > isolation, these files are covered by the CC BY-ND license. I see three > possible solutions. Ummm .... in what sense would *you* be modifying the files? UniProt's license http://www.uniprot.org/help/license says > License & disclaimer > > License > > We have chosen to apply the Creative Commons Attribution-NoDerivs License to all copyrightable parts of our databases. This means that you are free to copy, distribute, display and make commercial use of these databases in all legislations, provided you give us credit. However, if you intend to distribute a modified version of one of our databases, you must ask us for permission first. So I see no problem for EMBOSS in including the files. The only problem is for someone "modifying the files and redistributing them" without permission ... 
but strictly that would not apply to most uses of a UniProt entry (otherwise you could not use one entry as input and distribute the results). The licensing is there to prevent redistribution of UniProt without permission. Anyway, you can just delete them from the Debian distribution of EMBOSS - and find your own way to run the QA tests. I don't think we have a problem. regards, Peter Rice EMBOSS Team From pmr at ebi.ac.uk Fri Jul 29 04:39:46 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Fri, 29 Jul 2011 09:39:46 +0100 Subject: [EMBOSS] Files included in EMBOSS but licensed ... In-Reply-To: <4E326562.1020001@ebi.ac.uk> References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> Message-ID: <4E3271D2.2070906@ebi.ac.uk> On 07/29/2011 08:46 AM, Peter Rice wrote: > On 28/07/2011 15:38, Charles Plessy wrote: >> Dear EMBOSS developers, >> (CC Debian Med mailing list) >> >> while working on upgrading Debian's emboss package to version 6.4.0 >> (congratulations, by the way), I found some files in EMBOSS that are >> not considered “Free software” by Debian. While we're on the topic of licensing, some other data files in EMBOSS 6.4.0 have licences. emboss/data/OBO contains copies of several Open Bio-Ontologies for which EMBOSS includes index files - so you need the data file version that matches the index files. For example, the Gene Ontology terms http://www.geneontology.org/GO.cite.shtml are: GO Usage Policy The GO Consortium gives permission for any of its products to be used without license for any purpose under three conditions: That the Gene Ontology Consortium is clearly acknowledged as the source of the product; That any GO Consortium file(s) displayed publicly include the date(s) and/or version number(s) of the relevant GO file(s) (the GO is evolving and changes will occur with time); That neither the content of the GO file(s) nor the logical relationships embedded within the GO file(s) be altered in any way.
which looks rather like the problem you had with Creative Commons. Licenses that protect the official database release from derived versions are entirely reasonable and standard in bioinformatics. Basically, making sure that when you refer to a UniProt entry, or an OBO ontology term, everyone agrees you are referring to one agreed entry or term. EMBOSS does depend on these files. The database names are hard-coded into some of the new (and more to come) applications. You could download the databases and indexes from our rsync copies we use to keep developers in sync. These are at rsync://emboss.open-bio.org/EMBOSS/ It might make things clearer if someone from Debian could explain: (a) why a Creative Commons licence is an issue for you (b) why you appear to consider a copy of a whole or part of a public biological database as part of an "operating system" regards, Peter Rice EMBOSS Team From cjfields at illinois.edu Fri Jul 29 09:51:53 2011 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 29 Jul 2011 08:51:53 -0500 Subject: [EMBOSS] Files included in EMBOSS but licensed ... In-Reply-To: <4E3271D2.2070906@ebi.ac.uk> References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk> Message-ID: On Jul 29, 2011, at 3:39 AM, Peter Rice wrote: > On 07/29/2011 08:46 AM, Peter Rice wrote: >> On 28/07/2011 15:38, Charles Plessy wrote: >>> Dear EMBOSS developers, >>> (CC Debian Med mailing list) >>> >>> while working on upgrading Debian's emboss package to version 6.4.0 >>> (congratulations, by the way), I found some files in EMBOSS that are >>> not considered “Free software” by Debian. > > While we're on the topic of licensing, some other data files in EMBOSS > 6.4.0 have licences. > > emboss/data/OBO contains copies of several Open Bio-Ontologies for which > EMBOSS includes index files - so you need the data file version that > matches the index files.
> > For example, the Gene Ontology terms > http://www.geneontology.org/GO.cite.shtml are: > > GO Usage Policy > > The GO Consortium gives permission for any of its products to be used > without license for any purpose under three conditions: > > That the Gene Ontology Consortium is clearly acknowledged as the > source of the product; > That any GO Consortium file(s) displayed publicly include the > date(s) and/or version number(s) of the relevant GO file(s) (the GO is > evolving and changes will occur with time); > That neither the content of the GO file(s) nor the logical > relationships embedded within the GO file(s) be altered in any way. > > which looks rather like the problem you had with Creative Commons. > > Licenses that protect the official database release from derived > versions are entirely reasonable and standard in bioinformatics. > Basically, making sure that when you refer to a UniProt entry, or an OBO > ontology term, everyone agrees you are referring to one agreed entry or > term. > > EMBOSS does depend on these files. The database names are hard-coded > into some of the new (and more to come) applications. > > You could download the databases and indexes from our rsync copies we > use to keep developers in sync. These are at > rsync://emboss.open-bio.org/EMBOSS/ > > It might make things clearer if someone from Debian could explain: > > (a) why a Creative Commons licence is an issue for you > > (b) why you appear to consider a copy of a whole or part of a public > biological database as part of an "operating system" > > regards, > > Peter Rice > EMBOSS Team Charles, From the BioPerl perspective, this will very likely be a problem for us as well as all other Bio* languages (Biopython, BioJava, BioRuby); we typically include data derived from these sources. We may have a bit more flexibility in that the vast majority are mainly only for tests, but I believe some data is hard-coded in.
Fallback data like REBase for restriction analysis and GO (as Peter mentioned above) come to mind. chris Christopher Fields Senior Research Scientist National Center for Supercomputing Applications Institute for Genomic Biology University of Illinois Urbana-Champaign 1206 W. Gregory Dr., MC-195 Urbana, IL 61801 From asjo at koldfront.dk Fri Jul 29 16:35:13 2011 From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=) Date: Fri, 29 Jul 2011 22:35:13 +0200 Subject: [EMBOSS] Files included in EMBOSS but licensed ... References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk> Message-ID: <87sjpoq0zi.fsf@topper.koldfront.dk> On Fri, 29 Jul 2011 09:39:46 +0100, Peter wrote: > It might make things clearer if someone from Debian could explain: (I am not from Debian, but here is my take on it anyway:) > (a) why a Creative Commons licence is an issue for you One of the fundamental software freedoms is the freedom to change the software¹. The Debian Free Software Guidelines' definition of free software includes this freedom². So the "No Derivatives" variants of the Creative Commons licenses aren't free by the DFSG definition. (The GNU Free Documentation License on documents with invariant sections is considered non-free by DFSG-standards as well, even if the invariant sections are things that nobody would want to change.) When a project of volunteers packages 29000+ packages, I think making a judgement call on whether it is okay that the license of a couple of files does not live up to the guidelines is nigh impossible. The answer to "Why would you want to?" is, because you might need to. It is more obvious with programs and code than it is with database entries, granted - but I guess the equivalent problem would be that the licensor didn't want to fix a problem in such a database, and that problem made the programs using it malfunction.
It would be a pain if you weren't allowed to fix the problem and distribute the fixed data yourself, say, if "upstream" didn't want to include the fix for some reason or another; maybe they happened to turn sour on the world/you - stranger things have happened. I don't think that will happen in this specific case, but making judgement calls on what organisations/people will do in the future isn't quite firm ground. So, nobody is probably ever going to exercise that freedom in this specific case, I think, but ignoring some of the freedoms in special cases is infeasible for a project such as Debian. This is just me trying to explain how I understand it, so take it with a grain of salt, and swing by debian-legal³ for the experts. > (b) why you appear to consider a copy of a whole or part of a public > biological database as part of an "operating system" They are part of a package which is included in the Debian GNU/Linux free operating system. (I personally think it would make sense to change to a Creative Commons license that allows derivative works - Uniprot and others are going to be the canonical source for the data anyway, so nothing will be lost by them by doing that, as far as I can see.) Best regards, Adam ¹ http://en.wikipedia.org/wiki/Free_software#Definition ² http://en.wikipedia.org/wiki/Debian_Free_Software_Guidelines ³ http://lists.debian.org/debian-legal/ -- "Good car to drive after a war" Adam Sjøgren asjo at koldfront.dk From pmr at ebi.ac.uk Sat Jul 30 04:58:07 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Sat, 30 Jul 2011 09:58:07 +0100 Subject: [EMBOSS] Files included in EMBOSS but licensed ...
In-Reply-To: <87sjpoq0zi.fsf@topper.koldfront.dk> References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk> <87sjpoq0zi.fsf@topper.koldfront.dk> Message-ID: <4E33C79F.8080402@ebi.ac.uk> Quoted in full for the benefit of the debian-med list who missed the original posting On 29/07/2011 21:35, Adam Sjøgren wrote: > On Fri, 29 Jul 2011 09:39:46 +0100, Peter wrote: > >> It might make things clearer if someone from Debian could explain: > > (I am not from Debian, but here is my take on it anyway:) > >> (a) why a Creative Commons licence is an issue for you > > One of the fundamental software freedoms is the freedom to change the > software¹. > > The Debian Free Software Guidelines' definition of free software > includes this freedom². > > So the "No Derivatives" variants of the Creative Commons licenses aren't > free by the DFSG definition. > > (The GNU Free Documentation License on documents with invariant sections > is considered non-free by DFSG-standards as well, even if the invariant > sections are things that nobody would want to change.) > > When a project of volunteers packages 29000+ packages, I think > making a judgement call on whether it is okay that the license of a > couple of files does not live up to the guidelines is nigh impossible. > The answer to "Why would you want to?" is, because you might need to. > > It is more obvious with programs and code than it is with database > entries, granted - but I guess the equivalent problem would be that the > licensor didn't want to fix a problem in such a database, and that > problem made the programs using it malfunction. It would be a pain if > you weren't allowed to fix the problem and distribute the fixed data > yourself, say, if "upstream" didn't want to include the fix for some > reason or another; maybe they happened to turn sour on the world/you - > stranger things have happened.
> > So, nobody is probably ever going to exercise that freedom in this > specific case, I think, but ignoring some of the freedoms in special > cases is infeasible for a project such as Debian. > > This is just me trying to explain how I understand it, so take it with a > grain of salt, and swing by debian-legal³ for the experts. A specific example might help. About 5 years ago a release of the UniProt database (as plain text files) broke the Wisconsin (GCG) sequence analysis package. They introduced extremely long lines in a data file that everyone assumed was only maximum 80 characters. As GCG was closed source, the fix required a change to the UniProt files to either wrap or truncate the 'offending' records. The fix was not to distribute a change to the data of course, but to write and distribute a simple perl script that wrapped the long records. That was not a licensing issue - the content stays the same, the format is changed, no changed data is distributed. But it does illustrate that the database licensing does not prevent 'fixing' a database. >> (b) why you appear to consider a copy of a whole or part of a public >> biological database as part of an "operating system" > > They are part of a package which is included in the Debian GNU/Linux > free operating system. I expect there are many problems that arise if data ... and documentation ... are considered to be software. For EMBOSS we didn't officially specify a license for the documentation but other packages probably do. It still worries me that some of our documentation files officially include GPL licensed (EMBOSS) source code but I did not like any of the alternative documentation licenses. > (I personally think it would make sense to change to a Creative Commons > license that allows derivative works - Uniprot and others are going to > be the canonical source for the data anyway, so nothing will be lost by > them by doing that, as far as I can see.) Unlikely.
The no-derivatives version is specifically there to prevent derivatives - for example Debian distributing a modified UniProt without permission. The ontologies are similar, but do allow for the use case of importing terms from one ontology into another if the ontology name is changed (and preferably if cross-references to the original are provided). Again, the need is to protect the integrity of the original ontology content so references to a GO term or a UniProt entry are clearly defined. This is essential for many of the public bioinformatics databases. Data and software are not the same in this context. I am curious whether documentation licensing raises any issues. Just my 2c worth Peter Rice EMBOSS Team From asjo at koldfront.dk Sat Jul 30 07:36:54 2011 From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=) Date: Sat, 30 Jul 2011 13:36:54 +0200 Subject: [EMBOSS] Files included in EMBOSS but licensed ... References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk> <87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk> Message-ID: <87ipqkgfu1.fsf@topper.koldfront.dk> On Sat, 30 Jul 2011 09:58:07 +0100, Peter wrote: > A specific example might help. About 5 years ago a release of the > UniProt database (as plain text files) broke the Wisconsin (GCG) > sequence analysis package. [...] This is the opposite problem of what I tried to sketch. Your example has closed source software that can't be fixed, leading to either preprocessing or changing the database rather than fixing the real problem. If the software had been free, you could just have fixed the software. Switch around "software" and "database", and you have the example I was trying to paint. > I expect there are many problems that arise if data ... and > documentation ... are considered to be software. Sure. The whole GFDL debate took quite a while, I think. 
But that doesn't change that one of the solutions outlined by Charles Plessy is necessary for Debian to distribute EMBOSS (and any other piece of free/redistributable software). >> (I personally think it would make sense to change to a Creative Commons >> license that allows derivative works - Uniprot and others are going to >> be the canonical source for the data anyway, so nothing will be lost by >> them by doing that, as far as I can see.) > Unlikely. The no-derivatives version is specifically there to prevent > derivatives - for example Debian distributing a modified UniProt > without permission. What I was trying to say is that I don't think that that clause gives any value to the owners of Uniprot and other databases. Why would Uniprot want to prevent derivative works? They'll always be the canonical source for the correct information. You are free to distribute a modified version of the man-page for ls(1) - but if you introduce errors in it or make it worse, nobody will choose your derived version. > The ontologies are similar, but do allow for the use case of importing > terms from one ontology into another if the ontology name is changed > (and preferably if cross-references to the original are provided). > Again, the need is to protect the integrity of the original ontology > content so references to a GO term or a UniProt entry are clearly > defined. I think the problem that is being protected against is non-existent. People don't want to break stuff that works, they want to be able to fix stuff that doesn't. > This is essential for many of the public bioinformatics databases. Why? Only a hypothetical derivative would be changed, not the original. If someone distributed a derivative that was broken, I think people would quickly abandon it.
Again, just my point of view - not representing or speaking for anyone :-) Best regards, Adam -- "Good car to drive after a war" Adam Sjøgren asjo at koldfront.dk From cjfields at illinois.edu Sat Jul 30 15:01:58 2011 From: cjfields at illinois.edu (Chris Fields) Date: Sat, 30 Jul 2011 14:01:58 -0500 Subject: [EMBOSS] Files included in EMBOSS but licensed ... In-Reply-To: <4E33C79F.8080402@ebi.ac.uk> References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk> <87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk> Message-ID: <5EF06959-FAA0-4DBB-96EA-013CFFDE2960@illinois.edu> On Jul 30, 2011, at 3:58 AM, Peter Rice wrote: > Quoted in full for the benefit of the debian-med list who missed the original posting > > On 29/07/2011 21:35, Adam Sjøgren wrote: >> On Fri, 29 Jul 2011 09:39:46 +0100, Peter wrote: >> >>> It might make things clearer if someone from Debian could explain: >> >> (I am not from Debian, but here is my take on it anyway:) >> >>> (a) why a Creative Commons licence is an issue for you >> >> One of the fundamental software freedoms is the freedom to change the >> software¹. >> >> The Debian Free Software Guidelines' definition of free software >> includes this freedom². >> >> So the "No Derivatives" variants of the Creative Commons licenses aren't >> free by the DFSG definition. >> >> (The GNU Free Documentation License on documents with invariant sections >> is considered non-free by DFSG-standards as well, even if the invariant >> sections are things that nobody would want to change.) >> >> When a project of volunteers packages 29000+ packages, I think >> making a judgement call on whether it is okay that the license of a >> couple of files does not live up to the guidelines is nigh impossible. > >> The answer to "Why would you want to?" is, because you might need to.
>> >> It is more obvious with programs and code than it is with database >> entries, granted - but I guess the equivalent problem would be that the >> licensor didn't want to fix a problem in such a database, and that >> problem made the programs using it malfunction. It would be a pain if >> you weren't allowed to fix the problem and distribute the fixed data >> yourself, say, if "upstream" didn't want to include the fix for some >> reason or another; maybe they happened to turn sour on the world/you - >> stranger things have happened. >> >> So, nobody is probably ever going to exercise that freedom in this >> specific case, I think, but ignoring some of the freedoms in special >> cases is infeasible for a project such as Debian. >> >> This is just me trying to explain how I understand it, so take it with a >> grain of salt, and swing by debian-legal³ for the experts. > > A specific example might help. About 5 years ago a release of the UniProt database (as plain text files) broke the Wisconsin (GCG) sequence analysis package. They introduced extremely long lines in a data file that everyone assumed was only maximum 80 characters. > > As GCG was closed source, the fix required a change to the UniProt files to either wrap or truncate the 'offending' records. > > The fix was not to distribute a change to the data of course, but to write and distribute a simple perl script that wrapped the long records. > > That was not a licensing issue - the content stays the same, the format is changed, no changed data is distributed. But it does illustrate that the database licensing does not prevent 'fixing' a database. > >>> (b) why you appear to consider a copy of a whole or part of a public >>> biological database as part of an "operating system" >> >> They are part of a package which is included in the Debian GNU/Linux >> free operating system. > > I expect there are many problems that arise if data ... and documentation ... are considered to be software.
For EMBOSS we didn't officially specify a license for the documentation but other packages probably do. It still worries me that some of our documentation files officially include GPL licensed (EMBOSS) source code but I did not like any of the alternative documentation licenses. I don't understand the logic behind why data would be considered software, unless one is using a very fuzzy definition of 'software'. Is this strictly a packaging issue, e.g. any data packaged with source makes it 'software'? Or just the fact that such data is licensed? Would a package of just data/docs (no code) be allowed? >> (I personally think it would make sense to change to a Creative Commons >> license that allows derivative works - Uniprot and others are going to >> be the canonical source for the data anyway, so nothing will be lost by >> them by doing that, as far as I can see.) > > Unlikely. The no-derivatives version is specifically there to prevent derivatives - for example Debian distributing a modified UniProt without permission. > > The ontologies are similar, but do allow for the use case of importing terms from one ontology into another if the ontology name is changed (and preferably if cross-references to the original are provided). Again, the need is to protect the integrity of the original ontology content so references to a GO term or a UniProt entry are clearly defined. > > This is essential for many of the public bioinformatics databases. Data and software are not the same in this context. I am curious whether documentation licensing raises any issues. > > Just my 2c worth > > Peter Rice > EMBOSS Team Maybe the best solution is to just package any data separately? We have talked about setting up a 'biodata' repository for common datasets from all the Bio* projects. Feel free to skip the rest of this, but: I agree with Peter's point, Uniprot and other databases license data this way for very good (and well-intentioned) reasons. 
For the Bio* languages there are instances where we use such data as a fallback in case a newer version isn't immediately available (REBase and SO come to mind, and I think we have others), so we are likely in the same boat as EMBOSS. I had a long screed here, but I found some original sources for the discussion re: Uniprot and use of Creative Commons licensing that states the reasoning for why this is in place: http://wiki.creativecommons.org/Case_Studies/Uniprot http://eric.jain.name/2006/02/07/uniprot-creative-commons/ http://sciencecommons.org/resources/faq/databases/ http://sciencecommons.org/resources/faq/database-protocol/ Note there is now a 'Database Protocol' (last link) that recommends a different license; that page nicely summarizes the history of the whole Creative Commons licensing affair and the issues of using a Creative Commons license re: databases, mainly due to the issue Peter mentioned above, that databases != software. Uniprot doesn't use this as of yet (so it doesn't solve the problem at hand), but it's possible this may change. chris Christopher Fields Senior Research Scientist National Center for Supercomputing Applications Institute for Genomic Biology University of Illinois Urbana-Champaign 1206 W. Gregory Dr., MC-195 Urbana, IL 61801 From asjo at koldfront.dk Sat Jul 30 15:34:30 2011 From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=) Date: Sat, 30 Jul 2011 21:34:30 +0200 Subject: [EMBOSS] Files included in EMBOSS but licensed ... References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk> <87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk> <5EF06959-FAA0-4DBB-96EA-013CFFDE2960@illinois.edu> Message-ID: <87d3grwojd.fsf@topper.koldfront.dk> On Sat, 30 Jul 2011 14:01:58 -0500, Chris wrote: > I don't understand the logic behind why data would be considered > software, unless one is using a very fuzzy definition of 'software'.
> Is this strictly a packaging issue, e.g. any data packaged with source > makes it 'software'? Or just the fact that such data is licensed? > Would a package of just data/docs (no code) be allowed? "The DFSG is focused on software, but the word itself is unclear - some apply it to everything that can be expressed as a stream of bits, while a minority considers it to refer to just computer programs. Also, the existence of PostScript, executable scripts, sourced documents, etc, greatly muddies the second definition. Thus, to break the confusion, in June 2004 the Debian project decided to explicitly apply the same principles to software documentation, multimedia data and other content. The non-program content of Debian began to comply with the DFSG more strictly in Debian 4.0 (released in April 2007) and subsequent releases." - http://en.wikipedia.org/wiki/DFSG#Non-.22software.22_content So no. > I agree with Peter's point, Uniprot and other databases license data > this way for very good (and well-intentioned) reasons. Several people have mentioned the existence of these good reasons for not allowing derived works when it comes to science/databases/biology; I wonder what those reasons are? Just curious. [...] > http://sciencecommons.org/resources/faq/database-protocol/ > Note there is now a 'Database Protocol' (last link) that recommends a > different license; that page nicely summarizes the history of the whole > Creative Commons licensing affair and the issues of using a Creative > Commons license re: databases, mainly due to the issue Peter mentioned > above, that databases != software. Uniprot doesn't use this as of yet > (so it doesn't solve the problem at hand), but it's possible this may > change. It sounds like Science Commons' Open Access Data Protocol means putting the data in the public domain, which would mean that derived works would very much be allowed?
This link explains the protocol: * http://sciencecommons.org/projects/publishing/open-access-data-protocol/ Best regards, Adam -- "Good car to drive after a war" Adam Sjøgren asjo at koldfront.dk From cjfields at illinois.edu Sat Jul 30 15:42:19 2011 From: cjfields at illinois.edu (Chris Fields) Date: Sat, 30 Jul 2011 14:42:19 -0500 Subject: [EMBOSS] Files included in EMBOSS but licensed ... In-Reply-To: <87ipqkgfu1.fsf@topper.koldfront.dk> References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk> <87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk> <87ipqkgfu1.fsf@topper.koldfront.dk> Message-ID: On Jul 30, 2011, at 6:36 AM, Adam Sjøgren wrote: > On Sat, 30 Jul 2011 09:58:07 +0100, Peter wrote: > >> A specific example might help. About 5 years ago a release of the >> UniProt database (as plain text files) broke the Wisconsin (GCG) >> sequence analysis package. > > [...] > > This is the opposite problem of what I tried to sketch. > > Your example has closed source software that can't be fixed, leading to > either preprocessing or changing the database rather than fixing the > real problem. > > If the software had been free, you could just have fixed the software. > > Switch around "software" and "database", and you have the example I was > trying to paint. Yes, if the source were available fixing the parser would have been the best option. But I think you are missing the fundamental point that Peter made (that you left out): the wording of the license allowed them to reformat the file w/o changing the actual content. I'm not sure but I believe many GenPept documents are Uniprot-derived and follow the same concept. Data records and databases are not software, unless you are using some very fuzzy definition of such. >> I expect there are many problems that arise if data ... and >> documentation ... are considered to be software. > > Sure. The whole GFDL debate took quite a while, I think.
> > But that doesn't change that one of the solutions outlined by Charles > Plessy is necessary for Debian to distribute EMBOSS (and any other piece > of free/redistributable software). You'll also note Charles's distaste for the options mentioned. He was also searching for alternatives. >>> (I personally think it would make sense to change to a Creative Commons >>> license that allows derivative works - Uniprot and others are going to >>> be the canonical source for the data anyway, so nothing will be lost by >>> them by doing that, as far as I can see.) > >> Unlikely. The no-derivatives version is specifically there to prevent >> derivatives - for example Debian distributing a modified UniProt >> without permission. > > What I was trying to say is that I don't think that that clause gives > any value to the owners of Uniprot and other databases. > > Why would Uniprot want to prevent derivative works? They'll always be > the canonical source for the correct information. The links provided in my other response indicate some of the mindset behind this. I think the main point is that the work has to be attributed, and that any changes to such data need permission of Uniprot, likely so any content changes can be curated and (possibly) propagated to future releases. This also ensures that a set of files from a third-party containing the Uniprot name will not be modified (e.g. all content can be trusted as coming from Uniprot w/o modification). I have seen instances where loose data control (such as annotation from a newly sequenced genome) become balkanized to the point that no one can clearly state who is the trusted source (even when the list of sources includes large databases such as NCBI/EBI). So I understand the reasoning for the license, but I also see Science Commons is recommending something less strict.
> You are free to distribute a modified version of the man-page for ls(1) > - but if you introduce errors in it or make it worse, nobody will choose > your derived version. That's a straw man argument; man page documentation for an app is not the same as a database record based on scientific data. Would you make the same argument (allow free content modification) for a scientific publication? I would, but only for corrections or for new data that support/contradict the original data, and even then it must go through some sort of mediation (an editor for instance), not unlike what a database curator does. >> The ontologies are similar, but do allow for the use case of importing >> terms from one ontology into another if the ontology name is changed >> (and preferably if cross-references to the original are provided). > >> Again, the need is to protect the integrity of the original ontology >> content so references to a GO term or a UniProt entry are clearly >> defined. > > I think the problem that is being protected against is non-existing. > > People don't want to break stuff that works, they want to be able to fix > stuff that doesn't. Simply opening the licensing up for any content modification doesn't solve the problem in the case of scientific databases, it potentially exacerbates it. Hence the variations in the licensing in the previous links I sent. By the way, if you think the classic 'vi vs emacs' arguments can get out of control, see what happens when you have competing groups trying to make changes to a sequence record w/o curation. I do agree that it would be nice for the barrier to database modification to be lowered. Many previous attempts have been made at doing this, such as including third-party annotation, but with the major databases they all seem to fall by the wayside and they seem to fall back to simple curation.
Maybe it's time to come up with a git/hg for biological data, where one could fork records and make changes for submission; at least there one could have a trusted source and easier paths to data modification. Just a thought. >> This is essential for many of the public bioinformatics databases. > > Why? Only a hypothetical derivative would be changed, not the original. > > If someone distributed a derivative that was broken, I think people > would quickly abandon it. How could one tell the difference if both versions are implied to come from Uniprot (even if one comes from a third/fourth/fifth party)? There is no guarantee beyond going back and comparing the records to the original Uniprot data. > Again, just my point of view - not representing or speaking for anyone :-) > > > Best regards, > > Adam chris Christopher Fields Senior Research Scientist National Center for Supercomputing Applications Institute for Genomic Biology University of Illinois Urbana-Champaign 1206 W. Gregory Dr., MC-195 Urbana, IL 61801 From cjfields at illinois.edu Sat Jul 30 16:14:39 2011 From: cjfields at illinois.edu (Chris Fields) Date: Sat, 30 Jul 2011 15:14:39 -0500 Subject: [EMBOSS] Files included in EMBOSS but licensed ... In-Reply-To: <87d3grwojd.fsf@topper.koldfront.dk> References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk> <87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk> <5EF06959-FAA0-4DBB-96EA-013CFFDE2960@illinois.edu> <87d3grwojd.fsf@topper.koldfront.dk> Message-ID: (Charles, not sure you have been following, but any idea on the next steps and whether other packages like bioperl are affected?) On Jul 30, 2011, at 2:34 PM, Adam Sjøgren wrote: > On Sat, 30 Jul 2011 14:01:58 -0500, Chris wrote: > >> I don't understand the logic behind why data would be considered >> software, unless one is using a very fuzzy definition of 'software'. >> Is this strictly a packaging issue, e.g.
any data packaged with source >> makes it 'software'? Or just the fact that such data is licensed? >> Would a package of just data/docs (no code) be allowed? > > "The DFSG is focused on software, but the word itself is unclear - > some apply it to everything that can be expressed as a stream of > bits, while a minority considers it to refer to just computer > programs. Also, the existence of PostScript, executable scripts, > sourced documents, etc, greatly muddies the second definition. Thus, > to break the confusion, in June 2004 the Debian project decided to > explicitly apply the same principles to software documentation, > multimedia data and other content. The non-program content of Debian > began to comply with the DFSG more strictly in Debian 4.0 (released > in April 2007) and subsequent releases." > - http://en.wikipedia.org/wiki/DFSG#Non-.22software.22_content > > So no. Oh well; we'll leave that up to debian then. I think Peter and I stated our concerns, and possible options were stated by Charles and myself, no need to protract this out. I would rather find a solution. >> I agree with Peter's point, Uniprot and other databases license data >> this way for very good (and well-intentioned) reasons. > > Several people have mentioned the existence of these good reasons for > not allowing derived works when it comes to science/databases/biology; I > wonder what those reasons are? > > Just curious. Those links I passed on mention some of the primary concerns from both the Science Commons and Uniprot side. I believe it comes down to an issue of trusting the source of the data and the level of control the database wants (the latter was implied in Eric's blog post). > [...] 
>> http://sciencecommons.org/resources/faq/database-protocol/ > >> Note there is now a 'Database Protocol' (last link) that recommends a >> different license; that page nicely summarizes the history the whole >> Creative Commons licensing affair and the issues of using a Creative >> Commons license re: databases, mainly due to the issue Peter mentioned >> above, that databases != software. Uniprot doesn't use this as of yet >> (so it doesn't solve the problem at hand), but it's possible this may >> change. > > It sounds like Science Commons' Open Access Data Protocol means putting > the data in the public domain, which would mean that derived works would > very much be allowed? Yes, if one adopts that protocol (Uniprot hasn't). Eric's blog post indicates the CC-nonderivative was chose for a level of control both Uniprot users and curators felt comfortable with but wasn't overly restrictive. That's also from 2006, so a lot has likely changed since then. > This link explains the protocol: > > * http://sciencecommons.org/projects/publishing/open-access-data-protocol/ > > > Best regards, > > Adam There is no mention of derived or modified works there, but the brief mention of derived works from the Database Protocol page indicates that it is possibly allowed, yes. That may be an impediment to adoption by a database depending on what level of control they would like. I'm curious to see who has adopted it. chris From david.breimann at gmail.com Tue Jul 5 09:33:14 2011 From: david.breimann at gmail.com (David Breimann) Date: Tue, 5 Jul 2011 12:33:14 +0300 Subject: [EMBOSS] Updating EMBOSS Message-ID: Hello, I had an old installation of EMBOSS on my linux server (EMBOSS 6.0.1 according to embossversion). I downloaded EMBOSS 6.3.1, unpacked and compiled (following http://emboss.sourceforge.net/download/#Gettingstarted), hoping this will overwrite the older version. However, embossversion still returns 6.0.1. What should I do? 
Thanks, Dave From ajb at ebi.ac.uk Tue Jul 5 10:22:53 2011 From: ajb at ebi.ac.uk (ajb at ebi.ac.uk) Date: Tue, 5 Jul 2011 11:22:53 +0100 (BST) Subject: [EMBOSS] Updating EMBOSS In-Reply-To: References: Message-ID: <55403.82.26.12.214.1309861373.squirrel@imap04.ebi.ac.uk> Hello Dave, It depends on where/how you installed the different versions. If you had configured and installed using a prefix which specified a directory root which was to contain only emboss: e.g. ./configure --prefix=/fu/bar/emboss then you can just delete the /fu/bar/emboss directory and reinstall. If, however, you had installed EMBOSS using no prefix (such that it would be installed under /usr/local) or specified any other shared or system directory then the best means is usually to reinstall the old version (see ftp://emboss.open-bio.org/pub/EMBOSS/old/) on top of itself and then type: make uninstall If it were me I'd then do the same with the new version and have a nose-around to check that all traces of EMBOSS have been deleted, then reinstall the new version. We do recommend, when installing EMBOSS from source, to install it into its own directory (--prefix=/usr/local/emboss is a favourite example in administration documentation). HTH Alan > Hello, > > I had an old installation of EMBOSS on my linux server (EMBOSS 6.0.1 > according to embossversion). > I downloaded EMBOSS 6.3.1, unpacked and compiled (following > http://emboss.sourceforge.net/download/#Gettingstarted), hoping this will > overwrite the older version. > However, embossversion still returns 6.0.1. > What should I do? 
> > Thanks, > Dave > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss > From wo.granon at gmail.com Thu Jul 7 10:33:49 2011 From: wo.granon at gmail.com (Wolfgang) Date: Thu, 7 Jul 2011 12:33:49 +0200 Subject: [EMBOSS] Plasmid drawing Message-ID: Hello, are there any news to plasmid drawing (features and restriction sites) and improvement of cirdna, according to this message from 2005? http://www.mail-archive.com/emboss at emboss.open-bio.org/msg00040.html In our labs this is also a big point for users not to switch completely to emboss. Thanks, Wolfgang From pmr at ebi.ac.uk Thu Jul 7 11:33:05 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 07 Jul 2011 12:33:05 +0100 Subject: [EMBOSS] Plasmid drawing In-Reply-To: References: Message-ID: <4E159971.9070509@ebi.ac.uk> Dear Wolfgang, > are there any news to plasmid drawing (features and restriction sites) and > improvement of cirdna, according to this message from 2005? > http://www.mail-archive.com/emboss at emboss.open-bio.org/msg00040.html > > In our labs this is also a big point for users not to switch completely to > emboss. Very close to release date next week, so hard to do anything immediately. However, we did try adding a report format (an output choice for restrict and other applications) to create an input file for cirdna or lindna. Results at the time were poor, but I note we have revised both cirdna and lindna since. I will test whether results have improved. One possibility would be to re-enable this format so you can test and give us feedback on the new release. 
regards, Peter From hrh at fmi.ch Thu Jul 7 11:47:13 2011 From: hrh at fmi.ch (Hans-Rudolf Hotz) Date: Thu, 07 Jul 2011 13:47:13 +0200 Subject: [EMBOSS] Plasmid drawing In-Reply-To: <4E159971.9070509@ebi.ac.uk> References: <4E159971.9070509@ebi.ac.uk> Message-ID: <4E159CC1.9@fmi.ch> Hi Peter, We will be happy to help you testing and give feedback, since we are in a very similar situation to Wolfgang. Regards, Hans On 07/07/2011 01:33 PM, Peter Rice wrote: > Dear Wolfgang, > >> are there any news to plasmid drawing (features and restriction sites) >> and >> improvement of cirdna, according to this message from 2005? >> http://www.mail-archive.com/emboss at emboss.open-bio.org/msg00040.html >> >> In our labs this is also a big point for users not to switch >> completely to >> emboss. > > Very close to release date next week, so hard to do anything immediately. > > However, we did try adding a report format (an output choice for > restrict and other applications) to create an input file for cirdna or > lindna. > > Results at the time were poor, but I note we have revised both cirdna > and lindna since. > > I will test whether results have improved. One possibility would be to > re-enable this format so you can test and give us feedback on the new > release. > > regards, > > Peter > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss From pmr at ebi.ac.uk Thu Jul 7 12:07:19 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 07 Jul 2011 13:07:19 +0100 Subject: [EMBOSS] Plasmid drawing In-Reply-To: <4E159CC1.9@fmi.ch> References: <4E159971.9070509@ebi.ac.uk> <4E159CC1.9@fmi.ch> Message-ID: <4E15A177.6040201@ebi.ac.uk> On 07/07/11 12:47, Hans-Rudolf Hotz wrote: > Hi Peter, > > > We will be happy to help you testing and give feedback, since we are in > a very similar situation to Wolfgang. I'm curious. How many sites on this list (as a rough sample) are still running GCG? 
And how many are using some other commercial package for functions not in EMBOSS? Could be a very useful guide to the new applications needed. Peter From s.newslists at gmail.com Thu Jul 7 13:54:01 2011 From: s.newslists at gmail.com (Stefan) Date: Thu, 7 Jul 2011 15:54:01 +0200 Subject: [EMBOSS] Plasmid drawing In-Reply-To: References: <4E159971.9070509@ebi.ac.uk> <4E159CC1.9@fmi.ch> <4E15A177.6040201@ebi.ac.uk> Message-ID: Hi Peter, in our labs the people are also sad that they can not use the emboss suite for such daily work. We use two different applications: pDraw32 can draw plasmid cards. Very useful is the feature that it can generate a new plasmid out of two with given restriction enzymes. This can avoid a lot of little mistakes. ApE "A plasmid Editor" is very useful to find features in the plasmid. Often we get sequences where features such as the antibiotic resistance are missing. This tool can quickly find them and make draw a nice plasmid also with its restriction sites. We would be happy to use for all of this work the emboss suite. Also I would be happy to test. Best regards, Stefan 2011/7/7 Peter Rice : > On 07/07/11 12:47, Hans-Rudolf Hotz wrote: >> >> Hi Peter, >> >> >> We will be happy to help you testing and give feedback, since we are in >> a very similar situation to Wolfgang. > > > I'm curious. How many sites on this list (as a rough sample) are still > running GCG? > > And how many are using some other commercial package for functions not in > EMBOSS? > > Could be a very useful guide to the new applications needed. 
> > Peter > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss > From david.bauer at bayer.com Thu Jul 7 12:58:38 2011 From: david.bauer at bayer.com (david.bauer at bayer.com) Date: Thu, 7 Jul 2011 14:58:38 +0200 Subject: [EMBOSS] Antwort: Re: Plasmid drawing In-Reply-To: <4E15A177.6040201@ebi.ac.uk> Message-ID: We use VectorNTI for plasmid documentation and in-silico cloning. And as far as I know another widely used software for this purpos is 'Clone Manager' from 'Sci-Ed Software'. David. emboss-bounces at lists.open-bio.org schrieb am 07/07/2011 14:07:19: > On 07/07/11 12:47, Hans-Rudolf Hotz wrote: > > Hi Peter, > > > > > > We will be happy to help you testing and give feedback, since we are in > > a very similar situation to Wolfgang. > > > I'm curious. How many sites on this list (as a rough sample) are still > running GCG? > > And how many are using some other commercial package for functions not > in EMBOSS? > > Could be a very useful guide to the new applications needed. > > Peter > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss From cjfields at illinois.edu Thu Jul 7 15:10:35 2011 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 7 Jul 2011 10:10:35 -0500 Subject: [EMBOSS] Antwort: Re: Plasmid drawing In-Reply-To: References: Message-ID: I think Geneious and the CLC tools can also draw plasmid maps. Haven't used them extensively, though. re: GCG, Accelrys stopped GCG development in June 2008;I haven't seen anyone take up the perpetual license (which allows use of GCG, but with outdated databases, etc). Seems like everyone is implicitly being directed to EMBOSS. chris On Jul 7, 2011, at 7:58 AM, david.bauer at bayer.com wrote: > We use VectorNTI for plasmid documentation and in-silico cloning. 
> And as far as I know another widely used software for this purpos is > 'Clone Manager' from 'Sci-Ed Software'. > > David. > > emboss-bounces at lists.open-bio.org schrieb am 07/07/2011 14:07:19: > >> On 07/07/11 12:47, Hans-Rudolf Hotz wrote: >>> Hi Peter, >>> >>> >>> We will be happy to help you testing and give feedback, since we are > in >>> a very similar situation to Wolfgang. >> >> >> I'm curious. How many sites on this list (as a rough sample) are still >> running GCG? >> >> And how many are using some other commercial package for functions not >> in EMBOSS? >> >> Could be a very useful guide to the new applications needed. >> >> Peter >> _______________________________________________ >> EMBOSS mailing list >> EMBOSS at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/emboss > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss From kitagawam at takara-bio.co.jp Fri Jul 8 08:05:48 2011 From: kitagawam at takara-bio.co.jp (kitagawam at takara-bio.co.jp) Date: Fri, 8 Jul 2011 17:05:48 +0900 Subject: [EMBOSS] Antwort: Re: Plasmid drawing In-Reply-To: References: Message-ID: <678B3FABACE9F64B8FAF7A1045C3D67D4D943A52EE@tkrexmb1.central.takara.co.jp> I wish to recommend IMC. http://www.insilicobiology.jp/en/downloads ] -----Original Message----- ] From: emboss-bounces at lists.open-bio.org ] [mailto:emboss-bounces at lists.open-bio.org] On Behalf Of Chris Fields ] Sent: Friday, July 08, 2011 12:11 AM ] To: david.bauer at bayer.com ] Cc: emboss at lists.open-bio.org; emboss-bounces at lists.open-bio.org ] Subject: Re: [EMBOSS] Antwort: Re: Plasmid drawing ] ] I think Geneious and the CLC tools can also draw plasmid maps. Haven't ] used them extensively, though. ] ] re: GCG, Accelrys stopped GCG development in June 2008;I haven't seen anyone ] take up the perpetual license (which allows use of GCG, but with outdated ] databases, etc). 
Seems like everyone is implicitly being directed to ] EMBOSS. ] ] chris ] ] On Jul 7, 2011, at 7:58 AM, david.bauer at bayer.com wrote: ] ] > We use VectorNTI for plasmid documentation and in-silico cloning. ] > And as far as I know another widely used software for this purpos is ] > 'Clone Manager' from 'Sci-Ed Software'. ] > ] > David. ] > ] > emboss-bounces at lists.open-bio.org schrieb am 07/07/2011 14:07:19: ] > ] >> On 07/07/11 12:47, Hans-Rudolf Hotz wrote: ] >>> Hi Peter, ] >>> ] >>> ] >>> We will be happy to help you testing and give feedback, since we are ] > in ] >>> a very similar situation to Wolfgang. ] >> ] >> ] >> I'm curious. How many sites on this list (as a rough sample) are still ] >> running GCG? ] >> ] >> And how many are using some other commercial package for functions not ] >> in EMBOSS? ] >> ] >> Could be a very useful guide to the new applications needed. ] >> ] >> Peter ] >> _______________________________________________ ] >> EMBOSS mailing list ] >> EMBOSS at lists.open-bio.org ] >> http://lists.open-bio.org/mailman/listinfo/emboss ] > _______________________________________________ ] > EMBOSS mailing list ] > EMBOSS at lists.open-bio.org ] > http://lists.open-bio.org/mailman/listinfo/emboss ] ] ] _______________________________________________ ] EMBOSS mailing list ] EMBOSS at lists.open-bio.org ] http://lists.open-bio.org/mailman/listinfo/emboss From friedman at cancercenter.columbia.edu Wed Jul 13 20:56:29 2011 From: friedman at cancercenter.columbia.edu (Richard Friedman) Date: Wed, 13 Jul 2011 16:56:29 -0400 Subject: [EMBOSS] getting files in GCG format with annotation Message-ID: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu> Dear Emboss list, I am learning to use Emboss after being a long-time GCG user. The fetch command in GCG returns a file with the sequence in GCG format plus annotation. In EMBOSS I know how to get just sequence in GCG format with seqret. 
In EMBOSS I also know how to get the sequence plus annotation default format. What I would like to know is how using EMBOSS to get sequence plus annotation in GCG format like in GCG. Thanks and best wishes, Rich ------------------------------------------------------------ Richard A. Friedman, PhD Associate Research Scientist, Biomedical Informatics Shared Resource Herbert Irving Comprehensive Cancer Center (HICCC) Lecturer, Department of Biomedical Informatics (DBMI) Educational Coordinator, Center for Computational Biology and Bioinformatics (C2B2)/ National Center for Multiscale Analysis of Genomic Networks (MAGNet) Room 824 Irving Cancer Research Center Columbia University 1130 St. Nicholas Ave New York, NY 10032 (212)851-4765 (voice) friedman at cancercenter.columbia.edu http://cancercenter.columbia.edu/~friedman/ I am a Bayesian. When I see a multiple-choice question on a test and I don't know the answer I say "eeney-meaney-miney-moe". Rose Friedman, Age 14 From pmr at ebi.ac.uk Wed Jul 13 21:37:13 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 13 Jul 2011 22:37:13 +0100 Subject: [EMBOSS] getting files in GCG format with annotation In-Reply-To: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu> References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu> Message-ID: <4E1E1009.101@ebi.ac.uk> Dear Richard, On 13/07/2011 21:56, Richard Friedman wrote: > I am learning to use Emboss after being a long-time GCG user. > The fetch command in GCG returns a file with the sequence in GCG format > plus annotation. > > In EMBOSS I know how to get just sequence in GCG format with seqret. > In EMBOSS I also know how to get the sequence plus annotation default > format. > What I would like to know is how using EMBOSS to get sequence plus > annotation in GCG format > like in GCG. Hmmm ... by "annotation in GCG format" you mean the EMBL or Uniprot entry with gaps in the ". ." feaure records? The obvious question is why you need GCG format. 
GCG was not very clever in handling the annotation. You can get the sequence plus annotation in one file with: seqret -feature somedb:someid outfile.seq -osformat embl (or swiss) That gives you one file with "sequence plus annotation"... and you can use the annotation. You can also get the whole entry text with entret somedb:someid Hope that helps - and if not, please do ask again! Peter Rice EMBOSS Team From friedman at cancercenter.columbia.edu Thu Jul 14 16:08:17 2011 From: friedman at cancercenter.columbia.edu (Richard Friedman) Date: Thu, 14 Jul 2011 12:08:17 -0400 Subject: [EMBOSS] getting files in GCG format with annotation In-Reply-To: <4E1E1009.101@ebi.ac.uk> References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu> <4E1E1009.101@ebi.ac.uk> Message-ID: <134BC19D-E1C0-4788-A700-5B212192AD6B@cancercenter.columbia.edu> Dear Peter and Guy, I guess I just cling to the familiar. The output formats given by emboss are fine, One more obscure question: As far as I can see, the output from "seqret -feature" and "entret" are the same. Are there any differences? Thanks and best wishes, Rich On Jul 13, 2011, at 5:37 PM, Peter Rice wrote: > Dear Richard, > > On 13/07/2011 21:56, Richard Friedman wrote: >> I am learning to use Emboss after being a long-time GCG user. >> The fetch command in GCG returns a file with the sequence in GCG >> format >> plus annotation. >> >> In EMBOSS I know how to get just sequence in GCG format with seqret. >> In EMBOSS I also know how to get the sequence plus annotation default >> format. >> What I would like to know is how using EMBOSS to get sequence plus >> annotation in GCG format >> like in GCG. > > Hmmm ... by "annotation in GCG format" you mean the EMBL or Uniprot > entry with gaps in the ". ." feaure records? > > The obvious question is why you need GCG format. GCG was not very > clever in handling the annotation. 
> > You can get the sequence plus annotation in one file with: > > seqret -feature somedb:someid outfile.seq -osformat embl (or swiss) > > That gives you one file with "sequence plus annotation"... and you > can use the annotation. > > You can also get the whole entry text with entret somedb:someid > > Hope that helps - and if not, please do ask again! > > Peter Rice > EMBOSS Team > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss From pmr at ebi.ac.uk Thu Jul 14 16:14:03 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 14 Jul 2011 17:14:03 +0100 Subject: [EMBOSS] getting files in GCG format with annotation In-Reply-To: <134BC19D-E1C0-4788-A700-5B212192AD6B@cancercenter.columbia.edu> References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu> <4E1E1009.101@ebi.ac.uk> <134BC19D-E1C0-4788-A700-5B212192AD6B@cancercenter.columbia.edu> Message-ID: <4E1F15CB.5020205@ebi.ac.uk> On 14/07/2011 17:08, Richard Friedman wrote: > Dear Peter and Guy, > > I guess I just cling to the familiar. The output formats given by emboss > are fine, > One more obscure question: > > As far as I can see, the output from "seqret -feature" and "entret" are > the same. > Are there any differences? Not necessarily ... entret reports the exact text of the original input. seqret -feat with the same format as the input will rewrite everything using the output format. If that comes out identical then we are usually very happy (we do try to preserve everything in EMBL/GenBank and Swissprot formats) but there is no absolute guarantee. Also, strictly speaking, the output of entret is defined as "text" while the output of seqret is defined as "sequence" which leads to some distinctions - for example, you cannot choose an alternative output format for entret. Have fun with EMBOSS. Look out for the new release tomorrow! 
regards, Peter From gbottu at vub.ac.be Thu Jul 14 17:56:14 2011 From: gbottu at vub.ac.be (Guy Bottu) Date: Thu, 14 Jul 2011 19:56:14 +0200 Subject: [EMBOSS] getting files in GCG format with annotation In-Reply-To: <4E1E1009.101@ebi.ac.uk> References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu> <4E1E1009.101@ebi.ac.uk> Message-ID: <4E1F2DBE.70700@vub.ac.be> Dear Richard, I agree with Peter that it is not obvious what GCG simple sequence format is still useful for, since for giving the sequence as input to whatever software you can use seqret with whatever sequence format and for just reading the annotation you can use entret and for giving the features as input to whatever software you can use seqret with parameter -feature (GCG used for this the GCG RSF format but this did not become popular outside GCG/SeqLab). I can maybe add that a widely used format for features is GFF format and you can do : seqret -feature somedb:someid outfile.seq -osformat gff -oufo somegfffile You will obtain a file somegfffile in GFF format (with just the features, not the sequence). There is a lot of software that can use it. Regards, Guy Bottu, U.L.B. From ajb at ebi.ac.uk Fri Jul 15 08:54:26 2011 From: ajb at ebi.ac.uk (ajb at ebi.ac.uk) Date: Fri, 15 Jul 2011 09:54:26 +0100 (BST) Subject: [EMBOSS] EMBOSS 6.4.0 released Message-ID: <53026.82.26.12.214.1310720066.squirrel@imap04.ebi.ac.uk> EMBOSS Release 6.4.0 This release is now available on our OBF ftp server. UNIX version: ftp://emboss.open-bio.org/pub/EMBOSS/ mEMBOSS (MS Windows version): ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.0-setup.exe It includes major extensions to the type and number of data resources available to EMBOSS users. 
In addition, three books are published by Cambridge University Press: EMBOSS User's Guide: Practical Bioinformatics http://www.cambridge.org/gb/knowledge/isbn/item5979294/?site_locale=en_GB EMBOSS Developer's Guide: Bioinformatics Programming http://www.cambridge.org/gb/knowledge/isbn/item5979293/?site_locale=en_GB EMBOSS Administrator's Guide: Bioinformatics Software Management http://www.cambridge.org/gb/knowledge/isbn/item5979238/?site_locale=en_GB They are comprehensive and definitive guides to administering, developing and using EMBOSS. We hope they will prove useful to the EMBOSS community and to anyone providing training courses covering EMBOSS. In addition to these publications we have a new website. http://emboss.open-bio.org Updates for the new features in 6.4.0 will be made available soon on the new EMBOSS website, with tutorials to be developed on the EBI e-Learning Portal. Contents: 1.0 New in 6.4.0 1.1 Server definitions 1.2 Access methods 1.3 emboss.standard file 1.4 new data types 1.5 new query language 1.6 Hash tables and lists 1.7 Cross-references 1.8 URL generation 1.9 Database index compression 1.10 Database indexing applications 1.11 Generating server cache files 1.12 Server and database attributes 1.13 HTTP redirection 1.14 EMBOSS version number 1.15 ACD list 'select all' 2.0 EDAM Ontology 2.1 EDAM in ACD files 2.2 EDAM applications 3.0 DRCAT Data Resource Catalogue 4.0 NCBI Taxonomy 5.0 Maintenance 6.0 Installation Notes 6.1 UNIX 6.1.1 MySQL 6.1.2 PostgreSQL 6.1.3 axis2c 6.1.4 Other optional library software 6.1.5 eprimer3 and eprimer32 6.2 mEMBOSS 7.0 New EMBASSY applications 8.0 Future 1.0 New in 6.4.0 1.1 Server definitions Servers can be defined, in a similar style to a database definition, but covering all databases available from a single server. The server definition names a cache file describing each database, its format and its query fields. Cache files for a core set of public servers are included in the release. 
1.2 Access methods New access methods are provided, including Ensembl, BioMart, DAS, SOAP web services (EBI wsdbfetch and ebeye), REST web services (EBI dbfetch), and GMOD/CHADO. Ensembl access uses code contributed by Michael Schuster in the Ensembl team at EBI. This code is updated after each Ensembl API release. Some of these access methods were available but only partly implemented in the previous release. They now support standard server and database definitions and are open for further development. Data access methods have been restructured to use "text" access for any method which seeks a position in a file and then opens it for reading. This includes reading from a URL and returning a pointer to the start of the output. A few datatype-specific access methods remain, for example reading sequence data from a PIR/NBRF/GCG format database, or from the NCBI taxonomy files, or access to database systems via SQL or DAS. 1.3 emboss.standard file Previous releases depended on a user defining databases in their emboss.default file. Release 6.4.0 provides a new emboss.standard file defining the core servers and databases, and standard resource settings for database indexing. The local emboss.default file is only needed for local database definitions and settings. The configuration files emboss.standard, emboss.default and ~/.embossrc resolve variable references (e.g. in directory names) during parsing. Extensions to the syntax of these files include ALIAS to give secondary names to a database. IF, IFDEF, ELSE and ENDIF directives allow conditional inclusion of sections of the file dependent on variable settings. Special variables EMBOSS_AXIS2, EMBOSS_MYSQL, EMBOSS_POSTGRESQL and EMBOSS_SQL are automatically created for this purpose. New variable EMBOSS_STANDARD is automatically defined to be the share/EMBOSS install directory (or the emboss source code directory if the package is not installed).
This is by default where the emboss.standard files and server cache files are expected to be found. The value is reported by "embossversion -full". 1.4 new data types New data types are available as inputs and outputs of applications. Each has a simple definition including qualifiers -iformat for input format and -oformat for output format. The maxreads attribute defines whether the application expects to read a single entry (maxreads: 1) or loop over multiple entries (the default). This is simpler than the sequence and seqall definitions for sequence which are widely used and will remain unchanged. * text and outtext: the text of an entry for which EMBOSS has (to date) no specialised parser * obo and oboout: terms in an OBO ontology. Six ontologies are included in the release as source and index files (EDAM, GO, SO, RO, PW, ECO). We plan to add more and welcome suggestions for inclusion. * resource and resourceout: entries in the Data Resource Catalogue * taxon and outtaxon: nodes in the NCBI taxonomy which is indexed and included in the release * url and outurl: a database name from the Data Resource Catalogue, and an identifier, converted into a URL which can be pasted into a browser to cover cases where the URL does not return simple text or HTML data. * for future extension, assembly and variation datatypes are defined for development and use in a later release. 1.5 New query language All data types use a common query language. The existing "USA" (uniform sequence address) syntax is still valid for sequence data, but is also now used for features, obo terms, data resources, taxons and plain text data. In response to comments from our Scientific Advisory Board, we have extended the query language to cover multiple identifiers, multiple fields, and operators to combine elements of the query. * id lists: dbname:{ida,idb,idc} searches for 3 identifiers (id, accession, etc.)
in a database * or operator: dbname-{id:h* | des:hemoglobin} searches for all entries with identifiers starting with 'h' plus any others that include the word 'hemoglobin' in their descriptions. * not operator: dbname-{id:h* ! des:hemoglobin} searches for all entries with identifiers starting with 'h' that do not include the word 'hemoglobin' in their descriptions. * and operator: dbname-{id:h* & des:hemoglobin} searches for all entries with identifiers starting with 'h' that also include the word 'hemoglobin' in their descriptions. * eor operator: dbname-{id:h* ^ des:hemoglobin} searches for all entries with identifiers starting with 'h' that do not include the word 'hemoglobin' in their descriptions, and all those starting with another character that do include the word 'hemoglobin' in their description. This is the opposite of the and (&) operator. Query operators are not supported by all access methods. Where an operator is invalid an error message gives the list of valid operators. For example, the query syntax for SRS (srs, srswww access) does not include the exclusive-or (^) operator but supports the others as these are standard elements in SRS queries. The query language only allows a single database name in the query. This allows EMBOSS to combine query results for a single query expression. To query multiple databases a list file input with one database query on each line can be used. Indexed strings containing non-alphabetic characters including white space are simplified by converting a run of such characters to a single underscore. The same transformation is applied to a query string for the dbx (emboss) access method. This is especially useful for brackets and other characters in data resource names in DRCAT. We hope that the extended query language and the index file compression will increase the use of locally indexed data in EMBOSS installations, and welcome feedback on further developments of the query language and indexing. 
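The four query operators described in section 1.5 behave like set operations on the entries matched by each side of the query. The following is only a minimal sketch of that semantics in Python, not EMBOSS code: the entry identifiers, descriptions and helper names (`match_id`, `match_des`, `combine`) are hypothetical and chosen for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical toy "database": identifier -> description.
# These are illustrative values, not real database content.
ENTRIES = {
    "hba": "hemoglobin alpha chain",
    "hbb": "hemoglobin beta chain",
    "h2ax": "histone variant",
    "glb5": "globin resembling hemoglobin",
    "myg": "myoglobin",
}


def match_id(pattern):
    """Return the ids matching a wildcard pattern such as 'h*'."""
    return {eid for eid in ENTRIES if fnmatchcase(eid, pattern)}


def match_des(word):
    """Return the ids whose description contains the given whole word."""
    return {eid for eid, des in ENTRIES.items() if word in des.split()}


# Query operator -> set operation with the semantics given in the notes.
OPERATORS = {
    "|": set.union,                 # or
    "&": set.intersection,          # and
    "!": set.difference,            # not (left side minus right side)
    "^": set.symmetric_difference,  # eor (in exactly one side)
}


def combine(left, op, right):
    """Combine two result sets with one of the query operators."""
    return OPERATORS[op](left, right)


ids = match_id("h*")
des = match_des("hemoglobin")
for op in "|&!^":
    print(op, sorted(combine(ids, op, des)))
```

Each printed line corresponds to one of the dbname-{id:h* OP des:hemoglobin} examples above, applied to the toy entries.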
1.6 Hash table and lists

The new query language is supported by extensions to tables and lists in the libraries. Tables can now be automatically resized. Merge operations on two tables combine their contents using the same operations (or, and, not, eor) as the query language. By resizing the tables first this operation can be made highly efficient. Destructors can be defined for list data and for table keys and data to automatically clean up after use. Tables with string keys can use C char* or string object queries in all cases. Lists and tables can now be reference counted, avoiding unnecessary copying especially in the Ensembl API code.

1.7 Cross-references

Cross-references from UniProt/SwissProt and EMBL/GenBank/DDBJ are collected by extended parsers. New application seqxref reports the cross-references. New application seqxrefget creates a script to retrieve cross-referenced data as the original entries, using entret for sequence data, feattext for feature data, ontotext for ontology terms, textget for text and urlget for data where "HTML" is the only available format.

1.8 URL generation

New application urlget returns a query URL from DRCAT with one or more identifiers. Where data is from a UniProt/SwissProt or EMBL/GenBank/DDBJ entry the DRCAT entry definition of the original cross-reference is used to select from several possible identifier terms in EDAM in order to choose the correct query.

1.9 Database index compression

Indexes created by dbxflat or dbxfasta are now, by default, compressed automatically. These files, especially for secondary text indexes such as description, taxonomy or keyword, could be very sparse. Up to 95% space savings were achieved in some cases. The indexes are still updatable by code which uncompresses, updates, and recompresses on-the-fly using a copy of the index.
1.10 Database indexing applications

New indexing applications dbxedam (EDAM), dbxresource (DRCAT), dbxtax (NCBI taxonomy) and dbxobo (any OBO ontology) are added for the new data resources provided as standard. Users can install new releases of the source data and run these applications to update the index files.

Application dbxflat can now index fastq format. This was included in 6.3.1 as a special addition for one user to test and is now fully supported.

New applications dbxreport and dbxstat report on the overall and detailed content of dbx database indexes.

In database indexing applications, the default "resource" name is one included in the emboss.standard file. Users can continue to define their own resource files. Indexing "resource" definitions can now specify the maximum length of any field, and the page size and cache size for any field, using attributes with the field name as a prefix.

1.11 Generating server cache files

New applications for major access methods query a server (for example, the DAS registry or Ensembl) to update the server cache file with a current set of database definitions. When run by the system administrator these can update the site-wide cache file, but they can also be run by an individual user to create a user-specific set of databases. The cache files are time stamped. EMBOSS uses the most recent system or user file.

1.12 Server and database attributes

New applications showserver and servertell describe all servers or the attributes of a single named server. We expect to extend these applications once we have feedback on the most useful information they should report. New application dbtell similarly reports on the attributes of a single named database.

Database (and server) definitions can use an attribute more than once if it is defined as "multiple". These include a new "field:" attribute which gives the name and description of a query field.
A list of "field:" attributes supersedes the old "fields:" attribute which listed all query field names but allowed no further annotation. Database field names are extended from the original fixed set of "SRS sequence" fields to any name. "id" and "acc" are assumed to be the names of identifier and accession fields. The "hasaccession" attribute is set automatically for databases where no "acc" field is found, avoiding some error messages where the attribute has been omitted. 1.13 HTTP redirection Data retrieval using HTTP now checks the returned header for redirects and automatically replaces the results with the output from the redirected URL. Where redirected URLs were found in standard database definitions (e.g. the EBI's dbfetch service) these have been replaced by the current URL. We have also seen redirects from case-sensitive servers which redirect a lower case accession number to one in upper case in the same URL. 1.14 EMBOSS version number The EMBOSS version number now has 4 digits (6.4.0.0). The fourth digit is only there so that the Windows port (mEMBOSS) shows the same version number for QA testing. In mEMBOSS the final digit is the build number. QA tests for mEMBOSS now use the same test definition and qatest script as on Linux. mEMBOSS file handling and reporting has been adapted to support POSIX and Windows style paths. 1.15 ACD list 'select all' In ACD files, a list or selection definition can default to "*" for "select all" if the "minimum" attribute allows all terms to be selected. 2.0 EDAM Ontology EDAM is a new ontology from the EMBRACE project now further developed by Jon Ison in the EMBOSS team. EDAM describes terms for topics (for applications and data), operations (algorithms), formats, identifiers and data (semantic descriptions of data content). 
EDAM terms are used throughout this release: to annotate all ACD files at the application, input, parameter and output levels; to annotate data resources and their web queries in the Data Resource Catalogue; and to annotate database and server definitions.

2.1 EDAM in ACD files

ACD files are annotated extensively with EDAM terms using the term id and the human-readable name. The EMBOSS application groups have been extended to match the EDAM topic annotations, with some applications moving to different or new groups. EDAM has been used to validate these groups by comparing the topics hierarchy with the group designations.

2.2 EDAM applications

EDAM can be queried within any specific namespace by new applications edamname and edamdef. EDAM and other ontologies are supported by new applications (ontoget, ontotext, ontodown, ontoup, ontogetsibs, ontogetcommon, ontogetroot, ontogetobsolete, ontoisobsolete, ontocount).

New applications search EDAM term names and definitions, retrieve all matching terms and their descendants, and compare to: applications (wosstopic, wossoperation, wossinput, wossoutput, wossdata); data resources (drfindresource, drfindid, drfindformat, drfinddata); and related EDAM terms (edamhasinput, edamhasoutput, edamisid, edamisformat, edamissource).

3.0 DRCAT Data Resource Catalogue

DRCAT, the Data Resource Catalogue, is included in this release. DRCAT started as a description of databases found as cross-references in UniProt/SwissProt, extended by adding databases found as cross-references in EMBL/GenBank/DDBJ, plus others from Nucleic Acids Research, ELIXIR, and other sources.

Any database in DRCAT can be used by name from an EMBOSS application, returning sequence, feature, or text if a suitable data format is defined for any query, or creating a URL which can be pasted into a browser where the results are, for example, a graphical display using javascript which EMBOSS cannot interpret.

We aim to further extend and improve DRCAT in future releases.
4.0 NCBI Taxonomy

Taxonomy data from the NCBI taxonomy is included as standard in the release. New applications retrieve single nodes and their ancestors and descendants (taxget, taxgetup, taxgetdown, taxgetspecies, taxgetrank).

5.0 Maintenance

Application digest has been renamed pepdigest to avoid a clash with another utility. The name is also in keeping with the EMBOSS naming of other protein analysis applications.

Sequence and feature formats have been reviewed and updated, especially GFF3, GenPept, SAM, BAM and treecon. GFF3 output now more closely follows the official standard, including the escaping of special characters in the tag/value final column. GFF3 ID and Parent tags are supported.

Features with exons are now stored as a list of exon subfeatures. This change allows easier sorting of features by location, keeping groups of features together, and has simplified the generation of several feature output formats.

Graphical output for more than one input sequence has been corrected and enhanced.

The lindna application has been adjusted to correctly relocate overlapping text and to generate a clean sequence ruler for any range of positions. New report formats allow reported hits (-rformat draw) and restriction sites (-rformat restrict) to be plotted by lindna. We expect to work further on the views that these outputs generate.

The einverted application had a bug (also in the original version) when an inverted repeat maximum score was close to the edge of the search window. This was seen only at low threshold scores. Searches with low threshold scores can be expected to yield slightly different choices of hits.

In ACD files, the "gui" and "batch" application attributes are assumed to be "true" if missing. Previous releases defined them as "false" internally, but fortunately no parsers seem to have used the internal default value.

Database indexes created by the dbx programs now include a count of unique and total keys.
The text index files also report the type as "Identifier" or "Secondary" and whether the index is compressed.

EMBOSS configuration now uses autoheader and has less dependency on the version of libtool.

6.0 Installation notes

6.1 UNIX

The size of the EMBOSS package has shot up by approximately 60MB compared with the last major release. This is largely due to pre-supplied data and index files for ontology/taxonomy/etc. A typical installation size (shared images) is approximately 360MB.

Though not a requirement of EMBOSS there are some associated packages which may be installed prior to configuration that will allow you to use some optional access methods.

6.1.1 MySQL

This is used, for example, by the Ensembl access code. It will be automatically configured if the (MySQL-supplied) 'mysql_config' application is found in the PATH and if the associated development files (compiler headers etc) are also installed. As an example, for Linux systems, both things will be done by installing the mysql-devel (RPM distributions) or mysql-dev (Debian-based distributions) package. If your MySQL installation is in some arbitrary location then you can specify it using the --with-mysql= compilation switch.

6.1.2 PostgreSQL

This is used by some servers (e.g. flybase/genedb). Similar considerations apply to those described for MySQL above. Auto-detection is based on the presence in the PATH of 'pg_config'; dev[el] files must be installed; and the --with-postgresql configuration switch can be used for arbitrary locations.

6.1.3 axis2c

EMBOSS optionally uses the 1.6.0 release of Axis2C for retrieval from SOAP servers: http://axis.apache.org/axis2/c/core/

There is a linux binary distribution but, even so, Linux users may find themselves having to install from source (and may need to do an 'autoreconf -fi' prior to configuration to fix a subsequent compilation error on some systems). Auto-detection (by EMBOSS) of this package is based on the presence of a pkgconfig file that axis2c installs.
It is advised that you install pkgconfig if not already installed (it usually is pre-installed on Linux systems). EMBOSS has a --with_axis2c= configure switch if you install axis2c into a location other than /usr or /usr/local (typically).

6.1.4 Other optional library software

Installation of libraries for PNG (libpng/libgd) and PDF (libhpdf aka libharu) follow considerations given in previous releases and should be familiar to EMBOSS administrators by now.

6.1.5 eprimer3 and eprimer32

The Primer3 authors have released a 2.x.x version which differs significantly from the 1.x.x series. Unfortunately the executable is called the same for both releases (primer3_core). EMBOSS 6.4.0 provides two wrappers for these releases; eprimer3 is for the 1.x.x version and requires the primer3 executable to be called 'primer3_core' (this has always been the case); eprimer32 is for the 2.x.x version and requires the primer3 executable to be called primer32_core. This may involve some minor symlinking and/or directory/PATH reorganisation by administrators.

6.2 mEMBOSS

A typical installation executable is approximately 70MB and results in an installation size of approximately 570MB. MySQL, PostgreSQL, Axis2c, libhpdf (etc) come pre-supplied as part of the mEMBOSS installation. The QA test suite has been extended to automatically find and test both developer and end-user installations of mEMBOSS.

Note that, with the new server definitions in place (described above), the old SRS database definitions have been removed. You can now access databases using (e.g.) 'dbfetch:uniprotkb:opsd_human' as an ID. Such retrieval is much faster than the previously supplied SRS definitions.

7.0 New EMBASSY applications:

We have provided a wrapper package for the recently released clustal omega software which must, of course, also be installed.
We will add new releases of MIRA and VIENNA at a later date, when the new versions of the original packages are released and integrated. 8.0 Future development EMBOSS is fully funded until the end of December. We have an ambitious schedule of further developments planned for this period. There will be a further release of EMBOSS at the end of the year. We welcome any and all suggestions from our user and developer communities for immediate needs and future directions. At the end of this year the EMBOSS team will be leaving EBI. Peter Rice's maximum 9 year tenure is coming to an end. We do not yet know where we will be from January and are open to suggestions for ways to host and/or to fund further EMBOSS development and for potentially useful partnerships and collaborations to continue the advances we have made. We can most certainly guarantee that we will continue to maintain the existing code base and the latest releases. Alan From rothenbuhler at xoma.com Mon Jul 25 23:42:28 2011 From: rothenbuhler at xoma.com (Jake Rothenbuhler) Date: Mon, 25 Jul 2011 16:42:28 -0700 Subject: [EMBOSS] Algorithms for pI and molecular weight in pepstats Message-ID: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com> Hello, What are the algorithms used to compute the molecular weight and isoelectric point in pepstats? We are currently using pepstats to measure these properties in our in-house bioinformatics tools and some users are concerned because the results can differ from those returned by ExPASy. Thanks in advance, Jake Rothenbuhler Bioinformatics Programmer/Analyst XOMA (US) LLC (510) 204-7452 -- The information contained in this email message may contain confidential or legally privileged information and is intended solely for the use of the named recipient(s). No confidentiality or privilege is waived or lost by any transmission error. 
If the reader of this message is not the intended recipient, please immediately delete the e-mail and all copies of it from your system, destroy any hard copies of it and notify the sender either by telephone or return e-mail. Any direct or indirect use, disclosure, distribution, printing, or copying of any part of this message is prohibited. Any views expressed in this message are those of the individual sender, except where the message states otherwise and the sender is authorized to state them to be the views of XOMA.

From pmr at ebi.ac.uk Tue Jul 26 07:28:14 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 26 Jul 2011 08:28:14 +0100
Subject: [EMBOSS] Algorithms for pI and molecular weight in pepstats
In-Reply-To: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com>
References: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com>
Message-ID: <4E2E6C8E.5040502@ebi.ac.uk>

On 26/07/2011 00:42, Jake Rothenbuhler wrote:
> What are the algorithms used to compute the molecular weight and
> isoelectric point in pepstats? We are currently using pepstats to
> measure these properties in our in-house bioinformatics tools and some
> users are concerned because the results can differ from those returned
> by ExPASy.

There was discussion on this last year on this list too. There is no single correct answer.

Molecular weights can use the average value for each amino acid to calculate the molecular weight of a protein, or monoisotopic values to calculate peptide masses for mass-spec data. Pepstats has a command line option -mono to use the monoisotopic weights. We use amino acid molecular weights from ExPASy findmod in the calculations.

The isoelectric point can be calculated for various conditions. When I checked last, ExPASy's protparam was set up for the isoelectric focusing phase of 2D gels under high urea conditions. It was unclear at the time where to find all the values needed to reproduce their calculation.
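To make the average versus monoisotopic distinction concrete, here is a minimal Python sketch of a peptide mass calculation: sum the residue masses and add one water for the free termini. It is not the pepstats implementation. The residue masses are commonly tabulated values typed in for illustration and are an assumption, not a copy of the EMBOSS data files or of the ExPASy findmod table, and the function name is hypothetical.

```python
# Commonly tabulated residue masses in daltons (assumption: illustrative
# values, not read from the EMBOSS data files).
AVERAGE = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_AVERAGE = 18.01528
WATER_MONOISOTOPIC = 18.01056


def peptide_mass(sequence, monoisotopic=False):
    """Sum residue masses and add one water for the N- and C-termini."""
    table = MONOISOTOPIC if monoisotopic else AVERAGE
    water = WATER_MONOISOTOPIC if monoisotopic else WATER_AVERAGE
    return sum(table[residue] for residue in sequence.upper()) + water


average_mass = peptide_mass("ACDEFGHIK")
mono_mass = peptide_mass("ACDEFGHIK", monoisotopic=True)
```

For the same sequence the monoisotopic mass comes out slightly lower than the average mass, so two tools can disagree simply because they default to different tables, before any difference in the pK values used for the isoelectric point is considered.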
We would like to update EMBOSS's protein property calculations, possibly with additional options or alternative parameter sets. Any suggestions from anyone on the list? regards, Peter Rice EMBOSS Team From ajb at ebi.ac.uk Tue Jul 26 15:24:35 2011 From: ajb at ebi.ac.uk (ajb at ebi.ac.uk) Date: Tue, 26 Jul 2011 16:24:35 +0100 (BST) Subject: [EMBOSS] mEMBOSS 6.4.0.1 available Message-ID: <53274.82.26.12.214.1311693875.squirrel@imap04.ebi.ac.uk> This is a bugfix release for the MS Windows version of EMBOSS, primarily to fix a problem printing very long ('long long') integers. Though most users would be unlikely to hit this problem an uninstall/reinstall is nevertheless recommended. The release also contains a few minor bugfixes, notably making visible some potentially hidden SOAP server definitions. It is available from the usual place: ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.1-setup.exe Alan From Narayana.Upadhyaya at csiro.au Wed Jul 27 09:15:09 2011 From: Narayana.Upadhyaya at csiro.au (Narayana.Upadhyaya at csiro.au) Date: Wed, 27 Jul 2011 19:15:09 +1000 Subject: [EMBOSS] getorf output discrepancy Message-ID: Hi, I am trying to get both NT and AA sequence out puts from a file with ~20,000 transcripts models using the getorf with following command:- getorf -minsize 200 -reverse Y myfile.fa -find 3 getorf -minsize 200 -reverse Y myfile.fa -find 1 I get the outputs all right. But I was expecting same number of sequences in both (with identical names in the header). But looks like at 60 odd sequences(which are there in the AA output) are missing in the NT out put. Can anyone explain this discrepancy? I tried putting the minsize option as "201" for both but the problem persists. Regards, Narayana From Narayana.Upadhyaya at csiro.au Wed Jul 27 09:30:09 2011 From: Narayana.Upadhyaya at csiro.au (Narayana.Upadhyaya at csiro.au) Date: Wed, 27 Jul 2011 19:30:09 +1000 Subject: [EMBOSS] getorf output discrepancy Message-ID: Hi I figured out the problem. 
Missing ORFs in NT output are the ones which are just 198 NT length. When I put minsize 198 for NT output I don't miss anything. Sorry for bothering. Narayana Hi, I am trying to get both NT and AA sequence out puts from a file with ~20,000 transcripts models using the getorf with following command:- getorf -minsize 200 -reverse Y myfile.fa -find 3 getorf -minsize 200 -reverse Y myfile.fa -find 1 I get the outputs all right. But I was expecting same number of sequences in both (with identical names in the header). But looks like at 60 odd sequences(which are there in the AA output) are missing in the NT out put. Can anyone explain this discrepancy? I tried putting the minsize option as "201" for both but the problem persists. Regards, Narayana From friedman at cancercenter.columbia.edu Wed Jul 27 16:31:01 2011 From: friedman at cancercenter.columbia.edu (Richard Friedman) Date: Wed, 27 Jul 2011 12:31:01 -0400 Subject: [EMBOSS] dotplots taking similarity into account Message-ID: Dear Emboss list, Is there a way to get dotplots that take similarity according to a similarity matrix, rather than strict identity into account? As far as I can see, dottup is based on identity. Is there a way that we can dotplots based on a similarity matrix similar to dotplot in GCG? I know that it may be tiresome that I use GCG as a standard, but it is what I know and it is serving as a point of departure while I am learning Emboss and redoing the GCG portion of my course in Emboss. I am enjoying learning about the ways in which Emboss offers improved functionality in the process as well. Thanks and best wishes, Rich ------------------------------------------------------------ Richard A. 
Friedman, PhD
Associate Research Scientist,
Biomedical Informatics Shared Resource
Herbert Irving Comprehensive Cancer Center (HICCC)
Lecturer,
Department of Biomedical Informatics (DBMI)
Educational Coordinator,
Center for Computational Biology and Bioinformatics (C2B2)/
National Center for Multiscale Analysis of Genomic Networks (MAGNet)
Room 824
Irving Cancer Research Center
Columbia University
1130 St. Nicholas Ave
New York, NY 10032
(212)851-4765 (voice)
friedman at cancercenter.columbia.edu
http://cancercenter.columbia.edu/~friedman/

I am a Bayesian. When I see a multiple-choice question on a test and I don't know the answer I say "eeney-meaney-miney-moe".

Rose Friedman, Age 14

From s.newslists at gmail.com Wed Jul 27 18:14:03 2011
From: s.newslists at gmail.com (Stefan)
Date: Wed, 27 Jul 2011 20:14:03 +0200
Subject: [EMBOSS] dotplots taking similarity into account
In-Reply-To:
References:
Message-ID:

Dear Richard,

Dotmatcher uses a specified substitution matrix: http://emboss.open-bio.org/wiki/Appdoc:Dotmatcher

Best regards,
Stefan

2011/7/27 Richard Friedman :
> Dear Emboss list,
>
> Is there a way to get dotplots that take similarity according to a
> similarity matrix,
> rather than strict identity into account? As far as I can see, dottup is
> based on identity.
> Is there a way that we can dotplots based on a similarity matrix similar to
> dotplot in GCG?
> I know that it may be tiresome that I use GCG as a standard, but it is what
> I know and
> it is serving as a point of departure while I am learning Emboss and redoing
> the GCG
> portion of my course in Emboss. I am enjoying learning about the ways in
> which Emboss
> offers improved functionality in the process as well.
>
> Thanks and best wishes,
> Rich
> ------------------------------------------------------------
> Richard A.
Friedman, PhD > Associate Research Scientist, > Biomedical Informatics Shared Resource > Herbert Irving Comprehensive Cancer Center (HICCC) > Lecturer, > Department of Biomedical Informatics (DBMI) > Educational Coordinator, > Center for Computational Biology and Bioinformatics (C2B2)/ > National Center for Multiscale Analysis of Genomic Networks (MAGNet) > Room 824 > Irving Cancer Research Center > Columbia University > 1130 St. Nicholas Ave > New York, NY 10032 > (212)851-4765 (voice) > friedman at cancercenter.columbia.edu > http://cancercenter.columbia.edu/~friedman/ > > I am a Bayesian. When I see a multiple-choice question on a test and I don't > know the answer I say "eeney-meaney-miney-moe". > > Rose Friedman, Age 14 > > > > > > > > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss > From charles-listes-emboss at plessy.org Thu Jul 28 14:38:37 2011 From: charles-listes-emboss at plessy.org (Charles Plessy) Date: Thu, 28 Jul 2011 23:38:37 +0900 Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative Commons Attribution-NoDerivs Message-ID: <20110728143837.GC30927@merveille.plessy.net> Dear EMBOSS developers, (CC Debian Med mailing list) while working on upgrading Debian's emboss package to version 6.4.0 (congratulations, by the way), I found some files in EMBOSS that are not considered ?Free software? by Debian. They were actually present in past releases as well. Here is their list: test/data/amir.swiss test/data/uniprotft.sw test/swiss/seq.dat test/swnew/trembl.dat and emboss/data/dbxref.txt Their license is Creative Commons Attribution-NoDerivs 3.0 Unported (CC BY-ND 3.0), and it disallows modification of the files. The presence of these files in EMBOSS makes it impossible for Debian to redistribute it in our operating system. 
I have confirmed with the UniProt consortium's helpdesk that, even in isolation, these files are covered by the CC BY-ND license. I see three possible solutions. a) Remove the files in Debian's EMBOSS package. b) Distribute EMBOSS with the files, but in the non-free section of the Debian archive. c) Replace the files by Free equivalents, for instance by re-creating records from scratch. I am not very comfortable with any of the solutions, and was wondering if you would have suggestions ? Have a nice day, -- Charles Plessy Debian Med packaging team, http://www.debian.org/devel/debian-med Tsurumi, Kanagawa, Japan From mathog at caltech.edu Thu Jul 28 15:06:50 2011 From: mathog at caltech.edu (David Mathog) Date: Thu, 28 Jul 2011 08:06:50 -0700 Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative Commons Attribution-NoDerivs Message-ID: Charles Plessy wrote: >a) Remove the files in Debian's EMBOSS package. >b) Distribute EMBOSS with the files, but in the non-free section of the >Debian archive. >c) Replace the files by Free equivalents, for instance by re-creating >records from scratch. d) Add a small script that wget's each file from its original distribution site and installs it in the right place. Have the package install script either ask if it should run this script, or have it issue a message which describes the issue and leaves it up to the user to run the script. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From wolfgang.rumpf at gmail.com Thu Jul 28 17:14:23 2011 From: wolfgang.rumpf at gmail.com (Wolfgang Rumpf) Date: Thu, 28 Jul 2011 13:14:23 -0400 Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative Commons Attribution-NoDerivs In-Reply-To: References: Message-ID: <1DAFB6D6-77CF-424C-A2BC-CD1B57FE6671@gmail.com> I would prefer (c) or the newly-added (d) myself.... 
Cheers, Wolfgang -------------------------------------------------------------------------------------------------------------- Dr. Wolfgang Rumpf Senior Product Specialist & Director of Support, Rescentris Inc. Adjunct Faculty, Dept. of Biotechnology, UMUC -------------------------------------------------------------------------------------------------------------- wolfgang.rumpf at rescentris.com wolfgang.rumpf at gmail.com Mobile - (614) 638-6797 Skype - wolfgang.rumpf -------------------------------------------------------------------------------------------------------------- Read my Blog - "QuantumThoughts" - at http://culture.no-ip.org/quantumthoughts -------------------------------------------------------------------------------------------------------------- On Jul 28, 2011, at 11:06 AM, David Mathog wrote: > Charles Plessy wrote: > >> a) Remove the files in Debian's EMBOSS package. >> b) Distribute EMBOSS with the files, but in the non-free section of the >> Debian archive. >> c) Replace the files by Free equivalents, for instance by re-creating >> records from scratch. > > d) Add a small script that wget's each file from its original > distribution site and installs it in the right place. Have the package > install script either ask if it should run this script, or have it issue > a message which describes the issue and leaves it up to the user to run > the script. 
> > Regards, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss From s.newslists at gmail.com Thu Jul 28 17:24:53 2011 From: s.newslists at gmail.com (Stefan) Date: Thu, 28 Jul 2011 19:24:53 +0200 Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative Commons Attribution-NoDerivs In-Reply-To: <1DAFB6D6-77CF-424C-A2BC-CD1B57FE6671@gmail.com> References: <1DAFB6D6-77CF-424C-A2BC-CD1B57FE6671@gmail.com> Message-ID: I would prefer (d) and I know packages where this is realized like this. For example in SuSE the msttf fonts. Regards, Stefan 2011/7/28 Wolfgang Rumpf : > I would prefer (c) or the newly-added (d) myself.... > > > Cheers, > > > Wolfgang > > -------------------------------------------------------------------------------------------------------------- > Dr. Wolfgang Rumpf > Senior Product Specialist & Director of Support, Rescentris Inc. > Adjunct Faculty, Dept. of Biotechnology, UMUC > -------------------------------------------------------------------------------------------------------------- > wolfgang.rumpf at rescentris.com ? ? ? ? ? wolfgang.rumpf at gmail.com > Mobile - (614) 638-6797 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Skype - wolfgang.rumpf > -------------------------------------------------------------------------------------------------------------- > Read my Blog - "QuantumThoughts" - at http://culture.no-ip.org/quantumthoughts > -------------------------------------------------------------------------------------------------------------- > > On Jul 28, 2011, at 11:06 AM, David Mathog wrote: > >> Charles Plessy wrote: >> >>> a) Remove the files in Debian's EMBOSS package. >>> b) Distribute EMBOSS with the files, but in the non-free section of the >>> Debian archive. 
>>> c) Replace the files by Free equivalents, for instance by re-creating >>> records from scratch. >> >> d) ?Add a small script that wget's each file from its original >> distribution site and installs it in the right place. ?Have the package >> install script either ask if it should run this script, or have it issue >> a message which describes the issue and leaves it up to the user to run >> the script. >> >> Regards, >> >> David Mathog >> mathog at caltech.edu >> Manager, Sequence Analysis Facility, Biology Division, Caltech >> _______________________________________________ >> EMBOSS mailing list >> EMBOSS at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/emboss > > > _______________________________________________ > EMBOSS mailing list > EMBOSS at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/emboss > From rothenbuhler at xoma.com Thu Jul 28 22:44:47 2011 From: rothenbuhler at xoma.com (Jake Rothenbuhler) Date: Thu, 28 Jul 2011 15:44:47 -0700 Subject: [EMBOSS] Algorithms for pI and molecular weight in pepstats In-Reply-To: <4E2E6C8E.5040502@ebi.ac.uk> References: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com> <4E2E6C8E.5040502@ebi.ac.uk> Message-ID: <3110E5050A5DE54F8715EB2AC3D2057725FCED@cypress6.xoma.com> Thanks to Ingo and Peter for the quick and helpful replies. I've read through the discussion you had a year ago on this topic and it seems like it is still unresolved. > The isoelectric point can be calculated for various conditions. When I > checked last, ExPASy's protparam was set up the isoelectric focus phase > of 2D gels under high urea conditions. It was unclear at the time where > to find all the values needed to reproduce their calculation. I have been reading through the literature referenced by ExPASy's documentation. The article does not give pK values for all N-terminal residues. I've asked ExPASy support about the pK values used for residues not listed in the paper. 
If you're interested, I can keep you updated regarding their response.

> We would like to update EMBOSS's protein property calculations, possibly
> with additional options or alternative parameter sets.

If it's something you'd like to include in EMBOSS, I'd be willing to contribute to an additional option for pI calculation that uses ExPASy's pK values.

Jake Rothenbuhler
Bioinformatics Programmer/Analyst
XOMA (US) LLC
(510) 204-7452

--
The information contained in this email message may contain confidential or legally privileged information and is intended solely for the use of the named recipient(s). No confidentiality or privilege is waived or lost by any transmission error. If the reader of this message is not the intended recipient, please immediately delete the e-mail and all copies of it from your system, destroy any hard copies of it and notify the sender either by telephone or return e-mail. Any direct or indirect use, disclosure, distribution, printing, or copying of any part of this message is prohibited. Any views expressed in this message are those of the individual sender, except where the message states otherwise and the sender is authorized to state them to be the views of XOMA.

From pmr at ebi.ac.uk Fri Jul 29 07:28:48 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 29 Jul 2011 08:28:48 +0100
Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative Commons Attribution-NoDerivs
In-Reply-To: <20110728143837.GC30927@merveille.plessy.net>
References: <20110728143837.GC30927@merveille.plessy.net>
Message-ID: <4E326130.7030507@ebi.ac.uk>

On 28/07/2011 15:38, Charles Plessy wrote:
> Dear EMBOSS developers,
> (CC Debian Med mailing list)
>
> while working on upgrading Debian's emboss package to version 6.4.0
> (congratulations, by the way), I found some files in EMBOSS that are
> not considered "Free software" by Debian. They were actually present
> in past releases as well.
Here is their list: > > test/data/amir.swiss > test/data/uniprotft.sw > test/swiss/seq.dat > test/swnew/trembl.dat Huh? Example entries from UniProt? We can of course remove them from the distribution but then the QA tests will not work if anyone tries them. I suspect amir.swiss predates this UniProt licensing, but the others are more recently updated. Anyway, EMBOSS will work perfectly well without them. You can just delete them. > and emboss/data/dbxref.txt That one can go. It was a source for the DRCAT.dat data resource catalogue and yes we do have permission from UniProt to use it. > Their license is Creative Commons Attribution-NoDerivs 3.0 Unported (CC BY-ND > 3.0), and it disallows modification of the files. The presence of these files > in EMBOSS makes it impossible for Debian to redistribute it in our operating > system. I have confirmed with the UniProt consortium's helpdesk that, even in > isolation, these files are covered by the CC BY-ND license. I see three > possible solutions. > > a) Remove the files in Debian's EMBOSS package. > b) Distribute EMBOSS with the files, but in the non-free section of the Debian archive. > c) Replace the files by Free equivalents, for instance by re-creating records from scratch. > > I am not very comfortable with any of the solutions, and was wondering if you > would have suggestions ? I will also have words with the UniProt folk at EBI and if it really is not possible to include a few example entries with EMBOSS then I'll check with the other Open Bio projects. This is really silly. 
regards, Peter Rice EMBOSS Team From pmr at ebi.ac.uk Fri Jul 29 07:46:42 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Fri, 29 Jul 2011 08:46:42 +0100 Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative Commons Attribution-NoDerivs In-Reply-To: <20110728143837.GC30927@merveille.plessy.net> References: <20110728143837.GC30927@merveille.plessy.net> Message-ID: <4E326562.1020001@ebi.ac.uk> On 28/07/2011 15:38, Charles Plessy wrote: > Dear EMBOSS developers, > (CC Debian Med mailing list) > > while working on upgrading Debian's emboss package to version 6.4.0 > (congratulations, by the way), I found some files in EMBOSS that are > not considered “Free software” by Debian. They were actually present > in past releases as well. > > Their license is Creative Commons Attribution-NoDerivs 3.0 Unported (CC BY-ND > 3.0), and it disallows modification of the files. The presence of these files > in EMBOSS makes it impossible for Debian to redistribute it in our operating > system. I have confirmed with the UniProt consortium's helpdesk that, even in > isolation, these files are covered by the CC BY-ND license. I see three > possible solutions. Ummm .... in what sense would *you* be modifying the files? UniProt's license http://www.uniprot.org/help/license says > License & disclaimer > > License > > We have chosen to apply the Creative Commons Attribution-NoDerivs License to all copyrightable parts of our databases. This means that you are free to copy, distribute, display and make commercial use of these databases in all legislations, provided you give us credit. However, if you intend to distribute a modified version of one of our databases, you must ask us for permission first. So I see no problem for EMBOSS in including the files. The only problem is for someone "modifying the files and redistributing them" without permission ...
but strictly that would not apply to most uses of a UniProt entry (otherwise you could not use one entry as input and distribute the results). The licensing is there to prevent redistribution of UniProt without permission. Anyway, you can just delete them from the Debian distribution of EMBOSS - and find your own way to run the QA tests. I don't think we have a problem. regards, Peter Rice EMBOSS Team From pmr at ebi.ac.uk Fri Jul 29 08:39:46 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Fri, 29 Jul 2011 09:39:46 +0100 Subject: [EMBOSS] Files included in EMBOSS but licensed ... In-Reply-To: <4E326562.1020001@ebi.ac.uk> References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> Message-ID: <4E3271D2.2070906@ebi.ac.uk> On 07/29/2011 08:46 AM, Peter Rice wrote: > On 28/07/2011 15:38, Charles Plessy wrote: >> Dear EMBOSS developers, >> (CC Debian Med mailing list) >> >> while working on upgrading Debian's emboss package to version 6.4.0 >> (congratulations, by the way), I found some files in EMBOSS that are >> not considered “Free software” by Debian. While we're on the topic of licensing, some other data files in EMBOSS 6.4.0 have licences. emboss/data/OBO contains copies of several Open Bio-Ontologies for which EMBOSS includes index files - so you need the data file version that matches the index files. For example, the Gene Ontology terms http://www.geneontology.org/GO.cite.shtml are: GO Usage Policy The GO Consortium gives permission for any of its products to be used without license for any purpose under three conditions: That the Gene Ontology Consortium is clearly acknowledged as the source of the product; That any GO Consortium file(s) displayed publicly include the date(s) and/or version number(s) of the relevant GO file(s) (the GO is evolving and changes will occur with time); That neither the content of the GO file(s) nor the logical relationships embedded within the GO file(s) be altered in any way.
which looks rather like the problem you had with Creative Commons. Licenses that protect the official database release from derived versions are entirely reasonable and standard in bioinformatics. Basically, making sure that when you refer to a UniProt entry, or an OBO ontology term, everyone agrees you are referring to one agreed entry or term. EMBOSS does depend on these files. The database names are hard-coded into some of the new (and more to come) applications. You could download the databases and indexes from our rsync copies we use to keep developers in sync. These are at rsync://emboss.open-bio.org/EMBOSS/ It might make things clearer if someone from Debian could explain: (a) why a Creative Commons licence is an issue for you (b) why you appear to consider a copy of a whole or part of a public biological database as part of an "operating system" regards, Peter Rice EMBOSS Team From cjfields at illinois.edu Fri Jul 29 13:51:53 2011 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 29 Jul 2011 08:51:53 -0500 Subject: [EMBOSS] Files included in EMBOSS but licensed ... In-Reply-To: <4E3271D2.2070906@ebi.ac.uk> References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk> Message-ID: On Jul 29, 2011, at 3:39 AM, Peter Rice wrote: > On 07/29/2011 08:46 AM, Peter Rice wrote: >> On 28/07/2011 15:38, Charles Plessy wrote: >>> Dear EMBOSS developers, >>> (CC Debian Med mailing list) >>> >>> while working on upgrading Debian's emboss package to version 6.4.0 >>> (congratulations, by the way), I found some files in EMBOSS that are >>> not considered “Free software” by Debian. > > While we're on the topic of licensing, some other data files in EMBOSS > 6.4.0 have licences. > > emboss/data/OBO contains copies of several Open Bio-Ontologies for which > EMBOSS includes index files - so you need the data file version that > matches the index files.
> > For example, the Gene Ontology terms > http://www.geneontology.org/GO.cite.shtml are: > > GO Usage Policy > > The GO Consortium gives permission for any of its products to be used > without license for any purpose under three conditions: > > That the Gene Ontology Consortium is clearly acknowledged as the > source of the product; > That any GO Consortium file(s) displayed publicly include the > date(s) and/or version number(s) of the relevant GO file(s) (the GO is > evolving and changes will occur with time); > That neither the content of the GO file(s) nor the logical > relationships embedded within the GO file(s) be altered in any way. > > which looks rather like the problem you had with Creative Commons. > > Licenses that protect the official database release from derived > versions are entirely reasonable and standard in bioinformatics. > Basically, making sure that when you refer to a UniProt entry, or an OBO > ontology term, everyone agrees you are referring to one agreed entry or > term. > > EMBOSS does depend on these files. The database names are hard-coded > into some of the new (and more to come) applications. > > You could download the databases and indexes from our rsync copies we > use to keep developers in sync. These are at > rsync://emboss.open-bio.org/EMBOSS/ > > It might make things clearer if someone from Debian could explain: > > (a) why a Creative Commons licence is an issue for you > > (b) why you appear to consider a copy of a whole or part of a public > biological database as part of an "operating system" > > regards, > > Peter Rice > EMBOSS Team Charles, From the BioPerl perspective, this will very likely be a problem for us as well as all other Bio* languages (Biopython, BioJava, BioRuby); we typically include data derived from these sources. We may have a bit more flexibility in that the vast majority are mainly only for tests, but I believe some data is hard-coded in.
Fallback data like REBase for restriction analysis and GO (as Peter mentioned above) come to mind. chris Christopher Fields Senior Research Scientist National Center for Supercomputing Applications Institute for Genomic Biology University of Illinois Urbana-Champaign 1206 W. Gregory Dr. , MC-195 Urbana, IL 61801 From asjo at koldfront.dk Fri Jul 29 20:35:13 2011 From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=) Date: Fri, 29 Jul 2011 22:35:13 +0200 Subject: [EMBOSS] Files included in EMBOSS but licensed ... References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk> Message-ID: <87sjpoq0zi.fsf@topper.koldfront.dk> On Fri, 29 Jul 2011 09:39:46 +0100, Peter wrote: > It might make things clearer if someone from Debian could explain: (I am not from Debian, but here is my take on it anyway:) > (a) why a Creative Commons licence is an issue for you One of the fundamental software freedoms is the freedom to change the software¹. The Debian Free Software Guidelines' definition of free software includes this freedom². So the "No Derivatives" variants of the Creative Commons licenses aren't free by the DFSG definition. (The GNU Free Documentation License on documents with invariant sections is considered non-free by DFSG-standards as well, even if the invariant sections are things that nobody would want to change.) When a project of volunteers packages 29000+ packages, I think making a judgement call on whether it is okay that the license of a couple of files does not live up to the guidelines is nigh impossible. The answer to "Why would you want to?" is, because you might need to. It is more obvious with programs and code than it is with database entries, granted - but I guess the equivalent problem would be that the licensor didn't want to fix a problem in such a database, and that problem made the programs using it malfunction.
It would be a pain if you weren't allowed to fix the problem and distribute the fixed data yourself, say, if "upstream" didn't want to include the fix for some reason or another; maybe they happened to turn sour on the world/you - stranger things have happened. I don't think that will happen in this specific case, but making judgement calls on what organisations/people will do in the future isn't quite firm ground. So, nobody is probably ever going to exercise that freedom in this specific case, I think, but ignoring some of the freedoms in special cases is infeasible for a project such as Debian. This is just me trying to explain how I understand it, so take it with a grain of salt, and swing by debian-legal³ for the experts. > (b) why you appear to consider a copy of a whole or part of a public > biological database as part of an "operating system" They are part of a package which is included in the Debian GNU/Linux free operating system. (I personally think it would make sense to change to a Creative Commons license that allows derivative works - Uniprot and others are going to be the canonical source for the data anyway, so nothing will be lost by them by doing that, as far as I can see.) Best regards, Adam ¹ http://en.wikipedia.org/wiki/Free_software#Definition ² http://en.wikipedia.org/wiki/Debian_Free_Software_Guidelines ³ http://lists.debian.org/debian-legal/ -- "Good car to drive after a war" Adam Sjøgren asjo at koldfront.dk From pmr at ebi.ac.uk Sat Jul 30 08:58:07 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Sat, 30 Jul 2011 09:58:07 +0100 Subject: [EMBOSS] Files included in EMBOSS but licensed ...
In-Reply-To: <87sjpoq0zi.fsf@topper.koldfront.dk> References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk> <87sjpoq0zi.fsf@topper.koldfront.dk> Message-ID: <4E33C79F.8080402@ebi.ac.uk> Quoted in full for the benefit of the debian-med list who missed the original posting On 29/07/2011 21:35, Adam Sjøgren wrote: > On Fri, 29 Jul 2011 09:39:46 +0100, Peter wrote: > >> It might make things clearer if someone from Debian could explain: > > (I am not from Debian, but here is my take on it anyway:) > >> (a) why a Creative Commons licence is an issue for you > > One of the fundamental software freedoms is the freedom to change the > software¹. > > The Debian Free Software Guidelines' definition of free software > includes this freedom². > > So the "No Derivatives" variants of the Creative Commons licenses aren't > free by the DFSG definition. > > (The GNU Free Documentation License on documents with invariant sections > is considered non-free by DFSG-standards as well, even if the invariant > sections are things that nobody would want to change.) > > When a project of volunteers packages 29000+ packages, I think > making a judgement call on whether it is okay that the license of a > couple of files does not live up to the guidelines is nigh impossible. > The answer to "Why would you want to?" is, because you might need to. > > It is more obvious with programs and code than it is with database > entries, granted - but I guess the equivalent problem would be that the > licensor didn't want to fix a problem in such a database, and that > problem made the programs using it malfunction. It would be a pain if > you weren't allowed to fix the problem and distribute the fixed data > yourself, say, if "upstream" didn't want to include the fix for some > reason or another; maybe they happened to turn sour on the world/you - > stranger things have happened.
> > So, nobody is probably ever going to exercise that freedom in this > specific case, I think, but ignoring some of the freedoms in special > cases is infeasible for a project such as Debian. > > This is just me trying to explain how I understand it, so take it with a > grain of salt, and swing by debian-legal³ for the experts. A specific example might help. About 5 years ago a release of the UniProt database (as plain text files) broke the Wisconsin (GCG) sequence analysis package. They introduced extremely long lines in a data file that everyone assumed was only maximum 80 characters. As GCG was closed source, the fix required a change to the UniProt files to either wrap or truncate the 'offending' records. The fix was not to distribute a change to the data of course, but to write and distribute a simple perl script that wrapped the long records. That was not a licensing issue - the content stays the same, the format is changed, no changed data is distributed. But it does illustrate that the database licensing does not prevent 'fixing' a database. >> (b) why you appear to consider a copy of a whole or part of a public >> biological database as part of an "operating system" > > They are part of a package which is included in the Debian GNU/Linux > free operating system. I expect there are many problems that arise if data ... and documentation ... are considered to be software. For EMBOSS we didn't officially specify a license for the documentation but other packages probably do. It still worries me that some of our documentation files officially include GPL licensed (EMBOSS) source code but I did not like any of the alternative documentation licenses. > (I personally think it would make sense to change to a Creative Commons > license that allows derivative works - Uniprot and others are going to > be the canonical source for the data anyway, so nothing will be lost by > them by doing that, as far as I can see.) Unlikely.
The no-derivatives version is specifically there to prevent derivatives - for example Debian distributing a modified UniProt without permission. The ontologies are similar, but do allow for the use case of importing terms from one ontology into another if the ontology name is changed (and preferably if cross-references to the original are provided). Again, the need is to protect the integrity of the original ontology content so references to a GO term or a UniProt entry are clearly defined. This is essential for many of the public bioinformatics databases. Data and software are not the same in this context. I am curious whether documentation licensing raises any issues. Just my 2c worth Peter Rice EMBOSS Team From asjo at koldfront.dk Sat Jul 30 11:36:54 2011 From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=) Date: Sat, 30 Jul 2011 13:36:54 +0200 Subject: [EMBOSS] Files included in EMBOSS but licensed ... References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk> <87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk> Message-ID: <87ipqkgfu1.fsf@topper.koldfront.dk> On Sat, 30 Jul 2011 09:58:07 +0100, Peter wrote: > A specific example might help. About 5 years ago a release of the > UniProt database (as plain text files) broke the Wisconsin (GCG) > sequence analysis package. [...] This is the opposite problem of what I tried to sketch. Your example has closed source software that can't be fixed, leading to either preprocessing or changing the database rather than fixing the real problem. If the software had been free, you could just have fixed the software. Switch around "software" and "database", and you have the example I was trying to paint. > I expect there are many problems that arise if data ... and > documentation ... are considered to be software. Sure. The whole GFDL debate took quite a while, I think. 
But that doesn't change that one of the solutions outlined by Charles Plessy is necessary for Debian to distribute EMBOSS (and any other piece of free/redistributable software). >> (I personally think it would make sense to change to a Creative Commons >> license that allows derivative works - Uniprot and others are going to >> be the canonical source for the data anyway, so nothing will be lost by >> them by doing that, as far as I can see.) > Unlikely. The no-derivatives version is specifically there to prevent > derivatives - for example Debian distributing a modified UniProt > without permission. What I was trying to say is that I don't think that that clause gives any value to the owners of Uniprot and other databases. Why would Uniprot want to prevent derivative works? They'll always be the canonical source for the correct information. You are free to distribute a modified version of the man-page for ls(1) - but if you introduce errors in it or make it worse, nobody will choose your derived version. > The ontologies are similar, but do allow for the use case of importing > terms from one ontology into another if the ontology name is changed > (and preferably if cross-references to the original are provided). > Again, the need is to protect the integrity of the original ontology > content so references to a GO term or a UniProt entry are clearly > defined. I think the problem that is being protected against is non-existing. People don't want to break stuff that works, they want to be able to fix stuff that doesn't. > This is essential for many of the public bioinformatics databases. Why? Only a hypothetical derivative would be changed, not the original. If someone distributed a derivative that was broken, I think people would quickly abandon it.
Again, just my point of view - not representing or speaking for anyone :-) Best regards, Adam -- "Good car to drive after a war" Adam Sjøgren asjo at koldfront.dk From cjfields at illinois.edu Sat Jul 30 19:01:58 2011 From: cjfields at illinois.edu (Chris Fields) Date: Sat, 30 Jul 2011 14:01:58 -0500 Subject: [EMBOSS] Files included in EMBOSS but licensed ... In-Reply-To: <4E33C79F.8080402@ebi.ac.uk> References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk> <87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk> Message-ID: <5EF06959-FAA0-4DBB-96EA-013CFFDE2960@illinois.edu> On Jul 30, 2011, at 3:58 AM, Peter Rice wrote: > Quoted in full for the benefit of the debian-med list who missed the original posting > > On 29/07/2011 21:35, Adam Sjøgren wrote: >> On Fri, 29 Jul 2011 09:39:46 +0100, Peter wrote: >> >>> It might make things clearer if someone from Debian could explain: >> >> (I am not from Debian, but here is my take on it anyway:) >> >>> (a) why a Creative Commons licence is an issue for you >> >> One of the fundamental software freedoms is the freedom to change the >> software¹. >> >> The Debian Free Software Guidelines' definition of free software >> includes this freedom². >> >> So the "No Derivatives" variants of the Creative Commons licenses aren't >> free by the DFSG definition. >> >> (The GNU Free Documentation License on documents with invariant sections >> is considered non-free by DFSG-standards as well, even if the invariant >> sections are things that nobody would want to change.) >> >> When a project of volunteers packages 29000+ packages, I think >> making a judgement call on whether it is okay that the license of a >> couple of files does not live up to the guidelines is nigh impossible. > >> The answer to "Why would you want to?" is, because you might need to.
>> >> It is more obvious with programs and code than it is with database >> entries, granted - but I guess the equivalent problem would be that the >> licensor didn't want to fix a problem in such a database, and that >> problem made the programs using it malfunction. It would be a pain if >> you weren't allowed to fix the problem and distribute the fixed data >> yourself, say, if "upstream" didn't want to include the fix for some >> reason or another; maybe they happened to turn sour on the world/you - >> stranger things have happened. >> >> So, nobody is probably ever going to exercise that freedom in this >> specific case, I think, but ignoring some of the freedoms in special >> cases is infeasible for a project such as Debian. >> >> This is just me trying to explain how I understand it, so take it with a >> grain of salt, and swing by debian-legal³ for the experts. > > A specific example might help. About 5 years ago a release of the UniProt database (as plain text files) broke the Wisconsin (GCG) sequence analysis package. They introduced extremely long lines in a data file that everyone assumed was only maximum 80 characters. > > As GCG was closed source, the fix required a change to the UniProt files to either wrap or truncate the 'offending' records. > > The fix was not to distribute a change to the data of course, but to write and distribute a simple perl script that wrapped the long records. > > That was not a licensing issue - the content stays the same, the format is changed, no changed data is distributed. But it does illustrate that the database licensing does not prevent 'fixing' a database. > >>> (b) why you appear to consider a copy of a whole or part of a public >>> biological database as part of an "operating system" >> >> They are part of a package which is included in the Debian GNU/Linux >> free operating system. > > I expect there are many problems that arise if data ... and documentation ... are considered to be software.
For EMBOSS we didn't officially specify a license for the documentation but other packages probably do. It still worries me that some of our documentation files officially include GPL licensed (EMBOSS) source code but I did not like any of the alternative documentation licenses. I don't understand the logic behind why data would be considered software, unless one is using a very fuzzy definition of 'software'. Is this strictly a packaging issue, e.g. any data packaged with source makes it 'software'? Or just the fact that such data is licensed? Would a package of just data/docs (no code) be allowed? >> (I personally think it would make sense to change to a Creative Commons >> license that allows derivative works - Uniprot and others are going to >> be the canonical source for the data anyway, so nothing will be lost by >> them by doing that, as far as I can see.) > > Unlikely. The no-derivatives version is specifically there to prevent derivatives - for example Debian distributing a modified UniProt without permission. > > The ontologies are similar, but do allow for the use case of importing terms from one ontology into another if the ontology name is changed (and preferably if cross-references to the original are provided). Again, the need is to protect the integrity of the original ontology content so references to a GO term or a UniProt entry are clearly defined. > > This is essential for many of the public bioinformatics databases. Data and software are not the same in this context. I am curious whether documentation licensing raises any issues. > > Just my 2c worth > > Peter Rice > EMBOSS Team Maybe the best solution is to just package any data separately? We have talked about setting up a 'biodata' repository for common datasets from all the Bio* projects. Feel free to skip the rest of this, but: I agree with Peter's point, Uniprot and other databases license data this way for very good (and well-intentioned) reasons. 
For the Bio* languages there are instances where we use such data as a fallback in case a newer version isn't immediately available (REBase and SO come to mind, and I think we have others), so we are likely in the same boat as EMBOSS. I had a long screed here, but I found some original sources for the discussion re: Uniprot and use of Creative Commons licensing that states the reasoning for why this is in place: http://wiki.creativecommons.org/Case_Studies/Uniprot http://eric.jain.name/2006/02/07/uniprot-creative-commons/ http://sciencecommons.org/resources/faq/databases/ http://sciencecommons.org/resources/faq/database-protocol/ Note there is now a 'Database Protocol' (last link) that recommends a different license; that page nicely summarizes the history of the whole Creative Commons licensing affair and the issues of using a Creative Commons license re: databases, mainly due to the issue Peter mentioned above, that databases != software. Uniprot doesn't use this as of yet (so it doesn't solve the problem at hand), but it's possible this may change. chris Christopher Fields Senior Research Scientist National Center for Supercomputing Applications Institute for Genomic Biology University of Illinois Urbana-Champaign 1206 W. Gregory Dr. , MC-195 Urbana, IL 61801 From asjo at koldfront.dk Sat Jul 30 19:34:30 2011 From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=) Date: Sat, 30 Jul 2011 21:34:30 +0200 Subject: [EMBOSS] Files included in EMBOSS but licensed ... References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk> <87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk> <5EF06959-FAA0-4DBB-96EA-013CFFDE2960@illinois.edu> Message-ID: <87d3grwojd.fsf@topper.koldfront.dk> On Sat, 30 Jul 2011 14:01:58 -0500, Chris wrote: > I don't understand the logic behind why data would be considered > software, unless one is using a very fuzzy definition of 'software'.
> Is this strictly a packaging issue, e.g. any data packaged with source > makes it 'software'? Or just the fact that such data is licensed? > Would a package of just data/docs (no code) be allowed? "The DFSG is focused on software, but the word itself is unclear - some apply it to everything that can be expressed as a stream of bits, while a minority considers it to refer to just computer programs. Also, the existence of PostScript, executable scripts, sourced documents, etc, greatly muddies the second definition. Thus, to break the confusion, in June 2004 the Debian project decided to explicitly apply the same principles to software documentation, multimedia data and other content. The non-program content of Debian began to comply with the DFSG more strictly in Debian 4.0 (released in April 2007) and subsequent releases." - http://en.wikipedia.org/wiki/DFSG#Non-.22software.22_content So no. > I agree with Peter's point, Uniprot and other databases license data > this way for very good (and well-intentioned) reasons. Several people have mentioned the existence of these good reasons for not allowing derived works when it comes to science/databases/biology; I wonder what those reasons are? Just curious. [...] > http://sciencecommons.org/resources/faq/database-protocol/ > Note there is now a 'Database Protocol' (last link) that recommends a > different license; that page nicely summarizes the history the whole > Creative Commons licensing affair and the issues of using a Creative > Commons license re: databases, mainly due to the issue Peter mentioned > above, that databases != software. Uniprot doesn't use this as of yet > (so it doesn't solve the problem at hand), but it's possible this may > change. It sounds like Science Commons' Open Access Data Protocol means putting the data in the public domain, which would mean that derived works would very much be allowed? 
This link explains the protocol: * http://sciencecommons.org/projects/publishing/open-access-data-protocol/ Best regards, Adam -- "Good car to drive after a war" Adam Sjøgren asjo at koldfront.dk From cjfields at illinois.edu Sat Jul 30 19:42:19 2011 From: cjfields at illinois.edu (Chris Fields) Date: Sat, 30 Jul 2011 14:42:19 -0500 Subject: [EMBOSS] Files included in EMBOSS but licensed ... In-Reply-To: <87ipqkgfu1.fsf@topper.koldfront.dk> References: <20110728143837.GC30927@merveille.plessy.net> <4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk> <87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk> <87ipqkgfu1.fsf@topper.koldfront.dk> Message-ID: On Jul 30, 2011, at 6:36 AM, Adam Sjøgren wrote: > On Sat, 30 Jul 2011 09:58:07 +0100, Peter wrote: > >> A specific example might help. About 5 years ago a release of the >> UniProt database (as plain text files) broke the Wisconsin (GCG) >> sequence analysis package. > > [...] > > This is the opposite problem of what I tried to sketch. > > Your example has closed source software that can't be fixed, leading to > either preprocessing or changing the database rather than fixing the > real problem. > > If the software had been free, you could just have fixed the software. > > Switch around "software" and "database", and you have the example I was > trying to paint. Yes, if the source were available fixing the parser would have been the best option. But I think you are missing the fundamental point that Peter made (that you left out): the wording of the license allowed them to reformat the file w/o changing the actual content. I'm not sure but I believe many GenPept documents are Uniprot-derived and follow the same concept. Data records and databases are not software, unless you are using some very fuzzy definition of such. >> I expect there are many problems that arise if data ... and >> documentation ... are considered to be software. > > Sure. The whole GFDL debate took quite a while, I think.
>
> But that doesn't change that one of the solutions outlined by Charles
> Plessy is necessary for Debian to distribute EMBOSS (and any other piece
> of free/redistributable software).

You'll also note Charles's distaste for the options mentioned. He was also searching for alternatives.

>>> (I personally think it would make sense to change to a Creative Commons
>>> license that allows derivative works - Uniprot and others are going to
>>> be the canonical source for the data anyway, so nothing will be lost by
>>> them by doing that, as far as I can see.)
>
>> Unlikely. The no-derivatives version is specifically there to prevent
>> derivatives - for example Debian distributing a modified UniProt
>> without permission.
>
> What I was trying to say is that I don't think that that clause gives
> any value to the owners of Uniprot and other databases.
>
> Why would Uniprot want to prevent derivative works? They'll always be
> the canonical source for the correct information.

The links provided in my other response indicate some of the mindset behind this. I think the main point is that the work has to be attributed, and that any changes to such data need the permission of Uniprot, likely so any content changes can be curated and (possibly) propagated to future releases. This also ensures that a set of files from a third party carrying the Uniprot name will not be modified (e.g. all content can be trusted as coming from Uniprot w/o modification). I have seen instances where loosely controlled data (such as annotation from a newly sequenced genome) becomes balkanized to the point that no one can clearly state who is the trusted source (even when the list of sources includes large databases such as NCBI/EBI).

So I understand the reasoning for the license, but I also see Science Commons is recommending something less strict.
> You are free to distribute a modified version of the man-page for ls(1)
> - but if you introduce errors in it or make it worse, nobody will choose
> your derived version.

That's a straw man argument; man page documentation for an app is not the same as a database record based on scientific data. Would you make the same argument (allow free content modification) for a scientific publication? I would, but only for corrections or for new data that support/contradict the original data, and even then it must go through some sort of mediation (an editor for instance), not unlike what a database curator does.

>> The ontologies are similar, but do allow for the use case of importing
>> terms from one ontology into another if the ontology name is changed
>> (and preferably if cross-references to the original are provided).
>
>> Again, the need is to protect the integrity of the original ontology
>> content so references to a GO term or a UniProt entry are clearly
>> defined.
>
> I think the problem that is being protected against is non-existing.
>
> People don't want to break stuff that works, they want to be able to fix
> stuff that doesn't.

Simply opening the licensing up for any content modification doesn't solve the problem in the case of scientific databases; it potentially exacerbates it. Hence the variations in the licensing in the previous links I sent.

By the way, if you think the classic 'vi vs emacs' arguments can get out of control, see what happens when you have competing groups trying to make changes to a sequence record w/o curation.

I do agree that it would be nice for the barrier to database modification to be lowered. Many previous attempts have been made at doing this, such as including third-party annotation, but with the major databases these all seem to fall by the wayside, and the databases fall back to simple curation.
Maybe it's time to come up with a git/hg for biological data, where one could fork records and make changes for submission; at least there one could have a trusted source and easier paths to data modification. Just a thought.

>> This is essential for many of the public bioinformatics databases.
>
> Why? Only a hypothetical derivative would be changed, not the original.
>
> If someone distributed a derivative that was broken, I think people
> would quickly abandon it.

How could one tell the difference if both versions are implied to come from Uniprot (even if one comes from a third/fourth/fifth party)? There is no guarantee beyond going back and comparing the records to the original Uniprot data.

> Again, just my point of view - not representing or speaking for anyone :-)
>
>
> Best regards,
>
> Adam

chris

Christopher Fields
Senior Research Scientist
National Center for Supercomputing Applications
Institute for Genomic Biology
University of Illinois Urbana-Champaign
1206 W. Gregory Dr., MC-195
Urbana, IL 61801

From cjfields at illinois.edu Sat Jul 30 20:14:39 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Sat, 30 Jul 2011 15:14:39 -0500
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
In-Reply-To: <87d3grwojd.fsf@topper.koldfront.dk>
References: <20110728143837.GC30927@merveille.plessy.net>
 <4E326562.1020001@ebi.ac.uk>
 <4E3271D2.2070906@ebi.ac.uk>
 <87sjpoq0zi.fsf@topper.koldfront.dk>
 <4E33C79F.8080402@ebi.ac.uk>
 <5EF06959-FAA0-4DBB-96EA-013CFFDE2960@illinois.edu>
 <87d3grwojd.fsf@topper.koldfront.dk>
Message-ID:

(Charles, not sure you have been following, but any idea on the next steps and whether other packages like bioperl are affected?)

On Jul 30, 2011, at 2:34 PM, Adam Sjøgren wrote:

> On Sat, 30 Jul 2011 14:01:58 -0500, Chris wrote:
>
>> I don't understand the logic behind why data would be considered
>> software, unless one is using a very fuzzy definition of 'software'.
>> Is this strictly a packaging issue, e.g.
>> any data packaged with source
>> makes it 'software'? Or just the fact that such data is licensed?
>> Would a package of just data/docs (no code) be allowed?
>
> "The DFSG is focused on software, but the word itself is unclear -
> some apply it to everything that can be expressed as a stream of
> bits, while a minority considers it to refer to just computer
> programs. Also, the existence of PostScript, executable scripts,
> sourced documents, etc, greatly muddies the second definition. Thus,
> to break the confusion, in June 2004 the Debian project decided to
> explicitly apply the same principles to software documentation,
> multimedia data and other content. The non-program content of Debian
> began to comply with the DFSG more strictly in Debian 4.0 (released
> in April 2007) and subsequent releases."
> - http://en.wikipedia.org/wiki/DFSG#Non-.22software.22_content
>
> So no.

Oh well; we'll leave that up to Debian then. I think Peter and I have stated our concerns, and Charles and I have laid out possible options, so there is no need to draw this out. I would rather find a solution.

>> I agree with Peter's point, Uniprot and other databases license data
>> this way for very good (and well-intentioned) reasons.
>
> Several people have mentioned the existence of these good reasons for
> not allowing derived works when it comes to science/databases/biology; I
> wonder what those reasons are?
>
> Just curious.

Those links I passed on mention some of the primary concerns from both the Science Commons and Uniprot sides. I believe it comes down to an issue of trusting the source of the data and the level of control the database wants (the latter was implied in Eric's blog post).

> [...]
>> http://sciencecommons.org/resources/faq/database-protocol/
>
>> Note there is now a 'Database Protocol' (last link) that recommends a
>> different license; that page nicely summarizes the history of the whole
>> Creative Commons licensing affair and the issues of using a Creative
>> Commons license re: databases, mainly due to the issue Peter mentioned
>> above, that databases != software. Uniprot doesn't use this as of yet
>> (so it doesn't solve the problem at hand), but it's possible this may
>> change.
>
> It sounds like Science Commons' Open Access Data Protocol means putting
> the data in the public domain, which would mean that derived works would
> very much be allowed?

Yes, if one adopts that protocol (Uniprot hasn't). Eric's blog post indicates the CC no-derivatives license was chosen because it gave a level of control that both Uniprot users and curators felt comfortable with but that wasn't overly restrictive. That's also from 2006, so a lot has likely changed since then.

> This link explains the protocol:
>
> * http://sciencecommons.org/projects/publishing/open-access-data-protocol/
>
>
> Best regards,
>
> Adam

There is no mention of derived or modified works there, but the brief mention of derived works on the Database Protocol page indicates that they are possibly allowed, yes. That may be an impediment to adoption by a database, depending on what level of control it would like. I'm curious to see who has adopted it.

chris