From david.breimann at gmail.com  Tue Jul  5 05:33:14 2011
From: david.breimann at gmail.com (David Breimann)
Date: Tue, 5 Jul 2011 12:33:14 +0300
Subject: [EMBOSS] Updating EMBOSS
Message-ID: <CAL64h6467Cp+EzT61R6ucbwaoFtaQRAgRHQs3L-8DET1RQCarw@mail.gmail.com>

Hello,

I had an old installation of EMBOSS on my linux server (EMBOSS 6.0.1
according to embossversion).
I downloaded EMBOSS 6.3.1, unpacked and compiled (following
http://emboss.sourceforge.net/download/#Gettingstarted), hoping this would
overwrite the older version.
However, embossversion still returns 6.0.1.
What should I do?

Thanks,
Dave

From ajb at ebi.ac.uk  Tue Jul  5 06:22:53 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Tue, 5 Jul 2011 11:22:53 +0100 (BST)
Subject: [EMBOSS] Updating EMBOSS
In-Reply-To: <CAL64h6467Cp+EzT61R6ucbwaoFtaQRAgRHQs3L-8DET1RQCarw@mail.gmail.com>
References: <CAL64h6467Cp+EzT61R6ucbwaoFtaQRAgRHQs3L-8DET1RQCarw@mail.gmail.com>
Message-ID: <55403.82.26.12.214.1309861373.squirrel@imap04.ebi.ac.uk>

Hello Dave,

It depends on where/how you installed the different versions.
If you had configured and installed using a prefix which specified
a directory root which was to contain only emboss:

   e.g.  ./configure --prefix=/fu/bar/emboss

then you can just delete the /fu/bar/emboss directory and reinstall.

If, however, you had installed EMBOSS using no prefix (such that it
would be installed under /usr/local) or specified any other shared
or system directory then the best means is usually to reinstall
the old version (see ftp://emboss.open-bio.org/pub/EMBOSS/old/)
on top of itself and then type:

  make uninstall

If it were me I'd then do the same with the new version and have a
nose-around to check that all traces of EMBOSS have been deleted,
then reinstall the new version.

We do recommend, when installing EMBOSS from source, to install it
into its own directory (--prefix=/usr/local/emboss  is a favourite
example in administration documentation).
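
A minimal sketch of the dedicated-prefix approach described above. It only
prints the commands (a dry run) instead of running them. The default PREFIX
value and the final PATH line are assumptions added for illustration: if an
older embossversion is still found first on PATH, the shell will keep
reporting the old version.

```shell
# Dry-run sketch: print the build steps rather than execute them.
# PREFIX is a hypothetical location; substitute your own.
PREFIX="${PREFIX:-/usr/local/emboss}"

emboss_install_plan() {
    # $1 = a directory that will contain only EMBOSS
    printf '%s\n' \
        "./configure --prefix=$1" \
        "make" \
        "make install" \
        "PATH=$1/bin:\$PATH; export PATH"
}

emboss_install_plan "$PREFIX"

# Show which embossversion the shell currently resolves, if any.
command -v embossversion || echo "embossversion is not on PATH"
```

With everything under one prefix, removing an installation later is a
matter of deleting that single directory.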

HTH

Alan


> Hello,
>
> I had an old installation of EMBOSS on my linux server (EMBOSS 6.0.1
> according to embossversion).
> I downloaded EMBOSS 6.3.1, unpacked and compiled (following
> http://emboss.sourceforge.net/download/#Gettingstarted), hoping this will
> overwrite the older version.
> However, embossversion still returns 6.0.1.
> What should I do?
>
> Thanks,
> Dave
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss
>


From wo.granon at gmail.com  Thu Jul  7 06:33:49 2011
From: wo.granon at gmail.com (Wolfgang)
Date: Thu, 7 Jul 2011 12:33:49 +0200
Subject: [EMBOSS] Plasmid drawing
Message-ID: <CAOhDcVx+yZPGE4-zEHb8GeMd5eyLg1GX+V2CrHgWVjrtshSHeA@mail.gmail.com>

Hello,

is there any news on plasmid drawing (features and restriction sites) and
improvements to cirdna, following this message from 2005?
http://www.mail-archive.com/emboss at emboss.open-bio.org/msg00040.html

In our labs this is also a major reason why users do not switch completely to
EMBOSS.

Thanks,
Wolfgang

From pmr at ebi.ac.uk  Thu Jul  7 07:33:05 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 07 Jul 2011 12:33:05 +0100
Subject: [EMBOSS] Plasmid drawing
In-Reply-To: <CAOhDcVx+yZPGE4-zEHb8GeMd5eyLg1GX+V2CrHgWVjrtshSHeA@mail.gmail.com>
References: <CAOhDcVx+yZPGE4-zEHb8GeMd5eyLg1GX+V2CrHgWVjrtshSHeA@mail.gmail.com>
Message-ID: <4E159971.9070509@ebi.ac.uk>

Dear Wolfgang,

> are there any news to plasmid drawing (features and restriction sites) and
> improvement of cirdna, according to this message from 2005?
> http://www.mail-archive.com/emboss at emboss.open-bio.org/msg00040.html
>
> In our labs this is also a big point for users not to switch completely to
> emboss.

Very close to release date next week, so hard to do anything immediately.

However, we did try adding a report format (an output choice for 
restrict and other applications) to create an input file for cirdna or 
lindna.

Results at the time were poor, but I note we have revised both cirdna 
and lindna since.

I will test whether results have improved. One possibility would be to 
re-enable this format so you can test and give us feedback on the new 
release.

regards,

Peter

From hrh at fmi.ch  Thu Jul  7 07:47:13 2011
From: hrh at fmi.ch (Hans-Rudolf Hotz)
Date: Thu, 07 Jul 2011 13:47:13 +0200
Subject: [EMBOSS] Plasmid drawing
In-Reply-To: <4E159971.9070509@ebi.ac.uk>
References: <CAOhDcVx+yZPGE4-zEHb8GeMd5eyLg1GX+V2CrHgWVjrtshSHeA@mail.gmail.com>
	<4E159971.9070509@ebi.ac.uk>
Message-ID: <4E159CC1.9@fmi.ch>

Hi Peter,


We will be happy to help with testing and to give feedback, since we are in
a very similar situation to Wolfgang.


Regards, Hans


On 07/07/2011 01:33 PM, Peter Rice wrote:
> Dear Wolfgang,
>
>> are there any news to plasmid drawing (features and restriction sites)
>> and
>> improvement of cirdna, according to this message from 2005?
>> http://www.mail-archive.com/emboss at emboss.open-bio.org/msg00040.html
>>
>> In our labs this is also a big point for users not to switch
>> completely to
>> emboss.
>
> Very close to release date next week, so hard to do anything immediately.
>
> However, we did try adding a report format (an output choice for
> restrict and other applications) to create an input file for cirdna or
> lindna.
>
> Results at the time were poor, but I note we have revised both cirdna
> and lindna since.
>
> I will test whether results have improved. One possibility would be to
> re-enable this format so you can test and give us feedback on the new
> release.
>
> regards,
>
> Peter

From pmr at ebi.ac.uk  Thu Jul  7 08:07:19 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 07 Jul 2011 13:07:19 +0100
Subject: [EMBOSS] Plasmid drawing
In-Reply-To: <4E159CC1.9@fmi.ch>
References: <CAOhDcVx+yZPGE4-zEHb8GeMd5eyLg1GX+V2CrHgWVjrtshSHeA@mail.gmail.com>
	<4E159971.9070509@ebi.ac.uk> <4E159CC1.9@fmi.ch>
Message-ID: <4E15A177.6040201@ebi.ac.uk>

On 07/07/11 12:47, Hans-Rudolf Hotz wrote:
> Hi Peter,
>
>
> We will be happy to help you testing and give feedback, since we are in
> a very similar situation to Wolfgang.


I'm curious. How many sites on this list (as a rough sample) are still 
running GCG?

And how many are using some other commercial package for functions not 
in EMBOSS?

Could be a very useful guide to the new applications needed.

Peter

From s.newslists at gmail.com  Thu Jul  7 09:54:01 2011
From: s.newslists at gmail.com (Stefan)
Date: Thu, 7 Jul 2011 15:54:01 +0200
Subject: [EMBOSS]  Plasmid drawing
In-Reply-To: <CAECtV7PBqXv07m+VsN3E9yGDEzA6xoWvb9WWsw6PO99oSnROCQ@mail.gmail.com>
References: <CAOhDcVx+yZPGE4-zEHb8GeMd5eyLg1GX+V2CrHgWVjrtshSHeA@mail.gmail.com>
	<4E159971.9070509@ebi.ac.uk> <4E159CC1.9@fmi.ch>
	<4E15A177.6040201@ebi.ac.uk>
	<CAECtV7PBqXv07m+VsN3E9yGDEzA6xoWvb9WWsw6PO99oSnROCQ@mail.gmail.com>
Message-ID: <CAECtV7McouYpKX7tjnoM0+oP5r9_FxKM5s6efcJrKtdEm_qM2g@mail.gmail.com>

Hi Peter,

in our labs people are also sad that they cannot use the EMBOSS
suite for such daily work. We use two different applications:

pDraw32 can draw plasmid maps. A very useful feature is that it can
generate a new plasmid out of two with given restriction enzymes. This
can avoid a lot of little mistakes.

ApE "A plasmid Editor" is very useful for finding features in a plasmid.
Often we get sequences where features such as the antibiotic
resistance are missing. This tool can quickly find them and draw
a nice plasmid map, including its restriction sites.

We would be happy to use the EMBOSS suite for all of this work.

I would also be happy to test.

Best regards,
Stefan

2011/7/7 Peter Rice <pmr at ebi.ac.uk>:
> On 07/07/11 12:47, Hans-Rudolf Hotz wrote:
>>
>> Hi Peter,
>>
>>
>> We will be happy to help you testing and give feedback, since we are in
>> a very similar situation to Wolfgang.
>
>
> I'm curious. How many sites on this list (as a rough sample) are still
> running GCG?
>
> And how many are using some other commercial package for functions not in
> EMBOSS?
>
> Could be a very useful guide to the new applications needed.
>
> Peter

From david.bauer at bayer.com  Thu Jul  7 08:58:38 2011
From: david.bauer at bayer.com (david.bauer at bayer.com)
Date: Thu, 7 Jul 2011 14:58:38 +0200
Subject: [EMBOSS] Antwort: Re:  Plasmid drawing
In-Reply-To: <4E15A177.6040201@ebi.ac.uk>
Message-ID: <OFACDB8C5F.07B28638-ONC12578C6.00468B55-C12578C6.00474971@bayer.de>

We use VectorNTI for plasmid documentation and in-silico cloning.
And as far as I know another widely used software for this purpose is 
'Clone Manager' from 'Sci-Ed Software'. 

David.

emboss-bounces at lists.open-bio.org wrote on 07/07/2011 14:07:19:

> On 07/07/11 12:47, Hans-Rudolf Hotz wrote:
> > Hi Peter,
> >
> >
> > We will be happy to help you testing and give feedback, since we are in
> > a very similar situation to Wolfgang.
> 
> 
> I'm curious. How many sites on this list (as a rough sample) are still 
> running GCG?
> 
> And how many are using some other commercial package for functions not 
> in EMBOSS?
> 
> Could be a very useful guide to the new applications needed.
> 
> Peter

From cjfields at illinois.edu  Thu Jul  7 11:10:35 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Thu, 7 Jul 2011 10:10:35 -0500
Subject: [EMBOSS] Antwort: Re:  Plasmid drawing
In-Reply-To: <OFACDB8C5F.07B28638-ONC12578C6.00468B55-C12578C6.00474971@bayer.de>
References: <OFACDB8C5F.07B28638-ONC12578C6.00468B55-C12578C6.00474971@bayer.de>
Message-ID: <B7FACE16-3F45-4658-A476-842BAD410898@illinois.edu>

I think Geneious and the CLC tools can also draw plasmid maps.  Haven't used them extensively, though.

re: GCG, Accelrys stopped GCG development in June 2008; I haven't seen anyone take up the perpetual license (which allows use of GCG, but with outdated databases, etc).  Seems like everyone is implicitly being directed to EMBOSS.

chris

On Jul 7, 2011, at 7:58 AM, david.bauer at bayer.com wrote:

> We use VectorNTI for plasmid documentation and in-silico cloning.
> And as far as I know another widely used software for this purpose is 
> 'Clone Manager' from 'Sci-Ed Software'. 
> 
> David.
> 
> emboss-bounces at lists.open-bio.org schrieb am 07/07/2011 14:07:19:
> 
>> On 07/07/11 12:47, Hans-Rudolf Hotz wrote:
>>> Hi Peter,
>>> 
>>> 
>>> We will be happy to help you testing and give feedback, since we are in
>>> a very similar situation to Wolfgang.
>> 
>> 
>> I'm curious. How many sites on this list (as a rough sample) are still 
>> running GCG?
>> 
>> And how many are using some other commercial package for functions not 
>> in EMBOSS?
>> 
>> Could be a very useful guide to the new applications needed.
>> 
>> Peter


From kitagawam at takara-bio.co.jp  Fri Jul  8 04:05:48 2011
From: kitagawam at takara-bio.co.jp (kitagawam at takara-bio.co.jp)
Date: Fri, 8 Jul 2011 17:05:48 +0900
Subject: [EMBOSS] Antwort: Re:  Plasmid drawing
In-Reply-To: <B7FACE16-3F45-4658-A476-842BAD410898@illinois.edu>
References: <OFACDB8C5F.07B28638-ONC12578C6.00468B55-C12578C6.00474971@bayer.de>
	<B7FACE16-3F45-4658-A476-842BAD410898@illinois.edu>
Message-ID: <678B3FABACE9F64B8FAF7A1045C3D67D4D943A52EE@tkrexmb1.central.takara.co.jp>

I wish to recommend IMC.
http://www.insilicobiology.jp/en/downloads

] -----Original Message-----
] From: emboss-bounces at lists.open-bio.org
] [mailto:emboss-bounces at lists.open-bio.org] On Behalf Of Chris Fields
] Sent: Friday, July 08, 2011 12:11 AM
] To: david.bauer at bayer.com
] Cc: emboss at lists.open-bio.org; emboss-bounces at lists.open-bio.org
] Subject: Re: [EMBOSS] Antwort: Re: Plasmid drawing
] 
] I think Geneious and the CLC tools can also draw plasmid maps.  Haven't
] used them extensively, though.
] 
] re: GCG, Accelrys stopped GCG development in June 2008; I haven't seen anyone
] take up the perpetual license (which allows use of GCG, but with outdated
] databases, etc).  Seems like everyone is implicitly being directed to
] EMBOSS.
] 
] chris
] 
] On Jul 7, 2011, at 7:58 AM, david.bauer at bayer.com wrote:
] 
] > We use VectorNTI for plasmid documentation and in-silico cloning.
] > And as far as I know another widely used software for this purpose is
] > 'Clone Manager' from 'Sci-Ed Software'.
] >
] > David.
] >
] > emboss-bounces at lists.open-bio.org schrieb am 07/07/2011 14:07:19:
] >
] >> On 07/07/11 12:47, Hans-Rudolf Hotz wrote:
] >>> Hi Peter,
] >>>
] >>>
] >>> We will be happy to help you testing and give feedback, since we are in
] >>> a very similar situation to Wolfgang.
] >>
] >>
] >> I'm curious. How many sites on this list (as a rough sample) are still
] >> running GCG?
] >>
] >> And how many are using some other commercial package for functions not
] >> in EMBOSS?
] >>
] >> Could be a very useful guide to the new applications needed.
] >>
] >> Peter


From friedman at cancercenter.columbia.edu  Wed Jul 13 16:56:29 2011
From: friedman at cancercenter.columbia.edu (Richard Friedman)
Date: Wed, 13 Jul 2011 16:56:29 -0400
Subject: [EMBOSS] getting files in GCG format with annotation
Message-ID: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu>

Dear Emboss list,

	I am learning to use Emboss after being a long-time GCG user.
The fetch command in GCG returns a file with the sequence in GCG format
plus annotation.

In EMBOSS I know how to get just sequence in GCG format with seqret.
In EMBOSS I also know how to get the sequence plus annotation in the default
format.
What I would like to know is how to use EMBOSS to get sequence plus
annotation in GCG format,
as in GCG.

Thanks and best wishes,
Rich
------------------------------------------------------------
Richard A. Friedman, PhD
Associate Research Scientist,
Biomedical Informatics Shared Resource
Herbert Irving Comprehensive Cancer Center (HICCC)
Lecturer,
Department of Biomedical Informatics (DBMI)
Educational Coordinator,
Center for Computational Biology and Bioinformatics (C2B2)/
National Center for Multiscale Analysis of Genomic Networks (MAGNet)
Room 824
Irving Cancer Research Center
Columbia University
1130 St. Nicholas Ave
New York, NY 10032
(212)851-4765 (voice)
friedman at cancercenter.columbia.edu
http://cancercenter.columbia.edu/~friedman/

I am a Bayesian. When I see a multiple-choice question on a test and I  
don't
know the answer I say "eeney-meaney-miney-moe".

Rose Friedman, Age 14

From pmr at ebi.ac.uk  Wed Jul 13 17:37:13 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 13 Jul 2011 22:37:13 +0100
Subject: [EMBOSS] getting files in GCG format with annotation
In-Reply-To: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu>
References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu>
Message-ID: <4E1E1009.101@ebi.ac.uk>

Dear Richard,

On 13/07/2011 21:56, Richard Friedman wrote:
> I am learning to use Emboss after being a long-time GCG user.
> The fetch command in GCG returns a file with the sequence in GCG format
> plus annotation.
>
> In EMBOSS I know how to get just sequence in GCG format with seqret.
> In EMBOSS I also know how to get the sequence plus annotation default
> format.
> What I would like to know is how using EMBOSS to get sequence plus
> annotation in GCG format
> like in GCG.

Hmmm ... by "annotation in GCG format" you mean the EMBL or Uniprot 
entry with gaps in the ". ." feature records?

The obvious question is why you need GCG format. GCG was not very clever 
in handling the annotation.

You can get the sequence plus annotation in one file with:

seqret -feature somedb:someid outfile.seq -osformat embl (or swiss)

That gives you one file with "sequence plus annotation"... and you can 
use the annotation.

You can also get the whole entry text with entret somedb:someid

Hope that helps - and if not, please do ask again!

Peter Rice
EMBOSS Team

From friedman at cancercenter.columbia.edu  Thu Jul 14 12:08:17 2011
From: friedman at cancercenter.columbia.edu (Richard Friedman)
Date: Thu, 14 Jul 2011 12:08:17 -0400
Subject: [EMBOSS] getting files in GCG format with annotation
In-Reply-To: <4E1E1009.101@ebi.ac.uk>
References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu>
	<4E1E1009.101@ebi.ac.uk>
Message-ID: <134BC19D-E1C0-4788-A700-5B212192AD6B@cancercenter.columbia.edu>

Dear Peter and Guy,

	I guess I just cling to the familiar. The output formats given by
EMBOSS are fine.
One more obscure question:

As far as I can see, the output from "seqret -feature" and "entret"  
are the same.
Are there any differences?

Thanks and best wishes,
Rich


On Jul 13, 2011, at 5:37 PM, Peter Rice wrote:

> Dear Richard,
>
> On 13/07/2011 21:56, Richard Friedman wrote:
>> I am learning to use Emboss after being a long-time GCG user.
>> The fetch command in GCG returns a file with the sequence in GCG  
>> format
>> plus annotation.
>>
>> In EMBOSS I know how to get just sequence in GCG format with seqret.
>> In EMBOSS I also know how to get the sequence plus annotation default
>> format.
>> What I would like to know is how using EMBOSS to get sequence plus
>> annotation in GCG format
>> like in GCG.
>
> Hmmm ... by "annotation in GCG format" you mean the EMBL or Uniprot  
> entry with gaps in the ". ." feature records?
>
> The obvious question is why you need GCG format. GCG was not very  
> clever in handling the annotation.
>
> You can get the sequence plus annotation in one file with:
>
> seqret -feature somedb:someid outfile.seq -osformat embl (or swiss)
>
> That gives you one file with "sequence plus annotation"... and you  
> can use the annotation.
>
> You can also get the whole entry text with entret somedb:someid
>
> Hope that helps - and if not, please do ask again!
>
> Peter Rice
> EMBOSS Team
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss


From pmr at ebi.ac.uk  Thu Jul 14 12:14:03 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 14 Jul 2011 17:14:03 +0100
Subject: [EMBOSS] getting files in GCG format with annotation
In-Reply-To: <134BC19D-E1C0-4788-A700-5B212192AD6B@cancercenter.columbia.edu>
References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu>
	<4E1E1009.101@ebi.ac.uk>
	<134BC19D-E1C0-4788-A700-5B212192AD6B@cancercenter.columbia.edu>
Message-ID: <4E1F15CB.5020205@ebi.ac.uk>

On 14/07/2011 17:08, Richard Friedman wrote:
> Dear Peter and Guy,
>
> I guess I just cling to the familiar. The output formats given by emboss
> are fine,
> One more obscure question:
>
> As far as I can see, the output from "seqret -feature" and "entret" are
> the same.
> Are there any differences?

Not necessarily ...

entret reports the exact text of the original input.

seqret -feat with the same format as the input will rewrite everything 
using the output format. If that comes out identical then we are usually 
very happy (we do try to preserve everything in EMBL/GenBank and 
Swissprot formats) but there is no absolute guarantee.

Also, strictly speaking, the output of entret is defined as "text" while 
the output of seqret is defined as "sequence" which leads to some 
distinctions - for example, you cannot choose an alternative output 
format for entret.
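
A minimal sketch of how one might check this for a single entry. The
entry name somedb:someid is a placeholder, and the output format passed
in is assumed to match the format of the source database. The function
does nothing if EMBOSS is not on PATH.

```shell
# Hypothetical entry; replace with a database and ID defined at your site.
ENTRY="${ENTRY:-somedb:someid}"

compare_entry() {
    # $1 = entry, $2 = output format assumed to match the source database
    if ! command -v entret >/dev/null 2>&1 ||
       ! command -v seqret >/dev/null 2>&1; then
        echo "EMBOSS not found on PATH; nothing compared"
        return 0
    fi
    tmp=$(mktemp -d) || return 1
    # entret: original entry text; seqret -feature: entry rewritten by EMBOSS
    entret "$1" -outfile "$tmp/entry.txt" -auto &&
    seqret -feature "$1" -osformat "$2" -outseq "$tmp/entry.seq" -auto &&
    if diff "$tmp/entry.txt" "$tmp/entry.seq" >/dev/null; then
        echo "identical"
    else
        echo "differs"
    fi
    rm -rf "$tmp"
}

compare_entry "$ENTRY" embl
```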

Have fun with EMBOSS. Look out for the new release tomorrow!

regards,

Peter


From gbottu at vub.ac.be  Thu Jul 14 13:56:14 2011
From: gbottu at vub.ac.be (Guy Bottu)
Date: Thu, 14 Jul 2011 19:56:14 +0200
Subject: [EMBOSS] getting files in GCG format with annotation
In-Reply-To: <4E1E1009.101@ebi.ac.uk>
References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu>
	<4E1E1009.101@ebi.ac.uk>
Message-ID: <4E1F2DBE.70700@vub.ac.be>

	Dear Richard,

I agree with Peter that it is not obvious what the GCG simple sequence 
format is still useful for. To give the sequence as input to other 
software you can use seqret with whatever sequence format you need; to 
just read the annotation you can use entret; and to give the features as 
input to other software you can use seqret with the -feature parameter 
(GCG used its RSF format for this, but it did not become popular outside 
GCG/SeqLab). I can add that a widely used format for features is GFF, 
and you can do:

seqret -feature somedb:someid outfile.seq -osformat gff -oufo somegfffile

You will obtain a file somegfffile in GFF format (with just the 
features, not the sequence). There is a lot of software that can use it.

	Regards,
	Guy Bottu,
	U.L.B.

From ajb at ebi.ac.uk  Fri Jul 15 04:54:26 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Fri, 15 Jul 2011 09:54:26 +0100 (BST)
Subject: [EMBOSS] EMBOSS 6.4.0 released
Message-ID: <53026.82.26.12.214.1310720066.squirrel@imap04.ebi.ac.uk>

EMBOSS Release 6.4.0

This release is now available on our OBF ftp server.

UNIX version:
   ftp://emboss.open-bio.org/pub/EMBOSS/

mEMBOSS (MS Windows version):
   ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.0-setup.exe

It includes major extensions to the type and number of data resources
available to EMBOSS users.

In addition, three books are published by Cambridge University Press:

EMBOSS User's Guide: Practical Bioinformatics
http://www.cambridge.org/gb/knowledge/isbn/item5979294/?site_locale=en_GB

EMBOSS Developer's Guide: Bioinformatics Programming
http://www.cambridge.org/gb/knowledge/isbn/item5979293/?site_locale=en_GB

EMBOSS Administrator's Guide: Bioinformatics Software Management
http://www.cambridge.org/gb/knowledge/isbn/item5979238/?site_locale=en_GB

They are comprehensive and definitive guides to administering,
developing and using EMBOSS. We hope they will prove useful to the
EMBOSS community and to anyone providing training courses covering
EMBOSS.

In addition to these publications we have a new website.

http://emboss.open-bio.org

Updates for the new features in 6.4.0 will be made available soon on
the new EMBOSS website, with tutorials to be developed on the EBI
e-Learning Portal.

Contents:

1.0 New in 6.4.0
1.1 Server definitions
1.2 Access methods
1.3 emboss.standard file
1.4 New data types
1.5 New query language
1.6 Hash tables and lists
1.7 Cross-references
1.8 URL generation
1.9 Database index compression
1.10 Database indexing applications
1.11 Generating server cache files
1.12 Server and database attributes
1.13 HTTP redirection
1.14 EMBOSS version number
1.15 ACD list 'select all'
2.0 EDAM Ontology
2.1 EDAM in ACD files
2.2 EDAM applications
3.0 DRCAT Data Resource Catalogue
4.0 NCBI Taxonomy
5.0 Maintenance
6.0 Installation Notes
6.1 UNIX
6.1.1 MySQL
6.1.2 PostgreSQL
6.1.3 axis2c
6.1.4 Other optional library software
6.1.5 eprimer3 and eprimer32
6.2 mEMBOSS
7.0 New EMBASSY applications
8.0 Future

1.0 New in 6.4.0

1.1 Server definitions

Servers can be defined, in a similar style to a database definition,
but covering all databases available from a single server. The server
definition names a cache file describing each database, its format
and its query fields. Cache files for a core set of public servers are
included in the release.

1.2 Access methods

New access methods are provided, including Ensembl, BioMart, DAS, SOAP
web services (EBI wsdbfetch and ebeye), REST web services (EBI
dbfetch), and GMOD/CHADO. Ensembl access uses code contributed by
Michael Schuster in the Ensembl team at EBI. This code is updated
after each Ensembl API release. Some of these access methods were
available but only partly implemented in the previous release. They
now support standard server and database definitions and are open for
further development.

Data access methods have been restructured to use "text" access for
any method which seeks a position in a file and then opens it for
reading. This includes reading from a URL and returning a pointer to
the start of the output. A few datatype-specific access methods remain,
for example reading sequence data from a PIR/NBRF/GCG format database,
or from the NCBI taxonomy files, or access to database systems via SQL
or DAS.

1.3 emboss.standard file

Previous releases depended on a user defining databases in their
emboss.default file. Release 6.4.0 provides a new emboss.standard
file defining the core servers and databases, and standard resource
settings for database indexing. The local emboss.default file is only
needed for local database definitions and settings.

The configuration files emboss.standard, emboss.default and
~/.embossrc resolve variable references (e.g. in directory names)
during parsing. Extensions to the syntax of these files include ALIAS
to give secondary names to a database. IF, IFDEF, ELSE and ENDIF
directives allow conditional inclusion of sections of the file
dependent on variable settings. Special variables EMBOSS_AXIS2,
EMBOSS_MYSQL, EMBOSS_POSTGRESQL and EMBOSS_SQL are automatically
created for this purpose.

New variable EMBOSS_STANDARD is automatically defined to be the
share/EMBOSS install directory (or the emboss source code directory if
the package is not installed). This is by default where the
emboss.standard files and server cache files are expected to be
found. The value is reported by "embossversion -full".

1.4 New data types

New data types are available as inputs and outputs of
applications. Each has a simple definition including qualifiers
-iformat for input format and -oformat for output format. The maxreads
attribute defines whether the application expects to read a single
entry (maxreads: 1) or loop over multiple entries (the default). This
is simpler than the sequence and seqall definitions for sequence which
are widely used and will remain unchanged.

* text and outtext: the text of an entry for which EMBOSS has (to
   date) no specialised parser

* obo and oboout: terms in an OBO ontology. Six ontologies are
   included in the release as source and index files (EDAM, GO, SO, RO,
   PW, ECO). We plan to add more and welcome suggestions for inclusion.

* resource and resourceout: entries in the Data Resource Catalogue

* taxon and outtaxon: nodes in the NCBI taxonomy which is indexed and
   included in the release

* url and outurl: a database name from the Data Resource Catalogue, and
   an identifier, converted into a URL which can be pasted into a browser
   to cover cases where the URL does not return simple text or HTML data.

* for future extension, assembly and variation datatypes are defined
   for development and use in a later release.

1.5 New query language

All data types use a common query language. The existing "USA"
(uniform sequence address) syntax is still valid for sequence data,
but is also now used for features, obo terms, data resources, taxons
and plain text data.

In response to comments from our Scientific Advisory Board, we have
extended the query language to cover multiple identifiers, multiple
fields, and operators to combine elements of the query.

* id lists: dbname:{ida,idb,idc} searches for 3 identifiers (id,
   accession, etc.)  in a database

* or operator: dbname-{id:h* | des:hemoglobin} searches for all
   entries with identifiers starting with 'h' plus any others that
   include the word 'hemoglobin' in their descriptions.

* not operator: dbname-{id:h* ! des:hemoglobin} searches for all
   entries with identifiers starting with 'h' that do not include the
   word 'hemoglobin' in their descriptions.

* and operator: dbname-{id:h* & des:hemoglobin} searches for all
   entries with identifiers starting with 'h' that also include the
   word 'hemoglobin' in their descriptions.

* eor operator: dbname-{id:h* ^ des:hemoglobin} searches for all
   entries with identifiers starting with 'h' that do not include the
   word 'hemoglobin' in their descriptions, and all those starting with
   another character that do include the word 'hemoglobin' in their
   description. This is the opposite of the and (&) operator.
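
The four operators behave like set operations on the hit lists of the two
sub-queries. A minimal sketch with standard shell tools, not EMBOSS code:
the identifiers are made up, and the two files stand in for the results
of id:h* and des:hemoglobin.

```shell
LC_ALL=C; export LC_ALL   # keep sort and comm ordering consistent

# Toy hit lists standing in for the two sub-queries (made-up identifiers).
tmp=$(mktemp -d)
printf '%s\n' hba_human hbb_human hxk1_human | sort > "$tmp/id_h"
printf '%s\n' hba_human hbb_human lgb2_luplu | sort > "$tmp/des_hemo"

q_or()  { sort -u "$1" "$2"; }               # |  in either list
q_and() { comm -12 "$1" "$2"; }              # &  in both lists
q_not() { comm -23 "$1" "$2"; }              # !  in the first, not the second
q_eor() { comm -3 "$1" "$2" | tr -d '\t'; }  # ^  in exactly one list

for op in or and not eor; do
    echo "$op: $(q_$op "$tmp/id_h" "$tmp/des_hemo" | tr '\n' ' ')"
done
```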

Query operators are not supported by all access methods. Where an
operator is invalid an error message gives the list of valid
operators. For example, the query syntax for SRS (srs, srswww access)
does not include the exclusive-or (^) operator but supports the
others as these are standard elements in SRS queries.

The query language only allows a single database name in the
query. This allows EMBOSS to combine query results for a single query
expression. To query multiple databases a list file input with one
database query on each line can be used.
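
A minimal sketch of such a list file. The database names dbone and dbtwo
are hypothetical, and the seqret call is skipped when EMBOSS is not
installed.

```shell
# One query per line; dbone and dbtwo are hypothetical database names.
cat > queries.list <<'EOF'
dbone-{id:h* & des:hemoglobin}
dbtwo-{id:h* & des:hemoglobin}
EOF

# A leading '@' tells EMBOSS to read the queries from a list file.
if command -v seqret >/dev/null 2>&1; then
    seqret @queries.list -outseq hits.fasta -auto || echo "query failed"
else
    echo "seqret not found; wrote queries.list only"
fi
```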

Indexed strings containing non-alphabetic characters including white
space are simplified by converting a run of such characters to a
single underscore. The same transformation is applied to a query
string for the dbx (emboss) access method. This is especially useful
for brackets and other characters in data resource names in DRCAT.
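
A rough illustration of this transformation using sed, not the EMBOSS
implementation. It assumes that digits are kept and that leading or
trailing runs are also replaced; neither detail is stated above.

```shell
simplify_key() {
    # Replace each run of characters other than letters and digits
    # with a single underscore.
    printf '%s\n' "$1" | sed 's/[^[:alnum:]][^[:alnum:]]*/_/g'
}

simplify_key 'Protein Data Bank (PDB)'
```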

We hope that the extended query language and the index file
compression will increase the use of locally indexed data in EMBOSS
installations, and welcome feedback on further developments of the
query language and indexing.

1.6 Hash tables and lists

The new query language is supported by extensions to tables and lists
in the libraries. Tables can now be automatically resized. Merge
operations on two tables combine their contents using the same
operations (or, and, not, eor) as the query language. By resizing the
tables first this operation can be made highly efficient. Destructors
can be defined for list data and for table keys and data to
automatically clean up after use. Tables with string keys can use C
char* or string object queries in all cases.

Lists and tables can now be reference counted, avoiding unnecessary
copying especially in the Ensembl API code.

1.7 Cross-references

Cross-references from UniProt/SwissProt and EMBL/GenBank/DDBJ are
collected by extended parsers. New application seqxref reports the
cross-references. New application seqxrefget creates a script to
retrieve cross-referenced data as the original entries, using entret
for sequence data, feattext for feature data, ontotext for ontology
terms, textget for text and urlget for data where "HTML" is the only
available format.

1.8 URL generation

New application urlget returns a query URL from DRCAT with one or more
identifiers. Where data is from a UniProt/SwissProt or
EMBL/GenBank/DDBJ entry the DRCAT entry definition of the original
cross-reference is used to select from several possible identifier
terms in EDAM in order to choose the correct query.

1.9 Database index compression

Indexes created by dbxflat or dbxfasta are now, by default, compressed
automatically. These files, especially for secondary text indexes such
as description, taxonomy or keyword, could be very sparse. Up to 95%
space savings were achieved in some cases. The indexes are still
updatable by code which uncompresses, updates, and recompresses
on-the-fly using a copy of the index.

1.10 Database indexing applications

New indexing applications dbxedam (EDAM), dbxresource (DRCAT), dbxtax
(NCBI taxonomy) and dbxobo (any OBO ontology) are added for the new
data resources provided as standard. Users can install new releases of
the source data and run these applications to update the index files.

Application dbxflat can now index fastq format. This was included in
6.3.1 as a special addition for one user to test and is now fully
supported.

New applications dbxreport and dbxstat report on the overall and
detailed content of dbx database indexes.

In database indexing applications, the default "resource" name is one
included in the emboss.standard file. Users can continue to define
their own resource files. Indexing "resource" definitions can now
specify the maximum length of any field, and the page size and cache
size for any field, using attributes with the field name as a prefix.

1.11 Generating server cache files

New applications for major access methods query a server (for example,
the DAS registry or Ensembl) to update the server cache file with a
current set of database definitions. When run by the system
administrator these can update the site-wide cache file, but they can
also be run by an individual user to create a user-specific set of
databases. The cache files are time stamped. EMBOSS uses the most
recent system or user file.

1.12 Server and database attributes

New applications showserver and servertell describe all servers or the
attributes of a single named server. We expect to extend these
applications once we have feedback on the most useful information they
should report. New application dbtell similarly reports on the
attributes of a single named database.

Database (and server) definitions can use an attribute more than once
if it is defined as "multiple". These include a new "field:" attribute
which gives the name and description of a query field. A list of
"field:" attributes supersedes the old "fields:" attribute which listed
all query field names but allowed no further annotation.

Database field names are extended from the original fixed set of "SRS
sequence" fields to any name. "id" and "acc" are assumed to be the
names of identifier and accession fields. The "hasaccession" attribute
is set automatically for databases where no "acc" field is found,
avoiding some error messages where the attribute has been omitted.

1.13 HTTP redirection

Data retrieval using HTTP now checks the returned header for redirects
and automatically replaces the results with the output from the
redirected URL. Where redirected URLs were found in standard database
definitions (e.g. the EBI's dbfetch service) these have been replaced
by the current URL. We have also seen redirects from case-sensitive
servers which redirect a lower case accession number to one in upper
case in the same URL.
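
The redirect handling described above can be sketched as follows.
This is a Python illustration under stated assumptions, not the EMBOSS
implementation: the fetch callable (returning status, headers and body)
and the function name are hypothetical, and the hop limit is an
arbitrary choice.

```python
from urllib.parse import urljoin

# HTTP status codes that indicate a redirect.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_following_redirects(fetch, url, max_hops=5):
    """Return (final_url, body), following Location headers.

    `fetch` is any callable taking a URL and returning a
    (status, headers, body) tuple; it is a stand-in for a real
    HTTP client so the logic can be shown without network access.
    """
    for _ in range(max_hops + 1):
        status, headers, body = fetch(url)
        location = headers.get("Location")
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        return url, body
    raise RuntimeError("too many redirects")
```

A case-sensitive server that redirects a lower case accession to the
upper case form would simply appear here as one extra hop.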

1.14 EMBOSS version number

The EMBOSS version number now has 4 digits (6.4.0.0). The fourth digit
is only there so that the Windows port (mEMBOSS) shows the same
version number for QA testing. In mEMBOSS the final digit is the build
number. QA tests for mEMBOSS now use the same test definition and
qatest script as on Linux. mEMBOSS file handling and reporting has
been adapted to support POSIX and Windows style paths.
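
For scripts that compare version strings, a minimal Python sketch of
how the four-part number might be handled is given below. The helper
names are hypothetical, and treating a three-part version as having a
final digit of 0 is an assumption made for the example.

```python
def parse_version(text):
    """Split a dotted version such as '6.4.0.0' into a 4-tuple of ints."""
    parts = tuple(int(p) for p in text.split("."))
    if len(parts) == 3:
        # Older releases had three parts; pad with a zero build number.
        parts += (0,)
    if len(parts) != 4:
        raise ValueError(f"unexpected version string: {text}")
    return parts


def same_release(a, b):
    """True if two versions differ at most in the final (build) digit."""
    return parse_version(a)[:3] == parse_version(b)[:3]
```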

1.15 ACD list 'select all'

In ACD files, a list or selection definition can default to "*" for
"select all" if the "minimum" attribute allows all terms to be
selected.

2.0 EDAM Ontology

EDAM is a new ontology from the EMBRACE project now further developed
by Jon Ison in the EMBOSS team. EDAM describes terms for topics (for
applications and data), operations (algorithms), formats, identifiers
and data (semantic descriptions of data content). EDAM terms are used
throughout this release: to annotate all ACD files at the application,
input, parameter and output levels; to annotate data resources and
their web queries in the Data Resource Catalogue; and to annotate
database and server definitions.

2.1 EDAM in ACD files

ACD files are annotated extensively with EDAM terms using the term id
and the human-readable name. The EMBOSS application groups have been
extended to match the EDAM topic annotations, with some applications
moving to different or new groups. EDAM has been used to validate
these groups by comparing the topics hierarchy with the group
designations.

2.2 EDAM applications

EDAM can be queried within any specific namespace by new applications
edamname and edamdef.

EDAM and other ontologies are supported by new applications (ontoget,
ontotext, ontodown, ontoup, ontogetsibs, ontogetcommon, ontogetroot,
ontogetobsolete, ontoisobsolete, ontocount).

New applications search EDAM term names and definitions, retrieve all
matching terms and their descendants, and compare to: applications
(wosstopic, wossoperation, wossinput, wossoutput, wossdata); data
resources (drfindresource, drfindid, drfindformat, drfinddata); and
related EDAM terms (edamhasinput, edamhasoutput, edamisid,
edamisformat, edamissource).

3.0 DRCAT Data Resource Catalogue

DRCAT, the Data Resource Catalogue, is included in this release. DRCAT
started as a description of databases found as cross-references in
UniProt/SwissProt, extended by adding databases found as
cross-references in EMBL/GenBank/DDBJ, plus others from Nucleic Acids
Research, ELIXIR, and other sources. Any database in DRCAT can be used
by name from an EMBOSS application, returning sequence, feature, or
text if a suitable data format is defined for any query, or creating a
URL which can be pasted into a browser where the results are, for
example, a graphical display using javascript which EMBOSS cannot
interpret. We aim to further extend and improve DRCAT in future
releases.

4.0 NCBI Taxonomy

Taxonomy data from the NCBI taxonomy is included as standard in the
release. New applications retrieve single nodes and their ancestors
and descendants (taxget, taxgetup, taxgetdown, taxgetspecies,
taxgetrank).


5.0 Maintenance

Application digest has been renamed pepdigest to avoid a clash with
another utility. The name is also in keeping with the EMBOSS naming of
other protein analysis applications.

Sequence and features formats have been reviewed and updated,
especially GFF3, GenPept, SAM, BAM and treecon. GFF3 output now more closely
follows the official standard, including the escaping of special
characters in the tag/value final column. GFF3 ID and Parent tags are
supported.

Features with exons are now stored as a list of exon subfeatures.
This change allows easier sorting of features by location, keeping
groups of features together, and has simplified the generation of
several feature output formats.

Graphical output for more than one input sequence has been corrected
and enhanced.

The lindna application has been adjusted to correctly relocate
overlapping text and to generate a clean sequence ruler for any range
of positions. New report formats allow reported hits (-rformat draw)
and restriction sites (-rformat restrict) to be plotted by lindna. We
expect to work further on the views that these outputs generate.

The einverted application had a bug (also in the original version)
when an inverted repeat maximum score was close to the edge of the
search window. This was seen only at low threshold scores. Searches
with low threshold scores can be expected to yield slightly different
choices of hits.

In ACD files, the "gui" and "batch" application attributes are assumed
to be "true" if missing. Previous releases defined them as "false"
internally, but fortunately no parsers seem to have used the internal
default value.

Database indexes created by the dbx programs now include a count of
unique and total keys. The text index files also report the type as
"Identifier" or "Secondary" and whether the index is compressed.

EMBOSS configuration now uses autoheader and has less dependency on
the version of libtool.

6.0 Installation notes

6.1 UNIX

The size of the EMBOSS package has shot up by approximately 60MB
compared with the last major release. This is largely due to
pre-supplied data and index files for ontology/taxonomy/etc.  A
typical installation size (shared images) is approximately 360MB.

Though not a requirement of EMBOSS there are some associated
packages which may be installed prior to configuration that
will allow you to use some optional access methods.

6.1.1 MySQL

This is used, for example, by the Ensembl access code. It will be
automatically configured if the (MySQL-supplied) 'mysql_config'
application is found in the PATH and if the associated development
files (compiler headers etc) are also installed. As an example, for
Linux systems, both things will be done by installing the mysql-devel
(RPM distributions) or mysql-dev (Debian-based distributions). If your
MySQL installation is in some arbitrary location then you can specify
it using the --with-mysql= compilation switch.

6.1.2 PostgreSQL

This is used by some servers (e.g. flybase/genedb). Similar
considerations apply to those described for MySQL above:
auto-detection is based on the presence of 'pg_config' in the PATH,
the dev[el] files must be installed, and the --with-postgresql
configuration switch can be used for arbitrary locations.

6.1.3 axis2c

EMBOSS optionally uses the 1.6.0 release of Axis2C for
retrieval from SOAP servers:

  http://axis.apache.org/axis2/c/core/

There is a Linux binary distribution but, even so, Linux
users may find themselves having to install from
source (and may need to run 'autoreconf -fi' prior to
configuration to fix a subsequent compilation error on some
systems).

Auto-detection (by EMBOSS) of this package is based on the
presence of a pkgconfig file that axis2c installs. It is
advised that you install pkgconfig if not already installed
(it usually is pre-installed on Linux systems). EMBOSS has a
--with-axis2c= configure switch if you install axis2c into
a location other than /usr or /usr/local (typically).

6.1.4 Other optional library software

Installation of libraries for PNG (libpng/libgd) and PDF (libhpdf
aka libharu) follows the considerations given in previous releases and
should be familiar to EMBOSS administrators by now.

6.1.5 eprimer3 and eprimer32

The Primer3 authors have released a 2.x.x version which differs
significantly from the 1.x.x series. Unfortunately the executable is
called the same for both releases (primer3_core).  EMBOSS 6.4.0
provides two wrappers for these releases; eprimer3 is for the 1.x.x
version and requires the primer3 executable to be called
'primer3_core' (this has always been the case); eprimer32 is for the
2.x.x version and requires the primer3 executable to be called
primer32_core.

This may involve some minor symlinking and/or directory/PATH
reorganisation by administrators.


6.2 mEMBOSS

A typical installation executable is approximately 70MB and results
in an installation size of approximately 570MB.

MySQL, PostgreSQL, Axis2c, libhpdf (etc) come pre-supplied as part of
the mEMBOSS installation.

The QA test suite has been extended to automatically find and test
both developer and end-user installations of mEMBOSS.

Note that, with the new server definitions in place (described above),
the old SRS database definitions have been removed. You can now access
databases using (e.g.) 'dbfetch:uniprotkb:opsd_human' as an ID. Such
retrieval is much faster than the previously supplied SRS definitions.

7.0 New EMBASSY applications:

We have provided a wrapper package for the recently released clustal
omega software which must, of course, also be installed.  We will add
new releases of MIRA and VIENNA at a later date, when the new versions
of the original packages are released and integrated.

8.0 Future development

EMBOSS is fully funded until the end of December. We have an ambitious
schedule of further developments planned for this period. There will
be a further release of EMBOSS at the end of the year.

We welcome any and all suggestions from our user and developer
communities for immediate needs and future directions.

At the end of this year the EMBOSS team will be leaving EBI. Peter
Rice's maximum 9 year tenure is coming to an end. We do not yet know
where we will be from January and are open to suggestions for ways to
host and/or to fund further EMBOSS development and for potentially
useful partnerships and collaborations to continue the advances we
have made.

We can most certainly guarantee that we will continue to maintain the
existing code base and the latest releases.


Alan


From rothenbuhler at xoma.com  Mon Jul 25 19:42:28 2011
From: rothenbuhler at xoma.com (Jake Rothenbuhler)
Date: Mon, 25 Jul 2011 16:42:28 -0700
Subject: [EMBOSS] Algorithms for pI and molecular weight in pepstats
Message-ID: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com>

Hello,

 
What are the algorithms used to compute the molecular weight and
isoelectric point in pepstats? We are currently using pepstats to
measure these properties in our in-house bioinformatics tools and some
users are concerned because the results can differ from those returned
by ExPASy.

 
Thanks in advance,

 
Jake Rothenbuhler

Bioinformatics Programmer/Analyst

XOMA (US) LLC

(510) 204-7452

 
-- 
The information contained in this email message may 
contain confidential or legally privileged information and is intended solely 
for the use of the named recipient(s).  No confidentiality or privilege is 
waived or lost by any transmission error. If the reader of this message is 
not the intended recipient, please immediately delete the e-mail and all 
copies of it from your system, destroy any hard copies of it and notify the 
sender either by telephone or return e-mail.  Any direct or indirect use, 
disclosure, distribution, printing, or copying of any part of this message is 
prohibited.  Any views expressed in this message are those of the individual 
sender, except where the message states otherwise and the sender is 
authorized to state them to be the views of XOMA.


From pmr at ebi.ac.uk  Tue Jul 26 03:28:14 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 26 Jul 2011 08:28:14 +0100
Subject: [EMBOSS] Algorithms for pI and molecular weight in pepstats
In-Reply-To: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com>
References: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com>
Message-ID: <4E2E6C8E.5040502@ebi.ac.uk>

On 26/07/2011 00:42, Jake Rothenbuhler wrote:

> What are the algorithms used to compute the molecular weight and
> isoelectric point in pepstats? We are currently using pepstats to
> measure these properties in our in-house bioinformatics tools and some
> users are concerned because the results can differ from those returned
> by ExPASy.

There was discussion on this last year on this list too.

There is no single correct answer. Molecular weights can use the average 
value for each amino acid to calculate the molecular weight of a 
protein, or monoisotopic values to calculate peptide masses for 
mass-spec data. Pepstats has a command line option -mono to use the 
monoisotopic weights. We use amino acid molecular weights from ExPASy 
findmod in the calculations.
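
To make the average/monoisotopic distinction concrete, here is a small
Python sketch (not the pepstats code). The function name is
hypothetical, only three residues are included, and the residue masses
are commonly published values quoted for illustration; they should be
checked against an authoritative table before any real use.

```python
# Residue masses in daltons (amino acid minus one water), illustration only.
AVERAGE_MASS = {"G": 57.0519, "A": 71.0788, "S": 87.0782}
MONOISOTOPIC_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203}

# One water is added back for the free N- and C-termini.
WATER_AVERAGE = 18.01528
WATER_MONOISOTOPIC = 18.01056


def peptide_mass(sequence, mono=False):
    """Sum residue masses plus one water, using average or monoisotopic values."""
    table = MONOISOTOPIC_MASS if mono else AVERAGE_MASS
    water = WATER_MONOISOTOPIC if mono else WATER_AVERAGE
    return sum(table[residue] for residue in sequence.upper()) + water
```

The two modes give slightly different totals for the same sequence,
which is one source of disagreement between tools that do not state
which set of masses they use.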

The isoelectric point can be calculated for various conditions. When I 
checked last, ExPASy's protparam was set up for the isoelectric focusing phase 
of 2D gels under high urea conditions. It was unclear at the time where 
to find all the values needed to reproduce their calculation.
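
A sketch of why the pK set matters: the Python below estimates a pI by
bisection on the net charge from the Henderson-Hasselbalch relation.
It is an illustration only, not the pepstats or ExPASy method; the
function names are hypothetical and the pK values are placeholders
that are not claimed to match either tool.

```python
# Illustrative pK values only; not taken from pepstats or ExPASy.
POSITIVE_PK = {"nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PK = {"cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(sequence, ph, positive_pk=POSITIVE_PK, negative_pk=NEGATIVE_PK):
    """Net charge of a peptide at a given pH."""
    groups = ["nterm", "cterm"] + list(sequence.upper())
    charge = 0.0
    for group in groups:
        if group in positive_pk:
            # Fraction of the basic group that is protonated (+1).
            charge += 1.0 / (1.0 + 10 ** (ph - positive_pk[group]))
        elif group in negative_pk:
            # Fraction of the acidic group that is deprotonated (-1).
            charge -= 1.0 / (1.0 + 10 ** (negative_pk[group] - ph))
    return charge


def isoelectric_point(sequence, positive_pk=POSITIVE_PK, negative_pk=NEGATIVE_PK):
    """Find the pH where the net charge crosses zero, by bisection."""
    low, high = 0.0, 14.0
    while high - low > 1e-4:
        middle = (low + high) / 2.0
        if net_charge(sequence, middle, positive_pk, negative_pk) > 0:
            low = middle   # still positive: the pI lies at higher pH
        else:
            high = middle  # negative or zero: the pI lies at lower pH
    return (low + high) / 2.0
```

Passing a different positive_pk or negative_pk table changes the
result for the same sequence, which is the kind of difference users
see when comparing tools.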

We would like to update EMBOSS's protein property calculations, possibly 
with additional options or alternative parameter sets.

Any suggestions from anyone on the list?

regards,

Peter Rice
EMBOSS Team


From ajb at ebi.ac.uk  Tue Jul 26 11:24:35 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Tue, 26 Jul 2011 16:24:35 +0100 (BST)
Subject: [EMBOSS] mEMBOSS 6.4.0.1 available
Message-ID: <53274.82.26.12.214.1311693875.squirrel@imap04.ebi.ac.uk>

This is a bugfix release for the MS Windows version of EMBOSS,
primarily to fix a problem printing very long ('long long') integers.
Though most users would be unlikely to hit this problem an
uninstall/reinstall is nevertheless recommended.

The release also contains a few minor bugfixes, notably making visible
some potentially hidden SOAP server definitions.

It is available from the usual place:

 ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.1-setup.exe

Alan


From Narayana.Upadhyaya at csiro.au  Wed Jul 27 05:15:09 2011
From: Narayana.Upadhyaya at csiro.au (Narayana.Upadhyaya at csiro.au)
Date: Wed, 27 Jul 2011 19:15:09 +1000
Subject: [EMBOSS] getorf output discrepancy
Message-ID: <F9512FDAD950114680F34532768636025CEDFA2AEB@exvic-mbx04.nexus.csiro.au>

Hi,

I am trying to get both NT and AA sequence outputs from a file with ~20,000 transcript models using getorf with the following commands:

getorf -minsize 200 -reverse Y  myfile.fa -find 3
getorf -minsize 200 -reverse Y myfile.fa -find 1

I get the outputs all right, but I was expecting the same number of sequences in both (with identical names in the header). It looks like 60-odd sequences that are present in the AA output are missing from the NT output.

Can anyone explain this discrepancy?  I tried putting the minsize option as "201" for both but the problem persists.

Regards,

Narayana


From Narayana.Upadhyaya at csiro.au  Wed Jul 27 05:30:09 2011
From: Narayana.Upadhyaya at csiro.au (Narayana.Upadhyaya at csiro.au)
Date: Wed, 27 Jul 2011 19:30:09 +1000
Subject: [EMBOSS] getorf output discrepancy
Message-ID: <F9512FDAD950114680F34532768636025CEDFA2AEC@exvic-mbx04.nexus.csiro.au>

Hi
I figured out the problem. The ORFs missing from the NT output are the ones that are just 198 nt long. When I set minsize to 198 for the NT output I don't miss anything.

Sorry for bothering.

Narayana


From friedman at cancercenter.columbia.edu  Wed Jul 27 12:31:01 2011
From: friedman at cancercenter.columbia.edu (Richard Friedman)
Date: Wed, 27 Jul 2011 12:31:01 -0400
Subject: [EMBOSS] dotplots taking similarity into account
Message-ID: <B446D8AB-20BE-4E8B-B488-47EDB8CC5BE0@cancercenter.columbia.edu>

Dear Emboss list,

Is there a way to get dotplots that take similarity according to a
similarity matrix, rather than strict identity, into account? As far
as I can see, dottup is based on identity. Is there a way that we can
get dotplots based on a similarity matrix, similar to dotplot in GCG?
I know that it may be tiresome that I use GCG as a standard, but it is
what I know and it is serving as a point of departure while I am
learning Emboss and redoing the GCG portion of my course in Emboss. I
am enjoying learning about the ways in which Emboss offers improved
functionality in the process as well.

Thanks and best wishes,
Rich
------------------------------------------------------------
Richard A. Friedman, PhD
Associate Research Scientist,
Biomedical Informatics Shared Resource
Herbert Irving Comprehensive Cancer Center (HICCC)
Lecturer,
Department of Biomedical Informatics (DBMI)
Educational Coordinator,
Center for Computational Biology and Bioinformatics (C2B2)/
National Center for Multiscale Analysis of Genomic Networks (MAGNet)
Room 824
Irving Cancer Research Center
Columbia University
1130 St. Nicholas Ave
New York, NY 10032
(212)851-4765 (voice)
friedman at cancercenter.columbia.edu
http://cancercenter.columbia.edu/~friedman/

I am a Bayesian. When I see a multiple-choice question on a test and I
don't know the answer I say "eeney-meaney-miney-moe".

Rose Friedman, Age 14


From s.newslists at gmail.com  Wed Jul 27 14:14:03 2011
From: s.newslists at gmail.com (Stefan)
Date: Wed, 27 Jul 2011 20:14:03 +0200
Subject: [EMBOSS] dotplots taking similarity into account
In-Reply-To: <B446D8AB-20BE-4E8B-B488-47EDB8CC5BE0@cancercenter.columbia.edu>
References: <B446D8AB-20BE-4E8B-B488-47EDB8CC5BE0@cancercenter.columbia.edu>
Message-ID: <CAECtV7Om73f5jxL1U0b8WsZLcBeUaCRTDc6YgqWdELzfr+dVrw@mail.gmail.com>

Dear Richard,

Dotmatcher uses a specified substitution matrix:
http://emboss.open-bio.org/wiki/Appdoc:Dotmatcher

Best regards,
Stefan

2011/7/27 Richard Friedman <friedman at cancercenter.columbia.edu>:
> Dear Emboss list,
>
> Is there a way to get dotplots that take similarity according to a
> similarity matrix,
> rather than strict identity into account? As far as I can see, dottup is
> based on identity.
> Is there a way that we can dotplots based on a similarity matrix similar to
> dotplot in GCG?
> I know that it may be tiresome that I use GCG as a standard, but it is what
> I know and
> it is serving as a point of departure while I am learning Emboss and redoing
> the GCG
> portion of my course in Emboss. I am enjoying learning about the ways in
> which Emboss
> offers improved functionality in the process as well.
>
> Thanks and best wishes,
> Rich
> ------------------------------------------------------------
> Richard A. Friedman, PhD
> Associate Research Scientist,
> Biomedical Informatics Shared Resource
> Herbert Irving Comprehensive Cancer Center (HICCC)
> Lecturer,
> Department of Biomedical Informatics (DBMI)
> Educational Coordinator,
> Center for Computational Biology and Bioinformatics (C2B2)/
> National Center for Multiscale Analysis of Genomic Networks (MAGNet)
> Room 824
> Irving Cancer Research Center
> Columbia University
> 1130 St. Nicholas Ave
> New York, NY 10032
> (212)851-4765 (voice)
> friedman at cancercenter.columbia.edu
> http://cancercenter.columbia.edu/~friedman/
>
> I am a Bayesian. When I see a multiple-choice question on a test and I don't
> know the answer I say "eeney-meaney-miney-moe".
>
> Rose Friedman, Age 14
>
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss
>


From charles-listes-emboss at plessy.org  Thu Jul 28 10:38:37 2011
From: charles-listes-emboss at plessy.org (Charles Plessy)
Date: Thu, 28 Jul 2011 23:38:37 +0900
Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative
 Commons Attribution-NoDerivs
Message-ID: <20110728143837.GC30927@merveille.plessy.net>

Dear EMBOSS developers,
(CC Debian Med mailing list)

while working on upgrading Debian's emboss package to version 6.4.0
(congratulations, by the way), I found some files in EMBOSS that are
not considered "Free software" by Debian.  They were actually present
in past releases as well. Here is their list:

test/data/amir.swiss
test/data/uniprotft.sw
test/swiss/seq.dat
test/swnew/trembl.dat

and emboss/data/dbxref.txt

Their license is Creative Commons Attribution-NoDerivs 3.0 Unported (CC BY-ND
3.0), and it disallows modification of the files.  The presence of these files
in EMBOSS makes it impossible for Debian to redistribute it in our operating
system.  I have confirmed with the UniProt consortium's helpdesk that, even in
isolation, these files are covered by the CC BY-ND license.  I see three
possible solutions. 

 a) Remove the files in Debian's EMBOSS package.
 b) Distribute EMBOSS with the files, but in the non-free section of the Debian archive.
 c) Replace the files by Free equivalents, for instance by re-creating records from scratch.

I am not very comfortable with any of the solutions, and was wondering if you
would have suggestions?

Have a nice day,

-- 
Charles Plessy
Debian Med packaging team,
http://www.debian.org/devel/debian-med
Tsurumi, Kanagawa, Japan

From mathog at caltech.edu  Thu Jul 28 11:06:50 2011
From: mathog at caltech.edu (David Mathog)
Date: Thu, 28 Jul 2011 08:06:50 -0700
Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative
	Commons Attribution-NoDerivs
Message-ID: <E1QmSAk-0000CQ-0M@mendel.bio.caltech.edu>

Charles Plessy wrote:

>a) Remove the files in Debian's EMBOSS package.
>b) Distribute EMBOSS with the files, but in the non-free section of the
>Debian archive.
>c) Replace the files by Free equivalents, for instance by re-creating
>records from scratch.

d)  Add a small script that wget's each file from its original
distribution site and installs it in the right place.  Have the package
install script either ask if it should run this script, or have it issue
a message which describes the issue and leaves it up to the user to run
the script.
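
A minimal sketch of such a fetch script, written in Python rather than
with wget, might look like the following. Everything specific here is
an assumption for illustration: the URL is a placeholder (the real
upstream locations are not given in this thread), and the function
name and the injectable download argument are hypothetical.

```python
import os
import urllib.request

# Placeholder URL: the real upstream location is not given in this thread.
SOURCES = {
    "test/swiss/seq.dat": "https://example.org/path/to/seq.dat",
}


def fetch_nonfree_files(dest_root, sources=SOURCES,
                        download=urllib.request.urlretrieve):
    """Download each file to its place under dest_root; return the paths."""
    fetched = []
    for relative_path, url in sources.items():
        target = os.path.join(dest_root, relative_path)
        # Create the destination directory if it does not exist yet.
        os.makedirs(os.path.dirname(target), exist_ok=True)
        download(url, target)
        fetched.append(target)
    return fetched
```

The package install script could either call this after asking the
user, or just print a message explaining how to run it.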

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From wolfgang.rumpf at gmail.com  Thu Jul 28 13:14:23 2011
From: wolfgang.rumpf at gmail.com (Wolfgang Rumpf)
Date: Thu, 28 Jul 2011 13:14:23 -0400
Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative
	Commons Attribution-NoDerivs
In-Reply-To: <E1QmSAk-0000CQ-0M@mendel.bio.caltech.edu>
References: <E1QmSAk-0000CQ-0M@mendel.bio.caltech.edu>
Message-ID: <1DAFB6D6-77CF-424C-A2BC-CD1B57FE6671@gmail.com>

I would prefer (c) or the newly-added (d) myself....


Cheers,


Wolfgang

--------------------------------------------------------------------------------------------------------------
Dr. Wolfgang Rumpf
Senior Product Specialist & Director of Support, Rescentris Inc.
Adjunct Faculty, Dept. of Biotechnology, UMUC
--------------------------------------------------------------------------------------------------------------
wolfgang.rumpf at rescentris.com 	 	wolfgang.rumpf at gmail.com
Mobile - (614) 638-6797 				Skype - wolfgang.rumpf
--------------------------------------------------------------------------------------------------------------
Read my Blog - "QuantumThoughts" - at http://culture.no-ip.org/quantumthoughts
--------------------------------------------------------------------------------------------------------------

On Jul 28, 2011, at 11:06 AM, David Mathog wrote:

> Charles Plessy wrote:
> 
>> a) Remove the files in Debian's EMBOSS package.
>> b) Distribute EMBOSS with the files, but in the non-free section of the
>> Debian archive.
>> c) Replace the files by Free equivalents, for instance by re-creating
>> records from scratch.
> 
> d)  Add a small script that wget's each file from its original
> distribution site and installs it in the right place.  Have the package
> install script either ask if it should run this script, or have it issue
> a message which describes the issue and leaves it up to the user to run
> the script.
> 
> Regards,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech


From s.newslists at gmail.com  Thu Jul 28 13:24:53 2011
From: s.newslists at gmail.com (Stefan)
Date: Thu, 28 Jul 2011 19:24:53 +0200
Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative
 Commons Attribution-NoDerivs
In-Reply-To: <1DAFB6D6-77CF-424C-A2BC-CD1B57FE6671@gmail.com>
References: <E1QmSAk-0000CQ-0M@mendel.bio.caltech.edu>
	<1DAFB6D6-77CF-424C-A2BC-CD1B57FE6671@gmail.com>
Message-ID: <CAECtV7OOtkFeXid4R5=rKr_puULbZibh-S16QJpzXzppmvWfOw@mail.gmail.com>

I would prefer (d) and I know packages where this is realized like
this. For example in SuSE the msttf fonts.

Regards,
Stefan

2011/7/28 Wolfgang Rumpf <wolfgang.rumpf at gmail.com>:
> I would prefer (c) or the newly-added (d) myself....
>
>
> Cheers,
>
>
> Wolfgang
>
> --------------------------------------------------------------------------------------------------------------
> Dr. Wolfgang Rumpf
> Senior Product Specialist & Director of Support, Rescentris Inc.
> Adjunct Faculty, Dept. of Biotechnology, UMUC
> --------------------------------------------------------------------------------------------------------------
> wolfgang.rumpf at rescentris.com           wolfgang.rumpf at gmail.com
> Mobile - (614) 638-6797                 Skype - wolfgang.rumpf
> --------------------------------------------------------------------------------------------------------------
> Read my Blog - "QuantumThoughts" - at http://culture.no-ip.org/quantumthoughts
> --------------------------------------------------------------------------------------------------------------
>
> On Jul 28, 2011, at 11:06 AM, David Mathog wrote:
>
>> Charles Plessy wrote:
>>
>>> a) Remove the files in Debian's EMBOSS package.
>>> b) Distribute EMBOSS with the files, but in the non-free section of the
>>> Debian archive.
>>> c) Replace the files by Free equivalents, for instance by re-creating
>>> records from scratch.
>>
>> d)  Add a small script that wget's each file from its original
>> distribution site and installs it in the right place.  Have the package
>> install script either ask if it should run this script, or have it issue
>> a message which describes the issue and leaves it up to the user to run
>> the script.
>>
>> Regards,
>>
>> David Mathog
>> mathog at caltech.edu
>> Manager, Sequence Analysis Facility, Biology Division, Caltech


From rothenbuhler at xoma.com  Thu Jul 28 18:44:47 2011
From: rothenbuhler at xoma.com (Jake Rothenbuhler)
Date: Thu, 28 Jul 2011 15:44:47 -0700
Subject: [EMBOSS] Algorithms for pI and molecular weight in pepstats
In-Reply-To: <4E2E6C8E.5040502@ebi.ac.uk>
References: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com>
	<4E2E6C8E.5040502@ebi.ac.uk>
Message-ID: <3110E5050A5DE54F8715EB2AC3D2057725FCED@cypress6.xoma.com>

Thanks to Ingo and Peter for the quick and helpful replies. I've read
through the discussion you had a year ago on this topic and it seems
like it is still unresolved.

> The isoelectric point can be calculated for various conditions. When I
> checked last, ExPASy's protparam was set up the isoelectric focus phase
> of 2D gels under high urea conditions. It was unclear at the time where
> to find all the values needed to reproduce their calculation.

I have been reading through the literature referenced by ExPASy's
documentation. The article does not give pK values for all N-terminal
residues. I've asked ExPASy support about the pK values used for
residues not listed in the paper. If you're interested, I can keep you
updated regarding their response.
 
> We would like to update EMBOSS's protein property calculations, possibly
> with additional options or alternative parameter sets.

If it's something you'd like to include in EMBOSS, I'd be willing to
contribute to an additional option for pI calculation that uses ExPASy's
pK values.

Jake Rothenbuhler
Bioinformatics Programmer/Analyst
XOMA (US) LLC
(510) 204-7452



From pmr at ebi.ac.uk  Fri Jul 29 03:28:48 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 29 Jul 2011 08:28:48 +0100
Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative
 Commons Attribution-NoDerivs
In-Reply-To: <20110728143837.GC30927@merveille.plessy.net>
References: <20110728143837.GC30927@merveille.plessy.net>
Message-ID: <4E326130.7030507@ebi.ac.uk>

On 28/07/2011 15:38, Charles Plessy wrote:
> Dear EMBOSS developers,
> (CC Debian Med mailing list)
>
> while working on upgrading Debian's emboss package to version 6.4.0
> (congratulations, by the way), I found some files in EMBOSS that are
> not considered "Free software" by Debian.  They were actually present
> in past releases as well. Here is their list:
>
> test/data/amir.swiss
> test/data/uniprotft.sw
> test/swiss/seq.dat
> test/swnew/trembl.dat

Huh? Example entries from UniProt? We can of course remove them from the 
distribution but then the QA tests will not work if anyone tries them.

I suspect amir.swiss predates this UniProt licensing, but the others are 
more recently updated.

Anyway, EMBOSS will work perfectly well without them. You can just 
delete them.

> and emboss/data/dbxref.txt

That one can go. It was a source for the DRCAT.dat data resource 
catalogue and yes we do have permission from UniProt to use it.

> Their license is Creative Commons Attribution-NoDerivs 3.0 Unported (CC BY-ND
> 3.0), and it disallows modification of the files.  The presence of these files
> in EMBOSS makes it impossible for Debian to redistribute it in our operating
> system.  I have confirmed with the UniProt consortium's helpdesk that, even in
> isolation, these files are covered by the CC BY-ND license.  I see three
> possible solutions.
>
>   a) Remove the files in Debian's EMBOSS package.
>   b) Distribute EMBOSS with the files, but in the non-free section of the Debian archive.
>   c) Replace the files by Free equivalents, for instance by re-creating records from scratch.
>
> I am not very comfortable with any of the solutions, and was wondering if you
> would have suggestions ?

I will also have words with the UniProt folk at EBI and if it really is 
not possible to include a few example entries with EMBOSS then I'll 
check with the other Open Bio projects. This is really silly.

regards,

Peter Rice
EMBOSS Team


From pmr at ebi.ac.uk  Fri Jul 29 03:46:42 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 29 Jul 2011 08:46:42 +0100
Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative
 Commons Attribution-NoDerivs
In-Reply-To: <20110728143837.GC30927@merveille.plessy.net>
References: <20110728143837.GC30927@merveille.plessy.net>
Message-ID: <4E326562.1020001@ebi.ac.uk>

On 28/07/2011 15:38, Charles Plessy wrote:
> Dear EMBOSS developers,
> (CC Debian Med mailing list)
>
> while working on upgrading Debian's emboss package to version 6.4.0
> (congratulations, by the way), I found some files in EMBOSS that are
> not considered "Free software" by Debian.  They were actually present
> in past releases as well.
>
> Their license is Creative Commons Attribution-NoDerivs 3.0 Unported (CC BY-ND
> 3.0), and it disallows modification of the files.  The presence of these files
> in EMBOSS makes it impossible for Debian to redistribute it in our operating
> system.  I have confirmed with the UniProt consortium's helpdesk that, even in
> isolation, these files are covered by the CC BY-ND license.  I see three
> possible solutions.

Ummm .... in what sense would *you* be modifying the files?

UniProt's license http://www.uniprot.org/help/license says

> License & disclaimer
>
> License
>
> We have chosen to apply the Creative Commons Attribution-NoDerivs License to all copyrightable parts of our databases. This means that you are free to copy, distribute, display and make commercial use of these databases in all legislations, provided you give us credit. However, if you intend to distribute a modified version of one of our databases, you must ask us for permission first.

So I see no problem for EMBOSS in including the files.

The only problem is for someone "modifying the files and redistributing 
them" without permission ... but strictly that would not apply to most 
uses of a UniProt entry (otherwise you could not use one entry as input 
and distribute the results).

The licensing is there to prevent redistribution of UniProt without 
permission.

Anyway, you can just delete them from the Debian distribution of EMBOSS 
- and find your own way to run the QA tests. I don't think we have a 
problem.

regards,

Peter Rice
EMBOSS Team


From pmr at ebi.ac.uk  Fri Jul 29 04:39:46 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 29 Jul 2011 09:39:46 +0100
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
In-Reply-To: <4E326562.1020001@ebi.ac.uk>
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk>
Message-ID: <4E3271D2.2070906@ebi.ac.uk>

On 07/29/2011 08:46 AM, Peter Rice wrote:
> On 28/07/2011 15:38, Charles Plessy wrote:
>> Dear EMBOSS developers,
>> (CC Debian Med mailing list)
>>
>> while working on upgrading Debian's emboss package to version 6.4.0
>> (congratulations, by the way), I found some files in EMBOSS that are
>> not considered "Free software" by Debian. 

While we're on the topic of licensing, some other data files in EMBOSS
6.4.0 have licences.

emboss/data/OBO contains copies of several Open Bio-Ontologies for which
EMBOSS includes index files - so you need the data file version that
matches the index files.

For example, the Gene Ontology terms
http://www.geneontology.org/GO.cite.shtml are:

GO Usage Policy

The GO Consortium gives permission for any of its products to be used
without license for any purpose under three conditions:

    That the Gene Ontology Consortium is clearly acknowledged as the
source of the product;
    That any GO Consortium file(s) displayed publicly include the
date(s) and/or version number(s) of the relevant GO file(s) (the GO is
evolving and changes will occur with time);
    That neither the content of the GO file(s) nor the logical
relationships embedded within the GO file(s) be altered in any way.

which looks rather like the problem you had with Creative Commons.

Licenses that protect the official database release from derived
versions are entirely reasonable and standard in bioinformatics.
Basically, they make sure that when you refer to a UniProt entry, or an
OBO ontology term, everyone agrees you are referring to one agreed entry
or term.

EMBOSS does depend on these files. The database names are hard-coded
into some of the new (and more to come) applications.

You could download the databases and indexes from our rsync copies we
use to keep developers in sync. These are at
rsync://emboss.open-bio.org/EMBOSS/

It might make things clearer if someone from Debian could explain:

(a) why a Creative Commons licence is an issue for you

(b) why you appear to consider a copy of a whole or part of a public
biological database as part of an "operating system"

regards,

Peter Rice
EMBOSS Team


From cjfields at illinois.edu  Fri Jul 29 09:51:53 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 29 Jul 2011 08:51:53 -0500
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
In-Reply-To: <4E3271D2.2070906@ebi.ac.uk>
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
Message-ID: <B0C0539E-6E9D-4049-BD7E-77D3958B224C@illinois.edu>

On Jul 29, 2011, at 3:39 AM, Peter Rice wrote:

> On 07/29/2011 08:46 AM, Peter Rice wrote:
>> On 28/07/2011 15:38, Charles Plessy wrote:
>>> Dear EMBOSS developers,
>>> (CC Debian Med mailing list)
>>> 
>>> while working on upgrading Debian's emboss package to version 6.4.0
>>> (congratulations, by the way), I found some files in EMBOSS that are
>>> not considered "Free software" by Debian. 
> 
> While we're on the topic of licensing, some other data files in EMBOSS
> 6.4.0 have licences.
> 
> emboss/data/OBO contains copies of several Open Bio-Ontologies for which
> EMBOSS includes index files - so you need the data file version that
> matches the index files.
> 
> For example, the Gene Ontology terms
> http://www.geneontology.org/GO.cite.shtml are:
> 
> GO Usage Policy
> 
> The GO Consortium gives permission for any of its products to be used
> without license for any purpose under three conditions:
> 
>    That the Gene Ontology Consortium is clearly acknowledged as the
> source of the product;
>    That any GO Consortium file(s) displayed publicly include the
> date(s) and/or version number(s) of the relevant GO file(s) (the GO is
> evolving and changes will occur with time);
>    That neither the content of the GO file(s) nor the logical
> relationships embedded within the GO file(s) be altered in any way.
> 
> which looks rather like the problem you had with Creative Commons.
> 
> Licenses that protect the official database release from derived
> versions are entirely reasonable and standard in bioinformatics.
> Basically, they make sure that when you refer to a UniProt entry, or an
> OBO ontology term, everyone agrees you are referring to one agreed entry
> or term.
> 
> EMBOSS does depend on these files. The database names are hard-coded
> into some of the new (and more to come) applications.
> 
> You could download the databases and indexes from our rsync copies we
> use to keep developers in sync. These are at
> rsync://emboss.open-bio.org/EMBOSS/
> 
> It might make things clearer if someone from Debian could explain:
> 
> (a) why a Creative Commons licence is an issue for you
> 
> (b) why you appear to consider a copy of a whole or part of a public
> biological database as part of an "operating system"
> 
> regards,
> 
> Peter Rice
> EMBOSS Team


Charles,

From the BioPerl perspective, this will very likely be a problem for us as well as all other Bio* languages (Biopython, BioJava, BioRuby); we typically include data derived from these sources.  We may have a bit more flexibility in that the vast majority are mainly only for tests, but I believe some data is hard-coded in.  Fallback data like REBase for restriction analysis and GO (as Peter mentioned above) come to mind.

chris

Christopher Fields
Senior Research Scientist
National Center for Supercomputing Applications
Institute for Genomic Biology
University of Illinois Urbana-Champaign
1206 W. Gregory Dr. , MC-195
Urbana, IL 61801


From asjo at koldfront.dk  Fri Jul 29 16:35:13 2011
From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=)
Date: Fri, 29 Jul 2011 22:35:13 +0200
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
Message-ID: <87sjpoq0zi.fsf@topper.koldfront.dk>

On Fri, 29 Jul 2011 09:39:46 +0100, Peter wrote:

> It might make things clearer if someone from Debian could explain:

(I am not from Debian, but here is my take on it anyway:)

> (a) why a Creative Commons licence is an issue for you

One of the fundamental software freedoms is the freedom to change the
software¹.

The Debian Free Software Guidelines' definition of free software
includes this freedom².

So the "No Derivatives" variants of the Creative Commons licenses aren't
free by the DFSG definition.

(The GNU Free Documentation License on documents with invariant sections
is considered non-free by DFSG-standards as well, even if the invariant
sections are things that nobody would want to change.)

When a project of volunteers packages 29000+ packages, I think
making a judgement call on whether it is okay that the license of a
couple of files does not live up to the guidelines is nigh impossible.

The answer to "Why would you want to?" is, because you might need to.

It is more obvious with programs and code than it is with database
entries, granted - but I guess the equivalent problem would be that the
licensor didn't want to fix a problem in such a database, and that
problem made the programs using it malfunction. It would be a pain if
you weren't allowed to fix the problem and distribute the fixed data
yourself, say, if "upstream" didn't want to include the fix for some
reason or another; maybe they happened to turn sour on the world/you -
stranger things have happened.

I don't think that will happen in this specific case, but making
judgement calls on what organisations/people will do in the future isn't
quite firm ground.

So, nobody is probably ever going to exercise that freedom in this
specific case, I think, but ignoring some of the freedoms in special
cases is infeasible for a project such as Debian.

This is just me trying to explain how I understand it, so take it with a
grain of salt, and swing by debian-legal³ for the experts.

> (b) why you appear to consider a copy of a whole or part of a public
> biological database as part of an "operating system"

They are part of a package which is included in the Debian GNU/Linux
free operating system.


(I personally think it would make sense to change to a Creative Commons
license that allows derivative works - Uniprot and others are going to
be the canonical source for the data anyway, so nothing will be lost by
them by doing that, as far as I can see.)


  Best regards,

    Adam


¹ http://en.wikipedia.org/wiki/Free_software#Definition
² http://en.wikipedia.org/wiki/Debian_Free_Software_Guidelines
³ http://lists.debian.org/debian-legal/

-- 
 "Good car to drive after a war"                              Adam Sjøgren
                                                         asjo at koldfront.dk


From pmr at ebi.ac.uk  Sat Jul 30 04:58:07 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Sat, 30 Jul 2011 09:58:07 +0100
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
In-Reply-To: <87sjpoq0zi.fsf@topper.koldfront.dk>
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
	<87sjpoq0zi.fsf@topper.koldfront.dk>
Message-ID: <4E33C79F.8080402@ebi.ac.uk>

Quoted in full for the benefit of the debian-med list who missed the 
original posting

On 29/07/2011 21:35, Adam Sjøgren wrote:
> On Fri, 29 Jul 2011 09:39:46 +0100, Peter wrote:
>
>> It might make things clearer if someone from Debian could explain:
>
> (I am not from Debian, but here is my take on it anyway:)
>
>> (a) why a Creative Commons licence is an issue for you
>
> One of the fundamental software freedoms is the freedom to change the
> software¹.
>
> The Debian Free Software Guidelines' definition of free software
> includes this freedom².
>
> So the "No Derivatives" variants of the Creative Commons licenses aren't
> free by the DFSG definition.
>
> (The GNU Free Documentation License on documents with invariant sections
> is considered non-free by DFSG-standards as well, even if the invariant
> sections are things that nobody would want to change.)
>
> When a project of volunteers packages 29000+ packages, I think
> making a judgement call on whether it is okay that the license of a
> couple of files does not live up to the guidelines is nigh impossible.

> The answer to "Why would you want to?" is, because you might need to.
>
> It is more obvious with programs and code than it is with database
> entries, granted - but I guess the equivalent problem would be that the
> licensor didn't want to fix a problem in such a database, and that
> problem made the programs using it malfunction. It would be a pain if
> you weren't allowed to fix the problem and distribute the fixed data
> yourself, say, if "upstream" didn't want to include the fix for some
> reason or another; maybe they happened to turn sour on the world/you -
> stranger things have happened.
>
> So, nobody is probably ever going to exercise that freedom in this
> specific case, I think, but ignoring some of the freedoms in special
> cases is infeasible for a project such as Debian.
>
> This is just me trying to explain how I understand it, so take it with a
> grain of salt, and swing by debian-legal³ for the experts.

A specific example might help. About 5 years ago a release of the 
UniProt database (as plain text files) broke the Wisconsin (GCG) 
sequence analysis package. They introduced extremely long lines in a 
data file that everyone assumed was only maximum 80 characters.

As GCG was closed source, the fix required a change to the UniProt files 
to either wrap or truncate the 'offending' records.

The fix was not to distribute a change to the data of course, but to 
write and distribute a simple perl script that wrapped the long records.

That was not a licensing issue - the content stays the same, the format 
is changed, no changed data is distributed. But it does illustrate that 
the database licensing does not prevent 'fixing' a database.
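
A minimal sketch of the kind of line-wrapping script described above. The
original was a Perl script that is not shown in this thread; this Python
version, its function names, the 80-column limit and the assumption of a
5-character line-code prefix (two-letter code plus three spaces) are all
illustrative, not the script that was actually distributed.

```python
import textwrap


def wrap_record_line(line, width=80, prefix_len=5):
    """Wrap one flat-file line, repeating its leading prefix on each piece.

    Assumes (hypothetically) that the first `prefix_len` characters are a
    line code such as "CC   " that must be repeated on continuation lines.
    Only the layout changes; the words themselves are left untouched.
    """
    line = line.rstrip("\n")
    if len(line) <= width:
        return [line]
    prefix, body = line[:prefix_len], line[prefix_len:]
    # Break only at whitespace; a single token longer than the available
    # width is left intact rather than being truncated.
    chunks = textwrap.wrap(
        body,
        width=width - prefix_len,
        break_long_words=False,
        break_on_hyphens=False,
    )
    return [prefix + chunk for chunk in chunks]


def wrap_lines(lines, width=80):
    """Apply wrap_record_line to every line of a record."""
    wrapped = []
    for line in lines:
        wrapped.extend(wrap_record_line(line, width))
    return wrapped
```

The point of wrapping rather than truncating is that the content of each
record is preserved, which matches the distinction drawn above between
changing the format and changing the data.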

>> (b) why you appear to consider a copy of a whole or part of a public
>> biological database as part of an "operating system"
>
> They are part of a package which is included in the Debian GNU/Linux
> free operating system.

I expect there are many problems that arise if data ... and 
documentation ... are considered to be software. For EMBOSS we didn't 
officially specify a license for the documentation but other packages 
probably do. It still worries me that some of our documentation files 
officially include GPL licensed (EMBOSS) source code but I did not like 
any of the alternative documentation licenses.

> (I personally think it would make sense to change to a Creative Commons
> license that allows derivative works - Uniprot and others are going to
> be the canonical source for the data anyway, so nothing will be lost by
> them by doing that, as far as I can see.)

Unlikely. The no-derivatives version is specifically there to prevent 
derivatives - for example Debian distributing a modified UniProt without 
permission.

The ontologies are similar, but do allow for the use case of importing 
terms from one ontology into another if the ontology name is changed 
(and preferably if cross-references to the original are provided). 
Again, the need is to protect the integrity of the original ontology 
content so references to a GO term or a UniProt entry are clearly defined.

This is essential for many of the public bioinformatics databases. Data 
and software are not the same in this context. I am curious whether 
documentation licensing raises any issues.

Just my 2c worth

Peter Rice
EMBOSS Team


From asjo at koldfront.dk  Sat Jul 30 07:36:54 2011
From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=)
Date: Sat, 30 Jul 2011 13:36:54 +0200
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
	<87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk>
Message-ID: <87ipqkgfu1.fsf@topper.koldfront.dk>

On Sat, 30 Jul 2011 09:58:07 +0100, Peter wrote:

> A specific example might help. About 5 years ago a release of the
> UniProt database (as plain text files) broke the Wisconsin (GCG)
> sequence analysis package.

[...]

This is the opposite problem of what I tried to sketch.

Your example has closed source software that can't be fixed, leading to
either preprocessing or changing the database rather than fixing the
real problem.

If the software had been free, you could just have fixed the software.

Switch around "software" and "database", and you have the example I was
trying to paint.

> I expect there are many problems that arise if data ... and
> documentation ... are considered to be software.

Sure. The whole GFDL debate took quite a while, I think.

But that doesn't change that one of the solutions outlined by Charles
Plessy is necessary for Debian to distribute EMBOSS (and any other piece
of free/redistributable software).

>> (I personally think it would make sense to change to a Creative Commons
>> license that allows derivative works - Uniprot and others are going to
>> be the canonical source for the data anyway, so nothing will be lost by
>> them by doing that, as far as I can see.)

> Unlikely. The no-derivatives version is specifically there to prevent
> derivatives - for example Debian distributing a modified UniProt
> without permission.

What I was trying to say is that I don't think that that clause gives
any value to the owners of Uniprot and other databases.

Why would Uniprot want to prevent derivative works? They'll always be
the canonical source for the correct information.

You are free to distribute a modified version of the man-page for ls(1)
- but if you introduce errors in it or make it worse, nobody will choose
your derived version.

> The ontologies are similar, but do allow for the use case of importing
> terms from one ontology into another if the ontology name is changed
> (and preferably if cross-references to the original are provided).

> Again, the need is to protect the integrity of the original ontology
> content so references to a GO term or a UniProt entry are clearly
> defined.

I think the problem that is being protected against is non-existing.

People don't want to break stuff that works, they want to be able to fix
stuff that doesn't.

> This is essential for many of the public bioinformatics databases.

Why? Only a hypothetical derivative would be changed, not the original.

If someone distributed a derivative that was broken, I think people
would quickly abandon it.


Again, just my point of view - not representing or speaking for anyone :-)


  Best regards,

    Adam

-- 
 "Good car to drive after a war"                              Adam Sjøgren
                                                         asjo at koldfront.dk


From cjfields at illinois.edu  Sat Jul 30 15:01:58 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Sat, 30 Jul 2011 14:01:58 -0500
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
In-Reply-To: <4E33C79F.8080402@ebi.ac.uk>
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
	<87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk>
Message-ID: <5EF06959-FAA0-4DBB-96EA-013CFFDE2960@illinois.edu>

On Jul 30, 2011, at 3:58 AM, Peter Rice wrote:

> Quoted in full for the benefit of the debian-med list who missed the original posting
> 
> On 29/07/2011 21:35, Adam Sjøgren wrote:
>> On Fri, 29 Jul 2011 09:39:46 +0100, Peter wrote:
>> 
>>> It might make things clearer if someone from Debian could explain:
>> 
>> (I am not from Debian, but here is my take on it anyway:)
>> 
>>> (a) why a Creative Commons licence is an issue for you
>> 
>> One of the fundamental software freedoms is the freedom to change the
>> software¹.
>> 
>> The Debian Free Software Guidelines' definition of free software
>> includes this freedom².
>> 
>> So the "No Derivatives" variants of the Creative Commons licenses aren't
>> free by the DFSG definition.
>> 
>> (The GNU Free Documentation License on documents with invariant sections
>> is considered non-free by DFSG-standards as well, even if the invariant
>> sections are things that nobody would want to change.)
>> 
>> When a project of volunteers packages 29000+ packages, I think
>> making a judgement call on whether it is okay that the license of a
>> couple of files does not live up to the guidelines is nigh impossible.
> 
>> The answer to "Why would you want to?" is, because you might need to.
>> 
>> It is more obvious with programs and code than it is with database
>> entries, granted - but I guess the equivalent problem would be that the
>> licensor didn't want to fix a problem in such a database, and that
>> problem made the programs using it malfunction. It would be a pain if
>> you weren't allowed to fix the problem and distribute the fixed data
>> yourself, say, if "upstream" didn't want to include the fix for some
>> reason or another; maybe they happened to turn sour on the world/you -
>> stranger things have happened.
>> 
>> So, nobody is probably ever going to exercise that freedom in this
>> specific case, I think, but ignoring some of the freedoms in special
>> cases is infeasible for a project such as Debian.
>> 
>> This is just me trying to explain how I understand it, so take it with a
>> grain of salt, and swing by debian-legal³ for the experts.
> 
> A specific example might help. About 5 years ago a release of the UniProt database (as plain text files) broke the Wisconsin (GCG) sequence analysis package. They introduced extremely long lines in a data file that everyone assumed was only maximum 80 characters.
> 
> As GCG was closed source, the fix required a change to the UniProt files to either wrap or truncate the 'offending' records.
> 
> The fix was not to distribute a change to the data of course, but to write and distribute a simple perl script that wrapped the long records.
> 
> That was not a licensing issue - the content stays the same, the format is changed, no changed data is distributed. But it does illustrate that the database licensing does not prevent 'fixing' a database.
> 
>>> (b) why you appear to consider a copy of a whole or part of a public
>>> biological database as part of an "operating system"
>> 
>> They are part of a package which is included in the Debian GNU/Linux
>> free operating system.
> 
> I expect there are many problems that arise if data ... and documentation ... are considered to be software. For EMBOSS we didn't officially specify a license for the documentation but other packages probably do. It still worries me that some of our documentation files officially include GPL licensed (EMBOSS) source code but I did not like any of the alternative documentation licenses.

I don't understand the logic behind why data would be considered software, unless one is using a very fuzzy definition of 'software'.  Is this strictly a packaging issue, e.g. any data packaged with source makes it 'software'?  Or just the fact that such data is licensed?  Would a package of just data/docs (no code) be allowed?

>> (I personally think it would make sense to change to a Creative Commons
>> license that allows derivative works - Uniprot and others are going to
>> be the canonical source for the data anyway, so nothing will be lost by
>> them by doing that, as far as I can see.)
> 
> Unlikely. The no-derivatives version is specifically there to prevent derivatives - for example Debian distributing a modified UniProt without permission.
> 
> The ontologies are similar, but do allow for the use case of importing terms from one ontology into another if the ontology name is changed (and preferably if cross-references to the original are provided). Again, the need is to protect the integrity of the original ontology content so references to a GO term or a UniProt entry are clearly defined.
> 
> This is essential for many of the public bioinformatics databases. Data and software are not the same in this context. I am curious whether documentation licensing raises any issues.
> 
> Just my 2c worth
> 
> Peter Rice
> EMBOSS Team


Maybe the best solution is to just package any data separately?  We have talked about setting up a 'biodata' repository for common datasets from all the Bio* projects.

Feel free to skip the rest of this, but:

<my_2c>

I agree with Peter's point, Uniprot and other databases license data this way for very good (and well-intentioned) reasons. For the Bio* languages there are instances where we use such data as a fallback in case a newer version isn't immediately available (REBase and SO come to mind, and I think we have others), so we are likely in the same boat as EMBOSS.  

I had a long screed here, but I found some original sources for the discussion re: Uniprot and use of Creative Commons licensing that states the reasoning for why this is in place:

http://wiki.creativecommons.org/Case_Studies/Uniprot
http://eric.jain.name/2006/02/07/uniprot-creative-commons/
http://sciencecommons.org/resources/faq/databases/
http://sciencecommons.org/resources/faq/database-protocol/

Note there is now a 'Database Protocol' (last link) that recommends a different license; that page nicely summarizes the history of the whole Creative Commons licensing affair and the issues of using a Creative Commons license re: databases, mainly due to the issue Peter mentioned above, that databases != software.  Uniprot doesn't use this as of yet (so it doesn't solve the problem at hand), but it's possible this may change.

</my_2c>

chris

Christopher Fields
Senior Research Scientist
National Center for Supercomputing Applications
Institute for Genomic Biology
University of Illinois Urbana-Champaign
1206 W. Gregory Dr. , MC-195
Urbana, IL 61801


From asjo at koldfront.dk  Sat Jul 30 15:34:30 2011
From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=)
Date: Sat, 30 Jul 2011 21:34:30 +0200
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
	<87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk>
	<5EF06959-FAA0-4DBB-96EA-013CFFDE2960@illinois.edu>
Message-ID: <87d3grwojd.fsf@topper.koldfront.dk>

On Sat, 30 Jul 2011 14:01:58 -0500, Chris wrote:

> I don't understand the logic behind why data would be considered
> software, unless one is using a very fuzzy definition of 'software'.
> Is this strictly a packaging issue, e.g. any data packaged with source
> makes it 'software'? Or just the fact that such data is licensed?
> Would a package of just data/docs (no code) be allowed?

  "The DFSG is focused on software, but the word itself is unclear -
   some apply it to everything that can be expressed as a stream of
   bits, while a minority considers it to refer to just computer
   programs. Also, the existence of PostScript, executable scripts,
   sourced documents, etc, greatly muddies the second definition. Thus,
   to break the confusion, in June 2004 the Debian project decided to
   explicitly apply the same principles to software documentation,
   multimedia data and other content. The non-program content of Debian
   began to comply with the DFSG more strictly in Debian 4.0 (released
   in April 2007) and subsequent releases."
    - http://en.wikipedia.org/wiki/DFSG#Non-.22software.22_content

So no.

> I agree with Peter's point, Uniprot and other databases license data
> this way for very good (and well-intentioned) reasons.

Several people have mentioned the existence of these good reasons for
not allowing derived works when it comes to science/databases/biology; I
wonder what those reasons are?

Just curious.

[...]
> http://sciencecommons.org/resources/faq/database-protocol/

> Note there is now a 'Database Protocol' (last link) that recommends a
> different license; that page nicely summarizes the history of the whole
> Creative Commons licensing affair and the issues of using a Creative
> Commons license re: databases, mainly due to the issue Peter mentioned
> above, that databases != software. Uniprot doesn't use this as of yet
> (so it doesn't solve the problem at hand), but it's possible this may
> change.

It sounds like Science Commons' Open Access Data Protocol means putting
the data in the public domain, which would mean that derived works would
very much be allowed?

This link explains the protocol:

 * http://sciencecommons.org/projects/publishing/open-access-data-protocol/


  Best regards,

    Adam

-- 
 "Good car to drive after a war"                              Adam Sjøgren
                                                         asjo at koldfront.dk


From cjfields at illinois.edu  Sat Jul 30 15:42:19 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Sat, 30 Jul 2011 14:42:19 -0500
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
In-Reply-To: <87ipqkgfu1.fsf@topper.koldfront.dk>
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
	<87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk>
	<87ipqkgfu1.fsf@topper.koldfront.dk>
Message-ID: <C368A0FF-27D8-463E-BAB4-5FBB6A02D1C0@illinois.edu>

On Jul 30, 2011, at 6:36 AM, Adam Sjøgren wrote:

> On Sat, 30 Jul 2011 09:58:07 +0100, Peter wrote:
> 
>> A specific example might help. About 5 years ago a release of the
>> UniProt database (as plain text files) broke the Wisconsin (GCG)
>> sequence analysis package.
> 
> [...]
> 
> This is the opposite problem of what I tried to sketch.
> 
> Your example has closed source software that can't be fixed, leading to
> either preprocessing or changing the database rather than fixing the
> real problem.
> 
> If the software had been free, you could just have fixed the software.
> 
> Switch around "software" and "database", and you have the example I was
> trying to paint.

Yes, if the source were available fixing the parser would have been the best option.  But I think you are missing the fundamental point that Peter made (that you left out): the wording of the license allowed them to reformat the file w/o changing the actual content.  I'm not sure but I believe many GenPept documents are Uniprot-derived and follow the same concept. 

Data records and databases are not software, unless you are using some very fuzzy definition of such.

>> I expect there are many problems that arise if data ... and
>> documentation ... are considered to be software.
> 
> Sure. The whole GFDL debate took quite a while, I think.
> 
> But that doesn't change that one of the solutions outlined by Charles
> Plessy is necessary for Debian to distribute EMBOSS (and any other piece
> of free/redistributable software).

You'll also note Charles's distaste for the options mentioned.  He was also searching for alternatives.

>>> (I personally think it would make sense to change to a Creative Commons
>>> license that allows derivative works - Uniprot and others are going to
>>> be the canonical source for the data anyway, so nothing will be lost by
>>> them by doing that, as far as I can see.)
> 
>> Unlikely. The no-derivatives version is specifically there to prevent
>> derivatives - for example Debian distributing a modified UniProt
>> without permission.
> 
> What I was trying to say is that I don't think that that clause gives
> any value to the owners of Uniprot and other databases.
> 
> Why would Uniprot want to prevent derivative works? They'll always be
> the canonical source for the correct information.

The links provided in my other response indicate some of the mindset behind this. I think the main point is that the work has to be attributed, and that any changes to such data need permission of Uniprot, likely so any content changes can be curated and (possibly) propagated to future releases. This also ensures that a set of files from a third-party containing the Uniprot name will not be modified (e.g. all content can be trusted as coming from Uniprot w/o modification).

I have seen instances where data under loose control (such as annotation from a newly sequenced genome) become balkanized to the point that no one can clearly state who is the trusted source (even when the list of sources includes large databases such as NCBI/EBI).  So I understand the reasoning for the license, but I also see Science Commons is recommending something less strict.

> You are free to distribute a modified version of the man-page for ls(1)
> - but if you introduce errors in it or make it worse, nobody will choose
> your derived version.

That's a straw man argument; man page documentation for an app is not the same as a database record based on scientific data.  Would you make the same argument (allow free content modification) for a scientific publication?  I would, but only for corrections or for new data that support/contradict the original data, and even then it must go through some sort of mediation (an editor for instance), not unlike what a database curator does.

>> The ontologies are similar, but do allow for the use case of importing
>> terms from one ontology into another if the ontology name is changed
>> (and preferably if cross-references to the original are provided).
> 
>> Again, the need is to protect the integrity of the original ontology
>> content so references to a GO term or a UniProt entry are clearly
>> defined.
> 
> I think the problem that is being protected against is non-existing.
> 
> People don't want to break stuff that works, they want to be able to fix
> stuff that doesn't.

Simply opening the licensing up for any content modification doesn't solve the problem in the case of scientific databases, it potentially exacerbates it.  Hence the variations in the licensing in the previous links I sent.  By the way, if you think the classic 'vi vs emacs' arguments can get out of control, see what happens when you have competing groups trying to make changes to a sequence record w/o curation.

I do agree that it would be nice for the barrier to database modification to be lowered. Many previous attempts have been made at doing this, such as including third-party annotation, but with the major databases these attempts all seem to fall by the wayside, and the databases fall back to simple curation.

Maybe it's time to come up with a git/hg for biological data, where one could fork records and make changes for submission; at least there one could have a trusted source and easier paths to data modification.  Just a thought.

>> This is essential for many of the public bioinformatics databases.
> 
> Why? Only a hypothetical derivative would be changed, not the original.
> 
> If someone distributed a derivative that was broken, I think people
> would quickly abandon it.

How could one tell the difference if both versions are implied to come from Uniprot (even if one comes from a third/fourth/fifth party)?  There is no guarantee beyond going back and comparing the records to the original Uniprot data.  

> Again, just my point of view - not representing or speaking for anyone :-)
> 
> 
>  Best regards,
> 
>    Adam


chris

Christopher Fields
Senior Research Scientist
National Center for Supercomputing Applications
Institute for Genomic Biology
University of Illinois Urbana-Champaign
1206 W. Gregory Dr. , MC-195
Urbana, IL 61801


From cjfields at illinois.edu  Sat Jul 30 16:14:39 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Sat, 30 Jul 2011 15:14:39 -0500
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
In-Reply-To: <87d3grwojd.fsf@topper.koldfront.dk>
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
	<87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk>
	<5EF06959-FAA0-4DBB-96EA-013CFFDE2960@illinois.edu>
	<87d3grwojd.fsf@topper.koldfront.dk>
Message-ID: <F62874C2-2AE6-49E7-AD76-46BA8DC874A6@illinois.edu>

(Charles, not sure you have been following, but any idea on the next steps and whether other packages like BioPerl are affected?)

On Jul 30, 2011, at 2:34 PM, Adam Sjøgren wrote:

> On Sat, 30 Jul 2011 14:01:58 -0500, Chris wrote:
> 
>> I don't understand the logic behind why data would be considered
>> software, unless one is using a very fuzzy definition of 'software'.
>> Is this strictly a packaging issue, e.g. any data packaged with source
>> makes it 'software'? Or just the fact that such data is licensed?
>> Would a package of just data/docs (no code) be allowed?
> 
>  "The DFSG is focused on software, but the word itself is unclear -
>   some apply it to everything that can be expressed as a stream of
>   bits, while a minority considers it to refer to just computer
>   programs. Also, the existence of PostScript, executable scripts,
>   sourced documents, etc, greatly muddies the second definition. Thus,
>   to break the confusion, in June 2004 the Debian project decided to
>   explicitly apply the same principles to software documentation,
>   multimedia data and other content. The non-program content of Debian
>   began to comply with the DFSG more strictly in Debian 4.0 (released
>   in April 2007) and subsequent releases."
>    - http://en.wikipedia.org/wiki/DFSG#Non-.22software.22_content
> 
> So no.

Oh well; we'll leave that up to Debian then.  I think Peter and I stated our concerns, and possible options were stated by Charles and myself, so there is no need to drag this out.  I would rather find a solution.

>> I agree with Peter's point, Uniprot and other databases license data
>> this way for very good (and well-intentioned) reasons.
> 
> Several people have mentioned the existence of these good reasons for
> not allowing derived works when it comes to science/databases/biology; I
> wonder what those reasons are?
> 
> Just curious.

Those links I passed on mention some of the primary concerns from both the Science Commons and Uniprot side.  I believe it comes down to an issue of trusting the source of the data and the level of control the database wants (the latter was implied in Eric's blog post).  

> [...]
>> http://sciencecommons.org/resources/faq/database-protocol/
> 
>> Note there is now a 'Database Protocol' (last link) that recommends a
>> different license; that page nicely summarizes the history of the whole
>> Creative Commons licensing affair and the issues of using a Creative
>> Commons license re: databases, mainly due to the issue Peter mentioned
>> above, that databases != software. Uniprot doesn't use this as of yet
>> (so it doesn't solve the problem at hand), but it's possible this may
>> change.
> 
> It sounds like Science Commons' Open Access Data Protocol means putting
> the data in the public domain, which would mean that derived works would
> very much be allowed?

Yes, if one adopts that protocol (Uniprot hasn't).  Eric's blog post indicates the CC-nonderivative license was chosen for a level of control that both Uniprot users and curators felt comfortable with but that wasn't overly restrictive.  That's also from 2006, so a lot has likely changed since then.

> This link explains the protocol:
> 
> * http://sciencecommons.org/projects/publishing/open-access-data-protocol/
> 
> 
>  Best regards,
> 
>    Adam

There is no mention of derived or modified works there, but the brief mention of derived works on the Database Protocol page indicates that they are possibly allowed, yes.  That may be an impediment to adoption by a database depending on what level of control they would like.  I'm curious to see who has adopted it.

chris

From david.breimann at gmail.com  Tue Jul  5 09:33:14 2011
From: david.breimann at gmail.com (David Breimann)
Date: Tue, 5 Jul 2011 12:33:14 +0300
Subject: [EMBOSS] Updating EMBOSS
Message-ID: <CAL64h6467Cp+EzT61R6ucbwaoFtaQRAgRHQs3L-8DET1RQCarw@mail.gmail.com>

Hello,

I had an old installation of EMBOSS on my linux server (EMBOSS 6.0.1
according to embossversion).
I downloaded EMBOSS 6.3.1, unpacked and compiled (following
http://emboss.sourceforge.net/download/#Gettingstarted), hoping this will
overwrite the older version.
However, embossversion still returns 6.0.1.
What should I do?

Thanks,
Dave


From ajb at ebi.ac.uk  Tue Jul  5 10:22:53 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Tue, 5 Jul 2011 11:22:53 +0100 (BST)
Subject: [EMBOSS] Updating EMBOSS
In-Reply-To: <CAL64h6467Cp+EzT61R6ucbwaoFtaQRAgRHQs3L-8DET1RQCarw@mail.gmail.com>
References: <CAL64h6467Cp+EzT61R6ucbwaoFtaQRAgRHQs3L-8DET1RQCarw@mail.gmail.com>
Message-ID: <55403.82.26.12.214.1309861373.squirrel@imap04.ebi.ac.uk>

Hello Dave,

It depends on where/how you installed the different versions.
If you had configured and installed using a prefix which specified
a directory root which was to contain only emboss:

   e.g.  ./configure --prefix=/fu/bar/emboss

then you can just delete the /fu/bar/emboss directory and reinstall.

If, however, you had installed EMBOSS using no prefix (such that it
would be installed under /usr/local) or specified any other shared
or system directory then the best means is usually to reinstall
the old version (see ftp://emboss.open-bio.org/pub/EMBOSS/old/)
on top of itself and then type:

  make uninstall

If it were me I'd then do the same with the new version and have a
nose-around to check that all traces of EMBOSS have been deleted,
then reinstall the new version.
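
For the nose-around, a minimal sketch, assuming a POSIX shell and a find
that supports -iname (GNU and BSD find both do). The helper name
list_emboss_traces and the default root /usr/local are assumptions made
for illustration; they are not part of EMBOSS.

```shell
# Sketch: list files or directories whose name mentions "emboss" below a root.
# list_emboss_traces is a hypothetical helper, not an EMBOSS tool.
list_emboss_traces() {
    root=${1:-/usr/local}
    # -iname matches case-insensitively, so EMBOSS and emboss both match.
    find "$root" -iname '*emboss*' 2>/dev/null | sort
}

# Usage:
#   list_emboss_traces /usr/local
```

Program binaries such as seqret do not carry "emboss" in their names, so
this is only a first pass; an empty listing does not prove that the bin
directory is clean.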

We do recommend, when installing EMBOSS from source, to install it
into its own directory (--prefix=/usr/local/emboss  is a favourite
example in administration documentation).
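
A rough sketch of how one might check which embossversion the shell
actually finds, followed by the private-prefix install as comments. The
helper name find_all_on_path is an assumption for illustration, and
/usr/local/emboss is only the example prefix mentioned above.

```shell
# Sketch: print every copy of a program found on PATH, in search order.
# The first line printed is the copy the shell will run.
# find_all_on_path is a hypothetical helper, not an EMBOSS tool.
find_all_on_path() {
    prog=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        if [ -x "$dir/$prog" ]; then
            printf '%s\n' "$dir/$prog"
        fi
    done
    IFS=$old_ifs
}

find_all_on_path embossversion

# Install into a private prefix (run in the unpacked EMBOSS source directory):
#   ./configure --prefix=/usr/local/emboss
#   make
#   make install
# Then put the new bin directory first on PATH:
#   PATH=/usr/local/emboss/bin:$PATH; export PATH
```

If an old copy is listed before the new one, the old version keeps
answering until it is uninstalled or PATH is reordered.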

HTH

Alan


> Hello,
>
> I had an old installation of EMBOSS on my linux server (EMBOSS 6.0.1
> according to embossversion).
> I downloaded EMBOSS 6.3.1, unpacked and compiled (following
> http://emboss.sourceforge.net/download/#Gettingstarted), hoping this will
> overwrite the older version.
> However, embossversion still returns 6.0.1.
> What should I do?
>
> Thanks,
> Dave
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss
>


From wo.granon at gmail.com  Thu Jul  7 10:33:49 2011
From: wo.granon at gmail.com (Wolfgang)
Date: Thu, 7 Jul 2011 12:33:49 +0200
Subject: [EMBOSS] Plasmid drawing
Message-ID: <CAOhDcVx+yZPGE4-zEHb8GeMd5eyLg1GX+V2CrHgWVjrtshSHeA@mail.gmail.com>

Hello,

is there any news on plasmid drawing (features and restriction sites) and
improvements to cirdna, following this message from 2005?
http://www.mail-archive.com/emboss at emboss.open-bio.org/msg00040.html

In our labs this is also a major reason why users do not switch completely
to EMBOSS.

Thanks,
Wolfgang


From pmr at ebi.ac.uk  Thu Jul  7 11:33:05 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 07 Jul 2011 12:33:05 +0100
Subject: [EMBOSS] Plasmid drawing
In-Reply-To: <CAOhDcVx+yZPGE4-zEHb8GeMd5eyLg1GX+V2CrHgWVjrtshSHeA@mail.gmail.com>
References: <CAOhDcVx+yZPGE4-zEHb8GeMd5eyLg1GX+V2CrHgWVjrtshSHeA@mail.gmail.com>
Message-ID: <4E159971.9070509@ebi.ac.uk>

Dear Wolfgang,

> are there any news to plasmid drawing (features and restriction sites) and
> improvement of cirdna, according to this message from 2005?
> http://www.mail-archive.com/emboss at emboss.open-bio.org/msg00040.html
>
> In our labs this is also a big point for users not to switch completely to
> emboss.

Very close to release date next week, so hard to do anything immediately.

However, we did try adding a report format (an output choice for 
restrict and other applications) to create an input file for cirdna or 
lindna.

Results at the time were poor, but I note we have revised both cirdna 
and lindna since.

I will test whether results have improved. One possibility would be to 
re-enable this format so you can test and give us feedback on the new 
release.

regards,

Peter


From hrh at fmi.ch  Thu Jul  7 11:47:13 2011
From: hrh at fmi.ch (Hans-Rudolf Hotz)
Date: Thu, 07 Jul 2011 13:47:13 +0200
Subject: [EMBOSS] Plasmid drawing
In-Reply-To: <4E159971.9070509@ebi.ac.uk>
References: <CAOhDcVx+yZPGE4-zEHb8GeMd5eyLg1GX+V2CrHgWVjrtshSHeA@mail.gmail.com>
	<4E159971.9070509@ebi.ac.uk>
Message-ID: <4E159CC1.9@fmi.ch>

Hi Peter,


We will be happy to help you test and give feedback, since we are in
a very similar situation to Wolfgang.


Regards, Hans


On 07/07/2011 01:33 PM, Peter Rice wrote:
> Dear Wolfgang,
>
>> are there any news to plasmid drawing (features and restriction sites)
>> and
>> improvement of cirdna, according to this message from 2005?
>> http://www.mail-archive.com/emboss at emboss.open-bio.org/msg00040.html
>>
>> In our labs this is also a big point for users not to switch
>> completely to
>> emboss.
>
> Very close to release date next week, so hard to do anything immediately.
>
> However, we did try adding a report format (an output choice for
> restrict and other applications) to create an input file for cirdna or
> lindna.
>
> Results at the time were poor, but I note we have revised both cirdna
> and lindna since.
>
> I will test whether results have improved. One possibility would be to
> re-enable this format so you can test and give us feedback on the new
> release.
>
> regards,
>
> Peter


From pmr at ebi.ac.uk  Thu Jul  7 12:07:19 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 07 Jul 2011 13:07:19 +0100
Subject: [EMBOSS] Plasmid drawing
In-Reply-To: <4E159CC1.9@fmi.ch>
References: <CAOhDcVx+yZPGE4-zEHb8GeMd5eyLg1GX+V2CrHgWVjrtshSHeA@mail.gmail.com>
	<4E159971.9070509@ebi.ac.uk> <4E159CC1.9@fmi.ch>
Message-ID: <4E15A177.6040201@ebi.ac.uk>

On 07/07/11 12:47, Hans-Rudolf Hotz wrote:
> Hi Peter,
>
>
> We will be happy to help you testing and give feedback, since we are in
> a very similar situation to Wolfgang.


I'm curious. How many sites on this list (as a rough sample) are still 
running GCG?

And how many are using some other commercial package for functions not 
in EMBOSS?

Could be a very useful guide to the new applications needed.

Peter


From s.newslists at gmail.com  Thu Jul  7 13:54:01 2011
From: s.newslists at gmail.com (Stefan)
Date: Thu, 7 Jul 2011 15:54:01 +0200
Subject: [EMBOSS]  Plasmid drawing
In-Reply-To: <CAECtV7PBqXv07m+VsN3E9yGDEzA6xoWvb9WWsw6PO99oSnROCQ@mail.gmail.com>
References: <CAOhDcVx+yZPGE4-zEHb8GeMd5eyLg1GX+V2CrHgWVjrtshSHeA@mail.gmail.com>
	<4E159971.9070509@ebi.ac.uk> <4E159CC1.9@fmi.ch>
	<4E15A177.6040201@ebi.ac.uk>
	<CAECtV7PBqXv07m+VsN3E9yGDEzA6xoWvb9WWsw6PO99oSnROCQ@mail.gmail.com>
Message-ID: <CAECtV7McouYpKX7tjnoM0+oP5r9_FxKM5s6efcJrKtdEm_qM2g@mail.gmail.com>

Hi Peter,

in our labs people also regret that they cannot use the EMBOSS suite
for such daily work. We use two different applications:

pDraw32 can draw plasmid maps. A very useful feature is that it can
generate a new plasmid out of two with given restriction enzymes. This
can avoid a lot of little mistakes.

ApE "A plasmid Editor" is very useful for finding features in the plasmid.
Often we get sequences where features such as the antibiotic
resistance are missing. This tool can quickly find them and draw
a nice plasmid map, including its restriction sites.

We would be happy to use the EMBOSS suite for all of this work.

I would also be happy to test.

Best regards,
Stefan

2011/7/7 Peter Rice <pmr at ebi.ac.uk>:
> On 07/07/11 12:47, Hans-Rudolf Hotz wrote:
>>
>> Hi Peter,
>>
>>
>> We will be happy to help you testing and give feedback, since we are in
>> a very similar situation to Wolfgang.
>
>
> I'm curious. How many sites on this list (as a rough sample) are still
> running GCG?
>
> And how many are using some other commercial package for functions not in
> EMBOSS?
>
> Could be a very useful guide to the new applications needed.
>
> Peter


From david.bauer at bayer.com  Thu Jul  7 12:58:38 2011
From: david.bauer at bayer.com (david.bauer at bayer.com)
Date: Thu, 7 Jul 2011 14:58:38 +0200
Subject: [EMBOSS] Antwort: Re:  Plasmid drawing
In-Reply-To: <4E15A177.6040201@ebi.ac.uk>
Message-ID: <OFACDB8C5F.07B28638-ONC12578C6.00468B55-C12578C6.00474971@bayer.de>

We use VectorNTI for plasmid documentation and in-silico cloning.
And as far as I know another widely used software for this purpose is
'Clone Manager' from 'Sci-Ed Software'. 

David.

emboss-bounces at lists.open-bio.org wrote on 07/07/2011 14:07:19:

> On 07/07/11 12:47, Hans-Rudolf Hotz wrote:
> > Hi Peter,
> >
> >
> > We will be happy to help you testing and give feedback, since we are in
> > a very similar situation to Wolfgang.
> 
> 
> I'm curious. How many sites on this list (as a rough sample) are still 
> running GCG?
> 
> And how many are using some other commercial package for functions not 
> in EMBOSS?
> 
> Could be a very useful guide to the new applications needed.
> 
> Peter


From cjfields at illinois.edu  Thu Jul  7 15:10:35 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Thu, 7 Jul 2011 10:10:35 -0500
Subject: [EMBOSS] Antwort: Re:  Plasmid drawing
In-Reply-To: <OFACDB8C5F.07B28638-ONC12578C6.00468B55-C12578C6.00474971@bayer.de>
References: <OFACDB8C5F.07B28638-ONC12578C6.00468B55-C12578C6.00474971@bayer.de>
Message-ID: <B7FACE16-3F45-4658-A476-842BAD410898@illinois.edu>

I think Geneious and the CLC tools can also draw plasmid maps.  Haven't used them extensively, though.

re: GCG, Accelrys stopped GCG development in June 2008; I haven't seen anyone take up the perpetual license (which allows use of GCG, but with outdated databases, etc).  Seems like everyone is implicitly being directed to EMBOSS.

chris

On Jul 7, 2011, at 7:58 AM, david.bauer at bayer.com wrote:

> We use VectorNTI for plasmid documentation and in-silico cloning.
> And as far as I know another widely used software for this purpose is
> 'Clone Manager' from 'Sci-Ed Software'. 
> 
> David.
> 
> emboss-bounces at lists.open-bio.org schrieb am 07/07/2011 14:07:19:
> 
>> On 07/07/11 12:47, Hans-Rudolf Hotz wrote:
>>> Hi Peter,
>>> 
>>> 
>>> We will be happy to help you testing and give feedback, since we are in
>>> a very similar situation to Wolfgang.
>> 
>> 
>> I'm curious. How many sites on this list (as a rough sample) are still 
>> running GCG?
>> 
>> And how many are using some other commercial package for functions not 
>> in EMBOSS?
>> 
>> Could be a very useful guide to the new applications needed.
>> 
>> Peter


From kitagawam at takara-bio.co.jp  Fri Jul  8 08:05:48 2011
From: kitagawam at takara-bio.co.jp (kitagawam at takara-bio.co.jp)
Date: Fri, 8 Jul 2011 17:05:48 +0900
Subject: [EMBOSS] Antwort: Re:  Plasmid drawing
In-Reply-To: <B7FACE16-3F45-4658-A476-842BAD410898@illinois.edu>
References: <OFACDB8C5F.07B28638-ONC12578C6.00468B55-C12578C6.00474971@bayer.de>
	<B7FACE16-3F45-4658-A476-842BAD410898@illinois.edu>
Message-ID: <678B3FABACE9F64B8FAF7A1045C3D67D4D943A52EE@tkrexmb1.central.takara.co.jp>

I wish to recommend IMC.
http://www.insilicobiology.jp/en/downloads

] -----Original Message-----
] From: emboss-bounces at lists.open-bio.org
] [mailto:emboss-bounces at lists.open-bio.org] On Behalf Of Chris Fields
] Sent: Friday, July 08, 2011 12:11 AM
] To: david.bauer at bayer.com
] Cc: emboss at lists.open-bio.org; emboss-bounces at lists.open-bio.org
] Subject: Re: [EMBOSS] Antwort: Re: Plasmid drawing
] 
] I think Geneious and the CLC tools can also draw plasmid maps.  Haven't
] used them extensively, though.
] 
] re: GCG, Accelrys stopped GCG development in June 2008; I haven't seen anyone
] take up the perpetual license (which allows use of GCG, but with outdated
] databases, etc).  Seems like everyone is implicitly being directed to
] EMBOSS.
] 
] chris
] 
] On Jul 7, 2011, at 7:58 AM, david.bauer at bayer.com wrote:
] 
] > We use VectorNTI for plasmid documentation and in-silico cloning.
] > And as far as I know another widely used software for this purpose is
] > 'Clone Manager' from 'Sci-Ed Software'.
] >
] > David.
] >
] > emboss-bounces at lists.open-bio.org schrieb am 07/07/2011 14:07:19:
] >
] >> On 07/07/11 12:47, Hans-Rudolf Hotz wrote:
] >>> Hi Peter,
] >>>
] >>>
] >>> We will be happy to help you testing and give feedback, since we are in
] >>> a very similar situation to Wolfgang.
] >>
] >>
] >> I'm curious. How many sites on this list (as a rough sample) are still
] >> running GCG?
] >>
] >> And how many are using some other commercial package for functions not
] >> in EMBOSS?
] >>
] >> Could be a very useful guide to the new applications needed.
] >>
] >> Peter


From friedman at cancercenter.columbia.edu  Wed Jul 13 20:56:29 2011
From: friedman at cancercenter.columbia.edu (Richard Friedman)
Date: Wed, 13 Jul 2011 16:56:29 -0400
Subject: [EMBOSS] getting files in GCG format with annotation
Message-ID: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu>

Dear Emboss list,

	I am learning to use Emboss after being a long-time GCG user.
The fetch command in GCG returns a file with the sequence in GCG format
plus annotation.

In EMBOSS I know how to get just the sequence in GCG format with seqret.
In EMBOSS I also know how to get the sequence plus annotation in the
default format.
What I would like to know is how to use EMBOSS to get sequence plus
annotation in GCG format, as in GCG.

Thanks and best wishes,
Rich
------------------------------------------------------------
Richard A. Friedman, PhD
Associate Research Scientist,
Biomedical Informatics Shared Resource
Herbert Irving Comprehensive Cancer Center (HICCC)
Lecturer,
Department of Biomedical Informatics (DBMI)
Educational Coordinator,
Center for Computational Biology and Bioinformatics (C2B2)/
National Center for Multiscale Analysis of Genomic Networks (MAGNet)
Room 824
Irving Cancer Research Center
Columbia University
1130 St. Nicholas Ave
New York, NY 10032
(212)851-4765 (voice)
friedman at cancercenter.columbia.edu
http://cancercenter.columbia.edu/~friedman/

I am a Bayesian. When I see a multiple-choice question on a test and I  
don't
know the answer I say "eeney-meaney-miney-moe".

Rose Friedman, Age 14


From pmr at ebi.ac.uk  Wed Jul 13 21:37:13 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 13 Jul 2011 22:37:13 +0100
Subject: [EMBOSS] getting files in GCG format with annotation
In-Reply-To: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu>
References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu>
Message-ID: <4E1E1009.101@ebi.ac.uk>

Dear Richard,

On 13/07/2011 21:56, Richard Friedman wrote:
> I am learning to use Emboss after being a long-time GCG user.
> The fetch command in GCG returns a file with the sequence in GCG format
> plus annotation.
>
> In EMBOSS I know how to get just sequence in GCG format with seqret.
> In EMBOSS I also know how to get the sequence plus annotation default
> format.
> What I would like to know is how using EMBOSS to get sequence plus
> annotation in GCG format
> like in GCG.

Hmmm ... by "annotation in GCG format" you mean the EMBL or Uniprot 
entry with gaps in the ". ." feature records?

The obvious question is why you need GCG format. GCG was not very clever 
in handling the annotation.

You can get the sequence plus annotation in one file with:

seqret -feature somedb:someid outfile.seq -osformat embl (or swiss)

That gives you one file with "sequence plus annotation"... and you can 
use the annotation.

You can also get the whole entry text with entret somedb:someid
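
The two retrieval commands above can be wrapped in a small shell sketch.
The function name fetch_annotated is hypothetical, and somedb:someid
stands for whatever database and entry are defined at your site; only
seqret, entret, -feature and -osformat come from the commands above.

```shell
# Sketch: write one entry, sequence plus features, in a chosen format.
# fetch_annotated is a hypothetical wrapper, not part of EMBOSS.
fetch_annotated() {
    usa=$1              # e.g. somedb:someid
    outfile=$2          # e.g. outfile.seq
    format=${3:-embl}   # embl, or swiss for protein entries
    if ! command -v seqret >/dev/null 2>&1; then
        echo "seqret not found on PATH" >&2
        return 127
    fi
    seqret -feature "$usa" "$outfile" -osformat "$format"
}

# Usage (needs an EMBOSS installation with the database configured):
#   fetch_annotated somedb:someid outfile.seq embl
#   entret somedb:someid entry.txt    # exact text of the original entry
```

The guard only reports a missing seqret; it does not check that the
database name is defined.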

Hope that helps - and if not, please do ask again!

Peter Rice
EMBOSS Team


From friedman at cancercenter.columbia.edu  Thu Jul 14 16:08:17 2011
From: friedman at cancercenter.columbia.edu (Richard Friedman)
Date: Thu, 14 Jul 2011 12:08:17 -0400
Subject: [EMBOSS] getting files in GCG format with annotation
In-Reply-To: <4E1E1009.101@ebi.ac.uk>
References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu>
	<4E1E1009.101@ebi.ac.uk>
Message-ID: <134BC19D-E1C0-4788-A700-5B212192AD6B@cancercenter.columbia.edu>

Dear Peter and Guy,

	I guess I just cling to the familiar. The output formats given by
EMBOSS are fine.
One more obscure question:

As far as I can see, the output from "seqret -feature" and "entret"  
are the same.
Are there any differences?

Thanks and best wishes,
Rich


On Jul 13, 2011, at 5:37 PM, Peter Rice wrote:

> Dear Richard,
>
> On 13/07/2011 21:56, Richard Friedman wrote:
>> I am learning to use Emboss after being a long-time GCG user.
>> The fetch command in GCG returns a file with the sequence in GCG  
>> format
>> plus annotation.
>>
>> In EMBOSS I know how to get just sequence in GCG format with seqret.
>> In EMBOSS I also know how to get the sequence plus annotation default
>> format.
>> What I would like to know is how using EMBOSS to get sequence plus
>> annotation in GCG format
>> like in GCG.
>
> Hmmm ... by "annotation in GCG format" you mean the EMBL or Uniprot  
> entry with gaps in the ". ." feature records?
>
> The obvious question is why you need GCG format. GCG was not very  
> clever in handling the annotation.
>
> You can get the sequence plus annotation in one file with:
>
> seqret -feature somedb:someid outfile.seq -osformat embl (or swiss)
>
> That gives you one file with "sequence plus annotation"... and you  
> can use the annotation.
>
> You can also get the whole entry text with entret somedb:someid
>
> Hope that helps - and if not, please do ask again!
>
> Peter Rice
> EMBOSS Team


From pmr at ebi.ac.uk  Thu Jul 14 16:14:03 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 14 Jul 2011 17:14:03 +0100
Subject: [EMBOSS] getting files in GCG format with annotation
In-Reply-To: <134BC19D-E1C0-4788-A700-5B212192AD6B@cancercenter.columbia.edu>
References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu>
	<4E1E1009.101@ebi.ac.uk>
	<134BC19D-E1C0-4788-A700-5B212192AD6B@cancercenter.columbia.edu>
Message-ID: <4E1F15CB.5020205@ebi.ac.uk>

On 14/07/2011 17:08, Richard Friedman wrote:
> Dear Peter and Guy,
>
> I guess I just cling to the familiar. The output formats given by emboss
> are fine,
> One more obscure question:
>
> As far as I can see, the output from "seqret -feature" and "entret" are
> the same.
> Are there any differences?

Not necessarily ...

entret reports the exact text of the original input.

seqret -feat with the same format as the input will rewrite everything 
using the output format. If that comes out identical then we are usually 
very happy (we do try to preserve everything in EMBL/GenBank and 
Swissprot formats) but there is no absolute guarantee.

Also, strictly speaking, the output of entret is defined as "text" while 
the output of seqret is defined as "sequence" which leads to some 
distinctions - for example, you cannot choose an alternative output 
format for entret.
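
One way to see whether the two outputs differ for a given entry is to
write both to files and compare them. This is only a sketch:
report_difference is a hypothetical helper, and the file names and
somedb:someid are placeholders.

```shell
# Sketch: say whether two files are byte-identical; if not, show the
# first lines of the diff. report_difference is a hypothetical helper.
report_difference() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
        diff "$1" "$2" | head -n 20 || true
    fi
}

# Usage (needs EMBOSS and a configured database):
#   entret somedb:someid entry.txt
#   seqret -feature somedb:someid rewritten.embl -osformat embl
#   report_difference entry.txt rewritten.embl
```

A "different" result is not necessarily a problem, since seqret rewrites
the entry in the output format rather than copying the original text.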

Have fun with EMBOSS. Look out for the new release tomorrow!

regards,

Peter


From gbottu at vub.ac.be  Thu Jul 14 17:56:14 2011
From: gbottu at vub.ac.be (Guy Bottu)
Date: Thu, 14 Jul 2011 19:56:14 +0200
Subject: [EMBOSS] getting files in GCG format with annotation
In-Reply-To: <4E1E1009.101@ebi.ac.uk>
References: <642B5FF8-AA56-4C8E-B88B-1A74C22676C0@cancercenter.columbia.edu>
	<4E1E1009.101@ebi.ac.uk>
Message-ID: <4E1F2DBE.70700@vub.ac.be>

	Dear Richard,

I agree with Peter that it is not obvious what the GCG simple sequence
format is still useful for. To give the sequence as input to other
software you can use seqret with whatever sequence format is needed. To
just read the annotation you can use entret. To give the features as
input to other software you can use seqret with the parameter -feature
(GCG used the GCG RSF format for this, but it did not become popular
outside GCG/SeqLab). I can add that a widely used format for features
is GFF, and you can do:

seqret -feature somedb:someid outfile.seq -osformat gff -oufo somegfffile

You will obtain a file somegfffile in GFF format (with just the 
features, not the sequence). There is a lot of software that can use it.
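
As a rough illustration of how downstream code might consume such a
file, here is a minimal Python sketch that reads the nine tab-separated
GFF columns. The Feature tuple and read_gff function are hypothetical
names chosen for this example, and the sketch does not decode the
attributes column or handle every GFF variant.

```python
from collections import namedtuple

# The nine standard GFF columns, in order.
Feature = namedtuple(
    "Feature", "seqid source type start end score strand phase attributes"
)

def read_gff(lines):
    """Yield one Feature per data line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        cols[3], cols[4] = int(cols[3]), int(cols[4])
        yield Feature(*cols)

# Usage (the file name is the one from the command above):
# with open("somegfffile") as handle:
#     for feature in read_gff(handle):
#         print(feature.type, feature.start, feature.end)
```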

	Regards,
	Guy Bottu,
	U.L.B.


From ajb at ebi.ac.uk  Fri Jul 15 08:54:26 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Fri, 15 Jul 2011 09:54:26 +0100 (BST)
Subject: [EMBOSS] EMBOSS 6.4.0 released
Message-ID: <53026.82.26.12.214.1310720066.squirrel@imap04.ebi.ac.uk>

EMBOSS Release 6.4.0

This release is now available on our OBF ftp server.

UNIX version:
   ftp://emboss.open-bio.org/pub/EMBOSS/

mEMBOSS (MS Windows version):
   ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.0-setup.exe

It includes major extensions to the type and number of data resources
available to EMBOSS users.

In addition, three books are published by Cambridge University Press:

EMBOSS User's Guide: Practical Bioinformatics
http://www.cambridge.org/gb/knowledge/isbn/item5979294/?site_locale=en_GB

EMBOSS Developer's Guide: Bioinformatics Programming
http://www.cambridge.org/gb/knowledge/isbn/item5979293/?site_locale=en_GB

EMBOSS Administrator's Guide: Bioinformatics Software Management
http://www.cambridge.org/gb/knowledge/isbn/item5979238/?site_locale=en_GB

They are comprehensive and definitive guides to administering,
developing and using EMBOSS. We hope they will prove useful to the
EMBOSS community and to anyone providing training courses covering
EMBOSS.

In addition to these publications we have a new website.

http://emboss.open-bio.org

Updates for the new features in 6.4.0 will be made available soon on
the new EMBOSS website, with tutorials to be developed on the EBI
e-Learning Portal.

Contents:

1.0 New in 6.4.0
1.1 Server definitions
1.2 Access methods
1.3 emboss.standard file
1.4 New data types
1.5 New query language
1.6 Hash tables and lists
1.7 Cross-references
1.8 URL generation
1.9 Database index compression
1.10 Database indexing applications
1.11 Generating server cache files
1.12 Server and database attributes
1.13 HTTP redirection
1.14 EMBOSS version number
1.15 ACD list 'select all'
2.0 EDAM Ontology
2.1 EDAM in ACD files
2.2 EDAM applications
3.0 DRCAT Data Resource Catalogue
4.0 NCBI Taxonomy
5.0 Maintenance
6.0 Installation Notes
6.1 UNIX
6.1.1 MySQL
6.1.2 PostgreSQL
6.1.3 axis2c
6.1.4 Other optional library software
6.1.5 eprimer3 and eprimer32
6.2 mEMBOSS
7.0 New EMBASSY applications
8.0 Future

1.0 New in 6.4.0

1.1 Server definitions

Servers can be defined, in a similar style to a database definition,
but covering all databases available from a single server. The server
definition names a cache file describing each database, its format
and its query fields. Cache files for a core set of public servers are
included in the release.

1.2 Access methods

New access methods are provided, including Ensembl, BioMart, DAS, SOAP
web services (EBI wsdbfetch and ebeye), REST web services (EBI
dbfetch), and GMOD/CHADO. Ensembl access uses code contributed by
Michael Schuster in the Ensembl team at EBI. This code is updated
after each Ensembl API release. Some of these access methods were
available but only partly implemented in the previous release. They
now support standard server and database definitions and are open for
further development.

Data access methods have been restructured to use "text" access for
any method which seeks a position in a file and then opens it for
reading. This includes reading from a URL and returning a pointer to
the start of the output. A few datatype-specific access methods remain,
for example reading sequence data from a PIR/NBRF/GCG format database,
or from the NCBI taxonomy files, or access to database systems via SQL
or DAS.

1.3 emboss.standard file

Previous releases depended on a user defining databases in their
emboss.default file. Release 6.4.0 provides a new emboss.standard
file defining the core servers and databases, and standard resource
settings for database indexing. The local emboss.default file is only
needed for local database definitions and settings.

The configuration files emboss.standard, emboss.default and
~/.embossrc resolve variable references (e.g. in directory names)
during parsing. Extensions to the syntax of these files include ALIAS
to give secondary names to a database. IF, IFDEF, ELSE and ENDIF
directives allow conditional inclusion of sections of the file
dependent on variable settings. Special variables EMBOSS_AXIS2,
EMBOSS_MYSQL, EMBOSS_POSTGRESQL and EMBOSS_SQL are automatically
created for this purpose.

New variable EMBOSS_STANDARD is automatically defined to be the
share/EMBOSS install directory (or the emboss source code directory if
the package is not installed). This is by default where the
emboss.standard file and server cache files are expected to be
found. The value is reported by "embossversion -full".

1.4 New data types

New data types are available as inputs and outputs of
applications. Each has a simple definition including qualifiers
-iformat for input format and -oformat for output format. The maxreads
attribute defines whether the application expects to read a single
entry (maxreads: 1) or loop over multiple entries (the default). This
is simpler than the sequence and seqall definitions for sequence which
are widely used and will remain unchanged.

* text and outtext: the text of an entry for which EMBOSS has (to
   date) no specialised parser

* obo and oboout: terms in an OBO ontology. Six ontologies are
   included in the release as source and index files (EDAM, GO, SO, RO,
   PW, ECO). We plan to add more and welcome suggestions for inclusion.

* resource and resourceout: entries in the Data Resource Catalogue

* taxon and outtaxon: nodes in the NCBI taxonomy which is indexed and
   included in the release

* url and outurl: a database name from the Data Resource Catalogue, and
   an identifier, converted into a URL which can be pasted into a browser
   to cover cases where the URL does not return simple text or HTML data.

* for future extension, assembly and variation datatypes are defined
   for development and use in a later release.

1.5 New query language

All data types use a common query language. The existing "USA"
(uniform sequence address) syntax is still valid for sequence data,
but is also now used for features, obo terms, data resources, taxons
and plain text data.

In response to comments from our Scientific Advisory Board, we have
extended the query language to cover multiple identifiers, multiple
fields, and operators to combine elements of the query.

* id lists: dbname:{ida,idb,idc} searches for 3 identifiers (id,
   accession, etc.)  in a database

* or operator: dbname-{id:h* | des:hemoglobin} searches for all
   entries with identifiers starting with 'h' plus any others that
   include the word 'hemoglobin' in their descriptions.

* not operator: dbname-{id:h* ! des:hemoglobin} searches for all
   entries with identifiers starting with 'h' that do not include the
   word 'hemoglobin' in their descriptions.

* and operator: dbname-{id:h* & des:hemoglobin} searches for all
   entries with identifiers starting with 'h' that also include the
   word 'hemoglobin' in their descriptions.

* eor operator: dbname-{id:h* ^ des:hemoglobin} searches for all
   entries with identifiers starting with 'h' that do not include the
   word 'hemoglobin' in their descriptions, and all those starting with
   another character that do include the word 'hemoglobin' in their
   description. That is, it matches entries that satisfy exactly one
   of the two terms.
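
The four operators behave like set operations on the lists of entries
matched by each term. The following Python sketch illustrates that
reading only; the hit sets and the combine function are made up for the
example and are not EMBOSS code.

```python
# Hypothetical hit sets: entry identifiers matched by each query term.
id_hits = {"h1", "h2", "h3"}    # stands in for id:h*
des_hits = {"h2", "h3", "x9"}   # stands in for des:hemoglobin

def combine(left, right, op):
    """Combine two hit sets with one of the query operators."""
    if op == "|":      # or: entries in either set
        return left | right
    if op == "&":      # and: entries in both sets
        return left & right
    if op == "!":      # not: entries in the first set but not the second
        return left - right
    if op == "^":      # eor: entries in exactly one of the two sets
        return left ^ right
    raise ValueError(f"unsupported operator: {op!r}")

for op in "|&!^":
    print(op, sorted(combine(id_hits, des_hits, op)))
```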

Query operators are not supported by all access methods. Where an
operator is invalid an error message gives the list of valid
operators. For example, the query syntax for SRS (srs, srswww access)
does not include the exclusive-or (^) operator but supports the
others as these are standard elements in SRS queries.

The query language only allows a single database name in the
query. This allows EMBOSS to combine query results for a single query
expression. To query multiple databases a list file input with one
database query on each line can be used.

Indexed strings containing non-alphabetic characters including white
space are simplified by converting a run of such characters to a
single underscore. The same transformation is applied to a query
string for the dbx (emboss) access method. This is especially useful
for brackets and other characters in data resource names in DRCAT.
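
The transformation described above can be sketched in Python as
follows. This is an illustration, not the EMBOSS implementation: the
function name simplify_key is invented, and whether digits count as
characters to keep is an assumption exposed here as the keep_digits
parameter. Any trimming of leading or trailing underscores is not
covered.

```python
import re

def simplify_key(text, keep_digits=True):
    """Replace each run of unwanted characters with a single underscore.

    keep_digits is an assumption of this sketch: the release notes say
    "non-alphabetic", which read literally would also replace digits.
    """
    pattern = r"[^A-Za-z0-9]+" if keep_digits else r"[^A-Za-z]+"
    return re.sub(pattern, "_", text)

print(simplify_key("Protein Data Bank (PDB)"))
```

Applying the same function to both the indexed string and the query
string is what makes the two comparable.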

We hope that the extended query language and the index file
compression will increase the use of locally indexed data in EMBOSS
installations, and welcome feedback on further developments of the
query language and indexing.

1.6 Hash tables and lists

The new query language is supported by extensions to tables and lists
in the libraries. Tables can now be automatically resized. Merge
operations on two tables combine their contents using the same
operations (or, and, not, eor) as the query language. By resizing the
tables first this operation can be made highly efficient. Destructors
can be defined for list data and for table keys and data to
automatically clean up after use. Tables with string keys can use C
char* or string object queries in all cases.

Lists and tables can now be reference counted, avoiding unnecessary
copying especially in the Ensembl API code.

1.7 Cross-references

Cross-references from UniProt/SwissProt and EMBL/GenBank/DDBJ are
collected by extended parsers. New application seqxref reports the
cross-references. New application seqxrefget creates a script to
retrieve cross-referenced data as the original entries, using entret
for sequence data, feattext for feature data, ontotext for ontology
terms, textget for text and urlget for data where "HTML" is the only
available format.

1.8 URL generation

New application urlget returns a query URL from DRCAT with one or more
identifiers. Where data is from a UniProt/SwissProt or
EMBL/GenBank/DDBJ entry the DRCAT entry definition of the original
cross-reference is used to select from several possible identifier
terms in EDAM in order to choose the correct query.

1.9 Database index compression

Indexes created by dbxflat or dbxfasta are now, by default, compressed
automatically. These files, especially for secondary text indexes such
as description, taxonomy or keyword, could be very sparse. Up to 95%
space savings were achieved in some cases. The indexes are still
updatable by code which uncompresses, updates, and recompresses
on-the-fly using a copy of the index.
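
To see why sparse index pages compress so well, consider the Python
sketch below. It uses zlib purely as a stand-in (the compression scheme
used by the dbx indexes is not described here), and the page size and
page contents are hypothetical.

```python
import zlib

PAGE_SIZE = 2048  # hypothetical page size for this illustration

def sparse_page(entries, size=PAGE_SIZE):
    """Build a fixed-size page holding a few bytes of data, zero-padded."""
    page = bytearray(size)
    page[:len(entries)] = entries
    return bytes(page)

page = sparse_page(b"hemoglobin\x00\x00\x00\x2a")
packed = zlib.compress(page)

# Round trip: uncompress, then the page could be updated and recompressed.
assert zlib.decompress(packed) == page

saving = 1 - len(packed) / len(page)
print(f"{len(page)} -> {len(packed)} bytes ({saving:.0%} saved)")
```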

1.10 Database indexing applications

New indexing applications dbxedam (EDAM), dbxresource (DRCAT), dbxtax
(NCBI taxonomy) and dbxobo (any OBO ontology) are added for the new
data resources provided as standard. Users can install new releases of
the source data and run these applications to update the index files.

Application dbxflat can now index fastq format. This was included in
6.3.1 as a special addition for one user to test and is now fully
supported.

New applications dbxreport and dbxstat report on the overall and
detailed content of dbx database indexes.

In database indexing applications, the default "resource" name is one
included in the emboss.standard file. Users can continue to define
their own resource files. Indexing "resource" definitions can now
specify the maximum length of any field, and the page size and cache
size for any field, using attributes with the field name as a prefix.

1.11 Generating server cache files

New applications for major access methods query a server (for example,
the DAS registry or Ensembl) to update the server cache file with a
current set of database definitions. When run by the system
administrator these can update the site-wide cache file, but they can
also be run by an individual user to create a user-specific set of
databases. The cache files are time stamped. EMBOSS uses the most
recent system or user file.
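
The "most recent file wins" rule can be sketched in Python as below.
This is only an illustration: the function name newest_cache is
invented, and the file modification time is used as a stand-in for
whatever time stamp EMBOSS records for its cache files.

```python
import os

def newest_cache(system_file, user_file):
    """Return whichever existing cache file is newer, or None if neither exists.

    Assumption: modification time stands in for the cache time stamp.
    """
    candidates = [p for p in (system_file, user_file)
                  if p and os.path.exists(p)]
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)
```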

1.12 Server and database attributes

New applications showserver and servertell describe all servers or the
attributes of a single named server. We expect to extend these
applications once we have feedback on the most useful information they
should report. New application dbtell similarly reports on the
attributes of a single named database.

Database (and server) definitions can use an attribute more than once
if it is defined as "multiple". These include a new "field:" attribute
which gives the name and description of a query field. A list of
"field:" attributes supersedes the old "fields:" attribute which listed
all query field names but allowed no further annotation.

Database field names are extended from the original fixed set of "SRS
sequence" fields to any name. "id" and "acc" are assumed to be the
names of identifier and accession fields. The "hasaccession" attribute
is set automatically for databases where no "acc" field is found,
avoiding some error messages where the attribute has been omitted.

1.13 HTTP redirection

Data retrieval using HTTP now checks the returned header for redirects
and automatically replaces the results with the output from the
redirected URL. Where redirected URLs were found in standard database
definitions (e.g. the EBI's dbfetch service) these have been replaced
by the current URL. We have also seen redirects from case-sensitive
servers which redirect a lower case accession number to one in upper
case in the same URL.
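
The redirect handling described above can be sketched in Python as
follows. It is a simplified illustration, not the EMBOSS code: the
function names are invented, the fetch callable is a placeholder for a
real HTTP request, and the limit of five hops is an arbitrary choice.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}

def next_url(url, status, headers):
    """Return the URL to fetch next, or None if the response is final."""
    if status not in REDIRECT_CODES:
        return None
    # Header names are case-insensitive.
    location = {k.lower(): v for k, v in headers.items()}.get("location")
    if location is None:
        return None
    return urljoin(url, location)  # also resolves relative Location values

def resolve(url, fetch, max_hops=5):
    """Follow redirects using fetch(url) -> (status, headers, body)."""
    for _ in range(max_hops):
        status, headers, body = fetch(url)
        target = next_url(url, status, headers)
        if target is None:
            return url, body
        url = target
    raise RuntimeError("too many redirects")
```

With a server that redirects a lower case accession to upper case,
resolve would return the upper case URL together with the entry text.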

1.14 EMBOSS version number

The EMBOSS version number now has 4 digits (6.4.0.0). The fourth digit
is only there so that the Windows port (mEMBOSS) shows the same
version number for QA testing. In mEMBOSS the final digit is the build
number. QA tests for mEMBOSS now use the same test definition and
qatest script as on Linux. mEMBOSS file handling and reporting has
been adapted to support POSIX and Windows style paths.

1.15 ACD list 'select all'

In ACD files, a list or selection definition can default to "*" for
"select all" if the "minimum" attribute allows all terms to be
selected.

2.0 EDAM Ontology

EDAM is a new ontology from the EMBRACE project now further developed
by Jon Ison in the EMBOSS team. EDAM describes terms for topics (for
applications and data), operations (algorithms), formats, identifiers
and data (semantic descriptions of data content). EDAM terms are used
throughout this release: to annotate all ACD files at the application,
input, parameter and output levels; to annotate data resources and
their web queries in the Data Resource Catalogue; and to annotate
database and server definitions.

2.1 EDAM in ACD files

ACD files are annotated extensively with EDAM terms using the term id
and the human-readable name. The EMBOSS application groups have been
extended to match the EDAM topic annotations, with some applications
moving to different or new groups. EDAM has been used to validate
these groups by comparing the topics hierarchy with the group
designations.

2.2 EDAM applications

EDAM can be queried within any specific namespace by new applications
edamname and edamdef.

EDAM and other ontologies are supported by new applications (ontoget,
ontotext, ontodown, ontoup, ontogetsibs, ontogetcommon, ontogetroot,
ontogetobsolete, ontoisobsolete, ontocount).

New applications search EDAM term names and definitions, retrieve all
matching terms and their descendants, and compare to: applications
(wosstopic, wossoperation, wossinput, wossoutput, wossdata); data
resources (drfindresource, drfindid, drfindformat, drfinddata); and
related EDAM terms (edamhasinput, edamhasoutput, edamisid,
edamisformat, edamissource).

3.0 DRCAT Data Resource Catalogue

DRCAT, the Data Resource Catalogue, is included in this release. DRCAT
started as a description of databases found as cross-references in
UniProt/SwissProt, extended by adding databases found as
cross-references in EMBL/GenBank/DDBJ, plus others from Nucleic Acids
Research, ELIXIR, and other sources. Any database in DRCAT can be used
by name from an EMBOSS application, returning sequence, feature, or
text if a suitable data format is defined for any query, or creating a
URL which can be pasted into a browser where the results are, for
example, a graphical display using javascript which EMBOSS cannot
interpret. We aim to further extend and improve DRCAT in future
releases.

4.0 NCBI Taxonomy

Taxonomy data from the NCBI taxonomy is included as standard in the
release. New applications retrieve single nodes and their ancestors
and descendants (taxget, taxgetup, taxgetdown, taxgetspecies,
taxgetrank).


5.0 Maintenance

Application digest has been renamed pepdigest to avoid a clash with
another utility. The name is also in keeping with the EMBOSS naming of
other protein analysis applications.

Sequence and features formats have been reviewed and updated,
especially GFF3, GenPept, SAM, BAM and treecon. GFF3 output now more closely
follows the official standard, including the escaping of special
characters in the tag/value final column. GFF3 ID and Parent tags are
supported.

Features with exons are now stored as a list of exon subfeatures.
This change allows easier sorting of features by location, keeping
groups of features together, and has simplified the generation of
several feature output formats.

Graphical output for more than one input sequence has been corrected
and enhanced.

The lindna application has been adjusted to correctly relocate
overlapping text and to generate a clean sequence ruler for any range
of positions. New report formats allow reported hits (-rformat draw)
and restriction sites (-rformat restrict) to be plotted by lindna. We
expect to work further on the views that these outputs generate.

The einverted application had a bug (also in the original version)
when an inverted repeat maximum score was close to the edge of the
search window. This was seen only at low threshold scores. Searches
with low threshold scores can be expected to yield slightly different
choices of hits.

In ACD files, the "gui" and "batch" application attributes are assumed
to be "true" if missing. Previous releases defined them as "false"
internally, but fortunately no parsers seem to have used the internal
default value.

Database indexes created by the dbx programs now include a count of
unique and total keys. The text index files also report the type as
"Identifier" or "Secondary" and whether the index is compressed.

EMBOSS configuration now uses autoheader and has less dependency on
the version of libtool.

6.0 Installation notes

6.1 UNIX

The size of the EMBOSS package has shot up by approximately 60MB
compared with the last major release. This is largely due to
pre-supplied data and index files for ontology/taxonomy/etc.  A
typical installation size (shared images) is approximately 360MB.

Though not a requirement of EMBOSS there are some associated
packages which may be installed prior to configuration that
will allow you to use some optional access methods.

6.1.1 MySQL

This is used, for example, by the Ensembl access code. It will be
automatically configured if the (MySQL-supplied) 'mysql_config'
application is found in the PATH and if the associated development
files (compiler headers etc) are also installed. As an example, for
Linux systems, both will be provided by installing the mysql-devel
(RPM distributions) or mysql-dev (Debian-based distributions)
package. If your MySQL installation is in some other location then
you can specify it using the --with-mysql= configuration switch.

6.1.2 PostgreSQL

This is used by some servers (e.g. flybase/genedb). Similar
considerations apply to those described for MySQL above.
Auto-detection is based on the presence in the PATH of 'pg_config';
the dev[el] files must be installed; and the --with-postgresql
configuration switch can be used for arbitrary locations.

6.1.3 axis2c

EMBOSS optionally uses the 1.6.0 release of Axis2C for
retrieval from SOAP servers:

  http://axis.apache.org/axis2/c/core/

There is a Linux binary distribution but, even so, Linux
users may find themselves having to install from
source (and may need to do an 'autoreconf -fi' prior to
configuration to fix a subsequent compilation error on some
systems).

Auto-detection (by EMBOSS) of this package is based on the
presence of a pkgconfig file that axis2c installs. It is
advised that you install pkgconfig if not already installed
(it usually is pre-installed on Linux systems). EMBOSS has a
--with-axis2c= configure switch if you install axis2c into
a location other than /usr or /usr/local (typically).

6.1.4 Other optional library software

Installation of libraries for PNG (libpng/libgd) and PDF (libhpdf
aka libharu) follow considerations given in previous releases and
should be familiar to EMBOSS administrators by now.

6.1.5 eprimer3 and eprimer32

The Primer3 authors have released a 2.x.x version which differs
significantly from the 1.x.x series. Unfortunately the executable is
called the same for both releases (primer3_core).  EMBOSS 6.4.0
provides two wrappers for these releases; eprimer3 is for the 1.x.x
version and requires the primer3 executable to be called
'primer3_core' (this has always been the case); eprimer32 is for the
2.x.x version and requires the primer3 executable to be called
primer32_core.

This may involve some minor symlinking and/or directory/PATH
reorganisation by administrators.
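
One possible way to do the symlinking is sketched below in Python. The
function name and the example paths are hypothetical; the only names
taken from the text are 'primer3_core' and 'primer32_core'. The sketch
assumes a Primer3 2.x build is already installed somewhere outside the
directory that holds the 1.x executable.

```python
import os

def link_primer32(primer3_v2_binary, bin_dir):
    """Create bin_dir/primer32_core pointing at a Primer3 2.x executable."""
    link = os.path.join(bin_dir, "primer32_core")
    if os.path.lexists(link):
        return link  # leave an existing file or link untouched
    os.symlink(os.path.abspath(primer3_v2_binary), link)
    return link

# Usage (hypothetical paths):
# link_primer32("/opt/primer3-2.x/src/primer3_core", "/usr/local/bin")
```

The bin_dir given must be in the PATH for eprimer32 to find the link.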


6.2 mEMBOSS

A typical installation executable is approximately 70MB and results
in an installation size of approximately 570MB.

MySQL, PostgreSQL, Axis2c, libhpdf (etc) come pre-supplied as part of
the mEMBOSS installation.

The QA test suite has been extended to automatically find and test
both developer and end-user installations of mEMBOSS.

Note that, with the new server definitions in place (described above),
the old SRS database definitions have been removed. You can now access
databases using (e.g.) 'dbfetch:uniprotkb:opsd_human' as an ID. Such
retrieval is much faster than the previously supplied SRS definitions.

7.0 New EMBASSY applications

We have provided a wrapper package for the recently released clustal
omega software which must, of course, also be installed.  We will add
new releases of MIRA and VIENNA at a later date, when the new versions
of the original packages are released and integrated.

8.0 Future development

EMBOSS is fully funded until the end of December. We have an ambitious
schedule of further developments planned for this period. There will
be a further release of EMBOSS at the end of the year.

We welcome any and all suggestions from our user and developer
communities for immediate needs and future directions.

At the end of this year the EMBOSS team will be leaving EBI. Peter
Rice's maximum 9 year tenure is coming to an end. We do not yet know
where we will be from January and are open to suggestions for ways to
host and/or to fund further EMBOSS development and for potentially
useful partnerships and collaborations to continue the advances we
have made.

We can most certainly guarantee that we will continue to maintain the
existing code base and the latest releases.


Alan


From rothenbuhler at xoma.com  Mon Jul 25 23:42:28 2011
From: rothenbuhler at xoma.com (Jake Rothenbuhler)
Date: Mon, 25 Jul 2011 16:42:28 -0700
Subject: [EMBOSS] Algorithms for pI and molecular weight in pepstats
Message-ID: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com>

Hello,

 
What are the algorithms used to compute the molecular weight and
isoelectric point in pepstats? We are currently using pepstats to
measure these properties in our in-house bioinformatics tools and some
users are concerned because the results can differ from those returned
by ExPASy.

 
Thanks in advance,

 
Jake Rothenbuhler

Bioinformatics Programmer/Analyst

XOMA (US) LLC

(510) 204-7452

 


From pmr at ebi.ac.uk  Tue Jul 26 07:28:14 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 26 Jul 2011 08:28:14 +0100
Subject: [EMBOSS] Algorithms for pI and molecular weight in pepstats
In-Reply-To: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com>
References: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com>
Message-ID: <4E2E6C8E.5040502@ebi.ac.uk>

On 26/07/2011 00:42, Jake Rothenbuhler wrote:

> What are the algorithms used to compute the molecular weight and
> isoelectric point in pepstats? We are currently using pepstats to
> measure these properties in our in-house bioinformatics tools and some
> users are concerned because the results can differ from those returned
> by ExPASy.

There was discussion on this last year on this list too.

There is no single correct answer. Molecular weights can use the average 
value for each amino acid to calculate the molecular weight of a 
protein, or monoisotopic values to calculate peptide masses for 
mass-spec data. Pepstats has a command line option -mono to use the 
monoisotopic weights. We use amino acid molecular weights from ExPASy 
findmod in the calculations.
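
To show how much the choice of mass table matters, here is a small
Python sketch. The residue masses are commonly tabulated values for
three amino acids only and are not taken from the pepstats data files;
the function name peptide_mass is invented for the example.

```python
# Residue masses in daltons (amino acid minus one water), small subset.
# Assumption: commonly tabulated values, not read from EMBOSS data files.
AVERAGE = {"G": 57.0519, "A": 71.0788, "S": 87.0782}
MONOISOTOPIC = {"G": 57.02146, "A": 71.03711, "S": 87.03203}
WATER = {"average": 18.0153, "mono": 18.01056}

def peptide_mass(seq, mono=False):
    """Sum the residue masses and add one water for the free termini."""
    table = MONOISOTOPIC if mono else AVERAGE
    water = WATER["mono" if mono else "average"]
    return sum(table[aa] for aa in seq) + water

print(round(peptide_mass("GAS"), 4))             # average mass
print(round(peptide_mass("GAS", mono=True), 4))  # monoisotopic mass
```

Even for a tripeptide the two values differ in the first decimal place,
and the gap grows with sequence length.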

The isoelectric point can be calculated for various conditions. When I 
checked last, ExPASy's protparam was set up for the isoelectric focusing 
phase of 2D gels under high urea conditions. It was unclear at the time 
where to find all the values needed to reproduce their calculation.

We would like to update EMBOSS's protein property calculations, possibly 
with additional options or alternative parameter sets.

Any suggestions from anyone on the list?

regards,

Peter Rice
EMBOSS Team


From ajb at ebi.ac.uk  Tue Jul 26 15:24:35 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Tue, 26 Jul 2011 16:24:35 +0100 (BST)
Subject: [EMBOSS] mEMBOSS 6.4.0.1 available
Message-ID: <53274.82.26.12.214.1311693875.squirrel@imap04.ebi.ac.uk>

This is a bugfix release for the MS Windows version of EMBOSS,
primarily to fix a problem printing very long ('long long') integers.
Though most users would be unlikely to hit this problem an
uninstall/reinstall is nevertheless recommended.

The release also contains a few minor bugfixes, notably making visible
some potentially hidden SOAP server definitions.

It is available from the usual place:

 ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.1-setup.exe

Alan


From Narayana.Upadhyaya at csiro.au  Wed Jul 27 09:15:09 2011
From: Narayana.Upadhyaya at csiro.au (Narayana.Upadhyaya at csiro.au)
Date: Wed, 27 Jul 2011 19:15:09 +1000
Subject: [EMBOSS] getorf output discrepancy
Message-ID: <F9512FDAD950114680F34532768636025CEDFA2AEB@exvic-mbx04.nexus.csiro.au>

Hi,

I am trying to get both NT and AA sequence outputs from a file with ~20,000 transcript models using getorf with the following commands:

getorf -minsize 200 -reverse Y  myfile.fa -find 3
getorf -minsize 200 -reverse Y myfile.fa -find 1

I get the outputs all right. I was expecting the same number of sequences in both (with identical names in the header), but 60-odd sequences that are present in the AA output are missing from the NT output.

Can anyone explain this discrepancy?  I tried setting the minsize option to "201" for both but the problem persists.

Regards,

Narayana


From Narayana.Upadhyaya at csiro.au  Wed Jul 27 09:30:09 2011
From: Narayana.Upadhyaya at csiro.au (Narayana.Upadhyaya at csiro.au)
Date: Wed, 27 Jul 2011 19:30:09 +1000
Subject: [EMBOSS] getorf output discrepancy
Message-ID: <F9512FDAD950114680F34532768636025CEDFA2AEC@exvic-mbx04.nexus.csiro.au>

Hi
I figured out the problem. The ORFs missing from the NT output are the ones that are just 198 NT long. When I set minsize to 198 for the NT output I don't miss anything.
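
For anyone chasing a similar discrepancy, a quick way to list the ORFs present in one output but not the other is to compare the FASTA header names. The Python sketch below is only an illustration; the function names and the file names in the usage comment are hypothetical.

```python
def fasta_ids(lines):
    """Collect the first word of every FASTA header line."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }

def missing_ids(protein_lines, nucleotide_lines):
    """Names found in the protein output but absent from the nucleotide output."""
    return sorted(fasta_ids(protein_lines) - fasta_ids(nucleotide_lines))

# Usage (hypothetical file names for the two getorf outputs):
# with open("orfs.pep") as pep, open("orfs.nt") as nt:
#     print(missing_ids(pep, nt))
```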

Sorry for bothering.

Narayana


From friedman at cancercenter.columbia.edu  Wed Jul 27 16:31:01 2011
From: friedman at cancercenter.columbia.edu (Richard Friedman)
Date: Wed, 27 Jul 2011 12:31:01 -0400
Subject: [EMBOSS] dotplots taking similarity into account
Message-ID: <B446D8AB-20BE-4E8B-B488-47EDB8CC5BE0@cancercenter.columbia.edu>

Dear Emboss list,

	Is there a way to get dotplots that take similarity according to a
similarity matrix, rather than strict identity, into account? As far as
I can see, dottup is based on identity. Is there a way that we can get
dotplots based on a similarity matrix, similar to dotplot in GCG?
I know that it may be tiresome that I use GCG as a standard, but it is
what I know and it is serving as a point of departure while I am
learning Emboss and redoing the GCG portion of my course in Emboss. I
am enjoying learning about the ways in which Emboss offers improved
functionality in the process as well.

Thanks and best wishes,
Rich
------------------------------------------------------------
Richard A. Friedman, PhD
Associate Research Scientist,
Biomedical Informatics Shared Resource
Herbert Irving Comprehensive Cancer Center (HICCC)
Lecturer,
Department of Biomedical Informatics (DBMI)
Educational Coordinator,
Center for Computational Biology and Bioinformatics (C2B2)/
National Center for Multiscale Analysis of Genomic Networks (MAGNet)
Room 824
Irving Cancer Research Center
Columbia University
1130 St. Nicholas Ave
New York, NY 10032
(212)851-4765 (voice)
friedman at cancercenter.columbia.edu
http://cancercenter.columbia.edu/~friedman/

I am a Bayesian. When I see a multiple-choice question on a test and I  
don't
know the answer I say "eeney-meaney-miney-moe".

Rose Friedman, Age 14


From s.newslists at gmail.com  Wed Jul 27 18:14:03 2011
From: s.newslists at gmail.com (Stefan)
Date: Wed, 27 Jul 2011 20:14:03 +0200
Subject: [EMBOSS] dotplots taking similarity into account
In-Reply-To: <B446D8AB-20BE-4E8B-B488-47EDB8CC5BE0@cancercenter.columbia.edu>
References: <B446D8AB-20BE-4E8B-B488-47EDB8CC5BE0@cancercenter.columbia.edu>
Message-ID: <CAECtV7Om73f5jxL1U0b8WsZLcBeUaCRTDc6YgqWdELzfr+dVrw@mail.gmail.com>

Dear Richard,

Dotmatcher uses a specified substitution matrix:
http://emboss.open-bio.org/wiki/Appdoc:Dotmatcher

Best regards,
Stefan

2011/7/27 Richard Friedman <friedman at cancercenter.columbia.edu>:
> Dear Emboss list,
>
> Is there a way to get dotplots that take similarity according to a
> similarity matrix,
> rather than strict identity into account? As far as I can see, dottup is
> based on identity.
> Is there a way that we can dotplots based on a similarity matrix similar to
> dotplot in GCG?
> I know that it may be tiresome that I use GCG as a standard, but it is what
> I know and
> it is serving as a point of departure while I am learning Emboss and redoing
> the GCG
> portion of my course in Emboss. I am enjoying learning about the ways in
> which Emboss
> offers improved functionality in the process as well.
>
> Thanks and best wishes,
> Rich
> ------------------------------------------------------------
> Richard A. Friedman, PhD
> Associate Research Scientist,
> Biomedical Informatics Shared Resource
> Herbert Irving Comprehensive Cancer Center (HICCC)
> Lecturer,
> Department of Biomedical Informatics (DBMI)
> Educational Coordinator,
> Center for Computational Biology and Bioinformatics (C2B2)/
> National Center for Multiscale Analysis of Genomic Networks (MAGNet)
> Room 824
> Irving Cancer Research Center
> Columbia University
> 1130 St. Nicholas Ave
> New York, NY 10032
> (212)851-4765 (voice)
> friedman at cancercenter.columbia.edu
> http://cancercenter.columbia.edu/~friedman/
>
> I am a Bayesian. When I see a multiple-choice question on a test and I don't
> know the answer I say "eeney-meaney-miney-moe".
>
> Rose Friedman, Age 14
>
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss
>


From charles-listes-emboss at plessy.org  Thu Jul 28 14:38:37 2011
From: charles-listes-emboss at plessy.org (Charles Plessy)
Date: Thu, 28 Jul 2011 23:38:37 +0900
Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative
 Commons Attribution-NoDerivs
Message-ID: <20110728143837.GC30927@merveille.plessy.net>

Dear EMBOSS developers,
(CC Debian Med mailing list)

while working on upgrading Debian's emboss package to version 6.4.0
(congratulations, by the way), I found some files in EMBOSS that are
not considered "Free software" by Debian.  They were actually present
in past releases as well. Here is their list:

test/data/amir.swiss
test/data/uniprotft.sw
test/swiss/seq.dat
test/swnew/trembl.dat

and emboss/data/dbxref.txt

Their license is Creative Commons Attribution-NoDerivs 3.0 Unported (CC BY-ND
3.0), and it disallows modification of the files.  The presence of these files
in EMBOSS makes it impossible for Debian to redistribute it in our operating
system.  I have confirmed with the UniProt consortium's helpdesk that, even in
isolation, these files are covered by the CC BY-ND license.  I see three
possible solutions. 

 a) Remove the files in Debian's EMBOSS package.
 b) Distribute EMBOSS with the files, but in the non-free section of the Debian archive.
 c) Replace the files by Free equivalents, for instance by re-creating records from scratch.

I am not very comfortable with any of these solutions, and was wondering
whether you have any suggestions.

Have a nice day,

-- 
Charles Plessy
Debian Med packaging team,
http://www.debian.org/devel/debian-med
Tsurumi, Kanagawa, Japan


From mathog at caltech.edu  Thu Jul 28 15:06:50 2011
From: mathog at caltech.edu (David Mathog)
Date: Thu, 28 Jul 2011 08:06:50 -0700
Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative
	Commons Attribution-NoDerivs
Message-ID: <E1QmSAk-0000CQ-0M@mendel.bio.caltech.edu>

Charles Plessy wrote:

>a) Remove the files in Debian's EMBOSS package.
>b) Distribute EMBOSS with the files, but in the non-free section of the
>Debian archive.
>c) Replace the files by Free equivalents, for instance by re-creating
>records from scratch.

d)  Add a small script that wget's each file from its original
distribution site and installs it in the right place.  Have the package
install script either ask if it should run this script, or have it issue
a message which describes the issue and leaves it up to the user to run
the script.
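
A minimal sketch of such a script, written in Python rather than with
wget, might look like the following. The file list is the one given
earlier in this thread; the base URL, the function names and the
assumption that the files sit under the same relative paths on the remote
site are all hypothetical, so a real script would need the actual
download locations.

```python
import os
import urllib.request

# Relative paths as listed in this thread.
FILES = [
    "test/data/amir.swiss",
    "test/data/uniprotft.sw",
    "test/swiss/seq.dat",
    "test/swnew/trembl.dat",
    "emboss/data/dbxref.txt",
]


def fetch_url(url):
    """Download one URL and return its content as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def install_files(base_url, dest_root, files=FILES, fetch=fetch_url):
    """Fetch each file and write it under dest_root; return the paths written."""
    installed = []
    for rel_path in files:
        # Assumption: the remote layout mirrors the EMBOSS source tree.
        data = fetch(base_url.rstrip("/") + "/" + rel_path)
        target = os.path.join(dest_root, rel_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as handle:
            handle.write(data)
        installed.append(target)
    return installed
```

The `fetch` argument is only there so the download step can be swapped
out, for example when testing without network access.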

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


From wolfgang.rumpf at gmail.com  Thu Jul 28 17:14:23 2011
From: wolfgang.rumpf at gmail.com (Wolfgang Rumpf)
Date: Thu, 28 Jul 2011 13:14:23 -0400
Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative
	Commons Attribution-NoDerivs
In-Reply-To: <E1QmSAk-0000CQ-0M@mendel.bio.caltech.edu>
References: <E1QmSAk-0000CQ-0M@mendel.bio.caltech.edu>
Message-ID: <1DAFB6D6-77CF-424C-A2BC-CD1B57FE6671@gmail.com>

I would prefer (c) or the newly-added (d) myself....


Cheers,


Wolfgang

--------------------------------------------------------------------------------------------------------------
Dr. Wolfgang Rumpf
Senior Product Specialist & Director of Support, Rescentris Inc.
Adjunct Faculty, Dept. of Biotechnology, UMUC
--------------------------------------------------------------------------------------------------------------
wolfgang.rumpf at rescentris.com 	 	wolfgang.rumpf at gmail.com
Mobile - (614) 638-6797 				Skype - wolfgang.rumpf
--------------------------------------------------------------------------------------------------------------
Read my Blog - "QuantumThoughts" - at http://culture.no-ip.org/quantumthoughts
--------------------------------------------------------------------------------------------------------------

On Jul 28, 2011, at 11:06 AM, David Mathog wrote:

> Charles Plessy wrote:
> 
>> a) Remove the files in Debian's EMBOSS package.
>> b) Distribute EMBOSS with the files, but in the non-free section of the
>> Debian archive.
>> c) Replace the files by Free equivalents, for instance by re-creating
>> records from scratch.
> 
> d)  Add a small script that wget's each file from its original
> distribution site and installs it in the right place.  Have the package
> install script either ask if it should run this script, or have it issue
> a message which describes the issue and leaves it up to the user to run
> the script.
> 
> Regards,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech


From s.newslists at gmail.com  Thu Jul 28 17:24:53 2011
From: s.newslists at gmail.com (Stefan)
Date: Thu, 28 Jul 2011 19:24:53 +0200
Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative
 Commons Attribution-NoDerivs
In-Reply-To: <1DAFB6D6-77CF-424C-A2BC-CD1B57FE6671@gmail.com>
References: <E1QmSAk-0000CQ-0M@mendel.bio.caltech.edu>
	<1DAFB6D6-77CF-424C-A2BC-CD1B57FE6671@gmail.com>
Message-ID: <CAECtV7OOtkFeXid4R5=rKr_puULbZibh-S16QJpzXzppmvWfOw@mail.gmail.com>

I would prefer (d); I know of packages that handle it this way, for
example the msttf fonts in SuSE.

Regards,
Stefan

2011/7/28 Wolfgang Rumpf <wolfgang.rumpf at gmail.com>:
> I would prefer (c) or the newly-added (d) myself....
>
>
> Cheers,
>
>
> Wolfgang
>
> --------------------------------------------------------------------------------------------------------------
> Dr. Wolfgang Rumpf
> Senior Product Specialist & Director of Support, Rescentris Inc.
> Adjunct Faculty, Dept. of Biotechnology, UMUC
> --------------------------------------------------------------------------------------------------------------
> wolfgang.rumpf at rescentris.com           wolfgang.rumpf at gmail.com
> Mobile - (614) 638-6797                   Skype - wolfgang.rumpf
> --------------------------------------------------------------------------------------------------------------
> Read my Blog - "QuantumThoughts" - at http://culture.no-ip.org/quantumthoughts
> --------------------------------------------------------------------------------------------------------------
>
> On Jul 28, 2011, at 11:06 AM, David Mathog wrote:
>
>> Charles Plessy wrote:
>>
>>> a) Remove the files in Debian's EMBOSS package.
>>> b) Distribute EMBOSS with the files, but in the non-free section of the
>>> Debian archive.
>>> c) Replace the files by Free equivalents, for instance by re-creating
>>> records from scratch.
>>
>> d)  Add a small script that wget's each file from its original
>> distribution site and installs it in the right place.  Have the package
>> install script either ask if it should run this script, or have it issue
>> a message which describes the issue and leaves it up to the user to run
>> the script.
>>
>> Regards,
>>
>> David Mathog
>> mathog at caltech.edu
>> Manager, Sequence Analysis Facility, Biology Division, Caltech


From rothenbuhler at xoma.com  Thu Jul 28 22:44:47 2011
From: rothenbuhler at xoma.com (Jake Rothenbuhler)
Date: Thu, 28 Jul 2011 15:44:47 -0700
Subject: [EMBOSS] Algorithms for pI and molecular weight in pepstats
In-Reply-To: <4E2E6C8E.5040502@ebi.ac.uk>
References: <3110E5050A5DE54F8715EB2AC3D2057725FCCC@cypress6.xoma.com>
	<4E2E6C8E.5040502@ebi.ac.uk>
Message-ID: <3110E5050A5DE54F8715EB2AC3D2057725FCED@cypress6.xoma.com>

Thanks to Ingo and Peter for the quick and helpful replies. I've read
through the discussion you had a year ago on this topic and it seems
like it is still unresolved.

> The isoelectric point can be calculated for various conditions. When I
> checked last, ExPASy's protparam was set up for the isoelectric focus
> phase of 2D gels under high urea conditions. It was unclear at the time
> where to find all the values needed to reproduce their calculation.

I have been reading through the literature referenced by ExPASy's
documentation. The article does not give pK values for all N-terminal
residues. I've asked ExPASy support about the pK values used for
residues not listed in the paper. If you're interested, I can keep you
updated regarding their response.
 
> We would like to update EMBOSS's protein property calculations, possibly
> with additional options or alternative parameter sets.

If it's something you'd like to include in EMBOSS, I'd be willing to
contribute to an additional option for pI calculation that uses ExPASy's
pK values.
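
For readers unfamiliar with how a pI is computed from a pK table, here is
a minimal Python sketch of the usual approach: sum the
Henderson-Hasselbalch charges of the ionisable groups and bisect on pH
until the net charge is zero. The pK numbers below are round placeholder
values chosen for illustration only; they are neither the pepstats nor the
ExPASy values, and the sketch uses a single N-terminal pK rather than
residue-specific ones. The function names are likewise made up.

```python
# Placeholder pK values for illustration only (NOT pepstats or ExPASy values).
POSITIVE_PK = {"N_term": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE_PK = {"C_term": 3.0, "D": 4.0, "E": 4.0, "C": 8.5, "Y": 10.0}


def net_charge(seq, ph, positive_pk=POSITIVE_PK, negative_pk=NEGATIVE_PK):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10 ** (ph - positive_pk["N_term"]))
    charge -= 1.0 / (1.0 + 10 ** (negative_pk["C_term"] - ph))
    for residue in seq:
        if residue in positive_pk:
            charge += 1.0 / (1.0 + 10 ** (ph - positive_pk[residue]))
        elif residue in negative_pk:
            charge -= 1.0 / (1.0 + 10 ** (negative_pk[residue] - ph))
    return charge


def isoelectric_point(seq, tolerance=1e-4, **pk_tables):
    """Bisect on pH in [0, 14]; net charge decreases monotonically with pH."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(seq, mid, **pk_tables) > 0:
            low = mid   # still positive: the pI lies at a higher pH
        else:
            high = mid  # zero or negative: the pI lies at a lower pH
    return (low + high) / 2.0
```

Passing a different `positive_pk` or `negative_pk` table is how an
alternative parameter set would be selected in this sketch.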

Jake Rothenbuhler
Bioinformatics Programmer/Analyst
XOMA (US) LLC
(510) 204-7452



From pmr at ebi.ac.uk  Fri Jul 29 07:28:48 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 29 Jul 2011 08:28:48 +0100
Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative
 Commons Attribution-NoDerivs
In-Reply-To: <20110728143837.GC30927@merveille.plessy.net>
References: <20110728143837.GC30927@merveille.plessy.net>
Message-ID: <4E326130.7030507@ebi.ac.uk>

On 28/07/2011 15:38, Charles Plessy wrote:
> Dear EMBOSS developers,
> (CC Debian Med mailing list)
>
> while working on upgrading Debian's emboss package to version 6.4.0
> (congratulations, by the way), I found some files in EMBOSS that are
> not considered "Free software" by Debian.  They were actually present
> in past releases as well. Here is their list:
>
> test/data/amir.swiss
> test/data/uniprotft.sw
> test/swiss/seq.dat
> test/swnew/trembl.dat

Huh? Example entries from UniProt? We can of course remove them from the 
distribution but then the QA tests will not work if anyone tries them.

I suspect amir.swiss predates this UniProt licensing, but the others are 
more recently updated.

Anyway, EMBOSS will work perfectly well without them. You can just 
delete them.

> and emboss/data/dbxref.txt

That one can go. It was a source for the DRCAT.dat data resource 
catalogue and yes we do have permission from UniProt to use it.

> Their license is Creative Commons Attribution-NoDerivs 3.0 Unported (CC BY-ND
> 3.0), and it disallows modification of the files.  The presence of these files
> in EMBOSS makes it impossible for Debian to redistribute it in our operating
> system.  I have confirmed with the UniProt consortium's helpdesk that, even in
> isolation, these files are covered by the CC BY-ND license.  I see three
> possible solutions.
>
>   a) Remove the files in Debian's EMBOSS package.
>   b) Distribute EMBOSS with the files, but in the non-free section of the Debian archive.
>   c) Replace the files by Free equivalents, for instance by re-creating records from scratch.
>
> I am not very comfortable with any of the solutions, and was wondering if you
> would have suggestions ?

I will also have words with the UniProt folk at EBI and if it really is 
not possible to include a few example entries with EMBOSS then I'll 
check with the other Open Bio projects. This is really silly.

regards,

Peter Rice
EMBOSS Team


From pmr at ebi.ac.uk  Fri Jul 29 07:46:42 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 29 Jul 2011 08:46:42 +0100
Subject: [EMBOSS] Files included in EMBOSS but licensed under Creative
 Commons Attribution-NoDerivs
In-Reply-To: <20110728143837.GC30927@merveille.plessy.net>
References: <20110728143837.GC30927@merveille.plessy.net>
Message-ID: <4E326562.1020001@ebi.ac.uk>

On 28/07/2011 15:38, Charles Plessy wrote:
> Dear EMBOSS developers,
> (CC Debian Med mailing list)
>
> while working on upgrading Debian's emboss package to version 6.4.0
> (congratulations, by the way), I found some files in EMBOSS that are
> not considered "Free software" by Debian.  They were actually present
> in past releases as well.
>
> Their license is Creative Commons Attribution-NoDerivs 3.0 Unported (CC BY-ND
> 3.0), and it disallows modification of the files.  The presence of these files
> in EMBOSS makes it impossible for Debian to redistribute it in our operating
> system.  I have confirmed with the UniProt consortium's helpdesk that, even in
> isolation, these files are covered by the CC BY-ND license.  I see three
> possible solutions.

Ummm .... in what sense would *you* be modifying the files?

UniProt's license http://www.uniprot.org/help/license says

> License & disclaimer
>
> License
>
> We have chosen to apply the Creative Commons Attribution-NoDerivs License
> to all copyrightable parts of our databases. This means that you are free
> to copy, distribute, display and make commercial use of these databases in
> all legislations, provided you give us credit. However, if you intend to
> distribute a modified version of one of our databases, you must ask us for
> permission first.

So I see no problem for EMBOSS in including the files.

The only problem is for someone "modifying the files and redistributing 
them" without permission ... but strictly that would not apply to most 
uses of a UniProt entry (otherwise you could not use one entry as input 
and distribute the results).

The licensing is there to prevent redistribution of UniProt without 
permission.

Anyway, you can just delete them from the Debian distribution of EMBOSS
and find your own way to run the QA tests. I don't think we have a
problem.

regards,

Peter Rice
EMBOSS Team


From pmr at ebi.ac.uk  Fri Jul 29 08:39:46 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 29 Jul 2011 09:39:46 +0100
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
In-Reply-To: <4E326562.1020001@ebi.ac.uk>
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk>
Message-ID: <4E3271D2.2070906@ebi.ac.uk>

On 07/29/2011 08:46 AM, Peter Rice wrote:
> On 28/07/2011 15:38, Charles Plessy wrote:
>> Dear EMBOSS developers,
>> (CC Debian Med mailing list)
>>
>> while working on upgrading Debian's emboss package to version 6.4.0
>> (congratulations, by the way), I found some files in EMBOSS that are
>> not considered "Free software" by Debian.

While we're on the topic of licensing, some other data files in EMBOSS
6.4.0 have licences.

emboss/data/OBO contains copies of several Open Bio-Ontologies for which
EMBOSS includes index files - so you need the data file version that
matches the index files.

For example, the Gene Ontology terms
http://www.geneontology.org/GO.cite.shtml are:

GO Usage Policy

The GO Consortium gives permission for any of its products to be used
without license for any purpose under three conditions:

  * That the Gene Ontology Consortium is clearly acknowledged as the
    source of the product;
  * That any GO Consortium file(s) displayed publicly include the
    date(s) and/or version number(s) of the relevant GO file(s) (the GO
    is evolving and changes will occur with time);
  * That neither the content of the GO file(s) nor the logical
    relationships embedded within the GO file(s) be altered in any way.

which looks rather like the problem you had with Creative Commons.

Licenses that protect the official database release from derived
versions are entirely reasonable and standard in bioinformatics.
Basically, they make sure that when you refer to a UniProt entry, or an
OBO ontology term, everyone agrees you are referring to one agreed entry
or term.

EMBOSS does depend on these files. The database names are hard-coded
into some of the new (and more to come) applications.

You could download the databases and indexes from the rsync copies that we
use to keep developers in sync. These are at
rsync://emboss.open-bio.org/EMBOSS/

It might make things clearer if someone from Debian could explain:

(a) why a Creative Commons licence is an issue for you

(b) why you appear to consider a copy of a whole or part of a public
biological database as part of an "operating system"

regards,

Peter Rice
EMBOSS Team


From cjfields at illinois.edu  Fri Jul 29 13:51:53 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 29 Jul 2011 08:51:53 -0500
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
In-Reply-To: <4E3271D2.2070906@ebi.ac.uk>
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
Message-ID: <B0C0539E-6E9D-4049-BD7E-77D3958B224C@illinois.edu>

On Jul 29, 2011, at 3:39 AM, Peter Rice wrote:

> On 07/29/2011 08:46 AM, Peter Rice wrote:
>> On 28/07/2011 15:38, Charles Plessy wrote:
>>> Dear EMBOSS developers,
>>> (CC Debian Med mailing list)
>>> 
>>> while working on upgrading Debian's emboss package to version 6.4.0
>>> (congratulations, by the way), I found some files in EMBOSS that are
>>> not considered "Free software" by Debian.
> 
> While we're on the topic of licensing, some other data files in EMBOSS
> 6.4.0 have licences.
> 
> emboss/data/OBO contains copies of several Open Bio-Ontologies for which
> EMBOSS includes index files - so you need the data file version that
> matches the index files.
> 
> For example, the Gene Ontology terms
> http://www.geneontology.org/GO.cite.shtml are:
> 
> GO Usage Policy
> 
> The GO Consortium gives permission for any of its products to be used
> without license for any purpose under three conditions:
> 
>    That the Gene Ontology Consortium is clearly acknowledged as the
> source of the product;
>    That any GO Consortium file(s) displayed publicly include the
> date(s) and/or version number(s) of the relevant GO file(s) (the GO is
> evolving and changes will occur with time);
>    That neither the content of the GO file(s) nor the logical
> relationships embedded within the GO file(s) be altered in any way.
> 
> which looks rather like the problem you had with Creative Commons.
> 
> Licenses that protect the official database release from derived
> versions are entirely reasonable and standard in bioinformatics.
> Basically, they make sure that when you refer to a UniProt entry, or an
> OBO ontology term, everyone agrees you are referring to one agreed entry
> or term.
> 
> EMBOSS does depend on these files. The database names are hard-coded
> into some of the new (and more to come) applications.
> 
> You could download the databases and indexes from our rsync copies we
> use to keep developers in sync. These are at
> rsync://emboss.open-bio.org/EMBOSS/
> 
> It might make things clearer if someone from Debian could explain:
> 
> (a) why a Creative Commons licence is an issue for you
> 
> (b) why you appear to consider a copy of a whole or part of a public
> biological database as part of an "operating system"
> 
> regards,
> 
> Peter Rice
> EMBOSS Team


Charles,

Speaking from the BioPerl perspective, this will very likely be a problem
for us as well as for all the other Bio* projects (Biopython, BioJava,
BioRuby); we typically include data derived from these sources.  We may
have a bit more flexibility in that the vast majority of it is only used
for tests, but I believe some data is hard-coded in.  Fallback data like
REBASE for restriction analysis and GO (as Peter mentioned above) come to
mind.

chris

Christopher Fields
Senior Research Scientist
National Center for Supercomputing Applications
Institute for Genomic Biology
University of Illinois Urbana-Champaign
1206 W. Gregory Dr. , MC-195
Urbana, IL 61801


From asjo at koldfront.dk  Fri Jul 29 20:35:13 2011
From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=)
Date: Fri, 29 Jul 2011 22:35:13 +0200
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
Message-ID: <87sjpoq0zi.fsf@topper.koldfront.dk>

On Fri, 29 Jul 2011 09:39:46 +0100, Peter wrote:

> It might make things clearer if someone from Debian could explain:

(I am not from Debian, but here is my take on it anyway:)

> (a) why a Creative Commons licence is an issue for you

One of the fundamental software freedoms is the freedom to change the
software¹.

The Debian Free Software Guidelines' definition of free software
includes this freedom².

So the "No Derivatives" variants of the Creative Commons licenses aren't
free by the DFSG definition.

(The GNU Free Documentation License on documents with invariant sections
is considered non-free by DFSG-standards as well, even if the invariant
sections are things that nobody would want to change.)

When a project of volunteers packages 29000+ packages, I think
making a judgement call on whether it is okay that the license of a
couple of files does not live up to the guidelines is nigh impossible.

The answer to "Why would you want to?" is, because you might need to.

It is more obvious with programs and code than it is with database
entries, granted - but I guess the equivalent problem would be that the
licensor didn't want to fix a problem in such a database, and that
problem made the programs using it malfunction. It would be a pain if
you weren't allowed to fix the problem and distribute the fixed data
yourself, say, if "upstream" didn't want to include the fix for some
reason or another; maybe they happened to turn sour on the world/you -
stranger things have happened.

I don't think that will happen in this specific case, but making
judgement calls on what organisations/people will do in the future isn't
quite firm ground.

So, nobody is probably ever going to exercise that freedom in this
specific case, I think, but ignoring some of the freedoms in special
cases is infeasible for a project such as Debian.

This is just me trying to explain how I understand it, so take it with a
grain of salt, and swing by debian-legal³ for the experts.

> (b) why you appear to consider a copy of a whole or part of a public
> biological database as part of an "operating system"

They are part of a package which is included in the Debian GNU/Linux
free operating system.


(I personally think it would make sense to change to a Creative Commons
license that allows derivative works - Uniprot and others are going to
be the canonical source for the data anyway, so nothing will be lost by
them by doing that, as far as I can see.)


  Best regards,

    Adam


¹ http://en.wikipedia.org/wiki/Free_software#Definition
² http://en.wikipedia.org/wiki/Debian_Free_Software_Guidelines
³ http://lists.debian.org/debian-legal/

-- 
 "Good car to drive after a war"                              Adam Sjøgren
                                                         asjo at koldfront.dk


From pmr at ebi.ac.uk  Sat Jul 30 08:58:07 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Sat, 30 Jul 2011 09:58:07 +0100
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
In-Reply-To: <87sjpoq0zi.fsf@topper.koldfront.dk>
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
	<87sjpoq0zi.fsf@topper.koldfront.dk>
Message-ID: <4E33C79F.8080402@ebi.ac.uk>

Quoted in full for the benefit of the debian-med list who missed the 
original posting

On 29/07/2011 21:35, Adam Sjøgren wrote:
> On Fri, 29 Jul 2011 09:39:46 +0100, Peter wrote:
>
>> It might make things clearer if someone from Debian could explain:
>
> (I am not from Debian, but here is my take on it anyway:)
>
>> (a) why a Creative Commons licence is an issue for you
>
> One of the fundamental software freedoms is the freedom to change the
> software¹.
>
> The Debian Free Software Guidelines' definition of free software
> includes this freedom².
>
> So the "No Derivatives" variants of the Creative Commons licenses aren't
> free by the DFSG definition.
>
> (The GNU Free Documentation License on documents with invariant sections
> is considered non-free by DFSG-standards as well, even if the invariant
> sections are things that nobody would want to change.)
>
> When a project of volunteers packages 29000+ packages, I think
> making a judgement call on whether it is okay that the license of a
> couple of files does not live up to the guidelines is nigh impossible.

> The answer to "Why would you want to?" is, because you might need to.
>
> It is more obvious with programs and code than it is with database
> entries, granted - but I guess the equivalent problem would be that the
> licensor didn't want to fix a problem in such a database, and that
> problem made the programs using it malfunction. It would be a pain if
> you weren't allowed to fix the problem and distribute the fixed data
> yourself, say, if "upstream" didn't want to include the fix for some
> reason or another; maybe they happened to turn sour on the world/you -
> stranger things have happened.
>
> So, nobody is probably ever going to exercise that freedom in this
> specific case, I think, but ignoring some of the freedoms in special
> cases is infeasible for a project such as Debian.
>
> This is just me trying to explain how I understand it, so take it with a
> grain of salt, and swing by debian-legal³ for the experts.

A specific example might help. About 5 years ago a release of the
UniProt database (as plain text files) broke the Wisconsin (GCG)
sequence analysis package. They introduced extremely long lines in a
data file whose lines everyone had assumed were at most 80 characters long.

As GCG was closed source, the fix required a change to the UniProt files 
to either wrap or truncate the 'offending' records.

The fix was not to distribute a change to the data of course, but to 
write and distribute a simple perl script that wrapped the long records.

That was not a licensing issue - the content stays the same, the format 
is changed, no changed data is distributed. But it does illustrate that 
the database licensing does not prevent 'fixing' a database.
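
For illustration only, a reformatting step of that kind could look like
the Python sketch below (the original fix was a perl script, which is not
reproduced here). It assumes flat-file lines that begin with a 5-character
prefix (a two-letter line code plus three spaces) which is repeated on
each continuation line; the function name and defaults are made up for
this example.

```python
import textwrap


def wrap_record_line(line, width=80, prefix_len=5):
    """Split one over-long flat-file line into lines of at most `width`.

    The first `prefix_len` characters (assumed to be the line code plus
    padding) are repeated on every continuation line. Words are kept
    whole where possible; whitespace at the break points is dropped.
    """
    line = line.rstrip("\n")
    if len(line) <= width:
        return [line]
    prefix, body = line[:prefix_len], line[prefix_len:]
    chunks = textwrap.wrap(
        body,
        width=width - prefix_len,
        break_long_words=True,
        break_on_hyphens=False,
    )
    return [prefix + chunk for chunk in chunks]
```

Only the layout changes: joining the wrapped pieces with single spaces
gives back the text of the original line.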

>> (b) why you appear to consider a copy of a whole or part of a public
>> biological database as part of an "operating system"
>
> They are part of a package which is included in the Debian GNU/Linux
> free operating system.

I expect there are many problems that arise if data ... and 
documentation ... are considered to be software. For EMBOSS we didn't 
officially specify a license for the documentation but other packages 
probably do. It still worries me that some of our documentation files 
officially include GPL licensed (EMBOSS) source code but I did not like 
any of the alternative documentation licenses.

> (I personally think it would make sense to change to a Creative Commons
> license that allows derivative works - Uniprot and others are going to
> be the canonical source for the data anyway, so nothing will be lost by
> them by doing that, as far as I can see.)

Unlikely. The no-derivatives version is specifically there to prevent 
derivatives - for example Debian distributing a modified UniProt without 
permission.

The ontologies are similar, but do allow for the use case of importing 
terms from one ontology into another if the ontology name is changed 
(and preferably if cross-references to the original are provided). 
Again, the need is to protect the integrity of the original ontology 
content so references to a GO term or a UniProt entry are clearly defined.

This is essential for many of the public bioinformatics databases. Data 
and software are not the same in this context. I am curious whether 
documentation licensing raises any issues.

Just my 2c worth

Peter Rice
EMBOSS Team


From asjo at koldfront.dk  Sat Jul 30 11:36:54 2011
From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=)
Date: Sat, 30 Jul 2011 13:36:54 +0200
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
	<87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk>
Message-ID: <87ipqkgfu1.fsf@topper.koldfront.dk>

On Sat, 30 Jul 2011 09:58:07 +0100, Peter wrote:

> A specific example might help. About 5 years ago a release of the
> UniProt database (as plain text files) broke the Wisconsin (GCG)
> sequence analysis package.

[...]

This is the opposite problem of what I tried to sketch.

Your example has closed source software that can't be fixed, leading to
either preprocessing or changing the database rather than fixing the
real problem.

If the software had been free, you could just have fixed the software.

Switch around "software" and "database", and you have the example I was
trying to paint.

> I expect there are many problems that arise if data ... and
> documentation ... are considered to be software.

Sure. The whole GFDL debate took quite a while, I think.

But that doesn't change that one of the solutions outlined by Charles
Plessy is necessary for Debian to distribute EMBOSS (and any other piece
of free/redistributable software).

>> (I personally think it would make sense to change to a Creative Commons
>> license that allows derivative works - Uniprot and others are going to
>> be the canonical source for the data anyway, so nothing will be lost by
>> them by doing that, as far as I can see.)

> Unlikely. The no-derivatives version is specifically there to prevent
> derivatives - for example Debian distributing a modified UniProt
> without permission.

What I was trying to say is that I don't think that that clause gives
any value to the owners of Uniprot and other databases.

Why would Uniprot want to prevent derivative works? They'll always be
the canonical source for the correct information.

You are free to distribute a modified version of the man-page for ls(1)
- but if you introduce errors in it or make it worse, nobody will choose
your derived version.

> The ontologies are similar, but do allow for the use case of importing
> terms from one ontology into another if the ontology name is changed
> (and preferably if cross-references to the original are provided).

> Again, the need is to protect the integrity of the original ontology
> content so references to a GO term or a UniProt entry are clearly
> defined.

I think the problem that is being protected against is non-existing.

People don't want to break stuff that works, they want to be able to fix
stuff that doesn't.

> This is essential for many of the public bioinformatics databases.

Why? Only a hypothetical derivative would be changed, not the original.

If someone distributed a derivative that was broken, I think people
would quickly abandon it.


Again, just my point of view - not representing or speaking for anyone :-)


  Best regards,

    Adam

-- 
 "Good car to drive after a war"                              Adam Sjøgren
                                                         asjo at koldfront.dk


From cjfields at illinois.edu  Sat Jul 30 19:01:58 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Sat, 30 Jul 2011 14:01:58 -0500
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
In-Reply-To: <4E33C79F.8080402@ebi.ac.uk>
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
	<87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk>
Message-ID: <5EF06959-FAA0-4DBB-96EA-013CFFDE2960@illinois.edu>

On Jul 30, 2011, at 3:58 AM, Peter Rice wrote:

> Quoted in full for the benefit of the debian-med list who missed the original posting
> 
> On 29/07/2011 21:35, Adam Sjøgren wrote:
>> On Fri, 29 Jul 2011 09:39:46 +0100, Peter wrote:
>> 
>>> It might make things clearer if someone from Debian could explain:
>> 
>> (I am not from Debian, but here is my take on it anyway:)
>> 
>>> (a) why a Creative Commons licence is an issue for you
>> 
>> One of the fundamental software freedoms is the freedom to change the
>> software.
>> 
>> The Debian Free Software Guidelines' definition of free software
>> includes this freedom.
>> 
>> So the "No Derivatives" variants of the Creative Commons licenses aren't
>> free by the DFSG definition.
>> 
>> (The GNU Free Documentation License on documents with invariant sections
>> is considered non-free by DFSG-standards as well, even if the invariant
>> sections are things that nobody would want to change.)
>> 
>> When a project of volunteers packages 29000+ packages, I think
>> making a judgement call on whether it is okay that the license of a
>> couple of files does not live up to the guidelines is nigh impossible.
> 
>> The answer to "Why would you want to?" is, because you might need to.
>> 
>> It is more obvious with programs and code than it is with database
>> entries, granted - but I guess the equivalent problem would be that the
>> licensor didn't want to fix a problem in such a database, and that
>> problem made the programs using it malfunction. It would be a pain if
>> you weren't allowed to fix the problem and distribute the fixed data
>> yourself, say, if "upstream" didn't want to include the fix for some
>> reason or another; maybe they happened to turn sour on the world/you -
>> stranger things have happened.
>> 
>> So, nobody is probably ever going to exercise that freedom in this
>> specific case, I think, but ignoring some of the freedoms in special
>> cases is infeasible for a project such as Debian.
>> 
>> This is just me trying to explain how I understand it, so take it with a
>> grain of salt, and swing by debian-legal for the experts.
> 
> A specific example might help. About 5 years ago a release of the UniProt database (as plain text files) broke the Wisconsin (GCG) sequence analysis package. They introduced extremely long lines in a data file whose lines everyone had assumed were at most 80 characters long.
> 
> As GCG was closed source, the fix required a change to the UniProt files to either wrap or truncate the 'offending' records.
> 
> The fix was not to distribute a change to the data of course, but to write and distribute a simple perl script that wrapped the long records.
> 
> That was not a licensing issue - the content stays the same, the format is changed, no changed data is distributed. But it does illustrate that the database licensing does not prevent 'fixing' a database.
> 
>>> (b) why you appear to consider a copy of a whole or part of a public
>>> biological database as part of an "operating system"
>> 
>> They are part of a package which is included in the Debian GNU/Linux
>> free operating system.
> 
> I expect there are many problems that arise if data ... and documentation ... are considered to be software. For EMBOSS we didn't officially specify a license for the documentation but other packages probably do. It still worries me that some of our documentation files officially include GPL licensed (EMBOSS) source code but I did not like any of the alternative documentation licenses.

I don't understand the logic behind why data would be considered software, unless one is using a very fuzzy definition of 'software'.  Is this strictly a packaging issue, e.g. any data packaged with source makes it 'software'?  Or just the fact that such data is licensed?  Would a package of just data/docs (no code) be allowed?

>> (I personally think it would make sense to change to a Creative Commons
>> license that allows derivative works - Uniprot and others are going to
>> be the canonical source for the data anyway, so nothing will be lost by
>> them by doing that, as far as I can see.)
> 
> Unlikely. The no-derivatives version is specifically there to prevent derivatives - for example Debian distributing a modified UniProt without permission.
> 
> The ontologies are similar, but do allow for the use case of importing terms from one ontology into another if the ontology name is changed (and preferably if cross-references to the original are provided). Again, the need is to protect the integrity of the original ontology content so references to a GO term or a UniProt entry are clearly defined.
> 
> This is essential for many of the public bioinformatics databases. Data and software are not the same in this context. I am curious whether documentation licensing raises any issues.
> 
> Just my 2c worth
> 
> Peter Rice
> EMBOSS Team


Maybe the best solution is to just package any data separately?  We have talked about setting up a 'biodata' repository for common datasets from all the Bio* projects.

Feel free to skip the rest of this, but:

<my_2c>

I agree with Peter's point, Uniprot and other databases license data this way for very good (and well-intentioned) reasons. For the Bio* languages there are instances where we use such data as a fallback in case a newer version isn't immediately available (REBase and SO come to mind, and I think we have others), so we are likely in the same boat as EMBOSS.  

I had a long screed here, but I found some original sources for the discussion re: Uniprot and use of Creative Commons licensing that states the reasoning for why this is in place:

http://wiki.creativecommons.org/Case_Studies/Uniprot
http://eric.jain.name/2006/02/07/uniprot-creative-commons/
http://sciencecommons.org/resources/faq/databases/
http://sciencecommons.org/resources/faq/database-protocol/

Note there is now a 'Database Protocol' (last link) that recommends a different license; that page nicely summarizes the history of the whole Creative Commons licensing affair and the issues of using a Creative Commons license re: databases, mainly due to the issue Peter mentioned above, that databases != software.  Uniprot doesn't use this as of yet (so it doesn't solve the problem at hand), but it's possible this may change.

</my_2c>

chris

Christopher Fields
Senior Research Scientist
National Center for Supercomputing Applications
Institute for Genomic Biology
University of Illinois Urbana-Champaign
1206 W. Gregory Dr. , MC-195
Urbana, IL 61801


From asjo at koldfront.dk  Sat Jul 30 19:34:30 2011
From: asjo at koldfront.dk (Adam =?iso-8859-1?Q?Sj=F8gren?=)
Date: Sat, 30 Jul 2011 21:34:30 +0200
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
	<87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk>
	<5EF06959-FAA0-4DBB-96EA-013CFFDE2960@illinois.edu>
Message-ID: <87d3grwojd.fsf@topper.koldfront.dk>

On Sat, 30 Jul 2011 14:01:58 -0500, Chris wrote:

> I don't understand the logic behind why data would be considered
> software, unless one is using a very fuzzy definition of 'software'.
> Is this strictly a packaging issue, e.g. any data packaged with source
> makes it 'software'? Or just the fact that such data is licensed?
> Would a package of just data/docs (no code) be allowed?

  "The DFSG is focused on software, but the word itself is unclear -
   some apply it to everything that can be expressed as a stream of
   bits, while a minority considers it to refer to just computer
   programs. Also, the existence of PostScript, executable scripts,
   sourced documents, etc, greatly muddies the second definition. Thus,
   to break the confusion, in June 2004 the Debian project decided to
   explicitly apply the same principles to software documentation,
   multimedia data and other content. The non-program content of Debian
   began to comply with the DFSG more strictly in Debian 4.0 (released
   in April 2007) and subsequent releases."
    - http://en.wikipedia.org/wiki/DFSG#Non-.22software.22_content

So no.

> I agree with Peter's point, Uniprot and other databases license data
> this way for very good (and well-intentioned) reasons.

Several people have mentioned the existence of these good reasons for
not allowing derived works when it comes to science/databases/biology; I
wonder what those reasons are?

Just curious.

[...]
> http://sciencecommons.org/resources/faq/database-protocol/

> Note there is now a 'Database Protocol' (last link) that recommends a
> different license; that page nicely summarizes the history of the whole
> Creative Commons licensing affair and the issues of using a Creative
> Commons license re: databases, mainly due to the issue Peter mentioned
> above, that databases != software. Uniprot doesn't use this as of yet
> (so it doesn't solve the problem at hand), but it's possible this may
> change.

It sounds like Science Commons' Open Access Data Protocol means putting
the data in the public domain, which would mean that derived works would
very much be allowed?

This link explains the protocol:

 * http://sciencecommons.org/projects/publishing/open-access-data-protocol/


  Best regards,

    Adam

-- 
 "Good car to drive after a war"                              Adam Sjøgren
                                                         asjo at koldfront.dk


From cjfields at illinois.edu  Sat Jul 30 19:42:19 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Sat, 30 Jul 2011 14:42:19 -0500
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
In-Reply-To: <87ipqkgfu1.fsf@topper.koldfront.dk>
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
	<87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk>
	<87ipqkgfu1.fsf@topper.koldfront.dk>
Message-ID: <C368A0FF-27D8-463E-BAB4-5FBB6A02D1C0@illinois.edu>

On Jul 30, 2011, at 6:36 AM, Adam Sjøgren wrote:

> On Sat, 30 Jul 2011 09:58:07 +0100, Peter wrote:
> 
>> A specific example might help. About 5 years ago a release of the
>> UniProt database (as plain text files) broke the Wisconsin (GCG)
>> sequence analysis package.
> 
> [...]
> 
> This is the opposite problem of what I tried to sketch.
> 
> Your example has closed source software that can't be fixed, leading to
> either preprocessing or changing the database rather than fixing the
> real problem.
> 
> If the software had been free, you could just have fixed the software.
> 
> Switch around "software" and "database", and you have the example I was
> trying to paint.

Yes, if the source were available fixing the parser would have been the best option.  But I think you are missing the fundamental point that Peter made (that you left out): the wording of the license allowed them to reformat the file w/o changing the actual content.  I'm not sure but I believe many GenPept documents are Uniprot-derived and follow the same concept. 

Data records and databases are not software, unless you are using some very fuzzy definition of such.

>> I expect there are many problems that arise if data ... and
>> documentation ... are considered to be software.
> 
> Sure. The whole GFDL debate took quite a while, I think.
> 
> But that doesn't change that one of the solutions outlined by Charles
> Plessy is necessary for Debian to distribute EMBOSS (and any other piece
> of free/redistributable software).

You'll also note Charles's distaste for the options mentioned.  He was also searching for alternatives.

>>> (I personally think it would make sense to change to a Creative Commons
>>> license that allows derivative works - Uniprot and others are going to
>>> be the canonical source for the data anyway, so nothing will be lost by
>>> them by doing that, as far as I can see.)
> 
>> Unlikely. The no-derivatives version is specifically there to prevent
>> derivatives - for example Debian distributing a modified UniProt
>> without permission.
> 
> What I was trying to say is that I don't think that that clause gives
> any value to the owners of Uniprot and other databases.
> 
> Why would Uniprot want to prevent derivative works? They'll always be
> the canonical source for the correct information.

The links provided in my other response indicate some of the mindset behind this. I think the main point is that the work has to be attributed, and that any changes to such data need permission of Uniprot, likely so any content changes can be curated and (possibly) propagated to future releases. This also ensures that a set of files from a third-party containing the Uniprot name will not be modified (e.g. all content can be trusted as coming from Uniprot w/o modification).

I have seen instances where loosely controlled data (such as annotation from a newly sequenced genome) becomes balkanized to the point that no one can clearly state who the trusted source is (even when the list of sources includes large databases such as NCBI/EBI).  So I understand the reasoning for the license, but I also see Science Commons is recommending something less strict.

> You are free to distribute a modified version of the man-page for ls(1)
> - but if you introduce errors in it or make it worse, nobody will choose
> your derived version.

That's a straw man argument; man page documentation for an app is not the same as a database record based on scientific data.  Would you make the same argument (allow free content modification) for a scientific publication?  I would, but only for corrections or for new data that support/contradict the original data, and even then it must go through some sort of mediation (an editor for instance), not unlike what a database curator does.

>> The ontologies are similar, but do allow for the use case of importing
>> terms from one ontology into another if the ontology name is changed
>> (and preferably if cross-references to the original are provided).
> 
>> Again, the need is to protect the integrity of the original ontology
>> content so references to a GO term or a UniProt entry are clearly
>> defined.
> 
> I think the problem that is being protected against is non-existing.
> 
> People don't want to break stuff that works, they want to be able to fix
> stuff that doesn't.

Simply opening the licensing up for any content modification doesn't solve the problem in the case of scientific databases, it potentially exacerbates it.  Hence the variations in the licensing in the previous links I sent.  By the way, if you think the classic 'vi vs emacs' arguments can get out of control, see what happens when you have competing groups trying to make changes to a sequence record w/o curation.

I do agree that it would be nice for the barrier to database modification to be lowered. Many previous attempts have been made at doing this, such as including third-party annotation, but with the major databases these all seem to fall by the wayside, and the databases fall back to simple curation.

Maybe it's time to come up with a git/hg for biological data, where one could fork records and make changes for submission; at least there one could have a trusted source and easier paths to data modification.  Just a thought.

>> This is essential for many of the public bioinformatics databases.
> 
> Why? Only a hypothetical derivative would be changed, not the original.
> 
> If someone distributed a derivative that was broken, I think people
> would quickly abandon it.

How could one tell the difference if both versions are implied to come from Uniprot (even if one comes from a third/fourth/fifth party)?  There is no guarantee beyond going back and comparing the records to the original Uniprot data.  

> Again, just my point of view - not representing or speaking for anyone :-)
> 
> 
>  Best regards,
> 
>    Adam


chris

Christopher Fields
Senior Research Scientist
National Center for Supercomputing Applications
Institute for Genomic Biology
University of Illinois Urbana-Champaign
1206 W. Gregory Dr. , MC-195
Urbana, IL 61801


From cjfields at illinois.edu  Sat Jul 30 20:14:39 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Sat, 30 Jul 2011 15:14:39 -0500
Subject: [EMBOSS] Files included in EMBOSS but licensed ...
In-Reply-To: <87d3grwojd.fsf@topper.koldfront.dk>
References: <20110728143837.GC30927@merveille.plessy.net>
	<4E326562.1020001@ebi.ac.uk> <4E3271D2.2070906@ebi.ac.uk>
	<87sjpoq0zi.fsf@topper.koldfront.dk> <4E33C79F.8080402@ebi.ac.uk>
	<5EF06959-FAA0-4DBB-96EA-013CFFDE2960@illinois.edu>
	<87d3grwojd.fsf@topper.koldfront.dk>
Message-ID: <F62874C2-2AE6-49E7-AD76-46BA8DC874A6@illinois.edu>

(Charles, not sure you have been following, but any idea on the next steps and whether other packages like bioperl are affected?)

On Jul 30, 2011, at 2:34 PM, Adam Sjøgren wrote:

> On Sat, 30 Jul 2011 14:01:58 -0500, Chris wrote:
> 
>> I don't understand the logic behind why data would be considered
>> software, unless one is using a very fuzzy definition of 'software'.
>> Is this strictly a packaging issue, e.g. any data packaged with source
>> makes it 'software'? Or just the fact that such data is licensed?
>> Would a package of just data/docs (no code) be allowed?
> 
>  "The DFSG is focused on software, but the word itself is unclear -
>   some apply it to everything that can be expressed as a stream of
>   bits, while a minority considers it to refer to just computer
>   programs. Also, the existence of PostScript, executable scripts,
>   sourced documents, etc, greatly muddies the second definition. Thus,
>   to break the confusion, in June 2004 the Debian project decided to
>   explicitly apply the same principles to software documentation,
>   multimedia data and other content. The non-program content of Debian
>   began to comply with the DFSG more strictly in Debian 4.0 (released
>   in April 2007) and subsequent releases."
>    - http://en.wikipedia.org/wiki/DFSG#Non-.22software.22_content
> 
> So no.

Oh well; we'll leave that up to Debian then.  I think Peter and I stated our concerns, and possible options were stated by Charles and myself, so there is no need to drag this out.  I would rather find a solution.

>> I agree with Peter's point, Uniprot and other databases license data
>> this way for very good (and well-intentioned) reasons.
> 
> Several people have mentioned the existence of these good reasons for
> not allowing derived works when it comes to science/databases/biology; I
> wonder what those reasons are?
> 
> Just curious.

Those links I passed on mention some of the primary concerns from both the Science Commons and Uniprot side.  I believe it comes down to an issue of trusting the source of the data and the level of control the database wants (the latter was implied in Eric's blog post).  

> [...]
>> http://sciencecommons.org/resources/faq/database-protocol/
> 
>> Note there is now a 'Database Protocol' (last link) that recommends a
>> different license; that page nicely summarizes the history of the whole
>> Creative Commons licensing affair and the issues of using a Creative
>> Commons license re: databases, mainly due to the issue Peter mentioned
>> above, that databases != software. Uniprot doesn't use this as of yet
>> (so it doesn't solve the problem at hand), but it's possible this may
>> change.
> 
> It sounds like Science Commons' Open Access Data Protocol means putting
> the data in the public domain, which would mean that derived works would
> very much be allowed?

Yes, if one adopts that protocol (Uniprot hasn't).  Eric's blog post indicates the CC-nonderivative was chosen for a level of control both Uniprot users and curators felt comfortable with but wasn't overly restrictive.  That's also from 2006, so a lot has likely changed since then.

> This link explains the protocol:
> 
> * http://sciencecommons.org/projects/publishing/open-access-data-protocol/
> 
> 
>  Best regards,
> 
>    Adam

There is no mention of derived or modified works there, but the brief mention of derived works from the Database Protocol page indicates that it is possibly allowed, yes.  That may be an impediment to adoption by a database depending on what level of control they would like.  I'm curious to see who has adopted it.

chris