[BioPython] An open letter to bioinformatics researchers

Ewan Birney birney@ebi.ac.uk
Sat, 9 Dec 2000 19:03:10 +0000 (GMT)


Dear fellow bioinformatics developers:

By now you have probably heard that Celera Genomics has submitted
their human genome paper to the journal Science. Science and Celera
have agreed to special terms for the release of the human genome
sequence data.  It will be made available through the Celera website,
and will not be submitted to the international DNA database consortium
(GenBank, EMBL and DDBJ). Science's statement regarding the agreement
is at:
http://www.sciencemag.org/feature/data/announcement/genomesequenceplan.shl

All major journals, including Science, have a policy of deposition of
sequence data with the "appropriate data bank". The accepted community
standard is submission to GenBank/EMBL/DDBJ. The reason for this
deposition is to make the results of the work openly available for
future research. This principle was specifically mentioned in the
Clinton/Blair statement on human genome sequencing
(http://www.usinfo.state.gov/topical/global/biotech/00031401.htm),
which strongly upheld the view that "unencumbered access" to genome
data was critical.

The terms of the Celera/Science agreement will give us access to the
genome sequence, but not unencumbered access.  Celera is suggesting
publishing their data under an MTA (Material Transfer Agreement) which
would prevent large scale downloads and incorporation of this data
into GenBank/EMBL/DDBJ. In order to download the data, you and your
institution will have to sign a contract guaranteeing that you will
not "redistribute" the Celera data.

Science believes that the deal is an adequate compromise because it
provides us the right to download the data and publish our results.
We believe Science is thinking in terms of single gene biology, not
large scale bioinformatics. It is probably not hard for you to imagine
scenarios in bioinformatics in which "publication" and
"redistribution" are virtually the same thing; we cannot imagine
Celera allowing us to incorporate data into Pfam, for example,
nor into Ensembl.

We are asking for your support in writing to Science to politely
insist that genome sequence papers should be accompanied by
unencumbered deposition to GenBank/EMBL/DDBJ. Please note that we have
no issue with Celera either keeping this data unpublished for
commercial reasons or combining their data with freely
available data from the public genome projects. We would defend their
right to do either. Our view is simply that the genome community has
established a clear principle that published genome data must be
deposited in the international databases, that bioinformatics is
fueled by this principle, and that Science therefore threatens to set
a precedent that undermines our research.

We encourage you to express your views on this matter to Donald
Kennedy (kennedyd@kennedyd.pobox.stanford.edu), the Editor-in-Chief of
Science, and/or to Barbara Jasny (bjasny@aaas.org), the managing
editor in charge of genomics papers at Science.


Here is a Q/A about some points.   

* Why does this matter?

A classic example of how our field began to have an impact on
molecular biology was Russ Doolittle's discovery of a significant
sequence similarity between a viral oncogene and a cellular growth
factor receptor. Russ could not have found that result if he did not
have an aggregate database of previously published sequences. We have
come a long way from Russ and his son typing data into the NEWAT
protein sequence database by hand.

Throughout the 80's the international database community fought hard
to insist that DNA sequence data be deposited into the public domain
databases. Journals now generally require deposition as a condition of
accepting a paper. The forming of these databases and the
international agreements on data sharing between the European,
American and Japanese databases fostered the rapid development of
bioinformatics research. We now all take for granted the fact that
large DNA databases are accessible from a single point of contact, and
the identifiers are coordinated worldwide.

Bioinformatics research relies on open data with minimal legal
encumbrances submitted to public databases. Without these databases
there is no real substrate for bioinformatics research.


* What would happen if this precedent were set?

There are a number of consequences if Science set a precedent that
allowed people to publish DNA data under a variety of MTAs.

- One would not be able to form a single DNA database on which to
  do bioinformatics research, and the derivative databases (Swissprot,
  PIR, Pfam, PROSITE, etc.) would not be legal.

- Bench biologists would have to visit a number of websites and
  possibly enter into a number of different contracts for access to DNA
  data. Unexpected informative homologies could become prohibitively
  difficult to find.

- You may need to get a legal review before you can publish
  the results of an analysis, if your analysis is large-scale and
  detailed enough that it could be reasonably interpreted as a 
  "redistribution" of the primary sequence data. You could
  be sued for breach of contract for a Web Supplement page
  that discloses extensive sequence data supporting your results.

- Scientific openness will be undermined. Efforts to engage the
  community in cooperative annotation of large genomes, for instance,
  would be blocked -- we can't usefully annotate a genome we can't freely
  redistribute.


* Celera paid for it. Can't they set their own access terms?

Absolutely. We have no issue with Celera's commercial data gathering,
and their right to set their own access terms to their data.  We do
feel, though, that scientific publications carry a certain ethical
responsibility. The purpose of a paper is to enable the community to
efficiently build on your work. There is always a tension between
disclosing your work to your competitors (this is not unique to
private companies!) and receiving scientific credit for your work via
publication.  This tension is natural, and maintaining a consistent
and acceptable balance is the reason that scientists and journals
establish community standards that dictate how data are required to be
disclosed. In this case, the clearly accepted community standard is
that DNA sequence data are deposited in GenBank/EMBL/DDBJ upon
publication.

We certainly do not blame Celera (much) for seeking a special deal
that lets them have their cake and eat it too -- they would
understandably like scientific credit for their terrific and important
work in human sequencing, and they would also like a profitable
business model.

We do blame Science for failing to take a strong stand in upholding
accepted scientific publication practices. We cannot accept that it is
necessary to sacrifice ethics for expediency.

* Science claims they are honouring their own policy. What gives?
  
Science now claims that all their policy really requires is that
archival data be available via a publicly accessible database.  We
think this is a conveniently revisionist view of their own policy,
which states (in Instructions to Authors):
  
"archival data sets (such as sequence and structural data) must be 
deposited with the appropriate data bank and the identifier code should be 
sent to Science for inclusion in the published manuscript (coordinates
must be released at the time of publication)"

Notice the use of the definite article "THE appropriate data bank",
the notion of "deposition", and the additional rider that the
identifier code should be sent.

The spirit of this statement seems clear to us. Science's statement
anticipates that there is an appropriate, single, aggregate community
database for each sort of archival data, whether DNA sequence, protein
structure coordinates, or something else. Sensibly, they don't name
every possible database for every possible archival data set.  They
expect that recognized community standards exist. In no way does
Science's statement seem consistent with the view that an individual
lab could start its own "public" DNA sequence database and send a
meaningless internal database identifier; to try to read it that way
is a post hoc rationalisation.


*  What can Science do? This is a done deal.

It's true that this is a done deal. Science and Celera have mutually
agreed to the general terms of data release. But there are two ways
that we can minimize the damage.

First, the details of the agreement are not set. In particular, there
is no definition of allowed "publication" versus prohibited
"redistribution". Science could specify definitions that did not
interfere with noncommercial uses of the data in bioinformatics,
allowing us redistribution rights if it made sense in the context of
our project (for example, a genome annotation project like Ensembl).
 
Second, and preferably, Science -- or even the peer reviewers -- can
uphold Science's own data access policy, and reject the paper.

Incidentally, they might also choose to enforce Science's policy on
prior publication, which states "...the main findings of a paper
should not have been reported in the mass media. Authors are, however,
permitted to present their data at open meetings but should not
overtly seek media attention." If I issued a press release upon
submission of a manuscript to Science, as Celera did, Science would
rightly fire it back to me without review.

* What can I do?

Agitate. Let Science know that you care. They consider this deal to be
a trial balloon for future genome papers. Even if we can't change the
deal with Celera, we can try to make sure it's a one-time-only deal
that's viewed as a Big Mistake. Write a letter to Science and tell
them how their actions would impact your research, both in the long
term and in the short term. Also, you can pass on this open letter to
other bioinformatics researchers you know.


Dr Sean Eddy, 
Alvin Goldfarb Professor of Computational Biology,
Howard Hughes Medical Institute, Washington University in St. Louis, USA

Dr Ewan Birney
Team Leader, Genomic Annotation
European Bioinformatics Institute, UK