[BioPython] Human genome sequence

Charles Auffray auffray@infobiogen.fr
Mon, 18 Dec 2000 13:19:48 +0100


Ewan,

Let me express my reaction to your mail.

The Science-Celera deal is not, on many accounts, a precedent. In 1995,
Nature published in its Genome Directory a paper by Adams et al. under an
agreement which already breached many of the commonly accepted rules
for publication and sequence data deposition. The TIGR group, led by Craig
Venter, had included data released in public databases by other groups
without releasing much of their own data, which at that time was only
accessible through an MTA (based on their relationship with Human Genome
Sciences). My Genexpress team was denied the possibility of publishing our
interpretation of our own data in the same issue of Nature. The silence of
the scientific community at that time was astounding (notwithstanding the
fact that by some irony, Genome Research published our paper the very same
day as the Genome Directory).

What is happening now with publication of the human genome sequence papers
seems to indicate that the lessons of such past events have not been
learned, and that people have short memories. The sort of work that will
lead to a full description of genomes, transcriptomes and proteomes is the
result of the contributions of a large number of individuals over several
decades. In an attempt to evaluate how many people should be cited as
co-authors of an overview paper describing the state of knowledge on the
human transcriptome, I ended up with a figure of 44,444 (including Venter
and his co-workers), which is of the same order of magnitude as the
estimated number of human genes. I believe it would be appropriate for
those seeking to publish milestone papers on the current knowledge of the
human genome, whether from the public or the private sector, to acknowledge
all those who laid the ground for this work by citing them as co-authors. As
a first indication, there are 7846 papers registered in PubMed containing
"human genome" in their title or abstract. As many scientists know, and
despite all media coverage, the work is not yet completely finished, and
even if it were, it would only be the end of the beginning.

Such an action would have several advantages. First, it would convey to the
public the idea that science is a collective as well as an individual
endeavour. Second, it would make clear that the sequence of the human
genome is common knowledge which can be shared by all to advance human
health, in line with the United Nations Declaration on the Human Genome and
Human Rights which was adopted unanimously by the 186 nations represented
in 1998 (http://www1.umn.edu/humanrts/instree/Udhrhg.htm). Part of the
process is, as you rightly point out, the development of large-scale
analyses using informatics which require unencumbered access to the primary
data in the established international electronic data repositories
(EMBL/NCBI/DDBJ). We also need to ensure that useful applications can be
developed with appropriate financial investment and reach the end user
through the healthcare system. In this respect, some level of balanced and
fair competition, which occurs within the academic and industrial sectors
as well as between them, is desirable. Balance and fairness can only be
achieved if we all recognize the contributions of all and provide the
incentive for the required public and private investments. The
fuzziness of the intellectual property status of inventions based on
knowledge of the human genome sequence vs the human genome sequence itself
does not help.

In the eight years since I wrote a letter to Nature on this subject (DNA
sequences. Nature. 1992 355:292), it seems to me that we have witnessed
some progress in this regard, but a lot more effort is needed from law
and policy makers to clarify the situation. It is their responsibility to
enforce regulations disallowing attempts at monopolization, be it through
media coverage, as seems currently fashionable. The sooner the better.
There is so much to do ahead of us.

Charles Auffray

>>Date: Mon, 11 Dec 2000 17:55:25 +0100
>>To: crbm@crbm.cnrs-mop.fr
>>From: Vincent Coulon <coulon@jones.igm.cnrs-mop.fr>
>>Subject: Celera/Science agreement: les scelerats!
>>X-MIME-Autoconverted: from quoted-printable to 8bit by
>>  xerxes.crbm.cnrs-mop.fr id RAA30399
>>
>>------- Forwarded Message
>>
>>From: Ewan Birney <birney@ebi.ac.uk>
>>To: bioperl-l@bioperl.org, biojava-l@biojava.org, biopython@biopython.org,
>>       bioxml-dev@bioxml.org, ensembl-dev@ebi.ac.uk, apollo@ebi.ac.uk
>>Subject: [Bioperl-l] An open letter to bioinformatics researchers
>>Date: Sat, 9 Dec 2000 19:03:10 +0000 (GMT)
>>
>>
>>
>>Dear fellow bioinformatics developers:
>>
>>By now you have probably heard that Celera Genomics has submitted
>>their human genome paper to the journal Science. Science and Celera
>>have agreed to special terms for the release of the human genome
>>sequence data.  It will be made available through the Celera website,
>>and will not be submitted to the international DNA database consortium
>>(GenBank, EMBL and DDBJ). Science's statement regarding the agreement
>>is at:
>>http://www.sciencemag.org/feature/data/announcement/genomesequenceplan.shl
>>
>>All major journals, including Science, have a policy of deposition of
>>sequence data with the "appropriate data bank". The accepted community
>>standard is submission to GenBank/EMBL/DDBJ. The reason for this
>>deposition is to make the results of the work openly available for
>>future research. This principle was specifically mentioned in the
>>Clinton/Blair statement on human genome sequencing -
>> http://www.usinfo.state.gov/topical/global/biotech/00031401.htm
>>- who strongly upheld the view that "unencumbered access" to genome
>>data was critical.
>>
>>The terms of the Celera/Science agreement will give us access to the
>>genome sequence, but not unencumbered access.  Celera is suggesting
>>publishing their data under a MTA (Material Transfer Agreement) which
>>would prevent large scale downloads and incorporation of this data
>>into GenBank/EMBL/DDBJ. In order to download the data, you and your
>>institution will have to sign a contract guaranteeing that you will
>>not "redistribute" the Celera data.
>>
>>Science believes that the deal is an adequate compromise because it
>>provides us the right to download the data and publish our results.
>>We believe Science is thinking in terms of single gene biology, not
>>large scale bioinformatics. It is probably not hard for you to imagine
>>scenarios in bioinformatics in which "publication" and
>>"redistribution" are virtually the same thing; we cannot imagine
>>Celera allowing us to incorporate data into Pfam, for example,
>>nor into Ensembl.
>>
>>We are asking for your support in writing to Science to politely
>>insist that genome sequence papers should be accompanied by
>>unencumbered deposition to GenBank/EMBL/DDBJ. Please note that we have
>>no issue with Celera either keeping this data unpublished for
>>commercial reasons, nor with them combining their data with freely
>>available data from the public genome projects. We would defend their
>>right to do either. Our view is simply that the genome community has
>>established a clear principle that published genome data must be
>>deposited in the international databases, that bioinformatics is
>>fueled by this principle, and that Science therefore threatens to set
>>a precedent that undermines our research.
>>
>>We encourage you to express your views on this matter to Donald
>>Kennedy (kennedyd@kennedyd.pobox.stanford.edu), the Editor-in-Chief of
>>Science, and/or to Barbara Jasny (bjasny@aaas.org), the managing
>>editor in charge of genomics papers at Science.
>>
>>
>>Here is a Q/A about some points.
>>
>>* Why does this matter?
>>
>>A classic example of how our field began to have an impact on
>>molecular biology was Russ Doolittle's discovery of a significant
>>sequence similarity between a viral oncogene and a cellular growth
>>factor receptor. Russ could not have found that result if he did not
>>have an aggregate database of previously published sequences. We have
>>come a long way from Russ and his son typing data into the NEWAT
>>protein sequence database by hand.
>>
>>Throughout the 80's the international database community fought hard
>>to insist that DNA sequence data be deposited into the public domain
>>databases. Journals now generally require deposition as a condition of
>>accepting a paper. The forming of these databases and the
>>international agreements on data sharing between the European,
>>American and Japanese databases fostered the rapid development of
>>bioinformatics research. We now all take for granted the fact that
>>large DNA databases are accessible from a single point of contact, and
>>the identifiers are coordinated worldwide.
>>
>>Bioinformatics research relies on open data with minimal legal
>>encumbrances submitted to public databases. Without these databases
>>there is no real substrate for bioinformatics research.
>>
>>
>>* What would happen if this precedent was set?
>>
>>There are a number of consequences if Science set a precedent that
>>allowed people to publish DNA data under a variety of MTAs.
>>
>>- One would not be able to form a single DNA database on which to
>>  do bioinformatics research, and the derivative databases (Swissprot,
>>  PIR, Pfam, PROSITE, etc.) would not be legal.
>>
>>- Bench biologists would have to visit a number of websites and
>>  possibly enter into a number of different contracts for access to DNA
>>  data. Unexpected informative homologies could become prohibitively
>>  difficult to find.
>>
>>- You may need to get a legal review before you can publish
>>  the results of an analysis, if your analysis is large-scale and
>>  detailed enough that it could be reasonably interpreted as a
>>  "redistribution" of the primary sequence data. You could
>>  be sued for breach of contract for a Web Supplement page
>>  that discloses extensive sequence data supporting your results.
>>
>>- Scientific openness will be undermined. Efforts to engage the
>>  community in cooperative annotation of large genomes, for instance,
>>  would be blocked -- we can't usefully annotate a genome we can't freely
>>  redistribute.
>>
>>
>>* Celera paid for it. Can't they set their own access terms?
>>
>>Absolutely. We have no issue with Celera's commercial data gathering,
>>and their right to set their own access terms to their data.  We do
>>feel, though, that scientific publications carry a certain ethical
>>responsibility. The purpose of a paper is to enable the community to
>>efficiently build on your work. There is always a tension between
>>disclosing your work to your competitors (this is not unique to
>>private companies!) and receiving scientific credit for your work via
>>publication.  This tension is natural, and maintaining a consistent
>>and acceptable balance is the reason that scientists and journals
>>establish community standards that dictate how data are required to be
>>disclosed. In this case, the clearly accepted community standard is
>>that DNA sequence data are deposited in Genbank/EMBL/DDBJ upon
>>publication.
>>
>>We certainly do not blame Celera (much) for seeking a special deal
>>that lets them have their cake and eat it too -- they would
>>understandably like scientific credit for their terrific and important
>>work in human sequencing, and they would also like a profitable
>>business model.
>>
>>We do blame Science for failing to take a strong stand in upholding
>>accepted scientific publication practices. We cannot accept that it is
>>necessary to sacrifice ethics for expediency.
>>
>>* Science claims they are honouring their own policy. What gives?
>>
>>Science now claims that all their policy really requires is that
>>archival data be available via a publicly accessible database.  We
>>think this is a conveniently revisionist view of their own policy,
>>which states (in Instructions to Authors):
>>
>>"archival data sets (such as sequence and structural data) must be
>>deposited with the appropriate data bank and the identifier code should be
>>sent to Science for inclusion in the published manuscript (coordinates
>>must be released at the time of publication)"
>>
>>Notice the use of the definite article "THE appropriate data bank",
>>the notion of "deposition", and the additional rider that the
>>identifier code should be sent.
>>
>>The spirit of this statement seems clear to us. Science's statement
>>anticipates that there is an appropriate, single, aggregate community
>>database for each sort of archival data, whether DNA sequence, protein
>>structure coordinates, or something else. Sensibly, they don't name
>>every possible database for every possible archival data set.  They
>>expect that recognized community standards exist. In no way does
>>Science's statement seem consistent with the view that an individual
>>lab could start its own "public" DNA sequence database and send a
>>meaningless internal database identifier; to try to read it that way
>>is a post hoc rationalisation.
>>
>>
>>*  What can Science do? This is a done deal.
>>
>>It's true that this is a done deal. Science and Celera have mutually
>>agreed to the general terms of data release. But there are two ways
>>that we can minimize the damage.
>>
>>First, the details of the agreement are not set. In particular, there
>>is no definition of allowed "publication" versus prohibited
>>"redistribution". Science could specify definitions that did not
>>interfere with noncommercial uses of the data in bioinformatics,
>>allowing us redistribution rights if it made sense in the context of
>>our project (for example, a genome annotation project like Ensembl).
>>
>>Second, and preferably, Science -- or even the peer reviewers -- can
>>uphold Science's own data access policy, and reject the paper.
>>
>>Incidentally, they might also choose to enforce Science's policy on
>>prior publication, which states "...the main findings of a paper
>>should not have been reported in the mass media. Authors are, however,
>>permitted to present their data at open meetings but should not
>>overtly seek media attention." If I issued a press release upon
>>submission of a manuscript to Science, like Celera did, Science would
>>rightly fire it back to me without review.
>>
>>* What can I do?
>>
>>Agitate. Let Science know that you care. They consider this deal to be
>>a trial balloon for future genome papers. Even if we can't change the
>>deal with Celera, we can try to make sure it's a one-time-only deal
>>that's viewed as a Big Mistake. Write a letter to Science and tell
>>them how their actions would impact your research, both in the long
>>term and in the short term. Also, you can pass on this open letter to
>>other bioinformatics researchers you know.
>>
>>
>>Dr Sean Eddy,
>>Alvin Goldfarb Professor of Computational Biology,
>>Howard Hughes Medical Institute, Washington University in St. Louis, USA
>>
>>Dr Ewan Birney
>>Team Leader, Genomic Annotation
>>European Bioinformatics Institute, UK
>>
>>
>>_______________________________________________
>>Bioperl-l mailing list
>>Bioperl-l@bioperl.org
>>http://bioperl.org/mailman/listinfo/bioperl-l


Unite de Genetique Moleculaire
et Biologie du Developpement
CNRS ERS 1984  - 7-19 rue Guy Moquet
BP 8 - 94801 VILLEJUIF CEDEX - FRANCE
Tel : 33 (0)1 49 58 34 98 - Fax : 33 (0)1 49 58 35 09
E-mail : auffray@infobiogen.fr