Bioperl: Feature Table

Ian Korf ikorf@sapiens.wustl.edu
Wed, 8 Jul 1998 11:19:08 -0500


Clearly, the BioPerl Feature Table is an important issue. I have looked
into the DDBJ/EMBL/GenBank feature table as well as the ACEDB Sequence class
in use at the Sanger/GSC. For various reasons, I've developed my own
sequence object in the Bio::GSC modules. Since I've been volunteered to
oversee the feature table/"shopping bag", I thought I would state some
of my thoughts on the problem and open up some discussion.

First off, I must
admit my bias towards the representation of sequence data. I am much more
interested in developing a framework for describing and operating on sequences
than describing 3D structures or multiple alignments. It seems to me that 3D
structure is a completely different problem, and multiple alignments are
aggregates of sequence objects. Also, there are fundamental differences
between protein and nucleic acid sequences. Here, I'm considering nucleotide
only. It is my opinion that a DNA sequence feature, like an exon, is quite
different from a protein feature, like a beta-sheet. So although both types
of features could inherit methods from some base class, I'm not in favor of
such a relationship.

Some Ideas and Questions
------------------------

1) We should be able to convert the BioPerl sequence object to other formats
easily. For starters, BioPerl must be able to speak DDBJ/EMBL/GenBank and
ACEDB. That should be relatively straightforward, although there are important
differences between those two implementations. Once the data is in one of
those formats, there may be tools to get from there to other formats in the
future (e.g. BSML).
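
To make the conversion idea concrete, here is a minimal sketch of one feature
record with one writer per output format. It is written in Python purely for
illustration; a Perl version would have the same shape. The Feature class, the
writer names and the WRITERS table are all hypothetical. The INSDC writer
follows the GenBank flat-file column layout; the "ace-like" writer is only a
loose imitation, and the real .ace syntax and tag names would have to be taken
from the ACEDB documentation.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    key: str          # feature key, e.g. "exon"
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    strand: int = 1   # 1 for forward, -1 for reverse


def to_insdc(f: Feature) -> str:
    """One GenBank-style feature-table line: the key starts in
    column 6 and the location starts in column 22."""
    loc = f"{f.start}..{f.end}"
    if f.strand < 0:
        loc = f"complement({loc})"
    return " " * 5 + f.key.ljust(16) + loc


def to_ace_like(f: Feature) -> str:
    """Loosely ACEDB-flavoured line (illustrative layout, an assumption).
    Reverse-strand features are written with start greater than end."""
    a, b = (f.start, f.end) if f.strand > 0 else (f.end, f.start)
    return f'Feature "{f.key}" {a} {b}'


# One entry per supported output format; adding a format means adding a writer.
WRITERS = {"insdc": to_insdc, "ace": to_ace_like}


def dump(features, fmt):
    """Render every feature with the writer registered for fmt."""
    return "\n".join(WRITERS[fmt](f) for f in features)
```

The point of the sketch is that strand is stored once in the object and each
writer decides how to express it, so the two formats can differ without the
sequence object knowing about either.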

2) Is it better to develop our own model or try to adopt the same model used
by the International Nucleotide Sequence Database Collaboration (aka DDBJ/
EMBL/GenBank)? Note that the INSDC is a NUCLEOTIDE database description so
if proteins are to be included, we wouldn't want to follow it too closely.

3) How much structure/hierarchy should there be in the feature table?
In the Bio::GSC::Sequence class, features are derived from tools. This may
not be an optimal representation, but most features are created by some
program/tool (e.g. BLAST). In the INSDC feature table, all features are in the
same pot. I think it is important to be able to compare features to one
another, and this often takes place with respect to the algorithm. For
example, you might want to compare two gene prediction algorithms that both
make exon features. I see two approaches here. We can either have all the
features swimming in the same pool, and pull them out selectively by some
kind of tag, or we can impose a hierarchy as I've done with Bio::GSC::Tool. The
INSDC solution is to use a "group" to link similar things. I imagine either
approach would have the same interface, but the implementation would be quite
different.
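
The claim that both approaches could share an interface can be sketched as
follows. This is Python used only for illustration, not the Bio::GSC code:
FlatTable, ToolTable, the features() method and the tool names are all
hypothetical. One class keeps every feature in a single pool and filters by
tag; the other files features under the tool that produced them. Code that
compares two gene predictors does not need to know which one it was given.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    kind: str    # e.g. "exon"
    start: int
    end: int
    tool: str    # program that produced the feature


class FlatTable:
    """All features in one pool; selection by tag happens at query time."""

    def __init__(self):
        self._pool = []

    def add(self, f):
        self._pool.append(f)

    def features(self, tool=None, kind=None):
        return [f for f in self._pool
                if (tool is None or f.tool == tool)
                and (kind is None or f.kind == kind)]


class ToolTable:
    """Features are filed under the tool that made them."""

    def __init__(self):
        self._by_tool = defaultdict(list)

    def add(self, f):
        self._by_tool[f.tool].append(f)

    def features(self, tool=None, kind=None):
        tools = [tool] if tool is not None else list(self._by_tool)
        return [f for t in tools for f in self._by_tool.get(t, [])
                if kind is None or f.kind == kind]


def shared_exons(table, tool_a, tool_b):
    """Exon coordinates reported by both tools, for either table type."""
    a = {(f.start, f.end) for f in table.features(tool_a, "exon")}
    b = {(f.start, f.end) for f in table.features(tool_b, "exon")}
    return sorted(a & b)
```

The trade-off shows up in the implementations rather than the callers: the
flat pool scans everything on each query, while the per-tool table makes
"everything from one program" cheap and cross-tool queries slightly more work.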

One aspect of the Bio::GSC::Feature is that it is recursive; features can
contain features (genes contain exons). So if you want the beginning of an
exon, you have to first get the gene feature and disassemble it to get the
exon. BLAST hits work the same way: Sbjcts contain HSPs, but both Sbjcts and
HSPs are features. There are advantages to this approach, but it flies in the
face of putting everything in one pot. I can imagine a feature table containing
huge numbers of features with tags indicating their relationships to each
other. I'm not sure if I'm in favor of this.
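
The two representations are easier to compare side by side. Below is a
minimal sketch, in Python for illustration only and not the actual
Bio::GSC::Feature: a recursive feature whose parts are themselves features,
plus a function that flattens it into a one-pot table where each row carries
a parent tag. The class, the field names and the row layout are assumptions
made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    kind: str
    start: int
    end: int
    name: str = ""
    parts: List["Feature"] = field(default_factory=list)

    def walk(self, parent=None):
        """Yield (feature, parent) for this feature and every
        descendant, depth first."""
        yield self, parent
        for p in self.parts:
            yield from p.walk(self)


def flatten(top_level):
    """One-pot table: rows of (kind, start, end, name, parent_name).
    Top-level features get None as their parent name."""
    rows = []
    for feat in top_level:
        for f, parent in feat.walk():
            rows.append((f.kind, f.start, f.end, f.name,
                         parent.name if parent else None))
    return rows


gene = Feature("gene", 100, 900, "gene1", parts=[
    Feature("exon", 100, 250, "exon1"),
    Feature("exon", 600, 900, "exon2"),
])

# Nested form: reach the exon through its gene.
first_exon_start = gene.parts[0].start

# Flat form: every feature is a row, and the relationship is a tag.
table = flatten([gene])
exon_starts = [row[1] for row in table if row[0] == "exon"]
```

In the nested form the gene-to-exon relationship is free but every exon query
goes through a gene; in the flat form exon queries are direct but the
relationship has to be rebuilt from the parent tags.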

4) Some kind of BioPerl viewer. I've made a very simple GIF dumper, but I
don't have plans to extend it much. This project would probably require a lot
of work, but I think it would be worth it. The viewer need not be interactive,
but it ought to produce quality diagrams, and therefore should use PostScript.
If you've ever looked at the NCBI graphical display of a sequence entry, you
know how badly a better representation is needed.
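
As a rough idea of how little is needed for a non-interactive viewer, here is
a sketch that writes a one-page PostScript program: the sequence as a
horizontal line and each feature as a filled box with a label. It is Python
for illustration only; the function name, the (start, end, label) tuples and
the page coordinates are all assumptions, and labels are assumed to contain
no parentheses or backslashes, which PostScript strings would need escaped.

```python
def features_to_postscript(seq_length, features,
                           width=500.0, x0=50.0, y0=700.0):
    """Return PostScript text drawing a sequence of seq_length bases as a
    line width points long, with each (start, end, label) feature as a
    filled box just above it. Coordinates are 1-based and inclusive."""
    scale = width / seq_length   # points per base
    ps = [
        "%!PS-Adobe-3.0",
        "/Helvetica findfont 8 scalefont setfont",
        # the sequence itself
        f"newpath {x0:.1f} {y0:.1f} moveto {width:.1f} 0 rlineto stroke",
    ]
    for start, end, label in features:
        x = x0 + (start - 1) * scale      # left edge of the box
        w = (end - start + 1) * scale     # box width
        ps.append(f"newpath {x:.1f} {y0 + 5:.1f} moveto "
                  f"{w:.1f} 0 rlineto 0 10 rlineto {-w:.1f} 0 rlineto "
                  "closepath fill")
        # label above the box
        ps.append(f"{x:.1f} {y0 + 20:.1f} moveto ({label}) show")
    ps.append("showpage")
    return "\n".join(ps) + "\n"


if __name__ == "__main__":
    page = features_to_postscript(
        1000, [(101, 200, "exon1"), (601, 900, "exon2")])
    with open("features.ps", "w") as fh:
        fh.write(page)
```

A real viewer would need tracks, strands, scale bars and label collision
handling, which is where the "lot of work" comes in, but the output format
itself is not the hard part.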




-Ian
=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================