Bioperl: Feature Table

Ewan Birney birney@sanger.ac.uk
Sat, 11 Jul 1998 11:13:41 +0100 (BST)


On Wed, 8 Jul 1998, Ian Korf wrote:

> 
> Some Ideas and Questions
> ------------------------
> 
> 1) We should be able to convert the BioPerl sequence object to other formats
> easily. For starters, BioPerl must be able to speak DDBJ/EMBL/GenBank and
> ACEDB. That should be relatively straightforward, but there are important
> differences in those two implementations. But once there, there may be tools
> to get to other formats in the future (eg. BSML).

Absolutely. Matt Pocock wrote a pretty nifty parser for EMBL format
(you can find it at http://www.sanger.ac.uk/Software/PerlModule/) which
includes FeatureTable parsing of EMBL format.

I don't think it is an appropriate format for the Bio::Entry type
object, as we deliberately decided not to abstract the Feature Table,
which is what you want to do (quite rightly!).


> 
> 2) Is it better to develop our own model or try to adopt the same model used
> by the International Nucleotide Sequence Database Collaboration (aka DDBJ/
> EMBL/GenBank)? Note that the INSDC is a NUCLEOTIDE database description so
> if proteins are to be included, we wouldn't want to follow it too closely.

I doubt we can stick to the INSDC, and also I don't like their set up
at all (witness the dreadful need to write mRNA and CDS to specify
a gene with UTRs... makes me shudder).

The parsers to and from GenBank/EMBL however should try to be as kosher
as possible wrt the feature table specification.


> 
> 3) How much structure/hierarchy should there be in the feature table?
> In the Bio::GSC::Sequence class, features are derived from tools. This may
> not be an optimal representation, but most features are created by some
> program/tool (eg. BLAST). In the INSDC feature table, all features are in the
> same pot. I think it is important to be able to compare features to one
> another, and this often takes place with respect to the algorithm. For
> example, you might want to compare two gene prediction algorithms that both
> make exon features. I see two approaches here. We can either have all the
> features swimming in the same pool, and pull them out selectively by some
> kind of tag, or we can impose a hierarchy as I've done with Bio::GSC::Tool. The
> INSDC solution is to use a "group" to link similar things. I imagine either
> approach would have the same interface, but the implementation would be quite
> different.
> 
> One aspect of the Bio::GSC::Feature is that it is recursive; features can
> contain features (genes contain exons). So if you want the beginning of an
> exon, you have to first get the gene feature and disassemble it to get the
> exon. BLAST hits work the same way, Sbjcts contain HSPs, but both Sbjcts and
> HSPs are features. There are advantages to this approach, but it flies in the
> face of putting everything in one pot. I can imagine a feature table containing
> huge numbers of features with tags indicating the relationships to each other.
> I'm not sure if I'm in favor of this.


I am very much in favour of hierarchies. One issue then is how the
coordinate system works:

	In Bio::Feature::Gene start xxx end yyy
                  Bio::Feature::Exon start www end vvv

Now, are www and vvv with respect to the Gene's coordinates
or to the DNA's coordinates?

I prefer (and in Wise2 I have) hierarchical coordinates - so if you
want to move a gene but keep its structure, you just change the gene
coordinates...
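
A minimal sketch of the hierarchical idea, written in Python rather than
Perl purely for brevity. The Feature class, its add() and absolute()
methods, and the example numbers are all hypothetical; this is not Wise2
or BioPerl code. It assumes 1-based inclusive coordinates, with each
child positioned relative to its parent's start.

```python
class Feature:
    """A feature whose start/end are relative to its parent feature.

    Coordinates are 1-based and inclusive; a feature with no parent is
    positioned directly on the DNA.
    """

    def __init__(self, name, start, end):
        self.name = name
        self.start = start
        self.end = end
        self.parent = None
        self.children = []

    def add(self, child):
        """Attach a sub-feature (eg an exon inside a gene)."""
        child.parent = self
        self.children.append(child)
        return child

    def absolute(self):
        """Return (start, end) on the DNA by walking up the parents."""
        start, end = self.start, self.end
        node = self.parent
        while node is not None:
            shift = node.start - 1   # parent's position 1 sits at node.start
            start += shift
            end += shift
            node = node.parent
        return start, end


gene = Feature("gene", 1000, 2000)
exon = gene.add(Feature("exon", 1, 100))
print(exon.absolute())   # (1000, 1099)

# Moving the gene moves its exons too; the exon itself is untouched.
gene.start, gene.end = 5000, 6000
print(exon.absolute())   # (5000, 5099)
```

Only the gene's own coordinates change when it is moved; the price is
that every comparison between features has to go through something like
absolute() first.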

However - most operations (eg overlap) have to be done in the reference set
of coordinates, so you need some methods to do the mapping. (And there is
the age old question of whether to use 'bio coordinates' - ie, first residue
is 1, Range 1-2 is two residues - inclusive counting - or better, C/Perl
like ranges - first residue is 0, Range 1-2 is one residue.)
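
To make the two conventions concrete, here is a small sketch (Python,
for brevity) of converting between them and testing overlap in each.
All function names are made up for illustration. "C/Perl like" is read
here as 0-based with an exclusive end, which matches the "Range 1-2 is
one residue" description above.

```python
def bio_to_half_open(start, end):
    """1-based inclusive -> 0-based, end-exclusive."""
    return start - 1, end


def half_open_to_bio(start, end):
    """0-based, end-exclusive -> 1-based inclusive."""
    return start + 1, end


def bio_length(start, end):
    """Number of residues in a 1-based inclusive range."""
    return end - start + 1


def half_open_length(start, end):
    """Number of residues in a 0-based, end-exclusive range."""
    return end - start


def bio_overlap(a, b):
    """True if two inclusive ranges share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]


def half_open_overlap(a, b):
    """True if two end-exclusive ranges share at least one residue."""
    return a[0] < b[1] and b[0] < a[1]


print(bio_length(1, 2))         # 2
print(half_open_length(1, 2))   # 1
print(bio_to_half_open(1, 2))   # (0, 2)
```

The off-by-one differences are exactly where the bugs creep in: the
ranges 1-10 and 10-20 overlap under inclusive counting, but 0-10 and
10-20 do not under end-exclusive counting.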

Re: to group by Tools or not. I think someone is always going to want
to cut it in the two orthogonal ways:

	Give me everything that program XXX did

	Give me all Gene predictions.

Notice that some applications (eg, to name one close to my heart,
GeneWise) could produce both gene predictions and 'homology' Features.

So - I think both groupings have to be easily fetched. Whether internally
the objects are stored under one particular grouping (or you could keep
hashes for both) is, I think, your decision Ian.
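
The "hashes for both" option might look roughly like this. It is a
Python sketch under stated assumptions: the FeatureTable class, its
method names and the dictionary keys "tool" and "type" are hypothetical,
and "OtherPredictor" is an invented program name.

```python
from collections import defaultdict


class FeatureTable:
    """Stores each feature once and indexes it by tool and by type."""

    def __init__(self):
        self._by_tool = defaultdict(list)
        self._by_type = defaultdict(list)

    def add(self, feature):
        # The same object goes into both indexes, so neither
        # grouping is privileged over the other.
        self._by_tool[feature["tool"]].append(feature)
        self._by_type[feature["type"]].append(feature)

    def by_tool(self, tool):
        """Give me everything that program `tool` did."""
        return list(self._by_tool.get(tool, []))

    def by_type(self, ftype):
        """Give me all features of one kind, whatever made them."""
        return list(self._by_type.get(ftype, []))


table = FeatureTable()
table.add({"tool": "GeneWise", "type": "gene", "start": 100, "end": 900})
table.add({"tool": "GeneWise", "type": "homology", "start": 120, "end": 400})
table.add({"tool": "OtherPredictor", "type": "gene", "start": 95, "end": 910})

print(len(table.by_tool("GeneWise")))   # 2
print(len(table.by_type("gene")))       # 2
```

A program like GeneWise that emits two kinds of feature shows up
whole under by_tool() and split across kinds under by_type().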

Perhaps more importantly, do we have a hierarchy that looks like

	Bio::Feature::Gene etc

Or a field in Feature which is 'type'

Former - it is much cleaner

Latter - It is likely that it is more extendable, especially by people who
don't know how to make Perl Objects.

Sadly I think I'd vote for a 'type' field. (ugh!)
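
For comparison, a rough sketch of the two options side by side (in
Python for brevity; every class and function name here is invented for
illustration, and "promoter" is just an example of a kind nobody planned
for).

```python
# Option 1: one class per kind of feature (the "cleaner" design).
class BaseFeature:
    def __init__(self, start, end):
        self.start = start
        self.end = end


class Gene(BaseFeature):
    pass


class Exon(BaseFeature):
    pass


# Option 2: a single class with a 'type' field (the extendable design).
class TypedFeature:
    def __init__(self, ftype, start, end):
        self.type = ftype
        self.start = start
        self.end = end


def of_type(features, ftype):
    """Select features by their 'type' string."""
    return [f for f in features if f.type == ftype]


features = [
    TypedFeature("gene", 1, 900),
    TypedFeature("exon", 1, 100),
    # A new kind needs no new class, only a new string.
    TypedFeature("promoter", 1, 50),
]
print(len(of_type(features, "promoter")))   # 1
```

With option 1 the new kind would need someone to write a Promoter
class first; with option 2 a misspelt type string fails silently, which
is the price of the flexibility.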


> 
> 4) Some kind of BioPerl viewer. I've made a very simple GIF dumper, but I
> don't have plans to extend it much. This project would probably require a lot
> of work, but I think it would be worth it. The viewer need not be interactive,
> but it ought to produce quality diagrams, and therefore should use Postscript.
> If you've ever looked at the NCBI graphical display of a sequence entry, you
> know how badly a better representation is needed.
> 

I strongly believe that the Data representation should be kept at arm's
length from the viewing - so if there is a GIF dumper it should be a Factory
type module called Bio::FeatureGIF or something.
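
A sketch of that separation, in Python for brevity. The
TextTrackRenderer class is a hypothetical stand-in for something like
Bio::FeatureGIF and draws ASCII tracks rather than a GIF; the point is
only that the feature class knows nothing about drawing.

```python
class Feature:
    """Pure data: no drawing code lives here."""

    def __init__(self, name, start, end):
        self.name = name
        self.start = start   # 1-based, inclusive
        self.end = end


class TextTrackRenderer:
    """Turns a list of features into one text track per feature."""

    def __init__(self, width):
        self.width = width   # sequence length, one column per residue

    def render(self, features):
        lines = []
        for f in features:
            row = ["-"] * self.width
            for pos in range(f.start, f.end + 1):
                if 1 <= pos <= self.width:
                    row[pos - 1] = "="
            lines.append("".join(row) + " " + f.name)
        return "\n".join(lines)


renderer = TextTrackRenderer(width=6)
print(renderer.render([Feature("exon1", 2, 4)]))   # -===-- exon1
```

Swapping in a GIF or Postscript renderer would then mean writing another
class with the same render() method, without touching Feature.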

I also think that we should link up with people who are doing this 
stuff as the main project (not a sideline) - An example would be
Lincoln Stein's and Jean Thierry-Mieg's Jade viewer (? how suitable
I don't know - depends on how much of their fmap is in Jade), or Niclas
Jareborg's alfresco 

see

http://stein.cshl.org/jade/

http://www.sanger.ac.uk/Users/nic/comp.html

I think, Ian, we should see if we can't pass the viewing on to
someone else (does anyone else have any ideas - acedb and webace I
ruled out because they are heavyweight - very heavyweight - solutions)...


But great to see this discussion starting - we need a good Entry object,
as it will be the core for a lot of work.


> 
> -Ian
> =========== Bioperl Project Mailing List Message Footer =======
> Project URL: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/
> For info about how to (un)subscribe, where messages are archived, etc:
> http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
> ====================================================================
> 

Ewan Birney
<birney@sanger.ac.uk>
http://www.sanger.ac.uk/Users/birney/
