[Bioperl-l] Sequence Features... (hello from Singapore...)

Ewan Birney birney at ebi.ac.uk
Thu Feb 20 09:05:20 EST 2003



Bioperl hackers who have been in Singapore have been discussing the next
generation of sequence feature handling. As any developer - and indeed
user - who has used bioperl might have noticed, our sequence feature model
is quite complex - this is because we have a number of drivers, in
particular:

   - Full representation of EMBL/GenBank locations, including disjoints
and fuzzies

   - Handling multi-coordinate systems in genome->cDNA (mutations) and
contig->genome (DAS)

   - Handling nested/composed features nicely (Genes, SimilarityPairs).


The result is a number of different possibilities for storing things.


We'd like to change this to make life more consistent, and also to adhere
more to emerging standards, like SO (sequence ontology). Here is our
(Lincoln, Hilmar, Heikki, Jason and ChrisM's) view of life:


  - We separate the idea of "feature on a particular sequence" from the
"information stored inside a feature". For the information part we'll
reuse Bio::AnnotationCollection, as that is what we really want for
full-blown annotation-like cases (eg, genes on genomes). The former
case, which is just a lightweight join object of <location,annotation>,
we still call "SeqFeature".

  - A single Bio::AnnotationCollection can be pointed to by multiple
SeqFeatures; the AnnotationCollection is still a separate object, but
the SeqFeatures are the same sequence feature placed in different
coordinate systems.

     Concrete Example:

     Imagine a genome contig that is part of a chromosome. The contig has
two tRNA sequences and 12 Alu Repeats.

    - There are 2 AnnotationCollection objects representing the two tRNA
sequences and 12 AnnotationCollection objects representing each Repeat.

    - However there are 14 SeqFeature objects binding the
AnnotationCollections to the Contig and *another* 14 SeqFeature objects
binding the AnnotationCollections to the chromosome.
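
The counting in this example can be sketched as follows. This is a minimal
illustration in Python rather than Perl, and none of it is existing Bioperl
API: the class fields, the type strings, the feature lengths and the contig
offset are all made-up assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationCollection:
    """What the feature is (hypothetical stand-in for Bio::AnnotationCollection)."""
    unique_id: str
    type_string: str  # illustrative type label, not looked up in SO here


@dataclass(frozen=True)
class SeqFeature:
    """Lightweight <location, annotation> join on one coordinate system."""
    coord_system: str
    start: int
    end: int
    annotation: AnnotationCollection


# 2 tRNAs + 12 Alu repeats = 14 annotation objects, created once.
annotations = (
    [AnnotationCollection(f"tRNA-{i}", "tRNA") for i in range(1, 3)]
    + [AnnotationCollection(f"Alu-{i}", "repeat_region") for i in range(1, 13)]
)

CONTIG_OFFSET = 500_000  # assumed position of the contig on the chromosome

contig_features = []
chromosome_features = []
for i, ann in enumerate(annotations):
    start = 1_000 * i + 1  # arbitrary placement on the contig
    end = start + 299      # arbitrary feature length
    contig_features.append(SeqFeature("contig", start, end, ann))
    chromosome_features.append(
        SeqFeature("chromosome", start + CONTIG_OFFSET, end + CONTIG_OFFSET, ann)
    )

# 14 annotation objects, 28 SeqFeature bindings in total.
print(len(annotations), len(contig_features) + len(chromosome_features))
```

Each annotation object exists once; only the lightweight bindings are
duplicated per coordinate system.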


  - We can't directly link AnnotationCollections to all their SeqFeatures
because we'll have a circular reference and even though there are
techniques to break these in Perl, it is hard to get this to work nicely
in many cases.


  - Instead we have a Bio::CoordinateSystem::Resolver object associated
with each *coordinate system* which can give out either

     - All SeqFeatures for the coordinate system (that it knows
about)

     - A SeqFeature for a given AnnotationCollection

  - For this to work, we have to make AnnotationCollections "Identifiable"
or perhaps "UniquelyIdentifiable" (Lincoln's proposed a distinction
between externally identifiable - and system level identifiable
objects...).
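
One way to read the resolver idea, as a rough Python sketch and not the
proposed Bioperl implementation: the annotation holds no references to its
features, and a per-coordinate-system resolver indexes features by the
annotation's unique identifier. The method names (add, all_features,
feature_for) and the unique_id attribute are assumptions made for
illustration.

```python
class AnnotationCollection:
    """Identifiable annotation; holds no references back to its features."""

    def __init__(self, unique_id):
        self.unique_id = unique_id


class SeqFeature:
    """Location plus a one-way pointer to the annotation."""

    def __init__(self, start, end, annotation):
        self.start = start
        self.end = end
        self.annotation = annotation


class Resolver:
    """Hypothetical resolver for one coordinate system."""

    def __init__(self, coord_system):
        self.coord_system = coord_system
        self._features = []
        self._by_annotation_id = {}

    def add(self, feature):
        self._features.append(feature)
        # Index by identifier, so the annotation never has to point back.
        self._by_annotation_id[feature.annotation.unique_id] = feature

    def all_features(self):
        """All SeqFeatures this resolver knows about."""
        return list(self._features)

    def feature_for(self, annotation):
        """The SeqFeature for a given annotation, or None if unknown here."""
        return self._by_annotation_id.get(annotation.unique_id)


trna = AnnotationCollection("tRNA-1")

contig = Resolver("contig")
chromosome = Resolver("chromosome")
contig.add(SeqFeature(120, 192, trna))
chromosome.add(SeqFeature(500_120, 500_192, trna))

for resolver in (contig, chromosome):
    feature = resolver.feature_for(trna)
    print(resolver.coord_system, feature.start, feature.end)
```

Because the lookup goes through the identifier, references only run from
SeqFeature to AnnotationCollection, so no reference cycle is created.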

  - AnnotationCollection Annotations should have "type" (an Ontology term)
and "type_string" (the Ontology term as a string), which give links into
SO. We should slurp up the standard part of SO into Bioperl and always
allow people to make "standard" SO objects.

  - AnnotationCollection will also have a "contains" method, which gives
sub Annotations. Therefore a Gene will have Transcripts etc via this
method. By placing this method here, and *not* at the feature level
(sub_SeqFeatures), we get rid of the ambiguity of "should a
sub_SeqFeature be strictly covered by its parent". We officially allow the
containment hierarchy to have *nothing* to do with the location system.


  - As this is done at the interface level, implementations which need to
conserve space can use loop back methods. eg,

     SeqFeature->annotation() will give back the annotation collection,
but an implementation can easily be both SeqFeatureI compliant and
AnnotationCollectionI compliant and return self on this method. This is
not how we will build the generic objects, but it gives space-limited
implementations a route out to build lightweight systems.
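
The containment and loop-back ideas above can be sketched together. Again
this is Python for illustration only, not Bioperl code: add_contained and
CompactFeature are hypothetical names, and only contains() and annotation()
come from the proposal.

```python
class AnnotationCollection:
    """Annotation with containment that is independent of any location."""

    def __init__(self, unique_id, type_string):
        self.unique_id = unique_id
        self.type_string = type_string
        self._contained = []

    def add_contained(self, child):  # hypothetical helper name
        self._contained.append(child)

    def contains(self):
        """Sub-annotations, e.g. the Transcripts of a Gene."""
        return list(self._contained)


class SeqFeature:
    """Generic binding: a separate object pointing at its annotation."""

    def __init__(self, start, end, annotation):
        self.start = start
        self.end = end
        self._annotation = annotation

    def annotation(self):
        return self._annotation


class CompactFeature(AnnotationCollection):
    """Space-saving variant: one object plays both roles (loop-back)."""

    def __init__(self, unique_id, type_string, start, end):
        super().__init__(unique_id, type_string)
        self.start = start
        self.end = end

    def annotation(self):
        return self  # the feature is its own annotation collection


gene = AnnotationCollection("gene-1", "gene")
transcript = AnnotationCollection("transcript-1", "transcript")
gene.add_contained(transcript)

generic = SeqFeature(100, 900, gene)
compact = CompactFeature("Alu-1", "repeat_region", 2_000, 2_299)

# Callers use the same annotation() call for both kinds of object.
for feature in (generic, compact):
    ann = feature.annotation()
    print(ann.unique_id, ann.type_string, len(ann.contains()))
```

Nothing in contains() consults start or end, which matches the intent that
containment has nothing to do with the location system.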



   Vague plan for implementing this:


     - More discussion  <<grin>>

     - chrisM and SO people get SO ontologies into bioperl so we can ask
for SO terms easily

     - extend AnnotationCollection to work with ontologies

     - extend AnnotationCollection with containment

     - Deprecate SeqFeature::Generic. We should probably warn on
SeqFeature::Generic->new and instead have a new object representing the
new SeqFeature bindings (proposed name SeqFeature::Standard ??)

     - switch over SeqIO <<grin>> modules

     - Reroute all current SeqFeatureI methods to correctly chain to
SeqFeature::Standard->annotation->blah() methods with deprecation

     -





