[Bioperl-l] Proposal: SemanticMapping and call for info on Gene Objects

Chris Mungall cjm@fruitfly.org
Mon, 13 May 2002 08:20:18 -0700 (PDT)


Sounds sensible; you know my opinions on biospecific classes but if folks
want them this seems a good way to do it.

I would venture that map_to_generic_features() isn't really necessary, as
I strongly feel that the Gene/Transcript/Exon/etc classes should be
lightweight wrappers on top of the generic seqfeatures, with class
specific attribute accessors mapped onto the seqfeature tag/value system.

eg in gadfly, calling $gene->transcript_list([@trs]) actually maps to

$seqfeature->set_subfeatures_by_type("transcript", [@trs])

this keeps everything working for applications that just want to use the
objects at the generic seqfeature level.
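
A minimal sketch of the wrapper idea, for illustration only. The Sketch::Feature and Sketch::Gene packages are hypothetical stand-ins, not the gadfly or bioperl classes; only the set_subfeatures_by_type name comes from the example above, and get_subfeatures_by_type is an assumed counterpart.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a generic seqfeature; not the real Bioperl class.
package Sketch::Feature;

sub new {
    my ($class, %args) = @_;
    my $self = { type => $args{type}, sub => {} };
    return bless $self, $class;
}

sub type { $_[0]{type} }

# Generic storage: subfeatures are kept in a hash keyed by type name.
sub set_subfeatures_by_type {
    my ($self, $type, $list) = @_;
    $self->{sub}{$type} = [@$list];
    return;
}

# Assumed counterpart to the setter above.
sub get_subfeatures_by_type {
    my ($self, $type) = @_;
    return @{ $self->{sub}{$type} || [] };
}

# The gene class adds no storage of its own, only a typed accessor.
package Sketch::Gene;
our @ISA = ('Sketch::Feature');

sub transcript_list {
    my ($self, $list) = @_;
    $self->set_subfeatures_by_type('transcript', $list) if $list;
    return $self->get_subfeatures_by_type('transcript');
}

package main;

my $gene = Sketch::Gene->new(type => 'gene');
my @trs  = map { Sketch::Feature->new(type => 'transcript') } 1 .. 2;
$gene->transcript_list(\@trs);

# A generic consumer sees the same data without knowing about Sketch::Gene.
my @generic_view = $gene->get_subfeatures_by_type('transcript');
print scalar(@generic_view), " transcripts\n";
```

Because the accessor only delegates, nothing is stored twice and there is nothing to map back when writing the features out.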

Not sure about recording translation starts on the exons - what about
doubly encoded genes in retroviral genomes? Also, my faves - dicistronic
genes.

I'm happy to provide a tricky test-set of genbank files to test this.

This part is a bit less fleshed out... but it would be really nice if the
biology encoded in the object model is both as flexible as possible, and
open to introspection.

E.g. let's take a small part of SO and turn it into a lispy perl
datastructure:

[schema=>
  [gene=>[[isa=>"seqfeature"],
          [coding=>1],
          [class=>"Bio::GeneI"]]],
  ["noncoding-gene"=>[[isa=>"gene"],
                      [coding=>0],
                      [class=>"Bio::NcGeneI"]]],
  [transcript=>[[isa=>"seqfeature"],
                [partof=>"gene"],
                [class=>"Bio::TranscriptI"]]],
]

Ewan will hate this... but it would be nice to have as much of the
implementation specified dynamically by a "language" such as the above. Or
at least have it as an implementation option. If not, at least let's try
and keep the OM to SO mapping clean.
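
To make the introspection idea concrete, here is a rough sketch of how such a structure could be queried at runtime. The lineage() and attribute() helpers are invented for illustration, and attribute inheritance along isa is an assumption about how one might want it to behave. The hyphenated key is quoted because a bare noncoding-gene=> would parse as a subtraction.

```perl
use strict;
use warnings;

# The small schema, with the hyphenated key quoted so it parses under strict.
my $schema =
  [schema =>
    [gene => [[isa => "seqfeature"], [coding => 1], [class => "Bio::GeneI"]]],
    ["noncoding-gene" => [[isa => "gene"], [coding => 0], [class => "Bio::NcGeneI"]]],
    [transcript => [[isa => "seqfeature"], [partof => "gene"], [class => "Bio::TranscriptI"]]],
  ];

# Index the type entries by name: type => { attribute => value }.
my %types;
my ($root, @entries) = @$schema;
for my $entry (@entries) {
    my ($name, $attrs) = @$entry;
    $types{$name} = { map { $_->[0] => $_->[1] } @$attrs };
}

# Follow isa links from a type up to the root, most specific first.
sub lineage {
    my ($type) = @_;
    my (@chain, %seen);
    while (defined $type && !$seen{$type}++) {
        push @chain, $type;
        $type = $types{$type} ? $types{$type}{isa} : undef;
    }
    return @chain;
}

# Look up an attribute, inheriting from ancestors when a type does not set it.
sub attribute {
    my ($type, $key) = @_;
    for my $t (lineage($type)) {
        return $types{$t}{$key} if $types{$t} && exists $types{$t}{$key};
    }
    return;
}

print join(" -> ", lineage("noncoding-gene")), "\n";
print attribute("noncoding-gene", "class"), "\n";
```

With something like this, the class to bless into is a lookup in the schema instead of a decision hard-coded in the parser.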

Here's SO:
ftp://ftp.geneontology.org/pub/go/gobo/sequence.ontology/

Flexibility is important; on the one hand there are some who want to write
robust software that deals with protein-coding (pc) genes with the minimum
of fuss, writing to a simple (and possibly biologically restrictive) GeneI,
TranscriptI etc interface. On the other hand some of us want more plasticky
objects, possibly conforming to the pc-gene interfaces.

Regarding the logic for the actual mapping; this seems kind of tricky. Is
it robust to use the /gene field in genbank records to collect
alternatively spliced mRNAs into a gene object?
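
For reference, the naive version of that grouping might look like the sketch below. Features are plain hashes with invented field names (primary, tags), and the choice to set aside mRNAs with a missing or repeated /gene value instead of guessing is an assumption, not a tested rule.

```perl
use strict;
use warnings;

# Features are plain hashes here: { primary => ..., tags => { gene => [...] } }.
sub group_mrnas_by_gene_tag {
    my @features = @_;
    my (%by_gene, @untagged);
    for my $f (grep { $_->{primary} eq 'mRNA' } @features) {
        my $names = $f->{tags}{gene} || [];
        if (@$names == 1) {
            push @{ $by_gene{ $names->[0] } }, $f;
        }
        else {
            # Missing or repeated /gene: do not guess, leave for another rule.
            push @untagged, $f;
        }
    }
    return (\%by_gene, \@untagged);
}

my ($genes, $untagged) = group_mrnas_by_gene_tag(
    { primary => 'mRNA', tags => { gene => ['geneA'] } },
    { primary => 'mRNA', tags => { gene => ['geneA'] } },
    { primary => 'mRNA', tags => {} },
    { primary => 'CDS',  tags => { gene => ['geneA'] } },
);
printf "%d mRNAs for geneA, %d unassigned\n",
    scalar @{ $genes->{geneA} }, scalar @$untagged;
```

Whether the /gene value is reliable enough to key on is exactly the open question; the sketch only shows where the unreliable cases would surface.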

Would the semantic mapper do stuff like create intron objects from exons,
etc?
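
Deriving introns is at least mechanically simple. A minimal sketch, assuming exons are [start, end] pairs with 1-based inclusive coordinates from a single transcript on one strand (introns_from_exons is a made-up name):

```perl
use strict;
use warnings;

# Derive intron coordinates from the gaps between exons.
# Exons are [start, end] pairs, 1-based inclusive, from one transcript.
sub introns_from_exons {
    my @exons = sort { $a->[0] <=> $b->[0] } @_;
    my @introns;
    for my $i (1 .. $#exons) {
        my $start = $exons[$i - 1][1] + 1;
        my $end   = $exons[$i][0] - 1;
        # Skip abutting or overlapping exons, which leave no gap.
        push @introns, [$start, $end] if $start <= $end;
    }
    return @introns;
}

my @introns = introns_from_exons([100, 200], [500, 650], [300, 400]);
print join(", ", map { "$_->[0]..$_->[1]" } @introns), "\n";
```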

It seems the mapping must be in 2 parts; the first will manipulate the
seqfeature / subseqfeature hierarchy, eg to fix genbank split location
mRNA features into 3 level gene/transcript/exon/translation/cds objects.
The second part would go through and "bless" the objects appropriately. It
would be nice to separate those.
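
A rough sketch of what that separation could look like. Everything here is hypothetical: the hash layout, the restructure() and bless_tree() names, and the Sketch:: class names. Translation/CDS handling is left out to keep it short.

```perl
use strict;
use warnings;

# Phase 1: pure restructuring. A flat mRNA record with a split location
# becomes a gene/transcript/exon tree of plain, unblessed hashes.
sub restructure {
    my ($mrna) = @_;
    my @exons = map { +{ type => 'exon', start => $_->[0], end => $_->[1], sub => [] } }
                @{ $mrna->{segments} };
    my $transcript = { type => 'transcript', sub => \@exons };
    return { type => 'gene', name => $mrna->{gene}, sub => [$transcript] };
}

# Phase 2: pure typing. Walk the tree and bless each node by its type.
my %class_for = (
    gene       => 'Sketch::Gene',
    transcript => 'Sketch::Transcript',
    exon       => 'Sketch::Exon',
);

sub bless_tree {
    my ($node) = @_;
    bless_tree($_) for @{ $node->{sub} };
    my $class = $class_for{ $node->{type} };
    bless $node, $class if $class;    # unknown types stay generic
    return $node;
}

my $tree = restructure({ gene => 'example', segments => [[1, 50], [90, 140]] });
my $gene = bless_tree($tree);
print join(" / ", ref($gene), ref($gene->{sub}[0]), ref($gene->{sub}[0]{sub}[0])), "\n";
```

Keeping phase 1 free of any class knowledge means it can be tested on plain data, and phase 2 can be swapped or skipped by applications that only want generic features.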

On Sat, 11 May 2002, Jason Stajich wrote:

> I'm starting to try and build the semantic mapper for building
> Bio::SeqFeature::Gene objects from a list of Bio::SeqFeatureI objects.
> Dave/Hilmar any chance you guys can walk us through the ideas behind the
> Gene objects and the assumptions that have been made.
> I am wondering if we have a rich enough set of objects for truly
> representing all the information one might have for a gene.
>
> I think we probably need a CDS object or a little richer exon object to
> note where translation starts.  I'm not sure what is appropriate - to
> build objects towards the way data is organized in a genbank/embl file, or
> build them a little more generically and have to do some acrobatics to go
> in seqfile -> GeneStructure -> out seqfile format.
>
> Anyone who has opinions or ideas here, I would encourage you to look over
> the existing objects and help propose some directions.  I'd perhaps like
> to adopt what we can from the Gadfly & Ensembl models as well - any
> guidance and lessons learned would be great Ewan/Michele/Chris M.
>
> As for the actual semantic mapping part - here is a simple interface
> I've started.
>
> Bio::SeqFeature::SemanticMapperI
> (or should it be a Bio::Factory::SemanticMapperI ???)
> (happy to hear better suggestions for names)
>
> =head2 map_from_generic_features
>
>  Title   : map_from_generic_features
>  Usage   : my @features = $mapper->map_from_generic_features(-features => \@generic);
>  Function: Will build new Bio::SeqFeatureI object(s) from set of
>            Bio::SeqFeatureI objects on implemented logic.
>  Returns : List of Bio::SeqFeatureI objects
>  Args    : -features => \@generic  # Feature list
>
> =head2 map_to_generic_features
>
>  Title   : map_to_generic_features
>  Usage   : my @features = $mapper->map_to_generic_features(-features => \@specialized);
>  Function: Will build generic Bio::SeqFeature::Generic objects from
>            specialized Bio::SeqFeature:: objects useful for outputting
>            GenBank/EMBL Feature Tables.
>  Returns : List of Bio::SeqFeatureI
>  Args    : -features => \@specialized # array ref of features to map to
>                                       # generic objects
>
> =cut
>
> The first implementing class would be Bio::SeqFeature::GeneSemanticMapper,
> which would work to build Bio::SeqFeature::Gene::GeneStructure objects or
> at least Exon/Intron objects depending on the depth of the annotated
> data.
>
> A second implementing class would be
> Bio::SeqFeature::AnalysisSemanticMapper. (name up for debate!) This would
> allow us to expand/collapse SeqFeature::Computational/FeaturePair/HSP etc
> objects to/from a set of SeqFeatureI(s).
>
> This class would also provide a means for simplifying objects from high
> level bioperl SeqFeature classes down to the Generic object level suitable
> for outputting.
>
> I would then propose adding methods to Bio::SeqIO - add_SemanticMapper(),
> each_SemanticMapper(), remove_SemanticMappers() - to deal with having a set
> of semantic mappers to process sequence features once they have been created.
> Perhaps add a boolean state to the SeqIO class as to whether or not to use
> SemanticMapping as there is going to be a serious performance cost.  One
> can always process features after the sequence is read in so we gain
> flexibility without always paying the performance cost.  By delegating
> this to a separate factory we can still reimplement the sequence parsing
> later on without affecting this behavior.
>
>
> Comments, ideas, & volunteers welcomed.
>
> -jason
>