[Bioperl-l] RE: Proposal: SemanticMapping and call for info on Gene Objects

Hilmar Lapp hlapp@gnf.org
Mon, 13 May 2002 13:20:15 -0700


Jason, great that you started this. Everyone wants it, but no one has been bold enough to code it. Hopefully this gets the ball rolling. See below for my comments.

> -----Original Message-----
> From: Jason Stajich [mailto:jason@cgt.mc.duke.edu]
> Sent: Saturday, May 11, 2002 3:10 PM
> To: dblock@gnf.org; Hilmar Lapp
> Cc: Bioperl
> Subject: Proposal: SemanticMapping and call for info on Gene Objects
> 
> 
> I'm starting to try and build the semantic mapper for building
> Bio::SeqFeature::Gene objects from a list of Bio::SeqFeatureI objects.
> Dave/Hilmar any chance you guys can walk us through the ideas behind
> the Gene objects and the assumptions that have been made.
> I am wondering if we have a rich enough set of objects for truly
> representing all the information one might have for a gene.
> 
> I think we probably need a CDS object or a little richer exon object to
> note where translation starts.  I'm not sure what is appropriate - to
> build objects towards the way data is organized in a genbank/embl file,
> or build them a little more generically and have to do some acrobatics
> to go in seqfile -> GeneStructure -> out seqfile format.

I wouldn't strictly tie an object tree resulting from a semantic interpretation to the syntax limitations of a particular sequence databank format, and in particular not for the sake of round-tripping. What would the use case be for round-tripping in that way? The question as I see it is rather whether the file format is rich enough to provide and receive the information; I suppose GenBank and EMBL are (not sure though).

I'd really look at GeneStructure as a virtual representation; getting there from a GenBank entry may be a one-way street.

> 
> Anyone who has opinions or ideas here, I would encourage you to look
> over the existing objects and help propose some directions.  I'd perhaps
> like to adopt what we can from the Gadfly & Ensembl models as well - any
> guidance and lessons learned would be great Ewan/Michele/Chris M.
> 
> As for the actual semantic mapping part - here is a simple interface
> I've started.
> 
> Bio::SeqFeature::SemanticMapperI
> (or should it be a Bio::Factory::SemanticMapperI ???)
> (happy to hear better suggestions for names)
> 

This is very generic; I'm wondering whether there is a tangible benefit in going that generic. The methods defined here can't return anything more strongly typed than a SeqFeatureI (the least common denominator), but for 99.5% of use cases you'd have to cast the result to a specific type anyway (e.g. GeneStructure). In other words, what would be the benefit of sharing the interface between, say, a GeneStructureSemanticMapper and a PolymorphismSemanticMapper?


> =head2 map_from_generic_features
> 
>  Title   : map_from_generic_features
>  Usage   : my @features =
>              $mapper->map_from_generic_features(-features => \@generic);
>  Function: Will build new Bio::SeqFeatureI object(s) from a set of
>            Bio::SeqFeatureI objects based on implemented logic.
>  Returns : List of Bio::SeqFeatureI objects
>  Args    : -features => \@generic  # Feature list

Related to the above: IMHO this doesn't add as much sugar as I would like to see from a user's (programmer's) perspective. I've had SeqFeatureIs before, and I'll have them afterwards, possibly re-arranged of course. An implementation can be fancy, but it doesn't have to be, so what can you rely on in a possibly generic application? I'd vote for a stricter contract, e.g. as_GeneFeature(-features=>\@generic) returning a GeneI implementing object.
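
A minimal sketch of what such a stricter contract might look like, in plain Perl with no BioPerl dependency. Only the as_GeneFeature name comes from the suggestion above; the Sketch::Gene and Sketch::GeneMapper packages, the hash-based features, and the "collect all exons" logic are hypothetical stand-ins for illustration, not existing BioPerl classes or the intended mapping rules.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a GeneI implementing class.
package Sketch::Gene;
use List::Util qw(max);

sub new {
    my ($class, %args) = @_;
    return bless { exons => $args{-exons} || [] }, $class;
}

sub exons { my ($self) = @_; return @{ $self->{exons} }; }

# Exons are handed in sorted by start, so the first one gives the gene start.
sub start { my ($self) = @_; return $self->{exons}[0]{start}; }

sub end { my ($self) = @_; return max(map { $_->{end} } $self->exons); }

# Hypothetical mapper with the stricter contract.
package Sketch::GeneMapper;

sub new { my ($class) = @_; return bless {}, $class; }

# Always returns a single Sketch::Gene, never a loose list of features.
sub as_GeneFeature {
    my ($self, %args) = @_;
    my $features = $args{-features} or die "need -features\n";
    my @exons = sort { $a->{start} <=> $b->{start} }
                grep { $_->{primary_tag} eq 'exon' } @$features;
    die "no exon features to build a gene from\n" unless @exons;
    return Sketch::Gene->new(-exons => \@exons);
}

package main;

# Generic features modelled as plain hashes (an assumption of this sketch).
my @generic = (
    { primary_tag => 'exon',   start => 300, end => 400  },
    { primary_tag => 'source', start => 1,   end => 1000 },
    { primary_tag => 'exon',   start => 100, end => 200  },
);

my $gene  = Sketch::GeneMapper->new->as_GeneFeature(-features => \@generic);
my @exons = $gene->exons;
printf "gene spans %d..%d with %d exons\n",
    $gene->start, $gene->end, scalar @exons;
```

The caller knows it gets back one gene object with gene-specific methods, so no cast or type check on the result is needed.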

> 
> =head2 map_to_generic_features
> 
>  Title   : map_to_generic_features
>  Usage   : my @features =
>              $mapper->map_to_generic_features(-features => \@specialized);
>  Function: Will build generic Bio::SeqFeature::Generic objects from
>            specialized Bio::SeqFeature:: objects useful for outputting
>            GenBank/EMBL Feature Tables.
>  Returns : List of Bio::SeqFeatureI
>  Args    : -features => \@specialized # array ref of features to map to
>                                       # generic objects

I would think this should rather be as_flat_SeqFeatures(GeneI): the 'specialized' objects should be SeqFeatureIs by their very nature, but the tree may be nested, so they don't natively come out in an EMBL feature table compliant way.

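
To illustrate the flattening idea behind as_flat_SeqFeatures, here is a minimal sketch in plain Perl. It assumes features are plain hashes with an optional 'children' array ref, which is a simplification for illustration and not how BioPerl stores sub-features; the function walks the nested tree depth-first and returns shallow copies without the nesting.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the feature itself followed by all of its
# descendants, each as a shallow copy with the 'children' key removed.
sub as_flat_SeqFeatures {
    my ($feature) = @_;
    my %copy     = %$feature;              # shallow copy; input is not modified
    my $children = delete $copy{children};
    $children  ||= [];
    return (\%copy, map { as_flat_SeqFeatures($_) } @$children);
}

# A nested gene -> mRNA -> exon tree (made-up coordinates).
my $gene = {
    primary_tag => 'gene', start => 100, end => 400,
    children => [
        { primary_tag => 'mRNA', start => 100, end => 400,
          children => [
              { primary_tag => 'exon', start => 100, end => 200 },
              { primary_tag => 'exon', start => 300, end => 400 },
          ],
        },
    ],
};

my @flat = as_flat_SeqFeatures($gene);
print join("\t", @{$_}{qw(primary_tag start end)}), "\n" for @flat;
```

The result is a flat, parent-first list (gene, mRNA, exon, exon) that a feature-table writer could emit one entry at a time.
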
> 
> =cut
> 
> The first implementing class would be
> Bio::SeqFeature::GeneSemanticMapper, which would work to build
> Bio::SeqFeature::Gene::GeneStructure objects or at least Exon/Intron
> objects depending on the depth of the annotated data.
> 
> A second implementing class would be
> Bio::SeqFeature::AnalysisSemanticMapper. (name up for debate!) This
> would allow us to expand/collapse
> SeqFeature::Computational/FeaturePair/HSP etc objects to/from a set of
> SeqFeatureI(s).
> 
> This class would also provide a means for simplifying objects from high
> level bioperl SeqFeature classes down to the Generic object level
> suitable for outputting.

I don't think that's the best way to go. If there are certain output formats to be supported, then I'd vote for actually having them in the base interface or class, or possibly in an 'extension' of that (similar in idea to SeqI and RichSeqI). GFF is already there, and I don't think we should loosen the contract that SeqFeatureI implementing objects must override those methods in order to properly support GFF. The same would go for another format. Since EMBL or the like is probably too specific, you could require an XML format and, for example, XSLT-map from that to specific feature tables (not sure the formats aren't too wicked for this to actually work).

> 
> I would then propose adding methods to Bio::SeqIO - add_SemanticMapper(),
> each_SemanticMapper, remove_SemanticMappers() to deal with having a set
> of semantic mappers to process sequence features once they have been
> created. Perhaps add a boolean state to the SeqIO class as to whether or
> not to use SemanticMapping as there is going to be a serious performance
> cost.

I wouldn't do that unless there is really demand for it. If you have more than one mapper, in which order should they be applied? What should be the input for the next one? What would you do with the original feature objects, and with those that serve as input for the next mapper? I don't think it's too much to ask users to get the feature tree and 'manually' hand it over to the mapper they're interested in.

My 2 cents.

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------