[Bioperl-l] EnsEMBL-Bioperl converters proposal

Juguang Xiao juguang at fugu-sg.org
Thu Mar 6 17:20:28 EST 2003


Hi Ewan and Michele,

It was nice to talk with you at the Bio Hackathon in Singapore. I am 
writing this proposal for a converter between bioperl and EnsEMBL objects.

TERMS

In this document, 'users' means the programmers who use this converter 
system, while 'developer' refers to the programmer who develops this 
converter system, currently just me. :)

PACKAGE

With your agreement, I will use the module 
Bio::EnsEMBL::Utils::Converter as the converter factory, and the 
namespace under Bio::EnsEMBL::Utils::Converter (for example 
Bio::EnsEMBL::Utils::Converter::bio_ens) will be dedicated to the 
converter instances.

FRAMEWORK

The only module that users work with directly is 
Bio::EnsEMBL::Utils::Converter, the converter factory. In the POD of 
this module, the developer is responsible for announcing which converter 
instances have been implemented. Users need to perform two steps: 
(1) construct a converter instance and (2) convert. See the example 
code below.

   use Bio::EnsEMBL::Utils::Converter;

   my $ens_analysis; # a Bio::EnsEMBL::Analysis object
   my $ens_contig;   # a Bio::EnsEMBL::RawContig object

   my $converter = Bio::EnsEMBL::Utils::Converter->new(
       -in       => 'Bio::Search::Hit::GenericHit',
       -out      => 'Bio::EnsEMBL::DnaPepAlignFeature',
       -analysis => $ens_analysis,
       -contig   => $ens_contig,
   );

   # NOTE: By convention, the convert method accepts an array ref
   # and returns an array ref.
   my @objs; # an array of original objects.
   my @converted_objs = @{ $converter->convert(\@objs) };


NOTES

1. In Converter::new, a user needs to specify at least the '-in' and 
'-out' module names of the conversion, say 
-in => 'Bio::SeqFeature::Generic', -out => 'Bio::EnsEMBL::SimpleFeature'. 
When converting features from bioperl to ensembl, you also need to 
supply the analysis and rawcontig information, as ensembl requires.

2. By convention, Converter::convert accepts an array ref of objects 
and returns an array ref of objects. To be friendly, my implementation 
also accepts a single object and returns a single object, but gives the 
user a warning.
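
The convention in note 2 can be sketched as follows. This is only a minimal illustration, not the committed code: the package Sketch::Converter and its upper-casing _convert_single are hypothetical stand-ins, and plain strings take the place of bioperl objects.

```perl
use strict;
use warnings;

package Sketch::Converter;   # hypothetical stand-in, not the real module

sub new { my ($class) = @_; return bless {}, $class; }

# Hypothetical per-object conversion: it just upper-cases a string.
sub _convert_single { my ($self, $in) = @_; return uc $in; }

sub convert {
    my ($self, $input) = @_;
    # Normal case: array ref in, array ref out.
    if (ref($input) eq 'ARRAY') {
        return [ map { $self->_convert_single($_) } @{$input} ];
    }
    # Friendly case: a single object in, a single object out, with a warning.
    warn "convert() expects an array ref; converting a single object\n";
    return $self->_convert_single($input);
}

package main;

my $converter = Sketch::Converter->new;
my $many = $converter->convert([ 'a', 'b' ]);
my $one;
{
    local $SIG{__WARN__} = sub { };   # silence the warning in this demo
    $one = $converter->convert('c');
}
print join(',', @{$many}), " $one\n";
```

Checking ref() against 'ARRAY' keeps the two cases apart even when the single object is itself a blessed reference, since ref() then returns the class name.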

INSIDE

The hierarchy of the converter modules looks like this:

Converter
	bio_ens
		bio_ens_seq
		bio_ens_seqFeature
		bio_ens_featurePair (converting to Bio::EnsEMBL::FeaturePair / RepeatFeature, for repeatmasker results)
		bio_ens_hit (Bio::Search::Hit::GenericHit / HSP::GenericHSP, generated by Blast)
	ens_bio
		ens_bio_seq (an EnsEMBL feature object actually attaches a bioperl seq object)
		ens_bio_seqFeature
		ens_bio_featurePair

You can see that the design is borrowed from bioperl's SeqIO and 
modules of that sort, but with the variation of multiple layers. 
Hopefully there is no copyright or legal issue involved. :)

The top two levels mainly do the marshalling to find the right 
converter instance based on -in and -out. 
Bio::EnsEMBL::Utils::Converter judges whether the conversion is from 
bioperl to ensembl or in the opposite direction, and calls the 
constructor of either bio_ens or ens_bio. 
Bio::EnsEMBL::Utils::Converter::bio_ens, for example, then tries to 
find the more detailed implementor, also based on -in and -out. The 
method that does this is called _guess_module.
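
To illustrate the two-level lookup, here is a rough sketch that works on class names only. The function names guess_direction and guess_module and the pattern table @RULES are assumptions made for this example; the real _guess_module may decide differently.

```perl
use strict;
use warnings;

# Assumed mapping from bioperl class patterns to third-level module suffixes.
my @RULES = (
    [ qr/^Bio::Search::/,                 'hit'         ],
    [ qr/^Bio::SeqFeature::FeaturePair$/, 'featurePair' ],
    [ qr/^Bio::SeqFeature::/,             'seqFeature'  ],
    [ qr/^Bio::(?:Primary)?Seq$/,         'seq'         ],
);

# First level: decide the direction from the two class names.
sub guess_direction {
    my ($in, $out) = @_;
    my $in_is_ens  = $in  =~ /^Bio::EnsEMBL::/;
    my $out_is_ens = $out =~ /^Bio::EnsEMBL::/;
    return 'bio_ens' if !$in_is_ens && $out_is_ens;
    return 'ens_bio' if $in_is_ens && !$out_is_ens;
    die "Cannot tell the direction of $in -> $out\n";
}

# Second level: pick the detailed implementor for that direction.
sub guess_module {
    my ($in, $out) = @_;
    my $direction = guess_direction($in, $out);
    my $bio_class = $direction eq 'bio_ens' ? $in : $out;
    for my $rule (@RULES) {
        my ($pattern, $suffix) = @{$rule};
        return "Bio::EnsEMBL::Utils::Converter::${direction}_${suffix}"
            if $bio_class =~ $pattern;
    }
    die "No converter known for $in -> $out\n";
}

my $module = guess_module('Bio::SeqFeature::Generic',
                          'Bio::EnsEMBL::SimpleFeature');
print "$module\n";
```

In this sketch the rules are tried in order, so the more specific FeaturePair pattern must come before the general Bio::SeqFeature:: one.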

Each third-level module, such as 
Bio::EnsEMBL::Utils::Converter::bio_ens_seqFeature, is a hidden hero 
that implements two *internal* methods, _initialize and _convert_single.

Converter::convert dereferences the array of original objects, calls 
_convert_single of the converter instance module on each of them, and 
returns a reference to the array of converted objects.
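
The division of labour between convert and the two internal methods might look roughly like the sketch below. The Sketch:: package names are hypothetical, and plain hashes and strings stand in for the feature, analysis and contig objects, so this shows the shape of the design rather than the actual implementation.

```perl
use strict;
use warnings;

package Sketch::Converter::Base;   # hypothetical base class

sub new {
    my ($class, %args) = @_;
    my $self = bless {}, $class;
    $self->_initialize(%args);     # hook implemented by each subclass
    return $self;
}

# Dereference the input, convert each object, return a new array ref.
sub convert {
    my ($self, $objs) = @_;
    my @converted = map { $self->_convert_single($_) } @{$objs};
    return \@converted;
}

package Sketch::Converter::bio_ens_seqFeature;   # hypothetical third level
our @ISA = ('Sketch::Converter::Base');

# Keep the ensembl-side context handed to the constructor.
sub _initialize {
    my ($self, %args) = @_;
    $self->{analysis} = $args{-analysis};
    $self->{contig}   = $args{-contig};
    return;
}

# Convert one feature; a plain hash stands in for the real objects.
sub _convert_single {
    my ($self, $feature) = @_;
    return {
        start    => $feature->{start},
        end      => $feature->{end},
        analysis => $self->{analysis},
        contig   => $self->{contig},
    };
}

package main;

my $converter = Sketch::Converter::bio_ens_seqFeature->new(
    -analysis => 'RepeatMasker',
    -contig   => 'contig_1',
);
my $out = $converter->convert([
    { start => 10, end => 20 },
    { start => 30, end => 45 },
]);
print scalar(@{$out}), " features converted\n";
```

With this split, adding a new conversion only means writing a new module with its own _initialize and _convert_single.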


DEVELOPMENT TEST

There will be a converter.t file in the module/t directory of the 
ensembl cvs repository. It is in charge of testing all implemented 
converter instances.

Question: I did not find a Makefile.PL in ensembl cvs like the one in 
bioperl, so I do not know how to run all the test files in a batch, as 
'make test' does in bioperl. However, I do not think my converter 
breaks anyone else's code.

The converter.t test passes against the other code currently live in 
cvs, which I think is EnsEMBL 11.

THE END

Did I miss something?

I have committed the code and test file to ensembl CVS. What I have 
done so far is the framework, and the instances to convert between

1. Bio::SeqFeature::Generic <-> Bio::EnsEMBL::SeqFeature, SimpleFeature
2. Bio::SeqFeature::FeaturePair -> Bio::EnsEMBL::RepeatFeature and 
RepeatConsensus, (for repeatmasker result)

Coming soon, there will be

1. Bio::Search::Hit::GenericHit, Bio::Search::HSP::GenericHSP -> 
Bio::EnsEMBL::BaseAlignFeature, sub-categorized into 
DnaDnaAlignFeature, DnaPepAlignFeature, and PepDnaAlignFeature, based 
on which blastall program is used.

2. Bio::Tools::Prediction::Gene -> Bio::EnsEMBL::PredictionTranscript, 
(for genscan, etc)

3.
Bio::SeqFeature::Gene::GeneStructure -> Bio::EnsEMBL::Gene
Bio::SeqFeature::Gene::Transcript -> Bio::EnsEMBL::Transcript
Bio::SeqFeature::Gene::Exon -> Bio::EnsEMBL::Exon
(for result of genewise)


Did I match the object types correctly? Do you have suggestions for 
more conversions? Thanks

As for the special case of converting RawContig <-> Seq in bioperl, I 
am wondering whether this work is necessary, because RawContig already 
lazy-loads and auto-saves its Seq. See Bio::EnsEMBL::RawContig::subseq 
or Bio::EnsEMBL::DBSQL::RawContigAdaptor::fetch_by_name for getting 
the Seq, and Bio::EnsEMBL::RawContig::seq for setting the Seq.

Any comments are most welcome! Thanks

Juguang


