[Bioperl-l] Re: [Bioperl-announce-l] an extension to Bio::SeqIO

Tue Jun 17 13:33:18 EDT 2003

There is a bit of a chicken-and-egg problem here: most of the data sets
Bioperl has tried to interface with are not as rich as chado.  The
genbank->gene->chado way is not going to work for all genbank records
(which I personally can live with).  I would like to see if we can define
the least common denominator, so that people understand what is needed to
get a chado db populated.

As we've been discussing in different venues, I think we'd like to see a
general-purpose system which can take a collection of sequence features,
relate them in a graph based on an identifiable grouping (the /gene field,
or perhaps one mapped into a general slot like 'group' à la Lincoln's
Bio::DB::GFF system), and then, using SO, map these into objects.  For
genes I'd like to see these be Bio::SeqFeature::Gene::GeneStructure
objects (the object model of which might need some work) because there are
additional methods already built in, like intron inference and the ability
to loop through the transcripts, etc.
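
To make the grouping-then-graph idea concrete, here is a rough sketch in
Python rather than Perl, so that it stands apart from any real Bioperl API.
Everything in it is an assumption for illustration: features are plain
dicts, PART_OF is a tiny hand-written stand-in for SO part_of relations, and
the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for Sequence Ontology part_of relations.
PART_OF = {"mRNA": "gene", "exon": "mRNA", "CDS": "mRNA"}


def group_features(features, tag="gene"):
    """Bucket feature dicts by the value of a grouping tag (e.g. /gene)."""
    groups = defaultdict(list)
    for feat in features:
        # Features lacking the tag end up under the key None.
        groups[feat["tags"].get(tag)].append(feat)
    return dict(groups)


def build_graph(group):
    """Return {parent_index: [child_index, ...]} for one group of features.

    A feature becomes a child of every feature in the group whose type is
    its PART_OF parent type and whose location contains it.
    """
    by_type = defaultdict(list)
    for i, feat in enumerate(group):
        by_type[feat["type"]].append(i)

    graph = defaultdict(list)
    for i, feat in enumerate(group):
        parent_type = PART_OF.get(feat["type"])
        for p in by_type.get(parent_type, []):
            parent = group[p]
            if parent["start"] <= feat["start"] and feat["end"] <= parent["end"]:
                graph[p].append(i)
    return dict(graph)


# Made-up example data: two genes, one with a transcript and two exons.
features = [
    {"type": "gene", "start": 100, "end": 900, "tags": {"gene": "adh"}},
    {"type": "mRNA", "start": 100, "end": 900, "tags": {"gene": "adh"}},
    {"type": "exon", "start": 100, "end": 300, "tags": {"gene": "adh"}},
    {"type": "exon", "start": 600, "end": 900, "tags": {"gene": "adh"}},
    {"type": "gene", "start": 2000, "end": 2500, "tags": {"gene": "xyz"}},
]
groups = group_features(features)
adh_graph = build_graph(groups["adh"])
```

The two steps are kept separate on purpose: one pass only assigns groups,
the other only relates features within a group, so either could be swapped
out without touching the other.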

So my request is that we make the chado writer dumb: it should not try to
group anything, but should just obey however the objects are built.  An
intermediate set of factories can take lists of features and assign
'group' fields to them; a second factory could relate them into a graph
based on SO and the group fields.  This graph can then be written out to
chadoxml.  Another factory (I was calling it Bio::SeqFeature::Transmogrifier
for the Calvin and Hobbes fans) could build the appropriate composite
objects from the graph (Genes, HSPs, where appropriate) and deal with
multiple coordinate systems (in the case of features attached to the
annotated protein product).  The 'Transmogrifier' could also turn these
composite objects back into simple feature graphs so that they can be
written to chado simply and (finally) fully written out to a genbank
record with a controlled vocab of /tag=value fields.
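
As a rough illustration of the round trip between composite objects and a
simple feature graph, and of what a "dumb" writer means, here is a small
Python sketch.  It is not Bioperl code and not chadoxml: the Node class,
the row layout (id, parent_id, type, start, end), and the tab-separated
output are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical composite feature: a gene holding transcripts, etc."""
    type: str
    start: int
    end: int
    children: List["Node"] = field(default_factory=list)


def flatten(node, parent_id=None, rows=None):
    """Composite -> flat rows of (id, parent_id, type, start, end)."""
    if rows is None:
        rows = []
    node_id = len(rows)  # ids are assigned in pre-order
    rows.append((node_id, parent_id, node.type, node.start, node.end))
    for child in node.children:
        flatten(child, node_id, rows)
    return rows


def rebuild(rows):
    """Flat rows -> composite; assumes each parent row precedes its children."""
    nodes = {}
    root = None
    for node_id, parent_id, ftype, start, end in rows:
        nodes[node_id] = Node(ftype, start, end)
        if parent_id is None:
            root = nodes[node_id]
        else:
            nodes[parent_id].children.append(nodes[node_id])
    return root


def write_rows(rows):
    """'Dumb' writer: serializes rows exactly as given, with no regrouping."""
    return "\n".join(
        "\t".join("" if value is None else str(value) for value in row)
        for row in rows
    )


# Made-up example: one gene, one transcript, two exons.
gene = Node("gene", 100, 900, [
    Node("mRNA", 100, 900, [Node("exon", 100, 300), Node("exon", 600, 900)]),
])
rows = flatten(gene)
```

The writer never inspects feature types or decides on relationships; all of
that is settled before the rows reach it, which is the division of labour
being asked for here.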

These are my ideas anyways, perhaps too much?  I know other people (Shawn
Hoon, Chris Mungall) have volunteered ideas and coding to this as well so
we'd like to see if we can perhaps work together on it.

For examples of some minimal gene objects, the easiest way to get them
right now is from any of the gene prediction parsers
(Bio::Tools::Genewise, Bio::Tools::Genomewise, Bio::Tools::Genscan,
Bio::Tools::Glimmer).

-jason
On Tue, 17 Jun 2003, Peili Zhang wrote:

> Hi,
>
> here at FlyBase, we implement chado database schema to store sequence,
> annotation, genetic, controlled vocabulary, publication and other types
> of data (for detailed information about chado schema, please visit
> http://www.gmod.org and read the schema documentations and scripts in
> its CVS).  we have developed tools to dump FlyBase data into chadoxml
> and load data in chadoxml format into FlyBase (for chadoxml dtd, please
> see
> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/gmod/schema/chado/dat/chado.dtd),
> to facilitate data communication among the different sites of FlyBase
> and between FlyBase and the rest of the world. need arises for a tool to
> convert external data in other formats into chadoxml. I'm coding a perl
> module chadoxml.pm to write out a Bio::Seq object into chadoxml. we'd
> like to get your feedback on whether it's useful to add this module into
> bioperl as an extension to the Bio::SeqIO package. if you already have
> working code for the same purpose, maybe we can discuss how to merge our
> code to produce a better version.
>
> thanks for your input.
>
> regards,
> Peili Zhang
> FlyBase-Harvard

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu