[Bioperl-l] Re: Changes to GFF 2.5 "unflattening" code

Mon Dec 15 11:50:28 EST 2003

On Fri, 2003-12-12 at 18:58, Chris Mungall wrote:
> Nice one Scott!
> 
> I imagine this script would be v useful to plenty of non GMOD/chado folks.
> Is there anything chado or GMOD specific about this? can we add it to
> bioperl instead of GMOD? (IMHO there are far too few scripts in bioperl,
> which is fine for the hardcore object-heads who'll roll up their own in a
> few minutes, but not so great for new users)

I agree, and I intend to move my script from GMOD to bioperl when it
is ready for prime time--it's not there yet.
> 
> What do you think of rolling some of the logic up from the script into
> bioperl modules? For example, the typemapping stuff could go into
> Bio::SeqFeature::Tools::TypeMapper, which already has a method for mapping
> to the Sequence Ontology

I think TypeMapper is fine (that is, now that I know about it), but it
doesn't solve the fundamental problem of letting users know how things
should be mapped so as to be consistent with both what was intended by
the original authors of the db entry and with how other people will
interpret it when converting formats.  I am thinking there may need to
be an online resource much like SO that gives "standard" mappings,
allowing individual users to override them.
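
To make the "standard mappings with user overrides" idea concrete, here
is a minimal sketch in plain Perl. It is only an illustration: the
%standard_map table, its entries, and the build_type_map/map_type
helpers are hypothetical and are not part of TypeMapper or any existing
bioperl module.

```perl
use strict;
use warnings;

# Hypothetical "standard" table from GenBank feature keys to SO terms.
# The entries are illustrative assumptions, not an agreed-upon mapping.
my %standard_map = (
    CDS           => 'CDS',
    mRNA          => 'mRNA',
    misc_RNA      => 'transcript',
    repeat_region => 'repeat_region',
);

# Build the effective map: user overrides are merged over the defaults,
# so a key given by the user replaces the standard entry.
sub build_type_map {
    my (%overrides) = @_;
    return { %standard_map, %overrides };
}

# Look up a feature type; types with no entry are returned unchanged.
sub map_type {
    my ($map, $type) = @_;
    return exists $map->{$type} ? $map->{$type} : $type;
}

my $map = build_type_map(misc_RNA => 'ncRNA');
print map_type($map, 'misc_RNA'), "\n";    # user override is used
print map_type($map, 'CDS'), "\n";         # standard entry is used
print map_type($map, 'variation'), "\n";   # unmapped, passed through
```

In a shared resource the defaults would presumably be loaded from a
published file rather than hard-coded; the merge step would stay the
same.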
> 
> Mapping of the SeqFeature nesting hierarchy to GFF ID/Parent tags could
> also take place in FeatureHolderI, as discussed on this list the other
> week.

Yep.
> 
> By the way, what are you doing for parent features that don't have a
> natural ID? Are you creating artificial surrogate IDs?

Artificial IDs where necessary, though this is an evolving part of the
script (I think it will always use artificial IDs--I don't see a way
around that--but the way I am creating them is changing).
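
As an illustration only, one possible way to mint surrogate IDs is a
per-type counter that is used whenever a feature has no natural ID.
The feature_id helper and the "<type>_auto<N>" naming scheme below are
assumptions made for this sketch; they are not necessarily what the
script does.

```perl
use strict;
use warnings;

my %counter;    # running count of surrogate IDs issued per feature type

# Return the natural ID when one exists; otherwise mint a surrogate of
# the form "<type>_auto<N>" (a hypothetical naming scheme).
sub feature_id {
    my ($type, $natural_id) = @_;
    return $natural_id if defined $natural_id && length $natural_id;
    return sprintf '%s_auto%d', $type, ++$counter{$type};
}

# A parent gene without a natural ID, and a child mRNA that refers to it
# through a GFF3-style Parent tag.
my $gene_id = feature_id('gene', undef);
my $mrna_id = feature_id('mRNA', undef);
print "gene\tID=$gene_id\n";
print "mRNA\tID=$mrna_id;Parent=$gene_id\n";
```

A counter like this is only unique within a single run, so IDs from
separately converted files could collide unless something such as the
sequence accession is folded into the name.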
> 
> That way we could easily roll out genbank2chadoxml, genbank2ensembl,
> genbank2game, genbank2das, genbank2biosql and fastafile generators like
> genbank2intron_fasta, genbank2spliced_utr_fasta, genbank2exon_fasta,
> genbank2intergenic_fasta, genbank2my_favourite_SO_type_fasta and so on - I
> think this is the sort of thing people are really often after when they
> start downloading and wrestling with the bioperl object model.
> 
> By the way, we often use genbank, when what we really mean is
> genbank/embl(/ddbj?). is there a handy short catchy name for this
> collective, or shall we carry on just using the term genbank to denote the
> collection of genbank-like formats?

I'm fine with continuing to refer to Genbank/EMBL/DDBJ as Genbank.  It's
just shorter than 'Genbank/EMBL/DDBJ'.
> 
> This is all incredibly useful stuff in my opinion - for ages we've been
> able to say "we have a parser for format X" in bioperl, but really it's
> still been a  semantic quagmire, the parsing is just the first step.
> 
> Cheers
> Chris
> 
> 
> On Fri, 12 Dec 2003, Scott Cain wrote:
> 
> > Lincoln and Sheldon,
> >
> > For your information, I wrote a new genbank2gff3.pl script for use with
> > the pending GMOD release.  I anticipate that it will form the foundation
> > for rewriting the biofetch adaptor.  It uses Unflattener.pm and seems to
> > work for the organisms I tested (human, worm, fly, mosquito, and
> > Ecoli).  It is in the GMOD cvs in the schema repository at
> > schema/chado/load/bin/genbank2gff.PLS.
> >
> > Scott
> >
> > On Fri, 2003-12-12 at 10:56, bioperl-l-request at portal.open-bio.org
> > wrote:
> > > Hi Mark, Sheldon,
> > >
> > > I saw your change to the _parse_gff2_group code in Bio::DB::GFF, which
> > > prioritizes "gene", "locus_tag" and "transcript" as group fields in
> > > the column 9 attributes.  I like it, but unfortunately it breaks some
> > > other code that I have, including the GMOD tutorial.
> > >
> > > I think you'll like what I've done instead.  I've added a
> > > preferred_groups() method to which you pass a list of group names.
> > > Then, this list will be used as the priority list to pluck out groups
> > > from the GFF2 attribute list.  To get your previous behavior, you need
> > > to do this:
> > >
> > >  $db = Bio::DB::GFF->new(-preferred_groups=>['gene','locus_tag','transcript'],
> > > 	                 @other_args);
> > >  $db->load_gff(...);
> > >
> > > or this
> > >
> > >  $db = Bio::DB::GFF->new(@other_args);
> > >  $db->preferred_groups('gene','locus_tag','transcript');
> > >  $db->load_gff(...);
> > >
> > > You'll have to change your existing scripts accordingly.  Sure, this
> > > should be merged with Chris's unflattener, but then again let's just
> > > get to GFF3 as quickly as we possibly can and leave this nightmare
> > > behind us!
> > >
> > > Lincoln
> >
> 
-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         cain at cshl.org
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory