[Bioperl-l] Re: *major* error in genbank parser or am i just insane?

Chris Mungall cjm@fruitfly.org
Fri, 9 Aug 2002 14:04:16 -0700 (PDT)


What about refseqs and cDNA submissions? these are all cDNA bioentries so
there are no introns and hence no exons.

both the bioperl model and biosql have to be able to cope with these.

I think the fancy interpretation layer should be built on top of the
bioperl object model, and on top of biosql. this could take the form of:
subclasses, ontologies/controlled vocabularies, an xml layer with biofeature-specific
elements, sql views, decorator tables, entirely new tables/schema.... take
your pick, so long as it's decoupled. this way you don't force everyone to
use one single interpretation layer. for instance, if i'm working on a
quick project where I know everything I need is in the genbank NT files
(eg human assembled sequence), I can zip through the files making
seqfeature objects with the assumption that the only feature types are
gene, mRNA, CDS, variation, and I don't have to do kooky conversions
between exon features and mRNA sublocations.
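
to make the "zip through the files" case concrete, here is a rough Python sketch (not bioperl code). It assumes the usual GenBank flat-file layout, where feature keys are indented 5 spaces and locations/qualifiers sit further right. The names Feature, WANTED and scan_features are made up for illustration, and multi-line qualifier values are simply ignored.

```python
from dataclasses import dataclass, field

# Assumption from the text: only these feature types matter for this project.
WANTED = {"gene", "mRNA", "CDS", "variation"}

@dataclass
class Feature:
    key: str
    location: str
    qualifiers: dict = field(default_factory=dict)

def scan_features(lines, wanted=WANTED):
    """Yield a Feature for each wanted key in a GenBank FEATURES block."""
    in_table = False
    current = None        # feature being built, or None while skipping one
    in_location = False   # True until the first qualifier of a feature
    for line in lines:
        if line.startswith("FEATURES"):
            in_table = True
            continue
        if not in_table or not line.strip():
            continue
        indent = len(line) - len(line.lstrip(" "))
        if indent == 0:
            break         # next top-level section, e.g. ORIGIN
        text = line.strip()
        if indent == 5:   # a new feature key line
            if current is not None:
                yield current
            key, _, location = text.partition(" ")
            current = Feature(key, location.strip()) if key in wanted else None
            in_location = True
        elif current is not None:
            if text.startswith("/"):
                in_location = False
                name, _, value = text[1:].partition("=")
                current.qualifiers[name] = value.strip('"')
            elif in_location:
                current.location += text   # location wrapped onto next line
    if current is not None:
        yield current

SAMPLE = """\
FEATURES             Location/Qualifiers
     source          1..1000
                     /organism="Homo sapiens"
     gene            100..900
                     /gene="ABC1"
     mRNA            join(100..200,300..400,
                     800..900)
                     /gene="ABC1"
     exon            100..200
     CDS             join(150..200,300..350)
                     /gene="ABC1"
ORIGIN
""".splitlines()

features = list(scan_features(SAMPLE))
```

the location is kept as the raw string on purpose: no conversion between exon features and mRNA sublocations happens at this layer.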

of course we need the bottom layer correct before we have any hope of
making the interpretation layer work, ie a consistent model for
sublocation strands. sounds like we've agreed that -1 strand sublocs are
fine?
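
one possible shape for that bottom layer, sketched in Python rather than perl: every sublocation carries its own strand, and the strand of the whole location is derived from its parts. SubLocation and SplitLocation are hypothetical names here, not the bioperl classes, and returning None for mixed strands is just one convention.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class SubLocation:
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    strand: int = 1   # +1 or -1, stored per sublocation

@dataclass
class SplitLocation:
    parts: List[SubLocation]

    @property
    def strand(self) -> Optional[int]:
        """Shared strand of all parts, or None if the parts disagree."""
        strands = {p.strand for p in self.parts}
        return strands.pop() if len(strands) == 1 else None

    def length(self) -> int:
        """Total number of bases covered by the parts."""
        return sum(p.end - p.start + 1 for p in self.parts)

minus = SplitLocation([SubLocation(800, 900, -1), SubLocation(300, 400, -1)])
mixed = SplitLocation([SubLocation(100, 200, 1), SubLocation(300, 400, -1)])
```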

the interpretation layer is a different issue. i'd like to see this done
declaratively rather than imperatively (good for cross-project compatibility,
easier to understand code) but i have to admit i'm not 100% sure how this
would be done. i think xslt would get ugly rather quickly.

here's a made up example:

mRNA <=> mRNA
sublocationof(mRNA) <=> exon
misc_feature.type=snRNA <=> snRNA
CDS  <=> CDS
sublocationof(CDS)  <=> CDS-exon
sublocationof(5'UTR) <=> 5'UTR-exon
5'UTR + CDS + 3'UTR <=> mRNA
Seq(type=mRNA) <=> feature(type=mRNA)
Seq(type=protein) <=> feature(type=CDS)

forall(mRNA), mRNA.property.gene=GeneStructure.name
=> partof(mRNA, GeneStructure)
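
a few of the rules above can be sketched as plain data plus a tiny interpreter. This is Python, purely illustrative: the rule tuple format, RULES and interpret are invented for this sketch, it only covers the genbank-to-interpreted direction of "<=>", and it assumes forward-strand features whose sublocations are already sorted.

```python
# Each rule: (source feature type, required qualifiers, scope, emitted type).
# scope "feature" emits one feature spanning the whole location;
# scope "sublocation" emits one feature per sublocation.
RULES = [
    ("mRNA", {}, "feature", "mRNA"),
    ("mRNA", {}, "sublocation", "exon"),
    ("misc_feature", {"type": "snRNA"}, "feature", "snRNA"),
    ("CDS", {}, "feature", "CDS"),
    ("CDS", {}, "sublocation", "CDS-exon"),
]

def interpret(features, rules=RULES):
    """Apply every matching rule to every feature; return (type, start, end) tuples."""
    out = []
    for feat in features:
        quals = feat.get("qualifiers", {})
        spans = feat["sublocations"]
        for src_type, required, scope, target in rules:
            if feat["type"] != src_type:
                continue
            if any(quals.get(k) != v for k, v in required.items()):
                continue
            if scope == "feature":
                out.append((target, spans[0][0], spans[-1][1]))
            else:
                out.extend((target, start, end) for start, end in spans)
    return out

raw = [
    {"type": "mRNA", "sublocations": [(100, 200), (300, 400)]},
    {"type": "misc_feature", "qualifiers": {"type": "snRNA"},
     "sublocations": [(500, 600)]},
    {"type": "misc_feature", "qualifiers": {"note": "x"},
     "sublocations": [(700, 800)]},
]
interpreted = interpret(raw)
```

the point is only that the mapping lives in a table that another project could read, not in the control flow.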

there are lots of weird dependencies here; eg the UTR is occasionally
explicitly specified in the genbank file, most of the time it's implicit
from the CDS. sometimes exons are explicit, mostly implicit. there's also
the coordinate transformations between dna, transcript and protein space.
a declarative specification can help with avoiding dependency cycles.
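
two of those dependencies written out as a Python sketch: deriving the implicit UTRs from the mRNA exons plus the CDS bounds, and mapping a dna coordinate into transcript space. It assumes 1-based inclusive coordinates and a CDS given by its outer genomic bounds; the function names are made up for illustration.

```python
def implicit_utrs(exons, cds_start, cds_end, strand=1):
    """Return UTR segments implied by mRNA exons and the CDS genomic bounds."""
    left, right = [], []
    for start, end in sorted(exons):
        if start < cds_start:   # exon (or part of it) lies before the CDS
            left.append((start, min(end, cds_start - 1)))
        if end > cds_end:       # exon (or part of it) lies after the CDS
            right.append((max(start, cds_end + 1), end))
    # On the -1 strand the genomic left side is the 3' end of the transcript.
    five, three = (left, right) if strand == 1 else (right, left)
    return {"5'UTR": five, "3'UTR": three}

def genomic_to_transcript(pos, exons, strand=1):
    """Map a genomic position to a 1-based transcript position, or None if intronic."""
    ordered = sorted(exons, reverse=(strand == -1))
    offset = 0                  # transcript bases consumed by earlier exons
    for start, end in ordered:
        if start <= pos <= end:
            within = pos - start + 1 if strand == 1 else end - pos + 1
            return offset + within
        offset += end - start + 1
    return None

exons = [(100, 200), (300, 400), (800, 900)]
utrs = implicit_utrs(exons, cds_start=150, cds_end=350)
```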

this could be combined with an SO-type ontology, eg for inheritance of feature types.
ideally the declarative representation of the interpretation layer should
be part of the ontology.
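
inheritance of feature types can be sketched with a simple parent lookup. The hierarchy below is a toy one built from the type names in the made-up example, not the actual SO terms, and IS_A / is_a are hypothetical names (Python again).

```python
# Toy is_a hierarchy: child type -> parent type (single inheritance only).
IS_A = {
    "mRNA": "transcript",
    "snRNA": "transcript",
    "CDS-exon": "exon",
    "5'UTR-exon": "exon",
}

def is_a(term, ancestor, parents=IS_A):
    """True if term equals ancestor or inherits from it via the parents map."""
    while term is not None:
        if term == ancestor:
            return True
        term = parents.get(term)   # None once the root is passed
    return False
```

with this, code that asks for "exon" features would also pick up CDS-exon and 5'UTR-exon without listing them.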

this is a hard problem though; most of the time it's expedient for the
programmer to make certain assumptions about how the data they are
interested in is represented in genbank/embl, and just use bioperl/biosql
as-is, performing their own hacky transformations.


On Fri, 9 Aug 2002, Lincoln Stein wrote:

> Here's my 2c:
>
> If the genbank entry has CDS features but no exons, or an mRNA join operator
> which is out of sync with the CDS join, then in my opinion the quality of the
> annotation is so questionable that BioSQL should throw up its hands and seek
> human assistance in interpretation.  Asking the import software to read the
> minds of the submitters is beyond what can be reasonably expected, and only
> ends up propagating errors.
>
> Lincoln
>
> On Friday 09 August 2002 04:49 am, Brian King wrote:
> > > This is very hard to do because you have to handle:
> > >
> > >
> > >    (a) CDS with no Exons
> > >
> > > and, my particular favourite
> > >
> > >    (b) a mRNA join operator which is out of sync
> > > with the CDS join
> > > operator (!)
> >
> > For (a) I'd put generic sub-features in the CDS to
> > hold the places of the presumed exons, and for (b) use
> > generic sub-features for the CDS and the mRNA joins
> > and just let them be out of sync.  I surrender on
> > remote joins!  I'd keep the location string in
> > documentation in the data, but not try to interpret
> > it.  Ideally the parser would download the remote
> > record, but...
> >
> > Regards,
> > Brian
> >
> >
> >
>
>