[Bioperl-l] Re: *major* error in genbank parser or am i just insane?

Francis Ouellette francis@cmmt.ubc.ca
Fri, 09 Aug 2002 14:20:59 -0700


Lincoln Stein wrote:
> 
> I think the issue was what to do about those cases where the exon and the CDS
> features are inconsistent.  We all love and respect the NCBI data model...
> really!

I never meant to suggest that you don't respect the NCBI model; I know 
you do. I just wanted to explain that, under the model, the 
CDS join() is validated, while the individual exons may or may 
not be. So if you have an inconsistency, just use the CDS join(), 
or to code it even more simply, always use the CDS join() and 
never look at the exon features.
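
To make "just use the CDS join()" concrete, here is a minimal Python
sketch. It is only an illustration under stated assumptions: the names
cds_segments and coding_exons, the (key, location) tuples, and the sample
coordinates are all made up for this example and are not BioPerl or NCBI
code.

```python
import re

def cds_segments(location):
    """Return (start, end) pairs from a simple GenBank location string."""
    # Assumption: only plain 'a..b' ranges are expected, optionally with
    # '<' or '>' partial markers. Remote references (accession:a..b) and
    # single-base positions are not handled correctly by this sketch.
    pairs = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    return [(int(start), int(end)) for start, end in pairs]

def coding_exons(features):
    """Take coding segments from CDS features only; exon features are ignored."""
    return [cds_segments(loc) for key, loc in features if key == "CDS"]

# Hypothetical feature table: the first exon disagrees with the CDS join().
features = [
    ("exon", "10..50"),
    ("CDS", "join(12..78,134..202)"),
    ("exon", "134..202"),
]

print(coding_exons(features))
```

The exon entries never enter the result, so an inconsistent exon feature
cannot override the validated CDS coordinates.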

cheers,

f.

-- 
| B.F. Francis Ouellette                       francis@cmmt.ubc.ca | 
| Director, Bioinformatics Centre              Tel: (604) 875-3815 | 
| University of British Columbia               Fax: (604) 608-4795 | 
| Vancouver, BC Canada            http://www.cmmt.ubc.ca/ouellette |


> 
> Lincoln
> 
> On Friday 09 August 2002 03:40 pm, Francis Ouellette wrote:
> > [ apologies: long reply ]
> >
> > "Lin, Xiaoying J." wrote:
> > > > but for CDS features with no exon features, I am not sure I understand
> > > > you correctly. There are lots of submissions in GenBank which only come
> > > > with CDS (join) features, but no separate exon features. If that is a
> > > > mistake, it is a systematic mistake then. How does the current parser
> > > > handle a record like
> > > http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide
> > > &list_uids=1458097&dopt=GenBank
> >
> > > Having a CDS (with join) and no exon feature is how most (the great
> > > majority) of the CDSs that were submitted to the NCBI for inclusion
> > > in GenBank are built.
> >
> > > The rationale for this is that there were tooooo many records where the
> > > exon features were not valid/validated, which made exon a bad feature, and
> > > that the very best place (within NCBI's data model) to check and validate
> > > these is the join that makes up the CDS: make sure it is valid and makes
> > > the right protein, with valid exons. All of the information you need/want
> > > is in the join statement.
> >
> > > But "Ha" you say ... what about UTRs? Well, if you have non-coding
> > > exons, and you have their coordinates, you should put that information
> > > in a join statement in an mRNA feature.
> >
> > > With those two features (CDS and mRNA) the exon feature becomes
> > > superfluous (in the NCBI data model; I know and understand this is not
> > > the case in the bioperl world).
> >
> > > Another thing, which as far as I know is *not* validated in the current
> > > NCBI model (well, it wasn't a few years back when I was a humble civil
> > > servant): the join statement from the mRNA and the one from the
> > > corresponding CDS were not matched to make sure they were in accordance,
> > > and obviously you don't have a translation to validate that join.
> >
> > > Before people get bent out of shape against NCBI for not encouraging the
> > > exon feature, let me state the philosophy and reasoning behind that
> > > (very good, imho) decision: mRNAs and proteins are real biological
> > > entities within the cell, and in the NCBI data model exons are not -- they
> > > don't exist on their own. The NCBI data model (of which the GenBank flatfile
> > > is a *poor* text/report representation) tries to represent (read:
> > > validate, promote, allow computation on) biological "stuff". It doesn't
> > > care much for things which are not really "validatable" (an exon on
> > > its own is next to impossible to validate, and a CDS is much easier
> > > to validate).
> >
> > Anyway, I hope this long discourse explains a little where things
> > are coming from ...
> >
> > cheers,
> >
> > f.
> 
> --
> ========================================================================
> Lincoln D. Stein                           Cold Spring Harbor Laboratory
> lstein@cshl.org                                   Cold Spring Harbor, NY
> ========================================================================