[Bioperl-l] Re: *major* error in genbank parser or am i just insane?

Francis Ouellette francis@cmmt.ubc.ca
Fri, 09 Aug 2002 12:40:19 -0700


[ apologies: long reply ]

"Lin, Xiaoying J." wrote:

> but for CDS features but no exon features, I am not sure I understand
> you correctly. there are lots submissions in Genbank, which only comes
> with CDS (join) features, but no separate exon features. If that is a
> mistake, it is a systematic mistake then. How does the current parser
> handle a record like
> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide
> &list_uids=1458097&dopt=GenBank


Having a CDS (with join) and no exon feature is how most (the great
majority) of the CDSs that were submitted to the NCBI for inclusion in
GenBank are built.

The rationale for this is that there were too many records where the
exon feature was not valid/validated, which made it a bad feature, and
that the very best place (within NCBI's data model) to check and
validate these was to make sure the joins that make up the CDS are
valid, and make the right protein, with valid exons. All of the
information you need/want is in the join statement.
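
To illustrate how the exon coordinates can be read straight out of a CDS join statement, here is a minimal Python sketch. The function name `parse_join` is hypothetical (it is not part of bioperl or any NCBI tool), and it assumes a simple forward-strand location of the form `join(a..b,c..d)`; `complement(...)` and remote-accession locations are deliberately not handled.

```python
import re


def parse_join(location):
    """Return (start, end) pairs, 1-based inclusive, from a simple location.

    Assumption: forward strand only, no complement() and no remote
    accessions. Partial markers (< and >) are accepted and ignored.
    """
    inner = location.strip()
    # Strip an optional join(...) wrapper; a bare "a..b" is one segment.
    wrapped = re.fullmatch(r"join\((.*)\)", inner)
    if wrapped:
        inner = wrapped.group(1)
    segments = []
    for part in inner.split(","):
        seg = re.fullmatch(r"\s*<?(\d+)\.\.>?(\d+)\s*", part)
        if seg is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        segments.append((int(seg.group(1)), int(seg.group(2))))
    return segments


# Each segment of the join is one coding exon (or the coding part of one).
exons = parse_join("join(100..200,300..450,600..700)")
```

The gaps between consecutive segments are the introns, so nothing beyond the join is needed to recover the coding exon structure.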

But "Ha" you say ... what about UTRs? Well, if you have non-coding
exons, and you have their coordinates, you should put that information
in a join statement in an mRNA feature.

With those two features (CDS and mRNA) the exon feature becomes
superfluous (in the NCBI data model; I know and understand this is not
the case in the bioperl world).

Another thing: as far as I know, what is *not* validated in the current
NCBI model (well, it wasn't a few years back when I was a humble civil
servant) is whether the join statement from the mRNA and the one from
the corresponding CDS are in accordance with each other, and obviously
you don't have a translation to validate the mRNA join.
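
As a rough idea of what such a cross-check between the two joins could look like, here is a Python sketch. It is only an assumption about what "in accordance" should mean, not NCBI's validation logic: every CDS segment must sit inside consecutive mRNA exons, and every internal CDS boundary (a splice junction) must coincide with an mRNA exon boundary. The function name is hypothetical, and both inputs are assumed to be forward-strand lists of 1-based inclusive `(start, end)` pairs.

```python
def cds_consistent_with_mrna(cds, mrna):
    """Check a CDS join against an mRNA join (illustrative rule set only).

    cds, mrna: lists of (start, end), 1-based inclusive, forward strand,
    in transcript order.
    """
    if not cds:
        return False
    # Locate the mRNA exon that contains the first CDS segment.
    first = None
    for j, (m_start, m_end) in enumerate(mrna):
        if m_start <= cds[0][0] and cds[0][1] <= m_end:
            first = j
            break
    if first is None or first + len(cds) > len(mrna):
        return False
    last = len(cds) - 1
    for i, (c_start, c_end) in enumerate(cds):
        m_start, m_end = mrna[first + i]
        # The CDS segment must lie within the matching mRNA exon.
        if not (m_start <= c_start and c_end <= m_end):
            return False
        # Internal boundaries are splice sites and must match exactly.
        if i > 0 and c_start != m_start:
            return False
        if i < last and c_end != m_end:
            return False
    return True


mrna = [(1, 200), (300, 450), (600, 900)]
ok = cds_consistent_with_mrna([(150, 200), (300, 450), (600, 700)], mrna)
bad = cds_consistent_with_mrna([(150, 200), (310, 450), (600, 700)], mrna)
```

Here the first CDS starts inside the 5' UTR exon and the last one stops before the 3' UTR, which is allowed; the second example fails because its middle segment starts 10 bases into the mRNA exon.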

Before people get bent out of shape against NCBI for not encouraging the
exon feature, let me state the philosophy and reasoning behind that
(very good, imho) decision: mRNA and proteins are real biological
entities within the cell, and within the NCBI data model, exons are
not; they don't exist on their own. The NCBI data model (of which the
GenBank flatfile is a *poor* text/report representation) tries to
represent (read: validate, promote, allow computation on) biological
"stuff". It doesn't care much for things which are not really
"validatable" (an exon on its own is next to impossible to validate,
and a CDS is much easier to validate).
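
To make the "a CDS is much easier to validate" point concrete, here is a small Python sketch of the kind of check a join allows: splice the segments together and see whether the result looks like an open reading frame. This is not NCBI's validator. The function names are hypothetical, and the sketch assumes a complete, forward-strand CDS using the standard stop codons and an ATG start.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def splice(sequence, segments):
    """Concatenate 1-based inclusive (start, end) segments of a sequence."""
    return "".join(sequence[start - 1:end] for start, end in segments)


def cds_problems(sequence, segments):
    """Return a list of reasons the spliced CDS does not look like an ORF."""
    coding = splice(sequence, segments).upper()
    problems = []
    if len(coding) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not coding.startswith("ATG"):
        problems.append("does not start with ATG")
    # Only complete codons are considered.
    codons = [coding[i:i + 3] for i in range(0, len(coding) - 2, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal in-frame stop codon")
    return problems


# Toy sequence: two coding exons (upper case) around one intron.
toy = "ccATGGCCgtaagTTTTAAcc"
good = cds_problems(toy, [(3, 8), (14, 19)])
off_by_one = cds_problems(toy, [(3, 8), (13, 19)])
```

A single exon feature gives you nothing comparable to test: shifting one boundary by a base breaks the reading frame of the whole join, but an isolated exon interval looks just as plausible either way.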

Anyway, I hope this long discourse explains a little where things 
are coming from ...

cheers,

f.


-- 
| B.F. Francis Ouellette                       francis@cmmt.ubc.ca | 
| Director, Bioinformatics Centre              Tel: (604) 875-3815 | 
| University of British Columbia               Fax: (604) 608-4795 | 
| Vancouver, BC Canada            http://www.cmmt.ubc.ca/ouellette |