[Bioperl-l] [Gmod-schema] bp_genbank2gff3.pl in bioperl-live: why map CDS to gene_component_region?

Leighton Pritchard lpritc at scri.ac.uk
Wed Mar 24 12:05:08 UTC 2010


Hi,

I'm surprised that this issue hasn't come up already, as the change to the
gene model is quite significant.  For comparison, this is what the old
bp_genbank2gff3.pl script would produce with --CDS:

NC_000913    GenBank    gene    190    255    .    +    .
ID=thrL;Dbxref=EcoGene:EG11277,ECOCYC:EG11277,GeneID:944742;Note=synonyms:
ECK0001%2C JW4367;gene=thrL;locus_tag=b0001
NC_000913    GenBank    mRNA    190    255    .    +    .
ID=thrL.t01;Parent=thrL
NC_000913    GenBank    CDS    190    255    .    +    .
ID=thrL.p01;Parent=thrL.t01;Dbxref=ASAP:ABE-0000006,UniProtKB/Swiss-Prot:P0A
D86,GI:16127995,EcoGene:EG11277,ECOCYC:EG11277,GeneID:944742;Note=GO_process
: threonine biosynthetic process [goid
0009088];Ontology_term=GO:0009088;codon_start=1;function=leader%3B Amino
acid biosynthesis: Threonine,1.5.1.8 metabolism%3B building block
biosynthesis%3B amino acids%3B
threonine;gene=thrL;locus_tag=b0001;product=thr operon leader
peptide;protein_id=NP_414542.1;transl_table=11;translation=length.21
NC_000913    GenBank    exon    190    255    .    +    .    Parent=thrL.t01

and with --noCDS:

NC_000913    GenBank    gene    190    255    .    +    .
ID=thrL;Dbxref=EcoGene:EG11277,ECOCYC:EG11277,GeneID:944742;Note=synonyms:
ECK0001%2C JW4367;gene=thrL;locus_tag=b0001
NC_000913    GenBank    mRNA    190    255    .    +    .
ID=thrL.t01;Parent=thrL
NC_000913    GenBank    polypeptide    190    255    .    +    .
ID=thrL.p01;Dbxref=ASAP:ABE-0000006,UniProtKB/Swiss-Prot:P0AD86,GI:16127995,
EcoGene:EG11277,ECOCYC:EG11277,GeneID:944742;Derives_from=thrL.t01;Note=GO_p
rocess: threonine biosynthetic process [goid
0009088];Ontology_term=GO:0009088;codon_start=1;function=leader%3B Amino
acid biosynthesis: Threonine,1.5.1.8 metabolism%3B building block
biosynthesis%3B amino acids%3B
threonine;gene=thrL;locus_tag=b0001;product=thr operon leader
peptide;protein_id=NP_414542.1;transl_table=11;translation=length.21
NC_000913    GenBank    exon    190    255    .    +    .    Parent=thrL.t01


The new script produces this identical output with both --CDS and --noCDS:

NC_000913       GenBank region  190     255     .       +       .
ID=GenBank:region:NC_000913:190:255
NC_000913       GenBank exon    190     255     .       +       .
ID=GenBank:exon:NC_000913:190:255
NC_000913       GenBank gene    190     255     .       +       .
ID=b0001;Dbxref=EcoGene:EG11277,ECOCYC:EG11277,GeneID:944742;Note=synonyms:
ECK0001%2C JW4367;gene=thrL;locus_tag=b0001
NC_000913       GenBank gene_component_region   190     255     .       +
.       
Parent=b0001;Dbxref=ASAP:ABE-0000006,UniProtKB/Swiss-Prot:P0AD86,GI:16127995
,EcoGene:EG11277,ECOCYC:EG11277,GeneID:944742;Note=GO_process: threonine
biosynthetic process [goid
0009088];Ontology_term=GO:0009088;codon_start=1;function=leader%3B Amino
acid biosynthesis: Threonine,1.5.1.8 metabolism%3B building block
biosynthesis%3B amino acids%3B
threonine;gene=thrL;locus_tag=b0001;product=thr operon leader
peptide;protein_id=NP_414542.1;transl_table=11;translation=MKRISTTITTTITITTG
NGAG


So, although the new script improves the parent-child relationships by
identifying parents on the locus_tag field (guaranteed to be unique), rather
than gene name (not guaranteed to be unique), the GFF3 gene model has
apparently changed from canonical:

gene <- mRNA <- {polypeptide/CDS, exon}

to this:

region ; exon ; gene <- gene_component_region

So I guess I don't understand the region-exon-gene part of the new model,
after all.  This new model doesn't appear to be Sequence Ontology-compatible
any more (e.g. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1175956/) as exon
is no longer considered part_of the transcript.  In fact, there's not a
transcript.  Given that the SO cite bp_genbank2gff3.pl as a way to get
SO-compliant GFF3 
(http://www.sequenceontology.org/resources/faq.html#convert), this might be
an issue requiring a prompt fix or reversion.


For now, due to the downstream problems this model causes with GBROWSE and
ARTEMIS, I'm going to go back to BioPerl 1.6.1, with a modification to the
script to use the locus_tag field rather than the gene field for the feature
ID.


Cheers,

L.

On 23/03/2010 Tuesday, March 23, 18:18, "Scott Cain" <scott at scottcain.net>
wrote:

> Hi Leighton,
> 
> I wonder if this is a change stemming from Nathan's work on this
> script.  Nathan?
> 
> Scott


-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries.  This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________



More information about the Bioperl-l mailing list