[Bioperl-l] Loading NCBI/GenBank bacteria into CHADO: Chromosome/Plasmid gene name conflicts

Leighton Pritchard lpritc at scri.ac.uk
Mon Mar 1 11:32:10 UTC 2010


Hi,

I've tried going back through the mailing list, Googling the answer, and
reading the documentation and wiki to find a solution for this.  I've either
missed it, or it's not there yet.  Hopefully there's a simple solution, or
an option that I'm just not seeing.  I'm sure other people must be using
CHADO for bacterial genomes, and I would be interested in hearing about best
practice for using CHADO/GBROWSE with these sequences (I've seen
http://gmod.org/wiki/Chado_for_prokaryotes - but there's not much in
there...).

I have a working CHADO(GMOD-1.0)/GBROWSE2/BioPerl 1.6.1 setup on CentOS 5.4,
and I'm trying to load some bacterial data.  Specifically for this example,
I'm trying to get the GenBank sequences for E.coli S88: NC_011742 and
NC_011747 into CHADO.  I've been following instructions from a number of
locations, including http://gmod.org/wiki/Artemis-Chado_Integration_Tutorial
and http://gmod.org/wiki/Chado_Tutorial, but there's an issue with these two
files, in that the NC_011742 (chromosome) and NC_011747 (plasmid) sequences
contain genes that have the same names (and several genes with the same name
in the same sequence!), and this appears to be a problem.  Here's what's
going wrong:

I start off with the two GenBank files:

"""
[lpritc at localhost ~]$ ls -1 *.gbk
NC_011742.gbk
NC_011747.gbk
"""

And convert these to .gff3 using the BioPerl script (it doesn't seem to
matter whether I pass them with the wildcard, or convert separately, though
passing multiple sequences for conversion might be a good place to check for
unique IDs):

"""
[lpritc at localhost ~]$ bp_genbank2gff3.pl -s *.gbk
# Input: NC_011742.gbk
# working on region:NC_011742, Escherichia coli S88, 19-DEC-2008,
Escherichia coli S88, complete genome.
# GFF3 saved to ./NC_011742.gbk.gff
# Summary:
# Feature    Count
# -------    -----
# mRNA  4696
# gene  4898
# region  1
# pseudogene  151
# CDS  4696
# RESIDUES(tr)  1442813
# RESIDUES  5032268
# processed_transcript  89
# rRNA  22
# pseudogenic_region  151
# exon  4899
# tRNA  91
# 
# Input: NC_011747.gbk
# working on region:NC_011747, Escherichia coli S88, 18-AUG-2009,
Escherichia coli S88 plasmid pECOS88, complete sequence.
# GFF3 saved to ./NC_011747.gbk.gff
# Summary:
# Feature    Count
# -------    -----
# mRNA  4832
# gene  5037
# region  2
# pseudogene  159
# CDS  4832
# RESIDUES(tr)  1477756
# RESIDUES  5166121
# processed_transcript  92
# rRNA  22
# pseudogenic_region  159
# exon  5038
# tRNA  91
# 
"""

I can then use the gmod_bulk_load_gff3.pl script to load either file, but
only singly.  This appears to work, and the result is visible and seemingly
correctly navigable in GBROWSE (using NC_011747 as the first sequence here,
but the order is unimportant):

"""
[lpritc at localhost ~]$ gmod_bulk_load_gff3.pl --organism E.coli --dbxref
GeneID --noexon --recreate_cache --gfffile NC_011747.gbk.gff
(Re)creating the uniquename cache in the database...
Creating table...
Populating table...
Creating indexes...Done.
Preparing data for inserting into the chado database
(This may take a while ...)
Dropping cds temp tables...
Creating cds temp tables...
NOTICE:  CREATE TABLE will create implicit sequence
"tmp_cds_handler_cds_row_id_seq" for serial column
"tmp_cds_handler.cds_row_id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"tmp_cds_handler_pkey" for table "tmp_cds_handler"
NOTICE:  CREATE TABLE will create implicit sequence
"tmp_cds_handler_relationship_rel_row_id_seq" for serial column
"tmp_cds_handler_relationship.rel_row_id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"tmp_cds_handler_relationship_pkey" for table "tmp_cds_handler_relationship"
Loading data into feature table ...
Loading data into featureloc table ...
Loading data into feature_relationship table ...
Loading data into featureprop table ...
Skipping feature_cvterm table since the load file is empty...
Skipping synonym table since the load file is empty...
Skipping feature_synonym table since the load file is empty...
Skipping dbxref table since the load file is empty...
Loading data into feature_dbxref table ...
Skipping analysisfeature table since the load file is empty...
Skipping cvterm table since the load file is empty...
Skipping db table since the load file is empty...
Skipping cv table since the load file is empty...
Skipping analysis table since the load file is empty...
Skipping organism table since the load file is empty...
Adding cvtermprop=MapReferenceType for 'region' ...
Loading sequences (if any) ...
Optimizing database (this may take a while) ...
  (feature featureloc feature_relationship featureprop feature_cvterm
synonym feature_synonym dbxref feature_dbxref analysisfeature cvterm db cv
analysis organism ) Done.

While this script has made an effort to optimize the database, you
should probably also run VACUUM FULL ANALYZE on the database as well
"""

"""
chado=> SELECT feature_id, organism_id, name, uniquename FROM feature WHERE
name='NC_011747';
 feature_id | organism_id |   name    | uniquename
------------+-------------+-----------+------------
     146917 |          99 | NC_011747 | NC_011747
"""

However, attempting to load in the second sequence throws an error (though
this might also be a good point to check for ID uniqueness with a database
check, and appropriate modification to the ID, if necessary - problems could
arise if we were trying to add genuine duplicates, though...):

"""
[lpritc at localhost ~]$ gmod_bulk_load_gff3.pl --organism E.coli --dbxref
GeneID --noexon --recreate_cache --gfffile NC_011742.gbk.gff
(Re)creating the uniquename cache in the database...
Creating table...
Populating table...
Creating indexes...Done.
Preparing data for inserting into the chado database
(This may take a while ...)
Dropping cds temp tables...
Creating cds temp tables...
NOTICE:  CREATE TABLE will create implicit sequence
"tmp_cds_handler_cds_row_id_seq" for serial column
"tmp_cds_handler.cds_row_id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"tmp_cds_handler_pkey" for table "tmp_cds_handler"
NOTICE:  CREATE TABLE will create implicit sequence
"tmp_cds_handler_relationship_rel_row_id_seq" for serial column
"tmp_cds_handler_relationship.rel_row_id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"tmp_cds_handler_relationship_pkey" for table "tmp_cds_handler_relationship"

no parent yacC;
you probably need to rerun the loader with the --recreate_cache option

Issuing rollback() due to DESTROY without explicit disconnect() of
DBD::Pg::db handle dbname=chado;port=5432;host=localhost.
"""

This, of course, prevents the upload of the sequence and its annotations, as
a whole.

The script recommends that the --recreate_cache option should be used, but I
am already using it.  If the same process is run, reversing the order of the
input files, the same error is reported, but for the gene with name 'int'.
Both sequences contain genes with the names 'int' and 'yacC' (NC_011742
appears to contain four genes with the name 'int'):

"""
[lpritc at localhost ~]$ grep 'ID=yacC;' *.gbk.gff
NC_011742.gbk.gff:NC_011742    GenBank    gene    142755    143273    .    -
.    ID=yacC;Dbxref=GeneID:7130628;gene=yacC;locus_tag=ECS88_0131
NC_011747.gbk.gff:NC_011747    GenBank    gene    85083    85931    .    +
.    ID=yacC;Dbxref=GeneID:7119486;gene=yacC;locus_tag=pECS88_0103

[lpritc at localhost ~]$ grep 'ID=int;' *.gbk.gff
NC_011742.gbk.gff:NC_011742    GenBank    gene    1182443    1183585    .
-    .    ID=int;Dbxref=GeneID:7131611;gene=int;locus_tag=ECS88_1152
NC_011742.gbk.gff:NC_011742    GenBank    pseudogene    1998684    1999646
.    +    .    
ID=int;Dbxref=GeneID:7128964;gene=int;locus_tag=ECS88_2031;pseudo=_no_value
NC_011742.gbk.gff:NC_011742    GenBank    gene    2829972    2830991    .
+    .    ID=int;Dbxref=GeneID:7131911;gene=int;locus_tag=ECS88_2851
NC_011742.gbk.gff:NC_011742    GenBank    gene    3220074    3221336    .
+    .    ID=int;Dbxref=GeneID:7129893;gene=int;locus_tag=ECS88_3250
NC_011747.gbk.gff:NC_011747    GenBank    gene    132    872    .    +    .
ID=int;Dbxref=GeneID:7119360;gene=int;locus_tag=pECS88_0001
"""

Commenting out either of these genes, and their child features, defers the
error to another gene that has the same name in both sequences in each case.
It seems that the problem might derive from attempting to uniquely associate
each gene uniquely with its 'gene' tag in the GenBank file and, as there are
several points in the process where it would be sensible to check for name
collisions, so that the feature:uniquename column can be modified to reflect
this, I looked for command-line options to each script, but didn't see one
that could help.  Examining the manual for gmod_bulk_load_gff3.pl suggests
that this might be the problem (though I might be misunderstanding it):

"""
       Column 9 (group)
           Here is where the magic happens.

           Assigning feature.name, feature.uniquename
               The values of feature.name and feature.uniquename are
assigned according to these simple rules:

               If there is an ID tag, that is used as feature.uniquename
                   otherwise, it is assigned a uniquename that is equal to
¹auto¹ concatenated with the feature_id.

                   (Note that this is a potential problem as there is no
check to make sure that it is appropriately unique.)

               If there is a Name tag, it¹s value is set to feature.name;
                   otherwise it is null.

                   Note that these rules are much more simple than that
those that Bio::DB::GFF uses, and may need to be revisited.
"""

I suspect that, as the bp_genbank2gff3.pl script converts gene names (which
are not guaranteed to be unique) to ID tags, the problem recognised in the
manual is cropping up at this point.  Luckily, the GenBank files come with
locus_tag tags, which should be unique for each gene (see
http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html#locus_tag).  For
bacteria, at least, using the locus_tag values might be a more robust option
for the bp_genbank2gff3.pl; this already appears to have been recognised in
the script comments:

"""
            #?? should gene_name from
/locus_tag,/gene,/product,/transposon=xxx
            # be converted to or added as  Name=xxx (if not ID= or as well)
            ## problematic: convert_to_name ($feature); # drops
/locus_tag,/gene, tags
"""

I can get round the upload problem somewhat suckily by changing the priority
given to 'locus_tag' and 'gene' tags for generating the .gff ID tag in the
bp_genbank2gff3.pl script:

"""
[lpritc at localhost ~]$ diff bp_genbank2gff3.pl /usr/bin/bp_genbank2gff3.pl
976,977c976,977
<     if ($g->has_tag('locus_tag')) {
<         ($gene_id) = $g->get_tag_values('locus_tag');
---
>     if ($g->has_tag('gene')) {
>         ($gene_id) = $g->get_tag_values('gene');
979,980c979,980
<     elsif ($g->has_tag('gene')) {
<         ($gene_id) = $g->get_tag_values('gene');
---
>     elsif ($g->has_tag('locus_tag')) {
>         ($gene_id) = $g->get_tag_values('locus_tag');
"""

But this isn't a complete solution, as GBROWSE searches by gene name don't
work after making this change, and presumably some further configuration or
hacking about is required to sort that out (advice welcome).

So, what are other people doing to overcome this issue (if you've seen it),
and would a change to the bp_genbank2gff.pl script along the lines I mention
be useful to others?

Cheers,

L.


-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by guarantee. 
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries.  This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed.  It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________




More information about the Bioperl-l mailing list