[Bioperl-l] bp_genbank2gff3.pl - circular genomes, origin-spanning features, and GFF3

Leighton Pritchard lpritc at scri.ac.uk
Fri Apr 9 09:34:57 EDT 2010


Hi,

(cc'd to Lincoln due to GFF3 relevance - apologies to all for the earlier
partial post, I accidentally clicked on 'send' when moving windows about)

I've recently been trying to use BioPerl, CHADO and GBROWSE to represent
bacterial genome sequences.  In doing this, I've been testing with GenBank
genome/feature files, converting these to GFF3 with bp_genbank2gff3.pl to
get a CHADO-friendly gene model.  There appears to be an issue when
converting GenBank files that contain features which span the genomic
origin.

For example, the GenBank file NC_002128.gbk describes a plasmid from E. coli
O157:H7.  This contains the following feature, which spans the reference
sequence origin:

     gene            join(92527..92721,1..2502)
                     /gene="tagA"
                     /locus_tag="pO157p01"
                     /db_xref="GeneID:1789672"
     CDS             join(92527..92721,1..2502)
                     /gene="tagA"
                     /locus_tag="pO157p01"
                     /codon_start=1
                     /transl_table=11
                     /product="ToxR-regulated lipoprotein"
                     /protein_id="NP_052607.1"
                     /db_xref="GI:10955349"
                     /db_xref="GeneID:1789672"
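
To make "spans the origin" concrete, here is a minimal Python sketch (not part of
bp_genbank2gff3.pl or BioPerl) that flags a location string as origin-spanning
when one segment ends at the last base and the next starts at base 1.  The
function names are hypothetical, the sequence length of 92721 is an assumption
read off the first segment's end coordinate, and only simple a..b and
join(...) locations are handled.

```python
import re


def parse_segments(location):
    """Return (start, end) pairs found in a simple GenBank location string."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def spans_origin(location, seq_length):
    """True if consecutive segments run from the last base round to base 1."""
    segments = parse_segments(location)
    for (_, end1), (start2, _) in zip(segments, segments[1:]):
        if end1 == seq_length and start2 == 1:
            return True
    return False


# Assumed sequence length: 92721 (the end of the first segment above).
print(spans_origin("join(92527..92721,1..2502)", 92721))
print(spans_origin("2589..3464", 92721))
```

A check like this could be used to pick out the features worth inspecting before
conversion; it says nothing about how the converter itself treats them.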

Using the bp_genbank2gff3.pl script (either from bioperl-live or
release 1.6.1) to convert NC_002128.gbk to GFF3 with the command line

$ bp_genbank2gff3.pl ./Escherichia_coli_O157H7/NC_002128.gbk -out stdout > test.gff3

produces GFF that is not Sequence Ontology-compatible: CDS features are not
explicitly related to their parent gene features, and exons/mRNA are not
inferred:
"""
[...]
NC_002128       GenBank CDS     2589    3464    .       +       .       ID=pO157p02;Dbxref=GI:10955267,GeneID:1789731;Note=type II secretion pathway related protein;codon_start=1;gene=etpC;locus_tag=pO157p02;product=EtpC;protein_id=NP_052608.1;transl_table=11;translation=length.291
NC_002128       GenBank gene    2589    3464    .       +       .       ID=pO157p02;Dbxref=GeneID:1789731;gene=etpC;locus_tag=pO157p02
NC_002128       GenBank CDS     3675    5432    .       +       .       ID=pO157p03;Dbxref=GI:10955268,GeneID:1789733;Note=type II secretion pathway related protein;codon_start=1;gene=etpD;locus_tag=pO157p03;product=EtpD;protein_id=NP_052609.1;transl_table=11;translation=length.585
NC_002128       GenBank gene    3675    5432    .       +       .       ID=pO157p03;Dbxref=GeneID:1789733;gene=etpD;locus_tag=pO157p03
NC_002128       GenBank CDS     5432    6937    .       +       .       ID=pO157p04;Dbxref=GI:10955269,GeneID:1789725;Note=type II secretion pathway related protein;codon_start=1;gene=etpE;locus_tag=pO157p04;product=EtpE;protein_id=NP_052610.1;transl_table=11;translation=length.501
NC_002128       GenBank gene    5432    6937    .       +       .       ID=pO157p04;Dbxref=GeneID:1789725;gene=etpE;locus_tag=pO157p04
[...]
"""

Removing the wrapped feature from the .gbk file and running the script again
(same command-line) produces 'correct' output, with an SO-compatible model:

"""
[...]
NC_002128       GenBank gene    2589    3464    .       +       .       ID=pO157p02;Dbxref=GeneID:1789731;gene=etpC;locus_tag=pO157p02
NC_002128       GenBank mRNA    2589    3464    .       +       .       ID=pO157p02.t01;Parent=pO157p02
NC_002128       GenBank polypeptide     2589    3464    .       +       .       ID=pO157p02.p01;Dbxref=GI:10955267,GeneID:1789731;Derives_from=pO157p02.t01;Note=type II secretion pathway related protein;codon_start=1;gene=etpC;locus_tag=pO157p02;product=EtpC;protein_id=NP_052608.1;transl_table=11;translation=length.291
NC_002128       GenBank exon    2589    3464    .       +       .       Parent=pO157p02.t01
NC_002128       GenBank gene    3675    5432    .       +       .       ID=pO157p03;Dbxref=GeneID:1789733;gene=etpD;locus_tag=pO157p03
NC_002128       GenBank mRNA    3675    5432    .       +       .       ID=pO157p03.t01;Parent=pO157p03
NC_002128       GenBank polypeptide     3675    5432    .       +       .       ID=pO157p03.p01;Dbxref=GI:10955268,GeneID:1789733;Derives_from=pO157p03.t01;Note=type II secretion pathway related protein;codon_start=1;gene=etpD;locus_tag=pO157p03;product=EtpD;protein_id=NP_052609.1;transl_table=11;translation=length.585
NC_002128       GenBank exon    3675    5432    .       +       .       Parent=pO157p03.t01
NC_002128       GenBank gene    5432    6937    .       +       .       ID=pO157p04;Dbxref=GeneID:1789725;gene=etpE;locus_tag=pO157p04
NC_002128       GenBank mRNA    5432    6937    .       +       .       ID=pO157p04.t01;Parent=pO157p04
[...]
"""

Somewhere, the bp_genbank2gff3.pl script is baulking at the wrapped feature,
and this appears to be disrupting its ability to construct gene models.
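
The difference between the two outputs can be checked mechanically.  The
following is a rough Python sketch (the function names are hypothetical, and it
is not part of BioPerl) that lists CDS/polypeptide/mRNA/exon records carrying
neither a Parent nor a Derives_from attribute.  It assumes tab-delimited GFF3
with one record per line, i.e. the file as written by the script rather than the
wrapped text quoted in this mail.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value strings."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def unlinked_features(gff_lines, types=("CDS", "polypeptide", "mRNA", "exon")):
    """Return (type, ID) for records of the given types with no parent link."""
    missing = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        if cols[2] in types:
            attrs = parse_attributes(cols[8])
            if "Parent" not in attrs and "Derives_from" not in attrs:
                missing.append((cols[2], attrs.get("ID")))
    return missing


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python check_links.py test.gff3
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for feature_type, feature_id in unlinked_features(handle):
                print(feature_type, feature_id)
```

On the first output above every CDS would be reported; on the second, none of
the records shown would be.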

Now, since circular sequences are declared in the GenBank file (and there
are over 1000 bacterial, mostly circular, genomes), wrapped features
must be expected.  So I had a look to see how these are handled in
BioPerl/GFF3/etc., and came across the following:

A proposal from 2008 for modifying CHADO/GBROWSE/GFF3 to handle
origin-crossing features:
http://old.nabble.com/Circular-genomes-in-Chado-BioPerl-td19378544.html

The current(?) GFF3 spec (1.15), which doesn't describe how to handle this:
http://www.sequenceontology.org/gff3.shtml

A proposed extension to GFF3 (1.01), which adds a topology attribute to the
reference sequence to indicate circularity:
http://www.pathogenportal.org/gff3-usage-conventions.html

Which is reminiscent of Nathan's proposal, here:
http://gmod.org/wiki/GBrowse_FAQ#How_do_I_show_circular_genomes.3F

But it doesn't look like anything made it into the GFF3 spec, or into the
BioPerl script as a result.  So I was wondering whether I'm missing something
here, whether anything has in fact changed since the 2008 proposal, and what
direction people think this might take in BioPerl/GFF3 (links to a thread
are fine if it's already under discussion).  FWIW I like the
topology=circular|linear attribute from IOWG and the modular arithmetic
approach in the linked thread.
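
For illustration only, one way a modular arithmetic scheme might look in Python.
This is my own sketch under stated assumptions, not necessarily what the linked
thread proposes: a wrapped feature is stored with start <= end by letting end
run past the sequence length, and is mapped back to 1-based coordinates modulo
that length.  The function names and the length of 92721 are assumptions.

```python
def wrap(pos, seq_length):
    """Map a 1-based position that may exceed seq_length back onto the circle."""
    return (pos - 1) % seq_length + 1


def unwrap_feature(start, end, seq_length):
    """Return (start, end) with start <= end, letting end exceed seq_length."""
    if end < start:
        end += seq_length
    return start, end


def to_segments(start, end, seq_length):
    """Split an unwrapped feature into linear segments, as in a join(...)."""
    if end <= seq_length:
        return [(start, end)]
    return [(start, seq_length), (1, wrap(end, seq_length))]


# tagA from the example above, assuming a sequence length of 92721.
start, end = unwrap_feature(92527, 2502, 92721)
print(start, end)                         # 92527 95223
print(to_segments(start, end, 92721))     # [(92527, 92721), (1, 2502)]
```

The appeal of this kind of representation is that feature length stays
end - start + 1 and a single start/end pair describes the feature, with the
split into segments done only where a linear view is required.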

Cheers,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405

