[Bioperl-l] bp_genbank2gff3.pl

David Breimann david.breimann at gmail.com
Sat Sep 18 13:57:30 UTC 2010


Here is an interim summary of my situation:
I'm using Ubuntu 10.04 and Perl 5.10.1.
I get unexpected results when using bp_genbank2gff3.pl ("Name=" instead of
"locus_tag=" in the last GFF3 column), while Scott gets the expected results
using the latest version of bioperl.
I cloned a fresh copy of bioperl-live into my ~/src:
$ cd ~/src
$ git clone http://github.com/bioperl/bioperl-live.git

I then added the following line to the end of ~/.profile:
export PERL5LIB="$HOME/src/bioperl-live:$PERL5LIB"
and ran
$ source ~/.profile
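
Since the open question is which copy of BioPerl actually gets loaded, here is a minimal sketch for checking which PERL5LIB entry would supply a given module file. The function name first_perl5lib_hit is hypothetical, and the sketch assumes a POSIX-style shell and that Bio/SeqIO.pm sits at the top level of the clone. It only looks at PERL5LIB, not at the rest of Perl's @INC, so a "not found" result means Perl would fall back to an installed copy.

```shell
# Print the first PERL5LIB entry that contains the given module file.
# Perl searches PERL5LIB directories before its built-in library paths,
# so the first hit here is the copy Perl would load.
# (first_perl5lib_hit is a hypothetical helper name, not part of BioPerl.)
first_perl5lib_hit() {
    module_path=$1
    old_ifs=$IFS
    IFS=:
    for dir in ${PERL5LIB:-}; do
        if [ -n "$dir" ] && [ -f "$dir/$module_path" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Example: which PERL5LIB entry, if any, provides Bio::SeqIO?
first_perl5lib_hit Bio/SeqIO.pm || echo "Bio/SeqIO.pm not found in any PERL5LIB entry"
```

If the module loads at all, asking Perl directly is the more reliable check: perl -MBio::SeqIO -e 'print $INC{"Bio/SeqIO.pm"}' prints the path of the file that was really used.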

I then downloaded a small genome from NCBI:
$ wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_E24377A/NC_009789.gbk
and tested the script:
$ ~/src/bioperl-live/scripts/Bio-DB-GFF/genbank2gff3.PLS NC_009789.gbk

Here are the first 10 lines of the resulting GFF3:

##gff-version 3
# sequence-region NC_009789 1 6199
# conversion-by bp_genbank2gff3.pl
# organism Escherichia coli E24377A
# date 06-JAN-2010
# Note Escherichia coli E24377A plasmid pETEC_6, complete sequence.
NC_009789    GenBank    region    1    6199    .    +    1    ID=NC_009789;Dbxref=Project:13960,taxon:331111;Name=NC_009789;Note=Escherichia coli E24377A plasmid pETEC_6%2C complete sequence.,PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review. The reference sequence was derived from CP000798. Source DNA and bacteria available from Jacques Ravel (jravel at tigr.org). COMPLETENESS: full length. ;comment1=PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review. The reference sequence was derived from CP000798. Source DNA and bacteria available from Jacques Ravel (jravel at tigr.org). COMPLETENESS: full length. ;date=06-JAN-2010;mol_type=genomic DNA;organism=Escherichia coli E24377A;plasmid=pETEC_6;strain=E24377A
NC_009789    GenBank    gene    665    781    .    -    1    ID=EcE24377A_B0001;Dbxref=GeneID:5585816;Name=EcE24377A_B0001
NC_009789    GenBank    mRNA    665    781    .    -    1    ID=EcE24377A_B0001.t01;Parent=EcE24377A_B0001
NC_009789    GenBank    CDS    665    781    .    -    1    ID=EcE24377A_B0001.p01;Parent=EcE24377A_B0001.t01;Dbxref=GI:157149501,GeneID:5585816;Name=EcE24377A_B0001;Note=identified by glimmer%3B putative;codon_start=1;product=hypothetical protein;protein_id=YP_001451539.1;transl_table=11;translation=length.38

And these are the same lines from Scott's file:
##gff-version 3
# sequence-region NC_009789 1 6199
# conversion-by bp_genbank2gff3.pl
# organism Escherichia coli E24377A
# date 06-JAN-2010
# Note Escherichia coli E24377A plasmid pETEC_6, complete sequence.
NC_009789    GenBank    region    1    6199    .    +    1    ID=NC_009789;Dbxref=Project:13960,taxon:331111;Note=Escherichia coli E24377A plasmid pETEC_6%2C complete sequence.,PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review. The reference sequence was derived from CP000798. Source DNA and bacteria available from Jacques Ravel (jravel at tigr.org). COMPLETENESS: full length. ;comment1=PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review. The reference sequence was derived from CP000798. Source DNA and bacteria available from Jacques Ravel (jravel at tigr.org). COMPLETENESS: full length. ;date=06-JAN-2010;mol_type=genomic DNA;organism=Escherichia coli E24377A;plasmid=pETEC_6;strain=E24377A
NC_009789    GenBank    gene    665    781    .    -    1    ID=EcE24377A_B0001;Dbxref=GeneID:5585816;locus_tag=EcE24377A_B0001
NC_009789    GenBank    mRNA    665    781    .    -    1    ID=EcE24377A_B0001.t01;Parent=EcE24377A_B0001
NC_009789    GenBank    CDS    665    781    .    -    1    ID=EcE24377A_B0001.p01;Parent=EcE24377A_B0001.t01;Dbxref=GI:157149501,GeneID:5585816;Note=identified by glimmer%3B putative;codon_start=1;locus_tag=EcE24377A_B0001;product=hypothetical protein;protein_id=YP_001451539.1;transl_table=11;translation=length.38



Note that on the gene and CDS lines, my version has "Name=" where Scott's
has "locus_tag=", as desired. My region line also carries a "Name=" tag that
Scott's lacks.
I have no idea what is going on here...
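
To make the difference easier to spot than by eyeballing long attribute strings, here is a rough sketch that compares the attribute keys of two GFF3 last-column values. The function names gff3_attr_keys and gff3_keys_only_in_first are hypothetical helpers, not part of BioPerl, and the sketch assumes attributes are separated by ";" with literal semicolons inside values already escaped as %3B, as in the output above.

```shell
# List the attribute keys of a GFF3 column-9 string, one per line, sorted.
# (gff3_attr_keys is a hypothetical helper name.)
gff3_attr_keys() {
    printf '%s\n' "$1" | tr ';' '\n' | sed -n 's/^\([^=]*\)=.*$/\1/p' | LC_ALL=C sort -u
}

# Print the keys present in the first attribute string but not in the second.
gff3_keys_only_in_first() {
    gff3_attr_keys "$1" | grep -Fxv -e "$(gff3_attr_keys "$2")"
}

# The two gene lines quoted above (column 9 only).
mine='ID=EcE24377A_B0001;Dbxref=GeneID:5585816;Name=EcE24377A_B0001'
scotts='ID=EcE24377A_B0001;Dbxref=GeneID:5585816;locus_tag=EcE24377A_B0001'

gff3_keys_only_in_first "$mine" "$scotts"    # prints: Name
gff3_keys_only_in_first "$scotts" "$mine"    # prints: locus_tag
```

Run on whole files, the same idea would need a loop over the feature lines; the sketch handles a single pair of attribute strings only.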

Best,
Dave

On Sat, Sep 18, 2010 at 3:40 PM, Scott Cain <scott at scottcain.net> wrote:

> Hi Dave,
>
> Let's keep the discussion on the mailing list so we can make sure that
> when this problem is solved, its resolution will be archived.
>
> I don't really understand what is going on either, though it would
> probably be a good idea to set your PERL5LIB env variable so that when
> you execute this script from the git repository it will also use the
> BioPerl modules in the git repository instead of the ones that are
> installed in your "normal" path.
>
> Also, are you using any command line flags when executing it?  I didn't.
>
> Scott
>
>
> On Sat, Sep 18, 2010 at 2:14 PM, David Breimann
> <david.breimann at gmail.com> wrote:
> > Yes, I'm using Ubuntu 10.04.
> >
> > That is really weird. I tried running the script from the bioperl-live dir
> > (which I just pulled using git), and I get the same results as before
> > (`Name` instead of `locus_tag`):
> >
> >  $ wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_E24377A/NC_009789.gbk
> >  $ /home/dave/src/bioperl-live/blib/script/bp_genbank2gff3.pl -y NC_009789.genbank
> >
> > Attached is the resulting GFF3.
> > I also attach a copy of bp_genbank2gff3.pl as found under
> > /home/dave/src/bioperl-live/blib/script.
> >
> > This is a real mystery for me!
> >
> > On Sat, Sep 18, 2010 at 2:54 PM, Scott Cain <scott at scottcain.net> wrote:
> >>
> >> Typically I do build and install, but you can run it directly from the
> >> git checkout directory.
> >>
> >> For locating other versions of the script, are you running Linux?  If
> >> so, are you familiar with the "locate" command?
> >>
> >>  locate bp_genbank2gff3.pl
> >>
> >> If you've never used it before, you may need to update the database
> >> the locate command uses as root:
> >>
> >>  sudo updatedb
> >>
> >> Scott
> >>
> >>
> >> On Sat, Sep 18, 2010 at 1:46 PM, David Breimann
> >> <david.breimann at gmail.com> wrote:
> >> > Your gff seems fine. I get a very similar one, but with `Name=` instead
> >> > of `locus_tag=`.
> >> >
> >> > I don't really know how to check for multiple bioperl installations.
> >> > I'm using my personal server, so I don't mind removing and installing
> >> > everything from scratch -- but I don't know how to do that.
> >> >
> >> > Also, what I don't get with the git is how the scripts are supposed to
> >> > be updated (unless you build and install).
> >> >
> >> > Thank you!
> >> >
> >> > On Sat, Sep 18, 2010 at 2:38 PM, Scott Cain <scott at scottcain.net> wrote:
> >> >>
> >> >> Well, if you aren't getting the same results as me then I'd say you
> >> >> aren't using the same version of the script :-)
> >> >>
> >> >> Unfortunately, the scripts are no longer automatically marked with the
> >> >> "internal" version information when committed, so there really isn't
> >> >> anything in the script I can tell you to look for.  Check for more
> >> >> than one bioperl instance on your computer.
> >> >>
> >> >> I've attached the GFF3 file I got so you can look at it and tell me if
> >> >> it is what you expect.
> >> >>
> >> >> Scott
> >> >>
> >> >>
> >> >>
> >> >> On Sat, Sep 18, 2010 at 12:26 PM, David Breimann
> >> >> <david.breimann at gmail.com> wrote:
> >> >> > Hi Scott,
> >> >> >
> >> >> > I just pulled the latest bioperl-live using git.
> >> >> > I'm not sure how the scripts are updated, so I built and installed
> >> >> > anyway (perhaps exporting the path is supposed to be enough?)
> >> >> > Anyway, I still get the same results. No locus_tag.
> >> >> > How can I tell if I'm using the latest version of the script?
> >> >> >
> >> >> > Thanks again.
> >> >> >
> >> >> > On Sat, Sep 18, 2010 at 1:07 PM, Scott Cain <scott at scottcain.net> wrote:
> >> >> >>
> >> >> >> Hi Dave,
> >> >> >>
> >> >> >> A fresh "pull" of the bioperl git repository shows that
> >> >> >> bp_genbank2gff3.pl already does this.  It creates a locus_tag for all
> >> >> >> features that have a locus_tag, and uses the locus_tag for the ID when
> >> >> >> it can (it can't blindly use the locus tag for the ID since both the
> >> >> >> gene and the CDS have the same tag).
> >> >> >>
> >> >> >> Scott
> >> >> >>
> >> >> >>
> >> >> >> On Sat, Sep 18, 2010 at 11:20 AM, David Breimann
> >> >> >> <david.breimann at gmail.com> wrote:
> >> >> >> > Hi Scott,
> >> >> >> >
> >> >> >> > Here is a very short genbank:
> >> >> >> >
> >> >> >> > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_E24377A/NC_009789.gbk
> >> >> >> >
> >> >> >> > Note all genes in the genbank have locus tags. In the resulting GFF3,
> >> >> >> > however, only the last gene (EcE24377A_B0005) gets a locus_tag. I have
> >> >> >> > no idea why it deserves a special treatment... :)
> >> >> >> >
> >> >> >> > p.s. making this change (i.e., copying locus_tag to the GFF3 last
> >> >> >> > column whenever available) will really make my life easier.
> >> >> >> >
> >> >> >> > Thank you,
> >> >> >> > Dave
> >> >> >> >
> >> >> >> > On Sat, Sep 18, 2010 at 12:08 PM, Scott Cain <scott at scottcain.net> wrote:
> >> >> >> >>
> >> >> >> >> Hi Dave,
> >> >> >> >>
> >> >> >> >> That seems perfectly reasonable.  If you could point out a GenBank
> >> >> >> >> entry for which that does not happen, I could try to figure out why
> >> >> >> >> not.
> >> >> >> >>
> >> >> >> >> Scott
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Sat, Sep 18, 2010 at 10:20 AM, David Breimann
> >> >> >> >> <david.breimann at gmail.com> wrote:
> >> >> >> >> > Since locus_tag is an essential tag in GenBank, I suggest that
> >> >> >> >> > locus_tag always be added to the GFF last column if it exists in
> >> >> >> >> > the GenBank file, whether it is used as ID in the GFF or not.
> >> >> >> >> >
> >> >> >> >> > On Sat, Sep 18, 2010 at 11:17 AM, Scott Cain <scott at scottcain.net> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Hi Dave,
> >> >> >> >> >>
> >> >> >> >> >> bp_genbank2gff3.pl suffers from the fact that it has to deal with
> >> >> >> >> >> GenBank files :-)  It was designed initially to work on whole genome
> >> >> >> >> >> refseqs, and contains several ad hoc rules for trying to make it "do
> >> >> >> >> >> the right thing."  In practice, it is not unusual for a post
> >> >> >> >> >> processing step (either by hand or a quicky perl script) to be
> >> >> >> >> >> required to really get it right.  I don't recall the specifics (if I
> >> >> >> >> >> ever knew :-) for when and how the locus tag is used, but I do know
> >> >> >> >> >> that there is a list of things that it will try to use for the ID, and
> >> >> >> >> >> while the locus is on the list, I don't know where it comes in the
> >> >> >> >> >> list, so it's possible that other items might supersede it.
> >> >> >> >> >>
> >> >> >> >> >> Scott
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On Sat, Sep 18, 2010 at 10:05 AM, David Breimann
> >> >> >> >> >> <david.breimann at gmail.com> wrote:
> >> >> >> >> >> > Hello,
> >> >> >> >> >> >
> >> >> >> >> >> > I'm not sure how bp_genbank2gff3.pl works. Sometimes it adds a
> >> >> >> >> >> > `locus_tag` in the fields and sometimes it doesn't, even though the
> >> >> >> >> >> > GenBank file has a locus tag.
> >> >> >> >> >> > Also, is the ID always equivalent to the locus tag?
> >> >> >> >> >> >
> >> >> >> >> >> > Thanks,
> >> >> >> >> >> > Dave
> >> >> >> >> >> > _______________________________________________
> >> >> >> >> >> > Bioperl-l mailing list
> >> >> >> >> >> > Bioperl-l at lists.open-bio.org
> >> >> >> >> >> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> --
> >> >> >> >> >> ------------------------------------------------------------------------
> >> >> >> >> >> Scott Cain, Ph. D.                                   scott at scottcain dot net
> >> >> >> >> >> GMOD Coordinator (http://gmod.org/)                     216-392-3087
> >> >> >> >> >> Ontario Institute for Cancer Research
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>


