[Bioperl-l] Re: Problem with Unflattener

Scott Cain cain at cshl.org
Tue Dec 9 22:12:02 EST 2003


Well, really the goal is to write a converter that can be used by a MOD
to seed an instance of GMOD, and since Genbank is a defacto standard,
there really needs to be a converter for it.  At the moment, my approach
is to warn liberally when the script can't figure something out.

Thanks,
Scott

On Tue, 2003-12-09 at 21:52, Chris Mungall wrote:
> Oh, not quite. Sorry, not an expert on the various pre-gff3 flavours
> 
> But I imagine you could take the ensembl GFF (which is of 'GTF' flavour?)
> OR an actual ensembl mysql database, both of which sensibly preserve
> gene/transcript/protein/exon nesting structure, and turn that into GFF3
> without too much difficulty. Ok, you start losing yourself in a maze of
> converters, but the advantage is you never have to go through that lossy
> step to genbank format where you lose the nesting structure. There are
> some situations where it could be simply impossible for the Unflattener to
> recover some of this lost structure.
> 
> On Tue, 9 Dec 2003, Scott Cain wrote:
> 
> > Thanks.
> >
> > Re: GFF from ensembl: You can get it as GFF3?  Could you send me a link
> > if so.  (You can tell I'm a little incredulous.)
> >
> > Scott
> >
> > On Tue, 2003-12-09 at 21:19, Chris Mungall wrote:
> > > Hi Scott
> > >
> > > Bug squashed, do a cvs update and it should work
> > >
> > > The problem was that this record uses /locus_tag instead of /gene - the
> > > unflattener should be able to detect this in magic mode, but there was one
> > > place where "/gene" was hardcoded.
> > >
> > > By the way, for this particular record you can get the exact same data
> > > from ensembl, already unflattened (or rather, never flattened into genbank
> > > format in the first place). Nevertheless, this sort of thing is extremely
> > > useful for testing Unflattener.pm, so carry on testing! Really I should do
> > > a full QC by comparing ensembl sourced GFF and the results of
> > > ensembl->genbank->unflattener->gff, but I haven't got round to this yet.
> > >
> > > Cheers
> > > Chris
> > >
> > > On Tue, 9 Dec 2003, Scott Cain wrote:
> > >
> > > > Hello Chris,
> > > >
> > > > I am using Unflattener to create a genbank2gff script that is more
> > > > robust than what we have now.  As one of my example Genbank files, I am
> > > > using an A. gambiae chromosome:
> > > >
> > > > http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&list_uids=31249389&dopt=GenBank&term=NW_045730&qty=1
> > > >
> > > > When I try to run the simplified script below, I get the following
> > > > error:
> > > >
> > > > ------------- EXCEPTION  -------------
> > > > MSG: structure_type 2 is currently unknown
> > > > STACK Bio::SeqFeature::Tools::Unflattener::unflatten_seq /usr/local/lib/perl5/site_perl/5.8.1/Bio/SeqFeature/Tools/Unflattener.pm:1345
> > > > STACK toplevel ./simple.pl:19
> > > >
> > > > --------------------------------------
> > > >
> > > > As I read Unflattener, structure_type should only be set if I set it
> > > > explicitly, right?  So how is it getting set here, and how do I make it
> > > > stop?
> > > >
> > > > Here's the script:
> > > > #!/usr/bin/perl -w
> > > > use strict;
> > > > use Bio::SeqIO;
> > > > use Bio::SeqFeature::Tools::Unflattener;
> > > >
> > > > my $unflattener = Bio::SeqFeature::Tools::Unflattener->new;
> > > >
> > > > my $seqio = Bio::SeqIO->new(
> > > >     -file   => 'NW_045730.1.gbk',
> > > >     -format => 'GenBank'
> > > > );
> > > >
> > > > open OUT, '>out.gff';
> > > >
> > > > while ( my $seq = $seqio->next_seq() ) {
> > > >     my $acc = $seq->accession;
> > > >
> > > >     # get top level unflattended SeqFeatureI objects
> > > >     my @sfs = $unflattener->unflatten_seq(
> > > >         -seq       => $seq,
> > > >         -use_magic => 1
> > > >     );
> > > >
> > > >     foreach my $sf (@sfs) {
> > > >         my $gffio =
> > > >           $sf->gff_format( Bio::Tools::GFF->new( -gff_version => 3 ) );
> > > >
> > > >         $sf->seq_id($acc);
> > > >
> > > >         if ( $sf->primary_tag() eq 'source' ) {
> > > >             $sf->add_tag_value( 'ID', $acc );
> > > >             $sf->primary_tag('region');
> > > >         }
> > > >         print OUT $sf->gff_string . "\n";
> > > >     }
> > > > }
> > > > close OUT;
> > > > ---------------------------
> > > >
> > > > Thanks,
> > > > Scott
> > > >
> > > >
> > >
> >
-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         cain at cshl.org
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory



More information about the Bioperl-l mailing list