[Bioperl-l] Bio::DB::SeqFeature treamtent of tags and annotations

Lincoln Stein lstein at cshl.edu
Wed Jan 31 17:02:08 UTC 2007


The problem is when you try to merge files from different sources. The IDs
frequently collide. This is why we allow two GFF files to contain the same
IDs, and assume that the database will create its own internal (unique) IDs
when it loads them. The "Name" field is more or less the global identifier
that you are looking for.

Lincoln

On 1/31/07, Cook, Malcolm <MEC at stowers-institute.org> wrote:
>
>  Hi Lincoln,
>
> Thanks for the resolution of tag value methods.  Your fixes work in my hands...
>
> I never knew that "ID column is that it is supposed to be LOCAL to the GFF3 file and is not intended to be stored in the database".
>
> I read in http://www.sequenceontology.org/gff3.shtml that
>
>     ID	   Indicates the name of the feature.  IDs must be unique
> 	   within the scope of the GFF file.
>
> But that ID is solely for the purpose of linearizing the relationships among features within the scope of the GFF text file.
>
> Further, in that document, I see lots of examples where the ID is what is holding together the fabric of the feature relationships; IDs appear as the value of: Target and Parent attributes.
>
>
>
> I'd appreciate it if you can provide a motivating example where the ID= in a (real life) GFF3 file is in fact ONLY used for the purpose of linearizing the feature relationships.
>
>
>
> ANyway,
>
> If it matters, the GFF formated genome I'm wallowing in these days from
> Flybase (dmel r5.1) presents their FlyBase IDs in the ID attribute, like
> this:
>
> 4 FlyBase gene 24068 25621 . + . ID=FBgn0040037;Name=CG17923;Dbxref=FlyBase:FBan0017923,FlyBase_Annotation_IDs:CG17923...
> 4 FlyBase mRNA 24068 25621 . + . ID=FBtr0089155;Name=CG17923-RA;Parent=FBgn0040037;Dbxref=FlyBase_Annotation_IDs:CG17923-RA;
> 4 FlyBase exon 24068 24477 . + . ID=CG17923:1;Name=CG17923:1;Parent=FBtr0089155
>
> Switching to using 'Name' might work OK for my application.  I'll look into it.  It winds up being the same as the ID in some cases anyway....
>
> Malcolm Cook
>
>
>
>  ------------------------------
> *From:* lincoln.stein at gmail.com [mailto:lincoln.stein at gmail.com] *On
> Behalf Of *Lincoln Stein
> *Sent:* Tuesday, January 30, 2007 4:46 PM
> *To:* Cook, Malcolm
> *Cc:* bioperl list; lstein at cshl.org
> *Subject:* Re: Bio::DB::SeqFeature treamtent of tags and annotations
>
> I've fixed the first issue in CVS. Sorry for the inconsistency.
> add_tag_value(), delete_tag_value() and get_Annotations() now all work as
> expected.
>
> The problem with the ID column is that it is supposed to be LOCAL to the
> GFF3 file and is not intended to be stored in the database. In contrast,
> Name can survive roundtripping. Perhaps the thing to do is to add a flag to
> the GFF3 file that turns on ID round-tripping, e.g.
>
> ##round-trip-ids: 1
>
> If you like this idea, I can implement it.
>
> Lincoln
>
> On 1/29/07, Cook, Malcolm < MEC at stowers-institute.org> wrote:
> >
> >  Lincoln,
> >
> > Thanks for your suggestions on approach to my problems augmenting
> > Flybase annotation.  I am trying to follow them and finding the following
> > oddities
> >
> > The first issue relates to the intermix of 'annotations' and 'tag
> > values'.  I find that Bio::DB::SeqFeature implements some of the 'tag'
> > methods and some of the 'Annotation' methods.  Here is a perl one-liner that
> > shows values stored using add_tag_value are not retreived with
> > get_tag_values, but rather with get_Annotations.
> >
> > > perl -MBio::DB::SeqFeature -e 'my $f=Bio::DB::SeqFeature->new;
> > $f->add_tag_value("x",666); print "get_tag_values:\t" .
> > $f->get_tag_values("x") . "\nget_Annotations:\t" .
> > $f->get_Annotations("x");'
> >
> > whose output is:
> > get_tag_values:
> > get_Annotations:    666
> >
> > Tracing this shows me that this results from the fact that:
> >
> >  Bio::DB::SeqFeature uses of Bio::Graphics::FeatureBase (via
> > Bio::DB::SeqFeature::NormalizedFeature) which does not support -tags in
> > ->new but rather -attributes, viz:
> >
> >    -attributes   a hashref of tag value attributes, in which the key is
> > the tag
> >                   and the value is an array reference of values
> >
> >  And though Bio::Graphics::FeatureBase purports to implement
> > Bio::SeqFeatureI, it only partially implements the  'tag' methods (now
> > deprecated and relegated to Bio::AnnotatableI).  In particular, the '*'
> > methods Bio::SeqFeatureI are not implemented in Bio::Graphics::FeatureBase
> >
> >
> >   has_tag
> > *  add_tag_value
> >   get_tag_values
> >   get_all_tags
> > *  remove_tag
> >   get_tagset_values
> >   get_Annotations
> >
> > As a result, add_tag_value and remove_tag are inherited from different
> > modules whose understanding of tags is not the same!
> >
> > This one-liner :
> >
> > >perl -MClass::ISA -MClass::Inspector -MBio::DB::SeqFeature -e 'my @c =
> > Class::ISA::self_and_super_path("Bio::DB::SeqFeature"); foreach my $fn
> > qw(add_tag_value get_tag_values) {print "\n$fn:\t", join "\t", (grep
> > {Class::Inspector->function_exists($_, $fn)} @c)}'
> >
> > confirms that they are defined in different packages, namely:
> >
> > add_tag_value: Bio::AnnotatableI
> > get_tag_values: Bio::Graphics::FeatureBase Bio::AnnotatableI
> >  Proposed solution...  hmmmm ..... I dunno.... maybe the following patch
> > to Bio::Graphics::FeatureBase->add_tag_value :
> >
> > sub add_tag_value {
> >   my ($self,$tag, at vals) = @_;
> >   push @{$self->{attributes}{$tag}}, @vals;
> > }
> >
> >  It fixes my use case for now but I'm still concerned and confused about
> > this variety of methods.
> >
> > Suggestions?
> >
> >
> >
> > -------------------------------------------------------------------------
> >
> > Also, I think that any "ID" in column 9 of GFF3 float file should be
> > preserved through a round-trip through a Bio::DB::SeqFeature store, but this
> > is not yet possible since any ID attribute in GFF3 column 9 is being
> > lost by GFF3Loader, causing me to locally patch GFF3Loader::handle_feature
> > method to add the following:
> >
> >   # mec at stowers-institute.org , wondering why not all attributes are
> >   # carried forward, adds ID tag in particular service of
> >   # round-tripping ID, which, though present in database as load_id
> >   # attribute, was getting lost as itself
> >   $unreserved->{ID}= $reserved->{ID}     if exists $reserved->{ID};
> >
> > Poised to patch.... what d'you think?
> >
> > Malcolm Cook
> > Stowers Institute for Medical Research - Kansas City, Missouri
> >
> >
> >
> >  ------------------------------
> > From: lincoln.stein at gmail.com [mailto: lincoln.stein at gmail.com] On
> > Behalf Of Lincoln Stein
> > Sent: Tuesday, December 19, 2006 3:58 PM
> > To: Cook, Malcolm
> > Cc: bioperl list; lstein at cshl.org
> > Subject: Re: bp_seqfeature_load / Bio::DB::SeqFeature::Store::GFF3Loader
> > problems augmenting Flybase annotation
> >
> > Hi Malcom,
> >
> > Your second guess was right. The use case of augmenting an existing gene
> > with additional splice forms isn't provided for. You can get the
> > functionality by making direct calls to Bio::DB::SeqFeature::Store methods:
> >
> > my @genes = $db->get_features_by_name('FBgn0017545');
> > @genes == 1 or die "Didn't get exactly one gene";
> > my $parent = $genes[0];
> >
> > my $parent = $genes[0];
> > my $chr    = $parent->seq_id;
> > my $start  = $parent->start;
> > my $end    = $parent->end;
> > my $strand = $parent->strand;
> >
> > my $new_splice_form = $db->new_feature(-primary_tag => 'mRNA',
> >                        -source      => 'added',
> >                        -seq_id   => '4r',
> >                        -strand   => $strand,
> >                        -start    => $start+10,
> >                        -end      => $end,
> >                        );
> > $parent->add_SeqFeature($new_splice_form);
> >
> > for my $pos ([$start+10,$start+100],[$start+200,$end]) {
> >   my ($e_start,$e_end) = @$pos;
> >   my $exon = Bio::DB::SeqFeature->new(-primary_tag => 'exon',
> >                                       -store       => $db,
> >                       -seq_id      => '4r',
> >                       -strand     => $strand,
> >                       -start       => $e_start,
> >                       -end         => $e_end);
> >   $new_splice_form->add_SeqFeature($exon);
> > }
> >
> > I found a bug in updating the seqfeature database when I wrote this
> > script, so you'll have to get the latest biperl live. I think you can use
> > this to write a splice form updating script.
> >
> > In order to support the idea of adding new splice forms to an existing
> > gene using the GFF3 format, I will have to either modify the loader, or
> > write a separate script (probably better to do the latter). It shouldn't be
> > hard if you'd like to give it a try.
> >
> > Lincoln
> >
> > On 12/19/06, Cook, Malcolm <MEC at stowers-institute.org > wrote:
> > >
> > > Lincoln and fellow Bio::DB::SeqFeature travelers,
> > >
> > > I find that using bp_seqfeature_load.PLS to load subfeatures of genes
> > > already loaded using bp_seqfeature_load.PLS fails with
> > >
> > > ------------- EXCEPTION  -------------
> > > MSG: FBgn0017545 doesn't have a primary id
> > > STACK
> > > Bio::DB::SeqFeature::Store::GFF3Loader::build_object_tree_in_tables
> > > /home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:682
> > > STACK Bio::DB::SeqFeature::Store::GFF3Loader::build_object_tree
> > > /home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:663
> > > STACK Bio::DB::SeqFeature::Store::GFF3Loader::finish_load
> > > /home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:372
> > > STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh
> > > /home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:345
> > > STACK Bio::DB::SeqFeature::Store::GFF3Loader::load
> > > /home/mec/cvs/bioperl-live/Bio/DB/SeqFeature/Store/GFF3Loader.pm:242
> > > STACK toplevel
> > > /home/mec/cvs/bioperl-live/scripts/Bio-SeqFeature-Store/bp_seqfeature_lo
> > >
> > > ad.PLS:76
> > >
> > > Where FBgn0017545 is the ID of a gene previously loaded.
> > >
> > > I am unsure how to remedy my situation and welcome any advise on
> > > correct
> > > or improved approach to my problem.
> > >
> > > Here's some detail if it helps.  I am developing a pipeline to design
> > > a
> > > microarray probes capable of distinguishing among splice variants in
> > > drosophila (using latest Flybase dmel_r5.1 annotation).  So I
> > >
> > > 1) load a filtered selection of Flybase annotation using
> > > bp_seqfeature_load.  (for testing purposes, I am using a single gene's
> > >
> > > worth of annotation, FBgn0017545.gff, attached).  This is done as
> > > follows:
> > >
> > >         > bp_seqfeature_load.PLS  --create FBgn0017545.gff
> > >
> > > 2) analyze all the genes in the database, and create GFF3 output each
> > > feature of which has a 'Parent' that is a previously loaded gene (i.e.
> > > FBgn0017545).  (These features represent the unique introns, splice
> > > sites, and exonic design targets. Output of this analysis,
> > > FBgn0017545_matd.gff, is also attached)
> > >
> > > 3) load these analysis results into the same database, as follows:
> > >
> > >         > bp_seqfeature_load.PLS          FBgn0017545_matd.gff
> > >
> > > It is at this point that I get the above error.
> > >
> > > However, I don't get any error and the data loads fine if I load the
> > > two
> > > files together, as follows:
> > >
> > >         > bp_seqfeature_load.PLS --create <(cat FBgn0017545.gff
> > > FBgn0017545_matd.gff)
> > >
> > > So, I suspect that either I am misunderstanding when/how to use
> > > bp_seqfeature_load.PLS or else this use case has not yet arisen and
> > > must
> > > be provided for somehow.
> > >
> > > I am running against bioperl-live
> > >
> > > Thanks for your thoughts and assistance,
> > >
> > > Malcolm Cook
> > > Database Applications Manager - Bioinformatics
> > > Stowers Institute for Medical Research - Kansas City, Missouri
> > >
> > >
> >
> >
> > --
> > Lincoln D. Stein
> > Cold Spring Harbor Laboratory
> > 1 Bungtown Road
> > Cold Spring Harbor, NY 11724
> > (516) 367-8380 (voice)
> > (516) 367-8389 (fax)
> > FOR URGENT MESSAGES & SCHEDULING,
> > PLEASE CONTACT MY ASSISTANT,
> > SANDRA MICHELSEN, AT michelse at cshl.edu
> >
> >
>
>
> --
> Lincoln D. Stein
> Cold Spring Harbor Laboratory
> 1 Bungtown Road
> Cold Spring Harbor, NY 11724
> (516) 367-8380 (voice)
> (516) 367-8389 (fax)
> FOR URGENT MESSAGES & SCHEDULING,
> PLEASE CONTACT MY ASSISTANT,
> SANDRA MICHELSEN, AT michelse at cshl.edu
>
>


-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu



More information about the Bioperl-l mailing list