[Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF

Lincoln Stein lincoln.stein at gmail.com
Thu Aug 17 22:32:35 UTC 2006


Let me know how it works.

I also get a few of the warnings about the ortho:* features. They don't seem
to hurt anything so you can go ahead and use fast loading if you want. The
long-term fix is to sort the GFF3 files so that all features that share the
same ID occur next to each other.

Lincoln

On 8/17/06, Marco Blanchette <mblanche at berkeley.edu> wrote:
>
> I will answer my own question...
>
> Yes, one can load the fasta file after having loaded the gff file by
> doing:
>
> bp_seqfeature_load.pl -d dmel_43_SF_slow dmel-all-chromosome-r4.3.fasta
>
> Marco
>
>
> On 8/17/06 11:20, "Marco Blanchette" <mblanche at berkeley.edu> wrote:
>
> > Lincoln, thanks for the precision. I just could not find any references
> to
> > how to load the DNA (no where in bp_seqfeature_load.pl or in the
> > Bio::DB::SeqFeature::Store it says how load the DNA sequences).
> >
> > So right now the gff files were loaded in mysql using:
> > /usr/bin/bp_seqfeature_load.pl -d dmel_43_SF_slow *.gff
> >
> > I tried the --fast options but got a bunch of warning (see below).
> >
> > The DNA file (a single fasta database file containing all chromosome
> > sequences) was in a different location from the gff files and was not
> loaded
> > together with the gff files (the sequence table is empty in the
> database).
> >
> > Can I load the DNA sequence after the gff files were loaded?
> >
> > Many thanks
> >
> > Marco
> >
> >
> > -------------------- WARNING ---------------------
> > MSG: ID=ortho:2825 has been used more than once, but it cannot be found
> in
> > the database.
> > This can happen if you have specified fast loading, but features sharing
> the
> > same ID
> > are not contiguous in the GFF file. This will be loaded as a separate
> > feature.
> > Line 483681: "X .       orthologous_region      19477824        19478027
> > .       +       .
> > ID=ortho:2825;to_name=FBpp0074514,CG14214-PA;to_species=dpse"
> >
> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::handle_feature
> > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:537
> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::do_load
> > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:424
> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh
> > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:342
> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::load
> > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:240
> > STACK toplevel /usr/bin/bp_seqfeature_load.pl:81
> >
> >
> > On 8/17/06 10:27, "Lincoln Stein" <lstein at cshl.edu> wrote:
> >
> >> Hi,
> >>
> >> This message bounced because I tried to send it from my gmail account
> and so
> >> I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it
> >> finds a file that contains DNA data, it simply loads it. There is no
> special
> >> command line switch. Also you can include the DNA in the GFF3 file.
> >>
> >> Lincoln
> >>
> >> ---------- Forwarded message ----------
> >> From: Lincoln Stein <lincoln.stein at gmail.com>
> >> Date: Aug 17, 2006 12:26 PM
> >> Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF
> >> To: Chris Fields <cjfields at uiuc.edu>
> >> Cc: Marco Blanchette <mblanche at berkeley.edu>, "
> bioperl-l at lists.open-bio.org"
> >> <Bioperl-l at lists.open-bio.org>, cain.cshl at gmail.com
> >>
> >> I'm curious. Could you try using the Bio::DB::SeqFeature::Store class
> to
> >> load the GFF3-format Fly data? I think you're probably getting confused
> by
> >> overlapping mRNA splice forms, an issue that won't occur with the full
> >> GFF3-formatted data.
> >>
> >>
> >> On 8/13/06, Chris Fields <cjfields at uiuc.edu> wrote:
> >>>
> >>> Marco,
> >>>
> >>> Did you figure out what the problem was?  I was curious; the issue
> >>> you were having was rather odd.  I wanted to see if it was an issue
> >>> with the GFF data or with the database itself.
> >>>
> >>> Chris
> >>>
> >>> On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote:
> >>>
> >>>> Chris,
> >>>>
> >>>>> Do you mean you get duplicates of sequences back, or that you get
> >>>>> more than
> >>>>> one chunk of the same sequence back?
> >>>>
> >>>> I sometimes get duplicated sequences and sometimes overlapping
> >>>> regions (see
> >>>> bellow)
> >>>>
> >>>>>
> >>>>> Is it possible that each query using an ID could contain more than
> >>>>> one
> >>>>> feature?  That might explain it (you could check by testing the
> >>>>> size of the
> >>>>> array @feats).
> >>>> Most id return more than one features from various type
> >>>> ( point_mutation,
> >>>> insertion_site, processed_transcript, etc...). That's why I
> >>>> restirct the
> >>>> output to type "gene" using regexp /gene/ on $f->type.
> >>>>
> >>>>>
> >>>>> I'm not sure how split locations are handled within Bio:DB::GFF,
> >>>>> but do the
> >>>>> specific features have split locations?
> >>>>>
> >>>>> Chris
> >>>>>
> >>>> Not sure what you mean exactly but have a look at the following
> >>>> script, it
> >>>> gives the location and the group id of the feature being reported:
> >>>>
> >>>> use Bio::DB::GFF;
> >>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
> >>>>                               -dsn =>
> >>>> 'dbi:mysql:database=dmel_43_new');
> >>>> my %dups;
> >>>> while (<>){
> >>>>     chomp;
> >>>>     my $id = $_;
> >>>>     my @feat = $db->get_feature_by_name($id);
> >>>>
> >>>>      for my $f (@feat){
> >>>>         if (exists $dups{$f->group} && $f->type =~/gene/){
> >>>>             print     "Calling >>>", $f->group, "\n";
> >>>>             print     "Chr: ", $f->refseq,
> >>>>                     " Strand: ", $f->strand,
> >>>>                     " Start: ", $f->start,
> >>>>                     " End: ", $f->end,
> >>>>                     "\n";
> >>>>             print "Offending >>>", $dups{$f->group}->group, "\n";
> >>>>             print     "Chr: ", $dups{$f->group}->refseq,
> >>>>                     " Strand: ", $dups{$f->group}->strand,
> >>>>                     " Start: ", $dups{$f->group}->start,
> >>>>                       " End: ", $dups{$f->group}->end;
> >>>>             print "\n\n";
> >>>>          } else {
> >>>>             $dups{$f->group} = $f;
> >>>>          }
> >>>>      }
> >>>> }
> >>>>
> >>>> Here is the output:
> >>>> Calling >>>FBgn0004179
> >>>> Chr: 3L Strand: 1 Start: 22201102 End: 22207587
> >>>> Offending >>>FBgn0004179
> >>>> Chr: 3L Strand: 1 Start: 22200575 End: 22200575
> >>>>
> >>>> Calling >>>FBgn0025681
> >>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
> >>>> Offending >>>FBgn0025681
> >>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
> >>>>
> >>>> Calling >>>FBgn0025803
> >>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
> >>>> Offending >>>FBgn0025803
> >>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
> >>>>
> >>>> Calling >>>FBgn0000117
> >>>> Chr: X Strand: -1 Start: 1756796 End: 1747557
> >>>> Offending >>>FBgn0000117
> >>>> Chr: X Strand: -1 Start: 1757776 End: 1747182
> >>>>
> >>>> Calling >>>FBgn0005427
> >>>> Chr: X Strand: -1 Start: 136456 End: 125343
> >>>> Offending >>>FBgn0005427
> >>>> Chr: X Strand: -1 Start: 133199 End: 124949
> >>>>
> >>>> Calling >>>FBgn0000042
> >>>> Chr: X Strand: 1 Start: 5746100 End: 5750026
> >>>> Offending >>>FBgn0000042
> >>>> Chr: X Strand: 1 Start: 5746096 End: 5746106
> >>>>
> >>>> Calling >>>FBgn0004551
> >>>> Chr: 2R Strand: -1 Start: 19443485 End: 19434556
> >>>> Offending >>>FBgn0004551
> >>>> Chr: 2R Strand: -1 Start: 19445155 End: 19429977
> >>>>
> >>>> Do you have any suggestions?? Is the procedure I am using to
> >>>> retrieve  the
> >>>> genes right?
> >>>>
> >>>> Many thanks
> >>>>
> >>>> Marco
> >>>>
> >>>>
> >>>>
> >>>>>> Many thanks Scott,
> >>>>>>
> >>>>>> At the same time I got your email I was coming to the same
> >>>>>> conclusion as
> >>>>>> you.
> >>>>>>
> >>>>>> Now I have a stranger problem in my hands... My goal is quite
> >>>>>> simple, I
> >>>>>> try
> >>>>>> to get the sequence of the genes back from the Bio::DB::GFF
> database
> >>>>>> loaded
> >>>>>> on MySQL. The gene list is from a file with one gene id per per
> >>>>>> line. When
> >>>>>> I
> >>>>>> run the following script:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> use Bio::DB::GFF;
> >>>>>> use Bio::SeqIO;
> >>>>>> my $out = Bio::SeqIO->new(    -fh => \*STDOUT,
> >>>>>>                             -format => 'fasta');
> >>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
> >>>>>>                               -dsn =>
> >>>>>> 'dbi:mysql:database=dmel_43_new');
> >>>>>>
> >>>>>> while (<>){
> >>>>>>     chomp;
> >>>>>>     my $id = $_;
> >>>>>>     my @feats = $db->get_feature_by_name($id);
> >>>>>>     for my $f (@feats){
> >>>>>>         $out->write_seq( $f->seq ) if $f->type =~/gene/;
> >>>>>>     }
> >>>>>> }
> >>>>>>
> >>>>>>
> >>>>>> I get more sequence back than the number of gene in my input file.
> I
> >>>>>> double
> >>>>>> check there. Some of the duplicated entries are the same, some
> >>>>>> are not!
> >>>>>
> >>>>>
> >>>>> ...
> >>>>>
> >>>>> _______________________________________________
> >>>>> Bioperl-l mailing list
> >>>>> Bioperl-l at lists.open-bio.org
> >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>>
> >>>> ______________________________
> >>>> Marco Blanchette, Ph.D.
> >>>>
> >>>> mblanche at uclink.berkeley.edu
> >>>>
> >>>> Donald C. Rio's lab
> >>>> Department of Molecular and Cell Biology
> >>>> 16 Barker Hall
> >>>> University of California
> >>>> Berkeley, CA 94720-3204
> >>>>
> >>>> Tel: (510) 642-1084
> >>>> Cell: (510) 847-0996
> >>>> Fax: (510) 642-6062
> >>>> --
> >>>>
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> Bioperl-l mailing list
> >>>> Bioperl-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>
> >>> Christopher Fields
> >>> Postdoctoral Researcher
> >>> Lab of Dr. Robert Switzer
> >>> Dept of Biochemistry
> >>> University of Illinois Urbana-Champaign
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>
> >>
> >>
> >
> > ______________________________
> > Marco Blanchette, Ph.D.
> >
> > mblanche at uclink.berkeley.edu
> >
> > Donald C. Rio's lab
> > Department of Molecular and Cell Biology
> > 16 Barker Hall
> > University of California
> > Berkeley, CA 94720-3204
> >
> > Tel: (510) 642-1084
> > Cell: (510) 847-0996
> > Fax: (510) 642-6062
>
> ______________________________
> Marco Blanchette, Ph.D.
>
> mblanche at uclink.berkeley.edu
>
> Donald C. Rio's lab
> Department of Molecular and Cell Biology
> 16 Barker Hall
> University of California
> Berkeley, CA 94720-3204
>
> Tel: (510) 642-1084
> Cell: (510) 847-0996
> Fax: (510) 642-6062
> --
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>



-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu



More information about the Bioperl-l mailing list