[Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF

Thu Aug 17 18:40:50 UTC 2006

I will answer my own question...

Yes, one can load the fasta file after having loaded the gff file by doing:

bp_seqfeature_load.pl -d dmel_43_SF_slow dmel-all-chromosome-r4.3.fasta

Marco

On 8/17/06 11:20, "Marco Blanchette" <mblanche at berkeley.edu> wrote:

> Lincoln, thanks for the precision. I just could not find any references to
> how to load the DNA (no where in bp_seqfeature_load.pl or in the
> Bio::DB::SeqFeature::Store it says how load the DNA sequences).
> 
> So right now the gff files were loaded in mysql using:
> /usr/bin/bp_seqfeature_load.pl -d dmel_43_SF_slow *.gff
> 
> I tried the --fast options but got a bunch of warning (see below).
> 
> The DNA file (a single fasta database file containing all chromosome
> sequences) was in a different location from the gff files and was not loaded
> together with the gff files (the sequence table is empty in the database).
> 
> Can I load the DNA sequence after the gff files were loaded?
> 
> Many thanks
> 
> Marco
> 
> 
> -------------------- WARNING ---------------------
> MSG: ID=ortho:2825 has been used more than once, but it cannot be found in
> the database.
> This can happen if you have specified fast loading, but features sharing the
> same ID
> are not contiguous in the GFF file. This will be loaded as a separate
> feature.
> Line 483681: "X .       orthologous_region      19477824        19478027
> .       +       .
> ID=ortho:2825;to_name=FBpp0074514,CG14214-PA;to_species=dpse"
> 
> STACK Bio::DB::SeqFeature::Store::GFF3Loader::handle_feature
> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:537
> STACK Bio::DB::SeqFeature::Store::GFF3Loader::do_load
> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:424
> STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh
> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:342
> STACK Bio::DB::SeqFeature::Store::GFF3Loader::load
> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:240
> STACK toplevel /usr/bin/bp_seqfeature_load.pl:81
> 
> 
> On 8/17/06 10:27, "Lincoln Stein" <lstein at cshl.edu> wrote:
> 
>> Hi,
>> 
>> This message bounced because I tried to send it from my gmail account and so
>> I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it
>> finds a file that contains DNA data, it simply loads it. There is no special
>> command line switch. Also you can include the DNA in the GFF3 file.
>> 
>> Lincoln
>> 
>> ---------- Forwarded message ----------
>> From: Lincoln Stein <lincoln.stein at gmail.com>
>> Date: Aug 17, 2006 12:26 PM
>> Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF
>> To: Chris Fields <cjfields at uiuc.edu>
>> Cc: Marco Blanchette <mblanche at berkeley.edu>, "bioperl-l at lists.open-bio.org"
>> <Bioperl-l at lists.open-bio.org>, cain.cshl at gmail.com
>> 
>> I'm curious. Could you try using the Bio::DB::SeqFeature::Store class to
>> load the GFF3-format Fly data? I think you're probably getting confused by
>> overlapping mRNA splice forms, an issue that won't occur with the full
>> GFF3-formatted data.
>> 
>> 
>> On 8/13/06, Chris Fields <cjfields at uiuc.edu> wrote:
>>> 
>>> Marco,
>>> 
>>> Did you figure out what the problem was?  I was curious; the issue
>>> you were having was rather odd.  I wanted to see if it was an issue
>>> with the GFF data or with the database itself.
>>> 
>>> Chris
>>> 
>>> On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote:
>>> 
>>>> Chris,
>>>> 
>>>>> Do you mean you get duplicates of sequences back, or that you get
>>>>> more than
>>>>> one chunk of the same sequence back?
>>>> 
>>>> I sometimes get duplicated sequences and sometimes overlapping
>>>> regions (see
>>>> bellow)
>>>> 
>>>>> 
>>>>> Is it possible that each query using an ID could contain more than
>>>>> one
>>>>> feature?  That might explain it (you could check by testing the
>>>>> size of the
>>>>> array @feats).
>>>> Most id return more than one features from various type
>>>> ( point_mutation,
>>>> insertion_site, processed_transcript, etc...). That's why I
>>>> restirct the
>>>> output to type "gene" using regexp /gene/ on $f->type.
>>>> 
>>>>> 
>>>>> I'm not sure how split locations are handled within Bio:DB::GFF,
>>>>> but do the
>>>>> specific features have split locations?
>>>>> 
>>>>> Chris
>>>>> 
>>>> Not sure what you mean exactly but have a look at the following
>>>> script, it
>>>> gives the location and the group id of the feature being reported:
>>>> 
>>>> use Bio::DB::GFF;
>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>>                               -dsn =>
>>>> 'dbi:mysql:database=dmel_43_new');
>>>> my %dups;
>>>> while (<>){
>>>>     chomp;
>>>>     my $id = $_;
>>>>     my @feat = $db->get_feature_by_name($id);
>>>> 
>>>>      for my $f (@feat){
>>>>         if (exists $dups{$f->group} && $f->type =~/gene/){
>>>>             print     "Calling >>>", $f->group, "\n";
>>>>             print     "Chr: ", $f->refseq,
>>>>                     " Strand: ", $f->strand,
>>>>                     " Start: ", $f->start,
>>>>                     " End: ", $f->end,
>>>>                     "\n";
>>>>             print "Offending >>>", $dups{$f->group}->group, "\n";
>>>>             print     "Chr: ", $dups{$f->group}->refseq,
>>>>                     " Strand: ", $dups{$f->group}->strand,
>>>>                     " Start: ", $dups{$f->group}->start,
>>>>                       " End: ", $dups{$f->group}->end;
>>>>             print "\n\n";
>>>>          } else {
>>>>             $dups{$f->group} = $f;
>>>>          }
>>>>      }
>>>> }
>>>> 
>>>> Here is the output:
>>>> Calling >>>FBgn0004179
>>>> Chr: 3L Strand: 1 Start: 22201102 End: 22207587
>>>> Offending >>>FBgn0004179
>>>> Chr: 3L Strand: 1 Start: 22200575 End: 22200575
>>>> 
>>>> Calling >>>FBgn0025681
>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>>>> Offending >>>FBgn0025681
>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>>>> 
>>>> Calling >>>FBgn0025803
>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>>>> Offending >>>FBgn0025803
>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>>>> 
>>>> Calling >>>FBgn0000117
>>>> Chr: X Strand: -1 Start: 1756796 End: 1747557
>>>> Offending >>>FBgn0000117
>>>> Chr: X Strand: -1 Start: 1757776 End: 1747182
>>>> 
>>>> Calling >>>FBgn0005427
>>>> Chr: X Strand: -1 Start: 136456 End: 125343
>>>> Offending >>>FBgn0005427
>>>> Chr: X Strand: -1 Start: 133199 End: 124949
>>>> 
>>>> Calling >>>FBgn0000042
>>>> Chr: X Strand: 1 Start: 5746100 End: 5750026
>>>> Offending >>>FBgn0000042
>>>> Chr: X Strand: 1 Start: 5746096 End: 5746106
>>>> 
>>>> Calling >>>FBgn0004551
>>>> Chr: 2R Strand: -1 Start: 19443485 End: 19434556
>>>> Offending >>>FBgn0004551
>>>> Chr: 2R Strand: -1 Start: 19445155 End: 19429977
>>>> 
>>>> Do you have any suggestions?? Is the procedure I am using to
>>>> retrieve  the
>>>> genes right?
>>>> 
>>>> Many thanks
>>>> 
>>>> Marco
>>>> 
>>>> 
>>>> 
>>>>>> Many thanks Scott,
>>>>>> 
>>>>>> At the same time I got your email I was coming to the same
>>>>>> conclusion as
>>>>>> you.
>>>>>> 
>>>>>> Now I have a stranger problem in my hands... My goal is quite
>>>>>> simple, I
>>>>>> try
>>>>>> to get the sequence of the genes back from the Bio::DB::GFF database
>>>>>> loaded
>>>>>> on MySQL. The gene list is from a file with one gene id per per
>>>>>> line. When
>>>>>> I
>>>>>> run the following script:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> use Bio::DB::GFF;
>>>>>> use Bio::SeqIO;
>>>>>> my $out = Bio::SeqIO->new(    -fh => \*STDOUT,
>>>>>>                             -format => 'fasta');
>>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>>>>                               -dsn =>
>>>>>> 'dbi:mysql:database=dmel_43_new');
>>>>>> 
>>>>>> while (<>){
>>>>>>     chomp;
>>>>>>     my $id = $_;
>>>>>>     my @feats = $db->get_feature_by_name($id);
>>>>>>     for my $f (@feats){
>>>>>>         $out->write_seq( $f->seq ) if $f->type =~/gene/;
>>>>>>     }
>>>>>> }
>>>>>> 
>>>>>> 
>>>>>> I get more sequence back than the number of gene in my input file. I
>>>>>> double
>>>>>> check there. Some of the duplicated entries are the same, some
>>>>>> are not!
>>>>> 
>>>>> 
>>>>> ...
>>>>> 
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>> 
>>>> ______________________________
>>>> Marco Blanchette, Ph.D.
>>>> 
>>>> mblanche at uclink.berkeley.edu
>>>> 
>>>> Donald C. Rio's lab
>>>> Department of Molecular and Cell Biology
>>>> 16 Barker Hall
>>>> University of California
>>>> Berkeley, CA 94720-3204
>>>> 
>>>> Tel: (510) 642-1084
>>>> Cell: (510) 847-0996
>>>> Fax: (510) 642-6062
>>>> --
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> 
>>> Christopher Fields
>>> Postdoctoral Researcher
>>> Lab of Dr. Robert Switzer
>>> Dept of Biochemistry
>>> University of Illinois Urbana-Champaign
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> 
>> 
>> 
> 
> ______________________________
> Marco Blanchette, Ph.D.
> 
> mblanche at uclink.berkeley.edu
> 
> Donald C. Rio's lab
> Department of Molecular and Cell Biology
> 16 Barker Hall
> University of California
> Berkeley, CA 94720-3204
> 
> Tel: (510) 642-1084
> Cell: (510) 847-0996
> Fax: (510) 642-6062

______________________________
Marco Blanchette, Ph.D.

mblanche at uclink.berkeley.edu

Donald C. Rio's lab
Department of Molecular and Cell Biology
16 Barker Hall
University of California
Berkeley, CA 94720-3204

Tel: (510) 642-1084
Cell: (510) 847-0996
Fax: (510) 642-6062
--