[Bioperl-l] GFF file and load_gff.pl

Scott Cain scott at scottcain.net
Wed Jan 28 22:30:25 UTC 2009


Hi Richard,

It's not clear to me why the SeqFeature::Store didn't give a result
with the types call, but the Bio::DB::GFF call is giving you the
expected result.  The types method gives a list of
Bio::DB::GFF::Typename objects (you can see the explanation in the
perldoc for Bio::DB::GFF).  The Bio::DB::GFF::Typename class has an
asString method (perldoc Bio::DB::GFF::Typename) that gets called when
the object is being used in a string context, which it is when used in
a print statement.
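
In case it helps, here is a minimal sketch of how that kind of string
overloading works in Perl.  My::Typename is a hypothetical stand-in I made up
for illustration; it is not the real Bio::DB::GFF::Typename code, just the
same general technique, assuming the class stores a method and a source.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Bio::DB::GFF::Typename; not the real class.
package My::Typename;

# Ask Perl to call asString whenever the object is used as a string.
use overload '""' => 'asString', fallback => 1;

sub new {
    my ($class, $method, $source) = @_;
    return bless { method => $method, source => $source }, $class;
}

# Join the two fields as "method:source".
sub asString {
    my $self = shift;
    return join ':', $self->{method}, $self->{source};
}

package main;

my @types = (
    My::Typename->new('gene', 'SGD'),
    My::Typename->new('CDS',  'SGD'),
);

# Interpolating into a string is a string context, so asString is
# called implicitly here.
print "$_\n" for @types;
```

The print loop never calls asString by name; the overload does it.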

The asString method tells the object to return "type:source", which is
what you're seeing. The documentation for Bio::DB::GFF::Typename uses
the word "method" for type, which is what the type (the thing in the
third column) used to be called.  The source is the thing in the
second column.
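
As a rough illustration of which columns those are, the sketch below splits
one feature line (the gene line quoted further down, with the ninth column
shortened by me) and builds the same kind of label.  The variable names are
my own, and this is plain Perl rather than anything Bio::DB::GFF does
internally.

```perl
use strict;
use warnings;

# Feature line based on the gene entry quoted in this thread; the ninth
# column is shortened here for readability.
my $line = join "\t",
    'chr01', 'SGD', 'gene', 33449, 34702, '.', '+', '.',
    'ID=YAL061W;Name=YAL061W';

# GFF columns are tab-separated.  Column 2 is the source and column 3 is
# the type (called "method" in the older documentation).
my ($seqid, $source, $method) = split /\t/, $line;

my $label = "$method:$source";
print "$label\n";    # prints "gene:SGD"
```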

On reading that last paragraph, it feels very unclear to me, though it
says exactly what I want it to.  Does it make sense to you?

Scott


On Wed, Jan 28, 2009 at 3:59 PM, Richard Harrison
<richard.harrison at edinburgh.ac.uk> wrote:
> Thanks Scott,
> You're being a great help. Unfortunately, I am still struggling. There was
> no line at the top of the gff file. I added one, but it makes little
> difference. I had a look at Bio::DB::SeqFeature::Store, but as far as I can
> make out it handles the gff file worse than Bio::DB::GFF does. I tried
> another gff3 file from a different source and it made no difference at all.
>
>
> These are the commands that I'm using to populate two different databases,
> so I can work out which method is best:
>
> ./bp_seqfeature_load.pl -d cere_seqfeat -s Bio::DB::SeqFeature -a DBI::mysql -user root -pass pwd -v --verbose -c genome.gff
>
> ./bp_load_gff.pl -d cere_gffdb -user root -pass pwd --adaptor=dbi::mysql --create --gff3_munge genome.gff
>
> Both databases seem to load the data OK and don't give error messages.
>
>
> Then in bioperl:
>
> #use Bio::DB::SeqFeature;
>
>  # Open the feature database
>  my $db = Bio::DB::SeqFeature::Store->new( -adaptor => 'DBI::mysql',
>                                            -dsn     => 'dbi:mysql:cere_seqfeat',
>                                            -user    => 'root',
>                                            -pass    => 'pwd',
>                                            -create  => 1
>                                          );
>  my @types = $db->types;
>  foreach (@types) {
>      print "$_\n";
>  }
>
>
> I GET NO OUTPUT
>
> Alternatively:
> use Bio::DB::GFF;
>
>  # Open the feature database
>        my $db      = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>                                   -dsn     => 'dbi:mysql:cere_gffdb',
>                                   -user => 'root',
>                                   -pass => 'pwd'
>                                 );
>
> my @types = $db->types;
>        foreach (@types){
>        print "$_\n";
>        }
>
> I GET:
>
> telomere:SGD
> intron:SGD
> insertion:SGD
> chromosome:SGD
> region:landmark
> ncRNA:SGD
> transposable_element_gene:SGD
> region:SGD
> ARS:SGD
> snRNA:SGD
> snoRNA:SGD
> nc_primary_transcript:SGD
> rRNA:SGD
> transposable_element:SGD
> gene:SGD
> CDS:SGD
> repeat_family:SGD
> transcript_region:SGD
> pseudogene:SGD
> nucleotide_match:SGD
> tRNA:SGD
> binding_site:SGD
> repeat_region:SGD
> centromere:SGD
>
>
> Any ideas what is going on here? I'm struggling to comprehend where I'm
> going wrong.
>
> Best wishes,
> Richard
>
>
>
> On 28 Jan 2009, at 18:51, Scott Cain wrote:
>
>> Hi Richard,
>>
>> A few items:
>>
>> * It looks as though the loader didn't know that it was loading GFF3
>> (you can tell it's GFF3 by the = between the tags and values in the
>> ninth column; in GFF2, there would be a space).  As a result, the
>> classes weren't created properly.  Check that there is a line at the
>> top of your GFF file that looks like "##gff-version 3"
>>
>> * You may not want to use a Bio::DB::GFF database anyway.  Since you
>> are just getting started and have GFF3, you might be better off using
>> a Bio::DB::SeqFeature::Store database, which was designed to work with
>> GFF3 data (Bio::DB::GFF works better with GFF2).  The loader for a
>> SeqFeature::Store database is called bp_seqfeature_load.pl.
>>
>> Scott
>>
>>
>> On Wed, Jan 28, 2009 at 12:36 PM, Richard Harrison
>> <richard.harrison at ed.ac.uk> wrote:
>>>
>>> Thank you Chris, Scott and Adam,
>>> You are right, I was confused. I have now managed to create a
>>> Bio::DB::GFF database with my genome annotation loaded into it. One
>>> further question: I am having trouble retrieving the desired info from
>>> the database.  Shown below is a typical entry in the GFF file for a gene:
>>>
>>>
>>> #chr01  SGD     gene    33449   34702   .       +       .
>>>
>>> ID=YAL061W;Name=YAL061W;gene=BDH2;Alias=BDH2;Ontology_term=GO:0008150,GO:0005634,GO:0005737,GO:0016616,GO:0008270,GO:0016491,GO:0046872;Note=Putative%20medium-chain%20alcohol%20dehydrogenase%20with%20similarity%20to%20BDH1%3B%20transcription%20induced%20by%20constitutively%20active%20PDR1%20and%20PDR3%3B%20BDH2%20is%20an%20essential%20gene;dbxref=SGD:S000000057;orf_classification=Uncharacterized
>>>
>>> #chr01  SGD     CDS     33449   34702   .       +       0
>>>
>>> Parent=YAL061W;Name=YAL061W;gene=BDH2;Alias=BDH2;Ontology_term=GO:0008150,GO:0005634,GO:0005737,GO:0016616,GO:0008270,GO:0016491,GO:0046872;Note=Putative%20medium-chain%20alcohol%20dehydrogenase%20with%20similarity%20to%20BDH1%3B%20transcription%20induced%20by%20constitutively%20active%20PDR1%20and%20PDR3%3B%20BDH2%20is%20an%20essential%20gene;dbxref=SGD:S000000057;orf_classification=Uncharacterized
>>>
>>>
>>> I would like to search the database for YAL061W and retrieve the CDS
>>> coordinates, details about introns etc. I don't need the sequence, as I
>>> have separate multiple genome alignments.
>>>
>>>
>>> At present all I can work out how to do is get all feature types and
>>> classes in the database (see code below).
>>>
>>>
>>> my $db      = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>                                 -dsn     => 'dbi:mysql:biosql',
>>>                                 user => 'root',
>>>                                 pass => '*******'
>>>                               );
>>>      #get types
>>>      my @types = $db->types;
>>>
>>> EG:
>>>
>>> #telomere:SGDintron:SGDinsertion:SGDchromosome:SGDregion:landmarkncRNA:SGDtransposable_element_gene:SGDregion:SGDARS:SGDsnRNA:SGDsnoRNA:SGDnc_primary_transcript:SGDrRNA etc...
>>>
>>>
>>>
>>>      #get classes
>>>      my @classes = $db->classes;
>>>
>>> ID=YKR067W
>>> ID=YKR068C
>>> ID=YKR069W
>>> ID=YKR070W
>>> ID=YKR071C
>>> ID=YKR072C
>>> ID=YKR073C
>>> ID=YKR074W
>>>
>>> etc...
>>>
>>> Could someone point me towards a useful set of pointers for this? I've
>>> tried reading the documentation but it doesn't seem to illustrate what I
>>> want to do.
>>>
>>> Best wishes and thanks for the help so far,
>>>
>>> Richard
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 28 Jan 2009, at 16:15, Scott Cain wrote:
>>>
>>>> Hi Richard,
>>>>
>>>> You're mixing up two database schemas.  Do you want to use a BioSQL
>>>> database (bioperl-db) or a Bio::DB::GFF database?  I'm guessing that
>>>> you want the latter, so I'll answer that question (as it's the easier
>>>> one anyway).  You need to add the "-c" flag (for --create) to the
>>>> load_gff.pl command to create the Bio::DB::GFF schema.
>>>>
>>>> If you really wanted a BioSQL database, you'll have to wait for help
>>>> from someone else more knowledgeable about it.
>>>>
>>>> Scott
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Jan 28, 2009 at 10:22 AM, Richard Harrison
>>>> <richard.harrison at ed.ac.uk> wrote:
>>>>>
>>>>> Dear all,
>>>>>
>>>>> I am running Bioperl 1.6 on osx- leopard on a macbook pro.
>>>>>
>>>>> I have installed mysql-5.1.30-osx10.5-x86, DBD-mysql-4.010, the
>>>>> biosql-schema for mysql and bioperl-db.  As per the instructions I have
>>>>> a database called biosql, with which I associated the SQL dialect
>>>>> biosqldb-mysql.sql.
>>>>>
>>>>> After much fannying, the install seems fine, although I can't be sure
>>>>> (never used mysql before).
>>>>>
>>>>> I am having problems with the script load_gff.pl
>>>>>
>>>>> I want to load a database with the data from a genome.gff file (for
>>>>> Saccharomyces cerevisiae). I don't want to add sequence to it, as all I
>>>>> need is the annotation.
>>>>>
>>>>> I have tried the following command(s):
>>>>>
>>>>> ./bp_load_gff.pl -d biosql -user root -pass mypassword genome.gff
>>>>> ./bp_load_gff.pl -d biosql -user root -pass mypassword --adaptor=dbi::mysql genome.gff
>>>>>
>>>>> With both I get the following error:
>>>>>
>>>>> No ftype id for CDS:SGD Table 'biosql.ftype' doesn't exist Record skipped.
>>>>> (then another few thousand of these)
>>>>> then:
>>>>>
>>>>> genome.gff: 16379 records loaded
>>>>>
>>>>>
>>>>> Any ideas where I'm going wrong?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Richard
>>>>>
>>>>> ____________________________
>>>>> Dr Richard Harrison
>>>>> 127 Ashworth Labs
>>>>> Institutes of Evolutionary Biology
>>>>> King's Buildings
>>>>> West Mains Road
>>>>> Edinburgh EH9 3JT
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> The University of Edinburgh is a charitable body, registered in
>>>>> Scotland, with registration number SC005336.
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> ------------------------------------------------------------------------
>>>> Scott Cain, Ph. D.                                   scott at scottcain dot net
>>>> GMOD Coordinator (http://gmod.org/)                     216-392-3087
>>>> Ontario Institute for Cancer Research
>>>>



-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research



