[Bioperl-l] GFF file and load_gff.pl
Richard Harrison
richard.harrison at edinburgh.ac.uk
Wed Jan 28 20:59:02 UTC 2009
Thanks Scott,
You're being a great help. Unfortunately, I am still struggling. There
was no "##gff-version 3" line at the top of the GFF file. I added one,
but it makes little difference. I had a look at
Bio::DB::SeqFeature::Store, but as far as I can make out it handles the
GFF file worse than Bio::DB::GFF does. I tried another GFF3 file from a
different source and it made no difference at all.
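For anyone checking the same thing, a quick shell test along these lines
shows whether a file declares GFF3 on its first line and whether the
ninth column uses tag=value pairs rather than the space-separated GFF2
form. It is only a sketch: the function names are made up for
illustration and are not part of BioPerl, and it assumes the columns are
tab-separated.

```shell
# Hypothetical helpers (not part of BioPerl) for sanity-checking a GFF file.

# Succeeds if the first line of the file is the GFF3 version pragma.
gff_declares_v3() {
    head -n 1 "$1" | grep -Eq '^##gff-version[[:space:]]+3'
}

# Prints the ninth column of the first feature line,
# skipping comment lines and anything with fewer than nine columns.
gff_first_attributes() {
    awk -F '\t' '!/^#/ && NF >= 9 { print $9; exit }' "$1"
}

# Succeeds if that ninth column starts with tag=value (GFF3 style)
# rather than tag, space, value (GFF2 style).
gff_attributes_look_like_v3() {
    gff_first_attributes "$1" | grep -Eq '^[^ =;]+='
}
```

With these defined, running
gff_declares_v3 genome.gff && gff_attributes_look_like_v3 genome.gff && echo "looks like GFF3"
prints the message only when both checks pass.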
These are the commands that I'm using to populate two different
databases, so I can work out which method is best:
./bp_seqfeature_load.pl -d cere_seqfeat -s Bio::DB::SeqFeature -a DBI::mysql -user root -pass pwd -v --verbose -c genome.gff
./bp_load_gff.pl -d cere_gffdb -user root -pass pwd --adaptor=dbi::mysql --create --gff3_munge genome.gff
Both databases seem to load the data OK and don't give error messages.
Then in BioPerl:
use Bio::DB::SeqFeature::Store;
# Open the feature database
my $db = Bio::DB::SeqFeature::Store->new(
    -adaptor => 'DBI::mysql',
    -dsn     => 'dbi:mysql:cere_seqfeat',
    -user    => 'root',
    -pass    => 'pwd',
    -create  => 1
);
my @types = $db->types;
foreach (@types) {
    print "$_\n";
}
I GET NO OUTPUT
Alternatively:
use Bio::DB::GFF;
# Open the feature database
my $db = Bio::DB::GFF->new(
    -adaptor => 'dbi::mysql',
    -dsn     => 'dbi:mysql:cere_gffdb',
    -user    => 'root',
    -pass    => 'pwd'
);
my @types = $db->types;
foreach (@types) {
    print "$_\n";
}
I GET:
telomere:SGD
intron:SGD
insertion:SGD
chromosome:SGD
region:landmark
ncRNA:SGD
transposable_element_gene:SGD
region:SGD
ARS:SGD
snRNA:SGD
snoRNA:SGD
nc_primary_transcript:SGD
rRNA:SGD
transposable_element:SGD
gene:SGD
CDS:SGD
repeat_family:SGD
transcript_region:SGD
pseudogene:SGD
nucleotide_match:SGD
tRNA:SGD
binding_site:SGD
repeat_region:SGD
centromere:SGD
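
As a cross-check that does not depend on either database, the same
type:source pairs can be tallied straight from the GFF file, using
column 3 (type) and column 2 (source). This is a rough sketch only: the
function name is hypothetical, and it assumes tab-separated columns.

```shell
# Hypothetical helper: list the distinct type:source pairs in a GFF file,
# in the same "type:source" form shown above (e.g. gene:SGD).
gff_types() {
    # Skip comment lines and anything that is not a nine-column feature
    # line (for example, sequence lines in a trailing FASTA section).
    awk -F '\t' '!/^#/ && NF >= 9 { print $3 ":" $2 }' "$1" | sort -u
}
```

Running gff_types genome.gff should then give a list that can be
compared by eye with what $db->types returns.
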
Any ideas what is going on here? I'm struggling to comprehend where
I'm going wrong.
Best wishes,
Richard
On 28 Jan 2009, at 18:51, Scott Cain wrote:
> Hi Richard,
>
> A few items:
>
> * It looks as though the loader didn't know that it was loading GFF3
> (you can tell it's GFF3 by the = between the tags and values in the
> ninth column; in GFF2, there would be a space). As a result, the
> classes weren't created properly. Check that there is a line at the
> top of your GFF file that looks like "##gff-version 3"
>
> * You may not want to use a Bio::DB::GFF database anyway. Since you
> are just getting started and have GFF3, you might be better off using
> a Bio::DB::SeqFeature::Store database, which was designed to work with
> GFF3 data (Bio::DB::GFF works better with GFF2). The loader for a
> SeqFeature::Store database is called bp_seqfeature_load.pl.
>
> Scott
>
>
> On Wed, Jan 28, 2009 at 12:36 PM, Richard Harrison
> <richard.harrison at ed.ac.uk> wrote:
>> Thank you Chris, Scott and Adam,
>> You are right, I was confused. I have now managed to create a
>> Bio::DB::GFF database with my genome annotation loaded into it. One
>> further question: I am having trouble retrieving the desired info
>> from the database. Shown below is a typical entry in the GFF file for
>> a gene:
>>
>>
>> #chr01 SGD gene 33449 34702 . + . ID=YAL061W;Name=YAL061W;gene=BDH2;Alias=BDH2;Ontology_term=GO:0008150,GO:0005634,GO:0005737,GO:0016616,GO:0008270,GO:0016491,GO:0046872;Note=Putative%20medium-chain%20alcohol%20dehydrogenase%20with%20similarity%20to%20BDH1%3B%20transcription%20induced%20by%20constitutively%20active%20PDR1%20and%20PDR3%3B%20BDH2%20is%20an%20essential%20gene;dbxref=SGD:S000000057;orf_classification=Uncharacterized
>>
>> #chr01 SGD CDS 33449 34702 . + 0 Parent=YAL061W;Name=YAL061W;gene=BDH2;Alias=BDH2;Ontology_term=GO:0008150,GO:0005634,GO:0005737,GO:0016616,GO:0008270,GO:0016491,GO:0046872;Note=Putative%20medium-chain%20alcohol%20dehydrogenase%20with%20similarity%20to%20BDH1%3B%20transcription%20induced%20by%20constitutively%20active%20PDR1%20and%20PDR3%3B%20BDH2%20is%20an%20essential%20gene;dbxref=SGD:S000000057;orf_classification=Uncharacterized
>>
>>
>> I would like to search the database for YAL061W and retrieve the CDS
>> coordinates, details about introns etc. I don't need the sequence, as
>> I have separate multiple genome alignments.
>>
>>
>> At present all I can work out how to do is get all feature types and
>> classes in the database (see code below).
>>
>>
>> my $db = Bio::DB::GFF->new(
>>     -adaptor => 'dbi::mysql',
>>     -dsn     => 'dbi:mysql:biosql',
>>     -user    => 'root',
>>     -pass    => '*******'
>> );
>> #get types
>> my @types = $db->types;
>>
>> EG:
>> #telomere:SGDintron:SGDinsertion:SGDchromosome:SGDregion:landmarkncRNA:SGDtransposable_element_gene:SGDregion:SGDARS:SGDsnRNA:SGDsnoRNA:SGDnc_primary_transcript:SGDrRNA
>> etc...
>>
>>
>>
>> #get classes
>> my @classes = $db->classes;
>>
>> ID=YKR067W
>> ID=YKR068C
>> ID=YKR069W
>> ID=YKR070W
>> ID=YKR071C
>> ID=YKR072C
>> ID=YKR073C
>> ID=YKR074W
>>
>> etc...
>>
>> Could someone point me towards a useful set of pointers for this?
>> I've tried reading the documentation but it doesn't seem to
>> illustrate what I want to do.
>>
>> Best wishes and thanks for the help so far,
>>
>> Richard
>>
>>
>>
>>
>>
>>
>>
>> On 28 Jan 2009, at 16:15, Scott Cain wrote:
>>
>>> Hi Richard,
>>>
>>> You're mixing up two database schemas. Do you want to use a BioSQL
>>> database (bioperl-db) or a Bio::DB::GFF database? I'm guessing that
>>> you want the latter, so I'll answer that question (as it's the easier
>>> one anyway). You need to add the "-c" flag (for --create) to the
>>> load_gff.pl command to create the Bio::DB::GFF schema.
>>>
>>> If you really wanted a BioSQL database, you'll have to wait for help
>>> from someone else more knowledgeable about it.
>>>
>>> Scott
>>>
>>>
>>>
>>>
>>> On Wed, Jan 28, 2009 at 10:22 AM, Richard Harrison
>>> <richard.harrison at ed.ac.uk> wrote:
>>>>
>>>> Dear all,
>>>>
>>>> I am running BioPerl 1.6 on OS X Leopard on a MacBook Pro.
>>>>
>>>> I have installed mysql-5.1.30-osx10.5-x86, DBD-mysql-4.010, the
>>>> biosql-schema for mysql and bioperl-db. As per the instructions I
>>>> have a database called biosql, which I associated with the SQL
>>>> dialect biosqldb-mysql.sql.
>>>>
>>>> After much fannying, the install seems fine... although I can't be
>>>> sure (never used mysql before).
>>>>
>>>> I am having problems with the script load_gff.pl.
>>>>
>>>> I want to load a database with the data from a genome.gff file (for
>>>> Saccharomyces cerevisiae). I don't want to add sequence to it, as all
>>>> I need is the annotation.
>>>>
>>>> I have tried the following command(s):
>>>>
>>>> ./bp_load_gff.pl -d biosql -user root -pass mypassword genome.gff
>>>> ./bp_load_gff.pl -d biosql -user root -pass mypassword --adaptor=dbi::mysql genome.gff
>>>>
>>>> With both I get the following error:
>>>>
>>>> No ftype id for CDS:SGD Table 'biosql.ftype' doesn't exist Record skipped.
>>>> (then another few thousand of these)
>>>> then...
>>>>
>>>> genome.gff: 16379 records loaded
>>>>
>>>>
>>>> Any ideas where I'm going wrong?
>>>>
>>>> Thanks,
>>>>
>>>> Richard
>>>>
>>>> ____________________________
>>>> Dr Richard Harrison
>>>> 127 Ashworth Labs
>>>> Institutes of Evolutionary Biology
>>>> King's Buildings
>>>> West Mains Road
>>>> Edinburgh EH9 3JT
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> The University of Edinburgh is a charitable body, registered in
>>>> Scotland, with registration number SC005336.
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>>
>>>
>>> --
>>> ------------------------------------------------------------------------
>>> Scott Cain, Ph. D.                      scott at scottcain dot net
>>> GMOD Coordinator (http://gmod.org/)     216-392-3087
>>> Ontario Institute for Cancer Research
>>>
>>
>>
>>
>>
>
>
>
>
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.