[BioSQL-l] gff in biosql; biosql, gbrowse observations

Thu Mar 3 18:00:45 EST 2005

On Mar 3, 2005, at 1:25 PM, Genevieve DeClerck wrote:

> Hi,
>
> I've been playing around with a test biosql database I have running 
> (with gbrowse frontend and mysql rdbms). Using the 
> 'load_seqdatabase.pl' script I can successfully load a genbank file 
> (NC_004578.gbk, the TIGR-annotated P. syringae chromosome) into a 
> mysql biosql database. However, when I try to load a gff file with 
> additional syringae data I run in to problems - it appears that GFF is 
> not supported by biosql?? (I don't see it as a valid format in SeqIO/ 
> or any mention in the docs).

Biosql itself doesn't support or not support specific formats - it's 
whatever bioperl supports.

There is two caveats to this statement:

	1) load_seqdatabase.pl instantiates a SeqIO-compliant (or in fact a 
Bio::ClusterIO-compliant) parser and thus expects Bio::SeqI (or 
Bio::ClusterI) objects coming from the parser. In order to use a parser 
that returns different objects (like Bio::SeqFeatureI from the 
Bio::FeatureIO system if I understand things correctly) you at this 
point would have to write a variant of load_seqdatabase.pl that deals 
with this. Although it shouldn't be difficult (or so I suppose) it 
doesn't come off the shelf right now.

	2) The biosql schema does distinguish between 'bioentries' and 
seqfeatures and has a not-null constraint on the foreign key from 
seqfeature to bioentry. That is, puristically speaking biosql does not 
support features without a sequence object that holds them. So, if you 
had a Bio::FeatureIO::gff parser that returns SeqFeatureI objects 
without attached sequence, you either need to artificially make up a 
sequence or you cannot load those features into biosql.

>
> Also, when I take a look at the data in the biosql database, I see a 
> lot of empty tables and empty fields. The bulk of the data loads into 
> these tables: location, seqfeature, seqfeature_qualifier_value. One 
> table I expected to see data in is dbxref - I was surprized to see 
> this completely empty because there are db_xrefs defined throughout 
> the gbk file I loaded.

GenBank has dbxrefs in obscure places in the feature table. The 
Bio::SeqIO::genbank parser will not make an attempt to parse dbxrefs 
out of feature annotation - you have to write this yourself, e.g., as a 
SeqProcessor (this is what I do for e.g. RefSeq).

Alternatively, use EMBL format as input which has designated tags for 
dbxrefs that are parsed accordingly by the SeqIO::embl parser.

>  Other tables that are empty: bioentry_dbxref, bioentry_path, 
> bioentry_relationship, dbxref_qualifier_value, 
> location_qualifier_value, seqfeature_dbxref, seqfeature_path  
> seqfeature_relationship, term_* (all term tables except 'term').

The bioperl parser or the loader will not do black magic by themselves 
- you only get the information that is in there and that is properly 
tagged. A genbank record does not contain relationships to other 
records other than (obscured) dbxrefs, so bioentry_relationship will 
not be populated by it. A ClusterIO stream will populate that table, 
e.g. when parsing UniGene. bioentry_path is meant for the transitive 
closure over bioentry_relationship and you have to compute it yourself 
at this point. seqfeature_relationship will not get populated from a 
genbank file as the feature table in genbank (and embl) is flat. In 
order to load relationships between terms you will need to load the 
ontology that defines them. term_path is again the transitive closure 
table (the load_ontology.pl script _will_ optionally compute the 
transitive closure). If dbxref is empty then association tables with it 
will of course also be empty.

>
> As for empty fields, I would not have expected the 'display_name' in 
> table 'seqfeature' to be empty/NULL.

It is empty because none of the bioperl SeqIO parsers populates it.

>  This appears to be an important field for feature display in gbrowse.

Agreed, but it's not supported (yet) by the bioperl SeqIO system. In a 
sequence-object centric world naming features is not important, as 
opposed to a feature-centric world.

>  When I view this biosql database in gbrowse I can see CDS and genes 
> glyphs, but the identifier labels are missing (even thought they're 
> acitivate in the .conf file). I had a suspicion that the NULL 
> display_name field might have something to do with the missing glyph 
> label so I edited a record in 'seqfeature' in mysql, replacing NULL 
> with "Foo" -- I refreshed the gbrowse page and, sure enough, a "Foo" 
> label appeared on the glyph whose info I altered in the db. This 
> proved to me that there definately seems to be some disconnect between 
> how the gbk file was parsed by load_seqdatabase.pl and/or how these 
> data were inserted into the database.

There is no disconnect, but there is no magic either.

Remember, load_seqdatabase.pl and the rest of bioperl-db in essence is 
nothing but the object-relational bridge that maps bioperl objects into 
the biosql schema. load_seqdatabase.pl won't make up any content that 
you didn't supply it with to begin with through the bioperl SeqIO 
parser or chained SeqProcessor.

If you look at a genbank feature table you will realize that features 
do not have names - they only have type and location (and an implicit 
source), and some tag/value annotation. This is what a bioperl SeqIO 
parser will generally give you.

The disconnect is between the feature-centric world as e.g. represented 
in GFF(3) and the sequence record-centric world represented by 
datasources like Genbank, Embl, UniProt, etc. and therefore also the 
Bioperl core object model. There are modules and scripts in bioperl 
that try to bridge this disconnect by converting seq objects with 
features to features with names, and by 'unflattening' Genbank feature 
tables. Check out stuff under scripts/Bio-DB-GFF and 
Bio::SeqFeature::Tools::Unflattener.

	-hilmar

-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------