[BioSQL-l] gff in biosql; biosql, gbrowse observations
hlapp at gnf.org
Thu Mar 3 18:00:45 EST 2005
On Mar 3, 2005, at 1:25 PM, Genevieve DeClerck wrote:
> I've been playing around with a test biosql database I have running
> (with gbrowse frontend and mysql rdbms). Using the
> 'load_seqdatabase.pl' script I can successfully load a genbank file
> (NC_004578.gbk, the TIGR-annotated P. syringae chromosome) into a
> mysql biosql database. However, when I try to load a gff file with
> additional syringae data I run in to problems - it appears that GFF is
> not supported by biosql?? (I don't see it as a valid format in SeqIO/
> or any mention in the docs).
Biosql itself doesn't support or not support specific formats - it's
whatever bioperl supports.
There is two caveats to this statement:
1) load_seqdatabase.pl instantiates a SeqIO-compliant (or in fact a
Bio::ClusterIO-compliant) parser and thus expects Bio::SeqI (or
Bio::ClusterI) objects coming from the parser. In order to use a parser
that returns different objects (like Bio::SeqFeatureI from the
Bio::FeatureIO system if I understand things correctly) you at this
point would have to write a variant of load_seqdatabase.pl that deals
with this. Although it shouldn't be difficult (or so I suppose) it
doesn't come off the shelf right now.
2) The biosql schema does distinguish between 'bioentries' and
seqfeatures and has a not-null constraint on the foreign key from
seqfeature to bioentry. That is, puristically speaking biosql does not
support features without a sequence object that holds them. So, if you
had a Bio::FeatureIO::gff parser that returns SeqFeatureI objects
without attached sequence, you either need to artificially make up a
sequence or you cannot load those features into biosql.
> Also, when I take a look at the data in the biosql database, I see a
> lot of empty tables and empty fields. The bulk of the data loads into
> these tables: location, seqfeature, seqfeature_qualifier_value. One
> table I expected to see data in is dbxref - I was surprized to see
> this completely empty because there are db_xrefs defined throughout
> the gbk file I loaded.
GenBank has dbxrefs in obscure places in the feature table. The
Bio::SeqIO::genbank parser will not make an attempt to parse dbxrefs
out of feature annotation - you have to write this yourself, e.g., as a
SeqProcessor (this is what I do for e.g. RefSeq).
Alternatively, use EMBL format as input which has designated tags for
dbxrefs that are parsed accordingly by the SeqIO::embl parser.
> Other tables that are empty: bioentry_dbxref, bioentry_path,
> bioentry_relationship, dbxref_qualifier_value,
> location_qualifier_value, seqfeature_dbxref, seqfeature_path
> seqfeature_relationship, term_* (all term tables except 'term').
The bioperl parser or the loader will not do black magic by themselves
- you only get the information that is in there and that is properly
tagged. A genbank record does not contain relationships to other
records other than (obscured) dbxrefs, so bioentry_relationship will
not be populated by it. A ClusterIO stream will populate that table,
e.g. when parsing UniGene. bioentry_path is meant for the transitive
closure over bioentry_relationship and you have to compute it yourself
at this point. seqfeature_relationship will not get populated from a
genbank file as the feature table in genbank (and embl) is flat. In
order to load relationships between terms you will need to load the
ontology that defines them. term_path is again the transitive closure
table (the load_ontology.pl script _will_ optionally compute the
transitive closure). If dbxref is empty then association tables with it
will of course also be empty.
> As for empty fields, I would not have expected the 'display_name' in
> table 'seqfeature' to be empty/NULL.
It is empty because none of the bioperl SeqIO parsers populates it.
> This appears to be an important field for feature display in gbrowse.
Agreed, but it's not supported (yet) by the bioperl SeqIO system. In a
sequence-object centric world naming features is not important, as
opposed to a feature-centric world.
> When I view this biosql database in gbrowse I can see CDS and genes
> glyphs, but the identifier labels are missing (even thought they're
> acitivate in the .conf file). I had a suspicion that the NULL
> display_name field might have something to do with the missing glyph
> label so I edited a record in 'seqfeature' in mysql, replacing NULL
> with "Foo" -- I refreshed the gbrowse page and, sure enough, a "Foo"
> label appeared on the glyph whose info I altered in the db. This
> proved to me that there definately seems to be some disconnect between
> how the gbk file was parsed by load_seqdatabase.pl and/or how these
> data were inserted into the database.
There is no disconnect, but there is no magic either.
Remember, load_seqdatabase.pl and the rest of bioperl-db in essence is
nothing but the object-relational bridge that maps bioperl objects into
the biosql schema. load_seqdatabase.pl won't make up any content that
you didn't supply it with to begin with through the bioperl SeqIO
parser or chained SeqProcessor.
If you look at a genbank feature table you will realize that features
do not have names - they only have type and location (and an implicit
source), and some tag/value annotation. This is what a bioperl SeqIO
parser will generally give you.
The disconnect is between the feature-centric world as e.g. represented
in GFF(3) and the sequence record-centric world represented by
datasources like Genbank, Embl, UniProt, etc. and therefore also the
Bioperl core object model. There are modules and scripts in bioperl
that try to bridge this disconnect by converting seq objects with
features to features with names, and by 'unflattening' Genbank feature
tables. Check out stuff under scripts/Bio-DB-GFF and
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
More information about the BioSQL-l