[BioSQL-l] Quick question about tools using BioSQL

Tue Mar 2 06:06:08 EST 2010

On Mon, Mar 1, 2010 at 7:59 PM, Amedeo, Paolo <pamedeo at jcvi.org> wrote:
>
> I'm evaluating the possibility of using BioSQL for a new genome
> annotation project that also involves loading annotated genomes from
> GenBank.

We looked at this too, but one major drawback of the BioSQL schema
is it does not attempt to handle revisions (e.g. biologists manually
curating the annotation). I think my colleague is planning on trying
Artemis with a Chado database for this.

> So far, I have successfully deployed a working copy of the database
> under MySQL on my machine and loaded a couple of genomes from
> GenBank files, using the script load_seqdatabase.pl found in the scripts
> directory of BioPerl-db-1.6.0.

You can also load GenBank files into BioSQL using BioJava, Biopython
etc. They should give almost the same result in the database (since
the BioPerl implementation is the de facto reference implementation).

> Browsing the database, however, I have noticed a few things that
> concern me a little bit.
>
> First, the seqfeature_relationship table is completely empty and I
> couldn't identify any obvious way to investigate parent/child
> relationships between entities stored in seqfeature (e.g. in the case of
> overlapping genes, or genes embedded in introns of other genes,
> how one could determine to which gene a given CDS belongs?).

GenBank files do not hold that information, so why do you expect
the parser to store it in the database?

GFF files do hold this kind of information, and I am aware of a BioPerl
script to turn GenBank files into GFF (GFF3?) but to do this requires
a lot of inference about the gene model etc to generate the parent/child
information.

You could load the basic GenBank file into the database and then
run your own analysis to populate the relationship tables with this
kind of information.

> Second, I was unable to find a dedicated script to populate the ontology
> table and I was somehow surprised that this table got somehow populated
> with the keywords present in the GenBank files.

The BioPerl implementation (and others since) use an "ad hoc"
ontology, basically a nasty hack. I don't think any of the Bio* binding
for BioSQL implement a strict ontology check (but it would be nice).

> Third, once I have loaded a genome without first populating the taxon
> table and, as a result I have noticed that the values assigned to
> taxon.left_value and taxon.right_value described a narrow interval that
> didn't include at all the taxon_id of the genome loaded in the database.

There are two IDs for each entry in the taxon table, the database key
and the NCBI taxon id, which can be different. Biopython ignores the
left/right values, so I haven't looked at this recently. Could you clarify
with an example?

> I then tried to use the script bioentry2flat.pl to try to write back to
> a gbf file the genome that I had loaded in the database. Unfortunately I
> couldn't find any documentation for this script and I've tried to use as
> values of the various parameters the same strings that I used with the
> other script.  I had to edit the code to get rid of hard-coded values,
> but still I couldn't get the script to run successfully. I suspect that
> there is some problem with matching correctly the accession.

I've not used that script (I would use Biopython, just a few lines to
retrieve the entry from the BioSQL database, then give it to our
SeqIO module for output as a GenBank file).

> Obviously I'm doing one or more things wrong and/or I'm not using the
> proper set of tools for doing what I need to do.

I'm not sure if BioSQL does exactly what you expect.

> I would really appreciate if somebody could point me to a set of tools
> that would allow me to load gbf files into the database and extract the
> individual accessions in both gbf and asn.1 (sqn) format, or teaches me
> how should I correctly use these two scripts, so that the
> bioentry_relationship table is populated correctly.

When you say gbf do you mean GenBank? Most people use the
shorthand gb or gbk (based on the typical file extensions).

Could you give an example of the sort information you are hoping
to find in the bioentry_relationship table?

Peter