[BioSQL-l] Treating GenBank source features as top level annotation

Peter biopython at maubp.freeserve.co.uk
Wed Nov 18 06:06:51 EST 2009


Hello all,

Something we've just been discussing on the Biopython mailing list
is a possible change to how we parse the source features in GenBank
(or EMBL) files. This could have knock-on implications for how we use
BioSQL. For anyone interested, the thread is here:
http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html

The basic observation is that GenBank files do not have any extensible
annotation block for the whole sequence. There are a few fields like
the comment, organism and taxonomy - but nothing general and
structured. Instead, it seems the NCBI etc decided to use the feature
table for this task by inventing the "source" feature. In every single
GenBank file I have ever seen with a source feature, there is only
one feature of this type and it spans the full sequence.

For example, NC_005816, Yersinia pestis biovar Microtus str. 91001
plasmid pPCP1, complete sequence:

 source      1..9609
             /organism="Yersinia pestis biovar Microtus str. 91001"
             /mol_type="genomic DNA"
             /strain="91001"
             /db_xref="taxon:229193"
             /plasmid="pPCP1"
             /biovar="Microtus"

(I reduced the white space for emailing). All of that information
makes sense as annotation for the whole sequence. In fact, the
"organism" entry is duplicated on the ORGANISM line in the
GenBank header (and the SOURCE line too).

Currently we (Biopython, BioPerl etc.) store this annotation in BioSQL
using the seqfeature_qualifier_value and seqfeature_dbxref tables,
associated with a "source" feature in the seqfeature table.

I am suggesting it could make more sense to store the "source"
feature annotation at the sequence level, using instead the
bioentry_qualifier_value and bioentry_dbxref tables.
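
As a rough sketch of that idea, using plain Python dictionaries rather
than real Biopython or BioSQL objects, a full-length "source" feature
could be lifted to sequence level along these lines. The function name
lift_source_feature and the dict layout are hypothetical, chosen only
for illustration, and the "gene" feature in the usage part is made up:

```python
def lift_source_feature(seq_length, features):
    """Split features into (annotations, dbxrefs, remaining features).

    Hypothetical helper: each feature is a dict with "type", "start",
    "end" (1-based, inclusive, as in the GenBank feature table) and
    "qualifiers" (mapping a qualifier key to a list of values).
    """
    annotations = {}  # would map to bioentry_qualifier_value
    dbxrefs = []      # would map to bioentry_dbxref
    remaining = []    # everything that stays a normal feature
    for feature in features:
        is_full_length_source = (
            feature["type"] == "source"
            and feature["start"] == 1
            and feature["end"] == seq_length
        )
        if not is_full_length_source:
            # Partial "source" features (if any exist) are left alone.
            remaining.append(feature)
            continue
        for key, values in feature["qualifiers"].items():
            if key == "db_xref":
                dbxrefs.extend(values)
            else:
                annotations.setdefault(key, []).extend(values)
    return annotations, dbxrefs, remaining


# The NC_005816 source feature from the example above.
source = {
    "type": "source",
    "start": 1,
    "end": 9609,
    "qualifiers": {
        "organism": ["Yersinia pestis biovar Microtus str. 91001"],
        "mol_type": ["genomic DNA"],
        "strain": ["91001"],
        "db_xref": ["taxon:229193"],
        "plasmid": ["pPCP1"],
        "biovar": ["Microtus"],
    },
}
# A made-up feature, only here to show that other features are kept.
other = {
    "type": "gene",
    "start": 100,
    "end": 200,
    "qualifiers": {"gene": ["made_up"]},
}

annotations, dbxrefs, remaining = lift_source_feature(9609, [source, other])
```

Only a "source" feature spanning the whole sequence is lifted here, so
a file with several partial source features would keep them as
ordinary features.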

This is a slight shift from the origins of BioSQL as a schema to
hold GenBank files - but to me at least it is more logical.

What does everyone else think? Or, since things work as they are,
is this a case of "if it ain't broken, don't fix it"?

Peter

[Even if Biopython changes its internal object structure to treat
the "source" feature annotation as sequence level annotation,
we *could* continue to use a "source" feature when loading
GenBank files to/from BioSQL if required for compatibility with
the other Bio* projects. It would be more work though. In any
case, we'd also need to recreate a "source" feature when
writing GenBank output files.]
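
To illustrate the last point in the note above, here is a minimal
sketch of recreating a "source" feature from sequence level
annotation, again with plain dictionaries. The names
rebuild_source_feature and format_feature are hypothetical, and the
output layout is a simplification (it quotes every value and uses the
reduced white space from the example earlier), not a full GenBank
writer:

```python
def rebuild_source_feature(seq_length, annotations, dbxrefs):
    """Build a full-length "source" feature dict from sequence annotation.

    Hypothetical helper: annotations maps a qualifier key to a list of
    values, dbxrefs is a list of strings such as "taxon:229193".
    """
    qualifiers = {key: list(values) for key, values in annotations.items()}
    if dbxrefs:
        qualifiers["db_xref"] = list(dbxrefs)
    return {
        "type": "source",
        "start": 1,
        "end": seq_length,
        "qualifiers": qualifiers,
    }


def format_feature(feature):
    """Render a feature as simplified GenBank-style feature table lines."""
    lines = [" %-11s %d..%d" % (feature["type"], feature["start"], feature["end"])]
    for key, values in feature["qualifiers"].items():
        for value in values:
            # Simplification: real GenBank output leaves some values unquoted.
            lines.append('%s/%s="%s"' % (" " * 13, key, value))
    return "\n".join(lines)


annotations = {
    "organism": ["Yersinia pestis biovar Microtus str. 91001"],
    "mol_type": ["genomic DNA"],
    "strain": ["91001"],
    "plasmid": ["pPCP1"],
    "biovar": ["Microtus"],
}
feature = rebuild_source_feature(9609, annotations, ["taxon:229193"])
text = format_feature(feature)
```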

