[Biopython] Additions to the SeqRecord

Wed Nov 18 06:31:40 EST 2009

Peter wrote:
>>> Regarding the special case of the source feature in GenBank files, for
>>> tasks like removing part of the record, or doing an origin shift, you may
>>> want to recreate a new source feature reusing the old source feature
>>> annotation (e.g. NCBI taxon ID). However, the location would have to
>>> reflect the new modified sequence length.
>>>
>>> I have another idea to "solve" this problem:
>>>
>>> I am actually tempted to remove the source SeqFeature, and instead
>>> handle it via the annotations dict. To me this seems more natural than
>>> having it as an entry in the feature table - a GenBank file format choice I
>>> never really understood. My guess is they didn't want to introduce a record
>>> level extensible annotation header block, which is what the source feature
>>> could be regarded as handling.
>>>
>>> i.e. When parsing a GenBank (or EMBL) file, the source feature information
>>> could get stored in the SeqRecord annotations dictionary. When writing to
>>> GenBank (or in future EMBL) format, if the annotations dictionary contained
>>> relevant fields, we would generate a source feature for the full sequence.
>>>
>>> Does that make sense? It requires looking at the source feature not as
>>> a feature which happens to span the whole sequence, but as annotation
>>> for the whole sequence (which happens to be in the GenBank features
>>> table due to a historical choice or accident).

Let's call that idea Plan(B). I've started a thread on the BioSQL mailing
list, as this possible change would have implications for Biopython's use
of BioSQL for storing this information. Unless we put some special case
handling code in our BioSQL wrapper, it would mean Biopython would
treat the "source" features differently to all the other Bio* interfaces for
BioSQL. That would be bad.

http://lists.open-bio.org/pipermail/biosql-l/2009-November/001642.html
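
To make Plan(B) a little more concrete, here is a rough sketch of the
round trip. It deliberately uses plain dicts rather than the real
SeqRecord/SeqFeature classes, and the "source_qualifiers" key and both
function names are made up for illustration only, not a proposed API:

```python
def source_to_annotations(features, annotations):
    """On parsing: move 'source' qualifiers into record-level annotations.

    Returns the feature list without the 'source' feature.
    """
    remaining = []
    for feat in features:
        if feat["type"] == "source":
            # "source_qualifiers" is a hypothetical key name
            annotations["source_qualifiers"] = dict(feat["qualifiers"])
        else:
            remaining.append(feat)
    return remaining


def annotations_to_source(annotations, seq_length):
    """On writing: rebuild a full-length 'source' feature, if possible.

    Returns None when the annotations hold no source information.
    """
    quals = annotations.get("source_qualifiers")
    if not quals:
        return None
    return {
        "type": "source",
        "start": 0,
        "end": seq_length,  # always spans the current sequence
        "qualifiers": dict(quals),
    }
```

Because the location is regenerated from the sequence length at write
time, slicing or shifting the record would never leave a stale "source"
location behind.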

In thinking about this, perhaps there is another less invasive change,
which I'm going to call Plan(C):

We expect (and could even enforce this assumption) there to be at
most one "source" feature in a GenBank/EMBL file, and that it should
span the full length of the sequence. Taking this as a special case, when
slicing a SeqRecord, we could also slice the "source" SeqFeature to
match the new reduced sequence. Furthermore, when adding two
SeqRecord objects, we would try to combine the two "source"
SeqFeatures - taking only common annotation information.
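
As a rough sketch of what Plan(C) might look like, assuming a simplified
stand-in Feature class (not the real SeqFeature) with a half-open
start/end location, and two hypothetical helper functions:

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Simplified stand-in for a SeqFeature (illustration only)."""
    type: str
    start: int
    end: int
    qualifiers: dict = field(default_factory=dict)


def slice_source(source, start, end):
    """Return a 'source' feature spanning the sliced sequence [start:end]."""
    return Feature("source", 0, end - start, dict(source.qualifiers))


def combine_sources(a, b):
    """Return a 'source' feature for two added records.

    Only qualifiers present with identical values in both are kept.
    """
    common = {
        key: value
        for key, value in a.qualifiers.items()
        if key in b.qualifiers and b.qualifiers[key] == value
    }
    length = (a.end - a.start) + (b.end - b.start)
    return Feature("source", 0, length, common)
```

The slice helper assumes the original "source" feature spans the whole
sequence, which is exactly the assumption described above.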

And I'll use Plan(A) for leaving things as they stand. Pros and cons:
* pro - no code changes at all
* con - "source" annotation remains a bit hidden
* con - still lose "source" features on slicing

Plan(B) pros and cons ("source" as top level annotation):
* pro - elegant handling of "source" annotation
* pro - no changes in SeqRecord
* con - special case code in GenBank/EMBL input/output
* con - may need special case code in BioSQL wrapper
* con - fairly big break to backwards compatibility (affecting
  any scripts accessing or creating "source" features),
  depending on how such a transition was made.

Plan(C) pros and cons (special "source" slicing/adding):
* con - "source" annotation remains a bit hidden
* con - special case code in SeqRecord
* pro - no changes in GenBank/EMBL input/output
* pro - no changes in BioSQL wrapper
* pro - minor break to backwards compatibility (affecting
  slicing of "source" features only - remember SeqRecord
  addition hasn't been released yet).

Any thoughts? I've probably missed some advantages and
disadvantages, and alternative ideas are welcome.

This new idea to just special case slicing/adding of the "source"
feature (Plan C) lacks the elegance of moving the "source"
annotation to the top level (Plan B). However, it is much less
invasive and looks quite practical and intuitive.

Peter