[Biopython] Additions to the SeqRecord

Peter biopython at maubp.freeserve.co.uk
Fri Nov 13 08:51:48 EST 2009


On Fri, Nov 13, 2009 at 1:23 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi Peter;
>
> [...Discussion on what to do with full length features and annotations
>  when slicing SeqRecords...]
>
> Good discussion. Agreed that copying may be confusing. One hybrid
> approach is to provide a function that makes copying them easy if
> someone does want to save the annotations, dbxrefs and full length
> feature sources:
>
> sliced = rec[:100]
> sliced.set_full_length_features(rec)
>
> where set_full_length_features copied over the annotations and
> dbxrefs, ala your code example:
>
> deletion_mutant.dbxrefs = record.dbxrefs[:]
> deletion_mutant.annotations = record.annotations.copy()
>
> and perhaps also added any whole-sequence features from the
> original SeqRecord. This would help with discoverability for people
> who do want to retain all of the source and other high level information
> when they slice.
>
> Brad

Hi Brad.

Interesting idea - but I'm not sure about that name (maybe something like
copy_annotation would be better?) and personally don't think it is actually
any clearer than the two lines:

deletion_mutant.dbxrefs = record.dbxrefs[:]
deletion_mutant.annotations = record.annotations.copy()

[In the meantime, we should add those lines to the relevant examples in
the docstring and Tutorial in the repository.]
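
For comparison, a helper along the lines Brad suggests might look like the
sketch below. The name copy_annotation is only the suggestion from this
thread, not an existing SeqRecord method, and the records here are plain
SimpleNamespace stand-ins with made-up values so the example runs without
Biopython installed. Any object with dbxrefs and annotations attributes
should behave the same way.

```python
from types import SimpleNamespace


def copy_annotation(source, target):
    """Copy record-level annotation from source onto target (hypothetical helper)."""
    # New list, so later appends to one record do not show up in the other.
    target.dbxrefs = source.dbxrefs[:]
    # Shallow copy of the dict: top-level keys are independent, nested values are shared.
    target.annotations = source.annotations.copy()
    return target


# Stand-ins for SeqRecord objects; the values are invented for illustration.
record = SimpleNamespace(
    dbxrefs=["Project:12345"],
    annotations={"organism": "Escherichia coli", "taxonomy": ["Bacteria"]},
)
deletion_mutant = SimpleNamespace(dbxrefs=[], annotations={})

copy_annotation(record, deletion_mutant)

# Changes to the copy leave the original record untouched.
deletion_mutant.dbxrefs.append("Example:1")
deletion_mutant.annotations["note"] = "deletion mutant"
```

Note that annotations.copy() is a shallow copy, so a nested value such as the
taxonomy list is still shared between the two records.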

Regarding the special case of the source feature in GenBank files, for
tasks like removing part of the record, or doing an origin shift, you may
want to recreate a new source feature reusing the old source feature
annotation (e.g. NCBI taxon ID). However, the location would have to
reflect the new modified sequence length.
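
As a rough sketch of that, assuming a minimal stand-in Feature class rather
than the real SeqFeature (the class, the rebuild_source helper and the
sequence are all hypothetical), the qualifiers are reused while the location
is recomputed from the edited sequence:

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Minimal stand-in for a feature: a type, a 0-based half-open span, qualifiers."""
    type: str
    start: int
    end: int
    qualifiers: dict = field(default_factory=dict)


def rebuild_source(old_source, new_length):
    """Return a new source feature spanning 0..new_length, reusing the old qualifiers."""
    return Feature("source", 0, new_length, dict(old_source.qualifiers))


sequence = "ACGTACGTACGTACGTACGT"  # 20 bases, made up for illustration
old_source = Feature("source", 0, len(sequence), {"db_xref": ["taxon:562"]})

# Remove bases 5..9 (0-based, end-exclusive), as in a deletion mutant.
edited = sequence[:5] + sequence[10:]
new_source = rebuild_source(old_source, len(edited))
```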

I have another idea to "solve" this problem:

I am actually tempted to remove the source SeqFeature, and instead
handle it via the annotations dict. To me this seems more natural than
having it as an entry in the feature table - a GenBank file format choice I
never really understood. My guess is they didn't want to introduce a record
level extensible annotation header block, which is what the source feature
could be regarded as handling.

i.e. When parsing a GenBank (or EMBL) file, the source feature information
could get stored in the SeqRecord annotations dictionary. When writing to
GenBank (or in future EMBL) format, if the annotations dictionary contained
relevant fields, we would generate a source feature for the full sequence.
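
To make the idea concrete, here is a minimal sketch of the round trip. It is
not how the current parser or writer works. The feature table is modelled as
plain (type, start, end, qualifiers) tuples, and the function names and the
"source_feature" annotations key are assumptions made up for this example.

```python
def source_to_annotations(features, annotations):
    """Parsing side: move 'source' qualifiers into the annotations dict.

    features is a list of (type, start, end, qualifiers) tuples, a hypothetical
    stand-in for a parsed feature table. Returns the non-source features.
    """
    remaining = []
    for ftype, start, end, qualifiers in features:
        if ftype == "source":
            # "source_feature" is an invented key, chosen only for this sketch.
            annotations.setdefault("source_feature", {}).update(qualifiers)
        else:
            remaining.append((ftype, start, end, qualifiers))
    return remaining


def annotations_to_source(annotations, seq_length):
    """Writing side: build a full-length source feature, or None if nothing is stored."""
    qualifiers = annotations.get("source_feature")
    if not qualifiers:
        return None
    return ("source", 0, seq_length, dict(qualifiers))


annotations = {}
parsed = [
    ("source", 0, 300, {"organism": "Escherichia coli", "db_xref": "taxon:562"}),
    ("CDS", 10, 250, {"gene": "exampleA"}),
]
features = source_to_annotations(parsed, annotations)

# After keeping only the first 100 bases, the source feature is regenerated
# at the new length from the annotations, not sliced from the old one.
regenerated = annotations_to_source(annotations, 100)
```

With this arrangement the location question goes away: the source feature is
always written to span whatever sequence the record holds at output time.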

Does that make sense? It requires looking at the source feature not as
a feature which happens to span the whole sequence, but as annotation
for the whole sequence (which happens to be in the GenBank features
table due to a historical choice or accident).

Peter


