[Biopython] Additions to the SeqRecord

Tue Nov 17 08:24:17 EST 2009

Hi Peter;

> > [...Discussion on what to do with full length features and annotations
> >  when slicing SeqRecords...]
>
> [...Proposal to have a function that does the copying...]
>
> Interesting idea - but I'm not sure about that name (maybe something like
> copy_annotation would be better?) and personally don't think it is actually
> any clearer than the two lines:
> 
> deletion_mutant.dbxrefs = record.dbxrefs[:]
> deletion_mutant.annotations = record.annotations.copy()

Yes, I am terrible at thinking up function names -- copy_annotation
is great. Here I'm not as worried about clarity as I am about
discoverability. It's another way for people to realize that the
annotations were not copied.
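
As a rough sketch of what I have in mind: copy_annotation is only the
name proposed above, not an existing Biopython function, and
SimpleNamespace stands in for a SeqRecord here so the example runs
without Biopython installed. The only assumption is that both objects
have a dbxrefs list and an annotations dict, as a SeqRecord does.

```python
from types import SimpleNamespace


def copy_annotation(source, target):
    """Copy record-level annotation from source onto target.

    Hypothetical helper: it wraps the two-line idiom quoted above so the
    step is easier to discover.
    """
    # Shallow copies, so later appends or key assignments on the target
    # do not leak back into the source record.
    target.dbxrefs = source.dbxrefs[:]
    target.annotations = source.annotations.copy()
    return target


# Stand-ins for SeqRecord objects (assumption: only these two attributes matter).
record = SimpleNamespace(
    dbxrefs=["taxon:83332"],
    annotations={"organism": "Mycobacterium tuberculosis H37Rv"},
)
# A sliced record starts out with empty dbxrefs and annotations.
deletion_mutant = SimpleNamespace(dbxrefs=[], annotations={})

copy_annotation(record, deletion_mutant)
deletion_mutant.annotations["note"] = "deletion mutant"
```

After the call, deletion_mutant carries its own copies, so adding the
"note" key does not touch record.annotations.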

> Regarding the special case of the source feature in GenBank files, for
> tasks like removing part of the record, or doing an origin shift, you may
> want to recreate a new source feature reusing the old source feature
> annotation (e.g. NCBI taxon ID). However, the location would have to
> reflect the new modified sequence length.
> 
> I have another idea to "solve" this problem:
> 
> I am actually tempted to remove the source SeqFeature, and instead
> handle it via the annotations dict. To me this seems more natural than
> having it as an entry in the feature table - a GenBank file format choice I
> never really understood. My guess is they didn't want to introduce a record
> level extensible annotation header block, which is what the source feature
> could be regarded as handling.
> 
> i.e. When parsing a GenBank (or EMBL) file, the source feature information
> could get stored in the SeqRecord annotations dictionary. When writing to
> GenBank (or in future EMBL) format, if the annotations dictionary contained
> relevant fields, we would generate a source feature for the full sequence.
> 
> Does that make sense? It requires looking at the source feature not as
> a feature which happens to span the whole sequence, but as annotation
> for the whole sequence (which happens to be in the GenBank features
> table due to a historical choice or accident).

I like that. You're right that those full length features are really 
annotations in disguise. Instead of removing the source SeqFeature,
I would suggest making it available in both places. This way you
mimic what GenBank is doing, but also make it available in a more
accessible and natural place. So for something like:

     source          1..4411532
                     /organism="Mycobacterium tuberculosis H37Rv"
                     /mol_type="genomic DNA"
                     /strain="H37Rv"
                     /db_xref="taxon:83332"

you would have the source SeqFeature, but also the organism,
mol_type and strain in the annotations dictionary, and the cross
reference in dbxrefs. Nice idea.
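
A minimal sketch of the "both places" idea, for illustration only. The
function name promote_source_qualifiers is made up, and SimpleNamespace
objects stand in for the SeqRecord and SeqFeature so this runs without
Biopython. It assumes qualifier values are lists of strings, which is
how the GenBank parser stores them.

```python
from types import SimpleNamespace


def promote_source_qualifiers(record):
    """Mirror source feature qualifiers into annotations and dbxrefs.

    Hypothetical helper: the source feature itself is left in place.
    """
    for feature in record.features:
        if feature.type != "source":
            continue
        for key, values in feature.qualifiers.items():
            if key == "db_xref":
                # Cross references go to the record-level dbxrefs list.
                for value in values:
                    if value not in record.dbxrefs:
                        record.dbxrefs.append(value)
            else:
                # Unwrap single values; never overwrite existing annotation.
                value = values[0] if len(values) == 1 else list(values)
                record.annotations.setdefault(key, value)
    return record


# Stand-in for the source feature shown above.
source = SimpleNamespace(
    type="source",
    qualifiers={
        "organism": ["Mycobacterium tuberculosis H37Rv"],
        "mol_type": ["genomic DNA"],
        "strain": ["H37Rv"],
        "db_xref": ["taxon:83332"],
    },
)
record = SimpleNamespace(features=[source], annotations={}, dbxrefs=[])
promote_source_qualifiers(record)
```

Using setdefault means anything the parser already placed in the
annotations dictionary wins over the source qualifiers; whether that is
the right precedence is an open design choice.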

Brad