[BioPython] I don't understand why SeqRecord.feature is a list

Peter biopython at maubp.freeserve.co.uk
Tue Jun 12 10:32:45 EDT 2007


Marc Colosimo wrote:
> Additionally, for many formats you can have multiple features with 
> the same name; e.g., CDS, gene, etc... in GenBank Records.

Indeed - and as the SeqRecord/SeqFeature is most heavily used by the
GenBank parser, that does explain things well.

The problem with using a dictionary is what to index on - you can't
simply use the location string, for example, as there are usually
entries for gene and CDS features with the same location.

You can't depend on any other information like an identifier or name to 
be present in a GenBank file for all feature types.

In general, the choice of index will depend on what you want to use it 
for - so the flippant answer is just index it yourself, for example like 
this:

http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features
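
As a rough sketch of what "index it yourself" could look like, here is
one way to build a dictionary keyed on a qualifier value. The Feature
tuple is only a stand-in for a SeqFeature (it just has type, location
and qualifiers attributes), and the index_features name, the locus_tag
key and the example values are made up for illustration. Since several
features can share a key, each dictionary entry is itself a list.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a SeqFeature: only the attributes used here.
Feature = namedtuple("Feature", ["type", "location", "qualifiers"])


def index_features(features, qualifier_key, feature_type=None):
    """Map each value of one qualifier to the list of features carrying it."""
    index = defaultdict(list)
    for feature in features:
        # Optionally restrict the index to one feature type, e.g. "CDS".
        if feature_type is not None and feature.type != feature_type:
            continue
        # Qualifier values are lists; features lacking the key are skipped.
        for value in feature.qualifiers.get(qualifier_key, []):
            index[value].append(feature)
    return dict(index)


# Made-up features: a gene and its CDS share a location and a locus_tag.
features = [
    Feature("gene", "100..400", {"locus_tag": ["ABC_0001"]}),
    Feature("CDS", "100..400",
            {"locus_tag": ["ABC_0001"], "db_xref": ["GI:1", "GeneID:2"]}),
    Feature("misc_feature", "500..600", {}),
]

by_locus_tag = index_features(features, "locus_tag")
cds_by_xref = index_features(features, "db_xref", feature_type="CDS")
```

Here by_locus_tag["ABC_0001"] holds both the gene and the CDS, which is
exactly why a plain one-feature-per-key dictionary would not work.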

> The same rationale doesn't fully apply to why the feature qualifiers
> are dictionaries of lists.

No, it doesn't. The rationale seems to have been that feature qualifiers
in GenBank files can occur with no value (e.g. /pseudo and others), a
single value (e.g. translation) or multiple values (by repeated keys,
e.g. database cross references).  So using a list is a simple solution
to cover all these cases - even if most qualifiers only have a single
value.  (There are some old posts on the mailing list archive discussing
this.)
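
To make the three cases concrete, here is a small sketch that collects
(key, value) pairs into a dictionary of lists. It is not the GenBank
parser's code: collect_qualifiers is a made-up helper, the values are
invented, and storing a valueless qualifier as an empty string is an
assumption made purely for this example.

```python
def collect_qualifiers(pairs):
    """Group (key, value) pairs into a dict of lists, keeping repeated keys."""
    qualifiers = {}
    for key, value in pairs:
        # Assumption: a valueless qualifier is recorded as an empty string.
        qualifiers.setdefault(key, []).append("" if value is None else value)
    return qualifiers


# Made-up qualifiers covering no value, one value and repeated keys.
pairs = [
    ("pseudo", None),
    ("translation", "MKV"),
    ("db_xref", "GI:1"),
    ("db_xref", "GeneID:2"),
]
qualifiers = collect_qualifiers(pairs)

is_pseudo = "pseudo" in qualifiers            # flag-style qualifier
translation = qualifiers["translation"][0]    # single value: first element
xrefs = qualifiers["db_xref"]                 # repeated key: every value
```

The price of the uniform structure is the [0] needed for single-valued
qualifiers; the benefit is that code never has to check whether it was
handed a string or a list.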

Peter


