[BioPython] I don't understand why SeqRecord.feature is a list
Peter
biopython at maubp.freeserve.co.uk
Tue Jun 12 14:32:45 UTC 2007
Marc Colosimo wrote:
> Additionally, for many formats you can have multiple features with
> the same name; e.g., CDS, gene, etc... in GenBank Records.
Indeed - and as the SeqRecord/SeqFeature is most heavily used by the
GenBank parser, that does explain things well.
The problem with using a dictionary is what to index on - you can't
simply use the location string, for example, as there are usually
entries for both gene and CDS features with the same location.
You can't depend on any other information like an identifier or name to
be present in a GenBank file for all feature types.
In general, the choice of index will depend on what you want to use it
for - so the flippant answer is just index it yourself, for example like
this:
http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features
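As a rough illustration of indexing it yourself, here is a minimal
sketch that maps each value of a chosen qualifier to the positions of
the features carrying it. The Feature class is a hypothetical stand-in
(not the real SeqFeature), the index_features helper and the locus_tag
values are made up for the example, and I'm assuming qualifiers are
held as a dictionary of lists:

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical stand-in for a SeqFeature: type, location, qualifiers."""
    type: str
    location: str
    qualifiers: dict = field(default_factory=dict)


def index_features(features, key):
    """Map each value of qualifier `key` to the list positions of the
    features that carry it (hypothetical helper, not a Biopython function)."""
    index = defaultdict(list)
    for position, feature in enumerate(features):
        # Features without this qualifier are simply skipped.
        for value in feature.qualifiers.get(key, []):
            index[value].append(position)
    return dict(index)


# Made-up example data: a gene and a CDS sharing one location and locus tag,
# plus a feature with no qualifiers at all.
features = [
    Feature("gene", "1..300", {"locus_tag": ["ABC_0001"]}),
    Feature("CDS", "1..300", {"locus_tag": ["ABC_0001"],
                              "db_xref": ["GI:1", "GeneID:2"]}),
    Feature("misc_feature", "350..400"),
]

locus_index = index_features(features, "locus_tag")
```

Each index entry holds a list of positions rather than a single one,
because the gene and its CDS share both the location and the locus
tag. The feature list itself stays untouched, so you can build as many
indexes as you need.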
> The same rationale doesn't fully apply to why the feature qualifiers
> are dictionaries of lists.
No, it doesn't. The rationale seems to have been that feature qualifiers
in GenBank files can occur with no value (e.g. /pseudo and others), a
single value (e.g. translation) or multiple values (by repeated keys,
e.g. database cross references). So using a list is a simple solution
to cover all these cases - even if most qualifiers only have a single
value. (There are some old posts on the mailing list archive discussing
this.)
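
To make the three cases concrete, here is a small sketch using a plain
dictionary of lists. The qualifier values are invented, the first_value
helper is hypothetical, and storing a valueless qualifier as a list
holding one empty string is my assumption about the representation:

```python
# Made-up qualifiers covering the three cases described above.
qualifiers = {
    "pseudo": [""],                      # no value (assumed representation)
    "translation": ["MKVLA"],            # a single value
    "db_xref": ["GI:1", "GeneID:2"],     # a repeated key, so several values
}


def first_value(qualifiers, key, default=None):
    """Return the first value stored for `key`, or `default` if the key
    is absent (hypothetical convenience helper)."""
    values = qualifiers.get(key, [])
    return values[0] if values else default


# Every value is a list, so the same loop handles all three cases.
value_counts = {key: len(values) for key, values in qualifiers.items()}
```

The price is writing qualifiers["translation"][0] or using a helper
like first_value for the common single-value case.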
Peter