[BioPython] I don't understand why SeqRecord.feature is a list

Giovanni Marco Dall'Olio dalloliogm at gmail.com
Tue Jul 3 15:24:05 UTC 2007


2007/6/28, Peter <biopython at maubp.freeserve.co.uk>:
> Giovanni Marco Dall'Olio wrote:
> > Hi!
> > In principle, when I can't decide which keys to use for a dictionary,
> > I just take simple numerical integers as keys, and it works quite
> > well.
> > It simplifies testing/debugging/organization a lot and I can decide
> > the meaning of every key later (so it's better for dictionaries which
> > have to contain very heterogeneous data).
>
> It sounds like you don't need/want a dictionary at all.  If you are
> assigning increasing numerical integers "keys", then why not just use
> the list of features directly?

That is not true: with a list is more complicated to add/remove
elements if not in the last position. For instance, if I remove the
first element in a list, all the other elements shift a position and I
risk losing all the references I've made to them.
Also, I don't need the sort/reverse and other operators. Moreover the
cost of the insert operation into a dictionary in python is of the
order of O(1) for every position, while for lists is not constant if
not in the last position (sorry, I can't find a reference for this).
In other languages it could seem strange to use a list to store data,
because traditionally the cost of retrieving an element from a list is
on the order of O(n) (this is not the case of python).


Let's have a look at your example:
- we have a list of features like this:
list_features = ['GTAAGT', 'TACTAAC', 'TGT']

- then we specify the meaning of these features in another dictionary:
splicesignal5 = list_features[0]
polypirimidinetract = list_features[1]
splicesignal3 = list_features[2]

python passes the variables by value: this means that if you change
one of the values in the list_features list, then you have to update
all the variables which refer to it manually.

>>> list_features = ['GTAAGT', 'TACTAAC', 'TGT']
>>> splicesignal5 = list_features[0]
>>> print splicesignal5
'GTAAGT'
>>> list_features[0] = 'TTTTTTT'
>>> print splicesignal5
'GTAAGT'     # wrong!
>>> splicesignal5 = list_features[0]    # have to update all the
variables which refer to list_features manually
>>> print splicesignal5'
'TTTTTTT'

This is why I prefer to save the positions of the features instead of
their values:
>>> list_features = ['GTAAGT', 'TACTAAC', 'TGT']
>>> dict_aliases = {'splicesignal5': [0], 'polypirimidinetract' : [1],
'splicesignal3': [2]}
>>> def get_feature(feature_name): return
list_features[dict_aliases[feature_name]] # (this code doesn't work)


However, I think it's better to save the features in a dictionary
instead of a list, for the reasons I was explaining before, in this
way:

>>> dict_features = {0: 'GTAAGT', 1: 'TACTAAC', 2: 'TGT'}     #
features are in a dictionary instead of a list
>>> dict_aliases = {'splicesignal5': [0], 'polypirimidinetract' : [1],
'splicesignal3': [2]}
>>> def get_feature(feature_name):
             return(map (dict_features.get, x for x in
dict_aliases[feature_name]))

Another option could be to use references to memory positions instead
of dictionary keys, but I don't know how to implement this in python,
and I'm not sure it would be computationally convenient.



> e.g. assuming record is a SeqRecord object:
>
> first_feature = record.features[0]
> second_feature = record.features[1]
> third_feature = record.features[2]
> etc
>
> > I'm not sure I have understood the example you gave me on
> > http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features
> > , but it seems to work in a way similar to what I was saying before:
> > it saves all the features in a list (or is it a dictionary?) and
> > access them later by their positions.
>
> That example stored integers (indices in the features list) in a
> dictionary using either the Locus tag, GI numbers or GeneID (e.g. keys
> like "NEQ010", "GI:41614806" or "GeneID:2654552").
>
> The point being if you know in advance you want to find individual
> feature on the basis of their locus tag (for example), rather than the
> order in the file, then I would map the locus tag strings to positions
> in the list.
>
> e.g.
>
> locus_tag_cds_index = \ index_genbank_features(gb_record,"CDS","locus_tag")
> my_feature = gb_record.features[locus_tag_index["NEQ010"]]


uh ok.. but how is the gb_record.features dictionary structured? Which
keys does it have? And what happens to these dictionaries (let's say,
locus_tag_cds_index), when a feature  from gb_record.features is
deleted or modified?



> You could also build a dictionary which maps from the locus tag directly
> to the associated SeqFeature objects themselves.
>
> > Not to be silly but... how do you represent a gene with its
> > transcripts/exons/introns structure with biopython? With SeqRecord and
> > SeqFeature objects?
>
> If you loaded a GenBank or EMBL file using SeqIO you get one SeqRecord
> object (assuming there is only one LOCUS line in the file) which
> contains a list of SeqFeature objects which in turn may contain
> sub-features.
>
> I work with bacteria so I don't have much experience with dealing with
> sub-features in a SeqFeature object.


I've never worked with SeqFeature and GenBank files (I have to work
with GFF/GTF for annotations), but I will try to see how does it
works.
Thank you very much for these replies! I was really hoping to have
this kind of feedback.
Cheers! :)


>
> Peter
>
>


-- 
-----------------------------------------------------------

My Blog on Bioinformatics (italian): http://dalloliogm.wordpress.com



More information about the Biopython mailing list