[Biopython-dev] SeqFeature's FeatureLocation for GenBank
Marc Colosimo
mcolosimo at mitre.org
Thu Nov 3 08:45:35 EST 2005
Peter wrote:
> Marc Colosimo wrote:
>
>> I want to point out the very bizarre behavior of FeatureLocations
>> when using GenBank.FeatureParser (well to me anyways).
>
>
> Its by design...
>
>> When I was testing out some code, I noticed that the start positions
>> were 1 less that in the GenBank Record, but the end positions were
>> correct. My first thought was that this must be a bug and such went
>> looking for it. I soon gave up because I just don't have the time to
>> understand all the code that is involved (I was going to file a bug
>> report). So, I just added 1 to the start positions and went on to get
>> the features from the DNA. Suddenly I now understand why the
>> positions were like that: slicing!
>
>
> Exactly, e.g. something like:
>
> seq[feature.location.start.position:feature.location.end.position]
>
>> Unless I missed something, I didn't see anything talking about this
>> behavior.
>
>
> Python (like C) starts counting at zero, and this behaviour is
> deliberate to make handling of the BioPython sequence objects as easy
> as possible. Why - because the biopython DNA/RNA/Proteins sequences
> are as much like Python strings as possible.
>
> For example, to extract letters the 5 to 7 from "abcdefghijk" (using
> one based counting, i.e. "efg") in Python you say "abcdefghijk"[4:7]
>
> Suppose your gene is bases 150..300 (using one based counting as in a
> GenBank file).
>
> To extract this from the full DNA sequence, you would use something
> like: fullsequence[149:300]
>
> I suppose the CookBook may have assumed people were familiar with
> Python strings already...
>
> > Is this consistent with other parsers? If so, I would suggest
>
>> that this is included in the Cookbook ...
>
>
Thank you for the response. However, I know how lists work in Python
(and C, and Java, etc...). That was not question. Here is some code to
show you what I mean about the inconsistent behavior of Locations.
from Bio import GenBank
gi_list = GenBank.search_for("AB077698")
ncbi_dict = GenBank.NCBIDictionary( 'nucleotide', 'genbank', parser =
GenBank.FeatureParser() )
seq_rec = ncbi_dict[gi_list[0]]
print len(seq_rec.seq) # returns 2701, which is correct
# now lets look at a feature location
source_feature = seq_rec.features[0]
print source_feature.type # should be 'source'
print source_feature.location # (0..2701), in the gb record it was
(1..2701). The start is correct, the end is NOT
# get a slice
seq_rec.seq[source_feature.location.start.position :
source_feature.location.end.position]
# returns the correct thing
# now lets see what the first nt looks like
seq_rec.seq[source_feature.location.start.position] # works fine
# now lets see what the last nt looks like
seq_rec.seq[source_feature.location.end.position]
IndexError: string index out of range
# The correct answer for is...
seq_rec.seq[source_feature.location.end.position - 1] # now, this is
different from how start position works!
# but wait there is more...
# What if I didn't know about the funny end position business and wrote
this,
seq_rec.seq[source_feature.location.start.position:
source_feature.location.end.position + 1]
# This works, but it is not correct because it has added a nt from the
beginning to the end (slices are nice about that)
# If I were to use this on the other internal features I would get the
wrong thing (by one nt)
So, either location End should be 2700, Start should be 1, or state
'explicitly' what Locations positions represent. But not 0..2701.
Changing the end position probably would mess up lots of code. So that
leaves documentation. You can add my code above to the cookbook
<http://biopython.org/docs/tutorial/Tutorial004.html#toc16>.
Marc
More information about the Biopython-dev
mailing list