[Biopython-dev] SeqFeature's FeatureLocation for GenBank

Thu Nov 3 09:17:47 EST 2005

Marc Colosimo wrote:
> Thank you for the response. However,  I know how lists work in Python 
> (and C, and Java, etc...). That was not question. Here is some code to 
> show you what I mean about the inconsistent behavior of Locations.
> 
> from Bio import GenBank
> gi_list = GenBank.search_for("AB077698")
> ncbi_dict = GenBank.NCBIDictionary( 'nucleotide', 'genbank', parser = 
> GenBank.FeatureParser() )
> seq_rec = ncbi_dict[gi_list[0]]
> print len(seq_rec.seq)               # returns 2701, which is correct
> # now lets look at a feature location
> source_feature = seq_rec.features[0]   print 
> source_feature.type           # should be 'source'
> print source_feature.location      # (0..2701), in the gb record it was 
> (1..2701). The start is correct, the end is NOT

The start and end ARE correct in that seq_rec.seq[0:2701] will return 
all of the sequence.

The first nucleotide is seq_rec.seq[0]
The last nucleotide is seq_rec.seq[2700]
The length is 2701

It makes more sense in the case of (sub)features, rather than the source 
'feature' which is everything.

In the same way, a string of length 5, e.g. "abcde"
"abcde"[0] == "a"
"abcde"[4] == "e"
"abcde"[0:5] == "abcde"
"abcde"[5] is out of range.

 From memory, the location object is actually rather more complicated 
because it copes with nasty locations like 123..<150 plus joins etc, as 
well as the simple cases like 123..150

It took me a while to get my head round the location object too.

> # get a slice
> seq_rec.seq[source_feature.location.start.position : 
> source_feature.location.end.position]
> # returns the correct thing

> # now lets see what the first nt looks like
> seq_rec.seq[source_feature.location.start.position]   # works fine

> # now lets see what the last nt looks like
> seq_rec.seq[source_feature.location.end.position]
> IndexError: string index out of range

This is correct.  See my example with a string "abcde"[5]

> # The correct answer for is...
> seq_rec.seq[source_feature.location.end.position - 1]   # now, this is 
> different from how start position works!

It has to be different for the splicing trick.  Again, its the "fault" 
of trying to be the same as python strings.

> # but wait there is more...
> # What if I didn't know about the funny end position business and wrote 
> this,
> seq_rec.seq[source_feature.location.start.position: 
> source_feature.location.end.position + 1]
> # This works, but it is not correct because it has added a nt from the 
> beginning to the end (slices are nice about that)
> # If I were to use this on the other internal features I would get the 
> wrong thing (by one nt)
> 
> So, either location End should be 2700,  Start should be 1, or state 
> 'explicitly' what Locations positions represent. But not 0..2701. 

I personally am happy with having start 0, end 2701 for a genbank 
location of 1..2701 and this it is logical.

However, the documentation could be improved.

> Changing the end position probably would mess up lots of code. So that 
> leaves documentation. You can add my code above to the cookbook 
> <http://biopython.org/docs/tutorial/Tutorial004.html#toc16>.

Maybe an extra sub section, before the current "3.7.2.2  Locations" for 
the location simple case?  i.e. No joins, no fuzzy locations.  Just some 
very simple examples...

Peter