[Biopython-dev] SeqFeature's FeatureLocation for GenBank

Peter biopython-dev at maubp.freeserve.co.uk
Thu Nov 3 09:17:47 EST 2005


Marc Colosimo wrote:
> Thank you for the response. However,  I know how lists work in Python 
> (and C, and Java, etc...). That was not question. Here is some code to 
> show you what I mean about the inconsistent behavior of Locations.
> 
> from Bio import GenBank
> gi_list = GenBank.search_for("AB077698")
> ncbi_dict = GenBank.NCBIDictionary( 'nucleotide', 'genbank', parser = 
> GenBank.FeatureParser() )
> seq_rec = ncbi_dict[gi_list[0]]
> print len(seq_rec.seq)               # returns 2701, which is correct
> # now lets look at a feature location
> source_feature = seq_rec.features[0]   print 
> source_feature.type           # should be 'source'
> print source_feature.location      # (0..2701), in the gb record it was 
> (1..2701). The start is correct, the end is NOT

The start and end ARE correct in that seq_rec.seq[0:2701] will return 
all of the sequence.

The first nucleotide is seq_rec.seq[0]
The last nucleotide is seq_rec.seq[2700]
The length is 2701

It makes more sense in the case of (sub)features, rather than the source 
'feature' which is everything.

In the same way, a string of length 5, e.g. "abcde"
"abcde"[0] == "a"
"abcde"[4] == "e"
"abcde"[0:5] == "abcde"
"abcde"[5] is out of range.

 From memory, the location object is actually rather more complicated 
because it copes with nasty locations like 123..<150 plus joins etc, as 
well as the simple cases like 123..150

It took me a while to get my head round the location object too.

> # get a slice
> seq_rec.seq[source_feature.location.start.position : 
> source_feature.location.end.position]
> # returns the correct thing

> # now lets see what the first nt looks like
> seq_rec.seq[source_feature.location.start.position]   # works fine

> # now lets see what the last nt looks like
> seq_rec.seq[source_feature.location.end.position]
> IndexError: string index out of range

This is correct.  See my example with a string "abcde"[5]

> # The correct answer for is...
> seq_rec.seq[source_feature.location.end.position - 1]   # now, this is 
> different from how start position works!

It has to be different for the splicing trick.  Again, its the "fault" 
of trying to be the same as python strings.

> # but wait there is more...
> # What if I didn't know about the funny end position business and wrote 
> this,
> seq_rec.seq[source_feature.location.start.position: 
> source_feature.location.end.position + 1]
> # This works, but it is not correct because it has added a nt from the 
> beginning to the end (slices are nice about that)
> # If I were to use this on the other internal features I would get the 
> wrong thing (by one nt)
> 
> So, either location End should be 2700,  Start should be 1, or state 
> 'explicitly' what Locations positions represent. But not 0..2701. 

I personally am happy with having start 0, end 2701 for a genbank 
location of 1..2701 and this it is logical.

However, the documentation could be improved.

> Changing the end position probably would mess up lots of code. So that 
> leaves documentation. You can add my code above to the cookbook 
> <http://biopython.org/docs/tutorial/Tutorial004.html#toc16>.

Maybe an extra sub section, before the current "3.7.2.2  Locations" for 
the location simple case?  i.e. No joins, no fuzzy locations.  Just some 
very simple examples...

Peter


More information about the Biopython-dev mailing list