[Biopython-dev] SeqFeature's FeatureLocation for GenBank

Marc Colosimo mcolosimo at mitre.org
Thu Nov 3 13:35:34 EST 2005


Okay, I think we are arguing in a circle here and this could go on for 
a long time.

Since we both have in the past had to get our heads around this 
behavior (which to me indicates that the behavior is not intuitive), I 
suggest that we up date the Cookbook. I think my code is a good example 
(I can even change it to include CDS instead of the source feature). 
Also, something like the following,

Under 3.7.2.1:

location
	- The location of the SeqFeature on the sequence that you are dealing 
with. The locations are designed to return the sequence when slicing. 
Thus, the end position is ONE more than the actual end position in the 
sequence.

Under  3.7.2.2:

<insert code snippet> for ExactPositions.

Marc

On Nov 3, 2005, at 9:17 AM, Peter wrote:

> Marc Colosimo wrote:
>> Thank you for the response. However,  I know how lists work in Python 
>> (and C, and Java, etc...). That was not question. Here is some code 
>> to show you what I mean about the inconsistent behavior of Locations.
>> from Bio import GenBank
>> gi_list = GenBank.search_for("AB077698")
>> ncbi_dict = GenBank.NCBIDictionary( 'nucleotide', 'genbank', parser = 
>> GenBank.FeatureParser() )
>> seq_rec = ncbi_dict[gi_list[0]]
>> print len(seq_rec.seq)               # returns 2701, which is correct
>> # now lets look at a feature location
>> source_feature = seq_rec.features[0]   print source_feature.type      
>>      # should be 'source'
>> print source_feature.location      # (0..2701), in the gb record it 
>> was (1..2701). The start is correct, the end is NOT
>
> The start and end ARE correct in that seq_rec.seq[0:2701] will return 
> all of the sequence.
>
> The first nucleotide is seq_rec.seq[0]
> The last nucleotide is seq_rec.seq[2700]
> The length is 2701
>
> It makes more sense in the case of (sub)features, rather than the 
> source 'feature' which is everything.
>
> In the same way, a string of length 5, e.g. "abcde"
> "abcde"[0] == "a"
> "abcde"[4] == "e"
> "abcde"[0:5] == "abcde"
> "abcde"[5] is out of range.
>
> From memory, the location object is actually rather more complicated 
> because it copes with nasty locations like 123..<150 plus joins etc, 
> as well as the simple cases like 123..150
>
> It took me a while to get my head round the location object too.
>
>> # get a slice
>> seq_rec.seq[source_feature.location.start.position : 
>> source_feature.location.end.position]
>> # returns the correct thing
>
>> # now lets see what the first nt looks like
>> seq_rec.seq[source_feature.location.start.position]   # works fine
>
>> # now lets see what the last nt looks like
>> seq_rec.seq[source_feature.location.end.position]
>> IndexError: string index out of range
>
> This is correct.  See my example with a string "abcde"[5]
>
>> # The correct answer for is...
>> seq_rec.seq[source_feature.location.end.position - 1]   # now, this 
>> is different from how start position works!
>
> It has to be different for the splicing trick.  Again, its the "fault" 
> of trying to be the same as python strings.

  I don't see the connection between Location and Sequences behaving 
like python strings. The first part is the key to my confusion 
("splicing trick")

>
>> # but wait there is more...
>> # What if I didn't know about the funny end position business and 
>> wrote this,
>> seq_rec.seq[source_feature.location.start.position: 
>> source_feature.location.end.position + 1]
>> # This works, but it is not correct because it has added a nt from 
>> the beginning to the end (slices are nice about that)
>> # If I were to use this on the other internal features I would get 
>> the wrong thing (by one nt)
>> So, either location End should be 2700,  Start should be 1, or state 
>> 'explicitly' what Locations positions represent. But not 0..2701.
>
> I personally am happy with having start 0, end 2701 for a genbank 
> location of 1..2701 and this it is logical.
>
> However, the documentation could be improved.
>
>> Changing the end position probably would mess up lots of code. So 
>> that leaves documentation. You can add my code above to the cookbook 
>> <http://biopython.org/docs/tutorial/Tutorial004.html#toc16>.
>
> Maybe an extra sub section, before the current "3.7.2.2  Locations" 
> for the location simple case?  i.e. No joins, no fuzzy locations.  
> Just some very simple examples...
>
> Peter



More information about the Biopython-dev mailing list