[Biopython-dev] SeqFeature's FeatureLocation for GenBank

Marc Colosimo mcolosimo at mitre.org
Fri Nov 4 10:20:15 EST 2005


Michiel,

That would probably help. Along with some additions to the cookbook.

Marc

Michiel Jan Laurens de Hoon wrote:

> I think the confusion is coming from the way FeatureLocation prints 
> itself. (0..2701) looks too much like a GenBank-style location. If 
> FeatureLocation were to print (0:2701) instead, it's pretty clear that 
> this is a Python-style slice. One solution might be to let 
> FeatureLocation inherit from list, and override as needed for abstract 
> positions.
>
> --Michiel.
>
> Peter wrote:
>
>> Marc Colosimo wrote:
>>
>>> Thank you for the response. However,  I know how lists work in 
>>> Python (and C, and Java, etc...). That was not question. Here is 
>>> some code to show you what I mean about the inconsistent behavior of 
>>> Locations.
>>>
>>> from Bio import GenBank
>>> gi_list = GenBank.search_for("AB077698")
>>> ncbi_dict = GenBank.NCBIDictionary( 'nucleotide', 'genbank', parser 
>>> = GenBank.FeatureParser() )
>>> seq_rec = ncbi_dict[gi_list[0]]
>>> print len(seq_rec.seq)               # returns 2701, which is correct
>>> # now lets look at a feature location
>>> source_feature = seq_rec.features[0]   print 
>>> source_feature.type           # should be 'source'
>>> print source_feature.location      # (0..2701), in the gb record it 
>>> was (1..2701). The start is correct, the end is NOT
>>
>>
>>
>> The start and end ARE correct in that seq_rec.seq[0:2701] will return 
>> all of the sequence.
>>
>> The first nucleotide is seq_rec.seq[0]
>> The last nucleotide is seq_rec.seq[2700]
>> The length is 2701
>>
>> It makes more sense in the case of (sub)features, rather than the 
>> source 'feature' which is everything.
>>
>> In the same way, a string of length 5, e.g. "abcde"
>> "abcde"[0] == "a"
>> "abcde"[4] == "e"
>> "abcde"[0:5] == "abcde"
>> "abcde"[5] is out of range.
>>
>> From memory, the location object is actually rather more complicated 
>> because it copes with nasty locations like 123..<150 plus joins etc, 
>> as well as the simple cases like 123..150
>>
>> It took me a while to get my head round the location object too.
>>
>>> # get a slice
>>> seq_rec.seq[source_feature.location.start.position : 
>>> source_feature.location.end.position]
>>> # returns the correct thing
>>
>>
>>
>>> # now lets see what the first nt looks like
>>> seq_rec.seq[source_feature.location.start.position]   # works fine
>>
>>
>>
>>> # now lets see what the last nt looks like
>>> seq_rec.seq[source_feature.location.end.position]
>>> IndexError: string index out of range
>>
>>
>>
>> This is correct.  See my example with a string "abcde"[5]
>>
>>> # The correct answer for is...
>>> seq_rec.seq[source_feature.location.end.position - 1]   # now, this 
>>> is different from how start position works!
>>
>>
>>
>> It has to be different for the splicing trick.  Again, its the 
>> "fault" of trying to be the same as python strings.
>>
>>> # but wait there is more...
>>> # What if I didn't know about the funny end position business and 
>>> wrote this,
>>> seq_rec.seq[source_feature.location.start.position: 
>>> source_feature.location.end.position + 1]
>>> # This works, but it is not correct because it has added a nt from 
>>> the beginning to the end (slices are nice about that)
>>> # If I were to use this on the other internal features I would get 
>>> the wrong thing (by one nt)
>>>
>>> So, either location End should be 2700,  Start should be 1, or state 
>>> 'explicitly' what Locations positions represent. But not 0..2701. 
>>
>>
>>
>> I personally am happy with having start 0, end 2701 for a genbank 
>> location of 1..2701 and this it is logical.
>>
>> However, the documentation could be improved.
>>
>>> Changing the end position probably would mess up lots of code. So 
>>> that leaves documentation. You can add my code above to the cookbook 
>>> <http://biopython.org/docs/tutorial/Tutorial004.html#toc16>.
>>
>>
>>
>> Maybe an extra sub section, before the current "3.7.2.2  Locations" 
>> for the location simple case?  i.e. No joins, no fuzzy locations.  
>> Just some very simple examples...
>>
>> Peter
>



More information about the Biopython-dev mailing list