[Biopython-dev] SeqFeature's FeatureLocation for GenBank

Michiel Jan Laurens de Hoon mdehoon at c2b2.columbia.edu
Thu Nov 3 15:05:20 EST 2005


I think the confusion is coming from the way FeatureLocation prints 
itself. (0..2701) looks too much like a GenBank-style location. If 
FeatureLocation were to print (0:2701) instead, it's pretty clear that 
this is a Python-style slice. One solution might be to let 
FeatureLocation inherit from list, and override as needed for abstract 
positions.

--Michiel.

Peter wrote:

> Marc Colosimo wrote:
>
>> Thank you for the response. However,  I know how lists work in Python 
>> (and C, and Java, etc...). That was not question. Here is some code 
>> to show you what I mean about the inconsistent behavior of Locations.
>>
>> from Bio import GenBank
>> gi_list = GenBank.search_for("AB077698")
>> ncbi_dict = GenBank.NCBIDictionary( 'nucleotide', 'genbank', parser = 
>> GenBank.FeatureParser() )
>> seq_rec = ncbi_dict[gi_list[0]]
>> print len(seq_rec.seq)               # returns 2701, which is correct
>> # now lets look at a feature location
>> source_feature = seq_rec.features[0]   print 
>> source_feature.type           # should be 'source'
>> print source_feature.location      # (0..2701), in the gb record it 
>> was (1..2701). The start is correct, the end is NOT
>
>
> The start and end ARE correct in that seq_rec.seq[0:2701] will return 
> all of the sequence.
>
> The first nucleotide is seq_rec.seq[0]
> The last nucleotide is seq_rec.seq[2700]
> The length is 2701
>
> It makes more sense in the case of (sub)features, rather than the 
> source 'feature' which is everything.
>
> In the same way, a string of length 5, e.g. "abcde"
> "abcde"[0] == "a"
> "abcde"[4] == "e"
> "abcde"[0:5] == "abcde"
> "abcde"[5] is out of range.
>
> From memory, the location object is actually rather more complicated 
> because it copes with nasty locations like 123..<150 plus joins etc, 
> as well as the simple cases like 123..150
>
> It took me a while to get my head round the location object too.
>
>> # get a slice
>> seq_rec.seq[source_feature.location.start.position : 
>> source_feature.location.end.position]
>> # returns the correct thing
>
>
>> # now lets see what the first nt looks like
>> seq_rec.seq[source_feature.location.start.position]   # works fine
>
>
>> # now lets see what the last nt looks like
>> seq_rec.seq[source_feature.location.end.position]
>> IndexError: string index out of range
>
>
> This is correct.  See my example with a string "abcde"[5]
>
>> # The correct answer for is...
>> seq_rec.seq[source_feature.location.end.position - 1]   # now, this 
>> is different from how start position works!
>
>
> It has to be different for the splicing trick.  Again, its the "fault" 
> of trying to be the same as python strings.
>
>> # but wait there is more...
>> # What if I didn't know about the funny end position business and 
>> wrote this,
>> seq_rec.seq[source_feature.location.start.position: 
>> source_feature.location.end.position + 1]
>> # This works, but it is not correct because it has added a nt from 
>> the beginning to the end (slices are nice about that)
>> # If I were to use this on the other internal features I would get 
>> the wrong thing (by one nt)
>>
>> So, either location End should be 2700,  Start should be 1, or state 
>> 'explicitly' what Locations positions represent. But not 0..2701. 
>
>
> I personally am happy with having start 0, end 2701 for a genbank 
> location of 1..2701 and this it is logical.
>
> However, the documentation could be improved.
>
>> Changing the end position probably would mess up lots of code. So 
>> that leaves documentation. You can add my code above to the cookbook 
>> <http://biopython.org/docs/tutorial/Tutorial004.html#toc16>.
>
>
> Maybe an extra sub section, before the current "3.7.2.2  Locations" 
> for the location simple case?  i.e. No joins, no fuzzy locations.  
> Just some very simple examples...
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at biopython.org
> http://biopython.org/mailman/listinfo/biopython-dev



-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032




More information about the Biopython-dev mailing list