[Biopython-dev] SeqFeature's FeatureLocation for GenBank
Michiel Jan Laurens de Hoon
mdehoon at c2b2.columbia.edu
Thu Nov 3 15:05:20 EST 2005
I think the confusion is coming from the way FeatureLocation prints
itself. (0..2701) looks too much like a GenBank-style location. If
FeatureLocation were to print (0:2701) instead, it would be pretty clear that
this is a Python-style slice. One solution might be to let
FeatureLocation inherit from list, and override as needed for abstract
positions.
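
To make the idea concrete, here is a minimal sketch (written for Python 3).
It is not Biopython's FeatureLocation: SliceLocation, from_genbank and
extract are hypothetical names made up for illustration, it handles exact
positions only (no fuzzy ends, no joins), and for simplicity it does not
inherit from list. It only shows a location that stores Python-style
slice bounds and prints them with a colon.

```python
class SliceLocation:
    """Hypothetical stand-in for FeatureLocation (exact positions only)."""

    def __init__(self, start, end):
        # 0-based start, exclusive end, exactly like a Python slice.
        self.start = start
        self.end = end

    @classmethod
    def from_genbank(cls, first, last):
        # GenBank counts from 1 and includes both ends (e.g. 1..2701),
        # so only the start needs shifting to get slice bounds.
        return cls(first - 1, last)

    def __len__(self):
        return self.end - self.start

    def __str__(self):
        # A colon instead of ".." signals a slice, not a GenBank location.
        return "(%d:%d)" % (self.start, self.end)

    def extract(self, seq):
        # Works on anything that supports slicing (str, Seq, list).
        return seq[self.start:self.end]


seq = "abcde"
loc = SliceLocation.from_genbank(1, 5)
print(loc)               # (0:5)
print(loc.extract(seq))  # abcde
print(seq[loc.end - 1])  # e, the last letter; seq[loc.end] would fail
```

With this convention the GenBank location 1..2701 becomes (0:2701), and
the length of the feature is simply end minus start.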
--Michiel.
Peter wrote:
> Marc Colosimo wrote:
>
>> Thank you for the response. However, I know how lists work in Python
>> (and C, and Java, etc...). That was not the question. Here is some code
>> to show you what I mean about the inconsistent behavior of Locations.
>>
>> from Bio import GenBank
>> gi_list = GenBank.search_for("AB077698")
>> ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank',
>>                                    parser=GenBank.FeatureParser())
>> seq_rec = ncbi_dict[gi_list[0]]
>> print len(seq_rec.seq)  # returns 2701, which is correct
>> # now let's look at a feature location
>> source_feature = seq_rec.features[0]
>> print source_feature.type  # should be 'source'
>> print source_feature.location  # (0..2701); in the gb record it
>> # was (1..2701). The start is correct, the end is NOT
>
>
> The start and end ARE correct in that seq_rec.seq[0:2701] will return
> all of the sequence.
>
> The first nucleotide is seq_rec.seq[0]
> The last nucleotide is seq_rec.seq[2700]
> The length is 2701
>
> It makes more sense in the case of (sub)features, rather than the
> source 'feature' which is everything.
>
> In the same way, a string of length 5, e.g. "abcde"
> "abcde"[0] == "a"
> "abcde"[4] == "e"
> "abcde"[0:5] == "abcde"
> "abcde"[5] is out of range.
>
> From memory, the location object is actually rather more complicated
> because it copes with nasty locations like 123..<150 plus joins etc,
> as well as the simple cases like 123..150
>
> It took me a while to get my head round the location object too.
>
>> # get a slice
>> seq_rec.seq[source_feature.location.start.position :
>> source_feature.location.end.position]
>> # returns the correct thing
>
>
>> # now let's see what the first nt looks like
>> seq_rec.seq[source_feature.location.start.position] # works fine
>
>
>> # now let's see what the last nt looks like
>> seq_rec.seq[source_feature.location.end.position]
>> IndexError: string index out of range
>
>
> This is correct. See my example with a string "abcde"[5]
>
>> # The correct answer is...
>> seq_rec.seq[source_feature.location.end.position - 1]
>> # now, this is different from how start position works!
>
>
> It has to be different for the slicing trick. Again, it's the "fault"
> of trying to be the same as python strings.
>
>> # but wait there is more...
>> # What if I didn't know about the funny end position business and
>> # wrote this,
>> seq_rec.seq[source_feature.location.start.position:
>>             source_feature.location.end.position + 1]
>> # This works, but it is not correct because it has added a nt from
>> # the beginning to the end (slices are nice about that)
>> # If I were to use this on the other internal features I would get
>> # the wrong thing (by one nt)
>>
>> So, either location End should be 2700, Start should be 1, or state
>> 'explicitly' what Locations positions represent. But not 0..2701.
>
>
> I personally am happy with having start 0, end 2701 for a genbank
> location of 1..2701, and I think this is logical.
>
> However, the documentation could be improved.
>
>> Changing the end position probably would mess up lots of code. So
>> that leaves documentation. You can add my code above to the cookbook
>> <http://biopython.org/docs/tutorial/Tutorial004.html#toc16>.
>
>
> Maybe an extra sub section, before the current "3.7.2.2 Locations"
> for the location simple case? i.e. No joins, no fuzzy locations.
> Just some very simple examples...
>
> Peter
--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032