[Biopython-dev] SeqFeature's FeatureLocation for GenBank
Peter
biopython-dev at maubp.freeserve.co.uk
Thu Nov 3 09:17:47 EST 2005
Marc Colosimo wrote:
> Thank you for the response. However, I know how lists work in Python
> (and C, and Java, etc...). That was not question. Here is some code to
> show you what I mean about the inconsistent behavior of Locations.
>
> from Bio import GenBank
> gi_list = GenBank.search_for("AB077698")
> ncbi_dict = GenBank.NCBIDictionary( 'nucleotide', 'genbank', parser =
> GenBank.FeatureParser() )
> seq_rec = ncbi_dict[gi_list[0]]
> print len(seq_rec.seq) # returns 2701, which is correct
> # now lets look at a feature location
> source_feature = seq_rec.features[0] print
> source_feature.type # should be 'source'
> print source_feature.location # (0..2701), in the gb record it was
> (1..2701). The start is correct, the end is NOT
The start and end ARE correct in that seq_rec.seq[0:2701] will return
all of the sequence.
The first nucleotide is seq_rec.seq[0]
The last nucleotide is seq_rec.seq[2700]
The length is 2701
It makes more sense in the case of (sub)features, rather than the source
'feature' which is everything.
In the same way, a string of length 5, e.g. "abcde"
"abcde"[0] == "a"
"abcde"[4] == "e"
"abcde"[0:5] == "abcde"
"abcde"[5] is out of range.
From memory, the location object is actually rather more complicated
because it copes with nasty locations like 123..<150 plus joins etc, as
well as the simple cases like 123..150
It took me a while to get my head round the location object too.
> # get a slice
> seq_rec.seq[source_feature.location.start.position :
> source_feature.location.end.position]
> # returns the correct thing
> # now lets see what the first nt looks like
> seq_rec.seq[source_feature.location.start.position] # works fine
> # now lets see what the last nt looks like
> seq_rec.seq[source_feature.location.end.position]
> IndexError: string index out of range
This is correct. See my example with a string "abcde"[5]
> # The correct answer for is...
> seq_rec.seq[source_feature.location.end.position - 1] # now, this is
> different from how start position works!
It has to be different for the splicing trick. Again, its the "fault"
of trying to be the same as python strings.
> # but wait there is more...
> # What if I didn't know about the funny end position business and wrote
> this,
> seq_rec.seq[source_feature.location.start.position:
> source_feature.location.end.position + 1]
> # This works, but it is not correct because it has added a nt from the
> beginning to the end (slices are nice about that)
> # If I were to use this on the other internal features I would get the
> wrong thing (by one nt)
>
> So, either location End should be 2700, Start should be 1, or state
> 'explicitly' what Locations positions represent. But not 0..2701.
I personally am happy with having start 0, end 2701 for a genbank
location of 1..2701 and this it is logical.
However, the documentation could be improved.
> Changing the end position probably would mess up lots of code. So that
> leaves documentation. You can add my code above to the cookbook
> <http://biopython.org/docs/tutorial/Tutorial004.html#toc16>.
Maybe an extra sub section, before the current "3.7.2.2 Locations" for
the location simple case? i.e. No joins, no fuzzy locations. Just some
very simple examples...
Peter
More information about the Biopython-dev
mailing list