[Biopython] Start and end of feature locations in circular sequences
Jan T. Kim
jttkim at googlemail.com
Thu Aug 31 01:42:00 EDT 2023
Hi All,
I've recently encountered features in circular sequences that start near
the end of the (probably arbitrarily) linearised sequence and end near
its start. For an example see the first CDS feature in [1] (locus tag
"X600_gp001"):
join(139629..139738,1..196)
To my surprise, the start attribute of this feature's location is 0,
and its end attribute is the end of the sequence:
>>> f1.location.start
ExactPosition(0)
>>> f1.location.end
ExactPosition(139738)
So if one uses the start and end positions of the feature without checking
whether its location is compound (and, if so, going through its parts),
the feature appears to span the entire sequence (!!).
Technically, the findings above are consistent with the documentation, which
states that start and end give the minimal and maximal positions occurring in
a feature, respectively.
This behaviour is not quite consistent with my expectations in this case,
however. Is there any way (attribute, method or whatever) to detect whether
a feature straddles the cut point of a circular sequence? I realise that
when taking non-exact positions into account and when making no assumptions
about the ordering of parts, such a check can be difficult and may not
have a well-defined result in all cases, but on the other hand I don't
think it's likely that I'm the first person requiring such a check...?
My main objective with this post is to find out whether there's anything
in Biopython that does this type of job already. If there isn't I'll
code up some heuristic.
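For illustration, here is a minimal sketch of the kind of heuristic I have in
mind. It is not an existing Biopython function. The name straddles_origin and
the Part / Compound stand-ins are hypothetical and only there so the example
runs without Biopython installed. The function reads nothing but the parts,
start and strand attributes, so it should be applicable to a feature's
location object as well. It assumes exact positions and parts listed in
biological order. Locations that mix strands (e.g. trans-splicing) may be
misreported.

```python
from collections import namedtuple

# Hypothetical stand-ins for simple and compound locations, used only so
# that this sketch runs without Biopython installed.
Part = namedtuple("Part", ["start", "end", "strand"])
Compound = namedtuple("Compound", ["parts"])


def straddles_origin(location):
    """Guess whether a location wraps around the cut point of a circular sequence.

    Assumptions (not guaranteed for every record): positions are exact, and
    the parts are listed in biological (5' to 3') order, i.e. ascending
    coordinates on the forward strand and descending on the reverse strand.
    """
    # An object without a "parts" attribute is treated as a single part,
    # which cannot wrap.
    parts = list(getattr(location, "parts", [location]))
    for previous, current in zip(parts, parts[1:]):
        if previous.strand == -1 and current.strand == -1:
            # Reverse strand: an upward jump suggests crossing the cut point.
            if current.start > previous.start:
                return True
        elif current.start < previous.start:
            # Forward (or unstranded) parts: a downward jump suggests a crossing.
            return True
    return False


# join(139629..139738,1..196) in 0-based, end-exclusive coordinates
wrapping = Compound([Part(139628, 139738, 1), Part(0, 196, 1)])
# An ordinary two-part feature that does not cross the cut point
ordinary = Compound([Part(100, 200, 1), Part(300, 400, 1)])

print(straddles_origin(wrapping))  # True
print(straddles_origin(ordinary))  # False
```

With a real record, the call would presumably be
straddles_origin(f1.location), but I have not tested this against fuzzy
positions or unusual part orderings.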
Best regards, Jan
[1] https://www.ncbi.nlm.nih.gov/nuccore/NC_022920.1/