[Biopython] Start and end of feature locations in circular sequences

Jan T. Kim jttkim at googlemail.com
Thu Aug 31 01:42:00 EDT 2023


Hi All,

I've recently encountered features in circular sequences that start near
the end of the (probably arbitrarily) linearised sequence and end near
its start. For an example see the first CDS feature in [1] (locus tag
"X600_gp001"):

    join(139629..139738,1..196)

To my surprise, the start attribute of this feature's location is 0,
and its end attribute is the end of the sequence:

    >>> f1.location.start
    ExactPosition(0)
    >>> f1.location.end
    ExactPosition(139738)

So if one uses the start and end positions of the feature without checking
whether its location is compound (and, if so, going through its parts), the
feature appears to span the entire sequence (!!).

Technically, the findings above are consistent with the documentation, which
states that start and end give the minimal and maximal positions occurring in
a feature, respectively.
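
To make the arithmetic concrete, here is a minimal sketch in plain Python (no
Biopython calls) that reproduces the min/max behaviour for the two parts of
the join above. The names parts, span_start, span_end and covered are
illustrative only, and the tuples are assumed to be 0-based, half-open
(start, end) pairs matching the values shown in the session above.

```python
# Parts of join(139629..139738,1..196) written as 0-based, half-open
# (start, end) pairs. This is an assumed hand translation of the GenBank
# location, not output produced by Biopython.
parts = [(139628, 139738), (0, 196)]

# Minimal and maximal positions over all parts, as the documentation describes.
span_start = min(start for start, _ in parts)
span_end = max(end for _, end in parts)

# Number of bases the parts actually cover.
covered = sum(end - start for start, end in parts)

print(span_start, span_end)   # 0 139738
print(span_end - span_start)  # 139738
print(covered)                # 306
```

So the parts cover only 306 bases, while the span implied by start and end is
the full 139738.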

This behaviour is not quite consistent with my expectations in this case,
however. Is there any way (attribute, method or whatever) to detect whether
a feature straddles the cut point of a circular sequence? I realise that
when taking non-exact positions into account and when making no assumptions
about the ordering of parts, such a check can be difficult and may not
have a well defined result in all cases, but on the other hand I don't
think it's likely that I'm the first person requiring such a check...?

My main objective with this post is to find out whether there's anything
in Biopython that does this type of job already. If there isn't, I'll
code up some heuristic.
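
One possible heuristic, as a rough sketch only: treat a location as straddling
the cut point if two adjacent parts meet there, i.e. one part ends exactly at
the sequence length and its neighbour starts exactly at 0. The function name
straddles_origin and its arguments are hypothetical, not part of Biopython. It
works on plain (start, end) tuples, assumes exact, 0-based, half-open
coordinates, and does not attempt to handle fuzzy positions or unordered parts.

```python
def straddles_origin(parts, seq_len):
    """Return True if two adjacent parts meet at the cut point.

    parts   -- list of (start, end) tuples, 0-based and half-open
    seq_len -- length of the linearised sequence

    Hypothetical helper: adjacent parts are checked in both orders, so the
    result does not depend on whether the parts are listed in forward or
    reverse order.
    """
    for (s1, e1), (s2, e2) in zip(parts, parts[1:]):
        if (e1 == seq_len and s2 == 0) or (e2 == seq_len and s1 == 0):
            return True
    return False


# The first CDS of NC_022920.1, join(139629..139738,1..196), translated by
# hand into 0-based, half-open coordinates.
wrapped = [(139628, 139738), (0, 196)]
print(straddles_origin(wrapped, 139738))              # True

# An ordinary two-part join that does not touch the cut point.
print(straddles_origin([(10, 20), (30, 40)], 139738))  # False
```

With Biopython, the tuples could presumably be built from the parts attribute
of the location, for example
[(int(p.start), int(p.end)) for p in f1.location.parts], with seq_len taken
from the length of the record's sequence. This sketch does not check that the
record is actually annotated as circular, and a feature whose parts stop short
of the exact sequence ends would not be detected.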

Best regards, Jan


[1] https://www.ncbi.nlm.nih.gov/nuccore/NC_022920.1/


