[Biopython] zero-length feature

Mon Mar 22 12:07:06 UTC 2010

On Mon, Mar 22, 2010 at 11:44 AM, Anne Pajon <ap12 at sanger.ac.uk> wrote:
> My genome has a single N character at this point.
>

OK - then the feature should be length one, describing this single
base region. i.e. Using python counting, start+1 == end

>
> Here is the code I use to insert these gaps:
>
>    # Add FT gap
>    seq = record.seq
>    in_N = False
>    gap_features = []
>    for i in range(len(seq)):
>        if seq[i] == 'N' and not in_N:
>            start_N = i
>            in_N = True
>        if in_N and not seq[i+1] == 'N':
>            end_N = i
>            if start_N == end_N:
>                log.warning("gap of size 1 %s..%s" % (start_N, end_N))
>            length = (end_N - start_N) + 1
>            gap_feature = SeqFeature(FeatureLocation(start_N,end_N+1),
> strand=1, type="gap")
>            gap_feature.qualifiers['estimated_length'] = [length]
>            gap_features.append(gap_feature)
>            in_N = False
>
> What should I do to make it works with (unmodified) Biopython EMBL output?
> Thanks in advance for your help.
>
> Regards,
> Anne.

I think you have some out by one counting there (resulting in features
of length one shorted than they should have been). How does this self
contained example look?

from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
seq = Seq("ANANNANNNANNNNNA", generic_dna)
record = SeqRecord(seq, id="Test")
print "Finding stretches of N in this:"
print seq
# TODO - Cope with a sequence which ends with N
assert seq[-1] != "N", "FIXME - seq ends with N"
in_N = False
for i in range(len(seq)):
    if seq[i] == 'N' and not in_N:
        start_N = i
        in_N = True
    if in_N and not seq[i+1] == 'N':
        end_N = i+1
        length = end_N - start_N
        assert length > 0
        assert str(seq[start_N:end_N]) == "N"*length
        print "."*start_N + seq[start_N:end_N] + "."*(len(seq)-end_N),
        print "gap of size %i, Python slicing  %s:%s" % (length, start_N, end_N)
        gap_feature = SeqFeature(FeatureLocation(start_N,end_N),
strand=1, type="gap")
        gap_feature.qualifiers['estimated_length'] = [length]
        record.features.append(gap_feature)
        in_N = False
print
print record.format("embl")

And the output, which looks fine to me (this is more readable if your
email client uses a fixed width font):

Finding stretches of N in this:
ANANNANNNANNNNNA
.N.............. gap of size 1, Python slicing  1:2
...NN........... gap of size 2, Python slicing  3:5
......NNN....... gap of size 3, Python slicing  6:9
..........NNNNN. gap of size 5, Python slicing  10:15

ID   Test; ; ; DNA; ; UNC; 16 BP.
XX
AC   Test;
XX
DE   .
XX
OS   .
OC   .
XX
FH   Key             Location/Qualifiers
FT   gap             2..2
FT                   /estimated_length=1
FT   gap             4..5
FT                   /estimated_length=2
FT   gap             7..9
FT                   /estimated_length=3
FT   gap             11..15
FT                   /estimated_length=5
SQ   Sequence 16 BP; 5 A; 0 C; 0 G; 0 T; 11 other;
     ANANNANNNA NNNNNA                                                        16
//

Regards,

Peter