[Biopython] zero-length feature
Peter
biopython at maubp.freeserve.co.uk
Mon Mar 22 12:07:06 UTC 2010
On Mon, Mar 22, 2010 at 11:44 AM, Anne Pajon <ap12 at sanger.ac.uk> wrote:
> My genome has a single N character at this point.
>
OK - then the feature should be length one, describing this single
base region. i.e. Using python counting, start+1 == end
>
> Here is the code I use to insert these gaps:
>
> # Add FT gap
> seq = record.seq
> in_N = False
> gap_features = []
> for i in range(len(seq)):
> if seq[i] == 'N' and not in_N:
> start_N = i
> in_N = True
> if in_N and not seq[i+1] == 'N':
> end_N = i
> if start_N == end_N:
> log.warning("gap of size 1 %s..%s" % (start_N, end_N))
> length = (end_N - start_N) + 1
> gap_feature = SeqFeature(FeatureLocation(start_N,end_N+1),
> strand=1, type="gap")
> gap_feature.qualifiers['estimated_length'] = [length]
> gap_features.append(gap_feature)
> in_N = False
>
> What should I do to make it works with (unmodified) Biopython EMBL output?
> Thanks in advance for your help.
>
> Regards,
> Anne.
I think you have some out by one counting there (resulting in features
of length one shorted than they should have been). How does this self
contained example look?
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
seq = Seq("ANANNANNNANNNNNA", generic_dna)
record = SeqRecord(seq, id="Test")
print "Finding stretches of N in this:"
print seq
# TODO - Cope with a sequence which ends with N
assert seq[-1] != "N", "FIXME - seq ends with N"
in_N = False
for i in range(len(seq)):
if seq[i] == 'N' and not in_N:
start_N = i
in_N = True
if in_N and not seq[i+1] == 'N':
end_N = i+1
length = end_N - start_N
assert length > 0
assert str(seq[start_N:end_N]) == "N"*length
print "."*start_N + seq[start_N:end_N] + "."*(len(seq)-end_N),
print "gap of size %i, Python slicing %s:%s" % (length, start_N, end_N)
gap_feature = SeqFeature(FeatureLocation(start_N,end_N),
strand=1, type="gap")
gap_feature.qualifiers['estimated_length'] = [length]
record.features.append(gap_feature)
in_N = False
print
print record.format("embl")
And the output, which looks fine to me (this is more readable if your
email client uses a fixed width font):
Finding stretches of N in this:
ANANNANNNANNNNNA
.N.............. gap of size 1, Python slicing 1:2
...NN........... gap of size 2, Python slicing 3:5
......NNN....... gap of size 3, Python slicing 6:9
..........NNNNN. gap of size 5, Python slicing 10:15
ID Test; ; ; DNA; ; UNC; 16 BP.
XX
AC Test;
XX
DE .
XX
OS .
OC .
XX
FH Key Location/Qualifiers
FT gap 2..2
FT /estimated_length=1
FT gap 4..5
FT /estimated_length=2
FT gap 7..9
FT /estimated_length=3
FT gap 11..15
FT /estimated_length=5
SQ Sequence 16 BP; 5 A; 0 C; 0 G; 0 T; 11 other;
ANANNANNNA NNNNNA 16
//
Regards,
Peter
More information about the Biopython
mailing list