[Biopython] zero-length feature

Mon Mar 22 15:01:48 UTC 2010

All,

Just to make sure I'm sanity-checked here, the GenBank/EMBL/DDBJ location specs indicate a location of one nucleotide (inclusive) in length is to be characterized as one number, not a range at all:

http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#3.5.3

2..2

should just be:

2

Or, did I miss something in the discussion?

chris

On Mar 22, 2010, at 9:52 AM, Anne Pajon wrote:

> Brilliant! Thanks.
> 
> Regards,
> Anne.
> 
> On 22 Mar 2010, at 12:07, Peter wrote:
> 
>> On Mon, Mar 22, 2010 at 11:44 AM, Anne Pajon <ap12 at sanger.ac.uk> wrote:
>>> My genome has a single N character at this point.
>>> 
>> 
>> OK - then the feature should be length one, describing this single
>> base region. i.e. Using python counting, start+1 == end
>> 
>>> 
>>> Here is the code I use to insert these gaps:
>>> 
>>>   # Add FT gap
>>>   seq = record.seq
>>>   in_N = False
>>>   gap_features = []
>>>   for i in range(len(seq)):
>>>       if seq[i] == 'N' and not in_N:
>>>           start_N = i
>>>           in_N = True
>>>       if in_N and not seq[i+1] == 'N':
>>>           end_N = i
>>>           if start_N == end_N:
>>>               log.warning("gap of size 1 %s..%s" % (start_N, end_N))
>>>           length = (end_N - start_N) + 1
>>>           gap_feature = SeqFeature(FeatureLocation(start_N,end_N+1),
>>> strand=1, type="gap")
>>>           gap_feature.qualifiers['estimated_length'] = [length]
>>>           gap_features.append(gap_feature)
>>>           in_N = False
>>> 
>>> What should I do to make it works with (unmodified) Biopython EMBL output?
>>> Thanks in advance for your help.
>>> 
>>> Regards,
>>> Anne.
>> 
>> I think you have some out by one counting there (resulting in features
>> of length one shorted than they should have been). How does this self
>> contained example look?
>> 
>> from Bio.Alphabet import generic_dna
>> from Bio.Seq import Seq
>> from Bio.SeqRecord import SeqRecord
>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>> seq = Seq("ANANNANNNANNNNNA", generic_dna)
>> record = SeqRecord(seq, id="Test")
>> print "Finding stretches of N in this:"
>> print seq
>> # TODO - Cope with a sequence which ends with N
>> assert seq[-1] != "N", "FIXME - seq ends with N"
>> in_N = False
>> for i in range(len(seq)):
>>   if seq[i] == 'N' and not in_N:
>>       start_N = i
>>       in_N = True
>>   if in_N and not seq[i+1] == 'N':
>>       end_N = i+1
>>       length = end_N - start_N
>>       assert length > 0
>>       assert str(seq[start_N:end_N]) == "N"*length
>>       print "."*start_N + seq[start_N:end_N] + "."*(len(seq)-end_N),
>>       print "gap of size %i, Python slicing  %s:%s" % (length, start_N, end_N)
>>       gap_feature = SeqFeature(FeatureLocation(start_N,end_N),
>> strand=1, type="gap")
>>       gap_feature.qualifiers['estimated_length'] = [length]
>>       record.features.append(gap_feature)
>>       in_N = False
>> print
>> print record.format("embl")
>> 
>> 
>> And the output, which looks fine to me (this is more readable if your
>> email client uses a fixed width font):
>> 
>> 
>> Finding stretches of N in this:
>> ANANNANNNANNNNNA
>> .N.............. gap of size 1, Python slicing  1:2
>> ...NN........... gap of size 2, Python slicing  3:5
>> ......NNN....... gap of size 3, Python slicing  6:9
>> ..........NNNNN. gap of size 5, Python slicing  10:15
>> 
>> ID   Test; ; ; DNA; ; UNC; 16 BP.
>> XX
>> AC   Test;
>> XX
>> DE   .
>> XX
>> OS   .
>> OC   .
>> XX
>> FH   Key             Location/Qualifiers
>> FT   gap             2..2
>> FT                   /estimated_length=1
>> FT   gap             4..5
>> FT                   /estimated_length=2
>> FT   gap             7..9
>> FT                   /estimated_length=3
>> FT   gap             11..15
>> FT                   /estimated_length=5
>> SQ   Sequence 16 BP; 5 A; 0 C; 0 G; 0 T; 11 other;
>>    ANANNANNNA NNNNNA                                                        16
>> //
>> 
>> Regards,
>> 
>> Peter
> 
> --
> Dr Anne Pajon - Pathogen Genomics, Team 81
> Sanger Institute, Wellcome Trust Genome Campus, Hinxton
> Cambridge CB10 1SA, United Kingdom
> +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)
> 
> 
> 
> -- 
> The Wellcome Trust Sanger Institute is operated by Genome ResearchLimited, a charity registered in England with number 1021457 and acompany registered in England with number 2742969, whose registeredoffice is 215 Euston Road, London, NW1 2BE._______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython