[Biopython-dev] More 'fun' with GenBank

Peter Cock p.j.a.cock at googlemail.com
Tue Jan 15 16:41:32 UTC 2013


On Tue, Jan 15, 2013 at 3:54 PM, Kai Blin
<kai.blin at biotech.uni-tuebingen.de> wrote:
> Hi folks,
>
> as people are hitting my web service with all sorts of wonky GenBank
> files, I've stumbled over another one that throws the GenBank parser off
> track.
>
> The culprit is a SeqFeature with a location line like:
>
>      CDS             join(complement(4093..4338),complement(3876..4011),
>                      complement(3655..3809),complement(3284..3585),
>                      complement(2421..2813),complement(2057..2303))
>
> Now, the way I read the GenBank spec, this is not a valid location line,
> but should instead be a complement() of joins(). Unfortunately, the NCBI
> seems to disagree with its own specs, and put the record into their
> Nucleotide database as CABT02000004, which means that by all practical
> purposes, it _is_ a valid GenBank file and the parser should cope.

That should work - for a while GenBank and EMBL didn't agree about
joins on the complement strand, one did complement(join(a..b,c..d))
and the other join(complement(c..d),complement(a..b)), notice the
order of the sub-regions flips.

> The parser looks at this location and creates a feature on the -1
> strand, from 4092:2303. This is caused by by the feature location
> calculation on
> https://github.com/biopython/biopython/blob/master/Bio/GenBank/__init__.py#L1049
> and the lines after.
>
> In short, we do
>             s = cur_feature.sub_features[0].location.start
>             e = cur_feature.sub_features[-1].location.end
>             cur_feature.location = SeqFeature.FeatureLocation(s, e, strand)

For join feature locations, the sub-feature locations should be fine
but the overall feature location is a bit weird/broken for negative
and mixed strands.

This was one of the things the re-factoring on this branch aimed to
fix, https://github.com/peterjc/biopython/tree/f_loc4/
http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html

I was intending to bring this up again after the next release (which
could be later this month or February 2012), but perhaps it would
be worth doing now?

Peter



More information about the Biopython-dev mailing list