[Biopython-dev] More 'fun' with GenBank

Tue Jan 15 10:54:45 EST 2013

Hi folks,

as people are hitting my web service with all sorts of wonky GenBank
files, I've stumbled over another one that throws the GenBank parser off
track.

The culprit is a SeqFeature with a location line like:

     CDS             join(complement(4093..4338),complement(3876..4011),
                     complement(3655..3809),complement(3284..3585),
                     complement(2421..2813),complement(2057..2303))

Now, the way I read the GenBank spec, this is not a valid location line,
but should instead be a complement() of joins(). Unfortunately, the NCBI
seems to disagree with its own specs, and put the record into their
Nucleotide database as CABT02000004, which means that by all practical
purposes, it _is_ a valid GenBank file and the parser should cope.

The parser looks at this location and creates a feature on the -1
strand, from 4092:2303. This is caused by by the feature location
calculation on
https://github.com/biopython/biopython/blob/master/Bio/GenBank/__init__.py#L1049
and the lines after.

In short, we do
            s = cur_feature.sub_features[0].location.start
            e = cur_feature.sub_features[-1].location.end
            cur_feature.location = SeqFeature.FeatureLocation(s, e, strand)

And when the join() looks like the record I'm dealing with, this is
clearly the wrong way around.

I decided to fix this by sorting the subfeatures by start,end
coordinates, and that fixes this issue for me.

Unfortunately, this also breaks an existing test, the extra_keywords.gb
test.
https://github.com/biopython/biopython/blob/master/Tests/GenBank/extra_keywords.gb#L647
has a feature that has a location of

     CDS             join(153490..154269,AL121804.2:41..610,
                     AL121804.2:672..1487)

Here, we probably do want the feature from 153489:1487, even though I'm
not sure how useful such a location really is.

So I decided to fix this by sorting the subfeatures first on their ref,
and then on start, end.

This again breaks a test, this time in one_of.gb
https://github.com/biopython/biopython/blob/master/Tests/GenBank/one_of.gb#L39
where the location line is

     CDS join(2201..2479,U18267.1:120..246,U18268.1:130..288,
                     U18270.1:4691..4788,U18269.1:82..>128)

Here, the U18270.1 record seems to come befire the U18269.1 record.

Now, we're again spanning a feature into multiple contigs, none of which
are accessible to the extract() function as far as I'm aware.
Sorting the locations by start, end (and maybe ref first) at least fixes
the case CABT02000004 is broken on where we have the chance of getting
extract() to work.

The attached patch is my proposed change, but I wanted to get some
feedback first before opening a bug and/or submitting a pull request.

Cheers,
Kai

-- 
Dipl.-Inform. Kai Blin         kai.blin at biotech.uni-tuebingen.de
Institute for Microbiology and Infection Medicine
Division of Microbiology/Biotechnology
Eberhard-Karls-Universität Tübingen
Auf der Morgenstelle 28                 Phone : ++49 7071 29-78841
D-72076 Tübingen                        Fax :   ++49 7071 29-5979
Germany
Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-GenBank-Sort-subfeatures-by-ref-and-start-end-positi.patch
Type: text/x-patch
Size: 9059 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20130115/f7c0bb7d/attachment.bin>