[Biopython] [BioPython] Genbank parser

Wed Mar 16 11:43:28 UTC 2011

On Wed, Mar 16, 2011 at 8:26 AM, Timothy Wu <2huggie at gmail.com> wrote:
> Hi,
>
> I'm using Biopython to parse human genome files with code like this:
>
>        from Bio import SeqIO
>        for seq_record in SeqIO.parse(fd, "genbank"):
>            pass  # do something with seq_record
>
> However something tripped on me:
>
> Traceback (most recent call last):
> ...
>    raise LocationParserError(location_line)
> Bio.GenBank.LocationParserError: 958574^958575..958886
>
> The Genbank file involved has the following structure:
>
>    CDS             958574^958575..958772
>                     /gene="CSH2"
> ...
>
> This isn't the first occurrence in this file; however, I manually deleted
> the equivalent of "^958575" in the location and it works out OK.
>
> Is there something I can do? Right now I edit the GenBank file instead
> (since I won't be needing the location information).
> And I'm not sure what the caret is supposed to represent.

Hi Timothy,

I believe this is an invalid GenBank file, and I would like you
to contact the NCBI to check this. The caret is used for 'between'.
Here it seems to be saying that this feature starts between
958574 and 958575, and runs to 958772. That would normally
be represented just as 958575..958772.
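
If you would rather not edit by hand, here is a minimal sketch of that
rewrite on a single location string. It is only an illustration of the
interpretation above, not something Biopython does for you; the names
normalize_location and _BETWEEN_START are made up for this example, and
it assumes the two positions around the caret are adjacent.

```python
import re

# Matches the start of a location such as "958574^958575..958772":
# a 'between' position immediately followed by the ".." of a range.
_BETWEEN_START = re.compile(r"\b(\d+)\^(\d+)\.\.")


def normalize_location(location):
    """Rewrite 'A^B..C' as 'B..C' when B == A + 1 (hypothetical helper).

    Anything else, including a plain 'A^B' between-site, is left alone.
    """
    def _fix(match):
        first, second = int(match.group(1)), int(match.group(2))
        if second == first + 1:
            # Keep only the second position as the start of the range.
            return "%d.." % second
        return match.group(0)

    return _BETWEEN_START.sub(_fix, location)
```

You could apply this to the location text of each feature line before
handing the file to the parser. A lone 123^124 is not touched, since
only the caret-plus-range combination is in question here.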

See also:
http://bugzilla.open-bio.org/show_bug.cgi?id=3175
http://redmine.open-bio.org/issues/3175
(we're migrating the bug database, official announcement
due soon)

How many of these 'broken' GenBank records have you
found? I would hope it is just one or two that can be fixed by
hand. If on the other hand the NCBI say this is valid, we need
to handle this in the Biopython feature model...
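
To get a count, something along these lines should do as a rough
sketch. The function name find_suspect_locations is just for this
example, and it assumes any line containing digits, a caret, digits and
".." is one of these locations (a qualifier value containing the same
text would be a false positive).

```python
import re

# Digits, a caret, digits, then the ".." of a range, e.g. 958574^958575..
SUSPECT = re.compile(r"\d+\^\d+\.\.")


def find_suspect_locations(handle):
    """Return (line_number, stripped_line) for each suspect line.

    'handle' is any iterable of text lines, such as an open file.
    """
    hits = []
    for line_number, line in enumerate(handle, start=1):
        if SUSPECT.search(line):
            hits.append((line_number, line.strip()))
    return hits
```

Calling it on an open GenBank file and taking len() of the result gives
the number of lines to report to the NCBI or fix by hand.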

Peter