[BioPython] Genbank LOCUS line slightly misaligned

Thu Dec 18 15:25:54 UTC 2008

Thanks for your prompt reply.

Peter wrote:
> I suspect that they (Genomatix) are inserting a large locus
> identifier into the beginning of the LOCUS line which is sometimes
> bigger than the allocated slot, pushing the rest of the fields out of
> position in some of the files.  I'd need to see several examples to be
> confident about this guess.
>

That sounds about right. Here's a sample:

$ grep LOCUS skurukutipromo.gb | head
LOCUS       GXP_4216    601 bp    DNA
LOCUS       GXP_4217    601 bp    DNA
LOCUS       GXP_4220    601 bp    DNA
LOCUS       GXP_4226    603 bp    DNA
LOCUS       GXP_1485624    601 bp    DNA
LOCUS       GXP_1485625    601 bp    DNA
LOCUS       GXP_4230    601 bp    DNA
LOCUS       GXP_4253    640 bp    DNA
LOCUS       GXP_648168    662 bp    DNA
LOCUS       GXP_4281    601 bp    DNA

It's a bit careless on their part, but who listens to standards anyway? ;)

> If you don't actually need much information from the LOCUS line, you
> might find it easier to hack our parser to be a little more tolerant -
> I would suggest simply pulling out the locus ID, ignoring the rest of
> the LOCUS line, and printing a warning.
> 

I already ran a regex over the file itself to strip everything after the
locus ID on each LOCUS line, which put an end to the parser's complaints.
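
For anyone hitting the same problem, here is a minimal sketch of that kind of clean-up. It is not my exact regex; the pattern and the function name strip_locus_fields are just illustrative, and it assumes the locus ID is the first whitespace-free token after the LOCUS keyword. Note that it throws away the length and molecule type, so only use it if you don't need them.

```python
import re

# Assumption: the locus ID is the first non-space token after "LOCUS".
# [ \t]+ is used instead of \s+ so the match can never run onto the next line.
LOCUS_RE = re.compile(r"^(LOCUS[ \t]+\S+).*$", re.MULTILINE)


def strip_locus_fields(text):
    """Keep only 'LOCUS <id>' on each LOCUS line; other lines are untouched."""
    return LOCUS_RE.sub(r"\1", text)


if __name__ == "__main__":
    sample = (
        "LOCUS       GXP_1485624    601 bp    DNA\n"
        "DEFINITION  example record\n"
    )
    print(strip_locus_fields(sample), end="")
```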

I'm also finding I have to manually parse the description entry, which 
comes out in one big lump like this:

'loc=GXL_3369|sym=AK092090|acc=GXP_4216|taxid=9606| spec=Homo 
sapiens|chr=19|ctg=NC_000019|str=(-)| 
start=9533632|end=9534232|len=601|tss=501| descr=Homo sapiens cDNA 
FLJ34771 fis, clone NT2NE2003150.| comm=GXT_2771591/AK096729/501/gold'

Has some other formatting error prevented Biopython from breaking this
up for me, or is this the expected behaviour? I'm using Biopython 1.49.
It's not a big deal, I was just wondering.
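
In the meantime, the manual parsing I mean looks roughly like the sketch below. It assumes the fields are always "key=value" pairs separated by "|", that no value contains a "|" itself, and that any line breaks inside the string are just wrapping. The function name parse_description is my own, not anything from Biopython.

```python
def parse_description(description):
    """Split a 'key=value|key=value' description string into a dict.

    Assumes '|' never appears inside a value; wrapped lines are rejoined
    with single spaces before splitting.
    """
    fields = {}
    # Collapse newlines and runs of spaces left over from line wrapping.
    flattened = " ".join(description.split())
    for item in flattened.split("|"):
        key, sep, value = item.strip().partition("=")
        if sep:  # ignore fragments that have no '=' at all
            fields[key] = value
    return fields


if __name__ == "__main__":
    text = (
        "loc=GXL_3369|sym=AK092090|acc=GXP_4216|taxid=9606| spec=Homo \n"
        "sapiens|chr=19|ctg=NC_000019|str=(-)| \n"
        "start=9533632|end=9534232|len=601|tss=501"
    )
    for key, value in parse_description(text).items():
        print(key, value)
```

All values come back as strings, so things like start, end and len still need an int() call if you want to do arithmetic on them.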

Cheers,

Peter