[BioPython] Genbank LOCUS line slightly misaligned

Thu Dec 18 16:01:10 UTC 2008

On Thu, Dec 18, 2008 at 3:25 PM, Peter Saffrey <pzs at dcs.gla.ac.uk> wrote:
> Thanks for your prompt reply.
>
> Peter wrote:
>>
>> I suspect that they (Genomatrix) are inserting a large locus
>> identifier into the beginning of the LOCUS line which is sometimes
>> bigger than the allocated slot, pushing the rest of the fields out of
>> position in some of the files.  I'd need to see several examples to be
>> confident about this guess.
>>
>
> That sounds about right. Here's a sample:
>
> $ grep LOCUS skurukutipromo.gb | head
> LOCUS       GXP_4216    601 bp    DNA
> LOCUS       GXP_4217    601 bp    DNA
> LOCUS       GXP_4220    601 bp    DNA
> LOCUS       GXP_4226    603 bp    DNA
> LOCUS       GXP_1485624    601 bp    DNA
> LOCUS       GXP_1485625    601 bp    DNA
> LOCUS       GXP_4230    601 bp    DNA
> LOCUS       GXP_4253    640 bp    DNA
> LOCUS       GXP_648168    662 bp    DNA
> LOCUS       GXP_4281    601 bp    DNA
>
> It's a bit careless on their part, but who listens to standards anyway? ;)

Writing general output in GenBank format is tricky if you have long
record identifiers, because the LOCUS line has only a fixed-width slot
for them.

>> If you don't actually need much information from the LOCUS line, you
>> might find it easier to hack our parser to be a little more tolerant -
>> I would suggest simply pulling out the locus ID, ignoring the rest of
>> the LOCUS line, and printing a warning.
>
> I already did a regex on the file itself to excise everything after the
> locus id, which put an end to the complaints.

If you're happy, that's fine.
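
For anyone else hitting this, here is a rough sketch of that kind of
pre-processing step. It is not the regex you used (I haven't seen it)
and it is not part of Biopython; the function name trim_locus_line is
made up for illustration. It keeps only "LOCUS <id>" and drops the rest
of the line:

```python
import re

# Matches the LOCUS keyword, the padding after it, and the first token (the ID).
_LOCUS_ID = re.compile(r"^(LOCUS\s+\S+)")


def trim_locus_line(line):
    """Cut a LOCUS line down to 'LOCUS <id>'; return other lines unchanged.

    Hypothetical helper, not a Biopython function.
    """
    match = _LOCUS_ID.match(line)
    if match is None:
        return line
    return match.group(1) + "\n"


# Sample lines copied from the grep output quoted above.
sample = [
    "LOCUS       GXP_4216    601 bp    DNA\n",
    "LOCUS       GXP_1485624    601 bp    DNA\n",
    "DEFINITION  example\n",
]
trimmed = [trim_locus_line(line) for line in sample]
```

You would run every line of the file through this and write the result
to a new file before parsing it. Note that this throws away the
sequence length and molecule type given on the LOCUS line.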

> I'm also finding I have to manually parse the description entry, which comes
> out in one big lump like this:
>
> 'loc=GXL_3369|sym=AK092090|acc=GXP_4216|taxid=9606| spec=Homo
> sapiens|chr=19|ctg=NC_000019|str=(-)|
> start=9533632|end=9534232|len=601|tss=501| descr=Homo sapiens cDNA FLJ34771
> fis, clone NT2NE2003150.| comm=GXT_2771591/AK096729/501/gold'

What did the DEFINITION lines look like?  It's usually just a long
string like "species name, complete genome" spanning one or more
lines.  Here I'm guessing Genomatrix are sticking a whole load of
metadata into this field using their own convention.  This is a bit
odd, but I think I've also seen similar extra data dumped into the
COMMENT lines by other programs.

> Has some other formatting error prevented biopython from breaking this up
> for me, or is this the expected behaviour? I'm using biopython1.49. It's not
> a big deal, I was just wondering.

I think that's the expected behaviour: the DEFINITION lines become
the record's description property (a simple string).
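
If you do want the individual fields, splitting the string yourself is
probably simplest. Below is a minimal sketch, assuming the layout is
always "key=value" pairs separated by "|" as in your example. I don't
know the vendor's actual convention, so treat that as a guess; the
function name parse_description is made up, and the example string is
your sample with the email line wrapping undone.

```python
def parse_description(description):
    """Split a 'key=value|key=value' string into a dict.

    Hypothetical helper; assumes '|' never appears inside a value.
    """
    fields = {}
    for chunk in description.split("|"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # partition splits on the first '=' only, so values may contain '='.
        key, sep, value = chunk.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


example = (
    "loc=GXL_3369|sym=AK092090|acc=GXP_4216|taxid=9606| spec=Homo sapiens"
    "|chr=19|ctg=NC_000019|str=(-)| start=9533632|end=9534232|len=601"
    "|tss=501| descr=Homo sapiens cDNA FLJ34771 fis, clone NT2NE2003150."
    "| comm=GXT_2771591/AK096729/501/gold"
)
fields = parse_description(example)
```

In practice you would pass in the record's description string rather
than a literal. All values come back as strings, so convert start, end
and so on to integers yourself if you need them as numbers.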

Peter