[BioPython] Genbank LOCUS line slightly misaligned
Peter Saffrey
pzs at dcs.gla.ac.uk
Thu Dec 18 15:25:54 UTC 2008
Thanks for your prompt reply.
Peter wrote:
> I suspect that they (Genomatrix) are inserting a large locus
> identifier into the beginning of the LOCUS line which is sometimes
> bigger than the allocated slot, pushing the rest of the fields out of
> position in some of the files. I'd need to see several examples to be
> confident about this guess.
>
That sounds about right. Here's a sample:
$ grep LOCUS skurukutipromo.gb | head
LOCUS GXP_4216 601 bp DNA
LOCUS GXP_4217 601 bp DNA
LOCUS GXP_4220 601 bp DNA
LOCUS GXP_4226 603 bp DNA
LOCUS GXP_1485624 601 bp DNA
LOCUS GXP_1485625 601 bp DNA
LOCUS GXP_4230 601 bp DNA
LOCUS GXP_4253 640 bp DNA
LOCUS GXP_648168 662 bp DNA
LOCUS GXP_4281 601 bp DNA
It's a bit careless on their part, but who listens to standards anyway? ;)
> If you don't actually need much information from the LOCUS line, you
> might find it easier to hack our parser to be a little more tolerant -
> I would suggest simply pulling out the locus ID, ignoring the rest of
> the LOCUS line, and printing a warning.
>
I already did a regex on the file itself to excise everything after the
locus id, which put an end to the complaints.
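For anyone hitting the same problem, that kind of preprocessing can be sketched roughly as follows. This is only an illustration, not the exact expression I ran: the pattern and the function name `trim_locus_line` are made up for this example, and it assumes the locus ID is the first whitespace-delimited token after the LOCUS keyword.

```python
import re

# Capture the LOCUS keyword and the locus ID; the rest of the line is dropped.
# Assumption: the ID is the first whitespace-delimited token after "LOCUS".
LOCUS_RE = re.compile(r"^(LOCUS\s+\S+).*$")


def trim_locus_line(line):
    """Return the line with everything after the locus ID removed.

    Lines that do not start with LOCUS are returned unchanged.
    """
    return LOCUS_RE.sub(r"\1", line)


def trim_locus_lines(lines):
    """Apply trim_locus_line to every line of an iterable (e.g. an open file)."""
    for line in lines:
        yield trim_locus_line(line)
```

Note that this throws away the length, molecule type and anything else on the LOCUS line, so it only makes sense if you don't need those fields.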
I'm also finding I have to manually parse the description entry, which
comes out in one big lump like this:
'loc=GXL_3369|sym=AK092090|acc=GXP_4216|taxid=9606| spec=Homo
sapiens|chr=19|ctg=NC_000019|str=(-)|
start=9533632|end=9534232|len=601|tss=501| descr=Homo sapiens cDNA
FLJ34771 fis, clone NT2NE2003150.| comm=GXT_2771591/AK096729/501/gold'
Has some other formatting error prevented Biopython from breaking this
up for me, or is this the expected behaviour? I'm using Biopython 1.49.
It's not a big deal, I was just wondering.
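In the meantime, the manual parsing amounts to something like the sketch below. It assumes the description is a "|"-separated list of key=value pairs, that no value contains a "|" itself, and that any line breaks inside the string are just wrapping. The function name `parse_description` is hypothetical and not part of Biopython.

```python
def parse_description(description):
    """Split a 'key=value|key=value|...' description string into a dict.

    Assumptions: fields are separated by '|', each field has the form
    key=value, and values never contain '|'.
    """
    # Collapse wrapped lines and repeated spaces into single spaces.
    flat = " ".join(description.split())
    fields = {}
    for chunk in flat.split("|"):
        key, sep, value = chunk.strip().partition("=")
        if sep:  # skip empty chunks or chunks without '='
            fields[key.strip()] = value.strip()
    return fields


example = (
    "loc=GXL_3369|sym=AK092090|acc=GXP_4216|taxid=9606| spec=Homo\n"
    "sapiens|chr=19|ctg=NC_000019|str=(-)|\n"
    "start=9533632|end=9534232|len=601|tss=501| descr=Homo sapiens cDNA\n"
    "FLJ34771 fis, clone NT2NE2003150.| comm=GXT_2771591/AK096729/501/gold"
)
fields = parse_description(example)
```

All values come back as strings, so numeric fields such as start, end and len still need converting with int() if you want to do arithmetic on them.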
Cheers,
Peter