[BioPython] Genbank LOCUS line slightly misaligned
Peter
biopython at maubp.freeserve.co.uk
Thu Dec 18 16:01:10 UTC 2008
On Thu, Dec 18, 2008 at 3:25 PM, Peter Saffrey <pzs at dcs.gla.ac.uk> wrote:
> Thanks for your prompt reply.
>
> Peter wrote:
>>
>> I suspect that they (Genomatrix) are inserting a large locus
>> identifier into the beginning of the LOCUS line which is sometimes
>> bigger than the allocated slot, pushing the rest of the fields out of
>> position in some of the files. I'd need to see several examples to be
>> confident about this guess.
>>
>
> That sounds about right. Here's a sample:
>
> $ grep LOCUS skurukutipromo.gb | head
> LOCUS GXP_4216 601 bp DNA
> LOCUS GXP_4217 601 bp DNA
> LOCUS GXP_4220 601 bp DNA
> LOCUS GXP_4226 603 bp DNA
> LOCUS GXP_1485624 601 bp DNA
> LOCUS GXP_1485625 601 bp DNA
> LOCUS GXP_4230 601 bp DNA
> LOCUS GXP_4253 640 bp DNA
> LOCUS GXP_648168 662 bp DNA
> LOCUS GXP_4281 601 bp DNA
>
> It's a bit careless on their part, but who listens to standards anyway? ;)
Writing general output to GenBank format is tricky if you have long
record identifiers.
>> If you don't actually need much information from the LOCUS line, you
>> might find it easier to hack our parser to be a little more tolerant -
>> I would suggest simply pulling out the locus ID, ignoring the rest of
>> the LOCUS line, and printing a warning.
>
> I already did a regex on the file itself to excise everything after the
> locus id, which put an end to the complaints.
If you're happy, that's fine.
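
For anyone else hitting this, the kind of clean-up described above could be sketched roughly as follows. This is only an illustration, not necessarily the regex that was actually used: the helper names are made up, and it assumes the locus ID is the first whitespace-delimited token after LOCUS.

```python
import re

# Hypothetical pattern: capture "LOCUS" plus the identifier, then match
# (and discard) whatever follows on the same line.
LOCUS_RE = re.compile(r"^(LOCUS\s+\S+).*$")


def trim_locus_line(line):
    """Return the line with everything after the locus ID removed.

    Lines that do not start with LOCUS are returned unchanged.
    """
    return LOCUS_RE.sub(r"\1", line)


def trim_locus_lines(lines):
    """Apply trim_locus_line to each line of an iterable of lines."""
    for line in lines:
        yield trim_locus_line(line)
```

Note that this throws away the sequence length and molecule type from the LOCUS line, so it only makes sense if you don't need them.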
> I'm also finding I have to manually parse the description entry, which comes
> out in one big lump like this:
>
> 'loc=GXL_3369|sym=AK092090|acc=GXP_4216|taxid=9606| spec=Homo
> sapiens|chr=19|ctg=NC_000019|str=(-)|
> start=9533632|end=9534232|len=601|tss=501| descr=Homo sapiens cDNA FLJ34771
> fis, clone NT2NE2003150.| comm=GXT_2771591/AK096729/501/gold'
What did the DEFINITION lines look like? It's usually just a long
string like "species name, complete genome" spanning one or more
lines. Here I'm guessing Genomatrix are sticking a whole load of
metadata into this field using their own convention. This is a bit odd,
but I think I've also seen similar extra data dumped into the COMMENT
lines by other programs.
> Has some other formatting error prevented biopython from breaking this up
> for me, or is this the expected behaviour? I'm using biopython1.49. It's not
> a big deal, I was just wondering.
I think that's the expected behaviour: the DEFINITION lines become
the record's description property (a simple string).
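
If you do want the individual fields, splitting that string yourself should be straightforward. Here is a minimal sketch, assuming the convention seen in the sample above (fields separated by "|", each of the form key=value, with stray whitespace from line wrapping). The function name is made up for illustration, and it will break if a value ever contains a "|" itself.

```python
def parse_description(description):
    """Split a 'key=value|key=value|...' description into a dict.

    Assumes the layout seen in the sample above; this is not part of
    Biopython, just a hypothetical helper.
    """
    fields = {}
    for chunk in description.split("|"):
        # Collapse whitespace left over from wrapped DEFINITION lines.
        chunk = " ".join(chunk.split())
        if not chunk:
            continue
        key, sep, value = chunk.partition("=")
        if sep:  # skip anything that is not key=value
            fields[key] = value
    return fields
```

You would call it on the description string of each record, e.g. parse_description(record.description), and then look up fields such as "chr" or "start" by name. All values come back as strings, so convert the numeric ones yourself.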
Peter