[Biopython-dev] [Bug 2745] Bio.GenBank.LocationParserError with a GenBank CON file
bugzilla-daemon at portal.open-bio.org
Fri Jan 30 15:11:56 UTC 2009
http://bugzilla.open-bio.org/show_bug.cgi?id=2745
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-01-30 10:11 EST -------
It's the "gap(unk100)" entries which are breaking the location parser in
Bruce's examples. Similarly, even "gap()" entries of unknown length, as in
this record, will fail:
LOCUS AH007743 7832 bp DNA CON 26-MAY-1999
DEFINITION Gallus gallus ornithine transcarbamylase (OTC) gene, complete cds.
ACCESSION AH007743
VERSION AH007743.1 GI:4927367
KEYWORDS .
SOURCE chicken.
ORGANISM Gallus gallus
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Archosauria;
Aves; Neognathae; Galliformes; Phasianidae; Phasianinae; Gallus.
[....]
FEATURES Location/Qualifiers
source 1..7832
/organism="Gallus gallus"
/db_xref="taxon:9031"
/chromosome="1"
CONTIG join(AF065630.1:1..1903,gap(),AF065631.1:1..435,gap(),
AF065632.1:1..509,gap(),AF065633.1:1..722,gap(),AF065634.1:1..707,
gap(),AF065635.1:1..836,gap(),AF065636.1:1..1614,gap(),
AF065637.1:1..605,gap(),AF065638.1:1..501)
//
This example is based on ftp://ftp.ncbi.nih.gov/genbank/README.genbank although
that file does not describe the new terms. Older versions of the release notes
do, e.g. ftp://ftp.ncbi.nih.gov/genbank/release.notes/gb168.release.notes
========================= [start quote] =========================
3.4.15 CONTIG Format
As an alternative to SEQUENCE, a CONTIG record can be present
following the ORIGIN record. A join() statement utilizing a syntax
similar to that of feature locations (see the Feature Table specification
mentioned in Section 3.4.12) provides the accession numbers and basepair
ranges of other GenBank sequences which contribute to a large-scale
biological object, such as a chromosome or complete genome. Here is
an example of the use of CONTIG :
CONTIG join(AE003590.3:1..305900,AE003589.4:61..306076,
AE003588.3:61..308447,AE003587.4:61..314549,AE003586.3:61..306696,
AE003585.5:61..343161,AE003584.5:61..346734,AE003583.3:101..303641,
[ lines removed for brevity ]
AE003782.4:61..298116,AE003783.3:16..111706,AE002603.3:61..143856)
However, the CONTIG join() statement can also utilize a special operator
which is *not* part of the syntax for feature locations:
gap() : Gap of unknown length.
gap(X) : Gap with an estimated integer length of X bases.
To be represented as a run of n's of length X
in the sequence that can be constructed from
the CONTIG line join() statement .
gap(unkX) : Gap of unknown length, which is to be represented
as an integer number (X) of n's in the sequence that
can be constructed from the CONTIG line join()
statement.
The value of this gap operator consists of the
literal characters 'unk', followed by an integer.
Here is an example of a CONTIG line join() that utilizes the gap() operator:
CONTIG join(complement(AADE01002756.1:1..10234),gap(1206),
AADE01006160.1:1..1963,gap(323),AADE01002525.1:1..11915,gap(1633),
AADE01005641.1:1..2377)
The first and last elements of the join() statement may be a gap() operator.
But if so, then those gaps should represent telomeres, centromeres, etc.
Consecutive gap() operators are illegal.
========================= [end quote] =========================
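To make the three gap forms quoted above concrete, here is a minimal sketch of how they could be recognised with a regular expression. The function name parse_gap and its return convention are hypothetical, purely for illustration, and are not part of Biopython:

```python
import re

# Hypothetical helper for illustration only; not part of Biopython.
# Matches gap(), gap(X) and gap(unkX) as described in the release notes.
_GAP_RE = re.compile(r"^gap\((unk)?(\d*)\)$")


def parse_gap(token):
    """Return (length, length_is_known) for a gap operator.

    Returns None if the token is not a gap operator at all
    (e.g. an accession range such as AF065630.1:1..1903).
    """
    match = _GAP_RE.match(token.strip())
    if match is None:
        return None
    unk, digits = match.groups()
    if unk and not digits:
        # The notes say 'unk' must be followed by an integer.
        raise ValueError("'unk' must be followed by an integer: %r" % token)
    if not digits:
        # gap(): unknown length, no placeholder size given.
        return (None, False)
    # gap(X) has a known (estimated) length; gap(unkX) is unknown but
    # should be represented as X n's.
    return (int(digits), unk is None)
```

With this convention, "gap(unk100)" gives (100, False), i.e. write 100 n's but treat the true length as unknown, while "gap(1206)" gives (1206, True).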
Evidently Biopython doesn't cope with these CONTIG lines, but then they do
use a different syntax from the feature locations. I never understood why the
current code tries to parse the CONTIG string into a SeqFeature object in the
first place.
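As a rough illustration of the alternative, the CONTIG value could be treated as a plain string and split into its top-level elements without going near the feature location parser. This is only a sketch under that assumption; split_contig is a hypothetical name, not an existing Biopython function:

```python
def split_contig(contig):
    """Split a CONTIG join(...) string into its top-level elements.

    Hypothetical helper for illustration only; not part of Biopython.
    """
    # The CONTIG value is wrapped over several lines in the flat file,
    # so drop all whitespace first.
    text = "".join(contig.split())
    if not (text.startswith("join(") and text.endswith(")")):
        raise ValueError("expected a join(...) statement")
    inner = text[len("join("):-1]

    # Split on commas that are not nested inside complement(...) or gap(...).
    elements = []
    depth = 0
    start = 0
    for i, char in enumerate(inner):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        elif char == "," and depth == 0:
            elements.append(inner[start:i])
            start = i + 1
    elements.append(inner[start:])

    # The release notes say consecutive gap() operators are illegal.
    for first, second in zip(elements, elements[1:]):
        if first.startswith("gap(") and second.startswith("gap("):
            raise ValueError("consecutive gap() operators are illegal")
    return elements
```

For the AH007743 record above this would give a list of 17 strings, alternating accession ranges and "gap()" entries, which could then be stored as-is or interpreted separately.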