[Biopython-dev] LocationParserError
Peter Cock
p.j.a.cock at googlemail.com
Fri Mar 9 06:23:58 EST 2012
On Fri, Mar 9, 2012 at 10:06 AM, Matthias Bernt <MatatTHC at gmx.de> wrote:
> Just in case you need more test cases. I send all cases I found (all
> in mitochondria).
With the current release (Biopython 1.59) I didn't get an exception
for NC_016406, but something wasn't quite right: the external exon
was missing, which appears to be a bug in Entrez.
Here is NC_016406 from Entrez using GenBank (with parts):
http://www.ncbi.nlm.nih.gov/nuccore/NC_016406.1?report=gbwithparts&log$=seqview
     gene            join(complement(149815..150200),
                     complement(293787..295573),181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /db_xref="GeneID:11447159"
     CDS             join(complement(149815..150200),
                     complement(295492..295573),complement(293787..293978),
                     181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /codon_start=1
                     /transl_except=(pos:complement(150198..150200),aa:Met)
                     /product="NADH dehydrogenase subunit 1"
                     /protein_id="YP_004935334.1"
                     /db_xref="GI:357967323"
                     /db_xref="GeneID:11447159"
                     /translation="MYIAVPAEILGIILPLLLGVAFLVLAERKVMAFVQRRKGPDVVG
                     SFGLLQPLADGSKLILKEPISPSSANFSLFRMAPVTTFMLSLVARAVVPFDYGMVLSD
                     PNIGLLYLFAISSLGVYGIIIAGWSSNSKYAFLGALRSAAQMVPYEVSIGLILITVLI
                     CVGPRNSSEIVMAQKQIWSGIPLFPVLVMFFISCLAETNRAPFDLPEAEAELVAGYNV
                     EYSSMGSALFFLGEYANMILMSGLCTLLSPGGWPPILDLPISKKIPGSIWFSIKVILF
                     LFLYIWVRAAFPRYRYDQLMGLGRKVFLPLSLARVVAVSGVLVTFQWLP"
Here is NC_016406 from Entrez using GenBank (default, not with parts):
http://www.ncbi.nlm.nih.gov/nuccore/NC_016406.1?report=genbank&log$=seqview
     gene            join(complement(149815..150200),
                     complement(293787..295573),NC_016402.1:6618..6676,
                     181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /db_xref="GeneID:11447159"
     CDS             join(complement(149815..150200),
                     complement(295492..295573),complement(293787..293978),
                     NC_016402.1:6618..6676,181647..181905)
                     /gene="nad1"
                     /trans_splicing
                     /note="exons 1, 2, 3, and 5 on chromosome 1 are
                     trans-spliced with exon 4 on chromosome 3 to form the
                     complete coding region"
                     /codon_start=1
                     /transl_except=(pos:complement(150198..150200),aa:Met)
                     /product="NADH dehydrogenase subunit 1"
                     /protein_id="YP_004935334.1"
                     /db_xref="GI:357967323"
                     /db_xref="GeneID:11447159"
                     /translation="MYIAVPAEILGIILPLLLGVAFLVLAERKVMAFVQRRKGPDVVG
                     SFGLLQPLADGSKLILKEPISPSSANFSLFRMAPVTTFMLSLVARAVVPFDYGMVLSD
                     PNIGLLYLFAISSLGVYGIIIAGWSSNSKYAFLGALRSAAQMVPYEVSIGLILITVLI
                     CVGPRNSSEIVMAQKQIWSGIPLFPVLVMFFISCLAETNRAPFDLPEAEAELVAGYNV
                     EYSSMGSALFFLGEYANMILMSGLCTLLSPGGWPPILDLPISKKIPGSIWFSIKVILF
                     LFLYIWVRAAFPRYRYDQLMGLGRKVFLPLSLARVVAVSGVLVTFQWLP"
Can you see the difference? Using GenBank "with parts", the external
location (NC_016402.1:6618..6676) in the gene and CDS features has
been lost! I will report this bug to the NCBI.
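
The difference between the two CDS locations can also be checked
mechanically. The following is only a rough sketch: split_join is a
hypothetical helper written for this illustration (it is not part of
Biopython), and it only handles a single top-level join(...).

```python
def split_join(location):
    """Split a join(...) location string into its top-level parts.

    Hypothetical helper for illustration only, not a Biopython function.
    """
    location = "".join(location.split())  # drop whitespace and line breaks
    if not (location.startswith("join(") and location.endswith(")")):
        return [location]
    inner = location[len("join("):-1]
    parts, depth, current = [], 0, ""
    for char in inner:
        # Only split on commas that are not nested inside complement(...)
        if char == "," and depth == 0:
            parts.append(current)
            current = ""
            continue
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        current += char
    parts.append(current)
    return parts


# CDS location strings copied from the two Entrez views quoted above
with_parts = """join(complement(149815..150200),
complement(295492..295573),complement(293787..293978),
181647..181905)"""
default = """join(complement(149815..150200),
complement(295492..295573),complement(293787..293978),
NC_016402.1:6618..6676,181647..181905)"""

missing = [p for p in split_join(default) if p not in split_join(with_parts)]
print(missing)
```

Here missing holds the one part that is present in the default view
but absent from the "with parts" view.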
However, with that hurdle out of the way I found the problem in
Biopython: the regular expression for an external sequence
reference did not allow an underscore, so identifiers such as
NC_016402.1 failed to match. The fix itself is trivial. In
Bio/GenBank/__init__.py we replace this line:
_complex_location = r"([a-zA-z][a-zA-Z0-9]*(\.[a-zA-Z0-9]+)?\:)?(%s|%s|%s|%s|%s)" \
with:
_complex_location = r"([a-zA-z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+)?\:)?(%s|%s|%s|%s|%s)" \
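
The effect of the added underscore can be seen in a small standalone
sketch. It makes two assumptions that are not the real Biopython code:
the %s alternatives are replaced by a plain start..end range
(SIMPLE_RANGE), and the first character class is written as [a-zA-Z].

```python
import re

# Simplified stand-in for the (%s|%s|%s|%s|%s) alternatives; the real
# pattern accepts more location forms than a plain start..end pair.
SIMPLE_RANGE = r"\d+\.\.\d+"

# Optional "accession.version:" prefix, before and after the fix.
# Assumption: [a-zA-Z] is used for the first character here.
OLD_REF = r"([a-zA-Z][a-zA-Z0-9]*(\.[a-zA-Z0-9]+)?\:)?"
NEW_REF = r"([a-zA-Z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+)?\:)?"

old_pattern = re.compile("^" + OLD_REF + SIMPLE_RANGE + "$")
new_pattern = re.compile("^" + NEW_REF + SIMPLE_RANGE + "$")

external = "NC_016402.1:6618..6676"  # external reference from the record
local = "181647..181905"             # ordinary local location

# The old prefix stops at the underscore, so the whole match fails.
print(old_pattern.match(external))            # None
# The new prefix accepts the underscore and captures the reference.
print(new_pattern.match(external).group(1))   # NC_016402.1:
# Local locations without a prefix match either way.
print(bool(old_pattern.match(local)), bool(new_pattern.match(local)))
```
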
The following commit makes this change, adds a few tests, and fixes a typo:
https://github.com/biopython/biopython/commit/16efc7bc51b5ccef7f81f443d4b52f490f6fc354
If you are happy installing from source, you can download the
latest code from GitHub, or via git at the command line.
Peter