[BioPython] GenBank parser

Leighton Pritchard lpritc at scri.sari.ac.uk
Thu Apr 29 10:03:14 EDT 2004


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

I've noticed an oddity in the GenBank FeatureParser (CVS installation
19/4).  While parsing the Salmonella typhi file NC_003198.gbk, my way of
dealing with 'gene' tags fell over.  This turned out to be because the
GenBank file contains entries with valueless tags such as /partial and
/pseudo.  The current parser concatenates each of these tags with the tag
that follows it, e.g. for:

     CDS             1449249..1450391
                     /partial
                     /gene="fdnG"
                     /note="Similar to part of Escherichia coli formate
                     dehydrogenase, nitrate-inducible, major subunit fdnG
                     SW:FDNG_ECOLI (P24183; P78261) (1015 aa) fasta scores:
                     E(): 0, 94.4% id in 376 aa"
                     /pseudo
                     /codon_start=1
                     /transl_table=11

it returns a set of qualifiers which includes the tags "partial gene" and
"pseudo codon_start".  This probably isn't what was intended by the
authors ;)
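
To make the expected behaviour concrete, here is a rough sketch of qualifier
parsing that keeps valueless tags as separate keys.  This is only an
illustration and is NOT the Biopython parser code: the function name
parse_qualifiers, the use of an empty string for valueless tags, and the
assumption that continuation lines never start with '/' are all made up for
the example.

```python
def parse_qualifiers(lines):
    """Parse GenBank qualifier lines into a dict (illustrative sketch only)."""
    qualifiers = {}
    key = None
    for raw in lines:
        text = raw.strip()
        if text.startswith('/'):
            # A new qualifier starts here; split off a value if one is present
            name, sep, value = text[1:].partition('=')
            key = name
            # Valueless qualifiers such as /partial get an empty-string value
            qualifiers[key] = value if sep else ''
        elif key is not None:
            # Continuation line: append it to the current qualifier's value
            qualifiers[key] += ' ' + text
    # Strip the surrounding double quotes from quoted values
    return {k: v.strip('"') for k, v in qualifiers.items()}


sample = [
    '    /partial',
    '    /gene="fdnG"',
    '    /note="Similar to part of Escherichia coli formate',
    '    dehydrogenase, nitrate-inducible, major subunit fdnG',
    '    SW:FDNG_ECOLI (P24183; P78261) (1015 aa) fasta scores:',
    '    E(): 0, 94.4% id in 376 aa"',
    '    /pseudo',
    '    /codon_start=1',
    '    /transl_table=11',
]
parsed = parse_qualifiers(sample)
```

With this approach 'partial' and 'pseudo' end up as keys of their own, and
'gene' and 'codon_start' keep their values.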

I haven't got a fix for the parser, but my workaround in the code was:

##################

qualifiers = cds.qualifiers             # Shorthand for qualifiers
# We need to account for use of qualifiers, e.g. in
# NC_003198.gbk, the /partial and /pseudo tags often have no
# associated value - the BioPython GenBank feature parser lumps the
# two together into a single tag, e.g. 'partial gene' and
# 'pseudo codon_start'.  This buggers up our processing below,
# so the solution is to split tags by the ' ' space character,
# and add a qualifier comprising only the last item in the
# resulting list
for key in list(qualifiers.keys()):     # Copy the keys: we add entries below
    if ' ' in key:
        qualifiers[key.split(' ')[-1]] = qualifiers[key]

###################

...I wasn't bothered about the partial or pseudo tags for my script, so the
merged keys are simply left in place alongside the new ones.
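
If the valueless tags do matter, a variant of the same workaround could record
them as flags and drop the merged key.  This is a sketch only: the helper name
split_merged_keys is hypothetical, it works on a plain dict rather than a real
feature object, and it assumes every word before the last one in a merged key
is a valueless tag.

```python
def split_merged_keys(qualifiers):
    """Return a new dict with keys like 'partial gene' separated out."""
    fixed = {}
    for key, value in qualifiers.items():
        parts = key.split()
        if not parts:
            continue  # Skip empty keys rather than guess what they mean
        # Assumption: all words before the last are valueless flags
        for flag in parts[:-1]:
            fixed.setdefault(flag, '')
        # Assumption: the last word is the tag that owns the value
        fixed[parts[-1]] = value
    return fixed


# Made-up qualifiers mimicking the merged keys described above
merged = {
    'partial gene': 'fdnG',
    'pseudo codon_start': '1',
    'transl_table': '11',
}
cleaned = split_merged_keys(merged)
```

Building a new dict avoids changing the original while looping over it, and
means 'partial gene' does not linger next to 'gene'.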

- --
Dr Leighton Pritchard AMRSC
D104, PPI, Scottish Crop Research Institute
Invergowrie, Dundee, DD2 5DA, Scotland, UK
E: lpritc at scri.sari.ac.uk	W: http://bioinf.scri.sari.ac.uk/index.shtml
T: +44 (0)1382 568579		F: +44 (0)1382 568578
PGP key FEFC205C: GPG key E58BA41B: http://www.keyserver.net
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFAkQsiL1gZ+OWLpBsRAg2mAJkBe3EvfNiygGEwsJ4i5wwA85t5DwCfVfPp
nFoRXTGoAdrq8shnfhSPjuA=
=P60G
-----END PGP SIGNATURE-----


