[Bioperl-l] parsing DEFINITION field in GenBank entries?

Dave Lewis ddlewis3@worldnet.att.net
Sun, 24 Jun 2001 07:37:08 -0500


Hi - I'm wondering if anyone has attempted to parse (using perl or
otherwise) the DEFINITION field of GenBank entries.  Here are some examples:

DEFINITION  Papio hamadryas cynocephalus MHC class II antigen DQ-alpha MHC-DQA
            gene (MHC-DQA*AMB-2 allele), exon 2 and partial cds.

DEFINITION  Homo sapiens inducible
            6-phosphofructo-2-kinase/fructose-2,6-bisphosphatase (IPFK2) gene,
            partial cds.

DEFINITION  Homo sapiens SBBI12 mRNA, complete cds.

DEFINITION  Sequence 22 from Patent WO0100669.

The field is only semi-formatted, so this would in general be a heuristic
pattern-matching / natural language processing problem.  It wouldn't be
possible to do perfectly, but one might be able to do a reasonable job of
pulling out gene names and gene symbols when they are there, or at least
eliminating parts of the entry that aren't those things.
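
To make the idea concrete, here is a rough sketch of the kind of heuristic
I have in mind, written in Python only for brevity.  Everything in it is an
assumption on my part: the function name parse_definition, the regular
expressions, and the guess that a gene symbol is either a parenthesized
token right before "gene" or the single token right before "gene" or
"mRNA".  It handles the four examples above and will certainly miss or
mislabel many real entries.

```python
import re

# Illustrative patterns only; none of these come from a GenBank specification.
PATENT_RE = re.compile(r"^Sequence (\d+) from Patent (\w+)\.?$")
PAREN_SYMBOL_RE = re.compile(r"\(([^()\s]+)\)\s+gene\b")    # "(IPFK2) gene"
BARE_SYMBOL_RE = re.compile(r"(\S+)\s+(?:gene|mRNA)\b")     # "SBBI12 mRNA"
CDS_RE = re.compile(r"\b(complete|partial)\s+cds\b")


def parse_definition(raw):
    """Guess a gene symbol and cds completeness from a DEFINITION field."""
    # Join continuation lines and collapse runs of whitespace.
    text = " ".join(raw.split())
    if text.startswith("DEFINITION "):
        text = text[len("DEFINITION "):]

    result = {"symbol": None, "cds": None, "patent": None}

    # Patent entries carry no gene information, so stop early.
    m = PATENT_RE.match(text)
    if m:
        result["patent"] = m.group(2)
        return result

    # Prefer a parenthesized symbol; otherwise take the token before the keyword.
    m = PAREN_SYMBOL_RE.search(text) or BARE_SYMBOL_RE.search(text)
    if m:
        result["symbol"] = m.group(1)

    m = CDS_RE.search(text)
    if m:
        result["cds"] = m.group(1)
    return result
```

Calling parse_definition("DEFINITION  Homo sapiens SBBI12 mRNA, complete cds.")
gives a symbol of "SBBI12" and a cds value of "complete"; the patent example
gives only the patent number.  Note that it does not try to separate the
organism name from the gene name, which is probably the harder part.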

Regards, Dave

David D. Lewis, Ph.D.
858 W. Armitage Ave., #296
Chicago, IL 60614   USA
ph. 773-975-0304; fax 773 442-0262
http://www.DavidDLewis.com