[Biopython-dev] GenBank parser -- first go

Wed Dec 6 03:10:36 EST 2000

Brad:
>I spent this weekend getting together a GenBank parser,
>which I hope is something that we could include in Biopython in the
>future.

Wow!  I'm glad that people other than me can use it.  I've been
working on it for so long now that I don't have a good idea of what it
means to come into it from scratch.

>we all definately have to give
>Andrew another big pat on the back for his awesome tool :-).

Thank you.

>5. Can the code be speeded up/improved in any ways? Suggestions to
>help me code better are always very welcome!

I look over it and it seems quite good.  I do have some comments,
which I've included here.  (Most of the points are relevant to
Python and Martel programming, so I've sent it to the list instead
of just you directly.)

You use

cur_record = iterator.next()
while cur_record:
  ...
  cur_record = iterator.next()

The standard Python idiom is

while 1:
  cur_record = iterator.next()
  if not cur_record:
    break
  ...

Instead of
  indent_space = Martel.RepN(Martel.Str(" "), 2)

it is better to do (note: two spaces)
  indent_space = Martel.Str("  ")

They give the same result, but "  " gets checked with a single
test while two " "s is done with two tests.  This is even more
appropriate with
  qualifier_space = Martel.RepN(Martel.Str(" "), 21)

(Actually, there is a Martel.optimize module which contains the
function 'merge_strings'.  It should merge " "," " into " " but it
isn't automatically used.)

As an aside, I can see I've been focused on regexps compared to
you.  You have
  blank_space = Martel.Rep1(Martel.Str(" "))
where I would use
  blank_space = Martel.Re(" *")
There's no difference in implementation - the decision on which one to
use is based on readability/usability.  I (sadly?) know a lot about
regexps so I probably find them more usable :)

Case in point, I probably would have written the LOCUS definition with
more regexps.

def choice(tag, words):
  exp_list = map(Martel.Str, words)
  return Martel.Group(tag, exp_list)

residue_types = choice("residue_type",
                       ["DNA", "RNA", "mRNA", "PROTEIN"])

data_file_division = choice("data_file_division",
              ["PRI", "ROD", "MAM", "VRT", "INV", "PLN", "BCT", "RNA",
               "VRL", "PHG", "SYN", "UNA", "EST", "PAT", "STS", "GSS",
               "HTG"])

date = Martel.Group("date",
     Martel.Re("(?P<day>\d+)-(?P<month>[A-Z]+)-(?P<year>\d+)))

locus_line = Martel.Re("LOCUS +(?P<locus>\w+) +(?P<size>\d+) bp +") + \
             residue_types + blank_space + data_file_division + \
             blank_space + date + Martel.AnyEol()

Interestingly, I didn't use a regexp for residue_types like you do.
That's because I worry about people looking at a list of strings and
thinking they can add arbitrary text - forgetting about escapeing
regex characters.

Skipping ahead, the 'valid_f_keys' might someday include characters
like '+' and '.', so you really should use a function like the one I
gave above.

Hmmm.  I've been pickier than you about ignoring the leading
whitespace.  For example, you have

# definition line
# DEFINITION  Genomic sequence for Arabidopsis thaliana BAC T25K16 from
#             chromosome I, complete sequence.

definition = Martel.Group("definition",
                          Martel.Rep1(blank_space +
                                      Martel.ToEol()))

definition_line = Martel.Group("definition_line",
                               Martel.Str("DEFINITION") +
                               definition)

By comparison, I would enforced that the text be folded to start on
column 13 using

definition_line = Martel.Group("definition_line",
             Martel.Str("DEFINITION ") + Martel.ToEol("definition") + \
  Martel.Rep(Martel.Str("           ") + Martel.ToEol("definition")))

Here's a justification for this.  It's already common practice with
GenBank files to have subitems indented under the major item.  For
example,

SOURCE      thale cress.
  ORGANISM  Arabidopsis thaliana

Suppose some day the powers that be add a subitem to the DEFINITION
DEFINITION  Genomic sequence for Arabidopsis thaliana BAC T25K16 from
            chromosome I, complete sequence.
  BLAH      Abcd Ef Ghijkl.

I consider it a good thing for the parser to break at this point,
rather than include the "BLAH      Abcd Ef Ghijkl." text as part of
the definition.

It seems the form of
  "LABEL"    text starts on column 13
             and may fold over multiple
             lines

is pretty common.  If that's the case, you can make things simpler by
using a functions to make the definition.

INDENT = 12
def make_line(label, line_name, data_name):
    assert len(label) < INDENT, "label text too long"
    first_line = Martel.Str(label + " " * (INDENT - len(label))) + \
                 Martel.ToEol(data_name)
    other_lines = Martel.Str(" " * INDENT) + \
                  Martel.ToEol(data_name)
    return Martel.Group(line_name, first_line + Rep(other_lines))

definition_line = make_line("DEFINITION", "definition_line",
                            "definition")

accession_line = make_line("ACCESSION", "accession_line", "accession")

(If you were really trusting you could use just the line label and
string.lower it to get the data name, and add "_line" to that to get
the line name.  I'm usually more explicit than that.)

\w includes \d so you don't need to do [\w\d] for the nid

For that matter, [\d]+ is the same as \d+ (as in the gi)

I'm still undecided about the AtBeginning and AtEnd commands.  (You
use the former in the version_line definition.)  I don't know if they
should test for beginning/end of line or beginning/end of input text.
In fact, I would rather not have them at all, since they don't work
well with the RecordReader idea, where the text is broken up into
parts.  There's no real need for AtBeginning here so you can remove
it.

With the function definition above, you can also replace

  keywords_line = make_line("KEYWORDS", "keywords_line", "keywords")
  source_line = make_line("SOURCE", "source_line", "source")

Out of curiosity, why do arab1.gb and cor6_6.gb have different
ORGANISM lines?

SOURCE      thale cress.
  ORGANISM  Arabidopsis thaliana
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta;
Tracheophyta;
            euphyllophytes; Spermatophyta; Magnoliophyta; eudicotyledons;
            Rosidae; Capparales; Brassicaceae; Arabidopsis.

SOURCE      thale cress.
  ORGANISM  Arabidopsis thaliana
            Eukaryota; Viridiplantae; Embryophyta; Tracheophyta;
Spermatophyta;
            Magnoliophyta; eudicotyledons; core eudicots; Rosidae; eurosids
II;
            Brassicales; Brassicaceae; Arabidopsis.

The first has "Streptophyta", "euphyllophytes" and "Capparales".

The second has "core eudicots", "eurosids II" and "Brassicales".

Martel includes a helper function called 'Integer' which can simplify
some of your definitions, as with

reference_num = Martel.Integer("reference_num")
pubmed_id = Martel.Integer("pubmed_id")

Here's another preference of mine.  You have

  sequence = Martel.Group("sequence",
                          Martel.Re("[\w ]+"))

where I would do
  sequence = Martel.ToEol("sequence")

The difference is that you only accept a-zA-Z0-9_ and the space
character.  It doesn't accept "-" or "*", which I can see possibly
getting into the data.  Since there's no need to validate the
characters, might as well just consume anything you find.

(Note to self - need to use \R for the ToEol definition.)

Looking at GenBank.py, you use the record definition thusly:

        parser = genbank_format.record.make_parser(debug_level = debug)
        parser.setContentHandler(_EventGenerator(consumer))
        parser.setErrorHandler(handler.ErrorHandler())

        parser.parseFile(handle)

You really should cache the created parser.  It can take quite some
time to generate.  During my PIR tests, about 98% of time of some of
the tests were spent doing generation rather than parsing.

I think you do it because you want to allow different debug levels.
You can still support that in a couple of ways.  Here's one:

class _Scanner:
    def __init__(self):
        self._cached_parsers = {}
    def feed(self, handler, consumer, debug = 0)
        parser = self._cached_parsers.get(debug)
        if parser is None:
            parser = self._cached_parsers[debug] = \
                genbank_format.record.make_parser(debug_level = debug)

        parser.setContentHandler(_EventGenerator(consumer))
        parser.setErrorHandler(handler.ErrorHandler())
        parser.parseFile(handle)

Here's another which is tuned for smaller memory use and the
assumption that you almost never change debug levels.

class _Scanner:
    def __init__(self):
        self._cached_parser = None
        self._cached_debug = None
    def feed(self, handler, consumer, debug = 0)
        if self._cached_debug == debug:
            parser = self._cached_parser
        else:
            parser = self._cached_parser = \
                genbank_format.record.make_parser(debug_level = debug)
            self._cached_debug = debug

        parser.setContentHandler(_EventGenerator(consumer))
        parser.setErrorHandler(handler.ErrorHandler())
        parser.parseFile(handle)

You have a list of tags you are interested in receiving.  Martel has a
function to create a new expression tree based but only sending back
the events you are interested in receiving.

   expression = genbank_format.record
   expression = Martel.select_names(expression, interest_tags)
   parser = expression.make_parser()

BTW, this was implemented because Python's function call overhead is
pretty large so I wanted a way to reduce the number of calls if I knew
an event wasn't needed.

Replace
         fun_to_call = eval("self._consumer" + "." + name)
with
         fun_to_call = getattr("self._consumer",  name)

You should also do
         fun_to_call(info_to_pass)
instead of
         apply(fun_to_call, (info_to_pass,))

Here's a cute implementation for _PrintConsumer (untested)

# Let's you define new labels if you want them
event_name_converter = {
  "start_feature_table": "Starting feature table",
  "record_end": "End of Record!",
}

class _PrintWrapper:
  def __init__(self, name):
    self.name = name
  def __call__(self, content):
    print "%s: %s" % (self.name, content)

class _PrintConsumer:
  def __init__(self):
    self.data = 'blah'
  def __getattr__(self, name):
    if name[:2] == "__":
      raise AttributeError, name
    return _PrintWrapper(event_name_converter.get(name, name))

                    Andrew
                    dalke at acm.org