[Biopython-dev] neat Martel trick

Wed Jan 17 03:51:44 EST 2001

Forgot to point out the prodoc parser in the 0.5 release
has a neat trick.

> This parser does extra work to identify [footnote,records], <PRODOC>
> and (EC) links, which has about an extra 50% impact on performance (80
> seconds instead of 50).

reference_note = Martel.Group("reference_note",
                 Martel.Re(r"\[(?P<note>E?\d+)(,(?P<note>E?\d+))*\]"))

prodoc_link = Martel.Group("prodoc_link",
             Martel.Str("<") + prodoc_num + Martel.Str(">"))

# This is incomplete and doesn't allow things like 1.1.-.-
ec_link = Martel.Group("ec_link",
    Martel.Re(r"\(EC *(?P<ec_number>[1-9][0-9]*\.[1-9][0-9]*"
              r"\.[1-9][0-9]*\.([1-9][0-9]*|-))\)"))

generic_text = reference_note | \
               prodoc_link | \
               ec_link | \
               Martel.Re(r"[^\R]")

What this all says is, if a given character position starts
a substring looking like:
 -- "[1]" or "[2,3]" then it's a footnote reference
 -- "<PDOC00001>" then it's a PRODOC reference
 -- "(EC 1.2.3.4)" then it's an E.C. number
otherwise that character is just a regular character.
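
For anyone who wants to try the idea without Martel, here is a rough
sketch of the same "try the special patterns first, otherwise take one
character" logic using Python's standard re module. It is only a
stand-in, not what the prodoc parser does internally. The names MARKUP
and tokenize are made up for this example, the "PDOC" plus digits form
of a PRODOC number is an assumption (prodoc_num is not shown above),
and the footnote list is captured as a single group because re does
not accept the same group name twice in one pattern.

```python
import re

# Stand-in for the Martel expression, written for the stdlib "re" module.
# Alternatives are tried left to right at each position; the last one
# accepts any single character that is not a line ending.
MARKUP = re.compile(r"""
    (?P<reference_note>\[E?\d+(?:,E?\d+)*\])
  | (?P<prodoc_link><PDOC\d+>)          # assumed PRODOC number format
  | (?P<ec_link>\(EC\ *[1-9][0-9]*\.[1-9][0-9]*\.[1-9][0-9]*\.(?:[1-9][0-9]*|-)\))
  | (?P<char>[^\r\n])
""", re.VERBOSE)


def tokenize(line):
    """Return (kind, text) pairs; runs of plain characters are merged."""
    tokens = []
    for m in MARKUP.finditer(line):
        kind, text = m.lastgroup, m.group()
        if kind == "char" and tokens and tokens[-1][0] == "char":
            # Extend the current run of ordinary text.
            tokens[-1] = ("char", tokens[-1][1] + text)
        else:
            tokens.append((kind, text))
    return tokens
```

Calling tokenize("See [2,3] and <PDOC00001>.") gives the plain text
runs tagged "char" with the footnote and the PRODOC reference tagged
separately. Like the pattern above, this sketch treats "(EC 1.1.-.-)"
as ordinary text.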

What this *does* is provide a way to mark up semi-free form
text by providing detection of certain items.  For example,
I could also have had a pattern for http links, or email
addresses, or ...

In other words, the line

This is similar to <PDOC12345> which talks about (EC 1.2.3.-).

gets parsed as

This is similar to <prodoc_link>&lt;<prodoc_num>PDOC12345
</prodoc_num>&gt;</prodoc_link> which talks about <ec_link>(EC
<ec_number>1.2.3.-</ec_number>)</ec_link>.

which can easily be turned into links to other databases
or, for [] footnotes, into #relative references within the
current page.
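
A minimal sketch of that conversion, again using the standard re
module as a stand-in for Martel. The function names and the href
formats ("#ref1", "prodoc/...", "enzyme/...") are hypothetical
placeholders, not real database URLs, and the "PDOC" plus digits form
of a PRODOC number is an assumption. The text is scanned once from
left to right, so anything already turned into a link is never looked
at a second time.

```python
import re
from xml.sax.saxutils import escape

# Stand-in patterns for the stdlib "re" module, not Martel itself.
MARKUP = re.compile(
    r"\[(?P<notes>E?\d+(?:,E?\d+)*)\]"
    r"|<(?P<prodoc_num>PDOC\d+)>"  # assumed PRODOC number format
    r"|\(EC *(?P<ec_number>[1-9][0-9]*\.[1-9][0-9]*"
    r"\.[1-9][0-9]*\.(?:[1-9][0-9]*|-))\)"
)


def _link(m):
    """Build the HTML for one match; all link targets are hypothetical."""
    if m.group("notes"):
        refs = ['<a href="#ref%s">%s</a>' % (n, n)
                for n in m.group("notes").split(",")]
        return "[" + ",".join(refs) + "]"
    if m.group("prodoc_num"):
        num = m.group("prodoc_num")
        return '&lt;<a href="prodoc/%s.html">%s</a>&gt;' % (num, num)
    ec = m.group("ec_number")
    return '(EC <a href="enzyme/%s.html">%s</a>)' % (ec, ec)


def linkify(line):
    """Mark up one line of text as HTML in a single left-to-right pass."""
    out, pos = [], 0
    for m in MARKUP.finditer(line):
        out.append(escape(line[pos:m.start()]))  # plain text, escaped once
        out.append(_link(m))                     # generated HTML, not rescanned
        pos = m.end()
    out.append(escape(line[pos:]))
    return "".join(out)
```

For example, linkify("This is similar to <PDOC12345> which talks
about (EC 1.2.3.-).") wraps the PRODOC number and the E.C. number in
anchors and leaves the rest of the sentence alone.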

I've done this by hand and it's hard because you don't want
to replace text which was already replaced once before.
Guess I should have included it in the Python paper, but it
was getting long.  Perhaps at the conference.

                    Andrew
                    dalke at acm.org