[Biopython-dev] neat Martel trick
Andrew Dalke
dalke at acm.org
Wed Jan 17 03:51:44 EST 2001
Forgot to point out the prodoc parser in the 0.5 release
has a neat trick.
> This parser does extra work to identify [footnote,records], <PRODOC>
> and (EC) links, which has about an extra 50% impact on performance (80
> seconds instead of 50).
reference_note = Martel.Group("reference_note",
    Martel.Re("\[(?P<note>E?\d+)(,(?P<note>E?\d+))*\]"))

prodoc_link = Martel.Group("prodoc_link",
    Martel.Str("<") + prodoc_num + Martel.Str(">"))

# This is incomplete and doesn't allow things like 1.1.-.-
ec_link = Martel.Group("ec_link",
    Martel.Re("\(EC *(?P<ec_number>[1-9][0-9]*\.[1-9][0-9]*" \
              "\.[1-9][0-9]*\.([1-9][0-9]*|-))\)"))

generic_text = reference_note | \
               prodoc_link | \
               ec_link | \
               Martel.Re("[^\R]")
What this all says is, if a given character position starts
a substring looking like:
-- "[1]" or "[2,3]" then it's a footnote reference
-- "<PDOC00001>" then it's a PRODOC reference
-- "(EC 1.2.3.4)" then it's an E.C. number
otherwise that character is just a regular character.
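For anyone who wants to try the idea without Martel, here is a rough sketch of the same alternation using the standard re module. It is not the Martel code, only an approximation under a few assumptions: prodoc_num isn't shown in this message, so I assume "PDOC" followed by five digits; Python's re doesn't allow the repeated "note" group name, so the inner groups are dropped; and runs of ordinary characters are merged into one piece instead of being matched one character at a time. The names TOKEN, KINDS and tokenize are made up for the example.

```python
import re

# Standard-library stand-in for the Martel expression; this is not Martel.
# Assumption: a PRODOC number is "PDOC" plus five digits (prodoc_num is not
# shown in this message).
TOKEN = re.compile(
    r"(?P<reference_note>\[E?\d+(?:,E?\d+)*\])"
    r"|(?P<prodoc_link><PDOC\d{5}>)"
    r"|(?P<ec_link>\(EC *[1-9][0-9]*\.[1-9][0-9]*\.[1-9][0-9]*"
    r"\.(?:[1-9][0-9]*|-)\))"
)
KINDS = ("reference_note", "prodoc_link", "ec_link")


def tokenize(line):
    """Return (kind, text) pairs; kind is None for ordinary text."""
    tokens = []
    pos = 0
    for m in TOKEN.finditer(line):
        # Anything between the previous match and this one is plain text.
        if m.start() > pos:
            tokens.append((None, line[pos:m.start()]))
        # Exactly one of the top-level alternatives matched; find which.
        kind = next(k for k in KINDS if m.group(k) is not None)
        tokens.append((kind, m.group(0)))
        pos = m.end()
    if pos < len(line):
        tokens.append((None, line[pos:]))
    return tokens
```

Calling tokenize("See [2,3] and <PDOC00001>.") gives plain text, a reference_note, plain text, a prodoc_link, and the trailing "." as plain text.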
What this *does* is provide a way to mark up semi-free-form
text by detecting certain items. For example, I could also
have had a pattern for http links, or email addresses, or ...
In other words, the line
This is similar to <PDOC12345> which talks about (EC 1.2.3.-).
gets parsed as
This is similar to <prodoc_link><<prodoc_num>PDOC12345</prodoc_num>></prodoc_link>
which talks about <ec_link>(EC <ec_number>1.2.3.-</ec_number>)</ec_link>.
which can easily be turned into links to other databases
or, for [] footnotes, into #relative references within the
current page.
I've done this by hand and it's hard because you don't want
to replace text which was already replaced once before.
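To show what goes wrong by hand, here is a small sketch comparing two sequential substitutions with a single combined pass, again using plain re and not Martel. The URL template is a placeholder (example.org is not a real PRODOC server), the http pattern is deliberately crude, and the PRODOC number format is assumed to be "PDOC" plus five digits. All of the function names are invented for the example.

```python
import re

# Hypothetical URL template, for illustration only.
PRODOC_URL = "http://example.org/prodoc/%s"

PRODOC = re.compile(r"<(PDOC\d{5})>")
HTTP = re.compile(r'http://[^\s"<>]+')  # crude; good enough for the demo


def _anchor(url, label):
    return '<a href="%s">%s</a>' % (url, label)


def two_pass(text):
    """Naive approach: the second pass also sees the first pass's output."""
    text = PRODOC.sub(lambda m: _anchor(PRODOC_URL % m.group(1), m.group(1)), text)
    # The http pattern now also matches the URL inside the href just added.
    return HTTP.sub(lambda m: _anchor(m.group(0), m.group(0)), text)


COMBINED = re.compile(r'<(?P<prodoc>PDOC\d{5})>|(?P<http>http://[^\s"<>]+)')


def one_pass(text):
    """Single scan: each position is classified once, so nothing is redone."""
    def repl(m):
        if m.group("prodoc") is not None:
            return _anchor(PRODOC_URL % m.group("prodoc"), m.group("prodoc"))
        return _anchor(m.group("http"), m.group("http"))
    return COMBINED.sub(repl, text)
```

With input like "See <PDOC12345> and http://example.org/home for details", two_pass ends up wrapping the URL inside the first link's href in a second anchor, while one_pass only ever looks at the original text.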
Guess I should have included it in the Python paper, but it
was getting long. Perhaps at the conference.
Andrew
dalke at acm.org