[BioPython] Python HTMLParser
Brad Chapman
chapmanb@arches.uga.edu
Mon, 7 May 2001 17:17:04 -0400
Hi Scott;
> Does anyone out there have some Python code using the HTMLparser from the
> htmllib? I've tried the examples in the library reference but can't get
> them to work. I know I'm missing something, but I just can't find a good
> example out there of someone using Python parsers.
I have a script which parses the Arabidopsis clone tables, like:
http://arabidopsis.org/cgi-bin/maps/Seqtable.pl?chr=1
The code is at:
http://bioinformatics.org/cgi-bin/cvsweb.cgi/biopy-pgml/Bio/PGML/Arabidopsis/AtDB/ChrTable.py?rev=1.1.1.1&content-type=text/x-cvsweb-markup
(sorry for the long URL on that, it's a library module I've got in
CVS).
ChrTableParser does the actual parsing, and the parse_table() function
shows how to set it up. This is a semi-complicated example, but if you
wanted a real life example, this is it :-).
The UniGene parser code in Bio/UniGene/__init__.py uses sgmllib, which
works very similarly to htmllib, so this is another example you could
look at.
> Anyway, if you have a little script I could peek at or know where I can
> find one I would be very appreciative.
Basically, what you need to do is inherit from htmllib.HTMLParser and
implement start_<tag> and end_<tag> methods for the tags you want to
handle. For instance, if you wanted to parse a web page and just print
out everything inside anchor (<a>) tags, you would do something like:
import formatter
import htmllib

class AnchorParser(htmllib.HTMLParser):
    def __init__(self):
        # the base class requires a formatter; NullFormatter discards output
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
        # flag to determine if we are in an anchor tag
        self.in_anchor = 0

    def start_a(self, attrs):
        """Signal when we get to an <a> tag."""
        self.in_anchor = 1

    def end_a(self):
        """Signal when we are out of the anchor -- a </a> tag.

        End-tag handlers are called without an attrs argument.
        """
        self.in_anchor = 0

    def handle_data(self, text):
        """This is called every time we get to text data (i.e. not tags)."""
        if self.in_anchor:
            print "Got anchor text: %s" % text
Normally, you just use flags (like self.in_anchor in the example) to
keep track of where you are.
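
For anyone reading this on Python 3, where htmllib no longer exists,
here is a minimal sketch of the same flag technique using html.parser
from the standard library. The names AnchorTextParser, anchor_texts and
extract_anchor_texts are made up for illustration, and the text is
collected in a list instead of being printed so it is easier to check:

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect the text found between <a> and </a> tags.

    Hypothetical illustration class, not part of any library.
    """

    def __init__(self):
        super().__init__()
        # flag to determine if we are inside an anchor tag
        self.in_anchor = False
        # text chunks seen while the flag was set
        self.anchor_texts = []

    def handle_starttag(self, tag, attrs):
        # html.parser uses one method for all start tags, so test the name
        if tag == "a":
            self.in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        # called for text data (i.e. not tags)
        if self.in_anchor:
            self.anchor_texts.append(data)


def extract_anchor_texts(html):
    """Feed an HTML string to the parser and return the anchor text chunks."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.anchor_texts
```

You would call it as extract_anchor_texts(page), where page is a string
holding the HTML. The main difference from htmllib is that there are no
per-tag start_a/end_a methods, so the tag name is checked by hand.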
Hope this helps. Let us know if you are still stuck.
Brad