[BioPython] Python HTMLParser

Brad Chapman chapmanb@arches.uga.edu
Mon, 7 May 2001 17:17:04 -0400


Hi Scott;
 
> Does anyone out there have some Python code using the HTMLparser from the
> htmllib? I've tried the examples in the library reference but can't get
> them to work. I know I'm missing something, but I just can't find a good
> example out there of someone using Python parsers.

I have a script which parses the Arabidopsis clone tables, like:

http://arabidopsis.org/cgi-bin/maps/Seqtable.pl?chr=1

The code is at:

http://bioinformatics.org/cgi-bin/cvsweb.cgi/biopy-pgml/Bio/PGML/Arabidopsis/AtDB/ChrTable.py?rev=1.1.1.1&content-type=text/x-cvsweb-markup

(sorry for the long URL on that, it's a library module I've got in
CVS). 
ChrTableParser does the actual parsing, and the parse_table() function 
shows how to set it up. This is a semi-complicated example, but if you
wanted a real life example, this is it :-).

The UniGene parser code in Bio/UniGene/__init__.py uses sgmllib, which 
works very similarly to htmllib, so this is another example you could
look at.
 
> Anyway, if you have a little script I could peek at or know where I can
> find one I would be very appreciative.

Basically, what you need to do is inherit from htmllib.HTMLParser and
implement start_<tag> and end_<tag> methods for the tags you want to
handle. For instance, if you wanted to parse a web page and just print
out everything inside anchor (<a>) tags, you would do something like:

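import formatter
import htmllib

class AnchorParser(htmllib.HTMLParser):
    def __init__(self):
        # the base class needs a formatter; NullFormatter discards output
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
        # flag to determine if we are in an anchor tag
        self.in_anchor = 0

    def start_a(self, attrs):
        """Signal when we get to an <a> tag."""
        self.in_anchor = 1

    def end_a(self):
        """Signal when we are out of the anchor -- a </a> tag."""
        self.in_anchor = 0

    def handle_data(self, text):
        """This is called every time we get to text data (i.e. not tags)."""
        if self.in_anchor:
            print "Got anchor text: %s" % text

# feed the parser some HTML, then close it to flush any buffered data
parser = AnchorParser()
parser.feed('<a href="http://biopython.org/">Biopython</a> home page')
parser.close()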

Normally, you just use flags (like self.in_anchor in the example) to
keep track of where you are.
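
If you are on a Python where htmllib is not available (it was dropped
in Python 3), the same flag technique should carry over to
html.parser.HTMLParser. This is only a rough sketch under that
assumption: the names AnchorTextParser and anchor_text are made up for
the illustration, and it collects the anchor text in a list instead of
printing it.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect the text found inside <a>...</a> tags (illustrative name)."""

    def __init__(self):
        super().__init__()
        # flag to determine if we are in an anchor tag
        self.in_anchor = False
        # text chunks seen while the flag was set
        self.anchor_text = []

    def handle_starttag(self, tag, attrs):
        # html.parser has one handler for all start tags, so check the name
        if tag == "a":
            self.in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        # called for text data (i.e. not tags)
        if self.in_anchor:
            self.anchor_text.append(data)


def anchor_text(html):
    """Hypothetical helper: parse a string and return the anchor text."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.anchor_text
```

Calling anchor_text() on a page's HTML string gives back a list with
one entry per chunk of text found inside an anchor.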

Hope this helps. Let us know if you are still stuck.
Brad