[Biopython-dev] rebase

Mon Jul 24 03:28:01 EDT 2000

There's already a class that strips HTML tags in:
Bio.File.SGMLHandle

It decorates a file handle to HTML data (e.g. a socket to a web page) and
returns only the non-tag data.  It uses Python's built-in sgmllib library,
since stripping tags is non-trivial.
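
For anyone who wants to see the idea in isolation, here is a minimal
sketch of a tag-stripping handle. It is not the Bio.File.SGMLHandle code:
it assumes a modern Python where sgmllib is no longer available and uses
the standard library's html.parser instead. The names strip_tags and
tag_free_handle are made up for this example, and unlike a true streaming
decorator it reads the whole input up front.

```python
import io
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the character data found between tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text outside of tags; tags themselves are ignored.
        self.chunks.append(data)


def strip_tags(text):
    """Return text with markup removed (hypothetical helper)."""
    collector = _TextCollector()
    collector.feed(text)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.chunks)


def tag_free_handle(handle):
    """Wrap a text handle so reads return only non-tag data.

    Assumption: the input is small enough to read in one go.
    """
    return io.StringIO(strip_tags(handle.read()))
```

A parser would then call readline() on the handle returned by
tag_free_handle(handle) exactly as it would on a plain text file.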

There's also a consumer decorator so that you can build consumers that
don't have to deal with tags:
Bio.ParserSupport.SGMLStrippingConsumer
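
The consumer-decorator idea can be sketched roughly as follows. This is
an illustration only, not the Bio.ParserSupport code: the class names
TagStrippingConsumer and EnzymeConsumer, the enzyme_name method, and the
_strip_tags helper are all hypothetical, and it assumes every consumer
method takes a single line of text.

```python
from html.parser import HTMLParser


def _strip_tags(text):
    """Return text with markup removed (hypothetical helper)."""
    chunks = []
    parser = HTMLParser(convert_charrefs=True)
    parser.handle_data = chunks.append  # keep only non-tag data
    parser.feed(text)
    parser.close()
    return "".join(chunks)


class TagStrippingConsumer:
    """Decorates a consumer so its methods receive tag-free text."""

    def __init__(self, consumer):
        self._consumer = consumer

    def __getattr__(self, name):
        # Only reached for names not defined on the decorator itself.
        attr = getattr(self._consumer, name)
        if not callable(attr):
            return attr

        def stripped_call(line):
            return attr(_strip_tags(line))

        return stripped_call


class EnzymeConsumer:
    """Toy consumer that never has to think about HTML."""

    def __init__(self):
        self.names = []

    def enzyme_name(self, line):
        self.names.append(line.strip())
```

A scanner would be handed TagStrippingConsumer(EnzymeConsumer()) in place
of the bare consumer, so the consumer's own methods stay free of any HTML
handling.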

Jeff

On Mon, 24 Jul 2000, Cayte wrote:

>   Looking at the text rebase files, I noticed a difference between the
> Internet Explorer conversion to text and the Netscape Navigator
> version.  The Netscape version tries to preserve more of the look and
> feel of the HTML file, but both try to preserve indentation.  It occurred
> to me that it might be useful to have our own converter to prevent
> bugs caused by variations in browsers.  It would also eliminate the
> need for stripping whitespace.  The utility would simply remove the
> angle-bracketed stuff and forget about how it looks on a page.  But
> the converter could be written most efficiently in Perl. Are we having
> mixed language applications?  The advantage is that you can use each
> language for what it's best at.  The disadvantage is that users have to
> install lots of compilers.
> 
>  The utility could be useful in a lot of places, since many databases
> use HTML.  This is nice for human viewers but it's a hassle if it's
> being used as input to other software.
> 
> 
>                                                            Cayte
>