[Biopython-dev] rebase
Jeffrey Chang
jchang at SMI.Stanford.EDU
Mon Jul 24 03:28:01 EDT 2000
There's already a class that strips HTML tags in:
Bio.File.SGMLHandle
It wraps a file handle that yields HTML data (e.g. a socket to a web page)
and returns only the non-tag data. It uses sgmllib from Python's standard
library, since stripping tags is non-trivial.
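To make the idea concrete, here is a rough sketch of a handle wrapper that
returns only the non-tag data. It is not the code in Bio.File.SGMLHandle:
the names TagStrippingHandle and strip_tags are made up for illustration,
and it uses html.parser from current Python as a stand-in for sgmllib,
which newer Python versions no longer ship.

```python
import io
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Keeps the text between tags and drops the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of non-tag text.
        self.chunks.append(data)


def strip_tags(text):
    """Return text with all markup removed (hypothetical helper)."""
    parser = _TextCollector()
    parser.feed(text)
    parser.close()
    return "".join(parser.chunks)


class TagStrippingHandle:
    """Hypothetical wrapper: reads like a file, yields tag-free text."""

    def __init__(self, handle):
        # Assumption: the input is small enough to read at once. Reading it
        # all up front means tags that span several lines are still removed.
        self._clean = io.StringIO(strip_tags(handle.read()))

    def read(self, size=-1):
        return self._clean.read(size)

    def readline(self):
        return self._clean.readline()

    def __iter__(self):
        return iter(self._clean)


page = io.StringIO("<html><body>\n<b>EcoRI</b> G^AATTC<br>\n</body></html>\n")
for line in TagStrippingHandle(page):
    print(repr(line))
```

A parser downstream can then treat the wrapped handle like any other text
file and never sees the markup.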
There's also a consumer decorator so that you can build consumers that
don't have to deal with tags:
Bio.ParserSupport.SGMLStrippingConsumer
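As an illustration of the consumer decorator idea, the sketch below wraps a
consumer so that its data methods only ever receive tag-free text. It is an
assumption-laden example, not the code in Bio.ParserSupport: the class names,
the enzyme_name method and the convention that start_*/end_* methods take no
data are all invented here, and html.parser again stands in for sgmllib.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Keeps the text between tags and drops the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(text):
    """Return text with all markup removed (hypothetical helper)."""
    parser = _TextCollector()
    parser.feed(text)
    parser.close()
    return "".join(parser.chunks)


class StrippingConsumer:
    """Hypothetical decorator: cleans data before the consumer sees it."""

    def __init__(self, consumer):
        self._consumer = consumer

    def __getattr__(self, name):
        # Only reached for names not defined on the wrapper itself.
        attr = getattr(self._consumer, name)
        # Assumption: start_*/end_* events carry no data, so pass them through.
        if not callable(attr) or name.startswith(("start_", "end_")):
            return attr

        def cleaned(data):
            return attr(strip_tags(data))

        return cleaned


class EnzymeConsumer:
    """Toy consumer that knows nothing about HTML."""

    def __init__(self):
        self.names = []

    def start_record(self):
        pass

    def enzyme_name(self, line):
        self.names.append(line.strip())


consumer = EnzymeConsumer()
wrapped = StrippingConsumer(consumer)
wrapped.start_record()
wrapped.enzyme_name("<b>EcoRI</b>\n")
print(consumer.names)
```

The point of the design is that the consumer itself stays free of any HTML
handling, so the same consumer can be reused on plain-text input.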
Jeff
On Mon, 24 Jul 2000, Cayte wrote:
> Looking at the text rebase files, I noticed a difference between the
> Internet Explorer conversion to text and the Netscape Navigator
> version. The Netscape version tries to preserve more of the look and
> feel of the HTML file, but both try to preserve indentation. It occurred
> to me that it might be useful to have our own converter to prevent
> bugs caused by variations in browsers. It would also eliminate the
> need for stripping whitespace. The utility would simply remove the
> angle-bracketed stuff and forget about how it looks on a page. But
> the converter could be written most efficiently in Perl. Are we having
> mixed-language applications? The advantage is that you can use each
> language for what it's best at. The disadvantage is that users have to
> install lots of compilers.
>
> The utility could be useful in a lot of places, since many databases
> use HTML. This is nice for human viewers, but it's a hassle if it's
> being used as input to other software.
>
>
> Cayte
>