[Biopython-dev] ToEol in Martel

Andrew Dalke dalke at acm.org
Fri Mar 30 01:36:27 EST 2001


Brad:
>> This is just a random thought, but if Cayte is using
>> the EventGenerator class which I recently moved to
>> Bio.ParserSupport, this *does* strip whitespace before
>> sending an event:

Cayte:
>  Yes, I use EventGenerator.

I'm looking at the EventGenerator.  I get the idea
of what it does.

However, I have some problems with how it does it.
Specifically, how it handles whitespace.  By default
it strips whitespace from the ends of the element's
contents and it merges multiple elements with the same
tag name into a single string seperated by whitespace.

Consider the case where someone stores alignment data
in a sequence format.  In that case, whitespace is
very significant, yet the default mode of EventGenerator
is to throw that information away and there is no
way for the client code to specify otherwise.

As an existence proof, Cayte ran into a case where that
significant information was tossed.

There are a couple problems with the automatic joining
of successive tag code.  It is used for cases like:

<description>This is a description field</description>
<description>using two lines.</description>

which automatically gets turned into

<description>This is a description field using two lines.</description>

For that case it is useful, but consider

<sequence>EKLAD</sequence>
<sequence>WERNDA</sequence>

Do people expect
<sequence>EKLAD WERNDA</sequence>

over
<sequence>EKLAD WERNDA</sequence>

Even worse, I tried something like this with the XSLT
code to turn the description lines into a single field
for SWISS-PROT.  I ran into problems with records like:

DE  This record matches the Prosite pattern A-B-C-D-E-
DE  F-G-H-I but is a false positive.

When merged together, the pattern string becomes
A-B-C-D-E F-G-H-I

On the other hand, with the original callbacks, the
handler could be made smart enough to know that if
a line ends in a hyphen that it could be a split
pattern, or even a hypenated word and attempt to
reconstruct the "real" form of the data.

Not trivial work, but doable with some heuristics.

But if the handler only gets the merged data fields
then it doesn't have the clue that the "-" is at the
end of a line, which makes this sort of cleanup
much harder or at least more error prone.

Brad:
>> I guess whether I should do this or not is up for debate.
>> I know Jeff has some differing opinions (and a good example
>> of why this can be bad),

I'm going to have to agree with Jeff on this.  Whitespace
must be treated as important, unless told otherwise.

One way to address it is with composition, or whatever
the appropriate GoF name is.  Have handlers take handlers,
where each handler modifies the data stream as needed.

  consumer = YourConsumer()
  handler = Sax2Consumer(consumer)
  cleanup = MergeFieldsAndStripWhitespace(handler)
  filter = FilterEvents(("some", "list", "of", "tags"), cleanup)
  parser.setContentHandler(filter)

In this case, filter only passes along the named tags,
cleanup does the merging of fields and whitespace stripping,
and Sax2Consumer translates the start/endElement calls to
the form expected by the consumer.

This may be too fine-grained, but the point is that there
are ways to make it easy to meet the different needs without
much extra work, while also being explicit on what's
actually going on with the data.  No surprises is a good
thing.  (Except for the surprise that there are no surprises :)

>> If this is really a problem here, I can look at fixing it.

Please consider that.

Cayte:
>  One option is for me to subclass EventHandler and rewrite
> _make_callback. But this is inelegant because of the
> convention that a leading underscore means a private function.

Agreed.  This is because it's doing two things, normalizing
the callback text and passing the result to the consumer.
That normalization needs to be accessible to change, while the
rest of it does not.

> An alternative solution is a flag, with a default
> of false, that bypasses the code to strip whitespace.

I do not like flags.  In almost all cases there is a
better way to do it.  For example, another way is to
call "self._cleanup" where "_cleanup" could be a method
which may be overridden in subclasses but defaults to
string.strip .

The same would need to be done with the join function.
although this is harder to do because it isn't a simple
choice of which character to use - some people will
not want that joining to occur.

Brad:
>> # strip off whitespace and call the consumer
>> callback_function = eval('self._consumer.' + name)

Please change this to use
   callback_function = getattr(self._consumer, name)
This is faster and safer.  Evals can do nasty things,
like if name is
   abc + __import__("shutil").rmtree("/")

Also, you might want to change
   if name in self.flags.keys():
to
   if self.flags.has_key(name):

Python 2.1 will allow (as I recall)
   if name in self.flags:

                    Andrew
                    dalke at acm.org





More information about the Biopython-dev mailing list