[Bioperl-l] Naming consistency and Bioperl future searchresultparsing

Wed, 2 Jan 2002 14:39:23 -0700

>Methods don't work so well with '-' in perl so I had to change them.
>Tried to Underscorealize them.

Nor in Python.  I was just pointing out that the names aren't
exact copies of the NCBI XML -- nor do I want them to be.
BTW, in my code I'm mapping namespaced events, like ns:tag,
into ns__tag.  Doubleunderscorealization.  :)

>At least in my tiny understanding of these things, the accession is the
>last term.  Of course we can get into a philosophical discussion of what
>the accession is - a unique transportable identifier for the sequence.

For this I'm not interested in the philosophy, I just want the
pragmatics -- what's the standard algorithm we all should use to
get that information?

>ick.

Indeed.

>>  I would rather be dumb and get the name and the description,
>> and not try for parsing the name at this point.  The encoding
>> for the name is often done by whomever makes the input database,
>> and that's the only person who knows how to decode it properly.
>>
>yes I agree!  I frequently would not like to make my software try and
>guess things it just can't.

The advantages of being dumb are that it's easy to explain and easy
for the different projects to agree on the bare minimum.

>Ahh - but we are using methods not the underlying hash keys
>to get/set the data so the module looks like this:
  ...
>sub e { (shift)->expect }

Sure.  And if I used method accessors in Python then it's

  def getExpect(self):
    return self.expect
  def setExpect(self, expect):
    self.expect = expect
  getE = getExpect
  setE = setExpect
   ..
(I don't like the style where obj.expect(value) sets the
property and obj.expect() retrieves it.  What happens if
value is undefined?  Say, because of some problem elsewhere.
This causes an accidental and silent change in semantics.
Consider

sub expect {
  my($value) = @_;
  if (defined $value) {
    print "Set\n";
  }
  print "return value\n";
}

$s = "Expectation = 1";
$s =~ m/Expectation = (0\..*)/;
&expect($1);

where I expected the expectation value to always start with
a "0."  (I know, silly example, and I should have checked
that there was at least one match from the =~).  Since that
doesn't match, $1 is undefined.  I passed that to expect,
which sees the undefined value and simply returns the value,
rather than setting it as one might expect from simply looking
at the code.

I prefer this style more in C++ where the function resolution
is done by the compiler based on the call arguments at compile
time, and not their contents at runtime.

A concern about using direct attribute lookups is that they prevent
encapsulation when used in a language like C++.  But in Python
it's possible to override attribute lookup at runtime, so doing
something like
  obj.expect = 0.5
doesn't necessarily mean obj.__dict__["expect"] = 0.5; it can
do anything it wants to do.
)
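
A minimal sketch of that last point, using Python's built-in
property as one way to intercept attribute assignment.  The Hit
class and its validation rule are made up for illustration; they
are not part of any existing API.

```python
class Hit:
    """Hypothetical search hit; 'expect' looks like a plain attribute."""

    def __init__(self):
        self._expect = None

    @property
    def expect(self):
        return self._expect

    @expect.setter
    def expect(self, value):
        # Assignment runs this code, so a missing value fails loudly
        # instead of silently turning a "set" into a "get".
        if value is None:
            raise ValueError("expect value is missing")
        self._expect = float(value)


hit = Hit()
hit.expect = "0.5"      # goes through the setter, not hit.__dict__["expect"]
print(hit.expect)
```

Because setting and getting are separate operations here, an
undefined value cannot change which one is performed.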

>However, we do expect people to look at the API and the
>documentation to figure out how to use the object.

The more work needed to understand the API, the harder it
is to use.  I just happen to feel that aliases just aren't
worth it, except for backwards compatibility, and then the
solution is to document the aliases as deprecated, and
after a year or two have those aliases generate warnings,
and after another year or two, remove them.
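
The "generate warnings" stage of that schedule could look roughly
like this in Python, using the standard warnings module.  The HSP
class and the getE/getExpect names are only illustrative.

```python
import warnings


class HSP:
    """Hypothetical HSP object with one canonical accessor and one old alias."""

    def __init__(self, expect):
        self.expect = expect

    def getExpect(self):
        return self.expect

    def getE(self):
        # Deprecated alias: still works, but tells the caller to switch.
        warnings.warn("getE() is deprecated; use getExpect()",
                      DeprecationWarning, stacklevel=2)
        return self.getExpect()


# Record warnings so the effect of calling the alias is visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = HSP(0.001).getE()

print(value, caught[0].category.__name__)
```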

>This is moot. With the above example - we use methods to set data no raw
>data access is allowed (well it's allowed by perl but we frown upon that
>sort of thing).

Frown away.  You also put the onus on whoever else duplicates the
API to remember all of the methods -- and yes, people will implement
the methods they remember/need immediately but will not always
check the code nor the API documentation to figure out all of the
methods and aliases they need to provide.

>I agree - would like to move towards this, I guess I am having trouble
>deciding at what granularity we put the interpretation - do I write a
>parser for blastxml and just throw the events as the XML tagnames - then
>do I have to use the same tagnames when writing the FASTA parser?  Should
>I be converting things to completely different set of XML tags with XSLT
>and if one is richer than the other...? I admit I just did what seemed to
>be the simplest road - put most of the onus on the parser to generate
>events for the main data types in a DB or pairwise alignment Search
>report - Results, Hits, HSPs.  Need to handle Psi-blast iterations
>which is done in Steve's code - figuring out how to reconcile these
>right now.

I think you did the right thing.  Using the same tag names makes
things a lot easier.  That's what I've been doing with the Martel
work for the last couple of days.  The nice thing is I can toss
in a few mixin classes, which know how to parse different parts
of the code.  If I don't need that data parsed, I don't include
the mixin.  I also made a DispatchHandler class which enables this.
It looks like this:

  def startElement(self, tag, attrs):
    methodname = "start_" + escape(tag)
    method = getattr(self, methodname, None)
    if method is not None:
      method(tag, attrs)

where the escape function turns ':' into '__'s.
What this does is turn the central functions (startElement,
endElement) into dispatchers, so different handlers need only
listen to their specific methods.  That makes the mixin
idea work.
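
To make the dispatcher-plus-mixin idea concrete, here is a small
self-contained sketch built on the standard xml.sax module.  The
escape function follows the description above; the two counter
mixins, the tag names, and the sample document are invented for the
example and are not the Martel classes themselves.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def escape(tag):
    # Turn a namespaced tag like "ns:tag" into the identifier "ns__tag".
    return tag.replace(":", "__")


class DispatchHandler(ContentHandler):
    # Route each SAX event to start_<tag>/end_<tag> if such a method exists.
    def startElement(self, tag, attrs):
        method = getattr(self, "start_" + escape(tag), None)
        if method is not None:
            method(tag, attrs)

    def endElement(self, tag):
        method = getattr(self, "end_" + escape(tag), None)
        if method is not None:
            method(tag)


class HitCounter:
    # Hypothetical mixin: only listens for <Hit> elements.
    hit_count = 0

    def start_Hit(self, tag, attrs):
        self.hit_count += 1


class HspCounter:
    # Hypothetical mixin: only listens for <ns:hsp> elements.
    hsp_count = 0

    def start_ns__hsp(self, tag, attrs):
        self.hsp_count += 1


class Handler(HitCounter, HspCounter, DispatchHandler):
    # Pick the mixins you need; unhandled tags are simply ignored.
    pass


handler = Handler()
doc = b"<Result><Hit><ns:hsp/><ns:hsp/></Hit><Hit/></Result>"
xml.sax.parseString(doc, handler)
print(handler.hit_count, handler.hsp_count)
```

Leaving HspCounter out of the Handler bases would drop that parsing
work without touching the dispatcher.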

Also, I needed a characters dispatcher, which looks like this:

  def __init__(self):
    self.__text = None
    self.__text_positions = []
  def start_saving(self):
    if self.__text_positions:
      self.__text_positions.append(len(self.__text))
    else:
      self.__text = ""
      self.__text_positions.append(0)
  def stop_saving(self):
    pos = self.__text_positions.pop()
    return self.__text[pos:]
  def characters(self, s):
    if self.__text_positions:
      self.__text += s

Because of the balanced tree nature of XML, my mixins can get
the text they want with

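  # Outer element: reset the per-block state.  (The "_block" suffix is an
  # assumed name; the outer and inner handlers need distinct names.)
  def start_bioformat__sequence_block(self, tag, attrs):
    self.remove_spaces = attrs.get("remove_spaces", "1")
    self.sequences = []
  # Inner element: save the text of each piece of sequence.
  def start_bioformat__sequence(self, tag, attrs):
    self.start_saving()
  def end_bioformat__sequence(self, tag):
    self.sequences.append(self.stop_saving())
  # Outer element closes: join the pieces and hand off the sequence.
  def end_bioformat__sequence_block(self, tag):
    s = "".join(self.sequences)
    if self.remove_spaces == "1":
      s = s.replace(" ", "")
    self.add_sequence(s)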

and the start_saving/stop_saving methods don't interfere
with each other.
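
A quick way to see why nested saves don't interfere: the sketch
below repeats the same buffer-plus-offset-stack logic in a
standalone class (TextSaver is a name chosen for this example) and
drives it by hand with an outer and an inner save.

```python
class TextSaver:
    # One shared buffer plus a stack of start offsets, one offset per
    # active start_saving() call.
    def __init__(self):
        self._text = ""
        self._positions = []

    def start_saving(self):
        if not self._positions:
            self._text = ""          # nobody is saving: start a fresh buffer
        self._positions.append(len(self._text))

    def stop_saving(self):
        pos = self._positions.pop()
        return self._text[pos:]

    def characters(self, s):
        if self._positions:          # only buffer while someone is saving
            self._text += s


saver = TextSaver()
saver.characters("ignored ")         # dropped: no save is active
saver.start_saving()                 # outer element opens
saver.characters("ID ")
saver.start_saving()                 # inner element opens
saver.characters("ACGT")
inner = saver.stop_saving()          # inner element closes
saver.characters(" end")
outer = saver.stop_saving()          # outer element closes
print(repr(inner), repr(outer))
```

The inner save gets only its own text, while the outer save gets
everything between its start and stop, inner text included.  This
relies on the saves being properly nested, which XML guarantees.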

BTW, I strongly advise against XSLT.  While it can do some
sorts of transformations, it really is entirely too verbose
for its slight advantages.

>
>> Besides, do you also assume the, say, 14K in fasta.pm is accesible
>> by new developers?
>>
>
>My Bio::SearchIO::fasta is 600 lines with comments and really just one
>method - not sure which fasta.pm you are referring to.

Same thing.  I was talking about characters, not lines.

>> One of the things about file typing is, what are the names of
>> the formats?  That's part of a project I'm doing now -- Bioformats.
>>
>What are the names of what formats?  Sorry - lost me.

Whoops!  Forgot that the rest of the world doesn't really know what's
going on inside my head.

There are a lot of file formats.  Some are SWISS-PROT, BLAST, PDB,
and Prodoc.  Each format is actually part of a family or classification
of formats, so you might say something is in SWISS-PROT format when
it really is in the format used for release 38 of the SWISS-PROT
database, and not in the format used for release 40.  (There are
some very minor differences between the two -- eg, somewhere the
checksum changed from CRC32 to CRC64.)

Or there's the difference between different BLASTs - 2.0.5 vs. 2.0.11,
or BLASTP vs. BLASTX, or NCBI BLAST vs. WU-BLAST.  In the larger
sense, there are still some commonalities between formats; there are
fields in FASTA which are identical to fields in BLAST.  (Hence
the basis for your project.)

What we all have are various builders, which know how to parse one
or more formats and produce a data structure.  These might be able
to parse only one specific format, like BLASTP 2.0.8, or a class
of formats, like SWISS-PROT records which have ID, AC, and sequence
data.  (There may also be builders for the same format which do
different things, like build a sequence record, or convert the
text to HTML.)

The problem then is figuring out which builders can be used to
parse the file you have.  Right now that calls for someone to
look at the file and use the right parser.  It also calls for
the parser to handle all the variations in a given format.

However, suppose you have some way to identify the file format
automatically and unambiguously.  Then each of the builders can
have associated with it (somehow) a list of supported formats.
Reading from an arbitrary file is then a matter of:
  - start with the data type to build
  - find the list of associated builders
  - find the one which handles this format
  - use it to parse the file

For example, I have a "sequence" format definition, which
includes "pir", "swissprot", etc.  I then have a "swissprot"
definition which includes the different swissprot variations.
So when I want to build a SeqRecord object (similar to Bioperl's
Seq object), I say:
  for record in SeqRecord.io.readFile(infile, "sequence"):
    print record

The SeqRecord.io traverses the tree of registered formats,
starting with "sequence", finds that the file is in
"swissprot/38" format, gets the grammar definition for that,
pulls up the 'sprot38 to SeqRecord' content handler, and
starts parsing the file.
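
As a rough sketch of that lookup (not the actual Bioformats code):
a tree of format names where the leaves know how to recognize a
file, plus a table of builders keyed by data type and format name.
The Format class, the BUILDERS table, the read function, and the
toy recognition rules (checking for "CRC32;" or "CRC64;") are all
assumptions made for this example.

```python
class Format:
    # A node in the tree of format names; leaves can recognize a file.
    def __init__(self, name, matches=None, children=()):
        self.name = name
        self.matches = matches        # function(text) -> bool, leaves only
        self.children = list(children)

    def identify(self, text):
        # Depth-first search for the first leaf format that recognizes text.
        if not self.children:
            return self if self.matches(text) else None
        for child in self.children:
            found = child.identify(text)
            if found is not None:
                return found
        return None


# Toy recognizers; real format detection would be far more careful.
sequence = Format("sequence", children=[
    Format("pir", matches=lambda t: t.startswith(">P1;")),
    Format("swissprot", children=[
        Format("swissprot/38",
               matches=lambda t: t.startswith("ID ") and "CRC32;" in t),
        Format("swissprot/40",
               matches=lambda t: t.startswith("ID ") and "CRC64;" in t),
    ]),
])

# Builders registered per (data type, format name).  These stand-ins
# just return a label and the entry name instead of a real record.
BUILDERS = {
    ("SeqRecord", "swissprot/38"): lambda t: ("via sprot38", t.split()[1]),
    ("SeqRecord", "swissprot/40"): lambda t: ("via sprot40", t.split()[1]),
}


def read(text, data_type, root):
    # Identify the format under root, then use the matching builder.
    fmt = root.identify(text)
    if fmt is None:
        raise ValueError("unrecognized format")
    builder = BUILDERS.get((data_type, fmt.name))
    if builder is None:
        raise ValueError("no %s builder for %s" % (data_type, fmt.name))
    return fmt.name, builder(text)


text = "ID   EXAMPLE_ID   STANDARD;\nSQ   SEQUENCE   4 AA;  0 MW;  0 CRC32;\n"
print(read(text, "SeqRecord", sequence))
```

The caller only names the broad family ("sequence"); the specific
variation is discovered from the data, and a missing builder shows
up as an explicit error rather than a bad parse.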

This is all based around the idea of a canonical format
name.  These aren't well defined in bioinformatics.  There
isn't a list anywhere of all the file formats -- along
with their variations!  This is probably because there
are no other solutions available for automatic file
typing and parsing.  Hence, I need to come up with my own
names, and hence my statement:

>> One of the things about file typing is, what are the names of
>> the formats?

Make better sense now?

                    Andrew
                    dalke@dalkescientific.com