[Bioperl-l] Naming consistency and Bioperl future search result parsing

Andrew Dalke <dalke@dalkescientific.com>
Tue, 1 Jan 2002 14:18:23 -0700


Jason:
>Not yet - this is still very much a work in progress.

Sure.  I'm just pointing out that this is useful work for
others besides Bioperl.  I hope for there to be some
documentation outside of the code, and showed an example of
what I'm looking for.

>These names were influenced greatly by the NCBI xml tags

I saw that, and I know it's been out for over a year.  But
I also saw you changed some of the names.  (Eg, adding an
"_" or replacing a "-" with an "_".)

> Names are in flux here!  I don't want to get stuck to this
> API as this is not part of a stable release so I plan to rip
> things up wherever it seems appropriate.

Quite understood, and I made no plans dependent on any stability,
real or otherwise.  (BTW, something about a flux capacitor?  :)

> In NCBI formatted dbs the following is what is assigned in the
> NCBI blast xml output:
>>gi|1234|ref|NP_1234.1|NP_1234 some desc here
>name is 'gi|1234|ref|NP_1234.1|NP_1234'
>accession should be 'NP_1234'
>desc is 'some desc here' (everything else)

How would I write a program to do this?  Is the accession the
last |'ed term?  The second one after a 'ref'?  The first one
after a 'ref' but up to the '.'?  The 4th after the 'gi'?

>You could be smart and pick up multiple accessions or multiple
>references but the DTD did not account for this as far as I
>can tell.

I would rather be dumb and get the name and the description,
and not try for parsing the name at this point.  The encoding
for the name is often done by whoever makes the input database,
and that's the only person who knows how to decode it properly.

For NCBI, they defined their own encoding.

An even dumber solution is to simply store the whole line,
and not even try to figure out the name.  But I think that's
too dumb.
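
A minimal sketch of the "dumb" approach, in Python.  It assumes the
name is everything up to the first whitespace and the description is
the rest; the name itself is left undecoded.  The function name
split_defline is made up for illustration and is not part of Bioperl
or Biopython.

```python
def split_defline(line):
    """Split a FASTA-style header line into (name, description).

    Assumption: the name ends at the first whitespace. No attempt is
    made to decode the '|'-separated fields inside the name.
    """
    text = line.strip()
    if text.startswith(">"):
        text = text[1:]
    parts = text.split(None, 1)  # split once, on any run of whitespace
    name = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return name, description


name, description = split_defline(">gi|1234|ref|NP_1234.1|NP_1234 some desc here")
print(name)
print(description)
```

Anything smarter, like pulling out the accession, would sit on top of
this and be specific to whoever built the database.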


>Again, standardizing my names with what Steve has put together many of
>these will change.  I just reused the names from the BPlite::HSP object.

Again, understood.  :)   I'm pointing out the Biopython names in
case that helps.

>Some alignments don't have gaps.

In that case, will the property exist, will it be undef, or
will it be 0?

>We will likely support e, expect, and evalue as aliases
>for the same thing.

May I suggest you not use aliases?  I don't think they
help as much as they hinder.  For example (and assuming I
can speak any Perl these days):

  my @x = ();
  # $hit->{"hsps"} holds an array reference; dereference it to loop over the HSPs
  foreach my $hsp (@{ $hit->{"hsps"} }) {
    push(@x, $hsp) if ($hsp->expect > $threshold);
  }
  @x = reverse(@x);
  foreach my $hsp (@x) {
    print $hsp->name, " = ", $hsp->e, "\n";
  }

Someone looking at the code can't tell that "e" and "expect"
are really the same thing -- unless they know the API or
the biology.  That makes the code harder to understand.

Also, suppose you want to edit the information.  Then you
need to tie the properties together so changing one changes
the others.  That's tedious.  Eg, in Python if I allow

  hsp.evalue = 0.1
  assert hsp.e == hsp.expect == hsp.evalue == 0.1
  hsp.expect = 0.4
  assert hsp.e == 0.4

then I have to play attribute tricks, like (pre-Python 2.2)

  def __getattr__(self, name):
    if name in ("e", "evalue"):
      return self.expect
    raise AttributeError(name)
  def __setattr__(self, name, value):
    if name in ("e", "evalue"):
      name = "expect"
    self.__dict__[name] = value

(Python 2.2, which just came out, has ways to define
per-attribute accessor functions for this.)
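
For comparison, a sketch of the same aliasing using the "property"
built-in that arrived with Python 2.2.  The HSP class here is a toy
written for this example, not the Biopython or Bioperl class.

```python
class HSP(object):
    """Toy HSP: one stored value, 'expect', plus two alias names."""

    def __init__(self, expect):
        self.expect = expect

    def _get_expect(self):
        return self.expect

    def _set_expect(self, value):
        self.expect = value

    # "e" and "evalue" read and write the single "expect" attribute.
    e = property(_get_expect, _set_expect)
    evalue = property(_get_expect, _set_expect)


hsp = HSP(0.1)
hsp.evalue = 0.1
assert hsp.e == hsp.expect == hsp.evalue == 0.1
hsp.expect = 0.4
assert hsp.e == 0.4
```

It is less tricky than overriding __getattr__ and __setattr__, but
every alias still needs its own line, and a reader still has to know
that the three names mean the same thing.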

You can also see the tedium when you write a local object
which allows changes in a remote database.

>>     begin  -- same as for query
>      start() not begin

Ahh, okay, didn't read the code correctly.

>We also have frame and strand as part of Bio::SeqFeature::Similarity

Ditto.

>> Parameters:
>>   depends on the specific alignment search used.
>(hence my desire to leave it as a tag/value hash rather than its own
> object)

The difference is an anonymous vs. non-anonymous object.
In this case, I agree, having a class doesn't help.  That is,
I can't think of any realistic client code which would be
improved by using a class/type mechanism.

>t/data/cysprot1.FASTA, t/data/HUMBETGLOA.FASTA

Thanks for the pointer.

>It's there - It's been called the homology sequence. stored in
>HSP::homology_seq.

Again, thanks.

>> >The Bio::Search and Bio::SearchIO classes and directories will be
>> >reorganized to only contain Query, Hit, HSP, & Result in the API.
>>
>> What about Statistics and Parameters?
>>
>Don't think we need objects for them at this point.  Tag-value should be
>enough.

They are still important objects, even if they are implemented
with a built-in data type instead of a new class.  They should be
documented as "can be used as a hash and contain ...".  This
lets people know what's there, but gives the flexibility to
replace them with hash-like objects in some other implementation,
such as one which talks to a database to get/set the values.
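
To make the "hash-like object" idea concrete, here is a sketch in
Python.  RecordingParameters is hypothetical: it only appends writes
to a list where a real version would talk to a database, and the keys
"matrix" and "expect" are just sample parameter names.

```python
from collections.abc import MutableMapping


class RecordingParameters(MutableMapping):
    """Hash-like stand-in for an object backed by a database.

    Hypothetical: writes are appended to self.log instead of being
    sent anywhere.
    """

    def __init__(self, initial=None):
        self._data = dict(initial or {})
        self.log = []

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self.log.append((key, value))  # a real version would update the database here
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def describe(parameters):
    """Client code that relies only on the documented hash interface."""
    return ", ".join("%s=%s" % (k, parameters[k]) for k in sorted(parameters))


plain = {"matrix": "BLOSUM62", "expect": 10}
backed = RecordingParameters(plain)
backed["expect"] = 0.001
print(describe(plain))
print(describe(backed))
```

The client function never needs to know which of the two it was
handed, which is the flexibility the documentation should promise.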

>Not been following [Biopython/Martel] as I am not really python
>literate - not really clear how it would be ported.

Nor I.  The only ways I could think of were:
  - call Python from Perl
  - rewrite the engine as a standalone C library usable from the
      various languages
  - rewrite the whole thing in Perl

>Open to suggestions, improvements on
>building the event-based parser - I like your conversion to SAX events,
but I'm not sure I see a big win for us with what I perceive as added
>confusion and complexity to new developers.

The distinction is that most new developers don't need to worry
about SAX events.  There should be a set of existing SAX -> object
builders and converters which handle their needs.  And since it's
a standard API, the overhead for learning it is low.  It also
helps that it's possible to dump everything to XML, to see
what's going on at the lowest level.
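
As a rough idea of what a SAX -> object builder looks like, here is a
sketch using Python's standard xml.sax module.  The XML is hand-written
and heavily simplified, and the element names (result, hit, name,
expect) are placeholders, not the NCBI DTD or the Martel tag set.

```python
import xml.sax

# Hand-written sample input; element names are placeholders for illustration.
SAMPLE = b"""<result>
  <hit><name>gi|1234|ref|NP_1234.1|NP_1234</name><expect>0.001</expect></hit>
  <hit><name>gi|5678|ref|NP_5678.1|NP_5678</name><expect>2.5</expect></hit>
</result>"""


class HitBuilder(xml.sax.ContentHandler):
    """Turn SAX events into a list of plain dictionaries, one per <hit>."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.hits = []
        self._current = None  # dictionary for the <hit> being read, if any
        self._text = []       # character data seen since the last tag

    def startElement(self, tag, attrs):
        if tag == "hit":
            self._current = {}
        self._text = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, tag):
        if tag == "hit":
            self.hits.append(self._current)
            self._current = None
        elif self._current is not None:
            # Any other element inside a <hit> becomes a key/value pair.
            self._current[tag] = "".join(self._text).strip()
        self._text = []


builder = HitBuilder()
xml.sax.parseString(SAMPLE, builder)
for hit in builder.hits:
    print(hit["name"], hit["expect"])
```

A new developer would only touch builder.hits; the handler itself is
written once and reused.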

Besides, do you also assume that the, say, 14K in fasta.pm is
accessible to new developers?

>  Happy to be dissuaded.

Working furiously to be able to show things off at the O'Reilly
conference.  :)

>Would love to take advantages of your work - this is one of the
>frustrating things about doing all this work in different languages.
>Perhaps we can map some strategies about how to best share code between
>languages at the hackathon?

One of the things about file typing is, what are the names of
the formats?  That's part of a project I'm doing now -- Bioformats.

BTW, Damian Conway will be at the conference.  He's giving a
full-day tutorial on parsing.  He's also one of the Perl people
involved with regexps -- his list of projects at
   http://www.yetanother.org/damian/projects.html
includes
  Regexp::Configurable Roll-your-own regex syntax

I'm trying to convince him that some of the things in Martel
should be put into Perl.  I'll be talking with him more at
the conference.  (I'm also trying to get him to help out bioperl
with advice on context-free parsers, since they're needed to
handle Genbank locations and I don't think anyone here has
yet been able to deal with them, even with the Biopython
grammar available.)

                    Andrew
                    dalke@dalkescientific.com