[Bioperl-l] Naming consistency and Bioperl future search result parsing

Jason Stajich jason@cgt.mc.duke.edu
Tue, 1 Jan 2002 17:39:06 -0500 (EST)


On Tue, 1 Jan 2002, Andrew Dalke wrote:


> I saw that, and I know it's been out for over a year.  But
> I also saw you changed some of the names.  (Eg, adding an
> "_" or replacing a "-" with an "_".)
>
Methods don't work so well with '-' in perl so I had to change them.
Tried to Underscorealize them.


> How would I write a program to do this?  Is the accession the
> last |'ed term?  The second one after a 'ref'?  The first one
> after a 'ref' but up to the '.'?  The 4th after the 'gi'?
>
At least in my tiny understanding of these things, the accession is the
last term.  Of course we can get into a philosophical discussion of what
the accession is - a unique transportable identifier for the sequence.

This is of course one of the minor headaches of bioinformatics.
The following are all real entries from NR as provided by NCBI:

gi|89471|pir||S15901
- I interpret this to mean the sequence comes from pir but does not have
  a pir accession?
gi|231734|sp|P30274|CGA2_BOVIN

CGA2_BOVIN is the swissprot ID, but the accession could also be P30274, as
some dbs don't use the swissprot ID (it is assigned later on, after the
sequence is submitted and recorded in a database).

If you download the human refseqs you get entries like
gi|1234|ref|NP_1234.1|

so the last term is empty and no accession is listed there, but really it is
NP_1234

ick.
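
For what it's worth, the "last term" rule can be sketched in a few lines of
Perl.  This is only a heuristic based on my reading above, not an official
NCBI parser, and the name guess_accession is made up for illustration.  It
takes the last non-empty |-separated field and strips a trailing ".N"
version:

```perl
use strict;
use warnings;

# Hypothetical helper: guess an accession from an NCBI-style identifier.
# Heuristic only: last non-empty '|' field, minus any trailing ".N" version.
sub guess_accession {
    my ($id) = @_;
    my @fields = grep { length } split /\|/, $id;
    return undef unless @fields;
    (my $acc = $fields[-1]) =~ s/\.\d+$//;
    return $acc;
}

for my $id ('gi|89471|pir||S15901',
            'gi|231734|sp|P30274|CGA2_BOVIN',
            'gi|1234|ref|NP_1234.1|') {
    print "$id => ", guess_accession($id), "\n";
}
```

Note that for the swissprot entry this returns CGA2_BOVIN rather than
P30274, which is exactly the ambiguity described above.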

> >You could be smart and pick up multiple accessions or multiple
> >references but the DTD did not account for this as far as I
> >can tell.
>
>  I would rather be dumb and get the name and the description,
> and not try for parsing the name at this point.  The encoding
> for the name is often done by whomever makes the input database,
> and that's the only person who knows how to decode it properly.
>
yes I agree!  I would rather not make my software try to guess things it
just can't know.


> >Some alignments don't have gaps.
>
> In that case, will the property exist, will it be undef, or
> will it be 0?
>
Should be 0 I guess.
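
A minimal sketch of what "should be 0" could look like in an accessor,
assuming a hash-based object.  The class name My::HSP and the '_gaps' key
are placeholders, not the actual Bioperl code:

```perl
use strict;
use warnings;

package My::HSP;   # hypothetical class name, for illustration only

sub new { return bless {}, shift }

# get/set the gap count; reports 0 when nothing has been set
sub gaps {
    my ($self, $value) = @_;
    $self->{'_gaps'} = $value if defined $value;
    return defined $self->{'_gaps'} ? $self->{'_gaps'} : 0;
}

package main;

my $hsp = My::HSP->new;
print $hsp->gaps, "\n";   # nothing set yet, so this reports 0
$hsp->gaps(3);
print $hsp->gaps, "\n";
```

This way the caller never has to test for undef before doing arithmetic
on the value.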

> >We will likely support e, expect, and evalue as aliases
> >for the same thing.
>
> May I suggest you not use aliases?  I don't think they
> help as much as they hinder.  For examples (and assuming I
> can speak any Perl these days):
>

Ahh - but we are using methods not the underlying hash keys to get/set the
data so the module looks like this:

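sub expect {
    my ($self,$value) = @_;
    # act as a setter only when a value is passed in
    if( defined $value ) {
        $self->{'_expect'} = $value;
    }
    return $self->{'_expect'};
}
# alias: pass any arguments through so e() can set as well as get
sub e { shift->expect(@_) }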

> Someone looking at the code can't tell that "e" and "expect"
> are really the same thing -- unless they know the API or
> the biology.  That makes the code harder to understand.
>
Noted - the only reason to keep both is for backwards compatibility.  But
we are not really trying to support the previous mish-mashes of APIs
anyway, so it would be silly to add this ambiguity.  However, we do expect
people to look at the API and the documentation to figure out how to use
the object.

> Also, suppose you want to edit the information.  Then you
> need to tie the properties together so changing one changes
> the others.  That's tedious.  Eg, in Python if I allow
>
This is moot with the above example - we use methods to set data, so every
alias ends up writing the same underlying value.  No raw data access is
allowed (well, it's allowed by perl but we frown upon that sort of thing).
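
To make that concrete, here is a small self-contained sketch.  The class
name My::Hit is hypothetical, and whether we keep any aliases at all is
still open.  Every alias forwards to the one real accessor, so there is
nothing to tie together by hand:

```perl
use strict;
use warnings;

package My::Hit;   # hypothetical class name, for illustration only

sub new { return bless {}, shift }

# the one real accessor; the value lives under a single hash key
sub expect {
    my ($self, $value) = @_;
    $self->{'_expect'} = $value if defined $value;
    return $self->{'_expect'};
}

# aliases forward their arguments, so they both get and set
sub e      { shift->expect(@_) }
sub evalue { shift->expect(@_) }

package main;

my $hit = My::Hit->new;
$hit->e(1e-10);                       # set through one alias
print $hit->expect == $hit->evalue    # read through the other names
    ? "same\n" : "different\n";
```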


> Nor I.  The only ways I could think of were:
>   - call Python from Perl
>   - rewrite the engine as a standalone C library usable from the
>       various languages

        This can be a problem in that we aim to have 90% (I just made
        that number up - the word "most" would suffice) of our code
        runnable on Win and Mac perl ports.

>   - rewrite the whole thing in Perl

        Code duplication, but could be useful in its own right - would be
        interested to see where this fits in with Damian's other parser
        and grammar modules.

> The distinction is that most new developers don't need to worry
> about SAX events.  There should be a set of existing SAX -> object
> builders and converters which handle their needs.  And since it's
> a standard API, the overhead for learning it is low.  It also
> helps that it's possible to dump everything to XML, to see
> what's going on at the lowest level.
>

I agree - I would like to move towards this.  I guess I am having trouble
deciding at what granularity we put the interpretation.  Do I write a
parser for blastxml and just throw the events as the XML tagnames?  Then
do I have to use the same tagnames when writing the FASTA parser?  Should
I be converting things to a completely different set of XML tags with
XSLT, and what if one is richer than the other...?  I admit I just did
what seemed to be the simplest road - put most of the onus on the parser
to generate events for the main data types in a DB or pairwise alignment
Search report - Results, Hits, HSPs.  I still need to handle Psi-blast
iterations, which Steve's code already does - I am figuring out how to
reconcile these right now.
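
As a rough illustration of that road, here is a sketch of a listener that
builds nested Result / Hit / HSP data from format-neutral events.  The
class My::EventBuilder, the event names 'result', 'hit' and 'hsp', and the
attribute keys are all assumptions for this example, not the Bioperl API:

```perl
use strict;
use warnings;

package My::EventBuilder;   # hypothetical listener, not part of Bioperl

sub new { return bless { results => [], stack => [] }, shift }

# open a node and attach it to whatever is currently open
sub start_element {
    my ($self, $name, $attrs) = @_;
    my $node = { %{ $attrs || {} }, type => $name, children => [] };
    my $parent = $self->{stack}[-1];
    push @{ $parent ? $parent->{children} : $self->{results} }, $node;
    push @{ $self->{stack} }, $node;
    return;
}

# close the innermost open node, checking that events are balanced
sub end_element {
    my ($self, $name) = @_;
    my $node = pop @{ $self->{stack} };
    die "unbalanced event '$name'\n" unless $node && $node->{type} eq $name;
    return;
}

package main;

# Any parser (BLAST XML, FASTA, ...) would emit these same neutral events.
my $builder = My::EventBuilder->new;
$builder->start_element('result', { query => 'q1' });
$builder->start_element('hit',    { name => 'gi|231734|sp|P30274|CGA2_BOVIN' });
$builder->start_element('hsp',    { expect => 1e-10, gaps => 0 });
$builder->end_element('hsp');
$builder->end_element('hit');
$builder->end_element('result');

my $result = $builder->{results}[0];
print scalar @{ $result->{children} }, " hit(s) for $result->{query}\n";
```

The point of the design is that the builder never sees format-specific
tag names, so the same listener could sit behind either parser.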

> Besides, do you also assume the, say, 14K in fasta.pm is accessible
> by new developers?
>

My Bio::SearchIO::fasta is 600 lines with comments and really just one
method - not sure which fasta.pm you are referring to.

> One of the things about file typing is, what are the names of
> the formats?  That's part of a project I'm doing now -- Bioformats.
>
What are the names of what formats?  Sorry - lost me.

Anyways - all this little stuff aside - I am glad we are trying to work
towards a common nomenclature here.  I am less concerned with the
individual data access methods being the same than with the basic data
partitioning of the objects.  My goal, at a minimum, is that we (Bio{*})
should be able to match up to the same CORBA IDL spec for Analysis results
at some point when we generate one.

-jason
-- 
Jason Stajich
Duke University
jason@cgt.mc.duke.edu