[Dynamite] Is this working now then?
Ian Holmes
ihh@fruitfly.org
Sun, 5 Mar 2000 09:45:46 -0800 (PST)
On Sun, 5 Mar 2000, Ewan Birney wrote:
> > > At the risk of treading old ground here..
> > >
> > > I vote for a lightweight sequence data structure containing two strings:
> > > name and sequence data. This accession number stuff has nothing to do with
> > > dynamic programming really. Besides -- having three different kinds of ID
> > > with apparently nothing to distinguish them is somewhat idiosyncratic.
>
>
> No no... last point *bad*.
>
> Don't forget we are doing DB searching in this package. The DB searching
> package absolutely has to be able to handle the niceties of looping
> through a set of sequences and providing a data structure from which
> we can make decent output. One does need three names:
>
> a) human-readable name to be shown, does not have to be
> unique (display_id)
>
> b) supposed unique 'biological' name, (without the database
> part, as we should know what database we searched) which needs to be
> reported in the output for the output to be useful to a computer
> (accession_number)
>
> c) a unique id for this implementation, which the implementation
> has complete control of - for example, byte position in a file munged
> into a string.
>
>
> All three names are **absolutely required**. I have walked this road many
> times and I now know you need three "names" for different functions.
> It may look mad, but it certainly is not mad.
>
>
> This offends people again and again - and so they insist on one name - and
> then they find they need three parts to that name - so they munge a string
> together into that one string and then they have the affront to claim that
> we need structured munging rules of "display|accession|implementation" to
> allow people to know how to decode this munged name when they should have
> **let the datastructure have three names** in the first place.

You are conflating two issues. The database searching definitely needs
names, accession numbers, tags, peripheral information and whatever else
(it's still not clear that this object model is 100% resolved -- at the
very least, the "display_id" vs "primary_id" nomenclature could use some
clarification -- but the principle that sequences in databases have more
than just two fields is perfectly sound).

However, this information all belongs in the database layer, NOT the DP
layer. DP does not need to know about accession numbers, nor does it need
to make pretty output.

In terms of the sequence memo pattern you outlined in your previous post
(if I have indeed remembered the right name for it), I see this moving
toward:

* a Sequence interface with virtual 'name' & 'data' attributes
* a Sequence_memo datastructure mirroring the Sequence interface
* a Database_sequence interface that inherits from Sequence and also has
  accession_number & implementation_id attributes (your Foreign_seq)
  -- possibly a get_subseq method as well (I like this.)
* a Database_sequence_factory

This keeps the object model clean and decouples the DP from the database
layer. There is no uniqueness guarantee provided on the Sequence::name
method; uniqueness is a quality that is only relevant where there are
multiple sequences, i.e. the database layer.
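
To make the layering concrete, here is a rough sketch of the two interfaces in Python. It is only an illustration of the split proposed above, not Dynamite code: the class FlatFileSequence, the helper dp_input_length, the half-open range convention in get_subseq and the sample values are all made up for the example.

```python
from abc import ABC, abstractmethod


class Sequence(ABC):
    """DP-layer view: a name and the residue string, nothing else."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Display name; no uniqueness guarantee at this level."""

    @property
    @abstractmethod
    def data(self) -> str:
        """Residue string."""


class DatabaseSequence(Sequence):
    """Database-layer view: adds the identifiers DP never looks at."""

    @property
    @abstractmethod
    def accession_number(self) -> str:
        """Supposedly unique 'biological' name."""

    @property
    @abstractmethod
    def implementation_id(self) -> str:
        """Identifier owned entirely by the implementation."""

    def get_subseq(self, start: int, end: int) -> str:
        # Assumed convention: half-open range [start, end).
        return self.data[start:end]


class FlatFileSequence(DatabaseSequence):
    """Hypothetical implementation keyed on a byte offset in a file."""

    def __init__(self, name, data, accession_number, byte_offset):
        self._name = name
        self._data = data
        self._accession_number = accession_number
        self._byte_offset = byte_offset

    @property
    def name(self) -> str:
        return self._name

    @property
    def data(self) -> str:
        return self._data

    @property
    def accession_number(self) -> str:
        return self._accession_number

    @property
    def implementation_id(self) -> str:
        # Byte position munged into a string.
        return str(self._byte_offset)


def dp_input_length(seq: Sequence) -> int:
    """Stand-in for DP code: it touches only the Sequence interface."""
    return len(seq.data)
```

Anything written against Sequence, like dp_input_length here, accepts a database sequence unchanged and never sees the accession number or the implementation id.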
Also, since Database_sequence is an extension of Sequence, we can put the
code to take a snapshot of a Sequence in the Sequence_memo constructor
(where it belongs) instead of creating a new object.
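
As a rough Python sketch of that last point (the names SequenceMemo and ToyDatabaseSequence and the sample values are invented for illustration, not taken from Dynamite), the memo constructor can snapshot anything that exposes 'name' and 'data':

```python
class SequenceMemo:
    """Plain snapshot of anything exposing `name` and `data`."""

    def __init__(self, source):
        # Copy only the two DP-relevant fields; database extras are ignored.
        self.name = str(source.name)
        self.data = str(source.data)


class ToyDatabaseSequence:
    """Hypothetical database-layer object, used only to exercise the memo."""

    def __init__(self, name, data, accession_number, implementation_id):
        self.name = name
        self.data = data
        self.accession_number = accession_number
        self.implementation_id = implementation_id


source = ToyDatabaseSequence("example_seq", "ACGTACGT", "ACC0001", "offset:0")
memo = SequenceMemo(source)
source.data = "TTTT"  # a later change to the source does not reach the memo
```

The memo carries no accession number or implementation id, so the DP layer only ever receives the two fields it actually uses.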
Ian