[Bioperl-l] Sequence IDs and Comment()s

Jason Eric Stajich jason@cgt.mc.duke.edu
Tue, 23 Oct 2001 11:50:06 -0400 (EDT)


On Tue, 23 Oct 2001, Charles Tilford wrote:

> Jason,
>
> This has been a bit of confusion to me - I had assumed that
> display_id() was "human readable" text like "Beta Hemoglobin", while
> primary_id() should be used for unique database keys. I got this
> impression from the documentation for Bio::Seq. I'm particularly
> confused about the implementation of primary_id, since "For sequences
> with no natural id, this method should return a stringified memory
> location" (pointing to what, and to what end?).
>

This is a bugaboo on which we have simply made arbitrary decisions.  If you
look at the write_seq code in fasta.pm, we use the display_id when writing
out the sequence, because someone can instantiate a sequence without
specifying a primary id.  When reading a fasta db we initialize primary_id
AND display_id both to the same UNIQUE id (the first block of text after
the >).  Should we be more flexible here and try to write out the primary
id if it has been set by the user and is not a memory location?  And what
about sequences derived via trunc(): what primary id should they get?

If you go on to look at our seqio implementation of genbank/embl/swissprot
writing, we assume that if the sequence is a RichSeq then primary_id points
to something meaningful (a gi number), and we use that.

We plan to overhaul the system to use event-based parsers instead.  With
those, this kind of handling could be tweaked at a much finer grain by the
user, but I'm not really ready to start that project until after we get
1.0 released.

> I've been working in the context of migrating Seq objects to (and
> from) BSML for display in the LabBook Viewer. The problem I've faced
> is storing a short name for a sequence, for use as a title. The
> contents of desc are often too long for a simple on-screen label
> (sometimes a full sentence or more), and the sort of database primary
> keys I've been putting in primary_id are typically things like
> "245331", which are not of great utility to the viewer. Accession
> number is also not immediately informative, and is not always
> available.
>
> So where should I put "title" strings? In a Comment? s/ /_/g and put
> them as display_id?
>
I feel that "title" strings, i.e. short representations of a sequence,
should go in display_id, so the s/\s/_/g is probably for the best.
Anything more elaborate than 2 or 3 words should really go in the
description, because it won't fit in the user's short reading view in the
LabBook Viewer anyway. (IMHO)
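
For illustration only, here is a minimal sketch of that title-to-id
conversion, written in Python rather than Perl so it stands alone.  The
helper name title_to_display_id is hypothetical and not part of BioPerl.
Unlike a plain s/\s/_/g, it trims the ends and collapses each run of
whitespace into a single underscore, which is an assumption about what
you would want for an on-screen label.

```python
import re


def title_to_display_id(title: str) -> str:
    """Turn a short human-readable title into a space-free id string.

    Hypothetical helper for illustration; not a BioPerl function.
    """
    # Trim leading/trailing whitespace, then replace every run of
    # whitespace with one underscore.
    return re.sub(r"\s+", "_", title.strip())


print(title_to_display_id("Beta Hemoglobin"))  # prints Beta_Hemoglobin
```

The result could then be passed to display_id(), with the full sentence
kept in the description.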

Should we consider a more flexible strategy when dumping out fasta files,
to handle both cases: where no primary_id is known, so the internal 'fake'
primary id (a memory location) is in use, AND where primary_id is a valid
gi number or accession?
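
One way such a strategy might look, as a rough sketch in Python rather
than the actual fasta.pm code: the function name fasta_id is
hypothetical, and the regular expression is only my assumption about
what a stringified memory location looks like (a Perl reference such as
"Bio::PrimarySeq=HASH(0x81f3a2c)").

```python
import re

# Assumption: a 'fake' primary id is a stringified Perl reference,
# optionally blessed, e.g. "HASH(0x81f3a2c)" or
# "Bio::PrimarySeq=HASH(0x81f3a2c)".
MEMORY_LOCATION = re.compile(r"^(?:[\w:]+=)?[A-Z]+\(0x[0-9a-fA-F]+\)$")


def fasta_id(primary_id, display_id):
    """Pick the id to write after the '>' in a fasta header.

    Hypothetical policy: prefer a user-set primary id, and fall back to
    the display id when the primary id is missing or is only a memory
    location.
    """
    if primary_id and not MEMORY_LOCATION.match(primary_id):
        return primary_id
    return display_id


print(fasta_id("245331", "Beta_Hemoglobin"))
print(fasta_id("Bio::PrimarySeq=HASH(0x81f3a2c)", "Beta_Hemoglobin"))
```

The first call picks the primary id and the second falls back to the
display id.  This leaves open the trunc() question, since a derived
sequence would still need some rule for which primary id it inherits.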


> General observation: When performing Bio::Seq <-> BSML, I end up with
> a fistful of objects from one implementation that have no clear
> corollary in the other. In BSML, there are two generic name/value type
> containers (e.g. <Attribute name="user" content="Bob"> or <Qualifier
> value-type="gene" value="actin">) - I've used these liberally to store
> information that otherwise has no clear home. In the reverse
> direction, I've been mapping orphaned data into Comment()s of the form
> "name: Bob". However, I fret about the lack of predictability in
> delimiter choice (": " vs ":" vs "\t" vs "," etc.).
>
> So... What would people think of adding a type() (or class(),
> category(), meta(), etc.) method to Comment to optionally qualify the
> contents?
>

I'm wary of this because it implies we are starting to interpret the data
rather than just providing a mechanism for storing and manipulating it.
How would this work for a GenBank -> Seq -> BSML trip and back?



> -Charles
>
> Jason Eric Stajich wrote:
>
> ...snip...
>
> > If you wanted these names printed out with the seqio system you should
> > make sure and set the display id:
> > $seq->display_id("myaccessionnumber");
> >
> > display id should really not have spaces in it since it is the
> > intended unique id for the sequence in a db.
>
> ...snip...
>
>

-- 
Jason Stajich
Duke University
jason@cgt.mc.duke.edu