[DAS2] Notes from DAS/2 code sprint #2, day three, 15 Mar 2006

Steve Chervitz Steve_Chervitz at affymetrix.com
Thu Mar 16 15:37:16 EST 2006


$Id: das2-teleconf-2006-03-15.txt,v 1.1 2006/03/16 20:45:35 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Ed E., Gregg Helt
  Sanger: Thomas Down, Andreas Prlic
  CSHL: Lincoln Stein
  Dalke Scientific: Andrew Dalke (at Affy)
  UCLA: Allen Day, Brian O'Connor (at Affy)
        
Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 


[Notetaker: joining 10 min into the discussion]

ls: how does synonym business work?
ad: if server has access to data...
ls: we ask server for the global id, uses same global id for segments,
and uses same global id for the sequence.
gh: to do this in the capabilities for annot server, the global id for
segments query points to reference server.
ls: if the local machine (the current server) has sequence capabilities,
then it passes global id for segments to current server and it gets
the sequence. if it doesn't have that capability, then we need to
figure out a way for it to get the sequence. the easiest way to do
that would be to resolve that url and fetch it. I'm open to any
suggestion. I don't see how this uri/synonym is getting us any closer
to being able to find the server where sequence can be fetched. The
synonym isn't always a fetchable thing.
ad: syn is a global id
ad: look at the uri for the segment and fetch it from there
ls: could be a remote url.
gh: segments query is only thing that gives segment url
segments capabilities for the annot server should point

ls: break apart segments into: id=a string, then have an attribute
seq_url, when fetched returns the seq. returns the bases.
ad: is that what's there already?
ls: no, uri is an id
ad: every url is an id, but it's up to whim of the server
ls: i don't want people to think it's for an id.
want an agreed upon uri identifier, then optionally have a url.
turn synonym into uri, turn uri into resolver
make uri required, bases not required.
ad: additional constraint is 'agreed upon'. what about a group
starts a new sequencing project. There is no globally known uri for it
yet.
ls: they just create their own ids
td: the natural authority is the creator of the assembly.
gh: ncbi won't do it. they don't have a das server, unlikely to.
ls: can point to genome assembly. can create a url that will return
bases from ncbi in a supported format.
this approach will disentangle issue of resolvable vs non-resolvable,
local vs non-local segment ids and how to get segment dna.
gh: I think this will work.
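
A minimal sketch of the split Lincoln describes above: the uri is a
required identifier that is never dereferenced, and a separate optional
attribute says where the bases can be fetched. The class, the function,
the 'seq_url' field name as used here, and every URL are hypothetical
illustrations, not part of the DAS/2 spec.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Segment:
    uri: str                       # required: agreed-upon global identifier
    seq_url: Optional[str] = None  # optional: where the bases can be fetched


def fetch_bases(segment: Segment, fetch: Callable[[str], str]) -> Optional[str]:
    """Return the bases if the segment advertises a fetchable location.

    The uri is treated purely as an identifier and is never dereferenced.
    """
    if segment.seq_url is None:
        return None
    return fetch(segment.seq_url)


# Stand-in for an HTTP GET; the URLs and sequence data are made up.
fake_web = {"http://seq.example.org/human/35/chr1.fasta": "ACGT"}

chr1 = Segment("http://ids.example.org/human/35/chr1",
               "http://seq.example.org/human/35/chr1.fasta")
chr2 = Segment("http://ids.example.org/human/35/chr2")

bases_chr1 = fetch_bases(chr1, fake_web.__getitem__)  # has a seq_url
bases_chr2 = fetch_bases(chr2, fake_web.__getitem__)  # identifier only
```

With this shape a client can overlay annotations on chr2 by uri alone,
even though no server in the example offers its bases.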

ad: 'this' changing key names?
ls: key semantics
uri is required, global identifier sequence is an optional pointer
gh: you say that for feat xml, the id for seq will be the globally
agreed on id.
ls: yes
ad: if you don't have a local copy, if you have ability to map global
identifiers, then you know what it is from the coordinates.
there are two ways to specify coordinates: coordinates and segments

ad: if you just need the segments and some identifier.
only when you need to do an overlay with someone else that you
need the coords.
gh: no, coords don't say anything about ids of coord (?)

gh: if we do it the way lincoln proposed, then the logical way to
relate those is that the segments capabilities points to ref server.
ad: when feat returns a location is it in global or local space?
gh: lincoln - global space

ls: every annot server will know length of its landmarks (chrms).
some people will not want to be served dna, they will point somewhere
else where to get the dna. There will be many places to get dna for a
given global id, they chose one they like.
ls: feature locations are given in global id
ad: this changes the way it's been working. xml:base issues
ls: I know.
gh: if base of sequence and base of features are different, the xml
will get bigger.
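
The size concern can be illustrated with ordinary relative-URI
resolution, which is what xml:base relies on. This is only a sketch:
both base URLs and the 'segment/chr1' path are assumptions made up for
the example.

```python
from urllib.parse import urljoin

# Hypothetical bases; neither URL comes from the spec.
feature_base = "http://annot.example.org/das2/human/"
reference_base = "http://ref.example.org/das2/human/"

# Local segment ids: a short relative reference resolves against xml:base.
local_ref = urljoin(feature_base, "segment/chr1")

# Global segment ids hosted elsewhere: a relative reference would resolve
# against the wrong base, so each location must carry the full URI.
global_ref = urljoin(feature_base, reference_base + "segment/chr1")
```

Every feature location repeats the segment reference, so the longer
absolute form is paid once per location.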

ls: so an argument for having local ids is so you can make location
string shorter.
gh: yes.
ls: probably not worth it
ad: also makes it easier to set up a basic server. if you want to
overlay them, yes you do.
ls: you can always set up a local server if you

gh: segments response local and global id as we talked about yesterday
(which one feature location is relative to)
gh: if the only way to overlay for a client to know things are in the
same coord system is segid=xxxx and globalid=yyyy, how much harder is
it for server to use global ids.
ls: server can have configuration file to know where its global ids
are coming from

aday: would need to think about it more.
ad: who will set up these identifiers (yeast, human)
ls: I'll do it for model org databases, I will specify segments, and
their dna fetchers and will look up their lengths.
gh: versions?
ls: most recent. community can then keep it up to date.
I bet ensembl will be happy to generate this file automatically with
every build (for vertebrates)

ad: local id uri, and a bunch of synonyms. People will set up own
server not referencing a global system.
ls: then client would do a closure over all systems.
imagine three servers:
server-a says here is my segment
server-b says it can be b or c
server-c says it can be c or a
so you have to do a join over all servers
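
The join over servers amounts to computing connected components over
the synonym claims. A minimal sketch, assuming each server's claim is
simply a set of ids it considers equivalent (the function name and the
data layout are hypothetical):

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set


def synonym_closure(claims: Iterable[Iterable[str]]) -> List[FrozenSet[str]]:
    """Group ids into equivalence classes given per-server synonym claims."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for claim in claims:
        ids = list(claim)
        for ident in ids:
            neighbours[ident].update(ids)

    seen: Set[str] = set()
    groups: List[FrozenSet[str]] = []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this id.
        stack, group = [start], set()
        while stack:
            ident = stack.pop()
            if ident in group:
                continue
            group.add(ident)
            stack.extend(neighbours[ident] - group)
        seen |= group
        groups.append(frozenset(group))
    return groups


claims = [
    {"a"},        # server-a: here is my segment
    {"b", "c"},   # server-b: it can be b or c
    {"c", "a"},   # server-c: it can be c or a
]
groups = synonym_closure(claims)
```

Here all three ids end up in one group. The same mechanics mean that a
single incorrect claim merges two groups that should stay separate.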

gh: not encourage people to do that with local seq ids, encourage
people to use.
need a global referencing system to say this uri is same as that uri.
ad: bad logic for the web. If one is wrong, could be a problem
td: (proposal - based on genomic coord alignments)
ad: that says only alignable things are the same.

ad: don't think it will work, they will already have local servers

gh: what about 'the stick': people who want to register their server
with central registry can only do so if they use global ids for their
segments. 
ls, td: fine
ad: if they've been working for a while in house, they would have a
big effort to retrofit their system to comply. just won't do.

ls: in draft 3, where's assembly info?
ad: same as before. ask segments for agp format. draft not complete.
gh: the thing that ids which assembly you're on is the coordinates
element (authority, taxonomy, ...)
ls: authority is a recognized, globally unique organization. Should it
be a uri?
ad: authority and version is human visible so people can search by
it.
ls: fine.

gh: can invoke the 'stick' idea here: if you're trying to register
something on same genome assembly, then registry can check your
segments to verify they are agreed upon.
ls: taxon, source, authority, version all must match
ad: also an id
ap: we discussed in email
ad: the only stuff that is complete is in the ucla subdir.
ls: the examples are definitive
ad: yes, unless we change things today.

ls: what if taxon, source, version match but uri doesn't?
registry gets submission. makes a segments request on submitter, if it
gets a list of same segment identifiers, it accepts it. what if it
gets a subset?
gh: ok
ls: superset is not ok.
aday: why?
gh: if you allow subset and superset, you can have everything.
aday: use case: bacteria with extra plasmid identifier.
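
A sketch of the registry check as discussed: the coordinate system
fields must all match, a subset of the agreed segment ids is accepted,
and a superset is rejected. The superset rule was still being debated,
so this shows only Lincoln's position. The class, function name, and
all ids are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Coordinates:
    taxon: str
    source: str
    authority: str
    version: str


def registry_accepts(reference: Coordinates, reference_segments: Set[str],
                     submitted: Coordinates,
                     submitted_segments: Iterable[str]) -> bool:
    """Accept only a matching coordinate system whose segment ids all
    belong to the agreed reference set (subset ok, superset not)."""
    if submitted != reference:
        return False
    return set(submitted_segments) <= reference_segments


# Made-up reference entry for a bacterial genome.
ref = Coordinates("ExampleBacterium", "chromosome", "ExampleAuthority", "1")
ref_segments = {"http://ids.example.org/bact/1/chr"}

same = registry_accepts(ref, ref_segments, ref, ref_segments)
subset = registry_accepts(ref, ref_segments, ref, [])
# Allen's use case: the submitter also serves a plasmid.
superset = registry_accepts(
    ref, ref_segments, ref,
    ["http://ids.example.org/bact/1/chr", "http://lab.example.org/plasmid1"])
other_version = registry_accepts(
    ref, ref_segments,
    Coordinates("ExampleBacterium", "chromosome", "ExampleAuthority", "2"),
    ref_segments)
```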

nh: signing off. will be at affy tomorrow.

ls: you would have to create your own coord system.
gh: could argue with maintainer to add it.
ls: can you have multiple coordinates in a given assembly?
aday: proposal: make coords an attribute of the segment.
could keep your segment references local.

ls: we shouldn't give people ways to create new names. human chr1 ncbi
build 35 should be something that everybody can agree on.
gh: then we wouldn't allow allen's use case where someone wants a
superset of what's in reference?
ls: add new coord tag to source version entry, says I'm creating a
superset consisting of coords from ref 1, 2, 3, any of these can be a
new namespace that I set up.
gh: how do you know which ones come from where?
right now there's no way to get coord for a segment.
ad: can as of yesterday afternoon.

ls: to indicate which segments come from which auth. put coord id into
segments tag. 
aday: thank you!
ad: alternative proposal - multiple segments
use case: when you have scaffolds or chromosomes, or mouse and yeast
ls: say you want human mouse scaffolds + chrms, and human chrms
three diff coords tags in the sources document
each one gives auth, taxon, etc.
when client goes to get segments, it will get human chromosomes, mouse
chrms, and mouse scaffolds, in one big list, each will point back to
coord it got in features request.
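
A sketch of the client side of this proposal: the segments come back
as one flat list, and the coord id carried by each segment lets the
client regroup them by coordinate system. The field names and all ids
are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    uri: str
    coords: str   # id of the coordinates entry this segment belongs to


def by_coordinates(segments: List[Segment]) -> Dict[str, List[str]]:
    """Regroup a flat segment list by the coordinate system each names."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for seg in segments:
        grouped[seg.coords].append(seg.uri)
    return dict(grouped)


# One big list, as returned by a single segments request (made-up ids).
segments = [
    Segment("human/chr1", "coords-human-chrms"),
    Segment("mouse/chr1", "coords-mouse-chrms"),
    Segment("mouse/scaffold_17", "coords-mouse-scaffolds"),
    Segment("human/chr2", "coords-human-chrms"),
]
entry_points = by_coordinates(segments)
```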

gh: knowing what coordinates doesn't tell you global id for segment
aday: ok.
gh: multiple segments elements vs mult coords in a segment work for
me.
ad: what does a client do
gh: ...
ls: three types of entry points, hu chrms, mo chrms, mo scaffolds, now
tell me what you want to start browsing. human readable.
scaffold on mouse with name xxx from two

ad: displaying all together vs one or the other or the other.

ee: affymetrix use case in igb. [probe

gh: doesn't seem to matter
aday: the tag values are easier to implement
td: not a big difference to me
gh: drawing on whiteboard...

ls: let's rename das to distributed annotation research network. then
we can say "darn1, darn2"!

ad: gregg's request for search to find everything identical (start and
end are same)
td: if you have contained and inside, you can do identical with an and
operation.
ls: doesn't make server any more complicated, for completeness you may
want to do that.
ad: how about includes 1-5000 and excludes ... some of this is aesthetic.
ls: overlaps, contains, contained-in have good use cases.
exact match - maybe searching for curated exons that exactly match
predicted. 
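
Thomas's point, that an exact-match filter falls out of combining the
two containment filters, can be sketched as follows. Half-open
(start, end) ranges and the function names are assumptions for the
example, not the spec's definitions.

```python
from typing import Tuple

Range = Tuple[int, int]  # half-open (start, end); an assumption


def overlaps(feature: Range, query: Range) -> bool:
    return feature[0] < query[1] and query[0] < feature[1]


def inside(feature: Range, query: Range) -> bool:
    """Feature lies entirely within the query range."""
    return query[0] <= feature[0] and feature[1] <= query[1]


def contains(feature: Range, query: Range) -> bool:
    """Feature entirely covers the query range."""
    return feature[0] <= query[0] and query[1] <= feature[1]


def identical(feature: Range, query: Range) -> bool:
    # Exact match is the AND of the two containment filters.
    return inside(feature, query) and contains(feature, query)


curated_exon = (1000, 1200)
predicted_exon = (1000, 1200)
shifted_exon = (1000, 1210)
```

Under these definitions identical(curated_exon, predicted_exon) holds,
while the shifted exon overlaps and contains the curated one without
matching it exactly.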

[Lincoln has to leave.]

gh: drawing options for segments and coordinate systems.
[whether you  put a coords tag per segment, or ome capabilities one
for each coord system]
allen's approach - one query with filter or multiple fetches

aday: uniprot example
gh: separate segments query.
ap: can we leave it out and add later if necessary?
ad: these are things that haven't been discussed in last two years
aday: uri

ad: xml namespace issue - what do we call it (see email)
gh: you pick it

ad: required syntax for entry points /das/source
gh: recommended, but not required
ad: lincoln was only one who felt strongly about it being required,
and he's not here.


gh: feature xml, every feature can have multiple locations
features can represent alignments (collapsed alignment tag into feature
tag)
td: like it
gh: naive user - given a feat with multiple locations on genome, represent as
multiple locations, or parent child relations
td: don't see as a problem. using parent-child you have things to say
about child features specific to them
gh: genscan prediction,
a problem: one server can serve them up as parent child or as multiple
locations on parent

four child exons in one case
four diff locations in other case

problem is with feat filters. if you do an overlaps query and any
children meet the condition, you have to return the parent as well and
its parent on up. agreed?
ad: yes
gh: works fine for parent child, but for multiple location situation, if
inside query fully contains only two exons, do you return parent?

td: I'd assume inside query would return both. as long as one exon is
inside the region, the parent is returned. define inside as applying to
any level.
gh: so even though the transcript is not inside, you still return it?
td: using the get parent-if-get-children rule
gh: rule must apply to all of them, so you don't get transcript since
it doesn't meet the inside condition.

aday: multiple locations makes sense - just aligned mult times.
human alu feature 100,000s, do you want to create a single feature, or
just a single identifier and put it in many different locations.
ee: that is for alignments not parent-child relationship
aday: you consider location as an attribute of the object.
ee: I agree. alu is only one object, but the exon-transcript are
different
ad: would someone want to annotate the separate exons differently?
aday: you would split it off
ad: eg blast alignment, hsp is part of the conceptual alignment.
gh: in bioperl, some people will go one path, some go the other path,
so we need to figure out how to deal with it.

feat filters is clear for parent child relationship.
aday: inside and overlaps
gh: if your overlap query only grazes one child, you return the
parent. this is the only one I'm certain about.
gh: we haven't specified that the child is within bounds of parent.
with insides, we have a difference of opinion.

one exon is within, do you return it?
ad: most clients will be doing overlaps. you are the only one doing
insides. what do you want?
gh: the multiple locations muddies the issue.
if parent child rule is you only return it if parent is inside (and
recursive parent), I've already optimized for that.
For multiple locations, I can catch that and handle it.
the way I want, the behaviour of mult location will be diff than
parent child.
td: for me, the overlaps is the most important thing. Andreas just get
everything.
ad: can we delegate to gregg here for what to do in case of inside.

[A] gregg will write up description for inside query and multiple locations
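
The one rule everyone agreed on, that an overlaps query returns a
matching feature plus its parent and so on up, might look like the
sketch below for both representations. The inside case is left out
because it is still open. The Feature class, field names, half-open
ranges, and ids are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

Range = Tuple[int, int]  # half-open (start, end); an assumption


@dataclass
class Feature:
    id: str
    locations: List[Range]
    parent: Optional[str] = None


def overlaps(a: Range, b: Range) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def overlaps_query(features: Dict[str, Feature], query: Range) -> Set[str]:
    """Ids of features with any overlapping location, plus all ancestors."""
    hits: Set[str] = set()
    for feat in features.values():
        if any(overlaps(loc, query) for loc in feat.locations):
            # Walk up: a matching child pulls in its parent, and so on.
            current: Optional[Feature] = feat
            while current is not None and current.id not in hits:
                hits.add(current.id)
                current = features.get(current.parent) if current.parent else None
    return hits


exons = [(100, 150), (200, 250), (300, 350), (450, 500)]

# Parent-child form: a gene and transcript with no locations of their
# own, and four child exons.
parent_child = {"g1": Feature("g1", []), "t1": Feature("t1", [], parent="g1")}
for i, loc in enumerate(exons, start=1):
    parent_child[f"e{i}"] = Feature(f"e{i}", [loc], parent="t1")

# Multiple-location form: one feature carrying four locations.
multi_location = {"genscan1": Feature("genscan1", list(exons))}

hits_pc = overlaps_query(parent_child, (210, 220))
hits_ml = overlaps_query(multi_location, (210, 220))
```

A query that grazes only the second exon returns that exon with its
transcript and gene in the first form, and the single feature in the
second.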

Status reports
-----------------

gh: updating server. overlaps, insides, types, and each
good news: latest genome assembly on human on affy server overlayed
with allen's server. using hardcoded knowledge in igb for assembly id,
not coordinates yet.
with andrew: making sure clients can understand any variants of
namespace usage in the xml.
get client to use more capabilities like links

ad: example data set together, updated schema to latest spec, but
forgot cigar thing. update validator to use most recent version of rnc
schemas.
gh: even if your server isn't public you can cut and paste into the
validator at http://cgi.biodas.org:8080

aday: biopackages up to date with version 200 of spec file. issues for
nomi, and gregg. off by one error.

bo: small code refactor in the das server. testing that today.

ee: nothing das related yet, but will. implementing style sheets to get
colors for features.

ap: registry ui for upload of a das/2 source. coding for that

gh: what about registry rejecting segment ids if they don't match
standard ids for that coord system. sound good to you?
ap: basically yes. 
td: not done a great deal

gh: Nomi has been here working on apollo client. we'll hear from her
tomorrow. 

-----------------------
post teleconf discussion re: using global identifiers for uri

[Notetaker: just a few morsels were captured here.]

ad: most folks i work with get something going locally, then after
it's going, hook it up with the rest of the world, integrate with
other people. they don't want to revamp their work in order to do
that. 

gh: slightly in favor of andrew's view

ad: get what we have now. they are still uri's so it's just an
interpretation. will change attributes to be 'uri' and 'reference_uri'

gh: how does it get length of segments?


ad: good idea to have coordinates and segments in the document.
add your own track to ensembl, you don't need to give it a segments,
just specify coordinates.
gh: seems like it will encourage servers that can only work with
particular clients.

ad: what about getting rid of coordinates, just needed by Andreas for
registry. 



